#### **1. Introduction**

With the rapid development of the social economy and the improvement in living conditions, urban air pollution has become a hot topic and has attracted progressively more attention [1]. The new Ambient Air Quality Standard of China defines six major pollutants (PM2.5, PM10, SO2, NO2, O3, and CO), which together give a more comprehensive evaluation of urban air quality. Under the new standard, air quality data are collected continuously by monitoring stations throughout the country. The gathered data are typically multivariate, temporal, and geographically labeled.

An increasing number of works have been devoted to the analysis of air quality data, but most of them have been limited to analyzing the patterns of only one major pollutant in a specific city or monitoring station because of the complexity, diversity, and large volume of the data [2,3]. Determining the best approach to handling complicated air quality data to obtain multivariate temporal patterns and the relationships between different regions is a great challenge, yet the results of such an analysis can support the abatement of specific pollutants in specific areas. For example, AirVis [4] is a web-based visual analytic system that supports a collaborative analysis of spatiotemporal and multivariate features. However, it is implemented using only eight air quality monitoring stations in Beijing and is incapable of managing big data. Moreover, few studies focus on detecting anomalies in air quality, such as a city whose air quality differs markedly from that of its adjacent locations even though they have similar topographies and climate conditions. Identifying such divergent regions can drive the analysis of air pollution causes and the development of prevention measures. Furthermore, most previous works have drawn conclusions by computing isolated indicators [5,6], and the lack of meaningful contextual information has limited their further application. Visual analytics is a technology that makes up for this flaw. It is dedicated to transforming complex data into concise graphics that not only support the exploration of hidden patterns in the data but also assist in the comprehension of the patterns found. Thus, it is imperative to establish a comprehensive visual analysis platform that can analyze both the regular patterns of air quality and potential anomalies. Such a platform helps the departments involved in environmental protection formulate effective policies to improve air quality; it even enables non-professional users to understand the patterns of air pollution.

In this paper, we propose AirInsight, an interactive visual analytic system that supports the visual inspection of multivariate spatiotemporal patterns and anomalies from a variety of perspectives. To help users perceive data features effectively, we propose a dimension reduction method called Composite Least Square Projection (CLSP), which generates an interpretable layout that maintains both the multivariate data distribution and the attribute information. CLSP outperforms traditional projection solutions by enhancing both the observation of preliminary patterns and their interpretation through a layout that embeds the multivariate context. To explore multivariate features more deeply, we propose the Noise Hierarchical Clustering (NHC) algorithm to extract inherent patterns and separate outliers. Because some noteworthy anomalies remain hidden in regular patterns, namely those that reflect significant changes among similar timestamps or adjacent cities, we further introduce two indicators called time diversity (TD) and geographical surprise (GS) to quantify the anomaly strength of the data in these two cases. Using them together, we classify all data into four categories of spatiotemporal anomalies and help users find interesting data items intuitively. Multiple linked views are integrated into AirInsight to visualize the above analysis results, and a variety of contextual information is provided to help users understand the extracted patterns. Moreover, we design a pair of novel glyphs called R-Shield and A-Shield to summarize the normal and abnormal temporal patterns, respectively, of a specific city. By linking the glyphs along the temporal evolution process, several meaningful transition states can be revealed. The contributions of this work are the following:


3. **A visual analytic system integrating summarization glyphs and multiple coordinated views for air quality data.** This tool allows analysts to explore and interpret regular patterns and anomalies from different aspects and levels.

#### **2. Literature Review**

#### *2.1. Visualization of Air Quality Data*

With the extensive use of the Internet of Things (IoT), a massive volume of data is being generated and collected [7]. These data are a cornerstone of city computing [8], but their volume renders traditional methods of numerical analysis ineffective. A growing number of visual analytic methods, combined with automated analysis, are being applied to explore and interpret IoT data in different fields, such as non-residential building performance analysis [9] and public transport optimization using mobile phone data [10].

As a common type of data collected by sensors, air quality data have attracted the attention of many scholars. Most works have focused on time-varying analysis and regional research. Du et al. [11] proposed an adaptive multiscale trend view that could flexibly reveal the linear and periodical temporal patterns of air quality. Similarly, Li et al. [12] integrated the variations in multiple pollutants. They also studied the various air quality features in time and space and designed the Global Distribution View, which jointly visualizes the spatiotemporal and clustering information in a neat form. Through even deeper analysis, Zhou et al. [13] illustrated how spatial clusters changed over different time scales and used a storyline design to depict evolving changes for different locations. Another essential requirement for the visual analysis of air quality is correlation detection. The Time-Correlation-Partitioning (TCP) tree [14] is a novel visual representation that concisely describes both the variable hierarchy and the temporal variation in correlations hidden in air quality data. Qu et al. [15] not only considered the correlation between different kinds of pollutants but also accounted for the influence of weather data on air quality.

However, few works have paid attention to abnormal cases of air pollution. Li et al. [16] extracted events of air quality data and detected various co-occurrence patterns among them. Although they could find pollution-related urban agglomeration, the lack of extracted temporal variation for the target city limited the determinacy of the discovered events. In this paper, we propose a comprehensive system for air quality data that supports not only regular pattern analysis but also abnormal event detection in time and space.

#### *2.2. Visualization of Multivariate Data*

Analysis of multivariate data is an important and challenging research topic. Displaying an abstract data structure and discovering latent features generally rely on visualization.

Two major types of visualization approaches can be summarized as direct display and visual space projection. The parallel coordinate plot [17] and radar chart [18] are common methods of direct display: the attributes are represented as axes and the data items are drawn as lines across the axes. However, users cannot easily determine the relationships among items because tracking all the axes simultaneously is difficult, especially when the number of items increases along with the inevitable clutter. The other type, visual space projection, aims to map items from a high-dimensional space to the visual space while preserving relationships as much as possible. Thus, a poorly understood data structure can be observed and understood intuitively. Principal component analysis (PCA) [19], multidimensional scaling (MDS) [20], and t-distributed stochastic neighbor embedding (t-SNE) [21] are widely used projection methods. In the projection layout, users can quickly discover clusters through the densities of points. However, the lack of attribute information limits the user's understanding. In recent years, Radviz [22] and star coordinates [23] have been proposed. In these methods, the attributes are used as anchor points or axes aligned on a circle, and data are projected into the circle according to the attribute strengths. Nevertheless, these methods are strongly affected by the ordering of the attributes. Moreover, the relationships among the data are not considered, so items with different quantities and the same proportion are projected to the same position. To address this drawback, RadViz++ [24] includes histograms over each attribute cell. The histograms show the data distribution and are linked with brushed data, thereby resolving the ambiguity.
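The two Radviz drawbacks noted above can be illustrated with a minimal sketch. It assumes the standard Radviz formulation, in which each item is placed at the average of the attribute anchors weighted by its non-negative attribute values; the function name `radviz`, the `order` parameter, and the sample readings are hypothetical and chosen only for illustration.

```python
import math

def radviz(row, order=None):
    """Project one non-negative attribute vector into the unit circle.

    `order` lists which attribute occupies each anchor slot on the circle
    (hypothetical parameter; defaults to the original attribute order).
    """
    n = len(row)
    order = list(range(n)) if order is None else order
    total = sum(row)
    if total == 0:
        return (0.0, 0.0)  # no attribute pulls the point, so it sits at the center
    x = y = 0.0
    for slot, attr in enumerate(order):
        angle = 2 * math.pi * slot / n   # anchors are spaced evenly on the circle
        weight = row[attr] / total       # only the proportions matter
        x += weight * math.cos(angle)
        y += weight * math.sin(angle)
    return (x, y)

low = [10.0, 20.0, 5.0, 15.0]   # hypothetical readings for four attributes
high = [v * 3 for v in low]      # same proportions, three times the magnitude

p_low = radviz(low)
p_high = radviz(high)                           # coincides with p_low
p_reordered = radviz(low, order=[0, 2, 1, 3])   # same item, different anchor order
```

Because the weights are normalized by the row sum, `low` and `high` land on the same point even though their magnitudes differ, and swapping two anchors moves the same item to a different position. These are the proportion ambiguity and the ordering sensitivity described in the text.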

The data context map [25] was developed to overcome the above shortcomings by mapping attribute points and data points together on the basis of their integrated similarities. However, its applicability is limited for data such as air quality data, in which the items greatly outnumber the attributes. Building on this method, we propose CLSP, which can reduce errors and enhance the effectiveness of handling such types of data.

#### *2.3. Visualization of Anomaly Detection*

Many works have studied anomaly detection by visual analytic approaches in the past several years. Wilkinson presented hdoutliers [26], which is based on a distributional model that can deal with big complex data. It has widespread applications, even for a mixture of categorical and continuous variables. Nevertheless, apart from multivariate features, real-life data sets often have temporal and geographical tags, which this type of global method cannot exploit.

In order to assist users in finding temporal anomalies, Muelder et al. [27] portrayed the behaviors of compute nodes over time by applying a force-directed layout that aggregated similar patterns and distinguished abnormal timelines. Similarly, Xu et al. [28] introduced a time-aware outlier-preserving technique to extend Marey's graph and achieved effective anomaly detection in manufacturing processes. Unlike the approaches that focus on a time axis, Shi et al. [29] linked two time-slots in a projection view in a method that supported the analysis of temporal evolution and multivariate features of different items. Cao et al. [30] designed glyphs with a time arc to detect anomalous users in social media data. For a deeper analysis of spatiotemporal anomalies, several visual analytic systems have been developed [31,32]. For example, a visual analytic system named Voila, developed by Cao et al. [33], achieved an interactive anomaly detection performance through a tensor-based unsupervised algorithm that analyzed the current spatiotemporal state by incorporating the historical states.

However, a significant limitation of these existing approaches is that they only consider unidirectional temporal variations while ignoring the periodicity in temporal data. Further, they are limited in their ability to find items that behave normally individually but abnormally compared with adjacent locations. In this paper, we propose a novel strategy that allows for the detection of abnormal cases from an integral space–time perspective.
