#### *3.3. Workflow*

To achieve these tasks, we design the analysis pipeline (illustrated in Figure 2), which contains the following modules:


4. **Visual analysis module**. This module consists of three main views: (a) Projection view supports flexible switching between scatter mode and glyph mode to provide an overview of the multivariate data distribution with spatiotemporal information (T2 and T3) or further explore a specific city under the summarization glyphs (T3). (b) Trend view summarizes the distribution of different clusters for each timestamp (T2) and compares the patterns changing with time for different cities (T3). (c) Abnormity classification view exhibits the performance of all data under the anomaly indices (T2 and T3). In addition, we provide rich interaction functions, such as filtering and brushing, to help users explore interesting features with more flexibility.

**Figure 2.** Workflow of AirInsight.

#### **4. Methods**

#### *4.1. Preliminary Exploration of Patterns*

In this section, we explain the CLSP method, which maps multivariate spatiotemporal data and their attributes into a common visual space.

#### 4.1.1. Vectorized Representation

Let *S* denote the sample set and *A* the attribute set. One month of data from one city is defined as a sample. Here, *S* = {*s<sub>1</sub>*, *s<sub>2</sub>*, ..., *s<sub>n</sub>*}, where *n* is the product of the number of cities and the number of months. For the data studied in this paper, *n* is 3168. Each sample *s<sub>i</sub>* is a temporally ordered sequence of attribute values, *s<sub>i</sub>* = {*s<sup>k</sup><sub>i,j</sub>* | 1 ≤ *j* ≤ *d<sub>i</sub>*, 1 ≤ *k* ≤ *m*}, where *d<sub>i</sub>* is the number of days that belong to *s<sub>i</sub>*, and *m* is the number of attributes. Table 1 shows the sample from Chengdu in March 2014.

The attribute set consists of *m* vectors, *A* = {*a<sup>1</sup>*, *a<sup>2</sup>*, ..., *a<sup>m</sup>*}. Each attribute vector *a<sup>k</sup>* has *n* dimensions, *a<sup>k</sup>* = {*a<sup>k</sup><sub>i</sub>* | 1 ≤ *i* ≤ *n*}, in which each dimension *a<sup>k</sup><sub>i</sub>* is the mean of the *k*-th attribute values *s<sup>k</sup><sub>i,j</sub>* over the *d<sub>i</sub>* days of sample *s<sub>i</sub>* and can be computed as

$$a\_i^k = \frac{\sum\_{j=1}^{d\_i} s\_{i,j}^k}{d\_i}.\tag{1}$$
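As a minimal sketch of Equation (1), the following Python fragment computes the attribute vectors from a nested list of samples. The function name `attribute_vectors`, the nested-list layout `samples[i][j][k]`, and the toy values are assumptions made for illustration; they are not taken from the AirInsight implementation.

```python
from statistics import fmean


def attribute_vectors(samples):
    """Return A with A[k][i] = mean of attribute k over the days of sample i.

    Assumed layout: samples[i][j][k] is the value of attribute k on day j
    of sample i. Samples may contain different numbers of days (d_i).
    """
    m = len(samples[0][0])  # number of attributes
    return [
        [fmean(day[k] for day in sample) for sample in samples]
        for k in range(m)
    ]


# Toy data (hypothetical): two samples with 3 and 2 days, two attributes.
samples = [
    [[50.0, 20.0], [70.0, 40.0], [60.0, 30.0]],
    [[100.0, 10.0], [120.0, 30.0]],
]
A = attribute_vectors(samples)
print(A)
```

Each inner list of `A` corresponds to one attribute vector *a<sup>k</sup>* with one entry per sample, so unequal month lengths affect only the divisor *d<sub>i</sub>*.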

#### 4.1.2. Construction of Composite Distance Matrix

Inspired by the data context map [25], we construct a composite distance matrix that stores the relationships among sample vectors and attribute vectors. As shown in the orange block in Figure 3, the matrix consists of four submatrices: *DD* stores the pairwise diversities between sample vectors, *VV* stores the pairwise diversities between attribute vectors, *DV* stores the diversities between sample vectors and attribute vectors, and *VD* is the transpose of matrix *DV*. Because sample vectors and attribute vectors differ in nature, we quantify each kind of diversity with a method suited to it.


**Table 1.** The sample from Chengdu in March 2014.

Similar to the data context map, we apply the Pearson correlation coefficient [34] to evaluate the distance between a pair of attribute vectors and construct submatrix *VV*. With regard to the distance between a sample vector *s<sub>i</sub>* and an attribute vector *a<sup>k</sup>*, "*max* − *value*" [25] is used as follows:

$$distance(s\_i, a^k) = \max - a\_i^k,\tag{2}$$

where *max* is the maximum of the IAQI (500), and it can be thought of as the theoretical maximum of *a<sup>k</sup><sub>i</sub>*. The *distance*(*s<sub>i</sub>*, *a<sup>k</sup>*) is a significance distance. It is small for *s<sub>i</sub>* when *a<sup>k</sup><sub>i</sub>* is large, so when the mean value of a sample's *k*-th attribute is high, the relationship between the sample and the *k*-th attribute is close. Using Equation (2), we can construct the submatrices *DV* and *VD*.
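The construction of *VV*, *DV*, and *VD* can be sketched as follows. The *DV* entries follow Equation (2) directly. For *VV*, the text states only that the Pearson correlation coefficient is used, so the conversion 1 − *r* from correlation to distance is an assumption here, as are the function names and the toy attribute vectors.

```python
import math

IAQI_MAX = 500.0  # maximum of the IAQI, as stated in the text


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) * (a - mx) for a in x))
    sy = math.sqrt(sum((b - my) * (b - my) for b in y))
    return cov / (sx * sy)  # raises ZeroDivisionError for constant vectors


def vv_matrix(A):
    """VV[p][q] = 1 - r(a^p, a^q). The 1 - r form is an assumption."""
    m = len(A)
    return [
        [0.0 if p == q else 1.0 - pearson(A[p], A[q]) for q in range(m)]
        for p in range(m)
    ]


def dv_matrix(A):
    """DV[i][k] = max - a_i^k, following Equation (2)."""
    m, n = len(A), len(A[0])
    return [[IAQI_MAX - A[k][i] for k in range(m)] for i in range(n)]


# Toy attribute vectors (hypothetical): m = 2 attributes, n = 3 samples.
A = [[60.0, 110.0, 80.0], [30.0, 20.0, 25.0]]
DV = dv_matrix(A)
VD = [list(col) for col in zip(*DV)]  # VD is the transpose of DV
VV = vv_matrix(A)
```

A sample with a high mean for an attribute receives a small *DV* entry for that attribute, which is what later pulls the two points together in the projection.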

Nevertheless, the sample vector *s<sub>i</sub>* in this paper is a multivariate time series. When we evaluate the diversities for submatrix *DD*, it is vital to take into account the whole temporal trend of the two vectors. In addition, the lengths of the vectors may differ since the number of days in each month is not the same. To overcome these challenges, we apply dynamic time warping (DTW) [35], which can compute the shape similarity of two temporal vectors with unequal lengths. Under certain conditions, DTW stretches or compresses the two time series to find the alignment of their timestamps that minimizes the accumulated distance along the alignment path. When we compare two timestamps in the process of finding the alignment path, we introduce the structural similarity index (SSIM) [36] to compare their multivariate features.
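The DTW recurrence described above can be sketched with a pluggable per-timestamp cost. The text does not specify how SSIM is turned into a cost, so `ssim_cost` below, which uses 1 − SSIM with the standard stabilizing constants and a dynamic range of 500, is only one plausible reading and should be treated as an assumption. The Euclidean default, the function names, and the toy sequences are likewise illustrative.

```python
import math


def euclidean(u, v):
    """Stand-in local cost between two daily attribute vectors."""
    return math.sqrt(sum((a - b) * (a - b) for a, b in zip(u, v)))


def ssim_cost(u, v, dynamic_range=500.0):
    """1 - SSIM between two daily attribute vectors (assumed formulation)."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    var_u = sum((a - mu) * (a - mu) for a in u) / n
    var_v = sum((b - mv) * (b - mv) for b in v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    ssim = ((2 * mu * mv + c1) * (2 * cov + c2)) / (
        (mu * mu + mv * mv + c1) * (var_u + var_v + c2)
    )
    return 1.0 - ssim


def dtw(s, t, cost=euclidean):
    """Minimum accumulated cost of aligning sequences s and t.

    s and t are lists of daily attribute vectors and may differ in length.
    """
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(s[i - 1], t[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# Toy sequences (hypothetical): t repeats one day of s, so the shapes match.
s = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
t = [[1.0, 1.0], [2.0, 2.0], [2.0, 2.0], [3.0, 3.0]]
d_euclid = dtw(s, t)
d_ssim = dtw(s, s, cost=ssim_cost)
```

Filling *DD* then amounts to calling `dtw` on every pair of samples. With *n* = 3168 this is roughly five million pairs, so a practical implementation would likely need a vectorized or windowed variant.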

Since the diversities of the four submatrices are quantified by different means, their value ranges also differ. To bring them onto the same scale, we adjust the submatrices so that their mean values are equal and then fuse them into the final composite distance matrix. This matrix can evaluate all three kinds of relationships among samples and attributes and provides a foundation for projection.
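One way to realize this fusion is sketched below. The text states only that the submatrix means are made equal, so the multiplicative rescaling, the choice of the mean of *DD* as the common target, and the block layout [[*DD*, *DV*], [*VD*, *VV*]] are assumptions, as are the function names and the toy matrices.

```python
def block_mean(M):
    """Mean of all entries of a matrix given as a list of rows."""
    vals = [v for row in M for v in row]
    return sum(vals) / len(vals)


def rescale(M, target):
    """Scale M so that its mean equals target (requires a nonzero mean)."""
    factor = target / block_mean(M)
    return [[v * factor for v in row] for row in M]


def fuse(DD, DV, VV):
    """Assemble the composite matrix [[DD, DV], [VD, VV]] on one scale."""
    target = block_mean(DD)  # assumed common target: the mean of DD
    DV_s = rescale(DV, target)
    VV_s = rescale(VV, target)
    VD_s = [list(col) for col in zip(*DV_s)]  # transpose of rescaled DV
    top = [dd + dv for dd, dv in zip(DD, DV_s)]
    bottom = [vd + vv for vd, vv in zip(VD_s, VV_s)]
    return top + bottom


# Toy submatrices (hypothetical): n = 2 samples, m = 2 attributes.
DD = [[0.0, 2.0], [2.0, 0.0]]
DV = [[400.0, 450.0], [350.0, 400.0]]
VV = [[0.0, 1.5], [1.5, 0.0]]
C = fuse(DD, DV, VV)
```

The result is a symmetric (*n* + *m*) × (*n* + *m*) matrix in which samples and attributes share one index space, which is the form a distance-based projection expects.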

**Figure 3.** Pipeline of two-step projection.
