**Step 2: Embed attribute vectors**

To embed attributes into the layout generated in Step 1, we regard all samples as new control points. According to the submatrices *DV*, *VD*, and *VV*, we build the neighborhood relationship system of the attributes. Using the same method as in Step 1, we embed the attributes and obtain the final layout containing both the samples $s_i$ ($1 \le i \le n$) and the attributes $a_k$ ($1 \le k \le m$). As shown in Figure 4b, each orange point is a sample, and the gray symbols represent attributes. A sketch of this embedding step is given below.

**Figure 4.** Projection results. (**a**) Projection of sample vectors after Step 1. (**b**) Projection of sample vectors and attribute vectors after Step 2.
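To make Step 2 concrete, the following is a minimal sketch rather than the authors' implementation. It assumes Step 1 has produced a 2D sample layout, that *VD* and *VV* are available as dense nonnegative similarity matrices (larger values mean more related), and that each attribute retains at least one sample among its nearest neighbors; the helper name `embed_attributes` and the neighbor count are illustrative.

```python
import numpy as np

def embed_attributes(sample_layout, W_vd, W_vv, n_neighbors=10):
    """Place attribute points into a fixed sample layout, LSP-style.

    sample_layout : (n, 2) positions of the samples from Step 1 (control points).
    W_vd          : (m, n) attribute-to-sample similarities (submatrix VD).
    W_vv          : (m, m) attribute-to-attribute similarities (submatrix VV).
    """
    m, n = W_vd.shape
    W = np.hstack([W_vv, W_vd])        # candidate neighbors: [attributes | samples]
    np.fill_diagonal(W[:, :m], 0.0)    # an attribute is not its own neighbor
    for i in range(m):                 # keep only the strongest relations per row
        W[i, np.argsort(W[i])[:-n_neighbors]] = 0.0
    W /= W.sum(axis=1, keepdims=True)  # convex neighborhood weights

    # Each attribute sits at the weighted mean of its neighbors:
    #   x_i - sum_j w_ij x_j = sum_k w_ik c_k, i.e., (I - W_aa) X = W_as C.
    A = np.eye(m) - W[:, :m]
    B = W[:, m:] @ sample_layout
    return np.linalg.solve(A, B)       # (m, 2) attribute positions
```

Because the samples are fixed control points, only a small $m \times m$ system has to be solved, which is consistent with the low extra cost noted in Section 4.1.4.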

#### 4.1.4. CLSP Evaluation

The core reason for choosing LSP as the foundation of our method is its high computational efficiency and its ability to preserve neighborhood relationships in visual space, as verified by Paulovich et al. [37]. Although our CLSP method involves one additional step, little extra time is consumed because the attribute set is small. Furthermore, the retention of neighborhood relations yields a more compact layout, which supports the visualization goal of quickly finding associated data and makes multivariate patterns easier to observe.

Moreover, the ability to handle multiple kinds of relationships makes CLSP more powerful than LSP. The data mapping in the first step preserves submatrix *DD* to the greatest extent, without interference from the other submatrices. The second step of embedding attributes ensures that *DV* and *VD* are adequately considered. To verify the validity of CLSP, we adopt a commonly used strategy to evaluate the projection quality for each submatrix: for each sample vector or attribute vector, we find the *k* nearest vectors in the original space and the *k* nearest points in the projection layout, and we calculate the proportion of overlap between the two sets. For each submatrix, we report the mean repetition proportion (MRP) over all included vectors. Table 2 shows the MRP for the air quality data we studied, with *k* set to 20% of the corresponding vector count. The measure is sketched below.


**Table 2.** Comparing the MRP of LSP and CLSP.
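As an illustration of this measure, the sketch below computes the repetition proportion for the within-set case (e.g., *DD*); the cross-set variants (*DV*, *VD*) would restrict the candidate neighbors to vectors of the other type. The brute-force distance computation and the helper name are illustrative choices, not the paper's code.

```python
import numpy as np

def mean_repetition_proportion(orig_vectors, proj_points, k_frac=0.2):
    """MRP: mean overlap between k-NN sets in the original space and in the
    projection layout, with k set to a fraction of the vector count."""
    n = len(orig_vectors)
    k = max(1, int(k_frac * n))  # k = 20% of the vector count, as in the paper

    def knn_sets(X):
        X = np.asarray(X, dtype=float)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)             # exclude the point itself
        return [set(np.argsort(row)[:k]) for row in d]

    orig_nn = knn_sets(orig_vectors)
    proj_nn = knn_sets(proj_points)
    overlaps = [len(a & b) / k for a, b in zip(orig_nn, proj_nn)]
    return float(np.mean(overlaps))
```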

We find that $MRP_{VD}$ and $MRP_{DV}$ improve significantly, while $MRP_{VV}$ remains unchanged. This result means that CLSP better maintains the relationships between the data and the attributes without losing the relationships among the attributes. $MRP_{DD}$ is also improved, which becomes increasingly beneficial as the data set grows.

#### *4.2. Extraction of Inherent Patterns*

#### 4.2.1. Clustering

According to the projection results shown in Figure 4, a distinct cluster and some discrete points can be observed. To examine the patterns in more detail, we perform clustering on the sample points. Since there are some extremely isolated points, it is not appropriate to use partition clustering methods such as K-means [39], which assign a label to every data item and thus ignore abnormal patterns. Another widely used method, DBSCAN [40], is an incomplete clustering method and can recognize noise. However, it is hard for users to obtain a satisfying result for data with uneven densities, since DBSCAN depends strongly on the selection of its two parameters.

In this paper, we propose the Noise Hierarchical Clustering (NHC) algorithm to effectively extract both regular and abnormal patterns. Like traditional hierarchical clustering [41], our method starts by treating each point as a cluster and then repeatedly merges the two most similar clusters. Unlike the traditional method, we set a threshold that additionally controls the termination of cluster merging: once the number of points in a cluster exceeds the threshold, it is no longer merged with any other cluster. The algorithm terminates when no clusters can be merged. Hence, a sizable contiguous point cloud is split into finer clusters, which can be regarded as general patterns, while clusters that do not meet the threshold are dissolved into noise. We apply NHC to the projection results of Section 4.1; Algorithm 1 presents the NHC process (a sketch of one possible implementation follows).

**Algorithm 1:** Noise Hierarchical Clustering of sample points.

**Input:** Projection results $S = \{s_i \mid 1 \le i \le n\}$; threshold *CMin*.

**Output:** Clusters; noise.
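Since the pseudocode body is not reproduced here, the following Python sketch gives one possible reading of Algorithm 1. The paper does not specify the inter-cluster similarity, so the single-linkage choice, the helper name `nhc`, and the brute-force search are assumptions.

```python
import numpy as np
from itertools import combinations

def nhc(points, c_min):
    """Noise Hierarchical Clustering sketch: merge the two most similar
    clusters while neither exceeds c_min points; clusters still at or below
    the threshold when no merge is possible are dissolved into noise."""
    points = np.asarray(points, dtype=float)
    clusters = [{i} for i in range(len(points))]  # every point starts alone

    def dist(a, b):  # single-linkage distance between two clusters (assumed)
        return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

    while True:
        # only clusters at or below CMin may take part in a merge
        active = [c for c in clusters if len(c) <= c_min]
        if len(active) < 2:
            break  # no clusters can be merged; terminate
        a, b = min(combinations(active, 2), key=lambda pair: dist(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)

    noise = [i for c in clusters if len(c) <= c_min for i in c]
    patterns = [c for c in clusters if len(c) > c_min]
    return patterns, noise
```

The brute-force search is cubic and only meant to convey the logic; with *CMin* = 200, as used in Section 4.2.2, a large contiguous cloud is split into clusters of roughly that size, and the leftover small clusters are reported as noise.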


#### 4.2.2. NHC Evaluation

Figure 5 shows a comparison between our NHC method, DBSCAN, and K-means. The colors of the points encode cluster labels, and the black points represent noise. To highlight the noise points, we render them with a larger radius than the clustered points. For NHC, we set *CMin* to 200; for DBSCAN, we set the minimum number of points required to form a dense region, *minPts*, to 4 and the radius to 0.15; and for K-means, we set the number of clusters to 5. From Figure 5a,b, it can be observed that our method separates the clusters more finely where the points are dense. As shown in Figure 5a,c, the noise points are separated from the regular patterns and highlighted by our method. Thus, NHC performs better when dealing with uneven data distributions. The quoted settings are sketched below.

**Figure 5.** Results of different clustering methods.
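For reference, the baseline settings quoted above can be reproduced with scikit-learn; `layout_2d` is a hypothetical `(n, 2)` array of projected sample positions, and `nhc` is the sketch from Section 4.2.1.

```python
from sklearn.cluster import DBSCAN, KMeans

# Settings from the comparison in Figure 5 (layout_2d is assumed, not given).
db_labels = DBSCAN(eps=0.15, min_samples=4).fit_predict(layout_2d)  # label -1 marks noise
km_labels = KMeans(n_clusters=5, n_init=10).fit_predict(layout_2d)  # labels every point
patterns, noise = nhc(layout_2d, c_min=200)                         # NHC sketch from above
```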

According to Algorithm 1, *CMin* is the threshold controlling the maximum number of points in a cluster, and we can adjust it to obtain different clustering results. Figure 6 shows the results of NHC for different *CMin* values. In Figure 6a, most points are grouped into three clusters, marked in green, yellow, and blue. Compared with Figure 5a, in which *CMin* is 200, the green cluster in Figure 6a is a combination of three smaller clusters from Figure 5a. In Figure 6b,c, the points are divided into progressively more clusters, and more noise is separated. Thus, the smaller the *CMin* value, the more clusters and the more noise we obtain.

**Figure 6.** Results of NHC with different *CMin* values. (**a**) Clustering result with *CMin* set to 300. (**b**) Clustering result with *CMin* set to 100. (**c**) Clustering result with *CMin* set to 50.

#### *4.3. Detection of Latent Anomalies*

Through clustering, we can detect noise points on the basis of multivariate features. In this section, we further consider spatiotemporal information and introduce two evaluation indices to detect potential anomalies that remain hidden within regular multivariate patterns.

#### 4.3.1. Time Diversity

Typically, the air quality of a city in one month is similar to that in its neighboring months. Take the clustering result of Xizang for May 2015 as an example: it falls into the same cluster as April and June 2015 and as the corresponding months in 2014 and 2016. However, due to abrupt climate changes or related urban policies, the air quality of a given month may exhibit characteristics different from those of its time neighborhood. To discover such anomalous events, we introduce an indicator called time diversity (TD), inspired by the work of Zhang et al. [42].

Considering periodic variations, we define the time neighborhood of sample $s_i$ as $T_{s_i} = \{t^h_{s_i} \mid 1 \le h \le N_{T_{s_i}}\}$, where $N_{T_{s_i}}$ is the number of neighboring timestamps of $s_i$. In this paper, $N_{T_{s_i}}$ is set to 8: the previous month, the next month, and these three months in the two adjacent years. Figure 7 shows the time neighborhood of May 2015; the orange grid represents May 2015, and the blue grids represent the neighboring timestamps. From the clustering results in Section 4.2, the TD of each month is computed as

$$TD(s_i) = \sum_{C_{t^h_{s_i}} \neq C_{s_i}} \left(\frac{N_{C_{t^h_{s_i}}}}{N_{T_{s_i}}}\right)^{2} - \left(\frac{N_{C_{s_i}}}{N_{T_{s_i}}}\right)^{2}. \tag{3}$$

In Equation (3), $C_{s_i}$ denotes the cluster that sample $s_i$ belongs to, while $C_{t^h_{s_i}}$ is the cluster label of $t^h_{s_i}$. Over all samples in $T_{s_i}$, $N_{C_{s_i}}$ is the number of samples belonging to $C_{s_i}$, and $N_{C_{t^h_{s_i}}}$ is the number of samples belonging to $C_{t^h_{s_i}}$ (with $C_{t^h_{s_i}} \neq C_{s_i}$). $TD(s_i)$ is a real number in the range [−1, 1]. The closer $TD(s_i)$ is to −1, the more likely it is that $s_i$ and its time neighborhood belong to the same cluster, and vice versa. Hence, the higher the TD of a sample point, the more abnormal the air quality in that month. A sketch of Equation (3) is given below.

**Figure 7.** Neighbor timestamps of May 2015.
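The following sketch implements Equation (3), reading the sum as running over the distinct cluster labels that occur in the time neighborhood and differ from the sample's own label; the helper name `time_diversity` is illustrative.

```python
from collections import Counter

def time_diversity(own_label, neighbor_labels):
    """TD from the cluster labels of a sample's time neighborhood
    (8 neighboring months in the setting described above)."""
    n_t = len(neighbor_labels)
    counts = Counter(neighbor_labels)
    same = counts.get(own_label, 0)
    diff = sum((c / n_t) ** 2 for label, c in counts.items() if label != own_label)
    return diff - (same / n_t) ** 2

# Extremes: all neighbors share the sample's cluster -> -1; none do -> up to +1.
assert time_diversity("A", ["A"] * 8) == -1.0
assert time_diversity("A", ["B"] * 8) == 1.0
```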

#### 4.3.2. Geographic Surprise

Next, we introduce Bayesian surprise [43,44] to detect geographic anomalies; we refer to it hereinafter as geographic surprise (GS). Generally, the air quality data of locations within one area have similar features, owing to similar topographic characteristics and climate conditions. This can be taken as our expectation. When we observe the data distribution of an area at a timestamp, we are not surprised if it is consistent with that of the adjacent cities; in this case, the observed data match our expectation. By contrast, if we find a city whose air quality differs from that of the adjacent cities, the observation does not accord with our expectation and can be regarded as a surprising event. Thus, the impact on expectation can be used to find abnormal cases, and we use GS to quantify this impact for every sample point.

For sample $s_i$, we define several expected values and one observed value. Let $X$ be the expected data set corresponding to the sample set $S$, written as $X = \{x^u_i \mid 1 \le i \le n,\ 1 \le u \le q\}$, where $q$ is the number of expectations. $P(x^u_i)$ can be regarded as the prior probability, which is independent and defined manually. At the same time, let $Y$ be the observed data set, $Y = \{y_i \mid 1 \le i \le n\}$. After observing new data $y_i$, our expectation changes. We use $P(x^u_i \mid y_i)$ to model the updated likelihood of event $x^u_i$ given $y_i$. According to Bayes' rule, $P(x^u_i \mid y_i)$ is a posterior probability, proportional to the product of the prior probability and the standardized likelihood. It can be calculated as

$$P(x^u_i \mid y_i) = \frac{P(y_i \mid x^u_i)\,P(x^u_i)}{P(y_i)}. \tag{4}$$

The essence of the impact on expectation is the difference between the prior and posterior probability distributions. Thus, we use relative entropy to calculate GS:

$$GS(s_i) = KL\big(P(x_i \mid y_i)\,\big\|\,P(x_i)\big) = \sum_{u=1}^{q} P(x^u_i \mid y_i)\,\log\frac{P(x^u_i \mid y_i)}{P(x^u_i)}. \tag{5}$$

In this paper, we map all the sample vectors from the original $m$-dimensional space to a one-dimensional space and form the observed data set $Y$. In addition, two expected models, $x^1$ and $x^2$, are involved.


We not only specify the regional features of air quality as $x^1$ but also define $x^2$ to prevent inaccuracy caused by the artificial division of geographic areas. To balance the two expected models, we define the prior probabilities $P(x^u)$ ($u = 1, 2$) as $P(x^1) = 0.8$ and $P(x^2) = 0.2$.

According to Equation (5), the value of GS is always non-negative. The closer the value is to 0, the more consistent the sample is with the adjacent cities. Thus, the higher the GS value of a sample, the more abnormal the corresponding city at that timestamp. A sketch combining Equations (4) and (5) is given below.
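The two equations combine into a few lines. The sketch below assumes strictly positive likelihoods $P(y_i \mid x^u)$ are supplied; how the paper estimates them from the 1D mapping is not reproduced here, and the helper name is illustrative.

```python
import numpy as np

def geographic_surprise(prior, likelihood):
    """GS as the KL divergence between the posterior and the prior over the
    q expected models (Equations (4) and (5)). Inputs must be positive."""
    prior = np.asarray(prior, dtype=float)            # P(x^u), e.g., [0.8, 0.2]
    likelihood = np.asarray(likelihood, dtype=float)  # P(y_i | x^u)
    posterior = prior * likelihood
    posterior /= posterior.sum()  # divide by P(y_i) = sum_u P(y_i|x^u) P(x^u)
    return float(np.sum(posterior * np.log(posterior / prior)))

# Equal likelihoods leave the prior unchanged, so GS = 0 (no surprise);
# an observation contradicting the dominant model yields a larger GS.
print(geographic_surprise([0.8, 0.2], [0.9, 0.9]))  # 0.0
print(geographic_surprise([0.8, 0.2], [0.1, 0.9]))  # noticeably larger
```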

#### 4.3.3. Abnormity Classification

Using the above indices, we design an abnormity classification view that provides an overview of abnormal cases and divides them into four categories according to the evaluation indices. As shown in Figure 8, the horizontal axis indicates TD, and the vertical axis indicates GS. Each sample is represented by a circle whose size is proportional to the sum of the absolute values of its TD and GS, so users can easily locate representative abnormal samples. For the air quality data studied in this paper, the resulting ranges of TD and GS are [−1, 1] and [0, 0.3], respectively, and we set the classification threshold for each index to the midpoint of its minimum and maximum. On the basis of these ranges, we define four kinds of samples:

1. **Insusceptible samples**. These lie in the top-left corner (Figure 8a), with a low TD value and a high GS value. This combination means that these samples remain stable over time yet differ from the adjacent cities for long periods. It is important to analyze these samples and understand why they have unique features and are entirely unaffected by their neighboring areas.


From Figure 8, we find that most data are ordinary or susceptible samples, while a few insusceptible samples exist that are worth further exploration. From the abnormity classification view, users can quickly locate unusual samples with specific spatiotemporal characteristics and then trace detailed contextual information through other views. A sketch of the classification logic concludes the section.

**Figure 8.** Abnormity Classification view. (**a**) Insusceptible block. (**b**) Accidental block. (**c**) Ordinary block. (**d**) Susceptible block.
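To summarize the view's logic, the sketch below applies the midpoint thresholds described above. Only the insusceptible block's quadrant (low TD, high GS) is fixed by the text; the mapping of the remaining three quadrants to blocks (b)–(d) is an assumption inferred from the block names in Figure 8's caption, and the helper name is illustrative.

```python
def classify(td, gs, td_range=(-1.0, 1.0), gs_range=(0.0, 0.3)):
    """Classify one sample by its TD and GS values (hypothetical helper).

    Thresholds are the midpoints of the observed index ranges; the returned
    size mirrors the circle radius used in the view (|TD| + |GS|)."""
    td_thr = sum(td_range) / 2   # 0.0 for the range quoted above
    gs_thr = sum(gs_range) / 2   # 0.15
    size = abs(td) + abs(gs)
    if gs >= gs_thr:
        # top half: spatially anomalous
        label = "insusceptible (a)" if td < td_thr else "accidental (b)"  # (b): assumed
    else:
        # bottom half: spatially ordinary
        label = "ordinary (c)" if td < td_thr else "susceptible (d)"      # (c)/(d): assumed
    return label, size
```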
