#### *6.1. Rationale*

Dimension reduction and clustering are commonly used methods to identify patterns and processes when mining big data, and many works have integrated both of them into a visual analytic system. However, one question remains worth discussing: in what order should dimension reduction and clustering be applied? As noted in the literature [50], the schemes for combining these processes fall into several types and generate different results. Each scheme has advantages and disadvantages, so the choice should be based on actual application requirements. In this study, we chose to apply clustering to a data set that was already projected. The rationale for our scheme, as well as for the other applicable options, is discussed below.

**Independent Algorithms:** Since the dimension reduction and clustering processes are executed independently and do not affect each other, each algorithm can optimize its own results to the maximum extent. However, the clusters in the projection view may be intermingled and difficult to distinguish. This is not consistent with our goals, which included adding glyphs that display clustering information in the projection view. Independent algorithms result in overlapping glyphs and require users to expend more effort to trace a time-varying trend.

**Clustering Preprocessing for Dimension Reduction:** One possibility is to execute a clustering algorithm on the high-dimensional data and then project the data into visual space using some of the clustering results, such as cluster assignments or centroids. In this way, the clustering algorithm is unaffected by the projection and can reach its optimal results. Moreover, the clusters found can be kept together in low-dimensional space. Following this idea, we considered clustering the sample vectors first and then using the centroid of each cluster and the noise points as control points for step 1 in CLSP. However, because experts often hope to obtain fewer than 10 clusters for easy analysis, this would yield too few control points and lead to inaccurate projection results. In the system proposed in this paper, giving the user an overview of the data was an important design requirement, and it is achieved by the projection view. To ensure the accuracy of dimension reduction, we rejected this scheme. Another alternative is to additionally consider the similarities among cluster labels when quantifying the diversities among sample vectors. This scheme fully emphasizes the clustering result, and the corresponding projection view can better display the clusters found. However, in contrast to traditional dimension reduction methods, CLSP produces a composite layout in which a sample point placed near an attribute point indicates a relatively high value of that attribute for the sample. Adding information about cluster labels reduces the interpretability of this layout and makes it hard to observe the relationships between samples and attributes.

**Dimension Reduction Preprocessing for Clustering:** We finally chose to perform dimension reduction first and then cluster the samples in low-dimensional space. This scheme leads to accurate projection results since the dimension reduction is not affected by the clustering process. Its main disadvantage is that the clustering results are potentially misleading: because information is inevitably lost during the projection, the clusters cannot fully reflect the data relationships in high-dimensional space. However, as mentioned above, the projection layout is the basis for most analyses in this study, and the number of clusters required is relatively small. After weighing these trade-offs, we chose this scheme as the one best suited to our goals.
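The ordering chosen above (project first, then cluster in the low-dimensional space) can be illustrated with a minimal sketch. It is not the implementation used in AirInsight: `project_2d` is a hypothetical fixed linear projection standing in for CLSP, `kmeans` is plain k-means standing in for NHC, and the toy samples and axes are invented for illustration only.

```python
import math

def project_2d(samples, axes):
    """Stand-in projection (NOT CLSP): dot each sample with two fixed axes."""
    return [tuple(sum(x * a for x, a in zip(s, axis)) for axis in axes)
            for s in samples]

def kmeans(points, k, iters=20):
    """Plain k-means on the projected points (a stand-in for NHC)."""
    # Deterministic farthest-point initialisation keeps the sketch reproducible.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

# Toy 4-dimensional samples forming two well-separated groups (invented data).
samples = [(0, 1, 0, 1), (1, 0, 1, 2), (2, 0, 0, 0),
           (10, 11, 10, 11), (11, 10, 11, 12), (12, 10, 10, 10)]
axes = [(0.5, 0.5, 0.0, 0.0), (0.0, 0.0, 0.5, 0.5)]  # hypothetical projection axes

projected = project_2d(samples, axes)   # step 1: dimension reduction
labels = kmeans(projected, k=2)         # step 2: clustering sees only 2-D points
```

Because the clustering step only ever receives `projected`, any structure discarded by the projection is invisible to it, which is the trade-off described in the paragraph above.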

#### *6.2. Scalability*

The scalability of AirInsight is also worth discussing. Because it is a web-based system, it is easy for users to access and to apply to new data. In this study, AirInsight was applied to air quality data covering six kinds of pollutants in 88 cities over 3 years; however, it can be extended to the analysis of more samples and to more general problems involving multivariate spatiotemporal data.

AirInsight does not limit the geographic scale of the data. Users can study large-scale urban agglomerations, as well as analyze data from several locations or even a single location. For air pollution, the system is also suitable for finer-grained analysis at the level of monitoring stations.

In this work, each timestamp is a month, and a combination of DTW and SSIM is applied to measure the distance between samples. When the time granularity is finer, or when $s_i$ is no longer a time series, this measure can be replaced by one that considers only the multivariate features, such as the commonly used Euclidean distance. As the number of timestamps increases, the trend view adds a scroll bar to extend the display and show more time axes. Further, for the spiral heatmap in the glyphs, focus+context techniques let users observe more time grids. However, when the number of timestamps grows so large that the temporally sequential lines become cluttered, the time-varying process is hard to identify. The animation supported in AirInsight can mitigate this problem. In the future, we will introduce line-simplification visualization techniques, such as edge bundling [51].
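To make the choice of distance measure mentioned above more concrete, the following sketch contrasts classic dynamic time warping with the Euclidean distance on two toy univariate series. It is only an illustration under stated assumptions: the series are invented, the local cost is a plain absolute difference, and the combination with SSIM used in this work is not reproduced.

```python
import math

def euclidean(a, b):
    """Point-wise distance; requires equal lengths and ignores temporal shifts."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Classic dynamic time warping with an absolute-difference local cost."""
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]

# Two invented series with the same shape, one shifted by a single time step.
series_a = [0, 0, 1, 2, 1, 0]
series_b = [0, 1, 2, 1, 0, 0]

d_dtw = dtw(series_a, series_b)        # warping absorbs the shift
d_euc = euclidean(series_a, series_b)  # the shift is penalised point by point
```

Here the warping path absorbs the one-step shift, so the DTW distance is zero while the Euclidean distance is not. When the samples are not time series, no such alignment is needed and the simpler Euclidean measure is sufficient.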

For the multivariate features of the data, the larger the number of attributes, the more display space is required. In the system, both glyphs and radar charts can be enlarged to show more multivariate information. However, our tests showed that readability drops significantly when the data exceed 30 dimensions. Also, in the projection view, we use different symbols to represent attributes. When the number of symbols exceeds what users can remember and identify, new visual metaphors that are more intuitive and distinguishable are needed.

Beyond the final visualization system, several of the proposed methods can be used independently to meet the requirements of other application fields. CLSP can be used to create an interpretable dimension reduction layout, NHC can be used to analyze clusters and noise, and the anomaly detection strategy can be used to detect and classify spatiotemporal anomalies.

As an interdisciplinary application involving environmental science and visual analysis, our work applies data mining algorithms and statistical analysis indicators to extract and present hidden patterns and anomalies in big air quality data. For air quality experts, we provide the ability to analyze correlations among multiple pollutants and to find differences in pollutants among different regions or at different times. Users can also identify cities with long-term stability, cities with dramatic changes, and urban groups with similar patterns. Our system is also friendly to users without a professional background, since visualization technology makes huge amounts of data readable and straightforward. For example, media reporters who want to summarize the air quality of a city over a given time span can find the city of interest in the map view and click it. From the resulting glyph view, they can identify the most common air condition from the R-Shield with the maximum radius and inspect the trend view to compare the city with others in the same area.

#### *6.3. Limitations*

Although we received positive feedback from users, there are still some limitations that need to be discussed.

One is the size of the data that the system can manage while ensuring a good analytical experience. We tested the system on an Intel Core 3.6 GHz computer with 16 GB RAM. On the basis of this implementation, we recorded the running time required for the data studied in this paper. As shown in Table 3, we divided the preprocessing stage into several subprocesses, including constructing the difference matrix (CLSP), projecting (CLSP), NHC, and computing TD and GS. The most time-consuming part is constructing the difference matrix, whose computational complexity is $O((n + m)^2)$. In other words, the running time of this process is closely related to the total number of samples and attributes, and it grows quadratically as the size of the data increases. In the future, we will aim to design a parallel computing algorithm to reduce time costs. Another time-consuming part is projecting; the cost of this step is primarily due to the selection of control points by SF-Kmedoids, which can be optimized by improved methods. Aside from the limitations of preprocessing, we further tested interactivity performance with different data sizes. We randomly generated three sizes of projected samples: 5000, 10,000, and 20,000. With 5000 or 10,000 samples, the experiments revealed no delays, and the linkage between views by brushing or clicking was not affected. With 20,000 samples, the initial rendering of all views took about 3 s, and the linkage between views by brushing showed some delays. Thus, we regard 20,000 as the data size limit that our system can support. In summary, our system can support the exploration of up to 20,000 data items, with real-time response for up to 10,000 items, although users need to run preprocessing, with an acceptable runtime, when they analyze a new data set.
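As a rough illustration of what a cost of $O((n + m)^2)$ implies, the sketch below counts the entries of an $(n + m) \times (n + m)$ difference matrix for a few sample counts. It assumes that the running time is proportional to the number of matrix entries and that $m = 6$ (one attribute per pollutant); the sample counts simply reuse the sizes from the interactivity test and are not measured preprocessing workloads. The helper name `matrix_entries` is hypothetical.

```python
def matrix_entries(n_samples, n_attributes):
    """Number of entries in an (n + m) x (n + m) difference matrix."""
    size = n_samples + n_attributes
    return size * size

m = 6  # assumption: one attribute per pollutant
counts = {n: matrix_entries(n, m) for n in (5000, 10000, 20000)}

# Ratio of work when the number of samples doubles from 5000 to 10,000.
growth = counts[10000] / counts[5000]

for n, entries in counts.items():
    print(f"n = {n:>6}: {entries:,} entries")
print(f"doubling n multiplies the entry count by about {growth:.2f}")
```

Under these assumptions, doubling the number of samples roughly quadruples the work, which is why this step dominates preprocessing as the data grow.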


**Table 3.** Running time of preprocessing stage.

Our users also raised some issues worth mentioning after they used AirInsight. Four participants reported that, although the glyphs were useful and intuitive, they carried so much information that it took some effort to fully understand the details on first use. In addition, one user pointed out that our work lacked an analysis of the sensitivity of anomalies at different temporal and spatial scales. This is a significant issue that can be studied further in the future. After the projection view is brushed, the map view and trend view can only display statistics separately, rather than a joint spatiotemporal distribution. These views therefore cannot support more complex tasks, such as brushing samples and then comparing the most common month in each location. Also, bivariate color scales that are green and red at their extremes are not friendly to color-blind users. In the future, more accessible methods of visual mapping, such as mapping values to grayscale, should be considered in AirInsight to meet the needs of different kinds of users.
