**Big Data Computing for Geospatial Applications**

Edited by

Zhenlong Li, Wenwu Tang, Qunying Huang, Eric Shook and Qingfeng Guan

> Printed Edition of the Special Issue Published in *International Journal of Geo-Information*

www.mdpi.com/journal/ijgi

## **Big Data Computing for Geospatial Applications**


Editors

**Zhenlong Li Wenwu Tang Qunying Huang Eric Shook Qingfeng Guan**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors*

Zhenlong Li, Department of Geography, University of South Carolina, USA

Wenwu Tang, Center for Applied Geographic Information Science, Department of Geography and Earth Sciences, University of North Carolina at Charlotte, USA

Qunying Huang, Department of Geography, University of Wisconsin-Madison, USA

Eric Shook, Department of Geography, Environment, and Society, University of Minnesota, USA

Qingfeng Guan, School of Geography and Information Engineering, China University of Geosciences, Wuhan, China


*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *ISPRS International Journal of Geo-Information* (ISSN 2220-9964) (available at: https://www.mdpi.com/journal/ijgi/special_issues/Big_Data_Geospatial_Applications).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Article Number*, Page Range.

**ISBN 978-3-03943-244-8 (Hbk) ISBN 978-3-03943-245-5 (PDF)**

© 2020 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.



## **About the Editors**

**Zhenlong Li** is an Associate Professor in the Department of Geography at the University of South Carolina, where he established and leads the Geoinformation and Big Data Research Laboratory. His research focuses on geospatial big data analytics, spatiotemporal analysis/modeling, and cyberGIS/GeoAI. By synthesizing advanced computing technologies, geospatial methods, and spatiotemporal principles, his research aims to advance knowledge discovery and decision making to support domain applications, including disaster management, human mobility, and public health. He has more than 80 publications, including over 50 peer-reviewed journal articles and 20 articles in books and proceedings. He served as the Chair of the Cyberinfrastructure Specialty Group of the AAG and Co-Chair of the Cloud Computing Group of the Federation of Earth Science Information Partners (ESIP). Currently, he sits on the editorial boards of four international journals: *ISPRS International Journal of Geo-Information*, *Geo-spatial Information Science*, *PLoS ONE*, and *Big Earth Data*.

**Wenwu Tang** is an Associate Professor at the Department of Geography and Earth Sciences and the Executive Director of the Center for Applied GIScience (gis.uncc.edu) at the University of North Carolina at Charlotte. His research interests include cyberinfrastructure and high-performance geocomputation, agent-based modeling, land change modeling, and geospatial big data. Tang has over 80 peer-reviewed publications, including 58 journal articles, 16 book chapters, 12 conference proceedings, and 1 edited book. Wenwu's research has been supported by federal and state funding agencies (about \$2.2 million in total), including the USDA Forest Service, US CDC, US Fish and Wildlife, NCDOT, NC Forest Service, and the Electric Power Research Institute. Relevant courses that Wenwu has taught include CyberGIS and Big Data, Spatial Statistics, and Web GIS. Wenwu is an editorial board member of *Landscape and Urban Planning*, academic editor of *PLoS ONE*, and guest editor of three Special Issues (*Sustainability*, *ISPRS International Journal of Geo-Information*).

**Qunying Huang** is an Associate Professor in the Department of Geography at the University of Wisconsin-Madison. Her fields of expertise include spatial computing and spatial data mining. Dr. Huang's research bridges the gap between GIScience and Computer and Information Science (CIScience) by generating new computational algorithms and methods to make sense of complex spatial big datasets obtained from both physical (e.g., remote sensing) and social (e.g., social media) sensing networks. The problem domains of her research are related to natural hazards and human mobility.

**Eric Shook** is an Associate Professor in the Department of Geography, Environment, and Society at the University of Minnesota. His research is focused on geospatial computing, which is situated at the intersection of geographic information science and computational science, with particular emphasis on the areas of cyberGIS and data-intensive spatiotemporal analytics and modeling. Overall, his research aims to advance geospatial computing through foundational research, which is then leveraged in novel applications ranging from research in human origins to analyzing risk perceptions of disease outbreaks and extreme weather events using social media. Dr. Shook teaches in the areas of geographic information science, geospatial computing, and cyberGIS.

**Qingfeng Guan** is a Professor and an Associate Dean of the School of Geography and Information Engineering at China University of Geosciences, Wuhan, China, where he established and leads the High-Performance Spatial Computational Intelligence Lab (HPSCIL). His research interests include big spatiotemporal data analytics, spatiotemporal modeling, spatial computational intelligence, and high-performance spatial computing. Dr. Guan has received research funds from a variety of agencies, including the National Natural Science Foundation of China, the Natural Science Foundation of Hubei Province, the Ministry of Science and Technology of China, the Ministry of Education of China, and the China Geological Survey. Dr. Guan has published over 60 journal articles and 2 book chapters.

## *Editorial* **Introduction to Big Data Computing for Geospatial Applications**

### **Zhenlong Li 1,\*, Wenwu Tang 2, Qunying Huang 3, Eric Shook 4 and Qingfeng Guan 5**


Received: 3 August 2020; Accepted: 10 August 2020; Published: 12 August 2020

**Abstract:** The convergence of big data and geospatial computing has brought challenges and opportunities to GIScience with regard to geospatial data management, processing, analysis, modeling, and visualization. This special issue highlights recent advancements in integrating new computing approaches, spatial methods, and data management strategies to tackle geospatial big data challenges, while also demonstrating the opportunities of using big data for geospatial applications. Crucial to the advancements highlighted here is the integration of computational thinking and spatial thinking and the transformation of abstract ideas and models into concrete data structures and algorithms. This editorial first introduces the background and motivation of this special issue, followed by an overview of the ten included articles. Conclusions and future research directions are provided in the last section.

**Keywords:** geospatial big data; geospatial computing; cyberGIS; GeoAI; spatial thinking

## **1. Introduction**

Earth observation systems and model simulations are generating massive volumes of disparate, dynamic, and geographically distributed geospatial data with increasingly finer spatiotemporal resolutions [1]. Meanwhile, the ubiquity of smart devices, location-based sensors, and social media platforms provides extensive geo-information about daily life activities. Efficiently analyzing these geospatial big data streams enables us to investigate complex patterns and develop new decision-support systems, providing unprecedented value for science, engineering, and business. However, handling the five "Vs" (volume, variety, velocity, veracity, and value) of geospatial big data is a challenging task, as these data often need to be processed, analyzed, and visualized in the context of dynamic space and time [2].

Following a series of successful sessions organized at the American Association of Geographers (AAG) Annual Meeting since 2015, this special issue on "Big Data Computing for Geospatial Applications" by the *ISPRS International Journal of Geo-Information* aims to capture the latest efforts on utilizing, adapting, and developing new computing approaches, spatial methods, and data management strategies to tackle geospatial big data challenges for supporting applications in different domains, such as climate change, disaster management, human dynamics, public health, and environment and engineering.

Specifically, this special issue aims to address the following important topics: (1) geo-cyberinfrastructure integrating spatiotemporal principles and advanced computational technologies (e.g., GPU (graphics processing unit) computing, multicore computing, high-performance computing, and cloud computing); (2) innovations in developing computing and programming frameworks and architectures (e.g., MapReduce, Spark) or parallel computing algorithms for geospatial applications; (3) new geospatial data management strategies and storage models coupled with high-performance computing for efficient data query, retrieval, and processing (e.g., new spatiotemporal indexing mechanisms); (4) new computing methods considering the spatiotemporal collocation (locations and relationships) of users, data, and computing resources; (5) geospatial big data processing, mining, and visualization methods using high-performance computing and artificial intelligence; (6) integrating scientific workflows in cloud computing and/or high-performance computing environments; and (7) other research, development, education, and visions related to geospatial big data computing. This editorial provides a summary of the ten articles included in this issue and suggests future research directions in this area based on our collective observations.

## **2. Overview of the Articles**

The articles included in this issue make significant contributions to the use of big data computing for tackling various geospatial problems (from human mobility to disaster management to knowledge discovery) by incorporating novel methodologies, data structures, and algorithms with advanced computing frameworks (from geo-visual analytics and deep learning to cloud computing and MapReduce/Spark). Using ten different big data sources (e.g., social media, remote sensing, and the Internet of Things), this issue demonstrates the value and importance of integrating computational approaches and geospatial methods in advancing scientific discovery and domain applications (Table 1).


**Table 1.** Summary of the geospatial big data and computing approaches used in each article for various geospatial applications.

## *2.1. Big Data Computational Methods*

Geospatial data processing and analysis, such as geometry transformation, converting coordinate reference systems, and evaluating spatial relationships, often involve a large number of floating-point arithmetic computations. Correspondingly, MapReduce- and Spark-based frameworks and systems, such as SpatialHadoop [13] and GeoSpark [14], were developed to speed up these computations. Additionally, cloud-based computing platforms, such as Google Earth Engine (GEE) for big earth observation data, have been increasingly used in geospatial studies and applications. To optimize the performance of a parallel algorithm for geospatial processing, analysis, or modeling when using such general-purpose frameworks, the spatial characteristics of the data and algorithm must be considered in the algorithmic design [15,16]. The four papers by Jo et al. [3], Zhao et al. [4], Kang et al. [5], and Safanelli et al. [6] focus on parallel computing and highlight the adaptation of existing computing frameworks for geospatial data preprocessing, parallel algorithm design, simulation modeling, and data analysis.

It often takes a long time to prepare geospatial datasets for these data computing systems, which generally involves extracting, transforming, and loading (ETL) processes. To deal with big data in the ETL process, Jo and Lee proposed a new method, D\_ELT (delayed extracting–loading–transforming), to reduce the time required for data transformation within the Hadoop platform by utilizing MapReduce-based parallelization [3]. Using big sensor data of various sizes and geospatial analyses of varying complexity levels, several experiments were performed to measure the overall performance of D\_ELT, traditional ETL, and extracting–loading–transforming (ELT) systems. Their results demonstrate that D\_ELT outperforms both ETL and ELT. In addition, the larger the amount of data or the higher the complexity of the analysis, the better the performance of D\_ELT over the traditional ETL and ELT approaches.

Zhao et al. designed a parallel algorithm for overlay analysis that uses a measure of polygon shape complexity as the key factor for data partitioning, in combination with a distributed spatial index and a minimum bounding rectangle filter [4]. The parallel algorithm was implemented on Spark, a widely used distributed computing framework for large-scale applications [17]. Experimental results show that data partitioning based on shape complexity effectively improved load balancing among multiple computing nodes and hence the computational efficiency of the parallel algorithm. This work demonstrates that appropriately defining and measuring properties of the data and/or algorithm (however simple) to reflect computational intensity is essential for enhancing the performance of parallel algorithms.
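To illustrate the general idea (this is a minimal sketch, not the authors' Spark implementation), the snippet below balances polygons across partitions using a stand-in complexity score, here simply the vertex count, with a greedy longest-processing-time assignment:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: balance polygons across partitions by a shape-complexity score. */
public class ComplexityPartitioner {

    /** Stand-in complexity measure: here simply the vertex count. */
    static int complexity(double[][] polygon) {
        return polygon.length;
    }

    /** Greedy longest-processing-time assignment: heaviest polygons first,
     *  each going to the currently lightest partition. Sorts the input list. */
    static List<List<double[][]>> partition(List<double[][]> polygons, int numPartitions) {
        polygons.sort((a, b) -> Integer.compare(complexity(b), complexity(a)));
        List<List<double[][]>> parts = new ArrayList<>();
        long[] load = new long[numPartitions];
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        for (double[][] poly : polygons) {
            int lightest = 0;
            for (int i = 1; i < numPartitions; i++)
                if (load[i] < load[lightest]) lightest = i;
            parts.get(lightest).add(poly);
            load[lightest] += complexity(poly);
        }
        return parts;
    }
}
```

The design choice mirrors the paper's insight: partitioning by a computational-intensity proxy, rather than by record count alone, keeps the per-node workloads comparable.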

The CA–Markov model is one of the most widely used extensions of cellular automata (CA) and has been applied to the prediction and simulation of land-use change [5]. As land-use change simulation and prediction involve massive amounts of data and calculation, many parallel CA algorithms have been designed to simulate urban growth on various computing architectures, including central processing units (CPUs) and GPUs. While the parallel CA method incorporates spatial relationships amongst cells, it cannot maintain connections between partitions after a study area is divided into several pieces, resulting in differing prediction results. Meanwhile, the traditional Markov method can maintain integrity for the entire study area but lacks the ability to incorporate spatial relationships amongst the cells. Although the MapReduce framework is capable of efficient parallel processing when coupled with the CA–Markov model, the key problems of segmentation and maintaining spatial connections remain unresolved. As such, Kang et al. introduced a MapReduce-based solution to improve the parallel CA–Markov model for land-use-change prediction [5]. Results suggest that the parallel CA–Markov model not only resolves the paradox that the traditional CA–Markov model cannot simultaneously achieve integrity and segmentation for land-use change simulation and prediction but also achieves both efficiency and accuracy.
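To make the Markov component concrete, the following minimal sketch projects land-use class shares one period ahead using a transition probability matrix; the classes and probabilities are illustrative, not taken from the paper:

```java
/** Sketch of the Markov step in a CA–Markov model: project land-use class
 *  shares one period ahead with a transition probability matrix.
 *  The matrix values here are hypothetical. */
public class MarkovStep {
    public static void main(String[] args) {
        // Classes: 0 = urban, 1 = farmland, 2 = forest (illustrative).
        double[][] transition = {
            {0.95, 0.04, 0.01},   // P(urban -> *)
            {0.10, 0.85, 0.05},   // P(farmland -> *)
            {0.03, 0.07, 0.90},   // P(forest -> *)
        };
        double[] shares = {0.20, 0.50, 0.30};  // current class shares
        double[] next = new double[shares.length];
        for (int from = 0; from < shares.length; from++)
            for (int to = 0; to < shares.length; to++)
                next[to] += shares[from] * transition[from][to];
        // Prints: urban=0.249 farmland=0.454 forest=0.297
        System.out.printf("urban=%.3f farmland=%.3f forest=%.3f%n",
                          next[0], next[1], next[2]);
    }
}
```

The CA component then allocates these projected quantities over space using neighborhood rules, which is exactly where the partition-boundary problem described above arises.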

Safanelli et al. took a different approach to handling geospatial big data challenges [6]. They developed a terrain analysis algorithm based on GEE (termed TAGEE) to calculate a variety of terrain attributes, e.g., slope, aspect, and curvatures, for different resolutions and geographical extents. By using spheroidal geometries measured by the great-circle distance, TAGEE does not require the input DEM data to be projected onto a flat plane. Experiments show that TAGEE generates results similar to those of conventional GIS software packages. By taking advantage of the high-performance computing capacity of GEE, TAGEE is able to efficiently produce a suite of terrain attribute products at any spatial resolution at a global scale. This work represents an emerging paradigm of geospatial computing in the era of big data. As cloud computing platforms such as GEE mature, geospatial computing is no longer limited by locally available computing resources and datasets. Applications of complex geospatial algorithms/models at high spatial resolutions and the global scale have appeared in the last couple of years and will soon become the norm.
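The great-circle measurement underlying this approach can be sketched with the haversine formula; the snippet below is a generic illustration using a spherical-Earth approximation, not TAGEE's actual code:

```java
/** Sketch: great-circle distance via the haversine formula, the kind of
 *  unprojected distance measurement that lets terrain attributes be computed
 *  directly from geographic coordinates. Assumes a spherical Earth. */
public class GreatCircle {
    static final double EARTH_RADIUS_M = 6_371_000.0;  // mean Earth radius

    static double haversineMeters(double lat1, double lon1,
                                  double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1), phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                 + Math.cos(phi1) * Math.cos(phi2)
                 * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }
}
```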

## *2.2. Big Data Mining*

Social sensing, in which humans represent a large sensor network, has emerged as a new data collection approach in the big data era [18]. The following three papers by Zhang et al. [7], Yang et al. [8], and Wu et al. [9] demonstrate the power of integrating social sensing data (public transit, social media, and mobile phone) and big data computing techniques for supporting geospatial applications including human mobility, disaster management, and transportation.

Zhang et al. developed a novel approach for mining and visualizing human mobility patterns from multisource big public transit data, aiming to support transportation planning and management by providing an enhanced understanding of human movement patterns over space and time [7]. To efficiently extract travel patterns from massive heterogeneous data sources, this work developed a clustering algorithm to extract transit corridors indicating the connections between different regions and a graph-embedding algorithm to reveal hierarchical mobility community structures. Beyond the novel machine-learning algorithms, this work also provides a scalable web-based geo-visual analytical system including visualization techniques to allow users to interactively explore the extracted patterns. The system was evaluated by 23 users with different backgrounds and the results confirm the usability and efficiency of the integrated geo-visual analytical approach for human movement pattern discovery from public transit big data. This work demonstrates the power of integrating geospatial big data, machine-learning algorithms, and geo-visual analytical approaches for supporting transportation applications.

Yang et al. introduced a deep learning method to efficiently conduct sentiment analysis of big social media data for assisting disaster mitigation [8]. This work devises a five-phase framework for the automatic extraction of public emotions from geotagged Sina micro-blog data, including data collection and processing, emotion classification, and spatiotemporal analysis. To classify emotions (fearful, anxious, sad, angry, neutral, and positive), a convolutional neural network (CNN) model is designed and trained by converting the raw text to word vectors. To demonstrate the efficiency of the approach, the 2013 earthquake in Ya'an, China, was used as a case study. Based on the trained model, public emotions within the study area were classified at different time periods right after the earthquake. Spatiotemporal analyses were then performed to examine the dynamics of people's sentiments toward the earthquake over space and time. Results suggest that the proposed approach accurately classified emotions from big social media data (>81% accuracy), providing valuable public emotional information for disaster mitigation.

Wu et al. proposed a three-step approach to detect missing road segments from mobile phone-based navigation data within urban environments [9]. The first step applies filtering to the navigation data to remove records related to pedestrian movement and existing road segments. In the second step, centerlines of missing roads are constructed using a clustering algorithm. The third step builds the topology of the missing roads and connects these detected roads with existing road networks. Wu et al. [9] applied this approach to a study area (about 6 square kilometers) in Shanghai, China. Based on ~10 million GPS points collected from mobile navigation in 2017, this work evaluated the capability of the three-step approach in detecting missing roads. Results demonstrate the performance of this three-step approach based on mobile phone data, while recognizing the computational challenge of the approach when dealing with larger datasets.

## *2.3. Knowledge Representation*

Zhuang et al. [10] addressed an understudied problem, namely the representation and sharing of knowledge related to geospatial problem solving. Through a process of abstraction and decomposition, this work deconstructs geospatial problems into tasks that operate at three different granularities. Beyond a high-level description, this work formalizes the geospatial problem-solving process into a knowledge base by creating a suite of ontologies for tasks, processes, and GIS operations. Using a meteorological early-warning analysis as a case study, this work successfully demonstrates how to capture abstract geospatial problem-solving knowledge in a formal and sharable task-oriented knowledge base. Demonstrated by a prototype system, their results offer a promising glimpse of how users could begin building geospatial problem-solving models and workflows in the same way they build spatial models and workflows today. Such models and workflows could be reused and adapted for similar problems or used as building blocks to tackle more complex geospatial problems in the future, such as the global effects caused by climate change.

Wang et al. [11] built a knowledge graph similar to that of Zhuang et al. [10] but focused instead on capturing geographic objects and their spatiotemporal contexts. This work creates a geographic knowledge graph (GeoKG) composed of six elements to answer foundational questions in geography, including: Where is it? Why is it there? When and how did it happen? Through a process of model construction and formalization, this work captures geographic objects, their relations, and ongoing dynamics in a GeoKG. To demonstrate the effectiveness of the GeoKG, this work detailed the evolution of the administrative divisions of Nanjing, China, along the Yangtze River and then compared it to a well-known straightforward and extensible ontology, YAGO (Yet Another Great Ontology). Results show that the GeoKG improved accuracy and completeness through analyses and user evaluation, demonstrating scientific advancement in capturing geographic knowledge in a computational system.

## *2.4. Big Data Search*

Lastly, Gaigalas et al. [12] presented a cyberinfrastructure-enabled cataloging approach that combines web services and crawler technologies to support efficient search of big climate data. The cataloging approach consists of four main steps: selecting and analyzing a metadata repository, crawling the metadata, building a spatiotemporal index of the metadata, and searching based on collection search (via catalog services) and granule search (via a REST API). This cataloging approach was implemented to support the EarthCube CyberConnector. To demonstrate the feasibility and efficiency of the proposed approach, this cyberinfrastructure was tested with petabyte-level ESOM (Earth System Observation and Modeling) data provided by the UCAR THREDDS Data Server (TDS). Results suggest that the proposed cataloging approach not only boosts the crawling speed by 10 times but also dramatically reduces the redundant metadata from 1.85 gigabytes to 2.2 megabytes. Instead of focusing on big data analysis, this work demonstrates the significance and advanced techniques of making big climate data searchable to support interdisciplinary collaboration in climate analysis.

## **3. Conclusion and Future Research Directions**

This special issue highlights a diversity of geospatial models and analyses, geospatial data, geospatial thinking, and computational thinking used to address myriad geospatial problems ranging from human mobility [7] to disaster management [8]. The manuscripts span geospatial problem solving and knowledge (e.g., [10,11]), handling massive geospatial data (e.g., [3,12]), and analyzing and visualizing geospatial data (e.g., [7,9]).

Crucial to the advancements highlighted in this special issue is the integration of computational thinking and spatial thinking and the translation of abstract ideas and models to concrete data structures and algorithms. A promising future research direction will be to build on this integration of knowledge and skills across the disciplines of GIScience and computational science, which has been termed cyber literacy for GIScience [19]. In this way, integrated knowledge of real-world geospatial patterns and computational processes can be captured and shared, and big data and geospatial visual analytic frameworks can be integrated to provide more robust computational geospatial platforms to address myriad geospatial problems. A key challenge in this research direction will be the integrative fabric that can seamlessly combine scholarly thinking with computational infrastructures, geospatial data elements with big data capabilities, and geospatial methods infused with parallelism.

Parallelism can be achieved by innovatively utilizing advanced computing frameworks, such as MapReduce and Spark, for applications that include massive data sorting, computing, machine learning, and graph processing [20]. While this special issue highlighted advancements in geospatial big data preprocessing [3], land-use change prediction [5], and overlay analysis [4], more efforts should be devoted to identifying high-impact geospatial applications that benefit from the integration of geospatial methods and parallelization in the big data era. Additionally, many existing geospatial big data applications simply inject spatial data types or functions into existing big data systems (e.g., Hadoop) without much optimization [3]. Therefore, further research should focus on improving and optimizing the performance of big data frameworks from different aspects, such as data ETL, job scheduling, resource allocation, query analytics, memory issues, and I/O bottlenecks, by considering spatial principles and constraints [21].

**Author Contributions:** Conceptualization, Zhenlong Li; Writing—original draft preparation, Zhenlong Li, Wenwu Tang, Qunying Huang, Eric Shook and Qingfeng Guan; writing—review and editing, Zhenlong Li, Eric Shook, Wenwu Tang, Qunying Huang and Qingfeng Guan. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We would like to thank all authors for contributing to this special issue and the reviewers for their constructive suggestions and criticisms that significantly improved the papers in this issue. We also want to thank Yuanyuan Yang for her editorial support and help in preparing the special issue.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **MapReduce-Based D\_ELT Framework to Address the Challenges of Geospatial Big Data**

**Junghee Jo 1,\* and Kang-Woo Lee 2**


Received: 15 August 2019; Accepted: 21 October 2019; Published: 24 October 2019

**Abstract:** The conventional extracting–transforming–loading (ETL) system is typically operated on a single machine not capable of handling huge volumes of geospatial big data. To deal with the considerable amount of big data in the ETL process, we propose D\_ELT (delayed extracting–loading–transforming) by utilizing MapReduce-based parallelization. Among various kinds of big data, we concentrate on geospatial big data generated via sensors using Internet of Things (IoT) technology. In the IoT environment, update latency for sensor big data is typically short and old data are not worth further analysis, so the speed of data preparation is even more significant. We conducted several experiments measuring the overall performance of D\_ELT and compared it with both traditional ETL and extracting–loading–transforming (ELT) systems, using different sizes of data and complexity levels for analysis. The experimental results show that D\_ELT outperforms the other two approaches, ETL and ELT. In addition, the larger the amount of data or the higher the complexity of the analysis, the greater the parallelization effect of transform in D\_ELT, leading to better performance over the traditional ETL and ELT approaches.

**Keywords:** ETL; ELT; big data; sensor data; IoT; geospatial big data; MapReduce

## **1. Introduction**

In recent years, numerous types of sensors have been connected to the Internet of Things (IoT) and have produced huge volumes of data with high velocity. A large percentage of these sensor big data is geospatial data, describing information about physical things in relation to geographic space that can be represented in a coordinate system [1–4]. With the advance of IoT technologies, more diverse data have now become available, thereby greatly increasing the amount of geospatial big data.

Given the general properties of big data, the unique characteristics of geospatial data create distinctive challenges in data preparation [5]. Geospatial data typically include position data. These coordinate data differ from normal string or integer data in that pre-processing requires a large number of floating-point arithmetic computations. Examples include geometry transformation, converting coordinate reference systems, and evaluating spatial relationships. Among these, the most well-known aspect of geospatial data is the spatial relationship, describing the relationship of objects at a specific location to other objects in neighboring locations. The calculation of spatial relationships is central to most spatial analyses and has generally been regarded as a sophisticated problem [6]. Moreover, processing temporal elements further complicates the handling of geospatial data.
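For example, evaluating even a basic spatial relationship such as point-in-polygon containment requires floating-point arithmetic per polygon edge, as the standard even-odd ray-casting sketch below illustrates (a generic illustration, not code from the paper):

```java
/** Sketch: point-in-polygon containment via the even-odd ray-casting rule,
 *  an example of the floating-point-heavy computation that dominates
 *  geospatial pre-processing. */
public class PointInPolygon {
    /** polygon: array of {x, y} vertices in order; returns true if (px, py) is inside. */
    static boolean contains(double[][] polygon, double px, double py) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double xi = polygon[i][0], yi = polygon[i][1];
            double xj = polygon[j][0], yj = polygon[j][1];
            // Does the edge (j -> i) cross a horizontal ray from (px, py)?
            boolean crosses = (yi > py) != (yj > py)
                && px < (xj - xi) * (py - yi) / (yj - yi) + xi;
            if (crosses) inside = !inside;
        }
        return inside;
    }
}
```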

To deal with the challenges in processing and analyzing geospatial big data, several systems have emerged. Systems designed for big data have existed for years (e.g., Hadoop [7] and Spark [8]); however, they are unaware of spatial properties. This has led to a number of geospatial systems (e.g., SpatialHadoop [9] and GeoSpark [10]) being developed, mostly by injecting spatial data types or functions into existing big data systems. Hadoop, especially, has proven to be a mature big data platform, and so several geospatial big data systems have been constructed by inserting spatial data awareness into Hadoop. However, it is still not easy for big data software developers to create geospatial applications. Typically, to generate a MapReduce job for a required operation in Hadoop, developers need to program map and reduce functions. Spatial analysis usually requires more than one MapReduce step, where the output of one MapReduce step becomes the input to the next. As the complexity level of a spatial analysis increases, the number of MapReduce steps also increases, making it increasingly difficult for developers to write the iterative code that defines these more complicated MapReduce steps.
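The following sketch, written against the standard Hadoop Java API with trivial placeholder mapper/reducer bodies, shows the kind of boilerplate needed just to chain two MapReduce steps, where the output directory of the first job becomes the input of the second:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Sketch of chaining two MapReduce jobs; the mapper/reducer bodies are
 *  placeholders standing in for real spatial filtering and aggregation. */
public class ChainedSpatialJobs {

    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("k"), value);  // placeholder: spatial filtering
        }
    }

    public static class PassThroughReducer
            extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            for (Text v : values) ctx.write(key, v);  // placeholder: aggregation
        }
    }

    static Job makeJob(Configuration conf, String name, Path in, Path out)
            throws IOException {
        Job job = Job.getInstance(conf, name);
        job.setJarByClass(ChainedSpatialJobs.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setReducerClass(PassThroughReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("tmp/step1-output");
        // Step 1: e.g., spatial filtering.
        if (!makeJob(conf, "step1", new Path(args[0]), intermediate)
                .waitForCompletion(true)) System.exit(1);
        // Step 2: e.g., aggregation over step 1's output.
        System.exit(makeJob(conf, "step2", intermediate, new Path(args[1]))
                .waitForCompletion(true) ? 0 : 1);
    }
}
```

Every additional analysis step multiplies this kind of job wiring, which is precisely the burden the operator-sequence abstraction described next is meant to remove.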

To resolve this issue, in our previous work [11], we found a way to represent spatial analysis as a sequence of one or more units of spatial or non-spatial operators. This allows developers of geospatial big data applications to create spatial applications by simply combining built-in spatial or non-spatial operators, without having any detailed knowledge of MapReduce. Once the sequence of operators has been specified, it is automatically transformed into map and reduce jobs in our Hadoop-based geospatial big data system. During this conversion process, our system controls the number of MapReduce steps so as to achieve better performance by decreasing the overhead of mapping and reducing. The challenges for geospatial big data, however, lie in confronting not only how to store and analyze the data, but also how to transform the data while achieving good performance.

Currently, a large amount of geospatial data is continuously provided from many spatial sensors. It is important to analyze this geospatial big data as soon as possible to extract useful insights. However, the time required to transform massive amounts of geospatial data into the Hadoop platform has gradually increased. That is, it takes a lot of time to prepare the data required for geospatial analysis, thereby delaying the spatial analysis results. For example, we found that it took about 13 hours and 30 minutes to load 821 GB of digital tachograph (DTG) data using the traditional ETL method. In the ETL process, data are extracted from data sources, then transformed, involving normalization and cleansing, and loaded into the target database. The conventional ETL system is typically operated on a single machine that cannot effectively handle huge volumes of big data [12]. To deal with the considerable quantity of big data in the ETL process, there have been several attempts in recent years to utilize a parallelized data processing concept [13–15].

One study [14] proposed ETLMR, which uses a MapReduce framework to parallelize ETL processes. ETLMR is designed by integrating with Python-based MapReduce. This study conducted an experimental evaluation assessing system scalability based on different scales of jobs and data to compare with other MapReduce-based tools. Another study [15] compared Hadoop-based ETL solutions with commercial ETL solutions in terms of cost and performance and concluded that Hadoop-based ETL solutions are better than existing commercial ones. The study in [16] implemented P-ETL (parallel-ETL), which is developed on Hadoop. Instead of the traditional three steps of extracting, transforming, and loading, P-ETL involves five steps: extracting, partitioning, transforming, reducing, and loading. This study showed that P-ETL outperforms the classical ETL scheme. Many studies, however, have focused on big data analysis, and there have been insufficient studies attempting to increase the speed of preparing the data required for big data analysis.

In this paper, we continue our previous study on storing and managing geospatial big data and explain our approach to enhancing the performance of ETL processes. Specifically, we propose a method to start geospatial big data analysis within a short time by reducing the time required for data transformation on the Hadoop platform. A transformation is defined as data processing that converts source data into a consistent storage format for querying and analysis. Due to the complex nature of transformations, the performance of the ETL process depends mostly on how efficiently the transformations are conducted; transformation is the rate-limiting step in the ETL process. Our approach allows MapReduce-based parallelization of the transformation in the ETL process. Among the various sources of geospatial big data, we concentrate on sensor big data. With the increasing number of IoT sensing devices, the amount of sensor data is expected to grow significantly over time across a wide range of fields and applications. IoT-based sensor data are, however, essentially loosely structured and typically incomplete, much of them being directly unusable. In addition, in the IoT environment, updates to the data—the time between the arrival of raw data and when meaningful data are made available—occur more frequently than with typical batch data. These difficulties require that considerable resources be devoted to transformation in the ETL process.

This paper extends our research work presented in [11] and suggests a way to increase performance of the transformation functionality in the ETL process by taking advantage of the MapReduce framework. First, in Section 2 we briefly explain our previous work on constructing a geospatial big data processing system by extending the original Hadoop to support spatial properties. We focus particularly on explaining automatically converting a user-specified sequence of operators for spatial analysis to MapReduce steps. Section 3 describes up-to-date ETL research followed by our approach on improving performance of transformation in the ETL processes based on MapReduce. Our conducted experimental settings and results are described in Sections 4 and 5, respectively. Section 6 concludes our work and presents our plans for future research.

## **2. Geospatial Big Data Platform**

In our previous study [11], we developed a high-performance geospatial big data processing system based on Hadoop/MapReduce, named Marmot [17]. In Marmot, a spatial analysis is defined as a sequence of RecordSetOperators, where a RecordSet is a collection of records and a RecordSetOperator is a processing element that uses a RecordSet, similar to a relational operator in a relational database management system (RDBMS). A sequence of RecordSetOperators is defined as a Plan, as shown in Figure 1.

**Figure 1.** Representation of spatial analysis in Marmot: A sequence of one or more units of spatial or non-spatial operators.

In Marmot, a RecordSetOperator is classified into one of three types: RecordSetLoader, RecordSetFunction, or RecordSetConsumer. A RecordSetLoader is a non-spatial operator that loads source data and transforms it into a RecordSet; a RecordSetFunction is a spatial or non-spatial operator that takes a RecordSet as source data and produces a new RecordSet as output; a RecordSetConsumer is a non-spatial operator that stores the final RecordSet produced by a given spatial analysis outside of Marmot.
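A minimal sketch of these three roles, simplified from the richer interfaces Marmot actually provides, might look as follows:

```java
/** Minimal sketch of the three RecordSetOperator roles described above;
 *  the actual Marmot interfaces are richer than this. */
interface RecordSet { /* a collection of records */ }

interface RecordSetOperator { }

/** Loads source data and turns it into a RecordSet. */
interface RecordSetLoader extends RecordSetOperator {
    RecordSet load();
}

/** Takes a RecordSet and produces a new RecordSet (spatial or non-spatial). */
interface RecordSetFunction extends RecordSetOperator {
    RecordSet apply(RecordSet input);
}

/** Stores the final RecordSet outside of Marmot. */
interface RecordSetConsumer extends RecordSetOperator {
    void consume(RecordSet result);
}
```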

To process a given spatial analysis, a developer creates a corresponding Plan by combining spatial operators and non-spatial operators and injects the Plan into Marmot. Marmot processes each RecordSetOperator one by one and automatically transforms the given Plan to map and reduce jobs, as shown in Figure 2.

**Figure 2.** Automatic transformation of a Plan into MapReduce jobs. (**a**) A Plan having a RecordSetFunction divided into mapping and reducing operators; (**b**) An automatically transformed Plan.

While parsing a given Plan, when Marmot meets a RecordSetFunction that can be separated into mapping and reducing operators (e.g., ReduceByGroupKey), as shown in Figure 2a, Marmot decomposes the RecordSetFunction into the mapping operator and reducing operator, and eventually transforms the Plan into MapReduce jobs consisting of map and reduce phases, as shown in Figure 2b. During this transformation, Marmot controls the number of MapReduce phases in a way to achieve better performance by decreasing the overhead of mapping and reducing. To describe how Marmot handles such processes in detail, an example of spatial analysis to retrieve subway stations in a city is shown in Figures 3 and 4.

```java
Plan plan;
plan = marmot.planBuilder("subway stations per city")
             .load("lops/subway_stations")
             .update("the_geom = ST_Centroid(the_geom)")
             .spatialJoin("the_geom", "region/cadastral", "the_geom",
                          INTERSECTS, "param.sig_cd")
             .reduceByGroupKey("sig_cd")
             .aggregate(COUNT)
             .storeAsCsv("result")
             .build();
```

**Figure 3.** Marmot code for an example spatial analysis: counting subway stations per city.

Figure 3 shows the Marmot code for an example spatial analysis. The analysis is represented as a Plan consisting of five RecordSetOperators: Load, Update, SpatialJoin, ReduceByGroupKey, and StoreAsCsv. As shown in Figure 4, using the Load operator, Marmot reads the boundaries of each subway station and computes their center coordinates. The calculated center points are then set as the representative locations of the subway stations via the Update operator. For each subway station, using the SpatialJoin operator, Marmot identifies the city that contains the center point of the subway station. Finally, the number of subway stations per city is calculated via the ReduceByGroupKey operator, and the results are stored in a CSV file named "result" via the StoreAsCsv operator.

**Figure 4.** An example Plan for searching subway stations per city.

During the process of transforming the Plan to a sequence of MapReduce jobs, ReduceByGroupKey is decomposed into GroupBy and Reduce as a mapping operator and a reducing operator, respectively. Accordingly, Load, Update, SpatialJoin, and GroupBy are executed during the Map phase; Reduce and StoreAsCsv, during the Reduce phase.

## **3. Our MapReduce-Based D\_ELT Framework**

As mentioned in the previous section, we constructed Marmot, a high-performance data management system that enables developers without specific knowledge of big data technologies to implement high-performance spatial analysis applications on geospatial big data. The issues concerning geospatial big data, however, lie not only in how to efficiently manage the data for fast analysis, but also in how to efficiently transform the data for fast data preparation.

DTG data, for example, have been used to analyze the status of transportation operations, to identify improvement points, and to identify disadvantaged areas in terms of public transportation. Transportation authorities, e.g., the Korea Transportation Safety Authority, collect DTG data from commercial vehicles and apply analytics to such big data to extract insights and facilitate decision making. Often, the results of data analysis must be derived periodically within a specific time, e.g., every single day, to be prepared for emergent cases. In this situation, to complete the given analysis in time, not only the data analysis speed but also the data preparation speed is a critical factor affecting the overall performance. In the IoT environment, update latency for sensor big data, the focus of this paper among the various sources of geospatial big data, is typically short, and old data are not worth further analysis, making data preparation speed even more important. Moreover, sensor big data are machine-generated; therefore, the source data contain more noise and errors than human-generated data, complicating data preparation even more.

Traditional ETL [18–20] can no longer accommodate such situations. ETL is designed for light-weight computations on small data sets and is not capable of efficiently handling massive amounts of data. Figure 5a depicts data preparation and analysis in the ETL process. In this approach, data are extracted from various sources, transformed on an ETL server, which is typically a single machine, and loaded into a Hadoop distributed file system (HDFS). The loaded data are finally analyzed in a big data platform for decision making. Here, the *analysis* operation is processed in a parallel/distributed way using MapReduce [21,22], which guarantees reasonable performance, but bottlenecks can occur during the *transform* operation. In fact, *transform* is the most time-consuming phase in ETL because this operation includes filtering or aggregating source data to fit the structure of the target database. Data cleaning must also be performed for any duplicated data, missing data, or differing data formats. Moreover, in big data environments, the heterogeneous sources of big data make the traditional *transform* operation even more computationally burdensome. The overall performance of the ETL process, therefore, depends mainly on how efficiently the *transform* operation is conducted.


**Figure 5.** Illustration of geospatial big data preparation and analysis processes comparing three cases: (**a**) ETL; (**b**) ELT; (**c**) D\_ELT. In the figure, "E" stands for extract, "T" stands for transform, "L" stands for load, and "A" stands for analysis.
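To make the *transform* workload described above concrete, the sketch below shows typical cleaning work of the kind the text mentions, dropping exact duplicates and records with missing coordinate fields; the column layout and field count are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of cleaning work inside the transform step: drop exact duplicates
 *  and records with missing coordinate fields. Column layout is hypothetical. */
public class CleanTransform {
    static List<String> clean(List<String> rawLines) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String line : rawLines) {
            String[] f = line.split(",");
            boolean complete = f.length == 11
                && !f[8].isEmpty() && !f[9].isEmpty();   // x/y fields present
            if (complete && seen.add(line)) out.add(line);  // dedupe + keep
        }
        return out;
    }
}
```

Run on a single ETL server, every record must pass through code like this before loading, which is exactly where the bottleneck arises.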

To overcome the drawbacks of traditional ETL and to speed up the data preparation process, ELT was devised [23–25]. The nature of traditional ETL is to perform *transform* immediately after the *extract* operation and then start the *load* operation. In contrast, the basic idea of ELT is to conduct the *load* operation immediately after the *extract* operation and to perform the *transform* after storing the data in the HDFS, as shown in Figure 5b. This approach has several advantages over ETL. The *transform* operation can be done at run time, when needed, and can even be applied multiple times to handle changing data requirements. In addition, this approach eliminates a separate transformation engine, the ETL server, between the source and target, making the overall system less costly. Above all, ELT allows raw source data to be loaded directly into the target and leverages the target system to perform the *transform* operation. In that sense, ELT can speed up *transform* using the parallelization/distribution supported by the Hadoop-based big data platform.

Despite these advantages, ELT still has limitations in handling big data. The ELT framework can speed up *transform* using MapReduce, but *analysis* is initiated only after the *transform* has been completed. In this approach, it is difficult to optimize *transform* in conjunction with *analysis* because the *transform* is performed in a batch, regardless of the context of *analysis*. For example, in the case of geospatial data, one of the highest computational overheads in conducting *transform* occurs during type transformation, such as converting plain-text x and y values into (x, y) point coordinates and transforming coordinate systems for spatial analysis. If the *analysis* does not require such tasks, it is possible to identify them at the *transform* phase and load only the required data. By doing so, the system can eliminate unnecessary transformations and speed up performance.
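The sketch below illustrates this idea: coordinate parsing and conversion are skipped entirely when the requested analysis never touches geometry. The names (`SensorRecord`, `toAnalysisCrs`) and column positions are illustrative, not Marmot's API:

```java
/** Sketch: defer/skip the costly geometry conversion when the analysis
 *  context does not require it. All names here are hypothetical. */
public class LazyGeometryTransform {

    record SensorRecord(String[] fields, double[] point) { }

    static SensorRecord transform(String csvLine, boolean needsGeometry) {
        String[] fields = csvLine.split(",");
        if (!needsGeometry) {
            return new SensorRecord(fields, null);  // skip costly conversion
        }
        // Only pay the floating-point cost when the analysis requires it.
        double x = Double.parseDouble(fields[8]);   // hypothetical x column
        double y = Double.parseDouble(fields[9]);   // hypothetical y column
        return new SensorRecord(fields, toAnalysisCrs(x, y));
    }

    /** Placeholder for a real coordinate reference system conversion. */
    static double[] toAnalysisCrs(double x, double y) {
        return new double[] {x, y};
    }
}
```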

To achieve better scalability and performance in conducting *transform* on geospatial big data, this paper offers a new approach for data preparation called D\_ELT, in the sense that the decision of how to perform *transform* is delayed until the context of *analysis* is understood. As shown in Figure 5c, in our approach, *transform* is executed in a parallel/distributed fashion together with *analysis* within our geospatial big data platform, Marmot. In Marmot, the operators for *transform* are considered a type of RecordSetOperator and are composed into a Plan along with the existing RecordSetOperators designed for *analysis*. This approach has the advantage that the data preparation and analysis processes are described using the same data model; application developers are therefore spared the inconvenience of learning to implement the two processes separately.

The application developer specifies the operators required to conduct *transform* in the D\_ELT script. In this way, the developer can implement both data preparation and analysis simultaneously, without having to modify the existing code for conducting *analysis*. The D\_ELT script consists of the names of operators and a list of key–value parameters, as shown in Figure 6. For convenience, if a developer needs a new operator for conducting *transform*, the operator can be implemented separately as a plug-in and used in Marmot in the same way as existing operators.

**Figure 6.** An example of a D\_ELT (delayed extracting–loading–transforming) script describing operators required for data transformation.

To perform a spatial analysis, Marmot first loads the D\_ELT script to determine which operators need to be executed for *transform*. Then, Marmot (1) examines the operators that need to be executed for *analysis*, (2) loads only the required data based on the needs of the *analysis*, and (3) executes both *transform* and *analysis* in a parallel distributed way. At this point, part of the transformed data can be used for *analysis* without waiting for all the data to finish being transformed. Figure 7 shows the sequence of operators executed for *transform* and *analysis* and their composition as a Plan; a sketch of such a Plan follows Figure 7. In this example Plan, "ParseCSV" is the operator for *transform* and "Filter" is the operator for *analysis*. They are allocated to the Map phase and executed in a parallel distributed way. The outputs from the Map phase are combined during the Reduce phase, and the results are written to the output file.

**Figure 7.** Illustration of the Map and Reduce phases during the D\_ELT process.
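In the style of the Figure 3 code, such a Plan might be written as follows; `parseCsv` and `filter` are hypothetical operator bindings for the ParseCSV and Filter operators named above, and the column list and predicate are illustrative:

```java
// Hypothetical Plan combining a transform operator (ParseCSV) with an
// analysis operator (Filter) so both run in the same Map phase.
Plan plan = marmot.planBuilder("piggybacked transform and analysis")
                  .load("dtg/raw")                  // raw CSV already loaded (E + L)
                  .parseCsv("timestamp,vehicle_no,speed,x_position,y_position")
                  .filter("speed > 80")             // analysis over parsed records
                  .storeAsCsv("result")
                  .build();
```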

The reason we implemented D\_ELT using MapReduce instead of Spark, another well-known engine for big data processing, is that our previously developed geospatial big data platform is based on Hadoop, and our goal was to improve the data transformation time in that environment. In addition, the data we currently handle are large amounts of DTG data, generated at 20–30 TB every month. When running spatial analyses on data of this size with Spark, we anticipated that unexpected problems may occur (e.g., disk swapping), and to our knowledge, concrete solutions have not yet been proposed.

It is also important to note that ELT and D\_ELT are identical in terms of performing data transformation during the MapReduce phase in Hadoop. The difference between ELT and D\_ELT is as follows. In ELT, once the raw data are uploaded to Hadoop, the data are transformed using MapReduce; after the transformation is completely finished, the analysis is started using another MapReduce job. In D\_ELT, all of the raw data are likewise uploaded to Hadoop, but the transformation is delayed until the analysis is conducted. That is, the transformation task is piggybacked onto the analysis task, and both tasks are performed together using the same MapReduce job. In this way, part of the transformed data can be used for analysis immediately, without having to wait for all the data to be transformed.

## **4. Experimental Evaluation**

This section explains our evaluation of the improvement in performance achieved by our proposed approach, D\_ELT. In addition, the scalability of three different approaches (traditional ETL, ELT, and D\_ELT) were measured and compared by varying data size and levels of analysis complexity.

## *4.1. Experimental Setup*

Our experiments were conducted on a four-node Hadoop cluster. Each node was a desktop computer with a 4.0 GHz Intel 4-core i7 CPU, 32 GB of main memory, and a 4 TB disk. The operating system was CentOS 6.9, and the Hadoop version was Hortonworks HDP 2.6.1.0 with Ambari 2.5.0.3. PostgreSQL 9.5 was used for the database management system, along with Oracle JDK 1.8. MapReduce2 version 2.7.3 was used.

The test data used in the experiments were DTG data from devices installed in vehicles, which record driving history in real time. Each record consists of a timestamp, vehicle number, daily mileage, accumulated mileage, speed, acceleration, RPM, brake, x\_position, y\_position, and angle. The data were classified into three sizes: small, 9.9 GB; medium, 19.8 GB; and large, 29.8 GB, as shown in Table 1. For the geospatial big data platform, we used our own system, Marmot, as explained in Section 2.

**Table 1.** Data size: small, medium, and large.

| Data Set | Size |
|----------|---------|
| Small | 9.9 GB |
| Medium | 19.8 GB |
| Large | 29.8 GB |


## *4.2. Experiment 1: Measurement of Data Preparation Time*

In this experiment, we compared the data preparation times of ETL and ELT with that of our proposed D\_ELT, as well as the scalability of each approach over different data sizes. The overall results of this experiment are presented in Table 2.

As shown in Figure 5 in Section 3, the total time for data preparation in the ETL process includes the times for extracting, transforming, and loading. In the case of ELT, the total time spent on data preparation is the sum of the times for extracting, loading, and transforming. In the ETL process, *transform* was conducted on a single machine without MapReduce, while in ELT, *transform* was performed in a parallel distributed way based on MapReduce.


**Table 2.** Data preparation time (in seconds): ETL, ELT, and D\_ELT.

| Data Size | ETL 1 | ELT 2 | D\_ELT 3 |
|-----------|-------|-------|----------|
| Small | 579 | 413 | 116 |
| Medium | 1158 | 808 | 231 |
| Large | 1727 | 1175 | 345 |

1 Data preparation time in ETL: E+T+L, where E stands for extract, T for transform, and L for load; 2 data preparation time in ELT: E+L+T, where T is executed in a parallel distributed way; 3 data preparation time in D\_ELT: E+L.

In the case of D\_ELT, the total time spent on data preparation is the sum of the times for extracting and loading only; it does not include the time for transforming. *Transform* is executed simultaneously with *analysis* in the data analysis phase, so data preparation in D\_ELT involves only *extract* and *load*.

## *4.3. Experiment 2: Measurement of Data Analysis Time*

In this experiment, we compared the data analysis times of ETL, ELT, and our proposed D\_ELT, as well as the scalability of each approach over different data sizes. Additionally, this experiment included an optimized D\_ELT (D\_ELT\_Opt), which conducts data analysis by filtering and using only the required data. To see how performance varies with the complexity of the analysis, we utilized three analyses—Count, GroupBy, and SpatialJoin—for low-, middle-, and high-complexity analysis, respectively. The overall results of this experiment are presented in Table 3.



**Table 3.** Data analysis time (in seconds): ETL (or ELT), D\_ELT, and optimized D\_ELT.

1 Data analysis time in ETL or ELT: A, where A stands for analysis and is executed in parallel; 2 data analysis time in D\_ELT: T+A, where T and A are executed in parallel; 3 data analysis time in optimized D\_ELT: T+A, where T and A are executed in parallel using only the required data.

As shown in Figure 5 in Section 3, the total time for data analysis in ETL includes only the *analysis*, which is conducted in a parallel distributed way using MapReduce. In the case of ELT, once the data preparation is completed, the *analysis* will be conducted in the same way as for ETL. In the cases of D\_ELT and optimized D\_ELT, the total time spent on data analysis is the time required to execute *transform* and *analysis* in a parallel distributed way using MapReduce.

## **5. Results and Discussion**

## *5.1. Results*

The first experiment, measuring the data preparation time of each approach, reveals the following points. As shown in Figure 8, D\_ELT is about 5 times faster than ETL (116 sec vs. 579 sec for small data; 231 sec vs. 1158 sec for medium data; 345 sec vs. 1727 sec for large data) and about 3 times faster than ELT (116 sec vs. 413 sec for small data; 231 sec vs. 808 sec for medium data; 345 sec vs. 1175 sec for large data), regardless of the data size. The ELT approach is about 1.4 times faster than ETL, owing to the parallel distributed processing effect of Marmot's MapReduce when performing *transform* in ELT.

**Figure 8.** Data preparation time (in seconds): ETL, ELT, and D\_ELT.

The data analysis times measured in the second experiment reveal the following points. Table 4 compares the performance of D\_ELT against ETL (or ELT) and of optimized D\_ELT against ETL (or ELT). In both cases, the ratio of performance to analysis is almost identical regardless of the data size. An interesting point is that the data analysis time in the D\_ELT process contains the time for *transform*, while the ETL (or ELT) process does not. Although D\_ELT is slower than ETL (or ELT), there is little difference in performance: D\_ELT is at most 1.4 times slower. Optimized D\_ELT is only up to 1.2 times slower than the ETL (or ELT) approach. In D\_ELT, in the case of simple analysis, the time involved in data transformation is relatively large compared to the analysis time and consumes a large part of the total execution time. However, in the case of complex spatial analysis, the time involved in data transformation is relatively small compared to data analysis, so the transformation overhead incurred is relatively small.


**Table 4.** Performance comparison: D\_ELT/ETL(or ELT) and optimized D\_ELT/ETL(or ELT).

It is important to note that this slowdown does not degrade overall performance, considering that D\_ELT is about 3–5 times faster than ETL (or ELT) during data preparation, as shown in Table 2 in Section 4 and Figure 8. Therefore, the overall performance of both D\_ELT and optimized D\_ELT is much better than that of ETL or ELT.

Figure 9 compares the performance of D\_ELT against ETL (or ELT) during data analysis according to the data size and analysis type. As mentioned above, the analysis time of D\_ELT includes *transform*, unlike ETL (or ELT), so D\_ELT is slower than ETL (or ELT), as shown in Table 3 in Section 4. However, the higher the complexity of the analysis (Count < GroupBy < SpatialJoin), the smaller the difference between the D\_ELT and ETL (or ELT) performance. The reason is that the higher the complexity of the analysis, the greater the effect of parallelizing the *transform* of D\_ELT, thereby enhancing the performance of D\_ELT.

**Figure 9.** Performance comparison of D\_ELT/ETL (or ELT) based on small, medium, and large data sizes for each of the three analyses: Count, GroupBy, and SpatialJoin.

Similarly, Figure 10 compares the performance of optimized D\_ELT against ETL (or ELT) during data analysis, according to data size and analysis type. Compared to Figure 9, for the two simple analyses, Count and GroupBy, optimized D\_ELT is faster than D\_ELT. This is because, for a simple analysis, large amounts of data are often unrelated to the analysis, so more data can be included in the optimization target, yielding an incremental performance gain for optimized D\_ELT.

**Figure 10.** Performance comparison of optimized D\_ELT/ETL (or ELT) based on small, medium, and large data sizes for each of the three analyses: Count, GroupBy, and SpatialJoin.

In both cases, the performance ratio of the analysis is very similar regardless of data size. Thus, we chose only the small data size to compare the performance ratios, as shown in Figure 11. The figure shows that the higher the complexity of the analysis, the smaller the performance difference between D\_ELT and optimized D\_ELT. This is because the more complex the analysis, the more data is involved in it, which reduces the scope of optimization. For SpatialJoin, which has the highest complexity among the three analyses, the two values in Figure 11 converge to almost 1.0, showing that there is almost no performance difference between D\_ELT and optimized D\_ELT.

**Figure 11.** Performance comparison of D\_ELT/ETL (or ELT) vs. optimized D\_ELT/ETL (or ELT) based on the small data size for each of the three analyses: Count, GroupBy, and SpatialJoin.

The overall performance of ETL, ELT, D\_ELT, and optimized D\_ELT is derived by summing the data preparation and analysis times. Table 5 shows that the overall performance of D\_ELT is much better than that of the ETL or ELT approaches. D\_ELT is up to 3 times faster than ETL and 2 times faster than ELT. Optimized D\_ELT is up to 4 times faster than ETL and 3 times faster than ELT. These results are derived from the two simple analyses, Count and GroupBy, but not SpatialJoin. For SpatialJoin, both D\_ELT and optimized D\_ELT still perform better than ETL or ELT, but there is almost no difference between the overall performance of D\_ELT and optimized D\_ELT. Figure 12 shows that as the complexity of the analysis increases, the gap between D\_ELT and optimized D\_ELT decreases.


**Table 5.** Overall performance of ETL, ELT, D\_ELT, optimized D\_ELT (in seconds), and performance comparison among ELT vs. ETL, D\_ELT vs. ETL, and optimized D\_ELT vs. ETL.

**Figure 12.** Overall performance comparison of ELT/ETL vs. D\_ELT/ETL vs. optimized D\_ELT/ETL based on small, medium, and large data sizes for each of the three analyses: Count, GroupBy, and SpatialJoin.

## *5.2. Discussion*

Two conventional methods exist—ETL and ELT. The traditional ETL method does not use distributed/parallel processing during data pre-processing, which causes problems especially when the volume of data to be pre-processed is large. The ELT method improves on traditional ETL by speeding up data pre-processing with distributed/parallel processing. Our proposed D\_ELT method further reduces the overhead of data pre-processing: the transformation task is piggybacked onto the analysis task, and both tasks are performed together within the same MapReduce job. This allows the analysis to be conducted immediately without storing transformation results and also excludes unnecessary transformations that are not utilized in the analysis.
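The piggybacking idea can be illustrated with a few lines of PySpark. This is only an analogy to the authors' Marmot/MapReduce implementation, not their code; the HDFS path and the `transform` function are hypothetical:

```python
# Illustrative PySpark sketch of the D_ELT idea: the transform step rides along
# with the analysis job, so raw records are parsed/transformed and analyzed in
# the same distributed pass, and no intermediate transformed data set is written.
from pyspark import SparkContext

def transform(raw_line):
    # Hypothetical transform: parse a raw CSV sensor record into (region, value),
    # producing only the fields the analysis needs (the optimized-D_ELT filtering idea).
    region, value = raw_line.split(",")[:2]
    return region, float(value)

sc = SparkContext(appName="d_elt_sketch")
raw = sc.textFile("hdfs:///sensors/raw")      # load step: raw data only

# Transform and analysis execute inside one job; nothing is materialized between them.
counts = raw.map(transform).countByKey()      # GroupBy-style analysis
print(dict(counts))
```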

Compared to existing methods, the D\_ELT method significantly reduces the data preparation time, but it is at a disadvantage in the following cases. First, when the same kind of analysis must be conducted repeatedly: for example, D\_ELT reduces the data preparation time by 1382 seconds (large data, Table 2) compared to the conventional ETL method, but adds 103 seconds (large data, SpatialJoin, Table 3) every time an analysis is conducted. Therefore, the more often the analysis is run, the less efficient D\_ELT becomes compared to the traditional methods; in the example above, D\_ELT becomes less efficient than the existing method once the same analysis is conducted more than about 14 times in succession. Second, when a large amount of the input data is invalid, a large amount of data would normally be removed by the transform; because D\_ELT piggybacks the transformation onto every analysis task, this invalid data is repeatedly read, resulting in unnecessary I/O and computation. Finally, the method proposed in this paper does not consider real-time applications, although it does allow the required analysis results to be obtained more quickly than with the conventional methods.
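Using the paper's own numbers, the break-even point can be checked directly:

$$n > \frac{1382\ \text{s}}{103\ \text{s per analysis}} \approx 13.4,$$

so the preparation-time saving is exhausted once the same SpatialJoin analysis is repeated roughly 14 times, matching the count quoted above.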

## **6. Conclusions**

This paper presented our proposed D\_ELT approach for efficiently transforming and analyzing large amounts of sensor big data, especially geospatial big data. Based on the experimental results, we made the following observations. First, D\_ELT outperforms ETL and ELT during data preparation. Second, D\_ELT shows some performance degradation during data analysis; however, the higher the complexity of the analysis, the smaller this degradation, resulting in better overall performance than ETL or ELT. Finally, for simple analyses, which enlarge the scope of optimization, optimized D\_ELT provides a further performance gain. In the future, we plan to further increase the overall performance of our developed system, including D\_ELT and Marmot, by investigating spatial indexes to better support spatial queries over geospatial big data.

**Author Contributions:** K.-W.L. designed and implemented the D\_ELT; K.-W.L. and J.J. conducted the testing of D\_ELT and analyzed the experiment results; J.J. wrote and K.-W.L. revised the manuscript.

**Funding:** This research was supported by the MOLIT (the Ministry of Land, Infrastructure and Transport), Korea, under the national spatial information research program supervised by the KAIA (Korea Agency for Infrastructure Technology Advancement) (19NSIP-B081011-06).

**Acknowledgments:** This research, 'Geospatial Big Data Management, Analysis and Service Platform Technology Development', was supported by the MOLIT (the Ministry of Land, Infrastructure and Transport), Korea, under the national spatial information research program supervised by the KAIA (Korea Agency for Infrastructure Technology Advancement) (19NSIP-B081011-06).

**Conflicts of Interest:** The authors declare no conflicts of interest.

## **References**


©2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **High-Performance Overlay Analysis of Massive Geographic Polygons That Considers Shape Complexity in a Cloud Environment**

### **Kang Zhao 1, Baoxuan Jin 2,\*, Hong Fan 1, Weiwei Song 3, Sunyu Zhou 3 and Yuanyi Jiang 3**


Received: 24 March 2019; Accepted: 24 June 2019; Published: 26 June 2019

**Abstract:** Overlay analysis is a common task in geographic computing that is widely used in geographic information systems, computer graphics, and computer science. With breakthroughs in Earth observation technologies, particularly the emergence of high-resolution satellite remote sensing, geographic data have grown explosively, and the overlay analysis of massive, complex geographic data has become a computationally intensive task. Distributed parallel processing in a cloud environment provides an efficient solution to this problem. The cloud computing paradigm represented by Spark has become the standard for massive data processing in industry and academia due to its large-scale and low-latency characteristics, and it has attracted further attention for solving the overlay analysis of massive data. Existing studies mainly focus on how to implement parallel overlay analysis in a cloud computing paradigm but pay less attention to the impact of the graphical complexity of spatial data on parallel computing efficiency, especially the data skew caused by differences in graphic complexity. Geographic polygons often have complex graphical structures, such as many vertices and composite structures including holes and islands. When the Spark paradigm is used for the overlay analysis of massive geographic polygons, its efficiency is closely related to factors such as data organization and algorithm design. Considering the influence of the shape complexity of polygons on the performance of overlay analysis, we design and implement a parallel processing algorithm based on the Spark paradigm in this paper. Based on an analysis of the shape complexity of polygons, the overlay analysis is accelerated via reasonable data partitioning, a distributed spatial index, minimum bounding rectangle filtering, and other optimizations, while high speed-up and parallel efficiency are maintained.

**Keywords:** overlay analysis; shape complexity; massive data; cloud; parallel computing

## **1. Introduction**

Overlay analysis is a common geographic computing operation and an important spatial analysis function of geographic information systems (GIS). It is widely used in applications related to spatial computing [1,2]. This operation involves the spatial overlay analysis of different data layers and their attributes in a target area. It connects multiple spatial objects from multiple data sets, creates a new clipped data set, and quantitatively analyzes the spatial range and characteristics of the interactions among different types of spatial objects [3]. The development of geospatial science has entered a new stage with the rapid popularization of the global Internet, sensor technologies, and Earth observation technologies. The transformation of spatial information services from digital Earth to intelligent Earth has posed challenges such as data-intensive, computationally intensive, and time–space-intensive workloads and highly concurrent access [4,5].


Overlay analysis now deals with massive data, for which traditional data processing algorithms and models are no longer suitable. For example, the number of land use classification patches in Yunnan Province investigated in this study is in the hundreds of thousands at the county level, in the millions at the city level, and in the tens of millions at the provincial level. With the development of the social economy and advances in data acquisition technologies, the number of land use classification patches will continue to increase. Effectively calculating land use change with traditional single-computer models is difficult.

The rise of parallel computing technologies such as network clustering, grid computing, and distributed processing in recent years has gradually shifted research on high-performance GIS spatial computing from the optimization of algorithms to the parallel transformation and parallel strategy design of GIS spatial computing in a cloud computing environment [6]. Recently, MapReduce and Spark technologies have been applied to the overlay analysis of massive spatial data, and some results have been achieved. Nevertheless, massive spatial data differs from general massive Internet data: the spatial characteristics of the data and the complexity of spatial analysis algorithms mean that simply copying a cloud computing programming paradigm cannot achieve high-performance geographic computing. Therefore, this study chooses the classical Hormann clipping algorithm [7] to analyze and measure the impact of the shape complexity of geographic polygons on parallel overlay analysis, and proposes a Hilbert partition method based on a shape complexity measure to resolve the data skew caused by differences in the shape complexity of polygons. In addition, by combining MBR (Minimum Bounding Rectangle) filtering, an R-tree spatial index, and other optimizations, an efficient parallel overlay analysis algorithm is designed. The experimental analysis shows that the proposed method reduces the number of polygon intersection operations, achieves better load balancing of computing tasks, and greatly improves the parallel efficiency of overlay analysis. As the number of computational cores increases, the algorithm achieves an increasing speed-up ratio, and the computational performance changes nonlinearly.

The rest of this paper is organized as follows. Section 2 reviews the research background and related studies, including those on shape complexity and overlay analysis algorithms. In Section 3, the Hormann algorithm is improved for a parallel polygon clipping process, and the process of a parallel polygon clipping algorithm is optimized according to the shape complexity of polygons. Section 4 describes the experimental process in detail and analyzes the experimental results. Section 5 provides the conclusion drawn from this research, followed by potential future work.

## **2. Relevant Work**

This paper discusses the related research work from two aspects: shape complexity and overlay analysis algorithm.

## *2.1. Shape Complexity*

Many studies use abstract language to describe the shapes and complex details of geometric objects, such as "the structure of polygons with multiple holes, the number of vertices is very large, and polygons with multiple concavities." To evaluate computational cost, the complexity and computational efficiency of geometric computing problems should be accurately measured [8]. The concept of complexity has accordingly been introduced [9–12]. Many applications related to spatial computing depend heavily on algorithms that solve geometric problems. When dealing with large-scale geographic computing problems, the evaluation of computational cost must consider the quantity of input data, the complexity of the graphical objects, and the time complexity of the computing models [10]. When the amount of input data and the algorithm are fixed, the complexity of different graphical objects frequently leads to considerable differences in computing efficiency.

Mandelbrot described the complexity of geometric objects from the perspective of the fractal dimension [13]; the most commonly used method is the box-counting technique [14,15]. Brinkhoff quantitatively described the complexity of polygons from three aspects, namely, the frequency of local vibration, the amplitude of local vibration, and the deviation from the convex hull, to characterize global shape complexity [16]. On the basis of Brinkhoff's research, Bryson proposed a conceptual framework for query-processing-oriented shape complexity measures for spatial objects [17]. Rossignac [8] analyzed shape complexity in terms of algebraic, topological, morphological, combinatorial, and representation complexity, and reduced shape complexity by using a triangular boundary representation at different scales [8]. Ying optimized graphic data transmission on the basis of shape complexity [18].

From the above discussion, we know that graphical complexity has different meanings and measurement methods in different professional fields, such as design complexity and visual complexity. Therefore, we should consider shape complexity from the perspective of geographic computing. Shape complexity directly affects the efficiency of spatial analysis and spatial query computation; for example, the number of vertices and the local shapes (such as concavity) of graphics considerably influence the efficiency of spatial geometry calculations. These values are important indicators for evaluating calculation cost. Fully considering the influence of graphical complexity on specific geographic computations can effectively optimize the computing efficiency of applications.

## *2.2. Overlay Analysis*

The study of vector overlay analysis algorithms originates in the field of computer graphics; for example, two groups of thousands of overlapping polygons are often clipped in 2D and 3D graphics rendering. Different overlay analysis algorithms have been produced over time, among which the Sutherland–Hodgman [19], Vatti [20], and Greiner–Hormann [7] algorithms are the most representative for arbitrary polygon clipping. The Sutherland–Hodgman algorithm is unsuitable for complex polygons. The Weiler–Atherton algorithm requires candidate polygons to be arranged clockwise and to have no self-intersections. The Vatti algorithm does not restrict the types of clipping, so self-intersecting and porous polygons can also be processed. The Hormann algorithm clips polygons by judging the entry and exit of directional lines; it also addresses point degeneration by moving points small distances [21] and can deal with self-intersecting and nonconvex polygons. The Weiler algorithm uses tree data structures, whereas the Vatti and Greiner–Hormann algorithms adopt a bilinear linked-list data structure. Therefore, the Vatti and Greiner–Hormann algorithms are better than the Weiler algorithm in terms of complexity and running speed.

Subsequent researchers have made many improvements to the aforementioned traditional vector clipping algorithms [22–24], simplifying the calculation of vector polygon clipping. However, these studies optimize serial algorithms. When overlay analysis is applied to geographic computing, it must deal with more complex polygons (such as polygons with holes and islands) and larger data volumes (the number of land use classification patches in a province can reach tens or even hundreds of millions, and a polygon may have tens of thousands of vertices). A vector clipping algorithm that applies efficiently in computer graphics cannot necessarily be applied efficiently in geographic computing, and many traditional geographic element clipping algorithms also exhibit poor suitability and degraded performance. With the development of computer technology and the increase in spatial data volume, traditional vector clipping algorithms frequently encounter efficiency bottlenecks when dealing with large and complex geographic data sets. Therefore, improving the overlay algorithm and using a parallel computing platform for the overlay analysis of massive data is a new research direction.

With the rapid development of MapReduce and Spark cloud computing technologies, the use of large-scale distributed storage and parallel computing for massive data processing and analysis has become an effective technical approach [6,25]. Recent studies have applied MapReduce and Spark technology to the overlay analysis of massive spatial data. Wang [26] used MapReduce with grid partitioning and indexing to improve the efficiency of overlay analysis by about 10 times.

Zheng [27] built a multilevel grid index structure by combining a first-level grid with quartering on the Spark distributed computation platform. Zheng's experiments show that a grid index algorithm achieves good results when polygons are uniformly distributed; otherwise, the efficiency of the algorithm is low. Xiao [28] proved that parallel task partitioning based on the spatial locations of polygons achieves better load balancing than random task partitioning. In addition, SpatialHadoop [29–32] and GeoSpark [32–35] extend Hadoop and Spark to better support massive spatial data computing. SpatialHadoop designed a spatial object storage model that provides HDFS with grid, R-tree, Hilbert-curve, Z-curve and other indexes, and it also provides a filtering function for excluding data that need not be processed. GeoSpark likewise adds a set of spatial object models and extends the RDD (Resilient Distributed Dataset) to the SRDD (Spatial Resilient Distributed Dataset), which supports spatial object storage, and it also provides filtering functions. The design ideas of SpatialHadoop and GeoSpark have great reference value for the research in this paper.

In summary, using the Spark cloud computing paradigm is a cheap and high-performance way to develop high-performance geographic computing, and it is one of the research hotspots in this field. Recent studies have implemented overlay analysis in Spark, which greatly improves its efficiency. Optimization strategies tailored to spatial data characteristics, such as reasonable data partitioning and an excellent spatial data index, play an important role in improving the efficiency of parallel computing, and Hilbert partitioning is more suitable than grid partitioning for the parallel overlay analysis of non-uniformly distributed spatial data. However, current parallel overlay analysis relies on third-party clipping interfaces and ignores the impact of the shape complexity of geographic polygons on the clipping algorithm, which causes serious data skew.

## **3. Methodology**

In this section, the core idea of the paper will be introduced. First, an excellent basic overlay analysis algorithm is selected for execution on each computing node. Then, according to the complexity and location of graphics, the polygons are divided reasonably, and an index based on spatial location is established to realize fast data access and load balancing of parallel computing nodes.

### *3.1. Basic Overlay Analysis Algorithm Running on Each Computing Node*

## 3.1.1. Hormann Algorithm and Improvement of the Intersection Degeneration Problem

The basic overlay analysis algorithm, which is the basic processing program on each parallel computing node, performs overlay analysis on two sets of polygons. In designing the overlay analysis algorithm, we use the Hormann algorithm, which can deal with complex structures (e.g., self-intersection and polygons with holes), as a reference. However, the point-coordinate perturbation used by the Hormann algorithm is not the best solution to the degeneration problem, because it introduces cumulative errors into the area statistics of massive patches. Therefore, we resolve intersection degeneration by judging the azimuth intervals of the lines associated with an intersection, and we use this improved algorithm as the basis of the parallel overlay analysis.

To simplify the description of the overlay analysis of two groups of polygons, we assume that each group contains only one polygon object, because the overlay analysis of two layers with multiple polygons only increases the number of iterations. The processing steps of the improved Hormann algorithm are as follows:


As shown in Figure 1a, the clipped polygon *P*1 and the target polygon *P*2 intersect. Intersection points *K*1 and *K*2 can be obtained through a collinearity equation. By judging the sign of the cross product of the vector line segments, each intersection point can be classified as an entry or an exit. As illustrated in the same figure, $\overrightarrow{A_1A_2} \times \overrightarrow{B_1B_2} > 0$, and thus *K*1 is the entry point relative to *P*2. Moreover, $\overrightarrow{A_2A_3} \times \overrightarrow{B_1B_2} < 0$, and thus *K*2 is the exit point. The resulting polygon is composed of the vertex sequence *K*1, *A*2, *K*2. As illustrated in Figure 1b–d, the entry and exit points alone are unsuitable for describing intersection degradation.

**Figure 1.** Polygon overlay.
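A minimal sketch of the entry/exit sign test described above. The orientation convention (positive cross product means entry) and the coordinates are illustrative assumptions; the degenerate case z = 0 is exactly the intersection-degeneration problem handled next:

```python
# The sign of the 2D cross product of the clipped-polygon edge direction and
# the target-polygon edge direction classifies an intersection as entry or exit.
def cross(ax, ay, bx, by):
    """z-component of the 2D cross product a x b."""
    return ax * by - ay * bx

def classify_intersection(a1, a2, b1, b2):
    """a1->a2 is the clipped-polygon edge, b1->b2 the target-polygon edge."""
    z = cross(a2[0] - a1[0], a2[1] - a1[1], b2[0] - b1[0], b2[1] - b1[1])
    return "entry" if z > 0 else "exit"   # z == 0 is the degenerate case

# Illustrative edges: the cross product is positive, so this intersection is an entry.
print(classify_intersection((0, 0), (2, 1), (1, -1), (1, 2)))  # -> entry
```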

Figure 2 shows how to deal with the phenomenon of intersection degeneration. In Figure 2, the dotted arrow *N* points toward the north, which is the starting direction for the azimuth calculation; therefore, each line segment has its own azimuth. The clipped and target polygons have an intersection point *K*, which is also the location of vertices *Am* and *Bm*. The azimuth intervals of the clipped and target polygons at intersection *K* are *AC*(α1, α2) and *AT*(α1, α2). If *AC*(α1, α2) and *AT*(α1, α2) have overlapping parts (yellow in the figure), then the two polygons overlap near the intersection point, which should be added to the vertex sequence of the resultant polygon.

**Figure 2.** Diagram of azimuth interval calculation.

3.1.2. Effect of Shape Complexity on Parallel Clipping Efficiency

In parallel clipping computing, each computing node is usually assigned the same number of polygons. Generally speaking, it is difficult to ensure that complex polygons are evenly allocated to each computing node; usually, one computing node is allocated more of the complex polygons. Although the total number of polygons allocated to each computing node is then the same, that node needs a long time to complete its computing task while the other computing nodes wait. Therefore, ignoring the complexity differences of polygons results in the computing nodes failing to finish their tasks at the same time, which reduces the efficiency of parallel computing.

Complexity is an intuitive linguistic concept. Generally speaking, different professional fields pay attention to different aspects of shape complexity. In the field of geographic computation, shape complexity is tied to specific geographic algorithms: the same shape may have different shape complexities under different algorithms.

On the other hand, for geographic polygons specifically, different polygons have different morphological characteristics, such as convexity, concavity, self-intersection, and large numbers of vertices. To measure complexity, this information must be compressed into one or more comparable parameters and expression models. Although their starting points and locations may be completely different, similar shapes can still appear. Therefore, when discussing the shape complexity of a polygon, we can neglect spatial position and scale and focus on the influence of polygon features on geographic computing.

Shape complexity can be defined from the perspective of geographic computing:

**Definition 1.** *Shape complexity is a measure of the computational intensity index of shapes participating in the calculation of geographic algorithms. Shape complexity can be measured by the number of repetitions of basic operations in a geographic algorithm caused by a shape.*

*As for the overlay analysis of polygons, the most basic operation of the Hormann algorithm is to find the intersection point of two sides. Therefore, for the Hormann algorithm, the complexity of a polygon is the number of edges it possesses.*

*Based on these analyses, we know that shape complexity is an absolute value, which is difficult to use directly in a program. Therefore, it is necessary to obtain a relative value by normalization to measure graphic complexity.*

**Definition 2.** *Given a set of polygons P* = {*P*1, *P*2, ··· , *Pn*}*, let Vi be the number of vertices of polygon Pi, let Vmin be the minimum number of vertices over all polygons, and let Vmax be the maximum number of vertices over all polygons. Then, the complexity Wi of the polygon Pi can be expressed as*

$$W_i = \frac{V_i - V_{\min}}{V_{\max} - V_{\min}} \tag{1}$$

*Since a polygon is usually represented by a sequence of vertices in a polygon storage model, and the number of edges of a polygon equals its number of vertices, we use the vertices of a polygon instead of its edges in Definition 2.*

*Therefore, in parallel overlay analysis, we can take shape complexity as an indicator for data partitioning. The ideal state is that the polygon complexity of each data partition is the same, in which case all computing nodes complete their computing tasks at the same time.*
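The normalization in Definition 2 is straightforward to compute. The following minimal Python sketch is not the authors' implementation; the guard against identical vertex counts is our addition, and the example values simply reuse the extreme vertex counts reported later in Section 4:

```python
# Min-max normalization of vertex counts as the relative shape complexity W_i
# of each polygon (vertex count stands in for edge count, as noted above).
def shape_complexity(vertex_counts):
    v_min, v_max = min(vertex_counts), max(vertex_counts)
    span = (v_max - v_min) or 1        # guard: all polygons identical in size
    return [(v - v_min) / span for v in vertex_counts]

print(shape_complexity([4, 120, 99500]))  # -> [0.0, ~0.00117, 1.0]
```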

### *3.2. Data Balancing and Partitioning Method that Considers Polygon Shape Complexity*

## 3.2.1. Data Partitioning and Loading Strategy

Data partitioning is the key to accelerating a polygon clipping algorithm on a high-performance computing platform. A complete data set is divided into relatively small, independent blocks, which provide the basis for distributed or parallel data operations. Spatial data partitioning differs from general data partitioning: in addition to balancing the amount of data, spatial relationships, such as the spatial aggregation and proximity of the data, should also be considered. Commonly used spatial data partitioning methods are meshing and space-filling-curve partitioning [36]. Meshing is simple and considers the spatial proximity of data, but it cannot guarantee a balanced amount of data. The Hilbert curve is a classical space-filling curve with good spatial clustering characteristics that considers both spatial relationships and data load. Therefore, the data partitioning strategy in this study adopts the Hilbert filling curve algorithm combined with shape complexity to achieve load balancing.

In Figure 3, Hilbert partitioning divides the spatial region into 2<sup>*N*</sup> × 2<sup>*N*</sup> grid cells, where *N* is the order of the Hilbert curve, i.e., the number of iterations. In general, *N* is determined by the number of spatial objects *n*, which must satisfy *n* < 2<sup>2×*N*</sup>.

**Figure 3.** Hilbert partitioning and Hilbert curve generation.

Hilbert partitioning consists of the following four steps:


In actual partitioning, polygons are not ideally uniformly distributed and their shapes differ. If each grid cell were strictly required to contain only one polygon central point, then the Hilbert order *N* might be extremely large, the edge length of the grid would be too small, and many grid cells would contain no polygonal MBR center. Hilbert partitioning and Hilbert curve generation would then consume considerable computing time, and the subsequent overlay calculation would involve many cross-partition problems. Therefore, allowing multiple polygonal MBR centers in a grid cell is necessary.

The order *N* of the Hilbert grid is related to the grid edge length: fixing one determines the other. To obtain a reasonable order *N* of the Hilbert curve, we can calculate the normal distribution of the central-point positions of the polygon MBRs, determine the optimal grid edge length, and eventually balance the order *N* of the Hilbert curve against the number of polygon MBR central points in each grid cell. The key to dividing the Hilbert coding set PHid of the polygons is to ensure the load balance of each partition. Considering that polygon complexity may vary considerably, we cannot simply divide the Hilbert coding set PHid of the polygons into equal-sized parts.
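For the Hilbert coding step, the grid cell containing each polygon's MBR center is mapped to its position along the curve. The sketch below is the classic textbook xy-to-d conversion for a curve of a given order, not code from the paper:

```python
def hilbert_index(order, x, y):
    """Map cell (x, y) of a 2**order x 2**order grid to its distance d along
    the Hilbert curve (the classic xy-to-d conversion)."""
    side = 2 ** order
    d = 0
    s = side // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                          # rotate the quadrant
            if rx == 1:
                x, y = side - 1 - x, side - 1 - y
            x, y = y, x
        s //= 2
    return d

# Cells that are close along the curve are close in space, so sorting polygon
# MBR centers by hilbert_index preserves spatial proximity within partitions.
print([hilbert_index(1, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])
# -> [0, 1, 2, 3]
```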

Let the shape complexity of polygon *Pi* be *Wi*, let $\overline{W}$ denote the average complexity of all polygons, let *Wideal* denote the ideal complexity of each partition, and let *Wactual* denote the actual complexity of a partition. With *M* partitions in total, we have

$$W_{ideal} = \frac{\sum_{i=1}^{n} W_i}{M} \tag{2}$$

If the polygons from *j* to *k* are placed into the same partition, then

$$W_{actual} = \sum_{i=j}^{k} W_i \tag{3}$$

$$\left|W_{ideal} - W_{actual}\right| < \overline{W} \tag{4}$$

Generally, the number of polygons in each partition differs slightly after partitioning, but the total complexity of the polygons in each partition is essentially the same. Therefore, this strategy guarantees the load balancing of computing tasks.
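A minimal sketch of this partitioning rule, under the assumption that partition boundaries are chosen greedily while walking the Hilbert order (the paper does not spell out the exact boundary-selection procedure):

```python
# Walk the polygons in Hilbert-code order and close a partition once its
# accumulated complexity reaches the ideal per-partition complexity W_ideal,
# keeping |W_ideal - W_actual| small per Equations (2)-(4).
def balanced_partitions(polygons, m):
    """polygons: list of (hilbert_code, complexity W_i); m: partition count."""
    polygons = sorted(polygons)                    # Hilbert order
    w_ideal = sum(w for _, w in polygons) / m
    parts, current, acc = [], [], 0.0
    for code, w in polygons:
        current.append(code)
        acc += w
        if acc >= w_ideal and len(parts) < m - 1:  # close this partition
            parts.append(current)
            current, acc = [], 0.0
    parts.append(current)                          # remainder -> last partition
    return parts

print(balanced_partitions([(0, 0.1), (1, 0.9), (2, 0.2), (3, 0.8)], 2))
# -> [[0, 1], [2, 3]]  (each partition carries ~1.0 total complexity)
```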

## 3.2.2. R-tree Index Construction

The R-tree is a widely adopted spatial data index method; it is used in commercial software such as Oracle and SQL Server [37]. To improve the efficiency of spatial data access, an R-tree must be built. Data are segmented according to the Hilbert data partitioning points, and the grid area of the Hilbert curve before each partitioning point is defined as a sub-index area. The R-tree index for the spatial objects is then established in each sub-index area. Similarly, the mapping relationships among grid codes, polygon MBR central-point codes, and sub-index area codes are established, and the corresponding index codes are cached on the computing nodes. In this experiment, we directly use the STRtree (Sort-Tile-Recursive tree) class of the JTS (Java Topology Suite, a Java software library) to construct the R-tree index.
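For readers without a Java/JTS environment, the same STR bulk-loading idea is available in Python through Shapely's port of the JTS STRtree. The fragment below is an analogous sketch, assuming Shapely ≥ 2.0 semantics, where `query()` returns candidate indices:

```python
# Bulk-load an STR-packed R-tree over the target polygons of one partition and
# query it with a clipped polygon to get MBR-level candidates.
from shapely.geometry import box
from shapely.strtree import STRtree

target_polygons = [box(0, 0, 1, 1), box(2, 2, 3, 3), box(0.5, 0.5, 1.5, 1.5)]
tree = STRtree(target_polygons)            # Sort-Tile-Recursive bulk load

clip = box(0, 0, 1.2, 1.2)
candidate_idx = tree.query(clip)           # bounding-box candidates (indices)
candidates = [target_polygons[i] for i in candidate_idx]
print(len(candidates))                     # -> 2
```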

## *3.3. Process Design of Distributed Parallel Overlay Analysis*

To ensure that the process is suitable for decoupling, we divide the distributed parallel overlay analysis process into six steps based on the characteristics of the algorithm: Data preprocessing, preliminary filtering, Hilbert coding, data partitioning and index building, data filtering, and overlay calculation (Figure 4).

**Figure 4.** Parallel overlay computing flow.

## (1) Data preprocessing

Many steps of the whole process traverse all polygons and their vertices. To reduce the number of traversals, we perform centralized processing in a single traversal—calculating the MBR of each polygon, its geometric center point, the area of the MBR, and the shape complexity of the polygon—to prepare data for optimizing the processing flow. In subsequent calculations, this information can be read directly, avoiding repeated computation. Considering the massive amount of data, the Spark paradigm can be used for parallel preprocessing: data are divided by default according to the number of physical nodes *N*, and each computing node traverses its allocated data.

## (2) Preliminary filtering

Only the polygons in the area where the MBRs of the two polygonal layers intersect need to be clipped. Therefore, filtering out the polygons that do not require clipping reduces the computational cost.

## (3) Hilbert coding and computation of data partition points based on polygon clipping complexity

All polygons are divided by Hilbert grids in accordance with the spatial distribution position, and each grid and polygon are Hilbert coded. Then, the data partition points are calculated on the basis of shape complexity. This work is not suitable for decoupling and, therefore, cannot be executed in parallel.

## (4) Data partitioning and indexing

Based on the partitioning points, the region of the Hilbert curve is regarded as a sub-index region. The STR-tree class of the JTS library is used to establish an R-tree index for each partition, and the index file is stored in each computing node.

## (5) Target polygon filtering

In overlay analysis, every point of the clipped and target polygons is traversed. Even if the two polygons do not overlap, all points will be traversed, resulting in invalid calculations. Filtering out the invalid target polygons can obviously improve efficiency. Before the overlay calculation, target polygons that need no overlay analysis can be effectively eliminated by checking whether an overlay relationship exists between the MBR of the clipped polygon and the MBR of the target polygon. This check directly compares the maximum and minimum coordinates of the clipped and target polygons without invoking the overlay algorithm.
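This MBR test reduces to a handful of coordinate comparisons. A minimal sketch (the coordinates are illustrative only):

```python
# Two axis-aligned rectangles overlap iff neither lies entirely to one side of
# the other, so a direct comparison of min/max coordinates suffices.
def mbr_overlaps(a, b):
    """a, b: (min_x, min_y, max_x, max_y) of the clipped / target polygon MBRs."""
    return not (a[2] < b[0] or b[2] < a[0] or   # disjoint in x
                a[3] < b[1] or b[3] < a[1])     # disjoint in y

print(mbr_overlaps((0, 0, 2, 2), (1, 1, 3, 3)))  # -> True
print(mbr_overlaps((0, 0, 1, 1), (2, 2, 3, 3)))  # -> False
```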

## (6) Overlay analysis

All computing nodes use the Hormann algorithm described in Section 3.1 for parallel overlay computation. The results of each calculation node are reduced to obtain the final overlay analysis results.

## *3.4. Algorithmic Analysis*

The major processes of the parallel overlay analysis conducted in this study include data preprocessing, preliminary filtering, Hilbert partitioning, R-tree index establishment, polygon MBR filtering, and polygon clipping.

In the data preprocessing, only the layer attribute data and the vertex coordinate information of the polygons are included in the original data. The polygon MBRs and the numbers of polygon vertices are each used three times in the calculation process designed in this research. Therefore, we unified the data preprocessing, established a new data structure, and avoided repetitive calculation. Unified data preprocessing saves about half of the workload compared with separate preprocessing.

In the preliminary filtering, the time complexity of the MBR filtering algorithm is O(1), whereas the complexity of the overlay analysis algorithm is O(log *N*), where *N* is the number of polygon vertices. Therefore, the computational cost is considerably reduced by using MBRs to filter out polygons that need no overlay analysis. The size of this reduction depends on the spatial distribution of the polygons, which is an uncontrollable factor.

The time complexity of constructing the Hilbert curve is O(*N*<sup>2</sup>), where *N* is the order of the Hilbert curve. The larger *N* is, the longer data partitioning takes; however, if *N* is too small, multiple polygons will share the same Hilbert code. If Hilbert partitioning strictly satisfies the condition that each grid cell contains only one polygon MBR central point, a Hilbert grid supports a maximum of 2<sup>2×*N*</sup> polygons. Moreover, polygons in real data are generally not uniformly distributed, and many Hilbert grid cells contain no polygons. Therefore, allowing an appropriate number of repeated Hilbert code values is feasible. In addition, compared with grid partitioning, Hilbert partitioning handles unevenly distributed data well.

The R-tree is a typical spatial data index method; establishing an R-tree index considerably shortens data traversal, with a query time complexity of O(log *N*).

Data preprocessing, MBR filtering, R-tree index construction, and other processes are relatively time-consuming. By using multi-node parallel computing with data partitioning, the time consumed can be reduced to 1/*N*, where *N* is the number of parallel processes. In the Spark paradigm, data operations are performed in memory, so I/O operations consume minimal time. Therefore, the proposed overlay algorithm exhibits high efficiency.

## **4. Experimental Study**

## *4.1. Experimental Design*

To conduct the overlay analysis experiments, we used the land use type patches and the patches with a slope greater than 25 degrees in a county of Yunnan Province in 2018. There are 500,000 land use type patches and 110,000 slope patches, distributed over an area of 15,000 square kilometers. Based on these data, we constructed different data sets for the experiments. We execute different overlay analysis modes on data of different magnitudes, record the change in execution time, and analyze the characteristics and applicability of the different overlay analysis modes.

## 4.1.1. Computing Equipment

Experiments were carried out using one portable computer and six X86 servers. The equipment configuration information is shown in Table 1.



## 4.1.2. Experimental Data

## (1) Clipping layer

The 30 m digital elevation model (DEM) data of the county were obtained from the Internet, and a slope map was generated from them (Figure 5). The area with a slope greater than 25 degrees was extracted, yielding 108,025 patches.

**Figure 5.** Extracting slope as the clipping layer from the digital elevation model (DEM).

## (2) Target layer

A total of 10 groups of experimental data were obtained from 500,000 original land-type patches of the county using sparse and intensive data sets. The number of patches was 50,000, 100,000, 250,000, 500,000, 1 million, 2 million, 4 million, 6 million, 8 million and 10 million, respectively.

Through data checking, 88,000,000 vertices were found in the 500,000 original land-type patches. Among these polygons, the simplest has four vertices, whereas the most complex has 99,500 vertices. A total of 890,000 vertices were recorded in the 110,000 slope patches; among them, the simplest has 8 vertices, whereas the most complex has 5572 vertices. The statistics of the numbers of polygon vertices in the clipping and target layers are illustrated in Figures 6 and 7, respectively.

**Figure 6.** Distribution of polygons with different numbers of vertices.

**Figure 7.** Distribution of polygons with different numbers of vertices.

For parallel computing, the data are organized in GeoJSON format and uploaded to HDFS, with three replicas of each data block and a block size of 64 MB.

## 4.1.3. Experimental Scene

Before describing the experimental scenario design, we first define several different modes for comparison and explain the differences of each mode (Table 2).



Among them, the ArcMap mode is a typical method used in geographic data processing. The purpose of comparing the Spark\_original, Spark\_improved and Spark\_NoComplexity modes is to determine how much the performance has been improved. The purpose of comparing the Spark\_MBR, Spark\_MBR\_Hilbert and Spark\_MBR\_Hilbert\_R-tree modes is to determine how much the three improvements raise the efficiency of parallel overlay analysis.

(1) Scene 1: Compare the performance differences of four modes: ArcMap, Spark\_original, Spark\_improved and Spark\_NoComplexity.

Ten groups of polygons with different polygon counts were used for overlay analysis in the four modes. We recorded the completion time of the overlay analysis process and drew time-consumption curves. This experimental scenario can answer the following questions:


Ten groups of polygons with different polygon counts were used for overlay analysis in the three modes. We recorded the completion time of the overlay analysis process and drew time-consumption curves.

In addition to considering the influence of the shape complexity differences of geographic polygons, three important improvements are used in our algorithm flow: (1) MBR filtering, (2) Hilbert partitioning, and (3) R-tree construction. This experimental scenario answers the question: how much do these three improvements affect the efficiency of parallel computing?

(3) Scene 3: Cluster acceleration performance testing of the proposed algorithm.

The experimental data are fixed at 10 million geographic polygons. One to six servers are used to perform the overlay analysis, and the time-consumption changes of the proposed algorithm are recorded. In this experimental scenario, we can observe the acceleration ratio and parallel efficiency of the proposed algorithm on the Spark cluster.

## *4.2. Test Process and Results*

4.2.1. Compare the Performance Differences of Four Modes: ArcMap, Spark\_original, Spark\_NoComplexity and Spark\_improved

The parallel computing modes use six computing nodes, and the flow is calculated before and after optimization. The experimental data comprise the 50,000, 100,000, 250,000, 500,000, 1 million, 2 million, 4 million, 6 million, 8 million, and 10 million record data sets. The time-consumption statistics of the different computing modes are illustrated in Figure 8.

As shown in Figure 8, blue, red, yellow and grey represent the time consumed by the ArcMap, Spark\_original, Spark\_NoComplexity and Spark\_improved modes, respectively. As the data volume increases, all four time-consumption curves show an upward trend. The changes in these curves answer three questions related to the design of the experimental scenario.

(1) When the number of polygons is less than 10 million, the efficiency of the Spark\_original mode is even lower than that of the ArcMap mode. When the number of polygons is more than 50,000, the time consumption of the Spark\_improved mode is less than that of the ArcMap mode. When the number of polygons exceeds 1 million, the ArcMap mode consumes twice as much time as the Spark\_improved mode. As the amount of data increases, the time consumption of the ArcMap mode increases dramatically, while the time-consumption curve of the Spark\_improved mode remains relatively flat.


**Figure 8.** Time consumption statistical graphs of different computing modes.

**Figure 9.** Time consumption comparison of different optimization strategies.

4.2.2. Compare the Performance Differences of Four Modes: Spark\_original, Spark\_MBR, Spark\_MBR\_Hilbert and Spark\_MBR\_Hilbert\_R-tree

As shown in Figure 9:


## 4.2.3. Cluster Acceleration Performance Testing of the Proposed Algorithm

The experimental data are fixed at 10 million polygons, and one server is added at a time. As the number of servers increases, the time consumed in parallel computing decreases considerably (Figure 10). However, the rate of decrease in running time declines as the number of nodes increases.

**Figure 10.** Average running time for different numbers of nodes.

As shown in Figure 11, as the number of servers increases, the acceleration ratio decreases slightly but remains nearly linear. In addition, Figure 12 illustrates that with the increase in the number of servers, the parallel efficiency gradually decreases and finally stabilizes at more than 90%. This is a good result, considering that increasing the number of servers increases system synchronization and network overhead.

**Figure 11.** Acceleration ratio of different numbers of nodes.

**Figure 12.** Parallel efficiency of different numbers of nodes.

## *4.3. Analysis of Experimental Results*

Figure 8 shows that a single computer running ArcMap achieves high efficiency in the overlay analysis of small data volumes by adopting a reasonable algorithm and excellent multithreading; the SSD also plays an important role. However, the performance of ArcMap declines sharply when the number of data records reaches the millions. Spark distributed parallel computing can effectively solve such problems, but simply transplanting the overlay analysis algorithm into the Spark framework is not a reasonable solution. Actual geographic data are often unevenly distributed, and the complexity of polygon graphics varies greatly, which leads to serious data skew and severely affects the performance of parallel computing. When the data volume reaches tens of millions, the performance of our algorithm improves by more than 10 times via Hilbert partitioning based on polygon graphic complexity and the R-tree index. In addition, the performance advantage becomes more evident as the data volume increases.

When the amount of data is constant, the time consumption of parallel overlay analysis decreases as the number of servers increases, although the rate of decrease declines with more nodes. Figure 11 shows that the acceleration ratio is nearly linear. Figure 12 illustrates that parallel efficiency is still over 90% and remains stable when the number of servers increases to six, which means that high computing efficiency can be maintained as the computing cluster expands. Therefore, adding physical nodes is an effective method for massive data overlay analysis.

In addition, the proposed overlay analysis algorithm still has aspects to improve: (1) Large polygons span multiple data partitions, which leads to the repeated participation of such polygons in overlay analysis on multiple servers. (2) In the current algorithm, the R-tree index is created temporarily, so the index is rebuilt for each overlay analysis.

## **5. Conclusions**

In high-performance parallel overlay analysis, differences in the shape complexity of polygons can lead to serious data skew. In this paper, we measure the shape complexity of polygons from the perspective of geographic computing and design a high-performance parallel overlay analysis algorithm that considers this complexity. The analysis of the algorithm shows that it reduces invalid overlay calculations through MBR filtering, achieves load balancing through Hilbert partitioning based on polygon shape complexity, and improves data access speed using the R-tree index. Experiments show that this is a high-performance method that maintains high speed-up and parallel efficiency as the computing cluster expands.

In future studies, we will investigate the impact of the spatial distribution of graphics and of spatial data storage and indexing methods on the efficiency of overlay analysis. We will also optimize spatial index storage through distributed in-memory database technology to further improve the efficiency of parallel overlay analysis.

**Author Contributions:** Kang Zhao proposed the research ideas and technical lines. Baoxuan Jin and Hong Fan provided research guidance. Weiwei Song shared valuable opinions. Sunyu Zhou and Yuanyi Jiang helped complete the programming. Kang Zhao completed the thesis.

**Funding:** This research was funded by the Natural Science Foundation of China, grant number 41661086.

**Acknowledgments:** The authors are grateful to Fan Yang and Liying Li for providing their experimental data and to Lifeng Hou for giving valuable suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


©2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Parallel Cellular Automata Markov Model for Land Use Change Prediction over MapReduce Framework**

**Junfeng Kang 1,2, Lei Fang 3,\*, Shuang Li 1 and Xiangrong Wang 3**


Received: 12 September 2019; Accepted: 11 October 2019; Published: 13 October 2019

**Abstract:** The Cellular Automata Markov model combines the cellular automata (CA) model's ability to simulate the spatial variation of complex systems with the long-term predictive capability of the Markov model. In this research, we designed a parallel CA-Markov model based on the MapReduce framework. The model was divided into two main parts: a parallel Markov model based on MapReduce (Cloud-Markov), and a comprehensive evaluation method for land-use changes based on cellular automata and MapReduce (Cloud-CELUC). Choosing Hangzhou as the study area and using Landsat remote-sensing images from 2006 and 2013 as the experimental data, we conducted three experiments to evaluate the parallel CA-Markov model in a Hadoop environment. Efficiency evaluations compared Cloud-Markov and Cloud-CELUC over different data volumes. The results showed that the acceleration ratios of Cloud-Markov and Cloud-CELUC were 3.43 and 1.86, respectively, compared with their serial algorithms. The validity of the prediction algorithm was tested by using the parallel CA-Markov model to simulate land-use changes in Hangzhou in 2013 and analyzing the relationship between the simulation results and the interpretation results of the remote-sensing images. The Kappa coefficients of construction land, natural-reserve land, and agricultural land were 0.86, 0.68, and 0.66, respectively, which demonstrates the validity of the parallel model. Hangzhou land-use changes in 2020 were then predicted and analyzed. The results show that the central area of construction land is rapidly increasing due to a developed transportation system and is mainly converted from agricultural land.

**Keywords:** CA Markov; land-use change prediction; Hadoop; MapReduce; cloud computing

## **1. Introduction**

Studying land use/land cover changes at different times and places and predicting land-use structures and spatial layouts can provide scientific support for the utilization of regional land resources, the protection of regional ecological environments, and sustainable social and economic development [1,2].

Many researchers have proposed their own land-use-change simulation and prediction models, such as CLUE [3], CLUE-S [4], cellular automata (CA) [2,5,6], Markov chain [7,8], SLEUTH [9,10], and the spatial logistic model [11]. Since the early 1980s when Wolfram first proposed the CA model [12], many studies have been conducted to use the CA model to simulate urban land-use changes [2,5,6,13] and researchers have integrated other methods or models, such as neural networks [14], support vector machines (SVM) [15], ant-colony optimization [16], and the Markov chain [17], into the CA model to simulate and monitor land-use change.


The CA-Markov model was one of the most widely used extended CA models and was used in the prediction and simulation of land-use changes in many countries, such as the United States [18], Brazil [19], Portugal [20], Egypt [21], Ethiopia [22], Bangladesh [23], Malaysia [24], and China [25]. It has also been applied to the research of the evolution of urban-settlement patterns [26], the process of spatial dynamic vegetation changes [27], and land transfer across metropolitan areas [28].

As land-use change simulation and prediction involve tremendous amounts of data and calculation, some studies in recent years have designed parallel CA algorithms using Central Processing Unit (CPU) parallel computing [29], the Message Passing Interface (MPI) [30], Graphics Processing Unit (GPU) parallelism [31], and GPU/CPU hybrid parallelism [32] to simulate urban growth. However, the parallel CA method cannot deal with the connections between partitions after a research area is divided into several pieces, resulting in different final prediction results, whereas the traditional Markov method maintains the integrity of the entire study area but lacks the spatial relationships of the land cells. On the other hand, with the development of Big Data technologies and applications, MapReduce is a promising method for improving the running efficiency of traditional serial algorithms and has been applied and proven effective in many cases. Rathore et al. [33] proposed a real-time remote-sensing image processing and analysis application. Raojun et al. [34] proposed a parallel link-prediction algorithm based on MapReduce. Wiley K. et al. [35] analyzed astronomical graphics based on MapReduce, while Almeer [36] used Hadoop to analyze remote-sensing images, improving batch reading and writing efficiency. For the CA-Markov model, the MapReduce framework is not only capable of efficient parallel processing but also couples naturally to the model: "Map" corresponds to the CA process, realizing the parallelism of land-use-unit change prediction, while "Reduce" corresponds to the Markov process, achieving the overall prediction of land-use changes. However, because the key problem of segmentation and connection remained unresolved, there has been little research on a parallel CA-Markov model for land-use-change prediction over the MapReduce framework.

Based on an in-depth analysis of the parallelism of the CA-Markov model, this paper first proposes a parallel solution that uses the MapReduce framework to improve the CA-Markov model for land-use-change prediction. The parallel CA-Markov model not only resolves the contradiction that the traditional CA-Markov model cannot simultaneously achieve integrity and segmentation in land-use change simulation and prediction, but also ensures both efficiency and accuracy and realizes land-use change prediction in a cloud-computing environment.

## **2. Materials and Background Technologies**

## *2.1. Study Area and Data*

Hangzhou is located in the southeast coast of China, which is the political, economic, cultural, and financial center of Zhejiang Province. Hangzhou has a complex topography: The west is a hilly area with the main mountain range, including Tianmu Mountain, and the east is a plain area with low-lying terrain and dense river networks.

The 2006 Landsat TM and 2013 Landsat 8 remote-sensing images of the study area, with 30 m resolution, were downloaded from http://www.gscloud.cn/. Other experimental datasets included a DEM with 30 m resolution, road-network data, traffic-site data, and location-address data.

Nowadays, many studies develop their own automatic image-identification methods to classify high-resolution images, especially unmanned aerial vehicle (UAV) images [37,38]. Because Landsat images are of medium resolution, we chose a semi-manual method, preprocessing and interpreting them into land-use data using ENVI 5.3 and ArcGIS 10.2 [39]. As shown in Figure 1, the workflow of Landsat-image classification included four main steps: definition of the classification inputs, preprocessing, region-of-interest selection, and classification. The preprocessing steps mainly included geocorrection (geometric rectification or image registration), radiometric calibration, atmospheric correction, and topographic correction [40]. A support-vector-machine (SVM) classifier was selected for land-use classification [41].

**Figure 1.** Landsat-image classification flow.

In general, land-use types include cultivated land, forest, grassland, waters, construction land, gardens, transportation land, unused land, and swampland. In our experiment, four land-use types were defined in the training shapefile after reclassification: construction land (B), agricultural land (A), nature reserve (N), and water area (W). The reclassification scheme is shown in Table 1.

**Table 1.** Land-use reclassifications.


## *2.2. MapReduce*

A MapReduce [42] program consists of two functions, Map and Reduce, both of which take key-value pairs as input and output. The Map function receives a user-supplied key-value pair (*k1, v1*) and processes it to generate intermediate key-value pairs (*k2, v2*). All values sharing the same intermediate key *k2* are then aggregated into a list, list(*v2*), which is passed to the Reduce function and processed to obtain the final result list (*k3, v3*). The process can be expressed by the following formulas:

$$\text{Map}: (k_1, v_1) \to \text{list}(k_2, v_2) \tag{1}$$

$$\text{Reduce}: (k_2, \text{list}(v_2)) \to \text{list}(k_3, v_3) \tag{2}$$
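To make the data flow of Equations (1) and (2) concrete, the following minimal, in-memory Python sketch runs a user-supplied Map and Reduce over a list of records; the grouping step stands in for Hadoop's shuffle, and the record contents are purely illustrative:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal in-memory illustration of Equations (1) and (2)."""
    # Map: each (k1, v1) record yields a list of intermediate (k2, v2) pairs.
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)       # "shuffle": group values by k2
    # Reduce: each (k2, list(v2)) group yields final (k3, v3) pairs.
    results = []
    for k2, values in intermediate.items():
        results.extend(reduce_fn(k2, values))
    return results

# Example: count land-use codes appearing in input lines (hypothetical data).
records = [(0, "B A B"), (1, "A W")]
map_fn = lambda k, line: [(code, 1) for code in line.split()]
reduce_fn = lambda code, ones: [(code, sum(ones))]
print(run_mapreduce(records, map_fn, reduce_fn))  # [('B', 2), ('A', 2), ('W', 1)]
```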

The MapReduce framework is shown in Figure 2:

**Figure 2.** Overview of MapReduce process framework.

## *2.3. CA-Markov Model*

The Markov model is based on the theory of random processes. In this model, given the initial state and the state-transition probabilities, the simulated future state depends only on the current state, not on the history before it, which makes the model suitable for describing land-use changes from one period to the next and for predicting future changes. Change is captured by a land-use transition-probability matrix from period *t* to *t* + 1, which is the basis for predicting future land-use changes [43]:

$$S(t+1) = P_{ij} \times S(t) \tag{3}$$

where *S*(*t*) and *S*(*t* + 1) denote the states of the land-use system at times *t* and *t* + 1, respectively, and *P*ij is the state-transition matrix.
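As a worked illustration of Equation (3), the sketch below projects land-use areas one step forward with NumPy; the row-stochastic matrix values are invented for the four classes used later in this paper (B, A, N, W), not taken from the experiments:

```python
import numpy as np

# Hypothetical 4-type transition-probability matrix P (order: B, A, N, W),
# where P[m, k] is the probability that type m converts to type k.
P = np.array([
    [0.95, 0.03, 0.01, 0.01],   # construction land (B)
    [0.20, 0.75, 0.04, 0.01],   # agricultural land (A)
    [0.02, 0.03, 0.94, 0.01],   # nature reserve (N)
    [0.00, 0.00, 0.00, 1.00],   # water (W), held fixed
])

s_t = np.array([300.0, 500.0, 400.0, 100.0])  # areas (km^2) at time t
s_next = s_t @ P        # Equation (3), with S(t) written as a row vector
print(s_next)
```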

CA has four basic components: Cell and cell space, cell state, neighborhood, and transition rules. The CA model can be expressed as follows [44]:

$$S(t+1) = f(S(t), N) \tag{4}$$

In the formula, *S* is a finite, discrete set of cell states, *t* and *t* + 1 are different moments, *N* is the neighborhood of the cell, and *f* is the local cell-transition rule.

Usually, to make cellular automata better simulate a real environment, a spatial-constraint variable β is introduced to express the topographic terrain as well as the suitability and restrictive constraints that spatial factors impose on the cells. The formula becomes

$$S(t+1) = f(S(t), N, \beta). \tag{5}$$

The separate Markov model lacks spatial knowledge and does not consider the spatial distribution of geographic factors and land-use types, while the CA-Markov model adds spatial features to the Markov model, uses a cellular-automata filter to create weight factors with spatial character, and changes the state of the cells according to the state of adjacent cells and the transition rules [21].

## **3. Methods**

## *3.1. Parallel CA-Markov Model Overview*

## 3.1.1. Parallel CA-Markov Structure

Figure 3 shows the structure of the parallel CA-Markov model over the MapReduce framework. The structure contains four layers from bottom to top: the data layer, the parallel CA-Markov model layer, the land-use transition-direction determination layer, and the land-use prediction layer.

**Figure 3.** Structure of parallel cellular-automata Markov over MapReduce.

The experiment data include two-phase remote-sensing images, terrain data, traffic-network data, and location-address data.

The parallel CA-Markov model can mainly be divided into two parts: A parallel Markov algorithm based on MapReduce (Cloud-Markov), and a comprehensive evaluation method of land-use changes based on MapReduce (Cloud-CELUC).

Cloud-Markov was used to calculate the area transition-probability matrix of land-use changes. The algorithm of Cloud-Markov is discussed in Section 3.2.

Cloud-CELUC includes three main parts: evaluation of the influence of neighborhoods under a cloud-computing environment (C-ENID), multicriteria evaluation under a cloud-computing environment (C-MCE), and comprehensive evaluation of land use. Among them, C-ENID is a parallel CA model over MapReduce, and C-MCE is a MapReduce algorithm that calculates constraint-evaluation and suitability-evaluation values. The cell's neighborhood influence and the C-ENID calculation method are discussed in Section 3.3.1, the C-MCE design factors in Section 3.3.2, and the Cloud-CELUC algorithm in Section 3.3.3.

After using Cloud-Markov to calculate the transition-probability matrix of land-use changes and using Cloud-CELUC to calculate each cell's comprehensive evaluation value, each cell's land-use transition direction was determined, which is discussed in Section 3.4. Finally, land-use-change predictions were obtained, and the land-use prediction experiment is discussed in Section 4.

## 3.1.2. Parallel CA-Markov Workflow

As shown in Figure 4, the flow of the parallel CA-Markov model has five major steps, as follows:


**Figure 4.** Parallel Cellular Automata-Markov (CA-Markov) flow.

## *3.2. Markov Model Parallel Processing (Cloud-Markov)*

Using an overlay method to analyze the two phases of raster land-use data at the same spatial positions, we obtained the transition direction of each cell and counted the number of cells in each transition direction; from these counts we derived the area transfer matrix of each land-use type and calculated the area transition-probability matrix of land use. The annual area transfer matrix was then obtained by dividing this matrix by the interval, in years, between the two raster images. The MapReduce functions of the Markov model for the study area are as follows:

$$\text{Map}: (N, (T_1, T_2)) \to \text{list}(C_{mk}, i) \tag{6}$$

$$\text{Combiner}: (C_{mk}, \text{list}(i)) \to \text{list}(C_{mk}, s) \tag{7}$$

$$\text{Reduce}: (C_{mk}, \text{list}(s)) \to \text{list}(C_{mk}, q) \tag{8}$$

where *N* is the row offset of the input row, *T*1 represents the cells' land-use types in the earlier raster image, *T*2 represents the cells' land-use types in the later raster image, *Cmk* represents the conversion of a cell from land-use type *m* to land-use type *k*, *i* indicates the cell transfer count of *Cmk*, *M* is the number of land-use types in a MapReduce node, *s* is the total number of *Cmk* cells in a MapReduce node, *L* is the total number of land-use types, and *q* is the combined value of the total number of cells and the transition probability for the *Cmk* conversion in the entire study area. For example, "100-3.12%" means that there were 100 cells with the *Cmk* conversion and that the transition probability was 3.12%. Figure 5 shows the flow of the Cloud-Markov algorithm. After the numbers of cells with the same transition direction are summed in each node, the area transition matrix is calculated at the Reduce stage.

**Figure 5.** Cloud-Markov algorithm flow.

The steps are as follows:

(1) Map stage:

- Input <Key, Value>
- Raster-cell land-use-type conversion analysis

In this step, by comparing the cells at the same position in the two raster images, we obtained a list of *Cmk*. If the value of *Cmk* is 'B-A', the cell at that position was converted from construction land (B) to agricultural land (A) between the two images.


The output <Key, Value> pair has Key equal to the conversion direction *Cmk* and Value equal to the integer 1.

(2) Combiner stage:


<Key,Value> is a key-value pair (*Cmk,s*), where *Cmk* is the conversion direction and *s* is the total number of cells in a MapReduce node where the *Cmk* land-use-type conversion direction occurs.


Output key-value pairs (*Cmk,s*).

(3) Reduce stage:


The transition-probability-matrix calculation formula is defined as follows:

$$P_{mk} = \frac{V_{mk}}{S_m} \tag{9}$$

where *Vmk* is the sum of the *Cmk* counts from all MapReduce nodes, and *Sm* is the number of cells in the initial raster image whose land-use type is *m*.


<Key,Value> is a key-value pair (*Vmk,Pmk*), where *Vmk* is the land-use-type area-conversion matrix and *Pmk* is the transition-probability matrix.
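The following condensed Python sketch mirrors this Map/Combiner/Reduce logic (Equations (6)–(9)) in a single in-memory pass. The tab-separated input layout, with one raster row from each image per line, is an assumption for illustration rather than the authors' actual file format:

```python
from collections import Counter

def map_stage(lines):
    """Map: compare cells at the same position in the two images (Equation (6)).
    Each input line is assumed to hold one raster row from 2006 (T1) and 2013 (T2)."""
    for line in lines:
        t1, t2 = line.strip().split("\t")
        for m, k in zip(t1, t2):
            yield f"{m}-{k}", 1              # key C_mk = 'm-k', value 1

def combine(pairs):
    """Combiner/Reduce: sum the counts per transition direction (Equations (7)-(8))."""
    counts = Counter()
    for cmk, i in pairs:
        counts[cmk] += i
    return counts

counts = combine(map_stage(["BBAA\tBAAA", "AAWW\tBAWW"]))   # toy 2-row raster
source_totals = Counter()            # S_m: cells of type m in the initial image
for cmk, n in counts.items():
    source_totals[cmk.split("-")[0]] += n
for cmk, v_mk in sorted(counts.items()):
    p_mk = v_mk / source_totals[cmk.split("-")[0]]   # Equation (9): P_mk = V_mk / S_m
    print(f"{cmk}\t{v_mk}-{p_mk:.2%}")   # combined output, e.g. 'A-B  1-25.00%'
```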

*3.3. Cloud-CELUC*

3.3.1. Cell Neighborhood Processing

Obtaining a cell's neighborhood-influence value requires reading the states of its neighbor cells. The common neighborhood definitions are Von Neumann and Moore. Figure 6 shows how the cellular neighborhood is read; we chose the 3 × 3 Moore neighborhood to design our algorithm.

**Figure 6.** Cellular-neighborhood reading.

Figure 7 shows the process of cellular-dimension reduction, in which a list structure stores all cells so that each cell's neighbors can be read through its row and column indices. The two-dimensional raster image is reduced to a one-dimensional array, which reduces the data exchange between nodes during MapReduce processing. For example, the neighborhood cells of cell *K*x come from lines *K* − 1, *K*, and *K* + 1; they are recorded as (*K* − 1)x−1, (*K* − 1)x, (*K* − 1)x+1, *K*x−1, *K*x+1, (*K* + 1)x−1, (*K* + 1)x, and (*K* + 1)x+1, and then stored as an array structure in HDFS.

**Figure 7.** Cellular-dimension reduction.
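A small sketch of this dimension reduction: the raster is stored row by row in a one-dimensional list, and the Moore neighborhood of cell (*K*, *x*) is recovered from its row and column indices; the toy raster values are hypothetical:

```python
def moore_neighbors(cells, n_rows, n_cols, k, x):
    """Return the 3x3 Moore neighborhood of cell (k, x) from a flattened raster.

    `cells` stores the raster row by row, so cell (k, x) sits at index
    k * n_cols + x, and its neighbors come from lines k-1, k, and k+1."""
    neighbors = []
    for dk in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dk == 0 and dx == 0:
                continue                                 # skip the center cell
            r, c = k + dk, x + dx
            if 0 <= r < n_rows and 0 <= c < n_cols:      # border cells have fewer neighbors
                neighbors.append(cells[r * n_cols + c])
    return neighbors

# 3x4 toy raster flattened to one list; neighborhood of cell (1, 1).
raster = list("BBAA" "BAAA" "AAWW")
print(moore_neighbors(raster, 3, 4, 1, 1))  # ['B', 'B', 'A', 'B', 'A', 'A', 'A', 'W']
```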

## 3.3.2. Multicriteria Evaluation Factors

This research used the multicriteria evaluation (MCE) method to calculate constraint-evaluation and suitability-evaluation values. The purpose of MCE is to select the optimal decision from a limited (or infinite) set of alternative solutions among which there are conflicts and trade-offs [45].

The evaluation criteria of MCE are divided into two parts: suitability factors and constraint factors. Suitability factors normalize the influencing factors to continuous, measurable values, while constraint factors categorize spatial features by their spatial characteristics; the value of a constraint factor is Boolean, either 0 or 1.

The suitability factors are defined and standardized in Table 2, where eight factors capture the distance from a cell to different typical destinations; the weighted value of each factor for each land-use type is defined in Table 3. The analytic hierarchy process (AHP) and the expert-scoring method were used to calculate the weights of the suitability factors [46], which are listed in Table 3. The waters and gradient factors were defined as the constraint factors.



**Table 3.** Weight parameters of suitability factors.


The value of a constraint factor is Boolean and is determined by land-use type. This research set the constraint factors of hills with gradients greater than 25 degrees, waters, and ecological reserves to 0, because these land-use types would rarely change.

## 3.3.3. Parallel Cloud-CELUC Algorithm

Cloud-CELUC needs only the Map function to calculate the factors and obtain comprehensive-evaluation values; its Reduce function is used only to output the results. The Map function is defined as follows:

$$\text{Map}: (N, (i, H_1, H_2, H_3, H_4, \dots, H_m)) \to ((i, j), (L_1, L_2, L_3, \dots, L_m)) \tag{10}$$

where *N* is the line offset of the input line, *i* is the row index of the raster image, *j* is the column index of the raster image, *H*1 is the cell-state value to be calculated, *H*2 and *H*3 are the uplink and downlink cell-state values of *H*1, and *H*4, ..., *Hm* are the constraint factors and suitability factors corresponding to the cell. *Lm* combines the cell's comprehensive-evaluation value for a transition direction with that transition direction. For example, if *L*1 is 'ba-1.2234', the evaluation value of cell (*i, j*) for converting from initial land-use type 'b' to final land-use type 'a' is 1.2234.

The flow of the Cloud-CELUC algorithm is shown in Figure 8: the table of comprehensive-evaluation values is obtained after each node calculates the neighborhood-influence value, suitability-evaluation value, and constraint-evaluation value. The steps are as follows:


**Figure 8.** Cloud-comprehensive-evaluation value (CELUC) algorithm.

According to *H*1, *H*2, and *H*3, the neighborhood-influence degree of each cell in *H*1 was calculated. The formula for the evaluation value of the neighborhood-influence degree of cell (*i, j*) corresponding to class *a* at a certain time is as follows:

$$NID_a = \frac{1}{h-1} \sum \mathrm{Yes}\{S_{ij} == a\} \tag{11}$$

where *h* is the number of cells in the window (so *h* − 1 neighbors), and Yes{*S*ij == *a*} judges whether a neighbor cell of cell (*i, j*) has land-use type *a*: it returns 1 if so, and 0 otherwise.


The calculation formula of the suitability-evaluation value is as follows:

$$SEV_a = \sum_{\delta} V_{a\delta} \times DIS_{ij\delta} \tag{12}$$

where *V*aδ is the weight of factor δ corresponding to land-use type *a*, and *DIS*ijδ are the suitability-factor values of cell (*i, j*) defined in Table 2.


The constraint-evaluation formula is as follows:

$$CEV_a = \prod_k \mathrm{Yes}\big(K_{ij}\big) \tag{13}$$

where Yes(*K*ij) represents the constraint-evaluation value of cell (*i, j*) corresponding to constraint factor *k*. If the cell is constrained, Yes(*K*ij) returns 0; otherwise, it returns 1.

(5) Calculate the comprehensive-evaluation value (*CELUC*)

Based on the three evaluation values above, *NID, SEV,* and *CEV,* the comprehensive-evaluation formula was defined as follows:

$$CELUC_a = NID_a \times SEV_a \times CEV_a \tag{14}$$

where *a* is a land-use type.
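Putting Equations (11)–(14) together for a single cell, the following sketch computes NID, SEV, CEV, and the resulting CELUC; the neighbor states, factor weights, and standardized distances below are invented inputs, not values from the paper:

```python
def celuc(neighbors, a, weights, distances, constraints):
    """Comprehensive-evaluation value of one cell for target land-use type `a`.

    neighbors   -- land-use types of the Moore neighbors (h - 1 cells)
    weights     -- V_a_delta: weight of each suitability factor for type a
    distances   -- DIS_ij_delta: standardized suitability-factor values of the cell
    constraints -- Yes(K_ij) flags: 0 if the cell is constrained, else 1
    """
    h = len(neighbors) + 1                                  # window cells incl. center
    nid = sum(1 for s in neighbors if s == a) / (h - 1)     # Equation (11)
    sev = sum(w * d for w, d in zip(weights, distances))    # Equation (12)
    cev = 1
    for yes_k in constraints:                               # Equation (13)
        cev *= yes_k
    return nid * sev * cev                                  # Equation (14)

# Hypothetical cell: 5 of 8 neighbors already type 'B', two factors, unconstrained.
print(celuc(list("BBBBBAAW"), "B", [0.6, 0.4], [0.8, 0.5], [1, 1]))  # 0.425
```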


<Key,Value> is a key-value pair ((*i*, *j*), *CELUC*) where *i* is the cell's row index, *j* is the cell's column index, and *CELUC* is the comprehensive-evaluation value of the cell.

## *3.4. Cell Land-Use-Type Conversion*

The multi-objective land-use competition method was used to perform the cells' land-use-type conversion, resolving conflicts when a cell has several possible conversions [47]. For example, if there are *N* land-use types, cell (*i, j*) may have *N* possible conversions. According to the constraint factors, suitability factors, and neighborhood conditions, each possible conversion of a cell is given a comprehensive evaluation value. If the cell's transition direction were determined solely by the largest evaluation value, dominant land-use types could proliferate, causing oversimulation of dominant types and undersimulation of weak types. Hence, both the comprehensive land-use evaluation values and the area transfer matrix of land-use changes were used to decide each cell's conversion direction. Figure 9 shows the flow of a cell's land-use-type conversion process.

The steps are as follows:



The upper area limit of each land-use-type conversion was obtained from the area transition-probability matrix of each land-use type, defined as a key-value pair (*Vmk, Pmk*), where *Vmk* is the land-use-type area-conversion matrix and *Pmk* is the land-use transition-probability matrix from the Cloud-Markov stage.


**Figure 9.** Cell's land-use-type conversion flow.

If the upper area limit has not been reached, the land-use type of the cell is converted to the new land-use type and stored as a key-value pair ((*i, j*), CELUC*i*) in an array, and the cell's other CELUC values are discarded so that the cell is not reused in subsequent steps. The process then returns to the first step.
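A greedy sketch of this competition step, under the assumption that candidate conversions are visited in descending CELUC order and capped by per-direction area limits from the Cloud-Markov matrix:

```python
def allocate_conversions(candidates, area_limits):
    """Greedy sketch of the cell conversion flow (Figure 9).

    candidates  -- list of ((i, j), direction, celuc) tuples for every cell
    area_limits -- upper cell counts per direction from the transition matrix
    Returns {(i, j): direction} for the converted cells."""
    converted, used = {}, {}
    # Visit the strongest evaluation values first across all directions.
    for ij, direction, score in sorted(candidates, key=lambda c: -c[2]):
        if ij in converted:
            continue                                 # each cell converts at most once
        if used.get(direction, 0) < area_limits.get(direction, 0):
            converted[ij] = direction                # convert; other CELUCs are discarded
            used[direction] = used.get(direction, 0) + 1
    return converted

limits = {"A-B": 1}                                  # at most one A->B conversion allowed
cands = [((0, 0), "A-B", 0.9), ((0, 1), "A-B", 0.7), ((0, 1), "A-N", 0.4)]
print(allocate_conversions(cands, limits))           # {(0, 0): 'A-B'}
```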


## **4. Results and Discussion**

## *4.1. Model-Efficiency Analysis*

One machine was used as the master node for the work of NameNode and JobTracker, and four other machines were used as the slave nodes for the work of DataNode and TaskTracker. The operating environment of the machines was the CentOS 7.1.1503 system with Java version 1.8.0\_112 and Hadoop version 2.7.3. The experiment environment configuration is shown in Table 4. The hardware environment of the serial algorithm was the same as the hardware of the Hadoop node.


**Table 4.** Hadoop experiment environment.

The running efficiency and acceleration ratio of the Cloud-Markov algorithm relative to the serial Markov algorithm are shown in Figure 10, and those of the Cloud-CELUC algorithm relative to the serial CELUC algorithm are shown in Figure 11. In Figures 10a and 11a, the abscissa indicates the number of cells (one unit, *n*, is approximately 9,000,000 cells) and the ordinate indicates running time.

**Figure 10.** Running efficiency and acceleration ratio of Cloud-Markov relative to serial-Markov. (**a**) Running efficiency comparison, (**b**) acceleration ratio.

**Figure 11.** Running efficiency and acceleration ratio of serial-CELUC relative to Cloud-CELUC. (**a**) Running efficiency comparison, (**b**) acceleration ratio.

The results showed that the execution time of Cloud-Markov was less than that of the serial Markov algorithm and that, as the input data grew, the acceleration ratio increased and then leveled off. The acceleration ratio of the Cloud-Markov algorithm relative to the serial Markov algorithm stabilized at about 3.27, and that of Cloud-CELUC relative to serial CELUC at about 1.77. The highest acceleration ratios of Cloud-Markov and Cloud-CELUC were 3.43 and 1.86, respectively.

The efficiency of the parallel Markov model in a cloud environment was remarkable because the MapReduce system effectively distributed the workload of the two phases of cell matching and quantitative statistics. The efficiency of Cloud-CELUC also improved because the Map phase we defined ran very fast; however, transferring the output of the Map phase into the Reduce phase occupied a relatively large share of the total running time, which reduced overall efficiency.

## *4.2. Precision Evaluation and Result Analysis*

## 4.2.1. Precision Evaluation

The 2006 remote-sensing images and other data were defined as the initial data, and the 2013 land-use changes were simulated using the parallel CA-Markov model. In our experiment, the water area was held fixed. Based on the 2006 and 2013 land-use data, the area transition-probability matrix could be calculated. Table 5 is the 2006–2013 area transition matrix, in which each cell represents the total area of one land-use type transferred to another from 2006 to 2013. Table 6 is the 2006–2013 transition-probability matrix, in which each cell represents the probability of one land-use type transferring to another from 2006 to 2013.





As shown in Tables 5 and 6, the biggest land-use-type transition was agricultural land transferring to construction land; its ratio reached 22.91%. In order to evaluate the simulated precision, the 2013 land-use data were classified using real 2013 remote-sensing images at the data-processing stage. The simulated land-use data and the classified 2013 land-use data are shown in Figures 12 and 13, respectively.

After the simulation, precision-evaluation experiments were conducted to correct the weight parameters defined in C-MCE. After a great number of repeated experiments and weight-parameter corrections, the weight parameters of the suitability factors listed in Table 3 were obtained. Commonly used precision-evaluation methods include visual comparison, dimensionality tests, pixel contrasts, and the Kappa coefficient test.

With visual comparison, the simulation of the nature reserve in the western and southern regions was found to be the most accurate, suggesting that the suitability and constraint factors were in line with the change trend of the nature reserve in the study area. The construction land of the central urban area also had good simulation accuracy. However, the construction land in the east and north, including the towns of Yuanpu and Linpu, was scattered and intertwined with agricultural land, so the simulation error there was relatively large.

**Figure 12.** Map of land-use simulation in 2013.

**Figure 13.** Classification map of remote-sensing images in 2013.

The Kappa coefficient test is the most commonly used quantitative test method [48]. When the Kappa coefficient is used to compare the consistency of data, commonly used criteria are as follows: If the two land-use maps are identical, then Kappa = 1; when Kappa > 0.8, consistency is almost perfect; when 0.6 < Kappa ≤ 0.8, consistency is substantial; when 0.4 < Kappa ≤ 0.6, consistency is moderate; when 0.2 < Kappa ≤ 0.4, consistency is slight; when 0 < Kappa ≤ 0.2, consistency is poor [49,50]. Simulated land-use data and actual classified land-use data of 2013 were compared, and the results of the Kappa coefficient are shown in Tables 7–9, respectively.
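For reference, a compact NumPy sketch of the Kappa coefficient from a confusion matrix, as used for Tables 7–9; the 2 × 2 counts below are hypothetical:

```python
import numpy as np

def kappa(confusion):
    """Cohen's Kappa from a square confusion matrix (rows: actual, cols: simulated)."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p_o = np.trace(confusion) / total                             # observed agreement
    p_e = (confusion.sum(0) * confusion.sum(1)).sum() / total**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2x2 table (km^2): construction land vs. everything else.
print(round(kappa([[420, 80], [70, 930]]), 2))   # 0.77, i.e., substantial consistency
```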


**Table 7.** Kappa coefficient test table of nature reserve land in 2013 (unit: km²).

**Table 8.** Kappa coefficient of construction land in 2013 (unit: km²).


**Table 9.** Kappa coefficient of agricultural land in 2013 (unit: km²).


Results showed that the Kappa coefficients for nature reserve, construction land, and agricultural land were 0.85, 0.6, and 0.65, respectively. This meant that the 2013 results of the simulation were quite accurate [38,51–53], and that using the parallel CA-Markov model to predict future land use would be highly reliable.

## 4.2.2. Land-Use-Change Prediction

Based on the classified 2013 land-use data and other experiment data, 2020 land-use changes were predicted using the parallel CA-Markov model. The results of the 2013–2020 area transition matrix are shown in Table 10, and the 2020 land-use prediction map is shown in Figure 14.



As can be seen from the 2020 land-use prediction map, construction land in the study area was on the rise as a whole, and this increase mainly came from the conversion of agricultural land. Agricultural land showed a downward trend, and natural-reserve land changed little in proportion to its vast size.

The overall growth of construction land was relatively large. In particular, due to expanding road and public-transport systems, construction land grew faster in the urban center of each county. The growth of construction land in the districts of Xihu, Gongshu, Xiacheng, Binjiang, and Xiaoshan was prominent. Due to the terrain and water-body restrictions, other counties and urban areas maintained stable acreage of construction land.

**Figure 14.** The 2020 land-use prediction map.

Agricultural land and construction land are intertwined over large areas in the northwest and north of the study area. However, the evaluation value of construction land there was not high because the area is far from the city and county centers and its public-transportation system is less developed, so the acreage of agricultural land in that area remained basically stable. The agricultural land in the rest of the study area was affected by urban expansion and the road and public-transportation systems, and a large acreage was converted to construction land.

Nature reserves were mostly found in the western and southern hilly areas, which were subject to various restraints for development, such as sloping terrain and ecological-protection restrictions, and an inconvenient transportation system, and therefore remained basically stable in acreage.

## **5. Conclusions**

Experiments showed that the results of land-use simulation based on the CA-Markov model in a cloud environment were reliable, demonstrating that the proposed method is applicable, and that MapReduce is effective in parallelizing the CA-Markov model to improve the processing speed of land-use-change prediction. The method parallelizes the CA-Markov model in two parts: the parallel Markov model based on a cloud environment (Cloud-Markov) and the comprehensive evaluation method of land-use changes based on MapReduce (Cloud-CELUC). Selecting Hangzhou as the study area and setting up a Hadoop experiment environment, we designed experiments to verify the reliability, precision, and operating efficiency of the method, simulated land-use changes in Hangzhou in 2020, and analyzed the results. The experimental results showed that the method, which simultaneously achieves integrity and partitioning in land-use-change simulation and prediction, is practical and effective.

This research successfully applied the MapReduce framework to improve the efficiency of land-use-change prediction. However, some important issues are worth further investigation. First, land-use changes are constrained not only by natural conditions but also by political, economic, demographic, and other complex factors. Due to limited data sources, this study built its model mainly on traffic, terrain, and location factors; if more data become available, social, economic, policy, demographic, and other factors should be incorporated into the prediction of the spatial pattern of land use in future research. We will also consider integrating automatic image-identification methods [37] into our current work to reduce manual preprocessing. In addition, when Cloud-CELUC passes the output of the Map stage to the Reduce stage, input/output (IO) becomes a bottleneck in system performance; therefore, more research effort should be dedicated to improving IO efficiency for cloud computing based on the CA-Markov model.

**Author Contributions:** Conceptualization, J.K. and L.F.; Methodology, J.K. and S.L.; Software, J.K. and S.L.; Validation, L.F.; Formal analysis, J.K. and S.L.; Investigation, J.K.; Resources, L.F. and X.W.; Data curation, S.L. and J.K.; Writing—original draft preparation, J.K.; Writing—review and editing, L.F., J.K., and S.L.; Visualization, J.K.; Supervision, L.F. and X.W.; Project administration, J.K. and X.W.; Funding acquisition, L.F. and X.W.

**Funding:** This work was supported by the National Key Research and Development Program of China (Grant No. 2016YFC0803105, 2016YFC0502700), the China Postdoctoral Science Foundation (Grant No. 2018M641926), the China Scholarship Council Program (No. 201808360065), and the Jiangxi Provincial Department of Education Science and Technology Research Projects (Grants No. GJJ150661).

**Acknowledgments:** We would like to thank Zhejiang University GIS Lab for providing the data, and Kaibin Zhang for helping setup the experimental environment and evaluating our algorithms, and the anonymous reviewers for their valuable suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **Terrain Analysis in Google Earth Engine: A Method Adapted for High-Performance Global-Scale Analysis**

**José Lucas Safanelli 1, Raul Roberto Poppiel 1, Luis Fernando Chimelo Ruiz 1, Benito Roberto Bonfatti 2, Fellipe Alcantara de Oliveira Mello 1, Rodnei Rizzo 1 and José A. M. Demattê 1,\***


Received: 1 May 2020; Accepted: 14 June 2020; Published: 17 June 2020

**Abstract:** Terrain analysis is an important tool for modeling environmental systems. Aiming to use the cloud-based computing capabilities of Google Earth Engine (GEE), we customized an algorithm for calculating terrain attributes, such as slope, aspect, and curvatures, for different resolutions and geographical extents. The calculation method is based on geometry and elevation values estimated within a 3 × 3 spheroidal window, and it does not rely on projected elevation data. Thus, partial derivatives of terrain are calculated considering the great-circle distances of reference nodes of the topographic surface. The algorithm was developed using the JavaScript programming interface of the online code editor of GEE and can be loaded as a custom package. The algorithm also provides an additional feature for visualizing terrain maps with a dynamic legend scale, which is useful for mapping different extents, from local to global. We compared the consistency of the proposed method with an available but limited terrain-analysis tool of GEE, which resulted in correlations of 0.89 and 0.96 for aspect and slope over a near-global scale, respectively. In addition, we compared the slope, aspect, horizontal, and vertical curvature of a reference site (Mount Ararat) with their equivalent attributes estimated by the System for Automated Geoscientific Analyses (SAGA), which achieved correlations between 0.96 and 0.98. The visual correspondence of TAGEE and SAGA confirms its potential for terrain analysis. The proposed algorithm can be useful for making terrain analysis scalable and adapted to customized needs, benefiting from the high-performance interface of GEE.

**Keywords:** topographic surface; terrain modeling; global terrain dataset

## **1. Introduction**

Terrain analysis is essential for modeling environmental systems [1–3]. The variability of landforms is frequently used to understand, map, or model geomorphological, hydrological, and biological processes [4–7]. Elevation has a strong relationship with terrestrial temperature, vegetation type, and the potential energy accumulated on a slope. The aspect and its derived products, such as the Northernness and Easternness attributes, can be linked to the potential solar irradiation on terrain. The slope gradient, for example, controls the overland and subsurface flow velocity and the runoff rate. Similarly, curvatures are associated with the acceleration and dispersion of water and sediment flows, which affect erosion and soil water content [8].

The public availability of elevation data with global coverage, such as the digital elevation model (DEM) derived from NASA's Shuttle Radar Topography Mission (SRTM DEM, [9]) and the digital surface model from the Advanced Land Observing Satellite (AW3D30 DSM, [10]), has promoted the exploration of topographic features in different contexts using processing tools available in several geographic information systems (GIS) [4,11,12]. However, despite the popularization of many global elevation datasets, it is important to pay attention to their quality when they are used for modeling purposes, as the acquisition means and other production aspects can significantly impact the outputs [13,14]. In addition, analyzing big geospatial datasets can pose limitations for traditional GIS. This becomes more critical with the availability of new digital datasets, which provide better temporal and spatial resolutions due to advances in sensor technologies [15].

The Global Multi-resolution Terrain Elevation Data 2010 [16] and the global suite of terrain attributes [2] are examples of datasets that were produced using large computational tasks for mapping the global extent at different spatial resolutions, which demanded optimized processing architectures. In general, high-performance architectures are based on splitting the data into smaller subsets (tiles) to take advantage of distributed computing operations. Recently, with the advent and popularization of cloud-based interfaces for processing big geospatial data, e.g., Google Earth Engine [17], the Pangeo software packages [18], and the Actinia REST service [19], computational tasks applied to terrain analysis can be scaled and customized directly by the user.

Earth Engine (GEE) is a cloud-based platform developed by Google that supports the global-scale analysis of big catalogs of Earth Observation data [17]. It has been used to map global forest change in the 21st century [20], Earth's surface-water change [21], global urban areas [11], wildfire progression [22], global bare-surface change [23], and others. GEE is compelling not only because the distributed processing tasks are executed on Google's server side, but also because of the increasing availability of many global geospatial datasets that can be explored in topographic mapping. Several topographic datasets are available within GEE, such as the global SRTM DEM, AW3D30 DSM, Global 30 Arc-Second Elevation data (GTOPO30 DEM, [24]), and others. Thus, GEE could permit the customization of high-performance terrain analysis with minimal user input and without any computational processing on the user side. In fact, GEE provides three algorithms for calculating the slope, illumination, and aspect of terrain, but it lacks calculation methods for other terrain information, such as curvatures and landscape characterization.

In addition, a common obstacle of global terrain analysis in common GIS is the need for projecting DEMs onto projected coordinate systems, which ensures the elevation data is equally spaced on a plane square grid [25]. This step is complicated because it is difficult to define a projected system that minimizes terrain distortions over a global extent [26]. Moreover, as many available global DEMs are referenced by geographical coordinate systems and some researchers continue to apply square-grid algorithms to them, the algorithms should consider the geometry and specificity of global spheroidal DEMs [25]. This aspect is important because the application of square-grid methods to spheroidal equal angular DEMs leads to substantial computational errors in models of morphometric variables [25].

In this paper, we aimed at describing and making available a user-friendly processing algorithm for performing terrain analysis in GEE. This algorithm takes advantage of GEE's high-performance architecture for making the computational analysis scalable, adapted to customized needs, and requiring minimal user input. For this, the proposed package takes advantage of a calculation method adapted for spheroidal elevation grids, which favors the global-scale analysis of different DEM resolutions without projecting elevation data.

## **2. Material and Methods**

## *2.1. Algorithm Description*

The Terrain Analysis in GEE (TAGEE) package uses calculation methods adapted to spheroidal equal angular grids, i.e., the DEM can be referenced in a geographical coordinate system, e.g., the World Geodetic System (WGS84). The following paragraphs briefly describe the calculation methods performed by the TAGEE package. Readers are referred to [8] for the mathematical concepts of geomorphometry, a historical overview of the progress of digital terrain modeling, and the notion of the topographic surface and its limitations.

## 2.1.1. Topographic Surface

The land topography can be approximated by a topographic surface defined by a continuous, single-valued bivariate function (Equation (1)) [8]:

$$z = f(x, y) \tag{1}$$

where *z* is elevation (meters), and *x* and *y* are the geographical coordinates (degrees).

The local morphometric variables are functions of the partial derivatives of elevation. Using the Evans–Young method, the function *z* = *f*(*<sup>x</sup>*, *y*) is expressed as the second-order bivariate Taylor polynomial (Equation (2)):

$$z = \frac{rx^2}{2} + \frac{ty^2}{2} + sxy + px + qy + u \tag{2}$$

where *r*, *t*, *s*, *p*, and *q* are the partial derivatives, and *u* is the residual term.
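To make the least-squares fit of Equation (2) concrete, the sketch below gives the classical Evans–Young estimates for the simplified special case of a plane square grid with equal spacing *w*. This is not the spheroidal method derived next (Equations (6)–(10)); the window is assumed to be numbered z1–z9 row by row from the top, with north up:

```python
def evans_young(z, w):
    """Partial derivatives r, t, s, p, q of Equation (2) on a plane square grid.

    z -- nine window elevations z1..z9, row by row from the top
    w -- grid spacing in meters (equal along x and y on a square grid)"""
    z1, z2, z3, z4, z5, z6, z7, z8, z9 = z
    p = (z3 + z6 + z9 - z1 - z4 - z7) / (6 * w)           # dz/dx (east - west)
    q = (z1 + z2 + z3 - z7 - z8 - z9) / (6 * w)           # dz/dy (north - south)
    r = (z1 + z3 + z4 + z6 + z7 + z9 - 2 * (z2 + z5 + z8)) / (3 * w**2)  # d2z/dx2
    t = (z1 + z2 + z3 + z7 + z8 + z9 - 2 * (z4 + z5 + z6)) / (3 * w**2)  # d2z/dy2
    s = (z3 + z7 - z1 - z9) / (4 * w**2)                  # d2z/dxdy
    return r, t, s, p, q

# Hypothetical 3x3 window of elevations (m) on a 30 m grid.
print(evans_young([105, 104, 103, 102, 101, 100, 99, 98, 97], 30.0))
```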

Differently from a digital elevation model projected onto a plane square grid, where the partial derivatives of terrain are estimated by finite differences, the processing and analysis of a spheroidal equal angular DEM must consider the spheroidal geometry. In such a case, a grid spacing with approximately equal linear units along meridians and parallels exists only at the equator. To estimate the parameters of a spheroidal grid, a 3 × 3 moving window must retrieve both the geometry elements and the elevation values of the window nodes (Figure 1).

**Figure 1.** A 3 × 3 spheroidal equal angular grid with linear geometries a, b, c, d, and e, and nine elevation nodes—adapted from [8].

## 2.1.2. Terrain Parameters: Neighbor Elevations and Geometries

The elevation values of a 3 × 3 moving window are estimated by convolution kernels. For geometries, the Haversine formula is used to determine the great-circle distances between two neighbor nodes within the spheroidal window, given their latitude and longitude geographical positions (Equations (3)–(5)):

$$j = \sin^2\left(\frac{\Delta\phi}{2}\right) + \cos\phi_1 \cdot \cos\phi_2 \cdot \sin^2\left(\frac{\Delta\lambda}{2}\right) \tag{3}$$

$$k = 2 \cdot \operatorname{atan2}\left(\sqrt{j}, \sqrt{1 - j}\right) \tag{4}$$

$$l = R \cdot k \tag{5}$$

where φ1 and φ2 are the latitudes of the first and second given nodes in radians, λ1 and λ2 are their longitudes in radians, Δφ and Δλ are the respective differences of latitude and longitude between the given nodes, and *R* is the mean radius of the Earth, equal to 6,371,000 m. The linear distance *l* is given in meters.
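A direct Python transcription of Equations (3)–(5):

```python
from math import atan2, cos, radians, sin, sqrt

R = 6_371_000  # mean Earth radius in meters

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance l (meters) between two nodes, Equations (3)-(5)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    j = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2  # Eq. (3)
    k = 2 * atan2(sqrt(j), sqrt(1 - j))                                  # Eq. (4)
    return R * k                                                         # Eq. (5)

# One arc-second of latitude is roughly 30.9 m anywhere on the sphere.
print(round(haversine(0.0, 0.0, 1 / 3600, 0.0), 1))
```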

Knowing the latitude and longitude of the window nodes (Figure 1), the Haversine formula allows the calculation of linear distances of a, b, c, d, and e, which are used with the neighbor elevation values (from *z*1 to *z*9) to calculate the partial derivatives of terrain.

## 2.1.3. Terrain Derivatives

To estimate the first and second-order partial derivatives *r*, *t*, *s*, *p* and *q*, the polynomial model is fitted by least squares and results in the following estimations (Equations (6)–(10)) [8]:

$$p = \frac{a^2cd(d+e)(z_3 - z_1) + b\left(a^2d^2 + c^2e^2\right)(z_6 - z_4) + ac^2e(d+e)(z_9 - z_7)}{2\left[a^2c^2(d+e)^2 + b^2\left(a^2d^2 + c^2e^2\right)\right]} \tag{6}$$

$$\begin{array}{l} q = \frac{1}{3de(d+e)(a^4+b^4+c^4)} \\ \quad \cdot \Big[ \left[ d^2\left(a^4+b^4+b^2c^2\right) + c^2e^2\left(a^2-b^2\right) \right](z_1+z_3) \\ \quad - \left[ d^2\left(a^4+c^4+b^2c^2\right) - e^2\left(a^4+c^4+a^2b^2\right) \right](z_4+z_6) \\ \quad - \left[ e^2\left(b^4+c^4+a^2b^2\right) - a^2d^2\left(b^2-c^2\right) \right](z_7+z_9) \\ \quad + d^2\left[ b^4(z_2-3z_5) + c^4(3z_2-z_5) + \left(a^4-2b^2c^2\right)(z_2-z_5) \right] \\ \quad + e^2\left[ a^4(z_5-3z_8) + b^4(3z_5-z_8) + \left(c^4-2a^2b^2\right)(z_5-z_8) \right] \\ \quad - 2\left[ a^2d^2\left(b^2-c^2\right)z_8 + c^2e^2\left(a^2-b^2\right)z_2 \right] \Big] \end{array} \tag{7}$$

$$r = \frac{c^2(z_1+z_3-2z_2) + b^2(z_4+z_6-2z_5) + a^2(z_7+z_9-2z_8)}{a^4+b^4+c^4} \tag{8}$$

$$s = \frac{c\left[a^2(d+e) + b^2e\right](z_3-z_1) - b\left(a^2d - c^2e\right)(z_4-z_6) + a\left[c^2(d+e) + b^2d\right](z_7-z_9)}{2\left[a^2c^2(d+e)^2 + b^2\left(a^2d^2+c^2e^2\right)\right]} \tag{9}$$

$$\begin{array}{l} t = \frac{2}{3de(d+e)(a^4+b^4+c^4)} \\ \quad \cdot \Big[ \left[ d\left(a^4+b^4+b^2c^2\right) - c^2e\left(a^2-b^2\right) \right](z_1+z_3) \\ \quad - \left[ d\left(a^4+c^4+b^2c^2\right) + e\left(a^4+c^4+a^2b^2\right) \right](z_4+z_6) \\ \quad + \left[ e\left(b^4+c^4+a^2b^2\right) + a^2d\left(b^2-c^2\right) \right](z_7+z_9) \\ \quad + d\left[ b^4(z_2-3z_5) + c^4(3z_2-z_5) + \left(a^4-2b^2c^2\right)(z_2-z_5) \right] \\ \quad + e\left[ a^4(z_8-3z_5) + b^4(3z_8-z_5) + \left(c^4-2a^2b^2\right)(z_8-z_5) \right] \\ \quad - 2\left[ a^2d\left(b^2-c^2\right)z_8 - c^2e\left(a^2-b^2\right)z_2 \right] \Big] \end{array} \tag{10}$$

where the parameters *a*, *b*, *c*, *d*, and *e* are the linear distances calculated from the Haversine formula (Equations (3)–(5)), and the *z* values are elevation values from the neighbors of a moving window (Figure 1).


## 2.1.4. Terrain Attributes

Local attributes, such as slope, aspect, and curvatures, are calculated from the partial derivatives of terrain [8]. The slope gradient (*G*, Equation (11)) is a flow attribute related to the velocity of gravity-driven flows; their direction is measured by the slope aspect (*A*, Equations (12) and (13)). Additionally, one can calculate the degree to which a slope faces north or east, resulting in the Northernness (*AN*, Equation (14)) and Easternness (*AE*, Equation (15)) derived from the aspect. The remaining flux attributes that can be calculated from the first- and second-order partial derivatives are the horizontal (*kh*, Equation (16)) and vertical (*kv*, Equation (17)) curvatures. The horizontal curvature indicates whether a lateral flow converges (*kh* < 0) or diverges (*kh* > 0), while the vertical curvature measures the relative acceleration (*kv* > 0) or deceleration (*kv* < 0) of a gravity-driven flow:

$$G = \arctan\sqrt{p^2 + q^2} \tag{11}$$

$$\begin{array}{ll} A = & -90 \left[ 1 - \text{sign}(q) \right] \left( 1 - \left| \text{sign}(p) \right| \right) + 180 \left[ 1 + \text{sign}(p) \right] \\ & - \frac{180}{\pi} \text{sign}(p) \text{arccos} \left( \frac{-q}{\sqrt{p^2 + q^2}} \right) \end{array} \tag{12}$$

$$\text{sign}(\mathbf{x}) = \begin{cases} 1 & \text{for:} \mathbf{x} > 0 \\ 0 & \text{for:} \mathbf{x} = 0 \\ -1 & \text{for:} \mathbf{x} < 0 \end{cases} \tag{13}$$

$$A_N = \cos A \tag{14}$$

$$A_E = \sin A \tag{15}$$

$$k_h = -\frac{q^2r - 2pqs + p^2t}{\left(p^2 + q^2\right)\sqrt{1 + p^2 + q^2}} \tag{16}$$

$$k_v = -\frac{p^2r + 2pqs + q^2t}{\left(p^2 + q^2\right)\sqrt{\left(1 + p^2 + q^2\right)^3}} \tag{17}$$

Differently from flow attributes, which are gravity-field-specific variables, form attributes are related to the principal sections of terrain [8]. The mean curvature (*H*, Equation (18)) is a half-sum of any two orthogonal normal sections and represents two accumulation mechanisms of gravity-driven flows with equal weights: convergence and relative deceleration. Among the form attributes, the Gaussian curvature (*K*, Equation (19)) is the product of the maximal (*kmax*) and minimal (*kmin*) curvatures. These two principal curvatures give the highest and lowest curvature at a given point of the topographic surface. The maximal curvature (*kmax*, Equation (20)) is useful for mapping ridges (*kmax* > 0) and closed depressions (*kmax* < 0). Likewise, the minimal curvature (*kmin*, Equation (21)) is useful for identifying hills (*kmin* > 0) and valleys (*kmin* < 0) across the topographic surface. From the mean and Gaussian curvatures, a landform classification can be generated following [27], who proposed the continuous form of the Gaussian classification [8,28]. Instead of providing categorical values, the shape index (*SI*, Equation (22)) ranges from −1 to 1 and maps convex (*SI* > 0) and concave (*SI* < 0) landforms:

$$H = -\frac{\left(1 + q^2\right)r - 2pqs + \left(1 + p^2\right)t}{2\sqrt{\left(1 + p^2 + q^2\right)^3}}\tag{18}$$

$$K = \frac{rt - s^2}{\left(1 + p^2 + q^2\right)^2} \tag{19}$$

$$k_{\text{max}} = H + \sqrt{H^2 - K} \tag{20}$$

$$k_{\text{min}} = H - \sqrt{H^2 - K} \tag{21}$$

$$SI = \frac{2}{\pi} \arctan{\frac{H}{\sqrt{H^2 - K}}}\tag{22}$$
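The following sketch evaluates the slope and curvature attributes (Equations (11) and (16)–(22)) from fitted derivatives; the input derivative values are hypothetical, and the aspect formulas (Equations (12)–(15)) are omitted for brevity:

```python
from math import atan, atan2, degrees, pi, sqrt

def local_attributes(r, t, s, p, q):
    """Slope, curvatures, and shape index from the partial derivatives
    (Equations (11) and (16)-(22)); the slope is returned in degrees."""
    g = degrees(atan(sqrt(p**2 + q**2)))                              # Eq. (11)
    w = sqrt(1 + p**2 + q**2)
    kh = -(q**2 * r - 2*p*q*s + p**2 * t) / ((p**2 + q**2) * w)       # Eq. (16)
    kv = -(p**2 * r + 2*p*q*s + q**2 * t) / ((p**2 + q**2) * w**3)    # Eq. (17)
    h = -((1 + q**2) * r - 2*p*q*s + (1 + p**2) * t) / (2 * w**3)     # Eq. (18)
    k = (r*t - s**2) / (1 + p**2 + q**2) ** 2                         # Eq. (19)
    kmax = h + sqrt(h**2 - k)                                         # Eq. (20)
    kmin = h - sqrt(h**2 - k)                                         # Eq. (21)
    si = (2 / pi) * atan2(h, sqrt(h**2 - k))                          # Eq. (22)
    return g, kh, kv, h, k, kmax, kmin, si

# Hypothetical derivatives for a gently convex, north-facing cell.
print(local_attributes(r=-1e-4, t=-2e-4, s=1e-5, p=0.02, q=0.05))
```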

## *2.2. Package Description*

The calculation methods presented in this paper were developed using the JavaScript programming interface available in the online code editor of GEE. TAGEE was developed as different calculation modules, mirroring the structure described in the Methods. The first module, *calculateParameters*, uses convolution kernels and the Haversine formula to retrieve the elevation values and spheroidal geometries of a 3 × 3 moving window. In this module, a digital elevation model and a square polygon representing the bounding box (min. longitude, min. latitude, max. longitude, and max. latitude, in the WGS84 coordinate reference system) are required as input parameters. The bounding box is used in this and the other modules to generate images with constant values and to restrict the calculations to the study area. The first module returns an image with 14 bands, i.e., the neighbor elevation values (from *Z*1 to *Z*9) and the distances (*a*, *b*, *c*, *d*, and *e*) (Figure 1).

Once the basic parameters (elevations and distances) are established, the partial derivatives of terrain are calculated with the *calculateDerivatives* module. This second module requires the parameters returned by *calculateParameters* and the bounding box of the study region, and it adds the partial derivatives (*r*, *t*, *s*, *p*, and *q*) as new bands to the previous image. Then, the terrain attributes are calculated by the *calculateAttributes* module (Figure 2).

**Figure 2.** TAGEE modules for calculating terrain parameters, derivatives, and attributes.

Terrain attributes can also be calculated by a single function, without calling the intermediate modules. The final output, for both alternatives (Figure 2), is a multi-band object with the same data properties as the digital elevation model (resolution, data type, and coordinate reference system) and 13 bands (Table 1). The final attributes can be used for further modeling inside GEE or for thematic mapping.


**Table 1.** Attributes of terrain, with their units and description, calculated by TAGEE package.

The package has an additional feature that makes the visualization of terrain attributes easier. As the range of attribute values and the pixel resolution may vary with the visualization level (zoom), which impacts the estimated geometries and neighbor elevation values, a module called *makeVisualization* automatically calculates a dynamic legend defined by the 0.05 and 0.95 percentiles within the bounding box. In addition, different color palettes for the map legend are available in TAGEE: rainbow, inferno, cubehelix, red2green, green2red, elevation, and aspect. The package code and a minimal reproducible example are available at https://github.com/zecojls/tagee (Supplementary Materials).

## *2.3. Statistical Evaluation*

We evaluated the TAGEE attributes by comparing them with the aspect and slope derived from two available GEE functions (ee.Terrain.aspect and ee.Terrain.slope) on a near-global scale. For this task, we used Pearson correlation analysis with the 30 m SRTM DEM, which contains elevation in meters for the area between about 60° north and 56° south latitude. It is important to mention that the currently available terrain functions of GEE compute the local gradient using the four-connected neighbors of each pixel, differently from the proposed TAGEE method, which uses a 3 × 3 pixel window and also considers spheroidal geometries in its calculation; thus, small differences between the calculation methods are expected. This analysis was performed in GEE and, in addition to Pearson's correlation, we calculated the relative mean absolute error (MAE) between the outputs. The relative MAE is estimated by calculating the mean absolute difference between two rasters and standardizing the result by the range (maximum minus minimum values) of the reference raster.
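A NumPy sketch of these two comparison metrics, assuming the rasters are co-registered and flattened to equal-length arrays; the synthetic slope data below stand in for the real rasters:

```python
import numpy as np

def compare_rasters(reference, estimate):
    """Pearson's r and relative MAE between two co-registered rasters."""
    ref = np.asarray(reference, dtype=float).ravel()
    est = np.asarray(estimate, dtype=float).ravel()
    r = np.corrcoef(ref, est)[0, 1]                  # Pearson correlation
    mae = np.abs(ref - est).mean()
    rel_mae = mae / (ref.max() - ref.min())          # standardized by reference range
    return r, rel_mae

rng = np.random.default_rng(0)
slope_ref = rng.uniform(0, 45, size=10_000)           # synthetic slope raster (degrees)
slope_est = slope_ref + rng.normal(0, 1, size=10_000)  # estimate with small noise
r, rel_mae = compare_rasters(slope_ref, slope_est)
print(f"r = {r:.2f}, relative MAE = {rel_mae:.1%}")
```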

Similarly, we compared the results from TAGEE with terrain attributes calculated by the System for Automated Geoscientific Analyses (SAGA) GIS version 2.3.2 [12]. In this case, we downloaded from GEE the 30 m SRTM DEM together with the 12 attributes calculated by TAGEE, all covering Mount Ararat (located between 44.2° and 44.5° E, and 39.6° and 39.8° N). Mount Ararat was selected due to its high variability of landforms and the availability of published maps from previous works [8,29], allowing visual comparison of spatial patterns. The Mount Ararat SRTM DEM was processed in SAGA GIS using the "Slope, Aspect, Curvature" tool from the Morphometry module of Terrain Analysis. The calculation method was "Evans (1979)", based on six parameters and second-order polynomials, similar to the TAGEE calculation method. The comparison was performed by calculating Pearson's correlation coefficient (*r*) and the relative MAE, where the aspect, slope, horizontal curvature, and vertical curvature from TAGEE were compared with the aspect, slope, tangential curvature, and profile curvature from SAGA GIS, respectively, following the equivalence described in [8].

## **3. Results and Discussion**

The statistical analysis revealed a significant correlation (*p* < 0.01) of the TAGEE outputs with the equivalent terrain attributes calculated by GEE and SAGA GIS (Table 2). The slope estimated over a near-global extent reached a correlation of 0.98 (error of 2%) between TAGEE and the GEE functions, while the aspect resulted in a Pearson's *r* of 0.89 (error of 13%). The lower correlation of aspect can be associated with its circular nature as well as with the differences in calculation methods between TAGEE and GEE. Despite the small differences, TAGEE revealed the same spatial patterns and allowed the estimation of additional attributes at the global scale, such as the horizontal and vertical curvatures and the Northernness (Figure 3A–C, respectively). The main mountain ranges of the Earth, such as the Rocky Mountains in North America, the Andes in South America, the Alps in Europe, and the Himalayas and Tibetan Plateau in Asia, present the highest curvatures calculated by TAGEE. Conversely, plains and flat surfaces had the lowest estimates for both curvatures. The degree of orientation to the north (Figure 3C) also depicts the main landforms of the Earth.


**Table 2.** Comparison of TAGEE attributes with outputs from GEE and SAGA GIS algorithms.

> \* Significant for *p* < 0.01; 1 relative mean absolute error.

TAGEE was developed in GEE to take advantage of the high-performance computing of the platform. As cloud-based interfaces have created much enthusiasm and engagement in the remote-sensing and geospatial fields, many processing algorithms have been adapted to make substantive progress on global challenges involving the processing of big geospatial data [30]. In this sense, GEE provides petabytes of publicly available remote-sensing imagery and other ready-to-use products. The high-speed parallel processing of GEE servers and the libraries of operators and machine-learning algorithms, available through Application Programming Interfaces (APIs) in popular coding languages such as JavaScript and Python, enable users to discover, analyze, and visualize geospatial big data without needing access to supercomputers [30]. Within this framework, TAGEE supports the development of customized terrain analysis with different elevation data across large geographical extents.

When the TAGEE outputs were compared with those from SAGA GIS (Table 2), the statistical evaluation resulted in significant and high correlations for the slope and the horizontal and vertical curvatures of terrain (Pearson's *r* of 0.98, with relative errors of 3% and 4%). The aspects from TAGEE and SAGA GIS had a lower correlation coefficient, but the result was higher than that of the aspect from the GEE algorithm. The region of Mount Ararat was also used to visually compare the slope and the horizontal and vertical curvatures calculated by both TAGEE and SAGA GIS (Figure 4). The 3D visualizations revealed a high similarity between the maps, but some small differences can be seen in the color intensity. This is the case for the slope of Mount Ararat calculated by TAGEE (Figure 4A), which had a higher intensity than the slope from SAGA GIS (Figure 4B). A slightly higher intensity for the vertical curvature calculated by SAGA GIS was also evident on an edge of Mount Ararat (Figure 4F). Despite being small, these visual differences are consistent with the relative errors of the two methods (Table 2). In addition, the spatial patterns of aspect, slope, and curvatures from TAGEE presented a high correspondence with the terrain maps of Mount Ararat available in [8,29], reinforcing confidence in the TAGEE calculation method.

**Figure 3.** Example of terrain attributes calculated from TAGEE package and 1 arc-second SRTM DEM, displayed for the near-global extent at the visualization level 3 (~20 km pixel resolution): horizontal curvature (**A**), vertical curvature (**B**), and Northernness (**C**).

**Figure 4.** 3D visualizations of terrain attributes produced near Mount Ararat: slope, horizontal and vertical curvature from TAGEE (**A**,**C**,**E**, respectively) and SAGA GIS (**B**,**D**,**F**, respectively). 3D maps are displayed with a vertical exaggeration of 2.

In this work, the TAGEE algorithm was developed to consider spheroidal geometries in its calculation method. This approach diverges from the techniques available in traditional GIS, as TAGEE considers the great-circle distances between the DEM nodes defined by their latitude and longitude positions. Common GIS software, such as SAGA GIS, requires projecting the DEM to ensure the elevation data have the same pixel size. However, as identified by [25], some researchers continue to apply square-grid algorithms to spheroidal equal angular DEMs, which can lead to substantial computational errors in models of morphometric variables. The small relative errors between TAGEE and GEE or SAGA GIS could be linked to the differences in their calculation methods.

Finally, some limitations of TAGEE can also be noted. Only local morphometric variables, comprising flux and form attributes, can be calculated by the package. Non-local attributes, such as the specific catchment area, were not implemented due to the absence of a general analytical theory, which is still little developed [29], and due to the recursive processing that remains challenging within GEE [17]. Furthermore, a novel method has become available to handle major problems of terrain analysis, including the approximation of DEMs, generalization and denoising, and the computation of morphometric variables: the universal spectral analytical method based on high-order orthogonal expansions using Chebyshev polynomials was developed by [31] to handle these issues in an integrated framework, but it was not implemented in this work.

## **4. Conclusions**

The proposed package (TAGEE) can calculate terrain attributes using the high-performance platform of GEE with an accuracy equivalent to traditional GIS. The approach of using spheroidal geometries does not require the projection of the input elevation data for calculating terrain attributes. The comparison between algorithms demonstrated that TAGEE estimates terrain slope and aspect similarly to the available GEE functions. The advantage of TAGEE over the currently available functions is that additional outputs can be produced, such as curvatures and the shape index, which can be useful for environmental mapping and modeling studies. In addition, good agreement was found when TAGEE was compared with equivalent outputs from SAGA GIS, reaching Pearson's correlation coefficients between 0.96 and 0.98 and relative differences of 3–4%. Thus, TAGEE is a feasible tool for terrain analysis of big geospatial data, which can be customized to any spatial resolution and scaled up to the global extent.

**Supplementary Materials:** The package code and a minimal reproducible example are available online at https://github.com/zecojls/tagee.

**Author Contributions:** Conceptualization, José Lucas Safanelli; Methodology, José Lucas Safanelli, Raul Roberto Poppiel, and Luis Fernando Chimelo Ruiz; Software, José Lucas Safanelli; Validation, José Lucas Safanelli, Luis Fernando Chimelo Ruiz, Fellipe Alcantara de Oliveira Mello, and Rodnei Rizzo; Writing—Original Draft Preparation, José Lucas Safanelli; Writing—Review and Editing, all authors; Supervision, José A. M. Demattê; Funding Acquisition, José A. M. Demattê. Validation, Writing—Review & Editing, Benito Roberto Bonfatti. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by São Paulo Research Foundation, Grant Nos. 2014/22262-0 and 2016/01597-9.

**Acknowledgments:** The authors are grateful to the Geotechnologies in Soil Science (GEOCIS) group.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


©2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Integrating Geovisual Analytics with Machine Learning for Human Mobility Pattern Discovery**

### **Tong Zhang 1, Jianlong Wang 1, Chenrong Cui 1, Yicong Li 1, Wei He 1, Yonghua Lu 2,\* and Qinghua Qiao 3**

1 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China; zhangt@whu.edu.cn (T.Z.); wangjianlong@whu.edu.cn (J.W.); ccr2017@whu.edu.cn (C.C.); olivelee@whu.edu.cn (Y.L.); he\_wei@whu.edu.cn (W.H.)


Received: 19 August 2019; Accepted: 27 September 2019; Published: 30 September 2019

**Abstract:** Understanding human movement patterns is of fundamental importance in transportation planning and management. We propose to examine complex public transit travel patterns over a large-scale transit network, which is challenging since it involves thousands of transit passengers and massive data from heterogeneous sources. Additionally, efficient representation and visualization of discovered travel patterns is difficult given a large number of transit trips. To address these challenges, this study leverages advanced machine learning methods to identify time-varying mobility patterns based on smart card data and other urban data. The proposed approach delivers a comprehensive solution to pre-process, analyze, and visualize complex public transit travel patterns. This approach first fuses smart card data with other urban data to reconstruct original transit trips. We use two machine learning methods: a clustering algorithm that extracts transit corridors to represent primary mobility connections between regions, and a graph-embedding algorithm that discovers hierarchical mobility community structures. We also devise compact and effective multi-scale visualization forms to represent the discovered travel behavior dynamics. An interactive web-based mapping prototype is developed to integrate advanced machine learning methods with specific visualizations to characterize transit travel behavior patterns and to enable visual exploration of transit mobility patterns at different scales and resolutions over space and time. The proposed approach is evaluated using multi-source big transit data (e.g., smart card data, transit network data, and bus trajectory data) collected in Shenzhen City, China. Evaluation of our prototype demonstrates that the proposed visual analytics approach offers a scalable and effective solution for discovering meaningful travel patterns across large metropolitan areas.

**Keywords:** geovisual analytics; machine learning; smart card data; transit corridor; mobility community; trip

## **1. Introduction**

Monitoring human movement is of fundamental importance in transportation planning and management. To facilitate public transit planning and operational management, it is appealing to understand transit movement patterns across space and time [1]. Fortunately, recent advanced geospatial data collection technologies, such as global positioning systems, digital mapping, smart card automated fare payment systems, and wireless communication techniques, are generating a wealth of spatially and temporally varying transit data that create opportunities to discover meaningful and significant movement patterns over large metropolitan areas [2,3]. Various data mining methods have been developed to uncover transit travel behavior patterns on the basis of these heterogeneous geospatial datasets, including clustering for passenger segmentation [4,5], hazard modeling for loyalty analysis [6], trip chaining methods for destination estimation [7], and choice modeling for passenger activity analysis [8].

Over the past few years, many studies have been conducted to explore urban travel patterns using various modeling and analytical approaches based on massive human mobility data, such as optimization-based routing equilibrium models for congestion alleviation [9], clustering-based correlated analyses of mobility similarities and social relationships [10], low-level mobility pattern discovery [11], and multi-scale exploration of social fragmentation [12]. With the availability of massive human mobility data, machine learning techniques have been playing an increasingly important role in gaining a deep understanding of human mobility behavior [13,14], ranging from movement pattern mining [15–17], mobility prediction [18–20], and movement mode classification [21], to lifestyle discovery and prediction [22].

Recently, many attempts to visualize massive human mobility data, including cell phone data [23,24], taxi movement data [25], and social media data [26], have been reported. Some systems have been developed to perform visual analytics on smart card data [27,28], aiming to discover salient travel patterns to improve public transit planning and management. These efforts mostly focus on novel visualization designs by aggregating individual trip information into compact visual forms. With these visualization tools, users can discover and analyze significant travel characteristics efficiently. Nevertheless, most of these methods focus on the visualization of simple and intuitive spatio-temporal movement patterns, such as place-based flow variations, inter-area flow maps, or accessibility maps. While transit planners and operational managers need to reveal complex movement patterns at different spatio-temporal scales, these intuitive tools are not adequate since they are based on simple statistical methods. This motivates us to investigate the possibility of applying machine learning techniques to identify high-level, complex transit movement patterns that support advanced transit planning and management.

It can be argued that visualization should be enhanced by advanced machine learning methods, given the overwhelming size and complexity of transit data. Over the past few years, researchers have developed visual analytics tools to support interactive exploration of spatio-temporal movement patterns using massive amounts of mobility data. Among these efforts, machine learning methods have been used for pattern discovery and analysis. For example, von Landesberger et al. [29] propose to integrate interactive spatio-temporal clustering and aggregated graph representations to discover abstracted urban movement patterns using social media and mobile phone data. We choose to discover public transit corridors and mobility communities using two state-of-the-art machine learning algorithms because they produce representative high-level, complex transit mobility patterns that are useful for transit and urban planning. Furthermore, we develop specific interactive visualization forms to facilitate the understanding of identified corridors and community structure. We argue that the combination of machine learning and geovisualization is beneficial for gaining a deep understanding of complex transit mobility patterns in large metropolitan areas. We propose to examine high-level, complex public transit travel patterns using visual analytics over large-scale transit networks, which is challenging since thousands of transit passengers and massive amounts of data from heterogeneous sources are involved. Moreover, efficient representation and visualization of discovered travel patterns is also a difficult task given large quantities of transit trips. To address these challenges, this study leverages advanced machine learning methods to identify time-varying mobility patterns based on multi-source transit big data. We also devise compact multi-scale visualization forms to represent the discovered travel behavior dynamics. A web-based prototype is developed to implement the proposed geovisual analytics approach within an integrated graphic interface, enabling in-depth analysis of multi-source massive transit data. We evaluate the prototype with realistic transit data collected in Shenzhen City, China. Our empirical usability study demonstrates that our approach can offer a scalable and effective solution for discovering meaningful travel patterns across large metropolitan areas.

In this study, we aim to identify spatially and temporally varying transit movement patterns based on massive transit data over a large public transit network. We make the following technical contributions:


## **2. Data**

Deployed on public transit vehicles, smart card automated fare payment systems provide an efficient way to collect large volumes of travel data at the individual level. The proposed approach utilizes smart card data (SCD) collected in Shenzhen City, China. Shenzhen City has a large bus and subway network consisting of 8 subway lines, 199 subway stations, 808 bus routes, and 6226 bus stops (Figure 1).

**Figure 1.** The study region and public transit network.

We use a week of SCD from 3 to 9 April 2017. The SCD used in this study holds the names of boarding and alighting stations for each subway passenger. Bus passengers are not required to tap their smart cards when alighting; therefore, the information on alighting bus stops is not recorded. In addition to the SCD, we have access to bus trajectory data, public transit network data, and road network data. These three datasets are registered to the same georeference framework, i.e., the World Geodetic System 1984 (WGS 84) coordinate system and Universal Transverse Mercator (UTM) Zone 50. The public transit network dataset contains the location, identification, and schedule information of all subway lines and bus routes. Based on GPS devices installed on each bus, we can obtain bus trajectory data such as longitude and latitude, speed, and travel direction at approximately 20–60-second intervals. In addition, bus identification information, including license plate numbers, transit line numbers, and company names, is saved. With more than 6 million records collected each day, the SCD set for the week amounts to 6.5 GB. Each day, the bus trajectory dataset contains approximately 63–73 million GPS records.

## **3. Methodology**

## *3.1. Methodology Overview*

Public transit systems contain multiple components: bus stops, subway stations, bus and subway lines, bus and subway vehicles, and passengers. Most existing literature focuses on the analytics of transit lines/stops [30], schedules [31], or aggregated transit trips [32,33]. Some have explored the relationships between transit trips and points of interest [28] but have not leveraged advanced machine learning methods to analyze complex travel patterns and global mobility structures. Given a large amount of trip data, one may wish to identify significant spatio-temporal travel patterns, reveal global mobility structures, and visualize them on interactive maps. For example, questions can be raised to find interconnected road segments that carry significant transit travel demand at a global scale or to delineate areas with similar transit travel characteristics. What specific spatio-temporal patterns can be discovered from these road segments and areas, and how do these patterns evolve over time? Can we jointly examine transit travel patterns from different aspects of public transit services in an integrated interactive user interface?

To answer these questions, we can define geovisual analytics tasks as follows:


The proposed approach delivers a comprehensive solution to pre-process, analyze, and visualize complex public transit travel patterns (Figure 2). This approach first fuses SCD with other urban data to reconstruct transit trips. It then segments the study region into hierarchical areal units (i.e., public transit mobility communities) according to mobility features of trips and static local features using graph embedding. Based on the recovered public transit trip data, we develop a clustering-based algorithm for extracting transit corridors to represent mobility interactions between different regions. Based on the detected corridors and mobility communities, we develop various visualization forms to represent these transit movement patterns on maps and other views. An interactive web-based mapping prototype is also developed to enable the visual exploration of mobility structures over space and time. Specific visualization forms are designed and implemented in the web-based prototype to facilitate the analysis of massive transit data, including map-based visualization, focus views, and auxiliary views, which are detailed in Section 3.5. The discovered mobility community structure and transit corridors can be visualized on interactive maps. The focus views consist of four types of visualization: (1) a corridor detail view that shows detailed trip information on a simplified schematic map for each selected corridor; (2) a community tracking view that presents the evolving changes of specific communities across time; (3) a stop glyph plotting the statistical information of all trips that originate at, end at, or pass a specific bus stop or subway station; and (4) a corridor–community correlation view illustrating the spatio-temporal correlations between transit corridors and mobility communities using parallel coordinate plots.

**Figure 2.** Overview of the proposed geovisual analytic approach. SCD—smart card data.

The trip reconstruction and corridor discovery methods have been described in a separate paper [34]. Below, we briefly introduce the two methods. Implementation details are reported in Section 4.

## *3.2. Data Pre-Processing and Trip Reconstruction*

Following the procedure developed in Zhang et al. [34], we performed data pre-processing for the original datasets and reconstructed transit trips, which were used in subsequent geovisual analytics tasks. For a passenger, a public transit trip consists of multiple consecutively linked trip legs with a specific travel purpose [35]. Our visual analytics approach is based on trips rather than trip legs because trips are better at revealing realistic travel demands and behavior patterns. In this subsection, we briefly describe the data pre-processing and trip reconstruction steps.

After removing erroneous SCD and bus trajectory records, we corrected inconsistent stop names and locations between heterogeneous datasets based on the method developed by [36]. All of the datasets were imported into Microsoft SQL Server databases in which spatial indices were constructed based on transit network data to accelerate data query and trip reconstruction. The original SCD was divided into subway-based and bus-based datasets by each date, since subway-based records have full boarding and alighting information while an alighting stop estimation procedure was required for bus-based SCD.

First, we needed to estimate boarding and alighting stops for bus-based trip legs. For each bus-based SCD record, the boarding stop is identified by matching the license plate number between the SCD and the bus trajectory dataset to find GPS sampling points close to the boarding time; these points are then matched to the transit network to find the most probable boarding stop. Then, we proceeded to estimate alighting stops: (1) the alighting stop can be easily derived by searching for the closest stop to the next boarding stop if a passenger makes another boarding during the same day; (2) if the current trip leg is the final one of the day, the first boarding stop of the next day is used to estimate the possible alighting stop of this last trip leg; (3) otherwise, we can search other dates or other similar passengers to make estimations. Once both boarding and alighting stops were available, complete bus trip legs could be recovered. These bus trip legs were then connected with subway legs to reconstruct a complete trip if the trip legs taken by the same individual were within a 30-min threshold.
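
A minimal sketch of the three alighting-stop rules is given below. The `Stop` and `Leg` types and the fallback in rule 3 (most frequent first boarding on other dates) are simplified, hypothetical stand-ins for the SQL Server pipeline described above, not the authors' code:

```python
from collections import Counter
from dataclasses import dataclass, field
import math

@dataclass(frozen=True)
class Stop:
    stop_id: str
    x: float
    y: float
    def distance_to(self, other: "Stop") -> float:
        return math.hypot(self.x - other.x, self.y - other.y)

@dataclass
class Leg:
    boarding_stop: Stop
    route_stops: list = field(default_factory=list)  # stops along this route

def estimate_alighting(legs_by_day, card_id, day, i):
    """Estimate the alighting stop of the i-th bus leg for one card and day."""
    legs = legs_by_day[(card_id, day)]
    if i + 1 < len(legs):                      # rule 1: next boarding, same day
        anchor = legs[i + 1].boarding_stop
    elif (card_id, day + 1) in legs_by_day:    # rule 2: first boarding, next day
        anchor = legs_by_day[(card_id, day + 1)][0].boarding_stop
    else:                                      # rule 3: fall back to other dates
        firsts = [v[0].boarding_stop
                  for (cid, _), v in legs_by_day.items() if cid == card_id]
        anchor = Counter(firsts).most_common(1)[0][0]
    # snap the anchor to the nearest stop on the leg's own route
    return min(legs[i].route_stops, key=lambda s: s.distance_to(anchor))
```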

## *3.3. Extracting Transit Corridors*

The concept of "transit corridors" has been widely adopted and put into real-world planning and managemen<sup>t</sup> practices [37]. We define a corridor as a directional linear road segmen<sup>t</sup> consisting of multiple transit stops with significant numbers of passengers. Note that corridors may contain multiple branches and may overlap with other corridors. Based on massive transit trip data, time-varying transit corridors can be extracted to represent the most significant travel demand patterns across space and time. The corridor-extracting algorithm is developed on the basis of public transit trips, each of which is characterized by one single travel purpose. Each trip may consist of multiple legs, and each leg corresponds to one smart card transaction. We proposed a share-flow clustering algorithm [34]. The algorithm was based on the concept of "accumulated transit flow", which calculates the number of stops each passenger passes after boarding. If two stops have a large shared "accumulated transit flow", then the downstream stop is "directly transit-flow accessible" from its preceding adjacent stop. Starting from stops with a large amount of "accumulated transit flow", the algorithm iteratively evaluates adjacent stops along the travel direction. If this next stop is directly transit-flow accessible from the current stop, the road segmen<sup>t</sup> between the current and the next stop will be linked to the current corridor. After the initial growth of corridors, a pruning and merging process is performed to remove short and non-significant corridor candidates. By this clustering algorithm, linear corridors can be discovered dynamically for any specific time interval. The algorithm can be described in the following steps:


Usually, the algorithm can extract 5–10 transit corridors for peak hours and non-peak hours during weekdays and weekends based on the trip data we have produced.
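
The growing-and-pruning idea can be condensed into a short sketch. The version below is an illustration rather than the C++ implementation reported in Section 4; it omits branch handling and the merging of overlapping corridor candidates, and `next_stop` and `shared_flow` are assumed inputs:

```python
def grow_corridors(stops, flow, next_stop, shared_flow,
                   seed_threshold, share_threshold, min_length=3):
    """Grow corridors from high-flow seed stops along the travel direction.

    flow[s]: accumulated transit flow at stop s.
    next_stop[s]: downstream neighbour of s along the travel direction.
    shared_flow(a, b): accumulated flow common to adjacent stops a and b.
    """
    corridors = []
    seeds = [s for s in stops if flow[s] >= seed_threshold]
    for seed in sorted(seeds, key=lambda s: flow[s], reverse=True):
        corridor, current = [seed], seed
        while current in next_stop:
            nxt = next_stop[current]
            if shared_flow(current, nxt) < share_threshold:
                break                    # nxt is not transit-flow accessible
            corridor.append(nxt)         # link the road segment to the corridor
            current = nxt
        if len(corridor) >= min_length:  # pruning: drop short candidates
            corridors.append(corridor)
    return corridors
```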

## *3.4. Discovering Mobility Communities*

It is desirable to represent high-level urban mobility structures with multi-scale communities when dealing with overwhelming amounts of mobility data [38]. Each mobility community exhibits similar travel characteristics. The representation of a hierarchical community structure can significantly facilitate the understanding of inter-area interconnections in a city. Traditionally, the construction of community structure relies on community detection algorithms developed in network science [39]. These algorithms first build a graph to represent connections between nodes and then employ clustering, optimization, or statistical inference methods to divide the entire graph into groups, ensuring that nodes within each group are more densely connected internally than externally [40]. However, public transit passengers usually travel long distances away from their origins, and these mobility behaviors must be accounted for when extracting the transit mobility structure. This study proposes a different community definition that considers not only local trip statistics but also trip destinations and other dynamic travel characteristics, such as travel frequency and transferring patterns. All of this information can be readily computed from trip data.

Instead of denoting each subway station or bus stop by a graph node, we use region partitions that share similar sets of possible boarding stations nearby. We first partition the study region into regular grid cells, each with a size of 100 m × 100 m, and remove grid cells located in mountainous and water areas (inaccessible by transit services). Then a vector is built for each cell to record the possible boarding stations (stops) that are close to it. Finally, a heuristic algorithm is employed to merge neighboring cells with the most similar vectors. After merging the original grid cells, we obtain 18,109 grid groups, most of which consist of 2–7 original grid cells. These grid groups are then denoted as graph nodes, whose number is much smaller than that of the original transit stops, thereby dramatically reducing the computational expense of community detection.
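
The merging heuristic can be sketched as follows; the cosine similarity measure and the fixed threshold are illustrative assumptions, since the description above specifies only that neighboring cells with the most similar vectors are merged:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two boarding-station vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def merge_cells(cells, cell_vectors, neighbours, threshold=0.9):
    """Merge each cell with its most similar adjacent cell, if similar enough.

    cell_vectors[c]: binary vector over stations near cell c.
    neighbours(c): iterable of cells adjacent to c.
    Returns a dict mapping each cell to its grid-group representative.
    """
    group_of = {c: c for c in cells}            # union-find style labels
    def find(c):
        while group_of[c] != c:
            c = group_of[c]
        return c
    for c in cells:
        best, best_sim = None, threshold
        for n in neighbours(c):
            sim = cosine(cell_vectors[c], cell_vectors[n])
            if sim >= best_sim:
                best, best_sim = n, sim
        if best is not None:
            group_of[find(c)] = find(best)      # merge with most similar neighbour
    return {c: find(c) for c in cells}
```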

Traditional community detection algorithms cannot handle our problem and are not scalable to a large public transit network. In order to handle such complex trip behavior features, we use a graph-embedding method to uncover a dynamic community from realistic SCD. Graph embedding aims to produce a compact vector representation for each node and preserves graph structure within a low dimensional space [41].

We define a directed weighted graph *G<sub>t</sub>*(*V*, *E*) for a time interval *t*. *V* is the set of grid groups, and *E* represents transit connection edges between nodes in *V*. Each edge *e* is weighted by the realistic traffic flow between its origin node and ending node during *t*. Based on these weights, we can construct a traffic flow matrix *F*, where *f*<sub>*i*→*j*</sub> denotes the number of transit passengers travelling from node *i* to node *j*. We also construct an adjacency matrix *A* to describe local connectivity between graph nodes. The matrix *A* can be used to represent first-order transit connectivity proximity. The global network structure can be preserved via a high-order proximity based on the traffic flow matrix *F*. The high-order proximity is defined as the similarity between the traffic connectivity structures of a pair of nodes.
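
In code, constructing *F* and *A* for one interval is straightforward. The sketch below assumes trips already mapped to grid-group indices and a precomputed list of adjacent node pairs:

```python
import numpy as np

def build_matrices(trips, edges, n_nodes):
    """Build the traffic flow matrix F and adjacency matrix A for interval t.

    trips: list of (origin_node, destination_node) index pairs.
    edges: list of (i, j) pairs of spatially adjacent grid groups.
    """
    F = np.zeros((n_nodes, n_nodes))   # F[i, j]: passengers travelling i -> j
    for i, j in trips:
        F[i, j] += 1
    A = np.zeros((n_nodes, n_nodes))   # first-order connectivity proximity
    for i, j in edges:
        A[i, j] = A[j, i] = 1
    return F, A
```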

Since transit travel behaviors are largely non-linear and non-stationary, we leverage deep learning methods to learn the network embedding. A classical auto-encoder framework (structural deep network embedding, SDNE) is adopted to learn latent network representations [42]. The auto-encoder framework consists of an encoder and a decoder. The encoder contains multiple layers, each of which can be defined as

$$z^{(i)} = \sigma(\mathbf{W}^{(i)} z^{(i-1)} + \mathbf{b}^{(i)}),\tag{1}$$

where *z*<sup>(*i*)</sup> denotes the hidden representation of the *i*th layer and *z*<sup>(0)</sup> is the original input data *X*, an *n*-dimensional vector. *W*<sup>(*i*)</sup> and *b*<sup>(*i*)</sup> are learnable parameters, and σ(·) is the non-linear activation function.

If we use *K* layers in the encoder, the input vector *z*<sup>(0)</sup> is mapped into a hidden representation *z*<sup>(*K*)</sup>. Correspondingly, the decoder transforms *z*<sup>(*K*)</sup> back to a reconstructed vector *Y* after performing *K* layers of nonlinear transformation operations,

$$z'^{(j+1)} = \sigma\left(W'^{(j)} z'^{(j)} + b'^{(j)}\right),\tag{2}$$

where *z*′<sup>(*j*)</sup> denotes the reconstructed data vector of the *j*th layer, and *W*′<sup>(*j*)</sup> and *b*′<sup>(*j*)</sup> are learnable parameters. Note that *z*′<sup>(0)</sup> = *z*<sup>(*K*)</sup> and *Y* = *z*′<sup>(*K*)</sup>.

The model parameters can be learned by minimizing the reconstruction error between the reconstructed vector *Y* and the input vector *X*:

$$L(X, Y) = \sum_{i=1}^{n} \|y_i - x_i\|_2^2.\tag{3}$$

If multiple transit features are used as the input vector to the encoder (including flow, speed, destination, and travel frequency), we obtain *L*<sub>1</sub>. If the adjacency matrix *A* is used as the input, we can build

$$L_2(A, Y) = \sum_{i,j=1}^{|V|} f_{i \to j} \, \|y_j^{(K)} - y_i^{(K)}\|_2^2,\tag{4}$$

which preserves the high-order proximity of *G*.

The two reconstruction error functions can be linearly combined into a comprehensive joint loss function:

$$L_{all} = L_1 + \alpha L_2.\tag{5}$$

The model is randomly initialized and optimized with stochastic gradient descent. After model convergence, we obtain the final embedding representation for all nodes in *G*. Based on the learned compact node representations, we can use hierarchical clustering to generate hierarchical mobility communities.
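
As a concrete illustration, a condensed SDNE-style training step corresponding to Equations (1)–(5) might look as follows. This sketch uses the modern tf.keras API rather than the TensorFlow 1.14 code described in Section 4; the layer sizes (18,109 → 2000 → 128), Gaussian initialization, learning rate, and α = 0.2 follow the values reported there:

```python
import tensorflow as tf

N, H, D, ALPHA = 18109, 2000, 128, 0.2
init = tf.keras.initializers.RandomNormal(mean=1.0, stddev=0.01)

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(H, activation="sigmoid", kernel_initializer=init,
                          input_shape=(N,)),
    tf.keras.layers.Dense(D, activation="sigmoid", kernel_initializer=init),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(H, activation="sigmoid", kernel_initializer=init,
                          input_shape=(D,)),
    tf.keras.layers.Dense(N, activation="sigmoid", kernel_initializer=init),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)

@tf.function
def train_step(x, f):
    """x: (B, N) batch of node feature vectors; f: (B, B) flow weights."""
    with tf.GradientTape() as tape:
        z = encoder(x)                                   # embedding z^(K)
        y = decoder(z)                                   # reconstruction Y
        l1 = tf.reduce_sum(tf.square(y - x))             # Eq. (3)
        diff = tf.expand_dims(z, 1) - tf.expand_dims(z, 0)
        l2 = tf.reduce_sum(f * tf.reduce_sum(tf.square(diff), axis=-1))  # Eq. (4)
        loss = l1 + ALPHA * l2                           # Eq. (5)
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```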

## *3.5. Visual Analytics Design*

In the literature, public transportation visualization studies mostly focus on public transit networks and represent travel statistics based on stops and routes. We propose to examine and evaluate public transit services from different perspectives, namely, hierarchical mobility communities and significant transit corridors, in addition to the public transit network. Several visual designs are proposed to facilitate this comprehensive visual analytics strategy.

Our visualization design consists of three types of views (Figure 2): (1) Map-based visualization that uses interactive maps to depict extracted mobility communities and inter-community transit flow. Detected corridors are also illustrated in the map view. (2) Focus view is designed to present detailed information on user-selected corridors, transit mobility communities, and individual transit stops. Correlations between corridors and communities can also be visualized. (3) Other auxiliary views, including a query view and a statistics view. The query view enables interactive data selection for visual analytics for any time interval. The statistics view uses statistical diagrams to present summary information on corridors.

## 3.5.1. Mobility Communities

Based on realistic SCD, we can extract two-level mobility community structures over Shenzhen City. After performing graph embedding for all grid groups, we apply hierarchical clustering to produce two levels of mobility communities (see Section 4 for implementation details). Low-level communities are based only on the distances between embedded vectors. Based on the low-level communities, we can further produce high-level communities by accounting for spatial contiguity and cohesiveness using regionalization methods. Figure 3a illustrates a high-level mobility community structure for April 3 (a holiday). Transit flows between these high-level communities are mapped to depict an overview of aggregated transit flows across the study region. The high-level community structure is favored for global travel pattern discovery and analysis. When zooming in to the low level, detailed community structures are visualized, with different colors representing different cluster types (Figure 3b). With the interactive community maps, users can perform global pattern discovery task 1 to identify the hierarchical mobility community structure and visualize inter-community interactions conveniently.

**Figure 3.** Transit-based mobility community structure on a holiday. (**a**) High-level community flow map; (**b**) low-level community clusters.

In the mobility community tracking view, one can select (from the map) one specific mobility community and track the temporal changes in its shape and flows between itself and other nearby communities (Figure 4). As the detected community structures evolve over time, a community may undergo different changes, including splitting into separate communities or merging with other communities. Each community is represented by a vertical bar, the height of which is proportional to the number of transit trips of the community. Ribbons connecting bars between different times represent transit flows between communities. Wide (narrow) ribbons indicate high (low) volumes of transit trips. The vertical positions of the bottom of bars also indicate the topological relations of communities from adjacent dates. Bars that are far away from each other correspond to communities that are also distant on the map. As we fix the position of the original selected bar, the overlapping relations can also be revealed by their relative vertical positions between two bars from adjacent dates. As community structure undergoes constant changes, users can conduct local pattern exploration task 2 to track the evolving trend of any chosen communities and gain a deep understanding of the mobility structure of the study region.

**Figure 4.** Mobility community tracking view. We can observe that the selected community "C0101" (indicating a detected community on April 1) has a significant number of transit trips connecting it and two communities on April 2 ("C0206" and "C0203"). These two are further connected with five communities (C0301, C0302, C0304, C0305, and C0308) on April 3. We can see that "C0101" is strongly connected with "C0206" and "C0301", as indicated by the width of ribbons between these communities.

## 3.5.2. Corridors

Figure 5 shows five discovered corridors on weekdays on the map. The width of a corridor represents the size of its transit flows, and the flow direction is shown by animated particles [43]. A dedicated summary glyph in the statistics view is also designed to present a compact summary of all corridors in a radial layout, in which each segment corresponds to a corridor (Figure 6). For each corridor, trips are divided into four categories: only trip origins fall within the corridor (destinations are outside); only trip destinations are located within the corridor (origins are not); both origins and destinations are within the corridor; and both origins and destinations are outside the corridor. Four bars are used to represent these four types of trips, with bar heights proportional to trip counts. The overall performance of each corridor is also represented by a line of dots in the inner circle. Performance is computed as the ratio (percentage) of on-vehicle time versus overall travel time of a transit trip; dots close to the circle center indicate low performance. Based on this corridor overview map, users can perform global pattern discovery task 2 and examine the distribution of primary transit corridors.

Users can select and observe details of a corridor (Figure 7). The layout of a corridor is simplified in a schematic form to retain only topological connections (similar to metro maps). The above-mentioned four types of trips are visualized for the selected corridor: each ribbon that connects two adjacent stops is divided into four components, and the width represents flow counts. Key stops within a corridor are depicted as rings, with red/green representing boarding/alighting counts. Passengers boarding at a stop are further categorized into two groups: those who will be alighting within the same corridor and those who will be alighting outside the corridor. The two groups are denoted by dark and light red, respectively. Similarly, passengers who alight at a stop are divided into two groups: "boarding from at least 5 stops away" and "boarding close to the current stop". Dark- and light-green colors are used to denote the two groups. When selecting a particular corridor and observing its details in the corridor detail view, users can conduct local pattern exploration task 1 to fetch boarding, alighting, origin, and destination information in a compact visualization form.

**Figure 5.** Public transit corridors for weekday peak hours (8:00AM–10:00AM).

**Figure 6.** Corridor summary glyph.

**Figure 7.** Corridor detail view.

## 3.5.3. Transit Stops

The trip information of individual transit stops is plotted in a glyph when users click on a stop on the map. Figure 8 shows that the stop glyph can visualize in-vehicle, boarding, and alighting passenger numbers. In-vehicle passengers can be further divided into boarding from distant and nearby stops. Boarding passengers consist of initial boarding and transferring passengers. Alighting passengers comprise those who finish their trips and those who transfer at this stop. One can easily tell the role played by this stop for the whole transit network: it could be an important origin stop, destination stop, or a transfer stop.

**Figure 8.** Transit stop glyph.

## 3.5.4. Correlations between Corridors and Communities

In the previous sections, we introduced our approaches to discover primary transit corridors with significant travel demands and mobility communities with similar travel patterns. Furthermore, geovisual analytics can be performed to examine correlations between these two identified time-varying mobility representations. An integrated parallel coordinate plot is designed to describe their correlations. For any pre-specified time interval, we can draw identified corridors as polylines and represent discovered communities as vertical parallel axes. For each corridor (i.e., polyline), the intersection point on an axis (i.e., a community) indicates the number of transit passengers who originate from the community towards the corridor. In this way, spatio-temporal correlations between each corridor and each community can be illustrated in a compact manner. We can easily find which community contributes the biggest portion of transit flow to a particular corridor or identify which corridor is the most correlated one for a specific community. We can also see the composition of any corridor or community. For example, the parallel coordinate plot can reveal whether most trips from a community are correlated to a few corridors or are evenly distributed over a number of corridors across the city. Note that these correlations are not equivalent to intersection relations between corridors and communities, which are explicit on maps. As long as the origins of constituent trips for a corridor can be traced to a community, the corridor and the community establish a correlation. The number of these correlations represents the intensity of interactions between a corridor–community pair. For each vertical axis, a filtering box can be specified to find corridors that meet the transit flow number search range criterion. Multiple filtering boxes on different axes can be interactively designated to further identify corridors that are correlated to selected communities based on specified transit flow ranges (Figure 9). This geovisual analytics tool enables users to undertake the comprehensive analysis task to mine correlation knowledge between corridors and mobility communities.
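
To illustrate the statistic plotted on each axis, assuming a trip table that records each trip's origin community and the corridor it uses, the correlation counts reduce to a simple cross-tabulation (a toy sketch, not the prototype's JavaScript code):

```python
import pandas as pd

# One row per reconstructed trip (illustrative data).
trips = pd.DataFrame({
    "origin_community": ["C01", "C01", "C02", "C03", "C02"],
    "corridor":         ["K1",  "K2",  "K1",  "K1",  "K2"],
})

# Rows: communities (parallel axes); columns: corridors (polylines).
correlation = pd.crosstab(trips["origin_community"], trips["corridor"])
print(correlation)

# A filtering box on one axis keeps only corridors within a flow range:
selected = correlation.columns[correlation.loc["C01"].between(1, 5)]
```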

**Figure 9.** Parallel coordinate plot to illustrate correlations between transit corridors and mobility communities. Corridors discovered for April 7 are shown in the figure. After setting three filtering boxes, four corridors can be discovered and highlighted in the plot.

## **4. Implementation and Prototype**

The trip reconstruction and corridor discovery algorithms were implemented in C++. The corridor extraction and community detection algorithms were run on a desktop computer with an Intel Xeon E3-1240 @ 3.70 GHz processor and 16 GB of memory, running the Microsoft Windows 10 operating system.

We selected stops whose traffic flow counts fall within the 85th–90th percentile as corridor seeds. The "shared transit flow" threshold (*st*) was set to the 50th percentile of the transit flow counts. The "shared accumulated transit flow" threshold (*sa*) can be set between −15% and 25%.

Mobility community structures were extracted using a hierarchical clustering tool implemented in the SciPy package, based on network embedding results produced by the structural deep network embedding (SDNE) method [27]. The graph-embedding algorithm was implemented in Python 3.6 using TensorFlow 1.14.0. The clustering was performed based on Euclidean distance. The autoencoder network contains three layers: the input layer contains 18,109 neurons, which correspond to the 18,109 grid groups in the study region; the hidden layer has 2000 neurons; and the output layer produces a 128-dimensional vector as the final embedding result for each graph node. Deeper networks led to performance degradation, as demonstrated by our tests. The model parameters were initialized from a Gaussian distribution (mean μ = 1 and standard deviation σ = 0.01). The weight α in the joint loss function (Equation (5)) was set to 0.2, since this delivered the best performance, and the learning rate was set to 0.001. In order to produce cohesive and contiguous high-level communities, we applied a regionalization algorithm, REDCAP [44], to the communities produced by SciPy.
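
For illustration, the clustering step might look like the following sketch. The "ward" linkage method and the cluster counts are assumptions (the text specifies only Euclidean distance and two levels), and a smaller random array stands in for the 18,109 SDNE embeddings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

embeddings = np.random.rand(2000, 128)      # stand-in for the SDNE output
Z = linkage(embeddings, method="ward", metric="euclidean")
low_level = fcluster(Z, t=40, criterion="maxclust")   # finer partition
high_level = fcluster(Z, t=10, criterion="maxclust")  # coarser partition
# REDCAP regionalization (not shown) then enforces spatial contiguity.
```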

We compared the performance of our community detection algorithm with a classical community detection algorithm developed by Newman and Leicht [45]. To evaluate the performance, we used the modularity metric proposed by Newman and Girvan [46].

As indicated by Table 1, our graph-embedding algorithm outperformed Newman and Leicht's algorithm by a large margin. Note that our case study is different from the regular community detection problem, in which higher modularity values indicate good community partitions. Since we encourage a community to have dense inter-community transit trips and sparse intra-community trips, lower modularity values are better.
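
For reference, the modularity metric can be computed directly from its standard weighted, undirected definition, Q = (1/2m) Σ<sub>ij</sub> [A<sub>ij</sub> − k<sub>i</sub>k<sub>j</sub>/2m] δ(c<sub>i</sub>, c<sub>j</sub>). The sketch below is a generic implementation of this Newman–Girvan form, not the evaluation code used for Table 1; recall that for this particular problem lower values are better:

```python
import numpy as np

def modularity(W, labels):
    """Weighted, undirected Newman-Girvan modularity.

    W: symmetric weight matrix; labels: community label per node.
    """
    labels = np.asarray(labels)
    k = W.sum(axis=1)                  # weighted node degrees
    two_m = W.sum()                    # total weight (counts both directions)
    same = labels[:, None] == labels[None, :]
    return ((W - np.outer(k, k) / two_m) * same).sum() / two_m

W = np.array([[0, 3, 0], [3, 0, 1], [0, 1, 0]], dtype=float)
print(modularity(W, [0, 0, 1]))
```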

**Table 1.** Performance comparison of community detection using the modularity metric (low-level communities).


The visual designs were implemented in a web-based prototype, which was developed with PyCharm Pro 2018.3.1 on a Windows 10 operating system. Major visualization modules were developed in JavaScript following the HTML5 and CSS3 standards. The user interface comprises four major components (Figure 10): (1) the query view in the upper-left portion; (2) the map view in the upper right; (3) the statistics view in the bottom-left corner; and (4) the focus views for corridors and communities in the bottom-right region. In the query view, users can specify the time range and select the SCD falling within this range for analysis. The map view depicts discovered corridors and mobility community structures, with different corridors differentiated by distinct colors. The map view embeds a Baidu Map as the background map. Flow maps can be produced to describe primary transit flows between major communities. The focus views illustrate three types of detailed displays: the corridor detail view, the community tracking view, and the corridor–community correlation view. All of these views are dynamically linked: user interactions within any view apply to the other linked views for the same data (community, transit stop, or corridor).

**Figure 10.** Web-based prototype user interface.

## **5. Analysis and Discussion**

## *5.1. Geovisual Analytics Workflow and Examples*

Typically, a user can first specify the time range of SCD for analysis. For example, she can focus on morning peak hours of a weekday and invoke back-end algorithms to extract mobility structure and primary transit corridors. Then, both corridors and community structures can be visualized in multiple linked views to enable further examination. The integration of flow map and corridor map with community map at two scales can help users understand the overall transit mobility structure of the city. At a glance, users can identify the major origin/destination areas and how many passengers travel between these areas. Meanwhile, the statistics view presents summary information on all corridors, which allows users to compare the extracted corridors in terms of their trip types and performance. Furthermore, users can select a corridor and visualize it in the detailed view to gain more information on its constituent trips. Users can also select any transit stop to see the decomposition of its boarding and alighting trips. The prototype also allows users to examine the evolution of any chosen mobility community in the detail view. With all these linked views, users can perform the comprehensive analysis task to discover global and local transit travel patterns across the city over time.

For example, users can discover high-level mobility communities for any date. Figure 3a presents these communities for a holiday. The identified community structure synthesizes transit travel patterns in a form that is much easier to understand than the original massive transit footprints. The largest community (No. 1) is located on the east side of the downtown area, which is served by multiple subway lines and dozens of subway stations. This community attracts a large number of leisure-oriented trips that originate from all over the city. On the west side of downtown, three separate communities (Nos. 2, 3, and 4) can be observed, and each attracts short trips close to it. Other communities are distributed over the suburbs, which are mostly residential areas. Many passengers living in these suburban communities travel to downtown areas for leisure purposes on the holiday.

Figure 11 shows that it can be beneficial to examine intrinsic travel patterns by integrating transit corridors with mobility communities. It can be observed that the most salient corridor connects communities Nos. 8 and 9 with community No. 1, which have the most job opportunities in the city. Based on the flow direction of the corridor (shown by animated particles), we can see that a large number of commuters travel towards community No. 1 for work in the morning. Another corridor in the northeast indicates that many passengers who live in remote outskirts make trips towards community No. 6, which features many industrial parks and high-technology companies. With these interactive maps, corridor and community information can be combined to further investigate travel origins and destinations for different times and dates, thereby deepening our understanding of evolving movement patterns across the city. These integrated maps can also help explain the interactions between transit corridors and mobility communities.

**Figure 11.** Discovered corridors and mobility communities for 11:00AM–1:00PM on weekdays.

## *5.2. User Evaluation*

Twenty-three users were interviewed to obtain comments and feedback on our geovisual analytics approach based on their experience using the prototype. Sixteen of them are experts in public transportation, and among them, nine have geovisual analytics development knowledge (experienced users). The users can be classified into three groups: (1) experienced users with background knowledge (9 users); (2) non-experienced users with background knowledge (7 users); and (3) non-experienced users without background knowledge (7 users). Before allowing them to use the prototype, we introduced the proposed geovisual analytics approach and the web-based prototype. We asked users to evaluate 6 geovisual analytics tasks: (1) to discover and visualize transit corridors; (2) to extract and visualize mobility community structure; (3) to obtain transit stop summary statistics information; (4) to evaluate the corridor detailed view; (5) to evaluate the mobility community tracking view; (6) to evaluate the corridor–community correlation view. Numeric scores were obtained from questionnaires, with "0" indicating the worst user experience and "5" indicating the best. Figure 12 summarizes the scores of interviewed users.

**Figure 12.** Average evaluation scores for 6 selected geovisual analytics tasks.

According to ratings and comments, different groups of users agreed that our integrated analytics approach and web-based prototype deliver an interesting and applicable solution for human mobility pattern discovery given massive transit data and complex transit networks. As shown in Figure 12, experienced users and users with background knowledge tended to give more positive ratings than non-experienced users or users without any background knowledge for most evaluation tasks. It may take users in the third group more time to understand the interface and functions of the system, leaving them less time to fully explore all of the views and leading to lower scores. Evaluation tasks (1) and (4) had relatively low ratings, probably due to the unfamiliar concept of transit corridors for some users. Users may have had difficulties selecting and examining particular corridors between different views, which was confirmed by subsequent feedback interviews. The corridor detail view is also not intuitive to use, according to the users' comments, since it requires users to switch their focus between the map view and the detail view frequently.

We also implemented a simplified web-based version for user evaluation. Compared to the original version, the simplified prototype only has an integrated map view to show discovered corridors and community structure. It does not implement linked views and only has limited visualizations (e.g., without animated particles to show the flow direction in corridors, without individual transit stop glyphs and corridor detail views).

With such a simplified system, we conducted user evaluation interviews with the same three groups of users. The same evaluation interview procedure was conducted to solicit their scores and feedback. Note that only four evaluation tasks were evaluated, namely, exploring transit corridors, exploring community structures, examining stop statistics, and exploring corridor details. The average evaluation scores for the simplified system are also shown in Figure 12. As we can see, these evaluation scores are significantly lower than scores obtained based on the original prototype for the evaluated four tasks.

## *5.3. Discussion*

This study adopted the concepts of community structure and transit corridor to construct high-level aggregate mobility knowledge from massive SCD and other urban data. The results of the community detection and corridor discovery algorithms were integrated into an interactive visualization interface that consists of multiple linked views to enable efficient visual analysis of spatio-temporal transit mobility patterns at multiple scales and resolutions. The map view offers an overview interface to help users preserve context information when they focus on a particular corridor, community, or stop visualization. Specific views, such as the corridor detail view, the mobility community tracking view, and the parallel coordinate correlation plot, along with summary glyphs (including the transit stop glyph and corridor summary glyph), complement the map view to provide intuitive geovisual analytics tools for discovering detailed knowledge of any specific component of the public transit system. The advantages of integrating machine learning with interactive visualization can be summarized as follows:


For most city residents, transit travel follows a weekly rhythm: they commute to work on every weekday and enjoy their leisure time on weekends. One week of data can therefore be sufficient to extract typical transit movement patterns for the study area. In the literature, other researchers have also used one-week public transit data (i.e., smart card data) in their studies. For example, Long and Thill [47] examined job–housing relationships in Beijing with one week of bus-based SCD. Alsger et al. [48] validated origin–destination estimation algorithms based on one-week SCD from Southeast Queensland, Australia. If we can access SCD and GPS trajectory data from other time periods, the same geovisual analytics approach can be readily applied.

## **6. Conclusions and Further Work**

In this study, we applied two machine learning methods: a clustering algorithm to extract transit corridors and a graph-embedding algorithm to discover mobility community structure. These high-level representations are visualized in a web-based interactive interface that allows users to examine massive SCD in a highly aggregated and efficient manner. Our prototype demonstrates that the proposed visual analytics approach can offer a scalable and effective solution for discovering meaningful travel patterns across a large metropolitan area. In the near future, we plan to improve the usability of the prototype based on users' comments. We also plan to allow users to designate algorithm configurations in the graphical user interface, which would contribute to a better understanding of the underpinning machine learning algorithms.

**Author Contributions:** Conceptualization, T.Z.; Methodology, T.Z.; J.W.; Software, C.C.; Y.L. (Yicong Li), W.H.; Formal Analysis, C.C.; Y.L. (Yicong Li); Data Curation, C.C.; Q.Q.; Writing-Original Draft Preparation, T.Z.; Writing-Review & Editing, J.W., Q.Q.; Visualization, J.W.; C.C.; Project Administration, Y.L. (Yonghua Lu); Funding Acquisition, T.Z.; Y.L. (Yonghua Lu).

**Funding:** This research was funded by the Special Fund for the Development of Strategic Emerging Industries in Shenzhen, grant number JSGG20170412170711532, the National Natural Science Foundation of China, grant number 41871308, and the Basic Scientific Research Fund Program of the Chinese Academy of Surveying and Mapping, grant number 7771820.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

## **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Social Media Big Data Mining and Spatio-Temporal Analysis on Public Emotions for Disaster Mitigation**

### **Tengfei Yang 1,2, Jibo Xie 1,\*, Guoqing Li 1, Naixia Mou 3, Zhenyu Li 3, Chuanzhao Tian 1,2 and Jing Zhao 1,2**


Received: 30 October 2018; Accepted: 10 January 2019; Published: 15 January 2019

**Abstract:** Social media contains a lot of geographic information and has become one of the most important data sources for hazard mitigation. Compared with traditional methods of collecting disaster-related geographic information, social media provides real-time information at low cost. Due to the development of big data mining technologies, it is now easier to extract useful disaster-related geographic information from social media big data. Accordingly, many researchers have used related technology to study social media for disaster mitigation. However, few researchers have considered the extraction of public emotions (especially fine-grained emotions) as an attribute of disaster-related geographic information to aid in disaster mitigation. Combined with the powerful spatio-temporal analysis capabilities of geographical information systems (GISs), the public emotional information contained in social media could help us to understand disasters in more detail than traditional methods allow. However, social media data are quite complex and fragmented, both in format and in semantics, especially for Chinese social media. Therefore, a more efficient algorithm is needed. In this paper, we consider the earthquake that happened in Ya'an, China in 2013 as a case study and introduce a deep learning method to extract fine-grained public emotional information from Chinese social media big data to assist in disaster analysis. By combining this with other geographic information data (such as population density distribution data, POI (point of interest) data, etc.), we can further assist in the assessment of affected populations, explore the law of emotional movement, and optimize disaster mitigation strategies.

**Keywords:** social media; big data; fine-grained emotion classification; spatio-temporal analysis; hazard mitigation

## **1. Introduction**

With the popularity of mobile devices and the development of network infrastructure, social media has quickly integrated into people's lives. People can easily share what they see and hear, and even what they feel and think, with social media. Like "mobile sensors" [1], they constantly collect information around them. This provides a new way to acquire disaster-related data. Compared with traditional disaster information collection methods, social media provides real-time information at low cost. Furthermore, these data contain a lot of geographic information (such as location, time, and other attribute information), which is very important for disaster mitigation. Therefore, many researchers have noticed the importance of social media in disaster mitigation. They have studied disasters from the perspectives of event extraction [2,3], user trajectory rules [4], and data fusion [5], and achieved good results. However, few researchers have considered the public emotional information contained in social media (especially fine-grained emotions) as an attribute of disaster-related geographic information to aid in disaster mitigation. When disasters occur, public emotions often express the public's attitude towards the disaster, needs during the disaster, and feedback on disaster relief. These are very helpful for quickly understanding the progress of a disaster and effectively improving the efficiency of rescue. However, there is still a lack of an effective framework to quickly collect, process, and use this emotional information. Three problems are involved: (1) How can fine-grained public emotional categories be divided during a disaster? (2) Social media has a huge user base. Taking Sina micro-blog, a Chinese social media platform, as an example: as of Q3 2018, Sina micro-blog had over 431 million monthly active users [6]. When disasters occur, this generates a lot of disaster-related data; how can the fine-grained emotional information contained in these data be extracted more accurately? (3) Once these fine-grained emotions are extracted, how can they be treated as an attribute of disaster-related geographic information to assist disaster mitigation? In this paper, we used Sina micro-blog data and took an earthquake disaster as an example to describe how the framework we built extracts fine-grained public emotions and uses them to serve disaster mitigation.

Unlike most emotion analysis studies, which usually divide emotions into three categories (positive, neutral, and negative), we divide public emotions during a disaster into more dimensions, because using multiple dimensions of emotion in the disaster context allows more details of the disaster to be described. Previous studies have illustrated the importance of multidimensional emotional information in disasters. Ekman et al. [7] showed the differences between anger, disgust, fear, and sadness in terms of antecedent events and likely behavioral responses. Gruebner et al. [8] analyzed how to apply multiple dimensions of negative emotion (including anger, fear, sadness, surprise, confusion, and disgust) to survey disaster mental health. Existing psychological studies [9–11] also mention the fine-grained division of emotions in a disaster. Therefore, based on these previous studies and the corpus used in this paper, we subdivide the negative emotions into anger, anxiousness, fear, and sadness.

The commonly used methods for emotion classification include rule-based algorithms and traditional machine learning models [12]. Rule-based algorithms mainly use given emotional lexicons and corresponding grammatical rules to calculate the emotional intensity of text [13,14]. This method relies on a large number of manual operations, such as the manual development of search rules and a large-scale emotional lexicon [15], which determine the accuracy of the method. Additionally, this method is weak in dealing with stop words and new words. It is also hard to add slang and Internet buzzwords to the emotional lexicon in time, such as "喜大普奔" (great satisfaction) and "狂顶" (very supportive), which often appear in social media. Traditional machine learning models, such as naive Bayes [16], maximum entropy, and support vector machines [17], do not rely on emotional lexicons or search rules; they only need a manually annotated training set. However, traditional machine learning methods are based on the bag-of-words model, which ignores the semantic relations in text. In other words, it does not consider the order of words in a sentence, which can easily cause misclassification of emotions. For example, the sentences "Although the earthquake is terrible, we are safe and sound" and "Although we are safe and sound, the earthquake is terrible" contain the same words, but they express different emotions. Moreover, for traditional machine learning models, the input is the set of feature words extracted from the text after segmentation, and the definition of feature words has a significant impact on the model's efficiency [15]. We selected a deep learning method to extract public emotion from social media. Compared with the rule-based method, deep learning does not depend on any emotional lexicon and is therefore not affected by new and unknown words. Unlike traditional machine learning, deep learning uses word vector models instead of the bag-of-words model, which makes good use of the semantic information in sentences. Much research has indicated that the performance of deep learning [18,19] in natural language processing (NLP) tasks is better than that of traditional machine learning.
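
To make the contrast concrete, a minimal word-order-aware classifier of the kind argued for here can be sketched as follows. The vocabulary size, sequence length, six emotion classes, and the specific LSTM architecture are illustrative assumptions, not the model developed in this paper (described in Section 2); because tokens pass through word vectors and a recurrent layer, the two example sentences above receive different representations:

```python
import tensorflow as tf

VOCAB, N_CLASSES = 50000, 6   # assumed vocabulary and emotion classes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, 128),   # token ids -> word vectors
    tf.keras.layers.LSTM(64),                # order-aware sentence encoding
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_token_ids, emotion_labels, epochs=5)  # labeled micro-blogs
```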

Furthermore, we combined the extracted fine-grained public emotions with traditional geographic information data (population density distribution data, point of interest (POI) data, etc.) and the powerful spatial analysis functions of a GIS (geographic information system) to assist disaster relief. Incorporating public emotional information could produce the following benefits: (1) It could improve the accuracy and efficiency of disaster assessment. For example, with the help of the spatial analysis functions of a GIS, traditional geographic information data (such as population distribution, traffic distribution, etc.) and emotional distribution data can be combined to assess the affected population in real-time. People who express negative emotions are generally considered to be more affected by disasters. (2) It could help to reduce disaster-related losses. For example, disasters, especially sudden disasters (such as earthquakes, volcanic eruptions, etc.), can easily cause disaster-related mental health problems, such as post-traumatic stress disorder (PTSD) and depression [20–22]. Traditional monitoring has difficulty obtaining information about the emotions of the public in the disaster area (questionnaires exist, but their real-time performance is poor). If the public's emotions and their spatio-temporal distribution are known, the disaster reduction department can take corresponding psychological rescue measures to reduce the occurrence of disaster-related mental health problems. In addition, extreme disasters are inevitable and unpredictable [23]. People express different emotions at different stages [24] and respond differently in trying to cope with them [9]. For example, anxious people are more sensitive to the negative side of event-related information and can be easily influenced by rumors [25]. Therefore, by understanding the distribution of anxious people, we can release correct disaster information at appropriate times to prevent rumors from harming them. (3) Learning more about the causes of emotion could help to optimize emergency decisions. By using different emotion categories, we can explore different emotional causes, such as why angry emotions predominate in one area while anxious emotions predominate in another, and why the emotion categories change in some places over time. By understanding the causes of emotion, disaster reduction departments can carry out targeted countermeasures. In the spatio-temporal analysis of public emotion information, the framework in this paper includes assessing the affected population in real-time, exploring patterns of emotional movement, and monitoring the causes of emotional change.

The structure of this paper is as follows: Section 2 describes the data acquisition, parsing, processing, and emotion classification method used in this paper. Section 3 presents the role of public emotion in assisting disaster reduction with a case study. Section 4 shows the evaluation of the experimental indicators. Section 5 concludes the paper.

### **2. Framework to Analyze Public Emotion from Social Media Big Data**

The framework to analyze the role of public emotions in disaster mitigation proposed in this paper includes five major phases: data acquisition and processing, the construction of a word vector list, model training, emotion classification, and spatio-temporal analysis of public emotions (as shown in Figure 1).

**Figure 1.** Framework of the automatic emotion classification and disaster analysis.

## *2.1. Social Media Data Acquisition and Parsing*

We used an earthquake that happened in Ya'an, Sichuan, China, at 08:02 h on April 20, 2013, as the case study. According to the report by the China Seismograph Network (http://news.ceic.ac.cn/CC20130420080246.html), the magnitude of this earthquake was 7.0 and its focal depth was 13 kilometers. The epicenter was located at 30.30° N, 103.00° E, and the earthquake affected about 1.52 million people in an area of 12,500 square kilometers.

In this paper, social media data was acquired from Sina micro-blog for the region within a 200 km radius of the epicenter, which was severely damaged by the earthquake. The affected cities included Ya'an, Meishan, Ganzi, Leshan, Ziyang, Deyang, Chengdu, Aba, Zigong, Mianyang, and Neijiang, as shown in Figure 2. The time span of the social media data was from April 20 until April 26, 2013. Social media platforms usually provide an interface or API (Application Programming Interface) that allows developers to retrieve social media data. However, retrieving data this way has great limitations; for example, the time span and topics cannot be set. Therefore, in this paper, we used Sina micro-blog's advanced search capability to get data by time span, city names, and event-related key words.

**Figure 2.** The study area of the 2013 Ya'an earthquake that was used in this paper.

The data format was initially hypertext markup language (HTML). We parsed the data into a structured format including fields such as "time," "location," and "text." Among these, location was represented by an address, and the accuracy of addresses varied. Take Chengdu as an example: some addresses were described in detail, such as "East Gate of Sichuan University" and "Sishengci North Street," while others were described roughly, such as "Funan New District." Some texts had no address information at all, because people have different usage habits (some do not want to share their location information). We used the API provided by Baidu (http://lbsyun.baidu.com/index.php?title=webapi/guide/webservice-geocoding) to convert these addresses to latitude and longitude. For line data, such as "Sishengci North Street," we used the midpoint coordinates. For surface data, such as "Wangjiang Campus of Sichuan University" and even "Funan New District," we extracted the central point coordinates. We did not assign coordinates to texts without address information, including those with only rough addresses; they were just labeled "Chengdu."
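The geocoding step can be illustrated with a short script. The sketch below is a hypothetical illustration: it assumes Baidu's geocoding web service (the v3 endpoint) and its JSON response layout, so the key, parameters, and response fields should be verified against the documentation linked above.

```python
# Minimal geocoding sketch; the endpoint, parameters, and response fields
# are assumptions based on Baidu's geocoding web-service documentation.
import requests

BAIDU_GEOCODER = "https://api.map.baidu.com/geocoding/v3/"

def address_to_coords(address: str, city: str, ak: str):
    """Return (lng, lat) for an address, or None if it cannot be resolved."""
    params = {"address": address, "city": city, "output": "json", "ak": ak}
    resp = requests.get(BAIDU_GEOCODER, params=params, timeout=10).json()
    if resp.get("status") != 0:        # a non-zero status means the lookup failed
        return None
    loc = resp["result"]["location"]
    return loc["lng"], loc["lat"]

# Example: address_to_coords("四川大学东门", "成都", ak="YOUR_KEY")
```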

## *2.2. Social Media Data Processing*

In the subsequent processing steps, we mainly dealt with the text data. The main text processing steps included the conversion of full-width characters to half-width characters, the conversion of traditional Chinese to simplified Chinese, and the recognition of special characters and symbols. The aim of the first two steps was mainly to improve the computational efficiency of the model. The third step aimed to recognize special characters and symbols, such as "(>\_<)", which are deleted or ignored by many common natural language processing (NLP) tools. For an emotional analysis, however, these special characters and symbols carry emotional meaning; for example, "(>\_<)" and "(>\_<)>" can express troubled emotions. Therefore, in this paper, we interpreted them into text that could be processed by NLP. Some special characters and symbols could be translated into text by the micro-blog platform; for example, the tear emoji could be translated into "tear" (泪). Others that could not be decoded by the micro-blog platform, such as "(>\_<)", were interpreted according to the web's "list of emoticons," which collects the emotional implications of all kinds of emoticons from a large amount of published literature. For example, "(>\_<)" can be translated into "troubled" (焦虑), and the corresponding sad emoticon can be translated into "sad" (伤心).

Finally, after eliminating duplications, there were 39,341 data records stored in our database.

## *2.3. Constructing the Word Vector List*

In this paper, we first converted each word from the previously processed texts into a multidimensional vector. This process included two phases: word segmentation and the removal of stop words, and the construction of a word vector list.

### 2.3.1. Word Segmentation and the Removal of Stop Words

Unlike English, Chinese has no spaces between words. Therefore, we needed to segment the Chinese text to get separate words. Additionally, Chinese micro-blog text is colloquial, which brings great challenges to word segmentation. We compared many different Chinese word segmentation tools, such as "Stanford NLP," "ANSJ," and "NLPIR (Natural Language Processing & Information Retrieval Sharing Platform)," and found that "NLPIR" had the best performance in terms of the accuracy and speed of word segmentation.

Text contains many meaningless words after word segmentation; these are called stop words, such as "在 (on)," "是 (is)," and "一会 (a moment)." These words could affect the accuracy of the model and should therefore be removed. In this paper, we used the vocabulary of stop words developed by the Harbin Institute of Technology Social Computing and Information Retrieval Research Center. As the focus of this paper was emotional analysis, we optimized the stop word vocabulary by removing sentimental words from it, such as "愤然 (indignant)," "幸亏 (luckily)," and "嘻 (hey)," so that these words would be retained in the text.
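To make the preprocessing concrete, the sketch below shows segmentation followed by stop-word filtering. It is only an illustration: jieba stands in for NLPIR, and "stopwords.txt" stands in for the Harbin Institute of Technology stop-word list.

```python
# Sketch of word segmentation plus stop-word removal; jieba replaces NLPIR
# and stopwords.txt stands in for the HIT stop-word vocabulary.
import jieba

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f}
# Keep sentiment-bearing words in the text even if the list contains them.
stopwords -= {"愤然", "幸亏", "嘻"}

def tokenize(text: str):
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]

print(tokenize("虽然地震很可怕，但我们平安无事"))
```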

## 2.3.2. Construction of the Word Vector List

The input used for the emotion classification model was a word vector matrix. We needed to convert each word in the micro-blog text into a multidimensional vector, and then convert the whole sentence into a word vector matrix. In this paper, we converted all previously processed texts into a word vector list. The training text and new text to be categorized were transformed into a word vector matrix by matching them with the word vector list. The method we used for this was word2vec [26], which projected every word in every sentence to a specified dimensional vector space.

There are two commonly used models in word2vec: skip-gram [27,28] and CBOW (Continuous Bag-of-Words) [27,28]. A large number of experiments have compared these two models in terms of performance and accuracy [29], and the results show that the semantic accuracy of the skip-gram model is better than that of the CBOW model. Therefore, we used the skip-gram model in our experiment to construct the text feature vectors.

The skip-gram model determines correlations between words during corpus training. These correlations are represented by the multidimensional feature vector of each word, and these feature vectors are calculated taking full account of contextual semantic information. As the formula below shows, given a current word, $w_i$, this model tries to find words that have a contextual semantic relationship with the current word. The target of this model is to maximize the objective function, $G$:

$$G = \sum_{w_i \in C} \log P\left(\text{Context}(w_i) \mid w_i\right). \tag{1}$$

In this formula, $w_i$ represents the current word and $C$ represents the training corpus. $P(\text{Context}(w_i) \mid w_i)$ represents the probability of the context information given the current word.

When the training converges, words with similar semantic meanings are closer in the specified dimensional vector space. We exported the text feature vector of each word in the training corpus to generate the word embedding. The structure of the text feature vectors is shown in Figure 3.

**Figure 3.** Structure of the text feature vector.
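As an illustration of this step, the sketch below trains a skip-gram model with gensim's word2vec implementation. gensim and the toy corpus are assumptions; the paper only states that the word2vec framework with the skip-gram model was used, with 200-dimensional vectors.

```python
# Sketch of building the word vector list with skip-gram word2vec
# (sg=1 selects skip-gram; vector_size matches the 200 dimensions used later).
from gensim.models import Word2Vec

# Toy tokenized micro-blog texts (output of the segmentation step).
corpus = [["地震", "可怕"], ["我们", "平安"], ["地震", "余震", "害怕"]]

model = Word2Vec(corpus, vector_size=200, window=5, sg=1, min_count=1)
model.wv.save_word2vec_format("word_vector_list.txt")  # the word vector list
vec = model.wv["地震"]  # 200-dimensional feature vector for "earthquake"
```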

## *2.4. Model Training*

The deep learning model selected in this paper was the convolutional neural network. A review of the related literature shows that different deep learning methods can be selected for emotion classification, such as the convolutional neural network (CNN), recurrent neural network (RNN), and hierarchical attention network (HAN). These models [30–32] all have their own characteristics and usage scenarios. According to the literature [33], CNN performs emotion classification well, especially on shorter sentences, while RNN performs well on document-level emotion classification [34]. A previous study [35] compared the performance of CNN, RNN, and HAN in emotion classification. The results showed that when the training corpus is large enough, HAN has the highest accuracy, but CNN performs best when the training corpus is not very large. Annotating a large training corpus requires a lot of manpower and time, and it takes longer to train the HAN and RNN models than the CNN model. In this paper, the training corpus consisted of micro-blog texts, which are mainly short, and the amount of manually tagged training data was more suitable for the CNN model. Therefore, we selected CNN as the method to extract the public emotion contained in social media. The training process of the model is described below.

### 2.4.1. Word Segmentation, Removal of Stop Words, and Construction of the Feature Matrix

First, we segmented the training texts to obtain separate words. Then, we used a vocabulary of stop words to remove the meaningless words contained in those separate words. Finally, the remaining words were converted into word vectors through matching with a previously generated word vector list. Ultimately, every sentence was transformed into a feature matrix.

## 2.4.2. Training Convolutional Neural Network Model

The convolutional neural network (CNN) is a variant of the neural network. It was first successfully used for the recognition of images and videos. Later, researchers introduced it into the field of natural language processing [36] and found that it worked well. The CNN model used in this paper consisted of an input layer, a convolutional layer, a pooling layer, a fully connected layer, and a classification layer. The structure of the CNN is shown in Figure 4.

**Figure 4.** Structure of the convolutional neural network (CNN) model used in the paper.

During the training of the CNN model, the neurons are usually organized in three dimensions: depth, width, and height, and the size of each layer is depth × width × height [37]. For example, if a sentence has 140 words and each word is represented by a 200-dimensional vector, the size of the input layer is 1 × 140 × 200.

Next, we introduce the layers of the convolution neural network.

*Input layer*: The input layer of the CNN is a matrix that consists of text feature vectors. This matrix is calculated using the skip-gram model described in Section 2.3.2. The rows and columns (dimensions) of this matrix were set before the matrix was fed into the neural network model. Taking Sina micro-blog text as an example, the number of characters in each sentence is at most 140. Therefore, we set the number of rows in the matrix to 140. If a sentence was shorter than 140 positions, we used the empty character "space" to pad the missing positions. Therefore, every sentence is expressed as follows:

$$S_{1:140} = S_1 \oplus S_2 \oplus S_3 \oplus \dots \oplus S_{140}. \tag{2}$$

In this formula, $S_i$ represents a character or "pad," and $\oplus$ is the concatenation operator.
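A minimal sketch of this padding scheme is shown below; representing the "pad" positions with zero vectors is an assumption, as the paper only states that an empty character is used.

```python
# Sketch of Equation (2): pad a sentence to 140 positions and stack the
# 200-dimensional word vectors into the input matrix.
import numpy as np

def to_input_matrix(words, wv, max_len=140, dim=200):
    mat = np.zeros((max_len, dim))        # padded positions stay zero
    for i, w in enumerate(words[:max_len]):
        if w in wv:                       # wv maps a word to its 200-dim vector
            mat[i] = wv[w]
    return mat

wv = {"地震": np.random.rand(200), "可怕": np.random.rand(200)}  # toy vector list
m = to_input_matrix(["地震", "可怕"], wv)  # shape (140, 200)
```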

*Convolutional layer*: The convolutional layer is mainly used to extract features. It abstracts fragmented elements into features that can be used to distinguish different categories. By convolution, many low-level features can be abstracted into higher-level features. For example, the single word "打" or "call" has no emotional meaning on its own, but the higher-level feature "打call" (praise) can express an emotional attribute. The emotional attributes of such expressions can be acquired by the model from a large training corpus.

Given a matrix $\mathbf{u}$ from the input layer, the convolution operation is as follows:

$$c_j = f\left(\mathbf{u} * \mathbf{k}_j + b_j\right). \tag{3}$$

For the matrix $\mathbf{u} \in \mathbb{R}^{D \times L}$, $D$ represents the embedding dimensionality and $L$ represents the sentence length. The parameter $\mathbf{k}_j \in \mathbb{R}^{D \times s}$ represents the $j$-th convolutional kernel, which is applied to a window of $s$ words. The parameter $b_j \in \mathbb{R}$ represents a bias term, and $f(\cdot)$ is a non-linear activation function.

*Pooling layer*: After the convolution operation, we can use the output features to directly classify emotions. However, in doing so, we will not only face the challenge of computational complexity, but also the problem of over-fitting, which will affect the classification accuracy. The pooling operation can solve these problems well. In addition, the pooling operation can also serve as a feature selector that can help to identify the most important features to improve the classification performance.

Two pooling methods can be selected: max pooling and average pooling. We achieved better results with max pooling. This method selects global semantic features and attempts to capture the most important feature, the one with the highest value, in each feature map [37]. The output of the convolution operation, $c_j$, is used as the input of the pooling operation. The formula is as follows:

$$p_j = \text{pooling}(c_j) + b_j. \tag{4}$$

*Fully Connected Layer*: The neurons in this layer are fully connected to all neurons in the previous layer, and their values are calculated from those neurons. In the calculation process, the dropout regularization method is usually used to avoid over-fitting.

*Classifying*: We can obtain the emotional labels of the original text through the softmax function. In other words, these calculated results represent the probability distribution of the emotional labels.

Based on the training corpus, we can identify the best parameters for the CNN model. Then, this trained model can be used to calculate the emotional categories of new texts.

## *2.5. Emotion Classification*

We used the trained CNN model to analyze new texts. The emotions contained in these texts were divided into six categories: positive, neutral, angry, anxious, fearful, and sad. Among them, the positive emotion mainly included the public's satisfaction with disaster relief, the public's wishes for the disaster area, and the joy of surviving. The neutral emotion mainly included objective descriptions of the disaster. In the process of classification, new texts were first processed using word segmentation and the removal of stop words. Then, the previously trained word vector list was used to translate each word into a word vector, so that each new text was transformed into a word vector matrix. Finally, the word vector matrix was input into the trained CNN model, which labeled each new text with one of the emotional categories. We classified all 39,341 texts into the six emotion categories using this process.

## *2.6. Spatio-Temporal Analysis of the Public Emotions*

The framework in this paper aims to assist disaster mitigation by using the public emotional information contained in social media. In this process, emotional information is regarded as an attribute of geographic information, and the powerful spatial analysis capability of a GIS is used to combine the emotional information with other geographic data to extract more useful knowledge. For example, population density distribution data can be added to carry out a spatio-temporal assessment of the affected population, and POI data (such as shelters) can be used to explore the spatio-temporal trajectories of people in sudden disasters. In addition, emotional information can help disaster reduction departments to screen out urgent public demands from a vast amount of information. Public demands that contain emotional information are also effective feedback on disaster reduction work; they can help to optimize decision-making and improve rescue efficiency.

## **3. Spatio-Temporal Analysis of Public Emotional Information**

## *3.1. Spatio-Temporal Assessment of the Affected Population*

It is very important to know the distribution of the affected population when an earthquake happens. This helps to ensure an effective assessment of the disaster situation and a rational deployment of rescue resources. In this section, we combined the population density distribution data for the study area with the spatio-temporal information contained in social media to assist the analysis. The population density distribution data were taken from the GHSL (Global Human Settlement Layer) (http://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS\_POP\_GPW4\_GLOBE\_R2015A/). Introducing public emotional information can improve the accuracy of the assessment: it is generally believed that negative emotions indicate that the earthquake had a greater impact on the people expressing them. Further, according to the time-period division rules used by the rescue department, we set up six time periods: 0–4 hours, 4–12 hours, 12–24 hours, 24–48 hours, 48–72 hours, and 0–72 hours after the earthquake. We then used the overlay analysis of GIS software to process these data in each time period. The results of the analysis are shown in Figure 5.
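The time-period binning that precedes the overlay can be sketched as follows; the column names and sample records are hypothetical, and the overlay itself was performed in GIS software.

```python
# Sketch of assigning geocoded posts to the five disjoint time periods
# (the sixth period, 0-72 hours, is the union of the others).
import pandas as pd

quake_time = pd.Timestamp("2013-04-20 08:02:00")
bounds = [0, 4, 12, 24, 48, 72]  # hours after the earthquake

posts = pd.DataFrame({
    "time": pd.to_datetime(["2013-04-20 09:10", "2013-04-21 10:00"]),
    "emotion": ["anxious", "positive"],
})
hours = (posts["time"] - quake_time).dt.total_seconds() / 3600
posts["period"] = pd.cut(hours, bins=bounds,
                         labels=["0-4 h", "4-12 h", "12-24 h", "24-48 h", "48-72 h"])
print(posts.groupby(["period", "emotion"], observed=True).size())
```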

**Figure 5.** Emotional distribution characteristics of the affected population. Figures (**a**)–(**e**) describe the distribution of emotions in different time periods within 72 hours after the disaster; figure (**f**) shows the distribution of emotions over the whole 72 hours. Red circle 1, red circle 2, and red circle 3 each mark the same area across the figures. The blue circle 1 in (**b**) shows that, compared with (**a**), new negative emotions emerged in the same area.

From the results, we can see the following: (1) After the earthquake, the micro-blog data volume was larger in places with a high population density, and negative emotions predominated. (2) Within four hours after the earthquake, as shown in Figure 5a, there were almost no positive emotions in the areas near the epicenter; areas far from the epicenter, such as Leshan (red circle 3) and Chengdu (red circle 2), had only a few positive emotions. (3) From 4 to 12 hours after the earthquake, as shown in Figure 5b, compared with the previous distribution of emotional information, some new negative emotions emerged near the epicenter, such as in blue circle 1. This indicates that, as time went on, new disaster damage might have taken place in this region. We checked the corresponding texts and found that these new emotion points were mostly anxiety: people were anxious because "Fan Min Road" was blocked by boulders, and they worried that rescue vehicles unaware of this would be delayed. The emotions in other areas of this figure had also changed. For example, compared with red circle 2 and red circle 3 in Figure 5a, the number of emotional texts in these areas in Figure 5b increased significantly, indicating that the public's attention to the earthquake continued to increase during this time period. (4) We selected an area with a high population density near the epicenter to analyze in detail how emotions changed over time. The selected area was located in Yucheng District in Ya'an and is marked with red circle 1 in Figure 5a–e. Figure 6 shows the changes in data volume for each emotion category in this area over the different time periods. We found that positive emotions began to appear in the second period and then continued to increase: as the rescue operation unfolded, people's gratitude towards the rescuers grew. The number of negative emotions increased the most in the third period and then gradually decreased. Although this period lasted only 12 hours, its number of negative emotions was the highest of all periods, because it covered the first night after the earthquake, when most people were in urgent need of relief supplies, such as tents and clothes; therefore, anxiety was dominant. The number of neutral emotions did not change very much; they mainly described the progress of the earthquake. (5) Figure 5f depicts the overall situation within 72 hours after the earthquake. We found that the population density in Ya'an was not high and the distribution of population was not uniform; however, the number of emotions expressed in this city was large and fairly uniformly distributed, especially the negative emotions. This shows that the impact of the earthquake on Ya'an was the most serious. In addition, Chengdu, the capital of Sichuan Province, has the largest population density, and many negative emotions were expressed there during the earthquake. Although the distribution of these emotions was not uniform, more attention should be paid here to avoid unexpected accidents, such as people being hurt by rumors due to anxious emotions. The same approach can also be applied to other affected cities.

**Figure 6.** Changes in data volume for each emotion category in different time periods.

In this section, we conducted an overlay analysis combining public emotional information and population density distribution to assess the affected population. The spatio-temporal distribution characteristics of public emotional information can improve the accuracy of the assessment and provide more valuable information. Although this emotional information was unevenly distributed, and some areas with high population density had only a small amount of negative emotion, such as the area in blue circle 1 in Figure 5b, we should still pay attention to these areas, because emotions expressed by social media users may also reflect the emotions of the neighbors or communities around them, even if those neighbors or communities do not use social media [8].

## *3.2. Emotional Spatio-Temporal Trajectory Mining*

Earthquakes, as sudden disasters, cause tremendous damage in a short period of time and are especially hazardous to human life. Therefore, most cities have many shelters where people can take refuge from such disasters. In this section, we explore how the spatio-temporal trajectories of human beings change when a sudden disaster occurs and whether these changes are related to the locations of shelters. Furthermore, we investigate which emotion categories people show in this process of change and how these categories change. We used Chengdu as an example and obtained the locations of the shelters in this city from "The Official Website of Chengdu Municipal People's Government" (http://cdtf.gov.cn/chengdu/smfw/csyjbn.shtml). Then, we translated these shelters into coordinates through the Baidu API (http://api.map.baidu.com/lbsapi/getpoint/index.html) and vectorized the map of this region. Considering the suddenness of earthquakes, we set up seven small time periods: 08:02 h to 08:12 h, 08:12 h to 08:22 h, 08:22 h to 08:40 h, 08:40 h to 09:30 h, 09:30 h to 10:30 h, 10:30 h to 11:30 h, and 11:30 h to 13:00 h (the earthquake occurred at 08:02 h), to analyze the changes of the crowd at this fine temporal granularity. Figure 7 shows the changes in the number of people over time. We can see that the population grew fastest from 08:40 h to 09:30 h, which may reflect that a large number of people reached nearby shelters during this time period.

**Figure 7.** Changes in crowd size in each small time period.

We used the kernel density algorithm [38] to validate the change in crowd aggregation over time and to explore the relationships between the density centers of crowds and shelter locations, as shown in Figure 8. In Figure 8a, three clusters formed within the first 10 minutes after the earthquake; we marked them as cluster 1, cluster 2, and cluster 3. Although the density of cluster 1 is small, its core was already at the location of a shelter. This indicates that within 10 minutes of the earthquake, a small number of people had gathered in nearby shelters. Cluster 2 and cluster 3, especially cluster 3, had higher densities than cluster 1, but the people in these areas had not yet gathered at a shelter. Ten minutes later, between 08:12 h and 08:22 h, the number of people increased as they further converged on shelters; some shelters had gathered many people, as in cluster 1 and cluster 3 in Figure 8b. As time went on, a large number of discrete, small clusters fused into larger clusters, as shown in Figure 8c–e, indicating that during these periods a large number of people had reached the shelters. The number of people was largest between 08:40 h and 09:30 h, as shown in Figure 8d. Then, the number of people gathered in shelters began to decline, although the crowd had not yet dispersed, as shown in Figure 8e. These changes can be read from the corresponding crowd density values. In Figure 8f,g, we can see that the big cluster began to disintegrate and decomposed into small clusters, which gradually moved away from the shelters. This may indicate that people were no longer so tense at this time.
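The density surfaces in Figure 8 were produced with a GIS kernel density tool; a comparable sketch using scikit-learn is given below, with illustrative coordinates and bandwidth.

```python
# Sketch of kernel density estimation over geocoded posts in one time window.
import numpy as np
from sklearn.neighbors import KernelDensity

points = np.array([[104.06, 30.66], [104.07, 30.65], [104.10, 30.70]])  # lng, lat
kde = KernelDensity(kernel="gaussian", bandwidth=0.01).fit(points)

# Evaluate the density on a grid to locate cluster cores near shelters.
gx, gy = np.meshgrid(np.linspace(104.0, 104.2, 50), np.linspace(30.6, 30.8, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])
density = np.exp(kde.score_samples(grid)).reshape(gx.shape)
```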

**Figure 8.** The change characteristics of the public spatio-temporal trajectory. This sequence diagram describes how the crowd moved in different small time periods after the earthquake. Figure (**a**) shows the trajectories of public change in the first 10 minutes after the earthquake; three clusters formed in this period. Figure (**b**) shows the location relationship between each cluster and the shelters in the second ten minutes. Figures (**c**)–(**e**) show that all the small clusters formed a large cluster over time, with the largest population between 08:40 and 09:30, as in figure (**d**). Figures (**f**) and (**g**) show that the crowd was gradually dissipating and leaving the shelters.

From the whole process of analysis, we determined that: (1) When the earthquake happened, people rushed to the shelters in a very short time period. However, were these shelters reasonably laid out? We saw that some shelters did not contain many people, or even had no people at all. Therefore, the analysis results could be used as a reference for the rational layout of shelters. (2) The characteristics of crowd gathering and evacuation could be used as an effective reference to aid disaster reduction departments in dealing with future emergencies.

Further, we wanted to know which categories of emotion the public expressed during these periods and how these emotions changed, because massive population movements in a short time may lead to unnecessary accidents, such as stampedes caused by panic. If public emotions in this process can be monitored, quick and effective measures can be taken to improve the efficiency of evacuation and prevent accidents. Table 1 presents the emotional characteristics in each time period of Figure 8. The indicators in this table include the clusters formed by kernel density clustering and the emotion categories and major emotion category contained in each cluster.


**Table 1.** Distribution characteristics of emotions in different clusters in different time periods.

From Table 1, we can see that fearful emotions dominated in the first 150 minutes (08:02 h to 10:30 h) after the earthquake, followed by anxiousness. During this time, people were unprepared for the unexpected earthquake and were afraid of losing their lives. In the first time period (08:02 h to 08:12 h), people expressed anxiousness in cluster 1, as shown in Figure 8a. From the corresponding text content, we found that they did not yet know the details of the earthquake (such as the location of the epicenter, the magnitude, etc.), so they were worried about the safety of their relatives, friends, and even strangers. Angry, neutral, and positive emotions began to appear in the second time period (08:12 h to 08:22 h), although the numbers of positive and neutral texts were small (one positive and two neutral). Angry emotions in this period mainly showed people's aversion to the earthquake. Sad emotions began to emerge in the third time period (08:22 h to 08:40 h); people expressed sadness mainly because this earthquake reminded them of the dreadful catastrophe that happened in Sichuan in 2008 (the earthquake that occurred in Wenchuan, Sichuan Province, on May 12, 2008, caused great damage). As more detailed information about the earthquake became available, people expressed more categories of emotion, and the reasons for these emotions also changed. For example, in the fourth and fifth time periods, people expressed fearful and anxious emotions because they worried that there would be aftershocks in the near future. This also explains why people remained gathered near the shelters for a long time, as shown in Figure 8c–e. In Figure 8f, we can see that many people began to leave the shelters; combined with the emotions people expressed in this period, we found that fearful emotions were no longer dominant. People expressed sad emotions in cluster 2 in the sixth time period out of grief for the victims in the worst-hit areas. In the seventh time period, the main emotion category in each cluster differed, suggesting that people might have calmed down by this time; the main emotion of cluster 2 in Figure 8g was even positive, with people expressing their prayers for the disaster areas.

The fine-grained emotion analysis not only further explains the mining of human spatio-temporal trajectories, but also provides more details of the disaster for disaster reduction departments. On the one hand, it provides an understanding of the public's emergency awareness and movement patterns in the study area. On the other hand, based on the characteristics of the emotional spatio-temporal trajectory, disaster reduction departments can provide timely management and guidance at the "key nodes" of this process to avoid unexpected accidents. For example, effective guidance can be provided in areas where negative emotions are intense, e.g., in Figure 8a,b, to avoid possible stampedes due to panic.

## *3.3. Post-Disaster Emotional Change Analysis*

A sudden catastrophe has a long-term impact on the public. By monitoring and analyzing fine-grained public emotional information, we can mine a lot of important information from massive disaster-related data. This information can help us quickly understand the public's needs and feedback, including hard-to-detect problems such as mental health issues. This is very important for improving the efficiency of disaster emergency response and rescue. In this section, we took Ya'an as an example and used days as intervals to monitor public emotions over a longer period of time from a macro perspective. Meanwhile, we analyzed the causes of changes in public emotion using hot word extraction, which helps us quickly grasp the public's concerns from the mass of emotional information. The hot word extraction tool we used was a web service (http://www.picdata.cn/).
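Since the hot word service itself is a black box, a comparable sketch using jieba's built-in TF-IDF keyword extractor is shown below; the sample text is illustrative.

```python
# Sketch of hot-word extraction with TF-IDF (a stand-in for the web tool).
import jieba.analyse

day_texts = "道路中断 急需救援 上里镇被困 缺少帐篷"  # all texts for one day, joined
hot_words = jieba.analyse.extract_tags(day_texts, topK=10)
print(hot_words)
```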

Regarding positive emotions, as shown in Figure 9, on the second day (April 21) after the disaster, we traced the hot words "感动 (moved)" and "感激 (grateful)" back to the corresponding texts. We found that the public expressed their gratitude mainly to the professional rescue workers, such as the army; there were few volunteers at this time. However, more and more volunteers spontaneously joined the rescue effort over time, especially on the third and fourth days, when "志愿者 (volunteer)" appeared more frequently. Civilian rescue teams gradually arrived in the disaster area from about the fourth day (April 23) after the disaster. The hot words "爱心 (love)" and "物资 (relief supplies)" indicate that people in the disaster-stricken areas were grateful for non-governmental, spontaneously organized relief materials. Based on the changes in positive public emotions, we can understand the general progress of the rescue work.

Regarding anxiety, as shown in Figure 10, the anxiety of the public gradually decreased over time. On the second day after the earthquake, there was more anxiety, for the following reasons: (1) Roads were interrupted and some areas were isolated. We traced the hot words "中断 (interrupt)" and "救援 (rescue)" back to the original micro-blog texts to get detailed information and found that "上里镇 (Shangli town)," "中里镇 (Zhongli town)," "下里镇 (Xiali town)," and "碧峰峡 (Bifengxia town)" were isolated from the outside world and needed urgent rescue after the earthquake. (2) Some areas were in urgent need of relief supplies and were suffering from bad weather. For example, through the hot words, we found micro-blog texts saying that "Wangjia village in Longmen town was short of water, food, medicine, and tents." (3) Some people expressed anxiety because they could not contact their relatives and friends after the earthquake. From April 21 to April 26, we found that the public's anxiety was mainly due to a lack of supplies and bad weather. As time went on, the public's demand for relief supplies mainly concerned tents, especially on April 26. Combining the micro-blog content with location information could have allowed a more precise rescue plan to be carried out.

**Figure 9.** Sequence diagram of positive emotion (the words in the text box are the hot words related to this emotion in the corresponding time period).

**Figure 10.** Sequence diagram of anxiety.

As shown in Figure 11, from April 21 to April 23, the public mainly expressed anger because of their aversion to the earthquake and because of some Internet fraud. By examining the corresponding micro-blog content, we found that some of the Internet fraud was exposed by micro-blog users, for example: "It's horrible. Some criminals cheated under the cover of the earthquake. Please pay attention to this telephone number: xxx." These angry messages helped people guard against rumors (especially anxious people). After April 23, public anger was mainly due to aversion to the disaster.

**Figure 11.** Sequence diagram of anger.

As shown in Figure 12, from April 21 to April 25, sadness was due to the destruction of homes and the deaths of relatives or friends. Examples of corresponding micro-blog texts are: "Where is home? Where is the classroom? Yesterday? I was not an orphan yesterday" and "It's a scene of complete devastation, and when can we rebuild our homeland?" On April 26, many mourning activities were carried out by official and non-governmental organizations, which was why people expressed sadness on that day. During the earthquake rescue, disaster relief organizations could send psychological relief to places where sadness was intense, as determined from the locations of the corresponding micro-blog information.

**Figure 12.** Sequence diagram of sadness.

In terms of fear, as shown in Figure 13, from April 22 to April 24, there were several aftershocks in Ya'an, which had a great impact on the public's life; many hot words were observed, such as "余震 (aftershock)" and "可怕 (dreadful)." However, between April 24 and April 26, especially on April 26, there was a sudden increase in fear, and the hot words mainly included "心理咨询师 (psychological counselor)," "回忆 (recall)," and "惊吓 (startle)." We used these hot words to retrieve the original micro-blogs and saw posts such as: "A teacher in Zhongli Town reported that a girl was afraid of loud voices and kept eating. She said she would be afraid if she didn't eat. So this teacher hoped that the disaster reduction department could send a psychological counselor to help that girl" and "My brother said that whenever there was thunder and lightning in Ya'an, he was very scared! Ask for help!" Therefore, the relief department should give these people help.

**Figure 13.** Sequence diagram of fear.

We can continue to explore other areas in the same way. This could help us to accurately understand the public's reactions to the progress of disaster mitigation. Further, the analysis results could help the rescue department to optimize rescue strategies and improve the efficiency of rescues.

## **4. Evaluation of the Experimental Indicators**

## *4.1. Accuracy Evaluation of the Emotion Classification*

## 4.1.1. Experimental Corpus

In the emotion classification experiment, we first manually annotated a corpus based on the six emotion categories. In this corpus, each emotional category contained 1000 text samples. Among these, 800 text samples were selected as the training corpus and 200 were selected as the testing corpus from each emotion category.

## 4.1.2. Experimental Environment

In order to improve the accuracy of the emotion classification, we translated the special characters and symbols in the text into Chinese words. We integrated the word2vec framework (from Google [39]) and NLPIR-ICTCLAS (http://ictclas.nlpir.org) into our algorithmic framework to assist with this text processing. Finally, we built the convolutional neural network based on TensorFlow [40] and optimized the model parameters to achieve the best results. In this process, we set the dimension of the word vectors to 200; the number of convolution kernel sizes was 3, with sizes of 3, 4, and 5; the max pooling size was 4; the dropout proportion was 0.3; and the stride was 1.
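Under these settings, the model can be sketched with Keras as follows. The layer stack and the number of filters per kernel size (128 here) are assumptions; the paper specifies only the parameter values listed above.

```python
# Sketch of the text CNN: 140 x 200 input, kernel sizes 3/4/5 with stride 1,
# max pooling of size 4, dropout 0.3, and a six-way softmax output.
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(140, 200))             # 140 words x 200-dim vectors
branches = []
for size in (3, 4, 5):                           # the three kernel sizes
    c = layers.Conv1D(128, size, strides=1, activation="relu")(inp)
    p = layers.MaxPooling1D(pool_size=4)(c)
    branches.append(layers.Flatten()(p))
x = layers.Dropout(0.3)(layers.concatenate(branches))
out = layers.Dense(6, activation="softmax")(x)   # six emotion categories

model = Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```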

## 4.1.3. Experimental Results and Accuracy Comparison

We verified the accuracy of the algorithm based on precision (P), recall (R), and the comprehensive evaluation index (*F-1*). The formulas are shown below:

$$P = \frac{N_{\text{Correct}}}{N_{\text{Correct}} + N_{\text{False}}} \tag{5}$$

$$R = \frac{N_{\text{Correct}}}{N_{\text{Category}}} \tag{6}$$

$$F\text{-}1 = \frac{2 \times P \times R}{P + R}. \tag{7}$$

$N_{\text{Correct}}$ represents the number of texts that were correctly classified into a category, $N_{\text{False}}$ represents the number of texts that were misclassified into this category, and $N_{\text{Category}}$ represents the number of texts that belonged to this category in the testing corpus.
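As a worked example of Equations (5)–(7) with illustrative counts:

```python
# Worked example of precision, recall, and F-1 (counts are illustrative).
n_correct, n_false, n_category = 170, 30, 200

P = n_correct / (n_correct + n_false)  # precision = 0.85
R = n_correct / n_category             # recall = 0.85
F1 = 2 * P * R / (P + R)               # comprehensive index F-1 = 0.85
print(P, R, F1)
```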

Table 2 and Figure 14 show the accuracy of the CNN model in the fine-grained emotion classification. The comprehensive evaluation index scores for each category were all above 81%, which met the experimental requirements. In addition, we also accounted for slang, Internet buzzwords, and special characters and symbols to enhance the performance of the model.


**Table 2.** Accuracy evaluation of positive emotion classification.

## *4.2. Evaluation of Spatio-Temporal Analysis Experiments*

## 4.2.1. The Description of Data with Address Information

The number of texts in the dataset of this paper was 39,341. However, not all texts contained address information. We processed the address information using the method described in Section 2.1. All address information can be divided into two categories: rough address information and accurate address information. Rough address information can only identify villages and towns, or even just districts and counties, such as "Lushan County, Ya'an City" and "Wuhou District, Chengdu City." Accurate address information can identify streets and geographical entities, such as "Sishengci North Street" and "Wangjiang Campus of Sichuan University." Table 3 and Figure 15 depict the proportion and number of data with different accuracy in different cities. The formula for calculating the proportion is as follows:

$$\text{Proportion} = \frac{\text{The number of specified texts in the city}}{\text{The number of all texts in the city}} \tag{8}$$


**Table 3.** Proportion of data with different accuracy in different cities.

**Figure 15.** Comparison of the number of pieces of address information with different accuracy in each city.

## 4.2.2. Evaluation of the Experimental Process and Results

In Section 3.1, we used the population density distribution data provided by the GHSL (Global Human Settlement Layer) to assist in assessing the affected population. The scale of the maps we used was relatively small. Therefore, we considered that all the data with either accurate or rough address information could be used. Among them, the data in Aba and Ya'an better reflected the real situation expressed by social media in the region: although the social media data volume in these two cities was small, the data with address information accounted for a larger proportion, reaching 34.8% and 26.09%, respectively. Considering population density and the epicenter location (the epicenter was located in Ya'an), Section 3.1 mainly focused on the disaster situation in Ya'an and several cities close to it.

In Section 3.2, we explored how the spatio-temporal trajectories of human beings changed when a sudden disaster occurred, and we used geographic data (shelters) with accurate location information. Therefore, we considered that only the social media data with accurate address information could be used for this analysis. In addition, earthquakes are sudden disasters that cause tremendous damage in a short period of time. Therefore, we set up seven small time periods to explore the public's movement after the disaster and combined this with public emotions to mine more details of the disaster. This required more data with accurate address information. From Figure 15, we can see that Chengdu best met this requirement, and it produced satisfactory analysis results in Section 3.2. Of course, the same method could also be used to analyze the disaster in Ya'an; however, the time granularity would be coarser.

In Section 3.3, we mainly monitored the emotional information in each city over a long time. Therefore, we only needed to know which city each social media record belonged to. Although a lot of the social media data had no location information, every record had a city label, as explained in Section 2.1. Considering that Ya'an was hit worst by the earthquake, we selected Ya'an as the research object.

## **5. Conclusions**

When a disaster occurs, social media can provide a large amount of important disaster-related geographic information to disaster reduction departments in near real-time. In this paper, we regarded the fine-grained public emotional information extracted from social media as an attribute of geographic information to assist in disaster mitigation. In the process of extracting emotional information, we fully analyzed the characteristics of Chinese social media and selected a suitable algorithm (the convolutional neural network model). Meanwhile, the large number of special characters and symbols with emotional meaning contained in social media was also considered to improve the accuracy of classification. The methods in this paper achieved satisfactory results.

In order to verify the effectiveness of our method for disaster mitigation, we used the magnitude 7.0 earthquake that occurred on April 20, 2013, in Ya'an City, Sichuan Province, China, as a case study. We classified the social media texts related to the areas affected by the earthquake into six emotion categories. Then, with the help of GIS software and other traditional geographic information data (population density distribution data and shelter data), we explored how public emotional information can help disaster reduction. The results showed that fine-grained public emotions can provide more powerful data support for disaster reduction departments to optimize rescue strategies and improve rescue efficiency.

Although social media plays an important role in assisting disaster mitigation, it also has some limitations. (1) Social media data is unevenly distributed. Economically developed and populous areas tend to have more Sina micro-blog users. In the research area of this paper, Chengdu had the most Sina micro-blog data, and these data were concentrated in the urban area, yet it was not the worst-hit city. Therefore, in future research, data from other sources, such as image data and vehicle-borne GPS data, are needed to supplement social media data. (2) Not all social media users are willing to share their location information. In the dataset used in this paper, the proportion of texts with location information was very small, which limits the use of some spatio-temporal analysis methods. However, we found that there are many geographically named entities in the texts, and many of them can reflect the user's location. Therefore, an effective method is needed to automatically extract these geographically named entities to compensate for the lack of geographic location information in social media.

In addition, the potential of social media for disaster mitigation is far from fully exploited. With the development of data mining technology, more disaster-related information contained in social media can be extracted, such as different categories of disaster loss information. With the help of the powerful spatio-temporal analysis capabilities of GISs, this useful information can play a greater role.

**Author Contributions:** T.Y., J.X. and G.L. conceived and designed the paper; T.Y. and J.X. wrote the paper; T.Y. and N.M. designed and implemented the algorithmic framework; T.Y. and Z.L. realized the visualization; C.T. and J.Z. collected data and processed them.

**Funding:** This research was funded by the National Key R&D Program of China, grant number 2016YFE0122600, the National Natural Science Foundation of China, grant number 41771476, and the Strategic Priority Research Program of the Chinese Academy of Sciences, grant number XDA19020201.

**Acknowledgments:** We thank Edward T.-H. Chu, Associate Professor at Yunlin University of Science and Technology, for his advice and collaborative work on the emotion classification under the framework of a cooperation project. We also thank Zhenyu Lin and Qinglan Zhang, Master's students at Henan Polytechnic University, for their advice on visualization.


## **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
