**Preface to "Recent Advances in Geographic Information System for Earth Sciences"**

Geographic information systems (GISs) are computer-based technologies and methodologies for collecting, managing, analyzing, modeling, and presenting geospatial data for a wide range of applications. In recent decades, GISs have played a vital role in Earth sciences by providing a powerful means of observing the world and various tools for solving complex problems. The scientific community has used GISs to reveal fascinating details about the Earth and other planets.

This book on recent advances in GIS for Earth sciences includes 12 publications from esteemed research groups around the world. The research and review papers in this book belong to the following broad categories: Earth science informatics (geoinformatics), mining, hydrology, natural hazards, and society.

> **Yosoon Choi** *Special Issue Editor*

## *Editorial* **Recent Advances in Geographic Information System for Earth Sciences**

#### **Yosoon Choi**

Department of Energy Resources Engineering, Pukyong National University, Busan 48513, Korea; energy@pknu.ac.kr; Tel.: +82-33-570-6313

Received: 14 May 2020; Accepted: 29 May 2020; Published: 1 June 2020

#### **1. Introduction**

A geographic information system (GIS) is a computer-based technology and methodology for collecting, managing, analyzing, modeling, and presenting geospatial data for a wide range of applications. GIS plays a vital role in Earth sciences by providing a powerful means of observing the world and various tools for solving complex problems. The scientific community has used GIS to reveal fascinating details about the Earth and other planets.

This special issue on recent advances in GIS for Earth sciences includes 12 publications from esteemed research groups worldwide. The research and review papers in this special issue belong to the following broad categories: Earth science informatics (geoinformatics), mining, hydrology, natural hazards, and society.

#### **2. GIS for Geoinformatics**

GIS is an important tool for solving complex spatial problems in geoinformatics. Several articles dealing with basic algorithms for spatial data analysis are included in this special issue. Zhou et al. [1] propose an efficient parallel algorithm for polygon overlay analysis. Overlay analysis is a fundamental operator in spatial data analytics and is widely used in Earth science applications. The proposed algorithm includes procedures for active-slave spatial index decomposition for intersection, multi-strategy Hilbert ordering decomposition, and parallel spatial union. The application of their new parallel algorithm to a land-use map of China consisting of multiple polygons with 15,615 elements and 886,547 points shows that the algorithm can perform polygon overlay analysis with high efficiency. Therefore, the study contributes to geoinformatics by allowing the processing of large-scale spatial data for spatial data analytics.

Vector maps in GIS have been widely used in various fields, including Earth science. Currently, huge volumes of vector map data can easily be stolen and distributed without permission from the original data providers. Pham et al. [2] propose a random encryption algorithm based on multi-scale simplification and the Gaussian distribution to encrypt vector map data before it is stored and transmitted. Their experiment using vector maps of Scotland at different scales shows that the proposed algorithm provides higher security and greater computational efficiency for the storage and transmission of vector map data than previous methods. Therefore, the algorithm can be applied to improve the security of online and offline Earth science map services.

QGIS [3], an open-source GIS software, has been utilized in the Earth science community. Dobesova [4] assesses the visual notation of QGIS's Processing Modeler, a graphical editor for workflow design, using the Physics of Notations theory in combination with eye-tracking measurements. The results from this study provide several practical recommendations to improve the effective cognition of the QGIS Processing Modeler, including changing the fill color of symbols, increasing the size and variety of inner icons, removing functional icons, using a straight connector line instead of a curved line, and providing a supplemental preview window for the entire model.

Geo-sensor networks produce large amounts of Earth science data that can be processed using GIS for different purposes and for intelligent decision making. Malik and Kim [5] propose a geo-sensor framework that can be used by multiple clients to deploy their own geo-sensor networks, bind their sensor objects to desired locations, generate geo-sensor services for the uploaded networks, and manage the services with a geo-sensor composite toolbox. The framework is implemented based on the RESTful and SOAP web services [6]. Performance analysis shows that the lightweight RESTful web service is the best choice for ease of implementation and access.

#### **3. GIS for Mining**

Systematic and strategic mine planning, operation, and environmental management are necessary to improve mineral productivity, operational efficiency, and stability in the mining environment. To accomplish these objectives, GIS has been effectively used to design and optimize mine development. Choi et al. [7] review GIS-based methods and applications utilized in mine development, especially for mine planning, operation, and environmental management. They observe that GIS-based methods, including database management, spatial analysis, mapping, and visualization, are effectively used for all stages of mine development at global, regional, and local scales. In the mine planning phase, GIS-based methods are adopted for ore reserve estimation, open-pit boundary optimization, mine infrastructure design, and potential conflict analysis. Various mine operation systems based on GIS have been implemented in mining sites for ore haulage operations, wireless communication, ore management, safety monitoring, underground ventilation, and drainage systems. Moreover, various GIS applications have been developed to support decision-making in mine reclamation planning and re-utilization designs.

As an example of a GIS application for mining, Liu et al. [8] present a spatiotemporal model tightly coupled with GIS for simulating methane emissions in underground coal mines. This tight coupling is achieved by developing a lattice Boltzmann method (LBM)-based turbulent model with an underlying shared FluentEntity model within the GIS. A case study demonstrates that the proposed GIS-based model effectively provides functionalities for lattice domain decomposition, simulation, visualization, and analysis, and improves computational efficiency compared with traditional computational fluid dynamics (CFD) methods. The tight coupling approach for integrating GIS and simulation models is applicable to underground coal mine disasters.

#### **4. GIS for Hydrology**

In hydrological studies, GIS has facilitated the development of a dynamic model for analyzing runoff phenomena as well as a distributed parameter model that considers spatial variability in parameters related to the runoff process. The TOPography-based hydrological MODEL (TOPMODEL) is a distributed parameter model that uses a digital elevation model (DEM) in GIS. However, TOPMODEL is affected by the resolution of the DEM used. Park et al. [9] investigate a reliable DEM grid-size resolution that exhibits low sensitivity to changes in input parameters during runoff simulations. A case study in the Dongkok and Ieemokjung watersheds in South Korea shows that the efficiency of TOPMODEL rarely changes up to a DEM grid-size resolution of approximately 40 m but changes more noticeably at coarser resolutions. The findings of this study are important for understanding and quantifying the modeling behavior of TOPMODEL under the influence of varying DEM resolution.

Social media data collected through Twitter, Facebook, Flickr, and Weibo can be used to improve understanding of urban hydrology. Wang et al. [10] examine rainstorm-related micro-blogging activities in response to rainstorms in an urban environment at fine spatial and temporal scales. The study collected hourly precipitation data and a total of 3.32 million Weibo posts geotagged to Beijing, China, from June to September 2017. The consistency between rainfall amount and human activities can be explained by the distribution of water ponding sites and major transportation hubs. The results show that human responses to the rainstorm event are consistent, though with certain time lags, in virtual and physical spaces at both grid and city scales.

#### **5. GIS for Natural Hazards**

Advances in GIS have popularized its application to spatial analysis of natural hazards. In particular, GIS has been widely used for landslide susceptibility mapping. Landslide susceptibility maps generated by GIS can be effectively used for future land planning and monitoring. Dikshit et al. [11] review studies of rainfall-induced landslides in the Indian Himalayan region to provide, for the first time, a reference point for researchers working in this region, along with a summary of the improvements most urgently needed to better address landslide hazard research and management. Their study reveals that the inclusion of climate change factors and the acquisition of basic input data of the highest quality for computational models are critical for landslide susceptibility mapping.

Zhao and Chen [12] present an example of GIS-based landslide susceptibility mapping using ensemble techniques of functional tree-based bagging, rotation forest, and dagging (functional trees (FT), bagging-functional trees (BFT), rotation forest-functional trees (RFFT), and dagging-functional trees (DFT)). A landslide inventory map with 263 landslide events is established for Zichang County, China, and 14 landslide conditioning factors are selected to analyze the correlation between the conditioning factors and the occurrence of landslides. The results show that the BFT model achieves the highest prediction rate among the four ensemble models.

#### **6. GIS for Society**

GIS plays an important role in society, especially for land-use planning. Land is a complex system providing food, fresh water, and other material resources for humans, and it is essential for habitation, transport, leisure, and other activities. For land-use planning, various factors such as topography, soil, hydrology, biology, and climate must be considered simultaneously. Xiang et al. [13] use GIS to assess the spatiotemporal dynamic multi-functionality of land use and to analyze obstacle indicators in Xiangxi, China, using two methods (analytic hierarchy and hierarchical weighting). The study finds that the spatial heterogeneity of land use in Xiangxi is increasingly evident. The production function of land use in Xiangxi is slowly increasing, with more rapid growth in the southern and central regions than in the northern regions. Three types of obstacles preventing efficient land use in Xiangxi are identified by GIS-based spatial analysis.

Different land uses are connected by transport networks to improve accessibility for human activities. Yu et al. [14] analyze the traffic flow network using GIS to understand the properties of spatial connectivity, spatial aggregation, and spatial dynamics. The study conducted a series of experiments to explore the transport system in Beijing city using taxi trajectory points recorded by the global positioning system (GPS). The results indicate that the interactions of land use show different characteristics over different time periods. Aggregation patterns of functional areas are dynamic over time and are strongly associated with the travel behaviors of residents in the city.

**Funding:** This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2018R1D1A1A09083947).

**Acknowledgments:** This special issue would not be possible without the contributions of professional authors and reviewers, and the excellent editorial team of Applied Sciences.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **An Efficient Parallel Algorithm for Polygons Overlay Analysis**

#### **Yuke Zhou 1, Shaohua Wang 2,\* and Yong Guan 3**


Received: 29 September 2019; Accepted: 11 November 2019; Published: 13 November 2019

**Abstract:** Map overlay analysis is essential for geospatial analytics. Large-scale spatial data processing poses challenges for geospatial map overlay analytics. In this study, we propose an efficient parallel algorithm for polygons overlay analysis, including active-slave spatial index decomposition for intersection, multi-strategy Hilbert ordering decomposition, and a parallel spatial union algorithm. A multi-strategy-based spatial data decomposition mechanism is implemented, including the parallel spatial data index, the Hilbert space-filling curve sort, and decomposition. The results of the experiments showed that the parallel algorithm for polygons overlay analysis achieves high efficiency.

**Keywords:** parallel algorithm; map overlay analysis; Hilbert ordering decomposition; spatial analysis

#### **1. Introduction**

Map overlay analysis is a fundamental operator in geospatial analytics and is widely used in geospatial applications [1,2]. Large-scale spatial data processing poses challenges for geospatial map overlay analytics [3].

The parallel GIS algorithm is an efficient way to conduct map overlay analysis [4–7]. Spatial data decomposition is the basis of parallel computing architectures built on a spatial data partitioning mechanism [8]. Spatial data domain decomposition in parallel GIS refers to decomposing the object sets in the study area at a certain granularity and assigning them to different computing units for processing to achieve high concurrency. From the perspective of geographic data storage, spatial data domain decomposition mainly refers to allocating spatially adjacent geographical elements to the same physical storage medium according to a certain decomposition principle. Feature elements form spatial clusters, and spatially separated clusters are assigned to different storage areas to realize a parallelized spatial data extraction mode. The technical route of the parallelized map overlay analysis algorithm is based on this division of data and behavior.

Parallel spatial data decomposition needs to take into consideration the data storage and geo-computation in each child node from the perspective of the spatial distribution of feature objects, while spatial data is multidimensional [9]. In the process of parallel overlay analysis, the core of parallelization is the fast intersection judgment of geometric objects and the interactive communication of geospatial data [10,11]. Therefore, the critical principle of layer element decomposition is to maintain the spatial proximity of data. The main feature of geographic data is its strong spatial correlation, so a data-parallel strategy should be compatible with spatial data types. This spatial character distinguishes parallel GIS from ordinary numerical parallel computing and is the key technology of a parallel GIS system [12,13]. The purpose of spatial data decomposition is to localize spatial analysis operations (to reduce synchronization between computing nodes). The modeling of spatial elements can be accelerated by computational localization. With the improvement of computer hardware performance and the decreasing cost of storage, the usual strategy is to exchange storage space for computing time [12,14].

In this paper, we propose an efficient parallel algorithm for polygons overlay analysis. We decompose vector spatial data based on a space-filling curve (the Hilbert curve) to better maintain the spatial proximity of the decomposed data, which is conducive to parallelizing proximity-sensitive overlay operations; on the basis of the original Hilbert ordering decomposition, the spatial data is decomposed using different sorting strategies. This paper is organized as follows. Section 2 introduces related work. Section 3 presents the methods of the parallel algorithm for polygons overlay analysis. The experimental results are given in Section 4. The last section contains the conclusion and further work.

#### **2. Related Work**

The most time-consuming operation of polygon overlay analysis continues to be the intersection of line segments. Ching studied the load balancing problem for parallel map overlay calculation based on line segment sets [15], but did not deeply discuss the communication and merge efficiency of each node after parallel calculation and could not guarantee a constant acceleration ratio for the whole process.

Parallel spatial data region decomposition needs to take into consideration the storage and calculation in each child node from the perspective of the spatial distribution of feature objects while spatial data has multidimensionality. In traditional data partitioning methods, such as token rotation, hash table segmentation, and simple region partitioning, the spatial relationship between objects is split during decomposition, which does not reflect the proximity between spatial data.

The purpose of Hilbert space sorting is to map high-dimensional data to low-dimensional data to the maximum extent [16,17] and to keep geographically adjacent points close together in computer storage, accelerating data extraction and improving the efficiency of data operations in first-level storage. Access to spatial data in memory is performed randomly. For spatial data with an unbalanced distribution, if the point data is too dense in a certain area, data redundancy arises in the index sub-nodes. In order to maintain the uniqueness of the spatial data mapping, a more detailed division of the index grid is required. However, if the division is too detailed, it increases the difficulty and computational complexity of the spatial sorting encoding and also increases the cost of spatial queries.

Spatial data decomposition can be driven by the spatial indexing mechanism. Based on a spatial index, the search space of the candidate dataset in overlay analysis can be effectively reduced. At the same time, false intersections can be further filtered out in the proximity analysis of candidate geometry data, leaving map overlay objects with real intersections. Data decomposition in the parallel overlay analysis method is based on the vector topology data model. The key difficulty is how to assign elements with strong spatial proximity to the same node. Vector data capacity is usually between megabytes and gigabytes, while current computer hard disks offer terabyte-level storage capacity, so equalizing storage capacity is not a critical issue. The key problem is how to effectively balance the computational tasks over the vector data and reduce unnecessary intersection detection and parameter communication between distributed nodes. Because input spatial data can have very different feature densities, there is no conventional rule of thumb for dividing spatial data among distributed nodes in a parallel GIS algorithm.

Most spatial decomposition methods are based on planar space, where points on one side of a plane define one area while those on the other side determine another area. As points on the plane can be divided arbitrarily into one area, using planes to divide the space recursively eventually generates a Binary Space Partition Tree (BSP Tree) [18]. Using spatial data decomposition structures to store objects can easily speed up specific geometric query operations, such as conflict detection, to determine whether two objects are close together or whether a ray intersects an object. The quadtree index belongs to the vertical decomposition mode of the plane [19]. The generation process recursively divides the geographic space into four quadrants until a self-defined termination condition is met (for example, each area contains no more than a given number of points). Finally, a hierarchical quadtree is formed. Quadtree decomposition is a typical planar recursive decomposition [20]. This method is suitable for cases where the data are homogeneous and relatively evenly distributed [21].

The decomposition of a spatial index is determined by the geometric objects (object priority). Decomposing spatial regions into sub-regions according to spatial objects is also called bucketing; therefore, this method is usually called bucket partitioning. The object-oriented data decomposition method needs to follow certain principles. One classic strategy, following the B-tree rule, uses a separating point or line to decompose the spatial extent recursively [22]. Another classic decomposition principle is to keep the bounding rectangles of the objects as small as possible, of which the R-tree is an important implementation [23]. R-tree-based spatial data domain decomposition mainly uses the packing or bucket mode of the R-tree, and R-tree optimization mainly concerns the packing and sorting of sub-node index packets. Kamel and Faloutsos proposed an MBR sorting method based on the Hilbert curve [24]. Roussopoulos proposed a sorting method based on Nearest-X in a given direction. In the classic R-tree implementation, Guttman (1984) proposed two heuristic tree node decomposition methods, Quadratic-Split and Linear-Split [23]. The performance of the R-tree depends on the quality of the clustering algorithm applied to the bounding rectangles in its nodes. The Hilbert R-tree uses a space-filling curve, specifically the Hilbert curve, to organize the bounding rectangles linearly. The Hilbert R-tree comes in two forms: a static database form and a dynamic database form. In the research scheme of this paper, the Hilbert filling curve is used to achieve a better ordering of high-dimensional data within nodes. This ordering ensures that the bounding rectangles of similar objects are grouped together, keeping the area and perimeter of the resulting enclosing rectangles as small as possible, so the Hilbert curve is a good sorting method in this sense. The compact Hilbert R-tree is suitable for static databases [25], in which there are few or no update operations. The dynamic Hilbert R-tree is suitable for dynamic databases [26–28], which require real-time insert, delete, and update operations. The dynamic Hilbert R-tree uses an elastic split-delay mechanism to increase space utilization; each node in the tree has a well-defined set of sibling nodes. By establishing the order of the nodes in the R-tree and adjusting the Hilbert R-tree data partitioning strategy, an ideal degree of space utilization can be achieved. The Hilbert R-tree is ordered by the Hilbert value of the center point of each object's rectangle, where the Hilbert value of a point is the length along the Hilbert curve from its start to that point. In contrast, other R-tree variants offer no control over space utilization. Leutenegger proposed a new R-tree variant, the Sort-Tile-Recursive tree (STR-tree). The algorithm is recursive: for a set of r spatial rectangles in the k-dimensional plane, let the maximum leaf capacity of the R-tree be n; the rectangles are first sorted by the x value of their center points. The idea of tiling is to use √(r/n) vertical cuts to divide the sorted rectangles so that each strip holds close to √(r/n) nodes. Each slice is then sorted by the y value of the rectangle center points, and a leaf node is packed every n rectangles; the slices are processed top-down recursively to generate the entire R-tree.
One measure of the efficiency and accuracy of an R-tree index is the area and perimeter of the sub-node MBRs in the tree: the smaller the area and perimeter, the higher the spatial aggregation. By this measure, the R-tree proposed by Guttman (1984) has some shortcomings: long loading time, insufficient subspace optimization, and long data extraction time for spatial queries.
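To make the STR packing rule above concrete, the following is a minimal C++ sketch of STR leaf packing (sort by center x, cut into about √(r/n) vertical slices, sort each slice by center y, pack a leaf every n rectangles). The names `Rect` and `strPackLeaves` are illustrative and are not taken from the paper's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimum bounding rectangle (MBR) of a spatial object.
struct Rect {
    double xmin, ymin, xmax, ymax;
    double cx() const { return (xmin + xmax) / 2.0; } // center x
    double cy() const { return (ymin + ymax) / 2.0; } // center y
};

// Pack r rectangles into leaf groups of capacity n following the STR rule.
std::vector<std::vector<Rect>> strPackLeaves(std::vector<Rect> rects, std::size_t n) {
    const std::size_t r = rects.size();
    if (r == 0 || n == 0) return {};
    // Number of vertical slices: roughly sqrt(r / n).
    const std::size_t s = static_cast<std::size_t>(
        std::ceil(std::sqrt(static_cast<double>(r) / static_cast<double>(n))));
    const std::size_t sliceSize = s * n; // rectangles per vertical slice

    // 1. Sort all rectangles by the x value of their center points.
    std::sort(rects.begin(), rects.end(),
              [](const Rect& a, const Rect& b) { return a.cx() < b.cx(); });

    std::vector<std::vector<Rect>> leaves;
    for (std::size_t i = 0; i < r; i += sliceSize) {
        // 2. Within each vertical slice, sort by the y value of the center point.
        const auto first = rects.begin() + static_cast<std::ptrdiff_t>(i);
        const auto last  = rects.begin() +
            static_cast<std::ptrdiff_t>(std::min(i + sliceSize, r));
        std::sort(first, last,
                  [](const Rect& a, const Rect& b) { return a.cy() < b.cy(); });
        // 3. Pack a leaf node every n rectangles.
        for (auto it = first; it != last;) {
            const auto stop = (last - it) > static_cast<std::ptrdiff_t>(n)
                                  ? it + static_cast<std::ptrdiff_t>(n)
                                  : last;
            leaves.emplace_back(it, stop);
            it = stop;
        }
    }
    return leaves; // upper tree levels are built by recursing on the leaf MBRs
}
```

The upper levels of the STR-tree are obtained by treating the MBRs of the packed leaves as a new rectangle set and repeating the same procedure.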

Whether equilibrium grid decomposition, quadtree decomposition, or traditional R-tree decomposition is used, the problem of highly uneven spatial distribution and density cannot be avoided; the regularized partitions of these decomposition methods suffer from it to different degrees.

The algorithms for parallel overlay analysis involve frequent data extraction from the cluster environment and intersection of geometric objects [6]. In a single-disk, single-processor environment, the traditional spatial data extraction method uses the index structure of the spatial database. However, a single-point spatial index storage and access mechanism cannot meet the data extraction speed required by high-performance computing in multi-disk, multi-processor environments [29]. Therefore, it is necessary to implement a fast spatial data filtering and extraction mechanism matched with the distributed shared-nothing mode of the parallel overlay analysis computing environment. The spatial index is an important criterion for measuring the quality of a spatial database. A spatial database usually holds data tables with millions of records; if the traditional database indexing method is adopted, query efficiency is seriously degraded. At the same time, spatial data carries spatial object uncertainty, and relationships such as intersection, inclusion, and separation are complex, so spatial data is difficult to classify and sort with ordinary data processing. Spatial indexes in the spatial database field can be divided into two types: embedded and external [30]. An embedded spatial index structure is incorporated into the database as part of the database itself, while an external index usually takes the form of middleware that performs a proxy and forwarding mechanism between the data request and data layers. For example, the default indexing methods of Oracle Spatial and PostGIS are the R-tree and B-tree [31,32]. ESRI ArcSDE is an external spatial database management mechanism that has no specific spatial data carrier but builds on a traditional RDBMS, so an extension such as ArcSDE can implement spatial data management and indexing mechanisms on top of SQL Server and Oracle. The spatial database established by ArcSDE is called a geodatabase [33], whose default indexing strategy is a spatial grid index on the feature class. Regarding distributed spatial index data decomposition, Kamel (1993) and others applied the R-tree to single-processor, multi-disk hardware structures and implemented an R-tree-based parallel query mechanism. Zhang et al. used a Multiplexed-R-tree structure to optimize the R-tree in combination with proximity index constraints [34]; experiments show that parallel domain query performance is better when dealing with spatially balanced data. In order to improve the efficiency of massive spatial data management and parallel processing in a distributed parallel computing environment, Tanin et al. implemented a distributed quadtree-based spatial query function in a peer-to-peer network environment [35].

With the development of high-performance parallel GIS applications, the parallel spatial index has gradually become an essential branch of the spatial index family and can solve the problems of the simple data decomposition methods discussed in this study. The most typical parallel spatial index is the MC-R-tree (Master-client R-tree) method proposed by Schnitzer [36]. The method is characterized in that all non-leaf nodes of the spatial index tree are stored in the main cluster node, while each subtree of the index tree is stored in a sub-computing node of the cluster. The disadvantages are that the space utilization of the conventional R-tree is low and that the number of MBR overlaps is higher when the subtree indexes are assigned to the child nodes.

#### **3. Methods**

This section includes active-slave spatial index decomposition for intersection, multi-strategy Hilbert ordering decomposition, and the parallel spatial union algorithm.

#### *3.1. Active-Slave Spatial Index Decomposition for Intersection*

Regarding the dynamic nature of overlay analysis, the superimposed layers are divided into an active layer (overlay layer) and a passive layer (base layer). The focus of parallel acceleration is fast querying of the geometric elements in the active layer, and the combination of layer intersection with a spatial index is the key technology for achieving this acceleration. Given the characteristics of the parallel overlay analysis in this paper, the storage of spatial data adopts a completely redundant mechanism, and each child node maintains a complete set of the spatial data tables to be superimposed.

A data decomposition method is proposed for the parallel intersection operation. The process is to spatially decompose the active layer in advance and send the partition location information to the corresponding child node according to the FID (the feature identifier used in the spatial data layer); the child node then locally extracts the active layer data in its range. All geographic elements in the passive layer participate in building the whole spatial index tree; the elements in the active layer query the passive layer index and perform overlay operations on the intersecting candidate sets. Since the spatial query and the intersection operation in overlay analysis are largely the same operation, the data decomposition mode is designed to be combined with the parallel intersection operation. In Figure 1, O\_obj represents an entity object in the active overlay layer O\_lyr; after the decomposition strategy, it is distributed to a child node, and the base layer B\_lyr is processed in parallel.

**Figure 1.** Parallel data decomposition using R-tree.

The actual data decomposition divides and treats only the overlay layer O\_lyr, while the base layer is still queried as a whole. This solution achieves acceleration at reasonable computational complexity in small to medium data-scale applications. The data decomposition characteristics of the above map overlay algorithm show that if the full base layer resides in each child node, the spatial index of the entire tree has two defects: first, when processing a layer with a large number of geographic features, each node consumes a lot of time and memory resources; second, each query traverses the entire tree, so the intersection query of a single geometric object incurs a large number of invalid queries.

To improve the efficiency of massive spatial data management and processing in a parallel environment, an improved master-slave spatial indexing method, MC-STR-tree, was designed based on the current research results; it provides the conditions for rapid data management and fast geometric intersection.

The basic index structure adopts the STR-tree; the organization of R-tree object nodes is improved, and conflicts in data distribution among the nodes are reduced. The STR-tree has good spatial clustering characteristics and an extremely low spatial index packet overlap rate, so it has excellent parallelization characteristics. It is a highly balanced tree that is simple and maximizes space utilization.

The processing steps of the improved master-slave parallel space index MC-STR-tree decomposition (Figure 2) are as follows:

#### (1) Data preprocessing

The master node divides the MBRs of the spatial data index nodes into n parts according to specific rules; commonly used rules are Hilbert ordering and the STR partitioning method, where n is an integer not larger than the number of cluster computing nodes. The master node needs to record the node tag to which each subtree is sent and the root-node MBR of the subtree. In this paper, the leaf nodes are divided according to the STR rule, and each computing node corresponds to a strip in the STR. To keep the number of subtrees below the number of computing nodes, the parameter node Vol is used to set the maximum number of index sub-nodes allowed for each computing node in the STR-tree.

#### (2) Sending of the subtree index

The master node sends the index entries to the child nodes according to a certain replication and allocation policy. The data structure uses an in-memory spatial index; unlike a traditional hard-disk-based spatial index, the in-memory spatial index offers dynamic construction and fast queries.

#### (3) The master node sets up the Master-R-tree

The master node builds a new spatial index in computer memory from all non-leaf nodes of the original R-tree index tree. This index tree records the index item information allocated to each computing node and functions as an overall index controller. The intersection query of the active overlay layer is first filtered by this overall index tree.

#### (4) Computing node to build a Client-tree

According to the agreed STR-tree node organization rule, the index items sent by the master node are built into a client tree. The entity objects stored on the computing node are associated with the index items of the subtree, and the entity spatial data is extracted according to the index item MBRs.

Analyzing the parallel R-tree index decomposition based on the STR rule, its essence is to replace random partitioning by element FID with a domain decomposition partition that respects spatial neighborhood rules. The master node maintains the overall spatial index R-tree, which records the id information of the spatial data domain decomposition blocks in the cluster. The active layer first performs a query operation against the overall index tree; if one or more intersecting subdomains are found, the master node notifies the child nodes holding those subdomains to extract the active layer elements locally. Each element is then further intersected with the subtree; this process is analyzed in the parallelization of the intersection.

**Figure 2.** Sort-Tile-Recursive (STR)-tree data decomposition.

#### *3.2. Multi-Strategy Hilbert Ordering Decomposition*

As can be seen from the process of generating Hilbert code values, the code value is determined by the center point of the space grid, which is called the middle (midpoint) strategy of ordering. This section studies the difference between the middle strategy and the median strategy. Hilbert curve sorting is optimized so that the sorting method better accounts for the spatial proximity of map features. Spatial sorting is the reversible one-to-one correspondence between spatial feature points and consecutive integers in one-dimensional space used as key values.

The middle strategy is easier to understand: in the process of dividing the spatial hierarchy, each cell is divided at a fixed position, namely its middle point. In practical applications, the middle strategy is suitable for low-dimensional point sets with a relatively balanced distribution.

Unlike the middle strategy, the median strategy divides the cells in space according to the median point, in the x or y direction, of the points to be sorted (Figure 3); the geometric median provides an effective way to position such divisions. In high-dimensional space, the median point better represents the centrality of a cluster of points, so the median strategy is suited to high-dimensional point sets or point sets with uneven distribution.

**Figure 3.** Demonstration of the Hilbert median sort.

From the definition of the median and its application context, the cluster-preserving characteristics of Hilbert ordering can be better understood. The median is a concept from order statistics: in a set of n elements, the *i*-th order statistic is the *i*-th smallest element of the set. For example, the minimum is the first order statistic (*i* = 1), and the maximum is the *n*-th order statistic (*i* = *n*). Informally, the median is the "midpoint element" of its collection. When *n* is odd, the set has a unique median at *i* = (*n* + 1)/2; when *n* is even, it has two medians, at *i* = *n*/2 and *i* = *n*/2 + 1. Ignoring the parity of *n*, the medians appear at *i* = ⌈(*n* + 1)/2⌉ (upper median) and *i* = ⌊(*n* + 1)/2⌋ (lower median). In our study, Hilbert median-strategy sorting adopts the lower median by default. The CGAL library is used to assist with Hilbert sorting in the spatial data decomposition process. Because spatial data cannot be extracted from the data source directly by Hilbert code value, the main computing node in the parallel computation must first establish the mapping relationship with the spatial database before the data is decomposed. Figure 4 shows a world map divided by the two different Hilbert sorting methods; each color denotes a different class in the result of the corresponding Hilbert sorting method.

**Figure 4.** Hilbert middle and median sort for world map (4 nodes).
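As a concrete illustration of the two strategies, the following minimal C++ sketch uses the CGAL spatial sorting package mentioned above. It assumes CGAL's `hilbert_sort` function with its middle/median policy tags; the point data is invented for illustration and is not from the experiments.

```cpp
#include <CGAL/Simple_cartesian.h>
#include <CGAL/hilbert_sort.h>
#include <vector>

typedef CGAL::Simple_cartesian<double> Kernel;
typedef Kernel::Point_2 Point_2;

int main() {
    std::vector<Point_2> pts = {
        Point_2(0.1, 0.9), Point_2(0.4, 0.2),
        Point_2(0.8, 0.7), Point_2(0.3, 0.5)
    };

    // Middle strategy: each cell is split at its geometric midpoint.
    CGAL::hilbert_sort(pts.begin(), pts.end(),
                       CGAL::Hilbert_sort_middle_policy());

    // Median strategy: each cell is split at the median of the points
    // along x or y (the lower median is adopted by default in this study).
    CGAL::hilbert_sort(pts.begin(), pts.end(),
                       CGAL::Hilbert_sort_median_policy());
    return 0;
}
```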

The process of decomposition based on Hilbert sorting is shown in Figure 5. First, feature points are extracted from the feature elements, and the appropriate Hilbert space sorting strategy is selected according to the distribution characteristics of the spatial data. After Hilbert sorting, the main computing node follows the order of the one-dimensional Hilbert curve and distributes the data to each sub-node in a polling fashion according to the interval division. Because the vector data in this paper is stored in a spatial database, what is distributed here is the location tag of each object in the spatial database table.

**Figure 5.** Spatial data decomposition using Hilbert sort.

The most crucial purpose of spatial data domain decomposition in a parallel GIS computing system is that each child node can perform spatial data extraction or query work at the same time without pulling information from the global control node; traditional data partitioning, however, is based on static clustering. Reasonable spatial data partitioning should consider the spatial location distribution and the length uncertainty of geographic data. Whether Hilbert-ordering decomposition is the best strategy can be judged from two aspects: whether each piece of sub-data maintains spatial proximity to the maximum extent, and whether the storage load of the data blocks on each node is balanced.

#### *3.3. Parallel Spatial Union Algorithm*

The parallel spatial union algorithm process is as follows:

#### **(1) Polygon data feature point extraction**

The objects sorted by the Hilbert curve are point objects with integer values, so the ordering of polygons is converted into the ordering of their representative feature points. The feature point of a polygon is usually the center point of its bounding rectangle, one of the four corner points, the geometric center point, or the center of mass. Since calculating the geometric center and the center of gravity of a polygon is time-consuming and thus unfavorable for the parallel acceleration of map overlay analysis, the center point of the bounding rectangle is chosen as the feature point. The layer set S is regarded as the set of all layers L, and feature point extraction maps each layer to a feature point set constructed by a mapping function.

#### **(2) Feature point Hilbert space ordering**

The function corresponding to Hilbert curve sorting can be written as d = H(x, y), where x and y are integer coordinate values and d is the Hilbert code value. For floating-point data in practical applications, coordinates must be scaled to the nearest integer values. Constructing a Hilbert curve in two-dimensional space maps the unit interval onto the unit square; each cell is recursively divided into four parts until it contains only one feature point. The computational complexity of constructing the Hilbert sorting curve is O(n), where n is the number of hierarchical divisions of the Hilbert space. In the calculation of Hilbert code values in this paper, the value of n is not forcibly constrained, so that even densely clustered point objects still maintain a unique mapping.
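The mapping d = H(x, y) can be illustrated with the classic iterative conversion from grid coordinates to a Hilbert code value. The sketch below is the standard textbook formulation, not the authors' implementation, and assumes coordinates already scaled to integers on a 2^k × 2^k grid.

```cpp
#include <cstdint>

// Rotate/flip a quadrant so that the curve orientation is preserved.
static void rotateQuadrant(std::int64_t n, std::int64_t& x, std::int64_t& y,
                           std::int64_t rx, std::int64_t ry) {
    if (ry == 0) {
        if (rx == 1) { // reflect
            x = n - 1 - x;
            y = n - 1 - y;
        }
        const std::int64_t t = x; // swap x and y
        x = y;
        y = t;
    }
}

// n is the grid side length (a power of two); the returned d is the
// Hilbert code value, i.e., the distance along the curve to cell (x, y).
std::int64_t xy2d(std::int64_t n, std::int64_t x, std::int64_t y) {
    std::int64_t d = 0;
    for (std::int64_t s = n / 2; s > 0; s /= 2) {
        const std::int64_t rx = (x & s) > 0 ? 1 : 0;
        const std::int64_t ry = (y & s) > 0 ? 1 : 0;
        d += s * s * ((3 * rx) ^ ry);
        rotateQuadrant(n, x, y, rx, ry);
    }
    return d;
}
```

For example, on a 4 × 4 grid this formulation gives xy2d(4, 3, 3) = 10: cell (3, 3) is the eleventh cell visited under this curve orientation.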

In the parallel overlay analysis, Hilbert curve ordering is performed according to the distribution characteristics of the feature elements using two strategies: middle strategy and median strategy.

#### **(3) Distribution of nearby polygons**

In the parallel overlay analysis system, the master node is responsible for loading all the map layers involved in the calculation and establishing the Hilbert sort numbers for their contents. The elements in each layer are evenly distributed among the compute nodes; with n layer elements and a given number of processors, each node receives an equal share. The codes along the Hilbert curve must be mapped to the FIDs of the map elements; this one-to-one correspondence comes from the preceding Hilbert ordering process, in which two mapping relationships are connected through the feature point coordinates. In the resulting mapping, each Hilbert code value corresponds to the FID of a marked element, which serves as the data extraction condition for the overlay query.

The implementation process of the Hilbert union algorithm is as follows:

(i) Data partitioning strategy. The experiment uses three data distribution strategies: fid order, Hilbert\_median, and Hilbert\_middle. First, the master node determines the number of nodes participating in the calculation and then distributes the ordered sets equally among them (a sketch of this interval division is given after this list).

(ii) Independent calculations, generating intermediate results. Each computing node is responsible for the union operation on its local polygons and writes the generated intermediate results to a temporary table in the MySQL spatial database.

(iii) Results collection and union. The master node loads the intermediate layers into local memory in the order in which the child nodes complete their tasks and then unions these layers to output the final result layer.
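The following is a minimal sketch of the interval division in step (i), assuming the elements have already been Hilbert-ordered: the master node only computes [begin, end) interval bounds over the ordered sequence, one per computing node. The name `splitIntervals` is illustrative, not from the paper.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Split n Hilbert-ordered elements into k contiguous [begin, end) intervals
// (k >= 1); the first n % k nodes receive one extra element.
std::vector<std::pair<std::size_t, std::size_t>>
splitIntervals(std::size_t n, std::size_t k) {
    std::vector<std::pair<std::size_t, std::size_t>> intervals;
    const std::size_t base = n / k;
    const std::size_t extra = n % k;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < k; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        intervals.emplace_back(begin, begin + len);
        begin += len;
    }
    return intervals; // only interval bounds are communicated, not geometry
}
```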

The preceding part described the operation on a single layer, where the data decomposition concerns a single hierarchy of objects and divide-and-conquer is straightforward. For overlaying and merging two or more layers, the spatial data distribution of multiple layer types must be considered, such as the case of line-polygon overlay, to prevent overlaying layers with imbalanced spatial distributions.

From the perspective of spatial data decomposition parallelism, the parallel overlay of multiple layers can use one of two methods.

(i) According to the above design, the Hilbert sorting decomposition parallel method is based on the spatial ordering of multiple two-dimensional plane layers through higher-dimensional Hilbert curves. For processing two-dimensional plane map data, the specific process can be divided into dimension reduction and dimension raising, and the two are not contradictory. Dimension reduction applies to a polygon layer: a single point is extracted from each polygon in the layer, such as the center point of the bounding box, a corner point, or the centroid, to uniquely identify the polygon as its mapping onto the one-dimensional Hilbert function.

The polygon objects of the two layers are loaded into the same container, and the geometric objects in the container are manipulated according to the Hilbert parallel sorting algorithm to merge the individual layers. The advantage of this method is that the merge correspondence of polygons in the container need not be considered explicitly, as the division follows the Hilbert ordering. However, polygons that do not intersect in the original single layer may become indirectly connected through the addition of new layer features, as shown in Figure 6. Indirect connectivity increases the complexity of the parallel union. If the polygons in the two layers have only one-to-one relationships, they can be merged directly in batch form; when indirect connectivity is considered, the polygons may spread across pieces. In this case, the parallel union process must continue with binary-tree merging of the data decomposition results.

**Figure 6.** Connection relationship between two layers.

(ii) A combination of multi-layer filtering mechanisms using parallel master-slave indexes.

The master node establishes a coarse-grained spatial index for the two layers, which is equivalent to establishing m and n spatial index subtrees for the respective layers. First, the intersection of the root nodes of these subtrees is tested, and separated subtrees are filtered out. The information of the intersecting subtrees is sent to the child nodes for the merge operation on independent subtrees. The merge mode can be understood as follows: spatial analysis over spatial data of different granularity levels can be performed directly on the data with the lower granularity level, whereas spatial analysis over spatial data of the same granularity level must first be superimposed to generate spatial data of a lower granularity level, on which the analysis is then based.

According to the implementation principle of the parallel master-slave index union, with data decomposition based on a spatial index it is difficult to completely eliminate the redundancy of index objects, so different nodes will hold the same geometric object, and the subtree merge consumes resources recursively at run time; the master still needs to collect and combine the intermediate results at the end. The sprawling character of union operations means they cannot be parallelized as pure batch processing, so the parallel union in master-slave index tree mode is difficult to implement in practical applications. The parallel union research in this paper therefore treats the two layers as a whole and uses Hilbert ordering as the experimental subject; the parallel master-slave index union will be studied further in future work.

In our study, the two layers to be unioned are treated as one overall virtual layer and parallelized according to the strategy of polygon merging within a single layer (Figure 7). The specific process is as follows:

(i) First, create a data structure featureInfo for each feature in each layer. The main member variables are fid, layer\_num, hilbert\_code, and mbr\_pt. The fid records the cursor of the feature in the layer and serves as a query condition for quickly extracting the data later. The layer\_num records which layer the feature belongs to and is one of the filtering conditions of the data query. The hilbert\_code defaults to a null value when the featureInfo is initialized and records the encoding of the Hilbert space sorting. The mbr\_pt records the center point of the feature's MBR, used as the feature point for Hilbert ordering.

```cpp
typedef struct
{
    long fid;          // cursor of the feature in its layer
    long layer_num;    // layer the feature belongs to (filter condition)
    long hilbert_code; // Hilbert space sorting code (null until sorted)
    GTPoint* mbr_pt;   // center point of the MBR (Hilbert feature point)
} featureInfo;
```

(ii) Put the featureInfo records of the two layers into the vector array Vector<featureInfo\*> and use the Hilbert spatial sorting function hilbert\_sort() to sort and encode the elements in the array.

(iii) According to the number k of available computing nodes, the total of n elements is distributed so that every n/k consecutive elements in the Hilbert coding order are sent to one node. The polling method sends only the Hilbert space sorting number interval rather than the specific geometry objects.

(iv) Each child node receives its Hilbert space sorting coding interval, extracts the geometric objects of the features from the local database according to fid and layer\_num in featureInfo, and starts to merge the intersecting polygons within the interval.

(v) Continue the merge in the form of a single layer binary tree merge until all intersecting polygons are merged.
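The following C++ sketch illustrates steps (ii) and (iii) under stated assumptions: `GTPoint` is a minimal stand-in for the paper's point type, and `hilbert_sort` here simply orders records by a precomputed hilbert\_code (for example, one produced by the xy2d mapping sketched earlier); neither is the authors' actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct GTPoint { double x, y; }; // minimal stand-in for the paper's GTPoint

struct featureInfo {
    long fid;          // cursor of the feature in its layer
    long layer_num;    // which of the two layers the feature came from
    long hilbert_code; // filled in by the Hilbert encoding step
    GTPoint* mbr_pt;   // MBR center point used as the feature point
};

// Stand-in for hilbert_sort(): assumes hilbert_code was already computed
// for every record and simply orders the records by that code.
void hilbert_sort(std::vector<featureInfo*>& features) {
    std::sort(features.begin(), features.end(),
              [](const featureInfo* a, const featureInfo* b) {
                  return a->hilbert_code < b->hilbert_code;
              });
}

// Steps (ii)-(iii): treat two layers as one virtual layer, sort by Hilbert
// code, and assign contiguous coding intervals to k computing nodes (k >= 1).
std::vector<std::vector<featureInfo*>>
decomposeTwoLayers(const std::vector<featureInfo*>& layerA,
                   const std::vector<featureInfo*>& layerB, std::size_t k) {
    std::vector<featureInfo*> all;
    all.insert(all.end(), layerA.begin(), layerA.end());
    all.insert(all.end(), layerB.begin(), layerB.end());
    hilbert_sort(all); // order the virtual layer along the Hilbert curve

    // Every n/k consecutive elements go to one node; in practice only the
    // coding interval is sent, and each node re-extracts geometry locally
    // by fid and layer_num.
    std::vector<std::vector<featureInfo*>> perNode(k);
    const std::size_t n = all.size();
    for (std::size_t i = 0; i < n; ++i)
        perNode[i * k / n].push_back(all[i]);
    return perNode;
}
```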

In this scheme, the layers are combined into one overall layer before Hilbert sorting decomposition. The advantage is that the data is divided into non-redundant polygon subsets across the nodes, and the uniqueness of Hilbert code values ensures low coupling of the union calculation between processes. The binary-tree parallel merge reduces the collection pressure on the master node. Therefore, the scheme is feasible and can theoretically obtain an acceleration effect. The following is an experimental verification of the multi-layer Hilbert union method using actual data.

The computational complexity of this algorithm is O(N + K), which is derived from the count of polygons in the overlaid map layers. Given *N* polygons in each layer, a brute-force overlay test would have complexity O(N²). However, we perform an intersection filtering step that excludes unnecessary computation for polygons that do not overlap each other in space; thus, only K polygons in the overlaid layer participate in the overlay computation. Ultimately, the complexity of this algorithm is O(N + K).

**Figure 7.** Hilbert decomposition for two layers.

#### **4. Experiments**

The parallel algorithm for polygons overlay analysis was implemented, and the performance of the parallel spatial union was verified. A multi-strategy-based spatial data decomposition mechanism was implemented, including the parallel spatial data index, the Hilbert space-filling curve sort, and spatial data decomposition. The programming language was C++. The operating system of each computation node was 64-bit Ubuntu Linux (16.04), running on Intel CPU-based hardware (Model: Intel Core i7-3960X, 3.8 GHz, 12 cores).

#### *4.1. The Experiment for Parallel Spatial Union*

The experiment used the spatial union in map overlay analysis as an example to conduct parallelization experiments. The experimental data is the Chinese 1:4 million land-use map; the data type is multi-polygon, with 15,615 elements and 886,547 points (Figure 8). The visualization strategy in Figure 8 is the FID-based strategy. The experimental dataset is typical of the data commonly used in GIS applications and represents the general characteristics of spatial data decomposition: its spatial distribution is irregular and unbalanced, and the data volume is large, so it is representative for experiments and instructive for practical applications. Figures 9–11 show the results of the parallel spatial union using fid decomposition, median decomposition, and middle decomposition, with 2, 4, 8, and 16 partitions, respectively. Different colors denote different classes in these maps.

**Figure 8.** The landuse map of China.

**Figure 9.** Parallel spatial union using the fid decomposition method.

**Figure 10.** Parallel spatial union using the median decomposition method.

**Figure 11.** Parallel spatial union using the middle decomposition method.

Figure 12 shows the resource consumption of a computer when the parallel overlay with Hilbert ordering is run using MPI. The system starts a total of 10 computing processes. It can be seen from the figure that the CPU usage of each process is basically the same during the parallel computation, and the memory usage averages about 1.3%.

**Figure 12.** Resource consumption for parallel spatial union.

According to the experimental results, the parallel merging strategies are analyzed. Comparing FID partitioning and Hilbert partitioning, the results of FID partitioning are scattered: the merge results within the same sub-node are not necessarily adjacent. Hilbert partitioning divides the merge results better; with FID merging, even with only two computing nodes, the merged result of the first node also includes part of the merged result of the second node. At the same time, it can be observed that the merged results of FID partitioning are basically horizontal strips, while the Hilbert partitioning results are basically blocky. It can also be observed from the merging of Taiwan Province that the fid division strategy splits the islands into multiple blocks for processing, while the Hilbert spatial sorting combination keeps polygons of small area and similar shape together, so that roughly circular polygonal areas maintain their integrity.

The parallel computing times and speedup ratios of the three merge modes are shown in Figures 13 and 14. The parallel computing time is the sum of the data decomposition time, the independent calculation time, and the final merge time at the master node. The calculation time decreases as the number of processes increases; the acceleration is obvious with fewer than four processes, where it exceeds the linear speedup ratio, but it drops significantly with more than four processes for two reasons: first, if multiple processes run on one child node, their data I/O is concurrent; second, multiple child nodes send database write requests to the master node, the number of polygons the master node needs to merge increases, and the database is locked while the writing process takes time. When the number of threads is greater than eight, the median strategy shows a decreasing speedup ratio, indicating that the cost of collection and consolidation at the master node exceeds the gains of parallel computing. The solution is to increase the number of physical machines and keep each child node running a small number of processes, reduce read-write conflicts, allow as many binary-tree merges as possible between child nodes, and reduce the number of polygons merged at the master node.

**Figure 13.** Computing time for parallel polygon union.

**Figure 14.** Speedup ratio for parallel polygon union.

From the analysis of parallel acceleration efficiency, the fid division strategy has the lowest efficiency: from 1 process to 10 processes, the efficiency drops from 1 to 0.234, a large change. The reason is that when polygon data is allocated by fid, the number of actually intersecting polygons within each node is small, so many polygons must be merged again when the master node collects the results, and the overload of the master node lowers the overall acceleration efficiency. The acceleration efficiency of the Hilbert\_median partitioning strategy varies between 2~4 processes and 8~10 processes and is basically unchanged between 4~8 processes. The efficiency of the Hilbert\_middle strategy shows a linear downward trend as the number of processes increases. From the perspective of this degradation of parallel union efficiency, although a data decomposition strategy that accounts for spatial proximity can achieve a certain acceleration effect, there is still room for further optimization in the reduction phase of the parallel algorithm.

#### *4.2. The Experiment for Parallel Hierarchical Spatial Union*

In the experimental data, the basic layer is the planning map of a certain urban area, and the overlaid layer is the same planning map after translation (Figure 15). The result of the spatial union operation is shown in Figure 16.

**Figure 15.** Test data for the Hilbert decomposition of two layers.

**Figure 16.** Two data layers' union result.

#### **5. Conclusions and Future Work**

In this study, we propose an efficient parallel algorithm for polygon overlay analysis, including active-slave spatial index decomposition for intersection, multi-strategy Hilbert ordering decomposition, and a parallel spatial union algorithm. A multi-strategy spatial data decomposition mechanism is implemented, including a parallel spatial data index, Hilbert space-filling-curve sorting, and data decomposition. The results of the benchmark experiments show that the parallel algorithm for polygon overlay analysis achieves high efficiency.

However, this study has some limitations. First, we have not discussed cloud computing or edge computing for map overlay analysis; these new computing frameworks can improve the efficiency of spatial analysis for large-scale geospatial data [37–39]. Second, graphics processing unit (GPU)-based geo-computation is not considered, although the GPU could further improve the speedup ratio [40]. In future work, we will address these limitations: first, a map overlay algorithm based on cloud computing will be proposed; second, we will consider GPU-based parallel algorithms for map overlay.

**Author Contributions:** Conceptualization, S.W.; data curation, Y.Z., S.W., and Y.G.; funding acquisition, Y.Z.; investigation, Y.Z. and S.W.; methodology, Y.Z. and S.W.; project administration, Y.Z. and S.W.; software, Y.Z., S.W. and Y.G.; supervision, Y.Z. and S.W.; validation, Y.G.; visualization, Y.Z. and Y.G.; writing—original draft, Y.Z., S.W., and Y.G.

**Funding:** This research was funded by the National Natural Science Foundation of China (grant number 41601478) and the National Key Research and Development Program (grant numbers 2018YFB0505301, 2016YFC0500103).

**Acknowledgments:** We would like to acknowledge the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, for providing the research grant to conduct this work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Vector Map Random Encryption Algorithm Based on Multi-Scale Simplification and Gaussian Distribution**

**Giao N. Pham 1,2,\*, Son T. Ngo 1, Anh N. Bui 1, Dinh V. Tran 3, Suk-Hwan Lee 4 and Ki-Ryong Kwon 5**


Received: 2 October 2019; Accepted: 13 November 2019; Published: 14 November 2019

**Abstract:** In recent years, GIS (Geographical Information System) vector maps have become widely used in everyday life, science, and the military. However, the production process of vector maps is expensive, and a large volume of vector map data is easily stolen and illegally distributed. Therefore, original providers desire an encryption solution that encrypts GIS vector map data before it is stored and transmitted, in order to prevent pirate attacks and to ensure secure transmission. In this paper, we propose an encryption algorithm for GIS vector map data that prevents illegal copying and ensures secure storage and transmission. The polyline/polygon data of GIS vector maps are extracted to compute a backbone object. The backbone object is then selectively simplified by the multi-scale simplification algorithm in order to determine the feature vertices of the backbone object. The feature vertices of the backbone object are encrypted by the advanced encryption standard and the secret key. Finally, all vertices of the backbone object are randomized by the random Gaussian distribution algorithm to obtain the encrypted GIS vector map. Experimental results show that the entire map is altered completely after the encryption process. The proposed method is responsive to the various GIS vector map data formats, and it also provides better security than previous methods. The computation time of the proposed method is also significantly shorter than that of previous methods.

**Keywords:** GIS vector map data; GIS vector map security; selective encryption; simplification method and cryptography

#### **1. Introduction**

Currently, GIS vector maps are used in applications in many fields, such as science, navigation, and online or offline digital map services. GIS vector maps are created and developed by the merging of cartography, statistical analysis, and database technology based on vector models [1]. Due to the fact that GIS vector maps may have significant value, GIS vector map data may be stolen or easily purchased and redistributed or resold several times without obtaining any permission from the original data providers. In addition, some applications of vector maps by the military or some personal applications of vector maps must be secured from unauthorized users. Consequently, GIS vector map data protection is necessary to prevent illegal duplication and distribution.

GIS vector map security includes copyright protection [2–4], access control for users, and vector map data encryption in order to prevent attacks or illegal distribution during storage and transmission, as well as damage in the integration process of geographical information [5–14]. However, access control on the Web is unable to prevent attacks or illegal duplication and distribution, and watermarking techniques are only useful for identifying ownership and for copyright protection, whereas unauthorized users should not be able to see, attack, or extract GIS vector map content at all. Therefore, GIS vector map data encryption is necessary and suitable for protection. Moreover, conversion between data formats is vulnerable to attack, and security techniques based on database management systems are not responsive to the various data formats of vector maps [10,11]. Thus, encryption techniques for GIS vector maps should be responsive to the various data formats of GIS vector maps and to the security requirements.

To address the issues above, we propose an encryption algorithm for GIS vector map data in this paper. Our algorithm solves the current issues of GIS vector map security by encrypting the geometric objects of GIS vector maps. The main intent of the proposed algorithm is to extract the geometric objects (polylines and polygons) from GIS vector maps in order to encrypt them. The geometric objects are used to compute the backbone object of each polyline/polygon. The backbone object is then selectively simplified by the multi-scale simplification algorithm in order to determine the feature vertices of the backbone object. The feature vertices of the backbone object are encrypted by the AES (Advanced Encryption Standard) algorithm and the secret key. Finally, all vertices of the backbone object are randomized by the random Gaussian distribution in order to generate the encrypted vector map. To clarify the detailed contents of the proposed algorithm, this paper is organized as follows: in Section 2, we discuss vector map data security techniques and explain the relationship between vector map data and the proposed algorithm; in Section 3, we explain the proposed algorithm in detail; we then perform experiments, discuss the experimental results, and evaluate the performance of the proposed algorithm in Section 4; the conclusion is provided in Section 5.

#### **2. Related Works**

#### *2.1. Vector Map Data Security*

The main idea of the watermarking schemes for GIS vector maps is to embed the watermark into the vector map by modifying the coordinates of vertices of geometric objects in the spatial domain [2,3], or embedding the watermark into the spectrum coefficients of a sequence of vertices or topologies in the frequency domain [4]. Thus, vector map watermarking is not suitable for preventing attacks and illegal copying.

Security techniques for GIS vector map data have been proposed in recent years [5–9]. The authors mostly describe methods to define an access control system for spatial data on the Web, or address the security requirements of geospatial database management systems and privacy policies. However, access control on the Web and database management do not prevent attacks, illegal duplication, or distribution. Wu et al. [10] proposed a compound chaos-based encryption algorithm for vector data by considering the storage characteristics and the parameters of a chaos-based system; however, this method is not responsive to the various data formats of vector maps. Li et al. [11] encrypted the spatial index of a set of vector data in an external Oracle database management system when it is transmitted to the client; however, this algorithm does not ensure the security of vector maps because the key length is very short. Yasser et al. [12] described an encryption algorithm that combined AES and RSA (Rivest–Shamir–Adleman) cryptography with a simple watermarking technique in order to protect the copyright of vector maps in online and offline services. This method encrypts all parts of a shape-file using a 256-bit AES block cipher operator; because it encrypts the whole data stream of a shape-file with the AES cipher, its computation time is very long. Jang et al. [13] proposed a perceptual encryption method combined with the compression process for vector map data; it only encrypts the direction and position of data units in the compression process and therefore has low security. Bang et al. [14] proposed a selective encryption method for vector map data based on a chaotic map in the frequency domain of the discrete wavelet transform (DWT); however, this method only selects some objects and encrypts DC (Direct Current) values in the DWT domain with a common secret key. This method is very simple and weak because it does not encrypt all the contents of vector maps and uses a common key; moreover, that research did not present performance or security evaluations. In summary, the weaknesses of previous methods are solved by the proposed algorithm, which is presented in Section 3.

#### *2.2. Vector Map Data-Based Encryption*

GIS vector map data is stored in layers. Each layer contains a number of geometric objects, such as points, polylines, and polygons, and annotations, such as symbols. This notation is shown in Figure 1a. Annotation is used to display notes on vector maps, while geometric objects are used to represent geographical objects on vector maps. Points are used to represent simple objects, such as position, while polygons and polylines are used to represent complex objects. A polyline is a set of ordered vertices used to represent objects such as roads, contour lines, rivers, and railways. A polygon is a set of connected polylines used to represent objects such as buildings, areas, lakes, and boundaries. Thus, polylines and polygons are considered the important components of a vector map.

**Figure 1.** (**a**) Vector map data model, and (**b**) data components of a vector map.

In addition to annotation and geometric objects, vector maps also contain storage information such as the header and text. Because points, polylines, and polygons determine the content of vector maps, we consider them geometry data, while the annotation, header, and text are considered attribute data. Figure 1b shows the data components of a GIS vector map. The attribute data of a GIS vector map neither contains geographical information nor determines the shape of the map; thus, only the polylines and polygons need to be extracted to perform the random encryption process for GIS vector maps.

#### **3. The Proposed Method**

#### *3.1. Overview*

The proposed algorithm is shown in detail in Figure 2. To begin, each geometric object (polyline/polygon) is extracted from the GIS vector map to compute a backbone object. The backbone object is then selectively simplified by the multi-scale simplification algorithm with differential scale factors in order to obtain new geometric objects. If the backbone object cannot be simplified, it is encrypted by a key value and randomized by a set of random Gaussian numbers to generate an encrypted object; the key value is generated by a hashing function from the user's key input. The new geometric objects, which are simplified from the backbone object, are then used to compute the feature vertices of the backbone object. Here, the backbone object can be divided into two parts: the feature vertices and the remaining vertices. The feature vertices of the backbone object are then encrypted by the key value in order to generate the encrypted feature vertices. Finally, the remaining vertices and the encrypted feature vertices of the backbone object are randomly encrypted by a set of random Gaussian numbers in order to obtain the encrypted object. The encrypted GIS vector map is the set of encrypted geometric objects.

**Figure 2.** The proposed algorithm.

A GIS vector map contains a number of data layers. Each layer $\mathbf{L}$ contains a number of objects (polylines/polygons), $\mathbf{L} = \{\mathbf{P}_i \mid i \in [1, |\mathbf{L}|]\}$, and each object contains a number of vertices, $\mathbf{P}_i = \{v_{i,j} \mid j \in [1, |\mathbf{P}_i|]\}$. We briefly define the main notation as follows: $\mathbf{P}_i$ denotes a polyline/polygon object, and $|\mathbf{L}|$ and $|\mathbf{P}_i|$ are the cardinalities of a layer $\mathbf{L}$ and an object $\mathbf{P}_i$, respectively. Thus, $v_{i,j}$ denotes the $j$-th vertex in the $i$-th object of a layer $\mathbf{L}$ and is defined by two coordinates, $v_{i,j} = (x_{i,j}, y_{i,j})$. Next, $\mathbf{B}_i$ is the backbone object of $\mathbf{P}_i$, and $\mathbf{B}_i^s$ is the simplified object of $\mathbf{B}_i$ after the multi-scale simplification process with the differential scales. $\mathbf{B}'_i$ is the changed backbone object after the feature-vertices encryption process or the backbone-object encryption process. $\mathbf{F}_i$ represents the feature vertices of the backbone object $\mathbf{B}_i$, $\mathbf{Gn}_i$ is a set of random Gaussian numbers, $\mathbf{K}$ is the secret key, and $\mathbf{E}_i$ is the encrypted object of $\mathbf{P}_i$. Finally, $C_P(\cdot)$, $G_P(\cdot)$, and $R_P(\cdot)$ are the cipher function, the random Gaussian number function, and the randomization function, respectively.
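To make the notation concrete, the following is a minimal, hypothetical Python sketch of these data structures (the names are illustrative and not from the paper):

```python
# Illustrative containers for the notation above: a vertex v_{i,j},
# an object P_i (polyline/polygon), and a layer L.
from dataclasses import dataclass
from typing import List

@dataclass
class Vertex:
    x: float
    y: float

# P_i is an ordered list of vertices; L is a list of objects.
Polyobject = List[Vertex]
Layer = List[Polyobject]
```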

#### *3.2. Backbone Object Simplification and Feature Vertices Computation*

The backbone object is a set of vertices in which each vertex is the average point of two consecutive vertices of the object. Thus, the backbone object of $\mathbf{P}_i$ is $\mathbf{B}_i = \{b_{i,j} \mid j \in [1, |\mathbf{B}_i|]\}$, where $|\mathbf{P}_i| = |\mathbf{B}_i|$ and the value of $b_{i,j}$ is computed as shown in Equation (1). Figure 3a shows the backbone object of a polyline object.

$$b_{i,j} = \begin{cases} v_{i,j}, & j = 1 \\ \dfrac{v_{i,j-1} + v_{i,j}}{2}, & j \in [2, |\mathbf{B}_i|] \end{cases} \tag{1}$$

**Figure 3.** (**a**) Backbone object of a polyline, and (**b**) feature vertices of a backbone object.
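As a rough illustration, the following Python sketch implements the backbone transform of Equation (1) as reconstructed above, keeping the first vertex so the transform stays invertible (which the decryption process in Section 3.4 relies on); the boundary handling is an assumption, not taken from the paper:

```python
# Backbone transform per the reconstructed Equation (1): keep the first
# vertex, then replace each vertex by the midpoint with its predecessor.
def backbone(vertices):
    b = [vertices[0]]
    for j in range(1, len(vertices)):
        (x0, y0), (x1, y1) = vertices[j - 1], vertices[j]
        b.append(((x0 + x1) / 2.0, (y0 + y1) / 2.0))
    return b

def inverse_backbone(b):
    """Invert the transform: v_1 = b_1, v_j = 2*b_j - v_{j-1}."""
    v = [b[0]]
    for j in range(1, len(b)):
        v.append((2 * b[j][0] - v[-1][0], 2 * b[j][1] - v[-1][1]))
    return v
```

Under this reconstruction, a round trip `inverse_backbone(backbone(v))` returns `v` up to floating-point error.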

Our purpose is to simplify the backbone object with the scale factor $s$ in order to compute the feature vertices of the backbone object. Thus, we must check a condition before simplifying: a backbone object $\mathbf{B}_i$ can be simplified if

$$|\mathbf{B}_i| > \frac{1}{s} \tag{2}$$

If $|\mathbf{B}_i| \leq \frac{1}{s}$, that backbone object is encrypted directly by the cipher function $C_P(\cdot)$, as shown in Equation (3).

$$\mathbf{B}'_i = C_P(\mathbf{K}, \mathbf{B}_i) = \left\{ b'_{i,j} \,\middle|\, j \in [1, |\mathbf{B}_i|] \right\} \tag{3}$$

Otherwise, the backbone object $\mathbf{B}_i$ is simplified by the multi-scale simplification algorithm to obtain the simplified backbone object $\mathbf{B}_i^s$, as shown in Equation (4):

$$\mathbf{B}_i^s = \left\{ b^s_{i,j} \,\middle|\, j \in [1, |\mathbf{B}_i| \times s] \right\} \tag{4}$$

$$\mathbf{F}_i = \mathbf{B}_i^{1/2} \cap \mathbf{B}_i^{1/3} \cap \mathbf{B}_i^{1/4} \tag{5}$$

Here, we obtain three simplified objects $\mathbf{B}_i^{1/2}$, $\mathbf{B}_i^{1/3}$, and $\mathbf{B}_i^{1/4}$ with the differential scales $s = 1/2$, $s = 1/3$, and $s = 1/4$ (see Figure 3b). We then calculate the feature vertices $\mathbf{F}_i$, which is the set of vertices common to the three simplified backbones $\mathbf{B}_i^{1/2}$, $\mathbf{B}_i^{1/3}$, and $\mathbf{B}_i^{1/4}$, as shown in Equation (5). From Equations (4) and (5), we can see that $\mathbf{F}_i$ is determined only if $|\mathbf{B}_i| > 4$. Figure 3b shows the three simplified backbone objects of a polyline object with the differential scales 1/2, 1/3, and 1/4, and the feature vertices of the backbone object.
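The intersection in Equation (5) can be sketched as follows; the `simplify` stub below subsamples vertices and merely stands in for the paper's multi-scale simplification algorithm, whose exact rule is not detailed here:

```python
# Feature vertices per Equation (5): vertices common to all three
# simplified backbones. The subsampling simplifier is a stand-in.
def simplify(b, s):
    step = round(1 / s)                    # e.g. s = 1/2 -> every 2nd vertex
    return b[step - 1::step]               # |result| ~= |b| * s

def feature_vertices(b):
    common = set(simplify(b, 1/2)) & set(simplify(b, 1/3)) & set(simplify(b, 1/4))
    return [v for v in b if v in common]   # keep original vertex order
```

Consistent with the text, very short backbones (here, fewer than four vertices) yield an empty simplification at $s = 1/4$ and thus no feature vertices.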

#### *3.3. Random Encryption*

The secret key $\mathbf{K}$ is generated by the SHA-512 hashing algorithm [15] from the user's key input. We use the secret key $\mathbf{K}$ to compute a set of random numbers $\mathbf{Gn}_i$ by the Gaussian distribution [16,17], as shown in Equation (6):

$$\mathbf{Gn}_i = G_P(\mathbf{K}) = \left\{ g_{i,j} \,\middle|\, j \in [1, |\mathbf{Gn}_i|] \right\} \tag{6}$$

Therein, $|\mathbf{Gn}_i| = |\mathbf{P}_i|$; $g_{i,j}$ is calculated by the Gaussian function as shown in Equation (7), and $x$ is the value of $\mathbf{K}$:

$$g_{i,j} = \frac{i \times j}{|\mathbf{P}_i|} \times \frac{1}{\sqrt{2\pi}} e^{\frac{-x^2}{2}} \tag{7}$$
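A minimal sketch of this key-derivation and number-generation step follows; how the 512-bit digest is mapped to the scalar $x$ is not specified here, so the mapping below is an illustrative assumption:

```python
# Derive K by SHA-512 and compute g_{i,j} per Equation (7).
import hashlib
import math

def gaussian_numbers(user_key: str, i: int, n_vertices: int):
    digest = hashlib.sha512(user_key.encode()).digest()
    x = int.from_bytes(digest[:8], "big") / 2**64   # assumed digest -> x in [0, 1)
    coef = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return [i * j / n_vertices * coef for j in range(1, n_vertices + 1)]
```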

As mentioned above, after the feature-vertices computation process, we have the set of feature vertices $\mathbf{F}_i$ and the remaining vertices of the backbone object, $(\mathbf{B}_i - \mathbf{F}_i)$. We then encrypt the feature vertices $\mathbf{F}_i$ with the cipher function $C_P(\cdot)$ using the secret key $\mathbf{K}$, as shown in Equation (8), to obtain the encrypted feature vertices $\mathbf{F}'_i$. The cipher function $C_P(\cdot)$ can be the AES cipher function, the DES cipher function, or another cipher.

$$\mathbf{F}'_i = C_P(\mathbf{F}_i, \mathbf{K}) \tag{8}$$

The encrypted feature vertices $\mathbf{F}'_i$ and the remaining vertices $(\mathbf{B}_i - \mathbf{F}_i)$ are used in the random encryption process to obtain the encrypted object $\mathbf{E}_i$. From Section 3.2, the changed backbone object $\mathbf{B}'_i$ is computed by Equation (3) when $|\mathbf{B}_i| \leq \frac{1}{s}$; otherwise, $\mathbf{B}'_i$ consists of the encrypted feature vertices $\mathbf{F}'_i$ and the remaining vertices $(\mathbf{B}_i - \mathbf{F}_i)$, as shown in Equation (9).

$$\mathbf{B}'_i = (\mathbf{B}_i - \mathbf{F}_i) \cup \mathbf{F}'_i \tag{9}$$

The random encryption process is performed by a randomization process using the random Gaussian numbers $\mathbf{Gn}_i$:

$$\mathbf{E}_i = R_P(\mathbf{B}'_i, \mathbf{Gn}_i) = \left\{ e_{i,j} \,\middle|\, j \in [1, |\mathbf{E}_i|] \right\} \tag{10}$$

where $|\mathbf{E}_i| = |\mathbf{P}_i|$ and $e_{i,j}$ is computed as follows:

$$e_{i,j} = b'_{i,j} \times g_{i,j} = b'_{i,j} \times \frac{i \times j}{|\mathbf{P}_i|} \times \frac{1}{\sqrt{2\pi}} e^{\frac{-x^2}{2}} \quad \forall\, j \in [1, |\mathbf{P}_i|] \tag{11}$$

Figure 4 shows the random encryption process for the backbone object $\mathbf{B}_i$ when $|\mathbf{B}_i| > \frac{1}{s}$ (Figure 4a, with $|\mathbf{B}_i| = 17$) and when $|\mathbf{B}_i| \leq \frac{1}{s}$ (Figure 4b, with $|\mathbf{B}_i| = 4$). When $|\mathbf{B}_i| > 4$, the feature vertices of $\mathbf{B}_i$ are encrypted by the cipher function $C_P(\cdot)$ and the randomization process is then performed, as shown in Figure 4a. When $|\mathbf{B}_i| = 4$, the backbone object is directly encrypted by the cipher function $C_P(\cdot)$ and then randomized by the set of Gaussian numbers, as shown in Figure 4b.
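The randomization of Equation (11) and its inverse (used during decryption, Section 3.4) reduce to an element-wise multiply and divide. The following illustrative sketch assumes the Gaussian numbers are generated as in the earlier snippet:

```python
# Randomize per Equation (11): e_{i,j} = b'_{i,j} * g_{i,j}; the inverse
# divides by the same Gaussian numbers to recover the changed backbone.
def randomize(b_prime, g):
    return [(x * gj, y * gj) for (x, y), gj in zip(b_prime, g)]

def derandomize(e, g):
    return [(x / gj, y / gj) for (x, y), gj in zip(e, g)]
```

Because $g_{i,j}$ is reproducible from the secret key alone, a receiver who knows $\mathbf{K}$ can regenerate the same numbers and invert this step exactly.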

**Figure 4.** (**a**) Feature vertices encryption, and (**b**) backbone object encryption.

#### *3.4. Decryption Process*

The decryption process is the inverse of the encryption process. Firstly, the key value $\mathbf{K}$ is generated from the user's key by the SHA-512 hashing algorithm. The key value $\mathbf{K}$ is then used to compute a set of random Gaussian numbers as described by Equations (6) and (7) in Section 3.3. The encrypted objects are then extracted from the encrypted vector map, and their vertices are de-randomized by the random Gaussian numbers before decryption using the key value $\mathbf{K}$. After the vertex de-randomization process, if the encrypted object cannot be simplified, it is decrypted by the key value $\mathbf{K}$ to generate the backbone object; if it can be simplified, it is simplified to compute the encrypted feature vertices, which are then decrypted by the key value $\mathbf{K}$ to generate the backbone object. From the backbone object, we can calculate the decrypted objects based on Equation (1). The decrypted GIS vector map is the set of decrypted objects.

#### **4. Experimental Results and Analysis**

We used the GIS vector maps of Scotland [18] at differential scales for the visualization experiments and the evaluation of security and computation time. The detailed information of the GIS vector maps is shown in Table 1. The data format of the GIS vector maps was the shape-file (SHP) format [19], which is a popular geographical vector data format. The proposed algorithm was applied to the polylines and polygons of the vector maps. The backbone encryption process and the feature-vertices encryption process were performed with the AES algorithm, which we selected because its security is higher than that of alternative ciphers. Compared with conventional approaches, the proposed algorithm is more original than previous methods because it encrypts objects by encrypting the values of feature vertices in polylines/polygons. Consequently, it does not alter or expand the size of the encrypted file, thus preventing data loss. The GIS vector maps are completely altered after the random encryption process (see Figures 5–9).

#### *4.1. Visualization Experiments*

Experimental results are shown in Figures 5–9, each of which shows the original map and a part of the original map beside the encrypted map for comparison. Figure 5a shows the original railway map of Scotland and a part of the original railway map. The content of the railway map is presented by polylines; after encryption, the polylines are altered, broken into segments, and positioned in a disorderly manner (Figure 5b). In the experiment with the land-use map of Scotland, the content of the map consists of polygons (Figure 6a); after encryption, all polygons are changed into smaller polygons on the map, and the shape of the entire map is altered completely (Figure 6b). Experiments on the waterway map (Figure 7a), the nature map (Figure 8a), and the road map (Figure 9a) of Scotland yield similar results: waterway lines and roads are broken into shorter polylines and positioned in a disorderly manner (Figures 7b and 9b), and the shapes of the original polygons are altered and moved to other positions (Figure 8b). Consequently, the content of the GIS vector maps is altered completely.

**Table 1.** Experimental results of Scotland maps.



**Figure 5.** (**a**) Original railway map, and (**b**) encrypted railway map.

**Figure 6.** (**a**) Original land-use map, and (**b**) encrypted land-use map.

**Figure 7.** (**a**) Original waterway map, and (**b**) encrypted waterway map.

**Figure 8.** (**a**) Original nature map, and (**b**) encrypted nature map.

**Figure 9.** (**a**) Original road map, and (**b**) encrypted road map.

#### *4.2. Security Evaluation*

To evaluate the security of the proposed method, in this section we evaluate the randomness of the encrypted map: if the randomness of the encrypted map is high, the map is also difficult to attack. The randomness of the encrypted map is measured by its entropy, so we calculate the entropy of the encrypted map to evaluate the security of the proposed method. From Section 3, we can see that the entropy of an encrypted object depends on the secret key $\mathbf{K}$ and the value of $|\mathbf{P}_i|$; both $\mathbf{K}$ and $|\mathbf{P}_i|$ are discrete random variables. Thus, the entropy $H_{\mathbf{P}_i}$ of the encrypted object $\mathbf{E}_i$ is the sum of the entropies of the random variables:

$$H_{\mathbf{P}_i} = H(\mathbf{K}) + H(|\mathbf{P}_i|) = |\mathbf{K}| \log_2\left(\frac{1}{|\mathbf{K}|}\right) + |\mathbf{P}_i| \log_2\left(\frac{1}{|\mathbf{P}_i|}\right) \tag{12}$$

Furthermore, the entropy $H_{\mathbf{L}}$ of the encrypted map from the original map layer $\mathbf{L}$ is the sum of the entropies of the encrypted objects:

$$H_{\mathbf{L}} = \sum_{i=1}^{|\mathbf{L}|} H_{\mathbf{P}_i} \tag{13}$$

Clearly, $H_{\mathbf{L}}$ depends on the values of $|\mathbf{L}|$ and $\mathbf{K}$; the key value $\mathbf{K}$ is a random variable dependent on the user's key input. If the value of $\mathbf{K}$ is fixed, $H_{\mathbf{L}}$ depends only on the value of $|\mathbf{L}|$: the higher $|\mathbf{L}|$ is, the higher the entropy. For example, the railway map of Scotland has $|\mathbf{L}| = 9724$ and $H_{\mathbf{L}} = 128{,}817$ dB, while the nature map of Scotland has $|\mathbf{L}| = 99{,}835$ and $H_{\mathbf{L}} = 1.66 \times 10^6$ dB. Over the vector maps of Scotland in Table 1, the entropy of the proposed method ranges from $1.3 \times 10^5$ dB to $6.88 \times 10^6$ dB, with $|\mathbf{L}| \in [9724,\ 372{,}138]$.

The method of Yasser [12] only uses the AES-256 cipher operator to encrypt the data stream of a shape-file; thus, the entropy of this method depends on the length of the secret key and the length of the data stream. In the work of Bang [14], about 70% of the polylines/polygons are selected for the encryption process. Bang used a common secret key of 512 bits to encrypt all DC values of a vector map in the DWT domain, so the entropy of this method depends only on the length of the secret key. We applied the methods of both Yasser and Bang to the GIS vector maps of Scotland for comparison. The entropy of our method was higher than that of Yasser ($6.6 \times 10^6$ vs. $1.2 \times 10^5$ dB), while the entropy of Bang's method was fixed at 4608 dB for every GIS vector map (Table 2). In conclusion, our method offers more security than previous methods. Figure 10 shows the difference between the entropy of our method and that of the methods of Yasser and Bang according to the number of objects; the entropy of our method is much higher than that of Yasser or Bang.


**Table 2.** The entropy of the proposed method compared with previous methods.

| Map | \|**L**\| | Proposed Method (dB) | Yasser (dB) | Bang (dB) |
| --- | --- | --- | --- | --- |
| Road | 372,138 | 6,886,591 | 299,520 | 4608 |

**Figure 10.** Entropy according to the number of objects.

#### *4.3. Computation Time*

We implemented the proposed method in C# in Visual Studio 2013 and conducted our experiments on a PC with an Intel Core i7 quad-core 3.5 GHz CPU, 8 GB of RAM, and Windows 7 64-bit. Section 3 indicates that the computation time of the proposed method depends on the number of objects in the vector maps. For the GIS vector maps of Scotland, the computation time of our method ranged from 77 ms to 1896 ms for map sizes ranging from 1951 KB to 78,136 KB. We also implemented the methods of Yasser and Bang in C# in Visual Studio 2013, in a similar environment, to measure and compare the computation times of the methods. Yasser performed full encryption of GIS vector maps, meaning that the notation, header, text, and all geometric objects were encrypted; thus, the computation time of Yasser's method depends on the size of the GIS vector maps and is much longer than that of our method (2870 ms vs. 40 ms; Table 3). In the method of Bang, the computation time depends on the computation time of the selection process, the number of selected objects, and the computation time of the DWT and inverse DWT processes. The computation time of the DWT and inverse DWT processes depends on the number of vertices in each object: if an object has many vertices, the DWT computation time is long; conversely, if the number of selected objects is small, the computation time is short (Table 3). Figure 11 shows the computation times of the proposed method compared to the methods of Yasser and Bang according to the size of the vector maps. The time of the proposed method is significantly shorter than that of previous methods.


**Table 3.** Computation time of the proposed method compared with previous methods.

**Figure 11.** Computation time according to the size of map.

#### **5. Conclusions**

In this paper, we proposed a random encryption algorithm based on multi-scale simplification and the Gaussian distribution for GIS vector maps. Experimental results showed that the proposed method is very effective for GIS vector maps that contain many geometric objects. The presented method provides higher security than previous methods, and its computation time is significantly shorter, so it could replace previous methods for secure storage and transmission. By encrypting only geometric objects, the proposed method can be responsive to the various formats of GIS vector maps. In addition, to apply the method presented in this paper, developers only need to extract the geometric objects before performing the encryption process. Furthermore, the proposed method can be applied to the security of online and offline map services [20,21].

**Author Contributions:** Idea and Validation, S.-H.L.; Supervision and Evaluation, K.-R.K.; Research and Methodology, G.N.P.; Implementation and Software, S.T.N.; Visualization and Verification, A.N.B.; Data and Analysis, D.V.T.

**Funding:** This research was supported by FPT University and FPT Software, Hanoi, Vietnam, and also by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2016R1D1A3B03931003, No. 2017R1A2B2012456).

**Acknowledgments:** Thanks for the funding support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Evaluation of Effective Cognition for the QGIS Processing Modeler**

#### **Zdena Dobesova**

Department of Geoinformatics, Faculty of Science, Palacký University, 17. Listopadu 50, 77146 Olomouc, Czech Republic; zdena.dobesova@upol.cz

Received: 28 December 2019; Accepted: 17 February 2020; Published: 20 February 2020

**Abstract:** This article presents an evaluation of the QGIS Processing Modeler from the point of view of effective cognition. The QGIS Processing Modeler uses a visual programming language for workflow design. The functionality of the visual component and the visual vocabulary (the set of symbols and line connectors) are both important, and the form of the symbols affects how workflow diagrams may be understood. The article discusses the results of assessing the Processing Modeler's visual vocabulary in QGIS according to the Physics of Notations theory, evaluating the visual vocabularies of the older QGIS 2.x and the newer 3.x versions, and identifies serious design flaws in the Processing Modeler. Applying the Physics of Notations theory resulted in practical recommendations, such as changing the fill colour of symbols, increasing the size and variety of inner icons, removing functional icons, and using straight connector lines instead of curved ones. Another recommendation is to provide a supplemental preview window of the entire model in order to improve user navigation in huge models. Objective eye-tracking measurements validated some results of the evaluation based on the Physics of Notations: the respondents read workflows to solve different tasks while their gazes were tracked, and evaluation of the eye-tracking metrics revealed the respondents' reading patterns for the diagrams. The evaluation using both the Physics of Notations theory and eye-tracking measurements inspired recommendations for improving the visual notation. A set of recommendations for users is also given, which can be applied easily in practice with the current visual notation.

**Keywords:** algorithm; cognition; computer languages; eye-tracking measurement; gaze tracking; human-computer interaction; open source software; symbols; visualisation

#### **1. Introduction**

Today, open source GIS software competes with commercial GIS software. The user's choice depends not only on the price but also on the degree of functionality of the GIS software. Users need to satisfy their requirements, and one of these demands is the automatic processing of spatial data as a sequence of steps. Visual programming languages (VPLs) are used to design the steps of processes in the form of workflow diagrams, since GIS operations are not used in isolation but as part of a chain of operations that completely processes the data. An overview and basic description of several visual programming languages in GIS is given in [1], which mentions ModelBuilder for ArcGIS, Macro Modeler for IDRISI, Model Maker and Spatial Model Editor for ERDAS IMAGINE, and Workflow Designer for AutoCAD Map 3D. A systematic description and evaluation of VPLs in GIS is presented in a habilitation thesis [2]. VPLs in GIS are data-centric notations that serve to express a process in detail. Only AutoCAD Map uses hybrid symbols, in which a single symbol represents both an operation and its input/output data; the other VPLs have a unique set of simple symbols for data and unique symbols for operations and control of flow. GIS workflows do not express a generalised conceptual model of processing; they are more detailed.

Open source QGIS is competitive with commercial GIS software in designing workflow diagrams using a VPL. The accessibility of a visual programming language increases the usability of QGIS, and the possibility of designing workflows could be a reason for selecting open source QGIS.

The Processing Modeler is a graphical editor in QGIS software. This editor allows workflows to be designed in graphical form using a visual programming language. Workflow diagrams in QGIS are termed models.

VPLs generally differ in their visual notation, and the symbols used in GIS software vary. A visual notation consists of graphical symbols (visual vocabulary), a set of compositional rules (visual grammar) and definitions of the meaning of each symbol (visual semantics). The visual notation is important from the point of view of user perception and cognition. In his theory, the Physics of Notations, D. Moody stated that it is necessary to use cognitively effective visual notations [3]. Cognitively effective means optimised for processing by the human mind.

This article presents an assessment of the visual notation of the QGIS Processing Modeler using the Physics of Notations theory in combination with eye-tracking measurement. The presented research started with QGIS version 2 in 2014 and has continued with version 3 up to now. The long-term release (LTR) version QGIS 3.4 Madeira and partly version 3.6 Noosa were used for the assessment. Some features of the visual notation were empirically tested using the eye-tracking method on QGIS version 2. Finally, some improvements to the visual notation are suggested in this article.

The research question was "What is the level of effective cognition in the QGIS Processing Modeler?" This research aimed to evaluate and improve the cognition of the visual notation in QGIS. The results bring new and innovative ideas that improve the usability of and satisfaction with QGIS software.

These tasks fall under Human-Computer Interaction (HCI) research. Standard ISO 9241-210:2019 Ergonomics of human-system interaction—Part 210: Human-centred design for interactive systems [4] provides requirements and recommendations for human-centred design principles and activities in computer-based interactive systems. At the centre of HCI and UX (User Experience) research is the understanding and design of interactive digital systems and their human users [5]. Their common aim is to innovate novel computing user interfaces that satisfy the usefulness, ergonomics, and efficiency of digital systems [6,7]. Improvements are based both on theories and on empirical testing in laboratories [8], e.g., the eye-tracking measurement presented in this article.

#### **2. QGIS Processing Modeler and its Graphical Notation**

#### *2.1. History of QGIS Processing Modeler*

The Processing Modeler was implemented in version QGIS 2.0 Dufour, released in 2013. The next development of the Processing Modeler aimed to increase the functionality of the editor. The author of the graphical editor was Victor Olaya from Spain. In version QGIS 2.6 Brighton, released in 2014, the Processing Modeler was rewritten and provided extra functionality, such as allowing nested models with no depth limit [9].

Furthermore, adding Python scripts to the model was supported in version 2.x. Python scripts could be downloaded from an online external collection of scripts created by different users and adopted into a newly created user model. The software architecture and features of the QGIS processing framework are described in [10].

In 2018, the new series of QGIS 3.x versions began with QGIS 3.0 Girona. The Processing Modeler underwent extensive changes, including additional and changed input parameters and algorithms. Specifically, the colours of the basic symbols were changed, and the interface and functionality were redone; for example, zoom-in and zoom-out functions were added [11,12]. The two panels, Inputs and Algorithms, can be positioned differently in the interface and now float above the processing window [11,12]. The storage format of the model was also changed [13]: the file extension .model3 is used instead of .model.

#### *2.2. Description of Interface and Graphical Notation*

A graphical editor, the Processing Modeler, is embedded in QGIS and runs in a separate window. The interface is divided into two areas [14]. Two switchable panels are on the left side: the "Inputs" panel is the source of different types of input data, and the "Algorithms" panel is the source of operations that can be added to the model (workflow diagram). The large window on the right is a canvas for designing the model (Figure 1).

**Figure 1.** The interface of the QGIS Processing Modeler version 3 and an example model in visual programming language.

Selected inputs and algorithms can be added to the model by dragging and dropping them onto the modeler canvas. The symbols are movable, so their position is the user's choice. When input data is added to the model, the type and name of the data are set. The input data are considered input parameters; inputs are not assigned to particular existing data in a directory or to values of variables. Their names could, therefore, be more descriptive than the data's real names. This is an advantage because the names of parametrical inputs can be more general, which improves comprehension of the model for other users. The Algorithms panel provides GIS operations (processing algorithms) from several types of open source software apart from QGIS: GDAL, GRASS and SAGA. Previously created QGIS models are also displayed. Python scripts and operations from the ORFEO library were accessible in the older version 2.

Grey connector lines are automatically drawn immediately after adding operations and assigning the existing inputs to the operation in the model. The lines connect symbols of input data with the symbol of operations. The output data symbol is also automatically linked to the model after naming outputs from the operation. The connector lines are then automatically drawn between operation and output data. The user cannot draw the connector lines manually with a mouse or reconnect the symbols. The shape of the connector lines is curved and ends with a black point. When the positions of the symbols change, the shape of lines is automatically redrawn with a different curvature.

The Modeler's visual vocabulary contains three rectangular symbols (Figure 2). The size of the symbols is the same and cannot be changed. Originally, the violet rectangle represented input data, the blue rectangle represented output data and the white rectangle represented an operation. The fill colours were changed in version 3: the symbol for input data is now yellow, the symbol for output data is green and the symbol for operations remains white. At first glance, this was perhaps to emphasise that a model comes from the new Processing Modeler version; the user can subsequently very clearly distinguish existing models from the older version 2 and the latest version 3. Comparing the brightness of the symbols (compare the greyscale between versions in Figure 3), in version 3 the input symbol is lighter in tone and the output symbol is darker than in version 2. The difference in brightness is important for colour-blind people or people with perception limitations, for whom only the difference in brightness helps in distinguishing objects. The colour settings of computer applications and web page designs should respect colour-blind users by applying different brightness to menus, text boxes and other graphical objects of interfaces. From that point of view, the new colours of the symbols in version 3 are better due to their different brightness. The difference in brightness may not have been made intentionally by the QGIS designers, but it is valuable.


**Figure 2.** Graphic symbols (from top, input data, output data and operation); (**a**) symbols from version 2 at left, (**b**) symbols from version 3 at right.

**Figure 3.** Greyscale symbols (**a**) symbols from version 2 at left, (**b**) symbols from version 3 at right.

The rectangular symbols contain inner icons. The input data symbol is indicated by a *plus sign* icon, and the output data symbol by an *arrow*; both icons are on the left side. The operation symbols have different icons according to the source library or type of operation. For example, the QGIS 2 icon for Zonal Statistics is shown in Figure 2. In version 2, the input data symbols and operation symbols have two icons on the right side: a *cross* and a *pencil*. These icons depict the delete and edit functions and can be considered *operational icons*. In version 3, the *cross* icon remained, and the icon for editing is now *three dots*. The green output symbol was also assigned these two operational icons, which means that the label of the output data symbol is editable. The option to assign a default name and path for the output data is provided in the editing dialogue. When the output symbol is deleted, the output data is automatically assigned as a temporary output of the operation; no symbol in the model indicates a temporary output.

Generally, the use of icons is very helpful. According to Szczepanek, icons in software interfaces can be divided into three groups [15]. The first group is *universal icons*, which can be understood without explanation (e.g., a floppy disk for the save operation). The second group is *domain-specific icons* (e.g., for any GIS software), and the third group is *application-specific icons* (e.g., for QGIS software). In the case of the QGIS Processing Modeler, the icons can be sorted as follows: the pencil, three-dots and cross icons are universal icons. The plus icon for data is midway between universal and domain-specific icons; it is frequently used in GIS interfaces and means adding a layer to the current project. The icons of source libraries (in the white operation symbol) belong to the application-specific group. All icons can be understood very well.

#### **3. Theory and Measurement Methods**

#### *3.1. Theory Physics of Notations*

Physics of Notations is an objective theory for evaluating visual notations [3,16]. This theory is widely used in all areas of software engineering, not only GIS software, because creating diagrams is frequently required in information technology (IT). The first work in the area of VPLs in GIS applications was an assessment of the Esri ModelBuilder [17]. The Physics of Notations theory can be used not only for evaluating existing notations but also for improving graphical notations or designing new ones. This means that the visual notation in QGIS can be assessed and improved if any drawbacks are identified. Exploring this theory is also very beneficial when designing a new graphical vocabulary for any purpose. This paper presents the opportunity to make new suggestions for the QGIS Processing Modeler according to the theory.

The Physics of Notations theory states nine principles for cognitively effective notation. *Cognitive effectiveness* is defined as the speed, ease and accuracy with which a representation can be processed by the human mind [18]. The aim is to read the diagram quickly, without mistakes, and to comprehend it accurately.

The nine principles are organised as connected ideas, with the Principle of Semiotic Clarity at the centre. The modular structure of the Physics of Notations is designed to make it easy to add or remove principles, emphasising that they are not fixed or immutable but can be modified or extended by future research [19].

The principles are [3]:

#### • **Principle of Semiotic Clarity**

The principle of Semiotic Clarity expresses a one-to-one correspondence between syntactic and semantic features. According to this principle, symbol redundancy, symbol overload, symbol deficit and symbol excess are not permissible. The principle reflects ontological analysis.

#### • **Principle of Perceptual Discriminability**

The second principle of Perceptual Discriminability states that different symbols should be clearly distinguishable from each other by visual variables.

#### • **Principle of Visual Expressiveness**

The principle of Visual Expressiveness states that the full range of visual variables and their full capacity should be used to represent the notational elements. Colour is one of the most effective visual variables: the human visual system is very sensitive to differences in colour and can distinguish them quickly and accurately. Differences in colour are detected three times faster than differences in shape and are also easier to remember [20]. The level of expressiveness is measured from level 1 (lowest) to 8 (highest).

#### • **Principle of Graphic Economy**

The principle states that the number of symbols in a graphical vocabulary must be manageable by human working memory. The choice of symbols affects the ease of memorizing and recalling visual diagrams. The magic number seven expresses a suitable number of symbols: a range of 7 ± 2 symbols is suitable, and more than nine different symbols in a basic graphical vocabulary are demanding to comprehend.

#### • **Principle of Dual Coding**

The principle suggests using text to support the meanings of symbols and their clarity. Providing information through two channels (graphics and text) improves comprehensibility. It is based on the duality of mental representation [21].

#### • **Principle of Semantic Transparency**

This principle evaluates how well symbols associate with the real meaning of an element. Associations are sought between the shape or other visual variables of a symbol and its real properties, so that form implies content.

#### • **Principle of Complexity Management**

This principle recommends producing hierarchical levels of a diagram, dividing it into separate modules to create hierarchical structures. It is suitable for large models whose comprehension exceeds human working-memory capacity. Modularity means scaling information into separate chunks; modularisation is the division of large systems into smaller parts or separate subsystems. Practice shows that a subsystem should be only large enough to fit on one sheet of paper or one screen. The subsystem is then represented at a higher level by one symbol. Hierarchical structuring allows systems to be represented at different levels of detail (levelled diagrams) with the ability to control complexity at each level. This promotes understanding of the diagram from the highest level to the lowest, which improves the overall understanding of the diagram. Both mechanisms can be combined into recursive decomposition.

#### • **Principle of Cognitive Interaction**

The principle recommends increasing the options for navigating in the model. The reader must be able to follow the chain of operations easily. The connector lines affect navigation.

#### • **Principle of Cognitive Fit**

The principle proposes realizing different sets of graphical vocabularies for the same semantics, representing information in different ways for different tasks and different groups of users. It recommends the use of multiple visual dialects, each of which is suitable for different types of tasks and different user spectrums (according to experience).

#### *3.2. Eye-Tracking Measurement and Experiment*

Eye-tracking equipment was used to evaluate the comprehensibility and discriminability of visual symbols in models. This experimental method was adopted as a complement to, and extension of, the Physics of Notations results.

Testing was conducted at an eye-tracking laboratory in the Department of Geoinformatics, Palacký University in Olomouc (Czech Republic). The eye-tracker SMI RED 250 with the software SMI Experiment Suite 360° was used for the experiment. The test was designed using the SMI Experiment Center program, and the results were visualised using SMI BeGaze. An evaluation was also conducted using the software Ogama 4.5 and V-Analytics; for statistical evaluation, the STATISTICA software was used. The resolution of the monitor used to record eye movements and display the models was 1920 × 1080 pixels. The sampling frequency of the eye-tracker SMI RED was 250 Hz [22].

The complex eye-tracking experiment consisted of 22 workflow diagrams from Processing Modeler version 2. Several models with different sizes, functions, and arrangements of flow orientation (vertical, horizontal, and diagonal directions) were tested. The workflow diagrams were displayed individually on the screen in random order to prevent a learning effect [23]. Shuffling ensured that each respondent saw the models in a different order.

The respondents were first-year students at the end of the semester in a master's programme of Geoinformatics. They had attended lectures where the design of models in Processing Modeler version 2 was practised. They created various examples of models with different functionalities and sizes. The group of respondents was assumed to be advanced users. A total of 22 respondents participated in the eye-tracking testing, aged 22–25.

The term stimulus is applied in the process of eye-tracking testing [24]. The stimuli, in this case, were the models (workflow diagrams). Each model was associated with a comprehension task to record the cognitive process. Response time and the correctness of user answers were measured for each comprehension task as in other research [25–27]. The set of models or maps and comprehension tasks are often used to evaluate the usability of visualisation methods in cartography and GIS [28,29].

Research in the area of workflow diagrams has also been organised at our eye-tracking laboratory for other GIS VPLs. The reading patterns are described for models in ArcGIS ModelBuilder [30]. The significant effect of the orientation of connector lines is mentioned. Also, the influence of bends on connector lines was tested for ModelBuilder [22]. Being able to change colour helps to discriminate graphical symbols in ModelBuilder. This was also demonstrated using eye-tracking experiments at our university laboratory [31].

The eye-tracking experiment for the QGIS Processing Modeler consisted of two parts. The first part only displayed models without any task; this part is called free viewing. The second part contained 22 models that were introduced with comprehension tasks. The respondents solved the tasks by clicking on the stimulus at the location that answered the question. All tested diagrams are in Appendix A. Clicks were recorded as answers. All stimuli were interleaved with a fixation cross in the middle of the screen to provide the same starting point for all respondents. The fixation cross was displayed for 600 milliseconds before each stimulus.

The research combined the two above-mentioned methods in the evaluation, and they are very different. The first reports findings by applying theoretical principles; it produces results in text form, with a list of insufficiencies, good features, recommendations and ideas. The second is an experimental method in which an objective measurement is constructed through user testing. The two methods can be regarded as cross-validating each other's results, but eye-tracking mainly extends the results obtained from the theory. The research combined both approaches to obtain richer results, such as identifying reading patterns. In the phase of preparing the eye-tracking experiment, we considered how to test the principles of the Physics of Notations; the tasks were aimed at eliciting answers that correspond to the principle definitions, and the design of the experiment was made as coherent with the principles as possible. However, the set of principles can be tested only in a limited way: it is impossible to design the eye-tracking experiment as one task per principle, because many influences act on the respondent's perception. Moreover, the principles of Complexity Management and Cognitive Interaction are hard to test because no sufficient solution is present in the visual vocabulary, and it was impossible to test the Principle of Cognitive Fit because no visual dialects exist in the Processing Modeler.

Two hypotheses were proposed before eye-tracking testing:

**Hypothesis 1 (H1).** *Insufficiencies in Semiotic Clarity, Perceptual Discriminability, Visual Expressiveness and Semantic Transparency adversely affect the correctness of user answers*.

**Hypothesis 2 (H2).** *Insufficiencies in Semiotic Clarity, Perceptual Discriminability, Visual Expressiveness and Semantic Transparency adversely affect the effectiveness of comprehension*.

To evaluate these two hypotheses, the number of correct answers (for H1), the time required to answer, and eye-tracking metrics were measured. Eye-tracking metrics such as scanpath length, number of fixations and average fixation duration were calculated (for H2). All results are presented in Section 4.
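For illustration, these metrics can be computed from a list of fixations in a few lines. The following Python sketch is not from the study's toolchain (which used SMI BeGaze, Ogama and V-Analytics) and assumes fixations are given as (x, y, duration) tuples:

```python
# Basic eye-tracking metrics: scanpath length (sum of saccade distances
# between consecutive fixations), fixation count, mean fixation duration.
import math

def metrics(fixations):
    path = sum(
        math.dist(a[:2], b[:2])            # pixel distance between fixations
        for a, b in zip(fixations, fixations[1:])
    )
    durations = [f[2] for f in fixations]
    return {
        "scanpath_length": path,
        "fixation_count": len(fixations),
        "mean_fixation_duration": sum(durations) / len(durations),
    }
```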

#### **4. Evaluation of Processing Modeler**

#### *4.1. Evaluation of Effective Cognition by the Physics of Notations Method*

A systematic application of the Physics of Notations theory to the Processing Modeler follows.

#### 4.1.1. Principle of Semiotic Clarity

When this principle is applied to the symbols of the Processing Modeler, it is evident that both the input data and output data symbols are overloaded. In version 2, one symbol represents nine different data types: vector, raster, string, file, table, table field, number, extent and boolean. The newer version 3 offers 22 different types of input data. The user has to assign the data type immediately when an input data symbol is added to the model, yet there is no indication of the data type (for example, point, line, or polygon) in the graphical symbol.

The detected symbol overload could be solved in the following manner: remove the inner *plus* icon at the left of the symbol and replace it with more specific icons that express the type of data. Suggestions for vector and raster data symbols are given in Figure 4. Both icons are adopted from the QGIS interface. The lower pair uses the compound icons from version 3, where the symbols for vector and raster are supplemented by a small plus icon to express data input. These compound icons are better than the simple icons in version 2; the former, larger plus icon is substituted by a plus icon that forms part of the vector or raster icon. New icons could be suggested for file, folder, string, number, table and field using any universal icon, while data such as extent, CRS, map layer, etc., need domain-specific icons. The same inner icon sets could be used for the output data symbol, where a compound symbol can contain a bigger data-type icon and the small output arrow from the original output symbol. These suggestions follow Szczepanek's icon theory [15]. They would increase the number of icons in the graphical vocabulary and solve the overload of the two original symbols for input/output data.

**Figure 4.** Graphic Examples of data symbols with the addition of inner icons according to data types; (**a**) symbols for version 2 at the top, (**b**) symbols for version 3 at the bottom.

#### 4.1.2. Principle of Perceptual Discriminability

In practice, users discriminate between symbols using colour, shape, orientation, brightness and other visual variables. A systematic pairwise comparison gives the distance between every two symbols, where the visual distance is measured as the number of visual variables in which the symbols differ. The pairwise comparison of the version 3 symbols according to this principle is given in Table 1. In the Processing Modeler, the symbols differ only in colour and brightness; the rectangular shape is the same for all symbols. The visual distance is two in all pairs. The characteristics are poorer for the symbols of the older version 2: the only differences are in colour, and the visual distance is one (a difference in brightness exists only between the data symbol and the operation symbol). Perceptual discriminability of the symbols through colour is almost satisfactory because the tones differ. However, the Processing Modeler does not allow the user to define colours to express other meanings of symbols, for example, to distinguish final data from intermediate data in a large model.


**Table 1.** Pairwise comparison of graphical symbols and their visual distances.
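For illustration, the pairwise comparison underlying Table 1 can be sketched as follows; the visual-variable encodings below are hypothetical placeholders for the version 3 symbols, not the exact values of the notation.

```python
# Illustrative pairwise comparison behind Table 1, with hypothetical visual
# variable encodings. The visual distance of a pair is the number of visual
# variables in which the two symbols differ.
from itertools import combinations

symbols = {
    "input data":  {"shape": "rectangle", "colour": "yellow", "brightness": "light"},
    "operation":   {"shape": "rectangle", "colour": "white",  "brightness": "medium"},
    "output data": {"shape": "rectangle", "colour": "green",  "brightness": "dark"},
}

def visual_distance(a, b):
    """Count the visual variables whose values differ between two symbols."""
    return sum(a[var] != b[var] for var in a)

for (name_a, enc_a), (name_b, enc_b) in combinations(symbols.items(), 2):
    print(f"{name_a} vs {name_b}: visual distance {visual_distance(enc_a, enc_b)}")
# Every pair differs only in colour and brightness, i.e., a visual distance of two.
```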

Considering the principle of perceptual discriminability, the distinctiveness of the white symbol from the canvas is poor. The model's canvas and the operation symbol have the same white colour, from which a new recommendation emerged: change the fill colour of the operation symbol from white to orange-brown (Figure 5). The result of the pairwise comparison of all symbols in the vocabulary remains the same, but the discriminability of the symbol against the canvas is better than with the white symbol.

**Figure 5.** Suggested orange-brown fill for the operation symbol used in the workflow.

#### 4.1.3. Principle of Visual Expressiveness

The recommendation according to this principle is to use the maximum number of visual variables in symbols. Only colour is used, as the fill of the graphic elements. Other visual variables such as symbol *shape*, *size*, *texture*, *orientation* and *position* are not used in the Processing Modeler. The shape is the same rectangle for all symbols. The size of the symbols does not vary and cannot be changed. Brightness is used in version 3 (Figure 3). The symbol vocabulary could be improved by a greater variation in brightness between symbols and possibly by varying the shapes of symbols. The visual variable of position is only applied when the output data symbol is automatically placed near the right side of the producing operation (Figure 6); this near-automatic placement of output symbols is the same in version 3 (Figure 13). However, the position of the output data symbol is very often changed by the user, and when the operation symbol is subsequently moved, the output data symbol remains at its former position and does not follow the symbol of the sourcing process. The mutual position linking the symbols is not fixed, so the positioning of the output data symbol is only a weak and unstable use of the *position* variable.

**Figure 6.** The default placement of blue output data symbols near the producing white operations in version 2.

In terms of the principle of Visual Expressiveness, the graphical vocabulary is at a low level of 1 on a maximum scale of 8.

The QGIS Processing Modeler does not offer *loop* and *condition* functions. To implement these functions for controlling operations, the draft offered in this paper uses the visual variables of shape and colour (Figure 7). A pink rectangle with oblique sides represents the cycle (loop) operation, and a light yellow rhombus represents the condition. These shapes correspond to the classic flowchart symbols, and they differ from the basic rectangular shapes of the version 3 vocabulary in both shape and colour. With these new symbols, the number of visual variables used increases to two and the total number of symbols in the vocabulary would be five. The symbols fulfil the principles of Perceptual Discriminability and Visual Expressiveness, and the principle of Graphic Economy would also be fulfilled (an explanation of the principle of Graphic Economy follows).

**Figure 7.** Draft for loop and condition symbols.

#### 4.1.4. Principle of Graphic Economy

The number of base graphical elements is three, which meets the requirement for cognitive management and the recommended range of 7 ± 2 symbols. Even with all the previous suggestions, that is, the two symbols for the condition and cycle (under the principle of Visual Expressiveness) and the blue symbol for sub-models (see the Complexity Management principle below), the total number of symbols is six. The requirement of this principle is therefore fulfilled, and the vocabulary remains economical.

#### 4.1.5. Principle of Dual Coding

This principle suggests accompanying symbols with descriptive text. For models in the Processing Modeler, text completes the data symbol with the data name and the operation symbol with the operation name. The user assigns the data name arbitrarily; it is always an input parameter. In the model's design mode, the input data symbol is never bound to specific data stored on a storage medium, and the name can be edited as desired. The operation name is added to the symbol automatically according to the selected operation and can also be changed in version 3 (this is not possible in version 2). The option to edit the operation name is a good improvement in functionality and allows the model to be better understood. Renaming the operation is especially advantageous when the same operation appears multiple times in one model, because it makes it possible to describe or specify the meaning of each occurrence. Figure 8 depicts a diagram where the *v.generalize* operation is called three times, each time with a different generalisation algorithm; the selected algorithm is added to the operation name manually by the user. The option to edit the operation name improves the clarity of the model.

**Figure 8.** Default label of white symbols of operation supplemented by the user with the generalisation algorithm names: Douglas, Reuman, Boyle.

For long names that do not fit into the rectangle, the name is automatically truncated and completed with an ellipsis (Figure 9). If the Semantic Transparency modification (the deletion of functional icons, see below) were implemented, it would increase the space available for longer operation and input data names, which would be beneficial.

**Figure 9.** Automatic truncation of the long operation name ending in an ellipsis.

To follow the principle of Dual Coding, a modification of the input data labels is suggested. It would be helpful if the data symbols were improved by labels stating the data type in capitals; examples are given in Figure 10. If such labels were added automatically when symbols are added, the user would only need to choose an arbitrary data name. Additionally, the user's own data name (e.g., "input lines") can emphasise the spatial type. Manually describing a data type is already possible at the current stage of the notation, so there is room for users to apply Dual Coding well when naming symbols. The Processing Modeler meets the Dual Coding principle; however, comprehensibility could be improved with the proposed modification of captioning the data type.

**Figure 10.** Suggestions for improving the labels of input data symbols with labels for a data type in capitals.

Text is also used in the models to list the operation parameters when the plus symbol above the operation is pressed (Figure 11). After this, the black dot splits into several black dots according to the number of joined connector lines. The operation parameter list does not contain the values of these parameters and often produces overlapping lines leading to the rectangle. This behaviour is retained in version 3. It would be useful to add a list of the specific parameter values here. The current form of this textual information is not useful to users; it is perhaps only useful for expressing which symbol assigns a concrete parameter to the operation.

**Figure 11.** List of the operation parameters of the 'Variable distance buffer' with missing values.

#### 4.1.6. Principle of Semantic Transparency

According to this principle, symbols should be associated with the real meaning of the element they represent. In the Processing Modeler, the shape and colour of the symbols do not carry any association; they are semantically general, as in other visual programming languages for GIS applications. Inner icons are used instead: the input data symbol carries a *plus sign* icon at the left, and the output data symbol depicts an inward *arrow* icon. Icons can also carry semantic meaning, and these two can be considered almost semantically immediate: the *plus* icon indicates new data for processing, and the *arrow* icon indicates the processing result in a certain direction. Moreover, the earlier proposal under the principle of Semiotic Clarity, that each data type has its own icon as in Figure 4 (where the plus icon is replaced or becomes part of the compound symbol in version 3), also improves semantic immediacy. Here, it is clear that a change resulting from applying the Semiotic Clarity principle also leads to an improvement in Semantic Transparency.

For operations, icons are mainly used to represent the source library. These icons are semantically generic in that they do not explain anything about the purpose of the operation; however, they are a good guideline for determining the source library, which matters because many libraries contain operations with the same name (clip, buffer, etc.). In the Processing Modeler, version 3 sometimes uses an icon that represents the type of operation (namely for QGIS operations). Figure 5 shows the operation Dissolve, which has a specific icon representing this operation; another specific icon, for the operation 'Merge vector layer', appears in the model in Figure 13. The size and graphics of these icons are not suitable for improving the association between operations and their meanings: the icons are small and use only grey tones. A good example of large, colourful and detailed icons that describe the purpose of the operation is the Spatial Model Editor (Figure 12) embedded in the ERDAS IMAGINE software [32]. There, the icons take up more space than the text below them in the symbol and are prominent, so the graphical vocabulary of the Spatial Model Editor has high Semantic Transparency and is inspiring for a redesign of the Processing Modeler symbols. The final recommendation is to reshape the rectangle into a square to accommodate bigger icons and to put the text label below the icon.

**Figure 12.** (**a**) Samples of operation symbols with inner icons that represent the source or library (top to bottom): QGIS 2, QGIS 3, QGIS 3—specific icon for operation Random points in extent, existing model Rivers, GDAL, GRASS, SAGA; (**b**) symbols of operations in the Spatial Model Editor (bottom line).

The graphical vocabulary of the QGIS Processing Modeler is semantically opaque, except for some operations where a greater positive semantic immediacy can be observed (Figure 12, the third symbol from the top: *Random points in extent*).

#### 4.1.7. Principle of Complexity Management

This principle recommends producing hierarchical levels of the diagram and dividing it into separate modules. In textual and visual programming, this is achieved with sub-programs (sub-routines) or sub-models that can be designed and managed separately. In the Processing Modeler, the hierarchical model contains only two levels, no more.

The Processing Modeler allows existing models to be added to other models through the algorithms panel of the interface (Figure 13). This provides the right degree of modularity according to Complexity Management in both versions 2 and 3. The symbol of a sub-model has a three-gear-wheel icon (three connected balls in version 2) at its left; otherwise, a white rectangle is used. Since it would be good to differentiate the symbol of an individual operation from a sub-model by more than an icon, a colour fill other than white would be appropriate. The suggested depiction, a blue fill colour for sub-models, is shown in Figure 13. The visual resolution of the other symbols is maintained. The number of symbols increases to seven after the new sub-model symbol is added, and this final count of seven symbols still fulfils the principle of Graphic Economy.

**Figure 13.** (**a**) Groups of existing models that can be added to other models; (**b**) suggested blue colour fill for the sub-model.

#### 4.1.8. Principle of Cognitive Interaction

This principle recommends increasing the options for navigating the model. Connector lines affect navigation. In the Processing Modeler, curved connector lines join the symbols and are rendered automatically. Symbols very often overlap lines when they are moved manually (Figure 6). Lines also sometimes cross each other and are not parallel, so the user must manually search for the best symbol positions to prevent overlapping and perplexing criss-crossing of curved lines. Previous research recommended that the number of edge crossings in drawings be minimised [33]. For these reasons, curved connector lines do not appear to be the proper solution; it is often difficult to trace a connector's direction. A suggested change is to replace curved lines with straight lines (Figure 14). Straight lines ensure good continuity for reading, where good continuity means minimising the angular deviation from a straight line of two consecutive edges connecting two nodes [34]. In this suggestion for the Processing Modeler, straight lines could optionally be angled at an oblique or right angle when it is necessary to avoid a symbol; an acute angle is not suitable because it hinders smooth line tracking. If curved connectors remain in the notation, it is necessary to give the user control over shaping these connectors to prevent crossing and overlapping.

**Figure 14.** Replacing curved connector lines (**a**) with straight connector lines (**b**).
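As a rough illustration of the layout criterion from [33], the sketch below counts crossings between straight connector lines; the segment-intersection test is a standard simplification that ignores shared endpoints and collinear cases, and the connector coordinates are hypothetical.

```python
# Sketch: count crossings between straight connector lines, the quantity that
# reference [33] recommends minimising. Connectors are ((x1, y1), (x2, y2)) pairs.
from itertools import combinations

def _ccw(a, b, c):
    """True if the points a, b, c are in counter-clockwise order."""
    return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 properly intersect (no shared endpoints)."""
    return (_ccw(p1, q1, q2) != _ccw(p2, q1, q2)
            and _ccw(p1, p2, q1) != _ccw(p1, p2, q2))

def crossing_count(connectors):
    """Number of crossing pairs among all connector lines."""
    return sum(segments_cross(*a, *b) for a, b in combinations(connectors, 2))

# Two connectors that cross once and a third that crosses neither
connectors = [((0, 0), (10, 10)), ((0, 10), (10, 0)), ((0, 20), (10, 20))]
print(crossing_count(connectors))  # 1
```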

Operation symbols are linked to lines through a black dot offset from the edge of the symbol, which unnecessarily occupies space in the model's area. It would be possible to terminate the lines directly on an edge or at the plus sign of the symbol to save space.

Finally, the ability to display a thumbnail of the model in a separate preview window would help to navigate the model; such a preview window has not yet been implemented. In terms of cognitive interaction, version 3 was supplemented with a zooming function; the zoom in and zoom out functions were absent in version 2.

Aligning the symbols makes reading the model quicker and easier. No automatic function for aligning the model to a grid is implemented, although symbols usually snap to a grid in other graphical software; no snapping function is available in the Processing Modeler. Post-alignment of symbols to vertical or horizontal lines of the model could, therefore, be beneficial for design. At present, the arrangement of symbol positions depends entirely on the diligence of the user and is completely manual work in the Processing Modeler; manual aligning is a time-consuming activity.
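A minimal sketch of the suggested snap-to-grid behaviour follows; the grid spacing and the helper name are assumptions made for illustration, not part of the Processing Modeler.

```python
# Sketch of the suggested snap-to-grid behaviour: symbol positions are rounded
# to the nearest node of a regular grid when they are placed or moved.
def snap_to_grid(x, y, grid=20.0):
    """Return the (x, y) position snapped to the nearest grid node."""
    return (round(x / grid) * grid, round(y / grid) * grid)

print(snap_to_grid(103.0, 58.0))   # (100.0, 60.0)
print(snap_to_grid(111.0, 49.9))   # (120.0, 40.0)
```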

#### *4.2. Evaluation by Eye-Tracking Measurements*

The eye-tracking experiment was designed in a complex way to confirm or reject hypotheses H1 and H2. All tested diagrams are in Appendix A. The test design contains several tasks in order to gather the maximum amount of information, and some models were used repeatedly for different purposes, e.g., to find a symbol, compare orientation, or read the labels. After testing, only reliable answers with a correct eye-tracker record were evaluated and are presented in the article.

The first evaluation concerned the discriminability of symbols. These tasks required finding the input data and output data symbols in the models. The first task was: "*Click on the symbol where the input data are*" (tasks A1, A2, A3 in Appendix A). No incorrect answers were recorded. The next task was "*Click on the symbol where the output data are*" (tasks A4, A5, A6). There were two wrong answers each for tasks A4 and A5, and four for task A6; however, the arrangement of model A6 strongly influenced the answers. This means that the input and output symbols were reasonably high in *Perceptual Discriminability*, but the errors indicate room for improvement of the symbols, such as the inner icons suggested in Section 4.1 to increase transparency and use all visual variables.

Besides the number of correct/wrong answers, the time of the first click was recorded. The distribution of these times was not normal (tested by the Shapiro-Wilk test), so the non-parametric Kruskal-Wallis test was used to check whether the medians of the "first click time" of all tasks (A1–A5) are equal, i.e., whether the time samples originate from the same distribution. The result of the statistical test revealed no significant difference between finding the input symbol and finding the output symbol. This means that the basic symbols are discriminable and none of them is dominant in perception.
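The reported statistical procedure can be sketched with SciPy as follows; the timing values below are illustrative placeholders, not the measured data from the experiment.

```python
# Sketch of the statistical procedure: check normality of "first click" times
# with Shapiro-Wilk, then compare tasks A1-A5 with the non-parametric
# Kruskal-Wallis test. The times below are illustrative, not measured values.
from scipy import stats

first_click_s = {
    "A1": [1.2, 1.5, 1.1, 2.0, 1.7, 1.3],
    "A2": [1.4, 1.6, 1.9, 1.2, 1.5, 1.8],
    "A3": [1.3, 1.1, 1.6, 1.4, 2.1, 1.2],
    "A4": [1.8, 2.2, 1.5, 1.9, 2.4, 1.7],
    "A5": [1.6, 2.0, 1.4, 2.3, 1.9, 1.8],
}

for task, times in first_click_s.items():
    w, p = stats.shapiro(times)                 # normality check per task
    print(f"{task}: Shapiro-Wilk p = {p:.3f}")

h, p = stats.kruskal(*first_click_s.values())   # equality of medians across tasks
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3f} (alpha = 0.05)")
```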

The next task aimed to verify the influence of *Dual Coding* (although the influence of discriminability was also present). The task was: "*Click on the symbol where the 'Fixed Distance Buffer' operation is called*" (tasks A13, A14, A15). Once again, it was necessary to find the white symbols and read the labels in them. A total of 21 correct answers were recorded (one incorrect). The results for the 22 respondents were calculated as an attention heat map (Figure 15), which expresses the places where the peaks of gaze fixations of all respondents lie. The figure shows that all white symbols correctly attracted the gaze of respondents: they searched for white operation symbols and then read the particular operation label. The highest attention was recorded at the two places where the *Fixed Distance Buffer* operation appears (top and bottom); lower peaks of fixations occur at other white symbols with different operations. It is evident that the white colour of operations attracts the gaze. Both the principle of *Visual Expressiveness* (and also *Perceptual Discriminability*), through the white colour fill, and *Dual Coding* were verified. In fact, this stimulus did not confirm the poor distinguishability of white symbols from the white canvas that was expected in the theoretical part of this article.

**Figure 15.** Attention heat map for the task to find the 'Fixed Distance Buffer' operation.

The principle of Semantic Transparency was difficult to test. The transparency of the icons was tested with the task: *"Are all operations from the same source library?"* The tested model is shown in Figure 16 (tasks A16, A17, A18 in Appendix A). Three incorrect answers were recorded from 22 respondents.

**Figure 16.** The model with different source libraries of operations in a diagonal arrangement.

The Semantic Transparency of data types can currently be achieved in the Processing Modeler only with expressive text. This was verified with a model where the task was: "*Does the input data have the same data type as the output data?*" (tasks A10, A11, A12). The data symbols were labelled with the words "table" and "raster" as part of the data name in the model, a "user design help" that allowed respondents to distinguish the data type. The number of incorrect answers was three for two of the models and two for task A11. In these models, the response time was longer than in the previously presented models and tasks, and the average fixation duration was also longer. This confirms that users need to read the labels: solving semantic transparency with text labels consumes more comprehension time. The longer times support hypothesis H2 about the negative influence of insufficiencies on effective comprehension.

Both of the experiments mentioned above (about the source library and the comparison of the data types of input and output data) verified that Semantic Transparency is low in the Processing Modeler.

The numbers of correct and incorrect answers in all tasks presented in this section indicate that some insufficiencies adversely affect cognition, as stated in hypothesis H1.

Eye-tracking testing provided not only a cross-validation of the Physics of Notations results but also new information, in particular the reading patterns and the influence of flow orientation on the respondents' reading directions. To find the reading pattern of users, gazes were aggregated. A comparison of the same diagram from the free-viewing section and from a section with tasks is given in Figure 17. Aggregation in both cases revealed that the orientation of data flow expressed by connector lines had a significant influence. Reading began in the middle of the stimulus, following the central fixation cross in the previous stimulus; gazes were then attracted to the upper left corner and continued horizontally to the right. People's habit of reading lines of text was very strong, especially in free viewing (Figure 17a). The lower part of Figure 17b also depicts strongly followed lines, and only a small number of gazes skip between the two main horizontal workflow lines. Free viewing was not as systematic as the task-oriented gaze aggregations.

**Figure 17.** Aggregate directions of scanpaths of respondents in the model during free-oriented reading (**a**) and task-oriented reading (**b**).
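One possible way to aggregate gaze data into coarse reading directions is sketched below; the fixation coordinates and the four-bin scheme are assumptions made for illustration, not the aggregation tool actually used in the experiment.

```python
# Sketch: aggregate saccade directions into four coarse bins (right, left, up,
# down) to expose the dominant reading direction. Fixations are (x, y) points
# in screen coordinates with y growing downwards (an assumption of this sketch).
from collections import Counter

def saccade_directions(fixations):
    """Count the dominant direction of each saccade between consecutive fixations."""
    counts = Counter()
    for (x1, y1), (x2, y2) in zip(fixations, fixations[1:]):
        dx, dy = x2 - x1, y2 - y1
        if abs(dx) >= abs(dy):
            counts["right" if dx > 0 else "left"] += 1
        else:
            counts["down" if dy > 0 else "up"] += 1
    return counts

scanpath = [(120, 80), (260, 85), (400, 90), (150, 200), (300, 205)]
print(saccade_directions(scanpath))  # Counter({'right': 3, 'left': 1})
```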

Two models tested the effect of *symbol alignment in the model*, a finding that can be linked to the principle of *Cognitive Interaction*. The first model had aligned symbols; the second had unaligned symbols; the functionality was the same. The question was the same for both models: *"How many functions are in the model?"* (tasks A8, A9). It was enough to count only the white rectangles in these large models, and the expected correct answer was eight. The first, aligned model recorded two incorrect answers; the second recorded seven. The average task time was also much shorter for the first task. The non-parametric Kruskal-Wallis test was used for the eye-tracking metrics because of the non-normal distribution of the measured values; it tests whether the medians of groups are equal, i.e., whether they come from the same distribution [35]. The significance level for all Kruskal-Wallis tests was set to a p-value of 0.05. The test was run three times, for the number of fixations, the scanpath length and the number of fixations per second, comparing the measured values for the aligned and non-aligned models (A8 and A9). The Kruskal-Wallis test found statistically significant differences for all metrics, and the model with unaligned symbols (task A9) showed much worse values for all of them. Therefore, aligning the symbols in the model made it easier to read and understand the model. This eye-tracking evaluation supports the recommendation of a new function for the automatic alignment of symbols in the graphical editor.

Three groups of models with the same functionality were prepared to test *the orientation of flow* and to find whether any orientation is better for users. The models in each group differed only in orientation; three orientations were tested: vertical, horizontal and diagonal. Comparing orientations can also be regarded as a contribution to the principle of *Cognitive Interaction*. An example of a diagonal flow orientation is shown in Figure 16 and of a horizontal one in Figure 17. Variations of these models with the three orientations were designed; the triplets are [A10, A11, A12], [A13, A14, A15] and [A16, A17, A18] in Appendix A. To prevent bias, the same task was used within each group. The aim was to find the best model orientation, but the results of the Kruskal-Wallis tests were not statistically significant. In some cases, the horizontal orientation had the shortest average solution time; in some models, the diagonal orientation had better average fixation durations, while horizontal models had the shortest scanpath length. The results were completely ambiguous, and the orientation preferences were not the same in all triplets; much depends on the given question and on the model size. Eye-tracking did not reveal a best orientation of flow.

#### **5. Results**

Research into the QGIS Processing Modeler produced useful results and suggestions. The combination of the Physics of Notations theory and eye-tracking measurements determined that Perceptual Discriminability, Dual Coding and Graphic Economy are nearly good, with room for improvement. The worst situation is in Semantic Transparency. The recommendations made here can help improve Semiotic Clarity, Visual Expressiveness and Semantic Transparency.

All recommendations can be divided into two groups: the first is for developers of the Processing Modeler and the second is for users in practice. For the first group, larger sizes and colours for the inner meaning icons would increase Semantic Transparency; this solution would also increase the Semiotic Clarity of the symbols. Another suggested improvement is adding a colour fill to the operation symbol of sub-models. Straight connector lines are better than curved lines; optionally, user-shaped lines are even more suitable. New symbols for the IF and FOR loop commands were proposed, based on new shapes and different colours. The readability of models would be improved by an automatic function for aligning the symbols to a grid.

Users can benefit from several recommendations in practice. Correct labelling of symbols and expressing data types in capitals (VECTOR, RASTER, STRING, NUMBER, etc.) is very useful. Aligning symbols and preventing overlapping and crossing of lines improve the effective comprehensibility of a model. Designing and using sub-models fulfils the Complexity Management principle, and there is room for broader use of sub-models by users. Reading speed did not clearly increase for any one flow orientation (horizontal or diagonal), so no change of flow direction is needed. These user recommendations are presented every year to students attending lectures at the Department of Geoinformatics at Palacký University in Olomouc. The author has had a positive experience in applying the knowledge acquired in this research to teaching and to solving practical problems; this positive experience is described in an article about the database design for the university's botanical gardens (BotanGIS project) [36].

The presented evaluation and list of suggestions could inspire designers of visual programming languages in GIS software. Some recommendations could also be useful for the broader community of users to increase the effective cognition of any graphical depiction.

Table 2 reports all findings and recommendations in summarised form; the concrete graphical improvements are shown in the figures of the article.



**Funding:** This research received no external funding.

**Acknowledgments:** This article was created with the support of the Education for Competitiveness Operational Programme – European Social Fund (project CZ.1.07/2.3.00/20.0170 Ministry of Education, Youth, and Sports of the Czech Republic).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

#### **Eye-tracking experiment: List of tasks and models from QGIS Processing Modeler**

The appendix presents the list of models and assigned tasks that were used in the eye-tracking experiment. The order of models was random in testing.

*Note 1: All models were also used in the first part of testing (the free-viewing part).*

#### **1. Symbols for data and operations (testing of discriminability)**

Task A1. *Click on the symbol where the input data are*.

Task A2. *Click on the symbol where the input data are*.


Task A3. *Click on the symbol where the input data are*.


Task A4. *Click on the symbol of output data*.


Task A5. *Click on the symbol of output data*.


Task A6. *Click on the symbol of output data*.



Task A7. *How many functions are in the model? (Correct answer 8)*

Task A8. *How many functions are in the model? (Correct answer 8)*


Task A9. *How many functions are in the model? (Correct answer 8)*


#### **2. Test of dual coding**

Task A10. *Are the input data the same type as output data?*


Task A11. *Are the input data the same type as output data?*


Task A12. *Are the input data the same type as output data?*



Task A13. *Click on the symbol where the 'Fixed Distance Buffer' operation is called*.

Task A14. *Click on the symbol where the 'Fixed Distance Buffer' operation is called*.

Task A15. *Click on the symbol where the 'Fixed Distance Buffer' operation is called*.


#### **3. Test of inner icons (semantic transparency)**

Task A16. *Are all operations from the same source library? (Correct answer No)*

Task A17. *Are all operations from the same source library? (Correct answer No)*


Task A18. *Are all operations from the same source library? (Correct answer No)*


#### **4. Other tested models and tasks**

Task A19. *What operation is used? (Correct answer: Clip)*

Task A20. *What operation is used? (Correct answer: Fixed Distance Buffer)*


Task A21: *Click on the symbol of operation.*


Task A22: *Click on the symbol of operation*.


#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

