*Article*

## **A Novel Method of Missing Road Generation in City Blocks Based on Big Mobile Navigation Trajectory Data**

## **Hangbin Wu 1, Zeran Xu 1,\* and Guangjun Wu 2**


Received: 29 December 2018; Accepted: 11 March 2019; Published: 14 March 2019

**Abstract:** With the rapid development of cities, the geographic information of urban blocks is also changing rapidly. However, traditional methods of updating road data cannot keep up with this development because they require a high level of professional expertise for operation and are very time-consuming. In this paper, we develop a novel method for extracting missing roadways by reconstructing the topology of the roads from big mobile navigation trajectory data. The three main steps include filtering of original navigation trajectory data, extracting the road centerline from navigation points, and establishing the topology of existing roads. First, data from pedestrians and drivers on existing roads were deleted from the raw data. Second, the centerlines of city block roads were extracted using the RSC (ring-stepping clustering) method proposed herein. Finally, the topologies of missing roads and the connections between missing and existing roads were built. A complex urban block with an area of 5.76 square kilometers was selected as the case study area. The validity of the proposed method was verified using a dataset consisting of five days of mobile navigation trajectory data. The experimental results showed that the average absolute error of the length of the generated centerlines was 1.84 m. Comparative analysis with other existing road extraction methods showed that the F-score performance of the proposed method was much better than previous methods.

**Keywords:** missing road; city blocks; topology; big mobile navigation trajectory data

## **1. Introduction**

With rapid road construction in urban and rural areas, road data updates lag behind the actual changes, leaving datasets with poor currency, integrity, and accuracy. Traditional technologies for detecting and updating missing roads, such as professional surveys, map generalization, and remote sensing image interpretation, are costly, require long update cycles, involve complicated data processing, and cannot easily keep pace with the needs of rapid urban development [1].

Detecting new and missing roads on existing road networks has become a common concern in the fields of urban management, intelligent transportation, and driverless technology [2]. Imagery, GNSS (global navigation satellite system) trajectories, and multisource data fusion are some of the major data sources used for the renewal and repair of missing GIS (geographical information system) road network data [3]. With technological developments such as wireless communications, Big Data, and cloud computing, navigation trajectories are gradually becoming the main data source for urban road updates. Massive on-vehicle positioning trajectories are characterized by large volume, multiple sources, and heterogeneous structure. The use of VGI (volunteered geographic information) trajectories, such as those collected by smart phones [4,5] and smart devices installed in vehicles [6] or held by pedestrians [7], to update road data has recently achieved substantial results.

In this paper, a new method of extracting missing roads in city blocks from big mobile navigation trajectory data is proposed, which mainly consists of the following steps: (1) Useless information from pedestrian users and from users on existing urban roads was filtered out of the data; (2) After preprocessing, roadway centerlines were extracted using the proposed RSC (ring-stepping clustering) method; (3) The road topology of each block was then established, and connections between existing urban roads and missing roads in the blocks were built. Compared to traditional methods, the proposed method has the following advantages: First, compared to specialized vehicles equipped with mapping devices [8,9], the whole process can achieve a high level of automation, rarely requires manual operation, and demands a much smaller surveying-device investment; second, compared to remote sensing image extraction [10], the ubiquitous tree cover in city blocks has appreciably less influence on road extraction, and the satellite revisit period is no longer a problem; finally, compared to existing road extraction methods using GNSS trajectory data [9,11–14], although the computational complexity of the proposed method (Θ(n²)) is higher than that of the method in [9] (Θ(n)), the F-score of the generated road centerline is much higher. A case study with 9,944,710 trajectory GNSS points over an area of 5.76 square kilometers was selected to verify the feasibility of the proposed method. The results showed that the extracted road network was well matched with the real network.

In the next section, we briefly discuss the road network data updating methods found in the literature. The rest of the paper is organized as follows: Section 3 describes the research methodology in detail; Section 4 presents a case study with the adopted data and evaluates the quality of the proposed method; and Section 5 derives and discusses the main conclusions and concerns related to the proposed method.

## **2. Literature Review**

The current methods are mainly divided into two major aspects: updating of road geometry data and attribute data. The following subsections introduce the related works corresponding to these two aspects in detail.

## *2.1. Road Geometry Data Updating*

Selecting GNSS trajectory points not located on existing roads is usually a necessary stage in road mapping in order to update road networks and refine the geometry of road segments or intersections [5,15–19]. Then, different algorithms, strategies, and methods for detecting new roads, extensions, and disappearances of existing roads are proposed based on the outlier trajectory points.

Considering that GNSS trajectories (or points) appear only on roads, clustering GNSS points is the most commonly used method for updating the geometric data of a road. For example, *K*-Means [13,20–23] has been used to cluster a large number of GNSS points at a certain position on the road to identify the center of the road. This kind of method proceeds by inputting GNSS points or trajectories and specifying an initial seed point to cluster the center of the road. Then, after iterations over a certain distance interval, the geometric information for the road network is obtained.

The trace merging method [12,24] is usually used for road extraction from massive GNSS trajectories. By using iterations of each GNSS trajectory, raw trajectory edges are added to the current road map according to the map-matching results. Edges of current road maps are given a weight to describe the repeat occurrences and the possibility of a road's presence. Those edges with lower weights are removed, and the remaining edges are regarded as newly found roads.

A large number of unordered GNSS points exhibit geometrical features distributed along the road. Based on this phenomenon, researchers proposed kernel density estimation (KDE) methods that extract the most densely distributed areas of GNSS points, supplemented by a certain threshold, to achieve road boundary and road skeleton extraction [14,25,26]. The advantage of this approach is that as the number of samples increases, the output becomes more reliable and robust. However, when the number of samples is insufficient, the results often have large deviations.

Recently, total least squares [27] and turning-point detection [28] methods were proposed to segment and group GNSS trajectory data. After the segmentation and grouping, intersection positions and road segments were determined from the intersecting and crossing of different groups [28]. A dynamic time-warping (DTW) algorithm was then used to align road segments from the connection matrix between intersections.

Moreover, Reference [29] proposed a hidden Markov model (HMM) map-matching method for topological reconstruction and intersection refinement, and Reference [5] used a genetic algorithm to establish new road segments.

However, road-network extraction still faces difficulties in some places. For example, when a road is covered by a bridge or an underpass passes beneath it, the GNSS points become particularly chaotic because current mobile phone GNSS positioning handles heights poorly, making it difficult to extract this part of the road with existing methods.

## *2.2. Road Attribute Data Updating*

Another use for the movement trajectory is updating the attribute information of the road network. Changes in road attribute information mainly fall into eight categories: directionality, speed limit, number of lanes, access, average speed, congestion, importance, and geometric offset [30]. Winden et al. [30] proposed a decision-tree algorithm for deriving the above eight attributes for OpenStreetMap (OSM). The results show that one-way versus two-way directionality is classified with an accuracy of 99%, while the accuracy for the road speed limit is 69%.

Reference [31] modeled the speed of the tracks from the centerline of the road as a Gaussian distribution, and then extracted the directionality and turning restriction attributes for road maps such as OSM. Reference [32] used a probabilistic method to derive the number of traffic lanes from GNSS tracks by fitting a Gaussian mixture model (GMM) to the intersections between the GNSS traces and a sampling line perpendicular to the road's centerline. Reference [33] used data-mining techniques to extract the name and class of the road by integrating movement trajectories and geotagged data from social media with a support vector machine (SVM) method. Reference [34] used a fully connected deep neural network (DNN) to automatically extract deep features and classify trajectories based on transportation mode. Further, Reference [35] proposed a framework using feature engineering and noise removal to classify movement trajectories into typical transportation modes such as taxi, car, train, subway, walking, airplane, boat, bike, running, motorcycle, and bus.

The detected transportation modes of each movement trajectory can be used to update the attributes of the road map. Reference [19] adapted a robust map-matching algorithm to assure that each point was assigned to the current road map. Then, the missing intersections, turn restrictions, and road closures were detected and updated. Considering that OSM has become a common way for volunteers to draw maps, Reference [36] obtained newly drawn road data by analyzing OSM data. They then adopted a progressive buffering method to update the latest roads in the OSM data with roads from other data sources, including both geometry and attributes.

## **3. Methodology**

## *3.1. General Description of the Proposed Method*

The proposed method can be divided into three main parts (Figure 1). The first part is data filtering. In this step, two kinds of data, including navigation points located on the existing urban roads and records generated by pedestrians, were filtered out. The remaining navigation data were regarded as navigation points relating to driving on missing roads in city blocks. This part is introduced in Section 3.2.

Then, a clustering-based algorithm named RSC was proposed to extract road centerlines using the reserved navigation points. This is the second part of the proposed method. After that, the centerlines of the missing roads were determined (please see Section 3.3).

Finally, the topologies of missing roads and their relationship with the existing roads were established. The procedures are given in Section 3.4.

**Figure 1.** Flow chart of the proposed method.

## *3.2. Data Filtering*

Mobile navigation data are collected as long as the navigation software is active. In order to reduce the duration of the calculation, data points located on the existing roads and generated by pedestrians should be filtered.

In this subsection, two main steps were adopted to filter the original data. The first step was to filter out data from pedestrian users using the speed indicated by each record. Then, GNSS points located on existing urban roads were filtered out via overlay analysis.

First, GNSS point speed is the most appropriate indicator for distinguishing between pedestrian and vehicle users. According to previous studies, the average speeds of walking and driving are approximately 4.2 and 30 km/h, respectively. Therefore, in this research, 5.0 km/h was used as the threshold to separate pedestrian users from other navigation users; that is, GNSS points with speeds under 5.0 km/h were regarded as pedestrian records and removed.
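The pedestrian-filtering step above can be sketched as follows; the record layout and the `speed` field name are illustrative assumptions based on Table 1, while the 5.0 km/h threshold is the one stated in the paper.

```python
# Speed-based pedestrian filtering (Section 3.2), a minimal sketch.
# Assumption: each record is a dict with a "speed" field in km/h.

PEDESTRIAN_SPEED_THRESHOLD = 5.0  # km/h, as used in this paper

def filter_pedestrians(records, threshold=PEDESTRIAN_SPEED_THRESHOLD):
    """Drop records whose speed is below the threshold (assumed pedestrians)."""
    return [r for r in records if r["speed"] >= threshold]
```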

Navigation data generated by users' motion are partly located on existing roads and partly on missing roads. According to our analysis, navigation points on existing roads account for more than 90% of all points in the dataset; that is, points on roads inside city blocks account for only about 10% of the total data volume. Therefore, filtering out the points on existing roads greatly improves the efficiency of the method, as they account for the vast majority of the data.

Previous studies have been performed to filter the positioning points on existing urban roads, such as overlay analysis between the navigation-point layer and the existing urban road layer. Vector–raster analysis can also be used for filtering. In this paper, we adopted the method described in [37] to convert both existing roads and the navigation points into a raster format, so that filtering computing could be greatly accelerated.
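A minimal sketch of the raster-based filtering idea: both layers are mapped onto a grid, and any navigation point whose cell is occupied by an existing road is discarded. The cell size and the precomputed set of road cells are illustrative assumptions, not the actual parameters of [37].

```python
# Raster-based overlay filtering, a minimal sketch. Assumption: the existing
# roads have already been rasterized into a set of occupied grid cells
# (road_cells); the 10 m cell size is illustrative.

def to_cell(x, y, cell_size):
    """Map a coordinate to its (column, row) grid cell."""
    return (int(x // cell_size), int(y // cell_size))

def filter_on_road_points(points, road_cells, cell_size=10.0):
    """Keep only the points whose grid cell is not covered by an existing road."""
    return [(x, y) for (x, y) in points
            if to_cell(x, y, cell_size) not in road_cells]
```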

## *3.3. Road Centerline Extraction*

After filtering the raw data, the remaining navigation points largely belonged to the missing roads in city blocks. To extract the road network from these points, a clustering-based algorithm called RSC is proposed in this subsection.

## 3.3.1. Centerline Extraction via RSC

In this subsection, RSC was used to extract the centerlines of the missing roads. Figure 2 shows the four main RSC steps. The black dots indicate the navigation points reserved after the data filtering described in Section 3.2. Considering that the reserved GNSS points were located on the missing roads, an initial point (Pini) was randomly selected from them (see the yellow point in Figure 2a). Then, a radius parameter (R) was used to select the points covered by the circle of radius 2R, forming the point set S2R (see the red circle and the green points in Figure 2b). R is the ring-radius parameter, set to approximately the average width of the roads in the region. After that, the highest-density position was calculated from the green points and regarded as a node of the centerline (see the red point in Figure 2b).

**Figure 2.** The main steps of RSC (ring-stepping clustering) are as follows: (**a**) a random GNSS (global navigation satellite system) point is chosen as the initial point, (**b**) the initial node is selected from the point set S2R of the initial point, and (**<sup>c</sup>**,**d**) the next node is selected from the point set Sring of the current node.

After the initial centerline node was found, a ring with an inner radius R and an outer radius 3R was defined to find the next road centerline node. The points that fell on the ring were picked up to form the point set Sring (see the blue points shown in Figure 2c).

The density of each point in Sring was also calculated to find the highest-density point. However, before deciding on the next node, each point in Sring was checked to determine whether it would create a U-turn. As shown in Figure 3 and Equation (1), the value V was calculated from the direction of the potential segment (Vb) and that of the existing centerline segment (Va).

$$V = \mathbf{V}_a \cdot \mathbf{V}_b \tag{1}$$

If V < 0, the potential node will not be selected as the next node.

**Figure 3.** Determining a U-turn.
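The U-turn test of Equation (1) amounts to a sign check on a dot product; a minimal sketch, with node coordinates as illustrative (x, y) tuples:

```python
# U-turn check of Equation (1): V = Va . Vb, where Va is the direction of the
# existing centerline segment and Vb the direction toward the candidate node.
# A negative dot product means the candidate would reverse the centerline.

def is_u_turn(prev_node, cur_node, candidate):
    """Return True if stepping to `candidate` would turn the centerline around."""
    va = (cur_node[0] - prev_node[0], cur_node[1] - prev_node[1])
    vb = (candidate[0] - cur_node[0], candidate[1] - cur_node[1])
    v = va[0] * vb[0] + va[1] * vb[1]  # Equation (1)
    return v < 0
```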

The pseudocode of the proposed RSC algorithm is presented in Algorithm 1. During data processing, a k–d tree [38] was introduced as an index of the point set for acceleration. The computational complexity of the proposed RSC algorithm is Θ(n²).

**Algorithm 1.** RSC (ring-stepping clustering) Algorithm

**Inputs:** (1) a set of GNSS points **PointSet** = {P1, P2, P3, ..., Pn}; (2) one parameter **R**.

**Outputs:** a set of centerlines **CenterLineSet** = {CL1, CL2, CL3, ..., CLm}, where each centerline is composed of a set of points **NodeSet** = {N1, N2, N3, ..., Nk}.

```
 1: k-dtree ← Build k-d tree for PointSet
 2: for i ← 1 to n do
 3:     PointSet1 ← GetPointSetInRing(PointSet[i], 0, 2*R, PointSet)
 4:     for j ← 1 to length(PointSet1) do
 5:         PointSet1[j]'s PointSet2 ← GetPointSetInRing(PointSet1[j], 0, R, PointSet)
 6:     node ← the point whose PointSet2 has the most points in PointSet1
 7:     numpts ← length of node's PointSet2
 8:     if numpts > 0 then
 9:         while numpts > 0 do
10:             add node to NodeSet
11:             PointSet3 ← GetPointSetInRing(node, R, 3*R, PointSet)
12:             delete those points in PointSet3 that would make the centerline turn around
13:             for k ← 1 to length(PointSet3) do
14:                 PointSet3[k]'s PointSet4 ← GetPointSetInRing(PointSet3[k], 0, R, PointSet3)
15:             end
16:             node ← the point whose PointSet4 has the most points in PointSet3
17:             numpts ← length of node's PointSet4
18:         end
19:     end
20:     add NodeSet to CenterLineSet
21: end
22: end
23: return CenterLineSet
24: function GetPointSetInRing(Point, InsideRadius, OutsideRadius, PointSet)
25:     PointSetResult ← points in PointSet whose distance to Point is larger than
                         InsideRadius and smaller than OutsideRadius (using k-dtree)
26:     return PointSetResult
```
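The GetPointSetInRing helper of Algorithm 1 can be sketched with a k-d tree index; here SciPy's cKDTree stands in for the k-d tree of [38], and the point layout is an illustrative assumption:

```python
# GetPointSetInRing, a minimal sketch: query the k-d tree for all points
# within OutsideRadius of the query point, then discard those closer than
# InsideRadius, leaving the points that fall on the ring.

import numpy as np
from scipy.spatial import cKDTree

def get_points_in_ring(point, inside_r, outside_r, points, tree):
    """Return indices of points whose distance to `point` exceeds inside_r
    and is at most outside_r."""
    idx = tree.query_ball_point(point, outside_r)  # closed ball of radius outside_r
    p = np.asarray(point)
    return [i for i in idx if np.linalg.norm(points[i] - p) > inside_r]
```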

## 3.3.2. Duplication Avoidance and Improvement

Using the above proposed algorithm, centerlines derived from GNSS points were obtained and organized by node sequence. However, after the centerline extraction for the missing road, the GNSS points on that road were still reserved. This led to centerline duplication.

Therefore, a rectangular buffer with a width of 3R was built for every single centerline segment to remove the covered points (see the blue rectangle in Figure 4). All points that fell in the rectangular buffer were removed from PointSet and excluded from further calculations. The centerline extraction algorithm terminated when PointSet was empty.

**Figure 4.** Centerline duplication avoidance and improvement.
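The buffer-removal step can be sketched as a distance test: a point lies inside the 3R-wide buffer of a segment when its distance to the segment is at most 1.5R (this ignores the buffer's end caps; names are illustrative):

```python
# Duplication avoidance, a minimal sketch: drop every point within 1.5*R of
# an extracted centerline segment (i.e., inside its 3R-wide buffer).

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to segment ab."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:  # degenerate segment
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy  # closest point on the segment
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

def remove_buffered_points(points, seg_a, seg_b, r):
    """Keep only the points outside the 3R-wide buffer of segment (seg_a, seg_b)."""
    return [p for p in points if point_segment_distance(p, seg_a, seg_b) > 1.5 * r]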

## *3.4. Establishment of Road Topology*

After extracting the centerlines of missing roadways, the road topology, which is important for the road map, remained missing. In this subsection, the road topology was rebuilt for the extracted roads. Two main parts were included: the topology establishment for the extracted missing roads and the relationship establishment for the existing roads.

## 3.4.1. Topology Establishment for Missing Roads

Establishing the topological relationship of the extracted centerlines consisted of three main steps. First, the topological relationships between different centerlines were built according to the end node (EN) of each centerline. The distances between the end node (EN) and the normal nodes (Ni) were calculated, and the minimum distance (Dmin) was determined. If Dmin < 3R, the connection between EN and the corresponding Ni was added as a centerline segment (see Figure 5).

**Figure 5.** Relationship built between extracted centerlines: (**a**) before building and (**b**) after building. EN is the end node, Dmin is the minimum distance, and Ni is a normal node.
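The Dmin test can be sketched as follows (Python 3.8+ for `math.dist`; names and data layout are illustrative):

```python
# Connecting a centerline end node EN to the nearest node of other
# centerlines when the minimum distance Dmin is below 3R. Minimal sketch.

import math

def connect_end_node(end_node, other_nodes, r):
    """Return (nearest_node, should_connect), where should_connect is Dmin < 3R."""
    nearest = min(other_nodes, key=lambda n: math.dist(end_node, n))
    return nearest, math.dist(end_node, nearest) < 3 * r
```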

Then, a cluster analysis was used to combine the adjacent topological nodes according to the distance between them. In this paper, the mean shift clustering algorithm [39] was adopted. Those topological nodes within a distance R were converted into a single topological node. This greatly improved the topological relationship of the missing roads.
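A minimal stand-in for this merging step, using a greedy distance-based merge rather than the actual mean shift algorithm of [39]; each cluster of nearby nodes collapses to its mean position:

```python
# Greedy merge of topological nodes within distance R. This is a simplified
# stand-in for the mean shift clustering step, not the paper's implementation.

def merge_close_nodes(nodes, r):
    """Fuse nodes closer than R into a single node at their running mean."""
    merged = []  # list of (x, y, count)
    for x, y in nodes:
        for i, (cx, cy, n) in enumerate(merged):
            if ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 < r:
                # fold the node into the cluster's running mean
                merged[i] = ((cx * n + x) / (n + 1), (cy * n + y) / (n + 1), n + 1)
                break
        else:
            merged.append((x, y, 1))
    return [(x, y) for x, y, _ in merged]
```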

Finally, a node classification procedure was performed to classify the nodes into two groups: topological nodes and normal nodes. The difference between them lies in the number of connected centerline segments. Once a node's connected centerline segments numbered more than two, the node was regarded as a topological node, and the centerline passing through it was divided into two child centerlines (see the red nodes of Figure 6b).

**Figure 6.** Topological rebuild of the extracted centerlines. The centerlines are shown (**a**) before rebuilding and (**b**) after rebuilding.

## 3.4.2. Connections between Missing and Existing Roads

After the topology of the missing roads was established, the connections between missing and existing roads were also established.

First, potential connections were built between the nodes and existing roads (see yellow lines of Figure 7). Then, potential connection verification was performed using the azimuth distribution of GNSS points around the potential connection; once this parameter was uniform with the main angular direction of the connection, the potential connection was regarded as valid. Otherwise, it was treated as an invalid connection.

**Figure 7.** An illustrative example of the connections between missing and existing roads.
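The azimuth-verification idea can be sketched as follows; the 20° tolerance and the majority rule are assumptions, since the paper does not state numeric criteria:

```python
# Verifying a potential connection: keep it only if the azimuths of the GNSS
# points around it agree with the connection's own direction. The tolerance
# and majority criterion below are illustrative assumptions.

import math

def connection_azimuth(a, b):
    """Azimuth of segment a->b in degrees, clockwise from north."""
    return math.degrees(math.atan2(b[0] - a[0], b[1] - a[1])) % 360

def is_valid_connection(a, b, point_azimuths, tol=20.0):
    """Accept the connection if most point azimuths align with it (mod 180)."""
    az = connection_azimuth(a, b) % 180
    diffs = [min(abs(p % 180 - az), 180 - abs(p % 180 - az)) for p in point_azimuths]
    aligned = sum(d <= tol for d in diffs)
    return aligned >= len(point_azimuths) / 2
```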

## **4. Case Study**

## *4.1. Mobile Navigation Data*

In this study, mobile navigation data generated and collected by mobile phones were used to extract missing roads. These data were recorded whenever the navigation application on the phone was open, regardless of whether the user was actively navigating. The sampling rate was one record per second, and each record contained seven main fields: day, time, ID, longitude, latitude, speed, and azimuth. A description of each field is given in Table 1.



## *4.2. Case Dataset*

A region located in Shanghai was selected as the case area. It is 3.6 km long and 1.6 km wide, and it covers an area of 5.76 square kilometers. The case area covers two university campuses, and most of the roads in the case area are two-lane bidirectional roads. Around the case area, some municipal roads are available, such as Hujin Expressway, Jianchuan Road, Dongchuan Road, and Lianhua South Road. The location, imagery, and the existing municipal roads (marked with green color) of the case area are shown in Figure 8.

**Figure 8.** The location, imagery, and existing municipal roads of the case area, together with the roads and cross points selected for quality evaluation. The imagery was collected on 13 April 2017 by DigitalGlobe's GeoEye-1 at a spatial resolution of 1 m and was matched to the real map using the georeferencing toolbar of ArcGIS. Relative to the real map, the 13 control points of the image had a positional accuracy of 1.97 m RMSE (error magnitudes ranged from 1.35 to 2.93 m).

To evaluate the quality of the extracted road networks, a real map of the area provided by the Shanghai Municipal Institute of Surveying and Mapping was used. The map was measured manually with total stations, and its precision reached the centimeter level. Among all the real roads, 11 roads, including straight and curved roads, and the 22 cross points at their endpoints were selected to evaluate the performance of the proposed method. To enable comparison with the satellite-derived method, the 11 selected roads were also manually mapped from the satellite imagery in ArcGIS by three operators, each well trained in remote sensing image object extraction. The selected roads and cross points are shown in Figure 8. The manually mapped roads were compared with the real map of the area to evaluate the spatial accuracy, which is shown in Table 2. According to the evaluation results, the positioning precision of the digitalization was about 1.00 m, and the trueness relative to the real road data was about 3.95 m.

**Table 2.** Result of spatial accuracy of the 22 manually mapped road points. Trueness is the distance between the average position and the corresponding real position (m). Precision is the mean square error of the points by different operators (m).


Data used for analysis were collected in December 2017 (11–15 December) and consisted of 9,944,710 GNSS data points in total, belonging to 198,241 unique vehicle IDs. The data were provided by 1RenData (ShangHai) Technology Co., Ltd (Shanghai, China). Figure 9 shows the raw GNSS data distribution.

**Figure 9.** Raw research data. The black dots represent GNSS (Global Navigation Satellite System) points.

## *4.3. Data Filtering*

Using the data filtering methods described in Section 3.2, 392,128 records belonging to 8676 unique user IDs were finally reserved. The reserved GNSS points accounted for ~4.0% of the raw data. Components of the raw case data are shown in Table 3.


**Table 3.** Components of the case data.

## *4.4. Road Centerlines*

## 4.4.1. Road Centerline Extraction

The proposed RSC method was used to extract road centerlines from the filtered data. The method prototype was implemented in the C# programming language, and the experiment was carried out on a server with an Intel Xeon Platinum 8163 CPU @ 2.5 GHz and 16 GB of memory. As the average road width in the experimental area was ~7 m, the parameter R was set to 7.0 m. The obtained centerlines are shown in Figure 10. There were 603 centerlines in total, with an average length of 116.09 m. Extracting the centerlines from the 392,128 points took 557 s.

**Figure 10.** Results of the case study.

4.4.2. Geometric Quality Evaluation of Road Centerlines


In this subsection, two indices were selected as quality measures of the extracted roads. The first index was road length [40,41]. Let Lr be the length of the real road and Lg the length of the road under evaluation. The absolute error between the two road lengths was calculated by Equation (2):

$$\text{Eg} = |\text{Lr} - \text{Lg}|\tag{2}$$

where Eg is the absolute error between Lr and Lg.

The second index is the distance between each road centerline and the corresponding real road [42,43]. First, the region area between the real and extracted roads was calculated (see the blue area of Figure 11). Then, the quality values of the extracted centerlines Tg were calculated using Equation (3):

$$\text{Tg} = \frac{\text{Ag}}{\text{Lr}} \tag{3}$$

where Ag is the area of the region between the generated and real roads.

**Figure 11.** Distance calculation between real and extracted roads.
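The two quality measures of Equations (2) and (3), written out as plain functions:

```python
# Geometric quality measures: Eg is the absolute length error (Equation (2)),
# and Tg is the area between the generated and real centerlines divided by
# the real road length (Equation (3)), i.e., a mean offset in meters.

def length_error(l_real, l_generated):
    """Eg = |Lr - Lg|, Equation (2)."""
    return abs(l_real - l_generated)

def average_offset(area_between, l_real):
    """Tg = Ag / Lr, Equation (3)."""
    return area_between / l_real
```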

Using Equations (2) and (3), the quality evaluation results of the 11 selected roads are listed in Table 4. According to Table 4, the average Eg was 1.84 m and the average Tg was 1.62 m, demonstrating that the proposed method achieves excellent results.


**Table 4.** Results of the geometric evaluation of generated centerlines (m).


To compare the geometric accuracy of the proposed method with that of digitalization from remote sensing imagery, the true difference, named Dg, between selected cross points of the real roads and the corresponding cross points of the evaluated roads was used. In this paper, 22 road cross points were selected to compute the true difference against the real map. The minimum, maximum, and average Dg of the two methods are listed in Table 5.


**Table 5.** Minimum, maximum, and average Dg of digitalization and proposed method (m).

According to Table 5, the average Dg of the manually mapped roads was 3.95 m, larger than that of the proposed method, showing that the accuracy of the proposed method was slightly better than that of the digitalization method.


In addition to the evaluation methods mentioned above for the generated centerlines, the F-score proposed in [11] was also adopted to evaluate the proposed method. The F-score was computed as follows:

$$\text{spurious} = \frac{\text{spurious marbles}}{\text{spurious marbles} + \text{matched marbles}}$$

$$\text{missing} = \frac{\text{empty holes}}{\text{empty holes} + \text{matched holes}}$$

$$F\text{-score} = \frac{2 \times (1 - \text{spurious})(1 - \text{missing})}{(1 - \text{spurious}) + (1 - \text{missing})}$$

Starting from a random location, the roads were explored by placing point samples on each graph during a traversal outward within a maximum radius. Sample points on the roads requiring evaluation were considered "marbles" and those on the real roads "holes". Here, spurious marbles are the points on the evaluated roads that do not get a match, matched marbles are the points on the evaluated roads that do get a match, empty holes are the points on the real roads that do not get a match, and matched holes are the points on the real roads that do get a match.
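Given the four counts above, the F-score reduces to the harmonic mean of (1 − spurious) and (1 − missing); a direct implementation:

```python
# Marble/hole F-score of [11], computed from the four match counts.

def f_score(spurious_marbles, matched_marbles, empty_holes, matched_holes):
    """F-score = 2(1-spurious)(1-missing) / ((1-spurious) + (1-missing))."""
    spurious = spurious_marbles / (spurious_marbles + matched_marbles)
    missing = empty_holes / (empty_holes + matched_holes)
    precision, recall = 1 - spurious, 1 - missing
    return 2 * precision * recall / (precision + recall)
```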

The F-score of the proposed method was compared with that of the other methods [11–14]. The comparison is shown in Figure 12. Clearly, the proposed method offered a significant improvement over the previous methods.

**Figure 12.** F-score of the proposed and existing methods.

## *4.5. Topology Evaluation*

## 4.5.1. Topology of Missing Roads

The topology was built after centerline extraction. There were 371 topological nodes in total. After manual verification, 296 were found to be real road intersections, a correctness of ~79.8%. The wrongly extracted topological nodes were usually located at densely packed houses or on overly complex roads. This was probably caused by the low positioning accuracy and multipath effects of GNSS devices and could be improved by enhancing the dataset.

## 4.5.2. Connections between Missing and Existing Roads

Twenty-six potential connections between topological nodes and existing roads were found. After azimuth verification, 23 connections were reserved; comparison with the corresponding remote sensing images confirmed that they were correct. These connections represent both entrances and exits.

## **5. Discussion and Conclusions**

This paper proposes a new method for generating missing road networks in city blocks using big mobile navigation trajectory data. An algorithm named RSC was designed for high-frequency GNSS data. After extracting centerlines, methods for building road topologies in city blocks and establishing connections between missing and existing roads were proposed. A case area (5.76 km²) was used to verify the feasibility and validity of the proposed method. The results showed that, compared to real roads, the average length difference is approximately 1.84 m and the average distance is approximately 1.64 m, indicating that the proposed method can achieve meter-level missing road extraction. The satellite-extracted road data indicate that the proposed method achieves better results than the image-derived method. Meanwhile, in terms of the F-score index, the proposed method achieved the best results compared to previous studies.

The novelty of the proposed method is the higher geometric quality of the extracted missing roads. The length difference and the distance between the extracted and real roads are approximately 1.84 m and 1.64 m, respectively. This makes it possible to generate complicated road networks. Meanwhile, the F-score of the proposed method offered a large improvement over previous methods, meaning that the road networks generated by the proposed method are far more detailed.

However, the complexity of the proposed method is Θ(n²), which means that as the amount of input GNSS data increases, the resource and time consumption of the algorithm grow quadratically. Although the method introduced in this paper worked well in most areas, the quality of the results is affected by a few factors.

First, due to the chaotic trajectories in car parks, there will be a jumble of GNSS points in the corresponding region; therefore, roads through car parks cannot be extracted using the proposed method. Furthermore, a bridge above the road, an underground car park, or an underpass beneath the road will result in failure to generate the road's centerline at ground level.

Second, when a road runs between tall buildings or beneath a dense tree canopy, shadow and multipath effects may introduce significant errors into the coordinates of the GNSS points; thus, the quality of the centerline extracted via the proposed method will be poor.

Moreover, when two roads are close together, parallel, and both narrow, their related GNSS points are difficult to distinguish, so they will probably be extracted as a single road.

Finally, during data filtering, GNSS data with speeds below 5.0 km/h are removed to separate pedestrians from vehicle users. This approach also excludes some vehicles traveling at low speed; thus, the amount of GNSS data involved in the calculation is reduced.
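The speed-based filtering step described above can be sketched in a few lines of Python. This is an illustrative fragment, not the authors' code: the point schema and the `speed_kmh` field name are assumptions; only the 5.0 km/h threshold comes from the paper. It also makes the stated trade-off visible, namely that slow-moving vehicles are discarded along with pedestrians:

```python
# Hedged sketch of the speed-based filtering step; the point layout and
# the 'speed_kmh' field are invented for illustration.
SPEED_THRESHOLD_KMH = 5.0

def filter_slow_points(points):
    """Drop GNSS points slower than 5.0 km/h (pedestrians, but also slow vehicles)."""
    return [p for p in points if p["speed_kmh"] >= SPEED_THRESHOLD_KMH]

raw = [
    {"lat": 31.23, "lon": 121.47, "speed_kmh": 3.2},   # likely a pedestrian: dropped
    {"lat": 31.24, "lon": 121.48, "speed_kmh": 28.5},  # vehicle: kept
    {"lat": 31.25, "lon": 121.49, "speed_kmh": 4.9},   # slow vehicle: also dropped
]
kept = filter_slow_points(raw)
print(len(kept))  # 1
```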

**Author Contributions:** Conceptualization, Hangbin Wu and Guangjun Wu; methodology, Zeran Xu and Hangbin Wu; software, Zeran Xu; validation, Zeran Xu; formal analysis, Hangbin Wu and Zeran Xu; investigation, Zeran Xu; resources, Guangjun Wu; writing—original draft preparation, Hangbin Wu and Zeran Xu; writing—review and editing, Hangbin Wu and Zeran Xu; visualization, Zeran Xu and Hangbin Wu; supervision, Hangbin Wu; project administration, Hangbin Wu; funding acquisition, Hangbin Wu.

**Funding:** This study was supported by the National Science and Technology Major Program (2016YFB0502104), the National Science Foundation of China (No. 41671451), and the Fundamental Research Funds for the Central Universities of China.

**Acknowledgments:** The authors would like to thank Quan Yuan and Rui Jia for their manual work on the digitalization part. The authors also appreciate the contributions made by the anonymous reviewers.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


*ISPRS Int. J. Geo-Inf.* **2019**, *8*, 142

43. Liu, X.; Biagioni, J.; Eriksson, J.; Wang, Y.; Forman, G.; Zhu, Y. Mining large-scale, sparse GPS traces for map inference: Comparison of approaches. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 August 2012; pp. 669–677.

©2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Task-Oriented Knowledge Base for Geospatial Problem-Solving**

### **Can Zhuang 1, Zhong Xie 1,2,\*, Kai Ma 1, Mingqiang Guo 1 and Liang Wu 1,2**


Received: 5 September 2018; Accepted: 27 October 2018; Published: 31 October 2018

**Abstract:** In recent years, the rapid development of cloud computing and web technologies has significantly advanced the chaining of geospatial information services (GIServices) to solve complex geospatial problems. However, the construction of a problem-solving workflow requires considerable expertise from end-users. Currently, few studies design a knowledge base to capture and share geospatial problem-solving knowledge. This paper abstracts a geospatial problem as a task that can be further decomposed into multiple subtasks. The task distinguishes three distinct granularities: Geooperator, Atomic Task, and Composite Task. A task model is presented to define the outline of a problem solution at a conceptual level that closely reflects the processes of problem-solving. A task-oriented knowledge base that leverages an ontology-based approach is built to capture and share task knowledge. This knowledge base offers the potential to reuse task knowledge when faced with a similar problem. Finally, the details of the implementation are described using a meteorological early-warning analysis as an example.

**Keywords:** task; workflow; geospatial problem-solving; knowledge base

## **1. Introduction**

In recent years, with the rapid development of cloud computing and web technologies, an increasing number of geospatial information resources (GIRs), e.g., geospatial data, geospatial analysis functions, models, applications, etc., have been encapsulated into a wide variety of geographic information services (GIServices) [1] which are accessible to general public users over the web [2,3]. For example, a web service toolkit, named GeoPW [4], provides a set of geoprocessing services, which are used to fulfill data processing and spatial analysis tasks over distributed information infrastructures [5]. In the geospatial community, The Open Geospatial Consortium (OGC) established a series of standard interface specifications, such as Web Feature Service (WFS), Web Map Service (WMS), Web Coverage Service (WCS), and Web Processing Service (WPS), which further improve the interoperability and web-based sharing of GIServices [5–7].

In the geospatial application domain, geospatial problems usually involve heterogeneous data and several computational processes [8]. The capabilities of a single GIService are limited and often insufficient given the complexity of geospatial problems [4]. In the last decade, workflow-based approaches have evolved into a major way of addressing complex geospatial problems [9]. Currently, with the assistance of standard interface specifications, GIServices published by different organizations can be chained into a geoprocessing workflow that describes the execution order of problem-solving steps and enhances the power of atomic GIServices to fulfill complex geoprocessing tasks [10–13]. In general, geospatial problems require comparatively deep expertise; experts therefore need to contribute their problem-solving knowledge in the form of conceptual workflows.

In previous studies, there have already been investigations into the formalization of workflows [7,14,15] and semantic interoperability for GIServices [16–18]. Additionally, a number of studies have employed the task concept to facilitate the expression of user requirements at a semantic level [14,19,20]. In fact, many geospatial problems can share similar conceptual workflows. Therefore, conceptual workflows can be formalized into a knowledge base, which can help future users solve similar problems.

In this paper, we focus on using ontologies in association with a task-oriented approach to construct a knowledge base that enhances geospatial problem-solving. It is generally believed that ontology is the foundation and a significant part of the semantic web. Ontology provides unified terms to improve the semantic interoperability of domain knowledge [21]. A task is introduced as a reusable component to model the sequence of inference steps involved in solving certain kinds of geospatial problems at a conceptual level. The knowledge base can store conceptual workflows that are considered a priori knowledge accumulated from the past experience of domain experts [22], which makes problem-solving knowledge reusable [23]. A geospatial problem is abstracted as a task, and the knowledge for the task is considered a problem solution. Under many circumstances, tasks need to be decomposed into simpler tasks, each of which can be solved by one or a set of functions [24]. As each smaller task is simpler than the overall task, the complexity of the task is reduced significantly [22]. Hence, we further divide the task into three distinct granularities: (1) a geooperator, which is a basic processing functionality; (2) an atomic task, which is indecomposable; and (3) a composite task, which is decomposed into multiple subtasks.

The main work of this paper includes the following: (1) Concepts: the task concept is introduced as a reusable component for geospatial problem-solving and is used to reflect users' requirements; (2) Model: a task model is proposed to simulate problem-solving processes; (3) Knowledge base: an ontological knowledge base is designed that comprises several interoperable ontologies to capture and share problem-solving knowledge; and (4) Implementation: taking meteorological early-warning (MEW) analysis as an example, we describe the details of the implementation. We treat a geospatial problem solution as a task composed of conceptual geoprocessing operations, not connected to any concrete services. The instantiation and execution of a task, the low-level interaction with operations (such as accessing input data), and the validation of the processing chain are beyond the scope of this article.

The remainder of this paper is organized as follows. Section 2 reviews related work on task-based approaches and geospatial problem-solving. Section 3 proposes the task concept and task model to describe problem-solving processes. Section 4 presents the task-oriented knowledge base and describes some core ontologies in detail. Section 5 introduces the details of the implementation. Finally, conclusions and future work are given in Section 6.

## **2. Related Work**

## *2.1. The Task-Based Approach*

The notion of the task was proposed by Albrecht in the field of geographic information systems as early as the 1990s [25] and has been used in many studies. However, there is still no unified definition of a task [19]. In general, the task concept reflects user requirements and describes all actions or operations needed to solve a specific problem. Some studies have been performed using a task-based approach. We summarize and classify them as follows:

1. The task-based language. A task ontology language based on OWL (Web Ontology Language), named OWL-T, has been proposed to define task templates that formalize user demands and business processes at a high level of abstraction; it was applied to a trip-planning task [26]. Hu et al. [19] extended the task-oriented approach to the OGC Sensor Web domain. A Task Model Language, called TaskML, is a language for modeling tasks. The significant features of TaskML are Task Trigger, Task Priority, and Task QoS.


## *2.2. Geospatial Problem-Solving*

Currently, geoprocessing service technology is widely employed to solve specific geospatial problems in distributed information infrastructures. Much research has been devoted to utilizing or facilitating geoprocessing services to support problem-solving. Mikita [31] published a geoprocessing service for forest owners to optimize clearcut size and shape during the process of forest recovery. Müller [18] proposed a hierarchical framework to identify the semantic and syntactic properties of geoprocessing services at four levels of granularity, which is conducive to service retrieval, service comparison, and service invocation.

In most cases, a single geoprocessing service is not enough to solve a complex geospatial problem. Therefore, geoprocessing workflow technology provides a solution. The integration of geoprocessing services has become a popular research topic, and a series of tools and architectures have been developed to support geoprocessing service chaining. For example, an open source geoprocessing workflow tool, named GeoJModelBuilder, is able to integrate interoperable geoprocessing services and compose them into a workflow [6,7]. A RichWPS orchestration engine in combination with a DSL (Domain Specific Language) is used to orchestrate WPS processes and publish the composition as a WPS process for further composition [32]. In addition, there are many popular workflow management systems that facilitate the integration of geoprocessing services, such as Taverna [33,34], Triana [35], Kepler [36], and jABC [9,37]. However, they only simplify the workflow construction process at the syntactic level, and building a workflow composed of services for geospatial problem-solving is still challenging for end-users.

Recently, more studies have focused on semantic and automatic workflow composition for geospatial problem-solving. Farnaghi and Mansourian [12] proposed an automatic composition solution using an AI (Artificial Intelligence) planning algorithm and SAWSDL (Semantic Annotations for Web Service Description Language) to improve the disaster management process. Al-Areqi et al. [10] applied a constraints-driven synthesis method to implement the semiautomatic composition of a workflow for analyzing the impacts of sea-level rise. Samadzadegan et al. [38] designed a framework for an automatic fire detection early-warning workflow based on OGC services. Arul and Prakash presented a unified composition algorithm that adds a new phase, called Validation and Optimization, to automatic web service composition and generates a scalable composition process according to the dynamic change of user requirements [39].

## **3. Task as a Reusable Problem-Solving Component**

## *3.1. An Application Scenario*

In this section, we present an example that uses a workflow composed of distributed data and various geoprocessing services. This example is used throughout the remainder of the paper to help explain the concept of a geospatial task. Assume an end-user is a staff member of a meteorological disaster monitoring department who needs to predict the probability of geological disasters occurring in a certain region on the following day. The ideal result is a thematic map of the early-warning region that uses different colors to represent different early-warning levels.

To achieve the early-warning results, the most common approach is to formulate a geographic processing workflow that generates an early-warning result map. As shown in Figure 1, an elliptical shape represents a data node, and a rectangular shape represents a data processing node. First, geological hazard point data, influence factor data, and early-warning unit data are used as input to calculate the potential degree index of each early-warning unit. Similarly, effective rainfall data can be obtained. Then, the potential distribution map and effective rainfall data from the previous step, together with forecast rainfall data, go through early-warning analysis calculations to produce the early-warning result map.

**Figure 1.** Sequences of the meteorological early-warning (MEW) process.

For the aforementioned application, the entire workflow can be considered a task. GIS domain experts with professional knowledge are able to analyze the technological procedures and abstract them in the form of conceptualization, which are then used to describe the skeleton knowledge of the problem-solving process. The MEW task, which was previously performed manually and had the requirements of GIS skills and knowledge of business processes, can now be executed automatically.

## *3.2. Task and Task Model*

In this paper, the task concept is proposed to reflect user requirements, which can be accomplished by one or more geoprocessing services. A geospatial problem is abstracted as a task that denotes a high-level business goal, and users execute a sequence of processes to achieve the goal. Tasks are different from operations or services, as tasks focus on what users want to solve, while operations or services mainly focus on the implementation of geoprocessing computations.

A complex problem can consist of multiple problem-solving processes with different requirements, which makes it difficult to define the solution as a single task [22]. Hence, a complex task can be decomposed into several smaller tasks, each of which can be solved in a relatively independent way by one or more geoprocessing services and then combined together into a complete solution [24]. The granularity of the task plays an important role during the problem-solving process. As shown in Figure 2, there are three distinct granularities: (1) a geooperator as elementary functionality for an atomic task, (2) an atomic task as a building block for a composite task, and (3) a composite task as a building block for a complex geospatial task. Consequently, the task is a reusable component for construction of the problem-solving workflow.

**Figure 2.** Relationships between Task, AtomicTask, CompositeTask, and Geooperator.

The process property of the geospatial task is expressed by a task process graph (TPG), which captures the execution order of problem-solving steps and closely describes how a task should be achieved. Each TPG contains a set of edges that compose an acyclic directed graph structure. An edge connects two tasks, and its direction determines the dependency between them. The combination of TPG and task composes a task model that allows users to specify complex geospatial problems at an abstract level.

## *3.3. Geooperator*

Geospatial problem-solving knowledge is represented at a conceptualized level that requires categorization and formalization of geoprocessing services. Geooperators are developed mostly for improving the discoverability and exchangeability of geoprocessing functionality and providing an approach to formalize well-defined geoprocessing functionality [40]. In Brauner's work, geooperators are categorized in terms of multiple different perspectives, such as geodata, legacy GIS, pragmatic, formal or technical perspectives [41]. An overview of perspectives and top-level categories identified by Brauner is shown in Figure 3a, and elements described by the geooperator are given in Figure 3b, which can facilitate our work. The former is used to define the subclasses of Geooperator class in the GIS operation ontology without further modification; the latter is partially transformed into data properties and object properties of the Geooperator class.

In this paper, geooperators are introduced to provide a conceptualization for geoprocessing services (such as a geospatial analysis or transformation service) that are encapsulated as standard web services (e.g., WPS) for providing geoprocessing functionalities on the web. From an object-oriented perspective, geooperators act as wrappers for existing geoprocessing services and subsequently serve as building blocks for elementary geoprocessing tasks.

**Figure 3.** (**a**) Different perspectives on Geooperator [41] (**b**) Description elements of Geooperator [17].

## *3.4. Formal Definition*

**Definition 1 (Task).** *A task can be defined as a quadruple:*

$$\mathbf{T} = \text{(PT, OP, PA, C)},\tag{1}$$

*where PT specifies the type of task, OP is spatial inputs and outputs (e.g., spatial datasets), PA is a set of non-spatial parameters of a task and C consists of the precondition and result that generally constrains the thematic and geometric attributes of input or output data for geoprocessing tasks* [42]*.*

**Definition 2 (Task Process Graph).** *A task process graph defines the basic structure of task decomposition* [43]*, which is an acyclic directed graph defined as follows:*

$$\text{TPG} = (\text{V}, \text{E}), \tag{2}$$

*where V is a finite set of n vertices {v1, v2, . . . , vn}, and each vertex v* ∈ *V represents a task tv. E is a finite set of directed edges {e(vi,vj)}. Each edge e(vi,vj)* ∈ *E can be characterized by a tuple (p(vi,vj), cij). p(vi,vj) = <vi, vj> is an ordered pair that represents execution precedence between task tvi and task tvj; in other words, tvi is ahead of tvj in the sequence of task decomposition, which can also be denoted by vi* ≤ *vj. cij represents the control-flow connector between the two tasks, which includes sequence, branching, loop, and so forth.*
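Definition 2 can be sketched as a small Python class. This is an illustrative fragment, not the paper's implementation: the names (`Tpg`, `add_edge`) are invented, and the acyclicity requirement of the TPG is enforced with Kahn's topological-sort algorithm.

```python
# Hedged sketch of a task process graph (TPG): vertices are tasks, directed
# edges carry a control-flow connector, and the graph must stay acyclic.
class Tpg:
    def __init__(self):
        self.vertices = set()
        self.edges = {}  # (vi, vj) -> control-flow connector c_ij

    def add_edge(self, vi, vj, connector="sequence"):
        """Record that task vi precedes task vj (vi <= vj); reject cycles."""
        self.vertices.update((vi, vj))
        self.edges[(vi, vj)] = connector
        if self._has_cycle():
            del self.edges[(vi, vj)]
            raise ValueError(f"edge ({vi}, {vj}) would create a cycle")

    def _has_cycle(self):
        # Kahn's algorithm: a DAG can always be fully topologically sorted.
        indeg = {v: 0 for v in self.vertices}
        for (_, vj) in self.edges:
            indeg[vj] += 1
        queue = [v for v, d in indeg.items() if d == 0]
        seen = 0
        while queue:
            v = queue.pop()
            seen += 1
            for (vi, vj) in list(self.edges):
                if vi == v:
                    indeg[vj] -= 1
                    if indeg[vj] == 0:
                        queue.append(vj)
        return seen != len(self.vertices)

# Task names loosely follow the MEW scenario of Section 3.1.
tpg = Tpg()
tpg.add_edge("PotentialDegreeIndex", "EarlyWarningAnalysis")
tpg.add_edge("EffectiveRainfall", "EarlyWarningAnalysis")
```

Attempting to add an edge from "EarlyWarningAnalysis" back to "PotentialDegreeIndex" would raise a `ValueError`, mirroring the requirement that a TPG be an acyclic directed graph.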

**Definition 3 (Task Model).** *A task model is defined by a 2-tuple as follows:*

$$\text{tmodel} = (\text{t}, \text{tpg}), \tag{3}$$

*where t* ∈ *T is a task instance, and tpg denotes a task process graph associated with t that defines the decomposition structure. If tpg only contains a geooperator, we consider this task to be an atomic task; otherwise, it is a composite task.*

**Definition 4 (Task Decomposition).** *Following the definition of the task model, we can further accomplish the task decomposition. Given a task process graph tpg = (V, E), assuming v* ∈ *V, tv* ∈ *T, v associates with tv. If tv has a corresponding model tmodelv = (tv, tpgv), then the decomposition of the task can be defined by*

$$\text{tpg}' = \text{Decompose}(\text{v}, \text{tpg}, \text{tmodel}_\text{v}),\tag{4}$$

*where tpg' is a new task process graph obtained by replacing node v with tpgv in tmodelv.*

Taking the workflow mentioned in Section 3.1 as an example, Figure 4 depicts the task decomposition procedure. The node "Early-warning analysis" is replaced by a task process graph, which is defined in a task model, where the edges previously connected with this node are revised.

**Figure 4.** An example of task decomposition.
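The expansion step sketched in Figure 4 can be illustrated with a small Python fragment. This is a hedged sketch under simplifying assumptions: graphs are plain edge lists, the function and task names are invented, and the subgraph is assumed to expose a single entry and a single exit node.

```python
# Hedged sketch of Definition 4 (task decomposition): replace node v in a
# task process graph with the subgraph defined in v's own task model.
def decompose(v, tpg_edges, sub_edges, sub_entry, sub_exit):
    """Return a new edge list in which node v is replaced by a subgraph.

    Edges into v are rewired to the subgraph's entry node, and edges out
    of v to its exit node, as when 'Early-warning analysis' is expanded.
    """
    new_edges = []
    for a, b in tpg_edges:
        if a == v and b == v:
            continue                           # drop a self-loop on v
        elif b == v:
            new_edges.append((a, sub_entry))   # rewire incoming edge
        elif a == v:
            new_edges.append((sub_exit, b))    # rewire outgoing edge
        else:
            new_edges.append((a, b))           # untouched edge
    return new_edges + list(sub_edges)

# Invented task names for illustration only.
workflow = [("EffectiveRainfall", "EarlyWarningAnalysis"),
            ("EarlyWarningAnalysis", "ResultMap")]
sub = [("OverlayRainfall", "ClassifyWarningLevel")]
expanded = decompose("EarlyWarningAnalysis", workflow, sub,
                     sub_entry="OverlayRainfall",
                     sub_exit="ClassifyWarningLevel")
print(expanded)
```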

## **4. A Task-Oriented Knowledge Base**

This section presents the knowledge base, which adopts the ontology-based approach and provides comprehensive knowledge to support geoprocessing tasks. To build the knowledge base, a set of ontologies is needed to capture knowledge related to the problem-solving solution. The use of ontologies makes the semantic meaning of problem-solving procedures explicit and further helps users obtain the problem solution [44]. Formalizing the knowledge base will assist both GIS non-specialist users and specialists in automating problem solving, allowing the reuse and sharing of solutions [21]. Accordingly, we deem the knowledge base valuable.

## *4.1. Background on Ontologies*

It is widely known that ontology provides a formal language to standardize and share the semantics of various kinds of domain knowledge. The word ontology was first used as a philosophical concept addressing the nature of existence, and it was subsequently introduced into the information domain by researchers. Currently, one of the most prevalent definitions of ontology is "an explicit specification of a conceptualization", proposed by Gruber in 1993 [45]. Based on this definition, ontology is essentially a taxonomy of the objective world and a knowledge representation model. Meanwhile, ontology also supports non-taxonomic relations.

According to Perez [46], knowledge in ontologies is formalized by five kinds of modeling primitives: concepts, relations, functions, axioms, and instances. From a mathematical point of view, ontology can be formally expressed by an Equation as follows:

$$\mathbf{O} = \langle \mathbf{C}, \mathbf{R}, \mathbf{F}, \mathbf{A}, \mathbf{I} \rangle,\tag{5}$$

where C is a set whose elements are called concepts; R is a set of relations between concepts, R ⊆ C × C; F is a special relation in which the former n − 1 elements uniquely determine the n-th element, defined as F: C1 × C2 × . . . × Cn−1 → Cn; A represents a geographic axiom, that is, a collection of assertions in a logical form that are always true; and I stands for instances of concepts.

In the process of building ontology, instances represent objects that can be anything in a domain, and concepts are a set of objects that are mapped to classes. The relations between concepts are realized by properties that are classified into two types: an object property and a data property [47]. An object property specifies the relations between two classes, and it connects two individuals from different classes. A data property defines the relations between individuals and data values, which is similar to an inherent attribute of an object.
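The distinction between the two kinds of properties can be illustrated with a toy triple store in plain Python (no OWL library; all names are invented for this sketch): an object property links two individuals, whereas a data property links an individual to a literal value.

```python
# Toy illustration of object vs. data properties as tagged triples.
triples = set()

def add_object_property(subject, prop, obj):
    # Object property: connects two individuals from (possibly) different classes.
    triples.add((subject, prop, ("individual", obj)))

def add_data_property(subject, prop, value):
    # Data property: connects an individual to a literal data value.
    triples.add((subject, prop, ("literal", value)))

# An invented task individual linked to a process individual (object property)
add_object_property("MEW_Task_1", "hasProcess", "MEW_Process_1")
# ... and carrying a literal creation date (data property)
add_data_property("MEW_Task_1", "createTime", "2018-09-05")

literals = [t for t in triples if t[2][0] == "literal"]
print(len(triples), len(literals))  # 2 1
```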

## *4.2. Ontologies at the Heart of the Knowledge Base*

To realize the capability to represent the knowledge of the problem-solving process, the knowledge base provides a set of ontologies as follows: Task Ontology, Process Ontology, GIS Operation Ontology, Interface Ontology, Data Type Ontology, GIS Data Ontology, and GIService Type Ontology. These ontologies are combined to provide support for all facets of problem-solving, each of which plays a key role in building a rich, dynamic and flexible task-oriented knowledge base. Figure 5 shows the delineations of the definitions of ontologies and how they relate to each other. Several important ontologies are discussed in detail in the following section.

**Figure 5.** The relationships of ontologies in the knowledge base.

## 4.2.1. Task Ontology

Task Ontology is the core support for problem solving; it defines the Task class to represent a geospatial problem. Its property relations are composed of object properties and data properties. The data properties mainly describe the metadata of task instances, such as Description, Publisher, Create Time, and so on. The object properties include hasSynonym, hasTaskType, hasProcess, hasInput, hasOutput, etc.

The Task class refers to the Task Lexicon class through the hasSynonym property for semantic annotation of tasks, providing the words and phrases that describe tasks, on the basis of which end users express the target problem in their own natural language. This can broaden the scope of keyword queries and resolve synonyms to support natural language retrieval. The Task Type class describes the categorization of tasks on the basis of the functionalities that tasks can implement. The MEW analysis in the example mentioned above is one sort of geospatial task. The Task Type class is linked to the Task class for semantic reference to state the type of task individuals through the predefined hasTaskType property. Each individual of the Task class has at least one conceptual solution, which is denoted in the Process Ontology. The interfaces of the Task class are defined in the Interface Ontology, which is described in detail in the following section.

## 4.2.2. Process Ontology

Process Ontology is used to define problem-solving processes at a conceptual level for a certain type of task; it is not associated with any concrete services. The AtomicProcess and CompositeProcess classes are created as subclasses of the Process class to classify process individuals according to the number of processes involved. An atomic process directly refers to the Geooperator class in the GIS Operation Ontology using the RDF:Type property, whereas a composite process is a set of edges. Each edge denotes the sequence of two task nodes that are semantically annotated to the Task class using the fromTask and toTask properties. A series of edges forms a directed graph, called a task process graph, that describes how the task works. In this paper, we only consider the linear sequence between two tasks; other control flow logics will be included in future work.

## 4.2.3. Data Type Ontology

Data Type Ontology describes data types, which are divided into two categories: SimpleDataType and GeoDataType, as illustrated in Figure 6. SimpleDataType includes primitive data types from programming or description languages, such as xml:string and xml:float in XML. GeoDataType is an abstract representation of geospatial data and has data properties shared by every type of geospatial data, including attribute, data format, and coordinate reference system (CRS). Based on the abstract specifications of the International Organization for Standardization (ISO) for vector [48] and raster data [49], GeoDataType is differentiated into VectorDataType and RasterDataType, each of which has unique characteristics. In vector data, each geospatial feature must identify a geometric type, such as point, polyline, or polygon, following the OGC Simple Feature Specification [50]. In raster data, the resolution and band number must be identified.

**Figure 6.** Data type specifications.
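As one possible reading of the hierarchy above, the data types can be sketched with Python dataclasses. The attribute names are assumptions drawn from the prose (data format, CRS, geometry type, resolution, band number), not a normative schema:

```python
# Hedged sketch of the Figure 6 hierarchy: GeoDataType with shared
# properties, specialized into vector and raster variants.
from dataclasses import dataclass, field

@dataclass
class GeoDataType:
    data_format: str                 # e.g. "GeoTIFF", "Shapefile" (assumed values)
    crs: str                         # coordinate reference system
    attributes: list = field(default_factory=list)

@dataclass
class VectorDataType(GeoDataType):
    geometry_type: str = "polygon"   # point / polyline / polygon (OGC SFS)

@dataclass
class RasterDataType(GeoDataType):
    resolution: float = 0.0          # ground resolution; unit assumed to be meters
    band_number: int = 1

# Invented example instances loosely matching the MEW scenario.
units = VectorDataType(data_format="Shapefile", crs="EPSG:4326",
                       geometry_type="polygon")
rainfall = RasterDataType(data_format="GeoTIFF", crs="EPSG:4326",
                          resolution=1000.0, band_number=1)
print(isinstance(units, GeoDataType), isinstance(rainfall, GeoDataType))  # True True
```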

## 4.2.4. GIS Operation Ontology

In GIS Operation Ontology, the Geooperator class is employed to conceptualize geoprocessing functionalities. The notion of Geooperator has been introduced in the previous section. The geooperators are used as building blocks for the conceptual workflow of geospatial problem-solving. This ontology of the knowledge base is based on work by Hofer [42] who translated the SKOS (Simple Knowledge Organization System) thesaurus provided by Brauner [41] into an OWL ontology and included an additional concept that is known as a functional concept. The SKOS thesaurus contains 40 geooperators. This ontology can be extended by extra categories, if necessary. The categories of the Pragmatic perspective originate from the general task, and are task-oriented categories. Users can further integrate new categories based on practical application. Therefore, in this paper, an additional category named MEW is integrated into the Pragmatic perspective of the geooperator, and subcategories or geooperators can be created for a further description of geoprocessing operations. Based on this classification, geoprocessing services that perform geospatial functionalities are thought of as individuals of the Geooperator class.

## 4.2.5. Interface Ontology

As introduced in the previous section, tasks are used as reusable components to accomplish the composition of problem-solving processes. The composition requires an evaluation of the correspondence of interfaces. The knowledge base needs to include sufficient information of interfaces to satisfy the needs of the composition. An interface requires the description of operands that contain inputs and outputs, constraints that contain a precondition and result, and non-spatial parameters. Consequently, as illustrated in Figure 7, the Interface class consists of the subclasses Input, Output, Parameter, Precondition, and Result. GeoDataType in Data Type Ontology is used to specify operands of interfaces, whereas non-spatial parameters can refer to SimpleDataType which includes conventional data types. The Precondition class focuses on the thematic and geometric properties of the input to ensure the correct function of the operation [42]. The Postcondition class defines the expected result of the output.

**Figure 7.** Interface for annotating Task and Geooperator, and Data Type for specifying Interface.

Similarly, we extend the interface properties of geooperators using the Interface Ontology which presently does not involve the related interface specifications.

## **5. Implementation**

In Section 3, we introduce an application scenario that is a geospatial problem-solving process in the context of MEW. We take this example to demonstrate the benefits of the ontology-based knowledge base for tasks during the process of geospatial problem-solving. The implementation includes three parts: creation of ontologies, representation of knowledge, and task instances.

## *5.1. Creation of Ontologies*

Based on the proposed architecture of the task-oriented knowledge base described in Section 4.2, we build different abstract ontologies to represent the hierarchy and relationships of the concepts using Protégé 5.2.0, an OWL ontology development platform that allows the creation and querying of ontologies [21]. In general, an ontology is composed of the following components: concepts and the properties of each concept, relationships or constraints between concepts, and instances of concepts [28]. Figure 8a presents all concepts or classes defined in the ontological knowledge base. All object properties that represent the relationships between classes are shown in Figure 8b; they include hasTaskType, hasSynonym, hasProcess, etc. The abstract ontologies can be instantiated for specific tasks. In this paper, the task instances for meteorological early-warning are implemented, as detailed in the next section.

**Figure 8.** An excerpt of ontologies where (**a**) depicts the classes of ontologies, and (**b**) illustrates the object properties between classes.

## *5.2. Representation of Ontology Knowledge*

Once the components of an ontology are developed, the ontology can be represented by an ontology description language, such as the Resource Description Framework (RDF) or the Web Ontology Language (OWL). RDF is built upon XML and uses triples of object, property, and value to describe resources. OWL, a W3C-recommended standard semantic markup language developed by the Semantic Web community, is an extension of RDF [15,21]. In this paper, we use OWL as a standard, machine-readable language to represent the knowledge of the ontologies, which is stored as an OWL file.
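As a minimal illustration of the triple model described above, the following Python sketch stores knowledge as (subject, property, value) triples and matches them against a pattern. This is a toy model, not an RDF library; the specific triples about ERCTask are illustrative, not taken from the actual ontology file.

```python
# Toy triple store in the spirit of RDF: each fact is a
# (subject, property, value) triple. The ERCTask facts below are
# illustrative examples, not the paper's actual ontology content.
triples = [
    ("ERCTask", "rdf:type", "AtomicTask"),
    ("ERCTask", "hasInput", "Forecast_Rainfall_Data"),
    ("ERCTask", "hasOutput", "Effective_Rainfall_Data"),
]

def query(triples, s=None, p=None, o=None):
    """Return all triples matching an optional (s, p, o) pattern;
    None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(query(triples, p="hasInput"))
# [('ERCTask', 'hasInput', 'Forecast_Rainfall_Data')]
```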

Meanwhile, we use property restrictions, including the hasValue restriction and quantifier restrictions, to limit the associations between different classes [15]. The hasValue restriction specifies that the individuals of a class have a given value. In contrast, a quantifier restriction limits the individuals of a class using either an existential restriction (∃, owl:someValuesFrom) or a universal restriction (∀, owl:allValuesFrom). The former states that the restricted property must have at least one value from the specified class; the latter states that all values of the restricted property must come from the specified class. For example, an MEW analysis task needs only effective rainfall data, forecast rainfall data, and potential degree data, which can be expressed with the following formal statement: ∀ hasInput (Effective\_Rainfall\_Data ∪ Forecast\_Rainfall\_Data ∪ Potential\_Degree\_Data). This statement defines a universal restriction on the "hasInput" property between the Task class and the Input class (Figure 5). The OWL notation using the "owl:allValuesFrom" restriction is shown in Figure 9.

```
<owl:Class rdf:ID="Meteorological Early-warning Task">
    <rdfs:subClassOf>
        <owl:Restriction>
            <owl:onProperty rdf:resource="#hasInput"/>
            <owl:allValuesFrom rdf:resource="#Effective_Rainfall_Data"/>
            <owl:allValuesFrom rdf:resource="#Forecast_Rainfall_Data"/>
            <owl:allValuesFrom rdf:resource="#Potential_Degree_Data"/>
        </owl:Restriction>
    </rdfs:subClassOf>
</owl:Class>
```
**Figure 9.** Snippet of OWL notation using a universal restriction.
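The semantics of the owl:allValuesFrom restriction above can be sketched in a few lines of Python: every value of the hasInput property must come from the union of the three permitted data classes. This is a toy check, not an OWL reasoner; the class names follow the text, while the helper name is hypothetical.

```python
# The union of input classes permitted by the universal restriction
# (class names follow the text).
ALLOWED_INPUTS = {
    "Effective_Rainfall_Data",
    "Forecast_Rainfall_Data",
    "Potential_Degree_Data",
}

def satisfies_all_values_from(task_inputs, allowed=ALLOWED_INPUTS):
    """True iff every hasInput value belongs to the allowed union,
    i.e., the task satisfies ∀ hasInput.(A ∪ B ∪ C)."""
    return all(i in allowed for i in task_inputs)

# A task with only permitted inputs satisfies the restriction;
# one with an extra input class does not.
print(satisfies_all_values_from(["Effective_Rainfall_Data",
                                 "Forecast_Rainfall_Data"]))   # True
print(satisfies_all_values_from(["Effective_Rainfall_Data",
                                 "Temperature_Data"]))          # False
```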

## *5.3. Task Instances*

The specific task instances can be represented using the classes and properties defined in the ontologies. Using the meteorological early-warning example mentioned in Section 3.1, the tasks involved in the MEW example are listed in Table 1; they comprise two composite tasks (e.g., EWATask) and six atomic tasks (e.g., ERCTask and FQTask). We use ERCTask, which calculates the effective rainfall, as an example of an atomic task instance. Figure 10 shows the individuals and properties involved in the ERCTask instance. The process of an atomic task is an individual of AtomicProcess, whereas those of composite tasks are not. The namespace declaration of the ontologies and the syntax of the class, subclass, and property definitions in OWL are listed below Figure 10.


**Table 1.** Tasks involved in the MEW example.

Differing from an atomic task, the process of a composite task is composed of multiple edge individuals, each of which describes the data flow between two task instances. A set of edges composes a process graph that denotes how the task works. For example, Figure 11 shows the task instance of a composite task called EWATask. The process individual "process:EWAProcess" contains two edge individuals: "process:EWAEdge1" and "process:EWAEdge2". The former connects the two task instances "task:QATask" and "task:HRITask", and the latter connects "task:HRITask" and "task:EWLTask". These edge individuals are linked to process individuals with the itemEdge property.
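The edge-based process description can be sketched as follows, using the edge individuals named in the text; `linear_order` is a hypothetical helper that recovers the execution sequence from the edges of a linear process graph.

```python
# Edge individuals of process:EWAProcess as (source, target) pairs;
# the task names follow the text.
edges = [
    ("task:QATask", "task:HRITask"),   # process:EWAEdge1
    ("task:HRITask", "task:EWLTask"),  # process:EWAEdge2
]

def linear_order(edges):
    """Derive the execution sequence of a linear process graph:
    find the unique start task (never a target) and follow each edge."""
    succ = dict(edges)
    targets = set(succ.values())
    start = next(s for s, _ in edges if s not in targets)
    order = [start]
    while order[-1] in succ:
        order.append(succ[order[-1]])
    return order

print(linear_order(edges))
# ['task:QATask', 'task:HRITask', 'task:EWLTask']
```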

**Figure 10.** The task instance of an atomic task (EffectiveRainfallCalTask).

**Figure 11.** The task instance of a composite task (EarlyWarningAnalysisTask).

## *5.4. Prototype*

A prototype system based on the realized ontology representation and formalized task instances was developed to help users solve complex geospatial problems. The implemented prototype leverages a number of web techniques, such as Ajax, XML, JSON, EasyUI, GoJS, OpenLayers, Apache Axis2, and so on. Ajax is used for asynchronous data exchange between the client and server sides. XML and JSON are data exchange formats. EasyUI and GoJS are client UI frameworks, and GoJS is employed to draw the flowchart. OpenLayers is a JavaScript class library for WebGIS client development, which is used for map data access. Apache Axis2 is used to provide the Web Service interface. The Java API package Apache Jena [51], a Semantic Web framework for Java, is used to parse the ontology file, access ontology definitions, and infer knowledge [52]. The Apache Tomcat server was employed as the web container. The prototype system can be accessed using Microsoft Internet Explorer or Google Chrome on a Windows operating system.

The MEW analysis in Henan, China is used as an example of utilizing the knowledge base to support geospatial problem-solving. First, we define formal semantics in the ontology-based knowledge base by creating task instances using an ontology editor. The task instance is named EWATask (Figure 11), which can be decomposed into three subtasks (OATask, HRITask, and EWLTask). The ontology files are generated in the OWL format language mentioned in the previous section.

Second, the web services, including three data access services (Potential\_Degree\_Data, Effective\_Rainfall\_Data, and Forecast\_Rainfall\_Data) and three geoprocessing services (wps\_overlay, wps\_riskIndex, and wps\_ewLevel), are published with the support of MapGIS IGServer [53]. The details of the data access services are shown in Table 2; the geoprocessing services follow the WPS specification; and the workflow model for EWATask is shown in Figure 12.



**Figure 12.** The workflow model of EWATask. Data access services are represented in elliptical shapes, and geoprocessing services are represented in rectangular shapes.

Finally, the prototype system provides an intuitive and easy-to-use graphical user interface (GUI), which end-users can access through a web browser. As shown in Figure 13, the left panel contains a tree structure showing the task lists parsed from the knowledge base. The user selects and clicks a task node; the process model of the task is then displayed as a flowchart in the right panel (step 1). Next, the user right-clicks a process node and selects the "service binding" menu (step 2). A service binding window pops up, allowing end-users to bind the appropriate service and input the related parameters manually (step 3); this step is repeated for each process node. Finally, the user executes the task and gets the result map (step 4). For instance, Mr. Wang took Henan province in China as a forecast area for risk analysis and forecasted the possibility of the occurrence of geological hazards in the next 24 h. He clicked the EarlyWarnAnalysisTask node in the prototype system, and the process graph of the task was shown in the right panel (Figure 13). Following this workflow, he bound the appropriate geoprocessing services (wps\_overlay, wps\_riskIndex, and wps\_ewLevel), which were invoked in a linear sequence (wps\_overlay → wps\_riskIndex → wps\_ewLevel). According to the forecast results, an early-warning result map, shown in the lower right of Figure 13, was obtained; it uses different colors to represent different early-warning levels.
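The linear invocation sequence described above can be sketched as a chain of three placeholder functions. The service names follow the text, but their internal logic (overlay as a dictionary merge, a toy averaged risk index, and illustrative warning thresholds) is invented purely for illustration; the real services are WPS geoprocessing calls.

```python
# Placeholder stand-ins for the three geoprocessing services; only the
# names follow the text, the logic is illustrative.
def wps_overlay(effective, forecast, potential):
    # Overlay the three input layers (toy version: merge dicts).
    return {**effective, **forecast, **potential}

def wps_riskIndex(overlaid):
    # Compute a toy risk index as the mean of the overlaid values.
    return sum(overlaid.values()) / len(overlaid)

def wps_ewLevel(risk):
    # Map the risk index to an early-warning level (thresholds invented).
    return "red" if risk >= 80 else "yellow" if risk >= 60 else "blue"

# Chain the services in the linear sequence
# wps_overlay -> wps_riskIndex -> wps_ewLevel.
risk = wps_riskIndex(wps_overlay({"eff": 60}, {"fore": 80}, {"pot": 70}))
print(wps_ewLevel(risk))  # yellow
```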

**Figure 13.** The graphical user interface of the prototype system.

## **6. Conclusions and Future Work**

This paper proposes a task model that abstracts a geospatial problem as a task, which can then be used as a reusable component for problem-solving. A task-oriented knowledge base is built to capture sharable and reusable geospatial problem-solving knowledge. In the knowledge base, we combine multiple ontologies (e.g., the Task Ontology, Process Ontology, and GIS Operation Ontology) to provide assistance for all facets of problem-solving. This knowledge base is not tightly coupled with any specific workflow language. The required knowledge about problem-solving is stored in the knowledge base, which employs an ontology-based, task-oriented approach to achieve the formalization and reusability of tasks.

This knowledge base is tailored for domain experts to create and share their professional geospatial problem-solving knowledge. For the end-users, a user-friendly interface is needed to submit a geospatial problem and query the problem solution. An approach that has the capabilities of parsing natural language input will be developed in future work. This approach would allow users to input free-text to submit problem requirements.

In this paper, we concentrate only on using ontologies to describe a conceptual workflow that is composed of a linear sequence of GIS functionalities. We do not present an algorithm to instantiate it into a concrete service chain and execute the workflow. The knowledge transformation, instantiation, and execution of a workflow will be implemented in future work.

**Author Contributions:** C.Z. designed the knowledge base, implemented the prototype system, and wrote the paper. K.M. and M.G. deployed and performed the prototype system. L.W. contributed the materials and tools. Z.X. conceived the early ideas of this work, reviewed the paper, and provided some suggestions and feedback.

**Funding:** This work was funded by the National Key Research and Development Program of China (Grant Nos. 2017YFB0503600, 2018YFB0505500, 2017YFC0602204), National Natural Science Foundation of China (Grant Nos. 41671400, 41701446), and Hubei Province Natural Science Foundation of China (Grant No. 2017CFB277).

**Acknowledgments:** We acknowledge the anonymous reviewers for their valuable comments and suggestions to improve this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


©2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Geographic Knowledge Graph (GeoKG): A Formalized Geographic Knowledge Representation**

**Shu Wang 1,2,3, Xueying Zhang 1,2,3,\*, Peng Ye 1,2,3, Mi Du 1,2,3, Yanxu Lu 1,2,3 and Haonan Xue 1,2,3**


Received: 25 February 2019; Accepted: 4 April 2019; Published: 8 April 2019

**Abstract:** Formalized knowledge representation is the foundation of Big Data computing, mining, and visualization. Current knowledge representations regard information as items linked to relevant objects or concepts by tree or graph structures. However, geographic knowledge differs from general knowledge in that it focuses more on temporal, spatial, and change-related knowledge. Thus, it is difficult for discrete knowledge items to represent geographic states, evolutions, and mechanisms, e.g., the process of a storm: "{9:30-60 mm-precipitation}-{12:00-80 mm-precipitation}- ... ". The underlying problem lies in the constructors of the logic foundation (the ALC description language) of current geographic knowledge representations, which cannot provide such descriptions. To address this issue, this study designed a formalized geographic knowledge representation called GeoKG and supplemented the constructors of the ALC description language. Then, an evolution case of the administrative divisions of Nanjing was represented with the GeoKG. In order to evaluate the capabilities of our formalized model, two knowledge graphs were constructed from the administrative division case by using GeoKG and YAGO. Then, a set of geographic questions were defined and translated into queries. The query results show that, with the enhanced state information, the GeoKG results are more accurate and complete than YAGO's. Additionally, the user evaluation verified these improvements, which indicates that GeoKG is a promising and powerful model for geographic knowledge representation.

**Keywords:** geographic knowledge representation; geographic knowledge graph; formalization; GeoKG

## **1. Introduction**

Geographic knowledge is the product of geographic thinking and reasoning about the world's natural and human phenomena, and it plays an important role in geographic studies and applications [1]. Nearly every geographer is trying to answer the question of "how to perceive, understand and organize geographic knowledge scientifically" [2]. Generally, geographic knowledge representation is a type of human expression of the real world that is of great importance to storage and computation [3]. Especially in the era of Big Data, well-structured geographic knowledge benefits all kinds of geospatial applications, because formalization is the foundation of geospatial big data computing, mining, and visualization.

At present, the most popular knowledge representation is the knowledge graph. It organizes knowledge with a set of concepts, relations, and facts, which are associated by two types of statements: {entity, relation, entity} and {entity, attribute, attribute value} [4]. There are only three basic elements in knowledge graphs: the entity, the relation, and the attribute. These three elements can explicitly represent general information, such as "when did the 7·21 Beijing storm occur—9:30, 21 July". However, geographic knowledge is more complicated than general knowledge. More processes and evolutions need to be answered, e.g., "what caused the 7·21 Beijing storm", "how did it develop", and "what were the effects of the 7·21 Beijing storm". Entities, relations, and attributes cannot easily and directly answer these mechanism questions. For example, the geographic knowledge graph representation of the 7·21 Beijing storm is shown in Figure 1.

**Figure 1.** Different geographic knowledge representations of the 7·21 Beijing Storm. (**a**) Knowledge graph data structure and (**b**) procedural knowledge data structure.

Figure 1a organizes the geographic knowledge of the 7·21 Beijing storm using the data structure of the current knowledge graph. This knowledge representation model can explicitly represent each fact and its relations. However, it is not able to represent evolutions or mechanisms, which are key topics in geography. Moreover, this type of knowledge representation differs greatly from the procedural knowledge data structure shown in Figure 1b. In general, humans perceive objects, events, and activities through the processing of declarative knowledge, procedural knowledge, and structural knowledge [5]. Procedural knowledge frames declarative knowledge state by state, which benefits the understanding of underlying mechanisms [6,7]. The 7·21 Beijing storm includes three main stages: 9:30, 14:00, and 18:30. Each stage has a list of attributes. This procedural knowledge data structure helps people grasp the evolution or mechanism more explicitly. For example, people cannot directly understand why all the attributes (warning level-blue, warning level-yellow, etc.) link to "7·21 Beijing storm", whereas they can see that the storm had different warning levels at different stages.
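The contrast between the two data structures in Figure 1 can be sketched as follows. The times and precipitation values follow the storm process quoted in the abstract ({9:30-60 mm}-{12:00-80 mm}); the warning levels attached to each stage are illustrative.

```python
# Flat triples, as in the current knowledge graph: every value hangs
# directly off the storm entity, so stage membership is lost.
flat_triples = [
    ("7·21 Beijing storm", "precipitation", "60 mm"),
    ("7·21 Beijing storm", "precipitation", "80 mm"),
]  # cannot tell which precipitation value belongs to which stage

# Procedural, state-by-state structure: each attribute is anchored to
# a stage, so the evolution is recoverable.
stages = [
    {"time": "9:30",  "precipitation": "60 mm", "warning_level": "blue"},
    {"time": "12:00", "precipitation": "80 mm", "warning_level": "yellow"},
]

def precipitation_at(stages, time):
    """Look up the precipitation of the stage at a given time."""
    return next(s["precipitation"] for s in stages if s["time"] == time)

print(precipitation_at(stages, "12:00"))  # 80 mm
```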

The purpose of this paper is to improve the knowledge graph from declarative, discrete facts to procedural, aggregated knowledge. To this end, this paper presents a formalized model for geographic knowledge representation from a geography perspective, called GeoKG, and supplements the constructors of the ALC description language.

The remainder of this paper is organized as follows: Section 2 reviews the related works on geographic ontologies and geographic knowledge graphs. Section 3 describes the methodology by stating the basic ideas from the six core geographical questions and proposes a formalized model of geographic knowledge representation called GeoKG. Section 4 gives an evolution case study of the administrative divisions of Nanjing with the formalized GeoKG model. Section 5 constructs the administrative division case by using the GeoKG model and the YAGO model, sets a series of questions, and analyzes the results. Finally, Section 6 presents the conclusions.

## **2. Related Works**

There are two main representations of geographic knowledge for geospatial big data computing and reasoning: geographic ontologies and geographic knowledge graphs.

## *2.1. Geographic Ontology*

Geographic ontology originates from ontology, the most basic philosophical theory of the nature and characteristics of the real world [8]. In the 1960s, "ontology" was introduced into information science for categorization, representation, knowledge sharing, and reuse [9]. A geographic ontology is a domain concept that explicitly and formally defines the geographic concepts and their relations within geography through hierarchical relations [10–12]. These hierarchical relations between concepts are significant for geographic knowledge representation, information integration, knowledge interoperation, and information retrieval. Thus, geographic ontology is an important geographic knowledge representation method that is widely implemented in various geographical information applications [13]. However, computer simulations require not only the standard hierarchical concept logic but also massive amounts of instance information in geographic knowledge representation. Two aspects of typical geographic ontologies limit their representation of geographic knowledge.

First, geographic ontology focuses on the structure of a conceptual system, which is built from strict hyponymy information [9]. These relationships are well suited for categorization, disambiguation, identification, and inference, but not for describing the states and phenomena of changing geographic objects [6,14]. The descriptions of these changing states and phenomena require copious information, which is lacking in geographic ontology [15]. Although hyponymy is strictly defined by hierarchical tree structures in geographic ontology, this structure cannot directly represent the relationships between multiple concepts, which are important for representing evolutions and mechanisms in geography [11]. In addition, the relationships between vertices in a tree are not bi-directional, which limits the representation of the interactions between geographical objects. These problems arise because the tree structure limits the representation of geographic knowledge [16].

Second, the logic foundation of geographic ontology is the description logic (DL) of the attributive concept language with complements (ALC) [17]. DL is an object-based formal knowledge representation language. It contains four components: a constructor set that represents concepts and roles (e.g., a river is a concept; disjunction is a role), assertions about concept terminology (terminology box, Tbox, e.g., each river has its own length), assertions about individual items (assertion box, Abox, e.g., the length of the Changjiang River is 6300 km), and the reasoning mechanism over the Tbox and Abox. DL can construct complicated concepts and roles from simple concepts and roles by means of constructors. According to the different constructors, DL can be classified as ALC, ALCN, S, SH, SHIQ, etc. ALC is the basic DL, containing intersections (⊓), unions (⊔), complements (¬), universal restrictions (∀), and existential restrictions (∃). ALCN consists of the basic ALC operators and number restrictions (Q; ≥ n and ≤ n); ALC+R+, abbreviated S, consists of the basic descriptions and enhanced relationship operators (R+; role or concept transitions); the SH language adds concept inclusions and role inclusions (⊑); and SHIQ includes inverse roles (I) and role transitions (R+). At present, the description logic SHIQ has been certified to represent changes in the field of logical theory [18]. Note that "change" is an absolutely essential element of geographic knowledge representation, which means that the ALC constructors cannot represent all logical relations of geographic knowledge, especially quantity expressions and state changes. For example, number restriction constructors are required to represent the geographic knowledge "the Yangzi River has at least three branches" (∃ hasBranch; Yangzi River ≥ 3), and transition constructors are required to represent the geographic knowledge "Beiping was renamed Beijing on 27 September 1949" (Beiping ≡ Trans(Beijing)). Meanwhile, many studies have theoretically demonstrated and proved the decidability, soundness, and completeness of the operators of a series of DLs (from ALC, S, SI, and SHI to SHIQ, etc.) on the Tableau algorithm [19–22]; the complexity of SI (role or concept transitions) is PSPACE-complete, and the following SHI and SHIQ are EXPTIME-complete [22,23].
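The ALC constructors listed above have a straightforward set-theoretic reading, sketched below: concepts are interpreted as sets of individuals and roles as sets of pairs. The geographic names and the `flows_into` role are illustrative, and the helpers are toy versions of the ∃R.C and ∀R.C semantics, not a DL reasoner.

```python
# Interpretation: concepts are sets of individuals, roles are sets of
# pairs (all names illustrative).
DOMAIN = {"r1", "r2", "o1"}
River = {"r1", "r2"}
Ocean = {"o1"}
flows_into = {("r1", "o1")}  # a role R ⊆ River × Ocean

union = River | Ocean         # River ⊔ Ocean
intersection = River & Ocean  # River ⊓ Ocean (empty here)
complement = DOMAIN - River   # ¬River

def exists_R(role, concept):
    """∃R.C: individuals with at least one R-successor in C."""
    return {x for (x, y) in role if y in concept}

def forall_R(role, concept, domain=DOMAIN):
    """∀R.C: individuals all of whose R-successors are in C
    (vacuously satisfied by individuals with no successors)."""
    return {x for x in domain
            if all(y in concept for (x2, y) in role if x2 == x)}

print(exists_R(flows_into, Ocean))  # {'r1'}
print(complement)                   # {'o1'}
```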

## *2.2. Geographic Knowledge Graph*

A knowledge graph is a graph-formed knowledge representation model with strict logic, different concepts, various relations, and massive numbers of instances [10]. It was first presented by Google in 2012, containing over 5.7 billion entities and 0.18 billion facts [24]. With this wealth of information, the real world can be explicitly described. Graph-based storage has properties of connection, direction, and multi-vertices that are suitable for representing the interactions between concepts. Thus, knowledge graphs are promising models for representing knowledge and have been widely built, e.g., YAGO [25], Freebase [26], Probase [27], and DBpedia [28]. A geographic knowledge graph is a domain knowledge graph that is still in the exploratory stage.

At present, most geographic knowledge graphs are organized as universal knowledge graphs, e.g., CSGKB [4], NCGKB [29], and CrowdGeoKG [15]. The common sense geographic knowledge base (CSGKB) uses a data structure that links the concepts of geographic features, geographic locations, spatial relationships, and administrators for geographic information retrieval (GIR) instead of traditional gazetteers. Moreover, the naive Chinese geographic knowledge base (NCGKB) constructs a GIR-oriented geographic knowledge base from Chinese Wikipedia based on given concept relations and their instances. CrowdGeoKG is a crowdsourced geographic knowledge graph that extracts different types of geo-entities from OpenStreetMap and enriches them with human geography information from Wikidata. All of the concepts of these geographic knowledge graphs are developed based on geographic ontologies that follow the ALC description language, resulting in the same problems as geographic ontology.

More importantly, these three bases organize geographic knowledge as a set of concepts, relations, and facts, which are associated by two types of statements: {entity, relation, entity} and {entity, attribute, attribute value} [4]. In fact, there are only three basic elements in a knowledge graph: the entity, the relation, and the attribute. These three elements can explicitly represent general information, such as "when did the 7·21 Beijing storm occur—9:30, 21 July". However, geographic knowledge is more complicated than general knowledge. More processes and evolutions need to be answered, e.g., "what caused the 7·21 Beijing storm", "how did it develop", and "what were the effects of the 7·21 Beijing storm". Entities, relations, and attributes cannot easily and directly answer these mechanism questions.

Scholars have indicated that more elements are required. PLUTO supplemented the element of time with "before" and "after" in the knowledge graph model to describe the change trajectories of geographic objects [30]. Geological knowledge graphs have been applied with the evolution element for stating changes between different geological objects [31]. YAGO also explored anchoring spatial and temporal dimensions to the knowledge base, called YAGO2 [7]. YAGO2 uses time points and time intervals in a standard format to describe temporal information and associates geographical coordinates with entities to complete their spatial information. In fact, the spatial and temporal knowledge stored in the YAGO system is just regarded as a set of general attributes, added through predicates like "wasBornOnDate", "occursSince", "hasGeoCoordinates", etc., and such declarative, discrete information cannot directly answer the preceding questions about processes, evolutions, and mechanisms. Additionally, ten core concepts of geographic information science were proposed for transdisciplinary research: location, neighbourhood, field, object, network, event, granularity, accuracy, meaning, and value [23]. These concepts can cover every corner of geoscience, but they are extremely difficult to relate within one conceptualized model. More recently, six factors (geographic semantics, location, shape, evolutionary process, relationship between elements, and attribute) were proposed to describe information from a geographic element, object, or phenomenon [22]. Though these factors were designed for the information representation of geographic objects, they can also provide guidance for geographic knowledge representation. All the above studies indicate that geographic knowledge can be represented more effectively by supplementing elements, but this also raises a foundational question: "how to organize geographic knowledge scientifically and cognitively?" Therefore, a conceptualized model of the geographic knowledge graph from the geography perspective warrants further study.

## **3. Methodology**

## *3.1. Basic Idea*

## 3.1.1. Guiding Ideology

To address the aforementioned issues, the core question of the GeoKG model is to define the types of geographic knowledge that need to be stored. Geography (from the Greek γεωγραφία, geographia, literally "earth description") is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of Earth [32]. As a type of human understanding of the geographic environment, geographic knowledge should answer questions about geography. These questions have been separated into six core questions by the International Geographical Union (IGU) as part of the international charter on geographical education [2]. Therefore, GeoKG defines the basic elements and the conceptualized model based on the six core questions in geography. Each question corresponds to one core issue:


## 3.1.2. Main Elements

These aforementioned questions describe six core aspects of geographic knowledge that should be represented by GeoKG. Each aspect requires certain elements to describe it, and we seek the basic elements shared among all aspects:


There are seven types of elements in the descriptions of these six aspects: *object, location, time, attribute, relation, state*, and *change*. Three typical characteristics of these seven elements in describing the six aspects are as follows:


• **Stepped representation**. Note that the six aspects from the core geographical questions are not equal. Space and state focus on the static conditions of objects. Evolution and change pay more attention to the dynamic conditions of objects. Moreover, interaction and usage rely on relationships and mechanisms between geographic objects. Thus, the basic elements cannot be treated as equals.

According to the three typical characteristics of the basic elements, we find that a geographic object is a type of medium used to represent geographic knowledge. There are six basic elements used to describe geographic knowledge (see Figure 2). *Location*, *time*, *attribute*, *state*, *change*, and *relation* together represent geographic objects from different aspects. Note that these basic elements are not equivalent. *Location*, *time*, and *attribute* belong to the first level and represent a single static state of a geographic object. *State*, *change*, and *relation* describe the dynamic evolutions of, and relations between, geographic objects.


**Figure 2.** The six basic elements to represent a geographic object.

## 3.1.3. GeoKG Model

A conceptualized model of GeoKG, based on the ideas mentioned above, is shown in Figure 3. The six core elements together represent geographic objects and their information. In this model, a geographic object consists of a series of states. Any state of a geographic object is represented by attributes under a specific spatial-temporal condition. Any two continuous states, or different states between two geographic objects, can result in a change element. The change element can be categorized into time changes, location changes, or attribute changes. If the essential attribute is changed, the geographic object becomes another geographic object. The relation element exists between any time, location, and attribute of different states, regardless of whether they belong to the same geographic object.

**Figure 3.** A conceptualized model of GeoKG based on the six basic elements.
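A minimal sketch of this conceptualized model follows: a geographic object is a series of states, and change elements are derived from two continuous states. The class and field names are illustrative, and the Nanjing attribute values are invented purely to exercise the sketch, not taken from the case study.

```python
from dataclasses import dataclass

@dataclass
class State:
    """One state of a geographic object: attributes under a specific
    spatial-temporal condition (illustrative field names)."""
    time: str
    location: str
    attributes: dict

def changes_between(prev: State, curr: State):
    """Derive change elements (time/location/attribute changes) from
    two continuous states, as the GeoKG model describes."""
    changes = [("time_change", prev.time, curr.time)]
    if prev.location != curr.location:
        changes.append(("location_change", prev.location, curr.location))
    for k in prev.attributes:
        if curr.attributes.get(k) != prev.attributes[k]:
            changes.append(("attribute_change", k,
                            prev.attributes[k], curr.attributes.get(k)))
    return changes

# Two invented states of Nanjing, purely for illustration.
s1 = State("1949", "Nanjing", {"role": "capital of the Republic of China"})
s2 = State("1950", "Nanjing", {"role": "provincial capital"})
print(changes_between(s1, s2))
```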

## *3.2. Model Formalization*

To organize geographic knowledge in accordance with the basic ideas above, GeoKG must be based on a thorough, formalized model. This section provides the model semantics of GeoKG using description logic (DL), which is not limited to the attributive concept language with complements (ALC) level. Using description logic, a user can create a conceptual description for the representation and computation of geographic knowledge that is clear and formal.

## 3.2.1. DL and Construction Operators

DL comprises three basic components: concepts, individuals (instances), and roles. Concepts describe the common features of sets of individuals, e.g., all land masses that project well above their surroundings form the concept of "mountain". Individuals are the instances of concepts, e.g., a geographic entity such as the "Rocky Mountains". Roles can be explained as binary relations between individuals, i.e., properties, e.g., a spatial relation (conjunction, disjunction). A description logic system contains four parts: a constructor set, which represents concepts and roles; assertions about concept terminology (terminology box, Tbox); assertions about individuals (assertion box, Abox); and the reasoning mechanism over the Tbox and Abox. The Tbox is a set containing the definitions of the relationships of concepts and the axioms of relationships, which explain the concepts and roles. The Abox includes axiom sets describing specific situations, which contain the instance information of the Tbox. The Abox has two forms. One is the concept assertion, which expresses whether an object belongs to a concept. The other is the relation assertion, which expresses whether two objects satisfy a certain relation. Description logic can represent complicated concepts and relations from atomic concepts and atomic relations based on the given constructors. The basic constructors are and (⊓), or (⊔), not (¬), the existential quantifier (∃), and the universal quantifier (∀), which are included in ALC DL. More operators can represent more logic, forming different types of DL.

Let C and D be concepts; a, b, and c be individuals; R be a role between individuals; S be a simple role; and n be a nonnegative integer. As usual, an interpretation I = (Δ^I, ·^I) consists of a non-empty set Δ^I, called the domain of I, and a valuation ·^I, which associates with each role R a binary relation R^I ⊆ Δ^I × Δ^I. For comprehensive background reading, please refer to the referenced paper [20]. The primary operators that differ among the DLs are shown in Table 1.

Diagrams are supplied to illustrate the graphic meanings of the operators related to geographic objects and their relationships. A top concept (⊤) indicates all concepts or objects, e.g., ⊤River means all the rivers. A bottom concept (⊥) indicates no concepts or objects in the set, e.g., ⊥River means there are no rivers in the set. An atomic concept indicates a minimal concept, e.g., Ac could be a river, ocean, city, or country. An atomic role indicates the relationship between two atomic concepts, e.g., R ⊆ river × ocean means that there exists a relationship between the river and the ocean. A conjunction indicates two individuals that are joint or connected, e.g., Yangzi River ⊓ Nanjing indicates the joint part of the Yangzi River and Nanjing. A disjunction indicates the logical disjunction of two individuals, e.g., Yangzi River ⊔ Nanjing means the combined set of the Yangzi River and Nanjing. A negation indicates the set of all individuals not in the target individual, e.g., ¬Yangzi River means all individuals except the Yangzi River. An existential restriction indicates the existence of an individual or a role, e.g., ∃Yangzi River means there exists a Yangzi River, and ∃R ⊆ Yangzi River × Zhong Mountains means there exists a role between the Yangzi River and the Zhong Mountains. A value restriction indicates all individuals or roles, e.g., ∀River means all rivers, and ∀R ⊆ Yangzi River × Zhong Mountains means all roles between the Yangzi River and the Zhong Mountains. A concept inclusion indicates a concept belonging to another concept, e.g., rain ⊑ precipitation means rain is a kind of precipitation. A role inclusion indicates a role belonging to a role set, e.g., R_location(Yangzi River, Zhong Mountains) ⊑ R(Yangzi River × Zhong Mountains) indicates that the location relation between the Yangzi River and the Zhong Mountains is one of the roles in the entire role set between them. An inverse role indicates that a role has reversibility. A trans role indicates that a role has transitivity. A qualifying at-least/at-most restriction indicates that there exist at least or at most n individuals, e.g., (∃ ≥ 3 branches) ⊆ Yangzi River means the Yangzi River has at least three branches.
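The qualifying at-least restriction can likewise be read set-theoretically, as in this sketch: an individual satisfies ∃ ≥ n R when it has at least n R-successors. The branch names are illustrative, and `at_least` is a hypothetical helper, not part of any DL library.

```python
# A hasBranch role as a set of (river, branch) pairs; the branch names
# are illustrative.
role_has_branch = {
    ("Yangzi River", "Han River"),
    ("Yangzi River", "Jialing River"),
    ("Yangzi River", "Min River"),
}

def at_least(n, role, individual):
    """True iff the individual has at least n successors under the role,
    i.e., it satisfies the qualifying restriction ∃ ≥ n R."""
    return sum(1 for (x, _) in role if x == individual) >= n

print(at_least(3, role_has_branch, "Yangzi River"))  # True
print(at_least(4, role_has_branch, "Yangzi River"))  # False
```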
