where # means the number of elements. Note that the decidability, soundness, and completeness of all these operators have been demonstrated [6,23].

**Table 1.** Syntax and semantics of the main construction operators of the description logic.

*ISPRS Int. J. Geo-Inf.* **2019**, *8*, 184

## 3.2.2. Formalization Representation

In this section, the semantics of the GeoKG model are defined. First, we prescribe the set of geographic knowledge *GK* sourced from the natural and human phenomena of the entire world *W*. GeoKG is the set of all *GK*, defined as follows:

$$\text{GeoKG} = \{\, GK \mid GK \in W \,\}$$

*GK* is a tuple that consists of a geographic object *O* and its basic elements *E*:

$$GK = \{\, \langle O, E \rangle \mid \exists O \neq \varnothing,\ \exists E \neq \varnothing \,\}$$

The basic element set *E* contains six different elements: *location L, time T, attribute A, state St, change Ch* and *relation Re*. Thus, *E* is a six-tuple:

$$E = \{\, \langle L, T, A, St, Ch, Re \rangle \mid \exists\, L \,\|\, T \,\|\, A \,\|\, St \,\|\, Ch \,\|\, Re \neq \varnothing \,\}$$
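The two tuples above can be read as plain data structures. The following is an illustrative sketch, not code from the paper; all class and field names are our assumptions:

```python
from dataclasses import dataclass, field

# Illustrative sketch of GK = <O, E> and the six-tuple E; names are ours,
# not from the paper's implementation.
@dataclass
class Elements:
    location: list = field(default_factory=list)   # L
    time: list = field(default_factory=list)       # T
    attribute: list = field(default_factory=list)  # A
    state: list = field(default_factory=list)      # St
    change: list = field(default_factory=list)     # Ch
    relation: list = field(default_factory=list)   # Re

    def is_valid(self) -> bool:
        # E is valid when at least one element set is non-empty.
        return any([self.location, self.time, self.attribute,
                    self.state, self.change, self.relation])

@dataclass
class GK:
    obj: str            # geographic object O, must be non-empty
    elements: Elements  # basic elements E

    def is_valid(self) -> bool:
        return bool(self.obj) and self.elements.is_valid()
```

The validity checks mirror the non-emptiness constraints in the two set definitions above.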

Each element is identified as follows:

(1) Time

Time describes the temporal information of the state of a geographic object. Let *Sti* indicate a specific state of geographic object *Oi*; the basic element *time T* can be defined as follows:

$$T = \{\, \exists T \in St_i \mid \forall O_i \neq \varnothing,\ St_i \in O_i \,\}$$

Time should be described by both the basic types and time reference information. The basic types are point time, interval time, and reference time. Point time *Tpoi* records the moment of a state of a geographic object. Interval time *Tint* indicates the time interval between two point times. Reference time *Tref* takes its time from another element of a geographic object; e.g., "2018 World Cup" is an event with a unique time period that can reference a specific time accurately. Time reference knowledge *tref* indicates the additional knowledge attached to time descriptions. Let *tw* indicate a time word. A time word indicates a point time and may contain several descriptive parts, e.g., 12-July-2018, ten past nine, and tomorrow morning. The point time *Tpoi*, the interval time *Tint*, and the reference time *Tref* are defined as follows:

$$T_{poi} = \{\, \langle tw, t_{ref} \rangle \mid \exists!\, tw \in T \,\}$$

$$T_{int} = \{\, \langle tw, t_{ref} \rangle \mid \forall tw \in T,\ \#tw \geq 2,\ \forall R \subseteq tw \,\}$$

$$T_{ref} = \{\, \langle E, t_{ref} \rangle \mid \forall E \,\&\, \forall T \subseteq St_i \,\}$$

where *R* is the interval relation between two time words. Time reference knowledge *tref* is a set of reference knowledge consisting of commonality, relativity, fuzziness, continuity, and periodicity, namely, *tref* = {*com*, *rel*, *fuz*, *con*, *per*}. Each kind has simple examples. Commonality distinguishes common from domain time: "12-July-2018" is a common time, while "the Late Jurassic" is a domain time description. Relativity indicates whether a time is relative, e.g., "two days ago" is a relative time that refers to the absolute time "today". Fuzziness distinguishes accuracy: "9 o'clock" is an accurate time, while "around 9" is a fuzzy time. Continuity distinguishes instants from spans: "12-July" is an instant time, while "until 12 July" is a continuous time. Periodicity is easily understood, e.g., "every weekend", "every month", and "annually".
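As a minimal sketch (class names, field names, and defaults are our assumptions, not the paper's), the three time types and the *tref* flags could be modeled as:

```python
from dataclasses import dataclass, field

# Illustrative sketch (not the paper's code) of the time types Tpoi, Tint, Tref
# and the reference knowledge tref = {com, rel, fuz, con, per}.
@dataclass
class TimeRef:
    common: bool = True       # com: "12-July-2018" vs the domain time "Late Jurassic"
    relative: bool = False    # rel: "two days ago" refers to "today"
    fuzzy: bool = False       # fuz: "around 9" vs the accurate "9 o'clock"
    continuous: bool = False  # con: "until 12 July" vs the instant "12-July"
    periodic: bool = False    # per: "every weekend"

@dataclass
class PointTime:              # Tpoi: exactly one time word tw
    tw: str
    tref: TimeRef = field(default_factory=TimeRef)

@dataclass
class IntervalTime:           # Tint: at least two time words plus an interval relation R
    tws: tuple
    relation: str = "before"  # R, the interval relation between the time words
    tref: TimeRef = field(default_factory=TimeRef)

    def __post_init__(self):
        if len(self.tws) < 2:
            raise ValueError("interval time needs at least two time words")

@dataclass
class ReferenceTime:          # Tref: time taken from another element, e.g. an event
    referenced_element: str   # "2018 World Cup"
    tref: TimeRef = field(default_factory=TimeRef)
```

The `__post_init__` check enforces the #tw ≥ 2 condition of the interval-time definition.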

(2) Location

Location describes the spatial information of the state of a geographic object. Let *Sti* indicate a specific state of geographic object *Oi*; the basic element *location L* can be identified as follows:

$$L = \{\, \exists L \in St_i \mid \forall O_i \neq \varnothing,\ St_i \in O_i \,\}$$

According to the complexity of location descriptions, a location can be divided into basic types and location reference information. The basic types include toponym, address, coordinates, and reference location. Toponym *Ltop* describes a location with a common name. Address *Ladd* indicates a location with orderly numbers and streets named by administrators. Coordinates *Lcoo* record a location with a series of mathematically organized numbers. Reference location *Lref* takes its location from another element of a geographic object. Location reference knowledge *lref* indicates the additional knowledge attached to location descriptions. Let *tp*, *ad*, and *co* indicate a toponym, an address, and a coordinate, respectively. Toponym *Ltop*, address *Ladd*, coordinates *Lcoo*, and reference location *Lref* are identified as follows:

$$L_{top} = \{\, \langle tp, l_{ref} \rangle \mid \forall tp \in L \,\}$$

$$L_{add} = \{\, \langle ad, l_{ref} \rangle \mid \forall ad \in L \,\}$$

$$L_{coo} = \{\, \langle co, l_{ref} \rangle \mid \forall co \in L \,\}$$

$$L_{ref} = \{\, \langle E, l_{ref} \rangle \mid \forall E \,\&\, \forall L \subseteq St_i \,\}$$

Location reference knowledge *lref* is a set of reference knowledge consisting of space type, spatial reference, commonality, relativity, and fuzziness, namely, *lref* = {*typ*, *ref*, *com*, *rel*, *fuz*}. Space type describes the type of space, such as reality, a virtual world, or a specific domain. For example, Pandora is a toponym of the virtual world of the movie Avatar. Spatial reference records the reference system of a location description, e.g., WGS84 or the Mercator projection. Commonality stores whether a location description is common or domain-specific, e.g., Beijing is a common toponym that could be encoded as "-.-..--...-.---/-..---.-.-.--.." in a Morse code system. Relativity indicates whether a location is relative, e.g., "20 km south of Beijing" is a relative location description that refers to the absolute location "Beijing". Fuzziness states whether a location description is accurate, e.g., "near Times Square" is a fuzzy location description.
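Parallel to the time element, the location types and the *lref* flags can be sketched as follows; all names and defaults here are illustrative assumptions:

```python
from dataclasses import dataclass, field

# Illustrative sketch of Ltop, Ladd, Lcoo and lref = {typ, ref, com, rel, fuz};
# all names and defaults are our assumptions, not the paper's code.
@dataclass
class LocRef:
    space_type: str = "reality"  # typ: reality, virtual ("Pandora"), or a domain
    reference: str = "WGS84"     # ref: spatial reference system, e.g. WGS84
    common: bool = True          # com: common vs domain-specific description
    relative: bool = False       # rel: "20 km south of Beijing"
    fuzzy: bool = False          # fuz: "near Times Square"

@dataclass
class Toponym:      # Ltop: a location named by a common name
    tp: str
    lref: LocRef = field(default_factory=LocRef)

@dataclass
class Address:      # Ladd: orderly numbers and streets named by administrators
    ad: str
    lref: LocRef = field(default_factory=LocRef)

@dataclass
class Coordinate:   # Lcoo: mathematically organized numbers
    co: tuple
    lref: LocRef = field(default_factory=LocRef)
```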

(3) Attribute

An attribute describes the feature information of the state of a geographic object. Let *Sti* indicate a specific state of geographic object *Oi*; the basic element *attribute A* can be identified as follows:

$$A = \{\, \exists A \in St_i \mid \forall O_i \neq \varnothing,\ St_i \in O_i \,\}$$

All the feature descriptions of a geographic object belong to attributes, e.g., shape, color, speed, etc. To organize the attributes of a geographic object, the key is identifying what counts as an attribute. An attribute is a single feature description of one geographic object. For example, "a typhoon is a mature tropical cyclone that develops between 180° and 100°E in the Northern Hemisphere, with peak months from August to October" describes three attributes: the typical attribute "mature tropical cyclone", the location attribute "develops between 180° and 100°E in the Northern Hemisphere", and the frequency attribute "peak months from August to October". Note that attributes can be divided into two types: essential attributes *Aes* and non-essential attributes *Ane*:

$$A_{es} = \{\, \exists A_{es} \in St_i \mid \forall O_i \neq \varnothing,\ St_i \in O_i,\ \#A_{es} \geq 1 \,\}$$

$$A_{ne} = \{\, A_{ne} \in A' \mid \forall O_i \neq \varnothing,\ St_i \in O_i,\ A' = A \setminus A_{es} \,\}$$

An essential attribute is a mark attribute that distinguishes a geographic object from others. When an essential attribute changes, the geographic object could change into another object. For example, when a mature tropical cyclone develops in the Atlantic Ocean, it cannot be a typhoon. A non-essential attribute is any other feature description of a geographic object, e.g., the frequency attribute of a typhoon, "peak months from August to October". Such attributes cannot determine the nature of a geographic object.

(4) State

The state illustrates the different stages of a geographic object. The three basic elements above work together to express the state. Thus, the element *state St* can be identified as follows:

$$St = \{\, \exists St_i \in O \mid \exists! L \sqsubseteq St_i,\ \exists! T \sqsubseteq St_i,\ \exists A \sqsubseteq St_i,\ \#A \geq 0 \,\}$$

where ∃! means unique existence. The formulation means that a state is a part of a geographic object. As the element *state St* is represented by sets of attributes of a geographic object under a particular spatial-temporal dimension, it must depend on the element *location L* and the element *time T*. Note that the element *location L* and the element *time T* exist uniquely, because time and space are the two dimensions that position a stage in Euclidean space. For example, the state of a typhoon includes all features for a specific spatial-temporal reference frame, e.g., "Typhoon Maria, 23:00/10-July-2018, E123.40°/N25.60°, central pressure 945 hpa, max speed 30 km/h". A state cannot be defined without temporal and spatial information. By contrast, the element *state St* does not depend on the element *attribute A*. Attributes are descriptive records that cannot affect whether the state exists. For example, "Typhoon Maria, 23:00/10-July-2018, E123.40°/N25.60°" also defines a state of Typhoon Maria. Thus, the attribute element is defined differently from the location and time elements.
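The dependency argument above, that a state needs exactly one time and one location but zero or more attributes, can be sketched as follows (illustrative names, reusing the Typhoon Maria example from the text):

```python
from dataclasses import dataclass, field

# Sketch of the dependency argument: a state requires exactly one time and one
# location, while attributes (#A >= 0) are optional. Names are illustrative.
@dataclass
class State:
    time: str        # unique T, e.g. "23:00/10-July-2018"
    location: str    # unique L, e.g. "E123.40/N25.60"
    attributes: dict = field(default_factory=dict)  # A, may be empty

    def __post_init__(self):
        if not self.time or not self.location:
            raise ValueError("a state cannot be defined without time and location")

# The Typhoon Maria example from the text: both of these are valid states.
maria_min = State("23:00/10-July-2018", "E123.40/N25.60")
maria_full = State("23:00/10-July-2018", "E123.40/N25.60",
                   {"central pressure": "945 hpa", "max speed": "30 km/h"})
```

Constructing a state without a time or location raises an error, while the attribute-free state is accepted, which matches the asymmetry argued above.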

(5) Change

A change describes the transition of a geographic object from one state to another. Thus, change *Ch* must contain at least one difference between two states, which can be a location change, a time change, or an attribute change. A change contains four main components:

$$Ch = \{\, \langle St, act, CE, type \rangle \in O \mid \exists St,\ \#St = 2,\ CE \in \langle T, L, A \rangle,\ type \in (Ch_d, Ch_e) \,\}$$

where *St* indicates the states (including two different ones), *act* indicates the action of the change, *CE* indicates the changed elements, and *type* indicates the type of the change. Note that there are two types of changes: a developing change and an evolving change. A developing change shows the changes within one geographic object, and an evolving change describes the changes between two different geographic objects. Let *Chd* indicate a developing change and *Che* an evolving change; the formalized definitions are as follows:

$$Ch_d = \{\, \exists Ch_d = St_i \times St_{i+1} \mid \exists St_i \,\&\, St_{i+1} \in O_m,\ St_i \neq St_{i+1} \,\}$$

$$Ch_e = \{\, \exists Ch_e = St_{end} \times St_i \mid \exists St_{end} \in O_m,\ \exists St_i \in O_n,\ \exists! St_{end}.A_{es} \neq St_i.A_{es} \,\}$$

where *O*, *Om*, and *On* are geographic objects, *Sti* and *Sti+1* indicate consecutive states of a geographic object, *Stend* indicates the last state of a geographic object, and *Aes* indicates the essential attribute of a geographic object.
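The *Chd*/*Che* distinction can be checked mechanically: an evolving change is marked by a difference in essential attributes, otherwise the change is developing. A hypothetical sketch using plain dictionaries for states (the field names are our assumption):

```python
# Sketch of the Chd / Che distinction: an evolving change is marked by a
# difference in essential attributes (Aes), otherwise the change is developing.
# States are plain dicts here; the field names are our assumption.
def classify_change(state_a: dict, state_b: dict) -> str:
    if state_a["essential"] != state_b["essential"]:
        return "evolving"    # Che: the object turns into a different object
    return "developing"      # Chd: the same object at a later stage

# Example: a typhoon stays a typhoon between two states (developing), but a
# change of its essential attribute makes it a different object (evolving).
maria_1 = {"essential": {"type": "typhoon"}, "time": "10-July-2018"}
maria_2 = {"essential": {"type": "typhoon"}, "time": "11-July-2018"}
cyclone = {"essential": {"type": "tropical cyclone (Atlantic)"}, "time": "13-July-2018"}
```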

(6) Relation

A relation expresses the differences between the elements of geographic objects, which include three typical types: location relation, time relation, and attribute relation. These three types describe spatial differences, temporal differences, and feature differences, respectively. A relation contains three main components: the elements of two states *E*, the semantics of the relation *Sem*, and the type of the relation *type*:

$$Re = \{\, \langle E, Sem, type \rangle \in O \mid \exists E \,\&\, \#E \geq 2,\ type \in (Re_l, Re_t, Re_a) \,\}$$

Let *Rel*, *Ret*, and *Rea* indicate the location relation, time relation, and attribute relation, respectively; *Li* and *Lj* indicate the locations of different states; *Ti* and *Tj* indicate the times of different states; and *Ai* and *Aj* indicate the attributes of different states. The different types of relations are identified as follows:

$$Re_l = \{\, \exists Re_l = L_i \times L_j \mid \exists St_i \,\&\, St_j,\ St_i \neq St_j \,\}$$

$$Re_t = \{\, \exists Re_t = T_i \times T_j \mid \exists St_i \,\&\, St_j,\ St_i \neq St_j \,\}$$

$$Re_a = \{\, \exists Re_a = A_i \times A_j \mid \exists St_i \,\&\, St_j,\ St_i \neq St_j \,\}$$

A location relation describes the spatial relationships between different states, e.g., the location relations between the different states of a typhoon or the location relations between two different city centres under development. A time relation illustrates the temporal relationships between different states, i.e., the time span between two states, e.g., the time span of river diversion. An attribute relation describes the feature relationships between different states, i.e., the differences between two states of a typhoon, e.g., the max wind speed, central pressure, etc.
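The three relation types can be sketched as a simple derivation over two states. This is an illustrative reading, with dictionary keys of our choosing:

```python
# Sketch deriving the three relation types between two states; the dictionary
# keys "L", "T", "A" are our shorthand for the location, time, and attribute
# elements of a state.
def relations_between(st_i: dict, st_j: dict) -> list:
    rels = []
    if st_i["L"] != st_j["L"]:
        rels.append(("Re_l", st_i["L"], st_j["L"]))      # location relation
    if st_i["T"] != st_j["T"]:
        rels.append(("Re_t", st_i["T"], st_j["T"]))      # time relation (time span)
    for key in st_i["A"].keys() & st_j["A"].keys():
        if st_i["A"][key] != st_j["A"][key]:
            rels.append(("Re_a", key, st_i["A"][key], st_j["A"][key]))  # attribute relation
    return rels
```

Applied to two typhoon states, this yields one relation per differing element, matching the max-speed and central-pressure examples above.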

## **4. Case Study**

In this section, a full example is shown to illustrate the geographic knowledge representation using the GeoKG model. To describe the geographic knowledge representation clearly, an evolution case of administrative divisions of Nanjing was selected. The given example includes the basic geographic objects (e.g., Yangzi River, Zhongshan Mountain), the changing area of Nanjing, and several affiliated districts in different eras.

## *4.1. Research Area*

Nanjing, formerly romanized as Nanking and Nankin, is the capital of Jiangsu Province of the People's Republic of China and the second largest city in the East China region, with an administrative area of over 6000 km². The inner area of Nanjing enclosed by the city wall is the Nanjing Centre District, with an area of 55 km², while the Nanjing Metropolitan Region includes the surrounding cities and areas. Three representative stages were chosen to represent the evolution of Nanjing: 1368, 1949, and 2018. The sketch maps are shown in Figure 4.

**Figure 4.** The sketch maps of administrative divisions evolution of Nanjing in 1368, 1949, and 2018.

The first stage is the Ming dynasty, when the city was first given the name "Nanjing". The first emperor of the Ming dynasty, Zhu Yuanzhang, who overthrew the Yuan dynasty, renamed the city Nanjing, rebuilt it, and made it the dynastic capital in 1368. He constructed a 48 km long city wall around Nanjing. That wall encloses the centre district of Nanjing, which is situated to the south of the Yangzi River and to the west of the Zhongshan Mountain.

The second stage is the founding of the People's Republic of China. The government set Nanjing as a provincial-level unit directly controlled by the central government. At that stage, Nanjing administrated the centre district and several affiliated districts. The centre district included districts 1–10, and the affiliated districts comprised Jiangning, Jurong, Dangtu, Hexian, Pukou, and Luhe. By 1949, Nanjing had expanded across the Yangzi River and the Zhongshan Mountain.

The third stage is 2018, which refers to the current administrative boundaries of Nanjing. After a series of administrative division adjustments, Gaochun and Lishui were added to Nanjing, while Jurong, Dangtu, and Hexian were removed from its boundaries.

During more than 600 years of development, numerous elements of Nanjing changed, including its boundaries, its affiliated districts, and the relations between Nanjing and other geographic objects (e.g., the Yangzi River and the Zhongshan Mountain). Different relations arose at different stages among these geographic objects. Thus, the GeoKG model was used to represent this changing geographic knowledge. The formalization is introduced in the next section.

## *4.2. Formalization*

In this example, the administrative division evolution was organized using the GeoKG model. A geographic object is the key to representing geographic knowledge. First, this case identifies six relevant geographic objects: Nanjing *Onj*, Yangzi River *Oyz*, Zhongshan Mountain *Ozm*, Centre District *Ocd*, Jiangning *Ojn*, and Gaochun *Ogc*. Jiangning and Gaochun are representative affiliated districts selected for this case: Jiangning was part of Nanjing in both 1949 and 2018, while Gaochun underwent an administrative division adjustment. Each geographic object consists of a series of states, changes, and relations. For example, Nanjing *Onj* contains three states *Snj* = {*Snj*1, *Snj*2, *Snj*3}, six changes *Cnj* = {*Cnj*11, *Cnj*12, *Cnj*13, *Cnj*21, *Cnj*22, *Cnj*23}, and 12 relations *Rnj* = {*Rnj*11, *Rnj*12, *Rnj*13, *Rnj*21, *Rnj*22, *Rnj*23, *Rnj*24, *Rnj*31, *Rnj*32, *Rnj*33, *Rnj*34, *Rnj*35}. Thus, Nanjing *Onj* can be defined as follows, and the corresponding diagram is shown in Figure 5.

$$O_{nj} = \left\{ \begin{array}{l} \langle S_{nj}, C_{nj}, R_{nj} \rangle \subseteq O_{nj} \\ S_{nj} = \{S_{nj1}, S_{nj2}, S_{nj3}\},\ S_{nj}.number \leq 3,\ S_{nj}.number \geq 3, \\ C_{nj} = \{C_{nj11}, C_{nj12}, C_{nj13}, C_{nj21}, C_{nj22}, C_{nj23}\},\ C_{nj}.number \leq 6,\ C_{nj}.number \geq 6, \\ R_{nj} = \{R_{nj11}, R_{nj12}, R_{nj13}, R_{nj21}, R_{nj22}, R_{nj23}, R_{nj24}, R_{nj31}, R_{nj32}, R_{nj33}, R_{nj34}, R_{nj35}\},\ R_{nj}.number \leq 12,\ R_{nj}.number \geq 12 \end{array} \right\}$$

**Figure 5.** The diagram of different elements of Nanjing by using the GeoKG model.

The different states of Nanjing, *Snj*1, *Snj*2, and *Snj*3, indicate the three stages of 1368, 1949, and 2018. Each state contains its own time, location, and attribute elements. For example, the state *Snj*1 of Nanjing contains the time element *Tnj* of "1368", the location element *Lnj* of "location descriptions in 1368", and the attribute element *Anj* of "administrative region". The state *Snj*1 of Nanjing can be defined as follows:

$$S_{nj1} = \left\{ \begin{array}{c} T_{nj} \subseteq S_{nj1},\ L_{nj} \subseteq S_{nj1},\ A_{nj} \subseteq S_{nj1} \\ T_{nj} = \{ T_{nj1} \mid T_{nj}.number \leq 1,\ T_{nj}.number \geq 1 \}, \\ L_{nj} = \{ L_{nj1} \mid L_{nj}.number \leq 1,\ L_{nj}.number \geq 1 \}, \\ A_{nj} = \{ A_{nj1} \mid A_{nj}.number \leq 1,\ A_{nj}.number \geq 1 \} \end{array} \right\}$$

Different states can be linked by changes indicating the different kinds of transition from one state to another. For example, there are three main changes *Cnj*11, *Cnj*12, *Cnj*13 from the state *Snj*1 of Nanjing in 1368 to the state *Snj*2 of Nanjing in 1949: the change *Cnj*11 between time elements, the change *Cnj*12 between location elements, and the change *Cnj*13 between the attribute elements of "administrative region". Note that all these changes belong to the developing change type, which indicates that the change does not create a new geographic object. The changes can be defined as follows:

$$C_{nj11} = \left\{ \begin{array}{c} \langle St, act, CE, type \rangle \in C_{nj11} \\ St = \{S_{nj1}, S_{nj2}\},\ act = \{\text{"time change"}\},\ CE = \{T_{nj1}, T_{nj2}\},\ type = Ch_d \end{array} \right\} \subseteq O_{nj}$$

$$C_{nj12} = \left\{ \begin{array}{c} \langle St, act, CE, type \rangle \in C_{nj12} \\ St = \{S_{nj1}, S_{nj2}\},\ act = \{\text{"location change"}\},\ CE = \{L_{nj1}, L_{nj2}\},\ type = Ch_d \end{array} \right\} \subseteq O_{nj}$$

$$C_{nj13} = \left\{ \begin{array}{c} \langle St, act, CE, type \rangle \in C_{nj13} \\ St = \{S_{nj1}, S_{nj2}\},\ act = \{\text{"attribute change"}\},\ CE = \{A_{nj1}, A_{nj2}\},\ type = Ch_d \end{array} \right\} \subseteq O_{nj}$$

A relation is an indispensable element of geographic objects, referring to the relationships between different elements. In this example, three relations *Rnj*11, *Rnj*12, *Rnj*13 relate to Nanjing in 1368: the spatial relation *Rnj*11 between Nanjing *Onj* and the Yangzi River *Oyz*, the spatial relation *Rnj*12 between Nanjing *Onj* and the Zhongshan Mountain *Ozm*, and the attribute relation *Rnj*13 between Nanjing *Onj* and the Centre District *Ocd*, where *Lyz*1 is the location of the Yangzi River *Oyz* in 1368, *Lzm*1 is the location of the Zhongshan Mountain *Ozm* in 1368, and *Acd*1 is the "administrative region" attribute of the Centre District *Ocd* in 1368. The relations can be defined as follows, and the diagram of these relations is shown in Figure 6.

$$R_{nj11} = \left\{ \begin{array}{c} \langle E, Sem, type \rangle \subseteq R_{nj11} \\ E = \{L_{nj1}, L_{yz1}\},\ Sem = \{\text{"Nanjing is south of the Yangzi River"}\},\ type = Re_l \end{array} \right\} \subseteq O_{nj}$$

$$R_{nj12} = \left\{ \begin{array}{c} \langle E, Sem, type \rangle \subseteq R_{nj12} \\ E = \{L_{nj1}, L_{zm1}\},\ Sem = \{\text{"Nanjing is west of the Zhongshan Mountain"}\},\ type = Re_l \end{array} \right\} \subseteq O_{nj}$$

$$R_{nj13} = \left\{ \begin{array}{c} \langle E, Sem, type \rangle \subseteq R_{nj13} \\ E = \{A_{nj1}, A_{cd1}\},\ Sem = \{\text{"Centre District is part of Nanjing"}\},\ type = Re_a \end{array} \right\} \subseteq O_{nj}$$

Correspondingly, the Yangzi River contains the inverse relation *Ryz*1 = *Rnj*11<sup>−1</sup>, the Zhongshan Mountain contains *Rzm*1 = *Rnj*12<sup>−1</sup>, and the Centre District contains *Rcd*1 = *Rnj*13<sup>−1</sup>:

$$R_{yz1} = \left\{ \begin{array}{c} \langle E, Sem, type \rangle \subseteq R_{yz1} \\ E = \{L_{nj1}, L_{yz1}\},\ Sem = \{\text{"Nanjing is south of the Yangzi River"}\},\ type = Re_l \end{array} \right\} \subseteq O_{yz}$$

$$R_{zm1} = \left\{ \begin{array}{c} \langle E, Sem, type \rangle \subseteq R_{zm1} \\ E = \{L_{nj1}, L_{zm1}\},\ Sem = \{\text{"Nanjing is west of the Zhongshan Mountain"}\},\ type = Re_l \end{array} \right\} \subseteq O_{zm}$$

$$R_{cd1} = \left\{ \begin{array}{c} \langle E, Sem, type \rangle \subseteq R_{cd1} \\ E = \{A_{nj1}, A_{cd1}\},\ Sem = \{\text{"Centre District is part of Nanjing"}\},\ type = Re_a \end{array} \right\} \subseteq O_{cd}$$

**Figure 6.** The diagram of relation elements of Nanjing in 1368.

The whole evolution case of the administrative divisions of Nanjing is shown in Figure 7. Corresponding to Figure 4, each geographic object contains one to three states. For instance, the Yangzi River and the Zhongshan Mountain have three states for 1368, 1949, and 2018, and Jiangning and Gaochun have two states for 1949 and 2018. As inner changes are not considered, the Centre District is represented by only one state. Between different states, different kinds of changes were considered. For example, the different states of the Yangzi River and the Zhongshan Mountain include the time changes *Cyz*11, *Cyz*21, *Czm*11, *Czm*21; the different states of Nanjing include the time changes *Cnj*11, *Cnj*21, the location changes *Cnj*12, *Cnj*22, and the attribute changes *Cnj*13, *Cnj*23; and the different states of Jiangning and Gaochun include the time changes *Cjn*11, *Cgc*11 and the attribute changes *Cjn*12, *Cgc*12. Additionally, relations link different elements both among different geographic objects and within the same geographic object. For example, Nanjing in 1368 has relations to the Yangzi River *Rnj*11, the Zhongshan Mountain *Rnj*12, and the Centre District *Rnj*13. Nanjing in 1949 has relations to the Yangzi River *Rnj*21, the Zhongshan Mountain *Rnj*22, the Centre District *Rnj*23, and Jiangning *Rnj*24. In 2018, Nanjing has relations to the Yangzi River *Rnj*31, the Zhongshan Mountain *Rnj*32, the Centre District *Rnj*33, Jiangning *Rnj*34, and Gaochun *Rnj*35.

**Figure 7.** An overview of evolution case of administrative divisions of Nanjing and relevant geographic objects.

Note that there are also inner relations between elements. In this case, the administrative division of Jiangning in 1949 has an attribute relation *Rjn*12 of "inheritance relationship" to the administrative division of Jiangning in 2018. Gaochun has the same kind of attribute relation, *Rgc*11. All these relations have inverse relations stored on the opposite side.

## **5. Discussion**

In this section, the case study of the administrative division evolution of Nanjing was constructed using both the GeoKG model and the YAGO model. YAGO is a representative open-source knowledge graph with different versions. Note that we compared our model with YAGO2, a spatially and temporally enhanced version, from https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/. Then, three kinds of core geographic questions were posed, and the results were analyzed to evaluate the knowledge representation ability of the two models. Finally, a user evaluation was conducted to verify the comparisons objectively.

## *5.1. The GeoKG and the YAGO*

## 5.1.1. Structures

The structures of the GeoKG and the YAGO are different. Although Section 2 briefly introduced the characteristics of the YAGO, the two structures need to be compared in order to understand the comparisons of queries and results in the next section. Figure 8 shows examples structured by the two models.

**Figure 8.** The examples with structures of the YAGO model and the GeoKG model. (**a**) the entities, properties and relationships in YAGO structure; (**b**) the elements in GeoKG structure.

In Figure 8a, there are only three kinds of elements: entity, property, and relationship. Each property links to a related entity by a relationship with a predicate. For example, "Nanjing" and "1368" have a relationship named "startedOnDate". Note that the YAGO structure does not contain relationships between properties. Thus, there are no semantic relationships between properties; in other words, the massive descriptive properties of an entity link to the entity independently. For example, two relationships hold between Nanjing and the Yangzi River: "Nanjing is south of the Yangzi River" and "The Yangzi River passes through Nanjing". It is difficult to understand this knowledge with no links between the properties. The GeoKG in Figure 8b, by contrast, sets six core elements and links these elements. With more integrated elements, the relationship "Nanjing is south of the Yangzi River" can be illustrated more clearly, because this relationship links the two locations in two different states of the two geographic objects. The linked states indicate that this relationship held in 1368, and the linked locations tie it to the corresponding location descriptions. This knowledge cannot be provided without these links between the properties.

## 5.1.2. Construction

Both the GeoKG and the YAGO were constructed manually using the information from the case study of the administrative division evolution of Nanjing. The case study organized by the YAGO model uses the classic SPO triple sets, for which an open-source ontology template exists. The case study organized by the GeoKG model is also stored as SPO triple sets, but with more predicates. The main supplementary predicates include "isStateof", "isTimeof", "isLocationof", "isAttributeof", "isChangeof", "isRelationof", "isChangeto", and "isRelateto". All these predicates were applied to complete the semantic structure of the GeoKG model. From this perspective, the underlying storage mechanisms of the GeoKG and the YAGO are the same.
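A minimal sketch of this storage scheme, using plain Python tuples in place of RDF triples and the supplementary predicates listed above (the entity identifiers follow the case study; the layout is ours):

```python
# Minimal sketch of GeoKG's SPO storage: plain tuples stand in for RDF triples,
# and the identifiers (Snj1, Cnj11, ...) follow the case study; the layout is ours.
triples = {
    ("Snj1",  "isStateof",     "Nanjing"),
    ("Tnj1",  "isTimeof",      "Snj1"),
    ("Lnj1",  "isLocationof",  "Snj1"),
    ("Anj1",  "isAttributeof", "Snj1"),
    ("Cnj11", "isChangeof",    "Snj1"),
    ("Cnj11", "isChangeto",    "Snj2"),
    ("Rnj11", "isRelationof",  "Snj1"),
    ("Rnj11", "isRelateto",    "Syz1"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching an (s, p, o) pattern; None acts as a wildcard."""
    return {(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)}
```

Pattern matching over such triples is exactly what the SPARQL queries in the next subsection perform at scale.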

*5.2. The Comparison of Knowledge Representation Ability between the GeoKG and the YAGO*

## 5.2.1. Questions

Time, space, and attribute are three indispensable aspects on geoscience. These three kinds of questions can be defined as standard questions to evaluate whether the stored geographic knowledge is good. According to the differences between factual knowledge and inferential knowledge, each question was a set of two parts. To this case study, the questions are shown in Table 2.


## 5.2.2. Queries

Questions cannot be directly queried from the GeoKG and YAGO databases. Thus, they need to be translated into SPARQL queries, because both the GeoKG and the YAGO are stored as RDF triples. For example, the factual question on time can be translated into the SPARQL query shown in Table 3.
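As a hedged sketch of such a translation (the `geo:` namespace, entity names, and query shape are our assumptions; Table 3's actual queries are not reproduced here), a factual time question over the GeoKG predicates might look like this:

```python
# Hedged sketch of translating the factual time question into SPARQL; the
# geo: namespace and entity names are our assumptions, not Table 3's queries.
FACTUAL_TIME_QUERY = """
PREFIX geo: <http://example.org/geokg#>
SELECT ?time WHERE {
  ?state geo:isStateof geo:Jiangning .
  ?time  geo:isTimeof  ?state .
}
"""

def translate_time_question(entity: str) -> str:
    """Build the same query shape for any entity of the case study (illustrative)."""
    return FACTUAL_TIME_QUERY.replace("Jiangning", entity)
```

The query walks from the entity to its states via "isStateof" and then to the attached time elements via "isTimeof", mirroring the GeoKG structure described in Section 5.1.2.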


## 5.2.3. Comparison and Analysis

The results collected from the YAGO and the GeoKG for the six questions are listed in Table 4. The comparisons are conducted in terms of accuracy, completeness, and repetition.

## a. Accuracy

In general, the results of the GeoKG are slightly better than those of the YAGO. Both models respond with accurate results to #Q1, #Q2, #Q3, #Q4, and #Q6. In #Q5, the YAGO model returned two items and the GeoKG model returned four items. Actually, "Zhenjiang" and "Nanjing" from the YAGO model are misleading answers to the question "Which city does Gaochun belong to?" Although the results from the GeoKG model, "Zhenjiang (Gaochun, state of 1949)", "Zhenjiang (Zhenjiang, state of 1949)", "Nanjing (Gaochun, state of 2018)", and "Nanjing (Nanjing, state of 2018)", are similar to the former, they contain the geographic object and the relevant state information, which helps users understand the results. From this perspective, the state information from the GeoKG provides more accurate information than the YAGO model.


### **Table 4.** The results of YAGO and GeoKG on SPARQL queries.

## b. Completeness

Although both models can return complete results, the results of the GeoKG contain more semantic integrity. In #Q6, the YAGO returned 10 items: Centre District, Jiangning, Jurong, Dangtu, Luhe, Pukou, Hexian, Lishui, Gaochun, and Qixia. Among these divisions, the Centre District has belonged to Nanjing since 1368; Jiangning, Luhe, and Pukou have belonged to Nanjing since 1949; Jurong, Dangtu, and Hexian belonged to Nanjing only in 1949; and Lishui, Gaochun, and Qixia belonged to Nanjing in 2018. As the question has no explicit time constraint, the YAGO returned all the items, whereas the GeoKG returned 30 items, each recording the target object together with its relevant geographic object and state. It contains both the item "Centre District (Nanjing, state of 1368)" and the item "Centre District (Centre District, state of 1368)", because the relation is stored in both directions.

## c. Repetition

The results of the GeoKG have more repeated items than those of the YAGO. The results from the YAGO have repeated items in #Q3 and #Q4, because the records themselves are repeated. The GeoKG model is different: in #Q2, #Q4, #Q5, and #Q6, its results contain many repeated items, for example, the items "1949 (Jiangning, state of 1949)" and "1949 (Nanjing, state of 2018)" in #Q2. The query target object "1949" is the same. Although these two items are sourced from different geographic objects (Jiangning and Nanjing), they are still quite similar, which pushes more redundant information to the users.

In summary, the results of the GeoKG model are more accurate and complete than those of the YAGO model, thanks to the enhancing state information. This can reduce the influence of fuzzy questions and yield answers with more semantic meaning (e.g., the geographic object and its relevant state). Meanwhile, the GeoKG model can generate more paired results (e.g., "Nanjing is south of the Yangzi River (Nanjing, state of 1368)" vs. "Nanjing is south of the Yangzi River (Yangzi River, state of 1368)"), because each relation is stored oppositely in a different geographic object.

## 5.2.4. User Evaluation

An online questionnaire survey was also conducted in order to verify the results of the comparative analyses. The questionnaire is divided into eight parts. The first part is a basic information survey asking individuals about four aspects (gender, familiarity with the research area, background, and education level); the statistics of this basic information are shown in Figure 9. The 2nd–7th parts correspond to questions #Q1–#Q6 and ask about the best answer, accuracy, completeness, and repetition. The 8th part contains summary questions, including an overall evaluation and scores for YAGO and GeoKG on different aspects. Scores range from 1 to 5, corresponding to very bad, bad, normal, good, and very good, and each score group includes an overall score, accuracy score, completeness score, and repetition score. In total, 106 valid responses were received.

**Figure 9.** The statistics of the four main types of the basic information about the survey.

Figure 10 shows the best answers on #Q1–#Q6 and the overall scores of YAGO and GeoKG. In the best-answer histogram, 54.72% of individuals overall support the GeoKG, 23.59 percentage points higher than YAGO's 31.13%. Specifically, the counts for #Q1 and #Q2 are quite close, but those for #Q3–#Q6 are not: the GeoKG counts are much higher than the YAGO counts for the last four questions, especially #Q5. The line charts of the overall scores also show that the evaluation of GeoKG is better than that of YAGO, with a 7.8% improvement in the average score from YAGO (3.15) to GeoKG (3.49).


**Figure 10.** The best answer on #Q1–#Q6 and the overall scores of the YAGO and the GeoKG.

Viewed by sub-aspect (accuracy, completeness, and repetition), the score distributions immediately distinguish YAGO from GeoKG and reflect the abilities of each model (details in Figure 11). Nearly all three aspects of YAGO received a score of 3, whereas GeoKG was different: a score of 4 on accuracy, 4–5 on completeness, and 3–4 on repetition. Comparing these scores, there is an improvement in accuracy, from an average score of 3.11 for YAGO to 3.78 for GeoKG. An overwhelming improvement appears in the completeness of the answers, from an average score of 2.99 for YAGO to 3.87 for GeoKG. Additionally, GeoKG also obtains a higher repetition score, from an average of 3.01 for YAGO to 3.42 for GeoKG.

**Figure 11.** Rose maps of scores of different aspects on the YAGO and the GeoKG.

In summary, the answers from GeoKG are an improvement over those from YAGO. The user evaluation objectively verified the analyses in Section 5.2.3 and gave specific, clear answers. The main improvements of the GeoKG appear on #Q3–#Q6, which are spatial and attribute questions. The answers to these questions require more related state information and temporal information, which depend on the links between the elements (Figure 8). This is why the GeoKG performs better than YAGO. In addition, the GeoKG contains more redundant information than YAGO because of the bi-directionality of the relation element; this could be a focus of further research on indexing and applications in the future.

## **6. Conclusions**

Given that much attention has been paid to the representation of geographic knowledge, this paper focuses on advancing current geographic knowledge representations. We analyzed the problems of current geographic knowledge representation and found two issues that must be improved: the elements of geographic knowledge representation and the supplementation of the construction operators of DL.

Following the basic idea of the six core geographical questions, we designed a conceptual model called GeoKG based on the six elements around these questions, supplemented the construction operators of DL, and provided formalizations of the model with these operators. Additionally, an evolution case of the administrative divisions of Nanjing was formalized and illustrated, and knowledge graphs were constructed for the case study with both the GeoKG model and the YAGO model. After setting a group of standard geographic questions, the query results were compared. The comparison showed that the GeoKG results are more accurate and complete than the YAGO results, which was verified by the subsequent user evaluation. This indicates that the GeoKG model is able to organize geographic knowledge in computers and is a promising and powerful model for geographic knowledge representation.

**Author Contributions:** Conceptualization: S.W., X.Z., and P.Y.; data curation: S.W.; formal analysis: S.W.; funding acquisition: X.Z.; investigation: S.W., M.D., Y.L., and H.X.; methodology: S.W., X.Z., and M.D.; supervision: X.Z.; validation: P.Y., M.D., and Y.L.; visualization: M.D.; writing—original draft: S.W.; writing—review and editing: S.W. and P.Y.

**Acknowledgments:** The authors thank Mingguang Wu, Junzhi Liu and Jie Zhu for their critical reviews and constructive comments. This research is supported by the National Natural Science Foundation of China grants no. 41631177 and no. 41671393 and the National Key Research and Development Program of China, no. 2017YFB0503602.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Advanced Cyberinfrastructure to Enable Search of Big Climate Datasets in THREDDS**

## **Juozas Gaigalas, Liping Di \* and Ziheng Sun**

Center for Spatial Information Science and Systems, George Mason University, Fairfax, VA 22030, USA; juozasgaigalas@gmail.com (J.G.); zsun@gmu.edu (Z.S.)

**\*** Correspondence: ldi@gmu.edu

Received: 30 September 2019; Accepted: 31 October 2019; Published: 2 November 2019

**Abstract:** Understanding the past, present, and changing behavior of the climate requires close collaboration of a large number of researchers from many scientific domains. At present, the necessary interdisciplinary collaboration is greatly limited by the difficulties in discovering, sharing, and integrating climatic data due to the tremendously increasing data size. This paper discusses the methods and techniques for solving the inter-related problems encountered when transmitting, processing, and serving metadata for heterogeneous Earth System Observation and Modeling (ESOM) data. A cyberinfrastructure-based solution is proposed to enable effective cataloging and two-step search on big climatic datasets by leveraging state-of-the-art web service technologies and crawling the existing data centers. To validate its feasibility, the big dataset served by UCAR THREDDS Data Server (TDS), which provides Petabyte-level ESOM data and updates hundreds of terabytes of data every day, is used as the case study dataset. A complete workflow is designed to analyze the metadata structure in TDS and create an index for data parameters. A simplified registration model which defines constant information, delimits secondary information, and exploits spatial and temporal coherence in metadata is constructed. The model derives a sampling strategy for a high-performance concurrent web crawler bot which is used to mirror the essential metadata of the big data archive without overwhelming network and computing resources. The metadata model, crawler, and standard-compliant catalog service form an incremental search cyberinfrastructure, allowing scientists to search the big climatic datasets in near real-time. 
The proposed approach has been tested on UCAR TDS, and the results prove that it achieves its design goals by boosting the crawling speed by at least 10 times and reducing the redundant metadata from 1.85 gigabytes to 2.2 megabytes, which is a significant breakthrough toward making currently non-searchable climate data servers searchable.

**Keywords:** climate science; metadata; web cataloging service; big geospatial data; geospatial cyberinfrastructure

## **1. Introduction**

Cyberinfrastructure plays an important role in today's climate research activities [1–6]. Climate scientists search, browse, visualize, and retrieve spatial data using web systems on a daily basis, especially as data volumes from observation and model simulation grow to amounts that personal devices cannot hold entirely [7,8]. The big data challenges of volume, velocity, variety, veracity, and value (5Vs) have pushed geoscientific research into a more collaborative endeavor that involves many observational data providers, cyberinfrastructure developers, modelers, and information stakeholders [9]. Climate science has developed for decades and produced tens of petabytes of data products, including stationary observations, hindcasts, and reanalyses, which are stored in distributed data centers in different countries around the globe [10]. Individuals or small groups of scientists face big challenges when they attempt to efficiently discover the data they require. Currently, most scientists acquire their knowledge about datasets via conferences, colleague recommendations, textbooks, and search engines. They become very familiar with the datasets they use, and every time they want to retrieve data, they go directly to the dataset website to download the data falling within the requested temporal and spatial windows. However, these routines are becoming less sustainable as the sensors and datasets become more varied, models evolve more frequently, and new data pertaining to their research becomes available elsewhere [9].

In most scenarios, metadata is the first information that researchers see before they access and use the actual Earth observation and modeling data it describes [11,12]. Based on the metadata, they decide whether or not the actual data will be useful in their research. For big spatial data, metadata is the key component backing all kinds of users' daily operations, such as searching, filtering, browsing, downloading, and displaying. Currently, two of the fundamental problems in accessing and using big spatial data are the volume of metadata and the velocity of processing it [13,14]. Manual investigation of the Unidata THREDDS data repository (a metadata source we take as an example of typical geodata storage patterns) [15,16] reveals that most of the metadata is highly redundant. The vast majority of metadata records contain identical information, and only key fields representing spatial and temporal characteristics are regularly updated. There is a regular pattern to how the redundant information is structured and how new information is added to the repository, but the pattern varies according to the data organization hierarchy and changes with the type of data being delivered (for example, radar station vs. satellite observation vs. regular forecast model output).
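This redundancy pattern can be made concrete with a small sketch (the field names are hypothetical, and Python is used purely for illustration): factoring a batch of metadata records into the constant fields shared by all of them and the small per-record deltas that actually vary.

```python
from typing import Dict, List, Tuple

def factor_records(records: List[Dict[str, str]]) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split metadata records into the fields shared by every record
    (the constant part, stored once) and per-record deltas (the fields
    that actually vary, typically spatial/temporal extent)."""
    if not records:
        return {}, []
    # A field is constant if every record holds the same value for it.
    constant = {k: v for k, v in records[0].items()
                if all(r.get(k) == v for r in records)}
    deltas = [{k: v for k, v in r.items() if k not in constant}
              for r in records]
    return constant, deltas

# Three records from one hypothetical radar station: only the timestamp varies.
records = [
    {"sensor": "radar", "format": "NetCDF", "station": "KABR", "time": "2019-10-01T00:00Z"},
    {"sensor": "radar", "format": "NetCDF", "station": "KABR", "time": "2019-10-01T00:05Z"},
    {"sensor": "radar", "format": "NetCDF", "station": "KABR", "time": "2019-10-01T00:10Z"},
]
constant, deltas = factor_records(records)
```

Storing the constant part once and keeping only the deltas is a lossless factoring: the original records can always be reconstructed by merging each delta back with the constant fields.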

To overcome these big data search challenges, we must confront practical problems in the information model, information quality, and technical implementation of information systems. Our study follows the connection between fundamental scientific challenges and existing implementations of geoscience information systems. This study aims to build a cataloging model capable of fully describing real-time heterogeneous metadata whilst simultaneously reducing data volume and enabling search within big Earth data repositories. This model can be used to efficiently represent redundant data in the original metadata repository and to perform lossless compression of information for lightweight, efficient storage and searching. The model can shrink the huge amount of metadata (without sacrificing the information complexity or variety available in the original repositories) and reduce the computational burden of searching it. The model defines two types of objects: collections and granules. It also defines their lifecycle and relationship to the upstream THREDDS repository data. A collection contains content metadata (title, description, authorship, variable/band information, etc.) and one or more granules. Each granule contains only the spatiotemporal extent metadata. We prototyped the model as an online catalog system within EarthCube CyberConnector [17–20]. The final system is available online at: http://cube.csiss.gmu.edu/CyberConnector/web/covali. The system provides a near real-time replica of the source catalog (e.g., THREDDS), optimizes the metadata storage, and enables a searching capability that was not available before. The system acts like a clearinghouse with its own metadata database. Currently, the system is mainly used for searching operational time-series observations/simulations collected/derived from field sensors.
Other datasets, such as remote sensing and airborne datasets, are foreseen to be supported in the near future. The novelty of this research is that it turns legacy data center repositories into lightweight, flexible catalog services that are more manageable and provide searching capabilities for petabytes of datasets. The work provides important references for people operating big climate data centers and advises on further improvements to those operational centers to better serve the climate science community. This paper is organized as follows. Section 2 describes the background knowledge and history. Section 3 introduces related work. Section 4 introduces the proposed model. Section 5 shows the implementation of the model and the required cyberinfrastructure. Section 6 demonstrates the experiment results. Section 7 discusses the results of our approach. Section 8 concludes the paper.
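The two-object model can be sketched as follows (a minimal illustration with simplified field sets; the actual registration model carries more fields than shown here):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Granule:
    """Holds only the spatiotemporal extent metadata of one data file."""
    bbox: Tuple[float, float, float, float]  # (west, south, east, north)
    time_start: str                          # ISO 8601, e.g. "2019-01-01T00:00Z"
    time_end: str

@dataclass
class Collection:
    """Holds the content metadata shared by all of its granules."""
    title: str
    description: str
    variables: List[str]
    granules: List[Granule] = field(default_factory=list)

    def search(self, t0: str, t1: str) -> List[Granule]:
        # ISO 8601 strings in the same format compare correctly as text,
        # so a plain lexicographic overlap test suffices for this sketch.
        return [g for g in self.granules
                if g.time_start <= t1 and g.time_end >= t0]
```

Splitting content metadata (stored once per collection) from extent metadata (stored per granule) is what removes the redundancy: a collection with thousands of granules repeats its title, description, and variable list exactly once.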

The study described in this paper is an attempt to contribute to the global scientific endeavor on understanding and predicting the impacts of climate change. Understanding climate change and its impacts requires understanding Earth as a complex system of systems with behaviors that emerge from the interaction and feedback loops that occur on a range of temporal and spatial scales. However, new advances in these studies are obstructed by the challenges of interdisciplinary collaboration and the difficulty of data and information collaboration [21–27]. The difficulties of information collaboration can be understood in terms of long-standing big data problems of variety (complexity) and volume/velocity.

## **2. Background**

Metadata is a powerful tool for dealing with big data challenges. We discuss the background work on metadata and on the interoperability of metadata catalogs as critical components of the advanced cyberinfrastructure that we envision.

## *2.1. Metadata*

The topic of metadata has been approached by two distinct scholarly traditions, and understanding them helps clarify our approach to metadata in cyberinfrastructure. Library information scientists have described the metadata bibliographic control approach. Bibliographic principles allow information users to describe, locate, and retrieve information-bearing entities. The basic metadata unit is the "information surrogate", which derives its usefulness from being locatable (by author, title, and subject), accurately describing the information object (the data of the metadata), and identifying how to locate the object. The second (complementary) view of metadata originates in the computer science discipline and is called the data management approach. Complex and heterogeneous data (textual, graphical, relational, etc.) is not separated into information units, but is instead described by data models and architectures that represent "additional information that is necessary for data to be useful" [27]. The key difference is that the bibliographic approach works with distinct information entities of limited types, while the data management approach works with models of data/information structures and their relationships.

This distinction between bibliographic and data management approaches is important in the context of ongoing efforts of metadata standardization [28–31]. The second approach is not conducive to standardization because data management models are as complex and heterogeneous as the structures of the data being modeled. Consequently, in accordance with existing standards, the currently available metadata for large climate datasets follows the first approach, which provides bibliographic information and does not describe the data structures in a way that would permit new capacities of advanced cyberinfrastructure. Our paper describes work to supplement and transform the existing bibliographic metadata with a custom metadata management model, resulting in new applications for the existing data. Metadata standardization is a prerequisite for interoperability, which is in turn a prerequisite for building distributed information systems capable of handling complex Earth system data [32].

## *2.2. Interoperability, Data Catalogs, Geoinformation Systems, and THREDDS*

Data and information collaboration across disciplines is critical for advanced Earth science. Unfortunately, there is no strongly unified practice for data recording, storage, transmission, and processing that the entire scientific community follows [33–37]. Disparate fields and traditions have their own preferred data formats, software tools, and procedures for data management. However, Earth system studies generally work with data that follow a geospatial–temporal format [38–42]: all of the data can be meaningfully stored on a 4D grid (three spatial dimensions and one temporal). This basic commonality has inspired standardization efforts with the goal of enabling wider interoperability and collaboration.

Following organic outgrowth from the community, the standardization efforts are now headed by the Open Geospatial Consortium (OGC) and the International Organization for Standardization Technical Committee 211 (ISO TC 211) and have yielded successful standards in two areas relevant to us [43–47]. First is the definition of NetCDF as one of the standard data formats for storing geospatial data. The second is metadata standardization. Those efforts are extremely relevant to our research and are further discussed in the Related Work section. For background, it is important to mention that the standard geospatial metadata models developed by OGC are still evolving capabilities for describing the heterogeneous, high-volume, or high-velocity big data we are studying. The commonly used OGC/ISO 19\* series metadata standards have relatively limited relational features (aggregation only) and, in the repository we studied, each XML-encoded metadata record contains mostly redundant information (for example, two metadata objects that represent two images from a single sensor mostly contain duplicated information describing sensor characteristics). However, there are multiple lines of work that ISO TC 211 is pursuing that address these issues and suggest a trend toward expanding the applicability of standardized metadata models and the integration of a greater variety of information.

The standard geographical metadata model was developed in conjunction with a standard distributed catalog registration model titled Catalog Services for the Web (CSW) [48,49]. The CSW standard is widely known, and many Earth system data providers offer some information about their data holdings via the CSW interface. However, the CSW standard is also poorly suited to support big data collaborative studies of the Earth system. CSW follows the basic OGC metadata model in a way that makes it challenging to capture the valuable structure and semantics of existing data holdings without storing extremely redundant information, which exhausts computing resources without taking advantage of the true value of large and complex Earth big data. Nevertheless, OGC metadata stored in CSW is the existing standard that governs not only data distribution practices, but also how researchers think about data collaboration.

The next item this study works with is the UCAR Unidata THREDDS Data Server (TDS). The University Corporation for Atmospheric Research (UCAR) Unidata is a geoscience data collaboration community of diverse research and educational institutions. It provides the real-time heterogeneous Earth system data that this study targets. THREDDS is Unidata's Thematic Real-Time Environmental Distributed Data Services. TDS is a web server that provides metadata and data access for scientific datasets to climate researchers. TDS provides its own rudimentary hierarchical catalog service that is not searchable and does not support the CSW standard. However, it does support the OGC geospatial metadata standard—although not consistently or comprehensively. In order to make data hosted by TDS searchable, the TDS metadata must be copied to another server and a searchable catalog must be created for the metadata. This task is performed by a customized web crawler developed by this study.

This study attempts to build upon the existing infrastructure, with its available resources and limitations, to provide new capabilities. The limitations of the existing systems are two-fold. First are the limits of the CSW metadata registration model (it does not naturally support registering information about metadata lifecycle or sufficiently detailed aggregation information), and second is the incompleteness of the information within the metadata provided by THREDDS. This study attempts to overcome these limits by first interpolating information to improve the quality of the existing metadata model and then by extending the model to provide advanced capabilities. It demonstrates how to integrate TDS metadata with CSW software and proposes several practical solutions that work around the limitations of the CSW metadata registration model. We do this to show that improvements in metadata and catalog capabilities can also reduce the big data challenges of variability, volume, and velocity.

## **3. Related Work**

This paper brings together several existing lines of work to confront the problems of integrating and searching vast and diverse climate science datasets. Existing research in areas of metadata modeling, geospatial information interoperability, geospatial cataloging, web information crawling, and search indexing provides the building blocks for our work to demonstrate and evaluate advanced climate data cyberinfrastructure capabilities.

## *3.1. Metadata Models*

There are many studies exploring the fundamental relationship between metadata models and information capabilities. There exists diverse work in other areas that deals with the same basic issues and demonstrates that the creation of novel metadata models can be used as a method for solving information challenges. For example, Spéry et al. [50] developed a metadata model for describing the lineage of changes of geographical objects over time. They used a directed acyclic graph and a set of elementary operations to construct their model. The model supports a new application, querying historical cadastral data, and minimizes the size of geographical metadata information. Spatiotemporal metadata modeling can be generalized as a description of objects in space and time, with relationships between objects conceived as flows of information, energy, and material to model the interdependent evolution of objects in a system [51]. Provenance ("derivation history of data product starting from its original sources" [52]) modeling is an important part of metadata study. Existing metadata models and information systems have been experimentally extended with provenance modeling capabilities to enable visualization of data history and analysis of the workflows that derive data products used by scientists [53,54]. An experiment to re-conceptualize metadata as a practice of "knowledge management" yielded a metadata model that can support the needs of spatial decision-making by identifying issues of entity relationships, integrity, and presentation [55]. The proposed metadata model allows communicating more complex information about spatial data. It made it possible to build an original geographic information application, named the Florida Marine Resource Identification System, that extends the use of existing environmental and civil data to empower users with higher-level knowledge for analysis and planning.
Looking outside the geospatial domains, we still observe that the introduction of specialized metadata approaches and models permits the development of new capabilities.

## *3.2. Geospatial Metadata Standardization, Interoperability, and Cataloging*

The diversity of metadata models and formats developed by research has enabled new powerful geoinformation systems, but has also introduced a new set of problems of data reuse and interoperability. Public and private research, administrative, and business organizations have accumulated growing stores of geoinformation and data, but this data has not become easier to discover and access for users outside limited organizational jurisdictions. This has led to significant resource wastage and duplication of effort for data producers and consumers. Cataloging has grown increasingly challenging because of this heterogeneity. In response, new spatial data infrastructures have been developed. They have attempted to integrate and standardize multiple metadata models and develop shared semantic vocabulary models to enable discovery by employing the "digital library" models of metadata. In this process, syntactic and semantic interoperability challenges have been identified. Syntactic interoperability refers to information portability: the ability of systems to exchange information. Semantic interoperability refers to domain knowledge that permits information services to understand how to meaningfully use the data from other systems [56].

Various techniques for achieving metadata interoperability have been explored [57]. Two related families of techniques can be identified: one approach attempts to create standard and universal models; the other creates mappings between several metadata representations of the same data. Transformation between metadata models requires that syntactic, structural, and semantic heterogeneities be reconciled. The reconciliation is accomplished with techniques called metadata crosswalks. A crosswalk is "a mapping of the elements, semantics, and syntax from one metadata schema to another". Once mappings are developed, they can be used to apply multiple metadata schemas to existing data [58].
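In its simplest form, a crosswalk can be sketched as a field-level mapping applied to records. This is only an illustration: the source fields resemble Dublin Core, and the target paths below are hypothetical stand-ins, not exact ISO 19115 element identifiers; real crosswalks also need per-field logic for structural and semantic mismatches.

```python
# Illustrative crosswalk table: source field -> target field path.
# The target paths are invented for this sketch, not standard identifiers.
DC_TO_ISO = {
    "title":   "identificationInfo.citation.title",
    "creator": "contact.individualName",
    "date":    "dateStamp",
}

def crosswalk(record: dict, mapping: dict) -> dict:
    """Translate a flat source record into target-schema field names.

    Fields the mapping does not cover are dropped; a production
    crosswalk would report them rather than silently discard them."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}
```

Because the mapping is data rather than code, the same `crosswalk` function can serve many schema pairs, which is why crosswalks scale better than writing one translator per pair of formats.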

The possibilities for interoperability have been advanced by the efforts led by the International Organization for Standardization Technical Committee 211 (ISO TC 211) to standardize metadata representation. It introduced the ISO 19\* series of geospatial metadata standards for describing geographic information by means of metadata [54,59–61]. The standards define mandatory and optional metadata elements and associations among elements. For example, spatiotemporal extent, authorship, and a general description of datasets are required or recommended by the standard. Other kinds of information, such as the sequencing of datasets in a collection, aggregation, and other relational data, are optional in the standard. The ISO 19\* series of standards also provides an XML schema for the representation of the metadata in XML [62].

Looking at existing metadata interoperability work, we see a recurrence of similar problems, such as the diversity of metadata representations and the complexity of mapping between them. Several authors discuss the practical challenges of developing software and systems for translation [27,59,63]. There exists a proliferation of study efforts and results that advance the goals of interoperability by identifying key understandings of the challenges of interoperability and demonstrating systems, services, and models that address common challenges. Our work attempts to preserve existing interoperability advances while exploring the possibilities of expanding existing metadata models to support new possible uses of existing data.

Standardized metadata is often stored and made available using catalog services. Catalogs allow users to find metadata using queries that describe the desired spatial, temporal, textual, and other information characteristics of the searched data [64]. The OGC Catalog Service for the Web (CSW) is one of the widely used catalog models in the geoscience domain to describe geographic information holdings [6,65].

## *3.3. Web Harvesting and Crawling*

One critical capacity of metadata cyberinfrastructure is the ability to integrate metadata from remote web repositories. The process of finding and importing web-linked data into a metadata repository is called "crawling" and is accomplished using a software system called a "metadata web crawler". A web crawler is a computer program that browses the web in a "methodical, automatic manner or in an orderly fashion" [66]. A crawler is an internet bot: a program that autonomously and systematically retrieves data from the world wide web. It automatically discovers and collects different resources in an orderly fashion from the internet according to a set of built-in rules. Patil and Patil [66] summarize this general architecture of web crawlers and also provide definitions of several types of web crawlers. A focused crawler is designed to eliminate unnecessary downloading of web data by incorporating an algorithm for selecting which links to follow. An incremental crawler first checks for changes and updates to pages before downloading their full data; it necessarily involves an index table of page update dates and times. We follow these two strategies in the design of our crawler. The authors also outline common strategies for developing distributed and parallelized crawlers. Our crawler runs on a single machine, but we use a multithreaded process model with a shared queue mechanism, a common parallelization strategy identified by the authors [67].
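The combination of these strategies (focused link selection, an incremental index of page update stamps, and worker threads sharing one queue) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: `fetch`, `extract_links`, and `last_modified` are caller-supplied placeholders standing in for real HTTP logic.

```python
import queue
import threading

def crawl(seed_urls, fetch, extract_links, last_modified, n_threads=8):
    """Skeleton of a focused, incremental, multithreaded crawler.

    Focused selection lives inside `extract_links`, which should return
    only the links worth following; `last_modified` supplies the update
    stamp used by the incremental rule."""
    work = queue.Queue()   # shared work queue (the parallelization strategy)
    visited = {}           # url -> last-modified stamp (the incremental index)
    lock = threading.Lock()

    for url in seed_urls:
        work.put(url)

    def worker():
        while True:
            try:
                url = work.get(timeout=1)  # give up once the queue stays empty
            except queue.Empty:
                return
            stamp = last_modified(url)
            with lock:
                if visited.get(url) == stamp:
                    continue               # incremental rule: skip unchanged pages
                visited[url] = stamp
            for link in extract_links(fetch(url)):
                work.put(link)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return visited
```

With `n_threads` workers sharing one queue, pages are fetched concurrently, while the lock-guarded index keeps each unchanged page from being fetched twice.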

A fairly recent review collected by Desai et al. [68] shows that web crawler research is an active area of work; however, most of this work is focused on the needs of general web search engine index construction. There exists an area of research called "vertical crawling" which contends with the problems of crawling non-traditional web data: news items, online shopping lists, images, audio, and video. There do not appear to be any publications regarding efficient crawling of heterogeneous Earth system metadata.

There exists substantial previous work showing the feasibility of crawling this metadata. One recent paper summarizes the state of the art: Li et al. [69] present a heterogeneous Earth system metadata crawling and search system named PolarHub, a web crawling tool capable of conducting large-scale search and crawling of distributed geospatial data. It uses existing textual web search engines (Google) to discover OGC standards-compliant geospatial data services. It presents an interactive interface that allows users to find a large variety and diversity of catalogs and related data services. It has a sophisticated distributed multi-threaded software system architecture. PolarHub shows that it is possible to present data from many sources in a single place. However, it does not present datasets, only endpoints that users must further explore on their own. It does not download, summarize, or harmonize the metadata stored on the remote catalogs. It shows the feasibility of cyberinfrastructure that integrates a variety of data based on interoperable standards, but does not discuss the data volume and velocity challenges that arise when deeper and fuller crawling is done. PolarHub users can find a large number of catalogs and services that contain, for instance, "surface water temperature" data, but they cannot use a metadata crawler following this catalog hub strategy to discover datasets that hold "surface water temperature inside X spatial and temporal extent with Y spatial and temporal resolution".

A complementary strategy is discussed by Pallickara et al. [70], who present a metadata crawling system named GLEAN, which provides a new web catalog for atmospheric data based on the extraction of fine-grained metadata from existing large-scale atmospheric data collections. It solves the data volume problem by introducing a new metadata scheme based on custom synthetic datasets that represent collections (or subsets or intersections) of multiple existing datasets. This reduces metadata overhead greatly and permits high performance and precise discovery and access of specific datasets inside vast atmospheric data holdings. Unlike PolarHub, GLEAN avoids the data variety challenge by limiting its processing to one type of data format used in atmospheric science. They also do not contend with the interrelated velocity and near real-time access problems—in GLEAN crawling, the discovery of updated datasets is initiated by manual user request. They do not use the OGC catalog or metadata standards to support interoperability.

The BCube project (part of the EarthCube initiative) attacks similar problems with another approach [71]. EarthCube is a National Science Foundation initiative to create open, community-based cyberinfrastructure for all researchers and educators across the geosciences. EarthCube cyberinfrastructure must integrate heterogeneous data resources to allow forecasting the behavior of the complex Earth system. EarthCube is composed of many building blocks; our work is part of the EarthCube Cyberway building block. BCube (the Brokering Building Block) offers a different approach to heterogeneous geodata interoperability. BCube adopts a brokering framework to enhance cross-disciplinary data discovery and access. A broker is a third-party online data service that contains a suite of components, called accessors. Each accessor is designed to interface with a different type of geodata repository. A broker allows users to access multiple repositories with a single interface without requiring data providers to implement interoperability measures. BCube supports metadata brokering: it can search, access, and translate heterogeneous metadata from multiple sources. It demonstrates deeper interoperability than the other approaches discussed here, but does not attempt to solve the data volume or velocity problems [72]. The BCube approach is very relevant to us; however, BCube has very few documents available and the system is inaccessible, so we were unable to compare some of the details of our different approaches.

Song and Di [73] studied the same problem with the same example repository: Unidata TDS. The authors determined the volume and velocity characteristics of the target repository metadata. Like our study, they proposed modeling it with the concepts of collection and granule. They implemented a crawler that is able to crawl some of the TDS archive. Their work is earlier progress in the same project as ours and is highly relevant to this study. However, their approach did not perform well on real-world TDS data, which led us to take the work in a different direction. We rebuilt their system with a more sophisticated metadata model and a more advanced integrated search client and indexing service, demonstrating the possibility of processing all of TDS and enabling true real-time search.

Reviewing existing work reveals tremendous advances toward solving the challenges of creating interoperable Earth system cyberinfrastructures that can practically process the large volume and variety of observation and model data generated by high-velocity data production processes. Lines of work in metadata modeling, standardization, interoperability, repository crawling, and processing provide the foundation for our study. Our contribution is to synthesize these approaches to explore how interoperability and performance can be achieved simultaneously.

## **4. Materials and Methods**

To enable searching of big climate data, we propose a new big data cataloging solution, which includes the following steps. (1) Analyze the target geodata repository that provides a good example of data challenges for cross-disciplinary Earth system scientific collaboration. (2) Analyze the qualities and characteristics of the data in the selected repository. (3) Construct a model of the repository. (4) Use the repository model to construct an efficient metadata resource model. (5) Develop a crawler system that uses the repository and metadata resource models to optimize its crawling algorithm and metadata representation. (6) Demonstrate the advanced interoperable big geodata search and access capabilities that our approach permits. The completed cyberinfrastructure model and system architecture (derived from our metadata model) are shown in Figure 1.

**Figure 1.** The proposed big climate data cataloging solution. Abbreviations: CSW, Catalog Services for the Web; REST API, Representational State Transfer Application Programming Interface.

## *4.1. Metadata Repository Selection*

We took the Unidata THREDDS Data Server (TDS) as our example target geodata repository platform. TDS was chosen because it is widely used by atmospheric and other related Earth science fields. It supports a good variety of open metadata and data standards, and many data centers use TDS. It supports basic catalog features but lacks advanced search capabilities. It gives users and administrators wide latitude in how the data are organized and updated inside the TDS catalog. The geodata stored across many TDS installations meets our broad criteria for real-world data variety, volume, and velocity.

A single TDS instance was selected as a target for our experiment. The UCAR Unidata TDS (thredds.ucar.edu) repository was determined to be a suitable target system and a good example of diverse uses of TDS. Unidata TDS contains the requisite variety of data. It has near real-time data that demonstrate the data velocity challenge. It contains a variety of data granularity and a good range in the size and complexity of available datasets. The volume of data and the volume of metadata are sufficiently challenging. The catalog structure is heterogeneous: different types of data are organized on different principles. On initial inspection, Unidata TDS was determined to be a great example of the challenges we wanted to explore.

Using manual inspection and basic statistical analysis via custom Python scripts, we started mapping out the characteristics of the Unidata TDS information system. We tried to answer the following questions: (a) What is the hierarchical structure of data organization in this repository?; (b) how frequently are new records added and removed?; (c) which parts of the catalog exhibit regular patterns in the information structure that can be generalized, and which parts contain unique information?; (d) what are the size and content of the metadata resources stored in the catalog?; (e) how is the information in metadata resources related to the location of those resources within the hierarchy of the catalog structure?; and (f) what are the data transmission qualities of the Unidata TDS network system, i.e., what portion of the TDS information can be transferred and copied to our system?

## *4.2. Repository Analysis*

The following figures show some of the surface structure of the Unidata TDS catalog retrieved using a web browser from http://thredds.ucar.edu/thredds/catalog.html. Figure 2 shows the top level of the catalog hierarchy. Each listed item is a folder (a catalog). Most catalogs contain several levels of nested catalogs (Figure 3) in a tree-like hierarchy similar to a file system. At the bottom (leaf) tree level (Figure 4), the catalogs contain a list of data resources. Catalogs are presented in two formats. First is the HTML format, suitable for manual web browsing. Second is the XML format that contains additional metadata about the catalogs and the data resources. The XML representation follows THREDDS Client Catalog Specification. The specification extends the basic filesystem-like structure with temporal, spatial, and data variable description metadata annotations [74].
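To illustrate this structure, the sketch below parses a minimal, hypothetical catalog fragment and separates sub-catalog references from leaf datasets. The namespace URIs follow the THREDDS InvCatalog 1.0 client catalog specification; the catalog names, hrefs, and dataset IDs are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Namespaces used by THREDDS client catalogs (InvCatalog 1.0) and XLink.
NS = {
    "t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0",
    "xlink": "http://www.w3.org/1999/xlink",
}

# A minimal, hypothetical catalog fragment mimicking the nested structure
# described above: catalogRef elements point to sub-catalogs, while dataset
# elements are leaf data resources.
CATALOG_XML = """<?xml version="1.0"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink" name="Top">
  <catalogRef xlink:href="grib/NCEP/catalog.xml" xlink:title="Forecast Models"/>
  <catalogRef xlink:href="nexrad/level3/catalog.xml" xlink:title="NEXRAD Level III"/>
  <dataset name="Latest Collection" ID="latest" urlPath="latest.nc"/>
</catalog>"""

def list_children(xml_text):
    """Return (sub-catalog hrefs, dataset IDs) for one catalog document."""
    root = ET.fromstring(xml_text)
    refs = [r.get("{http://www.w3.org/1999/xlink}href")
            for r in root.findall("t:catalogRef", NS)]
    datasets = [d.get("ID") for d in root.findall(".//t:dataset", NS)]
    return refs, datasets

refs, datasets = list_children(CATALOG_XML)
print(refs)      # sub-catalogs to traverse next
print(datasets)  # leaf data resources at this level
```

A crawler traverses the tree by fetching each returned href and repeating the same parse, a pattern that libraries such as Siphon encapsulate.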

**Figure 2.** Top level Unidata THREDDS Data Server (TDS) catalog listing.

The TDS catalog provides a powerful general catalog hierarchy model. However, the practical use of this model by scientists who produce geodata is what determines the possibility of data collaboration and harmonization—as well as the specific shapes and possible solutions for big data problems. Email correspondence with Unidata explained that the data placed in different sub-catalogs is produced and organized by different teams of scientists. Although Unidata TDS acts as a unified repository for diverse Earth data, there are no mandatory overarching organizing principles to enable data harmonization [75].

**Figure 3.** Nested catalogs. Only the first 11 entries are shown; the remaining entries are omitted.


**Figure 4.** Data resource (dataset) listing at the bottom of the catalog hierarchy. Only the first seven entries are shown; the remaining entries are omitted.

That being the case, the next step was to understand and describe different sub-structures organically adopted by different teams. After manual inspection and basic statistical analysis performed with custom Python scripts, the following information was compiled to broadly describe the different patterns of sub-catalog utilization (Table 1).


**Table 1.** Unidata TDS subcatalog big data characteristics.

Four general types of data are simultaneously held in the Unidata TDS repository: (1) Forecast model output, (2) observations (time series from in-situ instruments), (3) satellite imagery, and (4) radar imagery from a stationary radar network (NEXRAD, Next Generation Weather Radar) [76]. Each type contains much additional variety in its own hierarchy of sub-catalogs, but at this level, there are some clear and useful broad differences in data qualities that can guide our experiment.

In Table 1, the estimated catalog size is the total size of the metadata held in the catalog. Most of this metadata is completely redundant, but without knowing the deeper structure of this data, we would have to mirror all of it in order to enable the search and discovery capabilities that THREDDS does not support. We measured a maximum data transfer throughput of 4 MB/s, i.e., roughly 4-5 min to load 1 GB of catalog data. It appears possible to mirror the entire Unidata TDS metadata catalog in several hours, but the data throughputs we observed were not consistent, often slowing down by an order of magnitude. Furthermore, the speed of data processing (indexing and registering with a standard-compliant OGC CSW catalog) also consumes considerable time, compute, and storage resources. We do not have the capacity to register and search millions of records mostly containing redundant information. Finally, Unidata TDS data are added in near real-time according to specific patterns and structure in the sub-catalogs. If we attempted to copy and register all of that metadata, we would not be able to provide near real-time capabilities.

The last two columns in Table 1 show two critical qualities that determine what approach we needed to take to integrate that metadata into our systems.

If final datasets have "coarse" granularity, each dataset is a very large file and the size of the metadata is small in relation to the data size; for "coarse" datasets, we can copy, harvest, and index the metadata into our search system. "Fine" datasets stretch our technical capabilities to transfer and process metadata. "Very fine" records are too numerous (the data files too small) for us to be able to effectively synchronize or process their metadata.

If datasets are produced in a regular way (predictable spatiotemporal attributes), then we can harvest minimal information and model the entire catalog. However, for NEXRAD radar metadata, there is no regular pattern to metadata production. A new record could be added every 5 min or every 15 min, and this regularity/irregularity also varies in time and across different radar sites (different sub-catalogs). This fine-grained irregular data is the most challenging, because it can neither be harvested wholesale nor modeled in an accurate way. It requires a targeted combination approach. Additional considerations arise when tracking which datasets have expired and been removed; ideally, this should be accomplished without performing an expensive full scan of the TDS repository.

Further examination of the sub-catalogs structure for irregular (and regular) highly granular data revealed additional useful structural information. Some catalogs are "dynamic" (or "live" or "streaming")—they are updated with new data resources with regular (or irregular) frequency. Other catalogs are archival—they can be assumed to never change (until they expire and are deleted entirely). Three distinct types of sub-catalogs can be identified:


## *4.3. Crawling*

Big data catalogs normally need to complete many crawling tasks to grab metadata files, and to repeat scanning on a regular basis to capture metadata of newly observed datasets. Crawling is the fundamental information source for the catalog, and how to crawl intelligently is one of the largest challenges in big data search due to the repeated computational burden and the complexity of the content. When designing our crawling strategy, we considered the observation update frequency, time window, and observatory network organization, and made the crawler touch only the folders of updated sensors at the collection (sensor) level. Although a sensor has millions of metadata records, we crawl the metadata only at the sensor level; in other words, only one metadata record is crawled for each sensor (or instrument). Using this strategy, we can save numerous hours of crawling and metadata transfer over the network, especially when the network is unstable. After applying a parallel worker mechanism, we can have dozens of crawlers scanning and capturing new/updated metadata of petabytes of climate datasets.
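The sensor-level selection described above can be sketched as follows. The sensor names, timestamps, and update intervals here are illustrative stand-ins, not real TDS entries:

```python
from datetime import datetime, timedelta

# Hypothetical sensor registry: for each sensor-level folder we track when we
# last crawled it and how often it is expected to publish new observations.
sensors = {
    "radar/KTLX": {"last_crawled": datetime(2019, 8, 30, 12, 0),
                   "update_every": timedelta(minutes=6)},
    "radar/KFTG": {"last_crawled": datetime(2019, 8, 30, 12, 58),
                   "update_every": timedelta(minutes=6)},
    "model/GFS":  {"last_crawled": datetime(2019, 8, 30, 6, 0),
                   "update_every": timedelta(hours=6)},
}

def sensors_to_crawl(registry, now):
    """Select only the sensor-level folders that may have new data,
    instead of re-scanning every granule in the repository."""
    return sorted(name for name, s in registry.items()
                  if now - s["last_crawled"] >= s["update_every"])

now = datetime(2019, 8, 30, 13, 0)
print(sensors_to_crawl(sensors, now))
```

Here only two of the three folders are due for a visit; the recently crawled radar site is skipped entirely, which is where the crawling savings come from.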

Our crawler is different from most existing crawlers in the literature, because it is not a general-purpose search engine crawler. Typical crawlers download the entire web page, find links to follow, and add those links to the work queue. We cannot do the same, because the web content we are crawling (the TDS catalog) contains vastly redundant information that is not possible to download and process in its entirety without overloading the available computing and network resources. There are various sensors in the climate monitoring networks, and the set of sensors changes dynamically, with new sensors added and old sensors removed. We had to crawl the THREDDS Data Server to make sure all the observations were fully synchronized in our catalog. Our crawler design must incorporate knowledge of the metadata and its structure into its processing and queueing algorithms in order to download only essential information.

## *4.4. Indexing*

The third step is indexing, which extracts the spatiotemporal information from the crawled metadata and creates indexes for the data granules of the time series produced by each instrument. CSW provides the basic metadata registration and query model. However, the large granularity of metadata objects (and the lack of aggregation/relational capabilities) makes CSW inefficient for storing and querying large numbers of datasets that have only small variations in their metadata. A more efficient model is needed. This is a long-explored and essentially solved problem in computer science and informatics. Theodoridis et al. [77] summarize the basic approach: for a time-evolving spatiotemporal object, a snapshot of its evolution can be represented by a triplet {*o_id*, *s_i*, *t_i*}, i.e., object id, space-stamp, and time-stamp. This information allowed us to create a "repository production model" (Figure 5). We identified patterns in the catalog hierarchical structure that allowed us to determine which paths in the catalog folder hierarchy are "live" and which ones are "archival". In our crawler implementation (discussed in the next section), we used the structural path patterns to drive the crawler algorithm in two stages: a "full sync" stage, which copies the archival data, and an "update" stage, which monitors and refreshes the listing from "live" catalog paths.
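A minimal sketch of the two-stage idea follows. The path patterns and catalog paths are hypothetical; the real patterns were derived from the observed repository structure:

```python
import re

# Hypothetical patterns marking "live" catalog paths; everything else is
# treated as archival (immutable until it expires).
LIVE_PATTERNS = [re.compile(r"/latest\b"), re.compile(r"/realtime/")]

def classify(path):
    """Label a catalog path as 'live' (updated continuously) or 'archival'."""
    return "live" if any(p.search(path) for p in LIVE_PATTERNS) else "archival"

paths = [
    "nexrad/level3/realtime/KTLX/catalog.xml",
    "nexrad/level3/2018/KTLX/catalog.xml",
    "grib/NCEP/GFS/latest/catalog.xml",
]

# Stage 1 ("full sync") copies archival paths once; stage 2 ("update")
# periodically re-visits only the live paths.
full_sync = [p for p in paths if classify(p) == "archival"]
update    = [p for p in paths if classify(p) == "live"]
print(full_sync)
print(update)
```

The point of the split is that the expensive full traversal happens once, while the recurring update pass touches only the small live subset.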

**Figure 5.** Repository metadata production model.

The repository production model allows targeted crawling; however, the number of metadata resources remains too large to harvest, process, and index in its entirety, even when done in two stages to avoid redundant harvesting. We needed a second model that encompasses the metadata information structure (Figure 6). There are two issues we needed to solve: first, most of the metadata in the catalog is completely redundant; second, the metadata information scope is not consistent across the catalog. The two issues have the same source: catalogs, sub-catalogs, and data granules can all have metadata attached.

**Figure 6.** Repository information model.

In these examples from Unidata TDS, we see that metadata is attached to the hierarchical catalog structure in various ways. In the first example, a catalog contains some content metadata (for example, authorship), the sub-catalog contains additional content metadata (e.g., variable names) and spatial metadata, while each granule contains temporal metadata. In the next two examples, the distribution of metadata between catalogs and granules is different. The last example is a case where each catalog contains only a single data record (granule). In some cases, the metadata is simply duplicated between several catalog levels, while in others, one specific layer contains all the metadata. Another important detail is that the catalog hierarchy itself, i.e., the names of the parent catalogs, is also metadata for the data resources.

When combined, these two perspectives (the information change model and the information structure model) produce a model of the Unidata TDS repository that can be used to develop efficient (non-redundant) harvesting and representation of all contained metadata. By applying the production model to our crawler design, we were able to harvest only the information we knew had changed. Knowing the structure of data changes also allowed us to perform targeted incremental harvesting for near real-time discovery capability. We defined two types of objects: collections and granules. A collection contains content metadata (title, description, authorship, variable/band information, etc.) and one or more granules. Each granule contains only the spatiotemporal extent metadata. The OGC CSW catalog standard does not support the composition of collections and granules, so we used CSW to represent collections only, while granules had to be stored externally. We used the popular PyCSW software to hold collection metadata and extended it with a PostgreSQL relational database to store the relations between collections and granules and the granule metadata (Figure 7).
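The collection/granule split can be sketched with an in-memory SQLite stand-in for the relational side. The schema, table names, identifiers, and authority prefix below are illustrative assumptions, not the production schema:

```python
import sqlite3

# Illustrative relational store: collections mirror PyCSW records, granules
# hold only compact spatiotemporal extent and access information.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE collections (
    name TEXT PRIMARY KEY,          -- keyed to the PyCSW record identifier
    title TEXT
);
CREATE TABLE granules (
    id INTEGER PRIMARY KEY,
    collection_name TEXT REFERENCES collections(name),
    time_start TEXT,                -- ISO 8601 temporal extent
    time_end TEXT,
    url TEXT                        -- access link for the data resource
);
""")

coll = "edu.ucar.unidata:NWS/NEXRAD3/PTA/YUX"   # assumed authority prefix
db.execute("INSERT INTO collections VALUES (?, ?)", (coll, "NEXRAD Level III PTA"))
db.executemany(
    "INSERT INTO granules (collection_name, time_start, time_end, url) "
    "VALUES (?, ?, ?, ?)",
    [(coll, "2019-08-30T17:13:00Z", "2019-08-30T17:13:00Z",
      "Level3_YUX_PTA_20190830_1713.nids"),
     (coll, "2019-08-30T17:19:00Z", "2019-08-30T17:19:00Z",
      "Level3_YUX_PTA_20190830_1719.nids")])

# Granule lookup by collection and temporal range: the query the index
# service answers during the second search step.
rows = db.execute(
    "SELECT url FROM granules WHERE collection_name = ? AND time_start >= ?",
    (coll, "2019-08-30T17:15:00Z")).fetchall()
print([r[0] for r in rows])
```

Because granule rows carry only extents and a URL, millions of them remain cheap to store and query, while the bulky descriptive metadata lives once per collection.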

## *4.5. Two-Step Search Process*

When the metadata has been harvested into PyCSW and the temporal granule index saved in PostgreSQL, search clients can use these two data sources to retrieve final results for access. The search process takes place in two steps. Initially, the client searches the PyCSW store using standard search methods and queries. This returns a list of collection-level results. To get a list of granules, the search client sends a second query to the crawler service. The crawler service queries the granule index, refreshes the index with the latest granules if needed, and returns a list of granules for the requested collection. The search client can then use the collection-level CSW record and combine it with the selected granule information to produce granule-level CSW information. Figures 1 and 8-10 show these interactions from systems architecture and event sequence perspectives.
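The two-step flow can be sketched with stubbed services standing in for the PyCSW store and the crawler index API. All names and records below are invented for illustration:

```python
def csw_search(keyword):
    """Step 1 (stub): collection-level search against the CSW store."""
    catalog = {"NEXRAD3/PTA/YUX": ["g1713", "g1719"],
               "GFS/Global_0p25deg": ["g0600"]}
    return [c for c in catalog if keyword in c], catalog

def granule_index(collection, catalog):
    """Step 2 (stub): granule listing from the crawler index service."""
    return catalog[collection]

# Step 1: find matching collections; step 2: expand each into its granules.
collections, catalog = csw_search("NEXRAD3")
results = {c: granule_index(c, collections and catalog) for c in collections}
print(results)
```

The key property is that the expensive granule expansion happens only for collections the user actually selects, not for every match in the catalog.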

**Figure 7.** Metadata collection and granule resources stored in referentially linked PyCSW and PostgreSQL databases. PyCSW gmd: fileIdentifier corresponds as a key to collections table name field in the SQL database.

**Figure 8.** Implementation architecture for searching big data served via Unidata THREDDS Data Server (TDS). Although in our study only UCAR TDS is used, the system is designed to support any TDS repository as a data source.

**Figure 9.** Simple granule index retrieval during search.

**Figure 10.** Granule index retrieval when granule temporal range is outside the range stored in the index. Search triggers an additional step that immediately crawls the TDS catalog and updates the real-time granule metadata. Abbreviations: DB, Database.

So far, we have analyzed the Unidata TDS repository structure, built a model of the repository that can inform an effective crawling strategy, and defined the model for the product output for the crawler. We have also described how a search client should function. To complete our experiment, we built a crawler that follows our metadata model and demonstrates a web search capability for the entire contents of Unidata TDS.

## **5. Implementation**

We implemented a module system within EarthCube CyberConnector [17,78,79] to realize the proposed model (Figure 8). The implementation includes the search server system and the client system. Below, we introduce the searching capabilities enabled by these systems.

## *5.1. Crawler Service Implementation*

We built a web crawler that traverses Unidata TDS and extracts and stores essential metadata without using unnecessary resources. It is named 'thredds-crawler' and the source code is available via a public GitHub repository: https://github.com/CSISS/thredds-crawler.

The crawler is written in Python. It was built using common open source libraries for HTTP API interaction (Flask [https://www.fullstackpython.com/flask.html], Gunicorn [https://gunicorn.org/]), general XML processing (lxml [https://lxml.de/]), and database abstraction (SQLAlchemy [https://www.sqlalchemy.org/]). It uses native Python threading libraries to support concurrency. For traversing the Unidata THREDDS catalog and retrieving metadata, it uses the Unidata-provided Python Siphon library [https://github.com/Unidata/siphon].

To support our big data experiment requirements, the crawler is tightly integrated with the catalog software PyCSW [https://pycsw.org/] and a PostgreSQL [https://www.postgresql.org/] database. The crawler, PyCSW, and the database each run in a separate Docker [https://www.docker.com/] container. For this demonstration, all three services run on the same machine and communicate over the local network. The Docker Compose tool is used to connect and orchestrate the three containers. This architecture allows simple scaling out to multiple machines using containers, which allows for potentially substantial improvements in system performance.

The crawler Docker container runs as a web service hosted by Gunicorn, a Python HTTP server widely used for hosting web applications. It serves three HTTP API endpoints that perform the following functions: harvest, create index, and read index.

The harvest function loads the Unidata catalog XML from a specified catalog\_url using the Siphon library. The catalog contains a list of datasets. TDS has a feature to translate its dataset metadata into an ISO/OGC-compatible XML format. For each dataset being harvested, the harvester constructs a query to TDS to retrieve the ISO/OGC metadata for the dataset. The ISO/OGC metadata returned by TDS is, however, often incomplete, inaccurate, or inconsistent in some way. The crawler harvesting process therefore applies a chain of XML filters to the ISO/OGC metadata to rectify it with information from the native TDS dataset metadata. Once the metadata is downloaded and processed, it is saved in the PyCSW database directly using a PyCSW compatibility library.
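The filter-chain idea might look like the following sketch. The filter functions, element names, and authority prefix are invented stand-ins for the actual ISO/OGC rectification filters:

```python
import xml.etree.ElementTree as ET
from functools import reduce

def fix_identifier(root):
    """Illustrative filter: ensure the identifier carries an authority prefix."""
    el = root.find("fileIdentifier")
    if el is not None and not el.text.startswith("edu.ucar.unidata:"):
        el.text = "edu.ucar.unidata:" + el.text
    return root

def drop_empty_elements(root):
    """Illustrative filter: remove elements with no text and no children."""
    for parent in root.iter():
        for child in list(parent):
            if child.text is None and len(child) == 0:
                parent.remove(child)
    return root

# Each filter takes a metadata tree and returns a rectified tree; the chain
# is applied left to right.
FILTERS = [fix_identifier, drop_empty_elements]

def rectify(xml_text):
    root = ET.fromstring(xml_text)
    return reduce(lambda r, f: f(r), FILTERS, root)

root = rectify("<metadata><fileIdentifier>NWS/NEXRAD3/PTA</fileIdentifier>"
               "<abstract/></metadata>")
print(ET.tostring(root, encoding="unicode"))
```

Structuring the fixes as independent tree-to-tree functions keeps each correction small and lets the chain grow as new metadata inconsistencies are discovered.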

Indexing is similar to harvesting, but involves a strategy for targeting the datasets to be harvested and additional processing steps. During index creation, TDS catalogs and datasets are turned into collections and granules in our model. For each TDS dataset encountered, we determine the collection name. Crucially, the collection name is not the name of the catalog containing the dataset. We found that catalog names are inconsistent, but that TDS dataset ids contain consistent identification information. In TDS, dataset ids are kept unique by including timestamps in the dataset id. For example, in a dataset with id "NWS/NEXRAD3/PTA/YUX/20190830/Level3\_YUX\_PTA\_*20190830\_1713*.nids", the portion "*20190830\_1713*" is a timestamp. To turn TDS catalogs with datasets into collections with granules, we remove the temporal information to construct the collection id. Then, we download the dataset in the ISO/OGC XML format and transform its XML content with a filter function called the "collection builder". This function updates the dataset metadata to turn it into a more general form that describes the collection. It changes the identifiers stored in the metadata. It also adds standard-compliant additional fields that identify the metadata as describing a "series" ("series" in the ISO/OGC model, "collection" in our model). This process needs to be done only for the first dataset encountered for each collection; when processing additional datasets, the existing collection is reused. In TDS, the dataset spatiotemporal extent information is part of the catalog metadata, which means that we only need to download a single dataset's metadata to build the collection metadata, and we can index the remainder of the granules from the catalog metadata. This solves the redundancy issue that previously prevented TDS from being searchable. We also correct TDS identifiers to ensure that the namespace authority portion of the identifier is correctly set.
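The timestamp-stripping step might look like the following sketch, assuming temporal components of the form YYYYMMDD or YYYYMMDD_HHMM and an illustrative authority prefix:

```python
import re

# Assumed shape of temporal components inside TDS dataset ids:
# an optional underscore, 8 date digits, and an optional _HHMM time part.
TIME_PART = re.compile(r"_?\d{8}(?:_\d{4})?")

def collection_id(dataset_id, authority="edu.ucar.unidata"):
    """Derive a collection id by removing temporal components and
    prefixing an authority namespace (both assumptions for illustration)."""
    stripped = TIME_PART.sub("", dataset_id)
    stripped = re.sub(r"/+", "/", stripped).strip("/_")  # tidy leftover slashes
    return f"{authority}:{stripped}"

ds = "NWS/NEXRAD3/PTA/YUX/20190830/Level3_YUX_PTA_20190830_1713.nids"
print(collection_id(ds))
```

Every granule of the same radar product collapses to the same collection id, which is exactly what lets the indexer reuse one collection record for millions of datasets.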

The following tables illustrate the process of extracting collections from granule identifiers for multiple types of data. Table 2 shows the catalog paths of TDS datasets, Table 3 shows how the collection identifier is generated, and Table 4 shows the resulting collection identifiers.

**Table 2.** Catalog paths of TDS datasets for three types of data. The catalog path hierarchy is marked in green. The dataset filename is marked in red.


**Table 3.** TDS dataset identifiers for three types of data. The portion of the identifiers that contain temporal information is highlighted.


**Table 4.** Collection identifiers for the example dataset IDs. They are calculated by removing temporal information and prefixing an authority namespace field.


When the index harvesting is complete, the collection information (in the OGC/ISO 19139 XML metadata format) is stored in PyCSW. The granule information is stored in a compact SQL index store (Figure 7). Once the index is created, it can be retrieved from the crawler web service using the HTTP API (GET /index). These requests take a collection name and temporal extent as parameters. Although our data model includes granule spatial and temporal extent, at the time of publication, only temporal index queries were implemented. The service checks the index data store to see whether the latest available granules are newer than the requested time extent. If more recent granules are not required, the crawler returns a list of granules in a compact JSON format (Figure 9). However, if the index does not contain recent enough granules, then the index service performs a partial "refresh" indexing of the TDS repository: it uses the TDS catalog link stored in our PyCSW collection and re-runs the index process described here (Figure 10). As discussed in the repository analysis, the TDS catalog is organized with some sub-catalogs storing archival information, while others contain near real-time "live" data. The crawler index refresh process takes advantage of that structure: it ignores the old sub-catalogs and only indexes those that contain more recent and unknown data. This makes near real-time index retrieval fast and efficient.
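The refresh decision can be sketched as follows, with a stub standing in for the targeted re-indexing of the live sub-catalogs; all names and timestamps are illustrative:

```python
from datetime import datetime, timedelta

def get_index(granules, requested_end, recrawl):
    """granules: list of (timestamp, url) sorted by time.
    Re-crawl only when the requested temporal extent ends after the newest
    granule we have indexed; otherwise serve straight from the index."""
    latest = granules[-1][0] if granules else datetime.min
    if requested_end > latest:
        granules = granules + recrawl(since=latest)  # partial, targeted crawl
    return granules

def fake_recrawl(since):
    # Stand-in for re-indexing only the "live" sub-catalogs newer than `since`.
    return [(since + timedelta(minutes=6), "Level3_YUX_PTA_1719.nids")]

index = [(datetime(2019, 8, 30, 17, 13), "Level3_YUX_PTA_1713.nids")]
updated = get_index(index, datetime(2019, 8, 30, 18, 0), fake_recrawl)
print(len(updated))
```

Requests whose time range is already covered never touch TDS at all, which is what keeps the common case fast.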

Both harvesting and index creation use the same multi-threaded queue strategy to achieve higher performance. Normally, most of the time is spent waiting for data to be transmitted over the network. By using many threads, we can increase the saturation of both the network and the local compute and memory resources, which makes the metadata available much faster.
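A sketch of the effect, with a simulated 50 ms "download" standing in for a metadata fetch; the URLs and worker count are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    """Stand-in for a network-bound metadata download."""
    time.sleep(0.05)          # simulated network latency
    return (url, "<metadata/>")

urls = [f"catalog/{i}.xml" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

# 20 fetches of 50 ms each would take ~1 s serially; with 10 workers the
# wall time is roughly two batches of waiting, i.e., much less.
print(len(results), round(elapsed, 2))
```

Because the work is I/O-bound, Python threads are sufficient here; the same pattern scales to dozens of crawler workers.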

## *5.2. Search System Implementation*

The search system is implemented based on the previously developed EarthCube CyberConnector infrastructure building block [17]. CyberConnector is a Java-based web application that supports discovery and visualization of data from CSW catalogs [17]. We extended CyberConnector to support accessing the metadata harvested and indexed by the thredds-crawler described in the previous section. We modified the CyberConnector search client to perform a two-stage search. The web application user selects the "Search" function (Figure 11) and a time range, which is used by the thredds-crawler index service to determine whether a granule refresh is needed. The web browser sends an AJAX request with the search parameters to the CyberConnector web application, which queries the thredds-crawler PyCSW service for collections matching the query parameters and returns a list of collections. To see the granules available in a collection, the user clicks the "List Granules" button (Figure 12). This issues another request to CyberConnector for a granule list in the specified temporal extent. The CyberConnector web application proxies the granule list request to the thredds-crawler indexing service, which either returns a list of granules directly (Figure 9) or harvests TDS to update the index and then returns the list (Figure 10). The client receives a list of granules, which can then be downloaded or visualized (Figure 13).

**Figure 11.** Search client web interface.

**Figure 12.** Search results with "List Granules" button.

**Figure 13.** Granules list for a collection. The buttons allow users to view metadata, download the dataset, or visualize it.

## **6. Experiment and Results**

Based on the implemented catalog system, we conducted several experiments to validate the feasibility of the proposed approach. Datasets for climate science are generally very large because of their long-term operation and high temporal resolution. We took the UCAR NEXRAD dataset [80] and the RDA ASR dataset (53.09 terabytes) [81] as our demonstration examples. Search capabilities for the two datasets were established in the EarthCube CyberConnector. We ran a complete set of tests on the search system; the results are introduced below.

## *6.1. Searching the NEXRAD Dataset*

NEXRAD is a very important dataset for climate science research. It currently comprises 160 sites throughout the United States and selected overseas locations (as shown in Figure 14). The basic original datasets, including three meteorological base data quantities (reflectivity, mean radial velocity, and spectrum width), are called Level II. The derived products are called Level III, which include numerous meteorological analysis products. All NEXRAD Level II data are available via NCEI, as well as via NOAA big data plan cloud providers, Amazon Web Services (http://thredds-aws.unidata.ucar.edu/thredds/catalog.html) and Google Cloud (https://cloud.google.com/storage/docs/public-datasets/nexrad). UCAR provides the near real-time observed data via their THREDDS Data Server (http://thredds.ucar.edu). Unfortunately, all these data repositories are still non-searchable at present, because it is a huge challenge for any catalog to index and search such a large number of metadata files for the frequently updated radar data records (every 6 min). We used this dataset to prove that the proposed cataloging approach works well on frequently updated big datasets.

**Figure 14.** NOAA NCDC Radar Data Map (NEXRAD Level II and III).

The completed system consists of the harvester/indexer service and the search client that is available to the user as a web application. As a result, users are able to search diverse heterogeneous Earth system observation and modeling datasets simultaneously. Once the metadata is found, users can use the CyberConnector visualization system to simultaneously visualize near real-time NEXRAD radar, satellite observation, and forecast simulation model product data. The system performance characteristics of this approach are significantly improved over the existing naive method of harvesting all of the datasets' metadata.

## *6.2. Searching UCAR RDA (Research Data Archive) TDS Repository*

The NSF-funded NCAR Computational and Information Systems Laboratory (CISL) maintains the Research Data Archive (RDA), which stores over 11,000 terabytes of climate datasets in its high-performance data storage system.

RDA hosts many climate datasets at present, and the Arctic System Reanalysis (ASR) is one of them. ASR is a demonstration regional reanalysis for the greater Arctic developed by Ohio State University. The ASR version 2 dataset (the latest version) is served via RDA with a total volume of 53.04 terabytes. The horizontal resolution is 15 km and the temporal coverage is from 2000 to 2016. It has 34 pressure levels (71 model levels); the analysis fields comprise 31 surface (including 3 soil) variables and 11 upper-air variables, and the forecast fields comprise 71 surface (including 3 soil) variables and 17 upper-air variables.

RDA provides TDS for most of its archived datasets. We harvested the metadata of ASR from its TDS and made it publicly available in CyberConnector. As shown in Figure 15, scientists can search the ASR dataset by providing keywords, a spatial extent, or a temporal range. The ASR data is in NetCDF format, which is displayable in COVALI. We demonstrated searching the ASR dataset in COVALI and visualized the temperature at 2 m above the surface over a 12-h period. COVALI and RDA were deployed in two remotely distributed facilities, and the interactions between COVALI and the RDA big data storage were conducted via the standard service interface over the network. The experiment proves that the proposed solution works well for enabling search on remote big data.
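The harvesting described above walks TDS catalog documents. A minimal sketch of that walk is shown below, using a hand-written XML sample in the TDS InvCatalog 1.0 schema; the dataset names and paths are illustrative, not real RDA entries:

```python
import xml.etree.ElementTree as ET

# Hand-written sample catalog in the THREDDS InvCatalog 1.0 schema.
# Names and urlPath values are hypothetical, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <dataset name="ASR v2 15km analyses">
    <dataset name="asr15km.anl.2D.20160101.nc" urlPath="rda/sample/path.nc"/>
    <catalogRef xlink:href="2016/catalog.xml" xlink:title="2016"/>
  </dataset>
</catalog>"""

THREDDS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK = "http://www.w3.org/1999/xlink"

root = ET.fromstring(SAMPLE)
# Leaf datasets carry a urlPath; catalogRef elements point to sub-catalogs
# that a crawler would enqueue for later visits.
granules = [d.get("name") for d in root.iter(f"{{{THREDDS}}}dataset")
            if d.get("urlPath")]
subcatalogs = [r.get(f"{{{XLINK}}}href")
               for r in root.iter(f"{{{THREDDS}}}catalogRef")]
print(granules, subcatalogs)
```

A real harvester would fetch each `catalogRef` target over HTTP and recurse; the sampling strategy decides which sub-catalogs actually need to be visited.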

**Figure 15.** ASR (Arctic System Reanalysis) search results and visualization of the temperature at 2 m above the surface in the CyberConnector COVALI visualization system.

## *6.3. Performance Evaluation*

The traditional approach to cataloging climate datasets is to fully harvest the metadata files of every single data record. We had previously implemented the searcher using this traditional method, but its performance was too slow to sustain operation in a practical big-data cataloging scenario. After applying the new cataloging strategy, we tested it by crawling hundreds to thousands of records from the UCAR THREDDS Data Server. We tested different numbers of parallel workers (40, 20, 10, 5, and a single worker, respectively) to measure the improvement from parallel crawling. Figure 16 displays the time costs of the tests, comparing the performance of the traditional approach and the proposed approach. The results demonstrate that the proposed approach outperforms the traditional approach by at least ten times in overall time cost (from ~10 s to ~1 s) and shows significant improvements in harvesting speed, storage use, and search speed as the number of datasets being processed grows.
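The parallel-worker setup can be sketched as follows; `fetch_record` here is a stand-in that simulates a network-bound metadata download rather than hitting a real TDS server:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_record(record_id):
    """Stand-in for downloading one metadata record (simulated latency)."""
    time.sleep(0.01)
    return record_id

def crawl(record_ids, workers):
    """Crawl all records with a fixed-size worker pool; return count and time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch_record, record_ids))
    return len(results), time.perf_counter() - start

# Compare the worker counts used in the experiment.
for workers in (1, 5, 10, 20, 40):
    n, elapsed = crawl(range(100), workers)
    print(f"{workers:2d} workers: {n} records in {elapsed:.2f}s")
```

Because metadata fetching is I/O-bound, thread-based workers scale nearly linearly until the server's network capacity becomes the bottleneck.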

**Figure 16.** Performance comparison (time in seconds) of the traditional harvesting approach (**a**) and our approach (**b**), sampled 5 times for crawling 125 records.

Search time cost has two components: the time to search for collections in the catalog and the time to retrieve the granule list from the granule index. Figure 17 shows that search result retrieval is extremely fast in our system.
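The two-step flow can be illustrated with in-memory stand-ins for the collection catalog and the granule index (collection IDs and granule names here are hypothetical):

```python
# Step 1 data: collection-level metadata in the catalog.
collections = {
    "asr_v2": {"keywords": {"arctic", "reanalysis"}, "format": "NetCDF"},
    "nexrad_l2": {"keywords": {"radar", "reflectivity"}, "format": "Level II"},
}
# Step 2 data: per-collection granule lists in the granule index.
granule_index = {
    "asr_v2": ["asr15km.anl.2D.20160101.nc", "asr15km.anl.2D.20160102.nc"],
}

def search(keyword):
    # Step 1: match collections in the catalog.
    hits = [cid for cid, meta in collections.items()
            if keyword in meta["keywords"]]
    # Step 2: resolve granule lists only for the matched collections,
    # so the bulk of the granule metadata is never touched.
    return {cid: granule_index.get(cid, []) for cid in hits}

print(search("arctic"))
```

Keeping step 2 lazy is what makes retrieval fast: the granule index is only consulted for the handful of collections the user actually opens.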

**Figure 17.** Search performance (time in seconds) with two different spatial extent parameters. (**a**) Time to query the catalog (collection searching, step 1); (**b**) time to query the granule index (granule searching, step 2).

The search currently supports filters including keywords, data format, and spatiotemporal extents. All of them are fixed filters with little uncertainty; therefore, the returned results stay the same as long as the metadata base neither adds new records nor deletes existing ones. Result completeness is 100%, because exactly the records matching the filter conditions are returned. Users can narrow down the spatiotemporal extent based on their interests and provide one or more keywords that match the data field names. The first-page relevance of the search results depends on the relationships between the input keywords and the metadata field values. Based on our experience with climate scientists, we find that they normally do not input any keywords and only use spatiotemporal filters to explore what a catalog offers. Once they have a region of interest or a time window, they have an impression of what data might be available; they come to the catalog mainly to find the access URLs to download or visualize the data files. The results from our search client are normally numerous because of the loose filters the scientists provide, but the results on the first page are usually well related to the scientists' needs. A more intelligent search, such as a semantics-based search that could return first-page results with higher relevance, will be studied in the next stage of this work.
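Because the filters are deterministic, matching reduces to a pure predicate over the metadata fields. A sketch with illustrative records and field names (not the system's actual schema):

```python
def matches(record, keyword=None, fmt=None, bbox=None, time_range=None):
    """Apply the fixed filters; every condition is deterministic, so the
    same query over the same metadata base always returns the same set."""
    if keyword and keyword.lower() not in record["name"].lower():
        return False
    if fmt and record["format"] != fmt:
        return False
    if bbox:  # bbox = (west, south, east, north); require spatial overlap
        w, s, e, n = record["bbox"]
        if e < bbox[0] or w > bbox[2] or n < bbox[1] or s > bbox[3]:
            return False
    if time_range:  # (start_year, end_year); require temporal overlap
        start, end = record["time"]
        if end < time_range[0] or start > time_range[1]:
            return False
    return True

rec = {"name": "Arctic System Reanalysis", "format": "NetCDF",
       "bbox": (-180, 60, 180, 90), "time": (2000, 2016)}
print(matches(rec, keyword="arctic", bbox=(-160, 65, -140, 75)))  # True
```

Unspecified filters are simply skipped, which mirrors how scientists typically query with only a spatiotemporal window and no keywords.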

## **7. Discussion**

Our solution to the big data volume, variety, and velocity challenges discussed in this paper consists of a novel metadata model and a cyberinfrastructure architecture and implementation derived from the model. The metadata model combines the description of metadata content (the "information model") with the description of metadata repository structure and behavior. The cyberinfrastructure consists of a crawler service that takes advantage of the metadata model to optimize the THREDDS crawling strategy and eliminate the transfer and processing of redundant metadata. Additionally, the metadata repository model permits the crawler service to perform incremental metadata transfer, which enables real-time search capability. The demonstrated cyberinfrastructure also includes an interoperable catalog service that uses the metadata model to minimize the storage of redundant information. Finally, a search client that uses the catalog and crawler services is implemented.

## *7.1. Can the Proposed Solution Address the Volume Challenge?*

Metadata volume is ~25 GB for the UCAR RADAR dataset. The traditional method for harvesting metadata (as discussed in Section 6.3) is able to process approximately one record (with an approximate size of 100 KB) per second. Completely ingesting all of the THREDDS RADAR metadata at the observed harvesting rate would take 250,000 s, or ~70 h. By using the proposed metadata model and cataloging system, we observe harvesting rates that are at least 10 times faster, which permits daily synchronization of all Unidata TDS metadata.
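The volume arithmetic above can be checked directly:

```python
# Volume figures from the text: ~25 GB of metadata, ~100 KB per record,
# one record per second with the traditional harvester.
records = int(25e9 // 100e3)          # ~250,000 metadata records
hours_traditional = records / 3600    # at 1 record/s
hours_proposed = hours_traditional / 10  # observed >= 10x speed-up
print(records, round(hours_traditional, 1), round(hours_proposed, 1))
# 250000 records, ~69.4 h traditionally, ~6.9 h with the proposed system
```

At ~7 h per full pass, a daily synchronization cycle fits comfortably within a 24-hour window, which the traditional ~70 h pass cannot.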

## *7.2. Can the Proposed Solution Address the Velocity Challenge?*

We determined that new (live) RADAR metadata is being generated at 330 records per minute. Our maximum harvest capacity (constrained by Unidata THREDDS network capacity) is 60 records per minute. Using the traditional method, we cannot keep up with the data velocity. Using the indexing harvester approach, we can process up to 1400 records per minute. This exceeds the velocity of THREDDS data production. Additionally, by using incremental index update during the client search request exchange, we can target the indexing harvest process to the exact sub-catalog containing the updated information and thus provide real-time search capability for this high-velocity data.
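The velocity comparison reduces to a rate check against the data stream:

```python
# Rates from the text, in metadata records per minute.
generation_rate = 330    # new (live) RADAR metadata being produced
traditional_rate = 60    # full harvest, constrained by network capacity
indexing_rate = 1400     # the indexing harvester approach

print(traditional_rate >= generation_rate)  # False: falls behind the stream
print(indexing_rate >= generation_rate)     # True: keeps up, with ~4x headroom
```

Only a harvester whose sustained rate exceeds the generation rate can ever converge to a synchronized mirror; the indexing harvester clears that bar with margin to spare for catch-up after outages.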

## *7.3. Can the Proposed Solution Reduce Metadata Crawling Redundancy?*

The solution demonstrated here is able to reduce redundancy in crawling and storage resource consumption. For example, using the traditional method with Forecast Models catalog, ~7000 records are downloaded. The total storage used is 1.85 GB. The same metadata can be processed using our approach by downloading only 45 sample metadata records (2.2 MB) that represent collection level information. This represents a 99% reduction in data transmission and storage costs.
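The stated 99% reduction follows directly from the numbers above:

```python
# Forecast Models catalog figures from the text.
traditional_bytes = 1.85e9   # ~7000 full metadata records, 1.85 GB
sampled_bytes = 2.2e6        # 45 sample records, 2.2 MB
reduction = 1 - sampled_bytes / traditional_bytes
print(f"{reduction:.2%}")    # 99.88%, consistent with the stated ~99%
```

The reduction comes from transferring only one representative sample per collection rather than every granule-level record.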

## *7.4. What Are the Benefits and Drawbacks of the Proposed Solution Compared to Other Big Data Searching Strategies?*

The solution demonstrates the expected benefits described at the beginning of this study. The main drawback of this solution is the model and software system complexity. Custom software has to be developed to intelligently process catalogs as they are being harvested. To get complete and accurate results, the ingested metadata must be cleaned and transformed to fill in missing pieces of information and to make it conform to our model. Although our approach is general enough to work with multiple TDS repositories, in practice, inconsistencies and additional varieties in each repository must be reconciled using custom code. Our work demonstrates that it is possible to build a unified and highly efficient searchable catalog system for large and heterogeneous Earth system data repositories that supports real-time queries; however, every solution has its limitations and costs. In this case, the costs are complexity in software and systems architecture, which means increased software development and maintenance costs.

## **8. Conclusions**

This paper proposed and demonstrated a novel cyberinfrastructure-based cataloging solution that enables an efficient two-step search on big climatic datasets by leveraging existing data centers and state-of-the-art web service technologies. To validate its feasibility, we used the huge datasets served by the UCAR THREDDS Data Server (TDS), which serves petabyte-level ESOM data and updates hundreds of terabytes of data every day. We analyzed the metadata structure in TDS and created an index for data parameters. A metadata registration model was developed that defines constant information, delimits variable information, and exploits spatial and temporal coherence in metadata. The model derives a sampling strategy for a high-performance concurrent web crawler bot, which is used to mirror the essential metadata of the big data archive without overwhelming network and computing resources. The metadata model, crawler, and standards-compliant catalog service form an incremental search cyberinfrastructure that allows scientists to search big climatic datasets in near real time. We experimented with the approach on both the UCAR TDS and the NCAR RDA TDS, and the results prove that the proposed approach achieves its design goal, a significant breakthrough for climate data servers, most of which are currently non-searchable. The solution identified redundant information and determined the sampling frequencies needed to keep the unpredictable parts of the source catalog synchronized with our downstream mirror catalog. An automated hierarchical crawler-indexer and a complementary search system were implemented using the pre-existing EarthCube CyberConnector. The metadata crawling and access performance validates our integrated approach as an effective method for dealing with the big data challenges posed by heterogeneous, real-time Earth system observation and model data.
However, although the proposed approach outperforms the traditional searching solution for big data, it is still time-consuming in both the crawling and searching processes, and it may fall behind when dealing with real-time streaming data. In the future, we will study how to further reduce the time spent crawling redundant metadata and seek a high-performance method for rapid and intelligent search.

**Author Contributions:** Conceptualization, Liping Di and Ziheng Sun; methodology, Ziheng Sun and Juozas Gaigalas; software, Juozas Gaigalas and Ziheng Sun; validation, Juozas Gaigalas and Ziheng Sun; formal analysis, Liping Di, Ziheng Sun, and Juozas Gaigalas; investigation, Juozas Gaigalas; resources, Liping Di; data curation, Juozas Gaigalas, Ziheng Sun, and Liping Di; writing—original draft preparation, Juozas Gaigalas; writing—review and editing, Liping Di and Ziheng Sun; visualization, Juozas Gaigalas and Ziheng Sun; supervision, Liping Di and Ziheng Sun; project administration, Liping Di and Ziheng Sun; funding acquisition, Liping Di.

**Funding:** This research was funded by the National Science Foundation, grant numbers AGS-1740693 and CNS-1739705; PI: Liping Di.

**Acknowledgments:** We sincerely thank UCAR, UCAR Unidata Support Team, and the authors of the software, libraries, tools, and datasets that we have used in this work.

**Conflicts of Interest:** The authors declare no conflicts of interest.

## **References**


©2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
