*Article* **OTNEL: A Distributed Online Deep Learning Semantic Annotation Methodology**

#### **Christos Makris \* and Michael Angelos Simos \***

Department of Computer Engineering and Informatics, University of Patras, 26504 Patras, Greece

**\*** Correspondence: makri@ceid.upatras.gr (C.M.); asimos@ceid.upatras.gr (M.A.S.); Tel.: +30-2610-996-968 (C.M.)

Received: 15 September 2020; Accepted: 27 October 2020; Published: 29 October 2020

**Abstract:** Semantic representation of unstructured text is crucial in modern artificial intelligence and information retrieval applications. The semantic information extraction process from an unstructured text fragment to a corresponding representation from a concept ontology is known as named entity disambiguation. In this work, we introduce a distributed, supervised deep learning methodology employing a long short-term memory-based deep learning architecture model for entity linking with Wikipedia. In the context of a frequently changing online world, we introduce and study the domain of online training named entity disambiguation, featuring on-the-fly adaptation to underlying knowledge changes. Our novel methodology evaluates polysemous anchor mentions with sense compatibility based on thematic segmentation of the Wikipedia knowledge graph representation. We aim at both robust performance and high entity-linking accuracy. The introduced modeling process efficiently addresses conceptualization, formalization, and computational challenges for the online training entity-linking task. The novel online training concept can be exploited for wider adoption, as it is considerably beneficial for targeted-topic, online global-context consensus in entity disambiguation.

**Keywords:** named entity disambiguation; text annotation; word sense disambiguation; ontologies; Wikification; neural networks; machine learning

#### **1. Introduction and Motivation**

Named entity disambiguation (NED) is a process involving textual mention resolution and assignment to predefined concepts from a knowledge base or concept ontology. The deterministic identification and linking of semantically dominant entity mentions, based on contextual information available, is not trivial in most cases; ambiguity is common on unstructured corpora, as homonymy and polysemy phenomena are inherent to natural languages.

Advances in the domains of artificial intelligence, information retrieval, and natural language processing have outlined the requisition of semantic knowledge input, as common paradigms like the bag-of-words representation are proven inefficient for deeper knowledge extraction and, hence, higher accuracy. As a result, NED is a common step for many relevant tasks including information retrieval [1], data mining [2,3], and web and semantic search [4–7], consequently being a vital component of the artificial intelligence (AI), internet, and information industries.

One of the fundamental challenges in NED involves the maintenance of knowledge resources, especially as new domains and concepts arise or change dynamically over time. In recent years, Wikipedia has been leveraged as a knowledge base and concept universe due to its online nature. The introduction of Wikipedia in the domain yielded noteworthy leaps in classic challenges such as knowledge acquisition and adversarial knowledge resolution, as its articles tend to summarize widespread and commonly accepted concepts.

Deep learning architectures have recently been established in several scientific fields including machine translation, computer vision, medical image analysis, speech recognition, audio recognition, social networks, bioinformatics, and material inspection. Such methodologies have successfully modeled high-level abstraction patterns by leveraging deep multilevel transformations, and several approaches have successfully addressed the NED problem. However, a series of factors constitute a challenging background for the task.

The engagement of deep learning architectures presupposes projecting the training input to lower-dimension spaces, as computational challenges arise with large-scale training datasets. Consequently, the training input entropy is abstracted during a dimensionality reduction process that aims to fit sparse input spanning plenteous domains into fixed-size dimension spaces, mainly for computational feasibility purposes. As a result of this trade-off between computational complexity and accuracy, the inference of semantics deviating from the dominant patterns is burdensome. In addition, the extensive training process required in the scope of a general-purpose NED application is demanding from a computational complexity perspective, as outlined in [8]. Our introduced methodology employs an alternative modeling and dimensionality reduction approach, detailed in Section 3.

Another fundamental adversity for the task resides in knowledge acquisition, including adversarial knowledge resolution and the impact of noise in the training input. Successful deep learning applications require vast training sets. As the task relies on facts for the semantic acquisition of pertinent sense representation associations in the available context, the intricacy of semantic attainment, defined as the *knowledge acquisition bottleneck* in [9], is dominant. Consequently, attaining high-quality data at scale for wide semantic coverage is not trivial. Similar works, as detailed in [8], often rely on diffusive sources ranging from structured ontologies to unstructured corpora, for example, by inducting context with unsupervised techniques for the inference of co-reference information. The impact of noise in the training input is critical for attaining high accuracy at scale. At the same time, uncontrolled data cleansing approaches aiming at eliminating anomalies in the input training sets could result in substantial information loss for the resolution of more intricate and less frequent senses of a polysemous anchor.

In this work, we propose a novel approach for efficient NED. In particular, by employing divergent thinking on the main task impediments described above, we propose a model for dimensionality reduction according to topical confinement in the context of online training. We focus on minimizing the impact of input data loss and simplifying the task by leveraging topical inference using a semantic ontology information network representation of Wikipedia.

The remainder of this manuscript is organized as follows: the necessary background and related work are presented in Section 2. Our methodology and implementation details are presented in Section 3. The experimental process is described and assessed in Section 4. Our final conclusions are presented in Section 5, along with potential improvements and future work.

#### **2. Background and Related Work**

The NED task requires a knowledge base or concept ontology as its foundation for the identification of named entities, in order to resolve text segments to a predefined concept or sense universe. Human domain experts also need such a knowledge ontology for identifying the appropriate sense of a polysemous mention within a context. As the creation of knowledge resources by human annotators is an expensive and time-consuming task, facing implications as new concepts or domains emerge or change over time, the knowledge acquisition issue has been pervasive in the field. The maintainability, coverage, and knowledge acquisition challenges have been outlined for several manually created ontologies applied to the NED task. As a result, attempts at unifying such ontologies emerged; however, they encountered accuracy issues throughout the unification process.

As Wikipedia is an online crowdsourcing encyclopedia with millions of articles, it constitutes one of the largest online open repositories of general knowledge. Wikipedia articles are created and maintained by a multitudinous and highly active community of editors. As a result, widespread and commonly accepted textual descriptions are created as derivatives of a diverse knowledge convergence process in real time. Each article can be interpreted as a knowledge entity. As Wikipedia's online nature inherits the main principles of the web in a wide and highly active user base, named entity linking with Wikipedia is among the most popular approaches in several similar works. The rich contextual and structured link information available in Wikipedia along with its online nature and wide conceptual coverage can be leveraged toward successful high-performance named entity linking applications.

#### *2.1. Named Entity Disambiguation Approximations*

Among natural language processing domain tasks, NED and word sense disambiguation (WSD) are acknowledged as challenging for a diversity of aspects. WSD was defined as AI-complete in [8]. AI-completeness is defined by analogy to the nondeterministic polynomial completeness (NP-completeness) concept in complexity theory.

Several formalization approaches have been applied at entity linking coarseness scopes ranging from specific sense ontological entities to generic domains or topics. The disambiguation coverage spans from the disambiguation of one to several senses per sentence. Domain confinement assumptions may also be present on the entity universe.

According to [8], WSD and, hence, NED approaches may be broadly classified into two major categories:

- Supervised machine-learning methodologies, used for inferring candidate mentions on the basis of knowledge inference from labeled training sets, usually via classification techniques;
- Unsupervised methodologies, based on unstructured corpora for the inference of semantic context via unsupervised machine-learning techniques.

A second-level distinction can further be made according to the knowledge sources involved.


Supervised knowledge-based NED methodologies are in the limelight of current research focus. Wikipedia is commonly employed for underlying knowledge base representation as an entity linking ontology.

#### *2.2. Early Approaches*

The pioneering works on the NED problem using Wikipedia for the entity linking approach were [9–11]. These works proposed methods for semantic entity linking to Wikipedia. Those early methods clearly captured the technical impediments of the task, while proposing some effective early solutions. Foundations for future work were laid by establishing the commonness feature for the task.

In [12,13], a more sophisticated approach to the task led to the introduction of relatedness among Wikipedia articles as an invaluable measure of semantic compatibility. Relatedness was defined as the inbound link overlap between Wikipedia articles. The coherence of input text anchors disambiguated with unambiguous mentions of the input was used as the core of the introduced models. Specifically, ambiguous mentions were ranked on the basis of a global score formula involving statistics, relatedness, and coherence for the final selection.

The segmentation of the ranking scores to local and global resulted in further improvements in [14]. Local scores were leveraged for the contribution representation of contextual content surrounding an anchor being processed. The consensus among every input anchor disambiguation within the full frame of the input was modeled as a global score. The problem was formalized as a ranking selection and a quadratic assignment problem, aiming at the approximation of an entity mention for each anchor on the basis of a linear summation of local and global scores.

Another suite with a focus on accuracy and computational efficiency of short input was introduced in [15]. The work is particularly popular and established as a baseline to date. Relatedness, commonness, and other Wikipedia statistics were combined in a voting schema for the selection of the top scoring candidate annotation for a second-step evaluation and selection process.

An alternate modeling approximation was used by [16,17]. An undirected weighted graph was used for the knowledge base representation. The graph nodes were used to model entity annotations or candidate entities. The weighted edges among entities of the graph were used for representing relatedness. Weighted edges among mentions and entities of the graph were used to model contextual similarities. These representations were referred to as the mention-entity graph in [16], and a dense subgraph search approximation was used for the selection of a subgraph of anchor nodes, each containing a unique mention-entity edge. In [17], the representation was referred to as a referent graph, and the methodology employed was based on the PageRank algorithm.

In [18], some innovative approaches for text annotation and entity linking were contributed. Voting schema approximations were introduced, along with a novel method inspired by the human interpretation process on polysemous contexts. An iterative method approach was employed for the modeling process. The iteration converged to proposed annotations while balancing high accuracy with performance, leveraging established metrics derived from the Wikipedia graph representation.

A graph knowledge base representation was employed by [19], and a Hyperlink-Induced Topic Search (HITS) algorithm variant using a breadth first search traversal was evaluated with the candidate entities of the input text anchors as initial seed nodes. Coreference resolution heuristics, extension of surface forms, and normalization contributions to the system constituted the core of this work.

The architecture of [15] was refined and redesigned in WAT [20], as several methodology variants were introduced for experimental assessment. The three-component architecture was revisited by some PageRank and HITS algorithm-based approaches. The main components were thoroughly assessed, and results for a series of methodologies were contributed to the community.

#### *2.3. Recent Deep Learning Approaches*

A leading deep learning approximation for the problem was presented in [21]. A vector space representation was used for modeling entities, context, and mentions. The core methodology architecture consisted of a convolutional neural network, in various context windows, for the projection of anchors on the continuous vector space. Finally, a computationally demanding methodology employing a tensor network introduced context and mention interaction information. A similar vector space representation approach of mentions and entities was also employed in [22]. The core disambiguation methodology extended the skip-gram model using a knowledge base graph. At a second level, the vector space proximity optimization of vectors representing coherent anchors and entities was used for concluding the process.

The authors of [23] introduced a suite combining established approaches, such as graph representation and knowledge base statistics, with deep learning benefits, involving an ensemble consensus disambiguation step. Specifically, variable sized context windows were used by a "neural attention mechanism" with an entity embedding representation.

As most systems rely on heuristics or knowledge-based approaches for conducting semantic relation evaluations, such as coreference, relatedness, or commonness for the conceptual compatibility assessment, the authors of [24] followed a neural entity linking approach, modeling relations as latent variables. Specifically, they extracted semantic relations in an unsupervised manner using an end-to-end optimization methodology for selecting the optimal mentions. The proposed multi-relational model exhibited high performance throughout an experimental evaluation process.

The problem was also addressed in [25] by leveraging a knowledge graph representation. This work was based on the observation that link density on the representation graph plays a key role, as the vertex degree had a major impact on the selection of strongly coherent nodes. To that end, their methodology induced a density enhancement step on the graph representation on the basis of cooccurrence statistics from an unstructured text for relational inference. A training step of entity embeddings was employed for extracting similarity results for the linking step. As anticipated, the system presented exceptional results for the less densely interconnected concepts on the input, resulting in high performance throughout the experimental assessment through a simple, yet effective method.

The authors of [26] attempted to address weaknesses in previous global models. Specifically, by filtering inconsistent candidate entity annotations, they successfully simplified their proposed model while reducing noise on data input. The task was treated as a sequence decision problem, as a sequential approach of exploiting disambiguated mentions during the disambiguation of subsequent anchors was applied. Finally, a reinforcement learning model was used, factoring in a global context. The experimental results outlined accuracy and high generalization performance.

#### *2.4. Conclusions and Current Limitations*

Following the success of deep learning methodologies on AI tasks, several similar research endeavors approached NED using deep neural network architectures, furthering the outstanding research works outlined above. However, input dimensionality challenges placed considerable impediments to production-ready, computationally efficient methodologies, as outlined in [27]. The complexity of recent approximations employing deep learning architectures led to several recent works, including [28], which queried whether deep neural network NED methodologies are currently applicable for industry-level big data AI applications compared to simpler and more scalable approaches. Current methods increasingly focus on accuracy rather than run-time performance, in many cases neglecting options for complexity reduction such as constraining input dimensionality. To that end, systems like RedW employ a performance-oriented approach relying on graph and statistical analysis features, questioning deep neural network approaches at scale. As deep learning methodologies have been established in terms of knowledge inference and enhanced modeling capabilities, a computationally efficient approach bridging complexity and performance would be propitious for wide industry adoption.

#### **3. Materials and Methods**

#### *3.1. Notations and Terminology*

For readability enhancement purposes, this section presents a terminology and notation summary, as used throughout Section 3. The terminology in use is aligned with widely adopted previous works in the domain.

- *a*: an anchor, i.e., a text fragment identified as a candidate for annotation; *ai* denotes the *i*-th of the *m* anchors identified in a text input;
- *pa*: a Wikipedia entity (page) constituting a candidate mention for anchor *a*;
- *Pg(a)*: the set of candidate entity mentions registered for anchor *a* in the mention database;
- *in(p)*: the set of inbound links of a Wikipedia entity *p*;
- *W*: the universe of Wikipedia entities.

A formal definition of the task on the basis of the above notation can be summarized as the selection of the best-fit mention *pa* from *Pg(a)* for each anchor *ai* in the set of identified anchors of a text input.

#### *3.2. Knowledge Extraction*

Knowledge is fundamental in an NED task. The current work relies on semantic information from hyperlink anchors to Wikipedia entities. Our methodology supports knowledge acquisition by incorporating any Wikipedia annotated corpus as a training set. In the scope of this work, we leverage the corpus of Wikipedia and the annotated inter-wiki hyperlinks for composing the mention universe ensemble. This ensemble of potential anchor senses grows in parallel with Wikipedia's corpus link structure and its semantic coverage, and it can be considered a sound foundation for our knowledge acquisition process, due to the collaborative online nature of the encyclopedia. Wikipedia entities are widely adopted as an underlying representation ontology for the task due to their commonly accepted textual descriptions.

The population of the mention universe requires the ensemble of Wikipedia pages in the MediaWiki article namespace, i.e., pages in the namespace with identifier 0. Redirect page unification is carried out for the inclusion of the redirect link context. This involves following the redirect chains and accordingly updating Wikipedia hyperlinks to the corresponding direct link entity IDs. As in [10,11,14,15,20], a preprocessing of a Wikipedia snapshot can be used for the initial extraction of the mention universe, which can remain up to date in syndication with the Wikimedia Update Service [29]. The process involves harvesting the following:

- the article titles along with their corresponding Wikipedia entity IDs;
- the redirect page mappings to their target entity IDs;
- the inter-wiki hyperlink anchor texts, along with the entity IDs they point to.
The above structures constitute the core of the knowledge acquisition for the extraction, transformation, and loading of our training dataset universe, effectively composing a graph representation of Wikipedia. In addition, an *anchor occurrence count* dictionary was extracted and maintained for link probability calculations via a second parse of the corpus for implementation simplicity. An appropriate indexing mechanism can be implemented for avoiding this second parse.

The mitigation of noise impact during the knowledge acquisition phase is crucial to the success of our NED methodology and any deep learning model. In the first stage, following an approach inspired by [25], we performed a mention dictionary density enhancement, by incorporating redirect title information. Specifically, page and redirect titles were treated as hyperlinks in a special source article ID. In the next step, unlike many recent deep learning approaches employing a coarser approximation, we applied filtering rules for ignoring low-frequency mention anomalies, with a relative threshold up to 0.5%, at a minimum of one occurrence. Common preprocessing rules for discarding stop-words, punctuation, single characters, and special symbols were also applied to the extracted mention vocabulary, as established in similar works [12–20]. The knowledge extraction phase was straightforward for both the initial loading and the online syndication of the mention universe, as real-time updates were performed in the structures outlined above.

As an outcome from the knowledge extraction process, Wikipedia was encoded in a mention database, enabling the next steps.
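As an illustration of this encoding, the sketch below outlines one possible shape of the mention database construction. The `pages` iterable, the structure of the parsed links, and all function names are assumptions for exposition only; the actual extraction ran as a distributed Apache Spark job at a far larger scale (Section 4).

```python
from collections import defaultdict

def build_mention_database(pages):
    """Builds the mention universe from a parsed Wikipedia snapshot.

    `pages` is assumed to yield lists of (anchor_text, target_entity_id)
    link pairs per article, with redirects already unified to direct
    entity IDs (Section 3.2); this structure is illustrative only.
    """
    mention_dict = defaultdict(lambda: defaultdict(int))  # anchor -> {entity ID -> count}
    link_count = defaultdict(int)                         # link(a): occurrences of a as a link
    for links in pages:
        for anchor_text, target_id in links:
            anchor = anchor_text.lower().strip()
            mention_dict[anchor][target_id] += 1
            link_count[anchor] += 1
    return mention_dict, link_count

def prune_low_frequency_senses(mention_dict, rel_threshold=0.005):
    """Filters low-frequency mention anomalies below the 0.5% relative
    threshold described above (with an absolute floor of one occurrence)."""
    pruned = {}
    for anchor, targets in mention_dict.items():
        total = sum(targets.values())
        kept = {t: c for t, c in targets.items() if c / total >= rel_threshold}
        if kept:
            pruned[anchor] = kept
    return pruned
```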

#### *3.3. Methodology*

The focus of this work was oriented toward the named entity disambiguation task, with anchor extraction as a prerequisite. The entity universe to be linked by our system was derived as detailed in the previous section for creating a mention database. In the first step, an extraction, transformation, and loading process was carried out on the unstructured text input for disambiguation (Section 3.3.1). As a next step, we applied a topical coherence-based pruning technique for the confinement of the entity linking scope to coherent topics in a given context (Section 3.3.2). Then, we employed a novel deep learning model for the selection of candidate mentions of an anchor on a local context window, modeling the problem as a classification problem (Section 3.3.3). Finally, a quantification of uncertainty scoring step followed for the confidence evaluation of outcome predictions (Section 3.3.4). Figure 1 outlines our methodology. In the remainder of this manuscript, our methodology is referred to as OTNEL (Online Training Named Entity Disambiguation).

**Figure 1.** OTNEL (Online Training Named Entity Disambiguation) methodology flowchart.

#### 3.3.1. Extraction, Transformation, and Loading

In the first stage, the unstructured text input was parsed for extracting candidate anchors along with their candidate mentions for further evaluation in the following steps. The input underwent a tokenization process for composing candidate *n*-grams, with *n* ranging from 1 to 6. The candidate *n*-grams were initially evaluated for existence in our mention database, as in similar works [10,13,15,20]. An *n*-gram present in the database could be identified as an anchor for annotation. However, the case of overlapping *n*-grams needed further examination. The *link probability* of a mention, as defined in Equation (1), was central to this examination process.

$$lp(a) = \frac{link(a)}{freq(a)}.\tag{1}$$

Link probability expresses the probability of a word or text fragment occurring as a hyperlink within a corpus. As expressed above, *link(a)* denotes the number of occurrences of anchor *a* as a link, while *freq(a)* denotes the occurrence frequency of *a* in a corpus. To preserve meaningful mentions and filter semantically meaningless ones, *n*-grams with link probability less than 0.001 were pruned, mirroring the corresponding filtering of the knowledge extraction process. As link probability indicates link worthiness, in cases of overlap, the *n*-gram with the highest link probability was selected. Stop-words, punctuation, special symbols, and single characters were ignored; not being present in the mention universe due to the relevant filtering during the knowledge extraction phase, they returned no matches in the mention database. The maximum *n*-gram length was selected in accordance with the maximum link length in our dataset; larger *n*-gram lengths would have no effect, while smaller lengths would confine the maximum token length of detected anchors. After this step, unstructured text segments were converted to sequences of semantically significant anchors.
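The following minimal sketch illustrates this spotting step, assuming the `mention_dict` and `link_count` structures from the knowledge extraction phase and a `freq_count` anchor occurrence count dictionary (Section 3.2); the greedy overlap resolution is our reading of the selection rule described above.

```python
def link_probability(anchor, link_count, freq_count):
    """Equation (1): lp(a) = link(a) / freq(a)."""
    freq = freq_count.get(anchor, 0)
    return link_count.get(anchor, 0) / freq if freq else 0.0

def spot_anchors(tokens, mention_dict, link_count, freq_count,
                 max_n=6, min_lp=0.001):
    """Scans 1- to 6-grams, keeps those present in the mention database
    with lp(a) >= 0.001, and resolves overlaps in favor of the n-gram
    with the highest link probability (Section 3.3.1)."""
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n]).lower()
            if ngram in mention_dict:
                lp = link_probability(ngram, link_count, freq_count)
                if lp >= min_lp:
                    candidates.append((lp, i, i + n, ngram))
    selected, used = [], set()
    for lp, start, end, ngram in sorted(candidates, reverse=True):
        if not used & set(range(start, end)):  # skip overlaps with a stronger anchor
            selected.append((start, ngram))
            used |= set(range(start, end))
    return [ngram for _, ngram in sorted(selected)]
```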

#### 3.3.2. Coherence-Based Dimensionality Reduction

Generic deep learning approximations to the problem face feasibility intricacies at scale for the NED task. Unlike the related task of named entity recognition, where the label space is limited, the NED problem space spans the entire mention universe registered in the underlying knowledge base. A perusal of the Wikipedia knowledge graph representation delineates relevant topic coherence with a high degree of reciprocity, which can be exploited for discarding incoherent entity mentions from further evaluation in a given context.

As the current work constitutes online semantic annotation, we applied topical confinement for our predictions in terms of the online training process. Specifically, in the first stage, the knowledge graph was pruned to the candidate mention set of the identified input anchors. In the next step, we recalled a training set consisting of mentions within that specific subgraph. This process could be iterated until a wider subgraph of the knowledge graph was covered. As our aim was a vast reduction in the dimensionality space involved, enabling on-the-fly training, a single iteration was performed. The trained model for the specific topical confinement could be persisted for future predictions or on-the-fly training expansion as training input becomes available. As a result, our methodology's dynamic adaptation to rapid underlying knowledge changes was twofold. Firstly, as described in Section 3.2, our mention universe and knowledge graph remained up to date in near real time in syndication with the Wikimedia Update Service [29]. Secondly, the online training process enabled eventual adaptation to potential knowledge changes or updates within the scope of the topical confinement.
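A minimal sketch of the topical confinement follows; `training_index`, mapping an entity ID to the training examples annotated with it, is a hypothetical interface to the mention database introduced for illustration.

```python
def topical_training_scope(input_anchors, mention_dict, training_index):
    """Prunes the knowledge graph to the candidate mention set of the
    identified input anchors and recalls the training examples that fall
    within that subgraph (a single iteration, as in Section 3.3.2)."""
    subgraph = set()
    for anchor in input_anchors:
        subgraph.update(mention_dict.get(anchor, {}))
    # Recall only training sequences annotated with entities inside the
    # topically confined subgraph; further iterations would widen the
    # subgraph at the cost of on-the-fly training feasibility.
    examples = [ex for entity_id in subgraph
                for ex in training_index.get(entity_id, [])]
    return subgraph, examples
```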

It is worth noting that existing approaches to the problem are entrenched in generality, relying on a shared vector space with a unified model across topics. In contrast, our approach efficiently confines the required training scope to an explicitly generated coherent topical context. Applying similar unified approaches would be prohibitive in the online scope of the NED task.

#### 3.3.3. Named Entity Disambiguation

For unambiguous anchors identified by the extraction step, a single entity annotation is available and can be used for linking, i.e., |*Pg(a)*| = 1. For cases where |*Pg(a)*| has more than one entry, a compatible mention evaluation is needed. Polysemous anchors may have several candidate entities for linking derived from the relevant mention ensemble of the knowledge base. However, using the knowledge base entity dimension as our output dimension would place exorbitant barriers on performance. For named entity disambiguation, we modeled the process as a feature-based classification problem, leveraging an architecture of long short-term memory (LSTM) cells in an artificial recurrent neural network (RNN). Specifically, the problem of selecting a coherent mention for a polysemous anchor in a context was modeled as a binary classification problem as described below.

For every candidate mention of an anchor, we evaluated the classification into the following complementary classes:

- Class 1: the candidate mention is semantically compatible with the anchor in the given context;
- Class 0: the candidate mention is incompatible with the anchor in the given context.

In the next phase, we could utilize the penultimate deep learning model layer scoring to depict class prediction probabilities. We selected the highest-scoring class 1 (i.e., compatible) mention as the disambiguation result.

For maintaining a low input dimension in our model, in line with the performance requirements of the online scope of the problem, we provided a set of three features at the input layer, summarizing the gist of the topical semantic information (a computational sketch follows the feature definitions below). Those features were as follows:

An *inter-wiki Jaccard index average*, as shown in Equation (2). This formula expresses the reciprocity of inbound mentions. The feature of Jaccard similarity was established in [20] as a strong Wikipedia entity coherence measure.

$$\text{avg interwiki Jaccard index}(a\_i) = \sum\_{k=0}^{i-1} \frac{\left| in(p\_{a\_i}) \cap in(p\_{a\_k}) \right|}{\left| in(p\_{a\_i}) \cup in(p\_{a\_k}) \right|} \Big/ m + \sum\_{k=i+1}^{m} \frac{\left| in(p\_{a\_i}) \cap in(p\_{a\_k}) \right|}{\left| in(p\_{a\_i}) \cup in(p\_{a\_k}) \right|} \Big/ m.\tag{2}$$

*Relatedness* is an established measure of semantic entity relatedness. The feature has been used as a core disambiguation primitive in several works [13–15]. In this case, we applied an *average relatedness* feature as depicted by Equation (3).

$$\text{avg relatedness}(a\_i) = \sum\_{k \in \{0,\dots,m\} - \{i\}} \frac{\log\left( \max\left( \left| in(p\_{a\_i}) \right|, \left| in(p\_{a\_k}) \right| \right) \right) - \log\left( \left| in(p\_{a\_i}) \cap in(p\_{a\_k}) \right| \right)}{\log(|\mathcal{W}|) - \log\left( \min\left( \left| in(p\_{a\_i}) \right|, \left| in(p\_{a\_k}) \right| \right) \right)} \Big/ m. \tag{3}$$

*Commonness* as defined by Equation (4) is the prior probability of an anchor pointing to a specific Wikipedia entity. Commonness was broadly used in similar works, contributing significant statistical information to the model.

$$\text{Commonness}(p\_k, a\_i) = P(p\_k \mid a\_i). \tag{4}$$
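Assuming an `inlinks` mapping from entity IDs to inbound-link sets and a `commonness` table of prior probabilities (both hypothetical interfaces to the mention database), the three input features could be computed as sketched below; Equation (3) is applied in its distance form, as written above.

```python
import math

def jaccard(in_a, in_b):
    """Inter-wiki Jaccard index of two entities' inbound link sets (summand of Equation (2))."""
    union = in_a | in_b
    return len(in_a & in_b) / len(union) if union else 0.0

def relatedness(in_a, in_b, wiki_size):
    """Summand of Equation (3): normalized link-based distance between two entities."""
    overlap = len(in_a & in_b)
    if overlap == 0:
        return 1.0  # no shared inbound links: maximally distant
    return ((math.log(max(len(in_a), len(in_b))) - math.log(overlap))
            / (math.log(wiki_size) - math.log(min(len(in_a), len(in_b)))))

def feature_vector(candidate, context_entities, inlinks, commonness, wiki_size):
    """Three-dimensional input vector for one candidate mention of an anchor,
    averaged over the entities of the other context anchors."""
    m = len(context_entities) or 1
    avg_jac = sum(jaccard(inlinks[candidate], inlinks[e])
                  for e in context_entities) / m
    avg_rel = sum(relatedness(inlinks[candidate], inlinks[e], wiki_size)
                  for e in context_entities) / m
    return [avg_jac, avg_rel, commonness[candidate]]
```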

Figure 2 presents the deep learning layers of our classifier's distributed architecture. The classifier received a three-dimensional vector of feature scores as input, summarizing the contextual compatibility of an evaluated candidate mention for an anchor. This evaluation was derived as a classification score for the binary compatible/incompatible output classes. As more than one, or (in rare cases) none, of the candidate mentions could be classified as compatible in a context, we exploited the penultimate layer score to derive a relative prediction for selecting the most coherent mention.

**Figure 2.** The proposed methodology classifier layer architecture.

The model's input dimensionality was intentionally maintained low through the employment of established context features. Leveraging the distributed execution of Keras [30] and TensorFlow [31], the input was equally split and fed to the distributed deep learning pipelines, and the individual outputs were combined to form our classifier output. The architecture was kept as simple as possible for computational performance reasons.

LSTM was established to address the limitations of RNNs [32]. General RNNs exhibit the "vanishing gradient problem", resulting in a declining contribution of previous training inputs. The LSTM layer building blocks comprise cells featuring typical input and output gates, along with a forget gate enabling a connected cell state and output recurring feedback from previous states. The *sigmoid* function is commonly adopted as the activation function for the input, forget, and output gates in LSTM architectures [33], efficiently exhibiting the characteristics and mathematical assets of an effective activation function. The *Tanh* (hyperbolic tangent) function is commonly applied as the activation for the LSTM candidate and output hidden states.

Our model consisted of two stacked LSTM cell layers after the input, followed by a dense layer producing the output summary. In our model implementation, the LSTM cell size used for the first and second layers was 3, thereby maintaining a low training complexity, yet a high degree of nonlinear feature relation adaptivity. The *MSE* loss function and *Adam* optimizer were used during the model training phase. The *Tanh* activation function and *sigmoid* recurrent activation were employed as the LSTM layer parameters.
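A minimal Keras [30]/TensorFlow [31] sketch of this classifier follows. The exact tensor layout (here, each candidate encoded as a one-step sequence of its three features) is our assumption, as the text does not fix it.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Two stacked LSTM layers with cell size 3, tanh activations, and
    # sigmoid recurrent activations, as described in Section 3.3.3.
    layers.LSTM(3, activation="tanh", recurrent_activation="sigmoid",
                return_sequences=True, input_shape=(1, 3)),
    layers.LSTM(3, activation="tanh", recurrent_activation="sigmoid"),
    # Dense output layer for the binary compatible/incompatible decision;
    # the penultimate layer score is what the method exploits for ranking,
    # with the dense pre-threshold score standing in for it in this sketch.
    layers.Dense(1, activation="sigmoid"),
])

# MSE loss and the Adam optimizer, per the training setup described above.
model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
```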

The intuition behind the specific multilevel LSTM layer architecture was to involve enhanced semantic relation persistence from the topically confined training sequence. In addition to a clear architecture, simplicity, and modeling and computational efficacy, the methodology enables enhanced prediction strength via exploiting a rich set of both positive and negative linking training examples.

As depicted in the comparative results on a different domain by [33], several activation function options may be explored and compared, contributing intriguing results even in the case of simple classification problems and LSTM architectures. However, in the scope of the current work, we focused on the general approach of a model for the problem, applying established activation functions that experimentally exhibit efficient results for a range of relevant domains. Further exploration of tuning options for our deep learning architectural approach in the specific domain is among our plans for future work.

#### 3.3.4. Quantification of Uncertainty

The NED task is quite demanding, with several cases of variant semantics, insufficient underlying information, or highly ambiguous context. Absolute accuracy may be considered unattainable even for humans on the task. As a result, the confidence evaluation of a named entity prediction is momentous for the development of successful applications. At the pre-output layer of our deep learning model architecture, we could fruitfully exploit the output score as a quality indication for the predicted positive linking compatibility class outcome.

For an anchor *a*, the candidate mention set size is |*Pg(a)*| = *k*. Let *compatibility score(m)* denote the compatibility score of mention *m*, and let the candidate mention set for anchor *a* in *Pg(a)* be denoted as {*m*1, *m*2, ..., *mk*}.

Hence, the uncertainty quantification formula can be defined as follows:

$$\begin{aligned} \mathcal{U}(a, m\_i) &= \frac{\text{compatibility score}(m\_i) - \text{compatibility score}(m\_j)}{\text{compatibility score}(m\_i)}, \\\\ m\_i &= \arg\max\_{m \in \{m\_1, m\_2, \dots, m\_k\}} \text{compatibility score}(m), \\\\ m\_j &= \arg\max\_{m \in \{m\_1, m\_2, \dots, m\_k\} - \{m\_i\}} \text{compatibility score}(m). \end{aligned} \tag{5}$$

Equation (5) is an expression of the semantic distance between the selected mention for an anchor annotation and the next most coherent available mention for that annotation in a specific context. This metric was proven as a good uncertainty indication throughout our experimental evaluation.
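Operationally, Equation (5) reduces to a few lines; the following is a minimal sketch, assuming a list of compatibility scores for the candidate mentions of a single anchor:

```python
def uncertainty(compatibility_scores):
    """Equation (5): relative gap between the two highest compatibility
    scores among the candidate mentions of an anchor."""
    ranked = sorted(compatibility_scores, reverse=True)
    if len(ranked) < 2:
        return 1.0  # a single (unambiguous) candidate leaves no rival mention
    if ranked[0] == 0:
        return 0.0  # no compatible candidate found; no certainty margin
    return (ranked[0] - ranked[1]) / ranked[0]
```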

#### *3.4. Evaluation Process*

The evaluation analysis focused on the entity linking disambiguation process, delineating the benefits of our novel methodology. The uncertainty score introduced for our methodology was thoroughly validated as a confidence indicator for the outcome prediction. For performance comparison, a classic precision, recall, and F1 measure assessment was carried out. Specifically, we evaluated precision as the ratio of valid anchor mention annotations over the size of the identified mention ensemble.

$$\text{Precision} = \frac{TP}{TP + FP}.\tag{6}$$

We evaluated recall as the number of correctly annotated predictions divided by the total number of ground-truth mentions.

$$\text{Recall} = \frac{TP}{|\text{mentions}|}. \tag{7}$$

Lastly, the F1 score outlined a harmonic mean between recall and precision.

$$\text{F1} = 2 \ast \frac{Precision \times Recall}{Precision + Recall} . \tag{8}$$
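For reference, the following minimal sketch computes Equations (6)–(8) from the true/false positive counts and the ground-truth mention count; the function names are illustrative:

```python
def precision(tp, fp):
    """Equation (6)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, total_mentions):
    """Equation (7): correct annotations over all ground-truth mentions."""
    return tp / total_mentions if total_mentions else 0.0

def f1(p, r):
    """Equation (8): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0
```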

The wiki-disamb30 dataset, introduced by [15], was utilized by several works in the domain, including [15,19,20,30], and it is generally accepted as a baseline for the task. Our methodology evaluation process was based on segments of the wiki-disamb30 dataset for a thorough performance analysis. This dataset contains approximately 1.4 million short input texts of up to 30 words, each incorporating at least one ambiguous anchor hyperlink along with its corresponding Wikipedia entity. As the dataset's target entity links correspond to an old instance of Wikipedia, some processing is required to update the references to the current changes. The dataset in use features extensive context variability, as the text segments cover a wide range of topics, making it well suited for a thorough assessment.

As this work introduces and studies a specific NED problem, namely, online training NED, our main focus was the evaluation of our methodology, using precision, recall, and F1 measures. However, the established baseline methodologies from [15] along with the systems proposed in [34] were included for an incommensurate yet indicative performance comparison, outlining the performance of our methodology under a common evaluation dataset.

The first baseline, TAGME [15], is a particularly popular and relatively simple to implement methodology featuring computational efficiency. Relatedness, commonness, and other Wikipedia statistics were combined in a voting schema for the selection of the top scoring candidate annotation for a second-step evaluation and selection process. We aimed to extract insights from a comparison with classic baseline high-performance approaches. The second baseline employed was the Clauset–Newman–Moore (CNM) methodology from [34]. This approach introduced community detection algorithms for semantic segmentation of the Wikipedia knowledge graph into densely connected subgraphs, achieving high accuracy. A classification model approach was employed for the task, using established features along with community coherence information derived by the Clauset–Newman–Moore algorithm.

#### **4. Results**

Our methodology assessment was performed using Wikipedia snapshots for the knowledge extraction process. Specifically, the enwiki-20191220-pages-articles dump of pages from 20 December 2019 [35] was employed for an extraction, transformation, and loading process in the WikiText [36] format using the Apache Spark framework [37]. Big data techniques were instrumental to the process, deriving more than 37,500,000 individual anchors and over 19,300,000 entities, composing a graph with over 1,110,000,000 edges. The distributed architecture and processing model of our implementation can handle a far larger scale, and its scalability capabilities allow following the growth rates of Wikipedia datasets. For the NED disambiguation process implementation, we used Keras [30] and TensorFlow [31]. Our experiments employed a distributed 192vCPU and 720 GB memory Google Cloud Platform setup.

#### *4.1. Experimental Analysis Discussion*

Our OTNEL implementation was experimentally evaluated as dominant in high-precision performance, as outlined in the comparative results of Figure 3 and Table 1. Specifically, the methodology outperformed the baseline methodologies at full recall by 7%. The precision–recall curves of similar works in the generic NED scope decline steeply at recall levels above 0.6. This fact was interpreted as the inadequacy of those methodologies to fit a generic model for low-frequency or poorly linked entities in the knowledge graph. Conversely, in the case of our methodology, we observed a more gradual decline in precision up to the 0.9 recall level. This justified not only the overall precision of our methodology but also the high-performance certainty evaluation metric employed for recall adjustment, along with the improved modeling capabilities of OTNEL due to its topical online training. In the recall area (0.9, 1], our method's evaluation exhibited a steeper negative slope toward absolute recall, as anticipated. This was mainly interpreted as knowledge deficiency and a latent modeling approximation of deviating cases.

**Figure 3.** Precision–recall of OTNEL method, compared with Clauset–Newman–Moore (CNM) and TAGME baselines.

**Table 1.** OTNEL, TAGME, and CNM precision and F1 scores at varying recall levels.


<sup>1</sup> WSD methodology based on Clauset-Newman-Moore Community detection; <sup>2</sup> Online Training Named Entity Linking.

Overall, our deep learning architecture consisted of a multilevel LSTM network. The recurrent learning selection was driven through a delayed reward with a global context within the topical confinement. Furthermore, the utilization of our penultimate layer score apparently yielded considerable insights into the success of certainty scoring, contributing to a progressive precision–recall inclination. Again, we could observe high precision even at recall levels over 0.9. For a dataset featuring such context variability, as training was conducted using Wikipedia, the extraordinary performance and potential of our approximation are profound.

The F1 score had a local maximum at a recall level of approximately 91% for the OTNEL model, as shown in Figure 4. The influence of topical segmentation introduced by our online training methodology, in conjunction with the high-performance indicator of linking certainty at the big data scale of the evaluation, emphasizes a consistently high performance, as clearly illustrated for the area over 0.8 of the recall axis. The value of our modeling approach is emphasized by the impressive accuracy even at high recall levels.

**Figure 4.** F1 score of OTNEL method, compared with CNM and TAGME baselines.

#### *4.2. Quantification of Certainty Evaluation*

Certainty was modeled as a measure of confidence for a mention selection. The correlation of certainty score and prediction score is outlined in the two-dimensional (2D) histograms of Figures 5 and 6. In Figure 5, we can observe a dense distribution of high certainty scores and a strong correlation of high prediction scores with high certainty. On the contrary, Figure 6 presents a less dense distribution of low certainty, in the areas of certainty below 0.6 and prediction score below 0.5. This intriguing observation can be interpreted as a knowledge deficit in the knowledge acquisition process, probably due to the coverage of our training set. Another reading of Figure 6 could delineate the presence of outliers; however, the apparent correlation of low certainty with low prediction scores clearly indicates our model's advanced capabilities. At this point, it is worth noting that the ratio of valid (true positive) to invalid (false positive) entity link predictions was highly inclined toward true positives, as outlined in Figures 3 and 4, and the visualization of certainty and prediction scores validates our intuitions. Overall, the performance of the certainty metric as a measure of confidence for the validity of a linked entity was outstanding.

**Figure 5.** Prediction score–certainty score two-dimensional (2D) histogram: true positive distribution.

**Figure 6.** Prediction score–certainty score 2D histogram: false positive distribution.

#### **5. Conclusions and Future Work**

The current work proposed an innovative methodology featuring a deep learning classification model for the NED task, introducing the novel concept of online topical training for attaining high performance whilst maintaining rich semantic input specificity. This work introduced and studied the domain of online training NED. Moreover, to the best of our knowledge, this is the first approximation of Wikification and NED leveraging online topical training, introducing a stacked LSTM deep learning architecture model for the task. Our thorough experimental assessment revealed astounding performance in terms of precision, at moderate computational requirements, due to our simplicity-oriented dimensionality and modeling approach.

Our overall deep learning architecture enables nonlinear input relation modeling, as the LSTM layers involved exploit a dynamically changing contextual window over the input sequence history during the online topical training process. As a result, the use of a limited set of established features from works in the domain was adequate for attaining superior deep semantic inference capabilities with a topical focus, successfully addressing high-dimensional-space performance difficulties on a challenging task.

Among our plans for the future enhancement of the current work's promising results, further analysis and experimentation in the quest for a more accurate architecture will be considered. A noteworthy advantage of the proposed neural network architecture is its understandability and low opacity, as a simple model delineates the benefits of the topical confinement concept in the online training NED task. As entity linking and NED tasks are based on knowledge, their underlying adversity is discerned in the absence of semantically linked corpora, namely, the knowledge acquisition bottleneck. An unsupervised machine learning knowledge expansion approximation could lead to more accurate results and, thus, knowledge acquisition closure from both structured and unstructured knowledge sources and corpora. The incorporation of an unstructured knowledge source via an ensemble learning approach, mitigating the impact of superinduced noise in the knowledge acquisition phase, is among our plans.

In this article, our primary focus was the evaluation of new concepts for lowering the computational feasibility barrier to employing deep learning architectures in the NED task, while maintaining input semantic entropy by avoiding vast input transformations and granularity loss. Our extensive experimentation revealed propitious results, placing the introduced methodology in the limelight for further study and broad adoption.

**Author Contributions:** Conceptualization, C.M. and M.A.S.; methodology, M.A.S.; software, M.A.S.; validation, M.A.S.; formal analysis, M.A.S.; investigation, M.A.S.; resources, C.M.; data curation, M.A.S.; writing—original draft preparation, M.A.S.; writing—review and editing, C.M. and M.A.S.; visualization, M.A.S.; supervision, M.A.S.; project administration, C.M.; funding acquisition, C.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** Christos Makris was co-financed by the European Union and the Greek national funds through the Regional Operational Program "Western Greece 2014–2020", under the Call "Regional Research and Innovation Strategies for Smart Specialization—RIS3 in Information and Communication Technologies" (project: 5038701 entitled "Reviews Manager: Hotel Reviews Intelligent Impact Assessment Platform").

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **ParlTech: Transformation Framework for the Digital Parliament**

**Dimitris Koryzis 1, Apostolos Dalas 2, Dimitris Spiliotopoulos <sup>3</sup> and Fotios Fitsilis 4,\***


**Abstract:** Societies are entering the age of technological disruption, which also impacts governance institutions such as parliamentary organizations. Thus, parliaments need to adjust swiftly by incorporating innovative methods into their organizational culture and novel technologies into their working procedures. Inter-Parliamentary Union World e-Parliament Reports capture digital transformation trends towards open data production, standardized and knowledge-driven business processes, and the implementation of inclusive and participatory schemes. Nevertheless, there is still a limited consensus on how these trends will materialize into specific tools, products, and services, with added value for parliamentary and societal stakeholders. This article outlines the rapid evolution of the digital parliament from the user perspective. In doing so, it describes a transformational framework based on the evaluation of empirical data by an expert survey of parliamentarians and parliamentary administrators. Basic sets of tools and technologies that are perceived as vital for future parliamentary use by intra-parliamentary stakeholders, such as systems and processes for information and knowledge sharing, are analyzed. Moreover, boundary conditions for development and implementation of parliamentary technologies are set and highlighted. Concluding recommendations regarding the expected investments, interdisciplinary research, and cross-sector collaboration within the defined framework are presented.

**Keywords:** digital parliament; digital transformation; legal tech; disruptive technologies; technology framework; parliamentary administrators; ParlTech; knowledge-driven processes; parliamentary hype cycle; semantic web

#### **1. Introduction**

Organizations such as parliaments are complex systems that can be considered an ensemble of five different elements: Process, people, culture, structure, and information systems [1]. These entail the need for an organizational transformation framework that exploits the potential of information communication technology (ICT) [2]. Over the past two decades, the evolutionary use of workplace technologies in organizations has hybridized their use with human activities [3], forming a more complex environment [4] and an emergent human-AI hybrid digital assistant [5] or meta-human configurations as new forms of socio-technical systems [6]. ICT has the potential to impact all of these elements and involves the emergence of several digital/human configurations [3], reflecting the assembly of digital features with human intent and their performance within a complex organization, as in the case of parliaments.

However, even if the demand on ICT to design and implement changes within the parliamentary institution has been documented in previous decades [7,8], it is still unclear how and under which conditions this digital transformation takes place [9].

**Citation:** Koryzis, D.; Dalas, A.; Spiliotopoulos, D.; Fitsilis, F. ParlTech: Transformation Framework for the Digital Parliament. *Big Data Cogn. Comput.* **2021**, *5*, 15. https://doi.org/10.3390/bdcc5010015

Received: 13 February 2021 Accepted: 11 March 2021 Published: 15 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


Within governance, in particular, ICT was found to skew the balance towards efficiency rather than innovation, despite organizations expressing a need towards the latter [10]. Therefore, a two-step ICT-embedded organizational transformation, i.e., technical and social, can be hypothesized. Mergel et al. [11] (p. 12) go a step further and use the term digital transformation "to emphasise the cultural, organizational, and relational changes ... to differentiate better between different forms of outcomes".

The introduction of ICT is often combined with the transformation of an organization as a whole. Naturally, the technical elements of an organization, e.g., data and information systems, are impacted most. For this reason, in relation to digitalization, researchers have called for "digital ambidexterity", which is the capability to dynamically balance the digital initiatives in terms of efficiency and innovation [12]. In contrast, the social system (culture and structure) appears to be less affected by digital transformation [13]. Generally, digital innovation for value creation in organizations such as parliaments unbundles and recombines linkages among existing resources or simply generates new ones. In situations like these, when changes are radical, digital disruption may emerge, with considerable effects for different actors [14].

Regarding parliaments, in recent years, a small number of studies have investigated their role as organizations that are managing new technologies [8,15]. Nonetheless, the e-Parliament concept is not new [16]. Since the early 2000s, several attempts, projects, and concepts have indicated that citizens can, and in fact should, be included and engaged in decision-making processes through tools, products, platforms, and integrated IT services that enable them to actively participate in interaction with policy makers [17]. Indeed, it appears that the use of digital technologies by traditional organizations such as parliaments is highly diverse, albeit existing studies mostly refer to a variety of tools that allow for the engagement of citizens [18]. It must be noted though that such concepts are still far from the manifestation of parliament as a digital democracy hub for online engagement, communication, cooperation, and interaction among citizens and legislators. Such a digital collaborative platform that is operated in a transparent manner could be a useful tool, particularly within the legislative process.

To date, little attention has been placed on the development of a theoretical framework for the transformation of a traditional public organization, such as parliament, into a modern digitally ambidextrous organization, because there are several pathologies [10]. A basic approach has been made with the technology acceptance model framework by Davis [19], which is currently in version 3 (TAM3). This is premised on the theory that the model helps explain a specific behavior, which in this case is usage, ease of use, or perceived usefulness, towards a specific target, using technology, and within a specific context (e.g., public administration, parliaments). Scarce literature is complemented by a small number of published digital strategies in parliaments that attempt to incorporate organizational transformation elements with digital technologies in a layered structure.

Ongoing development, e.g., within the ISA² European Interoperability Framework, is providing guidance towards interoperable digital public services [20]. Effectively, system architecture is leading to an ecosystem of tools and services with accurately defined functions and interfaces. Within this multi-stakeholder environment, a user-centric approach is favored, using agile and lean ICT methodologies for interoperable and secure systems that constitute legal data hubs, which are accessible and inclusive for all stakeholders. Nonetheless, for the parliamentary workspace, more than a simple platform with integrated tools and software applications is needed, especially in the policy formulation stage, where a large number of users, e.g., parliamentary actors and/or stakeholders, are typically involved. However, state-of-the-art intuitive integrated tools of the likes of e-participation services, social media campaigns, visualizations, and linguistic analysis have the potential to advance digital transformation of the policy cycle [21].

The emergence of disruptive technologies might complicate a linear evolutionary approach for the digital parliament. At the same time, they have the potential to strengthen parliamentary institutions and bridge the informational and processing gap towards the executive. Taking into consideration digital tools and solutions for the digital evolution

of parliaments [15,17], comparative reports for aspects of future parliaments [22], the World e-Parliament Report 2018 [23], and the evolution of digital technologies from the 2020 Gartner Hype Cycle for Emerging Technologies [24], this study describes a novel approach to the digital transformation of parliamentary organizations, both from a holistic perspective and the user's point of view. It does so by refining existing digital parliament concepts and discussing organizational vis-à-vis digital transformation. Moreover, an innovative digital framework is developed bottom-up using the findings from a survey of parliamentary experts, who constitute users of parliaments' digital systems. Finally, based on acknowledged technology trends, the definition of a "parliamentary hype cycle" identifies promising technologies of parliamentary relevance that could shape future e-Parliament systems.

The next part (Section 2) defines the research methodology and the approach taken to create the survey on which the study is based, as well as the selection of a representative sample of intra-parliamentary actors. It is followed by Section 3, where the main findings are presented and discussed. These are used to define a framework for the digital parliament (Section 4), based on which the most promising technologies are concisely discussed in light of the survey's key findings (Section 5). The article concludes with the most important aspects in a parliamentary context and a brief outlook on further research (Section 6).

#### **2. Approach and Methodology**

The authors have opted for a user-driven approach to define the framework of a digital parliament. To obtain data related to the nature and attributes of the framework, a structured expert survey was developed and sent to a carefully selected set of parliamentary actors/users/stakeholders [25]. An expert survey is preferable, since the object of scholarly inquiry is novel and complex, yet it directly affects the users as actors and actuators. Therefore, it is "more likely to find reliable information in experts' judgements rather than in documentary sources" [26] (p. 274). In expert surveys, i.e., surveys of special and limited populations, the sample size is small by design, and no representative sampling framework is required. Instead, for this study, purposeful sampling was utilized for data collection and its predominantly qualitative interpretation [27]. The main criteria according to which the subjects were selected as parliamentary experts were expertise in parliamentary development, scholarly engagement, and international cooperation. Further selection criteria were applied to ensure the geographic and gender diversity of the sample.

The survey builds upon IPU's definition of the digital parliament, the drivers and barriers for its digital and organizational transformation [13], and Gartner's hype cycle [24]. These were used to create a set of questions designed to capture the user's perception of the digital parliament. The resulting survey contains 15 questions, which can be divided into five basic blocks:


Following the general demographic questions (block 1), block 2 attempts to redefine the digital parliament. The perceived level of digitalization from the users' point of view is measured and linked to organizational transformation, and priorities and themes of relevance are captured. The barriers and drivers of organizational transformation and their link with digital transformation are assessed within block 3. With the 2018 IPU World e-Parliament report [23] as a point of reference, block 4 then re-defines trends and key aspects of the digital parliament and introduces tools and services in the parliamentary context. The final block of questions (block 5) estimates the applicability, maturity, usefulness, and sustainability of emerging digital technologies. For the detailed definition of these user experience (UX) terms, see [28,29].

The questions were carefully designed to facilitate the understanding of the parameters and the building blocks of an evolutionary framework for the digital parliament. Both language and terminology were adapted to the parliamentary context. Technology foresight, especially for niche parliamentary technology, or ParlTech, is a particularly difficult task. ParlTech goes beyond what is considered state-of-the-art and is based on emerging technologies that fully or partially automate or even advance processes of a parliamentary nature. As such, it is to be differentiated from standard technology aiming to provide solutions to administrative/organizational issues.

The technologies selected for inclusion in the survey, which eventually led to the creation of a parliamentary hype cycle, have been extracted from the 2020 Gartner hype cycles (emerging technologies, legal and compliance technologies, and internet of things) and constitute direct projections of emerging technologies onto the parliamentary workspace. In the course of an internal workshop, 13 specific technologies were identified as promising ParlTech and included in the survey. Table 1 matches the technologies from the Gartner hype cycles with the ones that are relevant for parliaments.


**Table 1.** Technologies for the parliamentary workspace.

<sup>1</sup> Short names appear in italics.

In the course of the article, the authors attempt to approach three particular research goals around the digital parliament:


The above survey design methodology constitutes a valid instrument for evaluating responses on the prerequisites and conditions of the digital parliament. While quantitative data have been collected (i.e., a Likert scale is used for quantitative evaluation), the focus is placed on the qualitative evaluation of findings, in order to arrive at a tangible approach for a digital parliament framework.

The survey was sent to 53 MPs and parliamentary professionals, collectively referred to as parliamentary experts, covering 36 countries. A total of 32 persons from 25 countries responded, a response rate of 60.4%, which the authors consider particularly high given the complexity of the survey. The high response rate may also be an indication that these usually busy, high-level parliamentary experts viewed the survey favorably. The respondents originate from 25 different parliaments, which means that some parliaments are represented by more than one expert. For methodological reasons, even in countries with bicameral systems, experts were selected from a single chamber. Hence, the number of parliaments coincides with the number of their countries of origin. As a result, in the context of the study, the terms country and parliament can be used interchangeably. Based on the original survey design, a wide geographic distribution across five continents can be observed. The findings are comparable across countries due to the common criteria used for expert selection, i.e., parliamentary experts or MPs who meet the above conditions are more likely to provide comparable information on the digital parliament and its development than random parliamentary professionals. A significant share of the survey respondents (around one third, i.e., 31.2%) are female. Basic sample demographics are presented in Table A1 (Appendix A).

Upon request, the participants received online support and technical guidance to complete the survey. Most of the queries referred to the last survey block, related to emerging digital technologies. This was anticipated as, unsurprisingly for parliamentary experts, a dominant majority of the respondents have a social science background: more than one third hold a degree in law, with only 15.6% having a degree in informatics or engineering. The experts work in different sectors of parliament, e.g., in parliamentary committees, the library and/or research service, and international relations. The broad distribution across parliamentary sectors is important because it provides for a holistic approach to the research topic.

Processing and presentation of the findings ensured the anonymity and confidentiality of the individual contributions. The survey, as well as the raw data set, has been placed on an open platform (Figshare) for cross-analysis and further elaboration [30]. The survey results have been assessed for reliability using Cronbach's alpha coefficient (α) for each block of questions, i.e., α = 0.88 for block 2, α = 0.76 for block 3, α = 0.89 for block 4, and α = 0.95 for block 5.
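
The block-level reliability check can be reproduced directly from a response matrix. The sketch below implements the standard Cronbach's alpha formula; the toy matrix and the per-block grouping are placeholders, since the actual raw data live in the Figshare dataset [30].

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) Likert score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items in the block
    item_variances = scores.var(axis=0, ddof=1)      # per-item sample variance
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of respondents' sums
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Toy 4-respondent, 3-item block; the real input would be one matrix per block.
block = [[4, 5, 4], [3, 4, 3], [5, 5, 4], [2, 3, 2]]
print(round(cronbach_alpha(block), 2))
```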

#### **3. Findings and Evaluation**

Different blocks in the survey cater for a gradual approximation of a digital parliament framework, starting with the perception of digitalization. A significant share of participants (46.9%) state that the level of digitalization of their parliament is high or very high, while 37.5% rate it as moderate. The resulting average score on the seven-point Likert scale is 4.37 with a standard deviation of 1.31, i.e., 4.37 (σ = 1.31), and gives an overall positive view of the level of digitalization in parliament. The extraordinarily high share, approximately 85%, of users reporting that the level of digitalization is at least moderate could be a temporal effect, partially explained by the overall positive effects of the pandemic on the digitalization of parliaments [31]. Nevertheless, it can be considered a strong foundation for further digitalization efforts. Moreover, this overall positive perception allows the authors to assume that the subjects also have the necessary technological affinity to assess the fitness of a broad list of technologies in the parliamentary workspace.

From the organizational perspective, findings show that digitalization has mostly transformed processes (78.1%), data (75%), people (65.6%), and systems (62.5%), with similar Likert scores (1–5 scale): data at 3.94 (σ = 0.88), processes at 3.69 (σ = 0.86), people at 3.66 (σ = 0.83), and systems at 3.88 (σ = 0.98). The widespread perception that ongoing digitalization is transforming data, information systems, and processes is not unexpected; it has already been the outcome of existing investigations (Tangi et al., 2020). However, one needs to consider whether the lower values measured for the digitalization effect on structure and culture, 3.31 (σ = 0.93) and 3.25 (σ = 0.80), respectively, can be attributed to a specific cause. Regarding the current progress of the parliaments from digitalization, the aspects that the progress applies to (processes, people, culture, structure, systems, and data) were all found to be non-independent when examined in pairs (chi-square, *p* < 0.001, for all pairs). The authors believe that these parameters are still decoupled from the effects of digitalization, hence the observed difference. As a matter of fact, the overall high acceptance values and inter-dependence seem to confirm that digitalization tends to affect parliamentary organizations holistically.
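
The pairwise independence tests reported above can be reproduced along the following lines. This is a minimal sketch assuming each aspect is rated on a 1–5 Likert scale; the aspect names and ratings are hypothetical stand-ins for the survey's actual variables.

```python
from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency

def pairwise_chi2(responses):
    """Chi-square independence test for every pair of rated aspects.

    `responses` maps an aspect name to a list of per-respondent
    Likert ratings (1-5); returns a p-value per aspect pair.
    """
    p_values = {}
    for a, b in combinations(responses, 2):
        table = np.zeros((5, 5))
        for x, y in zip(responses[a], responses[b]):
            table[x - 1, y - 1] += 1  # cross-tabulate the two ratings
        # Drop empty rows/columns so expected frequencies are well defined.
        table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]
        _, p, _, _ = chi2_contingency(table)
        p_values[(a, b)] = p
    return p_values

# Hypothetical data for two of the six aspects.
print(pairwise_chi2({"processes": [4, 4, 3, 5, 4, 2, 3, 4],
                     "data":      [4, 5, 3, 5, 4, 2, 3, 5]}))
```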

At the same time, findings show that the organizational transformation process that goes along with digitalization is significantly hindered by a number of factors, the most recognized being bureaucratic culture (65.6%) and resistance to change (62.5%). Likert (1–5) scores for bureaucratic culture and resistance to change are 3.63 (σ = 1.03) and 3.53 (σ = 0.95), respectively, which is similar to earlier findings [13]. This is an interesting result that is linked to the wider perception of parliaments as "traditional" organizations. The fact that digital transformation efforts have been acknowledged, be it as a response to the COVID-19 pandemic or not, shows that even high intrinsic barriers can be overcome, given the proper incentive or when reaching out to a greater objective. It is worth mentioning that the experts are divided when it comes to roadmaps and planning, i.e., "only" 50% agree with the statement, at 3.50 (σ = 1.02), a result that can be associated with findings from the 2018 IPU report [23]. This partially explains the observed lack of digital strategies in parliaments. Moreover, the survey participants reported that the fear of innovation was the most serious condition affecting the level of digitalization of their parliament (Spearman's rank correlation, *p* = 0.021, ρ(30) = 0.406). In the context of the institutional future, the greater objective is no less than to correspond to a digital societal shift while maintaining the institutional equilibrium.
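
The reported rank correlation can be computed with SciPy; the rating vectors below are placeholders for the actual survey columns.

```python
from scipy.stats import spearmanr

# Placeholder ratings; the study reports rho(30) = 0.406, p = 0.021 on the real data.
fear_of_innovation   = [4, 3, 5, 2, 4, 3, 5, 4, 2, 3]
digitalization_level = [5, 4, 6, 3, 5, 3, 6, 4, 2, 4]

rho, p = spearmanr(fear_of_innovation, digitalization_level)
print(f"rho = {rho:.3f}, p = {p:.3f}")
```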

Two thirds expressed that the organizational transformation process that goes along with digitalization is pushed by expected benefits for the main stakeholders, i.e., 3.72 (σ = 0.81) on the Likert scale (1–5), and strong top-down leadership, i.e., 3.47 (σ = 0.88). Both are expected drivers, as the effects of benefits and incentives in public service are well documented (for a systematic review of the relevant literature, see [32]), as is the positive impact of leadership [33].

Furthermore, parliamentary experts were asked to assess a series of digital trends and aspects from the 2018 IPU World e-Parliament Report [23]. Two years after the IPU report, the user evaluation can reveal how former trends have transformed into today's tangible systems and processes within parliaments. Additionally, it can serve as a qualitative indicator for the validity of the evaluation of emerging ParlTech.

A large majority of the experts (87.5%) perceive open and transparent legal data, as well as openness, accountability, and accessibility, as significant components of the digital parliament. However, the exact degree of correlation between the production of legal data, for instance Big Open Legal Data (BOLD), and an increase in institutional accountability is unclear, and almost certainly depends on the individual organization. Yet, this finding is in line with recent developments in legal informatics and the development of legal document standards that are utilized by dedicated legislative drafting tools [34,35].

When recording priorities for the digital parliament, in most cases (>90%) processes, data, and people are the experts' preferences. System architecture is a high priority for roughly two thirds (65%) of the experts. For all of them, high Likert (1–5) scores (>4.3) are recorded, with data displaying an extremely high value of 4.47 (σ = 0.57). Furthermore, the identification of people as a priority is relevant to societal representation, openness, inclusiveness, accessibility, accountability, communication, and engagement with citizens. At the same time, process is relevant to accountability, and system architecture is relevant to business process collaboration. Notably, these priorities coincide with the preferred components of digitalization from an organizational perspective.

When describing the use of digital tools, services, and products in a parliament, the experts highly favored accessibility and openness (87.5%) as well as communication with citizens (84.3%) as relevant attributes. These preferences are highly ranked on the Likert (1–5) scale, i.e., L > 4.2, with 0.68 < σ < 0.90. While these results are in line with the aforementioned findings regarding the digital parliament as a whole, it is worth highlighting the interaction between citizens and systems, through careful and efficient design and implementation of digital components, tools, products, and services [21]. These results confirm once more the IPU suggestions for an open, accessible digital parliament that communicates interactively with citizens.

The 2018 IPU report indicated that digital broadcasting and video streaming will gradually overtake traditional broadcasting, a finding that is supported by the majority of the experts in this study (78.1%). Other important IPU trends, such as inter-parliamentary support and political commitment to the use of digital technologies, are also confirmed by the present survey. Additional enabling factors, such as training and skills, earned similarly high scores. Acquiring new digital skills is deemed necessary for public administrators to be able to participate in the design and operation of ParlTech. For this, novel training approaches are necessary, which may involve national schools of government [36] and/or more unified schemes, such as the Interoperability Academy within the European Interoperability Framework.

When using digital tools, knowledge of how parliaments work seems to be a high barrier to greater citizen engagement for 68.7% of the respondents. Citizen engagement can be facilitated by parliaments through the use of social media (L = 3.50, σ = 0.92). One could infer that parliaments use social media mainly to report on parliamentary business, interacting with citizens only marginally [37]. Furthermore, there have been several attempts to use innovative ICT tools for social media analysis, albeit with limited impact [38,39]. Table A2 (Appendix B) presents the aggregated results of the above study parameters in the form of average scores on the five-point Likert scale (L), along with the respective standard deviation (σ).

The use of disruptive technologies derived from the Gartner hype cycle for emerging technologies constitutes a pragmatic approach to defining a first set of applicable technologies for parliamentary use. This has already been demonstrated for the broader e-governance sector [40]. The majority of experts identified linked open data and advanced legal services as the most promising technologies (59.3%), immediately followed by social media analytics and the virtual parliament (53.1%). Linked open data, when efficiently produced and distributed, is certainly a direct manifestation of the broader call for institutional openness and can lead, as seen above, to increased accountability of parliamentary actors. The virtual parliament is not a single technology but rather a combination of technologies around virtual, augmented, mixed, and extended reality [41], and can be associated with the widespread hype around these technologies. Nonetheless, one should not underestimate the marketing effect in relation to the introduction of such technologies. An adequate marketing wrap could be an efficient passport into the parliamentary sphere for the discussed technologies.

On the other hand, it stands out that a significant number of the surveyed users do not identify machine learning solutions and artificial intelligence (AI)-assisted legal drafting and policy making as particularly relevant for parliaments. In recent years, significant applications of AI technologies have found their way into governance. In particular, machine learning, as an expression of AI, is considered a mature technology with broad applications in GovTech (Government Technology) [42], albeit its future utilization needs to be based upon responsiveness, efficiency, and fairness [43]. Thus, negative opinions may be related to technological maturity or the lack of relevant pilot/demonstrator applications. Indeed, survey users rated AI- as well as blockchain-assisted technologies as less mature than others.

Regarding the usefulness of technologies, the virtual parliament, linked open data, and advanced legal services stand out for 68.7% of the experts. Additionally, social media analytics (59.3%) and rapid digital and operational transformation (53.1%) seem to be rather useful. Digital twins represent "digital replications of living as well as non-living entities that enable data to be seamlessly transmitted between the physical and virtual worlds" [44] (p. 87). In parliaments, the concept, boosted by machine intelligence and cloud computing, could be used to monitor and optimize institutional functions and operations. However, almost a third (31.2%) of the experts do not perceive digital twin infrastructure as useful.

Sustainability of these technologies is a central issue. After all, lack thereof would be a major indicator to question investment in technologies below a certain threshold. Regarding the sustainability of technologies that the users indicated as useful, almost all experts (96.9%) stated that these will provide added value to professional parliamentary work, while roughly eight out of ten (78.1%) believe that these will help provide usable and more interesting services that strengthen the democratic appreciation of citizens. The option of empowering civic stakeholders, favored by 71.9% of the experts, deserves special mention, as it practically confirms the finding that digital communication, e.g., through social media, can potentially re-link citizens and parliaments.

#### **4. Parliamentary Hype Cycle**

The findings from the evaluation of the maturity, usefulness, and applicability parameters of emerging technologies were used to develop a parliamentary hype cycle that is based on the Gartner hype cycle concept [45,46]. Conceptually, the Gartner hype cycle depicts the hype (expectations) around new and emerging technologies versus time, until they are adopted and pass into production. Based on the original Gartner plot, the following assumptions were made to assess the necessary parameters and create a respective chart for ParlTech:


For the technologies listed in Table 1 (short names are used; classified from lower to higher maturity), Table 2 shows the mean Likert scores (1–5) for these three parameters. The methodology was to create an XY chart of Maturity (Time) versus Usefulness (Expectations), with references to the distinct stages of the hype cycle as defined by Gartner. Similar to the original plot, a third dimension (time to productivity plateau) was added for each technology data point via a color code. Analysis of the survey results led to the definition of two basic time frames for the parliamentary hype cycle:



**Table 2.** Maturity, usefulness, and applicability of ParlTech.

<sup>1</sup> Standard deviation, M: 0.75 ≤ σ ≤ 1.02; U: 0.85 ≤ σ ≤ 1.06; A: 0.85 ≤ σ ≤ 1.13.

The chart depicts Maturity (X-axis) versus Usefulness (Y-axis), based on their mean Likert (1–5) values (see Figure 1). The 'noisy' early part, attributed to the overall low grading of the maturity parameter, has been normalized, an offset has been added, and the slope of the curve has been exaggerated to match the characteristic Gartner hype cycle form. The result is a qualitative plot that depicts technology hype as perceived by the experts. Already in this form, three characteristic stages of the Gartner plot are visible. The sharply rising part of the curve matches the "innovation trigger", followed by the "peak of inflated expectations", i.e., the highest point of the curve. The curve then enters the decreasing slope of the "trough of disillusionment".
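
As an illustration of the plotting methodology only, a sketch along these lines produces the Maturity-versus-Usefulness scatter underlying Figure 1. The coordinates below are placeholders, not the survey's actual scores (those are the values of Table 2), and the normalization and curve fitting described above are omitted.

```python
import matplotlib.pyplot as plt

# Placeholder mean Likert (1-5) scores; the real values are those of Table 2.
parltech = {
    "digital twins":           (2.6, 3.1),
    "recommender systems":     (2.8, 3.3),
    "linked open data":        (3.3, 4.0),
    "advanced legal services": (3.3, 4.0),
    "virtual parliament":      (3.6, 3.9),
    "social media analytics":  (3.7, 3.8),
}

fig, ax = plt.subplots()
for name, (maturity, usefulness) in parltech.items():
    ax.scatter(maturity, usefulness)
    ax.annotate(name, (maturity, usefulness),
                textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("Maturity (mean Likert, 1-5), a proxy for time")
ax.set_ylabel("Usefulness (mean Likert, 1-5), a proxy for expectations")
ax.set_title("ParlTech hype cycle (schematic)")
plt.show()
```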

**Figure 1.** ParlTech hype cycle for year 2020.

Visibly, most ParlTech finds itself at the "innovation trigger" (a potential breakthrough that might kick things off). It is noted that technology early in the hype cycle is perceived to be less applicable compared to technologies further along the cycle. Digital twins, recommender systems, and ontological representation belong to this category. At the "peak of inflated expectations", one finds linked open data and advanced legal services. This is the technology that enjoys the biggest hype, yet it is perceived as not mature enough to enter production status. Moving further right on the maturity axis, but perhaps still within the limits of the "peak of inflated expectations", the virtual parliament along with social media analytics can be found. A similarity between the Gartner and the parliamentary hype cycles lies in the fact that most technologies are located in the first two stages of the curve, where excitement and expectations are high. On the other hand, in contrast to Gartner, the ParlTech referenced here appears not to have reached the "trough of disillusionment" stage. In general, ParlTech is found to be delayed in terms of maturity and expectations compared to Gartner's emerging technologies.

It is, however, worth noting differences when considering responses from digitally advanced parliaments (based on responses for the level of digitalization with mean L(1–7) ≥ 5.50), namely Austria, Brazil, France, Israel, Libya, and Spain. Overall, higher maturity and usefulness scores are reached for the respective technologies, and all applicability scores are in the higher tier. Finally, specific ParlTech like digital twins seem to receive considerably higher scores. The assessment of the survey results, combined with existing knowledge from previous studies, offers significant insights for the development of a digital parliament framework.

#### **5. The Digital Parliament Transformation Framework**

Considering all the above, a broad framework for the digital parliament can be set up. This framework will give parliamentary organizations the opportunity to smooth the path towards advanced stages of digital transformation. There have been earlier attempts to define such frameworks from other perspectives, e.g., in the form of organizational restructuring concepts or digital national plans. These frequently rely solely on elaborate parliamentary strategies, as in the cases of the UK, Australia, Greece, and France. Such attempts have in turn led to specific operational plans and actions within the digital environment [47,48].

Matt et al. [49] presented five general principles, i.e., strategy, operations, functions, technologies, and transformation, upon which such a framework can be developed. Gimpel et al. [50] provide six fields of action for digitalization, i.e., user, data, value, organization, operations, and transformation. Additionally, Nwaiwu [51] compared 10 conceptual and theoretical frameworks for digital transformation, which primarily deal with organizational issues rather than user behavioral aspects and technological adoption, and concludes that the parameters to be considered when choosing a model for digital transformation are corporate strategy, vision, and mission.

In light of the above, in a balancing act between strategy and technology, a robust yet adjustable structure for a parliamentary digital framework is defined. The framework consists of four distinct components that roughly correspond to the principles of Matt et al. [49] when functions are unified with operations. This becomes possible because, in legislatures, parliamentary functions closely match primary working processes. Hence, the following components may constitute the basis of an advanced digital framework for parliament:


An integrated parliamentary digital strategy is the main pillar of this framework; it contains the organization's vision, values, scope, and goals, with a clear definition of digitalization in the parliamentary context (e.g., openness, transparency, accountability, and societal representation). Significant attributes of the latter are provided in the evaluation part of this article, correlating highly with people (users) as priorities of digitalization. The next step is an operational stage that is related to the identification and planning of digitalization actions. Here, as highlighted by the parliamentary experts, actions related to inclusive governance could be prioritized. These could include, for instance, parliamentary functions that strengthen citizens' engagement. Emerging technologies, as an expression of digital evolution, constitute a natural component of any digital framework. The survey findings led to the creation of a parliamentary hype cycle, based on state-of-the-art and disruptive technologies adapted to the parliamentary context.

Parliaments, depending on their level of digitalization and willingness to innovate, could screen the hype cycle to determine technologies appropriate for further utilization. An overview of the necessary digital (and organizational) transformation enablers is suggested above and includes, among others, strong leadership, digital skills, and potential benefits for users. Figure 2 presents the proposed Digital Parliament Transformation Framework, based on the reported priorities (people, culture, structure, data, processes, systems) and the identified attributes that ParlTech adoption is expected to enhance. However, there is a series of boundary conditions under which this framework can be useful for a parliament. For instance, there might be an overlap with existing digital strategies or commitments, as is the case with the Open Government Partnership. In such cases, a parliament may opt to reassess its relevant digital plans in light of the proposed framework.

**Figure 2.** Digital parliament transformation framework.

The above framework is more than a mere thought experiment. It relies on established knowledge and trustworthy data from a structured expert survey. Therefore, it can serve as a point of reference or an inspiration for parliamentary actors when planning digital strategies and action plans. However, there are several technology parameters that are yet to be defined with precision. At the same time, the authors are aware that the proposed framework may appear inexplicit, e.g., in defining the underlying principles or in the justification of a basic set of technologies. It needs to be noted that this was the intended purpose, since an overly rigid concept in the era of disruptive technology would risk being overturned all too soon.

#### **6. Conclusions and Outlook**

Parliaments are complex representative institutions that can benefit from ongoing digitalization, particularly through the use of emerging technologies. The authors evaluated the results from a structured expert survey directed to internal parliamentary actors, parliamentary professionals and MPs, who constitute users of the tools and services of the digital parliament. Data, people and, unsurprisingly, information systems are still top priorities for parliament digitalization, thus confirming IPU's 2018 report [23]. On the other hand, societal barriers, such as bureaucratic culture and resistance to change, and the lack of tangible strategies and plans may hinder digitalization, even if there is no lack of resources. This is why stakeholders in parliaments play a significant role in organizational transformation, something which is also positively correlated with parliament digitalization (ANOVA *p* = 0.006, F(3, 28) = 5.064). Open, transparent legal data, which are inherently linked to increased accountability and accessibility, are also of significance. This, again, leads to the determination of parameters such as openness, accessibility, and communication with citizens as particularly relevant contexts for the digital parliament.

In terms of applicability, maturity, and usefulness, the evaluation of expert preferences pointed toward a number of technologies particularly interesting for parliamentary use, such as legal informatics, integrated tools and services, the virtual parliament, social media analytics, and rapid digital and operational transformation. However, significant development efforts are necessary for them to be adapted, modified, and customized for use within parliaments. Parliamentary experts stated that these technologies will bestow added value on parliamentary professional work (internal environment). In addition, there are indications that such tools and services will strengthen the democratic appreciation of citizens (external environment) by empowering and improving relationships between parliament and its civic stakeholders.

By combining the quantitative findings with the qualitative approach of Gartner's depiction, a parliamentary hype cycle has been created. Indeed, the Gartner concept proved to provide solid guidance for assessing emerging ParlTech. Using the developed parliamentary hype cycle, technologies can be screened for suitability in the institutional workspace. Overall, an analogy to the original hype cycle can be observed, yet the responses are concentrated through the prism of parliamentary use.

Nonetheless, the introduction of emerging technologies should be performed within a wider digital framework. The findings from this study enable the construction of a robust framework for the digital parliament out of four components, i.e., strategy, operations, technology, and transformation, with specific boundary conditions for the utilization of novel parliamentary technologies. Within this framework, the user plays a central role in its design and implementation, with digitalization as the ultimate aim. For any given parliament, democratic tradition is deeply embedded in its organizational culture. Though indirectly accounted for when discussing ParlTech attributes (e.g., people and culture), the study of the related deeper political, societal, and organizational perceptions, interrelations, and ethical structures is well outside the scope of this article, and further research is needed to cover this field.

The evaluation results from the survey produced comprehensive insights, whose detailed presentation goes well beyond the scope of a single publication. The authors will continue the study of the data to come up with novel insights that further elucidate the framework and the individual components of the digital parliament. They also point to the online dataset made available to the research community and call for further interdisciplinary studies in the ParlTech field. As new digital technologies emerge at a high rate, increasing investments and cross-sector collaboration within the defined technological and organizational framework are necessary for them to be efficiently deployed in the parliamentary environment. In addition, a more detailed view of individual technologies appears advantageous, possibly prioritizing the ones built on an artificial intelligence background; for instance, recommender systems (for their use in parliaments, see [52]).

Ultimately, in light of the digital (r)evolution, one has to revisit once again the very notion of the digital parliament. This study suggests that the parliament of the future will be more than a mere aggregation of tools and technologies. This new parliament will still have strong social and procedural components (see also [11]). It is in the people's interest that intra-parliamentary actors do not develop negatively biased perceptions of emerging technologies that have the potential to shape the future of legislatures. Tangible digital strategies and targeted re-education of personnel and parliamentarians to develop essential digital skills, a notion labelled as 'training' in the 2018 IPU e-Parliament report [23], seem to be inevitable steps towards the digital future of representative institutions.

**Author Contributions:** All authors conceived, designed and performed the experiments; analyzed the data and wrote the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The data presented in this study are openly available in Figshare at doi: 10.6084/m9.figshare.13604030.v3.

**Acknowledgments:** The authors would like to thank the MPs and parliamentary professionals who participated in the expert survey.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**


#### **Table A1.** Basic sample demographics.

<sup>1</sup> More than one selection was possible; hence, the total percentage exceeds 100%. <sup>2</sup> Other: Economics, History, Higher Education Policy, and Urban Geography, each represented once.




#### **References**


## *Article* **GeoLOD: A Spatial Linked Data Catalog and Recommender**

**Vasilis Kopsachilis \*,† and Michail Vaitis †**

Department of Geography, University of the Aegean, GR-811 00 Mytilene, Greece; vaitis@aegean.gr

**\*** Correspondence: vkopsachilis@geo.aegean.gr

† Current address: Department of Geography, University Hill, GR-811 00 Mytilene, Greece.

**Abstract:** The increasing availability of linked data poses new challenges for the identification and retrieval of the most appropriate data sources that meet user needs. Recent dataset catalogs and recommenders provide advanced methods that facilitate linked data search, but none exploits the spatial characteristics of datasets. In this paper, we present GeoLOD, a web catalog of spatial datasets and classes and a recommender for spatial datasets and classes possibly relevant for link discovery processes. GeoLOD Catalog parses, maintains, and generates metadata about datasets and classes provided by SPARQL endpoints that contain georeferenced point instances. It offers text- and map-based search functionality and dataset descriptions in GeoVoID, a spatial dataset metadata template that extends VoID. GeoLOD Recommender pre-computes and maintains, for all identified spatial classes in the Web of Data (WoD), ranked lists of classes relevant for link discovery. In addition, the on-the-fly Recommender allows users to define an uncatalogued SPARQL endpoint, a GeoJSON file, or a Shapefile and get class recommendations in real time. Furthermore, generated recommendations can be automatically exported in SILK and LIMES configuration files in order to be used for a link discovery task. In the results, we provide statistics about the status and potential connectivity of spatial datasets in the WoD, assess the applicability of the recommender, and present the outcome of a system usability study. GeoLOD is the first catalog that targets both linked data experts and geographic information systems professionals, exploits the geographical characteristics of datasets, and provides an exhaustive list of WoD spatial datasets and classes along with class recommendations for link discovery.

**Keywords:** linked data; spatial datasets; data catalog; dataset recommender

#### **1. Introduction**

Linked data principles [1] lay the technological background for data publishing on the web so that data can be transparently and uniformly accessed by humans and software. Link establishment among related data increases data sharing, interoperability, and reuse; aids dataset enrichment; and unleashes powerful retrieval capabilities already exploited by question answering [2–5] and query federation [6–11] systems. The idea of a web of open and interlinked data has been embraced by scientists and organizations, and steps have been taken in this direction during the last decade or so. At the early stages of linked data development, providers such as DBpedia [12], MusicBrainz [13] and GeoNames [14] converted their data to RDF and made them accessible through dumps and SPARQL endpoints, or embedded them in HTML documents using RDFa [15]. Since then, many tools have been developed, such as search engines [16,17], data catalogs [18,19], link discovery frameworks [20,21], and dataset recommenders [7,22–24], forming the linked data tools ecosystem, facilitating the consumption of linked data, and lowering the barriers to its adoption by non-expert users. Today, the Linked Open Data (LOD) cloud diagram includes more than 1200 datasets, and DataHub maintains metadata for more than 700 datasets. References [25,26] note that the volume of linked data is expanding and that the number of LOD cloud diagram datasets increased from 203 to 1269 during the period 2010–2020. LODLaundromat [27] reports 38 billion indexed triples in 2018.


The increasing availability of linked data provides more options to users but, at the same time, increases the difficulty of identifying the appropriate data sources that meet their needs. Concerning linked data search, user needs vary; some common scenarios include searching for (a) topic-specific datasets (e.g., about conferences, music, or geography) [28]; (b) datasets that contain a given entity [29,30]; and (c) datasets similar to a given dataset [23,31]. These scenarios are covered by available tools and applications; however, to the best of our knowledge, there is no tool that addresses user needs related to the geographical aspects of datasets during linked data search and exploration. In this work, we identify and address four possible scenarios:


These scenarios are covered by GeoLOD, a web catalog of spatial Web of Data (WoD) datasets and classes and a recommender for spatial datasets and classes that may contain related instances. The terms spatial dataset and spatial class denote datasets and classes, respectively, that contain georeferenced instances, that is, instances whose locations are expressed with longitude and latitude coordinates. GeoLOD parses the LOD cloud and DataHub catalogs, identifies spatial datasets and their spatial classes, and extracts their metadata. It generates additional metadata that capture spatial aspects of datasets and classes, such as their bounding box, number of spatial entities, and associated spatial vocabularies, and exposes them in GeoVoID, a vocabulary that extends the Vocabulary of Interlinked Datasets (VoID) [32] to describe spatial aspects of datasets. GeoLOD Catalog allows access to the lists of linked data spatial datasets and classes (along with their metadata) through a user interface and provides text- and map-based search functionality, thus addressing scenarios 1 and 2.

GeoLOD Recommender generates ranked lists of spatial datasets and classes that may contain instances related to a given dataset or class, so that they can be further examined in link discovery processes for the establishment of owl:sameAs links or other links that denote a close semantic relation among their instances. The recommendation method is based on the work presented in [33], which builds a recommendation algorithm on the hypothesis that "pairs of classes whose instances present similar spatial distribution are more related than pairs of classes whose instances present dissimilar spatial distribution, in the sense that the former are more likely to contain semantically related instances" (p. 152), and are thus better candidates to be used as input in a link discovery process. GeoLOD applies the recommendation algorithm in the background to generate recommendations for each class in the Catalog. It allows the exploration of related classes and datasets through the user interface and the export of automatically generated SILK and LIMES configuration files for a selected pair of recommended classes that can be directly used in link discovery processes. Additionally, it allows on-the-fly recommendations for classes provided through a user-defined SPARQL endpoint not listed in the Catalog, and for GeoJSON and Shapefile datasets, which are typical geographic information systems (GIS) file formats, thus addressing scenarios 3 and 4.

In addition to the user interface, GeoLOD provides a REST API to serve its content in well-known templates and formats, enabling software-based consumption. Specifically, it provides services that expose GeoLOD metadata and content descriptions in the Data Catalog Vocabulary (DCAT) [34] format, an RDF vocabulary designed to facilitate interoperability among data catalogs, and dataset descriptions in GeoVoID that can aid source selection in query federation systems. It also provides services that export (a) SILK and LIMES configuration files for a selected pair of classes and (b) class recommendation lists for datasets and classes, to be consumed as input in batch link discovery processes. The main novelties of GeoLOD are:


The rest of the paper is organized as follows. In Section 2, we present applications related to linked data search and dataset recommendation. In Section 3, we present the design and methods of the GeoLOD application, and in Section 4, we present its implementation and the usage of the user interface and the REST API. In Section 5, we present statistics that summarize the linked data status regarding the geospatial domain, we assess the applicability of the GeoLOD recommender in relation to the LIMES framework, and we evaluate GeoLOD's usability with different user categories, namely linked data and GIS experts. We conclude the paper in Section 6 by discussing the results and by providing pointers for the improvement of the application.

#### **2. Related Work**

In this section, we present the work related to GeoLOD, classified into three categories: (a) vocabularies and tools for dataset description, (b) dataset catalogs, and (c) dataset recommenders for link discovery. We focus on prototypes and available systems for the linked data domain.

#### *2.1. Dataset Description*

VoID [32] (Vocabulary of Interlinked Datasets) is a well-known vocabulary for describing dataset content by expressing general information (such as title, keywords, distribution URL, and provenance metadata), statistics (such as the number of triples, classes, and properties), and connectivity details to other datasets. It aims to facilitate users and software agents in their dataset exploration [35], and many tools have been developed to generate automated VoID-based or similar dataset descriptions and statistics [36,37]. For example, RDFStats [38] provides an API that generates statistical items for SPARQL endpoints and RDF documents, including instance counts (per class) and histograms (per class, property, and value type), originally developed to aid query federation systems. ExpLOD [39] summarizes RDF dataset usage and interlinking by computing representative dataset graphs and statistics, such as the number of class instances and the predicates used to describe an instance. LODStats [40] defines 32 statistical criteria, extending those defined in VoID, in a scalable and high-performance framework. Aether [41] is a statistics generator and visualization web application that focuses on comparing datasets between versions and on error detection. Loupe [42] and ABStat [43] produce ontology-driven dataset summaries that highlight their structure. ProLOD++ [44] augments dataset analytics with data mining functionality for identifying dependencies between dataset entities, such as synonymously used predicates. In addition to dataset statistics, several tools, including LODex [45], LOD-Vader [46], LODAtlas [47], and LODSynthesis [31], provide high-level dataset summaries and visualizations. Concerning the description of the geographical elements of datasets, VoID supports the expression of their geographical coverage (e.g., bounding box) using the Dublin Core [48] spatial coverage predicate, and LODStats allows the (indirect) computation of geographical coverage by combining the minimum and maximum statistical criteria of longitude and latitude property values. Nevertheless, none of the above-described vocabularies and tools capture the geographical aspects of datasets covered in this work, such as the number of georeferenced instances in datasets and classes.

#### *2.2. Dataset Catalogs*

Dataset catalogs provide single entry points for available linked-data datasets, and the most prominent examples are arguably the Linked Open Data (LOD) cloud and DataHub. The LOD cloud visualizes datasets by topic, portrays their connectivity, and exports the list of its datasets in JSON format along with their basic provenance and descriptive dataset-level metadata, such as title, description, domain, point of contact, and distribution info (e.g., access URL and SPARQL endpoint URL). DataHub provides a user interface and a CKAN API (an API for querying data catalogs) for searching and filtering (not exclusively RDF) datasets and viewing their metadata. Both catalogs are populated through user-submitted datasets and metadata. LODAtlas [47] is a data catalog that provides keyword search and faceted navigation for RDF datasets parsed from several other catalogs including DataHub, Europeana, and Data.gov. It maintains dataset metadata and statistics about the number of their triples and their in- and out-going links. Moreover, it allows concurrent and in-depth exploration and comparison of multiple datasets' characteristics and provides an overview of their connectivity based on visual summaries. LODLaundromat [49] aims to improve linked data quality by republishing data in a "cleaner" state after correcting syntax errors, filtering duplicates, replacing blank nodes, etc. As part of the cleaning process, it offers description and search services for 650,000 cleaned RDF datasets (mostly data dumps). SPARQLES [50] monitors more than 500 SPARQL endpoints, collected from DataHub, regarding their availability, performance, interoperability, and discoverability, and provides a user interface for humans and an API for software agents for consuming its content. SPORTAL [51] is a catalog of SPARQL endpoints that allows SPARQL and keyword-based search. Endpoints are profiled by extended VoID descriptions, computed by directly querying their content. IDOL [52] provides metadata and analytics about an exhaustive list of RDF datasets in various formats (e.g., zip files and SPARQL endpoints), located by parsing eight data catalogs (including the LOD cloud, LODLaundromat, and the Registry of Research Data Repositories [53]). However, the list of datasets and their analytics are available only through a dump file. Contrary to the above generic data catalogs, LSLOD [54] and YummyData [55] are domain-specific data catalogs. The LSLOD Catalogue contains 52 life-sciences-related SPARQL endpoints for serving ontology alignment purposes between different datasets in the life science domain. Even though some catalogs allow (indirectly) the search for spatial datasets (e.g., in LODAtlas, users can retrieve spatial datasets by selecting a spatial vocabulary in the faceted search component), GeoLOD, to the best of our knowledge, is the first geographical domain data catalog that provides summaries for the spatial aspects of datasets. Moreover, GeoLOD Catalog implements several novel features, such as map-based dataset and class search and the on-the-fly projection of class spatial instances on an interactive map.

#### *2.3. Dataset Recommenders for Link Discovery*

Link Discovery refers to the problem of identifying and interlinking pairs of instances between two given triplesets for which a relation holds [56]. Two well-known link-discovery frameworks are SILK [20] and LIMES [21], which execute a link discovery process by allowing the set-up of a workflow in configuration files or in user interfaces. The general workflow of a link discovery process consists of (a) providing as input two triplesets (e.g., two datasets or two classes), usually referred to as source and target, respectively; (b) defining the type of relation between their instances that will be discovered and established (e.g., owl:sameAs, which means the two instances refer to the same real-world object); (c) defining the matching rule, which consists of one or more similarity metrics and the instance properties that will be evaluated (e.g., string equality of instance labels); and finally (d) executing the workflow to generate the recommended links between the instances of the two triplesets. A common obstacle for initiating a link discovery process is that sometimes there is no prior knowledge of which two triplesets can be used as input for the link discovery process, or a linked data publisher may not be aware of target triplesets that are likely to contain related instances with their (source) tripleset. This is the focus of the Dataset Recommendation for Link Discovery domain, which refers to the automated process of recommending triplesets (e.g., datasets or classes) that may contain related instances to a given tripleset in order to be used as input in a link-discovery process.
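
As a minimal illustration of steps (a)–(d), and emphatically not of the SILK or LIMES implementations themselves, consider two triplesets reduced to hypothetical (URI, label) pairs, owl:sameAs as the target relation, and label equality as the matching rule:

```python
# Step (a): source and target triplesets, reduced here to (URI, label) pairs.
source = [("src:1", "Athens"), ("src:2", "Mytilene")]
target = [("tgt:9", "Athens"), ("tgt:7", "Patras")]

# Steps (b)-(d): owl:sameAs as the relation, case-insensitive label equality
# as the matching rule, and exhaustive comparison as a naive execution strategy.
links = [
    (s_uri, "owl:sameAs", t_uri)
    for s_uri, s_label in source
    for t_uri, t_label in target
    if s_label.lower() == t_label.lower()
]
print(links)  # [('src:1', 'owl:sameAs', 'tgt:9')]
```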

Although several methodologies have been proposed to address the problem of Dataset Recommendation for Link Discovery [22,28,57–62], only a few are implemented in tools and web applications [23,31,63,64]. One of them, the FluidOps portal [64], offers a data source exploration service, involving users in the source selection process, where a user begins to explore by providing some input (e.g., a keyword) and then refines the results through faceted search. It employs a data source contextualization method for discovering sources containing "somehow" related entities, and thus can serve link-discovery and distributed query processing tasks. TRT [63] recommends relevant triplesets for link discovery by applying link prediction metrics on a graph that maintains dataset connectivity information extracted from DataHub metadata. TRTML [23] augments the recommendation process with supervised learning algorithms. The input to the TRT/TRTML tool is the VoID description of the tripleset for which the user wants recommendations, and the output is a ranked list of relevant triplesets for link discovery. LODSynthesis [31] is a suite of services for linked data search that includes object co-reference, fact checking, dataset discovery based on connectivity analysis, and connectivity analytics and visualizations. It indexes the entire content of hundreds of datasets in the LOD cloud and recommends relevant datasets by taking into consideration the closure of equivalence relationships based on existing instance (owl:sameAs) and class (owl:equivalentClass) equivalence links. As an example, users can request the K datasets that are most connected with the Hellenic Fire Brigade dataset. A related but slightly different tool is LinkLion [30], a semantic web link repository, that is, a catalog of identified links between data sources populated from user-employed link discovery processes, which contains 12.6 million links of 10 different relation types (e.g., owl:sameAs, dbo:spokenIn) for 449 datasets. The main difference between our work and all of the above is that their recommendation processes are based on information about existing links between datasets, while ours is based on the similarity of the spatial distribution of dataset and class instances. GeoLOD's novelties also include on-the-fly class recommendation for spatial datasets in GIS formats and the export of a recommended pair of classes to SILK and LIMES configuration files for direct use in a link-discovery process.

#### **3. Design and Methods**

GeoLOD consists of two distinct but complementary modules: (a) the Catalog of spatial datasets and classes, and (b) the Recommender of candidate datasets and classes for link discovery. In the following sections, we present the design of and the methods used in each module.

#### *3.1. The Catalog*

The goal of the Catalog is to provide lists of linked data spatial datasets and classes, along with methods for their textual and spatial retrieval. Each catalog item (a spatial dataset or class) should be described by its metadata, with an emphasis on its spatial characteristics. Users and agents should be able to browse and search the catalog and select an item to view its full description. The main design decisions include (a) the definitions of the terms spatial dataset and spatial class, (b) the identification of methods for collecting information about available spatial datasets and classes, and (c) the metadata set for describing catalog items.

#### 3.1.1. Definitions

An RDF triple is a statement about two resources that follows the subject-predicate-object structure, where subject and object represent two resources and predicate their relation. A set of triples is denoted as *S* ⊆ *I* × *R* × (*I* ∪ *L*), where I, L, and R represent instances, literals, and relations, respectively, so that the subject corresponds to an instance, the predicate to a relation, and the object to an instance or a literal. With the term spatial dataset, we refer to "a set of RDF triples published, maintained or aggregated provided by a single provider" [32] containing spatial instances, that is, subjects explicitly georeferenced with predicates defined in a spatial vocabulary. A spatial vocabulary defines predicates that allow the representation of an instance location in the form of longitude/latitude coordinates in a well-known Coordinate Reference System (CRS), such as WGS84 (e.g., Athens hasLongitude "23.58"). A spatial class is a subset of a spatial dataset containing spatial instances declared to be instances of a dataset class using the rdf:type predicate (e.g., Athens rdf:type City). In this work, we search for and catalog spatial datasets and their spatial classes whose instances' locations are expressed as single points, that is, by a longitude and a latitude value, using the W3C Basic Geo [65], GeoVocab [66], GeoSPARQL [67], GeoNames [68], or GeoRSS [69] vocabularies, which are common spatial vocabularies listed in Linked Open Vocabularies (LOV) [70] and LOV4IoT [71]. Furthermore, we search for and catalog only those datasets provided by SPARQL endpoints and not by other means, such as RDF dump files. A SPARQL endpoint is an interface that is accessible through a URL and allows access to the triples of a dataset using SPARQL, the standard language for querying linked data. Therefore, the terms dataset and SPARQL endpoint are used interchangeably in the remainder of the paper.

#### 3.1.2. Data Collection

The initial pool of information about available linked data datasets is formed by parsing the content of other well-known dataset catalogs, namely LOD cloud and DataHub, which provide means for the automated consumption of their contents. Specifically, LOD cloud exposes a list of datasets and their metadata at https://lod-cloud.net/lod-data.json (accessed on 16 April 2021) in JSON (an open standard and lightweight data-interchange format), and DataHub allows access to its dataset list through the CKAN API [72] (an API for querying data catalogs). The GeoLOD Catalog parses LOD cloud and DataHub to locate datasets provided through SPARQL endpoints and extracts basic metadata, such as their title and endpoint URL. Then, it sends ASK queries to the located SPARQL endpoints to identify which of them use any of the spatial vocabularies defined in Section 3.1.1. An ASK query is a SPARQL query form that returns a true or false answer to the issued query. For example, the ASK query below returns true if the endpoint contains triples that use the http://www.w3.org/2003/01/geo/wgs84_pos#long and http://www.w3.org/2003/01/geo/wgs84_pos#lat predicates (hereafter geo:long and geo:lat, respectively, for brevity) of the W3C Basic Geo vocabulary to express the coordinates of an instance (represented by the variable ?subject).

```
ASK {
  ?subject <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?x .
  ?subject <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?y
}
```
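For illustration, this probing step can be sketched in a few lines of Python using the standard SPARQL Protocol; the snippet below is a hypothetical sketch (the `requests` library and the function name are our assumptions, not part of the GeoLOD codebase).

```
# Hypothetical sketch: probing an endpoint for W3C Basic Geo usage via an ASK query.
import requests

ASK_BASIC_GEO = """
ASK {
  ?subject <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?x .
  ?subject <http://www.w3.org/2003/01/geo/wgs84_pos#lat>  ?y .
}
"""

def uses_basic_geo(endpoint_url: str, timeout: int = 30) -> bool:
    """Return True if the endpoint contains Basic Geo coordinate triples."""
    response = requests.get(
        endpoint_url,
        params={"query": ASK_BASIC_GEO},
        headers={"Accept": "application/sparql-results+json"},
        timeout=timeout,
    )
    response.raise_for_status()
    # ASK results in the SPARQL JSON format carry the answer in the "boolean" key.
    return bool(response.json().get("boolean", False))
```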

After the available spatial datasets have been identified, we retrieve the spatial classes of each dataset by sending SELECT queries to its SPARQL endpoint. A SELECT query is another SPARQL query form, used to extract the raw values that answer the given query. Specifically, we send five SELECT queries (Table 1), one for each vocabulary, to retrieve dataset classes by vocabulary. For example, the W3C Basic Geo SELECT query returns a list of the classes (variable ?class) that contain instances (variable ?s) using the geo:long and geo:lat predicates for expressing their location; a sketch of this step is given below. We note that if a class uses more than one spatial vocabulary (for example, an instance is georeferenced using both the W3C Basic Geo and GeoRSS vocabularies), we retrieve the class only once in order to avoid duplicates. Similar SELECT queries are sent to calculate the bounding box, the number of spatial instances, and other metadata of the spatial classes and datasets, which are presented in the following section.
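As a rough illustration of this step, the sketch below shapes the query after the prose description and reuses the hypothetical `requests`-based pattern shown earlier; it is an assumption-laden sketch, not GeoLOD's actual implementation.

```
# Hypothetical sketch: retrieving the spatial classes of a dataset per vocabulary,
# de-duplicating classes that appear under more than one vocabulary via a set union.
import requests

SELECT_BASIC_GEO_CLASSES = """
SELECT DISTINCT ?class WHERE {
  ?s <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?x .
  ?s <http://www.w3.org/2003/01/geo/wgs84_pos#lat>  ?y .
  ?s a ?class .
}
"""

def spatial_classes(endpoint_url: str, queries: list[str]) -> set[str]:
    """Union of class URIs over per-vocabulary SELECT queries (a set avoids duplicates)."""
    classes: set[str] = set()
    for query in queries:
        response = requests.get(
            endpoint_url,
            params={"query": query},
            headers={"Accept": "application/sparql-results+json"},
            timeout=60,
        )
        response.raise_for_status()
        for binding in response.json()["results"]["bindings"]:
            classes.add(binding["class"]["value"])
    return classes
```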

**Table 1.** SELECT SPARQL queries for retrieving dataset spatial classes.


#### 3.1.3. Item Metadata and GeoVoID

The GeoLOD Catalog contains two main categories of items: spatial datasets and spatial classes. Spatial datasets are described by some generic metadata, namely their title, description, SPARQL endpoint URL, and VoID URL (if available), extracted from the LOD cloud and DataHub metadata. Moreover, for each dataset, we compute spatial metadata, namely its bounding box (that is, the minimum bounding rectangle (MBR) that contains all its instance locations), the number of its spatial classes and spatial instances, and the spatial vocabularies found, extracted by sending the appropriate SELECT queries (as described in Section 3.1.2); a sketch of the MBR computation is given below. Spatial classes are described by some generic metadata, namely their URI (Uniform Resource Identifier), label, description, and the dataset they belong to. For each class, we compute spatial metadata, namely its MBR, the number of its spatial instances, and the spatial vocabulary that it uses. Figure 1 summarizes the metadata set for GeoLOD datasets and classes and their association.
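For concreteness, the bounding box computation reduces to taking coordinate extremes; a minimal sketch follows, assuming the instance coordinates have already been retrieved (e.g., via a SELECT query).

```
# Minimal sketch: the MBR of a class is the min/max of its instance coordinates.
def bounding_box(coords: list[tuple[float, float]]) -> tuple[float, float, float, float]:
    """Return (min_lon, min_lat, max_lon, max_lat) for (lon, lat) instance locations."""
    lons, lats = zip(*coords)
    return (min(lons), min(lats), max(lons), max(lats))

# Example: bounding_box([(23.58, 37.97), (21.73, 38.25)]) -> (21.73, 37.97, 23.58, 38.25)
```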

**Figure 1.** GeoLOD Catalog item metadata. A class belongs to one (1) dataset, and a dataset can contain many (\*) classes.

To describe spatial datasets in a machine-readable format, we designed and introduce GeoVoID, an RDF dataset description vocabulary that extends VoID [32] to express spatial metadata at the dataset level. In VoID, a void:Dataset class represents the instance of a dataset, which is described by properties such as void:entities (denoting the total number of its entities), void:classes (denoting the total number of its classes), and void:triples (denoting the total number of its triples). A class partition, declared with the void:classPartition property, is a subset of a void:Dataset that contains the description of a certain rdfs:Class, which is specified with the property void:class. In GeoVoID, each void:Dataset class is used to describe a spatial dataset and contains a mandatory dcterms:spatial predicate, which denotes the dataset MBR in Well-Known Text (WKT) format, a markup language for representing vector geometry objects. The newly defined predicates geovoid:vocabulary, geovoid:classes, and geovoid:entities denote the dataset's spatial vocabularies, number of spatial classes, and number of spatial instances, respectively (we remind the reader that the corresponding VoID predicates are not restricted to spatial vocabularies, classes, and instances). The void:classPartition predicate contains the list of spatial classes of the dataset, where each spatial class is declared with the void:class property. Each class partition can also contain the dcterms:spatial, geovoid:vocabulary, and geovoid:entities predicates to denote the corresponding spatial metadata for a class. The GeoVoID schema is available at http://snf-661343.vm.okeanos.grnet.gr/schemas/geovoid (accessed on 16 April 2021), and its term definitions are in accordance with the definitions used in this paper; that is, a spatial entity is a georeferenced instance, a spatial class is a class containing one or more spatial instances, and a spatial vocabulary is a vocabulary that can be used for instance georeferencing.
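To make the vocabulary concrete, the snippet below emits a minimal GeoVoID description with rdflib; the dataset URI, the literal values, and the `#`-terminated term base for the geovoid namespace are illustrative assumptions rather than normative usage.

```
# Hedged sketch: a minimal GeoVoID dataset description built with rdflib.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

VOID = Namespace("http://rdfs.org/ns/void#")
DCTERMS = Namespace("http://purl.org/dc/terms/")
GEOVOID = Namespace("http://snf-661343.vm.okeanos.grnet.gr/schemas/geovoid#")  # assumed term base

g = Graph()
ds = URIRef("http://example.org/dataset/sample")  # hypothetical dataset URI
g.add((ds, RDF.type, VOID.Dataset))
# Dataset MBR as a WKT polygon (illustrative coordinates).
g.add((ds, DCTERMS.spatial,
       Literal("POLYGON((-18.2 27.6, 4.4 27.6, 4.4 43.8, -18.2 43.8, -18.2 27.6))")))
g.add((ds, GEOVOID.vocabulary, Literal("W3C Basic Geo")))  # spatial vocabularies found
g.add((ds, GEOVOID.classes, Literal(2)))     # number of spatial classes
g.add((ds, GEOVOID.entities, Literal(250)))  # number of spatial instances
print(g.serialize(format="turtle"))
```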

#### *3.2. The Recommender*

The goal of the GeoLOD Recommender is to provide, for each spatial class in the GeoLOD Catalog, a ranked list of other spatial classes that may contain related instances, that is, instances that refer to the same real-world object or to semantically close objects (e.g., a university and its campus). The recommended pairs of classes can be used as input in a link-discovery process, using tools such as SILK and LIMES, for the establishment of owl:sameAs links or other links (e.g., rdfs:seeAlso) that denote a close semantic relation between instances. The Recommender generates recommendation lists for all spatial classes in the background and provides them through the web interface at both class and dataset level. In addition, it allows on-the-fly recommendations for datasets that are not listed in the Catalog and for non-RDF spatial datasets in well-known spatial data representation formats, such as Shapefile and GeoJSON.

The recommendation process implements the methodology presented in [33], which generates a ranked list of classes relevant for a link-discovery process with a given class, based on the similarity of the spatial distribution of their instances. Below, we briefly present the recommendation process, which is analyzed in detail in [33]. Initially, the algorithm builds spatial summaries for each class that capture (a) its spatial extent, by calculating its ConvexHull (the minimum polygon that encloses all instance locations of the class), and (b) the spatial distribution of its instances, by overlaying them on a global pre-computed QuadTree and generating the set of QuadTree cell IDs that overlap with the instances of the class. A QuadTree is a spatial index that segments the world into unequally sized cells (each having an ID), where small cells cover areas with a high concentration of linked data instances (such as cities) and large cells cover areas with a low concentration of linked data instances (such as oceans). The algorithm exploits the above-described class summaries and computes the similarity of an input (source) class (the class for which someone wants to get recommendations) with each of the other summarized (target) classes. In order to reduce the number of similarity computations, the algorithm filters out target classes that do not spatially overlap with the source class (i.e., their ConvexHulls are disjoint) or whose spatial distribution summaries do not have a minimum number of QuadTree cell IDs in common with the source class (which means that the two classes share few instances in close proximity). Finally, the algorithm computes a similarity score for the source class and each of the remaining (not filtered out) target classes using one of the similarity metrics proposed in [33]: Number of Common Cells (CC), Jaccard Similarity (JS), Overlap Coefficient (OC), Poisson Distribution Probability (PD), Pointwise Mutual Information (PMI), and Phi Coefficient (PHI). The output of the algorithm is a ranked list of classes recommended to the source class for a link-discovery process. The ranking is determined by the selected metric score, so that the higher the similarity between the source and a target class summary sets, the more likely this pair of classes is to contain related instances.
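A condensed sketch of this filter-and-score step over cell-ID summaries is given below; the ConvexHull test is elided, the Jaccard metric stands in for the six metrics of [33], and the threshold value is an illustrative assumption.

```
# Hedged sketch: filtering and scoring target classes by QuadTree cell-ID summaries.
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two cell-ID summary sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def score_targets(source_cells: set[str],
                  targets: dict[str, set[str]],
                  min_common: int = 3) -> list[tuple[str, float]]:
    """Rank target classes by summary similarity to the source class."""
    scored = []
    for name, cells in targets.items():
        # Filter step: skip targets sharing too few cells (ConvexHull test omitted here).
        if len(source_cells & cells) < min_common:
            continue
        scored.append((name, jaccard(source_cells, cells)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```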

GeoLOD creates summaries and recommendations for all classes in the Catalog by executing the recommendation algorithm described above with the following modifications. Instead of determining the ranking based on one metric, it combines the three most effective metrics, which, according to the evaluation performed in [33], are the Poisson Distribution Probability (PD), the Pointwise Mutual Information (PMI), and the Phi Coefficient (PHI), as follows: the pairs of classes (the source and each of the target classes) are ranked three times, once per metric, based on the corresponding similarity scores. Then, the three ranking positions of each pair are summed to compute its combined ranking. For example, if a pair of classes is ranked 1st for the PD, 6th for the PMI, and 3rd for the PHI metric, its combined ranking is 10. Finally, the combined rankings of all pairs are sorted in ascending order to generate the final ranked list of recommended classes.
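A small sketch of this sum-of-ranks aggregation follows; it assumes the (PD, PMI, PHI) score triples have already been computed for each candidate pair.

```
# Hedged sketch: combined ranking by summing per-metric rank positions.
def combined_ranking(scores: dict[str, tuple[float, float, float]]) -> list[str]:
    """scores maps each target class to its (PD, PMI, PHI) similarity scores."""
    rank_sum = {name: 0 for name in scores}
    for metric in range(3):  # rank once per metric; higher score = better position
        ordered = sorted(scores, key=lambda name: scores[name][metric], reverse=True)
        for position, name in enumerate(ordered, start=1):
            rank_sum[name] += position
    return sorted(rank_sum, key=rank_sum.get)  # lower summed rank ranks first

# Example: a pair ranked 1st (PD), 6th (PMI), and 3rd (PHI) gets a combined ranking of 10.
```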

To further reduce the size of the final lists of classes recommended for a source class, GeoLOD applies an additional filtering condition to exclude pairs of classes that achieve a low similarity score for at least one of the three metrics. The thresholds in the following condition were set empirically and are assessed in Section 5.3:

PD > 0.90 and PMI > 1 and PHI > 0.02
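For clarity, the condition can be restated as a predicate over the three metric scores (a trivial restatement, not GeoLOD code):

```
# The default filtering condition restated as a predicate over the metric scores.
def passes_default_filter(pd: float, pmi: float, phi: float) -> bool:
    return pd > 0.90 and pmi > 1 and phi > 0.02
```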

#### **4. The GeoLOD Application**

#### *4.1. Implementation*

The GeoLOD web interface is available at http://geolod.net/ (accessed on 16 April 2021). The frontend application was developed in *React* [73] and the backend API in *Node.js* [74]. The queries to the SPARQL endpoints are sent with the *Fetch SPARQL endpoint* Node.js module [75]. The thumbnails depicting the bounding boxes of datasets and classes were generated with the *Static Image Mapbox API* [76], and the interactive maps were built on *Leaflet* [77] and *OpenLayers* [78]. The database behind the application is PostgreSQL with the *PostGIS* [79] extension for spatial data management. GeoLOD is hosted on a *Ubuntu 18 LTS 4GB* Virtual Machine provided by *okeanos*, a GRNET cloud Infrastructure as a Service (IaaS) for Greek academic institutes.

GeoLOD content, that is, the list of spatial datasets and classes with their metadata and the recommendation lists for all classes, is updated automatically every two months as a background process. In each update, newly identified spatial datasets and classes are imported into the Catalog (according to the methods described in Section 3), and existing datasets and classes are checked for content changes and updated accordingly; for instance, if the number of a class's spatial instances has changed, we update its metadata and recalculate its minimum bounding rectangle (MBR).

#### *4.2. Use Cases*

In the GeoLOD interface (Figure 2), users can browse the complete list of identified spatial datasets and classes or filter them using text- and map-based criteria. Upon entering a keyword in the *Filters* dialog box, GeoLOD searches the titles and descriptions of datasets and classes; upon selecting an area in the interactive map, GeoLOD returns the datasets and classes whose minimum bounding rectangles intersect or are contained in the selected area, thus allowing users to browse datasets and classes that contain instances in specific geographical areas, such as continents, countries, or other user-defined areas. Additionally, users can sort the dataset and class lists in multiple ways, including by title, number of instances, and number of recommendations. Upon selecting an item (a dataset or a class), users can view its full description and perform some actions.

On a dataset description page (Figure 3), users can view its title, description, SPARQL endpoint URL, its bounding box on a thumbnail, the spatial vocabularies it uses, the number of spatial entities and classes it contains, and the number of recommendations (computed as the sum of recommendations for all dataset classes), and navigate the list of the dataset's spatial classes. An icon indicates whether the SPARQL endpoint is currently available (green) or unavailable (red). In addition, users can download its VoID file (if available) and export its GeoVoID description (see Section 3.1.3) and the dataset recommendations list in JSON. The latter can be used for batch link-discovery processes and consists of all recommendations for the dataset's classes. A sample of the JSON file is depicted below: Recommendations is the root element, which contains an array of recommendations. Each array object (enclosed in { and } characters) refers to a recommendation, that is, a pair of classes, and contains the source and the target class SPARQL endpoint (properties sourceEndpoint and targetEndpoint) and URI (properties sourceClass and targetClass), respectively.

On a class description page, users can view its label, description, URI, the dataset it belongs to, its bounding box on a thumbnail, the spatial vocabulary it uses, the number of its spatial entities, and the list of recommended classes, and export the list of recommended classes in JSON. Furthermore, they can download live copies of the class instances (extracted on the fly from the SPARQL endpoint) in RDF, JSON, and GeoJSON or browse the class's spatial instances on an interactive map (Figure 4). We note that the GeoJSON downloads are transformed in order to be readily consumable by geographic information system (GIS) software, such as QGIS.

**Figure 2.** GeoLOD home page with the list of linked data spatial datasets.


**Figure 3.** The *AEMET* dataset description page.

```
{"Recommendations":[{
  "sourceEndpoint":"http://aemet.linkeddata.es/sparql",
  "sourceClass":"http://www.w3.org/2003/01/geo/wgs84_pos#Point",
  "targetEndpoint":"http://www.linklion.org:8890/sparql",
  "targetClass":"http://linkedgeodata.org/ontology/AerowayThing"
},{
  "sourceEndpoint":"http://aemet.linkeddata.es/sparql",
  "sourceClass":"http://www.w3.org/2003/01/geo/wgs84_pos#Point",
  "targetEndpoint":"http://www.linklion.org:8890/sparql",
  "targetClass":"http://linkedgeodata.org/ontology/Viewpoint"
},{
...
}]}
```
A snapshot of the recommendation list for a given class (specifically, for the *Point* class of the *AEMET* dataset, which contains information about meteorological stations) is depicted in Figure 5. Users can navigate through the list, view details for a recommended class, such as the number of estimated related instances and the ranking order, and export SILK and LIMES configuration files for the pair of classes for direct use in a link-discovery process. The configuration files are automatically generated using as input the SPARQL endpoint URLs and URIs of the source (in this example, *Point*) and the selected target class, and they are configured to perform a basic instance matching that (a) "cleans" instance labels, by converting them to lower case and removing special characters, and checks their *Levenshtein Distance*, a typical string similarity metric, and (b) checks the distance of instance locations using the *Euclidean Distance* metric.

**Figure 4.** *Point* class instances of the *AEMET* dataset on the map. The user can click on an instance to get more information in a pop-up.


**Figure 5.** The ranked class recommendations list for the *Point* class of the *AEMET* dataset.

Figure 6 shows the on-the-fly recommender user interface for generating recommendations for datasets that are not listed in the GeoLOD Catalog. Initially, users select the type of the input dataset, which can be a SPARQL endpoint, a GeoJSON file, or a Shapefile (step 1). In the first case, they enter the URL of the endpoint and select a class from the automatically populated list; in the other cases, they upload the corresponding files. GeoLOD parses the input dataset, builds the required summaries and metadata in real time, and generates a preview (step 2). Finally, users click the *Get Recommendations* button, and GeoLOD searches the Catalog to return the list of recommended classes for link discovery.

**Figure 6.** The on-the-fly recommender interface.

#### *4.3. REST API*

GeoLOD provides a REST API that can be used by software agents. Table 2 lists the names, the request URIs (the base URI is http://snf-661343.vm.okeanos.grnet.gr, accessed on 16 April 2021), and the descriptions of the main services.


**Table 2.** GeoLOD REST services.

#### **5. Results**

In this section, we present statistics that provide insights into the characteristics of spatial datasets in the Web of Data (Section 5.1) and the potential interlinkings between spatial datasets and classes based on GeoLOD recommendations (Section 5.2). In Section 5.3, we assess the applicability of the Recommender by examining the relation between GeoLOD class recommendations and LIMES instance recommendations for different algorithm variations. Finally, we present the findings of the system usability study that we performed to evaluate the GeoLOD application (Section 5.4).

#### *5.1. Catalog Statistics*

In November 2020, LOD cloud and DataHub contained 478 and 723 datasets provided through SPARQL endpoints, respectively. Many datasets are listed in both catalogs, and some are provided through the same endpoint; GeoLOD identified 629 unique SPARQL endpoints from both catalogs. After sending simple SPARQL ASK queries to each (see Section 3.1.2), 477 returned an error response, such as URL unavailable or timeout, indicating that only approximately 24% of the SPARQL endpoints found in LOD cloud and DataHub are active. Of the remaining 152 active endpoints, 60 responded true, that is, they contain a spatial vocabulary, which means that approximately 39% of the active endpoints contain georeferenced information.

In the following pages, we analyze the content of the identified spatial datasets and present statistics that reveal the availability and distribution of spatial information in the Web of Data. Initially, we sent SPARQL SELECT queries to the 60 SPARQL endpoints in order to retrieve their spatial classes and collect statistics, namely, for each endpoint, the number of total classes, spatial classes, total instances, and spatial instances. During the investigation, we found endpoints that could not respond to the issued SELECT queries and endpoints that duplicate or mirror other endpoints, and we excluded them from subsequent analysis. We also removed classes that contain very few instances (fewer than 5), because such classes cannot be used for generating recommendations, or too many instances (more than 100,000), in order to avoid high computational costs. Finally, we excluded the DBpedia dataset from our analysis, as it contains 22,742 spatial classes (approximately seven times more than the sum of the spatial classes of all other datasets) and more than 1 million spatial instances.

After applying the above restrictions, we analyzed 40 SPARQL endpoints, presented in Table 3. The total number of identified spatial classes is 3418, that is, approximately 5% of the total classes (66,571) provided by the 40 identified spatial datasets. Accordingly, we identified approximately 77 million georeferenced instances, that is, approximately 18% of the total instances (424 million) provided by the same datasets. Table 3 reveals that the biggest providers of spatial information are the *LinkedGeoData* and *Linklion* datasets, containing 952 and 902 spatial classes and more than 48 and 20 million spatial instances, respectively.

Next, we present information about the spatial characteristics of linked data datasets and classes. Table 4 presents the statistical distribution of datasets and classes by the size of their spatial extents (i.e., their minimum bounding rectangles), classified into five categories, each representing an area roughly equal to a common geographical notion, ranging from small areas, covering medium-sized cities, to large areas, covering the whole world. Most datasets and classes are "global" or cover areas approximately equal to continents (about 78% of datasets and 87% of classes), which shows that most linked data providers publish large-area datasets and that few publish local ones. Furthermore, by inspecting class contents on the GeoLOD interactive map, we noticed that, in many cases, the population completeness, that is, the percentage of all real-world objects of a particular type that are represented in a class [80], regarding spatial instances at the local level is small. The implication of these findings is that local mapping organizations have not yet adopted linked data technologies. Figure 7 shows the spatial extents of all spatial datasets and their density all over the world and indicates that most non-global-scale datasets are located in and around Europe. A closer examination of Figure 7 reveals potential georeferencing errors for some datasets. For example, one dataset extends over a small area around zero longitude and latitude in the Gulf of Guinea in the Atlantic Ocean, and another has an MBR that is a thin line starting in the Pacific Ocean, east of South America, and ending in Australia.

**Table 3.** Number of total and spatial classes, total and spatial instances for 40 SPARQL endpoints. N/A denotes that the number could not be retrieved because of errors returned from the endpoint.


**Table 4.** Datasets and classes classified by the size of their spatial extent.


**Figure 7.** Spatial datasets minimum bounding rectangles and density.

We close this section by presenting two more findings. Regarding the use of spatial vocabularies, the most used is W3C Basic Geo, which appears in all 40 examined datasets and in 3345 classes. Ten datasets also use the GeoNames vocabulary and one the GeoVocab vocabulary, in 36 and 37 spatial classes, respectively; GeoRSS is used alongside W3C Basic Geo in 15 datasets, and no dataset was found that uses the GeoSPARQL vocabulary. Concerning the availability of VoID files, of the 629 identified datasets in LOD cloud and DataHub provided through SPARQL endpoints, only 236 were found to publish a VoID description, and of the 40 datasets listed in Table 3, the respective number is 11, which shows that providers usually do not publish VoID descriptions of their datasets. Furthermore, in none of the provided VoID descriptions did we find information describing the spatial aspects presented in this paper, such as dataset bounding boxes.

#### *5.2. Recommender Statistics*

In this section, we analyze the outcome of the GeoLOD Recommender that provides insights into the potential interlinking of linked data spatial datasets and classes. In particular, we executed the recommendation algorithm for 3418 spatial classes provided by the 40 spatial datasets (Table 3) using the ranking mechanism and filtering condition presented in Section 3.2.

Table 5 presents the results of the recommendation algorithm summarized by dataset. For each dataset, it shows (a) the number of its spatial classes as listed in Table 3 (column DC), (b) the number of dataset classes for which there are recommendations (column DCR), (c) the number of recommendations to other dataset classes (column OCR), and (d) the number of recommendations to other datasets (column ODR). It is worth noting that the numbers in Table 5 refer to GeoLOD recommendations (with the specific algorithm parameters) and not to correctly recommended classes and datasets. Furthermore, the presented statistics include only recommendations to classes of other datasets, not to classes provided by the same dataset as the source class. Finally, we note that columns DCR, OCR, and ODR can be read in two ways; for example, column DCR denotes both the number of dataset classes for which there are recommendations to classes of other datasets (outbound recommendations) and the number of dataset classes for which there are recommendations from classes of other datasets (inbound recommendations).

**Table 5.** GeoLOD Recommendations statistics (DC = Number of dataset classes, DCR = Number of dataset classes for which there are recommendations, OCR = Number of recommendations to other dataset classes, ODR = Number of recommendations to other datasets).


GeoLOD recommends one or more relevant classes for link discovery for 3029 classes, that is, for approximately 89% of all classes; GeoLOD finds no recommendations for only 389 (out of 3418) classes. The 3029 classes belong to 39 different datasets, which means that GeoLOD produces recommendations for all datasets except *Surface Forestière Mondiale 1990–2016*. The total number of class recommendations is 86,998 (164,782 when recommendations to same-dataset classes are included), and thus the average number of class recommendations per class is 25.45, which means that each class gets recommendations for (or from) approximately 0.75% of the total linked data classes (25.45 of 3418). At dataset level, each dataset has on average 2175 recommendations to classes of other datasets and 13.25 recommendations to other datasets, which means that each dataset gets recommendations to (or from) approximately 33% of the total identified spatial datasets. Table 5 shows that *LinkedGeoData* and *Linklion* are hub datasets regarding the number of recommendations they have to (or from) other datasets, with recommendations to 34 and 37 other datasets, respectively.

Regarding the execution time of the recommendation algorithm, it requires approximately one day to build the summaries and 44 days to generate the recommendation lists for the 3418 classes. Thus, it requires on average 18 min to generate the recommendation list for each class, although the execution time depends on the source class size and spatial extent, ranging from a few seconds to several minutes. We note that this is also the average execution time of the GeoLOD on-the-fly recommender, which builds summaries and generates recommendations in real time.

#### *5.3. Recommender Applicability Assessment*

In [33], we evaluated the effectiveness of the recommendation methodology implemented in GeoLOD and showed that the three most effective metrics are PD (Poisson Distribution Probability), PMI (Pointwise Mutual Information), and PHI (Phi Coefficient), and that the most effective, PD, generates ranked lists of recommended classes with 62% mean average precision, approximately 35% higher than simple baselines. In this work, we assess the benefits of employing the GeoLOD Recommender as a preparatory step in link-discovery processes, regarding its applicability and gains in time, and we examine the effect of the ranking mechanism and the filtering condition presented in Section 3.2. For this reason, we execute three recommendation algorithm variations and estimate the percentage of GeoLOD recommended pairs of classes for which the LIMES link-discovery framework finds possible instance links. We recall that LIMES recommends possible links between instances of two instance sets (in this case, classes), whereas GeoLOD recommends pairs of classes for which instance links can be recommended. Therefore, the higher the number of GeoLOD recommended pairs of classes for which LIMES recommends instance links, the higher the quality and usefulness of GeoLOD recommendations.

We execute the first (default) recommender algorithm variation as follows. We initially selected, from the list of recommendations presented in Section 5.2, a random sample of 5000 (out of the total 86,998) recommendations, that is, pairs of classes. To simplify the configuration of LIMES, we restricted the sample to classes using the W3C Basic Geo spatial vocabulary. We then imported the sample set of recommendations as a batch process into LIMES, each configured with the corresponding source and target endpoint URL and class URI and with the following matching rule:

```
AND(levenshtein(a.rdfs:label,b.rdfs:label)|0.8, euclidean(a.slat|slong,b.tlat|tlong)|0.8)
```

which recommends a link between two instances when the *(Normalized) Levenshtein* similarity of the instance labels is greater than 0.8 and the LIMES euclidean similarity of the instance locations is greater than 0.8, which corresponds to a Euclidean distance of 0.25 degrees, equal to approximately 25 km at the equator in the WGS84 Coordinate Reference System. We should note that the labels' similarity is measured after "cleaning" them, that is, converting them to lower case and removing special characters using the LIMES *regularAlphabet* function.
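The rule can be mimicked in plain Python as below; the label cleaning, the Levenshtein normalization, and the 1/(1 + d) distance-to-similarity mapping follow the prose description above and are assumptions about, not extracts of, the LIMES internals.

```
# Hedged sketch of the matching rule: normalized Levenshtein similarity on cleaned
# labels AND a Euclidean-distance-based similarity, both thresholded at 0.8.
import math
import re

def clean(label: str) -> str:
    """Lower-case and strip special characters (cf. the LIMES regularAlphabet step)."""
    return re.sub(r"[^a-z0-9 ]", "", label.lower())

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def accept(label_a: str, label_b: str,
           lon_a: float, lat_a: float, lon_b: float, lat_b: float,
           threshold: float = 0.8) -> bool:
    a, b = clean(label_a), clean(label_b)
    lev_sim = 1 - levenshtein(a, b) / max(len(a), len(b), 1)
    dist = math.hypot(lon_a - lon_b, lat_a - lat_b)  # degrees
    euc_sim = 1 / (1 + dist)  # similarity 0.8 <=> distance of 0.25 degrees
    return lev_sim >= threshold and euc_sim >= threshold
```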

We examined two more aspects of the GeoLOD recommendations, namely, (a) the quality of the Top-1 GeoLOD recommendations, by importing into LIMES only the top-ranked recommendation for each class, and (b) the effect of the final filtering condition of the recommendation algorithm, by importing into LIMES only those recommendations that satisfy the following condition (stricter than the default):

PD > 0.95 and PMI > 3 and PHI > 0.2

As a baseline, we input into LIMES 5000 pairs of classes randomly selected from the GeoLOD Catalog. Since these pairs are not necessarily GeoLOD recommendations, this lets us compare the applicability of the GeoLOD recommendations against random pairs of classes. Table 6 summarizes the experimental results for the three GeoLOD recommendation algorithm configurations and the baseline. For each, it shows the number of pairs of classes used as input in LIMES (column 2, LIMES executions), the number of pairs of classes for which LIMES found one or more possible instance links (which we call hits) and its percentage of the number of LIMES executions (columns 3 and 4), and the average number of LIMES instance link recommendations per hit (column 5).

**Table 6.** GeoLOD recommender evaluation using LIMES.


The percentage of pairs of classes for which LIMES recommends instance links (column 4) outperforms, regardless of configuration, the respective percentage for the randomly generated pairs of classes (baseline). In particular, 55.98% of the default, 69.56% of the Top-1, and 68.68% of the strict GeoLOD recommendations contain link recommendations according to the LIMES basic link specification. Strict GeoLOD recommendations present a higher percentage of hits than the default GeoLOD recommendations, but the recommendation list is significantly reduced (3858 recommendations compared to 86,998), which means that the default GeoLOD recommendations include more false positives but, at the same time, more true positives than the strict GeoLOD recommendations. In the GeoLOD frontend, we use the default recommendation algorithm condition (PD > 0.90 and PMI > 1 and PHI > 0.02) because the recommendations are ranked and users can decide how far down the recommendation lists they want to go to find all the recommended pairs of classes for which LIMES recommends instance links. However, with minor modifications to the GeoLOD frontend, users could select between a strict and a loose filtering condition.

We should note that if, for a pair of classes, LIMES recommends one or more instance links, this does not necessarily mean that this pair of classes indeed contains related instances. Conducting rigorous experiments to evaluate the quality of LIMES recommendations, that is, whether instance link recommendations truly correspond to related instances, is out of the scope of this paper. Nevertheless, we can assume that if a pair of classes contains many LIMES instance link recommendations, it is more likely to truly contain related instances than a pair of classes with few LIMES instance link recommendations. Based on this assumption, we compare the GeoLOD recommendation algorithm variations by examining the average number of instance link recommendations per relevant pair of classes. Table 6 shows that for random pairs of spatial classes the average number of LIMES instance link recommendations per pair (column 5) is 303, while for GeoLOD recommended pairs the respective number is much higher for all configurations. Specifically, the highest average is achieved by the strict variation, presenting 13,119 instance link recommendations per pair of recommended classes. Therefore, we can conclude that GeoLOD (especially the strict variation) is more likely to recommend pairs of classes that truly contain related instances than the random baseline.

Finally, we discuss the search space reduction achieved by the GeoLOD Recommender and the time saved when it is used as a preparatory step of a link-discovery process. The number of pairwise class comparisons needed for finding all possible instance links for all identified spatial classes is 3418 × 3418 = 11,682,724. GeoLOD generates approximately 165,360 recommendations (including classes from the same datasets), thus reducing the search space approximately 70 times. In our experiments, LIMES required approximately one hour to compare 1000 pairs of classes for instance link recommendations; thus, comparing all possible pairs of classes in the GeoLOD Catalog with LIMES would require 486 days (with a 6.88% probability of finding a pair of classes with links), while comparing only the GeoLOD recommended pairs requires 7 days (with a 55.98% probability of finding a pair of classes with links). For a single class, the execution time for instance link discovery is approximately 3.5 h (for examining 3418 pairs of classes); using the on-the-fly GeoLOD Recommender, it is 18 min (the average time GeoLOD requires to generate recommendations for a single class) plus, on average, 3 min for comparing in LIMES the 50 recommended pairs of classes (the average number of GeoLOD recommendations per class, including same-dataset recommendations), that is, approximately 21 min in total.
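The figures above can be reproduced with a few lines of arithmetic; the snippet below is a back-of-the-envelope check using the throughput observed in our experiments.

```
# Back-of-the-envelope check of the search-space and runtime figures quoted above.
classes = 3418
all_pairs = classes * classes            # 11,682,724 pairwise comparisons
recommended_pairs = 165_360              # GeoLOD recommendations (incl. same-dataset pairs)
print(all_pairs // recommended_pairs)    # ~70x search-space reduction

pairs_per_hour = 1000                    # observed LIMES throughput
print(all_pairs / pairs_per_hour / 24)   # ~487 days to compare all pairs
print(recommended_pairs / pairs_per_hour / 24)  # ~6.9 days for the recommended pairs
```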

#### *5.4. Usability Study*

As already stated, the GeoLOD user interface mainly targets linked data experts and GIS professionals, in order to facilitate their linked data exploration and link-discovery processes. For this reason, we conducted a system usability study to assess how each category of users perceives GeoLOD and to identify strong and weak features in order to improve the application. The study is based on the System Usability Scale (SUS) [81], which consists of 10 questions rated on a five-point scale ranging from strongly disagree to strongly agree, among which five are positive statements and the remaining five are negative. An adjective rating was added as an eleventh question to collect user ratings of the perceived usability according to a seven-point scale with different wordings [82]. The participants were selected to be experts in linked data, GIS, or both domains. Initially, invitations were sent to academia and business people with known experience in these domains, and those who responded positively participated on a voluntary basis. The study was completed in two web sessions, held on different days, allowing the participants to choose based on their availability. At the beginning of each session, we explained the purpose of the study and briefly introduced the GeoLOD application. Then, participants had some time to get familiar with the application and to execute some indicative tasks.


Finally, participants completed the online SUS and adjective rating questionnaire. Each session concluded with a short discussion, where participants expressed their general comments and proposals for the improvement of the application.

In total, 41 users participated in the study; 11 perceived themselves as linked data experts and 30 as GIS experts. Of the 41 users, only four declared themselves experts in both domains. Table 7 summarizes the results of the usability study. The first two rows show the results for each category of users, and the last row contains the total results. The mean SUS score indicates the overall level of usability, where the minimum possible score is 0 and the maximum possible is 100. The mean SUS score for all participants is 68.48, and the respective score for linked data experts is higher (81.36) than for GIS experts (63.75). The adjective rating corresponds to the results of the eleventh question on the seven-point scale, "Overall, I would rate the user-friendliness of this product as:", where 1 means Worst Imaginable and 7 Best Imaginable. The mean adjective rating for all participants is 5.17, and the respective rating for linked data experts is also higher (5.64) than for GIS experts (5.00). Table 8 and Figure 8 present the results of the SUS and the adjective rating questionnaire in more detail, per question and user category.



**Table 8.** System Usability Scale (SUS) questionnaire results per question, on a scale of 1 (Strongly disagree) to 5 (Strongly agree).


**Figure 8.** Adjective ratings per user category: Linked data experts (**left**), GIS experts (**center**), all (**right**).

The results of the study indicate that users' opinion of GeoLOD's usability and friendliness is good overall, and almost excellent among linked data experts. Furthermore, the responses to the first question of the SUS questionnaire show that users believe the application is useful. During the discussion, it emerged that users, especially those who were not linked data experts, would like more guidance (e.g., through tooltips or explanatory text in the user interface), since they are not familiar with some terms, such as VoID and SPARQL endpoint. Other proposals included improving the on-the-fly recommender response times, adding responsiveness for mobile devices, and including datasets that contain polygon geometries.

#### **6. Discussion and Conclusions**

In this paper, we presented GeoLOD, a web catalog of spatial linked data datasets and classes and a recommender for datasets and classes that may contain related spatial instances. GeoLOD addresses user needs for linked data search, taking into account the spatial characteristics of datasets, and is the first exhaustive catalog and recommender exclusively for spatial datasets and classes. It provides a user-friendly interface and an API for automated content consumption. It currently contains metadata for 79 spatial datasets and 5130 spatial classes, identified by parsing the LOD cloud and DataHub catalogs. It also provides more than 166,000 recommendations for pairs of classes that may contain the same or closely related instances and an on-the-fly recommender for user-submitted SPARQL endpoints and spatial datasets in GeoJSON and Shapefile formats. The catalog and the recommendations lists are updated in the background every two months.

GeoLOD is compliant with the linked data standards for describing catalogs and datasets, providing its content in DCAT and its dataset descriptions in GeoVoID. GeoVoID, introduced in this paper, extends VoID to describe the spatial characteristics of datasets. In the results section, we presented statistics about the availability of SPARQL endpoints and VoID descriptions that confirm other recent studies [25,26,51,83]: few datasets are accompanied by VoID descriptions, and, furthermore, there is no description of their spatial characteristics, such as their bounding box or the number of their georeferenced instances. GeoLOD fills this gap by automatically generating GeoVoID descriptions for each dataset in the Catalog. Our analysis reveals that most spatial datasets and classes are published by global data providers (such as DBpedia, LinkedGeoData, and Linklion) and cover the whole or large areas of the world. The study of linked data spatial characteristics reveals georeferencing errors or generalizations, including misplaced instances, the "null island" effect (instances located at zero longitude and latitude), the representation of large-area objects (e.g., countries) with points, and low population completeness [80] regarding georeferenced instances (e.g., a class about airports contains only a random subset of the existing airports). A study of systematic errors and their causes in geographic linked data [84] reveals that about 10% of all spatial data on the linked data cloud are erroneous to some degree. These errors could be minimized if local mapping organizations or agencies participated more actively in the linked data domain, since they usually possess complete and high-quality spatial datasets. Some reasons that may prevent their engagement with linked data are the absence or immaturity of linked data publishing tools and the consequent high barriers to publishing spatial linked data. One of GeoLOD's goals is to provide an easy-to-use tool that helps users who are not linked data experts to get familiar with the linked data landscape and thus to lower the barrier for data publishing. As the usability study indicates, users from the geospatial domain are positive about adopting GeoLOD; however, they would like a more user-friendly interface regarding the explanation of terms unknown to them.

GeoLOD includes three innovative features regarding dataset interlinking: (a) a complete list of recommendations for pairs of classes that may include related instances, (b) an on-the-fly recommender for uncatalogued SPARQL endpoints and non-RDF spatial datasets, and (c) the automatic generation of SILK and LIMES configuration files. These features help users to discover links between related instances, thus fulfilling the fourth linked data principle, which suggests the establishment of links between related instances so that users can discover related things. In the results, we showed the benefits of employing the GeoLOD Recommender as a preparatory step for link-discovery processes. It recommends pairs of classes with a 55.98% probability of containing link recommendations between class instances, using a basic link specification in LIMES, while the corresponding probability for random pairs of linked data classes is 6.88%. Furthermore, it reduces by a factor of 70 the search space in the Web of Data for candidate classes that can be used as input in link-discovery processes.

We conclude the paper by outlining future work on GeoLOD. Firstly, the user interface can be improved in terms of providing more help to users. The catalog can be populated with more content, including spatial datasets that are provided through RDF dumps, are listed in other data catalogs (such as LOD Laundromat), use other well-known spatial vocabularies, or express instance locations with line or polygon geometries in various coordinate reference systems. The on-the-fly recommender can be extended to support SPARQL endpoints that use spatial vocabularies other than W3C Basic Geo and additional spatial data formats, such as the Web Feature Service (WFS) [85]. We plan to conduct experiments to fine-tune the recommendation algorithm's filtering criteria and thresholds and to further reduce its overall execution time. Other ideas include involving GeoLOD users in providing feedback on "good" or "bad" recommendations and exploiting SILK/LIMES web services for instant instance link recommendations.

**Author Contributions:** Conceptualization, V.K.; methodology, V.K.; software, V.K.; validation, V.K. and M.V.; formal analysis, V.K. and M.V.; investigation, V.K.; resources, V.K.; data curation, V.K.; writing—original draft preparation, V.K.; writing—review and editing, V.K. and M.V.; visualization, V.K.; supervision, M.V.; project administration, M.V.; funding acquisition, V.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is being supported by the funding program "YPATIA" of the University of the Aegean.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data supporting reported results can be found at GeoLOD website at http://geolod.net/ accessed on 16 April 2021.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

