### **2. Related Work**

The question of search and discovery on the Internet is not new. Established techniques range from the well-known and widely deployed Domain Name System (DNS) and the Lightweight Directory Access Protocol (LDAP) to decentralised Distributed Hash Tables (DHTs). However, none of them addresses all the challenges in the IoT domain introduced in Section 1. This section highlights the current technical state of topics relevant to a search engine for the IoT.

### *2.1. Search over Discovered Metadata*

A number of approaches for managing IoT metadata and searching over it can be found in the literature. In the Dyser search engine [1], a query-based search mechanism is used for tracking the states of physical entities in real time. Using a typical link-traversing approach, it performs crawling and maintains the current state of dynamically changing metadata. Another service for semantic search and sensor discovery in the Web of Things (WoT) is DiscoWoT [2]. Using a RESTful approach, it enables the integration of WoT entities. The service is based on extensible discovery strategies and also allows publishers to semantically annotate WoT sources. The Thingful engine (https://www.thingful.net/ accessed on 24 February 2021) uses a ranking algorithm over geographically indexed resources. A map-based Web UI is provided for verified sensors with locations. A contextual search allows sensors to be queried based on their type, location and nearby surroundings. A wrapping approach for integrating real-time data sources is applied in a platform called Linked Stream Middleware (LSM) [3], which uses Semantic Web technology for integrating real-time physical sensor data. For annotating and visualising data, the platform exposes a web UI with a SPARQL endpoint for querying. A predefined taxonomy includes location, physical context, accuracy and other metadata used for displaying types of sensing devices. SPARQL 1.1 with the federation extension is used for federating queries across distributed endpoints [4]. WOTS2E [5], a search engine for the Semantic WoT, proposes a novel method for discovering WoT devices and services as well as semantically annotated data related to IoT/WoT. The engine relies on the results of traditional search engines (e.g., Google), from which it crawls Linked Data (SPARQL) endpoints, which are then semantically analysed. For the relevant endpoints, metadata are extracted and stored in a service description repository, which is later used by IoT applications such as a WoT index. In [6], the authors analyse the state-of-the-art literature on IoT search engines and conclude that the most influential (most cited) contributions were made around 2010. This explains why major references might look obsolete in 2021. In addition, the authors outline two major functionalities performed by IoT search engines (content discovery and search over it), propose a so-called meta-path methodology, identify 8 types of meta-paths and classify the search mechanisms of existing IoT search engines.
According to their classification (combinations of R, D, S, F), the search mechanisms of IoTCrawler are able to consider the following assets: aspects of stream representatives (R) and stream observations (Dynamic Content, D), and the semantics of sensors and sensing devices (Representatives of IoT things possessing streams, again R). Owing to the use of ontologies (IoT-Stream, SOSA) and an extensible GraphQL-based querying mechanism [7], the submission of new information models is not a problem for the IoTCrawler metadata storage. For example, one of the crawling mechanisms [8] uses the DogOnt ontology [9] to enrich the Metadata Repository (MDR) and search over it by the following assets: (a) types of sensors and sensing devices; (b) types of electrical appliances connected to energy-metering smart home sensors. The submission of new ontologies and the extension of the search mechanism with their facets share the same principles and would easily let IoTCrawler cover functionality aspects (F) of IoT things. Considering that, we can conclude that the search mechanism of IoTCrawler covers most of the proposed meta-path categories (except for the microsensor level, S) and matches the search capabilities of the engines belonging to them. Together with other capabilities (such as security, publish-subscribe and virtual sensors), the IoTCrawler framework exceeds the capabilities of pure IoT search engines.

### *2.2. Semantics, Ontologies and Information Models for Interoperability*

Over the past decade, a number of efforts have been made to define information models for IoT using ontologies and semantic annotations. Since these ontologies are developed by different entities and the IoT domain is quite broad, they are bound to diverge in semantics. An important focus of IoTCrawler is the description of sensors and IoT data streams. Regarding sensors, one of the main initiatives in this field is the W3C SSN ontology [10]. It defines an ontology for describing Sensors and Observations, but also extends to Systems, Deployments and Processes. SOSA [11] was created as an extension to SSN to simplify the ontology and to separate Sensors and Observations from other concepts that are deemed relevant for Sensor and Observation management. IoT-lite [12] was an effort to bind the core concepts of SSN with IoT concepts that were not covered by it, such as the concept of Service, while supporting the scalability of annotations to IoT resources in a minimalist manner. The Stream Annotation Ontology (SAO) [13] is another effort, which extends SSN to address sensor data streams. It employs a class taxonomy for stream analysis techniques, which is useful for high granularity. The IoT-Stream [14] ontology was therefore developed to serve the framework by carrying the principles adopted for IoT-lite over to data streams: streams should be annotated as minimally as possible to support scale in the context of IoT data, while remaining flexible enough to increase the granularity of annotation as needed by the system.

### *2.3. Security and Privacy in IoT*

Security and privacy cover different areas such as authentication, authorisation, integrity and confidentiality, to name a few. In the scope of IoT, Abomhara and Køien [15] identified three core aspects: privacy for humans, confidentiality of business processes and third-party dependability. They also classified different attacks related to eavesdropping on communications, which, together with traffic analysis techniques, allow attackers to identify information with special roles and activities in IoT devices and data. Nevertheless, they state that there are still open issues related to privacy in data collection, sharing and management, as described by Riahi et al. [16]. Another security aspect that has gained a lot of attention in both academia and industry is the combination of authentication and identity management. This is widely acknowledged in the literature, for example in the works of Mahalle et al. [17] and Bernal et al. [18]; the latter associates the term privacy-preserving with identity management, with the objective of representing not only users but also devices and services. These aspects, together with access control, have also been dealt with in different EU research projects, such as Smartie, SocIoTal or CPaaS.io, where the integration of these technologies has also been shown to be an appropriate solution for different domains such as smart buildings or smart cities. These projects also propose the use of access control mechanisms based on the eXtensible Access Control Markup Language (XACML) [19], even in a decentralised manner by using Attribute-Based Access Control (ABAC) [20], and deal with data privacy by using attribute-based encryption techniques, such as that of Perez et al. [21], which composes an identity.

Hwang [22] also raises the well-known concern regarding the security threats related to IoT, for example the possibility of overwhelming a system by means of a few IoT attackers using Denial-of-Service (DoS)-based attacks [23]. The most remarkable point from this paper's perspective is that, as Hwang states, a demand exists for security solutions capable of supporting multi-profile platforms with different security levels. Hernandez-Ramos et al. [24], in turn, address the issue of security and privacy from the point of view of the smart city. Their work addresses the need for mechanisms that empower citizens to manage their security and privacy through tools such as access control management and decentralised data sharing. This idea is also endorsed in another research work [25], which describes a future data-driven society requiring a harmonised vision of cybersecurity.

### *2.4. Reliability in IoT*

In the past, reliability in IoT has been handled by diverse techniques and solutions, from quality analysis to algorithms for fault detection and recovery, or the replacement of faulty data sources. The term Quality of Information (QoI) denotes the "fitness for use" of a piece of information that is being processed [26]. It was originally described as a quality indicator in the context of database systems [27], but has also been used in several frameworks for information processing. The authors of [28] proposed a framework for data translation and identity resolution for heterogeneous data sources that includes QoI. In comparison to other frameworks, their framework relies on linked data sets instead of real-time data. Other frameworks using QoI include [29], which deals with security in the context of healthcare, and [30], which deals with streaming data that are stored in a database; for later analysis, the calculated QoI is stored together with the data. A subscription system for data streams, which are selected based on their data quality, is proposed in [31]. Puiu et al. [32] focused on real-time information processing with integrated semantic annotation [33] and QoI calculation for fault recovery and event processing. Although all of these solutions integrate QoI and some of them provide real-time capabilities and semantics, they are bound to specific domains, and none of them is flexible enough to work as a decoupled solution supporting different IoT sensors.

As a result of the recent popularity of IoT, different platforms are trying to integrate large numbers of IoT devices into their systems. For this reason, some research has already been done on fault detection in IoT systems. IoTRepair [34] is a fault diagnosis system for IoT systems. Its diagnosis is facilitated by developer configuration files along with user preferences, and it works by monitoring the state of each sensor and how it correlates with the states of its neighbours. Power and Kotonya [35] provide a microservice architecture for fault diagnosis that uses event handling and online machine learning in a two-step approach. To provide a reasonable sensor value in case of faults, different imputation techniques are defined in the literature. Izonin et al. [36] developed a missing data recovery method that applies AdaBoost regression to sensor data transformed through Itô decomposition, and compared the results with other algorithms such as Support Vector Regression (SVR) and the Stochastic Gradient Descent (SGD) regressor. Liu et al. [37] defined a procedure to deal with large patches of faulty data in univariate time-series data. Al-Milli and Almobaideen [38] proposed a recurrent Jordan neural network with weight optimisation through genetic algorithms. Most of the techniques used for the detection and recovery of faults are computationally expensive and would become a burden on the processing units as the number of devices in IoT systems increases. In contrast to the aforementioned approaches, this work presents an objective approach to calculating the quality of received information, intended for an IoT search engine that can be used in cross-domain scenarios.

### *2.5. Indexing of Discovered Resources*

The large volumes of heterogeneous and dynamic IoT data sources that are available nowadays should be indexed in a distributed and scalable way in order to answer user queries quickly [39]. Depending on the attributes to be indexed, different techniques are required. For location attributes, the work in [40] proposed a framework that supports spatial indexing of the geographic values of data collected from sensing devices based on geohash (Z-order curve). Barnaghi et al. [41] combine geohashing and the semantic annotation of sensor data to create a spatio-temporal index. Before the k-means clustering algorithm is applied to distribute data in the repository and allow data queries, the dimensionality of the geohash vectors is reduced by means of Singular Value Decomposition (SVD). Another index structure is proposed in [42]. The process starts by clustering the resources based on their spatial characteristics and creating a tree structure in each cluster, where each branch represents a type of resource (e.g., humidity or CO2 sensors). The most notable works on indexing time series are Symbolic Aggregate Approximation (SAX) and its variants (e.g., iSAX 2.0 [43] and adaptive iSAX [44]). A great deal of IoT data can be considered time series, since by nature each observation has a timestamp associated with it. These methods assume that the data follow a Gaussian distribution and apply z-normalisation, by which the magnitude of the data vanishes. Since IoT data do not necessarily follow a Gaussian distribution and, owing to concept drift, the data distribution may change over time, SensorSAX [45] adapts the window size of the data according to its standard deviation in an online manner. Another work that is relevant in this sense is Blocks of Eigenvalues Algorithm for Time Series Segmentation (BEATS) [46], since it uses a non-normalised algorithm for constructing the segment representation of the raw time-series data. The mentioned methods derived from SAX are used to convert raw sensor data into symbolic representations and to infer higher-level abstractions (for example, dark rooms or warm environments).
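To make the remark about z-normalisation concrete, the following minimal sketch implements the basic SAX transform (z-normalisation, piecewise aggregate approximation, then discretisation against Gaussian breakpoints). It is an illustration of plain SAX only, not of iSAX, SensorSAX or BEATS; the function names, the four-letter alphabet and the sample values are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def z_normalise(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)  # constant series carries no shape
    return [(x - mean) / std for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: mean of equal-sized segments."""
    if len(series) % segments != 0:
        raise ValueError("series length must be divisible by segments")
    size = len(series) // segments
    return [sum(series[i * size:(i + 1) * size]) / size
            for i in range(segments)]


def sax(series, segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word over the given alphabet."""
    k = len(alphabet)
    # Breakpoints split the standard normal distribution into k
    # equiprobable regions (this is where the Gaussian assumption enters).
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    word = []
    for value in paa(z_normalise(series), segments):
        index = sum(value > b for b in breakpoints)
        word.append(alphabet[index])
    return "".join(word)


# Hypothetical readings: same shape, very different magnitude.
print(sax([1, 1, 2, 2, 3, 3, 4, 4], 4))                  # abcd
print(sax([110, 110, 120, 120, 130, 130, 140, 140], 4))  # abcd
```

Both series map to the same word, which shows how the magnitude of the data vanishes after z-normalisation; this is the property that non-normalised approaches aim to avoid.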

### *2.6. Ranking of Search Results*

While the index enables fast retrieval of search results, users and applications might still face the problem of sorting through a potentially large number of them. Ranking mechanisms can help to sort and prioritise resources and services by selecting the most suitable one. In the Web domain, Google's PageRank [47] is probably one of the most notable ranking algorithms. PageRank explores the links among Web pages to assign scores to documents, which are used in combination with text similarity metrics in the context of Web document search. In the IoT domain, on the other hand, the definition of similarity varies and resources can relate to each other based on a number of different features such as their type or their location. Not only can the number of features of IoT resources vary, but also the notion of similarity itself. Therefore, IoT ranking requires a multi-objective decision-making process in which the criteria to be considered are heavily dependent on the application and the domain. Existing work already exploits the multi-criteria nature of IoT domains for assigning ranking scores [39]. Guinard et al. [48] propose a ranking method for IoT resources that takes into account the resources' type (e.g., temperature), their multi-dimensional attributes (e.g., location) and/or the Quality of Service (QoS) (e.g., latency), and applies different ranking strategies for multi-criteria evaluation with criteria weights that are determined by the query (e.g., 40% for location, 40% for resource type and 20% for network latency). The work in [49] ranks sensor services based on two different QoS categories in Wireless Sensor Networks (WSNs), namely network-based (bandwidth, delay, latency, reliability and throughput) and sensor-based (accuracy, cost and trust). Other works incorporate user feedback or ratings into their ranking mechanisms [50,51]. In IoTCrawler, we have devised a ranking method that can be tailored to different applications.
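The idea of query-determined criteria weights can be illustrated with a simple weighted-sum ranking. This is a minimal sketch, not the method of [48] nor the IoTCrawler ranking component: the function name, the resource layout and the per-criterion scores (assumed to be already normalised to the range 0 to 1) are hypothetical. Only the 40/40/20 weighting is taken from the example in the text.

```python
def rank(resources, weights):
    """Sort resources by the weighted sum of their per-criterion scores.

    `resources` is a list of dicts with an "id" and a "scores" mapping;
    `weights` maps each criterion name to its (non-negative) weight.
    """
    total = sum(weights.values())

    def score(resource):
        return sum(weights[c] * resource["scores"][c] for c in weights) / total

    return sorted(resources, key=score, reverse=True)


# Hypothetical candidates with scores already normalised to [0, 1].
resources = [
    {"id": "s1", "scores": {"location": 0.9, "type": 1.0, "latency": 0.2}},
    {"id": "s2", "scores": {"location": 0.5, "type": 1.0, "latency": 0.9}},
    {"id": "s3", "scores": {"location": 1.0, "type": 0.0, "latency": 1.0}},
]

# Weights from the example in the text: 40% location, 40% type, 20% latency.
query_weights = {"location": 0.4, "type": 0.4, "latency": 0.2}
print([r["id"] for r in rank(resources, query_weights)])   # ['s1', 's2', 's3']

# A latency-sensitive application would supply different weights.
latency_weights = {"location": 0.1, "type": 0.4, "latency": 0.5}
print([r["id"] for r in rank(resources, latency_weights)])  # ['s2', 's3', 's1']
```

Changing only the weights reorders the same candidates, which is what tailoring the ranking to an application amounts to in this simplified setting.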

### **3. Search Framework for IoT**

In contrast to web search engines, a search engine for the IoT is used mainly by other machines or applications that need information to work properly. While a human user is able to assess the usability of a search result for their needs, a machine is not. A machine expects that all returned search results satisfy the search query, as it has no objective way to decide between them. Therefore, an IoT search engine should rank the results beforehand, even without specifically stated requirements within the search query. For this, it should use all available information about the IoT device, such as long-term availability and reliability. Search results for a human can be presented in different ways. Not only text-based results, but also images, tables and videos are popular ways to transfer knowledge. A machine, in contrast, requires not only a fixed endpoint, but also predefined data formats. It needs to know beforehand how to interpret a received search result as well as the IoT data stream.

As with any conventional search engine, looking for available resources at the time a search request is issued is not feasible. To provide search results in a timely manner, a data repository or database about the data sources needs to be built in advance. To further decrease the search time, the data within the database need to be set up with appropriate indices. For example, as the location of a device is an important factor when searching for an IoT device, providing indices related to the location can significantly improve the search. Before all of that, the search engine needs to be aware of IoT devices. This is probably the most challenging task, since there exists a variety of different IoT devices and configuration possibilities. In addition, the IoT domain is more dynamic than the World Wide Web. While web servers usually remain online and stationary over a long period of time, IoT devices may appear and disappear frequently. Thus, once an IoT device has been identified and integrated into the search engine's database, it needs to be monitored for availability and stream quality. At the same time, the environmental context of the IoT device can change, which needs to be captured to provide additional search criteria.
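As an illustration of how a location index built in advance can speed up a search, the sketch below encodes device positions as geohashes (the Z-order encoding mentioned in Section 2.5) and indexes devices by geohash prefix. It is a simplified example, not the IoTCrawler Indexing component: the class and function names, the device identifiers and the coordinates are hypothetical.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash(lat, lon, precision=6):
    """Encode a position by interleaving longitude and latitude bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, use_lon = [], True  # geohash starts with a longitude bit
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):  # 5 bits per base-32 character
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(_BASE32[index])
    return "".join(chars)


class LocationIndex:
    """Maps every geohash prefix to the devices located in that cell."""

    def __init__(self, precision=6):
        self.precision = precision
        self._cells = defaultdict(set)

    def add(self, device_id, lat, lon):
        code = geohash(lat, lon, self.precision)
        for n in range(1, self.precision + 1):
            self._cells[code[:n]].add(device_id)

    def query(self, lat, lon, prefix_len):
        """Return devices sharing the cell of the given position."""
        return set(self._cells.get(geohash(lat, lon, prefix_len), set()))


index = LocationIndex()
index.add("sensor-a", 57.64911, 10.40744)
index.add("sensor-b", 57.65, 10.41)
index.add("sensor-c", -33.86, 151.21)
print(sorted(index.query(57.649, 10.407, prefix_len=3)))
```

A lookup is a single dictionary access instead of a scan over all devices. Note that this sketch only matches devices inside the same cell; a practical index would also inspect neighbouring cells, because two nearby positions can fall on opposite sides of a cell boundary.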

For the IoTCrawler framework, we divided the search concept for IoT into two layers, shown in Figure 1: (a) the Crawling and Processing Layer and (b) the Search and Orchestration Layer, which handles incoming search requests. The parts labelled with a number (1–5) belong to the former layer and the ones labelled with an alphabetical character (A–D) belong to the latter layer.

The **Crawling and Processing Layer** is the "online" part of the framework. It is constantly running and responsible for integrating new data sources into the framework. In the first step (1), data sources of different kinds are found and integrated at the MDR level. The federated MDR is the anchor point of the IoTCrawler framework (cf. Section 4.2) and holds metadata information for all data streams available in the framework. In step (2), the IoTCrawler information model is applied. IoTCrawler features an extensive Information Model based on the Next Generation Service Interface for Linked Data (NGSI-LD) standard and centred around the concept of IoTStreams [14]. The model provides the basis for the information stored in the MDRs and the integration of heterogeneous data sources. Both steps enable other parts of the framework to handle heterogeneous data sources. After the integration of new data sources, the SE comes into play (3). The SE component enriches known data sources with new information extracted from the received data. It includes a quality analysis component that adds QoI (cf. Section 4.4.1) as well as a Pattern Extractor (PE) (cf. Section 4.4.2), which analyses data and provides higher-level information.

In parallel, the enriched data (streams) are monitored (4) to enable the Fault Detection (FD) and Fault Recovery (FR) solutions of the framework. The Monitoring component ensures a constant user experience by detecting faulty streams and providing data recovery mechanisms (cf. Section 4.3). In addition, it features a virtual sensor creator to replace faulty data streams with an ML-based virtual copy. In the last step within this layer (5), the search indices are created, allowing data sources to be found quickly in the search process. The Indexing component directly supports the search for data streams by building indices for the stream types and their attributes, such as locations (cf. Section 4.5).

The **Search and Orchestration Layer** contains components for handling search and subscription requests coming from IoT applications or individual users (A). The Orchestrator (B) is the main entry point for any user or application that wants to search for IoT devices (cf. Section 5.4). It organises the search process and orchestrates the needed data streams. The Orchestrator utilises the Search Enabler component (C) to resolve context-aware GraphQL requests to NGSI-LD requests, thus providing an easy-to-use interface that hides the complex NGSI-LD query mechanisms. For subscription requests coming from IoT applications, the Orchestrator can process the information gathered from the Search Enabler and is able to provide an endpoint to receive notifications about the stream properties, e.g., detected faults. NGSI-LD requests are redirected to the Ranking (D) component, which uses the built indices, given (user) constraints, and enriched information to rank the found data sources before they are sent back to the user or application.
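The resolution step can be pictured as mapping already-parsed search arguments onto an NGSI-LD entity query. The sketch below is not the Search Enabler's resolver, and the GraphQL schema itself is not shown: the function name, the argument layout, the broker URL, the entity type and the attribute filter are all hypothetical. Only the `/ngsi-ld/v1/entities` path and the `type`, `q`, `georel`, `geometry` and `coordinates` query parameters are taken from the NGSI-LD API.

```python
from urllib.parse import urlencode


def build_ngsi_ld_query(base_url, entity_type, attribute_filter=None, near=None):
    """Build an NGSI-LD entity query URL from simple search arguments.

    `near` is an optional (lat, lon, max_distance_m) tuple; NGSI-LD
    expects GeoJSON coordinate order, i.e. longitude first.
    """
    params = {"type": entity_type}
    if attribute_filter:
        params["q"] = attribute_filter
    if near:
        lat, lon, max_distance_m = near
        params["georel"] = f"near;maxDistance=={max_distance_m}"
        params["geometry"] = "Point"
        params["coordinates"] = f"[{lon},{lat}]"
    return f"{base_url.rstrip('/')}/ngsi-ld/v1/entities?{urlencode(params)}"


# Hypothetical request: temperature sensors within 500 m of a position.
url = build_ngsi_ld_query(
    "http://broker.example",          # assumed broker address
    "TemperatureSensor",              # assumed entity type
    attribute_filter="temperature>20",
    near=(52.27, 8.05, 500),
)
print(url)
```

Hiding this parameter encoding behind a higher-level query interface is the kind of simplification the text attributes to the Search Enabler.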

All steps, in both upper and lower layers, are constantly supported by IoTCrawler's Privacy and Security components (cf. Section 5.1) to continuously ensure restricted access to IoT data sources for legitimate users (indicated with a \* in Figure 1).

IoTCrawler enables users and applications to search for data sources, while addressing the challenges mentioned before. Due to the loose coupling of components via publish and subscribe APIs and the design of the single components, the framework is designed to reach high scalability (**R-1**).

**Figure 1.** IoTCrawler addressing search in Internet of Things (IoT).

### **4. Enablers for Discovery and Processing Layer**

This section addresses the enablers for discovery in the IoTCrawler framework introduced in Section 3. A detailed description of each enabler is provided and complemented by an evaluation of the enabler's performance.
