### *4.1. Information Model*

The IoTCrawler information model is built upon standards. It follows the NGSI-LD standard and combines it with well-known ontologies to reflect IoT use cases in the context of IoTCrawler and to address the requirement of semantics (**R-2**), i.e., to provide machine-readable results. The choice of NGSI-LD is justified by several factors: being based on a standard makes the model easier to inter-operate with, to integrate with other technologies, and to maintain and evolve. In addition, NGSI-LD supports semantics from the ground up, which is one of the core strengths of IoTCrawler and an enabler for some of the functionalities it provides. NGSI-LD not only allows semantic information to be incorporated into the data, but also provides a core information model and a common API to interact with that information (commonly called context). It was chosen as the main anchor point for the interactions between the components in IoTCrawler, greatly reducing the number of different APIs, data formats and models to implement and keep track of. This "common language" not only serves the internal purpose of simplification and optimisation, but also makes IoTCrawler components easier to integrate outside of IoTCrawler itself, and has already allowed existing components (such as the MDR) to be integrated seamlessly into IoTCrawler.

The model has been designed to capture a domain that focuses primarily on sensors and stream observations. To achieve this, and following best practices [52], concepts were reused from the SOSA ontology [11]. To enable search based on phenomena, the ObservableProperty class is also reused, and the Platform class is used to capture where a Sensor is hosted. In addition to SOSA, the SSN ontology is used to capture which Systems sensors belong to and where they are deployed. Although SOSA captures concepts for sensors and observations, the concept of streams is missing, which is fundamental for IoTCrawler as it involves stream processing.
To fill this gap, the IoT-Stream ontology defines the IotStream class [14], which represents, as an entity, the data stream that is generatedBy a sensor. It also extends the SOSA ontology by defining StreamObservation, a subclass of the Observation class; this has been done to extend the temporal properties of an Observation to cover windows as well as time points. For accessing the Service exposing an IotStream, the Service class from the IoT-lite ontology is used, which enables direct invocation of the data source. The IoT-Stream ontology also provides concepts for Analytics and Events, which represent aspects of the semantic enrichment process. Also in connection with semantic enrichment, IoT-Streams link to external concepts that capture QoI information about the streams. This is provided by the QoI ontology, which captures aspects of quality such as Age, Artificiality, Completeness, Concordance, Frequency and Plausibility [53]. An important aspect of any entity is its location; here, the GeoProperty defined by the NGSI-LD meta-model is used. The main classes and relationships of the IoTCrawler model are illustrated in Figure 2.
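To make the model concrete, the following is a minimal sketch of how an IotStream entity could be serialised as NGSI-LD, linking it to its Sensor via the generatedBy relationship and attaching a location as a GeoProperty. All URNs, coordinates and the context URL are illustrative assumptions, not values from the IoTCrawler deployment.

```python
import json

# Hypothetical IotStream entity in NGSI-LD form; identifiers are made up.
iot_stream = {
    "id": "urn:ngsi-ld:IotStream:example-001",  # hypothetical entity URN
    "type": "IotStream",
    "generatedBy": {  # relationship to the Sensor producing the stream
        "type": "Relationship",
        "object": "urn:ngsi-ld:Sensor:example-001",
    },
    "location": {  # NGSI-LD GeoProperty holding a GeoJSON geometry
        "type": "GeoProperty",
        "value": {"type": "Point", "coordinates": [-1.13, 37.98]},
    },
    "@context": [
        "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld",
    ],
}

print(json.dumps(iot_stream, indent=2))
```

In a real deployment the `@context` would additionally reference the IoT-Stream, SOSA and QoI vocabularies so that terms such as `generatedBy` expand to their ontology IRIs.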

**Figure 2.** IoTCrawler information model.

### *4.2. Federation of Metadata Repositories*

A key enabler in the IoTCrawler framework is the federation of multiple MDRs. The MDRs store all available metadata gathered by the discovery process. Considering the requirements from Section 1, the MDR must not only support the IoT search as a whole, but also address the requirements for scalability (**R-1**) and semantics, so that machines and applications can use the available IoT data sources (**R-2**).

From among other candidate technologies, such as triple stores or relational databases, IoTCrawler has chosen the NGSI-LD standard, which not only defines a data model for context information, forming the basis for IoTCrawler's data model (see Section 4.1), but also defines an API used by consumers and providers alike to access information. The API functionalities offered by the MDR include direct queries and a publish/subscribe mechanism, which allows context consumers to receive notifications whenever new information is made available in the system. This publish/subscribe mechanism is used extensively in IoTCrawler for communication and synchronisation between components, which subscribe to the context information relevant for their purpose and publish the processed information to make it available to other components.
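The publish/subscribe pattern above can be sketched as an NGSI-LD subscription payload that asks the broker to notify a component whenever IotStream entities change. The subscription URN and the notification endpoint are placeholders invented for illustration.

```python
import json

# Hypothetical subscription: notify a consumer about IotStream changes.
subscription = {
    "id": "urn:ngsi-ld:Subscription:example-001",  # hypothetical URN
    "type": "Subscription",
    "entities": [{"type": "IotStream"}],  # watch all IotStream entities
    "notification": {
        "endpoint": {
            "uri": "http://consumer.example/notify",  # placeholder receiver
            "accept": "application/ld+json",
        }
    },
    "@context": [
        "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld",
    ],
}

# In the NGSI-LD API this payload would be POSTed to the broker's
# /ngsi-ld/v1/subscriptions/ endpoint.
print(json.dumps(subscription, indent=2))
```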

NGSI-LD brokers can be interconnected in different ways to achieve scalability. The best-performing deployment configuration of NGSI-LD brokers, which is used in the IoTCrawler framework, is the federated one, as shown in Figure 3. It consists of a federation of brokers, in which all information held by the different federated brokers is automatically accessible through the federation broker. The federation broker acts as the central point of IoTCrawler's architecture and is the key to making IoTCrawler horizontally scalable and well performing. This allows all other components in IoTCrawler to use the MDR in a scalable and standardised way and, being based on the NGSI-LD standard, not only makes the MDR inter-operable and standards-compliant, but also allows different existing implementations to be used.
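In NGSI-LD terms, one plausible way a federated broker learns about the entities held by a downstream broker is through a context source registration. The sketch below builds such a payload; the registration URN and the broker endpoint are illustrative assumptions, not addresses from the IoTCrawler deployment.

```python
import json

# Hypothetical registration telling the federation broker that a
# downstream broker can answer queries about IotStream entities.
registration = {
    "id": "urn:ngsi-ld:ContextSourceRegistration:example-001",  # made-up URN
    "type": "ContextSourceRegistration",
    "information": [
        {"entities": [{"type": "IotStream"}]}  # entity types the source holds
    ],
    "endpoint": "http://federated-broker.example:9090",  # placeholder broker
    "@context": [
        "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld",
    ],
}

# In the NGSI-LD API this payload would be POSTed to the federation
# broker's /ngsi-ld/v1/csourceRegistrations/ endpoint.
print(json.dumps(registration, indent=2))
```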

For our current deployment, we have used Scorpio (https://github.com/ScorpioBroker/ScorpioBroker, accessed on 24 February 2021) because it is the only implementation that considers a federated scenario. Nevertheless, within the scope of this paper we have focused our metrics on a single instance of this broker, obtaining both latency and scalability measurements. To do so, we deployed a virtual machine with 8 CPU cores and 28 GB of RAM in the Google cloud. Latency has been evaluated over the different operations provided by the MDR, specifically entity management, publication/subscription and context provisioning.

**Figure 3.** Federated broker architecture [54].

Measurements show that the most time-consuming operation is retrieving entities by their ID, which takes around 800 ms. This operation should not be this costly, and we attribute the low performance to the maturity of the software. The remaining operations take from 17 to 37 ms, which is a far more reasonable processing time. Regarding subscription management, the creation of subscriptions is the heaviest task, taking up to 270 ms, whereas the other operations take only about 17 ms. Finally, context provisioning tasks, which comprise the registration of information coming from context providers pointing at the end-point services they provide, take more time than the previous tasks. Nevertheless, the registration and deletion of context providers are operations that are usually executed only once per context provider. By contrast, the operations to obtain context providers take about 100 ms.
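Latency figures like the ones above are typically obtained by timing each broker operation repeatedly and reporting a summary statistic. The following is a minimal sketch of such a measurement loop; the stub function stands in for a live HTTP call (e.g., fetching an entity by ID) so the sketch runs without a broker, and the exact methodology used in the paper may differ.

```python
import statistics
import time

def measure_latency_ms(operation, repeats=10):
    """Run `operation` `repeats` times and return the median latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Placeholder for a round trip such as GET /ngsi-ld/v1/entities/{id};
# the sleep simulates roughly 1 ms of broker processing.
def stub_get_entity():
    time.sleep(0.001)

print(f"median latency: {measure_latency_ms(stub_get_entity):.1f} ms")
```

Using the median rather than the mean keeps a single slow outlier (e.g., a cold cache) from distorting the reported figure.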

Finally, regarding the scalability metric, we have focused on the CPU and memory resources consumed by the NGSI-LD broker instance over a range of simultaneous connections (2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024), repeating the process four times. The results of these tests are presented in Figure 4, which shows that CPU consumption follows a logarithmic curve, with the slope flattening from 8 simultaneous connections onwards. By contrast, increasing the number of simultaneous connections does not notably impair memory consumption.

**Figure 4.** Metadata Repository (MDR) scalability assessment.
