### 4.4.1. QoI Analyser

The QoI Analyser is responsible for annotating data streams within the MDR with additional QoI metadata. By combining metadata with predefined QoI metrics, incoming data from data streams can be rated and the streams annotated with QoI accordingly. This additional quality information enables other components of the framework to provide better results, especially the Monitoring (cf. Section 4.3) and Ranking (cf. Section 5.2) components.

An important step is the definition of the QoI metrics available within the IoTCrawler framework. Currently, the QoI Analyser supports six QoI metrics: Completeness, Age, Frequency, Plausibility, Concordance and Artificiality. For details on the definition and calculation of these metrics, we refer to [59–61]. To integrate the results of the QoI calculation, an ontology has been created and integrated into the information model, as shown in Section 4.1.

A main anchor point for the integration is the publish/subscribe interface provided by the MDR. Figure 7 provides an overview of the interactions between the QoI Analyser and the IoTCrawler framework. When a data source is registered or updated at the MDR, the registration contains additional metadata, e.g., a detailed description of the data source's properties and characteristics. This allows the QoI calculation to be adapted to changes in the metadata, or a new data endpoint description to be used to access the data. Finally, the QoI Analyser calculates the QoI for each known data source and adds the results to the metadata, making this additional information accessible to other IoTCrawler components as well as to third-party users.

**Figure 7.** Semantic Enrichment (SE)–MDR communication.

For the following experiment, data from the city of Aarhus, Denmark are used. The data set, named "CityProbe", is a real-time data source consisting of 24 sensors mounted on light poles. The devices are solar powered and provide different sensor values, e.g., humidity, temperature, rain or CO. These data are analysed to show how the QoI Analyser detects increasing or decreasing quality of the incoming data. For the experiment, the metadata annotation has been set to the following values: the range for the measured temperature has been set to −30 °C to 40 °C, a common temperature range for a northern country, whereas the humidity ranges from 0% to 100%. These ranges are used for the calculation of the Plausibility metric by checking whether the observations remain within the defined ranges. Figure 8 shows an analysis of two sensor devices for temperature and humidity data over a time span of one week. The first graph depicts the measured values, whereas the second one shows the calculated Plausibility values. Figure 8a shows some suspicious temperature and/or humidity peaks in the negative range; from a human point of view, they can be assumed to be wrong. Accordingly, the Plausibility metric decreases, as can be seen in Figure 8b. This example shows a use case of a decreasing QoI metric. A subscriber to the QoI, e.g., the Monitoring, can now react to the dropping information quality; in the case of the Monitoring, it is now possible to initialise a more complex FD or FR algorithm or to create a new virtual sensor instance.
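The range check behind the Plausibility metric can be sketched as follows. This is a simplified, illustrative reading of the metric (the exact definition is given in the referenced publications): the score is taken as the fraction of observations in a window that fall inside the annotated range.

```python
def plausibility(observations, lower, upper):
    """Fraction of observations inside the annotated range.

    Simplified sketch of the Plausibility metric: 1.0 when all
    observations are in range, decreasing as out-of-range values occur.
    """
    if not observations:
        return 0.0
    in_range = sum(1 for v in observations if lower <= v <= upper)
    return in_range / len(observations)

# Temperature window with two implausible spikes below -30 °C
window = [4.2, 3.8, -41.0, 5.1, -55.3, 4.9]
score = plausibility(window, lower=-30.0, upper=40.0)  # 4 of 6 in range
```

Applied to the CityProbe experiment, the two spikes drag the score down, mirroring the drop visible in Figure 8b.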

With the QoI Analyser, it is possible to identify data streams with decreasing quality. As an example, the Frequency metric can detect whether a data stream fails to provide data within the annotated time interval. It is, of course, not possible to directly detect the reason for a decreasing Frequency, as IoTCrawler has no access to the sensor devices, but the results of the QoI calculation are provided to other components, which can then react to a changing QoI, e.g., by selecting an alternative data source. In this way, the QoI Analyser enhances reliability in IoT environments. The QoI annotations also provide objective criteria for choosing between data streams, especially when the search is performed by a non-human system (Requirement **R-2**).
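A Frequency check of the kind described above could look as follows. This is a hypothetical formulation, not the metric's actual definition: the score is the share of inter-arrival gaps that stay within a tolerance factor of the annotated interval.

```python
def frequency(timestamps, expected_interval, tolerance=1.5):
    """Share of inter-arrival gaps respecting the annotated interval.

    Hypothetical sketch: 1.0 when every gap between consecutive
    observations is at most `tolerance` times the annotated interval,
    degrading as observations are missed.
    """
    if len(timestamps) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    ok = sum(1 for g in gaps if g <= tolerance * expected_interval)
    return ok / len(gaps)

# Stream annotated with a 300 s interval; one observation was skipped,
# so one of the four gaps exceeds the 450 s tolerance
ts = [0, 300, 600, 1200, 1500]
score = frequency(ts, expected_interval=300)  # 3 of 4 gaps in time
```

A subscriber such as the Monitoring component could treat a dropping score as the trigger for switching to an alternative data source.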

(**a**) Humidity and Temperature

(**b**) Plausibility for Humidity and Temperature

**Figure 8.** Aarhus CityProbe sensor's data and Plausibility.

### 4.4.2. Pattern Extractor

To allow context-based search (Requirement **R-2**), the Pattern Extractor (PE) module enables the generation of higher-level context. The context itself is defined by the domain(s) of interest of the deployment, e.g., traffic congestion levels or personal health activity monitoring. The PE relies on a pre-training process in which it creates a set of clusters, each corresponding to a state or event. The PE analyses annotated IoT data streams that are pushed to the Metadata Repository and detects Events by employing a data analysis technique. A subscription to the MDR is made for StreamObservations that have a certain property and can also include spatial and temporal filters. iot-stream:StreamObservations of iot-stream:IotStreams that meet the requirements are then pushed as notifications to the PE component. The PE temporarily stores a number of observations corresponding to the time window pre-defined by the deployer. The output of the analysis is a textual label that interprets the pattern of the data. The label is then encapsulated in an iot-stream:Event instance, along with the start and end times of the window in question, and published to the MDR.
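The notification-driven flow above can be sketched as follows. All names here are illustrative, not the actual PE implementation: observations are buffered until the deployer-defined window fills, the window is analysed, and a labelled event with the window boundaries is recorded (in the real component, published to the MDR as an iot-stream:Event).

```python
from collections import deque

class PatternExtractor:
    """Minimal sketch of the PE notification flow (names illustrative)."""

    def __init__(self, window_size, analyse):
        self.window = deque(maxlen=window_size)
        self.analyse = analyse   # clustering/labelling back end
        self.events = []         # would be published to the MDR

    def on_notification(self, observation):
        """Called for each StreamObservation pushed by the MDR subscription."""
        self.window.append(observation)
        if len(self.window) == self.window.maxlen:
            label = self.analyse(list(self.window))
            self.events.append({
                "label": label,
                "windowStart": self.window[0]["time"],
                "windowEnd": self.window[-1]["time"],
            })
            self.window.clear()

# Toy analyser: the window's mean value decides the label
analyse = lambda w: "high" if sum(o["value"] for o in w) / len(w) > 50 else "low"
pe = PatternExtractor(window_size=3, analyse=analyse)
for t, v in [(0, 80), (60, 90), (120, 70), (180, 10), (240, 20), (300, 30)]:
    pe.on_notification({"time": t, "value": v})
# pe.events now holds two labelled events: "high" (0-120) and "low" (180-300)
```

In the actual deployment, the `analyse` back end would be the trained LPR/GMM model described below, and the label vocabulary would come from the domain ontology.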

The algorithm for pattern extraction aggregates observations from a time window for pattern representation: observations are grouped into time windows of predetermined size. On each window, Lagrangian Pattern Representation (LPR) [62,63] is applied to determine the patterns. The patterns are then clustered using Gaussian Mixture Models (GMM). The number of clusters depends on the number of expected events for a specific scenario. A label representing the pattern is given to each cluster; the label nomenclature is defined by the topical domain ontology for the specific use case.

In the PE component, there are two models that represent patterns [63]. K-means clustering was used for the first approach of representing patterns, and our model was applied to several data sets from the UCR Time-Series Classification Archive [64], a well-known benchmark for clustering and classification methods. The data sets Arrowhead, Lightning7, Coffee, Ford A and Proximal Phalanx Outline Age Group from the time-series archive were used. The Arrowhead data set contains shapes of projectile points as time series. Lightning7 contains time-domain electromagnetic data from lightning strikes. The Coffee data set contains measurements of infrared radiation interacting with coffee beans, used to verify the coffee species. Ford A contains measurements of car engine noise, and Proximal Phalanx Outline Age Group contains observations derived from radiographic images of hands and bones. The Silhouette coefficient was used to evaluate the model; Silhouette measures how well separated the constructed clusters are from each other. Such a measure of cluster separation is needed to evaluate the clustering technique in a real-world scenario, where the true classes are not available. The results were compared with K-means applied to the raw data without the Lagrangian representation. Table 1 shows that our method improves the clustering results on these data sets.
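The evaluation pipeline behind Table 1 can be illustrated as below. This sketch uses synthetic series standing in for a UCR data set and omits the LPR transform (whose definition is in [62,63]); it only shows how a Silhouette score is obtained from a K-means clustering of time-series rows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two synthetic groups of 50-point series standing in for a UCR data set
series = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 50)),   # class A around level 0
    rng.normal(2.0, 0.3, size=(30, 50)),   # class B around level 2
])

# Cluster the rows (in the paper, LPR would be applied to each row first)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(series)

# Silhouette: +1 means well-separated clusters, 0 means overlapping ones
score = silhouette_score(series, labels)
print(round(score, 2))
```

Running the same scoring once on raw series and once on their LPR transforms gives the paired comparison reported in Table 1.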

**Table 1.** Silhouette evaluation of Lagrangian representation using k-means.


The measurements for the above data sets were conducted on a machine with a 4.00 GHz 4-core CPU and 32 GB of RAM. For the time series in the Ford A data set, the average processing time for applying LPR was between 400 and 500 milliseconds. Figure 9 shows the relative comparison of the clustering algorithm's processing time for each data set.

**Figure 9.** Processing time of the clustering algorithm for different data sets.

For the evaluation of the Principal Component Analysis (PCA)-Lagrangian representation, the method was applied to both synthetic and real-world data, with GMM then used for clustering. We generated a synthetic time series of 2400 samples with four dimensions, drawn from three different multivariate Gaussian distributions with the same covariance matrix but different mean vectors; each distribution contributed 800 samples. In addition, another data set was generated by adding white noise with a Signal-to-Noise Ratio (SNR) of 0.01. The resulting Silhouette coefficients are 0.87 for the data without noise and 0.47 for the data with noise.
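The synthetic set-up can be reproduced roughly as follows. The mean vectors used in the paper are not given, so the ones here are illustrative assumptions, and the resulting Silhouette score will differ from the reported 0.87; the structure (2400 four-dimensional samples, three Gaussians with a shared covariance, 800 samples each, GMM clustering) follows the text.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
cov = np.eye(4)                                            # shared covariance
means = [np.zeros(4), np.full(4, 5.0), np.full(4, -5.0)]   # illustrative means
# 3 x 800 samples = 2400 four-dimensional observations
data = np.vstack([rng.multivariate_normal(m, cov, size=800) for m in means])

gmm = GaussianMixture(n_components=3, random_state=0).fit(data)
labels = gmm.predict(data)
score = silhouette_score(data, labels)
print(round(score, 2))
```

Adding white noise to `data` before fitting and re-computing the score reproduces the clean-versus-noisy comparison reported above.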

For a real-world scenario, we used air quality data from Aarhus' open data, covering a period of two months with a sampling frequency of five minutes. The data have two dimensions: Nitrogen dioxide (NO2) and Particulate Matter (PM). There are three different clusters: low risk, medium risk and high risk. We evaluated the results using the Silhouette coefficient and compared them; the results are shown in Table 2.

**Table 2.** Results of Silhouette coefficient for the Aarhus data set.


The proposed algorithms for pattern extraction allow high-level events to be extracted directly within the IoTCrawler framework (**R-2**). They also reduce the need for external applications to subscribe to raw data and decrease the amount of transferred data, improving scalability (**R-1**).
