*6.2. Data Integration*

All information gathered from social sensing activities, sensors, providers, and user contributions must go through a process of adaptation and processing so that it becomes accessible and meaningful to tourists and partners of the pilgrim path. The processing phase makes it possible to combine data coming from social sensing and from the different partners to create a customized experience and to adapt to changing conditions such as weather, the activation of new promotions, or unexpected events.

In this first test of the test-bed, we aim to show the effectiveness of the platform components chosen to process and integrate data and services coming from producers. We created a simple function that receives data from either an HTTP endpoint or a Zenoh resource, applies a filtering operation, and converts the result into JSON format. The transformed data are forwarded through a Kafka topic to Spark, where a Spark Streaming job executes computationally intensive operations on the received data, such as taking the square root of the number of occurrences of a character. The processed data are then stored in Elasticsearch through the Spark-Elasticsearch connector. To measure the impact of each part on the latency before the data can be stored in Elasticsearch, each service adds a timestamp to the payload upon receiving a message. We then sent a constant rate of 1000 requests per second and logged the resulting timestamps in Elasticsearch, visualizing them through Kibana.
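A minimal sketch of this preprocessing step is shown below; the broker address, topic name, filter rule, and timestamp field name are illustrative assumptions and not the actual deployment values.

```python
import json
import time
from kafka import KafkaProducer  # kafka-python client

# Illustrative broker address and topic name (assumptions, not the real deployment).
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def preprocess(raw_payload: str) -> None:
    """Filter the incoming message, convert it to JSON, and forward it to Kafka.

    The same body can be registered as a FaaS handler for an HTTP trigger
    or as a callback on a Zenoh subscriber.
    """
    # Simple filtering step: discard empty or oversized payloads (hypothetical rule).
    if not raw_payload or len(raw_payload) > 64 * 1024:
        return

    message = {
        "data": raw_payload,
        # Timestamp added at receival, later used to decompose the end-to-end latency.
        "ts_faas_received": time.time(),
    }
    producer.send("st-ingest", value=message)
```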

As mentioned before, fast processing of the gathered data, so that updates can be presented to customers in a near real-time fashion with a small end-to-end latency, is one of the main requirements that emerged from the "Francigena way" use case.

The results shown in Figure 7 indicate that the proposed platform is able to process, store, and re-expose a large quantity of data while keeping the total latency experienced by the customer below 1 s. The decomposition of the end-to-end latency shows that the sawtooth behavior is caused by Spark processing data with its batch streaming approach.

**Figure 7.** Average latency introduced by each component compared with the end-to-end latency under a constant load of 1000 messages/s (log scale).

We can also observe that, while the processing delay introduced by the FaaS platform shows less jitter than the Spark processing task, the many optimizations performed by Spark, such as the pre-allocation of computational resources, lead to a processing time one order of magnitude faster. The two major contributors to the resulting end-to-end latency are the message wait time and the FaaS processing time. Messages stored in Kafka queues wait until the Spark adaptor is ready to receive them for computation, and this latency is mainly due to the batch streaming behavior of Spark. The latency introduced by processing in FaaS platforms is a known problem of these platforms, caused by many factors related to the dynamic creation of functions at each request, and is a highly active research topic [36].
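As an illustration of how the per-stage latencies can be derived from the timestamps that each service adds to the payload and that end up stored in Elasticsearch, the following sketch computes the breakdown for a single message; the field names are hypothetical and stand in for whatever names the test-bed actually uses.

```python
def latency_breakdown(doc: dict) -> dict:
    """Decompose the end-to-end latency of one message from its per-stage timestamps.

    Field names are hypothetical placeholders for the timestamps added by the
    FaaS function, the Kafka producer, the Spark adaptor, and the Elasticsearch sink.
    """
    return {
        "faas_processing":  doc["ts_kafka_published"] - doc["ts_faas_received"],
        "kafka_wait":       doc["ts_spark_received"] - doc["ts_kafka_published"],
        "spark_processing": doc["ts_es_stored"] - doc["ts_spark_received"],
        "end_to_end":       doc["ts_es_stored"] - doc["ts_faas_received"],
    }
```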

These results also suggest that, while the authors' aim of achieving a delivery time below 1 s is met, the tuning of Spark Streaming is the main point of intervention when faster end-to-end processing is needed. There are, in fact, several options to tune the performance of a streaming process in Spark, such as increasing the level of parallelism in data receiving and serialization and setting the right batch interval. Finding the right batch interval (BI) for a Spark Streaming application running on a cluster is an essential condition for its stability and requires that the system be able to process data as fast as it is received [37]. We sent a constantly increasing number of requests, from 0 to 10,000, over a time-lapse of 5 min and varied the batch interval to test how the infrastructure behaves as the load increases.
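The sketch below shows how the batch interval can be set in a PySpark Streaming job and reproduces the toy computation used in the test; the socket source, the character being counted, and the application name are illustrative assumptions that stand in for the actual Kafka source and Elasticsearch sink of the test-bed.

```python
import math
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Batch interval in seconds: 0.25 corresponds to the 250 ms configuration,
# 2.0 to the 2 s one used in the tests described above.
BATCH_INTERVAL = 0.25

sc = SparkContext(appName="st-pipeline")
ssc = StreamingContext(sc, BATCH_INTERVAL)

# Source: in the test-bed, messages arrive from the Kafka topic fed by the FaaS
# function; a socket source keeps this sketch self-contained (the Kafka receiver
# requires the external spark-streaming-kafka package).
lines = ssc.socketTextStream("localhost", 9999)

def char_occurrence_sqrt(line: str) -> float:
    """Toy computationally intensive operation used in the test:
    square root of the number of occurrences of a character ('a' assumed)."""
    return math.sqrt(line.count("a"))

results = lines.map(char_occurrence_sqrt)
# In the test-bed the results are written to Elasticsearch through the
# Spark-Elasticsearch connector; here they are only printed.
results.pprint()

ssc.start()
ssc.awaitTermination()
```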

The variation of the BI shows that pipelines configured with a higher batch interval present higher jitter and latency, while being less affected by load variations (Figure 8). In fact, the pipeline configured with a 2 s BI does not show degraded behavior until approximately 150 s into the test (at 5000 messages/s), while the one configured with a BI of 250 ms already shows degradation after 75 s (at 2500 messages/s) from the test start. On the contrary, if the requirement is a low end-to-end latency, the best configuration is the 250 ms BI, which can process messages with a latency several orders of magnitude lower than the other BI configurations.

**Figure 8.** Average end-to-end latency on a logarithmic scale when varying the batch interval (BI) in Spark Streaming and submitting an increasing number of messages from 0 to 10,000 over a time-lapse of 5 min.

Zooming in on the test with the 250 ms BI, we can see that, when the latency introduced by each part is compared with the end-to-end latency, the dominant contribution comes from the Spark Streaming processing (Figure 9). These results confirm the need to introduce both the FaaS platform and Kafka, which guarantee the greatest flexibility of the infrastructure. In fact, the FaaS platform, with its fine-grained scalability, has not only exhibited the best adaptation to different connection protocols and message formats, but also the flexibility of performing preprocessing before the "hard" processing of Spark. Apache Kafka, on its side, is capable of storing information until it is effectively requested by Spark, enabling recovery from transient situations where the load exceeds what the resources assigned to Spark can compute, without losing data.

**Figure 9.** Average latency introduced by each part compared with the tail latency under an increasing load of messages from 0 to 10,000 during a time-lapse of 5 min.
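Kafka's buffering role relies only on standard topic retention settings. A sketch of how the ingest topic could be created with enough retention to absorb transient bursts is shown below; the broker address, topic name, partition count, and retention values are illustrative assumptions, not the actual test-bed configuration.

```python
from kafka.admin import KafkaAdminClient, NewTopic  # kafka-python client

# Illustrative broker address (assumption, not the real deployment).
admin = KafkaAdminClient(bootstrap_servers="kafka:9092")

ingest_topic = NewTopic(
    name="st-ingest",
    num_partitions=3,
    replication_factor=1,
    topic_configs={
        # Keep messages for one hour even if Spark lags behind, so that
        # transient overloads can be recovered without losing data.
        "retention.ms": str(60 * 60 * 1000),
        # Cap on-disk usage per partition (1 GiB here).
        "retention.bytes": str(1024 ** 3),
    },
)
admin.create_topics([ingest_topic])
```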

We can therefore claim that a platform like Apache Spark can provide a fast, complete, and efficient solution for the parallel processing of massive amounts of data coming from social sensing, social networks, sensors, and tourism partners.

At the same time, even a tailored tuning of the Spark platform cannot satisfy all the different requirements of ST while achieving optimal usage of the available computational resources [38,39]. We can state that Spark alone is not a solution for managing the heterogeneous, fluctuating information of ST scenarios: addressing the challenging tasks of ST with Spark requires an *ad hoc setup* of the infrastructure, dimensioned for the worst-case scenario, with an obvious waste of resources [40].

On the contrary, the FaaS platform, even though it introduces a greater computation latency, has shown much greater flexibility in response to load variations (Figure 9). Future developments of FaaS platforms that introduce the missing computation constructs (such as Map-Reduce) and enable better performance may make it possible to offload the part currently computed by Spark. In this way, it will be possible to leverage the finer granularity and zero-scaling capabilities of FaaS platforms to grant the best dynamic adaptation to the continuous variations typical of ST scenarios [36].

From the comparison with partners in the territory, the authors can claim that the proposed first prototype of APERTO5.0 is able to address the data collection and fast processing challenges posed by the creation of a unified platform supporting ST services on the "Francigena way".
