**2. Background**

This section provides a brief overview of the Digital Twin concept, presenting its main building blocks and relating them to IPPODAMO. Next, we provide a concise and contextual survey of big data computing frameworks, part of the technological ecosystem adopted in this work.

### *2.1. Digital Twin Concept*

Grieves is recognized as the first to coin the term Digital Twin, using it to describe the digital product lifecycle management process [11]. Since then, thanks to the availability of modern computing platforms and of engineering and management paradigms and practices, the term has gained in popularity, promoting DT technology as a key enabler of advanced digitization.

To date, many such applications have been successfully implemented in different domains, ranging from manufacturing [12] and logistics [13] to smart cities [14]. However, there is currently no common understanding of the term, and in this respect, a taxonomy would help to demarcate the conceptual framework.

To this end, the authors in [4] provide a classification of the individual application areas of Digital Twins. As part of a comprehensive literature review, they develop a taxonomy for classifying the domains of application, stressing the importance and prior lack of consideration of this technology in information systems research. Building on this, the authors of [15] undertake a structured literature review and propose an empirical, multi-dimensional taxonomy of Digital Twin solutions, focusing on the technical and functional characteristics of the technological solution. As an add-on, the proposal allows for the classification of existing Industry 4.0 standards enabling a particular characteristic.

What unites the various DT approaches is the presence of some key technological building blocks that need to be considered and are crucial to its success.

1. Data link: the twin needs data collected from its real-world counterpart over its full lifecycle. The type and granularity of the data depend on the scope and context of deployment of the technology.

IPPODAMO relies on a multitude of heterogeneous data sources, available at different granularities. Each data source has a dedicated (automatic) ingestion process used to transform the data, enriching the IPPODAMO data layer. For more information on the data sources and some computational aspects, we refer the reader to Sections 3 and 4, respectively.

2. Deployment: the twin can be deployed and span the entire cloud-to-thing continuum, starting from the thing (IoT devices), the edge and/or the cloud. The specific deployment criteria depend on the scenario requirements, typically based on latency/bandwidth and/or security/privacy constraints.

IPPODAMO is not directly involved in the raw data collection process. The system relies on third-party data providers, which collect, extract and push the data to dedicated, cloud-backed ingress machines. The raw data are anonymized and temporarily retained in the system.

3. Modeling: the twin may contain different and heterogeneous computational and representational models pertaining to its real-world counterpart. These may range from first-principle models adhering to physical laws; data-driven models; geometrical and material models, such as Computer-Aided Design/Engineering (CAD, CAE); or visualization-oriented ones, such as mixed reality.

Our Digital Twin solution provides a consolidated view of heterogeneous urban data (descriptive Digital Twin), while at the same time, relying on historical and (near-)real-time data to perform near-to-mid-term predictions on the activity indexes (predictive Digital Twin). This allows operators to perform simulation scenarios aimed at minimizing predetermined indexes during routine and scheduled interventions. More information on the use cases and, in particular, the UFM scheduling is provided in Section 5.

4. APIs: the twin needs to interact with other components, e.g., twins in a composite system or an operator consulting the system. To facilitate these interactions, various APIs must be available to allow for information collection and control between the twin and its real-world counterpart.

IPPODAMO presents the UFM operator with an intuitive, high-level user interface abstracting the low-level system interfaces. At the same time, IPPODAMO offers a programmatic interface, exposing well-defined REST APIs through which other systems and operators can integrate and interact. The rationale behind this choice is to enable future integrations of IPPODAMO into a larger, federated ecosystem of smart city platforms. This aspect of the study is beyond the scope of this article.

5. Security/privacy: considering the role and scope of the DT, physical-to-digital interactions require security/privacy mechanisms aimed at securing the contents of the twin and the interaction flows between the twin and its physical counterpart. The solution is deployed on a state-of-the-art virtualized environment equipped with all the necessary security features. To this end, different administrative levels are provisioned, for accessing both the virtualized system and the system functionalities.

In the following, we provide a concise survey on some big data computing frameworks, motivating the choice of the identified technological ecosystem.

### *2.2. Big Data Computing*

Smart cities generate, and more often than not require, the collection of massive amounts of geo-referenced data from heterogeneous sources. Depending on the process and purpose of the system(s), these data must be quickly processed and analyzed to benefit from the information. To this end, big data computing frameworks have gained tremendous importance, enabling complex and online processing of information and allowing us to gain insights about phenomena in (near-)real time.

Cluster computing frameworks such as Apache Hadoop, based on the MapReduce compute model, are optimized for offline data analysis and are not ideal candidates for fast and online data processing due to the overhead of storing/fetching data at intermediate computation steps [16]. Current efforts in the relevant state-of-the-art have shifted toward promoting a new computing paradigm referred to as Stream Processing [17]. This paradigm adheres to the dataflow programming model, where computation is split and modeled as a directed acyclic graph and data flow through the graph, subjected to various operations [18].

Micro-batching represents a middle ground in this continuum: the idea is to discretize the continuous flow of data into a sequence of small chunks, delivered to the system for processing. Frameworks such as Apache Spark adopt this philosophy [19]. While Apache Spark is not a purely streaming approach, it introduces a powerful abstraction, the resilient distributed dataset (RDD [20]), allowing for distributed and in-memory computation. This programming abstraction supports efficient batch, iterative and online micro-batching processing of data, and it is a perfect match for the dynamics of our UFM scenario. Moreover, the cluster computing platform offers a wide range of libraries and computation models, as well as the geographical extensions needed to handle spatial data [21].
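To make the micro-batching idea concrete, the following minimal sketch (not part of the IPPODAMO codebase; the socket source, host, port and batch interval are illustrative assumptions) shows how Spark's streaming API discretizes a continuous stream into RDD-backed micro-batches:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch").setMaster("local[2]")
    // Discretize the continuous stream into 10 s micro-batches (arbitrary interval).
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical text source; each line is one ingress record.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each micro-batch is exposed as an RDD, so batch-style operators apply.
    val counts = lines
      .map(record => (record.split(",")(0), 1L)) // key on the first CSV field
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```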

### **3. The IPPODAMO Platform**

In this section, we start by presenting the data sources used by the Digital Twin, discussing their spatial and temporal characteristics. Next, we provide a high-level overview of the functional components comprising the technical solution.

### *3.1. Data Sources*

IPPODAMO relies on a multitude of heterogeneous data sources provided, in part, by the project partners, including a telco operator and a company operating in the UFM sector.

Referring to Figure 1, the twin relies on (anonymized) vehicular and human presence data, combined to extract a measure of the activity inside an area of interest. The data sources are geo-referenced and embody different levels of granularity, and they are subject to different data processing steps and aggregation procedures aimed at building a composable activity level index (refer to Section 5). At the time of writing, the system has processed and currently stores two years of historical data, and it actively processes (near-)real-time updates from each data source. These data have important value, enabling IPPODAMO to gain (near-)real-time insights into the activities, but also to simulate near-to-mid-term evolutions of the activity level by exploiting the historical data.


**Figure 1.** IPPODAMO data sources.

Other important data relate to the UFM process itself, comprising: (i) data generated by the urban monitoring activity, e.g., the status of an urban asset assessed periodically through field inspections; (ii) annual planning and scheduled operations, e.g., repair interventions; and (iii) geographical data concerning public utilities such as hospitals, schools, cycling lanes, etc. The data are provided and updated by the company operating in the UFM sector, which has a vested interest in accurately depicting and monitoring the status of the urban assets.

Additional data sources are extracted from the open data portal curated and maintained by the municipality of Bologna, Italy; these include: (i) city events and (ii) other public utility maintenance operations.

All the above-mentioned data sources undergo dedicated processing steps and are stored in a logically centralized system, providing a consolidated and multi-source data layer. A rich set of visualizations can be built, guiding the UFM operator in their work. At the same time, more advanced functionalities, relying on the historical data to predict future evolutions of the phenomena inside an area of interest, are possible, and this topic is discussed in Section 5.

### *3.2. Technological Ecosystem*

The platform (Figure 2) is structured in four main conceptual layers: (i) the ingestion layer, which interacts with the data providers, continuously acquiring new data, performing syntactic transformations and pushing the data upwards for further refinement; (ii) the big data processing layer, which performs semantic transformation and enrichment of the raw data fed from the ingestion points; (iii) the storage layer, providing advanced storage and query capabilities over (near-)real-time and historical data; and (iv) the analytics layer, presenting to the customer an advanced (near-)real-time layer with query capabilities, aggregate metrics and advanced representations of data.

**Figure 2.** Technological components of the IPPODAMO platform.

For each data source used by the Digital Twin and, in particular, for the vehicular and presence data, there is a custom ingestion process tasked with reading the raw data, performing some syntactic transformations and pushing the data towards the big data cluster. The various data sources are retrieved and pushed in parallel to specific Kafka topics, which also identify the semantic processing pipeline. Indeed, to enable the reliable and fast delivery of data from the ingestion points to our analytics and storage platform, we rely on Apache Kafka, an open-source, distributed, message-oriented middleware [22]. This choice is driven by the capabilities of the platform, including, but not limited to, its ability to gracefully scale processing in the presence of high-throughput, low-latency ingress data. This matches the requirements of our domain, where data are constantly and periodically collected from many heterogeneous sources, e.g., vehicle black boxes, cellular networks, public transport, etc.
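As an illustration of such an ingestion step, the following minimal sketch (the broker address, topic name and record layout are illustrative assumptions, not IPPODAMO internals) applies a syntactic transformation to a raw record and publishes it to a source-specific topic:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IngestionSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker:9092") // hypothetical broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Syntactic transformation: normalize a raw CSV record into JSON
    // before publishing it to the source-specific topic.
    val raw = "44.4949,11.3426,2023-01-01T10:00:00Z"
    val Array(lat, lon, ts) = raw.split(",")
    val json = s"""{"lat":$lat,"lon":$lon,"ts":"$ts"}"""

    // Topic name is hypothetical; one topic per data source selects the
    // downstream semantic pipeline on the Spark cluster.
    producer.send(new ProducerRecord[String, String]("vehicular-raw", json))
    producer.flush()
    producer.close()
  }
}
```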

To provide advanced and fast processing capabilities for spatial data, the technological stack integrates and relies on Apache Sedona [23]. Apache Sedona is a distributed processing library, which builds on the in-memory computation abstraction of the Apache Spark framework and is able to provide advanced and fast queries over spatial data. Thanks to this spatial processing framework, we are able to blend and elaborate different data sources, creating rich representations of various phenomena in an area of interest.
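The following hedged sketch shows the kind of spatial blending Sedona enables; the input files, view names and column names are illustrative assumptions, and the registration entry point varies across Sedona versions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.sedona.sql.utils.SedonaSQLRegistrator

object SpatialQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sedona-sketch").master("local[*]").getOrCreate()
    // Registers ST_* functions; newer Sedona releases use SedonaContext.create(spark).
    SedonaSQLRegistrator.registerAll(spark)

    // Hypothetical inputs: tiles as WKT polygons, vehicular records as lon/lat points.
    spark.read.option("header", "true").csv("tiles.csv").createOrReplaceTempView("tiles")
    spark.read.option("header", "true").csv("vehicular.csv").createOrReplaceTempView("points")

    // Point-in-polygon join: count vehicular points falling inside each tile.
    val counts = spark.sql("""
      SELECT t.tile_id, COUNT(*) AS activity
      FROM tiles t, points p
      WHERE ST_Contains(ST_GeomFromWKT(t.wkt),
                        ST_Point(CAST(p.lon AS DOUBLE), CAST(p.lat AS DOUBLE)))
      GROUP BY t.tile_id
    """)
    counts.show()
  }
}
```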

Once the data have been processed, they are stored in Elasticsearch, a fast and scalable NoSQL database with advanced capabilities for indexing and querying key-value data [24]. Thanks to the advanced integration between Spark and Elasticsearch, we can use the solution both as a data provider and as a distributed storage system. The data are subject to further refinements and algorithmic processes aimed at creating different layers of aggregation, calculating synthetic indexes, etc., which are then used by the planner functionality to identify suitable time intervals during which to schedule urban operations. The raw data and the information extracted from them are presented to the end-users in different forms through advanced visualization dashboards available through Kibana, part of the Elasticsearch ecosystem.

Last is the JobServer component, an optimized RESTful interface for submitting and managing Apache Spark tasks [25]. A job request, e.g., for a suggestion on a maintenance operation schedule, is a JSON-formatted request triggering the execution of a specific algorithm, whose JSON-formatted output is visualized through a web-based user interface. This component forms part of the IPPODAMO programmatic API, which could be used to integrate the solution into a larger, federated ecosystem of platforms in a smart city context.
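For illustration, a minimal sketch of the Spark-to-Elasticsearch persistence step follows; it assumes the elasticsearch-hadoop connector on the classpath, and the host, index name and record schema are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object StorageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-sink-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical aggregated records: tile id, hour, activity index.
    val indexes = Seq(("tile-001", "2023-01-01T10", 0.42),
                      ("tile-002", "2023-01-01T10", 0.77))
      .toDF("tile_id", "hour", "activity")

    // The elasticsearch-hadoop connector persists the DataFrame directly
    // into an index, making it queryable from Kibana dashboards.
    indexes.write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "elastic-host") // hypothetical Elasticsearch host
      .option("es.port", "9200")
      .mode("append")
      .save("activity-index") // hypothetical index name
    spark.stop()
  }
}
```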

### **4. Big Data Processing**

In this section, we provide some technical details on the big data processing pipelines dedicated to the transformation and enrichment of incoming raw data. Next, we present and analyze the performance trend of two distinct data processing pipelines, providing in-depth insights into some system mechanisms.

### *4.1. Processing Pipeline(s)*

The data sources introduced in Section 3 are subject to different processing pipelines due to the inherent syntactic and semantic differences that they embody. Concerning the vehicular and presence data, of relevance to the UFM process is the measure of the activity level inside a particular area over time. To this end, both pipelines implement a counting technique measuring the activity volume. The geographical granularity of the presence data is accounted for on a tile basis, a square-shaped geographical region covering an area of 150 m × 150 m, while the vehicular data are point data (latitude and longitude) and can be accounted for at any meaningful granularity. These data resolutions are imposed by the data provider. It is important to note that in scenarios where both vehicular and presence data need to be accounted for, the granularity of this aggregation operation can be a tile or a multiple of tile entities, as sketched below.
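As a toy illustration of tile-level accounting, the following planar sketch assumes coordinates already projected to a metric CRS; the provider's actual tiling scheme may be defined differently:

```scala
// Snap a projected point (in metres) to its 150 m x 150 m tile.
// Assumes coordinates in a metric CRS (e.g., UTM); the real tiling is
// imposed by the data provider and may differ.
final case class Tile(ix: Long, iy: Long)

def tileOf(xMetres: Double, yMetres: Double, side: Double = 150.0): Tile =
  Tile(math.floor(xMetres / side).toLong, math.floor(yMetres / side).toLong)

// Multiples of the base tile (e.g., 2 x 2 blocks) coarsen the aggregation.
def coarsen(t: Tile, factor: Long): Tile =
  Tile(Math.floorDiv(t.ix, factor), Math.floorDiv(t.iy, factor))
```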

Referring to Figure 3, both vehicular and presence data sources are initially subjected to some syntactic transformations before being forwarded by the Kafka producer, simplifying the ingestion process on the Spark cluster. From here on, the data are subject to semantic transformation processes, enriching the source data with relevant geographical information. In particular, the last operation in the pipeline accounts for the activity index information, retaining it in a distributed memory support. This operation must be fast and efficient; otherwise, it risks becoming a bottleneck for the overall system, degrading the ingestion performance.

To this end, we rely on a spatial partitioning mechanism, dividing the area of interest among the cluster nodes. The area of interest to the project is initially saved in the underlying distributed file system, then programmatically loaded and partitioned using the GeoHash spatial partitioning scheme [26] (a hedged sketch of this step follows below). This allows us to distribute the topological data, and the indexes built over them, among the cluster nodes. Indeed, in scenarios where ingress data are uniformly distributed inside the area of interest, this allows us, on average, to equally distribute and scale the computation across the available cluster nodes. In particular, the data are accounted for by performing distributed point-in-polygon (PiP) and k-Nearest-Neighbor (kNN) operations, enclosed by the *GeoRDDPresence* and *GeoRDDVehicular* functional components for the presence and vehicular data, respectively. The data, once retained, are then subjected to additional processes, aggregating and slicing them in the time and space domains (refer to Section 5).

Other sources of information are those containing topological information on urban assets such as cycling and bus lanes, hospitals, schools, etc., and the open data offered by the municipality, e.g., city events. Topological data have a dedicated batch processing pipeline; at its crux is a geo-join operation aimed at enriching the IPPODAMO baseline topological map with additional information on urban assets. A dedicated update procedure is available, whereby old information is discarded and only fresh information is retained.
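The following hedged sketch shows one way to realize GeoHash-keyed partitioning with Sedona SQL; it assumes the Sedona-registered `spark` session and `points` view from the earlier snippet, a Sedona release providing `ST_GeoHash`, and an arbitrary precision of 6:

```scala
import org.apache.spark.sql.functions.col

// Key each point by its GeoHash cell, then co-locate records of the same
// cell on the same worker; downstream PiP/kNN work stays node-local.
val byGeoHash = spark.sql("""
  SELECT *, ST_GeoHash(ST_Point(CAST(lon AS DOUBLE), CAST(lat AS DOUBLE)), 6) AS gh
  FROM points
""").repartition(col("gh"))
```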

Last are the data sources containing operational information on the facility management process. UFM data have spatial and time information and are currently handled by a processing pipeline similar to that of the topological data.

**Figure 3.** IPPODAMO (big) data processing pipelines and information flow.

### *4.2. Performance Analysis*

The periodicity of the individual data sources imposes some operational constraints on the data processing pipelines, and this consideration applies to the vehicular and presence data. As an example, when a new batch of vehicular data enters the system, the corresponding end-to-end processing pipeline, comprising the various syntactic and semantic transformations, should take no more than 60 min (refer to Figure 1). Failure to do so would create a backlog of data that grows over time, degrading the ingestion performance.

In particular, among the operations shown in Figure 3, the one embodying the highest computational burden is the *GeoRDDVehicular* operation, which requires the execution of a kNN operation for each data point present in the hourly dataset. We report that the end-to-end sequential processing of an hourly vehicular dataset containing, on average, 15,000 records (trips) often results in a violation of its operational constraint.

To address the issue, we leverage some specific constructs of the Apache Sedona framework aimed at distributing the computational effort among the various cluster nodes. As anticipated, the proposed solution makes use of (i) the GeoHash spatial partitioning scheme, allowing the partitioning of a geographical area of interest among cluster nodes, and (ii) the broadcast primitive, implementing a distributed, read-only shared memory support. In particular, the ingress vehicular dataset is enclosed in a broadcast variable shared among the worker nodes, where each node is responsible for the computation of a kNN operation over a subset of the points contained in the original dataset. This approach allows the parallel computation of individual kNN operations, whose outcomes are later merged and retained in Elasticsearch. Figure 4 shows the performance trend of the optimized vehicular data processing pipeline. In particular, Figure 4a plots the processing time under a varying number of input records. The resulting trend grows monotonically with the input size while allowing for the timely processing of the input dataset, adhering to the operational constraints imposed by the data source periodicity. Figure 4b puts the processing time into a broader context, accounting for additional operations occurring before and after the *GeoRDDVehicular* processing step, comprising data (de)serialization and output communication to the driver node, with the kNN processing step accounting for nearly 91% of the processing time.
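The following self-contained sketch illustrates the broadcast-based parallelization pattern on toy planar data; the arc and point coordinates, names and the choice of k = 1 are illustrative, while the actual IPPODAMO implementation relies on Sedona's spatial primitives:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastKnnSketch {
  // Toy planar "road arcs"; squared distance suffices for neighbour ranking.
  final case class Arc(id: String, x: Double, y: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("knn-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical road-arc reference set, partitioned across the cluster.
    val arcs = sc.parallelize(Seq(Arc("a1", 0.0, 0.0), Arc("a2", 150.0, 0.0), Arc("a3", 0.0, 150.0)))

    // The hourly vehicular batch (~15,000 records in the paper) is small
    // enough to broadcast read-only to every worker.
    val hourlyPoints = sc.broadcast(Seq((10.0, 20.0), (140.0, 90.0)))

    // Each partition matches its local arcs against every broadcast point;
    // a reduce-side min per point merges the partial results (k = 1 here).
    val nearest = arcs.flatMap { arc =>
      hourlyPoints.value.map { case (px, py) =>
        val d2 = (arc.x - px) * (arc.x - px) + (arc.y - py) * (arc.y - py)
        ((px, py), (arc.id, d2))
      }
    }.reduceByKey((a, b) => if (a._2 <= b._2) a else b)

    nearest.collect().foreach(println)
    spark.stop()
  }
}
```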

**Figure 4.** Vehicular processing pipeline performance. The experiments were carried out in a testbed comprising 4 VMs—1 driver and 3 workers—each equipped with 8 vCores, 32 GB vRAM and 150 GB data SSD support. (**a**) Overall processing time under varying number of ingress records. (**b**) Processing time decomposition for the 15,000 record configuration.

### **5. A Decision Support System**

At its core, the IPPODAMO platform serves as a decision support system, aiding UFM operators in their daily activity. In this section, we start by discussing a set of identified use cases best targeting the needs of the UFM operator. Next, we discuss the concept behind the activity level index, a synthetic index used by some underlying system functionalities. Then, we provide a brief description of the algorithmic details of the scheduler functionality, along with a validation study showcasing its capabilities.

### *5.1. Use Cases*

While preserving the general aspect of our study and without loss of generality, we identified three broad use cases, which showcase the capabilities of the solution in the following directions:


Concerning the first use case, the system provides a rich set of configurable visualizations, allowing the UFM operator to consult and compare the historical and current data trends inside an area of interest. The second use case aims to provide the UFM planner with a proactive decision tool, guiding the scheduling decisions. The last use case allows the platform administrator to perform an *a posteriori* evaluation of the annual planning schedule, dictated by the data that IPPODAMO has ingested. Through this functionality, we would also like to be able to perform a qualitative evaluation of the algorithmic decisions made by IPPODAMO, comparing them with the knowledge of the UFM specialist. At the core of these use cases is the activity level index, which is discussed in the following section.

### *5.2. Activity Level Index*

Once the data are processed, they are stored in Elasticsearch and are subjected to further periodic and event-based algorithmic processes, aggregating and slicing the data in the time and space domains. Intuitively, the activity index is a measure of the activity level inside an area of interest.

At first, vehicular data are stored at their finest granularity, contributing to the traffic volume at a specific point of the underlying road topology. These point-wise data are aggregated and accounted for on a road-arc basis, a constituent of the road topology. The data are also sliced in the time domain, accounting for the daily traffic volume and the traffic volume during some configurable rush hours. Once aggregated, the data are normalized and stored at different scales, e.g., for better visualization. This computed quantity constitutes the activity level index derived from the vehicular data.
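A minimal Spark SQL sketch of this aggregation and normalization step follows; the `vehicular` view, its column names and the 7-9 rush-hour window are illustrative assumptions, and min-max scaling is only one of the possible normalizations:

```scala
// Assumes the Sedona-registered `spark` session from the earlier sketches
// and a `vehicular` view with arc_id and a timestamp column ts.
val daily = spark.sql("""
  SELECT arc_id,
         to_date(ts) AS day,
         COUNT(*)    AS volume,
         COUNT(CASE WHEN hour(ts) BETWEEN 7 AND 9 THEN 1 END) AS rush_volume
  FROM vehicular
  GROUP BY arc_id, to_date(ts)
""")
daily.createOrReplaceTempView("daily")

// Normalize volumes to [0, 1] so indexes from different sources compose;
// assumes a non-degenerate volume range (MAX > MIN).
val index = spark.sql("""
  SELECT arc_id, day,
         (volume - MIN(volume) OVER ()) /
         (MAX(volume) OVER () - MIN(volume) OVER ()) AS activity_index
  FROM daily
""")
```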

The human presence data are subjected to a similar workflow. The difference lies in the granularity (the tile entity) at which the data are accounted for. Recall that this granularity is imposed on us by the data provider. Once the individual indexes are computed, they can be composed and weighted according to some criteria, as formalized below. The current implementation allows an operator to simulate different scenarios by specifying the weights accordingly.
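As a hedged formalization (the symbols are ours, not the article's notation), the composition of the two indexes for an area $a$ at time $t$ can be written as a convex combination:

$$I(a,t) \;=\; w_v\, I_v(a,t) + w_p\, I_p(a,t), \qquad w_v + w_p = 1,\quad w_v, w_p \ge 0,$$

where $I_v$ and $I_p$ denote the normalized vehicular and presence indexes, and the weights $w_v, w_p$ encode the operator's scenario choices.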

These indexes are periodically updated and maintained, and they constitute the input for downstream functionalities, e.g., the UFM scheduler. Currently, three implementations of the activity level index are available, and the planner's behavior is parametric on the index type:


Concerning the predictive index, we rely on state-of-the-art algorithms capable of inferring seasonality and local phenomena from the data. Currently, we are evaluating some practical design considerations that can occur in a dynamic and time-varying scenario such as ours.

### *5.3. UFM Scheduling Algorithm*

This functionality is used in the second use case and aids the UFM operator in searching for a suitable timeframe during which to schedule a maintenance intervention. A maintenance operation may consist of a minor/major repair of an urban asset; it has fixed coordinates in space, a predicted duration and an optional timeframe within which it needs to be scheduled. The scheduling criteria vary and, depending on the objective, one would like to avoid or minimize disturbance to nearby activities. To this aim, the system can express and consider all these and other constraints when performing the search for a suitable timeframe.

Once a request is issued, the system receives all the constraints expressed by the operator, including the list of attributes, e.g., coordinates, expected duration, etc., for each intervention. The algorithm then exploits the geographical coordinates to compute the activity level index inside an area of interest and gathers all the potential interferences with nearby ongoing activities. To this aim, the functional component relies on the Spark SQL library and the geographical primitives available in Apache Sedona. Recall that the algorithm's behavior is parametric on the index type, and different types of indexes are available. The final outcome of this computation is the construction of the index history and its hypothetical evolution in time.

Once the activity level index is computed and all interferences have been retrieved, a final index is computed for the time horizon under consideration and served as input to the scheduling algorithm, as sketched below. The exact algorithmic details and final index composition are beyond the scope of this article.
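To fix ideas, the following toy sketch (our illustration, not the undisclosed IPPODAMO algorithm) searches a composed hourly index series for the quietest window matching the intervention's duration:

```scala
// Given the composed hourly activity index for the area of interest,
// return the start hour whose window of `durationHours` has the lowest
// cumulative activity; None if the horizon is shorter than the duration.
def bestStart(index: Vector[Double], durationHours: Int): Option[Int] =
  if (index.length < durationHours) None
  else Some(
    index.sliding(durationHours) // candidate timeframes
         .map(_.sum)             // cumulative activity per window
         .zipWithIndex
         .minBy(_._1)._2         // start offset of the quietest window
  )

// Example: a 3-hour intervention over a 6-hour horizon.
// bestStart(Vector(0.9, 0.4, 0.2, 0.1, 0.6, 0.8), 3) == Some(1)
```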
