**1. Introduction**

#### *1.1. Scope and Significance*

The future of connected and automated vehicles (CAVs) and the development of intelligent transportation systems (ITSs) are actively researched topics that open up a multitude of possibilities. With the progress in computation and communication technology over the last decade, some formerly unrealistic constructs have become practically viable and now demand proof-of-concept implementations. We envision a future in which traffic participants and observers such as CAVs and ITSs share their information resources in real time for a safer and more efficient transportation and traveling experience. Herein, we outline a so-called *Central System* architecture that enables such information sharing and integration. We use the term *central* in a strictly logical sense, denoting the emergence of a single, fully integrated and logically consistent environment and decision model in the cloud (the *digital twin*), while the physical implementation itself remains highly *distributed*: computation and communication loads are delegated to a hierarchy of spatially localized edge processing nodes, as well as to a network of third-party partners such as trusted data and algorithm providers. Recognizing the wide-ranging applicability of a well-integrated smart road and vehicle IoT, we made a dedicated effort to design the *Central System* as an easily extensible integration framework using industry-standard interfaces. In the coming months we expect to connect part of the fast-developing Hungarian ITS facilities to the *Central System* and to provide non-stop real-time integration of real traffic data, thus demonstrating the first large-scale industrial application. In the current paper, we report the positive results of small-scale experiments conducted in late 2020.

Arguably, the most useful kind of information for traffic participants is a so-called *digital twin*: a dynamic, real-time model of the environment. This is the focus of our current work and provides the major information source upon which other services can be built. The functional prototype we built and present in this paper therefore realizes only the fundamental real-time environment perception function of the *Central System*, which for clarity we will call *Central Perception*, distinguishing it from an envisioned suite of other dependent functionalities such as traffic analytics and planning, road infrastructure monitoring and management, cloud-based traffic control and vehicle control, and specific applications on CAV testing grounds and logistics grounds. Of course, such derivative services must be introduced in at least some detail to highlight the expected practical significance of our initial efforts. A schematic overview of the planned *Central System*, its participants and functionalities is given in Figure 1.

**Figure 1.** *Central System* overview.

The premise of *Central Perception* is the following: assuming smart roads and vehicles in the near future, the same physical traffic environment will typically be perceived simultaneously through many sensors and platforms of highly differing setups and capabilities, and these platforms may or may not be affiliated. Depending on application requirements such as expected response time and reliability, the integration of environment perception information from various sources poses challenges in communication, compatibility, synchronization, calibration, fusion, tracking, and end-to-end latency. Such an ad-hoc collective and collaborative perception system also requires a certain sophistication in its architecture, which must allow for scaling in a dynamically changing environment, especially considering CAVs that constantly change their location and whose sensor data must therefore be integrated with different stationary platforms at different points in time.

#### *1.2. Prior Work*

In the last two years, several connected automotive information system prototypes have been realized, all of them demonstrating some degree of cooperative perception. Krämmer et al. from TU Munich and Fortiss built "Providentia" [1], an intelligent infrastructure system consisting of several gantries equipped with cameras and radars, serving as a support system for intelligent vehicles and aiding them in perceiving blind spots and objects behind cover. They process and fuse the sensory data both locally at the individual measurement stations and centrally at designated edge computing nodes before communicating it to the consumer vehicles via 5G. Gabb et al. from Bosch and Karlsruhe presented theoretical guidelines (supported by experiments) for developing a similar system around the same time at the Intelligent Vehicles conference in Paris [2]. The University of Tokyo released open-source software for cooperative perception [3], while the University of California, together with Toyota, demonstrated a use case of backpropagating the already integrated data from the cloud for driving assistance purposes [4]. Toyota also pursues similar topics with other universities and independently, focusing on various settings such as V2V communication [5] or camera and digital twin integration for visual guidance systems [6]. In Australia, researchers implemented the ETSI CPM messaging standard in an I2V setting, allowing sensor-less vehicles to perceive and autonomously react to pedestrians [7]. Recently, Chinese authorities announced the launch of the "world's first high-level cloud-controlled autonomous driving demonstration zone" (http://m.news.cctv.com/2020/09/11/ARTIeJEug9svYwuLazxQFzO3200911.shtml (accessed on 14 September 2021)), to be constructed in Beijing with long-term targets similar to those of our *Central System* project.

As prior work, we carried out a measurement campaign (together with international industrial and academic partners) on a real-world motorway section in Hungary, which resulted in sensory data useful for future automotive R&D activities thanks to the available ground truth for both static and dynamic content [8].

A possible first industrial application of *Central System* technology is likely to occur where experimental CAVs and ITSs are introduced earliest: on automotive testing grounds. In particular, development of *Central System* based Scenario-in-the-Loop (SciL) [9] control is already underway on the ZalaZONE [10] CAV proving ground in Hungary.

#### *1.3. Primary Contribution*

To the best of our knowledge, we are the first to demonstrate a real-time cooperative perception platform that has both stationary and moving multi-sensor data sources and that combines several levels of data integration, namely inter-sensor raw fusion, on-platform tracking and inter-platform local area fusion, to finally create and visualize a simultaneous, centrally consistent model (*digital twin*) of all objects of interest in the area covered by the sensors' fields of view.

We emphasize that our primary contribution lies in the demonstration of system-level possibilities that have not been demonstrated before, i.e., the maintenance of a *digital twin* in a complex and heterogeneous environment. In this paper we do not claim any scientific novelty with respect to our individual subsystems, nor are they uniquely necessary for our technological demonstration (we could have chosen alternative technological approaches to demonstrate the same concept).

To clarify, we define the following:

- *Cooperative perception services:* more reliable centralized perception via sensor fusion in the cloud (*Central Perception*).

#### **2. Problem Definition**

Arguably, the most crucial element in providing centrally maintained *digital twin* services is achieving a centrally integrated, real-time perception of the traffic environment. This paper presents a solution to the *Central Perception* problem that we specify as follows.

The overall system must acquire an integral and dynamic object-level view of the real-time traffic situation with at most 100 ms latency. The logical core of the system should be responsible for data integration, while the peripheral measurement platforms act as data sources. The overall architecture and the established interfaces must enable the simultaneous participation of connected intelligent equipment, including static road-side infrastructure, mobile (vehicle-borne) and third-party measurement systems. Various sensor types and vendors must be supported, as well as the precise and reliable detection of traffic participants, including pedestrians, vehicles, obstacles, etc. As additional data sources, the system may use static HD maps and various traffic data providers (e.g., road meteorology).

We specified and built a proof-of-concept system covering a substantial subset of the above requirements, demonstrating the viability of the approach. Our functional sample provides the following capabilities:


#### **3. Overview of the Central Perception System Architecture**

In order to support the requirements outlined in Section 2, numerous design decisions had to be made regarding the prototype development of the measurement systems, the central server, and the communication between them. The following sections discuss the overall design in some detail; the current section gives only a brief overview.

The realized system consists of one central server in the cloud and three wirelessly connected measurement systems, two of which are stationary and one of which is mounted on a vehicle. At present we focus on fusion-based perception, since we want to utilize the strengths of different sensor types on a single platform. Each measurement system has a sensory setup consisting of a camera-LiDAR pair, with the exception of the vehicle, which has one camera and two LiDARs. For precise distance measurement and large-scale 3D reconstruction in automotive applications, stereo vision is becoming a less and less viable choice simply due to its precision loss at distances relevant to driving (compared with the high precision and falling prices of LiDAR technology) [11].

The measurement systems use the RTMaps software framework for data acquisition, synchronization, and data-flow processing. Detection and tracking are performed locally on the GPUs and CPUs of the measurement systems. The 3D pedestrian detection uses low-level (raw) data fusion on the local system, i.e., the camera image and the corresponding LiDAR point cloud are both necessary and are considered together in calculating the 3D position of the detected pedestrian (in contrast to object-level fusion, where each sensor arrives at a detection estimate separately and fusion occurs only afterwards). The fused detections of each sensor cluster are then tracked locally to smooth out any remaining uncertainties or missing data points. The tracks are then communicated in standard SENSORIS message format over 5G or DSRC to the central server, where the inter-system track fusion occurs. Finally, the resulting locally and globally fused tracks are displayed in real time in a digital twin simulation developed in Unity. An overview of the system components and connections is shown in Figure 2.

**Figure 2.** *Central Perception* prototype main components and protocols.
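For illustration, the sketch below shows the kind of object-track payload a measurement system might send to the central server. The JSON encoding and all field names are simplified placeholders of our own, not the actual SENSORIS schema (SENSORIS defines its own message structure); the sketch only conveys what information a locally fused track carries.

```python
import json
import time

def make_track_message(station_id, tracks):
    """Assemble a simplified, SENSORIS-inspired track message.

    `tracks` is a list of dicts describing locally fused and tracked
    objects of one measurement system. NOTE: all field names are
    illustrative placeholders, not the real SENSORIS schema.
    """
    return json.dumps({
        "envelope": {
            "source_id": station_id,            # e.g., "station-1", "vehicle-1"
            "timestamp_utc_ms": int(time.time() * 1000),
        },
        "objects": [
            {
                "track_id": t["id"],
                "class": t["class"],            # "pedestrian", "car", ...
                "position_utm": t["pos"],       # [easting_m, northing_m, alt_m]
                "velocity": t["vel"],           # [vx_mps, vy_mps]
                "position_variance": t["var"],  # [var_x, var_y]
            }
            for t in tracks
        ],
    })

# Example: one pedestrian track reported by a stationary measurement system
msg = make_track_message("station-1", [
    {"id": 7, "class": "pedestrian", "pos": [699424.1, 5192636.8, 0.0],
     "vel": [1.2, -0.3], "var": [0.04, 0.05]},
])
```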

#### **4. Perception Module**

#### *4.1. System Calibration*

The proposed *Central System* integrates numerous different types of sensors (each having an assigned local coordinate system), such as LiDARs and cameras deployed in the infrastructure or attached to a vehicle. In addition to these sensors, in the case of vehicles, the IMU/dGPS unit is a further key element. For the system to work properly, the calibration parameters have to be estimated first, i.e., all the intrinsics and extrinsics. Here, the intrinsics cover the internal parameters of the cameras (such as focal length, principal point coordinates, skew, and radial and tangential lens distortion), while the extrinsics stand for the transformations between the local coordinate systems of attached sensors as well as the transformations from and to the Universal Transverse Mercator (UTM) frame, which was selected as the global reference frame. For camera calibration, the method published in [12] has been applied. The simplified calibration setup is illustrated in Figure 3. Let us briefly introduce the calibration approaches used to calibrate the proposed *Central System*.

**Figure 3.** Illustration of the simplified setup of calibration containing two infrastructure stations and a vehicle each equipped with a camera and a LiDAR. In the vehicle there is an IMU/dGPS system, as well.

#### 4.1.1. Chessboard Based Camera-LiDAR Calibration

The estimation of the rotation and translation between camera-LiDAR pairs is more challenging than that between camera-camera pairs, since we have to identify 3D points in the LiDAR point cloud and their corresponding image points in the camera images. The estimation of the LiDAR-camera extrinsics was based on the method proposed in [13], a fully automatic extrinsic calibration approach that uses a printed chessboard attached to a rigid planar surface. The key element of the method is determining the 3D locations of the chessboard corners in the LiDAR's coordinate system. A full-scale model of the chessboard (A0 sized) is fitted to the segmented 3D points corresponding to the chessboard in the LiDAR point cloud. The intensities of light rays reflected from the black and white patches of the chessboard are different and well distinguishable; the model is therefore fitted to a 4D point cloud, where the fourth dimension corresponds to the intensity of the given LiDAR point. The corners of the fitted model are considered to be the 3D corners of the chessboard.

The extrinsic calibration of the camera and LiDAR is performed by minimizing the re-projection error, given the estimated corners $\mathbf{M}_i$ in the LiDAR frame, their measured projections $\mathbf{m}_i$ in the camera image, and the intrinsics of the camera (camera matrix $\mathbf{K}$, radial distortion coefficients $p_1, p_2, p_3$ and tangential distortion coefficients $q_1, q_2$). $N$ stands for the number of corner points considered:

$$\min\_{\mathbf{R}, \mathbf{t}} \sum\_{i=1}^{N} \left\| \mathbf{m}\_i - \hat{\mathbf{m}}\_i (\mathbf{M}\_i, \mathbf{K}, \mathbf{R}, \mathbf{t}, p\_1, p\_2, p\_3, q\_1, q\_2) \right\|^2 \tag{1}$$
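Minimizing Equation (1) is a standard perspective-n-point (PnP) problem. The following sketch solves it with OpenCV, assuming the 3D chessboard corners $\mathbf{M}_i$ from the fitted board model and the detected image corners $\mathbf{m}_i$ are already available; note that OpenCV orders its five distortion coefficients as $(k_1, k_2, p_1, p_2, k_3)$, which has to be mapped onto the coefficients used above.

```python
import cv2
import numpy as np

def calibrate_camera_lidar(M, m, K, dist):
    """Estimate the LiDAR-to-camera rotation and translation by
    minimizing the reprojection error of Equation (1).

    M: Nx3 chessboard corners in the LiDAR frame (fitted board model)
    m: Nx2 detected corner pixels in the camera image
    K: 3x3 camera matrix, dist: distortion coefficients (k1, k2, p1, p2, k3)
    """
    ok, rvec, tvec = cv2.solvePnP(
        M.astype(np.float64), m.astype(np.float64), K, dist,
        flags=cv2.SOLVEPNP_ITERATIVE)
    assert ok, "PnP estimation failed"
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> 3x3 matrix

    # Reprojection RMSE of the estimated extrinsics (cf. Table 1)
    proj, _ = cv2.projectPoints(M, rvec, tvec, K, dist)
    rmse = np.sqrt(np.mean(np.sum((proj.reshape(-1, 2) - m) ** 2, axis=1)))
    return R, tvec, rmse
```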

Figure 4 shows the process of data acquisition; the blue and yellow colors correspond to different LiDAR point intensities. Figure 5 shows the LiDAR points projected onto the camera image. In the chessboard image, both the detected corners and the 3D corners identified in the LiDAR point cloud and projected onto the camera image can be seen. The 3D viewer shows the detected corners together with the point cloud of the chessboard, colored according to the black and white patches of the fitted chessboard model. The RMSE achieved for five different poses of the chessboard is reported in Table 1.

**Figure 4.** Data acquisition for chessboard-based camera-LiDAR calibration.

**Figure 5.** Chessboard-based camera-LiDAR calibration results.

**Table 1.** The RMSE achieved for five different poses of the chessboard.


#### 4.1.2. Box Based Calibration

In this second approach, the camera-LiDAR extrinsics calibration relies on the corners of a box instead of a chessboard. The calibration box is placed at different locations of the working area (with known UTM coordinates, measured in advance by a portable dGPS module). For each pose of the box, the point cloud and the corresponding camera image are acquired. The box corners are determined in the LiDAR point cloud by segmenting the points that correspond to the box and then fitting the box model. Since the box corners are obtained with respect to the LiDAR coordinate frame and the corresponding UTM coordinates are known, the LiDAR pose with respect to the UTM frame can be estimated. Similarly, the camera pose can be determined by minimizing the re-projection error (see Equation (1)) associated with the selected corners with known UTM coordinates. Since the UTM coordinates of the box corners as well as the IMU pose with respect to UTM are known, the transformation between the LiDAR and IMU coordinate systems can also be determined. Another well-known approach to estimating the LiDAR-UTM extrinsics is hand-eye calibration, which requires at least two motions (with non-parallel rotation axes) of the sensors (LiDAR and IMU) [14].
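Estimating the LiDAR pose with respect to UTM from the matched box corners is an absolute-orientation problem; below is a minimal sketch of the standard SVD-based (Kabsch) solution. The function name and array layout are our own illustration, not the authors' implementation.

```python
import numpy as np

def rigid_transform_3d(P_lidar, P_utm):
    """Least-squares rigid transform with P_utm ~= R @ P_lidar + t,
    estimated from matched box corners by the Kabsch/SVD method.

    P_lidar, P_utm: Nx3 arrays of corresponding corner coordinates."""
    c_l = P_lidar.mean(axis=0)                # centroids
    c_u = P_utm.mean(axis=0)
    H = (P_lidar - c_l).T @ (P_utm - c_u)     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Enforce a proper rotation (det(R) = +1, no reflection)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c_u - R @ c_l
    return R, t
```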

#### *4.2. Data Synchronization*

In order to keep the data streams synchronized among infrastructural and vehicular sensors, a common time source as well as a time protocol is needed. The most commonly used time protocols are the Network Time Protocol (NTP) and the Precision Time Protocol (PTP). During our experiments, NTP was utilized with GPS time as the time source. Each station (including two infrastructural stations and one measurement vehicle) was equipped with an on-board unit having an integrated GPS time source and running the NTP service (see Section 6.1). Each computing node's (PCs, DrivePX2 Tegra A and Tegra B) system clock was synchronized with GPS time via NTP. Prior to testing the *Central Perception* system, the synchronicity of the data streams from the different sensors was verified experimentally. Each data frame (regardless of which type of sensor it originates from) is assigned a timestamp as it enters the computing framework. Within the computing framework, the data frames (from different streams) that are closest in time are associated and then processed together, as depicted in Figure 6.

**Figure 6.** Illustration of the data flow and timestamp based assignment of data frames.
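A minimal sketch of this timestamp-based, nearest-in-time association is given below; the 50 ms pairing threshold is an illustrative assumption, not a value from the paper.

```python
import bisect

def associate_frames(cam_frames, lidar_frames, max_dt=0.05):
    """Pair each camera frame with the LiDAR frame closest in time.

    Both inputs are lists of (timestamp_s, data) tuples sorted by
    timestamp; pairs further apart than `max_dt` seconds are dropped."""
    lidar_ts = [ts for ts, _ in lidar_frames]
    pairs = []
    for ts, img in cam_frames:
        if not lidar_ts:
            break
        k = bisect.bisect_left(lidar_ts, ts)
        # candidate neighbours: the LiDAR frames just before and after ts
        best = min((c for c in (k - 1, k) if 0 <= c < len(lidar_ts)),
                   key=lambda c: abs(lidar_ts[c] - ts))
        if abs(lidar_ts[best] - ts) <= max_dt:
            pairs.append(((ts, img), lidar_frames[best]))
    return pairs
```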

As an alternative for time synchronization, the Precision Time Protocol might be used, which, instead of millisecond-level synchronization, aims to achieve sub-microsecond (potentially nanosecond-level) synchronization. In the case of PTP, switches with PTP support are required at each station. For most commercial and industrial applications, NTP is more than accurate enough [15].

#### *4.3. 3D Object Detection*

3D object detection plays a crucial role in environment perception and understanding. During the development of the *Central System*, we paid much attention to the development of robust 3D object detection methods, which are essential from the point of view of overall system performance. We have considered two types of approaches:


#### 4.3.1. Yolo and Point Cloud Based Approach

Camera-based systems perform outstandingly well in recognition tasks, but when it comes to position estimation they are less accurate than LiDAR-based systems. Depending on the resolution and the number of cameras used, the baseline length, the accuracy of the calibration, and the accuracy of the pixel coordinates of the points of interest, camera-based position estimation can be improved; however, by including one or more LiDARs, the location estimation of objects can be improved significantly.

The method presented below combines the advantages of the two sensor types (i.e., the high resolution of cameras and the localization capabilities of LiDARs). In order to fuse camera images with LiDAR point clouds, the sensors have to be calibrated (see Figure 3). Another crucial point is guaranteeing real-time processing, which puts an additional constraint (depending on the hardware used) on the complexity of the applied algorithms. Furthermore, the data streams of the different sensors must be kept synchronized to ensure that data frames closest to each other in time are associated and processed accordingly (see Section 4.2).

As a first step, the detector receives images on its input and 2D object detection is performed by the YOLOv4 object detector [16]. The speed and accuracy of the algorithm are in line with the defined requirements: the frame rate of the overall system was set to be at least 20 FPS (which currently stands as the upper limit for LiDARs). During the experiment, pedestrians and cars were considered the primary objects of interest; however, the algorithm can easily be extended to detect additional classes such as motorcycles, buses, etc. The 2D detection takes several tens of milliseconds (∼30 ms) even on the most powerful hardware.

In the next stage, the point cloud is projected onto the camera image, and each estimated 2D bounding box is associated with the LiDAR points whose projections fall inside the given box. As a result, a set of frustums is obtained (one for each 2D bounding box) containing the 3D points of the objects of interest. Let us denote these frustums by $\mathcal{F}_i$. Given $\mathcal{F}_i$, the 3D bounding box corresponding to the given object can either be estimated neurally, by a convolutional neural network trained to perform detection in frustums, or the position of the object can be determined by simple reasoning.
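This projection-and-gating step can be sketched as follows, assuming an undistorted image, camera matrix $\mathbf{K}$ and LiDAR-to-camera extrinsics $(\mathbf{R}, \mathbf{t})$ from the calibration of Section 4.1; the function and variable names are ours.

```python
import numpy as np

def frustum_points(points_lidar, boxes_2d, K, R, t):
    """Associate LiDAR points with 2D detections by projection.

    points_lidar: Nx3 points in the LiDAR frame
    boxes_2d: list of (x1, y1, x2, y2) YOLO boxes in pixels
    Returns one point set (frustum F_i) per bounding box."""
    P_cam = points_lidar @ R.T + t           # transform into the camera frame
    P_cam = P_cam[P_cam[:, 2] > 0.0]         # keep points in front of camera
    uvw = P_cam @ K.T                        # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3]
    frustums = []
    for (x1, y1, x2, y2) in boxes_2d:
        inside = ((uv[:, 0] >= x1) & (uv[:, 0] <= x2) &
                  (uv[:, 1] >= y1) & (uv[:, 1] <= y2))
        frustums.append(P_cam[inside])
    return frustums
```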

In the reasoning-based approach, the false points (foreground and background points) are first eliminated from $\mathcal{F}_i$, and a small 2D window is defined inside each bounding box. The size and position of the window are proportional to those of the original box; the scaling factor and position were set empirically based on the type of object. Since these windows generate significantly narrower frustums, the points falling inside them are more likely to belong to the object of interest. Let us denote the set of these points by $\mathcal{F}_i'$. The location of the object is determined as the mean of the points falling inside the volume bounded by $\mathcal{F}_i$ and satisfying the constraint

$$d\_{\min} < \|\mathbf{p}\_j - \mathbf{c}\| < d\_{\min} + \delta, \quad \mathbf{p}\_j \in \mathcal{F}\_i, \quad j = 1, \ldots, N\_i, \tag{2}$$

where $\mathbf{c}$ stands for the camera center, $N_i$ represents the number of points in $\mathcal{F}_i$, and $\delta = \max\{\text{object width}, \text{object height}\}$, with

$$d\_{\min} = \min\_{\mathbf{q} \in \mathcal{F}\_i'} \|\mathbf{q} - \mathbf{c}\|. \tag{3}$$

This is an extremely simple and therefore very fast way to filter out unnecessary points and localize the object of interest (Figure 7). The latency of the detection is shown in Figure 8.
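A direct implementation of this rule (Equations (2) and (3)) is sketched below; the names are ours, and the window frustum $\mathcal{F}_i'$ is assumed to be precomputed.

```python
import numpy as np

def localize_in_frustum(F_i, F_i_narrow, c, obj_w, obj_h):
    """Object position by the depth-shell rule of Equations (2) and (3).

    F_i: Mx3 points of the full frustum, F_i_narrow: points of the
    narrower window frustum, c: camera center (all in the same frame)."""
    d_min = np.min(np.linalg.norm(F_i_narrow - c, axis=1))   # Eq. (3)
    delta = max(obj_w, obj_h)
    d = np.linalg.norm(F_i - c, axis=1)
    shell = (d > d_min) & (d < d_min + delta)                # Eq. (2)
    return F_i[shell].mean(axis=0)   # object position = mean of shell points
```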

**Figure 7.** Detected objects.

#### 4.3.2. Yolo and Homography Based Approach

While the detector described above uses the LiDAR point cloud to estimate the 3D location of the target, the method introduced in this section performs 3D localization of targets based on a single camera. Homography and its estimation are well-known topics in the literature, but let us briefly summarize them. Let us denote a world point by $\mathbf{M}$ and its image coordinates by $\mathbf{m}$, and consider the scenario in which the world points of interest lie on the XY plane, so that their Z coordinate is zero. These points are projected onto the image plane of the camera as follows:

$$\mathbf{m} = \mathbf{PM} = \mathbf{K}[\mathbf{R} \mid \mathbf{t}] \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} \mathbf{r}\_1 & \mathbf{r}\_2 & \mathbf{r}\_3 & \mathbf{t} \end{bmatrix} \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix} = \underbrace{\mathbf{K} \begin{bmatrix} \mathbf{r}\_1 & \mathbf{r}\_2 & \mathbf{t} \end{bmatrix}}\_{\mathbf{H}} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \tag{4}$$

where $\mathbf{r}_i$ denote the columns of the rotation matrix $\mathbf{R}$, $\mathbf{t}$ stands for the translation, and $\mathbf{K}$ denotes the camera matrix containing the camera intrinsics. In order to estimate the homography $\mathbf{H}$, the following cost function is minimized (measurement error is considered in both the image and the world plane):

$$\min\_{\mathbf{H}, \hat{\mathbf{m}}\_i', \hat{\mathbf{m}}\_i} \sum\_{i=1}^N \|\mathbf{m}\_i - \hat{\mathbf{m}}\_i\|^2 + \|\mathbf{m}\_i' - \hat{\mathbf{m}}\_i'\|^2, \quad \text{s.t. } \hat{\mathbf{m}}\_i' = \mathbf{H} \hat{\mathbf{m}}\_i \ \forall i, \tag{5}$$

where $\mathbf{m}_i$ and $\mathbf{m}_i'$ stand for the measured point pairs, while $\hat{\mathbf{m}}_i$ and $\hat{\mathbf{m}}_i'$ stand for the estimated, perfectly matched correspondences, i.e., $\hat{\mathbf{m}}_i' = \mathbf{H}\hat{\mathbf{m}}_i$ [17].
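As an illustration, the sketch below fits such a ground-plane homography from surveyed marker correspondences with OpenCV and maps a detection's foot point to ground coordinates. This is our own simplified example, not the authors' implementation; it works in pixel rather than normalized image coordinates, which is equivalent up to the (known) intrinsics.

```python
import cv2
import numpy as np

def fit_ground_homography(img_pts, ground_xy):
    """Estimate H mapping image points onto ground-plane (e.g., UTM XY)
    coordinates from marker correspondences; cf. Equation (5).
    img_pts, ground_xy: Nx2 float arrays, N >= 4 (18 markers were used)."""
    H, _ = cv2.findHomography(img_pts, ground_xy, method=0)  # least squares
    return H

def bbox_foot_to_ground(bbox, H):
    """Map the bottom-edge center of a 2D detection to the ground plane."""
    x1, y1, x2, y2 = bbox
    foot = np.array([(x1 + x2) / 2.0, y2, 1.0])   # homogeneous foot point
    m = H @ foot
    return m[:2] / m[2]                           # (X, Y); altitude set to 0
```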

We used 18 markers $\mathbf{m}_i'$ with known UTM coordinates (measured in advance by a mobile GNSS system with an accuracy of ∼20 mm) and their image projections $\mathbf{m}_i$ to estimate the homography, where $\mathbf{m}_i$ are the undistorted, normalized image points. The detection part of the approach uses the YOLOv4 [16] neural network to detect targets of various types in images (during our experiment, pedestrians were the main objects of interest; however, other object types are also supported by the proposed perception system). The point of interest for each detected pedestrian was set to be the center point of the bottom edge of its 2D bounding box. Let us likewise denote these points by $\mathbf{m}_i$. By applying the estimated homography $\mathbf{H}$, the image points $\mathbf{m}_i$ can be transformed to the XY plane of the UTM coordinate system as $\mathbf{m}_i' = \mathbf{H}\mathbf{m}_i$ for all $i$. Here we omit the true altitude, i.e., it is set to zero for each point. Although this kind of 3D detection is very useful for static cameras (for example, cameras installed in the infrastructure), in the case of cameras attached to a vehicle, a change in the pitch or roll of the vehicle (caused, for example, by accelerating or making a hard turn) invalidates the estimated homography. In addition, the uncertainty of the measured image points $\mathbf{m}_i$ must also be considered. Given the uncertainties of both $\mathbf{H}$ and $\mathbf{m}_i$, the covariance of the estimated points $\mathbf{m}_i'$ is given by:

$$\boldsymbol{\Sigma}\_{\mathbf{m}\_i'} = \mathbf{J}\_{\mathbf{h}} \, \boldsymbol{\Sigma}\_{\mathbf{h}} \, \mathbf{J}\_{\mathbf{h}}^T + \mathbf{J}\_{\mathbf{m}\_i} \, \boldsymbol{\Sigma}\_{\mathbf{m}\_i} \, \mathbf{J}\_{\mathbf{m}\_i}^T \tag{6}$$

where $\boldsymbol{\Sigma}_{\mathbf{m}_i'}$, $\boldsymbol{\Sigma}_{\mathbf{m}_i}$ and $\boldsymbol{\Sigma}_{\mathbf{h}}$ stand for the covariance matrices of the estimated road point $\mathbf{m}_i'$, the measured image point $\mathbf{m}_i$, and the estimated homography $\mathbf{h}$, respectively (the vector $\mathbf{h}$ is composed of the concatenated rows of $\mathbf{H}$). Furthermore, $\mathbf{J}_{\mathbf{m}_i}$ and $\mathbf{J}_{\mathbf{h}}$ stand for the Jacobians of $\mathbf{m}_i' = \mathbf{H}\mathbf{m}_i$ with respect to $\mathbf{m}_i$ and $\mathbf{h}$, respectively [17].
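Equation (6) can be evaluated, for example, with numerically approximated Jacobians; the sketch below uses simple forward differences (an illustrative choice, since the Jacobians can also be derived analytically).

```python
import numpy as np

def map_point(h, m):
    """m' = dehomogenized H [m, 1], with h the 9 stacked rows of H."""
    w = h.reshape(3, 3) @ np.array([m[0], m[1], 1.0])
    return w[:2] / w[2]

def propagate_cov(h, m, Sigma_h, Sigma_m, eps=1e-6):
    """Equation (6) with forward-difference Jacobians J_h (2x9)
    and J_m (2x2) of the mapping m' = H m."""
    f0 = map_point(h, m)
    J_h = np.empty((2, 9))
    for k in range(9):
        dh = np.zeros(9)
        dh[k] = eps
        J_h[:, k] = (map_point(h + dh, m) - f0) / eps
    J_m = np.empty((2, 2))
    for k in range(2):
        dm = np.zeros(2)
        dm[k] = eps
        J_m[:, k] = (map_point(h, m + dm) - f0) / eps
    return J_h @ Sigma_h @ J_h.T + J_m @ Sigma_m @ J_m.T
```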

Since the vehicle is moving, the detected objects have to be transformed, according to the actual pose of the vehicle, into a global coordinate system (e.g., UTM). First, the detections are estimated with respect to a selected reference coordinate system and then, based on the actual pose of the vehicle, transformed into the UTM frame. The reference coordinate system of the vehicle was chosen as the coordinate system of the IMU shifted along the vertical axis to ground (road) level; let us refer to this coordinate system as *IMUgl*. In order to estimate the homography that transforms the image points directly into *IMUgl*, one needs the marker coordinates in *IMUgl*. Since the IMU module used during our experiments includes a differential GPS with a dual-antenna system, the UTM coordinates and the heading of the vehicle can be measured with an accuracy of ∼20 mm, which can be considered sufficient for autonomous driving related applications. Based on the measured pose of the vehicle, the UTM to *IMUgl* rigid transformation can be determined, and thus the markers can be expressed in *IMUgl*. By applying the homography (estimated from the markers in *IMUgl* and the corresponding image plane points) to the points $\mathbf{m}_i$, the 3D position of each detected target is obtained directly in *IMUgl*. Since the pose of the vehicle is continuously measured at a sampling rate of 100 Hz, the detections can be transformed directly into the UTM frame in real time.
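A minimal sketch of the planar part of this final step, turning an *IMUgl* detection into UTM coordinates using the measured dGPS pose, is given below; the zero-heading convention is an assumption made for the illustration.

```python
import numpy as np

def imugl_to_utm(det_xy, vehicle_utm_xy, heading_rad):
    """Transform a detection from the vehicle's ground-level IMU frame
    (IMUgl) into UTM using the dGPS pose sampled at 100 Hz.

    det_xy: (x, y) of the detection in IMUgl
    vehicle_utm_xy: (easting, northing) of the IMU origin in UTM
    heading_rad: vehicle heading (convention assumed for illustration)."""
    c, s = np.cos(heading_rad), np.sin(heading_rad)
    R = np.array([[c, -s],
                  [s,  c]])                # planar rotation by the heading
    return R @ np.asarray(det_xy) + np.asarray(vehicle_utm_xy)
```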

#### *4.4. Object Tracking*
