#### 5.2.3. Buffering

The issue of input message buffering is not entirely trivial, since the order of message arrival within a topic partition does not necessarily follow any (e.g., timestamp-based) ordering. We have to make sure that the tracker is always fed appropriately chosen inputs and that no inputs are wasted or discarded prematurely. The regular intervals at which the fusion is queried (in our case every 120 ms) require a flexible buffering method that supports fusion with missing data, no data, old data, future data, etc.

Our buffering method:


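One possible realization of a buffer meeting these requirements can be sketched as follows (a minimal Python illustration, not our actual implementation; the class and method names are placeholders). Messages are kept sorted by event timestamp regardless of arrival order, queries at fusion time t consume only messages stamped up to t, future-stamped messages remain buffered, and messages arriving after their fusion slot has passed are rejected:

```python
import bisect

class FusionInputBuffer:
    """Orders out-of-order messages by event time and serves fusion queries."""

    def __init__(self):
        self._times = []  # sorted event timestamps
        self._msgs = []   # messages, kept parallel to _times
        self._last_fusion_t = float("-inf")

    def push(self, event_time, msg):
        # A message stamped before the last fusion query arrived too late:
        # fusion time only moves forward, so it can no longer be used.
        if event_time <= self._last_fusion_t:
            return False
        # Insert while keeping the buffer sorted by event time,
        # regardless of arrival order in the topic partition.
        i = bisect.bisect_right(self._times, event_time)
        self._times.insert(i, event_time)
        self._msgs.insert(i, msg)
        return True

    def pop_until(self, fusion_time):
        # Return every buffered message with event_time <= fusion_time;
        # "future" messages stay buffered for subsequent fusion cycles.
        self._last_fusion_t = fusion_time
        i = bisect.bisect_right(self._times, fusion_time)
        out = self._msgs[:i]
        del self._times[:i], self._msgs[:i]
        return out
```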
#### 5.2.4. Real-Time vs. Playback Mode

The behavior of the fusion module should differ when we play back data from the past than when we stream live data. In both cases fusion is performed at fixed real-time (not data-time) intervals, and the fusion time can only proceed forward (strictly greater than the previous fusion time) (see Table 2).
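The forward-only fusion clock shared by both modes can be sketched as follows (a Python illustration; the grid-snapping logic and names are our own simplification for this example). The wall clock is snapped onto the fixed 120 ms grid, and a grid slot already fused is never revisited:

```python
FUSION_PERIOD_MS = 120  # fusion is queried every 120 ms of real time

def next_fusion_time_ms(prev_fusion_ms, wall_clock_ms):
    """Next fusion timestamp on the fixed 120 ms grid, or None if not due.

    Applies to both real-time and playback mode: the wall clock always
    advances in real time, only the data timestamps differ. Fusion time
    is strictly monotonic (greater than the previous fusion time).
    """
    candidate = (wall_clock_ms // FUSION_PERIOD_MS) * FUSION_PERIOD_MS
    if candidate <= prev_fusion_ms:
        return None  # this grid slot was already fused; wait for next tick
    return candidate
```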

**Table 2.** Comparison of real-time and playback fusion modes.


#### 5.2.5. TrackerGNN Parametrization

The method parameters for the trackerGNN fusion component were set to:

```
tracker = trackerGNN( ...
     'TrackerIndex', 0, ...
     'FilterInitializationFcn', @initcvkf, ...
     'Assignment', 'MatchPairs', ...
     'AssignmentThreshold', 15*[1 Inf], ...
     'TrackLogic', 'History', ... % History|Score
     'ConfirmationThreshold', [2 3], ...
     'DeletionThreshold', [5 5], ...
     'DetectionProbability', 0.9, ...
     'FalseAlarmRate', 1e-6, ...
     'Beta', 1, ...
     'Volume', 1, ...
     'MaxNumTracks', 100, ...
     'MaxNumSensors', 20, ...
     'StateParameters', struct(), ...
     'HasDetectableTrackIDsInput', false, ...
     'HasCostMatrixInput', false ...
);
```
The detection-to-track assignment upper threshold was set to half of the default, since pedestrians are smaller and slower than vehicles and are therefore expected to have smaller uncertainty. The track confirmation threshold was set to 2 out of 3 detections, although our detections often arrive with a nonzero object type, which means instant confirmation. The track deletion threshold was set to 5 out of 5 misses. In addition, a custom rule was introduced that removes all input detections with any position covariance value larger than an experimentally chosen threshold of 5.0 m².
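The custom covariance-gating rule is simple enough to state compactly; the sketch below is a Python rendition of it (illustrative only, the function name and matrix representation are our own for this example):

```python
COV_THRESHOLD_M2 = 5.0  # experimentally chosen gate on position covariance

def passes_covariance_gate(position_covariance, threshold=COV_THRESHOLD_M2):
    """Accept a detection only if NO entry of its position covariance
    matrix (given as a nested list, in m^2) exceeds the threshold."""
    return all(abs(v) <= threshold for row in position_covariance for v in row)
```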

#### 5.2.6. Results

The processing time of the local area fusion component was measured and found sufficiently performant for our requirements. Latencies are on the order of 2–3 ms, counting not only the fusion itself but also stream (de)serialization and message parsing. Outliers were acceptably rare: 0.14% of cases required more than 5 ms, and none more than 35 ms (see Figure 11).

**Figure 11.** Distribution of processing times of the local area fusion component.

We have developed two distinct tools for visualizing fusion outputs. One is a 3D rendering demonstration tool described in Section 7. The other is a 2D monitoring dashboard for internal use that lets us step through each fusion cycle individually. A screenshot of the fusion results is presented in Figure 12 below.

**Figure 12.** *Central System* Dashboard: our monitoring and analysis tool.

#### **6. Client Module**

#### *6.1. DSRC Communication*

Real-time communication, such as a 5G cellular network or WiFi-based 802.11p (DSRC, Dedicated Short Range Communications), plays a major role in the system architecture. In the *Central Perception* functional sample of the *Central System*, dedicated DSRC 5.9 GHz radio communication made it possible for distant system components, such as the infrastructure stations, the vehicle, and the central server, to communicate with each other in real time via radio frequency (RF).

For DSRC in our functional sample we used Cohda Wireless MK5 OBUs, which are built around the SAF5100 Software-Defined Radio (SDR) baseband processor and the TEF5100 dual-radio (antenna A and B) multi-band RF transceiver. These chips offer adjustable radio wave modulation schemes. The unit also includes a dedicated HSM (Hardware Security Module) that handles data encryption, compression and decryption, and stores the keys used for encryption [25]. The data rate corresponding to the modulation scheme of the device (BPSK, QPSK, QAM, etc.) can be changed from 3 Mbps up to a maximum of 27 Mbps. At faster data rates, one of the most critical metrics, the PDR (Packet Delivery Ratio), drops below 100%, so a trade-off had to be made: we chose a medium-rate modulation option that is faster than BPSK (Binary Phase Shift Keying) yet still reliable.

The MK5 OBU complies with the following standards and protocols: IEEE 802.11 (part of the IEEE 802 set, the most widely used wireless networking standard), IEEE 1609 WAVE (Standard for Wireless Access in Vehicular Environments), ETSI ES 202 663 (European profile standard for the physical and medium access control layers of Intelligent Transport Systems operating in the 5 GHz frequency band), and SAE J2735 (Dedicated Short Range Communications (DSRC) Message Set Dictionary).

The 802.11p protocol compliance grants the following advantages:


The OBUs have another key role in the *Central System* architecture: they are also used for time synchronization, using the Chrony module and the GNSS antenna. The MK5 runs a gpsd server to allow applications to access GPS data. Chrony [26] is a versatile implementation of NTP (the Network Time Protocol) that can synchronize the system clock with NTP servers and reference clocks. With the help of Chrony, all of the OBUs can be configured to share a reference time with microsecond accuracy.
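For illustration, a chrony configuration in the spirit described above might look as follows (a sketch, not the exact configuration deployed on the MK5; the shared-memory segment numbers and refids are assumptions). gpsd publishes the NMEA time and the PPS edge via shared memory, and chrony treats them as reference clocks:

```
# /etc/chrony/chrony.conf (sketch)
# NMEA time from gpsd via shared memory: coarse, but names the second.
refclock SHM 0 refid NMEA offset 0.0 precision 1e-1 noselect
# PPS edge from gpsd: microsecond-accurate, locked to the NMEA source.
refclock SHM 1 refid PPS precision 1e-6 lock NMEA
# Step the clock at startup if the initial offset is large.
makestep 0.2 3
```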

Regarding the network topology in the DSRC setup, we define four subnetworks: two infrastructure stations, one vehicle, and one central server. Every subnetwork contains one PC for data acquisition, processing and visualization, one wireless router and one Cohda Wireless MK5 OBU.

The MK5 module has an Ethernet connection interface, which supports Ethernet 100 Base-T with 100 Mbps data rate. For the *Central Perception* functional sample the Cohda OBUs have been configured as IPv4 (Internet Protocol v4) gateways to provide a fully transparent communication between subnetworks. This means that all subnets are seen by each other, so real-time data exchange between nodes can be easily achieved. Figure 13 represents the subnetwork layout of the communication architecture of the *Central Perception* prototype.

#### *6.2. Kafka Streaming Platform*

For communication middleware we chose Kafka, the popular open-source "dumb broker" streaming platform maintained by the Apache Software Foundation. Judging by its core functionality, Kafka can also be considered a distributed commit log, although it is primarily used for messaging. The aim of the project is to provide a real-time, high-throughput, low-latency streaming platform. Kafka provides horizontal scalability by distributing message topic partitions across their respective partition leader brokers, while also providing fault tolerance by replicating each partition across non-leader brokers in a way reminiscent of RAID redundancy and fallback mechanisms. The distributed brokers-and-topic-partitions architecture fits our long-term hierarchical edge computing vision perfectly if we assume that message topics are divided into partitions according to the source area of the measurements. We tested our system in a 3-broker, 3-way-partitioned and triply-replicated (one original and two replicas) setup and experienced no perceptible lag or slowdown. When a broker was deliberately terminated, one of the remaining brokers automatically took over partition leadership; and when the temporarily disabled broker came back to life, the load balancing mechanism automatically reassigned partition leadership to it once again.
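The area-based partitioning assumption can be illustrated with a tiny partitioner sketch (Python; a plain hash function standing in for a real Kafka partitioner, with an illustrative area-ID key that is our own naming, not part of SENSORIS or Kafka):

```python
import hashlib

NUM_PARTITIONS = 3  # matches our 3-broker, 3-way-partitioned test setup

def partition_for_area(area_id, num_partitions=NUM_PARTITIONS):
    """Map a measurement source area to a fixed topic partition.

    All messages originating from the same geographic area land on the
    same partition (hence the same leader broker), which is the premise
    of the hierarchical edge computing layout. MD5 is used only because
    it is stable across processes, unlike Python's built-in hash().
    """
    digest = hashlib.md5(area_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```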

**Figure 13.** Subnetwork layout of the communication architecture.

In order to connect to the Kafka middleware, we developed a universal, platform-independent client module that runs in the Java Virtual Machine runtime environment. It exposes a convenient socket-based API for uploading processed sensor data (detections, tracks, source system positions, etc.) to the distributed Kafka cloud in all the supported standard business domain level formats and protocols. This currently extends to the SENSORIS and ETSI CPM protocols, of which SENSORIS was used in the prototype demonstration. The client module's API encapsulates Kafka specifics and accepts standard SENSORIS messages. For further convenience we also provided a Python wrapper API that can easily be called from the RTMaps client-side real-time dataflow-processing framework.

#### *6.3. SENSORIS Message Standard*

Exactly one message is sent for each source's each output (after each measurement-detection-tracking cycle). The source is not necessarily a single sensor; it might be, e.g., an untracked detection based on raw sensor fusion, or the output of a tracker. The data we collect via SENSORIS v1.0.0 (https://sensoris.org/ (accessed on 14 September 2021)) messages therefore contains the following elements:

- identification (platform UUID, sensor UUID, sensor SUID), where UUID stands for universally unique identifier and SUID for system-wide unique identifier
- GPS PPS synchronized timestamp of the originating measurement (event time)
- localization (position, orientation) and its uncertainty
- object SUID
- object type and type uncertainty
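These elements can be bundled into a compact record; the following Python dataclass mirrors the fields above (an illustrative container only, with field names of our own choosing, not the SENSORIS protobuf schema itself):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class DetectionRecord:
    """Fields transmitted per detection, mirroring the SENSORIS payload."""
    platform_uuid: str      # universally unique identifier of the platform
    sensor_uuid: str        # universally unique identifier of the sensor
    sensor_suid: str        # system-wide unique identifier of the sensor
    event_time_us: int      # GPS-PPS-synchronized measurement timestamp
    position: Tuple[float, float, float]   # localization
    orientation: Tuple[float, float, float]
    position_cov: Tuple[float, ...]        # localization uncertainty
    object_suid: str
    object_type: int        # 0 means "unknown type"
    type_confidence: float  # certainty of the type classification
```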

#### **7. 3D Renderer**


In order to visually represent the information provided by the individual sensor systems, as well as by the central fusion system, a digital twin rendering module is necessary. A properly constructed 3D visualization demonstrates the cooperative perception of scenario participants in situations where a single on-site sensor would not ensure proper operation. The visualization system must communicate with the *Central System*, including reading SENSORIS messages, and must be able to decode and display this information in real time. It is also assumed that a digitized 3D model of the real environment of the on-site demonstration is available, so that the visualized information can be compared with the real-world scenario. In our case, we used the Unity 3D engine to implement the visualization, which communicates with the *Central System* over a TCP connection. The localization of the measurement stations and their respective object detections (fused or raw) are available on a Kafka topic as encoded SENSORIS messages. In order to visualize the measurement vehicle and the surrounding pedestrians, these data are accessed and forwarded to the visualization module in a proper structure.

#### *7.1. Virtual Environment*

Virtual imaging of the real environment is most accurate when based on laser measurements. Therefore, testing on the university campus was preceded by a laser survey that provided a digitized description of the area as a LiDAR point cloud. This point cloud had to be converted from the LAS format to a readable XYZ format for display within Unity. In addition to the format conversion, it is important to place the lateral and longitudinal coordinate pairs relative to some center point in an x-y coordinate system defined by us, so that distances can also be interpreted in Unity. This transformation requires the use of the WGS84 ellipsoid as well as the choice of a clearly definable (0, 0) coordinate. This coordinate later becomes the center of Unity's virtual world, as well as the basis for the transformation of all information that comes in during testing. The XYZ data created this way can be read from a CSV or TXT file, and spheres representing the points can be created for the coordinates it contains. In this way, the point cloud becomes interpretable in the virtual space of Unity, and based on this the various landmarks are clearly outlined (Figures 14 and 15). During the demonstration, the most important requirement is that the roads are positioned correctly in the digital world, so we performed additional GPS measurements at their corner points. The origin of the virtual world was also determined during these measurements.
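The projection onto the locally defined x-y frame around the chosen (0, 0) origin can be sketched as follows (Python; a small-area local tangent-plane approximation using WGS84 constants, not the exact transformation pipeline we used):

```python
import math

WGS84_A = 6378137.0            # semi-major axis [m]
WGS84_F = 1.0 / 298.257223563  # flattening

def geodetic_to_local_xy(lat_deg, lon_deg, origin_lat_deg, origin_lon_deg):
    """Project WGS84 lat/lon to meters in a local x-y frame.

    Valid for small areas such as a campus: the meridian and normal radii
    of curvature are evaluated at the origin and the surface is treated as
    flat. Returns (x_east, y_north) in meters relative to the origin.
    """
    lat0 = math.radians(origin_lat_deg)
    e2 = WGS84_F * (2.0 - WGS84_F)  # first eccentricity squared
    s = math.sin(lat0)
    n = WGS84_A / math.sqrt(1.0 - e2 * s * s)             # normal radius
    m = WGS84_A * (1.0 - e2) / (1.0 - e2 * s * s) ** 1.5  # meridian radius
    x_east = math.radians(lon_deg - origin_lon_deg) * n * math.cos(lat0)
    y_north = math.radians(lat_deg - origin_lat_deg) * m
    return x_east, y_north
```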

The shape and texture of the buildings surrounding the campus have been modelled according to reality. The shape and location of the vegetation and other components in the parking lot could be modelled based on the point cloud. The vegetation has been designed so that the colour and density of the foliage vary with the seasons. Unity also provides the ability to model current lighting conditions using various skyboxes. However, for adequate runtime performance, lighting is not generated in real time; instead, a so-called baked lightmap is created, which precomputes illumination with the given settings.

The digital replica of the environment is best presented through cameras that can be matched to each real sensor. For a scenario to be well demonstrated, it is necessary to be able to present the given environment from several perspectives. We placed virtual cameras in the positions corresponding to the two infrastructure cameras as well as the cameras placed on the test vehicle, applying the basic properties of the real sensors.

**Figure 14.** University campus based on LiDAR point cloud (map source: http://maps.google.com (accessed on 14 September 2021)).

**Figure 15.** The Unity model of the University campus with the point cloud.

#### *7.2. Sensor Detection Visualization*

The test vehicle is displayed according to the method detailed in [27]. High-frequency real-time GPS data is available from the test vehicle, accurate enough to place the vehicle in the virtual world. In this case, the lateral and longitudinal position of the vehicle and its heading are used. The movement of the vehicle's wheels was not modelled. When handling sensor detections, whether from static stations or from the moving vehicle, it is necessary to separate information from different sources, as well as to handle different objects. Although only pedestrians were detected during our current measurements, the system is also prepared to handle all static and dynamic objects defined by SENSORIS. The detections from the sensors always reflect the state closest to real time, i.e., only the most recent objects are displayed. This also means that objects existing in the previous update step must be moved or deleted. We also had to consider that the frequency of messages from different stations and sensor types is not the same in all cases, and may change dynamically. Residual detections, i.e., object tracks that get no confirmation within a short time period, are only rendered for a parameterized time, after which they are automatically deleted. In our simulations, this time was set to 0.25 s. With these solutions, the movement of the detected pedestrians is continuous, there is no vibration in the display process, and the objects neither multiply during movement nor leave a trailing strip.
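The residual-detection timeout amounts to a simple age-based prune, sketched here in Python (illustrative only; the actual logic runs inside the Unity visualization, and the names are our own):

```python
RESIDUAL_TIMEOUT_S = 0.25  # unconfirmed tracks survive this long on screen

def prune_residuals(last_seen, now, timeout=RESIDUAL_TIMEOUT_S):
    """Drop every displayed object not confirmed within `timeout` seconds.

    `last_seen` maps object SUID -> timestamp of its latest confirmation;
    the returned dict keeps only objects that are still fresh at `now`.
    """
    return {suid: t for suid, t in last_seen.items() if now - t <= timeout}
```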

A visual distinction was made between detections of different origin, which helps in understanding the scenario. In addition, different sources also assign different tags to objects, which allows one to treat the objects belonging to a tag as a group, whether to turn off the display of those objects or even to delete them. Pedestrian objects are generated based on a predefined cylindrical shape whose properties, such as size, colour, or transparency, are set based on the data associated with the detection.

The system provides a sufficiently high update frequency to ensure that the motion appears clearly continuous. There are two ways to test the visualization system: displaying real-time uploaded detections online, and playing back data already present on the server offline. During the tests, we had two main expectations for the viewer: first, to display the detections sent to it in real time; and second, to position both the environment and the detections accurately in the virtual world. With these conditions fulfilled, we observe a complete and synchronous copy of reality within the simulation. The system also allows one to turn off the display of any given sensor's detections for separate analysis. Figure 16 simultaneously shows the simulated camera FoV areas of all three sending infrastructures and a larger camera FoV overlooking the entire simulation environment. This figure also shows that the vehicle sensor sees a garage door (lower right corner), meaning that the snapshot came from a replay in which the test vehicle did not participate in the measurement.

**Figure 16.** Detections in simulation.

We can best evaluate our digital twin during real-time tests, where we can see the scenario in reality and in its digital version at the same time. The streaming of raw camera images also provides additional checking possibilities. In the offline state, when a recorded stream is played back, the video played can also reveal whether there is a substantial difference in the digitized world compared to reality, as can be observed in Figure 17.

**Figure 17. Top-left** image: objects sensed by infrastructure-1; **Bottom-left** image: objects sensed by infrastructure-2; **Top-right** image: objects sensed by the vehicle; **Bottom-right** image: Real-time digital twin generated by the central system.

#### **8. Conclusions and Future Work**

We have proposed a cooperative perception system capable of generating and maintaining a digital twin of the traffic environment in real time by fusing higher-level data from multiple sensors (deployed either in the infrastructure or in intelligent vehicles), thus providing object detections of higher reliability while at the same time extending the sensing range. Besides giving a general idea of cooperative perception, we have also introduced the key building blocks of this system, including calibration, 3D detection, tracking, fusion, data synchronization, communication and visualization. For time-critical components we have also presented the underlying algorithms and pointed out the relevant implementation details. The functional prototype of the proposed system has been created and tested online under real circumstances. We have demonstrated a single service of the proposed perception system, namely the real-time visualization of the generated digital twin of the environment, including pedestrians as dynamic objects of interest, communicated using standard SENSORIS message formats over 5G or DSRC to the central server. The system can be further extended to support other types of objects as well, such as cars, cyclists, etc. Besides digital twin generation, a broad range of new derivative services can also be facilitated, including cloud-assisted and cloud-controlled ADAS functions, various analytics for traffic control, etc., which are subjects of further research. We have also shown that the proposed perception system is able to operate in real time, meaning that an overall latency of less than 100 ms has been achieved. As already stated, we envision this prototype system as part of a larger network of local information processing and integration nodes, where the logically centralized digital twin is maintained in a physically distributed edge cloud in real time.

We have encountered several noteworthy practical questions and lessons during the implementation, leading to certain best practices that cannot be treated adequately within the bounds of this paper but can at least be mentioned. They include accounting for sensor latencies, triggering simultaneous snapshots, and associating data from sensors with different frequencies. There are effects related to vehicle movement during a full LiDAR rotation, and there are difficulties in creating a perfectly flat and orthogonal calibration point layout in the field. As already mentioned in some detail, GPS-time-based inter-platform synchronization was a cardinal issue. Detection can suffer from all the problems inherent in deep learning systems: unfamiliar lighting, background, or anything else that takes the input image beyond the domain and distribution the neural network was trained for can influence the algorithm adversely. Of course, deep learning models are also susceptible to deliberate adversarial attacks, like "invisibility T-shirts" [28] on pedestrians. Foreground clutter at chest height (even a stretched-out hand) can destabilize our LiDAR-reliant raw fusion method. DSRC communication tends to break down in the presence of obscuring objects: the installation height and placement of on-board/road-side communication units is crucial, and communication hand-off between moving vehicle platforms and stationary road-side units has to be solved. On the server side, managing the spatially distributed digital twin across several edge computing nodes and their overlapping areas of responsibility is a theoretical problem we are currently investigating. Practical considerations like system security, authentication, authorization and information integrity are undeniably safety-critical issues that must be tackled before industrial application, as is adherence to automotive standards like ASIL D and the use of provably real-time hardware and software systems that come with industrial-grade guarantees. Despite numerous challenges, technological enablers like cheap LiDARs, powerful deep learning and ubiquitous 5G are making the road towards cooperative perception services more attainable by the day.

**Author Contributions:** Conceptualization, Z.S. and V.T.; methodology, V.T., A.R. and V.R.; sensor calibration, A.R., M.C. and A.K.; platform setup: A.R., V.R., Z.V. and M.C.; platform software: data acquisition, object detection, A.R. and M.C.; platform software: tracking, Z.V.; communication: Z.P. and V.R.; central software: stream processing and local area fusion, V.R.; simulation and visualization, M.S. and B.V.; writing—original draft preparation, A.R. and V.R.; writing—review and editing, A.R. and V.R.; supervision, project administration, V.T.; funding acquisition V.T. and Z.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research reported in this paper and carried out at the Budapest University of Technology and Economics has been supported by the National Research Development and Innovation Fund (TKP2020 National Challenges Subprogram, Grant No. BME-NC) based on the charter of bolster issued by the National Research Development and Innovation Office under the auspices of the Ministry for Innovation and Technology. In addition, the research was supported by the National Research, Development and Innovation Office through the project "National Laboratory for Autonomous Systems" under Grant NKFIH-869/2020. Funder NKFIH Grant Nos.: TKP2020 BME-NC and NKFIH-869/2020.

**Acknowledgments:** The research reported in this paper and carried out at the Budapest University of Technology and Economics has been supported by the National Research Development and Innovation Fund (TKP2020 National Challenges Subprogram, Grant No. BME-NC) based on the charter of bolster issued by the National Research Development and Innovation Office under the auspices of the Ministry for Innovation and Technology. In addition, the research was supported by the Ministry of Innovation and Technology NRDI Office within the framework of the Autonomous Systems National Laboratory Program.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
