Article

Three-Dimensional Vehicle Detection and Pose Estimation in Monocular Images for Smart Infrastructures

by Javier Borau Bernad *, Álvaro Ramajo-Ballester and José María Armingol Moreno
Intelligent Systems Lab, Universidad Carlos III de Madrid, 28911 Leganés, Spain
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(13), 2027; https://doi.org/10.3390/math12132027
Submission received: 23 May 2024 / Revised: 13 June 2024 / Accepted: 26 June 2024 / Published: 29 June 2024
(This article belongs to the Special Issue Advanced Machine Vision with Mathematics)

Abstract

Over the last decades, the idea of smart cities has evolved from a visionary concept of the future into a concrete reality. However, the vision of smart cities has not been fully realized within our society, partly due to the challenges encountered in contemporary data collection systems. Despite these obstacles, advancements in deep learning and computer vision have propelled the development of highly accurate detection algorithms capable of obtaining 3D data from image sources. Nevertheless, this approach has predominantly centered on data extraction from a vehicle’s perspective, bypassing the advantages of using infrastructure-mounted cameras for performing 3D pose estimation of vehicles in urban environments. This paper focuses on leveraging 3D pose estimation from this alternative perspective, benefiting from the enhanced field of view that infrastructure-based cameras provide, avoiding occlusions, and obtaining more information from the objects’ sizes, leading to better results and more accurate predictions compared to models trained on a vehicle’s viewpoint. Therefore, this research proposes a new path for exploration, supporting the integration of monocular infrastructure-based data collection systems into smart city development.

1. Introduction

Over the first years of the 21st century, the concept of smart cities has transformed from a futuristic vision into a tangible reality, propelled by rapid advancements in technology and a growing emphasis on urban sustainability, efficiency, and safety. The core of this change is the integration of digital technologies in the urban environment, improving the quality of life for the residents and the efficiency of the resources used. Among these technologies, machine vision stands out as a key element as it fundamentally changes the data collection process and provides the tools to revolutionize today’s transportation systems.
Monocular 3D detection has emerged as a technology of significant interest, renowned for its ability to infer three-dimensional pose estimation from images captured by a single camera. This capability is crucial for a wide variety of applications, from autonomous driving and traffic management to security and pedestrian safety. The primary advantage of monocular detection systems lies in their simplicity and cost-effectiveness compared to more complex sensor setups like stereo vision systems or LiDAR. Utilizing computational models and deep learning techniques, monocular camera systems can accurately identify and determine the position and orientation of vehicles, obstacles, and other elements within urban environments, all from the visual information provided by a single lens.
While vehicle detection has mainly focused on the vehicle’s side view, mimicking the driver’s perspective during navigation, its potential when viewed from alternative angles remains largely unexplored. This paper introduces a paradigm shift by leveraging infrastructure-mounted cameras, which offer an elevated perspective, enabling unobstructed views that overcome significant limitations of the vehicle’s point of view, such as restricted fields of view and occlusions caused by other vehicles and roadside objects. This approach not only benefits from the strategic positioning of sensors and their superior view but also unveils a new branch for future research related to computer vision and smart cities. This ego-infrastructure approach provides a more comprehensive, reliable, and scalable solution for 3D vehicle detection for smart city development.
Our research delves into adapting existing monocular detection algorithms that are already effective from a vehicle’s point of view to perform 3D pose estimation from the infrastructure side, analyzing their performance and possible accuracy advantages compared to vehicle-based setups. Recent studies highlight the growing number of cooperative detection systems [1,2], demonstrating enhanced detection capabilities when combining vehicle and infrastructure data. On the other hand, the proliferation of datasets specifically designed for infrastructure-based detection scenarios [3,4,5] has smoothed the path for more robust model training and validation, leading to advances in algorithms specifically optimized for these setups, such as MonoUNI [6], which displays effective vehicle detection from elevated perspectives. Benefiting from the higher perspective of the cameras and the ability to analyze several vehicles from a single sensor, this approach creates a solution to data acquisition in urban environments that offers a broader view than is possible with traditional vehicle-based methods.
This innovative approach of detecting vehicles from infrastructure-mounted cameras not only results in a more accurate solution but also unveils a set of applications for next-generation smart city setups. Apart from using the acquired information for real-time traffic management, it additionally proposes other applications like traffic flow analysis, real-time traffic light control, smart parking site detection, and intelligent toll detection based on vehicle dimensions, which are unlocked thanks to the wider and elevated viewpoint of the sensors.
As our cities continue to evolve, weaving cutting-edge technologies like monocular 3D detection into the fabric of urban infrastructure stands out as a crucial step toward bringing next-generation smart cities to the present. This paper aims to contribute to this field by proposing an approach to data capture from an innovative point of view, paving the way for further exploration in this exciting domain.

2. Vehicle-Side Monocular Detection

2.1. Datasets

One of the most critical aspects to consider in deep learning and computer vision is the significance of the selected dataset. Indeed, the precision and quality of the data provided by the dataset will directly impact the performance of our algorithms. This is primarily because, during the training process, neural networks attempt to learn and generalize from the examples provided to them. The diversity and characteristics of these examples are crucial for the model to perform accurately in real-world situations. Therefore, datasets used in vehicle detection must contain a wide range of conditions, including different lighting scenarios, weather conditions, and urban settings, to ensure models can generalize effectively after the training procedure. Furthermore, the annotations related to these datasets, such as the 3D bounding boxes and categories, need to be as precise as possible to avoid transferring inaccuracies to the algorithms.
Fortunately, during the last decade, the growing interest in vehicle 3D pose estimation has led to an increase in the availability of public datasets for researchers. This surge in resources has significantly expanded the possibilities for experimentation, enabling the development of more complex and accurate detection models. Figure 1 displays a timeline of the public datasets released in recent years, showcasing a clear trajectory of increasing complexity and diversity. This timeline shows the evolution from early datasets, focused on limited scenarios and fundamental vehicle perspectives, to more complex situations that encompass variable weather conditions and different points of view. Additionally, Table 1 summarizes some of the most important characteristics of the datasets, such as the quantity of data available and the types of conditions covered.
Two of the most widely used datasets in the realm of vehicle detection are KITTI and nuScenes, both offering a vehicle point of view, highlighting the research community’s focus on this perspective. Examples from these two datasets are displayed in Figure 2. Remarkably, the KITTI dataset, although the oldest among those listed, is the most frequently employed by researchers. The popularity of the KITTI dataset stems from its well-regarded benchmark, which features test metrics based on the AP11 and AP40 references [18]. This benchmark distinguishes between three levels of difficulty: easy, moderate, and hard, taking into account factors such as occlusion and the pixel height of the bounding boxes; it is explained in detail in a later section, as it is the evaluation procedure adopted for the experiments in this study.

2.2. Detection Algorithms

The advancement of deep learning techniques, such as the application of transformers in object detection tasks [19] or new depth estimation techniques, has allowed computers to perform highly complex tasks that were unimaginable some decades ago. These sophisticated algorithms play a crucial role in 3D pose estimation, a key component in the development of autonomous driving technologies.
There are various methods to acquire 3D data from the environment, not limited only to monocular images. Alternatives include stereo cameras, LiDAR, and radar systems. However, the primary advantage of utilizing simple image data for vehicle inference lies in the system’s simplicity and low cost. Conventional cameras are already installed in numerous devices, including cars and infrastructure cameras, and new models are quite affordable. In contrast, LiDAR systems are costly, and stereo cameras need a complex calibration process. These advantages make monocular cameras an ideal candidate for consideration in research on 3D pose estimation.
However, images inherently have a significant drawback; they capture a 2D plane, necessitating the extraction of 3D information. To derive 3D information from these flat representations, the employment of deep learning algorithms is crucial. Consequently, over the last few years, and accompanied by the increasing availability of data that enables the creation of more complex models, a wide range of deep learning algorithms has emerged.
As previously mentioned, the KITTI dataset is the most widely used for 3D object detection, largely due to its benchmarking capabilities. Consequently, most algorithms are specifically developed, trained, and optimized with this dataset in mind, making them particularly suited for 3D object detection from a vehicle’s perspective. In Table 2, some of the most significant models developed for the KITTI dataset and their test results are displayed. Additionally, Figure 3 presents inferences using the PGD and SMOKE models on the KITTI dataset to demonstrate their detection capabilities.

3. Our Approach: Infrastructure-Based Monocular Detection

While the majority of research in vehicle detection and 3D pose estimation has traditionally focused on data acquired from vehicle-mounted cameras, our study explores the paradigm shift toward utilizing infrastructure as a strategic vantage point. This alternative perspective not only introduces novel data capture opportunities but also aims to overcome some limitations inherent in vehicle-centric methods. By leveraging elevated infrastructure points, our approach accesses a broader and more detailed view of vehicle movements, potentially enhancing the precision of detection algorithms. Relevant works in the field, such as cooperative detection systems [1] or infrastructure-based systems [2], have laid the groundwork by demonstrating the benefits of integrating data from multiple sources. Furthermore, models like MonoUNI [6] have shown promising results in infrastructure detection, providing a solid benchmark for our research. Our work further extends these findings by systematically comparing the performance of various monocular detection models when applied in an infrastructure-based context against traditional vehicle-mounted setups, illuminating the significant improvements in accuracy and reliability gained through our method.
Infrastructure, in the context of our study, refers to fixed installations such as traffic cameras, bridge-mounted surveillance systems, and highway monitoring sensors. These elements provide elevated viewpoints that cover wider areas than what is possible with vehicle-based cameras. Additionally, the data collected by these roadside sensors enrich the information available about vehicles on the road, such as more precise dimensions and locations, allowing our algorithms to achieve greater accuracy. This comprehensive view is crucial for developing more reliable and effective detection systems.
To better illustrate the advantages of infrastructure-based data acquisition, Figure 4 provides a side-by-side comparison of images captured from an infrastructure-mounted camera and a vehicle-mounted camera. This visual comparison highlights the broader field of view and the reduced occlusions achieved with infrastructure-based viewpoints. Notably, the infrastructure camera captures a comprehensive scene of the road and surrounding area, offering clearer insights into traffic patterns and vehicle interactions, which are often occluded in vehicle-based images. These characteristics are crucial for algorithms that require detailed and unobstructed views to accurately estimate vehicle dimensions and positions.
In this section, we will explore how to implement this approach, beginning with the selection of a dataset that captures the necessary views from infrastructure points. After selecting a dataset with the desired characteristics, we can proceed to train existing models initially developed for the KITTI benchmark with the data obtained from the infrastructure. Through this process, we aim to develop a fully trained set of models capable of accurately inferring 3D poses from data captured via infrastructure-based sources.

3.1. DAIR-V2X Dataset

A fundamental step in training a model involves supplying it with data that accurately reflect the real-world scenario it aims to address. This factor is especially crucial in 3D object detection, where selecting the right dataset is key to preventing performance degradation. Figure 5 showcases the performance differences between a model trained on data sourced from infrastructure viewpoints and one trained with data from a car’s point of view. This comparison underscores the critical importance of choosing a dataset that aligns well with the intended application of the model.
For our research, we have selected the DAIR-V2I dataset [3], the infrastructure-focused part of the DAIR-V2X dataset. This subset comprises 10,000 images with almost 500,000 labeled objects, all captured from sensors positioned on infrastructure elements. This distinct perspective makes the DAIR-V2I dataset an ideal resource for training our models.
Specifically, the DAIR dataset consists of images captured with a 1920 × 1080 RGB camera, offering a higher resolution that is key to our task. Enhanced image quality boosts our model’s ability to detect vehicles accurately; however, it also increases the computational resources required. Additionally, the dataset includes 10,000 LiDAR instances, captured simultaneously with the images, making it potentially useful for future applications involving models that require additional data for improved performance.
Annotation of this dataset was carried out by expert annotators who labeled 10 object classes in each image and point cloud frame, noting each object’s category, occlusion and truncation states, and a 7-dimensional cuboid ($x$, $y$, $z$, width, length, height, yaw angle). Furthermore, for the camera images, 2D bounding boxes were annotated, defined by $x$, $y$, width, and height.
The dataset has been converted into the KITTI format, enabling us to use all the tools available in MMDetection3D [27] and the KITTI benchmark to measure the performance of the model. Finally, some instances of images and their corresponding 3D labeled boxes are displayed in Figure 6.
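To make this conversion step concrete, the snippet below sketches how a single cuboid annotation could be mapped onto a KITTI-format label line. The field names of the input record (`category`, `position`, `size`, `bbox_2d`, etc.) are hypothetical placeholders mirroring the annotation scheme described above, not the actual DAIR or MMDetection3D field names.

```python
def dair_to_kitti_line(ann):
    """Map one cuboid annotation (hypothetical field names) to a KITTI label line.
    KITTI column order: type, truncation, occlusion, alpha, 2D box (x1, y1, x2, y2),
    dimensions (h, w, l), location (x, y, z) in camera coordinates, rotation_y."""
    x, y, z = ann["position"]              # cuboid center in camera coordinates
    h, w, l = ann["size"]                  # height, width, length
    x1, y1, x2, y2 = ann["bbox_2d"]        # 2D bounding box on the image plane
    fields = [
        ann["category"],                   # e.g. "Car"
        f"{ann['truncation']:.2f}",        # truncation state
        str(ann["occlusion"]),             # occlusion state
        f"{ann.get('alpha', -10.0):.2f}",  # observation angle (-10 when unknown)
        f"{x1:.2f}", f"{y1:.2f}", f"{x2:.2f}", f"{y2:.2f}",
        f"{h:.2f}", f"{w:.2f}", f"{l:.2f}",
        f"{x:.2f}", f"{y:.2f}", f"{z:.2f}",
        f"{ann['yaw']:.2f}",               # yaw angle around the vertical axis
    ]
    return " ".join(fields)

# Example with a made-up annotation record:
example = {
    "category": "Car", "truncation": 0.0, "occlusion": 0,
    "position": (4.2, 1.5, 35.0), "size": (1.5, 1.8, 4.3),
    "bbox_2d": (610.0, 420.0, 700.0, 480.0), "yaw": 1.57,
}
print(dair_to_kitti_line(example))
```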

3.2. Detection Models

Throughout our research, we have focused on three different models, each leveraged for their unique capabilities in handling the complex demands of 3D object detection from infrastructure-based viewpoints. Each model was implemented using MMDetection3D. These models were selected based on their performance in the KITTI benchmark, which presents similar challenges to our setup in terms of detecting and estimating the position and dimensions of vehicles. Additionally, their adaptability to the new perspective offered by infrastructure-mounted cameras, such as wider fields of view and varied object scales, makes them particularly valuable. These models provide a unique opportunity to compare the effectiveness of infrastructure-based detection against traditional vehicle-based approaches, thereby deepening our understanding of how spatial perspectives affect detection accuracy and model performance. In the following subsections, we briefly introduce each model and outline its key features.

3.2.1. SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation

The SMOKE [21] algorithm is a computer vision model that uses monocular images to infer the 3D pose of objects using Keypoint Estimation. Chosen for its rapid performance and high precision, SMOKE is well-suited as an algorithm for infrastructure research, where these attributes are critical.
The algorithm is composed of a hierarchical layer fusion network, DLA-34 [28], as the backbone to extract features from the image. From the feature map generated by the backbone, the Keypoint Branch extracts the projected 3D center of the object, allowing the algorithm to recover the 3D location with the camera parameters, as illustrated in Equation (1). Let $[x, y, z]^T$ be the 3D center of each object in camera frame coordinates and $[x_c, y_c]^T$ its projection on the image, obtained with the intrinsic matrix $K$, as follows:
$$ \begin{bmatrix} z \cdot x_c \\ z \cdot y_c \\ z \end{bmatrix} = K_{3\times 3} \begin{bmatrix} x \\ y \\ z \end{bmatrix} \quad (1) $$
In parallel, the regression branch predicts the essential variables to create the 3D bounding box. All of them are encoded to be learned in residual representation, easing the training task. These variables are represented as an 8-tuple, consisting of the depth offset $\delta_z$, the discretization offsets $(\delta_{x_c}, \delta_{y_c})$, the residual dimensions $(\delta_h, \delta_w, \delta_l)$, and the observation angle $\alpha$, represented by $(\sin\alpha, \cos\alpha)$.
The procedure of recovering the final 3D parameters is distinct for each variable. Firstly, the depth $z$ is recovered using the pre-defined scale and shift parameters $\sigma_z$ and $\mu_z$, as follows:
$$ z = \mu_z + \delta_z \, \sigma_z \quad (2) $$
Secondly, by inverting Equation (1) and using the recovered depth z along with the 2D image projections, the algorithm estimates the 3D location as follows:
$$ \begin{bmatrix} x \\ y \\ z \end{bmatrix} = K_{3\times 3}^{-1} \begin{bmatrix} z \cdot (x_c + \delta_{x_c}) \\ z \cdot (y_c + \delta_{y_c}) \\ z \end{bmatrix} \quad (3) $$
Thirdly, the model uses the pre-calculated average dimension array $(\bar{h}, \bar{w}, \bar{l})^T$ of each class from the dataset and calculates the object dimensions $(h, w, l)^T$ by applying the equation
$$ \begin{bmatrix} h \\ w \\ l \end{bmatrix} = \begin{bmatrix} \bar{h} \cdot e^{\delta_h} \\ \bar{w} \cdot e^{\delta_w} \\ \bar{l} \cdot e^{\delta_l} \end{bmatrix} \quad (4) $$
Finally, the observation angle $\alpha$, which represents the object’s orientation, is calculated from the $\sin(\alpha)$ and $\cos(\alpha)$ values obtained by the regression branch. However, it is necessary to transform this angle into the yaw angle $\theta$ to satisfy the selected coordinate system, which is obtained using the following equation:
$$ \theta = \alpha_z + \arctan\!\left(\frac{x}{z}\right) \quad (5) $$
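As a minimal numerical sketch of the recovery pipeline in Equations (1)–(5), the code below decodes one object from a projected keypoint and the regressed 8-tuple. The intrinsic matrix, the per-class average dimensions, and the depth shift/scale values are placeholder numbers, not those used in the actual experiments.

```python
import numpy as np

# Placeholder calibration and statistics; real values come from the dataset
# calibration files and the training-set class averages.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
mean_dims = np.array([1.5, 1.6, 3.9])   # (h_bar, w_bar, l_bar) for one class
mu_z, sigma_z = 30.0, 20.0              # pre-defined depth shift and scale

def decode_smoke(xc, yc, delta):
    """Recover the 3D box from a projected keypoint (xc, yc) and the regressed
    8-tuple delta = (dz, dxc, dyc, dh, dw, dl, sin_a, cos_a), following
    Equations (2)-(5)."""
    dz, dxc, dyc, dh, dw, dl, sin_a, cos_a = delta
    z = mu_z + dz * sigma_z                               # Equation (2)
    rhs = np.array([z * (xc + dxc), z * (yc + dyc), z])   # right side of Eq. (3)
    x, y, _ = np.linalg.inv(K) @ rhs                      # 3D location
    h, w, l = mean_dims * np.exp([dh, dw, dl])            # Equation (4)
    alpha = np.arctan2(sin_a, cos_a)                      # observation angle
    theta = alpha + np.arctan2(x, z)                      # Equation (5), yaw
    return (x, y, z), (h, w, l), theta

print(decode_smoke(980.0, 520.0, (0.1, 0.2, -0.1, 0.0, 0.0, 0.1, 0.0, 1.0)))
```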
Nevertheless, when applying deep learning to computer vision, it is also necessary to define the loss functions that the algorithm uses during training. The SMOKE model uses a different function in each branch of the model, Equation (6) being applied to the keypoint classification head and Equation (7) to the 3D regression branch, with the total loss being the sum of both, as expressed in Equation (8).
$$ L_{cls} = -\frac{1}{N} \sum_{i,j} \left(1 - y_{i,j}\right)^{\beta} \left(1 - s_{i,j}\right)^{\alpha} \log\left(s_{i,j}\right) \quad (6) $$
where $s$ is the prediction of the model, $y$ the ground-truth values, $\alpha$ and $\beta$ adjustable hyper-parameters, and $N$ the number of keypoints in the image.
$$ L_{reg} = \frac{\lambda}{N} \left\lVert B^{*} - B \right\rVert \quad (7) $$
where $B^{*}$ denotes the predicted values, $B$ the ground truth, and $\lambda$ a scaling factor to balance the classification and regression tasks.
$$ L = L_{cls} + \sum_{i=1}^{3} L_{reg}\!\left(B_i\right) \quad (8) $$
This multi-loss approach of the SMOKE algorithm ensures that it not only performs effectively in object detection but also accurately estimates the 3D pose of vehicles, which is crucial for applications within the Intelligent Transportation Systems field.
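A compact sketch of this multi-loss setup is shown below for illustration only; the heatmap term is written in the standard penalty-reduced (CenterNet-style) form that Equation (6) condenses, and it is not the reference SMOKE implementation.

```python
import torch

def keypoint_focal_loss(s, y, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over the keypoint heatmap (cf. Equation (6)).
    s: predicted heatmap in (0, 1); y: Gaussian-smoothed ground-truth heatmap."""
    pos = y.eq(1.0).float()                                    # keypoint centers
    neg = 1.0 - pos
    pos_term = ((1 - s) ** alpha) * torch.log(s + eps) * pos
    neg_term = ((1 - y) ** beta) * (s ** alpha) * torch.log(1 - s + eps) * neg
    n = pos.sum().clamp(min=1.0)                               # number of keypoints
    return -(pos_term + neg_term).sum() / n

def box_regression_loss(pred, gt, lam=0.01):
    """L1 regression on the decoded 3D box parameters (cf. Equation (7));
    lam is the scaling factor balancing classification and regression."""
    n = max(pred.shape[0], 1)
    return (lam / n) * torch.abs(pred - gt).sum()

def smoke_loss(heat_pred, heat_gt, box_preds, box_gts):
    """Total loss (cf. Equation (8)): classification plus the regression terms
    for the three groups of box parameters."""
    loss = keypoint_focal_loss(heat_pred, heat_gt)
    for bp, bg in zip(box_preds, box_gts):
        loss = loss + box_regression_loss(bp, bg)
    return loss
```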

3.2.2. MONOCON: Monocular Context

The MONOCON algorithm [25] distinguishes itself from other 3D object detection models by its focus on providing additional contextual information during training. This extra context, such as the corners of the bounding box and the 2D dimensions of the object, allows for better generalization. MONOCON has been chosen to evaluate the performance of this more complex algorithm on the DAIR-V2I dataset in pursuit of superior results, inspired by its demonstrated success on the KITTI dataset.
Similar to the SMOKE algorithm, MONOCON processes the image using a DLA-34 backbone to create a feature map from the image data. From the feature map generated by the backbone, MONOCON proposes a series of detector heads divided into two groups, the 3D Bounding Box Regression Heads and the Auxiliary Context Regression Heads. During training, both groups of heads are used to provide information to the loss functions; however, only the 3D Bounding Box Regression Heads are used during inference or testing. Firstly, let us define the module used in all heads, $g(F; \Phi)$, where $\Phi$ denotes the learnable parameters of each head and $F$ the feature map provided by the backbone, as follows:
$$ F \xrightarrow{\; d \times 3 \times 3 \times D \ \text{Conv} \,+\, \text{AN} \,+\, \text{ReLU} \;} F_{d \times h \times w} \xrightarrow{\; c \times 1 \times 1 \times d \ \text{Conv} \;} H_{c \times h \times w} \quad (9) $$
With this module defined, the 3D Regression Heads, which compute the position and dimensions of the vehicles, are composed of five different detection heads:
  • The 2D Bbox Center Heatmap head: uses the regression head of Equation (9) to create a heatmap $H^{b}$ for the two-dimensional center of each of the classes.
    $$ H^{b} = g(F; \Phi_b) \quad (10) $$
    This heatmap aids in accurately locating the 2D center of each object, which will serve as the reference coordinates to perform the 3D regression procedures of the other heads.
  • Offset vector head: computes the offset vector $(\Delta x_{bc}, \Delta y_{bc})$ from the object’s 2D center $(x_b, y_b)$ to the projected 3D bounding box center $(x_c, y_c)$, defined by
    $$ O^{c}_{2 \times h \times w} = g(F; \Theta_{bc}) \quad (11) $$
  • Depth and uncertainty head: regresses the depth $Z$ and its uncertainty $\sigma_Z$ as follows:
    $$ Z_{1 \times h \times w} = \frac{1}{\mathrm{Sigmoid}\!\left(g(F; \Theta_Z)[0]\right) + \epsilon} - 1 \quad (12) $$
    $$ \sigma^{Z}_{1 \times h \times w} = g(F; \Theta_Z)[1] \quad (13) $$
    Equation (12) calculates the depth $Z$ by applying the Sigmoid function to the output of Equation (9), normalizing it into a range between 0 and 1, which is easier for the neural network to learn; $\epsilon$ is included to prevent division by zero. Additionally, Equation (13) computes the uncertainty $\sigma_Z$, which indicates the confidence level of the measurement.
  • Shape dimensions head: estimates the object dimensions by
    $$ S^{3D}_{3 \times h \times w} = g(F; \Theta_{S3D}) \quad (14) $$
  • Observation angle: uses the multi-bin method [29], where the angle is divided into a predefined number of bins, $b$, plus a residual angle, both regressed by
    $$ A_{2b \times h \times w} = g(F; \Theta_A) \quad (15) $$
    The final orientation angle is recovered by identifying the most likely bin class and adding its associated residual value, obtaining a more precise result compared to other angle estimation methods.
However, unlike other algorithms, MONOCON’s behavior requires adaptation to meet the specific needs of an infrastructure perspective. Since some detection branches of the algorithm rely on the corners of the 3D bounding box, it is necessary to adjust for the camera’s pitch. Vehicle-mounted cameras generally capture images from a flat, ground-parallel perspective, aligned with the vehicle’s orientation. In contrast, infrastructure cameras are often tilted towards the road to capture a broader view. Consequently, when calculating the projected coordinates of a vehicle’s bounding box corners in the image, it is required to consider the camera’s pitch. In our study, this adjustment is executed by using the extrinsic parameter matrix from the DAIR-V2X dataset’s calibration files, enabling accurate alignment of the bounding boxes with the actual camera angle.
As previously discussed, these detection heads are used during both the training and inference phases; however, the object’s pose and dimensions must be decoded during the inference phase. This begins with locating the 2D center of the object on the heatmap $H^{b}$. Once the center is identified, the projected 3D center is determined using the associated offset vector $(\Delta x_{bc}, \Delta y_{bc})$. Subsequently, the object’s depth, the Z-coordinate, is directly derived from Equation (12), while its dimensions are obtained from Equation (14). Additionally, the object’s orientation angle is decoded using the multi-bin class method. The decoding finishes with the computation of the 3D position, calculated from the projected 3D center, the depth, and the intrinsic matrix of the camera.
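The decoding path just described can be condensed into a few lines. The sketch below takes the peak of one class heatmap, applies the offset vector and the decoded depth, and back-projects through the intrinsic matrix; the tensor shapes, the single-peak selection, and the omission of the quantization residual are simplifications for illustration, not the model’s actual post-processing.

```python
import numpy as np

def decode_monocon_center(heatmap, offset, depth_map, K, stride=4):
    """Single-object decoding sketch for the inference path described above.
    heatmap:   (h, w) 2D-centre heatmap of one class (feature-map resolution)
    offset:    (2, h, w) offset vectors to the projected 3D centre
    depth_map: (h, w) decoded depth Z (already transformed as in Eq. (12))
    K:         (3, 3) camera intrinsic matrix; stride: backbone downsampling.
    The quantization-residual correction of Eqs. (19)-(20) is omitted here."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)   # heatmap peak
    dx, dy = offset[:, v, u]                                     # offset to 3D centre
    xc, yc = (u + dx) * stride, (v + dy) * stride                # image-plane coords
    z = depth_map[v, u]                                          # object depth
    return np.linalg.inv(K) @ np.array([xc * z, yc * z, z])      # back-projection

K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
hm = np.zeros((270, 480)); hm[130, 250] = 1.0
off = np.zeros((2, 270, 480)); depth = np.full((270, 480), 40.0)
print(decode_monocon_center(hm, off, depth, K))   # approximate 3D centre (x, y, z)
```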
Secondly, the Auxiliary Context Regression Heads are designed to complement the 3D Regression Heads during training, allowing the model to achieve better performance. These heads are also based on Equation (9) and are explained briefly as follows:
  • Heatmaps of the projected keypoints: as with the 2D bbox center heatmap above, these heads regress the heatmap of the nine projected keypoints, namely the eight corners of the 3D Bbox and its center, following the equation
    $$ H^{k}_{9 \times h \times w} = g(F; \Theta_k) \quad (16) $$
  • Offset vectors for the eight projected corner points: computes the offset vectors from the 2D image locations to the projected corner points, as done previously for the center point
    $$ O^{k}_{16 \times h \times w} = g(F; \Theta_{bk}) \quad (17) $$
  • 2D Bbox size: regresses the height and width of the 2D bounding box as follows:
    $$ H^{2D}_{2 \times h \times w} = g(F; \Theta_{S2D}) \quad (18) $$
  • Quantization residual of a keypoint location: the overall stride (normally s > 1) in the backbone produces a difference between the pixel position in the image and the feature map from the backbone. The residual of the center keypoint is regressed by Equation (19), and the residuals of the remaining keypoints follow Equation (20), as follows:
    $$ R^{b}_{2 \times h \times w} = g(F; \Theta_{Rb}) \quad (19) $$
    $$ R^{k}_{2 \times h \times w} = g(F; \Theta_{Rk}) \quad (20) $$
As with many algorithms in 3D object detection, MONOCON uses several loss functions applied differently to each detection head. Specifically, it uses the Gaussian kernel weighted focal loss function [30] for heatmaps (Equations (10) and (16)), the Laplacian aleatoric uncertainty [20] loss for depth and uncertainty (Equations (12) and (13)), the dimension-aware L1 function [24] for shape dimensions (Equation (14)), the standard cross-entropy loss function for the bin index in observation angles (Equation (15)), and finally the standard L1 loss function for the rest of the heads. Each loss is designed to accurately capture the complexities of 3D object detection from monocular images, improving the performance of the model.

3.2.3. PGD: Probabilistic and Geometric Depth

The third algorithm implemented in this infrastructure research is the PGD model [23], an enhanced version of FCOS3D [31] that introduces a novel approach to depth estimation aimed at improving the depth accuracy of FCOS3D. Selected for its innovative dual-depth estimation approach, PGD combines probabilistic methods with geometric perspective relationships, enhancing the depth accuracy critical for complex urban infrastructure environments. This capability significantly elevates its performance compared to traditional depth estimation methods, making it particularly suitable for the demanding conditions of infrastructure-based detection tasks. In 3D object detection, inaccuracies in depth estimation significantly impact the overall accuracy of the model. The PGD model proposes calculating depth estimation from both probabilistic and geometric perspectives, finally merging them to achieve better performance.
The PGD model inherits its architecture from FCOS3D, which is a one-stage algorithm, similar to the previously explained SMOKE. It utilizes a pretrained ResNet101 [32] as a backbone for feature extraction, followed by a Feature Pyramid Network (FPN) [33] for preliminary object detection tasks. Additionally, the model includes a series of shared detection heads responsible for classification, center location, and bounding box regression.
However, PGD diverges from FCOS3D in its approach to depth estimation. Instead of adopting FCOS3D’s depth estimation head, PGD incorporates its own depth estimation module. This new depth head computes depth using probabilistic methods and geometric perspective relationships, producing two separate depth estimates. These estimates are then fused to determine the final depth, aiming for a more accurate and reliable depth estimation process.
These improvements in depth regression allow PGD to achieve better performance than its predecessor, FCOS3D, making PGD a strong candidate for training in our infrastructure experiments.

4. Experiments

4.1. Data Preparation and Preprocessing

The data processed by the models in the experiments conducted come from the DAIR-V2X dataset, as described above, which includes images along with corresponding labels and calibration data essential for converting real-world coordinates into 3D projections suitable for image analysis. Specifically, the dataset includes 7058 images, divided into 80% for training, 10% for validation, and 10% for testing the models after the training phase.
To enhance the generalization capabilities of our algorithms, we employ a series of preprocessing steps. These steps are designed to introduce variability, ensuring that the models are exposed to a diverse range of data during training. This preprocessing routine is uniformly applied across all three models.
Using the functionalities provided by the MMDetection3D library, our preprocessing begins with loading the dataset images at a resolution of 1920 × 1080 pixels. Once loaded, to simulate more vehicle orientations and increase the data’s diversity, we apply random flipping along the horizontal axis with a 50% probability. Additional data augmentation includes rescaling the images, applied with a 30% probability and a scale factor between 0.2 and 0.4.
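For illustration, the toy routine below mimics that augmentation recipe outside of MMDetection3D: a horizontal flip with 50% probability and an occasional rescale. The interpretation of the scale factor as a relative enlargement and the omission of the matching label transforms (mirrored boxes and yaw angles) are assumptions of this sketch, not the actual pipeline configuration.

```python
import random
import numpy as np

def augment(image, flip_prob=0.5, scale_prob=0.3, scale_range=(0.2, 0.4)):
    """Toy augmentation mirroring the preprocessing described above.
    image: HxWx3 NumPy array. Label transforms (mirroring 2D/3D boxes and
    yaw angles on flip) are intentionally omitted from this sketch."""
    if random.random() < flip_prob:
        image = image[:, ::-1, :]                  # horizontal flip
    if random.random() < scale_prob:
        s = 1.0 + random.uniform(*scale_range)     # assumed relative enlargement
        h, w = image.shape[:2]
        new_h, new_w = int(h * s), int(w * s)
        rows = (np.arange(new_h) * h // new_h).astype(int)   # nearest-neighbour map
        cols = (np.arange(new_w) * w // new_w).astype(int)
        image = image[rows][:, cols]               # dependency-free resize
    return image

img = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(augment(img).shape)
```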

4.2. Implementation Details

While the implementation specifics vary between the models trained, the same computational environment was utilized for all of them. Each model underwent training on a single GeForce RTX 3090 Ti, equipped with 24 GB of VRAM. The system was configured with CUDA 12.2, providing the necessary support for GPU acceleration. The development and training of the models were facilitated by the MMDetection3D 1.2 library, which integrates PyTorch 1.13, offering a robust framework for deep learning applications in computer vision.
Firstly, the training of the SMOKE algorithm lasted 100 epochs, with validation intervals set at every 5 epochs and a batch size of 4. Images are initially loaded at their original resolution of 1920 × 1080 pixels, and an affine resize operation reduces their dimensions by a factor of 1/4. The Adam optimizer was chosen for this training, starting with an initial learning rate of $2.5 \times 10^{-4}$. This learning rate is maintained for the first 50 epochs; then, from epoch 51 to the end of the training, it is reduced to $2.5 \times 10^{-5}$ to ensure finer adjustments in the network’s weights for improved model performance.
Secondly, the training of MONOCON spans 100 epochs, with validation checkpoints set every 25 epochs and a batch size of 4. It also loads the images at their original resolution, downsampling them by a factor of 4. The Adam optimizer was selected for optimization, starting with an initial learning rate of $2.5 \times 10^{-5}$. The parameter scheduler linearly increases the learning rate to a value of $2.5 \times 10^{-4}$ by epoch 5 and then decreases it in the same manner to $2.5 \times 10^{-9}$ by the end of the training.
Finally, the training of the PGD model is conducted over 48 epochs, with the validation interval set at every 12 epochs and a batch size of 1 due to memory restrictions. However, unlike the other algorithms, this model does not downsample the images, so the final image shape is 1920 × 1080 pixels. The optimizer selected is SGD, with an initial learning rate of $1 \times 10^{-3}$ and a momentum of 0.9. Lastly, the learning rate is decreased by a factor of 10 at epoch 32 and again at epoch 44, finishing the training with a learning rate of $1 \times 10^{-5}$.
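To illustrate one of these schedules, the sketch below reproduces the SMOKE-style step decay (constant learning rate for 50 epochs, then a tenfold reduction) with a placeholder model standing in for the detector; the MONOCON warm-up/annealing and the PGD step decay at epochs 32 and 44 would be built analogously with PyTorch's scheduler classes.

```python
import torch
from torch import nn

model = nn.Linear(10, 10)      # placeholder standing in for the detector
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
# Keep 2.5e-4 for the first 50 epochs, then drop by a factor of 10 to 2.5e-5.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)

for epoch in range(100):
    # ... one training epoch over the DAIR data would run here ...
    optimizer.step()           # placeholder parameter update
    scheduler.step()           # advance the epoch-based schedule
    if epoch + 1 in (50, 51, 100):
        print(f"epoch {epoch + 1}: lr = {optimizer.param_groups[0]['lr']:.1e}")
```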

4.3. Evaluation Metrics

One of the primary benefits of adapting the DAIR dataset to the KITTI format and using MMDetection3D lies in the availability of established tools and evaluation metrics from the KITTI benchmark. For the evaluation of the trained models, we have used the AP40 metrics. These metrics classify all ground-truth objects into three main types: easy, moderate, and hard, according to the occlusion and pixel size of the bounding boxes. Additionally, the Intersection over Union (IoU) metric evaluates the overlap between the predicted bounding box and the ground truth, and detections are filtered by a minimum IoU score before computing AP40; for example, thresholds of 0.70 or 0.50 are commonly used for AP evaluation in vehicle detection.
Subsequently, the Average Precision (AP) scores are calculated according to the following equation, evaluating the precision at 40 points of recall:
$$ AP|_{R_N} = \frac{1}{N} \sum_{r \in R} P(r) \quad (21) $$
where $R = \left[ r_0,\; r_0 + \frac{r_1 - r_0}{N - 1},\; r_0 + \frac{2(r_1 - r_0)}{N - 1},\; \ldots,\; r_1 \right]$ and $P(r) = \max_{r' : r' \geq r} P(r')$. AP40 provides a more detailed insight than AP11 into the model’s behavior due to the greater number of evaluation points, offering a closer look at its performance. However, AP11 remains a widely recognized and utilized metric, acknowledged for its established benchmark status in 3D object detection.
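The calculation above translates directly into code; the short function below evaluates the interpolated precision at 40 equally spaced recall positions. The precision–recall pairs in the example are synthetic and serve only to illustrate the computation.

```python
import numpy as np

def average_precision(recalls, precisions, n_points=40, r0=1.0 / 40, r1=1.0):
    """AP over n_points recall positions with interpolated precision:
    P(r) = max over r' >= r of P(r'), averaged over R = [r0, ..., r1]."""
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(r0, r1, n_points):
        above = precisions[recalls >= r]            # precisions at recall >= r
        ap += above.max() if above.size else 0.0    # interpolated precision
    return ap / n_points

# Synthetic precision-recall pairs, purely for demonstration:
rec = np.linspace(0.0, 0.9, 50)
prec = np.clip(1.0 - 0.6 * rec, 0.0, 1.0)
print(f"AP40 = {average_precision(rec, prec):.3f}")
```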

4.4. Results

Upon completion of all experiments, the results obtained from testing with the DAIR dataset are detailed in Table 3. These results serve to compare the performance of the SMOKE, MONOCON, and PGD models in terms of AP40 3D metrics across two different IoU thresholds, 0.7 and 0.5, providing insights into each model’s ability to accurately detect objects.
Additionally, a comparative analysis of the models’ performances on the DAIR and KITTI datasets at an IoU threshold of 0.7 is provided in Table 4. This comparison highlights how the infrastructure perspective aids detection models and offers a valuable viewpoint, making them more accurate than those from a vehicle’s perspective trained on KITTI.
To better observe the precision achieved by each algorithm, examples of model inferences are illustrated in Figure 7, showcasing the practical applications and effectiveness of the models in real-world scenarios. Furthermore, the results are displayed from a bird’s-eye view (BEV) alongside the corresponding labels to showcase how accurately the models can estimate a vehicle’s position. Moreover, Figure 8 depicts two additional inferences on DAIR using each model.
After reviewing the results obtained from the experiments conducted during this research, it is insightful to consider additional factors beyond algorithm precision to correctly evaluate the effectiveness of the models. In 3D object detection applied to Intelligent Transportation Systems, it is crucial to consider the inference time and model complexity of the algorithms. On the one hand, ITS often demand almost immediate processing times to deliver real-time traffic updates, making the inference time of the model a critical factor. On the other hand, the analysis of the traffic is sometimes carried out locally, depending on the environment setup, and therefore the computational complexity of the algorithms is fundamental too. Table 5 showcases the precision achieved by each model, along with its average inference time, estimated frames per second, and overall computational demands, providing a comprehensive overview of these crucial aspects.
The results underscore the differing performance and computational demands of the models. While SMOKE and MONOCON show significantly smaller computational demands than PGD, they also perform robustly across all difficulty levels, highlighting that a more complex algorithm, as indicated by higher GFLOPS, does not necessarily translate into better performance. Furthermore, the higher model complexity of PGD directly impacts its inference time, resulting in a lower number of Frames Per Second (FPS) achievable by the model. In contrast, SMOKE and MONOCON achieve inference times suitable for real-time traffic detection, allowing the infrastructure to utilize the extracted data in urban environments. These observations are crucial, especially in applications that require rapid response times, where increased model complexity without corresponding improvements in performance is a significant drawback.
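For completeness, timings such as those in Table 5 are usually obtained by averaging many timed forward passes after a warm-up phase; the sketch below shows such a measurement loop for a generic PyTorch module, with a small convolution standing in for the actual detectors.

```python
import time
import torch
from torch import nn

def measure_latency(model, input_shape=(1, 3, 1080, 1920), n_warmup=10, n_runs=20):
    """Average per-image latency (ms) and FPS for a forward pass of `model`."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(n_warmup):
            model(x)                                 # warm-up passes
        if device == "cuda":
            torch.cuda.synchronize()                 # wait for queued GPU work
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / n_runs
    return latency * 1000.0, 1.0 / latency           # milliseconds, frames per second

print(measure_latency(nn.Conv2d(3, 8, kernel_size=3)))   # placeholder model
```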
Nevertheless, the increased precision achieved by the detection system depends on the correct calibration of the cameras used. In real-world scenarios, it is essential to correctly calibrate the infrastructure cameras to use the detection algorithms precisely. This process must be carried out for each sensor, but it remains simpler than the calibration required for LiDAR or stereo-camera systems, maintaining the advantage of simplicity that characterizes monocular 3D detection.
To conclude the experiments section, these calibrated cameras are ready for deployment and capable of performing 3D object detection either locally on devices like the Nvidia Jetson or through cloud computing solutions. The simplicity of the models, particularly SMOKE and MONOCON, supports their integration into compact and portable devices, enabling on-site detection and immediate data transmission to a central control center. Alternatively, the infrastructure-based detection system can transmit images to a traffic management center for centralized analysis. Finally, with recent advancements in cloud computing, handling detections externally offers another layer of flexibility. This range of deployment options significantly broadens the practical applicability of 3D object detection systems within various Intelligent Transportation System configurations.

4.5. Challenges and Limitations

Despite the satisfactory results obtained from the experiments, computer vision deployed on Smart Infrastructures faces some critical challenges and limitations. One significant bottleneck is the delay in the transmission of decisions to vehicles, which can severely impact the ability of Intelligent Transportation Systems (ITS) to operate in real time. Such delays are often compounded by network latency and the inference times discussed earlier. Additionally, the decision-making system of the ITS setup also consumes crucial fractions of a second, which are vital if the application is related to traffic planning.
Moreover, the accuracy and reliability of monocular 3D object detection systems can be substantially affected by environmental factors, such as lighting changes and extreme weather conditions, which can severely impair the detection capabilities of the algorithms [34]. Additionally, specific traffic characteristics, such as large trucks occluding vehicles behind them or high-density traffic patterns, can reduce the system’s accuracy and reliability.
The challenges identified highlight the evidence that further research and development are needed to improve computer vision within Smart Infrastructures. Although infrastructure-based detection has proven to be more accurate than ego-vehicle approaches, considerable limitations remain unsolved before these technologies can be fully implemented within Intelligent Transportation Systems and smart cities, particularly in real-time applications. Future research should focus on enhancing detection capabilities, particularly in adverse lighting and weather conditions, and optimizing the detection system and information handling to minimize detection and decision-making delays, thereby providing ITS with a reliable tool for autonomous vehicle applications.

5. Conclusions and Future Works

The results obtained from this research clearly demonstrate the advantages of utilizing infrastructure-based monocular cameras for 3D vehicle detection over traditional vehicle-based methods. Our experiments underscore the increased precision of algorithms when compared to models trained using the KITTI dataset. This improvement in precision is primarily attributed to the elevated viewpoint provided by infrastructure-mounted cameras, which allows for less occluded and broader views.
The efficiency of the analyzed models, particularly in terms of inference time and model complexity, has important implications for the development of smart cities. The reduced inference time, especially of the MONOCON and SMOKE models, enables the real-time processing capabilities crucial in urban environments, where live data ensure users’ safety and comfort. Additionally, the decreased model complexity allows for integration into small and relatively simple computational devices, crucial for scaling up smart city technologies and making monocular 3D object detection systems more accessible and cost-effective. Moreover, the use of monocular cameras on smart infrastructures presents an alternative to more expensive systems, such as LiDAR and stereo camera setups, offering a solution for cities looking to implement 3D detection technologies without significant investments.
Moving forward, the field of monocular 3D detection from an infrastructure point of view is poised for significant development, with several paths for future research. Increasing the precision and diversity of future datasets will be a key step to take, including a wide range of environmental conditions to improve algorithmic generalization and reliability. Additionally, the development of more sophisticated algorithms that can overcome current state-of-the-art models is also essential for the evolution of 3D vehicle detection systems from the infrastructure side. Finally, exploring the full array of applications for intelligent infrastructure will be an alternative avenue of investigation, including next-generation smart city traffic management, autonomous vehicle control, and even smart parking planning systems.
In conclusion, this paper underscores the potential of developing infrastructure-based monocular 3D detection systems for next-generation smart cities, paving the way for safer, more efficient, and sustainable urban environments.

Author Contributions

Conceptualization, Á.R.-B. and J.M.A.M.; methodology, J.B.B.; software, J.B.B.; validation, J.B.B., Á.R.-B. and J.M.A.M.; formal analysis, J.B.B.; investigation, J.B.B.; resources, Á.R.-B.; data curation, Á.R.-B.; writing—original draft preparation, J.B.B.; writing—review and editing, Á.R.-B. and J.M.A.M.; visualization, J.B.B.; supervision, Á.R.-B. and J.M.A.M.; project administration, J.M.A.M.; funding acquisition, J.M.A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by MCIN/AEI/10.13039/501100011033 grant numbers PID2021-124335OB-C21, PID2022-140554OB-C32, and PDC2022-133684-C31.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Guo, E.; Chen, Z.; Rahardja, S.; Yang, J. 3D Detection and Pose Estimation of Vehicle in Cooperative Vehicle Infrastructure System. IEEE Sens. J. 2021, 21, 21759–21771. [Google Scholar] [CrossRef]
  2. Zimmer, W.; Birkner, J.; Brucker, M.; Tung Nguyen, H.; Petrovski, S.; Wang, B.; Knoll, A.C. InfraDet3D: Multi-Modal 3D Object Detection based on Roadside Infrastructure Camera and LiDAR Sensors. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023; pp. 1–8. [Google Scholar] [CrossRef]
  3. Yu, H.; Luo, Y.; Shu, M.; Huo, Y.; Yang, Z.; Shi, Y.; Guo, Z.; Li, H.; Hu, X.; Yuan, J.; et al. DAIR-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 21361–21370. [Google Scholar]
  4. Ye, X.; Shu, M.; Li, H.; Shi, Y.; Li, Y.; Wang, G.; Tan, X.; Ding, E. Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 21341–21350. [Google Scholar]
  5. Creß, C.; Zimmer, W.; Strand, L.; Fortkord, M.; Dai, S.; Lakshminarasimhan, V.; Knoll, A. A9-Dataset: Multi-Sensor Infrastructure-Based Dataset for Mobility Research. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 4–9 June 2022; pp. 965–970. [Google Scholar] [CrossRef]
  6. Jia, J.; Li, Z.; Shi, Y. MonoUNI: A Unified Vehicle and Infrastructure-side Monocular 3D Object Detection Network with Sufficient Depth Clues. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  7. Ramajo-Ballester, Á.; de la Escalera Hueso, A.; Armingol Moreno, J.M. 3D Object Detection for Autonomous Driving: A Practical Survey. In Proceedings of the 9th International Conference on Vehicle Technology and Intelligent Transport Systems, Prague, Czech Republic, 26–28 April 2023; pp. 64–73. [Google Scholar] [CrossRef]
  8. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  9. Huang, X.; Cheng, X.; Geng, Q.; Cao, B.; Zhou, D.; Wang, P.; Lin, Y.; Yang, R. The ApolloScape Dataset for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  10. Patil, A.; Malla, S.; Gang, H.; Chen, Y.T. The H3D Dataset for Full-Surround 3D Multi-Object Detection and Tracking in Crowded Urban Scenes. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 9552–9557. [Google Scholar] [CrossRef]
  11. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  12. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  13. Weng, X.; Man, Y.; Park, J.; Yuan, Y.; O’Toole, M.; Kitani, K.M. All-In-One Drive: A Comprehensive Perception Dataset with High-Density Long-Range Point Clouds. 2021. Available online: https://openreview.net/forum?id=yl9aThYT9W (accessed on 1 June 2024).
  14. Xiao, P.; Shao, Z.; Hao, S.; Zhang, Z.; Chai, X.; Jiao, J.; Li, Z.; Wu, J.; Sun, K.; Jiang, K.; et al. PandaSet: Advanced Sensor Suite Dataset for Autonomous Driving. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 3095–3101. [Google Scholar] [CrossRef]
  15. Mao, J.; Niu, M.; Jiang, C.; Liang, H.; Chen, J.; Liang, X.; Li, Y.; Ye, C.; Zhang, W.; Li, Z.; et al. One Million Scenes for Autonomous Driving: ONCE Dataset. arXiv 2021, arXiv:2106.11037. [Google Scholar] [CrossRef]
  16. Liao, Y.; Xie, J.; Geiger, A. KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3292–3310. [Google Scholar] [CrossRef] [PubMed]
  17. Wilson, B.; Qi, W.; Agarwal, T.; Lambert, J.; Singh, J.; Khandelwal, S.; Pan, B.; Kumar, R.; Hartnett, A.; Pontes, J.K.; et al. Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting. arXiv 2023, arXiv:2301.00493. [Google Scholar] [CrossRef]
  18. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  20. Chen, Y.; Tai, L.; Sun, K.; Li, M. MonoPair: Monocular 3D Object Detection Using Pairwise Spatial Relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  21. Liu, Z.; Wu, Z.; T’oth, R. SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 4289–4298. [Google Scholar]
  22. Shi, X.; Ye, Q.; Chen, X.; Chen, C.; Chen, Z.; Kim, T.K. Geometry-Based Distance Decomposition for Monocular 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 15172–15181. [Google Scholar]
  23. Wang, T.; Zhu, X.; Pang, J.; Lin, D. Probabilistic and Geometric Depth: Detecting Objects in Perspective. arXiv 2021, arXiv:2107.14160. [Google Scholar] [CrossRef]
  24. Ma, X.; Zhang, Y.; Xu, D.; Zhou, D.; Yi, S.; Li, H.; Ouyang, W. Delving Into Localization Errors for Monocular 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4721–4730. [Google Scholar]
  25. Liu, X.; Xue, N.; Wu, T. Learning Auxiliary Monocular Contexts Helps Monocular 3D Object Detection. arXiv 2021, arXiv:2112.04628. [Google Scholar] [CrossRef]
  26. Li, Z.; Jia, J.; Shi, Y. MonoLSS: Learnable Sample Selection For Monocular 3D Detection. arXiv 2023, arXiv:2312.14474. [Google Scholar] [CrossRef]
  27. Contributors, M. MMDetection3D: OpenMMLab Next-Generation Platform for General 3D Object Detection. 2020. Available online: https://github.com/open-mmlab/mmdetection3d (accessed on 1 June 2024).
  28. Yu, F.; Wang, D.; Darrell, T. Deep Layer Aggregation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2403–2412. [Google Scholar]
  29. Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  30. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  31. Wang, T.; Zhu, X.; Pang, J.; Lin, D. FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. arXiv 2021, arXiv:2104.10956. [Google Scholar] [CrossRef]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  33. Li, P.; Zhao, H.; Liu, P.; Cao, F. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving. arXiv 2020, arXiv:2001.03343. [Google Scholar] [CrossRef]
  34. Vargas, J.; Alsweiss, S.; Toker, O.; Razdan, R.; Santos, J. An overview of autonomous vehicles sensors and their vulnerability to weather conditions. Sensors 2021, 21, 5397. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Timeline of publicly available datasets and respective point of view.
Figure 2. Examples of images from (a) KITTI dataset and (b) nuScenes.
Figure 3. Inference results using (a) PGD and (b) SMOKE models on an image from the KITTI dataset.
Figure 4. Comparison of data capture perspectives: (a) Image from a vehicle-mounted camera with potential occlusions and limited scope. (b) Image from an infrastructure-mounted camera showing a wide, unobstructed view. This comparison underscores the benefits of infrastructure perspectives in terms of visibility and data quality.
Figure 5. (a) Inference using the SMOKE model trained on the KITTI dataset and applied to the DAIR-V2X dataset. (b) Inference using the SMOKE model trained directly with the DAIR-V2X dataset. Note the differences in detection capabilities and accuracy.
Figure 6. Visualization of 3D bounding boxes obtained from the labels file provided by the DAIR-V2I dataset.
Figure 7. Inferences using the SMOKE (left), MONOCON (center), and PGD (right) models on the DAIR dataset, showcasing the projected bounding boxes on the image and the bird’s-eye view (BEV). This illustrates the models’ accuracy in object detection, identifying objects at multiple distances and from different perspectives.
Figure 8. Inferences using the SMOKE (left), MONOCON (center), and PGD (right) models on the DAIR dataset in different image scenarios.
Table 1. Comparison between publicly available datasets for 3D object detection, sorted by year [7].
Dataset | Year | Images | LiDAR | 3D ann. | Classes | Night/Rain | View
KITTI [8] | 2013 | 15 k | 15 k | 200 k | 8 | No/No | Onboard
ApolloScape [9] | 2019 | 144 k | 20 k | 475 k | 6 | -/- | Onboard
H3D [10] | 2019 | 83 k | 27 k | 1.1 M | 8 | No/No | Onboard
nuScenes [11] | 2020 | 1.4 M | 400 k | 1.4 M | 23 | Yes/Yes | Onboard
WaymoOpen [12] | 2020 | 1 M | 230 k | 12 M | 4 | Yes/Yes | Onboard
AIODrive [13] | 2021 | 100 k | 100 k | 10 M | - | Yes/Yes | Virtual
PandaSet [14] | 2021 | 48 k | 16 k | 1.3 M | 28 | Yes/Yes | Onboard
ONCE [15] | 2021 | 7 M | 1 M | 417 k | 5 | Yes/Yes | Onboard
KITTI-360 [16] | 2022 | 300 k | 78 k | 68 k | 37 | -/- | Onboard
Rope3D [4] | 2022 | 50 k | - | 1.5 M | 12 | Yes/Yes | Infrast.
A9Dataset [5] | 2022 | 5.4 k | 1.7 k | 215 k | 8 | Yes/Yes | Infrast.
DAIR-V2X [3] | 2022 | 71 k | 71 k | 1.2 M | 10 | -/- | I/O 1
Argoverse2 [17] | 2023 | ∼1 M | 150 k | - | 30 | -/- | Onboard
1 Infrastructure and onboard data.
Table 2. Comparison of performance metrics for various 3D monocular object detection models as of 2023, evaluated on the KITTI benchmark. The table presents the Average Precision ($AP_{3D}$) at an Intersection over Union (IoU) threshold of 0.7 using the $R_{40}$ recall levels. This compilation highlights advancements in model accuracy over recent years, showcasing the progressive enhancement of monocular detection capabilities within the KITTI benchmark. All results shown are reported by the authors of each respective study.
Model | Year | Easy | Moderate | Hard ($AP_{3D}$, IoU = 0.7, $R_{40}$)
MONOPAIR [20] | 2020 | 16.28 | 12.30 | 10.42
SMOKE [21] | 2020 | 14.03 | 9.76 | 7.84
MONORCNN [22] | 2021 | 18.36 | 12.65 | 10.03
PGD [23] | 2021 | 19.05 | 11.73 | 9.39
MONODLE [24] | 2021 | 17.23 | 12.26 | 10.29
MONOCON [25] | 2021 | 22.50 | 16.46 | 13.95
MONOLSS [26] | 2023 | 26.11 | 19.15 | 16.94
Table 3. Results obtained from the experiments explained. The table presents the AP40 3D metrics at two different levels of Intersection over Union (IoU), 0.7 and 0.5.
Model | Easy (IoU = 0.7) | Moderate (IoU = 0.7) | Hard (IoU = 0.7) | Easy (IoU = 0.5) | Moderate (IoU = 0.5) | Hard (IoU = 0.5)
SMOKE | 59.82 | 51.14 | 50.93 | 84.10 | 78.34 | 78.19
MONOCON | 51.35 | 43.41 | 43.19 | 85.58 | 79.50 | 79.30
PGD | 59.99 | 50.87 | 49.54 | 84.06 | 73.92 | 73.94
Table 4. Comparison of AP40 3D metrics between DAIR and KITTI datasets at IoU thresholds of 0.7.
Model | DAIR Easy | DAIR Moderate | DAIR Hard | KITTI Easy | KITTI Moderate | KITTI Hard (AP40, IoU ≥ 0.7)
SMOKE | 59.82 | 51.14 | 50.93 | 14.03 | 9.76 | 7.84
MONOCON | 51.35 | 43.41 | 43.19 | 22.50 | 16.46 | 13.95
PGD | 59.99 | 50.87 | 49.54 | 19.05 | 11.73 | 9.39
Table 5. Performance and computational requirements of detection models at an IoU threshold of 0.7, showing Average Precision at a moderate difficulty level (AP40), inference time, frames per second (FPS), and computational complexity (GFLOPS) when inferring 1920 × 1080 p images.
Model | AP40 (Moderate) | Inference Time | FPS | Model Complexity
SMOKE | 51.14 | 134 ms | 7.46 FPS | 185 GFLOPS
MONOCON | 43.41 | 133 ms | 7.52 FPS | 190 GFLOPS
PGD | 50.87 | 286 ms | 3.50 FPS | 1780 GFLOPS
