Article

Architectural Framework to Enhance Image-Based Vehicle Positioning for Advanced Functionalities

by Iosif-Alin Beti 1,†, Paul-Corneliu Herghelegiu 2,† and Constantin-Florin Caruntu 1,*,†
1 Department of Automatic Control and Applied Informatics, Gheorghe Asachi Technical University of Iasi, 700050 Iasi, Romania
2 Department of Computer Science and Engineering, Gheorghe Asachi Technical University of Iasi, 700050 Iasi, Romania
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Information 2024, 15(6), 323; https://doi.org/10.3390/info15060323
Submission received: 24 April 2024 / Revised: 22 May 2024 / Accepted: 28 May 2024 / Published: 31 May 2024

Abstract:
The growing number of vehicles on the roads has resulted in several challenges, including increased accident rates, fuel consumption, pollution, travel time, and driving stress. However, recent advancements in intelligent vehicle technologies, such as sensors and communication networks, have the potential to revolutionize road traffic and address these challenges. In particular, the concept of platooning for autonomous vehicles, where they travel in groups at high speeds with minimal distances between them, has been proposed to enhance the efficiency of road traffic. To achieve this, it is essential to determine the precise position of vehicles relative to each other. Global positioning system (GPS) devices have an inherent positioning error that might increase due to various conditions, e.g., the number of available satellites, nearby buildings, trees, driving through tunnels, etc., making it difficult to compute the exact relative position between two vehicles. To address this challenge, this paper proposes a new architectural framework to improve positioning accuracy using images captured by onboard cameras. It presents a novel algorithm and performance results for vehicle positioning based on GPS and video data. This approach is decentralized, meaning that each vehicle has its own camera and computing unit and communicates with nearby vehicles.

1. Introduction

We live in a constantly developing world, where cities are becoming more and more crowded, and the number of cars is constantly increasing. Traffic becomes more and more congested because the road infrastructure can no longer cope with the increasing number of vehicles. This means more fuel consumption, more pollution, longer journeys, stressed drivers, and, most importantly, an increase in the number of accidents. Pedestrians, cyclists, and motorcyclists are the most exposed to road accidents. According to a report from 2017 by the World Health Organization, every year, 1.25 million people die in road accidents, and millions more are injured [1]. The latest status report from 2023 [2] indicates a slight decrease in the number of road traffic deaths to 1.19 million per year, highlighting the positive impact of efforts to enhance road safety. However, it underscores that the cost of mobility remains unacceptably high. The study described in [3] tracked the progress of reducing the number of car accidents since 2010 in several cities. It concluded that very few of the studied cities are improving road safety at a pace that will reduce road deaths by 50% by 2030, in line with the United Nations’ road safety targets.
Autonomous vehicles can avoid some errors made by drivers, and they can improve the flow of traffic by controlling their pace so that traffic stops oscillating. They are equipped with advanced technologies such as global positioning systems (GPS), video cameras, radars, light detection and ranging (LiDAR) sensors, and many other types of sensors. They can travel together, exchanging information about travel intentions, detected hazards and obstacles, etc., through vehicle-to-vehicle (V2V) or vehicle-to-everything (V2X) communication networks.
To increase the efficiency of road traffic, the idea of grouping autonomous vehicles into platoons through information exchange was proposed in [4]. Vehicles should consider all available lanes on a given road sector when forming a group and travel at high speeds with minimal safety distances between them. However, this is possible only if a vehicle can determine its precise position with respect to other traffic participants.
In recent years, image processing and computer vision techniques have been widely applied to solve various real-world problems related to traffic management, surveillance, and autonomous driving. In particular, the detection of traffic participants such as vehicles, pedestrians, and bicycles [5] plays a crucial role in many advanced driver assistance systems (ADAS) and smart transportation applications.
Image processing plays an important role in traffic optimization, as it is used to develop functionalities that reduce the number of accidents, increase driving comfort, and group vehicles into platoons. Several approaches have been proposed over time to detect traffic participants with video cameras using convolutional neural networks [6,7], but most of them require a significant amount of computing power and cannot be used in real time due to increased latency.
In this paper, a proof-of-concept algorithm to solve a part of the image-based vehicle platooning problem is proposed. It uses a decentralized approach, with each vehicle performing its own computing steps and determining its position with respect to the nearby vehicles. This approach relies on images acquired by the vehicle's cameras and on the communication between vehicles. To test our approach, we used low-cost commercial dashboard cameras equipped with a GPS sensor. No other sensors were used, mainly because they would have greatly increased the hardware cost. Each vehicle computes an image descriptor for every frame in the video stream, which it sends as a message along with other GPS information to other vehicles. Vehicles within communication range receive this message and attempt to find the frame in their own stream that most closely resembles the received one. The novelty of this approach lies in calculating the distance between the two vehicles by matching image descriptors computed for frames from both vehicles, determining the time difference at which the two frames were captured, and considering the traveling speeds of the vehicles.
The rest of the paper is organized as follows. Section 2 presents some vehicle grouping methods for traffic optimization, then reviews applications of image processing related to street scenes, and, lastly, presents several image descriptors. The method proposed in this paper is detailed in Section 3, while in Section 4, the implementation of the algorithm is presented. In Section 5, preliminary results are presented, demonstrating the feasibility of the proposed algorithm. Finally, Section 6 presents the main conclusions of this study and directions for future research.

2. Related Work

The aim of this study is to describe a system architecture for positioning nearby vehicles using image processing techniques. The related work section is divided into three subsections: traffic optimization, street scene image processing, and image descriptors, as detailed in the following subsections.

2.1. Traffic Optimization

The majority of driver errors can be avoided by autonomous vehicles. They can reduce traffic oscillations by maintaining a safe distance, while exchanging information in real-time. Single-lane platooning solutions, proposed in [8,9], prove that vehicle platooning improves traffic safety and increases the capacity of existing roads.
To further increase traffic flow, ref. [4] extends the idea of single-lane platoons to multi-lane platoons [10,11,12]. Vehicles should consider all available lanes on a given road sector when creating a group of vehicles and travel with small distances between them at high speeds. The platoon should be dynamic to allow new vehicles to join or leave the group, be able to overcome certain obstacles encountered on the road, and also allow faster-moving vehicles to overtake. Through the vehicle-to-vehicle communication network, they exchange information about travel intentions, dangers, and detected obstacles.
Vehicle movement control is divided into two parts: lateral control and longitudinal control. Lateral control is in charge of changing lanes, whereas longitudinal control is in charge of actions in the current lane. These two, when combined, must ensure that vehicle collisions are avoided. Maintaining formations and joining new members is made easier with lateral control via a lane change solution. At the same time, longitudinal control is used to keep a safe distance between vehicles. As such, the triangle formation strategy inspired by [13] is usually chosen because it offers several advantages, such as the stability of each member within the group and quick regrouping of the members in case of a dissolving scenario.
The way platoons are formed is based on the fields of swarm robotics and flocking, which are inspired by nature, more precisely, by the way fish, birds [14], insects, and mammals interact in a group [15,16]. An individual possesses poor abilities, but within a group, they contribute to the formation of complex group behaviors and, thus, provide flexibility and robustness, such as route planning and task allocation.

2.2. Street Scene Image Processing

Vehicle detection plays a very important role in modern society, significantly impacting transportation efficiency, safety, and urban planning. It optimizes traffic flow by refining signal timings and reducing congestion, as evidenced in [17]. Moreover, advancements in vehicle detection technology have facilitated features like automatic collision avoidance and pedestrian detection, contributing to a decrease in accidents [5].
Law enforcement also benefits from vehicle detection systems, aiding in tasks like license plate identification and stolen vehicle tracking [18]. Additionally, these systems support efficient parking management by monitoring parking spaces and guiding drivers to available spots [19].
Regarding the topic of advanced driver-assistance systems (ADAS), recent studies provide comprehensive reviews of vision-based on-road vehicle detection systems [20,21]. These systems, mounted on vehicles, face challenges in handling the vast amounts of data from traffic surveillance cameras and necessitate real-time analysis for effective traffic management.
Addressing challenges in traffic monitoring requires precise object detection and classification, accurate speed measurement, and interpretation of traffic patterns. Techniques proposed in studies offer efficient approaches for detecting cars in video frames, utilizing image processing methods [22].
While vision-based solutions have made significant advances in the automotive industry, they remain vulnerable to adverse weather conditions [23]. Weather elements like heavy rain, fog, or low lighting can potentially impact the accuracy and reliability of these systems, thus necessitating further research and development efforts.
Moreover, communication between vehicles is an emerging area of research, as highlighted in [24]. Understanding and optimizing vehicle-to-vehicle communication techniques are essential for enhancing road safety and traffic efficiency. Implementing robust communication protocols can facilitate cooperative driving strategies, leading to smoother traffic flow and reduced congestion levels.

2.3. Image Descriptors

Image descriptors are essential in computer vision and image processing. They extract robust and distinctive features from images for tasks like matching, recognition, and retrieval. Rather than processing the entire image, these techniques focus on specific key points. Each key point is associated with a descriptor that describes its properties. Examples of such descriptors are provided below.

2.3.1. Scale-Invariant Feature Transform (SIFT) Descriptor

The scale-invariant feature transform (SIFT) [25] is a widely used feature descriptor in computer vision and image processing. It extracts distinctive features using the scale-space extrema detection and the difference in the Gaussian (DoG) method. SIFT provides invariance to image scaling, rotation, and robustness against changes in viewpoint, illumination, and occlusion.
The algorithm consists of several steps [26]:
  • Scale-space extrema detection: this step identifies potential points of interest that are invariant to orientation using the difference of Gaussians (DoG) function.
  • Key point localization: the algorithm establishes the location and scale of key points to measure their stability.
    • Contrast threshold: following the selection of key points, the algorithm applies a contrast threshold to ensure stability. Treating the DoG response as a contrast measure, key points with a DoG value below 0.03 (after normalizing intensities to the range [0, 1]) are excluded from the list.
    • Eliminating edge response: the next step in localizing key points is the elimination of edge responses. This is achieved by employing the Hessian matrix derived from the DoG function,
      H = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{yx} & D_{yy} \end{pmatrix}. (1)
      Due to the behavior of the DoG function along edges, there is a large principal curvature across the edge but a much smaller one in the perpendicular direction, so key points lacking a stable maximum are excluded. This is achieved using the Hessian matrix, whose trace and determinant give the sum and product of its eigenvalues:
      \mathrm{Tr}(H) = D_{xx} + D_{yy} = \alpha + \beta, (2)
      \mathrm{Det}(H) = D_{xx} D_{yy} - D_{xy} D_{yx} = \alpha \beta. (3)
      In the rare case that the determinant is negative, the curvatures have different signs and the key point is discarded. Let r denote the ratio of the largest to the smallest eigenvalue magnitude; the expression (r + 1)^2 / r reaches its minimum when the two eigenvalues are equal and increases as r grows. Therefore, to ensure that the ratio of the principal curvatures stays below a threshold r, only the following check is necessary:
      \frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} = \frac{(\alpha + \beta)^2}{\alpha \beta} = \frac{(r\beta + \beta)^2}{r\beta^2} = \frac{(r + 1)^2}{r}, (4)
      \frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} < \frac{(r + 1)^2}{r}. (5)
      Therefore, with r = 10, key points whose ratio of principal curvatures exceeds 10 are discarded.
  • Orientation assignment: local image gradient directions are assigned to each key point position.
  • Key point descriptor: descriptors are obtained from the region surrounding each key point, incorporating local image gradients and scale information to represent significant shifts in light and local shape distortions.
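As an illustration of how such a pipeline can be exercised in practice, the following minimal sketch detects SIFT key points and computes their descriptors with OpenCV in Python; the image path is a placeholder, and the parameter values simply mirror the contrast threshold of 0.03 and the curvature ratio r = 10 discussed above.

import cv2

# Load one video frame in grayscale; the path is a placeholder.
image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Contrast and edge thresholds follow the values discussed above (0.03 and r = 10).
sift = cv2.SIFT_create(contrastThreshold=0.03, edgeThreshold=10)
keypoints, descriptors = sift.detectAndCompute(image, None)

# Each SIFT descriptor is a 128-dimensional vector of local gradient statistics.
print(len(keypoints), descriptors.shape)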

2.3.2. Sped-Up Robust Feature (SURF) Descriptor

The sped-up robust feature (SURF) descriptor [27] is a faster and more efficient alternative to SIFT for feature extraction in computer vision and image processing. It is based on Haar wavelet responses and the determinant of the Hessian matrix. SURF achieves comparable performance to SIFT in matching and recognition tasks while significantly improving processing efficiency. Unlike SIFT, which utilizes the difference of Gaussian (DoG) technique to approximate the Laplacian of Gaussian (LoG), SURF employs box filters. This approach offers computational advantages, as box filters can be efficiently computed, and calculations for different scales can be performed simultaneously.
To handle orientation, SURF calculates Haar-wavelet responses in both the x and y directions within a circular neighborhood of radius 6s around each key point, where s is the scale at which the key point was detected. The dominant orientation is determined by summing the responses within a sliding orientation window.
For feature extraction, a 20s × 20s neighborhood is extracted around each key point and divided into 4 × 4 cells. Haar wavelet responses are computed for each cell, and the responses from all cells are concatenated to form a 64-dimensional feature descriptor.
The SURF algorithm’s implementation involves the following key steps [28]:
  • Identifying salient features like blobs, edges, intersections, and corners in specific regions of the integral image. SURF utilizes the fast Hessian detector for feature point detection.
  • Utilizing descriptors to characterize the surrounding neighborhood of each feature point. These feature vectors must possess uniqueness while remaining robust to errors, geometric deformations, and noise.
  • Assigning orientation to key point descriptors by calculating Haar wavelet responses across image coordinates.
  • Ultimately, SURF matching is conducted using the nearest-neighbor approach.
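A minimal sketch of this pipeline in Python is shown below; SURF is distributed in the opencv-contrib xfeatures2d module and may require a build with the non-free algorithms enabled, and the Hessian threshold of 400 is an illustrative value rather than a parameter taken from this paper.

import cv2

image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Fast-Hessian detector; larger thresholds keep only the strongest blob-like features.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(image, None)

# The standard SURF descriptor has 64 dimensions (4 x 4 cells, 4 Haar statistics each).
print(len(keypoints), descriptors.shape)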

2.3.3. Oriented FAST and Rotated BRIEF (ORB) Descriptor

The oriented FAST and rotated BRIEF (ORB) descriptor [29] is an efficient algorithm for feature extraction in computer vision and image processing. It combines the FAST key point detector [30] with the binary robust independent elementary features (BRIEF) descriptor [31] and introduces rotation invariance. This makes ORB robust to image rotations and enhances its performance in matching and recognition tasks [32]. It achieves comparable performance to other popular descriptors like SIFT and SURF while being significantly faster in computation time.
The ORB method utilizes a simple measure for corner orientation, namely the intensity centroid [33]. First, the moments of a patch are defined as follows:
m_{pq} = \sum_{x,y} x^p y^q I(x, y). (6)
With these moments, the centroid, also known as the ‘center of mass’ of the patch, can be determined as follows:
C = \left( \frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}} \right). (7)
One can construct a vector \vec{OC} from the corner's center O to the centroid C. The orientation of the patch is then given as follows:
\theta = \mathrm{atan2}(m_{01}, m_{10}). (8)
After calculating the orientation of the patch, it can be rotated to a canonical position, enabling the computation of the descriptor and ensuring rotation invariance. BRIEF (9) lacks rotation invariance; hence, ORB employs rotation-aware BRIEF (rBRIEF) (11). ORB integrates this feature while maintaining the speed advantage of BRIEF:
f_n(p) = \sum_{1 \le i \le n} 2^{i-1} \tau(p; x_i, y_i), (9)
where \tau(p; x, y) is defined as follows:
\tau(p; x, y) = \begin{cases} 1 & : p(x) < p(y), \\ 0 & : p(x) \ge p(y), \end{cases} (10)
and p(x) is the intensity of the patch p at the point x.
The steered BRIEF operator is obtained as follows:
g_n(p, \theta) = f_n(p) \mid (x_i, y_i) \in S_\theta. (11)
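The following sketch, written under the assumption of an OpenCV-based pipeline in Python, detects oriented FAST corners and computes their rBRIEF descriptors; the budget of 10,000 features matches the number of key points used later in the experiments, and the image path is a placeholder.

import cv2

image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# FAST corners with intensity-centroid orientation and rBRIEF descriptors.
orb = cv2.ORB_create(nfeatures=10000)
keypoints, descriptors = orb.detectAndCompute(image, None)

# ORB descriptors are 256-bit binary strings packed into 32 bytes each.
print(len(keypoints), descriptors.shape, descriptors.dtype)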

2.3.4. Boosted Efficient Binary Local Image Descriptor (BEBLID)

The boosted efficient binary local image descriptor (BEBLID) [34] is a newer binary descriptor that encodes intensity differences between neighboring pixels. It provides an efficient representation for feature matching and recognition. BEBLID enhances the performance of the real-valued descriptor, boosted efficient local image descriptor (BELID) [35], improving both matching efficiency and accuracy.
The proposed algorithm assumes that there is a training dataset consisting of pairs of image blocks, denoted as \{(x_i, y_i, l_i)\}_{i=1}^{N}, where x_i, y_i \in \mathcal{X} are labeled with l_i \in \{-1, 1\}. Here, l_i = 1 indicates that the two blocks belong to the same image structure, while l_i = -1 indicates that they are different. The objective is to minimize the loss using AdaBoost:
L_{\mathrm{BELID}} = \sum_{i=1}^{N} \exp\left( -\gamma\, l_i \sum_{k=1}^{K} \alpha_k h_k(x_i) h_k(y_i) \right), (12)
where g_s(x_i, y_i) = \sum_{k=1}^{K} \alpha_k h_k(x_i) h_k(y_i), and \gamma represents the learning rate parameter. The function h_k(z) \equiv h_k(z; f, T) corresponds to the k-th weak learner (WL) in g_s, weighted by \alpha_k. The WL depends on the feature extraction function f(x) and a threshold T, defined as follows:
h_k(z; f, T) = \begin{cases} +1 & : f(x) \le T, \\ -1 & : f(x) > T. \end{cases} (13)
The BEBLID feature extraction function f(x) is defined as follows:
f(x; p_1, p_2, s) = \frac{1}{s^2} \left( \sum_{q \in R(p_1, s)} I(q) - \sum_{r \in R(p_2, s)} I(r) \right), (14)
where I(q) represents the gray value of pixel q, and R(p, s) is the square image region centered at p with side s. Thus, f(x) computes the difference between the average gray values of the pixels in R(p_1, s) and R(p_2, s), which is then thresholded. To output binary values in \{0, 1\}, -1 is mapped to 0 and +1 to 1, resulting in the BEBLID binary descriptor, whose corresponding loss (with all WLs equally weighted) is
L_{\mathrm{BEBLID}} = \sum_{i=1}^{N} \exp\left( -\gamma\, l_i \sum_{k=1}^{K} h_k(x_i) h_k(y_i) \right). (15)
BEBLID has demonstrated superior performance compared to other state-of-the-art descriptors such as SIFT [25], SURF [27], ORB [29], and convolutional neural networks (CNNs) [36] in terms of speed, accuracy, and robustness.
Compared to CNNs, a powerful deep learning approach for feature extraction and recognition, BEBLID is a lightweight and efficient alternative, ideal for low-resource applications. It also offers easier interpretation and debugging due to its binary string representation. Hence, this paper utilizes the advantages and performance of BEBLID as the chosen algorithm.
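Since the experiments in Section 5 pair the ORB detector with the BEBLID descriptor, a possible realization of that pairing in Python is sketched below; BEBLID is available in the opencv-contrib xfeatures2d module, the image path is a placeholder, and the 0.75 scale factor is the value suggested by the OpenCV documentation for ORB key points, i.e., an assumption rather than a parameter reported in this paper.

import cv2

image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

detector = cv2.ORB_create(nfeatures=10000)         # key point locations and orientations
descriptor = cv2.xfeatures2d.BEBLID_create(0.75)   # binary BEBLID descriptor

keypoints = detector.detect(image, None)
keypoints, descriptors = descriptor.compute(image, keypoints)

# Binary descriptors, matched later with the Hamming distance.
print(len(keypoints), descriptors.shape)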

3. Precise Localization Algorithm

In this paper, an algorithm is proposed to help vehicles position themselves with respect to other nearby vehicles. The approximate distance between two nearby vehicles is computed using GPS data. The exact position between two vehicles cannot be computed using only GPS data because all commercial GPS devices have an inherent positioning error [37]. This error might increase further according to various specific conditions, like the number of available satellites, nearby buildings, trees, driving through tunnels, etc. For example, when using a smartphone, the GPS error can be, on average, as much as 4.9 m [38]. Such errors can lead to potentially dangerous situations if any relative vehicle positioning system relies solely on GPS data. For this reason, the aim of this paper is to increase the positioning accuracy using images captured by cameras mounted on each vehicle. Thus, the proposed solution aims to find two similar frames from different vehicles within a certain distance range. Each vehicle sends information about multiple consecutive frames while also receiving similar information from other vehicles for local processing. When the image descriptors computed for these frames are matched, a high number of matches indicates that the two vehicles passed through roughly the same position. Using the timestamps associated with the two frames, we can determine the moment each vehicle was in that position, allowing us to calculate the distance between them by considering their traveling speed and the time difference between the two.
The proposed approach is decentralized, meaning that each vehicle acts as an independent entity. It has to handle the information exchange with the other vehicles as well as the processing of the self-acquired and received data. In our model, vehicles employ a V2X communication system to broadcast information, but they will also use a V2V communication model if the distance to a responding nearby vehicle is below a pre-defined threshold (Figure 1). Each vehicle will broadcast processed information without being aware of whether any other vehicle will receive it.
The proposed algorithm assumes that each vehicle is equipped with an onboard camera with GPS and a computing unit. The GPS indicates the current position of the vehicle in terms of latitude and longitude, as well as the corresponding timestamp and the vehicle's speed. The vehicle's computing unit handles all computations and communications, i.e., it processes the data and sends it to the other vehicles. The processing unit also receives data from other nearby vehicles. Based on the received information, it must determine whether V2V communication can start between the two vehicles. If it can, it will begin an information exchange with the other vehicle and will process the subsequently received data. This means that each vehicle has two roles: the first involves data processing and transmission, while the second involves receiving messages from other nearby vehicles and analyzing them. As the paper does not focus on the communication model itself but rather on the image processing part, we employed a very simple and straightforward communication model. This model cannot be used in real-world applications, where security, compression, and other factors must be taken into consideration. Our main focus when developing the communication model was on the processing steps needed from the image processing point of view. The sending and receiving roles are described in the following subsections.

3.1. Message Transmission Procedure

To avoid sending large amounts of irrelevant data between vehicles, a handshake system must be defined first. This will prevent congestion in any communication technique. The handshake system allows all vehicles to broadcast their GPS position. As a short-range communication system is assumed, only nearby vehicles will receive this message. Any nearby vehicle that receives this broadcast message will compute the distance between the sending vehicle and itself, and if the distance is lower than a threshold, it will send back a start-communication message. The distance threshold should be around 15 to 20 m, which takes into consideration both the GPS errors and the minimum safety distance between two vehicles. Also, note that, as most GPS systems record data once per second and the video records at a much higher rate, synchronization between the GPS coordinates and each frame must be performed. In other words, if the camera records 30 frames per second, 30 frames will have the same GPS coordinate attached. For example, a car driving at 60 km/h travels about 16.7 m during one second. Thus, along these approximately 17 m, the messages sent by the vehicle will carry the same GPS position, leading to potentially dangerous situations if data from images is not taken into consideration.
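A minimal sketch of this handshake check is given below, assuming each broadcast message carries the sender's latitude and longitude; the haversine formula is used to approximate the distance between the two GPS fixes, and the 20 m threshold follows the range discussed above.

import math

def gps_distance_m(lat1, lon1, lat2, lon2):
    # Haversine (great-circle) distance in metres between two GPS fixes.
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * r * math.asin(math.sqrt(a))

def should_start_v2v(own_fix, received_fix, threshold_m=20.0):
    # Reply with a start-communication message only if the sender is close enough.
    lat1, lon1 = own_fix
    lat2, lon2 = received_fix
    return gps_distance_m(lat1, lon1, lat2, lon2) < threshold_m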
After establishing that the two vehicles are close enough and that image data has to be sent between them, the next question to be answered is exactly which data should be exchanged. Sending the entire video stream is unfeasible due to high bandwidth requirements, so the most straightforward approach is to compute key points for each frame. Then, for each detected key point, a descriptor is computed. All these steps are presented in the flowchart illustrated in Figure 2. Once the descriptors have been computed, every piece of information related to the current frame is serialized and sent to other paired vehicles. This includes the timestamp, latitude, longitude, speed, key points, and descriptors.
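One way to assemble such a per-frame message is sketched below; the field names and the pickle-based serialization are illustrative assumptions, not the actual wire format used by the authors.

import pickle

def build_frame_message(timestamp, latitude, longitude, speed,
                        frame_number, keypoints, descriptors):
    # Pack one frame's metadata, key points, and descriptors for transmission.
    return pickle.dumps({
        "timestamp": timestamp,        # GPS time of the fix, in seconds
        "latitude": latitude,
        "longitude": longitude,
        "speed": speed,                # speed reported by the GPS unit
        "frame_number": frame_number,  # index of the frame within the current GPS second (1..30)
        # cv2.KeyPoint objects are not picklable, so only the fields needed for matching are kept.
        "keypoints": [(kp.pt, kp.size, kp.angle) for kp in keypoints],
        "descriptors": descriptors,    # NumPy array of descriptors
    })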

3.2. Message Reception Procedure

After passing the handshake stage described above, each vehicle will only receive messages from nearby vehicles. The message will contain GPS positioning data and processed image data. The steps these data go through are presented in Figure 3 and described in detail below.
Each vehicle will also have its own video stream that is processed locally. This means that, for each frame, the key points and their descriptors are computed. These will have to be matched against the key points and descriptors received from other vehicles. There are various algorithms developed for feature matching, but the most used ones are brute-force matcher (BFMatcher) [39] and fast library for approximate nearest neighbors (FLANN) [40]. Brute-force matcher matches one feature descriptor from the first set with all features in the second set using distance computation to find the closest match. FLANN is an optimized library for fast nearest neighbor search in large datasets and high dimensional features, which works faster than BFMatcher for large datasets and requires two dictionaries specifying the algorithm and its related parameters.
The matching algorithm will output the matched descriptors, which are the descriptors that correspond to two matched pixels in the input images. Usually, filtering is carried out to remove the outliers, i.e., points whose descriptors match but that do not correspond to the same location in the input images. Having the two sets of matched descriptors, the next step is to determine the relative position between them. In other words, at this point, a set of matched points is available, but the location of the points from the received image (their descriptors) within the current vehicle's frame is unknown. This will determine where the two vehicles are positioned with respect to each other: if the points are located in the center of the image, it means that the two vehicles are in the same lane, one in front of the other. If the points are located to the side of the image, it means that the two vehicles are on separate lanes, close to each other. For example, if corresponding matched points are located on the right side of the first image and on the left side of the second image, then the first vehicle is on the right side of the second vehicle.
One way to determine the points’ relative position is to compute a homography matrix that transforms a point in the first image into a point in the second image. Once the homography matrix is computed, it can be applied to the points from the first image to see where those points are in the second image. In this way, the two vehicles can be relatively positioned in relation to each other.
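A possible implementation of this matching and relative-positioning step is sketched below for binary descriptors (ORB or BEBLID); the Hamming-distance brute-force matcher, the ratio-test filtering, and the RANSAC reprojection threshold are illustrative choices rather than parameters reported in this paper.

import cv2
import numpy as np

def match_and_locate(local_kps, local_desc, remote_kps, remote_desc, ratio=0.8):
    # Match received (remote) descriptors against the local frame's descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(remote_desc, local_desc, k=2)
    # Ratio test to discard ambiguous matches (outlier filtering).
    good = [m[0] for m in knn if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    if len(good) < 4:
        return None, 0  # not enough matches to estimate a homography
    src = np.float32([remote_kps[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([local_kps[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # Homography mapping points of the received frame into the current vehicle's frame.
    homography, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    inliers = int(mask.sum()) if mask is not None else 0
    return homography, inliers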
Of course, it might happen that the received frame corresponds to an older or newer frame in the receiving vehicle's own video stream. In these cases, multiple comparisons between the current frame and the received frames must be performed. By combining all these pieces of information, the vehicle can detect the location of each nearby vehicle accurately and efficiently; the exact steps are described in detail in the following section regarding the algorithm implementation.

4. Framework for Algorithm Implementation

The following section details the implementation of the algorithm, which involves several steps: resizing the image dimensions, extracting the camera-displayed time, simulating the communication process, detecting the corresponding frame, defining the distance calculation formula, and outlining the hardware equipment utilized.

4.1. Adjust Image Dimensions

Considering that the camera captures a significant part of the car dashboard (see, for example, Figure 4), this can negatively influence the image-matching algorithm. It is important to note that this particular area is not relevant for the intended purposes. Furthermore, information from the camera is displayed in that area, such as the camera name, current date, and speed. These elements can also affect the performance of the used image descriptors. For this reason, the decision was made to crop out the respective area from the original image, thus eliminating irrelevant information and retaining only the essential data. The cropped-out area corresponds to the region beneath the red line in Figure 4, and this approach enables greater precision in image matching and enhances the algorithm's performance for the intended objectives.

4.2. Extract Camera Displayed Time

It is necessary to consider that most GPS systems record data once per second, while video recording is carried out at a much higher rate. Therefore, meticulous synchronization between GPS coordinates and each video frame is required. To put it simply, if the camera records 30 frames per second, it means that 30 frames will have the same GPS coordinate attached. To address this issue and ensure the accuracy of our data, optical character recognition (OCR) technology was utilized to extract time information from the images provided by the camera. We specified the exact area where the time is displayed in the image and assigned the corresponding GPS coordinates to that moment. This synchronization process has allowed us to ensure that GPS data are accurately correlated with the corresponding video frames, which is essential for the subsequent analysis and understanding of our information.
To extract text from the image, we used Tesseract version 5.3.0, an open-source optical character recognition engine [41]. Tesseract is renowned for its robustness and accuracy in converting images containing text into editable content.
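A minimal sketch of this extraction step is given below using the pytesseract wrapper around Tesseract; the crop coordinates of the on-screen time overlay are placeholders that depend on the camera model and resolution.

import cv2
import pytesseract

def read_overlay_time(frame, roi=(0, 0, 260, 40)):
    # Crop the region where the camera prints the time; coordinates are placeholders.
    x, y, w, h = roi
    crop = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    # Binarize to make the overlay text easier for Tesseract to read.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Treat the crop as a single text line and restrict the character set to digits and separators.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789:/-"
    return pytesseract.image_to_string(binary, config=config).strip()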

4.3. Simulation of Vehicle-to-Vehicle Communication

In our system, a vehicle extracts key points from each frame to obtain significant information about the surrounding environment. Additionally, information is retrieved from the GPS system through a dedicated function, providing the latitude, the longitude, the vehicle speed, and crucial details such as the number of lanes.
To ensure the exchange of information with other involved vehicles, a function was developed to simulate message transmission. As stated in Section 3, our paper focuses mainly on image processing and not on the communication itself. This is why we use a simulation model, in order to prove the feasibility of the proposed method. The function that simulates the communication is responsible for transmitting the processed data to other vehicles. This vehicle-to-vehicle communication is simulated through a file where the information is stored and later read.
To receive and process messages from other vehicles, a function for message reception is utilized. This simple function reads information from the specific file and extracts relevant data to make decisions and react appropriately within our vehicle communication system.
Overall, this architecture enables us to successfully collect, transmit, and interpret data to facilitate efficient communication and collaboration among vehicles in our project.

4.4. Detection of the Corresponding Frame

To detect the frame received from another vehicle within the current vehicle’s frame sequence, we used an approach relying on the number of matches between the two frames. Thus, the frame with the highest number of matches in the video recording was selected.
When the received information about a frame is searched for in a video, the number of matches increases as the search approaches the corresponding frame (as observed in Figure 5). Essentially, the closer the current vehicle gets to the position where that frame was captured, the more similar the images become. With some exceptions that will be discussed at the end of the paper, this approach proved valid during all our tests.
Implementing this algorithm involved monitoring the number of matches for each frame and retaining the frame with the most matches. An essential aspect was identifying two consecutive decreases in the number of matches, at which point the frame with the highest number of matches up to that point was considered the corresponding frame.
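The stopping rule described above can be written compactly as follows; count_matches stands for the descriptor-matching step of the previous section and is an assumed helper, not a library function.

def find_corresponding_frame(remote_frame, local_frames, count_matches):
    # Keep the frame with the most matches; stop after two consecutive decreases.
    best_index, best_matches = -1, -1
    previous, consecutive_drops = None, 0
    for index, local_frame in enumerate(local_frames):
        matches = count_matches(remote_frame, local_frame)
        if matches > best_matches:
            best_index, best_matches = index, matches
        if previous is not None and matches < previous:
            consecutive_drops += 1
            if consecutive_drops == 2:  # two drops in a row: the peak has been passed
                break
        else:
            consecutive_drops = 0
        previous = matches
    return best_index, best_matches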

4.5. Compute Distance

After detecting the corresponding frame in the video stream, the next step involves computing the distance between vehicles and determining their positions. This can be achieved using information from the two vehicles associated with these frames, such as the timestamp and speed.
Thus, if the timestamp from the current vehicle is greater than that of the other vehicle, it indicates that the latter is in front of the current one. The approach to comparing the two timestamps is presented in Figure 6. By knowing the speed of the front vehicle and the time difference between the two vehicles, the distance between them is computed.
Conversely, if the timestamp from the current vehicle is smaller than that of the other vehicle, it suggests that the latter is behind the current one. With the speed of the current vehicle and the time difference between the two vehicles, the distance between them can still be computed.
Given that the video operates at a frequency of 30 frames per second and GPS data is reported every second, each of the 30 frames contains the same set of information. However, this uniformity prevents the exact determination of distance because both frame 1 and frame 30 will have the same timestamp despite an almost 1-s difference between the two frames.
To enhance the accuracy of distance computation between the two vehicles, adjustments are made to the timestamp for the frames from both vehicles. In addition to other frame details, the frame number reported with the same timestamp (ranging from 1 to 30) is transmitted. In the distance computation function, the timestamp is adjusted by adding the current frame number divided by the total number of frames (30). For instance, if the frame number is 15, 0.5 s are added to the timestamp.
In Figure 7, the method of computing the distance, assuming that Vehicle 1 is in front and Vehicle 2 is behind, is detailed. Frame V1 from Vehicle 1, which is the x-th frame at timestamp T1, is detected by Vehicle 2 as matching frame V2, which is the y-th frame at timestamp T2. To determine its position relative to Vehicle 1, the other vehicle needs to compute the distance traveled by the first vehicle in the time interval from timestamp T1 to the current timestamp T2, taking into account its speed.
To compute the distance as accurately as possible, the speed reported at each timestamp is considered, and the calculation formula is presented in Equation (16). Since frame V1 is the x-th frame at timestamp T1, and considering that there are 30 frames per second, the time remaining until timestamp T1 + 1 s can be determined. Then, this time interval is multiplied by the speed S1 at timestamp T1 to determine the distance traveled in this interval. The distance traveled from timestamp T1 + 1 to T2 − 1 is determined by multiplying the speeds S1 at these timestamps by 1 s each. To determine the distance traveled from T2 to the y-th frame, the speed S1 at T2 is multiplied by y/30. By summing all these distances, the total distance is obtained.
\mathrm{distance} = \sum_{j=x-1}^{29} \frac{1}{30}\left( S_1(T_1) + j\,\frac{S_1(T_1+1) - S_1(T_1)}{30} \right) + \sum_{T_i=T_1+1}^{T_2-1} \sum_{j=0}^{29} \frac{1}{30}\left( S_1(T_i) + j\,\frac{S_1(T_i+1) - S_1(T_i)}{30} \right) + \sum_{j=0}^{y-1} \frac{1}{30}\left( S_1(T_2) + j\,\frac{S_1(T_2+1) - S_1(T_2)}{30} \right). (16)
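A direct transcription of Equation (16) is sketched below; speeds are assumed to be given in metres per second and indexed by whole-second timestamps, the per-frame speed is linearly interpolated between consecutive GPS reports, and the helper is an illustrative reading of the formula rather than the authors' implementation.

def distance_between_frames(speeds, t1, x, t2, y, fps=30):
    # speeds: dict mapping whole-second timestamps to the front vehicle's speed in m/s.
    # Frame x of second t1 (front vehicle) was matched with frame y of second t2 (rear vehicle).
    def segment(t, j):
        # Distance covered during the j-th frame of second t, with the speed
        # linearly interpolated towards the next GPS report.
        s_now = speeds[t]
        s_next = speeds.get(t + 1, s_now)
        return (s_now + j * (s_next - s_now) / fps) / fps

    total = sum(segment(t1, j) for j in range(x - 1, fps))   # remainder of second t1
    for t in range(t1 + 1, t2):                              # full seconds in between
        total += sum(segment(t, j) for j in range(fps))
    total += sum(segment(t2, j) for j in range(y))           # first y frames of second t2
    return total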

4.6. Hardware Used

For the developed solution, two DDPAI MOLA N3 cameras were utilized, each featuring a 2K resolution and operating at a frame rate of 30 frames per second. These cameras have built-in GPS functionality that accurately records the vehicle's location and speed. An advantage of these cameras lies in their GPS data storage format, which facilitates the seamless retrieval of this information. Cameras with identical resolutions were selected to ensure consistency, as not all image descriptors maintain scale invariance, which could otherwise affect algorithm performance.

5. Performed Experiments and Test Results

Based on the implementation presented in the previous section, a series of tests was conducted to demonstrate both the feasibility of the algorithm and its performance. The performances of the BEBLID, ORB, and SIFT descriptors were tested, as well as how the number of features influences frame detection. Finally, a comparison between the distance calculated by the proposed algorithm, the one calculated based on GPS data, and the measured distance is presented to illustrate the precision of the proposed algorithm in real-world applications. This comparison shows that the algorithm reflects a high degree of accuracy when validated against physically measured distances, which demonstrates its potential effectiveness for applications requiring precise distance calculations, e.g., vehicle platooning applications.

5.1. Test Architecture

To prove the feasibility and robustness of the proposed algorithm, we conducted various tests in real-world scenarios. For the first test, a vehicle equipped with a dashboard camera was used to make two passes on the same streets, resulting in two video recordings. The main purpose of this test was to determine which descriptors work best and whether the proposed system performs well when the errors caused by using multiple cameras are eliminated. For this purpose, for a frame extracted from the first video (left picture in Figure 8), we had to find the corresponding frame in the second video (right picture in Figure 8). Additionally, the performances of three different descriptors were compared, namely SIFT, ORB, and BEBLID, in terms of matching accuracy and speed. This comparison allows us to evaluate the strengths and limitations of each descriptor in the context of our experiment.
For the selected frame and each frame in the second video, the following steps were performed:
  • In total, 10,000 key points were detected using the ORB detector for each frame.
  • Based on these key points, the descriptors were computed, and the performances of SIFT, ORB, and BEBLID descriptors were compared.
  • For feature matching, the brute-force descriptor matcher was used for ORB and BEBLID, which are binary descriptors. This technique compares binary descriptors efficiently by calculating the Hamming distance. As for SIFT, a floating-point descriptor, the FLANN descriptor matcher, was employed. FLANN utilizes approximate nearest neighbor search techniques to efficiently match floating-point descriptors.
  • The frame with the highest number of common features with the selected frame from the first video is considered as its corresponding frame in the second video. This matching process is based on the similarity of visual features between frames, allowing us to find the frame in the second video that best corresponds to the reference frame from the first video. In Figure 8, an example of the identified corresponding frame is presented.
The SIFT descriptor is considered a gold-standard reference but requires a significant amount of computational power. It has a feature vector of 128 values. This descriptor managed to match the reference frame with frame 136 from the second video with a total of 3456 matches (Figure 9a). The ORB descriptor is considered one of the fastest algorithms and has a feature vector of 32 values. It also successfully detected frame 136 from the second video with 2178 matches, as shown in Figure 9b. According to the authors, the BEBLID descriptor achieves results similar to SIFT and surpasses ORB in terms of accuracy and speed, with a feature vector of 64 values. However, in our specific test case, the BEBLID descriptor managed to detect frame 136 but with a lower number of matches, specifically 2064 (as shown in Figure 9c) compared to ORB. This discrepancy could be attributed to the specific conditions of our experiment, such as variations in lighting, perspective, or the content of the frames.
As a result of this first experiment, all three considered descriptors successfully matched two frames from different videos recorded with the same camera, even though their content is slightly different. Such differences, for example, variations in traffic conditions, will also occur when video sequences from different vehicles are used.

5.2. Descriptor Performance Test

The underlying concept of this test involved the deployment of two vehicles equipped with dashboard cameras driving on the same street. As they progressed, the cameras recorded footage, resulting in two distinct videos.
Using these video recordings, the objective of the test was to identify 50 consecutive frames from the leading vehicle within the footage captured by the trailing vehicle. Each frame from the first video was compared with 50 frames from the second video, and the frame with the highest number of matches was taken into consideration. This aimed to ascertain the algorithm’s capability to consistently detect successive frames, thus showcasing its robustness. Furthermore, a secondary aim was to evaluate the performance of the three descriptors used in the process. In Figure 10a, one of the frames from the car in front (left) and the matched frame from the rear car are presented (right). In the frame on the right, the front car is also visible.
In Table 1, the results of the three descriptors for a total of 20,000 features are presented. BEBLID correctly detected 39 frames, with the incorrect detections shown in blue in the table. These incorrect detections exhibit only a minor deviation, selecting a frame either immediately preceding or following the actual frame, which poses no significant concern.
Note that the frames shown in blue in Table 1 might be caused by the fact that a perfect synchronization between the frames of the used videos cannot be accomplished. For example, the first car, traveling at 21.9 km/h (6.083 m/s), records a frame approximately every 0.2 m (6.083 m/s divided by 30 frames per second). This sampling rate might, in our opinion, cause some of the slightly incorrect detections presented in blue in Table 1. This is why the blue color is used: these detections might result from the sampling rate of the used cameras and not from an actual error in the matching algorithm.
Similarly, ORB shows good performance by correctly detecting 38 frames. However, the performance of SIFT falls short of expectations, with only 23 out of 50 frames being detected accurately. Additionally, for SIFT, there are cases when it detected an earlier frame after having already detected a subsequent one, indicated in red in the table. Moreover, the cases where the difference between the correct frame and the predicted one is greater than 1 frame are highlighted in orange in the table. Another downside of using SIFT is that it has more cases with three consecutive detections of the same frame than the other two descriptors (BEBLID: 0, ORB: 1 (frame 45), SIFT: 4 (frames 55, 63, 71, 73)). Also, when using SIFT, there are cases when two consecutive detected frames differ by three frames (frames 29, 58, 66, and 71), a situation that was not encountered when using BEBLID or ORB.
Furthermore, it is noteworthy to highlight that a higher number of matches, as observed in the case of SIFT, does not necessarily translate to better performance. Despite BEBLID having a lower number of matches compared to the other two descriptors, it achieved the highest performance in this test.
These findings underscore the importance of not only relying on the quantity of matches but also considering the accuracy and robustness of the detection algorithm. In this context, BEBLID stands out as a promising descriptor for its ability to deliver reliable performance even with a comparatively lower number of matches.
It is worth mentioning that the frames written in orange and red are most likely errors in the detection algorithm and can lead to potentially dangerous situations if their number increases.

5.3. Influence of the Number of Features

In this test, the objective was to analyze the influence of the number of features associated with each descriptor on its performance. As the computational time increases with a higher number of features, we examined the performance of the three descriptors across a range of feature numbers, from 20,000 down to 5000.
The test methodology involved detecting 20 frames from the first video against 20 frames from the second video. This approach facilitated an assessment of how varying feature counts affected the accuracy and efficiency of frame detection for each descriptor.
As observed in Table 2, the number of matches per frame decreased as the number of features decreased. For BEBLID, if the number of features decreased from 20,000 to 10,000, the performance did not decline considerably. In fact, for a feature count of 16,000, we achieved the highest number of correctly detected frames, with 18 out of 20. However, if the number of features dropped below 10,000, performance deteriorated significantly.
The results for ORB can be observed in Table 3. For a feature count of 12,000 and 10,000, we achieved 16 out of 20 correctly detected frames. However, if the feature count dropped below 10,000, the performance deteriorated.
From Table 4, it is clear that the number of correctly detected frames varies depending on the number of features for SIFT. However, overall, this descriptor exhibits poor performance in all cases.
Based on the outcomes of the last two tests, we can conclude that BEBLID generally achieves better results, the exception being the case of 12,000 features, where ORB detects 16 frames correctly compared to BEBLID's 15 (see Figure 11). ORB also shows satisfactory results, whereas the performance of SIFT is not as commendable. For these reasons, we will use only the BEBLID descriptor in further tests.

5.4. Distance Computation Test

The objective of this test was to evaluate the distance calculation algorithm and compare the distance calculated based on the proposed algorithm with the distance calculated based on GPS coordinates.
Thirty frames were used as a reference from the car in front, and attempts were made to detect them in the video stream from the car behind using the BEBLID descriptor. As can be observed in Table 5, the detected frames are mostly consecutive. Additionally, it is evident that the distance calculated based on GPS coordinates is significantly larger than the one calculated by the algorithm.
A second test was conducted by reversing the order of the two cars. What is observed in this test is that, in this case, the distance calculated based on GPS coordinates is smaller than the distance calculated by the proposed algorithm. These results are presented in Table 6.

5.5. Accuracy of the Computed Distance in a Real-World Scenario

The last test conducted aims to ascertain the accuracy of the distance between two vehicles computed by the proposed algorithm. For this, we used a simple but very effective real-world testing scenario, which allows us to measure the exact distance between two vehicles and compare it with the distance computed using the presented approach.
For this test, we used two vehicles, each equipped with a video camera. The street where we recorded the videos was a one-lane street. First, we recorded the videos used by our positioning algorithm. Next, for the same frames for which we had computed the distance, we positioned the cars in the exact same locations and measured the actual distance using a measuring tape. During this test, the cars were traveling at speeds between 17.8 and 20.3 km/h.
We did this twice, the only difference being that we switched the car order. The detected frames were in the same area in both cases. The results are presented in Table 7 and Table 8. In these tables, we included the frame from the first video (from the car in front), the detected frame in the video from the car behind, the distance computed relying solely on the GPS coordinates, the distance computed by the proposed algorithm, and the real measured distance.
The data presented in these two tables confirm the hypothesis that the distance computed using only the GPS coordinates presents a significant error compared to the real distance and should not be used in car platooning applications. The data also indicate that the distance computed using the proposed algorithm outperforms the GPS-based distance by a large margin, with small differences compared to the real distance. All the differences between the distances computed using the proposed algorithm and the real distances were under 1 m, compared to errors of around 10 m for the GPS-based distance.

5.6. Limitations

In the conducted tests, we observed that for certain areas captured in the images, such as in Figure 12, the proposed algorithm detected very few matches between frames from the two cameras. This compromised the optimal functioning of the corresponding frame detection algorithm. As depicted in Figure 13, for frame 11, which should ideally have the highest number of matches, they amounted to only around 150. Due to the low number of matches, the algorithm fails to accurately identify the correct frame.
One of the reasons for this issue could be the lower brightness in these areas, where descriptors may struggle to extract and match significant features between images, resulting in a reduced number of matches. Nevertheless, such cases can be labeled as failed detections and excluded from further use in vehicle platooning applications.

6. Conclusions

Increasing urbanization and vehicle density have led to escalating traffic congestion and a rise in road accidents. With millions of lives lost or injured annually, urgent measures are required to enhance road safety. This underscores the necessity for effective vehicle positioning algorithms to mitigate these challenges.
A robust vehicle positioning algorithm is crucial for effective traffic management and enhanced road safety. With the rising number of vehicles, there is an increased need to optimize traffic flow, minimize delays, and enable intelligent control systems. By accurately determining the position of vehicles, advanced functionalities can be developed to reduce the risk of accidents, improve commuting experiences, and facilitate efficient resource allocation. Implementing such an algorithm is vital for creating a safer and more efficient transportation system.
In this paper, an algorithm that accurately and robustly positions vehicles on a road with respect to the positions of other nearby vehicles was described. The algorithm follows a decentralized approach in which each vehicle acts as an independent computational node and tries to position itself based on the data received from the other nearby vehicles.
The decentralized approach proposed in this paper can use a short-range communication system with a very high bandwidth, but each vehicle requires high computational power to perform all the processing tasks in real time. A centralized approach (using cloud services, for example) can perform all the processing tasks in real time, but it depends heavily on the communication between the vehicles and the server, mainly because each vehicle would have to send its entire video stream to the server.
Based on the results obtained for the various performed tests, it was proven that the novel approach proposed in this paper is efficient and can be used to increase the accuracy of the computed distance between vehicles.
For the first future research direction, the goal is to detect whether vehicles are in the same lane or in different lanes based on the relative position of the two matched descriptors. Another research direction involves implementing a centralized approach, where each vehicle sends data to a server that utilizes cloud computing to process all the data in real-time. This way, each vehicle will have a clearer understanding of vehicles that are not within the considered distance threshold. Furthermore, we plan to expand the experiments and conduct them at higher speeds once we find a suitable road that allows for this, aiming to ensure minimal interference and achieve more accurate results.

Author Contributions

Conceptualization, I.-A.B. and P.-C.H.; methodology, I.-A.B. and P.-C.H.; software, I.-A.B.; validation, I.-A.B., P.-C.H. and C.-F.C.; formal analysis, I.-A.B. and P.-C.H.; investigation, I.-A.B. and P.-C.H.; resources, I.-A.B. and P.-C.H.; data curation, I.-A.B.; writing—original draft preparation, I.-A.B.; writing—review and editing, I.-A.B., P.-C.H. and C.-F.C.; visualization, I.-A.B.; supervision, P.-C.H. and C.-F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GPS    Global Positioning System
LiDAR    Light Detection and Ranging
V2V    Vehicle-to-Vehicle
V2X    Vehicle-to-Everything
ADAS    Advanced Driver Assistance Systems
SIFT    Scale-Invariant Feature Transform
DoG    Difference of Gaussian
LoG    Laplacian of Gaussian
SURF    Speeded-Up Robust Feature
ORB    Oriented FAST and Rotated BRIEF
BRIEF    Binary Robust Independent Elementary Features
rBRIEF    Rotation-aware BRIEF
BELID    Boosted Efficient Local Image Descriptor
BEBLID    Boosted Efficient Binary Local Image Descriptor
CNN    Convolutional Neural Network
BFMatcher    Brute-Force Matcher
FLANN    Fast Library for Approximate Nearest Neighbors
OCR    Optical Character Recognition

References

  1. World Health Organization. Save Lives: A Road Safety Technical Package; World Health Organization: Geneva, Switzerland, 2017; p. 60.
  2. World Health Organization. Global Status Report on Road Safety 2023; World Health Organization: Geneva, Switzerland, 2023.
  3. International Transport Forum. Monitoring Progress in Urban Road Safety; International Transport Forum: Paris, France, 2018.
  4. Caruntu, C.F.; Ferariu, L.; Pascal, C.; Cleju, N.; Comsa, C.R. Connected cooperative control for multiple-lane automated vehicle flocking on highway scenarios. In Proceedings of the 23rd International Conference on System Theory, Control and Computing, Sinaia, Romania, 9–11 October 2019; pp. 791–796.
  5. Sun, Y.; Song, J.; Li, Y.; Li, Y.; Li, S.; Duan, Z. IVP-YOLOv5: An intelligent vehicle-pedestrian detection method based on YOLOv5s. Connect. Sci. 2023, 35, 2168254.
  6. Ćorović, A.; Ilić, V.; Ðurić, S.; Marijan, M.; Pavković, B. The Real-Time Detection of Traffic Participants Using YOLO Algorithm. In Proceedings of the 2018 26th Telecommunications Forum (TELFOR), Belgrade, Serbia, 20–21 November 2018; pp. 1–4.
  7. Joshi, R.; Rao, D. AlexDarkNet: Hybrid CNN architecture for real-time Traffic monitoring with unprecedented reliability. Neural Comput. Appl. 2024, 36, 1–9.
  8. Jia, D.; Lu, K.; Wang, J.; Zhang, X.; Shen, X. A Survey on Platoon-Based Vehicular Cyber-Physical Systems. IEEE Commun. Surv. Tutor. 2016, 18, 263–284.
  9. Axelsson, J. Safety in Vehicle Platooning: A Systematic Literature Review. IEEE Trans. Intell. Transp. Syst. 2017, 18, 1033–1045.
  10. Yang, H.; Hong, J.; Wei, L.; Gong, X.; Xu, X. Collaborative Accurate Vehicle Positioning Based on Global Navigation Satellite System and Vehicle Network Communication. Electronics 2022, 11, 3247.
  11. Kolat, M.; Bécsi, T. Multi-Agent Reinforcement Learning for Highway Platooning. Electronics 2023, 12, 4963.
  12. Gao, C.; Wang, J.; Lu, X.; Chen, X. Urban Traffic Congestion State Recognition Supporting Algorithm Research on Vehicle Wireless Positioning in Vehicle–Road Cooperative Environment. Appl. Sci. 2022, 12, 770.
  13. Lee, G.; Chong, N. Recent Advances in Multi Robot Systems; Chapter: Flocking Controls for Swarms of Mobile Robots Inspired by Fish Schools; InTechOpen: London, UK, 2008; pp. 53–68.
  14. Reynolds, C.W. Flocks, Herds and Schools: A Distributed Behavioral Model. SIGGRAPH Comput. Graph. 1987, 21, 25–34.
  15. Tan, Y.; Yang, Z. Research Advance in Swarm Robotics. Def. Technol. 2013, 9, 18–39.
  16. Kennedy, J.; Eberhart, R.C.; Shi, Y. Swarm Intelligence. In The Morgan Kaufmann Series in Artificial Intelligence; Morgan Kaufmann: San Francisco, CA, USA, 2001.
  17. Mandal, V.; Mussah, A.R.; Jin, P.; Adu-Gyamfi, Y. Artificial Intelligence-Enabled Traffic Monitoring System. Sustainability 2020, 12, 9177.
  18. Sultan, F.; Khan, K.; Shah, Y.A.; Shahzad, M.; Khan, U.; Mahmood, Z. Towards Automatic License Plate Recognition in Challenging Conditions. Appl. Sci. 2023, 13, 3956.
  19. Rafique, S.; Gul, S.; Jan, K.; Khan, G.M. Optimized real-time parking management framework using deep learning. Expert Syst. Appl. 2023, 220, 119686.
  20. Tang, X.; Zhang, Z.; Qin, Y. On-Road Object Detection and Tracking Based on Radar and Vision Fusion: A Review. IEEE Intell. Transp. Syst. Mag. 2022, 14, 103–128.
  21. Umair Arif, M.; Farooq, M.U.; Raza, R.H.; Lodhi, Z.U.A.; Hashmi, M.A.R. A Comprehensive Review of Vehicle Detection Techniques Under Varying Moving Cast Shadow Conditions Using Computer Vision and Deep Learning. IEEE Access 2022, 10, 104863–104886.
  22. Kalyan, S.S.; Pratyusha, V.; Nishitha, N.; Ramesh, T.K. Vehicle Detection Using Image Processing. In Proceedings of the IEEE International Conference for Innovation in Technology, Bangluru, India, 6–8 November 2020; pp. 1–5.
  23. Zhang, Y.; Carballo, A.; Yang, H.; Takeda, K. Perception and sensing for autonomous vehicles under adverse weather conditions: A survey. ISPRS J. Photogramm. Remote Sens. 2023, 196, 146–177.
  24. Lu, S.; Shi, W. Vehicle Computing: Vision and challenges. J. Inf. Intell. 2023, 1, 23–35.
  25. Lowe, D. Object recognition from local scale-invariant features. In Proceedings of the 7th IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157.
  26. Vaithiyanathan, D.; Manigandan, M. Real-time-based Object Recognition using SIFT algorithm. In Proceedings of the 2023 Second International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT), Trichirappalli, India, 5–7 April 2023; pp. 1–5.
  27. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359.
  28. Sreeja, G.; Saraniya, O. Chapter 3: Image Fusion Through Deep Convolutional Neural Network. In Deep Learning and Parallel Computing Environment for Bioengineering Systems; Sangaiah, A.K., Ed.; Academic Press: Cambridge, MA, USA, 2019; pp. 37–52.
  29. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G.R. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
  30. Rosten, E.; Drummond, T. Machine Learning for High-Speed Corner Detection. In Computer Vision – ECCV; Springer: Berlin/Heidelberg, Germany, 2006; pp. 430–443.
  31. Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. BRIEF: Binary Robust Independent Elementary Features. In Computer Vision – ECCV; Springer: Berlin/Heidelberg, Germany, 2010; pp. 778–792.
  32. Wu, S.; Fan, Y.; Zheng, S.; Yang, H. Object tracking based on ORB and temporal-spacial constraint. In Proceedings of the IEEE 5th International Conference on Advanced Computational Intelligence, Nanjing, China, 18–20 October 2012; pp. 597–600.
  33. Rosin, P.L. Measuring Corner Properties. Comput. Vis. Image Underst. 1999, 73, 291–307.
  34. Suárez, I.; Sfeir, G.; Buenaposada, J.M.; Baumela, L. BEBLID: Boosted efficient binary local image descriptor. Pattern Recognit. Lett. 2020, 133, 366–372.
  35. Suarez, I.; Sfeir, G.; Buenaposada, J.; Baumela, L. BELID: Boosted Efficient Local Image Descriptor. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; pp. 449–460.
  36. Tian, Y.; Fan, B.; Wu, F. L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6128–6136.
  37. Zhang, H.C.; Zhou, H. GPS positioning error analysis and outlier elimination method in forestry. Trans. Chin. Soc. Agric. Mach. 2010, 41, 143–147.
  38. van Diggelen, F.; Enge, P.K. The World's first GPS MOOC and Worldwide Laboratory using Smartphones. In Proceedings of the 28th International Technical Meeting of the Satellite Division of the Institute of Navigation (ION GNSS+ 2015), Tampa, FL, USA, 14–18 September 2015.
  39. OpenCV Modules. Available online: https://docs.opencv.org/4.9.0/ (accessed on 1 May 2024).
  40. Muja, M.; Lowe, D. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. VISAPP 2009, 1, 331–340.
  41. Tesseract OCR. Available online: https://github.com/tesseract-ocr (accessed on 1 May 2024).
Figure 1. Vehicle communication. First, each vehicle broadcasts data in a V2X communication model (blue circles). Depending on the computed distance between two vehicles, they can start V2V communication (orange arrows).
Figure 2. Send message architecture overview.
Figure 3. Receive message architecture overview.
Figure 4. Cropping out non-relevant image areas: enhancing data relevance and algorithm efficiency.
Figure 5. Observing frame proximity in video analysis: closer vehicle positioning correlates with increased image similarity and match frequency.
Figure 6. Determining neighboring vehicle relative position using matched frame timestamps.
Figure 7. Distance estimation between two vehicles through frame matching and timestamp comparison.
Figure 8. Two corresponding frames from two video sequences.
Figure 9. Comparison of matching results for SIFT, ORB, and BEBLID.
Figure 10. Two matched frames from vehicles.
Figure 11. Comparison of correct detection rates for BEBLID, ORB, and SIFT descriptors across different feature counts.
Figure 12. Area of sparse matches detected by the algorithm.
Figure 13. Correct frame detection.
Table 1. Results of descriptor analysis with color indications: blue for incorrect detections, red for frames detected from behind, and orange for differences exceeding 1 frame.
Frame Video 1 | Frame Video 2 | BEBLID | ORB | SIFT
1 | 26 | 26–3992 matches | 26–4964 matches | 26–5935 matches
2 | 27 | 26–3848 matches | 26–4885 matches | 26–5970 matches
3 | 28 | 28–3929 matches | 28–4970 matches | 29–5984 matches
4 | 29 | 29–3921 matches | 29–5053 matches | 29–6052 matches
5 | 30 | 30–3868 matches | 29–4918 matches | 30–5958 matches
6 | 31 | 31–3890 matches | 31–4953 matches | 32–5878 matches
7 | 32 | 33–3923 matches | 32–5016 matches | 32–5957 matches
8 | 33 | 33–3991 matches | 33–4996 matches | 34–5927 matches
9 | 34 | 34–3900 matches | 33–4864 matches | 35–5884 matches
10 | 35 | 35–3974 matches | 35–4949 matches | 36–5957 matches
11 | 36 | 36–3884 matches | 36–4927 matches | 35–5893 matches
12 | 37 | 37–3894 matches | 37–5061 matches | 37–5925 matches
13 | 38 | 38–3895 matches | 38–5044 matches | 38–5875 matches
14 | 39 | 39–3837 matches | 38–4964 matches | 40–5838 matches
15 | 40 | 40–3906 matches | 40–4846 matches | 40–5817 matches
16 | 41 | 41–3848 matches | 41–4937 matches | 41–5886 matches
17 | 42 | 42–3966 matches | 42–4995 matches | 42–5880 matches
18 | 43 | 43–3764 matches | 43–4835 matches | 44–5980 matches
19 | 44 | 45–3744 matches | 45–4772 matches | 44–5787 matches
20 | 45 | 45–3780 matches | 45–4903 matches | 45–5862 matches
21 | 46 | 46–3809 matches | 45–4776 matches | 46–5796 matches
22 | 47 | 47–3891 matches | 47–5009 matches | 48–5998 matches
23 | 48 | 48–3939 matches | 48–4956 matches | 49–6043 matches
24 | 49 | 50–3645 matches | 48–4802 matches | 49–6017 matches
25 | 50 | 50–3678 matches | 50–4592 matches | 50–6011 matches
26 | 51 | 51–3695 matches | 51–4737 matches | 50–6019 matches
27 | 52 | 52–3672 matches | 52–4741 matches | 52–5986 matches
28 | 53 | 53–3544 matches | 52–4599 matches | 53–5900 matches
29 | 54 | 54–3755 matches | 54–4709 matches | 55–6031 matches
30 | 55 | 55–3793 matches | 55–4796 matches | 55–5976 matches
31 | 56 | 55–3582 matches | 55–4560 matches | 55–5971 matches
32 | 57 | 56–3486 matches | 56–4591 matches | 58–5926 matches
33 | 58 | 58–3600 matches | 57–4582 matches | 58–5985 matches
34 | 59 | 59–3639 matches | 59–4584 matches | 60–6079 matches
35 | 60 | 60–3628 matches | 60–4704 matches | 59–6052 matches
36 | 61 | 61–3584 matches | 61–4609 matches | 62–5965 matches
37 | 62 | 62–3650 matches | 62–4575 matches | 63–5851 matches
38 | 63 | 63–3653 matches | 63–4523 matches | 63–5946 matches
39 | 64 | 63–3584 matches | 63–4512 matches | 63–5909 matches
40 | 65 | 65–3435 matches | 65–4402 matches | 66–5875 matches
41 | 66 | 65–3449 matches | 66–4349 matches | 67–5745 matches
42 | 67 | 67–3565 matches | 67–4410 matches | 68–5967 matches
43 | 68 | 68–3393 matches | 68–4320 matches | 68–6003 matches
44 | 69 | 70–3343 matches | 69–4356 matches | 71–5825 matches
45 | 70 | 70–3477 matches | 70–4399 matches | 71–5940 matches
46 | 71 | 71–3434 matches | 71–4389 matches | 71–5929 matches
47 | 72 | 73–3288 matches | 72–4282 matches | 73–5971 matches
48 | 73 | 73–3135 matches | 73–4119 matches | 73–5960 matches
49 | 74 | 75–3193 matches | 74–4048 matches | 73–5884 matches
50 | 75 | 75–3199 matches | 75–4079 matches | 75–5955 matches
Number of Correct Detections |  | 39 | 38 | 23
Table 2. BEBLID – The influence of the number of features on the frame detection: blue for incorrect detections, red for frames detected from behind.
Frame Video 1 | Frame Video 2 | 20,000 | 18,000 | 16,000 | 14,000 | 12,000 | 10,000 | 8000 | 6000 | 5000
1 | 26 | 26–3992 | 26–3637 | 26–3263 | 26–2840 | 26–2442 | 26–2013 | 26–1672 | 26–1284 | 26–1065
2 | 27 | 26–3848 | 26–3483 | 26–3106 | 26–2755 | 26–2374 | 27–1974 | 26–1605 | 26–1252 | 26–1047
3 | 28 | 28–3929 | 28–3578 | 28–3218 | 28–2785 | 28–2392 | 28–2007 | 28–1667 | 28–1298 | 28–1095
4 | 29 | 29–3921 | 29–3575 | 29–3206 | 29–2815 | 29–2449 | 29–2052 | 30–1653 | 29–1274 | 30–1065
5 | 30 | 30–3868 | 30–3546 | 30–3210 | 30–2798 | 30–2442 | 30–2042 | 30–1671 | 29–1262 | 30–1074
6 | 31 | 31–3890 | 31–3539 | 31–3151 | 31–2753 | 32–2368 | 31–2016 | 30–1633 | 30–1252 | 30–1039
7 | 32 | 33–3923 | 32–3560 | 32–3183 | 32–2837 | 32–2434 | 33–2017 | 33–1643 | 33–1221 | 33–1037
8 | 33 | 33–3991 | 33–3616 | 33–3235 | 33–2857 | 33–2481 | 33–2111 | 33–1711 | 33–1289 | 33–1083
9 | 34 | 34–3900 | 34–3560 | 34–3183 | 33–2803 | 33–2442 | 35–2045 | 35–1643 | 34–1249 | 35–1056
10 | 35 | 35–3974 | 35–3609 | 35–3258 | 35–2887 | 35–2447 | 35–2049 | 35–1652 | 35–1244 | 35–1043
11 | 36 | 36–3884 | 35–3534 | 36–3159 | 36–2789 | 36–2415 | 36–2011 | 35–1629 | 36–1233 | 36–1048
12 | 37 | 37–3894 | 37–3508 | 37–3181 | 37–2793 | 37–2407 | 38–2025 | 38–1619 | 38–1248 | 36–1069
13 | 38 | 38–3895 | 38–3514 | 38–3153 | 38–2774 | 38–2407 | 38–2016 | 38–1640 | 38–1230 | 38–1023
14 | 39 | 39–3837 | 39–3512 | 39–3168 | 39–2813 | 39–2406 | 39–1986 | 39–1608 | 39–1227 | 39–1024
15 | 40 | 40–3906 | 40–3562 | 40–3167 | 40–2822 | 40–2441 | 40–2014 | 40–1650 | 40–1248 | 40–1047
16 | 41 | 41–3848 | 41–3483 | 41–3084 | 41–2717 | 41–2364 | 41–1922 | 42–1585 | 42–1201 | 42–1008
17 | 42 | 42–3966 | 42–3564 | 42–3203 | 42–2800 | 42–2410 | 42–2037 | 42–1619 | 42–1285 | 42–1088
18 | 43 | 43–3764 | 43–3431 | 43–3071 | 43–2679 | 44–2299 | 43–1937 | 43–1564 | 43–1208 | 42–993
19 | 44 | 45–3744 | 43–3364 | 43–3022 | 45–2692 | 43–2326 | 45–1967 | 45–1609 | 45–1218 | 43–1016
20 | 45 | 45–3780 | 45–3426 | 45–3069 | 45–2715 | 45–2336 | 45–1958 | 45–1569 | 45–1238 | 45–1017
Correct Detection |  | 17 | 17 | 18 | 17 | 15 | 16 | 11 | 13 | 11
Table 3. ORB – The influence of the number of features on the frame detection: blue for incorrect detections, and orange for differences exceeding 1 frame.
Frame Video 1 | Frame Video 2 | 20,000 | 18,000 | 16,000 | 14,000 | 12,000 | 10,000 | 8000 | 6000 | 5000
1 | 26 | 26–4964 | 26–4481 | 26–4016 | 26–3542 | 26–3052 | 26–2559 | 26–2170 | 26–1652 | 26–1379
2 | 27 | 26–4885 | 26–4425 | 28–3951 | 26–3487 | 28–3026 | 27–2538 | 26–2104 | 26–1618 | 28–1352
3 | 28 | 28–4970 | 28–4533 | 28–4095 | 28–3622 | 28–3118 | 28–2614 | 28–2161 | 28–1670 | 28–1412
4 | 29 | 29–5053 | 29–4614 | 29–4135 | 29–3664 | 29–3169 | 29–2668 | 29–2162 | 29–1660 | 29–1394
5 | 30 | 29–4918 | 30–4485 | 30–4069 | 30–3570 | 30–3104 | 30–2609 | 30–2125 | 29–1612 | 30–1330
6 | 31 | 31–4953 | 30–4480 | 30–4066 | 30–3566 | 30–3070 | 30–2592 | 30–2121 | 29–1600 | 30–1338
7 | 32 | 32–5016 | 32–4579 | 32–4079 | 32–3614 | 32–3159 | 32–2583 | 32–2119 | 32–1599 | 33–1348
8 | 33 | 33–4996 | 33–4555 | 33–4103 | 33–3632 | 33–3175 | 33–2688 | 33–2197 | 33–1660 | 33–1388
9 | 34 | 33–4864 | 33–4445 | 33–4008 | 33–3538 | 34–3105 | 34–2568 | 34–2107 | 33–1572 | 33–1320
10 | 35 | 35–4949 | 35–4534 | 35–4089 | 35–3628 | 35–3102 | 35–2621 | 35–2092 | 35–1593 | 35–1337
11 | 36 | 36–4927 | 35–4495 | 35–4033 | 36–3541 | 36–3069 | 36–2569 | 35–2100 | 36–1592 | 35–1346
12 | 37 | 37–5061 | 37–4566 | 37–4089 | 37–3615 | 37–3137 | 37–2615 | 36–2092 | 36–1602 | 36–1373
13 | 38 | 38–5044 | 38–4560 | 38–4055 | 37–3574 | 38–3142 | 38–2642 | 38–2143 | 38–1639 | 38–1366
14 | 39 | 38–4964 | 38–4482 | 39–4046 | 38–3545 | 38–3108 | 38–2619 | 38–2108 | 38–1606 | 38–1362
15 | 40 | 40–4846 | 40–4410 | 39–3942 | 39–3522 | 40–3010 | 39–2517 | 39–2095 | 39–1597 | 39–1312
16 | 41 | 41–4937 | 41–4477 | 41–3954 | 41–3510 | 41–3036 | 41–2479 | 40–2051 | 40–1570 | 40–1326
17 | 42 | 42–4995 | 42–4507 | 42–4057 | 42–3608 | 42–3055 | 42–2559 | 42–2113 | 42–1647 | 42–1384
18 | 43 | 43–4835 | 43–4394 | 43–3971 | 42–3510 | 43–3044 | 43–2584 | 42–2061 | 42–1635 | 42–1349
19 | 44 | 45–4772 | 43–4310 | 43–3879 | 45–3433 | 43–2987 | 43–2531 | 43–2062 | 43–1575 | 43–1349
20 | 45 | 45–4903 | 45–4439 | 45–4005 | 45–3521 | 45–3018 | 45–2536 | 45–2030 | 45–1566 | 45–1293
Correct Detection |  | 15 | 14 | 14 | 12 | 16 | 16 | 11 | 10 | 9
Table 4. SIFT – The influence of the number of features on the frame detection: blue for incorrect detections, red for frames detected from behind, and orange for differences exceeding 1 frame.
Frame Video 1 | Frame Video 2 | 20,000 | 18,000 | 16,000 | 14,000 | 12,000 | 10,000 | 8000 | 6000 | 5000
1 | 26 | 25–5994 | 25–5474 | 25–5002 | 26–4504 | 26–3967 | 25–3385 | 26–2844 | 26–2169 | 25–1869
2 | 27 | 27–6042 | 28–5556 | 27–5054 | 27–4574 | 27–4059 | 27–3430 | 27–2942 | 27–2206 | 28–1871
3 | 28 | 28–6036 | 30–5499 | 28–4996 | 28–4511 | 29–3944 | 30–3413 | 28–2871 | 28–2196 | 30–1859
4 | 29 | 29–6066 | 31–5441 | 29–5007 | 31–4544 | 29–4050 | 29–3485 | 31–2904 | 31–2204 | 29–1839
5 | 30 | 30–5954 | 31–5496 | 29–4990 | 31–4475 | 30–4023 | 30–3459 | 31–2852 | 31–2250 | 31–1899
6 | 31 | 32–5904 | 32–5441 | 31–4978 | 32–4506 | 31–3980 | 31–3396 | 30–2847 | 31–2229 | 31–1912
7 | 32 | 32–5926 | 32–5437 | 33–4944 | 32–4503 | 33–3977 | 33–3430 | 33–2822 | 33–2188 | 32–1874
8 | 33 | 34–5874 | 34–5419 | 34–4972 | 32–4437 | 32–3900 | 34–3399 | 32–2817 | 33–2200 | 34–1903
9 | 34 | 35–5873 | 35–5450 | 35–4891 | 35–4439 | 35–3944 | 35–3410 | 34–2782 | 34–2155 | 35–1846
10 | 35 | 36–5953 | 35–5469 | 36–4960 | 35–4558 | 35–4027 | 35–3404 | 36–2806 | 35–2163 | 36–1845
11 | 36 | 36–5938 | 35–5474 | 35–4979 | 35–4441 | 35–3926 | 35–3411 | 36–2760 | 36–2174 | 36–1839
12 | 37 | 37–5892 | 37–5407 | 37–4964 | 37–4463 | 37–3930 | 38–3369 | 37–2775 | 36–2205 | 39–1836
13 | 38 | 38–5930 | 38–5468 | 38–4953 | 38–4400 | 38–3901 | 40–3369 | 38–2784 | 39–2159 | 39–1881
14 | 39 | 40–5836 | 40–5352 | 40–4902 | 40–4343 | 40–3893 | 41–3373 | 40–2785 | 39–2203 | 39–1853
15 | 40 | 40–5831 | 40–5425 | 40–4978 | 40–4386 | 39–3900 | 39–3346 | 40–2793 | 39–2247 | 39–1917
16 | 41 | 42–5928 | 42–5482 | 43–4970 | 41–4484 | 41–3902 | 42–3356 | 42–2786 | 41–2203 | 42–1877
17 | 42 | 42–5894 | 42–5501 | 42–4928 | 42–4410 | 42–3890 | 42–3428 | 42–2788 | 42–2195 | 42–1867
18 | 43 | 44–5981 | 44–5503 | 42–4947 | 44–4449 | 44–3978 | 42–3385 | 42–2803 | 43–2173 | 43–1844
19 | 44 | 44–5757 | 45–5310 | 45–4858 | 45–4421 | 44–3846 | 45–3360 | 45–2715 | 45–2162 | 43–1814
20 | 45 | 46–5855 | 46–5408 | 45–4978 | 45–4451 | 45–3904 | 45–3387 | 45–2789 | 45–2148 | 43–1847
Correct Detection |  | 11 | 6 | 9 | 11 | 12 | 7 | 10 | 13 | 7
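For context on how the per-descriptor match counts in Tables 1–4 could be reproduced, the sketch below compares SIFT, ORB, and ORB keypoints described with BEBLID on a single frame pair using OpenCV (BEBLID requires the opencv-contrib package). The feature budget, the Lowe-ratio threshold, and the BEBLID scale factor are illustrative assumptions and not necessarily the exact settings behind the reported results.

```python
# Rough sketch: match counts for SIFT, ORB, and ORB+BEBLID on one frame pair.
# Requires opencv-contrib-python for cv2.xfeatures2d.BEBLID_create.
import cv2


def match_count(desc_a, desc_b, norm, ratio=0.8):
    """Number of Lowe-ratio-filtered matches between two descriptor sets."""
    knn = cv2.BFMatcher(norm).knnMatch(desc_a, desc_b, k=2)
    return sum(1 for p in knn if len(p) == 2 and p[0].distance < ratio * p[1].distance)


def compare_descriptors(img_a, img_b, n_features=20000):
    """Return match counts for SIFT, ORB, and ORB+BEBLID on one frame pair."""
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)

    # SIFT: float descriptors, matched with the L2 norm.
    sift = cv2.SIFT_create(nfeatures=n_features)
    _, da = sift.detectAndCompute(gray_a, None)
    _, db = sift.detectAndCompute(gray_b, None)
    sift_matches = match_count(da, db, cv2.NORM_L2)

    # ORB: binary descriptors, matched with the Hamming norm.
    orb = cv2.ORB_create(nfeatures=n_features)
    _, da = orb.detectAndCompute(gray_a, None)
    _, db = orb.detectAndCompute(gray_b, None)
    orb_matches = match_count(da, db, cv2.NORM_HAMMING)

    # BEBLID is a descriptor only: reuse ORB keypoints and recompute descriptors.
    beblid = cv2.xfeatures2d.BEBLID_create(0.75)  # 0.75 is the scale suggested for ORB keypoints
    ka, kb = orb.detect(gray_a, None), orb.detect(gray_b, None)
    _, da = beblid.compute(gray_a, ka)
    _, db = beblid.compute(gray_b, kb)
    beblid_matches = match_count(da, db, cv2.NORM_HAMMING)

    return {"SIFT": sift_matches, "ORB": orb_matches, "BEBLID": beblid_matches}
```

Running such a comparison over consecutive frame pairs, while sweeping the feature budget, would yield tables with the same shape as Tables 1–4, even if the absolute match counts depend on the matching criterion used.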
Table 5. First test – The computed distance between the vehicles (Car A in Front, Car B Behind) for 30 consecutive frames.
Frame Video 1 | Frame Video 2 | GPS Distance | Computed Distance
1 | 33–2267 matches | 34.153 m | 20.694 m
2 | 33–2215 matches | 34.153 m | 20.516 m
3 | 35–2175 matches | 34.153 m | 20.681 m
4 | 37–2162 matches | 34.153 m | 20.846 m
5 | 38–2134 matches | 34.153 m | 20.839 m
6 | 39–2108 matches | 34.153 m | 20.832 m
7 | 40–2207 matches | 34.153 m | 20.825 m
8 | 41–2120 matches | 33.463 m | 20.653 m
9 | 43–2151 matches | 33.463 m | 21.005 m
10 | 45–2139 matches | 33.463 m | 21.181 m
11 | 44–2167 matches | 33.463 m | 20.827 m
12 | 45–2128 matches | 33.463 m | 20.826 m
13 | 47–2186 matches | 33.463 m | 21.000 m
14 | 49–2128 matches | 33.463 m | 21.173 m
15 | 49–2123 matches | 33.463 m | 20.996 m
16 | 50–2089 matches | 33.463 m | 20.993 m
17 | 52–2100 matches | 33.463 m | 21.164 m
18 | 52–2122 matches | 33.463 m | 20.987 m
19 | 53–2128 matches | 33.463 m | 20.984 m
20 | 54–2026 matches | 33.463 m | 20.980 m
21 | 56–1985 matches | 33.463 m | 21.148 m
22 | 56–2008 matches | 33.463 m | 20.972 m
23 | 57–1980 matches | 33.463 m | 20.968 m
24 | 58–2079 matches | 33.463 m | 20.963 m
25 | 60–2065 matches | 33.463 m | 21.128 m
26 | 62–2020 matches | 33.463 m | 21.292 m
27 | 62–2079 matches | 33.463 m | 21.117 m
28 | 64–2105 matches | 33.463 m | 21.279 m
29 | 64–2013 matches | 33.463 m | 21.104 m
30 | 66–2133 matches | 33.463 m | 21.265 m
Table 6. Second Test – The computed distance between the vehicles (Reversed Order: Car B in Front, Car A Behind) for 30 consecutive frames.
Frame Video 1 | Frame Video 2 | GPS Distance | Computed Distance
1 | 25–2255 matches | 9.085 m | 24.025 m
2 | 26–2222 matches | 9.085 m | 24.006 m
3 | 27–2232 matches | 9.085 m | 23.987 m
4 | 29–2184 matches | 9.184 m | 23.513 m
5 | 30–2180 matches | 9.184 m | 23.333 m
6 | 31–2177 matches | 9.184 m | 23.336 m
7 | 32–2191 matches | 9.184 m | 23.338 m
8 | 32–2148 matches | 9.184 m | 23.161 m
9 | 33–2270 matches | 9.184 m | 23.164 m
10 | 34–2202 matches | 9.184 m | 23.166 m
11 | 35–2176 matches | 9.184 m | 23.167 m
12 | 36–2186 matches | 9.184 m | 23.168 m
13 | 38–2202 matches | 9.184 m | 23.341 m
14 | 38–2205 matches | 9.184 m | 23.168 m
15 | 39–2193 matches | 9.184 m | 23.168 m
16 | 41–2182 matches | 9.184 m | 23.336 m
17 | 41–2166 matches | 9.184 m | 23.165 m
18 | 43–2210 matches | 9.184 m | 23.329 m
19 | 44–2127 matches | 9.184 m | 23.325 m
20 | 44–2144 matches | 9.184 m | 23.156 m
21 | 45–2030 matches | 9.184 m | 23.152 m
22 | 47–2081 matches | 9.184 m | 23.310 m
23 | 48–1912 matches | 9.184 m | 23.304 m
24 | 49–1964 matches | 9.184 m | 23.297 m
25 | 49–2036 matches | 9.184 m | 23.132 m
26 | 50–1946 matches | 9.184 m | 23.126 m
27 | 51–2014 matches | 9.184 m | 23.119 m
28 | 51–2036 matches | 9.184 m | 22.955 m
29 | 54–2025 matches | 9.184 m | 23.256 m
30 | 54–2161 matches | 9.184 m | 23.094 m
Table 7. First Test – Comparison between the measured distance and the computed distance for 3 frames (Car A in Front, Car B Behind).
Frame Video 1 | Frame Video 2 | GPS Distance | Computed Distance | Measured Distance | Difference | Percent
1 | 33–2267 matches | 34.153 m | 20.694 m | 21.5 m | −0.806 m | 3.748%
16 | 50–2089 matches | 33.463 m | 20.993 m | 21.45 m | −0.457 m | 2.130%
25 | 60–2065 matches | 33.463 m | 21.128 m | 20.38 m | 0.748 m | 3.670%
Table 8. Second Test – Comparison between the measured distance and the computed distance for 3 frames (Reversed Order: Car B in Front, Car A Behind).
Frame Video 1 | Frame Video 2 | GPS Distance | Computed Distance | Measured Distance | Difference | Percent
1 | 25–2255 matches | 9.085 m | 24.025 m | 23.95 m | 0.075 m | 0.313%
12 | 36–2186 matches | 9.184 m | 23.168 m | 23.98 m | −0.812 m | 3.386%
16 | 41–2182 matches | 9.184 m | 23.336 m | 23.9 m | −0.564 m | 2.359%
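As a reading aid for Tables 7 and 8, the Difference and Percent columns are consistent with the computed distance being compared against the tape-measured distance, with the percentage expressed relative to the measured value; this interpretation is inferred from the reported numbers. A worked example for the first row of Table 7:

\[
\Delta = d_{\mathrm{computed}} - d_{\mathrm{measured}} = 20.694\ \mathrm{m} - 21.5\ \mathrm{m} = -0.806\ \mathrm{m}
\]
\[
\mathrm{Percent} = \frac{|\Delta|}{d_{\mathrm{measured}}} \cdot 100 = \frac{0.806}{21.5} \cdot 100 \approx 3.748\%
\]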
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
