1. Introduction
In recent years, unmanned aerial vehicles (UAVs), commonly known as drones, have undergone significant advancements due to technological breakthroughs that have enhanced their design and functionality. This evolution has expanded their practical applications across various sectors, significantly improving their operational efficiency and capabilities. In agriculture [1], UAVs are indispensable for precise crop mapping, real-time harvest analyses, and early pest detection, leading to better crop management strategies. In search and rescue operations [2], drones quickly cover large areas, locate distressed individuals, and provide crucial information in disaster-stricken areas. The military sector benefits from UAVs through advanced reconnaissance and immediate intelligence [3], revolutionizing modern warfare and strategy. At the same time, in environmental conservation, they respond swiftly to forest fires, conduct air quality studies, and track wildlife [4].
Commercially, UAVs have transformed business methodologies by offering a safer, more cost-effective alternative to manned aerial data collection [5] compatible with diverse payloads like high-resolution cameras and sensor arrays. This adaptability hints at future developments such as enhanced autonomous navigation and sophisticated delivery systems. Additionally, UAVs are becoming crucial to urban development, promising to improve traffic management, urban planning, and public safety in Smart Cities [6,7,8]. They are also set to revolutionize the transportation industry by introducing drone taxis and automated delivery services [9], indicating a significant shift in urban mobility and logistics. Regulatory bodies are changing policies to balance the benefits of UAVs against privacy and security concerns, ensuring their smooth and beneficial integration into society.
However, as with any emerging technology, UAVs face challenges in maintaining precision and dependability while performing autonomous tasks in real-world settings, particularly in regions with weak or nonexistent GPS signals. One methodology involves the application of SLAM, which leverages optical data to discern features, thereby enabling the real-time determination of a UAV’s trajectory and spatial orientation while simultaneously facilitating the generation of a navigational map [10]. A supplementary strategy integrates visual sensing with laser and/or inertial measurement units to enhance locational accuracy [11]. Additionally, deploying GNSS-augmented LiDAR systems offers a robust alternative via the augmentation of positional data with high-fidelity topographical information [12]. Another approach merges the functionalities of optical sensors with LiDAR technology, capitalizing on the complementary strengths of both systems to refine navigational precision [13]. However, these multi-sensor techniques can become costly and cumbersome in applications such as low-cost drones or in scenarios such as city surveillance in which weight restrictions are imposed. They also pose challenges because they are sometimes memory-intensive and time-consuming, requiring more powerful processing to run in real time.
In response to these challenges, efforts have been made to explore alternative navigation techniques, such as visual SLAM (VSLAM), that rely on a single sensor: a monocular camera. Most visual SLAM systems are designed to operate in static environments, leading to error accumulation when environmental changes occur. This reduces the accuracy and reliability of these systems. Recent advancements in machine learning [14] have spurred the development of new visual SLAM approaches that incorporate deep learning to address these dynamic issues. Notably, SLAM methods now increasingly incorporate object detection algorithms such as YOLO to identify dynamic objects within a scene.
Using YOLO for object detection helps create precise and real-time tracking systems by accurately identifying and categorizing dynamic elements in an environment. This is particularly effective in environments in which the contours of dynamic objects are usually clear and distinct from static backgrounds. This clarity aids in refining the detection outlines of dynamic objects, enhancing the overall performance of the SLAM system. Moreover, its robustness in varying lighting and weather conditions enhances its utility in outdoor applications. However, challenges persist due to the limited diversity of recognizable objects and the size of the training dataset. These limitations can result in incomplete coverage of dynamic objects, thereby reducing the number of dynamic objects detected in scenes. Consequently, these objects may not be accurately filtered out of the SLAM process, thereby affecting the correct estimation of the camera’s position and the mapping of its surroundings.
To address the aforementioned issues and enhance the accuracy and robustness of pose estimation in visual SLAM systems operating within dynamic indoor environments, the primary contributions of the proposed method are outlined as follows:
- (1)
The capabilities of YOLOv4 are leveraged to accurately identify and classify various objects in images or videos. Additionally, a specific Kalman filter that utilizes the centroids of objects for enhanced tracking accuracy is integrated.
- (2)
An algorithm has been developed to selectively filter features associated with dynamic objects.
- (3)
These object detection and tracking models are integrated into the ORB-SLAM process. This integration involves deleting feature information from dynamic objects to prevent them from adversely affecting the SLAM performance. This approach ensures that the system can more effectively navigate and map environments in which object movement occurs using only a monocular camera.
In the following sections, existing studies pertinent to this challenge are examined, this study’s approach is outlined, and the findings of this study are delved into. The technique is applied to a widely recognized dataset to evaluate the extent of improvements made by the proposed methods.
3. Dynamic Object Tracking and Elimination
This research introduces a robust VSLAM algorithm tailored for UAV localization in dynamic settings which is mindful of the existence of moving objects [33]. This methodology is implemented using the ORB-SLAM algorithm, a monocular visual SLAM technique. To ensure its efficacy in dynamic settings, ORB-SLAM [34] is augmented with a YOLO-Kalman framework which merges YOLOv4, an object detection algorithm [35], with the Kalman filter, thereby enhancing localization and mapping.
The structure of the proposed approach is depicted in Figure 1. The blue component represents ORB-SLAM, while the orange component corresponds to the proposed object detection and tracking module. This module is integrated with ORB-SLAM to address dynamic objects. This schematic representation aims to help visualize the workflow and understand key steps in the ORB-SLAM process and the overall proposed methodology.
ORB-SLAM is a prevalent algorithm for conducting SLAM in 3D settings through the use of photographic sensors. It incorporates three main components: TRACKING, LOCAL MAPPING, and LOOP CLOSING. These elements collaboratively facilitate the real-time determination of a camera’s position, the creation of a local map, and the identification of loops to maintain the global map’s consistency.
3.1. Camera Tracking
The camera tracking component is crucial in determining the camera’s position in real time as it navigates through an environment. This involves initializing the camera’s position, selecting key images, extracting and matching features between the key images and the current image, estimating the camera’s position using the matches, and relocating if tracking is lost. The aim is to continuously track the camera’s position and orientation as accurately as possible.
3.2. Local Mapping
The local mapping component focuses on constructing a local map of the environment. It performs a variety of tasks, such as triangulating 3D points using correspondences between keyframes and the current image, eliminating redundant keyframes to optimize computational efficiency, performing local bundle adjustment to fine-tune camera poses and map points, and updating the keyframe database to maintain a diverse set of keyframes for robust tracking and loop closure. The local map represents a portion of the environment around the camera’s trajectory.
3.3. Loop Closing
Loop closing is tasked with identifying and rectifying loop closures, situations in which the camera revisits a location it has encountered before. Its main aim is to ensure overall map consistency and reduce drift. Loop closure involves recognizing loop closure candidates by comparing the current image with key images in the database, checking loops for geometric consistency and appearance, performing a global bundle adjustment to optimize the overall map, and updating the covisibility graph representing relationships between key images to reflect newly detected loop closures. By closing loops, ORB-SLAM can correct accumulated errors and obtain a more accurate, globally consistent map.
3.4. Detecting and Tracking Objects
Detecting and tracking are based on two components: YOLOv4 and the Kalman filter.
YOLOv4 is trained using the COCO (Common Objects in Context) dataset. This extensive dataset is utilized for tasks such as object detection, segmentation, and captioning. It encompasses more than 330,000 images with labels spread across 80 categories of objects, establishing it as a frequently employed standard in the study of object detection. The YOLOv4 algorithm identifies objects within images and videos, delivering high precision and rapid processing. It provides the coordinates for each object’s bounding box, specifies its category, and evaluates the detection confidence level. The classification of objects is crucial for recognizing and sorting moving objects that have been detected.
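As an illustration, the detections returned by such a detector can be represented as bounding boxes with a class label and a confidence score, from which only the classes considered dynamic are retained. The class list and confidence threshold below are illustrative assumptions, not the exact configuration used in this work:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple          # (x_min, y_min, x_max, y_max) in pixels
    class_name: str     # COCO category label, e.g., "person"
    confidence: float   # detection confidence in [0, 1]

# Classes treated as potentially dynamic (illustrative choice).
DYNAMIC_CLASSES = {"person", "car", "bicycle", "dog", "cat"}

def dynamic_detections(detections, min_confidence=0.5):
    """Keep confident detections whose class is considered dynamic."""
    return [d for d in detections
            if d.class_name in DYNAMIC_CLASSES
            and d.confidence >= min_confidence]
```

Only the surviving detections are passed on to the tracking and feature-removal stages.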
The Kalman filter is an algorithm that iteratively predicts a system’s state using noisy measurements. In this context, it continuously tracks an object’s position, even when measurements from the YOLO detector are unavailable. By integrating YOLOv4 with the Kalman filter, a robust system capable of effectively detecting and tracking objects in dynamic environments is created, complementing the ORB-SLAM algorithm.
Integrating deep learning technology and the Kalman tracking module within the ORB-SLAM framework enables real-time object detection and the elimination of dynamic objects. This configuration allows ORB-SLAM to monitor the camera’s motion and create a map of its surroundings, even within changing environments.
To facilitate an understanding of VSLAM techniques, including ORB-SLAM and the proposed VSLAM, refer to Figure 2, Figure 3 and Figure 4. These figures present detailed flowcharts illustrating simplified versions of the processes.
VSLAM is a technique that processes input from a single camera to deliver outputs as a 3D map and an estimate of the camera’s position, as depicted in Figure 2. The estimated position of the camera is essential for navigation purposes, and the 3D map plays a key role in comprehending the camera’s surroundings and identifying obstacles.
ORB-SLAM, as a VSLAM technique, employs the ORB (Oriented FAST and Rotated BRIEF) feature detector to identify keypoints within a camera’s imagery, as illustrated in Figure 3. ORB keypoints are pinpointed at unique image locations, including corners, edges, and blobs. The result of ORB-SLAM is an environmental map alongside the estimated location and orientation of the camera. This map is constructed from tracked keypoints and their corresponding descriptors, which are refined through bundle adjustment. An estimation of the camera’s position and orientation is derived from this map in conjunction with the camera’s movements.
In this research, ORB-SLAM is integrated with the method of detecting and tracking (as illustrated in Figure 4) to address the challenge of localization drift induced by dynamic objects. The YOLOv4 algorithm is applied alongside Kalman tracking modules for each keyframe identified by ORB-SLAM. Whenever dynamic objects are detected, a matrix operation is executed to remove the keypoints situated within the bounding boxes of these objects. This approach significantly improves localization accuracy by excluding the drift caused by keypoints associated with dynamic objects while concurrently endeavoring to preserve the fidelity of the environmental map.
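The keypoint-removal step described above can be sketched as a vectorized matrix operation. The following NumPy fragment is a minimal illustration, assuming keypoints given as pixel coordinates and boxes given as (x_min, y_min, x_max, y_max); the actual implementation operates on ORB keypoints inside ORB-SLAM:

```python
import numpy as np

def remove_dynamic_keypoints(keypoints, boxes):
    """Drop keypoints falling inside any dynamic-object bounding box.

    keypoints : (N, 2) array of (x, y) pixel coordinates
    boxes     : (M, 4) array of (x_min, y_min, x_max, y_max) boxes
    Returns the keypoints lying outside all boxes.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    if len(boxes) == 0:
        return keypoints
    boxes = np.asarray(boxes, dtype=float)
    x, y = keypoints[:, 0:1], keypoints[:, 1:2]          # (N, 1) each
    inside = ((x >= boxes[:, 0]) & (x <= boxes[:, 2]) &  # broadcast to (N, M)
              (y >= boxes[:, 1]) & (y <= boxes[:, 3]))
    return keypoints[~inside.any(axis=1)]                # keep outside points
```

The broadcast comparison tests every keypoint against every box at once, which keeps the operation fast enough for per-keyframe use.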
Thanks to the capabilities of YOLOv4, as illustrated in Figure 5, a dynamic entity such as a person is accurately identified. Following the detection of dynamic objects and their bounding boxes, a movement model of the boxes is integrated with the Kalman filter to track all identified objects. This strategy is known as the multi-object tracking method, and further details can be found in [36]. Subsequently, this detection and tracking process is combined with ORB features to pinpoint features within the predicted boxes, as depicted in Figure 6. In the final step, algorithms are implemented to remove the keypoints associated with unwanted dynamic objects, a process showcased in Figure 7.
Integrating visual SLAM with the YOLO-Kalman module offers an effective strategy for rectifying inaccuracies introduced by dynamic objects within a real-world setting. In this methodology, the integration of ORB-SLAM and the YOLO-Kalman framework delivers precise and dependable assessments of the camera’s location and alignment alongside environmental mapping, even in the presence of dynamic entities. By merging the outcomes of YOLOv4 object detection with the visual SLAM framework, this approach can avoid any discrepancies resulting from moving objects.
In the study, the TUM public datasets were utilized for analysis [37]. Figure 5, Figure 6 and Figure 7 were derived from this dataset. Modifications to the original data included object detection using YOLOv4, outcomes of the ORB feature, and the results of dynamic features. These alterations were made to highlight image features relevant to the research objectives.
3.5. Typical Kalman Filter
A widely adopted method for estimating parameters involves deploying an observer that relies on a state space model. Such an estimator is capable of inferring unobservable states within a system, as detailed in the referenced paper on state estimation [38]. By leveraging the known input and output signals of a system, it is possible to estimate its internal states. The main goal is to use an estimator to either monitor states that cannot be directly measured or to minimize uncertainties associated with real-world sensor data. Nonetheless, the precision of these estimations hinges critically on the accuracy of the underlying model.
Initially, consider a tracking system in which the state vector $x_k$ represents the dynamic characteristics of the object, with $k$ signifying the temporal aspect of the discretized object box model. In this context, the aim is to deduce $x_k$ based on the observed measurements $z_k$.

Consider the following equation that represents the model of the internal state:

$$x_k = F x_{k-1} + w_{k-1}$$

In this context, $F$ represents the transition matrix, while $x_{k-1}$ denotes the state transitioning from time $k-1$ to $k$. $w_{k-1}$ is a Gaussian random variable characterized by a zero mean and a covariance $Q$. With a normal probability distribution, $w$ is as follows:

$$p(w) \sim \mathcal{N}(0, Q)$$

The measurement $z_k$ at time $k$ is defined as follows:

$$z_k = H x_k + v_k$$

Here, $H$ denotes the measurement matrix, and $v_k$ is a Gaussian random variable characterized by a zero mean and a covariance $R$. With a normal probability distribution, $v$ is as follows:

$$p(v) \sim \mathcal{N}(0, R)$$

The estimation process is divided into two phases: the time-update equations and the measurement-update equations. The notation $\hat{x}_{k|k-1}$ signifies the state at time $k$ based on the data available up to time $k-1$. The time-update equations are responsible for predicting the estimated states ($\hat{x}_{k|k-1}$) and the estimated error covariance ($P_{k|k-1}$) for the upcoming time step. The overall algorithm is described as follows:

$$\hat{x}_{k|k-1} = F \hat{x}_{k-1|k-1}$$
$$P_{k|k-1} = F P_{k-1|k-1} F^{T} + Q$$

The measurement-update equations serve to adjust the predicted estimated states and error covariance from the time-update phase by comparing the estimated states against actual measurements. These equations are outlined as follows:

$$K_k = P_{k|k-1} H^{T} \left( H P_{k|k-1} H^{T} + R \right)^{-1}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - H \hat{x}_{k|k-1} \right)$$
$$P_{k|k} = \left( I - K_k H \right) P_{k|k-1}$$

Here, $Q$ and $R$ are positive definite matrices representing the covariances of the process noise and measurement noise, respectively. It is important to note that the process noise and measurement noise in Kalman filters are assumed to be white Gaussian noise and are independent from each other. This independence is a crucial prerequisite for the estimator’s convergence.
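The two phases above can be sketched in a few lines. The following is a generic textbook implementation of the predict and update steps, not the exact code used in this work:

```python
import numpy as np

def kalman_predict(x, P, F, Q):
    """Time update: propagate state estimate and error covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Measurement update: correct the prediction with measurement z."""
    S = H @ P_pred @ H.T + R                      # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)             # corrected state
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred    # corrected covariance
    return x, P
```

Running predict without update (when a YOLO detection is missing for a frame) is exactly the mechanism that lets the tracker coast through detection gaps.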
3.6. Object Tracking Using Kalman Filter
For the purpose of identifying and monitoring moving objects recorded by a camera, it is essential to examine their features, such as their positions, geometries, and centroids [39]. The camera employed in this research captures images at a rate of 30 frames per second (30 fps), resulting in minimal changes between two consecutive frames for moving objects. This allows the movement of the target object to be considered continuous over adjacent frames.
To effectively describe a moving object, focus is placed on its centroid position and the tracking window size. By using these features, a representation that accurately describes the object’s motion can be created. Once moving objects have been identified through learning methods, certain preparatory steps are required for tracking these objects.
A key step involves allocating a tracking window to every moving object within a scene. To minimize the impact of excessive noise, the tracking window size is maintained at a modest scale set slightly larger than the object’s image. This approach aids in diminishing noise disturbances, improving image processing efficiency, and increasing the speed of operation.
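As a minimal sketch, the tracking window can be allocated by inflating the detected bounding box by a small margin, so that it is slightly larger than the object's image. The 10% margin below is an illustrative assumption, not the value used in this work:

```python
def tracking_window(box, margin=0.1):
    """Inflate a bounding box (x_min, y_min, x_max, y_max) by a small
    relative margin so the tracking window slightly exceeds the object."""
    x1, y1, x2, y2 = box
    dx = margin * (x2 - x1)   # horizontal padding
    dy = margin * (y2 - y1)   # vertical padding
    return (x1 - dx, y1 - dy, x2 + dx, y2 + dy)
```

Keeping the window modest limits the amount of background (and hence noise) processed for each tracked object.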
The Kalman filter applied in tracking is characterized by its states, the motion model, and the measurement equation. The system state vector $x_k$ is eight-dimensional and can be represented as follows:

$$x_k = \left[ c_x, c_y, w, h, \dot{c}_x, \dot{c}_y, \dot{w}, \dot{h} \right]^{T}$$

Here, $c_x$ and $c_y$ denote the horizontal and vertical coordinates of the centroid, while $w$ and $h$ indicate the half-width and half-height of the tracking window, respectively. $\dot{c}_x$, $\dot{c}_y$, $\dot{w}$, and $\dot{h}$ represent the respective velocities of these parameters.

The measurement vector of the system adopts the following form:

$$z_k = \left[ c_x, c_y, w, h \right]^{T}$$

In what follows, $F$ represents the transition matrix and $H$ denotes the measurement matrix of our tracking system, accompanied by the Gaussian process noise and the measurement noise introduced above. The magnitudes of these noise values rely entirely on the characteristics of the system under observation and are determined through empirical adjustments. With a constant-velocity motion model and a sampling interval $\Delta t$, the transition matrix takes the block form

$$F = \begin{bmatrix} I_4 & \Delta t \, I_4 \\ 0 & I_4 \end{bmatrix}$$

The observation matrix $H$ can be defined as follows:

$$H = \begin{bmatrix} I_4 & 0_{4 \times 4} \end{bmatrix}$$
Once the state and measurement equations of the motion model have been established, the Kalman filter can be applied in the subsequent frame to predict the position and dimensions of the object within a limited area, thereby obtaining the trajectories of moving objects.
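Under the constant-velocity assumption implied by the continuity of motion between adjacent frames, the transition and measurement matrices for the eight-dimensional box state can be sketched as below. The block structure is a standard choice and the default sampling interval of 1/30 s matches the 30 fps camera; the exact matrices used in this work may differ:

```python
import numpy as np

def make_tracking_matrices(dt=1 / 30):
    """Constant-velocity model for the 8-D box state
    [cx, cy, w, h, vcx, vcy, vw, vh]: each box parameter evolves as
    p_k = p_{k-1} + dt * v_{k-1}, while velocities stay constant."""
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)                     # positions integrate velocities
    H = np.hstack([np.eye(4), np.zeros((4, 4))])   # only the box is measured
    return F, H
```

These matrices plug directly into the predict/update equations of the previous subsection.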
4. Results and Discussion
In this section, the outcomes from both the standard ORB-SLAM and the enhanced ORB-SLAM integrated with YOLO methods are presented, and these results are compared against those obtained from the proposed approach that combines ORB-SLAM and YOLO-Kalman. For more detailed information regarding ORB-SLAM combined with YOLO, please refer to previous work [40]. First, the results of the ORB-SLAM method in comparison to the ORB-SLAM with YOLO-Kalman method are showcased, focusing on the 3D trajectory and the x-, y-, and z-axes. Second, a comparison of 2D trajectory results for both algorithms, ORB-SLAM and ORB-SLAM with YOLO-Kalman, is provided. Finally, a table summarizing the results for ORB-SLAM alone, ORB-SLAM with YOLO, and the proposed ORB-SLAM with YOLO-Kalman methods is included.
TUM (Technical University of Munich) datasets designed for RGB-D SLAM systems were utilized for algorithm assessment, as referenced in [37]. This database is extensively employed in SLAM research and evaluation due to its provision of high-quality data accompanied by ground-truth poses crucial for appraising VSLAM algorithms. Essentially, the TUM datasets encompass a variety of environments and scenarios, such as dynamic environments, object SLAM, and suboptimal lighting conditions. With its collection of indoor and outdoor scenes, both dynamic and static objects, and diverse lighting conditions, the TUM database serves as a valuable tool for examining the performance of SLAM algorithms across a range of situations. Finally, the TUM datasets are primarily utilized within academic and research circles, facilitating the comparison of various SLAM algorithms’ performances. They are also employed for assessing ORB-SLAM in Matlab, which aligns with our objective of evaluating our work. While numerous databases like KITTI [41] and EuRoC [42] are available for SLAM algorithm evaluation in research, the TUM datasets are especially favored for their capability to assess SLAM performance in dynamic environments. This preference is due to the datasets’ inclusion of sequences with substantial dynamic object interactions in addition to their accuracy and broad adoption in the research community.
Moreover, the choice to use the Matlab environment is driven by its high-level programming language and interactive framework, which facilitate the rapid prototyping, comparison, and visualization of complex algorithms and data. Matlab offers an interactive suite of tools and functionalities tailored for the robotics community, including several specialized toolboxes like the Robotics System Toolbox and Mapping Toolbox. These toolboxes are equipped with functions designed for managing robot sensors, kinematics, and mapping tasks. This comprehensive toolset renders Matlab an ideal platform for devising innovative approaches to addressing the complexities of VSLAM [43].
The evaluation primarily focused on the key feature of the ORB-SLAM and YOLO-Kalman system, namely the removal of dynamic objects to correct trajectory drift, utilizing real-time experiments on the public TUM dataset. It should be noted that map accuracy was not assessed in this study. For the datasets, the TUM freiburg2-desk-with-person sequence was selected, depicting a typical office setting with an individual sitting and moving throughout the recording.
This particular sequence is well-suited for assessing the effectiveness of our ORB-SLAM with YOLO-Kalman system in managing dynamic object removal and model correction. The video sequence lasts for 142.08 s, during which the camera covers a distance of 17.044 m, moving at an average velocity of 0.121 m per second.
The methodology for computing the improvement criterion $\eta$ in terms of the RMSE is depicted in Equation (14):

$$\eta = \frac{\mathrm{RMSE}(\hat{X}) - \mathrm{RMSE}(\hat{X}^{*})}{\mathrm{RMSE}(\hat{X})} \times 100\% \quad (14)$$

The method for calculating the RMSE is provided in Equation (15):

$$\mathrm{RMSE}(\hat{X}) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{x}_i - x_i \right)^2} \quad (15)$$

where $\hat{X}$ represents the set of points predicted by the ORB-SLAM algorithm, $\hat{X}^{*}$ denotes the set of points predicted by ORB-SLAM with the YOLO-Kalman enhancement, and $x_i$ is the corresponding ground-truth position.

The calculation method for the RMSE in the 3D position is given in Equation (16):

$$\mathrm{RMSE}_{3D} = \sqrt{\mathrm{RMSE}_x^2 + \mathrm{RMSE}_y^2 + \mathrm{RMSE}_z^2} \quad (16)$$

Finally, this paper defines the deviation error as the maximum amplitude of the absolute error.
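These evaluation metrics can be sketched as follows. The formulas are reconstructions of the standard definitions (relative RMSE reduction for the improvement criterion, maximum absolute error for the deviation error), so the exact expressions used in the paper may differ slightly:

```python
import numpy as np

def rmse(estimated, ground_truth):
    """Root-mean-square error between estimated and ground-truth
    trajectories, given as (N, 3) arrays of 3D positions."""
    err = np.asarray(estimated) - np.asarray(ground_truth)
    return np.sqrt(np.mean(np.sum(err ** 2, axis=1)))

def improvement(rmse_orb, rmse_proposed):
    """Improvement criterion: relative RMSE reduction, in percent."""
    return 100.0 * (rmse_orb - rmse_proposed) / rmse_orb

def deviation_error(estimated, ground_truth):
    """Maximum amplitude of the absolute position error."""
    err = np.asarray(estimated) - np.asarray(ground_truth)
    return np.max(np.linalg.norm(err, axis=1))
```

The same routines apply per axis by passing single-column slices of the trajectories.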
Initially, the proposed method was evaluated against the original ORB-SLAM method, demonstrating enhanced performance in accurately estimating the camera trajectory, even in highly dynamic settings. In the subsequent results, the term “without YOLO-Kalman” refers to the outcomes derived solely from the original ORB-SLAM.
Figure 8 depicts three distinct trajectories: the initial one represents the ground-truth trajectory, the second is generated by ORB-SLAM, and the third shows the outcome of implementing ORB-SLAM with the YOLO-Kalman algorithm. The recording appears to have occurred within a 4-m range on the x-axis, a 1-m range on the y-axis, and a 2-m range on the z-axis, all while rotating on a desktop table. Further analyses of the impact of dynamic object removal on ORB-SLAM’s performance will be presented for each axis in the subsequent results.
The data presented in Figure 9 illustrate the estimated positions and their corresponding errors along the x-axis. From an analysis of the errors, a significant enhancement is observed. The deviation error for the ORB-SLAM algorithm tops out at just over 30.18 cm, while the ORB-SLAM algorithm integrated with YOLO-Kalman demonstrates a lower maximum deviation error of 19.03 cm, showcasing an improvement compared to the original ORB-SLAM.
Figure 10 presents the outcomes of the experiment focused on estimating positions and errors along the y-axis. The results highlight a limitation in the ORB-SLAM with YOLO-Kalman method’s precision in estimating the y-axis position. Nonetheless, the overall conclusion demonstrates that despite this shortcoming, the integration of ORB-SLAM with YOLO-Kalman results in a method that surpasses the performance of the original ORB-SLAM in predicting 3D trajectories.
Figure 11 displays the z-axis trajectory outcomes from our proposed VSLAM algorithm in comparison with those from ORB-SLAM. The trajectory generated by our method appears to closely match the actual camera trajectory. Regarding deviation error, the ORB-SLAM algorithm’s maximum reaches just over 41.57 cm. In contrast, the ORB-SLAM integrated with YOLO-Kalman shows a significantly lower maximum of 20.20 cm, indicating an improvement over the original ORB-SLAM. This demonstrates that the error in trajectory prediction by the proposed approach is considerably smaller compared to ORB-SLAM, underscoring the superior quality of the trajectory prediction results achieved by our proposed method.
Additionally, the YOLO-Kalman method yields a keyframe count of 273, compared with 254 for the original ORB-SLAM, reflecting the augmented number of keyframes found when using the YOLO-Kalman algorithm.
Figure 12 displays 2D plots from the ORB-SLAM method and the proposed methodology with YOLO-Kalman for the dataset “freiburg2-desk-with-person”. It can be observed that the estimated trajectory obtained with the proposed method aligns more closely with the real trajectory than that of ORB-SLAM when moving along the y-axis in the right-hand section, where x is greater than 1.5 m. However, a limitation becomes evident in the left-hand section, where x is less than 0.5 m, especially when traversing along both the x- and y-axes.
After presenting the estimation results figures for the ORB-SLAM and ORB-SLAM with YOLO-Kalman methods, the outcomes of ORB-SLAM integrated with YOLO are included in the table below for a comprehensive comparison with the method employing YOLO-Kalman.
The integration of ORB-SLAM with YOLO, as explained in previous work, is a technique used to enhance VSLAM by eliminating features on dynamic objects, using only YOLOv4 for object detection. However, this technique suffers from discontinuities in object detection.
Table 1 presents the trajectory results of the proposed method integrating ORB-SLAM with the YOLO-Kalman algorithm compared against the results from the original ORB-SLAM and ORB-SLAM enhanced with YOLO methods. In this table, the improvement criterion is presented in relation to ORB-SLAM. ORB-SLAM with only object detection shows an improvement of 23.85%, compared to 34.99% when using tracking methods. This highlights the importance of addressing the weakness of ORB-SLAM with only YOLO, namely the discontinuities in object detection that arise from relying solely on detection. The improvement criterion achieved by ORB-SLAM with YOLO-Kalman validates the choice of the YOLO-Kalman corrector.
The proposed algorithm outperforms others primarily because it operates within the SLAM framework, where the camera-tracking step relies on features extracted by the ORB algorithm. This algorithm does not differentiate between dynamic and static objects, assuming the entire scene is static. This leads to the execution of the “local map tracking” algorithm based on potentially inaccurate feature measurements from a mistakenly assumed static scene despite the presence of dynamic objects, which makes the environment dynamic. An incorrect tracking estimation influenced by dynamic objects can result. To address this, the algorithm is designed to detect and track objects in real time, effectively eliminating features associated with dynamic objects and reducing the uncertainty these objects introduce in the camera-tracking phase.
Furthermore, considering all the results discussed in this article, which demonstrate the enhancement of the ORB-SLAM algorithm for drone localization in a dynamic environment, it is noteworthy that there is a loss of precision along the y-axis. This underscores the need for further investigation to refine the results and explore the feasibility of integrating the algorithm for dynamic object elimination with other SLAM algorithms. Future work could also benefit from considering advanced detection techniques, such as object elimination and inpainting methods, to further enhance accuracy.