
MOLO-SLAM: A Semantic SLAM for Accurate Removal of Dynamic Objects in Agricultural Environments

1 College of Engineering, South China Agricultural University, Guangzhou 510642, China
2 Guangdong Topsee Technology Co., Ltd., Guangzhou 510663, China
3 National Key Laboratory of Agricultural Equipment Technology, Guangzhou 510642, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(6), 819; https://doi.org/10.3390/agriculture14060819
Submission received: 16 April 2024 / Revised: 21 May 2024 / Accepted: 22 May 2024 / Published: 24 May 2024
(This article belongs to the Special Issue Advanced Image Processing in Agricultural Applications)

Abstract:
Visual simultaneous localization and mapping (VSLAM) is a foundational technology that enables robots to achieve fully autonomous locomotion, exploration, inspection, and more within complex environments. Its applicability also extends significantly to agricultural settings. While numerous impressive VSLAM systems have emerged, the majority of them rely on the static world assumption, which constrains their use in real dynamic scenarios and leads to increased instability when applied to agricultural contexts. To address the problem of detecting and eliminating slow dynamic objects in outdoor forest and tea garden agricultural scenarios, this paper presents a dynamic VSLAM system called MOLO-SLAM (mask ORB label optimization SLAM). MOLO-SLAM merges the ORBSLAM2 framework with the Mask-RCNN instance segmentation network, utilizing masks and bounding boxes to enhance the accuracy and cleanliness of 3D point clouds. Additionally, we used the BundleFusion reconstruction algorithm for 3D mesh model reconstruction. Comparing our algorithm with various dynamic VSLAM algorithms on our agricultural dataset and the TUM and KITTI datasets demonstrates significant improvements, with enhancements of up to 97.72%, 98.51%, and 28.07% relative to the original ORBSLAM2 on the three datasets, respectively. This showcases the outstanding advantages of our algorithm.

1. Introduction

Simultaneous localization and mapping (SLAM) [1] is playing an increasingly vital role in today’s intelligent era, and its rapid development in fields like autonomous driving, robotics, and virtual reality [2,3,4] has turned it into a major research focus worldwide. Notably, SLAM’s relevance in agriculture has also grown, making it a popular subject of study [5]. SLAM can be divided into two categories based on the sensors used: laser SLAM and vision SLAM [6]. Laser SLAM, having been extensively researched earlier, has traditionally been regarded as the primary solution for mobile robots, owing to its stability and accuracy. However, with the progress of industrialization, the high cost associated with laser SLAM has shifted focus towards vision SLAM. Vision SLAM presents several advantages, such as cost-effectiveness, ease of installation, and the abundance of environmental information it provides. Furthermore, VSLAM (visual SLAM) exhibits superior environmental perception capabilities and boasts applicability across a broader range of scenarios. Capitalizing on these benefits, a multitude of VSLAM systems with remarkable performance have emerged in recent years. Notable examples include PTAM [7], SVO [8], LSD-SLAM [9], and ORB-SLAM2 [10], among others.

1.1. Visual SLAM

Visual SLAM primarily relies on visual methods to construct 3D maps and estimate the localization of an agent within an environment. The direct method estimates motion by minimizing the photometric error of image pixels. For instance, Newcombe et al. [11] proposed the DTAM system, which performs dense pixel-wise matching across every video frame, although it does not constitute a complete visual SLAM system. MonoSLAM [12] was one of the pioneering real-time monocular VSLAM systems, utilizing an extended Kalman filter (EKF) [13] for efficient camera pose optimization. Jakob Engel et al. introduced LSD-SLAM [9], a semi-dense VSLAM system that selects salient feature points with large pixel gradients to track camera motion. Another approach is RGBD-SLAM [14], which creates real-time dense 3D point cloud maps based on a comprehensive feature point-based framework. However, the feature point proposal step in this approach can be computationally expensive and inefficient. The ORB-SLAM series, proposed by Mur-Artal et al., represents one of the most prominent and widely adopted VSLAM systems. ORB-SLAM2 [10] utilizes a quadtree-based spatial selection strategy for ORB feature points and keyframes, resulting in high accuracy and real-time performance. Building upon ORB-SLAM2, ORB-SLAM3 [15] incorporates a module for fusing inertial information and introduces a multi-map system (ATLAS [16]), significantly improving robustness and accuracy. However, the traditional VSLAM systems mentioned above primarily operate under the assumption of a static environment. Consequently, they can face substantial localization and mapping errors in highly dynamic environments, where the presence of moving objects violates the static world hypothesis. Therefore, it has become essential to investigate new VSLAM algorithms that are specifically tailored to handle such dynamic environments.
Most current visual SLAM systems are designed based on the assumption of a static environment, where the camera frames are presumed to capture static scenes. However, these systems encounter challenges when dealing with dynamic objects, as the matching of feature points on such moving entities can lead to data association errors, reducing localization accuracy. In agricultural environments, such association errors can significantly reduce the accuracy of localization and map building. To mitigate the impact of dynamic or erroneous points, existing VSLAM algorithms often incorporate various measures. For instance, random sample consensus (RANSAC) [17] is used to filter out anomalous points, and local or global bundle adjustment (BA) [18] is employed to correct position estimates and minimize the influence of matching errors. Other systems mitigate the impact by fusing additional sensors. For example, Zhao et al. [19] proposed a robust depth-assisted RGBD inertial odometry for indoor positioning that integrates an IMU. However, when the dynamic object area dominates the current frame, significant errors can occur, leading to VSLAM system failure. Therefore, enhancing the robustness of VSLAM algorithms in highly dynamic scenarios remains a significant challenge.

1.2. Dynamic SLAM

As deep learning continues to advance, the integration of SLAM with deep learning networks has become more profound, enabling us to address a wider range of problems. This synergy allows for more effective solutions and improved performance in various scenarios. One approach that exemplifies this integration is semantic prior-based dynamic SLAM, where a deep learning network is employed to segment the environmental image data obtained from the camera sensor. This provides a priori mask and label information for different objects. By utilizing this a priori information, the SLAM system can identify and remove dynamic label feature points, thereby enhancing system stability. Li et al. [20] proposed a real-time visual SLAM system based on deep learning, utilizing a multi-task feature extraction network and self-supervised feature points. They employed deep learning methods for extracting feature points to enhance accuracy. However, in a dynamic environment without the assistance of prior information, accuracy may decrease, affecting the system’s performance. Dyna-SLAM, introduced by Bescos et al. [21], combines Mask-RCNN [22] instance segmentation to obtain a priori dynamic information. It then identifies the masks of non-prior dynamic objects through multi-view geometry and region-growing algorithms while providing a background restoration method. Building upon this research, Dyna-SLAM II [23] computes the motion trajectories of multiple objects in a 3D map based on the camera’s position estimation. Similarly, DS-SLAM [24], utilizing SegNet [25], verifies object motion consistency through optical flow and epipolar constraints. It removes detected feature points within a predefined dynamic mask, although only the “person” label is treated as dynamic. VDO-SLAM [26] computes and optimizes camera poses and dynamic points while determining whether a moving object is dynamic or not by estimating the poses of dynamic and static points. However, this algorithm requires significant computation. MaskFusion [27] employs geometric segmentation to generate more accurate object boundaries, thereby overcoming the imperfection of mask boundaries. SG-SLAM [28] is a dynamic RGB-D SLAM system based on Yolact and geometric constraints. It significantly enhances positioning accuracy; however, it relies on depth maps, and accuracy issues with Yolact make it unsuitable for outdoor environments.
To address the time-consuming aspect of deep learning networks, researchers have explored efficient segmentation strategies for feature frame segmentation. For example, Detect-SLAM [29] utilizes a movement strategy that propagates four types of keypoints to overcome latency in semantic information. Dynamic-SLAM [30] improves the recall of the existing SSD [31] network by employing a leakage-detection compensation algorithm. Additionally, RDS-SLAM [32] introduces a non-blocking model that enables real-time tracking through probabilistic updating of moving objects and semantic propagation. On the other hand, DGS-SLAM [33] introduces a lightweight SLAM system based on depth-graph clustering and adaptive a priori semantic segmentation, ensuring high robustness and speed. While these methods enhance real-time performance, their failure to segment each frame may result in decreased robustness.

1.3. Visual SLAM in Agriculture

Due to the advancement in SLAM technology, its application in agriculture has become increasingly widespread. However, the complexity of the agricultural environment requires targeted algorithmic improvements. AGRI-SLAM [34] utilizes an image enhancement technique for recovering ORB point and LSD line features, which allows the algorithm to function in low-light environments. Kaiyu Song et al. proposed an optimized VINS-mono algorithm [35] that includes a key frame selection algorithm based on vertical motion smoothness validation, enhancing localization accuracy in agricultural environments. The Graph-SLAM algorithm [36] improves loop closure detection (LCD) through semantic segmentation based on grapevines and is suitable for large-scale mapping of vineyards. However, none of the aforementioned vision SLAM algorithms take the dynamic environment into consideration, even though dynamics in the agricultural environment significantly affect localization and mapping accuracy. Therefore, it is necessary to eliminate dynamic objects while the agricultural robot is operating.
The application of visual SLAM technology in agricultural robot navigation and mapping is becoming increasingly widespread. However, agricultural environments present unique challenges, including unstable lighting conditions and the presence of abundant repetitive texture regions, which make it more difficult for visual SLAM algorithms to identify feature points. Compared to urban and industrial settings, visual SLAM systems deployed in agricultural environments often exhibit poor stability of feature points. This is due to the highly repetitive nature of the environment, which renders the feature detection and matching processes more susceptible to the influence of dynamic objects. As a result, the overall system performance can be adversely affected. Additionally, during the inspection process, agricultural robots aim to acquire three-dimensional data of crops or the environment. However, the presence of dynamic objects in such scenarios can undermine the accuracy of localization and mapping, preventing the accurate reconstruction of the three-dimensional information of crops. These challenges impose rigorous requirements on visual SLAM systems operating in agricultural environments.

1.4. Differences from Other Works

The application of Mask RCNN in dynamic vision SLAM is relatively common, but a fixed a priori assignment of dynamic and static labels remains problematic: the label judgment cannot adapt to changes in an object’s motion state between image pairs. Additionally, the mask may be incomplete or undetected, and the point cloud may be disorganized. In contrast, our algorithm can dynamically judge object labels in real time and reject anomalous mask frames. This enables us to achieve higher localization accuracy and obtain a cleaner 3D point cloud. Our objective is to enable the agricultural inspection robot to achieve localization and dense reconstruction during inspections, aiming to extract 3D point cloud data and semantic information about the agricultural environment and crops while excluding the influence of dynamic objects in the scene. Since our system operates outdoors, the SLAM system needs to be compatible with both RGB-D and stereo cameras. Dyna-SLAM employs a priori masks and multi-view geometry to identify dynamic masks indoors. However, in outdoor stereo mode, it eliminates the label mask for “car”, which adversely affects accuracy. DS-SLAM is not suitable for outdoor use, as it only finds the dynamic mask for “person”. DGS-SLAM utilizes the depth map clustering method, which requires reconstructing the depth map in outdoor stereo mode; unfortunately, its accuracy cannot be guaranteed, rendering it unsuitable for outdoor use as well.
To address the challenge of detecting and eliminating dynamic objects in outdoor forest and tea garden agricultural scenarios, which affects localization and image reconstruction accuracy, we introduce a novel solution called MOLO-SLAM. This approach involves a label-optimized dynamic SLAM technique that incorporates multiple geometric constraints and dynamic confidence. Our methodology is built upon the foundation of the ORB-SLAM2 algorithm and obtaining highly accurate a priori mask information via Mask-RCNN. Additionally, we curated our own dataset, which was named the Forest-Tea-COCO Dataset, and trained the corresponding Forest-Tea-COCO Model by Mask-RCNN. This model was then used to conduct reconstruction experiments and generate semantic maps in forest–tea garden scenarios, with the aim of effectively addressing dynamic object interference in forest and tea garden environments. The primary contributions of this paper can be summarized as follows:
  • This paper proposes a novel calculation method for dynamic label confidence, which utilizes multiple geometric constraints. This approach improves SLAM accuracy by excluding masks of highly dynamic objects based on their dynamic confidence. To validate its effectiveness, experiments were conducted using our forest–tea garden datasets, and the results demonstrate that MOLO-SLAM significantly enhanced accuracy in dynamic scenes.
  • The paper also presents the design of the label consistency module and the incomplete mask contaminated frame rejection module. These modules effectively reject contaminated point cloud data and facilitate the reconstruction of high-precision 3D point cloud maps.
  • Additionally, the paper proposes a combination with the BundleFusion 3D reconstruction algorithm, resulting in the generation of a more refined 3D mesh model.
The rest of the paper is organized as follows: Section 2 describes the MOLO-SLAM algorithm system. Section 3 presents comparative experiments on the TUM and KITTI datasets, as well as tests on our own agricultural dataset. Finally, the summary and future directions of the work are presented in Section 4.

2. Materials and Methods

2.1. System Overview

First, let us provide a brief overview of the entire system, as illustrated in Figure 1. Our system was built upon ORB-SLAM2, which serves as the foundation for tracking and mapping. The process begins with RGB images and depth images obtained from sensors in either RGB-D or stereo modes. These images are then fed into the system, with RGB images being processed by two threads. The first thread, feature extraction, is responsible for extracting feature points from the current frame. The Multiple Geometric Constraints module calculates the dynamic confidence of the labels. Simultaneously, the Mask-RCNN thread provides RGB image masks, which are combined with the dynamic confidence values to identify dynamic tags with high confidence. Subsequently, the feature points within the dynamic tag masks are eliminated, while static feature points are used for tracking and map building to obtain precise position estimation. Meanwhile, the labels obtained from Mask-RCNN undergo label consistency and polluted frame removal through feature point tracking. The removed contaminated frames are excluded from the point cloud map thread. Additionally, we refer to the dense mapping system proposed by Yang et al. [37], and the positional estimation is then combined with the key frames to generate the point cloud map. Finally, the point cloud map undergoes bundle fusion to generate a mesh 3D model. This model is combined with the Mask-RCNN segmentation results to obtain the semantic point cloud map.
Compared to ORB-SLAM2 and other dynamic VSLAMs, we introduced several modules to enhance the overall system performance:
(1)
Dynamic detection module: We implemented the multiple geometric constraints dynamic detection by employing movement consistency check, reprojection error, and pre-previous frame movement consistency check. Additionally, we strengthened certain labels by appropriately increasing the number of large gradient pixel points. This enables us to compute dynamic confidence scores for each label based on weighted criteria.
(2)
In order to classify objects in the agricultural environment more efficiently, we employed augmented learning to input the agricultural dataset and the COCO [38] instance dataset into the Mask-RCNN network. This allows us to obtain a semantic labeling segmentation model for instances that is applicable to both forest and tea gardens, thereby extending the semantic labeling capabilities in these scenarios.
(3)
3D reconstruction module: To ensure accurate 3D reconstruction, we implemented label consistency and contaminated frame removal. By eliminating frames with incomplete segmentation, we prevent them from entering the point cloud threads. Consequently, the dynamically culled RGB maps and depth maps are then input into the BundleFusion system for the final 3D reconstruction.

2.2. Dynamic Feature Point Detection

The framework we employ, ORBSLAM2, relies on the identification of ORB feature points in consecutive image frames. These feature points are then matched across frames, enabling the calculation of relative poses and thereby facilitating localization. However, a notable challenge arises when feature points on dynamic objects are used for localization: such scenarios can introduce significant errors into the localization process. Consequently, it becomes imperative to address this issue by detecting and distinguishing between static and dynamic feature points.
Currently, many VSLAM algorithms employ time-consuming region-growing algorithms or cluster segmentation techniques to detect static points and then perform calculations on the depth map. However, this approach heavily relies on the depth image. To better adapt to outdoor scenes, we require a straightforward and efficient dynamic detection method that uses only RGB images. Therefore, we chose the triple weighting of motion consistency, re-projection error, and pre-previous frame consistency to construct the multiple geometric constraints module, as illustrated in Figure 2. Epipolar geometry forms the foundation of this module: given the relative change in R (rotation) and t (translation) of the camera, we analyze the behavior of spatial points such as P1, P2, and P3 in relation to their epipolar lines. In an ideal scenario, these points should fall on the epipolar line. However, due to real-world errors, the feature points in the subsequent frame deviate from the epipolar line by a certain distance, as illustrated in Figure 2. This distance is denoted as “dp”—the distance between the matching points (P1, P2, P3) and the epipolar line (dp1, dp2, dp3) between the previous and current frames. Importantly, when P3 is dynamic, its movement increases its distance dp3, resulting in a larger gap between the matching point and the epipolar line. By exploiting this property, we can establish a mechanism for verifying motion consistency. In the following section, we provide a detailed explanation of the calculation method for dynamic point detection.

2.2.1. Epipolar Constraint Check

Since the proposal of DS-SLAM, the mobile consistency module has served as the foundation for numerous dynamic SLAM research studies. Here, we also adopt the mobile consistency module as our starting point. The process involves comparing the current frame FC with the previous frame FP. In the first step, we compute the optical flow pyramid for the current frame, FC, to obtain its feature points. Then, we calculate the error of the 3 × 3 image block of the optical flow matching point and remove it if it is close to the boundary or if the error is too large. Moving on to the third step, we employ RANSAC iteration to locate the best optical flow matching feature point, allowing us to calculate and find the F (fundamental matrix). In the fourth step, we calculate the epipolar line by considering the coordinates of PC and PP, as shown in Formula (1):
P_C = \left( u_c, v_c, 1 \right)^T, \quad P_P = \left( u_p, v_p, 1 \right)^T
where u and v are the pixel coordinates of the optical flow feature points. The epipolar line is then calculated by Formula (2):
l_C = F P_C = F \begin{bmatrix} u_c \\ v_c \\ 1 \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
Finally, we calculate the epipolar distance based on the feature points and the epipolar line, and the formula for this calculation is shown in Formula (3):
D_{P_C} = \frac{\left| P_P^T F P_C \right|}{\sqrt{X^2 + Y^2}}
where DPC is the epipolar distance of point PC. When DPC is greater than the threshold ε, we consider PC to be a dynamic feature point. The algorithm flow of mobile consistency checking is shown in Algorithm 1.
Algorithm 1 Mobile consistency checking
Input: Current frame FC; previous frame FP; current feature points PC.
1 Compute the previous feature points by optical flow: PP = CalcOpticalFlowPyrLK(FC, FP, PC);
2 if PP is close to the image edge or the SSD of its 3 × 3 block is too large then remove the match from PP end if;
3 F = FindFundamentalMatrix(PC, PP, RANSAC);
4 for each matched point pair in (PC, PP) do compute lC via (2);
5 Compute the epipolar distance DPC via (3);
6 if DPC > ε then put (PCi, PPi) into the dynamic feature point set S end if.
Output: Dynamic feature point set S.
Mobile consistency detection is a dynamic outlier detection technique based on optical flow and epipolar constraints. Using the optical flow method and the F matrix, we can effectively describe the correlation between corresponding points in the two images and express their motion relationship. Nevertheless, this method has its limitations: when an object moves along the epipolar line, its epipolar distance remains small, so the distance threshold check fails to flag it. Hence, we consider introducing the re-projection error for optimization.
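To make the epipolar constraint check concrete, the following Python sketch (assuming OpenCV and NumPy; function and variable names are illustrative and not taken from the authors’ implementation) reproduces the main steps of Algorithm 1: track the current-frame points into the previous frame with the LK optical flow pyramid, estimate F with RANSAC, and flag points whose epipolar distance from Formula (3) exceeds the threshold ε.

import cv2
import numpy as np

def epipolar_dynamic_points(frame_cur, frame_prev, pts_cur, eps=1.0):
    # Track current-frame feature points into the previous frame (step 1 of Algorithm 1).
    pts_cur = pts_cur.astype(np.float32).reshape(-1, 1, 2)
    pts_prev, status, _ = cv2.calcOpticalFlowPyrLK(frame_cur, frame_prev, pts_cur, None)
    good = status.ravel() == 1
    pc = pts_cur.reshape(-1, 2)[good]
    pp = pts_prev.reshape(-1, 2)[good]
    # Fundamental matrix from the surviving matches; RANSAC rejects gross mismatches (step 3).
    F, _ = cv2.findFundamentalMat(pc, pp, cv2.FM_RANSAC)
    pc_h = np.hstack([pc, np.ones((len(pc), 1))])   # homogeneous current points
    pp_h = np.hstack([pp, np.ones((len(pp), 1))])   # homogeneous previous points
    lines = pc_h @ F.T                              # epipolar lines l_C = F * P_C, rows (X, Y, Z)
    # Point-to-line distance |P_P^T F P_C| / sqrt(X^2 + Y^2), Formula (3).
    dist = np.abs(np.sum(pp_h * lines, axis=1)) / (np.sqrt(lines[:, 0]**2 + lines[:, 1]**2) + 1e-12)
    dynamic = dist > eps                            # step 6: epipolar distance above threshold
    return pc[dynamic], pp[dynamic]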

2.2.2. Re-Projection Error

Re-projection error refers to the Euclidean distance dpe between a feature point and its transformed (re-projected) match. For abnormal points, dpe is generally relatively large, enabling their detection. The homography matrix is often used to detect abnormal points in scenes with planar features. Relatively flat tea gardens can be regarded as scenes rich in planar features, as illustrated in Figure 2, so we use H (the homography matrix), which better represents the projection of points between the two images. To compute the re-projection error, we first calculate H from the optical flow feature points under the assumption of a static background. The specific calculation then proceeds analogously to the epipolar constraint check, as outlined below:
First, we calculate the optical flow pyramid of the current frame FC, to obtain its feature point, PC. Next, we locate the matching feature point, PP, of the previous frame, FP, using the optical flow method. In the third step, we initially screen the feature point pairs, PCS and PPS, in the static background and calculate the homography matrix, H, based on the matching feature points of PCS and PPS. Finally, we utilize the H matrix to calculate the re-projection error Dre for each feature point. The calculation formula is shown in Formula (4):
P_{P\text{-}re\text{-}v} = H \begin{bmatrix} u_p \\ v_p \\ 1 \end{bmatrix} = \begin{bmatrix} u_{p\text{-}re\text{-}v} \\ v_{p\text{-}re\text{-}v} \\ w_{p\text{-}re\text{-}v} \end{bmatrix}
where PP-re-v is the re-projection of the previous-frame feature point PP into the current frame, and uP-re-v, vP-re-v, and wP-re-v are its homogeneous coordinate components, which are normalized to obtain the normalized pixel coordinates, as shown in Formula (5):
P_{P\text{-}re} = \begin{bmatrix} u_{p\text{-}re\text{-}v}/w_{p\text{-}re\text{-}v} \\ v_{p\text{-}re\text{-}v}/w_{p\text{-}re\text{-}v} \\ 1 \end{bmatrix} = \begin{bmatrix} u_{p\text{-}re} \\ v_{p\text{-}re} \\ 1 \end{bmatrix}
The Euclidean distance is used to calculate the re-projection error Dre:
D_{re} = \sqrt{(u_c - u_{p\text{-}re})^2 + (v_c - v_{p\text{-}re})^2}
When Dre is greater than the threshold e, we consider it an outlier. The algorithm flow of re-projection error outlier point detection is shown in Algorithm 2.
Algorithm 2 Re-projection error outlier point detection
Input: Current frame FC; previous frame FP; current feature points PC.
1 Compute the previous feature points by optical flow: PP = CalcOpticalFlowPyrLK(FC, FP, PC);
2 if PP is close to the image edge or the SSD of its 3 × 3 block is too large then remove the match from PC end if;
3 H = FindHomographyMatrix(PC-s, PP-s, RANSAC) // (PC-s, PP-s) are the static background matching points;
4 for each matched point pair in (PC, PP) do compute PP-re via (5) and Dre via (6);
5 if Dre > e then put (PCi, PPi) into the outlier feature point set R end if.
Output: Outlier current feature point set R.
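As a complement to Algorithm 2, the sketch below (again assuming OpenCV/NumPy; names are illustrative) estimates H from matches presumed to lie on the static background and flags matches whose re-projection error from Formulas (4)–(6) exceeds the threshold e.

import cv2
import numpy as np

def reprojection_outliers(pts_cur, pts_prev, static_cur, static_prev, e=2.0):
    # H maps previous-frame points into the current frame; it is estimated only from
    # matches assumed to belong to the static background (step 3 of Algorithm 2).
    H, _ = cv2.findHomography(static_prev.astype(np.float32),
                              static_cur.astype(np.float32), cv2.RANSAC)
    pp_h = np.hstack([pts_prev, np.ones((len(pts_prev), 1))])  # homogeneous previous points
    proj = pp_h @ H.T                                          # Formula (4): H * P_P
    proj = proj[:, :2] / proj[:, 2:3]                          # Formula (5): divide by w
    d_re = np.linalg.norm(pts_cur - proj, axis=1)              # Formula (6): Euclidean distance
    return d_re > e                                            # boolean mask of outlier matches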

2.2.3. Pre-Previous Frame Epipolar Constraint Check

Motion consistency detection is capable of detecting moving objects. However, it may fail to identify dynamic points that move along the epipolar line within the current viewing angle, leading to an omission of dynamic abnormal points. Additionally, because of the narrow roads in a forest–tea garden environment, the movement speed (v) of dynamic objects is relatively low, heightening the challenge of dynamic detection. To avoid omitting dynamic abnormal points, we contemplate introducing an additional viewing angle for motion consistency checking. However, an excessively large change in camera viewing angle can result in significant errors, which affects the continuity of moving objects. To strike a balance, we opt for motion consistency detection between the current frame, FC, and the pre-previous frame, FP-pre, to identify early abnormal points. The algorithm flow follows the consistency check above, with the inputs denoted as the current RGB frame FC, the current RGB frame’s feature points PC, and the pre-previous RGB frame FP-pre. The schematic diagram of the pre-previous frame check is depicted in Figure 2.

2.3. Dynamic Confidence Calculation Method

Once the dynamic feature points are successfully identified, the system may still fail to detect all the erroneous feature points associated with a dynamic object because of residual errors. Therefore, for the accurate localization of moving objects, it becomes necessary to incorporate prior semantic information and instance masks to eliminate all the feature points on dynamic objects. Consequently, we opted to employ the Mask-RCNN instance segmentation network, which is known for its superior accuracy in providing such prior information. This network is capable of generating instance masks and bounding boxes for various categories, offering a more streamlined approach to image detection.
Mask-RCNN exhibits superior segmentation accuracy in comparison to other instance segmentation approaches. However, a notable drawback of Mask-RCNN is its inability to effectively differentiate between static and dynamic object labels, particularly when there are transitions in the motion state of the labeled entities (e.g., from moving to stationary). Consequently, after leveraging the prior semantic information provided by Mask R-CNN, it becomes necessary to accurately detect and distinguish the dynamic objects through alternative means. Multiple geometric constraints are proposed here. We integrate three dynamic detection methods and utilize the ratio of detected outlier points to the total number of feature points from these methods. Each method is assigned a specific weight, and this weighted ratio is used to calculate the dynamic confidence of the label. We then combine the deep learning Mask-RCNN method to locate the dynamic mask and reject the feature points falling within the mask area. In this context, the proposed multiple geometric constraints method serves as a crucial step for accurately calculating the dynamic confidence of the object labels, thereby enabling the robust detection of dynamic masks. Here, we introduce the specific details of this calculation approach.
Firstly, we adopted the dynamic label checking strategy of DS-SLAM and established a threshold for the epipolar distance Em. We classified the points larger than this distance as outer points Pout, while the rest were designated as inner points Pin. The second step involves traversing each outer point Pout to determine its corresponding mask and record the number of outer points Li on the label of each mask. A label is considered dynamic when the number of outer points Li exceeds the threshold value N.
However, this calculation method has some drawbacks. If we preset too many target labels, it may result in small areas or features with insufficient prominence on the mask, leading to only a few feature points. Consequently, even though the labeling should be dynamic, the number of outer points Li might not surpass the threshold value N. This error in calculation can adversely affect the 3D reconstruction quality. To address this issue, we introduce a pseudo-feature point increase strategy. When the number of feature points on a mask Li is less than a designated threshold value M = 2N, we randomly add (M-Li) large gradient pixel points to that mask as pseudo-feature points. Figure 3 describes the specific process of dynamic confidence, and pseudo feature points are added to labels with insufficient feature points.
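The pseudo-feature point strategy can be sketched as follows (Python/OpenCV; illustrative only). The paper adds random large-gradient pixels; for simplicity, this sketch takes the strongest-gradient pixels inside the label’s mask, which is an assumption rather than the authors’ exact sampling rule.

import cv2
import numpy as np

def add_pseudo_feature_points(gray, label_mask, existing_pts, M):
    # Top up a label whose mask carries fewer than M flow points with large-gradient pixels.
    need = M - len(existing_pts)
    if need <= 0:
        return existing_pts
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    mag[label_mask == 0] = 0                        # only consider pixels on this label's mask
    order = np.argsort(mag, axis=None)[::-1][:need] # strongest gradients first (see lead-in)
    ys, xs = np.unravel_index(order, mag.shape)
    extra = np.stack([xs, ys], axis=1).astype(np.float32)
    pts = existing_pts.reshape(-1, 2) if len(existing_pts) else np.empty((0, 2), np.float32)
    return np.vstack([pts, extra])                  # mask now holds at least M = 2N points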
For calculating the initial dynamic probability Cm-Ini of mobile consistency, our gradient point increase strategy ensures that each mask has a minimum of M feature points, where we define M = 2N. With this definition, when there are N outliers, the critical value for the dynamic confidence becomes N/M = 0.5. Thus, the initial dynamic confidence of label “i” that detects LSi outliers is as follows in Formula (7):
C_{m\text{-}Ini} = L_{Si} / M
When LSi > M, Cm-Ini is set to 0.95, with the remaining 0.05 reserved as floating redundancy.
From the above analysis, we obtain the calculation method of dynamic confidence. First, we perform mobile consistency detection and the re-projection error to calculate initial confidence:
Label\_conf = \alpha C_m + \beta C_{re}
The dynamic confidence coefficient Cm of mobile consistency detection is Formula (9):
C_m = \alpha \cdot C_{m\text{-}Ini}
α is the weight of the detection method, and Cm-Ini is the initial dynamic confidence. Similarly, the re-projection confidence is Formula (10):
C_{re} = \beta \cdot C_{re\text{-}Ini}
β is the weight of the detection method, and Cre-Ini is the initial dynamic confidence of the re-projection check.
Although the threshold has been defined, the confidence at this point is determined solely by M; we therefore introduce the static–dynamic comparison coefficient B.
The contrast coefficient B is determined by the number of outlier points Pout and the number of feature points Li on the label mask, where Pout counts the outlier points prior to the inclusion of high-gradient pixel points and Li denotes the quantity of feature points on the label mask. The calculation formula for the coefficient K is as follows:
K = \frac{P_{out} / L_{Si}}{L_i / M}
The coefficient K represents the relationship between dynamic confidence and the proportion of original dynamic points. Specifically, when the initial dynamic confidence Cm-Ini is 0.5, but the proportion of original dynamic points exceeds 0.9, we tend to favor it as a dynamic label and appropriately increase the dynamic confidence. Therefore, we establish the following rules: when K is less than 0.2, the dynamic confidence is capped at 0.85; when K exceeds 2, the dynamic confidence is capped at 1.25. This leads to the derivation of the static–dynamic comparison coefficient B:
B(K) = \begin{cases} 0.8, & K \le 0.2 \\ 0.25K + 0.75, & 0.2 < K < 2 \\ 1.25, & K \ge 2 \end{cases}
Finally, the dynamic probability Cm for motion consistency is determined as follows:
C_m = B \cdot C_{m\text{-}Ini}
Because additional feature points are introduced, uncertainty increases, which leads us to introduce the uncertainty coefficient Uncer. Two factors influence this parameter: the motion level of the label, i.e., the prior motion probability ω, and the increased proportion of feature points on the mask, ξ.
Since there are 80 categories in the COCO dataset, the likelihood of motion varies across each category. Based on real-world experience, we categorized dynamics into five classes of levels, from strongest to weakest, classifying the labels into five category levels: prior dynamic (e.g., people, cats, dogs), prior static (e.g., beds, tables), intermediate-to-dynamic (e.g., cars, bicycles), intermediate-to-static (e.g., potted plants, TVs), and intermediate states (others). The motion levels of the five categories are presented in Table 1 below.
The dynamic weight of the intermediate state is set to “1”, indicating that the category prior of the label does not influence the calculation of dynamic confidence. For values where “a > b > 1”, it implies an increase in the dynamic weight of the label. Similarly, for values where “1 > c > d”, it indicates an increase in the static weight of the label. Here, “ω” represents the dynamic probability weight. The motion level can be adjusted according to different datasets in different scenarios. For example, in the TUM indoor dynamic dataset, where the object motion range is small, the dynamic object motion coefficient is increased accordingly. On the other hand, in the KITTI outdoor data, where the dynamic range is large, the dynamic object motion coefficient is decreased accordingly.
To successfully detect dynamic labels, we introduced pseudo feature points to ensure a minimum of M optical flow points on the mask. However, the added points are not necessarily detected as corner points but rather are pixels with significant gradients, so introducing too many of them brings a certain degree of unreliability. Consequently, when the number of genuine mask optical flow feature points on the label is m, ξ should not influence the dynamic confidence, and we set ξ(m) = 1. When the number of mask optical flow feature points on the label is 0 and the added points are considered less reliable, we stipulate that the label’s status is determined only when at least 2/3·m of the feature points ascertain the status; hence, ξ(0) is set to 2/3. We employ the following linear model between these two endpoints:
\xi(x) = \frac{1}{3m}x + \frac{2}{3}, \quad x \in [0, m]
x is the number of optical flow feature points on the mask. So, we define the uncertainty coefficient Uncer of the label as
Uncer = \omega \cdot \xi
If the calculated Label_conf is uncertain, i.e., Label_conf ∈ [0.35, 0.65] with a large Uncer, a second motion consistency check is started. The second motion consistency check is the optical flow motion consistency dynamic detection between the current frame and the pre-previous frame (Section 2.2.3). After its introduction, the dynamic confidence becomes Formula (13):
Label\_conf = Uncer \cdot (\alpha C_m + \beta C_{re} + \sigma C_{m2})
We set the initial weights as α = 0.7 and β = 0.3; after introducing the quadratic weight, we set α = 0.6, β = 0.2, and σ = 0.2. This means that we still place the most trust in the motion consistency check, while the other methods are introduced to stabilize the dynamic confidence coefficient.
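The following Python sketch strings Formulas (7)–(13) together into one label-confidence routine. It is one possible reading of the calculation, not the authors’ implementation: the outlier counts from the three checks, the pre-padding counts Pout and Li, the prior weight ω from Table 1, and the identification of the ξ endpoint m with M are all passed in or assumed explicitly.

def initial_conf(n_out, M):
    # Formula (7): initial confidence of a label with n_out outliers among M padded points,
    # capped at 0.95 with 0.05 floating redundancy.
    return 0.95 if n_out > M else n_out / M

def contrast_B(p_out, ls_i, l_i, M):
    # Formulas (11)-(12): static-dynamic comparison coefficient B(K).
    K = (p_out / max(ls_i, 1)) / (max(l_i, 1) / M)
    if K <= 0.2:
        return 0.8
    return 1.25 if K >= 2.0 else 0.25 * K + 0.75

def label_confidence(ls_m, ls_re, ls_m2, p_out, l_i, M, omega):
    # ls_m, ls_re, ls_m2: outliers found by the motion consistency, re-projection and
    # pre-previous frame checks; p_out, l_i: outliers and flow points before padding.
    c_m = contrast_B(p_out, ls_m, l_i, M) * initial_conf(ls_m, M)
    c_re = initial_conf(ls_re, M)
    conf = 0.7 * c_m + 0.3 * c_re                    # first pass, alpha = 0.7, beta = 0.3
    if 0.35 <= conf <= 0.65:                         # uncertain label: run the second check
        xi = min(l_i / (3.0 * M) + 2.0 / 3.0, 1.0)   # linear model, taking m = M (assumption)
        uncer = omega * xi
        c_m2 = initial_conf(ls_m2, M)
        conf = uncer * (0.6 * c_m + 0.2 * c_re + 0.2 * c_m2)   # Formula (13)
    return conf                                      # label treated as dynamic if conf > 0.5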

2.4. Label Consistency Algorithm and Pollution Frame Removal

Although Mask-RCNN and multiple geometric constraints are employed for dynamic object detection, to achieve more accurate 3D point cloud information, it is still necessary to process certain contaminated frames. Hence, we introduce the label consistency algorithm and pollution frame removal as part of our approach. The overall process is shown in Figure 4.

2.4.1. Label Consistency Algorithm

During optical flow calculations, we want to establish connections between consecutive frames. However, in the instance segmentation process, only the image information of the current frame is considered for image processing, without incorporating the labels from the preceding and subsequent frames. As a result, inconsistencies may arise between the labels of consecutive frames, potentially leading to confusion and ambiguity.
To achieve label alignment, we adopted the method of optical flow feature point matching, wherein the ORB feature points of the current frame are tracked to their corresponding matching points in the previous frame using the optical flow method, thereby establishing data association between consecutive frames. The specific procedure is as follows:
(1)
While traversing each optical flow feature point’s mask, we simultaneously associate them with the corresponding label attributes. For example, if the label of the current frame (FC) is classified as a dynamic label, we track the optical flow feature points matched by FP using these attributes, e.g., person1.
(2)
We then traverse the optical flow feature points tracked by FP and determine the attributes of the feature points’ label on the mask of FP. When most of the points’ label attributes align with a certain label, e.g., person2, we synchronize the label attributes from the previous frame to the next frame, i.e., person1→person2.
To prevent the continuous propagation of synchronization errors among specific frames, this algorithm restricts its operation to key frames. Specifically, for each key frame, we initiate the label consistency operation; upon detecting the subsequent key frame, it serves as a fresh starting point for the labeling process.
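A minimal sketch of this majority-vote label alignment is given below (Python; the data layout and names are assumptions for illustration): each tracked optical flow point contributes one (current instance, previous instance) vote, and a current-frame instance adopts the previous-frame identity only when a clear majority of its points agree.

from collections import Counter

def align_instance_labels(point_matches):
    # point_matches: list of (cur_instance, prev_instance) pairs, one per tracked flow point,
    # obtained by looking up which mask each point falls into in the two frames.
    votes = {}
    for cur_inst, prev_inst in point_matches:
        votes.setdefault(cur_inst, Counter())[prev_inst] += 1
    mapping = {}
    for cur_inst, counter in votes.items():
        best_prev, count = counter.most_common(1)[0]
        if count > sum(counter.values()) / 2:   # only relabel on a clear majority
            mapping[cur_inst] = best_prev       # e.g. person1 -> person2 as in the text
    return mapping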

2.4.2. Pollution Frame Removal

Mask-RCNN demonstrates exceptional edge coverage detection capability and typically achieves high MIOU (mean intersection over union) for image detection tasks. However, when confronted with camera-acquired images, such as those obtained in SLAM (simultaneous localization and mapping), the reduced resolution of the camera and the motion blur during image acquisition can have a detrimental effect on Mask-RCNN’s masking performance. Consequently, in certain scenarios, the detection may be incomplete, leading to missed detections.
To prevent incomplete and missing mask detection from polluting the point cloud and undermining the reconstruction of the 3D terrain, a strategy is needed to eliminate these polluted frames. Our approach involves a relatively simple pollution frame elimination strategy. By ensuring label consistency between the preceding and subsequent frames, we can unify them. Since false detections and missed detections occur only in a small number of cases, we utilize the change in area of the detection boxes between the preceding and subsequent frames to identify contaminated frames. In continuous image sequences, a significant change in the area of the detection box for a target instance indicates a mask failure or disappearance, resulting in a decrease in area. Therefore, frames with a high probability of being polluted are identified based on this criterion. Specifically, we set a minimum value, A, for the detection box area; if the detection box area falls below A, or if the area of the detection box in the next frame changes by more than half compared to the previous frame, we classify the frame as polluted. Such polluted frames are excluded from the point cloud thread in the subsequent stages. However, to ensure tracking integrity, the polluted frame is not removed from the tracking thread.
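A sketch of this check is shown below (Python; the box format, the minimum area value, and the symmetric treatment of growth and shrinkage are illustrative assumptions rather than the authors’ exact settings).

def is_polluted_frame(prev_boxes, cur_boxes, min_area=400, ratio=0.5):
    # prev_boxes / cur_boxes: dicts instance_id -> (x, y, w, h) for label-aligned frames.
    # A frame is flagged when any tracked instance's detection box disappears, drops below
    # the minimum area, or changes area by more than half relative to the previous frame.
    for inst, (x, y, w, h) in prev_boxes.items():
        prev_area = w * h
        if inst not in cur_boxes:
            return True                                  # mask disappeared entirely
        cw, ch = cur_boxes[inst][2], cur_boxes[inst][3]
        cur_area = cw * ch
        if cur_area < min_area or abs(cur_area - prev_area) > ratio * prev_area:
            return True
    return False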

2.5. Point Cloud Map and Surface Reconstruction

To ensure comprehensive coverage of dynamic objects and their feature points, we employ a mask expansion (dilation) process. After expanding the mask, we remove the corresponding masked area from both the RGB image and depth image, enabling real-time input to the point cloud thread for the static map reconstruction of the 3D point cloud. However, despite these efforts, the point cloud may still contain a few polluted points, which can be addressed through 3D mesh reconstruction. To achieve this, we save the RGB image and depth image of the dynamic mask area locally and utilize the BundleFusion algorithm for offline 3D reconstruction. The optimized SLAM system provides the localization data, which is combined with the RGB images and depth maps after dynamic mask rejection and fed into the BundleFusion algorithm in RGB-D mode. This approach results in a 3D mesh model with high precision, purity, and a small memory footprint, as demonstrated in Figure 5. We have named this dynamic SLAM system MOLO-SLAM (mask ORB label optimization SLAM).
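The mask expansion and culling step can be sketched as follows (Python/OpenCV; the kernel size and the use of zeroed pixels to mark removed regions are assumptions, not the authors’ exact settings).

import cv2
import numpy as np

def cull_dynamic_regions(rgb, depth, dynamic_mask, dilate_px=10):
    # Dilate the dynamic mask so it also covers feature points near object boundaries,
    # then blank the region in both the RGB and depth images before they reach the
    # point cloud thread (zero depth is simply skipped during back-projection).
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (dilate_px, dilate_px))
    expanded = cv2.dilate(dynamic_mask.astype(np.uint8), kernel)
    rgb_out = rgb.copy()
    depth_out = depth.copy()
    rgb_out[expanded > 0] = 0
    depth_out[expanded > 0] = 0
    return rgb_out, depth_out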

3. Results

3.1. Experiments Equipment and Process

To verify the effectiveness and performance of the MOLO-SLAM and 3D reconstruction methods mentioned earlier, a series of experimental tests was conducted. The algorithm is specifically designed to address the challenges of stable and efficient VSLAM in agricultural, forest, and tea garden scenes, which are often complex and dynamic environments.
We aimed to achieve accurate three-dimensional reconstruction of these scenes to support the navigation of agricultural inspection robots and gather essential three-dimensional information about crops. To achieve this, we needed to conduct outdoor experiments using specialized equipment, as shown in Figure 6. The experimental equipment consisted of an independently designed tracked multi-terrain adaptive chassis, equipped with an IMU, a 3D LIDAR, and the Kinect V2 depth camera. Additionally, a high-performance PC was used as a data collection and computing platform, capable of gathering RGB images, depth images, Lidar data, and IMU data.
We used this set of equipment to create agricultural dynamic datasets at South China Agricultural University’s forest and tea garden. We tested our algorithm in dynamic agricultural environments by deliberately introducing dynamic objects into the forest and tea garden scenes. The datasets evaluate the system’s ability to handle dynamic masks and reconstruct static scenes. Figure 7 shows three sequences in which “person” and other labels represent dynamic and static objects; to judge transitions in an object’s motion state, a bicycle, which can conveniently be parked, was used as an intermediate-state label. Sequences (a) and (b) are intentionally designed dynamic datasets recorded in the forest and tea plantation scenarios, respectively. After the camera travels for a certain distance in the static environment, a person pushing a bicycle walks into the camera view from the side; after moving for a certain distance, the person puts the bicycle down, simulating a dynamic label transitioning to a static one, and then leaves the camera view. Sequence (c) simulates a more complex scene in which two people enter the camera view in turn: one walks for a period of time and then takes the bicycle away, simulating a static label transitioning to a dynamic one, while the other first enters pushing the bicycle and later puts it down and leaves, simulating a dynamic label transitioning to a static one; both kinds of transitions appear in the same dataset. The relevant starting threshold parameters for the operation of the system in this paper are shown in Table 2.
Once data collection was completed, we ran the MOLO-SLAM algorithm offline on the Legion 7i to conduct a series of experiments on the localization and 3D reconstruction of forest and tea garden scenes. These experiments allowed us to assess the algorithm’s performance and its applicability to real-world scenarios. All the experiments were carried out on a system with an Intel i7-12700 CPU, an NVIDIA 3060 graphics card, TensorFlow-GPU 2.2.4, and the Ubuntu 20.04 operating system. The Mask-RCNN instance segmentation model used was trained on the COCO datasets, specifically the Mask-RCNN-COCO model. Since this paper utilizes a large, parameter-heavy Mask-RCNN segmentation model along with a relatively complex geometric dynamic detection method, and since our objective was to investigate the accuracy of the method, the system was executed offline to evaluate its impact on each dataset. Additionally, for quantitative evaluation, we assessed it on the TUM and KITTI datasets, which provide ground truth position and attitude data. We conducted comparisons with ORB-SLAM2, as well as state-of-the-art (SOTA) dynamic SLAM systems, namely, DS-SLAM and Dyna-SLAM.

3.2. Dynamic Confidence and Dynamic Detection Experiments

We input the collected data into our system, following the procedure described earlier. Our system utilizes multiple motion consistency checks and outlier point detection to calculate the dynamic confidence of each label. It then identifies and removes the dynamic masks, resulting in dynamic-free pose estimation. As shown in Figure 8, whether in a forest garden or a tea garden scene with abundant planar features, the system successfully detected the dynamic labels “person” and “bicycle”. When the bicycle came to a stop, “bicycle” was accurately identified as a static label, demonstrating the stability of the system. It is worth noting that the dynamic labels “bicycle” and “person” used as examples are not treated specially among the dataset labels; in practical scenarios, the algorithm is therefore capable of detecting and excluding other dynamic objects during computation.
We added semantic static labels (road, Tea_garden, tree, Other_obstacles) to the collected agricultural datasets. Through augmented learning, we inputted the agricultural datasets and COCO instance datasets into the Mask-RCNN network for hybrid training. We obtained the Forest–Tea-COCO-Model and integrated it into MOLO-SLAM. The detailed workflow is depicted in Figure 9. Thanks to adaptive improvements in agricultural scenarios, our algorithm successfully detected the dynamic mask and effectively removed it, demonstrating its capability to handle dynamic objects in real-world scenarios.
The trained model was then validated by inputting images of forest and tea gardens for Forest–Tea-COCO-Model validation; the results are shown in Figure 10. In the agricultural scenario of forest and tea gardens, our model was able to segment the static semantic labels and dynamic instance labels effectively. Whether the scene was static or moving, the model accurately identified and segmented static and dynamic objects of various categories.

3.3. Pollution Frame Removal Tests

We performed label consistency alignment on these two datasets, ensuring that the frame sequence labels between keyframes were consistent. Next, we leveraged the unified label information to calculate changes in detection frames and identify contaminated frames. These contaminated frames may contain misidentified objects, leading to inconsistencies in the reconstruction of depth maps and the resulting three-dimensional point cloud. For the validation of contaminated frame rejection and point cloud reconstruction, we primarily focused on the RGB-D mode using our datasets. In this experiment, we inputted the aligned labels into the contaminated frame rejection algorithm to identify and detect the contaminated frames. Subsequently, we selected five key frames above and below these contaminated frames for the point cloud reconstruction experiment. A comparison was made between the effects of rejecting the contaminated frames and leaving them unremoved in the 3D point cloud. The outcomes of this experiment are shown in Figure 11 and Figure 12.
Based on the results, we can observe that Sequence1, Sequence2, and Sequence3 successfully detected the contaminated frames. Furthermore, the contaminated point cloud, labeled in the red box in Figure 12 corresponding to the contaminated frames, was effectively removed, resulting in a relatively pure 3D point cloud. This demonstrates the beneficial role of our algorithm in maintaining data accuracy and enhancing the quality of the reconstructed 3D point cloud.

3.4. Pose Estimation Experiment

To evaluate the performance of our SLAM system, we conducted tests on our datasets and the TUM datasets. Lidar SLAM, with its wide viewing angle, high accuracy, and extensive depth range, excels at capturing static point clouds in large scenes and is less prone to interference from dynamic objects than general visual SLAM algorithms. However, the trade-off is a sparse map and higher cost. To evaluate our algorithm, we used the Fast-LIO2 laser SLAM algorithm, built with 3D lidar, as a reference. Additionally, we evaluated our algorithm on the TUM datasets, selecting seven sequences for testing: four dynamic sequences and three static sequences. We compared DS-SLAM, Dyna-SLAM, and our MOLO-SLAM system on the TUM datasets. All of these systems are dynamic VSLAM systems based on ORBSLAM2.
Note that we specifically focused on the reconstruction of the static environment, since we needed to subsequently generate 3D point cloud maps. As a result, we set the motion level of the “person” label to infinite in order to ensure its elimination. In other cases, the motion level could be adjusted to utilize feature points on a static “person”; however, in our scenario, we chose to eliminate them. Simultaneously, we unconditionally remove a mask if its average depth exceeds 10 m or if its bounding box is too small.
To assess the performance, we utilized several evaluation metrics: absolute trajectory error (ATE) for evaluating global errors, relative pose error (RPE) for relative local error evaluation, and root mean square error (RMSE) for aggregating values over a sequence. The ATE of the i-th frame is calculated as shown in Formula (17):
F_i = Q_i^{-1} S P_i
where Fi is the error value of the i-th frame pose; Pi ∈ SE(3) is the estimated pose of the i-th frame of the algorithm; Qi ∈ SE(3) is the real pose of the i-th frame; and S ∈ SE(3) is the transformation matrix from the estimated pose to the real pose, obtained by the least squares method.
The calculation formula of RPE is shown in Formula (18):
E_i = \left( Q_i^{-1} Q_{i+\Delta} \right)^{-1} \left( P_i^{-1} P_{i+\Delta} \right)
where Ei is the error value of the i-th frame pose; Pi ∈ SE(3) is the estimated pose of the i-th frame of the algorithm; Qi ∈ SE(3) is the real pose of the i-th frame; and Δ is the fixed frame interval over which the relative pose difference is compared with the ground truth.
We introduced the Improvements value to evaluate the accuracy improvement of our algorithm relative to ORBSLAM2. Formula (19) for the calculation of the Improvements value is as follows:
K = (1 - \tau / \psi) \times 100\%
where K is the value of Improvements; τ is the evaluation value of MOLO-SLAM; and ψ is the evaluation value of ORBSLAM2, which we record as the baseline. To assess the ATE and RPE values for ORB-SLAM2 and MOLO-SLAM in the forest–tea garden setting, we used FAST-LIO2 as a reference. The evaluation results are presented in Table 3 and Table 4. Figure 13 displays the FAST-LIO2 mapping and the ATE positioning error map evaluated with the EVO tool on the agricultural dynamic sequences. According to the quantitative data, our algorithm achieved an improvement surpassing 80% in complex agricultural dynamic environments, demonstrating accuracy comparable to that of the laser SLAM algorithm in this specific setting.
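For reference, a minimal sketch of this evaluation is given below (Python/NumPy; it aligns the trajectories with an SVD-based least-squares rigid transform, which is one standard way to obtain the matrix S in Formula (17), and computes the Improvements value of Formula (19)).

import numpy as np

def ate_rmse(gt_xyz, est_xyz):
    # Align the estimated trajectory to ground truth with a least-squares rigid transform
    # (no scale), then report the RMSE of the residual translations.
    mu_g, mu_e = gt_xyz.mean(0), est_xyz.mean(0)
    U, _, Vt = np.linalg.svd((gt_xyz - mu_g).T @ (est_xyz - mu_e))
    R = U @ Vt
    if np.linalg.det(R) < 0:                  # keep a proper rotation
        U[:, -1] *= -1
        R = U @ Vt
    aligned = (R @ (est_xyz - mu_e).T).T + mu_g
    return float(np.sqrt(np.mean(np.sum((gt_xyz - aligned) ** 2, axis=1))))

def improvement(molo_value, orb_value):
    # Formula (19): relative improvement over ORB-SLAM2 in percent.
    return (1.0 - molo_value / orb_value) * 100.0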
In the evaluation of TUM data, our algorithm was quantitatively compared with ORBSLAM2, DS-SLAM, and Dyna-SLAM. In the comparison of ATE evaluation values in Table 5, our MOLO-SLAM system outperformed in the highly dynamic fr3/w_static and fr3/w_xyz sequences. For the other two highly dynamic sequences, fr3/w_half and fr3/w_rpy, Dyna-SLAM achieved the best results. Notably, Dyna-SLAM also performed remarkably well in the static sequences, fr3/s_half and fr2/xyz. These results suggest that our system did not significantly negatively impact the static sequences. As shown in Table 6, the performance of translational RPE in both dynamic and static sequences also demonstrated positive outcomes. Analyzing the data, we found that Dyna-SLAM excelled in high dynamic sequences, but our algorithm closely approached the performance of Dyna-SLAM. Moreover, when comparing DS-SLAM and ORBSLAM2, our algorithm showcased excellent optimization in dynamic sequences while maintaining a minimal negative impact on static sequences. Overall, these results demonstrate the effectiveness of our algorithm. Figure 14 presents the ATE trajectory effect graphs for the four highly dynamic sequences.
Similarly, we also provide the evaluation results of the KITTI dataset, which are shown in Table 7 and Table 8. Based on the data, it can be concluded that the performance improvement amounted to 28.07%.

3.5. Point Cloud Reconstruction and Mesh Reconstruction Experiments

We utilized the keyframe mechanism of ORBSLAM2 to generate the 3D point cloud reconstruction. The specific process is as follows: First, the ORBSLAM2 system runs to determine the keyframes and record the pose of each frame. Then, the depth map and RGB map of each frame form a local point cloud. Each local point cloud is transformed by the pose of the corresponding keyframe, and all local point clouds are fused together to form the global point cloud reconstruction, as shown in Formula (20):
P_E = \sum_{i=1}^{n} \left( R_i P_i + t_i \right)
where Pi denotes the local point cloud, PE denotes the global point cloud map, Ri denotes the rotation matrix, and ti denotes the translation vector of the corresponding keyframe camera with respect to a reference coordinate system, as determined by the SLAM system. The reference coordinate system is typically established based on the pose of the system in the first frame. In our MOLO-SLAM system, this process is facilitated by the label consistency module and the contaminated frame rejection module; similar to the ORBSLAM2 system, key frames are utilized to generate the point cloud. However, if a key frame is identified as contaminated, we select the nearest non-contaminated frame and substitute it into the point cloud system.
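A sketch of this fusion step is given below (Python/NumPy; it assumes the keyframe poses are available as 4 × 4 camera-to-world matrices and ignores voxel filtering and color handling).

import numpy as np

def fuse_point_clouds(local_clouds, poses):
    # Formula (20): transform each keyframe's local cloud P_i by its pose (R_i, t_i)
    # and concatenate everything into the global map.
    # local_clouds: list of (N_i, 3) arrays; poses: list of 4x4 camera-to-world matrices.
    world_points = []
    for cloud, T in zip(local_clouds, poses):
        R, t = T[:3, :3], T[:3, 3]
        world_points.append(cloud @ R.T + t)   # R_i * P_i + t_i for every point
    return np.vstack(world_points)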
After conducting the experiment on outdoor dynamic culling, we proceeded with the experiments on 3D point cloud reconstruction in the forest tea garden scene, and the experimental results are presented in Figure 15. In the case of the original ORBSLAM2 localization, as depicted in Figure 15a,f, the position estimation was adversely affected by the presence of dynamic objects; consequently, the resulting 3D point cloud exhibited chaotic and inaccurate characteristics, failing to accurately reconstruct the terrain topography of the scene. However, after introducing our dynamic VSLAM system, MOLO-SLAM, as depicted in Figure 15b,g, both the localization accuracy and the accuracy of the dense map reconstruction exhibited significant improvements. Our system successfully mitigated the influence of dynamic objects and accurately reconstructed the 3D terrain topography in the forest and tea plantation scenario. Additionally, during the transition of the bicycle from dynamic to static, as shown in Figure 15c,h, our algorithm effectively identified and preserved the static part of the bicycle while eliminating the dynamic part, thus further substantiating the effectiveness of our algorithm. Moreover, we extracted the depth maps processed by MOLO-SLAM along with RGB maps, which were subsequently utilized as input for the BundleFusion 3D reconstruction algorithm. The result is an enhanced-accuracy 3D mesh model with reduced space occupation, as illustrated in Figure 15d,i. Simultaneously, when combined with the Forest–Tea-COCO-Model obtained through augmented learning, a semantic point cloud map can be derived. By utilizing the semantic map, a 3D point cloud with semantic labels can be used to implement inspection requirements for various objects, as shown in Figure 15e,j.

3.6. Time Efficiency Analysis

To analyze the temporal efficiency of the dynamic confidence calculation module, we evaluated the time performance of each component on the agricultural dataset using our research platform and recorded the average computation time of each module in Table 9. "Dynamic confidence calculation (total)" represents the aggregate time cost of the three dynamic detection methods. Compared with the DS-SLAM approach, which uses only an epipolar constraint check, the overall time cost of our framework is higher. The multi-view geometry stage of DynaSLAM [21] was reported to take approximately 200–300 ms, although under a different experimental setup. While our geometric checks have lower computational overhead than that, we also incorporate the Mask R-CNN model, which has a large number of parameters, together with the multiple-geometric-constraints technique to improve segmentation accuracy; this integration inevitably increases the time cost. The algorithm presented here therefore still has considerable room for real-time optimization. The trade-off between accuracy and computational efficiency remains an active research area, and further algorithmic and hardware-level optimizations may be needed to achieve the real-time capability required for practical deployment.

4. Discussion

In this work, we present MOLO-SLAM (mask ORB label optimization SLAM), a deep-learning-based dynamic object culling VSLAM system that enhances ORBSLAM2 by integrating the Mask-RCNN instance segmentation algorithm. Our system is specifically tailored to operate effectively in dynamic agricultural environments. We introduce several key improvements:
(1)
Enhancing the front-end feature extraction module of ORBSLAM2: we integrate it with the Mask-RCNN instance segmentation network to identify dynamic object masks and selectively remove feature points from those regions, thereby improving the visual odometry estimation (see the first sketch after this list).
(2)
Introducing a multiple-geometric-constraints and dynamic confidence module: this module combines an epipolar geometric consistency check, re-projection error analysis, and a pre-previous-frame geometric consistency check to identify anomalous feature points under different thresholds and to compute a dynamic confidence for each label. Labels with a confidence greater than 0.5 can then be reliably identified as dynamic.
(3)
Proposing label consistency and contaminated-frame rejection in the point cloud reconstruction pipeline: frames with Mask-RCNN segmentation errors are removed during point cloud reconstruction, resulting in a cleaner and more accurate point cloud, after which the BundleFusion algorithm is employed for 3D mesh model reconstruction (see the second sketch after this list).
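As a concrete illustration of item (1), the sketch below discards ORB feature points that fall inside the dynamic instance masks produced by Mask-RCNN. It assumes OpenCV-style keypoints and a single merged binary mask; it is a simplified stand-in, not the authors' implementation.

```python
def filter_dynamic_keypoints(keypoints, descriptors, dynamic_mask):
    """Drop ORB keypoints that fall on pixels segmented as dynamic.

    keypoints:    list of cv2.KeyPoint objects (kp.pt = (x, y) in pixels).
    descriptors:  (K, 32) ORB descriptor array, row-aligned with `keypoints`.
    dynamic_mask: (H, W) boolean array, True where Mask-RCNN marked a dynamic instance.
    """
    h, w = dynamic_mask.shape
    keep = []
    for idx, kp in enumerate(keypoints):
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        # Discard the point only if it lies inside the image and on a dynamic pixel.
        if 0 <= v < h and 0 <= u < w and dynamic_mask[v, u]:
            continue
        keep.append(idx)
    kept_kps = [keypoints[i] for i in keep]
    kept_desc = descriptors[keep] if descriptors is not None else None
    return kept_kps, kept_desc
```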
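For item (3), one cue listed in Table 2 is the detection-box area variation ratio for contaminated frames (A = 1/2). The hypothetical check below flags a keyframe as contaminated when the bounding box of a tracked instance changes area too abruptly between neighbouring frames; the actual module combines this with further label-consistency cues.

```python
def is_contaminated(prev_boxes, curr_boxes, area_ratio_threshold=0.5):
    """Flag the current keyframe if a tracked detection box changes area too abruptly.

    prev_boxes, curr_boxes: dicts mapping an instance label/track id to a box
                            (x_min, y_min, x_max, y_max) in pixels.
    A box whose area shrinks or grows beyond the allowed ratio (A = 1/2 in
    Table 2) suggests a missed or spurious Mask-RCNN segmentation.
    """
    def area(box):
        x0, y0, x1, y1 = box
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)

    for label, curr in curr_boxes.items():
        prev = prev_boxes.get(label)
        if prev is None:
            continue  # newly appeared instance; handled by other cues
        a_prev, a_curr = area(prev), area(curr)
        if a_prev == 0 or a_curr == 0:
            return True
        if min(a_prev, a_curr) / max(a_prev, a_curr) < area_ratio_threshold:
            return True
    return False
```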
We established a comprehensive agricultural SLAM dataset, which includes synchronized RGB images, depth images, LiDAR point clouds, and inertial measurement unit (IMU) data. In addition, we compared our approach against several state-of-the-art dynamic VSLAM algorithms on the TUM and KITTI benchmarks. The results show significant improvements over the original ORBSLAM2: up to 97.72% in trajectory estimation accuracy on the forest–tea garden agricultural dataset, up to 98.51% on the TUM dataset, and up to 28.07% on the KITTI dataset. Furthermore, our algorithm has minimal negative impact on static sequences, and the point cloud experiments yielded cleaner and more accurate 3D point clouds and 3D mesh models. We also conducted comprehensive data acquisition experiments in forest and tea plantation environments for camera pose estimation and 3D reconstruction, achieving excellent results in terms of the generated 3D point clouds and mesh models in these challenging scenarios. Finally, we discussed extended applications of the algorithm, showing how it can support other robotic perception and mapping functions.
Our algorithm has shown promising results in dynamic scenarios. However, the use of Mask-RCNN makes real-time operation difficult because of its computational demands and offline processing. In future research, we aim to address this by replacing the instance segmentation network with a lighter, simpler alternative capable of real-time processing, which will enhance the practicality and applicability of our system in dynamic environments. Moreover, as AI models become smaller, GAN-based inpainting could be used to predict and fill the holes left after removing dynamic objects, allowing the environmental point cloud to be reconstructed more completely.

Author Contributions

Data curation, J.L. (Jinhong Lv), B.Y., S.S. and Q.L.; funding acquisition, W.W.; project administration, W.W.; formal analysis, W.W., J.L. (Jinhong Lv), B.Y., C.G., H.G. and J.L. (Junlin Li); investigation, B.Y., J.L. (Jinhong Lv), S.S. and Q.L.; methodology, W.W., B.Y., J.L. (Jinhong Lv), H.G., Q.L., S.S. and Q.L.; software, W.W., J.L. (Jinhong Lv), B.Y., C.G. and S.S.; supervision, B.Y., C.G., H.G. and Q.L.; resources, C.G., J.L. (Junlin Li), S.S. and Q.L.; validation, W.W., J.L. (Jinhong Lv), J.L. (Junlin Li) and Q.L.; writing—original draft, B.Y., J.L. (Jinhong Lv), C.G., J.L. (Junlin Li), H.G. and S.S.; writing—review and editing, W.W., J.L. (Jinhong Lv) and B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the 2024 Rural Revitalization Strategy Special Funds Provincial Project (2023LZ04), and Research and Development of Intelligence Agricultural Machinery and Control Technology (FNXM012022020-1-03).

Data Availability Statement

The data presented in this study are available in the article.

Acknowledgments

The authors thank the editors and reviewers for their constructive comments and support of this work.

Conflicts of Interest

Author Haijun Guo is employed by Guangdong Topsee Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  2. Reitmayr, G.; Langlotz, T.; Wagner, D.; Mulloni, A.; Schall, G.; Schmalstieg, D.; Pan, Q. Simultaneous localization and mapping for augmented reality. In Proceedings of the 2010 International Symposium on Ubiquitous Virtual Reality, Gwangju, Republic of Korea, 7–10 July 2010; pp. 5–8. [Google Scholar]
  3. Singandhupe, A.; La, H.M. A review of slam techniques and security in autonomous driving. In Proceedings of the 2019 third IEEE International Conference on Robotic Computing (IRC), Naples, Italy, 25–27 February 2019; pp. 602–607. [Google Scholar]
  4. Yousif, K.; Bab-Hadiashar, A.; Hoseinnezhad, R. An overview to visual odometry and visual slam: Applications to mobile robotics. Intell. Ind. Syst. 2015, 1, 289–311. [Google Scholar] [CrossRef]
  5. Ding, H.; Zhang, B.; Zhou, J.; Yan, Y.; Tian, G.; Gu, B. Recent developments and applications of simultaneous localization and mapping in agriculture. J. Field Robot. 2022, 39, 956–983. [Google Scholar] [CrossRef]
  6. Bresson, G.; Alsayed, Z.; Yu, L.; Glaser, S. Simultaneous localization and mapping: A survey of current trends in autonomous driving. IEEE Trans. Intell. Veh. 2017, 2, 194–220. [Google Scholar] [CrossRef]
  7. Klein, G.; Murray, D. Parallel tracking and mapping for small ar workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 225–234. [Google Scholar]
  8. Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. Svo: Semidirect visual odometry for monocular and multicamera systems. IEEE Trans. Robot. 2016, 33, 249–265. [Google Scholar] [CrossRef]
  9. Engel, J.; Schöps, T.; Cremers, D. Lsd-slam: Large-scale direct monocular slam. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 834–849. [Google Scholar]
  10. Mur-Artal, R.; Tardos, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  11. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. Dtam: Dense tracking and mapping in real-time. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2320–2327. [Google Scholar]
  12. Davison, A.J.; Reid, I.D.; Molton, N.D.; Stasse, O. Monoslam: Real-time single camera slam. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1052–1067. [Google Scholar] [CrossRef] [PubMed]
  13. Yan, L.; Zhao, L. An approach on advanced unscented kalman filter from mobile robot-slam. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 43, 381–389. [Google Scholar] [CrossRef]
  14. Endres, F.; Hess, J.; Sturm, J.; Cremers, D.; Burgard, W. 3D mapping with an rgb-d camera. IEEE Trans. Robot. 2013, 30, 177–187. [Google Scholar] [CrossRef]
  15. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.; Tardos, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  16. Elvira, R.; Tardos, J.D.; Montiel, J.M. Orbslam-atlas: A robust and accurate multi-map system. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 6253–6259. [Google Scholar]
  17. Yoo, J.; Borselen, R.; Mubarak, M.; Tsingas, C. Automated first break picking method using a random sample consensus (ransac). In Proceedings of the 81st EAGE Conference and Exhibition 2019, London, UK, 3–6 June 2019; pp. 1–5. [Google Scholar]
  18. Bustos, A.P.; Chin, T.-J.; Eriksson, A.; Reid, I. Visual slam: Why bundle adjust? In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 2385–2391. [Google Scholar]
  19. Zhao, X.; Li, Q.; Wang, C.; Dou, H.; Liu, B. Robust depth-aided rgbd-inertial odometry for indoor localization. Measurement 2023, 209, 112487. [Google Scholar] [CrossRef]
  20. Li, G.; Yu, L.; Fei, S. A deep-learning real-time visual slam system based on multi-task feature extraction network and self-supervised feature points. Measurement 2021, 168, 108403. [Google Scholar] [CrossRef]
  21. Bescos, B.; Facil, J.M.; Civera, J.; Neira, J. Dynaslam: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  22. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  23. Bescos, B.; Campos, C.; Tardos, J.D.; Neira, J. Dynaslam ii: Tightly-coupled multi-object tracking and slam. IEEE Robot. Autom. Lett. 2021, 6, 5191–5198. [Google Scholar] [CrossRef]
  24. Yu, C.; Liu, Z.; Liu, X.-J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. Ds-slam: A semantic visual slam towards dynamic environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
  25. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  26. Zhang, J.; Henein, M.; Mahony, R.; Ila, V. Vdo-slam: A visual dynamic object-aware slam system. arXiv 2020, arXiv:2005.11052. [Google Scholar] [CrossRef]
  27. Runz, M.; Buffier, M.; Agapito, L. Maskfusion: Real-time recognition, tracking and reconstruction of multiple moving objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 16–20 October 2018; pp. 10–20. [Google Scholar]
  28. Wang, X.; Zheng, S.; Lin, X.; Zhu, F. Improving rgb-d slam accuracy in dynamic environments based on semantic and geometric constraints. Measurement 2023, 217, 113084. [Google Scholar] [CrossRef]
  29. Zhong, F.; Wang, S.; Zhang, Z.; Wang, Y. Detect-slam: Making object detection and slam mutually beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1001–1010. [Google Scholar]
  30. Xiao, L.; Wang, J.; Qiu, X.; Rong, Z.; Zou, X. Dynamic-slam: Semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robot. Auton. Syst. 2019, 117, 1–16. [Google Scholar] [CrossRef]
  31. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  32. Liu, Y.; Miura, J. Rds-slam: Real-time dynamic slam using semantic segmentation methods. IEEE Access 2021, 9, 23772–23785. [Google Scholar] [CrossRef]
  33. Yan, L.; Hu, X.; Zhao, L.; Chen, Y.; Wei, P.; Xie, H. Dgs-slam: A fast and robust rgbd slam in dynamic environments combined by geometric and semantic information. Remote Sens. 2022, 14, 795. [Google Scholar] [CrossRef]
  34. Islam, R.; Habibullah, H.; Hossain, T. Agri-slam: A real-time stereo visual slam for agricultural environment. Auton. Robot. 2023, 27, 649–668. [Google Scholar] [CrossRef]
  35. Song, K.; Li, J.; Qiu, R.; Yang, G. Monocular visual-inertial odometry for agricultural environments. IEEE Access 2022, 10, 103975–103986. [Google Scholar] [CrossRef]
  36. Papadimitriou, A.; Kleitsiotis, I.; Kostavelis, I.; Mariolis, I.; Giakoumis, D.; Likothanassis, S.; Tzovaras, D. Loop closure detection and slam in vineyards with deep semantic cues. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2251–2258. [Google Scholar]
  37. Yang, L.; Wang, L. A semantic slam-based dense mapping approach for large-scale dynamic outdoor environment. Measurement 2022, 204, 112001. [Google Scholar] [CrossRef]
  38. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Figure 1. Overview of the MOLO-SLAM system.
Figure 2. Schematic diagram of multiple geometric constraints.
Figure 3. Process of dynamic confidence calculation method.
Figure 4. Process of pollution frame removal.
Figure 5. The 3D reconstruction method.
Figure 6. Forest and tea garden crawler inspection robot.
Figure 7. Agricultural datasets for the forest–tea garden scenario.
Figure 8. MOLO-SLAM dynamic mask detection.
Figure 9. Segmentation effect in the forest tea garden scene.
Figure 10. Distribution of dynamic confidence scores for dynamic and static objects.
Figure 11. Abnormal frame detection: The red box represents the detected contaminated frames.
Figure 12. Comparison of point cloud before and after contaminated frame rejection: (a,c,e) is before removing point clouds from contaminated frames in Sequence1, Sequence2, and Sequence3. (b,d,f) is after removing point clouds from contaminated frames in Sequence1, Sequence2, and Sequence3.
Figure 13. (a) FAST-LIO2 locating and mapping in an agricultural environment. (b) ATE trajectories for three agricultural dynamic sequences.
Figure 14. ATE trajectories for four highly dynamic sequences: the black line is the true trajectory; blue is the algorithm estimation curve; red is the error between the estimation algorithm and the true value (the larger the distance the higher the error).
Figure 15. Point cloud reconstruction experiment: (a,f) original orbslam2 point cloud; (b,g) point cloud generated by MOLO-SLAM after dynamic culling; (c,h) point cloud of bicycle; (d,i) the mesh model generated by BundleFusion after dynamic culling; (e,j) semantic map and segmented point cloud.
Table 1. Motion level classification.
Classification | Motion Level | Dynamic Probability Weights
Prior dynamic | 5 | a
Intermediate-to-dynamic | 4 | b
Intermediate states | 3 | 1
Intermediate-to-static | 2 | c
Prior static | 1 | d
Table 2. The relevant starting threshold parameters for the operation of the system in this paper.
Parameter | Value
Epipolar constraint distance threshold, ε | 5 pixels
Reprojection error threshold, e | 8 pixels
Epipolar constraint distance threshold of the previous frame, ε2 | 7 pixels
Initial weights of dynamic confidence, α, β | 0.7, 0.3
Outlier threshold of dynamic confidence, N | 7
Detection box area variation ratio for contaminated frames, A | 1/2
Motion level weights, a, b, c, d | 1.2, 1.1, 0.9, 0.8
Number of ORB feature points, NF | 1000
Mask-RCNN training batch size (Batch_size) | 4
Number of Mask-RCNN training epochs (Epoch) | 100
Table 3. Comparison of absolute trajectory error RMSE for the agricultural dataset—reference FAST-LIO2 (ATE)/m.
Sequences | ORBSLAM2 | MOLO-SLAM | Improvements
Sequence1 | 0.1489 | 0.0190 | 87.24%
Sequence2 | 0.1954 | 0.0120 | 93.86%
Sequence3 | 0.8112 | 0.0169 | 97.92%
Table 4. Comparison of relative translation trajectory RMSE for the agricultural dataset—reference FAST-LIO2 (translation RPE)/m.
Sequences | ORBSLAM2 | MOLO-SLAM | Improvements
Sequence1 | 0.3669 | 0.0871 | 76.26%
Sequence2 | 0.4740 | 0.0796 | 83.21%
Sequence3 | 0.5114 | 0.1090 | 78.69%
Table 5. Comparison of absolute trajectory error RMSE for TUM data (ATE)/m.
Sequences | ORBSLAM2 | DS-SLAM | Dyna-SLAM | SG-SLAM | MOLO-SLAM * | Improvements
fr3/w_half | 0.4613 | 0.0304 | 0.0260 | 0.0247 | 0.0316 | 93.15%
fr3/w_rpy | 0.6824 | 0.3877 | 0.0303 | 0.0196 | 0.0382 | 94.40%
fr3/w_static | 0.4025 | 0.0084 | 0.0068 | 0.0060 | 0.0060 | 98.51%
fr3/w_xyz | 0.8572 | 0.02959 | 0.0157 | 0.0153 | 0.0150 | 98.25%
fr3/s_half | 0.0202 | 0.0162 | 0.0192 | - | 0.0159 | 21.29%
fr3/s_xyz | 0.0082 | 0.0114 | 0.0105 | - | 0.0109 | −32.93%
fr2/xyz | 0.0052 | 0.0039 | 0.0044 | - | 0.0033 | 36.54%
* Represents our system. - Represents not provided.
Table 6. Comparison of relative translation trajectory error RMSE of TUM data (translation RPE)/m.
Sequences | ORBSLAM2 | DS-SLAM | Dyna-SLAM | SG-SLAM | MOLO-SLAM | Improvements
fr3/w_half | 0.7248 | 0.0430 | 0.0370 | 0.0249 | 0.0445 | 93.86%
fr3/w_rpy | 0.9999 | 0.5727 | 0.0437 | 0.0372 | 0.0540 | 94.60%
fr3/w_static | 0.5745 | 0.0125 | 0.0116 | 0.0082 | 0.0080 | 98.61%
fr3/w_xyz | 1.2148 | 0.0458 | 0.0223 | 0.0195 | 0.0190 | 98.44%
fr3/s_half | 0.0209 | 0.0152 | 0.0287 | - | 0.0143 | 31.38%
fr3/s_xyz | 0.0105 | 0.0124 | 0.0185 | - | 0.0148 | −40.95%
fr2/xyz | 0.0033 | 0.0029 | 0.0035 | - | 0.0038 | −15.15%
Table 7. Comparison of absolute trajectory error RMSE for KITTI data (ATE)/m.
Sequences | ORBSLAM2 | Dyna-SLAM | MOLO-SLAM | Improvements
00 | 3.4251 | 3.6928 | 3.4617 | −1.07%
01 | 7.5282 | 12.1051 | 6.1746 | 17.98%
02 | 5.5369 | 5.4668 | 5.6880 | −2.73%
03 | 5.0319 | 4.7365 | 4.9563 | 1.50%
04 | 1.1186 | 1.4302 | 1.0799 | 3.46%
05 | 1.5924 | 1.2435 | 1.1454 | 28.07%
06 | 1.9360 | 2.2895 | 2.2345 | −15.42%
Table 8. Comparison of relative translation trajectory error RMSE (translation RPE)/m for KITTI data.
Sequences | ORBSLAM2 | Dyna-SLAM | MOLO-SLAM | Improvements
00 | 1.5587 | 1.6378 | 1.5723 | 0.87%
01 | 2.5115 | 4.5293 | 2.6927 | −7.21%
02 | 1.9585 | 1.9467 | 1.9177 | 2.08%
03 | 2.2782 | 2.1527 | 2.2095 | 3.02%
04 | 1.5947 | 1.8147 | 1.5723 | 1.40%
05 | 0.9675 | 0.8867 | 0.7906 | 18.28%
06 | 1.6503 | 1.9055 | 1.8968 | −14.94%
Table 9. The average calculation time for each module (ACT)/ms.
Module | Time/ms
Epipolar constraint check | 36.253
Re-projection error | 42.254
Pre-previous frame epipolar constraint check | 37.256
Dynamic confidence calculation (total) | 118.532
Mask-RCNN | 352.235
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
