
MOLO-SLAM: A Semantic SLAM for Accurate Removal of Dynamic Objects in Agricultural Environments

1 College of Engineering, South China Agricultural University, Guangzhou 510642, China
2 Guangdong Topsee Technology Co., Ltd., Guangzhou 510663, China
3 National Key Laboratory of Agricultural Equipment Technology, Guangzhou 510642, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(6), 819; https://doi.org/10.3390/agriculture14060819
Submission received: 16 April 2024 / Revised: 21 May 2024 / Accepted: 22 May 2024 / Published: 24 May 2024
(This article belongs to the Special Issue Advanced Image Processing in Agricultural Applications)

Abstract:
Visual simultaneous localization and mapping (VSLAM) is a foundational technology that enables robots to achieve fully autonomous locomotion, exploration, inspection, and more within complex environments. Its applicability also extends significantly to agricultural settings. While numerous impressive VSLAM systems have emerged, the majority of them rely on the static world assumption, which constrains their use in real dynamic scenarios and leads to increased instability when applied to agricultural contexts. To address the problem of detecting and eliminating slow dynamic objects in outdoor forest and tea garden agricultural scenarios, this paper presents a dynamic VSLAM system called MOLO-SLAM (mask ORB label optimization SLAM). MOLO-SLAM merges the ORBSLAM2 framework with the Mask-RCNN instance segmentation network, utilizing masks and bounding boxes to enhance the accuracy and cleanliness of 3D point clouds. Additionally, we used the BundleFusion reconstruction algorithm for 3D mesh model reconstruction. Comparing our algorithm with various dynamic VSLAM algorithms on our agricultural dataset and the TUM and KITTI datasets demonstrates significant improvements, with enhancements of up to 97.72%, 98.51%, and 28.07% relative to the original ORBSLAM2 on the three datasets, respectively. This showcases the outstanding advantages of our algorithm.

1. Introduction

Simultaneous localization and mapping (SLAM) [1] is playing an increasingly vital role in today’s intelligent era, and its rapid development in fields like autonomous driving, robotics, and virtual reality [2,3,4] has turned it into a major research focus worldwide. Notably, SLAM’s relevance in agriculture has also grown, making it a popular subject of study [5]. SLAM can be divided into two categories based on the sensors used: laser SLAM and vision SLAM [6]. Laser SLAM, having been extensively researched earlier, has traditionally been regarded as the primary solution for mobile robots, owing to its stability and accuracy. However, with the progress of industrialization, the high cost associated with laser SLAM has shifted focus towards vision SLAM. Vision SLAM presents several advantages, such as cost-effectiveness, ease of installation, and the abundance of environmental information it provides. Furthermore, VSLAM (visual SLAM) exhibits superior environmental perception capabilities and boasts applicability across a broader range of scenarios. Capitalizing on these benefits, a multitude of VSLAM systems with remarkable performance have emerged in recent years. Notable examples include PTAM [7], SVO [8], LSD-SLAM [9], and ORB-SLAM2 [10], among others.

1.1. Visual SLAM

Visual SLAM primarily relies on visual methods to construct 3D maps and estimate the localization of an agent within an environment. The direct method estimates motion by minimizing the photometric error of image pixels. For instance, Newcombe et al. [11] proposed the DTAM system, which performs dense pixel-wise matching across every video frame, although it does not constitute a complete visual SLAM system. MonoSLAM [12] was one of the pioneering real-time monocular VSLAM systems, utilizing an extended Kalman filter (EKF) [13] for efficient camera pose optimization. Jakob Engel et al. introduced LSD-SLAM [9], a semi-dense VSLAM system that selects salient feature points with large pixel gradients to track camera motion. Another approach is RGBD-SLAM [14], which creates real-time dense 3D point cloud maps based on a comprehensive feature point-based framework. However, the feature point proposal step in this approach can be computationally expensive and inefficient. The ORB-SLAM series, proposed by Mur-Artal et al., represents one of the most prominent and widely adopted VSLAM systems. ORB-SLAM2 [10] utilizes a quadtree-based spatial selection strategy for ORB feature points and keyframes, resulting in high accuracy and real-time performance. Building upon ORB-SLAM2, ORB-SLAM3 [15] incorporates a module for fusing inertial information and introduces a multi-map system (ATLAS [16]), significantly improving robustness and accuracy. However, the traditional VSLAM systems mentioned above primarily operate under the assumption of a static environment. Consequently, they can face substantial localization and mapping errors in highly dynamic environments, where the presence of moving objects violates the static world hypothesis. Therefore, it has become essential to investigate new VSLAM algorithms that are specifically tailored to handle such dynamic environments.
Most current visual SLAM systems are designed based on the assumption of a static environment, where the camera frames are presumed to capture static scenes. However, these systems encounter challenges when dealing with dynamic objects, as the matching of feature points on such moving entities can lead to data association errors, reducing localization accuracy. In agricultural environments, such association errors can significantly reduce the accuracy of localization and map building. To mitigate the impact of dynamic or erroneous points, existing VSLAM algorithms often incorporate various measures. For instance, random sample consensus (RANSAC) [17] is used to filter out anomalous points, and local or global bundle adjustment (BA) [18] is employed to correct position estimates and minimize the influence of matching errors. Other systems mitigate the impact by fusing additional sensors. For example, Zhao et al. [19] proposed a robust depth-assisted RGBD inertial odometry for indoor positioning that integrates an IMU. However, when the dynamic object area dominates the current frame, significant errors can occur, leading to VSLAM system failure. Therefore, enhancing the robustness of VSLAM algorithms in highly dynamic scenarios remains a significant challenge.

1.2. Dynamic SLAM

As deep learning continues to advance, the integration of SLAM with deep learning networks has become more profound, enabling us to address a wider range of problems. This synergy allows for more effective solutions and improved performance in various scenarios. One approach that exemplifies this integration is semantic prior-based dynamic SLAM, where a deep learning network is employed to segment the environmental image data obtained from the camera sensor. This provides a priori mask and label information for different objects. By utilizing this a priori information, the SLAM system can identify and remove dynamic label feature points, thereby enhancing system stability. Li et al. [20] proposed a real-time visual SLAM system based on deep learning, utilizing a multi-task feature extraction network and self-supervised feature points. They employed deep learning methods for extracting feature points to enhance accuracy. However, in a dynamic environment without the assistance of prior information, accuracy may decrease, affecting the system’s performance. Dyna-SLAM, introduced by Bescos et al. [21], combines Mask-RCNN [22] instance segmentation to obtain a priori dynamic information. It then identifies the masks of non-prior dynamic objects through multi-view geometry and region-growing algorithms while providing a background restoration method. Building upon this research, Dyna-SLAM II [23] computes the motion trajectories of multiple objects in a 3D map based on the camera’s position estimation. Similarly, DS-SLAM [24], utilizing SegNet [25], verifies object motion consistency through optical flow and epipolar constraints. It removes detected feature points within a predefined dynamic mask, although only the “person” label is treated as dynamic. VDO-SLAM [26] computes and optimizes camera poses and dynamic points while determining whether a moving object is dynamic or not by estimating the poses of dynamic and static points. However, this algorithm requires significant computation. MaskFusion [27] employs geometric segmentation to generate more accurate object boundaries, thereby overcoming the imperfection of mask boundaries. SG-SLAM [28] is a dynamic RGB-D SLAM system based on Yolact and geometric constraints. It significantly enhances positioning accuracy; however, it relies on depth maps, and accuracy issues with Yolact make it unsuitable for outdoor environments.
To address the time-consuming aspect of deep learning networks, researchers have explored efficient segmentation strategies for feature frame segmentation. For example, Detect-SLAM [29] utilizes a movement strategy that propagates four types of keypoints to overcome latency in semantic information. Dynamic-SLAM [30] improves the recall of the existing SSD [31] network by employing a leakage-detection compensation algorithm. Additionally, RDS-SLAM [32] introduces a non-blocking model that enables real-time tracking through probabilistic updating of moving objects and semantic propagation. On the other hand, DGS-SLAM [33] introduces a lightweight SLAM system based on depth-graph clustering and adaptive a priori semantic segmentation, ensuring high robustness and speed. While these methods enhance real-time performance, their failure to segment each frame may result in decreased robustness.

1.3. Visual SLAM in Agriculture

Due to the advancement in SLAM technology, its application in agriculture has become increasingly widespread. However, the complexity of the agricultural environment requires targeted algorithmic improvements. AGRI-SLAM [34] utilizes an image enhancement technique for recovering ORB point and LSD line features, which allows the algorithm to function in low-light environments. Kaiyu Song et al. proposed an optimized VINS-mono algorithm [35] that includes a key frame selection algorithm based on vertical motion smoothness validation, enhancing localization accuracy in agricultural environments. The Graph-SLAM algorithm [36] improves loop closure detection (LCD) through semantic segmentation based on grapevines and is suitable for large-scale mapping of vineyards. However, none of the aforementioned vision SLAM algorithms take the dynamic environment into consideration, even though dynamics in the agricultural environment significantly affect localization and mapping accuracy. Therefore, it is necessary to eliminate dynamic objects while the agricultural robot is operating.
The application of visual SLAM technology in agricultural robot navigation and mapping is becoming increasingly widespread. However, agricultural environments present unique challenges, including unstable lighting conditions and the presence of abundant repetitive texture regions, which make it more difficult for visual SLAM algorithms to identify feature points. Compared to urban and industrial settings, visual SLAM systems deployed in agricultural environments often exhibit poor stability of feature points. This is due to the highly repetitive nature of the environment, which renders the feature detection and matching processes more susceptible to the influence of dynamic objects. As a result, the overall system performance can be adversely affected. Additionally, during the inspection process, agricultural robots aim to acquire three-dimensional data of crops or the environment. However, the presence of dynamic objects in such scenarios can undermine the accuracy of localization and mapping, preventing the accurate reconstruction of the three-dimensional information of crops. These challenges impose rigorous requirements on visual SLAM systems operating in agricultural environments.

1.4. Differences from Other Works

The application of Mask RCNN in dynamic vision SLAM is relatively common, but a fixed a priori assignment of dynamic and static labels remains problematic: the label judgment cannot adapt to changes in an object’s motion state between image pairs. Additionally, the mask may be incomplete or undetected, and the point cloud may be disorganized. In contrast, our algorithm can dynamically judge object labels in real time and reject anomalous mask frames. This enables us to achieve higher localization accuracy and obtain a cleaner 3D point cloud. Our objective is to enable the agricultural inspection robot to achieve localization and dense reconstruction during inspections, aiming to extract 3D point cloud data and semantic information about the agricultural environment and crops while excluding the influence of dynamic objects in the scene. Since our system operates outdoors, the SLAM system needs to be compatible with both RGB-D and stereo cameras. Dyna-SLAM employs a priori masks and multi-view geometry to identify dynamic masks indoors. However, in outdoor stereo mode, it eliminates the label mask for “car”, which adversely affects accuracy. DS-SLAM is not suitable for outdoor use, as it only finds the dynamic mask for “person”. DGS-SLAM utilizes the depth map clustering method, which requires reconstructing the depth map in outdoor stereo mode; unfortunately, its accuracy cannot be guaranteed, rendering it unsuitable for outdoor use as well.
To address the challenge of detecting and eliminating dynamic objects in outdoor forest and tea garden agricultural scenarios, which affects localization and image reconstruction accuracy, we introduce a novel solution called MOLO-SLAM. This approach involves a label-optimized dynamic SLAM technique that incorporates multiple geometric constraints and dynamic confidence. Our methodology is built upon the foundation of the ORB-SLAM2 algorithm and obtaining highly accurate a priori mask information via Mask-RCNN. Additionally, we curated our own dataset, which was named the Forest-Tea-COCO Dataset, and trained the corresponding Forest-Tea-COCO Model by Mask-RCNN. This model was then used to conduct reconstruction experiments and generate semantic maps in forest–tea garden scenarios, with the aim of effectively addressing dynamic object interference in forest and tea garden environments. The primary contributions of this paper can be summarized as follows:
  • This paper proposes a novel calculation method for dynamic label confidence, which utilizes multiple geometric constraints. This approach improves SLAM accuracy by excluding masks of highly dynamic objects based on their dynamic confidence. To validate its effectiveness, experiments were conducted using our forest–tea garden datasets, and the results demonstrate that MOLO-SLAM significantly enhanced accuracy in dynamic scenes.
  • The paper also presents the design of the label consistency module and the incomplete mask contaminated frame rejection module. These modules effectively reject contaminated point cloud data and facilitate the reconstruction of high-precision 3D point cloud maps.
  • Additionally, the paper proposes a combination with the BundleFusion 3D reconstruction algorithm, resulting in the generation of a more refined 3D mesh model.
The rest of the paper is organized as follows: Section 2 describes the MOLO-SLAM algorithm system. Section 3 presents comparative experiments on the TUM and KITTI datasets, as well as tests on our own agricultural dataset. Finally, the summary and future directions of the work are presented in Section 4.

2. Materials and Methods

2.1. System Overview

First, let us provide a brief overview of the entire system, as illustrated in Figure 1. Our system was built upon ORB-SLAM2, which serves as the foundation for tracking and mapping. The process begins with RGB images and depth images obtained from sensors in either RGB-D or stereo modes. These images are then fed into the system, with RGB images being processed by two threads. The first thread, feature extraction, is responsible for extracting feature points from the current frame. The Multiple Geometric Constraints module calculates the dynamic confidence of the labels. Simultaneously, the Mask-RCNN thread provides RGB image masks, which are combined with the dynamic confidence values to identify dynamic tags with high confidence. Subsequently, the feature points within the dynamic tag masks are eliminated, while static feature points are used for tracking and map building to obtain precise position estimation. Meanwhile, the labels obtained from Mask-RCNN undergo label consistency and polluted frame removal through feature point tracking. The removed contaminated frames are excluded from the point cloud map thread. Additionally, we refer to the dense mapping system proposed by Yang et al. [37], and the positional estimation is then combined with the key frames to generate the point cloud map. Finally, the point cloud map undergoes bundle fusion to generate a mesh 3D model. This model is combined with the Mask-RCNN segmentation results to obtain the semantic point cloud map.
Compared to ORB-SLAM2 and other dynamic VSLAMs, we introduced several modules to enhance the overall system performance:
(1)
Dynamic detection module: We implemented the multiple geometric constraints dynamic detection by employing movement consistency check, reprojection error, and pre-previous frame movement consistency check. Additionally, we strengthened certain labels by appropriately increasing the number of large gradient pixel points. This enables us to compute dynamic confidence scores for each label based on weighted criteria.
(2)
In order to classify objects in the agricultural environment more efficiently, we employed augmented learning to input the agricultural dataset and the COCO [38] instance dataset into the Mask-RCNN network. This allows us to obtain a semantic labeling segmentation model for instances that is applicable to both forest and tea gardens, thereby extending the semantic labeling capabilities in these scenarios.
(3)
3D reconstruction module: To ensure accurate 3D reconstruction, we implemented label consistency and contaminated frame removal. By eliminating frames with incomplete segmentation, we prevent them from entering the point cloud threads. Consequently, the dynamically culled RGB maps and depth maps are then input into the BundleFusion system for the final 3D reconstruction.

2.2. Dynamic Feature Point Detection

The framework we employ, ORBSLAM2, relies on the identification of ORB feature points in consecutive image frames. These feature points are then matched across frames, enabling the calculation of relative poses and thereby facilitating localization. However, a notable challenge arises when feature points on dynamic objects are used for localization: such scenarios can introduce significant errors into the localization process. Consequently, it becomes imperative to address this issue by detecting and distinguishing between static and dynamic feature points.
Currently, many VSLAM algorithms employ time-consuming region-growing algorithms or cluster segmentation techniques to detect static points and then perform calculations on the depth map. However, this approach heavily relies on the depth image. To better adapt to outdoor scenes, we require a straightforward and efficient dynamic detection method that uses only RGB images. Therefore, we chose the triple weighting of motion consistency, re-projection error, and pre-previous frame consistency to construct the multiple geometric constraints module, as illustrated in Figure 2. Epipolar geometry forms the foundation of this module: given the relative change in R (rotation) and t (translation) of the camera, we analyze the behavior of spatial points such as P1, P2, and P3 in relation to their epipolar lines. In an ideal scenario, these points should fall on the epipolar line. However, due to real-world errors, the feature points in the subsequent frame deviate from the epipolar line by a certain distance, as illustrated in Figure 2. This distance is denoted as “dp”—the distance between the matching points (P1, P2, P3) and the epipolar line (dp1, dp2, dp3) between the previous and current frames. Importantly, when P3 is dynamic, its movement increases its distance dp3, resulting in a larger gap between the matching point and the epipolar line. By exploiting this property, we can establish a mechanism for verifying motion consistency. In the following section, we provide a detailed explanation of the calculation method for dynamic point detection.

2.2.1. Epipolar Constraint Check

Since the proposal of DS-SLAM, the mobile consistency module has served as the foundation for numerous dynamic SLAM research studies. Here, we also adopt the mobile consistency module as our starting point. The process involves comparing the current frame FC with the previous frame FP. In the first step, we compute the optical flow pyramid for the current frame, FC, to obtain its feature points. Then, we calculate the error of the 3 × 3 image block of the optical flow matching point and remove it if it is close to the boundary or if the error is too large. Moving on to the third step, we employ RANSAC iteration to locate the best optical flow matching feature point, allowing us to calculate and find the F (fundamental matrix). In the fourth step, we calculate the epipolar line by considering the coordinates of PC and PP, as shown in Formula (1):
P_C = \left( u_c, v_c, 1 \right)^T, \quad P_P = \left( u_p, v_p, 1 \right)^T
where u and v are the pixel coordinates of the optical flow feature points. The epipolar line is then calculated by Formula (2):
l_C = F P_C = F \begin{bmatrix} u_c \\ v_c \\ 1 \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
Finally, we calculate the epipolar distance based on the feature points and the epipolar line, and the formula for this calculation is shown in Formula (3):
D_{P_C} = \frac{\left| P_P^T F P_C \right|}{\sqrt{X^2 + Y^2}}
where DPC is the epipolar distance of point PC. When DPC is greater than the threshold ε, we consider PC to be a dynamic feature point. The algorithm flow of mobile consistency checking is shown in Algorithm 1.
Algorithm 1 Mobile consistency checking
Input: Current frame FC; previous frame FP; current feature points PC.
1 Compute the previous feature points by optical flow: PP = CalcOpticalFlowPyrLK(FC, FP, PC);
2 if PP is close to the image edge or the SSD of its 3 × 3 block is too large then remove the match from PP end if;
3 F = FindFundamentalMatrix(PC, PP, RANSAC);
4 for each matched point pair in (PC, PP) do compute lC via (2);
5 Compute the epipolar distance DPC via (3);
6 if DPC > ε then put (PCi, PPi) into the dynamic feature point set S end if.
Output: Dynamic feature point set S.
Mobile consistency detection is a dynamic outlier detection technique based on optical flow and epipolar constraints. Using the optical flow method and the F matrix, we can effectively describe the correlation between corresponding points in the two images and express their motion relationship. Nevertheless, this method has its limitations: when an object moves along the epipolar line, its epipolar distance remains small, so the distance threshold check fails to flag it. Hence, we consider introducing the re-projection error for optimization.
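To make the epipolar constraint check concrete, the following Python sketch (assuming OpenCV and NumPy; function and variable names are illustrative and not taken from the authors’ implementation) reproduces the main steps of Algorithm 1: track the current-frame points into the previous frame with the LK optical flow pyramid, estimate F with RANSAC, and flag points whose epipolar distance from Formula (3) exceeds the threshold ε.

import cv2
import numpy as np

def epipolar_dynamic_points(frame_cur, frame_prev, pts_cur, eps=1.0):
    # Track current-frame feature points into the previous frame (step 1 of Algorithm 1).
    pts_cur = pts_cur.astype(np.float32).reshape(-1, 1, 2)
    pts_prev, status, _ = cv2.calcOpticalFlowPyrLK(frame_cur, frame_prev, pts_cur, None)
    good = status.ravel() == 1
    pc = pts_cur.reshape(-1, 2)[good]
    pp = pts_prev.reshape(-1, 2)[good]
    # Fundamental matrix from the surviving matches; RANSAC rejects gross mismatches (step 3).
    F, _ = cv2.findFundamentalMat(pc, pp, cv2.FM_RANSAC)
    pc_h = np.hstack([pc, np.ones((len(pc), 1))])   # homogeneous current points
    pp_h = np.hstack([pp, np.ones((len(pp), 1))])   # homogeneous previous points
    lines = pc_h @ F.T                              # epipolar lines l_C = F * P_C, rows (X, Y, Z)
    # Point-to-line distance |P_P^T F P_C| / sqrt(X^2 + Y^2), Formula (3).
    dist = np.abs(np.sum(pp_h * lines, axis=1)) / (np.sqrt(lines[:, 0]**2 + lines[:, 1]**2) + 1e-12)
    dynamic = dist > eps                            # step 6: epipolar distance above threshold
    return pc[dynamic], pp[dynamic]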

2.2.2. Re-Projection Error

Re-projection error refers to the Euclidean distance dpe between a feature point and its transformed (re-projected) match. For abnormal points, dpe is generally relatively large, enabling their detection. The homography matrix is often used to detect abnormal points in scenes with planar features. Relatively flat tea gardens can be regarded as scenes rich in planar features, as illustrated in Figure 2, so we use H (the homography matrix), which better represents the projection of points between the two images. To compute the re-projection error, we first calculate H from the optical flow feature points under the assumption of a static background. The specific calculation then proceeds analogously to the epipolar constraint check, as outlined below:
First, we calculate the optical flow pyramid of the current frame FC, to obtain its feature point, PC. Next, we locate the matching feature point, PP, of the previous frame, FP, using the optical flow method. In the third step, we initially screen the feature point pairs, PCS and PPS, in the static background and calculate the homography matrix, H, based on the matching feature points of PCS and PPS. Finally, we utilize the H matrix to calculate the re-projection error Dre for each feature point. The calculation formula is shown in Formula (4):
P_{P\text{-}re\text{-}v} = H \begin{bmatrix} u_p \\ v_p \\ 1 \end{bmatrix} = \begin{bmatrix} u_{p\text{-}re\text{-}v} \\ v_{p\text{-}re\text{-}v} \\ w_{p\text{-}re\text{-}v} \end{bmatrix}
where PP-re-v is the re-projection of the previous-frame feature point PP into the current frame, and uP-re-v, vP-re-v, and wP-re-v are its homogeneous coordinate components, which are normalized to obtain the normalized pixel coordinates, as shown in Formula (5):
P_{P\text{-}re} = \begin{bmatrix} u_{p\text{-}re\text{-}v}/w_{p\text{-}re\text{-}v} \\ v_{p\text{-}re\text{-}v}/w_{p\text{-}re\text{-}v} \\ 1 \end{bmatrix} = \begin{bmatrix} u_{p\text{-}re} \\ v_{p\text{-}re} \\ 1 \end{bmatrix}
The Euclidean distance is used to calculate the re-projection error Dre:
D_{re} = \sqrt{(u_c - u_{p\text{-}re})^2 + (v_c - v_{p\text{-}re})^2}
When Dre is greater than the threshold e, we consider it an outlier. The algorithm flow of re-projection error outlier point detection is shown in Algorithm 2.
Algorithm 2 Re-projection error outlier point detection
Input: Current frame FC; previous frame FP; current feature points PC.
1 Compute the previous feature points by optical flow: PP = CalcOpticalFlowPyrLK(FC, FP, PC);
2 if PP is close to the image edge or the SSD of its 3 × 3 block is too large then remove the match from PC end if;
3 H = FindHomographyMatrix(PC-s, PP-s, RANSAC) // (PC-s, PP-s) are the static background matching points;
4 for each matched point pair in (PC, PP) do compute PP-re via (5) and Dre via (6);
5 if Dre > e then put (PCi, PPi) into the outlier feature point set R end if.
Output: Outlier current feature point set R.
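As a complement to Algorithm 2, the sketch below (again assuming OpenCV/NumPy; names are illustrative) estimates H from matches presumed to lie on the static background and flags matches whose re-projection error from Formulas (4)–(6) exceeds the threshold e.

import cv2
import numpy as np

def reprojection_outliers(pts_cur, pts_prev, static_cur, static_prev, e=2.0):
    # H maps previous-frame points into the current frame; it is estimated only from
    # matches assumed to belong to the static background (step 3 of Algorithm 2).
    H, _ = cv2.findHomography(static_prev.astype(np.float32),
                              static_cur.astype(np.float32), cv2.RANSAC)
    pp_h = np.hstack([pts_prev, np.ones((len(pts_prev), 1))])  # homogeneous previous points
    proj = pp_h @ H.T                                          # Formula (4): H * P_P
    proj = proj[:, :2] / proj[:, 2:3]                          # Formula (5): divide by w
    d_re = np.linalg.norm(pts_cur - proj, axis=1)              # Formula (6): Euclidean distance
    return d_re > e                                            # boolean mask of outlier matches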

2.2.3. Pre-Previous Frame Epipolar Constraint Check

Motion consistency detection is capable of detecting moving objects. However, it may fail to identify dynamic points that move along the epipolar line within the current viewing angle, leading to an omission of dynamic abnormal points. Additionally, because of the narrow roads in a forest–tea garden environment, the movement speed (v) of dynamic objects is relatively low, heightening the challenge of dynamic detection. To avoid omitting dynamic abnormal points, we contemplate introducing an additional viewing angle for motion consistency checking. However, an excessively large change in camera viewing angle can result in significant errors, which affects the continuity of moving objects. To strike a balance, we opt for motion consistency detection between the current frame, FC, and the pre-previous frame, FP-pre, to identify early abnormal points. The algorithm flow follows the consistency check above, with the inputs denoted as the current RGB frame FC, the current RGB frame’s feature points PC, and the pre-previous RGB frame FP-pre. The schematic diagram of the pre-previous frame check is depicted in Figure 2.

2.3. Dynamic Confidence Calculation Method

Once the dynamic feature points are successfully identified, the system may still fail to detect all the erroneous feature points associated with a dynamic object because of residual errors. Therefore, for the accurate localization of moving objects, it becomes necessary to incorporate prior semantic information and instance masks to eliminate all the feature points on dynamic objects. Consequently, we opted to employ the Mask-RCNN instance segmentation network, which is known for its superior accuracy in providing such prior information. This network is capable of generating instance masks and bounding boxes for various categories, offering a more streamlined approach to image detection.
Mask-RCNN exhibits superior segmentation accuracy in comparison to other instance segmentation approaches. However, a notable drawback of Mask-RCNN is its inability to effectively differentiate between static and dynamic object labels, particularly when there are transitions in the motion state of the labeled entities (e.g., from moving to stationary). Consequently, after leveraging the prior semantic information provided by Mask R-CNN, it becomes necessary to accurately detect and distinguish the dynamic objects through alternative means. Multiple geometric constraints are proposed here. We integrate three dynamic detection methods and utilize the ratio of detected outlier points to the total number of feature points from these methods. Each method is assigned a specific weight, and this weighted ratio is used to calculate the dynamic confidence of the label. We then combine the deep learning Mask-RCNN method to locate the dynamic mask and reject the feature points falling within the mask area. In this context, the proposed multiple geometric constraints method serves as a crucial step for accurately calculating the dynamic confidence of the object labels, thereby enabling the robust detection of dynamic masks. Here, we introduce the specific details of this calculation approach.
Firstly, we adopted the dynamic label checking strategy of DS-SLAM and established a threshold for the epipolar distance Em. We classified the points larger than this distance as outer points Pout, while the rest were designated as inner points Pin. The second step involves traversing each outer point Pout to determine its corresponding mask and record the number of outer points Li on the label of each mask. A label is considered dynamic when the number of outer points Li exceeds the threshold value N.
However, this calculation method has some drawbacks. If we preset too many target labels, it may result in small areas or features with insufficient prominence on the mask, leading to only a few feature points. Consequently, even though the labeling should be dynamic, the number of outer points Li might not surpass the threshold value N. This error in calculation can adversely affect the 3D reconstruction quality. To address this issue, we introduce a pseudo-feature point increase strategy. When the number of feature points on a mask Li is less than a designated threshold value M = 2N, we randomly add (M-Li) large gradient pixel points to that mask as pseudo-feature points. Figure 3 describes the specific process of dynamic confidence, and pseudo feature points are added to labels with insufficient feature points.
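The pseudo-feature point strategy can be sketched as follows (Python/OpenCV; illustrative only). The paper adds random large-gradient pixels; for simplicity, this sketch takes the strongest-gradient pixels inside the label’s mask, which is an assumption rather than the authors’ exact sampling rule.

import cv2
import numpy as np

def add_pseudo_feature_points(gray, label_mask, existing_pts, M):
    # Top up a label whose mask carries fewer than M flow points with large-gradient pixels.
    need = M - len(existing_pts)
    if need <= 0:
        return existing_pts
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    mag[label_mask == 0] = 0                        # only consider pixels on this label's mask
    order = np.argsort(mag, axis=None)[::-1][:need] # strongest gradients first (see lead-in)
    ys, xs = np.unravel_index(order, mag.shape)
    extra = np.stack([xs, ys], axis=1).astype(np.float32)
    pts = existing_pts.reshape(-1, 2) if len(existing_pts) else np.empty((0, 2), np.float32)
    return np.vstack([pts, extra])                  # mask now holds at least M = 2N points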
For calculating the initial dynamic probability Cm-Ini of mobile consistency, our gradient point increase strategy ensures that each mask has a minimum of M feature points, where we define M = 2N. With this definition, when there are N outliers, the critical value for the dynamic confidence becomes N/M = 0.5. Thus, the initial dynamic confidence of label “i” that detects LSi outliers is as follows in Formula (7):
C_{m\text{-}Ini} = L_{Si} / M
When LSi > M, Cm-Ini is set to 0.95, with the remaining 0.05 reserved as floating redundancy.
From the above analysis, we obtain the calculation method of dynamic confidence. First, we perform mobile consistency detection and the re-projection error to calculate initial confidence:
Label\_conf = \alpha C_m + \beta C_{re}
The dynamic confidence coefficient Cm of mobile consistency detection is Formula (9):
C_m = \alpha \cdot C_{m\text{-}Ini}
α is the weight of the detection method, and Cm-Ini is the initial dynamic confidence. Similarly, the re-projection confidence is Formula (10):
C_{re} = \beta \cdot C_{re\text{-}Ini}
β is the weight of the detection method, and Cre-Ini is the initial dynamic confidence of the re-projection check.
Although the threshold has been defined, the confidence at this point is determined solely by M; we therefore introduce the static–dynamic comparison coefficient B.
The contrast coefficient B is determined by the number of outlier points Pout and the number of feature points Li on the label mask, where Pout counts the outlier points prior to the inclusion of high-gradient pixel points and Li denotes the quantity of feature points on the label mask. The calculation formula for the coefficient K is as follows:
K = \frac{P_{out} / L_{Si}}{L_i / M}
The coefficient K represents the relationship between dynamic confidence and the proportion of original dynamic points. Specifically, when the initial dynamic confidence Cm-Ini is 0.5, but the proportion of original dynamic points exceeds 0.9, we tend to favor it as a dynamic label and appropriately increase the dynamic confidence. Therefore, we establish the following rules: when K is less than 0.2, the dynamic confidence is capped at 0.85; when K exceeds 2, the dynamic confidence is capped at 1.25. This leads to the derivation of the static–dynamic comparison coefficient B:
B(K) = \begin{cases} 0.8, & K \le 0.2 \\ 0.25K + 0.75, & 0.2 < K < 2 \\ 1.25, & K \ge 2 \end{cases}
Finally, the dynamic probability Cm for motion consistency is determined as follows:
C_m = B \cdot C_{m\text{-}Ini}
Because additional feature points are introduced, uncertainty increases, which leads us to introduce the uncertainty coefficient Uncer. Two factors influence this parameter: the motion level of the label, i.e., the prior motion probability ω, and the increased proportion of feature points on the mask, ξ.
Since there are 80 categories in the COCO dataset, the likelihood of motion varies across each category. Based on real-world experience, we categorized dynamics into five classes of levels, from strongest to weakest, classifying the labels into five category levels: prior dynamic (e.g., people, cats, dogs), prior static (e.g., beds, tables), intermediate-to-dynamic (e.g., cars, bicycles), intermediate-to-static (e.g., potted plants, TVs), and intermediate states (others). The motion levels of the five categories are presented in Table 1 below.
The dynamic weight of the intermediate state is set to “1”, indicating that the category prior of the label does not influence the calculation of dynamic confidence. For values where “a > b > 1”, it implies an increase in the dynamic weight of the label. Similarly, for values where “1 > c > d”, it indicates an increase in the static weight of the label. Here, “ω” represents the dynamic probability weight. The motion level can be adjusted according to different datasets in different scenarios. For example, in the TUM indoor dynamic dataset, where the object motion range is small, the dynamic object motion coefficient is increased accordingly. On the other hand, in the KITTI outdoor data, where the dynamic range is large, the dynamic object motion coefficient is decreased accordingly.
To successfully detect dynamic labels, we introduced pseudo feature points to ensure a minimum of M optical flow points on the mask. However, the added points are not necessarily detected as corner points but rather are pixels with significant gradients, so introducing too many of them brings a certain degree of unreliability. Consequently, when the number of genuine mask optical flow feature points on the label is m, ξ should not influence the dynamic confidence, and we set ξ(m) = 1. When the number of mask optical flow feature points on the label is 0 and the added points are considered less reliable, we stipulate that the label’s status is determined only when at least 2/3·m of the feature points ascertain the status; hence, ξ(0) is set to 2/3. We employ the following linear model between these two endpoints:
\xi(x) = \frac{1}{3m}x + \frac{2}{3}, \quad x \in [0, m]
x is the number of optical flow feature points on the mask. So, we define the uncertainty coefficient Uncer of the label as
Uncer = \omega \cdot \xi
If the calculated Label_conf is uncertain, i.e., Label_conf ∈ [0.35, 0.65] with a large Uncer, a second motion consistency check is started. The second motion consistency check is the optical flow motion consistency dynamic detection between the current frame and the pre-previous frame (Section 2.2.3). After its introduction, the dynamic confidence becomes Formula (13):
Label\_conf = Uncer \cdot (\alpha C_m + \beta C_{re} + \sigma C_{m2})
We set the initial weights as α = 0.7 and β = 0.3; after introducing the quadratic weight, we set α = 0.6, β = 0.2, and σ = 0.2. This means that we still place the most trust in the motion consistency check, while the other methods are introduced to stabilize the dynamic confidence coefficient.
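The following Python sketch strings Formulas (7)–(13) together into one label-confidence routine. It is one possible reading of the calculation, not the authors’ implementation: the outlier counts from the three checks, the pre-padding counts Pout and Li, the prior weight ω from Table 1, and the identification of the ξ endpoint m with M are all passed in or assumed explicitly.

def initial_conf(n_out, M):
    # Formula (7): initial confidence of a label with n_out outliers among M padded points,
    # capped at 0.95 with 0.05 floating redundancy.
    return 0.95 if n_out > M else n_out / M

def contrast_B(p_out, ls_i, l_i, M):
    # Formulas (11)-(12): static-dynamic comparison coefficient B(K).
    K = (p_out / max(ls_i, 1)) / (max(l_i, 1) / M)
    if K <= 0.2:
        return 0.8
    return 1.25 if K >= 2.0 else 0.25 * K + 0.75

def label_confidence(ls_m, ls_re, ls_m2, p_out, l_i, M, omega):
    # ls_m, ls_re, ls_m2: outliers found by the motion consistency, re-projection and
    # pre-previous frame checks; p_out, l_i: outliers and flow points before padding.
    c_m = contrast_B(p_out, ls_m, l_i, M) * initial_conf(ls_m, M)
    c_re = initial_conf(ls_re, M)
    conf = 0.7 * c_m + 0.3 * c_re                    # first pass, alpha = 0.7, beta = 0.3
    if 0.35 <= conf <= 0.65:                         # uncertain label: run the second check
        xi = min(l_i / (3.0 * M) + 2.0 / 3.0, 1.0)   # linear model, taking m = M (assumption)
        uncer = omega * xi
        c_m2 = initial_conf(ls_m2, M)
        conf = uncer * (0.6 * c_m + 0.2 * c_re + 0.2 * c_m2)   # Formula (13)
    return conf                                      # label treated as dynamic if conf > 0.5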

2.4. Label Consistency Algorithm and Pollution Frame Removal

Although Mask-RCNN and multiple geometric constraints are employed for dynamic object detection, to achieve more accurate 3D point cloud information, it is still necessary to process certain contaminated frames. Hence, we introduce the label consistency algorithm and pollution frame removal as part of our approach. The overall process is shown in Figure 4.

2.4.1. Label Consistency Algorithm

During optical flow calculations, we want to establish connections between consecutive frames. However, in the instance segmentation process, only the image information of the current frame is considered for image processing, without incorporating the labels from the preceding and subsequent frames. As a result, inconsistencies may arise between the labels of consecutive frames, potentially leading to confusion and ambiguity.
To achieve label alignment, we adopted the method of optical flow feature point matching, wherein the ORB feature points of the current frame are tracked to their corresponding matching points in the previous frame using the optical flow method, thereby establishing data association between consecutive frames. The specific procedure is as follows:
(1)
While traversing each optical flow feature point’s mask, we simultaneously associate them with the corresponding label attributes. For example, if the label of the current frame (FC) is classified as a dynamic label, we track the optical flow feature points matched by FP using these attributes, e.g., person1.
(2)
We then traverse the optical flow feature points tracked by FP and determine the attributes of the feature points’ label on the mask of FP. When most of the points’ label attributes align with a certain label, e.g., person2, we synchronize the label attributes from the previous frame to the next frame, i.e., person1→person2.
To prevent the continuous propagation of synchronization errors among specific frames, this algorithm restricts its operation to key frames. Specifically, for each key frame, we initiate the label consistency operation; upon detecting the subsequent key frame, it serves as a fresh starting point for the labeling process.
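A minimal sketch of this majority-vote label alignment is given below (Python; the data layout and names are assumptions for illustration): each tracked optical flow point contributes one (current instance, previous instance) vote, and a current-frame instance adopts the previous-frame identity only when a clear majority of its points agree.

from collections import Counter

def align_instance_labels(point_matches):
    # point_matches: list of (cur_instance, prev_instance) pairs, one per tracked flow point,
    # obtained by looking up which mask each point falls into in the two frames.
    votes = {}
    for cur_inst, prev_inst in point_matches:
        votes.setdefault(cur_inst, Counter())[prev_inst] += 1
    mapping = {}
    for cur_inst, counter in votes.items():
        best_prev, count = counter.most_common(1)[0]
        if count > sum(counter.values()) / 2:   # only relabel on a clear majority
            mapping[cur_inst] = best_prev       # e.g. person1 -> person2 as in the text
    return mapping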

2.4.2. Pollution Frame Removal

Mask-RCNN demonstrates exceptional edge coverage detection capability and typically achieves high MIOU (mean intersection over union) for image detection tasks. However, when confronted with camera-acquired images, such as those obtained in SLAM (simultaneous localization and mapping), the reduced resolution of the camera and the motion blur during image acquisition can have a detrimental effect on Mask-RCNN’s masking performance. Consequently, in certain scenarios, the detection may be incomplete, leading to missed detections.
To prevent incomplete and missing mask detection from polluting the point cloud and undermining the reconstruction of the 3D terrain, a strategy is needed to eliminate these polluted frames. Our approach involves a relatively simple pollution frame elimination strategy. By ensuring label consistency between the preceding and subsequent frames, we can unify them. Since false detections and missed detections occur only in a small number of cases, we utilize the change in area of the detection boxes between the preceding and subsequent frames to identify contaminated frames. In continuous image sequences, a significant change in the area of the detection box for a target instance indicates a mask failure or disappearance, resulting in a decrease in area. Therefore, frames with a high probability of being polluted are identified based on this criterion. Specifically, we set a minimum value, A, for the detection box area; if the detection box area falls below A, or if the area of the detection box in the next frame changes by more than half compared to the previous frame, we classify the frame as polluted. Such polluted frames are excluded from the point cloud thread in the subsequent stages. However, to ensure tracking integrity, the polluted frame is not removed from the tracking thread.
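A sketch of this check is shown below (Python; the box format, the minimum area value, and the symmetric treatment of growth and shrinkage are illustrative assumptions rather than the authors’ exact settings).

def is_polluted_frame(prev_boxes, cur_boxes, min_area=400, ratio=0.5):
    # prev_boxes / cur_boxes: dicts instance_id -> (x, y, w, h) for label-aligned frames.
    # A frame is flagged when any tracked instance's detection box disappears, drops below
    # the minimum area, or changes area by more than half relative to the previous frame.
    for inst, (x, y, w, h) in prev_boxes.items():
        prev_area = w * h
        if inst not in cur_boxes:
            return True                                  # mask disappeared entirely
        cw, ch = cur_boxes[inst][2], cur_boxes[inst][3]
        cur_area = cw * ch
        if cur_area < min_area or abs(cur_area - prev_area) > ratio * prev_area:
            return True
    return False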

2.5. Point Cloud Map and Surface Reconstruction

To ensure comprehensive coverage of dynamic objects and their feature points, we employ a mask expansion (dilation) process. After expanding the mask, we remove the corresponding masked area from both the RGB image and depth image, enabling real-time input to the point cloud thread for the static map reconstruction of the 3D point cloud. However, despite these efforts, the point cloud may still contain a few polluted points, which can be addressed through 3D mesh reconstruction. To achieve this, we save the RGB image and depth image of the dynamic mask area locally and utilize the BundleFusion algorithm for offline 3D reconstruction. The optimized SLAM system provides the localization data, which is combined with the RGB images and depth maps after dynamic mask rejection and fed into the BundleFusion algorithm in RGB-D mode. This approach results in a 3D mesh model with high precision, purity, and a small memory footprint, as demonstrated in Figure 5. We have named this dynamic SLAM system MOLO-SLAM (mask ORB label optimization SLAM).
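The mask expansion and culling step can be sketched as follows (Python/OpenCV; the kernel size and the use of zeroed pixels to mark removed regions are assumptions, not the authors’ exact settings).

import cv2
import numpy as np

def cull_dynamic_regions(rgb, depth, dynamic_mask, dilate_px=10):
    # Dilate the dynamic mask so it also covers feature points near object boundaries,
    # then blank the region in both the RGB and depth images before they reach the
    # point cloud thread (zero depth is simply skipped during back-projection).
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (dilate_px, dilate_px))
    expanded = cv2.dilate(dynamic_mask.astype(np.uint8), kernel)
    rgb_out = rgb.copy()
    depth_out = depth.copy()
    rgb_out[expanded > 0] = 0
    depth_out[expanded > 0] = 0
    return rgb_out, depth_out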

3. Results

3.1. Experiments Equipment and Process

To verify the effectiveness and performance of the MOLO-SLAM and 3D reconstruction methods mentioned earlier, a series of experimental tests was conducted. The algorithm is specifically designed to address the challenges of stable and efficient VSLAM in agricultural, forest, and tea garden scenes, which are often complex and dynamic environments.
We aimed to achieve accurate three-dimensional reconstruction of these scenes to support the navigation of agricultural inspection robots and gather essential three-dimensional information about crops. To achieve this, we needed to conduct outdoor experiments using specialized equipment, as shown in Figure 6. The experimental equipment consisted of an independently designed tracked multi-terrain adaptive chassis, equipped with an IMU, a 3D LIDAR, and the Kinect V2 depth camera. Additionally, a high-performance PC was used as a data collection and computing platform, capable of gathering RGB images, depth images, Lidar data, and IMU data.
We used this set of equipment to create agricultural dynamic datasets at South China Agricultural University’s forest and tea garden. We tested our algorithm in dynamic agricultural environments by deliberately introducing dynamic objects into the forest and tea garden scenes. The datasets evaluate the system’s ability to handle dynamic masks and reconstruct static scenes. Figure 7 shows three sequences in which “person” and other labels represent dynamic and static objects; to judge transitions in an object’s motion state, a bicycle, which can conveniently be parked, was used as an intermediate-state label. Sequences (a) and (b) are intentionally designed dynamic datasets recorded in the forest and tea plantation scenarios, respectively. After the camera travels for a certain distance in the static environment, a person pushing a bicycle walks into the camera view from the side; after moving for a certain distance, the person puts the bicycle down, simulating a dynamic label transitioning to a static one, and then leaves the camera view. Sequence (c) simulates a more complex scene in which two people enter the camera view in turn: one walks for a period of time and then takes the bicycle away, simulating a static label transitioning to a dynamic one, while the other first enters pushing the bicycle and later puts it down and leaves, simulating a dynamic label transitioning to a static one; both kinds of transitions appear in the same dataset. The relevant starting threshold parameters for the operation of the system in this paper are shown in Table 2.
Once data collection was completed, we ran the MOLO-SLAM algorithm offline on the Legion 7i to conduct a series of experiments on the localization and 3D reconstruction of forest and tea garden scenes. These experiments allowed us to assess the algorithm’s performance and its applicability to real-world scenarios. All the experiments were carried out on a system with an Intel i7-12700 CPU, an NVIDIA 3060 graphics card, TensorFlow-GPU 2.2.4, and the Ubuntu 20.04 operating system. The Mask-RCNN instance segmentation model used was trained on the COCO datasets, specifically the Mask-RCNN-COCO model. Since this paper utilizes a large, parameter-heavy Mask-RCNN segmentation model along with a relatively complex geometric dynamic detection method, and since our objective was to investigate the accuracy of the method, the system was executed offline to evaluate its impact on each dataset. Additionally, for quantitative evaluation, we assessed it on the TUM and KITTI datasets, which provide ground truth position and attitude data. We conducted comparisons with ORB-SLAM2, as well as state-of-the-art (SOTA) dynamic SLAM systems, namely, DS-SLAM and Dyna-SLAM.

3.2. Dynamic Confidence and Dynamic Detection Experiments

We input the collected data into our system, following the procedure described earlier. Our system utilizes multiple motion consistency checks and outlier point detection to calculate the dynamic confidence of each label. It then identifies and removes the dynamic masks, resulting in dynamic-free pose estimation. As shown in Figure 8, whether in a forest garden or a tea garden scene with abundant planar features, the system successfully detected the dynamic labels “person” and “bicycle”. When the bicycle came to a stop, “bicycle” was accurately identified as a static label, demonstrating the stability of the system. It is worth noting that the dynamic labels “bicycle” and “person” used as examples are not treated specially among the dataset labels; in practical scenarios, the algorithm is therefore capable of detecting and excluding other dynamic objects during computation.
We added semantic static labels (road, Tea_garden, tree, Other_obstacles) to the collected agricultural datasets. Through augmented learning, we inputted the agricultural datasets and COCO instance datasets into the Mask-RCNN network for hybrid training. We obtained the Forest–Tea-COCO-Model and integrated it into MOLO-SLAM. The detailed workflow is depicted in Figure 9. Thanks to adaptive improvements in agricultural scenarios, our algorithm successfully detected the dynamic mask and effectively removed it, demonstrating its capability to handle dynamic objects in real-world scenarios.
The trained model was then validated by inputting images of forest and tea gardens for Forest–Tea-COCO-Model validation; the results are shown in Figure 10. In the agricultural scenario of forest and tea gardens, our model was able to segment the static semantic labels and dynamic instance labels effectively. Whether the scene was static or moving, the model accurately identified and segmented static and dynamic objects of various categories.

3.3. Pollution Frame Removal Tests

We performed label consistency alignment on these two datasets, ensuring that the frame sequence labels between keyframes were consistent. Next, we leveraged the unified label information to calculate changes in detection frames and identify contaminated frames. These contaminated frames may contain misidentified objects, leading to inconsistencies in the reconstruction of depth maps and the resulting three-dimensional point cloud. For the validation of contaminated frame rejection and point cloud reconstruction, we primarily focused on the RGB-D mode using our datasets. In this experiment, we inputted the aligned labels into the contaminated frame rejection algorithm to identify and detect the contaminated frames. Subsequently, we selected five key frames above and below these contaminated frames for the point cloud reconstruction experiment. A comparison was made between the effects of rejecting the contaminated frames and leaving them unremoved in the 3D point cloud. The outcomes of this experiment are shown in Figure 11 and Figure 12.
Based on the results, we can observe that Sequence1, Sequence2, and Sequence3 successfully detected the contaminated frames. Furthermore, the contaminated point cloud, labeled in the red box in Figure 12 corresponding to the contaminated frames, was effectively removed, resulting in a relatively pure 3D point cloud. This demonstrates the beneficial role of our algorithm in maintaining data accuracy and enhancing the quality of the reconstructed 3D point cloud.

3.4. Pose Estimation Experiment

To evaluate the performance of our SLAM system, we conducted tests on our datasets and the TUM datasets. Lidar SLAM, with its wide viewing angle, high accuracy, and extensive depth range, excels at capturing static point clouds in large scenes and is less prone to interference from dynamic objects than general visual SLAM algorithms. However, the trade-off is a sparse map and higher cost. To evaluate our algorithm, we used the Fast-LIO2 laser SLAM algorithm, built with 3D lidar, as a reference. Additionally, we evaluated our algorithm on the TUM datasets, selecting seven sequences for testing: four dynamic sequences and three static sequences. We compared DS-SLAM, Dyna-SLAM, and our MOLO-SLAM system on the TUM datasets. All of these systems are dynamic VSLAM systems based on ORBSLAM2.
Note that we specifically focused on the reconstruction of the static environment, since we needed to subsequently generate 3D point cloud maps. As a result, we set the motion level of the “person” label to infinite in order to ensure its elimination. In other cases, the motion level could be adjusted to utilize feature points on a static “person”; however, in our scenario, we chose to eliminate them. Simultaneously, we unconditionally remove a mask if its average depth exceeds 10 m or if its bounding box is too small.
To assess the performance, we utilized several evaluation metrics: absolute trajectory error (ATE) for evaluating global errors, relative pose error (RPE) for relative local error evaluation, and root mean square error (RMSE) for aggregating values over a sequence. The ATE of the i-th frame is calculated as shown in Formula (17):
F_i = Q_i^{-1} S P_i
where Fi is the error value of the i-th frame pose; Pi ∈ SE(3) is the estimated pose of the i-th frame of the algorithm; Qi ∈ SE(3) is the real pose of the i-th frame; and S ∈ SE(3) is the transformation matrix from the estimated pose to the real pose, obtained by the least squares method.
The calculation formula of RPE is shown in Formula (18):
E_i = \left( Q_i^{-1} Q_{i+\Delta} \right)^{-1} \left( P_i^{-1} P_{i+\Delta} \right)
where Ei is the error value of the i-th frame pose; Pi ∈ SE(3) is the estimated pose of the i-th frame of the algorithm; Qi ∈ SE(3) is the real pose of the i-th frame; and Δ is the fixed frame interval over which the relative pose difference is compared with the ground truth.
We introduced the Improvements value to evaluate the accuracy improvement of our algorithm relative to ORBSLAM2. Formula (19) for the calculation of the Improvements value is as follows:
K = (1 - \tau / \psi) \times 100\%
where K is the value of Improvements; τ is the evaluation value of MOLO-SLAM; and ψ is the evaluation value of ORBSLAM2, which we record as the baseline. To assess the ATE and RPE values for ORB-SLAM2 and MOLO-SLAM in the forest–tea garden setting, we used FAST-LIO2 as a reference. The evaluation results are presented in Table 3 and Table 4. Figure 13 displays the FAST-LIO2 mapping and the ATE positioning error map evaluated with the EVO tool on the agricultural dynamic sequences. According to the quantitative data, our algorithm achieved an improvement surpassing 80% in complex agricultural dynamic environments, demonstrating accuracy comparable to that of the laser SLAM algorithm in this specific setting.
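For reference, a minimal sketch of this evaluation is given below (Python/NumPy; it aligns the trajectories with an SVD-based least-squares rigid transform, which is one standard way to obtain the matrix S in Formula (17), and computes the Improvements value of Formula (19)).

import numpy as np

def ate_rmse(gt_xyz, est_xyz):
    # Align the estimated trajectory to ground truth with a least-squares rigid transform
    # (no scale), then report the RMSE of the residual translations.
    mu_g, mu_e = gt_xyz.mean(0), est_xyz.mean(0)
    U, _, Vt = np.linalg.svd((gt_xyz - mu_g).T @ (est_xyz - mu_e))
    R = U @ Vt
    if np.linalg.det(R) < 0:                  # keep a proper rotation
        U[:, -1] *= -1
        R = U @ Vt
    aligned = (R @ (est_xyz - mu_e).T).T + mu_g
    return float(np.sqrt(np.mean(np.sum((gt_xyz - aligned) ** 2, axis=1))))

def improvement(molo_value, orb_value):
    # Formula (19): relative improvement over ORB-SLAM2 in percent.
    return (1.0 - molo_value / orb_value) * 100.0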
In the evaluation of TUM data, our algorithm was quantitatively compared with ORBSLAM2, DS-SLAM, and Dyna-SLAM. In the comparison of ATE evaluation values in Table 5, our MOLO-SLAM system outperformed in the highly dynamic fr3/w_static and fr3/w_xyz sequences. For the other two highly dynamic sequences, fr3/w_half and fr3/w_rpy, Dyna-SLAM achieved the best results. Notably, Dyna-SLAM also performed remarkably well in the static sequences, fr3/s_half and fr2/xyz. These results suggest that our system did not significantly negatively impact the static sequences. As shown in Table 6, the performance of translational RPE in both dynamic and static sequences also demonstrated positive outcomes. Analyzing the data, we found that Dyna-SLAM excelled in high dynamic sequences, but our algorithm closely approached the performance of Dyna-SLAM. Moreover, when comparing DS-SLAM and ORBSLAM2, our algorithm showcased excellent optimization in dynamic sequences while maintaining a minimal negative impact on static sequences. Overall, these results demonstrate the effectiveness of our algorithm. Figure 14 presents the ATE trajectory effect graphs for the four highly dynamic sequences.
Similarly, we also provide the evaluation results of the KITTI dataset, which are shown in Table 7 and Table 8. Based on the data, it can be concluded that the performance improvement amounted to 28.07%.

3.5. Point Cloud Reconstruction and Mesh Reconstruction Experiments

We utilized the keyframe mechanism of ORBSLAM2 to generate the 3D point cloud reconstruction. The specific process is as follows: First, the ORBSLAM2 system runs to determine the keyframes and record the pose of each frame. Then, the depth map and RGB map of each frame form a local point cloud. Each local point cloud is transformed by the pose of the corresponding keyframe, and all local point clouds are fused together to form the global point cloud reconstruction, as shown in Formula (20):
P_E = \sum_{i=1}^{n} \left( R_i P_i + t_i \right)
where Pi denotes the local point cloud, PE denotes the global point cloud map, Ri denotes the rotation matrix, and ti denotes the translation vector of the corresponding keyframe camera with respect to a reference coordinate system, as determined by the SLAM system. The reference coordinate system is typically established based on the pose of the system in the first frame. In our MOLO-SLAM system, this process is facilitated by the label consistency module and the contaminated frame rejection module; similar to the ORBSLAM2 system, key frames are utilized to generate the point cloud. However, if a key frame is identified as contaminated, we select the nearest non-contaminated frame and substitute it into the point cloud system.
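A sketch of this fusion step is given below (Python/NumPy; it assumes the keyframe poses are available as 4 × 4 camera-to-world matrices and ignores voxel filtering and color handling).

import numpy as np

def fuse_point_clouds(local_clouds, poses):
    # Formula (20): transform each keyframe's local cloud P_i by its pose (R_i, t_i)
    # and concatenate everything into the global map.
    # local_clouds: list of (N_i, 3) arrays; poses: list of 4x4 camera-to-world matrices.
    world_points = []
    for cloud, T in zip(local_clouds, poses):
        R, t = T[:3, :3], T[:3, 3]
        world_points.append(cloud @ R.T + t)   # R_i * P_i + t_i for every point
    return np.vstack(world_points)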
After conducting the experiment on outdoor dynamic culling, we proceeded with the experiments on 3D point cloud reconstruction in the forest tea garden scene, and the experimental results are presented in Figure 15. In the case of the original ORBSLAM2 localization, as depicted in Figure 15a,f, the position estimation was adversely affected by the presence of dynamic objects; consequently, the resulting 3D point cloud exhibited chaotic and inaccurate characteristics, failing to accurately reconstruct the terrain topography of the scene. However, after introducing our dynamic VSLAM system, MOLO-SLAM, as depicted in Figure 15b,g, both the localization accuracy and the accuracy of the dense map reconstruction exhibited significant improvements. Our system successfully mitigated the influence of dynamic objects and accurately reconstructed the 3D terrain topography in the forest and tea plantation scenario. Additionally, during the transition of the bicycle from dynamic to static, as shown in Figure 15c,h, our algorithm effectively identified and preserved the static part of the bicycle while eliminating the dynamic part, thus further substantiating the effectiveness of our algorithm. Moreover, we extracted the depth maps processed by MOLO-SLAM along with RGB maps, which were subsequently utilized as input for the BundleFusion 3D reconstruction algorithm. The result is an enhanced-accuracy 3D mesh model with reduced space occupation, as illustrated in Figure 15d,i. Simultaneously, when combined with the Forest–Tea-COCO-Model obtained through augmented learning, a semantic point cloud map can be derived. By utilizing the semantic map, a 3D point cloud with semantic labels can be used to implement inspection requirements for various objects, as shown in Figure 15e,j.

3.6. Time Efficiency Analysis

To analyze the temporal efficiency of the dynamic confidence calculation module, we evaluated the time performance of each component on the agricultural dataset using our research platform and recorded the average computation time of each module in Table 9. "Dynamic confidence calculation (total)" represents the aggregate time cost of the three dynamic detection methods. Compared with the DS-SLAM approach, which uses only an epipolar constraint check, the overall time cost of our framework is higher. The multi-view geometry stage of DynaSLAM [21] was reported to take approximately 200–300 ms, although under a different experimental setup. While our geometric checks have lower computational overhead than that, we also incorporate the Mask R-CNN model, which has a large number of parameters, together with the multiple-geometric-constraints technique to improve segmentation accuracy; this integration inevitably increases the time cost. The algorithm presented here therefore still has considerable room for real-time optimization. The trade-off between accuracy and computational efficiency remains an active research area, and further algorithmic and hardware-level optimizations may be needed to achieve the real-time capability required for practical deployment.

4. Discussion

In this work, we present MOLO-SLAM (mask ORB label optimization SLAM), a deep-learning-based dynamic object culling VSLAM system that enhances ORBSLAM2 by integrating the Mask-RCNN instance segmentation algorithm. Our system is specifically tailored to operate effectively in dynamic agricultural environments. We introduce several key improvements:
(1)
Enhancing the front-end feature extraction module of ORBSLAM2: we integrate it with the Mask-RCNN instance segmentation network to identify dynamic object masks and selectively remove feature points from those regions, thereby improving the visual odometry estimation (see the first sketch after this list).
(2)
Introducing a multiple-geometric-constraints and dynamic confidence module: this module combines an epipolar geometric consistency check, re-projection error analysis, and a pre-previous-frame geometric consistency check to identify anomalous feature points under different thresholds and to compute a dynamic confidence for each label. Labels with a confidence greater than 0.5 can then be reliably identified as dynamic.
(3)
Proposing label consistency and contaminated-frame rejection in the point cloud reconstruction pipeline: frames with Mask-RCNN segmentation errors are removed during point cloud reconstruction, resulting in a cleaner and more accurate point cloud, after which the BundleFusion algorithm is employed for 3D mesh model reconstruction (see the second sketch after this list).
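As a concrete illustration of item (1), the sketch below discards ORB feature points that fall inside the dynamic instance masks produced by Mask-RCNN. It assumes OpenCV-style keypoints and a single merged binary mask; it is a simplified stand-in, not the authors' implementation.

```python
def filter_dynamic_keypoints(keypoints, descriptors, dynamic_mask):
    """Drop ORB keypoints that fall on pixels segmented as dynamic.

    keypoints:    list of cv2.KeyPoint objects (kp.pt = (x, y) in pixels).
    descriptors:  (K, 32) ORB descriptor array, row-aligned with `keypoints`.
    dynamic_mask: (H, W) boolean array, True where Mask-RCNN marked a dynamic instance.
    """
    h, w = dynamic_mask.shape
    keep = []
    for idx, kp in enumerate(keypoints):
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        # Discard the point only if it lies inside the image and on a dynamic pixel.
        if 0 <= v < h and 0 <= u < w and dynamic_mask[v, u]:
            continue
        keep.append(idx)
    kept_kps = [keypoints[i] for i in keep]
    kept_desc = descriptors[keep] if descriptors is not None else None
    return kept_kps, kept_desc
```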
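For item (3), one cue listed in Table 2 is the detection-box area variation ratio for contaminated frames (A = 1/2). The hypothetical check below flags a keyframe as contaminated when the bounding box of a tracked instance changes area too abruptly between neighbouring frames; the actual module combines this with further label-consistency cues.

```python
def is_contaminated(prev_boxes, curr_boxes, area_ratio_threshold=0.5):
    """Flag the current keyframe if a tracked detection box changes area too abruptly.

    prev_boxes, curr_boxes: dicts mapping an instance label/track id to a box
                            (x_min, y_min, x_max, y_max) in pixels.
    A box whose area shrinks or grows beyond the allowed ratio (A = 1/2 in
    Table 2) suggests a missed or spurious Mask-RCNN segmentation.
    """
    def area(box):
        x0, y0, x1, y1 = box
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)

    for label, curr in curr_boxes.items():
        prev = prev_boxes.get(label)
        if prev is None:
            continue  # newly appeared instance; handled by other cues
        a_prev, a_curr = area(prev), area(curr)
        if a_prev == 0 or a_curr == 0:
            return True
        if min(a_prev, a_curr) / max(a_prev, a_curr) < area_ratio_threshold:
            return True
    return False
```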
We established a comprehensive agricultural SLAM dataset, which includes synchronized RGB images, depth images, LiDAR point clouds, and inertial measurement unit (IMU) data. In addition, we compared our approach against several state-of-the-art dynamic VSLAM algorithms on the TUM and KITTI benchmarks. The results show significant improvements over the original ORBSLAM2: up to 97.72% in trajectory estimation accuracy on the forest–tea garden agricultural dataset, up to 98.51% on the TUM dataset, and up to 28.07% on the KITTI dataset. Furthermore, our algorithm has minimal negative impact on static sequences, and the point cloud experiments yielded cleaner and more accurate 3D point clouds and 3D mesh models. We also conducted comprehensive data acquisition experiments in forest and tea plantation environments for camera pose estimation and 3D reconstruction, achieving excellent results in terms of the generated 3D point clouds and mesh models in these challenging scenarios. Finally, we discussed extended applications of the algorithm, showing how it can support other robotic perception and mapping functions.
Our algorithm has shown promising results in dynamic scenarios. However, the use of Mask-RCNN makes real-time operation difficult because of its computational demands and offline processing. In future research, we aim to address this by replacing the instance segmentation network with a lighter, simpler alternative capable of real-time processing, which will enhance the practicality and applicability of our system in dynamic environments. Moreover, as AI models become smaller, GAN-based inpainting could be used to predict and fill the holes left after removing dynamic objects, allowing the environmental point cloud to be reconstructed more completely.

Author Contributions

Data curation, J.L. (Jinhong Lv), B.Y., S.S. and Q.L.; funding acquisition, W.W.; project administration, W.W.; formal analysis, W.W., J.L. (Jinhong Lv), B.Y., C.G., H.G. and J.L. (Junlin Li); investigation, B.Y., J.L. (Jinhong Lv), S.S. and Q.L.; methodology, W.W., B.Y., J.L. (Jinhong Lv), H.G., Q.L., S.S. and Q.L.; software, W.W., J.L. (Jinhong Lv), B.Y., C.G. and S.S.; supervision, B.Y., C.G., H.G. and Q.L.; resources, C.G., J.L. (Junlin Li), S.S. and Q.L.; validation, W.W., J.L. (Jinhong Lv), J.L. (Junlin Li) and Q.L.; writing—original draft, B.Y., J.L. (Jinhong Lv), C.G., J.L. (Junlin Li), H.G. and S.S.; writing—review and editing, W.W., J.L. (Jinhong Lv) and B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the 2024 Rural Revitalization Strategy Special Funds Provincial Project (2023LZ04), and Research and Development of Intelligence Agricultural Machinery and Control Technology (FNXM012022020-1-03).

Data Availability Statement

The data presented in this study are available in the article.

Acknowledgments

The authors thank the editors and reviewers for their constructive comments and support of this work.

Conflicts of Interest

Author Haijun Guo is employed by Guangdong Topsee Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  2. Reitmayr, G.; Langlotz, T.; Wagner, D.; Mulloni, A.; Schall, G.; Schmalstieg, D.; Pan, Q. Simultaneous localization and mapping for augmented reality. In Proceedings of the 2010 International Symposium on Ubiquitous Virtual Reality, Gwangju, Republic of Korea, 7–10 July 2010; pp. 5–8. [Google Scholar]
  3. Singandhupe, A.; La, H.M. A review of slam techniques and security in autonomous driving. In Proceedings of the 2019 third IEEE International Conference on Robotic Computing (IRC), Naples, Italy, 25–27 February 2019; pp. 602–607. [Google Scholar]
  4. Yousif, K.; Bab-Hadiashar, A.; Hoseinnezhad, R. An overview to visual odometry and visual slam: Applications to mobile robotics. Intell. Ind. Syst. 2015, 1, 289–311. [Google Scholar] [CrossRef]
  5. Ding, H.; Zhang, B.; Zhou, J.; Yan, Y.; Tian, G.; Gu, B. Recent developments and applications of simultaneous localization and mapping in agriculture. J. Field Robot. 2022, 39, 956–983. [Google Scholar] [CrossRef]
  6. Bresson, G.; Alsayed, Z.; Yu, L.; Glaser, S. Simultaneous localization and mapping: A survey of current trends in autonomous driving. IEEE Trans. Intell. Veh. 2017, 2, 194–220. [Google Scholar] [CrossRef]
  7. Klein, G.; Murray, D. Parallel tracking and mapping for small ar workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 225–234. [Google Scholar]
  8. Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. Svo: Semidirect visual odometry for monocular and multicamera systems. IEEE Trans. Robot. 2016, 33, 249–265. [Google Scholar] [CrossRef]
  9. Engel, J.; Schöps, T.; Cremers, D. Lsd-slam: Large-scale direct monocular slam. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 834–849. [Google Scholar]
  10. Mur-Artal, R.; Tardos, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  11. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. Dtam: Dense tracking and mapping in real-time. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2320–2327. [Google Scholar]
  12. Davison, A.J.; Reid, I.D.; Molton, N.D.; Stasse, O. Monoslam: Real-time single camera slam. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1052–1067. [Google Scholar] [CrossRef] [PubMed]
  13. Yan, L.; Zhao, L. An approach on advanced unscented kalman filter from mobile robot-slam. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 43, 381–389. [Google Scholar] [CrossRef]
  14. Endres, F.; Hess, J.; Sturm, J.; Cremers, D.; Burgard, W. 3D mapping with an rgb-d camera. IEEE Trans. Robot. 2013, 30, 177–187. [Google Scholar] [CrossRef]
  15. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.; Tardos, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  16. Elvira, R.; Tardos, J.D.; Montiel, J.M. Orbslam-atlas: A robust and accurate multi-map system. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 6253–6259. [Google Scholar]
  17. Yoo, J.; Borselen, R.; Mubarak, M.; Tsingas, C. Automated first break picking method using a random sample consensus (ransac). In Proceedings of the 81st EAGE Conference and Exhibition 2019, London, UK, 3–6 June 2019; pp. 1–5. [Google Scholar]
  18. Bustos, A.P.; Chin, T.-J.; Eriksson, A.; Reid, I. Visual slam: Why bundle adjust? In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 2385–2391. [Google Scholar]
  19. Zhao, X.; Li, Q.; Wang, C.; Dou, H.; Liu, B. Robust depth-aided rgbd-inertial odometry for indoor localization. Measurement 2023, 209, 112487. [Google Scholar] [CrossRef]
  20. Li, G.; Yu, L.; Fei, S. A deep-learning real-time visual slam system based on multi-task feature extraction network and self-supervised feature points. Measurement 2021, 168, 108403. [Google Scholar] [CrossRef]
  21. Bescos, B.; Facil, J.M.; Civera, J.; Neira, J. Dynaslam: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  22. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  23. Bescos, B.; Campos, C.; Tardos, J.D.; Neira, J. Dynaslam ii: Tightly-coupled multi-object tracking and slam. IEEE Robot. Autom. Lett. 2021, 6, 5191–5198. [Google Scholar] [CrossRef]
  24. Yu, C.; Liu, Z.; Liu, X.-J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. Ds-slam: A semantic visual slam towards dynamic environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
  25. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  26. Zhang, J.; Henein, M.; Mahony, R.; Ila, V. Vdo-slam: A visual dynamic object-aware slam system. arXiv 2020, arXiv:2005.11052. [Google Scholar] [CrossRef]
  27. Runz, M.; Buffier, M.; Agapito, L. Maskfusion: Real-time recognition, tracking and reconstruction of multiple moving objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 16–20 October 2018; pp. 10–20. [Google Scholar]
  28. Wang, X.; Zheng, S.; Lin, X.; Zhu, F. Improving rgb-d slam accuracy in dynamic environments based on semantic and geometric constraints. Measurement 2023, 217, 113084. [Google Scholar] [CrossRef]
  29. Zhong, F.; Wang, S.; Zhang, Z.; Wang, Y. Detect-slam: Making object detection and slam mutually beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1001–1010. [Google Scholar]
  30. Xiao, L.; Wang, J.; Qiu, X.; Rong, Z.; Zou, X. Dynamic-slam: Semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robot. Auton. Syst. 2019, 117, 1–16. [Google Scholar] [CrossRef]
  31. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  32. Liu, Y.; Miura, J. Rds-slam: Real-time dynamic slam using semantic segmentation methods. IEEE Access 2021, 9, 23772–23785. [Google Scholar] [CrossRef]
  33. Yan, L.; Hu, X.; Zhao, L.; Chen, Y.; Wei, P.; Xie, H. Dgs-slam: A fast and robust rgbd slam in dynamic environments combined by geometric and semantic information. Remote Sens. 2022, 14, 795. [Google Scholar] [CrossRef]
  34. Islam, R.; Habibullah, H.; Hossain, T. Agri-slam: A real-time stereo visual slam for agricultural environment. Auton. Robot. 2023, 27, 649–668. [Google Scholar] [CrossRef]
  35. Song, K.; Li, J.; Qiu, R.; Yang, G. Monocular visual-inertial odometry for agricultural environments. IEEE Access 2022, 10, 103975–103986. [Google Scholar] [CrossRef]
  36. Papadimitriou, A.; Kleitsiotis, I.; Kostavelis, I.; Mariolis, I.; Giakoumis, D.; Likothanassis, S.; Tzovaras, D. Loop closure detection and slam in vineyards with deep semantic cues. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2251–2258. [Google Scholar]
  37. Yang, L.; Wang, L. A semantic slam-based dense mapping approach for large-scale dynamic outdoor environment. Measurement 2022, 204, 112001. [Google Scholar] [CrossRef]
  38. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Figure 1. Overview of the MOLO-SLAM system.
Figure 2. Schematic diagram of multiple geometric constraints.
Figure 3. Process of dynamic confidence calculation method.
Figure 4. Process of pollution frame removal.
Figure 5. The 3D reconstruction method.
Figure 6. Forest and tea garden crawler inspection robot.
Figure 7. Agricultural datasets for the forest–tea garden scenario.
Figure 8. MOLO-SLAM dynamic mask detection.
Figure 9. Segmentation effect in the forest tea garden scene.
Figure 10. Distribution of dynamic confidence scores for dynamic and static objects.
Figure 11. Abnormal frame detection: The red box represents the detected contaminated frames.
Figure 12. Comparison of point cloud before and after contaminated frame rejection: (a,c,e) is before removing point clouds from contaminated frames in Sequence1, Sequence2, and Sequence3. (b,d,f) is after removing point clouds from contaminated frames in Sequence1, Sequence2, and Sequence3.
Figure 13. (a) FAST-LIO2 locating and mapping in an agricultural environment. (b) ATE trajectories for three agricultural dynamic sequences.
Figure 14. ATE trajectories for four highly dynamic sequences: the black line is the true trajectory; blue is the algorithm estimation curve; red is the error between the estimation algorithm and the true value (the larger the distance the higher the error).
Figure 15. Point cloud reconstruction experiment: (a,f) original orbslam2 point cloud; (b,g) point cloud generated by MOLO-SLAM after dynamic culling; (c,h) point cloud of bicycle; (d,i) the mesh model generated by BundleFusion after dynamic culling; (e,j) semantic map and segmented point cloud.
Table 1. Motion level classification.
Classification | Motion Level | Dynamic Probability Weights
Prior dynamic | 5 | a
Intermediate-to-dynamic | 4 | b
Intermediate states | 3 | 1
Intermediate-to-static | 2 | c
Prior static | 1 | d
Table 2. The relevant starting threshold parameters for the operation of the system in this paper.
Parameter | Value
Epipolar constraint distance threshold, ε | 5 pixels
Reprojection error threshold, e | 8 pixels
Epipolar constraint distance threshold of the previous frame, ε2 | 7 pixels
Initial weights of dynamic confidence, α, β | 0.7, 0.3
Outlier threshold of dynamic confidence, N | 7
Detection box area variation ratio for contaminated frames, A | 1/2
Motion level weights, a, b, c, d | 1.2, 1.1, 0.9, 0.8
Number of ORB feature points, NF | 1000
Mask-RCNN training batch size (Batch_size) | 4
Number of Mask-RCNN training epochs (Epoch) | 100
Table 3. Comparison of absolute trajectory error RMSE for the agricultural dataset—reference FAST-LIO2 (ATE)/m.
Sequences | ORBSLAM2 | MOLO-SLAM | Improvements
Sequence1 | 0.1489 | 0.0190 | 87.24%
Sequence2 | 0.1954 | 0.0120 | 93.86%
Sequence3 | 0.8112 | 0.0169 | 97.92%
Table 4. Comparison of relative translation trajectory RMSE for the agricultural dataset—reference FAST-LIO2 (translation RPE)/m.
Sequences | ORBSLAM2 | MOLO-SLAM | Improvements
Sequence1 | 0.3669 | 0.0871 | 76.26%
Sequence2 | 0.4740 | 0.0796 | 83.21%
Sequence3 | 0.5114 | 0.1090 | 78.69%
Table 5. Comparison of absolute trajectory error RMSE for TUM data (ATE)/m.
Sequences | ORBSLAM2 | DS-SLAM | Dyna-SLAM | SG-SLAM | MOLO-SLAM * | Improvements
fr3/w_half | 0.4613 | 0.0304 | 0.0260 | 0.0247 | 0.0316 | 93.15%
fr3/w_rpy | 0.6824 | 0.3877 | 0.0303 | 0.0196 | 0.0382 | 94.40%
fr3/w_static | 0.4025 | 0.0084 | 0.0068 | 0.0060 | 0.0060 | 98.51%
fr3/w_xyz | 0.8572 | 0.02959 | 0.0157 | 0.0153 | 0.0150 | 98.25%
fr3/s_half | 0.0202 | 0.0162 | 0.0192 | - | 0.0159 | 21.29%
fr3/s_xyz | 0.0082 | 0.0114 | 0.0105 | - | 0.0109 | −32.93%
fr2/xyz | 0.0052 | 0.0039 | 0.0044 | - | 0.0033 | 36.54%
* Represents our system. - Represents not provided.
Table 6. Comparison of relative translation trajectory error RMSE of TUM data (translation RPE)/m.
Sequences | ORBSLAM2 | DS-SLAM | Dyna-SLAM | SG-SLAM | MOLO-SLAM | Improvements
fr3/w_half | 0.7248 | 0.0430 | 0.0370 | 0.0249 | 0.0445 | 93.86%
fr3/w_rpy | 0.9999 | 0.5727 | 0.0437 | 0.0372 | 0.0540 | 94.60%
fr3/w_static | 0.5745 | 0.0125 | 0.0116 | 0.0082 | 0.0080 | 98.61%
fr3/w_xyz | 1.2148 | 0.0458 | 0.0223 | 0.0195 | 0.0190 | 98.44%
fr3/s_half | 0.0209 | 0.0152 | 0.0287 | - | 0.0143 | 31.38%
fr3/s_xyz | 0.0105 | 0.0124 | 0.0185 | - | 0.0148 | −40.95%
fr2/xyz | 0.0033 | 0.0029 | 0.0035 | - | 0.0038 | −15.15%
Table 7. Comparison of absolute trajectory error RMSE for KITTI data (ATE)/m.
Sequences | ORBSLAM2 | Dyna-SLAM | MOLO-SLAM | Improvements
00 | 3.4251 | 3.6928 | 3.4617 | −1.07%
01 | 7.5282 | 12.1051 | 6.1746 | 17.98%
02 | 5.5369 | 5.4668 | 5.6880 | −2.73%
03 | 5.0319 | 4.7365 | 4.9563 | 1.50%
04 | 1.1186 | 1.4302 | 1.0799 | 3.46%
05 | 1.5924 | 1.2435 | 1.1454 | 28.07%
06 | 1.9360 | 2.2895 | 2.2345 | −15.42%
Table 8. Comparison of relative translation trajectory error RMSE (translation RPE)/m for KITTI data.
Sequences | ORBSLAM2 | Dyna-SLAM | MOLO-SLAM | Improvements
00 | 1.5587 | 1.6378 | 1.5723 | 0.87%
01 | 2.5115 | 4.5293 | 2.6927 | −7.21%
02 | 1.9585 | 1.9467 | 1.9177 | 2.08%
03 | 2.2782 | 2.1527 | 2.2095 | 3.02%
04 | 1.5947 | 1.8147 | 1.5723 | 1.40%
05 | 0.9675 | 0.8867 | 0.7906 | 18.28%
06 | 1.6503 | 1.9055 | 1.8968 | −14.94%
Table 9. The average calculation time for each module (ACT)/ms.
Module | Time/ms
Epipolar constraint check | 36.253
Re-projection error | 42.254
Pre-previous frame epipolar constraint check | 37.256
Dynamic confidence calculation (total) | 118.532
Mask-RCNN | 352.235
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
