#### *2.3. Optimization-Based SLAM*

Full SLAM estimates all vehicle poses and map states using the entire sensor data history, and is mostly optimization-based. Similar to filter-based SLAM, an optimization-based SLAM system consists of two main parts: the frontend and the backend. In the frontend step, the SLAM system extracts the constraints of the problem from the sensor data, for example, by performing feature detection and matching, motion estimation, loop closure detection, etc. Nonlinear optimization is then applied at the backend to obtain the maximum likelihood estimate.

Graph SLAM is one of the main classes of full SLAM; it uses a graphical structure to represent the Bayesian SLAM problem. All the platform poses along the whole trajectory and all the detected features are treated as nodes. Spatial constraints between poses are encoded in the edges between the nodes. These constraints result from observations, odometry measurements, and loop closure detection. After the graph construction, graph optimization is applied to optimize the graph model of the whole trajectory and map. To solve the full optimization and to calculate the Gaussian approximation of the a posteriori distribution, a number of methods can be used, such as Gauss–Newton or Levenberg–Marquardt [31].
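
To make the optimization concrete, the pose-graph problem can be stated as a nonlinear least squares objective; the notation below is standard and paraphrased here rather than quoted from [31]:

```latex
% Pose-graph SLAM as nonlinear least squares:
%   x        : the stacked pose (and feature) variables
%   e_ij     : residual between predicted and measured constraint on edge (i,j)
%   Omega_ij : information matrix of that constraint
x^{*} = \arg\min_{x} \sum_{(i,j)} e_{ij}(x_i, x_j)^{\top}\,\Omega_{ij}\, e_{ij}(x_i, x_j)

% One Gauss--Newton iteration linearizes each e_ij and solves the normal equations
%   H\,\Delta x = -b, \quad H = \sum J_{ij}^{\top}\Omega_{ij} J_{ij}, \quad
%   b = \sum J_{ij}^{\top}\Omega_{ij} e_{ij},
% while Levenberg--Marquardt damps the system as (H + \lambda I)\,\Delta x = -b.
```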

For graph-based SLAM, the size of the covariance matrix and the update time are constant once the graph is generated; therefore, graph SLAM has become popular for building large-scale maps. Reducing the computational complexity of the optimization step has become one of the main research topics for practical implementations of the high-dimensional SLAM problem. The key to solving the optimization step efficiently is the sparsity of the normal matrix. Since each measurement is associated with only a very limited number of variables, the matrix is very sparse. With Cholesky factorization and QR factorization methods, the information matrix and the measurement Jacobian matrix can be factorized efficiently, and hence the computational cost can be significantly reduced. Several algorithms have been proposed, such as TORO and g2o. The sub-map method is also a popular strategy for solving large-scale problems [32–36]. The sub-maps can be optimized independently, each relative to a local coordinate frame. The sub-map coordinate origins can be treated as pose nodes, linked with motion constraints or loop closure constraints, thus generating a global pose graph. In this way, both the computational complexity and the update time are improved.
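
The following minimal SciPy sketch illustrates why this sparsity matters; the chain-structured toy problem and all values are illustrative, not code from TORO or g2o:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Synthetic chain-structured pose graph: each odometry edge couples only
# consecutive poses, so the normal matrix H is block-tridiagonal (very sparse).
n = 1000                                   # number of 1-D pose variables (toy example)
diag = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
H = sp.diags([off, diag, off], [-1, 0, 1], format="csc")  # sparse normal matrix
b = np.random.randn(n)                     # stand-in for -J^T * Omega * e

# Sparse factorization (SciPy exposes sparse LU; dedicated solvers such as
# CHOLMOD additionally exploit symmetry). For this structure the solve cost
# grows roughly linearly in n, instead of O(n^3) for a dense factorization.
delta_x = spla.splu(H).solve(b)            # one Gauss-Newton update step
print(delta_x[:5])
```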

Smoothing and Mapping (SAM), another optimization-based SLAM approach, formulates SLAM as a nonlinear least squares problem. Such a least squares problem can be solved incrementally by Incremental Smoothing and Mapping (iSAM) [37] and iSAM2 [38]. Online SLAM can be obtained with incremental SAM since it avoids unnecessary calculations over the entire covariance matrix. iSAM2 is more efficient as it uses a Bayes tree to achieve incremental variable re-ordering and fluid re-linearization.
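
As a hedged illustration, incremental smoothing is available through the GTSAM library, whose Python bindings include an iSAM2 implementation [38]; the sketch below uses arbitrary poses and noise values and only shows the incremental update pattern, where each step supplies just the new factors and initial guesses:

```python
import numpy as np
import gtsam

isam = gtsam.ISAM2()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))

# Initialize the smoother with a prior on the first pose.
graph = gtsam.NonlinearFactorGraph()
values = gtsam.Values()
graph.add(gtsam.PriorFactorPose2(0, gtsam.Pose2(0, 0, 0), prior_noise))
values.insert(0, gtsam.Pose2(0, 0, 0))
isam.update(graph, values)

# As each new odometry constraint arrives, only the new factor and its
# initial guess are passed in; iSAM2 re-linearizes and re-orders incrementally.
for k in range(1, 5):
    graph = gtsam.NonlinearFactorGraph()
    values = gtsam.Values()
    graph.add(gtsam.BetweenFactorPose2(k - 1, k, gtsam.Pose2(1.0, 0, 0), odom_noise))
    values.insert(k, gtsam.Pose2(float(k), 0, 0))
    isam.update(graph, values)

estimate = isam.calculateEstimate()        # current smoothed trajectory
```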

SLAM++ is another, very efficient, incremental solution for nonlinear least squares optimization-based SLAM. Moreover, for online SLAM implementations, fast state covariance recovery is very important for data association, obtaining reduced state representations, active decision-making, and next-best-view planning [39,40]. SLAM++ has an advantage in that it allows incremental covariance calculation, which is faster than other implementations [40].

Table 1 summarizes the characteristics of some typical SLAM techniques. Note that Graph SLAM utilizes all available observations and control information and can achieve very accurate and robust estimation results. It is suitable for offline applications, and its performance relies on a good initial state guess. Filter-based SLAM is more suitable for online estimation in small-scale environments, but for complex environments, real-time computation may be difficult with traditional EKF-SLAM; other variants or fastSLAM should be considered. The incremental optimization methods support incremental updating, and so can provide an optimal estimation of a large-scale map with very high efficiency and in real-time.

**Table 1.** Characteristics of some typical SLAM techniques.




#### *2.4. Sensors and Fusion Methods for SLAM*

New SLAM methods have appeared thanks to advances in sensor and computing technology. These methods are also optimization-based or filter-based at the backend estimation step, while the frontend step is highly dependent on the sensor modalities used. Two of the major sensors used for SLAM are Lidar and Camera. The Lidar method has become popular due to its simplicity and accuracy compared to other sensors [52]. The core of Lidar-based localization and mapping is scan-matching, which recovers the relative position and orientation of two scans or point clouds. Popular approaches for scan matching include the Iterative Closest Point (ICP) algorithm and its variants [53–55], and the normal distribution transform (NDT) [56]. These methods are highly dependent on a good initial guess and are impacted by local minima [57,58]. Other matching methods include probabilistic methods such as correlative scan matching (CSM) [59], feature-based methods [57,60], and others. Many scan-matching methods aim to be free of, or robust to, initialization error, but they still face the challenge of computational efficiency.
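
A minimal point-to-point ICP sketch illustrates the alternation between nearest-neighbour association and closed-form alignment; this is the textbook formulation, and the variants of [53–55] differ in association, weighting, and error metric:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(source, target, iters=30, tol=1e-6):
    """Minimal point-to-point ICP for 2D/3D point clouds (N x d arrays)."""
    src = source.copy()
    d = source.shape[1]
    R_total, t_total = np.eye(d), np.zeros(d)
    tree = cKDTree(target)                       # index target once
    prev_err = np.inf
    for _ in range(iters):
        dist, idx = tree.query(src)              # nearest-neighbour association
        matched = target[idx]
        # Closed-form rigid alignment (Kabsch/SVD) of the matched pairs.
        mu_s, mu_t = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_t)
        U, _, Vt = np.linalg.svd(H)
        D = np.eye(d)
        D[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid reflections
        R = Vt.T @ D @ U.T
        t = mu_t - R @ mu_s
        src = src @ R.T + t                      # apply increment
        R_total, t_total = R @ R_total, R @ t_total + t  # accumulate
        err = dist.mean()
        if abs(prev_err - err) < tol:            # converged: residual stopped improving
            break
        prev_err = err
    return R_total, t_total
```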

Other range sensors that can be used for SLAM estimation are Radar and Sonar/ultrasonic sensors. Radar works in a similar manner to Lidar, but the system emits radio waves instead of light to measure the distance to objects. Furthermore, since Radar can observe the relative velocity between the sensor and the object using the measured Doppler shift [61], it is suitable for distinguishing between stationary and moving objects, and can be used to discard moving objects during the map-building process [62]. Research on using Radar for SLAM can be found in [42,62–66]. Compared to Lidar, Radar's lower price, lower power consumption, and lower sensitivity to atmospheric conditions make it well suited for outdoor applications. However, Radar has lower measurement resolution, and its detections are sparser than Lidar's. Thus, it is harder to match Radar data and deal with the data association problem, which makes Radar-based 3D mapping less accurate.

Sonar/ultrasonic sensors also measure the time-of-flight (TOF) to determine the distance to objects, by sending and receiving sound waves. Sonar-based SLAM was initially used for underwater [67,68] and indoor [69] applications. It has become popular due to its low cost and low power consumption. It is not affected by visibility restrictions and can be used with multiple surface types [70]. However, similar to Radar, it obtains sparse information and suffers from inaccurate feature extraction and long processing time. Thus, it is of limited use for high-speed vehicle applications. Moreover, Sonar/ultrasonic sensors have limited sensing range and may be affected by environmental noise and by other platforms using ultrasound at the same frequency [71].

The Camera is another popular sensor for SLAM. Different techniques have been developed, such as monocular [72,73], stereo [74–77], and multi-camera [78–81]. These techniques can be used in a wide range of environments, both indoor and outdoor. A single-camera system is easy to deploy; however, it suffers from scale uncertainty [82]. Stereo-camera systems can overcome the scale factor problem and can retrieve 3D structural information by comparing the same scene from two different perspectives [61]. Multi-camera systems have gained increasing interest, particularly as they achieve a large field of view [78] or are even capable of panoramic vision [81]. Such a system is more robust in complicated environments, where a single-sensor system may be very vulnerable to environmental interference [81]. However, the integration of multiple Cameras requires additional software and hardware, and more calibration and synchronization effort [71,83]. Another special Camera, the RGB-D Camera, has been studied by the SLAM and computer vision communities [84–91] since it can directly obtain depth information. However, this system is mainly applicable in indoor environments because it uses infrared spectrum light and is therefore sensitive to external illumination [70].
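
The stereo scale recovery follows from standard triangulation geometry: for a rectified stereo pair with focal length f and baseline B, a feature observed with disparity d between the two views lies at depth

```latex
Z = \frac{f\,B}{d}
```

A monocular system has no baseline B, which is why absolute scale is unobservable without additional information.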

Visual SLAM can also be classified as feature-based or direct SLAM, depending on how the measurements are used. Feature-based SLAM repeatedly detects features in images and utilizes descriptive features for tracking and depth estimation [92]. Some fundamental frameworks for feature-based systems include MonoSLAM [72,93], PTAM [94], ORB-SLAM [95], and ORB-SLAM2 [96]. Instead of using feature detectors and descriptors, the direct SLAM method uses the whole image. Examples of direct SLAM include DTAM [97], LSD-SLAM [73], and SVO [98]. A dense or semi-dense environment model can be acquired by these methods, which makes them more computationally demanding than feature-based methods. Engel et al. [74] extended LSD-SLAM from a monocular to a stereo model, while Caruso et al. [99] extended LSD-SLAM to an omnidirectional model. Detailed reviews of Visual SLAM can be found in [5] and [70,92,100,101].
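
As a minimal sketch of a feature-based frontend step, in the spirit of the ORB features used by ORB-SLAM (the image file names below are placeholders, not from any cited system):

```python
import cv2

# Load two consecutive grayscale frames (file names are placeholders).
img1 = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_0002.png", cv2.IMREAD_GRAYSCALE)

# Detect ORB keypoints and compute binary descriptors in each frame.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with cross-checking yields tentative tracks,
# which downstream motion estimation and mapping then consume.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} tentative feature tracks")
```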

Each of these perceptive sensors has its advantages and limitations. Lidar approaches can provide precise and long-range observations, but have limitations such as sensitivity to atmospheric conditions, high cost, and currently rather bulky hardware. Radar systems are relatively low cost, but are more suitable for object detection than for 3D map building. Sonar/ultrasonic sensors are not suitable for high-speed platform applications. Cameras are low-cost, even when multiple Cameras are used, and can provide rich visual information. However, they are sensitive to environment texture and light and, in general, have high computational demands. Therefore, a popular strategy is to combine a variety of sensors to make the SLAM system more robust.

There are several strategies to integrate data from different sensors for SLAM. One is to fuse independently processed sensor results to obtain the final solution. In [102], a mapping method was proposed that merged two grid maps, generated individually from laser and stereo camera measurements, into a single grid map. In this method, the measurements of the different sensors need to be mapped to a joint reference system. In [103], a multi-sensor SLAM system was proposed that combined the 3-DoF pose estimation from laser readings, the 6-DoF pose estimation from a monocular visual system, and the inertial navigation estimation results to generate the final 6-DoF position using an EKF processing scheme. With this type of strategy, the sensors provide redundancy and the system is robust to a possible single-sensor failure. A decision-making step may be needed to identify whether the data from each sensor is reliable, and to decide whether to adopt the estimation from that sensor modality or to ignore it (see the sketch below). Another fusion strategy is to use an assistant sensor to improve the performance of another sensor-based SLAM algorithm. The main sensor could be Lidar or Camera, while the assistant sensor could be any other type of sensor. In this strategy, the assistant sensor is used to overcome the limitations of the main sensor. The work in [104] incorporated visual information to provide a good initial guess of the rigid body transformation, and then used this initial transformation to seed the ICP framework. Huang et al. [105] extracted the depth of point-based and line-based landmarks from the Lidar data. The proposed system used this depth information to guide camera tracking and also to support the subsequent point-line bundle adjustment to further improve the estimation accuracy.
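
The first strategy can be sketched under simplifying assumptions: independent Gaussian pose estimates per sensor, fused in information form, with a chi-square gate as the decision-making step. All names and thresholds are illustrative, not from the cited systems:

```python
import numpy as np
from scipy.stats import chi2

def fuse_estimates(estimates, gate_p=0.99):
    """Variance-weighted fusion of independent per-sensor pose estimates.

    estimates: list of (x, P) pairs, with x a state vector and P its covariance.
    A chi-square gate against the running fused estimate discards sensor
    outputs that disagree too strongly (the decision-making step).
    """
    x_f, P_f = estimates[0]
    dof = len(x_f)
    for x, P in estimates[1:]:
        innov = x - x_f
        S = P + P_f
        if innov @ np.linalg.solve(S, innov) > chi2.ppf(gate_p, dof):
            continue                       # sensor result deemed unreliable; skip it
        # Information-form combination of the two Gaussian estimates.
        P_new = np.linalg.inv(np.linalg.inv(P_f) + np.linalg.inv(P))
        x_f = P_new @ (np.linalg.solve(P_f, x_f) + np.linalg.solve(P, x))
        P_f = P_new
    return x_f, P_f
```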

The above two strategies can be combined. In the work of [106], the fusion consists of two modules: one performs feature fusion, utilizing line feature information from images to remove "pseudo-segments", which result from dynamic objects, from the laser segments; the other is a modified EKF SLAM framework that incorporates the state estimates obtained from the individual monocular and laser SLAM in order to reduce the pose estimation covariance and improve localization accuracy. This modified SLAM framework can keep running even when one sensor fails, since the per-sensor SLAM processes run in parallel.

Examples of tighter fusion can also be found in the literature. The work of [107] combined both the laser point cloud data and image feature point data as constraints, and conducted a graph optimization over both sets of constraints using a specific cost function. Furthermore, an image feature-based loop closure was added to this system to remove accumulated errors.

Inertial SLAM incorporates an inertial measurement unit (IMU) as an assistant sensor. The IMU can be fused with the Camera or Lidar to support pose (position, velocity, attitude) estimation. With an IMU, the attitudes, especially the heading, are observable [108]. The integration of IMU measurements can also improve motion tracking performance during gaps in observations. For instance, for Visual SLAM, illumination changes, texture-less areas, or motion blur will cause losses of visual tracks [108]. For a Lidar system, the raw scan data may suffer from skewing caused by high-acceleration motion, such as moving fast or shaking suddenly, resulting in sensing error that is difficult to account for [109]. The work of [110] used an IMU sensor to deal with fast velocity changes and to initialize motion estimates for scan-matching Lidar odometry to support their LOAM system. The high-frequency IMU data between two Lidar scans can be used to de-skew Lidar point clouds and improve their accuracy [109].
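
The de-skewing idea can be sketched as follows, assuming IMU-propagated orientations (quaternions) and positions whose timestamps bracket the scan, and interpolating between them; this is a simplification of what systems such as [109,110] implement:

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def deskew_scan(points, point_times, imu_times, imu_quats, imu_pos):
    """Re-express each Lidar point in the pose at scan end.

    points:      (N, 3) raw points in the sensor frame at their capture time
    point_times: (N,)   per-point timestamps within [imu_times[0], imu_times[-1]]
    imu_quats:   (M, 4) IMU-propagated orientations as quaternions
    imu_pos:     (M, 3) IMU-propagated positions at imu_times
    """
    slerp = Slerp(imu_times, Rotation.from_quat(imu_quats))
    R_t = slerp(point_times)                               # pose at each capture time
    p_t = np.column_stack([np.interp(point_times, imu_times, imu_pos[:, k])
                           for k in range(3)])
    world = R_t.apply(points) + p_t                        # sensor -> world, per point
    R_end = Rotation.from_quat(imu_quats[-1])              # reference: scan-end pose
    return R_end.inv().apply(world - imu_pos[-1])          # world -> scan-end frame
```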

The fusion of inertial sensors can take the form of a simple assistant [111,112] or a more tightly coupled integration [108,113–115]. In the simple assistant case, the IMU is mainly used to provide orientation information, for example to support system initialization. The IMU serves as a prior for the whole system, and the IMU measurements are not used in further optimization. In the tightly coupled case, IMU data is fused with Camera/Lidar states to build up measurement models, which are then used to perform state estimation and feed back to the inertial navigation system to improve navigation performance [116]. The former method is therefore more efficient than the latter, but it is less accurate [117]. In the tightly coupled case, a Kalman filter can be used to correct the IMU states, even during GNSS outages [118].

#### *2.5. Deep Learning-Based SLAM*

Most of the aforementioned SLAM methods are geometric model-based; they build up models of platform motion and the environment based on geometry. These methods have achieved great success over the past decade. However, they still face many challenging issues. For instance, Visual SLAM (VSLAM) is limited under extreme lighting conditions. For large-scale applications, the model-based methods need to deal with large amounts of information, such as features and dynamic obstacles. Recently, deep learning techniques, as data-driven approaches developed in the computer vision field, have attracted increasing attention, and many researchers have attempted to apply deep learning methods to SLAM problems.

Most current research activities focus on utilizing learning-based methods for VSLAM problems, since deep learning techniques have made breakthroughs in the areas of image classification, recognition, object detection, and image segmentation [119]. For instance, deep learning has been successfully applied to the visual odometry (VO) problem, which is an important element of VSLAM. Optical flow estimation is utilized as input in some learned VO models [120–124]. Learning approaches can also be applied in an end-to-end manner, without adopting any module of the conventional VO pipeline [125,126]. Wang et al. [125] introduced an end-to-end VO algorithm with deep Recurrent Convolutional Neural Networks (RCNNs), combining CNNs with RNNs. With this algorithm, the pose of the camera is directly estimated from raw RGB images, and neither prior knowledge nor parameters are needed to recover the absolute scale [125]. Li et al. [127] proposed an Unsupervised Deep Learning based VO system (UnDeepVO) which is trained with stereo image pairs and then performs both pose estimation and dense depth map estimation with monocular images. Unlike the system proposed by Wang et al. [125], UnDeepVO does not need ground truth since it operates in an unsupervised manner.
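
As a schematic illustration of the RCNN idea, the PyTorch sketch below uses illustrative layer sizes; it is not the architecture of [125], only the general pattern of a CNN encoder feeding an RNN over time:

```python
import torch
import torch.nn as nn

class RecurrentVO(nn.Module):
    """End-to-end VO sketch: a CNN encodes stacked image pairs, an LSTM models
    temporal dependencies, and a linear head regresses relative 6-DoF poses."""

    def __init__(self, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(                    # per-pair feature extractor
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(128, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 6)                 # 3 translation + 3 rotation

    def forward(self, seq):                              # seq: (B, T, 6, H, W)
        B, T = seq.shape[:2]
        feats = self.encoder(seq.flatten(0, 1)).flatten(1)   # (B*T, 128)
        out, _ = self.lstm(feats.view(B, T, -1))             # temporal modeling
        return self.head(out)                                # (B, T, 6) relative poses
```

Training such a model regresses relative poses from image sequences; loss design and datasets are beyond this sketch.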

Learning-based methods can also be combined with a VSLAM system to replace or augment individual modules of traditional SLAM, such as image depth estimation [128–130], pose estimation [131–133], and loop closure [134–137], in order to improve the traditional method. Li et al. [138] proposed a fully unsupervised deep learning-based VSLAM that contains several components, including Mapping-net, Tracking-net, Loop-net, and a graph optimization unit. By combining the important geometric models and constraints with the network architecture and the loss function, this DeepSLAM method can achieve accurate pose estimation and is robust in some challenging scenarios.

Semantic perception of the environment and semantic segmentation are current research topics in the computer vision field. They can provide a high-level understanding of the environment and are extremely important for autonomous applications. The rapid development of deep learning can assist in the introduction of semantic information into VSLAM [139] for semantic segmentation [140–142], localization and mapping [143–147], and dynamic object removal [148–151]. Detailed reviews of deep learning-based VSLAM can be found in [92,139,152–154].

Fusion with an inertial sensor can also benefit from deep learning techniques, especially the RNN, which has an advantage in integrating temporal information and helps to establish consistency between nearby frames [139]. The integration of visual and inertial data with an RNN or Long Short-Term Memory (LSTM) network, a variant of the RNN that allows it to learn long-term trends [155], has been proven to be more effective and convenient than traditional fusion [156–158]. According to Clark et al. [157], the data-driven approach eliminates the need for manual synchronization of the camera and IMU, and for manual calibration between them. It outperforms the traditional fusion method since it is robust to calibration errors and can mitigate sensor drifts. However, to deal with the drift problem, a further extension of the learning-based visual-inertial odometry system to a larger SLAM-like system with loop-closure detection and global relocalization still needs to be investigated.

Compared to visual-based SLAM, the application of deep learning techniques to laser scanner or Lidar-based SLAM is still in the early stages and can be considered a new challenge [159]. Velas et al. [160] used a CNN for Lidar odometry estimation, with an IMU sensor supporting rotation parameter estimation; the results are competitive with state-of-the-art methods such as LOAM. Li et al. [161] introduced an end-to-end Lidar odometry, LO-Net, which has high efficiency and accuracy and can handle dynamic objects. However, this method is trained with ground truth data, which limits its applicability in large-scale outdoor scenarios. Li et al. [162] designed a self-supervised visual-Lidar odometry framework that does not use any ground truth labels. The results indicate that this VLO method outperforms other current self-supervised visual or Lidar odometry methods, and performs better than fully supervised VOs. Data-driven approaches also make semantic segmentation of Lidar data more accurate and faster, making it suitable for supporting autonomous vehicles [163–165]. Moving objects can be distinguished from static objects by LMNet [166], based on CNNs over 3D Lidar scans. One limitation of some cost-effective 3D Lidar applications for autonomous driving in challenging dynamic environments is the relatively sparse point clouds. To overcome this drawback, high-resolution camera images were utilized by Yue et al. [167] to enrich the raw 3D point cloud. ERFNet is employed to segment the image with the aid of sparse Lidar data, while the sparsity-invariant CNN (SCNN) is employed to predict the dense point cloud. The enriched point clouds are then refined by combining these two outputs using a multi-layer convolutional neural network (MCNN). Finally, Lidar SLAM can be performed with this enriched point cloud. Better target segmentation can be achieved with this Lidar data enrichment neural network method. However, due to the small training dataset, this method did not show improvement in SLAM accuracy with the enriched point cloud when compared to the original sparse point cloud. More training and further investigation of dynamic objects may be needed to satisfy autonomous driving application requirements [167].

The development of complex deep learning architectures has contributed to more accurate, robust, adaptive, and efficient computer vision solutions, confirming their great potential for application to SLAM problems. The availability of large-scale datasets is still the key to boosting these applications. Moreover, with no need for ground truth, unsupervised learning is more promising for SLAM applications in autonomous driving. Compared to traditional SLAM algorithms, data-driven SLAM is still in the development stage, especially for Lidar SLAM. In addition, combining multiple sensing modalities may overcome the shortcomings of individual sensors, but learning-based integration of multiple sensors still needs further investigation.

#### **3. Application of SLAM in Autonomous Driving**

The different characteristics of SLAM techniques suit different autonomous driving applications. One classification of the applications is whether they are offline or online. A map satisfying high-performance requirements is typically generated offline, such as the High Definition (HD) map [10]. For this kind of 3D point cloud map, an offline map generation process ensures the accuracy and reliability of the map. Such maps can be pre-generated to support the real-time operations of autonomous vehicles.

#### *3.1. High Definition Map Generation and Updating*

As stated earlier, SLAM can be used to generate digital maps for autonomous driving, such as the HD map [10]. Due to the stringent requirements, high-quality sensors are used. Lidar is one of the core sensors for automated cars as it can generate high-density 3D point clouds. High-end GNSS and INS technology are also used to provide accurate position information. Cameras can provide information similar to that detected by human eyes. The fusion of sensor data and analysis of road information to generate HD maps requires considerable computational power, which is not feasible in current onboard vehicle systems. Therefore, the HD map is built up offline, using techniques such as optimization-based SLAM. Offline map creation can be performed by driving the road network several times to collect information; all the collected perceptive sensor information and position information is then processed together to improve the accuracy of the final map. An example of an HD map is shown in Figure 4 [11].

**Figure 4.** An image from a high definition map (https://here.com/) [11].

The road environment and road rules may change; for instance, the speed limit may be reduced due to road work, road infrastructure may be changed due to building development, and so on. Therefore, the HD map needs frequent updates. Such updates can utilize the online data collected from any autonomous car. For example, the data is transmitted to central (cloud) computers where the update computations are performed. Other cars can then receive such cloud-based updates and make timely adjustments to driving plans. Jo et al. [168] proposed a SLAM change update (SLAMCU) algorithm, utilizing a Rao–Blackwellized PF approach for online vehicle position and (new) map state estimation. In the work of [169], a new feature layer of the HD map can be generated using Graph SLAM when a vehicle is temporarily stopped or in a parking lot. The new feature layer from one vehicle can then be uploaded to the map cloud and integrated with those from other vehicles into a new feature layer in the map cloud, thus enabling more precise and robust vehicle localization. In the work of Zhang et al. [170], real-time semantic segmentation and Visual SLAM were combined to generate semantic point cloud data of the road environment, which was then matched with a pre-constructed HD map to confirm map elements that have not changed and to generate new elements as they appear, thus facilitating crowdsourced updates of HD maps.

#### *3.2. Small Local Map Generation*

SLAM can also be used for small local areas. One example is within parking areas. The driving speed in a parking lot is low, so vision techniques are more robust than in high-speed driving scenarios. The parking area could be unknown (public parking lot or garage) or known (home zone); both cases can benefit from SLAM. Since SLAM can be used without GNSS signals, it is suitable for vehicles in indoor or underground parking areas, using just the perceptive sensor and odometry measurements (velocity, turn angle) or IMU measurements. For unknown public parking areas, the position of the car and of obstacles, such as pillars and sidewalls, can be estimated at the same time, guiding the parking system. For home zone parking, the pre-generated map and a frequent parking trajectory can be stored within the automated vehicle system. Each time the car returns home, re-localization using the stored map can be carried out by matching detected features with the map. The frequent trajectory can be used for the planning and control steps.

An approach that utilizes multi-level surface (MLS) maps to locate the vehicle, and to calculate and plan the vehicle path within indoor parking areas, was proposed in [171]. In this study, graph-based SLAM was used for mapping, and the MLS map was then used to plan a global path from the start to the destination and to robustly localize the vehicle with laser range measurements. In the work of [172], a grid map and an EKF SLAM algorithm were used with W-band radar for autonomous back-in parking; an efficient EKF SLAM algorithm was proposed to enable real-time processing. In [173], the authors proposed an around-view monitor (AVM)/Lidar sensor fusion method to recognize the parking lane and to provide rapid loop closing performance. The above studies have demonstrated that both filter-based SLAM and optimization-based SLAM can be used to support efficient and accurate vehicle parking assistance (local area mapping and localization), even without GNSS. In the work of Qin et al. [174], pose graph optimization is performed to achieve an optimized trajectory and a global map of a parking lot, with semantic features such as guide signs, parking lines, and speed bumps. These kinds of features are more stable in the long term and more robust than traditional geometric features, especially in underground parking environments. An EKF was then used to complete the localization system for autonomous driving.

#### *3.3. Localization within the Existing Map*

In map-based localization, a matching method is used to match "live" data with map information, using methods such as Iterative Closest Point (ICP), Normal Distribution Transform (NDT), and others [10,175]. These algorithms can be linked to the SLAM problem since SLAM executes loop closing and re-localization using similar methods. For a SLAM problem, the ability to recognize a previously mapped object or feature and to relocate the vehicle within the environment is essential for correcting the maps [13]. Therefore, the reuse of a pre-generated map to localize the vehicle can be considered an extension of a SLAM algorithm. In other words, the pre-generated and stored map can be treated as a type of "sensor" to support localization.

However, matching live data with a large-scale pre-prepared map requires substantial computational resources. Hence, some methods have been proposed to increase computational efficiency. One method is to first narrow down the possible matching area of the map using the position estimated from GNSS or GNSS/INS, and then carry out detailed matching of the detected features with the map [176].
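
A minimal sketch of this two-stage idea follows; the search radius is illustrative, and `icp` stands in for any detailed scan matcher, e.g., the ICP sketch in Section 2.4:

```python
import numpy as np
from scipy.spatial import cKDTree

def localize_with_gnss_prior(scan, map_points, gnss_xy, radius=50.0):
    """Stage 1: crop the global map to a disc around the GNSS position prior.
    Stage 2: refine by detailed scan matching against the cropped sub-map."""
    tree = cKDTree(map_points[:, :2])                 # index map points by ground-plane x, y
    idx = tree.query_ball_point(gnss_xy, r=radius)    # candidate region from the prior
    local_map = map_points[idx]
    return icp(scan, local_map)                       # e.g., the ICP sketch in Section 2.4
```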

Due to the currently limited installation of Lidar systems in commercial vehicles (high sensor price and high power consumption), localization of a vehicle with a low-cost sensor (e.g., a vision sensor) within a pre-generated HD map is of considerable practical interest. For instance, the work in [177] located a vehicle within a dense Lidar-generated map using vision data, and demonstrated that an error rate of the same order of magnitude as traditional Lidar localization can be achieved with sensor technology that is several orders of magnitude cheaper. Schreiber et al. [178] proposed to first generate a highly accurate map with road markings and curb information using a high-precision GNSS unit, a Velodyne laser scanner, and cameras. During the localization process, a stereo camera system was then used to detect road information and match it with the pre-generated map to achieve lane-level real-time localization. Jeong et al. [179] utilized road markings obtained from camera images for global localization. A sub-map that contained road information, such as 3D road marking points, was generated and utilized to recognize a revisited place and to support accurate loop detection; a pose graph-based approach was then used to eliminate the drift. Qin et al. [146] proposed a semantic localization system to provide a lightweight localization solution for low-cost cars. In this work, a local semantic map was generated by combining the CNN-based semantic segmentation results and the optimized trajectory after pose graph optimization. A compact global map was then generated (or updated) in the cloud server for further end-user localization based on the ICP method and within an EKF framework. The average size of the semantic map was 36 kb/km. This camera-based localization framework is reliable and practical for autonomous driving.

In addition to the aforementioned applications, moving objects within the road environment can cause drifts in perception, localization, and mapping for autonomous driving. SLAM can be used to address the problem of DATMO (detection and tracking of moving objects) [15] because one of the assumptions of SLAM is that the detected features are stationary. As the static parts of the environment are localized and mapped by SLAM, the dynamic parts can be concurrently detected and tracked. Several approaches have dealt with dynamic obstacles [180–182].

#### **4. Challenges of Applying SLAM for Autonomous Driving and Suggested Solutions**

*4.1. Ensuring High Accuracy and High Efficiency*

Localization and mapping for automated vehicles need to be accurate, robust to any changes in the environment, and executed with high efficiency. With rapidly developing sensor technology, the combination of different sensors can compensate for the limitations of any particular sensor. Examples include GNSS/INS + Lidar/Camera SLAM, Radar SLAM, and others. There is considerable research and development associated with low-cost and/or miniaturized Lidar sensors. New Lidar sensor concepts promise a significant reduction in the cost of Lidar systems, with the potential for real-time implementation in future autonomous vehicles. For instance, RoboSense has unveiled a new \$200 Lidar sensor combining MEMS sensors and an AI-based deep-learning algorithm to support high-performance autonomous driving applications [183].

Choosing a SLAM approach should take into consideration the different application scenarios with their different levels of requirements. Optimization-based SLAM can provide more accurate and robust estimation; however, it is more suitable for offline estimation. EKF SLAM suffers from a quadratic increase in computation with the number of state variables, which restricts its online application in large-scale environments. Although high-resolution map generation can be done offline, real-time or near-real-time solutions are essential for map updating and map-based localization applications.

Any change in the road environment should be quickly updated in the map and transmitted to other road users. Emerging 5G wireless technology can make vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), and vehicle-to-cloud communication more reliable and of higher throughput [14].

#### *4.2. Representing the Environment*

There are different types of maps that can be used to represent the road environment. Three major types of maps in the robotics field for SLAM applications are occupancy grid maps, feature-based maps, and topological maps [184]. They are also applicable to road environments, and each has its own advantages and limitations for autonomous driving applications. The grid map divides the environment into many fixed-size cells, each of which carries its own property, such as whether the cell is occupied, free, or unknown [185,186]. The obstacle occupancy information can be fed directly to planning algorithms. This kind of map can be merged easily and is flexible in incorporating data from numerous types of sensors [184]. Mentasti and Matteucci [185] proposed an occupancy grid creation method that utilized data from all the available sensors on an autonomous vehicle, including Lidar, Camera, Laser, and Radar. The grid map also shows potential for detecting moving objects [187]. Mutz et al. [188] compared the performance of mapping and localization with different grid maps, including occupancy, reflectivity, color, and semantic grid maps, for self-driving car applications in diverse driving environments, including under challenging conditions. GraphSLAM was used for mapping, while localization was based on particle filter solutions. According to their results, the occupancy grid map showed the most accurate localization results, followed by the reflectivity grid map. Semantic grid maps maintained position tracking without losses in most scenarios, though with larger errors than the first two map types. Colorized grid maps were the most inconsistent and inaccurate for localization, which may be due to the influence of illumination conditions. One shortcoming that limits the occupancy grid map for large-scale autonomous driving is its dense representation, which requires large storage space and high computational power [189]. Thus, Li [186] suggested applying this technique for real-time local mapping with a controlled size, instead of for global mapping.
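
A minimal log-odds occupancy grid sketch shows the per-cell update such maps rely on; the cell size and the occupied/free increments are illustrative:

```python
import numpy as np

class OccupancyGrid:
    """Log-odds occupancy grid: every cell starts unknown (log-odds 0) and each
    measurement pushes it toward occupied or free."""

    def __init__(self, shape=(500, 500), resolution=0.1, l_occ=0.85, l_free=-0.4):
        self.log_odds = np.zeros(shape)
        self.res = resolution              # metres per cell
        self.l_occ, self.l_free = l_occ, l_free

    def to_index(self, x, y):
        return int(x / self.res), int(y / self.res)

    def update(self, x, y, occupied):
        # Accumulate evidence for one cell hit by (or traversed by) a beam.
        ix, iy = self.to_index(x, y)
        self.log_odds[ix, iy] += self.l_occ if occupied else self.l_free

    def probabilities(self):
        # Convert log-odds back to occupancy probability per cell.
        return 1.0 - 1.0 / (1.0 + np.exp(self.log_odds))
```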

The feature-based map is a popular map type for autonomous driving. It represents the map with a group of features extracted from sensor data. For outdoor road environments, the typical features are traffic lanes, kerbs, road markings and signs, buildings, trees, etc. For indoor areas, especially parking areas, the features are mainly the parking lanes, sidewalls, etc. These features can be represented by points, lines, and planes, tagged with coordinate information. The point feature represents the environment as dense point clouds. The high-density point cloud maps generated using Lidar and/or vision sensors can provide abundant features and 3D structure information of the area surrounding the vehicle. However, the transmission, updating, and processing of this volume of data is burdensome for complex road environments. The sparser line and plane features are suitable for structured environments with clear markings, such as indoor environments, urban areas, or highways. These features are more sophisticated than point features, have lower memory requirements [186], and are less susceptible to noise [189]. Im et al. [173] proposed a parking line-based SLAM approach which extracted and analyzed parking line features to achieve rapid loop closure and accurate localization in a parking area. Javanmardi et al. [190] generated a city road map with 2D lines and 3D planes to represent the buildings and the ground along the road. However, for autonomous driving, the application environment is variable, and a specific landmark-based algorithm may not be suitable for other driving scenarios. Furthermore, in some rural areas, the road may be unpaved and without lane markings; there, the related feature-based map approaches may not be feasible due to the lack of road markings and the irregular road curve geometry [191].

The topological map represents the environment with a series of nodes and edges. The nodes indicate important objects, such as corners, intersections, and feature points, while the edges denote the topological relationships between them [192,193]. One typical topological map is OpenStreetMap (OSM) [194], which contains the coordinates of features as well as road properties such as road direction, number of lanes, etc. This kind of map significantly reduces the storage and computational requirements. However, it loses some useful information about the nature and structure of the actual environment [184]. Thus, some approaches combine topological maps with other types of maps. Bernuy and Ruiz-del-Solar [195] proposed the use of a topological map based on semantic information to provide robust and efficient mapping and localization solutions for large-scale outdoor scenes for autonomous vehicles and ADAS systems. According to Bernuy and Ruiz-del-Solar [195], the graph-based topological semantic mapping method is suitable for large-scale driving tasks on highways, rural roads, and city areas, with less computational expense than metric maps. Bender et al. [196] introduced a highly detailed map, Lanelets, which combines both geometrical and topological representations, and includes information on traffic regulations and speed limits.

The semantic map is becoming increasingly important in autonomous driving as it contains semantic information that allows the robot or vehicle to better understand the environment and to complete higher-level tasks, such as human-robot interaction. For outdoor applications, the labeled objects could be static background (e.g., 'building', 'tree', 'traffic sign') or dynamic entities (e.g., 'vehicle', 'pedestrian'). Therefore, this kind of map can facilitate complex tasks for autonomous vehicles, such as planning and navigation [195,197]. Associating semantic concepts with geometric entities has become a popular research topic, and semantic SLAM approaches that combine geometric and semantic information have been investigated [139,143,149]. Semantic SLAM approaches can contribute to making localization and mapping more robust [174], to supporting re-localization in revisited areas [143], and, very importantly, to tracking moving objects detected in dynamic environments [149,151,198]. One critical problem faced by semantic map generation and utilization is that some modules within them, such as semantic segmentation, are very computationally demanding, which makes them unsuitable for real-time applications [199], especially for large-scale outdoor scenarios. Some investigations seek to solve this problem. Ros et al. [199] proposed an offline-online strategy that generates a dense 3D semantic map offline without sacrificing accuracy. Afterward, real-time self-localization can be performed by matching the current view to the 3D map, and the related geometry and semantics can be retrieved accordingly. Meanwhile, new dynamic objects can be detected online to support instantaneous motion planning. With the advent of deep learning, the efficiency and reliability of semantic segmentation and semantic SLAM have been vastly improved [147,200–203]. However, as previously mentioned, when applying deep learning-based semantic SLAM to autonomous driving, there are still challenges, such as the need for large amounts of training data, or the lack of ground truth, which makes unsupervised learning methods necessary.

The different map representations are essential to support a highly automated vehicle operating in a challenging and complex road environment. Therefore, a detailed digital map, such as the HD map, which contains different layers of data, has been increasingly adopted. In addition to the most basic 3D point cloud map layer, the HD map may also contain layers with information on road topology, geometry, occupancy, lane features, road furniture, road regulations, real-time knowledge, and more. Storing, updating, and utilizing such dense data without losing accuracy is a challenge. Some researchers have proposed the concept of "Road DNA" to represent the road environment and to deal with the Big Data problem [12,204]. Road DNA converts a 3D point cloud road pattern into a compressed 2D view of the roadway without losing details [12], with the objective of reducing processing requirements.

#### *4.3. Issue of Estimation Drifts*

SLAM estimation drifts may be caused by accumulated linearization error, the presence of dynamic obstacles, noisy sensor data, wrong data association, etc.

In most SLAM algorithms, nonlinear models are used to represent the vehicle motion pattern and the environment. EKF SLAM suffers from a divergence problem due to the accumulation of linearization errors: biases may occur when linearization is performed at values of the state variables that are far from their true values. For optimization-based SLAM, a poor initial guess of the variables will lead to poor convergence performance. Rotation is a major source of nonlinearity and has a strong impact on estimation divergence [205,206]; thus, accumulated vehicle orientation error will cause inconsistency in the SLAM solution. One solution to the linearization challenge is the Linear SLAM algorithm proposed in [205], which modifies the relative state vector and carries out "map joining". This sub-map joining, which involves solving a linear least squares problem and performing nonlinear coordinate transformations, does not require an initial guess or iteration. In the work of [207], a robocentric local map sequencing approach was presented which can bound the location uncertainty within each local map and improve the linearization accuracy, constrained by the sensor uncertainty level. Many variants of the classical EKF-SLAM have been proposed to overcome the divergence of the filter. The study of [208] demonstrated that Unscented SLAM can improve online consistency for large-scale outdoor applications. Huang et al. [209] proposed two alternatives to EKF-SLAM, the Observability Constrained EKF and the First-Estimates Jacobian EKF, both of which significantly outperform the EKF in terms of accuracy and consistency. A linear time-varying (LTV) Kalman filter was introduced in [210] which avoids linearization error by creating virtual measurements. Some nonparametric approaches, mainly based on the PF, such as fastSLAM [28] and Unscented fastSLAM [211–214], show better performance than EKF-SLAM.
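
To make the linearization issue explicit (standard EKF notation, added here for clarity): the EKF replaces the nonlinear motion model by its first-order expansion about the current estimate,

```latex
% First-order expansion of the motion model f about the estimate \hat{x}:
f(x) \approx f(\hat{x}) + F(\hat{x})\,(x - \hat{x}), \qquad
F(\hat{x}) = \left.\frac{\partial f}{\partial x}\right|_{x = \hat{x}}
% The neglected higher-order terms grow with |x - \hat{x}|; when the estimate
% (especially the heading) drifts far from the truth, these errors propagate
% through P <- F P F^T + Q and the filter becomes inconsistent.
```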

For the nonlinear optimization-based SLAM approach, computing a good initial guess (solving the initialization problem) leads to faster convergence and reduces the risk of convergence to local minima. Olson et al. [215] presented a fast iterative algorithm for optimizing pose graphs using a variant of Stochastic Gradient Descent (SGD), which is robust against local minima and converges quickly even with a bad initial guess. In the work of [50], an extension of Olson's algorithm was proposed which uses a tree-based parameterization for the nodes in the graph; this algorithm was demonstrated to be more efficient than Olson's and robust to the initial configuration. An approximate solution for 2D pose-graphs, called Linear Approximation for pose Graph Optimization (LAGO), can be used as an exact solution or for bootstrapping nonlinear techniques [216,217]. This method first solves a linear estimation problem to obtain a suboptimal orientation estimate, then uses it to express the relative position measurements in the global reference frame, and finally obtains the position and orientation solution by solving another linear estimation problem. This solution can then be treated as the initial guess for a Gauss-Newton iteration. The method provides a good initial guess; however, it is limited to 2D pose-graphs and is sensitive to noisy measurements. An algorithm with more complex initialization was proposed in [218] that uses an M-estimator, in particular the Cauchy function, as a bootstrapping technique. Similar to approaches that use M-estimators to make estimation robust to outliers, the M-estimator proved to also be robust to a bad initial guess. In contrast to LAGO and TORO, this method can be applied to different variants of SLAM (pose-graph and feature-based) in both 2D and 3D [218]. Carlone et al. [219] surveyed different 3D rotation estimation techniques and demonstrated the importance of a good rotation estimate to bootstrap iterative pose graph solvers. More recent research presented a heuristic method called Multi-Ancestor Spatial Approximation Tree (MASAT), which has low complexity and is computationally efficient without needing a preliminary optimization step [220]; this method is still limited to pose graphs. Other studies seek to obtain a good initial guess by introducing inertial measurements to support initialization [221,222] or by conducting parameter calibration [223–225].

Dynamic objects such as pedestrians, bicycles, other vehicles, etc., may cause estimation drifts since the system may wrongly identify them as static road entities. There are several methods to avoid this. Probabilistic maps that use probabilistic infrared intensity values were proposed in [226]. In this study, GNSS/INS and a 64-beam Lidar sensor were combined to achieve robust position RMS errors of 9 cm in dynamic environments. However, this system suffers from high cost and a high computational burden. The 3D Object Tracker [227] can be used to track moving objects in Visual SLAM methods. Another algorithm, proposed in [228], uses Canny's edge detector to find dominant edges in the vertical direction of a tree trunk and selects these tree trunks as typical salient features. As aforementioned, deep learning methods are increasingly being investigated to deal with dynamic environments [148–151,166,198].

Another source of drift is outliers within the sensor observations. Each sensor has its own error sources. For example, in the case of a camera, blurred images due to high speed and poor lighting conditions may cause incorrect identification of landmarks. Lidar sensors are sensitive to weather conditions (such as rainfall) and to large changes in the road environment. GNSS may suffer from signal blockage. Fault Detection and Isolation (FDI) techniques can be used to detect measurement outliers and reject their influence on positioning and localization [229].

The aforementioned SLAM error sources may also result in incorrect data association, the important process of associating measurements with specific landmarks. Wrong data association may result not only from noisy sensor data, inconsistency, wrong detection of dynamic objects, etc., but also from specific road environments. For instance, the highway environment is sometimes visually repetitive and contains many similar features, which makes it difficult to recognize a previously explored area.

Some researchers address the challenge of wrong data association directly at the frontend step of SLAM by using RANSAC [230], which is commonly used in Visual SLAM to reject outliers. In [231], the authors proposed a middle layer, referred to as Graph-Tinker (GTK), that can detect and remove false-positive loop closures; artificial loop closures are then injected into the pose graph using an Extended Rauch–Tung–Striebel smoother framework.
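
A minimal sketch of RANSAC-based outlier rejection at the visual frontend, using OpenCV functions; the inputs are assumed to come from a feature matcher such as the ORB sketch in Section 2.4:

```python
import cv2
import numpy as np

def frontend_ransac(pts1, pts2, K):
    """Reject wrong feature associations with RANSAC before pose estimation.

    pts1, pts2: (N, 2) matched pixel coordinates from two frames;
    K:          3x3 camera intrinsic matrix.
    """
    # RANSAC keeps only matches consistent with a single epipolar geometry,
    # discarding wrong associations and points on moving objects.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # Recover relative camera motion from the inlier set only.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t, inliers.ravel().astype(bool)
```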

The data association challenge can also be addressed at the backend step, since there is still a chance that outliers have not been totally eliminated. The concept of Switchable Constraints (SC) was introduced in [232], whereby a switchable variable is added to each loop closure constraint; once a constraint is considered an outlier, it can be turned off during optimization. In [233], the authors introduced an algorithm known as Realizing, Reversing, and Recovering (RRR), which is a consistency-based loop closure verification method. More recently, Carlone et al. [234] used ℓ1-relaxation to select "reliable" measurements, and Carlone and Calafiore [235] used convex relaxations to solve the nonconvex problem without the need for an initial guess of the unknown poses. The potential causes of SLAM drifts and the corresponding suggested solutions are summarized in Table 2.
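
For reference, the switchable constraints objective of [232] augments the pose-graph cost with one switch variable per loop closure (notation paraphrased from the original):

```latex
% Augmented objective with a switch variable s_ij per loop-closure edge:
\min_{X, S} \; \sum_{i} \lVert f(x_i, u_i) - x_{i+1} \rVert_{\Sigma_i}^{2}
 + \sum_{ij} \lVert \Psi(s_{ij})\,\big(f(x_i, x_j) - u_{ij}\big) \rVert_{\Lambda_{ij}}^{2}
 + \sum_{ij} \lVert \gamma_{ij} - s_{ij} \rVert_{\Xi_{ij}}^{2}
% \Psi maps s_ij into [0, 1]; driving s_ij toward 0 "switches off" an edge
% judged to be an outlier, at the cost of the switch prior term (gamma_ij = 1).
```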


**Table 2.** Potential causes of SLAM drifts and solutions.
