Article
Peer-Review Record

Lightweight Model Development for Forest Region Unstructured Road Recognition Based on Tightly Coupled Multisource Information

Forests 2024, 15(9), 1559; https://doi.org/10.3390/f15091559
by Guannan Lei 1, Peng Guan 2, Yili Zheng 2, Jinjie Zhou 1 and Xingquan Shen 3,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 31 July 2024 / Revised: 30 August 2024 / Accepted: 3 September 2024 / Published: 4 September 2024
(This article belongs to the Special Issue Modeling of Vehicle Mobility in Forests and Rugged Terrain)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript is entitled "Lightweight Model Development for Forest Region Unstructured Road Recognition Based on Tightly Coupled Multisource Information".

 

Abstract

In this section, the authors should say more about the main methodology of this research and add at least one more sentence; the methodology should be briefly introduced before presenting the results. It would also be good to add a keyword that better reflects the main methodology used; in my opinion, the authors need to add another word to the keywords section.

Introduction

Lines 42-43: can the authors better explain how it is possible to create an algorithm to recognize forest paths?

Could the authors obtain information about Tesla's autonomous driving program to better detect forest trails?

Line 64: can the authors explain more about Lidar 2D point clouds? This is very important for future analysis.

The first part of the introduction needs to be expanded to include sentences that explain the mapping of forest trails. The final results of this paper must be presented in terms of the mapping of forest roads and a better understanding of image quality in raster and vector enhancements.

 

Based on the above points and the fact that the authors can extend this analysis, I would like to recommend that the authors read and cite valuable references. In this research, the authors can read and analyze many valuable methods; these methods may be applicable.

The recommended references are:

- Valjarević, A., Djekić, T., Stevanović, V., Ivanović, R., & Jandziković, B. (2018). GIS numerical and remote sensing analyses of forest changes in the Toplica region for the period of 1953–2013. Applied Geography, 92, 131-139.

-Zhang, W., & Hu, B. (2021). Forest roads extraction through a convolution neural network aided method. International Journal of Remote Sensing, 42(7), 2706-2721. https://doi.org/10.1080/01431161.2020.1862438.

Materials and Methods

Can the authors provide more data on used vehicles?

Can the authors explain the characteristics of the images as samples?

How did the authors determine the matrix values?

Is it sufficient for this experiment to have an i7 processor with only 8 GB of RAM?

Deeplab-Road Model Architecture

How large are the average differences in the images between the input and output images?

Fig. 3: The flow chart must be drawn with larger symbols.

Parameterized Construction of Quasi-Structured Roads

As for the lines the authors have drawn on the quasi-roads, please explain them better: what were the characteristics of the CCD camera, and can you tell more about the process of pixelization?

How do the authors handle the visualization and mapping of Lidar data?

Conclusion section

In this section, the authors need to add the following important answers

-Why is this research important?

-What are the main findings of this research?

-A few sentences about the key findings could be added to this section.

Recommendation

The work has the scientific potential to be accepted after moderate revision.

Good luck to the authors


Comments for author File: Comments.pdf

Author Response

Response to Reviewer 1 Comments

Abstract

Point 1: In this section, the authors should say more about the main methodology of this research and add at least one more sentence; the methodology should be briefly introduced before presenting the results. It would also be good to add a keyword that better reflects the main methodology used; in my opinion, the authors need to add another word to the keywords section.

Response 1: We gratefully thank you for your time and effort in reviewing our manuscript! We greatly appreciate your valuable suggestions. Based on your opinion, we have added further specific explanations and an introduction of the main methodology to the Abstract section of the manuscript; in particular, the characteristics of the backbone network and the role of the pluggable modules are described. The revisions made to the manuscript have been marked with the highlighting function of “Track Changes”. Thanks again for your valuable suggestion. In addition, we have added two keywords in the keywords section to reflect our main methodology.

According to your suggestion, we have made the following corrections in the Abstract section:

Abstract: Promoting the deployment and application of embedded systems in complex forest scenarios is an inevitable developmental trend in advanced intelligent forestry equipment. Unstructured roads, which lack effective artificial traffic signs and reference objects, pose significant challenges for driverless technology in forest scenarios owing to their high nonlinearity and uncertainty. In this research, an unstructured road parameterization construction method, “DeepLab-Road,” based on tight coupling of multi-source information is proposed, which aims to provide a new segmentation architecture scheme for the embedded deployment of a driving assistance system for forestry engineering vehicles. DeepLab-Road utilizes MobileNetV2 as the backbone network, which improves the completeness of feature extraction through the inverted residual strategy. Then, it integrates pluggable modules, including DenseASPP and a strip-pooling mechanism, which connect the dilated convolutions in a denser manner to improve feature resolution without significantly increasing the model size. Boundary pixel tensor expansion is then completed through a cascade of two-dimensional Lidar point cloud information. Combined with the coordinate transformation, a quasi-structured road parameterization model in the vehicle coordinate system is established. The strategy was trained on a self-built Unstructured Road Scene Dataset and transplanted onto our intelligent experimental platform to verify its effectiveness. Experimental results show that the system can meet real-time data processing requirements (≥12 frames/s) under low-speed conditions (≤1.5 m/s). For the trackable road centerline, the average matching error between the image and the Lidar was 0.11 m. This study offers valuable technical support for autonomous navigation in unstructured environments where satellite signals are denied and high-precision maps are unavailable, such as forest product transportation, agricultural and forestry management, autonomous inspection and spraying, nursery stock harvesting, skidding, and transportation.

Keywords: DeepLab-Road; expand convolutional layer; inverted residuals; lightweight model for embedded systems; unstructured road recognition in forest region.

Introduction

Point 2: Lines 42-43: can the authors better explain how it is possible to create an algorithm to recognize forest paths?

Response 2: We greatly appreciate your valuable suggestions. The commonly used methods for identifying forest roads are aerial photography and airborne laser scanning (ALS). Aerial photography can be used to locate and measure roads that are clearly distinguishable from the surrounding forest canopy. However, considering that forest roads are commonly occluded by dense canopy cover, the use of aerial photography for forest road identification is limited. The use of ALS can facilitate road detection in such situations and allows road status and width assessments over large forest areas. ALS is increasingly used in operational forestry as it provides reliable and accurate terrain information, even under dense forest cover. Despite the development of several ALS-based road extraction methods, the remoteness and size of forested ecosystems continue to limit the monitoring and updating of forest road networks.

We have added a description of common methods for traditional forest road detection in this section (Lines: 46-52):

The commonly used methods for identifying forest roads are aerial photography and airborne laser scanning (ALS). Aerial photography can be used to locate and measure roads that are clearly distinguishable from the surrounding forest canopy. The use of ALS can facilitate road detection in such situations and allows road status and width assessments over large forest areas. However, considering the resolution of aerial images and the huge computational volume of point cloud data, the use of these methods for forest road identification is limited [2]. While forest road identification methods based on Convolutional Neural Networks (CNNs) have developed considerably since 2012, two major challenges remain in road image recognition: (i) deep network models focus primarily on improving accuracy rather than efficiency; (ii) similar to the structured driving environments of the United States or Europe, most current work focuses on well-structured driving environments and cannot be separated from the assistance of high-precision maps [3]. Current solutions still have considerable limitations in terms of portability and adaptability to a wider range of forest scenarios and unstructured road conditions.

Point 3: Could the authors obtain information about Tesla's autonomous driving program to better detect forest trails?

Response 3: We appreciate your valuable comment. During the experiment, we extensively searched for information and literature on the current research progress in unstructured road recognition in field scenarios. Tesla, as a pioneer in the field of autonomous driving, has extensive technical accumulation and application cases in road recognition, unmanned control, and high-level assisted driving. Based on our limited knowledge of Tesla, its main application scenarios focus on urban structured roads, relying on “Vehicle-Road-Cloud” integration technology to achieve intelligent decision-making and autonomous driving. The commonality between this study and Tesla's application scenarios is that both need to face and solve the problems of data completeness and limited application scenarios caused by the lack of roadside information. The difference is that Tesla has relatively complete “Vehicle-Road-Cloud” information in urban scenes and on structured roads, where the problems of satellite signal loss and incomplete map information are usually short-term and can be overcome. However, autonomous vehicles face even harsher environmental and signal conditions in forest scenes.

In future research, we will further focus on and refer to Tesla's relevant research in the field of autonomous driving and solutions for road recognition. Thank you very much for your kind advice and reminder.

Point 4: Line 64: can the authors explain more about Lidar 2D point clouds? This is very important for future analysis.

Response 4: We greatly appreciate your valuable suggestions. As pointed out in your comment, 2D LiDAR is one of the very important environmental perception methods for unmanned vehicles. We have supplemented the current research status of 2D LiDAR in the field of autonomous driving in lines 75-90 of this manuscript. At the same time, corresponding adjustments and updates have been made to the cited literature in the reference section.

Based on your suggestion, we have added the following content to the Introduction section:

The utilization of Lidar point cloud data is also prevalent in unmanned vehicle systems owing to its inherent benefits in providing intuitive distance measurements and effective obstacle sensing [8,9]. 2D Lidar is widely used in the environment perception of unmanned vehicles due to its simple structure, small number of point clouds, and fast computing speed. Palacín et al. [10] proposed mobile robot self-localization based on an onboard 2D push-broom Lidar using a previously obtained 2D reference map; self-localization with a 2D push-broom Lidar is possible by projecting the scan points onto the horizontal plane of the 2D reference map before applying a 2D self-location algorithm. Hassan et al. [11] proposed a map correction method with planar environmental constraints, which introduces a complementary filter with moving-average filtering for Lidar pose estimation to overcome severe vibration of the 2D point cloud. Zhang Bin et al. [12] proposed a novel loop closure detection algorithm based on global feature point matching, which creates a tightly coupled front end to mitigate accumulated front-end errors and constructs globally consistent maps. In general, high-reliability and high-confidence perception based on the fusion of 2D Lidar and other sensors remains one of the exploration directions for lightweight models [13].

Point 5: The first part of the introduction needs to be expanded to include sentences that explain the mapping of forest trails. The final results of this paper must be presented in terms of the mapping of forest roads and a better understanding of image quality in raster and vector enhancements.

Response 5: We appreciate your valuable comment. Connecting and clearly explaining the path from unmanned driving in the wild to forest trails is of great significance for clarifying the relationship between forest trails, unstructured roads, and this manuscript, and it can increase the readability and interpretability of the article. In addition, as you mentioned, it is also very important to provide relevant explanations of image quality in raster and vector enhancement. This is of great significance for indicating the image size used during the experimental process and for explaining the computation speed on the images.

Based on your suggestion, we have added the relevant explanation about forest trails to the Introduction section (Lines 39-41):

As forest trails are typical unstructured roads in forest areas, their effective identification can provide new solutions for the autonomous planning and decision-making of robots in the wild.

Regarding the image quality of forest trails mentioned in the manuscript, we have provided further supplementary explanations in the appropriate section of the Results (3.2) (Lines: 393-397):

It should be noted that the images of forest trails in Figure 6 and Figure 7 correspond one-to-one: the images in Figure 6 are the original images captured by the vehicle-mounted camera, and Figure 7 shows the corresponding processing results of the algorithm proposed in this study. The collected images were uniformly 640 × 480 pixels.

Point 6: Based on the above points and the fact that the authors can extend this analysis, I would like to recommend the authors to read and cite valuable references. In this research, authors can read and analyze many valuable methods. These methods may be applicable.

The recommended references are:

- Valjarević, A., Djekić, T., Stevanović, V., Ivanović, R., & Jandziković, B. (2018). GIS numerical and remote sensing analyses of forest changes in the Toplica region for the period of 1953–2013. Applied Geography, 92, 131-139.

- Zhang, W., & Hu, B. (2021). Forest roads extraction through a convolution neural network aided method. International Journal of Remote Sensing, 42(7), 2706-2721. https://doi.org/10.1080/01431161.2020.1862438.

Response 6: We greatly appreciate your kind recommendation. Forest roads are important features of the environment that have significant impacts on wildlife habitats. We have carefully studied the references you recommended. The methods provided in these references have considerable value and reference significance, and they have also prompted our thinking about the next research direction and content. At the same time, we have updated the References section accordingly.

Based on your suggestion, we have inserted the recommended references into the appropriate position in the Introduction section (Lines: 70-73 and Line 144):

Zhang et al. [7] utilized multivariate Gaussian and Laplacian of Gaussian (LoG) filters and VGG 16 on high-spatial-resolution multispectral imagery to extract both primary and secondary roads in forested areas.

However, their acquisition predominantly focuses on urban structured road scenes with robust infrastructures [33-36], revealing a conspicuous gap in the distribution of sample data for unstructured forest scenes.

Materials and Methods

Point 7: Can the authors provide more data on used vehicles?

Response 7: Of course. Based on your suggestion, we have further supplemented the structure and parameters of the experimental platform, namely the vehicle shown in Figure 1 (Lines 170-173):

The experimental platform adopted the design structure of wood transport vehicles, with a tractor using the Ackermann chassis structure and an unpowered four-wheel trailer. The dimensions of the tractor and trailer were consistent: the length of a single vehicle body was 970 mm, and the width was 680 mm.

Point 8: Can the authors explain the characteristics of the images as samples?

Response 8: We fully understand your concerns. We did a lot of thinking and deployment work in the early stage regarding the selection of image samples and our database. To ensure the generalization and robustness of the model, we tried to expand the distribution range of the samples as much as possible and to increase the differences between samples. We selected samples from two aspects. On the one hand, the experimental results show that both subjective and objective considerations indicate a significant influence of seasonal variations on the image features within the scene's target regions; therefore, the unstructured scene samples cover three different seasons: spring, summer, and autumn. On the other hand, the images include forest road surfaces with different surface structures: vegetated pavements, gravel pavements, pavements covered by litter, dirt roads, flagstone roads, road surfaces overexposed by the camera, and hardened roads with wide surfaces and blurred road boundaries. We also want to enhance the generalization ability of the model by extracting different feature images.

Following your suggestion, we have modified and explained this in the manuscript (Lines: 343-352):

To ensure algorithmic effectiveness and generalizability, a comprehensive collection of unstructured road data from diverse field scenes has been actively pursued.

Figure 6 (a)-(d) are sourced from the RUGD public dataset, and the remaining figures are derived from our self-constructed URSD dataset. The experimental results show that both subjective and objective considerations indicate a significant influence of seasonal variations on the image features within the scene's target regions. Therefore, in the URSD set, the unstructured scenes cover three different seasons: spring, summer, and autumn. From another perspective, the URSD dataset also includes pavements with different surface structures: vegetated pavements (Figure 6 (a)), gravel pavements (Figure 6 (b)-(c)), pavements covered by litter (Figure 6 (d)-(e)), dirt roads (Figure 6 (f)-(i)), flagstone roads (Figure 6 (j)-(m)), road surfaces overexposed by the camera (Figure 6 (n)), and hardened roads with wide surfaces and blurred road boundaries (Figure 6 (o)-(p)).

 

Point 9: How did the authors determine the matrix values?

Response 9: We fully understand your concerns. These two matrices are the parameter matrices of the joint calibration of the camera and Lidar in the experiment. In order to ensure the accuracy and efficiency of multi-sensor information fusion, the joint calibration method of the visual camera and 2D Lidar was developed and verified in the experiment (including calibration in the spatial dimension and calibration in the time dimension). Regarding the specific content of this section, we developed and validated the method in our previous research and published a paper explaining it. Due to space limitations, the parameters and results of the calibration matrices are directly provided here. Detailed information about the calibration method can be found in our previous research [38,39].

Meanwhile, based on your suggestion, we have also provided relevant explanations in the manuscript (Lines: 174-178) and supplemented the references at the end of the manuscript:

In order to ensure the accuracy and efficiency of multi-sensor information fusion, the joint calibration method of the visual camera and 2D Lidar was developed and verified in the experiment (including calibration in the spatial dimension and calibration in the time dimension). Detailed information about the calibration method can be found in our previous research [38,39].

Point 10: Is it sufficient for this experiment to have an i7 processor with only 8 GB of RAM?

Response 10: We greatly appreciate your kind reminder. In fact, our experiments are divided into two parts: laboratory experiments and field experiments. In the early stages of data collection, data partitioning, model training, and network structure adjustment, the configuration of our laboratory computer is much higher than that of the vehicle platform. After the model is trained, it needs to be ported to the vehicle platform for validation (as you can see in the manuscript). The purpose is to conduct extensive experimentation and improve efficiency. We have made many attempts and efforts to obtain a lightweight model. In harsh forest environments, vehicle vibrations caused by unstructured roads can cause unpredictable damage to the onboard computer; in fact, the upper computer crashed or was damaged almost every week due to vibration during the experiments, which caused us a lot of trouble. Therefore, achieving unstructured road recognition with limited resources is our research goal. In subsequent research and experiments, we will provide corresponding solutions for system shock absorption and improve the configuration of the upper computer, thereby further enhancing the computing power of the system.

Deeplab-Road Model Architecture

Point 11: How large are the average differences in the images between the input and output images?

Response 11: We greatly appreciate your kind reminder. We would like to address the differences between input and output images from two perspectives. In terms of image scale, the input and output images are the same size; as described in the manuscript, the image size is 640 × 480 pixels. In terms of image semantics, the input image is the original image captured by the visual camera, and the output image is obtained through semantic segmentation of the initial image. First, the algorithm determines whether each pixel belongs to the road area. Then, a translucent mask (the same size as the original image) is placed on top of the original image: according to the pixel classification results, road areas are filled with red, while non-road areas are filled with black. Finally, the irregular non-road boundaries are fitted and pseudo-structured road boundary lines are drawn. Thus, we obtain the results shown in the following figure, where two images have been selected for comparison and explanation.

                          

[Figure 1 in this response: (a) initial image; (b) result image — comparison between the original image and the running result.]

In summary, there is a necessary connection and causal relationship between the original image and the image processing result. In fact, removing the mask from Figure 1(b) recovers the original Figure 1(a); the algorithm only performs semantic understanding and segmentation of the original image. We hope this explanation clears up your doubts.
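For readers who wish to reproduce this overlay step, a minimal sketch is given below; it assumes the binary road mask output by the segmentation network, uses OpenCV and NumPy, and the file names and the 40% mask opacity are illustrative placeholders rather than values taken from the manuscript.

import cv2
import numpy as np

# Hypothetical file names, for illustration only.
image = cv2.imread("frame_0001.png")                            # original 640x480 BGR frame
road_mask = cv2.imread("mask_0001.png", cv2.IMREAD_GRAYSCALE)   # 255 = road, 0 = non-road

# Colour layer: road pixels red, non-road pixels black (OpenCV uses BGR order).
overlay = np.zeros_like(image)
overlay[road_mask > 0] = (0, 0, 255)

# Blend the translucent mask onto the original image; alpha is an arbitrary example value.
alpha = 0.4
result = cv2.addWeighted(image, 1.0 - alpha, overlay, alpha, 0.0)
cv2.imwrite("result_0001.png", result)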

Point 12: Fig. 3 The flow chart must be drawn with the larger symbols.

Response 12: Thank you for your careful inspection. We have rearranged and formatted Figure 3, hoping we have understood your meaning correctly.

Parameterized Construction of Quasi-Structured Roads

Point 13: As for the lines the authors have drawn on the quasi-roads, please explain them better: what were the characteristics of the CCD camera, and can you tell more about the process of pixelization?

Response 13: We greatly appreciate your kind suggestion. The pseudo-structured road is generated to solve the problem of irregular road boundaries under image segmentation. Here, the minimum scale of semantic segmentation is the pixel, and the discretization and irregularity of road boundaries lead to extremely rough boundaries. Figure 2(a) below shows the binary segmentation result of an image randomly extracted from the dataset: the black area represents non-road areas, the blue area represents road areas, and the red and green curves represent the left and right road boundaries, respectively. Obviously, letting such road boundaries directly guide the navigation of an autonomous driving system may lead to catastrophic consequences. Therefore, it is necessary to perform a smooth fitting of the pseudo-structured road (Figure 2(b)).

In Figure 2(b), the red curve is the fitted road boundary line, and the black curve is the road median line calculated from the road boundary lines.

                        

[Figure 2 in this response: (a) binary image of semantic segmentation; (b) pseudo-structured path of the original image.]

Based on your suggestion, we have provided further supplementary explanations of the reasons, methods, effects, and roles of generating pseudo-structured roads (Lines: 375-387):

The pseudo-structured road is generated to solve the problem of irregular road boundaries under image segmentation. Here, the minimum scale of semantic segmentation is the pixel. Uneven road boundaries can cause frequent and discontinuous directional adjustments of the unmanned system, which is unsafe for unmanned navigation in the wild; letting such road boundaries directly guide the navigation of an autonomous driving system may lead to catastrophic consequences. Therefore, it is necessary to perform a smooth fitting of the pseudo-structured road. The binary image was derived from the mask image, and the boundary delineation of irregular unstructured roads was expeditiously accomplished using the Sobel edge detection algorithm. Subsequently, the furthest point was employed as the pivotal boundary point for the separation of the left and right borders, and a cubic Bézier curve fitting was performed based on the image coordinates of the sampled points. The green curve represents the fitted road boundary line, whereas the black curve represents the road centerline calculated from the road boundary lines.
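To make this procedure easier to follow, a minimal sketch of the same sequence (Sobel edge detection on the binary mask, splitting the left and right borders at the furthest point, cubic Bézier fitting, centerline as the mean of the two fitted borders) is given below. The helper fit_cubic_bezier, the file name, and the choice of the topmost edge pixel as the "furthest point" are illustrative assumptions, not code taken from the manuscript.

import cv2
import numpy as np

def fit_cubic_bezier(points, n_samples=100):
    # Least-squares fit of a cubic Bezier curve to ordered 2D points, evaluated at n_samples positions.
    n = len(points)
    t = np.linspace(0.0, 1.0, n)
    B = np.stack([(1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t ** 2 * (1 - t), t ** 3], axis=1)
    ctrl, *_ = np.linalg.lstsq(B, points, rcond=None)          # four control points
    ts = np.linspace(0.0, 1.0, n_samples)
    Bs = np.stack([(1 - ts) ** 3, 3 * ts * (1 - ts) ** 2, 3 * ts ** 2 * (1 - ts), ts ** 3], axis=1)
    return Bs @ ctrl

# Binary segmentation result (road = 255, non-road = 0); the file name is a placeholder.
road_mask = cv2.imread("mask_0001.png", cv2.IMREAD_GRAYSCALE)

# Sobel edge detection on the binary mask.
gx = cv2.Sobel(road_mask, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(road_mask, cv2.CV_32F, 0, 1, ksize=3)
ys, xs = np.nonzero(np.hypot(gx, gy) > 0)

# Split the boundary pixels into left and right borders at the furthest (topmost) road point.
split_x = xs[np.argmin(ys)]
left = np.column_stack([xs[xs < split_x], ys[xs < split_x]]).astype(float)
right = np.column_stack([xs[xs >= split_x], ys[xs >= split_x]]).astype(float)

# Fit each border with a cubic Bezier curve and take their mean as the road centerline.
left_fit = fit_cubic_bezier(left[np.argsort(left[:, 1])])
right_fit = fit_cubic_bezier(right[np.argsort(right[:, 1])])
centerline = (left_fit + right_fit) / 2.0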

Point 14: How do the authors handle the visualization and mapping of Lidar data?

Response 14: We fully understand your concerns. In this study, multi-source data fusion was achieved through reprojection between point clouds and images. The fusion strategy of image and point cloud data is as follows:

Building upon the joint calibration outcomes of the CCD camera and 2D Lidar, a re-projection of the point cloud onto the image was executed. The distance information of the point cloud is assigned to the corresponding pixel along the fitted boundary line, leading to an extension of the pixel coordinate vectors. Subsequently, the positional information of the road boundary is transformed into the actual local scene coordinate system through a continuous sequence of coordinate transformations involving image, camera, vehicle, and world coordinates. In fact, we have weakened the visualization of point clouds and instead used reprojection techniques to assign distance information to the corresponding pixel points at the road boundary.

Based on your suggestion, we have provided further supplementary explanations in Section 3.3, Reprojection of Images and 2D Point Clouds (Lines: 401-419):

The semantic segmentation results of the image lack distance information, so combining point cloud information is necessary to construct complete road information in the vehicle coordinate system. Therefore, quasi-structured road construction is considered to provide local coordinate information for the tracking control of UGV in forest scenarios.

Building upon the joint calibration outcomes of the CCD camera and 2D Lidar, a re-projection of the point cloud onto the image was executed. The distance information of the point cloud is assigned to the corresponding pixel along the fitted boundary line, leading to an extension of the pixel coordinate vectors. Subsequently, the positional information of the road boundary is transformed into the actual local scene coordinate system through a continuous sequence of coordinate transformations involving image, camera, vehicle, and world coordinates.
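For clarity, a minimal sketch of this re-projection chain is given below. The intrinsic matrix K, the extrinsic rotation R and translation T, and the axis convention are illustrative placeholders only; the actual values come from the joint calibration described in [38,39].

import numpy as np

# Illustrative calibration parameters (placeholders, not the calibrated values from the paper).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])                    # camera intrinsics for a 640x480 image
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])                    # Lidar -> camera rotation (typical axis swap)
T = np.array([0.0, -0.10, 0.05])                   # Lidar -> camera translation in metres

def scan_to_points(ranges, angles):
    # Convert a 2D Lidar scan (range, bearing) into 3D points in the Lidar frame (scan plane z = 0).
    return np.column_stack([ranges * np.cos(angles), ranges * np.sin(angles), np.zeros_like(ranges)])

def project_to_image(points_lidar):
    # Re-project Lidar points into pixel coordinates using the joint calibration (R, T, K).
    pts_cam = points_lidar @ R.T + T               # Lidar frame -> camera frame
    keep = pts_cam[:, 2] > 0                       # keep points in front of the camera
    uvw = pts_cam[keep] @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]                  # perspective division -> pixel coordinates
    return uv, pts_cam[keep, 2]                    # pixels and the associated depths

def attach_depth(boundary_uv, lidar_uv, lidar_depth):
    # Assign each fitted boundary pixel the depth of the nearest projected Lidar point.
    idx = np.argmin(np.linalg.norm(lidar_uv[None, :, :] - boundary_uv[:, None, :], axis=2), axis=1)
    return lidar_depth[idx]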

Conclusion section

Point 15: In this section, the authors need to add the following important answers:

-Why is this research important?

Response 15: We greatly appreciate your valuable suggestions. Based on your suggestion, we have added an explanation of the importance of this research to the Conclusion section (Lines: 518-528):

Forest roads are important features of the environment that have significant impacts on forestry equipment automation and unmanned forest operations. Aiming at the insufficient generalization ability of intelligent navigation systems in current forest scenarios, this study proposes DeepLab-Road, a lightweight quasi-structured road recognition and reconstruction scheme suitable for embedded systems. The algorithm has good real-time performance, strong robustness, and environmental adaptability, and it can provide more accurate parameterized road models for unmanned ground vehicles navigating in unstructured scenes. This technology has important application value in promoting the autonomous navigation of intelligent robots in unstructured scenarios, such as forest scenarios, port and dock cargo handling, urban underground space exploration, reconnaissance in harsh battlefield environments, independent picking on farms, and forest nursery care.

Point 16: -What are the main findings of this research? A few sentences about the key findings could be added to this section.

Response 16: We greatly appreciate your valuable suggestions. Based on your suggestion, we have reorganized and adjusted the main findings of this research in the conclusion section.

The main findings of this research are as follows (Lines: 529-559):

This study focuses on autonomous navigation technology for unstructured road scenes in forest scenarios and proposes DeepLab-Road, a lightweight quasi-structured road recognition and reconstruction scheme suitable for embedded systems. The model uses MobileNetV2 as the backbone network and integrates DenseASPP and the plug-and-play Strip Pooling module to balance the real-time performance of forest engineering vehicles in outdoor environments with image segmentation accuracy. Combined with re-projection technology, the detailed geometric shape and topological information of the road boundaries provide accurate guidance on the optimal route for vehicles and humans. At the same time, this approach alleviates, to some extent, the difficulty of real-time computation in practical application scenarios caused by the redundancy of 3D point cloud data and the lack of clear, unified point cloud data structure features. The construction of pseudo-structured roads provides a parameterized road model for local UGV navigation lacking satellite signals and high-precision map support.

The main contributions of this study also include the self-built URSD dataset. It mainly includes unstructured road data in open scenarios, and the data consistency of this dataset is good. It eliminates, as much as possible, interference from factors such as lane lines, traffic lights, people, and vehicles found in traditional unmanned driving datasets, enabling scholars and models to focus on unstructured road areas. All data samples come from the collected original images, without image rotation, image flipping, noise addition, or other augmentations. When new samples appear, the training system can be upgraded and updated by adding them to the training set to expand the samples and retrain the model; this means that the system has good plasticity, robustness, and portability. The dataset that matches the images and point clouds will be systematically curated, disseminated, and showcased in forthcoming research efforts.

It is noteworthy that the dataset explicitly excludes winter scenes with snow-covered roads. In practice, we did capture some unstructured road scenes post-snowfall during winter. However, the homogeneity of color and texture between the foreground and background in images depicting snow-covered roads poses a challenge for feature extraction. This complexity misguides the trained model and introduces certain confusion and complications. In subsequent endeavors, we plan to extensively augment the dataset and undertake further research and exploration into this particular issue.

We gratefully thank you for your time and effort in reviewing our manuscript! Thank you again!

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

In the paper “Lightweight Model Development for Forest Region Unstructured Road Recognition Based on Tightly Coupled Multisource Information” by Guannan Lei et al., the authors propose their own algorithm, “DeepLab-Road”, based on the lightweight network MobileNetV2, for recognizing roads in an input image. The algorithm can also be used for recognition of unstructured road images collected in forest scenarios.

Judging by Figures 7 and 9, the algorithm works very well, which is crucially important for unmanned platforms navigating complex outdoor scenes in the wild and performing real-time road detection and direction estimation.

The authors provide a comparison with other well-known recognition algorithms, and judging by Figure 9 and Table 1, DeepLab-Road shows the best results in time and image recognition.

The methods and materials are described in detail. The algorithm claimed to be effective, Strip Pooling, is explained quite clearly in Figure 5.

I have some minor comments.

1)      Such a well-working algorithm should be reproducible so that anyone who wants to can use or improve it. Therefore, the authors need to expand its description in Section 3.2, in particular by clearly specifying all the criteria for determining the road boundaries that they used.

2)      More clarification is needed on the DeepLab-Road Network Framework Structure (Figure 2).

3)      What do R and T in formulas (1)  and (2) mean?

 I would like to separately praise the authors for the very beautiful photographs of the roads.

Author Response

Response to Reviewer 2 Comments

Point 1:  Such a well-working algorithm should be reproducible so that anyone who wants to, can use or improve it. Therefore, the authors need to expand its description in section 3.2, in particular, clearly specifying all the criteria for determining the road boundaries that they used.

Response 1: We gratefully thank you for your time and effort in reviewing our manuscript, and we fully understand your concern. Based on your suggestion, we have added further specific explanations and introductions of the main methodology in Section 3.2 (Lines: 355-367 and 375-397):

The pseudo-structured road is generated to solve the problem of irregular road boundaries under image segmentation. The boundary of the mask area is irregular and tortuous, which would cause frequent and drastic changes in the direction control commands during the control process, making it impossible to ensure the stability and comfort of the vehicle tracking system. Letting such road boundaries directly guide the navigation of an autonomous driving system may lead to catastrophic consequences. Therefore, it is necessary to perform a smooth fitting of the pseudo-structured road. The binary image was derived from the mask image, and the boundary delineation of irregular unstructured roads was expeditiously accomplished using the Sobel edge detection algorithm. Subsequently, the furthest point was employed as the pivotal boundary point for the separation of the left and right borders, and a cubic Bézier curve fitting was performed based on the image coordinates of the sampled points. The green curve represents the fitted road boundary line, whereas the black curve represents the road centerline calculated from the road boundary lines.

The DeepLab-Road model demonstrates noteworthy segmentation efficacy in such scenarios. From a quantitative analytical standpoint, the delineation of boundaries for these roads is more explicit compared with their unstructured counterparts, yielding a superior and more precise segmentation outcome. Thus, preliminary construction can be achieved by moving from image segmentation to quasi-structured roads.

It should be noted that the images of forest trails in Figure 6 and Figure 7 correspond one-to-one: the images in Figure 6 are the original images captured by the vehicle-mounted camera, and Figure 7 shows the corresponding processing results of the algorithm proposed in this study. The collected images were uniformly 640 × 480 pixels.

Point 2: More clarification is needed on the DeepLab-Road Network Framework Structure (Figure 2).

Response 2: We greatly appreciate your valuable suggestions. Based on your suggestion, we have reorganized this section and supplemented the relevant content (Lines: 192-225):

Unlike structured roads, unstructured road images collected in forest scenarios exhibit varied textural features, inconspicuous feature clutter, and irregular road shapes and boundaries. It is challenging to extract and identify unstructured road areas because of these non-structural features. Preliminary experiments revealed that common network models could not be directly applied to unstructured road image segmentation tasks. The existing challenges are primarily reflected in three aspects: (i) traditional networks have complex backbone feature extraction networks and a deeper network structure, leading to potential difficulties in training and deployment; (ii) the long-range contextual information in the image has not been fully utilized, although it has been proven to be highly effective for unstructured road segmentation; (iii) the multi-scale features generated by atrous spatial pyramid pooling in traditional networks have limited feature resolution along the scale axis and are not dense enough for the unstructured road segmentation scenario.

In response to these issues, the DeepLab-Road model was designed in this study, as shown in Figure 2. The model utilizes MobileNetV2 as the backbone network and incorporates strip pooling and DenseASPP into the encoder. The first layer within the bottleneck blocks employs a pointwise convolution with a kernel size of 1×1, thereby enabling dimensionality expansion; the second layer involves a separable depthwise convolution with a spatial extent of 3×3; and the concluding layer consists of a convolution with a spatial extent of 1×1. The inverted residual structure facilitates comprehensive feature extraction and mitigates the risk of gradient loss.

Then, the DenseASPP structure was incorporated into the backbone network. It combines atrous convolution and atrous spatial pyramid pooling to create a denser feature pyramid and a larger receptive field. DenseASPP achieves richer feature-scale sampling and information sharing by cascading and densely connecting atrous convolutional layers with different dilation rates, which improves the recognition accuracy of the network. It is particularly suitable for processing high-resolution images and for tasks that require capturing a large range of contextual information.

On this basis, the strip pooling strategy is further introduced to reformulate spatial pooling so that the backbone network can model long-range dependencies effectively. It collects rich contextual information by utilizing pooling operations with different kernel shapes to explore images of complex scenes. For each spatial position in the pooled feature map, it encodes the global horizontal and vertical information and then uses these encodings to balance its own weights for feature optimization. It can be used as an effective plug-and-play module in existing scene analysis networks.
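As a concrete illustration of these two components, a simplified PyTorch sketch is shown below. The channel counts, expansion factor, and fusion details are schematic assumptions that follow the general MobileNetV2 inverted-residual and strip-pooling designs; they do not reproduce the exact DeepLab-Road configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InvertedResidual(nn.Module):
    # MobileNetV2-style bottleneck: 1x1 expansion -> 3x3 depthwise -> 1x1 projection.
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out   # residual connection when shapes match

class StripPooling(nn.Module):
    # Plug-and-play strip pooling: encode global horizontal/vertical context for each position.
    def __init__(self, ch):
        super().__init__()
        self.conv_h = nn.Conv2d(ch, ch, (1, 3), padding=(0, 1), bias=False)
        self.conv_v = nn.Conv2d(ch, ch, (3, 1), padding=(1, 0), bias=False)
        self.fuse = nn.Conv2d(ch, ch, 1, bias=False)

    def forward(self, x):
        n, c, h, w = x.shape
        xh = F.adaptive_avg_pool2d(x, (h, 1))                    # pool over width  -> column strip
        xv = F.adaptive_avg_pool2d(x, (1, w))                    # pool over height -> row strip
        xh = F.interpolate(self.conv_v(xh), size=(h, w), mode="nearest")
        xv = F.interpolate(self.conv_h(xv), size=(h, w), mode="nearest")
        weight = torch.sigmoid(self.fuse(F.relu(xh + xv)))
        return x * weight                                        # re-weight the input features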

Point 3: What do R and T in formulas (1) and (2) mean?

Response 3: Thank you for your careful review; we fully understand your concerns. These two matrices are the parameter matrices of the joint calibration of the camera and Lidar in the experiment. In order to ensure the accuracy and efficiency of multi-sensor information fusion, the joint calibration method of the visual camera and 2D Lidar was developed and verified in the experiment (including calibration in the spatial dimension and calibration in the time dimension). Regarding the specific content of this section, we developed and validated the method in our previous research and published a paper explaining it. Due to space limitations, the parameters and results of the calibration matrices are directly provided here. Detailed information about the calibration method can be found in our previous research [38,39].

Meanwhile, based on your suggestion, we have also provided relevant explanations in the manuscript (Lines: 174-178) and supplemented the references at the end of the manuscript:

In order to ensure the accuracy and efficiency of multi-sensor information fusion, the joint calibration method of the visual camera and 2D Lidar was developed and verified in the experiment (including calibration in the spatial dimension and calibration in the time dimension). Detailed information about the calibration method can be found in our previous research [38,39].
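As a general illustration only (the exact notation of Equations (1) and (2) should be read from the manuscript), in standard joint-calibration formulations R denotes the 3×3 rotation matrix and T the 3×1 translation vector of the extrinsic transformation that maps a Lidar point into the camera frame, after which the intrinsic matrix K projects it onto the image plane:

s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \left( R \begin{bmatrix} X_L \\ Y_L \\ Z_L \end{bmatrix} + T \right), \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

where (u, v) are pixel coordinates, s is the projective scale factor, and (X_L, Y_L, Z_L) is the point expressed in the Lidar coordinate system.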

We gratefully thank you for your time and effort in reviewing our manuscript! Thank you again!

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Dear authors,

The study focuses on detecting unstructured roads and reconstructing pseudo-structured roads for unmanned ground vehicles (UGVs) in complex outdoor environments. It introduces the DeepLab-Road model, which uses MobileNetV2 as the backbone and incorporates Stripe Pooling and DenseASPP modules for efficient and accurate road recognition.

The paper presents a highly innovative approach by introducing the DeepLab-Road model, which effectively combines MobileNetV2, DenseASPP, and Stripe Pooling to tackle the complex challenge of unstructured road detection. This combination of technologies is particularly noteworthy for its potential to enable real-time applications in off-road and forest environments. A major strength of the study is the creation of the Unstructured Road Surface Dataset (URSD), which is specifically tailored to the research, enhancing the relevance and practical applicability of the findings. Additionally, the comprehensive methodology, integrating advanced techniques to capture complex features and contextual information, demonstrates the robustness of the research. The study’s focus on addressing practical challenges in vehicle mobility, such as real-time processing in complex terrains, further underscores its significance and potential impact in both military and civilian contexts within forestry.

However, I believe there are several areas that need improvement for the paper to be considered relevant. Improvements 2 and 3 are particularly crucial, while some other aspects may be less pressing and could be addressed in future work.

Improvement 1: Evaluation in different environments:

The current study primarily tests the DeepLab-Road model under specific conditions that may not fully represent the diverse range of environments where off-road vehicles operate. There are some real-case conditions that should be evaluated or tested to better understand its robustness and versatility. This could include:

Weather Conditions:

·       Rain: Testing the model under rainy conditions to see how well it handles water on the ground, which can obscure unstructured roads and create reflective surfaces that might confuse the model.

·       Fog: Evaluating the model’s performance in foggy conditions, where visibility is reduced, would test its ability to detect roads when visual information is partially obscured.

Lighting Scenarios:

·       High vs. low light: Assessing the model’s effectiveness under both hard sunlight and diffuse light on cloudy days would determine its ability to function under varying light conditions.

·       Shadowed Areas: Testing in environments with dappled sunlight, such as in dense forests where light filters through the canopy, would evaluate how well the model copes with high-contrast lighting and shadowing effects.

Varying Vegetation Densities:

·       Sparse vs. Dense Vegetation: The model should be tested in areas with different levels of vegetation density. Sparse vegetation might expose more of the terrain, making road detection easier, while dense vegetation could obscure roads and create false positives (e.g., interpreting shadows or gaps in the canopy as roads).

·       Seasonal Vegetation Changes: Evaluating the model in environments where vegetation changes with the seasons (e.g., leaf-covered vs. leafless in deciduous forests) would help determine if the model can adapt to these variations.

Terrain Variability:

·       Flat vs. Rugged Terrain: Including different terrain types, from relatively flat and straightforward paths to rugged, uneven terrains with obstacles like boulders and fallen logs, would test the model’s adaptability to varying terrain roughness and complexity.

·       Mud and Slippery Surfaces: Testing on muddy or wet ground would add another layer of complexity, as these surfaces can be difficult for vehicles to traverse and might affect the accuracy of road detection.

 

Improvement 2: Enhancing real-time processing capabilities:

The DeepLab-Road model, while designed with a lightweight architecture using MobileNetV2, faces challenges in achieving real-time processing, especially when dealing with the complexities of 3D point cloud data which are essential for capturing the detailed structure of the terrain but require significant computational power to process efficiently.

There are some potential solutions, some of them essential when designing this type of platform, that could be considered now or as future work:

1.       Model optimization:

·       Pruning and Quantization: Techniques like pruning, where unnecessary parts of the model are removed, and quantization, where the precision of the model’s weights and activations is reduced, can significantly decrease the computational load (see the sketch after this list).

·       Knowledge Distillation: This involves training a smaller, faster model (student) to mimic the performance of a larger, more complex model (teacher). The student model can retain most of the accuracy of the teacher model but with reduced complexity, allowing for quicker processing.

2.       Alternative architectures:

·       Edge Computing and Distributed Processing: Instead of relying solely on the vehicle's onboard computing power, processing tasks could be distributed across multiple edge devices or cloud-based systems.

·       Use of Specialized Hardware: Implementing the model on hardware specifically designed for deep learning tasks, such as Tensor Processing Units (TPUs) or Graphics Processing Units (GPUs), can significantly enhance processing speeds.

3.           Simplifying input data:

·       Dimensionality Reduction: Reducing the complexity of the input data (e.g., by simplifying the 3D point clouds to 2.5D or using fewer data points without losing critical information) can help speed up processing.

·       Data Fusion: Combining data from multiple sensors (e.g., LIDAR, RGB cameras, and RADAR) into a unified, simplified dataset before processing can streamline the model’s operations.

4.           Incremental processing:

·       Real-Time Data Streaming: Instead of processing large batches of data at once, the model could process incoming data incrementally as it is received (streaming).

·       Prioritization of Critical Data: The model could be adjusted to prioritize the processing of critical data points that are most likely to impact vehicle navigation (e.g., obstacles directly in the vehicle’s path), allowing for faster response times.
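For example, a minimal PyTorch sketch of magnitude pruning followed by dynamic quantization is shown below; it is a generic illustration of the techniques named in item 1 above, not an optimization applied to DeepLab-Road, and the stand-in model, the 30% sparsity level, and the layer choices are arbitrary examples.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in model; in practice this would be the trained segmentation network.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),
)

# 1) Unstructured magnitude pruning: zero out the 30% smallest weights of every conv layer.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")        # make the pruning permanent

# 2) Dynamic quantization: int8 weights for Linear layers (convolution layers would need
#    static quantization or quantization-aware training instead).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)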

 

Improvement 3: Benchmarking against existing models:

This is essential in this type of paper. Benchmarking against other state-of-the-art models is essential to objectively evaluate the performance of the DeepLab-Road model. The readers/researchers should understand how their model compares in terms of accuracy, efficiency, robustness, and other critical metrics. Without such comparisons, it is challenging to identify the unique strengths and weaknesses of the proposed model, making it difficult to justify its practical utility or to identify areas for improvement. The steps for effective benchmarking should be:

1.       Select comparable models:

·       Unstructured Road Detection Models: These models could include traditional computer vision approaches, deep learning-based methods like U-Net, and models that use different data inputs, such as LIDAR or combined sensor data.

·       Broader Autonomous Navigation Models: Include models used for general off-road navigation, such as those applied in autonomous military vehicles, agricultural machinery, or disaster response robots. These might not focus solely on road detection but could offer insights into how well the DeepLab-Road model performs in the broader context of vehicle mobility.

2.       Key performance metrics: This part is fundamental

·       Accuracy and precision: Compare how accurately each model detects unstructured roads. Metrics such as Intersection over Union (IoU), precision, and recall would provide a clear picture of the model's effectiveness in identifying road boundaries and avoiding false positives (a small computation sketch is given after this list).

·       Processing speed: Assess how quickly each model processes input data. Models should be compared in terms of frames per second (FPS) or latency in processing 3D point clouds or other sensor data.

·       Robustness across different environments: Test each model under varying conditions (e.g., different weather, lighting, and vegetation density) to evaluate their robustness.

·       Computational efficiency: Compare the computational resources required by each model, including memory usage, CPU/GPU requirements, and energy consumption. A model that performs well but requires extensive computational resources might be less practical than one that offers a balance between performance and efficiency.

3.           Comparative analysis techniques:

·       Qualitative Analysis: Perform visual comparisons of output from the models. For example, visually inspect the road detection overlays on sample images or 3D reconstructions to see how accurately and confidently each model detects roads.

·       Quantitative Analysis: Use statistical tools to compare the performance metrics across models. Techniques such as paired t-tests or ANOVA can be employed to determine if differences in performance metrics are statistically significant.

·       Scenario-Based Analysis: Evaluate how each model performs in specific use cases relevant to the application domain (e.g., navigating through dense forests, avoiding obstacles like fallen trees). This contextual comparison would reveal practical strengths and weaknesses.

4.       Insights gained from benchmarking:

·       Identifying superior techniques: Benchmarking might reveal certain techniques or components from other models that outperform those used in the DeepLab-Road model. For instance, another model might use a more effective method for handling occlusions or better preprocessing steps for noisy data.

·       Highlighting unique strengths: The DeepLab-Road model might outperform others in specific areas, such as handling dense vegetation or recognizing roads in highly unstructured environments. These strengths could be emphasized in future work or applied to other domains.

·       Targeting weaknesses for improvement: Areas where the DeepLab-Road model lags behind competitors could be targeted for enhancement. This might involve adopting novel algorithms, refining data preprocessing techniques, or improving model architecture.
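As a reference point for the accuracy metrics mentioned under item 2 above, a minimal sketch of per-image IoU, precision, and recall computed from binary road masks is shown below; the variable names are illustrative.

import numpy as np

def road_segmentation_metrics(pred_mask, gt_mask):
    # IoU, precision and recall for binary road masks (arrays with road = 1, non-road = 0).
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + 1e-9)
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return iou, precision, recall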

 

Improvement 4: Integration with other navigation systems

This is more of a future work task, but you should consider that integrating the DeepLab-Road model with other navigation systems such as GNSS, inertial navigation systems (INS), LIDAR, RADAR, and vision-based systems could significantly enhance vehicle mobility in forested environments. By combining the model's unstructured road detection capabilities with the precise location data from GNSS, the reliable positioning of INS, and the detailed obstacle detection from LIDAR and RADAR, the system would offer a comprehensive and robust navigation solution. This integration would improve decision-making, safety, and operational efficiency, particularly in challenging terrains where single-system approaches may fall short. Testing the model within such an integrated framework would also allow for real-time data fusion, cross-validation of sensor outputs, and more reliable vehicle guidance, ultimately leading to better performance in practical deployments.

 

Author Response

Response to Reviewer 3 Comments

Improvement 1: Evaluation in different environments:

The current study primarily tests the DeepLab-Road model under specific conditions that may not fully represent the diverse range of environments where off-road vehicles operate. There are some real-case conditions that should be evaluated or tested to better understand its robustness and versatility. This could include:

Weather Conditions:

Rain: Testing the model under rainy conditions to see how well it handles water on the ground, which can obscure unstructured roads and create reflective surfaces that might confuse the model.

Fog: Evaluating the model’s performance in foggy conditions, where visibility is reduced, would test its ability to detect roads when visual information is partially obscured.

Lighting Scenarios:

High vs. low light: Assessing the model’s effectiveness under both hard sunlight and diffuse light on cloudy days would determine its ability to function under varying light conditions.

Shadowed Areas: Testing in environments with dappled sunlight, such as in dense forests where light filters through the canopy, would evaluate how well the model copes with high-contrast lighting and shadowing effects.

Varying Vegetation Densities:

Sparse vs. Dense Vegetation: The model should be tested in areas with different levels of vegetation density. Sparse vegetation might expose more of the terrain, making road detection easier, while dense vegetation could obscure roads and create false positives (e.g., interpreting shadows or gaps in the canopy as roads).

Seasonal Vegetation Changes: Evaluating the model in environments where vegetation changes with the seasons (e.g., leaf-covered vs. leafless in deciduous forests) would help determine if the model can adapt to these variations.

Terrain Variability:

Flat vs. Rugged Terrain: Including different terrain types, from relatively flat and straightforward paths to rugged, uneven terrains with obstacles like boulders and fallen logs, would test the model’s adaptability to varying terrain roughness and complexity.

Mud and Slippery Surfaces: Testing on muddy or wet ground would add another layer of complexity, as these surfaces can be difficult for vehicles to traverse and might affect the accuracy of road detection.

Response: We gratefully thank you for your time and effort in reviewing our manuscript, and we fully understand your concern. As an unstructured and complex scene, working conditions in forest areas are harsh and varied; forest scene perception is therefore a field in which breakthroughs are urgently needed for special-purpose forest equipment. As you pointed out, the scenarios covered in this study are a subset of all open scenarios. Weather conditions such as rain and fog, high vs. low light, shadowed areas, varying vegetation densities, and terrain variability are indeed real conditions encountered in forest scenes. This is also one of the important reasons for the slow expansion of unmanned driving and unmanned forestry equipment to new scenarios.

So, we also did a great deal of thinking and preparatory work in the early stage regarding the selection of image samples and our database. To ensure the generalization and robustness of the model, we tried to expand the distribution range of the samples as much as possible and to increase the differences between samples. We selected samples from two aspects. On the one hand, the experimental results show that, from both subjective and objective considerations, seasonal variations significantly influence the image features within the scene's target regions; therefore, the samples cover unstructured scenes from three different seasons: spring, summer, and autumn. On the other hand, the images include forest road surfaces with different surface structures: vegetated pavements, gravel pavements, pavements covered by litter, dirt roads, flagstone roads, road surfaces overexposed by the camera, and hardened roads with wide surfaces and blurred road boundaries. In this way, we aim to enhance the generalization ability of the model by extracting diverse feature images.

In response to your concerns, we have revised and clarified the manuscript (Lines: 343-352):

To ensure algorithmic effectiveness and generalizability, a comprehensive collection of unstructured road data from diverse field scenes has been actively pursued.

Figure 6 (a)-(d) are sourced from the RUGD public dataset and the remaining figures are derived from our self-constructed URSD dataset. The experimental results show that, from both subjective and objective considerations, seasonal variations significantly influence the image features within the scene's target regions. Therefore, in the URSD set, the unstructured scenes cover three different seasons: spring, summer and autumn. From another perspective, the URSD dataset also includes pavements with different surface structures: vegetated pavements (Figure 6 (a)), gravel pavements (Figure 6 (b)-(c)), pavements covered by litter (Figure 6 (d)-(e)), dirt roads (Figure 6 (f)-(i)), flagstone roads (Figure 6 (j)-(m)), road surfaces overexposed by the camera (Figure 6 (n)), and hardened roads with wide surfaces and blurred road boundaries (Figure 6 (o)-(p)).

Of course, as you have mentioned, we will further improve the dataset and model in future work to ensure better generalization and scalability of the algorithm.

Improvement 2: Enhancing real-time processing capabilities:

The DeepLab-Road model, while designed with a lightweight architecture using MobileNetV2, faces challenges in achieving real-time processing, especially when dealing with the complexities of 3D point cloud data, which is essential for capturing the detailed structure of the terrain but requires significant computational power to process efficiently.

There are some potential solutions, some of them essential when designing this type of platform, that could be considered now or as future work:

  1. Model optimization:

Pruning and Quantization: Techniques like pruning, where unnecessary parts of the model are removed, and quantization, where the precision of the model’s weights and activations is reduced, can significantly decrease the computational load.

Knowledge Distillation: This involves training a smaller, faster model (student) to mimic the performance of a larger, more complex model (teacher). The student model can retain most of the accuracy of the teacher model but with reduced complexity, allowing for quicker processing.

Response: We greatly appreciate your valuable suggestions. As you said, model optimization is an important means of improving model performance. Specifically, for deep learning models, pruning, quantization, and knowledge distillation are direct means of reducing model parameters while preserving accuracy. Regarding pruning and quantization, we are conducting related research based on the current DeepLab-Road model. During this research we found that pruning and quantization operations require manual implementation, and that pruning behavior has a certain degree of randomness and considerable unpredictability. Therefore, we are attempting to construct an adaptive pruning model based on DeepLab-Road to address the performance fluctuations caused by the uncertainty of the pruning process. The relevant results will be announced once they are summarized and organized. A minimal sketch of the mechanics of pruning and quantization is given below.
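As an illustration only (not our DeepLab-Road implementation), the following Python sketch shows magnitude pruning with torch.nn.utils.prune and dynamic int8 quantization of a Linear layer. The toy layers, the 30% pruning ratio, and the use of dynamic quantization are assumptions chosen purely for demonstration; conv-heavy segmentation networks would normally rely on static quantization or quantization-aware training instead.

```python
# Illustrative sketch only: magnitude pruning and dynamic quantization with
# standard PyTorch utilities, applied to placeholder layers.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

backbone = nn.Sequential(                  # stand-in for a trained network
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 1),
)
head = nn.Linear(128, 2)   # toy layer, only to demonstrate quantize_dynamic

# Magnitude pruning: zero the 30% smallest weights in every conv layer.
for m in backbone.modules():
    if isinstance(m, nn.Conv2d):
        prune.l1_unstructured(m, name="weight", amount=0.3)
        prune.remove(m, "weight")          # bake the pruning mask into the weights

# Dynamic quantization: weights of Linear layers are stored as int8.
head_int8 = torch.quantization.quantize_dynamic(
    nn.Sequential(head), {nn.Linear}, dtype=torch.qint8
)
```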

  2. Alternative architectures:

Edge Computing and Distributed Processing: Instead of relying solely on the vehicle's onboard computing power, processing tasks could be distributed across multiple edge devices or cloud-based systems.

Use of Specialized Hardware: Implementing the model on hardware specifically designed for deep learning tasks, such as Tensor Processing Units (TPUs) or Graphics Processing Units (GPUs), can significantly enhance processing speeds.

Response: We greatly appreciate your valuable suggestions. For edge computing and distributed processing, this type of strategy can indeed improve the computing power of in-vehicle systems, including the deployment of image processing units.

In fact, our experiment is divided into two parts: laboratory experiments and field experiments. In the early stages of data collection, data partitioning, model training, and network structure adjustment, the configuration of our laboratory computer is much higher than that of the vehicle platform. The laboratory processing platform is a PC with a 13th Gen Intel(R) Core(TM) i5-13700KF @ 3.5 GHz, 64 GB of RAM, and four Nvidia 4070 GPUs, running Linux 16.04 and ROS Kinetic. The purpose is to conduct extensive experimentation and improve efficiency, and we have made many attempts and efforts to obtain a lightweight model.

After the model is trained, it needs to be ported to the vehicle platform for validation (as described in the manuscript). Many practical problems need to be overcome in this process. The biggest and most urgent problem we face is optimizing the vibration-absorption scheme for the vehicle system. In harsh forest environments, vehicle vibrations caused by unstructured roads can cause unpredictable damage to the onboard computer. Therefore, before upgrading the configuration of the onboard processor system, the first thing we need to address is the system's shock-absorption issue, and we are currently working on resolving it. In fact, during the experiments the upper computer would crash or be damaged almost every week due to vibration caused by unstructured roads.

In addition, we made preliminary attempts in the early stages to improve computing efficiency based on cloud systems. However, due to inadequate communication infrastructure and tree-canopy obstruction, ensuring uninterrupted coverage of cloud services over time is difficult in forest scenarios. Based on your suggestion, we will focus on vehicle-side computing, supplemented by cloud computing, in the future to improve the computing efficiency of the vehicles.

  3. Simplifying input data:

Dimensionality Reduction: Reducing the complexity of the input data (e.g., by simplifying the 3D point clouds to 2.5D or using fewer data points without losing critical information) can help speed up processing.

Data Fusion: Combining data from multiple sensors (e.g., LIDAR, RGB cameras, and RADAR) into a unified, simplified dataset before processing can streamline the model’s operations.

Response: We fully understand your concerns. The input data involved in this study mainly include image data and point cloud data, so a huge amount of computation is involved. In order to balance computational complexity against real-time performance, we use image data as the main source and point cloud data as a supplement. On the basis of a thorough understanding and segmentation of the image semantics, 2D point cloud data are integrated. By jointly calibrating the Lidar and the visual camera, we determine the positional relationship and reprojection relationship between the two. The fusion here can be regarded as a dimensionality enhancement of the image data, that is, assigning point cloud distance signals to the corresponding pixels.

We have further supplemented and explained in section 3.3 (Lines: 400-414):

3.3. Reprojection of Images and 2D Point Clouds

The semantic segmentation results of the image lack distance information, so combining point cloud information is necessary to construct complete road information in the vehicle coordinate system. Therefore, quasi-structured road construction is considered to provide local coordinate information for the tracking control of UGV in forest scenarios.

Building upon the joint calibration outcomes of the CCD camera and 2D Lidar, a re-projection of the point cloud onto the image was executed. The distance information of the point cloud is assigned to the corresponding pixel along the fitted boundary line, leading to an extension of the pixel coordinate vectors. Subsequently, the positional information of the road boundary is transformed into the actual local scene coordinate system through a continuous sequence of coordinate transformations involving image, camera, vehicle, and world coordinates.
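For concreteness, a minimal NumPy sketch of this reprojection chain is given below. The intrinsic matrix, the Lidar-to-camera rotation and translation, and the scan values are placeholder assumptions, not the joint-calibration results used in the paper.

```python
# Minimal sketch: project 2D Lidar returns into the image using assumed
# calibration (intrinsics K, rotation R and translation t from Lidar to
# camera frame), and attach each range to the pixel it lands on.
import numpy as np

K = np.array([[800.0, 0.0, 320.0],         # fx, 0, cx
              [0.0, 800.0, 240.0],         # 0, fy, cy
              [0.0, 0.0, 1.0]])
R = np.array([[0.0, -1.0, 0.0],            # assumed rotation: Lidar x-forward
              [0.0, 0.0, -1.0],            # mapped to camera z-forward
              [1.0, 0.0, 0.0]])
t = np.array([0.0, -0.1, 0.2])             # assumed Lidar -> camera offset (m)

def project_scan(ranges, angles, image_shape):
    """Map a 2D Lidar scan (range, bearing) to (u, v, distance) triplets."""
    pts_lidar = np.stack([ranges * np.cos(angles),
                          ranges * np.sin(angles),
                          np.zeros_like(ranges)], axis=1)   # scan plane, z = 0
    pts_cam = pts_lidar @ R.T + t           # transform into the camera frame
    valid = pts_cam[:, 2] > 0               # keep points in front of the camera
    uvw = pts_cam[valid] @ K.T              # perspective projection
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)
    h, w = image_shape
    inside = (0 <= uv[:, 0]) & (uv[:, 0] < w) & (0 <= uv[:, 1]) & (uv[:, 1] < h)
    return np.column_stack([uv[inside], ranges[valid][inside]])

depth_pixels = project_scan(np.linspace(1.0, 5.0, 181),
                            np.deg2rad(np.linspace(-90, 90, 181)),
                            image_shape=(480, 640))
```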

  4. Incremental processing:

Real-Time Data Streaming: Instead of processing large batches of data at once, the model could process incoming data incrementally as it is received (streaming).

Prioritization of Critical Data: The model could be adjusted to prioritize the processing of critical data points that are most likely to impact vehicle navigation (e.g., obstacles directly in the vehicle’s path), allowing for faster response times.

Response: We greatly appreciate your valuable suggestions. In fact, this issue involves the selection of keyframes from the video stream. As you say, this question is of great significance for improving the processing efficiency of video-stream images, and keyframe extraction algorithms are the subject of extensive ongoing research. In this study, keyframes were extracted at equal intervals, for two reasons: on the one hand, this overcomes, to a certain extent, the subjectivity of keyframe selection; on the other hand, this study needs to consider the correlation between the contextual information of consecutive frames, since contextual correlation between images helps increase the accuracy of forest road recognition in unfamiliar scenes. In our subsequent research, keyframe extraction is also one of the key areas of focus, and we will improve the overall recognition performance of the system through better keyframe filtering methods. A small sketch of the equal-interval sampling is given below.
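As a minimal sketch only (the stride of 10 frames and the use of OpenCV's VideoCapture are assumptions for illustration, not our pipeline), equal-interval keyframe extraction can be written as:

```python
# Minimal sketch of equal-interval keyframe selection from a video stream.
import cv2

def extract_keyframes(video_path, stride=10):
    cap = cv2.VideoCapture(video_path)
    keyframes, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:          # keep one frame every `stride` frames
            keyframes.append(frame)
        idx += 1
    cap.release()
    return keyframes
```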

Improvement 3: Benchmarking against existing models:

This is essential in this type of paper. Benchmarking against other state-of-the-art models is essential to objectively evaluate the performance of the DeepLab-Road model. Readers and researchers should understand how the model compares in terms of accuracy, efficiency, robustness, and other critical metrics. Without such comparisons, it is challenging to identify the unique strengths and weaknesses of the proposed model, making it difficult to justify its practical utility or to identify areas for improvement. The steps for effective benchmarking should be:

  1. Select comparable models:

Unstructured Road Detection Models: These models could include traditional computer vision approaches, deep learning-based methods like U-Net, and models that use different data inputs, such as LIDAR or combined sensor data.

Broader Autonomous Navigation Models: Include models used for general off-road navigation, such as those applied in autonomous military vehicles, agricultural machinery, or disaster response robots. These might not focus solely on road detection but could offer insights into how well the DeepLab-Road model performs in the broader context of vehicle mobility.

Response: We greatly appreciate your valuable suggestions. Here we selected eight methods for comparison and supplementary explanation. In our study, visual images serve as the main perception source, supplemented and assisted by point cloud data; the input is therefore unified into image information and point cloud information, and, based on the image segmentation, the distance information contained in the point cloud is assigned to the corresponding pixels. Following your suggestions, we have supplemented the comparison options in section 4.1:

4.1. Evaluation and Comparison of Image Segmentation Effect

To comprehensively evaluate the open-scene semantic recognition proposed in this study, six widely used image segmentation algorithms were selected for training and testing on an Unstructured Road Scene Dataset.

As shown in Figure 9, the columns correspond to: (a) the original image, (b) SegNet, (c) FCN, (d) U-Net, (e) PSPNet, (f) UperNet, (g) DeepLab V3+ and (h) DeepLab-Road; the last column is the segmentation result of the proposed DeepLab-Road algorithm. The segmentation outcomes are visually presented as binary images, where the red area is the road area recognized by each algorithm and the black area is the background.

  2. Key performance metrics: This part is fundamental

Accuracy and precision: Compare how accurately each model detects unstructured roads. Metrics such as Intersection over Union (IoU), precision, and recall would provide a clear picture of the model's effectiveness in identifying road boundaries and avoiding false positives.

Processing speed: Assess how quickly each model processes input data. Models should be compared in terms of frames per second (FPS) or latency in processing 3D point clouds or other sensor data.

Robustness across different environments: Test each model under varying conditions (e.g., different weather, lighting, and vegetation density) to evaluate their robustness.

Computational efficiency: Compare the computational resources required by each model, including memory usage, CPU/GPU requirements, and energy consumption. A model that performs well but requires extensive computational resources might be less practical than one that offers a balance between performance and efficiency.

Response: We greatly appreciate your valuable suggestions. Based on your suggestion, we selected three evaluation metrics to measure the accuracy and precision of the image segmentation results: MPA (Mean Pixel Accuracy), MIoU (Mean Intersection over Union) and FWIoU (Frequency Weighted Intersection over Union).

                                                           MPA = sum(Pi) / N                                                        (1)

Where, Pi is the pixel accuracy for each category; N is the number of categories.

MIoU = (IoUp + IoUn) / 2 = [ TP / (TP + FP + FN) + TN / (TN + FN + FP) ] / 2           (2)

Where, TP is true positive; FP is false positive; FN is false negative; TN is true negative.

                                             FWIoU = fp × IoUp + fn × IoUn                                           (3)

Where, fp and fn are the frequencies (proportions of ground-truth pixels) of the road and background classes, respectively.

For the evaluation of processing speed, we use the average processing time (Ave Time). In addition, Parameters and MS denote the computational resources required by each model and the memory usage of the CPU, respectively. The specific details and parameters are listed in Table 1 of the manuscript, and a short sketch of how these metrics can be computed is given below.
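For illustration only (not the evaluation code used in the paper), the following sketch computes MPA, MIoU and FWIoU in Eqs. (1)-(3) from a K×K confusion matrix with rows as ground truth and columns as predictions; the toy confusion-matrix values are assumptions.

```python
# Minimal sketch: segmentation metrics from a K x K confusion matrix.
import numpy as np

def segmentation_metrics(conf):
    conf = conf.astype(float)
    diag = np.diag(conf)
    gt_per_class = conf.sum(axis=1)            # pixels of each true class
    pred_per_class = conf.sum(axis=0)          # pixels predicted as each class
    pa = diag / gt_per_class                   # per-class pixel accuracy Pi
    iou = diag / (gt_per_class + pred_per_class - diag)
    freq = gt_per_class / conf.sum()           # class frequency weights
    return {"MPA": pa.mean(), "MIoU": iou.mean(), "FWIoU": (freq * iou).sum()}

# Toy binary example: class 0 = background, class 1 = road.
cm = np.array([[9000,  300],
               [ 200, 1500]])
print(segmentation_metrics(cm))
```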

Table 1. Quantitative evaluation results of comparative studies on URSD

| Model | MPA (%) | MIoU (%) | FWIoU (%) | Parameters (M) | MS (MB) | Ave Time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| SegNet | 90.26 | 85.66 | 88.25 | 16.31 | 124.55 | 95 |
| FCN | 92.13 | 86.05 | 88.36 | 134.27 | 1095.68 | 701 |
| U-Net | 92.14 | 87.99 | 90.12 | 26.36 | 201.19 | 167 |
| PSPNet | 92.85 | 88.74 | 90.71 | 51.43 | 392.76 | 255 |
| UperNet | 93.20 | 89.08 | 90.98 | 126.07 | 962.62 | 582 |
| DeepLab V3+ | 93.49 | 89.27 | 91.12 | 59.34 | 453.47 | 247 |
| DeepLab-Road | 94.86 | 89.48 | 91.18 | 15.13 | 58.03 | 83 |

 

  3. Comparative analysis techniques:

Qualitative Analysis: Perform visual comparisons of output from the models. For example, visually inspect the road detection overlays on sample images or 3D reconstructions to see how accurately and confidently each model detects roads.

Quantitative Analysis: Use statistical tools to compare the performance metrics across models. Techniques such as paired t-tests or ANOVA can be employed to determine if differences in performance metrics are statistically significant.

Scenario-Based Analysis: Evaluate how each model performs in specific use cases relevant to the application domain (e.g., navigating through dense forests, avoiding obstacles like fallen trees). This contextual comparison would reveal practical strengths and weaknesses.

Response: We greatly appreciate your valuable suggestions. Based on your suggestion, the qualitative analysis and quantitative analysis have been supplemented and modified. The qualitative explanation is mainly embodied in the visual appearance of the image segmentation results, while the quantitative explanation is embodied in the point cloud reprojection and the pseudo-structured road fitting.

Based on your suggestions, we have supplemented the various partitioning options in section 3.2 and 4.2:

The pseudo-structured road is generated to solve the problem of irregular road boundaries under image segmentation. The boundary of the mask area is irregular and tortuous, which would cause frequent and drastic changes in the direction control commands, making it impossible to ensure the stability and comfort of the vehicle tracking system. Such road boundaries, if used directly to guide the navigation of an autonomous driving system, could lead to catastrophic consequences. Therefore, it is necessary to perform smooth fitting of the pseudo-structured roads. The binary image was derived from the mask image, and the boundary delineation of irregular unstructured roads was expeditiously accomplished using the Sobel edge detection algorithm. Subsequently, the furthest point was employed as the pivotal boundary point for separating the left and right borders, and a cubic Bézier curve was fitted to the image coordinates of the sampled points. The green curve represents the fitted road boundary line, whereas the black curve represents the road centerline calculated from the road boundary lines. A minimal sketch of this boundary-smoothing step is given below.
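As an illustration of this fitting step (not the paper's implementation; the toy mask, the edge threshold, and the crude left/right split at the image mid-column are assumptions), a minimal OpenCV/NumPy sketch is:

```python
# Minimal sketch: Sobel edges on a binary road mask, then a least-squares
# cubic Bezier fit to the sampled left-boundary points.
import cv2
import numpy as np

def fit_cubic_bezier(points):
    """Least-squares cubic Bezier through ordered 2D boundary points."""
    pts = np.asarray(points, float)
    d = np.cumsum(np.r_[0, np.linalg.norm(np.diff(pts, axis=0), axis=1)])
    t = d / d[-1]                                    # chord-length parameters
    B = np.stack([(1 - t) ** 3, 3 * t * (1 - t) ** 2,
                  3 * t ** 2 * (1 - t), t ** 3], axis=1)
    ctrl, *_ = np.linalg.lstsq(B, pts, rcond=None)   # 4 control points
    return ctrl, B @ ctrl                            # control pts, fitted curve

mask = np.zeros((240, 320), np.uint8)                # toy road mask
poly = np.array([[60, 239], [140, 40], [180, 40], [260, 239]], dtype=np.int32)
cv2.fillPoly(mask, [poly], 255)
edges = cv2.Sobel(mask, cv2.CV_64F, 1, 0, ksize=3)   # vertical boundary edges
ys, xs = np.nonzero(np.abs(edges) > 0)
left = np.array(sorted([(x, y) for x, y in zip(xs, ys) if x < 160],
                       key=lambda p: p[1]))          # crude left/right split
ctrl, curve = fit_cubic_bezier(left)                 # smoothed left boundary
```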

Table 2. Remapping Error Analysis

| Distance | Eup (pixel) | Evp (pixel) | Eud (m) | Evd (m) | Eξ (m) |
| --- | --- | --- | --- | --- | --- |
| 1 m–2 m | 23.614 | 3.014 | 0.075 | 0.007 | 0.035 |
| 2 m–3 m | 27.851 | 3.537 | 0.122 | 0.012 | 0.050 |
| 3 m–4 m | 31.926 | 4.028 | 0.191 | 0.022 | 0.079 |
| 4 m–5 m | 34.864 | 4.293 | 0.242 | 0.026 | 0.108 |

Eξ refers to the lateral average error of the sampling points along the road centerline. The road centerline is derived by averaging the sampling points of the left and right boundaries, akin to mean filtering, which effectively diminishes the fitting error of the centerline. The maximum average error of the road centerline within a 5 m range is 0.108 m, and the average relative error does not exceed 6%. Most of the road widths involved are within the range of 2 m–4 m. Consequently, there is a strong alignment between image and radar data, affirming that the road model fulfills the requirements for localization and tracking. A small numerical sketch of this centerline-and-error computation follows below.
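Purely as an illustration of the computation just described (the boundary and reference coordinates below are made-up values, not the paper's data), the centerline averaging and the lateral error Eξ can be sketched as:

```python
# Minimal sketch: centerline as the point-wise mean of matched left/right
# boundary samples, and E_xi as the mean absolute lateral offset from a
# reference centerline.
import numpy as np

left  = np.array([[0.8, d] for d in np.linspace(1, 5, 9)])   # (x, distance) m
right = np.array([[3.2, d] for d in np.linspace(1, 5, 9)])
reference = np.array([[2.0, d] for d in np.linspace(1, 5, 9)])

centerline = (left + right) / 2.0          # averaging acts like mean filtering
e_xi = np.abs(centerline[:, 0] - reference[:, 0]).mean()
print(f"lateral average error E_xi = {e_xi:.3f} m")
```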

  4. Insights gained from benchmarking:

Identifying superior techniques: Benchmarking might reveal certain techniques or components from other models that outperform those used in the DeepLab-Road model. For instance, another model might use a more effective method for handling occlusions or better preprocessing steps for noisy data.

Highlighting unique strengths: The DeepLab-Road model might outperform others in specific areas, such as handling dense vegetation or recognizing roads in highly unstructured environments. These strengths could be emphasized in future work or applied to other domains.

Targeting weaknesses for improvement: Areas where the DeepLab-Road model lags behind competitors could be targeted for enhancement. This might involve adopting novel algorithms, refining data preprocessing techniques, or improving model architecture.

Response: We greatly appreciate your valuable suggestions. Based on your suggestion, we have reorganized and adjusted the main findings of this research in the conclusion section.

The main findings of this research are as follows (Line:529-559):

This study focuses on autonomous navigation technology for unstructured road scenes in forest scenarios and proposes DeepLab-Road, a lightweight quasi-structured road recognition and reconstruction scheme suitable for embedded systems. The model uses MobileNetV2 as the backbone network and integrates DenseASPP with the plug-and-play Strip Pooling module, balancing the real-time requirements of forest engineering vehicles in outdoor environments against image segmentation accuracy. Combined with reprojection technology, the detailed geometric shape and topological information of the road boundaries provide accurate, guiding directions for the optimal route for vehicles and humans. At the same time, this overcomes, to some extent, the difficulty of real-time computation in practical application scenarios caused by the redundancy of 3D point cloud data and the lack of clear, unified structural features in point cloud data. The construction of pseudo-structured roads provides a parameterized road model for local UGV navigation lacking satellite signals and high-precision map support. A simplified sketch of this kind of network composition is given below.
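For readers unfamiliar with this kind of composition, the following is a highly simplified sketch (not the DeepLab-Road implementation): a MobileNetV2 feature extractor feeding a few parallel dilated convolutions and a strip-pooling-style attention branch, then a segmentation head. All channel counts, dilation rates, and module details are assumptions made for illustration.

```python
# Highly simplified sketch of a MobileNetV2 backbone + multi-rate dilated
# convolutions + strip pooling + segmentation head. Not DeepLab-Road itself.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class StripPooling(nn.Module):
    """Pool along H and W separately and fuse, as in strip-pooling modules."""
    def __init__(self, ch):
        super().__init__()
        self.conv_h = nn.Conv2d(ch, ch, 1)
        self.conv_w = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        h = F.adaptive_avg_pool2d(x, (x.shape[2], 1))   # column strip
        w = F.adaptive_avg_pool2d(x, (1, x.shape[3]))   # row strip
        return x * torch.sigmoid(self.conv_h(h) + self.conv_w(w))

class RoadSegNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features   # 1280-ch output
        self.aspp = nn.ModuleList([
            nn.Conv2d(1280, 128, 3, padding=r, dilation=r) for r in (1, 6, 12)
        ])
        self.strip = StripPooling(128 * 3)
        self.head = nn.Conv2d(128 * 3, num_classes, 1)

    def forward(self, x):
        feat = self.backbone(x)
        feat = torch.cat([F.relu(b(feat)) for b in self.aspp], dim=1)
        feat = self.strip(feat)
        logits = self.head(feat)
        return F.interpolate(logits, size=x.shape[2:], mode="bilinear",
                             align_corners=False)

out = RoadSegNet()(torch.randn(1, 3, 256, 256))    # -> (1, 2, 256, 256)
```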

The main contributions of this study also include the self-built URSD dataset. It mainly contains unstructured road data from open scenarios, and the data consistency of the dataset is good. Interference from factors such as lane lines, traffic lights, people, and vehicles, which appear in traditional autonomous driving datasets, is eliminated as much as possible, enabling researchers and models to focus on unstructured road areas. All data samples come from the collected original images, without image rotation, image flipping, added noise, or other augmentations. When new samples appear, the training system can be upgraded and updated by adding them to the training set to expand the samples and retrain the model, meaning the system has good plasticity, robustness, and portability. The dataset that matches the images and point clouds will be systematically curated, disseminated, and showcased in forthcoming research efforts.

It is noteworthy that the dataset explicitly excludes winter scenes with snow-covered roads. In practice, we did capture some unstructured road scenes post-snowfall during winter. However, the homogeneity of color and texture between the foreground and background in images depicting snow-covered roads poses a challenge for feature extraction. This complexity misguides the trained model and introduces certain confusion and complications. In subsequent endeavors, we plan to extensively augment the dataset and undertake further research and exploration into this particular issue.

Improvement 4: Integration with other navigation systems

This is more of a future-work task, but integrating the DeepLab-Road model with other navigation systems such as GNSS, inertial navigation systems (INS), LIDAR, RADAR, and vision-based systems could significantly enhance vehicle mobility in forested environments. By combining the model's unstructured road detection capabilities with the precise location data from GNSS, the reliable positioning of INS, and the detailed obstacle detection from LIDAR and RADAR, the system would offer a comprehensive and robust navigation solution. This integration would improve decision-making, safety, and operational efficiency, particularly in challenging terrains where single-system approaches may fall short. Testing the model within such an integrated framework would also allow for real-time data fusion, cross-validation of sensor outputs, and more reliable vehicle guidance, ultimately leading to better performance in practical deployments.

Response: We greatly appreciate your valuable suggestions, and we fully understand your concern. As you said, there is still a lot of work that needs to be further refined, promoted, and improved in the future. For example, we will consider integrating the DeepLab-Road model with other navigation systems such as BDS, GPS, and 3D LIDAR; we will further consider the distributed deployment of optimized models, including edge solving and cloud high-confidence services; and we will pursue real-time data fusion, cross-validation of sensor outputs, and more reliable vehicle guidance. In the follow-up work, we will continue to strive to do better.

We gratefully thank you for your time and effort in reviewing our manuscript! Thank you again!

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript entitled Lightweight Model Development for Forest Region Unstructured Road Recognition Based on Tightly Coupled Multisource Information is, in my opinion, acceptable as presented.

 

The authors have responded very seriously to all my comments and corrected all typographical and scientific errors.

 

The authors have also changed the most important parts of the text and made other improvements that sound much better.

The English language is also very good in terms of scientific expression.

 

The authors have now produced much better and more appropriate maps and figures.

 

I therefore recommend acceptance.

 

Sincerely,

Reviewer#1

Author Response

Comments:The manuscript entitled Lightweight Model Development for Forest Region Unstructured Road Recognition Based on Tightly Coupled Multisource Information is, in my opinion, acceptable as presented.

The authors have responded very seriously to all my comments and corrected all typographical and scientific errors.

The authors have also changed the most important parts of the text and made other improvements that sound much better.

The English language is also very good in terms of scientific expression.

The authors have now produced much better and more appropriate maps and figures.

I therefore recommend acceptance.

Sincerely,

Reviewer#1

Response: We gratefully thank you for your time and effort in reviewing our manuscript! Thank you again.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Dear authors,

You have responded well to the recommendations, showing an understanding of the areas needing improvement. However, to elevate your paper to a more relevant and impactful level, I would recommend taking more concrete steps to address the suggested improvements. You will find a point-by-point response below.

 

Improvement 1: Evaluation in Different Environments

You have made a commendable effort to address the recommendation by expanding your dataset to include different seasons and various types of road surfaces. This is a good start, and it's clear you recognize the importance of broader environmental conditions, such as different weather scenarios and varying vegetation densities.

However, the current response is somewhat limited in scope, as you include only three seasons and do not seem to have tested under conditions like rain, fog, or different lighting scenarios, which were specifically suggested.

I would recommend expanding your dataset further to include more extreme conditions like rain, fog, and variable lighting. Additionally, including winter scenarios, which you excluded due to their difficulty, would allow a more comprehensive evaluation of your model's robustness. As Richard Feynman said, "Science is the belief in the ignorance of experts." It’s not about proving that your method is the best in all scenarios, but about understanding where it fails. Identifying these limitations can be valuable, as it not only guides the refinement of your current work but also helps in planning future research directions more effectively.

 

Improvement 2: Enhancing Real-Time Processing Capabilities

You have acknowledged the importance of model optimization and mentioned ongoing research into pruning and quantization to improve real-time performance. The discussion of challenges related to hardware limitations and system vibration is relevant for real-time deployment in harsh environments. However, while you outline potential future solutions, these have not yet been implemented or tested.

To strengthen your work, it would be useful to provide more concrete results from your ongoing research into model optimization. Additionally, offering preliminary results from testing alternative architectures or simplified data inputs would show progress towards real-time capability, showing these solutions are not just theoretical but are being pursued.

 

Improvement 3: Benchmarking Against Existing Models

You have made significant progress by benchmarking your model against several other modern models, comparing performance using relevant metrics like MPA, MIoU, and FWIoU. Including both qualitative and quantitative analysis is important for a comprehensive evaluation.

While your benchmarking efforts are commendable, it would be beneficial to include a discussion of any specific areas where your model may lag behind others. A more detailed comparative analysis highlighting these weaknesses and suggesting targeted improvements would add depth to the discussion and provide clearer insights for future research.

 

Improvement 4: Integration with Other Navigation Systems

The authors acknowledge the suggestion and express an intention to integrate the DeepLab-Road model with other navigation systems in future work. They recognize the potential benefits of this integration but have not yet taken concrete steps towards this.

I would recommend moving beyond a simple acknowledgment and outlining a specific plan or timeline for this integration in your future research. Providing a detailed strategy for testing and implementing integration with GNSS, INS, LIDAR, and other systems would show a commitment to advancing this part of your work. Additionally, even preliminary results or a pilot study would strengthen your response.

Author Response

Response to Reviewer 3 Comments 

Improvement 1: Evaluation in Different Environments

You have made a commendable effort to address the recommendation by expanding your dataset to include different seasons and various types of road surfaces. This is a good start, and it's clear you recognize the importance of broader environmental conditions, such as different weather scenarios and varying vegetation densities.

However, the current response is somewhat limited in scope, as you include only three seasons and do not seem to have tested under conditions like rain, fog, or different lighting scenarios, which were specifically suggested.

I would recommend expanding your dataset further to include more extreme conditions like rain, fog, and variable lighting. Additionally, including winter scenarios, which you excluded due to their difficulty, would allow a more comprehensive evaluation of your model's robustness. As Richard Feynman said, "Science is the belief in the ignorance of experts." It’s not about proving that your method is the best in all scenarios, but about understanding where it fails. Identifying these limitations can be valuable, as it not only guides the refinement of your current work but also helps in planning future research directions more effectively.

 

Response: We gratefully thank you for your time and effort in reviewing our manuscript! Following your suggestions, we replaced and supplemented Figure 6(n)-(p) and Figure 7(n)-(p) in the manuscript. Figure 6(n)-(o) and Figure 7(n)-(o) show overexposed forest roads, while Figure 6(p) and Figure 7(p) show a snow-covered road. Our dataset is still being expanded to include images under different lighting conditions (shaded roads, overexposed roads, etc.) as well as snowy roads. Frankly, however, the URSD dataset we built does not yet include forest roads on rainy and foggy days. As you can see, the effectiveness of our model for road recognition in forest areas on snowy days still needs to be improved, for two reasons. (1) From the perspective of image segmentation, a snow-covered forest road loses its original color and texture, and the color and texture of the whole ground become nearly uniform, which can be fatal to our model and indeed to most image processing models. (2) Usually, images with a specific class of features are grouped into one dataset; therefore, from the perspective of data distribution, the URSD dataset constructed in this study may not belong to the same distribution as a snow-image dataset, owing to differences in color, texture, and other features, or the two distribution centers may be far apart. This is also why we did not put snow-day images into the URSD dataset. In addition, we must honestly say that our model does have limitations and cannot meet all conditions of forest road identification. We hope to do better through subsequent efforts.

We have provided additional explanations in the manuscript (Lines: 346-353) and (Lines: 371-375):

Therefore, in the URSD set, the unstructured scenes now cover four different seasons: spring, summer, autumn and winter. From another perspective, the URSD dataset also includes pavements with different surface structures: vegetated pavements (Figure 6 (a)), gravel pavements (Figure 6 (b)-(c)), pavements covered by litter (Figure 6 (d)-(e)), dirt roads (Figure 6 (f)-(i)), flagstone roads (Figure 6 (j)-(m)), road surfaces overexposed by the camera (Figure 6 (n)-(o), with some partially melted snow in Figure 6 (o)), and a snow-covered road surface (Figure 6 (p)).

In Figure 7 (m), the interplay of sunlight casts shadows from the tree canopy onto the road surface. In Figure 7 (n)-(o), the effectiveness of the road boundary planning clearly needs to be improved: overexposed road surfaces and heavily snow-covered forest roads lose some of their original color and texture features, resulting in a nearly uniform color and texture across the whole ground, which can be fatal to our model.

 

Improvement 2: Enhancing Real-Time Processing Capabilities

You have acknowledged the importance of model optimization and mentioned ongoing research into pruning and quantization to improve real-time performance. The discussion of challenges related to hardware limitations and system vibration is relevant for real-time deployment in harsh environments. However, while you outline potential future solutions, these have not yet been implemented or tested.

To strengthen your work, it would be useful to provide more concrete results from your ongoing research into model optimization. Additionally, offering preliminary results from testing alternative architectures or simplified data inputs would show progress towards real-time capability, showing these solutions are not just theoretical but are being pursued.

 

Response: Thank you for your suggestion. In fact, we are currently adjusting and optimizing the backbone network of the model, moving from MobileNetV2 to MobileNetV3 (a minimal sketch of this swap is given below). At present, the new model is not yet mature and is still undergoing parameter tuning and training. We would like to see progress quickly, but as you know, every small advance in scientific research requires a great deal of effort and time. If there is research progress, we will publish and publicize it in the form of a paper as soon as possible.
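Purely as an illustration of the backbone swap under evaluation (using torchvision's stock feature extractors; nothing here is the retrained model itself):

```python
# Minimal sketch: comparing MobileNetV2 and MobileNetV3 feature extractors.
import torch
from torchvision.models import mobilenet_v2, mobilenet_v3_large

x = torch.randn(1, 3, 256, 256)
v2_feats = mobilenet_v2(weights=None).features(x)        # (1, 1280, 8, 8)
v3_feats = mobilenet_v3_large(weights=None).features(x)  # (1, 960, 8, 8)
# The segmentation head must be adapted to the new channel count (1280 -> 960).
print(v2_feats.shape, v3_feats.shape)
```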

 

Improvement 3: Benchmarking Against Existing Models

You have made significant progress by benchmarking your model against several other modern models, comparing performance using relevant metrics like MPA, MIoU, and FWIoU. Including both qualitative and quantitative analysis is important for a comprehensive evaluation.

While your benchmarking efforts are commendable, it would be beneficial to include a discussion of any specific areas where your model may lag behind others. A more detailed comparative analysis highlighting these weaknesses and suggesting targeted improvements would add depth to the discussion and provide clearer insights for future research.

 

Response: Thank you for your valuable suggestion. For the model proposed in this study, the cost of its overall lightweight structure is a certain dependence on the hardware performance and configuration of the system. In our model, we replace a deep network with multiple small parallel networks to achieve an effect similar to that of a deep neural network. Multi-line parallelism requires the CPU or GPU to have enough independent processing units to meet the parallel network requirements. The experiments show that when the number of independent CPU cores or independent GPU computing units is limited, the computing speed of the model is significantly limited, indicating that our model is highly dependent on computer hardware performance. We have added this discussion to the Conclusions section of the manuscript. Of course, this is also why we have further upgraded our equipment, as covered under Improvement 4.

We have provided additional explanations in the manuscript (Lines: 554-561):

Further improvement and optimization are needed in terms of the model architecture. This model replaces deep neural networks with multiple small parallel networks to achieve effects similar to those of deep neural networks. Multi-line parallelism requires a sufficient number of independent processing units on the CPU or GPU to meet the parallel network requirements. Experiments show that when the number of independent CPU cores or independent GPU computing units is limited, the computational speed of the model is significantly limited. This indicates that our model has a high dependence on computer hardware performance.

 

Improvement 4: Integration with Other Navigation Systems

The authors acknowledge the suggestion and express an intention to integrate the DeepLab-Road model with other navigation systems in future work. They recognize the potential benefits of this integration but have not yet taken concrete steps towards this.

I would recommend moving beyond a simple acknowledgment and outlining a specific plan or timeline for this integration in your future research. Providing a detailed strategy for testing and implementing integration with GNSS, INS, LIDAR, and other systems would show a commitment to advancing this part of your work. Additionally, even preliminary results or a pilot study would strengthen your response.

 

Response: Thank you for your attention to and promotion of our work. Regarding the integration of the DeepLab-Road model with other navigation systems, we are in fact already working on it. As shown in the figure below, we have fully upgraded the system on the first experimental platform. This is the second version of the unmanned platform we built, and the disks in the picture are the GPS mushroom antennas. We are currently debugging the GPS module. As can be seen in the figure, the space on top of the vehicle is limited and the installation and layout of the various sensors need to be optimized, so no cameras or radars are installed there yet. The work has made some progress in this respect; however, it is not yet enough to draw firm conclusions. If there is technical progress and a breakthrough, we will publish it after sorting and summarizing the results, and we will be happy to share our research results with you. We hope our work will go smoothly.

Thank you again for your time and effort in reviewing this manuscript! Your suggestions are of great significance in improving the quality of the manuscript and in helping us organize our next steps. Thank you very much! (Please refer to "Response to Reviewer 3 - Round 2.pdf" for the modified images and the image of our upgraded second version of the intelligent driving platform.)

Author Response File: Author Response.pdf
