Article

Semantic Segmentation Network for Unstructured Rural Roads Based on Improved SPPM and Fused Multiscale Features

1 School of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China
2 Research Center for Intelligent Agriculture, Ministry of Education Engineering, Urumqi 830052, China
3 Xinjiang Agricultural Informatization Engineering Technology Research Center, Urumqi 830052, China
4 National Engineering Research Center for Information Technology in Agriculture, Beijing 100125, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8739; https://doi.org/10.3390/app14198739
Submission received: 19 August 2024 / Revised: 17 September 2024 / Accepted: 25 September 2024 / Published: 27 September 2024
(This article belongs to the Special Issue Advances in Computer Vision and Semantic Segmentation, 2nd Edition)

Abstract

Semantic segmentation of rural roads presents unique challenges due to the unstructured nature of these environments, including irregular road boundaries, mixed surfaces, and diverse obstacles. In this study, we propose an enhanced PP-LiteSeg model specifically designed for rural road segmentation, incorporating a novel Strip Pooling Simple Pyramid Module (SP-SPPM) and a Bottleneck Unified Attention Fusion Module (B-UAFM). These modules improve the model’s ability to capture both global and local features, addressing the complexity of rural roads. To validate the effectiveness of our model, we constructed the Rural Roads Dataset (RRD), which includes a diverse set of rural scenes from different regions and environmental conditions. Experimental results demonstrate that our model significantly outperforms baseline models such as UNet, BiSeNetv1, and BiSeNetv2, achieving higher accuracy in terms of mean intersection over union (MIoU), Kappa coefficient, and Dice coefficient. Our approach enhances segmentation performance in complex rural road environments, providing practical applications for autonomous navigation, infrastructure maintenance, and smart agriculture.

1. Introduction

Rural roads play an important social and economic role globally, especially in agricultural production, transportation, and rural–urban connectivity. However, rural roads often have irregular shapes, fuzzy boundaries, and complex environmental interference, which bring great challenges to the automatic recognition and semantic segmentation of road scenes. Existing semantic segmentation techniques mostly focus on structured urban road scenarios, while related research on unstructured rural roads is not sufficient.
Rural road semantic segmentation has a wide range of applications not only in agriculture [1], such as supporting autonomous driving, navigation, and path planning of intelligent agricultural machinery, but also in several other fields [2,3,4,5]. First, rural roads are an important infrastructure, connecting remote rural areas with cities, and their state directly affects the transportation of agricultural products, distribution of goods, and local economic development [6,7]. Second, rural roads play a crucial role in emergency rescue after natural disasters, and accurate road segmentation can help rescuers quickly determine road conditions and improve rescue efficiency [8]. In addition, semantic segmentation of rural roads can support the application of autonomous driving technologies in non-urban environments and help intelligent transportation systems adapt to more complex and irregular road conditions [9]. Segmentation in such scenarios is significantly more complex and requires effective solutions tailored to the rural road environment.

2. Related Work

In earlier studies, significant progress was made in semantic segmentation for structured environments such as urban roads and well-defined landscapes. Models such as UNet [10], ENet [11], BiSeNetv1 [12], and BiSeNetv2 [13] have become popular for road segmentation tasks, especially in urban settings. However, these models face notable limitations when applied to unstructured rural road environments. UNet is a well-known encoder–decoder architecture primarily used for biomedical image segmentation. It relies on skip connections between the encoder and decoder, which help recover spatial details lost during downsampling. However, UNet's pooling and upsampling operations often struggle with complex, unstructured environments like rural roads, where road shapes and boundaries are irregular. ENet performs well in real-time applications, but its lightweight design compromises segmentation accuracy, and it may fall short of more complex models when faced with intricate rural roads and small objects. BiSeNetv1 uses a two-pathway architecture to balance spatial detail extraction with contextual information. Although it performs well in urban road segmentation, it struggles with the long, narrow structures of rural roads, as its spatial pathway cannot fully retain crucial geometric information. BiSeNetv2 improves upon its predecessor by optimizing for lightweight, real-time segmentation; however, this reduction in computational complexity comes at the cost of accuracy, particularly for small objects and complex scenes, making it less effective in unstructured rural environments.
In recent years, deep convolutional neural networks have made significant advances in computer vision, particularly in tasks such as image classification, object detection, and semantic segmentation, where they have shown outstanding performance [13,14]. The advent of deep learning has provided an effective approach to the challenges of complex road scene recognition and parsing [15]. Zhu et al. [16] developed an improved DeepLabV3+ method for visual navigation path recognition that achieves effective semantic segmentation in a pitaya orchard environment; however, the model's computational complexity is high, which makes practical application difficult. Ni et al. [17] proposed a deep dual-resolution road scene segmentation network based on decoupled dynamic filters and squeeze-and-excitation modules (DDF&SE-DDRNet), which reduces the number of network parameters while still delivering satisfactory segmentation results, but it likewise suffers from high model complexity and computational cost. Lv et al. [18] introduced a parallel complement network (PCNet) that obtains good results on public datasets by exploiting inverted residual structures while maintaining a large receptive field; although the multi-branch parallel architecture enhances feature extraction, it also increases model complexity. Despite these advances, existing convolutional approaches to semantic segmentation still face challenges such as high parameter counts, computational complexity, and suboptimal inference speed. Furthermore, there remains room for improvement in the use of contextual and global information, which directly affects segmentation accuracy in complex scenes.
There are three major challenges in achieving semantic segmentation of rural roads. First, the complexity of rural road scene images poses a formidable obstacle. The intricate environment, undefined boundaries, uneven road surfaces, occlusions, and numerous obstacles make rural road images much more complex than urban road scenes. Consequently, this complexity significantly heightens the difficulty of semantic segmentation tasks. Second, the variability of rural road scene images presents another challenge. Unlike urban roads, rural road environments undergo frequent changes, resulting in varying road shapes and environmental conditions. As a result, rural road images may exhibit diverse features depending on the prevailing conditions, which complicates the recognition process. Therefore, it is imperative to take into account the myriad scenarios encountered on rural roads. Third, there is a scarcity of rural road datasets. Collecting comprehensive datasets for rural roads is difficult due to the dynamic and complex nature of rural environments. This scarcity severely limits further research in this domain and hinders progress in developing robust semantic segmentation models tailored specifically for rural road scenes.
To tackle these challenges, an innovative approach to fine-grained segmentation of different objects in rural road scenes using efficient and accurate CNNs is introduced in this research. The focus of this research is to propose an improved PP-LiteSeg semantic segmentation model for the rural road recognition task. Figure 1 outlines the architecture of our proposed model. The main contributions of this work are summarized as follows:
1. A Rural Roads Dataset (RRD) is constructed, and it is verified that the proposed improved algorithm achieves accurate segmentation in complex rural environments.
2. An SP-SPPM module is designed, which introduces the SP module into the simple pyramid pooling module so that the multi-scale pooled features are further subjected to strip feature extraction.
3. A B-UAFM module is designed, which introduces the BAM module into the unified attention fusion module to capture richer contextual information.
4. An improved PP-LiteSeg model is proposed, which ensures accuracy in segmenting complex rural roads while keeping the model lightweight.
The remainder of this article is organized as follows. Section 3 details our proposed deep learning framework for accurate segmentation of rural road scenes. Section 4 describes the RRD constructed for this study, and Section 5 reports extensive experiments on it. Section 6 summarizes the main points of this paper and provides suggestions for future research.

3. Improvement of PP-LiteSeg Rural Road Recognition Methods

3.1. Architecture of Semantic Segmentation Model

Modern semantic segmentation models commonly employ an encoder–decoder architecture. The encoder typically applies a series of operations such as convolution, pooling, and activation functions to extract features, while the decoder uses upsampling or transposed convolution to restore the low-resolution encoder features to high resolution and produce the final prediction. The original PP-LiteSeg [19] model follows this encoder–decoder structure: in the encoding stage, features are extracted with STDC [20] and refined by a simple pyramid pooling module; in the decoding stage, the Unified Attention Fusion Module combines the deeper features with those from the encoding stage, and the prediction is obtained by upsampling. In the improved model presented in this paper, the features extracted by the backbone network are fed into a simple pyramid module with strip pooling at the encoding stage, which yields more effective global context information and improves model performance. In the decoding stage, the output of the strip-pooling pyramid module is fed into the Bottleneck Unified Attention Fusion Module, which uses dual branching to increase the utilization of feature information and fuses intermediate features from the encoding stage to enrich the feature set. Finally, the prediction results are obtained by upsampling. The specific structure of the improved PP-LiteSeg is shown in Figure 2.

3.2. Simple Pyramid Module for Strip Pooling

Strip pooling [21] is a spatial pooling technique used in computer vision, proposed specifically for scene parsing tasks. It enhances the capture of global information by pooling along the horizontal and vertical directions. Assuming that the input is $X \in \mathbb{R}^{H \times W}$, strip pooling is given by the following equation:
y_i^h = \frac{1}{W} \sum_{0 \le j < W} x_{i,j}, \qquad y_j^v = \frac{1}{H} \sum_{0 \le i < H} x_{i,j}, \qquad y_o = y_i^h + y_j^v
The proposed Strip Pooling-Simple Pyramid Pooling Module, as depicted in Figure 3, begins with conducting three global average pooling operations and one individual strip pooling operation on the features generated by the backbone network. The global average pooling windows are sized 1 × 1 , 2 × 2 , and 4 × 4 , respectively. Subsequently, the features undergo a convolution operation followed by strip pooling and upsampling. The resulting features from these operations are then aggregated and subjected to a 3 × 3 convolution operation. Finally, the processed features are output.
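To make this data flow concrete, the following is a minimal PaddlePaddle-style sketch of a strip-pooling branch and an SP-SPPM-like module. It is an illustration of the description above rather than the authors' released implementation: the channel widths, the use of batch normalization, and the exact order of convolution, strip pooling, and upsampling within each branch are assumptions.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class StripPool(nn.Layer):
    """Strip pooling along the horizontal and vertical directions, following Eq. (1)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2D(channels, channels, (3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2D(channels, channels, (1, 3), padding=(0, 1))

    def forward(self, x):
        h, w = x.shape[2], x.shape[3]
        y_h = x.mean(axis=3, keepdim=True)   # (N, C, H, 1): average over the width  (y_i^h)
        y_w = x.mean(axis=2, keepdim=True)   # (N, C, 1, W): average over the height (y_j^v)
        y_h = F.interpolate(self.conv_h(y_h), size=(h, w), mode='bilinear')
        y_w = F.interpolate(self.conv_w(y_w), size=(h, w), mode='bilinear')
        return y_h + y_w                     # y_o = y^h + y^v

class SPSPPM(nn.Layer):
    """One reading of Figure 3: pyramid pooling branches plus a strip-pooling branch."""
    def __init__(self, in_ch, mid_ch, out_ch, bin_sizes=(1, 2, 4)):
        super().__init__()
        self.pool_branches = nn.LayerList([
            nn.Sequential(nn.AdaptiveAvgPool2D(s),
                          nn.Conv2D(in_ch, mid_ch, 1, bias_attr=False),
                          nn.BatchNorm2D(mid_ch), nn.ReLU())
            for s in bin_sizes])
        self.strip_branch = nn.Sequential(
            nn.Conv2D(in_ch, mid_ch, 1, bias_attr=False),
            nn.BatchNorm2D(mid_ch), nn.ReLU(), StripPool(mid_ch))
        self.out_conv = nn.Sequential(
            nn.Conv2D(mid_ch, out_ch, 3, padding=1, bias_attr=False),
            nn.BatchNorm2D(out_ch), nn.ReLU())

    def forward(self, x):
        h, w = x.shape[2], x.shape[3]
        fused = self.strip_branch(x)                   # strip-pooling branch on the backbone features
        for branch in self.pool_branches:              # 1x1, 2x2, 4x4 global average pooling branches
            fused = fused + F.interpolate(branch(x), size=(h, w), mode='bilinear')
        return self.out_conv(fused)                    # final 3x3 convolution
```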

3.3. Parallel Feature Fusion Module

Bottleneck Attention Module

BAM (Bottleneck Attention Module) [22] combines channel attention and spatial attention to enhance the network’s attention to key features. The specific structure of BAM is shown in Figure 4. Its main idea is to enhance important channels and spatial locations by generating attention maps on feature channels and spatial dimensions, respectively. The specific operation can be divided into the following two parts:
(1) Channel Attention Module
Channel attention adaptively adjusts the importance of each channel. Given the input feature map $X \in \mathbb{R}^{H \times W \times C}$, global average pooling and max pooling are first performed to obtain the global features:
X_{avg} = \mathrm{AvgPool}(X), \qquad X_{max} = \mathrm{MaxPool}(X)
The features are then scaled through the fully connected layer:
X_{channel} = \sigma\left( W_1 \left( W_0(X_{avg}) + W_0(X_{max}) \right) \right)
where $W_0$ and $W_1$ are trainable weight matrices and $\sigma$ denotes the activation function.
(2) Spatial Attention Module
Spatial attention captures important locations in an image. The feature maps are first subjected to max pooling and average pooling operations in the channel dimension to obtain two two-dimensional spatial attention feature maps:
X_{avg}^{spatial} = \mathrm{AvgPool}(X, \mathrm{axis}=\mathrm{channel}), \qquad X_{max}^{spatial} = \mathrm{MaxPool}(X, \mathrm{axis}=\mathrm{channel})
They are then concatenated and a spatial attention map is generated by a convolution operation:
X_{spatial} = \sigma\left( \mathrm{Conv}\left( \left[ X_{avg}^{spatial}, X_{max}^{spatial} \right] \right) \right)
Finally, the results of channel attention and spatial attention are combined onto the input feature map to obtain an enhanced feature representation:
X_{out} = X \cdot X_{channel} \cdot X_{spatial}
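For clarity, a compact sketch of the attention computation written in Equations (2)–(6) is given below, again in PaddlePaddle. The bottleneck width of the fully connected layers and the 3 × 3 kernel of the spatial convolution are illustrative assumptions (the latter follows Figure 4); this is one reading of the equations above, not the official BAM code.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class BAMAttention(nn.Layer):
    """Channel and spatial attention as written in Eqs. (2)-(6)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # W0 and W1 of Eq. (3): a bottleneck pair of fully connected layers
        self.fc0 = nn.Linear(channels, channels // reduction)
        self.fc1 = nn.Linear(channels // reduction, channels)
        # Conv of Eq. (5): maps the concatenated avg/max maps to one spatial attention map
        self.spatial_conv = nn.Conv2D(2, 1, kernel_size=3, padding=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Eq. (2): global average and max pooling over the spatial dimensions
        x_avg = x.mean(axis=[2, 3])                      # (N, C)
        x_max = x.max(axis=[2, 3])                       # (N, C)
        # Eq. (3): channel attention weights
        x_channel = F.sigmoid(self.fc1(self.fc0(x_avg) + self.fc0(x_max)))
        x_channel = x_channel.reshape([n, c, 1, 1])
        # Eq. (4): per-pixel average and max over the channel dimension
        s_avg = x.mean(axis=1, keepdim=True)             # (N, 1, H, W)
        s_max = x.max(axis=1, keepdim=True)              # (N, 1, H, W)
        # Eq. (5): spatial attention map
        x_spatial = F.sigmoid(self.spatial_conv(paddle.concat([s_avg, s_max], axis=1)))
        # Eq. (6): rescale the input by both attention maps
        return x * x_channel * x_spatial
```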
The parallel feature fusion module proposed in this paper is shown in Figure 5. First, the output of the strip-pooling simple pyramid module is multiplied element by element. The two resulting feature streams then enter their respective B-UAFM modules, where the features from the intermediate stage of the backbone network are fused with them. Finally, the two outputs are multiplied element by element again to produce the final result. The specific structure of B-UAFM is also shown in Figure 5.
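The sketch below shows one plausible reading of a single B-UAFM fusion step, reusing the BAMAttention layer from the sketch above: the deep feature is upsampled, a BAM-derived weight decides how to blend it with the intermediate (skip) feature from the encoder, and a convolution produces the fused output. The gating formulation follows the original UAFM idea of PP-LiteSeg; the exact wiring of the two branches and the element-wise multiplications of Figure 5 is an assumption.

```python
class BUAFM(nn.Layer):
    """One reading of a B-UAFM fusion step (uses BAMAttention from the sketch above)."""
    def __init__(self, channels):
        super().__init__()
        self.attn = BAMAttention(channels)
        self.out_conv = nn.Conv2D(channels, channels, kernel_size=3, padding=1)

    def forward(self, deep, skip):
        # Bring the deeper, lower-resolution feature up to the skip feature's size
        deep = F.interpolate(deep, size=skip.shape[2:], mode='bilinear')
        # BAM attention over the combined features yields a per-pixel blending weight
        alpha = F.sigmoid(self.attn(deep + skip).mean(axis=1, keepdim=True))
        fused = alpha * deep + (1.0 - alpha) * skip
        return self.out_conv(fused)
```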

4. Dataset Construction

4.1. Image Data Acquisition

Semantic segmentation of rural roads assigns the objects in rural road images to their corresponding labels and thereby provides the information needed for scene understanding. In autonomous driving, roads can generally be divided into two categories: structured and unstructured. Structured roads have clear road markings and well-defined boundaries, and generally include city roads and highways. Unstructured roads are characterized by fuzzy or missing road markings, boundaries that are difficult to define, and more complex backgrounds; they generally refer to non-main roads and rural roads. Rural roads present the following unstructured characteristics: (1) road boundaries are difficult to define, the road environment changes considerably, and the shape of the road is variable; (2) the road surface is uneven, with occlusions and many obstacles; and (3) as the environment changes, the road in the image may exhibit different characteristics. These uncertain conditions pose many challenges to semantic segmentation in rural road scenes, so the model needs a degree of generalization ability and robustness.
In this paper, we collected and constructed a dataset containing a variety of rural road scenarios, named the RRD. The dataset is organized according to specific rural road environments. The annotated categories are buildings, asphalt roads, non-hardened roads, sky, obstacles, cars, towers, line poles, vegetation, fences, cement roads, motorcycles, agricultural machinery, banners, persons, trucks, and traffic signs; together with a background category, this gives a total of 18 categories in the rural road images. Example images are shown in Figure 6.
Images were collected in Shawan, Fukang, Nanshan, and Urumqi counties in Xinjiang. The acquisition device was a GoPro HERO9 monocular sports camera (GoPro, San Mateo, CA, USA) with a resolution of 3840 × 2160 and a frame rate of 30 fps; it supports 4K video and 20-megapixel photos, Super Anti-Shake 3.0 video stabilization, built-in horizon correction, and long battery life. To ensure continuous and clear capture, a smartphone with a 4K/30 fps HD camera served as an auxiliary device. A large number of rural road images were collected under different weather and road conditions to capture the diversity of individual object types and better reflect the characteristics of rural road scenes. To obtain a wider view of the road scene, the acquisition device was fixed to the interior rearview mirror of a car driven at a constant speed of 30 km/h; a total of 1 h 30 min of video was recorded, from which 1490 images were selected by frame extraction. The original images were rescaled to 1280 × 720 to facilitate network training and reduce the hardware load during feature extraction. Table 1 details the number of annotations for each category in the RRD.
The dataset construction follows three principles: (1) data collection addressed specific questions to ensure that the dataset reflects the unstructured nature of rural roads; (2) the data were sourced from multiple rural areas, and this geographic diversity ensures that the dataset generalizes across different types of rural road scenarios beyond a single region or condition, increasing the adaptability of the model to a wide range of rural environments; and (3) road and environmental features are comprehensively labeled, covering a wide variety of object classes commonly found in rural road environments, so that models trained on the dataset can handle a broad range of rural road scenarios.

4.2. Data Preprocessing

The type of annotation is primarily determined by the task. The objective of this study is to segment country road scenes and obtain accurate segmentation information, so this paper focuses on pixel-level annotation of common objects on country roads. The RRD contains a substantial number of polygon annotations, and the annotation cost is proportional to their complexity, which requires a significant amount of manual work. The images were annotated using a custom-built CVAT annotation platform, with the annotations saved in JSON format. These JSON files were then converted into the standard Pascal VOC format using scripts. Pascal VOC is a standardized dataset format used for object detection, semantic segmentation, and related tasks; its annotation files are stored as XML, with each XML file corresponding to one image and containing the location and category information of all objects in that image. Figure 6 illustrates a sample annotated image, where each annotation entity is represented by a mask.
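As an illustration of this preprocessing step, the snippet below rasterizes polygon annotations from an exported JSON file into a single-channel class-index mask. The JSON field names (shapes, points, label) and the label-to-index mapping are hypothetical placeholders for whatever the actual export schema contains; they are not the exact format used in this study.

```python
import json
import numpy as np
from PIL import Image, ImageDraw

# Hypothetical label-to-index mapping; the real RRD defines 18 classes.
LABEL_MAP = {"background": 0, "asphalt road": 1, "non-hardened road": 2, "person": 3}

def json_to_mask(json_path, width=1280, height=720):
    """Draw every annotated polygon into a single-channel class-index mask."""
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    mask = Image.new("L", (width, height), 0)            # background = 0
    draw = ImageDraw.Draw(mask)
    for shape in ann.get("shapes", []):                  # assumed field name
        cls = LABEL_MAP.get(shape["label"])
        if cls is None:
            continue
        polygon = [tuple(pt) for pt in shape["points"]]  # [[x, y], ...] -> [(x, y), ...]
        draw.polygon(polygon, fill=cls)
    return np.array(mask, dtype=np.uint8)
```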

5. Experimentation and Analysis

5.1. Experiment Platform and Parameter Settings

The computing equipment used in this study was as follows: a 10th Gen Intel(R) Core(TM) i7-10870H CPU, an NVIDIA GeForce RTX 3070 laptop GPU, and the Windows 11 operating system. We used Python 3.9.7, PaddlePaddle [23] 2.6.1, and PaddleSeg [24] 2.8.0, with CUDA 11.6 as the unified computing device architecture and cuDNN v8.4 as the deep neural network acceleration library. The dataset images were randomly divided into a training set (90%) and a validation set (10%); the model is trained on the training set and tested on the validation set. Training starts from ImageNet pre-training weights; ImageNet is an image classification dataset containing 1.35 million images in 1000 categories. During training, the input images are scaled to 512 × 512, the initial learning rate is $5 \times 10^{-3}$, the batch size is 4, the maximum number of iterations is 40,000, the optimizer is stochastic gradient descent (SGD) [25] with a momentum of 0.9 and a weight decay of 0.00005, and the learning rate decay strategy is polynomial decay. The specific parameters are shown in Table 2. The loss function is OhemCrossEntropyLoss [26] with the following expression:
\mathrm{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \begin{cases} \log(p_{target}), & y_{target} = 1 \\ \log(1 - p_{target}), & y_{target} = 0 \\ 0, & \mathrm{otherwise} \end{cases}
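A simplified NumPy sketch of the online hard example mining (OHEM) idea behind this loss is given below: only the hardest pixels, those whose predicted probability for the true class falls below a threshold, contribute to the averaged cross-entropy, with a minimum number of pixels always kept. The threshold and min_kept values here are illustrative defaults, not the exact settings used in training.

```python
import numpy as np

def ohem_cross_entropy(probs, labels, thresh=0.7, min_kept=10000, ignore_index=255):
    """probs: (N, C, H, W) softmax outputs; labels: (N, H, W) integer class map."""
    n, c, h, w = probs.shape
    flat_labels = labels.reshape(-1)
    flat_probs = probs.transpose(0, 2, 3, 1).reshape(-1, c)
    valid = flat_labels != ignore_index
    # Probability assigned to the true class of each pixel (ignored labels clipped to a dummy class)
    p_target = flat_probs[np.arange(flat_labels.size), np.clip(flat_labels, 0, c - 1)]
    losses = -np.log(np.clip(p_target, 1e-12, 1.0))
    losses[~valid] = 0.0
    # Hard examples: valid pixels whose true-class probability is below the threshold
    hard = valid & (p_target < thresh)
    if hard.sum() < min_kept:
        # Fall back to the min_kept valid pixels with the lowest true-class probability
        order = np.argsort(np.where(valid, p_target, np.inf))
        hard = np.zeros_like(valid)
        hard[order[:min(min_kept, int(valid.sum()))]] = True
        hard &= valid
    return losses[hard].mean() if hard.any() else 0.0
```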

5.2. Evaluation Indices

Accuracy ($Acc$) is the proportion of correctly classified pixels among all pixels in the segmentation result. In image segmentation, the predicted label of each pixel is compared with its true label; the accuracy is the number of pixels whose prediction matches the ground truth divided by the total number of pixels. A higher accuracy means that more pixels are correctly classified and the segmentation performance is better. The formula is as follows:
\mathrm{Acc} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}}
where $i$ denotes the true class, $j$ the predicted class, $p_{ij}$ the number of pixels whose true class is $i$ but are predicted as class $j$, and $k$ the total number of categories.
The MIoU (mean intersection over union) is the ratio between the intersection and the union of the predicted values and the labels. In image segmentation, the prediction and the label are compared for each category to compute the IoU, and the MIoU of the whole image is the average IoU over all categories. A higher MIoU means the segmentation is closer to the labels and the result is better. The expression for MIoU is given in the following equation:
\mathrm{MIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}
where $i$ denotes the true class, $j$ the predicted class, $p_{ij}$ the number of pixels of class $i$ predicted as class $j$, and $k$ the number of classes.
The Dice coefficient is a set-similarity measure commonly used to quantify the similarity of two samples. The predicted values are compared with the labels, and for each category the degree of overlap between the predicted segmentation and the ground-truth label is computed as the ratio of the overlap region to the combined size of the two; the Dice of the whole image is then obtained from these per-category values. A higher Dice indicates greater overlap between the predicted segmentation and the ground truth and therefore better segmentation. The expression is given in the following equation:
\mathrm{Dice} = \frac{2 \times |pred \cap true|}{|pred| + |true|}
where $pred$ is the set of predicted pixels and $true$ is the set of ground-truth pixels. The numerator is twice the size of their intersection, which compensates for the common elements being counted in both terms of the denominator; the denominator is the sum of the sizes of $pred$ and $true$.
The Kappa coefficient is an evaluation metric based on the confusion matrix that measures the consistency between segmentation results and labels. A confusion matrix is first constructed from the model predictions and the ground truth. The observed accuracy is then computed as the sum of the diagonal elements of the confusion matrix divided by the total number of samples, and the expected accuracy is estimated from the marginal distributions of the rows and columns of the confusion matrix and the total number of samples. Kappa is finally computed from the observed and expected accuracies. A higher Kappa means higher consistency between the predicted segmentation and the ground truth and a better segmentation result. The expression is given in the following equation:
\mathrm{Kappa} = \frac{p_o - p_e}{1 - p_e}
where $p_o$ is the observed accuracy and $p_e$ is the expected accuracy.
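All four metrics can be computed from a single confusion matrix accumulated over the validation set. The NumPy sketch below mirrors the formulas above; it is a reference implementation for clarity, not the evaluation code actually used in the experiments.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j]: number of pixels whose true class is i and predicted class is j."""
    conf = conf.astype(np.float64)
    total = conf.sum()
    tp = np.diag(conf)                                   # correctly classified pixels per class
    acc = tp.sum() / total                               # Acc
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    miou = np.mean(tp / np.maximum(union, 1))            # MIoU, averaged over classes
    dice = np.mean(2 * tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0), 1))  # Dice
    p_o = acc                                            # observed accuracy
    p_e = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / (total ** 2)  # expected accuracy
    kappa = (p_o - p_e) / (1 - p_e)                      # Kappa
    return acc, miou, dice, kappa

# Typical usage: accumulate the confusion matrix over all validation images, e.g.
# conf = np.zeros((num_classes, num_classes), dtype=np.int64)
# np.add.at(conf, (gt.ravel(), pred.ravel()), 1)
```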

5.3. Experiment: Comparison with State-of-the-Art Methods

To verify the effectiveness of the model proposed in this paper for rural road segmentation, UNet, ENet, BiSeNetv1, BiSeNetv2, PP-LiteSeg, and the improved PP-LiteSeg model were selected for comparison. All of these networks are well known in the field of lightweight semantic segmentation. The experimental configurations of the other models are consistent with our approach, and all experiments are performed entirely on the RRD dataset that we constructed.
As can be seen from Table 3, the MIoU and Dice of the proposed model are 53.82% and 67.04%, respectively, which are 14.50 and 4.73 percentage points higher than those of UNet. This is mainly because UNet's repeated downsampling operations lose a large amount of detail information, and the pavement boundaries of rural roads are often irregular and fuzzy, which UNet cannot capture efficiently. The corresponding gains over ENet are 20.00 and 4.08 percentage points: although the lighter ENet model has a faster inference speed, its feature extraction ability is relatively weak, so it performs worse than more complex models in complex scenarios. The gains over BiSeNetv1 and BiSeNetv2 are 3.31 and 0.76 percentage points and 13.14 and 0.53 percentage points, respectively, mainly because, although the BiSeNet family maintains fast inference while capturing spatial information through its bilateral structure, its small receptive field limits its ability to capture contextual information. The gains over the original model are 2.44 and 1.16 percentage points, mainly due to the introduction of the Simple Pyramid Pooling Module with Strip Pooling (SP-SPPM) and the Bottleneck Attention Module (BAM), which capture global contextual information more effectively and enhance the processing of details in complex scenes. The SP-SPPM module strengthens the interaction of information across different spatial dimensions through strip pooling, whereas the BAM module improves the model's perception of small objects and details by attending to both the spatial and channel dimensions. These improvements significantly increase segmentation accuracy in complex rural road scenes, allowing the proposed model to outperform the baseline models in both MIoU and Dice. Although the differences in Kappa coefficients are smaller, they remain meaningful: a higher Kappa indicates better overall classification consistency and higher reliability, especially for large-scale categories.
As shown in Figure 7, the model proposed in this paper effectively and accurately segments the targets in the rural road scenes. In contrast, the UNet model performs poorly on small objects because multiple downsampling operations discard many details, and it therefore also suffers from segmentation errors: the non-hardened pavement in column 1 of Figure 7 is incorrectly segmented, the people in column 4 are not segmented or recognized, and the banner in column 2 shows segmentation confusion. The ENet model produces fuzzy segmentation results with poor boundary continuity and segmentation errors; for example, in column 3 of Figure 7, the junction of the asphalt road and the non-hardened road is both discontinuous and ambiguously segmented, and the banner area is incorrectly recognized because ENet does not consider the overall information of the image and has a weak ability to capture image context. The BiSeNetv1 and BiSeNetv2 models are limited by their receptive fields, so the upper and lower parts of the image are not well recognized; because contextual information is insufficiently taken into account, small objects are segmented roughly, and the distant truck in column 2 of Figure 7 is incorrectly merged into the background. The original model suffers from similar problems with small objects and boundary details, such as the incorrect segmentation of the banner in column 3 of Figure 7, the discontinuity of the poles, and the confused segmentation of the motorcycle, poles, traffic signs, and motorcycle riders in column 5.

5.4. Experiment: Results on the Indian Driving Dataset (IDD)

To further validate the effectiveness of the proposed model in rural road segmentation, we selected UNet, ENet, BiSeNetv1, BiSeNetv2, PP-LiteSeg, and the improved PP-LiteSeg model for comparison. All these networks are well known in the field of lightweight semantic segmentation. The experimental configurations of the other models are consistent with our approach, and all experiments are performed entirely on the IDD dataset. The IDD (Indian Driving Dataset) is a semantic segmentation dataset specifically designed for Indian autonomous driving scenarios, aimed at handling the complex, unstructured road environments found in India.
As can be seen in Table 4, UNet performs significantly worse, with only 28.85% MIoU and 50.20% Dice, showing its limitations in dealing with unstructured environments like rural roads. Although ENet has the lowest number of parameters and FLOPs, its MIoU of 21.82% and Dice of 32.39% are also the lowest, reflecting its shortcomings in segmentation performance. The MIoU and Dice of BiSeNetv1 are 47.38% and 63.58%, and those of BiSeNetv2 are 39.39% and 49.63%, respectively; BiSeNetv2 has lower accuracy and fails to acquire sufficient contextual information, although it is faster at inference. The original model has an MIoU of 49.48% and a Dice of 62.59%, a good trade-off between speed and accuracy, but it still does not capture complete feature information. The improved PP-LiteSeg reaches an MIoU of 51.08%, a Kappa coefficient of 87.81%, and a Dice coefficient of 64.66%, providing an optimal balance between accuracy and computational efficiency even though its number of parameters and FLOPs are slightly higher than those of the other lightweight models.
As can be seen in Figure 8, the UNet model performs moderately well on unhardened road and sky segmentation, and, due to multiple downsampling, UNet is weaker on detailing, with unhardened roads incorrectly segmented in the first column of the second row, pedestrians unsegmented in the fourth column of the second row, and obfuscated pavements in the second column of the second row. The segmentation results of the ENet model are fuzzy, with poor boundary continuity, and prone to segmentation errors; the asphalt road in the third row and third column is not correctly segmented, and the segmentation of the road and the unhardened pavement is unclear and discontinuous. Two models, BiSeNetv1 and BiSeNetv2, have better segmentation results on larger objects and more regular scenes. However, both models still perform poorly on small objects, and the vehicle in the first column of the fourth row is incorrectly segmented as a roadway. Due to the small receptive field, BiSeNetv1 and BiSeNetv2 fail to effectively capture the contextual information of the image, resulting in rougher segmentation of small objects. The original model performs significantly better than the previous model in complex unstructured scenarios. In particular, the segmentation of smaller objects has been improved, but the segmentation is still not detailed enough in some places where the boundaries are more complex, for example, the segmentation of the motorcycle and the pole in the sixth row and fifth column is more ambiguous. The improved PP-LiteSeg model performed the best of all models, segmenting small objects and complex boundaries better. The truck in the second column of the seventh row was segmented correctly, and the pole, traffic sign, and motorcycle in the fifth column of the seventh row were more accurate. The main reason is the enhanced ability to capture global contextual information through the SP-SPPM and BAM modules, making it better able to handle details in complex scenes.

5.5. Ablation Experiment: Category Accuracy of Semantic Segmentation Models with Different Functional Unit Configurations

Ablation experiments were conducted to validate the effectiveness of the proposed modules and analyze the impact of each module on model performance. Based on the original model, the SP and BAM modules are introduced step by step. The performance of the model is analyzed with metrics such as single-class pixel accuracy, MIoU, Kappa, and Dice, together with the number of parameters of the model. Table 5 and Table 6 show the results of the models on the test set.
As can be seen from Table 5, objects with distinctive shape, color, and contour features, such as buildings, asphalt roads, towers, vegetation, and agricultural machinery, are segmented with higher accuracy, whereas categories such as cars, fences, and cement roads are affected by distance and distribution and therefore reach lower accuracy than the former group. Because motorcycles and poles occupy only a small area in the image and the processed images have limited resolution, multiple downsampling operations during training further reduce the resolution of the feature maps and discard much of the fine detail; sampling these objects becomes more difficult, so incomplete or incorrect segmentation occurs in the final result, and these categories tend to be less accurate.
As can be seen from Table 6, after adding only the BAM module, the MIoU, Kappa, and Dice of the model reach 53.39%, 89.02%, and 64.42%, respectively. This shows that the BAM module captures more spatial position information and thus improves the prediction performance of the model; FLOPs do not increase noticeably, and FPS remains similar to the original model. Adding the SP module to the simple pyramid pooling module results in an MIoU, Kappa, and Dice of 53.10%, 88.60%, and 66.39%, respectively, which shows that extracting features of strip regions is effective and that aggregating information from different areas clearly enhances the segmentation results; FLOPs and FPS again remain similar to the original model. When both modules are added, the MIoU, Kappa, and Dice of the model reach 53.82%, 89.36%, and 67.04%, respectively. FLOPs increase slightly and FPS drops somewhat, but both remain comparable to the original model, while the MIoU improves significantly. These results show that the two modules enable the model to obtain richer features and produce more fine-grained predictions.
As shown in Figure 9, the improved model produces better segmentation. The base model segments small objects poorly, such as the traffic sign in column 2 of Figure 9 and the distant person in column 4; adding the SP module alleviates this to a certain extent. In addition, the truck in column 1 and the asphalt pavement in column 4 are incorrectly segmented by the base model, and adding the BAM module allows these scenes to be segmented correctly. With both modules added, the buildings in column 3 and the pedestrians in column 4 of Figure 9 are segmented correctly and the overall boundaries are more continuous, since the model takes the overall information of the image fully into account.

6. Conclusions

In this study, an improved PP-LiteSeg semantic segmentation model is proposed that enhances feature extraction through a strip-pooling simple pyramid pooling module and a parallel feature fusion module; it is validated on the self-constructed RRD and the public IDD dataset with strong segmentation results. Despite these good results, limitations remain, especially in geographic coverage, object class diversity, and the handling of dynamic scenes. Future research can address these issues by expanding the dataset, introducing multimodal data, increasing computational efficiency, and improving the segmentation of dynamic scenes and complex boundaries. These improvements will help build more robust models that provide technical support for intelligent transportation systems and smart agricultural equipment.

Author Contributions

Conceptualization, X.C.; methodology, X.C.; data curation, X.C., Y.Z. and Z.Y.; software, Y.Z.; validation, Y.T.; visualization, Y.Z.; writing—original draft, X.C. and Y.T.; writing—review and editing, Y.Z. and T.Z.; funding acquisition T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key R&D Program of China (2022ZD0115805), Provincial Key S&T Program of Xinjiang (2022A02011), and Research on key technology of rural road image panoramic segmentation for the complex environment (XJAUGRI2024008).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Talaviya, T.; Shah, D.; Patel, N.; Yagnik, H. Implementation of artificial intelligence in agriculture for optimisation of irrigation and application of pesticides and herbicides. Artif. Intell. Agric. 2020, 4, 58–73. [Google Scholar] [CrossRef]
  2. Hou, X.; Chen, P. Analysis of Road Safety Perception and Influencing Factors in a Complex Urban Environment—Taking Chaoyang District, Beijing, as an Example. ISPRS Int. J. Geo-Inf. 2024, 13, 272. [Google Scholar] [CrossRef]
  3. Wang, J.; Zeng, X.; Wang, Y.; Ren, X.; Wang, D.; Qu, W.; Liao, X.; Pan, P. A Multi-Level Adaptive Lightweight Net for Damaged Road Marking Detection Based on Knowledge Distillation. Remote Sens. 2024, 16, 2593. [Google Scholar] [CrossRef]
  4. Ding, L.; Zhang, H.; Xiao, J.; Li, B.; Lu, S.; Klette, R.; Norouzifard, M.; Xu, F.; Xu, F. A Comprehensive Approach for Road Marking Detection and Recognition. Multimed. Tools Appl. 2020, 79, 17193–17210. [Google Scholar] [CrossRef]
  5. Wang, Z.; Wang, J.; Yang, K.; Wang, L.; Su, F.; Chen, X. Semantic Segmentation of High-Resolution Remote Sensing Images Based on a Class Feature Attention Mechanism Fused with Deeplabv3+. Comput. Geosci. 2022, 18, 1049–1069. [Google Scholar] [CrossRef]
  6. Yang, Y.; He, J.; Wang, P.; Luo, X.; Zhao, R.; Huang, P.; Gao, R.; Liu, Z.; Luo, Y.; Hu, L. TCNet: Transformer Convolution Network for Cutting-Edge Detection of Unharvested Rice Regions. Agriculture 2024, 14, 1122. [Google Scholar] [CrossRef]
  7. Fan, S.; Zhang, X. Infrastructure and regional economic development in rural China. In Regional Inequality in China; Routledge: Oxfordshire, UK, 2009; pp. 177–189. [Google Scholar]
  8. Smith, A.B.; Katz, R.W. US billion-dollar weather and climate disasters: Data sources, trends, accuracy and biases. Nat. Hazards 2013, 67, 387–410. [Google Scholar] [CrossRef]
  9. Paz, D.; Zhang, H.; Li, Q.; Xiang, H.; Christensen, H. Probabilistic semantic mapping for urban autonomous driving applications. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 2059–2064. [Google Scholar]
  10. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  11. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar]
  12. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  13. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
  14. Jiang, H.; Zhang, C.; Qiao, Y.; Zhang, Z.; Zhang, W.; Song, C. CNN feature-based graph convolutional network for weed and crop recognition in smart farming. Comput. Electron. Agric. 2020, 174, 105450. [Google Scholar] [CrossRef]
  15. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  16. Zhu, L.; Deng, W.; Lai, Y.; Guo, X.; Zhang, S. Research on Improved Road Visual Navigation Recognition Method Based on DeepLabV3+ in Pitaya Orchard. Agronomy 2024, 14, 1119. [Google Scholar] [CrossRef]
  17. Ni, H.; Jiang, S. Deep Dual-Resolution Road Scene Segmentation Networks Based on Decoupled Dynamic Filter and Squeeze–Excitation Module. Sensors 2023, 23, 7140. [Google Scholar] [CrossRef] [PubMed]
  18. Lv, Q.; Sun, X.; Chen, C.; Dong, J.; Zhou, H. Parallel complement network for real-time semantic segmentation of road scenes. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4432–4444. [Google Scholar] [CrossRef]
  19. Peng, J.; Liu, Y.; Tang, S.; Hao, Y.; Chu, L.; Chen, G.; Wu, Z.; Chen, Z.; Yu, Z.; Du, Y.; et al. PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model. arXiv 2022, arXiv:2204.02681. [Google Scholar]
  20. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking BiSeNet For Real-time Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 9716–9725. [Google Scholar]
  21. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip Pooling: Rethinking Spatial Pooling for Scene Parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4003–4012. [Google Scholar]
  22. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  23. Ma, Y.; Yu, D.; Wu, T.; Wang, H. PaddlePaddle: An Open-Source Deep Learning Platform from Industrial Practice. Front. Data Comput. 2019, 1, 105–115. [Google Scholar]
  24. Liu, Y.; Chu, L.; Chen, G.; Wu, Z.; Chen, Z.; Lai, B.; Hao, Y. PaddleSeg: A High-Efficient Development Toolkit for Image Segmentation. arXiv 2021, arXiv:2101.06175. [Google Scholar]
  25. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris, France, 22–27 August 2010; pp. 177–186. [Google Scholar]
  26. Shrivastava, A.; Gupta, A.; Girshick, R. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769. [Google Scholar]
Figure 1. Architecture of the proposed model.
Figure 2. Improvements to the PP-LiteSeg modeling framework.
Figure 3. Strip pooling simple pyramid module, where 1D Conv represents a one-dimensional convolution operation, 1 × 1 Conv represents a convolution operation with a convolution kernel of 1, sigmoid represents an activation function, Sp represents the strip pooling operation, and Add represents the addition of elements.
Figure 4. The overall structure of BAM attention, where 1 × 1 Conv and 3 × 3 Conv represent 1 × 1 convolutional manipulation and 3 × 3 convolutional operation, respectively, FC represents fully connected operation, and Sigmoid represents the activation function.
Figure 5. Overall structure of parallel feature fusion module and detailed structure of B-UAFM, where Upsample represents the upsampling operation, and BAM represents the bottleneck attention module.
Figure 6. Overall sample images with annotations: (a) rows of images represent simple scenes; (b) rows of images represent their corresponding labeled images; (c) rows of images represent complex scenes; and (d) rows of images represent their corresponding labeled images.
Figure 7. Exemplary visual results from various models: row (a) illustrates the input images, rows (b–f) display the segmentation outcomes of UNet, ENet, BiSeNetv1, BiSeNetv2, and PP-LiteSeg, and row (g) displays the results of our approach.
Figure 8. Exemplary visual results from various models: row (a) illustrates the input images, rows (b–f) display the segmentation outcomes of UNet, ENet, BiSeNetv1, BiSeNetv2, and PP-LiteSeg, and row (g) displays the results of our approach.
Figure 9. Examples of visualization of different functional units: row (a) illustrates the input images; rows (b–e) show the segmentation results for the original model, with the addition of the SP module, with the addition of the BAM module, and with the addition of both modules.
Table 1. The number of annotated semantics per category.

Category                  Number of Semantics
Fence                     983
Barrier                   1202
Asphalt road              8480
Non-hardened road         4957
Cement road               1106
Building                  4047
Person                    327
Sky                       2830
Vegetation                6730
Banner                    592
Pole                      3903
Traffic sign              724
Car                       1172
Motorcycle                294
Truck                     516
Agricultural machinery    106
Tower                     283
Table 2. Experimental settings.

Setting                  Value
Batch size               4
Crop size                512 × 512
Momentum                 0.9
Initial learning rate    0.005
Weight decay             0.000005
Table 3. Results of comparison of MIoU (%), Kappa, Dice, and parameters for different models on the RRD.

Model                  MIoU (%)   Kappa (%)   Dice (%)   Parameters     FLOPs (G)   FPS (f·s⁻¹)
Unet                   39.32      86.83       62.31      1.34 × 10^7    118.02      6.13
Enet                   33.82      85.28       47.86      3.60 × 10^5    2.50        28.73
BiSeNetv1              50.51      88.60       65.69      1.29 × 10^7    52.76       22.73
BiSeNetv2              40.68      88.83       51.25      2.33 × 10^6    7.52        43.50
PP-LiteSeg             51.38      88.20       64.55      8.04 × 10^6    5.34        41.67
The proposed method    53.82      89.36       67.04      11.0 × 10^6    5.42        40.03
Table 4. Results of comparison of MIoU (%), Kappa, Dice, and parameters for different models on the IDD.

Model                  MIoU (%)   Kappa (%)   Dice (%)   Parameters     FLOPs (G)   FPS (f·s⁻¹)
Unet                   28.85      85.66       50.20      1.34 × 10^7    118.02      6.56
Enet                   21.82      84.37       32.39      3.60 × 10^5    2.50        25.17
BiSeNetv1              47.38      87.23       63.58      1.29 × 10^7    52.76       17.87
BiSeNetv2              39.39      87.34       49.63      2.33 × 10^6    7.52        35.26
PP-LiteSeg             49.48      87.22       62.59      8.04 × 10^6    5.34        39.20
The proposed method    51.08      87.81       64.66      11.0 × 10^6    5.42        38.16
Table 5. Category accuracy (%) of semantic segmentation models with different functional unit configurations.

Category                  Base Model (%)   +SP (%)   +BAM (%)   +SP+BAM (%)
Building                  82.39            82.99     81.62      84.59
Asphalt road              89.10            89.88     90.33      91.38
Non-hardened road         75.15            76.88     75.53      77.48
Sky                       70.44            71.56     74.63      75.07
Barrier                   69.07            81.39     82.06      82.75
Car                       60.16            59.86     63.54      63.11
Tower                     93.35            93.73     93.20      93.58
Pole                      65.99            65.23     65.47      68.65
Vegetation                98.73            98.74     98.71      98.69
Fence                     68.56            70.17     70.71      75.62
Cement road               62.89            61.45     64.72      69.44
Motorcycle                56.37            55.78     62.44      64.26
Agricultural machinery    85.20            76.84     85.50      86.64
Banner                    70.04            71.04     72.86      74.57
Person                    75.78            70.07     75.30      82.45
Truck                     65.07            84.08     87.79      88.87
Traffic sign              72.19            73.83     74.69      75.59
Table 6. Performance of semantic segmentation models with different functional unit configurations (%).

Model         MIoU (%)   Kappa (%)   Dice (%)   Parameters     FLOPs (G)   FPS (f·s⁻¹)
Base model    51.38      88.20       64.55      8.04 × 10^6    5.34        41.67
+SP           53.10      88.60       66.39      8.16 × 10^6    5.38        41.17
+BAM          53.39      89.02       64.42      8.04 × 10^6    5.34        41.55
+SP+BAM       53.82      89.36       67.04      11.0 × 10^6    5.42        40.03
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cao, X.; Tian, Y.; Yao, Z.; Zhao, Y.; Zhang, T. Semantic Segmentation Network for Unstructured Rural Roads Based on Improved SPPM and Fused Multiscale Features. Appl. Sci. 2024, 14, 8739. https://doi.org/10.3390/app14198739

