Article

Real-Time Semantic Segmentation Algorithm for Street Scenes Based on Attention Mechanism and Feature Fusion

by Bao Wu 1, Xingzhong Xiong 2,* and Yong Wang 1

1 School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644000, China
2 Artificial Intelligence Key Laboratory of Sichuan Province, Sichuan University of Science and Engineering, Yibin 644000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3699; https://doi.org/10.3390/electronics13183699
Submission received: 17 August 2024 / Revised: 6 September 2024 / Accepted: 12 September 2024 / Published: 18 September 2024

Abstract

In computer vision, the task of semantic segmentation is crucial for applications such as autonomous driving and intelligent surveillance. However, achieving a balance between real-time performance and segmentation accuracy remains a significant challenge. Although Fast-SCNN is favored for its efficiency and low computational complexity, it still faces difficulties when handling complex street scene images. To address this issue, this paper presents an improved Fast-SCNN, aiming to enhance the accuracy and efficiency of semantic segmentation by incorporating a novel attention mechanism and an enhanced feature extraction module. Firstly, the integrated SimAM (Simple, Parameter-Free Attention Module) increases the network’s sensitivity to critical regions of the image and effectively adjusts the feature space weights across channels. Additionally, the refined pyramid pooling module in the global feature extraction module captures a broader range of contextual information through refined pooling levels. During the feature fusion stage, the introduction of an enhanced DAB (Depthwise Asymmetric Bottleneck) block and SE (Squeeze-and-Excitation) attention optimizes the network’s ability to process multi-scale information. Furthermore, the classifier module is extended by incorporating deeper convolutions and more complex convolutional structures, leading to a further improvement in model performance. These enhancements significantly improve the model’s ability to capture details and overall segmentation performance. Experimental results demonstrate that the proposed method excels in processing complex street scene images, achieving a mean Intersection over Union (mIoU) of 71.7% and 69.4% on the Cityscapes and CamVid datasets, respectively, while maintaining inference speeds of 81.4 fps and 113.6 fps. These results indicate that the proposed model effectively improves segmentation quality in complex street scenes while ensuring real-time processing capabilities.

1. Introduction

Semantic segmentation [1,2] is a core task in computer vision, aimed at assigning semantic labels to every pixel in an image to achieve a detailed understanding of its content. This task is of significant importance in fields such as autonomous driving, medical image analysis [3], remote sensing image processing, and intelligent surveillance. For example, in the medical field, semantic segmentation is particularly advantageous in surgical procedures, as observed by Marullo et al. [4]. Traditional image segmentation methods [5], such as thresholding, edge detection, region growing, and watershed algorithms, rely on low-level features (e.g., color, grayscale, edges) and struggle to capture high-level semantic information. Moreover, these methods are sensitive to noise, changes in lighting, and complex backgrounds, leading to unstable results. Their adaptability is limited due to the need for manually designed feature extraction and segmentation criteria, which is particularly challenging when dealing with complex textures and multiple objects. Advances in deep learning, especially the application of convolutional neural networks [2], have greatly enhanced the accuracy and robustness of semantic segmentation. The introduction of fully convolutional networks (FCNs) [2] represents a breakthrough in applying deep learning to semantic segmentation. FCNs replace the fully connected layers in traditional CNNs with convolutional layers, enabling pixel-level prediction for images of arbitrary sizes and supporting end-to-end training and prediction.
Building on the foundation of FCNs, numerous models have emerged with various improvements and extensions aimed at enhancing segmentation accuracy, reducing computational complexity, boosting feature representation [6] capabilities, and adapting to different application scenarios. The U-Net [7] architecture enhances feature representation through its distinctive symmetric shape and skip connections, making it particularly well-suited for handling small objects and fine details in images. RefineNet [8] employs a multi-path refinement strategy, effectively combining high-level semantic information with low-level detail through cross-layer connections, thereby improving segmentation accuracy and edge sharpness. DeepLabv3+ [9] features an encoder-decoder structure, where the encoder captures contextual information of the image and the decoder restores spatial details, particularly enhancing the accuracy of object boundaries. ERFNet [10] achieves a good balance between efficiency and accuracy through the use of residual connections and factorized convolutions. PSPNet [11] integrates context information at different scales using a pyramid pooling module, significantly enhancing the model’s ability to understand complex scenes and providing more precise scene interpretation. Across these models, mIoU quantifies segmentation accuracy while FPS reflects processing speed; plotting the two metrics against each other reveals the trade-off between accuracy and efficiency and aids in selecting the most suitable model for a specific application scenario.
MFNet [12] employs a multi-branch structure, including attention branches, semantic branches, and spatial information branches, along with heterogeneous decomposition (AF) blocks to effectively fuse multi-level features, thereby improving both segmentation accuracy and real-time performance. LCNet [13] introduces a PCT (partial-channel transformation) strategy. The PCT block incorporates a TCA (three-branch context aggregation) module, expanding the receptive field of features and capturing multi-scale contextual information. MS-SRSS [14] proposes a multi-resolution learning mechanism, which enhances the feature extraction capabilities of the semantic segmentation network’s encoder. Despite significant advances in accuracy and efficiency achieved by deep learning-based semantic segmentation methods, several challenges remain. For instance, street scene images often encompass multiple object categories, such as pedestrians, various types of vehicles, and road markings, each with diverse sizes and shapes and often subject to mutual occlusion. Moreover, street scene datasets typically cover urban environments under a range of lighting and weather conditions. Variations in lighting and seasonal changes can lead to significant differences in the appearance of street scenes, necessitating models with robust generalization capabilities.
In response to the complex challenges associated with the semantic segmentation of street scene images, this paper presents an improved Fast-SCNN [15] architecture. The main contributions of this study are as follows:
  • The integration of SimAM enables the model to more effectively highlight critical features. Additionally, the enhanced DAB is incorporated into the high-resolution branch of the feature fusion module to enhance the model’s ability to capture image details. Simultaneously, the SE attention mechanism is integrated into the low-resolution branch of the feature fusion module, further refining the model’s focus on important information.
  • In the classifier, depthwise separable convolutions and additional convolutional layers are introduced to enhance the model’s capacity to process features. The RPP (refined pyramid pooling) module is extended by adding finer-grained levels, providing richer contextual information for the semantic segmentation task.
  • The proposed method achieves a mean Intersection over Union (mIoU) of 71.7% and 69.4% on the challenging Cityscapes and CamVid test datasets, respectively, with inference speeds reaching 81.4 fps and 113.6 fps.
The structure of this paper is as follows: Section 2 reviews related work; Section 3 presents the methodology; Section 4 presents the experimental results and comparative analysis; and Section 5 provides the conclusions.

2. Related Work

2.1. Semantic Segmentation

Semantic segmentation is a crucial technique in computer vision, aiming to partition an image into multiple regions, each representing a specific category. This technique is widely applied in fields such as autonomous driving, medical imaging, and video surveillance. In the context of deep learning, semantic segmentation is typically implemented using CNNs, which can learn to distinguish the visual features of different objects from large volumes of labeled images. Classic semantic segmentation models include FCN, SegNet, and U-Net. In recent years, many efficient network models incorporating attention mechanisms and deep feature fusion have emerged, such as DANet [16], DSANet [17], MFNet [12], and BiSeNet [18]. Figure 1 summarizes common network architectures used for semantic segmentation.
FCNs, through fully convolutional layers and upsampling techniques, are capable of processing input images of arbitrary sizes and performing precise pixel-level semantic segmentation. DeepLab v3+ [9] combined an encoder-decoder structure with depthwise separable convolutions, enhancing edge handling capabilities and operational efficiency. BiSeNet [18] is a dual-path network designed for real-time semantic segmentation, featuring a fast spatial path to preserve image details and a context path to extract contextual information. The SegBlocks [19] network adaptively adjusts the processing resolution based on the complexity of the image content. This approach not only reduces the computational overhead but also preserves the ability to capture critical visual information. The Depthwise Asymmetric Bottleneck module in DABNet [20] effectively reduces model parameters while maintaining high processing speed and accuracy by combining depthwise separable convolutions with dilated convolutions. In the field of real-time semantic segmentation [21], processing speed [22] and accuracy are critical factors. To achieve efficient real-time performance, many approaches utilize lightweight network architectures and optimization techniques.

2.2. Attention Modules

In deep learning, the attention mechanism [23,24] mimics human attention processes by selectively focusing on key parts of the input data, thereby enhancing model performance and efficiency. It is widely used in various fields such as natural language processing [25] and computer vision [26], and shows exceptional performance in handling long sequences and image recognition tasks. DANet [16] introduces the position attention module and channel attention module to strengthen the discriminative power of feature representations. By applying spatial and channel attention at key points, MSCFNet [27] more effectively integrates information from different stages and scales, maintaining high segmentation performance even with fewer parameters. SFNet-N [28] enhances performance using a multi-scale attention mechanism. The encoder incorporates the CBAM (Convolutional Block Attention Module), which recalibrates features across channel and spatial dimensions. Unlike the traditional SE module, the SA module in SANet [29] not only recalibrates channel weights but also considers the spatial relationships between pixels by incorporating the average pooling to preserve spatial information. By combining dilated spatial attention and channel attention, DSANet [17] fully leverages feature maps across multiple levels, enabling more accurate pixel-level classification and meeting real-time processing requirements. In LASNet [30], the transform module employs three attention mechanism branches to enhance the network’s focus on important features, thereby improving the overall performance. In CTNet [31], the self-attention mechanism proposed in the SCM does not calculate correlations between all pixels but instead models global dependencies indirectly by exploring correlations between pixels and categories.

2.3. Feature Fusion Strategies

FCNs achieve feature fusion through “skip connections”, which facilitate the combination of feature layers from different depths within the network. This strategy helps retain more spatial and contextual information, enhancing the overall performance of the model. In MFNet, the FFB (Feature Fusion Block) integrates features from the attention branch, semantic branch, and spatial information branch. This design allows the network to simultaneously leverage high-level semantic information and low-level spatial information. In PSPNet [11], before the final classification, the multi-scale features generated by the pyramid pooling module are combined with the original features and further processed through convolutional layers, achieving a fusion of global and local information. FSFNet’s [32] feature selective fusion module dynamically adjusts the importance of features across spatial and channel dimensions by generating a weight map associated with each feature map, effectively integrating features from different levels. DFANet utilizes sub-network aggregation to upsample features from the previous network to the next, progressively refining pixel classification. Sub-stage aggregation further enhances feature expressiveness and receptive fields by fusing features across stages. ICNet [33] utilizes the CCF (Cascade Feature Fusion) unit to integrate features from branches with varying resolutions. This design enables the low-resolution branch to generate an initial prediction rapidly, while the mid-resolution and high-resolution branches progressively refine the segmentation results through the fusion process. The AlignFA module in AlignSeg [34] aligns and fuses high-resolution and low-resolution feature maps by learning two-dimensional transformation offsets. This approach ensures precise alignment during multi-resolution feature aggregation and effectively addresses the issue of feature misalignment.

2.4. Dilated Convolution

Compared to traditional convolutions, dilated convolutions [35] expand the receptive field without increasing the number of parameters, avoiding the growth in parameter count and computational cost that larger kernels would require. The DAB module in DABNet combines depthwise asymmetric convolution and dilated convolution, enabling efficient dense feature extraction and contextual information utilization. Dilated convolutions are employed in the non-bottleneck-1D layers of ERFNet to capture more contextual information, thereby improving segmentation accuracy. In EDANet [36], dilated convolutions are primarily used in the second pair of asymmetric convolutional layers within the EDA (efficient dense module with asymmetric convolution) module, with progressively increasing dilation rates to aggregate multi-scale contextual information. The SS-nbt unit (Split-Shuffle-Non-Bottleneck) in the LEDNet [37] encoder uses dilated convolutions with varying rates to expand the receptive field, enhance segmentation accuracy, and maintain a low computational cost. The parallel factorized convolutional unit in ESNet [38] uses multi-branch dilated convolutions to expand the receptive field, capture objects at various scales, and reduce computational complexity. ESPNet [39] achieves efficient feature extraction and contextual information capture through dilated convolutions within its efficient spatial pyramid, significantly reducing computational complexity while maintaining high segmentation accuracy. The surrounding context extractor in CGNet [40] employs dilated convolutions to more effectively capture surrounding contextual information. The FPE block in FPENet [41] combines pyramid dilated convolutions with depthwise separable inverted bottleneck structures, using dilated convolutions at varying rates to form a spatial pyramid that encodes multi-scale features and reduces computational complexity.

3. Proposed Method

Fast-SCNN [15] is a lightweight and efficient semantic segmentation model designed specifically for real-time performance on high-resolution images. Unlike traditional dual-branch networks, Fast-SCNN shares the initial layers between branches, thereby reducing redundant computations. Building upon the foundation of Fast-SCNN, the proposed network introduces several enhancements: the SimAM [42] module improves feature representation; the global feature extraction module enhances contextual awareness; and feature fusion integrates SE attention with the enhanced DAB module. These improvements collectively contribute to generating the final output through the classifier. Figure 2 illustrates the architecture of the proposed network in detail.
To provide a clearer depiction of the network structure, Table 1 lists the detailed composition of each module in the proposed network.

3.1. Efficient Down-Sample

The primary objective of the downsampling module is to reduce the spatial dimensions of the input data while retaining as much important feature information as possible. This module achieves this through a series of convolutional operations, including standard convolutions and depthwise separable convolutions. Specifically, the input image is first processed by a convolutional layer with a kernel size of 3 and a stride of 2, reducing the spatial dimensions of the output feature map to half the original size. Subsequently, it passes through two depthwise separable convolution layers, each with a stride of 2, further reducing the feature map dimensions sequentially. Ultimately, the spatial resolution of the output feature map is 1/8 of that of the input image. Each convolution operation is followed by batch normalization and ReLU activation functions to ensure the preservation of critical visual information during feature compression and to enhance the non-linearity of feature representation. By progressively downsampling the input image, the computational load of subsequent layers can be significantly reduced, thus improving the computational efficiency of the model. This is particularly important for processing high-resolution street scene images. The downsampling process effectively retains important visual features while filtering out insignificant details, thereby extracting more abstract and high-level features.
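To make the layer layout concrete, the following PyTorch sketch mirrors the downsampling path described above and the channel widths listed in Table 1 (32, 48, 64). It is an illustrative reconstruction under those assumptions, not the authors' released code.

```python
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1, each with BN + ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class LearningToDownsample(nn.Module):
    """One standard conv and two depthwise separable convs, each with stride 2 (output at 1/8 resolution)."""
    def __init__(self, in_ch=3, chs=(32, 48, 64)):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, chs[0], 3, 2, 1, bias=False),
            nn.BatchNorm2d(chs[0]),
            nn.ReLU(inplace=True),
        )
        self.ds1 = DSConv(chs[0], chs[1], stride=2)
        self.ds2 = DSConv(chs[1], chs[2], stride=2)

    def forward(self, x):
        return self.ds2(self.ds1(self.conv(x)))
```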

3.2. SimAM

Existing attention modules typically refine features only along the channel or spatial dimensions, limiting their flexibility in learning attention weights across both dimensions. Unlike traditional attention mechanisms such as CBAM, SimAM [42] does not introduce additional parameters to the network. Instead, it uses an energy optimization function to assess the importance of each neuron, considering both spatial and channel dimensions. This energy function, grounded in neuroscience theory, reflects the competitive and inhibitory effects between neurons. By solving the closed-form solution of this energy function, SimAM efficiently computes attention weights, which are then used to weight the feature map, highlighting important features and suppressing less significant ones. The mechanism of SimAM is detailed in Figure 3 to facilitate a clearer understanding of its operational principles.
The above process can be expressed using the following equations:
$$e_t(\omega_t, b_t, y, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1}\sum_{i=1}^{M-1}(y_o - \hat{x}_i)^2 \qquad (1)$$
Equation (1) defines the basic form of the energy function, which aims to identify the linear separability between the target neuron and other neurons by minimizing this energy function. $t$ represents the target neuron in the feature map, $x_i$ represents the other neurons in the same channel, $\hat{t}$ represents the linear transformation of the target neuron, and $\hat{x}_i$ represents the linear transformation of the surrounding neurons. $M$ refers to the total number of neurons in a channel. $y_t$ and $y_o$ represent the labels assigned to the target neuron and to the surrounding neurons, respectively.
$$e_t(\omega_t, b_t, y, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1 - (\omega_t x_i + b_t)\bigr)^2 + \bigl(1 - (\omega_t t + b_t)\bigr)^2 + \lambda \omega_t^2 \qquad (2)$$
Equation (2) is an extension of Equation (1), incorporating a regularization term $\lambda \omega_t^2$ to prevent overfitting and simplify the model's complexity.
$$\omega_t = -\frac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda} \qquad (3)$$
$$b_t = -\frac{1}{2}(t + \mu_t)\,\omega_t \qquad (4)$$
Equations (3) and (4) are the closed-form solutions to Equation (2), providing specific methods for computing $\omega_t$ and $b_t$. $\mu_t$ is the mean of the surrounding neurons in the same channel, $\sigma_t^2$ is their variance, and $\lambda$ is the regularization coefficient used to avoid overfitting; $\omega_t$ represents the weight. The two equations adjust the weights and biases of each neuron based on the statistical properties (mean and variance) of the input features, thereby normalizing the features.
$$e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda} \qquad (5)$$
Equation (5) calculates the minimum energy value $e_t^{*}$ for each neuron based on the solutions $\omega_t$ and $b_t$, where $t - \hat{\mu}$ is the difference between the target neuron and the mean of the surrounding neurons. This minimum energy value reflects the degree of separation between the neuron and the other neurons in the feature space.
$$\tilde{X} = \mathrm{sigmoid}\!\left(\frac{1}{E}\right) \odot X \qquad (6)$$
Equation (6) uses the minimum energy values $e_t^{*}$ obtained from Equation (5) to adjust the response of each neuron in the feature map. This adjustment is carried out through a sigmoid function, and $E$ is the collection of the minimum energy values $e_t^{*}$ of all neurons.
SimAM enhances feature representation by computing pixel-wise attention weights, allowing the network to focus on key feature regions and reducing background noise interference.
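Because the solution is closed-form, SimAM is straightforward to implement. The sketch below follows the commonly used PyTorch formulation of Equations (5) and (6); the value of the regularization coefficient λ (e_lambda) is an assumption.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: each activation is weighted by the inverse of its minimum energy (Eqs. 1-6)."""
    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda  # regularization coefficient lambda

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1  # number of "other" neurons per channel
        # squared deviation of each activation from its channel mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # channel variance estimated over the remaining neurons
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse minimum energy 1 / e_t* (the constant offset evaluates to 0.5)
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        # Eq. (6): sigmoid-scaled reweighting of the original features
        return x * torch.sigmoid(e_inv)
```

In the proposed network this module is inserted after the downsampling stage and adds no learnable parameters.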

3.3. Global Feature Extractor

The global feature extraction module is primarily responsible for extracting deep global features from the network. Specifically, this is achieved through the use of multiple levels of linear bottleneck modules and the final refined pyramid pooling (RPP) module. The linear bottleneck module operates as follows: the input features first undergo an expansion convolution, which increases the number of channels to t times the original, thereby enlarging the feature dimensions. This is followed by depthwise convolutions applied in the high-dimensional feature space to capture more detailed features while maintaining computational efficiency. Finally, a compression convolution reduces the number of feature channels back to the specified output channels. Figure 4 illustrates the refined pyramid pooling (RPP).
In semantic segmentation, capturing multi-scale information is crucial for improving accuracy. The refined pyramid pooling module extracts global and local features through pooling operations at five different scales, significantly improving the understanding of multi-scale context. These operations provide multi-level information, while the 1 × 1 pooling operation captures global context, aiding in the recognition of large objects and overall structures. This module up-samples the feature maps from five scales to the original size and concatenates them along the channel dimension to achieve effective multi-scale feature fusion. Despite the multi-scale pooling and convolution operations, the design of upsampling and convolution maintains computational efficiency while enhancing network performance. Multi-scale feature fusion enables the network to capture boundary information of objects in street scene images more accurately, thus improving boundary segmentation precision. This is particularly important in complex urban scenes, aiding in the clear differentiation of various object categories. By integrating features at different scales, the network maintains consistency in handling details and the overall structure, and can identify both large objects and small objects. This multi-scale feature extraction capability ensures efficient segmentation on street scene datasets and provides robustness and adaptability to targets with varying scales.
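A minimal sketch of such a five-scale pooling module is given below, in the style of PSPNet's pyramid pooling. The specific bin sizes (1, 2, 3, 6, 8) and the channel reduction per branch are assumptions, since the text only states that the refined module pools at five scales and fuses the results by upsampling and concatenation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinedPyramidPooling(nn.Module):
    """Pool at several scales, project each branch with a 1x1 conv, upsample, concatenate, and fuse."""
    def __init__(self, in_ch, out_ch, bins=(1, 2, 3, 6, 8)):
        super().__init__()
        branch_ch = in_ch // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),
                nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ) for b in bins
        ])
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_ch + branch_ch * len(bins), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return self.bottleneck(torch.cat(feats, dim=1))
```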

3.4. Feature Fusion Module

Low-resolution features, typically derived from the deeper layers of the network, provide rich semantic information [43] but lack spatial detail. Conversely, high-resolution features from the shallower layers contain more spatial details but less semantic information. To effectively fuse these two types of features, we first apply 1 × 1 convolution and batch normalization to the high-resolution feature maps and enhance their expressiveness with the refined DAB module. Low-resolution features are upsampled to match the size of the high-resolution features using bilinear interpolation and then refined with depthwise convolution to preserve key information while reducing the number of parameters. Subsequently, 1 × 1 convolution is used to adjust the channel dimensions, combined with batch normalization and SE attention [44] mechanisms to improve feature representation and importance adjustment. Finally, the high-resolution and low-resolution feature maps are integrated effectively through element-wise fusion. This module integrates the deep semantic information with shallow spatial details, which is crucial for handling complex street scene scenarios, particularly for multi-scale objects and detailed boundaries. Figure 5 illustrates a comparison of the implementation details between the ERFNet non-bottleneck-1D module, the DAB module, and the proposed enhanced DAB module in this paper.
Compared to the original DAB module, enhanced DAB captures multi-scale features through various convolution operations, including standard convolutions, depthwise convolutions, and dilated convolutions. These convolution operations extract information from different receptive fields, helping the model understand different contextual information and details in the image. Depthwise convolutions reduce computational complexity while maintaining the spatial resolution of feature maps, whereas dilated convolutions increase the receptive field, aiding in capturing a broader range of contextual information. The additional 3 × 3 and 1 × 1 convolution layers further enhance the feature representation capabilities. These layers, applied in the final stage of the module, help extract higher-level feature information.
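The block below sketches this idea in PyTorch: a DAB-style bottleneck with a depthwise asymmetric branch, a dilated depthwise asymmetric branch, a pointwise projection, and a residual connection. The channel split, the dilation rate, and the exact placement of the additional convolution layers are assumptions made for illustration; the authoritative layout of the enhanced DAB module is the one shown in Figure 5c.

```python
import torch.nn as nn

class DABBlock(nn.Module):
    """DAB-style bottleneck: local and dilated depthwise asymmetric branches plus a residual connection."""
    def __init__(self, ch, d=2):
        super().__init__()
        half = ch // 2
        self.reduce = nn.Sequential(  # 3x3 conv reducing channels before the asymmetric branches
            nn.Conv2d(ch, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.local = nn.Sequential(   # depthwise asymmetric branch (local detail)
            nn.Conv2d(half, half, (3, 1), padding=(1, 0), groups=half, bias=False),
            nn.Conv2d(half, half, (1, 3), padding=(0, 1), groups=half, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.context = nn.Sequential( # dilated depthwise asymmetric branch (enlarged receptive field)
            nn.Conv2d(half, half, (3, 1), padding=(d, 0), dilation=(d, 1), groups=half, bias=False),
            nn.Conv2d(half, half, (1, 3), padding=(0, d), dilation=(1, d), groups=half, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.project = nn.Sequential( # 1x1 projection back to the input width
            nn.Conv2d(half, ch, 1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.reduce(x)
        y = self.local(y) + self.context(y)
        return x + self.project(y)
```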
The SE (squeeze-and-excitation) attention is illustrated in Figure 6. The attention enhances feature representation by introducing channel attention mechanisms. First, the input feature map undergoes adaptive average pooling to compress the spatial dimensions to 1 × 1, thereby capturing global channel information. Subsequently, this information is processed through two fully connected layers with an intermediate ReLU activation function, and the channel attention weights are output through a sigmoid function. Finally, these weights are multiplied channel-wise with the original input feature map, recalibrating the importance of each channel and producing a recalibrated feature map. Integrating SE attention into the low-resolution branch of feature fusion can enhance important features and improve feature representation. During the feature fusion process, after emphasizing significant features in the low-resolution branch, these features can be better preserved and utilized when fused with the high-resolution branch, thereby enhancing the fusion effectiveness.
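A compact SE block following this description might look as follows; the reduction ratio of 16 is taken from the original SE paper and is an assumption here.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pool -> FC -> ReLU -> FC -> sigmoid -> channel reweighting."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # per-channel attention weights
        return x * w
```

In the fusion module, this block is applied to the upsampled and convolved low-resolution branch before the element-wise addition with the high-resolution branch.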

3.5. Enhanced Classifier

The classifier module maps high-dimensional feature maps to the class space for pixel-level classification. It optimizes computational efficiency and model performance by combining depthwise separable convolutions with depthwise convolutions. Specifically, the input feature maps first pass through three depthwise separable convolutional layers to reduce computational load while maintaining effective feature extraction. Subsequently, the feature maps are processed through a depthwise convolutional layer, and finally refined through a module containing 1 × 1 convolutions, batch normalization, ReLU activation, and dropout layers to complete the class mapping. Depthwise separable convolutions significantly reduce computation and parameter count by decomposing standard convolutions into depthwise and pointwise convolutions, which is particularly effective for handling high-resolution street scene images. Additionally, the combination of batch normalization and ReLU enhances training speed and stability. Batch normalization reduces internal covariate shift and accelerates the training process, while ReLU activation enhances the model’s representational capacity, allowing it to capture complex features more effectively.
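The sketch below assembles this head as described (three depthwise separable convolutions, one depthwise convolution, then the dropout/1 × 1 projection listed as final_conv_block in Table 1). The dropout probability is an assumption.

```python
import torch.nn as nn

class EnhancedClassifier(nn.Module):
    """Classifier head: 3x depthwise separable conv -> depthwise conv -> dropout/1x1 projection to classes."""
    def __init__(self, ch=128, num_classes=19, p_drop=0.1):
        super().__init__()
        def dsconv(c):
            return nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True),
                nn.Conv2d(c, c, 1, bias=False),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True))
        self.ds = nn.Sequential(dsconv(ch), dsconv(ch), dsconv(ch))
        self.dw = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.head = nn.Sequential(  # mirrors Table 1: Dropout + Conv + BN + ReLU + Dropout + Conv
            nn.Dropout(p_drop),
            nn.Conv2d(ch, ch, 1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
            nn.Conv2d(ch, num_classes, 1))

    def forward(self, x):
        return self.head(self.dw(self.ds(x)))
```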

4. Experiments and Results

4.1. Data Sets and Evaluation Metrics

The Cityscapes dataset [45] is widely used for autonomous driving and urban scene understanding. It contains 5000 high-resolution images (2048 × 1024 pixels) from 50 cities. The dataset is divided into 2975 training images, 500 validation images, and 1525 test images, with precise pixel-level annotations for 30 classes, 19 of which are used for evaluation. The annotations cover common urban elements such as roads, pedestrians, and vehicles.
The CamVid [45] dataset is a high-quality resource widely used for semantic segmentation and autonomous driving scene understanding. It comprises 701 video frames, partitioned into 367 for training, 101 for validation, and 233 for testing. With a resolution of 960 × 720 pixels, the dataset ensures clear visual detail. Covering a range of weather and lighting conditions, it provides pixel-level annotations for 11 key categories, including road, building, vehicle, pedestrian, and tree.
The mean Intersection over Union (mIoU) is a commonly used evaluation metric in semantic segmentation tasks, assessing model performance across different categories. The calculation formula is as follows:
$$mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij} + \sum_{j=0}^{k}p_{ji} - p_{ii}}$$
Here, $p_{ii}$ denotes the number of pixels correctly identified for class $i$, while $p_{ij}$ represents the number of pixels of class $i$ that have been misclassified as class $j$.
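For reference, mIoU can be computed from a class confusion matrix as in the short NumPy sketch below, which follows the formula above directly.

```python
import numpy as np

def mean_iou(confusion: np.ndarray) -> float:
    """confusion[i, j]: number of pixels of class i predicted as class j."""
    tp = np.diag(confusion).astype(np.float64)      # p_ii
    fp = confusion.sum(axis=0) - tp                 # sum_j p_ji minus p_ii
    fn = confusion.sum(axis=1) - tp                 # sum_j p_ij minus p_ii
    iou = tp / np.maximum(tp + fp + fn, 1e-12)      # per-class IoU
    return float(iou.mean())                        # average over the k+1 classes
```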
In subsequent experiments, the network’s performance was comprehensively evaluated from multiple perspectives by assessing metrics such as the number of model parameters, frames per second (FPS), and floating-point operations (GFLOPs).

4.2. Implementation Details

This study was implemented using Python 3.8 and the PyTorch 2.3.1 deep learning framework. Experiments were conducted on a system equipped with CUDA 12.1 and CuDNN v8 to ensure efficient computation on an NVIDIA GeForce RTX 4060 GPU. The model was evaluated on two benchmark datasets: Cityscapes and CamVid. For the Cityscapes dataset, a training batch size of 4 was used; for the CamVid dataset, a batch size of 8 was employed. Data augmentation strategies included random cropping, scaling, horizontal flipping, and additional cropping to enhance model robustness. During training, the cross-entropy loss function was used as the loss metric, and the Adam optimizer was applied to optimize the network parameters. The momentum value was set to 0.9, the weight decay rate was set to 0.0005, and the initial learning rate was set to 0.001. To improve training efficiency and stability, the network parameters were initialized using the Kaiming initialization strategy.
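A hedged sketch of this training configuration is shown below. Mapping the stated momentum of 0.9 to Adam's first-moment coefficient and using ignore_index=255 for unlabeled pixels are assumptions; `model` stands for the network assembled from the modules of Section 3.

```python
import torch
import torch.nn as nn

def build_training_setup(model: nn.Module):
    """Loss, optimizer, and initialization following Section 4.2 (assumed mapping of momentum to Adam beta1)."""
    # Kaiming initialization for convolutional layers
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
    criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255 assumed to mark unlabeled pixels
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), weight_decay=5e-4)
    return criterion, optimizer
```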
The learning rate adjustment strategy followed a Poly schedule, which gradually reduces the learning rate during training to fine-tune model parameters in the later stages. The specific formula is:
$$lr = lr_0 \times \left(1 - \frac{iter}{max\_iter}\right)^{power}$$
where $lr_0$ is the initial learning rate and $power$ is the decay exponent, set to 0.9.
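In code, the schedule reduces to a one-line function that can be applied to the optimizer's parameter groups once per iteration; the usage below assumes the Adam optimizer from the setup sketch above.

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """Poly decay: lr = lr0 * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Example usage inside the training loop:
# for group in optimizer.param_groups:
#     group["lr"] = poly_lr(1e-3, cur_iter, max_iter)
```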

4.3. Ablation Study

To systematically evaluate the contributions of the optimization modules and related improvements to the performance of the semantic segmentation network, we conducted ablation experiments using the Cityscapes dataset. The purpose of these experiments is to quantify the impact of each component and validate its effectiveness. The experiments illustrate the trade-offs between parameter count, processing speed, and mIoU (mean Intersection over Union) across different configurations. The baseline model has a parameter count of 1.18 M, a processing speed of 123.5 FPS, and an mIoU of 68%. The introduction of additional components resulted in a series of performance changes in the model. Table 2 presents a comparative analysis of the impact of various components on model performance.
SimAM module: The introduction of the SimAM attention mechanism enhances the feature representation capability within the semantic segmentation network. By leveraging self-attention and adaptive weighting, it improves the distinctiveness and accuracy of the features. This mechanism dynamically adjusts the weights of the feature maps, allowing the network to better capture and utilize critical feature information. Consequently, the mIoU increases to 68.6%, though the processing speed decreases to 117.4 FPS, and the GFLOPs increase to 18.7 G, while the parameter count remains unchanged, indicating a notable performance improvement.
RPP module: The RPP module enhances the performance of the network in segmenting distant small objects and complex backgrounds by improving the pyramid pooling module in the global feature extraction. Specifically, the refined pyramid pooling module increases the number of pooling scales, which enables more effective extraction and fusion of multi-scale features. As a result, the mIoU improves by 0.4% compared to the baseline model, while the processing speed decreases to 119 FPS, and the parameter count increases to 1.55 M. The GFLOPs increase to 21.8 G.
FFM module: Incorporating DABBlock in the high-resolution branch and SE attention in the low-resolution branch of the feature fusion module (FFM) improved the model’s mIoU from 68% to 68.8% and 68.5%, respectively, demonstrating effective adaptation to different scales of objects. When DABBlock and SE attention are used together, the mIoU further increases to 69.1%, highlighting the significant advantage of the FFM module in enhancing the segmentation of small objects and edge regions. DABBlock combines standard and dilated convolutions to achieve effective multi-scale feature extraction and integration, improving the network’s ability to recognize objects of various sizes while maintaining computational efficiency. SE attention generates weights through global pooling, enhancing features, suppressing irrelevant information, and optimizing the low-resolution branch. The GFLOPs for FFM with DABBlock increase to 23.9 G, and with SE attention, this increases to 25.2 G.
After integrating SimAM, RPP, FFM, and the enhanced classifier, the proposed algorithm significantly improves the overall segmentation performance in complex urban scenes, achieving an mIoU of 71.7% with GFLOPs of 58.5 G.

4.4. Comparison with the Existing Model

In this section, we compare the proposed algorithm with several state-of-the-art segmentation models and present a series of performance metrics derived from the Cityscapes and CamVid datasets. Figure 7 illustrates the relationship between model inference speed and segmentation accuracy for various segmentation networks on the Cityscapes dataset.

4.4.1. Results on Cityscapes Dataset

Table 3 presents a comparison of the performance metrics for different algorithms on the CityScapes dataset. OCNet, with 62.6 M parameters and a computational cost of 549 G, achieves an mIoU of 80.1%, indicating that its high accuracy comes at the expense of a substantial computational overhead. In contrast, the ENet and ESPNet models have minimal parameters (0.36 M) and lower computational costs, but their mIoU values are 58.3% and 60.3%, respectively, highlighting that efficient models designed for resource-constrained environments may sacrifice accuracy. BiSeNet1, with 5.8 M parameters and 14.8 G of computational cost, operates at 101.6 FPS with an mIoU of 68.4%. On the other hand, BiSeNet2, which has 49 M parameters and 55.3 G of computational cost, achieves 60.4 FPS and 74.7% mIoU. This demonstrates that a moderate increase in model complexity can enhance accuracy without significantly compromising speed. The network proposed in this study features 2.59 M parameters and a computational cost of 58.5 G, which is significantly lower than that of OCNet and RefineNet, making it more suitable for real-time applications. Additionally, this network maintains a high segmentation accuracy with an mIoU of 71.7% while achieving a processing speed of 81.4 FPS, ensuring both fast response and high segmentation performance.
As illustrated in Figure 8, the proposed algorithm demonstrates relatively better performance in detailed capture and boundary handling. The proposed method retains more details and achieves more accurate segmentation of object boundaries when handling complex scenes. For instance, as shown in the second and third rows of Figure 8, the proposed algorithm provides clearer and more natural contours for dense crowds compared to Fast-SCNN and DABNet. The final row of Figure 8 highlights that the proposed method segments large objects, such as buses, with clearer boundaries and fewer segmentation misalignments. The third row of Figure 8 indicates that the proposed algorithm achieves higher precision in segmenting slender objects, such as traffic signs, demonstrating its excellence in capturing image details. In contrast, the first and fourth rows of Figure 8 reveal that both Fast-SCNN and DABNet exhibit some degree of segmentation inaccuracies when segmenting bicycles. Fast-SCNN shows errors in distinguishing between people riding bicycles and the bicycles themselves, while DABNet’s weaker detail capture results in a loss of fine image details. Additionally, as seen in the third row of Figure 8, Fast-SCNN and DABNet perform less effectively than the proposed algorithm in segmenting slender objects like traffic signs.
Based on Table 4, our method achieves a mean Intersection over Union (mIoU) of 71.7% on the Cityscapes dataset, outperforming other algorithms. Table 4 also provides IoU evaluations for different algorithms across various categories, offering a detailed reflection of model performance. Among the 19 categories, our method shows superior performance compared to competing algorithms in most categories, particularly demonstrating significant advantages in “Road” (97.8%), “Building” (91.6%), and “Person” (82.0%).

4.4.2. Results on CamVid Dataset

Table 5 compares the performance metrics of different algorithms on the CamVid dataset. Our proposed method achieves 113.6 FPS with only 2.59 M parameters while maintaining an mIoU of 69.4%, demonstrating its ability to effectively balance model complexity, computational efficiency, and segmentation accuracy in practical applications. In contrast, although PSPNet achieves a comparable mIoU of 69.1%, it requires a significantly larger number of parameters (250.8 M) and offers only 5.4 FPS, making it unsuitable for real-time applications. ENet and ESPNet represent another extreme, with very small parameter sizes (0.36 M each) and FPS rates of 96.4 and 190.3, respectively, but their segmentation accuracy is relatively low, with mIoUs of 51.3% and 58.3%. DABNet, with an extremely small parameter size of 0.76 M, achieves 162 FPS and an mIoU of 65.7%, demonstrating a good balance between efficiency and accuracy. In summary, our proposed method achieves an optimal trade-off between high accuracy, low computational overhead, and real-time performance.
Figure 9 presents the segmentation results of various algorithms on the CamVid test set, highlighting the advantages of the proposed method in street scene analysis. Firstly, the proposed algorithm achieves precise object boundary segmentation, effectively avoiding the edge blurring issues seen in Fast-SCNN and DABNet, and demonstrates superior detail retention. For instance, in the third row of Figure 9, the proposed method shows greater accuracy in segmenting bicycles compared to Fast-SCNN. Secondly, the proposed algorithm excels in segmenting small objects. The CamVid dataset contains many small objects, such as bicycles and traffic signs, which are often missed or misclassified by other algorithms. In contrast, the proposed method utilizes refined feature extraction and attention mechanisms to accurately identify and segment these small targets. As seen in the first two rows of Figure 9, the proposed method demonstrates improved accuracy in segmenting poles. Moreover, despite the low illumination in the first input image of Figure 9, the proposed method maintains consistent performance across different categories, indicating its robustness under complex lighting conditions. The algorithm maintains stable segmentation results across various lighting conditions through deep feature learning and effective normalization. For occluded objects, which typically lead to segmentation errors, the proposed method effectively addresses this issue through contextual information integration and enhanced feature representation. Overall, the proposed algorithm exhibits outstanding performance in segmentation stability and accuracy, demonstrating significant practical value.
Table 6 presents the segmentation accuracy for each class achieved by different networks on the CamVid dataset.

5. Conclusions

In this study, we made several key improvements to the Fast-SCNN model to enhance its semantic segmentation performance on street scene images. First, we introduced the SimAM module to boost the network’s sensitivity to critical spatial features. This module, applied after the downsampling phase, significantly improves the expressiveness of feature maps. Second, we extended the pyramid pooling module (PPM) by incorporating more fine-grained layers and designed the enhanced DAB module, which is integrated into the high-resolution branch of the feature fusion module. This provides richer contextual information and improves the accuracy of small object recognition. Additionally, the low-resolution branch includes SE attention, which significantly enhances the fusion of low-level and high-level features. The classifier module was improved by adding depthwise separable convolutions and depthwise convolutions, which deepens the network structure and better captures and classifies multi-scale features. These enhancements increase the network’s ability to handle image details and improve segmentation accuracy and robustness. The experimental results show that our network achieves significantly higher segmentation accuracy in complex scenarios, such as urban street scenes, compared to the original model. On the Cityscapes and CamVid datasets, our model achieved 71.7% and 69.4% mIoU, respectively, while maintaining inference speeds of 81.4 fps and 113.6 fps.
Future work will focus on further enhancing the efficiency and performance of the model. This includes exploring more lightweight network architectures to improve the inference speed while maintaining high accuracy, thereby making the model more suitable for deployment on resource-constrained embedded devices. Additionally, efforts will be directed towards investigating methods to enhance the robustness and generalization capabilities of the model across various challenging scenarios, such as extreme weather conditions and nighttime environments.

Author Contributions

Conceptualization, X.X.; methodology, B.W.; software, B.W.; validation, X.X., B.W. and Y.W.; formal analysis, B.W.; investigation, X.X.; resources, B.W.; data curation, B.W.; writing—original draft preparation, B.W.; writing—review and editing, B.W.; visualization, B.W.; supervision, X.X.; project administration, B.W.; funding acquisition, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Key Science and Technology Program of Zigong Municipality (No. 2020YGJC25), the Opening Fund of Power Internet of Things Key Laboratory of Sichuan Province (No. PIT-F-202304), and the Sichuan University of Science & Engineering Graduate Innovation Fund (No. Y2023308).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be made available upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Hu, X.; Feng, J. A Fast Attention-Guided Hierarchical Decoding Network for Real-Time Semantic Segmentation. Sensors 2023, 24, 95. [Google Scholar] [CrossRef] [PubMed]
  2. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  3. Tsuneki, M. Deep Learning Models in Medical Image Analysis. J. Oral Biosci. 2022, 64, 312–320. [Google Scholar] [CrossRef] [PubMed]
  4. Marullo, G.; Tanzi, L.; Ulrich, L.; Porpiglia, F.; Vezzetti, E. A Multi-Task Convolutional Neural Network for Semantic Segmentation and Event Detection in Laparoscopic Surgery. J. Pers. Med. 2023, 13, 413. [Google Scholar] [CrossRef] [PubMed]
  5. Luo, Z.; Yang, W.; Yuan, Y.; Gou, R.; Li, X. Semantic Segmentation of Agricultural Images: A Survey. Inf. Process. Agric. 2023, 11, 172–186. [Google Scholar] [CrossRef]
  6. Yuan, Y.; Huang, L.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. OCNet: Object Context for Semantic Segmentation. Int. J. Comput. Vis. 2021, 129, 2375–2398. [Google Scholar] [CrossRef]
  7. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. ISBN 978-3-319-24573-7. [Google Scholar]
  8. Lin, G.; Milan, A.; Shen, C.; Reid, I. Refinenet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934. [Google Scholar]
  9. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  10. Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. Erfnet: Efficient Residual Factorized Convnet for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272. [Google Scholar] [CrossRef]
  11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  12. Lu, M.; Chen, Z.; Liu, C.; Ma, S.; Cai, L.; Qin, H. MFNet: Multi-Feature Fusion Network for Real-Time Semantic Segmentation in Road Scenes. IEEE Trans. Intell. Transp. Syst. 2022, 23, 20991–21003. [Google Scholar] [CrossRef]
  13. Shi, M.; Lin, S.; Yi, Q.; Weng, J.; Luo, A.; Zhou, Y. Lightweight Context-Aware Network Using Partial-Channel Transformation for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2024, 25, 7401–7416. [Google Scholar] [CrossRef]
  14. Shu, R.; Zhao, S. Multi-Resolution Learning and Semantic Edge Enhancement for Super-Resolution Semantic Segmentation of Urban Scene Images. Sensors 2024, 24, 4522. [Google Scholar] [CrossRef]
  15. Poudel, R.P.K.; Liwicki, S.; Cipolla, R. Fast-SCNN: Fast Semantic Segmentation Network. arXiv 2019, arXiv:1902.04502. [Google Scholar]
  16. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  17. Elhassan, M.A.; Huang, C.; Yang, C.; Munea, T.L. DSANet: Dilated Spatial Attention for Real-Time Semantic Segmentation in Urban Street Scenes. Expert. Syst. Appl. 2021, 183, 115090. [Google Scholar] [CrossRef]
  18. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral Segmentation Network for Real-Time Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  19. Verelst, T.; Tuytelaars, T. SegBlocks: Block-Based Dynamic Resolution Networks for Real-Time Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2400–2411. [Google Scholar] [CrossRef] [PubMed]
  20. Li, G.; Yun, I.; Kim, J.; Kim, J. DABNet: Depth-Wise Asymmetric Bottleneck for Real-Time Semantic Segmentation. arXiv 2019, arXiv:1907.11357. [Google Scholar]
  21. Kong, C.; Luo, A.; Wang, S.; Li, H.; Rocha, A.; Kot, A.C. Pixel-Inconsistency Modeling for Image Manipulation Localization. arXiv 2023, arXiv:2310.00234. [Google Scholar]
  22. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking Bisenet for Real-Time Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725. [Google Scholar]
  23. Liu, T.; Xu, C.; Qiao, Y.; Jiang, C.; Chen, W. News Recommendation with Attention Mechanism. J. Ind. Eng. Appl. Sci. 2024, 2, 21–26. [Google Scholar] [CrossRef]
  24. Jin, A.; Zeng, X. A Novel Deep Learning Method for Underwater Target Recognition Based on Res-Dense Convolutional Neural Network with Attention Mechanism. J. Mar. Sci. Eng. 2023, 11, 69. [Google Scholar] [CrossRef]
  25. Patil, R.; Boit, S.; Gudivada, V.; Nandigam, J. A Survey of Text Representation and Embedding Techniques in Nlp. IEEE Access 2023, 11, 36120–36146. [Google Scholar] [CrossRef]
  26. Zalewski, J.; Hożyń, S. Computer Vision-Based Position Estimation for an Autonomous Underwater Vehicle. Remote Sens. 2024, 16, 741. [Google Scholar] [CrossRef]
  27. Gao, G.; Xu, G.; Yu, Y.; Xie, J.; Yang, J.; Yue, D. MSCFNet: A Lightweight Network with Multi-Scale Context Fusion for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2021, 23, 25489–25499. [Google Scholar] [CrossRef]
  28. Wang, H.; Chen, Y.; Cai, Y.; Chen, L.; Li, Y.; Sotelo, M.A.; Li, Z. SFNet-N: An Improved SFNet Algorithm for Semantic Segmentation of Low-Light Autonomous Driving Road Scenes. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21405–21417. [Google Scholar] [CrossRef]
  29. Zhong, Z.; Lin, Z.Q.; Bidart, R.; Hu, X.; Daya, I.B.; Li, Z.; Zheng, W.-S.; Li, J.; Wong, A. Squeeze-and-Attention Networks for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13065–13074. [Google Scholar]
  30. Chen, Y.; Zhan, W.; Jiang, Y.; Zhu, D.; Guo, R.; Xu, X. LASNet: A Light-Weight Asymmetric Spatial Feature Network for Real-Time Semantic Segmentation. Electronics 2022, 11, 3238. [Google Scholar] [CrossRef]
  31. Li, Z.; Sun, Y.; Zhang, L.; Tang, J. CTNet: Context-Based Tandem Network for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9904–9917. [Google Scholar] [CrossRef] [PubMed]
  32. Pei, Y.; Sun, B.; Li, S. Multifeature Selective Fusion Network for Real-Time Driving Scene Parsing. IEEE Trans. Instrum. Meas. 2021, 70, 5008412. [Google Scholar] [CrossRef]
  33. Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. Icnet for Real-Time Semantic Segmentation on High-Resolution Images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 405–420. [Google Scholar]
  34. Huang, Z.; Wei, Y.; Wang, X.; Liu, W.; Huang, T.S.; Shi, H. Alignseg: Feature-Aligned Segmentation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 550–557. [Google Scholar] [CrossRef] [PubMed]
  35. Gao, R. Rethinking Dilated Convolution for Real-Time Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4675–4684. [Google Scholar]
  36. Lo, S.-Y.; Hang, H.-M.; Chan, S.-W.; Lin, J.-J. Efficient Dense Modules of Asymmetric Convolution for Real-Time Semantic Segmentation. In Proceedings of the ACM Multimedia Asia, Beijing, China, 15–18 December 2019; pp. 1–6. [Google Scholar]
  37. Wang, Y.; Zhou, Q.; Liu, J.; Xiong, J.; Gao, G.; Wu, X.; Latecki, L.J. Lednet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1860–1864. [Google Scholar]
  38. Wang, Y.; Zhou, Q.; Xiong, J.; Wu, X.; Jin, X. ESNet: An Efficient Symmetric Network for Real-Time Semantic Segmentation. In Pattern Recognition and Computer Vision; Lin, Z., Wang, L., Yang, J., Shi, G., Tan, T., Zheng, N., Chen, X., Zhang, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11858, pp. 41–52. ISBN 978-3-030-31722-5. [Google Scholar]
  39. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. Espnet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 552–568. [Google Scholar]
  40. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. Cgnet: A Light-Weight Context Guided Network for Semantic Segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef]
  41. Liu, M.; Yin, H. Feature Pyramid Encoding Network for Real-Time Semantic Segmentation. arXiv 2019, arXiv:1909.08599. [Google Scholar]
  42. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. Simam: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, online, 17–18 July 2021; pp. 11863–11874. [Google Scholar]
  43. Zhao, S.; Zhao, X.; Huo, Z.; Zhang, F. BMSeNet: Multiscale Context Pyramid Pooling and Spatial Detail Enhancement Network for Real-Time Semantic Segmentation. Sensors 2024, 24, 5145. [Google Scholar] [CrossRef]
  44. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  45. Emek Soylu, B.; Guzel, M.S.; Bostanci, G.E.; Ekinci, F.; Asuroglu, T.; Acici, K. Deep-Learning-Based Approaches for Semantic Segmentation of Natural Scene Images: A Review. Electronics 2023, 12, 2730. [Google Scholar] [CrossRef]
  46. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  47. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar]
  48. Li, H.; Xiong, P.; Fan, H.; Sun, J. Dfanet: Deep Feature Aggregation for Real-Time Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9522–9531. [Google Scholar]
Figure 1. Comparison of common structures in semantic segmentation networks. From left to right: (a) Spatial pyramid pooling; (b) Multi-branch; (c) Encoder-decoder architecture; (d) Feature reutilization at the stage level.
Figure 2. Overall architecture of the proposed method.
Figure 3. Schematic diagram of the operational mechanism of SimAM.
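For orientation, the weighting summarized in Figure 3 can be written in a few lines. The following is a minimal PyTorch-style sketch of the published parameter-free SimAM formulation from [42]; the regularization constant e_lambda is the default suggested in that paper, not a value reported here, and this is not claimed to be the authors' exact code.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention following the energy-based formulation in [42]."""

    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda  # regularization constant from the SimAM paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        n = h * w - 1
        # Squared deviation of every activation from its channel-wise spatial mean.
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # Channel variance estimate used in the closed-form energy function.
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # Inverse energy: more distinctive neurons receive larger attention weights.
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)
```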
Figure 4. Details of the implementation of the refined pyramid pooling module.
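The refined pyramid pooling module is the authors' modification of standard pyramid pooling, and Figure 4 remains the authoritative description. As a rough reference only, a generic pyramid pooling stage is sketched below; the bin sizes (1, 2, 3, 6) and the 1 × 1 compression layout are illustrative assumptions, not the refined pooling levels used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Generic pyramid pooling stage; bin sizes are illustrative assumptions."""

    def __init__(self, channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.bins = bins
        # One 1x1 convolution per pooling level to compress channels.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(channels, channels // len(bins), 1, bias=False) for _ in bins]
        )
        self.project = nn.Conv2d(channels * 2, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        feats = [x]
        for bin_size, conv in zip(self.bins, self.reduce):
            # Pool to a coarse grid, compress, then upsample back to the input size.
            y = conv(F.adaptive_avg_pool2d(x, bin_size))
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False))
        # Concatenate the original map with all pooled contexts and fuse them.
        return self.project(torch.cat(feats, dim=1))
```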
Figure 5. (a) ERFNet non-bottleneck-1D module; (b) DAB module; (c) the enhanced DAB module proposed in this paper. Here, d denotes the use of dilated convolutions, and c denotes the number of input channels; all convolutions within the black dashed box are depthwise convolutions.
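To make the factorized-convolution idea behind Figure 5 concrete, the sketch below implements a generic depthwise asymmetric bottleneck in the spirit of the original DAB block (Figure 5b). The channel split, dilation rate, and activation choices are assumptions for illustration; the enhanced variant in Figure 5c adds further changes that are not reproduced here.

```python
import torch
import torch.nn as nn

class DABStyleBlock(nn.Module):
    """Illustrative depthwise asymmetric bottleneck (not the paper's enhanced variant)."""

    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        half = channels // 2
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.PReLU(half),
        )
        # Local branch: depthwise 3x1 followed by 1x3 convolutions.
        self.local = nn.Sequential(
            nn.Conv2d(half, half, (3, 1), padding=(1, 0), groups=half, bias=False),
            nn.Conv2d(half, half, (1, 3), padding=(0, 1), groups=half, bias=False),
            nn.BatchNorm2d(half), nn.PReLU(half),
        )
        # Context branch: the same factorized pair, but dilated (parameter d in Figure 5).
        self.context = nn.Sequential(
            nn.Conv2d(half, half, (3, 1), padding=(dilation, 0),
                      dilation=(dilation, 1), groups=half, bias=False),
            nn.Conv2d(half, half, (1, 3), padding=(0, dilation),
                      dilation=(1, dilation), groups=half, bias=False),
            nn.BatchNorm2d(half), nn.PReLU(half),
        )
        self.expand = nn.Conv2d(half, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)
        y = self.local(y) + self.context(y)
        # Residual connection preserves the input resolution and channel count.
        return x + self.expand(y)
```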
Figure 6. The implementation principle of the SE attention mechanism.
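The SE mechanism in Figure 6 follows the standard squeeze-and-excitation recipe of [44]. A compact sketch is given below; the reduction ratio of 16 is the common default from that paper, not necessarily the value used in this work.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (Hu et al., ref. [44])."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Squeeze: global average pooling collapses each channel to one value.
        s = x.mean(dim=(2, 3))
        # Excitation: a small bottleneck MLP produces per-channel weights in (0, 1).
        w = self.fc(s).view(b, c, 1, 1)
        # Reweight the original feature map channel-wise.
        return x * w
```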
Figure 7. Model inference speed and segmentation accuracy on the Cityscapes dataset. The red triangle denotes the proposed algorithm, the green dots represent other segmentation algorithms, and the red dashed line indicates the minimum real-time segmentation requirements.
Figure 8. Visual comparisons on the Cityscapes validation set. From left to right are the original image, ground truth, segmentation results from Fast-SCNN, DABNet, and the proposed algorithm. The red dashed box highlights the key areas for comparison.
Figure 9. Visual comparisons on the CamVid test set. From left to right are the original image, ground truth, segmentation outputs from Fast-SCNN, DABNet, and the proposed algorithm. The red dashed box highlights key areas for comparison.
Table 1. Detailed architecture of proposed method.
Operator | Mode | Channel | Output Size
Conv2D | kernel: 3 × 3; stride 2 | 32 | 512 × 1024
DSConv | stride 2 | 48 | 256 × 512
DSConv | stride 2 | 64 | 128 × 256
SimAM | - | 64 | 128 × 256
bottleneck1 | n = 3; t = 6 | 64 | 64 × 128
bottleneck2 | n = 3; t = 6 | 96 | 32 × 64
bottleneck3 | n = 3; t = 6 | 128 | 32 × 64
RPPM | - | 128 | 32 × 64
FFM (branch1) | Conv + DAB block | 128 | 128 × 256
FFM (branch2) | DWConv + Conv + SE | 128 | 128 × 256
FFM | add | 128 | 128 × 256
DSConv | stride 1 | 128 | 128 × 256
DSConv | stride 1 | 128 | 128 × 256
DSConv | stride 1 | 128 | 128 × 256
DWConv | kernel: 3 × 3; stride 1 | 128 | 128 × 256
final_conv_block | Dropout + Conv + BN + ReLU + Dropout + Conv | 19 | 128 × 256
Bilinear Interpolation | ×8 | 19 | 1024 × 2048
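Several rows of Table 1 reuse the depthwise separable convolution (DSConv) building block inherited from Fast-SCNN. A minimal sketch of such a block is shown below; the layer ordering and normalization choices are the conventional ones, not confirmed details of the authors' implementation.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1,
    each with batch normalization and ReLU (a conventional layout, assumed here)."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            # Depthwise 3x3: one filter per input channel.
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # Pointwise 1x1: mixes channels and sets the output width.
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)
```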
Table 2. Comparison of the impact of different components on model performance.
Experiment Configuration | Params (M) | GFLOPs | FPS | mIoU (%)
Baseline | 1.18 | 15.3 G | 123.5 | 68.0
Baseline + SimAM | 1.18 | 18.7 G | 117.4 | 68.6
Baseline + RPP | 1.55 | 21.8 G | 119 | 68.4
Baseline + FFM (with DAB block) | 1.69 | 23.9 G | 107.2 | 68.8
Baseline + FFM (with SE attention) | 1.63 | 25.2 G | 113 | 68.5
Baseline + FFM | 2.14 | 35.6 G | 104.9 | 69.1
Baseline + Enhanced Classifier | 1.26 | 29.1 G | 112.5 | 68.8
Baseline + SimAM + RPP | 1.55 | 25.2 G | 101.6 | 68.9
Baseline + SimAM + RPP + FFM | 2.51 | 42.7 G | 86.7 | 69.4
Proposed Method | 2.59 | 58.5 G | 81.4 | 71.7
Table 3. Comparison of the results of the proposed method with other advanced models on the Cityscapes test set.
Method | Input Size | Pretrain | Params (M) | GFLOPs | FPS | mIoU (%)
SegNet [46] | 640 × 340 | ImageNet | 29.5 | 286 G | 16.7 | 57
OCNet [6] | 512 × 1024 | ImageNet | 62.6 | 549 G | 8.7 | 80.1
RefineNet [8] | 512 × 1024 | ImageNet | 118.1 | 526 G | 9.1 | 73.6
ENet [47] | 512 × 1024 | No | 0.36 | 4.38 G | 69.4 | 58.3
ESPNet [39] | 512 × 1024 | No | 0.36 | 4.64 G | 113 | 60.3
CGNet [40] | 512 × 1024 | No | 0.5 | 7.0 G | 62 | 64.8
ERFNet [10] | 512 × 1024 | No | 2.1 | 27.2 G | 46 | 68.0
DABNet [20] | 512 × 1024 | No | 0.8 | 10.4 G | 102 | 70.1
ICNet [33] | 1024 × 2048 | ImageNet | 26.5 | 28.3 G | 30 | 69.5
BiSeNet1 [18] | 768 × 1536 | ImageNet | 5.8 | 14.8 G | 101.6 | 68.4
DSANet [17] | 512 × 1024 | No | 3.49 | 37.4 G | 33.8 | 71.4
BiSeNet2 [18] | 768 × 1536 | ImageNet | 49 | 55.3 G | 60.4 | 74.7
Ours | 512 × 1024 | ImageNet | 2.59 | 58.5 G | 81.4 | 71.7
The last row of the table highlights the metrics of our method in bold.
Table 4. Class mIoU scores on Cityscapes test set for the per-class category.
Method | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle | mIoU
ENet [47] | 96.3 | 74.2 | 85.1 | 32.1 | 33.2 | 43.4 | 34.2 | 44.0 | 88.6 | 61.4 | 90.5 | 65.5 | 38.4 | 90.6 | 36.9 | 50.4 | 48.1 | 38.8 | 55.4 | 58.3
CGNet [40] | 95.7 | 73.9 | 89.9 | 43.9 | 46.1 | 52.9 | 55.9 | 63.8 | 91.7 | 68.3 | 94.2 | 76.6 | 54.2 | 91.3 | 41.3 | 55.9 | 32.8 | 41.2 | 60.9 | 64.8
ERFNet [10] | 97.7 | 81.0 | 89.8 | 42.3 | 48.0 | 56.3 | 59.8 | 65.3 | 91.4 | 68.2 | 94.2 | 76.8 | 57.2 | 92.8 | 50.8 | 60.3 | 51.7 | 47.3 | 61.6 | 68.0
DABNet [20] | 97.6 | 80.7 | 90.3 | 47.9 | 48.2 | 56.4 | 61.8 | 67.0 | 92.1 | 69.5 | 94.2 | 80.3 | 59.2 | 93.5 | 46.0 | 57.1 | 35.2 | 50.4 | 66.8 | 68.1
DSANet [17] | 96.8 | 78.4 | 91.2 | 50.5 | 50.9 | 59.4 | 64.2 | 71.7 | 92.6 | 70.0 | 94.5 | 81.8 | 61.8 | 92.9 | 56.1 | 75.6 | 50.6 | 50.9 | 66.7 | 71.4
Ours | 97.8 | 80.7 | 91.6 | 50.3 | 51.7 | 59.6 | 64.0 | 71.5 | 92.9 | 69.5 | 94.6 | 82.0 | 62.0 | 92.8 | 56.2 | 75.3 | 52.2 | 50.6 | 67.1 | 71.7
Bold values indicate that the corresponding metric surpasses all other compared methods.
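The class and mean IoU values in Tables 3 and 4 (and in Tables 5 and 6 below) follow the standard intersection-over-union definition accumulated over the test set. A minimal sketch of the computation from a class confusion matrix is given below; this is a hypothetical helper for illustration, not the authors' evaluation code.

```python
import numpy as np

def mean_iou(conf_matrix: np.ndarray) -> float:
    """mIoU from a class-by-class confusion matrix (rows: ground truth, cols: prediction)."""
    tp = np.diag(conf_matrix).astype(np.float64)          # true positives per class
    fp = conf_matrix.sum(axis=0) - tp                      # false positives per class
    fn = conf_matrix.sum(axis=1) - tp                      # false negatives per class
    denom = tp + fp + fn
    # IoU = TP / (TP + FP + FN); classes absent from both prediction and ground truth are skipped.
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))
```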
Table 5. Comparison of the experimental results of the proposed method with other state-of-the-art models on the CamVid test set. ‘-’ denotes that the original paper did not provide the relevant data.
Method | Params (M) | GFLOPs | FPS | mIoU (%)
SegNet [46] | 29.5 | 286 G | 46 | 55.6
PSPNet [11] | 250.8 | 412.2 G | 5.4 | 69.1
DFANet [48] | 7.8 | 3.4 G | 120 | 64.7
ENet [47] | 0.36 | 4.38 G | 96.4 | 51.3
ESPNet [39] | 0.36 | 4.64 G | 190.3 | 58.3
CGNet [40] | 0.5 | 7.0 G | 95.6 | 64.7
ERFNet [10] | 2.1 | 27.2 G | 139.1 | 67.7
BiSeNet1 [18] | 5.8 | 14.8 G | - | 65.6
DABNet [20] | 0.76 | 10.4 G | 162 | 65.7
ICNet [33] | 26.5 | 28.3 G | 27.8 | 67.1
BiSeNet2 [18] | 49 | 55.3 G | - | 68.7
Ours | 2.59 | 58.5 G | 113.6 | 69.4
The last row of the table highlights the metrics of our method in bold.
Table 6. Class mIoU scores on CamVid test set for the per-class category.
Method | Building | Tree | Sky | Car | Sign | Road | Pedestrian | Fence | Pole | Sidewalk | Bicyclist | mIoU (%)
BiSeNet1 [18] | 82.2 | 74.5 | 91.9 | 80.7 | 42.8 | 93.1 | 53.9 | 49.7 | 25.4 | 77.3 | 50.1 | 65.6
BiSeNet2 [18] | 82.9 | 75.7 | 92.1 | 83.7 | 46.5 | 94.6 | 58.7 | 53.6 | 31.9 | 81.4 | 54.2 | 68.7
CGNet [40] | 79.8 | 73.2 | 90.6 | 81.3 | 41.6 | 95.4 | 52.9 | 32.9 | 28.2 | 81.9 | 53.9 | 64.7
ERFNet [10] | 81.2 | 75.0 | 91.9 | 84.3 | 45.1 | 95.0 | 58.3 | 36.2 | 37.8 | 81.2 | 58.2 | 67.7
DABNet [20] | 81.0 | 74.1 | 91.1 | 81.7 | 43.0 | 93.8 | 56.2 | 37.2 | 29.4 | 78.7 | 56.5 | 65.7
Ours | 81.6 | 78.4 | 92.5 | 81.8 | 51.2 | 97.3 | 57.4 | 45.6 | 37.8 | 81.2 | 58.7 | 69.4
Bold values indicate that the corresponding metric surpasses all other compared methods.