Article

LMANet: A Lightweight Asymmetric Semantic Segmentation Network Based on Multi-Scale Feature Extraction

by Hui Chen 1,*, Zhexuan Xiao 1, Bin Ge 1 and Xuedi Li 2
1 School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
2 China Telecom Co., Ltd., Anhui Branch, Hefei 230000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3361; https://doi.org/10.3390/electronics13173361
Submission received: 31 July 2024 / Revised: 11 August 2024 / Accepted: 21 August 2024 / Published: 23 August 2024

Abstract

With the rapid progress of deep learning and its wide application to semantic segmentation, segmentation performance has improved significantly. However, achieving a reasonable compromise between accuracy, model size, and inference speed remains crucial. In this paper, we propose a lightweight multi-scale asymmetric encoder–decoder network (LMANet) designed on the basis of an encoder–decoder structure. First, an optimized bottleneck module is used to extract features from different levels, and different receptive fields are applied to obtain effective information at different scales. Then, a channel-attention module and a feature-extraction module are combined to form the residual structure, and different feature maps are connected by a feature-fusion module to effectively improve segmentation accuracy. Finally, a lightweight multi-scale decoder is designed to recover the image, and a spatial attention module is added to recover spatial details effectively. We verify the proposed method on the Cityscapes and CamVid datasets, achieving mean intersection over union (mIoU) of 73.9% and 71.1% with inference speeds of 111 FPS and 118 FPS, respectively, while the number of parameters is only 0.85 M.

1. Introduction

Semantic segmentation involves assigning a category label to each pixel in an image, enabling detailed scene understanding through the semantics of individual pixels. It has been widely used in real-world scenarios such as autonomous driving [1], medical diagnosis [2], and remote sensing [3]. However, as models are increasingly deployed on terminal equipment, model size and computational cost must be considered. It is therefore important to construct a lightweight semantic segmentation network that achieves a balance between accuracy, model size, and inference speed.
Many recent deep-learning-based semantic segmentation algorithms have outperformed traditional methods in both accuracy and inference speed, and numerous methods built on fully convolutional networks [4] have been proposed. These networks focus on two main metrics, namely (1) accuracy and (2) light weight. However, high-accuracy segmentation networks usually rely on deeper convolutional layers and larger channel numbers, as in PSPNet [5] and Deeplabv3+ [6], which leads to more complex structures and limits their usability on mobile devices. Other networks address real-time requirements through targeted optimization. For instance, Zhao et al. [7] proposed ICNet, which acquires information and integrates features through a cascade strategy. ESPNet [8] adopted a convolution factorization strategy and an efficient spatial pyramid module to achieve a smaller number of parameters. In addition, some networks apply techniques from natural language processing to image semantic segmentation to improve accuracy, such as transformers [9] and semantic enhancement [10].
Moreover, some networks use different branches to extract semantic and spatial information, preserving accuracy with fewer model parameters. For example, BiseNet [11] contains two branches that extract spatial information and semantic information, respectively. Building on BiseNet, Yu et al. [12] introduced group convolution and further optimized the fusion method in BiseNetv2, which improved segmentation but also greatly increased the size and computational cost of the model. BFMNet [13] also adopted a two-branch structure and perceives multi-scale object information through an added multi-scale context aggregation module. DDRNet [14] cascades two different branches on the basis of the two-branch structure and adds a pyramid module to extract context information. DecoupleSegNet [15] used a dual-stream framework and shared high-resolution information between the branches.
In the semantic segmentation task, a critical factor for improving performance is effectively recovering the information lost during downsampling. U-Net [16] used a symmetric encoder–decoder structure to achieve image segmentation, but it is mainly suited to images with relatively simple semantic content and does not meet lightweight requirements. ENet [17], proposed by Paszke et al., decreases the model's memory requirements with a compact decoder, but at a considerable cost in segmentation accuracy. In this paper, we propose a novel real-time network called the lightweight multi-scale asymmetric network (LMANet), composed of an encoder and a decoder. In the encoder, two kinds of feature-bottleneck modules are designed to extract information from feature maps at different levels. In addition, a residual efficient attention (REA) module is introduced to refine the features, and its output is fused with the output of the feature-extraction module through the feature-fusion module. In the decoder, we design a new multi-scale feature decoder that merges information from different scales to recover spatial details with only 0.01 M parameters. In summary, the primary contributions of this paper are as follows.
(1)
A low-level feature-bottleneck (LFB) module and a high-level feature-bottleneck (HFB) module are proposed to extract features at different levels, which can capture richer semantic information through deep convolution with different dilation rates;
(2)
We introduce the residual efficient attention (REA) module, which focuses on feature channels and spatial information with reduced parameters and computational cost. Meanwhile, the REA module enhances the feature information of the feature-fusion module (FFM);
(3)
A multi-scale feature decoder (MFD) with the spatial attention module (SAM) is proposed to process different features and recover spatial information more efficiently;
(4)
The proposed LMANet can effectively extract and recover image features based on the asymmetric encoder–decoder structure. Compared with existing real-time semantic segmentation algorithms, LMANet strikes a more favorable balance between accuracy, model size, and inference speed.
The organization of this paper is as follows. Section 2 gives some related knowledge of real-time semantic segmentation. Section 3 describes the key components and architecture of LMANet. Section 4 presents experiments to illustrate the performance of our network. Finally, Section 5 offers the conclusions of the paper.

2. Related Work

2.1. Encoder–Decoder Semantic Segmentation

Many networks have adopted an encoder–decoder structure to extract and recover image information for semantic segmentation. These networks can be divided into two categories, namely (1) symmetric and (2) asymmetric. SegNet [18] constructs a symmetric encoder–decoder structure that performs non-linear upsampling directly with the pooling indices stored during encoding. ERFNet [19] introduced the Non-bottleneck-1D block to build a symmetric encoder–decoder structure. Although such networks recover spatial information well, the multiple identical feature-extraction and upsampling steps prevent a good balance between computational burden and accuracy.
The asymmetric encoder–decoder structure has been extensively used in semantic segmentation networks recently because of reduced parameters and faster inference speed. DABNet [20] used a deep asymmetric bottleneck layer to extract the feature information and recover the spatial information with fewer parameters through an efficient non-local module. DFANet [21] adopted a backbone network based on lightweight extended space and used the cross-level feature aggregation module as the decoder to enhance the performance. LAANet [22] introduced the EAB module and AFFU module to form an asymmetric encoder–decoder structure for image segmentation. MSCFNet [23] introduced an EAR module in the encoder and a deconvolution to maintain the segmentation effect. WFDCNet [24] applied an LAPF module as a lightweight encoder to realize the combination of multiple information, resulting in 0.5 M parameters. EFRNet [25] designed a feature-fusion module in a single branch to effectively fuse and refine the feature information with 0.46 M parameters. LEANet [26] adopted the CA-PP module as a decoder to collect context information, resulting in 0.74 M parameters.

2.2. Optimized Convolution

In recent real-time semantic segmentation tasks, many algorithms optimize the standard convolution to improve the model. Optimized convolution mainly falls into two families: dilated convolution and factorized convolution. Standard convolution typically has a small receptive field, making it difficult to capture rich contextual information. Dilated convolution expands the receptive field without adding parameters by inserting gaps (zeros) between the elements of the convolutional kernel. It was first introduced for semantic segmentation in Deeplabv2 [27] and has since been widely used in segmentation models such as CFPNet [28] and Deeplabv3+ [6].
Factorized convolution takes many forms, such as asymmetric depth-wise separable convolution [29,30], asymmetric depth-wise separable dilated convolution [31], and asymmetric convolution [32]. Asymmetric depth-wise separable convolution factorizes the standard convolution to separate spatial and channel correlations, combining asymmetric depth-wise convolutions with a point-wise convolution to extract feature maps with fewer parameters and lower computation cost. Asymmetric depth-wise separable dilated convolution further enlarges the receptive field and obtains richer information while keeping the number of parameters and the computation cost low. Therefore, optimized convolution can effectively satisfy the requirements of real-time semantic segmentation, as in MSCFNet and EADNet [33].
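As an illustration of these factorized convolutions, the following PyTorch sketch (module and parameter names are ours, not taken from any of the cited networks) builds a 3 × 3 depth-wise dilated convolution factorized into 3 × 1 and 1 × 3 depth-wise convolutions followed by a 1 × 1 point-wise convolution:

```python
import torch
import torch.nn as nn

class AsymDWSepDilatedConv(nn.Module):
    """Illustrative factorized convolution: a 3x3 depth-wise (dilated) convolution
    split into 3x1 and 1x3 depth-wise convolutions, followed by a 1x1 point-wise
    convolution that mixes the channels."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        pad = dilation  # keeps the spatial size for a kernel of 3
        self.dw_3x1 = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                padding=(pad, 0), dilation=(dilation, 1),
                                groups=channels, bias=False)
        self.dw_1x3 = nn.Conv2d(channels, channels, kernel_size=(1, 3),
                                padding=(0, pad), dilation=(1, dilation),
                                groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pw(self.dw_1x3(self.dw_3x1(x)))

x = torch.randn(1, 64, 64, 128)
y = AsymDWSepDilatedConv(64, dilation=2)(x)
print(y.shape)  # torch.Size([1, 64, 64, 128])
```

For a 3 × 3 kernel with C input and output channels, this replaces roughly 9C² weights with about 6C + C² weights, which is where the parameter and computation savings come from.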

2.3. Attention Mechanism

Attention mechanisms have been widely adopted in computer vision [34]; they enable a network to focus on different regions by assigning varying weights to each pixel. They are broadly categorized into spatial attention and channel attention according to how the attention weights are computed. SENet [35] models the relationships between channels to compute channel weights. Building on SENet, ECANet [36] replaces the fully connected layer with a 1D convolution that uses fewer parameters. CBAM [37] applies channel and spatial attention sequentially to weight the pixels of the feature map.
The attention mechanism is commonly utilized to refine feature maps and enhance the integration of features from different levels. LAANet [22] proposed an improved CBAM-based module and designed the AFFU module on this basis to collect multi-scale contextual information from various layers. DANet [38] used attention mechanisms to enhance low-level features before combining them with high-level features in the feature-fusion module. In general, attention-based fusion tends to yield better results than simple direct fusion.

3. Method

To better satisfy the requirements of real-time semantic segmentation, we propose a new model called LMANet, which consists of two parts: an encoder composed of feature-extraction modules and a lightweight decoder. To enhance accuracy, different attention mechanisms are introduced into the model.

3.1. Overall Architecture of LMANet

The architecture of LMANet is shown in Figure 1, and the detailed architectural configuration is listed in Table 1.
At the beginning of LMANet, an initial unit is employed to adjust the resolution of the input image and filter out redundant information. First, a 3 × 3 convolution with stride 2 processes the original image. Then, two 3 × 3 standard convolutions are used to gather rich context information. In addition, a downsampling module, which consists of a 3 × 3 standard convolution with stride 2 and a 2 × 2 maximum pooling layer, is used to increase the receptive field. After that, the feature-bottleneck modules are employed to obtain more information, and the features are combined through the feature-fusion modules. At the decoder stage, the multi-scale feature decoder combines feature maps of different scales, and a spatial attention module is introduced to enhance the recovery of spatial information, which significantly improves segmentation accuracy.
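A minimal sketch of this stem and downsampling module follows. The exact arrangement inside the downsampling module is not fully specified in the text; since Table 1 shows a single ×2 resolution reduction per downsampling step, the sketch assumes the common arrangement in which the stride-2 convolution and the 2 × 2 max pooling run in parallel and their outputs are concatenated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_prelu(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.PReLU(out_ch))

class InitialUnit(nn.Module):
    """Stem: one stride-2 3x3 convolution followed by two stride-1 3x3 convolutions."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.stem = nn.Sequential(conv_bn_prelu(in_ch, out_ch, stride=2),
                                  conv_bn_prelu(out_ch, out_ch),
                                  conv_bn_prelu(out_ch, out_ch))

    def forward(self, x):
        return self.stem(x)

class Downsample(nn.Module):
    """Downsampling module: a stride-2 3x3 convolution and a 2x2 max pooling whose
    outputs are concatenated (assumed arrangement; Table 1 shows one x2 reduction)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch - in_ch, 3, stride=2, padding=1, bias=False)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.act = nn.Sequential(nn.BatchNorm2d(out_ch), nn.PReLU(out_ch))

    def forward(self, x):
        return self.act(torch.cat([self.conv(x), self.pool(x)], dim=1))

x = torch.randn(1, 3, 512, 1024)
feat = InitialUnit()(x)                                                # 1 x 32 x 256 x 512
fused = torch.cat([feat, F.interpolate(x, scale_factor=0.5)], dim=1)   # first FFM: 32 + 3 channels
print(Downsample(35, 64)(fused).shape)                                 # torch.Size([1, 64, 128, 256])
```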

3.2. Feature-Bottleneck Module

Many networks reduce the number of parameters and the computational cost by applying optimized convolution within lightweight residual layers. LEDNet [39] introduced the split-shuffle non-bottleneck, which employs convolution factorization to extract local information in two branches. However, keeping the receptive field the same throughout the model limits its effectiveness on feature maps at different levels; effective information can instead be obtained by using receptive fields of different scales at the corresponding stages. Inspired by this residual structure, we design new bottlenecks, the LFB module and the HFB module, in LMANet, as shown in Figure 2.
The structure of the LFB module is shown in Figure 2a. It is mainly used to extract context information from low-level feature maps. Concretely, a 3 × 3 standard convolution is first employed to generate deeper features while halving the number of input channels at a low cost. Next, the output is split into two branches to obtain feature information at different scales. The left branch uses a 3 × 3 depth-wise separable convolution to extract local information while retaining spatial information. The right branch uses asymmetric depth-wise separable dilated convolution with a kernel size of 3 and a dilation rate of 2 to extract context information at a reduced computational cost. The features of the two branches are then fused, and the channels are restored by a 1 × 1 convolution. In addition, a skip connection performs identity mapping, and the initial input is added to achieve residual learning. Finally, a channel shuffle facilitates information interaction between channels and improves segmentation accuracy. Batch normalization and PReLU are applied before every convolution operation to improve the convergence of the model.
Since deep feature maps contain more abundant information, and receptive fields of different sizes should be used at the corresponding levels, we design the HFB module to extract features from the 1/4 feature maps. The structure of the HFB module is shown in Figure 2b. It is mainly used to extract the information in high-level feature maps. Compared with the LFB module, the difference is that the HFB module uses a three-branch structure, where the additional branch applies asymmetric depth-wise separable dilated convolution with a kernel size of 3 and a larger dilation rate to extract feature information.
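A minimal sketch of the LFB module under this description is shown below. Fusing the two branches by addition before the channel-restoring 1 × 1 convolution is our assumption; the HFB module would add a third, more heavily dilated branch of the same form.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class LFB(nn.Module):
    """Sketch of the low-level feature bottleneck: a 3x3 conv halves the channels,
    two parallel factorized branches extract local and dilated context, a 1x1 conv
    restores the channels, and a residual connection plus channel shuffle follow."""
    def __init__(self, ch, dilation=2):
        super().__init__()
        half = ch // 2
        self.reduce = nn.Sequential(nn.BatchNorm2d(ch), nn.PReLU(ch),
                                    nn.Conv2d(ch, half, 3, padding=1, bias=False))
        # left branch: 3x3 depth-wise separable convolution
        self.local = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.Conv2d(half, half, 1, bias=False))
        # right branch: asymmetric depth-wise separable dilated convolution (3x1 + 1x3)
        d = dilation
        self.context = nn.Sequential(
            nn.Conv2d(half, half, (3, 1), padding=(d, 0), dilation=(d, 1), groups=half, bias=False),
            nn.Conv2d(half, half, (1, 3), padding=(0, d), dilation=(1, d), groups=half, bias=False),
            nn.Conv2d(half, half, 1, bias=False))
        self.restore = nn.Conv2d(half, ch, 1, bias=False)

    def forward(self, x):
        y = self.reduce(x)
        y = self.restore(self.local(y) + self.context(y))  # fuse the two branches
        return channel_shuffle(y + x)                       # residual add + channel shuffle

x = torch.randn(1, 64, 128, 256)
print(LFB(64)(x).shape)  # torch.Size([1, 64, 128, 256])
```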

3.3. Feature-Fusion Module

The skip connection is employed in many networks, such as ESPNet and AdaNet [40], to fuse features at different levels and improve segmentation. Motivated by this strategy, we design a feature-fusion module (FFM). The first FFM directly fuses the downsampled feature map with the output of the initial block, avoiding excessive loss of effective information. By contrast, the other two FFMs not only concatenate the output features of the feature-bottleneck block with the original image, downsampled to the same size to compensate for information loss, but also concatenate the output of the REA module, forming a skip connection that better attends to the features in different channels. In addition, a point-wise convolution in the FFM further integrates the information with few extra parameters. The different FFM structures are shown in Figure 3.
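A sketch of this fusion step is given below. The channel bookkeeping follows Table 1 (e.g., 64 + 64 + 3 = 128 + 3 after the second FFM); keeping the concatenated width through the point-wise convolution is our assumption rather than something stated explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFM(nn.Module):
    """Sketch of a feature-fusion module: concatenate the bottleneck-block output,
    the attention (REA) output on the skip path, and the original image resized to
    the same resolution, then integrate them with a point-wise convolution."""
    def __init__(self, feat_ch, att_ch, img_ch=3):
        super().__init__()
        out_ch = feat_ch + att_ch + img_ch
        self.pw = nn.Sequential(nn.Conv2d(out_ch, out_ch, 1, bias=False),
                                nn.BatchNorm2d(out_ch), nn.PReLU(out_ch))

    def forward(self, feat, att, image):
        image = F.interpolate(image, size=feat.shape[2:], mode='bilinear', align_corners=False)
        return self.pw(torch.cat([feat, att, image], dim=1))

feat = torch.randn(1, 64, 128, 256)   # LFB block output
att = torch.randn(1, 64, 128, 256)    # REA output on the skip connection
img = torch.randn(1, 3, 512, 1024)    # original image
print(FFM(64, 64)(feat, att, img).shape)  # torch.Size([1, 131, 128, 256]), i.e. 128 + 3
```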
For the residual efficient attention (REA) module, since the channels contain rich feature information, we introduce the REA [36] module into the skip connection between the input and output of the HFB block. This also reduces the influence of interference noise in the channels and thus improves feature extraction, as shown in Figure 3. To capture the relationships between channels efficiently, the REA module uses a 1 × 1 convolution that avoids dimensionality reduction while capturing cross-channel interaction. The REA module obtains global information through the residual structure and global average pooling at a negligible computational cost. The process can be expressed as (1):
$$M_c(X) = f_m\left(X,\ \sigma\left(C_{k\times k}\left(f_T\left(f_{AP}(X)\right)\right)\right)\right),\qquad(1)$$
where $M_c$ is the output feature map, $X$ is the input feature map, $f_m(\cdot,\cdot)$ represents pointwise multiplication, $f_T$ represents the compression and re-weighting operations, $f_{AP}$ denotes the average pooling operation, $C_{k\times k}$ denotes the standard convolution with a kernel size of $k \times k$, and $\sigma$ is the sigmoid function.
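A sketch of Eq. (1) in the ECA style is given below; treating the reshape of the pooled channel descriptor as $f_T$ and placing the residual addition inside the module are our assumptions.

```python
import torch
import torch.nn as nn

class REA(nn.Module):
    """Sketch of an ECA-style residual efficient attention block (Eq. 1): channel
    weights come from global average pooling, a small convolution over the channel
    descriptor, and a sigmoid; the input is re-weighted and added back residually."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # f_AP: global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k,      # C_kxk acting on the channel vector
                              padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()                     # sigma

    def forward(self, x):
        w = self.pool(x)                                # N x C x 1 x 1
        w = w.squeeze(-1).transpose(1, 2)               # N x 1 x C (f_T: reshape for the conv)
        w = self.sigmoid(self.conv(w))                  # cross-channel interaction, no reduction
        w = w.transpose(1, 2).unsqueeze(-1)             # back to N x C x 1 x 1
        return x * w + x                                # f_m re-weighting plus residual path

x = torch.randn(1, 128, 64, 128)
print(REA(128)(x).shape)  # torch.Size([1, 128, 64, 128])
```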

3.4. Multi-Scale Feature Decoder

In the encoder–decoder semantic segmentation architecture, the decoder upsamples the feature maps back to the original resolution, so a good decoder can effectively improve segmentation with a small number of parameters. We present a new lightweight decoder, named MFD, to improve the recovery of feature information with fewer parameters. The MFD takes the 1/4 feature map as the main attention map, restores contextual semantic details under the guidance of the 1/8 feature map, and finally restores spatial information more effectively with the 1/2 feature map. The structure of the MFD is shown in Figure 4.
Specifically, the 1/4 output feature map of the FFM is converted into C1 channels by a 1 × 1 convolution. The result, $F_{a2}$, is then passed through the spatial attention module (SAM) to increase attention to informative features. The calculation of SAM can be expressed as (2):
$$F_{a3} = \sigma\left(C_{k\times k}\left(f_c\left(f_{AP}(F_{a2}),\ f_{MP}(F_{a2})\right)\right)\right),\qquad(2)$$
where $F_{a3}$ is the output map, $f_c(\cdot,\cdot)$ represents concatenation, and $f_{AP}$ and $f_{MP}$ refer to average pooling and maximum pooling, respectively. Meanwhile, $F_{b1}$, the 1/8 output map of the FFM, is transformed into $F_{b2}$ with C2 channels by a 1 × 1 convolution, and upsampling is used to double the size of $F_{b2}$. The operation can be expressed as (3):
$$F_{b3} = f_{up}\left(C_{1\times 1}(F_{b1})\right),\qquad(3)$$
where the upsampling operation is executed by bilinear interpolation.
Second, the two maps are concatenated and processed by a depth-wise separable 3 × 3 convolution, as in (4):
$$F_{Dw} = C_{Dw3\times 3}\left(f_c\left(F_{a3},\ F_{b3}\right)\right),\qquad(4)$$
where $C_{Dw3\times 3}$ represents a depth-wise separable convolution with a kernel size of 3 × 3.
In addition, $F_{c1}$, the 1/2 output feature map of the FFM, is converted into $F_{c2}$ with C3 channels by a 1 × 1 convolution. Then, $F_{Dw}$ is upsampled to the same size as $F_{c2}$ and fused with it to recover richer spatial information. Finally, another upsampling restores the original resolution. The output can be computed by (5):
$$F_{out} = f_{up}\left(f_a\left(f_{up}(F_{Dw}),\ C_{1\times 1}(F_{c1})\right)\right),\qquad(5)$$
where $f_a(\cdot,\cdot)$ represents the addition operation.
We collect feature information at different scales, using the 1/2 and 1/8 feature maps as guidance to capture more spatial and semantic information. In addition, we apply an attention mechanism to the 1/4 feature map branch to focus on learning more effective context information. Therefore, the multi-scale feature decoder can recover the details of the feature map with a very small number of parameters.
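The following sketch assembles Eqs. (2)-(5). Two points are assumptions rather than statements from the paper: the SAM output re-weights its input (Eq. (2) writes only the attention map), and C1 = C2 = C3 equals the number of classes, which is one channel choice consistent with the reported 0.01 M decoder size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def up2(x):
    # bilinear upsampling by a factor of 2 (f_up)
    return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

class SAM(nn.Module):
    """Spatial attention (Eq. 2): channel-wise average and max maps are concatenated,
    convolved and passed through a sigmoid; the map re-weights the input features."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)       # average pooling over channels
        mx, _ = torch.max(x, dim=1, keepdim=True)      # max pooling over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn

class MFD(nn.Module):
    """Sketch of the multi-scale feature decoder (Eqs. 2-5); the class-count channel
    widths are an assumed, not stated, design choice."""
    def __init__(self, ch_quarter, ch_eighth, ch_half, num_classes):
        super().__init__()
        c = num_classes
        self.proj_a = nn.Conv2d(ch_quarter, c, 1, bias=False)   # 1/4 branch
        self.proj_b = nn.Conv2d(ch_eighth, c, 1, bias=False)    # 1/8 branch
        self.proj_c = nn.Conv2d(ch_half, c, 1, bias=False)      # 1/2 branch
        self.sam = SAM()
        self.dw = nn.Sequential(                                 # depth-wise separable 3x3 (Eq. 4)
            nn.Conv2d(2 * c, 2 * c, 3, padding=1, groups=2 * c, bias=False),
            nn.Conv2d(2 * c, c, 1, bias=False))

    def forward(self, f_quarter, f_eighth, f_half):
        fa = self.sam(self.proj_a(f_quarter))                    # F_a3
        fb = up2(self.proj_b(f_eighth))                          # F_b3 (Eq. 3)
        fdw = self.dw(torch.cat([fa, fb], dim=1))                # F_Dw (Eq. 4)
        out = up2(fdw) + self.proj_c(f_half)                     # fuse with the 1/2 map
        return up2(out)                                          # F_out (Eq. 5)

f4 = torch.randn(1, 131, 128, 256)   # 1/4-resolution FFM output
f8 = torch.randn(1, 259, 64, 128)    # 1/8-resolution FFM output
f2 = torch.randn(1, 35, 256, 512)    # 1/2-resolution FFM output
print(MFD(131, 259, 35, num_classes=19)(f4, f8, f2).shape)  # 1 x 19 x 512 x 1024
```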

4. Experiments

In this section, we first provide information about the Cityscapes [41] and CamVid [42] datasets, and then we perform ablation experiments on the Cityscapes dataset to demonstrate the effectiveness of the network components. Finally, we compare the proposed algorithm with existing algorithms to demonstrate the advantages of LMANet.
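All results below are reported as mean intersection over union (mIoU). As a reference, a minimal implementation of the standard confusion-matrix computation (not the authors' evaluation code) is:

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """mIoU from a confusion matrix; pred and target are integer label maps."""
    mask = target != ignore_index
    hist = np.bincount(num_classes * target[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - inter
    with np.errstate(divide='ignore', invalid='ignore'):
        iou = inter / union          # NaN for classes absent from both maps
    return float(np.nanmean(iou))

pred = np.random.randint(0, 19, size=(512, 1024))
gt = np.random.randint(0, 19, size=(512, 1024))
print(mean_iou(pred, gt, num_classes=19))
```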

4.1. Datasets

The Cityscapes dataset is one of the primary datasets for real-time semantic segmentation, including 5000 finely annotated images and 20,000 roughly annotated images from different urban streetscapes. More specifically, the finely annotated images consist of 2975 images for training, 500 images for validating, and 1525 images for testing. The original image resolution of the Cityscapes dataset is 2048 × 1024.
The CamVid dataset is a street-scene dataset derived from video sequences. As an auxiliary dataset for this paper, the CamVid dataset consists of 367 images for training, 101 images for validation, and 233 images for testing. The original image resolution of the CamVid dataset is 720 × 960.
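As a hedged illustration (not the authors' pipeline), the fine-annotation Cityscapes split can be loaded with torchvision as sketched below; the root path is a placeholder, and CamVid has no built-in torchvision loader.

```python
import numpy as np
import torch
import torchvision.transforms.functional as TF
from torchvision import datasets

# Assumes the Cityscapes archives (leftImg8bit + gtFine) were downloaded manually
# into ./cityscapes; torchvision only reads them from disk.
train_set = datasets.Cityscapes(root='./cityscapes', split='train', mode='fine',
                                target_type='semantic')

image, target = train_set[0]                   # PIL image and PIL label map
x = TF.to_tensor(image)                        # 3 x 1024 x 2048 float tensor
y = torch.from_numpy(np.array(target)).long()  # raw label IDs; mapping the raw classes
                                               # to the 19 training classes is still required
print(x.shape, y.shape)
```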

4.2. Implementation Details

All experiments are performed on one RTX 2080Ti GPU with CUDA 10.0, 40 GB of memory, and the Ubuntu 18.04 operating system, using the PyTorch 1.1.0 platform. Mini-batch SGD is used with a batch size of 8, a momentum of 0.9, and a weight decay of 1 × 10−4. In addition, we employ a polynomial learning rate decay policy, expressed as (6):
$$lr = lr_{i} \times \left(1 - \frac{epoch}{max\_epoch}\right)^{0.9},\qquad(6)$$
where $lr$ is the learning rate in the current epoch, $lr_i$ is the initial learning rate, $epoch$ is the current iteration, and $max\_epoch$ is the maximum number of iterations. We also randomly crop the training images of the Cityscapes dataset to a resolution of 512 × 1024 during training.
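A minimal sketch of this schedule and optimizer setup is given below; the initial learning rate of 4.5 × 10⁻² is an assumed placeholder (not stated in the paper), and `model` stands in for LMANet.

```python
import torch

def poly_lr(lr_init, epoch, max_epoch, power=0.9):
    """Polynomial decay from Eq. (6): lr = lr_init * (1 - epoch / max_epoch) ** power."""
    return lr_init * (1 - epoch / max_epoch) ** power

# Mini-batch SGD as described in Section 4.2 (batch size 8, momentum 0.9, weight decay 1e-4).
model = torch.nn.Conv2d(3, 19, 1)          # placeholder module for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=4.5e-2,
                            momentum=0.9, weight_decay=1e-4)

max_epoch = 1000
for epoch in range(max_epoch):
    for group in optimizer.param_groups:
        group['lr'] = poly_lr(4.5e-2, epoch, max_epoch)
    # ... training loop over 512x1024 random crops goes here ...
    break  # sketch only
```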

4.3. Ablation Studies

In this section, we design ablation experiments to verify the effectiveness of each component of LMANet and of their combination. We conduct ablation experiments on the structure of the feature-bottleneck module, the number of feature-bottleneck modules, and the MFD module. Finally, we provide ablation results on the contribution of each component to the overall performance of LMANet. All experiments are conducted on the Cityscapes dataset.
(1) For the ablation study of the feature-bottleneck module, the HFB module is used to extract the feature information of high-level maps in LMANet, and we design two kinds of experiments to demonstrate its effectiveness. First, we reduce the number of branches in the HFB module to two and set the dilated rates to different sequences for comparison. Second, we set different dilated rate sequences on the basis of the three branches to investigate the influence of the dilated rates on the experimental results. The ablation results can be seen in Table 2. Note that, since one branch in the HFB module is fixed, we perform the ablation on the other two branches.
In Table 2, when the number of branches in the HFB module is two, the mIoU is greatly reduced. In addition, we set the parameters in the dilated rate sequence to the same or different values to verify the influence of different dilated rates in S1 and S2, respectively. It can be seen that, when the dilated rate sequence is set to different values in S2, the mIoU is 1.7% higher than the sequence {2,2,2,2,2,2} in S1, while the number of parameters and inference speed are almost unchanged. We can conclude that setting different dilation rates to extract multi-scale features can substantially boost segmentation accuracy.
To investigate the effect of different dilated rates on the results, we designed sequences with varying rates. From Table 2, we can observe that both the size and the order of the dilated rates affect segmentation. Compared with S6 (ours), S3 and S4 use different dilated rate sequences, and their mIoU drops by 1.0% and 0.8%, respectively. When we adjust the order of the dilation rates in S5, the mIoU decreases by 0.4% compared with S6, while the model size and inference speed change only slightly as the sequence is adjusted. Therefore, the HFB module designed in this paper achieves a good effect.
In addition, we conduct ablation experiments to evaluate the effectiveness of the feature-bottleneck modules themselves. We replace LFB and HFB with different bottleneck modules, including the depth-wise asymmetric bottleneck [20], SS-Nbt [39], Non-bt-1D [19], and the efficient asymmetric bottleneck [22], to form different segmentation networks. The comparison results are shown in Table 3. The EAB module has the fewest parameters, but our network achieves higher accuracy and a better balance between accuracy, parameters, and inference speed.
(2) For the ablation study of feature-extraction module depth, we performed ablation experiments by adjusting the numbers of LFB and HFB modules. The results on the Cityscapes dataset are listed in Table 4, where m and n denote the numbers of LFB and HFB modules in LMANet, respectively. As shown in Table 4, adding modules improves precision at the expense of longer inference time and more parameters, and the HFB module has the more pronounced influence on both. However, simply increasing the number of feature-extraction modules brings only a slight improvement in accuracy, and beyond a certain depth, accuracy and efficiency even decrease. To make a proper trade-off between parameters and accuracy, we chose 3 and 6 as the values for m and n, respectively.
(3) For the ablation study of the multi-scale feature decoder, MFD is used to restore the feature maps and produce the segmentation result, and we designed two ablation experiments covering its overall performance and its internal components. First, we compare MFD with the decoders of DABNet [20] and ERFNet [19]. Then, we performed ablation experiments on the attention mechanism in MFD and on the connection methods between feature maps of different scales. The results are shown in Table 5 and Table 6. From Table 5, the mIoU of the model is reduced by 1.3% when MFD is replaced by the DABNet decoder at almost the same computational cost, while the model with ERFD loses 0.5% accuracy compared to MFD and uses 0.74 M more parameters. The visual results of the decoder ablation are shown in Figure 5, where the differences are marked with white dotted boxes. As can be seen from Figure 5, the decoder in LMANet recovers edge details better.
In Table 6, we conducted ablation experiments replacing SAM with a CBAM [37] module and an SE [35] module to verify the effectiveness of SAM. When SAM is used in MFD to refine the 1/4 feature map, the model achieves the highest mIoU, while the number of parameters decreases slightly and the inference speed increases slightly. In addition, the feature-fusion method is an important topic in multi-scale information aggregation, so Table 6 also compares the fusion modes between high-level and low-level maps. We combined three operations (concatenation, multiplication, and addition) and report the better-performing combinations. The first entry in parentheses indicates the fusion method between the 1/4 and 1/8 feature maps, and the second entry indicates the fusion mode between the 1/2 feature map and the preceding output. The model achieves the highest mIoU when using concatenation and addition, respectively, with a negligible change in computational cost, which demonstrates the benefit of the multi-scale feature fusion in MFD.
(4) For the ablation study of each component, in order to verify the validity of each module and their combination, we conducted ablation experiments on each component; the results are shown in Table 7. In the experiment, we took the asymmetric depth-wise bottleneck network as the baseline for comparison. As can be seen from Table 7, when the low-level feature-extraction module is replaced by the LFB module, the parameter count and running speed of the model are almost unchanged compared with the baseline, and the mIoU increases to 71.2%. When the spatial attention mechanism is introduced into the multi-scale feature decoder (MFD) designed in this paper, the number of model parameters increases by only 0.01 M, the inference speed is almost unaffected, and the mIoU value is further improved.

4.4. Comparison with Other Approaches

In this section, we compare the performance of LMANet with some state-of-the-art models on the Cityscapes and CamVid datasets. The results are shown in Table 8 and Table 9.
For the results on Cityscapes, as shown in Table 8, we performed quantitative analyses on the Cityscapes dataset. Our LMANet achieves 73.9% mIoU at 111 FPS with only 0.85 M parameters. From the table, LMANet is able to better balance accuracy and computational cost. Specifically, compared with ENet [17] and ESPNet [8], which have the smallest number of parameters, the mIoU of LMANet is much higher, while the number of parameters increases by only 0.49 M. In comparison to LRNNet [47], the mIoU of LMANet increases by 1.7% with only 0.17 M more parameters, and its inference speed of 111 FPS is faster than that of LRNNet. Compared with some real-time models with very high inference speeds, such as Fast-SCNN [46] and DABNet [20], LMANet is similar or somewhat slower, but its accuracy is higher by 5.9% and 3.8%, respectively. ContextNet [44] and LMANet have the same number of parameters, but our mIoU is 7.8% higher, and the inference speed of LMANet on the 2080Ti is 111 FPS, faster than ContextNet. In terms of mIoU, several segmentation methods reach similar or even higher accuracy than LMANet, such as FPANet [50] and Hyperseg-M [51]. Indeed, Hyperseg-M achieves a higher mIoU than LMANet, but its 10.1 M parameters are approximately 12 times those of LMANet, and its inference speed is only 36 FPS, whereas LMANet runs at 111 FPS. Therefore, compared with other advanced methods, our LMANet achieves the best balance. Moreover, we present the visualization results of LMANet and several other networks in Figure 6, where the white dotted boxes highlight the differences in the segmentation results.
For the results on CamVid, to verify the generality of LMANet, we also tested its performance on the CamVid dataset and compared it with other methods quantitatively in Table 9. As can be seen in Table 9, LMANet also achieves outstanding performance on the CamVid dataset. Specifically, LMANet achieves 71.1% mIoU at 118 FPS for a 360 × 480 input image. Compared with MSCFNet [23], LMANet achieves a higher mIoU with fewer parameters and a faster inference speed. In addition, the inference speed of EDANet [45] is similar to that of LMANet at an input size of 360 × 480, while its mIoU is 4.7% lower. These experiments demonstrate that LMANet performs well in real-time semantic segmentation and strikes an effective balance between accuracy and efficiency.

5. Conclusions

In this paper, we propose a lightweight multi-scale asymmetric semantic segmentation network (LMANet) based on an encoder–decoder structure that comprises three types of components: feature-bottleneck modules, FFMs, and the MFD. The feature-bottleneck modules include the LFB module and the HFB module, which extract context features effectively from feature maps at different levels. The FFMs refine the feature maps and fuse different features to produce multi-scale local features. The MFD restores the spatial information to the original resolution with an attention mechanism. Moreover, we conducted a series of experiments to validate the model and its components. Specifically, LMANet achieves 73.9% mIoU and 71.1% mIoU with 0.85 M parameters on the Cityscapes and CamVid datasets, with inference speeds of 111 FPS and 118 FPS on an RTX 2080Ti. The experimental results show that LMANet strikes a good balance between accuracy, number of parameters, and inference speed in real-time semantic segmentation tasks. The segmentation accuracy of LMANet still has room for improvement, and further optimization of the feature-extraction module and attention mechanism will be considered in future work.

Author Contributions

Conceptualization, Z.X.; methodology, H.C., Z.X. and B.G.; software, Z.X.; validation, H.C., Z.X. and X.L.; formal analysis, H.C. and B.G.; investigation, B.G.; resources, H.C.; data curation, H.C.; writing—original draft preparation, H.C., Z.X. and X.L.; writing—review and editing, H.C., Z.X., B.G. and X.L.; visualization, H.C.; supervision, H.C.; project administration, H.C.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Project (grant number 2020YFB1314103) and the Key Teaching Research Project of Anhui province (grant number 2020jyxm0458).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Author Xuedi Li was employed by the company China Telecom Co., Ltd., Anhui Branch. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360. [Google Scholar] [CrossRef]
  2. Ding, S.; Wang, H.; Lu, H.; Nappi, M.; Wan, S. Two path gland segmentation algorithm of colon pathological image based on local semantic guidance. IEEE J. Biomed. Health Inform. 2023, 27, 1701–1708. [Google Scholar] [CrossRef]
  3. Dai, X.; Xia, M.; Weng, L.; Hu, K.; Lin, H.; Qian, M. Multiscale location attention network for building and water segmentation of remote sensing image. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–19. [Google Scholar] [CrossRef]
  4. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  5. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  6. Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  7. Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. ICNet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  8. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In Proceedings of the European Conference Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  9. Vo, T. A novel semantic-enhanced text graph representation learning approach through transformer paradigm. Cybern. Syst. 2023, 54, 499–525. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Ji, Z.; Wang, D.; Pang, Y.; Li, X. USER: Unified semantic enhancement with momentum contrast for image-text retrieval. IEEE Trans. Image Process. 2024, 33, 596–609. [Google Scholar] [CrossRef]
  11. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  12. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiseNetV2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
  13. Liu, J.; Zhang, F.; Zhou, Z.; Wang, J. BFMNet: Bilateral feature fusion network with multi-scale context aggregation for real-time semantic segmentation. Neurocomputing 2023, 521, 27–40. [Google Scholar] [CrossRef]
  14. Pan, H.; Hong, Y.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes. IEEE Trans. Intell. Transp. Syst. 2023, 24, 3448–3460. [Google Scholar] [CrossRef]
  15. Li, X.; Li, X.; Zhang, L.; Cheng, G.; Shi, J.; Lin, Z.; Tan, S.; Tong, Y. Improving semantic segmentation via decoupled body and edge supervision. In Proceedings of the European Conference Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  16. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015. [Google Scholar]
  17. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147. Available online: https://arxiv.org/abs/1606.02147 (accessed on 7 June 2016).
  18. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder–decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  19. Romera, E.; Álvarez, J.; Bergasa, L.; Arroyo, R. ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2018, 19, 263–272. [Google Scholar] [CrossRef]
  20. Li, G.; Yun, I.; Kim, J.; Kim, J. DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. arXiv 2019, arXiv:1907.11357. Available online: https://arxiv.org/abs/1907.11357 (accessed on 1 October 2019).
  21. Li, H.; Xiong, P.; Fan, H.; Sun, J. DFANet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  22. Zhang, X.; Du, B.; Wu, Z.; Wan, T. LAANet: Lightweight attentionguided asymmetric network for real-time semantic segmentation. Neur. Comp. Appl. 2022, 34, 3573–3587. [Google Scholar] [CrossRef]
  23. Gao, G.; Xu, G.; Yu, Y.; Xie, J.; Yang, J.; Yue, D. MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25489–25499. [Google Scholar] [CrossRef]
  24. Hao, X.; Hao, X.; Zhang, Y.; Li, Y.; Wu, C. Real-time semantic segmentation with weighted factorized-depthwise convolution. Image Vis. Comput. 2021, 114, 104269. [Google Scholar] [CrossRef]
  25. Zhang, K.; Liao, Q.; Zhang, J.; Liu, S.; Ma, H.; Xue, J. EFRNet: A lightweight network with efficient feature fusion and refinement for real-time semantic segmentation. In Proceedings of the IEEE International Conference on Multimedia and Expo, Shenzhen, China, 5–9 July 2021. [Google Scholar]
  26. Zhang, X.; Du, B.; Luo, Z.; Ma, K. Lightweight and efficient asymmetric network design for real-time semantic segmentation. Appl. Intell. 2021, 52, 564–579. [Google Scholar] [CrossRef]
  27. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  28. Lou, A.; Loew, M. CFPNet: Channel-wise feature pyramid for real-time semantic segmentation. In Proceedings of the IEEE International Conference on Image Processing, Anchorage, AK, USA, 19–22 September 2021. [Google Scholar]
  29. Yu, C.; Wang, J.; Gao, C.; Yu, G.; Shen, C.; Sang, N. Context prior for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  30. Gao, G.; Xu, G.; Li, J.; Yu, Y.; Lu, H.; Yang, J. FBSNet: A fast bilateral symmetrical network for real-time semantic segmentation. IEEE Trans. Multimedia 2022, 25, 3273–3283. [Google Scholar] [CrossRef]
  31. Shi, M.; Shen, J.; Yi, Q.; Weng, J.; Huang, Z.; Luo, A.; Zhou, Y. LMFFNet: A well-balanced lightweight network for fast and accurate semantic segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 3205–3219. [Google Scholar] [CrossRef]
  32. Wang, Y.; Zhou, Q.; Xiong, J.; Wu, X.; Jin, X. ESNet: An efficient symmetric network for real-time semantic segmentation. In Proceedings of the Pattern Recognition and Computer Vision, Xi’an, China, 8–11 November 2019. [Google Scholar]
  33. Yang, Q.; Chen, T.; Fan, J.; Lu, Y.; Zuo, C.; Chi, Q. EADnet: Efficient asymmetric dilated network for semantic segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
  34. Wu, F.; Chen, F.; Jin, X.; Hu, C.; Ge, Q.; Ji, Y. Dynamic attention network for semantic segmentation. Neurocomputing 2020, 384, 182–191. [Google Scholar] [CrossRef]
  35. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  36. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  37. Woo, S.; Park, J.; Lee, J.; Kweon, I. CBAM: Convolutional block attention module. In Proceedings of the European Conference Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  38. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  39. Wang, Y.; Zhou, Q.; Liu, J.; Xiong, J.; Gao, G.; Wu, X.; Latecki, L. LEDNet: A lightweight encoder–decoder network for real-time semantic segmentation. In Proceedings of the IEEE International Conference on Image Processing, Taipei, Taiwan, China, 22–25 September 2019. [Google Scholar]
  40. Cortes, C.; Gonzalvo, X.; Kuznetsov, V.; Mohri, M.; Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017. [Google Scholar]
  41. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  42. Brostow, G.; Fauqueur, J.; Cipolla, R. Semantic object classes in video: A high-definition ground truth database. Pattern Recognit. Lett. 2009, 30, 88–97. [Google Scholar] [CrossRef]
  43. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2021, 30, 1169–1179. [Google Scholar] [CrossRef]
  44. Poudel, R.; Bonde, U.; Liwicki, S.; Zach, C. ContextNet: Exploring context and detail for semantic segmentation in real-time. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018. [Google Scholar]
  45. Lo, S.; Hang, H.; Chan, S.; Lin, J. Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In Proceedings of the ACM Multimedia Asia, Beijing, China, 16–18 December 2019. [Google Scholar]
  46. Poudel, R.; Liwicki, S.; Cipolla, R. Fast-SCNN: Fast semantic segmentation network. In Proceedings of the British Machine Vision Conference, Cardiff, UK, 9–12 September 2019. [Google Scholar]
  47. Jiang, W.; Xie, Z.; Li, Y.; Liu, C.; Lu, H. LRNNET: A light-weighted network with efficient reduced non-local operation for real-time semantic segmentation. In Proceedings of the IEEE International Conference on Multimedia & Expo Workshops, London, UK, 6–10 July 2020. [Google Scholar]
  48. Wang, D.; Xiang, S.; Zhou, Y.; Mu, J.; Zhou, H.; Irampaye, R. Multiple-Attention Mechanism Network for Semantic Segmentation. Sensors 2022, 12, 4477. [Google Scholar] [CrossRef]
  49. Zhou, Q.; Wang, Y.; Fan, Y.; Wu, X.; Zhang, S.; Kang, B.; Latecki, L. AGLNet: Towards real-time semantic segmentation of self-driving images via attention-guided lightweight network. Appl. Soft Comput. 2020, 96, 106682. [Google Scholar] [CrossRef]
  50. Wu, Y.; Jiang, J.; Huang, Z.; Tian, Y. FPANet: Feature pyramid aggregation network for real-time semantic segmentation. Appl. Intell. 2022, 52, 3319–3336. [Google Scholar] [CrossRef]
  51. Nirkin, Y.; Wolf, L.; Hassner, T. HyperSeg: Patch-wise hypernetwork for real-time semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
  52. Liu, J.; Xu, X.; Shi, Y.; Deng, C.; Shi, M. RELAXNet: Residual efficient learning and attention expected fusion network for real-time semantic segmentation. Neurocomputing 2022, 474, 115–127. [Google Scholar] [CrossRef]
Figure 1. Overview architecture of our LMANet.
Figure 2. The structure of the feature-bottleneck module. (a) LFB module. (b) HFB module. DConv indicates depth-wise convolution. DDConv indicates depth-wise dilated convolution.
Figure 3. The structure of the feature-fusion module. (a) The first FFM structure. (b) The other FFM structure.
Figure 4. Illustration of our MFD.
Figure 5. Visual results of the ablation study on MFD.
Figure 6. Comparison of visualization results on the Cityscapes dataset between different algorithms.
Table 1. Architecture details of LMANet.

Stage | Operator | Mode | Output Size | Channel
Encoder | 3 × 3 Conv | Stride = 2 | 256 × 512 | 32
 | 3 × 3 Conv | Stride = 1 | 256 × 512 | 32
 | 3 × 3 Conv | Stride = 1 | 256 × 512 | 32
 | FFM | – | 256 × 512 | 32 + 3
 | Downsample | – | 128 × 256 | 64
LFB Block | LFB module × 3 | Dilated = 2 | 128 × 256 | 64
 | REA module | – | 128 × 256 | 64
 | FFM | – | 128 × 256 | 128 + 3
 | Downsample | – | 64 × 128 | 128
HFB Block | HFB module × 2 | Dilated = {2,4} | 64 × 128 | 128
 | HFB module × 2 | Dilated = {4,8} | 64 × 128 | 128
 | HFB module × 2 | Dilated = {8,16} | 64 × 128 | 128
 | REA module | – | 64 × 128 | 128
 | FFM | – | 64 × 128 | 256 + 3
Decoder | MFD | – | 512 × 1024 | C
Table 2. Ablation study results of the HFB module on the Cityscapes dataset.

Strategy | Dilation Rates 1 (R1) | Dilation Rates 2 (R2) | Params/M | Speed/FPS | mIoU/%
S1 | {2,2,2,2,2,2} | – | 0.84 | 122 | 69.8
S2 | {2,2,4,4,8,8} | – | 0.84 | 121 | 71.5
S3 | {2,2,5,5,9,9} | {5,5,9,9,17,17} | 0.85 | 108 | 72.9
S4 | {2,2,2,8,8,8} | {4,4,4,16,16,16} | 0.85 | 109 | 73.1
S5 | {2,4,8,2,4,8} | {4,8,16,4,8,16} | 0.85 | 112 | 73.5
S6 | {2,2,4,4,8,8} | {4,4,8,8,16,16} | 0.85 | 111 | 73.9

Table 3. Ablation study results of the feature-bottleneck module.

Module | Params/M | Speed/FPS | mIoU/%
DAB module | 0.75 | 116 | 71.5
SS-nbt module | 0.81 | 113 | 72.8
Non-bt-1D | 1.13 | 92 | 73.2
EAB module | 0.72 | 118 | 73.1
LFB and HFB | 0.85 | 111 | 73.9
Table 4. Results of different values of m and n in the feature-extraction module on the Cityscapes dataset.

m | n | Params/M | Speed/FPS | mIoU/%
3 | 1 | 0.43 | 178 | 65.3
1 | 10 | 1.15 | 88 | 72.5
3 | 8 | 1.02 | 92 | 74.1
4 | 6 | 0.87 | 103 | 73.8
2 | 4 | 0.66 | 146 | 69.6
5 | 10 | 1.23 | 73 | 74.0
3 | 6 | 0.85 | 111 | 73.9
Table 5. Experiment results on MFD.

Decoder | Params/M | Speed/FPS | mIoU/%
DABNet | 0.84 | 118 | 72.6
ERFNet (ERFD) | 1.59 | 90 | 73.4
MFD | 0.85 | 111 | 73.9
Table 6. Experiment results on the internal components of MFD.

Ablation Study | Type | Params/M | Speed/FPS | mIoU/%
Attention module | CBAM | 0.86 | 105 | 73.6
 | SE | 0.86 | 108 | 73.5
 | SAM | 0.85 | 111 | 73.9
Feature-fusion method | {Multiply, Add} | 0.85 | 112 | 73.4
 | {Multiply, Concat} | 0.85 | 114 | 73.3
 | {Concat, Multiply} | 0.85 | 113 | 73.6
 | {Concat, Add} | 0.85 | 111 | 73.9
Table 7. Experiment results of different modules.

Type | LFB | HFB | REA Module | MFD | Params/M | Speed/FPS | mIoU/%
S1 | | | | | 0.75 | 132 | 70.3
S2 | ✓ | | | | 0.75 | 130 | 71.2
S3 | ✓ | ✓ | | | 0.84 | 121 | 72.2
S4 | ✓ | ✓ | ✓ | | 0.84 | 115 | 73.1
S5 (LMANet) | ✓ | ✓ | ✓ | ✓ | 0.85 | 111 | 73.9
Table 8. Comparison of state-of-the-art semantic segmentation methods on the Cityscapes dataset.

Method | Input Size | GPU | GFLOPs | Params/M | mIoU/% | Speed/FPS
SegNet [18] | 360 × 640 | 3090 | 168.2 | 29.5 | 56.1 | 53
ENet [17] | 512 × 1024 | 3090 | 4.3 | 0.36 | 58.8 | 52
ESPNet [8] | 512 × 1024 | 3090 | 3.6 | 0.36 | 60.3 | 165
CGNet [43] | 1024 × 2048 | 3090 | 28 | 0.49 | 64.8 | 31
ContextNet [44] | 1024 × 2048 | 3090 | 7.2 | 0.85 | 66.1 | 80
EDANet [45] | 512 × 1024 | 3090 | 9 | 0.68 | 67.3 | 112
ERFNet [19] | 512 × 1024 | 3090 | 27.9 | 2.1 | 68.0 | 63
Fast-SCNN [46] | 1024 × 2048 | 3090 | 7 | 1.1 | 68.0 | 125
BiseNetv1 [11] | 768 × 1536 | 1080 Ti | 14.8 | 5.8 | 68.4 | 105
LEDNet [39] | 512 × 1024 | 3090 | 11.5 | 0.95 | 70.6 | 59
DABNet [20] | 512 × 1024 | 3090 | 27.1 | 0.76 | 70.1 | 143
ICNet [7] | 1024 × 2048 | 3090 | 28.3 | 26.5 | 69.9 | 41
ESNet [32] | 512 × 1024 | 3090 | 24.4 | 1.7 | 70.7 | 53
LRNNet [47] | 512 × 1024 | 3090 | – | 0.68 | 72.2 | 71
MANet [48] | 512 × 1024 | 1080 Ti | – | 65.3 | 72.8 | 23
DFANet [21] | 1024 × 1024 | 1080 Ti | 3.6 | 7.8 | 71.3 | 100
BiseNetv2 [12] | 768 × 1536 | 2080 Ti | 21.2 | 3.4 | 72.6 | 156
FBSNet [30] | 512 × 1024 | 3090 | 9.7 | 0.62 | 70.9 | 60
AGLNet [49] | 512 × 1024 | 1080 Ti | 13.8 | 1.1 | 70.1 | 52
FPANet [50] | 512 × 1024 | 2080 Ti | – | 15.5 | 73.4 | 63
MSCFNet [23] | 512 × 1024 | 3090 | 17.1 | 1.2 | 71.9 | 50
WFDCNet [24] | 512 × 1024 | 3090 | 5.8 | 0.51 | 73.6 | 88
Hyperseg-M [51] | 512 × 1024 | 1080 Ti | 7.5 | 10.1 | 75.8 | 36
LMANet (ours) | 512 × 1024 | 2080 Ti | 16.3 | 0.85 | 73.9 | 111
Table 9. Performance comparison of our LMANet against state-of-the-art semantic segmentation networks on the CamVid dataset.

Method | Input Size | GPU | GFLOPs | Params/M | mIoU/% | Speed/FPS
ENet [17] | 360 × 480 | 3090 | 1.4 | 0.36 | 52.8 | 80
ESPNet [8] | 360 × 480 | 3090 | 1.1 | 0.36 | 55.6 | 230
LEDNet [39] | 360 × 480 | 3090 | 3.8 | 0.95 | 66.6 | 76
ERFNet [19] | 360 × 480 | 3090 | 8.6 | 2.1 | 60.5 | 133
BiseNetv1 [11] | 720 × 960 | 1080 Ti | 32.4 | 5.8 | 68.7 | 116
CGNet [43] | 360 × 480 | 3090 | 2.3 | 0.49 | 65.6 | 66
EDANet [45] | 360 × 480 | 3090 | 2.9 | 0.68 | 66.4 | 112
DABNet [20] | 360 × 480 | 3090 | 3.2 | 0.76 | 66.2 | 107
DFANet [21] | 720 × 960 | 1080 Ti | 2.1 | 7.8 | 64.7 | 120
AGLNet [49] | 360 × 480 | 3090 | 4.6 | 1.1 | 69.4 | 90
MSCFNet [23] | 360 × 480 | 3090 | 5.7 | 1.2 | 69.3 | 50
WFDCNet [24] | 360 × 480 | 3090 | 1.9 | 0.51 | 67.5 | 90
RELAXNet [52] | 360 × 480 | 3090 | 7.9 | 1.9 | 71.2 | 79
LMANet (ours) | 360 × 480 | 2080 Ti | 8.6 | 0.85 | 71.1 | 118
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
