Article

Asymmetric-Convolution-Guided Multipath Fusion for Real-Time Semantic Segmentation Networks

1 School of Measurement and Control Technology and Communication Engineering, Harbin Institute of Technology, Harbin 150001, China
2 China Telecom Heilongjiang Branch, Harbin 150010, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(17), 2759; https://doi.org/10.3390/math12172759
Submission received: 20 June 2024 / Revised: 29 July 2024 / Accepted: 19 August 2024 / Published: 5 September 2024

Abstract

Aiming to address the inaccurate segmentation of long objects and the loss of small-object information in real-time semantic segmentation algorithms, this paper proposes a lightweight multi-branch real-time semantic segmentation network based on BiseNetV2. A new auxiliary branch makes full use of spatial details and context information to cover long objects in the field of view. Meanwhile, to preserve the inference speed of the model, asymmetric convolution is used at each stage of the auxiliary branch to build a structure with low computational complexity. In the multi-branch fusion stage, an alignment-and-fusion module is designed to provide guidance information for deep and shallow feature maps, compensating for the feature misalignment that arises when information at different scales is fused and thereby reducing the loss of small-target information. To further improve the model's awareness of key information, a global context module is designed to capture the most important features in the input data. The proposed network was evaluated on the Cityscapes and CamVid street-scene datasets using an NVIDIA GeForce RTX 3080 Laptop GPU, reaching mean intersection over union (mIoU) values of 77.1% and 77.4% and running speeds of 127 frames/s and 112 frames/s, respectively. The experimental results show that the proposed algorithm achieves real-time segmentation while significantly improving accuracy, demonstrating good semantic segmentation performance.

1. Introduction

Image semantic segmentation is an important research subject in the field of computer vision. By classifying and predicting a given image pixel by pixel, it partitions the image into regions with different semantic meanings, and it has a wide range of applications in fields such as autonomous driving, scene analysis, medical detection, and machine perception [1,2].
In recent years, with the rapid development of artificial intelligence, deep learning has been widely applied to image semantic segmentation and has achieved better results than traditional segmentation algorithms [3]. Fully convolutional networks (FCNs) [4] replace the fully connected layers of traditional convolutional neural networks with convolutional layers, and this deep-convolutional approach [5] has become the main solution to semantic segmentation tasks. It opened a new research direction and prompted the proposal of many high-precision image semantic segmentation algorithms. Among them, to refine spatial details, PSPNet [6] designs a pyramid pooling module (PPM) and uses global average pooling operations to extract global information. To reduce the loss of spatial information during downsampling, DeepLab [7] uses dilated (atrous) convolution to enlarge the receptive field and obtain more context information; subsequent optimized and improved versions of DeepLab [8,9,10] have gradually improved segmentation accuracy. To extract features at different resolutions, HRNet [11] achieves high-accuracy semantic segmentation by maintaining a high-resolution representation while conducting parallel downsampling.
The above semantic segmentation networks perform well in terms of accuracy, but this accuracy mainly depends on complex model designs that usually require a large number of parameters and considerable computational resources, making it difficult to meet real-time processing requirements and thus limiting practical applications. Therefore, real-time semantic segmentation algorithms with fewer parameters and faster running speeds have become a research focus [12,13,14,15,16]. Early real-time semantic segmentation networks often used efficient encoder–decoder structures [17,18,19]. ICNet [20] uses a cascading network structure to progressively encode features at different resolutions, effectively improving running speed. FANet [21] uses a bidirectional feature pyramid network to fuse feature information from different levels in the decoder stage. These methods have made progress toward real-time application, but drastically reducing the number of model parameters weakens the network's ability to extract feature information. Therefore, some methods design dedicated feature-extraction modules to improve segmentation performance. ESPNet [22] designs the efficient spatial pyramid (ESP) module, which improves accuracy while reducing the number of parameters and the computational cost of the model. SwiftNet [23] adopts a spatial pyramid pooling (SPP) structure in the downsampling phase to reduce the amount of computation. EADNet [24] designs a multi-scale multi-shape receptive field convolution (MMRFC) module to build a lightweight semantic segmentation network. The inference speed of these networks is greatly improved, but they sacrifice spatial resolution to achieve real-time inference, resulting in the loss of spatial feature information.
For this reason, the creators of BiSeNet [25] chose a lightweight backbone network and proposed an additional downsampling path to obtain spatial details and integrate them with the backbone network to compensate for the reduced accuracy. On this basis, BiSeNetV2 [26] improved the structure of BiSeNet by reducing the number of channels, adopting a fast downsampling strategy, and introducing an auxiliary training strategy to further improve segmentation performance.
This two-branch network achieves a better balance between inference speed and segmentation accuracy. However, the detail branch of BiseNetV2 is shallow, so the extracted features lack a sufficient receptive field. At the same time, the symmetric convolution adopted by this branch captures interference information from irrelevant regions [27] and cannot effectively identify targets with elongated (such as grass) or dispersed (such as traffic signs) structures, which ultimately lowers segmentation accuracy. In addition, the way BiseNetV2 fuses multi-branch features suffers from feature misalignment [28] and ignores the diversity between the two branches, which is not conducive to recovering the feature information of small targets lost during network downsampling.
Finally, since the semantic branch of BiseNetV2 aims to capture global context information and extract high-level semantic features, it needs further improvement to strengthen this capability. To solve the above problems, this paper optimizes BiseNetV2 to improve the performance of the algorithm. The main contributions of this paper are summarized as follows: (1) an auxiliary branch is designed to guide the detail branch to capture long-range relationships between features, further improving the segmentation of large irregular targets; (2) an alignment-and-fusion module (AAFM) is proposed to merge the multiple output branches of the model while enabling effective interaction between them, easing the feature-misalignment problem and making up for the loss of small-target information; (3) a global context module (GCM) is introduced to capture the most important features in the input data, thereby improving the model's awareness of key information; and (4) the performance of the proposed algorithm on the Cityscapes and CamVid datasets is significantly improved compared with BiseNetV2 and other existing advanced algorithms, proving the superiority of the proposed algorithm.

2. Proposed Algorithm

2.1. Overall Structure

The network structure of the BiseNetV2 algorithm and the proposed algorithm is illustrated in Figure 1. Figure 1a depicts the BiseNetV2 algorithm, while Figure 1b illustrates the enhanced proposed algorithm, which is based on BiseNetV2.
As illustrated in Figure 1a, BiseNetV2 employs detail branches and semantic branches to extract both spatial detail information and semantic abstract information, respectively.
This paper presents an optimised and improved version of the BiseNetV2 algorithm. As illustrated in Figure 1b, the network is structured into three branches: the detail branch and the auxiliary branch extract spatial detail features, while the semantic branch captures semantic context features. An alignment-and-fusion module is then devised to fuse the features of the three branches and obtain the semantic segmentation result.
The overall structure of the proposed multi-branch network is illustrated in Figure 2. The backbone comprises three branches: the detail branch (blue), the auxiliary branch (yellow), and the semantic branch (green). The auxiliary branch is a novel addition to the original network, introduced to extract comprehensive spatial detail and contextual information and thereby cover the long strip-shaped targets within the field of view. The numerals inside each branch box denote the ratio of the feature-map size to the resolution of the original input. The image is first processed by the three parallel branches. The auxiliary branch is integrated with the detail branch at each stage, helping the model retain spatial information. Once the semantic branch has downsampled the features to 1/32 of the original resolution, a global context module is embedded to enhance the context-representation capability.
Furthermore, the alignment-and-fusion module is employed when the features are merged. The module feeds the features obtained from the detail branch by downsampling (Down) and the features obtained from the semantic branch through the sigmoid activation function (ϕ) into the alignment layer (AL). Concurrently, the features derived from the semantic branch through upsampling and the sigmoid activation function (ϕ), together with the features obtained from the detail branch, are also fed into the alignment layer. This enables comprehensive alignment and interactive fusion of the features from the two branches. The two outputs of the alignment layer are finally summed (SUM) with the output of the auxiliary branch. In addition, the algorithm retains the auxiliary training strategy of the baseline model and enhances the feature-extraction capacity of the shallow networks by combining the losses of four auxiliary heads with the loss of the main segmentation head. The three modules proposed in this paper, namely the auxiliary branch, the alignment-and-fusion module, and the global context module, are described in detail in the following subsections. The instantiation of the network is given in Table 1.
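To make the data flow concrete, the following PyTorch-style sketch outlines how the three branches and the two added modules could be composed. The class and argument names are illustrative placeholders, the stage-wise interaction between the detail and auxiliary branches (Section 2.2) is omitted for brevity, and this is not the authors' released implementation.

```python
import torch.nn as nn

class MultiPathSegNet(nn.Module):
    """Illustrative composition of the three branches, the GCM, and the AAFM."""
    def __init__(self, detail, auxiliary, semantic, gcm, aafm, seg_head):
        super().__init__()
        self.detail, self.auxiliary, self.semantic = detail, auxiliary, semantic
        self.gcm, self.aafm, self.seg_head = gcm, aafm, seg_head

    def forward(self, x):
        f_d = self.detail(x)                # spatial detail features (1/8 resolution)
        f_a = self.auxiliary(x)             # asymmetric-convolution auxiliary features
        f_s = self.gcm(self.semantic(x))    # semantic features (1/32), refined by the GCM
        f = self.aafm(f_d, f_s, f_a)        # alignment-and-fusion of the three branches
        return self.seg_head(f)             # per-pixel logits, upsampled to the input size
```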

2.2. Auxiliary Branch

BiSeNetV2 employs symmetric convolution operations to build the detail branch, which extracts low-level features from images. However, the disadvantages of symmetric convolution are evident. Firstly, symmetric convolution considers all directions equally, which may result in the omission of valid texture information. Secondly, symmetric convolution has a fixed kernel size and is therefore unsuitable for processing input data with different shapes or aspect ratios, limiting its capacity to handle diverse inputs. Thirdly, due to its inherent symmetry and smoothing effect, symmetric convolution may in some cases lose details and texture information. To address these issues, this algorithm introduces a novel auxiliary branch structure designed to capture long-range dependencies, edges, textures, and other pertinent information. The proposed auxiliary branch structure is illustrated in Figure 3.
The novel auxiliary branch comprises three stages, as illustrated in Figure 3a, with the red box denoting the convolution process. The first stage uses convolution with a stride of 2 to extract high-resolution features from the input. The second stage uses convolution with a stride of 1 to expand the receptive field and capture the long-range relationships between isolated regions, facilitating the segmentation of large targets. The third stage again uses convolution with a stride of 1 to encode and integrate the detailed features extracted previously. In the first stage, the input X is processed by a 3 × 1 vertical asymmetric convolution followed by a 1 × 3 horizontal asymmetric convolution; the operations in the latter two stages are analogous. The operation of each stage is as follows:
$Y_H = \mathrm{Conv}_{3\times 1}(X)$ (1)
$Y_V = \mathrm{Conv}_{1\times 3}(Y_H)$ (2)
where $X \in \mathbb{R}^{H \times W \times C}$ is the input of this branch, $\mathrm{Conv}_{m\times n}$ represents an asymmetric convolution with a kernel size of m × n, and $Y_H$ and $Y_V$ are the features obtained after the vertical and horizontal asymmetric convolutions, respectively. The processing of the auxiliary branch is illustrated in Figure 3b. In the initial stage of the detail branch, the image resolution is high and the available detail information is substantial; to avoid model redundancy, this algorithm does not provide assistance at this stage. Instead, the feature $F_{D1}$ from the first stage of the detail branch is merged with the feature $F_{A1}$ from the first stage of the auxiliary branch to obtain $F_1$, which is used to guide the second stage of the detail branch and produce $F_{D2}$. $F_{D2}$ is then merged with the output $F_{A2}$ of the second stage of the auxiliary branch to create $F_2$, which in turn guides the third stage of the detail branch. In this way, the auxiliary-branch features $F_{Ai}$ and the detail-branch features $F_{Di}$ are integrated gradually, facilitating the interaction of feature information at different scales. This integration establishes dependencies between discretely distributed regions and enhances edge and texture information, providing a more comprehensive and nuanced representation of the features. The specific operational process is illustrated below:
$F_1 = F_{A1} \oplus \eta(F_{D1})$ (3)
$F_{D2} = C(F_1)$ (4)
$F_2 = F_{D2} \oplus \eta(F_{A2})$ (5)
$F_{D3} = C(F_2)$ (6)
where $F_{Ai}$, $F_{Di}$, and $F_i$ represent the auxiliary-branch, detail-branch, and fused feature maps, respectively, with $i \in \{1, 2, 3\}$; $\eta(\cdot)$ represents the broadcast operation, $C(\cdot)$ represents the convolution operation, and "⊕" represents element-wise addition.
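A minimal PyTorch sketch of one auxiliary-branch stage and the stage-wise fusion is given below. The BN/ReLU placement, the stride assignment, and the use of bilinear resizing to realise the broadcast operation η(·) are assumptions made for illustration; they are not taken from the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricStage(nn.Module):
    """One auxiliary-branch stage: a 3x1 vertical convolution followed by a 1x3 horizontal
    convolution, as in Equations (1) and (2); each convolution is followed by BN and ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.vertical = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (3, 1), stride=stride, padding=(1, 0), bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.horizontal = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, (1, 3), stride=1, padding=(0, 1), bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y_h = self.vertical(x)        # Y_H = Conv_{3x1}(X)
        return self.horizontal(y_h)   # Y_V = Conv_{1x3}(Y_H)

def fuse_and_guide(f_a, f_d, conv):
    """Equations (3)-(6): broadcast-add one branch's stage feature to the other's and pass
    the sum through a convolution C(.) that produces the next detail-branch stage."""
    if f_a.shape[-2:] != f_d.shape[-2:]:   # eta(.): resize so element-wise addition is valid
        f_d = F.interpolate(f_d, size=f_a.shape[-2:], mode="bilinear", align_corners=False)
    return conv(f_a + f_d)
```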
Furthermore, for real-time semantic segmentation models it is essential to consider not only accuracy but also speed. This paragraph explains, from the perspective of computational efficiency, why asymmetric convolution is chosen to form the new auxiliary branch. Consider a convolution layer with $C$ input channels, $F$ output channels, and a spatial kernel of size $d_v \times d_h$; for simplicity, assume $d_v = d_h = d$. This layer can be decomposed into two convolution layers with kernels of size $d \times 1$ and $1 \times d$, respectively, where $L$ denotes the number of channels of the intermediate feature map. The computational costs of the two schemes are proportional to $CFd^2$ and $L(C+F)d$, respectively, so a significant saving is obtained when $L(C+F) \ll CFd$. This analysis shows that the new auxiliary branch does not significantly increase the computational burden.
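As a quick check of the proportionality argument, the snippet below compares multiply-accumulate counts for a square kernel and its asymmetric decomposition; the choice L = C = F = 64 and the feature-map size are illustrative values only.

```python
def square_conv_macs(C, F, d, H, W):
    """MACs of a d x d convolution with C input and F output channels on an H x W map."""
    return C * F * d * d * H * W

def asym_conv_macs(C, F, L, d, H, W):
    """MACs of the d x 1 (C -> L) plus 1 x d (L -> F) decomposition: L*(C+F)*d per pixel."""
    return (C * L + L * F) * d * H * W

C, F, L, d, H, W = 64, 64, 64, 3, 128, 256
ratio = square_conv_macs(C, F, d, H, W) / asym_conv_macs(C, F, L, d, H, W)
print(f"square / asymmetric MACs = {ratio:.2f}")   # = C*F*d / (L*(C+F)) = 1.50
```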

2.3. Alignment-and-Fusion Module

BiseNetV2 introduces a bidirectional guided aggregation module to fuse and mutually guide the detail and semantic branches. Although effective communication is achieved through this mutual guidance, the approach does not address the feature misalignment that can arise when the inputs of the two branches are merged. As a result, the two branches pass erroneous information to one another, leading to feature redundancy after fusion, which hinders the recovery of detailed information, particularly for small targets. To address this issue, a novel alignment-and-fusion module is devised to integrate the three branches of the network. Its structure is illustrated in Figure 4. The module consists of two parts: in the first, the detail branch guides the semantic branch; in the second, the semantic branch guides the detail branch.
The detail branch provides guidance for the semantic branch; this guidance stage is shown in orange in Figure 4a. First, the detail-branch features pass through a 3 × 3 convolution to extract local features, and their spatial resolution is then reduced by an average pooling operation. Concurrently, the semantic-branch features are processed by a 3 × 3 depthwise separable convolution to enlarge their receptive field, followed by a 1 × 1 convolution to integrate the features. The two branches are then fused into $F_s$ by element-wise multiplication, ensuring efficient guidance between them. The alignment stage is shown in grey in Figure 4a. First, a 3 × 3 convolution is applied to $F_s$ to predict a two-dimensional offset $F_{O1}$, in which each pixel position contains a horizontal and a vertical offset. The offset $F_{O1}$ determines the relative positional relationship between pixels of features with different resolutions. The semantic-branch feature $F_S$ is then warped using $F_{O1}$ to obtain $F_{U1}$. Warp is a spatial transformation operation: as illustrated in Figure 4b, a spatial grid is generated from the offset and used to resample the low-resolution feature map, warping it into a high-resolution feature map. This produces the aligned feature map and alleviates the feature misalignment that arises when feature maps with different resolutions are fused.
The semantic branch then provides guidance for the detail branch. As in the operation above, the two branches guide one another to produce $F_D$, which is fed into a 3 × 3 convolution to learn and predict the two-dimensional offset $F_{O2}$. Since the warp operation is designed to warp low-resolution feature maps into high-resolution feature maps, $F_{O2}$ is again used to warp the low-resolution $F_S$, yielding $F_{U2}$. The feature maps $F_{U1}$ and $F_{U2}$ are then concatenated along the channel dimension.
Finally, in the purple section of Figure 4a, a 3 × 3 convolution is applied to remove the redundant features evident in the fused map, yielding $F_U$. To save parameters, the algorithm then merges the three branches by adding the corresponding pixels of $F_A$ and $F_U$ to obtain the final output $F$. The above process is expressed as follows:
$F_{O1} = f_{3\times 3}\left(\mathrm{APooling}(f_{3\times 3}(F_D)) \otimes f_{1\times 1}(f_{3\times 3}(F_S))\right)$ (7)
$F_{U1} = \mathrm{warp}(F_S, F_{O1})$ (8)
$F_{O2} = f_{3\times 3}\left(f_{1\times 1}(f_{3\times 3}(F_D)) \otimes \mathrm{upsample}(f_{3\times 3}(F_S))\right)$ (9)
$F_{U2} = \mathrm{warp}(F_S, F_{O2})$ (10)
$F_U = f_{3\times 3}(\mathrm{concat}(F_{U1}, F_{U2}))$ (11)
where $F_I$ represents the input of this module, with $I \in \{D, S, A\}$ denoting the detail, semantic, and auxiliary branches, respectively; $f_{m\times n}$ represents an m × n convolution operation; and APooling denotes the average pooling operation. The warp operation is shown in Figure 4b. ⊗ represents element-wise multiplication.
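The warp operation can be realised with grid_sample: a base sampling grid is displaced by the predicted per-pixel offsets and used to resample the low-resolution feature map at the higher resolution. The sketch below illustrates this idea under the assumption that offsets are expressed in pixels of the target grid; it is not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def warp(feature, offset):
    """Warp `feature` onto the grid defined by `offset` (Equations (8) and (10)).
    `feature`: (N, C, h, w) low-resolution map; `offset`: (N, 2, H, W) horizontal and
    vertical displacements in pixels of the target H x W grid."""
    n, _, h, w = offset.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=offset.device, dtype=offset.dtype),
        torch.arange(w, device=offset.device, dtype=offset.dtype),
        indexing="ij")
    # Displace the base grid and normalize to [-1, 1], the convention expected by
    # grid_sample (x coordinate first, then y).
    grid_x = 2.0 * (xs.unsqueeze(0) + offset[:, 0]) / max(w - 1, 1) - 1.0
    grid_y = 2.0 * (ys.unsqueeze(0) + offset[:, 1]) / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)              # (N, H, W, 2)
    return F.grid_sample(feature, grid, mode="bilinear", align_corners=True)
```

For example, warp(F_S, F_O1) would produce the aligned feature F_U1 at the resolution of the offset map, so that the subsequent fusion operates on spatially aligned features.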

2.4. Global Context Module

To maximise the contribution of high-level semantic information in the semantic branch, a global context module is devised to better retain image information. Conventional methods frequently stack convolutional layers to extract high-level semantic features, but this consumes considerable computing resources. To preserve real-time performance, the proposed module uses global average pooling and global maximum pooling instead of stacked convolution layers. Global average pooling is not constrained by the receptive field size and operates on the entire feature map, capturing a wider range of contextual information; this helps the model understand the semantics and structure of the image and perceive it globally. Selecting the maximum activation value within each channel extracts the most significant features, focusing attention on the most important features in the image and enhancing the model's ability to perceive and distinguish key information.
The module structure is illustrated in Figure 5. First, the input feature $F_{input}$ undergoes global average pooling and global maximum pooling, respectively, and the two resulting feature maps are batch-normalised to stabilise the distribution of the input features. To extract more effective features, a 1 × 1 convolution is then applied to each map to fuse and exchange information across channels. Batch normalisation and a ReLU activation function are then applied to strengthen the model's expressive ability, yielding the outputs $F_L$ and $F_R$. Next, $F_{input}$, $F_L$, and $F_R$ are merged through skip connections, which transfer the input-feature information directly to $F_L$ and $F_R$ and balance the processing of fine details with the integration of global information. Finally, to represent the semantic information of the image more accurately, a 3 × 3 convolution further enhances the model's abstraction ability and produces the final output $F_{output}$. These operations can be expressed as follows:
$F_L = \sigma(\beta(\mathrm{Conv}_{1\times 1}(\beta(\mathrm{GAP}(F_{input})))))$ (12)
$F_R = \sigma(\beta(\mathrm{Conv}_{1\times 1}(\beta(\mathrm{GMP}(F_{input})))))$ (13)
$F_{output} = \mathrm{Conv}_{3\times 3}(F_{input} \oplus F_L \oplus F_R)$ (14)
where $F_{input}$ and $F_{output}$ represent the input and output of the module, respectively; GAP and GMP denote global average pooling and global maximum pooling; $\mathrm{Conv}_{m\times n}$ represents a convolution with a kernel of size m × n; and $\sigma(\cdot)$ and $\beta(\cdot)$ represent the ReLU activation and batch normalization operations, respectively.
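A compact PyTorch sketch of the module follows. The layer ordering mirrors the description above (pooling, BN, 1 × 1 convolution, BN, ReLU, skip connections, 3 × 3 convolution); keeping the channel count unchanged and omitting bias terms are assumptions, not details taken from the paper.

```python
import torch.nn as nn

class GlobalContextModule(nn.Module):
    """Sketch of the global context module, Equations (12)-(14)."""
    def __init__(self, channels):
        super().__init__()
        self.gap_branch = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.gmp_branch = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        f_l = self.gap_branch(x.mean(dim=(2, 3), keepdim=True))   # global average pooling path
        f_r = self.gmp_branch(x.amax(dim=(2, 3), keepdim=True))   # global max pooling path
        # Skip connections broadcast the 1x1 context vectors back over the spatial map.
        return self.out_conv(x + f_l + f_r)
```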

3. Experiment and Result Analysis

3.1. Dataset Introduction

3.1.1. Cityscapes Dataset

The Cityscapes dataset [29] comprises 25,000 high-resolution street-view images collected from 50 cities in Germany, each with a resolution of 2048 × 1024 pixels. The dataset contains 5000 finely annotated images and 20,000 coarsely annotated images. The algorithm was trained and validated on the finely annotated images, which are split into a training set (2975 images), a validation set (500 images), and a test set (1525 images). Following advanced semantic segmentation methods [7,26], 19 common semantic categories (such as sidewalk, road, and car) are used in this experiment.

3.1.2. CamVid Dataset

CamVid [30] is a widely used computer vision resource for semantic segmentation tasks, particularly for image segmentation and semantic annotation in urban scenes. The dataset comprises video images captured by cameras mounted on moving cars and encompasses a diverse array of scenes and objects, including urban streets, traffic signs, pedestrians, and vehicles. However, it has lower image and annotation quality than the Cityscapes dataset. This road-scene dataset, captured from the perspective of a driving car, comprises 701 high-resolution annotated video frames taken from five video sequences and covering 11 semantic categories. These 701 images are used to validate the algorithm presented in this paper.

3.2. Evaluation Index

In this study, the evaluation metrics are the mean intersection over union (mIoU), frames per second (FPS), and the number of parameters. FPS is the number of image frames processed by the model per second and is used to assess the model's speed. The number of parameters is used to assess the model's memory consumption. mIoU is used to evaluate the model's accuracy; here, $i$ denotes the ground-truth class, $j$ denotes the predicted class, $P_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$, and $k+1$ is the number of classes. The mIoU is expressed as follows:
$\mathrm{mIoU} = \dfrac{1}{k+1}\sum_{i=0}^{k}\dfrac{P_{ii}}{\sum_{j=0}^{k}P_{ij} + \sum_{j=0}^{k}P_{ji} - P_{ii}}$ (15)
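The metric can be computed from a pixel-level confusion matrix as in the sketch below; the ignore label of 255 follows the common Cityscapes convention and is an assumption rather than something stated in the paper.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Compute mIoU from a pixel-wise confusion matrix, following Equation (15):
    IoU_i = P_ii / (sum_j P_ij + sum_j P_ji - P_ii), averaged over classes.
    `pred` and `target` are integer label maps of the same shape."""
    mask = target != ignore_index
    hist = np.bincount(
        num_classes * target[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2).reshape(num_classes, num_classes)   # hist[i, j] = P_ij
    intersection = np.diag(hist)                                        # P_ii
    union = hist.sum(axis=1) + hist.sum(axis=0) - intersection          # rows + columns - diagonal
    iou = intersection / np.maximum(union, 1)   # classes absent from both maps contribute 0
    return iou.mean()
```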

3.3. Training Strategy

3.3.1. Experimental Parameter

The experiments are based on the PyTorch deep learning framework and use a single NVIDIA GeForce RTX 3080 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA; device manufacturer: Lenovo, Beijing, China). The network was trained using the Adam optimiser with a momentum of 0.9 [31]. For the Cityscapes dataset, the weight decay was set to $5 \times 10^{-4}$, the batch size to 2, and the maximum number of iterations to 150,000, with a warm-up strategy applied during the first 1000 iterations [32]. For the CamVid dataset, the weight decay was set to $5 \times 10^{-6}$, the batch size to 2, and the maximum number of iterations to 800,000, with the warm-up strategy again applied during the first 1000 iterations.

3.3.2. Data Enhancement

To address data imbalance, the algorithm employs data augmentation techniques including random horizontal flipping and random cropping. The random scales span the set {0.75, 1, 1.25, 1.5, 1.75, 2.0}. Training resolutions of 512 × 1024 for the Cityscapes dataset and 960 × 720 for the CamVid dataset were used.
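A minimal augmentation sketch consistent with the settings above is shown below; it assumes image and label are tensors of shape (C, H, W) and (1, H, W), and the interpolation and padding choices are illustrative, not the authors' exact pipeline.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

SCALES = [0.75, 1.0, 1.25, 1.5, 1.75, 2.0]   # random-scale set used in this paper

def augment(image, label, crop_size=(512, 1024)):
    """Random scaling, random cropping to the training resolution, and random horizontal flip."""
    scale = random.choice(SCALES)
    h, w = label.shape[-2:]
    new_size = (int(h * scale), int(w * scale))
    image = TF.resize(image, new_size)                                   # bilinear by default
    label = TF.resize(label, new_size, interpolation=InterpolationMode.NEAREST)
    # Random crop; torchvision pads with zeros if the scaled image is smaller than the crop.
    top = random.randint(0, max(new_size[0] - crop_size[0], 0))
    left = random.randint(0, max(new_size[1] - crop_size[1], 0))
    image = TF.crop(image, top, left, crop_size[0], crop_size[1])
    label = TF.crop(label, top, left, crop_size[0], crop_size[1])
    if random.random() < 0.5:                                            # random horizontal flip
        image, label = TF.hflip(image), TF.hflip(label)
    return image, label
```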

3.3.3. Training Optimization Strategy

In this paper, we use the poly learning rate strategy, as in BiseNetV2 [26], to decay the learning rate. The current learning rate is given by Formula (16):
$l = l_{init} \times \left(1 - \dfrac{iter}{iter_{max}}\right)^{power}$ (16)
where $l$ is the current learning rate, $l_{init}$ is the initial learning rate (set to $5 \times 10^{-2}$), $iter$ is the current iteration number, $iter_{max}$ is the maximum number of iterations, and $power$ is set to 0.9.
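A direct implementation of this schedule might look like the following; the linear ramp over the stated 1000 warm-up iterations is an assumption about the warm-up shape, since only its length is given in the paper.

```python
def poly_lr(iteration, init_lr=5e-2, max_iter=150_000, power=0.9, warmup_iters=1000):
    """Poly learning-rate schedule of Formula (16) with an assumed linear warm-up."""
    if iteration < warmup_iters:
        return init_lr * (iteration + 1) / warmup_iters        # linear warm-up phase
    return init_lr * (1 - iteration / max_iter) ** power       # poly decay phase
```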

3.4. Ablation Experiment

To demonstrate the effectiveness of the different modules in this algorithm, this section validates various combinations on the Cityscapes dataset; the results are presented in Table 2. The models are distinguished by their structure: the Baseline model uses only BiseNetV2, +Auxiliary Branch denotes BiseNetV2 with the auxiliary branch, +AAFM denotes BiseNetV2 with the alignment-and-fusion module, and +CGM denotes BiseNetV2 with the global context module. The full algorithm is Baseline + Auxiliary Branch + AAFM + CGM. The selected evaluation metrics, namely mIoU, FPS, and the number of parameters, are presented in columns 2, 3, and 4, respectively.
As shown in Table 2, the mIoU reached 74.3% after the auxiliary branch was added, a 1.7% improvement over the baseline model; meanwhile, the asymmetric convolution used in the auxiliary branch limits the impact on processing speed, with a reduction of only 17 frames/s. Adding the alignment-and-fusion module reduced the running speed by a mere 19 frames/s while increasing the mIoU to 74.4%, further confirming the module's effectiveness in mitigating feature misalignment. Adding the global context module yielded an mIoU of 73.3%, slightly lower than the other two modules in terms of segmentation accuracy; however, because its pooling operations introduce no additional parameters, this configuration reaches 150 frames/s in inference speed. In summary, although the added modules slightly affect processing speed, the full algorithm achieves the highest accuracy, with an mIoU of 77.1%.
These results demonstrate that the modules incorporated into this algorithm markedly enhance model performance: adding the auxiliary branch, the alignment-and-fusion module, and the global context module to the baseline model improves segmentation accuracy while keeping the running speed largely unchanged. The proposed algorithm therefore yields more accurate segmentation results while preserving efficient real-time performance.

3.5. Contrast Experiment

Table 3 compares the proposed algorithm with 14 other advanced algorithms; real-time and non-real-time networks are listed separately. The first two columns give the network type and name, column 3 indicates whether the model was pre-trained on ImageNet, columns 4, 5, and 7 give the evaluation metrics (mIoU, running speed, and parameters), and column 6 gives the input resolution.
As shown in Table 3, compared with non-real-time semantic segmentation networks, the mIoU of the proposed algorithm is superior to that of several large-scale networks, including SegNet and DeepLabV2, while requiring only 7% of the parameters of PSPNet. Among the lightweight (real-time) networks, Faster:SCDNet attains the highest segmentation accuracy, 0.5% higher than the algorithm presented in this paper; however, its inference speed is significantly lower, only about half that of the proposed algorithm. In terms of segmentation efficiency, the proposed algorithm reaches 127 frames/s, slightly slower than STDC-Seg50 and the baseline BiseNetV2: although it is 29.6 frames/s and 29 frames/s slower than these two models, respectively, its segmentation accuracy is 3.7% and 4.5% higher, so the proposed algorithm offers clear advantages in accuracy. Regarding model complexity, the proposed algorithm has 4.65 M parameters, which is relatively low among all compared semantic segmentation networks, indicating few redundant parameters and a compact, efficient network structure. The experimental results demonstrate that, compared with classical algorithms of recent years, the proposed algorithm segments targets more accurately and exhibits superior overall performance.
To illustrate the analysis more directly, Figure 6 presents a scatterplot of accuracy versus speed on the Cityscapes validation set, including 10 classical algorithms that differ from those in Table 3. The proposed algorithm is located in the upper-right region of the plot: it achieves higher segmentation accuracy than other lightweight semantic segmentation networks while maintaining a high processing speed, demonstrating that it strikes a good balance between accuracy and real-time performance.
To further validate the performance of the proposed algorithm, experimental comparisons were conducted with current classical real-time semantic segmentation algorithms on the CamVid dataset. As shown in Table 4, the proposed algorithm attains the highest mIoU of 77.4%, outperforming the other models, while running at 112 FPS. This speed is lower than that of STDC-Seg50, Faster:SCDNet, and BiseNetV2, but the proposed algorithm's mIoU exceeds theirs by 3.5%, 3.2%, and 5.0%, respectively, a notable advantage in accuracy. These results again demonstrate that the proposed algorithm offers superior overall performance and more accurate target segmentation than classical algorithms of recent years.
To further validate the effectiveness of the proposed algorithm, the per-category IoU on the Cityscapes dataset was compared with that of current classical real-time semantic segmentation algorithms across all 19 categories, such as motorcycle, sky, building, and wall. The detailed results are presented in Table 5. For the small target objects that this algorithm mainly optimizes, such as poles and traffic signs, the segmentation accuracy exceeds that of the baseline model by 1.4% and 2.5%, respectively. Similarly, for long strip-shaped objects that the algorithm also optimizes, such as buildings, walls, and buses, the segmentation accuracy exceeds that of the baseline model by more than 10% or even 20%, illustrating the substantial benefits of the proposed algorithm for such objects. For categories that are secondary optimization objectives, such as sky and bicycle, the segmentation accuracy declines by only 0.1% and 4.3%, respectively, compared with the baseline model, demonstrating that although the auxiliary branch has some adverse effect on these categories, its impact on the overall segmentation results is not considerable.
In summary, while the algorithm’s performance in the categories of sky and motorcycle is not exceptional, the segmentation accuracy of the categories of support rods, signs, buildings and buses has been significantly enhanced. With regard to the discrepancy in segmentation accuracy between each category and the baseline model, the algorithm presented in this paper continues to demonstrate superior performance, thereby substantiating its efficiency.

3.6. Visual Result

To demonstrate the segmentation performance of the proposed algorithm, partial visualisation results and error maps for the proposed algorithm, BiseNetV1, and BiseNetV2 are presented in Figure 7 and Figure 8, respectively. Figure 7 shows the input image (Figure 7a), the ground-truth label image (Figure 7b), the semantic segmentation results of BiseNetV1 (Figure 7c) and BiseNetV2 (Figure 7d), and the result of the proposed algorithm (Figure 7e). Different colours represent different categories, and the blue boxes indicate where the segmentation results of the three algorithms differ. As illustrated in Figure 7, in the first and third rows the boundaries of small-scale objects such as poles and traffic signs appear blurred and the main body of the poles is partly missing, whereas with the proposed algorithm the contours are complete and the boundaries are clearer and smoother. In the second and fifth rows, for high-frequency, large-scale categories such as fences and roads, the proposed algorithm segments the roadside fences better and achieves complete ground segmentation. In the first, fourth, and fifth rows, the proposed model distinguishes pedestrians from vehicles with high accuracy, with clearly delineated contour boundaries.
Figure 8 shows the input image (Figure 8a) and the segmentation error maps of BiseNetV1 (Figure 8b), BiseNetV2 (Figure 8c), and the proposed algorithm (Figure 8d). White areas denote correct classifications and black areas denote erroneous classifications. Comparing the proportion of black and white areas inside the red boxes with the input image shows that, after adding the proposed modules, the target shapes are significantly more discernible, the boundaries are markedly smoother, and the segmentation results are notably more refined.
The experimental results demonstrate that the proposed algorithm effectively compensates for the loss of spatial feature information, further refines feature edges, and enhances the recognition of long and large objects. At the same time, the context-representation capability of the features is enhanced and the loss of small-target information is reduced, leading to more exact segmentation results and favourable semantic segmentation performance.

4. Conclusions

This paper addresses the limitations of existing semantic segmentation algorithms by proposing a lightweight multi-branch network for real-time semantic segmentation. First, the algorithm captures irregular features and supplementary edge information through an auxiliary branch, enhancing its ability to recognise long and large targets; to improve segmentation precision without slowing inference, the auxiliary branch is built from asymmetric convolutions. Second, the alignment-and-fusion module is designed to guide and fuse the feature maps of multiple branches, alleviating the feature misalignment that occurs in multi-branch fusion and improving the recovery of small-target details. Finally, to account for the importance of global information, a global context module is added at the final stage of the semantic branch. These structures are integrated and jointly optimised so that the algorithm achieves strong semantic segmentation performance. Experiments on the Cityscapes and CamVid datasets show that the proposed algorithm balances segmentation accuracy and inference speed, achieving a mean intersection over union (mIoU) of 77.1% at 127 frames per second (FPS) on Cityscapes and 77.4% at 112 FPS on CamVid. In future work, we will examine the algorithm in more detail and improve the semantic branch to further increase the model's accuracy.

Author Contributions

Conceptualization, J.L.; Methodology, J.L.; Funding acquisition, J.L.; investigation, J.L.; Project administration, J.L.; Supervision, J.L.; Data curation, B.Z.; Resources, B.Z.; Software, B.Z.; Visualization, B.Z.; Writing—original draft, B.Z.; Formal analysis, M.T.; Validation, M.T.; Writing—review, J.L., B.Z. and M.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of the Heilongjiang Provincial Department of Transport, grant number LH2024B002.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  2. Ren, F.; Zhou, H.; Yang, L.; Liu, F.; He, X. ADPNet: Attention based dual path network for lane detection. J. Vis. Commun. Image Represent. 2022, 87, 103574. [Google Scholar] [CrossRef]
  3. Huang, X.; Wang, P.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The apolloscape open dataset for autonomous driving and its application. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2702–2719. [Google Scholar] [CrossRef] [PubMed]
  4. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  5. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  6. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  7. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar] [CrossRef]
  8. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  9. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  10. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  11. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
  12. Zhang, Z.; Pfister, T. Learning fast sample re-weighting without reward data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 725–734. [Google Scholar]
  13. Tian, S.; Yao, G.; Chen, S. Faster SCDNet: Real-Time Semantic Segmentation Network with Split Connection and Flexible Dilated Convolution. Sensors 2023, 23, 3112. [Google Scholar] [CrossRef] [PubMed]
  14. Chen, Y.; Zhang, S.; Liu, J.; Li, B. Towards a Deep Learning Approach for Detecting Malicious Domains. In Proceedings of the 2018 IEEE International Conference on Smart Cloud (SmartCloud), New York, NY, USA, 21–23 September 2018; pp. 190–195. [Google Scholar]
  15. Hu, X.; Jiang, Y.; Tang, K.; Chen, J.; Miao, C.; Zhang, H. Learning to segment the tail. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14045–14054. [Google Scholar]
  16. Gao, G.; Xu, G.; Li, J.; Yu, Y.; Lu, H.; Yang, J. FBSNet: A fast bilateral symmetrical network for real-time semantic segmentation. IEEE Trans. Multimed. 2022, 50, 1609–1620. [Google Scholar] [CrossRef]
  17. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  18. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar]
  19. Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272. [Google Scholar] [CrossRef]
  20. Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. ICNet for Real-Time Semantic Segmentation on High-Resolution Images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  21. Hu, P.; Perazzi, F.; Heilbron, F.C.; Wang, O.; Lin, Z.; Saenko, K.; Sclaroff, S. Real-Time Semantic Segmentation with Fast Attention. In Proceedings of the International Conference on Robotics and Automation, Xi’an, China, 30 May–5 June 2021. [Google Scholar]
  22. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation; Springer: Cham, Switzerland, 2018. [Google Scholar]
  23. Wang, H.; Jiang, X.; Ren, H.; Hu, Y.; Bai, S. SwiftNet: Real-time Video Object Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  24. Yang, Q.; Chen, T.; Fan, J.; Lu, Y.; Chi, Q. EADNet: Efficient Asymmetric Dilated Network for Semantic Segmentation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
  25. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation; Springer: Cham, Switzerland, 2018. [Google Scholar]
  26. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
  27. He, J.; Deng, Z.; Zhou, L.; Wang, Y.; Qiao, Y. Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7519–7528. [Google Scholar]
  28. Li, X.; You, A.; Zhu, Z.; Zhao, H.; Yang, M.; Yang, K.; Tan, S.; Tong, Y. Semantic flow for fast and accurate scene parsing. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer: Cham, Switzerland, 2020; pp. 775–793. [Google Scholar]
  29. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  30. Brostow, G.J.; Shotton, J.; Fauqueur, J.; Cipolla, R. Segmentation and recognition using structure from motion point clouds. In Proceedings of the Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008; Proceedings, Part I 10. Springer: Berlin/Heidelberg, Germany, 2008; pp. 44–57. [Google Scholar]
  31. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  33. Poudel, R.P.; Liwicki, S.; Cipolla, R. Fast-scnn: Fast semantic segmentation network. arXiv 2019, arXiv:1902.04502. [Google Scholar]
  34. Li, G.; Yun, I.; Kim, J.; Kim, J. DABNet: Depth-Wise Asymmetric Bottleneck for Real-Time Semantic Segmentation. 2021. Available online: http://arxiv.org/pdf/1907.11357.pdf (accessed on 12 January 2024).
  35. Li, H.; Xiong, P.; Fan, H.; Sun, J. Dfanet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9522–9531. [Google Scholar]
  36. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725. [Google Scholar]
  37. Hao, S.; Zhou, Y.; Guo, Y.; Hong, R.; Cheng, J.; Wang, M. Real-time semantic segmentation via spatial-detail guided context propagation. IEEE Trans. Neural Netw. Learn. Syst. 2022, 24, 1–12. [Google Scholar] [CrossRef] [PubMed]
  38. Wu, Y.; Jiang, J.; Huang, Z.; Tian, Y. FPANet: Feature pyramid aggregation network for real-time semantic segmentation. Appl. Intell. 2022, 52, 3319–3336. [Google Scholar] [CrossRef]
  39. Yang, Z.; Yu, H.; Fu, Q.; Sun, W.; Jia, W.; Sun, M.; Mao, Z.H. NDNet: Narrow while deep network for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2020, 22, 5508–5519. [Google Scholar] [CrossRef]
  40. Chen, Y.; Zhan, W.; Jiang, Y.; Zhu, D.; Guo, R.; Xu, X. LASNet: A light-weight asymmetric spatial feature network for real-time semantic segmentation. Electronics 2022, 11, 3238. [Google Scholar] [CrossRef]
  41. Kim, M.; Park, B.; Chi, S. Accelerator-aware fast spatial feature network for real-time semantic segmentation. IEEE Access 2020, 8, 226524–226537. [Google Scholar] [CrossRef]
  42. Wang, Y.; Zhou, Q.; Liu, J.; Xiong, J.; Gao, G.; Wu, X.; Latecki, L.J. Lednet: A lightweight encoder-decoder network for real-time semantic segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1860–1864. [Google Scholar]
Figure 1. Schematic comparison of network structure.
Figure 2. Overall structure of the algorithm in this paper.
Figure 3. Auxiliary branch structure.
Figure 4. Alignment-and-fusion module structure.
Figure 5. Global context module.
Figure 6. Comparison of precision and speed of lightweight network.
Figure 7. Semantic segmentation results of BiseNetV1, BiseNetV2, and our algorithm on the Cityscapes dataset. Notation: The blue boxes indicate the different segmentation results produced by the three algorithms.
Figure 8. Error graphs of BiseNetV1, BiseNetV2, and the algorithm in this paper on the Cityscapes dataset. Notation: The red boxes indicate the different segmentation results produced by the three algorithms.
Table 1. Instantiation of the detail branch, the auxiliary branch, and the semantic branch. Each stage S comprises one or more operations. Each operation is characterised by a kernel size (k), a step (s), and a number of output channels (c), which are repeated a specified number of times (r). The expansion factor (e) is applied to expand the channel number of the operation. Notation: 'Conv2d' and 'asymmetric Conv2d' refer to the convolutional and asymmetric convolutional layers, respectively, each followed by a batch normalization layer and a ReLU activation function. "stem" denotes the stem block. The GE layer is responsible for gathering and expansion. The global context module (GCM) processes global context information.

Stage | Detail Branch | Auxiliary Branch | Semantic Branch | Output Size
S1 | Conv2d, k = 3, c = 64, s = 2, r = 1; Conv2d, k = 3, c = 64, s = 1, r = 1 | asymmetric conv2d, k = 1 × 3, c = 64, s = 2, r = 1; asymmetric conv2d, k = 3 × 1, c = 64, s = 2, r = 1 | stem, k = 3, c = 16, e = _, s = 4, r = 1 | 256 × 512
S2 | Conv2d, k = 3, c = 64, s = 2, r = 1; Conv2d, k = 3, c = 64, s = 1, r = 2 | asymmetric conv2d, k = 1 × 3, c = 64, s = 1, r = 1; asymmetric conv2d, k = 3 × 1, c = 64, s = 1, r = 1 | – | 128 × 256
S3 | Conv2d, k = 3, c = 128, s = 2, r = 1; Conv2d, k = 3, c = 128, s = 1, r = 2 | asymmetric conv2d, k = 1 × 3, c = 128, s = 2, r = 1; asymmetric conv2d, k = 3 × 1, c = 128, s = 2, r = 1 | GE, k = 3, c = 32, e = 6, s = 2, r = 1; GE, k = 3, c = 32, e = 6, s = 1, r = 1 | 64 × 128
S4 | – | – | GE, k = 3, c = 64, e = 6, s = 2, r = 1; GE, k = 3, c = 64, e = 6, s = 1, r = 1 | 32 × 64
S5 | – | – | GE, k = 3, c = 128, e = 6, s = 2, r = 1; GE, k = 3, c = 128, e = 6, s = 1, r = 3; GCM, k = 3, c = 128, e = _, s = 1, r = 1 | 16 × 32
Table 2. Results of different combinations on Cityscapes dataset.

Model | mIoU% | Running Speed (Frames·s−1) | Parameters (M)
Baseline | 72.60 | 156.00 | -
+Auxiliary Branch | 74.30 | 139.00 | 3.74
+AAFM | 74.40 | 137.00 | 4.25
+CGM | 73.30 | 150.00 | 3.37
+Auxiliary Branch + AAFM + CGM | 77.10 | 127.00 | 4.65
Bold indicates the best performing value in this column.
Table 3. Fourteen algorithms compared on the Cityscapes dataset.

Network Type | Network Name | Pre-Training | mIoU% | Running Speed (Frames·s−1) | Resolution | Parameters (M)
Large scale | SegNet [17] | Y | 57.0 | 17 | 640 × 360 | 29.5
Large scale | PSPNet [6] | Y | 81.2 | <1 | 713 × 713 | 65.7
Large scale | DeepLabV2 [8] | Y | 70.4 | <1 | 512 × 1024 | 44
Light weight | Fast-SCNN [33] | N | 68 | 123.5 | 1024 × 2048 | 0.4
Light weight | ESPNet [22] | Y | 60.3 | 112.9 | 512 × 1024 | 0.4
Light weight | DABNet [34] | Y | 70.1 | 27.7 | 1024 × 2048 | -
Light weight | DFANet A [35] | Y | 71.3 | 100 | 1024 × 2048 | -
Light weight | STDC-Seg50 [36] | Y | 73.4 | 156.6 | 512 × 1024 | 12.3
Light weight | SGCPNet [37] | Y | 70.9 | 103.7 | 1024 × 2048 | 0.61
Light weight | FPANet [38] | Y | 75.9 | 31 | 1024 × 2048 | 2.59
Light weight | SwiftNet [23] | Y | 75.4 | 39.9 | 1024 × 2048 | 11.8
Light weight | Faster:SCDNet [13] | Y | 77.6 | 66.1 | 1024 × 2048 | 17.8
Light weight | BiseNetV1 [25] | Y | 68.4 | 105.8 | 786 × 1536 | 5.8
Light weight | BiseNetV2 [26] | N | 72.6 | 156.0 | 512 × 1024 | -
Light weight | Ours | Y | 77.1 | 127.0 | 512 × 1024 | 4.65
Bold indicates the best performing value in this column.
Table 4. Six algorithms compared on the CamVid dataset.

Network Name | mIoU% | Running Speed (Frames·s−1) | Resolution
DFANet A [35] | 64.7 | 120 | 960 × 720
STDC-Seg50 [36] | 73.9 | 152.2 | 720 × 960
FPANet [38] | 72.9 | 88 | 720 × 960
Faster:SCDNet [13] | 74.2 | 154.2 | 720 × 960
BiseNetV1 [25] | 65.6 | 175 | 960 × 720
BiseNetV2 [26] | 72.4 | 124.5 | 960 × 720
Ours | 77.4 | 112.0 | 960 × 720
Bold indicates the best performing value in this column.
Table 5. Comparison of IoU% among different categories in the Cityscapes dataset.

Network Name | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation | Terrain
SegNet [17] | 96.4 | 73.2 | 84.0 | 28.4 | 29.0 | 35.7 | 39.8 | 45.1 | 87.0 | 63.8
ENet [18] | 96.3 | 74.3 | 75.0 | 32.2 | 33.2 | 43.4 | 34.1 | 44.0 | 88.6 | 61.4
ESPNet [22] | 95.7 | 73.3 | 86.6 | 32.8 | 36.4 | 47.0 | 46.9 | 55.4 | 89.8 | 66.0
NDNet [39] | 96.6 | 75.2 | 87.2 | 44.2 | 46.1 | 29.6 | 40.4 | 53.3 | 87.4 | 57.9
LASNet [40] | 97.1 | 80.3 | 89.1 | 64.5 | 58.8 | 48.6 | 48.5 | 62.6 | 89.9 | 62.0
FSFNet [41] | 97.7 | 81.1 | 90.2 | 41.7 | 47.0 | 47.0 | 61.1 | 65.3 | 91.8 | 69.3
ERFNet [19] | 97.9 | 82.1 | 90.7 | 45.2 | 50.4 | 59.0 | 62.6 | 68.4 | 91.9 | 69.4
LEDNet [42] | 98.1 | 79.5 | 91.6 | 47.7 | 49.9 | 62.8 | 61.3 | 72.8 | 92.6 | 61.2
BiseNetV2 [26] | 98.2 | 82.9 | 91.7 | 44.5 | 51.1 | 63.5 | 71.3 | 75.0 | 92.9 | 71.1
Ours | 97.9 | 83.8 | 92.3 | 63.7 | 63.8 | 64.9 | 63.1 | 77.5 | 92.4 | 63.0

Network Name | Sky | Pedestrian | Rider | Car | Truck | Bus | Train | Bicycle | Motorcycle | mIoU
SegNet [17] | 91.8 | 62.8 | 42.8 | 89.3 | 38.1 | 43.1 | 44.1 | 35.8 | 51.9 | 55.6
ENet [18] | 90.6 | 65.5 | 38.4 | 90.6 | 36.9 | 50.5 | 48.1 | 38.8 | 55.4 | 58.3
ESPNet [22] | 92.5 | 68.5 | 45.9 | 89.9 | 40.0 | 47.7 | 40.7 | 36.4 | 54.9 | 60.3
NDNet [39] | 90.2 | 62.6 | 41.6 | 88.5 | 57.8 | 63.7 | 35.1 | 31.9 | 59.4 | 60.6
LASNet [40] | 91.8 | 70.8 | 51.3 | 91.1 | 77.3 | 81.7 | 69.2 | 48.0 | 65.8 | 70.9
FSFNet [41] | 94.2 | 77.8 | 57.8 | 92.8 | 47.3 | 64.4 | 59.4 | 53.1 | 66.2 | 65.3
ERFNet [19] | 94.2 | 78.5 | 59.8 | 93.4 | 52.3 | 60.8 | 53.7 | 49.9 | 64.2 | 69.7
LEDNet [42] | 94.9 | 76.2 | 53.7 | 90.9 | 64.4 | 64.0 | 52.7 | 44.4 | 71.6 | 70.6
BiseNetV2 [26] | 94.9 | 83.6 | 65.4 | 94.9 | 60.5 | 68.7 | 56.8 | 61.5 | 51.9 | 72.6
Ours | 94.8 | 81.0 | 58.5 | 94.3 | 80.6 | 83.8 | 78.0 | 57.2 | 76.5 | 77.1
Bold indicates the best performing value in this column.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
