1. Introduction
Semantic segmentation is a crucial computer vision task that aims to obtain a comprehensive understanding of image content by accurately assigning each pixel in the image to a specific semantic category. This technology has broad application prospects in various fields, including human–computer interaction [1], autonomous driving [2,3], video surveillance [4], and medical image processing [5]. However, such application scenarios often require systems with fast inference speeds and high-precision semantic segmentation to realize real-time user interaction and response. Achieving this balance between fast inference speed and high segmentation accuracy presents a significant challenge in real-time semantic segmentation.
In recent years, semantic segmentation has advanced significantly, driven by the fully convolutional network (FCN) [6]. Notably, FCN pioneered the use of a fully convolutional structure to perform pixel-level classification of images. Following FCN, several strong semantic segmentation methods have been introduced. UNet [7] employs a U-shaped network structure and dense skip connections; this design allows UNet to retain intricate details during image processing while effectively capturing global semantic information. SegNet [8] employs a symmetric encoder–decoder structure that recovers the encoder's low-resolution feature maps using the stored max-pooling indices and top–down lateral connections, ensuring that feature-map information is retained. PSPNet [9] employs a pyramid pooling module that effectively extracts contextual information across various scales, enhancing the network's ability to perceive global information. Chen et al. [10] utilized dilated convolutions to expand the network's receptive field, allowing broader contextual information to be extracted, and employed Conditional Random Fields (CRFs) to improve segmentation accuracy. DeepLabv3+ [11] builds on a pyramid of dilated convolutions, utilizing different dilation rates to expand the network's receptive field while aggregating rich multiscale contextual information. HRNet [12] employs a multi-branch structure, with parallel branches that maintain high image resolution even as the network depth increases; consequently, spatial detail and multiscale information are effectively extracted. DANet [13] leverages spatial and channel attention modules to capture global feature dependencies across both spatial and channel dimensions. CCNet [14] introduces a criss-cross attention module, which extracts cross-path feature information for each pixel across various directions and captures dense global contextual information. However, to enhance segmentation accuracy, high-precision semantic segmentation networks rely upon complex backbone networks such as ResNet [15] and VGG [16]. Although these complex backbones provide powerful feature representations, they also incur significant computational costs, which reduces inference speed. Furthermore, these high-precision networks often focus on maximizing segmentation accuracy while neglecting the equally important inference speed. As a result, they are unsuitable for applications that require real-time performance.
To optimize inference speed, numerous real-time semantic segmentation networks adopt lightweight networks as their backbone. This strategic choice aims to balance efficient real-time execution with satisfactory segmentation accuracy, without increasing computational costs. MobileNet [17], a lightweight classification network, introduced the depth-wise separable convolution. This design significantly reduces the network's parameters and computational complexity, facilitating fast inference while preserving satisfactory accuracy. ShuffleNet [18] is another notable lightweight classification network; it employs channel shuffle operations to reduce computational complexity, thereby enhancing the network's computational efficiency. EfficientNet [19] employs a novel compound scaling method that uses a single coefficient to scale network depth, width, and resolution in a balanced manner, optimizing performance while ensuring computational efficiency. However, lightweight classification networks often have limited feature extraction capabilities, potentially hindering the extraction of rich spatial and contextual information from the input image. Other authors [20,21] have explored reducing the size of the input images to further improve inference speed, since smaller input resolutions significantly reduce computational complexity and inference time. However, lower resolutions can cause a significant loss of boundary information, which in turn reduces segmentation accuracy. The studies in references [8,22] increased inference speed by decreasing the number of channels in the network, because fewer channels translate to reduced computational cost. However, reducing channels may weaken a model's ability to represent spatial information.
In addition, various other notable network architectures have emerged. ICNet [23] employs a three-level cascade architecture that effectively integrates detail and semantic information. ESPNet [24] introduced an efficient spatial pyramid module that decomposes standard convolution into a pointwise convolution and a spatial pyramid of dilated convolutions, effectively reducing the model's parameter count and computational complexity. CGNet [25] employs an effective context-guided block to efficiently model and extract global contextual information. Fast-SCNN [26] extracts shallow features efficiently through a learning-to-downsample module shared by its two branches. BiSeNet [27] employs a dual-branch architecture designed to preserve shallow spatial details while simultaneously extracting deep semantic information: the detail branch extracts spatial detail using shallow, wide channels, and the semantic branch extracts semantic information via deep, narrow channels. BiSeNetV2 [28] also employs a dual-branch structure, with one path dedicated to detailed spatial information and the other to categorical semantic information. STDC-Seg [29] builds upon BiSeNet [27] to achieve an improved balance between real-time inference speed and segmentation accuracy, leveraging a detail-guidance approach to extract low-level detail information at minimal computational cost. DDRNet [30] separates the backbone into two parallel deep branches: the first generates high-resolution feature maps, and the second extracts semantic information through multiple downsampling layers. Multiple bilateral connections integrate information from both branches, facilitating high performance in semantic segmentation tasks.
Image segmentation tasks are inherently complex because of the existence of multiple segmentation objects with different scales. This variability can result in different objects of the same category appearing at various sizes in the visual scene. During pixel-level classification in semantic segmentation, relying on a single scale makes it difficult for the model to accurately perceive targets of different sizes, which reduces the network's robustness and generalizability. Furthermore, the continuous downsampling of images during the segmentation process can lead to a significant loss of spatial detail information. To address the deficiencies in semantic feature diversity and the loss of spatial detail information in real-time semantic segmentation networks, this paper proposes a Multiscale Context Pyramid Pooling and Spatial Detail Enhancement Network (BMSeNet), which employs a dual-branch network structure. To address the issue of limited semantic feature diversity, the proposed model incorporates a Multiscale Context Pyramid Pooling Module (MSCPPM). In addition, a Spatial Detail Enhancement Module (SDEM) was established to improve the extraction of spatial detail information. To seamlessly integrate spatial detail and semantic information, a Bilateral Attention Fusion Module (BAFM) was constructed. The primary contributions of this paper are as follows:
The design of the Multiscale Context Pyramid Pooling Module (MSCPPM) addresses the limited semantic feature diversity in segmentation tasks. The proposed MSCPPM realizes this by employing multiple pooling operations at different scales. This approach increases the network’s receptive field and facilitates the extraction of abundant local and global feature information from the input image. Consequently, it enables the capture of abundant multiscale contextual information, enhancing the model’s ability to perceive and process information at different scales.
In this paper, we introduce the Spatial Detail Enhancement Module (SDEM), which accurately acquires image position information through global average pooling operations in various directions. This module compensates for the loss of spatial detail information that occurs during the continuous downsampling process, thereby improving the model’s ability to effectively perceive spatial detail.
The Bilateral Attention Fusion Module (BAFM) effectively fuses spatial detail information and global contextual information. This is achieved by reasonably guiding the network to allocate weights based on the positional correlation between pixels in two branches, which improves the effectiveness of feature fusion.
To validate the efficacy of the proposed BMSeNet, comprehensive experiments were performed on two datasets. The results demonstrate that BMSeNet achieves 76.4% mIoU at 60.2 FPS on the Cityscapes test set and 72.9% mIoU at 78 FPS on the CamVid test set.
3. Proposed Method
3.1. Overall Architecture
Despite the impressive performance of BiSeNet-based segmentation networks, two pressing issues remain. First, relying on a single scale for semantic features reduces the network's generalization and robustness. Second, the loss of spatial detail information decreases segmentation accuracy. To address these limitations, we propose a network called BMSeNet, whose overall architecture is depicted in Figure 4. As presented in Figure 4, BMSeNet follows a dual-branch network structure. The semantic branch, comprising the MSCPPM and a ResNet18 [15] backbone, extracts rich semantic information. In contrast, the detail branch comprises the SDEM and the spatial path from BiSeNet, and prioritizes extracting abundant spatial detail information. Beyond the input and prediction layers, the overall network architecture includes five stages, from stage 1 to stage 5. Within each stage, BMSeNet employs a downsampling operation that halves the input resolution using a stride of 2. A detailed explanation of the entire implementation process of BMSeNet is provided below.
BMSeNet begins by feeding the input image into its backbone network, which is designed to extract rich semantic information. After the image features pass through the backbone network, the MSCPPM is applied; it uses various pooling operations to enlarge the receptive field and efficiently gather diverse multiscale contextual information. In addition, the Attention Refinement Module (ARM) enhances the output features of the final two stages. In the detail branch, after three Conv + BN + ReLU operations, the Spatial Detail Enhancement Module is introduced to extract abundant spatial detail information, which enhances the model's perception of spatial details. Then, the features extracted by the two branches are combined through the BAFM, which seamlessly merges the informative details extracted by both branches and thereby improves the overall segmentation performance of the network. The feature map produced by the BAFM is one-eighth of the original input size. Finally, this feature map undergoes an eightfold enlargement back to the original size, enabling the final prediction.
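To make this data flow concrete, the following is a minimal PyTorch sketch of the pipeline, assuming the MSCPPM, SDEM, and BAFM module sketches given in the subsections below and a torchvision ResNet18 trunk. The channel widths are illustrative, and the ARM and any auxiliary heads are omitted for brevity; this is a reading aid under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


def conv_bn_relu(cin, cout, k=3, s=2):
    # Conv + BN + ReLU block; stride 2 halves the spatial resolution
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )


class BMSeNetSketch(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Detail branch: three Conv + BN + ReLU stages (output at 1/8), then SDEM
        self.detail = nn.Sequential(
            conv_bn_relu(3, 64), conv_bn_relu(64, 64), conv_bn_relu(64, 128)
        )
        self.sdem = SDEM(128)           # sketched in Section 3.3
        # Semantic branch: ResNet18 trunk (output at 1/32), then MSCPPM
        r = resnet18(weights=None)
        self.backbone = nn.Sequential(
            r.conv1, r.bn1, r.relu, r.maxpool, r.layer1, r.layer2, r.layer3, r.layer4
        )
        self.mscppm = MSCPPM(512, 128)  # sketched in Section 3.2
        self.bafm = BAFM(128)           # sketched in Section 3.4
        self.head = nn.Conv2d(128, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = self.sdem(self.detail(x))      # spatial detail features at 1/8
        s = self.mscppm(self.backbone(x))  # multiscale semantic features at 1/32
        s = F.interpolate(s, size=d.shape[-2:], mode='bilinear', align_corners=False)
        y = self.bafm(d, s)                # bilateral attention fusion at 1/8
        # eightfold upsampling back to the input resolution for the final prediction
        return F.interpolate(self.head(y), size=x.shape[-2:],
                             mode='bilinear', align_corners=False)
```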
Table 1 summarizes the detailed structure of BMSeNet. In this table, Dopr represents the convolution and SDEM operations in the detail branch, and Sopr encompasses the convolution, max pooling, and MSCPPM operations in the semantic branch. Here, 'C' denotes the number of channels used at each stage, and 'S' represents the stride value employed during downsampling.
3.2. Multiscale Context Pyramid Pooling Module (MSCPPM)
In semantic segmentation, the context aggregation module is a critical component that significantly improves segmentation performance. By expanding the model's receptive field, it becomes feasible to consider a broader range of contextual information and effectively grasp the long-range dependencies between pixels, which enhances the model's global perceptual capability. At the same time, when dealing with images containing multiscale structures, the module can effectively perceive and capture the multiscale feature information present in the image. When processing images with closely related semantic content, it enables accurate extraction of dense semantic relationships between different regions, which improves segmentation results.
The MSCPPM is employed to enhance the accuracy of contextual information extraction from images. This allows for more effective extraction of rich multiscale contextual information, which improves the segmentation performance of the model. To preserve inference speed while enhancing accuracy, the MSCPPM is connected to the backbone's output feature map, whose resolution is 1/32 of the original image. Since the input feature resolution of the MSCPPM is only 1/32 of the original image resolution, the amount of computation is significantly reduced, resulting in a lower impact on inference speed. Additionally, 1 × 1 group convolution is used: by grouping the input feature map and performing independent convolution operations on each group, the number of parameters and the computational complexity are reduced, effectively improving inference speed. The structure of the MSCPPM is depicted in Figure 5.
Within this module, the terms Pooling, Strip Pooling, UP, and GConv refer, respectively, to pooling, strip pooling, upsampling, and group convolution with a group size of 2. Initially, the input feature map undergoes a 1 × 1 group convolution to reduce dimensionality and improve computational efficiency. The resulting feature map is processed by a 1 × 1 convolution to generate a new feature map. Simultaneously, a set of pooling operations with kernel sizes {3 × 3, 5 × 5, 9 × 9} is applied, each followed by a 1 × 1 convolution and an upsampling operation that resizes the pooled feature maps back to the size of the input feature map (1/32 of the original image). In addition, global contextual information is extracted in parallel through a global average pooling layer, followed by a 1 × 1 convolution and upsampling. At the same time, the module performs strip pooling in parallel, likewise followed by a 1 × 1 convolution and upsampling; this captures long-distance relationships within isolated regions while integrating both global and local contextual information. To maximize the utilization of feature information, the six aforementioned branches are connected in parallel, with each branch contributing sequentially to the next one, starting from the bottom. Feature information from various directions and scales thus merges and interacts, and a 3 × 3 convolution is applied to fuse the rich multiscale contextual information. Mathematically, if we represent the input features as $X$ and the reduced features produced by the initial 1 × 1 group convolution as $X_r$, the output features $F_i$ generated by each parallel branch of the MSCPPM can be expressed as follows:

$$F_i = \begin{cases} C_{1\times1}(X_r), & i = 1,\\ C_{3\times3}\big(U\big(C_{1\times1}(P_j(X_r))\big) + F_{i-1}\big), & i = 2, 3, 4,\\ C_{3\times3}\big(U\big(C_{1\times1}(G(X_r))\big) + F_{i-1}\big), & i = 5,\\ C_{3\times3}\big(U\big(C_{1\times1}(S(X_r))\big) + F_{i-1}\big), & i = 6, \end{cases}$$

where $C_{1\times1}$ represents the 1 × 1 convolution, $C_{3\times3}$ represents the 3 × 3 convolution, $U$ indicates the upsampling operation, $G$ expresses the global average pooling, $S$ signifies strip pooling, and $P_j$ denotes a pooling operation with kernel size $j \in \{3, 5, 9\}$ for branches $i = 2, 3, 4$, respectively.
Finally, the previously generated feature maps are combined via concatenation to form a unified feature map. After the concatenated feature map is processed with a 1 × 1 GConv, it is connected residually to the feature map $X_r$ that proceeds into the parallel branches. The output feature $Y$ of the MSCPPM can then be expressed as follows:

$$Y = C^{g}_{1\times1}\big(\mathrm{Concat}(F_1, F_2, \ldots, F_6)\big) + X_r,$$

where $\mathrm{Concat}(\cdot)$ represents the concatenation of the output features of each branch $F_i$, and $C^{g}_{1\times1}$ denotes the 1 × 1 group convolution.
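For concreteness, below is a minimal PyTorch sketch of the MSCPPM consistent with the description and equations above. The pooling strides, the shared strip-pooling convolution, and the absence of normalization inside the branches are assumptions, as these details are not fully specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSCPPM(nn.Module):
    # Sketch of the Multiscale Context Pyramid Pooling Module: six parallel
    # branches aggregated bottom-up, with a residual connection to X_r.
    def __init__(self, in_ch: int, mid_ch: int):
        super().__init__()
        self.gconv_in = nn.Conv2d(in_ch, mid_ch, 1, groups=2, bias=False)  # 1x1 GConv, dim. reduction
        self.branch1 = nn.Conv2d(mid_ch, mid_ch, 1, bias=False)            # F1: plain 1x1 branch
        # F2..F4: pooling branches with kernel sizes {3, 5, 9} (strides assumed)
        self.pools = nn.ModuleList(
            [nn.AvgPool2d(k, stride=s, padding=k // 2) for k, s in ((3, 2), (5, 4), (9, 8))]
        )
        self.pconvs = nn.ModuleList([nn.Conv2d(mid_ch, mid_ch, 1, bias=False) for _ in range(3)])
        self.gap_conv = nn.Conv2d(mid_ch, mid_ch, 1, bias=False)    # F5: global average pooling
        self.strip_conv = nn.Conv2d(mid_ch, mid_ch, 1, bias=False)  # F6: strip pooling
        # 3x3 convolutions fusing each branch with the previous one, bottom-up
        self.fuse = nn.ModuleList([nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False) for _ in range(5)])
        self.gconv_out = nn.Conv2d(6 * mid_ch, mid_ch, 1, groups=2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[-2:]
        up = lambda t: F.interpolate(t, size=size, mode='bilinear', align_corners=False)
        xr = self.gconv_in(x)                                       # X_r
        feats = [self.branch1(xr)]                                  # F1
        for pool, conv, fuse in zip(self.pools, self.pconvs, self.fuse[:3]):
            feats.append(fuse(up(conv(pool(xr))) + feats[-1]))      # F2..F4
        g = up(self.gap_conv(F.adaptive_avg_pool2d(xr, 1)))         # global context
        feats.append(self.fuse[3](g + feats[-1]))                   # F5
        strips = up(self.strip_conv(xr.mean(dim=2, keepdim=True))) \
               + up(self.strip_conv(xr.mean(dim=3, keepdim=True)))  # 1xW and Hx1 strips
        feats.append(self.fuse[4](strips + feats[-1]))              # F6
        return self.gconv_out(torch.cat(feats, dim=1)) + xr         # residual to X_r
```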
3.3. Spatial Detail Enhancement Module (SDEM)
Spatial detail information is crucial in semantic segmentation; it encompasses specific local micro-features of the image, such as the position, size, and edges of objects. At the same time, preserving object boundaries and achieving accurate spatial positioning requires capturing essential spatial detail information. By accurately capturing spatial detail information, the model can effectively differentiate between various semantic categories and precisely delineate their boundaries, enhancing its performance in complex scenes with numerous objects. Furthermore, the accurate acquisition of spatial details directly improves the overall segmentation performance, leading to higher-precision segmentation.
To improve the accuracy of the spatial detail information extracted from images, this study employs the SDEM. By accurately computing the position of each object, we can extract spatial details from the image more effectively; this is achieved by multiplying the horizontal and vertical spatial features. In addition to improving accuracy, the inference speed of the SDEM is considered. Specifically, a 3 × 3 depth-wise separable convolution is employed, which reduces computational complexity and the number of parameters by decomposing the convolution into a depth-wise and a pointwise convolution. Compared with traditional convolution, depth-wise separable convolution provides higher computational efficiency, improving the model's inference speed. Moreover, the SDEM processes three distinct branches in parallel, further enhancing computational efficiency and accelerating inference.
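As a quick illustration of this parameter saving (with an arbitrarily chosen channel count), the following compares a standard 3 × 3 convolution with its depth-wise separable factorization in PyTorch:

```python
import torch.nn as nn

# Parameter count of a standard 3x3 convolution vs. its depth-wise separable
# factorization (depth-wise 3x3 followed by point-wise 1x1), for 128 channels.
c = 128
standard = nn.Conv2d(c, c, 3, padding=1, bias=False)
dw_separable = nn.Sequential(
    nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # depth-wise: 3*3*128 weights
    nn.Conv2d(c, c, 1, bias=False),                       # point-wise: 128*128 weights
)
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(standard), params(dw_separable))  # 147456 vs. 17536 (~8.4x fewer)
```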
Figure 6 presents an overview of the SDEM, which processes the input features through three distinct branches: a depth-wise separable convolution enhances the feature representation while reducing computational complexity, generating an enhanced feature map; horizontal global pooling captures horizontal spatial features; and vertical global pooling captures vertical spatial features. In contrast to CA [33], which simply concatenates the horizontal and vertical spatial features, the SDEM applies matrix multiplication between them, resulting in a feature map with positional information. This approach better captures the relationships between pixels at different locations, enhancing the module's ability to model spatial relationships. After the positional information is captured, a sigmoid function is applied to generate pixel-wise weights. These weights are then multiplied with the enhanced feature map, yielding a weight-guided feature map. Finally, the weight-guided feature map is added element-wise to the enhanced feature map to obtain the final output of the SDEM. This can be expressed mathematically as follows:
$$F_e = D_{3\times3}(X),$$

$$Y = F_e \odot \sigma\big(H(X) \times V(X)\big) + F_e,$$

where $X$ represents the input to the SDEM, $\sigma$ denotes the sigmoid function, and $Y$ indicates the output after processing through the SDEM. $D_{3\times3}$ refers to the 3 × 3 depth-wise separable convolution, $H$ signifies the global average pooling performed horizontally, $V$ denotes the global average pooling executed vertically, $\times$ denotes matrix multiplication, and $\odot$ denotes element-wise multiplication.
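A minimal PyTorch sketch of the SDEM consistent with the equations above is given below; the normalization and activation inside the depth-wise separable convolution are assumptions.

```python
import torch
import torch.nn as nn


class SDEM(nn.Module):
    # Sketch of the Spatial Detail Enhancement Module: a depth-wise separable
    # branch plus two directional pooling branches, combined per the equations.
    def __init__(self, ch: int):
        super().__init__()
        self.dwsep = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),  # depth-wise 3x3
            nn.Conv2d(ch, ch, 1, bias=False),                        # point-wise 1x1
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fe = self.dwsep(x)                      # enhanced feature map F_e
        h = x.mean(dim=3, keepdim=True)         # horizontal GAP H(X): (N, C, H, 1)
        v = x.mean(dim=2, keepdim=True)         # vertical GAP V(X):   (N, C, 1, W)
        w = torch.sigmoid(torch.matmul(h, v))   # matrix product -> (N, C, H, W) position map
        return fe * w + fe                      # weight-guided map plus residual
```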
3.4. Bilateral Attention Fusion Module (BAFM)
Feature fusion involves merging features from various levels or branches. The dual-branch network architecture leverages a feature fusion module to effectively integrate spatial detail information and semantic information. Spatial detail information and semantic information offer distinct types of insights. Local micro-features in the image, such as object position, shape, and edges, contribute to spatial detail information. High-level abstract information about the overall context of an image is included in the semantic information, which covers a global understanding of object categories, scene semantics, and the relationships between objects. Spatial detail and semantic information hold different levels of importance. Simply summing elements or concatenating channels is insufficient to effectively merge features from various levels.
The BAFM is designed to facilitate a more effective integration of the spatial detail information and semantic information extracted by the network's dual branches. By considering the positional relationships between pixels, the model gains the ability to selectively focus on features from specific locations, enhancing the recognition of critical areas and significantly improving the quality of feature fusion. BAFM also takes inference speed into account. It employs a 1 × 1 convolution to reduce the number of channels in the fused feature map, thereby achieving dimensionality reduction; this effectively decreases the complexity of subsequent computations and enhances inference speed. Furthermore, BAFM processes feature information from the semantic and detail branches synchronously, enabling the simultaneous utilization of features at different levels, reducing computational redundancy, and further improving computational efficiency. As shown in Figure 7, the BAFM begins by concatenating the features $X_d$ and $X_s$ from the two branches. Next, to ensure a balance among features, BAFM applies a 1 × 1 convolution, batch normalization, and a ReLU activation function; the processed map is denoted as $F$. Unlike AFM [36], which uses global average pooling, BAFM decomposes this step into horizontal and vertical global pooling, allowing the extraction of accurate spatial location information across extended spatial dependencies. The feature map $F$ undergoes separate horizontal and vertical global pooling operations, capturing horizontal and vertical spatial features, respectively. The obtained horizontal and vertical spatial features are multiplied element-wise, producing a feature map with pixel-wise positional information. Subsequently, this feature map is processed with a 1 × 1 convolution, batch normalization, and a sigmoid activation function to produce weights. The weights are then multiplied with each branch's features. Finally, the two weighted features are concatenated, and further feature extraction is conducted using a 3 × 3 convolution, batch normalization, and a ReLU activation function to derive the final output feature map $Y$ of the BAFM. The detailed formulas for the entire process are as follows:
$$F = C^{r}_{1\times1}\big(\mathrm{Concat}(X_d, X_s)\big),$$

$$W = \sigma\Big(C^{b}_{1\times1}\big(H(F) \odot V(F)\big)\Big),$$

$$Y = C^{r}_{3\times3}\big(\mathrm{Concat}(X_d \odot W, X_s \odot W)\big),$$

where $C^{r}_{1\times1}$ denotes the 1 × 1 convolution followed by batch normalization and the ReLU activation function, $C^{r}_{3\times3}$ refers to the 3 × 3 convolution followed by batch normalization and the ReLU activation function, $\sigma$ indicates the sigmoid activation function, and $C^{b}_{1\times1}$ refers to the 1 × 1 convolution followed by batch normalization. $H$ denotes the global average pooling operation performed horizontally, $V$ denotes the global average pooling operation executed vertically, and $\odot$ denotes element-wise multiplication.
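A minimal PyTorch sketch of the BAFM consistent with these formulas follows, assuming for simplicity that both branch inputs carry the same number of channels.

```python
import torch
import torch.nn as nn


class BAFM(nn.Module):
    # Sketch of the Bilateral Attention Fusion Module: positional weights from
    # directional pooling gate both branches before the final fusion.
    def __init__(self, ch: int):
        super().__init__()
        self.balance = nn.Sequential(            # C^r_1x1 on Concat(X_d, X_s)
            nn.Conv2d(2 * ch, ch, 1, bias=False),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )
        self.to_weight = nn.Sequential(          # C^b_1x1 (sigmoid applied in forward)
            nn.Conv2d(ch, ch, 1, bias=False),
            nn.BatchNorm2d(ch),
        )
        self.out = nn.Sequential(                # C^r_3x3 on the weighted concatenation
            nn.Conv2d(2 * ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, xd: torch.Tensor, xs: torch.Tensor) -> torch.Tensor:
        f = self.balance(torch.cat([xd, xs], dim=1))         # balanced fused map F
        h = f.mean(dim=3, keepdim=True)                      # H(F): (N, C, H, 1)
        v = f.mean(dim=2, keepdim=True)                      # V(F): (N, C, 1, W)
        w = torch.sigmoid(self.to_weight(h * v))             # pixel-wise positional weights W
        return self.out(torch.cat([xd * w, xs * w], dim=1))  # output Y
```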