Article

MAS-Net: Multi-Attention Hybrid Network for Superpixel Segmentation

1 School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
2 Key Laboratory of Media Convergence Technology and Communication, Lanzhou 730070, China
3 Key Laboratory of Big Data and Artificial Intelligence in Transportation, Ministry of Education, Beijing Jiaotong University, Beijing 100044, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(8), 1000; https://doi.org/10.3390/sym16081000
Submission received: 27 June 2024 / Revised: 16 July 2024 / Accepted: 30 July 2024 / Published: 6 August 2024

Abstract

Superpixels, as essential mid-level image representations, have been widely used in computer vision due to their computational efficiency and redundancy reduction. Compared with traditional superpixel methods, superpixel algorithms based on deep learning frameworks demonstrate significant advantages in segmentation accuracy. However, existing deep learning-based superpixel algorithms lose detail through the convolution and upsampling operations in their encoder–decoder structures, which weakens their semantic detection capabilities. To overcome these limitations, we propose MAS-Net, a novel superpixel segmentation network based on a multi-attention hybrid mechanism. MAS-Net retains an efficient symmetric encoder–decoder architecture. First, at the feature-encoding stage, a residual structure built around a parameter-free attention module enhances the capture of fine-grained features. Second, at the feature-selection stage, a global semantic fusion self-attention module reconstructs the feature map. Finally, at the feature-decoding stage, channel and spatial attention mechanisms are fused to obtain superpixel segmentation results with enhanced boundary adherence. Experimental results on real-world image datasets demonstrate that the proposed method achieves competitive results in terms of visual quality and metrics, such as ASA and BR-BP, compared with state-of-the-art approaches.

1. Introduction

A superpixel refers to the grouping and segmentation of adjacent pixels into visually homogeneous blocks. These approaches aggregate pixels based on similarities in texture, luminance, edge energy, and curvature, resulting in a more natural and perceptually meaningful image representation [1]. Superpixels effectively reduce redundant information in images without losing image details, thereby reducing the complexity of subsequent computer vision tasks and providing essential research directions for large-scale and real-time image processing. Due to their significantly reduced computational and storage requirements, superpixels have gained popularity in a variety of computer vision tasks, including semantic segmentation [2,3,4], object detection [5,6], hyperspectral image classification [7], saliency detection [8], and medical image segmentation [9].
Existing superpixel algorithms can be roughly classified into two categories: traditional algorithms and deep learning-based algorithms. Traditional algorithms typically begin by uniformly distributing seed points across the image or dividing it into a regular grid as an initialization step, followed by an iterative process using methods such as graph cuts, clustering, and gradient descent [10,11,12] to obtain the final superpixel segmentation result. As an intermediate image representation, superpixels are commonly employed within the overall network to enhance the precision of image segmentation and detection [5] or to provide a priori information to guide subsequent visual tasks, such as hyperspectral image classification [7]. Therefore, superpixel algorithms must prioritize the accuracy and boundary adherence of segmentation results to fully realize their effectiveness. Additionally, whether serving as a partial component of the overall network or providing a priori information, the superpixel algorithm should exhibit low computational complexity. In contrast, traditional superpixel methods often necessitate laborious iterative clustering processes. The limitations of traditional algorithms and hardware result in their accuracy and computational complexity failing to meet the requirements of existing computer vision tasks.
Deep learning-based methods also initialize an image by dividing it into regular grids [13] and use neural networks to predict the association between each pixel and the surrounding regular grid cells, thereby generating superpixels, as shown in Figure 1. Regrettably, the existing deep learning-based superpixel algorithms remain imperfect. Some superpixel algorithms still employ iterative clustering to generate superpixels after extracting high-dimensional features. This approach enhances the accuracy of the segmentation process but concomitantly increases the time complexity by a factor of ten or even twenty relative to traditional algorithms. In contrast, a series of superpixel algorithms based on the fully convolutional network [14,15,16] markedly enhances the efficiency of superpixel segmentation. These algorithms are efficient and have low computational costs, but suffer from a significant loss of fine-grained details during the downsampling process in the encoding stage, as well as the upsampling process in the decoding stage. Notably, the impact of these losses on the final prediction results is often overlooked, leading to a persistent failure to effectively improve the boundary adherence of superpixels.
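To make this association-based paradigm concrete, the sketch below converts a predicted pixel-to-grid association map into hard superpixel labels. It follows the common SCN-style convention of scoring the 3 × 3 neighborhood of initial grid cells for every pixel; the function name, tensor layout, and neighbor ordering are illustrative assumptions rather than the exact implementation used by any of the cited methods.

```python
import torch

def assoc_to_labels(assoc: torch.Tensor, cell: int) -> torch.Tensor:
    """Turn a predicted pixel-to-grid association map into hard superpixel labels.

    assoc: (B, 9, H, W) scores over the 3x3 neighbourhood of initial grid cells.
    cell:  side length (in pixels) of the initial regular grid cells.
    """
    B, _, H, W = assoc.shape
    gh, gw = H // cell, W // cell                      # grid dimensions

    # grid-cell row/column each pixel starts in
    base_y = torch.arange(H).div(cell, rounding_mode="floor").view(H, 1).expand(H, W)
    base_x = torch.arange(W).div(cell, rounding_mode="floor").view(1, W).expand(H, W)

    # offsets of the 3x3 neighbouring cells, in the same order as the 9 channels
    offsets = torch.tensor([(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)])

    choice = assoc.argmax(dim=1)                       # (B, H, W) winning neighbour
    dy, dx = offsets[choice][..., 0], offsets[choice][..., 1]
    y = (base_y + dy).clamp(0, gh - 1)                 # clamp at the image border
    x = (base_x + dx).clamp(0, gw - 1)
    return y * gw + x                                  # (B, H, W) superpixel indices
```

For a 208 × 208 image with 16-pixel cells, this yields 13 × 13 = 169 candidate superpixels, which matches the grid size discussed in Section 3.2.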
To preserve the efficient segmentation paradigm while addressing the limitations of existing algorithms, this paper proposes a novel superpixel segmentation network based on a multi-attention hybrid mechanism. To the best of our knowledge, this represents the first instance of an attention mechanism being integrated with the superpixel segmentation task. The core motivation of this work is that the main function of superpixels as an intermediate image representation is to serve as a preprocessing step for subsequent tasks; it is therefore crucial to generate superpixels that are more semantically informative. Moreover, superpixels with more semantic information also imply better adherence to real boundaries, effectively reducing the errors and redundant information introduced by superpixel segmentation, which benefits subsequent tasks. The multi-attention hybrid superpixel segmentation strategy proposed in this paper allows for a more semantics-focused association between pixels and superpixels, in addition to the traditional considerations based on color and location information. This means that the superpixels generated by the method described in this paper are better suited as over-segmentation labels for a variety of computer vision tasks. Furthermore, as an end-to-end network architecture, it can be embedded into other task networks to enhance the overall performance. Specifically, we utilized residual blocks reconstructed based on a parameter-free attention mechanism at the encoder stage to mitigate the loss of fine-grained details during downsampling. Concurrently, for the deep semantic information obtained from encoding, we performed key semantic reconstruction using a global semantic fusion self-attention mechanism. This approach optimized the computational overhead while accentuating the significance of the key semantic information. Furthermore, during the decoding process, we emphasized the correlation of the feature points within both the channel dimension and the spatial dimension at different stages. This strategy aimed to further reduce the loss of details. Quantitative and qualitative results on various benchmark datasets, including the Berkeley segmentation dataset 500 (BSDS500) [17] and the New York University depth dataset version 2 (NYUv2) [18], indicate the superiority of the proposed algorithm over existing superpixel algorithms. In summary, the main contributions of this study are as follows:
  • We propose a strategy that combines the multi-attention hybrid mechanism with superpixel segmentation. Through three-stage multi-attention fusion, we achieved fine-grained feature extraction, efficient deep semantic map reconstruction, and semantic feature enhancement in upsampling. This strategy addressed the issue of detail loss in the encoding–decoding stage of existing superpixel algorithms.
  • Our multi-attention hybrid network for superpixel segmentation could focus on both the semantic and spatial information contained in the input image, thus generating superpixels with more semantic awareness.
  • Experimental results on various visual task datasets show the excellent performance of the proposed method in superpixel segmentation, particularly in generating superpixels with better boundary adherence.

2. Related Work

2.1. Traditional Superpixel Methods

Traditional superpixel algorithms can be classified into graph-based, clustering-based, and gradient-based algorithms. Graph-based algorithms treat each pixel as a graph node and represent the relationships between adjacent pixels using edge weights; the graph nodes are then partitioned into different parts to achieve superpixel segmentation. Examples of such algorithms include normalized cut (NC) [1], Felzenszwalb and Huttenlocher (FH) [19], and entropy rate superpixel (ERS) [10]. Clustering-based algorithms use methods like K-means to measure the pixel feature similarity in various feature spaces. For instance, simple linear iterative clustering (SLIC) [11] performs pixel clustering based on a five-dimensional feature space comprising color and position information. Variants like linear spectral clustering (LSC) [20] project image features into a 10-dimensional space using kernel functions, while manifold SLIC (MSLIC) [21] projects them into a two-dimensional manifold space. Gradient-based algorithms use gradient descent to optimize clusters. Well-known algorithms include Waterpixels [12], extended topology preserving segmentation (ETPS) [22], and watershed-based superpixels with global and local boundary marching (WSGL) [23]. Most traditional algorithms involve an iterative clustering process and are non-differentiable, making them difficult to integrate into end-to-end models.
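As an illustration of the clustering-based family, the snippet below sketches the SLIC-style distance in the five-dimensional (l, a, b, x, y) feature space; the function name and the default compactness weight m are illustrative assumptions, not values taken from any particular implementation.

```python
import numpy as np

def slic_distance(pixel_lab, pixel_xy, center_lab, center_xy, S, m=10.0):
    """SLIC-style distance in the 5-D (l, a, b, x, y) feature space.

    S is the sampling interval of the initial grid and m trades colour
    proximity against spatial proximity (larger m -> more compact superpixels).
    """
    d_color = np.linalg.norm(np.asarray(pixel_lab) - np.asarray(center_lab))
    d_space = np.linalg.norm(np.asarray(pixel_xy) - np.asarray(center_xy))
    return np.sqrt(d_color ** 2 + (d_space / S) ** 2 * m ** 2)
```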

2.2. Deep Superpixel Methods

Deep neural networks have been widely used in computer vision due to their effectiveness in extracting image features. Consequently, researchers have also attempted to apply deep learning to superpixel segmentation. Early approaches often involved using deep neural networks to extract rich image features and then applying traditional algorithms for clustering and segmentation. Segmentation-aware loss (SEAL) [24] extracts features with pixel affinities using a segmentation-aware loss and subsequently applies the ERS algorithm for superpixel segmentation. Similarly, the superpixel sampling network (SSN) [13] relaxes the nearest-neighbor constraints in the SLIC module and combines it with features extracted by a CNN for clustering. These attempts demonstrated the potential of integrating deep learning with superpixel segmentation but did not align with the efficient end-to-end nature of deep learning.
Superpixel segmentation with a fully convolutional network (SCN) [14] integrates feature extraction and superpixel segmentation into an end-to-end process. It utilizes a fully convolutional network to establish the association map between pixels and regular grids, significantly enhancing the efficiency of superpixel segmentation. Association implantation for superpixels (AINet) [15] builds upon this by incorporating a novel association implantation (AI) module to enhance the perception capability between pixels and grids. The efficient superpixel network (ESNet) [16] replaces the decoder with a pyramid gradient superpixel generator (PSG) to effectively extract cluster-friendly features. However, these improvements are limited to optimizing specific parts of the network, resulting in limited enhancement of the feature acquisition and processing capabilities of the overall network architecture.
Recently, the bio-inspired superpixel segmentation network (BINet) [25] reconsidered the process of superpixel generation from a bio-inspired perspective. The approach comprises two key components, namely, the enhanced screening module (ESM) and the boundary-aware labeling (BAL) module. The ESM module enhances the semantic information by simulating the interaction projection mechanism observed in the visual cortex, while the BAL module simulates the spatial frequency properties of visual cortex cells to generate superpixels that adhere to boundaries. The ESM module can be regarded as an extension of the AI module in AINet. While it enhances the correlation between neighboring grids, the algorithm focuses excessively on superpixel compactness, which results in suboptimal boundary adherence. Content disentangle superpixel (CDS) [26] is a superpixel algorithm built on content disentanglement: by disentangling content from the RGB and HSV features, it reduces the influence of dataset style noise on the model and improves generalization. Nevertheless, it requires constructing auxiliary modal data and introducing steps such as mutual information minimization, which result in an overall high algorithmic complexity.
Table 1 summarizes the publication year, publication venue, programming language, category, and number of Google Scholar citations for existing superpixel algorithms. This information provides an overview of the scholarly impact and technical details of the approaches described.

3. Methodology

There are several challenges in performing superpixel segmentation using end-to-end deep learning frameworks, such as the loss of fine-grained details due to deep network hierarchies and the ambiguity in segmentation caused by insufficient semantic perception capabilities. To address these issues, we constructed the superpixel segmentation network architecture MAS-Net based on a multi-attention hybrid mechanism. The overall architecture of the network is shown in Figure 2.
This structure was still designed with an efficient and symmetric encoder–decoder architecture. For the encoding stage, we propose a novel residual structure based on a parameter-free attention mechanism to enhance the fine-grained image feature extraction. The input image is successively convolved and downsampled by a new parameter-free attention residual block to obtain rich feature map information. While shallow feature maps often contain abundant pixel-level spatial information, deeper low-resolution feature maps usually contain high-dimensional semantic information. The deepest feature map is downsampled 16 times compared with the original input image. For such complex abstract semantic information, a global information fusion attention mechanism was used to reconstruct the feature map. This mechanism operates effectively and focuses on each feature point. The decoder is capable of recovering more valid information due to the enhancement of features under large receptive fields (RFs). During the decoding stage, we employed the channel attention mechanism to integrate features while upsampling the reconstructed deep semantic information. For the shallow features obtained from the upsampling, we recovered more detailed spatial location information by using the spatial attention mechanism. Additionally, we utilized skip connections to merge image features at the same resolution in the encoding and decoding stages.
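The following sketch captures this data flow at a high level: a symmetric encoder–decoder with skip connections whose final 1 × 1 convolution predicts a 9-way pixel-to-grid association map. For brevity it uses plain convolutional blocks in place of the PAR, GSFS, and AED modules described in Sections 3.1–3.3 and fewer downsampling stages than the actual network, so all class and function names here are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    # stand-in for the PAR / AED blocks described in Sections 3.1 and 3.3
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.SiLU())

class EncoderDecoderSkeleton(nn.Module):
    """Simplified symmetric encoder-decoder that outputs a 9-way association map."""

    def __init__(self, chs=(32, 64, 128, 256)):
        super().__init__()
        self.stem = conv_block(3, chs[0])
        self.encs = nn.ModuleList(conv_block(chs[i], chs[i + 1], stride=2)
                                  for i in range(len(chs) - 1))
        self.decs = nn.ModuleList(
            nn.ConvTranspose2d(chs[i + 1], chs[i], 2, stride=2)
            for i in reversed(range(len(chs) - 1)))
        self.fuse = nn.ModuleList(conv_block(2 * chs[i], chs[i])
                                  for i in reversed(range(len(chs) - 1)))
        self.head = nn.Conv2d(chs[0], 9, 1)      # association to the 3x3 grid cells

    def forward(self, x):
        x = self.stem(x)
        skips = [x]
        for enc in self.encs:                    # encoding: downsample step by step
            x = enc(x)
            skips.append(x)
        for dec, fuse, skip in zip(self.decs, self.fuse, reversed(skips[:-1])):
            x = dec(x)                           # decoding: upsample
            x = fuse(torch.cat([x, skip], dim=1))  # skip connection at same resolution
        return self.head(x).softmax(dim=1)       # pixel-to-grid association map
```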

3.1. Encoder with Parameterless Attention ResBlock

The residual structure [27] has been widely used in various deep learning tasks to address gradient vanishing and explosion issues. Therefore, in this study, we considered replacing the ordinary convolution of the encoder part with the residual structure to improve the model performance while maintaining better robustness. However, the residual structure has a limited impact on enhancing the feature extraction capability of the encoder. To address this issue, we propose two main improvements. First, we incorporated a parameter-free attention module called SimAM [28] into the residual structure to measure the importance of each feature point, enabling the extraction of fine-grained features without introducing additional parameters. Additionally, to further enhance the nonlinearity of the model, we replaced all ReLU functions in the original residual structure with SiLU, thus constructing the parameter-free attention residual (PAR) block. In contrast to the original residual structure, the proposed PAR module does not introduce additional learnable parameters, yet it effectively enhances the overall model performance. Furthermore, we conducted quantitative ablation experiments to analyze the original residual structure and the PAR module, thereby exploring its effectiveness. Figure 3 shows a comparison between the original residual structure and the improved structure.
SimAM originates from visual neuroscience, where neurons carrying more information often exhibit stronger spatial suppression effects. To distinguish the target neuron $t$ from the other neurons $x_i$, an energy function is defined as follows:

$e_t^* = \dfrac{4(\sigma^2 + \lambda)}{(t - \mu)^2 + 2\sigma^2 + 2\lambda}$,   (1)

where $\mu = \frac{1}{M}\sum_{i=1}^{M} x_i$ and $\sigma^2 = \frac{1}{M}\sum_{i=1}^{M}(x_i - \mu)^2$ represent the mean and variance of the other neurons in the current channel, respectively. $M = H \times W$ represents the number of neurons in the current feature map, and $H$ and $W$ denote the feature map height and width. $\lambda$ is a hyperparameter, which was set to $1 \times 10^{-4}$ in this study.
Spatial suppression neurons exhibit strong linear separability. Therefore, the more significant the difference between neuron $t$ and the surrounding neurons, the higher the importance of the target neuron. Similar to other attention mechanisms, SimAM also uses the sigmoid function to compute the weight coefficients and multiplies them with the original input $X$ to obtain the enhanced feature map $\tilde{X}$:

$\tilde{X} = \mathrm{Sigmoid}\!\left(\dfrac{1}{E}\right) \otimes X$,   (2)

where $E$ groups $e_t^*$ across all channel and spatial dimensions.
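A minimal PyTorch sketch of this re-weighting, following the published SimAM formulation in Equations (1) and (2), is shown below; the function name is ours, and the default lambda matches the value stated above, but the exact tensor handling in the authors' code may differ.

```python
import torch

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM re-weighting of a (B, C, H, W) feature map.

    Implements Equations (1)-(2): the per-neuron inverse energy 1/e_t* equals
    d / (4 * (v + lam)) + 0.5, passed through a sigmoid and multiplied with x.
    """
    n = x.shape[2] * x.shape[3] - 1                     # number of "other" neurons
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # (t - mu)^2 per position
    v = d.sum(dim=(2, 3), keepdim=True) / n             # channel-wise variance
    e_inv = d / (4 * (v + lam)) + 0.5                   # equals 1 / e_t*
    return x * torch.sigmoid(e_inv)                     # Sigmoid(1/E) (x) X
```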
Furthermore, the SiLU activation function combines the advantages of the ReLU and sigmoid activation functions. As a smooth function, it facilitates network optimization. Therefore, we replaced the activation functions in the traditional residual structure, and in the rest of the model, with SiLU. Additionally, a batch normalization layer was added after the $1 \times 1$ convolution branch to enhance the robustness. Relative to the ReLU activation function, employing the SiLU activation function reduces the number of training iterations by approximately 60k, thereby saving 27.3% of the overall training time. When the input is denoted by the variable $x$, the SiLU function is defined as follows:

$f(x) = \dfrac{x}{1 + e^{-x}}$   (3)
Finally, the encoder feeds the input image through a stride-2 convolution and the improved parameter-free attention residual structure to extract deep fine-grained features. In contrast to the conventional convolution or interpolation sampling techniques employed in the encoder components of other models, the proposed PAR module minimizes the loss of feature information during the downsampling process. Furthermore, the integrated SimAM module can selectively screen key feature points and increase their relative importance within the encoding process. Concurrently, the shortcut connection in PAR helps to mitigate network degradation and enhance the gradient stability.
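Under these design choices, a PAR block can be sketched roughly as below: a two-convolution residual body with SiLU activations and a batch-normalized 1 × 1 shortcut, re-weighted by the simam function from the previous snippet before the residual addition. The exact layer ordering inside the published block (Figure 3) may differ, so this is an assumption-laden sketch rather than the authors' implementation.

```python
import torch.nn as nn

class PARBlock(nn.Module):
    """Sketch of a parameter-free attention residual (PAR) block.

    Residual body with SiLU activations, SimAM re-weighting of the body output
    (reusing the simam function above), and a batch-normalized 1x1 shortcut.
    """

    def __init__(self, cin: int, cout: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
            nn.BatchNorm2d(cout), nn.SiLU(),
            nn.Conv2d(cout, cout, 3, 1, 1, bias=False),
            nn.BatchNorm2d(cout))
        self.shortcut = nn.Sequential(                  # 1x1 conv branch with BN
            nn.Conv2d(cin, cout, 1, stride, bias=False),
            nn.BatchNorm2d(cout)) if (stride != 1 or cin != cout) else nn.Identity()
        self.act = nn.SiLU()

    def forward(self, x):
        out = simam(self.body(x))                       # parameter-free attention
        return self.act(out + self.shortcut(x))         # residual addition
```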

3.2. Feature Map Reconstruction Based on Global Semantic Fusion Self-Attention

After continuous convolution and downsampling in the encoder, the deepest feature map shrinks from the input dimension $\mathbb{R}^{3 \times H \times W}$ to $\mathbb{R}^{C \times h \times w}$, where, in this study, $h = w = 13$. This implies that each feature map within a channel contains only 169 feature points, each of which has a larger RF than that of the shallow features. To reinforce the correlation between the feature points under such a large RF, we used a self-attention mechanism with an improved computation scheme to reconstruct the feature maps.
Generally, the attention score in self-attention mechanisms [29] can be computed with the following:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\dfrac{QK^T}{\sqrt{d}}\right)V$,   (4)

where $Q$, $K$, and $V$ denote the query, key, and value vectors of each feature point, respectively, and $\sqrt{d}$ is a scaling factor. For a feature with dimensions $\mathbb{R}^{C \times H \times W}$, the multi-head self-attention mechanism first flattens it into $\mathbb{R}^{C \times N}$, where $C$ denotes the channels and $N$ represents the total number of feature points. It is then divided into the form $\mathbb{R}^{(C/D) \times N}$ according to the number of attention heads $D$. As shown in Figure 4, the original multi-head self-attention mechanism first calculates the dot product of $Q$ and $K^T$ as the similarity score between the feature points, which is normalized by the softmax function and then multiplied by $V$. Mathematically, the dot product of $Q$ and $K^T$ implies matrix multiplication between two matrices of sizes $N \times (C/D)$ and $(C/D) \times N$, resulting in a matrix $QK^T$ of size $N \times N$. This matrix is then multiplied by the $V$ matrix of size $N \times (C/D)$. Therefore, the total computational complexity is $O(D \cdot N^2 \cdot (C/D)) = O(N^2 C)$.
In contrast, linear attention [30] modifies the computation order by prioritizing the dot product of $K^T$ and $V$ and replacing the softmax with a kernel function $\phi(x)$. By altering the order of matrix multiplication, the computational complexity of linear attention is reduced to $O(D \cdot N \cdot (C/D)^2) = O(N C^2 / D)$, and it can be calculated as follows:

$\mathrm{LinearAttention}(Q, K, V) = \phi(Q)\left(\phi(K)^T V\right)$   (5)
Inspired by linear attention, we set the number of heads $D$ equal to the number of channels $C$. The features are arranged in the form $\mathbb{R}^{C \times N \times 1}$, which means that the feature points within each channel can be associated with the global features instead of being fragmented into local correlations. Unlike linear attention, in this study, we normalized $Q$ and $K^T$ using the L2 norm to obtain $\tilde{Q}$ and $\tilde{K}^T$, and summed the results of multiplying $\tilde{K}^T$ with $V$ to further fuse the global information. This kind of feature reconstruction combines global mixed information and is more suitable for abstract semantic details compared with the traditional multi-head self-attention mechanisms. Moreover, in the case of $D = C$, the computational complexity is further reduced to $O(D \cdot N \cdot (C/D)^2) = O(NC)$. We refer to this design as the global semantic fusion self-attention (GSFS) mechanism, which can be formulated as follows:

$\mathrm{GSFS}(Q, K, V) = \dfrac{Q}{\|Q\|_2}\left(\dfrac{K^T}{\|K^T\|_2}\, V\right)$   (6)
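A rough PyTorch sketch of Equation (6) follows, treating each channel as its own attention head so that $K^T V$ collapses to a single scalar per channel before being broadcast back over the $N$ positions. The 1 × 1 projections, the normalization axis, and the residual connection at the end are assumptions made for a self-contained example, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSFS(nn.Module):
    """Sketch of global semantic fusion self-attention (Equation (6)).

    With the number of heads D equal to the number of channels C, K^T V
    collapses to one scalar per channel, giving O(NC) complexity overall.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.to_q(x).flatten(2)                  # (B, C, N): one head per channel
        k = self.to_k(x).flatten(2)
        v = self.to_v(x).flatten(2)
        q = F.normalize(q, dim=-1)                   # Q / ||Q||_2 over the N positions
        k = F.normalize(k, dim=-1)
        context = (k * v).sum(dim=-1, keepdim=True)  # K^T V: one scalar per channel
        out = (q * context).view(b, c, h, w)         # broadcast back to every position
        return out + x                               # residual connection (assumption)
```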

3.3. Decoder with CAM/SAM

In the decoding stage, it is common practice to restore the downsampled features to the original image size using interpolation or transposed convolution to obtain the final superpixel segmentation results. However, a simple fusion of high-dimensional features often leads to blurred object boundaries and a loss of detailed information; therefore, we designed an attention-enhanced decoder (AED) in conjunction with a convolutional block attention module (CBAM) [31] to improve the decoder’s recognition ability and highlight the features between different target classes. As emphasized earlier, deep low-resolution feature maps have a larger RF and more abstract semantic information, while shallow feature maps possess richer spatial position information. The GSFS-reconstructed deep semantic features contain 256 channels, and there are still 128 channels after one upsampling. Therefore, we first used the channel attention module (CAM) to obtain the importance of each feature channel in the two deepest feature maps to strengthen the important features by assigning them weight coefficients. For the upsampled feature maps restored to higher resolutions, the spatial attention module (SAM) was used to enhance the representation of key areas, thus reducing the interference of non-target regions on segmentation.
Specifically, for the first two layers of the deep feature maps, we first performed one upsampling using transposed convolution to fuse with the encoder part of the same resolution feature maps, thereby preserving more details. Then, we assigned channel attention weights to the fused features, and we further refined the features and adjusted the channel numbers by inserting two convolutional layers. After two upsamplings, the feature map contained sufficient spatial features. At this point, the CAM was replaced with the SAM. Let F denote the input feature. The CAM and SAM attention processes can be summarized as follows:
$\mathrm{CAM}(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)$,   (7)

$\mathrm{SAM}(F) = \sigma\left(f^{7 \times 7}\left([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]\right)\right)$,   (8)

where $\sigma$ represents the sigmoid function, $\mathrm{MLP}$ denotes a multilayer perceptron consisting of fully connected layers, and $f^{7 \times 7}$ represents a convolution operation with a kernel size of $7 \times 7$. Equation (7) indicates that average pooling and max pooling are applied to the input features over the spatial dimensions to obtain global information for each channel, followed by an $\mathrm{MLP}$ that discerns the importance of each channel, with the resulting activations serving as weights. Equation (8) demonstrates that average and max pooling are performed on the corresponding feature points across the channels to capture spatial information; a large-kernel convolution is then utilized to determine the importance of each feature point.
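Both modules follow the CBAM design [31]; a compact sketch of Equations (7) and (8) is given below. The channel-reduction ratio in the MLP is a common CBAM default and is an assumption here, as are the class names.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention module (Equation (7)), CBAM-style."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))

    def forward(self, f):
        avg = self.mlp(f.mean(dim=(2, 3)))                      # AvgPool branch
        mx = self.mlp(f.amax(dim=(2, 3)))                       # MaxPool branch
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return f * w                                            # per-channel weights

class SpatialAttention(nn.Module):
    """Spatial attention module (Equation (8)), CBAM-style 7x7 convolution."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):
        pooled = torch.cat([f.mean(dim=1, keepdim=True),          # channel-wise avg
                            f.amax(dim=1, keepdim=True)], dim=1)  # channel-wise max
        return f * torch.sigmoid(self.conv(pooled))               # per-position weights
```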
Finally, Figure 5 presents the visualizations of key feature maps in each encoder–decoder model. From top to bottom, the visualizations show the superpixel segmentation results, the first downsampling results of the encoder, and the final upsampling results of the decoder. Through the visualizations, it is evident that the framework proposed in this paper exhibited less loss of color and texture information compared with other encoder–decoder-based network models. For instance, details such as the characters, logos, and landing gears painted on the aircraft fuselage were preserved, which was crucial for enhancing the accuracy of superpixel segmentation and category perception capability. The other network models focused more on the contour information of the image, which led to their inability to generate superpixels that contained logo information, such as the segmentation results of the SCN on the tail logo in the first column and ESNet on the fuselage logo in the third column.

4. Experiments and Results

4.1. Datasets

To demonstrate the effectiveness of the proposed method for superpixel segmentation, we evaluated the method on BSDS500 and NYUv2, which are commonly used evaluation datasets in the field of superpixel segmentation. Additionally, to verify the enhancement of the network structure on superpixel boundary adherence and the model generalization capability, we conducted boundary adherence tests on the semantic boundaries dataset (SBD) [32] and Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) [33] dataset without fine-tuning the model. The SBD comprises a greater number of semantic categories and exhibits greater similarity to the BSDS500. In comparison, the KITTI dataset exhibits a reduction in the number of semantic categories and a greater degree of homogeneity within each category, particularly with regard to color features. We chose these two datasets with large differences to better validate the generalization of the model. Moreover, qualitative analysis was performed by showcasing visual results on datasets such as the WHU building dataset (WHU_BD) [34] and digital retinal images for vessel extraction (DRIVE) dataset [35].
Among these datasets, BSDS500 was used for the model training. It includes 200 training images, 100 validation images, and 200 test images and provides multiple ground truth annotations for each image. NYUv2 is a large-scale indoor scene dataset, while SBD, KITTI, WHU_BD, and DRIVE are widely used datasets in scene understanding, autonomous driving perception, building change detection, and medical image segmentation, respectively. Superpixel segmentation has found widespread application across numerous computer vision tasks. Through comprehensive quantitative and qualitative experimentation, we aimed to analyze the potential of MAS-Net in various applications, as well as explore subsequent optimization directions.

4.2. Evaluation Metrics

For a fair comparison, all the metric evaluations were conducted using the standards and code provided by [36]. Similar to other methods, this study primarily evaluated superpixels using five metrics: achievable segmentation accuracy (ASA), boundary recall (BR), boundary precision (BP), under-segmentation error (UE), and compactness (CO). ASA measures the ability to accurately identify objects in the image, BR and BP assess the boundary adherence of superpixels to the ground truth, UE measures the percentage of the superpixels leaked across the ground truth boundaries, and CO is used to evaluate the compactness of superpixels. Among them, higher scores of the ASA, BR, BP, and CO metrics and lower scores of the UE metric represent better superpixel segmentation results. UE is strongly correlated to ASA but not directly equivalent to 1-ASA.
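To make the boundary metrics concrete, the snippet below computes boundary recall and precision with a small pixel tolerance, which is the usual definition; the exact tolerance and protocol used by the benchmark code of [36] may differ, so this is an illustrative sketch rather than the evaluation code used in the experiments.

```python
import numpy as np
from scipy import ndimage

def boundary_recall_precision(sp_labels, gt_labels, r=1):
    """Boundary recall / precision with an r-pixel tolerance.

    sp_labels, gt_labels: 2-D integer label maps of identical shape.
    """
    def boundary(lab):
        b = np.zeros(lab.shape, dtype=bool)
        b[:-1, :] |= lab[:-1, :] != lab[1:, :]     # vertical label changes
        b[:, :-1] |= lab[:, :-1] != lab[:, 1:]     # horizontal label changes
        return b

    sp_b, gt_b = boundary(sp_labels), boundary(gt_labels)
    struct = np.ones((2 * r + 1, 2 * r + 1), dtype=bool)
    sp_near = ndimage.binary_dilation(sp_b, structure=struct)   # tolerance band
    gt_near = ndimage.binary_dilation(gt_b, structure=struct)

    recall = (gt_b & sp_near).sum() / max(gt_b.sum(), 1)        # GT boundaries recovered
    precision = (sp_b & gt_near).sum() / max(sp_b.sum(), 1)     # SP boundaries justified
    return recall, precision
```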

4.3. Implementation Details

The network architecture proposed in this paper was implemented in PyTorch, and all experiments were conducted on one NVIDIA GeForce RTX 4090 GPU. BSDS500 was selected as the training data; since it provides multiple label annotations per image, we expanded the images to match the corresponding number of labels, resulting in a total of 1087 training samples. Before being fed into the network for training, the images were resized to $208 \times 208$ and subjected to random flipping and cropping for data augmentation. The same operations were applied to the images of the test set. For the optimizer, the model used Adam with its default hyperparameters, i.e., $\beta_1 = 0.9$ and $\beta_2 = 0.999$, to speed up convergence, and the weight decay was set to $4 \times 10^{-4}$ to avoid overfitting. The initial learning rate was set to $5 \times 10^{-5}$; a smaller initial learning rate helps the model achieve better performance. The batch size was set to 16 to avoid crashes due to running out of memory during training, and the model reached convergence after approximately 140k iterations. Additionally, in this experiment, the loss function and the initialization grid size followed the settings of SCN and ESNet.
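A training-loop sketch with these optimizer settings is shown below; model, train_loader, and superpixel_loss are placeholders to be supplied by the caller (for instance, the skeleton network sketched in Section 3 and an SCN/ESNet-style loss), so only the hyperparameters are taken from this section.

```python
import torch
from torch import optim

def train_masnet(model, train_loader, superpixel_loss, max_iters=140_000):
    """Training loop with the optimizer settings reported in this section.

    model, train_loader, and superpixel_loss are supplied by the caller; the
    loss follows the SCN/ESNet formulation and is not reproduced here.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=5e-5,
                           betas=(0.9, 0.999), weight_decay=4e-4)
    step = 0
    while step < max_iters:
        for images, labels in train_loader:        # batches of 16 augmented crops
            assoc = model(images.to(device))       # predicted association map
            loss = superpixel_loss(assoc, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= max_iters:
                break
    return model
```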

4.4. Comparison with the State-of-the-Art Methods

To comprehensively compare the advantages of MAS-Net over existing algorithms, we conducted experiments involving nine different algorithms, including SLIC, LSC, ETPS, SEAL, SSN, SCN, AINet, ESNet, and CDS. Among them, SLIC, LSC, and ETPS are the most representative traditional methods, and the rest are the current state-of-the-art deep learning methods. The BSDS500 and NYUv2 test sets were chosen as the datasets for the evaluation. Existing superpixel algorithms typically employ post-processing techniques to enhance the spatial connectivity of the segmentation results. This is commonly achieved by merging superpixels that are smaller than a predefined threshold with their surrounding superpixels. However, this approach inherently limits the ability to precisely control the number of generated superpixels. Consequently, in this study, we likewise opted to illustrate the experimental results using a trend graph that depicts the relationship between the metrics and the corresponding number of generated superpixels. The results are shown in Figure 6.
In addition, Table 2 visualizes the performance of MAS-Net against other models on BSDS500, where we set all algorithms to generate 600 superpixels for comparison. From Figure 6 and Table 2, it is evident that the deep learning-based superpixel segmentation algorithms exhibited certain advantages over the traditional algorithms in each metric. Moreover, the MAS-Net model proposed in this paper achieved the best results in the ASA, BR-BP, and UE metrics compared with other deep learning-based methods. Especially for boundary precision and recall, the MAS-Net constructed by the mixture of three-stage attention mechanisms made the segmentation effective in terms of precision and accuracy by refining, fusing, and reconstructing the semantic information, which means that MAS-Net was capable of segmenting superpixel labels with less ambiguity. As the number of superpixels increased, the advantage of MAS-Net in terms of the ASA and UE metrics became more pronounced, which was attributable to the spatial feature segmentation capabilities provided by the AED module.
Furthermore, as outlined in [36], there exists a tradeoff between compactness and boundary adherence for superpixels. Greater compactness results in superpixel segmentation outputs that are more circular in shape, thereby encompassing more semantic information for a given boundary length. However, the category label boundaries in real-world datasets are typically irregular, necessitating a compromise in compactness to concentrate the superpixel segmentation on category boundaries and enhance semantic awareness. In fact, most classification and segmentation tasks rely more heavily on precise a priori information. Nonetheless, the CO metrics maintained by the proposed approach still exhibited an advantage over traditional algorithms and certain deep learning-based methods. Moving forward, we will further investigate the potential for performance improvements in superpixel segmentation when the number of superpixels is small.
For NYUv2, we directly evaluated the test set using the model trained on BSDS500 without fine tuning, and it also achieved highly competitive performance, indicating the good generalization capability of the model. Furthermore, we had to sacrifice a portion of compactness to focus the superpixel segmentation results on category boundaries and enhance the semantic awareness. However, the CO metric still maintained an advantage over the traditional and some deep learning-based algorithms.
We also conducted a qualitative visual analysis of the deep learning-based superpixel segmentation algorithms, and the results are depicted in Figure 7. The examples shown from top to bottom correspond to the BSDS500, NYUv2, WHU_BD, and DRIVE datasets, respectively, with the number of superpixels being BSDS500→400, NYUv2→300, WHU_BD→260, and DRIVE→1300. By comparing the results in Figure 7, it can be observed more intuitively that our proposed method better adhered to the category boundaries in the ground truth. Additionally, it avoided generating ambiguous segmentation within regions; in the other algorithms, the superpixel boundaries formed within categories actually represent redundant information, as shown by the rooftop example in the third row. Furthermore, even for datasets with minimal saliency variations, like DRIVE, MAS-Net was still capable of segmenting the retinal vessel contours.
To further validate the practical improvement of the model in terms of boundary adherence in superpixel segmentation, we conducted tests on the SBD and KITTI datasets. As shown in Figure 8, AINet, ESNet, and CDS exhibited improvements in boundary segmentation precision and boundary recall compared with SCN. However, the proposed MAS-Net framework demonstrated a significant advantage in boundary adherence. The reason is that, although AINet and ESNet improve the recognition capability of SCN through the AI module and the PSG module, respectively, these two methods only optimize a single stage and therefore cannot fully exploit feature extraction and deep semantic information. In addition, the performance of CDS varied greatly across datasets, with poor boundary adherence on the SBD, which contains more categories. Meanwhile, for the KITTI dataset, which has fewer categories and in which each category contains a large number of pixels with similar statistical attributes, ESNet, CDS, and MAS-Net achieved similar performances. This indicates that MAS-Net had stronger generalization ability and boundary adherence compared with the existing models. MAS-Net achieved excellent semantic feature extraction, reconstruction, and enhancement through the integration of multiple attention mechanisms throughout the entire process. This enabled the segmented superpixels to effectively perceive category information from the ground truth and adhere to their boundaries.

4.5. Efficiency Analysis

Table 3 illustrates the operational efficiency of MAS-Net in comparison with existing state-of-the-art algorithms. Running efficiency is typically influenced by the number of superpixels and the image resolution. Accordingly, approximately 600 superpixels were generated on the BSDS500 test set, with an image resolution of 321 × 481, for evaluation purposes. All computations were conducted using an RTX 4090 GPU and an Intel Core i7 CPU. Although our model exhibited the highest time complexity among the encoder–decoder-based methods, it nevertheless outperformed all existing algorithms in metrics such as ASA, BR-BP, and UE, which is attributable to our proposed three-stage attention design. Furthermore, SEAL and SSN employ conventional techniques for superpixel segmentation following feature extraction, which results in significantly slower inference; this drawback is circumvented by MAS-Net with its end-to-end architecture. Our MAS-Net strikes a favorable balance between performance and computational complexity, effectively enhancing each performance metric while ensuring that the inference speed remains satisfactory.

4.6. Ablation Study

In this section, an ablation experiment was designed to validate the effectiveness of each module of the model in improving superpixel segmentation performance. The baseline model consisted of an encoder–decoder structure without attention mechanisms for comparison purposes. Various modules were added to this baseline model for evaluation, and the results are shown in Figure 9.
The ablation experiment showed that the PAR structure in the model significantly improved the performance. However, if the extracted deep features are fed directly into the AED without global semantic fusion, certain feature channels containing useful information may be assigned smaller weights by the channel attention module, suppressing their importance. This situation is further exacerbated by the shallow spatial attention module, which leads to a decreased segmentation performance, although it still outperforms the baseline model. Our designed GSFS mechanism effectively compensated for this limitation: by reconstructing the deep feature maps, the decoder could obtain more globally valid features.
Furthermore, we also quantitatively analyzed the proposed PAR module with the original residual blocks, and the results are shown in Table 4. The experiment only replaced the PAR module in the model, while all the other parameters remained unaltered. The results demonstrate that the proposed PAR module could markedly enhance the overall performance of MAS-Net. The rich, fine-grained features extracted from it were crucial for the subsequent reconstruction and recovery of the feature information.
The results of the aforementioned experiments demonstrate that MAS-Net exhibited superior performance compared with the existing algorithms with respect to segmentation accuracy, boundary recall, and boundary precision. Furthermore, the quantitative and qualitative experiments on multiple datasets demonstrated a robust capacity for generalization. In comparison with traditional superpixel algorithms and other deep learning-based superpixel algorithms, it demonstrated greater applicability to a range of computer vision tasks. As an end-to-end model, MAS-Net can be readily integrated into tasks such as object detection [5]. In addition, ref. [7] demonstrated that applying superpixels to multi-spectral images makes it possible to extract rich spectral–spatial information and significantly improve classification accuracy; compared with the traditional superpixel algorithm ERS used in [7], MAS-Net is better in terms of time complexity and performance.

5. Conclusions

This paper presents the effective superpixel segmentation network MAS-Net. This work innovatively incorporates attention mechanisms at various positions within the superpixel segmentation model, which enables comprehensive utilization of image features and addresses the performance issues of existing models stemming from feature information loss. MAS-Net is composed of a parameter-free attention residual (PAR) block, a global semantic fusion self-attention (GSFS) mechanism, and an attention-enhanced decoder (AED). Through the three-stage process of feature refinement, reconstruction, and focusing, MAS-Net significantly mitigated the loss of image details during the encoding and decoding processes, resulting in enhanced accuracy of the superpixel segmentation. The experimental results demonstrated that the proposed approach achieved superior segmentation performance compared with other methods by leveraging multilevel visual attention mechanisms for feature enhancement. Meanwhile, our approach exhibited stronger boundary adherence in the superpixel segmentation results across multi-domain visual task datasets than methods optimized for a single stage. This finding implies that the proposed MAS-Net exhibits superior generalization capabilities, along with the potential for application across a wide range of computer vision tasks. Moreover, the visualization results indicate that the proposed method can reduce the generation of superpixel boundaries within categories, thereby reducing the possibility of ambiguous segmentation.
It is noteworthy that, while MAS-Net exhibited enhanced performance compared with the existing algorithms, its efficiency still requires further improvement. Although we reduced the computation of the self-attention mechanism in the GSFS module, the number of parameters in the overall model is still higher than that of other algorithms, so MAS-Net encounters certain limitations when dealing with large datasets.
In our future research, the focus will be on methods to reduce the number of trainable model parameters, eliminate redundant design while preserving the original performance, and achieve model pruning and a lightweight design. Additionally, we will further investigate the application of superpixels across a variety of computer vision tasks, such as object detection and classification in polarimetric synthetic aperture radar images and hyperspectral data.

Author Contributions

Conceptualization, G.Y.; writing—original draft preparation, C.W.; methodology and supervision, X.J. and W.C.; data curation, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by (i) the National Natural Science Foundation of China (grant no. 62366029); (ii) the National Natural Science Foundation of China (grant no. 62366028); (iii) the National Natural Science Foundation of China (grant no. 62062049); (iv) the Gansu Provincial Science and Technology Plan Project, China (grant no. 21ZD8RA008); and (v) The Key Laboratory of Big Data and Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education (grant no. BATLAB202302).

Data Availability Statement

Data that are cited in this article are available in a publicly accessible repository.

Acknowledgments

The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ren, X.; Malik, J. Learning a classification model for segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Nice, France, 13–16 October 2003. [Google Scholar]
  2. Kim, S.; Park, D.; Shim, B. Semantic-aware superpixel for weakly supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023. [Google Scholar]
  3. Lei, T.; Jia, X.; Zhang, Y.; Liu, S.; Meng, H.; Nandi, A.K. Superpixel-based fast fuzzy C-means clustering for color image segmentation. IEEE Trans. Fuzzy Syst. 2019, 27, 1753–1766. [Google Scholar] [CrossRef]
  4. Zhang, S.; Ma, Z.; Zhang, G.; Lei, T.; Zhang, R.; Cui, Y. Semantic image segmentation with deep convolutional neural networks and quick shift. Symmetry 2020, 12, 427. [Google Scholar] [CrossRef]
  5. Liu, M.; Chen, S.; Lu, F.; Xing, M.; Wei, J. Realizing target detection in SAR images based on multiscale superpixel fusion. Sensors 2021, 21, 1643. [Google Scholar] [CrossRef] [PubMed]
  6. Huang, C.; Zong, Y.; Ding, Y.; Luo, X.; Clawson, K.; Peng, Y. A new deep learning approach for the retinal hard exudates detection based on superpixel multi-feature extraction and patch-based CNN. Neurocomputing 2021, 452, 521–533. [Google Scholar] [CrossRef]
  7. Mu, C.; Dong, Z.; Liu, Y. A two-branch convolutional neural network based on multi-spectral entropy rate superpixel segmentation for hyperspectral image classification. Remote Sens. 2022, 14, 1569. [Google Scholar] [CrossRef]
  8. Wei, W.; Chen, W.; Xu, M. Co-saliency detection of RGBD image based on superpixel and hypergraph. Symmetry 2022, 14, 2393. [Google Scholar] [CrossRef]
  9. Rout, R.; Parida, P.; Alotaibi, Y.; Alghamdi, S.; Khalaf, O.I. Skin lesion extraction using multiscale morphological local variance reconstruction based watershed transform and fast fuzzy C-means clustering. Symmetry 2021, 13, 2085. [Google Scholar] [CrossRef]
  10. Liu, M.-Y.; Tuzel, O.; Ramalingam, S.; Chellappa, R. Entropy rate superpixel segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011. [Google Scholar]
  11. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. 2012, 34, 2274–2282. [Google Scholar] [CrossRef] [PubMed]
  12. Machairas, V.; Faessel, M.; Cardenas-Pena, D.; Chabardes, T.; Walter, T.; Decenciere, E. Waterpixels. IEEE Trans. Image Process. 2015, 24, 3707–3716. [Google Scholar] [CrossRef] [PubMed]
  13. Jampani, V.; Sun, D.; Liu, M.-Y.; Yang, M.-H.; Kautz, J. Superpixel sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  14. Yang, F.; Sun, Q.; Jin, H.; Zhou, Z. Superpixel segmentation with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  15. Wang, Y.; Wei, Y.; Qian, X.; Zhu, L.; Yang, Y. AINet: Association implantation for superpixel segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  16. Xu, S.; Wei, S.; Ruan, T.; Zhao, Y. ESNet: An efficient framework for superpixel segmentation. IEEE Trans. Circ. Syst. Vid. 2023, 34, 5389–5399. [Google Scholar] [CrossRef]
  17. Arbeláez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. 2011, 33, 898–916. [Google Scholar] [CrossRef] [PubMed]
  18. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. In Proceedings of the European Conference on Computer Vision (ECCV), Firenze, Italy, 7–13 October 2012. [Google Scholar]
  19. Felzenszwalb, P.F.; Huttenlocher, D.P. Efficient graph-based image segmentation. Int. J. Comput. Vision 2004, 59, 167–181. [Google Scholar] [CrossRef]
  20. Li, Z.; Chen, J. Superpixel segmentation using linear spectral clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  21. Liu, Y.-J.; Yu, C.-C.; Yu, M.-J.; He, Y. Manifold SLIC: A fast method to compute content-sensitive superpixels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  22. Yao, J.; Boben, M.; Fidler, S.; Urtasun, R. Real-time coarse-to-fine topologically preserving segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  23. Yuan, Y.; Zhu, Z.; Yu, H.; Zhang, W. Watershed-based superpixels with global and local boundary marching. IEEE Trans. Image Process. 2020, 29, 7375–7388. [Google Scholar] [CrossRef]
  24. Tu, W.-C.; Liu, M.-Y.; Jampani, V.; Sun, D.; Chien, S.-Y.; Yang, M.-H.; Kautz, J. Learning superpixels with segmentation-aware affinity loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  25. Zhao, T.; Peng, B.; Sun, Y.; Yang, D.; Zhang, Z.; Wu, X. Rethinking superpixel segmentation from biologically inspired mechanisms. Appl. Soft. Comput. 2024, 156, 111467. [Google Scholar] [CrossRef]
  26. Xu, S.; Wei, S.; Ruan, T.; Liao, L. Learning invariant inter-pixel correlations for superpixel generation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  28. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  30. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 12–18 July 2020. [Google Scholar]
  31. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  32. Gould, S.; Fulton, R.; Koller, D. Decomposing a scene into geometric and semantically consistent regions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan, 29 September–2 October 2009. [Google Scholar]
  33. Abu Alhaija, H.; Mustikovela, S.K.; Mescheder, L.; Geiger, A.; Rother, C. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. Int. J. Comput. Vision 2018, 126, 961–972. [Google Scholar] [CrossRef]
  34. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
  35. Staal, J.; Abramoff, M.D.; Niemeijer, M.; Viergever, M.A.; Van Ginneken, B. Ridge-based vessel segmentation in color images of the retina. IEEE Trans. Med. Imaging 2004, 23, 501–509. [Google Scholar] [CrossRef] [PubMed]
  36. Stutz, D.; Hermans, A.; Leibe, B. Superpixels: An evaluation of the state-of-the-art. Comput. Vis. Image Und. 2018, 166, 1–27. [Google Scholar] [CrossRef]
Figure 1. From initial grid to learned superpixels.
Figure 2. The architecture of MAS-Net.
Figure 3. Comparison of original residual block and improved residual block structure.
Figure 4. Comparison of the calculation of three self-attention mechanisms.
Figure 5. Visualization of key layer results of superpixel segmentation network based on encoder–decoder structure.
Figure 6. Performance comparison on BSDS500 and NYUv2. From left to right: ASA, BR-BP, UE, and CO.
Figure 7. Qualitative results of our and previous state-of-the-art methods. The boxes represent localized magnifications of the locations in question.
Figure 8. Boundary adherence comparison on SBD and KITTI dataset.
Figure 9. Ablation study on BSDS500.
Table 1. Comparison of superpixel methods. Abbreviations: ICCV, International Conference on Computer Vision; IJCV, International Journal of Computer Vision; CVPR, IEEE/CVF Computer Vision and Pattern Recognition Conference; TPAMI, IEEE Transactions on Pattern Analysis and Machine Intelligence; TIP, IEEE Transactions on Image Processing; ECCV, European Conference on Computer Vision; TCSVT, IEEE Transactions on Circuits and Systems for Video Technology; ASC, Applied Soft Computing; AAAI, AAAI Conference on Artificial Intelligence.
Method | Pub., Year | Implementation | Category | Citations (July 2024)
NC [1] | ICCV, 2003 | MatLab/C | Graph | 2367
FH [19] | IJCV, 2004 | C/C++ | Graph | 8455
ERS [10] | CVPR, 2011 | C/C++ | Graph | 1193
SLIC [11] | TPAMI, 2012 | C/C++ | Clustering | 10,408
LSC [20] | CVPR, 2015 | C/C++ | Clustering | 568
MSLIC [21] | CVPR, 2016 | MatLab/C | Clustering | 157
Waterpixels [12] | TIP, 2015 | Python | Gradient | 124
ETPS [22] | CVPR, 2015 | MatLab/C | Gradient | 162
WSGL [23] | TIP, 2020 | C/C++ | Gradient | 19
SEAL [24] | CVPR, 2018 | Python | Deep learning | 135
SSN [13] | ECCV, 2018 | Python | Deep learning | 261
SCN [14] | CVPR, 2020 | Python | Deep learning | 228
AINet [15] | ICCV, 2021 | Python | Deep learning | 36
ESNet [16] | TCSVT, 2023 | Python | Deep learning | 2
BINet [25] | ASC, 2024 | Python | Deep learning | 1
CDS [26] | AAAI, 2024 | Python | Deep learning | 0
Table 2. Results on the BSDS500 test set. The number of superpixels was set to 600, and the best results are denoted in bold. ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Method | Pub., Year | ASA (%) ↑ | BP (%) ↑ | BR (%) ↑ | UE (%) ↓ | CO (%) ↑
SLIC [11] | TPAMI, 2012 | 95.60 | 11.17 | 82.91 | 4.40 | 27.65
LSC [20] | CVPR, 2015 | 96.67 | 8.99 | 90.77 | 3.33 | 20.99
ETPS [22] | CVPR, 2015 | 96.58 | 8.17 | 95.38 | 3.42 | 13.09
SEAL [24] | CVPR, 2018 | 97.06 | 8.68 | 90.08 | 2.94 | 23.14
SSN [13] | ECCV, 2018 | 96.95 | 12.03 | 85.97 | 3.05 | 38.09
SCN [14] | CVPR, 2020 | 96.92 | 12.54 | 84.83 | 3.08 | 39.05
AINet [15] | ICCV, 2021 | 97.07 | 12.74 | 86.90 | 2.93 | 34.71
ESNet [16] | TCSVT, 2023 | 97.21 | 13.08 | 87.90 | 2.79 | 37.28
CDS [26] | AAAI, 2024 | 97.25 | 12.80 | 88.17 | 2.74 | 35.59
MAS-Net (ours) | - | 97.29 | 13.14 | 88.96 | 2.71 | 34.55
Table 3. Efficiency comparison on the BSDS500 test set. †: the calculation of the parameters of SEAL and SSN encompassed solely the feature extraction phase.
Method | Params (M) | Iterations | Time (ms) | ASA (%) | Device
SLIC [11] | - | Yes | 105 | 96.23 | CPU
LSC [20] | - | Yes | 96 | 96.52 | CPU
ETPS [22] | - | Yes | 299 | 96.50 | CPU
SEAL [24] | 0.89 † | Yes | 1691 | 97.06 | CPU and GPU
SSN [13] | 0.21 † | Yes | 2316 | 97.10 | GPU
SCN [14] | 2.27 | No | 19 | 96.92 | GPU
AINet [15] | 5.59 | No | 41 | 96.95 | GPU
ESNet [16] | 0.44 | No | 7 | 97.21 | GPU
CDS [26] | 0.40 | No | 7 | 97.25 | GPU
MAS-Net (ours) | 6.58 | No | 46 | 97.29 | GPU
Table 4. Quantitative analysis of the Ori ResBlock and the PAR module on the BSDS500 test set. The number of superpixels was set to 600, and the best results are denoted in bold. ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Module | ASA (%) ↑ | BP (%) ↑ | BR (%) ↑ | UE (%) ↓ | CO (%) ↑
Ori ResBlock | 97.25 | 11.98 | 88.18 | 2.75 | 36.88
PAR | 97.29 | 13.14 | 88.96 | 2.71 | 34.55
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yan, G.; Wei, C.; Jia, X.; Li, Y.; Chang, W. MAS-Net: Multi-Attention Hybrid Network for Superpixel Segmentation. Symmetry 2024, 16, 1000. https://doi.org/10.3390/sym16081000
