1. Introduction
With the rapid development of machine vision technology, semantic segmentation models are being widely applied in fields such as agriculture, industry, healthcare, and transportation. Semantic segmentation [1] classifies the pixels of a target image, predicting the category to which each pixel belongs.
In the process of automating medical detection, accurately and efficiently segmenting consumables made of different materials is a key task. Especially for weak edge targets made of transparent or semi-transparent materials like glass and plastic, the edge information is often blurry, posing a challenge to traditional segmentation methods. The precise identification and localization of such weak edge targets are of significant importance for ensuring the quality and safe use of medical equipment.
Traditional image segmentation methods, such as threshold-based [2], region-growing [3,4], and edge detection methods [5,6], mainly rely on low-level texture and edge information of the targets. However, for medical instruments made of materials like glass and plastic, their edge features are often not distinct, lacking clear differentiation from the background. Furthermore, these methods typically require manual design and selection of features, lacking the capability to comprehend and model overall semantics, making it difficult to adapt them to complex and variable instrument shapes and imaging conditions.
The vigorous development of deep learning technology in recent years has provided new insights for the segmentation of medical consumables [7,8,9]. CNNs, with their powerful feature extraction and representation capabilities, have achieved remarkable success in the field of image recognition. In particular, the introduction of Fully Convolutional Networks (FCNs) pioneered end-to-end semantic segmentation [10]. Building upon this foundation, U-Net [11] further introduced the encoder–decoder structure and skip connection mechanism, significantly enhancing the accuracy of image semantic segmentation.
In recent years, U-Net has gained widespread adoption in the field of object segmentation due to its exceptional performance, attracting considerable attention from researchers. Sariturk et al. [12] demonstrated the significant effectiveness of U-Net and its variants in building segmentation. Their study compared various CNN and Transformer models, including U-Net, Residual U-Net, and Attention Residual U-Net. These models achieved excellent segmentation performance across different datasets, highlighting the adaptability and potential of the U-Net architecture in handling complex scene segmentation tasks. Ahmed et al. [13] showcased U-Net’s powerful real-time multi-object segmentation capabilities in remote sensing and surveillance applications, particularly excelling in processing drone aerial imagery. Su et al. [14] proposed an improved U-Net model that integrated the advantages of DenseNet, dilated convolutions, and DeconvNet for remote sensing image semantic segmentation, exhibiting outstanding performance on the Potsdam orthophoto dataset. Ahsan et al. [15] applied U-Net to brain tumor detection and segmentation, demonstrating superior performance through joint work with YOLOv5. Guo et al. [16] introduced a dual U-Net framework, innovatively combining object region and boundary information for segmentation, showing advantages in tasks such as lung, heart, and clavicle segmentation in chest X-rays, as well as segmentation of subtle edges like pelvic structures. John et al. [17] incorporated attention mechanisms into U-Net for deforestation detection in the Amazon rainforest and Atlantic forest of South America. The Attention U-Net outperformed traditional U-Net and other baseline models on Sentinel-2 satellite imagery. Cui et al. [18] enhanced the U-Net model by introducing attention mechanisms and residual modules, making it more suitable for the semantic segmentation of moving objects. U-Net and its derivative networks have been widely applied in diverse detection scenarios, fully demonstrating the versatility and adaptability of this architecture. This not only highlights the universality of the U-Net structure but also provides a reliable theoretical foundation and practical support for improving and applying it to segmentation tasks involving objects with weak boundaries, such as glass.
However, directly applying existing CNN models to the segmentation of medical consumables made of materials such as glass and plastic still faces challenges. These materials possess unique optical properties, such as transparency and reflectivity, resulting in less pronounced edge information in consumable images, which poses difficulties for feature extraction modules. Moreover, the significant variations in the shape and size of consumables lack consistent structured representations, placing higher demands on the generalization capability of segmentation networks. Yu et al. [19] proposed a Progressive Glass Segmentation Network (PGSNet) to address the challenges in glass segmentation. This method employs a discriminative enhancement module to bridge the characteristic gap between features at different levels, improving the discriminative power of feature representations. Additionally, it utilizes a focus and exploration fusion module to deeply mine useful information during the fusion process by highlighting commonalities and exploring differences, achieving glass segmentation from coarse to fine. Wan et al. [20] introduced a novel bidirectional cross-modal fusion framework for Glass-Like Object (GLO) segmentation. This framework incorporates a Feature Exchange Module (FEM) and a Shifted-Window Cross-Attention Feature Fusion Module (SW-CAFM) in each Transformer block stage. The FEM uses coordinate and spatial attention mechanisms to filter noise and recalibrate features from two modalities. The SW-CAFM fuses RGB and depth features through cross-attention and employs shifted-window self-attention operations to reduce computational complexity. This method achieved promising results on multiple glass and mirror benchmark datasets. Hu et al. [21] proposed a new glass detection method specifically for single RGB images. It extracts backbone features through self-attention methods and then uses a VIT-based deep semantic segmentation architecture called MFT to associate multi-level receptive field features and retain feature information captured at each layer, effectively improving glass region detection capabilities. Liu et al. [22] presented a hybrid learning architecture called YOLOS (You Only Look Once Station Scene), which integrates a novel squeeze-and-excitation (SE) attention module into the detection branch. This module adaptively learns the weights of feature channels, enabling the network to focus more on critical deep features of objects. Despite significant progress in glass and similar object detection, challenges remain in detection accuracy and network generalization capabilities. This research aims to address the limitations of existing methods by proposing innovative solutions to enhance the performance and efficiency of weak edge object segmentation networks. Specifically, this paper focuses on optimizing detection algorithm accuracy while enhancing model adaptability across different scenarios, contributing to technological advancements in this field.
Existing deep learning methods still suffer from low accuracy and blurred boundaries when segmenting weak edge targets such as glass and plastic, and addressing these issues poses two major challenges. The first challenge revolves around enhancing the network’s feature extraction capabilities to capture the crucial features of the targets, while the second involves maximally suppressing redundancy and noise propagation while ensuring feature richness. To tackle these challenges, most improvement efforts focus on strengthening the network’s feature representation and information transmission abilities. However, blindly deepening the network structure may lead to the loss of detailed information, which is essential for accurate segmentation. Furthermore, existing methods mostly utilize skip connections, directly merging the features of the encoder into the decoder. While this approach partially alleviates the issue of information loss, it inevitably introduces noise and redundancy.
Given the aforementioned circumstances, this paper introduces a novel Dual Attention Mechanism Weak Edge Target Segmentation Network (WETS-Net) to address these challenges. The network is founded on the U-Net architecture and integrates dual attention mechanisms to enhance, respectively, the network’s feature extraction and fusion capabilities. Specifically, an enhanced spatial attention mechanism is introduced in the shallow encoder of the U-Net, optimized in conjunction with edge extraction algorithms such as the Laplacian operator. By aggregating cross-scale gradient information, this module selectively extracts and enhances weak edges and texture features, enabling the network to capture subtle variations in weak target regions more sensitively and providing essential priors for subsequent segmentation tasks. Additionally, to address the underutilization of multi-scale information during feature decoding in U-Net, a channel attention mechanism is introduced at the skip connection points in this study. Through adaptive learning, it evaluates the significance of features at different scales from the encoder and adjusts their weights during the decoding process. Unlike directly concatenating features from different scales, the channel attention mechanism selectively enhances features in target regions according to task requirements and properties and suppresses redundant background interference, thereby making the fusion of multi-scale information more efficient and targeted. By combining these two significant innovations, WETS-Net demonstrates a notable performance enhancement in weak edge object segmentation, paving the way for new directions in subsequent research.
The main contributions of this paper are as follows:
Proposed an improved Weak Edge Target Segmentation Network, WETS-Net, which significantly enhances the network’s extraction and fusion capabilities for weak target edges and multi-scale features by introducing a dual attention mechanism in the encoder and decoder of U-Net.
Designed a novel Edge Enhancement Convolution (EE-Conv), inspired by the Laplacian operator’s role in edge detection, and applied it in constructing the Edge Attention Extraction Module (EAEM). Positioned at the shallow feature output locations of the encoder, EAEM effectively extracts and enhances the edge and texture features of weak targets.
Introduced a channel attention mechanism at the skip connections of U-Net and devised a Multi-Scale Information Fusion Module (MIFM). MIFM adaptively adjusts the importance of different scale features from the encoder and decoder, facilitating efficient fusion of multi-scale features and further enhancing the segmentation performance of the network.
3. Model Design and Optimization
This section provides a detailed overview of the proposed Weak Edge Target Segmentation Network (WETS-Net) based on a dual attention mechanism, elucidating the roles and design principles of its various components. When designing WETS-Net, we focused on two key issues in weak edge target image segmentation tasks: how to enhance the network’s feature extraction capability, and how to effectively fuse and compensate high-level semantic features while suppressing noise and irrelevant information as much as possible. Given the excellent performance of U-shaped networks in medical image segmentation tasks, and considering that medical images and weak edge target images are similar in their lack of texture features, we ultimately chose to build WETS-Net on the U-Net architecture.
Figure 2 illustrates the overall network architecture of WETS-Net. Compared to the classic U-Net, WETS-Net incorporates targeted improvements and optimizations in both the encoder and decoder sections. Specifically, an Edge Attention Extraction Module (EAEM) is embedded at the shallow feature output locations of the encoder to enhance the network’s capability in extracting and representing weak edge and texture features. In the decoder section, an enhanced skip connection with a Multi-Scale Information Fusion Module (MIFM) is introduced to adaptively regulate and fuse multi-scale feature information from the encoder. Through this approach, WETS-Net can selectively enhance feature representations at different scales, thereby improving its performance in weak target segmentation tasks.
In the following subsections, we will delve into the detailed design specifics and implementation methods of the EAEM and MIFM modules, further expounding on the overall network structure and optimization strategies of WETS-Net.
3.1. Edge Attention Extraction Module
In semantic segmentation tasks, low-scale feature maps typically possess higher resolution and richer spatial information, crucial for accurate object localization and segmentation. However, many networks employing an “encoder-decoder” structure in the early stages often overlook the extraction and utilization of low-scale features, limiting segmentation performance. While some methods attempt to enhance the representation of low-scale features by introducing multi-path spatial pyramid pooling or large convolutional kernels, these approaches often come with high computational complexity, rendering them unsuitable for real-time segmentation tasks. On the other hand, some existing methods specifically designed for real-time semantic segmentation and low-scale feature extraction have limitations: using small-sized image inputs may introduce more low-scale information, but their training strategies are complex and impractical, and while certain structures emphasizing spatial information have achieved good segmentation results, they do not fully consider the rich edge information inherent in low-scale features.
Addressing the aforementioned issues, the proposed Edge Attention Extraction Module (EAEM) aims to provide an efficient and effective low-scale feature extraction mechanism for real-time semantic segmentation tasks. Inspired by the Laplacian operator’s role in edge detection, we have devised an edge-enhancing convolution, EE-Conv, and integrated it into the construction of EAEM. The design of EE-Conv draws on the operational principles of the eight-neighbor Laplacian operator, as illustrated in Figure 3. The Laplacian operator, a second-order differential operator widely used in image processing for edge detection and enhancement, offers several significant advantages over first-order differential operators like the Sobel operator. Firstly, the Laplacian operator’s response to edges is directionally invariant, exhibiting good isotropy by detecting edges in all directions uniformly, thereby avoiding the directional bias seen with first-order differential operators. Secondly, the Laplacian operator is commonly combined with Gaussian smoothing (as in the Laplacian of Gaussian), effectively suppressing image noise and enhancing the robustness of edge detection. Additionally, as a second-order differential operator, the Laplacian operator is more sensitive to details and textures in images, enabling the extraction of finer and more comprehensive edge information.
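The paper does not spell out the exact parameterization of EE-Conv, but the description above suggests a 3 × 3 convolution whose kernels follow the eight-neighbor Laplacian mask. A minimal PyTorch sketch under that assumption (the depthwise layout, channel count, and trainability flag are illustrative choices, not the authors' published settings) might look as follows:

```python
import torch
import torch.nn as nn

class EEConv(nn.Module):
    """Edge-enhancing 3x3 convolution (sketch): a depthwise convolution whose
    kernels are initialized with the eight-neighbor Laplacian mask, so the layer
    starts out as an isotropic second-order edge detector and can be fine-tuned."""
    def __init__(self, channels: int, trainable: bool = True):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=1, groups=channels, bias=False)
        # Eight-neighbor Laplacian mask: center weight 8, all eight neighbors -1.
        laplacian = torch.tensor([[-1., -1., -1.],
                                  [-1.,  8., -1.],
                                  [-1., -1., -1.]])
        with torch.no_grad():
            self.conv.weight.copy_(laplacian.expand(channels, 1, 3, 3))
        self.conv.weight.requires_grad = trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)
```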
Compared to traditional convolution, EE-Conv introduces minimal additional computation while extracting edge information, ensuring the module’s lightweight nature and real-time performance. Based on this, the EAEM module designed in this study is illustrated in Figure 4. The module comprises two paths: a main path and a sub-path.
Structurally, the main path incorporates three key convolution operations. The first and third layers employ 1 × 1 convolutions, followed by ReLU and Sigmoid activation functions, respectively. The second layer utilizes EE-Conv combined with a ReLU activation function to accurately extract edge features from the feature map. Following the convolution operations of the main path, an edge attention weight matrix of size 1 × H × W is generated, effectively increasing the weight of pixels in the target edge area. This weight matrix is then convolved with the input image's edge features transmitted by the sub-path to obtain the edge attention feature map, which serves as input for the final convolution layer.
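Putting the main path and sub-path together, a hedged PyTorch sketch of the module as described above is given below, reusing the EEConv sketch from earlier in this subsection; the mid-channel width and the element-wise reweighting of the sub-path features are assumptions made where the text is not explicit:

```python
import torch
import torch.nn as nn

class EAEM(nn.Module):
    """Edge Attention Extraction Module (sketch). Main path: 1x1 conv + ReLU ->
    EE-Conv + ReLU -> 1x1 conv + Sigmoid, producing a 1 x H x W edge-attention
    map; the map then reweights the sub-path features element-wise."""
    def __init__(self, channels: int, mid_channels: int = 16):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(channels, mid_channels, 1),
                                    nn.ReLU(inplace=True))
        self.edge = nn.Sequential(EEConv(mid_channels), nn.ReLU(inplace=True))
        self.attn = nn.Sequential(nn.Conv2d(mid_channels, 1, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Main path: edge-aware attention weights of shape (N, 1, H, W).
        weights = self.attn(self.edge(self.reduce(x)))
        # Sub-path: the incoming features, reweighted by the edge attention map
        # (assumed element-wise combination).
        return x * weights
```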
This design methodology effectively integrates spatial and edge information, enhancing the module’s information extraction capability. Notably, the entire structure contains only one short path, utilizing 1 × 1 convolution, which minimizes information processing time. This design incorporates a spatial attention mechanism, optimizing edge feature extraction while improving computational resource utilization efficiency. Consequently, it enhances the module’s performance and practicality while maintaining the stability and reliability of its core functions.
The engineering design of this structure provides crucial technical support for real-time applications and lightweight models. In the WETS-Net model, the first two layers typically output shallow features, primarily encompassing edge and texture features. These features aid in locating object boundaries and contours, providing fundamental cues to the image structure. Therefore, in this model’s design process, EAEM is positioned at the output of the encoder’s first two shallow feature layers. This placement allows for comprehensive extraction and enhancement of edge and texture features of weak targets, facilitating refinement of segmentation boundaries and details. Furthermore, it provides richer and more fine-grained low-scale information for subsequent feature fusion and segmentation tasks.
Regarding the computational overhead of the EAEM module, its carefully designed architecture achieves lightweight computation. The module’s efficiency can be analyzed as follows:
Main Path: The 1 × 1 convolutions applied in the first and third layers primarily serve for linear combination across channels. For RGB input images, the computational complexity is 3 × 3 × H × W, which is relatively low.
EE-Conv Layer: As the core of the module, it employs a 3 × 3 convolution kernel with a theoretical complexity of 3 × 3 × 3 × H × W. However, due to its optimized design, the actual computational cost is lower than traditional 3 × 3 convolutions. This layer precisely extracts edge features while reducing computational complexity through sparse convolution or other efficient strategies.
Edge Attention Weight Matrix: The generation of the 1 × H × W edge attention weight matrix, used to enhance pixel importance in target edge regions, primarily involves convolution operations during its creation process.
Branch Path: The convolution operation between the weight matrix and the input image produces the edge attention feature map.
Overall, the EAEM module optimizes computational costs in the 1 × 1 convolutions and the EE-Conv layer, ensuring efficient edge feature extraction while maintaining a lightweight and real-time capable structure. This design makes it suitable for real-time applications and lightweight model requirements.
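As a back-of-the-envelope illustration of the per-layer costs quoted above, the multiply counts can be computed directly; the 512 × 512 resolution below is an assumed example, not a value taken from the paper:

```python
# Back-of-the-envelope multiply counts for the EAEM layers discussed above,
# using the per-pixel costs quoted in the text. The 512 x 512 resolution is
# an assumed example resolution.
H, W = 512, 512

pointwise_1x1 = 3 * 3 * H * W      # 1x1 convolution on a 3-channel (RGB) input
ee_conv = 3 * 3 * 3 * H * W        # 3x3 EE-Conv, theoretical upper bound

print(f"1x1 conv multiplies: {pointwise_1x1:,}")  # 2,359,296
print(f"EE-Conv multiplies:  {ee_conv:,}")        # 7,077,888
```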
The module’s architecture balances the trade-off between computational efficiency and feature extraction effectiveness. By leveraging optimized convolution operations and a well-designed attention mechanism, EAEM achieves enhanced edge detection capabilities without significantly increasing computational overhead. This approach aligns with the growing demand for efficient, real-time capable models in various computer vision applications.
3.2. Multi-Scale Information Fusion Module
The primary role of the EAEM module is to extract and preserve sufficient edge information, while the MIFM aims to efficiently acquire rich global feature information. Some existing methods, such as multi-scale image inputs, large convolutional kernels, and spatial pyramids, although effective in capturing multi-scale features, often significantly increase the network’s complexity and computational load, rendering them unsuitable for real-time segmentation tasks. Additionally, the representation of convolutional features is not only related to the receptive field but also closely linked to the feature channels. Compared to enlarging the convolutional kernel size, adding more channels requires less computation. Therefore, utilizing a channel attention mechanism to extract multi-channel feature information may be a more optimal solution for real-time semantic segmentation tasks.
The channel attention mechanism has been widely applied in semantic segmentation tasks, with SENet and FC attention mechanisms being two representative methods. SENet introduces the SE module to adaptively adjust the weights of each feature channel, enabling the network to automatically learn the importance of different channels and enhance the quality of feature representation. The design of the SE module is concise and efficient, significantly boosting network performance at a low computational cost and achieving outstanding results in tasks like image classification and object detection. However, the channel weights learned by the SE module are globally unique, neglecting differences in spatial dimensions and limiting its performance in dense prediction tasks such as semantic segmentation.
To overcome this limitation, the FC attention mechanism employs fully connected layers to learn attention weights in both spatial and channel dimensions, enabling the network to adaptively adjust feature responses at different positions and channels. The FC attention mechanism outperforms the SE module in semantic segmentation tasks, demonstrating the importance of spatial attention for dense prediction tasks. Nevertheless, the FC attention mechanism introduces a large number of parameters and computational overhead, restricting its application in real-time semantic segmentation tasks. Therefore, finding ways to reduce the complexity of the channel attention mechanism while maintaining accuracy is a research direction worth further exploration.
To address these limitations, this study proposes a novel Multi-Scale Information Fusion Module (MIFM) designed to efficiently and flexibly fuse and extract multi-scale feature information. MIFM utilizes a lightweight channel attention mechanism to adaptively adjust channel weights in different scale feature maps, facilitating feature fusion and optimization across scales, as depicted in Figure 5.
In the structural design of MIFM, the input initially undergoes processing through a 3 × 3 convolution layer, followed by a ReLU activation function layer, and another 3 × 3 convolution layer. The number of channels is appropriately reduced to mitigate the computational burden. Subsequently, the feature map bifurcates into two parallel branches: the main path for generating the guidance vector, and the branch path.
In the main path of the guidance vector, the feature map traverses a global average pooling layer to generate a global descriptor. Unlike SENet, MIFM eschews the use of fully connected layers with large parameter counts, instead employing a 1 × 1 pointwise convolution to process the global descriptor, thereby enhancing model efficiency. Pointwise convolution operates on the channel dimension, significantly reducing the number of parameters and improving computational efficiency. Because pointwise convolution does not need to model spatial associations and only processes information between channels, it proves more efficient while still aggregating information across channels.
Following the pointwise convolution, the global descriptor is converted into a compressed guidance vector and then restored to the same number of channels as the input feature map through another 1 × 1 pointwise convolution, yielding a reweighted guidance vector. In the feature fusion stage, the guidance vector undergoes element-wise multiplication with the output of the shortcut branch, adaptively adjusting the weight of each channel to achieve selective feature enhancement.
Finally, the enhanced feature maps of different scales are concatenated along the channel dimension to form the final fused features, which are then passed to the subsequent layer. By introducing a lightweight channel attention mechanism and a cross-scale feature fusion strategy, MIFM can efficiently extract and integrate multi-scale contextual information, enhancing the quality and robustness of feature representation. This structural design effectively integrates the channel attention mechanism, improving the model’s performance and efficiency while maintaining accurate capture and effective utilization of feature information.
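A compact PyTorch sketch of MIFM as described in this subsection is given below; the reduction ratio, the activations around the pointwise convolutions, and the (encoder feature, decoder feature) calling convention are assumptions made for illustration rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class MIFM(nn.Module):
    """Multi-Scale Information Fusion Module (sketch). Two 3x3 convolutions
    reduce the channel count, then a lightweight channel-attention path
    (global average pooling -> 1x1 conv -> 1x1 conv) produces a guidance
    vector that reweights the shortcut branch; the result is concatenated
    with the decoder features along the channel dimension."""
    def __init__(self, in_channels: int, out_channels: int, reduction: int = 4):
        super().__init__()
        self.pre = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )
        self.guide = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                # global descriptor
            nn.Conv2d(out_channels, out_channels // reduction, 1),  # compress (pointwise)
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // reduction, out_channels, 1),  # restore channel count
            nn.Sigmoid(),                                           # assumed gating nonlinearity
        )

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # dec_feat is assumed to be upsampled to the same spatial size as enc_feat.
        x = self.pre(enc_feat)
        x = x * self.guide(x)                   # channel-wise reweighting of the shortcut
        return torch.cat([x, dec_feat], dim=1)  # cross-scale fusion along channels
```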
In WETS-Net, each skip connection layer incorporates a MIFM module before the decoders at each level. Through this design, the network can fully utilize the rich semantic information extracted by the encoder and the high-resolution spatial details restored by the decoder, thereby better capturing the multi-scale features of weak and small targets. Applying MIFM at each skip connection enables WETS-Net to optimize feature representation layer by layer and achieve cross-scale fusion. The introduction of this structure effectively enhances segmentation accuracy and efficiency while maintaining the stability and interpretability of the overall network architecture.
The improved network can more effectively learn and utilize the importance of different channels, thereby increasing focus on key features and reducing noise interference in the final feature representation. This layer-wise optimization and cross-scale fusion method enables WETS-Net to perform well in processing complex scenes, not only improving the accuracy of segmentation tasks but also enhancing overall computational efficiency. Consequently, the network becomes more suitable for practical application scenarios.
3.3. Total Network Framework
WETS-Net is a real-time segmentation network built upon the foundation of U-Net, enhanced by the incorporation of the EAEM and MIFM modules, as illustrated in Figure 2. The EAEM module extracts edge feature information from the first and second low-scale downsampled features at the encoding stage, while the MIFM module focuses on channel information extraction from all downsampled features. These modules are interconnected during the adjacent upsampling process in the decoder, facilitating further information aggregation for feature fusion.
WETS-Net is an efficient and accurate image-processing network that demonstrates outstanding performance in terms of both real-time capability and precision. From a computational complexity perspective, the network, based on U-Net, introduces only a small number of additional 3 × 3 convolutional layers, maximizing GPU resource utilization and reducing computational waste. Moreover, the foundational U-Net model itself is a lightweight multitask network, further decreasing the time overhead during inference. In short, WETS-Net maintains performance while balancing computational efficiency and resource utilization.
To enhance the network’s feature extraction and fusion capabilities, WETS-Net incorporates two key modules: EAEM and MIFM. EAEM not only focuses on target edge features but also reorganizes and optimizes spatial features, enabling the network to better capture structural information about the targets. Meanwhile, MIFM strengthens the fusion and transmission of contextual information by modeling the dependencies between features at different scales. The synergistic action of these two modules allows WETS-Net to explore semantic information from multiple perspectives, leading to higher segmentation accuracy.
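To summarize how the pieces fit together, the following sketch shows one plausible wiring of EAEM and MIFM into a U-Net-style backbone; the number of stages, channel widths, and upsampling details are all assumptions, since the paper specifies only where the modules are placed:

```python
import torch.nn as nn

class WETSNetSketch(nn.Module):
    """Illustrative wiring of WETS-Net: a U-Net-style encoder/decoder with EAEM
    after the first two (shallow) encoder stages and an MIFM on every skip
    connection. Stage definitions, channel widths, and upsampling details are
    assumptions; only the module placement follows the text."""
    def __init__(self, encoder_stages, decoder_stages, eaems, mifms):
        super().__init__()
        self.enc = nn.ModuleList(encoder_stages)   # downsampling stages
        self.dec = nn.ModuleList(decoder_stages)   # upsampling stages
        self.eaems = nn.ModuleList(eaems)          # applied to the first two stages
        self.mifms = nn.ModuleList(mifms)          # one per skip connection

    def forward(self, x):
        skips = []
        for i, stage in enumerate(self.enc):
            x = stage(x)
            # Edge attention only on the two shallow feature maps.
            skips.append(self.eaems[i](x) if i < 2 else x)
        for stage, mifm, skip in zip(self.dec, self.mifms, reversed(skips[:-1])):
            x = stage(x)          # upsample + convolutions (assumed)
            x = mifm(skip, x)     # adaptive cross-scale fusion at the skip connection
        return x
```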
WETS-Net employs a multi-task joint optimization strategy in its loss function design, incorporating two auxiliary loss terms (EAEM and MIFM outputs) to enhance network training. This approach not only supervises the final output but also emphasizes the learning of intermediate layer features, promoting more effective gradient propagation and consequently improving convergence speed and generalization performance. The comprehensive loss function concept allows for a balanced weighting of the main task and auxiliary tasks, ensuring that the model achieves an optimal equilibrium across various objectives. This sophisticated design enhances the model’s expressiveness while deepening feature learning and providing more extensive supervision throughout the network training process. The loss function is formulated as follows:
$$L_{total} = L_{final} + \alpha L_{EAEM} + \beta L_{MIFM} \tag{1}$$

In the equation, α and β are weight parameters that balance the auxiliary loss terms and adjust the influence of the two attention mechanism modules. After verification through simulation experiments, both weight parameters of the auxiliary loss terms were set to 0.5 to balance the contributions of the attention modules. In the formula, L_final, L_EAEM, and L_MIFM represent the losses of the entire network, the EAEM, and the MIFM, respectively, and L_total represents the total loss function. It should be pointed out that the two auxiliary loss terms are used only during the training phase. All of these losses are computed with the cross-entropy loss function, as shown in Equation (2). During training, by appropriately adjusting the weight parameters and the structure of the loss function, the model can better learn the data features and improve its performance. This careful loss design and weight adjustment promotes effective training of the network and optimization of its final performance.

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log(p_{ij}) \tag{2}$$
In the equation, N represents the number of samples, C represents the number of categories, yij is an indicator variable (if the true category of sample i is j, then yij = 1; otherwise it is 0), and pij represents the model’s predicted probability that sample i belongs to category j. Since the value of pij ranges from 0 to 1, the value range of the calculated cross-entropy loss function is [0, +∞). When the model predicts completely accurately, the loss function is 0; as the deviation between the prediction and the true label increases, the value of the loss function also increases. In practical applications, a smaller loss function value indicates that the network has a better classification effect. In the training process of network optimization, minimizing the loss function is a crucial step. By continuously adjusting the network parameters to minimize the loss function, the network can better fit the training data and show better generalization ability on unseen data. Therefore, when training deep learning models, continuously monitoring and optimizing the loss function is a key step to ensure model performance and generalization ability.
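For concreteness, a minimal sketch of the composite loss in Equations (1) and (2), assuming PyTorch's cross-entropy and auxiliary prediction heads on the EAEM and MIFM branches (the existence and shapes of these auxiliary logits are assumptions), is:

```python
import torch.nn.functional as F

def wets_net_loss(final_logits, eaem_logits, mifm_logits, target,
                  alpha: float = 0.5, beta: float = 0.5):
    """Composite training loss of Equations (1)-(2): cross-entropy on the main
    output plus weighted auxiliary cross-entropy terms supervising the EAEM and
    MIFM branches. The auxiliary heads producing eaem_logits and mifm_logits
    are assumed; they are used only during training."""
    # Logits: (N, C, H, W); target: (N, H, W) with per-pixel class indices.
    l_final = F.cross_entropy(final_logits, target)
    l_eaem = F.cross_entropy(eaem_logits, target)
    l_mifm = F.cross_entropy(mifm_logits, target)
    return l_final + alpha * l_eaem + beta * l_mifm
```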
In general, the loss function design of WETS-Net not only improves the expressiveness of the model but also deepens the level of feature learning, allowing the network to better understand the intrinsic structure of the data. By introducing auxiliary loss terms, the network obtains more supervisory signals during the learning process, which helps the network learn effective feature representations faster and improves the generalization performance of the network. The design of the comprehensive loss function effectively models the relationship between different tasks, provides more information and guidance for network training, and enables the model to more comprehensively learn and optimize the correlation and trade-offs between various tasks.
5. Conclusions
The paper introduces a novel dual-attention mechanism semantic segmentation network called WETS-Net to address the challenges faced in weak edge target image segmentation tasks, such as transparent consumables. Built upon the U-Net architecture, this network incorporates the EAEM and MIFM modules to enhance texture details in shallow features and fuse contextual information across different scales. The EAEM module employs a spatial attention mechanism to adaptively adjust the weights of different regions in shallow feature maps, emphasizing texture details of weak edge targets while suppressing background interference. The design of EAEM integrates the EE-Conv convolutional kernel, enabling the network to better capture edge features of targets and improve segmentation accuracy.
In the decoding phase, WETS-Net enhances the skip connections of U-Net by introducing the MIFM module at each connection point. This allows the network to adaptively fuse high-level semantic features extracted by the encoder with high-resolution spatial details restored by the decoder, reducing the impact of blurry boundaries and background noise on segmentation results. MIFM utilizes a lightweight channel attention mechanism to adjust channel weights in different scale feature maps adaptively, achieving cross-scale feature fusion and optimization. Compared to existing channel attention mechanisms, MIFM significantly reduces parameter and computational overhead while maintaining performance, making it more suitable for real-time segmentation tasks.
To further improve segmentation performance, a multi-task learning strategy is employed, incorporating a composite loss function consisting of primary and auxiliary losses. The auxiliary loss supervises the learning process of the EAEM and MIFM modules, directing the network’s focus towards the feature extraction and fusion performance of these key modules.
Experimental results on the Trans10K dataset and a custom dataset demonstrate that WETS-Net achieves significant performance improvements in weak edge target image segmentation tasks, surpassing existing semantic segmentation methods. Through ablation studies, the effectiveness of the proposed modules is further validated. However, the paper acknowledges the limitations of CNN-based methods in global semantic understanding despite their strong local feature extraction capabilities and suitability for small-scale datasets. Future research directions could explore enhancing the network’s perception and understanding of key target features, such as by introducing Transformer-based global attention mechanisms or designing more efficient cross-scale feature fusion strategies.