1. Introduction
With the rapid development of machine vision technology, semantic segmentation models are being widely applied in fields such as agriculture, industry, healthcare, and transportation. Semantic segmentation [1] classifies the pixels of a target image, predicting the category to which each pixel belongs.
In the process of automating medical detection, accurately and efficiently segmenting consumables made of different materials is a key task. Especially for weak edge targets made of transparent or semi-transparent materials like glass and plastic, the edge information is often blurry, posing a challenge to traditional segmentation methods. The precise identification and localization of such weak edge targets are of significant importance for ensuring the quality and safe use of medical equipment.
Traditional image segmentation methods, such as threshold-based [2], region-growing [3,4], and edge detection methods [5,6], mainly rely on low-level texture and edge information of the targets. However, for medical instruments made of materials like glass and plastic, their edge features are often not distinct, lacking clear differentiation from the background. Furthermore, these methods typically require manual design and selection of features, lacking the capability to comprehend and model overall semantics, making it difficult to adapt them to complex and variable instrument shapes and imaging conditions.
The vigorous development of deep learning technology in recent years has provided new insights for the segmentation of medical consumables [7,8,9]. CNNs, with their powerful feature extraction and representation capabilities, have achieved remarkable success in the field of image recognition. In particular, the introduction of Fully Convolutional Networks (FCNs) pioneered end-to-end semantic segmentation [10]. Building upon this foundation, U-Net [11] further introduced the encoder–decoder structure and skip connection mechanism, significantly enhancing the accuracy of image semantic segmentation.
In recent years, U-Net has gained widespread adoption in the field of object segmentation due to its exceptional performance, attracting considerable attention from researchers. Sariturk et al. [12] demonstrated the significant effectiveness of U-Net and its variants in building segmentation. Their study compared various CNN and Transformer models, including U-Net, Residual U-Net, and Attention Residual U-Net. These models achieved excellent segmentation performance across different datasets, highlighting the adaptability and potential of the U-Net architecture in handling complex scene segmentation tasks. Ahmed et al. [13] showcased U-Net’s powerful real-time multi-object segmentation capabilities in remote sensing and surveillance applications, particularly excelling in processing drone aerial imagery. Su et al. [14] proposed an improved U-Net model that integrated the advantages of DenseNet, dilated convolutions, and DeconvNet for remote sensing image semantic segmentation, exhibiting outstanding performance on the Potsdam orthophoto dataset. Ahsan et al. [15] applied U-Net to brain tumor detection and segmentation, demonstrating superior performance through joint work with YOLOv5. Guo et al. [16] introduced a dual U-Net framework, innovatively combining object region and boundary information for segmentation, showing advantages in tasks such as lung, heart, and clavicle segmentation in chest X-rays, as well as segmentation of subtle edges like pelvic structures. John et al. [17] incorporated attention mechanisms into U-Net for deforestation detection in the Amazon rainforest and Atlantic forest of South America. The Attention U-Net outperformed traditional U-Net and other baseline models on Sentinel-2 satellite imagery. Cui et al. [18] enhanced the U-Net model by introducing attention mechanisms and residual modules, making it more suitable for the semantic segmentation of moving objects. U-Net and its derivative networks have been widely applied in diverse detection scenarios, fully demonstrating the versatility and adaptability of this architecture. This not only highlights the universality of the U-Net structure but also provides a reliable theoretical foundation and practical support for improving and applying it to segmentation tasks involving objects with weak boundaries, such as glass.
However, directly applying existing CNN models to the segmentation of medical consumables made of materials such as glass and plastic still faces challenges. These materials possess unique optical properties, such as transparency and reflectivity, resulting in less pronounced edge information in consumable images, which poses difficulties for feature extraction modules. Moreover, the significant variations in the shape and size of consumables lack consistent structured representations, placing higher demands on the generalization capability of segmentation networks. Yu et al. [19] proposed a Progressive Glass Segmentation Network (PGSNet) to address the challenges in glass segmentation. This method employs a discriminative enhancement module to bridge the characteristic gap between features at different levels, improving the discriminative power of feature representations. Additionally, it utilizes a focus and exploration fusion module to deeply mine useful information during the fusion process by highlighting commonalities and exploring differences, achieving glass segmentation from coarse to fine. Wan et al. [20] introduced a novel bidirectional cross-modal fusion framework for Glass-Like Object (GLO) segmentation. This framework incorporates a Feature Exchange Module (FEM) and a Shifted-Window Cross-Attention Feature Fusion Module (SW-CAFM) in each Transformer block stage. The FEM uses coordinate and spatial attention mechanisms to filter noise and recalibrate features from two modalities. The SW-CAFM fuses RGB and depth features through cross-attention and employs shifted-window self-attention operations to reduce computational complexity. This method achieved promising results on multiple glass and mirror benchmark datasets. Hu et al. [21] proposed a new glass detection method specifically for single RGB images. It extracts backbone features through self-attention methods and then uses a VIT-based deep semantic segmentation architecture called MFT to associate multi-level receptive field features and retain feature information captured at each layer, effectively improving glass region detection capabilities. Liu et al. [22] presented a hybrid learning architecture called YOLOS (You Only Look Once Station Scene), which integrates a novel squeeze-and-excitation (SE) attention module into the detection branch. This module adaptively learns the weights of feature channels, enabling the network to focus more on critical deep features of objects. Despite significant progress in glass and similar object detection, challenges remain in detection accuracy and network generalization capabilities. This research aims to address the limitations of existing methods by proposing innovative solutions to enhance the performance and efficiency of weak edge object segmentation networks. Specifically, this paper focuses on optimizing detection algorithm accuracy while enhancing model adaptability across different scenarios, contributing to technological advancements in this field.
Existing deep learning methods still suffer from low accuracy and blurred boundaries when segmenting weak edge targets such as glass and plastic, and addressing these issues poses two major challenges. The first challenge revolves around enhancing the network’s feature extraction capabilities to capture the crucial features of the targets, while the second involves maximally suppressing redundancy and noise propagation while ensuring feature richness. To tackle these challenges, most improvement efforts focus on strengthening the network’s feature representation and information transmission abilities. However, blindly deepening the network structure may lead to the loss of detailed information, which is essential for accurate segmentation. Furthermore, existing methods mostly utilize skip connections, directly merging the features of the encoder into the decoder. While this approach partially alleviates the issue of information loss, it inevitably introduces noise and redundancy.
Given the aforementioned circumstances, this paper introduces a novel Dual Attention Mechanism Weak Edge Target Segmentation Network (WETS-Net) to address these challenges. The network is founded on the U-Net architecture and integrates dual attention mechanisms to enhance, respectively, the network’s feature extraction and fusion capabilities. Specifically, an enhanced spatial attention mechanism is introduced in the shallow encoder of the U-Net, optimized in conjunction with edge extraction algorithms such as the Laplacian operator. By aggregating cross-scale gradient information, this module selectively extracts and enhances weak edges and texture features, enabling the network to capture subtle variations in weak target regions more sensitively and providing essential priors for subsequent segmentation tasks. Additionally, to address the underutilization of multi-scale information during feature decoding in U-Net, a channel attention mechanism is introduced at the skip connection points in this study. Through adaptive learning, it evaluates the significance of features at different scales from the encoder and adjusts their weights during the decoding process. Unlike directly concatenating features from different scales, the channel attention mechanism selectively enhances features in target regions according to task requirements and properties and suppresses redundant background interference, thereby making the fusion of multi-scale information more efficient and targeted. By combining these two significant innovations, WETS-Net demonstrates a notable performance enhancement in weak edge object segmentation, paving the way for new directions in subsequent research.
The main contributions of this paper are as follows:
Proposed an improved Weak Edge Target Segmentation Network, WETS-Net, which significantly enhances the network’s extraction and fusion capabilities for weak target edges and multi-scale features by introducing a dual attention mechanism in the encoder and decoder of U-Net.
Designed a novel Edge Enhancement Convolution (EE-Conv), inspired by the Laplacian operator’s role in edge detection, and applied it in constructing the Edge Attention Extraction Module (EAEM). Positioned at the shallow feature output locations of the encoder, EAEM effectively extracts and enhances the edge and texture features of weak targets.
Introduced a channel attention mechanism at the skip connections of U-Net and devised a Multi-Scale Information Fusion Module (MIFM). MIFM adaptively adjusts the importance of different scale features from the encoder and decoder, facilitating efficient fusion of multi-scale features and further enhancing the segmentation performance of the network.
3. Model Design and Optimization
This section provides a detailed overview of the proposed Weak Edge Target Segmentation Network (WETS-Net) based on a dual attention mechanism, elucidating the roles and design principles of its various components. When designing WETS-Net, we focused on two key issues in weak edge target image segmentation tasks: how to enhance the network’s feature extraction capability, and how to effectively fuse and compensate high-level semantic features while suppressing noise and irrelevant information as much as possible. Given the excellent performance of U-shaped networks in medical image segmentation tasks, and considering that medical images and weak edge target images are similar in their lack of texture features, we ultimately chose to build WETS-Net on the U-Net architecture.
Figure 2 illustrates the overall network architecture of WETS-Net. Compared to the classic U-Net, WETS-Net incorporates targeted improvements and optimizations in both the encoder and decoder sections. Specifically, an Edge Attention Extraction Module (EAEM) is embedded at the shallow feature output locations of the encoder to enhance the network’s capability in extracting and representing weak edge and texture features. In the decoder section, an enhanced skip connection with a Multi-Scale Information Fusion Module (MIFM) is introduced to adaptively regulate and fuse multi-scale feature information from the encoder. Through this approach, WETS-Net can selectively enhance feature representations at different scales, thereby improving its performance in weak target segmentation tasks.
In the following subsections, we will delve into the detailed design specifics and implementation methods of the EAEM and MIFM modules, further expounding on the overall network structure and optimization strategies of WETS-Net.
3.1. Edge Attention Extraction Module
In semantic segmentation tasks, low-scale feature maps typically possess higher resolution and richer spatial information, crucial for accurate object localization and segmentation. However, many networks employing an “encoder-decoder” structure in the early stages often overlook the extraction and utilization of low-scale features, limiting segmentation performance. While some methods attempt to enhance the representation of low-scale features by introducing multi-path spatial pyramid pooling or large convolutional kernels, these approaches often come with high computational complexity, rendering them unsuitable for real-time segmentation tasks. On the other hand, some existing methods specifically designed for real-time semantic segmentation and low-scale feature extraction have limitations: using small-sized image inputs may introduce more low-scale information, but their training strategies are complex and impractical, and while certain structures emphasizing spatial information have achieved good segmentation results, they do not fully consider the rich edge information inherent in low-scale features.
Addressing the aforementioned issues, the proposed Edge Attention Extraction Module (EAEM) aims to provide an efficient and effective low-scale feature extraction mechanism for real-time semantic segmentation tasks. Inspired by the Laplacian operator’s role in edge detection, we have devised an edge-enhancing convolution, EE-Conv, and integrated it into the construction of EAEM. The design of EE-Conv draws on the operational principles of the eight-neighbor Laplacian operator, as illustrated in Figure 3. The Laplacian operator, a second-order differential operator widely used in image processing for edge detection and enhancement, offers several significant advantages over first-order differential operators like the Sobel operator. Firstly, the Laplacian operator’s response to edges is directionally invariant, exhibiting good isotropy by detecting edges in all directions uniformly, thereby avoiding the directional bias seen with first-order differential operators. Secondly, the Laplacian operator is commonly combined with Gaussian smoothing (as in the Laplacian of Gaussian), effectively suppressing image noise and enhancing the robustness of edge detection. Additionally, as a second-order differential operator, the Laplacian operator is more sensitive to details and textures in images, enabling the extraction of finer and more comprehensive edge information.
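The paper does not spell out the exact parameterization of EE-Conv, but the description above suggests a 3 × 3 convolution whose kernels follow the eight-neighbor Laplacian mask. A minimal PyTorch sketch under that assumption (the depthwise layout, channel count, and trainability flag are illustrative choices, not the authors' published settings) might look as follows:

```python
import torch
import torch.nn as nn

class EEConv(nn.Module):
    """Edge-enhancing 3x3 convolution (sketch): a depthwise convolution whose
    kernels are initialized with the eight-neighbor Laplacian mask, so the layer
    starts out as an isotropic second-order edge detector and can be fine-tuned."""
    def __init__(self, channels: int, trainable: bool = True):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=1, groups=channels, bias=False)
        # Eight-neighbor Laplacian mask: center weight 8, all eight neighbors -1.
        laplacian = torch.tensor([[-1., -1., -1.],
                                  [-1.,  8., -1.],
                                  [-1., -1., -1.]])
        with torch.no_grad():
            self.conv.weight.copy_(laplacian.expand(channels, 1, 3, 3))
        self.conv.weight.requires_grad = trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)
```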
Compared to traditional convolution, EE-Conv introduces minimal additional computation while extracting edge information, ensuring the module’s lightweight nature and real-time performance. Based on this, the EAEM module designed in this study is illustrated in Figure 4. The module comprises two paths: a main path and a sub-path.
Structurally, the main path incorporates three key convolution operations. The first and third layers employ 1 × 1 convolutions, followed by ReLU and Sigmoid activation functions, respectively. The second layer utilizes EE-Conv combined with a ReLU activation function to accurately extract edge features from the feature map. Following the convolution operations of the main path, an edge attention weight matrix of size 1 × H × W is generated, effectively increasing the weight of pixels in the target edge area. This weight matrix is then convolved with the input image's edge features transmitted by the sub-path to obtain the edge attention feature map, which serves as input for the final convolution layer.
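Putting the main path and sub-path together, a hedged PyTorch sketch of the module as described above is given below, reusing the EEConv sketch from earlier in this subsection; the mid-channel width and the element-wise reweighting of the sub-path features are assumptions made where the text is not explicit:

```python
import torch
import torch.nn as nn

class EAEM(nn.Module):
    """Edge Attention Extraction Module (sketch). Main path: 1x1 conv + ReLU ->
    EE-Conv + ReLU -> 1x1 conv + Sigmoid, producing a 1 x H x W edge-attention
    map; the map then reweights the sub-path features element-wise."""
    def __init__(self, channels: int, mid_channels: int = 16):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(channels, mid_channels, 1),
                                    nn.ReLU(inplace=True))
        self.edge = nn.Sequential(EEConv(mid_channels), nn.ReLU(inplace=True))
        self.attn = nn.Sequential(nn.Conv2d(mid_channels, 1, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Main path: edge-aware attention weights of shape (N, 1, H, W).
        weights = self.attn(self.edge(self.reduce(x)))
        # Sub-path: the incoming features, reweighted by the edge attention map
        # (assumed element-wise combination).
        return x * weights
```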
This design methodology effectively integrates spatial and edge information, enhancing the module’s information extraction capability. Notably, the entire structure contains only one short path, utilizing 1 × 1 convolution, which minimizes information processing time. This design incorporates a spatial attention mechanism, optimizing edge feature extraction while improving computational resource utilization efficiency. Consequently, it enhances the module’s performance and practicality while maintaining the stability and reliability of its core functions.
The engineering design of this structure provides crucial technical support for real-time applications and lightweight models. In the WETS-Net model, the first two layers typically output shallow features, primarily encompassing edge and texture features. These features aid in locating object boundaries and contours, providing fundamental cues to the image structure. Therefore, in this model’s design process, EAEM is positioned at the output of the encoder’s first two shallow feature layers. This placement allows for comprehensive extraction and enhancement of edge and texture features of weak targets, facilitating refinement of segmentation boundaries and details. Furthermore, it provides richer and more fine-grained low-scale information for subsequent feature fusion and segmentation tasks.
Regarding the computational overhead of the EAEM module, its carefully designed architecture achieves lightweight computation. The module’s efficiency can be analyzed as follows:
Main Path: The 1 × 1 convolutions applied in the first and third layers primarily serve for linear combination across channels. For RGB input images, the computational complexity is 3 × 3 × H × W, which is relatively low.
EE-Conv Layer: As the core of the module, it employs a 3 × 3 convolution kernel with a theoretical complexity of 3 × 3 × 3 × H × W. However, due to its optimized design, the actual computational cost is lower than traditional 3 × 3 convolutions. This layer precisely extracts edge features while reducing computational complexity through sparse convolution or other efficient strategies.
Edge Attention Weight Matrix: The generation of the 1 × H × W edge attention weight matrix, used to enhance pixel importance in target edge regions, primarily involves convolution operations during its creation process.
Branch Path: The convolution operation between the weight matrix and the input image produces the edge attention feature map.
Overall, the EAEM module optimizes computational costs in the 1 × 1 convolutions and the EE-Conv layer, ensuring efficient edge feature extraction while maintaining a lightweight and real-time capable structure. This design makes it suitable for real-time applications and lightweight model requirements.
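As a back-of-the-envelope illustration of the per-layer costs quoted above, the multiply counts can be computed directly; the 512 × 512 resolution below is an assumed example, not a value taken from the paper:

```python
# Back-of-the-envelope multiply counts for the EAEM layers discussed above,
# using the per-pixel costs quoted in the text. The 512 x 512 resolution is
# an assumed example resolution.
H, W = 512, 512

pointwise_1x1 = 3 * 3 * H * W      # 1x1 convolution on a 3-channel (RGB) input
ee_conv = 3 * 3 * 3 * H * W        # 3x3 EE-Conv, theoretical upper bound

print(f"1x1 conv multiplies: {pointwise_1x1:,}")  # 2,359,296
print(f"EE-Conv multiplies:  {ee_conv:,}")        # 7,077,888
```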
The module’s architecture balances the trade-off between computational efficiency and feature extraction effectiveness. By leveraging optimized convolution operations and a well-designed attention mechanism, EAEM achieves enhanced edge detection capabilities without significantly increasing computational overhead. This approach aligns with the growing demand for efficient, real-time capable models in various computer vision applications.
3.2. Multi-Scale Information Fusion Module
The primary role of the EAEM module is to extract and preserve sufficient edge information, while the MIFM aims to efficiently acquire rich global feature information. Some existing methods, such as multi-scale image inputs, large convolutional kernels, and spatial pyramids, although effective in capturing multi-scale features, often significantly increase the network’s complexity and computational load, rendering them unsuitable for real-time segmentation tasks. Additionally, the representation of convolutional features is not only related to the receptive field but also closely linked to the feature channels. Compared to enlarging the convolutional kernel size, adding more channels requires less computation. Therefore, utilizing a channel attention mechanism to extract multi-channel feature information may be a more optimal solution for real-time semantic segmentation tasks.
The channel attention mechanism has been widely applied in semantic segmentation tasks, with SENet and FC attention mechanisms being two representative methods. SENet introduces the SE module to adaptively adjust the weights of each feature channel, enabling the network to automatically learn the importance of different channels and enhance the quality of feature representation. The design of the SE module is concise and efficient, significantly boosting network performance at a low computational cost and achieving outstanding results in tasks like image classification and object detection. However, the channel weights learned by the SE module are globally unique, neglecting differences in spatial dimensions and limiting its performance in dense prediction tasks such as semantic segmentation.
To overcome this limitation, the FC attention mechanism employs fully connected layers to learn attention weights in both spatial and channel dimensions, enabling the network to adaptively adjust feature responses at different positions and channels. The FC attention mechanism outperforms the SE module in semantic segmentation tasks, demonstrating the importance of spatial attention for dense prediction tasks. Nevertheless, the FC attention mechanism introduces a large number of parameters and computational overhead, restricting its application in real-time semantic segmentation tasks. Therefore, finding ways to reduce the complexity of the channel attention mechanism while maintaining accuracy is a research direction worth further exploration.
To address these limitations, this study proposes a novel Multi-Scale Information Fusion Module (MIFM) designed to efficiently and flexibly fuse and extract multi-scale feature information. MIFM utilizes a lightweight channel attention mechanism to adaptively adjust channel weights in different scale feature maps, facilitating feature fusion and optimization across scales, as depicted in Figure 5.
In the structural design of MIFM, the input initially undergoes processing through a 3 × 3 convolution layer, followed by a ReLU activation function layer, and another 3 × 3 convolution layer. The number of channels is appropriately reduced to mitigate the computational burden. Subsequently, the feature map bifurcates into two parallel branches: the main path for generating the guidance vector, and the branch path.
In the main path of the guidance vector, the feature map traverses a global average pooling layer to generate a global descriptor. Unlike SENet, MIFM eschews the use of fully connected layers with large parameter counts, instead employing a 1 × 1 pointwise convolution to process the global descriptor, thereby enhancing model efficiency. Pointwise convolution operates on the channel dimension, significantly reducing the number of parameters and improving computational efficiency. Because pointwise convolution does not need to model spatial associations and only processes information between channels, it proves more efficient while still aggregating information across channels.
Following the pointwise convolution, the global descriptor is converted into a compressed guidance vector and then restored to the same number of channels as the input feature map through another 1 × 1 pointwise convolution, yielding a reweighted guidance vector. In the feature fusion stage, the guidance vector undergoes element-wise multiplication with the output of the shortcut branch, adaptively adjusting the weight of each channel to achieve selective feature enhancement.
Finally, the enhanced feature maps of different scales are concatenated along the channel dimension to form the final fused features, which are then passed to the subsequent layer. By introducing a lightweight channel attention mechanism and a cross-scale feature fusion strategy, MIFM can efficiently extract and integrate multi-scale contextual information, enhancing the quality and robustness of feature representation. This structural design effectively integrates the channel attention mechanism, improving the model’s performance and efficiency while maintaining accurate capture and effective utilization of feature information.
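A compact PyTorch sketch of MIFM as described in this subsection is given below; the reduction ratio, the activations around the pointwise convolutions, and the (encoder feature, decoder feature) calling convention are assumptions made for illustration rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class MIFM(nn.Module):
    """Multi-Scale Information Fusion Module (sketch). Two 3x3 convolutions
    reduce the channel count, then a lightweight channel-attention path
    (global average pooling -> 1x1 conv -> 1x1 conv) produces a guidance
    vector that reweights the shortcut branch; the result is concatenated
    with the decoder features along the channel dimension."""
    def __init__(self, in_channels: int, out_channels: int, reduction: int = 4):
        super().__init__()
        self.pre = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )
        self.guide = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                # global descriptor
            nn.Conv2d(out_channels, out_channels // reduction, 1),  # compress (pointwise)
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // reduction, out_channels, 1),  # restore channel count
            nn.Sigmoid(),                                           # assumed gating nonlinearity
        )

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # dec_feat is assumed to be upsampled to the same spatial size as enc_feat.
        x = self.pre(enc_feat)
        x = x * self.guide(x)                   # channel-wise reweighting of the shortcut
        return torch.cat([x, dec_feat], dim=1)  # cross-scale fusion along channels
```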
In WETS-Net, each skip connection layer incorporates a MIFM module before the decoders at each level. Through this design, the network can fully utilize the rich semantic information extracted by the encoder and the high-resolution spatial details restored by the decoder, thereby better capturing the multi-scale features of weak and small targets. Applying MIFM at each skip connection enables WETS-Net to optimize feature representation layer by layer and achieve cross-scale fusion. The introduction of this structure effectively enhances segmentation accuracy and efficiency while maintaining the stability and interpretability of the overall network architecture.
The improved network can more effectively learn and utilize the importance of different channels, thereby increasing focus on key features and reducing noise interference in the final feature representation. This layer-wise optimization and cross-scale fusion method enables WETS-Net to perform well in processing complex scenes, not only improving the accuracy of segmentation tasks but also enhancing overall computational efficiency. Consequently, the network becomes more suitable for practical application scenarios.
3.3. Total Network Framework
WETS-Net is a real-time segmentation network built upon the foundation of U-Net, enhanced by the incorporation of the EAEM and MIFM modules, as illustrated in Figure 2. The EAEM module extracts edge feature information from the first and second low-scale downsampled features at the encoding stage, while the MIFM module focuses on channel information extraction from all downsampled features. These modules are interconnected during the adjacent upsampling process in the decoder, facilitating further information aggregation for feature fusion.
WETS-Net is an efficient and accurate image-processing network that demonstrates outstanding performance in terms of both real-time capability and precision. From a computational complexity perspective, the network, based on U-Net, introduces only a small number of additional 3 × 3 convolutional layers, maximizing GPU resource utilization and reducing computational waste. Moreover, the foundational U-Net model itself is a lightweight multitask network, further decreasing the time overhead during inference. In short, WETS-Net maintains performance while balancing computational efficiency and resource utilization.
To enhance the network’s feature extraction and fusion capabilities, WETS-Net incorporates two key modules: EAEM and MIFM. EAEM not only focuses on target edge features but also reorganizes and optimizes spatial features, enabling the network to better capture structural information about the targets. Meanwhile, MIFM strengthens the fusion and transmission of contextual information by modeling the dependencies between features at different scales. The synergistic action of these two modules allows WETS-Net to explore semantic information from multiple perspectives, leading to higher segmentation accuracy.
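To summarize how the pieces fit together, the following sketch shows one plausible wiring of EAEM and MIFM into a U-Net-style backbone; the number of stages, channel widths, and upsampling details are all assumptions, since the paper specifies only where the modules are placed:

```python
import torch.nn as nn

class WETSNetSketch(nn.Module):
    """Illustrative wiring of WETS-Net: a U-Net-style encoder/decoder with EAEM
    after the first two (shallow) encoder stages and an MIFM on every skip
    connection. Stage definitions, channel widths, and upsampling details are
    assumptions; only the module placement follows the text."""
    def __init__(self, encoder_stages, decoder_stages, eaems, mifms):
        super().__init__()
        self.enc = nn.ModuleList(encoder_stages)   # downsampling stages
        self.dec = nn.ModuleList(decoder_stages)   # upsampling stages
        self.eaems = nn.ModuleList(eaems)          # applied to the first two stages
        self.mifms = nn.ModuleList(mifms)          # one per skip connection

    def forward(self, x):
        skips = []
        for i, stage in enumerate(self.enc):
            x = stage(x)
            # Edge attention only on the two shallow feature maps.
            skips.append(self.eaems[i](x) if i < 2 else x)
        for stage, mifm, skip in zip(self.dec, self.mifms, reversed(skips[:-1])):
            x = stage(x)          # upsample + convolutions (assumed)
            x = mifm(skip, x)     # adaptive cross-scale fusion at the skip connection
        return x
```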
WETS-Net employs a multi-task joint optimization strategy in its loss function design, incorporating two auxiliary loss terms (EAEM and MIFM outputs) to enhance network training. This approach not only supervises the final output but also emphasizes the learning of intermediate layer features, promoting more effective gradient propagation and consequently improving convergence speed and generalization performance. The comprehensive loss function concept allows for a balanced weighting of the main task and auxiliary tasks, ensuring that the model achieves an optimal equilibrium across various objectives. This sophisticated design enhances the model’s expressiveness while deepening feature learning and providing more extensive supervision throughout the network training process. The loss function is formulated as follows:
$$L_{total} = L_{final} + \alpha L_{EAEM} + \beta L_{MIFM} \tag{1}$$

In the equation, α and β are weight parameters that balance the auxiliary loss terms and adjust the influence of the two attention mechanism modules. After verification through simulation experiments, both weight parameters of the auxiliary loss terms were set to 0.5 to balance the contributions of the attention modules. In the formula, L_final, L_EAEM, and L_MIFM represent the losses of the entire network, the EAEM, and the MIFM, respectively, and L_total represents the total loss function. It should be pointed out that the two auxiliary loss terms are used only during the training phase. All of these losses are computed with the cross-entropy loss function, as shown in Equation (2). During training, by appropriately adjusting the weight parameters and the structure of the loss function, the model can better learn the data features and improve its performance. This careful loss design and weight adjustment promotes effective training of the network and optimization of its final performance.

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log(p_{ij}) \tag{2}$$
In the equation, N represents the number of samples, C represents the number of categories, yij is an indicator variable (if the true category of sample i is j, then yij = 1; otherwise it is 0), and pij represents the model’s predicted probability that sample i belongs to category j. Since the value of pij ranges from 0 to 1, the value range of the calculated cross-entropy loss function is [0, +∞). When the model predicts completely accurately, the loss function is 0; as the deviation between the prediction and the true label increases, the value of the loss function also increases. In practical applications, a smaller loss function value indicates that the network has a better classification effect. In the training process of network optimization, minimizing the loss function is a crucial step. By continuously adjusting the network parameters to minimize the loss function, the network can better fit the training data and show better generalization ability on unseen data. Therefore, when training deep learning models, continuously monitoring and optimizing the loss function is a key step to ensure model performance and generalization ability.
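For concreteness, a minimal sketch of the composite loss in Equations (1) and (2), assuming PyTorch's cross-entropy and auxiliary prediction heads on the EAEM and MIFM branches (the existence and shapes of these auxiliary logits are assumptions), is:

```python
import torch.nn.functional as F

def wets_net_loss(final_logits, eaem_logits, mifm_logits, target,
                  alpha: float = 0.5, beta: float = 0.5):
    """Composite training loss of Equations (1)-(2): cross-entropy on the main
    output plus weighted auxiliary cross-entropy terms supervising the EAEM and
    MIFM branches. The auxiliary heads producing eaem_logits and mifm_logits
    are assumed; they are used only during training."""
    # Logits: (N, C, H, W); target: (N, H, W) with per-pixel class indices.
    l_final = F.cross_entropy(final_logits, target)
    l_eaem = F.cross_entropy(eaem_logits, target)
    l_mifm = F.cross_entropy(mifm_logits, target)
    return l_final + alpha * l_eaem + beta * l_mifm
```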
In general, the loss function design of WETS-Net not only improves the expressiveness of the model but also deepens the level of feature learning, allowing the network to better understand the intrinsic structure of the data. By introducing auxiliary loss terms, the network obtains more supervisory signals during the learning process, which helps the network learn effective feature representations faster and improves the generalization performance of the network. The design of the comprehensive loss function effectively models the relationship between different tasks, provides more information and guidance for network training, and enables the model to more comprehensively learn and optimize the correlation and trade-offs between various tasks.
5. Conclusions
The paper introduces a novel dual-attention mechanism semantic segmentation network called WETS-Net to address the challenges faced in weak edge target image segmentation tasks, such as transparent consumables. Built upon the U-Net architecture, this network incorporates the EAEM and MIFM modules to enhance texture details in shallow features and fuse contextual information across different scales. The EAEM module employs a spatial attention mechanism to adaptively adjust the weights of different regions in shallow feature maps, emphasizing texture details of weak edge targets while suppressing background interference. The design of EAEM integrates the EE-Conv convolutional kernel, enabling the network to better capture edge features of targets and improve segmentation accuracy.
In the decoding phase, WETS-Net enhances the skip connections of U-Net by introducing the MIFM module at each connection point. This allows the network to adaptively fuse high-level semantic features extracted by the encoder with high-resolution spatial details restored by the decoder, reducing the impact of blurry boundaries and background noise on segmentation results. MIFM utilizes a lightweight channel attention mechanism to adjust channel weights in different scale feature maps adaptively, achieving cross-scale feature fusion and optimization. Compared to existing channel attention mechanisms, MIFM significantly reduces parameter and computational overhead while maintaining performance, making it more suitable for real-time segmentation tasks.
To further improve segmentation performance, a multi-task learning strategy is employed, incorporating a composite loss function consisting of primary and auxiliary losses. The auxiliary loss supervises the learning process of the EAEM and MIFM modules, directing the network’s focus towards the feature extraction and fusion performance of these key modules.
Experimental results on the Trans10K dataset and a custom dataset demonstrate that WETS-Net achieves significant performance improvements in weak edge target image segmentation tasks, surpassing existing semantic segmentation methods. Through ablation studies, the effectiveness of the proposed modules is further validated. However, the paper acknowledges the limitations of CNN-based methods in global semantic understanding despite their strong local feature extraction capabilities and suitability for small-scale datasets. Future research directions could explore enhancing the network’s perception and understanding of key target features, such as by introducing Transformer-based global attention mechanisms or designing more efficient cross-scale feature fusion strategies.