Search Results (263)

Search Parameters:
Keywords = weakly supervised learning

25 pages, 5766 KB  
Article
Early-Stage Wildfire Detection: A Weakly Supervised Transformer-Based Approach
by Tina Samavat, Amirhessam Yazdi, Feng Yan and Lei Yang
Fire 2025, 8(11), 413; https://doi.org/10.3390/fire8110413 (registering DOI) - 25 Oct 2025
Abstract
Smoke detection is a practical approach for the early identification of wildfires, mitigating hazards that affect ecosystems, infrastructure, property, and communities. Existing deep learning (DL) object detection methods (e.g., the Detection Transformer (DETR)) have demonstrated significant potential for early awareness of these events. However, their precision is limited by the low visual salience of smoke and the reliability of annotations, and collecting real-world datasets with precise annotations is labor-intensive and time-consuming. To address this challenge, we propose a weakly supervised Transformer-based approach with a teacher–student architecture designed explicitly for smoke detection that reduces the need for extensive labeling effort. In the proposed approach, an expert model serves as the teacher, guiding the student model to learn from a variety of annotation types, including bounding boxes, point labels, and unlabeled images. This adaptability reduces the dependency on exhaustive manual annotation. The approach integrates a Deformable-DETR backbone with a modified loss function to enhance the detection pipeline by improving spatial reasoning, supporting multi-scale feature learning, and facilitating a deeper understanding of global context. Experimental results demonstrate performance comparable to, and in some cases exceeding, that of fully supervised models, including DETR and YOLOv8. Moreover, this study expands the existing datasets to offer a more comprehensive resource for the research community.
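
To make the teacher–student idea concrete, here is a minimal sketch, not the authors' implementation: the teacher (assumed here to be an exponential-moving-average copy of the student) filters its confident detections on unlabeled images into pseudo-boxes; the threshold and EMA momentum are illustrative assumptions.

import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher tracks an exponential moving average of the student's weights.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

@torch.no_grad()
def pseudo_boxes(scores, boxes, conf_thresh=0.7):
    # Keep only confident teacher detections as pseudo-labels for the student.
    # scores: (N,) confidences; boxes: (N, 4) xyxy coordinates.
    keep = scores > conf_thresh
    return boxes[keep], scores[keep]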

22 pages, 3532 KB  
Article
Dual Weakly Supervised Anomaly Detection and Unsupervised Segmentation for Real-Time Railway Perimeter Intrusion Monitoring
by Donghua Wu, Yi Tian, Fangqing Gao, Xiukun Wei and Changfan Wang
Sensors 2025, 25(20), 6344; https://doi.org/10.3390/s25206344 - 14 Oct 2025
Viewed by 306
Abstract
The high operational velocities of high-speed trains constrain onboard track intrusion detection systems in real-time capture and analysis, owing to limited computational resources and motion-blurred imagery. This underscores the critical need for track perimeter intrusion monitoring systems. Consequently, an intelligent monitoring system employing trackside cameras is constructed, integrating weakly supervised video anomaly detection with unsupervised foreground segmentation, offering a solution for monitoring foreign objects on high-speed rail tracks. To address the challenges of complex dataset annotation and detection of unknown targets, a weakly supervised video-based detection method is proposed for foreign object intrusions. Pretraining of Xception3D and the integration of multiple attention mechanisms markedly enhance feature extraction. Top-K sample selection, together with the amplitude score/feature loss function, effectively discriminates abnormal from normal samples, and time-smoothing constraints ensure detection consistency across consecutive frames. Once abnormal video frames are identified, a multiscale variational autoencoder is proposed for localizing foreign objects. A downsampling/upsampling module is optimized to increase feature extraction efficiency, and a pixel-level background weight distribution loss function is engineered to jointly balance background authenticity and noise resistance. Ultimately, the experimental results indicate that the video anomaly detection model achieved an AUC of 0.99 on the track anomaly detection dataset and processes 2 s video segments in 0.41 s. The proposed foreground segmentation algorithm achieved an F1 score of 0.9030 on the track anomaly dataset and 0.8375 on CDnet2014 at 91 frames per second, confirming its efficacy.
(This article belongs to the Section Sensing and Imaging)
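
A minimal sketch of the Top-K selection named above, under stated assumptions (it is not the paper's amplitude score/feature loss): with only a video-level label, the mean of the k highest segment scores represents the video in a binary cross-entropy term, plus a time-smoothing penalty.

import torch
import torch.nn.functional as F

def topk_mil_loss(segment_scores, video_label, k=3, smooth_weight=1e-3):
    # segment_scores: (T,) anomaly scores in (0, 1); video_label: 0 or 1.
    topk = segment_scores.topk(min(k, segment_scores.numel())).values.mean()
    target = segment_scores.new_tensor(float(video_label))
    loss = F.binary_cross_entropy(topk, target)
    # Time-smoothing constraint: adjacent segments should score similarly.
    loss = loss + smooth_weight * (segment_scores[1:] - segment_scores[:-1]).pow(2).sum()
    return loss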

17 pages, 1307 KB  
Article
Video Content Plagiarism Detection Using Region-Based Feature Learning
by Xun Jin, Su Yan, Rongchun Chen, Xuanyou Li, De Li and Yanwei Wang
Electronics 2025, 14(20), 4011; https://doi.org/10.3390/electronics14204011 - 13 Oct 2025
Viewed by 358
Abstract
Due to the continuous increase in copyright infringement involving video content, the economic losses of copyright owners continue to rise. To improve the efficiency of plagiarism detection in video content, in this paper we propose region-based video feature learning. The first innovation of this paper lies in combining temporal positional encoding and attention mechanisms to extract global features for weakly supervised model training. Self- and cross-attention mechanisms are combined to enhance similar features within and between videos, with positional encoding incorporated to capture temporal relationships between video frames. A global classification descriptor is embedded to capture global spatiotemporal information and is combined with a weakly supervised loss for model training. The second innovation is the frame sequence similarity calculation, composed of Chamfer similarity, a coordinate attention mechanism, and residual connections, which aggregates similarity scores between videos. Experimental results show that the proposed method achieves an mAP of 0.907 on a short-video dataset from Douyin. The proposed method outperforms frame-level and video-level feature baselines in detection accuracy and further improves video content plagiarism detection performance.
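
For reference, Chamfer similarity over frame features can be sketched as below (a common formulation, assumed rather than taken from the paper, which additionally applies coordinate attention and residual connections):

import torch

def chamfer_similarity(x, y):
    # x: (Tx, D) and y: (Ty, D) L2-normalized frame descriptors.
    sim = x @ y.t()                      # (Tx, Ty) pairwise cosine similarities
    return sim.max(dim=1).values.mean()  # best match in y per frame of x, averaged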

19 pages, 3418 KB  
Article
WSVAD-CLIP: Temporally Aware and Prompt Learning with CLIP for Weakly Supervised Video Anomaly Detection
by Min Li, Jing Sang, Yuanyao Lu and Lina Du
J. Imaging 2025, 11(10), 354; https://doi.org/10.3390/jimaging11100354 - 10 Oct 2025
Viewed by 605
Abstract
Weakly Supervised Video Anomaly Detection (WSVAD) is a critical task in computer vision. It aims to localize and recognize abnormal behaviors using only video-level labels. Without frame-level annotations, it becomes significantly challenging to model temporal dependencies. Given the diversity of abnormal events, it is also difficult to model semantic representations. Recently, the cross-modal pre-trained model Contrastive Language-Image Pretraining (CLIP) has shown a strong ability to align visual and textual information. This provides new opportunities for video anomaly detection. Inspired by CLIP, WSVAD-CLIP is proposed as a framework that uses its cross-modal knowledge to bridge the semantic gap between text and vision. First, the Axial-Graph (AG) Module is introduced. It combines an Axial Transformer and Lite Graph Attention Networks (LiteGAT) to capture global temporal structures and local abnormal correlations. Second, a Text Prompt mechanism is designed. It fuses a learnable prompt with a knowledge-enhanced prompt to improve the semantic expressiveness of category embeddings. Third, the Abnormal Visual-Guided Text Prompt (AVGTP) mechanism is proposed to aggregate anomalous visual context for adaptively refining textual representations. Extensive experiments on the UCF-Crime and XD-Violence datasets show that WSVAD-CLIP notably outperforms existing methods in coarse-grained anomaly detection. It also achieves superior performance in fine-grained anomaly recognition tasks, validating its effectiveness and generalizability.
(This article belongs to the Section Computer Vision and Pattern Recognition)
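
The CLIP-side scoring can be pictured with the hedged sketch below, which assumes precomputed frame and prompt embeddings and omits the AG and AVGTP modules:

import torch
import torch.nn.functional as F

def frame_class_logits(frame_feats, text_feats, temperature=0.01):
    # frame_feats: (T, D) visual embeddings; text_feats: (C, D) category prompts.
    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return f @ t.t() / temperature  # (T, C) per-frame alignment logits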

21 pages, 899 KB  
Article
Gated Fusion Networks for Multi-Modal Violence Detection
by Bilal Ahmad, Mustaqeem Khan and Muhammad Sajjad
AI 2025, 6(10), 259; https://doi.org/10.3390/ai6100259 - 3 Oct 2025
Viewed by 511
Abstract
Public safety and security require effective monitoring systems that detect violence from visual, audio, and motion data. However, current methods often fail to exploit the complementary strengths of the visual and auditory modalities, reducing their overall effectiveness. To enhance violence detection, we present a novel multimodal method that draws on motion, audio, and visual information from the input to recognize violence. We designed a framework comprising two specialized components, a gated fusion module and a multi-scale transformer, which together enable efficient detection of violence in multimodal data. To ensure seamless and effective feature integration, the gated fusion module dynamically adjusts the contribution of each modality, while the multi-modal transformer uses multiple instance learning (MIL) to identify violent behaviors more accurately by capturing complex temporal correlations. Our model fully integrates multi-modal information using these techniques, improving the accuracy of violence detection. Our approach outperformed state-of-the-art methods with an accuracy of 86.85% on the XD-Violence dataset, demonstrating the potential of multi-modal fusion for detecting violence.
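
A gated fusion module of the kind described can be sketched as follows (the dimensions and the two-modality restriction are assumptions of this example):

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual, audio):
        # A sigmoid gate mixes the two modalities feature by feature.
        g = torch.sigmoid(self.gate(torch.cat([visual, audio], dim=-1)))
        return g * visual + (1.0 - g) * audio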

20 pages, 162180 KB  
Article
Annotation-Efficient and Domain-General Segmentation from Weak Labels: A Bounding Box-Guided Approach
by Ammar M. Okran, Hatem A. Rashwan, Sylvie Chambon and Domenec Puig
Electronics 2025, 14(19), 3917; https://doi.org/10.3390/electronics14193917 - 1 Oct 2025
Viewed by 461
Abstract
Manual pixel-level annotation remains a major bottleneck in deploying deep learning models for dense prediction and semantic segmentation tasks across domains. This challenge is especially pronounced in applications involving fine-scale structures, such as cracks in infrastructure or lesions in medical imaging, where annotations are time-consuming, expensive, and subject to inter-observer variability. To address these challenges, this work proposes a weakly supervised and annotation-efficient segmentation framework that integrates sparse bounding-box annotations with a limited subset of strong (pixel-level) labels to train robust segmentation models. The fundamental element of the framework is a lightweight Bounding Box Encoder that converts weak annotations into multi-scale attention maps. These maps guide a ConvNeXt-Base encoder, and a lightweight U-Net-style convolutional neural network (CNN) decoder, using nearest-neighbor upsampling and skip connections, reconstructs the final segmentation mask. This design enables the model to focus on semantically relevant regions without relying on full supervision, drastically reducing annotation cost while maintaining high accuracy. We validate our framework on two distinct domains, road crack detection and skin cancer segmentation, demonstrating that it achieves performance comparable to fully supervised segmentation models using only 10–20% of strong annotations. Given the ability of the proposed framework to generalize across varied visual contexts, it has strong potential as a general annotation-efficient segmentation tool for domains where strong labeling is costly or infeasible.
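
One plausible reading of the Bounding Box Encoder idea, sketched under assumptions (the paper's multi-scale variant is more elaborate), turns boxes into a soft spatial attention map:

import torch

def boxes_to_attention(boxes, height, width, sigma=8.0):
    # boxes: (N, 4) pixel xyxy. Returns an (H, W) map in [0, 1] that is 1 inside
    # each box and decays with distance outside it.
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    attn = torch.zeros(height, width)
    for x1, y1, x2, y2 in boxes.tolist():
        dx = torch.clamp(torch.maximum(x1 - xs, xs - x2), min=0.0)
        dy = torch.clamp(torch.maximum(y1 - ys, ys - y2), min=0.0)
        dist2 = dx ** 2 + dy ** 2  # squared distance to the nearest box point
        attn = torch.maximum(attn, torch.exp(-dist2 / (2 * sigma ** 2)))
    return attn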

17 pages, 3106 KB  
Article
Weakly Supervised Gland Segmentation Based on Hierarchical Attention Fusion and Pixel Affinity Learning
by Yanli Liu, Mengchen Lin, Xiaoqian Sang, Guidong Bao and Yunfeng Wu
Bioengineering 2025, 12(9), 992; https://doi.org/10.3390/bioengineering12090992 - 18 Sep 2025
Viewed by 449
Abstract
Precise segmentation of glands in histopathological images is essential for the diagnosis of colorectal cancer, as changes in gland morphology are associated with pathological progression. Conventional computer-assisted methods rely on dense pixel-level annotations, which are costly and labor-intensive to obtain. The present study proposes a two-stage weakly supervised segmentation framework named Multi-Level Attention and Affinity (MAA). The MAA framework utilizes image-level labels and combines the Multi-Level Attention Fusion (MAF) and Affinity Refinement (AR) modules. The MAF module extracts hierarchical features from multiple transformer layers to grasp global semantic context and generates more comprehensive initial class activation maps. By modeling inter-pixel semantic consistency, the AR module refines the pseudo-labels, sharpening boundary delineation and reducing label noise. Experiments on the GlaS dataset showed that the proposed MAA framework achieves an Intersection over Union (IoU) of 81.99% and a Dice coefficient of 90.10%, outperforming the state-of-the-art Online Easy Example Mining (OEEM) method by 4.43% in IoU. These results demonstrate the effectiveness of integrating hierarchical attention mechanisms with affinity-guided refinement for annotation-efficient and robust gland segmentation.
(This article belongs to the Special Issue Recent Progress in Biomedical Image Processing and Analysis)
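
The affinity refinement step can be illustrated by the sketch below, assuming pixel embeddings are available: CAM scores take a few random-walk steps along an affinity matrix so boundaries tighten (note the (N, N) affinity is memory-hungry at full resolution).

import torch
import torch.nn.functional as F

def affinity_refine(cam, feats, iters=3, temperature=0.1):
    # cam: (C, N) class scores over N pixels; feats: (N, D) pixel embeddings.
    f = F.normalize(feats, dim=-1)
    aff = torch.softmax(f @ f.t() / temperature, dim=-1)  # (N, N), rows sum to 1
    for _ in range(iters):
        cam = cam @ aff.t()  # one propagation step along strong affinities
    return cam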

20 pages, 774 KB  
Article
Enhanced Pseudo-Labels for End-to-End Weakly Supervised Semantic Segmentation with Foundation Models
by Xuesheng Zhou and Zhenzhou Tang
Appl. Sci. 2025, 15(18), 10013; https://doi.org/10.3390/app151810013 - 12 Sep 2025
Viewed by 685
Abstract
Weakly supervised semantic segmentation (WSSS) aims to learn pixel-level semantic concepts from image-level class labels. Due to the simplicity and efficiency of the training pipeline, end-to-end WSSS has received significant attention from the research community. However, the coarse nature of pseudo-label regions remains one of the primary bottlenecks limiting the performance of such methods. To address this issue, we propose class-guided enhanced pseudo-labeling (CEP), a method designed to generate high-quality pseudo-labels for end-to-end WSSS frameworks. Our approach leverages pretrained foundation models, such as contrastive language-image pre-training (CLIP) and the segment anything model (SAM), to enhance pseudo-label quality. Specifically, following the pseudo-label generation pipeline, we introduce two key components: a Skip-CAM module and a pseudo-label refinement module. The Skip-CAM module enriches feature representations by introducing skip connections from multiple blocks of CLIP, thereby improving the quality of localization maps. The refinement module then utilizes SAM to refine and correct the pseudo-labels based on the initial class-specific regions derived from the localization maps. Experimental results demonstrate that our method surpasses state-of-the-art end-to-end approaches as well as many multi-stage competitors.
(This article belongs to the Section Computing and Artificial Intelligence)
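
The refinement can be pictured with this hedged sketch, which assumes SAM's region masks are already computed as boolean arrays and simply lets each region vote on its class from the coarse pseudo-label:

import numpy as np

def refine_with_regions(pseudo_label, region_masks, ignore_index=255):
    # pseudo_label: (H, W) integer class map; region_masks: list of (H, W) bool.
    refined = pseudo_label.copy()
    for mask in region_masks:
        votes = pseudo_label[mask]
        votes = votes[votes != ignore_index]
        if votes.size:  # assign the region its majority class
            refined[mask] = np.bincount(votes).argmax()
    return refined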

23 pages, 3606 KB  
Article
Dual-Stream Attention-Enhanced Memory Networks for Video Anomaly Detection
by Weishan Gao, Xiaoyin Wang, Ye Wang and Xiaochuan Jing
Sensors 2025, 25(17), 5496; https://doi.org/10.3390/s25175496 - 4 Sep 2025
Viewed by 1086
Abstract
Weakly supervised video anomaly detection (WSVAD) aims to identify unusual events using only video-level labels. However, current methods face several key challenges, including ineffective modelling of complex temporal dependencies, indistinct feature boundaries between visually similar normal and abnormal events, and high false alarm rates caused by an inability to distinguish salient events from complex background noise. This paper proposes a novel method that systematically enhances feature representation and discrimination to address these challenges. The proposed method first builds robust temporal representations by employing a hierarchical multi-scale temporal encoder and a position-aware global relation network to capture both local and long-range dependencies. The core of this method is the dual-stream attention-enhanced memory network, which achieves precise discrimination by learning distinct normal and abnormal patterns via dual memory banks, while utilising bidirectional spatial attention to mitigate background noise and focus on salient events before memory querying. The models underwent a comprehensive evaluation utilising solely RGB features on two demanding public datasets, UCF-Crime and XD-Violence. The experimental findings indicate that the proposed method attains state-of-the-art performance, achieving 87.43% AUC on UCF-Crime and 85.51% AP on XD-Violence. This result demonstrates that the proposed “attention-guided prototype matching” paradigm effectively resolves the aforementioned challenges, enabling robust and precise anomaly detection.
(This article belongs to the Section Sensing and Imaging)
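
Memory querying of this kind is commonly implemented as attention over learned prototypes; the sketch below is an assumption-laden miniature (one bank shown, no spatial attention):

import torch

def query_memory(features, memory):
    # features: (T, D) snippet features; memory: (M, D) learned prototypes.
    attn = torch.softmax(features @ memory.t(), dim=-1)  # (T, M) addressing weights
    read = attn @ memory                                 # (T, D) prototype mixture
    residual = (features - read).norm(dim=-1)            # distance to known patterns
    return read, residual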

21 pages, 1662 KB  
Article
Controllable Speech-Driven Gesture Generation with Selective Activation of Weakly Supervised Controls
by Karlo Crnek and Matej Rojc
Appl. Sci. 2025, 15(17), 9467; https://doi.org/10.3390/app15179467 - 28 Aug 2025
Viewed by 602
Abstract
Generating realistic and contextually appropriate gestures is crucial for creating engaging embodied conversational agents. Although speech is the primary input for gesture generation, adding controls such as gesture velocity, hand height, and emotion is essential for generating more natural, human-like gestures. However, current approaches to controllable gesture generation often utilize a limited number of control parameters and lack the ability to activate or deactivate them selectively. Therefore, in this work, we propose the Cont-Gest model, a Transformer-based gesture generation model that enables selective control activation through masked training and a control fusion strategy. Furthermore, to better support the development of such models, we propose a novel evaluation-driven development (EDD) workflow, which combines several iterative tasks: automatic control signal extraction, control specification, visual (subjective) feedback, and objective evaluation. This workflow enables continuous monitoring of model performance and facilitates iterative refinement through feedback-driven development cycles. For objective evaluation, we use the validated Kinetic–Hellinger distance, an objective metric that correlates strongly with human perception of gesture quality. We evaluated multiple model configurations and control dynamics strategies within the proposed workflow. Experimental results show that Feature-wise Linear Modulation (FiLM) conditioning, combined with single-mask training and voice activity scaling, achieves the best balance between gesture quality and adherence to control inputs.
(This article belongs to the Section Computing and Artificial Intelligence)
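
FiLM conditioning with selective activation can be sketched as below (the shapes and the masking convention are assumptions of this example, not the Cont-Gest code):

import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, control_dim, feature_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(control_dim, 2 * feature_dim)

    def forward(self, features, controls, active_mask):
        # features: (T, F); controls: (T, C); active_mask: (T, C) in {0, 1}.
        # Deactivated controls are zeroed before predicting scale and shift.
        gamma, beta = self.to_scale_shift(controls * active_mask).chunk(2, dim=-1)
        return (1.0 + gamma) * features + beta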

10 pages, 2952 KB  
Article
Weakly Supervised Monocular Fisheye Camera Distance Estimation with Segmentation Constraints
by Zhihao Zhang and Xuejun Yang
Electronics 2025, 14(17), 3429; https://doi.org/10.3390/electronics14173429 - 28 Aug 2025
Viewed by 563
Abstract
Monocular fisheye camera distance estimation is a crucial visual perception task for autonomous driving. Due to the practical challenges of acquiring precise depth annotations, existing self-supervised methods usually consist of a monocular distance model and an ego-motion predictor with the goal of minimizing a reconstruction matching loss. However, they suffer from inaccurate distance estimation in low-texture regions, especially road surfaces. In this paper, we introduce a weakly supervised learning strategy that incorporates semantic segmentation, instance segmentation, and optical flow as additional sources of supervision. In addition to the self-supervised reconstruction loss, we introduce a road surface flatness loss, an instance smoothness loss, and an optical flow loss to enhance the accuracy of distance estimation. We evaluate the proposed method on the WoodScape and SynWoodScape datasets, and it outperforms the self-supervised monocular baseline, FisheyeDistanceNet.
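
A road-surface flatness penalty could look like the sketch below (an illustrative form, assuming a boolean road mask from the semantic segmentation): neighbouring distance predictions on the road are encouraged to vary smoothly.

import torch

def road_flatness_loss(distance, road_mask):
    # distance: (H, W) predicted distances; road_mask: (H, W) bool road pixels.
    dx = (distance[:, 1:] - distance[:, :-1]).abs()
    dy = (distance[1:, :] - distance[:-1, :]).abs()
    mx = road_mask[:, 1:] & road_mask[:, :-1]  # both horizontal neighbours on road
    my = road_mask[1:, :] & road_mask[:-1, :]
    return dx[mx].mean() + dy[my].mean()  # assumes a non-empty road mask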

21 pages, 3126 KB  
Article
CViT Weakly Supervised Network Fusing Dual-Branch Local-Global Features for Hyperspectral Image Classification
by Wentao Fu, Xiyan Sun, Xiuhua Zhang, Yuanfa Ji and Jiayuan Zhang
Entropy 2025, 27(8), 869; https://doi.org/10.3390/e27080869 - 15 Aug 2025
Viewed by 733
Abstract
In hyperspectral image (HSI) classification, feature learning and label accuracy play a crucial role. In real hyperspectral scenes, however, noisy labels are unavoidable and seriously degrade the performance of methods. While deep learning has achieved remarkable results in HSI classification tasks, its noise resistance usually comes at the cost of feature representation capability. High-dimensional, deep convolutions can capture rich semantic features, but with high complexity and resource consumption. To address these problems, we propose a CViT Weakly Supervised Network (CWSN) for HSI classification. Specifically, a lightweight 1D-2D two-branch network is used for local generalization and enhancement of spatial–spectral features. Then, the fusion and characterization of local and global features are achieved through a CNN-Vision Transformer (CViT) cascade strategy. Experimental results on four benchmark HSI datasets show that CWSN has good noise resistance and remains robust and versatile on both clean and noisy training sets. Compared with other methods, CWSN achieves better classification accuracy.
(This article belongs to the Section Signal and Data Analysis)
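
The 1D-2D two-branch idea can be pictured with this assumed sketch: a 1D convolution runs along each pixel's spectrum while a 2D convolution runs over the spatial plane, and their features are concatenated for the CViT stage.

import torch
import torch.nn as nn

class DualBranch(nn.Module):
    def __init__(self, bands, dim=32):
        super().__init__()
        self.spectral = nn.Conv1d(1, dim, kernel_size=7, padding=3)
        self.spatial = nn.Conv2d(bands, dim, kernel_size=3, padding=1)

    def forward(self, patch):
        # patch: (B, bands, H, W) hyperspectral patch.
        b, c, h, w = patch.shape
        spec = self.spectral(patch.permute(0, 2, 3, 1).reshape(-1, 1, c))
        spec = spec.mean(-1).view(b, h, w, -1).permute(0, 3, 1, 2)  # (B, dim, H, W)
        return torch.cat([spec, self.spatial(patch)], dim=1)  # channel-wise fusion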

21 pages, 4714 KB  
Article
Automatic Scribble Annotations Based Semantic Segmentation Model for Seedling-Stage Maize Images
by Zhaoyang Li, Xin Liu, Hanbing Deng, Yuncheng Zhou and Teng Miao
Agronomy 2025, 15(8), 1972; https://doi.org/10.3390/agronomy15081972 - 15 Aug 2025
Viewed by 496
Abstract
Canopy coverage is a key indicator for assessing maize growth and predicting production during the seedling stage. Researchers usually use deep learning methods to estimate canopy coverage from maize images, but fully supervised models typically need pixel-level annotations, which require substantial manual labor. To overcome this problem, we propose ASLNet (Automatic Scribble Labeling-based Semantic Segmentation Network), a weakly supervised model for image semantic segmentation. We designed a module that self-generates scribble labels for maize plants in an image. Accordingly, ASLNet was constructed with a collaborative mechanism composed of scribble label generation, pseudo-label-guided training, and double-loss joint optimization. Its cross-scale contrastive regularization realizes semantic segmentation without manual labels. We evaluated the model for label quality and segmentation accuracy. The results showed that ASLNet generated high-quality scribble labels with stable segmentation performance across different scribble densities. Compared to Scribble4All, ASLNet improved mIoU by 3.15% and outperformed fully and weakly supervised models by 6.6% and 15.28% in segmentation accuracy, respectively. Our work demonstrates that ASLNet can be trained on pseudo-labels and offers a cost-effective approach for canopy coverage estimation at the maize seedling stage, enabling early assessment of maize growth and prediction of yield.
(This article belongs to the Section Precision and Digital Agriculture)
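
Scribble self-generation for seedlings can be sketched roughly as below (an assumed pipeline, not ASLNet's module): an excess-green index separates plants from soil, and skeletonization thins the mask into scribble-like strokes.

import numpy as np
from skimage.morphology import skeletonize

def auto_scribbles(rgb, exg_thresh=0.1):
    # rgb: (H, W, 3) float image in [0, 1]. Returns an (H, W) bool scribble mask.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    exg = 2 * g - r - b          # excess-green vegetation index
    plant = exg > exg_thresh     # coarse plant/soil separation
    return skeletonize(plant)    # thin blobs into scribble strokes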

18 pages, 3256 KB  
Article
YOLOv8-Seg with Dynamic Multi-Kernel Learning for Infrared Gas Leak Segmentation: A Weakly Supervised Approach
by Haoyang Shen, Lushuai Xu, Mingyue Wang, Shaohua Dong, Qingqing Xu, Feng Li and Haiyang Yu
Sensors 2025, 25(16), 4939; https://doi.org/10.3390/s25164939 - 10 Aug 2025
Cited by 1 | Viewed by 712
Abstract
Gas leak detection in oil and gas processing facilities is a critical component of the safety production monitoring system. Non-contact detection technology based on infrared imaging has emerged as a vital real-time monitoring method due to its rapid response and extensive coverage. However, existing pixel-level segmentation networks face challenges such as insufficient segmentation accuracy, rough gas edges, and jagged boundaries. To address these issues, this study proposes a novel pixel-level segmentation training framework based on anchor box annotation and enhances the segmentation performance of the YOLOv8-seg network for gas detection applications. First, a dynamic threshold is introduced into the Visual Background Extractor (ViBe) method, which, in combination with the YOLOv8-det network, generates binary masks to serve as training masks. Next, a segmentation head architecture is designed, incorporating dynamic kernels and multi-branch collaboration. This architecture utilizes feature concatenation under deformable convolution and attention mechanisms to replace parts of the original segmentation head, thereby enhancing the extraction of detailed gas features and reducing dependency on anchor boxes during segmentation. Finally, a joint Dice-BCE (Binary Cross-Entropy) loss, weighted by ViBe-CRF (Conditional Random Fields) confidence, replaces the original Seg_loss. This effectively reduces roughness and jaggedness at gas edges, significantly improving segmentation accuracy. Experimental results indicate that the improved network achieves a 6.4% increase in F1 score and a 7.6% improvement in mIoU (mean Intersection over Union). This advancement provides a new, real-time, and efficient detection algorithm for infrared imaging of gas leaks in oil and gas processing facilities. Furthermore, it introduces a low-cost weakly supervised approach for training pixel-level segmentation networks.
(This article belongs to the Section Optical Sensors)
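
A confidence-weighted joint Dice-BCE objective of the kind named above can be sketched as follows; the per-pixel weight standing in for the ViBe-CRF confidence is an assumption of this example.

import torch
import torch.nn.functional as F

def dice_bce_loss(logits, target, weight, eps=1.0):
    # logits, target, weight: (B, H, W); target in {0, 1}, weight in [0, 1].
    bce = F.binary_cross_entropy_with_logits(logits, target, weight=weight)
    prob = torch.sigmoid(logits)
    inter = (weight * prob * target).sum()
    union = (weight * (prob + target)).sum()
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)
    return bce + dice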

17 pages, 3807 KB  
Article
2AM: Weakly Supervised Tumor Segmentation in Pathology via CAM and SAM Synergy
by Chenyu Ren, Liwen Zou and Luying Gui
Electronics 2025, 14(15), 3109; https://doi.org/10.3390/electronics14153109 - 5 Aug 2025
Viewed by 699
Abstract
Tumor microenvironment (TME) analysis plays an extremely important role in computational pathology. Deep learning shows tremendous potential for tumor tissue segmentation on pathological images, which is an essential part of TME analysis. However, fully supervised segmentation methods based on deep learning usually require a large number of manual annotations, which is time-consuming and labor-intensive. Recently, weakly supervised semantic segmentation (WSSS) works based on the Class Activation Map (CAM) have shown promising results in learning segmentation from image-level class labels, but they usually produce imprecise boundaries due to the lack of pixel-wise supervision. On the other hand, the Segment Anything Model (SAM), a foundation model for segmentation, has shown an impressive ability for general semantic segmentation on natural images, but it is sensitive to noise in its initial prompts. To address these problems, we propose a simple but effective weakly supervised framework, termed 2AM, combining CAM and SAM for tumor tissue segmentation on pathological images. Our 2AM model is composed of three modules: (1) a CAM module for generating salient regions for tumor tissues on pathological images; (2) an adaptive point selection (APS) module that provides more reliable initial prompts for the subsequent SAM by exploiting three priors of basic appearance, spatial distribution, and feature difference; and (3) a SAM module for predicting the final segmentation. Experimental results on two independent datasets show that our proposed method boosts tumor segmentation accuracy by nearly 25% compared with the baseline method, and achieves more than 15% improvement over previous state-of-the-art segmentation methods under WSSS settings.
(This article belongs to the Special Issue AI-Driven Medical Image/Video Processing)
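
Adaptive point selection can be pictured with the hedged sketch below: confident, mutually distant CAM peaks become point prompts for SAM (the greedy suppression radius and threshold are illustrative assumptions, not the APS module's priors).

import numpy as np

def select_points(cam, num_points=5, min_dist=32, thresh=0.5):
    # cam: (H, W) activation map in [0, 1]. Returns a list of (row, col) prompts.
    cam = cam.copy()
    points = []
    while len(points) < num_points and cam.max() > thresh:
        r, c = np.unravel_index(cam.argmax(), cam.shape)
        points.append((int(r), int(c)))
        rr, cc = np.ogrid[:cam.shape[0], :cam.shape[1]]
        cam[(rr - r) ** 2 + (cc - c) ** 2 <= min_dist ** 2] = 0.0  # suppress area
    return points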
