Search Results (885)

Search Parameters:
Keywords = video mapping

20 pages, 985 KiB  
Article
Multi-Object Tracking Method Based on Domain Adaptation and Camera Motion Compensation
by Yongze Zhang, Feipeng Da and Haocheng Zhou
Electronics 2025, 14(11), 2238; https://doi.org/10.3390/electronics14112238 - 30 May 2025
Abstract
Although recent multi-object tracking (MOT) methods have shown impressive performance, MOT remains challenging due to two key issues: the poor generalization of ReID in MOT tasks and motion estimation errors caused by camera movement. To address these, we propose DCTrack, a novel tracker with two core modules: Domain Adaptation-based Batch Instance Normalization (DA-BIN) and camera motion compensation mapped to the Ground Plane (CMC-GP). DA-BIN enhances appearance modeling through domain adaptation using a combination of batch and instance normalization layers, allowing DCTrack to simulate unsuccessful generalization scenarios and improving ReID performance in MOT. CMC-GP uses a Kalman Filter-based method to map object motion estimation to the ground plane and applies the same compensation parameters across the video sequence, enhancing robustness to camera motion. Experimental results show that DCTrack effectively improves ReID generalization in MOT and performs well in scenes with camera motion, achieving a HOTA of 67.3% and an IDF1 of 65.8% on DanceTrack. Full article
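
As a rough illustration of the camera-motion-compensation idea (not DCTrack's ground-plane mapping, which the abstract does not detail), a tracker can estimate a global affine transform between consecutive frames and apply it to each Kalman-filter track position before the update step; the functions and parameter choices below are assumptions for the sketch.

```python
# Generic camera-motion compensation for Kalman-filter track states (illustrative only;
# DCTrack's ground-plane mapping and shared compensation parameters are not reproduced here).
import cv2
import numpy as np

def estimate_camera_motion(prev_gray, curr_gray):
    """Estimate a 2x3 affine transform between consecutive grayscale frames."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.eye(2, 3, dtype=np.float32)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good = status.ravel() == 1
    if good.sum() < 4:
        return np.eye(2, 3, dtype=np.float32)
    A, _ = cv2.estimateAffinePartial2D(pts[good], nxt[good], method=cv2.RANSAC)
    return A if A is not None else np.eye(2, 3, dtype=np.float32)

def compensate_track(mean_xy, A):
    """Apply the estimated camera motion to a track's predicted (x, y) position."""
    x, y = mean_xy
    x_new = A[0, 0] * x + A[0, 1] * y + A[0, 2]
    y_new = A[1, 0] * x + A[1, 1] * y + A[1, 2]
    return np.array([x_new, y_new])
```
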
16 pages, 2556 KiB  
Article
Deep Learning Method with Domain-Task Adaptation and Client-Specific Fine-Tuning YOLO11 Model for Counting Greenhouse Tomatoes
by Igor Glukhikh, Dmitry Glukhikh, Anna Gubina and Tatiana Chernysheva
Appl. Syst. Innov. 2025, 8(3), 71; https://doi.org/10.3390/asi8030071 - 27 May 2025
Abstract
This article discusses the tasks involved in the operational assessment of the volume of produced goods, such as tomatoes. The large-scale implementation of computer vision systems in greenhouses requires approaches that reduce costs, time and complexity, particularly in creating training data and preparing neural network models. Publicly available models like YOLO often lack the accuracy needed for specific tasks. This study proposes a method for the sequential training of detection models, incorporating Domain-Task Adaptation and Client-Specific Fine-Tuning. The model is initially trained on a large, specialized dataset for tasks like tomato detection, followed by fine-tuning with a small custom dataset reflecting real greenhouse conditions. This results in the light YOLO11n model achieving high validation accuracy (mAP50 > 0.83, Precision > 0.75, Recall > 0.73) while reducing computational resource requirements. Additionally, a custom training dataset was developed that captures the unique challenges of greenhouse environments, such as dense vegetation and occlusions. An algorithm for counting tomatoes was also created, which processes video frames to accurately count only the visible tomatoes in the front row of plants. This algorithm can be utilized in mobile video surveillance systems, enhancing monitoring efficiency in greenhouses. Full article
(This article belongs to the Section Artificial Intelligence)
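
A minimal sketch of the two-stage training recipe described above, using the Ultralytics API; the dataset YAML files (tomato_large.yaml, greenhouse_client.yaml) and hyperparameters are placeholders rather than the authors' settings.

```python
# Hypothetical two-stage training sketch: domain-task adaptation on a large tomato
# dataset, then client-specific fine-tuning on a small set of target-greenhouse images.
from ultralytics import YOLO

# Stage 1: domain-task adaptation on a large, specialized tomato-detection dataset.
model = YOLO("yolo11n.pt")
model.train(data="tomato_large.yaml", epochs=100, imgsz=640)

# Stage 2: client-specific fine-tuning on a small custom dataset reflecting real
# greenhouse conditions, freezing early layers so only later layers adapt.
model.train(data="greenhouse_client.yaml", epochs=30, imgsz=640, freeze=10, lr0=1e-4)

metrics = model.val(data="greenhouse_client.yaml")  # reports mAP50, precision, recall
```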

19 pages, 3395 KiB  
Article
End-to-End Online Video Stitching and Stabilization Method Based on Unsupervised Deep Learning
by Pengyuan Wang, Pinle Qin, Rui Chai, Jianchao Zeng, Pengcheng Zhao, Zuojun Chen and Bingjie Han
Appl. Sci. 2025, 15(11), 5987; https://doi.org/10.3390/app15115987 - 26 May 2025
Abstract
The limited field of view, cumulative inter-frame jitter, and dynamic parallax interference in handheld video stitching often lead to misalignment and distortion. In this paper, we propose an end-to-end, unsupervised deep-learning framework that jointly performs real-time video stabilization and stitching. First, a collaborative optimization architecture allows the stabilization and stitching modules to share parameters and propagate errors through a fully differentiable network, ensuring consistent image alignment. Second, a Markov trajectory smoothing strategy in relative coordinates models inter-frame motion as incremental relationships, effectively reducing cumulative errors. Third, a dynamic attention mask generates spatiotemporal weight maps based on foreground motion prediction, suppressing misalignment caused by dynamic objects. Experimental evaluation on diverse handheld sequences shows that our method achieves higher stitching quality, lower geometric distortion rates, and improved video stability compared to state-of-the-art baselines, while maintaining real-time processing capabilities. Ablation studies validate that relative trajectory modeling substantially mitigates long-term jitter and that the dynamic attention mask enhances stitching accuracy in dynamic scenes. These results demonstrate that the proposed framework provides a robust solution for high-quality, real-time handheld video stitching. Full article
(This article belongs to the Collection Trends and Prospects in Multimedia)
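
The trajectory-smoothing step can be illustrated with a conventional stabilization sketch: accumulate per-frame incremental motion, low-pass filter the resulting path, and apply the difference as a correction. This shows only the generic idea; the paper's Markov smoothing in relative coordinates is not reproduced.

```python
# Illustrative camera-path smoothing: per-frame increments (dx, dy, d_angle) are
# accumulated into a trajectory, smoothed, and the difference becomes the correction.
import numpy as np

def smooth_trajectory(increments, window=15):
    """increments: (N, 3) array of per-frame (dx, dy, d_angle)."""
    path = np.cumsum(increments, axis=0)             # raw cumulative camera path
    kernel = np.ones(window) / window
    smoothed = np.vstack([
        np.convolve(path[:, i], kernel, mode="same") for i in range(path.shape[1])
    ]).T
    return smoothed - path                            # per-frame correction to apply

increments = np.random.randn(300, 3) * 0.5            # stand-in for estimated inter-frame motion
corrections = smooth_trajectory(increments)
print(corrections.shape)                               # (300, 3)
```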

19 pages, 4931 KiB  
Article
A Hybrid Deep Learning Model for Early Forest Fire Detection
by Akhror Mamadmurodov, Sabina Umirzakova, Mekhriddin Rakhimov, Alpamis Kutlimuratov, Zavqiddin Temirov, Rashid Nasimov, Azizjon Meliboev, Akmalbek Abdusalomov and Young Im Cho
Forests 2025, 16(5), 863; https://doi.org/10.3390/f16050863 - 21 May 2025
Abstract
Forest fires pose an escalating global threat, severely impacting ecosystems, public health, and economies. Timely detection, especially during early stages, is critical for effective intervention. In this study, we propose a novel deep learning-based framework that augments the YOLOv4 object detection architecture with a modified EfficientNetV2 backbone and Efficient Channel Attention (ECA) modules. The backbone substitution leverages compound scaling and Fused-MBConv/MBConv blocks to improve representational efficiency, while the lightweight ECA blocks enhance inter-channel dependency modeling without incurring significant computational overhead. Additionally, we introduce a domain-specific preprocessing pipeline employing Canny edge detection, CLAHE + Jet transformation, and pseudo-NDVI mapping to enhance fire-specific visual cues in complex natural environments. Experimental evaluation on a hybrid dataset of forest fire images and video frames demonstrates substantial performance gains over baseline YOLOv4 and contemporary YOLO variants (YOLOv5–YOLOv9), with the proposed model achieving 97.01% precision, 95.14% recall, 93.13% mAP, and 92.78% F1-score. Furthermore, our model outperforms fourteen state-of-the-art approaches across standard metrics, confirming its efficacy, generalizability, and suitability for real-time deployment in UAV-based and edge computing platforms. These findings highlight the synergy between architectural optimization and domain-aware preprocessing for high-accuracy, low-latency wildfire detection systems. Full article
(This article belongs to the Special Issue Forest Fires Prediction and Detection—2nd Edition)
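
The preprocessing pipeline named above (Canny edges, CLAHE + Jet, pseudo-NDVI) can be sketched with standard OpenCV calls; the thresholds and the RGB-based pseudo-NDVI formula below are illustrative assumptions, not the authors' exact settings.

```python
# Rough sketch of the fire-cue preprocessing steps named in the abstract.
import cv2
import numpy as np

def preprocess_frame(bgr):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

    # 1) Canny edge map
    edges = cv2.Canny(gray, 100, 200)

    # 2) CLAHE contrast enhancement followed by a Jet colour map
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    jet = cv2.applyColorMap(clahe.apply(gray), cv2.COLORMAP_JET)

    # 3) Pseudo-NDVI approximated from RGB channels: (G - R) / (G + R), scaled to [0, 255]
    b, g, r = cv2.split(bgr.astype(np.float32))
    ndvi = (g - r) / (g + r + 1e-6)
    ndvi = ((ndvi + 1.0) * 127.5).astype(np.uint8)

    return edges, jet, ndvi
```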

20 pages, 4395 KiB  
Article
The Creation of Artificial Data for Training a Neural Network Using the Example of a Conveyor Production Line for Flooring
by Alexey Zaripov, Roman Kulshin and Anatoly Sidorov
J. Imaging 2025, 11(5), 168; https://doi.org/10.3390/jimaging11050168 - 20 May 2025
Abstract
This work is dedicated to the development of a system for generating artificial data for training neural networks used within a conveyor-based technology framework. It presents an overview of the application areas of computer vision (CV) and establishes that traditional methods of data collection and annotation—such as video recording and manual image labeling—are associated with high time and financial costs, which limits their efficiency. In this context, synthetic data represents an alternative capable of significantly reducing the time and financial expenses involved in forming training datasets. Modern methods for generating synthetic images using various tools—from game engines to generative neural networks—are reviewed. As a tool-platform solution, the concept of digital twins for simulating technological processes was considered, within which synthetic data is utilized. Based on the review findings, a generalized model for synthetic data generation was proposed and tested on the example of quality control for floor coverings on a conveyor line. The developed system provided the generation of photorealistic and diverse images suitable for training neural network models. A comparative analysis showed that the YOLOv8 model trained on synthetic data significantly outperformed the model trained on real images: the mAP50 metric reached 0.95 versus 0.36, respectively. This result demonstrates the high adequacy of the model built on the synthetic dataset and highlights the potential of using synthetic data to improve the quality of computer vision models when access to real data is limited. Full article
(This article belongs to the Special Issue Industrial Machine Learning with Image Technology Integration)
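
A minimal sketch of the reported protocol, training on synthetic renders and validating on real conveyor-line images, using the Ultralytics API; the dataset YAML names are hypothetical.

```python
# Train YOLOv8 on a purely synthetic dataset, then evaluate on held-out real photos.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(data="flooring_synthetic.yaml", epochs=100, imgsz=640)   # synthetic renders only

metrics = model.val(data="flooring_real.yaml")                       # real conveyor-line images
print(metrics.box.map50)                                             # compare with a model trained on real data
```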

17 pages, 1654 KiB  
Article
ConvGRU Hybrid Model Based on Neural Ordinary Differential Equations for Continuous Dynamics Video Object Detection
by Linbo Qian, Shanlin Sun and Shike Long
Electronics 2025, 14(10), 2033; https://doi.org/10.3390/electronics14102033 - 16 May 2025
Abstract
Video object detection involves identifying and localizing objects within video frames over time. However, challenges such as real-time processing requirements, motion blur, and the need for temporal consistency in video data make this task particularly demanding. This study proposes a novel hybrid model that integrates Neural Ordinary Differential Equations (Neural ODEs) with Convolutional Gated Recurrent Units (ConvGRU) to achieve continuous dynamics in object detection for video data. First, it leverages the continuous dynamics of Neural ODEs to define the hidden state transitions between observation points, enabling the model to naturally align with real-world time-based processes. Second, we present the FPN-Up module, which combines high-level semantic information with low-level spatial details to enhance the exploitation of multi-layer feature representations. Finally, we integrate a CBAM attention module into the detection head, enabling the model to emphasize the most salient input feature regions, thereby elevating detection precision while preserving the existing network structure. Evaluation on the KITTI object detection dataset reveals that our proposed model outperforms a vanilla video object detector by 2.8% in mAP while maintaining real-time processing capabilities. Full article
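
A toy sketch of the continuous-dynamics idea: a Neural ODE (via torchdiffeq, assumed available) evolves a convolutional hidden state between observation times, and a ConvGRU cell updates it at each frame. Layer sizes, timestamps, and features are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

class ODEFunc(nn.Module):
    """Defines dh/dt for the hidden state between observations."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Tanh())

    def forward(self, t, h):
        return self.net(h)

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU update at each observed frame."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n

ode_func, gru = ODEFunc(16), ConvGRUCell(16, 16)
h = torch.zeros(1, 16, 32, 32)
frames = torch.randn(5, 1, 16, 32, 32)             # pre-extracted per-frame features
times = torch.tensor([0.0, 0.1, 0.25, 0.4, 0.6])   # possibly irregular timestamps
for i, x in enumerate(frames):
    if i > 0:  # evolve the hidden state continuously from the previous observation time
        h = odeint(ode_func, h, times[i - 1:i + 1])[-1]
    h = gru(x, h)                                    # discrete update at the observation
```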

21 pages, 29272 KiB  
Article
Multi-Strategy Enhancement of YOLOv8n Monitoring Method for Personnel and Vehicles in Mine Air Door Scenarios
by Lei Zhang, Hongjing Tao, Zhipeng Sun and Weixun Yi
Sensors 2025, 25(10), 3128; https://doi.org/10.3390/s25103128 - 15 May 2025
Abstract
The mine air door is the primary facility for regulating airflow and controlling the passage of personnel and vehicles. Intelligent monitoring of personnel and vehicles within the mine air door system is a crucial measure to ensure the safety of mine operations. To address the issues of slow speed and low efficiency associated with traditional detection methods in mine air door scenarios, this study proposes a CGSW-YOLO man-vehicle monitoring model based on YOLOv8n. Firstly, the Faster Block module, which incorporates partial convolution (PConv), is integrated with the C2f module of the backbone network. This combination aims to minimize redundant calculations during the convolution process and expedite the model’s aggregation of multi-scale information. Secondly, standard convolution is replaced with GhostConv in the backbone network to further reduce the number of model parameters. Additionally, the Slim-neck module is integrated into the neck feature fusion network to enhance the information fusion capability of various feature maps while maintaining detection accuracy. Finally, WIoUv3 is utilized as the loss function, and a dynamic non-monotonic focusing mechanism is implemented to adjust the quality of the anchor frame dynamically. The experimental results indicate that the CGSW-YOLO model exhibits strong performance in monitoring man-vehicle interactions in mine air door scenarios. The Precision (P), Recall (R), and the mAP@0.5 are recorded at 88.2%, 93.9%, and 98.0%, respectively, representing improvements of 0.2%, 1.5%, and 1.7% over the original model. The Frames Per Second (FPS) has increased to 135.14 f·s−1, reflecting a rise of 35.14%. Additionally, the parameters, the floating point operations per second (FLOPS), and model size are 2.36 M, 6.2 G, and 5.0 MB, respectively. These values indicate reductions of 21.6%, 23.5%, and 20.6% compared to the original model. Through the verification of on-site surveillance video, the CGSW-YOLO model demonstrates its effectiveness in monitoring both individuals and vehicles in scenarios involving mine air doors. Full article
(This article belongs to the Special Issue Recent Advances in Optical Sensor for Mining)
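
The partial convolution (PConv) inside the Faster Block can be illustrated as follows: convolve only a fraction of the channels and pass the rest through unchanged, which cuts redundant computation. The channel ratio below is illustrative.

```python
# Minimal PConv sketch: 3x3 convolution over a subset of channels, identity on the rest.
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.conv_ch = channels // ratio               # channels actually convolved
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1, bias=False)

    def forward(self, x):
        x_conv, x_id = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(x_conv), x_id], dim=1)

y = PConv(64)(torch.randn(1, 64, 40, 40))
print(y.shape)  # torch.Size([1, 64, 40, 40])
```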

19 pages, 29576 KiB  
Article
Vehicle Detection in Videos Leveraging Multi-Scale Feature and Memory Information
by Yanni Yang and Shengnan Lu
Electronics 2025, 14(10), 2009; https://doi.org/10.3390/electronics14102009 - 15 May 2025
Abstract
Vehicle detection in videos is a critical task in traffic monitoring. Existing vehicle detection tasks commonly use static detectors. Since video frames are processed as discrete static images, static detectors neglect the temporal information of vehicles when detecting vehicles in videos, leading to a reduction in detection accuracy. To address the above shortcoming, this paper improves the detection performance by introducing a video vehicle detection method that combines multi-scale features with memory information. We design a Multi-scale Feature Generation Network (MFGN) to improve the detector’s self-adaptation ability to vehicle scales. MFGN generates features with two scales and predefines multi-scale anchors for each feature scale. Based on MFGN, we propose a Memory-based Multi-scale Feature Aggregation Network (MMFAN), which aggregates historical features with current features through two parallel memory networks. The multi-scale feature and memory based method enhances the features of each frame in two perspectives, thus enhancing the vehicle detection accuracy. On the commonly adopted vehicle detection dataset UA-DETRAC, the mAP of our method is 7.4% higher compared to its static detector. The proposed approach is further validated on the well-known ImageNet VID benchmark. It demonstrates comparable performance with the memory-driven state-of-the-art frameworks. Full article
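
A simplified sketch of aggregating current-frame features with a memory of past frames; MMFAN's two parallel memory networks and multi-scale anchors are not reproduced here, only the general blend-with-history idea.

```python
# Toy feature memory: keep a fixed-length buffer of past feature maps and blend the
# running average with the current frame's features before detection.
from collections import deque
import torch

class FeatureMemory:
    def __init__(self, maxlen=8, momentum=0.5):
        self.buffer = deque(maxlen=maxlen)
        self.momentum = momentum

    def aggregate(self, feat):
        if self.buffer:
            history = torch.stack(list(self.buffer)).mean(dim=0)
            feat = self.momentum * feat + (1 - self.momentum) * history
        self.buffer.append(feat.detach())
        return feat

memory = FeatureMemory()
for _ in range(10):                                   # per-frame backbone features
    enhanced = memory.aggregate(torch.randn(1, 256, 34, 60))
```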

14 pages, 3518 KiB  
Article
Object Detection in Laparoscopic Surgery: A Comparative Study of Deep Learning Models on a Custom Endometriosis Dataset
by Andrey Bondarenko, Vilen Jumutc, Antoine Netter, Fanny Duchateau, Henrique Mendonca Abrão, Saman Noorzadeh, Giuseppe Giacomello, Filippo Ferrari, Nicolas Bourdel, Ulrik Bak Kirk and Dmitrijs Bļizņuks
Diagnostics 2025, 15(10), 1254; https://doi.org/10.3390/diagnostics15101254 - 15 May 2025
Abstract
Background: Laparoscopic surgery for endometriosis presents unique challenges due to the complexity of and variability in lesion appearances within the abdominal cavity. This study investigates the application of deep learning models for object detection in laparoscopic videos, aiming to assist surgeons in accurately identifying and localizing endometriosis lesions and related anatomical structures. A custom dataset was curated, comprising 199 video sequences and 205,725 frames. Of these, 17,560 frames were meticulously annotated by medical professionals. The dataset includes object detection annotations for 10 object classes relevant to endometriosis, alongside segmentation masks for some classes. Methods: To address the object detection task, we evaluated the performance of two deep learning models—FasterRCNN and YOLOv9—under both stratified and non-stratified training scenarios. Results: The experimental results demonstrated that stratified training significantly reduced the risk of data leakage and improved model generalization. The best-performing FasterRCNN object detection model achieved a high average test precision of 0.9811 ± 0.0084, recall of 0.7083 ± 0.0807, and mAP50 (mean average precision at 50% overlap) of 0.8185 ± 0.0562 across all presented classes. Despite these successes, the study also highlights the challenges posed by the weak annotations and class imbalances in the dataset, which impacted overall model performance. Conclusions: In conclusion, this study provides valuable insights into the application of deep learning for enhancing laparoscopic surgical precision in endometriosis treatment. The findings underscore the importance of robust dataset curation and advanced training strategies in developing reliable AI-assisted tools for surgical interventions. The latter could potentially improve the guidance of surgical interventions and prevent blind spots occurring in difficult-to-reach abdominal regions. Future work will focus on refining the dataset and exploring more sophisticated model architectures to further improve detection accuracy. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
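
One common way to obtain the sequence-level separation that prevents the frame-level leakage mentioned above is a group-wise split by video ID, sketched here with scikit-learn; the paper's exact stratification scheme may differ.

```python
# Split at the video-sequence level so frames from the same surgery never appear in
# both the training and test sets.
from sklearn.model_selection import GroupShuffleSplit
import numpy as np

frame_ids = np.arange(1000)                          # stand-ins for annotated frames
video_ids = np.repeat(np.arange(50), 20)             # 50 sequences, 20 frames each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(frame_ids, groups=video_ids))
assert not set(video_ids[train_idx]) & set(video_ids[test_idx])   # no sequence overlap
```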

16 pages, 2784 KiB  
Article
Retinal Vessel Flicker Light Responsiveness and Its Relation to Analysis Protocols and Static and Metabolic Data in Healthy Subjects
by Dmitri Artemiev, Christophe Valmaggia, Scott Tschuppert, Konstantin Kotliar, Cengiz Türksever and Margarita G. Todorova
Biomedicines 2025, 13(5), 1201; https://doi.org/10.3390/biomedicines13051201 - 15 May 2025
Abstract
Background: The aim of this study was to assess the agreement between different analysis protocols for the determination of retinal vessel dilation response to flicker light (FL) and its relation to static and metabolic parameters of retinal vessels in healthy subjects. Methods: In total, 24 right eyes of 24 healthy controls (mean age: 36.04 ± SD 14.4 years) who underwent dynamic and static retinal diameter and oxygen saturation measurements on a Retinal Vessel Analyzer (RVA, Imedos, Jena, Germany) were included. Using repeated video analyses, responses to FL were measured with RVA. These measurements were conducted at three specific retinal locations: within the superotemporal area—within a distance of less than one optic disk (OD) diameter to optic nerve head (ONH) (group 1); greater than one OD diameter to ONH (group 2); and areas near the ONH within the VesselMap region (group 3). For comparability, the static and oxygen saturation parameters were also calculated in the superotemporal peripapillary area using the VesselMap tool of the RVA and were evaluated in relation to the corresponding dynamic area (group 3). Results: In all groups, the vascular FL response of arteries was less pronounced compared to venules (p = 0.0014). Even though FL responses (mean ± SD: FL-A; FL-V) in group 1 were more pronounced (3.36 ± 2.31; 4.42 ± 1.69) compared to those in group 2 (2.97 ± 2.40; 4.08 ± 1.55) and group 3 (2.84 ± 2.29; 4.21 ± 2.03), they did not reach statistically significant values. The mean flicker response of venules (VDil) in all groups showed negative correlations to the corresponding static parameter: central retinal venous equivalent (CRV) (r = −0.0437; p = 0.015). The mean flicker response of arteries (ADil) in all groups showed negative correlations to the corresponding metabolic parameter: arterio-venous oxygen extraction fraction (r = −0.101; p = 0.041). Conclusions: Our study confirms that the flicker light response, despite slight variations in its duration and location, allows for reliable measurements, proving the Retinal Vessel Analyzer to be a valuable diagnostic tool. Furthermore, we were able to highlight the relationship between the dynamic and metabolic components of retinal supply, which enables early diagnosis concerning the development of diseases within this spectrum. Full article

17 pages, 5707 KiB  
Article
AI-Enabled Digital Twin Framework for Safe and Sustainable Intelligent Transportation
by Keke Long, Chengyuan Ma, Hangyu Li, Zheng Li, Heye Huang, Haotian Shi, Zilin Huang, Zihao Sheng, Lei Shi, Pei Li, Sikai Chen and Xiaopeng Li
Sustainability 2025, 17(10), 4391; https://doi.org/10.3390/su17104391 - 12 May 2025
Abstract
This study proposes an AI-powered digital twin (DT) platform designed to support real-time traffic risk prediction, decision-making, and sustainable mobility in smart cities. The system integrates multi-source data—including static infrastructure maps, historical traffic records, telematics data, and camera feeds—into a unified cyber–physical platform. AI models are employed for data fusion, anomaly detection, and predictive analytics. In particular, the platform incorporates telematics–video fusion for enhanced trajectory accuracy and LiDAR–camera fusion for high-definition work-zone mapping. These capabilities support dynamic safety heatmaps, congestion forecasts, and scenario-based decision support. A pilot deployment on Madison’s Flex Lane corridor demonstrates real-time data processing, traffic incident reconstruction, crash-risk forecasting, and eco-driving control using a validated Vehicle-in-the-Loop setup. The modular API design enables integration with existing Advanced Traffic Management Systems (ATMSs) and supports scalable implementation. By combining predictive analytics with real-world deployment, this research offers a practical approach to improving urban traffic safety, resilience, and sustainability. Full article

27 pages, 8920 KiB  
Article
Advancing Rice Disease Detection in Farmland with an Enhanced YOLOv11 Algorithm
by Hongxin Teng, Yudi Wang, Wentao Li, Tao Chen and Qinghua Liu
Sensors 2025, 25(10), 3056; https://doi.org/10.3390/s25103056 - 12 May 2025
Abstract
Smart rice disease detection is a key part of intelligent agriculture. To address issues like low efficiency, poor accuracy, and high costs in traditional methods, this paper introduces YOLOv11-RD, an enhanced lightweight version of the YOLOv11 algorithm, which improves multi-scale feature extraction through the integration of the enhanced LSKAC attention mechanism and the SPPF module. It also lowers computational complexity and enhances local feature capture through the C3k2-CFCGLU block. The C3k2-CSCBAM block in the neck region reduces the training overhead and boosts target learning in complex backgrounds. Additionally, a lightweight 320 × 320 LSDECD detection head improves small-object detection. Experiments on a rice disease dataset extracted from agricultural operation videos demonstrate that, compared to YOLOv11n, the algorithm improves mAP50 and mAP50-95 by 2.7% and 11.5%, respectively, while reducing the model parameters by 4.58 M and the computational load by 1.1 G. The algorithm offers significant advantages in lightweight design and real-time performance, outperforming other classical object detection algorithms and providing an optimal solution for real-time field diagnosis. Full article
(This article belongs to the Topic Digital Agriculture, Smart Farming and Crop Monitoring)

21 pages, 4777 KiB  
Article
Harnessing Semantic and Trajectory Analysis for Real-Time Pedestrian Panic Detection in Crowded Micro-Road Networks
by Rongyong Zhao, Lingchen Han, Yuxin Cai, Bingyu Wei, Arifur Rahman, Cuiling Li and Yunlong Ma
Appl. Sci. 2025, 15(10), 5394; https://doi.org/10.3390/app15105394 - 12 May 2025
Abstract
Pedestrian panic behavior is a primary cause of overcrowding and stampede accidents in public micro-road network areas with high pedestrian density. However, reliably detecting such behaviors remains challenging due to their inherent complexity, variability, and stochastic nature. Current detection models often rely on single-modality features, which limits their effectiveness in complex and dynamic crowd scenarios. To overcome these limitations, this study proposes a contour-driven multimodal framework that first employs a CNN (CDNet) to estimate density maps and, by analyzing steep contour gradients, automatically delineates a candidate panic zone. Within these potential panic zones, pedestrian trajectories are analyzed through LSTM networks to capture irregular movements, such as counterflow and nonlinear wandering behaviors. Concurrently, semantic recognition based on Transformer models is utilized to identify verbal distress cues extracted through Baidu AI’s real-time speech-to-text conversion. The three embeddings are fused through a lightweight attention-enhanced MLP, enabling end-to-end inference at 40 FPS on a single GPU. To evaluate branch robustness under streaming conditions, the UCF Crowd dataset (150 videos without panic labels) is processed frame-by-frame at 25 FPS solely for density assessment, whereas full panic detection is validated on 30 real Itaewon-Stampede videos and 160 SUMO/Unity simulated emergencies that include explicit panic annotations. The proposed system achieves 91.7% accuracy and 88.2% F1 on the Itaewon set, outperforming all single- or dual-modality baselines and offering a deployable solution for proactive crowd safety monitoring in transport hubs, festivals, and other high-risk venues. Full article
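
The fusion step can be illustrated with a toy attention-weighted MLP over the three modality embeddings (density/contour, trajectory, speech); the embedding size and attention form are assumptions, not the authors' architecture.

```python
# Toy attention-enhanced MLP fusing three modality embeddings into a panic probability.
import torch
import torch.nn as nn

class AttentionFusionMLP(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.attn = nn.Linear(dim, 1)                          # one scalar weight per modality
        self.mlp = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, density_emb, traj_emb, speech_emb):
        stack = torch.stack([density_emb, traj_emb, speech_emb], dim=1)   # (B, 3, dim)
        weights = torch.softmax(self.attn(stack), dim=1)                  # (B, 3, 1)
        fused = (weights * stack).sum(dim=1)                              # (B, dim)
        return torch.sigmoid(self.mlp(fused))                             # panic probability

model = AttentionFusionMLP()
prob = model(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
```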

24 pages, 11944 KiB  
Article
YOLO-Act: Unified Spatiotemporal Detection of Human Actions Across Multi-Frame Sequences
by Nada Alzahrani, Ouiem Bchir and Mohamed Maher Ben Ismail
Sensors 2025, 25(10), 3013; https://doi.org/10.3390/s25103013 - 10 May 2025
Abstract
Automated action recognition has become essential in the surveillance, healthcare, and multimedia retrieval industries owing to the rapid proliferation of video data. This paper introduces YOLO-Act, a novel spatiotemporal action detection model that enhances the object detection capabilities of YOLOv8 to efficiently manage complex action dynamics within video sequences. YOLO-Act achieves precise and efficient action recognition by integrating keyframe extraction, action tracking, and class fusion. The model depicts essential temporal dynamics without the computational overhead of continuous frame processing by leveraging the adaptive selection of three keyframes representing the beginning, middle, and end of the actions. Compared with state-of-the-art approaches such as the Lagrangian Action Recognition Transformer (LART), YOLO-Act exhibits superior performance with a mean average precision (mAP) of 73.28 in experiments conducted on the AVA dataset, resulting in a gain of +28.18 mAP. Furthermore, YOLO-Act achieves this higher accuracy with significantly lower FLOPs, demonstrating its efficiency in computational resource utilization. The results highlight the advantages of incorporating precise tracking, effective spatial detection, and temporal consistency to address the challenges associated with video-based action detection. Full article
(This article belongs to the Section Sensing and Imaging)
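
The keyframe idea, representing an action by its beginning, middle, and end frames instead of processing every frame, can be reduced to a trivial selection sketch (the surrounding detection and tracking steps are omitted).

```python
# Pick the first, middle, and last frames of a tracked action segment.
def select_keyframes(frame_indices):
    """Return the beginning, middle, and end frame indices of an action segment."""
    if len(frame_indices) < 3:
        return list(frame_indices)
    return [frame_indices[0], frame_indices[len(frame_indices) // 2], frame_indices[-1]]

print(select_keyframes(list(range(120, 180))))   # [120, 150, 179]
```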

22 pages, 12121 KiB  
Article
A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments
by Hamideh Yazdani, Alireza Bosaghzadeh, Reza Ebrahimpour and Fadi Dornaika
Big Data Cogn. Comput. 2025, 9(5), 120; https://doi.org/10.3390/bdcc9050120 - 6 May 2025
Abstract
Human visual attention is influenced by multiple factors, including visual, auditory, and facial cues. While integrating auditory and visual information enhances prediction accuracy, many existing models rely solely on visual-temporal data. Inspired by cognitive studies, we propose a computational model that combines spatial, temporal, face (low-level and high-level visual cues), and auditory saliency to predict visual attention more effectively. Our approach processes video frames to generate spatial, temporal, and face saliency maps, while an audio branch localizes sound-producing objects. These maps are then integrated to form the final audio-visual saliency map. Experimental results on the audio-visual dataset demonstrate that our model outperforms state-of-the-art image and video saliency models and the basic model and aligns more closely with behavioral and eye-tracking data. Additionally, ablation studies highlight the contribution of each information source to the final prediction. Full article
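
A minimal sketch of combining the per-frame saliency maps (spatial, temporal, face, audio) into one audio-visual map by weighted summation and renormalization; the weights are arbitrary and the paper's integration scheme is more involved.

```python
# Weighted fusion of saliency maps followed by normalization to [0, 1].
import numpy as np

def fuse_saliency(maps, weights):
    fused = sum(w * m for w, m in zip(weights, maps))
    fused -= fused.min()
    return fused / (fused.max() + 1e-8)

h, w = 90, 160
spatial, temporal, face, audio = (np.random.rand(h, w) for _ in range(4))
final_map = fuse_saliency([spatial, temporal, face, audio], [0.3, 0.3, 0.2, 0.2])
```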
