Search Results (354)

Search Parameters:
Keywords = COCO2017 dataset

37 pages, 10380 KB  
Article
FEWheat-YOLO: A Lightweight Improved Algorithm for Wheat Spike Detection
by Hongxin Wu, Weimo Wu, Yufen Huang, Shaohua Liu, Yanlong Liu, Nannan Zhang, Xiao Zhang and Jie Chen
Plants 2025, 14(19), 3058; https://doi.org/10.3390/plants14193058 - 3 Oct 2025
Abstract
Accurate detection and counting of wheat spikes are crucial for yield estimation and variety selection in precision agriculture. However, challenges such as complex field environments, morphological variations, and small target sizes hinder the performance of existing models in real-world applications. This study proposes FEWheat-YOLO, a lightweight and efficient detection framework optimized for deployment on agricultural edge devices. The architecture integrates four key modules: (1) FEMANet, a mixed aggregation feature enhancement network with Efficient Multi-scale Attention (EMA) for improved small-target representation; (2) BiAFA-FPN, a bidirectional asymmetric feature pyramid network for efficient multi-scale feature fusion; (3) ADown, an adaptive downsampling module that preserves structural details during resolution reduction; and (4) GSCDHead, a grouped shared convolution detection head for reduced parameters and computational cost. Evaluated on a hybrid dataset combining GWHD2021 and a self-collected field dataset, FEWheat-YOLO achieved a COCO-style AP of 51.11%, AP@50 of 89.8%, and AP scores of 18.1%, 50.5%, and 61.2% for small, medium, and large targets, respectively, with an average recall (AR) of 58.1%. In wheat spike counting tasks, the model achieved an R2 of 0.941, MAE of 3.46, and RMSE of 6.25, demonstrating high counting accuracy and robustness. The proposed model requires only 0.67 M parameters, 5.3 GFLOPs, and 1.6 MB of storage, while achieving an inference speed of 54 FPS. Compared to YOLOv11n, FEWheat-YOLO improved AP@50, AP_s, AP_m, AP_l, and AR by 0.53%, 0.7%, 0.7%, 0.4%, and 0.3%, respectively, while reducing parameters by 74%, computation by 15.9%, and model size by 69.2%. These results indicate that FEWheat-YOLO provides an effective balance between detection accuracy, counting performance, and model efficiency, offering strong potential for real-time agricultural applications on resource-limited platforms.
(This article belongs to the Special Issue Advances in Artificial Intelligence for Plant Research)
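For reference, the counting metrics quoted above (R2, MAE, RMSE) measure the agreement between predicted and ground-truth per-image counts. The minimal Python sketch below shows how they are conventionally computed; the example counts are made-up placeholders, not data from the paper.

```python
import numpy as np

def counting_metrics(y_true, y_pred):
    """Return R^2, MAE and RMSE between true and predicted object counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))                  # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))           # root mean squared error
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination
    return r2, mae, rmse

# Hypothetical per-image spike counts (illustration only)
true_counts = [34, 27, 41, 52, 38]
pred_counts = [33, 29, 40, 49, 37]
print(counting_metrics(true_counts, pred_counts))
```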
17 pages, 2399 KB  
Article
SADAMB: Advancing Spatially-Aware Vision-Language Modeling Through Datasets, Metrics, and Benchmarks
by Giorgos Papadopoulos, Petros Drakoulis, Athanasios Ntovas, Alexandros Doumanoglou and Dimitris Zarpalas
Computers 2025, 14(10), 413; https://doi.org/10.3390/computers14100413 - 29 Sep 2025
Abstract
Understanding spatial relationships between objects in images is crucial for robotic navigation, augmented reality systems, and autonomous driving applications, among others. However, existing vision-language benchmarks often overlook explicit spatial reasoning, limiting progress in this area. We attribute this limitation in part to existing open datasets and evaluation metrics, which tend to overlook spatial details. To address this gap, we make three contributions: First, we greatly extend the COCO dataset with annotations of spatial relations, providing a resource for spatially aware image captioning and visual question answering. Second, we propose a new evaluation framework encompassing metrics that assess image captions’ spatial accuracy at both the sentence and dataset levels. And third, we conduct a benchmark study of various vision encoder–text decoder transformer architectures for image captioning using the introduced dataset and metrics. Results reveal that current models capture spatial information only partially, underscoring the challenges of spatially grounded caption generation.
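As a rough illustration of what a spatial-relation annotation over COCO-format boxes can look like, the sketch below derives a coarse left/right/above/below relation by comparing box centres. This rule is an assumption for illustration only and is not the authors' annotation protocol; the box values are hypothetical.

```python
def box_center(box):
    """Center of a COCO-format box [x, y, w, h] (origin at the top-left corner)."""
    x, y, w, h = box
    return x + w / 2.0, y + h / 2.0

def coarse_relation(box_a, box_b):
    """Name a coarse spatial relation of box_a relative to box_b."""
    (ax, ay), (bx, by) = box_center(box_a), box_center(box_b)
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):                 # horizontal offset dominates
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"  # image y grows downwards

# Hypothetical boxes (not from the released annotations)
dog = [120, 300, 80, 60]
bench = [300, 280, 150, 90]
print("dog is", coarse_relation(dog, bench), "the bench")
```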
18 pages, 2628 KB  
Article
Importance-Weighted Locally Adaptive Prototype Extraction Network for Few-Shot Detection
by Haibin Wang, Yong Tao, Zhou Zhou, Yue Wang, Xu Fan and Xiangjun Wang
Sensors 2025, 25(19), 5945; https://doi.org/10.3390/s25195945 - 23 Sep 2025
Abstract
Few-Shot Object Detection (FSOD) aims to identify new object categories with a limited amount of labeled data, which holds broad application prospects in real-life scenarios. Previous approaches usually pay insufficient attention to critical information, which leads to the generation of low-quality prototypes and suboptimal performance in few-shot scenarios. To overcome this defect, an improved FSOD network is proposed in this paper, which mimics the human visual attention mechanism by emphasizing areas that are semantically important and rich in spatial information. Specifically, an Importance-Weighted Local Adaptive Prototype module is first introduced, which highlights key local features of support samples and generates more expressive class prototypes by assigning greater weights to salient regions, effectively enhancing generalization ability under few-shot settings. Secondly, an Imbalanced Diversity Sampling module is utilized to select diverse and challenging negative sample prototypes, which enhances inter-class separability and reduces confusion among visually similar categories. Moreover, a Weighted Non-Linear Fusion module is designed to integrate various forms of feature interaction. The contributions of the feature interactions are modulated by learnable importance weights, which improve the effect of feature fusion. Extensive experiments on PASCAL VOC and MS COCO benchmarks validate the effectiveness of our method. The experimental results show that our method improves the mean average precision by 2.84% on the PASCAL VOC dataset compared with Fine-Grained Prototypes Distillation (FPD), and surpasses the recent FPD baseline by 0.8% and 1.8% in AP, respectively, on the MS COCO dataset.
(This article belongs to the Section Intelligent Sensors)
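The paper's importance-weighted prototype module is its own contribution; the sketch below only shows the standard prototype-matching idea it builds on (class prototypes as mean support embeddings, queries matched by cosine similarity), with random placeholder features rather than real detector embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_prototypes(support_feats, support_labels):
    """Average the support embeddings of each class into one prototype."""
    return {cls: support_feats[support_labels == cls].mean(axis=0)
            for cls in np.unique(support_labels)}

def classify_by_cosine(query_feat, prototypes):
    """Assign the query to the prototype with the highest cosine similarity."""
    q = l2_normalize(query_feat)
    scores = {cls: float(q @ l2_normalize(p)) for cls, p in prototypes.items()}
    return max(scores, key=scores.get), scores

# Toy 5-way, 5-shot example with random 128-d embeddings (placeholders)
rng = np.random.default_rng(0)
feats = rng.normal(size=(25, 128))
labels = np.repeat(np.arange(5), 5)
protos = build_prototypes(feats, labels)
print(classify_by_cosine(rng.normal(size=128), protos))
```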
29 pages, 34222 KB  
Article
BFRDNet: A UAV Image Object Detection Method Based on a Backbone Feature Reuse Detection Network
by Liming Zhou, Jiakang Yang, Yuanfei Xie, Guochong Zhang, Cheng Liu and Yang Liu
ISPRS Int. J. Geo-Inf. 2025, 14(9), 365; https://doi.org/10.3390/ijgi14090365 - 21 Sep 2025
Abstract
Unmanned aerial vehicle (UAV) image object detection has become an increasingly important research area in computer vision. However, variable target shapes and complex environments make it difficult for a model to fully exploit its features. To solve this problem, we propose a UAV image object detection method based on a backbone feature reuse detection network, named BFRDNet. First, we design a backbone feature reuse pyramid network (BFRPN), which takes the model characteristics as its starting point and more fully utilizes the multi-scale features of the backbone network to improve performance in complex environments. Second, we propose a feature extraction module based on multiple-kernel convolution (MKConv) to deeply mine features under different receptive fields, helping the model accurately recognize targets of different sizes and shapes. Finally, we design a detection head preprocessing module (PDetect) to enhance the feature representation fed to the detection head and effectively suppress the interference of background information. In this study, we validate the performance of BFRDNet primarily on the VisDrone dataset. The experimental results demonstrate that BFRDNet achieves a significant improvement in detection performance, with the mAP increasing by 7.5%. To further evaluate the model’s generalization capacity, we extend the experiments to the UAVDT and COCO datasets.
40 pages, 9065 KB  
Article
Empirical Evaluation of Invariances in Deep Vision Models
by Konstantinos Keremis, Eleni Vrochidou and George A. Papakostas
J. Imaging 2025, 11(9), 322; https://doi.org/10.3390/jimaging11090322 - 19 Sep 2025
Abstract
The ability of deep learning models to maintain consistent performance under image transformations, termed invariances, is critical for reliable deployment across diverse computer vision applications. This study presents a comprehensive empirical evaluation of modern convolutional neural networks (CNNs) and vision transformers (ViTs) concerning four fundamental types of image invariance: blur, noise, rotation, and scale. We analyze a curated selection of thirty models across three common vision tasks (object localization, recognition, and semantic segmentation), using benchmark datasets including COCO, ImageNet, and a custom segmentation dataset. Our experimental protocol introduces controlled perturbations to test model robustness and employs task-specific metrics such as mean Intersection over Union (mIoU) and classification accuracy (Acc) to quantify models’ performance degradation. Results indicate that while ViTs generally outperform CNNs under blur and noise corruption in recognition tasks, both model families exhibit significant vulnerabilities to rotation and extreme scale transformations. Notably, segmentation models demonstrate higher resilience to geometric variations, with SegFormer and Mask2Former emerging as the most robust architectures. These findings challenge prevailing assumptions regarding model robustness and provide actionable insights for designing vision systems capable of withstanding real-world input variability.
(This article belongs to the Special Issue Advances in Machine Learning for Computer Vision Applications)
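A generic version of the evaluation protocol described above (apply a controlled perturbation, re-measure accuracy, report the drop) can be sketched as follows. The perturbation severities, the stub classifier, and the random batch are placeholders; the paper's actual settings, models, and datasets differ.

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

def accuracy(model, images, labels):
    with torch.no_grad():
        return (model(images).argmax(dim=1) == labels).float().mean().item()

# Controlled perturbations of fixed severity (illustrative settings only)
PERTURBATIONS = {
    "clean":    lambda x: x,
    "blur":     lambda x: TF.gaussian_blur(x, kernel_size=9, sigma=3.0),
    "noise":    lambda x: (x + 0.2 * torch.randn_like(x)).clamp(0, 1),
    "rotation": lambda x: TF.rotate(x, angle=30.0),
    "scale":    lambda x: TF.resize(TF.resize(x, [56, 56]), [224, 224]),  # down/up-sample
}

# Stand-in classifier and data so the sketch executes; a real study would plug in
# pretrained CNN/ViT weights and ImageNet/COCO batches here.
model = nn.Sequential(nn.Conv2d(3, 8, 3, stride=4), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10)).eval()
images = torch.rand(16, 3, 224, 224)
labels = torch.randint(0, 10, (16,))

baseline = accuracy(model, images, labels)
for name, perturb in PERTURBATIONS.items():
    acc = accuracy(model, perturb(images), labels)
    print(f"{name:8s} acc={acc:.3f}  drop={baseline - acc:+.3f}")
```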
17 pages, 24022 KB  
Article
Robust Object Detection Under Adversarial Patch Attacks in Vision-Based Navigation
by Haotian Gu, Hyung Jin Yoon and Hamidreza Jafarnejadsani
Automation 2025, 6(3), 44; https://doi.org/10.3390/automation6030044 - 9 Sep 2025
Abstract
In vision-guided autonomous robots, object detectors play a crucial role in perceiving the environment for path planning and decision-making. However, adaptive adversarial patch attacks undermine the resilience of detector-based systems. Strengthening object detectors against such adaptive attacks enhances the robustness of navigation systems. Existing defenses against patch attacks are primarily designed for stationary scenes and struggle against adaptive patch attacks that vary in scale, position, and orientation in dynamic environments. In this paper, we introduce Ad_YOLO+, an efficient and effective plugin that extends Ad_YOLO to defend against white-box patch-based image attacks. Built on YOLOv5x with an additional patch detection layer, Ad_YOLO+ is trained on a specially crafted adversarial dataset (COCO-VisDrone-2019). Unlike conventional methods that rely on redundant image preprocessing, our approach directly detects adversarial patches and the overlaid objects. Experiments on the adversarial training dataset demonstrate that Ad_YOLO+ improves both provable robustness and clean accuracy. Ad_YOLO+ achieves 85.4% top-1 clean accuracy on the COCO dataset and 74.63% top-1 robust provable accuracy against pixel square patches anywhere on the image for the COCO-VisDrone-2019 dataset. Moreover, under adaptive attacks in AirSim simulations, Ad_YOLO+ reduces the attack success rate, ensuring tracking resilience in both dynamic and static settings. Additionally, it generalizes well to other patch detection weight configurations.
(This article belongs to the Section Robotics and Autonomous Systems)
17 pages, 1078 KB  
Article
Prototype-Based Two-Stage Few-Shot Instance Segmentation with Flexible Novel Class Adaptation
by Qinying Zhu, Yilin Zhang, Peng Xiao, Mengxi Ying, Lei Zhu and Chengyuan Zhang
Mathematics 2025, 13(17), 2889; https://doi.org/10.3390/math13172889 - 7 Sep 2025
Abstract
Few-shot instance segmentation (FSIS) is devised to address the intricate challenge of instance segmentation when labeled data for novel classes is scant. Nevertheless, existing methodologies encounter notable constraints in the agile expansion of novel classes and the management of memory overhead. The integration workflow for novel classes is inflexible, and given the necessity of retaining class exemplars during both training and inference stages, considerable memory consumption ensues. To surmount these challenges, this study introduces an innovative framework encompassing a two-stage “base training-novel class fine-tuning” paradigm. It acquires discriminative instance-level embedding representations. Concretely, instance embeddings are aggregated into class prototypes, and the storage of embedding vectors as opposed to images inherently mitigates the issue of memory overload. Via a Region of Interest (RoI)-level cosine similarity matching mechanism, the flexible augmentation of novel classes is realized, devoid of the requirement for supplementary training and independent of historical data. Experimental validations attest that this approach significantly outperforms state-of-the-art techniques in mainstream benchmark evaluations. More crucially, its memory-optimized attributes facilitate, for the first time, the conjoint assessment of FSIS performance across all classes within the COCO dataset. Visualized instances (incorporating colored masks and class annotations of objects across diverse scenarios) further substantiate the efficacy of the method in real-world complex contexts.
(This article belongs to the Special Issue Structural Networks for Image Application)
30 pages, 25011 KB  
Article
Multi-Level Contextual and Semantic Information Aggregation Network for Small Object Detection in UAV Aerial Images
by Zhe Liu, Guiqing He and Yang Hu
Drones 2025, 9(9), 610; https://doi.org/10.3390/drones9090610 - 29 Aug 2025
Abstract
In recent years, detection methods for generic object detection have achieved significant progress. However, due to the large number of small objects in aerial images, mainstream detectors struggle to achieve a satisfactory detection performance. The challenges of small object detection in aerial images are primarily twofold: (1) Insufficient feature representation: The limited visual information for small objects makes it difficult for models to learn discriminative feature representations. (2) Background confusion: Abundant background information introduces more noise and interference, causing the features of small objects to easily be confused with the background. To address these issues, we propose a Multi-Level Contextual and Semantic Information Aggregation Network (MCSA-Net). MCSA-Net includes three key components: a Spatial-Aware Feature Selection Module (SAFM), a Multi-Level Joint Feature Pyramid Network (MJFPN), and an Attention-Enhanced Head (AEHead). The SAFM employs a sequence of dilated convolutions to extract multi-scale local context features and combines a spatial selection mechanism to adaptively merge these features, thereby obtaining the critical local context required for the objects, which enriches the feature representation of small objects. The MJFPN introduces multi-level connections and weighted fusion to fully leverage the spatial detail features of small objects in feature fusion and enhances the fused features further through a feature aggregation network. Finally, the AEHead is constructed by incorporating a sparse attention mechanism into the detection head. The sparse attention mechanism efficiently models long-range dependencies by computing the attention between the most relevant regions in the image while suppressing background interference, thereby enhancing the model’s ability to perceive targets and effectively improving the detection performance. Extensive experiments on four datasets, VisDrone, UAVDT, MS COCO, and DOTA, demonstrate that the proposed MCSA-Net achieves an excellent detection performance, particularly in small object detection, surpassing several state-of-the-art methods.
(This article belongs to the Special Issue Intelligent Image Processing and Sensing for Drones, 2nd Edition)
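The SAFM described above gathers multi-scale local context with dilated convolutions before a learned spatial selection step. The PyTorch sketch below illustrates only the generic dilated-convolution part of that idea; the module name, dilation rates, and the simple residual fusion are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiDilationContext(nn.Module):
    """Parallel dilated 3x3 convolutions that gather context at several receptive fields,
    then fuse them with a 1x1 convolution plus an identity shortcut."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        ctx = torch.cat([b(x) for b in self.branches], dim=1)  # concatenate all scales
        return self.fuse(ctx) + x                               # fuse and keep identity

x = torch.rand(1, 64, 80, 80)              # dummy feature map
print(MultiDilationContext(64)(x).shape)    # -> torch.Size([1, 64, 80, 80])
```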
21 pages, 3363 KB  
Article
A Hybrid CNN-GCN Architecture with Sparsity and Dataflow Optimization for Mobile AR
by Jiazhong Chen and Ziwei Chen
Appl. Sci. 2025, 15(17), 9356; https://doi.org/10.3390/app15179356 - 26 Aug 2025
Abstract
Mobile augmented reality (AR) applications require high-performance, energy-efficient deep learning solutions to deliver immersive experiences on resource-constrained devices. We propose SAHA-WS, a Sparsity-Aware Hybrid Architecture with Weight-Stationary Dataflow, combining Convolutional Neural Networks (CNNs) and Graph Convolutional Networks (GCNs) to efficiently process grid-like (e.g., images) and graph-structured (e.g., human skeletons) data. SAHA-WS leverages channel-wise sparsity in CNNs and adjacency matrix sparsity in GCNs, paired with weight-stationary dataflow, to minimize computations and memory access. Evaluations on ImageNet, COCO, and NTU RGB+D datasets demonstrate SAHA-WS achieves 87.5% top-1 accuracy, 75.8% mAP, and 92.5% action recognition accuracy at 0% sparsity, with 40 ms latency and 42 mJ energy consumption at 60% sparsity, outperforming a baseline by 1020% in efficiency. Ablation studies confirm the contributions of sparsity and dataflow optimizations. SAHA-WS enables complex AR applications to run smoothly on mobile devices, enhancing immersive and engaging experiences.
26 pages, 30652 KB  
Article
Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification
by Ananya Saha, Mahir Afser Pavel, Md Fahim Shahoriar Titu, Afifa Zain Apurba and Riasat Khan
Vehicles 2025, 7(3), 89; https://doi.org/10.3390/vehicles7030089 - 25 Aug 2025
Abstract
Efficient and explainable vehicle damage inspection is essential due to the increasing complexity and volume of vehicular incidents. Traditional manual inspection approaches are not time-effective, are prone to human error, and lead to inefficiencies in insurance claims and repair workflows. Existing deep learning methods, such as CNNs, often struggle with generalization, require large annotated datasets, and lack interpretability. This study presents a robust and interpretable deep learning framework for vehicle damage classification, integrating Vision Transformers (ViTs) and ensemble detection strategies. The proposed architecture employs a RetinaNet backbone with a ViT-enhanced detection head, implemented in PyTorch using the Detectron2 object detection framework. It is initialized from COCO-pretrained weights and fine-tuned through focal loss and aggressive augmentation techniques to improve generalization under real-world damage variability. The proposed system applies the Weighted Box Fusion (WBF) ensemble strategy to refine detection outputs from multiple models, offering improved spatial precision. To ensure interpretability and transparency, we adopt three explainability techniques—Grad-CAM, Grad-CAM++, and SHAP—offering semantic and visual insights into model decisions. A custom vehicle damage dataset with 4500 images has been built, consisting of approximately 60% curated images collected through targeted web scraping and crawling covering various damage types (such as bumper dents, panel scratches, and frontal impacts), along with 40% COCO dataset images to support model generalization. Comparative evaluations show that Hybrid ViT-RetinaNet achieves superior performance with an F1-score of 84.6%, mAP of 87.2%, and 22 FPS inference speed. In an ablation analysis, WBF, augmentation, transfer learning, and focal loss significantly improve performance, with focal loss increasing F1 by 6.3% for underrepresented classes and COCO pretraining boosting mAP by 8.7%. Additional architectural comparisons demonstrate that our full hybrid configuration not only maintains competitive accuracy but also achieves up to 150 FPS, making it well suited for real-time use cases. Robustness tests under challenging conditions, including real-world visual disturbances (smoke, fire, motion blur, varying lighting, and occlusions) and artificial noise (Gaussian and salt-and-pepper), confirm the model’s generalization ability. This work contributes a scalable, explainable, and high-performance solution for real-world vehicle damage diagnostics.
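Focal loss, credited above with a 6.3% F1 gain on underrepresented classes, down-weights well-classified examples so training concentrates on hard ones. A minimal binary sketch in PyTorch is shown below; the alpha/gamma values are the commonly used defaults and the logits/targets are toy numbers, not the paper's settings or data.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for binary targets."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)        # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([2.0, -1.5, 0.3])    # toy detector scores
targets = torch.tensor([1.0, 0.0, 1.0])
print(binary_focal_loss(logits, targets))
```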
16 pages, 707 KB  
Article
High-Resolution Human Keypoint Detection: A Unified Framework for Single and Multi-Person Settings
by Yuhuai Lin, Kelei Li and Haihua Wang
Algorithms 2025, 18(8), 533; https://doi.org/10.3390/a18080533 - 21 Aug 2025
Abstract
Human keypoint detection has become a fundamental task in computer vision, underpinning a wide range of downstream applications such as action recognition, intelligent surveillance, and human–computer interaction. Accurate localization of keypoints is crucial for understanding human posture, behavior, and interactions in various environments. In this paper, we propose a deep-learning-based human skeletal keypoint detection framework that leverages a High-Resolution Network (HRNet) to achieve robust and precise keypoint localization. Our method maintains high-resolution representations throughout the entire network, enabling effective multi-scale feature fusion, without sacrificing spatial details. This approach preserves the fine-grained spatial information that is often lost in conventional downsampling-based methods. To evaluate its performance, we conducted extensive experiments on the COCO dataset, where our approach achieved competitive performance in terms of Average Precision (AP) and Average Recall (AR), outperforming several state-of-the-art methods. Furthermore, we extended our pipeline to support multi-person keypoint detection in real-time scenarios, ensuring scalability for complex environments. Experimental results demonstrated the effectiveness of our method in both single-person and multi-person settings, providing a comprehensive and flexible solution for various pose estimation tasks in dynamic real-world applications.
(This article belongs to the Section Evolutionary Algorithms and Machine Learning)
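The COCO keypoint AP and AR reported above are built on Object Keypoint Similarity (OKS), a Gaussian score over keypoint distances averaged across labeled keypoints. The sketch below follows the standard COCO formulation; the coordinates and per-keypoint falloff constants are toy values for illustration, not the full COCO constants used by the paper.

```python
import numpy as np

def object_keypoint_similarity(pred, gt, visibility, area, kappas):
    """COCO-style OKS: per-keypoint Gaussian similarity, averaged over labeled keypoints.
    pred, gt: (K, 2) keypoint coordinates; visibility: (K,) flags (>0 = labeled);
    area: object segment area; kappas: (K,) per-keypoint falloff constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared pixel distances
    scale2 = area * (2 * kappas) ** 2            # per-keypoint variance term
    ks = np.exp(-d2 / (2 * scale2 + 1e-12))      # per-keypoint similarity in (0, 1]
    labeled = visibility > 0
    return float(ks[labeled].mean()) if labeled.any() else 0.0

# Toy 3-keypoint example with made-up coordinates and constants
gt = np.array([[100.0, 100.0], [120.0, 140.0], [90.0, 160.0]])
pred = gt + np.array([[2.0, -1.0], [5.0, 3.0], [0.0, 4.0]])
print(object_keypoint_similarity(pred, gt,
                                 visibility=np.array([2, 2, 1]),
                                 area=900.0,
                                 kappas=np.array([0.026, 0.079, 0.107])))
```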
12 pages, 7715 KB  
Article
Hardware Accelerator Design by Using RT-Level Power Optimization Techniques on FPGA for Future AI Mobile Applications
by Achyuth Gundrapally, Yatrik Ashish Shah, Sai Manohar Vemuri and Kyuwon (Ken) Choi
Electronics 2025, 14(16), 3317; https://doi.org/10.3390/electronics14163317 - 20 Aug 2025
Abstract
In resource-constrained edge environments—such as mobile devices, IoT systems, and electric vehicles—energy-efficient Convolutional Neural Network (CNN) accelerators on mobile Field Programmable Gate Arrays (FPGAs) are gaining significant attention for real-time object detection tasks. This paper presents a low-power implementation of the Tiny YOLOv4 object detection model on the Xilinx ZCU104 FPGA platform using Register Transfer Level (RTL) optimization techniques. We propose three RTL techniques in this paper: (i) Local Explicit Clock Enable (LECE), (ii) operand isolation, and (iii) Enhanced Clock Gating (ECG). A novel low-power design of the Multiply-Accumulate (MAC) operation, one of the main components of the AI algorithm, is proposed to eliminate redundant signal switching activity. The Tiny YOLOv4 model, trained on the COCO dataset, was quantized and compiled using the Tensil tool-chain for fixed-point inference deployment. Post-implementation evaluation using Vivado 2022.2 demonstrates an approximately 29.4% reduction in total on-chip power. Our design supports real-time detection throughput while maintaining high accuracy, making it ideal for deployment in battery-constrained environments such as drones, surveillance systems, and autonomous vehicles. These results highlight the effectiveness of RTL-level power optimization for scalable and sustainable edge AI deployment.
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
32 pages, 3272 KB  
Article
Bridging Modalities: An Analysis of Cross-Modal Wasserstein Adversarial Translation Networks and Their Theoretical Foundations
by Joseph Tafataona Mtetwa, Kingsley A. Ogudo and Sameerchand Pudaruth
Mathematics 2025, 13(16), 2545; https://doi.org/10.3390/math13162545 - 8 Aug 2025
Abstract
What if machines could seamlessly translate between the visual richness of images and the semantic depth of language with mathematical precision? This paper presents a theoretical and empirical analysis of five novel cross-modal Wasserstein adversarial translation networks that challenge conventional approaches to cross-modal understanding. Unlike traditional generative models that rely on stochastic noise, our frameworks learn deterministic translation mappings that preserve semantic fidelity across modalities through rigorous mathematical foundations. We systematically examine: (1) cross-modality consistent dual-critical networks; (2) Wasserstein cycle consistency; (3) multi-scale Wasserstein distance; (4) regularization through modality invariance; and (5) Wasserstein information bottleneck. Each approach employs adversarial training with Wasserstein distances to establish theoretically grounded translation functions between heterogeneous data representations. Through mathematical analysis—including information-theoretic frameworks, differential geometry, and convergence guarantees—we establish the theoretical foundations underlying cross-modal translation. Our empirical evaluation across MS-COCO, Flickr30K, and Conceptual Captions datasets, including comparisons with transformer-based baselines, reveals that our proposed multi-scale Wasserstein cycle consistent (MS-WCC) framework achieves remarkable performance gains—12.1% average improvement in FID scores and 8.0% enhancement in cross-modal translation accuracy—compared to state-of-the-art methods, while maintaining superior computational efficiency. These results demonstrate that principled mathematical approaches to cross-modal translation can significantly advance machine understanding of multimodal data, opening new possibilities for applications requiring seamless communication between visual and textual domains.
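All five frameworks above rest on Wasserstein adversarial training. As background only, the sketch below shows a generic WGAN-style critic objective with gradient penalty over feature embeddings; the critic architecture, penalty weight, and 64-dimensional toy inputs are assumptions for illustration and not the paper's cross-modal formulations.

```python
import torch

def critic_wgan_gp_loss(critic, real, fake, gp_weight=10.0):
    """Wasserstein critic loss with gradient penalty:
    E[D(fake)] - E[D(real)] + lambda * E[(||grad D(x_hat)|| - 1)^2]."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)   # random interpolates
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return critic(fake).mean() - critic(real).mean() + gp_weight * penalty

# Toy critic over 64-d embeddings (placeholder for image/text feature critics)
critic = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(),
                             torch.nn.Linear(128, 1))
real = torch.randn(8, 64)
fake = torch.randn(8, 64)
print(critic_wgan_gp_loss(critic, real, fake))
```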
12 pages, 492 KB  
Article
AFJ-PoseNet: Enhancing Simple Baselines with Attention-Guided Fusion and Joint-Aware Positional Encoding
by Wenhui Zhang, Yu Shi and Jiayi Lin
Electronics 2025, 14(15), 3150; https://doi.org/10.3390/electronics14153150 - 7 Aug 2025
Abstract
Simple Baseline has become a dominant benchmark in human pose estimation (HPE) due to its excellent performance and simple design. However, its “strong encoder + simple decoder” architectural paradigm suffers from two core limitations: (1) its non-branching, linear deconvolutional path prevents it from leveraging the rich, fine-grained features generated by the encoder at multiple scales and (2) the model lacks explicit prior knowledge of both the absolute positions and structural layout of human keypoints. To address these issues, this paper introduces AFJ-PoseNet, a new architecture that deeply enhances the Simple Baseline framework. First, we restructure Simple Baseline’s original linear decoder into a U-Net-like multi-scale fusion path, introducing intermediate features from the encoder via skip connections. For efficient fusion, we design a novel Attention Fusion Module (AFM), which dynamically gates the flow of incoming detailed features through a context-aware spatial attention mechanism. Second, we propose the Joint-Aware Positional Encoding (JAPE) module, which innovatively combines a fixed global coordinate system with learnable, joint-specific spatial priors. This design injects both absolute position awareness and statistical priors of the human body structure. Our ablation studies on the MPII dataset validate the effectiveness of each proposed enhancement, with our full model achieving a mean PCKh of 88.915, a 0.341 percentage point improvement over our re-implemented baseline. On the more challenging COCO val2017 dataset, our ResNet-50-based AFJ-PoseNet achieves an Average Precision (AP) of 72.6%. While this involves a slight trade-off in Average Recall for higher precision, this result represents a significant 2.2 percentage point improvement over our re-implemented baseline (70.4%) and also outperforms other strong, publicly available models like DARK (72.4%) and SimCC (72.1%) under comparable settings, demonstrating the superiority and competitiveness of our proposed enhancements.
(This article belongs to the Section Computer Science & Engineering)
24 pages, 4199 KB  
Article
Hazelnut Kernel Percentage Calculation System with DCIoU and Neighborhood Relationship Algorithm
by Sultan Murat Yılmaz, Serap Çakar Kaman and Erkan Güler
Processes 2025, 13(8), 2414; https://doi.org/10.3390/pr13082414 - 30 Jul 2025
Abstract
Hazelnut (Corylus avellana L.) is a significant global agricultural product due to its high economic and nutritional worth. The traditional methods used to measure the hazelnut kernel percentage for quality assessment are often time-consuming, expensive, and prone to human errors. Inaccurate measurements can adversely impact the market value, shelf life, and industrial applications of hazelnuts. This research introduces a novel system for calculating hazelnut kernel percentage utilizing a non-destructive X-ray imaging technique along with deep learning methods to assess hazelnut quality more efficiently and reliably. An image dataset of hazelnut kernels has been developed using X-ray technology, and defective areas are identified employing YOLOv7 architecture. Additionally, a novel bounding box regression technique called DCIoU and an algorithm for Neighborhood Relationship have been introduced to enhance object detection capabilities and to improve the selection of the target box with greater precision, respectively. The performance of these proposed methods has been evaluated using both the created hazelnut dataset and the COCO-128 dataset. The results indicate that the system can serve as a valuable tool for measuring hazelnut kernel percentages by accurately identifying defects in hazelnuts.
(This article belongs to the Section Food Process Engineering)
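DCIoU is the authors' own bounding-box regression variant and is not reproduced here; for orientation, the sketch below computes the standard Distance-IoU (DIoU) that such variants typically refine (IoU minus the normalized squared distance between box centers). The box coordinates are hypothetical.

```python
def diou(box_a, box_b):
    """Distance-IoU between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared center distance, normalized by the enclosing box diagonal
    cdist2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    diag2 = cw ** 2 + ch ** 2
    return iou - (cdist2 / diag2 if diag2 > 0 else 0.0)

# Hypothetical predicted vs. ground-truth defect boxes
print(diou((10, 10, 50, 60), (20, 15, 55, 70)))
```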