Search Results (199)

Search Parameters:
Keywords = mobile vision transformer

28 pages, 3527 KB  
Article
Autonomous Tomato Harvesting System Integrating AI-Controlled Robotics in Greenhouses
by Mihai Gabriel Matache, Florin Bogdan Marin, Catalin Ioan Persu, Robert Dorin Cristea, Florin Nenciu and Atanas Z. Atanasov
Agriculture 2026, 16(8), 847; https://doi.org/10.3390/agriculture16080847 - 11 Apr 2026
Abstract
Labor shortages and the need for increased productivity have accelerated the development of robotic harvesting systems for greenhouse crops; however, reliable operation under fruit occlusion and clustered arrangements remains a major challenge, particularly due to the limited integration between perception and motion planning modules. The paper presents the design and experimental validation of an autonomous robotic system for greenhouse tomato harvesting. The proposed platform integrates a rail-guided mobile base, a six-degrees-of-freedom robotic manipulator, and an adaptive end effector with a hybrid vision framework that combines convolutional neural networks and watershed-based segmentation to enable robust fruit detection and localization under occluded conditions. The proposed approach enables improved separation of overlapping fruits and provides accurate spatial localization through stereo vision combined with IMU-assisted camera-to-robot coordinate transformation. An occlusion-aware trajectory planning strategy was developed to generate collision-free manipulation paths in the presence of leaves and stems, enhancing harvesting safety and reliability. The system was trained and evaluated using a dataset of real greenhouse images supplemented with synthetic data augmentation. Experimental trials conducted under practical greenhouse conditions demonstrated a fruit detection precision of 96.9%, recall of 93.5%, and mean Intersection-over-Union of 79.2%. The robotic platform achieved an overall harvesting success rate of 78.5%, reaching 85% for unobstructed fruits, with an average cycle time of 15 s per fruit in direct harvesting scenarios. The rail-guided mobility significantly improved positioning stability and repeatability during manipulation compared with fully mobile platforms. The results confirm that integrating hybrid perception with occlusion-aware motion planning can substantially improve the functionality of robotic harvesting systems in protected cultivation environments. The proposed solution contributes to the advancement of automation technologies for greenhouse vegetable production and supports the transition toward more sustainable and labor-efficient agricultural practices.
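The IMU-assisted camera-to-robot coordinate transformation mentioned in this abstract is, at its core, a rigid transform applied to each stereo-localized fruit point. A minimal sketch follows; the calibration values (rotation and offset) are made up for illustration, since the paper's extrinsics are not given here.

```python
import numpy as np

def camera_to_robot(p_cam, R, t):
    """Map a 3-D point from the camera frame into the robot base frame
    using a rigid transform: rotation R followed by translation t."""
    return R @ np.asarray(p_cam, dtype=float) + t

# Hypothetical calibration: camera rotated 90 degrees about Z and
# offset 0.5 m along the robot's X axis (illustrative values only).
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, 0.0, 0.0])

# A fruit seen 0.2 m to the camera's right, 1 m ahead.
p_robot = camera_to_robot([0.2, 0.0, 1.0], R, t)
```

In a real system, R and t would come from calibration, with the IMU correcting the camera's tilt before the transform is applied.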

21 pages, 845 KB  
Article
GNTF: A Lightweight CNN Robustness Enhancement Method for IoT Devices
by Xuan Liu, Benkui Zhang, Jinxiao Wang, Huanyu Bian and Yunping Ge
Sensors 2026, 26(7), 2207; https://doi.org/10.3390/s26072207 - 2 Apr 2026
Abstract
Deploying lightweight convolutional neural networks (CNNs) to provide vision services on resource-constrained Internet of Things (IoT) devices has become the mainstream approach to addressing computing and energy consumption constraints. However, these IoT devices often operate in complex outdoor environments (e.g., fog, rain, and snow), and the quality of the data they collect is easily degraded, causing standard lightweight CNNs to experience a significant performance drop under such corrupted data. To this end, this paper proposes a Generative Nonlinear Transformation Filter (GNTF) method to improve the generalization performance of lightweight CNNs on corrupted data. The core idea of the GNTF is that only a portion of the filters serve as learnable parameters (the seed filters), while the remaining filters are generated by applying a nonlinear transformation, randomly initialized and then fixed during training, to the seed filters. This design makes the model parameters less dependent on the training data distribution, thereby regularizing the model, mitigating overfitting, and enhancing its robustness to data degradation. An analysis of the structural characteristics of lightweight CNNs further shows that significant performance improvements can be achieved simply by replacing the depthwise convolutional modules. Furthermore, this paper examines the properties of various nonlinear transformation functions and finds that model robustness can be improved by applying simple translations. To verify the effectiveness of the GNTF, we conducted extensive experiments on the CIFAR-10/-100, CIFAR-10-C/-100-C, and ICONS-50 datasets, using the MobileNetV2, ShuffleNetV2, EfficientNet, and GhostNet models. The results show that the proposed GNTF can improve the model’s accuracy on corrupted data while reducing the number of trainable parameters in most cases. For example, on the CIFAR-10-C dataset, ShuffleNetV2 with the GNTF improves accuracy by about 3.3% over the original model while slightly reducing the number of trainable parameters.
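The seed-filter idea can be sketched in a few lines. Everything specific below is an illustrative assumption, not the paper's design: the filter counts, and the particular fixed nonlinear transform (a random translation followed by sin).

```python
import numpy as np

rng = np.random.default_rng(0)

n_filters, k = 8, 3     # total depthwise filters and kernel size (assumed)
n_seed = 2              # only these filters are trainable

seed = rng.standard_normal((n_seed, k, k))               # learnable seed filters
shift = rng.standard_normal((n_filters - n_seed, 1, 1))  # fixed, never trained

# Generate the remaining filters from the seeds with a fixed nonlinear
# transform: translate by a frozen random shift, then apply sin.
idx = np.arange(n_filters - n_seed) % n_seed
generated = np.sin(seed[idx] + shift)

bank = np.concatenate([seed, generated], axis=0)  # full filter bank
trainable_params = seed.size                      # 18 instead of 72
```

Because only `seed` receives gradients, the trainable parameter count drops from 72 to 18 in this toy bank while the layer still has 8 distinct filters.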
(This article belongs to the Section Internet of Things)

31 pages, 7864 KB  
Article
Development of a General-Purpose AI-Powered Robotic Platform for Strawberry Harvesting
by Muhammad Tufail, Jamshed Iqbal and Rafiq Ahmad
Agriculture 2026, 16(7), 769; https://doi.org/10.3390/agriculture16070769 - 31 Mar 2026
Abstract
The integration of emerging technologies such as robotics and artificial intelligence (AI) has the potential to transform agricultural harvesting by improving efficiency, reducing waste, lowering labor dependency, and enhancing produce quality. This paper presents the development of an intelligent robotic berry harvesting system that combines deep learning–based perception with autonomous robotic manipulation for real-time strawberry harvesting. A computer vision pipeline based on the YOLOv11 segmentation model was developed and integrated into a Smart Mobile Manipulator (SMM) equipped with autonomous navigation, a 6-degree-of-freedom (6-DoF) xArm 6 robotic arm, and ROS middleware to enable real-time operation. Using a publicly available strawberry dataset comprising 2,800 images collected under ridge-planted cultivation conditions, the proposed YOLOv11-small segmentation model achieved 84.41% mAP@0.5, outperforming YOLOv11 object detection, Faster R-CNN, and RT-DETR in segmentation quality while maintaining real-time performance at 10 FPS on an NVIDIA Jetson Orin Nano edge GPU. A PCA-based fruit orientation and geometric analysis method achieved 86.5% localization accuracy on 200 test images. Controlled indoor harvesting experiments using synthetic strawberries demonstrated an overall harvesting success rate of 72% across 50 trials. The proposed system provides a general-purpose platform for berry harvesting in controlled environments, offering a scalable and efficient solution for autonomous harvesting.
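The PCA-based fruit orientation step typically means taking the pixel coordinates of a segmented fruit mask and reading the dominant axis off the leading eigenvector of their covariance. A sketch under that assumption (synthetic points stand in for a real strawberry mask):

```python
import numpy as np

def principal_axis(points):
    """Dominant orientation of a 2-D point set (e.g. a fruit mask) from
    the eigenvector of the largest covariance eigenvalue."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = centered.T @ centered / len(pts)
    vals, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return vecs[:, -1]                 # eigenvector of the largest one

# Synthetic elongated blob stretched along the x-axis.
rng = np.random.default_rng(1)
pts = rng.standard_normal((500, 2)) * np.array([5.0, 0.5])
axis = principal_axis(pts)
```

For an elongated berry, this axis approximates the stem-to-tip direction, which the gripper can then align with before picking.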
(This article belongs to the Special Issue Advances in Robotic Systems for Precision Orchard Operations)

20 pages, 2112 KB  
Article
CE-Fusion Botanic: A Lightweight Leaf Disease Detection Model via Adaptive Local–Global Information Fusion
by Yamei Bao, Xiaolong Qi, Huiling Wang, Tao Liu and Yuqi Bai
Appl. Sci. 2026, 16(7), 3177; https://doi.org/10.3390/app16073177 - 25 Mar 2026
Abstract
To address the limited generalization ability common to lightweight models for leaf disease detection, this paper proposes CE-Fusion Botanic, a lightweight detection model based on the adaptive control of local–global information fusion. The model includes a globally guided dynamic gating fusion mechanism that dynamically adjusts fusion weights between local features, such as spot lesions, and global semantic features, such as symptoms of systemic infection, realizing adaptive perception of the dual characteristics of plant diseases. The local information extraction branch combines an improved MobileNetV3-Small structure with a CBAM attention mechanism, while the global information extraction branch uses a lightweight Vision Transformer (ViT) design called EffiViT. Comprehensive comparison experiments were carried out using seven mainstream lightweight models on the PlantVillage tomato disease subset, the full-category PlantVillage leaf disease dataset, and the Grapevine leaf disease dataset. Models were divided into large-scale, medium-scale, and small-scale groups according to the number of parameters. The results show that CE-Fusion Botanic is significantly better than the comparative methods in both detection accuracy and generalization performance while keeping a lightweight profile, demonstrating superior cross-dataset adaptation capabilities.
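A globally guided dynamic gate of the kind described can be sketched as a learned scalar gate computed from the global feature. The scalar parameterization here is an assumption; the paper's mechanism may gate per channel or per spatial location.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(local_feat, global_feat, w, b):
    """Globally guided gating: the global descriptor decides how much
    local vs. global evidence enters the fused feature."""
    g = sigmoid(global_feat @ w + b)              # gate in (0, 1)
    return g * local_feat + (1.0 - g) * global_feat, g

rng = np.random.default_rng(0)
d = 16
local_feat = rng.standard_normal(d)    # e.g. spot-lesion features
global_feat = rng.standard_normal(d)   # e.g. systemic-symptom features
w, b = rng.standard_normal(d), 0.0     # hypothetical learned gate params

fused, g = gated_fusion(local_feat, global_feat, w, b)
```

For localized spot lesions the gate would learn to push `g` toward 1 (favoring the CNN branch); for diffuse systemic symptoms, toward 0 (favoring the ViT branch).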
(This article belongs to the Section Computing and Artificial Intelligence)

23 pages, 7102 KB  
Article
Detection of Uniform Corrosion in Steel Pipes Using a Mobile Artificial Vision System
by Rafael Antonio Rodríguez Ospino, Cristhian Manuel Durán Acevedo and Jeniffer Katerine Carrillo Gómez
Corros. Mater. Degrad. 2026, 7(1), 21; https://doi.org/10.3390/cmd7010021 - 20 Mar 2026
Abstract
Corrosion in steel pipelines can cause critical failures in industrial systems, while conventional inspection methods such as radiography and ultrasonic testing are costly and require specialized personnel. This study presents a mobile computer vision system for automated corrosion detection inside steel pipes using deep learning-based visual analysis. The proposed system consists of a Raspberry Pi 4-based mobile robot equipped with a high-resolution camera for internal inspection. Acquired images were processed using color-space transformations (RGB–HSV), filtering, and segmentation. Convolutional neural networks and semantic segmentation models, including YOLOv8-seg (instance segmentation) and DeepLabV3 (semantic segmentation), were trained on a custom corrosion image dataset to identify corroded regions. Real-time visualization was implemented via Flask-based video streaming. Experimental results demonstrated high detection accuracy for uniform corrosion, achieving a mean Intersection over Union (mIoU) above 0.98 and a precision of 0.99 with the YOLOv8-seg model. These results indicate that the proposed system enables reliable and automated corrosion inspection, with the potential to reduce inspection costs and improve operational efficiency. Future work will focus on enhancing real-time performance through hardware optimization.
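The RGB–HSV step exploits the fact that rust tones occupy a narrow hue band, which is easy to threshold once pixels leave RGB space. A stdlib-only sketch; the hue window and saturation floor are illustrative assumptions, not the paper's calibration.

```python
import colorsys

def looks_corroded(r, g, b, hue_window=(0.02, 0.11), min_sat=0.4):
    """Rough per-pixel corrosion test: convert a normalized RGB pixel to
    HSV and check whether its hue falls in a brownish-orange band with
    enough saturation. Thresholds are hypothetical."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return hue_window[0] <= h <= hue_window[1] and s >= min_sat

rust = looks_corroded(0.55, 0.27, 0.07)   # brownish-orange pixel
steel = looks_corroded(0.60, 0.60, 0.65)  # neutral gray pixel
```

In the full pipeline such a hue mask would only be a preprocessing cue; the trained YOLOv8-seg/DeepLabV3 models do the actual segmentation.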

35 pages, 5649 KB  
Article
Cross-Dataset Benchmarking of Deep Learning Models for Surface Defect Classification in Metal Parts
by Fábio Mendes da Silva, João Manuel R. S. Tavares, António Mendes Lopes and Antonio Ramos Silva
Appl. Sci. 2026, 16(6), 3022; https://doi.org/10.3390/app16063022 - 20 Mar 2026
Abstract
Accurate surface defect classification is critical for industrial quality control. Although Deep Learning achieves strong results on individual datasets, most prior studies benchmark only a narrow set of models under inconsistent pipelines, limiting comparability and industrial relevance. This work introduces the first systematic benchmark of ten architectures—CNNs (CNN, ResNet18/50), lightweight models (MobileNetV2, SuperSimpleNet, GhostNet, EfficientNetV2), Vision Transformers (Swin Transformer), a hybrid CNN–Transformer (CoAtNet), and a one-stage detector (YOLOv12)—across five public defect datasets (NEU-DET, X-SDD, KolektorSDD2, DAGM, MTDD) under a unified pipeline. Results show that Swin Transformer and CoAtNet achieve the best performance (mean F1-scores 90.8% and 85.5%), while EfficientNetV2 underperformed (41.9%), underscoring the need for domain-specific benchmarks. Lightweight models such as MobileNetV2, GhostNet, and SuperSimpleNet deliver competitive accuracy at much lower cost, offering practical solutions for edge deployment. By bridging the gap between academic benchmarks and manufacturing requirements, this study provides actionable guidance for selecting defect detection models in automated inspection.

22 pages, 2426 KB  
Article
MidFusionEfficientV2: Improving Ophthalmic Diagnosis with Mid-Level RGB–LBP Fusion and SE Attention
by Julide Kurt Keles, Soner Kiziloluk, Eser Sert, Furkan Talo and Muhammed Yildirim
J. Clin. Med. 2026, 15(6), 2352; https://doi.org/10.3390/jcm15062352 - 19 Mar 2026
Abstract
Background/Objectives: Early diagnosis of eye diseases is critically important for enhancing individuals’ quality of life and reducing the risk of vision loss. In this study, a deep learning-based hybrid model called MidFusionEfficientV2 is proposed to classify eye diseases, including uveitis, conjunctivitis, cataract, eyelid drooping, and normal conditions. Methods: The model uses a dual-branch architecture that combines an RGB image branch built on an EfficientNetV2-S architecture with a specialized texture branch based on a Local Binary Pattern (LBP) transformation. Squeeze-and-Excitation (SE) blocks integrated into the LBP branch provide channel-wise attention that enhances the prominence of textural features. The features obtained from the RGB and LBP branches were combined at an intermediate level and passed to the classification stage. Results: Experimental studies on the five-class eye disease dataset from the Mendeley Data platform showed that the proposed model outperformed six strong models (ResNetV2, ConvNeXt, DenseNet-121, EfficientNet-B1, MobileNetV3 Large, and EfficientNetV2-S) with an accuracy of 98%. In the difficult-to-diagnose uveitis class in particular, recall and F1 scores of 97% and 94%, respectively, were achieved. Conclusions: The results show that a mid-level combination of color and texture features significantly improves classification performance, and that MidFusionEfficientV2 offers a reliable and effective solution for the automatic diagnosis of eye diseases.
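The SE block used in the LBP branch is a standard mechanism: global-average-pool each channel, pass the descriptor through a small two-layer bottleneck, and rescale the channels with the resulting sigmoid weights. A minimal numpy sketch (layer sizes and the reduction ratio are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map:
    squeeze -> two-layer bottleneck with ReLU -> channel rescaling."""
    z = x.mean(axis=(1, 2))                    # squeeze: one value per channel
    s = sigmoid(np.maximum(z @ w1, 0) @ w2)    # excite: weights in (0, 1)
    return x * s[:, None, None]                # rescale each channel

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                        # r = reduction ratio (assumed)
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C, C // r))
w2 = rng.standard_normal((C // r, C))
y = se_block(x, w1, w2)
```

Since the weights lie in (0, 1), each channel is attenuated rather than amplified; informative texture channels keep weights near 1 while uninformative ones are suppressed.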

19 pages, 7310 KB  
Article
Mathematical Benchmarking of Convolutional Neural Networks for Thai Dialect Recognition: A Spectrogram Texture Classification Approach
by Porawat Visutsak, Duongduen Ongrungruaeng, Surapong Wiriya and Keun Ho Ryu
Electronics 2026, 15(6), 1271; https://doi.org/10.3390/electronics15061271 - 18 Mar 2026
Abstract
This study rigorously evaluates 13 Convolutional Neural Network (CNN) architectures for Thai dialect recognition. By treating Automatic Speech Recognition (ASR) as a computer vision texture classification task, we processed an extensive 840-hour dataset from the Spoken Language Systems, Chulalongkorn University (SLSCU) corpus. Raw audio from four major dialects—Central, Northern (Khummuang), Northeastern (Korat), and Southern (Pattani)—was transformed into 2D Mel-spectrograms using the Short-Time Fourier Transform (STFT). We analyzed a diverse range of architectures, including the VGG, Inception, ResNet, DenseNet, and MobileNet families, to establish the optimal trade-off between mathematical complexity and spectral feature extraction. Our experimental results identify NASNet-Mobile as the most effective model, achieving a macro-average F1-score of 0.9425. The analysis suggests that NASNet’s search-optimized cell structure is uniquely capable of capturing the multiscale texture of phonetic formants. In contrast, we observed a catastrophic mode collapse in VGG16 (32.97% accuracy), likely due to excessive parameter bloat, while Xception and MobileNetV2 maintained robust generalization. Confusion matrix analysis reveals high acoustic distinctiveness for Southern Thai (96.7% recall), whereas Northern Thai exhibits significant spectral overlap with Central Thai. These results support the hypothesis that CNNs interpret spectrograms as textures rather than discrete objects, positioning NASNet-Mobile as a high-performance, low-latency baseline for edge-device deployment in resource-constrained environments.
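The audio-to-image step rests on the STFT: slice the waveform into windowed frames and take the magnitude of each frame's FFT, producing a 2-D array a CNN can treat as a texture. The sketch below stops at the linear-frequency magnitude spectrogram and omits the Mel filter bank the paper applies on top; frame sizes are assumed.

```python
import numpy as np

def stft_magnitude(signal, n_fft=256, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> real FFT -> |.|.
    Returns an array of shape (freq_bins, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A pure 440 Hz tone sampled at 8 kHz should light up one frequency band.
sr, f0 = 8000, 440.0
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * f0 * t))
```

For a tone, energy concentrates in the bin nearest f0 (here 440 / (8000/256) ≈ bin 14), which is exactly the kind of horizontal "texture stripe" the CNNs learn to read.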
(This article belongs to the Special Issue Advances in Machine Learning for Image Classification)

40 pages, 9518 KB  
Article
Transit-Oriented Development in the Gulf: Comparative Analysis of Al Mansoura (Doha) and Olaya (Riyadh)
by Silvia Mazzetto, Raffaello Furlan, Jalal Hoblos and Rashid Al-Matwi
Sustainability 2026, 18(6), 2952; https://doi.org/10.3390/su18062952 - 17 Mar 2026
Abstract
Since the 1970s, accelerated urban development in Doha has contributed to a disjointed and inefficient city structure. While the Doha Metro has begun to address spatial and mobility-related challenges, planners continue to call for a more integrated, strategic approach to ensure safe, accessible, and efficient transit connectivity. In response, the Qatar National Development Framework provides a long-term vision for sustainable urban transformation, with a central aim of embedding the Metro system within the existing urban context and aligning expansion with Transit-Oriented Development (TOD), which promotes dense, multifunctional, pedestrian-oriented neighborhoods along transit corridors. Within this context, this study investigates how TOD strategies can enhance quality of life in mixed-use environments, focusing on the area surrounding Al Mansoura metro station and the adjacent Najma and Al Mansoura districts. Using the Integrated Modification Methodology (IMM), the analysis assesses spatial structure across density, spatial diversity, and connectivity, and derives evidence-based recommendations to improve livability and support sustainable revitalization. To broaden regional applicability, the study also compares Al Mansoura with Olaya in Riyadh—two mid-to-late 20th-century, high-density mixed-use districts undergoing TOD-driven transition—highlighting how spatial form, infrastructure legacy, and urban governance shape TOD outcomes and inform adaptable TOD frameworks for Gulf cities.
(This article belongs to the Section Sustainable Transportation)

20 pages, 6854 KB  
Article
TARTS: Training-Free Adaptive Reference-Guided Traversability Segmentation with Automated Footprint Supervision and Experimental Verification
by Shuhong Shi and Lingchuan Zeng
Electronics 2026, 15(6), 1194; https://doi.org/10.3390/electronics15061194 - 13 Mar 2026
Abstract
Autonomous mobile robots require robust traversability perception to navigate safely in diverse outdoor environments. However, traditional deep learning approaches are data-hungry, requiring large-scale manual annotations, and struggle to adapt quickly to unseen environments. This paper introduces TARTS (Training-free Adaptive Reference-guided Traversability Segmentation), a novel framework combining one-shot prototype initialization with trajectory-guided online adaptation for terrain segmentation. Using a single reference image of desired traversable terrain, TARTS establishes an initial prototype from pre-trained DINO Vision Transformer (ViT) features. The system performs segmentation through superpixel-based feature aggregation and valley-emphasis Otsu thresholding while continuously refining the prototype via Exponential Moving Average (EMA) updates driven by automated footprint supervision from the robot’s traversed trajectory. Extensive experiments on our introduced Reference-guided Traversability Segmentation Dataset (RTSD) and the challenging Off-Road Freespace Detection (ORFD) benchmark demonstrate strong performance, achieving 94.5% IoU on RTSD and 94.1% IoU on ORFD, outperforming state-of-the-art supervised methods that require multi-modal inputs and dedicated training. The framework maintains efficient performance (17–24 FPS) on embedded platforms, enabling practical deployment with only a reference image as initialization.
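The prototype-refinement loop can be sketched compactly: the traversable-terrain prototype is an EMA over features from terrain the robot has actually driven, and segmentation reduces to a similarity test against that prototype. Feature dimension, EMA momentum, and the cosine-similarity criterion are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ema_update(prototype, footprint_feat, alpha=0.9):
    """Refine the traversability prototype with features from where the
    robot actually drove (automated footprint supervision)."""
    return alpha * prototype + (1.0 - alpha) * footprint_feat

rng = np.random.default_rng(0)
prototype = rng.standard_normal(32)      # one-shot init from reference image
terrain = prototype + 0.05 * rng.standard_normal(32)  # features on the route

for _ in range(50):                      # online adaptation as the robot moves
    prototype = ema_update(prototype, terrain)

sim = cosine(prototype, terrain)         # high similarity -> traversable
```

A region is then marked traversable when its aggregated superpixel feature exceeds a similarity threshold against the prototype (selected by valley-emphasis Otsu in the paper).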

17 pages, 3196 KB  
Article
Aphid-ResNetSwin: An Image Recognition Method with Improved Attention Mechanism for Graded Identification of Myzus persicae
by Jinzhou Luo, Jiazhao Sun, Xiaoli Hao, Heng Liu, Fajin Lv and Wei Ding
Insects 2026, 17(3), 305; https://doi.org/10.3390/insects17030305 - 11 Mar 2026
Abstract
Myzus persicae is the most devastating piercing-sucking pest threatening tobacco production. Precise quantification of infestation severity is a prerequisite for precision pest management, making image-based analysis essential for efficient management. Current computer vision models in modern agriculture are primarily designed for classifying various pest species, and there is a lack of image-driven analytical tools for assessing the severity of damage inflicted by a single target pest. To supplement existing analytical tools and enable the graded recognition of tobacco aphid (M. persicae) infestation levels, we propose the Aphid-ResNetSwin model. This model is constructed by integrating a Global Channel-Spatial Attention module (a specialized attention mechanism) into the well-established InceptionResNetV2 architecture. Performance evaluation results demonstrated that the Aphid-ResNetSwin model achieved a graded recognition accuracy of 89.11%. Compared with widely adopted mainstream baseline models in computer vision, such as MobileNetV3, Swin Transformer, and InceptionResNetV2, our proposed model exhibited superior recognition accuracy. Furthermore, the classification accuracy of our model for M. persicae infestation across all severity levels was significantly higher than that of manual identification, with the exception of healthy leaves. Collectively, our findings indicate that the Aphid-ResNetSwin model provides a robust tool for the graded recognition of M. persicae infestation, offering valuable model-based support for the precision control of aphids in tobacco fields.

31 pages, 8223 KB  
Article
X-ViTCNN: A Novel Network-Level Fusion of Transfer Learning and Customized Vision Transformer for Multi-Stage Alzheimer’s Disease Prediction Using MRI Scans
by Armughan Ali, Hooria Shahbaz, Shahid Mohammad Ganie and Manahil Mohammed Alfuraydan
Diagnostics 2026, 16(6), 835; https://doi.org/10.3390/diagnostics16060835 - 11 Mar 2026
Abstract
Background/Objectives: Alzheimer’s disease (AD), the most prevalent form of dementia, is characterized by an overall decline in cognitive functioning and represents a major public health crisis. It remains critical to be able to accurately and quickly diagnose patients with AD; however, recent deep learning approaches using MRI data generalize poorly across samples, have high computational requirements, and offer little interpretability. Methods: In this study, we present a new framework called eXplorative ViT-CNN (X-ViTCNN) that combines a customized Vision Transformer model with two previously trained CNNs (DenseNet201 and MobileNetV2). With our proposed contrast-enhanced preprocessing to highlight neuroanatomical features and Bayesian Optimization to tune hyperparameters, we fuse local structural features originating from the CNNs with global representations from the transformer and feed the final result to fully connected dense layers for multi-stage classification. We also use Grad-CAM visualizations to provide insight into how our model arrived at its classification. Results: Experiments conducted on ADNI and OASIS datasets demonstrate the superiority of X-ViTCNN, achieving accuracies of 97.98% and 94.52%, respectively. The model outperformed individual baselines and other pre-trained architectures, showing balanced sensitivity and specificity across all AD stages. Conclusions: The proposed X-ViTCNN framework is a powerful, interpretable method for predicting the development of multi-stage Alzheimer’s disease using MRI scans. The combination of complementary feature learning, automatic hyperparameter optimization, and interpretability through visualization makes it an excellent potential tool for clinicians to support their decision making in the early diagnosis and ongoing monitoring of persons with Alzheimer’s disease.
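The fusion step described, combining local CNN features with global transformer representations before dense classification, is commonly implemented as feature concatenation followed by a fully connected layer. A sketch under that assumption; feature sizes, the four-class output, and the single dense layer are all illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_and_classify(cnn_feat, vit_feat, w, b):
    """Network-level fusion: concatenate local CNN features with global
    transformer features, then classify with one dense layer."""
    fused = np.concatenate([cnn_feat, vit_feat])
    return softmax(fused @ w + b)

rng = np.random.default_rng(0)
cnn_feat = rng.standard_normal(64)   # stand-in for DenseNet201/MobileNetV2 output
vit_feat = rng.standard_normal(32)   # stand-in for the custom ViT embedding
w = rng.standard_normal((96, 4))     # 4 AD stages (illustrative count)
b = np.zeros(4)

probs = fuse_and_classify(cnn_feat, vit_feat, w, b)
```

The softmax output gives one probability per disease stage; in the paper the dense head is deeper and its hyperparameters are tuned by Bayesian Optimization.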

17 pages, 1775 KB  
Article
Evaluation of Maxillary Sinus Membrane Morphology Using a Novel Hybrid CNN-ViT-Based Deep Learning Model: An Automated Classification Study
by Nurullah Duger, Furkan Talo, Gulucag Giray Tekin, Burak Dagtekin, Mucahit Karaduman, Muhammed Yildirim and Tuba Talo Yildirim
Diagnostics 2026, 16(5), 777; https://doi.org/10.3390/diagnostics16050777 - 5 Mar 2026
Abstract
Objectives: This study aimed to develop and validate a hybrid deep learning model combining Convolutional Neural Networks (CNN) and Vision Transformers (ViT) to automatically classify maxillary sinus membrane morphologies on Cone-Beam Computed Tomography (CBCT) images, distinguishing between Normal, Flat, Polypoid, and Obstruction types. Methods: A dataset of 959 CBCT images was collected and categorized into four morphological classes: Normal, Flat, Polypoid, and Obstruction. A custom hybrid model was developed, integrating a lightweight residual CNN for local feature extraction, learnable weighted feature fusion with a bidirectional feature pyramid network, and a Transformer encoder for global context modeling. The performance of the proposed model was compared against six different architectures, including ResNet50, MobileNetV3L, and standard ViT models, using accuracy, precision, recall, and F1-score metrics. Results: The proposed hybrid model achieved the highest overall accuracy of 98.44%, outperforming six strong CNN and ViT models, including ResNet50 (97.92%) and ViT-B16 (86.46%). In class-wise analysis, the model demonstrated superior diagnostic capability, particularly for the “Obstruction” class, achieving 100% accuracy. High discrimination was also observed for “Flat” (98.21%) and “Polypoid” (98.04%) morphologies, confirming the model’s sensitivity to shape-based features. Conclusions: The proposed hybrid CNN-ViT model successfully classifies maxillary sinus membrane morphologies with high accuracy, effectively overcoming the limitations of standard ViT models on limited datasets. Detection of membrane morphology is vital for predicting surgical risks like membrane perforation and post-operative sinusitis. This model serves as a reliable clinical decision support tool, enabling clinicians to objectively assess specific risk factors before implant surgery and sinus floor elevation.
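"Learnable weighted feature fusion with a bidirectional feature pyramid network" usually refers to BiFPN-style fast-normalized fusion: non-negative learned weights decide each input's share of the fused map. A minimal sketch under that assumption (the weight values and map sizes are illustrative):

```python
import numpy as np

def weighted_fusion(features, weights, eps=1e-4):
    """Fast-normalized weighted fusion (BiFPN-style): clamp the learned
    weights to be non-negative, normalize them, and blend the inputs."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU keeps w >= 0
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))

# Two same-sized feature maps fused with equal learned weights.
a = np.full((4, 4), 1.0)
b = np.full((4, 4), 3.0)
fused = weighted_fusion([a, b], weights=[1.0, 1.0])
```

During training the weights shift so that more informative scales dominate the fused feature, while the epsilon keeps the normalization numerically stable when all weights shrink.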
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

20 pages, 1379 KB  
Article
Hybrid Vision Transformer–CNN Framework for Alzheimer’s Disease Cell Type Classification: A Comparative Study with Vision–Language Models
by Md Easin Hasan, Md Tahmid Hasan Fuad, Omar Sharif and Amy Wagler
J. Imaging 2026, 12(3), 98; https://doi.org/10.3390/jimaging12030098 - 25 Feb 2026
Viewed by 641
Abstract
Accurate identification of Alzheimer’s disease (AD)-related cellular characteristics from microscopy images is essential for understanding neurodegenerative mechanisms at the cellular level. While most computational approaches focus on macroscopic neuroimaging modalities, cell type classification from microscopy remains relatively underexplored. In this study, we propose a hybrid vision transformer–convolutional neural network (ViT–CNN) framework that integrates DeiT-Small and EfficientNet-B7 to classify three AD-related cell types—astrocytes, cortical neurons, and SH-SY5Y neuroblastoma cells—from phase-contrast microscopy images. We perform a comparative evaluation against conventional CNN architectures (DenseNet, ResNet, InceptionNet, and MobileNet) and prompt-based multimodal vision–language models (GPT-5, GPT-4o, and Gemini 2.5-Flash) using zero-shot, few-shot, and chain-of-thought prompting. Experiments conducted with stratified fivefold cross-validation show that the proposed hybrid model achieves a test accuracy of 61.03% and a macro F1 score of 61.85, outperforming standalone CNN baselines and prompt-only LLM approaches under data-limited conditions. These results suggest that combining convolutional inductive biases with transformer-based global context modeling can improve generalization for cellular microscopy classification. While constrained by dataset size and scope, this work serves as a proof of concept and highlights promising directions for future research in domain-specific pretraining, multimodal data integration, and explainable AI for AD-related cellular analysis. Full article

15 pages, 1376 KB  
Article
GANimate: Ultra-Efficient Lip-Landmark-Driven Talking Face Animation Using a Learned Kalman Filter on GAN Feature Latent Space for Human–Computer Interaction on Mobile Devices
by Ethan Fenakel, Ben Ohayon and Dan Raviv
Sensors 2026, 26(4), 1377; https://doi.org/10.3390/s26041377 - 22 Feb 2026
Viewed by 581
Abstract
We present GANimate, a lightweight method for animating talking faces that leverages recent advances in latent-space manipulation of Generative Adversarial Networks (GANs). Unlike existing approaches based on computationally intensive diffusion models, transformers, or complex 3DMM representations, which are impractical for mobile and other low-resource edge devices due to their high memory and compute demands, GANimate is designed for efficient operation on low-memory, low-compute hardware. The model operates on 2D lip landmarks extracted from standard mobile vision-sensor inputs and requires no pre-training, making it easily integrable with any lip-landmark generator. Through an optimization process in the GAN feature latent space, these landmarks act as geometric constraints to animate a static portrait, producing realistic and expressive lip movements. To maintain stability and visual coherence across frames, we employ a Kalman filter to track the detected lip landmarks during video synthesis, enabling adaptive refinement and improved temporal consistency. The result is a compact and modular framework that bridges the gap between performance and accessibility in talking face synthesis, delivering high-quality, stable animations with minimal computational overhead. GANimate represents an important step toward lifelike, real-time avatars suitable for sensor-enabled and mobile human–computer interaction. Full article
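The temporal-consistency step the abstract describes can be illustrated with a classical Kalman filter smoothing one 2D lip landmark across frames. This is an assumed sketch, not the paper's learned filter: the constant-velocity state model and the noise parameters `q` and `r` are illustrative choices.

```python
# Hypothetical sketch: constant-velocity Kalman smoothing of one 2D landmark.
# State is [x, y, vx, vy]; only the noisy (x, y) position is measured each frame.
import numpy as np

class Landmark2DKalman:
    def __init__(self, dt=1.0, q=1e-3, r=1e-1):
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], float)  # constant-velocity dynamics
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], float)   # observe position only
        self.Q = q * np.eye(4)                      # process noise covariance
        self.R = r * np.eye(2)                      # measurement noise covariance
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def step(self, z):
        # Predict the next state from the motion model
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the measured landmark position z = (x, y)
        innov = np.asarray(z, float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innov
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]  # smoothed position estimate

kf = Landmark2DKalman()
rng = np.random.default_rng(0)
truth = np.stack([np.linspace(0, 10, 50), np.linspace(0, 5, 50)], axis=1)
noisy = truth + rng.normal(0, 0.3, truth.shape)       # jittery detections
smoothed = np.array([kf.step(z) for z in noisy])
# Compare error after the filter has converged (skip the first 10 frames)
err_noisy = np.mean(np.linalg.norm(noisy[10:] - truth[10:], axis=1))
err_smooth = np.mean(np.linalg.norm(smoothed[10:] - truth[10:], axis=1))
print(err_smooth < err_noisy)  # smoothing reduces frame-to-frame jitter
```

In a pipeline like the one described, each landmark would get such a tracker, and the smoothed positions would serve as the geometric constraints for the latent-space optimization, suppressing detector jitter between frames.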
(This article belongs to the Section Sensing and Imaging)
