Search Results (305)

Search Parameters:
Keywords = cross-modal alignment

32 pages, 13552 KB  
Article
Closing Sim2Real Gaps: A Versatile Development and Validation Platform for Autonomous Driving Stacks
by J. Felipe Arango, Rodrigo Gutiérrez-Moreno, Pedro A. Revenga, Ángel Llamazares, Elena López-Guillén and Luis M. Bergasa
Sensors 2026, 26(4), 1338; https://doi.org/10.3390/s26041338 - 19 Feb 2026
Abstract
The successful transfer of autonomous driving stacks (ADS) from simulation to the real world faces two main challenges: the Reality Gap (RG)—mismatches between simulated and real behaviors—and the Performance Gap (PG)—differences between expected and achieved performance across domains. We propose a Methodology for Closing Reality and Performance Gaps (MCRPG), a structured and iterative approach that jointly reduces RG and PG through parameter tuning, cross-domain metrics, and staged validation. MCRPG comprises three stages—Digital Twin, Parallel Execution, and Real-World—to progressively align ADS behavior and performance. To ground and validate the method, we present an open-source, cost-effective Development and Validation Platform (DVP) that integrates an ROS-based modular ADS with the CARLA simulator and a custom autonomous electric vehicle. We also introduce a two-level metric suite: (i) Reality Alignment via Maximum Normalized Cross-Correlation (MNCC) over multi-modal signals (e.g., ego kinematics, detections), and (ii) Ego-Vehicle Performance covering safety, comfort, and driving efficiency. Experiments in an urban scenario show convergence between simulated and real behavior and increasingly consistent performance across stages. Overall, MCRPG and DVP provide a replicable framework for robust, scalable, and accessible Sim2Real research in autonomous navigation techniques. Full article
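As an illustration of the Reality Alignment metric described above, the following minimal NumPy sketch computes a Maximum Normalized Cross-Correlation between one simulated and one real signal over a window of lags; the signal choice (ego speed), the lag window, and the normalization details are assumptions for illustration and not taken from the paper.

```python
import numpy as np

def mncc(sim: np.ndarray, real: np.ndarray, max_lag: int = 50) -> float:
    """Maximum Normalized Cross-Correlation between two 1-D signals.

    Both signals are normalized to zero mean and unit variance, then the
    normalized cross-correlation is evaluated over a window of lags and
    the maximum value is returned (values near 1.0 mean well aligned).
    """
    sim = (sim - sim.mean()) / (sim.std() + 1e-9)
    real = (real - real.mean()) / (real.std() + 1e-9)
    n = min(len(sim), len(real))
    sim, real = sim[:n], real[:n]
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = sim[lag:], real[:n - lag]
        else:
            a, b = sim[:n + lag], real[-lag:]
        if len(a) < 2:
            continue
        best = max(best, float(np.dot(a, b) / len(a)))
    return best

# Example with hypothetical data: a simulated ego-speed trace vs. a delayed, noisy real one.
t = np.linspace(0, 10, 500)
v_sim = np.sin(t)
v_real = np.sin(t - 0.1) + 0.05 * np.random.randn(500)
print(f"MNCC(speed) = {mncc(v_sim, v_real):.3f}")
```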
24 pages, 8810 KB  
Article
FreqPose: Frequency-Aware Diffusion with Fractional Gabor Filters and Global Pose–Semantic Alignment
by Meng Wang, Bing Wang, Huiling Chen, Jing Ren and Xueping Tang
Sensors 2026, 26(4), 1334; https://doi.org/10.3390/s26041334 - 19 Feb 2026
Abstract
Pose-guided person image generation has long faced two major challenges: high-frequency texture details tend to blur and be lost during appearance transfer, and the semantic identity of the person is difficult to maintain consistently across pose changes. To address these issues, this paper proposes a diffusion-based generative framework that integrates frequency awareness and global semantic alignment. The framework consists of two core modules: a multi-level fractional-order Gabor frequency-aware network, which accurately extracts and reconstructs high-frequency texture features such as hair strands and fabric wrinkles and enhances image detail fidelity through fractional-order filtering and complex-domain modeling; and a global semantic–pose alignment module that uses a cross-modal attention mechanism to establish a global mapping between pose features and appearance semantics, ensuring pose-driven semantic alignment and appearance consistency. Together, the two modules ensure that the generated results maintain structural integrity and natural textures even under complex pose variations and large-angle rotations. Experimental results on the DeepFashion and Market1501 datasets demonstrate that the proposed method outperforms existing state-of-the-art approaches in SSIM, FID, and perceptual quality, validating its effectiveness in enhancing texture fidelity and semantic consistency. Full article
(This article belongs to the Section Intelligent Sensors)
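To give a concrete feel for the frequency-aware branch in the FreqPose entry above, here is a plain (integer-order) Gabor filter bank in NumPy; the paper's fractional-order Gabor filters and complex-domain modeling generalize this, so treat the sketch only as background on the base operation, with all parameter values chosen arbitrarily.

```python
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lambd=6.0, gamma=0.5, psi=0.0):
    """Standard Gabor kernel: a Gaussian envelope modulated by a cosine carrier.
    The fractional-order variant used in the paper generalizes this form."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)       # rotate coordinates
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t) ** 2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lambd + psi)
    return envelope * carrier

# A small multi-orientation bank for extracting high-frequency texture responses.
bank = [gabor_kernel(theta=k * np.pi / 4) for k in range(4)]
print(len(bank), bank[0].shape)  # 4 kernels, each 15x15
```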
26 pages, 26398 KB  
Article
WEMFusion: Wavelet-Driven Hybrid-Modality Enhancement and Discrepancy-Aware Mamba for Optical–SAR Image Fusion
by Jinwei Wang, Yongjin Zhao, Liang Ma, Bo Zhao, Fujun Song and Zhuoran Cai
Remote Sens. 2026, 18(4), 612; https://doi.org/10.3390/rs18040612 - 15 Feb 2026
Viewed by 153
Abstract
Optical and synthetic aperture radar (SAR) imagery are highly complementary in terms of texture details and structural scattering characterization. However, their imaging mechanisms and statistical distributions differ substantially. In particular, pseudo-high-frequency components introduced by SAR coherent speckle can be easily entangled with genuine optical edges, leading to texture mismatch, structural drift, and noise diffusion. To address these issues, we propose WEMFusion, a wavelet-prior-driven framework for frequency-domain decoupling and discrepancy-aware state-space fusion. Specifically, a multi-scale discrete wavelet transform (DWT) explicitly decomposes the inputs into low-frequency structural components and directional high-frequency sub-bands, providing an interpretable frequency-domain constraint for cross-modality alignment. We design a hybrid-modality enhancement (HME) module: in the high-frequency branch, it effectively injects optical edges and directional textures while suppressing the propagation of pseudo-high-frequency artifacts, and in the low-frequency branch, it reinforces global structural consistency and prevents speckle perturbations from leaking into the structural component, thereby mitigating structural drift. Furthermore, we introduce a discrepancy-aware gated Mamba fusion (DAG-MF) block, which generates dynamic gates from modality differences and complementary responses to modulate the parameters of a directionally scanned two-dimensional state-space model, so that long-range dependency modeling focuses on discrepant regions while preserving directional coherence. Extensive quantitative evaluations and qualitative comparisons demonstrate that WEMFusion consistently improves structural fidelity and edge detail preservation across multiple optical–SAR datasets, achieving superior fusion quality with lower computational overhead. Full article
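The wavelet-prior decomposition described above can be illustrated with PyWavelets; the fusion rule at the end is a deliberately naive stand-in for the paper's HME and DAG-MF modules, and the wavelet choice ('haar') and patch sizes are assumptions.

```python
import numpy as np
import pywt

# Hypothetical co-registered patches: an optical grayscale image and a SAR amplitude image.
optical = np.random.rand(256, 256).astype(np.float32)
sar = np.random.rand(256, 256).astype(np.float32)

# One-level 2-D DWT: the LL band holds low-frequency structure; LH, HL, HH
# hold directional high-frequency detail.
opt_ll, (opt_lh, opt_hl, opt_hh) = pywt.dwt2(optical, 'haar')
sar_ll, (sar_lh, sar_hl, sar_hh) = pywt.dwt2(sar, 'haar')

# Naive fusion rule for illustration only: keep the optical high-frequency
# detail (to avoid injecting speckle) and average the low-frequency structure.
fused_ll = 0.5 * (opt_ll + sar_ll)
fused = pywt.idwt2((fused_ll, (opt_lh, opt_hl, opt_hh)), 'haar')
print(fused.shape)  # (256, 256)
```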
19 pages, 1562 KB  
Article
Vox2Face: Speech-Driven Face Generation via Identity-Space Alignment and Diffusion Self-Consistency
by Qiming Ma, Yizhen Wang, Xiang Sun, Jiadi Liu, Gang Cheng, Jia Feng, Rong Wang and Fanliang Bu
Information 2026, 17(2), 200; https://doi.org/10.3390/info17020200 - 14 Feb 2026
Viewed by 203
Abstract
Speech-driven face generation aims to synthesize a face image that matches a speaker’s identity from speech alone. However, existing methods typically trade identity fidelity for visual quality and rely on large end-to-end generators that are difficult to train and tune. We propose Vox2Face, a speech-driven face generation framework centered on an explicit identity space rather than direct speech-to-image mapping. A pretrained speaker encoder first extracts speech embeddings, which are distilled and metric-aligned to the ArcFace hyperspherical identity space, transforming cross-modal regression into a geometrically interpretable speech-to-identity alignment problem. On this unified identity representation, we reuse an identity-conditioned diffusion model as the generative backbone and synthesize diverse, high-resolution faces in the Stable Diffusion latent space. To better exploit this prior, we introduce a discriminator-free diffusion self-consistency loss that treats denoising residuals as an implicit critique of speech-predicted identity embeddings and updates only the speech-to-identity mapping and lightweight LoRA adapters, encouraging speech-derived identities to lie on the high-probability identity manifold of the diffusion model. Experiments on the HQ-VoxCeleb dataset show that Vox2Face improves the ArcFace cosine similarity from 0.295 to 0.322, boosts R@10 retrieval accuracy from 29.8% to 32.1%, and raises the VGGFace Score from 18.82 to 23.21 over a strong diffusion baseline. These results indicate that aligning speech to a unified identity space and reusing a strong identity-conditioned diffusion prior is an effective way to jointly improve identity fidelity and visual quality. Full article
(This article belongs to the Section Artificial Intelligence)
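A minimal PyTorch sketch of the speech-to-identity alignment idea described above, assuming a small MLP mapping and a plain cosine alignment loss on the unit hypersphere; the dimensions, architecture, and loss form are illustrative and not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechToIdentity(nn.Module):
    """Maps a speaker embedding onto the unit-norm face-identity hypersphere."""
    def __init__(self, speech_dim=192, id_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(speech_dim, 512), nn.GELU(), nn.Linear(512, id_dim)
        )

    def forward(self, speech_emb):
        return F.normalize(self.mlp(speech_emb), dim=-1)

def alignment_loss(pred_id, arcface_id):
    """1 - cosine similarity between predicted and ArcFace identity embeddings."""
    arcface_id = F.normalize(arcface_id, dim=-1)
    return (1.0 - (pred_id * arcface_id).sum(dim=-1)).mean()

model = SpeechToIdentity()
speech = torch.randn(8, 192)   # batch of speaker embeddings (hypothetical)
face_id = torch.randn(8, 512)  # matching ArcFace embeddings (hypothetical)
loss = alignment_loss(model(speech), face_id)
loss.backward()
```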
24 pages, 2988 KB  
Article
Multimodal Named-Entity Recognition Based on Symmetric Fusion with Contrastive Learning
by Yubo Wu and Junqiang Liu
Symmetry 2026, 18(2), 353; https://doi.org/10.3390/sym18020353 - 14 Feb 2026
Viewed by 98
Abstract
Multimodal named-entity recognition (MNER) aims to identify entity information by leveraging multimodal features. As recent research shifts to multi-image scenarios, existing methods overlook modality noise and lack effective cross-modal interaction, leaving prominent semantic gaps. This study integrates symmetric multimodal fusion with contrastive learning, proposing a novel model with a symmetric-encoder collaborative architecture. To mitigate modality noise, a modality refinement encoder maps each modality to an exclusive space, while an aligned encoder bridges semantic gaps via contrastive learning in a shared space, surpassing the superficial cross-modal mapping of existing models. Building on these encoders, a symmetric fusion module achieves deep bidirectional fusion, overcoming the limitations of traditional one-way or concatenation-based fusion. Experiments on two datasets show that the model outperforms state-of-the-art methods, and ablation experiments confirm the unique role of the symmetric encoders in consistent multimodal learning. Full article
(This article belongs to the Section Computer)
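The aligned encoder above relies on contrastive learning in a shared space; a standard symmetric InfoNCE loss of the kind commonly used for such alignment is sketched below. The temperature and feature dimensions are arbitrary, and this is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment(text_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE loss in a shared space: matched text/image pairs
    (same batch index) are pulled together, all other pairs pushed apart."""
    text = F.normalize(text_feats, dim=-1)
    image = F.normalize(image_feats, dim=-1)
    logits = text @ image.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(text.size(0), device=text.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Hypothetical batch of 16 text and image features projected into a shared 256-d space.
loss = contrastive_alignment(torch.randn(16, 256), torch.randn(16, 256))
```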
20 pages, 2326 KB  
Article
A LLaMA-Based Efficient Fine-Tuning Method for Image Captioning Using Multi-Feature Dynamic Prompts
by Yongyang Yin, Hengyu Cao, Chunsheng Zhang, Faxun Jin, Xin Liu and Jun Lin
Appl. Sci. 2026, 16(4), 1857; https://doi.org/10.3390/app16041857 - 12 Feb 2026
Viewed by 145
Abstract
To address the trade-off between parameter scale and generation quality in Vision-Language Models (VLMs), this study proposes a Multi-Feature Dynamic Instruction Tuning (MFDIT) image captioning model based on LLaMA. By integrating CLIP-based global features with SAM-derived local features, the model constructs a multi-level visual representation. Additionally, a Dynamic Prompt Adapter is designed to enable cross-modal semantic alignment with adaptive flexibility. Combined with a Low-Rank Adaptation (LoRA) fine-tuning strategy, the proposed method enhances the model’s capability in describing diverse images while training only 20 million parameters, accounting for merely 0.05% of the total parameter volume. Experimental results demonstrate that the model achieves a CIDEr score of 126.7 on the MSCOCO dataset, surpassing traditional adapter-based approaches by 3.0 points. Moreover, in the MME Benchmark evaluation, the proposed model outperforms the mainstream LLaMA-Adapter V2 by 7.3% and 3.8% in OCR and object counting tasks, respectively. Ablation studies further validate the synergistic effects of multi-feature fusion and dynamic instruction optimization. This research provides an efficient solution for parameter-efficient multimodal model training and potential deployment in resource-constrained environments. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
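Since the method above trains only about 0.05% of the parameters via LoRA, the following minimal PyTorch sketch shows how a LoRA update wraps a frozen linear layer; the rank, scaling, and layer sizes are illustrative, and the Dynamic Prompt Adapter itself is not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A, so only
    r * (in_features + out_features) parameters are updated during fine-tuning."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.t() @ self.lora_b.t()) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable parameters vs. ~16.8M frozen in the base layer
```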
18 pages, 4326 KB  
Article
DCS: A Zero-Shot Anomaly Detection Framework with DINO-CLIP-SAM Integration
by Yan Wan, Yingqi Lang and Li Yao
Appl. Sci. 2026, 16(4), 1836; https://doi.org/10.3390/app16041836 - 12 Feb 2026
Viewed by 130
Abstract
Recent progress in foundation models such as CLIP and SAM has shown great potential for zero-shot anomaly detection. However, existing methods usually rely on generic descriptions such as “abnormal”, whose semantic coverage is too narrow to express fine-grained anomaly semantics. In addition, CLIP primarily performs global-level alignment and struggles to accurately locate minor defects, while the segmentation quality of SAM depends heavily on prompt constraints. To solve these problems, we propose DCS, a unified framework that integrates Grounding DINO, CLIP, and SAM through three key innovations. First, we introduce FinePrompt for adaptive learning, which significantly enhances the modeling of anomaly semantics by building a fine-grained anomaly description library and adopting learnable text embeddings. Second, we design an Adaptive Dual-path Cross-modal Interaction (ADCI) module that achieves more effective cross-modal information exchange through dual-path fusion. Finally, we propose a Box-Point Prompt Combiner (BPPC), which combines box priors provided by DINO with point prompts generated by CLIP to guide SAM toward finer and more complete segmentation results. Extensive experiments demonstrate the effectiveness of our method: on the MVTec-AD and VisA datasets, DCS achieves state-of-the-art zero-shot anomaly detection results. Full article
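A rough sketch of the box-point prompt combination idea described above: a detector box is paired with a point taken at the peak of an anomaly map inside that box, then handed to a promptable segmenter. The peak-picking rule, array shapes, and the commented SAM call are assumptions, not the paper's BPPC implementation.

```python
import numpy as np

def combine_box_point(box_xyxy, anomaly_map):
    """Combine a detector box (e.g., from Grounding DINO) with a foreground point
    taken at the anomaly-map maximum inside that box, returning (box, points, labels)
    in the generic format promptable segmenters such as SAM accept."""
    x0, y0, x1, y1 = [int(v) for v in box_xyxy]
    crop = anomaly_map[y0:y1, x0:x1]
    dy, dx = np.unravel_index(np.argmax(crop), crop.shape)
    point = np.array([[x0 + dx, y0 + dy]])   # (1, 2) point in image coordinates
    label = np.array([1])                    # 1 = foreground point
    return np.array(box_xyxy, dtype=float), point, label

heat = np.random.rand(480, 640)              # hypothetical CLIP-derived anomaly scores
box, point, label = combine_box_point([100, 120, 220, 260], heat)
# predictor.predict(box=box, point_coords=point, point_labels=label)  # SAM call (assumed API)
```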
23 pages, 15029 KB  
Article
LPDiag: LLM-Enhanced Multimodal Prototype Learning Framework for Intelligent Tomato Leaf Disease Diagnosis
by Heng Dong, Xuemei Qiu, Dawei Fan, Mingyue Han, Jiaming Yu, Changcai Yang, Jinghu Li, Ruijun Liu, Riqing Chen and Qiufeng Chen
Agriculture 2026, 16(4), 419; https://doi.org/10.3390/agriculture16040419 - 12 Feb 2026
Viewed by 188
Abstract
Tomato leaf diseases exhibit subtle inter-class differences and substantial intra-class variability, making accurate identification challenging for conventional deep learning models, especially under real-world conditions with diverse lighting, occlusion, and growth stages. Moreover, most existing approaches rely solely on visual features and lack the ability to incorporate semantic descriptions or expert knowledge, limiting their robustness and interpretability. To address these issues, we propose LPDiag, a multimodal prototype-attention diagnostic framework that integrates large language models (LLMs) for fine-grained recognition of tomato diseases. The framework first employs an LLM-driven semantic understanding module to encode symptom-aware textual embeddings from disease descriptions. These embeddings are then aligned with multi-scale visual features extracted by an enhanced Res2Net backbone, enabling cross-modal representation learning. A set of learnable prototype vectors, combined with a knowledge-enhanced attention mechanism, further strengthens the interaction between visual patterns and LLM prior knowledge, resulting in more discriminative and interpretable representations. Additionally, we develop an interactive diagnostic system that supports natural-language querying and image-based identification, facilitating practical deployment in heterogeneous agricultural environments. Extensive experiments on three widely used datasets demonstrate that LPDiag achieves a mean accuracy of 98.83%, outperforming state-of-the-art models while offering improved explanatory capability. The proposed framework offers a promising direction for integrating LLM-based semantic reasoning with visual perception to enhance intelligent and trustworthy plant disease diagnostics. Full article
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)
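To illustrate the learnable-prototype idea above, the sketch below scores a fused image-text feature against one prototype per disease class via cosine similarity; the dimensions, class count, and temperature are illustrative, and the knowledge-enhanced attention mechanism is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeHead(nn.Module):
    """One learnable prototype per class; classification logits are scaled
    cosine similarities between the fused feature and each prototype."""
    def __init__(self, feat_dim=512, num_classes=10, tau=0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.tau = tau

    def forward(self, fused_feat):                    # (B, feat_dim)
        f = F.normalize(fused_feat, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        return (f @ p.t()) / self.tau                 # (B, num_classes) logits

head = PrototypeHead()
logits = head(torch.randn(4, 512))                    # hypothetical fused features
loss = F.cross_entropy(logits, torch.tensor([0, 3, 3, 7]))
```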
16 pages, 429 KB  
Article
HCA-IDS: A Semantics-Aware Heterogeneous Cross-Attention Network for Robust Intrusion Detection in CAVs
by Qiyi He, Yifan Zhang, Jieying Liu, Wen Zhou, Tingting Zhang, Minlong Hu, Ao Xu and Qiao Lin
Electronics 2026, 15(4), 784; https://doi.org/10.3390/electronics15040784 - 12 Feb 2026
Viewed by 178
Abstract
Connected and Autonomous Vehicles (CAVs) are exposed to increasingly sophisticated cyber threats hidden within high-dimensional, heterogeneous network traffic. A critical bottleneck in existing Intrusion Detection Systems (IDS) is the feature heterogeneity gap: discrete protocol signatures (e.g., flags, services) and continuous traffic statistics (e.g., flow duration, packet rates) reside in disjoint latent spaces. Traditional deep learning approaches typically rely on naive feature concatenation, which fails to capture the intricate, non-linear semantic dependencies between these modalities, leading to suboptimal performance on long-tail, minority attack classes. This paper proposes HCA-IDS, a novel framework centered on Semantics-Aware Cross-Modal Alignment. Unlike heavy-weight models, HCA-IDS adopts a streamlined Multi-Layer Perceptron (MLP) backbone optimized for edge deployment. We introduce a dedicated Multi-Head Cross-Attention mechanism that explicitly utilizes static “Pattern” features to dynamically query and re-weight relevant dynamic “State” behaviors. This architecture forces the model to learn a unified semantic manifold where protocol anomalies are automatically aligned with their corresponding statistical footprints. Empirical assessments on the NSL-KDD and CICIDS2018 datasets, validated through rigorous 5-Fold Cross-Validation, substantiate the robustness of this approach. The model achieves a Macro-F1 score of over 94% on 7 consolidated attack categories, exhibiting exceptional sensitivity to minority attacks (e.g., Web Attacks and Infiltration). Crucially, HCA-IDS is ultra-lightweight, with a model size of approximately 1.00 MB and an inference latency of 0.0037 ms per sample. These results confirm that explicit semantic alignment combined with a lightweight architecture is key to robust, real-time intrusion detection in resource-constrained CAVs. Full article
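A minimal PyTorch sketch of the Pattern-queries-State cross-attention described above, using nn.MultiheadAttention; the feature splits, dimensions, and seven-class head are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatternStateCrossAttention(nn.Module):
    """Discrete protocol 'Pattern' features query continuous traffic 'State'
    features via multi-head cross-attention before classification."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.pattern_proj = nn.Linear(16, d_model)   # e.g., embedded flags/services
        self.state_proj = nn.Linear(32, d_model)     # e.g., flow duration, packet rates
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 7)            # 7 consolidated attack categories

    def forward(self, pattern, state):               # (B, 16), (B, 32)
        q = self.pattern_proj(pattern).unsqueeze(1)  # (B, 1, d) query from patterns
        kv = self.state_proj(state).unsqueeze(1)     # (B, 1, d) keys/values from states
        fused, _ = self.attn(q, kv, kv)              # pattern queries state behavior
        return self.head(fused.squeeze(1))           # (B, 7) class logits

model = PatternStateCrossAttention()
logits = model(torch.randn(8, 16), torch.randn(8, 32))
```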
20 pages, 1914 KB  
Article
Influence of Multimodal AR-HUD Navigation Prompt Design on Driving Behavior at F-Type-5 M Intersections
by Ziqi Liu, Zhengxing Yang and Yifan Du
J. Eye Mov. Res. 2026, 19(1), 22; https://doi.org/10.3390/jemr19010022 - 11 Feb 2026
Viewed by 142
Abstract
In complex urban traffic environments, the design of multimodal prompts in augmented reality head-up displays (AR-HUDs) plays a critical role in driving safety and operational efficiency. Despite growing interest in audiovisual navigation assistance, empirical evidence remains limited regarding when prompts should be delivered and whether visual and auditory information should remain temporally aligned. To address this gap, this study examines how audiovisual prompt timing and prompt mode influence driving behavior in AR-HUD navigation systems at complex F-type-5 m intersections through a within-subject experimental design. A 2 (prompt mode: synchronized vs. asynchronous) × 3 (prompt timing: −1000 m, −600 m, −400 m) design was employed to assess driver response time, situational awareness, and eye-movement measures, including average fixation duration and fixation count. The results showed clear main effects of both prompt mode and prompt timing. Compared with asynchronous prompts, synchronized prompts consistently resulted in shorter response times, reduced visual demand, and higher situational awareness. Driving performance also improved as prompt timing shifted closer to the intersection, from −1000 m to −400 m. However, no significant interaction effects were found, suggesting that prompt mode and prompt timing can be treated as relatively independent design factors. In addition, among the six experimental conditions, the −400 m synchronized condition yielded the most favorable overall performance, whereas the −1000 m asynchronous condition performed worst. These findings indicate that in time-critical and low-tolerance scenarios, such as F-type-5 m intersections, near-distance synchronized multimodal prompts should be prioritized. This study provides empirical support for optimizing prompt timing and cross-modal temporal alignment in AR-HUD systems and offers actionable implications for interface and timing design. Full article
27 pages, 749 KB  
Article
A Data-Driven Multimodal Method for Early Detection of Coordinated Abnormal Behaviors in Live-Streaming Platforms
by Jingwen Luo, Pinrui Zhu, Yiyan Wang, Zilin Xiao, Jingqi Li, Xuebei Kong and Yan Zhan
Electronics 2026, 15(4), 769; https://doi.org/10.3390/electronics15040769 - 11 Feb 2026
Viewed by 106
Abstract
With the rapid growth of live-streaming e-commerce and digital marketing, abnormal marketing behaviors have become increasingly concealed, coordinated, and intertwined across heterogeneous data modalities, posing substantial challenges to data-driven platform governance and early risk identification. Existing approaches often fail to jointly model cross-modal temporal semantics, the gradual evolution of weak abnormal signals, and organized group-level manipulation. To address these challenges, a data-driven multimodal abnormal behavior detection framework, termed MM-FGDNet, is proposed for large-scale live-streaming environments. The framework models abnormal behaviors from two complementary perspectives, namely temporal evolution and cooperative group structure. A cross-modal temporal alignment module first maps video, text, audio, and user behavioral signals into a unified temporal semantic space, alleviating temporal misalignment and semantic inconsistency across modalities. Building upon this representation, a temporal fraud pattern modeling module captures the progressive transition of abnormal behaviors from early incipient stages to abrupt outbreaks, while a cooperative manipulation detection module explicitly identifies coordinated interactions formed by organized user groups and automated accounts. Extensive experiments on real-world multi-platform live-streaming e-commerce datasets demonstrate that MM-FGDNet consistently outperforms representative baseline methods, achieving an AUC of 0.927 and an F1 score of 0.847, with precision and recall reaching 0.861 and 0.834, respectively, while substantially reducing false alarm rates. Moreover, the proposed framework attains an Early Detection Score of 0.689. This metric serves as a critical benchmark for operational viability, quantifying the system’s capacity to shift platform governance from passive remediation to proactive prevention. It confirms the reliable identification of the “weak-signal” stage—rigorously defined as the incipient phase where subtle, synchronized deviations in interaction rhythms manifest prior to traffic inflation outbreaks—thereby providing the necessary time window for preemptive intervention against coordinated manipulation. Ablation studies further validate the independent contributions of each core module, and cross-domain generalization experiments confirm stable performance across new streamers, new product categories, and new platforms. Overall, MM-FGDNet provides an effective and scalable data-driven artificial intelligence solution for early detection of coordinated abnormal behaviors in live-streaming systems. Full article
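The cross-modal temporal alignment module above maps differently sampled streams onto a unified temporal space; a simplified stand-in is sketched below, which linearly resamples each modality's feature sequence to a common number of steps and projects it to a shared dimension. All dimensions and the interpolation scheme are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAligner(nn.Module):
    """Resamples per-modality feature sequences onto a common time grid and
    projects them into one shared dimension."""
    def __init__(self, dims, d_shared=256, steps=64):
        super().__init__()
        self.steps = steps
        self.proj = nn.ModuleDict({k: nn.Linear(d, d_shared) for k, d in dims.items()})

    def forward(self, streams):                       # {name: (B, T_m, D_m)}
        aligned = []
        for name, x in streams.items():
            x = x.transpose(1, 2)                     # (B, D_m, T_m) for 1-D interpolation
            x = F.interpolate(x, size=self.steps, mode='linear', align_corners=False)
            aligned.append(self.proj[name](x.transpose(1, 2)))   # (B, steps, d_shared)
        return torch.stack(aligned, dim=1)            # (B, modalities, steps, d_shared)

dims = {'video': 768, 'audio': 128, 'text': 512, 'behavior': 32}   # hypothetical feature sizes
aligner = TemporalAligner(dims)
out = aligner({'video': torch.randn(2, 120, 768), 'audio': torch.randn(2, 400, 128),
               'text': torch.randn(2, 30, 512), 'behavior': torch.randn(2, 600, 32)})
print(out.shape)  # torch.Size([2, 4, 64, 256])
```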
19 pages, 1393 KB  
Article
Multimodal Emotion Recognition Model Based on Dynamic Heterogeneous Graph Temporal Network
by Bulaga Da and Feilong Bao
Appl. Sci. 2026, 16(4), 1731; https://doi.org/10.3390/app16041731 - 10 Feb 2026
Viewed by 148
Abstract
To address the semantic gap and complex feature entanglement inherent in multimodal emotion recognition, we propose the Dynamic Heterogeneous Graph Temporal Network (DHGTN), an end-to-end framework designed to model dynamic cross-modal interactions effectively. Utilizing a robust backbone of Wav2vec 2.0, VideoMAE, and BERT, we introduce a shared-private subspace projection mechanism that explicitly disentangles common emotion-related features from modality-specific noise through contrastive learning, ensuring strict semantic alignment. Furthermore, our collaborative Dynamic Heterogeneous Graph and Transformer module overcomes static fusion limitations by constructing time-varying graphs for instantaneous associations and employing global attention to capture long-range temporal dependencies. Extensive experiments on the IEMOCAP and MELD benchmarks demonstrate that DHGTN significantly outperforms state-of-the-art baselines, achieving weighted F1-scores of 73.86% and 66.87%, respectively, which confirms the method’s effectiveness and robustness. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
18 pages, 2041 KB  
Article
Wavelet-CNet: Wavelet Cross Fusion and Detail Enhancement Network for RGB-Thermal Semantic Segmentation
by Wentao Zhang, Qi Zhang and Yue Yan
Sensors 2026, 26(3), 1067; https://doi.org/10.3390/s26031067 - 6 Feb 2026
Viewed by 146
Abstract
Leveraging thermal infrared imagery to complement RGB spatial information is a key technology in industrial sensing. This technology enables mobile devices to perform scene understanding through RGB-T semantic segmentation. However, existing networks conduct only limited information interaction between modalities and lack specific designs to exploit the thermal aggregation entropy of the thermal modality, resulting in inefficient feature complementarity within bilateral structures. To address these challenges, we propose Wavelet-CNet for RGB-T semantic segmentation. Specifically, we design a Wavelet Cross Fusion Module (WCFM) that applies wavelet transforms to separately extract four types of low- and high-frequency information from RGB and thermal features, which are then fed back into attention mechanisms for dual-modal feature reconstruction. Furthermore, a Cross-Scale Detail Enhancement Module (CSDEM) introduces cross-scale contextual information from the TIR branch into each fusion stage, aligning global localization through contour information from thermal features. Wavelet-CNet achieves competitive mIoU scores of 58.3% and 85.77% on MFNet and PST900, respectively, while ablation studies on MFNet further validate the effectiveness of the proposed WCFM and CSDEM modules. Full article
(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems—2nd Edition)
22 pages, 7754 KB  
Article
CSSA: A Cross-Modal Spatial–Semantic Alignment Framework for Remote Sensing Image Captioning
by Xiao Han, Zhaoji Wu, Yunpeng Li, Xiangrong Zhang, Guanchun Wang and Biao Hou
Remote Sens. 2026, 18(3), 522; https://doi.org/10.3390/rs18030522 - 5 Feb 2026
Viewed by 276
Abstract
Remote sensing image captioning (RSIC) aims to generate natural-language descriptions for a given remote sensing image, which requires a comprehensive, in-depth understanding of the image content and the ability to summarize it in sentences. Most RSIC methods extract visual features successfully, but their spatial or fused representations fail to fully account for the cross-modal differences between remote sensing images and text, resulting in unsatisfactory performance. We therefore propose a novel cross-modal spatial–semantic alignment (CSSA) framework for RSIC, which consists of a multi-branch cross-modal contrastive learning (MCCL) mechanism and a dynamic geometry Transformer (DG-former) module. Specifically, compared with discrete text, remote sensing images are inherently noisy, which interferes with the extraction of valid visual features; the MCCL mechanism therefore learns consistent representations between image and text, achieving cross-modal semantic alignment. In addition, most objects in remote sensing images are scattered and appear sparse due to the overhead view, yet a standard Transformer mines object relationships without considering object geometry, capturing the spatial structure only suboptimally. To address this, the DG-former realizes spatial alignment by introducing geometry information. We conduct experiments on three publicly available datasets (Sydney-Captions, UCM-Captions, and RSICD), and the superior results demonstrate the framework’s effectiveness. Full article
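To make the geometry-aware attention idea above concrete, the sketch below converts pairwise box geometry into an additive per-head attention bias, a common way to inject spatial layout into a Transformer; the geometry features and MLP are illustrative and not the DG-former's actual design.

```python
import torch
import torch.nn as nn

def box_geometry(boxes):
    """Pairwise relative geometry (log-scaled offsets and size ratios) between
    object boxes given as (N, 4) [cx, cy, w, h]."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-3)
    dy = torch.log(torch.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-3)
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)       # (N, N, 4)

class GeometryBias(nn.Module):
    """Turns pairwise geometry into a per-head additive bias for attention logits."""
    def __init__(self, n_heads=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_heads))

    def forward(self, boxes):                           # (N, 4) normalized boxes
        return self.mlp(box_geometry(boxes)).permute(2, 0, 1)   # (heads, N, N)

bias = GeometryBias()(torch.rand(12, 4) + 0.1)          # 12 hypothetical detected objects
# attention_logits = (Q @ K.transpose(-2, -1)) / d**0.5 + bias  # bias added per head
```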
25 pages, 7527 KB  
Article
Heterogeneous Multi-Domain Dataset Synthesis to Facilitate Privacy and Risk Assessments in Smart City IoT
by Matthew Boeding, Michael Hempel, Hamid Sharif and Juan Lopez
Electronics 2026, 15(3), 692; https://doi.org/10.3390/electronics15030692 - 5 Feb 2026
Viewed by 276
Abstract
The emergence of the Smart Cities paradigm and the rapid expansion and integration of Internet of Things (IoT) technologies within this context have created unprecedented opportunities for high-resolution behavioral analytics, urban optimization, and context-aware services. However, this same proliferation intensifies privacy risks, particularly those arising from cross-modal data linkage across heterogeneous sensing platforms. To address these challenges, this paper introduces a comprehensive, statistically grounded framework for generating synthetic, multimodal IoT datasets tailored to Smart City research. The framework produces behaviorally plausible synthetic data suitable for preliminary privacy risk assessment and as a benchmark for future re-identification studies, as well as for evaluating algorithms in mobility modeling, urban informatics, and privacy-enhancing technologies. As part of our approach, we formalize probabilistic methods for synthesizing three heterogeneous and operationally relevant data streams—cellular mobility traces, payment terminal transaction logs, and Smart Retail nutrition records—capturing the behaviors of a large number of synthetically generated urban residents over a 12-week period. The framework integrates spatially explicit merchant selection using K-Dimensional (KD)-tree nearest-neighbor algorithms, temporally correlated anchor-based mobility simulation reflective of daily urban rhythms, and dietary-constraint filtering to preserve ecological validity in consumption patterns. In total, the system generates approximately 116 million mobility pings, 5.4 million transactions, and 1.9 million itemized purchases, yielding a reproducible benchmark for evaluating multimodal analytics, privacy-preserving computation, and secure IoT data-sharing protocols. To show the validity of this dataset, the underlying distributions of these residents were successfully validated against reported distributions in published research. We present preliminary uniqueness and cross-modal linkage indicators; comprehensive re-identification benchmarking against specific attack algorithms is planned as future work. This framework can be easily adapted to various scenarios of interest in Smart Cities and other IoT applications. By aligning methodological rigor with the operational needs of Smart City ecosystems, this work fills critical gaps in synthetic data generation for privacy-sensitive domains, including intelligent transportation systems, urban health informatics, and next-generation digital commerce infrastructures. Full article
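The spatially explicit merchant selection above can be illustrated with a SciPy KD-tree query; the coordinates, candidate count, and distance-decay weighting below are hypothetical and only sketch the idea under those assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical merchant coordinates (lon, lat) and one resident's current position.
rng = np.random.default_rng(0)
merchants = rng.uniform([-96.75, 41.20], [-96.55, 41.35], size=(5000, 2))
resident_pos = np.array([-96.65, 41.27])

tree = cKDTree(merchants)                      # build once, query for every transaction

# Retrieve the k nearest candidate merchants, then pick one with a
# distance-decayed probability as a simplified stand-in for the paper's
# spatially explicit merchant selection.
dist, idx = tree.query(resident_pos, k=10)
weights = np.exp(-dist / dist.mean())
choice = rng.choice(idx, p=weights / weights.sum())
print("selected merchant:", choice, "at", merchants[choice])
```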