Search Results (5)

Search Parameters:
Keywords = cross-modal linguistic associations

30 pages, 37977 KB  
Article
Text-Guided Visual Representation Optimization for Sensor-Acquired Video Temporal Grounding
by Yun Tian, Xiaobo Guo, Jinsong Wang and Xinyue Liang
Sensors 2025, 25(15), 4704; https://doi.org/10.3390/s25154704 - 30 Jul 2025
Viewed by 470
Abstract
Video temporal grounding (VTG) aims to localize a semantically relevant temporal segment within an untrimmed video based on a natural language query. The task continues to face challenges arising from cross-modal semantic misalignment, which is largely attributed to redundant visual content in sensor-acquired video streams, linguistic ambiguity, and discrepancies in modality-specific representations. Most existing approaches rely on intra-modal feature modeling, processing video and text independently throughout the representation learning stage. However, this isolation undermines semantic alignment by neglecting the potential of cross-modal interactions. In practice, a natural language query typically corresponds to spatiotemporal content in video signals collected through camera-based sensing systems, encompassing a particular sequence of frames and its associated salient subregions. We propose a text-guided visual representation optimization framework tailored to enhance semantic interpretation over video signals captured by visual sensors. This framework leverages textual information to focus on spatiotemporal video content, thereby narrowing the cross-modal gap. Built upon the unified cross-modal embedding space provided by CLIP, our model leverages video data from sensing devices to structure representations and introduces two dedicated modules to semantically refine visual representations across spatial and temporal dimensions. First, we design a Spatial Visual Representation Optimization (SVRO) module to learn intra-frame spatial information; it selects text-relevant salient patches, capturing finer-grained visual details. Second, we introduce a Temporal Visual Representation Optimization (TVRO) module to learn inter-frame temporal relations; TVRO employs a temporal triplet loss to strengthen attention on text-relevant frames and capture clip-level semantics. Additionally, a self-supervised contrastive loss is introduced at the clip–text level to improve inter-clip discrimination by maximizing semantic variance during training. Experiments on the widely used Charades-STA, ActivityNet Captions, and TACoS benchmark datasets demonstrate that our method outperforms state-of-the-art methods across multiple metrics.
(This article belongs to the Section Sensing and Imaging)
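
The two training objectives this abstract names, a temporal triplet loss over frames and a clip–text contrastive loss, can be illustrated with a minimal PyTorch sketch. The function names, margin, and temperature below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a temporal triplet loss that pulls the
# text embedding toward text-relevant frames and away from irrelevant ones, plus
# a symmetric InfoNCE-style clip-text contrastive loss. Margin, temperature, and
# all names are illustrative assumptions.
import torch
import torch.nn.functional as F


def temporal_triplet_loss(text_emb, pos_frame_emb, neg_frame_emb, margin=0.2):
    """Anchor = text embedding, positive/negative = frame embeddings, shape [B, D]."""
    pos_sim = F.cosine_similarity(text_emb, pos_frame_emb, dim=-1)
    neg_sim = F.cosine_similarity(text_emb, neg_frame_emb, dim=-1)
    return F.relu(neg_sim - pos_sim + margin).mean()


def clip_text_contrastive_loss(clip_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between clip-level video and text embeddings;
    matched pairs share the same batch index."""
    clip_emb = F.normalize(clip_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = clip_emb @ text_emb.t() / temperature        # [B, B] similarity matrix
    targets = torch.arange(clip_emb.size(0), device=clip_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, D = 8, 512  # D chosen to match a CLIP-like embedding size
    text, pos, neg, clip = (torch.randn(B, D) for _ in range(4))
    print(temporal_triplet_loss(text, pos, neg).item())
    print(clip_text_contrastive_loss(clip, text).item())
```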

26 pages, 13429 KB  
Article
Multimodal Prompt-Guided Bidirectional Fusion for Referring Remote Sensing Image Segmentation
by Yingjie Li, Weiqi Jin, Su Qiu and Qiyang Sun
Remote Sens. 2025, 17(10), 1683; https://doi.org/10.3390/rs17101683 - 10 May 2025
Cited by 1 | Viewed by 958
Abstract
Multimodal feature alignment is a key challenge in referring remote sensing image segmentation (RRSIS). The complex spatial relationships and multi-scale targets in remote sensing images call for efficient cross-modal mapping and fine-grained feature alignment. Existing approaches typically rely on cross-attention for multimodal fusion, which increases model complexity. To address this, we introduce the concept of prompt learning in RRSIS and propose a parameter-efficient multimodal prompt-guided bidirectional fusion (MPBF) architecture. MPBF combines both early and late fusion strategies. In the early fusion stage, it conducts the deep fusion of linguistic and visual features through cross-modal prompt coupling. In the late fusion stage, to handle the multi-scale nature of remote sensing targets, a scale refinement module is proposed to capture diverse scale representations, and a vision–language alignment module is employed for pixel-level multimodal semantic associations. Comparative experiments and ablation studies on a public dataset demonstrate that MPBF significantly outperforms existing state-of-the-art methods with relatively small computational overhead, highlighting its effectiveness and efficiency for RRSIS. Further application experiments on a custom dataset confirm the method’s practicality and robustness in real-world scenarios.
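
The "cross-modal prompt coupling" used in MPBF's early-fusion stage can be pictured with a small sketch: a shared pool of learnable prompt tokens is projected into each modality's embedding space and prepended to the visual and linguistic token sequences before encoding. This is a generic, assumption-laden illustration of prompt learning, not the MPBF code; the class name and all dimensions are hypothetical.

```python
# Minimal sketch (assumptions throughout, not the MPBF implementation): a shared
# prompt pool is coupled across modalities via per-modality linear projections
# and prepended to both token sequences, in a parameter-efficient way.
import torch
import torch.nn as nn


class CrossModalPromptCoupling(nn.Module):
    def __init__(self, num_prompts=8, vis_dim=768, txt_dim=512, shared_dim=256):
        super().__init__()
        # Shared learnable prompts, projected into each modality's space.
        self.shared_prompts = nn.Parameter(torch.randn(num_prompts, shared_dim) * 0.02)
        self.to_visual = nn.Linear(shared_dim, vis_dim)
        self.to_textual = nn.Linear(shared_dim, txt_dim)

    def forward(self, vis_tokens, txt_tokens):
        """vis_tokens: [B, Nv, vis_dim]; txt_tokens: [B, Nt, txt_dim].
        Returns both sequences with the coupled prompts prepended."""
        B = vis_tokens.size(0)
        vis_p = self.to_visual(self.shared_prompts).unsqueeze(0).expand(B, -1, -1)
        txt_p = self.to_textual(self.shared_prompts).unsqueeze(0).expand(B, -1, -1)
        return (torch.cat([vis_p, vis_tokens], dim=1),
                torch.cat([txt_p, txt_tokens], dim=1))


if __name__ == "__main__":
    coupler = CrossModalPromptCoupling()
    vis = torch.randn(2, 196, 768)   # e.g., ViT patch tokens
    txt = torch.randn(2, 20, 512)    # e.g., word tokens
    v, t = coupler(vis, txt)
    print(v.shape, t.shape)          # [2, 204, 768], [2, 28, 512]
```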

22 pages, 1959 KB  
Article
DMFormer: Dense Memory Linformer for Image Captioning
by Yuting He and Zetao Jiang
Electronics 2025, 14(9), 1716; https://doi.org/10.3390/electronics14091716 - 23 Apr 2025
Cited by 1 | Viewed by 535
Abstract
Image captioning is a task at the intersection of computer vision and natural language processing, aiming to describe image content in natural language. Existing methods still have deficiencies in modeling the spatial location and semantic correlation between image regions. Furthermore, these methods often exhibit insufficient interaction between image features and text features. To address these issues, we propose a Linformer-based image captioning method, the Dense Memory Linformer for Image Captioning (DMFormer), which has lower time and space complexity than the traditional Transformer architecture. The DMFormer contains two core modules: the Relation Memory Augmented Encoder (RMAE) and the Dense Memory Augmented Decoder (DMAD). In the RMAE, we propose Relation Memory Augmented Attention (RMAA), which combines explicit spatial perception and implicit spatial perception. It explicitly uses geometric information to model the geometric correlation between image regions and implicitly constructs memory unit matrices to learn the contextual information of image region features. In the DMAD, we introduce Dense Memory Augmented Cross Attention (DMACA). This module fully utilizes the low-level and high-level features generated by the RMAE through dense connections, and constructs memory units to store prior knowledge of images and text. It learns the cross-modal associations between visual and linguistic features through an adaptive gating mechanism. Experimental results on the MS-COCO dataset show that the descriptions generated by the DMFormer are richer and more accurate, with significant improvements in various evaluation metrics compared to mainstream methods.
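
A rough sketch of the general mechanism the DMFormer combines: Linformer-style attention that projects keys and values down to a fixed number of landmark positions for linear complexity, augmented with learnable memory key/value slots. The class name, projection sizes, and memory-slot count are assumptions for illustration and do not reproduce the actual RMAA or DMACA modules.

```python
# Minimal sketch (an assumption, not the DMFormer code): Linformer-style attention
# with low-rank key/value projections plus learnable memory slots appended to the
# keys and values, loosely mirroring memory-augmented attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryLinformerAttention(nn.Module):
    def __init__(self, dim=512, seq_len=50, k=16, num_memory=8):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Linformer projections E and F fix the input length to seq_len.
        self.proj_k = nn.Linear(seq_len, k, bias=False)
        self.proj_v = nn.Linear(seq_len, k, bias=False)
        self.mem_k = nn.Parameter(torch.randn(num_memory, dim) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(num_memory, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, x):
        """x: [B, N, dim] region features (N must equal seq_len); returns [B, N, dim]."""
        B, N, D = x.shape
        q = self.q(x)                                        # [B, N, D]
        k, v = self.kv(x).chunk(2, dim=-1)                   # [B, N, D] each
        k = self.proj_k(k.transpose(1, 2)).transpose(1, 2)   # [B, k, D]
        v = self.proj_v(v.transpose(1, 2)).transpose(1, 2)   # [B, k, D]
        # Append learnable memory slots to keys and values.
        k = torch.cat([k, self.mem_k.expand(B, -1, -1)], dim=1)
        v = torch.cat([v, self.mem_v.expand(B, -1, -1)], dim=1)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v


if __name__ == "__main__":
    attn = MemoryLinformerAttention()
    regions = torch.randn(2, 50, 512)  # e.g., 50 detected image regions
    print(attn(regions).shape)         # torch.Size([2, 50, 512])
```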

38 pages, 10305 KB  
Article
Listening Beyond the Source: Exploring the Descriptive Language of Musical Sounds
by Isabel Pires
Behav. Sci. 2025, 15(3), 396; https://doi.org/10.3390/bs15030396 - 20 Mar 2025
Viewed by 1445
Abstract
The spontaneous use of verbal expressions to articulate and describe abstract auditory phenomena in everyday interactions is an inherent aspect of human nature. This occurs without the structured conditions typically required in controlled laboratory environments, relying instead on intuitive and spontaneous modes of expression. This study explores the relationship between auditory perception and descriptive language for abstract sounds. These sounds, synthesized without identifiable sources or musical structures, allow listeners to engage with sound perception free from external references. The investigation of correlations between subjective descriptors (e.g., “rough”, “bright”) and physical sound attributes (e.g., spectral and dynamic properties) reveals significant cross-modal linguistic associations in auditory perception. An international survey with a diverse group of participants revealed that listeners often draw on other sensory domains to describe sounds, suggesting a robust cross-modal basis for auditory descriptors. Moreover, the findings indicate a correlation between subjective descriptors and objective sound wave properties, demonstrating the effectiveness of abstract sounds in guiding listeners’ attention to intrinsic qualities. These results could support the development of new paradigms in sound analysis and manipulation, with applications in artistic, educational, and analytical contexts. This multidisciplinary approach may provide the foundation for a perceptual framework for sound analysis, to be tested and refined through theoretical modelling and experimental validation.
(This article belongs to the Special Issue Music Listening as Exploratory Behavior)
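
The descriptor-versus-acoustics correlation analysis this abstract describes can be sketched in a few lines: compute a physical attribute such as the spectral centroid (a common proxy for perceived brightness) for each stimulus, then correlate it with mean listener ratings of a descriptor. The stimuli and ratings below are synthetic placeholders, not the study's data.

```python
# Minimal sketch of this kind of analysis (hypothetical data, not the study's):
# correlate a physical attribute (spectral centroid) with mean ratings of a
# subjective descriptor ("brightness") across a small stimulus set.
import numpy as np
from scipy.stats import spearmanr


def spectral_centroid(signal, sr):
    """Amplitude-weighted mean frequency of the magnitude spectrum, in Hz."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))


if __name__ == "__main__":
    sr = 44100
    t = np.linspace(0, 1.0, sr, endpoint=False)
    # Hypothetical stimulus set: sine tones of increasing frequency.
    stimuli = [np.sin(2 * np.pi * f * t) for f in (200, 400, 800, 1600, 3200)]
    centroids = [spectral_centroid(s, sr) for s in stimuli]
    # Hypothetical mean "brightness" ratings (1-7 scale) for the same stimuli.
    brightness_ratings = [2.2, 3.1, 4.5, 5.9, 5.6]
    rho, p = spearmanr(centroids, brightness_ratings)
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```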

16 pages, 4298 KB  
Article
Acquisition of the Epistemic Discourse Marker Wo Juede by Native Taiwan Mandarin Speakers
by Chun-Yin Doris Chen, Chung-Yu Wu and Hongyin Tao
Languages 2022, 7(4), 292; https://doi.org/10.3390/languages7040292 - 15 Nov 2022
Viewed by 2153
Abstract
This study examines the use of a fixed expression, wo juede (WJ) ‘I feel, I think’, in Taiwan Mandarin in the context of two types of oral production tasks: argumentative and negotiative discourses. The participants consisted of two groups used for comparison: one group of children from Grades 2, 4, and 6, and one group of adults (college students). The results show that both groups were more inclined to utilize WJ in argumentative genres than in negotiative genres. Of the seven pragmatic functions associated with WJ, the participants all had a strong preference to use WJ for the commenting/reasoning function. Developmental patterns gleaned from the data indicate that children’s language expands as their age increases. The implications of the findings for cross-linguistic comparison in the realm of epistemic modality are explored in this paper. This study contributes to the study of Chinese morphology by drawing more attention to the acquisition and development patterns of fixed expressions in larger chunks.
(This article belongs to the Special Issue Current Research on Chinese Morphology)