Search Results (390)

Search Parameters:
Keywords = voice recognition

18 pages, 2065 KB  
Article
Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Symmetry 2025, 17(9), 1478; https://doi.org/10.3390/sym17091478 - 8 Sep 2025
Viewed by 190
Abstract
Cantonese automatic speech recognition (ASR) faces persistent challenges due to its nine lexical tones, extensive phonological variation, and the scarcity of professionally transcribed corpora. To address these issues, we propose a lightweight and data-efficient framework that leverages weak phonetic supervision (WPS) in conjunction with two phoneme-aware augmentation strategies. (1) Dynamic Boundary-Aligned Phoneme Dropout progressively removes entire IPA segments according to a curriculum schedule, simulating real-world phenomena such as elision, lenition, and tonal drift while ensuring training stability. (2) Phoneme-Aware SpecAugment confines all time- and frequency-masking operations within phoneme boundaries and prioritizes high-attention regions, thereby preserving intra-phonemic contours and formant integrity. Built on the Whistle encoder—which integrates a Conformer backbone, Connectionist Temporal Classification–Conditional Random Field (CTC-CRF) alignment, and a multilingual phonetic space—the approach requires only a grapheme-to-phoneme lexicon and Montreal Forced Aligner outputs, without any additional manual labeling. Experiments on the Cantonese subset of Common Voice demonstrate consistent gains: Dynamic Dropout alone reduces the phoneme error rate (PER) from 17.8% to 16.7% with 50 h of speech and from 16.4% to 15.1% with 100 h, while the combination of the two augmentations further lowers PER to 15.9%/14.4%. These results confirm that structure-aware phoneme-level perturbations provide an effective and low-cost solution for building robust Cantonese ASR systems under low-resource conditions.
(This article belongs to the Section Computer)
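The boundary-confined masking idea is concrete enough to sketch: the time mask is drawn so that it never crosses a phoneme boundary. The function below assumes MFA-style (start_frame, end_frame) segments; the mask width, fill value, and names are illustrative, not the authors' implementation.

```python
# Minimal sketch of phoneme-boundary-constrained time masking, one
# ingredient of Phoneme-Aware SpecAugment as described in the abstract.
import numpy as np

def phoneme_aware_time_mask(spec, boundaries, max_mask=10, rng=None):
    """Mask a random time span entirely inside one phoneme segment.

    spec       : (n_mels, T) log-Mel spectrogram
    boundaries : list of (start_frame, end_frame) phoneme segments,
                 e.g. converted from Montreal Forced Aligner output
    """
    rng = rng or np.random.default_rng()
    start, end = boundaries[rng.integers(len(boundaries))]
    width = int(rng.integers(1, max(2, min(max_mask, end - start))))
    t0 = int(rng.integers(start, max(start + 1, end - width)))
    out = spec.copy()
    out[:, t0:t0 + width] = out.mean()   # mask stays within the phoneme
    return out
```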

20 pages, 2732 KB  
Article
Redesigning Multimodal Interaction: Adaptive Signal Processing and Cross-Modal Interaction for Hands-Free Computer Interaction
by Bui Hong Quan, Nguyen Dinh Tuan Anh, Hoang Van Phi and Bui Trung Thanh
Sensors 2025, 25(17), 5411; https://doi.org/10.3390/s25175411 - 2 Sep 2025
Viewed by 403
Abstract
Hands-free computer interaction is a key topic in assistive technology, with camera-based and voice-based systems being the most common methods. Recent camera-based solutions leverage facial expressions or head movements to simulate mouse clicks or key presses, while voice-based systems enable control via speech commands, wake-word detection, and vocal gestures. However, existing systems often suffer from limitations in responsiveness and accuracy, especially under real-world conditions. In this paper, we present 3-Modal Human-Computer Interaction (3M-HCI), a novel interaction system that dynamically integrates facial, vocal, and eye-based inputs through a new signal processing pipeline and a cross-modal coordination mechanism. This approach not only enhances recognition accuracy but also reduces interaction latency. Experimental results demonstrate that 3M-HCI outperforms several recent hands-free interaction solutions in both speed and precision, highlighting its potential as a robust assistive interface.
(This article belongs to the Section Sensing and Imaging)
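The abstract does not spell out the cross-modal coordination mechanism, so the sketch below shows only a generic confidence-weighted late fusion of per-modality command predictions, a common baseline rather than the authors' design; the weights and command names are illustrative.

```python
# Generic late fusion across face / voice / eye modalities: each modality
# votes for a command with a confidence, and votes are weight-combined.
from collections import defaultdict

def fuse_commands(predictions, weights=None):
    """predictions: {modality: (command, confidence)} -> fused command."""
    weights = weights or {"face": 0.4, "voice": 0.4, "eye": 0.2}  # assumed
    scores = defaultdict(float)
    for modality, (command, conf) in predictions.items():
        scores[command] += weights.get(modality, 0.0) * conf
    return max(scores, key=scores.get)

print(fuse_commands({"face": ("click", 0.9),
                     "voice": ("scroll", 0.6),
                     "eye": ("click", 0.7)}))   # -> "click"
```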

12 pages, 304 KB  
Article
LoRA-INT8 Whisper: A Low-Cost Cantonese Speech Recognition Framework for Edge Devices
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Sensors 2025, 25(17), 5404; https://doi.org/10.3390/s25175404 - 1 Sep 2025
Viewed by 545
Abstract
To address the triple bottlenecks of data scarcity, oversized models, and slow inference that hinder Cantonese automatic speech recognition (ASR) in low-resource and edge-deployment settings, this study proposes a cost-effective Cantonese ASR system based on LoRA fine-tuning and INT8 quantization. First, Whisper-tiny is parameter-efficiently fine-tuned on the Common Voice zh-HK training set using LoRA with rank = 8. Only 1.6% of the original weights are updated, reducing the character error rate (CER) from 49.5% to 11.1%, a performance close to full fine-tuning (10.3%), while cutting the training memory footprint and computational cost by approximately one order of magnitude. Next, the fine-tuned model is compressed into a 60 MB INT8 checkpoint via dynamic quantization in ONNX Runtime. On a MacBook Pro M1 Max CPU, the quantized model achieves an RTF of 0.20 (offline inference at 5× real time) and 43% lower latency than the FP16 baseline; on an NVIDIA A10 GPU, it reaches an RTF of 0.06, meeting the requirements of high-concurrency cloud services. Ablation studies confirm that the LoRA-INT8 configuration offers the best trade-off among accuracy, speed, and model size. Limitations include the absence of spontaneous-speech noise data, extreme-hardware validation, and adaptive LoRA structure optimization. Future work will incorporate large-scale self-supervised pre-training, tone-aware loss functions, AdaLoRA architecture search, and INT4/NPU quantization, and will establish an mJ/char energy–accuracy curve. The ultimate goal is to achieve CER ≤ 8%, RTF < 0.1, and mJ/char < 1 for low-power real-time Cantonese ASR in practical IoT scenarios.
(This article belongs to the Section Electronic Sensors)
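Both stages named above, rank-8 LoRA fine-tuning and INT8 dynamic quantization in ONNX Runtime, map onto standard tooling; a minimal sketch follows. The target modules, hyperparameters, and file names are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of the two stages: (1) wrap Whisper-tiny with rank-8 LoRA via
# Hugging Face peft, (2) quantize an exported ONNX model to INT8.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model
from onnxruntime.quantization import quantize_dynamic, QuantType

# 1) Parameter-efficient fine-tuning: only the LoRA adapters train
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])   # assumed choice
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # a small fraction of all weights

# 2) After ONNX export (not shown), compress weights to INT8
quantize_dynamic("whisper_tiny_lora.onnx",        # exported model (assumed name)
                 "whisper_tiny_lora_int8.onnx",   # quantized output
                 weight_type=QuantType.QInt8)
```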

23 pages, 1233 KB  
Article
Decoding the Digits: How Number Notation Influences Cognitive Effort and Performance in Chinese-to-English Sight Translation
by Xueyan Zong, Lei Song and Shanshan Yang
Behav. Sci. 2025, 15(9), 1195; https://doi.org/10.3390/bs15091195 - 1 Sep 2025
Viewed by 346
Abstract
Numbers present persistent challenges in interpreting, yet cognitive mechanisms underlying notation-specific processing remain underexplored. While eye-tracking studies in visually-assisted simultaneous interpreting have advanced number research, they predominantly examine Arabic numerals in non-Chinese contexts—neglecting notation diversity increasingly prevalent in computer-assisted interpreting systems where Automatic Speech Recognition outputs vary across languages. Addressing these gaps, this study investigated how number notation (Arabic digits vs. Chinese character numbers) affects trainee interpreters’ cognitive effort and performance in Chinese-to-English sight translation. Employing a mixed-methods design, we measured global (task-level) and local (number-specific) eye movements alongside expert assessments, output analysis, and subjective assessments. Results show that Chinese character numbers demand significantly greater cognitive effort than Arabic digits, evidenced by more and longer fixations, more extensive saccadic movements, and a larger eye-voice span. Concurrently, sight translation quality decreased markedly with Chinese character numbers, with more processing attempts yet lower accuracy and fluency. Subjective workload ratings confirmed higher mental, physical, and temporal demands in Task 2. These findings reveal an effort-quality paradox where greater cognitive investment in processing complex notations leads to poorer outcomes, and highlight the urgent need for notation-specific training strategies and adaptive technologies in multilingual communication.
(This article belongs to the Section Cognition)
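One of the local measures above, the eye-voice span, is simple enough to show as a worked example: EVS for a number token is the lag between first fixating the number and beginning to utter its translation. The column names and log layout below are assumptions about how such eye-tracking and audio records might be organized, not the study's actual data format.

```python
# Illustrative eye-voice span (EVS) computation per number token.
import pandas as pd

trials = pd.DataFrame({
    "token": ["1,234", "五千六百"],
    "first_fixation_ms": [1200, 3400],   # from the eye tracker
    "voice_onset_ms": [1850, 4700],      # from the audio alignment
})
trials["evs_ms"] = trials["voice_onset_ms"] - trials["first_fixation_ms"]
print(trials[["token", "evs_ms"]])       # larger EVS = later voice onset
```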

28 pages, 2673 KB  
Article
AI Anomaly-Based Deepfake Detection Using Customized Mahalanobis Distance and Head Pose with Facial Landmarks
by Cosmina-Mihaela Rosca and Adrian Stancu
Appl. Sci. 2025, 15(17), 9574; https://doi.org/10.3390/app15179574 - 30 Aug 2025
Viewed by 663
Abstract
The development of artificial intelligence has inevitably led to the growth of deepfake images, videos, human voices, and more. Deepfake detection is essential, especially where such content is used for unethical and illegal purposes. This study presents a novel approach to image deepfake detection by introducing the Custom-Made Facial Recognition Algorithm (CMFRA), which employs four distinct features to differentiate between authentic and deepfake images. The proposed method combines facial landmark detection with advanced statistical analysis, integrating the mean Mahalanobis distance and three head pose coordinates (yaw, pitch, and roll). The landmarks are extracted using the Google Vision API. This multi-feature approach assesses facial structure and orientation, capturing subtle inconsistencies indicative of deepfake manipulation. A key innovation of this work is the introduction of the mean Mahalanobis distance as a core feature for quantifying spatial relationships between facial landmarks. The research also takes an anomaly-analysis approach, using authentic facial data alone to establish a baseline for natural facial characteristics; the anomaly detection model then recognizes a modified face by analyzing deviations from this established pattern, without extensive training on deepfake samples. CMFRA demonstrated a detection accuracy of 90% and distinguishes between authentic and deepfake images under varied conditions.
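The mean Mahalanobis distance feature can be made concrete in a few lines: fit a covariance on landmark vectors from authentic faces only, then score new faces by their distance from that baseline. The feature dimensionality and decision threshold below are illustrative, not the paper's values.

```python
# Sketch of anomaly scoring with a Mahalanobis distance over facial
# landmark vectors, fitted on authentic faces only (CMFRA-style idea).
import numpy as np
from scipy.spatial.distance import mahalanobis

authentic = np.random.rand(500, 10)   # stand-in: flattened landmark vectors
mu = authentic.mean(axis=0)
VI = np.linalg.inv(np.cov(authentic, rowvar=False))   # inverse covariance

def anomaly_score(landmarks):
    """Distance of one face's landmark vector from the authentic baseline;
    larger = more anomalous."""
    return mahalanobis(landmarks, mu, VI)

is_deepfake = anomaly_score(np.random.rand(10)) > 4.0   # assumed threshold
```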

26 pages, 555 KB  
Concept Paper
Do We Need a Voice Methodology? Proposing a Voice-Centered Methodology: A Conceptual Framework in the Age of Surveillance Capitalism
by Laura Caroleo
Societies 2025, 15(9), 241; https://doi.org/10.3390/soc15090241 - 30 Aug 2025
Viewed by 377
Abstract
This paper explores the rise of voice-based social media as a pivotal transformation in digital communication, situated within the broader era of chatbots and voice AI. Platforms such as Clubhouse, X Spaces, Discord, and similar ones foreground vocal interaction, reshaping norms of participation, identity construction, and platform governance. This shift from text-centered communication to hybrid digital orality presents new sociological and methodological challenges, calling for the development of voice-centered analytical approaches. In response, the paper introduces a multidimensional methodological framework for analyzing voice-based social media platforms in the context of surveillance capitalism and AI-driven conversational technologies. We propose a high-level reference architecture for a machine-learning pipeline for social science that integrates digital-methods techniques, automatic speech recognition (ASR) models, and natural language processing (NLP) models within a reflexive and ethically grounded framework. To illustrate its potential, we outline the possible stages of a proof-of-concept (PoC) audio-analysis machine learning pipeline, demonstrated through a conceptual use case involving the collection, ingestion, and analysis of X Spaces. While not a comprehensive empirical study, this pipeline proposal highlights the technical and ethical challenges of voice analysis. By situating the voice as a central axis of online sociality and examining it in relation to AI-driven conversational technologies in an era of post-orality, the study contributes to ongoing debates on surveillance capitalism, platform affordances, and the evolving dynamics of digital interaction. In this rapidly evolving landscape, we urgently need a robust vocal methodology to ensure that voice is not just processed but understood.
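One stage of such a PoC pipeline, turning collected audio into analyzable text, can be sketched with open tooling: an ASR pass followed by an NLP model over the transcript. The model choices and file name below are illustrative assumptions, not the paper's prescribed stack.

```python
# Sketch of an ingestion -> transcription -> analysis step for recorded
# audio, using openai-whisper for ASR and a transformers pipeline for NLP.
import whisper                       # openai-whisper package
from transformers import pipeline

asr = whisper.load_model("base")
text = asr.transcribe("space_recording.wav")["text"]   # assumed file name

nlp = pipeline("sentiment-analysis")                   # downstream analysis
print(nlp(text[:512]))               # truncated for the default model
```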

7 pages, 312 KB  
Proceeding Paper
AI as Modern Technology for Home Security Systems: A Systematic Literature Review
by Rizki Muhammad, Muhammad Syailendra Aditya Sagara, Yaunarius Molang Teluma and Fikri Arif Wicaksana
Eng. Proc. 2025, 107(1), 35; https://doi.org/10.3390/engproc2025107035 - 28 Aug 2025
Viewed by 1124
Abstract
The growing demand for innovative home security solutions has accelerated the integration of advanced technologies to enhance safety, convenience, and operational efficiency. Artificial intelligence (AI) has become a pivotal element in revolutionizing home security systems by enabling real-time threat detection, automated surveillance, and intelligent decision-making. This study employs a systematic literature review (SLR) to explore recent advancements in AI-driven technologies, such as machine learning, computer vision, natural language processing, and the Internet of Things (IoT). These innovations enhance security by providing features like facial recognition, anomaly detection, voice-activated controls, and predictive analysis, delivering more accurate and responsive security solutions. Furthermore, this study addresses challenges related to data privacy, cybersecurity threats, and cost considerations while emphasizing AI’s potential to deliver scalable, efficient, and user-friendly systems. The findings demonstrate AI’s vital role in the evolution of home security technologies, paving the way for smarter and safer living environments.

19 pages, 319 KB  
Article
The Unbearable Lightness of Being an Early Childhood Educator in Day-Care Settings
by Bárbara Tadeu and Amélia Lopes
Educ. Sci. 2025, 15(9), 1107; https://doi.org/10.3390/educsci15091107 - 26 Aug 2025
Viewed by 322
Abstract
This article explores how working conditions and professional well-being intersect in day-care settings, shaping early childhood educators’ professional identities, especially at the start of their careers. Based on a qualitative and interpretative study involving a focus group with seven educators and thirty interviews across Portugal, the findings reveal a profession marked by overload, time pressure, institutional silence, and the invisibility of emotional labour. Yet, educators also demonstrate resistance, mutual support networks, and pedagogical reinvention. Well-being is conceptualised as an ecological and political issue, influenced by institutional structures, the absence of public policies, and cultural narratives that continue to devalue the profession. Special focus is given to novice educators, whose entry into the field is characterised by vulnerability, lack of guidance, and identity tensions, pointing to the urgent need for better initial training and institutional support. This article presents a critical analysis of professionalism in early childhood education and care, with implications for teacher education, including mentoring, supervision, and public policy development. It frames the work of early childhood educators in day-care as both an ethical commitment and a form of resistance. Ultimately, it amplifies educators’ voices as knowledge producers and agents of change, contributing to the pedagogy of dignity and the recognition of a profession often rendered invisible.
(This article belongs to the Special Issue Education for Early Career Teachers)
20 pages, 2568 KB  
Article
Towards Spatial Awareness: Real-Time Sensory Augmentation with Smart Glasses for Visually Impaired Individuals
by Nadia Aloui
Electronics 2025, 14(17), 3365; https://doi.org/10.3390/electronics14173365 - 25 Aug 2025
Viewed by 546
Abstract
This research presents an innovative Internet of Things (IoT) and artificial intelligence (AI) platform designed to provide holistic assistance and foster autonomy for visually impaired individuals within the university environment. Its main novelty is real-time sensory augmentation and spatial awareness, integrating ultrasonic, LiDAR, and RFID sensors for robust 360° obstacle detection, environmental perception, and precise indoor localization. A novel, optimized Dijkstra algorithm calculates optimal routes; speech and intent recognition enable intuitive voice control. The wearable smart glasses are complemented by a platform providing essential educational functionalities, including lesson reminders, timetables, and emergency assistance. Based on gamified principles of exploration and challenge, the platform includes immersive technology settings, intelligent image recognition, auditory conversion, haptic feedback, and rapid contextual awareness, delivering a sophisticated, effective navigational experience. A thorough technical evaluation reveals notable improvements in navigation performance, object detection accuracy, and the technical capabilities supporting social interaction, enabling a more autonomous and fulfilling university experience.
(This article belongs to the Section Computer Science & Engineering)
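The navigation layer rests on Dijkstra shortest-path search; a textbook priority-queue version over a weighted indoor graph is sketched below. The paper's optimized variant is not detailed in the abstract, so this shows only the baseline algorithm, with node names invented for illustration.

```python
# Baseline Dijkstra over an adjacency-list graph with a binary heap.
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbor, weight), ...]}; returns shortest distances."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                          # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# e.g. rooms and corridors as nodes, distances in meters as edge weights
campus = {"entrance": [("hall", 12.0)], "hall": [("room_101", 8.5)]}
print(dijkstra(campus, "entrance"))
```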

2 pages, 681 KB  
Correction
Correction: Luzzi et al. Defective Awareness of Person-Recognition Disorders Through Face, Voice and Name in Right and Left Variants of Semantic Dementia: A Pilot Study. Brain Sci. 2025, 15, 504
by Simona Luzzi, Oscar Prata and Guido Gainotti
Brain Sci. 2025, 15(9), 912; https://doi.org/10.3390/brainsci15090912 - 25 Aug 2025
Viewed by 292
Abstract
In the original publication [...]
(This article belongs to the Special Issue Anosognosia and the Determinants of Self-Awareness)
25 pages, 10870 KB  
Article
XTTS-Based Data Augmentation for Profanity Keyword Recognition in Low-Resource Speech Scenarios
by Shin-Chi Lai, Yi-Chang Zhu, Szu-Ting Wang, Yen-Ching Chang, Ying-Hsiu Hung, Jhen-Kai Tang and Wen-Kai Tsai
Appl. Syst. Innov. 2025, 8(4), 108; https://doi.org/10.3390/asi8040108 - 31 Jul 2025
Viewed by 561
Abstract
As voice cloning technology rapidly advances, the risk of personal voices being misused by malicious actors for fraud or other illegal activities has significantly increased, making the collection of speech data increasingly challenging. To address this issue, this study proposes a data augmentation method based on XText-to-Speech (XTTS) synthesis to tackle the challenges of small-sample, multi-class speech recognition, using profanity as a case study to achieve high-accuracy keyword recognition. Two models were therefore evaluated: a CNN model (Proposed-I) and a CNN-Transformer hybrid model (Proposed-II). Proposed-I leverages local feature extraction, improving accuracy on a real human speech (RHS) test set from 55.35% without augmentation to 80.36% with XTTS-enhanced data. Proposed-II integrates CNN’s local feature extraction with Transformer’s long-range dependency modeling, further boosting test set accuracy to 88.90% while reducing the parameter count by approximately 41%, significantly enhancing computational efficiency. Compared to a previously proposed incremental architecture, the Proposed-II model achieves an 8.49% higher accuracy while reducing parameters by about 98.81% and MACs by about 98.97%, demonstrating exceptional resource efficiency. By utilizing XTTS and public corpora to generate a novel keyword speech dataset, this study enhances sample diversity and reduces reliance on large-scale original speech data. Experimental analysis reveals that an optimal synthetic-to-real speech ratio of 1:5 significantly improves the overall system accuracy, effectively addressing data scarcity. Additionally, the Proposed-I and Proposed-II models achieve accuracies of 97.54% and 98.66%, respectively, in distinguishing real from synthetic speech, demonstrating their strong potential for speech security and anti-spoofing applications.
(This article belongs to the Special Issue Advancements in Deep Learning and Its Applications)
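Generating the synthetic keyword clips is a short script with the open-source Coqui XTTS model, which the augmented set then mixes with real recordings at roughly the 1:5 synthetic-to-real ratio the study found optimal. The model identifier follows Coqui's published naming, while the keyword list, reference speaker, and language code are placeholders.

```python
# Sketch of XTTS-based keyword synthesis for data augmentation,
# using the Coqui TTS API with a cloned reference voice.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
for i, keyword in enumerate(["keyword_a", "keyword_b"]):   # stand-in list
    tts.tts_to_file(text=keyword,
                    speaker_wav="reference_speaker.wav",   # voice to clone
                    language="zh-cn",                      # assumed language
                    file_path=f"synth_{i}.wav")
```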

16 pages, 2283 KB  
Article
Recognition of Japanese Finger-Spelled Characters Based on Finger Angle Features and Their Continuous Motion Analysis
by Tamon Kondo, Ryota Murai, Zixun He, Duk Shin and Yousun Kang
Electronics 2025, 14(15), 3052; https://doi.org/10.3390/electronics14153052 - 30 Jul 2025
Viewed by 336
Abstract
To improve the accuracy of Japanese finger-spelled character recognition using an RGB camera, we focused on feature design and refinement of the recognition method. By leveraging angular features extracted via MediaPipe, we proposed a method that effectively captures subtle motion differences while minimizing the influence of background and surrounding individuals. We constructed a large-scale dataset that includes not only the basic 50 Japanese syllables but also those with diacritical marks, such as voiced sounds (e.g., “ga”, “za”, “da”) and semi-voiced sounds (e.g., “pa”, “pi”, “pu”), to enhance the model’s ability to recognize a wide variety of characters. In addition, the application of a change-point detection algorithm enabled accurate segmentation of sign language motion boundaries, improving word-level recognition performance. These efforts laid the foundation for a highly practical recognition system. However, several challenges remain, including the limited size and diversity of the dataset and the need for further improvements in segmentation accuracy. Future work will focus on enhancing the model’s generalizability by collecting more diverse data from a broader range of participants and incorporating segmentation methods that consider contextual information. Ultimately, the outcomes of this research should contribute to the development of educational support tools and sign language interpretation systems aimed at real-world applications.
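The angular features can be sketched directly from MediaPipe hand landmarks: each joint angle is computed from three neighboring landmark positions, which makes the feature largely insensitive to background and to other people in frame. The joint choice below is illustrative, not the paper's full feature set.

```python
# Sketch of angle features from MediaPipe hand landmarks (legacy API).
import numpy as np
import mediapipe as mp
import cv2

def joint_angle(a, b, c):
    """Angle at landmark b (radians) formed by points a-b-c."""
    v1, v2 = np.array(a) - np.array(b), np.array(c) - np.array(b)
    cosang = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.arccos(np.clip(cosang, -1.0, 1.0)))

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)
img = cv2.cvtColor(cv2.imread("sign.jpg"), cv2.COLOR_BGR2RGB)  # assumed file
res = hands.process(img)
if res.multi_hand_landmarks:
    lm = [(p.x, p.y, p.z) for p in res.multi_hand_landmarks[0].landmark]
    # e.g. index-finger PIP angle: landmarks 5 (MCP), 6 (PIP), 7 (DIP)
    print(joint_angle(lm[5], lm[6], lm[7]))
```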

23 pages, 3741 KB  
Article
Multi-Corpus Benchmarking of CNN and LSTM Models for Speaker Gender and Age Profiling
by Jorge Jorrin-Coz, Mariko Nakano, Hector Perez-Meana and Leobardo Hernandez-Gonzalez
Computation 2025, 13(8), 177; https://doi.org/10.3390/computation13080177 - 23 Jul 2025
Viewed by 526
Abstract
Speaker profiling systems are often evaluated on a single corpus, which complicates reliable comparison. We present a fully reproducible evaluation pipeline that trains Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models independently on three speech corpora representing distinct recording conditions—studio-quality TIMIT, crowdsourced Mozilla Common Voice, and in-the-wild VoxCeleb1. All models share the same architecture, optimizer, and data preprocessing; no corpus-specific hyperparameter tuning is applied. We perform a detailed preprocessing and feature extraction procedure, evaluating multiple configurations and validating their applicability and effectiveness in improving the obtained results. A feature analysis shows that Mel spectrograms benefit CNNs, whereas Mel Frequency Cepstral Coefficients (MFCCs) suit LSTMs, and that the optimal Mel-bin count grows with the corpus signal-to-noise ratio (SNR). With this fixed recipe, EfficientNet achieves 99.82% gender accuracy on Common Voice (+1.25 pp over the previous best) and 98.86% on VoxCeleb1 (+0.57 pp). MobileNet attains 99.86% age-group accuracy on Common Voice (+2.86 pp) and a 5.35-year MAE for age estimation on TIMIT using a lightweight configuration. The consistent, near-state-of-the-art results across three acoustically diverse datasets substantiate the robustness and versatility of the proposed pipeline. Code and pre-trained weights are released to facilitate downstream research.
(This article belongs to the Section Computational Engineering)
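The two front-ends compared above are a few lines each with librosa: Mel spectrograms for the CNNs and MFCCs for the LSTMs. The bin counts below are placeholders; the paper's point is precisely that the optimal Mel-bin count depends on corpus SNR.

```python
# Sketch of the two acoustic front-ends with librosa.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # assumed file and rate

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)   # CNN input
log_mel = librosa.power_to_db(mel)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)            # LSTM input
print(log_mel.shape, mfcc.shape)   # (n_mels, T), (n_mfcc, T)
```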

26 pages, 2261 KB  
Article
Real-Time Fall Monitoring for Seniors via YOLO and Voice Interaction
by Eugenia Tîrziu, Ana-Mihaela Vasilevschi, Adriana Alexandru and Eleonora Tudora
Future Internet 2025, 17(8), 324; https://doi.org/10.3390/fi17080324 - 23 Jul 2025
Viewed by 747
Abstract
In the context of global demographic aging, falls among the elderly remain a major public health concern, often leading to injury, hospitalization, and loss of autonomy. This study proposes a real-time fall detection system that combines a modern computer vision model (YOLOv11 with integrated pose estimation) with an Artificial Intelligence (AI)-based voice assistant designed to reduce false alarms and improve intervention efficiency and reliability. The system continuously monitors human posture via video input, detects fall events based on body dynamics and keypoint analysis, and initiates a voice-based interaction to assess the user’s condition. Depending on the user’s verbal response or the absence thereof, the system determines whether to trigger an emergency alert to caregivers or family members. All processing, including speech recognition and response generation, is performed locally to preserve user privacy and ensure low-latency performance. The approach is designed to support independent living for older adults. Evaluation of 200 simulated video sequences acquired by the development team demonstrated high precision and recall, along with a decrease in false positives when incorporating voice-based confirmation. In addition, the system was evaluated on an external dataset to assess its robustness. Our results highlight the system’s reliability and scalability for real-world in-home elderly monitoring applications.
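As a rough illustration of keypoint-based fall detection on top of a YOLO pose model, the sketch below flags a torso whose shoulder-hip axis is closer to horizontal than vertical. The rule is a crude stand-in for the paper's body-dynamics analysis, and the model file name assumes the ultralytics YOLO11 pose variant.

```python
# Pose-keypoint fall cue on top of an ultralytics YOLO pose model.
from ultralytics import YOLO

model = YOLO("yolo11n-pose.pt")                # pose-estimation variant

def looks_fallen(result):
    """Crude cue: shoulder-hip axis more horizontal than vertical."""
    for kp in result.keypoints.xy:             # (17, 2) COCO keypoints per person
        mid_sh = (kp[5] + kp[6]) / 2           # left/right shoulder
        mid_hip = (kp[11] + kp[12]) / 2        # left/right hip
        dx = abs(float(mid_sh[0] - mid_hip[0]))
        dy = abs(float(mid_sh[1] - mid_hip[1]))
        if dx > dy:                            # torso has tipped over
            return True
    return False

for result in model("hallway.mp4", stream=True):   # assumed video source
    if looks_fallen(result):
        print("possible fall -> trigger voice check-in")
```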

25 pages, 5055 KB  
Article
FlickPose: A Hand Tracking-Based Text Input System for Mobile Users Wearing Smart Glasses
by Ryo Yuasa and Katashi Nagao
Appl. Sci. 2025, 15(15), 8122; https://doi.org/10.3390/app15158122 - 22 Jul 2025
Viewed by 596
Abstract
With the growing use of head-mounted displays (HMDs) such as smart glasses, text input remains a challenge, especially in mobile environments. Conventional methods like physical keyboards, voice recognition, and virtual keyboards each have limitations—physical keyboards lack portability, voice input has privacy concerns, and virtual keyboards struggle with accuracy due to a lack of tactile feedback. FlickPose is a novel text input system designed for smart glasses and mobile HMD users, integrating flick-based input and hand pose recognition. It features two key selection methods: the touch-panel method, where users tap a floating UI panel to select characters, and the raycast method, where users point a virtual ray from their wrist and confirm input via a pinch motion. FlickPose uses five left-hand poses to select characters. A machine learning model trained for hand pose recognition outperforms Random Forest and LightGBM models in accuracy and consistency. FlickPose was tested against the standard virtual keyboard of Meta Quest 3 in three tasks (hiragana, alphanumeric, and kanji input). Results showed that raycast had the lowest error rate, reducing unintended key presses; touch-panel had more deletions, likely due to misjudgments in key selection; and frequent HMD users preferred raycast, as it maintained input accuracy while allowing users to monitor their text. A key feature of FlickPose is adaptive tracking, which ensures the keyboard follows user movement. While further refinements in hand pose recognition are needed, the system provides an efficient, mobile-friendly alternative for HMD text input. Future research will explore real-world application compatibility and improve usability in dynamic environments.
(This article belongs to the Special Issue Extended Reality (XR) and User Experience (UX) Technologies)
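The pinch confirmation used by the raycast method can be illustrated with a standard hand-tracking heuristic: treat a small thumb-tip to index-tip distance as a click. The landmark indices follow MediaPipe's hand model and the threshold is an assumed value; the actual system runs on the HMD's own hand tracking.

```python
# Illustrative pinch detection from 21 MediaPipe-style hand landmarks.
import numpy as np

PINCH_THRESHOLD = 0.05   # normalized image units, assumed value

def is_pinching(landmarks):
    """landmarks: 21 (x, y, z) points; True when thumb and index tips meet."""
    thumb_tip = np.array(landmarks[4])    # THUMB_TIP
    index_tip = np.array(landmarks[8])    # INDEX_FINGER_TIP
    return np.linalg.norm(thumb_tip - index_tip) < PINCH_THRESHOLD
```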
