Search Results (102)

Search Parameters:
Keywords = speaker identification

25 pages, 2538 KB  
Article
Fic2Bot: A Scalable Framework for Persona-Driven Chatbot Generation from Fiction
by Sua Kang, Chaelim Lee, Subin Jung and Minsu Lee
Electronics 2025, 14(19), 3859; https://doi.org/10.3390/electronics14193859 - 29 Sep 2025
Abstract
This paper presents Fic2Bot, an end-to-end framework that automatically transforms raw novel text into in-character chatbots by combining scene-level retrieval with persona profiling. Unlike conventional RAG-based systems that emphasize factual accuracy but neglect stylistic coherence, Fic2Bot ensures both factual grounding and consistent persona expression without any manual intervention. The framework integrates (1) Major Entity Identification (MEI) for robust coreference resolution, (2) scene-structured retrieval for precise contextual grounding, and (3) stylistic and sentiment profiling to capture linguistic and emotional traits of each character. Experiments conducted on novels from diverse genres show that Fic2Bot achieves robust entity resolution, more relevant retrieval, highly accurate speaker attribution, and stronger persona consistency in multi-turn dialogues. These results highlight Fic2Bot as a scalable and domain-agnostic framework for persona-driven chatbot generation, with potential applications in interactive roleplaying, language and literary studies, and entertainment.
(This article belongs to the Special Issue Feature Papers in Artificial Intelligence)
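
For readers who want to prototype the retrieval step, here is a minimal sketch of scene-structured retrieval feeding a persona-conditioned prompt, in the spirit of the abstract. The `embed()` encoder, the prompt format, and the scene segmentation are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence encoder; any embedding model can stand in here."""
    raise NotImplementedError

def top_k_scenes(query: str, scenes: list[str], k: int = 3) -> list[str]:
    # Scene-structured retrieval: rank whole scenes rather than arbitrary
    # chunks, so the retrieved context keeps speakers and setting intact.
    q = embed(query)
    mat = np.stack([embed(s) for s in scenes])
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
    return [scenes[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(character: str, persona: str, query: str, scenes: list[str]) -> str:
    # The persona profile (stylistic and sentiment traits) conditions the
    # reply style; the retrieved scenes supply factual grounding.
    context = "\n---\n".join(top_k_scenes(query, scenes))
    return (f"You are {character}. Persona profile: {persona}\n"
            f"Relevant scenes:\n{context}\n"
            f"Stay in character and answer: {query}")
```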

20 pages, 776 KB  
Article
Who Speaks to Whom? An LLM-Based Social Network Analysis of Tragic Plays
by Aura Cristina Udrea, Stefan Ruseti, Laurentiu-Marian Neagu, Ovio Olaru, Andrei Terian and Mihai Dascalu
Electronics 2025, 14(19), 3847; https://doi.org/10.3390/electronics14193847 - 28 Sep 2025
Abstract
The study of dramatic plays has long relied on qualitative methods to analyze character interactions, making few assumptions about the structural patterns of communication involved. Our approach bridges NLP and literary studies, enabling scalable, data-driven analysis of interaction patterns and power structures in drama. We propose a novel method to supplement addressee identification in tragedies using Large Language Models (LLMs). Unlike conventional Social Network Analysis (SNA) approaches, which often diminish dialogue dynamics by relying on co-occurrence or adjacency heuristics, our LLM-based method accurately records directed speech acts, joint addresses, and listener interactions. In a preliminary evaluation on an annotated multilingual dataset of 14 scenes from nine plays in four languages, our top-performing LLM (i.e., Llama3.3-70B) achieved an F1-score of 88.75% (P = 94.81%, R = 84.72%), an exact match of 77.31%, and an 86.97% partial match with human annotations, where partial match indicates any overlap between predicted and annotated receiver lists. Through automatic extraction of speaker–addressee relations, our method provides preliminary evidence for the potential scalability of SNA for literary analyses, as well as insights into power relations, influence, and isolation of characters in tragedies, which we further visualize by rendering social network graphs.
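
The evaluation metrics are concrete enough to sketch directly: exact match, partial match (any overlap between predicted and annotated receiver lists), and precision/recall/F1 over receivers. Micro-averaging across utterances is an assumption here; the paper may aggregate differently.

```python
def addressee_scores(predicted: list[set[str]], gold: list[set[str]]) -> dict[str, float]:
    # Micro-averaged P/R/F1 over receiver lists, plus exact-match and
    # partial-match rates, mirroring the metrics reported in the abstract.
    tp = fp = fn = exact = partial = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)
        fp += len(pred - ref)
        fn += len(ref - pred)
        exact += int(pred == ref)
        partial += int(bool(pred & ref))   # any overlap counts
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    n = len(gold)
    return {"P": p, "R": r, "F1": f1, "exact": exact / n, "partial": partial / n}

print(addressee_scores([{"Creon"}, {"Chorus", "Ismene"}],
                       [{"Creon"}, {"Antigone", "Ismene"}]))
```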

22 pages, 2431 KB  
Article
Perceptual Plasticity in Bilinguals: Language Dominance Reshapes Acoustic Cue Weightings
by Annie Tremblay and Hyoju Kim
Brain Sci. 2025, 15(10), 1053; https://doi.org/10.3390/brainsci15101053 - 27 Sep 2025
Abstract
Background/Objectives: Speech perception is shaped by language experience, with listeners learning to selectively attend to acoustic cues that are informative in their language. This study investigates how language dominance, a proxy for long-term language experience, modulates cue weighting in highly proficient Spanish–English bilinguals’ perception of English lexical stress. Methods: We tested 39 bilinguals with varying dominance profiles and 40 monolingual English speakers in a stress identification task using auditory stimuli that independently manipulated vowel quality, pitch, and duration. Results: Bayesian logistic regression models revealed that, compared to monolinguals, bilinguals relied less on vowel quality and more on pitch and duration, mirroring cue distributions in Spanish versus English. Critically, cue weighting within the bilingual group varied systematically with language dominance: English-dominant bilinguals patterned more like monolingual English listeners, showing increased reliance on vowel quality and decreased reliance on pitch and duration, whereas Spanish-dominant bilinguals retained a cue weighting that was more Spanish-like. Conclusions: These results support experience-based models of speech perception and provide behavioral evidence that bilinguals’ perceptual attention to acoustic cues remains flexible and dynamically responsive to long-term input. These results are in line with a neurobiological account of speech perception in which attentional and representational mechanisms adapt to changes in the input.
(This article belongs to the Special Issue Language Perception and Processing)
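
As a concrete illustration of the modeling approach named above, here is a minimal Bayesian logistic regression over the three manipulated cues in PyMC, fitted to simulated responses; the priors, group-level structure, and dominance interactions of the actual analysis are not specified in the abstract.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n = 400
# Columns: vowel quality, pitch, duration (standardized cue values per trial).
X = rng.normal(size=(n, 3))
# Simulated binary responses (e.g., "first syllable stressed").
y = rng.binomial(1, 1 / (1 + np.exp(-(1.2 * X[:, 0] + 0.5 * X[:, 1]))))

with pm.Model():
    beta = pm.Normal("beta", mu=0, sigma=1, shape=3)    # one weight per cue
    intercept = pm.Normal("intercept", mu=0, sigma=1)
    p = pm.math.sigmoid(intercept + pm.math.dot(X, beta))
    pm.Bernoulli("resp", p=p, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior means of beta index each cue's perceptual weight.
print(idata.posterior["beta"].mean(dim=("chain", "draw")).values)
```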

19 pages, 1603 KB  
Article
Cross-Linguistic Influences on L2 Prosody Perception: Evidence from English Interrogative Focus Perception by Mandarin Listeners
by Xing Liu, Xiaoxiang Chen, Chen Kuang and Fei Chen
Brain Sci. 2025, 15(9), 1000; https://doi.org/10.3390/brainsci15091000 - 16 Sep 2025
Viewed by 342
Abstract
Background/Objectives: This study sets out to explore how L1 Mandarin speakers with varying lengths of L2 experience perceived the English focus interrogative tune, L*H-H%, within the framework of the autosegmental–metrical model. Methods: Eighteen Mandarin speakers with varying lengths of residence in the United States and eighteen native English speakers were invited to perceive prosodic prominence and judge the naturalness of focus prosody tunes. Results: For the perception of the on-focus pitch accent L*, Mandarin speakers performed well in the prominence detection task but not in the focus identification task. For post-focus edge tones, we found that phrase accents were more susceptible to L1 influences than boundary tones due to the varying degrees of cross-linguistic similarity between these intonational categories. The results also show that even listeners with extended L2 experience were not proficient in their perception of L2 interrogative focus tunes. Conclusions: This study reveals the advantage of considering the degree of L1–L2 similarity and the necessity of examining cross-linguistic influences on L2 prosody perception separately in the phonological and phonetic dimensions.
(This article belongs to the Special Issue Language Perception and Processing)

12 pages, 381 KB  
Article
Russian–Belarusian Border Dialects and Their “Language Roof”: Dedialectization and Trajectories of Changes
by Anastasiia Ryko
Languages 2025, 10(9), 225; https://doi.org/10.3390/languages10090225 - 5 Sep 2025
Viewed by 475
Abstract
The dialects discussed in this article were considered Belarusian in the early 20th century, and later, as a result of the transfer of the administrative (state) border, they became part of the Russian territory and were considered Russian. The changes occurring in these dialects as a result of the influence of the standard Russian language are interesting from various perspectives. Firstly, the linguistic self-identification of dialect speakers changes and the perception of their dialect as less prestigious compared to the standard language is formed. Secondly, linguistic features that dialectologists previously defined as characteristic of the Belarusian language are being replaced by standard Russian ones. By analyzing the linguistic data obtained from the dialect speakers of different generations, we can trace the emergence of variation and then its loss. Observing which linguistic features are subject to change first, and which remain more stable, allows us to examine linguistic changes through the lens of the “hierarchy of borrowings” theory. Additionally, given the linguistic inequality between the dialect and the standard language, we can observe the gradual transformation of the dialect under the influence of the prestigious standard idiom. Therefore, the loss of Belarusian–Russian variation can be viewed as a process of dedialectization, bringing the dialect closer to the standard language.
(This article belongs to the Special Issue Language Attitudes and Language Ideologies in Eastern Europe)

18 pages, 3632 KB  
Article
Multilingual Mobility: Audio-Based Language ID for Automotive Systems
by Joowon Oh and Jeaho Lee
Appl. Sci. 2025, 15(16), 9209; https://doi.org/10.3390/app15169209 - 21 Aug 2025
Viewed by 537
Abstract
With the growing demand for natural and intelligent human–machine interaction in multilingual environments, automatic language identification (LID) has emerged as a crucial component in voice-enabled systems, particularly in the automotive domain. This study proposes an audio-based LID model that identifies the spoken language directly from voice input without requiring manual language selection. The model architecture leverages two types of feature extraction pipelines: a Variational Autoencoder (VAE) and a pre-trained Wav2Vec model, both used to obtain latent speech representations. These embeddings are then fed into a multi-layer perceptron (MLP)-based classifier to determine the speaker’s language among five target languages: Korean, Japanese, Chinese, Spanish, and French. The model is trained and evaluated using a dataset preprocessed into Mel-Frequency Cepstral Coefficients (MFCCs) and raw waveform inputs. Experimental results demonstrate the effectiveness of the proposed approach in achieving accurate and real-time language detection, with potential applications in in-vehicle systems, speech translation platforms, and multilingual voice assistants. By eliminating the need for predefined language settings, this work contributes to more seamless and user-friendly multilingual voice interaction systems.
(This article belongs to the Section Computing and Artificial Intelligence)
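
As a toy version of the classification stage, the sketch below mean-pools MFCCs into a fixed-length vector and trains an MLP over the five target languages; the paper's VAE and Wav2Vec feature pipelines would replace the hand-rolled feature step, and the paths and hyperparameters here are illustrative only.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

LANGUAGES = ["ko", "ja", "zh", "es", "fr"]   # the five target languages

def mfcc_embedding(path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    # Mean-pooled MFCCs as a crude fixed-length utterance representation.
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

# train_paths / train_labels are assumed to exist: audio files plus integer
# language indices into LANGUAGES.
# X = np.stack([mfcc_embedding(p) for p in train_paths])
# clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=300).fit(X, train_labels)
# print(LANGUAGES[clf.predict(mfcc_embedding("query.wav")[None, :])[0]])
```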

19 pages, 1612 KB  
Article
Listening for Region: Phonetic Cue Sensitivity and Sociolinguistic Development in L2 Spanish
by Lauren B. Schmidt
Languages 2025, 10(8), 198; https://doi.org/10.3390/languages10080198 - 20 Aug 2025
Viewed by 688
Abstract
This study investigates how second language (L2) learners of Spanish identify the regional origin of native Spanish speakers and whether specific phonetic cues predict dialect identification accuracy across proficiency levels. Situated within a growing body of work on sociolinguistic competence, this research addresses the development of learners’ ability to use linguistic forms not only for communication but also for social interpretation. A dialect identification task was administered to 111 American English-speaking learners of Spanish and 19 native Spanish speakers. Participants heard sentence-length stimuli targeting regional phonetic features and selected the speaker’s country of origin. While L2 learners were able to identify regional dialects above chance, accuracy was low and significantly below that of native speakers. Higher-proficiency learners demonstrated improved identification, especially for speakers from Spain and Argentina, and relied more on salient phonetic cues (e.g., [θ], [ʃ]). No significant development was found for identification of Mexican or Puerto Rican varieties. Unlike native speakers, L2 learners did not show sensitivity to broader macrodialect groupings; instead, they frequently defaulted to high-exposure varieties (e.g., Spain, Mexico) regardless of the phonetic cues present. Findings suggest that sociophonetic perception in L2 Spanish develops gradually and unevenly, shaped by cue salience and exposure.
(This article belongs to the Special Issue Second Language Acquisition and Sociolinguistic Studies)

24 pages, 5649 KB  
Article
Bangla Speech Emotion Recognition Using Deep Learning-Based Ensemble Learning and Feature Fusion
by Md. Shahid Ahammed Shakil, Fahmid Al Farid, Nitun Kumar Podder, S. M. Hasan Sazzad Iqbal, Abu Saleh Musa Miah, Md Abdur Rahim and Hezerul Abdul Karim
J. Imaging 2025, 11(8), 273; https://doi.org/10.3390/jimaging11080273 - 14 Aug 2025
Viewed by 686
Abstract
Emotion recognition in speech is essential for enhancing human–computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, struggling with robustness and accuracy in noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model’s robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time–frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Subsequently, we utilize a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates a 1D CNN with Long Short-Term Memory (LSTM), and the third combines 1D CNNs with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We utilize three primary datasets—SUBESCO, BanglaSER, and a merged version of both—as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves impressive results with accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% for the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech.
(This article belongs to the Section Computer Vision and Pattern Recognition)
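
The handcrafted feature set is fully enumerated above, so the fusion step can be sketched directly with librosa: each descriptor is mean-pooled over time and concatenated into one 1D vector. The pooling choice and dimensions are assumptions; the abstract does not state them.

```python
import numpy as np
import librosa

def handcrafted_features(y: np.ndarray, sr: int) -> np.ndarray:
    # Each extractor returns (n_features, n_frames); mean-pooling over time
    # yields the 1D numerical vectors described in the abstract.
    feats = [
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),
        librosa.feature.rms(y=y),
        librosa.feature.melspectrogram(y=y, sr=sr),
    ]
    return np.concatenate([f.mean(axis=1) for f in feats])

# y, sr = librosa.load("clip.wav")
# x = handcrafted_features(y, sr)   # input to each CNN/LSTM/Bi-LSTM stream;
#                                   # averaging the streams' class probabilities
#                                   # implements the soft-voting ensemble.
```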

20 pages, 594 KB  
Article
Identification of Mandarin Tones in Loud Speech for Native Speakers and Second Language Learners
by Hui Zhang, Xinwei Chang, Weitong Liu, Yilun Zhang and Na Wang
Behav. Sci. 2025, 15(8), 1062; https://doi.org/10.3390/bs15081062 - 5 Aug 2025
Viewed by 846
Abstract
Teachers often raise their vocal volume to improve intelligibility or capture students’ attention. While this practice is common in second language (L2) teaching, its effects on tone perception remain understudied. To fill this gap, this study explores the effects of loud speech on Mandarin tone perception for L2 learners. Twenty-two native Mandarin speakers and twenty-two Thai L2 learners were tested on their perceptual accuracy and reaction time in identifying Mandarin tones in loud and normal speech modes. Results revealed a significant between-group difference: native speakers consistently demonstrated a ceiling effect across all tones, while L2 learners exhibited lower accuracy, particularly for Tone 3, the falling-rising tone. Loud speech had different impacts on the two groups. For native speakers, tone perception accuracy remained stable across speech modes. In contrast, for L2 learners, loud speech significantly reduced the accuracy of Tone 3 identification and increased confusion between Tones 2 and 3. Reaction times were prolonged for all tones in loud speech for both groups; after subtracting tone duration, the delay remained evident only for Tones 3 and 4. Therefore, raising the speaking volume negatively affects L2 learners’ Mandarin tone perception, especially the distinction between Tone 2 and Tone 3. Our findings have implications for both theories of L2 tone perception and pedagogical practices.
(This article belongs to the Section Cognition)

22 pages, 305 KB  
Review
Review of Automatic Estimation of Emotions in Speech
by Douglas O’Shaughnessy
Appl. Sci. 2025, 15(10), 5731; https://doi.org/10.3390/app15105731 - 20 May 2025
Cited by 1 | Viewed by 898
Abstract
Identification of the emotions exhibited in utterances is useful for many applications, e.g., assisting with handling telephone calls or psychological diagnoses. This paper reviews methods to identify emotions from speech signals. We examine the information in speech that helps to estimate emotion, from points of view involving both production and perception. As machine approaches to recognizing emotion in speech often have much in common with other speech tasks, such as automatic speaker verification and speech recognition, we compare such processes. Many methods of emotion recognition have been adapted from research on pattern recognition in other areas, e.g., image and text recognition, especially recent methods for machine learning. We show that speech is very different from most other signals that can be recognized, and that emotion identification is different from other speech applications. This review is primarily aimed at non-experts (more algorithmic detail is present in the cited literature), but it also contains substantial discussion for experts.
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)

17 pages, 4114 KB  
Article
Biomimetic Computing for Efficient Spoken Language Identification
by Gaurav Kumar and Saurabh Bhardwaj
Biomimetics 2025, 10(5), 316; https://doi.org/10.3390/biomimetics10050316 - 14 May 2025
Viewed by 758
Abstract
Spoken Language Identification (SLID)-based applications have become increasingly important in everyday life, driven by advancements in artificial intelligence and machine learning. Multilingual countries utilize the SLID method to facilitate speech detection. This is accomplished by determining the language of the spoken parts using language recognizers. However, when working with multilingual datasets, the presence of multiple languages that have a shared origin presents a significant challenge for accurately classifying languages using automatic techniques. A further challenge is the significant variance in speech signals caused by factors such as different speakers, content, acoustic settings, language differences, changes in voice modulation based on age and gender, and variations in speech patterns. In this study, we introduce the DBODL-MSLIS approach, which integrates biomimetic optimization techniques inspired by natural intelligence to enhance language classification. The proposed method employs Dung Beetle Optimization (DBO) with deep learning, simulating the beetle’s foraging behavior to optimize feature selection and classification performance. The proposed technique integrates speech preprocessing, which encompasses pre-emphasis, windowing, and frame blocking, followed by feature extraction utilizing pitch, energy, the Discrete Wavelet Transform (DWT), and zero-crossing rate (ZCR). Feature selection is then performed by the DBO algorithm, which removes redundant features to improve efficiency and accuracy. Spoken languages are classified using Bayesian optimization (BO) in conjunction with a long short-term memory (LSTM) network. The DBODL-MSLIS technique has been experimentally validated using the IIIT Spoken Language dataset. The results indicate an average accuracy of 95.54% and an F-score of 84.31%. This technique surpasses various other state-of-the-art models, such as SVM, MLP, LDA, DLA-ASLISS, HMHFS-IISLFAS, GA-based fusion, and VGG-16. We have also evaluated the accuracy of our proposed technique against state-of-the-art biomimetic computing models such as GA, PSO, GWO, DE, and ACO. While ACO achieved up to 89.45% accuracy, our Bayesian optimization with LSTM outperformed all others, reaching a peak accuracy of 95.55%, demonstrating its effectiveness in enhancing spoken language identification. The suggested technique demonstrates promising potential for practical applications in the field of multilingual voice processing.
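
The preprocessing chain (pre-emphasis, windowing, frame blocking) is standard signal processing; a NumPy sketch follows, using the conventional 0.97 pre-emphasis coefficient and 25 ms / 10 ms framing rather than the paper's (unstated) settings.

```python
import numpy as np

def preprocess(signal: np.ndarray, sr: int, alpha: float = 0.97,
               frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Frame blocking: overlapping frames ready for feature extraction.
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: a Hamming window tapers frame edges before pitch, energy,
    # DWT, and ZCR features are computed per frame.
    return frames * np.hamming(frame_len)
```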

11 pages, 4877 KB  
Proceeding Paper
Leveraging RFID for Road Safety Sign Detection to Enhance Efficiency and Notify Drivers
by Dhanasekar Ravikumar, Vijayaraja Loganathan, Pranav Ponnovian, Vignesh Loganathan and Bharanidharan Sivalingam
Eng. Proc. 2025, 87(1), 53; https://doi.org/10.3390/engproc2025087053 - 15 Apr 2025
Viewed by 412
Abstract
Road safety signboards are now difficult to see due to pollution and harsh weather elements such as snow and fog, which has resulted in more accidents. The problem is especially common in Western countries where snow can block these critical signs. An approach addressing this issue involves a system that uses Radio Frequency Identification (RFID) and the Internet of Things (IoT). The real-time alerts that this system sends to drivers improve safety in complex environments. For this purpose, an RFID reader is placed in the vehicle, and passive RFID tags are attached to road safety signboards. The reader picks up the signal as the vehicle comes within range, and a warning is sent to the driver. This helps to reduce the number of accidents resulting from poor visibility. In addition, because the system delivers multilingual audio alerts through speakers and visual warnings on a display screen, it is accessible to drivers from various regions. To make the system more sustainable, we added solar panels to cut energy costs. The system combines GPS and GSM modules to report the vehicle position to the cloud in real time, which enables better warnings and helps avoid accidents. In addition to improving road safety, the system benefits the environment by limiting the emissions and waste of resources caused by accidents. The collected data can also be used to study traffic patterns, supporting more efficient and eco-friendly transportation systems. This solution enables a smarter, safer, and more sustainable vehicle network with quick, accurate alerts.
(This article belongs to the Proceedings of The 5th International Electronic Conference on Applied Sciences)
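
The alert logic reduces to a small polling loop: a tag ID read in range is mapped to a sign message in the driver's language. The `read_tag()` reader interface and the tag database below are hypothetical; audio playback, the display, and the GPS/GSM cloud upload are omitted.

```python
import time

# Hypothetical tag database: tag ID -> per-language sign message.
SIGN_DB = {
    "3F0A11": {"en": "Sharp curve ahead", "fr": "Virage serré devant"},
    "3F0A12": {"en": "School zone, slow down", "fr": "Zone scolaire, ralentir"},
}

def read_tag():
    """Hypothetical poll of the in-vehicle RFID reader; a real reader would
    be queried over serial/UART and return the ID of any tag in range."""
    return None

def alert_loop(language: str = "en") -> None:
    announced = set()
    while True:
        tag = read_tag()
        if tag in SIGN_DB and tag not in announced:
            announced.add(tag)                 # do not repeat the same sign
            print(SIGN_DB[tag][language])      # stand-in for audio + display
        time.sleep(0.1)                        # poll at 10 Hz
```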

17 pages, 3872 KB  
Article
Technology to Enable People with Intellectual Disabilities and Blindness to Collect Boxes with Objects and Transport Them to Different Rooms of Their Daily Context: A Single-Case Research Series
by Giulio E. Lancioni, Gloria Alberti, Francesco Pezzuoli, Fabiana Abbinante, Nirbhay N. Singh, Mark F. O’Reilly and Jeff Sigafoos
Technologies 2025, 13(4), 131; https://doi.org/10.3390/technologies13040131 - 31 Mar 2025
Viewed by 524
Abstract
(1) Background: People with intellectual disabilities and blindness tend to be withdrawn and sedentary. This study was carried out to assess a new technology system to enable seven of these people to collect boxes containing different sets of objects from a storage room and transport them to the appropriate destination rooms. (2) Methods: The technology system used for the study involved tags with radio frequency identification codes, a tag reader, a smartphone, and mini speakers. At the start of a session, the participants were called by the system to take a box from the storage room. Once they collected a box, the system identified the tags attached to the box, called the participants to the room where the box was to be transported and delivered, and provided them with preferred music stimulation. The same process was followed for each of the other boxes available in the session. (3) Results: During baseline sessions without the system, the mean frequency of boxes handled correctly (collected, transported, and put away without research assistants’ guidance) was zero or virtually zero. During the intervention sessions with the system, the participants’ mean frequency of boxes handled correctly increased to between about 10 and 15 per session. (4) Conclusions: These findings suggest that the new technology system might be helpful for people like the participants of this study.

16 pages, 799 KB  
Article
Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models
by Francisco Javier Lima Florido and Gloria Corpas Pastor
Computers 2025, 14(3), 102; https://doi.org/10.3390/computers14030102 - 14 Mar 2025
Cited by 1 | Viewed by 1694
Abstract
In recent years, advances in deep neural networks (DNNs) and large language models (LLMs) have led to major breakthroughs and new levels of performance in Natural Language Processing (NLP), including tasks related to speech processing. Building on these trends, models such as Whisper and Wav2Vec 2.0 achieve robust performance in speech processing tasks, even in speech-to-text translation and end-to-end speech translation, far exceeding all previous results. Although these models have shown excellent results in real-time speech processing, they still have accuracy issues on some tasks and high-latency problems when working with large amounts of audio data. In addition, many of them need audio to be segmented and labelled for speech synthesis and annotation tasks. Speaker diarisation, background noise detection, prosodic boundary detection, and accent classification are some of the pre-processing tasks required in these cases. In this study, we fine-tune a small Wav2Vec 2.0 base model for multi-task classification and audio segmentation. A corpus of spoken American English is used for the experiments. We explore this new approach and, more specifically, the performance of the model on prosodic boundary detection for audio segmentation and on advanced accent identification.
(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)
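
A minimal fine-tuning setup of the kind described, using the Hugging Face `Wav2Vec2ForSequenceClassification` head on a base checkpoint. The number of accent classes is an assumption, and the paper's multi-task arrangement (boundary detection plus accent classification) is only indicated in a comment.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

NUM_ACCENTS = 8        # assumed label count; not stated in the abstract
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=NUM_ACCENTS)

waveform = torch.randn(16000 * 3)   # placeholder: 3 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
logits = model(**inputs).logits     # shape (1, NUM_ACCENTS)
# Fine-tuning would apply cross-entropy to these logits; prosodic-boundary
# detection can be framed as a second, frame-level classification head.
```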

21 pages, 4067 KB  
Article
The Speaker Identification Model for Air-Ground Communication Based on a Parallel Branch Architecture
by Weijun Pan, Shenhao Chen, Yidi Wang, Sheng Chen and Xuan Wang
Appl. Sci. 2025, 15(6), 2994; https://doi.org/10.3390/app15062994 - 10 Mar 2025
Viewed by 1417
Abstract
This study addresses the challenges of complex noise and short speech in civil aviation air-ground communication scenarios and proposes a novel speaker identification model, Chrono-ECAPA-TDNN (CET). The aim of the study is to enhance the accuracy and robustness of speaker identification in these environments. The CET model incorporates three key components: the Chrono Block module, the speaker embedding extraction module, and the optimized loss function module. The Chrono Block module utilizes a parallel branch architecture, Bi-LSTM, and multi-head attention mechanisms to effectively extract both global and local features, addressing the challenge of short speech. The speaker embedding extraction module aggregates features from the Chrono Block and employs self-attention statistical pooling to generate robust speaker embeddings. The loss function module introduces the Sub-center AAM-Softmax loss, which improves feature compactness and class separation. To further improve robustness, data augmentation techniques such as speed perturbation, spectral masking, and random noise suppression are applied. Pretrained on the VoxCeleb2 dataset and tested on the air-ground communication dataset, the CET model achieves 9.81% EER and 88.62% accuracy, outperforming the baseline ECAPA-TDNN model by 1.53% in EER and 2.19% in accuracy. The model also demonstrates strong performance on four cross-domain datasets, highlighting its broad potential for real-time applications.
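
Of the components listed, self-attention statistical pooling is the easiest to show in isolation: attention weights over frames produce a weighted mean and standard deviation, concatenated into a fixed-length speaker embedding. A PyTorch sketch with illustrative layer sizes:

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Self-attention statistical pooling in the ECAPA-TDNN style."""
    def __init__(self, channels: int, hidden: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1), nn.Tanh(),
            nn.Conv1d(hidden, channels, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); weights sum to 1 over frames.
        w = torch.softmax(self.attn(x), dim=2)
        mean = (x * w).sum(dim=2)
        var = (x.pow(2) * w).sum(dim=2) - mean.pow(2)
        std = var.clamp(min=1e-6).sqrt()
        return torch.cat([mean, std], dim=1)   # (batch, 2 * channels)

pooled = AttentiveStatsPooling(512)(torch.randn(4, 512, 200))
print(pooled.shape)   # torch.Size([4, 1024])
```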
