Article

Whispered Speech Recognition Based on Audio Data Augmentation and Inverse Filtering

Jovan Galić, Branko Marković, Đorđe Grozdić, Branislav Popović and Slavko Šajić
1 Department of Telecommunications, Faculty of Electrical Engineering, University of Banja Luka, 78000 Banja Luka, Bosnia and Herzegovina
2 Department of Computer and Software Engineering, Faculty of Technical Sciences, University of Kragujevac, 32000 Čačak, Serbia
3 Grid Dynamics, 11000 Belgrade, Serbia
4 School of Electrical Engineering, University of Belgrade, 11000 Belgrade, Serbia
5 Faculty of Technical Sciences, University of Novi Sad, 21000 Novi Sad, Serbia
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8223; https://doi.org/10.3390/app14188223
Submission received: 28 July 2024 / Revised: 6 September 2024 / Accepted: 8 September 2024 / Published: 12 September 2024
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)

Abstract

Modern Automatic Speech Recognition (ASR) systems are primarily designed to recognize normal speech. Due to a considerable acoustic mismatch between normal speech and whisper, ASR systems suffer from a significant loss of performance in whisper recognition. Creating large databases of whispered speech is expensive and time-consuming, so research studies explore the synthetic generation of whispered data using pre-existing normal or whispered speech databases. The impact of standard audio data augmentation techniques on the accuracy of isolated-word recognizers based on Hidden Markov Models (HMM) and Convolutional Neural Networks (CNN) is examined in this research study. Furthermore, the study explores the potential of inverse filtering as an augmentation strategy for producing pseudo-whisper speech. The Whi-Spe speech database, containing recordings in normal and whisper phonation, is utilized for data augmentation, while an internally recorded speech database, developed specifically for this study, is employed for testing purposes. Experimental results demonstrate a statistically significant improvement in performance when employing data augmentation strategies and inverse filtering.

1. Introduction

Human communication mostly relies on speech, playing a significant role in our everyday lives. Unlike traditional tactile interfaces, such as the mouse and keyboard, which require the user’s full attention, speech-based human-machine interaction offers a more natural and intuitive mode of communication [1].
Automatic Speech Recognition (ASR) employs various complex computational algorithms to transcribe spoken words into text. ASR systems are distinguished by several key characteristics [2]:
  • Initial user training is unnecessary, as speech is a natural skill that develops from a young age.
  • Speaking enables communication up to ten times faster than typing or writing.
  • The users can be partially engaged in other activities, as their hands and eyes are not occupied.
  • Microphones are affordable and have smaller dimensions compared to keyboards.
Modern ASR systems demonstrate notable speed and accuracy, surpassing human accuracy in specific controlled environments for Large Vocabulary Continuous Speech Recognition (LVCSR) [3]. Nevertheless, these systems are highly sensitive to speech that differs from the training data, which typically consist of normal speech recorded under controlled laboratory conditions. Recognizing such unconventional speech, including changes in vocal effort, variations in speaker emotional states and accents, and adverse conditions such as background noise, reverberation, and loudness, presents a challenging task for researchers in the field.
Speech can be classified into five modes based on the level of effort: whispered, soft, normally phonated (normal), loud, and shouted speech [4]. Whisper is the most distinctive mode owing to the absence of glottal vibration and the purely noise-like excitation of the vocal tract. People tend to whisper or lower their voices for various reasons. Firstly, whispering is used in situations where normal speech is not allowed or is deemed inappropriate, such as in a theater or reading room. Secondly, it is employed to convey confidential information that uninvolved individuals should not hear. Lastly, whispering is used in criminal activities to conceal one’s identity. Apart from intentional whispering, it can also occur as a result of health issues such as laryngitis or rhinitis [5].
Whispered speech poses the greatest challenge to the ASR system’s performance compared to other speech modes. The notable acoustic mismatch between normal and whispered speech plays a key role in such performance degradation. To improve the robustness, an extensive speech database containing different speech modalities and variability should be developed, especially for modern deep-learning-based ASR systems [6]. However, this procedure is expensive and time-consuming. To address these constraints and enhance the generalization capability of ASR systems, data augmentation (DA) methods have been utilized. A review of the augmentation techniques used in the literature for speech recognition tasks is provided in Table 1.
The application of data augmentation in audio classification tasks is discussed in [16]. Study [17] demonstrates the synthesis of speech waveforms based on Continuous Wavelet Transform (CWT), which produces natural-sounding synthetic speech. Recent advancements in whispered speech recognition involve pseudo-whispered speech conversion through the following three steps [18]:
  (1) Removing the glottal contribution via spectral subtraction;
  (2) Shifting the first formant using frequency warping;
  (3) Increasing the formant bandwidth with a moving average filter.
Audio data augmentation utilizes various algorithms to generate new (artificial) audio samples from existing ones (natural) to enlarge the dataset size and mitigate over-fitting issues in ASR systems.
In the context of low-resource languages such as Serbian, where only a limited number of publicly available speech databases are accessible, utilization of audio data augmentation can play an important role. The main motivation of this research study is to demonstrate that by applying audio data augmentation techniques to some existing datasets, we can elevate the quality and diversity of training data, ultimately leading to improved performance of ASR.
The primary contributions outlined in this paper are:
  • A whispered speech database, annotated and made publicly accessible, was developed for testing purposes.
  • The efficacy of using inverse filtering as an algorithm for data augmentation in generating pseudo-whisper has been validated in speaker-independent whisper recognition.
  • The efficacy of typical augmentation techniques (such as pitch shifting, time stretching, and volume adjustment) in artificially enlarging whispered speech databases has been illustrated for an ASR system utilizing Hidden Markov Models (HMM) and Convolutional Neural Networks (CNN).
  • The beneficial effect of parallel augmentation, where a single natural utterance produces multiple artificial ones, has been verified to enhance the performance of CNN-based recognizers.
This study analyzes the performance of recognition systems that utilize Mel-Frequency Cepstral Coefficients (MFCC) as feature vectors. Additionally, Linear Predictive Coding (LPC), based on the source-filter model of speech production, is commonly employed in formant extraction. In this paper, LPC coefficients are utilized for inverse filtering to generate pseudo-whisper utterances from normal speech. Stochastic Gradient Descent (SGD) is an effective algorithm for training neural networks, and the CNN recognizer developed in this study is trained with Adaptive Moment Estimation (Adam), a variant of SGD.
The paper is structured into seven sections as follows. Section 2 provides an overview of whispered speech and compares it with normal speech. Section 3 explores the methodology and theory underlying audio data augmentation techniques. Section 4 explains the rationale behind the usage of the inverse filtering algorithm for artificially expanding the speech database used for training. Section 5 details the experimental preparation, including the speech databases (natural and augmented) used for training and testing, the feature extraction procedure, and the characteristics of ASR systems. Section 6 presents the experimental results and discussion, focusing on the contribution of augmentation strategies and inverse filtering concerning WER reduction. Finally, Section 7 concludes with final remarks and outlines directions for future work.

2. Whispered Speech

Whisper represents a unique mode of verbal expression that, based on its distinct characteristics, nature, and way of production, deviates considerably from normal speech. As previously stated, the key features of whispering include the lack of a fundamental frequency and the noisy excitation of the vocal tract. Formant frequencies of whispered vowels are significantly greater than those of the neutral voice [19]. Compared to normal speech, whispered speech exhibits reduced frame energy, extended speech and silence durations, a flatter long-term spectrum, and a lower Sound Pressure Level (SPL).
Even though a greater level of concentration is required to understand whispered speech, its intelligibility remains remarkably high. Reference [20] reports an average accuracy of 82% in identifying vowels within [hVd] syllables spoken in a whisper. Conversely, non-verbal cues such as age, sex, emotional state, and identity are often difficult to discern in whispered speech.
Figure 1 illustrates the waveform and spectrogram of the phrase “Govor šapata” (“Whispered speech” in English) in Serbian. A phonetic transcription is also provided for normal speech (in capital letters) and whisper (in small letters). The amplitude levels in Figure 1a display a noticeable difference between the two modes of speech due to a loss of sonority. Additionally, the spectrogram in Figure 1b reveals the following characteristics:
  • Fricatives such as /š/ (i.e., /ʃ/ in IPA notation) and plosives /p/ and /t/ are well preserved in whisper.
  • The vibrant /r/ exhibits a similar shape in both normal and whispered speech.
  • The absence of vertical lines representing glottal pulses shows that a harmonic structure of vowels is not present in a whisper.

3. Audio Data Augmentation

Data augmentation encompasses a variety of effective algorithms developed to generate synthetic data and modify existing datasets, improving a recognition system’s capability to generalize. Even though the technique was initially created for image processing, audio data can also benefit from it. A number of audio augmentation techniques have been developed for audio-related machine-learning tasks. Depending on the format of the input audio, data augmentation techniques can be categorized into two groups:
  • Audio data augmentation and
  • Spectrogram audio data augmentation.
Audio Data Augmentation (Audio DA) includes a set of modifications (transformations) performed directly on the raw audio signal (waveform). Commonly used augmentation techniques are pitch shifting, time stretching, volume control, adding noise, and time shifting.
  • Pitch shift
The voice fundamental frequency (F0) is one of the speaker’s most significant individual acoustic characteristics. This method modifies F0 by a shift usually expressed in semitones (one-twelfth of an octave). Pitch shifting is widely recognized as one of the most common augmentation techniques.
  • Time stretch
This method adjusts the speaking rate or speed of the speech signal while maintaining the pitch. The speedup factor, also known as the stretch ratio, typically ranges from 0.8 (slower) to 1.2 (faster).
  • Volume control (volume gain)
This method applies a gain to the speech signal, typically measured in decibels.
  • Add noise
This technique adds white noise to the speech signal based on the specified Signal-to-Noise Ratio (SNR), as defined by Equation (1), where $P_s$ represents the power of the signal and $P_n$ represents the power of the noise:
$$SNR\,[\mathrm{dB}] = 10 \log_{10} \frac{P_s}{P_n}$$
Despite improving its robustness in noisy conditions, it may not be effective when the test set is noiseless [21].
  • Time shift
This method shifts the audio signal left or right by a specified time in milliseconds. While it is more useful in sound classification tasks, its applicability in ASR tasks is limited [21].
The other audio DA techniques include:
  • Wow Resampling
  • Clipping
  • Add Impulse Response
  • Filtering
  • MP3 Compression
  • Inversion
The image and spectrogram DA techniques are described in more detail in [22]. Traditional image augmentation methods include flipping, zooming, shifting, rotation, and brightness adjustment. Spectrogram DA is applied to spectral images and is often used in audio classification tasks.
Audio augmentation algorithms can be applied either sequentially (in series) or independently (in parallel). Figure 2a illustrates the architecture of a sequential augmentation chain producing a single augmented output, while Figure 2b shows the parallel application of N random sequential chains producing N augmented output signals. The probability of applying each augmentation and the values of any probabilistically determined parameters are drawn independently.
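To make the scheme concrete, the following Python sketch (not the authors' MATLAB implementation) chains the three waveform augmentations examined later in Section 6.2, applying each with probability 0.5 and drawing its parameter from the ranges used there; the function names and the use of librosa are illustrative assumptions.

```python
import random
import librosa

def augment_sequential(y, sr, p_apply=0.5):
    """Apply pitch shift, time stretch, and volume gain in sequence,
    each with probability p_apply and a randomly drawn parameter."""
    if random.random() < p_apply:
        # Pitch shift in the range [-2, 2] semitones
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=random.uniform(-2.0, 2.0))
    if random.random() < p_apply:
        # Time stretch with a speedup factor in the range [0.8, 1.2]
        y = librosa.effects.time_stretch(y, rate=random.uniform(0.8, 1.2))
    if random.random() < p_apply:
        # Volume gain in the range [-3, +3] dB
        y = y * 10.0 ** (random.uniform(-3.0, 3.0) / 20.0)
    return y

def augment_parallel(y, sr, n_aug=3):
    """Parallel augmentation (Figure 2b): N independent augmented copies of one utterance."""
    return [augment_sequential(y, sr) for _ in range(n_aug)]
```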

4. Inverse Filtering

Inverse filtering is a powerful technique in speech processing used to isolate the excitation component from the vocal tract shape. This section focuses on using inverse filtering to transform normal speech into a pseudo-whisper by removing the formant structure of voiced sounds and flattening the spectrum [23,24,25,26].
Linear Predictive Coding (LPC) [27] is a method that approximates the spectral envelope of a speech signal using a finite number of coefficients. The speech signal (s[n]) can be modeled as a linear combination of its past (p) samples plus an error term (e[n]):
$$s[n] = \sum_{i=1}^{p} a_i\, s[n-i] + e[n],$$
where:
  • s[n] is the current value of the signal;
  • $a_i$ are the coefficients of the model;
  • $s[n-i]$ are the previous values of the signal up to order p;
  • e[n] is the error term or white noise at time n.
LPC coefficients can be calculated using different methods, including the autocorrelation method and Burg’s method [28]. Burg’s method estimates LPC coefficients by minimizing the forward and backward prediction errors while ensuring the model’s stability.
The vocal tract is modeled as an all-pole filter, represented as a transfer function H(z) in the z-domain, where a i are the coefficients and p is the order of the system:
$$H(z) = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}}$$
This filter captures the resonance characteristics or formants of the vocal tract [29,30]. The inverse filter is designed to remove the effects of the vocal tract filter:
$$IF(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}$$
By applying this filter to the original signal, the formant structure is removed, resulting in a flattened spectrum that emphasizes the excitation signal. From a signal processing viewpoint, the steps include frame blocking and windowing. First, the speech signal is divided into overlapping frames (25 ms with 10 ms overlap). Then, a window function (Hamming) is applied to each frame. For each frame, the LPC coefficients are computed using Burg’s method. The inverse filter IF(z) is then applied to each frame:
$$e[n] = s[n] - \sum_{i=1}^{p} a_i\, s[n-i]$$
Filtering can be conducted in both the time domain and the spectral domain. In this study, the signals were convolved directly with the inverse filter in the time domain. Alternatively, in the spectral domain, the signals should be transformed to the frequency domain using FFT, multiplied with the inverse filter, and then transformed back.
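A minimal sketch of this time-domain procedure is given below; it is not the authors' implementation. It frames the signal (25 ms frames, 10 ms overlap), applies a Hamming window, computes the prediction-error (inverse) filter with librosa.lpc, which uses Burg's method, filters each frame, and reconstructs the pseudo-whisper by overlap-add. The LPC order of 16 and the overlap-add normalization are assumptions.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def pseudo_whisper(signal, sr, lpc_order=16, frame_ms=25, overlap_ms=10):
    """Frame-by-frame LPC inverse filtering of a float waveform
    (e.g., loaded with librosa.load)."""
    frame_len = int(sr * frame_ms / 1000)
    hop = frame_len - int(sr * overlap_ms / 1000)   # 25 ms frames, 10 ms overlap
    window = np.hamming(frame_len)
    out = np.zeros(len(signal))
    win_sum = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        if not np.any(frame):
            continue                                 # skip silent frames
        a = librosa.lpc(frame, order=lpc_order)      # inverse-filter polynomial via Burg's method
        residual = lfilter(a, [1.0], frame)          # apply IF(z): prediction error e[n]
        out[start:start + frame_len] += residual
        win_sum[start:start + frame_len] += window
    win_sum[win_sum == 0] = 1.0
    return out / win_sum                             # approximate overlap-add normalization
```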
The order (p) of the LPC filter affects the accuracy of the vocal tract model and the resulting spectrum after inverse filtering. Higher-order filters capture more details of the formant structure, while lower-order filters are sometimes more appropriate because they maintain some signal characteristics without fully flattening them. The LPC model of the vocal tract effectively represents the resonant frequencies. By inverse filtering, these resonances are removed, leading to a pseudo-whisper [26]. For a sample speech signal, the spectra before and after inverse filtering are depicted in Figure 3.
Inverse filtering using LPC effectively removes the vocal tract resonances, creating a flattened spectrum and transforming speech into a pseudo-whisper. This technique offers significant utility in various speech processing applications, including audio (speech) data augmentation.
The research presented in [24] analyzes the Long-Term Average Spectrum (LTAS) of both normal speech and whisper. It demonstrates, through the calculation of cepstral distance, that the spectra of speech and whisper exhibit greater similarity after inverse filtering. Consequently, the mismatch between normal speech and whisper is reduced.
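One way to reproduce such a comparison is sketched below; the exact LTAS estimation and cepstral distance definition used in [24] may differ, so this is only an assumed formulation.

```python
import numpy as np
import librosa

def ltas_cepstral_distance(x, y, n_fft=512, hop_length=160, n_coeffs=20):
    """Compare two signals via the cepstra of their Long-Term Average Spectra.
    The Euclidean distance over the first n_coeffs real-cepstrum coefficients
    is an assumed distance definition."""
    def ltas_cepstrum(sig):
        mag = np.abs(librosa.stft(sig, n_fft=n_fft, hop_length=hop_length))
        log_ltas = np.log(mag.mean(axis=1) + 1e-10)   # long-term average log-magnitude spectrum
        return np.fft.irfft(log_ltas)[:n_coeffs]      # real cepstrum of the LTAS
    return np.linalg.norm(ltas_cepstrum(x) - ltas_cepstrum(y))
```

Computing this distance for normal/whisper pairs before and after inverse filtering would qualitatively reproduce the reported reduction in mismatch.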

5. Materials and Methods

5.1. Natural Speech Databases

In this study, two speech databases containing whisper utterances were employed. The first is the Whi-Spe database, used in its full capacity [31]. Whispered utterances from this database were utilized for training and augmentation, while normal (neutral) utterances were used for inverse filtering (creating pseudo-whisper samples). Recordings from the Whi-Spe dataset consist of fifty isolated words spoken by ten speakers (5 female and 5 male). Each word was repeated 10 times, both in normal and whispered phonation. As a result, the overall size of the database encompasses 10,000 recordings, which is roughly equivalent to two hours of speech. The recordings were carried out under controlled laboratory conditions. The speech samples were digitized with a sampling rate of 22,050 Hz and 16 bits per sample, in mono, linear PCM WAV format. More details regarding the vocabulary, segmentation procedure, and quality control of this database can be found in [31].
The second database, known as DBtest, was created in a real-world setting to evaluate the robustness of the ASR systems. Ten volunteer students (5 males and 5 females, as in Whi-Spe) uttered each word from the Whi-Spe lexicon twice in a whispered voice. The recordings were made using a laptop with an integrated microphone in various rooms, with an ambient noise Sound Pressure Level (SPL) of approximately 35 dBA. The database is publicly available in the repository [32]. To ensure consistency, all recordings from the two databases were resampled to 16 kHz and 16 bits per sample. Furthermore, the audio files in both databases follow the same naming convention.

5.2. Augmented Speech Databases

Based on the Whi-Spe database comprising whisper and normal phonation, we created 22 artificial (augmented) datasets for experimental purposes. Augmented datasets were generated using MATLAB® R2021b [33]. To examine the influence of three different augmentation techniques and their combinations, 14 augmented datasets were created. The flow diagram for the generation of augmented speech databases is depicted in Figure 4.
An additional 8 augmented datasets were generated to examine the influence of the number of augmentations on the performance of the recognizer. The flow diagram for the generation of augmented speech databases using parallel augmentations is depicted in Figure 5.

5.3. Automatic Speech Recognition Systems

In this research study, the recognition performance using two types of classifiers was analyzed:
  • Hidden Markov Models (HMM), and
  • Convolutional Neural Networks (CNN).
The subsequent subsections will provide a more detailed explanation of both classifiers.

5.3.1. HMM-Based ASR System

The conventional ASR methodology relies on Hidden Markov Models (HMM) with Gaussian Mixture Models (GMM). Three modeling units are usually considered for the recognition of isolated words: context-independent phonemes (monophones), context-dependent phonemes (biphones or triphones), and whole words. For whispered speech recognition on the Whi-Spe database, monophone models have demonstrated the highest robustness [34].
The most important configuration parameters in the feature extraction procedure are summarized in Table 2. The feature vectors were extracted using the HTK software (version 3.4.1) [35].
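The features were extracted with HTK; purely for illustration, the librosa-based sketch below approximates the Table 2 configuration (HTK differs in details such as cepstral liftering and the filterbank implementation).

```python
import numpy as np
import librosa

def mfcc_features(y, sr=16000):
    """Approximate the Table 2 front-end: pre-emphasis 0.97, Hamming window,
    24 ms frames with an 8 ms shift, 26 mel channels, 13 MFCCs + deltas, CMS."""
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])             # pre-emphasis
    win = int(0.024 * sr)                                   # 24 ms window
    hop = int(0.008 * sr)                                   # 8 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512,
                                win_length=win, hop_length=hop,
                                window="hamming", n_mels=26)
    mfcc -= mfcc.mean(axis=1, keepdims=True)                # cepstral mean subtraction
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2])                 # 39 coefficients per frame
```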
The developed ASR system utilized HMM models with continuous density GMMs and diagonal covariance matrices. Models of monophones consist of 5 states (3 emitting states) with a left-to-right topology and without skips, as depicted in Figure 6.
The models are initialized with the global mean and variance (flat-start initialization). Baum-Welch re-estimation was performed for five cycles, and the number of mixtures was gradually increased up to 16. Phone-level transcription involved 32 monophones, encompassing the 30 Serbian letters, the schwa phoneme, and silence. Although the schwa phoneme does not have a letter in Serbian, it is marked separately when it accompanies the letter /r/. During the testing phase, the Viterbi decoder was employed to estimate the most probable state sequence. The ASR system was developed using the HTK toolkit, while MATLAB (version R2016a) was used for generating scripts and configuration files, phonetic transcription, and logging the performance of the ASR system, which was evaluated within HTK.

5.3.2. CNN-Based ASR System

Even though Convolutional Neural Networks were originally designed for image recognition tasks, recent studies have demonstrated promising performance in speech recognition, particularly for isolated and control words [36].
A typical CNN model architecture consists of the following components:
  • Input layer,
  • Convolution layer with an activation function,
  • Pooling layer, and
  • Fully connected layer.
The most important parameters for the CNN model are summarized in Table 3.
To normalize the input data (to values between 0 and 1), the following transformation was applied:
$$\mathrm{norm}(x) = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
This is necessary because deep learning algorithms perform better when the input data are normalized. Various algorithms are available for training neural networks, and the SGD algorithm and its variants are commonly used for training CNN models [37].
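The exact implementation is not listed in the paper; as an illustration only, the following Keras sketch instantiates an architecture consistent with Table 3. The input arrangement (18 × 39 features treated as a single-channel image) and the kernel and pooling sizes are assumptions.

```python
import tensorflow as tf

NUM_WORDS = 50              # Whi-Spe vocabulary size
INPUT_SHAPE = (18, 39, 1)   # 18 windows x 39 MFCCs; the 2-D arrangement is an assumption

def build_cnn():
    """CNN consistent with Table 3: two convolution layers (32 and 64 filters),
    ReLU, max pooling, Adam optimizer, categorical cross-entropy."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=INPUT_SHAPE),
        tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(NUM_WORDS, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training for the 10 epochs listed in Table 3, with min-max normalized inputs:
# model = build_cnn(); model.fit(x_train, y_train, epochs=10, validation_split=0.1)
```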
The speaker-independent mode was analyzed using a leave-one-speaker-out cross-validation approach: the held-out speaker was reserved for testing, while the remaining data were split into 90% for training and 10% for validation. To ensure a reliable evaluation of the performance, the experiments were repeated 10 times, and the average accuracy was calculated for each speaker.
After training both models, the training histories were obtained, in which training and validation losses were plotted against the epochs. Figure 7 shows the training history of the model. As shown, the training process was completed correctly, as the validation loss decreased along with the training loss across the epochs.
The speech signal is analyzed on a short-time basis, where it is divided into overlapping frames, and a feature vector is computed for each frame. The window size is typically between 20 and 30 ms, with consecutive frames shifted between 10 and 15 ms. Consequently, utterances of varying durations have different numbers of feature vectors. To mitigate this limitation, two approaches are considered:
  • The window size is proportional to the frame period, $w_s = K \cdot f_p$, where K is a constant for all utterances (Figure 8).
  • The window length is fixed, but K is dynamically changed.
Research studies, such as the one presented in [38], have shown better results in the recognition of isolated words using the first approach. The best results in recognizing words from the Whi-Spe database (in the speaker-dependent case) were achieved using 18 windows per utterance [39]. Therefore, in our experiments, segmentation into 18 overlapping windows with K = 3 was employed. Speech parameterization used a 39-dimensional MFCC feature vector per window (13 static + Δ + ΔΔ), with cepstral mean subtraction (CMS) applied. Ultimately, each utterance was represented by a vector of 702 coefficients (18 × 39). The ASR recognizer was developed in Python (version 3.9.12), utilizing the Scikit-learn package.

5.4. The Experiment Setup

The experimental setup flow chart is depicted in Figure 9 for the baseline experiments (a) and the experiments examining the impact of data augmentation techniques and the number of augmentations (b).
The accuracy is used to evaluate the performance of the recognizer. For instance, if N is the total number of utterances analyzed for a specific speaker and E is the number of incorrectly recognized utterances, the accuracy can be determined using the following equation:
$$Acc = \frac{N - E}{N} \cdot 100\ [\%]$$
Word Error Rate (WER) is a common metric of the performance and can be calculated using Equation (8):
$$WER = \frac{E}{N} \cdot 100\ [\%]$$
To ensure a more reliable evaluation of performance, a 10-fold cross-validation is conducted for every speaker. Finally, the average accuracy is calculated and employed as the final metric of the recognizer’s performance. The experiments were conducted on two laptops with similar specifications. The HMM training and testing were performed on a laptop equipped with an Intel Core i5-1235U processor, 1.30 GHz, and 8 GB of RAM. On the other hand, the laptop employed for CNN training and testing was equipped with an AMD Ryzen 5 4500U, 2.38 GHz, and 8 GB RAM.

6. Results

This section is structured in the following manner. Section 6.1 presents the results and discussion of the baseline experiments. The purpose of these experiments is to evaluate the recognition performance for two natural speech databases: Whi-Spe and DBtest, where the training is performed on the Whi-Spe database.
In Section 6.2, we investigate the impact of augmentation strategies on the ASR-recognizer (HMM and CNN) accuracy for the DBtest dataset.
In Section 6.3, we investigate the impact of the number of augmentations on the ASR-recognizer accuracy (HMM and CNN) for the DBtest dataset.
Section 6.4 delves into the examination of the cumulative impact of augmentation strategies and the number of augmentations on the enhancement of relative Word Error Rate (WER).

6.1. Baseline Experiments

The main objective of the baseline experiments was to analyze the speaker-independent recognition of a whisper in two distinct scenarios:
  • Closed set—both training and testing were carried out using the Whi-Spe database.
  • Open set—training was completed using the Whi-Spe database while testing was carried out with the DBtest database.
The training/test split for speaker-independent recognition is based on individual speakers, ensuring that utterances from the test speaker are not included in the training set. The accuracy results are presented in Figure 10 (HMM) and Figure 11 (CNN), showing the impact of varying percentages of the database used for the training, ranging from 10% to 100% with a 10% increment. The results of experiments for each speaker are given in the Supplementary Document accessible in the repository [32].
Figure 10 illustrates that the closed-set recognition for HMM achieved a final accuracy of 95.02% after full-capacity training. In contrast, the accuracy dropped significantly to 91.61% for the DBtest database. Bearing in mind that DBtest was not recorded in a laboratory, such a drop in performance is expected. The most notable performance increase was observed between 10% and 20% of the training database, followed by a saturation effect and minor fluctuations after 30%.
The CNN recognition showed lower performance than HMM, as shown in Figure 11. The accuracy for full-capacity training was 92.46% (Whi-Spe) and 84.14% (DBtest). In contrast to HMM, the CNN scenario did not achieve performance saturation for the DBtest database. The insufficient size of the training database may have contributed to the poor performance of CNN.
The same experimental setup was applied to normal speech from the Whi-Spe database. Compared to whispered speech, the closed-set recognition of normal speech in the baseline experiments was more successful, reaching an accuracy of 98.07% (HMM) and 92.80% (CNN) [40].

6.2. The Impact of Data Augmentation Strategies

Of the several augmentation techniques outlined in Section 3, three individual methods and their combinations (2-way and 3-way) were examined in this study, resulting in a total of seven augmentation strategies:
  • Pitch Shift (PS),
  • Time Stretch (TST),
  • Volume Control (VC),
  • PS + TST,
  • TST + VC,
  • PS + VC, and
  • PS + TST + VC.
The probability of applying the augmentation method was set to 0.5 (see Figure 2a). The sequential augmentations had the following parameter ranges:
  • PS—Semitone Shift Range: [−2, 2]
  • TST—Speedup Factor Range: [0.8, 1.2]
  • VC—Volume Gain Range: [−3 dB, 3 dB]
The values (real numbers) for the augmentation parameters are randomly assigned within these specified ranges. It is worth noting that adding noise and time shifting were excluded from the analysis due to their lack of positive influence on the performance of the ASR system.
The average word recognition accuracy for the speech database DBtest can be found in Table 4 (HMM) and Table 5 (CNN). The codes for experiments are accessible in [32]. In these experiments, the speech database used for augmentation is of the same size as the original one (sequential augmentation, as illustrated in Figure 2).
Moreover, the analysis included training solely with augmented speech samples (W—whisper augmented and P—pseudo-whisper augmented), as well as the fusion of the original and augmented datasets (OW, OP, and OWP rows).
The results obtained from training using solely augmented utterances (W and P rows in Table 4 and Table 5) indicate the following:
  • When augmentation was applied to whispered utterances (W row), the performance of the HMM recognizer slightly weakened, while it remained comparable for the CNN recognizer.
  • When augmentation was applied to pseudo-whisper utterances only (P row), the performance significantly worsened for both the HMM and CNN recognizers.
When training was applied to both the original and augmented datasets (OW, OP, and OWP rows in Table 4 and Table 5), the performance was notably improved for the CNN recognizer for all techniques. At the same time, only particular combinations of scenario and augmentation method led to performance improvement for the HMM recognizer. The significance of the inverse filtering algorithm is evident when comparing the OWP and OW rows in Table 4 and Table 5. In every augmentation scenario, adding pseudo-whisper utterances to the training set resulted in a marked improvement in accuracy for both recognizers.
To evaluate the statistical significance of the obtained results in the cases where augmented datasets are added to the original, a one-tailed Wilcoxon test was performed for the following pair of hypotheses (HA):
  • Hypothesis H0A: “Both recognizers (initial recognizer when training was applied using only original samples and the other with augmentation—OW/OP/OWP) produce the same accuracy”.
  • Hypothesis H1A: “Recognizer with augmentation—OW/OP/OWP gives higher accuracy”.
The p-values for the H0A/H1A hypothesis pair are provided in Table 6; the significance level was set to α = 0.05.
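A minimal sketch of how such a test can be run with SciPy is given below, assuming the paired (signed-rank) form of the Wilcoxon test; the per-speaker accuracies shown are placeholders (the measured values are in the Supplementary Document).

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-speaker accuracies (ten speakers) for the baseline recognizer
# and the same recognizer trained with an augmented dataset (OW/OP/OWP).
acc_baseline  = np.array([91.6, 90.2, 92.4, 89.8, 93.0, 91.1, 90.7, 92.9, 90.0, 91.9])
acc_augmented = np.array([92.3, 90.8, 92.9, 90.5, 93.6, 91.0, 91.5, 93.4, 90.9, 92.5])

# One-tailed test of H1A ("the recognizer with augmentation gives higher accuracy").
stat, p_value = wilcoxon(acc_augmented, acc_baseline, alternative="greater")
print(f"p = {p_value:.3f} ->",
      "reject H0A" if p_value < 0.05 else "fail to reject H0A", "at alpha = 0.05")
```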
As can be seen from Table 6, statistically significant improvement ( p < 0.05 ) in the accuracy for the HMM recognizer is obtained only in the OWP case for TST, VC, and TST + VC augmentation strategies. Statistically significant improvement is obtained for all techniques in the OWP and OP cases for the CNN recognizer.
Processing time is a crucial metric for evaluating the efficiency during the training phase and real-time processing capabilities in the decoding phase of ASR systems. The Real-Time Factor (RTF) measures the speed of ASR during decoding, defined as the ratio between the processing time and the input duration. Figure 12 shows the training time and RTF for the best augmentation strategies (VC for HMM and PS + VC for CNN).
The results obtained suggest the following conclusions:
  • Expanding the training dataset by incorporating augmented utterances significantly increases the training time but improves accuracy, particularly for the CNN recognizer.
  • The RTF for the HMM recognizer is considerably higher and remains relatively stable, even with the expansion of the training dataset.

6.3. The Impact of the Number of Augmentations

As shown in Figure 2b in Section 3, parallel augmentation enables enlargement of the augmented dataset. By defining the number of augmentations (N), we obtain an N-times larger augmented dataset. The objective of this subsection is to examine the impact of N on the performance of the recognizers. Because of the larger training corpus, the number of mixtures for the HMM recognizer was increased from 16 to 32.
Figure 13 depicts the average accuracy of HMM and CNN recognizers for the number of augmentations from one to five. The best augmentation strategy from the previous subsection (VC for HMM and PS + VC for CNN in the OWP scenario) was exploited. The results of particular experiments for each speaker are given in the Supplementary Document accessible in the repository [32].
As evident from Figure 13, unlike the CNN, the HMM recognizer shows a small drop in accuracy after N = 2. Thus, enlarging the training set beyond that point does not lead to performance improvement for the HMM recognizer.
On the other hand, the accuracy of the CNN recognizer notably increases as N grows. When N goes from 1 to 3, the accuracy increases by roughly 1 percentage point (from 88.06% to 89.03%). Afterward, the gains diminish, with a tendency toward performance saturation. The final examined case (N = 5) gives an accuracy of 89.45%. The next subsection deals with the relative WER improvement.

6.4. WER Improvement

The relative WER improvement (i.e., WER reduction) with respect to the cumulative effect of augmentation techniques (VC for HMM and PS + VC for CNN), inverse filtering, and parallel augmentations is depicted in Figure 14. As can be seen, the contribution of augmentation techniques and inverse filtering to the WER improvement is 24.7% for CNN and 5.7% for HMM. Further increment in the number of augmentations leads to the WER improvement for the CNN recognizer and reaches 33.5% for N = 5.
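As a check of these figures using the values reported in Table 5 and Section 6.3 for the CNN recognizer (baseline accuracy 84.14%, best augmentation 88.06% for N = 1 and 89.45% for N = 5):
$$\mathrm{WER}_{\mathrm{base}} = 100 - 84.14 = 15.86\%, \quad \mathrm{WER}_{N=1} = 100 - 88.06 = 11.94\%, \quad \mathrm{WER}_{N=5} = 100 - 89.45 = 10.55\%$$
$$\frac{15.86 - 11.94}{15.86} \cdot 100\% \approx 24.7\%, \qquad \frac{15.86 - 10.55}{15.86} \cdot 100\% \approx 33.5\%$$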
For the HMM recognizer, the WER improves only when N is increased from 1 to 2. Further increments lead to WER deterioration.
It should be emphasized that adding pseudo-whisper samples obtained through inverse filtering from normal speech contributes significantly to the WER improvement, especially for the CNN recognizer.
It is crucial to bear in mind that the accuracy of ASR systems is highly dependent on factors such as the size of the speech database employed in training (including the number of speakers and utterances), perplexity, and the type of speech (spontaneous, conversational, read, isolated-word, etc.). Therefore, conducting direct and consistent comparisons between different studies may not always be feasible.
To the best of our knowledge, there are only a few research studies that report WER improvement through the utilization of data augmentation and pseudo-whisper generation methods to enhance speaker-independent whisper recognition.
The results of this research demonstrate a greater relative improvement compared to the state-of-the-art studies carried out on the wTIMIT speech corpus [41], outperforming the improvements reported in [15] (23%) and [18] (18%). However, to enable a direct comparison, it is important to assess our methodology on the wTIMIT corpus, which can be explored in future studies.

7. Conclusions

The motivation behind the research study presented in this paper is the increasing necessity to enhance human-machine speech communication, including variations in vocal effort. Given the substantial acoustic mismatch between normal and whispered speech, achieving reliable and accurate speaker-independent recognition of whispered speech in real-world scenarios poses a significant challenge for researchers.
Recent research studies have demonstrated the viability of utilizing data augmentation strategies to enhance the robustness of ASR systems, particularly those utilizing deep learning algorithms. The efficacy of traditional augmentation techniques such as pitch shifting, time stretching, volume control, and their combinations is assessed. Furthermore, the potential of generating pseudo-whispers through inverse filtering algorithms is explored.
The findings of this study suggest that data augmentation can be successfully utilized to enhance the performance of isolated-word whispered speech recognition systems. The conducted experiments have shown that the success of the recognizer, besides traditional augmentation methods, can be additionally improved by an inverse filtering strategy for both HMM and CNN recognizers. Conducted statistical tests confirmed the significance of improvement in performance.
Future work includes enlarging the whispered speech dataset used for training, as well as evaluating the examined algorithms on other currently available whisper datasets.

Supplementary Materials

The supplementary documents can be downloaded at: https://github.com/jovan81etf/whisper (accessed on 26 July 2024).

Author Contributions

Data curation, J.G., B.M. and S.Š.; Funding acquisition, B.P.; Methodology, J.G. and Đ.G.; Resources, J.G. and S.Š.; Software, J.G.; Visualization, J.G. and Đ.G.; Writing—original draft, J.G., Đ.G. and B.P.; Writing—review and editing, J.G., B.M., Đ.G., B.P. and S.Š. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Science Fund of the Republic of Serbia, #7449, Multimodal Multilingual Human-Machine Speech Communication—AI-SPEAK.

Data Availability Statement

The dataset Whi-Spe is available on request from the authors. The speech dataset used for testing can be downloaded at: https://github.com/jovan81etf/whisper (accessed on 26 July 2024).

Acknowledgments

The authors would like to thank the students for agreeing to participate in the creation and recording of the whispered speech database used for testing purposes (DBtest).

Conflicts of Interest

Author Đorđe Grozdić was employed by the company Grid Dynamics, Belgrade, Serbia. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Lu, R.; Wei, R.; Zhang, J. Human-computer interaction based on speech recognition. Appl. Comput. Eng. 2024, 36, 102–110. [Google Scholar] [CrossRef]
  2. Vajpai, J.; Bora, A. Industrial Applications of Automatic Speech Recognition Systems. Int. J. Eng. Res. Appl. 2016, 6, 88–95. Available online: https://api.semanticscholar.org/CorpusID:42601329 (accessed on 25 April 2024).
  3. Xiong, W.; Droppo, J.; Huang, X.; Seide, F.; Seltzer, M.; Stolcke, A.; Yu, D.; Zweig, G. Achieving Human Parity in Conversational Speech Recognition. arXiv 2016, arXiv:1610.05256. [Google Scholar] [CrossRef]
  4. Zhang, C.; Hansen, J.H.L. Analysis and Classification of Speech Mode: Whispered through Shouted. In Proceedings of the 8th Annual Conference of the International Speech Communication Association (Interspeech 2007); International Speech Communication Association (ISCA), Antwerp, Belgium, 27–31 August 2007; pp. 2289–2292. [Google Scholar] [CrossRef]
  5. Jovičić, S.T.; Šarić, Z. Acoustic Analysis of Consonants in Whispered Speech. J. Voice 2008, 22, 263–274. [Google Scholar] [CrossRef] [PubMed]
  6. Bang, J.-U.; Choi, M.-Y.; Kim, S.-H.; Kwon, O.W. Automatic Construction of a Large-Scale Speech Recognition Database Using Multi-Genre Broadcast Data with Inaccurate Subtitle Timestamps. IEICE Trans. Inf. Syst. 2020, 103-D, 406–415. [Google Scholar] [CrossRef]
  7. Singh, D.K.; Amin, P.P.; Sailor, H.B.; Patil, H.A. Data Augmentation Using CycleGAN for End-to-End Children ASR. In Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 511–515. [Google Scholar] [CrossRef]
  8. Atmaja, B.T.; Sasou, A. Effects of Data Augmentations on Speech Emotion Recognition. Sensors 2022, 22, 5941. [Google Scholar] [CrossRef] [PubMed]
  9. Chatziagapi, A.; Paraskevopoulos, G.; Sgouropoulos, D.; Pantazopoulos, G.; Nikandrou, M.; Giannakopoulos, T.; Katsamanis, A.; Potamianos, A.; Narayanan, S. Data Augmentation Using GANs for Speech Emotion Recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 171–175. [Google Scholar] [CrossRef]
  10. Ko, T.; Peddinti, V.; Povey, D.; Khudanpur, S. Audio Augmentation for Speech Recognition. In Proceedings of the 16th Annual Conference of the International Speech Communication Association INTERSPEECH, Dresden, Germany, 6–10 September 2015; pp. 3586–3589. [Google Scholar] [CrossRef]
  11. Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224. [Google Scholar] [CrossRef]
  12. Fernández-Gallego, M.P.; Toledano, D.T. A Study of Data Augmentation for ASR Robustness in Low Bit Rate Contact Center Recordings Including Packet Losses. Appl. Sci. 2022, 12, 1580. [Google Scholar] [CrossRef]
  13. Ramirez, J.M.; Montalvo, A.; Calvo, J.R. A Survey of the Effects of Data Augmentation for Automatic Speech Recognition Systems. In Proceedings of the 24th Iberoamerican Congress on Pattern Recognition, CIARP 2019, Havana, Cuba, 28–31 October 2019; Nyström, I., Hernández Heredia, Y., Milián Núñez, V., Eds.; Springer: Cham, Switzerland, 2019; pp. 669–678. [Google Scholar] [CrossRef]
  14. Damania, R. Data Augmentation for Automatic Speech Recognition for Low Resource Languages; Rochester Institute of Technology: New York, NY, USA, 2021; Available online: https://repository.rit.edu/theses/10968 (accessed on 18 May 2024).
  15. Gudepu, P.R.R.; Vadisetti, G.P.; Niranjan, A.; Saranu, K.; Sarma, R.; Shaik, M.A.B.; Paramasivam, P. Whisper Augmented End-to-End/Hybrid Speech Recognition System-CycleGAN Approach. In Proceedings of the 21st Annual Conference of the International Speech Communication Association INTERSPEECH, Shanghai, China, 25–29 October 2020; pp. 2302–2306. [Google Scholar] [CrossRef]
  16. Sugiura, T.; Kobayashi, A.; Utsuro, T.; Nishizaki, H. Audio Synthesis-Based Data Augmentation Considering Audio Event Class. In Proceedings of the 10th Global Conference on Consumer Electronics (GCCE), Kyoto, Japan, 12–15 October 2021; pp. 60–64. [Google Scholar] [CrossRef]
  17. Salah Al-Radhi, M.; Gábor Csapó, T.; Zainkó, C.; Németh, G. Continuous Wavelet Vocoder-Based Decomposition of Parametric Speech Waveform Synthesis. In Proceedings of the Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; pp. 2212–2216. [Google Scholar] [CrossRef]
  18. Lin, Z.; Patel, T.B.; Scharenborg, O. Improving Whispered Speech Recognition Performance Using Pseudo-Whispered Based Data Augmentation. In Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 16–20 December 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar] [CrossRef]
  19. Swerdlin, Y.; Smith, J.; Wolfe, J. The Effect of Whisper and Creak Vocal Mechanisms on Vocal Tract Resonances. J. Acoust. Soc. Am. 2010, 127, 2590–2598. [Google Scholar] [CrossRef] [PubMed]
  20. Tartter, V.C. Identifiability of Vowels and Speakers from Whispered Syllables. Percept. Psychophys. 1991, 49, 365–372. [Google Scholar] [CrossRef] [PubMed]
  21. Maguolo, G.; Paci, M.; Nanni, L.; Bonan, L. Audiogmenter: A MATLAB Toolbox for Audio Data Augmentation. arXiv 2022, arXiv:1912.05472. [Google Scholar] [CrossRef]
  22. Ferreira-Paiva, L.; Alfaro-Espinoza, E.; Almeida, V.M.; Felix, L.B.; Neves, R.V. A Survey of Data Augmentation for Audio Classification. In Proceedings of the 24th Brazilian Congress of Automatics (CBA), Fortaleza, Brazil, 16–19 October 2022; Volume 3. [Google Scholar] [CrossRef]
  23. Grozdić, Đ.T.; Jovičić, S.T.; Subotić, M. Whispered Speech Recognition Using Deep Denoising Autoencoder. Eng. Appl. Artif. Intell. 2017, 59, 15–22. [Google Scholar] [CrossRef]
  24. Grozdić, Đ.T.; Jovičić, S.T. Whispered Speech Recognition Using Deep Denoising Autoencoder and Inverse Filtering. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2313–2322. [Google Scholar] [CrossRef]
  25. Grozdić, Đ.; Jovičić, S.T.; Šumarac Pavlović, D.; Galić, J.; Marković, B. Comparison of Cepstral Normalization Techniques in Whispered Speech Recognition. Adv. Electr. Comput. Eng. 2017, 17, 21–27. [Google Scholar] [CrossRef]
  26. Grozdić, Đ.; Jovičić, S.T.; Galić, J.; Marković, B. Application of Inverse Filtering in Enhancement of Whisper Recognition. In Proceedings of the 12th Neural Network Applications in Electrical Engineering (NEUREL), Belgrade, Serbia, 25–27 November 2014. [Google Scholar] [CrossRef]
  27. Makhoul, J. Linear Prediction: A Tutorial Review. Proc. IEEE 1975, 63, 561–580. [Google Scholar] [CrossRef]
  28. Burg, J. Maximum Entropy Spectral Analysis, Paper Presented at the 37th Meeting; Society of Exploration Geophysics: Oklahoma City, OK, USA, 1967. [Google Scholar]
  29. Rabiner, L.R.; Schafer, R.W. Digital Processing of Speech Signals; Prentice-Hall, Inc.: Englewood Cliffs, NJ, USA, 1978; ISBN 978-0132136037. [Google Scholar]
  30. Proakis, J.G.; Manolakis, D.G. Digital Signal Processing: Principles, Algorithms, and Applications; Pearson: Upper Saddle River, NY, USA, 2021; ISBN 978-0137348244. [Google Scholar]
  31. Marković, B.; Jovičić, S.T.; Galić, J.; Grozdić, Đ. Whispered Speech Database: Design, Processing and Application. In Text, Speech, and Dialogue; Habernal, I., Matoušek, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 591–598. [Google Scholar] [CrossRef]
  32. Galić, J. Github Repository. Available online: https://github.com/jovan81etf/whisper (accessed on 26 July 2024).
  33. The MathWorks, Inc. MATLAB (R2021b); Natick, MA, USA. Available online: www.mathworks.com (accessed on 7 July 2024).
  34. Galić, J.; Jovičić, S.T.; Grozdić, Đ.; Marković, B. HTK-Based Recognition of Whispered Speech. In Speech and Computer; Ronzhin, A., Potapova, R., Delic, V., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 251–258. [Google Scholar] [CrossRef]
  35. Young, S.J.; Kershaw, D.; Odell, J.; Ollason, D.; Valtchev, V.; Woodland, P. The HTK Book Version 3.4; Cambridge University Press: Cambridge, UK, 2006; Available online: http://speech.ee.ntu.edu.tw/homework/DSP_HW2-1/htkbook.pdf (accessed on 8 June 2024).
  36. Alsobhani, A.; ALabboodi, H.M.A.; Mahdi, H. Speech Recognition Using Convolution Deep Neural Networks. J. Phys. Conf. Ser. 2021, 1973, 012166. [Google Scholar] [CrossRef]
  37. Habib, G.; Qureshi, S. Optimization and Acceleration of Convolutional Neural Networks: A Survey. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 4244–4268. [Google Scholar] [CrossRef]
  38. García-Cabellos, J.M.; Peláez-Moreno, C.; Gallardo-Antolín, A.; Pérez-Cruz, F.; Díaz-de-María, F. SVM Classifiers for ASR: A Discussion about Parameterization. In Proceedings of the 12th European Signal Processing Conference, Vienna, Austria, 6–10 September 2004; pp. 2067–2070. [Google Scholar] [CrossRef]
  39. Galić, J.; Popović, B.; Šumarac Pavlović, D. Whispered Speech Recognition Using Hidden Markov Models and Support Vector Machines. Acta Politech. Hung. 2018, 15, 11–29. [Google Scholar] [CrossRef]
  40. Galić, J.; Grozdić, Đ. Exploring the Impact of Data Augmentation Techniques on Automatic Speech Recognition System Development: A Comparative Study. Adv. Electr. Comput. Eng. 2023, 23, 3–12. [Google Scholar] [CrossRef]
  41. Lim, B.P. Computational Differences between Whispered and Non-Whispered Speech. Ph.D. Thesis, University of Illinois at Urbana-Champaign, Champaign–Urbana Metropolitan Area, Champaign, IL, USA, 2011. Available online: https://hdl.handle.net/2142/24283 (accessed on 15 June 2024).
Figure 1. The waveform (a) and spectrogram (b) of the phrase “Govor šapata” spoken in Serbian (normal—capital letters; whisper—small letters). The horizontal axis represents time in seconds.
Figure 2. The architecture of sequential audio augmentation for one (a) and N (b) augmentations of the input audio signal.
Figure 3. Example of inverse filtering on a word in normal speech: (a) FFT spectrum of the word, (b) LPC spectral envelope, (c) FFT spectrum after inverse filtering, and (d) frequency response of the inverse filter IF (z).
Figure 4. The flow diagram for the generation of augmented datasets using various augmentation techniques and their combinations.
Figure 5. The flow diagram for the generation of augmented datasets by varying the number of augmentations using a single augmentation technique.
Figure 6. The topology of HMM models.
Figure 7. Training history of model: blue—training loss; orange—validation loss.
Figure 8. The segmentation of utterances using the fixed overlap factor.
Figure 9. The flow of the recognition process in (a) baseline experiments and (b) experiments testing the impact of data augmentation techniques and the number of augmentations.
Figure 10. The average recognition accuracy (in %) for Whi-Spe (closed set) and DBtest database (open set) in the HMM framework. The horizontal axis denotes the percentage of the Whi-Spe subset employed in the training.
Figure 11. The average recognition accuracy (in %) for Whi-Spe (closed set) and DBtest database (open set) in the CNN framework. The horizontal axis denotes the percentage of the Whi-Spe subset employed in the training.
Figure 12. Average training time (a) and Real Time Factor (b) for HMM and CNN recognizers. The corresponding accuracies are given in parentheses.
Figure 13. Average recognition accuracy vs. the number of augmentations for HMM and CNN recognizers.
Figure 14. Relative WER improvement compared to the baseline.
Table 1. The review of augmentation techniques.

| Ref. | Augmentation | Metric | Speech |
|------|--------------|--------|--------|
| [7] | CycleGAN, SpecAugment, Speed perturbation | Word Error Rate (WER) | Children’s |
| [8] | Glottal source extraction, Silence removal, Impulse response, Noise addition | Unweighted Average Recall (UAR) | Emotional |
| [9] | GANs, Time Stretch, Pitch Shift, Add Noise | UAR, F-score | Emotional |
| [10] | Speed perturbation | WER | Continuous |
| [11] | Simulated Room Impulse Response, Add noise | WER | Continuous |
| [12] | MP3 and Full rate GSM | WER and Mean Opinion Score (MOS) | Conversational |
| [13] | Add noise | WER | Continuous |
| [14] | Multiplying the region, Replacing the region, Input concatenation | WER | Read |
| [15] | CycleGAN | WER | Whispered |
Table 2. Configuration parameters in the feature extraction procedure.

| Parameter | Name/Value |
|-----------|------------|
| Feature vector | MFCC |
| Window | Hamming |
| Pre-emphasis | 0.97 |
| Window size/frame shift | 24/8 ms |
| Cepstrum | Magnitude |
| Number of channels | 26 |
| Number of coefficients (static + ∆ + ∆∆) | 39 (13 + 13 + 13) |
| Normalization | CMS |
Table 3. The parameters of the CNN model.

| Parameter | Name/Value |
|-----------|------------|
| Number of convolution layers | 2 |
| Number of filters | 32 and 64 |
| Activation function | ReLU |
| Pooling layer | Max |
| Number of epochs | 10 |
| Optimizer | ADAM |
| Loss function | Categorical cross-entropy |
| Metric | Accuracy |
Table 4. Average recognition accuracy (DBtest database) for seven data augmentation techniques and HMM framework. The accuracies with statistically significant improvements are bolded.

| Training Scenario | PS | TST | VC | PS + TST | PS + VC | TST + VC | PS + TST + VC |
|---|---|---|---|---|---|---|---|
| W | 90.66 | 90.66 | 91.61 | 89.06 | 90.75 | 90.98 | 89.88 |
| P | 72.16 | 71.32 | 69.51 | 72.11 | 72.06 | 72.26 | 70.84 |
| OW | 91.29 | 91.21 | 91.61 | 91.09 | 91.29 | 91.49 | 91.14 |
| OP | 91.17 | 90.91 | 91.34 | 91.06 | 91.20 | 91.09 | 90.88 |
| OWP | 91.53 | **91.91** | **92.09** | 91.37 | 91.48 | **92.06** | 91.52 |

ORIGINAL (NO AUG): 91.61
Table 5. Average recognition accuracy (DBtest database) for seven data augmentation techniques and CNN framework. The accuracies with statistically significant improvements are bolded.

| Training Scenario | PS | TST | VC | PS + TST | PS + VC | TST + VC | PS + TST + VC |
|---|---|---|---|---|---|---|---|
| W | 84.29 | 83.72 | 83.88 | 84.22 | 84.37 | 83.70 | 84.93 |
| P | 69.52 | 69.76 | 69.23 | 69.94 | 69.00 | 69.72 | 68.85 |
| OW | **85.82** | 84.98 | **85.18** | **85.62** | **85.84** | 85.02 | **85.80** |
| OP | **85.73** | **85.80** | **86.17** | **86.10** | **85.74** | **85.16** | **85.80** |
| OWP | **87.54** | **87.37** | **87.28** | **87.97** | **88.06** | **86.86** | **87.31** |

ORIGINAL (NO AUG): 84.14
Table 6. The p-value calculated for hypothesis HA in relation to seven augmentation strategies (OW—original + whisper; OP—original + pseudo-whisper; OWP—original + whisper + pseudo-whisper). The statistically significant p-values are bolded.

| Recognizer | Training Scenario | PS | TST | VC | PS + TST | PS + VC | TST + VC | PS + TST + VC |
|---|---|---|---|---|---|---|---|---|
| HMM | OW | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 | 0.903 | 1.000 |
| HMM | OP | 0.998 | 1.000 | 0.985 | 0.999 | 0.998 | 1.000 | 1.000 |
| HMM | OWP | 0.811 | **0.007** | **0.002** | 0.968 | 0.957 | **0.001** | 0.804 |
| CNN | OW | **0.010** | 0.122 | **0.027** | **0.006** | **0.021** | 0.097 | **0.004** |
| CNN | OP | **0.018** | **0.017** | **0.002** | **0.016** | **0.007** | **0.049** | **0.018** |
| CNN | OWP | **0.001** | **0.001** | **0.001** | **0.001** | **0.001** | **0.002** | **0.001** |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
