Article

Non-Intrusive Air Traffic Control Speech Quality Assessment with ResNet-BiLSTM

School of Computer Science, Civil Aviation Flight University of China, Guanghan 618307, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(19), 10834; https://doi.org/10.3390/app131910834
Submission received: 20 August 2023 / Revised: 20 September 2023 / Accepted: 27 September 2023 / Published: 29 September 2023

Abstract

In the current field of air traffic control speech, effective objective speech quality evaluation methods are lacking. This paper proposes a new network framework based on ResNet–BiLSTM to address this issue. Firstly, the mel-spectrogram of the speech signal is segmented using a sliding window technique. Next, a preceding feature extractor composed of convolutional and pooling layers extracts shallow features from each mel-spectrogram segment. Then, ResNet extracts spatial features from the shallow features while BiLSTM extracts temporal features, and these features are concatenated horizontally. Finally, the speech quality score is computed from the concatenated spatiotemporal features using fully connected layers. We conduct experiments on an air traffic control speech database and compare the objective scoring results with subjective scores. The experimental results demonstrate that the proposed method correlates highly with the mean opinion score (MOS) of air traffic control speech.

1. Introduction

Currently, air traffic management in China's civil aviation sector relies primarily on manual control, with instructions exchanged between air traffic controllers and pilots over radio. During voice interaction, factors such as noise, accent, speaking rate, intonation, and timbre can affect the transmission of speech information and its auditory perception. However, clear and accurate voice communication between controllers and pilots is crucial to flight safety. Therefore, evaluating the quality of air traffic control speech is of utmost importance.
Speech quality assessment methods can be broadly divided into two categories: subjective and objective. Subjective evaluation, which accounts for the perceptual characteristics of the human auditory system, is widely adopted for its high credibility and reliability. This method requires professionals to reference the pristine clean version of the speech signal when estimating the quality of the distorted speech signal, typically using a rating scale from 1 to 5 [1], as shown in Table 1. The average of all ratings is referred to as the mean opinion score (MOS) [2]. During evaluation, professionals consider various aspects, including noise, distortion, speech clarity, naturalness of sound, and alignment with the specific application scenario, which allows for a relatively objective assessment of speech quality.
However, subjective evaluation methods have limitations, including rater subjectivity and variability, as well as the time-consuming, costly, and non-repeatable nature of the evaluation process [3]. To address these challenges, researchers have proposed objective evaluation methods, which quantify and measure speech quality using computer algorithms or automation techniques, thus reducing the influence of subjective factors [4]. These methods can be based on acoustic features, waveform analysis, signal processing, and other techniques, using mathematical models and algorithms to assess speech quality. Compared to subjective methods, objective methods offer better repeatability and consistency, as well as higher efficiency and lower costs.
Objective evaluation methods can be classified as intrusive or non-intrusive. Intrusive models also require the pristine clean version of the speech signal as a reference to estimate the quality of the distorted signal; an example is the PESQ model proposed in ITU-T Recommendation P.862 [5,6]. In contrast, non-intrusive models rely solely on the output signal of the transmission system to estimate the MOS; an example is the P.563 model introduced by the ITU-T [7,8]. Because the pristine clean version of the speech signal is often unavailable in practical communication scenarios, non-intrusive objective evaluation, which offers real-time operation and convenience, has become a research hot spot both in China and abroad. Non-intrusive models can directly assess distorted speech signals, offering greater flexibility across application scenarios and reducing the complexity of the evaluation process [9]. This approach is therefore significant for a wide range of speech quality assessment tasks.
With the continuous development of technologies like deep learning, researchers are exploring and improving non-intrusive objective evaluation methods to enhance their accuracy and reliability. These methods leverage models such as neural networks to automatically learn speech features and patterns, enabling more precise speech quality assessment. Future research will continue to focus on the advancement of non-intrusive objective evaluation, further driving its application in practical scenarios.

Related Work

Various non-intrusive speech quality assessment methods are available in the literature. Non-intrusive methods for complex environments are proposed in [10,11]. The method in [10] employs Bayesian non-negative matrix factorization (BNMF) to calculate the fundamental spectro-temporal matrices of the target speech, integrating the resulting matrix, as a separate layer, into a deep neural network (DNN) model. A deep neural network is then trained to learn the complex mapping between the target source and the mixture signal, reconstructing the magnitude spectrogram of the quasi-clean speech. Finally, the reconstructed speech is used as the reference for a modified PESQ to estimate the MOS of the tested speech sample. The method in [11] learns an overcomplete dictionary of the clean speech power spectrum through K-singular value decomposition. During the sparse representation stage, it adaptively obtains the stopping residue based on the estimated cross-correlation and noise spectrum, adjusted by an a posteriori SNR-weighted factor, and utilizes orthogonal matching pursuit to reconstruct clean speech spectra from noisy speech. Using the quasi-clean speech as a reference, it estimates the degraded speech's MOS with the modified PESQ. Both methods entail constructing reference signals, leading to high computational costs.
Another approach to non-intrusive speech quality assessment extracts intrinsic features from the speech signal and then maps these features to the MOS. Compared to reconstructing clean speech, this approach has lower computational costs. Fu et al. [12] proposed an end-to-end non-intrusive speech quality evaluation model based on Bidirectional Long Short-Term Memory (BiLSTM) for predicting PESQ scores; however, the model has difficulty evaluating enhanced speech whose PESQ scores are lower than those of the corresponding noisy speech. Lo et al. [13] studied several neural network architectures for evaluating voice conversion quality, including a Convolutional Neural Network (CNN), BiLSTM, and CNN–BiLSTM; however, their model applies only to voice conversion systems and has limited generalization ability. Dong and Williamson [14] used a pyramid BiLSTM with an attention mechanism to predict the MOS, achieving scores close to human judgment; however, the encoder, composed of stacked pyramid BiLSTM layers, significantly increases computational complexity, making the model less suitable for low-resource speech evaluation tasks. Cauchi et al. [15] combined modulation energy features with a Long Short-Term Memory recurrent neural network to propose a non-intrusive method suited to evaluating speech enhancement algorithms under various acoustic conditions. Shen et al. [16] proposed a reference-free speech quality evaluation method based on ResNet and BiLSTM, using an attention mechanism [17] to weight the BiLSTM output for scoring, which yields scores closer to human ratings; however, the model mostly predicts MOS values between 1.5 and 4.5 and cannot effectively predict speech with very low or very high MOS.
Most of the aforementioned methods show promising results, but, to the best of our knowledge, no method exists for evaluating the quality of air traffic control speech. Inspired by the complementary advantages of CNNs and RNNs and by the powerful feature extraction capabilities of ResNet and BiLSTM, this work proposes a new air traffic control speech quality assessment method based on ResNet and BiLSTM. Because details of air traffic control speech, such as call signs, command types, and command content, are important during communication, the method, informed by extensive analysis of experimental data, preserves more of these details in three ways. Firstly, the mel-spectrogram is introduced [18] to capture dense features in the speech, and it is segmented for frame-level analysis of speech quality. Secondly, a preceding feature extractor composed of convolutional and pooling layers computes shallow features suitable for speech quality prediction from the mel-spectrogram segments. Lastly, ResNet extracts spatial features and BiLSTM extracts temporal features from the shallow features [19]; the features from both networks are concatenated and fused, effectively improving the accuracy of air traffic control speech quality evaluation.
In summary, this paper makes the following contributions:
  • We created an air traffic control speech database in a real environment, annotated with speech quality scores.
  • We proposed a new non-intrusive speech quality evaluation method based on ResNet and BiLSTM for air traffic control speech.
  • We investigated the effects of varying signal-to-noise ratios and speech rates in the air traffic control speech database on the performance of the proposed method.
In the following, Section 2 presents the proposed methodology. Subsequently, Section 3 introduces the database and performance metrics utilized in the experimental setup and the experiments conducted in this study. Finally, Section 4 presents the conclusion and future work.

2. Method

The speech quality evaluation model designed in this paper for air traffic control speech consists of four main components: a preprocessing module, a shallow feature extraction module, a spatiotemporal feature parameter processing module, and a score mapping module, as shown in Figure 1. In Module 1, the mel-spectrogram is computed from the input air traffic control speech signal and divided into several sub-segments with overlapping regions. In Module 2, a feature extractor computes shallow features suitable for speech quality prediction from the mel-spectrogram segments. In Module 3, a parallel network of ResNet and BiLSTM extracts spatiotemporal features from the shallow features, and the features output by ResNet and BiLSTM are concatenated. In Module 4, a fully connected layer with an attention mechanism estimates the quality of the air traffic control speech.

2.1. Preprocessing Module

Air traffic control speech undergoes pre-emphasis, framing, windowing, the fast Fourier transform, power spectrum calculation, and filtering through the mel-filterbank to compute the mel-spectrogram [20]. Mel-spectrograms of speech with MOS scores from 1 to 5 are illustrated in Figure 2.
The model utilizes the sliding window technique [21] to divide the input mel-spectrogram features into several overlapping sub-segments, as shown in Figure 3. Since the mel-filterbank consists of 48 bands, the height of each sub-segment is 48. The width of each sub-segment is 15 frames (150 ms at a 10 ms frame shift), and the hop size between sub-segments is 4 frames (40 ms) [22]. This segmentation allows the model to process air traffic control speech at the frame level, enabling more effective capture of the temporal features of the speech.
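To make the preprocessing concrete, the following is a minimal PyTorch/torchaudio sketch of this module. It assumes a 16 kHz input, a 10 ms frame hop, and 48 mel bands; the FFT size and the file name are illustrative assumptions rather than values taken from the paper.

```python
# Sketch of the preprocessing module (Section 2.1): mel-spectrogram plus
# sliding-window segmentation. Assumptions: 16 kHz audio, 10 ms frame hop,
# n_fft=512 (the paper does not state the FFT size).
import torch
import torchaudio

def mel_segments(waveform: torch.Tensor, sample_rate: int = 16000,
                 n_mels: int = 48, seg_width: int = 15, seg_hop: int = 4):
    """Compute a mel-spectrogram and split it into overlapping sub-segments."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=512,
        hop_length=sample_rate // 100,   # 10 ms frame hop
        n_mels=n_mels,
    )(waveform)                          # shape: (1, 48, n_frames)
    mel = torchaudio.transforms.AmplitudeToDB()(mel)
    # Sliding window along time: width 15 frames, hop 4 frames.
    segments = mel.unfold(dimension=-1, size=seg_width, step=seg_hop)
    return segments.permute(2, 0, 1, 3)  # (L, 1, 48, 15): one patch per segment

waveform, sr = torchaudio.load("atc_sample.wav")   # hypothetical file name
patches = mel_segments(waveform, sr)               # (L, 1, 48, 15)
```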

2.2. Shallow Feature Extraction Module

This module employs a preceding feature extractor consisting of convolutional and pooling layers to extract features from each mel-spectrogram segment, yielding local feature representations that better capture short-term variations in the speech signal. The preceding feature extractor consists of 6 convolutional layers, 3 max-pooling layers, and 1 fully connected layer, with the parameters outlined in Table 2. Except for the last convolutional layer, all convolutional layers use width padding. As a result, an input of size 48 × 15 passes through the sixth convolutional layer to yield an output of 64 × 6 × 1, where 64 is the number of channels. Finally, a fully connected layer compresses the features into a 384-dimensional vector [23].
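A minimal PyTorch sketch of this extractor, following Table 2, is given below. Since the paper does not fully specify the padding scheme, every convolution here uses a padding of 1 (an assumption chosen so that a 48 × 15 input reaches the stated 64 × 6 × 1 output), and the softmax listed for the fully connected layer in Table 2 is omitted for clarity.

```python
# Sketch of the shallow feature extractor (Table 2): 6 conv layers,
# 3 max-pooling layers, and a fully connected layer producing 384 features.
import torch
import torch.nn as nn

class ShallowFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                          # 48x15 -> 24x7
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                          # 24x7 -> 12x3
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                          # 12x3 -> 6x1
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),  # output: 64 x 6 x 1
        )
        self.fc = nn.Linear(64 * 6 * 1, 384)  # compress to a 384-dim vector

    def forward(self, x):                        # x: (batch, 1, 48, 15)
        return self.fc(self.conv(x).flatten(1))  # (batch, 384)
```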

2.3. Spatiotemporal Feature Parameter Processing Module

This module utilizes a parallel network of ResNet and BiLSTM for feature extraction and performs feature concatenation and fusion. Such a network structure allows for better capturing of the features and contextual information of the speech signal, providing more accurate and comprehensive high-quality features [24].

2.3.1. ResNet

ResNet is a deep-learning model proposed by He et al. [25] in 2016. It exhibits powerful capabilities in feature extraction and successfully addresses the issue of accuracy saturation or decline with increasing network depth. Figure 4 illustrates the ResNet network, which performs convolution operations in the time dimension to generate feature maps containing temporal sequence information. The ResNet model incorporates batch normalization (BN) layers, establishing correlations among batch samples [26]. As a result, during random batch processing in the network, the model’s output is not uniquely determined for a specific training sample. This characteristic helps, to some extent, in avoiding overfitting issues.
The formula for the ResNet network is shown below [27]:
$$y = f(x, W) + x, \qquad H(x) = g(y)$$
where $x$ represents the input from the previous module; $W$ is the weight matrix of the convolutional layers; $f(x, W)$ denotes the output after two convolutions of $x$; $y$ represents the residual output; $g$ refers to the ReLU activation function; and $H(x)$ represents the output of $x$ after passing through the ResNet model.
The ResNet model, applied for feature extraction in air traffic control speech, is capable of learning more robust and discriminative feature representations to capture spatial features in speech. Leveraging the advantages of deep learning, the ResNet model demonstrates good generalization performance on different databases compared to other traditional methods [28]. The parameters of each layer in ResNet in this paper are shown in Table 3.
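As an illustration of how the residual formula maps to code, the sketch below implements one residual unit in PyTorch: two stacked convolutions realize $f(x, W)$, the identity shortcut adds $x$, and ReLU plays the role of $g$. The 32-channel width and the BN placement follow Table 3; the padding, and the omission of Table 3's final 64-channel convolution, are simplifying assumptions.

```python
# Sketch of a residual unit: H(x) = g(f(x, W) + x), with g = ReLU.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.conv2(self.relu(self.bn(self.conv1(x))))  # y = f(x, W)
        return self.relu(y + x)                            # H(x) = g(y + x)
```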

2.3.2. BiLSTM

BiLSTM is based on LSTM, which is a type of recurrent neural network (RNN). The design goal of LSTM is to address the issues of vanishing and exploding gradients that commonly occur in RNNs when dealing with long sequence information [29]. LSTM introduces gate control mechanisms, including the “forget gate”, “input gate”, and “output gate” [30], which enable long-term memory when processing long sequence information. The structure of LSTM is illustrated in Figure 5.
The computation steps of LSTM are as follows. Firstly, the previous timestep's state value, $h_{t-1}$, and the current timestep's input, $x_t$, are used to calculate the values of the three gates and the candidate memory cell, $\tilde{C}_t$:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
where $x_t$ is the current input; $h_{t-1}$ is the previous timestep's state value; $\tilde{C}_t$ is the current timestep's candidate memory cell; $f$, $i$, and $o$ denote the forget, input, and output gates, respectively; $\tanh$ is the hyperbolic tangent activation function; $\sigma$ is the sigmoid activation function; $W$ are the weight matrices; and $b$ are the bias parameters.
Next, the forget gate, $f_t$; the input gate, $i_t$; the previous timestep's memory cell, $C_{t-1}$; and the candidate memory cell, $\tilde{C}_t$, are combined through element-wise multiplication and addition to update the memory cell, $C_t$:
$$C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$$
Finally, the memory cell, $C_t$, is passed through the $\tanh$ activation function and multiplied by the output gate, $o_t$, to obtain the current timestep's state value, $h_t$:
$$h_t = o_t \times \tanh(C_t)$$
Due to the temporal correlation of noise, jitter, or other disturbances affecting air traffic control speech data [31], there is a continuous impact on the quality of speech data within a certain time range, rather than just instantaneously. In the spectrogram of the speech, this temporal correlation is manifested as persistent patterns or trends of these disturbances over time, rather than a completely random distribution. Merely relying on LSTM to extract forward information features from air traffic control speech is insufficient. Therefore, this paper utilizes a BiLSTM composed of both forward LSTM and backward LSTM to integrate the contextual information of air traffic control speech data. The structure of BiLSTM is illustrated in Figure 6.
The computation process of BiLSTM involves performing forward propagation in the forward layer while simultaneously performing backward computation on the input sequence [32]. The final result is the stacking of outputs from both the forward LSTM and the backward LSTM. The design of BiLSTM in this paper is presented in Table 4.
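A minimal PyTorch sketch of this BiLSTM branch, following Table 4 (two bidirectional layers with 16 and 32 hidden units and 50% dropout between them); the 384-dimensional input size is an assumption chosen to match the shallow features described in Section 2.2:

```python
# Sketch of the BiLSTM branch (Table 4). Each layer's forward and backward
# outputs are concatenated, so hidden sizes 16 and 32 yield 32- and 64-dim
# outputs, respectively.
import torch
import torch.nn as nn

class BiLSTMBranch(nn.Module):
    def __init__(self, input_dim: int = 384):
        super().__init__()
        self.lstm1 = nn.LSTM(input_dim, 16, bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.lstm2 = nn.LSTM(32, 32, bidirectional=True, batch_first=True)

    def forward(self, x):                   # x: (batch, L segments, 384)
        h, _ = self.lstm1(x)                # (batch, L, 32)
        h, _ = self.lstm2(self.dropout(h))  # (batch, L, 64)
        return h
```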

2.4. Score Mapping Module

This module weights the output of the spatiotemporal feature parameter processing module to compute the final speech quality score. The score mapping module, shown in Figure 7, takes as input a feature matrix of size $d_{tf} \times L$, where $d_{tf}$ is the feature dimension of each processed segment and $L$ is the number of segments along the time axis. A fully connected layer produces attention scores, which are masked for zero-padding and normalized using the softmax function. The normalized scores are then multiplied with the input matrix, $y$, through matrix multiplication to obtain a weighted average feature vector, $z$. Finally, $z$ is passed through another fully connected layer to estimate the overall speech quality [33].
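A sketch of this attention pooling in PyTorch follows; the default feature width $d_{tf}$ is an illustrative assumption, and `mask` is a boolean tensor marking non-padded segments:

```python
# Sketch of the score mapping module: per-segment attention scores from a
# fully connected layer, masking of zero-padded positions, softmax weighting,
# and a final fully connected layer mapping the pooled vector z to a score.
import torch
import torch.nn as nn

class ScoreMapping(nn.Module):
    def __init__(self, d_tf: int = 128):   # d_tf: assumed feature width
        super().__init__()
        self.att = nn.Linear(d_tf, 1)
        self.out = nn.Linear(d_tf, 1)

    def forward(self, y, mask):  # y: (batch, L, d_tf); mask: (batch, L) bool
        scores = self.att(y).squeeze(-1)                   # (batch, L)
        scores = scores.masked_fill(~mask, float("-inf"))  # ignore zero-padding
        w = torch.softmax(scores, dim=-1)                  # attention weights
        z = torch.bmm(w.unsqueeze(1), y).squeeze(1)        # weighted average feature
        return self.out(z).squeeze(-1)                     # estimated quality score
```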

3. Experimental Results

3.1. Experiment Settings

3.1.1. Database

This model aims to evaluate the quality of air traffic control speech, so it was necessary to construct a database of real air traffic control speech. We collected real air traffic control speech from the air traffic control agency's voice communication system (VCS), encoded using Pulse Code Modulation (PCM). These speech samples covered a wide range of acoustic conditions, including both genders (male/female), various noise sources (ground noise, aircraft noise, weather noise, etc.), different speaking rates (slow/medium/fast), and different scenarios (ground, tower, approach, etc.). To create a database for speech quality assessment in the air traffic control domain, containing air traffic control speech and the corresponding real MOS values, we employed several strategies: voice activity detection (VAD) to extract valid audio segments, automatic speaker recognition, and manual verification. Finally, we annotated the air traffic control speech with the real MOS values. The specific process is as follows.
The air traffic control speech was collected as follows: in addition to the existing VCS, a bypass VCS was installed at each air traffic control position, and communication speech between pilots and controllers was collected simultaneously through both systems. The existing VCS was set to normal communication mode, receiving and recording the distorted speech data containing the original noise. The bypass VCS was set to monitoring mode, designed to minimize background noise and interference as much as possible while receiving and recording clean speech data that is theoretically free of disturbances such as noise. In this way, pairs of original noisy and clean speech data were collected [34].
The collected clean–distorted speech data pairs undergo preprocessing, including segmentation and speaker discrimination. The process is as follows:
  • Continuous communication speech is segmented into command speech segments using VAD, with each segment containing a single-sentence command from one speaker, while non-active speech is discarded.
  • A classification model is employed to classify command speech segments into two categories: controller and pilot speech, with pilot speech data being discarded.
The preprocessing procedure is illustrated in Figure 8.
Following these steps, we collected 50,000 pairs of air traffic control speech data from real-world environments, each pair containing the clean and distorted versions of the same air traffic control speech. Finally, the distorted speech data were annotated with MOS values for training the subsequent quality assessment model. Because obtaining the MOS through subjective evaluation is time-consuming and labor-intensive, we employed the PESQ model proposed by the ITU to derive the MOS of the distorted speech data; PESQ calculates the quality score of the distorted speech by analyzing the feature differences between the clean and distorted versions of the same air traffic control speech.
As a result, we obtained an air traffic control speech database comprising 50,000 instances of air traffic control speech with various disturbances, including noise, along with the corresponding MOS ratings. The speech is sampled at 16 kHz and can be categorized by speaker gender (male/female), signal-to-noise ratio (SNR), and speech rate, as shown in Table 5.
To verify the quality assessment results and ensure a reliable comparison, 80% of this database is used for training the air traffic control speech quality assessment model, while 20% is used for testing the model.

3.1.2. Experimental Environment

The experiments in this paper were conducted on a 64-bit Windows 10 operating system. The CPU used was an 11th Gen Intel Core i5-11300H, and the deep learning framework employed was PyTorch. During the training process, Adam [35] was used as the network optimizer, with an initial learning rate of 0.001. The batch size was set to 40, and the number of epochs was set to 500. An exponential decay learning rate [36] was utilized, where the learning rate decayed by a factor of 0.2 every 15 epochs.
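A minimal sketch of this training configuration in PyTorch; the decay described above corresponds to a step schedule, and `model` is assumed to be the assembled ResNet–BiLSTM network:

```python
# Optimizer and learning-rate schedule per Section 3.1.2 (sketch).
import torch

def configure_training(model: torch.nn.Module):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # Decay the learning rate by a factor of 0.2 every 15 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.2)
    return optimizer, scheduler
```

With a batch size of 40 and 500 epochs, `scheduler.step()` would be called once at the end of each epoch.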
The objective function proposed in [12] was introduced as the loss function for predicting speech quality. It incorporates frame-level prediction errors to obtain speech-level predictions closer to human scores, as shown in Equation (5).
$$O = \frac{1}{S}\sum_{s=1}^{S}\left[\left(\hat{X}_s - X_s\right)^2 + \frac{1}{T_s}\sum_{t=1}^{T_s}\left(\hat{X}_s - x_{s,t}\right)^2\right] \tag{5}$$
where $S$ represents the total number of speech samples; and $\hat{X}_s$, $X_s$, $T_s$, and $x_{s,t}$ denote the true MOS, the predicted MOS, the total number of frames, and the frame-level prediction for the $t$-th frame of the $s$-th speech sample, respectively.
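A sketch of this objective in PyTorch, assuming for simplicity that every utterance in the batch has the same number of frames:

```python
# Sketch of Equation (5): speech-level squared error plus the mean squared
# error between the true MOS and the frame-level predictions.
import torch

def quality_objective(true_mos, pred_mos, frame_pred):
    """true_mos, pred_mos: (S,); frame_pred: (S, T) frame-level predictions."""
    utter_term = (true_mos - pred_mos) ** 2                           # (X^_s - X_s)^2
    frame_term = ((true_mos.unsqueeze(1) - frame_pred) ** 2).mean(1)  # frame average
    return (utter_term + frame_term).mean()
```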

3.1.3. Evaluation Index

This paper adopts the root-mean-square error (RMSE) [37] and Karl Pearson’s Coefficient of Correlation (PCC) [38] as performance metrics to evaluate the model. RMSE is used to measure the magnitude of the error between the predicted and true values. PCC is used to assess the strength of the linear relationship between two variables.
Equation (6) calculates the root-mean-square error between the subjective MOS and the predicted MOS; a smaller RMSE indicates better predictive performance of the model [39].
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(a_i - b_i\right)^2} \tag{6}$$
Equation (7) calculates the correlation coefficient between the subjective MOS and the predicted MOS; the closer the absolute value of the PCC is to 1, the better the predictive performance of the model [40].
$$\mathrm{PCC} = \frac{\sum_{i=1}^{N}\left(a_i - \bar{a}\right)\left(b_i - \bar{b}\right)}{\sqrt{\sum_{i=1}^{N}\left(a_i - \bar{a}\right)^2 \sum_{i=1}^{N}\left(b_i - \bar{b}\right)^2}} \tag{7}$$
where $N$ represents the total number of speech samples in the test set, $a_i$ denotes the subjective MOS of the $i$-th speech sample, $b_i$ represents the predicted MOS of the $i$-th speech sample, and $\bar{a}$ and $\bar{b}$ represent the arithmetic means of the subjective and predicted MOS, respectively.
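For reference, both metrics can be computed directly from Equations (6) and (7); a small NumPy sketch:

```python
# RMSE and PCC between subjective MOS a and predicted MOS b.
import numpy as np

def rmse(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sqrt(np.mean((a - b) ** 2))

def pcc(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    da, db = a - a.mean(), b - b.mean()
    return np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))
```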

3.2. Contrast Experiment

This study conducted comparative experiments with the NISQA, P.563, and Quality-Net models. The P.563 model is the ITU's non-intrusive objective speech quality assessment standard, while NISQA and Quality-Net are non-intrusive speech quality prediction models proposed in recent years. All three are classic models in the field against which many speech quality assessment models have been benchmarked, so this paper compares against them as well.
Figure 9 shows the scatter plots of the subjective and objective evaluations for the four models of the air traffic control speech test set. The vertical axis represents the predicted MOS values, while the horizontal axis represents the subjective MOS values. By comparing the distribution of the scatter plots, we can evaluate the prediction consistency and accuracy of the proposed model compared to the NISQA model, the P.563 model, and the Quality-Net model.
According to Figure 9, the scatter distributions in (a), (b), and (d) lie closer to the diagonal than that in (c), indicating that the predictions of the proposed model, the NISQA model, and the Quality-Net model are more consistent with human evaluations; the differences among (a), (b), and (d) are not pronounced. To compare the four methods visually, we plotted third-order polynomial fitting curves and calculated their PCC and RMSE; the results are shown in Figure 10 and Table 6. These provide reliable quantitative indicators for comparing the performance of the different models and further distinguish the proposed model from the NISQA and Quality-Net models.
According to the results in Figure 10, it can be observed that compared to the NISQA model, the P.563 model, and the Quality-Net model, the fitting curve of the proposed model is closer to y = x. This indicates that the proposed model has a more accurate prediction of the subjective MOS.
Additionally, based on the data in Table 6, it can be observed that the proposed model exhibits higher correlation coefficients and lower root-mean-square errors compared to the other models, thus validating the effectiveness of the proposed approach. These results further support the academic contribution of this research and indicate that the proposed model has an advantage in predicting the MOS.

3.3. Impact of SNR

Figure 11 shows the PCC and RMSE of air traffic control speech with varying SNRs. Figure 11 demonstrates that the proposed ResNet–BiLSTM model maintains good performance when evaluating air traffic control speech with different SNRs.

3.4. Impact of Speech Rate

Figure 12 shows the PCC and RMSE of air traffic control speech with various speech rates. Figure 12 demonstrates that the proposed model maintains good performance when the speech rate is medium.

3.5. Discussion

By observing the experimental results, it can be noted that the predicted MOS in this study is generally lower than the subjective MOS. This phenomenon is partly attributed to the influence of using PESQ–MOS as labels for the air traffic control speech database. Since the PESQ method has certain limitations in assessing audio quality, its predictions may deviate from subjective evaluations to some extent. Therefore, training the model with such MOS labels in this study might lead to an underestimation of the subjective MOS. Further exploration and improvement are needed regarding the feasibility of using PESQ–MOS as database labels to enhance the accuracy and reliability of MOS prediction. A better solution would involve annotating the air traffic control speech database with the MOS based on the subjective evaluation methods proposed by ITU.
From Figure 9, we can observe that compared to the NISQA model, the P.563 model, and the Quality-Net model, our proposed approach performs better when predicting real MOS values in the range of 2–3.5 for air traffic control speech. This improvement can be attributed to our optimization of the shallow feature extractor, considering the impact of speaking rate on the mel-frequency spectra of air traffic control speech. Thus, the predictions made by our proposed approach are closer to the real MOS values.
As shown in Figure 12, we found that the performance of the proposed method declines significantly when the speech rate is fast. The underlying reason is that, for the same air traffic control speech under identical conditions, a faster speech rate compresses the speech along the time axis, reducing the number of spectrogram frames. The feature extraction module of the proposed method then captures more redundant information from the spectrogram, leading to deviations in the evaluation of air traffic control speech quality.

4. Conclusions

In this study, a no-reference air traffic control speech quality assessment method was developed by optimizing the details of shallow feature extraction and improving the ResNet–BiLSTM network structure. The experimental results demonstrated that the proposed model accurately estimates the quality of the air traffic control speech and shows a high consistency with subjective evaluations. This provides a feasible solution for air traffic control speech quality assessment.
In the future, we will continue our commitment to enhancing and expanding the practicality and depth of this research. Specifically, we have planned the following initiatives: Firstly, our focus will be on expanding the real air traffic control speech database to encompass a wider range of scenarios and diverse categories of speech samples. This expansion aims to improve the robustness of our model, ensuring its exceptional performance across various contexts. Secondly, we intend to develop a rating standard tailored to air traffic control speech, following the subjective evaluation method proposed by ITU, and apply this standard to rate our speech database. Through subjective evaluations, we will achieve a more accurate quantification of speech quality and validate our model’s performance in real-world scenarios. Additionally, we will continuously refine and optimize the network structure to enhance its capability to accurately predict air traffic control speech quality scores in complex speech environments. This will involve improvements in feature extraction and deep learning architectures to better capture the critical features of speech quality. Lastly, we will conduct an in-depth analysis of the specific factors that influence air traffic control speech quality and convert them into actionable metrics. This will facilitate a deeper understanding of speech quality and further improvements in this domain.

Author Contributions

Conceptualization, Y.W. and G.L.; methodology, G.L.; software, Q.F.; validation, Y.W., G.L., and Q.F.; formal analysis, Y.W.; investigation, G.L.; writing—original draft preparation, Y.W. and G.L.; writing—review and editing, Y.W., G.L., and Q.F.; project administration, Y.W.; funding acquisition, Y.W. and Q.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Key R&D Program of China (program no. 2021YFF0603904) and in part by the Fundamental Research Funds for the Central Universities (program no. ZJ2022-004).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ResNet: Residual Network
LSTM: Long Short-Term Memory
BiLSTM: Bidirectional Long Short-Term Memory
MOS: Mean Opinion Score
ITU: International Telecommunication Union
DNN: Deep Neural Network
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
SNR: Signal-to-Noise Ratio
BN: Batch Normalization
VCS: Voice Communication System
VAD: Voice Activity Detection
RMSE: Root-Mean-Square Error
PCC: Karl Pearson's Coefficient of Correlation

References

  1. ITU-T Recommendations. P.800: Methods for Subjective Determination of Transmission Quality; International Telecommunication Union: Geneva, Switzerland, 1996.
  2. ITU-T Recommendations. P.800.1: Mean Opinion Score (MOS) Terminology; International Telecommunication Union: Geneva, Switzerland, 2006.
  3. Nybacka, M.; He, X.; Gómez, G.; Bakker, E.; Drugge, L. Links between subjective assessments and objective metrics for steering. Int. J. Automot. Technol. 2014, 15, 893–907.
  4. Yang, M.; Wang, S.; Calheiros, R.N.; Yang, F. Survey on QoE assessment approach for network service. IEEE Access 2018, 6, 48374–48390.
  5. ITU-T Recommendations. P.862: PESQ—An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs; International Telecommunication Union: Geneva, Switzerland, 2001.
  6. ITU-T Recommendations. Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs; International Telecommunication Union: Geneva, Switzerland, 2007.
  7. ITU-T Recommendations. P.563: Single-Ended Method for Objective Speech Quality Assessment in Narrow-Band Telephony Applications; International Telecommunication Union: Geneva, Switzerland, 2004.
  8. ITU-T Recommendations. P.Imp563: Implementers' Guide for Recommendation ITU-T P.563; International Telecommunication Union: Geneva, Switzerland, 2006.
  9. Kumalija, E.J.; Nakamoto, Y. MiniatureVQNet: A Light-Weight Deep Neural Network for Non-Intrusive Evaluation of VoIP Speech Quality. Appl. Sci. 2023, 13, 2455.
  10. Zhou, W.; Zhu, Z. A novel BNMF-DNN based speech reconstruction method for speech quality evaluation under complex environments. Int. J. Mach. Learn. Cybern. 2021, 12, 959–972.
  11. Zhou, W.; He, Q.; Wang, Y.; Li, Y. Sparse representation-based quasi-clean speech construction for speech quality assessment under complex environments. IET Signal Process. 2017, 11, 486–493.
  12. Fu, S.W.; Tsao, Y.; Hwang, H.T.; Wang, H.M. Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM. arXiv 2018, arXiv:1808.05344.
  13. Lo, C.C.; Fu, S.W.; Huang, W.C.; Wang, X.; Yamagishi, J.; Tsao, Y.; Wang, H.M. MOSNet: Deep learning based objective assessment for voice conversion. arXiv 2019, arXiv:1904.08352.
  14. Dong, X.; Williamson, D.S. A pyramid recurrent network for predicting crowdsourced speech-quality ratings of real-world signals. arXiv 2020, arXiv:2007.15797.
  15. Cauchi, B.; Siedenburg, K.; Santos, J.F.; Falk, T.H.; Doclo, S.; Goetze, S. Non-intrusive speech quality prediction using modulation energies and LSTM-network. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1151–1163.
  16. Shen, K.; Yan, D.; Ye, Z.; Xu, X.; Gao, J.; Dong, L.; Peng, C.; Yang, K. Non-intrusive speech quality assessment with attention-based ResNet-BiLSTM. Signal Image Video Process. 2023, 17, 3377–3385.
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  18. Hamza, A.; Javed, A.R.R.; Iqbal, F.; Kryvinska, N.; Almadhor, A.S.; Jalil, Z.; Borghol, R. Deepfake audio detection via MFCC features using machine learning. IEEE Access 2022, 10, 134018–134028.
  19. Liu, Y.; Xiang, H.; Jiang, Z.; Xiang, J. A Domain Adaptation ResNet Model to Detect Faults in Roller Bearings Using Vibro-Acoustic Data. Sensors 2023, 23, 3068.
  20. Li, F.; Lu, Z.; Tang, J.; Zhang, W.; Tian, Y.; Cui, Z.; Jiang, F.; Li, H.; Jiang, S. Rotating Machinery State Recognition Based on Mel-Spectrum and Transfer Learning. Aerospace 2023, 10, 480.
  21. Suresh, V.; Janik, P.; Rezmer, J.; Leonowicz, Z. Forecasting Solar PV Output Using Convolutional Neural Networks with a Sliding Window Algorithm. Energies 2020, 13, 723.
  22. Mittag, G.; Möller, S. Quality Degradation Diagnosis for Voice Networks—Estimating the Perceived Noisiness, Coloration, and Discontinuity of Transmitted Speech. In Proceedings of INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 3426–3430.
  23. Mittag, G.; Möller, S. Non-intrusive speech quality assessment for super-wideband speech communication networks. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 7125–7129.
  24. Purwins, H.; Li, B.; Virtanen, T.; Schlüter, J.; Chang, S.Y.; Sainath, T. Deep learning for audio signal processing. IEEE J. Sel. Top. Signal Process. 2019, 13, 206–219.
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  26. Hou, R.; Ma, B.; Chang, H.; Gu, X.; Shan, S.; Chen, X. IAUnet: Global context-aware feature learning for person reidentification. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4460–4474.
  27. Sun, X.; Fu, J.; Wei, B.; Li, Z.; Li, Y.; Wang, N. A Self-Attentional ResNet-LightGBM Model for IoT-Enabled Voice Liveness Detection. IEEE Internet Things J. 2022, 10, 8257–8270.
  28. Zhang, S.; Kong, J.; Chen, C.; Li, Y.; Liang, H. Speech GAU: A single head attention for mandarin speech recognition for air traffic control. Aerospace 2022, 9, 395.
  29. Al-Selwi, S.M.; Hassan, M.F.; Abdulkadir, S.J.; Muneer, A. LSTM Inefficiency in Long-Term Dependencies Regression Problems. J. Adv. Res. Appl. Sci. Eng. Technol. 2023, 30, 16–31.
  30. Girirajan, S.; Pandian, A. Acoustic Model with Hybrid Deep Bidirectional Single Gated Unit (DBSGU) for Low Resource Speech Recognition. Multimed. Tools Appl. 2022, 81, 17169–17184.
  31. Liu, T.; Wang, C.; Li, Z.; Huang, M.C.; Xu, W.; Lin, F. Wavoice: A mmWave-Assisted Noise-Resistant Speech Recognition System. ACM Trans. Sens. Netw. 2023, accepted.
  32. Han, T.; Zhang, Z.; Ren, M.; Dong, C.; Jiang, X.; Zhuang, Q. Speech Emotion Recognition Based on Deep Residual Shrinkage Network. Electronics 2023, 12, 2512.
  33. Mittag, G.; Naderi, B.; Chehadi, A.; Möller, S. NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets. arXiv 2021, arXiv:2104.09494.
  34. Burczyk, R.; Cwalina, K.; Gajewska, M.; Magiera, J.; Rajchowski, P.; Sadowski, J.; Stefanski, J. Voice Multilateration System. Sensors 2021, 21, 3890.
  35. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  36. Li, Z.; Arora, S. An Exponential Learning Rate Schedule for Deep Learning. arXiv 2019, arXiv:1910.07454.
  37. Martinez, A.M.C.; Spille, C.; Roßbach, J.; Kollmeier, B.; Meyer, B.T. Prediction of Speech Intelligibility with DNN-Based Performance Measures. Comput. Speech Lang. 2022, 74, 101329.
  38. Chen, X.; Wang, X.; Zhong, B.; Yang, J.; Shang, Y. Bispectral Feature Speech Intelligibility Assessment Metric Based on Auditory Model. Comput. Speech Lang. 2023, 80, 101492.
  39. Ye, Z.; Chen, J.; Yan, D. Residual-Guided Non-Intrusive Speech Quality Assessment. arXiv 2022, arXiv:2203.11499.
  40. Manocha, P.; Kumar, A. Speech Quality Assessment through MOS Using Non-Matching References. arXiv 2022, arXiv:2206.12285.
Figure 1. The proposed model architecture for evaluating the quality of air traffic control speech. Our network architecture consists of four modules: the mel-spectrogram segmentation module, shallow feature extraction module, spatiotemporal feature parameter processing module, and score mapping module.
Figure 2. Spectrograms of different MOSs. (a) MOS = 1, (b) MOS = 2, (c) MOS = 3, (d) MOS = 4, and (e) MOS = 5.
Figure 3. Mel-spectrogram segmentation.
Figure 4. Structure of the ResNet network.
Figure 5. Structure of the LSTM network.
Figure 6. Structure of the BiLSTM network.
Figure 7. Structure of score mapping.
Figure 8. Preprocessing procedure.
Figure 9. Performance of each model. (a) Ours, (b) NISQA, (c) P.563, and (d) Quality-Net.
Figure 10. The third-order fitting curves of the four models.
Figure 11. Prediction results of air traffic control speech database with different SNRs.
Figure 12. Prediction results of air traffic control speech database with different speech rates.
Table 1. Speech quality rating scale.

Rating | Speech Quality | Level of Distortion
5 | Excellent | Imperceptible
4 | Good | Just perceptible, but not annoying
3 | Fair | Perceptible and slightly annoying
2 | Poor | Annoying, but not objectionable
1 | Bad | Very annoying and objectionable
Table 2. Details of the shallow feature extraction network. "-" indicates nonexistent.

Network Layer | Kernel | Stride | Channel | Activation Function
Conv | 3 | 1 | 16 | ReLU
Max-Pooling | 2 | 2 | - | -
Conv | 3 | 1 | 32 | ReLU
Max-Pooling | 2 | 2 | - | -
Conv | 3 | 1 | 64 | ReLU
Conv | 3 | 1 | 64 | ReLU
Max-Pooling | 2 | 2 | - | -
Conv | 3 | 1 | 64 | ReLU
Conv | 3 | 1 | 64 | ReLU
Fully connected | - | - | - | Softmax
Table 3. Details of the ResNet network.

Network Layer | Kernel | Stride | Channel | Activation Function
Conv | 3 | 1 | 32 | ReLU
BN | - | - | - | -
Conv | 3 | 1 | 32 | ReLU
Conv | 3 | 1 | 64 | ReLU
Table 4. Details of the BiLSTM network.

Network Layer | Unit
BiLSTM | 16
Dropout 50% | -
BiLSTM | 32
Table 5. Classifications and proportions of air traffic control speech database.

Category | Classification | Proportion
Gender | Male | 50%
Gender | Female | 50%
SNR | >15 dB | 10%
SNR | 10–15 dB | 40%
SNR | 5–10 dB | 40%
SNR | <5 dB | 10%
Speech rate | Slow | 25%
Speech rate | Medium | 50%
Speech rate | Fast | 25%
Table 6. Performance comparison of four methods.

Method | PCC | RMSE
ResNet–BiLSTM (Ours) | 0.87 | 0.68
NISQA | 0.85 | 0.95
P.563 | 0.81 | 1.33
Quality-Net | 0.82 | 0.89

