1. Introduction
The airspace operating environment is becoming increasingly complex and air traffic flow continues to grow, so problems related to air transportation efficiency and flight safety are becoming more serious [1], demanding higher standards for air traffic control. Air traffic control relies primarily on radio communication between air traffic controllers and pilots: controllers transmit instructions, pilots read them back, and controllers confirm the readback before the pilots execute the instructions. Because voice communication is the first step in executing air traffic control directives, clear and accurate communication between controllers and pilots is crucial to flight safety [2]. According to statistical data released by the European Organization for the Safety of Air Navigation and the European Telecommunications Standards Institute for Aviation, as many as 30% of aviation incidents are related to misheard voice instructions [3].
During flights and air traffic control procedures, communication between controllers and pilots is often disturbed because of restricted communication conditions. These disturbances originate partly from external environmental sources, such as cabin instrument noise, aircraft engine roar, and air traffic control center noise, and partly from the communication process itself, including radio noise and high-frequency communication noise. As a result, conversations between the two parties frequently suffer from poor quality, leading to unsafe occurrences such as unclear semantic expression, misinterpretations, and missed information. Research on speech enhancement for air traffic control communication is therefore of significant practical importance. Such research helps controllers convey instructions clearly, helps pilots understand these instructions, enhances flight safety, reduces communication issues and risks, and promotes more effective air traffic management.
Traditional speech enhancement methods can be broadly categorized into those based on time–frequency (T-F) domain filtering and those based on statistical models. T-F domain-filtering methods include spectral subtraction [4], subspace methods [5], Wiener filtering [6], and minimum mean square error estimation [7]. These techniques assume that the noise is stationary and suppress it by estimating and attenuating its spectrum. Methods based on statistical models include hidden Markov models [8] and Gaussian mixture models [9], which rely on specific acoustic assumptions and model structures. However, noise in natural environments is often non-stationary, which invalidates these acoustic assumptions. Consequently, the denoising effectiveness of traditional speech enhancement methods is limited, and these methods tend to generate artifacts known as musical noise.
With the rise of deep learning, various neural network models have been applied to speech enhancement. Xu et al. [10,11] proposed a supervised method based on a deep neural network (DNN) that directly predicts enhanced speech by learning the mapping function between noisy and clean speech signals. Wang et al. [12,13] employed a DNN to estimate two different targets, namely, the ideal binary mask (IBM) and the ideal ratio mask (IRM), aiming to reduce noise interference in the target speech signal. Zhou et al. [14] integrated the estimation of the IRM with the target binary mask (TBM) to obtain an improved mask for speech enhancement; the fused mask performed better than either mask estimated individually. Liu et al. [15] proposed an amplitude fusion method based on deep learning, which improves speech dereverberation by combining the mapping method with the ideal amplitude masking (IAM) method. The IBM is simple to implement but may cause speech distortion; the IAM offers smooth transitions and improved sound quality but has higher computational complexity and may retain some noise; and the IRM provides the best smoothness and noise suppression of the three masks but is computationally complex and requires accurate estimation of speech and noise energy.
This research aims to improve the quality of air traffic control speech in complex noisy environments. Specifically, the following research questions and hypotheses are proposed. First, we explore how to effectively enhance the clarity and intelligibility of air traffic control speech in high-noise environments. Second, we hypothesize that an improved DNN-IRM method can reduce the distortion of target speech in a noisy environment and outperform the traditional method. In addition, we hypothesize that introducing the LeakyReLU activation function can alleviate the vanishing gradient problem and improve the accuracy of the IRM estimation, and thus improve the relevant speech quality scores. We verify these hypotheses through experiments and assess the effectiveness of the proposed method on actual air traffic control speech.
This paper’s main contributions are as follows:
- We propose an air traffic control speech enhancement method based on an improved DNN-IRM. It uses the LeakyReLU activation function to mitigate the vanishing gradient issue, refines the DNN architecture to boost the IRM estimation accuracy, and modifies the IRM weights to minimize noise interference with the target speech.
- We select pure noise segments from the air traffic control speech database as noise, add them to other clean speech, and create pairs of clean and noisy speech. Additionally, we conduct three sets of comparative experiments using the perceptual evaluation of speech quality (PESQ) [16], short-time objective intelligibility (STOI) [17], the structural similarity index measure (SSIM) [18], and the scale-invariant signal-to-distortion ratio (SI-SDR) [19] as evaluation metrics. The SSIM is used to measure the similarity between the original speech spectrogram and the enhanced speech spectrogram.
- We use air traffic control speech recorded in a natural environment to test the performance of the proposed method, and we score the enhanced speech with two non-reference speech quality evaluation methods: the deep noise suppression mean opinion score (DNSMOS) [20] and non-intrusive speech quality assessment (NISQA) [21].
- We investigate the impact of different parameter values on the experimental results when adjusting the IRM weights.
The rest of the article is organized as follows. In Section 2, we begin by explaining the relevant concepts covered in the paper, followed by a detailed exposition of the adopted method’s process and network structure. Section 3 compares the proposed method with other speech enhancement methods and explores the impact of adjusting the mask weight on the experimental results. Section 4 summarizes the paper and outlines directions for future research.
3. Experiments and Results Analysis
This section first introduces the experimental setup, then gives a detailed explanation of the air traffic control speech dataset, and finally shows the experimental results of the comparative experiments and each group of experiments.
3.1. Experimental Setup
The experiment is conducted using the PyTorch 1.8.0 framework.
Table 2 provides the specific details of the experimental setup.
3.2. Dataset
Both the training and test speech data come from the air traffic control speech database provided by the Civil Aviation Administration of China, which contains 50,000 speech recordings in total. Air traffic control speech is characterized by a fast speaking rate and bilingual communication in Chinese and English. In addition, because of the limitations of the voice channel, completely clean speech data cannot be collected. Therefore, we computed the NISQA score for all the speech data and regarded recordings with a score greater than 3.0 as “clean speech data”, which yielded 20,000 recordings that could be used for the experiments. Each recording is about 3 s long, and the total length is about 17 h. Overall, 70% of the recordings were randomly selected as training data, and the remaining 30% were used as test data. Pure noise segments with a score below 1.5 were selected from the speech data and extended to the same length as the clean speech signals. The noise covers more than 130 types, such as aircraft cockpit noise, high-frequency communication noise, and air traffic control (ATC) center noise. After the dataset was collected, all the data were uniformly converted into single-channel waveform files with a sampling rate of 16 kHz. Subsequently, noise was added to the clean speech at signal-to-noise ratios (SNRs) of −5 dB, 0 dB, 5 dB, and 10 dB to generate the noisy training and test speech.
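The noise-extension and mixing steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names (`mean_power`, `extend_noise`, `mix_at_snr`) are hypothetical, the noise is extended by simple repetition (the paper does not say how it was extended), and the signals are plain Python lists of samples rather than 16 kHz waveform files.

```python
import math
import random


def mean_power(signal):
    """Mean power (average squared amplitude) of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def extend_noise(noise, length):
    """Repeat a noise segment until it covers `length` samples, then trim.

    Repetition is an assumption; the paper only states that noise segments
    were extended to the length of the clean speech.
    """
    repeats = math.ceil(length / len(noise))
    return (list(noise) * repeats)[:length]


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the clean-to-noise power ratio equals `snr_db`, then add it."""
    noise = extend_noise(noise, len(clean))
    # Required noise power is clean_power / 10^(snr_db / 10).
    scale = math.sqrt(mean_power(clean) / (mean_power(noise) * 10 ** (snr_db / 10)))
    scaled_noise = [scale * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise


# Toy signals standing in for a clean utterance and a shorter noise segment.
random.seed(0)
clean = [math.sin(0.05 * i) for i in range(2000)]
noise = [random.uniform(-1.0, 1.0) for _ in range(700)]
for snr_db in (-5, 0, 5, 10):
    noisy, scaled_noise = mix_at_snr(clean, noise, snr_db)
    achieved = 10 * math.log10(mean_power(clean) / mean_power(scaled_noise))
    print(snr_db, round(achieved, 6))
```

Scaling the noise rather than the speech keeps the clean reference unchanged, which is convenient when the same clean signal is reused at several SNRs.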
3.3. Contrast Experiment
To verify the performance of the method proposed in this paper, four sets of experiments were set up to provide results for discussion, as follows:
- (a) R-UnAdj: This method adopts the traditional speech enhancement method based on DNN-IRM, uses ReLU as the activation function, and does not adjust the IRM weights;
- (b) R-Adj: This method is based on R-UnAdj, uses ReLU as the activation function, but adjusts the IRM weights;
- (c) LR-UnAdj: This method is based on R-UnAdj, uses LeakyReLU as the activation function, and does not adjust the IRM weights;
- (d) LR-Adj: This method is based on R-UnAdj, uses LeakyReLU as the activation function, and adjusts the IRM weights. LR-Adj is the method proposed in this paper.
In addition, we trained the networks using ReLU and LeakyReLU for 100 epochs each, and recorded the loss for each epoch of both networks, as shown in Figure 6. We observed that the network trained with LeakyReLU was more stable and had lower losses.
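The difference between the two activation functions compared above can be illustrated with a short sketch. The negative slope of 0.01 is an assumption (it is the default of PyTorch's `nn.LeakyReLU`); this section does not state the slope used in the experiments, and the function names are illustrative only.

```python
def relu(x):
    """ReLU: passes positive inputs and zeroes out the rest."""
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU: exactly zero for non-positive inputs."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.01):
    """LeakyReLU: keeps a small linear response for non-positive inputs."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative of LeakyReLU: small but non-zero for non-positive inputs."""
    return 1.0 if x > 0 else negative_slope


# For a negative pre-activation, ReLU blocks the gradient entirely,
# whereas LeakyReLU still lets a scaled-down gradient through.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, relu_grad(x), leaky_relu_grad(x))
```

Because the gradient never becomes exactly zero, units with negative pre-activations can still be updated during backpropagation, which is the usual motivation for preferring LeakyReLU over ReLU.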
3.4. Results Analysis
To evaluate the actual performance of the air traffic control speech enhancement method based on the improved DNN-IRM proposed in this article, the experimental results were compared and analyzed from two aspects: reference-based evaluation and non-reference evaluation. We used the SSIM, PESQ, STOI, and SI-SDR as the reference-based evaluation metrics; the SSIM was used to judge the similarity between the original speech spectrogram and the enhanced speech spectrogram. For the non-reference evaluation, we chose DNSMOS and NISQA, whose purpose was to test the performance of the proposed method in a natural air traffic control speech environment. Finally, we also discuss how the experimental results change when the two parameters used to adjust the IRM weights take different values. In these comparative experiments, when analyzing one parameter, we fixed the other.
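Of the reference-based metrics listed above, the SI-SDR has a compact closed form and can be sketched directly. The code below follows the commonly used definition, in which the estimate is projected onto the reference before the energy ratio is taken; it is an illustration rather than the evaluation code used in the experiments, and it omits the mean removal that is often applied to both signals first.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB.

    The estimate is split into a component along the reference (the target)
    and an orthogonal residual; the score is their energy ratio.
    """
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy  # optimal scaling of the reference
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10 * math.log10(target_energy / residual_energy)


# Toy example: rescaling the estimate does not change the score.
reference = [1.0, 0.0, 1.0, 0.0]
estimate = [1.0, 1.0, 1.0, 1.0]
print(si_sdr(reference, estimate))
print(si_sdr(reference, [3.0 * e for e in estimate]))
```

The scale invariance matters for mask-based enhancement because an overall gain change in the enhanced signal should not be counted as distortion.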
3.4.1. Spectrogram Comparison and SSIM
The spectrogram plays a crucial role in speech analysis, providing a visual understanding of speech data. Figure 7 shows the spectrogram of the clean speech used in the experiment, while Figure 8, Figure 9 and Figure 10 display the spectrograms of the same speech data under different SNR conditions, affected by aircraft cockpit noise, ATC center noise, and high-frequency communication (HF communication) noise, respectively. The enhanced spectrograms produced by the four aforementioned methods are also presented. Each column in the figures corresponds to a set of instances. Taking Figure 8 as an example, Figure 8a to Figure 8e show, respectively, the spectrogram with −5 dB aircraft cockpit noise, the spectrogram enhanced by R-UnAdj, the spectrogram enhanced by R-Adj, the spectrogram enhanced by LR-UnAdj, and the spectrogram enhanced by LR-Adj. From Figure 8, it can be observed that aircraft cockpit noise is mainly concentrated in specific frequency bands. At an SNR of −5 dB, LR-Adj effectively eliminates high-frequency noise, but some noise remains in the low-frequency range, especially in noise-only segments. ATC center noise gradually weakens from low to high frequencies, and LR-Adj and LR-UnAdj show the best enhancement at 10 dB while preserving more speech details. Aircraft cockpit noise is concentrated in specific frequency bands, and all methods exhibit their worst enhancement at −5 dB. LR-Adj significantly reduces noise in noise-only segments at 0 dB, 5 dB, and 10 dB. In summary, the three figures show that the LR-Adj method generally outperforms the other three methods, especially in low SNR conditions, where it effectively eliminates low-frequency noise.
The SSIM measures the similarity between the clean speech spectrograms and the enhanced speech spectrograms. It considers three aspects of information: luminance, contrast, and structure. Figure 11 shows the SSIM scores of the speech spectrograms produced by the four enhancement methods at different SNR levels.
The SSIM score is obtained by comparing the enhanced speech spectrogram with the clean speech spectrogram. The ordinate of each chart in the figure represents the SSIM score of the four methods, and the abscissa represents the SNR of the added noise. It can be observed that the speech spectrograms enhanced by LR-Adj achieved the highest SSIM scores across all noise levels. From Figure 11b, it can be observed that LR-Adj has a significant effect on low-SNR ATC center noise. From Figure 11a,c, it can be seen that although LR-Adj does not have a significant effect on low-SNR aircraft cockpit noise and HF communication noise, its ability to handle these two noise types improves at high SNR.
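To make the three SSIM components (luminance or brightness, contrast, and structure) concrete, the following sketch computes a simplified SSIM between two spectrograms. It is a deliberate simplification: the standard SSIM is computed over local sliding windows and then averaged, whereas this version treats the whole spectrogram as a single window. The constants `k1 = 0.01` and `k2 = 0.03` are the values commonly used in the SSIM literature, the spectrograms are assumed to be normalized to the range [0, 1], and the function name is hypothetical.

```python
def global_ssim(spec_a, spec_b, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized 2-D spectrograms.

    Means capture luminance, variances capture contrast, and the
    covariance captures structure.
    """
    a = [v for row in spec_a for v in row]
    b = [v for row in spec_b for v in row]
    if len(a) != len(b):
        raise ValueError("spectrograms must have the same shape")
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


# Toy 2x3 "spectrograms": a clean one and a slightly perturbed one.
clean_spec = [[0.1, 0.8, 0.3], [0.9, 0.2, 0.6]]
enhanced_spec = [[0.15, 0.7, 0.35], [0.8, 0.25, 0.55]]
print(global_ssim(clean_spec, clean_spec))
print(global_ssim(clean_spec, enhanced_spec))
```

A score of 1 means the two spectrograms are identical; the score decreases as the enhanced spectrogram departs from the clean one in mean level, spread, or pattern.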
3.4.2. PESQ and STOI
The main purpose of the PESQ is to simulate listeners’ perception of speech quality and provide a numerical score to represent the quality of the speech. The PESQ generates a quality score between −0.5 and 4.5, where a higher score indicates better speech quality and a lower score indicates poorer quality. We selected airplane takeoff and landing noise, aircraft cabin noise, and electric current noise to test the performance of each method. The PESQ scores of each method for the different noise types and SNR levels are shown in Table 3.
According to the data in Table 3, the enhancement effect of each method varies with the noise type and SNR level. To quantify the improvement offered by a method, we subtracted the score of the noisy speech from the score of the enhanced speech and divided the difference by the score of the noisy speech. For each noise type, the improvements at all SNR values were averaged to obtain the average improvement of the method under that noise type. Under aircraft cabin noise, the average improvements for R-UnAdj, R-Adj, LR-UnAdj, and LR-Adj were 43.08%, 53.26%, 63.11%, and 78.35%, respectively. Under electric current noise, the average improvements were 40.04%, 53.50%, 70.08%, and 79.48%, respectively. Under airplane takeoff and landing noise, the average improvements were 42.92%, 53.38%, 65.17%, and 78.60%, respectively. Overall, LR-Adj performed best, followed by LR-UnAdj, then R-Adj, and finally R-UnAdj. These results show that the improved DNN-IRM method yields higher PESQ scores than the original method across the noise types tested. Furthermore, we observed that using LeakyReLU rather than ReLU improved the performance of the model with adjusted IRM weights, because the loss of the network trained with LeakyReLU is smaller than that of the network trained with ReLU. Moreover, under high SNR, the mask value estimation is relatively accurate, and the speech quality can be improved by reducing the weight of the noise mask.
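The improvement calculation described above can be written out as a short sketch. The function names are hypothetical, and the scores in the example are placeholder values chosen for illustration; they are not taken from Table 3.

```python
def relative_improvement(noisy_score, enhanced_score):
    """Improvement of the enhanced score over the noisy score, in percent."""
    return (enhanced_score - noisy_score) / noisy_score * 100.0


def average_improvement(noisy_scores, enhanced_scores):
    """Average the per-SNR relative improvements for one noise type."""
    gains = [
        relative_improvement(noisy, enhanced)
        for noisy, enhanced in zip(noisy_scores, enhanced_scores)
    ]
    return sum(gains) / len(gains)


# Placeholder PESQ scores at four SNR levels (NOT values from the paper).
noisy_pesq = [1.0, 1.2, 1.5, 2.0]
enhanced_pesq = [1.5, 1.8, 2.4, 3.0]
print(average_improvement(noisy_pesq, enhanced_pesq))
```

Note that this ratio is only meaningful when the noisy score is positive; since the PESQ can range down to −0.5, a noisy score near or below zero would need separate handling.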
STOI is an objective assessment method that measures speech clarity and intelligibility. Scores range from 0 to 1, where 1 represents complete intelligibility and 0 represents complete unintelligibility. Higher STOI scores indicate clearer and more understandable speech, while lower scores indicate poorer intelligibility. The STOI results for each method for the different noise types and SNR levels are shown in Table 4.
According to the data in Table 4, LR-Adj and LR-UnAdj perform similarly in terms of STOI, with LR-Adj achieving the best score on seven indicators. The advantages of LR-Adj are mainly reflected at high SNR levels: its STOI scores are generally higher than those of the three other methods at 5 dB and 10 dB. In contrast, LR-UnAdj achieves the best score on six indicators and shows better enhancement at −5 dB and 0 dB. The reasons for the strong performance of the LR-Adj method are the same as before: the loss of the network trained with LeakyReLU is low, and reducing the weight of the noise mask can improve speech quality.
3.4.3. Comparisons with Other Methods and Testing Real Speech
To further verify the performance of this method, we also tested the effects of other masks on this method.
The IBM marks each time–frequency point as belonging to either the target speech or the noise. If a time–frequency point contains mainly speech information, it is marked as 1 in the ideal binary mask; otherwise, it is marked as 0, as follows:
$$\mathrm{IBM}(t,f) = \begin{cases} 1, & |S(t,f)|^2 > |N(t,f)|^2 \\ 0, & \text{otherwise} \end{cases}$$
where $S(t,f)$ represents the target speech and $N(t,f)$ represents the noise at time frame $t$ and frequency bin $f$.
The ideal amplitude mask (IAM) is similar to the IBM, but it considers the impact of the target speech and the mixed speech on the result, as follows:
$$\mathrm{IAM}(t,f) = \frac{|S(t,f)|}{|Y(t,f)|}$$
where $S(t,f)$ represents the target speech and $Y(t,f)$ represents the mixed speech.
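The two masks described above can be sketched on toy magnitude spectrograms as follows. This is an illustration under stated assumptions: the binary mask uses the common criterion that speech energy exceeds noise energy at a time–frequency point (a local SNR threshold of 0 dB), the small constant `eps` is added only to avoid division by zero, and the function names are hypothetical.

```python
def ideal_binary_mask(speech_mag, noise_mag):
    """1 where the speech energy exceeds the noise energy, else 0."""
    return [
        [1.0 if s * s > n * n else 0.0 for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_mag, noise_mag)
    ]


def ideal_amplitude_mask(speech_mag, mixture_mag, eps=1e-8):
    """Ratio of the target speech magnitude to the mixture magnitude."""
    return [
        [s / (y + eps) for s, y in zip(s_row, y_row)]
        for s_row, y_row in zip(speech_mag, mixture_mag)
    ]


# Toy magnitudes: rows are time frames, columns are frequency bins.
speech = [[2.0, 0.5], [0.1, 1.5]]
noise = [[1.0, 1.0], [0.4, 0.5]]
mixture = [[2.5, 1.2], [0.45, 1.8]]
print(ideal_binary_mask(speech, noise))
print(ideal_amplitude_mask(speech, mixture))
```

The binary mask makes a hard keep-or-discard decision at each point, whereas the amplitude mask is continuous and can exceed 1 when the speech and noise partially cancel in the mixture.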
We used the same data to train the DNN-IBM and DNN-IAM models, and then mixed noise into the test data at random SNRs between 0 and 5 dB. The resulting noisy data were denoised by the three methods to obtain the enhanced speech. Finally, we used three evaluation indicators (PESQ, STOI, SI-SDR) to measure the performance of each method, and the results are shown in Table 5. Under the three different noise types, our method achieved the best results on all three evaluation metrics. The improvement value is calculated as the difference between the enhanced speech score and the noisy speech score. For the PESQ, our method achieved an average improvement of 1.04, better than the DNN-IBM’s 0.31 and the DNN-IAM’s 0.94. For STOI, our method showed an average improvement of 0.143, also surpassing the DNN-IBM and DNN-IAM, both at 0.13. For the SI-SDR, our method achieved an average improvement of 11.09, slightly higher than the DNN-IAM’s 11.06 and the DNN-IBM’s 10.90. These results indicate that our method provided better speech quality and intelligibility while suppressing noise across the noise environments tested.
In addition, we randomly selected 50 pieces of discarded original control speech data. These data had not been processed by noise type, and the noise in the speech came from real ATC center noise. These data were used to test the performance of the proposed method, with the DNSMOS and NISQA as evaluation indicators. The results are shown in Table 6. Compared with the other two methods, the method proposed in this article performs well on both the DNSMOS and NISQA.
3.4.4. Impact of Parameter Values
This section investigates the impact of the two parameter values on the experimental outcomes during the adjustment of the IRM weights. The experiment was again conducted on data from the air traffic control speech database, and the average PESQ obtained under three noise conditions was used as the evaluation criterion. Each parameter was analyzed in turn while the other was held fixed, which isolates the influence of the individual parameter on the experimental outcomes.
Table 7 shows the comparison of the PESQ values within the LR-Adj method for different
values when
. Among these, the
values of 0.4 and 0.5 achieved the optimal average PESQ score, both reaching 2.15.
exhibited the best PESQ performance at 0 dB. The optimal scores were achieved at −5 dB when the values were set to 0.3 and 0.4. Additionally,
was also effective at 10 dB. The poorest average PESQ score was seen at
, scoring only 1.47. In addition, when
, it produced almost the same results as LR-UnAdj.
Table 8 presents the comparison of PESQ values within the LR-Adj for different
values when
. The optimal average PESQ scores were observed at
values of 0.8 and 0.9, both reaching 2.17. Conversely, the least favorable average PESQ performance occurred at
, scoring only 1.95. Moreover, for
values of 0.5, 0.6, and 0.8, they were suitable for handling speech at a −5 dB noise environment, with scores reaching 1.54.
Figure 12 illustrates the average PESQ scores from
Table 7 and
Table 8. The red line represents the average PESQ scores for different
values when
, while the green line represents the average PESQ scores for different
values when
.
This section has mainly explored the impact of different values of the parameters on the experimental results when adjusting the IRM weight. From
Figure 12, it can be observed that at
and
, the speech enhancement performs optimally under various noise conditions, achieving the highest average scores. When the parameter value is close to 0 or close to 1, the PESQ score is very low, especially when
and
. At this time, the adjusted mask is similar to the IBM, and the noise signal is assigned to 0, but unlike the IBM, the target signal is not assigned to 1.
4. Discussion and Conclusions
To address the quality and clarity issues of air traffic control speech in complex noisy environments, this paper proposes an air traffic control speech enhancement method based on an improved DNN-IRM, trained on noise and speech from the air traffic control speech database. The interference of noise with the target speech is reduced by refining the network architecture, using LeakyReLU as the activation function, and adjusting the weight of the IRM output by the network.
The experimental results show that the proposed method improves the PESQ, STOI, SI-SDR, and SSIM indicators, indicating that adjusting the weight of the network output IRM is a feasible way to improve speech quality, because this step reduces the weight of the noise mask. The results also suggest that LeakyReLU is more suitable for this speech enhancement task than ReLU, because the loss of the network trained with LeakyReLU is smaller than that of the network trained with ReLU. In addition, when enhancing actual air traffic control speech, the proposed method also performs better than the IBM-based method and the IAM-based method.
The amplitude spectrum and phase spectrum can be obtained by applying the Fourier transform to the speech signal, but the mask used in this paper considers only the amplitude spectrum and ignores the influence of the phase spectrum on speech enhancement. Future experiments will consider more advanced masks, such as the complex ratio mask [25], to ensure that the original information in the speech is not lost.