1. Introduction
Cyber–physical systems (CPSs) are complex networked systems that integrate physical perception, computing, communication, and control to realize real-time perception and distributed control of physical objects [1,2]. The rise of CPSs has driven the development of industrial CPSs by breaking bottlenecks in traditional industrial automation and control systems. Embedded systems built from sensors, actuators, processors, and heterogeneous networks provide extensive and flexible support for complex, large-scale industrial production lines, promote highly automated and intelligent integration in future industries, and enable closer collaboration between robots and humans [3]. Speech is the most natural and convenient form of human communication. In many industrial CPS environments, where operating large equipment often requires multiple people and machines to work together, speech is one of the most effective means of communication. However, industrial facilities are frequently noisy environments in which speech is invariably interfered with, especially when the sound generated by powering and operating large equipment is much louder than the speech of workers. As a result, communication becomes significantly less effective; in severe cases, communication fails entirely and collaborative work breaks down. Speech enhancement restores clean speech signals from noise-corrupted speech signals, improving speech quality and listener comfort, and is therefore widely used for speech in noisy environments [4].
After decades of development, numerous speech enhancement algorithms have been proposed, including classical methods such as spectral subtraction, Wiener filtering, statistical model-based methods, and subspace-based methods [5,6,7,8]. These methods tend to assume stationary or slowly varying noise and perform well, at low computational cost, under high signal-to-noise ratio (SNR) and stationary noise conditions. In factories, however, most noise is low-SNR and non-stationary, and traditional methods cannot track its characteristics effectively [9], so they are not well suited to real industrial environments. Owing to the excellent ability of deep neural networks to model complex non-linear functions, training with datasets drawn from different noisy environments can achieve stable noise reduction even in highly non-stationary noise [10,11], supporting the implementation of CPSs.
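For concreteness, the sketch below illustrates the classical spectral-subtraction idea mentioned above; it is a minimal numpy implementation, assuming the leading frames of the recording are noise-only, not the exact variant of any cited reference.

```python
# Minimal spectral-subtraction sketch (classical method), assuming the
# first `noise_frames` frames of the recording contain noise only.
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=10):
    """Enhance `noisy` (1-D float array) by subtracting an average noise
    magnitude spectrum estimated from the leading frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Noise floor estimated from the assumed noise-only frames.
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract and half-wave rectify to avoid negative magnitudes.
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    # Resynthesize with the noisy phase via overlap-add.
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), axis=1)
    out = np.zeros(len(noisy))
    for i, f in enumerate(clean_frames):
        out[i * hop:i * hop + frame_len] += f * window
    return out
```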
Neural network-based supervised single-channel speech enhancement falls mainly into time–frequency masking-based methods and feature mapping-based methods. A time–frequency masking method uses a neural network to learn the time–frequency relationship between clean and noisy speech, multiplies the estimated mask of the clean speech with the original noisy speech, and then synthesizes the time-domain waveform of the enhanced speech through an inverse transform. Early masking methods exploited only the relationship between clean and noisy speech magnitudes, ignoring the phase information between them; examples include ideal binary masking (IBM) [12], ideal ratio masking (IRM) [13], and spectral amplitude masking (SAM) [14]. However, research has found that phase information is important for improving perceptual speech quality at low SNR [15]. The phase-sensitive mask (PSM) [16] showed the feasibility of phase estimation. Williamson et al. [17] proposed a complex ideal ratio mask that jointly estimates the real and imaginary parts of clean speech. Tan et al. [18] proposed a convolutional recurrent network (CRN) for complex spectral mapping (CSM) that can, in principle, estimate the real and imaginary spectra of clean speech from the spectrum of noisy speech. The authors of [19] combined a CRN with a DPRNN module to improve both the local and the long-term modeling capability of the network. Since the signal is converted to the complex frequency domain through the STFT, the speech enhancement algorithm must process the magnitude and phase of the speech signal concurrently [20]. The difficulty of phase estimation thus imposes an upper limit on enhancement performance. In addition, effective speech enhancement in the STFT domain requires high frequency resolution, which entails a relatively long time window and hence large system delays, because the minimum delay of the system is bounded by the STFT window length; this makes practical deployment difficult. The feature mapping-based approach instead uses neural networks to learn the complex mapping between noisy and clean speech, and the network directly outputs the waveform of the enhanced speech. Conv-TasNet [21], a fully convolutional time-domain audio separation network that models speech signals directly in the time domain, is a representative high-performance model for neural noise reduction. Pascual et al. [22] applied GANs [23] to speech enhancement and achieved measurable gains. Given the powerful modeling capability of WaveNet [24] on speech waveforms, Ref. [25] proposed introducing speech prior distributions into a Bayesian WaveNet framework for speech enhancement. The authors of [26] built on WaveNet with non-causal dilated convolutions to predict the target speech. Further, Refs. [27,28] proposed end-to-end speech enhancement frameworks using fully convolutional neural networks that focus on temporal dependencies across long speech segments.
At the same time, the highly integrated nature and large-scale production lines of industrial CPSs make information security issues increasingly prominent. Data communication is an essential part of a CPS, so keeping CPS information secure is important. In practice, a CPS operates on data from different industrial production systems that may originate in a variety of privacy-sensitive scenarios. Accessing source data from many different noisy environments therefore faces significant barriers, owing to user data privacy and security concerns and to commercial competition. To preserve data privacy, federated learning has been proposed [29,30]. This paradigm enables institutions with different data structures to build models collaboratively without uploading private data, effectively protecting user privacy and data security.
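As an illustration of this paradigm, the sketch below shows one FedAvg-style communication round; it is only a schematic under the assumption of a simple float-parameter model and client data loaders yielding (noisy, clean) pairs, and the names are hypothetical rather than taken from the cited systems.

```python
# Illustrative FedAvg-style round: clients train locally, and only model
# weights (never raw audio) reach the server. Names are hypothetical.
import copy
import torch

def federated_round(global_model, clients, local_steps=1):
    """Each client updates a copy of the global model on its private
    data; the server then averages the resulting weights."""
    client_states = []
    for loader in clients:  # each `loader` yields a client's private batches
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=1e-3)
        for _ in range(local_steps):
            noisy, clean = next(iter(loader))
            loss = torch.nn.functional.mse_loss(local(noisy), clean)
            opt.zero_grad()
            loss.backward()
            opt.step()
        client_states.append(local.state_dict())
    # Server-side weight averaging; private data never leaves the client.
    avg = {k: torch.stack([s[k].float() for s in client_states]).mean(0)
           for k in client_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```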
Mainstream federated learning systems are often based on the assumption that local data on the client side are labeled. In realistic scenarios, however, client-side data are mostly unlabeled, because capturing paired clean and noisy speech is costly. As an extension of federated learning, knowledge distillation extracts features of a teacher network in an unsupervised manner and improves the performance of a student network based on those features. Therefore, more and more researchers are applying knowledge distillation within unsupervised domain-adaptation methods.
Two heads are better than one: since a combination of multiple teacher models outperforms a single teacher, Ref. [31] transferred the predictive distributions of multiple teachers as knowledge to the student model. In [32], the authors weighted teacher knowledge by analyzing the diversity of the teacher models in gradient space. The authors of [33,34] extended knowledge distillation to domain adaptation by training multiple teacher models in the source domain and pooling them in the target domain to train a student model. In recent years, knowledge distillation has also been introduced to improve speech enhancement models. The teacher network in [35] uses enhanced speech while the student model uses the original noisy speech for ASR training, thereby encouraging the student model to perform speech enhancement within the network. In [36], researchers achieved low-latency online single-channel speech enhancement using teacher–student learning, preventing the performance degradation caused by reduced input segment length. The authors of [37] used the noise reduction results of the teacher model to optimize the student model in an unsupervised manner. Recently, Refs. [38,39] attempted to transfer knowledge from multiple teachers to student models in speech enhancement. In [38], the spectrogram was divided into multiple sub-bands, a teacher model was trained on each sub-band, and the knowledge was then transferred from the teacher models to a general sub-band student enhancement model through a knowledge distillation framework; the student model outperformed the corresponding full-band model. The authors of [39] trained teacher models on multiple datasets with different SNRs and then used these teachers to supervise the training of student models.
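The unsupervised teacher–student setup in the spirit of [37] can be sketched as follows; the model classes, loss choice, and function names here are illustrative assumptions, not the exact formulation of the cited work.

```python
# Sketch of teacher-student distillation on unlabeled noisy speech: the
# frozen teacher's enhanced output serves as the student's pseudo-target,
# so no clean reference is needed. Names are illustrative.
import torch

def distill_step(student, teacher, noisy_batch, optimizer):
    """One unsupervised training step on a batch of noisy waveforms."""
    with torch.no_grad():
        pseudo_clean = teacher(noisy_batch)  # teacher's noise-reduction result
    loss = torch.nn.functional.l1_loss(student(noisy_batch), pseudo_clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```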
To address the above problems of speech enhancement in industrial noise environments, we combine speech enhancement with knowledge distillation to obtain a comprehensive speech enhancement model that suppresses multiple types of industrial noise. First, the student model is built with a network structure comparable to that of the teacher models; this structural similarity can be regarded as symmetry between the student and teacher models in the physical space. Then, through knowledge distillation, the prior knowledge of the various teacher models is transferred to the student model, so that the student model unifies, in the information space, the prior knowledge possessed by the individual teachers. Consequently, the student model is a symmetric and unified counterpart of the teacher models in both the physical and information spaces and is able to suppress various forms of noise. The results also show that combining multiple teacher models improves on any single teacher, since each teacher model focuses on a single noise-reduction problem.
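The two strategies for constructing the student's target from multiple noise-specific teachers, compared later as average and random distillation, can be sketched as below; this is only an illustrative sketch, and the exact losses and weighting used in our method are given in Section 2.

```python
# Hedged sketch of multi-teacher target construction: "average" fuses the
# mean of all frozen teachers' outputs, while "random" samples one teacher
# per batch. Details may differ from the method in Section 2.
import random
import torch

def multi_teacher_target(teachers, noisy_batch, mode="average"):
    """Build the student's training target from frozen teacher models,
    each specialized for one industrial noise type."""
    with torch.no_grad():
        if mode == "average":
            outs = [t(noisy_batch) for t in teachers]
            return torch.stack(outs).mean(dim=0)  # fuse all teachers
        else:  # "random": a single teacher supervises this batch
            return random.choice(teachers)(noisy_batch)
```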
The remainder of the paper is structured as follows: Section 2 describes the proposed methodology; Section 3 presents the experimental configurations; Section 4 reports the experimental results and analysis; and Section 5 concludes our work.
5. Conclusions
To address the difficulty of pooling industrial CPS noise data, caused by privacy and commercial copyright constraints and by the lack of labels in public datasets, we propose unsupervised distillation-based speech enhancement for source-free data. In this method, a speech enhancement model for multiple noisy environments is trained through federated learning and knowledge distillation, achieving unification of the CPS physical space without access to the source data. Through extensive comparative experiments, we verify that a teacher model trained under a given noise consistently outperforms the other teacher models on that noise. We also compare our average distillation and random distillation methods. The experimental results show that average distillation guides the student model to exploit the feature information of the different teacher models effectively during training and alleviates the under- or over-enhancement exhibited by a single teacher model during enhancement.
Our method can eliminate or reduce the background noise in noisy speech and improve the quality and intelligibility of the target speech signal. As shown in Figure 4, speech enhancement may succeed in suppressing the noise yet often distorts the speech of interest, and the artifacts it introduces can harm speech recognition or other downstream automatic processing. However, a High-Fidelity Generative Adversarial Network [49], consisting of one generator and two discriminators, can be adversarially trained to output high-quality speech without generation artifacts. In our next work, we will apply the High-Fidelity Generative Adversarial Network to speech enhancement to eliminate such artifacts.
In future research, we will continue to explore the application of multi-teacher distillation to speech enhancement and will train a standard model on samples representing different noisy environments as a comparison. Although our experimental data come from a public dataset, the noisy–clean speech pairs are artificially synthesized from clean speech and noise, and such simulated speech may differ from recordings made in real noise environments. In future work, we will use CHiME and other real-recording datasets to conduct knowledge distillation research in multi-noise scenarios. Furthermore, to broaden the applicability of our method, we aim to extend it to other downstream tasks, including simultaneous speech recognition and separation.