1. Introduction
Audio signals are ubiquitous in daily life. With advances in science and technology, audio has evolved from monophonic audio to dual-channel and multi-channel audio, such as 2.1-channel, 5.1-channel, and 7.1-channel [1]. Multi-channel audio technology aims to reproduce the variety of sounds that humans hear in nature [2]. Multi-channel audio is therefore closer to the real sound heard by the human ear and gives the audience a more immersive experience. At present, it is widely used in movies, TV programs, music production, and game development. However, collecting and transmitting multi-channel audio signals is accompanied by abnormal phenomena such as missing data and audio damage. These phenomena degrade the quality of the final audio, impair the listening experience, and can even reduce the intelligibility of the audio content. If the damaged audio is used in other tasks, such as audio recognition and classification, the final accuracy suffers as well. The restoration of damaged multi-channel audio has therefore become an active research topic [3].
The core problem in recovering an audio signal is how to establish a link between the lost data and the known data [4]. Traditional audio signal restoration algorithms tend to be complex and to give unsatisfactory recovery results. For instance, the audio restoration algorithm based on sparse decomposition [5] needs many iterations to approach the optimal result. The audio restoration algorithm based on the regression model [6] requires tuning of the model order and other parameters, and it introduces audible distortion. The traditional matrix completion algorithm [7] may suffer from information loss and serious performance degradation. Moreover, none of these algorithms was designed specifically for multi-channel audio signals, so they do not account for how the spatial position of the channels and the strong correlation between channels affect the completion result.
A multi-channel audio signal can be regarded as a multi-dimensional model with channel, time, and spectrum dimensions [8]. The traditional matrix model cannot process such high-order data directly: the data must first be transformed into a matrix using dimensionality reduction operations. This step discards some structural information and degrades signal recovery. Representing multi-dimensional data with a matrix is also inefficient [9]. The tensor, an extension of the matrix to higher-dimensional space, has been used for multi-dimensional array processing [10]. The tensor model can therefore represent high-order data and analyze and process it directly [11]. Because it reflects the inherent relationships within multi-factor signals well, the tensor model has been widely used in image processing [12], computer vision [13], and other fields. To make full use of the correlation between the various factors of the audio signal [14], researchers have developed tensor completion algorithms.
Tensor completion methods can be roughly grouped into tensor factorization-based methods and nuclear norm-based methods [15]. Tensor decomposition can succinctly represent the underlying structure of a tensor, so various tensor factorizations have been applied to tensor completion. CANDECOMP/PARAFAC (CP) is one of the best-known tensor factorizations, and the CP weighted optimization (CP-WOPT) algorithm is currently one of the most commonly used methods; it is often used for audio completion. CP is a special case of Tucker decomposition, so Tucker is also used for completion. Compared with Tucker, however, the tensor train (TT) decomposition has a better ability to represent tensors and can avoid the curse of dimensionality, so it is better suited to completion. Sedighin et al. [9] used a completion algorithm based on TT decomposition to reconstruct the signal in the multiway delay space. In addition, the tensor nuclear norm is the sum of the singular values of the frontal slices of the tensor after Fourier transformation, and it is the tightest convex relaxation of the $\ell_1$ norm of the tensor multi-rank. Studies have shown that methods based on the nuclear norm are superior to methods based on tensor factorization, so the tensor nuclear norm is used for signal completion. For example, Ran et al. [15] adopted high-accuracy Low Rank Tensor Completion (HaLRTC) to complete traffic data.
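The definition of the tensor nuclear norm given above can be made concrete with a short sketch. It assumes NumPy and a third-order array; the function name `tensor_nuclear_norm` and the optional division by the number of frontal slices are choices made for this illustration (conventions differ between papers) and are not taken from this paper.

```python
import numpy as np

def tensor_nuclear_norm(x, scale=True):
    """Sum of the singular values of the frontal slices of fft(x, axis=2).

    `scale=True` divides by the number of frontal slices, which is one
    common convention; this choice is an assumption of the sketch.
    """
    n3 = x.shape[2]
    xf = np.fft.fft(x, axis=2)  # Fourier transform along the third mode
    total = 0.0
    for k in range(n3):
        # Singular values of the k-th frontal slice in the Fourier domain
        total += np.linalg.svd(xf[:, :, k], compute_uv=False).sum()
    return total / n3 if scale else total

# Illustrative check: if every frontal slice equals diag(3, 1), the scaled
# value reduces to the matrix nuclear norm of that slice, 3 + 1.
x = np.zeros((2, 2, 4))
x[0, 0, :] = 3.0
x[1, 1, :] = 1.0
print(tensor_nuclear_norm(x))
```

When the third mode has length one, the Fourier transform is the identity and the function returns the ordinary matrix nuclear norm, which is why the tensor nuclear norm can be read as an extension of it.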
In this paper, tensor completion based on the tensor nuclear norm is used to recover missing data in multi-channel audio signals. First, the multi-channel audio signal is preprocessed by framing and windowing. The signal is then transformed from the time domain to the frequency domain, and a third-order multi-channel audio tensor is constructed. Finally, the multi-channel audio signal with missing data is recovered using the completion algorithm based on the tensor nuclear norm. By finding the connection between the missing data and the known data, the lost data can be recovered as far as possible from the available observations, which improves the quality and auditory effect of the multi-channel audio signal.
The rest of this paper is organized as follows: Section 2 introduces the notations and methods. The settings of the experiment are described in Section 3. The experimental results are discussed in Section 4. Section 5 presents the conclusions.
3. Experimental Setup
3.1. Audio Signal Modeling
Although vectors and matrices are easier to process, this paper arranges the multi-channel audio signal as a third-order tensor in order to retain more structural correlations. Multi-channel audio can have more than two axes of variation, such as channel, frame, and feature [30]. First, the multi-channel audio signal is divided into frames and a window is applied to each frame. The frame length is generally set to 10–30 ms. Adjacent frames partially overlap to avoid discontinuities of the audio signal in the time domain. The number of samples in each frame is the product of the frame length and the sampling frequency. The MDCT is then applied to the samples of each windowed frame, transforming the audio signal from the time domain to the frequency domain and yielding the frequency domain coefficients of each frame. Because of the symmetry properties of the MDCT, the number of frequency domain coefficients is half the number of samples in the time domain. The frequency domain coefficients are selected as the characteristic parameters and form one mode of the audio tensor. The multi-channel audio signal can then be arranged as a third-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$ represents the coefficients after frequency domain transformation, $I_2$ represents the frames, and $I_3$ represents the number of channels of the audio signal. Its structure is shown in Figure 1. The audio tensor is then completed by exploiting the structural relationships within the audio signal in the frequency domain. The process of audio signal recovery is shown in Figure 2.
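The modeling steps described in this subsection (framing, windowing, MDCT, stacking into a third-order tensor) can be sketched as follows. This is a minimal illustration, assuming NumPy, a sine window, an unnormalized direct-form MDCT, and no padding at the signal edges; none of these choices is specified in the paper, and the function names `mdct_frames` and `build_audio_tensor` are hypothetical.

```python
import numpy as np

def mdct_frames(signal, frame_len, hop):
    """Frame, window, and MDCT one channel.

    Returns an array of shape (frame_len // 2, n_frames).
    The sine window and the unnormalized transform are assumptions.
    """
    m = frame_len // 2                      # coefficients per frame
    n = np.arange(frame_len)
    k = np.arange(m)
    window = np.sin(np.pi * (n + 0.5) / frame_len)
    # MDCT basis: cos(pi/M * (n + 1/2 + M/2) * (k + 1/2)), shape (m, frame_len)
    basis = np.cos(np.pi / m * np.outer(k + 0.5, n + 0.5 + m / 2))
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)],
        axis=1,
    )                                       # shape (frame_len, n_frames)
    return basis @ (frames * window[:, None])

def build_audio_tensor(audio, frame_len, hop):
    """audio has shape (n_samples, n_channels).

    Returns a tensor of shape (coefficients, frames, channels).
    """
    per_channel = [
        mdct_frames(audio[:, c], frame_len, hop) for c in range(audio.shape[1])
    ]
    return np.stack(per_channel, axis=2)

# Small synthetic example (random data, not the paper's audio)
rng = np.random.default_rng(0)
audio = rng.standard_normal((64, 2))
tensor = build_audio_tensor(audio, frame_len=16, hop=8)
print(tensor.shape)  # (8, 7, 2)
```

Each frame of `frame_len` samples yields `frame_len // 2` coefficients, matching the halving described above.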
3.2. Experiment Settings
The multi-channel audio signals used in the completion experiment are 5.1-channel audio signals, a format common in daily life, downloaded from the Internet. The content is popular music and the file format is WAV. In total, 50 audio segments are used, each 10 s long, with a sampling frequency of 48 kHz and a bit depth of 16 bit. Taking one piece of audio in the dataset as an example, the dynamic range is 46.82 dB for the left channel, 50.77 dB for the right channel, 47.62 dB for the center channel, 47.04 dB for the low frequency enhanced channel, 50.92 dB for the left surround channel, and 45.37 dB for the right surround channel. The dynamic range of music is generally 40–60 dB, and the audio used in the experiment lies within this range. The frame length is set to 20 ms, so each frame contains 960 samples. The overlap between adjacent frames is set to 50%, so the frame shift is 10 ms. The time–frequency conversion is performed with the MDCT, which yields 480 frequency domain coefficients per frame. The third-order audio tensor can then be constructed.
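The frame parameters quoted above follow from simple arithmetic, reproduced in the sketch below. The sampling rate, frame length, overlap, duration, and channel count come from the text; the frame count is an extra derived quantity that assumes no padding at the signal edges, which the paper does not specify.

```python
SAMPLE_RATE_HZ = 48_000   # from the text
FRAME_MS = 20             # from the text
OVERLAP = 0.5             # from the text
DURATION_S = 10           # from the text
N_CHANNELS = 6            # 5.1-channel audio

frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000      # 960 samples per frame
hop = int(frame_len * (1 - OVERLAP))               # 480 samples
hop_ms = 1000 * hop / SAMPLE_RATE_HZ               # 10.0 ms frame shift
n_coeffs = frame_len // 2                          # 480 MDCT coefficients

# Assumption: frames are taken only where a full frame fits (no padding).
n_samples = SAMPLE_RATE_HZ * DURATION_S
n_frames = 1 + (n_samples - frame_len) // hop      # 999 under this assumption

tensor_shape = (n_coeffs, n_frames, N_CHANNELS)
print(frame_len, hop_ms, n_coeffs, tensor_shape)
```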
Missing data in the multi-channel audio signal are simulated by random loss, in which data loss occurs at random locations. The total missing data rate is set to 15%, 30%, 45%, 60%, and 75% in turn. With these settings, a third-order audio tensor with missing data can be constructed. Four audio completion algorithms, namely the TC-TNN algorithm, the TC-SNN algorithm, the CP-WOPT algorithm, and the robust matrix completion (RMC) algorithm, are then used to carry out recovery experiments on the audio signals with missing data. The allowable error of the experiment is $10^{-8}$ and the maximum number of iterations is 500.
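One way to simulate the random loss described above is to draw a Boolean observation mask over the tensor entries. The sketch below assumes NumPy and assumes that individual tensor entries are dropped uniformly at random and replaced by zeros; the paper does not state at which stage or granularity the loss is applied, and the names `random_loss_mask` and `apply_loss` are hypothetical.

```python
import numpy as np

MISSING_RATES = (0.15, 0.30, 0.45, 0.60, 0.75)  # rates listed in the text

def random_loss_mask(shape, missing_rate, seed=0):
    """Boolean mask: True where an entry is observed, False where it is lost."""
    rng = np.random.default_rng(seed)
    n_total = int(np.prod(shape))
    n_missing = int(round(missing_rate * n_total))
    mask = np.ones(n_total, dtype=bool)
    # Choose the lost positions uniformly at random, without repetition
    mask[rng.choice(n_total, size=n_missing, replace=False)] = False
    return mask.reshape(shape)

def apply_loss(tensor, mask):
    """Zero out the lost entries (zero filling is an assumption of the sketch)."""
    return np.where(mask, tensor, 0.0)

# Illustrative use on a small synthetic tensor
x = np.ones((10, 10, 2))
for rate in MISSING_RATES:
    mask = random_loss_mask(x.shape, rate)
    print(rate, int((~mask).sum()))
```

A completion algorithm would then receive the masked tensor together with the mask, so that only observed entries constrain the solution.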
The recovery effect is evaluated using both objective and subjective indicators. The objective indicator is the relative standard error (RSE), and the subjective indicator is Multiple Stimuli with Hidden Reference and Anchor (MUSHRA). The completion experiment is conducted on a DELL 7050 computer with a 3.6 GHz Intel Core i7 CPU and 16 GB of RAM, and the simulation software is MATLAB (R2019a).
4. Results and Discussion
4.1. Objective Evaluation
In the audio recovery experiment, the experiment was repeated 10 times for each audio segment at each missing data rate, to rule out chance results. The results of the recovery experiment are then evaluated objectively. The objective evaluation indicator is the RSE, which measures the difference between the original signal and the recovered signal. The lower the RSE, the better the audio signal recovery. RSE is defined as follows:
$$\mathrm{RSE} = \frac{\|\hat{\mathcal{X}} - \mathcal{X}\|_F}{\|\mathcal{X}\|_F}$$
where $\hat{\mathcal{X}}$ represents the tensor after completion and $\mathcal{X}$ represents the tensor without data loss.
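The RSE can be computed in a few lines. The sketch below assumes the usual definition, the Frobenius norm of the difference between the completed and original tensors divided by the Frobenius norm of the original tensor, and assumes NumPy; the function name `rse` is chosen for this illustration.

```python
import numpy as np

def rse(recovered, original):
    """Frobenius norm of the error divided by the norm of the original."""
    # ravel() makes the Frobenius norm explicit for arrays of any order
    error = np.linalg.norm((recovered - original).ravel())
    return error / np.linalg.norm(original.ravel())

# Illustrative values on synthetic data
original = np.ones((2, 2, 2))
print(rse(original, original))          # a perfect recovery gives 0.0
print(rse(np.zeros((2, 2, 2)), original))  # recovering nothing gives 1.0
```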
Table 2 records the RSE of the multi-channel audio signals restored using the four audio completion algorithms; each RSE value is the average over the 50 pieces of multi-channel audio.
In addition, the experiment records the time each audio completion algorithm takes to restore the multi-channel audio signal, measured as CPU running time (CPU time). The results are shown in Table 3.
It can be seen from Table 2 that, at every missing data rate, the TC-TNN algorithm achieves the lowest RSE, which shows that its audio recovery capability is the best in all cases. This indicates that arranging the multi-channel audio signal as a tensor makes full use of the inherent relationships in the high-order structure and recovers the missing data better, so this tensor completion algorithm has a stronger recovery ability and yields higher audio recovery quality.
The recovery ability of the TC-SNN and CP-WOPT algorithms is intermediate among the four completion algorithms; both also model the multi-channel audio signal as a tensor, and their RSE is lower than that of the RMC algorithm. The TC-SNN algorithm is based on the matrix nuclear norm, while the TC-TNN algorithm is based on the tensor nuclear norm. The tensor nuclear norm can be regarded as a higher-order extension of the matrix nuclear norm; it therefore captures the high-order structure, and its intrinsic correlation is stronger than that of the matrix nuclear norm. For this reason, at each missing data rate the RSE of the TC-SNN algorithm is slightly higher than that of the TC-TNN algorithm, that is, its recovery ability is slightly weaker. For data completion, the nuclear norm is superior to tensor factorization, so the recovery ability of the CP-WOPT algorithm is slightly weaker than that of the TC-TNN and TC-SNN algorithms.
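To illustrate what a completion objective built on the matrix nuclear norm looks like, the sketch below computes a sum of nuclear norms over the mode unfoldings of a tensor. It assumes that SNN denotes this weighted sum, as is common in the low-rank tensor completion literature, and it assumes equal weights and NumPy; the paper's own definition is not reproduced in this section, and the function names are hypothetical.

```python
import numpy as np

def unfold(x, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def sum_of_nuclear_norms(x, weights=None):
    """Weighted sum of the matrix nuclear norms of all mode unfoldings.

    Equal weights summing to one are an assumption of this sketch.
    """
    n_modes = x.ndim
    if weights is None:
        weights = [1.0 / n_modes] * n_modes
    return sum(
        w * np.linalg.norm(unfold(x, mode), ord="nuc")
        for mode, w in enumerate(weights)
    )

# Illustrative rank-one tensor: every unfolding has a single nonzero
# singular value equal to the product of the vector norms (5 * 1 * 2).
a, b, c = np.array([3.0, 4.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])
x = np.einsum("i,j,k->ijk", a, b, c)
print(sum_of_nuclear_norms(x))
```

Each term treats the tensor as a flattened matrix, whereas the tensor nuclear norm operates on the frontal slices in the Fourier domain and keeps the third mode coupled, which is one way to read the comparison drawn above.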
It can be seen from the two tables that the RMC algorithm takes less time than the other three algorithms to recover the multi-channel audio signal, but its RSE is the highest of the four methods. The RMC algorithm does not consider the spatial structure and other correlations of high-order data, and it even loses high-order structural information while restoring the audio signal. As a result, its recovery ability is not as good as that of the tensor completion algorithms.
The tensor completion algorithms require a large number of iterations to reach the optimal solution, which makes multi-channel audio signal recovery slow. Compared with the traditional matrix completion algorithm, however, their audio recovery quality is better: this type of algorithm accepts higher complexity in exchange for a better audio recovery effect.
Because the TC-TNN algorithm has the best recovery effect in the completion experiment, the spectrograms of the original audio and of the audio recovered using the TC-TNN and RMC algorithms are shown in Figure 3, Figure 4 and Figure 5. From top to bottom, the panels are the left channel, the right channel, the center channel, the low frequency enhanced channel, the left surround channel, and the right surround channel. In the spectrogram, the depth of the color indicates the energy at each frequency, the horizontal stripes represent the formant information, and the vertical stripes represent the pitch information. The denser the stripes, the higher the pitch. The energy at the frequency points of corresponding channels differs slightly between Figure 3 and Figure 4, but the position and number of horizontal and vertical stripes are roughly the same. The difference between Figure 5 and Figure 3 is greater, and some of the horizontal stripes are blurred. The spectrograms therefore also show that the audio recovery effect of the TC-TNN algorithm is better.
4.2. Subjective Evaluation
The purpose of recovering a multi-channel audio signal with data loss is to improve the quality of the audio, improve the intelligibility of its content, and give a better listening experience. It is therefore necessary to evaluate the experimental results subjectively and to test the perceived restoration quality of the multi-channel audio. The subjective evaluation indicator is the MUSHRA method. This test method is recommended by the International Telecommunication Union and was first used for the subjective evaluation of streaming media and communication coding. The main feature of the MUSHRA method is that the lossless audio is mixed into the test corpus as a hidden reference, with the total loss audio as an anchor. In a double-blind listening test, the measured audio, the hidden reference audio, and the anchor audio are scored subjectively. The method requires experienced listeners, who must be trained in the test process and scoring rules before the formal test. The original multi-channel audio without data loss is generally used as the reference signal. During the formal test, the listeners score the audio by comparing the multi-channel audio signal without data loss with the multi-channel audio signal after completion. The scores are integers from 0 to 100, and the corresponding evaluations range from poor to very good.
In this experiment, ten experienced listeners, five men and five women, were selected to conduct the subjective listening test and score the multi-channel audio. The test is performed in a quiet audio lab with a room reverberation time of 0.5 s. The playback equipment is a 5.1-channel system with a dynamic range of 86 dB, placed according to the 5.1-channel schematic diagram shown in Figure 6. L is the left channel, R is the right channel, C is the center channel, LFE is the low frequency enhanced channel, SL is the left surround channel, and SR is the right surround channel. Each piece of measured audio is 10 s long. The reference audio and the measured audio are played in alternation, and the test is repeated three times to prevent misjudgment. The total time of a complete test is kept within 15 min to 20 min, also to prevent misjudgment due to auditory fatigue. The average MUSHRA score over the 50 pieces of multi-channel audio is used as the subjective evaluation of the audio restoration quality. The results are shown in Table 4.
It can be seen from Table 4 that, at every missing data rate, the MUSHRA score of the TC-TNN algorithm is the highest among the completion algorithms, and the scores of the TC-SNN and CP-WOPT algorithms are in the middle, which is consistent with the objective evaluation. For all of the completion algorithms, the MUSHRA score decreases as the missing data rate increases, that is, the audio recovery quality falls as more data are missing. In particular, when more than half of the data are missing, the audio recovery quality drops sharply. When too much data is lost, the structural correlation is weakened and it becomes difficult to mine the connection between the lost data and the known data, so the multi-channel audio signal cannot be recovered well and the audio recovery quality declines.
5. Conclusions
In the field of audio signal processing, audio restoration tasks have attracted wide attention. In this paper, a tensor completion algorithm is used to restore multi-channel audio signals with data loss. First, the multi-channel audio signal with data loss is arranged as a third-order tensor after signal preprocessing and time–frequency transformation. The audio is then recovered using the completion algorithm based on the tensor nuclear norm, with the optimal solution of the convex tensor completion model obtained through convex relaxation. Finally, the completed audio tensor is converted back into multi-channel audio. The results are compared with those of the traditional matrix completion algorithm on both objective and subjective indicators. The experimental results show that the tensor completion algorithm recovers audio signals with data loss better than the traditional method. The tensor completion algorithm formulates audio data recovery as a mathematical model and optimizes that model to find the global optimum, thereby recovering the data. The tensor completion method provides a new way to recover the lost data of multi-channel audio and effectively improves the quality of the recovered audio. It therefore has good application prospects in the field of audio signal processing.