3.1. CNN Noise Cancellation
CNNs are commonly used in audio signal processing tasks, such as speech recognition, music analysis, and sound event detection, due to their ability to extract relevant features from raw audio data [16,17]. CNNs are especially effective in analysing audio signals because they can detect local patterns and features in the input data by applying a set of learnable filters, or kernels, to small segments of the audio signal at a time. These kernels can capture patterns, such as specific frequency ranges or spectrotemporal modulations, that are relevant to the task at hand. Furthermore, CNNs can learn to represent increasingly complex features hierarchically by stacking multiple layers of convolutional and pooling operations. This allows the network to capture higher-level representations of the audio signal, such as phoneme or chord sequences in speech or music [18]. CNNs process audio signals effectively because they extract relevant features from the raw data and learn hierarchical representations of these features, which are essential for many audio signal processing tasks.
The CNN model can compute the most accurate filter parameters for ANC. In addition, a CNN is well suited to feedforward control because its performance improves greatly when the mathematical properties of the noise are captured well by the network. The noise components are inferred by training on noisy audio, which is the combination of the clean audio and the noise to be extracted.
Figure 3 shows the training and inference processes of the CNN noise canceller. In the training stage, recorded noisy common voices are used as the dataset and the noise audio as the target. The CNN is trained to predict the noise signal contained in the noisy audio so that the anti-noise to be removed can be generated. Rather than generating the clean audio directly, the network accepts the magnitude of the signals transformed to the frequency domain by the FFT and computes the FIR filter impulse response. To cancel the noise adaptively using feedforward control, the noise components should be determined using a fixed filter; thus, frequency-domain data are used in our work. In the inference stage, the noisy audio recorded from the microphone is fed to the CNN model in the form of its magnitude spectrum, which acts as the predictor. As the input audio changes, a filter-coefficient library that extracts a specified noise can be configured by user choice based on the learned model. The computed magnitude and angle components are then combined into a complex number. Through this process, we constructed a CNN that extracts the filter coefficients of the noise components for the human voice.
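The inference stage described above can be sketched as follows. This is an illustrative sketch only: the trained CNN is replaced by a placeholder function (`predict_noise_magnitude`), and the frame size and FFT length are assumed values, not the paper's actual configuration.

```python
import numpy as np

def predict_noise_magnitude(noisy_mag):
    # Placeholder for the trained CNN (hypothetical): in the real system,
    # this maps the noisy magnitude spectrum to the predicted noise
    # magnitude. A fixed scaling stands in for the learned mapping here.
    return 0.5 * noisy_mag

def filter_coefficients_from_frame(noisy_frame, n_fft=512):
    """Sketch of the inference stage: FFT magnitude -> CNN prediction ->
    recombination with the angle component -> FIR impulse response."""
    spectrum = np.fft.rfft(noisy_frame, n=n_fft)
    magnitude = np.abs(spectrum)      # magnitude works as the predictor
    phase = np.angle(spectrum)        # angle component kept for later
    noise_mag = predict_noise_magnitude(magnitude)
    # Combine the calculated magnitude and angle into a complex number,
    # then invert to obtain time-domain FIR filter coefficients.
    noise_spectrum = noise_mag * np.exp(1j * phase)
    return np.fft.irfft(noise_spectrum, n=n_fft)

coeffs = filter_coefficients_from_frame(np.random.randn(512))
print(coeffs.shape)  # (512,)
```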
3.2. Even–Odd Buffer
Buffering refers to the practice of temporarily storing audio data in a buffer, or temporary memory space, before processing it. Buffering is commonly used in audio applications to prevent audio dropouts or interruptions caused by network latency, slow processing speed, and other performance issues [19]. However, buffering can also create its own set of issues if the buffer size is too small or too large. If the buffer size is too small, there may not be enough audio data available to prevent dropouts, resulting in stuttering or skipping in the audio. On the other hand, if the buffer size is too large, it can introduce a delay, or latency, between the audio source and the playback device, resulting in audio lag or synchronization issues. Buffering issues can be especially noticeable in real-time audio applications, such as live streaming and online gaming, in which any delay or lag in the audio can disrupt the user experience [20]. To address buffering issues, it is important to size the buffer for the specific audio application and to ensure that the buffer is constantly filled with enough audio data to prevent interruptions.
Buffer size has increased due to high-resolution (over 96 kHz and 24 bits) audio sampling and signal processing circuits that demand a high computational load. Buffer management in audio DSP design is therefore increasingly important, and the buffer structure should have an optimized data-processing scheme for faster processing speed. One of the most popular buffer optimization schemes is double buffering, which efficiently handles the input/output dataflow using multiple buffers. Double buffering reduces elapsed time if the processing time of the processing units is smaller than the read time during the merge process [21]. Additionally, as long as scheduling is associated with the input-side buffer, the maximum speedup of a double buffer over a single buffer is ideally a factor of two. However, double buffering has latency issues: data-processing overhead delays the output buffer, and the input/output buffers must be synchronized by the input data acquisition and must manage all control signals [22]. This synchronization can be performed by a simple task-level pipelined ping-pong buffer, in which the producer and consumer are scheduled simultaneously. In our study, based on the ping-pong buffer, an even–odd buffer scheme is introduced to overcome the buffer under-fill issue and to decrease processing time robustly in real-time applications.
Figure 4 illustrates the structure of the proposed even–odd buffer. It operates as a ping-pong buffer that continuously feeds data to the consumer. In the ANC, the producer is the ADC and the consumer is the FFT module. Stream scheduling is managed by exploiting parallelism [13]. The input/output ports of the even–odd buffer have a bit depth of 16 bits because common audio is sampled at 16 bits. The microphone picks up the sound and sends the ambient noise signal to the ADC at a specified sample rate, buffer size, and bit depth, and the ADC writes the audio samples into the buffer. Simultaneous access by the processing unit could result in a hazard called underrun, which refers to silence in the output signal while data are being written into the buffer from the ADC. To solve this problem, we propose an even–odd buffer that efficiently passes the data to the FFT module with little latency; it is designed to write and read data continuously. The control logic receives a temporary “done” signal when the write operations of both FIFOs complete within one operation cycle. The ends of the sampled frames are clearly defined because the operation is symmetrical. The key to achieving a low noise ratio in ANC is to optimize the computation dataflow: faster coefficient changes can increase the quality of the output signals while the FIR filters process many parameters per window. However, there is a bottleneck in the data transfer process because the buffer cannot hold long windows. Our previous study [15] confirmed that the even–odd buffer sampling method is more efficient when the sampled frame is large. Experiments comparing the performance of other models are possible because the designs are synthesizable to register-transfer-level (RTL) schematics and feasible on an FPGA. The adopted even–odd buffer pipeline architecture determines which buffer is the writer and which is the reader. When the synchronization operation ends with the buffering of one frame, the control logic enables the data processing units.
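The writer/reader alternation described above can be modelled with a minimal software sketch. The class below is a hypothetical illustration of the ping-pong behaviour, not the RTL implementation: the producer fills one buffer while the consumer reads the other, and a "done" strobe marks frame completion.

```python
import numpy as np

class EvenOddBuffer:
    """Minimal sketch of a ping-pong (even-odd) buffer: the producer (ADC)
    fills one buffer while the consumer (FFT module) reads the other, so
    reads never stall on an in-progress write (no underrun)."""
    def __init__(self, frame_size):
        self.frames = [np.zeros(frame_size), np.zeros(frame_size)]
        self.write_sel = 0        # which buffer the producer writes
        self.count = 0            # samples written in the current frame
        self.frame_size = frame_size

    def write(self, sample):
        self.frames[self.write_sel][self.count] = sample
        self.count += 1
        done = self.count == self.frame_size
        if done:                  # frame complete: swap writer/reader roles
            self.write_sel ^= 1
            self.count = 0
        return done               # "done" strobe for the control logic

    def read_frame(self):
        # The consumer always reads the buffer the producer is NOT writing.
        return self.frames[self.write_sel ^ 1]

buf = EvenOddBuffer(4)
for s in [1, 2, 3, 4]:
    buf.write(s)
print(buf.read_frame())  # [1. 2. 3. 4.]
```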
Equation (1) describes direct convolution:

$$y[n] = \sum_{k=0}^{L-1} x[k]\,h[n-k], \qquad n = 0, 1, \ldots, L-1, \tag{1}$$

where $y[n]$ is the output sequence, $x[n]$ is the input sequence, $h[n]$ is the impulse-response sequence, and $L$ is the convolution length, given by the sum of the lengths of $x$ and $h$. Direct convolution has a large computational cost ($\mathcal{O}(L^2)$). This time-domain multiply–add method is infeasible when impulse responses are large, so impulse responses are instead computed following the overlap-add or overlap-save convolution method to reduce the computational cost.
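Written out as a nested loop, direct convolution makes the quadratic cost explicit. The following numpy sketch is illustrative only and is checked against numpy's built-in convolution.

```python
import numpy as np

def direct_convolution(x, h):
    """Direct (time-domain) convolution: O(L^2) multiply-adds for an
    output of length L = len(x) + len(h) - 1."""
    L = len(x) + len(h) - 1
    y = np.zeros(L)
    for n in range(L):            # one output sample per outer iteration
        for k in range(len(h)):   # inner multiply-add over the filter taps
            if 0 <= n - k < len(x):
                y[n] += h[k] * x[n - k]
    return y

x = np.array([1.0, 2.0, 3.0])
h = np.array([1.0, 0.5])
print(direct_convolution(x, h))  # [1.  2.5 4.  1.5]
```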
We use the OLS computation (2) in our experiment. This method requires less computational cost ($\mathcal{O}(N \log N)$):

$$Y[k] = X[k] \cdot H[k], \qquad X = \mathrm{FFT}\big([\,R \,\Vert\, x\,]\big), \qquad k = 0, 1, \ldots, N-1, \tag{2}$$

where $Y[k]$ is the output sequence, $X[k]$ is the input sequence, $R$ is the overlapped register, and $H[k]$ is the impulse-response sequence. All of the parameters are magnitudes in the frequency domain. $M$ is the filter length, and $L$ is an integer such that $L + M - 1$ becomes a power of 2; in other words, $N = L + M - 1$ should be equal to the length of the FFT. Our proposed even–odd buffer has a special register to store the overlapped samples, which gives it the advantage of computing the OLS method well with partitioned convolution, whereas the single-buffer method carries an overwriting risk that could disturb the processing of previous samples. Additionally, partitioned convolution uses previous frames and overlaps $M-1$ samples: for every $N-M$ sample cycles, the last $M$ samples are transferred to another buffer. The parallel structure of the buffer and its sampling controllability make implementing the OLS convolution easy. However, the CNN noise cancellation system requires a sliding window, which is converted to the frequency domain by the FFT and is used for the OLS convolution.
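The OLS scheme can be sketched in software as follows. The FFT length and filter are illustrative assumptions; each frame prepends the overlapped samples from the register, multiplies in the frequency domain, and discards the first M−1 outputs, which are corrupted by circular wrap-around.

```python
import numpy as np

def overlap_save(x, h, n_fft=8):
    """Overlap-save (OLS) FFT convolution sketch. N = n_fft is the FFT
    length, M = len(h) the filter length, and each frame advances by
    L = N - M + 1 samples."""
    M = len(h)
    L = n_fft - M + 1
    H = np.fft.fft(h, n_fft)                # filter spectrum, computed once
    R = np.zeros(M - 1)                     # overlap-save register
    y = []
    for start in range(0, len(x), L):
        frame = x[start:start + L]
        if len(frame) < L:                  # zero-pad the last short frame
            frame = np.pad(frame, (0, L - len(frame)))
        block = np.concatenate([R, frame])  # prepend overlapped samples
        Y = np.fft.fft(block, n_fft) * H    # pointwise product, as in Eq. (2)
        y.append(np.real(np.fft.ifft(Y))[M - 1:])  # drop wrapped samples
        R = block[-(M - 1):]                # save the last M-1 samples
    return np.concatenate(y)[:len(x)]

x = np.random.randn(32)
h = np.array([0.5, 0.25, 0.125])
assert np.allclose(overlap_save(x, h), np.convolve(x, h)[:32])
```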
Figure 5 shows the FFT-based OLS convolution with cross-fading. Acoustic fields are generally time-varying; at a point $T$ in the time domain, a change in the noise type triggers a controlled switch of the pre-trained filter. Each writer/reader buffer sends status signals to the control logic, and assigning a count register in each buffer prevents overflow. The buffered bandwidth minimization technique includes partitioned convolution, a real-time technique that efficiently performs time-domain convolution with low inherent latency [23].
Algorithm 1 is the proposed even–odd buffer operation algorithm, which performs the OLS convolution method. The OLS method is a convolution technique that avoids the costly multiplication operations in the time domain by working in the frequency domain instead. The algorithm buffers the input signal into two separate buffers of size $N$, called buffer 0 and buffer 1, and processes the data frame by frame, where each frame consists of $N$ samples of the input signal. The algorithm also uses a filter of size $M$ as well as an overlap-save register $R$ of size $M-1$, which saves the overlapping samples from one frame to the next. Two control signals are configured at the beginning of each frame: a buffer-select signal that indicates which buffer is used to buffer the input data, and a done signal that is set to 1 when a frame is completed. If both buffers are full (i.e., both write counters equal $N$), the done signal is set to 1 and the buffer-select signal is toggled. The algorithm then buffers the input data into the selected buffer: if the select signal is 1, the input data are buffered into buffer 0; if it is 0, they are buffered into buffer 1. For each buffer, the algorithm first overlaps the samples from the previous frame with the current frame by copying the samples from the overlap-save register $R$ into the buffer, and then saves the new data samples from the current frame into the buffer. If the end of the input data is reached (i.e., either write counter exceeds $L$, the sample size), the algorithm saves the overlapping samples of the current frame into the overlap-save register $R$ for use in the next frame. The even–odd buffering algorithm is memory-efficient because it requires only two buffers of size $N$, and it avoids unnecessary computation by using the overlap-save technique.
Algorithm 1. Even–odd buffering algorithm for the OLS convolution.
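The even–odd buffering procedure described above can be sketched in software as follows. This is a hypothetical illustration, not the actual RTL: the variable names are descriptive stand-ins for the control signals, an overlap register of size M−1 is assumed, and each frame admits N−M+1 new samples after the overlap copy.

```python
import numpy as np

def even_odd_ols(x, h, n_fft=8):
    """Sketch of even-odd buffering for OLS convolution: two buffers of
    size N alternate between writer and reader roles, and the overlap
    register R carries the last M-1 samples of each frame into the next."""
    M = len(h)
    N = n_fft
    H = np.fft.fft(h, N)
    buffers = [np.zeros(N), np.zeros(N)]
    R = np.zeros(M - 1)                     # overlap-save register
    sel, pos, out = 0, 0, []

    def consume(buf):                       # reader side: OLS on a full frame
        Y = np.fft.fft(buf, N) * H
        out.append(np.real(np.fft.ifft(Y))[M - 1:])

    for sample in x:
        if pos == 0:                        # new frame: copy overlap first
            buffers[sel][:M - 1] = R
            pos = M - 1
        buffers[sel][pos] = sample          # write into the selected buffer
        pos += 1
        done = pos == N                     # frame-completed strobe
        if done:
            R = buffers[sel][-(M - 1):].copy()  # save overlap for next frame
            consume(buffers[sel])
            sel ^= 1                        # toggle the writer buffer
            pos = 0
    return np.concatenate(out) if out else np.array([])

x = np.random.randn(36)                     # 36 = 6 full frames of 6 new samples
h = np.array([0.5, 0.25, 0.125])
assert np.allclose(even_odd_ols(x, h), np.convolve(x, h)[:36])
```

The result matches direct convolution over the fully buffered samples, which is the defining property of a correct overlap-save partition.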