Article

Audio Pre-Processing and Beamforming Implementation on Embedded Systems

1 School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
2 Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City 24205, Taiwan
3 Department of Computer Science and Information Engineering, National Central University, Taoyuan City 320317, Taiwan
4 Department of Computer Science and Information Engineering, Providence University, Taichung 43301, Taiwan
5 Department of Information Management, Chung Yuan Christian University, Taoyuan City 320317, Taiwan
6 Department of Electronic Engineering, Chung Yuan Christian University, Taoyuan City 320317, Taiwan
7 Faculty of Digital Technology, University of Technology and Education, Da Nang 550000, Vietnam
8 AI Research Center, Hon Hai Research Institute, New Taipei City 236, Taiwan
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2024, 13(14), 2784; https://doi.org/10.3390/electronics13142784
Submission received: 15 May 2024 / Revised: 12 June 2024 / Accepted: 14 June 2024 / Published: 15 July 2024
(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

Abstract

Since the invention of the microphone by Berliner in 1876, there have been numerous applications of audio processing, such as phonographs, broadcasting stations, and public address systems, which merely captured and amplified sound and played it back. Nowadays, audio processing also involves analysis and noise-filtering techniques. There are various methods for noise filtering, each employing its own algorithm, but most require two or more microphones for signal processing and analysis. For instance, mobile phones use two microphones located in different positions for active noise cancellation (one for primary audio capture and the other for capturing ambient noise). A drawback, however, is that when the sound source is distant, audio capture quality may be poor. To capture sound from distant sources, alternative methods, such as blind signal separation and beamforming, are necessary. This paper proposes employing a beamforming algorithm with two microphones to enhance speech and implements this algorithm on an embedded system. Prior to beamforming, however, the direction of the sound source must be detected accurately so that the audio from that direction can be processed and analyzed.

1. Introduction

Methods for reducing noise and enhancing signal recognition have been studied in various fields since the early days of audio signal processing. Starting in the mid-1980s, techniques such as Blind Source Separation (BSS) [1,2,3,4,5] and beamforming [6,7,8,9] were developed for this purpose. They serve as crucial components in numerous audio-processing applications, such as sound recognition and speech recognition [10,11,12,13,14].
Numerous algorithms have subsequently been derived from beamforming, such as the Fourier beamformer [15] and the minimum variance distortionless response (MVDR) beamformer [16]. The MVDR beamformer is a widely used and optimal design for beamforming applications. Its versatility allows it to be applied across various fields, such as antenna arrays, radar systems, ultrasound technology, and microphone arrays used for speech enhancement. This beamformer adapts to the data with the aim of minimizing the output variance. When the noise and the desired signal are uncorrelated, the variance of the captured signal equals the sum of the variances of the desired signal and the noise, so minimizing this total variance effectively reduces the impact of the noise [17]. MVDR maintains a distortionless response toward the desired signal while also providing interference suppression capabilities (by detecting the angle of the sound source and attenuating sounds from other angles). However, incorrect identification of the sound source angle can result in an output containing excessive noise.
To accurately identify the angle of the sound source, Voice Activity Detection (VAD) [18] can be utilized to estimate noise levels. Prior to conducting direction of arrival (DOA) estimation, eliminating noise components from the audio signal can lead to more accurate DOA outcomes with reduced errors.
In typical environments, there are often numerous unwanted sounds, such as fans, speakers, refrigerator compressors, washing machines, etc. These unwanted sounds are collectively referred to as “noise”. These noises can interfere with the transmission of essential information and affect discernibility. Therefore, there is a need for methods to achieve “noise reduction and speech enhancement”, which will be greatly beneficial for future applications.
The purpose of multichannel noise suppression mainly involves two parts: direction of arrival (DOA) detection and beamforming.
  • DOA (direction of arrival) involves analyzing the time differences of arrival of signals to determine the relative angle of a sound source, utilizing the relationship between signal space and time.
  • Beamforming: Forms a corresponding beam pattern for a specific angle of arrival, conducts spatial filtering, and suppresses sounds from other angles.
In theory, once the angle is confirmed, beamforming can suppress noise and enhance the speech from the sound source. However, DOA estimation and beamforming require significant computational resources, and most embedded systems need additional memory to cope with them; even then, their processing speed cannot achieve real-time processing of the signals.
If real-time beamforming is the goal, an FPGA offers good performance, but the production cost is prohibitively high and many detailed parameters are not easily modifiable. Therefore, this paper employs a high-performance MCU to accomplish the task. The buffers required for the computations are all kept in the on-chip cache, and unnecessary parameters and variables are trimmed.
  • Hardware Platform.
The experimental setup is built upon the architecture of an audio codec IC and MCU. Utilizing hardware circuits for low-/high-pass filters helps in the early elimination of power noise. Combined with the cache memory (512 KB) within the MCU, the aim is to achieve real-time processing to the greatest extent possible.
  • Algorithm Implementation and Evaluation.
The software research concentrates on implementing DOA (direction of arrival) and beamforming algorithms, validating their functionality and evaluating the execution efficiency of the MCU.
Sound is a category of signals generated by the vibration of objects, propagated through a medium (such as air, solids, or liquids) and perceived by the auditory organs of animals. When objects vibrate, they cause regular variations in the density of the medium (such as air), resulting in longitudinal waves.
Sound is the result of the superposition of “sine waves” with different frequencies and intensities. In signal analysis, Fourier transformation is commonly used to decompose sound into signals of different frequencies, facilitating signal analysis and filtering. The frequency range of sound waves audible to the human ear is 20 Hz to 20 kHz. Therefore, in audio analysis, only signals below 20 kHz are typically utilized.
Microphones are capable of capturing sound through internal sensors resembling eardrums, which detect the pressure of sound waves and generate vibrations. Depending on the amplitude of these vibrations, different voltage signals (analog signals) are produced. However, to store sound signals as digital information in media files, the analog signals (voltage values sensed by the microphone) need to be converted through an Analog-to-Digital Converter (ADC). This process enables the storage of sound as digital data, allowing for analysis and processing of these digital signals, applicable across various fields.
A microphone array is composed of multiple sensors arranged in a specific manner (with equal spacing and consistent relative angle differences). It is used to receive signals in space, which are then processed for various applications, including source separation, direction of arrival (DOA) estimation, and beamforming.
Theoretically, multiple microphones are placed to enhance recognition rates and to establish the relationship between "signal space" and "time" for the purpose of speech enhancement. However, microphone placement follows specific patterns depending on the application, with variations in arrangement and quantity. The simplest model of a microphone array is shown in Figure 1 [19].

2. Implementation of Direction of Arrival (DOA) on an Embedded System

2.1. Algorithm Flow

The operational procedure for the DOA in this paper is as follows:
  • Utilize VAD [18] to detect speech segments;
  • Apply spectral subtraction [20,21] for audio noise reduction;
  • Input the denoised audio into the DOA recognizer for azimuth detection [22]. An architecture diagram of the DOA-embedded system is shown in Figure 2.

2.1.1. Voice Activity Detection (VAD)

VAD treats the sound as a form of energy in order to identify sound events, according to the following formula:

$E_t = \sum_{n=1}^{N} \left( x_t[n] - m_t \right)^2, \qquad \text{if } E_t > T, \text{ a sound event is detected}$

$x_t[n]$: received audio;
$t$: frame number of the audio;
$N$: total number of samples in the audio frame;
$n$: sample index of the audio (ranging from 1 to $N$);
$m_t$: the mean value of the audio frame $x_t[n]$.
If $E_t$ is greater than the threshold value $T$, a sound event is declared. Different microphones or audio modules, however, have different threshold values, which need to be determined through testing [11].
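As a concrete illustration, the following C sketch computes the frame energy $E_t$ defined above and compares it against the threshold $T$. The frame length and the way the threshold is obtained are assumptions made for illustration; as noted, the threshold has to be calibrated for the particular microphone and audio module in use.

```c
#include <stddef.h>

/* Returns 1 if the frame contains a sound event, 0 otherwise.
 * x: one frame of audio samples, n: frame length (N), threshold: calibrated T. */
static int vad_frame(const float *x, size_t n, float threshold)
{
    float mean = 0.0f;                  /* m_t: mean value of the frame */
    for (size_t i = 0; i < n; i++)
        mean += x[i];
    mean /= (float)n;

    float energy = 0.0f;                /* E_t = sum over n of (x_t[n] - m_t)^2 */
    for (size_t i = 0; i < n; i++) {
        float d = x[i] - mean;
        energy += d * d;
    }
    return energy > threshold;          /* event detected if E_t > T */
}
```

In the DOA pipeline of Figure 2, frames flagged by this check are passed on to the noise-reduction and direction-detection stages, while non-event frames can be used to update the noise estimate described in the next subsection.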

2.1.2. Spectral Subtraction Method

This method is primarily used to remove environmental noise. It subtracts the "average noise spectrum" from the "spectrum of the noisy signal"; the average noise spectrum refers to the signal received when no sound event is present.
Let $y(k)$ be the received audio signal, $s(k)$ the original (clean) signal, and $n(k)$ the additive noise. We obtain the following [21,22]:

$y(k) = s(k) + n(k)$

After Fourier transformation:

$Y(e^{j\omega}) = S(e^{j\omega}) + N(e^{j\omega})$

The spectral subtraction rule is expressed as:

$|\hat{S}(e^{j\omega})|^2 = \begin{cases} |Y(e^{j\omega})|^2 - \alpha\,|\mu(e^{j\omega})|^2, & \text{if } |Y(e^{j\omega})|^2 > \alpha\,|\mu(e^{j\omega})|^2 \\ \beta\,|Y(e^{j\omega})|^2, & \text{otherwise} \end{cases}$

$|\mu(e^{j\omega})|^2 = E\{|N(e^{j\omega})|^2\}$

$\mu(e^{j\omega})$: average noise spectrum;
$\alpha$: a value between 0 and 1, used as a weighting factor for the noise in the algorithm;
$\beta$: a very small value close to 0 (spectral floor).
The spectrum of the denoised signal is $\hat{S}(e^{j\omega})$, where $\theta(e^{j\omega})$ denotes the phase of $Y(e^{j\omega})$:

$\hat{S}(e^{j\omega}) = |\hat{S}(e^{j\omega})|\, e^{j\theta(e^{j\omega})}$

The gain $H(e^{j\omega})$ is the ratio between the denoised spectrum and the noisy spectrum $Y(e^{j\omega})$, so the denoised spectrum can also be written as:

$\hat{S}(e^{j\omega}) = H(e^{j\omega})\, Y(e^{j\omega}), \qquad H(e^{j\omega}) = \sqrt{\frac{|\hat{S}(e^{j\omega})|^2}{|Y(e^{j\omega})|^2}}$
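A minimal C sketch of this per-bin processing is given below. It assumes that the FFT of the current frame has already been computed elsewhere and that the noise power spectrum $|\mu(e^{j\omega})|^2$ is maintained as a running average over non-event frames; the function name and parameter choices are illustrative only.

```c
#include <math.h>
#include <stddef.h>

/* Spectral subtraction applied to one frame in the frequency domain.
 * re, im:       real/imaginary parts of Y(e^jw) per bin (modified in place)
 * noise_power:  estimate of |mu(e^jw)|^2 per bin
 * alpha:        noise weighting factor (0..1); beta: small spectral-floor factor */
static void spectral_subtract(float *re, float *im, const float *noise_power,
                              size_t n_bins, float alpha, float beta)
{
    for (size_t k = 0; k < n_bins; k++) {
        float y_pow = re[k] * re[k] + im[k] * im[k];        /* |Y(e^jw)|^2 */
        float s_pow = (y_pow > alpha * noise_power[k])
                        ? y_pow - alpha * noise_power[k]    /* |Y|^2 - alpha*|mu|^2 */
                        : beta * y_pow;                     /* spectral floor */

        /* H(e^jw) = sqrt(|S|^2 / |Y|^2); scaling Re and Im keeps the noisy phase. */
        float gain = (y_pow > 0.0f) ? sqrtf(s_pow / y_pow) : 0.0f;
        re[k] *= gain;
        im[k] *= gain;
    }
}
```

After the inverse FFT of the scaled spectrum, the denoised frame is handed to the DOA stage.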

2.1.3. Direction Detection

The Choice of Microphone Spacing

The calculation of a sound signal's angle is a form of blind detection: mathematical algorithms are applied directly to the received data to compute the angle of the sound source. A direction detection schematic diagram is shown in Figure 3 [23].
To achieve the minimum resolvable angle, one needs to determine the time difference corresponding to the smallest path-length difference that the system can resolve:

$\text{minimum detection time} = \frac{\text{path-length difference corresponding to the minimum resolvable angle}}{\text{speed of sound}}$

$\tau = \frac{l}{v} = \frac{d \sin(\theta)}{v}$

$v = 331.5 + 0.61\,T$

$v$: speed of sound (m/s);
$T$: temperature in degrees Celsius;
$\tau$: time difference.
For example, if the current temperature is 25 °C, the microphone spacing is 25 cm, and a resolvable angle of 5° is desired, substituting these values into the formulas yields a minimum time difference of 62.8376 μs, which requires a sample rate of at least 15.9 kHz.
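These numbers can be reproduced with a few lines of C; the short program below, written as a sketch, evaluates $v$, $\tau$, and the minimum sample rate $1/\tau$ for the values quoted above.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double pi     = 3.14159265358979;
    const double temp_c = 25.0;                 /* temperature T in Celsius */
    const double d      = 0.25;                 /* microphone spacing in meters */
    const double theta  = 5.0 * pi / 180.0;     /* minimum resolvable angle */

    double v   = 331.5 + 0.61 * temp_c;         /* speed of sound (m/s) */
    double tau = d * sin(theta) / v;            /* minimum time difference (s) */
    double fs  = 1.0 / tau;                     /* minimum required sample rate (Hz) */

    printf("v = %.2f m/s, tau = %.2f us, fs >= %.1f kHz\n",
           v, tau * 1e6, fs / 1e3);             /* approx. 62.84 us and 15.9 kHz */
    return 0;
}
```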

The Angle Detection Algorithm

The angle detection algorithm follows the approach of cross-power spectral density (XPSD) estimation and cross-correlation peak detection based on GCC-PHAT (Generalized Cross-Correlation with Phase Transform) for time delay estimation (TDE) [22]. This method can only discern the direction of one sound source at a time and cannot identify multiple sound source locations simultaneously. However, its advantages lie in its simplicity: it requires only two microphones, a simpler hardware architecture, and lower computational resources.
When performing directional detection, two scenarios need to be considered: far-field planar waves and near-field radiating waves. In the near field, differences caused by the distance between microphones must be considered, whereas in the far-field model, differences in sound reception between microphones can be neglected. For simplicity of computation, this paper only considers the far-field model.
In the far-field model, sound waves can be treated as planar waves. The time difference τ between the sound received by two microphones is expressed by Equation (8). Therefore, knowing the value of the time difference τ allows us to determine the sound source’s direction.
This paper uses the following method to obtain the direction of arrival [22]. For a distant sound source, the incident angle with respect to each microphone is almost identical. The signals received by the two microphones can be expressed as:

$x_1(t) = s(t) + n_1(t)$

$x_2(t) = \alpha\, s(t + D) + n_2(t)$

$x_1(t), x_2(t)$: the signals received by the two microphones;
$n_1(t), n_2(t)$: the noise received by each microphone;
$D$: time delay between the two microphones receiving the same signal;
$\tau$: the lag variable; the lag that maximizes the cross-correlation provides an estimate of the delay;
$\alpha$: attenuation of the signal between the two microphones.
The cross-correlation between the two microphone signals can be expressed as follows:

$R_{x_1 x_2}(\tau) = E\!\left[x_1(t)\, x_2(t-\tau)\right] = E\!\left[\left(s(t) + n_1(t)\right)\left(\alpha\, s(t-\tau) + n_2(t-\tau)\right)\right]$
$= \alpha\, E\!\left[s(t)\, s(t-\tau)\right] + E\!\left[s(t)\, n_2(t-\tau)\right] + \alpha\, E\!\left[s(t-\tau)\, n_1(t)\right] + E\!\left[n_1(t)\, n_2(t-\tau)\right]$

Assuming that $s(t)$, $n_1(t)$, and $n_2(t)$ are mutually independent zero-mean processes and that $\alpha = 1$, the cross terms vanish and the formula becomes:

$R_{x_1 x_2}(\tau) = E\!\left[s(t)\, s(t-\tau)\right]$

where $E$ denotes the expected value [22]. The delay estimate is the value of $\tau$ that maximizes Equation (13); however, because the actual observation time is finite, only an estimate of the cross-correlation is available:

$\hat{R}_{x_1 x_2}(\tau) = \frac{1}{T-\tau} \int_{\tau}^{T} x_1(t)\, x_2(t-\tau)\, dt$

where $T$ represents the observation interval. The relationship between the cross-correlation and the cross-power spectrum can be expressed using the Fourier transform [22]:

$R_{x_1 x_2}(\tau) = \int_{-\infty}^{\infty} G_{x_1 x_2}(f)\, e^{j 2\pi f \tau}\, df$

$G_{y_1 y_2}(f) = H_1(f)\, H_2^{*}(f)\, G_{x_1 x_2}(f)$

where $^*$ denotes the complex conjugate [22]. As a result, the generalized cross-correlation between $x_1(t)$ and $x_2(t)$ can be written as:

$R^{(g)}_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} \varphi_g(f)\, G_{x_1 x_2}(f)\, e^{j 2\pi f \tau}\, df$

$\varphi_g(f) = H_1(f)\, H_2^{*}(f)$

where $\varphi_g(f)$ denotes the general frequency weighting [22].
In practice, only an estimate $\hat{G}_{x_1 x_2}(f)$ of the cross-power spectral density $G_{x_1 x_2}(f)$ can be derived from finite observations of $x_1(t)$ and $x_2(t)$, giving:

$\hat{R}^{(g)}_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} \varphi_g(f)\, \hat{G}_{x_1 x_2}(f)\, e^{j 2\pi f \tau}\, df$

The resulting $\hat{R}^{(g)}_{y_1 y_2}(\tau)$ is computed and used for delay estimation. Depending on the specific form of $\varphi_g(f)$ and the available prior knowledge, it may also be necessary to estimate $\varphi_g(f)$. For instance, if the purpose of the prefilter is to emphasize, at the correlator input, the frequencies with the highest signal-to-noise (S/N) ratio, then $\varphi_g(f)$ is a function of the signal and noise spectra, which must either be known beforehand or estimated [22].
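To make the procedure concrete, the following C sketch estimates the delay by searching for the peak of a finite-observation cross-correlation (the discrete-time counterpart of the estimate above) and converts the delay into an angle via $\tau = d\sin(\theta)/v$. It deliberately omits the PHAT frequency weighting, which requires an FFT, so it is a simplified stand-in for the GCC-PHAT processing of [22]; the search range, spacing, and sample rate are illustrative.

```c
#include <math.h>
#include <stddef.h>

/* Time delay estimation by cross-correlation peak search.
 * x1, x2: microphone signals of length n; max_lag: search bound in samples,
 * which should correspond to the largest physically possible delay d / v. */
static int estimate_delay(const float *x1, const float *x2, size_t n, int max_lag)
{
    int best_lag = 0;
    float best_val = -INFINITY;

    for (int lag = -max_lag; lag <= max_lag; lag++) {
        float r = 0.0f;
        for (size_t t = 0; t < n; t++) {
            int t2 = (int)t - lag;               /* x2(t - tau) */
            if (t2 >= 0 && (size_t)t2 < n)
                r += x1[t] * x2[t2];
        }
        if (r > best_val) { best_val = r; best_lag = lag; }
    }
    return best_lag;   /* delay in samples; the sign tells which mic was reached first */
}

/* Convert a delay in samples to the angle implied by tau = d*sin(theta)/v. */
static float delay_to_angle_deg(int lag, float fs, float mic_dist, float speed)
{
    float s = ((float)lag / fs) * speed / mic_dist;   /* sin(theta) */
    if (s > 1.0f)  s = 1.0f;                          /* clamp numerical overshoot */
    if (s < -1.0f) s = -1.0f;
    return asinf(s) * 180.0f / 3.14159265f;
}
```

A full GCC-PHAT implementation would instead compute the cross-power spectrum with an FFT, normalize each bin by its magnitude (the PHAT weighting $\varphi_g(f)$), and take the inverse transform before the peak search.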
The waveform schematic diagram for direction detection is shown in Figure 4.

2.2. Embedded System Hardware Devices

The embedded system for azimuth detection in this paper uses the PIC32MZ starter kit and the AK4642EN audio codec module as the development platform. The related hardware for the development platform is shown in Figure 5, and a simplified circuit diagram of the embedded system is shown in Figure 6. The hardware architecture is described below in three parts: the microphone, the audio codec module, and the PIC32MZ starter kit development board.

2.2.1. The Audio Module (Comprising Microphone Module and Audio Codec Module)

Using a dynamic microphone with the signal amplification module (E001) featuring a filtering function can reduce noise in the received audio at the expense of attenuating higher frequency signals.
The audio codec module contains the AK4642EN audio codec IC for audio encoding/decoding, and the MIC signal receiver includes a simple hardware filter that can eliminate signals below 50 Hz. It can be configured via the I2C communication interface to adjust basic functions of the audio codec, such as the sample rate, communication format, volume adjustment, filtering frequency, etc. The detailed specifications are shown in Table 1. The AK4642EN hardware block is shown in Figure 7 [24].
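As an illustration of how such a codec is brought up from the MCU, the sketch below shows the general I2C configuration pattern. The register addresses and values are placeholders only and are not the real AK4642EN register map, which must be taken from the datasheet [24]; i2c_write_reg() stands in for whatever I2C driver routine the project provides.

```c
#include <stdint.h>

/* Placeholder symbols - NOT the actual AK4642EN register map; see [24]. */
#define CODEC_I2C_ADDR    0x12   /* placeholder 7-bit device address */
#define REG_POWER_MGMT    0x00   /* placeholder register */
#define REG_MODE_CONTROL  0x01   /* placeholder register */

/* Assumed to be provided by the board support package / I2C driver. */
extern int i2c_write_reg(uint8_t dev_addr, uint8_t reg, uint8_t value);

/* Example start-up sequence: power up the codec, then set sample rate/format. */
static int codec_init(void)
{
    if (i2c_write_reg(CODEC_I2C_ADDR, REG_POWER_MGMT,   0x01) != 0) return -1;
    if (i2c_write_reg(CODEC_I2C_ADDR, REG_MODE_CONTROL, 0x22) != 0) return -1;
    return 0;
}
```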

2.2.2. The Processing Core of Embedded Systems

The core of the embedded system is an MCU from the PIC32MZ family, which incorporates a hardware floating-point unit. It also features DMA functionality, allowing data transfer to proceed concurrently with the execution of other algorithms. The specifications and characteristics of the microprocessor are shown in Table 2. The microprocessor computational capability block diagram is shown in Figure 8 [25,26].

3. Beamforming

In this study, due to resource constraints, beamforming, which typically utilizes an array of multiple sensors for signal enhancement, is implemented using only dual microphones.
Two assumptions need to be considered in array signal processing: near-field radiation and far-field plane waves. Both scenarios need to be accounted for in beamforming.
In the near field, the differences caused by the distance between the microphones must be calculated. In the far-field model, differences in reception between the microphones can be neglected. For simplicity of computation, this paper considers only the far-field model when explaining the beamforming principle.
Assuming different angles of arrival for audio source signals are known, the task of beamforming is to process these audio signals, retaining only the sound from the selected direction. If executed correctly, other noises and echoes can be filtered out directly. The method chosen in this paper is minimum variance distortionless response (MVDR) beamforming. MVDR enhances the desired signal and then minimizes interference or noise from other directions, reducing total output power.
The MVDR beamforming used in this work is described as follows [28]. Considering a uniform linear array (ULA) comprising $M$ sensors with inter-element spacing $d$, the output of a narrowband beamformer is given by

$y(k) = \mathbf{w}^{H} \mathbf{x}(k)$

where $k$ is the time index, $\mathbf{x}(k) = \left[ x_1(k), \dots, x_M(k) \right]^T$ is the $M \times 1$ complex vector of array observations, and $\mathbf{w} = \left[ w_1, w_2, \dots, w_M \right]^T$ is the $M \times 1$ complex vector of beamformer weights. $T$ and $H$ stand for the transpose and Hermitian transpose, respectively.
The output SINR of the beamformer is defined as follows:

$\mathrm{SINR} = \frac{\mathbf{w}^H \mathbf{R}_d \mathbf{w}}{\mathbf{w}^H \mathbf{R}_{i+n} \mathbf{w}} = \frac{\sigma_0^2 \left| \mathbf{w}^H \mathbf{a}(\theta_0) \right|^2}{\mathbf{w}^H \mathbf{R}_{i+n} \mathbf{w}}$

where $\mathbf{R}_d$ is the covariance matrix of the desired signal, $\mathbf{R}_{i+n}$ is the interference-plus-noise covariance matrix, $\sigma_0^2$ is the power of the desired signal, and $\mathbf{a}(\theta_0)$ is the steering vector for the source direction $\theta_0$.
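The steering vector $\mathbf{a}(\theta_0)$ is not written out in the text; for a uniform linear array under the narrowband far-field assumption, its standard form (included here for completeness) is

$\mathbf{a}(\theta) = \left[\, 1,\; e^{-j 2\pi f d \sin(\theta)/c},\; \dots,\; e^{-j 2\pi f (M-1) d \sin(\theta)/c} \,\right]^{T}$

where $c$ is the propagation speed (the speed of sound here) and $f$ is the narrowband center frequency.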
The optimum beamformer that maximizes the output SINR can be interpreted as the solution to the following constrained minimization problem. The beamforming-embedded system architecture diagram is shown in Figure 9 [28].
$\mathbf{w}_0 = \arg\min_{\mathbf{w}} \; \mathbf{w}^H \mathbf{R} \mathbf{w} \qquad \text{s.t.} \qquad \mathbf{w}^H \mathbf{a}(\theta_0) = 1$
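The closed-form solution of this constrained problem is the standard MVDR weight vector (not spelled out in the text):

$\mathbf{w}_0 = \frac{\mathbf{R}^{-1}\mathbf{a}(\theta_0)}{\mathbf{a}^{H}(\theta_0)\,\mathbf{R}^{-1}\,\mathbf{a}(\theta_0)}$

The C sketch below evaluates these weights for the two-microphone case used in this work, with a direct 2 × 2 complex matrix inverse. The diagonal loading term is an assumption added to keep the covariance matrix invertible, and the estimation of $\mathbf{R}$ from frames of microphone data is not shown.

```c
#include <complex.h>

/* MVDR weights w = R^{-1} a / (a^H R^{-1} a) for M = 2 microphones.
 * R: 2x2 spatial covariance matrix in row-major order, a: steering vector. */
static void mvdr_weights_2ch(const float complex R[4], const float complex a[2],
                             float complex w[2])
{
    const float loading = 1e-6f;                    /* diagonal loading (assumption) */
    float complex r00 = R[0] + loading, r01 = R[1];
    float complex r10 = R[2],           r11 = R[3] + loading;

    float complex det = r00 * r11 - r01 * r10;      /* 2x2 inverse */
    float complex i00 =  r11 / det, i01 = -r01 / det;
    float complex i10 = -r10 / det, i11 =  r00 / det;

    float complex Ra0 = i00 * a[0] + i01 * a[1];    /* R^{-1} a */
    float complex Ra1 = i10 * a[0] + i11 * a[1];

    float complex denom = conjf(a[0]) * Ra0 + conjf(a[1]) * Ra1;  /* a^H R^{-1} a */
    w[0] = Ra0 / denom;
    w[1] = Ra1 / denom;
}

/* Beamformer output for one snapshot: y(k) = w^H x(k). */
static float complex mvdr_output(const float complex w[2], const float complex x[2])
{
    return conjf(w[0]) * x[0] + conjf(w[1]) * x[1];
}
```

In a practical frequency-domain implementation, one set of weights is computed per frequency bin, with the covariance matrix estimated recursively from the incoming frames.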

4. Results

4.1. Experimental Results of Direction Detection

4.1.1. Experimental Environment

The test space used measures 8 m × 5 m × 3 m and features two unidirectional microphones spaced 25 cm apart. The distance between the microphones and the source of sound is 1.5 m. A schematic diagram of experimental space for direction detection is shown in Figure 10. The experimental setup is shown in Figure 11. Microphone placement is shown in Figure 12. Microphone placement and angle markings are shown in Figure 13.
A total of nine angles are to be tested, ranging from 30° to 150° in increments of 15°. Each angle should be tested at least twice to verify any differences between consecutive tests.

4.1.2. Experimental Results

Currently, with the direction of arrival (DOA) functionality implemented on the Microchip development board, the obtained angular error does not yet meet the expected results. The system can nevertheless detect angles to within ±15 degrees, and for repeated tests at the same point most differences are within 5 degrees. The current recognition time is approximately 1.5 s, which falls short of real-time recognition, primarily due to the processing speed of the MCU itself. A demonstration is depicted in Figure 14. The test values are listed in Table 3.

4.1.3. Discussion of Issues

Based on the actual test results, the angular differences measured at the same point are all within ±5 degrees. However, there are still cases with significant deviations in the measured angle, as well as cases where the measured angle falls at 90 ± 5 degrees when it should not. The former is inferred to be due to excessive reverberation in the testing environment, which leads to significant discrepancies in angle determination. The latter is inferred to be caused by temporary noise interference or excessive environmental noise, rendering the spectral subtraction method ineffective at removing the noise.

4.2. Results from the Beamforming Experiment

4.2.1. Experimental Environment

The space used for testing has dimensions of 8 m × 5 m × 3 m, with two unidirectional microphones placed at a spacing of 25 cm. The microphones are positioned 1.5 m away from the center of the sound sources. Two sound sources are positioned at angles of 45° and 135°, respectively. The microphone setup follows the configuration described in Section 4.1. The signals transformed by the MVDR method are recorded and compared to evaluate the differences. A schematic diagram of the experimental space for dual-sound-source beamforming testing is shown in Figure 15.

4.2.2. Experimental Results

Because only the PIC32's internal cache is available for data storage, signals from many channels cannot be stored. Thus, only two microphones are used to test the waveform differences after MVDR beamforming.
There are two testing methods currently being employed.
  • Using a single sound source (fixed at 45°) and steering the MVDR beamformer to different angles to observe the waveform differences. With single-source beamforming steered to 170°, the result is a waveform similar to the right channel, as shown in Figure 16. Steered to 45°, the result is a waveform similar to the left channel, as shown in Figure 17. Steered to 90°, the result is a waveform similar to the mean of the left and right channels, as shown in Figure 18.
  • Using dual sound sources (at 45° and 135°) and steering the MVDR beamformer to different angles to observe the waveform differences. With dual-source beamforming steered to 180°, the result is a waveform similar to the right channel, as shown in Figure 19. Steered to 45°, the result is a waveform similar to the left channel, as shown in Figure 20. Steered to 135°, the result is a waveform similar to the mean of the left and right channels, as shown in Figure 21.
The following shows an experimental illustration of the results for a single sound source at a 45-degree angle.

4.2.3. Discussion of Issues

According to the theory of beamforming, it is supposed to enhance the audio from the desired direction while attenuating sound from other directions. From a waveform perspective, this appears to hold, but in terms of human perception there is not a significant difference. When two sound sources are present simultaneously, sound from the other direction can still be clearly heard, which reduces the method's effectiveness. The reasons for this could include the following:
  • Differences in microphones and filtering circuits: The type of microphone used can affect the quality of reception, and poorly designed signal filtering circuits can result in unwanted noise. If this noise needs to be completely eliminated through software, it could lead to excessive computational time. However, if left unaddressed, it could interfere with the beamforming algorithm.
  • Varied types or placements of microphones: Differences in microphone types, positions, and angles can all lead to variations in the corresponding algorithms. Fine-tuning of the MVDR algorithm might be necessary.
  • Inadequate number of microphones: Typically, beamforming requires arrays of four or more microphones, arranged not in a straight line but in a curved or circular configuration. Currently, only two microphones are being used, so while beamforming may have some effect, it is not pronounced.

5. Conclusions

The method employed in this paper utilizes a beamforming algorithm with two microphones to enhance speech, implemented on an embedded system. However, prior to beamforming, accurate detection of the sound source’s direction is imperative for effectively processing and analyzing audio signals emanating from that specific direction. According to beamforming theory, it should strengthen the desired direction’s audio while attenuating sound from other directions. This is generally reflected in the waveform; however, perceptually, there may not be a significant difference, especially when multiple sound sources are present simultaneously, leading to performance degradation. This could be attributed to several reasons.
1. Differences originating from microphones and filtering circuits: The type of microphone used can affect the quality of reception, and improper design of the signal filtering circuits may result in unwanted noise. Completely eliminating this noise through software may lead to excessive computational time, but leaving it unaddressed could affect the beamforming algorithm.
2. Differences in microphone type or placement: Variations in microphone type, placement, and angle can lead to differences in the corresponding algorithms. Further refinement of the minimum variance distortionless response (MVDR) algorithm may be necessary.
3. Insufficient number of microphones: Beamforming typically requires arrays of four or more microphones, arranged not linearly but in an arc or ring configuration. Currently, only two microphones are being used, resulting in some effectiveness of beamforming, albeit not prominently evident.

Author Contributions

Conceptualization, J.-C.W.; methodology, S.-J.K.; resources, Y.-H.L.; writing—original draft preparation, S.-J.K.; writing—review and editing, J.-H.W., P.T.L., T.-C.T., K.-C.L., S.-L.C., Z.-Y.W. and T.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hérault, J.; Jutten, C.; Ans, B. Détection de grandeurs primitives dans un message composite par une architecture de calcul neuromimétique en apprentissage non supervisé. In Proceedings of the GRETSI, Nice, France, 20–24 May 1985; pp. 1017–1020.
  2. Bai, M.-H.; Liu, Y.-T.; Kuei, C.-Y.; Hsu, W.-C. A Method for Noise Reduction and Speech Enhancement; Intellectual Property Office: Beijing, China, 2011.
  3. Wang, J.-C.; Wang, C.-Y.; Tai, T.-C.; Shih, M.; Huang, S.-C.; Chen, Y.-C.; Lin, Y.-Y.; Lian, L.-X. VLSI design for convolutive blind source separation. IEEE Trans. Circuits Syst. II 2016, 63, 196–200.
  4. Ueda, T.; Nakatani, T.; Ikeshita, R.; Kinoshita, K.; Araki, S.; Makino, S. Low Latency Online Blind Source Separation Based on Joint Optimization with Blind Dereverberation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021.
  5. Ueda, T.; Nakatani, T.; Ikeshita, R.; Kinoshita, K.; Araki, S.; Makino, S. Low Latency Online Source Separation and Noise Reduction Based on Joint Optimization with Dereverberation. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 1000–1004.
  6. Johnson, D.; Dudgeon, D. Array Signal Processing: Concepts and Techniques; Prentice Hall: Englewood Cliffs, NJ, USA, 1993.
  7. Van Veen, B.D.; Buckley, K.M. Beamforming: A versatile approach to spatial filtering. IEEE ASSP Mag. 1988, 5, 4–24.
  8. Priyanka, S.S. A review on adaptive beamforming techniques for speech enhancement. In Proceedings of the 2017 Innovations in Power and Advanced Computing Technologies (i-PACT), Vellore, India, 21–22 April 2017; pp. 1–6.
  9. Zhu, J.; Bao, C.; Cheng, R. Speech Enhancement Integrating the MVDR Beamforming and T-F Masking. In Proceedings of the 2019 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Dalian, China, 20–22 September 2019; pp. 1–5.
  10. Wang, J.-C.; Lee, Y.-S.; Lin, C.-H.; Siahaan, E.; Yang, C.-H. Robust environmental sound recognition with fast noise suppression for home automation. IEEE Trans. Autom. Sci. Eng. 2015, 12, 1235–1242.
  11. Nga, C.H.; Vu, D.-Q.; Luong, H.H.; Huang, C.-L.; Wang, J.-C. Cyclic transfer learning for Mandarin-English code-switching speech recognition. IEEE Signal Process. Lett. 2023, 30, 1387–1391.
  12. Prabhavalkar, R.; Hori, T.; Sainath, T.N.; Schlüter, R.; Watanabe, S. End-to-End Speech Recognition: A Survey. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 325–351.
  13. Gong, X.; Wu, Y.; Li, J.; Liu, S.; Zhao, R.; Chen, X.; Qian, Y. Advanced Long-Content Speech Recognition with Factorized Neural Transducer. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1803–1815.
  14. Rouhe, A.; Grósz, T.; Kurimo, M. Principled Comparisons for End-to-End Speech Recognition: Attention vs. Hybrid at the 1000-Hour Scale. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 623–638.
  15. Zoltowski, M. High resolution sensor array signal processing in the beamspace domain: Novel techniques based on the poor resolution of Fourier beamforming. In Proceedings of the Fourth Annual ASSP Workshop on Spectrum Estimation and Modeling, Minneapolis, MN, USA, 3–5 August 1988; pp. 350–355.
  16. Marciano, J.S., Jr.; Vu, T.B. Reduced complexity MVDR broadband beamspace beamforming under frequency invariant constraint. In Proceedings of the IEEE Antennas and Propagation Society International Symposium, Salt Lake City, UT, USA, 16–21 July 2000; pp. 902–905.
  17. VOCAL Technologies. Available online: https://vocal.com/beamforming-2/minimum-variance-distortionless-response-mvdr-beamformer (accessed on 15 May 2024).
  18. Kinnunen, T.; Rajan, P. A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7229–7233.
  19. Jin, J.; Benesty, J.; Huang, G.; Chen, J. On Differential Beamforming with Nonuniform Linear Microphone Arrays. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 1840–1852.
  20. Boll, S.F. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120.
  21. Kim, W.; Kang, S.; Ko, H. Spectral subtraction based on phonetic dependency and masking effects. IEE Proc.-Vis. Image Signal 2000, 147, 423–427.
  22. Knapp, C.H.; Carter, G.C. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 320–327.
  23. Ibrahim, S.Z.; Razalli, M.S.; Hoon, W.F.; Karim, M.N.A. Six-Port interferometer for direction-of-arrival detection system. In Proceedings of the 2016 IEEE International Symposium on Systems Engineering (ISSE), Edinburgh, UK, 3–5 October 2016; pp. 1–4.
  24. AK4642EN Datasheet. Available online: https://www.alldatasheet.net/datasheet-pdf/pdf/136842/AKM/AK4642EN.html (accessed on 1 May 2024).
  25. PIC32MZ Datasheet. Available online: https://www.alldatasheet.com/view.jsp?Searchword=Pic32mz&gad_source=1&gclid=C-jwKCAjwrcKxBhBMEiwAIVF8rLSgCqjZ81IB8nHSzJV9bQYjmVWdD48KMWiGLTCPAruosHVGesXvihoCs08QAvD_BwE (accessed on 1 May 2024).
  26. Li, D.; Yin, Q.; Mu, P.; Guo, W. Robust MVDR beamforming using the DOA matrix decomposition. In Proceedings of the 2011 1st International Symposium on Access Spaces (ISAS), Yokohama, Japan, 17–19 June 2011; pp. 105–110.
  27. Guo, C.; Tian, L.; Jiang, Z.H.; Hong, W. A Self-Calibration Method for 5G Full-Digital TDD Beamforming Systems Using an Embedded Transmission Line. IEEE Trans. Antennas Propag. 2021, 69, 2648–2659.
  28. Liang, Y.; Bao, C.; Zhou, J. An Implementation of the CNN-Based MVDR Beamforming for Speech Enhancement. In Proceedings of the 2021 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Xi’an, China, 17–19 August 2021; pp. 1–5.
Figure 1. Microphone array—linear model [19].
Figure 2. Architecture diagram of the DOA-embedded system [22].
Figure 3. Direction detection schematic diagram [23].
Figure 4. Waveform schematic diagram for direction detection. (Note: The “立体声” is “stereo”. The “32位浮点” is “32-bit floating point”. The “静音” is “mute”. The “独奏” is “solo”.)
Figure 5. Related hardware for development platforms.
Figure 6. Simple circuit diagram for embedded systems.
Figure 7. AK4642EN hardware block diagram [24].
Figure 8. Microprocessor computational capability block diagram [27].
Figure 9. Beamforming-embedded system architecture diagram.
Figure 10. Schematic diagram of experimental space for direction detection.
Figure 11. Experimental setup. (Note: The definition of “30度” is 30-degree angle. The definition of “60度” is 60-degree angle. The definition of “90度” is 90-degree angle. The definition of “120度” is 120-degree angle. The definition of “150度” is 150-degree angle.)
Figure 12. Microphone placement.
Figure 13. Microphone placement and angle markings.
Figure 14. Demonstration illustration. (Note: The definition of “通訊埠測試-[接收音訊資料]” is “Communication port testing-[Receiving audio data]”. The definition of “設定” is “Settings”. The definition of “執行” is “Execute”. The definition of “接收音訊資料中” is “In the process of receiving audio data”. The definition of “無法接收音訊檔” is “Unable to receive audio file”.)
Figure 15. Beamforming: a schematic diagram of the experimental space for dual sound source testing.
Figure 16. Under single-source beamforming at 170°, the result shows a waveform similar to the right channel. (Note: The definition of “文件-編輯-視圖-播放-軌道-生成-效果-分析-幫助” is “File-Edit-View-Play-Track-Generate-Effects-Analyze-Help”. The “立体声” is “stereo”. The “32位浮点” is “32-bit floating point”. The “静音” is “mute”. The “独奏” is “solo”. The definition of “麥克風” is “Microphone”.)
Figure 17. Under single-source beamforming at 45°, the result shows a waveform similar to the left channel.
Figure 18. Under single-source beamforming at 90°, the result shows a waveform similar to the average of the left and right channels.
Figure 19. Under dual-source beamforming at 180°, the result shows a waveform similar to the right channel.
Figure 20. Under dual-source beamforming at 45°, the result shows a waveform similar to the left channel.
Figure 21. Under dual-source beamforming at 135°, the result shows a waveform similar to the mean of the left and right channels.
Table 1. AK4642EN simple specification sheet.

Specifications | Description
IC serial number | AK4642EN
ADC/DAC precision | 16 bits
ADC sample rate | 8 kHz~48 kHz
Audio interface format | ADC: 16-bit MSB-justified, I2S; DAC: 16-bit MSB-justified, 16-bit LSB-justified, 16–24-bit I2S
Control interface | I2C (400 kHz, high-speed mode)
Filter | Digital high-/low-pass filter
Other features | Stereo mic input; digital volume +12 dB~−115 dB; digital ALC (Automatic Level Control) +36 dB~−54 dB
Table 2. Microprocessor [25].

Specifications | Description
Core | 200 MHz
Cache | 512 KB
Flash ROM size | 2 MB
Communication interface | PMP, UART, SPI, I2C, I2S, USB, Ethernet, CAN bus
Pipeline | 5-stage pipeline
Timer | 9 timer units
Multiply/divide unit | Maximum issue rate of one 32 × 32 multiply per clock; early-in iterative divide (each calculation needs 12~38 clocks)
Floating point unit | IEEE 754-compliant; supports single- and double-precision data types; runs at a 1:1 core/FPU clock ratio
Table 3. Test values.

Sound Source Angle (°) | Test Max (°) | Test Min (°) | Algorithm Execution Time (s) | Repeated Angle Error (°) | Angle Error Relative to the True Angle (°)
30 | 32.29 | 32.29 | 1.531 | 0 | 2.29
45 | 48.89 | 45.43 | 1.531 | 3.45 | 3.89
60 | 60.54 | 60.54 | 1.531 | 0 | 0.54
75 | 80.13 | 76.16 | 1.531 | 3.97 | 5.13
90 | 92.21 | 91.58 | 1.531 | 0.63 | 2.21
105 | 111.04 | 99.21 | 1.531 | 11.83 | 6.04
120 | 123.91 | 123.80 | 1.531 | 0.11 | 3.91
135 | 135.45 | 120.91 | 1.531 | 14.54 | 0.45
150 | 154.23 | 152.40 | 1.531 | 1.83 | 4.23
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Wang, J.-H.; Le, P.T.; Kuo, S.-J.; Tai, T.-C.; Li, K.-C.; Chen, S.-L.; Wang, Z.-Y.; Pham, T.; Li, Y.-H.; Wang, J.-C. Audio Pre-Processing and Beamforming Implementation on Embedded Systems. Electronics 2024, 13, 2784. https://doi.org/10.3390/electronics13142784