1. Introduction
Sound reproduction systems play an important role in everyday life. They allow us to listen to recordings from a different place and a past time. Many methods for recording and playing back sound exist, using different combinations of microphone and loudspeaker setups. The most common one is simple stereo reproduction, but there are more complex techniques, such as wave field synthesis [1] or ambisonics [2]. Even though state-of-the-art methods reproduce sound fields very accurately, they do not consider the interaction between the acoustics of the recording environment and those of the playback environment. In particular, the playback environment adds extra reverberation, and there is no control over the spatial distribution of the reverberant sound field, which may influence the apparent source width and the perceived listener envelopment. For this reason, ongoing investigations aim to improve the performance of these methods.
In particular, Grosse and van de Par proposed a new way of recording and playing back sound fields [3]. The main idea behind their research was to record the direct and the reverberant sound field separately in order to render them in a playback room while optimizing certain perceptually motivated criteria for authentic audio reproduction. These criteria aim to recreate the reverberant sound field in the playback environment as faithfully as possible by optimizing the amount and spectral shape of the reverberation, as well as the interaural cross-correlation of the reproduced reverberant sound field as it arises in the reproduction room, including the reverberation added by that room. In their paper, Grosse and van de Par assumed that optimizing these perceptual criteria is sufficient for an authentic reproduction of the sound field that a single source creates in the recording room. This assumption was supported by subjective evaluations. The playback and recording configuration can be seen in Figure 1. In addition to the two basic stereo loudspeakers, the proposed approach used two dipole loudspeakers to excite and equalize the reverberant sound field. For the optimized rendering, the system relies on a relatively dry direct signal to be rendered on the frontal loudspeakers and a reverberant signal to be optimized and rendered on the dipole loudspeakers. To record the direct sound, a microphone was positioned close to the sound source. This also avoided early reflections, which could cause a change in coloration [4,5]. For recording the reverberant sound field, two microphones were placed at two distant positions in the diffuse field.
Since the method of Grosse and van de Par [3] has so far been limited to a single source and records the direct sound field with only one microphone, an extension is needed that also represents the spatial distribution of sources within the direct sound field signals as perceived at the listener position. Although this could in principle be achieved by using multiple close microphones and an appropriate mixing scheme, in this contribution, we want to provide a method that uses only a single 'true-stereo' microphone setup placed at the intended listener position within the recording room. Particular attention has to be paid to reducing the reverberant sound field in the direct sound field signals so that the rendering of the direct and reverberant sound fields can be optimized separately according to perceptual criteria within the playback room [3].
Although the proposed microphone array is designed for use in the audio reproduction system of Grosse and van de Par [3], it could also be used to record a relatively dry spatial image of the sound sources on stage, to be combined with a reverberant track mixed at a level that the recording engineer deems suitable. In this case, however, the result will not necessarily fulfill the optimization criteria formulated in Grosse and van de Par [3] that create a faithful audio reproduction.
State-of-the-art true stereo systems combine two microphones with a characteristic directivity pattern, placed at different distances and at different angles relative to one another. Depending on these parameters, the spatial rendering of the distributed sources differs [6]. These systems are, however, unsuitable for the sound reproduction method proposed by Grosse and van de Par [3], because they record a high percentage of reverberant sound, which should be avoided in the system of [3].
We overcome these disadvantages by developing a new true stereo microphone array that applies a superdirective beamforming algorithm to two logarithmically spaced microphone arrays. Correct, frequency-dependent interchannel level differences are captured by optimizing the shape of the two main lobes of the arrays. Together, the main lobes create the interchannel level difference required for an accurate spatial reproduction of the sound field while ensuring that no interchannel phase differences occur that could result in unintended changes in the perceived location of sound sources. Additionally, an optimal side lobe suppression is applied to reduce the influence of the reverberant sound field on the recording of the direct sound. The proposed stereo microphone array is compared to the state-of-the-art stereo microphone configurations mentioned earlier and shows a clearly reduced level of the reverberant sound field.
2. Methods
The following section is divided into five parts. Section 2.1 gives a brief introduction to the most relevant theory on beamforming needed for our proposed method. Section 2.2 focuses on the robustness of beamforming algorithms. The desired directivity pattern is specified in Section 2.3, which is based on a stereo intensity-panning rule related to the auditory processing of interaural level differences. Section 2.4 introduces an optimal array design to suppress side lobes and, in this way, reduce the influence of the reverberant sound field on the recording of the direct sound. Further, a specific filter design is proposed in Section 2.5, which will be used and evaluated throughout this study. The design is based on a superdirective beamforming algorithm and describes how the directivity pattern specified in Section 2.3 can be used for the optimization.
2.1. Beamforming
Beamforming describes the process of forming the directivity pattern of several microphones, which are arranged into an array, with signal processing techniques to obtain a specific, frequency-dependent directivity pattern. The directivity pattern
of a linear discrete microphone array, consisting of
N microphones, is calculated as follows [
7]:
where
denotes the angle ranging from
to
,
f the frequency,
the frequency-dependent complex weighting filter applied to microphone
n and
the steering vector denoting the direction and frequency-dependent transfer function from the sound source to microphone
n. Such a microphone array is illustrated in
Figure 2.
Assuming far-field conditions and microphones with an omnidirectional directivity pattern, the transfer function is:
where
c is the speed of sound and
represents the distance of the
n-th microphone to the center of the array [
7].
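The two relations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the sign of the phase term, the convention that the angle is measured from the array axis, the speed of sound and the five microphone positions are all chosen here for the example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steering_vector(freq_hz, theta_rad, positions_m, c=SPEED_OF_SOUND):
    """Far-field transfer functions from direction theta to each omnidirectional microphone.

    Assumed convention: theta is measured from the array axis, so theta = 0 is end-fire.
    """
    k = 2.0 * math.pi * freq_hz / c  # wavenumber
    return [cmath.exp(-1j * k * x * math.cos(theta_rad)) for x in positions_m]


def directivity(weights, freq_hz, theta_rad, positions_m, c=SPEED_OF_SOUND):
    """Directivity pattern: weighted sum of the microphone transfer functions."""
    d = steering_vector(freq_hz, theta_rad, positions_m, c)
    return sum(w * dn for w, dn in zip(weights, d))


# Hypothetical symmetric 5-microphone array with uniform weights.
positions = [-0.10, -0.05, 0.0, 0.05, 0.10]
uniform = [1.0 / len(positions)] * len(positions)
broadside = directivity(uniform, 1000.0, math.pi / 2, positions)
oblique = directivity(uniform, 1000.0, 0.3, positions)
```

With these uniform weights the response perpendicular to the array axis is one, and because both the positions and the weights are symmetric about the central microphone, the response at any angle has a vanishing imaginary part.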
The directivity patterns of the individual microphones in the array can be taken into account by changing the transfer function. The filter optimization used to match the directivity pattern of the array with a desired one is called beamforming. The look direction of the microphone array is defined as the angle of the main lobe of the desired directivity pattern, which is also called the steering angle.
There are several beamforming algorithms based on an analytic solution for the optimal filter
and some others on a numerical approximation. Analytic solutions allow us to set
N constraints on the directivity pattern for a finite number of frequencies, as for example described in [
8]. Since we have a higher number of constraints in our problem, we will use numerical methods that allow accommodating a higher number of constraints to control the directivity pattern.
Equation (
1) will be solved numerically, and for this purpose, the frequency range is discretized into
P frequencies
and the angular range into
M angles
:
Equation (
3) is reformulated in matrix notation as:
where the directivity pattern is an
vector
, the transfer function an
matrix
and the filter a
vector
[
7]. All bold variables are either vectors or matrices in the remainder of this manuscript.
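For a single frequency, the matrix formulation amounts to one matrix-vector product. The sketch below builds the transfer matrix for M discrete angles and N microphones and multiplies it by a filter vector. The angular grid, the microphone positions, the weights and the phase convention are illustrative assumptions rather than values from this study.

```python
import cmath
import math


def transfer_matrix(freq_hz, angles_rad, positions_m, c=343.0):
    """M x N matrix whose row m holds the far-field steering vector for angle m."""
    k = 2.0 * math.pi * freq_hz / c
    return [[cmath.exp(-1j * k * x * math.cos(theta)) for x in positions_m]
            for theta in angles_rad]


def pattern(G, w):
    """Matrix-vector product G w: one directivity value per discrete angle."""
    return [sum(g * wn for g, wn in zip(row, w)) for row in G]


M = 181
angles = [math.pi * m / (M - 1) for m in range(M)]  # hypothetical grid from 0 to pi
positions = [-0.10, -0.05, 0.0, 0.05, 0.10]         # hypothetical positions in metres
w = [0.2] * 5                                       # uniform weights for illustration
D = pattern(transfer_matrix(1000.0, angles, positions), w)
```

Repeating this product for each of the P discrete frequencies gives the complete frequency-dependent directivity pattern.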
2.2. Robustness and White Noise Gain
One of the problems that beamforming algorithms often have is their lack of robustness. This property is related to a resistance to the presence of spatially white noise and can be impaired by deviations from the specified microphone characteristics and microphone position errors. These imperfections affect the beamformer in a manner similar to a recorded spatially white noise that is amplified. Hence, the White Noise Gain (WNG) is a measure commonly used for quantifying the robustness of a beamformer design. The WNG shows the ability of a beamformer to suppress spatial white noise, because it expresses the gain of the beamformer in the desired look direction relative to the amplification of spatially white noise.
The WNG
is defined as follows:
where
denotes the value of the directivity pattern in steer direction [
7]. A high value of the WNG
corresponds to a robust beamforming design, whereas a small value
effectively corresponds to an amplification of spatial white noise [
7]. The maximum possible value of the WNG is equal to the number of microphones used:
which corresponds to a uniform filter [
7]:
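A short numerical check illustrates this property. The sketch assumes the common form of the WNG, the squared magnitude of the look-direction response divided by the squared norm of the filter; the function name and the nine-microphone example are illustrative.

```python
def white_noise_gain(weights, steering):
    """Squared look-direction response divided by the summed squared filter magnitudes."""
    response = sum(w * d for w, d in zip(weights, steering))
    noise_power = sum(abs(w) ** 2 for w in weights)
    return abs(response) ** 2 / noise_power


N = 9
steer = [1.0 + 0.0j] * N   # hypothetical look direction without inter-microphone delay
uniform = [1.0 / N] * N    # uniform filter
wng_uniform = white_noise_gain(uniform, steer)

# A filter with large alternating weights has a much lower WNG.
alternating = [(-1) ** n * 5.0 for n in range(N)]
wng_alternating = white_noise_gain(alternating, steer)
```

For the uniform filter, the result equals the number of microphones, whereas the alternating filter amplifies spatially white noise relative to the look direction and yields a value below one.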
2.3. Desired Directivity Pattern
The playback of the recorded signals should be in a stereophonic configuration, as mentioned in
Section 1 and illustrated in
Figure 3a.
The playback approach proposed by Grosse and van de Par [
3] uses two loudspeakers for the direct sound reproduction with a typical base angle of
relative to the listener’s position [
9]. There are several approaches to shift a phantom source from one loudspeaker to the other, utilizing phase differences
and/or level differences (amplitude panning)
applied on the two loudspeaker signals.
Based on this playback configuration, the recording configuration presented in this paper consists of two crossed end-fire microphone arrays with a
opening angle, sharing one center microphone and using omnidirectional microphones, illustrated in
Figure 3b. The microphone positions in this figure are only a sketch; the absolute positions are given in
Section 3. The phantom-source shifting approaches of the playback configuration can be used to formulate the correct phase and/or level differences between the two arrays. In this way, the perceived location of the sound source during playback is identical to that in the recording, provided that the distribution of recorded sound sources does not span more than
of angle. Although not evaluated here, in principle, a different opening angle could be used for the microphone arrays, thus effectively compressing or expanding the reproduced sound stage. We restrict our proposed method to have only level differences, and for this reason, the desired directivity pattern
is purely real valued. With this desired directivity pattern, the phase of the directivity pattern is mainly controlled by the array design, which will be explained in
Section 2.4.
In this paper, the phantom source shifting approach of amplitude panning is used for formulating the desired directivity pattern of Array 1
and Array 2
[
9]:
The angle area
between both arrays is defined by:
with the constant
. The derivation of the desired directivity patterns according to [
9] gives two possible recording room assumptions: an anechoic chamber or a real room. The latter one is chosen for Equation (
8) since the microphone array configuration will be used in real rooms, such as concert halls.
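To illustrate how an amplitude-panning rule translates a source angle into two channel gains, the following sketch uses the tangent law with constant-power normalisation. This is an assumed, commonly used rule and may differ in detail from the rule taken from [9]; the half base angle of 30 degrees, the sign convention and the function name are likewise assumptions made for the example.

```python
import math


def panning_gains(phi_deg, half_base_deg=30.0):
    """Tangent-law amplitude panning with constant-power normalisation.

    phi_deg: source angle, 0 = centre, +half_base_deg = fully in channel 1 (assumed convention).
    """
    ratio = math.tan(math.radians(phi_deg)) / math.tan(math.radians(half_base_deg))
    g1, g2 = 1.0 + ratio, 1.0 - ratio
    norm = math.hypot(g1, g2)  # enforces g1^2 + g2^2 = 1
    return g1 / norm, g2 / norm


centre = panning_gains(0.0)   # source in the middle: equal gains
edge = panning_gains(30.0)    # source at one loudspeaker: only channel 1 is active
```

Evaluating such a rule over the angular range between the two arrays gives one real-valued gain curve per channel, which is the role the desired directivity patterns of Array 1 and Array 2 play here.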
The desired directivity pattern of one array is the mirrored version of that of the other array. This symmetry of the recording configuration makes it possible to formulate one desired directivity pattern, which is the same for both arrays. The following parts of the desired directivity pattern, the first
valid for the beam area and the second
valid for the steering angle, consider a microphone array aligned on the
axis corresponding to the steering angle
:
In the following subsections, an optimal array design in terms of optimal microphone positions and an optimal filter design is proposed to achieve the desired directivity pattern.
2.4. Array Design
The positions of the microphones have an influence both on the filter
and the transfer function
, and thus, on the directivity pattern itself. The optimal microphone positions selected for this paper maximize the spatial aliasing frequency and, at the same time, minimize the frequency from which beamforming is effectively possible. The spatial aliasing frequency describes the lowest frequency
for which aliasing effects occur, which is caused by a spatial undersampling of the array for sound waves at high frequencies. The aliasing leads to side lobes with the same amplitude as the main lobe. The spatial aliasing frequency of an array with linear microphone spacing is usually given in the literature as:
with
as the space between the microphones [
10].
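As a quick numerical illustration, the sketch below assumes the common half-wavelength criterion, in which aliasing starts once the microphone spacing exceeds half the wavelength. The function name, the speed of sound and the example spacings are assumptions made for the example.

```python
def spatial_aliasing_frequency(spacing_m, c=343.0):
    """Frequency at which the spacing equals half a wavelength (assumed criterion)."""
    return c / (2.0 * spacing_m)


f_wide = spatial_aliasing_frequency(0.05)     # 5 cm spacing
f_narrow = spatial_aliasing_frequency(0.025)  # halving the spacing doubles the limit
```

Under this criterion, a spacing of 5 cm corresponds to roughly 3.4 kHz, which shows why small spacings are needed near the center of the array.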
A small microphone spacing raises the spatial aliasing frequency, which is the upper limit of the usable frequency range. In contrast, a large microphone spacing lowers the frequency from which beamforming is effectively possible. In order to have good directional properties of the microphone array across a wide frequency range, an irregularly spaced microphone array is used in which both kinds of spacing occur. A linear array with logarithmic spacing that is symmetrical about the reference microphone (
) is used in this paper. Consequently, the number of microphones
N has to be odd
. The symmetry around one central microphone ensures a purely real directivity. The microphone positions are calculated as follows [
11].
with:
where
is the total length of the array. The array parameter
is a free variable describing the ratio between the spacing of the microphones at the extremities of the array and the spacing of the microphones at the center of the array. Linear microphone spacing is achieved with
. If
, the spacing of the microphones at the extremities of the array is smaller than the one at the center of the array. In the case of
, it is the opposite.
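The role of the array parameter can be illustrated with a simple geometric spacing rule. The sketch below is not the position formula from [11]; it is an assumed construction in which the spacings grow geometrically from the center outwards so that the ratio of the outermost to the innermost spacing equals the given parameter. All names and example values are hypothetical.

```python
def symmetric_positions(num_mics, length_m, ratio):
    """Positions of an odd number of microphones, symmetric about a central one at 0.

    ratio = outermost spacing / innermost spacing; spacings grow geometrically in between.
    """
    if num_mics % 2 == 0 or num_mics < 3:
        raise ValueError("num_mics must be odd and at least 3")
    half = (num_mics - 1) // 2                     # microphones on each side of the centre
    step = ratio ** (1.0 / (half - 1)) if half > 1 else 1.0
    spacings = [step ** i for i in range(half)]    # innermost spacing first
    scale = (length_m / 2.0) / sum(spacings)       # each half spans length / 2
    side, x = [], 0.0
    for s in spacings:
        x += s * scale
        side.append(x)
    return [-p for p in reversed(side)] + [0.0] + side


uniform = symmetric_positions(5, 0.4, 1.0)   # ratio 1 gives equal spacing
dense_centre = symmetric_positions(9, 0.8, 8.0)
```

A ratio of one reproduces equal spacing, while a ratio above one concentrates the microphones around the center, which keeps the smallest spacing small without shortening the array.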
2.5. Filter Design
In this section, an optimal filter design is proposed to fit the directivity pattern of the array, whose design was specified in
Section 2.4, to the desired directivity pattern specified in
Section 2.3. The following filter design is based on numerical convex optimization and has the advantage that only one global minimum exists. In general, this end-fire design can also be used with different desired directivity patterns and array designs. In
Section 3, we indicate the ideal values of the constants for the desired directivity pattern and array design proposed in this study.
The aim of this algorithm is to minimize the quadratic error
between the directivity pattern obtained by a microphone array
and a desired frequency independent directivity pattern
[
7]:
This minimization task will be subjected to additional constraints, and therefore, the beamformer will be termed the Constrained Least-Squares Beamformer (CLSB).
In the following subsections, the main minimization task and the constraints used will be explained, paying particular attention to the WNG and the different spatial areas. These areas are shown in
Figure 4.
Additionally, this optimization process is placed within an optimization loop in order to optimize several important constants. This optimization procedure will be explained in the last subsection of this section.
2.5.1. White Noise Gain
Such a convex optimization procedure allows including a frequency-dependent lower bound
for the WNG when optimizing the filters
[
7]:
This constraint has a direct influence on the robustness and on how well the desired directivity pattern can be achieved. A high value for the lower bound reduces the accuracy of forming the directivity pattern because the filter is too restricted by this constraint, whereas a low value leads to a filter that is not robust. In
Section 3, an optimal value for this lower bound will be discussed.
2.5.2. Steering Angle
In the direction of the steering angle
, representing the direction of the main lobe of the microphone array, the directivity pattern obtained by the array is constrained to the value of the desired directivity pattern [
7]:
In this way, the directivity pattern is normalized to . The steering angle is limited to the array-axis, since the goal is an end-fire array.
2.5.3. Beam Area
The area around the steering angle is the beam area, which defines the main lobe of the directivity pattern:
and
indicate one discrete angle before and after the steering angle, respectively. The constant
can be chosen freely and defines the width of the beam area. Fitting the directivity pattern to the desired one, an angle-dependent upper bound
is set to the error (cf. Equation (
14)) in this area:
where
denotes the absolute value of every entry of the vector argument. In this case,
is a column vector with as many entries as the directivity pattern in the beam area.
2.5.4. Unconstrained Area
An angle area without any constraints is defined to avoid an effective discontinuity in the intermediate zone between the beam and the stop area, which would have a negative impact on the optimized solution:
The constant can be chosen freely and defines the width of the unconstrained area.
2.5.5. Stop Area
The remaining area is called the stop area:
The main optimization task is applied to this area. In the context of this work, the sound from this direction can be assumed to be mainly reverberant sound that does not belong to the direct sound and is therefore undesired. For this reason, the desired directivity pattern in this area is set to zero to suppress sound coming from this area as much as possible [
7]:
In addition to this optimization, an upper bound
is set to the uniform norm of the directivity pattern:
This upper bound is not angle-dependent, but restricted to the stop area because of the uniform norm and will play an important role in the following loop design.
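The partition into areas and the uniform-norm bound on the stop area can be sketched as follows. The area widths, the 10-degree grid and the pattern values are made up for illustration, angle wrap-around is ignored, and the function names are hypothetical.

```python
def classify_angle(theta_deg, steer_deg, beam_half_width_deg, gap_deg):
    """Assign an angle to the steering, beam, unconstrained or stop area."""
    offset = abs(theta_deg - steer_deg)
    if offset == 0:
        return "steer"
    if offset <= beam_half_width_deg:
        return "beam"
    if offset <= beam_half_width_deg + gap_deg:
        return "unconstrained"
    return "stop"


def stop_area_peak(pattern, areas):
    """Uniform norm (largest magnitude) of the directivity pattern over the stop area."""
    return max(abs(v) for v, a in zip(pattern, areas) if a == "stop")


angles = list(range(0, 181, 10))                         # hypothetical 10-degree grid
areas = [classify_angle(a, 0, 30, 20) for a in angles]   # hypothetical area widths
pattern = [1.0, 0.9, 0.7, 0.5, 0.3, 0.2] + [0.05] * 13   # made-up pattern values
peak = stop_area_peak(pattern, areas)
within_bound = peak <= 0.1                               # hypothetical upper bound
```

Because the uniform norm only looks at the largest magnitude, a single side lobe above the bound is enough to violate the constraint, regardless of how small the pattern is elsewhere in the stop area.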
2.5.6. Loop Design
Choosing the correct upper bound for the beam area is difficult: on the one hand, a low upper bound for the beam area leads to a good fit in this area (low values), but to undesired side lobes in the stop area (high values). Consequently, the direct sound will be recorded correctly, but is mixed with the undesired reverberant sound field, which should ideally be suppressed. On the other hand, a high upper bound for the beam area leads to the opposite, a bad fit in the beam area (high values), but low undesired side lobes (low values). The following loop design finds a frequency-dependent optimal upper bound for the beam area, which is a compromise between a good fit in the beam area and only small side lobes in the stop area.
As a first step in the loop design, the upper bound of the beam area is initialized in matrix notation:
The rows cover the beam area, whereas the columns cover the different iterations of the following loops with k as the counter, where indicates the last iteration. The upper bound starts in the first iteration with and continues linearly spaced with step size . The step size is designed in such a way that the maximum value of the upper bound of the beam area is reached in overall K steps. Either or can be chosen to calculate , since they are equal according to the symmetry of the desired directivity pattern. The upper bound then ends with the difference between and at the row specific angle. If this difference is reached before the last iteration , this value will stay till this iteration is reached. This will be the case for every row, except the first and the last one. This procedure ensures that stays the maximum value of the directivity pattern.
In contrast to the upper bound of the beam area, the bound of the stop area is initialized as a vector, since there is no angle dependency:
The entries with the counter l, where indicates the last iteration, correspond to the iterations of the following loops and are linearly spaced. The constant controls the maximum allowed value of the directivity pattern in the stop area for the first iteration.
The loop design itself can be seen in
Figure 5 and is repeated for every frequency
, where the constants
and
can be chosen freely so that
and
, respectively. These two constants regulate the part of the upper bound of the beam area, which is used in the looped optimization process.
The first loop repeats the optimization with the first part of the upper bound of the beam area
to
till Equation (
22) with
is true. A result of the optimization, fulfilling Equation (
22), is denoted as valid. If this is not the case, Loop 2 repeats Loop 1 with different upper bounds of the stop area
. If still no valid result is found, Loop 3 increases
with the step width of
. The upper bounds, for which the loop design finds a valid solution, are denoted as optimal
and
. The filter
, which corresponds to these upper bounds, is also denoted as optimal
. For the case that
increases
over
K , the last
calculated result of the optimization is taken as a valid solution.
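The control flow of the three nested loops can be sketched as below for a single frequency. This is a simplified reading of the description, not the authors' implementation: the beam-area bound is reduced to one scalar per iteration instead of an angle-dependent column, the stop-area bound is checked after each solve instead of being passed to the solver, and `solve` and `stop_peak` are hypothetical callables standing in for the convex optimization and the uniform norm over the stop area.

```python
def loop_design(solve, stop_peak, beam_bounds, stop_bounds, k_first, k_step):
    """Search for the first filter whose stop-area peak stays below the stop bound.

    solve(beam_bound) returns a filter (or None if infeasible); stop_peak(filter)
    returns the largest pattern magnitude in the stop area.
    """
    last = None
    k_end = k_first
    while k_end <= len(beam_bounds):               # Loop 3: widen the scanned range
        for stop_bound in stop_bounds:             # Loop 2: relax the stop-area bound
            for k in range(k_end):                 # Loop 1: scan the beam-area bounds
                w = solve(beam_bounds[k])
                if w is None:
                    continue
                last = (w, beam_bounds[k], stop_bound)
                if stop_peak(w) <= stop_bound:
                    return last                    # valid result found
        k_end += k_step
    return last                                    # fall back to the last computed result


# Toy stand-ins: a tighter beam bound produces higher side lobes.
toy_solve = lambda bound: bound
toy_peak = lambda w: 0.5 - w
result = loop_design(toy_solve, toy_peak, [0.1, 0.2, 0.3, 0.4], [0.05, 0.15, 0.25], 2, 1)
fallback = loop_design(toy_solve, toy_peak, [0.1, 0.2, 0.3, 0.4], [0.01], 2, 1)
```

In the toy run, no combination within the first two beam bounds is valid, so the outer loop extends the range by one step and the search succeeds at the third beam bound with the loosest stop bound. When nothing is valid, the last computed result is returned, mirroring the fallback described above.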
3. Setup
The following setup is used for the numerical simulations, whose results are described in
Section 4 and
Section 5. The angular range is discretized into
linearly-spaced angles
. The frequency range covers the range of
to
generated at a sampling rate of
using a filter length of 512 samples. This results in
linearly spaced frequency bins. This frequency range covers the spectral content of music [12] that is to be recorded by these microphone arrays. To obtain the impulse responses of the filters, the complex spectrum was mirrored, conjugated and transformed to the time domain via an inverse FFT.
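The mirroring step can be sketched as follows: a one-sided spectrum with N/2 + 1 bins is extended to a Hermitian-symmetric spectrum of length N, whose inverse transform is real. The example uses a naive inverse DFT and a short 8-tap filter for readability; the function name and the test spectrum (a pure two-sample delay) are assumptions for illustration.

```python
import cmath
import math


def impulse_response(half_spectrum):
    """Real impulse response from a one-sided spectrum (bins 0 .. N/2 of an even-length DFT)."""
    # Mirror and conjugate the interior bins to build the full Hermitian-symmetric spectrum.
    mirrored = [v.conjugate() for v in reversed(half_spectrum[1:-1])]
    full = list(half_spectrum) + mirrored
    n = len(full)
    # Naive inverse DFT; an FFT routine would be used in practice.
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


# Hypothetical one-sided spectrum of a two-sample delay for an 8-tap filter (5 bins).
taps = 8
half = [cmath.exp(-2j * math.pi * k * 2 / taps) for k in range(taps // 2 + 1)]
h = impulse_response(half)
```

The resulting impulse response is a single unit tap at index two, which confirms that the mirrored spectrum describes a real-valued filter.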
The microphone array consists of omnidirectional microphones and has a total length of . The array design is done with , so that the smallest microphone spacing in the center of the array is . Following that, the spatial aliasing frequency can be maximized to a frequency of . For practical reasons, the limitation is set to to ensure enough space for the microphones. The absolute microphone positions are set as follows (displayed in millimeter precision): , , , , , , , , .
After having specified the microphone positions, the convex functions of the CLSB, shown in
Section 2.5, are solved utilizing CVX, a package for specifying and solving convex programs [
13,
14]. Parts of these convex functions are the WNG constraint and the loop design.
For the WNG constraint, the lower bound
for the WNG
is set up as follows:
The lower bound starts with and ends with . In the intermediate zone, a Cubic Spline Interpolation (CSI) connects both points. The CSI in the intermediate zone avoids rapid changes of the directivity pattern across frequency below . In the high frequency range , a lower bound of ensures a robust beamforming design.
For the loop design, the constants are set up as follows:
The constants
and
, as well as the parts of the desired directivity pattern
and
are set up according to
Section 2.3.
The values of the constants
K,
and
are chosen in such a way that Loop 1 scans the beam area from
in steps of
till
. If necessary, Loop 3 increases the value of the upper bound of the beam area according to the value of the constant
(cf.
Section 2.5).
An increase of the value of the constant
K leads to an improvement in the beam area (lower
values), because the step size
is smaller. The validity (cf.
Section 2.5) of more possible directivity patterns with small
values is checked by the loop design. In fact, to find a valid solution, Loop 2 has to increase
further than before, which leads to a worsening in the stop area (higher
values). A decrease of the value of the constant
K leads consequently to the opposite effect.
An increase of the values of the constants and leads to a worsening in the beam area (higher values), because the first end point of Loop 1 , as well as all of the other ones is now higher. More possible directivity patterns with high values are checked by the loop design: Loop 2 does not have to increase as much as before, because these directivity patterns are in general more likely to be valid. This then leads to an improvement in the stop area (lower values). A decrease of the values of the constants and leads consequently to the opposite effect.
The values of the constants L and are chosen in such a way that Loop 2 scans the stop area from in steps of till .
An increase of the value of the constant and at the same time a decrease of the value of the constant L, preserving the step width of as mentioned earlier, lead to a worsening in the stop area. The start point of Loop 2 is now higher, allowing higher values from the beginning. It is now easier for Loop 1 to find a valid solution, which leads to an improvement in the beam area. A decrease of the value of the constant and a coherent increase of the value of the constant L lead to the opposite effect.
Overall, it can be said that a variation of the values of the constants K, , , L and leads to a changed balance, fulfilling the constraints between the beam and the stop area. Optimal values have to be found separately for every desired directivity pattern and intended purpose of the microphone array.
A variation of the value of the constant does not significantly change the results in terms of the error in the beam and the stop area. Nevertheless, the value should not be chosen too large, in order to avoid undesired results (very large differences between the obtained and the desired directivity pattern), since there is no control over the directivity pattern in the unconstrained area. The largest value of this constant for which no undesired results occur depends in a complex manner on the number of microphones used and the desired directivity pattern.
With the setup shown in Equation (26), we achieved the best results in fitting the directivity pattern to the desired one. Different initializations of the constants are also possible, as mentioned before (a detailed analysis of how varying the values of the constants given in Equation (26) affects the results is beyond the scope of this article). Our results are discussed in Section 4 and Section 5.
6. Discussion and Conclusions
In this study, a new approach for intensity stereophonic recording has been investigated. Guided by the playback situation and its auditory requirements, we proposed a setup consisting of two crossed end-fire microphone arrays and a matching desired directivity pattern. The difference between the directivity pattern obtained and the one desired was minimized by a superdirective beamforming algorithm. The algorithm is based on numerical convex optimization and includes a frequency-dependent WNG constraint to ensure a robust beamforming design.
In addition to designing the filters of the microphones via beamforming algorithms, we found an ideal array design. This design maximizes the spatial aliasing frequency and also takes into account practical issues that arise when the arrays are physically built. The physical size of the microphones demands a minimum spacing, which also avoids interference between them.
A comparison between the new stereo system and the state-of-the-art ones, which use two microphones, has shown that the former records less reverberant sound, as it is more directive in the look direction than the latter. This matches the requirements of the recording method proposed in Grosse and van de Par [3], which requires separate dry and reverberant representations of the audio signal. The reverberant sound field can be taken from single microphone signals.
Future research could develop a method to optimize the directivity pattern of both arrays as one system rather than handling them separately. Furthermore, two additional beams pointing into the diffuse field could be introduced for optimization to replace the two microphones placed in that field and to use only the array system.
A final assessment of the proposed recording and playback system requires listening tests that investigate the perception of the recording and playback room.