1. Introduction
Spatial sound field reproduction seeks to establish an immersive acoustic environment within a predefined region, enabling the listener to experience a virtual yet realistic replication of the original sound field. Techniques that physically recreate an acoustic field are referred to as sound field synthesis [1]. Examples include Wave Field Synthesis (WFS) [2,3,4,5,6], Higher Order Ambisonics (HOA) [7,8,9], and the pressure matching method [10,11]. In contrast, amplitude panning methods attempt to create a plausible spatial perception by delivering the relevant psychoacoustic cues to the listener’s ears. These perceptually motivated reproduction techniques distribute the source signal across multiple loudspeakers, assigning a gain to each to create a sound image, or virtual sound source, from the desired direction. Such methods are advantageous for practical applications, offering low computational complexity, absence of destructive interference in the ‘sweet spot’, high timbral quality, and gradual degradation of sound quality outside the ‘sweet spot’ [12]. Vector-based amplitude panning (VBAP) [13,14,15,16] is the most widely used perceptually motivated method for two- and three-dimensional multi-loudspeaker reproduction, forming the basis of spatial and object-based audio standards such as MPEG-H [17]. The multiple direction amplitude panning (MDAP) method [18] extends VBAP by introducing additional virtual sources around the intended source position, allowing the synthesis of both uniform and variable source spreads. The distance-based amplitude panning (DBAP) method [19] is used for irregular setups, taking the actual positions of the loudspeakers in space as its point of departure. All-Round Ambisonic Panning (AllRAP) [20] creates phantom sources with stable loudness and adjustable source width. Additionally, a convex optimization-based method has been proposed to allow precise control of source spread [12,21,22]. The Compensated Amplitude Panning (CAP) method [23,24,25] takes the orientation of the listener’s head into account. Frequency-dependent gain panning methods [26,27] have been proposed to maintain consistent loudness and localization.
Most amplitude panning methods assume that the listener is positioned at the ‘sweet spot’, with loudspeakers arranged equidistantly and symmetrically around them. However, achieving such an idealized listening environment can be challenging, particularly in domestic and cabin settings. In consumer systems, where loudspeaker positions may deviate from their canonical locations, the perceived direction of the virtual sound source may shift away from its intended location. To address this problem, techniques such as time-aligning and loudness-matching loudspeakers based on distance [28] have been proposed. A modified panning approach for non-equidistant loudspeakers, considering the composite contributions of direct sound from multiple loudspeakers, has also been suggested [29]. However, these compensation methods for amplitude panning assume each loudspeaker behaves as an omnidirectional point source in a free field, thereby neglecting frequency-dependent factors, including room modes, loudspeaker directivity, and inter-loudspeaker variations. Moreover, these methods frequently presume that, after compensation, the reproduction system approximates an ideal listening environment, enabling the application of panning methods predicated on idealized conditions. This assumption disregards the persistent effects of irregular setups and various listening environments, ultimately leading to inaccuracies in the perceived direction of virtual sound sources.
To enhance the listening experience in acoustic environments with irregular loudspeaker configurations, this paper introduces an adaptive binaural cue-based amplitude panning algorithm designed to accurately reproduce the azimuth angle of virtual sound sources. The algorithm leverages measured room impulse responses (RIRs) and binaural room impulse responses (BRIRs) to account for the characteristics of the listening environment when deriving loudspeaker gains. It consists of two stages: a compensation stage and a gain and delay optimization stage. In the compensation stage, an inverse filtering algorithm is employed to eliminate discrepancies in the magnitude frequency responses of the RIRs, ensuring consistent sound output from each loudspeaker at the center of the listening area. In the gain and delay optimization stage, the interaural time difference (ITD) and interaural cross-correlation (IACC) for different gains and time delays are predicted using BRIRs measured with a dummy head. By adjusting the gains and the time delay between loudspeakers, the predicted ITD of the virtual source is matched to that of a real source in the free field, creating an accurate virtual sound source at the desired azimuth angle. When the search yields multiple solutions, the optimal gain pair and time delay are determined using the IACC. The effectiveness of the proposed algorithm is validated through both subjective and objective evaluations on a stereo system.
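The two binaural cues driving the optimization can be computed directly from the left- and right-ear signals. The sketch below estimates ITD and IACC from the peak of the normalized interaural cross-correlation within ±1 ms; these are the standard textbook definitions rather than a reproduction of the paper's Equations (5)–(7), and the function name and sign convention (positive ITD when the left-ear signal leads) are our own.

```python
import numpy as np

def itd_iacc(left, right, fs, max_lag_ms=1.0):
    """Estimate ITD (s) and IACC from left/right-ear signals using the
    normalized interaural cross-correlation restricted to lags within
    +/- max_lag_ms. Positive ITD means the left-ear signal leads,
    i.e., the source lies toward the left."""
    # Full cross-correlation; peak lag = arrival of right relative to left.
    c = np.correlate(right, left, mode="full")
    c = c / np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    lags = np.arange(-(len(left) - 1), len(right))
    keep = np.abs(lags) <= int(round(max_lag_ms * 1e-3 * fs))
    c, lags = c[keep], lags[keep]
    i = np.argmax(np.abs(c))
    return lags[i] / fs, abs(c[i])
```

For two identical impulses offset by a few samples, the estimate recovers the sample offset exactly and yields an IACC of one, consistent with a real source in a free field.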
The remainder of this paper is organized as follows: Section 2 introduces the proposed amplitude panning method, including the use of inverse filters to eliminate magnitude differences between loudspeakers and the determination of gains and time delay based on ITD and IACC. Section 3 presents the results of the simulation. Section 4 presents the results of objective and subjective experiments conducted in a listening room. Finally, Section 5 concludes the paper.
3. Simulations
The simulations were conducted using measured RIRs and BRIRs obtained in the listening room of the Acoustics Institute at Nanjing University, with a reverberation time of approximately 0.3 s. The spatial arrangement of the loudspeakers and listening area is shown in Figure 2a. The distance between the left and right loudspeakers is 1.5 m, and the distance from the center of the listening position to the loudspeaker axis is 1.3 m. The center of the listening area is laterally offset by 0.4 m from the loudspeaker centerline, with its height aligned with the loudspeakers. To acquire the BRIRs, the MegaSig AH 262 desktop dummy head is placed at the listening location, oriented directly forward, as shown in Figure 2b. This dummy head is interfaced with a Fireface UC sound card connected to a laptop, which serves to control the loudspeakers and record signals from the left and right ear microphones of the dummy head. The sampling rate is set to 44.1 kHz.
The RIRs of both loudspeakers, measured with the microphone placed at the listening position, are shown in Figure 3. Irregular loudspeaker placement introduces variations in the distances between each loudspeaker and the measurement position, causing differences in the direct sound components of the RIRs, both in time and amplitude, compared to standard configurations. Additionally, room modes, loudspeaker directivity, and their differences lead to magnitude variations across frequencies. Therefore, the inverse filtering algorithm is applied to mitigate the magnitude frequency response differences between the left and right loudspeakers, with the target response set to a constant. The frequency range of the inverse filter is limited to 150 Hz to 9000 Hz by adjusting the frequency-dependent regularization factor based on the loudspeakers’ working frequency bands. This prevents excessive compensation outside the loudspeakers’ working range. The regularization factor β is shown in Figure 3c. For the frequency range from 150 Hz to 9000 Hz, β is set to 1 × 10⁻⁴, and for frequencies below 75 Hz and above 18,000 Hz, it is set to 1 × 10⁻³. The factor for the remaining frequencies is determined through logarithmic interpolation. To improve the robustness of inverse filtering, one-third octave smoothing is applied to the RIRs before calculating the inverse filters. The resulting filtered magnitude frequency responses are shown in Figure 3d. Compared to the raw RIRs shown in Figure 3b, the magnitude differences between the loudspeakers are substantially reduced, with their magnitude frequency curves exhibiting a near-flat characteristic within the designated frequency range.
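As a rough illustration of this compensation stage, the sketch below builds a frequency-domain regularized inverse filter with a flat target, using the regularization values quoted above (1 × 10⁻⁴ in the 150 Hz–9000 Hz band, 1 × 10⁻³ below 75 Hz and above 18 kHz, log-interpolated in the transition bands). This is a common Kirkeby-style construction, not the paper's exact design, and the one-third octave smoothing step is omitted.

```python
import numpy as np

def reg_profile(f):
    """Frequency-dependent regularization beta(f): 1e-4 inside the
    150 Hz - 9000 Hz working band, 1e-3 below 75 Hz and above 18 kHz,
    log-interpolated in the transition bands."""
    beta = np.full(f.shape, 1e-3)
    beta[(f >= 150) & (f <= 9000)] = 1e-4
    m = (f > 75) & (f < 150)
    beta[m] = 10.0 ** np.interp(np.log10(f[m]),
                                [np.log10(75.0), np.log10(150.0)],
                                [-3.0, -4.0])
    m = (f > 9000) & (f < 18000)
    beta[m] = 10.0 ** np.interp(np.log10(f[m]),
                                [np.log10(9000.0), np.log10(18000.0)],
                                [-4.0, -3.0])
    return beta

def regularized_inverse(rir, fs, n_fft=8192):
    """Regularized inverse filter with a flat target response:
    H_inv = conj(H) / (|H|^2 + beta(f)). Outside the working band the
    larger beta limits the filter gain, preventing over-compensation."""
    H = np.fft.rfft(rir, n_fft)
    f = np.fft.rfftfreq(n_fft, 1.0 / fs)
    H_inv = np.conj(H) / (np.abs(H) ** 2 + reg_profile(f))
    return np.fft.irfft(H_inv, n_fft)
```

Applying the resulting filter to the RIR it was designed from yields a combined magnitude response close to unity inside the working band, while the response is deliberately left uncorrected outside it.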
During the gain and delay optimization phase, the gains and inter-loudspeaker delay are determined to accurately reproduce the virtual sound source at various azimuth angles. In this stage, a range of gains is systematically evaluated, with the corresponding ITD of the virtual sound source calculated. The gain matching the target ITD is selected to reproduce the virtual sound source at the desired azimuth angle. In the simulation, the azimuth angle search range extends from −15° to 40°, with 5° intervals in the front horizontal plane. The target ITD values are derived from measurements of a real source acquired in the anechoic chamber at Nanjing University, corresponding to the relevant azimuth angles. During the optimization stage, the left loudspeaker’s gain is incrementally increased from 0 to 1 with discrete steps of 0.001, and the right loudspeaker’s gain is calculated using Equation (10). The ITD for each gain pair is calculated using Equations (5)–(7), with the input signal set as the unit impulse signal. Prior to the gain optimization process, a preliminary time delay is applied to the left loudspeaker’s signal to ensure the proper combination of direct sounds from both loudspeakers, preventing excessive time differences that could lead to auditory dominance by the leading loudspeaker.
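The gain search loop above can be sketched as follows, using toy single-impulse BRIRs in place of the measured ones. The constant-power pairing of the two gains is an assumption standing in for Equation (10), the ITD predictor is a plain cross-correlation peak rather than the paper's Equations (5)–(7), and all names and numbers are illustrative.

```python
import numpy as np

FS = 44100   # sampling rate (Hz)
N = 256      # toy BRIR length (samples)

def _imp(delay, amp=1.0):
    x = np.zeros(N)
    x[delay] = amp
    return x

# Toy BRIRs (delayed, attenuated impulses) standing in for measured
# ones: H[(speaker, ear)] with speakers "L"/"R" and ears "L"/"R".
H = {("L", "L"): _imp(40), ("L", "R"): _imp(48, 0.7),
     ("R", "L"): _imp(48, 0.7), ("R", "R"): _imp(40)}

def _shift(x, n):
    # Delay x by n samples with zero padding (no wrap-around).
    y = np.zeros_like(x)
    y[n:] = x[:len(x) - n]
    return y

def predicted_itd(g_left, g_right, delay_samples, max_lag=44):
    """ITD (in samples) of the virtual source for one gain pair, taken
    as the peak lag of the interaural cross-correlation of the summed
    ear signals; positive values mean the left-ear signal leads."""
    ear_l = g_left * _shift(H[("L", "L")], delay_samples) + g_right * H[("R", "L")]
    ear_r = g_left * _shift(H[("L", "R")], delay_samples) + g_right * H[("R", "R")]
    c = np.correlate(ear_r, ear_l, mode="full")
    lags = np.arange(-(N - 1), N)
    keep = np.abs(lags) <= max_lag
    return int(lags[keep][np.argmax(c[keep])])

def search_gains(target_itd_samples, delay_samples, step=0.01):
    """Sweep the left gain from 0 to 1; the right gain follows a
    constant-power constraint (an assumed stand-in for Equation (10)).
    Gains whose predicted ITD equals the target are averaged,
    mirroring the robustness rule described above."""
    matches = [g for g in np.arange(0.0, 1.0 + step / 2, step)
               if predicted_itd(g, np.sqrt(max(0.0, 1.0 - g * g)),
                                delay_samples) == target_itd_samples]
    return float(np.mean(matches)) if matches else None
```

With these symmetric toy BRIRs and a target ITD of zero (a centered virtual source), a whole interval of left gains reproduces the target, and the averaged gain again yields the target ITD when applied.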
Figure 4a shows the variation of ITD with gain for left loudspeaker time delays of 0.88 ms, 0.93 ms, and 0.98 ms. A positive correlation exists between the left loudspeaker gain and the ITD. Specifically, when the left loudspeaker gain is set to 0, only the right loudspeaker emits sound, resulting in the smallest ITD. Conversely, when the left loudspeaker gain reaches its maximum value of 1, the ITD reaches its maximum, corresponding to the left loudspeaker’s ITD. Therefore, as the left loudspeaker gain increases from 0 to 1, the ITD of the virtual sound source gradually shifts from the right loudspeaker’s ITD to the left loudspeaker’s ITD. Furthermore, precise time alignment significantly impacts the algorithm’s performance. When employing time delays of 0.93 ms or 0.98 ms, the ITD changes uniformly with varying gain, with the difference between adjacent ITD values corresponding to the duration of a single time sample. As the gain increases from 0 to 1, every ITD value between the left and right loudspeakers’ ITDs is covered, enabling the acquisition of the gain for each desired azimuth angle. When multiple gains correspond to the same ITD value, the arithmetic mean gain is selected to prevent minor gain changes from affecting the ITD, thereby enhancing robustness. However, with a time delay of 0.88 ms, a notable discontinuity arises: the ITD abruptly shifts from −0.36 ms to 0 ms as the left loudspeaker gain transitions from 0.472 to 0.473. This discontinuity causes the gain search method to fail to accurately reproduce the target ITDs within this range. The gain results for virtual sound source angles ranging from −15° to 40° are shown in Figure 4b. The ITD of the virtual sound source is influenced by both the time delay and the gains. For a given azimuth angle, varying time delays correspond to different gains. To ensure a continuous ITD variation during the gain search process, the time delay must be judiciously selected. Otherwise, an abrupt discontinuity in the ITD during the gain and delay search stage will prevent the virtual sound source’s ITD from matching the target ITD through gain adjustment alone.
To determine the appropriate time delay, we investigate the influence of varying time delays on the virtual sound source ITD obtained by the gain optimization algorithm. Given that the inherent time difference for the direct sound position between the two loudspeakers is 1.00 ms, the search range for time delays is set from 0.77 ms to 1.22 ms, centered around this value, covering 20 adjacent time sampling points. The results of the loudspeaker gain search for each discrete time delay are shown in Figure 5. When the time delay is between 0.93 ms and 1.16 ms, the loudspeaker gain pair can be adjusted for each desired azimuth angle to align the virtual sound source’s ITD with that of the real sound source in the free field. However, outside this range, the ITD fails to accurately match the real sound source’s ITD at certain angles. As the deviation from the ideal time delay increases, an increasing number of angles exhibit ITD errors.
Since multiple time delays can result in zero absolute ITD error, we calculate the mean IACC and mean absolute ITD error for each desired sound source azimuth angle at each delay to determine the optimal time delay and corresponding gains, as shown in Figure 6a. Within the time delay range where the mean absolute ITD error approaches zero, the mean IACC initially increases and subsequently decreases, reaching a maximum value at 1.04 ms. Outside this range, the highest mean IACC occurs at 0.77 ms. At this particular time delay, the ITDs for each azimuth angle are restricted to either 0.16 ms or −0.43 ms, corresponding to the ITDs obtained when the left or right loudspeaker operates alone. This indicates that during the search process, the sound image abruptly shifts from the left loudspeaker to the right, failing to synthesize a virtual sound source between the loudspeakers. Consequently, compared to a virtual sound source reproduced by two loudspeakers, the IACC value for a single loudspeaker is higher. Therefore, a time delay of 1.04 ms is selected as the optimal setting, and the corresponding gains obtained during the search are applied to reproduce the virtual sound source at the desired azimuth angles, as shown in Figure 6b. As the desired azimuth angle increases, the gain of the left loudspeaker decreases, and that of the right loudspeaker increases. The proposed algorithm uses a uniform time delay for all azimuth angles, thereby simplifying the creation of continuously moving virtual sound sources compared to methods with variable time delays. For instance, when the virtual sound source is intended to move continuously from −15° to −10°, interpolation techniques can be used to achieve the desired spatial trajectory. When the delays for −15° and −10° are consistent, the loudspeaker gains can be easily interpolated. However, if the time delays for −15° and −10° differ, both the time delay and the gain would need simultaneous adjustment when moving the virtual sound source. Since time delay adjustments occur in discrete steps (at least a single time sample), maintaining continuous changes becomes challenging and complicates the interpolation process.
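The two-step delay selection described above can be restated compactly: discard candidate delays whose mean absolute ITD error is nonzero, then take the remaining delay with the highest mean IACC. A minimal sketch follows, applied to per-delay statistics loosely patterned on Figure 6a (illustrative values, not the paper's data):

```python
import numpy as np

def select_delay(delays_ms, mean_abs_itd_err_ms, mean_iacc, tol=1e-9):
    """Among candidate delays whose mean absolute ITD error is (near)
    zero, pick the one with the highest mean IACC."""
    err = np.asarray(mean_abs_itd_err_ms, dtype=float)
    iacc = np.asarray(mean_iacc, dtype=float)
    ok = np.where(err <= tol)[0]          # delays matching every target ITD
    if ok.size == 0:
        raise ValueError("no candidate delay reproduces every target ITD")
    return float(np.asarray(delays_ms, dtype=float)[ok[np.argmax(iacc[ok])]])

# Illustrative per-delay statistics (ms, ms, dimensionless):
delays = [0.77, 0.88, 0.93, 0.98, 1.04, 1.11, 1.16, 1.22]
err    = [0.21, 0.05, 0.0,  0.0,  0.0,  0.0,  0.0,  0.08]
iacc   = [0.95, 0.88, 0.80, 0.84, 0.90, 0.86, 0.82, 0.70]
best = select_delay(delays, err, iacc)
```

With these values, the 0.77 ms candidate has the highest IACC overall but is rejected by the ITD criterion, so 1.04 ms is returned, mirroring the behavior described for Figure 6a.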
The performance of the proposed algorithm was evaluated in comparison with the VBAP [13] and DBAP [19] algorithms. As the VBAP algorithm is inherently designed for loudspeaker arrangements with equidistant placement, amplitude and time delay compensation are required when applying it to setups with unequal loudspeaker distances [28]. In the simulation, measured RIRs were used to implement delay compensation, aligning the direct sound peaks of the two loudspeakers. Additionally, the amplitude was adjusted to equalize the peak values of the direct sound. Regarding the DBAP algorithm, which simultaneously controls both the azimuth angle and distance of the virtual sound source, the virtual source was restricted to the loudspeaker axis.
Figure 7a presents the ITD simulation results for the three evaluated methods. The VBAP algorithm accurately reproduces virtual sound sources with ITD errors within 0.02 ms (equivalent to one sample delay) and azimuth angle deviations within 5° for desired angles between −15° and 10°, as well as at 40°. However, in the range of 15° to 35°, the reproduced ITDs are consistently lower than the target values. In particular, at 20°, the ITD error reaches 0.14 ms (equivalent to six sample delays), with corresponding azimuth errors approaching 15°. For the DBAP algorithm, the ITDs remain nearly constant at approximately 0.14 ms for azimuth angles between −15° and 15°. However, at 20° and 25°, the ITDs exhibit a sudden increase to 0.91 ms, while for angles above 30°, they drop sharply to −0.41 ms. In contrast, the proposed method yields ITD values that closely match the target ITDs across the entire azimuth range, indicating superior and consistent performance. These results demonstrate that the proposed algorithm achieves reliable ITD reproduction, while the VBAP algorithm maintains accuracy only at azimuths near the loudspeakers and exhibits significant bias at mid-range angles. The DBAP algorithm fails to synthesize coherent virtual sources, primarily due to the excessive time delay between sound arrival from the two loudspeakers. At 20° and 25°, the calculated ITDs reflect only the geometric path length difference, rather than corresponding to a perceivable auditory image.
Figure 7b presents the IACC simulation results of virtual sound sources reproduced by the three methods across various azimuth angles. In a free field, the IACC of a real sound source approaches unity. However, in a listening room, acoustic reflections reduce the IACC values for all virtual sources. The overall differences in IACC among the three methods are relatively small. Compared to the other two methods, the proposed algorithm yields slightly lower IACC values at azimuth angles near the loudspeakers, but higher values at mid-range angles.