Article

Big-Delay Estimation for Speech Separation in Assisted Living Environments

by
Swarnadeep Bagchi
and
Ruairí de Fréin
*,†
School of Electrical and Electronic Engineering, Technological University Dublin, D07 EWV4 Dublin, Ireland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Future Internet 2025, 17(4), 184; https://doi.org/10.3390/fi17040184
Submission received: 19 February 2025 / Revised: 1 April 2025 / Accepted: 8 April 2025 / Published: 21 April 2025

Abstract

Phase wraparound due to large inter-sensor spacings in multi-channel demixing renders the DUET and AdRess source separation algorithms—known for their low computational complexity and effective speech demixing performance—unsuitable for hearing-assisted living applications, where such configurations are needed. DUET is limited to relative delays of up to 7 samples, given a sampling rate of $F_s = 16$ kHz in anechoic scenarios, while the AdRess algorithm is constrained to instantaneous mixing problems. The aim of this paper is to improve the performance of DUET-type time–frequency (TF) masks when microphones are placed far apart. A significant challenge in assistive hearing scenarios is phase wraparound caused by large relative delays. We evaluate the performance of a large relative delay estimation method, called the Elevatogram, in the presence of significant phase wraparound. We present extensions of DUET and AdRess, termed Elevato-DUET and Elevato-AdRess, which are effective in scenarios with relative delays of up to 200 samples. The findings demonstrate that Elevato-AdRess not only outperforms Elevato-DUET in terms of objective separation quality metrics—BSS_Eval and PEASS—but also achieves higher intelligibility, as measured by Perceptual Evaluation of Speech Quality (PESQ) Mean Opinion Scores (MOS). These findings suggest that the phase wraparound limitations of the DUET and AdRess algorithms in assistive hearing scenarios involving large inter-microphone spacing can be addressed by the Elevatogram-based Elevato-DUET and Elevato-AdRess algorithms. These algorithms improve separation quality and intelligibility, with Elevato-AdRess demonstrating the best overall performance.

1. Introduction

Selective Hearing (SH) is described in [1] as the ability of listeners to focus on a specific sound source within a competing speaker scenario in their auditory environment. The machine emulation of this process via SH technology involves source localization, separation, and enhancement techniques, as well as unifying frameworks that integrate these methods. Machine emulation of SH seeks to enhance the intelligibility and quality of a target speaker’s voice at the ear of a listening-impaired person. Speech enhancement focuses on improving the perceptual quality of speech that has been degraded by additive noise [2]. These algorithms work to reduce or suppress background noise and are often referred to as noise suppression algorithms [3]. In conjunction with interference suppression (IS), these types of problems are closely related to the problem of source separation (SS) [4]. For better interference reduction performance, researchers often introduce the assumption that each audio signal in a multi-audio scenario is in proximity to at least one microphone channel that predominantly captures the target signal [5,6,7,8,9]. These methods have extensive applications in assisted living (AL) scenarios where hearing aids (HAs) are used [9,10,11], in voice-controlled smart homes [12,13,14] for elderly people, and in other speech AL technologies [15,16,17]. Audio SS involves extracting individual sources from a combined audio mixture. This technique has applications in speech enhancement [18], speech recognition [19], and HA devices [20,21].
This paper is motivated by applications in multi-microphone-based auditory localization systems. In the scenario illustrated in Figure 1a, each mixture captures a different version of the speaker, characterized by variations in spatial location, relative attenuation, and delay parameters. These mixtures are processed by powerful speech demixing algorithms capable of isolating a specific voice of interest [22]. The demixed speech is then transmitted in real-time to a wireless HA using Bluetooth, infrared, or other wireless transmission systems [23]. In these systems, the strategic placement of microphones has proven highly effective in addressing challenges such as noise, interference, and reverberation, particularly in the framework of microphone array processing.
Advances in Voice-over-Wireless Sensor Network (VoWSN) systems [24,25,26,27] and Acoustic Sensor Networks [28,29,30,31] are making it increasingly feasible to enable audio communication through microphones and other AL devices [22,32,33], which are wirelessly connected and strategically placed throughout large rooms. The performance of audio processing tasks—including speech enhancement, separation, diarization, and recognition—benefits greatly from the spatial distribution of microphones across a large area in a room [34].
Assistive hearing devices encompass a wide range of listening technologies designed to enhance access to speech signals or to amplify speech of interest for individuals with hearing impairments. These devices address challenges where traditional ear-mounted HAs [35] may be insufficient. Examples include environments with significant background noise, large speaker-to-listener distances, interference, or reverberation. By wirelessly linking to a user’s HA, these devices ensure a clear, high-quality broadcast of the demixed target speech signal [32]. The prevalence of wireless devices with microphones, including smartphones, laptops, and hands-free kits, is a common feature of modern life. These devices can also be integrated into the network to improve noise reduction in HAs [36]. State-of-the-art solutions often rely on telecoil-based amplification systems. These systems employ wireless remote microphones (RMs) paired with HAs [37], using telecoil technology to amplify and transmit the speaker of interest’s voice via Bluetooth, infrared, light waves, or other proprietary wireless technologies [23]. RMs [38] allow clearer reception of the distant speaker of interest than is otherwise achievable when relying on the sound-field near the listener’s ear [39], that is, when the talker must speak close to the listener’s ear. Recent developments aim to help HA users better understand speakers located at a distance [10,11,37,40,41]. Listening can be challenging when there is excessive background noise, interference, reverberation, or a significant distance between the sound source and the person with hearing loss, as these factors can reduce the perceived signal quality [37]. The use of an RM has been shown to reliably improve intelligibility in adverse conditions [37,41]. The wireless RM provides a relatively clean reference for speech recognition systems [42]. Additional spatial cues are often available in the form of the relative delay/interaural phase difference (IPD) and the received signal strength/interaural intensity difference (IID). These cues correspond to the panning coefficient in the pan-pot mixing model and are captured on the far-end microphone relative to the RM [38], as shown in Figure 1. Ideally, an RM in conjunction with other distant microphones in a wireless acoustic sensor network (WASN) setting mimics the sound-field energy present at the speaker’s mouth and can transmit it to the listener’s ear in a manner resembling free-field anechoic transmission [39]. Speech enhancement in these settings refers to a class of algorithms that improve the quality and intelligibility of speech transmitted from a far-end microphone—where speech is captured as part of a mixture—to the listener’s HA, which ultimately broadcasts the demixed target speech of interest.
Spatial cues help us to distinguish sounds arriving at our ears from different talkers and directions separated in space [10]. Understanding how humans interpret these spatial cues can inform the design of microphone array setups that effectively separate sound sources. For instance, a close-talk or near-field microphone array setup, designed to capture sound sources in close proximity, aids the detection of target speech with a relatively high signal-to-noise ratio (SNR). It decreases in effectiveness with increased speaker-to-microphone distance [43,44]. A gap between the RM and the far-end microphone larger than half the smallest wavelength, $\lambda_{\min}$, of the propagating wave will render the estimation of relative delay cues/IPDs ambiguous [45,46]. This is known as the phase wraparound effect. Phase wraparound occurs because the higher-frequency components of the signal cannot diffract around the microphones positioned along their propagation path [47]. These high frequencies are corrupted and pose a challenge when using the time–frequency (TF) representation [48]. Typical indoor temperatures in AL environments are often maintained between 20 °C and 25 °C (68 °F and 77 °F). At standard atmospheric pressure, and at a typical room temperature of 20 °C, the speed of sound is 343 m/s. To accurately capture and represent these sound waves digitally, a sufficient sampling rate is required. Given a sampling rate of $F_s = 16$ kHz, the maximum frequency content of the signal is $f_{\max} = 8$ kHz. The bound on the inter-microphone gap so that phase wraparound does not occur is commonly known as the spatial Nyquist criterion
$$d_s < \frac{\lambda_{\min}}{2} = \frac{c}{2 f_{\max}}.$$
Substituting these values into Equation (1), the maximum permissible separation between sensors is 2.14 cm. Under similar conditions, setting the inter-microphone gap to be greater than 1 m, $d_s > 1$ m, limits the highest uncorrupted frequency to $f_{\text{aliasing}} = 171.5$ Hz. Frequencies above this maximum frequency, $f_{\text{aliasing}}$, may experience phase wraparound [49]. This corresponds to approximately 98% of the total frequencies being corrupted, resulting in ambiguous delay estimates. Underdetermined speech demixing occurs when the number of sources exceeds the number of microphones. With an increased inter-microphone gap between the RM and a far-end microphone, depicted in Figure 1b, phase wraparound becomes a challenge in near-field settings, leading to two issues: scaling and permutation ambiguities, which complicate underdetermined speech demixing problems. As a result, in the TF domain, a source may exhibit erroneous phases and gains across different TF bins [50], especially at frequencies above $f_{\text{aliasing}}$. This complicates relative delay estimation and consequently source demixing, as clustering-based techniques struggle to assign the TF bins to their correct sources. The authors of [46,49,51] demonstrated that if the IPDs of the low-frequency components below $f_{\text{aliasing}}$ are estimated accurately, they can be used to align the higher-frequency TF bins, where phase wraparound occurs, to the correct sources. We address the problem of achieving speech separation in competing-speaker assisted living environments, which are characterized by large delays—and consequently phase wraparound—due to the significant physical separation of microphones.
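The arithmetic behind these figures is easy to verify. The following snippet (a Python convenience for illustration; the experiments in this paper were implemented in MATLAB) reproduces the 2.14 cm spatial Nyquist bound and the 171.5 Hz aliasing frequency for a 1 m gap.

```python
c = 343.0                # speed of sound at 20 degrees Celsius, m/s
Fs = 16_000.0            # sampling rate, Hz
f_max = Fs / 2           # maximum signal frequency content: 8 kHz

d_s_max = c / (2 * f_max)                 # spatial Nyquist bound, Eq. (1)
print(f"max sensor gap: {100 * d_s_max:.2f} cm")       # -> 2.14 cm

d_s = 1.0                                 # a 1 m inter-microphone gap
f_aliasing = c / (2 * d_s)                # highest uncorrupted frequency
print(f"f_aliasing: {f_aliasing:.1f} Hz")              # -> 171.5 Hz
print(f"corrupted band: {100 * (1 - f_aliasing / f_max):.1f}%")  # -> 97.9%
```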
In this paper, we propose two methods of adapting the Degenerate Unmixing Estimation Technique (DUET)-based mask demixing function to operate effectively in anechoic scenarios with large relative delays, of up to 200 samples. These proposed methods are called Elevato-DUET and Elevato-AdRess. Numerical evaluations on real-speech mixtures demonstrate that Elevato-AdRess outperforms Elevato-DUET in most speech separation quality metrics. Furthermore, the techniques introduced in this paper are not affected by the permutation or scaling problems commonly encountered in scenarios involving large relative delays.
This paper is organized as follows. In Section 2, we introduce the mixing model. Section 3 describes the Elevatogram, originally introduced in [5]. Section 4 outlines the contributions of this paper—namely, the Elevato-AdRess and Elevato-DUET algorithms. Finally, Section 5 evaluates the performance of these algorithms in the task of separating the target source, $s_j[n]$, from a mixture of $J = 4$ sources.

2. Mixing Model

A reverberant stereo noise-free pair of mixtures is expressed as
$$x_1[n] = \sum_{j=1}^{J} h_{1j}[n] * s_j[n],$$
$$x_2[n] = \sum_{j=1}^{J} h_{2j}[n] * s_j[n],$$
where $*$ is the linear convolution operator and the filters, $h_{ij}[n]$, are the channel Room Impulse Responses (RIRs) for source $s_j[n]$ at microphone $x_i[n]$. The source-to-microphone distance is $d_{ij}$ and $K$ is the number of DFT points. For an anechoic or low-reverberation scenario, the mixing filter for the $i$th channel, $h_{ij}[n]$, simplifies to a combination of delays, $\delta_{ij}$, and attenuations, $\alpha_{ij}$, where
$$\alpha_{ij} = \frac{1}{4\pi d_{ij}},$$
where $d_{ij}$ is the distance between the $j$th source and the $i$th microphone [49]. The Acoustic Transfer Function (ATF) of the $j$th source, which is the frequency-domain transform of the RIR and is called its steering vector, in a scenario with $I$ microphones, is
$$\mathbf{d}_j[k] = \left[ \frac{1}{4\pi d_{1j}} e^{-j\frac{2\pi}{K}k\frac{d_{1j}}{c}F_s},\; \frac{1}{4\pi d_{2j}} e^{-j\frac{2\pi}{K}k\frac{d_{2j}}{c}F_s},\; \ldots,\; \frac{1}{4\pi d_{Ij}} e^{-j\frac{2\pi}{K}k\frac{d_{Ij}}{c}F_s} \right].$$
The gain-normalized ATF of the $j$th source, called the Relative Transfer Function (RTF), is
$$\mathbf{d}_j^{\mathrm{Rel}}[k] = \left[ 1,\; \frac{d_{1j}}{d_{2j}} e^{-j\frac{2\pi}{K}k\frac{(d_{2j}-d_{1j})}{c}F_s},\; \ldots,\; \frac{d_{1j}}{d_{Ij}} e^{-j\frac{2\pi}{K}k\frac{(d_{Ij}-d_{1j})}{c}F_s} \right],$$
and can be uniquely determined by the relative gain and delay estimates [52] in a low-reverberation room [53]. Equation (6) resolves the scaling problem [54]. Substituting Equation (4) into Equation (6) gives the RTF,
$$\mathbf{d}_j^{\mathrm{Rel}}[k] = \left[ 1,\; \alpha_{2j} e^{-j\frac{2\pi}{K}k\delta_{2j}},\; \ldots,\; \alpha_{Ij} e^{-j\frac{2\pi}{K}k\delta_{Ij}} \right],$$
where $\delta_{ij} = \frac{d_{ij} - d_{1j}}{c} F_s$ is the relative delay measured in samples, and the attenuation of the $j$th source impinging on the $i$th microphone is $\alpha_{ij} = \frac{d_{1j}}{d_{ij}}$, which is bounded by $0 \leq \alpha_{ij} \leq 1$ and is used to define an anechoic mixing model. The anechoic mixing model is a special case of the reverberant model. In a room with low reverberation, or with no echoes, the RTF in Equation (7) can be uniquely determined by the relative gain and delay estimates [52,53]. The terms relative gain, attenuation, and IID are used interchangeably in the literature.
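As a concrete illustration of Equation (7), the following sketch builds the RTF of one source from its source-to-microphone distances. It is written in Python with numpy for convenience (the paper's implementation is in MATLAB), and the function name and defaults are illustrative assumptions.

```python
import numpy as np

def relative_transfer_function(dists, k, K, c=343.0, Fs=16_000.0):
    """Sketch of the RTF in Equation (7) for one source and I microphones.
    `dists` holds the source-to-microphone distances d_ij for i = 1..I;
    microphone 1 is the reference, so the first entry returned is 1."""
    dists = np.asarray(dists, dtype=float)
    alphas = dists[0] / dists                    # relative gains, d_1j / d_ij
    deltas = (dists - dists[0]) / c * Fs         # relative delays in samples
    return alphas * np.exp(-1j * 2 * np.pi * k * deltas / K)

# Hypothetical two-microphone example: a source 0.2 m from the RM and
# 1.5 m from the far-end microphone, evaluated at frequency bin k = 10.
print(relative_transfer_function([0.2, 1.5], k=10, K=2048))
```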
Conventional relative delay estimation techniques, for example, the Generalized Cross-Correlation with PHAse Transform (GCC-PHAT) [55], the Steered Response Power with PHAse Transform (SRP-PHAT) [56], and MUltiple SIgnal Classification (MUSIC) [57], are designed to estimate the loudest source in an indoor acoustic environment. The low power of the target speech source often impedes the accuracy of the delay estimate [58] in a multi-talker scenario because the phase information is lost due to other speech interference [59]. Joint multi-source delay estimation is challenging when two sources are spatially close to one another [46,60,61]. In addition, speech enhancement techniques are not useful when the interferer is in proximity to the target speaker of interest [62]. Recent work in [4] demonstrated that the accuracy of widely used delay estimation techniques decreased when tasked with estimating the relative delays of all constituent sources simultaneously. The advantage of a close-talk setup is the ability to estimate the most prominent peak associated with a relative delay [63], which corresponds to the reference microphone positioned near the target speaker’s mouth. This provides a reliable estimate of the IPD for the speaker of interest, which can then be used to parameterize a TF mask to isolate the speech effectively.
Assuming that an arbitrary source, s j [ n ] , is in physical proximity to any one RM, as shown in Figure 1b, then the corresponding discrete-time, anechoic stereo-mixing scenario, which consists of J sources is,
$$x_1[n] = s_j[n],$$
$$x_2[n] = \sum_{j=1}^{J} \alpha_{2j}\, s_j[n - \delta_{2j}],$$
where the discrete-time index is $n$, with $1 \leq n \leq N$ and $n \in \mathbb{Z}^{+}$. Given a binaural mixture, the relative attenuation and delay coefficients are redefined, $\alpha_{2j}$ as $\alpha_j$ and $\delta_{2j}$ as $\delta_j$, for simplicity. In the mixture $x_2[n]$, the source, $s_j[n]$, is attenuated and delayed by $\alpha_j$ and $\delta_j$, respectively, relative to $x_1[n]$. The relative attenuation lies in the range $0 \leq \alpha_j \leq 1$, and the relative delay $\delta_j$ is measured in samples. The TF transform of $x_1[n]$ provides the mapping $X_1 : x_1[n] \in \mathbb{R} \rightarrow X_1[k,\tau] \in \mathbb{C}$, where the discrete frequency is $k$, with $1 \leq k \leq K$. The discrete-time frame is $\tau$, with $1 \leq \tau \leq T$. Similarly, $x_2[n]$ is mapped to $X_2[k,\tau]$, using the TF transform stated in [48].
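A minimal sketch of this anechoic stereo model, Equations (8) and (9), assuming integer sample delays, is given below in Python/numpy (illustrative names; the paper's experiments use MATLAB).

```python
import numpy as np

def anechoic_stereo_mix(sources, alphas, delays):
    """Mix J equal-length sources per Eqs. (8) and (9): the RM channel x1
    captures the near source cleanly; x2 receives every source attenuated
    by alpha_j and delayed by an integer delta_j samples."""
    N = len(sources[0])
    x1 = sources[0].copy()                    # Eq. (8): RM picks up s_j
    x2 = np.zeros(N)
    for s, a, d in zip(sources, alphas, delays):
        x2[d:] += a * s[:N - d]               # Eq. (9): attenuate and delay
    return x1, x2

# Hypothetical usage with two synthetic utterances and large delays:
rng = np.random.default_rng(0)
s = [rng.standard_normal(16_000) for _ in range(2)]
x1, x2 = anechoic_stereo_mix(s, alphas=[0.9, 0.4], delays=[100, 37])
```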
It is assumed that the target speaker is always the most intense among all speakers present. As proposed in [4], speech enhancement with interference reduction using an RM can enable speech demixing. The demixing of a mixture of voice signals at the far-end microphone is illustrated in Figure 1b. In a simple anechoic model, an acoustic speech wave at one microphone differs from the reference RM signal in terms of IPD and IID cues. We hypothesize that DUET-based masking techniques, which include the classical DUET [64] and AdRess [65] algorithms—where the latter was originally designed for separating music mixed using a pan-pot model—can be adapted to a large-delay anechoic model, with delays of up to 200 samples. The AdRess technique is a brute force version of the DUET mask, where the IID cues are localized on a frequency–azimuth plane relative to a gain scale.
In the time domain, the silent interval between two interacting speakers is typically small. This causes the majority of the signal coefficients to be non-zero [66]. In the TF domain, by contrast, a small percentage of the TF coefficients contain the majority of the energy of the acoustic signal [67]: the magnitudes of most of the Gabor coefficients are generally small. Speech is typically sparse and disjoint in the TF domain [68]. The TF representation is commonly employed for speech processing because sources tend to exhibit less overlap in the TF domain compared to the temporal waveform [69,70]. According to the authors of [69,71], it is unlikely that multiple speech signals will overlap in a particular TF bin of the mixture of speech utterances. This characteristic makes the signals approximately Windowed-Disjoint Orthogonal (WDO) [69,71]. The authors of [67] observed that the TF bins that contain 90% of the energy of the $j$th source contain less than 1% of the energy of all the interference combined.
Strict, pairwise WDO between two speech utterances, S 1 [ k , τ ] and S 2 [ k , τ ] , separated in space, in a channel mixture is expressed as,
$$S_1[k,\tau]\, S_2[k,\tau] = 0, \quad \forall [k,\tau], \quad S_1 \neq S_2.$$
In real mixtures, these utterances do not satisfy the condition in Equation (10), but they are approximately sparse. A simple explanation is obtained by calculating the Log-Max approximation in the TF domain as
$$X_i[k,\tau] = S_1[k,\tau] + S_2[k,\tau]$$
$$= S_1[k,\tau]\left(1 + \frac{S_2[k,\tau]}{S_1[k,\tau]}\right).$$
Taking the natural logarithm on both sides produces
$$\ln X_i[k,\tau] = \ln S_1[k,\tau] + \ln\left(1 + \frac{S_2[k,\tau]}{S_1[k,\tau]}\right).$$
If $S_1[k,\tau] \gg S_2[k,\tau]$, then $\frac{S_2[k,\tau]}{S_1[k,\tau]} \ll 1$ and Equation (13) reduces to
$$\ln X_i[k,\tau] \approx \ln S_1[k,\tau], \quad \ln\left(1 + \frac{S_2[k,\tau]}{S_1[k,\tau]}\right) \approx 0.$$
Similarly, if $S_2[k,\tau] \gg S_1[k,\tau]$, then taking $S_2[k,\tau]$ as a factor in Equation (11), it can be shown that
$$\ln X_i[k,\tau] \approx \ln S_2[k,\tau], \quad \ln\left(1 + \frac{S_1[k,\tau]}{S_2[k,\tau]}\right) \approx 0.$$
Consequently, the expression
$$\ln X_i[k,\tau] \approx \begin{cases} \ln S_2[k,\tau], & \text{if } S_2[k,\tau] \gg S_1[k,\tau] \\ \ln S_1[k,\tau], & \text{if } S_1[k,\tau] \gg S_2[k,\tau] \end{cases}$$
assumes that the source with the maximum energy among the constituent sources is essentially the sole occupant of a TF bin, leading to the definition of approximate WDO,
$$\ln\left(S_1[k,\tau] + S_2[k,\tau]\right) \approx \max\left(\ln S_1[k,\tau],\, \ln S_2[k,\tau]\right).$$
The approximate WDO description in Equation (16), accompanied by the idea of the dominance of a particular speech utterance in a TF bin, is a good approximation of the strict WDO criterion given in Equation (10), when the sources are i.i.d. Gaussian. Similarly, when the number of sources present in Equation (11) is $N$, it can be stated that the sources are $N$-way W-disjoint orthogonal, with one source predominant in each bin in the sense of Equation (16), $\forall k, \tau, j$.
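A quick numerical check of the Log-Max approximation in Equation (16), with toy magnitude values, shows how small the error term becomes when one coefficient dominates (Python for illustration):

```python
import numpy as np

# When S1 >> S2, ln(S1 + S2) is close to max(ln S1, ln S2); the error is
# the neglected term ln(1 + S2/S1) from Equation (13).
S1, S2 = 1.0, 0.05                       # toy TF-bin magnitudes, S1 >> S2
lhs = np.log(S1 + S2)                    # exact log of the mixture bin
rhs = max(np.log(S1), np.log(S2))        # Log-Max approximation
print(lhs - rhs)                         # -> 0.0488, i.e., ln(1.05)
```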

3. The Elevatogram for Large Delay Estimation

The Elevatogram was introduced in [5]. In previous contributions, it was demonstrated that in an anechoic condition without any environmental noise, the Elevatogram can estimate relative delays of more than 500 samples, given a sampling frequency of $F_s = 16$ kHz [5,8]. The TF representations of the mixtures, $x_1[n]$ and $x_2[n]$, in Equations (8) and (9), are given as $X_1[k,\tau]$ and $X_2[k,\tau]$, respectively. The Cross-Power Spectrum (CPS) is calculated as
$$X[k,\tau] = X_1[k,\tau]\, \overline{X}_2[k,\tau],$$
where the complex conjugate of $X_2[k,\tau]$ is denoted as $\overline{X}_2[k,\tau]$. The normalized CPS contains the phase information for a given delay, $\delta_j$, and is expressed as
$$\Delta\phi[k,\tau] = \angle X[k,\tau] = \angle X_1[k,\tau] - \angle X_2[k,\tau] = \Omega\,\delta_j + 2\pi r,$$
where $r \in \mathbb{Z}$ comes from the modulo-$2\pi$ phase ambiguity [72], $\Delta\phi[k,\tau] \in \mathbb{R}$, and $\Omega = 2\pi k / K$. The parameter, $\Delta\phi[k,\tau]$, can be modeled as a wrapped scaled-Gaussian Mixture Model (scaled-GMM) or Von Mises noise distribution [73,74], because the phase is a circular variable that is bounded in the interval $(-\pi, \pi]$. The random variable $\Delta\phi[k,\tau]$ is zero-mean. It has a periodicity of $2\pi$, is unimodal in one cycle [75], and is i.i.d. A histogram of the $k$th discrete-frequency row of $\Delta\phi[k,:]$ was generated in the Elevatogram in [5]. The bins of the histogram are given by the vector $[-\pi, \ldots, \pi]^T$, which has $L$ elements, each corresponding to a quantization level; the levels are uniformly separated, and each bin has width $\Delta d$. If the bins are indexed by $\tilde{\phi}$, then the set of TF bins contributing to any $\tilde{\phi}$ is obtained using
$$I[\tilde{\phi}] := \left\{ [k,\tau] : \left| \Delta\phi[k,\tau] - \tilde{\phi} \right| < \Delta d \right\}.$$
The higher the cardinality of $|I[\tilde{\phi}]|$, the more intensely that bin $\tilde{\phi}$ has been activated in the histogram, $\mathbf{h}_k^T$. The histogram is generated as follows,
$$\mathbf{h}_k = \mathrm{hist}\left(\Delta\phi[k,:],\, L\right) \in \mathbb{R}^{1 \times L}.$$
The Elevatogram [5] computes a histogram for each of the $K$ frequencies. It generates a matrix, $\mathbf{P} \in \mathbb{R}^{L \times K}$, which was also called the Elevatogram by the authors of [5], and is obtained by concatenation, resulting in a phase–frequency matrix,
$$\mathbf{P} = \left[ \mathbf{h}_1^T, \mathbf{h}_2^T, \mathbf{h}_3^T, \ldots, \mathbf{h}_K^T \right] \in \mathbb{R}^{L \times K}.$$
The slanted lines in the example shown in Figure 2a, which depict the distribution of phase with respect to frequency, represent a single line that undergoes phase wraparound at 1, 3, 5, and 7 kHz. The larger the delay, the more often this phase wraparound occurs. The location of the set of most significant collinear points forming a straight line in the phase–frequency domain was determined in [5], as shown in Figure 2b. This is analogous to finding the distance–angle accumulator in the Hough transform [76] that receives the highest activation. This value is marked in green and denoted as $\phi_{\max}$. The relative delay produced by the Elevatogram [5] is
$$\hat{\delta}_j = \frac{K}{L} \tan\phi_{\max}.$$
In general, the phase wraparound issue prevents the IPD cues from being mapped to a unique delay, $\delta_j$. Using the closed-form solution in Equation (22), the relative delay can be estimated unambiguously. This explains the success of the Elevatogram approach [5]. For better estimation, the base histogram matrix, $\mathbf{P}$, was tiled by the authors of [5] by concatenating tiles in a $T \times T$ grid. Each tile was a replica of $\mathbf{P}$, and concatenating them adjacent to and below the base $\mathbf{P}$ gave the consolidated tiled histogram [5]. This procedure can be interpreted as a Maximum Likelihood Estimator (MLE). Given a set of $N$ independent and identically distributed (i.i.d.) pixel input features, $X = \{x_1, x_2, \ldots, x_N\}$, signifying $N$ significant collinear points in the phase–frequency feature space, which are mapped to an accumulator in the parametric $[\rho, \phi]$-space shown in Figure 2a, the joint distribution is defined in Appendix A and restated here,
$$\mathcal{L}([\rho, \phi]) = \prod_{i=1}^{N} p\left( x_i \mid [\rho, \phi] \right), \quad x_i \in X.$$
The log-likelihood of which is expressed as,
$$L[\rho, \phi] = \log \mathcal{L}[\rho, \phi] = \sum_{i=1}^{N} \log p\left( x_i \mid [\rho, \phi] \right).$$
As more features (pixels) are taken into account, the PDF in the Hough space, given $N$ features, will change from a uniform distribution, where all accumulators are equally likely, to one with a maximum at a particular accumulator, $[\hat{\rho}, \hat{\phi}]$, in Figure 2b, which is the MLE, computed as
$$[\hat{\rho}, \hat{\phi}] = \operatorname*{argmax}_{\rho, \phi} L[\rho, \phi].$$
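The sketch below illustrates the two Elevatogram ingredients in Python/numpy: building the phase–frequency histogram matrix $\mathbf{P}$, and recovering a large integer delay. For brevity, the Hough-style accumulator search of [5] is replaced here by an exhaustive scan over candidate delays, scored by the circular concentration of the phase residuals; the function names and the sign convention (x2 delayed relative to x1) are illustrative assumptions.

```python
import numpy as np

def elevatogram(X1, X2, L=100):
    """Phase-frequency histogram matrix P (L x K): one histogram of the
    cross-power-spectrum phase per discrete frequency row."""
    dphi = np.angle(X1 * np.conj(X2))            # wrapped phase, (-pi, pi]
    edges = np.linspace(-np.pi, np.pi, L + 1)    # L uniform phase bins
    P = np.zeros((L, X1.shape[0]))
    for k in range(X1.shape[0]):
        P[:, k], _ = np.histogram(dphi[k, :], bins=edges)
    return P

def brute_force_delay(X1, X2, K, max_delay=250):
    """Stand-in for the Hough-style line search of [5]: score each integer
    candidate delay by how tightly the wrapped residuals concentrate
    (circular mean resultant length), and return the best candidate."""
    dphi = np.angle(X1 * np.conj(X2))            # shape (K_bins, T)
    k = np.arange(X1.shape[0])[:, None]          # frequency-bin index column
    best_d, best_score = 0, -np.inf
    for d in range(max_delay + 1):
        resid = dphi - 2 * np.pi * k * d / K     # residual w.r.t. model line
        score = np.abs(np.exp(1j * resid).mean())
        if score > best_score:
            best_d, best_score = d, score
    return best_d
```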

4. TF Masking

Widely recognized source separation (SS) approaches include Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) [77], DUET [67], the TIme Frequency Ratio Of Mixture (TIFROM) [78], and the Direction Estimation of Mixing matrIX (DEMIX) [79] algorithms. DUET-ESPRIT (DESPRIT) [80] is an extension of ESPRIT. A family of power-weighted estimators was introduced in [71], exploiting the sparsity of sources in a specific transform domain [81], in this case the TF domain. The core principle of Independent Component Analysis (ICA)-based methods is likewise to leverage the sparsity of sources in a specific transform domain [82]. If the sources are sparse, this implies that they are already separated [68], enabling speech demixing to be achieved using a binary or hard mask, as shown in Equation (26). The mask is a weighting matrix that corresponds to a filter for the target source from the mixture [20]. Hard-masking techniques assume that a TF bin belongs to one source. According to the authors of [67], if all the TF bins that correspond to a particular source are determined, then the source is already separated.
$$M[k,\tau] = \begin{cases} 1, & \text{if the TF-bin} \in S_j \\ 0, & \text{otherwise.} \end{cases}$$
The mask, $M[k,\tau]$, has TF bins set to 1 for all components belonging to the speaker of interest, $S_j$. All other TF bins in $M[k,\tau]$ belong to the background interference and are set to 0. The hard-mask AdRess algorithm was extended to handle anechoic scenarios in [83]. A soft-masking variant of AdRess is known as Reformulated-AdRess (Redress) [84] because it tackles the hard-masking shortcoming of AdRess. It was motivated by prior work on single-channel feature learning in the presence of background noise [85]. It was adapted for anechoic conditions in [86]. These are computationally efficient soft-masking techniques for SS. Unlike hard-masking methods, which use binary decisions, soft-masking approaches generate a ratio mask $M[k,\tau]$ with values in the range $0 \leq M[k,\tau] \leq 1$. This approach distributes the energy contributions in $M[k,\tau]$ across sources, based on the likelihood of their presence in the mixture [87]. TF-masking has seen application in Automatic Speech Recognition (ASR) systems because these systems involve a step to isolate the target from its acoustic background.
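Applying a binary mask of the form in Equation (26) is a single element-wise product followed by resynthesis. A minimal Python sketch, assuming a mask already aligned with the STFT grid, and using the same analysis settings as the experiments in Section 5 (2048-sample Hamming window, 50% overlap):

```python
import numpy as np
from scipy.signal import stft, istft

def apply_binary_mask(x1, mask, fs=16_000, nperseg=2048):
    """Apply a hard TF mask, Eq. (26), to one mixture channel and
    resynthesise the estimated utterance. `mask` must have the same
    shape as the one-sided STFT of x1."""
    f, t, X1 = stft(x1, fs=fs, window="hamming", nperseg=nperseg)
    S_hat = mask * X1                      # keep only the target's TF bins
    _, s_hat = istft(S_hat, fs=fs, window="hamming", nperseg=nperseg)
    return s_hat
```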

4.1. Elevato-AdRess: Separation via Frequency–Azimuth Plane

The classical AdRess algorithm [65] is often used for music demixing, particularly when the sources are mixed using a pan-pot approach. It isolates sources by localizing the IID cues on a frequency–azimuth plane. This method separates music from the TF mixture by employing a form of spectral energy cancellation technique, which uses attenuations as a spatial cue. They are obtained from the magnitude spectra, | X i [ k , τ ] | , where X i is the i t h channel mixture. The AdRess algorithm [65] was licensed for integration into Sony’s SingStar game on the PlayStation 3. Subsequently, it was licensed to Riffstation, which was later acquired by Fender and utilized by millions of users between 2012 and 2018. Each source is given a position in the stereo field using an azimuth location. In the sound engineering and music production industries these panning coefficients are used to create multichannel mixtures [65,88]. These coefficients scale the contribution of each source in each of the available channels.
We extended this approach to an anechoic speech mixing model, introducing the D-AdRess technique in [83], which considers the mixing model in Equations (8) and (9). D-AdRess uses the magnitude spectra, $|X_i[k,\tau]|$, and the phase spectra, $\angle X_i[k,\tau]$. D-AdRess was restricted to scenarios with a relative delay of up to 5 samples. The primary contribution of the present paper is an extension of the AdRess technique to large-delay scenarios, with delays of up to 200 samples. The resulting algorithm is called Elevato-AdRess, acknowledging its reliance on the Elevatogram for accurate estimation of large delays [5]. Elevato-AdRess is summarized in Algorithm 1. The AdRess technique has garnered widespread adoption in recent research [89,90,91] and has been improved in subsequent contributions [9,84]. In addition, it has been cited in a patent for an acoustic noise monitoring framework [92]. The proposed method, Elevato-AdRess, operates as a beamforming technique that maximizes the power of the source arriving from a specific direction, corresponding to $\hat{\delta}_j$, using a pair of microphones. The Signal-to-Noise Ratio (SNR) is the ratio of the target source power to the combined power of all interfering speech sources. This approach is equivalent to maximizing the SNR, as described in [61]. The use of relative delay estimates for beamforming was reported in [93]. Prior information regarding the direction of the target and the interference helps implement robust IS techniques for speech enhancement [44].
To introduce Elevato-AdRess, we consider the two-synthetic-signal example that served as motivation in [84]. The first signal, $s_1[n]$, is composed of sinusoids with frequencies $k_1 = 100$ Hz and $k_3 = 300$ Hz, and the second signal, $s_2[n]$, has components of frequency $k_2 = 200$ Hz and $k_4 = 400$ Hz. The resulting sources
$$s_1[n] = \sin\left( 2\pi k_1 \frac{n}{F_s} \right) + \sin\left( 2\pi k_3 \frac{n}{F_s} \right),$$
$$s_2[n] = \sin\left( 2\pi k_2 \frac{n}{F_s} \right) + \sin\left( 2\pi k_4 \frac{n}{F_s} \right),$$
are mixed via Equations (8) and (9) with the attenuation coefficients $\alpha_1 = 0.8$ and $\alpha_2 = 0.2$. The authors of [84] localized the IID cues on a frequency–gain plane, $\mathbf{A} = [\mathbf{A}_1, \mathbf{A}_2] \in \mathbb{R}^{K \times M}$, as illustrated in Figure 3a, by scaling $|X_1|$ relative to $|X_2|$ and vice versa, with respect to an attenuation range $g$, where $0 \leq g \leq 1$. This range is divided into $M$ equally spaced levels. The cancellation approach described in [5] results in the difference between the magnitudes of the Short-Time Fourier Transform (STFT) representations of the constituent sources being zero [48]. A plane is constructed by finding the best $\hat{g}$ for a particular source, where
$$\hat{g} \leftarrow \mathrm{find}\left( \mathbf{A}_1 = \left| X_1[k,\tau] - g\, X_2[k,\tau] \right| \rightarrow 0, \quad \text{and} \quad \mathbf{A}_2 = \left| X_2[k,\tau] - g\, X_1[k,\tau] \right| \rightarrow 0 \right),$$
where the spectral energies of the sources are canceled, according to [84], producing
$$\mathbf{A}_1 = \begin{cases} \alpha_1 = 1 - g, & \text{if } k = k_1 \\ \alpha_2 = 1 - g, & \text{if } k = k_3 \\ 0, & \text{otherwise,} \end{cases} \qquad \mathbf{A}_2 = \begin{cases} \alpha_1 = g, & \text{if } k = k_2 \\ \alpha_2 = g, & \text{if } k = k_4 \\ 0, & \text{otherwise,} \end{cases}$$
provided $0 \leq g \leq 1$. In Figure 3a, it is observed that a null forms at $\hat{g} = \alpha_1 = 0.2$ for the frequencies of $s_1[n]$, $k = 100$ Hz and $k = 300$ Hz. Similarly, the source $s_2[n]$ contains the nulls at $\hat{g} = \alpha_2 = 0.8$ for the frequencies $k = 200$ Hz and $k = 400$ Hz, for non-overlapping frequencies. This frequency–gain plane is also called the frequency–azimuth plane because it provides information on the source’s azimuth in the stereo mixture. The foundational step is to isolate the target source. A two-microphone scenario gives the azimuth location of the source and not the elevation. Peaks constructed at these nulls determine the energy content of the sources. If the sources overlap, the corresponding null forms at $\hat{g} \approx \frac{\alpha_1 + \alpha_2}{2}$, subject to $|\hat{g}| \leq 1$. The initial, detailed discussion justifying this assertion is given in [94]. Figure 3c shows that the sources overlap at 300 Hz and have peaks at $\hat{g} \approx 0.5$. AdRess is restricted to pan-pot mixing scenarios. Detection of peaks at $\hat{g}$ becomes challenging in anechoic mixing scenarios. We formulate a technique for delay cancellation in the second mixture, $X_2$, using the delay estimated via the Elevatogram [5] in Equation (22) in [83], and scale $X_2$, based on the approach in [94], according to
$$X_2 \leftarrow e^{+j\omega\hat{\delta}_j}\, X_2.$$
This operation localizes the spectral energies at accurate gain locations on the frequency–azimuth plane. Extracting the subportion of $\mathbf{A}$ where the peaks corresponding to a particular source occur, and repeating this for all the available time frames, $T$, results in an augmented TF representation. Inverting this TF representation gives the demixed source in the temporal domain. The spectral energy of sources at overlapping frequency locations cannot be accurately extracted using this approach. For real speech signals, the spectral energies at these overlapping frequencies are dispersed across $\mathbf{A}$ in Figure 3, making their precise estimation ambiguous.
Spatial and spectral information must be highly correlated for AdRess to function effectively in an anechoic scenario. Any inaccuracy in estimating the relative delay can result in discrepancies when localizing the spectral energy of the corresponding source on the frequency–azimuth plane, ultimately hindering the technique’s ability to isolate the target speaker.
Algorithm 1 Elevato-AdRess algorithm
1: Find the delay estimate, $\hat{\delta}_j$, using Equation (22) as per the Elevatogram [5].
2: Subtract the delayed sources in $X_2$ using Equation (32).
3: Extract the peaks corresponding to the target source using Equation (29) and generate an augmented TF representation.
4: Compute the inverse TF (ITF) transform to recover the target utterance in the time domain.
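A single-frame Python sketch of steps 2 and 3 follows, assuming the Elevatogram delay estimate is already available. The per-bin null location and peak magnitude follow the AdRess construction; the variable names and the one-sided-spectrum convention are illustrative.

```python
import numpy as np

def elevato_adress_frame(X1, X2, delay_hat, K, M=1000):
    """Steps 2-3 of Algorithm 1 for one STFT frame. X1, X2 are complex
    one-sided spectra of length K//2 + 1; delay_hat is the Elevatogram
    estimate in samples."""
    k = np.arange(len(X1))
    omega = 2 * np.pi * k / K
    X2c = np.exp(+1j * omega * delay_hat) * X2            # Eq. (32)
    g = np.linspace(0.0, 1.0, M)                          # gain (azimuth) scale
    A1 = np.abs(X1[:, None] - g[None, :] * X2c[:, None])  # Eq. (29) plane
    null_gain = g[A1.argmin(axis=1)]                      # null location per bin
    peak = A1.max(axis=1) - A1.min(axis=1)                # energy at the null
    return null_gain, peak
```

Bins whose null falls close to the target's gain are retained, frame by frame, to build the augmented TF representation that step 4 inverts.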

4.2. Elevato-DUET

The assumption that speech sources are sparse in the TF mixture suggests that, at any bin, one source is active. The RTF in Equation (6) for a stereo-anechoic setting, where the number of sensors is $I = 2$, in Equations (8) and (9), gives us the mixing model,
$$\begin{bmatrix} X_1[k,\tau] \\ X_2[k,\tau] \end{bmatrix} = \begin{bmatrix} 1 \\ \alpha_j e^{-j\frac{2\pi}{K}k\delta_j} \end{bmatrix} S_j[k,\tau],$$
where S j [ k , τ ] is the speaker of interest that must be isolated from the mixture and directed to the hearing-aid. In many papers where the DUET-based sparsity assumption is made, for example [95,96,97,98,99], the TF-bins where one source is active are referred to as Single Source Points (SSPs). This assumption is restricted to noiseless anechoic scenarios. Given a sampling rate of F s = 16 kHz, DUET can accurately estimate delays of up to 7 samples. A recent contribution in [91] extended DUET to echoic scenarios and used a Head Related Transfer Function (HRTF) data lookup table, where the spatial cues of different speakers in the room relative to the listener are predefined. This approach is an extension of the TF dictionary approach in [95] which constructs spatial signatures for different room locations. The Google-held patent [100] underlines the efficacy, quality and robustness of DUET-based masking techniques. The classical DUET technique described in [64] constructs a 2-D weighted histogram from the ratio of the TF mixtures,
$$\frac{X_2[k,\tau]}{X_1[k,\tau]} \approx \alpha_j e^{-j\frac{2\pi}{K}k\delta_j}, \quad \forall [k,\tau] \text{ belonging to } S_j.$$
The two axes represent a range of relative attenuations and delays, denoted by the discrete indices $\tilde{\alpha}$ and $\tilde{\delta}$. Each axis is evenly divided with resolution widths of $\Delta\alpha$ and $\Delta\delta$. The set of points contributing to a specific location, $[\alpha, \delta]$, in the histogram is defined as
$$I[\alpha, \delta] := \left\{ [k,\tau] : \left| \alpha[k,\tau] - \tilde{\alpha} \right| < \Delta\alpha,\; \left| \delta[k,\tau] - \tilde{\delta} \right| < \Delta\delta \right\}.$$
This 2-D histogram,
$$H[\alpha, \delta] := \int_{I[\alpha, \delta]} d\alpha\, d\delta,$$
aggregates membership of the set of $(\alpha, \delta)$ pairs that belong to one source, $s_j[n]$. The most prominent peaks on $H[\alpha, \delta]$ are the estimates of the mixing parameters, $(\hat{\alpha}_j, \hat{\delta}_j)$, associated with the $J$ sources, $\{(\hat{\alpha}_j, \hat{\delta}_j),\ j = 1, 2, \ldots, J\}$. The underlying assumption is that the peaks located are sufficiently far apart and identifiable. More details about this technique can be found in [64].
In large-delay scenarios, where $d_s > \lambda_{\min}/2$, the DUET mixing parameter estimation step using the 2-D histogram plane is often not the best approach. The Elevatogram described in [5] provides good estimates of $\hat{\delta}_j$ via Equation (22). This estimate can be used to subtract the delayed source in the second mixture using Equation (32) and, via the AdRess step in Equation (29), to estimate the attenuation, $\hat{\alpha}_j$. This relative attenuation–delay pair can be used to generate a mask that partitions the TF mixture,
$$J[k,\tau] := \operatorname*{argmin}_j \frac{\left| \hat{\alpha}_j e^{-j\hat{\delta}_j\Omega} X_1[k,\tau] - X_2[k,\tau] \right|^2}{1 + \hat{\alpha}_j^2}.$$
The variable $\Omega = \frac{2\pi}{K} k$ is the angular frequency, in the range $-\pi \leq \Omega \leq +\pi$. This procedure is iterated over all the available parameter pairs, $(\hat{\alpha}_j, \hat{\delta}_j)$, to compute the TF mask,
$$M[k,\tau] = \begin{cases} 1, & \text{if } J[k,\tau] = j \\ 0, & \text{otherwise,} \end{cases}$$
where $M[k,\tau] \in \{0, 1\}$. Sources are demixed by combining the mask and the MLE estimate using
$$\hat{S}_j[k,\tau] = M[k,\tau]\, \frac{X_1[k,\tau] + \hat{\alpha}_j e^{+j\hat{\delta}_j\Omega} X_2[k,\tau]}{1 + \hat{\alpha}_j^2},$$
where $\hat{S}_j[k,\tau]$ is the recovered target speech utterance. This illustrates that Elevato-DUET can demix a target speaker experiencing a large relative delay.
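The partition and recovery steps above reduce to a few vectorized lines. A Python/numpy sketch, assuming the $(\hat{\alpha}_j, \hat{\delta}_j)$ pairs have already been estimated (Elevatogram delays plus AdRess-style gains), with the same sign convention as the sketches above:

```python
import numpy as np

def elevato_duet_demix(X1, X2, alphas, delays, K):
    """ML partition of the TF plane and masked recovery for each source.
    X1, X2: complex STFTs, shape (K_bins, T); alphas, delays: estimated
    (alpha_j, delta_j) pairs for the J sources."""
    k = np.arange(X1.shape[0])[:, None]
    Omega = 2 * np.pi * k / K                        # angular frequency
    costs = np.stack([
        np.abs(a * np.exp(-1j * d * Omega) * X1 - X2) ** 2 / (1 + a ** 2)
        for a, d in zip(alphas, delays)])            # ML cost per source
    J = np.argmin(costs, axis=0)                     # best source per TF bin
    S_hat = []
    for j, (a, d) in enumerate(zip(alphas, delays)):
        M = (J == j).astype(float)                   # binary TF mask
        S_hat.append(M * (X1 + a * np.exp(+1j * d * Omega) * X2) / (1 + a ** 2))
    return S_hat
```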
The quality of any TF mask can be evaluated using the Preserved-Signal Ratio (PSR), a measure of how well the mask preserves the energy of $s_j[n]$, computed as
$$\mathrm{PSR} := \frac{\left\| M[k,\tau]\, S_j[k,\tau] \right\|^2}{\left\| S_j[k,\tau] \right\|^2}.$$
In addition, the Source-to-Interference Ratio (SIR) of the mask is given as
$$\mathrm{SIR} = \frac{\left\| M[k,\tau]\, S_j[k,\tau] \right\|^2}{\left\| M[k,\tau] \sum_{p=1,\, p \neq j}^{J} S_p[k,\tau] \right\|^2},$$
where $\sum_{p \neq j} S_p[k,\tau]$ is the sum of the interference. The WDO is calculated as
$$\mathrm{WDO} = \mathrm{PSR} - \frac{\mathrm{PSR}}{\mathrm{SIR}}.$$
WDO has a maximum value of 1 and typically lies in the range $0 < \mathrm{WDO} < 1$. A WDO value that is less than 0 implies that the mask rejects the target source energy and that the voice is barely intelligible. The PSR and SIR are reported in dB. From the definition of the WDO, observe that $\mathrm{WDO} \rightarrow 1$ as $\mathrm{SIR} \rightarrow \infty$, signifying good separation performance: the mask, $M[k,\tau]$, preserves the energy of the speech of interest while suppressing the interference. A WDO of approximately 0 means that either the PSR is approximately zero or that the SIR is approximately 1. The former implies the mask rejects the spectral energy of the target speaker, and the latter means $\|S_j[k,\tau]\|^2 = \|\sum_{p \neq j} S_p[k,\tau]\|^2$, that is, the mask results in the source having equal energy to the interference. In practical demixing scenarios, the ideal WDO cannot be achieved in a strict sense; WDO is considered in an approximate sense. The greatest obtainable WDO drops to less than 1 as the number of sources in the mixture increases. Calculating the PSR of Elevato-AdRess is not as straightforward as for Elevato-DUET because the mask $M[k,\tau]$ is not derived in the form of a binary matrix. Instead, Elevato-AdRess extracts the spectral energies that are localized on the frequency–azimuth plane in Figure 3, and then creates an augmented TF representation. Applying the Inverse TF (ITF) transform gives us the estimated utterance.
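When the ground-truth source STFTs are available, the three mask-quality measures above are direct energy ratios. A Python sketch, assuming the summed interference STFT is known:

```python
import numpy as np

def mask_quality(M, S_target, S_interf):
    """PSR, SIR and WDO of a TF mask M, given the ground-truth target
    STFT and the summed interference STFT (all of the same shape)."""
    energy = lambda Z: np.sum(np.abs(Z) ** 2)
    psr = energy(M * S_target) / energy(S_target)      # preserved energy
    sir = energy(M * S_target) / energy(M * S_interf)  # mask SIR
    wdo = psr - psr / sir                              # bounded above by 1
    return psr, sir, wdo
```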

5. Results

Anechoic stereo mixtures consisting of up to four sources were generated using randomly selected speech utterances from a total of 25,200 files in the TIMIT corpus [101]. The evaluation was carried out using a TF transform with a $K = 2048$-sample Hamming window and a 50% overlap. Algorithms 1 and 2 were implemented in MATLAB 2021a.
Algorithm 2 Elevato-DUET algorithm
1: Find the delay estimate, $\hat{\delta}_j$, using Equation (22) as per the Elevatogram [5].
2: Subtract the delayed sources in $X_2$ using Equation (32).
3: Find the attenuation estimate, $\hat{\alpha}_j$, using Equation (29).
4: Using the estimated $(\hat{\alpha}_j, \hat{\delta}_j)$ pairs corresponding to the sources, compute the TF mask using Equations (26) and (37).
5: Compute the ITF transform.
The goal is to separate $s_j[n]$ from a mixture of $J$ sources. Up to $J = 4$ source utterances were considered. To construct the Elevatogram matrix, $\mathbf{P}$, as described in [5], the total number of quantization levels used was $L = 100$ and the number of tiles was $T \times T = 2 \times 2$, to keep the MATLAB computation time low. An analysis of the improvement obtained by increasing the tiling parameter is given in [5]. The performance of the proposed techniques was evaluated using relative delays in the range $1 \leq \delta_j \leq 200$ samples. The number of equally spaced azimuth positions was $M = 1000$ in the range $0 \leq g \leq 1$ for Elevato-AdRess. It was assumed that any two sources were separated by at least $\Delta g = 150$ units on the gain axis and by at least 40 samples of relative delay, so that the peaks were sufficiently far apart, and thus distinguishable (Figure 4).
The quality of the separated speech utterances was evaluated using BSS_Eval [102] and PEASS [103], and their intelligibility was measured using Perceptual Evaluation of Speech Quality (PESQ) [104] scores. The Ideal Binary Mask (IBM) is considered the ultimate goal of Computational Auditory Scene Analysis (CASA) [105]. The target source is judged to be present at TF points where it dominates its interference; consequently, it is preserved. In the regions where the source has a weaker presence than the interference, it is discarded. To evaluate the quality of recovered utterances, the two proposed methods were benchmarked against the IBM (https://github.com/IoSR-Surrey/MatlabToolbox, accessed on 15 April 2025) [106], which is assumed to give the best achievable separation quality. Given a sampling rate of $F_s = 16$ kHz, in a noiseless scenario, the two benchmark techniques are limited as follows: DUET estimates delays of up to 7 samples, while AdRess was designed for instantaneous scenarios. This is captured by the poor performance of DUET and AdRess for large relative delays in Figure 5.
Speech enhancement and demixing algorithms employ various analysis and synthesis techniques, in the form of spectral energy and phase manipulation, that introduce distortions, noise, and artifacts, typically called filtering errors [107]. In general, the measurements tend to decrease as the number of sources in the mixture increases, primarily due to the higher probability of sources overlapping in the TF bins. The estimate of the target source, $\hat{s}_j[n]$, can be decomposed into the sum of four separate components,
$$\hat{s}_j[n] = s_j[n] + e_j^{\mathrm{interference}}[n] + e_j^{\mathrm{noise}}[n] + e_j^{\mathrm{artifact}}[n],$$
where $s_j[n]$ is the clean speech utterance [102]. The Source-to-Distortion Ratio (SDR) separation quality is defined as
$$\mathrm{SDR} = \frac{\left\| s_j[n] \right\|^2}{\left\| e_j^{\mathrm{interference}}[n] + e_j^{\mathrm{noise}}[n] + e_j^{\mathrm{artifact}}[n] \right\|^2}.$$
It gives a measure of how well the target utterance has been recovered. The SIR is defined as
$$\mathrm{SIR} = \frac{\left\| s_j[n] \right\|^2}{\left\| e_j^{\mathrm{interference}}[n] \right\|^2}.$$
It computes a measure of how much energy leaks from the other interfering sources, or how loudly they can be heard in the recovered speech utterance. The Source-to-Artifact Ratio (SAR) is given as
$$\mathrm{SAR} = \frac{\left\| s_j[n] + e_j^{\mathrm{interference}}[n] + e_j^{\mathrm{noise}}[n] \right\|^2}{\left\| e_j^{\mathrm{artifact}}[n] \right\|^2}.$$
This measure attempts to quantify burbling sounds, also referred to as musical noise, in the reconstructed separated utterances.

Figure 5 illustrates that Elevato-DUET and Elevato-AdRess can function effectively with relative delays of up to 200 samples. This capability arises from combining the Elevatogram technique, which was contributed in [5], with the classical DUET [64] and AdRess [65] methods, enhancing their performance in scenarios where the acoustic sources experience significant relative delays. The figure further highlights that Elevato-AdRess outperforms Elevato-DUET in terms of the SDR, SIR, and SAR separation quality metrics.
As shown in Figure 5a, Elevato-AdRess gives better results than Elevato-DUET by a mean reconstruction SDR gain of 2.2 dB for relative delays of up to 120 samples. Beyond this point, a performance decrease is observed due to an increase in the Mean Absolute Error (MAE), $|\delta_j - \hat{\delta}_j|$, between the true delay, $\delta_j$, and its estimated value, $\hat{\delta}_j$, which is calculated by the Elevatogram [5]. As can be seen in Figure 5b, the mean reconstruction SIR of Elevato-AdRess exceeds that of Elevato-DUET by 4.3 dB for a relative delay of $\delta_j = 100$ samples. Figure 5c shows that the difference in mean reconstruction SAR between the two algorithms is small. It lies within 1.5 dB for relative delays of up to 120 samples. In general, Elevato-AdRess improves the separation quality, in terms of the SDR, SIR, and SAR, compared with the utterances recovered by Elevato-DUET.
Figure 6 illustrates the boxplots of the measurements obtained for each mixture category. The categories correspond to 2-source, 3-source, and 4-source mixtures. The PEASS toolbox [103] estimates the distortion, which is considered to be the error between the estimated source, $\hat{s}_j[n]$, and the clean speech, $s_j[n]$, given as
$$\hat{s}_j[n] - s_j[n] = e_j^{\mathrm{target}}[n] + e_j^{\mathrm{interference}}[n] + e_j^{\mathrm{artifact}}[n].$$
The target distortion, $e_j^{\mathrm{target}}[n]$, indicates how much distortion (both attenuation and amplification distortions [108]) is added to the true target source during the filtering process. The measure of interference-related source energies present in the recovered utterance is denoted by $e_j^{\mathrm{interference}}[n]$, and $e_j^{\mathrm{artifact}}[n]$ is the measure of musical noise that degrades the demixed speech. Each recovered utterance, $\hat{s}_j[n]$, is evaluated relative to the reference. Each perceptual score is on a scale that ranges from 0 to 100 units, with higher scores indicating better quality of separation. As can be seen in Figure 6, Elevato-AdRess achieves better Overall Perceptual Score (OPS), Target-related Perceptual Score (TPS), and Interference-related Perceptual Score (IPS) performance than Elevato-DUET.
Figure 6a shows that Elevato-AdRess outperforms Elevato-DUET in terms of the average OPS score, with a mean reconstruction improvement of 2 units for mixtures with two sources. This difference becomes more pronounced as the number of speech utterances in the mixtures increases, reaching a difference of 7 units for mixtures with four sources. Figure 6b highlights that reduced distortions are experienced in the estimated utterances recovered by Elevato-AdRess. This is reflected by the average TPS scores, where Elevato-AdRess exceeds Elevato-DUET by 8 units in the case of two sources. For mixtures with four sources, the TPS score difference increases to 24 units.
In Figure 6c, reduced energy leakage from interfering sources is observed in the estimates demixed by Elevato-AdRess compared to Elevato-DUET. Elevato-AdRess outperforms Elevato-DUET by a score of 18 units in the two-source case and by 12 units when four sources are present in the mixture.
The highest PEASS metric achieved by Elevato-AdRess resulted from reduced interference in the recovered utterance. The average IPS score exceeded 85 units and remained relatively stable, even as the number of constituent sources in the mixture increased, in contrast to Elevato-DUET, which showed a notable decline under similar conditions. In terms of artifact-related noise, Figure 6d reveals that Elevato-DUET achieves a higher Artifact-related Perceptual Score (APS), indicating fewer artifacts compared to Elevato-AdRess. Specifically, Elevato-AdRess is outperformed by Elevato-DUET by an average score of 3 units in the two-source scenario. This difference increases slightly to 5 units in the case of four sources. This result aligns with the higher SNR observed for Elevato-DUET, as shown in Figure 7. A reasonable assumption is that discarding the overlapping frequency bin energies, the approach taken by Elevato-AdRess, introduces significant distortion and artifact noise in the separated utterances. SNRs are typically considered high if they are greater than 5 dB, and speech is completely unintelligible if the SNR is less than 0 dB, according to [3]. In general, the SNRs of the separated speech utterances decrease with an increase in the number of constituent sources in the mixture, where the SNR between the clean utterance, $s_j[n]$, and the recovered utterance, $\hat{s}_j[n]$, is defined as
$$\mathrm{SNR} = 10 \log_{10} \frac{\| s_j[n] \|_2^2}{\| s_j[n] - \hat{s}_j[n] \|_2^2}.$$
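This per-utterance SNR is a one-line computation; a Python sketch, assuming the clean and recovered signals are time-aligned and of equal length:

```python
import numpy as np

def snr_db(s, s_hat):
    """SNR in dB between a clean utterance s and its recovery s_hat."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))
```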
In Figure 7, it can be observed that Elevato-DUET improved the SNR performance compared to Elevato-AdRess by 4.9, 4.5, and 4.13 dB for the two-, three-, and four-source mixture cases, respectively.
Higher-quality speech-utterance separation does not necessarily mean that high intelligibility is achieved [108,109]. A good speech demixing and enhancement algorithm must preserve intelligibility while improving the quality of separation. To evaluate intelligibility, the PESQ Mean Opinion Score (MOS) [104] intelligibility assessment is used. These scores range from 0.5 to 4.5 units, with scores below 1 unit indicating unintelligible speech and higher scores reflecting better intelligibility. Comparing an utterance with itself gives a MOS score of ≥4 units. Figure 8 quantifies the intelligibility of the demixed speech utterances. For a mixture of two sources, $J = 2$, Elevato-AdRess achieves a NarrowBand (NB) MOS score 1.4 units higher than Elevato-DUET. This difference increases to 1.73 units in a four-source scenario, $J = 4$, as shown in Figure 8 (left). In this scenario, Elevato-AdRess achieves better performance than the IBM by 0.3 units. Figure 8 (middle) illustrates that, in terms of the Listening Quality Objective (LQO) score, Elevato-AdRess gives better performance than Elevato-DUET and the IBM by 0.57 units for the two-source case. This improvement rises to 1.1 units when there are $J = 4$ sources in the mixtures. Finally, Figure 8 (right) highlights the WideBand (WB) MOS scores, which extend the NB evaluation to account for broader frequency ranges, further demonstrating the superior performance of Elevato-AdRess.
Note that the separation quality achieved by Elevato-AdRess is largely preserved as the number of constituent sources in the speech mixtures increases from two to four. This demonstrates the robustness of Elevato-AdRess: it remains relatively unaffected by the increase in the number of sources in the mixture.

6. Discussion and Future Work

This paper introduced a reformulation of the AdRess demixing technique (itself an extension of the classical DUET algorithm), which was originally developed for the pan-pot mixing model, to address target speaker recovery in anechoic, competing-speaker scenarios characterized by relative attenuations and delays. A stereo mixture provides spatial cues that help identify a signal propagating from a specific direction. Using this information, we proposed the use of a remote microphone to isolate the speaker of interest and directed the separated audio to a smart HA, enhancing speech accessibility for individuals with hearing impairments.
The main contribution of this paper lies in extending two source demixing algorithms—DUET and AdRess—by incorporating a large relative delay estimation technique called the Elevatogram, which was introduced in [5], with the derived TF masks. The masks developed use both spectral and spatial information and can function in anechoic scenarios. The two proposed techniques, Elevato-DUET and Elevato-AdRess, outperformed the benchmark techniques in scenarios where the speech source suffers large relative delays. In the experiments, the speech utterances were mixed in an anechoic model with different attenuations and relative delays. Care was taken to ensure distinct spatial signatures—relative attenuations and delays—for each source, minimizing proximity effects that could hinder separation performance. In both proposed TF masks, the spectral cue depends on the spatial cue. The step concerning accurate delay subtraction helped us to localize the spectral energy of the target source on the frequency–azimuth plane. It is essential to estimate the relative delay accurately for the proposed techniques to function. Inaccurate spatial cues will localize the spectral energies at erroneous positions on the gain axis in Figure 3a. In Elevato-DUET, it is important to maintain the quality of the TF mask. Unlike Elevato-AdRess, which focuses solely on localizing the IID cues of the target speaker, Elevato-DUET requires estimating the attenuation–delay pairs for all constituent sources. The TF mask, in Equation (37), is then computed by iterating over these pairs to derive the best mask to isolate the target source. As demonstrated in [9], the failure to find a good mask results in lower separation quality. In short, these results underline the importance of accurate relative delay estimation when the separation between sensors is large in AL environments, and provide additional evidence that the Elevatogram [5] provides accurate large relative delay estimates.
In summary, in terms of speech separation quality, as measured by BSS_Eval metrics (SDR, SAR, and SIR) and PEASS scores (OPS, TPS, IPS, and APS), as well as intelligibility assessed by PESQ, Elevato-AdRess demonstrated superior performance compared to Elevato-DUET. The Elevato-AdRess technique focuses on isolating the components of the frequency–azimuth matrix where the IID spectral cues of the target speaker are localized, under the constraint that discrete frequencies must be non-overlapping. TF bins where the spectral energies overlap on the frequency–azimuth plane, in Figure 3c, are excluded during the extraction process, leading to a loss of information that introduces significant distortion and affects the SDR and SNR scores of recovered speech utterances. The localization of spectral energies across $M = 1000$ discrete azimuth levels offers the advantage of reducing interference in the recovered utterance compared to Elevato-DUET, but this comes at the cost of a significant increase in MATLAB computation time. Elevato-DUET is computationally efficient compared to Elevato-AdRess. One of the key motivations behind the interest in Elevato-AdRess is that its computational time can be reduced, since it can be computed in parallel, making it suitable for real-time implementation. Computing the WDO and PSR measures for Elevato-AdRess-style masking is a challenge. This is because AdRess-based masking uses a brute-force method of explicitly localizing the non-overlapping spectral energies of a speech utterance on the frequency–azimuth plane. It does not produce a matrix mask, $M[k,\tau] \in \{0, 1\}$, in the traditional DUET sense. Elevato-AdRess and Elevato-DUET perform well in noiseless anechoic scenarios. Future work will focus on investigating how the quality of separation in Elevato-DUET deteriorates with an increasing presence of overlapping TF bins in the mixture, by increasing the number of speakers in the AL environment. The Elevato-AdRess and Elevato-DUET algorithms demonstrate improvements in intelligibility by effectively addressing the phase wraparound issue. However, aggressive source separation may introduce audible artifacts and spectral distortions. We will explore adaptive post-processing techniques to minimize these distortions, thereby enhancing perceived speech quality. For example, in the context of music source separation, once the AdRess algorithm has been applied to determine the stems in the original mixture—individual component tracks of a mixed audio recording—some of the mixture is added back into the demixed stems to improve the perceptual quality of the resulting sources.

7. Conclusions

A source demixing technique originally designed to separate music sources mixed using the pan-pot model was reformulated to operate in an anechoic scenario characterized by relative attenuations and large delays. The primary objective was to isolate a target speaker of interest, using multiple microphones, from a mixture of concurrently active voices in a competing acoustic environment, and to transmit the isolated speech to a hearing aid worn by a hearing-impaired patient in an AL environment. A significant challenge in real-world scenarios is phase wraparound, which arises when the distance between a pair of microphones exceeds the smallest allowable wavelength of the acoustic wave. This severely corrupts the higher frequencies, rendering traditional separation algorithms ineffective. Demixing algorithms function as filtering techniques involving both analysis and synthesis processes, which may introduce distortions and artifacts into the synthesized utterance.
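A back-of-the-envelope sketch makes the scale of the problem explicit. Assuming plane-wave propagation at roughly $c = 343$ m/s (the numbers below are illustrative, not the exact experimental geometry), a spacing of $d$ meters produces a relative delay of about $F_s d / c$ samples, and the inter-channel phase $\omega \delta$ is unambiguous only below $F_s / (2\delta)$ Hz:

```python
FS = 16_000   # sampling rate (Hz)
C = 343.0     # speed of sound (m/s), room temperature

def wraparound_limit(d_meters, fs=FS, c=C):
    """Relative delay (samples) for a worst-case end-fire source, and the
    highest frequency whose inter-channel phase stays within +/- pi."""
    delay = fs * d_meters / c      # extra travel time, in samples
    f_max = fs / (2.0 * delay)     # |omega * delta| <= pi  =>  f <= fs/(2*delta)
    return delay, f_max

# An inter-sensor gap of ~4.3 m already yields a ~200-sample delay,
# and the phase wraps for every frequency above roughly 40 Hz.
print(wraparound_limit(4.3))
```

At the 200-sample delays considered in this paper, essentially the entire speech band is wrapped, which is why a dedicated estimator such as the Elevatogram is required.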
Building on the work in [4,5,9], this paper proposed a speech IS scheme utilizing a remote microphone (RM) in a near-field microphone setup. The RM-aided IS scheme was designed to enhance the intelligibility of a distant target speaker. Specifically, this paper investigated the feasibility of extending two classical source demixing algorithms, DUET and AdRess, by incorporating the large delay estimation technique known as the Elevatogram [5]. In general, phase wraparound hinders accurate localization of a source in space; the Elevatogram technique in [5] overcomes this challenge. The enhanced versions, named Elevato-DUET and Elevato-AdRess, are spatio-spectral masking techniques in which the IID cue depends on the accurate estimation of the IPD cue, making the two cues interrelated, and they are designed to operate effectively in scenarios with substantial relative delays of up to 200 samples. This advancement holds significant potential for applications in hearing-assisted living scenarios. Through the derived masks of these algorithms, we demonstrated their capability to perform robustly under large relative delay conditions. In conclusion, Elevato-AdRess demonstrated superior performance compared to Elevato-DUET in terms of speech separation quality, exhibiting fewer distortions, reduced interference, and enhanced intelligibility for human listeners. However, by discarding overlapping frequencies, the Elevato-AdRess technique introduced significant artifact noise into the recovered utterance, resulting in a reduced SNR for the extracted source.
Finally, the correlation between the BSS_Eval and PEASS separation quality metrics was demonstrated. Specifically, the SDR exhibited a strong correlation with the OPS, which evaluates the overall quality of the separated utterance, and the SIR showed a significant correlation with the interference-related IPS metric. Poor separation quality produced low scores in both families of objective measures, while the SAR of BSS_Eval demonstrated little correlation with the APS of PEASS. Elevato-AdRess demonstrated better interference reduction in the recovered signal than Elevato-DUET, and it remained stable as the number of constituent sources in the mixture increased. Accurate estimation of the relative delay was crucial for the operation of Elevato-AdRess, because erroneous estimates rendered the application of the mask unsuccessful: the MAE, $|\delta_j - \hat{\delta}_j|$, had to be small, and an estimation error of more than 1 sample rendered Elevato-AdRess ineffective. From the perspective of human auditory perception, Elevato-AdRess demonstrated better performance in terms of the PESQ MOS intelligibility score compared to Elevato-DUET.
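The reported metric correlations can be checked with a one-line computation. A minimal sketch, assuming per-trial score vectors exported from the BSS_Eval and PEASS toolkits (the variable names are placeholders):

```python
import numpy as np

def metric_correlation(bss_metric, peass_metric):
    """Pearson correlation between one BSS_Eval metric (e.g., per-trial
    SDR values) and one PEASS metric (e.g., per-trial OPS values)."""
    return np.corrcoef(bss_metric, peass_metric)[0, 1]

# Per the findings above, metric_correlation(sdr_per_trial, ops_per_trial)
# would be expected to be high, while metric_correlation(sar_per_trial,
# aps_per_trial) would be expected to be low.
```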

Author Contributions

S.B.: Conceptualization; Methodology; Software; Validation; Investigation; Resources; Data curation; Writing—original draft preparation; Visualization. R.d.F.: Conceptualization; Methodology; Software; Validation; Investigation; Resources; Data curation; Writing—original draft preparation; Visualization; Writing—review and editing; Project administration; Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This publication has emanated from research conducted with the financial support of Research Ireland under Grant numbers 18/CRT/6222, 15/SIRG/3459 and 13/RC/2077_P2. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

Data Availability Statement

The original data presented in this study are openly available in a publicly accessible repository at https://catalog.ldc.upenn.edu/LDC93S1 (accessed on 15 April 2025) [101].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Hough Transform as a Log-Likelihood

Consider a data set of $N$ collinear points in the image feature space in Figure 2a,

$\mathbf{x} = \left[ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \right]^T.$
Given this set of points, we consider the problem of maximizing the likelihood function with respect to the parameter pair, $[\rho, \phi]$. Assuming an i.i.d. Gaussian noise model with variance $\sigma^2$, the likelihood function is given by:

$p\left(\mathbf{x} \mid [\rho, \phi]\right) = \mathcal{L}\left(\mathbf{x} \mid (\rho, \phi)\right) = \prod_{i=1}^{N} p\left((x_i, y_i) \mid (\rho, \phi)\right) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\left(\rho - x_i \cos(\phi) - y_i \sin(\phi)\right)^2}{2\sigma^2}\right).$
The log-likelihood, $\log \mathcal{L}\left(\mathbf{x} \mid (\rho, \phi)\right)$, is a more convenient form to work with. The voting procedure maximizes the log-likelihood function; the MLE,

$[\hat{\rho}, \hat{\phi}] = \operatorname*{argmax}_{[\rho, \phi]} \log \mathcal{L}\left(\mathbf{x} \mid (\rho, \phi)\right),$

corresponds to the accumulator cell that receives the highest vote.
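A minimal sketch of the voting procedure described above, assuming a uniformly discretized $(\rho, \phi)$ accumulator; the grid resolutions are illustrative, not those used in the paper:

```python
import numpy as np

def hough_mle(points, n_phi=180, n_rho=200):
    """Estimate the line parameters (rho, phi) by accumulator voting.
    Each point votes along rho = x*cos(phi) + y*sin(phi); under the
    i.i.d. Gaussian noise model above, the fullest cell is the MLE.

    points : array of shape (N, 2) holding the (x_i, y_i) pairs
    """
    phis = np.linspace(0.0, np.pi, n_phi, endpoint=False)
    rho_bound = np.abs(points).sum(axis=1).max()     # |rho| <= |x| + |y|
    rho_edges = np.linspace(-rho_bound, rho_bound, n_rho + 1)
    acc = np.zeros((n_rho, n_phi), dtype=int)
    for x, y in points:
        rho = x * np.cos(phis) + y * np.sin(phis)    # one vote per phi column
        idx = np.digitize(rho, rho_edges) - 1
        ok = (idx >= 0) & (idx < n_rho)
        acc[idx[ok], np.arange(n_phi)[ok]] += 1
    r, p = np.unravel_index(np.argmax(acc), acc.shape)
    return 0.5 * (rho_edges[r] + rho_edges[r + 1]), phis[p]
```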

References

  1. Cano, E.; Lukashevich, H. Selective Hearing: A Machine Listening Perspective. In Proceedings of the 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), Kuala Lumpur, Malaysia, 27–29 September 2019; pp. 1–6. [Google Scholar]
  2. Chhetri, S.; Joshi, M.S.; Mahamuni, C.V.; Sangeetha, R.N.; Roy, T. Speech Enhancement: A Survey of Approaches and Applications. In Proceedings of the 2023 2nd International Conference on Edge Computing and Applications (ICECAA), Namakkal, India, 19–21 July 2023; pp. 848–856. [Google Scholar]
  3. Loizou, P.C. Speech Enhancement: Theory and Practice; CRC Press: Boca Raton, FL, USA, 2007. [Google Scholar]
  4. Bagchi, S.; de Fréin, R. Elevato-CDR: Speech Enhancement in Large Delay and Reverberant Assisted Living Scenarios. In Proceedings of the 2024 9th International Conference on Frontiers of Signal Processing (ICFSP), Paris, France, 12–14 September 2024; pp. 153–157. [Google Scholar] [CrossRef]
  5. de Fréin, R. Tiled time delay estimation in mobile cloud computing environments. In Proceedings of the IEEE ISSPIT, Bilbao, Spain, 18–20 December 2017; pp. 282–287. [Google Scholar]
  6. Prätzlich, T.; Bittner, R.M.; Liutkus, A.; Müller, M. Kernel additive modeling for interference reduction in multi-channel music recordings. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 584–588. [Google Scholar]
  7. Cano, E.; Nowak, J.; Grollmisch, S. Exploring sound source separation for acoustic condition monitoring in industrial scenarios. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 2264–2268. [Google Scholar]
  8. Bagchi, S.; de Fréin, R. Evaluating large delay estimation techniques for assisted living environments. Electron. Lett. 2022, 58, 846–849. [Google Scholar] [CrossRef]
  9. Bagchi, S.; de Fréin, R. Anechoic Demixing under Phase Wraparound Conditions in Assisted Living Environments. In Proceedings of the 2024 35th Irish Signals and Systems Conference (ISSC), Belfast, UK, 13–14 June 2024; pp. 1–6. [Google Scholar]
  10. Corey, R.M.; Singer, A.C. Adaptive binaural filtering for a multiple-talker listening system using remote and on-ear microphones. In Proceedings of the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 17–20 October 2021; pp. 1–5. [Google Scholar]
  11. Sathyapriyan, V.; Pedersen, M.S.; Brookes, M.; Østergaard, J.; Naylor, P.A.; Jensen, J. Speech Enhancement in Hearing Aids Using Target Speech Presence Estimation Based on a Delayed Remote Microphone Signal. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1006–1010. [Google Scholar]
  12. Zhu, J.; Wang, D.; Zhao, Y. Design of smart home environment based on wireless sensor system and artificial speech recognition. Meas. Sens. 2024, 33, 101090. [Google Scholar] [CrossRef]
  13. Chumuang, N.; Ganokratanaa, T.; Pramkeaw, P.; Ketcham, M.; Chomchaiya, S.; Yimyam, W. Voice-activated assistance for the elderly: Integrating speech recognition and iot. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2024; pp. 1–4. [Google Scholar]
  14. Supriya, N.; Surya, S.; Kiran, K. Voice Controlled Smart Home for Disabled. In Proceedings of the 2024 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE), Bangalore, India, 24–25 January 2024; pp. 1–4. [Google Scholar]
  15. Latif, S.; Qadir, J.; Qayyum, A.; Usama, M.; Younis, S. Speech technology for healthcare: Opportunities, challenges, and state of the art. IEEE Rev. Biomed. Eng. 2020, 14, 342–356. [Google Scholar] [CrossRef]
  16. Deepa, P.; Khilar, R. Speech technology in healthcare. Meas. Sens. 2022, 24, 100565. [Google Scholar] [CrossRef]
  17. Gullapalli, A.S.; Mittal, V.K. Early detection of Parkinson’s disease through speech features and machine learning: A review. In ICT with Intelligent Applications: Proceedings of ICTIS 2021; Springer: Singapore, 2021; Volume 1, pp. 203–212. [Google Scholar]
  18. Das, N.; Chakraborty, S.; Chaki, J.; Padhy, N.; Dey, N. Fundamentals, present and future perspectives of speech enhancement. Int. J. Speech Technol. 2021, 24, 883–901. [Google Scholar] [CrossRef]
  19. Yu, D.; Deng, L. Automatic Speech Recognition; Springer: Berlin/Heidelberg, Germany, 2016; Volume 1. [Google Scholar]
  20. Wang, D. Time-frequency masking for speech separation and its potential for hearing aid design. Trends Amplif. 2008, 12, 332–353. [Google Scholar] [CrossRef]
  21. Esra, J.S.; Sukhi, Y. Speech Separation Methodology for Hearing Aid. Comput. Syst. Sci. Eng. 2023, 44, 1659–1678. [Google Scholar] [CrossRef]
  22. Wang, D. Deep learning reinvents the hearing aid. IEEE Spectr. 2017, 54, 32–37. [Google Scholar] [CrossRef]
  23. Mecklenburger, J.; Groth, T. Wireless technologies and hearing aid connectivity. In Hearing Aids; Springer: Cham, Switzerland, 2016; pp. 131–149. [Google Scholar]
  24. Mangharam, R.; Rowe, A.; Rajkumar, R.; Suzuki, R. Voice over sensor networks. In Proceedings of the 2006 27th IEEE International Real-Time Systems Symposium (RTSS’06), Rio de Janeiro, Brazil, 5–8 December 2006; pp. 291–302. [Google Scholar]
  25. Mouhassine, N.; Moughit, M.; Laassiri, F. Improving the quality of service of voice over IP in wireless sensor networks by centralizing handover management and authentication using the SDN controller. In Proceedings of the 2019 Third International Conference on Intelligent Computing in Data Sciences (ICDS), Marrakech, Morocco, 28–30 October 2019; pp. 1–6. [Google Scholar]
  26. Yang, C. Design of smart home control system based on wireless voice sensor. J. Sens. 2021, 2021, 8254478. [Google Scholar] [CrossRef]
  27. Mathur, R.; Dubey, T.K. Security-Focused Mathematical Model for Voice Over Wireless Sensor Network. In Intelligent Computing Techniques for Smart Energy Systems: Proceedings of ICTSES 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 681–688. [Google Scholar]
  28. Brendel, A.; Kellermann, W. Distributed source localization in acoustic sensor networks using the coherent-to-diffuse power ratio. IEEE J. Sel. Top. Signal Process. 2019, 13, 61–75. [Google Scholar] [CrossRef]
  29. Ferrer, M.; de Diego, M.; Piñero, G.; Gonzalez, A. Affine projection algorithm over acoustic sensor networks for active noise control. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 448–461. [Google Scholar] [CrossRef]
  30. Richard, G.; Smaragdis, P.; Gannot, S.; Naylor, P.A.; Makino, S.; Kellermann, W.; Sugiyama, A. Audio signal processing in the 21st century: The important outcomes of the past 25 years. IEEE Signal Process. Mag. 2023, 40, 12–26. [Google Scholar] [CrossRef]
  31. Hu, D.; Si, Q.; Liu, R.; Bao, F. Distributed sensor selection for speech enhancement with acoustic sensor networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 985–999. [Google Scholar] [CrossRef]
  32. Kim, J.S.; Kim, C.H. A review of assistive listening device and digital wireless technology for hearing instruments. Korean J. Audiol. 2014, 18, 105. [Google Scholar] [CrossRef]
  33. Zallio, M.; Ohashi, T. The evolution of assistive technology: A literature review of technology developments and applications. Hum. Factors Access. Assist. Technol. 2022, 37, 85. [Google Scholar]
  34. Kellermann, W.; Martin, R.; Ono, N. Signal processing and machine learning for speech and audio in acoustic sensor networks. EURASIP J. Audio Speech Music Process. 2023, 2023, 54. [Google Scholar] [CrossRef]
  35. Plazak, J.; Kersten-Oertel, M. A Survey on the Affordances of “Hearables”. Inventions 2018, 3, 48. [Google Scholar] [CrossRef]
  36. Bertrand, A. Applications and trends in wireless acoustic sensor networks: A signal processing perspective. In Proceedings of the 2011 18th IEEE Symposium on Communications and Vehicular Technology in the Benelux (SCVT), Ghent, Belgium, 22–23 November 2011; pp. 1–6. [Google Scholar]
  37. Wagener, K.C.; Vormann, M.; Latzel, M.; Mülder, H.E. Effect of hearing aid directionality and remote microphone on speech intelligibility in complex listening situations. Trends Hear. 2018, 22, 2331216518804945. [Google Scholar] [CrossRef]
  38. Courtois, G.A. Spatial Hearing Rendering in Wireless Microphone Systems for Binaural Hearing Aids; Technical Report; EPFL: Lausanne, Switzerland, 2016. [Google Scholar]
  39. Stone, M.A.; Lough, M.; Wilbraham, K.; Whiston, H.; Dillon, H. Toward a Real-World Technical Test Battery for Remote Microphone Systems Used with Hearing Prostheses. Trends Hear. 2023, 27, 23312165231182518. [Google Scholar] [CrossRef]
  40. Szurley, J.; Bertrand, A.; Van Dijk, B.; Moonen, M. Binaural noise cue preservation in a binaural noise reduction system with a remote microphone signal. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 952–966. [Google Scholar] [CrossRef]
  41. Corey, R.M.; Singer, A.C. Immersive Enhancement and Removal of Loudspeaker Sound Using Wireless Assistive Listening Systems and Binaural Hearing Devices. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–2. [Google Scholar]
  42. Kumatani, K.; McDonough, J.; Raj, B. Microphone array processing for distant speech recognition: From close-talking microphones to far-field sensors. IEEE Signal Process. Mag. 2012, 29, 127–140. [Google Scholar] [CrossRef]
  43. Cobos, M.; Antonacci, F.; Alexandridis, A.; Mouchtaris, A.; Lee, B. A survey of sound source localization methods in wireless acoustic sensor networks. Wirel. Commun. Mob. Comput. 2017, 2017, 3956282. [Google Scholar] [CrossRef]
  44. Pertilä, P.; Fagerlund, E.; Huttunen, A.; Myllylä, V. Online own voice detection for a multi-channel multi-sensor in-ear device. IEEE Sens. J. 2021, 21, 27686–27697. [Google Scholar] [CrossRef]
  45. Pertilä, P. Online blind speech separation using multiple acoustic speaker tracking and time–frequency masking. Comput. Speech Lang. 2013, 27, 683–702. [Google Scholar] [CrossRef]
  46. Xu, A.; Choudhury, R.R. Learning to separate voices by spatial regions. arXiv 2022, arXiv:2207.04203. [Google Scholar]
  47. Moore, B.C. An Introduction to the Psychology of Hearing; Brill: Leiden, The Netherlands, 2012. [Google Scholar]
  48. de Fréin, R.; Rickard, S.T. The Synchronized Short-Time-Fourier-Transform: Properties and Definitions for Multichannel Source Separation. IEEE Trans. Signal Process. 2011, 59, 91–103. [Google Scholar] [CrossRef]
  49. Duong, N.Q.; Vincent, E.; Gribonval, R. Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 1830–1840. [Google Scholar] [CrossRef]
  50. Araki, S.; Mukai, R.; Makino, S.; Nishikawa, T.; Saruwatari, H. The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech. IEEE Trans. Speech Audio Process. 2003, 11, 109–116. [Google Scholar] [CrossRef]
  51. Sawada, H.; Araki, S.; Mukai, R.; Makino, S. Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1592–1604. [Google Scholar] [CrossRef]
  52. Gannot, S.; Cohen, I. Adaptive beamforming and postfiltering. In Springer Handbook of Speech Processing; Springer: Berlin/Heidelberg, Germany, 2008; pp. 945–978. [Google Scholar]
  53. Gannot, S.; Vincent, E.; Markovich-Golan, S.; Ozerov, A. A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 692–730. [Google Scholar] [CrossRef]
  54. Yang, J.; Guo, Y.; Yang, Z.; Xie, S. Under-determined convolutive blind source separation combining density-based clustering and sparse reconstruction in time-frequency domain. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 66, 3015–3027. [Google Scholar] [CrossRef]
  55. Knapp, C.; Carter, G. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 320–327. [Google Scholar] [CrossRef]
  56. Grinstein, E.; Tengan, E.; Çakmak, B.; Dietzen, T.; Nunes, L.; van Waterschoot, T.; Brookes, M.; Naylor, P.A. Steered Response Power for Sound Source Localization: A tutorial review. EURASIP J. Audio Speech Music Process. 2024, 2024, 59. [Google Scholar] [CrossRef]
  57. Belouchrani, A.; Amin, M.G. Time-frequency MUSIC. IEEE Signal Process. Lett. 1999, 6, 109–110. [Google Scholar] [CrossRef]
  58. Bu, S.; Zhao, T.; Zhao, Y. TDOA Estimation of Speech Source in Noisy Reverberant Environments. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; pp. 1059–1066. [Google Scholar]
  59. Cobos, M.; Antonacci, F.; Comanducci, L.; Sarti, A. Frequency-sliding generalized cross-correlation: A sub-band time delay estimation approach. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1270–1281. [Google Scholar] [CrossRef]
  60. Nesta, F.; Svaizer, P.; Omologo, M. Cumulative state coherence transform for a robust two-channel multiple source localization. In Proceedings of the Independent Component Analysis and Signal Separation: 8th International Conference, ICA 2009, Paraty, Brazil, 15–18 March 2009; pp. 290–297. [Google Scholar]
  61. Blandin, C.; Ozerov, A.; Vincent, E. Multi-source TDOA estimation in reverberant audio using angular spectra and clustering. Signal Process. 2012, 92, 1950–1960. [Google Scholar] [CrossRef]
  62. Chaudhari, A.; Dhonde, S. A review on speech enhancement techniques. In Proceedings of the 2015 International Conference on Pervasive Computing (ICPC), Pune, India, 8–10 January 2015; pp. 1–3. [Google Scholar]
  63. Kumatani, K.; Raj, B.; Singh, R.; McDonough, J. Microphone array post-filter based on spatially-correlated noise measurements for distant speech recognition. In Proceedings of the Interspeech 2012, Portland, OR, USA, 9–13 September 2012; pp. 298–301. [Google Scholar] [CrossRef]
  64. Rickard, S. The DUET blind source separation algorithm. In Blind Speech Separation; Springer: Berlin/Heidelberg, Germany, 2007; pp. 217–241. [Google Scholar]
  65. Barry, D.; Lawlor, B.; Coyle, E. Sound Source Separation: Azimuth Discrimination and Resynthesis. In Proceedings of the 7th International Conference on Digital Audio Effects, DAFX 04, Montréal, QC, Canada, 5–8 October 2004. [Google Scholar]
  66. Boldt, J. Binary Masking & Speech Intelligibility. Ph.D. Thesis, Aalborg University, Aalborg, Denmark, 2010. Available online: https://vbn.aau.dk/en/publications/binary-masking-amp-speech-intelligibility (accessed on 18 February 2025).
  67. Yilmaz, O.; Rickard, S. Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. 2004, 52, 1830–1847. [Google Scholar] [CrossRef]
  68. Rickard, S. Sparse sources are separated sources. In Proceedings of the 2006 14th European Signal Processing Conference, Florence, Italy, 4–8 September 2006; pp. 1–5. [Google Scholar]
  69. Rickard, S.; Yilmaz, O. On the approximate W-disjoint orthogonality of speech. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA, 13–17 May 2002; Volume 1, p. I-529. [Google Scholar]
  70. Rafii, Z.; Liutkus, A.; Stöter, F.R.; Mimilakis, S.I.; FitzGerald, D.; Pardo, B. An overview of lead and accompaniment separation in music. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1307–1335. [Google Scholar] [CrossRef]
  71. de Fréin, R.; Rickard, S.T. Power-Weighted Divergences for Relative Attenuation and Delay Estimation. IEEE Sig. Proc. Let. 2016, 23, 1612–1616. [Google Scholar] [CrossRef]
  72. Oppenheim, A.V. Discrete-Time Signal Processing; Pearson Education India: Delhi, India, 1999. [Google Scholar]
  73. Mandel, M.I. Binaural Model-Based Source Separation and Localization; Columbia University: New York, NY, USA, 2010. [Google Scholar]
  74. Ban, Y.; Alameda-Pineda, X.; Evers, C.; Horaud, R. Tracking multiple audio sources with the von mises distribution and variational em. IEEE Signal Process. Lett. 2019, 26, 798–802. [Google Scholar] [CrossRef]
  75. Mandel, M.I.; Weiss, R.J.; Ellis, D.P. Model-based expectation-maximization source separation and localization. IEEE Trans. Audio Speech Lang. Process. 2009, 18, 382–394. [Google Scholar] [CrossRef]
  76. Duda, R.O.; Hart, P.E. Use of the Hough transformation to detect lines and curves in pictures. Comm. ACM 1972, 15, 11–15. [Google Scholar] [CrossRef]
  77. Roy, R.H., III; Kailath, T. ESPRIT-Estimation of Signal Parameters via Rotational Invariance Techniques. Opt. Eng. 1990, 29, 296–313. [Google Scholar]
  78. Abrard, F.; Deville, Y. A time–frequency blind signal separation method applicable to underdetermined mixtures of dependent sources. Signal Process. 2005, 85, 1389–1403. [Google Scholar] [CrossRef]
  79. Arberet, S.; Gribonval, R.; Bimbot, F. A robust method to count and locate audio sources in a stereophonic linear instantaneous mixture. In Proceedings of the International Conference on Independent Component Analysis and Signal Separation, Charleston, SC, USA, 5–8 March 2006; pp. 536–543. [Google Scholar]
  80. Melia, T.; Rickard, S.; Fearon, C. Histogram-based Blind Source Separation of more sources than sensors using a DUET-ESPRIT technique. In Proceedings of the 13th EUSIPCO, New Paltz, NY, USA, 16–19 October 2005; pp. 1–4. [Google Scholar]
  81. Hyvarinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Tran. Neural Netw. 1999, 10, 626–634. [Google Scholar] [CrossRef]
  82. Zibulevsky, M.; Pearlmutter, B.A. Blind Source Separation by Sparse Decomposition in a Signal Dictionary. Neural Comput. 2001, 13, 863–882. [Google Scholar] [CrossRef] [PubMed]
  83. Bagchi, S.; de Fréin, R. Extending Instantaneous De-mixing Algorithms to Anechoic Mixtures. In Proceedings of the 2021 32nd Irish Signals and Systems Conference (ISSC), Athlone, Ireland, 10–11 June 2021; pp. 1–6. [Google Scholar]
  84. de Fréin, R. Reformulating the binary masking approach of adress as soft masking. Electronics 2020, 9, 1373. [Google Scholar] [CrossRef]
  85. de Fréin, R.; Rickard, S.T. Learning speech features in the presence of noise: Sparse convolutive robust non-negative matrix factorization. In Proceedings of the 2009 16th International Conference on Digital Signal Processing, Santorini, Greece, 5–7 July 2009; pp. 1–6. [Google Scholar] [CrossRef]
  86. Bagchi, S.; de Fréin, R. Soft-Mask De-Mixing for Anechoic Mixtures. In Proceedings of the 2022 33rd Irish Signals and Systems Conference (ISSC), Cork, Ireland, 9–10 June 2022; pp. 1–6. [Google Scholar]
  87. Izumi, Y.; Ono, N.; Sagayama, S. Sparseness-based 2ch BSS using the EM algorithm in reverberant environment. In Proceedings of the 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 21–24 October 2007; pp. 147–150. [Google Scholar]
  88. Cano, E.; FitzGerald, D.; Liutkus, A.; Plumbley, M.D.; Stöter, F.R. Musical source separation: An introduction. IEEE Signal Process. Mag. 2018, 36, 31–40. [Google Scholar] [CrossRef]
  89. Barry, D. Real-Time Sound Source Separation for Music Applications. Ph.D. Thesis, Technological University Dublin, Dublin, Ireland, 2019. [Google Scholar]
  90. Chun, C.; Jeon, K.M.; Choi, W. Configuration-invariant sound localization technique using azimuth-frequency representation and convolutional neural networks. Sensors 2020, 20, 3768. [Google Scholar] [CrossRef]
  91. Presti, G. The bivariate mixture space: A compact spectral representation of bivariate signals. J. Audio Eng. Soc. 2023, 71, 481–491. [Google Scholar]
  92. Abesser, J.; Lukashevich, H.; Holly, S.; Körber, Y.; Ruch, R. Device, Method and Computer Program for Acoustic Monitoring of a Monitoring Area. U.S. Patent 11,557,279, 17 January 2023. [Google Scholar]
  93. Kumatani, K.; Arakawa, T.; Yamamoto, K.; McDonough, J.; Raj, B.; Singh, R.; Tashev, I. Microphone array processing for distant speech recognition: Towards real-world deployment. In Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, Hollywood, CA, USA, 3–6 December 2012; pp. 1–10. [Google Scholar]
  94. de Fréin, R. Remedying Sound Source Separation via Azimuth Discrimination and Re-synthesis. In Proceedings of the 31st Irish Signals and Systems Conference (ISSC), Letterkenny, Ireland, 11–12 June 2020; pp. 1–6. [Google Scholar] [CrossRef]
  95. de Fréin, R.; Rickard, S.T.; Pearlmutter, B.A. Constructing Time-Frequency Dictionaries for Source Separation via Time-Frequency Masking and Source Localisation. In Proceedings of the Independent Component Analysis and Signal Separation; Adali, T., Jutten, C., Romano, J.M.T., Barros, A.K., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 573–580. [Google Scholar]
  96. Reju, V.G.; Koh, S.N.; Soon, Y. Underdetermined convolutive blind source separation via time–frequency masking. IEEE Trans. Audio Speech Lang. Process. 2009, 18, 101–116. [Google Scholar] [CrossRef]
  97. Zhang, H.; Hua, G.; Yu, L.; Cai, Y.; Bi, G. Underdetermined blind separation of overlapped speech mixtures in time-frequency domain with estimated number of sources. Speech Commun. 2017, 89, 1–16. [Google Scholar] [CrossRef]
  98. He, Y.; Wang, H.; Chen, Q.; So, R.H. Harvesting partially-disjoint time-frequency information for improving degenerate unmixing estimation technique. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 506–510. [Google Scholar]
  99. Lu, J.; Qian, W.; Yin, Q.; Xu, K.; Li, S. An Improved Underdetermined Blind Source Separation Method for Insufficiently Sparse Sources. Circuits Syst. Signal Process. 2023, 42, 7615–7639. [Google Scholar] [CrossRef]
  100. Xiangru, B.; Zhang, G.; Xie, Y.; Zhang, Q. Method and System for Voice Separation Based on Degenerate Unmixing Estimation Technique. U.S. Patent 11,783,848, 10 October 2023. [Google Scholar]
  101. Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S.; Dahlgren, N.L.; Zue, V. Timit Acoustic Phonetic Continuous Speech Corpus; Linguistic Data Consortium: Philadelphia, PA, USA, 1993. [Google Scholar]
  102. Vincent, E.; Gribonval, R.; Févotte, C. Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1462–1469. [Google Scholar] [CrossRef]
  103. Emiya, V.; Vincent, E.; Harlander, N.; Hohmann, V. Subjective and objective quality assessment of audio source separation. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2046–2057. [Google Scholar] [CrossRef]
  104. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)—A new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752. [Google Scholar]
  105. Wang, D. On ideal binary mask as the computational goal of auditory scene analysis. In Speech Separation by Humans and Machines; Springer: New York, NY, USA, 2005; pp. 181–197. [Google Scholar]
  106. Hummersone, C.; Mason, R.; Brookes, T. Ideal Binary Mask Ratio: A Novel Metric for Assessing Binary-Mask-Based Sound Source Separation Algorithms. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2039–2045. [Google Scholar] [CrossRef]
  107. Ward, D.; Wierstorf, H.; Mason, R.D.; Grais, E.M.; Plumbley, M.D. BSS Eval or PEASS? Predicting the perception of singing-voice separation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 596–600. [Google Scholar]
  108. Kim, G.; Loizou, P.C. Why do speech-enhancement algorithms not improve speech intelligibility? In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, USA, 14–19 March 2010; pp. 4738–4741. [Google Scholar]
  109. Loizou, P.C.; Kim, G. Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions. IEEE Trans. Audio Speech Lang. Process. 2010, 19, 47–56. [Google Scholar] [CrossRef]
Figure 1. (a) Motivational Scenario: Multi-microphone arrays and a smart hearing aid (HA) are interconnected via a wireless network. The microphones capture speech mixtures, which are processed by powerful demixing algorithms, isolating the speaker of interest and directing their speech to the HA. (b) Overhead schematic view of the experimental environment: the distance, $d_s = BM$, is the effective inter-sensor gap in meters. Relative to the RM, the speech signal, $s_1[n]$, travels the extra distance $BM$, giving it a larger relative delay. As the distance $d_s$ increases, the target signal $s_1[n]$ becomes more prone to phase wraparound.
Figure 2. (a–d) IPD distribution across frequencies: A voting algorithm is used in [5] to obtain the global maximum, $\phi = 2.5$ rad. The black, green, and orange dots represent $\phi_{\max}$, located in the region where $\tan(\phi)$ rises/falls exponentially.
Figure 3. (a) Both halves of the frequency–attenuation matrix are illustrated [84]. (b) Considering the second half of the frequency–attenuation matrix, when the frequency components of the sources are non-overlapping, nulls appear at the appropriate attenuation coefficients. (c) When the sources overlap at 300 Hz, a new peak is formed at $\hat{g} \approx 0.5$. The interested reader is directed to [84] for a more in-depth discussion of this process.
Figure 4. Speech utterances recovered from a mixture of 4 sources using Elevato-AdRess and Elevato-DUET are illustrated. The first column gives the clean speech sources, the second column shows the source estimates recovered by Elevato-AdRess, and the third column presents the sources estimated by Elevato-DUET. Normalized frequency is given on the y-axis of each panel. The primary result is that the Elevato-DUET estimates in the third column contain more interference.
Figure 5. (a–c) AdRess, DUET, Elevato-DUET and Elevato-AdRess are evaluated in terms of their Source-to-Distortion Ratio (SDR), on the LHS, Source-to-Interference Ratio (SIR), in the center, and Source-to-Artifact Ratio (SAR), on the RHS, as a function of relative delay. The Ideal Binary Mask is also illustrated as a benchmark method. The algorithms are evaluated using a four-source mixture, $J = 4$, with a sampling rate of $F_s = 16$ kHz. All the separation quality metrics are measured in dB. The Elevato-AdRess algorithm typically gives better performance than Elevato-DUET. The performance range of DUET is limited to small delays, and that of AdRess to zero delay. Of particular interest in this setting is that, in scenarios with the largest delays, Elevato-DUET generally outperforms Elevato-AdRess. This difference can be attributed to the Mean Absolute Error (MAE) of $\hat{\delta}_j$, which is used to estimate $\hat{\alpha}_j$ and subsequently the TF mask, and which grows when the delay is large.
Figure 6. (a–d) PEASS scores: the OPS, TPS, IPS, and APS scores are used to compare the performance of Elevato-AdRess and Elevato-DUET relative to the Ideal Binary Mask (IBM). The PEASS scores are compared as a function of the number of sources in each mixture, which increases from two to four. Boxplots are used to summarize the performance. Elevato-AdRess obtains a better Overall Perceptual Score (OPS), Target-related Perceptual Score (TPS), and Interference-related Perceptual Score (IPS) than Elevato-DUET.
Figure 7. Boxplots of the SNR of the sources recovered by Elevato-AdRess and Elevato-DUET from two-, three- and four-source mixtures. The Ideal Binary Mask (IBM) is given as the benchmark demixing algorithm. The SNR of the separated speech utterances decreases as the number of sources in the mixture increases. In summary, Elevato-DUET demonstrates better SNRs for the recovered sources than Elevato-AdRess.
Figure 8. The intelligibility of the demixed speech recovered by DUET, Elevato-DUET and Elevato-AdRess is quantified. The IBM is used as the benchmark technique. Sources experience a relative delay of 120 samples at a sampling rate of $F_s = 16$ kHz. The results for the NarrowBand (NB) MOS are given on the LHS, the NB MOS-Listening Quality Objective (LQO) results in the center, and the WideBand (WB) MOS-LQO scores of the separated utterances on the RHS. Elevato-AdRess improves speech intelligibility in comparison with Elevato-DUET and the IBM, achieving higher MOS and LQO scores in both the NB and WB evaluations.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
