**2. Background**

Noisy mixed signals observed via a recording device can be stated as *y*(*t*) = *s*1(*t*) + *s*2(*t*) + *n*(*t*), where *s*1(*t*) and *s*2(*t*) denote the original sounds and *n*(*t*) is noise. This research focuses on two sound events in a single recorded signal. The proposed method consists of two steps, noisy sound separation and sound event classification, as illustrated in Figure 1, where *y*(*t*) and **Y**(ω, *t*) denote a sound-event mixture in the time domain and the time-frequency domain, respectively. The terms **W***k*(ω), **H***k*(*t*), and φ*k*(ω, *t*) are the spectral basis, the temporal code (weight matrix), and the phase information, respectively. The term λ*k*(*t*) represents the sparsity parameter, and $\hat{s}_j(t)$ is an estimated sound event source. The abbreviations MFCC, STE, and STZCR stand for Mel frequency cepstral coefficients, short-time energy, and short-time zero-crossing rate, respectively. The proposed method is elaborated step by step in the following parts.
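As a minimal sketch of this observation model, the snippet below builds a synthetic two-event mixture and its STFT representation **Y**(ω, *t*); the sample rate, tone frequencies, and noise level are assumed values for illustration, not taken from the paper.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                               # sample rate (Hz), assumed
t = np.arange(fs) / fs                   # one second of samples
s1 = np.sin(2 * np.pi * 440 * t)         # stand-in for sound event s1(t)
s2 = np.sin(2 * np.pi * 950 * t)         # stand-in for sound event s2(t)
n = 0.05 * np.random.randn(fs)           # additive noise n(t)
y = s1 + s2 + n                          # observed single-channel mixture y(t)

_, _, Y = stft(y, fs=fs, nperseg=1024)   # complex spectrogram Y(w, t)
```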


**Figure 1.** Signal flow of the proposed method.

### *2.1. Single-Channel Sound Event Separation*

The problem formulation in the time-frequency (TF) representation is as follows: given an observed complex spectrum $\mathbf{Y}(\omega, t) \in \mathbb{C}$, estimate the optimal parameters $\theta = \{\mathbf{W}, \mathbf{H}, \phi\}$ of the model. A new factorization algorithm, named the adaptive *L*1-sparse complex non-negative matrix factorization (adaptive *L*1-SCMF), is derived in the following section. The generative model is given by

$$\mathbf{Y}(\omega, t) = \sum\_{k=1}^{K} \mathbf{W}^{k}(\omega) \mathbf{H}^{k}(t) \mathbf{Z}^{k}(\omega, t) = \mathbf{X}(\omega, t) + \boldsymbol{\varepsilon}(\omega, t) \tag{1}$$

where $\mathbf{Z}^{k}(\omega, t) = e^{j\phi^{k}(\omega, t)}$ and the reconstruction error $\varepsilon(\omega, t) \sim \mathcal{N}_{\mathbb{C}}(0, \sigma^{2})$ is assumed to be independently and identically distributed (i.i.d.) white noise with zero mean and variance $\sigma^{2}$. The term $\varepsilon(\omega, t)$ denotes the modeling error for each source.
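For illustration only, the generative model of Equation (1) can be written directly in NumPy; the dimensions *F*, *T*, *K* and the random initializations below are assumptions, not values from the paper.

```python
import numpy as np

F, T, K = 513, 100, 8                              # assumed dimensions
W = np.random.rand(F, K)                           # spectral bases W^k(w) >= 0
H = np.random.rand(K, T)                           # temporal codes H^k(t) >= 0
phi = np.random.uniform(-np.pi, np.pi, (K, F, T))  # phases phi^k(w, t)

# X(w, t) = sum_k W^k(w) H^k(t) e^{j phi^k(w, t)}  -- Equation (1)
X = np.einsum('fk,kt,kft->ft', W, H, np.exp(1j * phi))
```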

The likelihood of $\theta = \{\mathbf{W}, \mathbf{H}, \phi\}$ is thus written as

$$P(\mathbf{Y}|\theta) = \prod\_{\omega,t} \frac{1}{\pi \sigma^2} \exp\left(-\frac{\left|\mathbf{Y}(\omega, t) - \mathbf{X}(\omega, t)\right|^2}{\sigma^2}\right)\tag{2}$$

It is assumed that the prior distributions for **W**, **H**, and φ are independent, which yields

$$P(\theta|\lambda) = P(\mathbf{W})P(\mathbf{H}|\lambda)P(\phi) \tag{3}$$

The prior *P*(**H**|λ) corresponds to the sparsity cost, for which a natural choice is a generalized Gaussian prior. When *p* = 1, *P*(**H**|λ) promotes *L*1-norm sparsity. *L*1-norm sparsity has been shown to be probabilistically equivalent to the *L*0 pseudo-norm, which is the theoretically optimal sparsity measure [33,34]. However, the *L*0-norm is non-deterministic polynomial-time (NP) hard to optimize and is therefore impractical for large datasets such as audio. Given Equation (3), the posterior density [35,36] defines a maximum a posteriori probability (MAP) estimation problem, which leads to an optimization problem minimized with respect to θ. The equations of the generalized Gaussian prior and the MAP estimation are expressed in Appendix A.
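To make the role of the *L*1 penalty tangible, consider the scalar problem $\min_h \frac{1}{2}(h - v)^2 + \lambda|h|$, whose closed-form solution is the soft-threshold operator that sets small coefficients exactly to zero; the demo below is illustrative only and is not part of the proposed algorithm.

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam*|h|: shrinks v toward 0, zeroing |v| <= lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([-0.3, 0.05, 1.2, -2.0])
print(soft_threshold(v, lam=0.5))   # -> [ 0.   0.   0.7 -1.5]
```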

The CMF parameters are updated through an iterative process by using an efficient auxiliary function. The auxiliary function for $f(\theta)$ is constructed as follows: for any auxiliary variables $\overline{\mathbf{Y}}^{k}(\omega, t)$ with $\sum_k \overline{\mathbf{Y}}^{k}(\omega, t) = \mathbf{Y}(\omega, t)$, any $\beta^{k}(\omega, t) > 0$ with $\sum_k \beta^{k}(\omega, t) = 1$, any $\mathbf{H}^{k}(t) \in \mathbb{R}$, $\overline{\mathbf{H}}^{k}(t) \in \mathbb{R}$, and $p = 1$, it holds that $f(\theta) \le f^{+}(\theta, \overline{\theta})$, where the auxiliary function is defined as:

$$f^{+}\left(\theta,\overline{\theta}\right) = \sum\_{\omega,k,t} \frac{\left|\overline{\mathbf{Y}}^{k}\left(\omega,t\right) - \mathbf{W}^{k}\left(\omega\right)\mathbf{H}^{k}\left(t\right)e^{j\phi^{k}\left(\omega,t\right)}\right|^{2}}{\beta^{k}\left(\omega,t\right)} + \sum\_{k,t} \left[\left(\lambda^{k}\left(t\right)\right)^{p}\left(p\left|\overline{\mathbf{H}}^{k}\left(t\right)\right|^{p-2}\mathbf{H}^{k}\left(t\right)^{2} + \left(2-p\right)\left|\overline{\mathbf{H}}^{k}\left(t\right)\right|^{p}\right) - \log\lambda^{k}\left(t\right)\right] \tag{4}$$

where $\overline{\theta} = \left\{\overline{\mathbf{Y}}^{k}(\omega, t), \overline{\mathbf{H}}^{k}(t) \;\middle|\; 1 \le f \le F,\ 1 \le t \le T,\ 1 \le k \le K\right\}$. The function $f^{+}(\theta, \overline{\theta})$ is minimized w.r.t. $\overline{\theta}$ when

$$\overline{\mathbf{Y}}^{k}(\omega,t) = \mathbf{W}^{k}(\omega)\overline{\mathbf{H}}^{k}(t)\cdot\mathbf{e}^{i\phi^{k}(\omega,t)} + \beta^{k}(\omega,t)(\mathbf{Y}(\omega,t) - \mathbf{X}(\omega,t))\tag{5}$$

$$\overline{\mathbf{H}}^k(t) = \mathbf{H}^k(t) \tag{6}$$

### *2.2. Formulation of the Proposed CMF-Based Adaptive Variable Regularization Sparsity*

2.2.1. Estimation of the Spectral Basis and Temporal Code

In Equation (4), the update rules for $\theta$ are derived by partially differentiating $f^{+}(\theta, \overline{\theta})$ w.r.t. **W***k*(ω) and **H***k*(*t*) and setting the derivatives to zero, which yields the following:

$$\mathbf{W}^{k}(\omega) = \frac{\sum\_{t} \frac{\mathbf{H}^{k}(t)}{\beta^{k}(\omega, t)} \text{Re} \left[ \overline{\mathbf{Y}}^{k}(\omega, t)^{\*} \cdot e^{j\phi^{k}(\omega, t)} \right]}{\sum\_{t} \frac{\mathbf{H}^{k}(t)^{2}}{\beta^{k}(\omega, t)}} \tag{7}$$

$$\mathbf{H}^{k}(t) = \frac{\sum\_{\omega} \frac{\mathbf{W}^{k}(\omega)}{\beta^{k}(\omega, t)} \text{Re} \left[ \overline{\mathbf{Y}}^{k}(\omega, t)^{\*} \cdot e^{j\phi^{k}(\omega, t)} \right]}{\sum\_{\omega} \frac{\mathbf{W}^{k}(\omega)^{2}}{\beta^{k}(\omega, t)} + \left(\lambda^{k}(t)\right)^{p} \left| \overline{\mathbf{H}}^{k}(t) \right|^{p-2}} \tag{8}$$

The update rule for the phase, φ*k*(ω, *t*), can be derived by reformulating Equation (4) as follows:

$$\begin{split} f^{+}\left(\theta,\overline{\theta}\right) &= \sum\_{k,\omega,t} \frac{\left|\overline{\mathbf{Y}}^{k}(\omega,t)\right|^{2} - 2\mathbf{W}^{k}(\omega)\mathbf{H}^{k}(t)\,\text{Re}\left[\overline{\mathbf{Y}}^{k}(\omega,t)^{\*} \cdot e^{j\phi^{k}(\omega,t)}\right] + \mathbf{W}^{k}(\omega)^{2}\mathbf{H}^{k}(t)^{2}}{\beta^{k}(\omega,t)} \\ &\quad + \sum\_{k,t} \lambda^{k}(t)\left[\left|\overline{\mathbf{H}}^{k}(t)\right|^{-1}\mathbf{H}^{k}(t)^{2} + \left|\overline{\mathbf{H}}^{k}(t)\right|\right] - \sum\_{k,t}\log\lambda^{k}(t) \\ &= A - 2\sum\_{k,\omega,t}\left|\mathbf{B}^{k}(\omega,t)\right|\cos\left(\phi^{k}(\omega,t) - \Omega^{k}(\omega,t)\right) \end{split} \tag{9}$$

where $A$ denotes the terms that are independent of $\phi^{k}(\omega, t)$, $\mathbf{B}^{k}(\omega, t) = \frac{\mathbf{W}^{k}(\omega)\mathbf{H}^{k}(t)\overline{\mathbf{Y}}^{k}(\omega, t)}{\beta^{k}(\omega, t)}$, $\cos \Omega^{k}(\omega, t) = \frac{\text{Re}\left[\overline{\mathbf{Y}}^{k}(\omega, t)\right]}{\left|\overline{\mathbf{Y}}^{k}(\omega, t)\right|}$, and $\sin \Omega^{k}(\omega, t) = \frac{\text{Im}\left[\overline{\mathbf{Y}}^{k}(\omega, t)\right]}{\left|\overline{\mathbf{Y}}^{k}(\omega, t)\right|}$. The derivation of Equation (9) is elucidated in Appendix B. The auxiliary function $f^{+}(\theta, \overline{\theta})$ in Equation (4) is minimized when $\cos(\phi^{k}(\omega, t) - \Omega^{k}(\omega, t)) = \cos \phi^{k}(\omega, t) \cos \Omega^{k}(\omega, t) + \sin \phi^{k}(\omega, t) \sin \Omega^{k}(\omega, t) = 1$, namely, $\cos \phi^{k}(\omega, t) = \cos \Omega^{k}(\omega, t)$ and $\sin \phi^{k}(\omega, t) = \sin \Omega^{k}(\omega, t)$. The update formula for $e^{j\phi^{k}(\omega, t)}$ eventually leads to

$$\begin{aligned} e^{j\Phi^k(\omega, t)} &= \cos\phi^k(\omega, t) + j\sin\phi^k(\omega, t) \\ &= \frac{\overline{\mathbf{Y}}^k(\omega, t)}{|\overline{\mathbf{Y}}^k(\omega, t)|} \end{aligned} \tag{10}$$

The update formulas for β*k*(ω, *t*) and **H***k*(*t*), which project the estimates onto the constraint space, are set to

$$\beta^k(\omega, t) = \frac{\mathbf{W}^k(\omega)\mathbf{H}^k(t)}{\sum\_k \mathbf{W}^k(\omega)\mathbf{H}^k(t)}\tag{11}$$

$$\mathbf{H}^{k}(t) \leftarrow \frac{\mathbf{H}^{k}(t)}{\sum\_{k} \mathbf{H}^{k}(t)} \tag{12}$$
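A minimal sketch of one pass of these update rules (Equations (5), (7), (8), and (10)-(12)) with *p* = 1 is given below. Array shapes, variable names, and the small-constant guards are assumptions, and the non-negativity clipping of **H** is a practical safeguard rather than part of the derivation.

```python
import numpy as np

def update_once(Y, W, H, phi, lam, eps=1e-12):
    """One pass of the auxiliary-function updates (sketch, p = 1).
    Y: F x T complex, W: F x K, H: K x T, phi: K x F x T, lam: K x T."""
    E = np.exp(1j * phi)                               # e^{j phi^k(w,t)}
    comp = W.T[:, :, None] * H[:, None, :] * E         # K x F x T components
    X = comp.sum(axis=0)                               # model X(w, t)

    mag = W.T[:, :, None] * H[:, None, :]              # W^k(w) H^k(t) >= 0
    beta = mag / (mag.sum(axis=0) + eps)               # Eq. (11)
    Ybar = comp + beta * (Y - X)[None, :, :]           # Eq. (5)
    R = np.real(np.conj(Ybar) * E)                     # Re[Ybar^* e^{j phi}]

    # Eq. (7): spectral basis update
    W = ((H[:, None, :] / (beta + eps) * R).sum(axis=2)
         / (((H ** 2)[:, None, :] / (beta + eps)).sum(axis=2) + eps)).T

    # Eq. (8): temporal code update with p = 1 (so |Hbar|^{p-2} = 1/|Hbar|)
    num = (W.T[:, :, None] / (beta + eps) * R).sum(axis=1)
    den = ((W ** 2).T[:, :, None] / (beta + eps)).sum(axis=1) + lam / (np.abs(H) + eps)
    H = np.maximum(num / (den + eps), 0.0)             # clip: practical guard

    phi = np.angle(Ybar)                               # Eq. (10)
    H = H / (H.sum(axis=0, keepdims=True) + eps)       # Eq. (12)
    return W, H, phi
```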

2.2.2. Estimation of the *L*1-Optimal Sparsity Parameter λ*k*(*t*)

This section aims to equip the spectral dictionaries with adaptive sparse coding. First, the CMF model is rewritten in the following matrix-vector form:

$$\begin{gathered} \mathbf{W} = \left[\mathbf{I} \otimes \mathbf{W}^{1}(\omega) \;\; \mathbf{I} \otimes \mathbf{W}^{2}(\omega) \;\; \cdots \;\; \mathbf{I} \otimes \mathbf{W}^{K}(\omega)\right], \qquad e^{j\phi(t)} = \left[e^{j\phi^{1}(t)} \; \cdots \; e^{j\phi^{K}(t)}\right], \\ \underline{\mathbf{y}} = \text{vec}(\mathbf{Y}) = \begin{bmatrix} \mathbf{Y}^{1}(:) \\ \mathbf{Y}^{2}(:) \\ \vdots \\ \mathbf{Y}^{K}(:) \end{bmatrix}, \quad \underline{\mathbf{h}} = \begin{bmatrix} \mathbf{H}^{1}(t) \\ \mathbf{H}^{2}(t) \\ \vdots \\ \mathbf{H}^{K}(t) \end{bmatrix}, \quad \underline{\lambda} = \begin{bmatrix} \lambda^{1}(t) \\ \lambda^{2}(t) \\ \vdots \\ \lambda^{K}(t) \end{bmatrix}, \quad \underline{\phi} = \begin{bmatrix} \phi^{1}(:,t) \\ \phi^{2}(:,t) \\ \vdots \\ \phi^{K}(:,t) \end{bmatrix}, \\ \overline{\mathbf{A}} = \begin{bmatrix} \mathbf{W} \circ e^{j\phi(t)} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{W} \circ e^{j\phi(t)} & & \vdots \\ \vdots & & \ddots & \mathbf{0} \\ \mathbf{0} & \cdots & \mathbf{0} & \mathbf{W} \circ e^{j\phi(t)} \end{bmatrix} \end{gathered} \tag{13}$$

where "⊗" and "◦" are the Kronecker product and the Hadamard product, respectively. The term vec(·) denotes the column vectorization and the term **I** is the identity matrix. The goal is then set to compute the regularization parameter λ*k*(*t*) related to each **H***k*(*t*). To achieve the goal, the parameter *p* in Equation (4) is set to 1 to acquire a linear expression (in λ*k*(*t*)). In consideration of the noise variance σ2, Equation (4) can concisely be rewritten as:

Taking the noise variance $\sigma^2$ into account, Equation (4) can concisely be rewritten as:

$$F(\underline{\mathbf{h}}, \underline{\lambda}) = \frac{1}{2\sigma^2} \parallel \underline{\mathbf{y}} - \overline{\mathbf{A}}\,\underline{\mathbf{h}} \parallel\_F^2 + \underline{\lambda}^T \underline{\mathbf{h}} - (\log \underline{\lambda})^T \underline{\mathbf{1}} \tag{14}$$

where **h** and λ are vectors of dimension *R* × 1 (i.e., *R* = *F* × *T* × *K*), and the superscript '**T**' denotes the complex Hermitian transpose (i.e., vector (or matrix) transpose followed by complex conjugation). The Expectation-Maximization (EM) algorithm is used to determine λ, with **h** as the hidden variable, so that the log-likelihood function can be optimized with respect to λ. The log-likelihood function satisfies the following [12]:

$$\ln p(\underline{\mathbf{y}} \,|\, \underline{\lambda}, \overline{\mathbf{A}}, \sigma^2) \ge \int Q(\underline{\mathbf{h}}) \ln \left( \frac{p(\underline{\mathbf{y}}, \underline{\mathbf{h}} \,|\, \underline{\lambda}, \overline{\mathbf{A}}, \sigma^2)}{Q(\underline{\mathbf{h}})} \right) d\underline{\mathbf{h}} \tag{15}$$

by applying Jensen's inequality for any distribution *Q*(**h**). One can verify that the distribution maximizing the right-hand side of Equation (15) is the posterior distribution of **h**, i.e., $Q(\underline{\mathbf{h}}) = p(\underline{\mathbf{h}} \,|\, \underline{\mathbf{y}}, \underline{\lambda}, \overline{\mathbf{A}}, \sigma^2)$. The posterior distribution in the form of the Gibbs distribution is proposed as follows:

$$Q(\underline{\mathbf{h}}) = \frac{1}{Z\_{\mathbf{h}}} \exp[-F(\underline{\mathbf{h}})] \text{ where } Z\_{\mathbf{h}} = \int \exp[-F(\underline{\mathbf{h}})] d\underline{\mathbf{h}} \tag{16}$$

The term *F*(**h**) in Equation (16), the energy function of the Gibbs distribution, is essential for simplifying the adaptive optimization of λ. The maximum-likelihood (ML) estimation of λ can be decomposed as follows:

$$\underline{\lambda}^{ML} = \underset{\underline{\lambda}}{\text{arg}\, \text{max}} \int Q(\underline{\mathbf{h}}) \ln p(\underline{\mathbf{h}} \,|\, \underline{\lambda})\, d\underline{\mathbf{h}} \tag{17}$$

In the same way,

$$\sigma^2\_{ML} = \underset{\sigma^2}{\text{arg}\, \text{max}} \int Q(\underline{\mathbf{h}}) \ln p(\underline{\mathbf{y}} \, | \, \underline{\mathbf{h}}, \, \overline{\mathbf{A}}, \sigma^2) d\underline{\mathbf{h}} \tag{18}$$

Each element of **h** is required to be exponentially distributed with an independent decay parameter, which gives $p(\underline{\mathbf{h}} \,|\, \underline{\lambda}) = \prod_{g} \lambda_{g} \exp(-\lambda_{g} h_{g})$; thus, Equation (17) becomes

$$\underline{\lambda}^{ML} = \underset{\underline{\lambda}}{\text{arg}\, \text{max}} \int Q(\underline{\mathbf{h}}) \sum\_{g}\left(\ln \lambda\_{g} - \lambda\_{g} h\_{g}\right) d\underline{\mathbf{h}} \tag{19}$$

The term **h** is the variable of integration under the distribution *Q*(**h**), whereas the other parameters are assumed to be constant. As such, the λ optimization in Equation (19) is solved by differentiating the expression inside the integral with respect to each λ*g* and setting the result to zero. The functional optimization [37] of λ then yields

$$\lambda\_{g} = \frac{1}{\int h\_{g}\, Q(\underline{\mathbf{h}})\, d\underline{\mathbf{h}}} \tag{20}$$

where *g* = 1, 2, ... , *R*, and λ*g* denotes the *g*th element of λ. Notice that the solution **h** naturally splits its elements into two distinct subsets, **h***M* and **h***P*, consisting of the components *m* ∈ *M* for which *hm* > 0 and the components *p* ∈ *P* for which *hp* = 0. The sparsity parameter is then obtained as presented in Equation (21):

$$\lambda\_{g} = \begin{cases} \dfrac{1}{\int h\_{g}\, \hat{Q}\_{M}(\underline{\mathbf{h}}\_{M})\, d\underline{\mathbf{h}}\_{M}} = \dfrac{1}{h\_{M\_{g}}^{\text{MAP}}} & \text{if } g \in M \\[2ex] \dfrac{1}{\int h\_{g}\, \hat{Q}\_{P}(\underline{\mathbf{h}}\_{P})\, d\underline{\mathbf{h}}\_{P}} = \dfrac{1}{u\_{g}} & \text{if } g \in P \end{cases} \tag{21}$$

and its covariance X is given by

$$X\_{ab} = \begin{cases} \left(\mathbf{C}\_{M}^{-1}\right)\_{ab} & \text{if } a, b \in M \\ u\_{p}^{2}\, \delta\_{ab} & \text{otherwise} \end{cases} \tag{22}$$

where $\hat{Q}_{P}(\underline{\mathbf{h}}_{P} \ge 0) = \prod_{p \in P} \frac{1}{u_{p}} \exp\left(-\frac{h_{p}}{u_{p}}\right)$, $\mathbf{C}_{P} = \frac{1}{\sigma^2} \mathbf{A}_{P}^{T} \mathbf{A}_{P}$, and $u_{p} \leftarrow u_{p} \frac{-\hat{b}_{p} + \sqrt{\hat{b}_{p}^{2} + 4(\hat{\mathbf{C}}\mathbf{u})_{p}/u_{p}}}{2(\hat{\mathbf{C}}\mathbf{u})_{p}}$. The function $\hat{Q}_{M}(\underline{\mathbf{h}}_{M})$ is expressed as an unconstrained Gaussian with mean $\underline{\mathbf{h}}_{M}^{\text{MAP}}$ and covariance $\mathbf{C}_{M}^{-1}$, based on a multivariate Gaussian distribution. Similarly, the inference for $\sigma^2$ can be computed as

$$
\sigma^2 = \frac{1}{N\_0} \int Q(\underline{\mathbf{h}}) (\|\underline{\mathbf{y}} - \overline{\mathbf{A}}\underline{\mathbf{h}}\|^2) d\underline{\mathbf{h}} \tag{23}
$$

where

$$\hat{h}\_{g} = \begin{cases} h\_{g}^{\text{MAP}} & \text{if } g \in M \\ u\_{g} & \text{if } g \in P \end{cases}$$
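A compact sketch of the resulting EM-style updates follows: Equation (20) sets each decay parameter to the reciprocal of the posterior mean of $h_g$, and Equation (23) re-estimates the noise variance. Collapsing the expectations onto the point estimate $\hat{\mathbf{h}}$ is a simplification assumed here for brevity.

```python
import numpy as np

def update_lambda(h_hat, eps=1e-12):
    """Eq. (20): lambda_g = 1 / E[h_g], with E[h_g] approximated by h_hat_g."""
    return 1.0 / (h_hat + eps)

def update_sigma2(y, A, h_hat):
    """Eq. (23) with the expectation collapsed onto h_hat; y.size stands in
    for the normalizer N_0 (an assumption)."""
    r = y - A @ h_hat
    return np.real(np.vdot(r, r)) / y.size
```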

The core procedure of the proposed CMF method is based on the *L*1-optimal sparsity parameters. The estimated sources are recovered by multiplying the respective rows of the **W***k*(ω) components with the corresponding columns of the **H***k*(*t*) weights and the time-varying phase spectrum $e^{j\phi^{k}(\omega, t)}$. The separated source $\hat{s}_j(t)$ is obtained by converting the time-frequency representation of each source back into the time domain. The derivation of the *L*1-optimal sparsity parameter is elucidated in Appendix C.
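The reconstruction step can be sketched as follows, assuming a fixed grouping of components to sources (`source_ids`) and SciPy's inverse STFT; both are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.signal import istft

def reconstruct_sources(W, H, phi, source_ids, fs=16000, nperseg=1024):
    """source_ids: length-K array mapping each component k to a source j."""
    source_ids = np.asarray(source_ids)
    comp = W.T[:, :, None] * H[:, None, :] * np.exp(1j * phi)  # K x F x T
    estimates = []
    for j in np.unique(source_ids):
        S_j = comp[source_ids == j].sum(axis=0)      # TF spectrum of source j
        _, s_j = istft(S_j, fs=fs, nperseg=nperseg)  # back to the time domain
        estimates.append(s_j)                        # estimated s_hat_j(t)
    return estimates
```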

### *2.3. Sound Event Classification*

Once the separated sound signals are obtained, the next step is to identify the sound events. A multiclass support vector machine (MSVM) is employed to achieve this goal. The MSVM comprises two phases: the learning phase and the evaluation phase. The MSVM is based on the one-versus-one (OvsO) strategy, which splits the observed $c$ classes into $\frac{c(c-1)}{2}$ binary classification sub-problems. To train each binary MSVM model, the MSVM constructs hyperplanes that discriminate the observed data into their corresponding classes by executing the series of binary classifications. Starting with the learning phase, sound signatures are extracted from the training dataset in the time-frequency domain. The sound signatures studied in this research were the Mel frequency cepstral coefficients (MFCC: $MF$), the short-time energy (STE: $E(t)$), and the short-time zero-crossing rate (STZCR: $Z(t)$), which can be expressed, in order, as $MF = 2595 \times \log_{10}[1 + (f/700)]$, $E(t) = \sum_{\tau=-\infty}^{\infty} [y(\tau) \cdot f_w(t-\tau)]^2$, and $Z(t) = \sum_{\tau=-\infty}^{\infty} |sgn[s(\tau)] - sgn[s(\tau-1)]| \cdot f_w(t-\tau)$, where $f_w(t)$ denotes the windowing function. The training signals are segmented into small blocks, and the three signature parameters are extracted from each block. The mean supervector is then computed as the average of each feature over all blocks for each sound event input. Thus, the mean feature supervector ($O$) is paired with its corresponding sound-event-label vector ($w$) (i.e., $\psi(O, w)$) and supplied to the MSVM model.
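A hedged sketch of the per-block signature extraction follows; the STE and STZCR implement the windowed sums above, while the MFCCs are delegated to librosa, an assumed dependency that the paper does not name.

```python
import numpy as np
import librosa

def short_time_energy(y, win):
    """E(t) = sum_tau [y(tau) * f_w(t - tau)]^2 over one block."""
    return np.sum((y * win) ** 2)

def short_time_zcr(y, win):
    """STZCR: windowed count of sign changes within the block."""
    signs = np.sign(y)
    return 0.5 * np.sum(np.abs(np.diff(signs)) * win[1:])

def block_features(y, sr, n_mfcc=13):
    """One feature vector per block; blocks are later averaged into O."""
    win = np.hamming(len(y))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    return np.concatenate([mfcc, [short_time_energy(y, win)],
                           [short_time_zcr(y, win)]])
```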

The discriminant formula can be expressed as:

$$\begin{aligned} \{\hat{w}, \hat{\beta}\} &= \underset{\ell}{\text{arg}\, \text{max}} \left\{ \alpha\_{\ell}^{T} \psi(O, w; \beta) \right\} \\ &= \underset{\ell}{\text{arg}\, \text{max}} \left\{ \max\_{\beta} \sum\_{i=1}^{|w|} \alpha\_{\ell}^{T} \psi(O\_{i|\beta}, w\_{i}) \right\} \end{aligned} \tag{24}$$

where $O_{i|\beta}, w_i$, $i = 1, \ldots, M$, represent the $i$th separated sound signals; the weight vector $\alpha_\ell$ is employed for each class $\ell$ to compute a discriminant score for $O$; the index $i$ runs over the block order ($\beta$); and the function $\alpha_\ell^T \psi(O, w; \beta)$ measures the linear discriminant distance between the hyperplane and the feature vector extracted from the observed data. Under the OvsO strategy, the hyperplane between the $\ell$th class and the others, $\alpha_\ell^T \psi(O, w; \beta) + b$, can be learned via the following equation:

$$\min\_{\alpha\_{\ell},\, \xi\_{\ell}} \frac{1}{2} \left\| \alpha\_{\ell} \right\|^2 + C \sum\_{i=1}^{M} \xi\_i^{\ell} \tag{25}$$

where $\xi_i^\ell \ge 0$ and $b$ is a constant. The term $\sum_{i=1}^{M} \xi_i^\ell$ denotes a penalty function that trades off a large margin against a small error penalty. The optimal hyperplane is determined by minimizing $\frac{1}{2}\|\alpha_\ell\|^2$ subject to the condition $\alpha_\ell^T \psi(O, w; \beta) + b \ge 1 - \xi_i^\ell$. If this condition holds, the estimated sound event belongs to the $\ell$th class; otherwise, the estimated sound event is assigned to another class.
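The OvsO learning phase can be sketched with scikit-learn's `SVC`, which internally trains the same $\frac{c(c-1)}{2}$ binary classifiers; the library, data shapes, and hyperparameters below are assumptions for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(40, 15)        # placeholder mean supervectors O
y_train = np.random.randint(0, 5, 40)   # placeholder sound event labels w

# SVC trains c(c-1)/2 binary classifiers under the one-versus-one strategy
clf = SVC(kernel='linear', C=1.0, decision_function_shape='ovo')
clf.fit(X_train, y_train)

X_test = np.random.rand(3, 15)
print(clf.predict(X_test))              # predicted sound event classes
```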

The overview of the proposed algorithm is presented in the following table as Algorithm 1.

**Algorithm 1** Overview of the proposed algorithm.


Compute the estimated source spectrograms

$$\hat{\mathbf{S}}\_{i}(\omega, t) = \sum\_{k=1}^{K\_{i}} \mathbf{W}\_{i}^{k}(\omega)\, \mathbf{H}\_{i}^{k}(t)\, e^{j\phi\_{i}^{k}(\omega, t)}$$

and construct the binary TF mask for the $i$th source,

$$M\_{i}(f, t\_{s}) = \begin{cases} 1, & \text{if } \left| \hat{\mathbf{S}}\_{i}(f, t\_{s}) \right|^{2} > \left| \hat{\mathbf{S}}\_{j}(f, t\_{s}) \right|^{2},\; i \neq j \\ 0, & \text{otherwise} \end{cases}$$
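A sketch of this masking step: each TF bin is assigned to the source with the larger estimated power, and the resulting binary mask can then be applied to the mixture spectrogram (variable names are illustrative).

```python
import numpy as np

def binary_masks(S_hats):
    """S_hats: list of complex source spectrograms, each F x T."""
    power = np.stack([np.abs(S) ** 2 for S in S_hats])  # J x F x T powers
    winner = power.argmax(axis=0)                       # loudest source per bin
    return [(winner == j).astype(float) for j in range(len(S_hats))]

# e.g., masked_0 = binary_masks([S1_hat, S2_hat])[0] * Y  applies mask 0 to Y
```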

