### 3.2.1. Determination of the Optimal Window Length for Feature Encoding

For the MSVM method, sound-event signals are segmented into small blocks for feature encoding using a fixed-length sliding window. The resulting sets of feature vectors are reduced to mean supervectors and fed to the MSVM model for learning and constructing the hyperplane. The block size affects the information carried by the feature vectors, and therefore the classifier performance. Because the block size affects α, modifying it changes the learning efficiency of the MSVM model. Therefore, to obtain the optimal value of α, the optimal block size was determined by training the MSVM model with various window lengths (i.e., 0.5, 1.0, 1.5, and 2.0 s) on the 400 noisy sound-event signals of the four event classes, using cross-validation.

The experimental results are plotted in Figure 5, where the block size varies from 0.5 to 2.0 s in 0.5 s increments. The MSVM model with a 1.5 s block size yielded the best sound-event classification, at 100% accuracy. The sliding window benefits the SVM in learning an unknown sound event by generating a set of blocks from the observed event, which can be regarded as a set of observed events. As a result, a set of sound-event characteristics was computed for each block (i.e., *O<sub>i</sub>*|*β*, *w<sub>i</sub>* in Equation (24)).

**Figure 5.** Classification performance of the original and combination source MSVM with various block sizes.

A window of optimal length captures the signature of the sound event; if the window is too short, the encoded features deviate from the character of the sound event. In addition, the mean supervector is computed from the features of all blocks and can be regarded as the mean of the probability distribution of the features. This mean supervector gives the MSVM an advantage over the conventional SVM in reducing misclassifications. Hence, the STFT window was set to 1.5 s in all experiments.
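The block segmentation and mean-supervector computation described above can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB implementation; the default per-block feature used here (a magnitude spectrum) is a hypothetical placeholder for the actual features of Section 3.2.2:

```python
import numpy as np

def mean_supervector(signal, fs, win_len=1.5, feature_fn=None):
    """Cut a sound-event signal into fixed-length blocks and average
    the per-block feature vectors into a single mean supervector."""
    if feature_fn is None:
        # Placeholder feature: magnitude spectrum of the block.
        feature_fn = lambda b: np.abs(np.fft.rfft(b))
    block = int(win_len * fs)  # 1.5 s was the optimal block size (Figure 5)
    blocks = [signal[i:i + block]
              for i in range(0, len(signal) - block + 1, block)]
    feats = np.stack([feature_fn(b) for b in blocks])
    # Mean over blocks ~ mean of the feature probability distribution.
    return feats.mean(axis=0)
```

The supervector for each training signal would then be passed to the MSVM for learning.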

### 3.2.2. Determination of Sound-Event Features

Each sound-event signal was encoded with three features: Mel frequency cepstral coefficients (MFCCs), short-time energy (STE), and short-time zero-crossing rate (STZCR). MFCCs are frequency-domain features evaluated on a scale resembling human hearing (i.e., logarithmic frequency perception). STE is the total spectrum power of an observed event. STZCR counts the sign changes of the signal amplitude within a frame, i.e., *STZCR* = (1/(*T* − 1)) Σ<sub>*t*=1</sub><sup>*T*−1</sup> 𝟙{*s<sub>t</sub>s<sub>t−1</sub>* < 0}, where 𝟙{*s<sub>t</sub>s<sub>t−1</sub>* < 0} is 1 if the condition is true and 0 otherwise. The STZCR features of the four sound-event classes are illustrated in Figure 6.
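The STZCR of a single frame follows directly from this definition; as a sketch (Python/NumPy illustration, with a function name of our choosing):

```python
import numpy as np

def stzcr(frame):
    """Short-time zero-crossing rate:
    STZCR = (1/(T-1)) * sum_{t=1}^{T-1} 1{ s_t * s_{t-1} < 0 }."""
    s = np.asarray(frame, dtype=float)
    # A sign change between consecutive samples makes their product negative.
    crossings = np.sum(s[1:] * s[:-1] < 0)
    return crossings / (len(s) - 1)
```

For example, `stzcr([1, -1, 1, -1])` returns 1.0, since every consecutive pair of samples crosses zero.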

**Figure 6.** STZCR patterns of four sound-events (**a**–**d**).

The STZCR feature yields a distinct pattern for each of the four sound-event classes, differing in both shape and data range. Similarly, the MFCCs and STE features extract distinctive patterns for all event classes, except between door knocking and footsteps, as illustrated in Figure 7.

**Figure 7.** MFCCs (**a**) and STE (**b**) patterns of door knocking and footstep.

Figure 7 compares the characteristics of similar sound events, namely door knocking and footsteps. MFCCs and STE features were used to illustrate the event patterns: Figure 7a shows the first five orders of the MFCC features for door knocking and footsteps, while the STE features are shown in Figure 7b.

The proposed method separated the six categories of mixtures and then classified each estimated sound-event signal into its corresponding class. The classification results of the six categories are presented as confusion matrices below:

The classification performance of the proposed method was measured by Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F1-score = 2 × (Precision × Recall)/(Precision + Recall), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The Precision, Recall, and F1-score were 0.7667, 0.7731, and 0.7699, respectively.
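These metrics follow directly from the confusion-matrix counts; a minimal Python sketch (the function and variable names are ours):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only (not the paper's data):
precision_recall_f1(3, 1, 1)  # → (0.75, 0.75, 0.75)
```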

Each feature represents unique characteristics of an individual sound event. The features were therefore combined into seven cases to explore their influence on the MSVM classifiers (i.e., {(MFCC), (STE), (STZCR), (MFCC, STE), (MFCC, STZCR), (STE, STZCR), (MFCC, STE, STZCR)}).
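The seven cases are simply the non-empty subsets of the three features, which can be enumerated as follows (Python sketch):

```python
from itertools import combinations

FEATURES = ("MFCC", "STE", "STZCR")

# All non-empty subsets of the three features: 2^3 - 1 = 7 cases.
cases = [combo
         for r in range(1, len(FEATURES) + 1)
         for combo in combinations(FEATURES, r)]
```

Each case would then train its own MSVM model for comparison.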

As shown in Figure 8, the MSVM model trained on MFCCs and STZCR yielded the best classification accuracy at 100%, with less deviation than the other cases. Therefore, the separated signals were classified by the proposed MSVM method using the MFCC and STZCR vectors and the 1.5 s window. The computational complexity of the proposed method was analyzed in two steps. First, the adaptive *L*1-SCMF method is NP-hard; its complexity depends on the spectral basis (*m*), the temporal code (*n*), and the phase information, which rely on the number of components (*k*). Thus, the Big-O of the separation step is *O*(*mnk*<sup>2</sup>). The MSVM step is a polynomial algorithm with Big-O of *O*(*n*<sup>3</sup>). Therefore, the overall computational complexity of the proposed method is *O*(*mnk*<sup>2</sup>). All experiments were performed on a PC with an Intel® Core™ i7-4510U CPU @ 2.00 GHz and 8 GB RAM, using MATLAB as the programming platform.

**Figure 8.** Classification performances of multi-class MSVM of various sets of features and length of event signal.

### 3.2.3. Performance of MSVM Classifier

The MSVM-classifier performance is reported as the percentage of correctly classified sound events. The 240 separated signals of the four classes produced by the proposed separation method were individually identified by the MSVM classifier.

Figure 9 compares the classification performance on the four classes of individual sound events. The best classification accuracy was achieved for the door-open event, followed by footsteps, door knocking, and speech. The classification results for the mixed sound events are illustrated in Figure 10, where the MSVM model delivered its highest performance on the door-open event, at 84% accuracy.

Across the above experiments, the proposed method yields an average classification accuracy of 76.67%. The MSVM method can discriminate and classify the mixed event signals with high accuracy (i.e., the mixtures of door open with door knocking and of door knocking with speech were correctly classified with above 80% accuracy). This is because the MFCC and STZCR features of the individual events exhibit clearly distinguishable patterns, as shown in the example STZCR plots in Figure 6. Although the SDR scores of the separated door-open and door-knocking signals were relatively low (as given in Figure 3), the MSVM still yielded its highest classification accuracy for the door open with door knocking mixture (DO + DK). In general, interference remaining in the separated event signals causes the extracted MFCC and STZCR vectors to deviate from their original sound-event vectors, which accounts for the remaining misclassifications.

**Figure 9.** Average percentage of classification accuracy from the perspective of event group of the proposed NSSEC method.

**Figure 10.** Classification performance of NSSEC model with 1.5 s block size.
