## *2.1. System Overview*

Figure 1 shows the overall architecture of the proposed acoustic scene classification system. The sound signal is captured from the environment through a microphone, digitized using an ADC in the preprocessing step, and fed into the feature extraction stage. The MFCC and 1D-LTP features are extracted from the digital sound signal, fused into a joint feature vector, and finally classified using an SVM classifier. The main processing steps of the proposed system are discussed as follows.

**Figure 1.** System Architecture for Acoustic Scene Classification.

## *2.2. Feature Extraction*

#### 2.2.1. 1-D Local Ternary Patterns

Local binary patterns (LBPs) have been investigated as among the most prominent feature descriptors in the field of computer vision and image analysis [25]. The basic idea behind LBP is to compare each pixel of an image with its neighborhood. Each comparison of an image pixel with one of its neighbors results in a binary value, '0' or '1'. This summarizes the local structure of an image and yields powerful feature descriptors for a number of promising applications, such as face recognition [26] and texture analysis [27]. LBPs are invariant to monotonic grey-scale changes and have low computational cost [28]. Applying the LBP method to 1-D signals such as sound yields useful information about the local temporal dynamics of the sound. LBPs produce discriminative features for several kinds of sounds, as exhibited by works on music genre recognition [29] as well as environmental sound classification [1]. However, LBPs are highly affected by noise and fluctuations in acoustic samples [1]. To further improve the discriminative power of LBP, LTPs were proposed for face recognition in 2010 [30] and were later applied in a number of works [31–33]. In contrast to LBPs, which encode only the 'greater than' or 'less than' relationship between a pixel and its neighbor, LTPs reflect the 'greater than', 'equal to', or 'less than' relationships. Under the same sampling conditions, LTPs therefore yield more discriminative and sophisticated sound features than 1D-LBPs.

The analog audio signal is first digitized with sampling frequency *Fs* to form a discrete signal *X*[*i*] having *N* samples. The 1D-LTPs of the sampled signal *X*[*i*] are computed using a sliding-window approach. Consider a signal sample *x*[*i*] with amplitude *α* placed at the center of a window of size *P* + 1. The upper and lower amplitude thresholds are defined as (*α* + *t*) and (*α* − *t*), respectively, where *t* is an arbitrary constant. From the amplitudes of the signal samples that lie in the window, a ternary code vector *F* of size *P* is obtained, whose individual values are computed as:

$$F[j] = Q(x[i + \frac{P}{2} - r]), \forall j \in \{0, \dots, P - 1\}, \tag{1}$$

$$r = \left\{ \begin{array}{ccc} j & j < \frac{P}{2} \\ j+1 & j \ge \frac{P}{2} \end{array} \right\},\tag{2}$$

where *Q*(*x*[*i*]) is defined as;

$$Q(x[i]) = \left\{ \begin{array}{ll} 1, & x[i] > (\alpha+t) \\ 0, & (\alpha-t) \le x[i] \le (\alpha+t) \\ -1, & x[i] < (\alpha-t) \end{array} \right\}.\tag{3}$$

From the ternary code vector, the upper and lower local ternary patterns are computed as:

$$LTP\_{upper}[i] = \sum\_{k=0}^{P-1} s\_u(F[k]) \cdot 2^k,\tag{4}$$

$$LTP\_{lower}[i] = \sum\_{k=0}^{P-1} s\_l(F[k]) \cdot 2^k,\tag{5}$$

where,

$$s\_u(F[k]) = \left\{ \begin{array}{ll} 1 & F[k] = 1 \\ 0 & \text{otherwise} \end{array} \right\},\tag{6}$$

$$s\_l(F[k]) = \left\{ \begin{array}{ll} 1 & F[k] = -1 \\ 0 & \text{otherwise} \end{array} \right\}.\tag{7}$$
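As a minimal NumPy sketch of Equations (1)–(7): the function name `ltp_1d`, the neighborhood size *P* = 8, and the threshold *t* = 0.05 below are illustrative assumptions, not values prescribed in this work.

```python
import numpy as np

def ltp_1d(x, P=8, t=0.05):
    """Per-sample upper/lower 1D-LTP codes of signal x (Eqs. (1)-(7))."""
    half = P // 2
    n = len(x)
    upper = np.zeros(n, dtype=np.int64)
    lower = np.zeros(n, dtype=np.int64)
    weights = 2 ** np.arange(P)               # the 2^k terms of Eqs. (4)-(5)
    # Skip the borders where the full (P+1)-sample window does not fit.
    for i in range(half, n - half):
        alpha = x[i]                          # center-sample amplitude
        F = np.empty(P, dtype=np.int8)        # ternary code vector
        for j in range(P):
            r = j if j < half else j + 1      # Eq. (2): skip the center sample
            xn = x[i + half - r]
            if xn > alpha + t:                # Eq. (3): ternary quantization
                F[j] = 1
            elif xn < alpha - t:
                F[j] = -1
            else:
                F[j] = 0
        upper[i] = np.sum((F == 1) * weights)   # Eqs. (4), (6)
        lower[i] = np.sum((F == -1) * weights)  # Eqs. (5), (7)
    return upper, lower
```

The per-sample codes are then pooled into a fixed-length descriptor; histogramming the upper and lower codes separately is one common choice for LTP-style features.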

Figure 2 illustrates the extraction of 1D-LTP features for one sample of a discrete audio signal.

**Figure 2.** Extraction of 1D-LTP features.

#### 2.2.2. Mel-Frequency Cepstral Coefficients (MFCC)

MFCCs are a baseline feature set that has been widely used in the analysis of audio signals. Although originally designed for speech recognition [34,35], they have been a popular choice of feature in automatic scene classification [36,37]. The MFCCs are the coefficients that collectively make up the Mel-Frequency Cepstrum (MFC), a representation of the short-term power spectrum of a sound based on a linear cosine transform of the log power spectrum on a nonlinear Mel scale of frequency. The MFCC filter banks are linearly spaced on the Mel frequency scale, which closely approximates the response of the human auditory system. Such a representation of the sound signal extracts discriminant features that help to achieve environmental sound classification with good accuracy.

Figure 3 shows a standard pipeline for the extraction of MFCC features. In the first step, the digitized sound signal is segmented into short frames, each having *N* samples. Next, the periodogram-based power spectrum is estimated for each frame. Let *si*(*n*) denote the time-domain signal (of *N* samples) that belongs to frame *i*; its Discrete Fourier Transform (DFT) is calculated as:

$$S\_i(k) = \sum\_{n=1}^{N} s\_i(n)h(n)e^{-j2\pi kn/N}, \ 1 \le k \le K \tag{8}$$

where *K* denotes the length of the DFT and *h*(*n*) denotes the *N*-sample-long analysis window. In this work, a Hamming window is used to realize a high-pass FIR filter that emphasizes the high-frequency part of the signal and removes the DC content. In the next step, the output of the complex Fourier transform is magnitude-squared, and the power spectral estimate of frame *i* is computed as:

$$P\_i(k) = \frac{1}{N} \left| S\_i(k) \right|^2.\tag{9}$$

**Figure 3.** MFCC Feature Extraction Pipeline.
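The framing, windowing, and periodogram steps of Equations (8) and (9) can be sketched as follows, assuming NumPy; the 25 ms frame length, 10 ms hop, and 512-point DFT are illustrative choices (for 16 kHz audio), not values taken from this work.

```python
import numpy as np

def frame_power_spectrum(x, frame_len=400, hop=160, n_fft=512):
    """Per-frame periodogram estimate of signal x (Eqs. (8)-(9))."""
    window = np.hamming(frame_len)            # analysis window h(n)
    n_frames = 1 + (len(x) - frame_len) // hop
    P = np.empty((n_frames, n_fft // 2 + 1))
    for i in range(n_frames):
        s_i = x[i * hop : i * hop + frame_len] * window  # windowed frame
        S_i = np.fft.rfft(s_i, n=n_fft)                  # DFT, Eq. (8)
        P[i] = (np.abs(S_i) ** 2) / frame_len            # periodogram, Eq. (9)
    return P
```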

Then, a set of Mel-scaled filter banks is computed and applied to the power spectrum of each frame. The Mel scale is linear for frequencies below 1000 Hz and logarithmic above. To compute the filter-bank energy spectrum, each filter is multiplied by the power spectrum computed above and the resulting coefficients are summed. The Mel-filtered spectrum of frame *i* is computed as:

$$E\_i(l) = \sum\_{k=0}^{N-1} P\_i(k) H\_l(k), \ \forall l = 1, \cdots, L \tag{10}$$

where *L* denotes the total number of filters and *Hl* denotes the transfer function of the *l*th filter. Next, the logarithm of the Mel-filtered energy spectrum is computed and the Discrete Cosine Transform (DCT) is applied to it. Mathematically,

$$E\_i^{'}(l) = \log(E\_i(l)), \ \forall l = 1, \cdots, L \tag{11}$$

$$x\_i(n) = \sum\_{l=1}^{L} E\_i^{'}(l) \cos\left(\frac{\pi n(l - 0.5)}{L}\right) \tag{12}$$

where *n* = 1, ··· , *L* indexes the cepstral coefficients. In the proposed framework, the first 13 MFCCs are used for scene classification.
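The remaining steps, Equations (10)–(12), can be sketched as below, continuing from `frame_power_spectrum` above. The sampling rate, the number of filters *L* = 26, and the triangular shape of the filters *Hl* are illustrative assumptions; this work only fixes the number of retained coefficients (13).

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: approximately linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power(P, fs=16000, n_fft=512, L=26, n_ceps=13):
    """Mel filter-bank energies, log compression, and DCT (Eqs. (10)-(12))."""
    # Triangular filters whose centers are equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), L + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    H = np.zeros((L, n_fft // 2 + 1))
    for l in range(1, L + 1):
        left, centre, right = bins[l - 1], bins[l], bins[l + 1]
        H[l - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        H[l - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    E = P @ H.T                     # Eq. (10): energy in each of the L filters
    E_log = np.log(E + 1e-10)       # Eq. (11): log of the filter-bank energies
    n = np.arange(1, n_ceps + 1)    # retain the first 13 coefficients
    l_idx = np.arange(1, L + 1)
    basis = np.cos(np.pi * np.outer(n, l_idx - 0.5) / L)
    return E_log @ basis.T          # Eq. (12): MFCCs, one row per frame
```

Chained together, `mfcc_from_power(frame_power_spectrum(x))` yields a 13-coefficient MFCC vector per frame.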

## *2.3. Feature Fusion*

The 1D-LTP and MFCC features extracted above are fused to form a joint feature vector for classification. The fusion of 1D-LTP and MFCC features yields a more sophisticated feature representation that has better discriminative properties as well as an accurate representation in the frequency domain. The fusion process is a simple serial concatenation of the 1D-LTP and MFCC feature vectors:

$$\mathbf{F} = \mathbf{f}\_{\mathrm{LTP}} \, \| \, \mathbf{f}\_{\mathrm{MFCC}}, \tag{13}$$

where $\mathbf{f}\_{\mathrm{LTP}}$ and $\mathbf{f}\_{\mathrm{MFCC}}$ denote the 1D-LTP and MFCC feature vectors, respectively, and $\|$ denotes serial concatenation.
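As a minimal sketch of Equation (13), assuming the two descriptors are already fixed-length NumPy vectors (the names `f_ltp` and `f_mfcc` are illustrative):

```python
import numpy as np

def fuse(f_ltp, f_mfcc):
    """Eq. (13): serial concatenation of the two per-clip descriptors."""
    return np.concatenate([np.ravel(f_ltp), np.ravel(f_mfcc)])

# The fused vectors then feed the SVM classifier, e.g. with scikit-learn:
# from sklearn.svm import SVC
# clf = SVC().fit(X_train, y_train)   # X_train: fused vectors, y_train: labels
```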
