#### *4.3. Features*

In ML, features are individual measurable properties of an observed phenomenon [42]. Selecting informative, independent, and discriminating features is a crucial step in classification or regression. The 45 features employed in this study are shown in Table 1. The feature set includes low-level signal properties (f1–f9) and Mel-frequency cepstral coefficients (MFCCs) (f10–f45) [27].

Table 1 defines the low-level signal property features (f1–f9). *N* is the number of samples in one segment; *k* refers to the *k*th sample point; *x* is the time-series signal; *X* denotes the Fourier transform (FT) spectrum; *sign*( ) is the sign function; *TH* is the threshold, which takes the value of 0.85 in the definition of f6; and *P*(*k*), which appears in the definition of f8, is the probability distribution of the power spectrum *S*(*k*) = *X*(*k*)<sup>2</sup>. Moreover, MFCCs are features commonly used in speech and speaker recognition [38]. In this study, the first 12 MFCC coefficients (f10–f21) were used to obtain more information from the audio segments. Because the audio signals vary intermittently, it is necessary to add features describing the change of cepstral characteristics over time [43]. Therefore, the first- and second-order derivatives of the first 12 MFCCs (f22–f33 and f34–f45) were also calculated.
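For concreteness, the following is a minimal sketch of how the MFCC-related features (f10–f45) could be extracted for one segment using the librosa library; the sampling rate and the per-segment averaging of each coefficient are our assumptions, as the paper does not specify the implementation.

```python
# A sketch of extracting features f10–f45 for one audio segment.
# Assumptions: librosa is used, sr = 44100 Hz, and each coefficient
# is averaged over the frames of the segment.
import librosa
import numpy as np

def mfcc_features(segment: np.ndarray, sr: int = 44100) -> np.ndarray:
    """Return 36 MFCC-related features: the first 12 MFCCs plus their
    first- and second-order derivatives, averaged over time."""
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=12)  # f10–f21
    delta1 = librosa.feature.delta(mfcc, order=1)             # f22–f33
    delta2 = librosa.feature.delta(mfcc, order=2)             # f34–f45
    return np.concatenate(
        [mfcc.mean(axis=1), delta1.mean(axis=1), delta2.mean(axis=1)])
```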


**Table 1.** Features used in this study.

#### *4.4. Feature Selection Based on IG*

During data analysis, hundreds of features may be generated, many of which are redundant or irrelevant to the data mining task. Retaining these irrelevant features wastes vast amounts of computation time and degrades the prediction results. Although experts in relevant fields can select the useful features manually, this is a challenging and time-consuming task, especially when the characteristics of the dataset are not well known. The goal of feature selection is to find a minimum set of features such that the prediction results are as close as possible to (or better than) those obtained with the original feature set.

In this study, we employed IG as an index for feature selection. IG is a feature evaluation method based on entropy and is widely used in the field of ML [44]. In feature selection, IG quantifies the information that a feature provides for the classification task, and it measures the importance of a feature as:

$$IG(S, a) = E(S) - E(S|a),\tag{3}$$

where *IG*(*S*, *a*) is the information gain of feature *a* with respect to the feature set *S*; *E*(*S*) is the entropy of the feature set without any change; and *E*(*S*|*a*) is the conditional entropy of the feature set given feature *a*. The conditional entropy *E*(*S*|*a*) can be written as:

$$E(S|a) = \sum\_{v \in a} \frac{|S\_a(v)|}{|S|} E(S\_a(v)), \tag{4}$$

where |*S<sub>a</sub>*(*v*)|/|*S*| is the proportion of samples for which feature *a* takes the value *v* ∈ *a*, and *E*(*S<sub>a</sub>*(*v*)) is the entropy of the sample group in which *a* has the value *v*. The greater the value of *IG*(*S*, *a*), the more critical *a* is for the classification model.
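As an illustration, the following sketch computes Equations (3) and (4) for a single feature; discretizing a continuous feature into bins is our assumption, since the paper does not state how continuous features are handled.

```python
# A sketch of information gain, IG(S, a) = E(S) - E(S|a), using numpy only.
# Binning a continuous feature into 10 bins is an assumption.
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy E(S) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature: np.ndarray, labels: np.ndarray,
                     bins: int = 10) -> float:
    """IG(S, a) = E(S) - E(S|a), with feature a discretized into bins."""
    values = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
    cond_entropy = 0.0
    for v in np.unique(values):  # sum over v in a, Equation (4)
        mask = values == v
        cond_entropy += mask.mean() * entropy(labels[mask])  # |S_a(v)|/|S| * E(S_a(v))
    return entropy(labels) - cond_entropy
```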

#### *4.5. Multi-Classification Model for Vehicle Interior Noise Based on XGBoost*

XGBoost was designed based on gradient boosted decision trees [45]. We chose XGBoost due to its computation speed and model performance, which have been verified by a previous study [22]. As an ensemble model of decision trees, the definition of the XGBoost model can be written as:

$$\hat{y}\_i = \sum\_{k=1}^{K} f\_k(x\_i), \tag{5}$$

where *K* is the total number of decision trees, *f<sub>k</sub>* is the *k*th decision tree, and *ŷ<sub>i</sub>* is the prediction result for sample *x<sub>i</sub>*. The cost function with a regularization term is given by [45]:

$$L(f) = \sum\_{i=1}^{n} l(\hat{y}\_i, y\_i) + \sum\_{k=1}^{K} \Omega(f\_k), \tag{6}$$

with:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2, \tag{7}$$

where *T* is the number of leaves of the classification tree *f*, and *w* is the vector of leaf scores. The complexity penalty with coefficient γ (on the number of leaves) and the ridge regularization with coefficient λ (on the leaf scores) work together to control the complexity of the model. By expressing the objective function as a second-order Taylor expansion, the objective function at step *t* can be written as [46]:

$$L(f) \approx \sum\_{i=1}^{n} \left[ l(\hat{y}\_i, y\_i) + g\_i f\_t(x\_i) + \frac{1}{2} h\_i f\_t^2(x\_i) \right] + \Omega(f\_t), \tag{8}$$

where $g_i = \partial_{\hat{y}} l(\hat{y}_i, y_i)$ and $h_i = \partial_{\hat{y}}^2 l(\hat{y}_i, y_i)$ are the first- and second-order derivatives of the loss function. By removing the constant term, the approximation of the objective at step *t* is obtained:

$$\hat{L}(f) = \sum\_{i=1}^{n} \left[ g\_i f\_t(x\_i) + \frac{1}{2} h\_i f\_t^2(x\_i) \right] + \Omega(f\_t). \tag{9}$$

By expanding the regularization term Ω and defining *Ij* as the instance set at leaf *j*, Equation (9) can be rewritten as [47]:

$$\hat{L}(f) = \sum\_{j=1}^{T} \left[ \left( \sum\_{i \in I\_j} g\_i \right) w\_j + \frac{1}{2} \left( \sum\_{i \in I\_j} h\_i + \lambda \right) w\_j^2 \right] + \gamma T. \tag{10}$$

By rewriting the objective function as a quadratic function of each leaf score *w*, the optimal *w* and the corresponding value of the objective function are easily obtained. In XGBoost, the gain is used for splitting decision trees:

$$G\_j = \sum\_{i \in I\_j} g\_i, \tag{11}$$

$$H\_j = \sum\_{i \in I\_j} h\_i, \tag{12}$$

$$\text{gain} = \frac{1}{2} \left[ \frac{G\_L^2}{H\_L + \lambda} + \frac{G\_R^2}{H\_R + \lambda} - \frac{\left(G\_L + G\_R\right)^2}{H\_L + H\_R + \lambda} \right] - \gamma, \tag{13}$$

where the first and second terms are the scores of the left and right child nodes, respectively; the third term is the score without splitting; and γ is the complexity cost of adding a new split. Despite the serial relationship between adjacent trees, the nodes at a given level can be split in parallel, which gives XGBoost a faster training speed.
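To make the splitting criterion concrete, the sketch below evaluates Equations (11)–(13) for one candidate split; the variable names and the boolean-mask representation of the left child are illustrative, not taken from the XGBoost source.

```python
# A sketch of the split gain in Equation (13): g and h are the first- and
# second-order gradients of the instances at a node, and `left` is a
# boolean mask selecting the instances sent to the left child.
import numpy as np

def split_gain(g: np.ndarray, h: np.ndarray, left: np.ndarray,
               lam: float = 1.0, gamma: float = 0.0) -> float:
    """Gain of splitting a node into a left part and its complement."""
    G_L, H_L = g[left].sum(), h[left].sum()    # Equations (11) and (12)
    G_R, H_R = g[~left].sum(), h[~left].sum()
    score = (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
             - (G_L + G_R)**2 / (H_L + H_R + lam))
    return 0.5 * score - gamma                 # Equation (13)
```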

#### **5. Results and Discussions**

In general, the parameters of an ML model can significantly impact its performance, and XGBoost is no exception. Through extensive testing and observation, we set the critical parameters of this model as follows: maximum depth of the tree (max\_depth) = 6; learning rate (eta) = 0.01; minimum sum of instance weight needed in a child (min\_child\_weight) = 1; subsample ratio of the training instance (subsample) = 1; fraction of features (columns) to use (colsample\_bytree) = 1. The ratio between the training dataset and the test dataset was set to 0.8/0.2 in this study.
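A minimal sketch of this setup using the xgboost scikit-learn wrapper is shown below; the synthetic stand-in data and the random seed are our assumptions, while the listed hyperparameters and the 0.8/0.2 split follow the text.

```python
# A sketch of the model configuration described above; the synthetic
# stand-in data and random seed are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 45))    # stand-in for the 45 segment features
y = rng.integers(0, 5, size=1000)  # stand-in for the 5 noise classes

# 0.8/0.2 split between training and test data, as stated above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = XGBClassifier(
    max_depth=6,            # maximum depth of a tree
    learning_rate=0.01,     # eta
    min_child_weight=1,     # minimum sum of instance weight in a child
    subsample=1.0,          # subsample ratio of the training instances
    colsample_bytree=1.0)   # fraction of features (columns) per tree
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```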

#### *5.1. Optimal Time Window Size and Data Balance*

We divided the audio signals collected from the test line into segment sequences with different time window sizes. Figure 6 presents the Shannon entropies calculated under different time window sizes. The Shannon entropy remains relatively stable as the time window size increases from 0.1 (10<sup>−1</sup>) to 1.58 (10<sup>0.2</sup>) s, after which it decreases dramatically, reaching its maximum at a window size of 1.58 s. According to the maximum Shannon entropy hypothesis, the optimal time window size is therefore 1.58 s. However, we kept a relatively small window in our study to avoid a situation in which one window contains different vehicle interior noise events. Therefore, we set the time window size to 1 s.

**Figure 6.** Entropy at different time window sizes.

Using SMOTE, we increased the number of samples in each of the four minority classes to match that of 'Broadcast'. The performance of the multi-classification model trained with balanced and unbalanced data was then compared. Table 2 reports the comparison in terms of precision, recall, and F1 score; 'Support' in this table means the total number of occurrences of each category. Data balancing increased the precision of 'Broadcast' and decreased its recall. In contrast, it decreased the precision and increased the recall of the minority classes, namely 'Beep', 'Rumble', 'Squeal', and 'Other noises'. Meanwhile, the F1 scores dropped slightly after data balancing, except for the classes 'Beep' and 'Squeal'.
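A minimal sketch of this balancing step, assuming the imbalanced-learn implementation of SMOTE, is shown below; X_train and y_train follow the split from the previous sketch.

```python
# A sketch of balancing the training data with SMOTE, assuming the
# imbalanced-learn package; only the training split is resampled.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)  # default: oversample all minority classes
X_balanced, y_balanced = smote.fit_resample(X_train, y_train)
# After resampling, each minority class ('Beep', 'Rumble', 'Squeal',
# 'Other noises') has as many samples as the majority class ('Broadcast').
```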


**Table 2.** Classification reports of test results.

We also employed confusion matrices to describe the performance before and after the training data were balanced, as shown in Figure 7. These matrices provide insights into the errors made by the classification model and distinguish the types of errors. For instance, the matrices show that 'Squeal' is commonly mislabeled as 'Broadcast', and 'Rumble' as 'Other noises'. One can also notice that data balancing improves the identification performance for minority classes such as 'Beep', 'Rumble', and 'Squeal'. 'Squeal' and 'Rumble' are strongly related to vehicle-track conditions, which are a major concern in our research, so it is desirable to detect all 'Squeal' and 'Rumble' events. We therefore balanced the training dataset via SMOTE to improve the recall of 'Squeal' and 'Rumble', despite the slight decrease in precision.

**Figure 7.** Confusion matrices of test results: (**a**) Model trained with unbalanced data; (**b**) Model trained with balanced data.

#### *5.2. Feature Selection Based on the Importance Score*

The importance score was calculated explicitly for each feature by using the built-in feature importance property of the XGBoost algorithm. These scores indicate how useful each feature was in the construction of the model and allow the features to be ranked and compared with each other. In addition, a mutual information-based feature selection method was used to verify the results of the importance-based method. In contrast to the importance score, the calculation of mutual information does not depend on the classifier; it considers only the statistical characteristics of the input features and target variables.
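Both rankings can be obtained with standard library calls; the sketch below assumes the model and data from the earlier training sketch and is illustrative rather than the authors' exact code.

```python
# A sketch of the two feature rankings, assuming the `model`, `X_train`,
# and `y_train` objects from the previous sketch.
from sklearn.feature_selection import mutual_info_classif

# Gain-based importance scores from the trained XGBoost booster;
# features absent from the dict were never used in a split.
gain_importance = model.get_booster().get_score(importance_type="gain")

# Classifier-independent mutual information between features and labels.
mi = mutual_info_classif(X_train, y_train, random_state=42)
```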

In our classification model, 45 initial features were considered. Figure 8a shows the feature importance scores calculated by gain [45]. The importance scores of different features vary greatly, ranging from 0 to 378. The spectral centroid, denoted as f4, ranks first. In contrast, the importance score of f2, the root mean square (RMS) of the segments, equals zero, which means that it was not used during the training process. Figure 8a also shows that the low-order features and the first 12 MFCCs are essential in the classification task. The results of the feature importance analysis indicate that the contribution of different features to the model varies greatly. Thus, feature selection is necessary to improve the performance of the model and the speed of the calculations. Figure 8c shows the results for the 45 features calculated by the mutual information-based method. The mutual information of these features follows a trend similar to that of the importance score. However, the importance scores of some features differ greatly from their mutual information values. For example, the importance score of feature f2 is 0, but its mutual information ranks fifth among all 45 features. The reason is that mutual information, which considers only the features and target variables, cannot reflect whether a feature was actually used in building the classification model.

**Figure 8.** Illustration of feature selection based on different methods: (**a**) importance score of all the features; (**b**) importance score of the top 20 features; (**c**) mutual information of all the features; (**d**) mutual information of the top 20 features; (**e**) comparison of results of the two feature selection methods.

First, all 45 features were sorted in descending order of importance score and of mutual information, respectively. Figure 8b,d show the histograms of the top 20 features under each ranking. We then constructed 20 feature sets incrementally with the top 1, top 2, ..., and top 20 features. The classification results with the different feature sets were compared, as shown in Figure 8e. There, the weighted macro average F1 score, *F*1*<sub>wm</sub>*, was used to evaluate the performance of the multi-classification model, and it is defined as follows:

$$F1\_{wm} = \frac{\sum\_{i=1}^{N} F1\_i \times w\_i}{N}, \tag{14}$$

where *N* is the total number of classes (*N* = 5 in this study); *F*1*<sub>i</sub>* is the F1 score of the *i*th class; and *w<sub>i</sub>* is the weight of the *i*th class, with $\sum_{i=1}^{N} w_i = N$. Because this study mainly focuses on 'Squeal' and 'Rumble', we set both of their weights to 1.3, and the weights of 'Other noises', 'Beep', and 'Broadcast' to 0.8. The value of *F*1*<sub>wm</sub>* varies from 0 to 1; the closer it is to 1, the better the model performs. The red line in Figure 8e corresponds to the classification results of the 20 feature sets constructed by the mutual information-based feature selection method, and the blue line to those of the feature importance-based method. The results in Figure 8e show that *F*1*<sub>wm</sub>* for both feature selection methods increased rapidly as the feature set expanded from the top 1 to the top 8 features, after which it remained stable. The comparison of the two methods indicates that the mutual information-based method performed better when the number of selected features was less than 4, whereas the importance-based method performed better as the feature set expanded from the top 4 to the top 11 features. Beyond that, further increasing the number of selected features caused no obvious difference between the performances of the two methods. Based on this analysis, the set of the top 10 features selected by the importance-based method was employed in this study, for which *F*1*<sub>wm</sub>* reached 0.91.
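A minimal sketch of Equation (14) with the weights used in this study is shown below; the ordering of the classes in the weight vector is an assumption.

```python
# A sketch of the weighted macro average F1 score, Equation (14).
# Assumed class order: Broadcast, Beep, Squeal, Rumble, Other noises.
import numpy as np
from sklearn.metrics import f1_score

# Weights: 1.3 for 'Squeal' and 'Rumble', 0.8 for the other three
# classes, so that they sum to N = 5.
weights = np.array([0.8, 0.8, 1.3, 1.3, 0.8])

def weighted_macro_f1(y_true, y_pred, w=weights):
    f1_per_class = f1_score(y_true, y_pred, average=None)  # F1_i per class
    return np.sum(f1_per_class * w) / len(w)               # Equation (14)
```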

#### *5.3. Comparisons with Other Methods*

To validate the performance and execution speed of the XGBoost-based classifier used in our study, we compared it with other commonly used classifiers, including the K-nearest neighbors, decision tree, random forest, gradient boost, extra trees, AdaBoost, and artificial neural network (ANN) classifiers. All classifiers were run on the same computer with the same training and testing datasets. Table 3 shows the comparison results in terms of *F*1*<sub>wm</sub>* and running time. The *F*1*<sub>wm</sub>* value of the gradient boost classifier ranked first at 0.925. However, training and testing the gradient boost classifier also consumed the longest running time, 340.31 s, approximately 22 times longer than that of the XGBoost classifier. In contrast, the K-nearest neighbors classifier had the fastest computing speed and one of the lowest *F*1*<sub>wm</sub>* values. The accuracy and precision of the different models are also provided in Table 3 and follow a trend similar to that of *F*1*<sub>wm</sub>*. The comparison shows that the XGBoost model achieves a good balance of accuracy and execution speed.

**Table 3.** Comparisons between XGBoost and other classifiers.


#### *5.4. Case Studies to Extend the Model Application Scenarios*

In this paper, we provide two case studies to extend the application scenarios. First, we conducted a statistical analysis to investigate the relationship between vehicle interior noises and the dynamic responses of the car body, using multi-source data collected by smartphones. Second, we used the proposed multi-classification model to detect abnormal interior noise events and evaluate the effect of rail grinding, in order to guide the implementation of maintenance work. Figure 9 illustrates the schematics of both case studies.

**Figure 9.** Schematics for case studies: (**a**) statistical analysis of vehicle interior noise and dynamic responses; (**b**) abnormal events detection and rail grinding effect evaluation using the XGBoost multi-classification model.

In the first case study, about 10 h of onboard monitoring data collected by smartphones were used. As shown in Figure 9a, the audio signals of the vehicle interior noise were fed into the multi-classification model established in this work. According to the classification results, the raw data were labeled into three categories: 'Squeal', 'Rumble', and 'Normal', where 'Normal' contained all events other than 'Squeal' and 'Rumble'. Statistical analyses of the dynamic responses corresponding to the different vehicle interior noises were then performed. This case study aimed to investigate the causes of the abnormal noise events and to identify solutions based on the statistical analysis results.

For 'Squeal', 'Rumble', and 'Normal', the probability distribution curves of the running speed (*v*) and vertical acceleration (*a<sub>v</sub>*) of the car body are presented in Figure 10a,b, respectively. The vehicle speed *v* used here was not measured directly but was obtained by integrating the longitudinal acceleration *a<sub>l</sub>* [47], which can be written as follows:

$$v = \int\_0^t a\_l \, dt + v\_0, \tag{15}$$

where *t* denotes the time and *v*<sub>0</sub> is the initial velocity. Since the integration begins when the subway train starts, *v*<sub>0</sub> equals 0. The probability distribution curves in Figure 10a show that 'Squeal' usually occurs at a higher running speed than 'Normal' and 'Rumble'. This suggests that the occurrence of 'Squeal' can be reduced by adjusting the operating speed of the train. In contrast, 'Rumble' occurs at a lower speed and a higher vertical vibration level than 'Squeal', as shown in Figure 10b. This phenomenon implies that the occurrence of 'Rumble' is related to the resonance of the car body, which may be avoided by optimizing the car body structure.
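A minimal sketch of Equation (15) for uniformly sampled acceleration data is shown below; the sampling-rate parameter and function name are illustrative.

```python
# A sketch of recovering running speed from the longitudinal
# acceleration via Equation (15); fs is the assumed sampling rate.
import numpy as np
from scipy.integrate import cumulative_trapezoid

def speed_from_acceleration(a_l: np.ndarray, fs: float,
                            v0: float = 0.0) -> np.ndarray:
    """Integrate the longitudinal acceleration over time.
    v0 = 0 because recording starts while the train is at rest."""
    t = np.arange(len(a_l)) / fs
    return cumulative_trapezoid(a_l, t, initial=0) + v0
```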

The schematic of the second case study is presented in Figure 9b. The test interval selected in this study was between two adjacent stations with a length of 1631 m. The track alignment of the test interval is presented in the upper plot of Figure 11a. There are three curves in the test interval, the radii of which are 1200 m, 800 m, and 800 m. This case study aimed to test the capacity of this model for identifying abnormal noise events, evaluating the effect of rail grinding, and providing information relevant to designing a future maintenance plan.

**Figure 10.** Statistical analysis of vehicle interior noise and dynamic responses: (**a**) The probability distribution curves of running speed (*v*); (**b**) The probability distribution curves of vertical acceleration (*av*).

**Figure 11.** Abnormal events detection and rail grinding effect evaluation using the XGBoost multi-classification model: (**a**) track alignments of the test section and the identification results before and after rail grinding; (**b**) the surface roughness of the rail before and after rail grinding.

The authors first collected multi-source data with the onboard smartphone on 2 August 2019. The results of the multi-classification model are depicted in the lower plot of Figure 11a with a blue line. The results indicate that 'Squeal' occurred at 580–890 m, 910–1040 m, and 1320–1370 m. It can be seen from the figure that the sections where 'Squeal' occurred overlap strongly with the curved sections, especially the curves with a radius of 800 m. According to the classification results and design information, we can draw a preliminary conclusion that sharp curves are the main cause of 'Squeal'. The results also indicate the need for rail grinding or other corresponding maintenance measures.

Then, a scheduled rail grinding of the test interval was performed on 21 August 2019. The surface roughness of the rail before and after grinding, presented in Figure 11b, indicates that rail grinding effectively reduced the roughness of the rail surface. Because reducing rail roughness, that is, the unevenness of the rail tread, improves the wheel-rail contact, rail grinding is a common measure for eliminating abnormal noise and vibration in subway trains.

Another onboard test was conducted on 1 October 2019 to verify the effects of the maintenance work. The corresponding classification results after rail grinding are displayed in red in the lower plot of Figure 11a. After grinding, the 'Squeal' at 580–890 m and 1320–1370 m was eliminated, whereas the 'Squeal' at 910–1040 m remained. The results illustrate that rail grinding effectively eliminated 'Squeal' at circular curves. Nevertheless, it showed no apparent effect on the occurrences at transition curves and straight sections, which indicates that other factors lead to 'Squeal' in these sections. Thus, future maintenance work should focus on the section from 910 to 1040 m. This case study demonstrates the potential of the proposed multi-classification model for evaluating the effect of rail grinding and providing additional information about track conditions for further rail maintenance planning.
