## **2. Research Methodology**

The research methodology of this study is shown in Figure 1. First, we developed an Android app that leverages the built-in sensors of onboard smartphones to collect vehicle interior noise and the corresponding dynamic responses of the car body. Second, time windows were used to segment the multi-source signals and establish the corresponding relationship between the audio and the other signals; this method effectively overcomes the difficulty caused by the different sampling frequencies of the various sensors. Third, features were generated and selected from the time and frequency domains. Fourth, an automatic classification model for train interior noise was developed using XGBoost, a tree-based method. Finally, the proposed model was validated through field experiments on the subway line.

**Figure 1.** Research methodology of this study.

## **3. Data Collection and Description**

Figure 2 shows the field test setup for data collection using Android smartphones (Huawei Honor FRD-AL00). During the test, the smartphone was placed on the cabin floor, directly above the bogie, to sense the response from the wheel-rail contact interface. In a parallel study, we verified that the differences between smartphone sensors and high-precision industrial accelerometers are acceptable, especially in the vertical direction [36]. Thus, the dynamic response signals can be considered a good record of the movement state of the car body. An app was developed to save and transmit the data to our cloud server. In the field test, three sensors were used, namely the microphone, accelerometer, and gyroscope. Considering the performance of these sensors and the characteristics of the signals, the sampling frequencies of the accelerometer and gyroscope were set to 100 Hz, and that of the microphone to 22,050 Hz.

**Figure 2.** Data collection with the smartphone.

In this study, all tests were performed on Line 7 of the Chengdu Metro, China, which is a loop subway line. Its layout is shown in Figure 3a. The line covers 38.61 km and 31 stations, and it started operations in December 2017. The trains run along the outer and inner loops, with a maximum speed of 80 km/h. Because this is a loop line, it contains a large number of curve sections (166 curves); the radius distribution of these curves is presented in Figure 3b. The high number of curves makes it challenging to maintain the track structures in good condition, and the squeal that typically occurs along the curves is one of the most significant problems.

**Figure 3.** Line 7 of the Chengdu Metro, China: (**a**) Overview; (**b**) Radius of curves.

The data used in this study were collected on 2 August 2019 and 1 October 2019, before and after rail grinding, respectively. The dataset recorded before rail grinding contained more abnormal events. The data from August were used to train and test the multi-classification model and to justify the need for rail grinding, and the data measured on the two days were compared. When training the model, we manually labeled the audio sequence into five classes: 'Other noises', 'Broadcast', 'Squeal', 'Rumble', and 'Beep'. Here, 'Broadcast' refers to the official broadcast of the subway system or passengers' voices. 'Squeal' is an intense noise generated by the relative movement between wheel and rail. 'Rumble' refers to a low, heavy sound produced when the train passes a specific area. 'Beep' is the alarm sound emitted when a door is opened or closed. 'Other noises' covers any sound that cannot be categorized into the above four classes. The time-frequency characteristics of these five classes of noise are presented in Figure 4.

**Figure 4.** Time-frequency characteristics of the five classes of vehicle interior noise.

## **4. Model Approach**

#### *4.1. Data Segmentation and Time Window*

Differences in sensor sampling frequencies make it difficult to identify the corresponding relationship among the multi-source signals. In this context, data segmentation is a typical method to preprocess continuous data and capture embedded features, and it has been frequently implemented in activity recognition, such as speech [38] and human activity [39] recognition. Therefore, we adopted the moving time-window method to segment the signals in our study. During data segmentation, two crucial parameters had to be determined: the size of the time window and the overlap between two adjacent windows. To prevent duplicated data from interfering with the statistical analysis, the overlap parameter was set to 0; that is, there was no overlap between two adjacent windows. Although the window method is commonly used in data segmentation, there is no clear consensus on which window size should be employed [39]. The characteristics of vehicle interior noise differ from those of other audio signals, so the window sizes used in speech recognition cannot serve as a reference. Generally, small windows allow for on-point activity detection with few resources and low energy costs, whereas large windows are usually employed to identify complex activities. To obtain the optimal window size for vehicle interior noise multi-classification, we leveraged the Shannon entropy together with the practical requirements of labeling the training data manually.
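The non-overlapping windowing described above can be sketched as follows. The sampling rates match the field setup (audio at 22,050 Hz; accelerometer and gyroscope at 100 Hz), but the function itself is an illustrative assumption, not the authors' exact implementation:

```python
# Sketch of non-overlapping time-window segmentation for multi-rate signals.
# Cutting every stream into windows of the same duration (in seconds) gives a
# one-to-one correspondence between audio windows and motion windows, which is
# how the different sampling frequencies are reconciled.

def segment(signal, rate_hz, window_s):
    """Split a 1-D signal into consecutive non-overlapping windows."""
    n = int(rate_hz * window_s)          # samples per window for this sensor
    full = len(signal) // n              # drop the trailing partial window
    return [signal[i * n:(i + 1) * n] for i in range(full)]

# Two seconds of dummy data at each sensor's rate.
audio = [0.0] * (2 * 22050)              # microphone, 22,050 Hz
accel = [0.0] * (2 * 100)                # accelerometer, 100 Hz

window_s = 0.5                            # hypothetical window size
audio_windows = segment(audio, 22050, window_s)
accel_windows = segment(accel, 100, window_s)

# Equal window counts: window i of the audio aligns with window i of the
# acceleration, despite the very different numbers of samples per window.
assert len(audio_windows) == len(accel_windows) == 4
```

Because the overlap is zero, each sample of every stream contributes to exactly one window, avoiding the duplicated-data interference mentioned above.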

We assumed that under the optimal window size, the system carries more information than under other situations [40]. The Shannon entropy is a method commonly used to describe the average information of a system, and it can be written as:

$$H = -\sum_{i=1}^{m} p(x_i) \log_2 p(x_i), \tag{1}$$

where $x_i$ denotes the $i$th event; $m$ represents the total number of events; and $p(x_i)$ is the probability that $x = x_i$, with $\sum_{i=1}^{m} p(x_i) = 1$. To obtain the optimal window size, the vehicle interior noise signal was first divided into a series of segment sequences according to different window sizes. The standard deviation of each segment was calculated to describe the state of that segment, so that standard deviation sequences corresponding to the different window sizes were available. It was then assumed that all standard deviation values fall within the range $(0, A]$, where $A$ is the maximum standard deviation across the different window sizes. This interval was equally divided into $m$ sub-intervals, where the $i$th sub-interval can be written as $(a_i, a_{i+1}]$, with $a_1 = 0$ and $a_{m+1} = A$. Thus, the optimization model for the time window size can be described as:

$$\max H(n) = -\sum_{i=1}^{m} p_i(n) \log_2 p_i(n), \tag{2}$$

where $n$ is the time window size, and $p_i(n)$ is the probability that a standard deviation value falls into the range $(a_i, a_{i+1}]$ when the time window size is $n$. In this study, the optimal time window size was obtained from an extensive number of samples. The window sizes ranged from 0.1 to 64 s, with 200 samples in total. For higher classification accuracy, more attention should be paid to small windows; to obtain such samples, logarithmic interpolation was used, so that each window size is always $10^{(\log_{10} 64 - \log_{10} 0.1)/200}$ times the previous one. By calculating the Shannon entropy for all 200 sizes, we obtained the maximum entropy and its corresponding window size.

#### *4.2. Data Balance Using the Synthetic Minority Oversampling Technique (SMOTE)*

The pie chart in Figure 5a shows the proportion of the five categories of vehicle interior noise studied in this work. The most frequent event is 'Broadcast', which accounts for 67.56% of all vehicle interior noise events. 'Other noises' is the next most frequent, at approximately 22%. 'Beep', 'Squeal', and 'Rumble' represent smaller percentages, at 4.99%, 2.79%, and 2.66%, respectively. These results indicate a severe class imbalance, which could significantly undermine most standard classification learning algorithms [41].

**Figure 5.** Data (**a**) before and (**b**) after synthetic minority oversampling technique (SMOTE) balance.

In this study, we adopted the synthetic minority oversampling technique (SMOTE) to overcome the data imbalance. Generally, class imbalance can be addressed by: (1) synthesizing new minority class instances; (2) oversampling the minority class; (3) under-sampling the majority class; and (4) tweaking the cost function to increase the penalty for misclassifying minority instances. The SMOTE used in this study follows the first solution, because synthesizing new minority instances, rather than merely duplicating them, yields stronger robustness and generalization ability. This technique returns the original samples plus an additional number of synthetic minority class samples. SMOTE takes each minority class sample and its *k* nearest neighbors in the feature space and generates new instances that combine the features of the target sample with those of its *k* neighbors. It therefore increases the features available for each category and makes the samples more general. In this study, we increased the percentages of 'Other noises', 'Squeal', 'Rumble', and 'Beep' to match that of 'Broadcast' via SMOTE when training the multi-classification model, as shown in Figure 5b.
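The neighbor-interpolation step of SMOTE can be sketched in a few lines. This is a minimal, pure-Python illustration on toy 2-D feature vectors, not the pipeline used in the study; in practice one would typically apply an established implementation such as imbalanced-learn's `SMOTE` to the full feature matrix:

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: synthesize n_new points, each by interpolating a
    randomly chosen minority instance with one of its k nearest neighbors.
    Assumes at least two minority samples."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x within the minority class (excluding x)
        neigh = sorted((p for p in minority if p is not x),
                       key=lambda p: dist2(x, p))[:k]
        nn = rng.choice(neigh)
        gap = rng.random()               # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nn)))
    return synthetic

# Hypothetical example: grow a 5-sample minority class to match a
# 20-sample majority class.
squeal = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1), (1.1, 2.2), (1.0, 1.8)]
balanced = squeal + smote(squeal, n_new=15)
assert len(balanced) == 20
```

Because each synthetic point lies on the line segment between two real minority samples, the new instances stay inside the minority region of the feature space instead of being exact duplicates.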
