*3.3. Machine Learning*

As mentioned before, three different machine learning algorithms were used and their results compared: a CNN applied to different representations, a Support Vector Machine (SVM), and K Nearest Neighbor (KNN). Each algorithm depends on a set of hyperparameters, whose optimal values are summarized in Table 3. The following paragraphs describe each algorithm and how the hyperparameter values were identified. All the machine learning algorithms were implemented in MATLAB. The three algorithms were chosen on the basis of the following considerations.



The main objective of this study was to investigate the performance of Deep Learning (DL) for this particular problem of the identification of road anomalies and obstacles. Among DL algorithms, the CNN has been widely applied to image processing since its excellent results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 [18]. The approach proposed in this paper is based on transforming into images the accelerometer/gyroscope readings collected in the time domain on the vehicles while driving (see Figures 7 and 8). The close relationship between the layers of a CNN and the spatial information in its input renders CNNs well suited for image processing and for extracting the discriminating characteristics of the road anomalies and obstacles [19].

**Figure 8.** Application of the spectrogram to the Gyroscope Y (GyroY) output from the IMU.

The relatively small size of the dataset for deep learning (9600 segments of road) may generate a risk of overfitting. This risk was mitigated by tuning the hyperparameters (e.g., *L2*) to improve generalization, by using a dropout layer (see Figure 9), and by applying 12-fold cross-validation, as described in the subsequent paragraphs. We also highlight that the number of segments is comparable to the datasets used in the literature where deep learning was applied [9,15].

**Figure 9.** CNN architecture used for the classification.

Two shallow machine learning algorithms, SVM and KNN, were then used to compare their classification performance to that of the deep learning CNN algorithm described above. SVM is a computational learning method based on statistical learning theory: the algorithm constructs and searches for the separating hyperplanes with the maximum margin by transforming the problem into the dual space by means of the Lagrangian. SVM has been successfully applied in many classification problems (object detection and recognition, information and image retrieval) with good generalization ability [20]. SVM delivers a unique solution, since the optimization problem is convex, while other algorithms (like neural networks) may have multiple solutions associated with local minima. In addition, SVM is well known for its effectiveness in high-dimensional spaces, such as the dataset used in this study.

The SVM was also used since it is widely adopted in the literature for road anomaly detection when a machine learning approach is used. The survey presented in [4] showed that SVM is the most used machine learning algorithm for this problem and the second most used approach overall (the first in the ranking is the threshold-based approach, which is not a machine learning algorithm and is therefore not adopted in this paper).

The KNN was chosen as a baseline for comparison with the other two algorithms as it is relatively simple and naturally lends itself to multi-class problems, as in this case.

The architecture of the CNN is shown in Figure 9 and briefly described here. The input layer of the CNN depends on the type of adopted transformation (e.g., for the 1D time domain representation, the input is 1 × 200). The convolutional layers' parameters (e.g., stride) were optimized according to the specific input, as the input changes for the different representations (e.g., time domain or spectrogram). Padding was used. The number of filters was set to 30 (an optimization range between 20 and 40 was considered), and the max pooling size was set to 4. The RMSProp solver was used, as it provided superior performance to the other solvers (stochastic gradient descent and stochastic gradient descent with momentum) for this specific dataset, with a learning rate of 0.001 (optimization range between 0.0001 and 0.01). The batch size was set to 128 and the number of epochs to 160. To mitigate overfitting, the *L2* regularization parameter was set to 0.0005 (optimization range between 0.0001 and 0.01), and a dropout layer was used. Cross-entropy was used as the loss function. The Number of Convolutional layers (*NC*) was also optimized in the range *NC* = 2 to *NC* = 5; the optimal value for the identification accuracy was *NC* = 3, which defines the architecture shown in Figure 9. Because the number of hyperparameters to optimize was significant (number of filters, solver, *L2*, batch size, and number of convolutional layers), the parameters were optimized sequentially, keeping some fixed while optimizing the others. The first parameter to be optimized was the number of convolutional layers *NC* (i.e., the CNN architecture), with *Nf* = 20, solver = SGD, *L2* = 0.001, and batch size = 64. Then, the solver was optimized with the identified number of convolutional layers (i.e., *NC* = 3). Finally, a three-dimensional grid approach was used to select the optimal values of *Nf*, *L2*, and batch size. The resulting values are presented in Table 3.
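The sequential-then-grid search described above can be sketched as follows. This is an illustrative Python sketch, not the paper's MATLAB code; `train_and_score` is a hypothetical stand-in for training the CNN and returning the validation accuracy, here replaced by a dummy surrogate that peaks at the values reported in the text.

```python
from itertools import product

def train_and_score(nc, solver, nf, l2, batch):
    # Dummy surrogate for CNN training: peaks at the reported optima
    # (NC = 3, solver = RMSProp, Nf = 30, L2 = 0.0005, batch = 128).
    score = -(abs(nc - 3) + abs(nf - 30) / 10
              + abs(l2 - 5e-4) * 1000 + abs(batch - 128) / 64)
    if solver != "rmsprop":
        score -= 1.0
    return score

# Step 1: optimize the number of convolutional layers NC with the other
# hyperparameters fixed (Nf = 20, solver = SGD, L2 = 0.001, batch = 64).
best_nc = max(range(2, 6),
              key=lambda nc: train_and_score(nc, "sgdm", 20, 1e-3, 64))

# Step 2: optimize the solver with NC fixed at the identified value.
best_solver = max(["sgdm", "rmsprop", "sgd"],
                  key=lambda s: train_and_score(best_nc, s, 20, 1e-3, 64))

# Step 3: three-dimensional grid search over Nf, L2, and batch size.
grid = product([20, 30, 40], [1e-4, 5e-4, 1e-2], [64, 128, 256])
best_nf, best_l2, best_batch = max(
    grid, key=lambda p: train_and_score(best_nc, best_solver, *p))
```

With the dummy surrogate, the search recovers the hyperparameter values reported in Table 3, which illustrates the mechanics of the sequential procedure.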

SVM is a supervised learning model that classifies data by creating a hyperplane or set of hyperplanes in a high- or infinite-dimensional space to distinguish the samples belonging to different classes. As this paper addresses a multi-class machine learning problem, a multi-class SVM was used, based on an Error-Correcting Output Coding (ECOC) classifier for multiclass learning, where the classifier consists of multiple binary learners. In particular, we used the OneVsOne approach, where for each binary learner one class is positive, another is negative, and the algorithm ignores the rest. This design exhausts all combinations of class pair assignments. Various kernels (e.g., linear, polynomial) were tried, and the one providing the best performance was the Radial Basis Function (RBF) kernel, whose scaling factor *γ* must be optimized together with the parameter *C* [21]. In this paper, the SVM was applied with the RBF kernel with a scaling factor *γ* = 2<sup>5</sup> and *C* = 2<sup>7</sup>. The optimization was performed on a range of values for *γ* and *C* between 2<sup>2</sup> and 2<sup>10</sup> using a grid search approach. The optimal values of the parameters are presented in Table 3.
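The grid search over the RBF hyperparameters can be sketched as below (an illustrative Python sketch, not the paper's MATLAB implementation). `cv_accuracy` is a hypothetical placeholder for a cross-validated SVM evaluation; here it is a dummy function peaking at *γ* = 2<sup>5</sup>, *C* = 2<sup>7</sup>.

```python
def cv_accuracy(gamma, C):
    # Dummy stand-in for 12-fold cross-validated SVM accuracy,
    # peaking at gamma = 2**5 and C = 2**7 as reported in the text.
    return 1.0 - (abs(gamma - 2**5) / 2**10 + abs(C - 2**7) / 2**10)

# Sweep both hyperparameters over powers of two from 2^2 to 2^10.
candidates = [2**k for k in range(2, 11)]
best_gamma, best_C = max(
    ((g, c) for g in candidates for c in candidates),
    key=lambda p: cv_accuracy(*p))
```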

K Nearest Neighbor (KNN) is an approach to data classification that estimates how likely a data point is to be a member of one class or another depending on which group the data points nearest to it belong to. KNN is an example of a lazy learner algorithm, meaning that it does not build a model from the training set until a query of the dataset is performed. The main hyperparameter in KNN is the *K* factor, which must be optimized for the specific classification problem. The type of distance metric used to determine the "nearest" points must also be chosen carefully [22]. The optimization was performed on a range of values of *K* from 1 to 10 for each of the different distance metrics: Euclidean, Manhattan, Chebyshev, and Mahalanobis. The optimal values of the hyperparameters, which provided the highest accuracy, are listed in Table 3.
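A minimal KNN classifier, with the two hyperparameters discussed above (*K* and the distance metric) made explicit, could look as follows. This is a toy Python sketch for illustration; the sample points and labels are invented, and the actual experiments used MATLAB.

```python
from collections import Counter

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, labels, query, k=3, dist=euclidean):
    # Lazy learner: no model is built; at query time, rank the training
    # points by distance and take a majority vote among the k nearest.
    nearest = sorted(range(len(train)),
                     key=lambda i: dist(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy 2-D data: two invented classes standing in for road segment types.
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
y = ["NORM", "NORM", "POTHOLE", "POTHOLE", "POTHOLE"]
label = knn_predict(X, y, (1.0, 0.9), k=3)  # the 3 nearest are POTHOLE
```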

All the optimized values presented in this section were calculated using a linear or grid search approach for each sampling rate and then averaged for the analysis of the time domain data.

For all three machine learning algorithms, a 12-fold cross-validation was used, partitioning the entire set into 12 exclusive folds. For each fold, eleven of the twelve parts of the initial dataset were used for training plus validation, while the remaining 1/12 was used for testing. The validation part was 1/4 of the training plus validation portion (thus, the validation part was 11/48 of the entire dataset in each fold). The results were then averaged.
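The fold arithmetic above can be verified with a short sketch (illustrative Python; segment counts follow the text):

```python
N = 9600                                  # total number of road segments
folds = 12

test_size = N // folds                    # 1/12 held out per fold: 800
trainval_size = N - test_size             # 11/12 for training + validation: 8800
val_size = trainval_size // 4             # 1/4 of train+val for validation: 2200
train_size = trainval_size - val_size     # remainder for training: 6600

# The validation part is 11/48 of the entire dataset, as stated.
assert val_size == N * 11 // 48
```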

The evaluation metrics were the accuracy, precision, and recall, defined by the following equations:

$$Accuracy = \frac{(TP + TN)}{(TP + TN + FP + FN)} \tag{1}$$

$$Precision = \frac{(TP)}{(TP + FP)}\tag{2}$$

$$Recall = \frac{(TP)}{(TP + FN)}\tag{3}$$

where *TP* is the number of True Positives, *FP* is the number of False Positives, *FN* is the number of False Negatives, and *TN* is the number of True Negatives.
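Equations (1)–(3) translate directly into code; the confusion counts in the example below are invented purely for illustration:

```python
def accuracy(tp, tn, fp, fn):
    # Equation (1): fraction of all samples classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Equation (2): fraction of positive predictions that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Equation (3): fraction of actual positives that are detected.
    return tp / (tp + fn)

# Example with invented counts: 90 TP, 95 TN, 5 FP, 10 FN.
acc = accuracy(90, 95, 5, 10)   # 0.925
prec = precision(90, 5)         # 90/95
rec = recall(90, 10)            # 0.9
```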

#### *3.4. Time Frequency Transforms*

This section is focused on the description of the spectrogram and the related hyperparameters. In addition, because Section 4 provides a comparison with another time-frequency representation based on Continuous Wavelet Transform (CWT), the CWT definition and hyperparameters are also described here.

Consider a signal *s*(*τ*) of length *N* (the original signal in the time domain, called 1D-T in the rest of this paper) and a window *w*(*τ*) of length *Wsize* (where *N* >> *Wsize*), whose Fourier transforms are respectively *S(f)* and *W(f)*, as shown in the following equations:

$$S(f) = \int\_{-\infty}^{\infty} s(\tau) e^{-j2\pi f\tau} d\tau \tag{4}$$

$$\mathcal{W}(f) = \int\_{-\infty}^{\infty} w(\tau) e^{-j2\pi f \tau} d\tau \tag{5}$$

The localized spectrum representation of *s*(*τ*) at time *τ* = *t* is obtained by multiplying the signal by the window *w*(*τ*) centered in time *τ* = *t*, as shown in the following equation:

$$s\_w(t, \tau) = s(\tau)w(\tau - t),\tag{6}$$

through the application of the FT, the Short Time Fourier Transform (STFT) *F<sub>s</sub><sup>w</sup>*(*t*, *f*) is obtained:

$$\mathcal{F}\_s^{w}(t,f) = \mathcal{F}\_{\tau \to f} \left\{ s(\tau)w(\tau - t) \right\} \tag{7}$$

The squared magnitude of the STFT, denoted by *S<sub>s</sub><sup>w</sup>*(*t*, *f*), is called the spectrogram and is expressed by the following equation:

$$S\_s^w(t,f) = \left|\mathcal{F}\_s^w(t,f)\right|^2\tag{8}$$
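Equations (6)–(8) can be sketched in a few lines: slide a window across the signal, take the Fourier transform of each windowed segment, and keep the squared magnitude. The sketch below is illustrative Python with a naive DFT (the study itself used MATLAB); the test signal is an invented pure tone, whose energy concentrates in a single frequency bin of each frame.

```python
import cmath
import math

def dft(x):
    # Naive discrete Fourier transform, for illustration only.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n)
                for t in range(n))
            for f in range(n)]

def spectrogram(s, wsize, hop, window=None):
    window = window or [1.0] * wsize          # rectangular by default
    frames = []
    for start in range(0, len(s) - wsize + 1, hop):
        seg = [s[start + i] * window[i] for i in range(wsize)]  # Eq. (6)
        frames.append([abs(c) ** 2 for c in dft(seg)])          # Eqs. (7)-(8)
    return frames

# Pure tone: 4 cycles per 16-sample window.
sig = [math.sin(2 * math.pi * 4 * t / 16) for t in range(64)]
S = spectrogram(sig, wsize=16, hop=8)
```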

As described in Section 4, the spectrogram was adopted for the analysis presented in this paper because it provided a superior identification performance in comparison to the phase component of the STFT, as shown in Table 4.

In the spectrogram definition, we evaluated the impact of three main hyperparameters: (1) the type of window *w*(*τ*); (2) the length of the window, or window size *Wsize*; and (3) the amount of overlap between successive windows over the overall length *N* of the signal *s*(*τ*). The overlap *Olap* is defined as a percentage of the window size *Wsize* as the window slides across *s*(*τ*). Because *Wsize* depends on the sampling rate at which the data are collected by the IMU and would therefore not be comparable across different sampling rates, the parameter *WR*, defined as the ratio *N*/*Wsize*, is used in the rest of this paper. For example, a value of *WR* = 10 at a sampling rate of 50 Hz results in *Wsize* = 20.
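The relationship between *N*, *WR*, *Wsize*, and *Olap* can be made concrete with the example from the text (*WR* = 10 at 50 Hz, where each segment is 1 × 200 samples). The hop formula below, where the window advances by *Wsize* minus the overlapped samples, is our assumption about the sliding scheme rather than a detail stated in the paper:

```python
N = 200                               # samples per road segment at 50 Hz
WR = 10                               # window ratio N / Wsize
Wsize = N // WR                       # window size: 20 samples
Olap = 0.66                           # overlap as a fraction of Wsize

# Assumed sliding scheme: advance by the non-overlapped part of the window.
hop = Wsize - round(Olap * Wsize)     # 20 - 13 = 7 samples per step
n_frames = (N - Wsize) // hop + 1     # number of spectrogram columns
```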

Four different window types were used: Hamming, Bartlett, Chebyshev, and Kaiser [23], chosen because of their different filtering behaviors. Three different window sizes were evaluated through the values *WR* = 4, 5, and 10. Finally, three different overlapping factors *Olap* were evaluated: 33%, 50%, and 66%. The results of the evaluation are presented in Section 4.

The other time-frequency representation was based on wavelets. The CWT provides an overcomplete representation of a signal by letting the translation and scale parameters of the wavelets vary continuously [24].

The CWT is expressed by:

$$\mathbb{C}(a,b) = \frac{1}{|a|^{1/2}} \int\_{-\infty}^{\infty} s(\tau) \psi\left(\frac{(\tau-b)}{a}\right) d\tau \tag{9}$$

where *ψ* is the mother wavelet, *a* is the scale, and *b* is the translation value. In this analysis, we chose the Morse mother wavelet [25] and the CWT implementation in MATLAB from MathWorks (i.e., the cwt function).

#### **4. Results**

In an initial step, the impact of the hyperparameters of the spectrogram definition on the identification accuracy was evaluated.

#### *4.1. Optimization of the Spectrogram Hyperparameters*

This section is focused on the optimization of the hyperparameters in the spectrogram.

Figure 10 presents the accuracy calculated with the CNN-SP approach using AccZ. The results are presented as bar graphs whose bars correspond to the hyperparameters of the spectrogram definition, namely the size of the window (based on the *WR* factor) and the different values of the overlapping factor *Olap*. Four graphs are presented for the different sampling rates: 50 Hz (Figure 10a), 100 Hz (Figure 10b), 200 Hz (Figure 10c), and 250 Hz (Figure 10d). The sampling rate of 150 Hz is not shown for reasons of space and because its graph is quite similar to the others. The graphs in Figure 10 show that for *WR* = 4 and *WR* = 5, the highest accuracy was obtained with *Olap* = 66%, while for *WR* = 10, the trend was less consistent. On the other hand, for most of the shown sampling rates (e.g., 50, 100, and 200 Hz), the accuracy obtained with *WR* = 10 was the lowest across all the values of *Olap*, while this was not the case for the sampling rate of 250 Hz. Higher sampling rates also increase the classification time; therefore, if there is no loss of classification accuracy (as shown in Figure 10), it is preferable to use the optimal parameters for the lowest sampling rate (i.e., 50 Hz). In this case, the optimum was obtained with *WR* = 5 and *Olap* = 66%. These results were obtained with the Hamming window. The choice of this window is supported by the results presented in Figure 11 for different windows and overlap values *Olap* with *WR* = 5: the optimal identification accuracy is obtained with the Hamming window across all the sampling rates and *Olap* values, and this is the type of window used in the subsequent results.

The results obtained for GyroY are shown in Figure 12. In general, they confirm the previous results obtained with AccZ, but with greater variability. In fact, *Olap* = 66% provides better results at 100 Hz (Figure 12b) and 200 Hz (Figure 12c) for *WR* = 4 and *WR* = 5, but in the other cases, *Olap* = 50% provides better results. We highlight that the highest classification accuracy (reaching 0.97) is quite similar to that obtained with AccZ, which indicates that AccZ and GyroY provide an equivalent classification accuracy. All these results were obtained with the Hamming window. As shown in Figure 13, the Hamming window provides better results than the other windows across all the sampling rates. From this point of view, the results obtained for the gyroscope are consistent with those obtained with the accelerometer.

(**a**) Accuracy for sampling rate equal to 50 Hz

(**b**) Accuracy for sampling rate equal to 100 Hz

(**c**) Accuracy for sampling rate equal to 200 Hz

(**d**) Accuracy for sampling rate equal to 250 Hz

**Figure 10.** Accuracy for AccZ in relation to the size of window (based on the *WR* factor) used in the spectrogram for different sampling rates and different values of the overlapping factor *Olap*. The Hamming window is used to obtain these results.

(**a**) Accuracy for sampling rate equal to 50 Hz

(**b**) Accuracy for sampling rate equal to 100 Hz

(**c**) Accuracy for sampling rate equal to 200 Hz

(**d**) Accuracy for sampling rate equal to 250 Hz

**Figure 11.** Accuracy for AccZ in relation to the type of window used in the spectrogram for different sampling rates and different values of the overlapping factor *Olap* and *WR* = 5.

(**c**) Accuracy for sampling rate equal to 200 Hz

(**d**) Accuracy for sampling rate equal to 250 Hz

**Figure 12.** Accuracy for GyroY in relation to the size of window (based on the *WR* factor) used in the spectrogram for different sampling rates and different values of the overlapping factor *Olap*. The Hamming window is used to obtain these results.

(**a**) Accuracy for sampling rate equal to 50 Hz

(**c**) Accuracy for sampling rate equal to 200 Hz

**Figure 13.** Accuracy for GyroY in relation to the type of window used in the spectrogram for different sampling rates and different values of the overlapping factor *Olap* and *WR* = 5.

#### *4.2. Analysis and Comparison of the Approaches*

This section presents the results of the comparison of the proposed approach with the shallow machine learning algorithms and CNN-1D. Figure 14 provides the comparison of all the approaches evaluated in the analysis for the accelerometer in the *Z* direction (AccZ), while Figure 15 provides the comparison for the gyroscope in the *Y* direction (GyroY). As mentioned before, CNN-SP is the application of the CNN to the spectrogram of the original signal in the time domain; CNN-1D is the application of the CNN to 1D-T (the original signal in the time domain). SVM and KNN represent the direct application of the respective shallow machine learning algorithms to 1D-T. The presented results are based on the mean of the results calculated across the 12 folds. The results clearly show that all the CNN based approaches consistently and significantly outperform the shallow machine learning approaches across all the sampling rates, which confirms the results obtained by the authors in [15] for a different dataset. In addition, the approach proposed in this paper, based on the combination of the CNN with the spectrogram (CNN-SP), consistently outperforms the direct application of the CNN in the time domain (CNN-1D), which is the novel finding of this paper. Another interesting result from Figures 14 and 15 is that the CNN-SP approach provides a relatively uniform accuracy across the different sampling rates, while the accuracy of CNN-1D drops significantly at higher sampling rates. A potential reason for this behavior is that the CNN algorithm is able to extract similar discriminating features across the different sampling rates when applied in combination with the spectrogram, while lower sampling rates in 1D-T introduce a smoothing effect, which benefits the classification accuracy of CNN-1D. A similar consideration applies to the SVM and KNN algorithms, whose classification accuracy also degrades with increasing sampling rates.

**Figure 14.** Comparison of the accuracy among the spectrogram CNN approach (CNN-SP) proposed in this paper (for different window sizes), the CNN directly applied to the time domain representation (CNN-1D), and the shallow machine learning techniques (SVM, KNN) applied to the data of the Accelerometer in the *Z* direction (AccZ).

For completeness, we also report the accuracy obtained for the other time-frequency representation mentioned previously (i.e., the CWT) and for the application of the STFT phase (the squared magnitude of the STFT is the spectrogram already considered above). The results are shown in Table 4. Note that the results presented for the STFT phase and the CWT are the outcome of an optimization process similar to the one performed for the spectrogram in Section 4.1: for the CWT, this was done on the type of wavelet and the number of octaves per frequency; for the STFT phase, the parameters identified in Section 4.1 were used. The results clearly confirm that CNN-SP provides a superior classification performance in comparison to the other approaches. In particular, the magnitude-only component used by CNN-SP provides a better identification accuracy than the phase-only component and than the combination of magnitude plus phase, which is also reported here. The other spectral approach, based on the CWT, also provides an inferior performance, which indicates that wavelets may not be appropriate for this context (a similar result for vehicle authentication was obtained in [16]).

**Figure 15.** Comparison of the accuracy among the spectrogram CNN approach (CNN-SP) proposed in this paper (for different window sizes), the CNN directly applied to the time domain representation (CNN-1D), and the shallow machine learning techniques (SVM, KNN) applied to the data of the Gyroscope in the *Y* direction (GyroY).

**Table 4.** Accuracy obtained for the different representations at 50 Hz. Bold represents the highest value.


The accuracy metric only provides a limited view of the classification performance. For this reason, the following paragraphs and figures also provide the recall and precision obtained for the specific set of parameters and for the different machine learning algorithms. The figures show the recall and precision for each specific road anomaly when compared to the others, and they give a better understanding of where the machine learning failed to identify the correct samples (false positives and false negatives).

Figure 16a,b provides respectively the precision and recall for each road anomaly as a percentage using the CNN-SP approach with a sampling rate of 50 Hz, *Olap* = 66%, and *WR* = 5. Similar results were obtained for the other values of the CNN-SP hyperparameters, but they are not provided here for space reasons. These figures confirm the previous findings on the accuracy and show that it is possible to obtain a high identification accuracy using the CNN-SP approach. On the other hand, the figures show that there can be great variability in the identification of each road anomaly/obstacle, depending on its type and on how different it is from the others. One relevant aspect is the relatively low value of recall and precision for the class NORM. The reason is that the machine learning algorithm in some cases confuses a road anomaly/obstacle with a normal segment (identified as NORM in the figures), because the digital output from the accelerometers/gyroscopes is similar. In fact, even a NORM segment includes some small irregularities (e.g., cracks, uneven road surface), which may be difficult to distinguish. This result is to be expected, and it is consistent with the findings in the literature [15].

**Figure 16.** Precision and recall using the CNN-SP approach. (**a**) Precision for each road anomaly for CNN-SP (sampling rate of 50 Hz, *Olap* = 66%, *WR* = 5) and (**b**) recall for each road anomaly for CNN-SP (sampling rate of 50 Hz, *Olap* = 66%, *WR* = 5).

On the other hand, such low values of precision and recall may be a significant problem in real-world applications, since most real-world road segments are normal. A potential reason for these results is that no thresholds were applied for the detection of road anomalies/obstacles; the machine learning algorithms were applied directly to the data recordings. Different approaches can be used to mitigate this issue. Some of these approaches are based on similar analyses in the literature, and they can be classified by the phase of the overall methodology in which they act: data collection, data pre-processing, or data processing/classification (e.g., the ML/DL algorithm). One potential approach in the data collection phase would be to widen the sensor inputs using the shock responses of the four absorbers or the wheel speed from the four wheels, as was done in [9], to enhance the discriminating power of the classification algorithm. Another approach is to introduce a smoothing filter in the data pre-processing phase. We note that the evaluation of the identification performance at different sampling rates described in this study is itself a crude form of smoothing for the lower sampling rates. A review of the application of filters in the literature for road anomaly/obstacle detection is provided in Section 2.2 of [10]. Another possibility would be to set a threshold to analyze only the most significant road anomalies/obstacles and remove the smaller ones, which are present in the normal segments (see Section 2.3 of [10] for a review of threshold techniques). Note that the threshold definitions do not need to be static: they can be dynamic and adapt to the road conditions, as proposed in [26]. The application of both techniques would require significant effort to identify the most effective smoothing filter or the appropriate threshold values; for this reason, this analysis is postponed to future developments (see Section 5). In the data processing/classification phase, the existing dataset could be widened using data augmentation techniques (e.g., adding noise) to improve generalization and robustness. Another approach would be to use other time-frequency transforms (e.g., the Stockwell transform) in combination with the CNN or with other deep learning algorithms.
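As a hedged illustration of the smoothing-filter mitigation mentioned above, a simple moving-average filter applied to the raw IMU signal in the pre-processing phase could look as follows. The filter choice and window width here are our assumptions; selecting the most appropriate filter is explicitly left to future work.

```python
def moving_average(signal, width=5):
    # Replace each sample with the mean of its neighbourhood;
    # edges fall back to a shorter (truncated) window.
    half = width // 2
    return [sum(signal[max(0, i - half):i + half + 1])
            / len(signal[max(0, i - half):i + half + 1])
            for i in range(len(signal))]

# An isolated spike (a small irregularity in a NORM segment) is
# attenuated and spread over its neighbours.
smoothed = moving_average([0, 0, 10, 0, 0], width=3)
```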

Figure 17a,b provides respectively the precision and recall for each road anomaly as a percentage using the CNN-1D approach at a sampling rate of 50 Hz. The results are similar to those achieved with the CNN-SP approach, but the recall and precision are slightly lower, as anticipated by the results presented in Figure 14. In particular, the same phenomenon of lower precision and recall for the NORM class is also noted here.

Figure 18a,b provides respectively the precision and recall for each road anomaly as a percentage using the SVM algorithm at a sampling rate of 50 Hz. The figures show the drastic drop in the classification performance in comparison to the CNN based approaches. In addition, we note a greater variability in the identification of the specific road anomalies. In general, the precision is negatively impacted, as the number of False Positives (FP) is quite significant: in particular, the algorithm confuses specific road features (e.g., RF14) or obstacles (e.g., SB01 and SB02) with normal road segments (i.e., NORM). In comparison, the recall is usually higher for the same road anomalies, which shows that the False Negatives (FN) are fewer than the False Positives (FP).

**Figure 17.** Precision and recall using the CNN-1D approach. (**a**) Precision for each road anomaly for CNN-1D (sampling rate of 50 Hz) and (**b**) recall for each road anomaly for CNN-1D (sampling rate of 50 Hz).

**Figure 18.** Precision and recall using the SVM algorithm. (**a**) Precision for each road anomaly for SVM (sampling rate of 50 Hz) and (**b**) recall for each road anomaly using SVM (sampling rate of 50 Hz).

Figure 19a,b provides respectively the precision and recall for each road anomaly as a percentage using the KNN algorithm at a sampling rate of 50 Hz. These results confirm what was already observed for the accuracy. The variability is even greater than with the SVM approach. In addition, the recall for the NORM segment reaches quite low values, which highlights the difficulty of the algorithm in distinguishing the NORM segments from the others. It is also noted that the lowest values of precision and recall usually occur for specific road anomalies (e.g., RF14 or SB01), similarly to what was observed in the previous figures and algorithms.

**Figure 19.** Precision and recall using the KNN algorithm. (**a**) Precision for each road anomaly using KNN (sampling rate of 50 Hz) and (**b**) recall for each road anomaly using KNN (sampling rate of 50 Hz).

#### *4.3. Discussion*

The results shown in the previous subsections demonstrate the superiority of the time-frequency CNN (TF-CNN) for road anomaly identification in comparison to the shallow machine learning algorithms (SVM, KNN) and to the application of the CNN to the initial time domain representation of the accelerometer and gyroscope data. The difference in classification performance appears for all the considered sampling rates (i.e., 50, 100, 150, 200, and 250 Hz) and for both accelerometer and gyroscope data. With accelerometer data, the improvement in the identification accuracy of TF-CNN is even more visible than with gyroscope data. The final identification accuracy, taking into consideration all the road anomalies and the vehicles used in the data collection, is slightly more than 97%, which is consistent with the findings in the literature. We note that this identification accuracy was obtained using a set of 12 different vehicles, rather than the one or two vehicles commonly used in the literature. The high accuracy obtained with TF-CNN is thus even more remarkable, because it manages to mitigate the differences in the mechanical configurations of the vehicles, and it can support crowd-sourcing approaches (with many different vehicles on the road) for the detection and mapping of road anomalies in the road infrastructure.

Regarding the choice of the time-frequency transform, the results in Table 4 show that the spectrogram in combination with CNN (CNN-SP) provides a slightly better identification accuracy than the application of the complex magnitude of the CWT (CNN-CWT). Additional time-frequency transforms could also be applied, and this will be the scope of future studies (see Section 5).

Another important result from Figures 14 and 15 is that CNN-SP is able to provide a high identification accuracy (higher than 97%) even at lower sampling rates (e.g., 50 or 100 Hz), which supports the deployment of the proposed approach even with cost-effective, mass-market smartphones, which are now equipped with accelerometers and gyroscopes with a sampling rate of 100 Hz.

We note that the study presented in this paper conducted an extensive analysis of the hyperparameters present in the definition of the time-frequency transforms (e.g., window size, type of window, overlapping ratio) to evaluate their impact on the identification accuracy. The results show that although some hyperparameters do not have a significant impact, other hyperparameters like the overlapping ratio must be carefully tuned in a training phase to achieve the optimal identification accuracy.

One aspect that must still be improved is the relatively low precision and recall of the NORM segments. A potential approach to mitigate this issue would be to introduce a smoothing filter in the data pre-processing phase, but this would entail an extensive analysis of the choice of the most appropriate filter; for this reason, the analysis is postponed to future developments (see Section 5).

#### **5. Conclusions and Future Developments**

This paper proposes a novel approach for the detection and identification of road anomalies using data collected from accelerometers and gyroscopes installed on vehicles. The approach is based on the transformation of the collected data into the spectral domain (via a spectrogram), which is then given as input to a CNN. This approach is compared against the direct application of the CNN to the samples collected in the time domain and against the application of shallow machine learning algorithms like SVM and KNN, as usually proposed in the existing research literature. The approach is evaluated on a dataset created by the authors by collecting IMU (e.g., accelerometer, gyroscope) recordings over many kilometers of road infrastructure using 12 different automotive vehicles. A comprehensive analysis of the influence of the hyperparameters on the classification performance is presented for the data collected from the accelerometers and gyroscopes. The results show that this approach is able to obtain a very high identification accuracy for each road anomaly, and it is able to distinguish accurately between obstacles intentionally created by road traffic authorities and road anomalies (e.g., potholes) that are the consequence of road degradation and roadworks. Beyond the analysis of road roughness, such high accuracy can also be used to correctly identify specific road anomalies and obstacles in the road infrastructure, which can serve as landmarks in maps to improve the positioning of vehicles (including autonomous vehicles) in the future.

Future developments will extend the scope of the study presented in this paper in various directions. One direction in the data collection phase is to use additional sensor inputs (e.g., different placements of the accelerometers/gyroscopes in the vehicle, the shock responses of the four absorbers, or the wheel speeds). Another direction in the data collection phase would be to collect data with different passengers and drivers in the vehicles. One direction in the processing phase is to investigate the application of additional time-frequency transformations in combination with other deep learning architectures and algorithms. Future developments will also investigate the combination of thresholds or smoothing filters with CNN-SP to enhance the precision and recall for the classification of the NORM segments. In particular, we will investigate adaptive filters whose definition and parameters are modified depending on the conditions of the road or on the vehicle model.

**Author Contributions:** Conceptualization, G.B. and R.G.; methodology, G.B., R.G., and F.G.; software, G.B., R.G., and F.G.; validation, G.B., R.G., and F.G.; investigation, G.B. and F.G.; writing, original draft preparation, G.B.; writing, review and editing, G.B., R.G., and F.G.; supervision, G.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** No specific funding was used to conduct the study presented in this paper.

**Acknowledgments:** We acknowledge the directorates I and C of the European Commission Joint Research Centre for providing most of the vehicles used in the study.

**Conflicts of Interest:** The authors declare no conflict of interest.
