1. Introduction
Intelligent Transportation Systems (ITSs) play a crucial role in developing smart cities through advanced technologies to enhance the efficiency and sustainability of transportation networks [
1,
2]. These systems integrate data from sensors, cameras, and GPS devices that provide real-time information on traffic flow, weather conditions, and other relevant factors. ITSs can also use this to adjust the traffic signals dynamically, manage toll road usage, and give drivers personalized route recommendations to reduce congestion and lower travel time. Additionally, ITSs can support autonomous vehicles and shared mobility services, further improving overall urban transportation system performance.
In smart cities, ITSs can also contribute to air quality and greenhouse gas emission reduction by making public transportation, cycling, and walking real options instead of private car travel [
3]. This can be achieved through incentives for using sustainable modes of transportation by implementing smart parking systems and congestion pricing schemes, thus decreasing the overall demand for fossil fuel-powered vehicles. This would further allow easy intermodal connections, making urban mobility and the general urban environment even more accessible, sustainable, and inclusive.
Vision-based sensing has become an integral part of modern infrastructure, especially in ITSs, where it supports both safety and efficiency. However, these systems have a significant limitation: they cannot “hear” important auditory signals such as emergency sirens, mechanical faults, and environmental hazards [4]. Reported limitations of vision-based systems include the inability to detect auditory signals [5], environmental noise interference [6], and mechanical fault detection [7]. Beyond emergency signals and mechanical faults, environmental hazards such as construction noise, falling debris, or wildlife may create serious dangers. Vision-based systems cannot detect such hazards until they become visually apparent, which may be too late to act effectively. Sound-based diagnostics can provide early warnings of such dangers and thereby improve the safety, reliability, and efficiency of existing systems. Sound-based diagnosis is valued for several reasons, including enhanced situational awareness [8,9], real-time monitoring and alerts [10], cost-effectiveness [11], accessibility for hearing-impaired people [12], and improved emergency response [12].
ITSs are designed to improve the functionality and safety of transport systems using advanced technologies [
13]. Whereas these systems are designed to streamline the process, people with disabilities face unique barriers to their movement and access to essential services in such contexts. These include, but are not limited to, information inaccessibility, lack of warnings in emergencies, navigation barriers, inadequate access to vehicles, and communication barriers. Sound-based systems are essential in offering complementarities and ensuring increased access within ITS environments based on alternative means of communication and information sharing. Some of the possible solutions include the following: audio cues for navigation, visual alerts for the hearing-impaired, sound detection for vehicle fault detection, emergency sirens and alerts, and improved communication systems [
14]. Sound detection has applications across a wide range of ITS fields. In emergency response, it supports quick and timely mitigation of hazards. In public transportation, it can increase safety and efficiency for passengers. In smart cars, it enhances driver awareness and the vehicle’s behavior.
Recent advances in artificial intelligence (AI) have paved the way for sophisticated data analysis techniques that drive innovative applications across various fields, including sound-based diagnostics. Within this AI framework, machine learning (ML)—defined as using algorithms and statistical models that enable computer systems to learn from data and improve their performance on specific tasks without explicit programming—has emerged as a key enabler. By leveraging ML, our study can analyze complex auditory signals from vehicle faults and environmental sounds, transforming raw data into actionable insights for Intelligent Transportation Systems (ITSs).
We begin our approach with foundational ML models that provide robust baseline performance. Techniques such as Logistic Regression (LR) and k-nearest neighbors (kNN) are utilized to establish initial classification capabilities, forming the groundwork for further enhancements. These basic models are critical for understanding the underlying patterns in the audio data and serve as benchmarks against which more sophisticated methods can be compared.
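As a rough illustration of how such a baseline operates, the sketch below implements a minimal k-nearest-neighbors classifier in plain Python. The two-dimensional feature vectors and the labels are hypothetical toy values chosen only for this example; they are not the features or classes used in the study, and the function name `knn_predict` is an assumption.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    if k < 1 or k > len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair every training point's Euclidean distance to the query with its label.
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors standing in for extracted audio features.
train_x = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
           (0.90, 0.80), (0.80, 0.90), (0.85, 0.85)]
train_y = ["engine", "engine", "engine", "siren", "siren", "siren"]

print(knn_predict(train_x, train_y, (0.20, 0.20)))
print(knn_predict(train_x, train_y, (0.70, 0.90)))
```

A baseline of this kind has no training phase beyond storing the data, which is one reason it is a convenient reference point before more elaborate models are tried.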
The study introduces three novel datasets that capture various sounds relevant to Intelligent Transportation Systems. The first dataset (DB1) consists of 27 distinct vehicle fault classes featuring critical sounds such as Engine Misfire, Fuel Pump Cartridge Fault, Radiator Fan Failure, and Strut Mount Failure, all directly generated by the vehicle. The second dataset (DB2) comprises 22 environmental sound classes, including emergency signals like sirens and various transportation-related and ambient environmental noises. These datasets provide a rich collection of auditory signals that form the basis for robust sound-based diagnostic systems. The third dataset (DB3) merges DB1 and DB2 to create a comprehensive collection of 49 classes. This unified dataset enables the framework to classify any sound from vehicle faults or external environmental events into the correct category. By addressing the scarcity of specialized, publicly available auditory datasets, the study lays the groundwork for advancing machine learning research in sound-based diagnostics and contributes to the development of more inclusive and responsive ITS applications.
Building on these fundamentals, our framework incorporates advanced variants of ML to address the challenges of differentiating acoustically similar classes. Ensemble methods such as AdaBoost, Random Forest (RF), and Gradient Boosting (GB) enhance accuracy by combining the strengths of multiple weak learners. Additionally, Support Vector Machines (SVM) and Stochastic Gradient Descent (SGD) optimize decision boundaries in complex feature spaces, while Decision Trees (DTs) provide interpretable classification logic. Complementing these are the CN2 algorithm and Naive Bayes (NB), which handle complex rule-based classification and probabilistic inference. Together, these diverse ML techniques form a comprehensive diagnostic system capable of robust performance in real-world ITS applications.
Integrating auditory intelligence into ITSs addresses several key research challenges, notably the difficulty of distinguishing acoustically similar classes—such as Universal Joint Failure versus Bad CV Joint and Knocking versus Pre-ignition Problem. These challenges demand advanced feature extraction techniques and robust machine learning models that capture subtle differences in sound signatures. Additionally, the scarcity of specialized, publicly available auditory datasets has historically hindered progress in this area; this research overcomes that barrier by introducing a comprehensive, curated dataset that serves as a benchmark for future work.
This study addresses a critical gap in Intelligent Transportation Systems (ITSs) by explicitly defining its aim to detect faults directly from sounds generated by vehicles, such as engine or brake noises, and to classify external alert sounds, including emergency sirens. The intended applications of these predictive outputs are articulated, emphasizing their role in real-time diagnostics for smart vehicle systems and providing auditory-to-visual alert conversions to assist sound-impaired drivers. Additionally, the study highlights the potential of auditory capabilities to enhance vehicle fault detection and accessibility for individuals with disabilities while addressing the scarcity of specialized datasets in this domain. This is achieved through the following key contributions:
Introducing a novel dataset comprising vehicle fault sounds, emergency sirens, and environmental noises, filling a critical gap in publicly available resources.
Developing a comprehensive methodology for audio preprocessing, including normalization, resampling, and segmentation.
Proposing robust feature extraction techniques, such as Mel Spectrograms, MFCCs, and Chromatograms, enabling compact and expanded feature representations.
Evaluating multiple ML models in the first scenario, including neural networks, Logistic Regression, and Random Forests.
Proposing a Bayesian-Optimized Weighted Soft Voting with Feature Selection (BOWSVFS) approach in the second scenario, achieving a classification accuracy of 91.04% on the car fault dataset (DB1) and outperforming the first scenario’s results.
Demonstrating the relevance of sound-based ITSs in promoting accessibility by offering real-time alerts and auditory-to-visual conversion solutions for individuals with disabilities.
Aligning sound-based diagnostics with broader smart city goals, contributing to the development of safer and more inclusive transportation systems.
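The weighted soft voting idea named in the contributions can be sketched generically as follows. This is only an illustration of weighted soft voting itself: the per-model probabilities and the weights are made-up values, the function name `weighted_soft_vote` is an assumption, and the Bayesian optimization of the weights and the feature selection step of BOWSVFS are not shown.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-model class probabilities with a weighted average.

    probabilities: one list of class probabilities per model.
    weights: one non-negative weight per model.
    Returns (index of the winning class, combined probabilities).
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probabilities[0])
    combined = [0.0] * n_classes
    for probs, weight in zip(probabilities, weights):
        if len(probs) != n_classes:
            raise ValueError("all models must score the same classes")
        for c, p in enumerate(probs):
            combined[c] += weight * p / total
    best = max(range(n_classes), key=combined.__getitem__)
    return best, combined


# Hypothetical outputs of three models for one audio clip.
classes = ["Knocking", "Pre-ignition Problem", "Engine Misfire"]
model_probs = [
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.3, 0.4, 0.3],
]
model_weights = [0.5, 0.2, 0.3]  # illustrative values, not tuned

best, combined = weighted_soft_vote(model_probs, model_weights)
print(classes[best], [round(p, 2) for p in combined])
```

In this toy case the second model alone would pick Engine Misfire, but the weighted average favors Pre-ignition Problem, which shows how the weights shift the final decision between acoustically similar classes.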
This research establishes a strong foundation for integrating auditory intelligence into ITSs, with significant implications for safety, accessibility, and inclusivity in smart cities. The framework demonstrates strong performance overall, with several fault classes being recognized with near-perfect accuracy. For example, classes such as Engine Misfire, Fuel Pump Cartridge Fault, Radiator Fan Failure, Strut Mount Failure, and Suspension Arm Fault consistently achieve 100% accuracy in many cases. This indicates that the framework effectively captures the distinct acoustic signatures associated with these faults. However, challenges remain for acoustically similar classes. Universal Joint Failure, for instance, is occasionally misclassified—often confused with Bad CV Joint—while Bad Wheel Bearing also shows minor misclassifications. More notably, the Knocking and Pre-ignition Problem classes face significant difficulties, with Pre-ignition Problem instances frequently being predicted as Engine Misfire. These misclassifications highlight the areas where further refinement in feature extraction or model tuning may be necessary to better differentiate between closely related acoustic patterns.
The structure of this paper is as follows: A summary of the current literature is given in
Section 2, along with potential directions for further research.
Section 3 introduces the datasets.
Section 4 focuses on materials and explains the proposed methodology.
Section 5 provides an overview of the experiments, including the experimental setup, methods, and findings collected, focusing on the performance metrics attained. The overall discussion in
Section 6 wraps up with conclusions and future work, summarizing the paper’s key contributions and suggesting directions for subsequent research in this domain.
2. Literature Review
This section discusses earlier attempts at sound recognition and sound-based defect detection in machinery, vehicles, trains, and aircraft systems. The majority of the methods assessed were developed using ML methods. Some, meanwhile, are more recent and rely on deep learning or vision transformers.
Nasim et al. introduced a sound-based early fault detection system for vehicles utilizing ML technology [
15]. This system is designed to detect vehicle faults at their initial stages by analyzing the sound emitted by the vehicle. It first performs binary classification to decide whether the vehicle is faulty or healthy. They utilized time domain, frequency domain, and time–frequency domain features to detect normal and abnormal vehicle conditions effectively. Additionally, they used the abnormal vehicle data to classify faults into fifteen typical vehicle issues. Through experimentation, the random forest algorithm yielded the best accuracy of 97% for fault detection and 92% for problem classification when utilizing time–frequency features. Hamad et al. proposed a rule-based ML technique that automatically detects engine problems [
16]. The system's generalizability is addressed by using time domain, frequency domain, and time–frequency domain features. The robustness of the developed system is evaluated using noisy sound data collected under various normal and abnormal conditions. The experimental results demonstrated that the approach outperformed other techniques by 2.6−6.0% and yielded the highest performance accuracy of 98.6%. Yildirim et al. proposed a testing and evaluation procedure on the sound quality of two types of cars [
17]. The sound quality is analyzed through the car’s road running test on the provided ground with varying running speeds. They proposed a neural network predictor to model the system for possible experimental applications. In their experiments, only objective factors of loudness, sharpness, speech intelligibility, and sound pressure level are considered essential for sound quality. The computer simulations and experiments show evidence that the neural predictor algorithm provides reasonable accommodation in different cases and allows superior prediction in two-car sound analysis.
Mel-Frequency Cepstral Coefficients (MFCCs), DWT-based features, and the Extreme Learning Machine (ELM) classifier were employed in the vehicle problem diagnostic system that Akbalik et al. presented [
18]. The proposed framework uses a big, diversified dataset that includes many vehicle models and real-world operating situations. The experiment results show that the MFCC-based features combined with the ELM classifier outperform the others in terms of accuracy, precision, recall, F1-score, macro F1-score, and weighted F1-score, which are 92.17%, 92.24%, 92.22%, 92.10%, and 92.06%, respectively. Murovec et al. created an acquisition system using the Zero-Crossing Signature (ZCS) technique [
19]. To accomplish precise engine type classification, the study used a unique level-crossing (ZCS) feature that demonstrated excellent performance in differentiating engine sounds from surrounding noise. A dataset of 417 vehicle recordings was examined, and the classification performance of the ZCS was compared to the traditional Zero-Crossing (ZC) technique utilizing a Self-Organizing Map (SOM) with a 1D grid of nine neurons. Wang et al. proposed a method for diagnosing engine acoustic signal faults using multi-level supervised learning and time-frequency transformation [
20]. First, it decomposes the fault diagnostic problem into feature augmentation, fault detection, and identification. Second, based on several time–frequency studies, it proposes an adaptive fault feature band extraction approach aimed at distinct features from different vehicle data. Finally, a frequency band attention module was designed to focus on the most meaningful frequency range to the characteristics of engine failure.
Boztas et al. proposed a learning model for improving machine fault classification using handcrafted attributes [
21]. The approach utilized texture and statistical features in classifying faults with high performance. They developed a hybrid and multilevel feature extraction technique that maintains high efficiency while lowering the complexity associated with deep learning frameworks. Using a Chi2 feature selector to eliminate redundant features, the model focused on the most informative features throughout the classification step. In the MIMII (noisy) dataset, the proposed model effectively classified more than 90% of the five cases. A Variational Autoencoder/Convolutional Neural Network (VAE-CNN) was created by Wang et al. to diagnose rolling bearing faults [
22]. The model was developed to extract complex vibration signal features to detect and categorize faults. While the CNN component increases the expressiveness of signal data and successfully handles issues like gradient vanishing and explosion, the VAE component improves noise robustness. The diagnostic accuracy of the VAE-CNN model for various fault types at varying rotational speeds typically exceeded 90%, yielding generally satisfactory diagnostic results. Xinwen Guo developed a defect diagnostic approach based on feature extraction and a word bag model using acoustics and vibration engineering science theories [
23]. This approach mainly expands the three-layer structure of the word bag model and constructs codebooks for each layer’s feature vectors based on this model. Thereafter, it develops the failure detection system of a rolling bearing based on the adaptive extended word bag model. The findings revealed that the defect detection technique has excellent diagnostic accuracy and stability, offering dependable technical assistance for regular operation and safe mechanical equipment maintenance.
Li et al. developed a defect diagnosis system for railway turnout switch machines based on sound signals [
24]. The method used Eigenmode Decomposition to improve the sound signal, reduce noise, and extract important statistical information from the time and frequency domains. The ReliefF algorithm is used for feature selection, dimension reduction, and fault classification with weighted parameters to address redundant information in high-dimensional features. The selected feature parameters are then utilized to train the Support Vector Machine. The results showed a defect diagnostic accuracy of 98% in the positioning work mode and 95.67% in the reversing work mode. Kreuzera et al. proved that diagnosing bearing defects in railway vehicles using aerial sound data is possible, even in complex real-world settings [
25]. To that purpose, many characteristics are investigated, including Mel Frequency Cepstral Coefficients (MFCCs), which are best suited for diagnosing bearing problems by analyzing airborne sound. The MFCCs were utilized to train an MLP classifier. The suggested technique is assessed using real-world data from a cutting-edge commuter train car in a dedicated measurement effort. The classification results showed that the chosen MFCC features allowed for the reliable detection of bearing defects, including those not included in the training. Eunsun Yun and Minjoong Jeong proposed a feature extraction technique for fault sound identification in EPS motors [
26]. This technique reduced the feature dimensionality while preserving the original raw waveform, which is crucial to maintaining the essential features in the waveform for anomaly detection. They combined DFMT with MFCC to optimize feature extraction. They applied LSTM-AE to classify data by segregating standard data from abnormal ones using reconstruction error metrics. The experimental results of the proposed method were proved efficient with an accuracy of 99.2%, recall of 94.0%, precision of 95.6%, and F1-score of 94.7%.
A sound-based engine classification was proposed by Shajie et al. to detect flaws in engine ball bearings [
27]. They used sound-based component extraction techniques to find reoccurring patterns across time. They proposed modifications to the ResNet and hybrid CNN models based on the NASA-bearing dataset. To identify TIM-bearing faults, they employed time and frequency features that may be inferred from the signals and their spectra. The experiments considered realistic scenarios found in real-world industrial settings. They gained insights into the method’s performance with reasonable accuracy rates. To improve industrial productivity and minimize machinery downtime, Khan et al. developed a technique for diagnosing robotic manipulator faults using motor sound analysis [
28]. They investigate the efficiency of deep learning and conventional ML in detecting motor abnormalities using a dataset created with a specifically designed robotic manipulator. It obtained an F1-score of over 92%, outperforming the traditional methods significantly, hence proving the potential of sound analysis for automatic defect identification in robotic systems using the proposed custom CNN and 1D-CNN models. Kim et al. proposed a deep denoising autoencoder method to filter out various industrial noise levels from audio data [
29]. They applied unsupervised learning models for rapid and accurate anomaly detection. They preprocessed audio data to adapt the denoising technique to the noise levels of different industrial contexts. Several experiments using different industrial equipment types demonstrated the proposed technique’s effectiveness, efficiency, and processing speed. Senanayaka et al. diagnosed machinery defects by isolating audio sources from complex mixtures of sound waves [
30]. First, they activated fault sound isolation and separated distinct fault noises from a complicated blend of sound signals. Then, the isolated fault noises were passed through a 1D-CNN classifier to ensure correct classification. A machine fault simulator by Spectra Quest equipped with a condenser mic was employed to evaluate the proposed model. To improve early vehicle defect recognition, Hameed et al. investigated the application of ML for real-time engine knocking detection [
31]. They analyzed several machine-learning techniques and retrieved frequency modulation amplitude demodulation features from engine sound data. With a classification accuracy of 66.01%, the coarse decision tree approach proved the most successful of these. The accuracy was then increased by employing deep learning models; an LSTM-based recurrent neural network (RNN) attained 90% accuracy.
Naryanto et al. developed a deep learning model to detect and classify damage or defects in diesel engines using artificial neural networks and convolutional neural networks [
32]. They utilized the DEFault dataset, which has 3500 rows of data organized under four distinct labels. Results showed that the ANN outperformed the CNN on noisier datasets, whereas the CNN performed better on less noisy datasets. Yuan et al. proposed a defect detection approach for new energy vehicle engines using wavelet transforms and Support Vector Machines [
33]. First, an abnormal noise signal identification model for vehicle engine faults is developed, and the time–frequency parameters of the basis function are adaptively changed. The engine surface radiation noise is then split into the inner mechanical and battery excitation components. The new energy vehicle engine failure signal was decomposed using feature decomposition and multiscale separation. Furthermore, fuzzy clustering and time–frequency analysis of fault signals in the fractional Fourier domain were used to detect faults in new energy vehicle engines. Chu et al. proposed an intelligent identification model for diesel engine faults based on mixed attention [
34]. They proposed a multi-cylinder whole-machine fault diagnosis model that integrates 1D-CNN with self- and mutual attention mechanisms. Single-cylinder sensor data were integrated using self-attention in the model, and signal features of each cylinder were fused using the mutual attention mechanism. Simultaneously considering the mechanism knowledge of cylinder structural consistency and signal time delay similarity, this approach utilized single-cylinder fault data to develop a comprehensive fault recognition model for all cylinders. The average diagnosis accuracy reached 100% in known fault data and about 96.65% in unknown fault data.
Lee et al. proposed a bearing failure detection using an LSTM autoencoder with self-attention based on graph convolution networks [
35]. They trained their model using data from the Fault Simulator Testbed and the Case Western Reserve University (CWRU) dataset. Results demonstrated that the proposed model attained accuracies of 97.3% and 99.9% on the CWRU dataset and the Fault Simulator dataset, respectively. Using a single microphone and a data-driven approach, Spadini et al. developed a model for intelligent fault diagnosis in rotating equipment that successfully identified 42 classes of defect kinds and severities [
36]. They considered reliable data from the unbalanced MaFaulDa dataset to balance high performance and minimal resource consumption. The model achieved remarkable performance in terms of the analysis by time, frequency, mel-frequency, and statistical parameters with an accuracy of 99.54% and F-Beta of 99.52%. Using sound samples, Gantert et al. proposed a multiclass method for identifying anomalous samples in industrial machinery [
37]. The method integrates binary models commonly found in the literature to improve the model’s generality while decreasing the number of classifiers. Using MIMII and ToyADMOS, two industrial sound datasets, they compared the proposed multiclass models with the binary alternative. Experiments revealed that 98% of the ToyADMOS dataset and 93% of the MIMII dataset were correctly classified.
Table 1 summarizes the papers reviewed in this section, highlighting the main characteristics and problems of each one.
Research gap: While Intelligent Transportation Systems have advanced significantly through vision-based technologies, a critical gap exists in integrating sound-based fault detection mechanisms. This gap is particularly evident in three areas: (1) the limited development of audio-based diagnostic systems that utilize real-time analysis of vehicle-generated sounds (e.g., engine or brake noises) and external emergency alert sounds (e.g., sirens), (2) the scarcity of comprehensive public datasets designed explicitly for vehicle sound analysis, and (3) insufficient attention to accessibility needs for individuals with disabilities within ITS frameworks. These limitations hinder the development of more inclusive and comprehensive transportation monitoring systems. To address these limitations, this study aims to develop a comprehensive dataset that serves as a “conscious ear” for intelligent systems in modern cities, vehicles, and transportation networks. This effort seeks to enhance the auditory capabilities of smart systems, enabling them to respond effectively to complex auditory scenarios, thereby enhancing safety and functionality in various applications.
3. Dataset Creation
The main problem with transportation sound-based fault diagnosis is the availability of datasets. Therefore, data from various sources are collected to create a tailored dataset for car sound analysis. Reliable audio samples are built by downloading videos from YouTube related to car faults, animal sounds, car crashes, siren sounds, etc. After this step, the model splits videos into segments and extracts those sections that may contain the target audio. Then, it converts the files into audio format. To expand the dataset, additional audio is supplemented from Kaggle datasets: FSC22 [
38], Google AudioSet [
39], Audio Classifier Dataset [
40], Sound Classification of Animal Voice [
41], and Vehicle Sounds Dataset [
42]. Lastly, the model ensures that every sample is labeled and verified.
In this vein, the dataset was created and reviewed using a combination of publicly available datasets and real-world recordings, covering a wide range of vehicle faults, crashes, emergency sirens (police, ambulance, fire truck), wild animal sounds, car and truck horns, and other environmental road sounds. This approach ensures a diverse and realistic dataset that enhances model performance in detecting road-related events. The dataset creation procedure involves different stages as shown in
Figure 1.
3.1. Data Collection and Annotation
We carefully selected publicly available datasets that include real traffic scenarios and vehicle fault cases. Additionally, we extracted relevant frames and sequences from YouTube videos, ensuring a diverse representation of traffic conditions and vehicle behaviors. Each data sample was manually labeled based on predefined criteria, focusing on vehicle states, traffic interactions, and specific fault conditions.
3.2. Expert Review and Validation
To enhance reliability, domain experts with extensive experience in automotive engineering and machine learning reviewed the dataset. The experts cross-checked and validated the labels to ensure accuracy and consistency with real-world vehicle behaviors.
3.3. Publicly Available Datasets Referenced
We utilized multiple datasets containing wild animal sounds, vehicle faults, and environmental noises to build a comprehensive dataset. The key datasets referenced include:
FSC22 Dataset: A collection focused on various sound categories, including vehicle sounds and environmental noises, useful for sound classification models [
38].
Google AudioSet: A large-scale collection of audio data across thousands of categories, aimed at improving sound classification models [
39].
Audio Data: A dataset containing diverse audio clips across various categories, useful for developing classification models [
40].
Sound Classification of Animal Voice: A dataset containing sounds from different animals, useful for animal sound classification tasks [
41].
DCASE 2024 Challenge: A dataset designed for the DCASE 2024 challenge, covering environmental sound classification tasks [
43].
UrbanSound8K Dataset: Contains 8732 labeled sound excerpts from urban environments, categorized into 10 classes such as car horns and sirens [
44].
AudioSet by Google Research: A vast dataset with over 2 million human-labeled 10 s sound clips spanning thousands of audio categories [
45].
Vehicle Sounds Dataset: Contains various vehicle sounds useful for training models focused on transportation-related sound classification [
42].
This dataset has been meticulously designed and validated to provide a diverse and realistic representation of road-related sounds, ensuring high-quality training data for machine learning models in the domain of automotive fault detection, traffic event classification, and environmental sound analysis. Standardized processing requires files of equal duration, but the collected recordings vary in length. Preprocessing is therefore applied to produce samples of the same duration. Algorithm 1 contains the pseudo-code for this stage. The primary steps in the algorithm are:
Repeat the audio until it achieves the required duration.
Normalize the audio to standardize levels.
Resample the audio to a consistent sampling rate (e.g., 16 kHz or 44.1 kHz).
Segment lengthy recordings into shorter clips (e.g., 2–5 s each).
Algorithm 1: Pseudo-code for Audio Preprocessing Script
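The four steps listed above (repeat, normalize, resample, segment) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' script: the constant names and values follow those quoted in Section 4, while peak normalization, linear-interpolation resampling, and all function names are assumptions made for illustration. A production pipeline would normally use a band-limited resampler from an audio library.

```python
import math

MIN_DURATION_MS = 10_000     # minimum clip length quoted in Section 4
TARGET_SAMPLE_RATE = 16_000  # target sample rate in Hz
SEGMENT_DURATION_MS = 2_500  # fixed segment length


def repeat_to_min_duration(samples, sample_rate, min_ms=MIN_DURATION_MS):
    """Loop a short clip until it reaches min_ms, then trim to exactly min_ms."""
    needed = sample_rate * min_ms // 1000
    if not samples or len(samples) >= needed:
        return list(samples)
    repeats = math.ceil(needed / len(samples))
    return (list(samples) * repeats)[:needed]


def peak_normalize(samples, peak=1.0):
    """Scale so the largest absolute sample equals `peak` (assumed scheme)."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)
    return [s * peak / loudest for s in samples]


def resample_linear(samples, src_rate, dst_rate=TARGET_SAMPLE_RATE):
    """Resample by linear interpolation (a simplification, no anti-aliasing)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def segment(samples, sample_rate, seg_ms=SEGMENT_DURATION_MS):
    """Cut into non-overlapping clips and drop any incomplete tail."""
    seg_len = sample_rate * seg_ms // 1000
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, seg_len)]


def preprocess(samples, sample_rate):
    """Repeat, normalize, resample, then segment, in the order listed above."""
    x = repeat_to_min_duration(samples, sample_rate)
    x = peak_normalize(x)
    x = resample_linear(x, sample_rate)
    return segment(x, TARGET_SAMPLE_RATE)
```

For instance, a 3 s recording sampled at 8 kHz is looped to 10 s, rescaled, resampled to 16 kHz, and returned as four clips of 40,000 samples each.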
4. Methodology
This study aims to develop an advanced sound-based early diagnosis system to support Intelligent Transportation Systems (ITS) by enabling real-time detection of vehicle faults and identification of emergency sounds. The main steps of this system are illustrated in
Figure 2. To achieve this objective, addressing the primary challenges encountered in this field is essential, starting with the absence of comprehensive public datasets specifically designed for vehicle sound analysis. The initial phase of the proposed model involves the creation of a dataset that contains recordings of car fault sounds, emergency sirens, and ambient noises. This process includes audio data collection and preprocessing. Subsequently, the most significant features are extracted in two versions: a compact version with 52 features and an expanded one with 126 features.
In the final step, both sets of extracted features are classified using 11 distinct ML models. The system then applies an additional optimization phase to improve classification accuracy: it combines the best-performing ML models with the highest-ranked features to build an optimized ensemble model. The following subsections provide further details on each stage of this model.
To provide further insight into how the system operates, the key steps of our audio preprocessing pipeline are as follows:
Fixed Time Windows for Feature Extraction:
We preprocess the raw audio files by extending them to a minimum duration of 10 s (MIN_DURATION_MS = 10,000 ms), normalizing their levels, and resampling them to a target sample rate of 16 kHz (TARGET_SAMPLE_RATE = 16,000 Hz).
The preprocessed audio is then segmented into fixed-length clips of 2.5 s (SEGMENT_DURATION_MS = 2500 ms).
Only segments meeting the required length are retained for further processing.
Time Window Length Determination:
The nature of vehicle fault sounds, which are typically periodic and repetitive over short durations, guided the choice of segment duration.
To ensure consistency across all samples, we set a minimum duration of 10 s for all audio files. If an audio file is shorter than this threshold, it is repeated and trimmed to match the minimum duration.
The fixed 2.5 s time window used for feature extraction ensures that features such as MFCCs, Mel Spectrogram, and Chroma Features capture sufficient temporal and spectral characteristics of the sounds.
Sliding Factor Consideration:
A fixed, non-overlapping windowing approach is used instead of overlapping sliding windows during segmentation. This ensures non-redundant segments while maintaining dataset balance.
However, future work could explore the impact of using overlapping windows to capture more temporal variations while controlling data redundancy.
This structured approach ensures that the extracted sound features represent the fault categories well while maintaining computational efficiency.
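The preprocessing steps above can be sketched in dependency-free Python. This is a minimal illustration and not the study's own script: the function name `preprocess` is hypothetical, and peak normalization and linear-interpolation resampling are assumptions standing in for whatever the original Pydub-based code does. Only the three constants are taken from the text.

```python
import math

MIN_DURATION_MS = 10_000      # minimum clip length before segmentation (from the text)
TARGET_SAMPLE_RATE = 16_000   # target sample rate in Hz (from the text)
SEGMENT_DURATION_MS = 2_500   # fixed, non-overlapping window (from the text)


def preprocess(samples, sample_rate):
    """Turn a mono signal (list of floats) into a list of fixed-length segments."""
    if not samples:
        raise ValueError("empty signal")
    # 1) Repeat the signal until it reaches the minimum duration, then trim.
    min_len = sample_rate * MIN_DURATION_MS // 1000
    if len(samples) < min_len:
        repeats = math.ceil(min_len / len(samples))
        samples = (list(samples) * repeats)[:min_len]
    # 2) Peak-normalize so the loudest sample has magnitude 1 (assumed scheme).
    peak = max(abs(s) for s in samples)
    if peak > 0:
        samples = [s / peak for s in samples]
    # 3) Resample to the target rate by linear interpolation (assumed scheme).
    n_out = len(samples) * TARGET_SAMPLE_RATE // sample_rate
    ratio = sample_rate / TARGET_SAMPLE_RATE
    resampled = []
    for i in range(n_out):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        resampled.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    # 4) Cut into non-overlapping segments and drop any trailing partial segment.
    seg_len = TARGET_SAMPLE_RATE * SEGMENT_DURATION_MS // 1000
    return [resampled[k:k + seg_len]
            for k in range(0, len(resampled) - seg_len + 1, seg_len)]
```

Under these assumptions, a 1 s recording is looped to 10 s and yields four 2.5 s segments of 40,000 samples each at 16 kHz.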
By the end of this phase, the audio files are sampled, labeled, and normalized to build the dataset. Three datasets are created: the first contains car faults (DB1), with 133 audio files and 27 distinct classes; the second contains other sounds (DB2), with 1031 audio files and 22 distinct classes; and the third (DB3) merges the first two, with 1164 audio files and 49 distinct classes.
Table 2 and
Table 3 show the labels and the corresponding file counts for DB1 and DB2, respectively.
4.1. Feature Extraction
After preparing the datasets, the next stage in the proposed system is feature extraction. Audio feature extraction is a significant task in processing an audio signal for the purpose of sound classification. From an audio signal, meaningful features can be extracted to analyze and understand the content of the audio.
Figure 3 shows some key features commonly extracted from audio signals.
The essential features in our study are extracted in two versions: a compact version with 52 features and an expanded one with 126 features. In the compact version, Mel Spectrogram [
46], MFCCs [
47], and Chroma Features [
48] were used.
Figure 4 shows an example of a Mel Spectrogram.
For generating the expanded version, Spectral Features [
49], Zero-Crossing Rate [
19], Root Mean Square Energy (RMSE) [
50], Chroma Features, MFCCs, and Extended MFCCs [
51] were used.
Table 4 defines these features, including their counts and the version(s), Compact (C) or Expanded (E), they appeared in.
Figure 5 shows a two-dimensional projection of the DB1 compact features.
The pseudo-code for extracting the compact and expanded feature lists from audio files is listed in Algorithms 2 and 3, respectively.
Algorithm 2: Pseudo-code for extracting Compact feature list |
![Machines 13 00258 i002]() |
Algorithm 3: Pseudo-code for extracting Expanded feature list |
![Machines 13 00258 i003]() |
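As an illustration of two of the simpler features in the expanded list, the sketch below computes the frame-wise Zero-Crossing Rate and Root Mean Square Energy of a signal and summarizes each by its mean and standard deviation. The frame length (2048), hop length (512), the mean/standard-deviation summary, and all function names are assumptions chosen for illustration; the study's own extraction relies on Librosa and is specified in Algorithms 2 and 3.

```python
import math
import statistics


def frame_signal(samples, frame_len, hop_len):
    """Split a signal into overlapping frames of equal length."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop_len)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root mean square amplitude of one frame."""
    return math.sqrt(sum(v * v for v in frame) / len(frame))


def summarize(samples, frame_len=2048, hop_len=512):
    """Return mean and standard deviation of ZCR and RMS over all frames."""
    frames = frame_signal(samples, frame_len, hop_len)
    zcr = [zero_crossing_rate(f) for f in frames]
    rms = [rms_energy(f) for f in frames]
    return {
        "zcr_mean": statistics.fmean(zcr), "zcr_std": statistics.pstdev(zcr),
        "rms_mean": statistics.fmean(rms), "rms_std": statistics.pstdev(rms),
    }
```

Each 2.5 s segment would contribute four values under this scheme; the remaining features (MFCCs, Mel Spectrogram, Chroma, spectral descriptors) follow the same frame-then-summarize pattern but need a spectral transform.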
4.2. Classification
In this proposed system, the input audio is classified using ML techniques. The two versions of feature lists are used to test eleven different models on the three datasets created.
Neural Network (NN): A computational model consisting of interconnected neurons [
52]. It is used for both regression and classification tasks. A neural network can be formulated by:
$$y = f(Wx + b)$$
where $y$ is the output, $f$ is an activation function, $W$ are the weights, $x$ is the input, and $b$ is the bias.
Naive Bayes (NB): A probabilistic classifier based on Bayes’ theorem, assuming independence among predictors [
53]. The NB equation is given by:
$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$
where $C$ is the class and $X$ is the feature vector.
Logistic Regression (LR): A statistical method for predicting binary classes [
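54]. The outcome is modeled using a logistic function, which outputs probabilities. Logistic Regression is formulated as follows:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{\top} x + b)}}$$
where $x$ is the feature vector, $w$ are the weights, and $b$ is the bias.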
Stochastic Gradient Descent (SGD): An iterative method for optimizing an objective function, commonly used in training ML models, particularly neural networks [
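55].
$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)$$
where $\theta_t$ are the model parameters at step $t$, $\eta$ is the learning rate, and $\nabla_{\theta} L(\theta_t)$ is the gradient of the loss function.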
k-Nearest Neighbors (kNN): A non-parametric method used for classification and regression by finding the k nearest data points in the feature space [
56].
Decision Tree (DT): A flowchart-like structure where each internal node represents a feature test, each branch represents an outcome, and each leaf node represents a class label [
57].
Random Forest (RF): An ensemble learning method that constructs multiple decision trees during training and outputs the mode of their predictions for classification tasks [
58].
Support Vector Machine (SVM): A supervised learning model that finds the optimal hyperplane that best separates different classes in the feature space [
59].
CN2 Rule Induction: An algorithm for inducing classification rules from examples [
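60]. It generates rules based on the attributes of the training data.
$$\text{IF } (cond_1 \wedge cond_2 \wedge \dots \wedge cond_n) \text{ THEN } C$$
where $cond_1, \dots, cond_n$ are conditions based on attributes and $C$ is the class label.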
Adaptive Boosting (AdaBoost): An ensemble method that combines multiple weak classifiers to create a strong classifier by focusing on errors made by previous classifiers [
61].
$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$
where $\alpha_t$ is the weight of the classifier $h_t$.
Gradient Boosting (GB): An ensemble technique that builds models sequentially, with each new model correcting errors made by the previous ones [
62].
$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$$
where $\eta$ is the learning rate and $h_m(x)$ is the new model.
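To make two of the models listed above concrete, the following sketch trains a binary Logistic Regression classifier with Stochastic Gradient Descent on a toy one-dimensional problem. It is a minimal illustration only: the function names, learning rate, epoch count, and toy data are assumptions, and the study itself works with multi-class data and the configurations in Table 5.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(X, y, learning_rate=0.1, epochs=200, seed=0):
    """Fit weights w and bias b by per-sample gradient steps on the log loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            error = p - y[i]  # gradient of the log loss with respect to the score
            w = [wj - learning_rate * error * xj for wj, xj in zip(w, X[i])]
            b -= learning_rate * error
    return w, b


def predict_proba(w, b, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

For example, training on `X = [[-1.5], [-0.5], [0.5], [1.5]]` with labels `[0, 0, 1, 1]` yields a positive weight, so larger feature values map to higher positive-class probabilities.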
4.3. Feature Ranking
Feature ranking is a crucial step in machine learning and data analysis, as it identifies the relevant features that contribute most to the predictive models [
63]. Four feature ranking methods were applied to estimate the importance of the features used in the classification experiments: information gain, analysis of variance, ReliefF, and the fast correlation-based filter.
Information Gain (IG) measures the reduction in entropy or uncertainty after splitting a dataset based on a feature [
64]. IG calculates the difference between the entropy of the target variable and the conditional entropy given the feature. Features with higher IG values are more informative.
$$IG(T, F) = H(T) - H(T \mid F)$$
where $H(T)$ is the entropy of the target variable $T$, and $H(T \mid F)$ is the conditional entropy of $T$ given the feature $F$.
Analysis of Variance (ANOVA) measures the ratio of between-class variance to within-class variance for a feature [
65]. Features with higher ANOVA values are more discriminative.
ReliefF is an extension of the Relief algorithm, estimating feature relevance by measuring the difference between the feature’s values for nearest neighbours from different classes [
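66].
$$W[F] \leftarrow W[F] - \frac{1}{k} \sum_{j=1}^{k} \mathrm{diff}(F, R, H_j) + \frac{1}{k} \sum_{j=1}^{k} \mathrm{diff}(F, R, M_j)$$
where $W[F]$ is the current weight of feature $F$, $k$ is the number of nearest neighbors considered, $H_j$ are the nearest neighbors from the same class (hits), $M_j$ are the nearest neighbors from different classes (misses), $R$ is the sampled instance, and $\mathrm{diff}$ is the difference in the value of $F$ between two instances.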
Fast Correlation-Based Filter (FCBF) evaluates feature relevance using correlation and redundancy [
67]. It selects features with a high correlation to the target variable and low redundancy.
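Two of the ranking scores described above, Information Gain and the ANOVA F-score, are simple enough to sketch directly. The code below is an illustrative sketch under stated assumptions: the function names are hypothetical, and the Information Gain function expects an already discretized feature (continuous audio features would first need binning, which is not shown).

```python
import math
from collections import Counter
from statistics import fmean


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_bins, labels):
    """H(T) - H(T | F) for a discretized feature F and target labels T."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_bins, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def anova_f_score(values, labels):
    """Ratio of between-class variance to within-class variance for one feature."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = fmean(values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    # Assumes at least two classes and some within-class spread (ss_within > 0).
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A feature that separates the classes perfectly gets the maximum Information Gain (1 bit for two balanced classes), and a feature whose class means are far apart relative to the within-class spread gets a large F-score.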
4.4. Bayesian-Optimized Weighted Soft Voting with Feature Selection (BOWSVFS)
Ensemble learning is one of the most powerful methods in machine learning, combining several models to achieve better predictive performance than any standalone model [
68]. However, determining the optimal combination of model weights remains an optimization challenge. To address this issue, the proposed approach employs Bayesian Optimization combined with Weighted Soft Voting.
In this model, Weighted Soft Voting (WSV) is the central process: several classifiers vote on the final prediction, and each classifier is assigned a weight. Soft voting, unlike hard voting, does not use each classifier's predicted class directly; instead, it uses the per-class probability distributions [
69]. Each prediction is weighted based on the perceived importance of each classifier in the entire ensemble, and the weighted probabilities are combined to obtain the final prediction.
The weights assigned to each classifier play a crucial role in the ensemble’s performance. Traditional approaches often use equal weights or weights determined through grid search. However, these methods can be computationally expensive and may not find the optimal weight configuration, especially when dealing with multiple classifiers and features. Algorithm 4 depicts the procedures for calculating WSV.
Figure 6 shows the steps of Bayesian-Optimized Weighted Soft Voting procedure.
Algorithm 4: Pseudo-code for calculating WSV |
![Machines 13 00258 i004]() |
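The weighted soft voting idea can be illustrated with a short sketch for a single sample. The function name and the example probabilities are hypothetical and serve only as an illustration; Algorithm 4 remains the authoritative description of the procedure.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-classifier class probabilities for one sample.

    probabilities[m][c] is the probability classifier m assigns to class c;
    weights[m] is the (non-negative) weight of classifier m.
    """
    total = sum(weights)
    n_classes = len(probabilities[0])
    # Weighted average of the probability vectors, normalized by the weight sum.
    combined = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    # The predicted class is the one with the highest combined probability.
    predicted = max(range(n_classes), key=combined.__getitem__)
    return predicted, combined
```

With three classifiers reporting `[0.6, 0.4]`, `[0.3, 0.7]`, and `[0.2, 0.8]`, equal weights favor class 1, whereas weights `[0.8, 0.1, 0.1]` shift the decision to class 0. This is why the choice of weights matters and is worth optimizing.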
Bayesian optimization provides a more systematic and efficient technique for determining optimal weights [
70]. It uses a probabilistic model, typically a Gaussian Process or a variant of it, to capture the relationship between the hyperparameters (classifier weights and feature counts) and model performance. This strategy is particularly useful because it explores the hyperparameter space efficiently by building a surrogate model of the objective function. It balances exploring unknown regions against exploiting known favorable ones, and therefore requires fewer iterations than grid or random search. The steps for implementing the proposed Bayesian Optimization for Weighted Soft Voting are given in Algorithm 5.
Algorithm 5: Pseudo-code for Bayesian Optimization Weighted Soft Voting |
![Machines 13 00258 i005]() |
In practice, the proposed approach can be deployed by detecting the onset of an alert or emergency sound through a preprocessing step such as sound event detection or voice activity detection (VAD). By distinguishing between background noise and relevant sound events, these techniques can help the system identify when a sound starts, even in noisy environments. Additionally, microphones or sensors can be strategically placed in or around the vehicle to capture sound more accurately, such as in isolated engine compartments with noise-canceling technology to improve sound capture quality. The system can also be integrated with existing vehicle monitoring systems to automatically trigger sound detection when certain conditions are met, such as abnormal engine behavior, sudden changes in vehicle speed, or other sensor data that might indicate a fault or emergency event.
5. Experiments and Discussion
In this study, we evaluated the performance of eleven distinct machine learning models on three datasets, utilizing two versions of feature lists: a compact version comprising 52 features and an expanded version consisting of 126 features. The models were assessed based on several performance metrics, including Area Under the Curve (AUC), Classification Accuracy (CA), F1-score (F1), Precision (Prec), Recall, Matthews Correlation Coefficient (MCC), Specificity (Spec), and Logarithmic Loss (LogLoss).
Data acquisition involved online tools for downloading YouTube videos, while segmenting and audio extraction utilized FFmpeg (v6.0) and Veed.io. Audio conversion was performed using Online Audio Converter. Audio processing, feature extraction, and analysis were conducted using Python (v3.12) (Jupyter Notebook and Spyder IDE) on a computer equipped with an Intel Core i7 processor and 16 GB RAM. Key libraries employed include Librosa (v0.10.1) and Pydub (v0.25.1). All processes were completed using standard software tools. To ensure transparency and reproducibility, all datasets and codes are publicly available in our GitHub (v3.15) repository, along with comprehensive documentation and 72 references for dataset collection, including publicly available sources and YouTube audio samples.
5.1. Performance Metrics
This study used several performance indicators to analyze the efficiency of the models under evaluation. One significant metric utilized is the Area Under the Curve (AUC), which measures a model’s ability to distinguish between positive and negative classes by computing the area under the Receiver Operating Characteristic (ROC) curve. The AUC is determined using the following formula:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{ROC}(t)\,dt$$
where $\mathrm{ROC}(t)$ is the true positive rate as a function of the false positive rate $t$, traced over the various threshold settings.
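For a binary problem, the area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise formulation; the function name is hypothetical, and a multi-class setting such as the one in this study would need a one-vs-rest average on top of it, which is an assumption not detailed here.

```python
def roc_auc(scores, labels):
    """Binary AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    # Count a full win when the positive outscores the negative, half for a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For scores `[0.1, 0.4, 0.35, 0.8]` with labels `[0, 0, 1, 1]`, three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75.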
Another important metric is Classification Accuracy (CA), calculated as the ratio of correctly predicted instances to the total number of instances in a dataset. The formula is as follows:
$$CA = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
Precision is the ratio of true positive predictions to the total number of positive predictions made by the model, indicating how precise the model is when producing positive predictions. Precision can be calculated as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall, also known as sensitivity, is the ratio of true positive predictions to all actual positive instances. It is one of the key metrics for evaluating a predictive model's ability to identify positive instances correctly. It is calculated using the following formula:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Specificity, on the other hand, measures the proportion of true negative predictions among all actual negative instances and is calculated as:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
The F1-score evaluates a model’s performance by calculating the harmonic mean of Precision and Recall. It is calculated with the following formula:
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The Matthews Correlation Coefficient (MCC) is a well-balanced measure that considers all four categories of the confusion matrix, providing a more comprehensive metric for binary classification. The MCC is calculated as follows:
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
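The confusion-matrix metrics defined above can be computed from the four counts in a few lines. The sketch below is illustrative: the function name and the example counts are hypothetical, and for the multi-class datasets in this study the counts would be taken per class (one-vs-rest) and then averaged, which is an assumption about the evaluation setup.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity, F1, and MCC from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy, "precision": precision, "recall": recall,
        "specificity": specificity, "f1": f1, "mcc": mcc,
    }
```

With 40 true positives, 45 true negatives, 5 false positives, and 10 false negatives, accuracy is 0.85, recall 0.80, specificity 0.90, and MCC roughly 0.70. The sketch does not guard against empty rows or columns of the confusion matrix, which would cause a division by zero.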
Logarithmic Loss (LogLoss) is a metric used to evaluate the performance of a classification model. It measures the accuracy of the probabilities assigned to each class. The LogLoss is calculated as:
$$\mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
where $N$ is the total number of instances, $y_i$ is the actual label (0 or 1), and $p_i$ is the predicted probability of the positive class.
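A direct implementation of the binary LogLoss is shown below as a minimal sketch. The function name and the clipping constant `eps` are assumptions; clipping is a common safeguard against taking the logarithm of zero and is not stated in the text.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0) (assumed safeguard)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

A classifier that always outputs 0.5 scores ln 2 (about 0.693), while confident and correct predictions drive the loss toward zero. This is why a lower LogLoss indicates better-calibrated probabilities.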
5.2. Hyperparameters
Hyperparameter tuning is one of the essential steps in any machine learning classification process [
71,
72]. It basically involves the selection of the best hyperparameters that control the learning process of a model. These hyperparameters are fixed before training since they are not learned from the data during training. Well-set hyperparameters can significantly improve the accuracy of a model and its generalization on unseen data. They form the basis for finding a good trade-off between bias and variance. Appropriate hyperparameters can speed up the training process, which means faster model development and deployment. Efficient hyperparameter settings can optimize resource utilization, thus reducing training time and costs.
Table 5 lists the configurations used for each model. The following are some common hyperparameters and their effect:
Learning Rate: This hyperparameter controls the step size during gradient descent when moving toward the minimum. A very high learning rate results in instability, while a very low one slows down training.
Number of Trees/Estimators: The number of trees in ensemble techniques such as Random Forest and Gradient Boosting. More trees generally improve accuracy, but training takes longer.
Tree Depth: The hyperparameter for tree-based models defines each tree’s maximum depth. Deep trees can easily capture complex patterns but tend to overfit much more.
Regularization: Methods such as L1 and L2 regularization prevent overfitting by penalizing large weights. The strength of regularization is a hyperparameter that needs tuning.
Number of Hidden Layers and Neurons: This governs the model’s architecture in neural networks.
Cross-validation is a widely used method for hyperparameter tuning and was employed in this study. It evaluates a model on held-out folds of the data to obtain a more reliable estimate of its performance. The sampling scheme used was 10-fold cross-validation.
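The 10-fold scheme can be sketched as an index generator. This is an illustration under stated assumptions: the function name and the fixed shuffling seed are hypothetical, and the folds are not stratified by class, whereas the tooling used in the study may stratify.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

For the 133 files of DB1, this produces ten splits whose test folds contain 13 or 14 files each, and every file is used for testing exactly once.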
5.3. Car Faults DB1 Evaluation
Various ML models were tested on the car faults dataset (DB1), analyzing two versions of feature lists: compact and expanded.
Table 6 and
Table 7 display the measured performance metrics in both cases.
With the compact feature list, Logistic Regression achieved the highest classification accuracy and the lowest Log Loss on DB1 among all evaluated models. With the expanded feature list, Logistic Regression again had the best classification accuracy but only the second-lowest Log Loss.
5.4. Other Sounds DB2 Evaluation
ML models were tested on the other sounds dataset (DB2), analyzing two versions of feature lists: compact and expanded.
Table 8 and
Table 9 display the measured performance metrics in both cases.
With the compact feature list, the Neural Network model achieved the highest classification accuracy and the lowest Log Loss on DB2 among all evaluated models. With the expanded feature list, the Neural Network obtained the second-highest accuracy after AdaBoost.
5.5. DB3 Evaluation
ML models were tested on the merged dataset (DB3), analyzing two versions of feature lists: compact and expanded.
Table 10 and
Table 11 display the measured performance metrics in both cases.
With the compact feature list, the Neural Network model achieved the highest classification accuracy and the lowest Log Loss on DB3 among all evaluated models. With the expanded feature list, the Neural Network reached the second-highest accuracy after AdaBoost.
5.6. Feature Ranking
Feature ranking was performed using the compact feature list on the DB1 dataset.
Table 12 shows the rankings of the 52 features of the compact list. The table’s rankings demonstrate that the top features across approaches were MFCC features (mean_10, mean_3, mean_2, mean_4) and Mel Spectrogram features (mean). The dominance of MFCC features is evident. MFCC mean features are statistically significant across all measures, consistently outperforming standard deviation features. Chromagram features, notably standard deviations, have a lower overall relevance. However, a few exceptions, such as chromagram_mean_7, have moderate rankings.
5.7. Evaluation of BOWSV
To incorporate Bayesian Optimization and Weighted Soft Voting into the proposed model, the previously extracted features are ranked by their ANOVA F-scores to select the most relevant ones. Standardization is performed to bring all variables to the same scale. Multiple diverse base classifiers are included. Optimization starts by defining the bounds for the classifier weights and feature counts. The objective function evaluates the ensemble’s performance using cross-validation, so that the estimates of generalization performance are reliable, while an acquisition function guides the search for optimal parameters. The procedure converges toward the best configuration by iteratively proposing different weight combinations and assessing how well they work.
Table 13 shows the metrics of the three datasets, DB1, DB2, and DB3, after applying ensemble optimization. For each iteration, the weights (w1, w2, w3), the number of features used, and the achieved accuracy are reported.
Due to the large number of classes (27 for car faults, 22 for environmental sounds, and 49 for the merged dataset), a full confusion matrix would be impractical. Instead, key examples of classification performance have been summarized. Several classes—including Engine Misfire, Fuel Pump Cartridge Fault, Radiator Fan Failure, Strut Mount Failure, Suspension Arm Fault, and others—are classified with 100% accuracy, and the Bad CV Joint class achieves around 75% accuracy. Furthermore, the Bayesian-Optimized Weighted Soft Voting with Feature Selection (BOWSVFS) approach demonstrates the robustness of the model by achieving an overall accuracy of 91.04% on the car fault dataset (DB1).
However, some classes present challenges. For instance, the Universal Joint Failure or Steering class has an 80% correct classification rate, with misclassifications primarily as engine rattling noise. The Knocking class, in particular, exhibits significant difficulty, with only 40% of instances correctly classified and misclassifications distributed across categories such as bad wheel bearing, squeaky, and squeaky brake (or grinding brake). These examples highlight the strengths and areas for improvement within the proposed framework.
5.8. Outlook and Future Perspectives
The practical implications of this research are far-reaching. The framework enhances overall transportation safety through timely interventions and improved emergency response by enabling early fault detection and real-time classification of vehicle and environmental sounds. Furthermore, the ability to accurately interpret auditory cues supports the development of more accessible and inclusive ITSs. For instance, auditory alerts can be transformed into visual or haptic signals, thereby assisting individuals with disabilities and ensuring that critical safety information is disseminated effectively. These advances pave the way for smarter, more responsive urban transportation systems that improve efficiency and significantly elevate the safety and quality of life in smart cities.
Although the proposed classification methods can achieve a high degree of accuracy in sound-based early fault detection for vehicles in ITSs, there remains potential for further enhancement through the incorporation of explicit user feedback, such as ratings of the classification results. The efficacy of machine learning systems can be notably augmented by fostering a collaborative relationship with users, improving the system’s accuracy and enhancing user understanding and trust in the system [
73,
74,
75]. Users can contribute to the classification model by providing explicit collective feedback regarding its classification accuracy and the early detection of faults for vehicles in ITSs. This feedback can subsequently be utilized to refine the overall accuracy of the classification model. For instance, users might assign scores or ratings to the accuracy of detected faults. Nonetheless, sustaining user motivation for continuous feedback poses a challenge, as many users exhibit limited interest in participating in such evaluations [
76].
The gamification concept is employed as a behavioral change strategy to enhance user motivation toward engaging in desired behaviours, such as providing feedback on the classification accuracy of detected faults for vehicles in ITS [
77,
78]. A prevalent application of gamification involves incorporating elements of video games, such as points and levels, into non-gaming contexts, such as educational settings [
79]. Gamification has demonstrated successful implementation across various domains, including the promotion of healthy lifestyle choices [
80], the enhancement of student engagement in academic courses [
81], and the improvement of quality and productivity within business environments [
82]. There are four primary elements of gamification commonly utilized in non-gaming contexts [
83]:
Points: Many gamification strategies rely on point systems, which may include features such as levels and leaderboards. The classification accuracy of detected faults can be quantified through user ratings regarding the quality of fault detection for vehicles in ITS. Points accumulated or lost will subsequently inform the classification model’s training to enhance its ability to detect sound-based faults for vehicles in ITSs early. Nevertheless, points should be integrated with other gamification elements to effectively motivate users [
83].
Digital Badges: Users may receive digital badges as recognition for acquiring specific skills, knowledge, or achievements, thereby showcasing their accomplishments [
84]. These badges are typically awarded based on predefined criteria [
85,
86,
87]. For example, users might earn digital badges by reaching a specified number of points corresponding to their ratings on the classification accuracy of early detected sound-based faults for vehicles in ITSs.
Levels: Users must accumulate points to advance to higher levels. Upon reaching a predetermined point threshold, they can level up, thereby unlocking additional features within the system [
88].
Leaderboards: Users can establish leaderboards to reflect their achievements or points earned or to track progress toward specific goals [
86].
A recent study [
89] identifies several factors that affect users’ perceptions and responses to gamification elements utilized for feedback collection, revealing diverse preferences in this context. This underscores the necessity of systematically gathering users’ explicit and collective feedback, which can be instrumental in optimizing our proposed classification model to align with user preferences. Neglecting this aspect could result in overlooking critical factors that enhance classification accuracy. To address this, one can utilize the application-independent conceptual framework proposed by [
89], which can be adapted to gamify the feedback collection process regarding the accuracy of our sound-based early detection of faults for vehicles in the ITS system. This framework articulates the variations in user perceptions and needs concerning gamification elements, aiming to motivate users to provide high-quality feedback on the classification accuracy of our proposed system. It serves as a guiding resource for software engineers in encouraging users to offer their explicit and collective feedback, thereby facilitating further training of our classification model and potentially improving its early fault detection accuracy for vehicles in the ITS. Additionally, a category representing normal operational conditions or safe car sounds can be included to better differentiate faults from irrelevant auditory data.