Article

Acoustic Analysis of Semi-Rigid Base Asphalt Pavements Based on Transformer Model and Parallel Cross-Gate Convolutional Neural Network

1 National Engineering Research Center of Highway Maintenance Equipment, Chang’an University, Xi’an 710018, China
2 Henan Gaoyuan Highway Maintenance Technology Co., Ltd., Xinxiang 453000, China
3 School of Information Engineering, Chang’an University, Xi’an 710018, China
4 School of Engineering Machinery, Xi’an University of Science and Technology, Xi’an 710018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9125; https://doi.org/10.3390/app15169125
Submission received: 28 July 2025 / Revised: 14 August 2025 / Accepted: 15 August 2025 / Published: 19 August 2025
(This article belongs to the Section Civil Engineering)

Abstract

Semi-rigid base asphalt pavements, a common highway structure in China, often suffer from debonding defects that reduce road stability and shorten service life. In this study, a new road debonding detection method based on the acoustic vibration method is proposed to address hidden debonding defects that are difficult to detect. The approach combines a Transformer model with a parallel cross-gated convolutional neural network, forming the Transformer-based Parallel Cross-Gated Convolutional Neural Network (T-PCG-CNN), to classify and recognize semi-rigid base asphalt pavement acoustic data. Firstly, over a span of several years, an excitation device was designed and employed to collect acoustic data from different road types, creating a dedicated multi-sample dataset specifically for semi-rigid base asphalt pavements. Secondly, the improved Mel frequency cepstral coefficient (MFCC) feature and its first-order differential features (ΔMFCC) and second-order differential features (Δ2MFCC) are adopted as network inputs to characterize the acoustic signals of the different sample types. Then, the proposed T-PCG-CNN model fuses the multi-frequency feature extraction advantage of the parallel cross-gated convolutional network with the long-range dependency modeling of the Transformer to improve the classification of different road acoustic features. Comprehensive experiments were conducted to analyze parameter sensitivity, feature combination strategies, and comparisons with existing classification algorithms. The results demonstrate that the proposed model achieves high accuracy and weighted F1 score. The confusion matrix indicates high per-class recall (including debonding), and the one-vs-rest ROC curves (AUC ≥ 0.95 for all classes) confirm strong class separability with low false-alarm trade-offs across operating thresholds. Moreover, the use of blockwise self-attention with global tokens and shared weight matrices significantly reduces model complexity and size. In the multi-type road data classification test, the classification accuracy reaches 0.9208 and the weighted F1 value reaches 0.9315, which is significantly better than existing methods, demonstrating its generalizability in the identification of multiple road defect types.

1. Introduction

The construction of semi-rigid base asphalt pavements provides durability and driving comfort on roads subject to heavy traffic and variable climatic conditions. However, moisture, temperature fluctuations, and uneven subsidence can cause debonding between the asphalt and water-stabilized layers [1,2,3]. Debonding often begins covertly, hampering early detection without specialized methods. Over time, these localized defects can spread, compromising structural integrity and shortening pavement service life [4,5].
For semi-rigid base asphalt pavement debonding detection, researchers worldwide have conducted extensive studies and optimized a variety of techniques [6,7,8,9]. Commonly used debonding detection techniques for semi-rigid bases include ground-penetrating radar, deflection (bending subsidence) testing, and the acoustic vibration method. Ground-penetrating radar [10,11,12], due to its wavelength limitations, is mainly suitable for detecting large-scale debonding. Rasmussen [13], Pedersen [7], Graczyk [14], Muller [15], and Zofka [16] used Traffic Speed Deflectometers (TSD) to investigate road debonding and to model road foundation behavior and deflection, but the deformation distribution behavior and testing error of semi-rigid base asphalt pavements still require further study.
Acoustic vibration methods, traditionally used for rigid pavement analysis [17,18,19], also show promise for semi-rigid pavements by analyzing impact-generated acoustic waves. However, low bonding strength between the asphalt and stabilizing layers, irregular porosity, and complicated attenuation paths make acoustic signal interpretation far more challenging in semi-rigid asphalt structures [18]. A reliable acoustic-based approach must therefore handle multi-scale noise, complex frequency overlap, and long-range temporal variations in the signals. Recent material-side advances include the studies of Kuz’min [20], showing that burnt-rock aggregates can alter the mechanical and durability performance of concrete, and Gunka [21], demonstrating that phenol–cresol–formaldehyde resins improve bitumen–aggregate adhesion. Together, these studies highlight material variability in pavements and motivate robust, data-driven in-situ monitoring.
Recent research demonstrates that data-driven approaches can significantly enhance the precision and efficiency of road condition monitoring [22,23]. Convolutional Neural Networks (CNNs) are adept at extracting local time–frequency features from acoustic data [24]. Nevertheless, CNNs have inherent constraints in modeling long sequential dependencies, which are essential for identifying subtle, gradually evolving debonding signals [25,26]. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) architectures can capture extended temporal information but often suffer from vanishing gradients and high computational overhead, impeding large-scale or real-time applications [27].
Moreover, although approaches like Mel-Frequency Cepstral Coefficients (MFCC) effectively capture spectral characteristics in short time frames, they may not fully accommodate the dynamic variations required to detect transient debonding anomalies. CNNs often leverage MFCC-based inputs to enhance classification performance in various acoustic tasks, including speech recognition [28] and environmental acoustic classification [29]. By focusing on local time–frequency representations, CNN + MFCC pipelines can robustly handle steady-state or moderately varying signals [30]. However, when confronted with rapidly changing or highly noisy acoustic environments, they may struggle to capture longer temporal dependencies, resulting in reduced accuracy for anomalies such as road debonding [31]. Acoustic reflections and attenuation paths in pavement structures can significantly distort local spectral patterns, underscoring the need for more advanced architectures capable of modeling global context as well as localized features.
To handle long-range dependencies more efficiently, Transformer architectures have gained significant traction in signal processing [32,33]. Their multi-head self-attention mechanism is particularly well-suited for capturing global contextual cues in noisy acoustic environments [34,35]. However, vanilla Transformers often face two critical challenges: high computational complexity [36,37,38] and reduced robustness in high-noise or limited-data scenarios [39,40,41,42]. Consequently, there is a pressing need for approaches that integrate local feature extraction with global attention while maintaining computational efficiency and noise robustness.
Convolutional networks with gating (CG-CNNs) [43] employ lightweight channel- and spatial-gating to recalibrate feature maps, enhancing salient time–frequency patterns while suppressing nuisance noise. This design offers strong local discrimination at modest parameter cost. However, because gating acts within localized receptive fields, CG-CNNs are limited in capturing long-range temporal interactions and cross-event context; reweighting alone cannot model global dependencies or event ordering. Motivated by these limitations, we next consider Parallel Cross-Gate CNNs (PCG-CNNs). PCG-CNNs introduce gated modules and multi-branch convolutions to efficiently merge complementary frequency components, showing enhanced adaptability in complex acoustic classification tasks [44,45]. Although PCG-CNNs exhibit excellent performance in local noise suppression, their reliance on localized receptive fields can limit their effectiveness in capturing long time-range interactions [46,47]. Therefore, an integrated Transformer + PCG-CNN design can mitigate both local and global feature extraction challenges: the multi-branch convolution and gating mechanisms in PCG-CNNs effectively handle noisy local structures, while the Transformer focuses on long-range dependencies.
Nevertheless, many existing acoustic vibration-based detection studies predominantly focus on specific road sections or operating conditions, thus adopting a case-driven application approach [14,48]. Due to the lack of standardized testing protocols and direct comparison frameworks, the reported results often lack comparability and fail to highlight the theoretical contributions and practical reliability of these methods across varied conditions. Against this background, the present study not only integrates the complementary strengths of Transformers and PCG-CNNs at the network design level but also constructs a diverse road-condition dataset with multiple evaluation metrics to enable more reproducible and comparative experiments in acoustic-based detection. This study first constructs a diverse dataset of acoustic signals from three representative semi-rigid base asphalt pavements. Then, it extracts static and dynamic MFCC features to capture time–frequency characteristics. A T-PCG-CNN model is proposed by integrating parallel gated CNNs and a Transformer encoder. Finally, comprehensive experiments are conducted to evaluate accuracy, robustness, and generalization ability.
The remainder of this paper is organized as follows. Section 2 introduces the construction of the road-acoustics dataset and the multi-level feature-extraction workflow, detailing the acquisition hardware, sample statistics, and the generation of MFCC, ΔMFCC and Δ2MFCC features. Section 3 describes the proposed T-PCG-CNN architecture, which integrates a parallel cross-gated convolution module with a lightweight Transformer to balance local noise suppression and long-range dependency modelling. Section 4 presents comprehensive experiments and result analyses—including parameter sensitivity studies, ablation experiments and comparisons with existing algorithms—to demonstrate the method’s advantages in accuracy and noise robustness. Section 5 summarizes the work and outlines future research directions.

2. Road Acoustic Dataset and Multi-Level Feature Extraction

2.1. Road Type Selection and Acoustic Characteristics

This study investigates several highways and provincial semi-rigid base asphalt pavements in China, referencing existing literature on different road types and structures [3,5,49,50]. Specifically, three representative roads were selected:
  • Changji Expressway: connecting Jincheng, Shanxi Province, to Jiaozuo, Henan Province, designated G5512.
  • National Highway 107: Xinxiang, Henan Province.
  • South Outer Ring Municipal Road: Town Road in Xinxiang City, Henan Province.
These roads were chosen because they are typical semi-rigid base asphalt pavements yet differ in design bearing capacities, traffic volumes, and structural configurations, as illustrated in Figure 1.
These three structural designs capture the variations in different road types and traffic demands [50]. The G5512 Expressway employs a multi-layer SBS modified asphalt concrete structure featuring high flexibility and strength, efficiently withstanding stress concentrations under large traffic loads and high-speed conditions. National Highway 107 combines the flexibility of asphalt concrete with the rigidity of a cement-stabilized layer, thus providing enhanced fatigue and rut resistance suitable for moderate traffic volumes. In contrast, the South Outer Ring Municipal Road emphasizes surface-layer flexibility coupled with a high-strength base, aiming to improve surface durability for lower-speed, varied-load scenarios.
Building on these structural and functional distinctions, each pavement type exhibits distinctive reflection, absorption, and scattering properties that substantially influence the time–frequency distribution of the detected signals. Asphalt surfaces typically present moderate absorption due to their inherent porosity, which tends to dampen higher-frequency components and alter vibration energy pathways. Cement-based layers, by contrast, are generally more reflective, producing stronger echoes and broader high-frequency energy distributions. Composite pavements blend features of both asphalt and concrete, resulting in a more intricate acoustic signature that simultaneously reflects and absorbs energy across different frequency bands. Consequently, the signals recorded from each of these pavement structures exhibit unique time-frequency characteristics, thereby affecting subsequent feature extraction and classification in debonding detection.
Collectively, these three roads—encompassing expressways, national highways, and municipal roads—offer a broad range of application value and sample diversity for analyzing acoustic characteristics in semi-rigid base asphalt pavements.

2.2. Data Acquisition: Collection Setup and Signal Processing Principles

In the data collection process of this study, road structure continuity CAE (Cavity-Acoustic-Effect) testing equipment developed by the China National Engineering Research Centre for Highway Maintenance Equipment was used for data collection, and the testing vehicle continuously collected road excitation acoustic signals (data collection was performed under authorization from the relevant road authority). The road acoustic collection equipment and the principle of collection are shown in Figure 2.
The detection principle is based on continuous tapping of the road under test by a specially designed excitation wheel, which generates mechanical vibrations. The excitation acoustic signals are captured by a dedicated acoustic acquisition device with a sampling frequency of 44.1 kHz and a bit depth of 16 bits, and are subsequently converted into electronic signals by an acoustic–electrical converter. These electronic signals are transferred to a computer data processing system together with the position information. The software system analyzes the discontinuity characteristic parameters of the road structure, generates the acoustic characteristic curve and baseline of the detected road section, and, in combination with the data from the distance acquisition module, further determines the location and extent of the discontinuity area. The operation and data interpretation of this equipment require professional engineers with a high level of experience. This study only uses this equipment to collect road excitation acoustic signals.

2.3. Dataset Structuring and Statistical Distribution

After continuously collecting road excitation acoustics from the three pavement sections mentioned above, field surveys, coring, and distance correction calibration techniques were employed to ensure a strict correspondence among the acoustic data, distance measurements, and road events. This resulted in a semi-rigid base asphalt road acoustic dataset. To increase sample size and enhance the model’s generalization capability, data augmentation was performed, ultimately yielding a total of 1560 samples of damage, 1680 samples of debonding, 440 samples of manhole cover (primarily metal or concrete in municipal settings), 860 samples of vehicle noise (environmental noise), and 5300 samples labeled as others. As shown in Table 1, the dataset is further subdivided into G-datasets (from G5512), NR-datasets (from National Road 107), and TR-datasets (from the Town Road), each containing a similar distribution of categories.
Despite substantial efforts to achieve balanced categories, a clear imbalance remains due to the notably larger number of samples in the “others” category (n = 5300) compared to the other classes. This discrepancy can lead to class imbalance issues, biasing certain machine learning models. Moreover, manhole-cover and vehicle-noise samples account for only about 3% and 6% of the total data, respectively, which means that relying solely on accuracy can result in models that prioritize majority classes and overlook minority classes. Consequently, a model might exhibit high accuracy overall but perform poorly on underrepresented categories. In such scenarios, the F1 score provides a more balanced measure of performance by capturing the trade-off between precision and recall, particularly for minority classes [32]. Details on the F1 score evaluation approach are presented in Section 4.
To avoid introducing synthetic artifacts in the acoustic domain, we deliberately refrained from aggressive oversampling or synthesis-based balancing (e.g., SMOTE on spectrograms) in this study. Instead, we report per-class results and adopt the weighted F1 score as the primary metric to mitigate evaluation bias toward majority classes. In future work, we will investigate calibrated oversampling and loss re-weighting (e.g., focal or class-weighted losses) to further enhance minority-class recognition without compromising acoustic realism.

2.4. Acoustic Feature Analysis: Time–Frequency Domain Comparisons

Analyzing the frequency characteristics of the data is crucial for understanding the underlying acoustic properties [51,52,53,54]. In Figure 3a, the original time-domain signals (Acoustic Time Domain Signal, ATDS) for each sample type—damage, debonding, manhole cover, other, and vehicle noise—reveal overall amplitude levels and waveform structures. However, these raw signals may mask short-lived or subtle events due to overlapping vibration energy and environmental noise.
To better capture dynamic acoustic variations, first-order and second-order derivatives were introduced [55,56]. The first-order derivative (Figure 3b) highlights the rate of change in amplitude, helping to diminish the influence of inconsistent knocking intensities or transient disturbances. For instance, the damage signals exhibit sharp, high-amplitude spikes in the derivative waveform—signifying sudden, high-energy events—whereas debonding and manhole cover signals show more moderate changes. The second-order derivative (Figure 3c) further accentuates acceleration or curvature in the signal, effectively revealing structural nuances in road layers. Notably, damage signals again stand out, while vehicle noise and other background signals maintain relatively smoother profiles, confirming their less abrupt nature. Through these derivative analyses, specific acoustic behaviors—ranging from abrupt impact events to mild background vibrations—become more discernible.
By analyzing the original signal, first-order, and second-order differential signals, the dynamic variation characteristics of different acoustic signals can be more clearly understood. Signals of the “damage” type exhibit significant and abrupt changes in the differential analysis, indicating sudden or high-intensity events. The “debonding” and “manhole cover” types show moderate changes, which may be related to regular mechanical activities. Signals of the “other” and “vehicle noise” types exhibit smaller variations in the differential signals, reflecting their characteristics as background or continuous acoustics.
Building on the time-domain data analysis presented in Figure 3, the Fast Fourier Transform (FFT) method was introduced to identify and analyze distinct features of road acoustic signals [53,54] more effectively. This approach addresses the limitations of time-domain analysis, which may not accurately differentiate between variations in road materials and structures. The FFT method transforms the acoustic data from the time domain to the frequency domain, allowing for clearer identification of the frequency components and their intensity distribution within the acoustic signal, as shown in Figure 4. This frequency analysis not only reveals information hidden in the time-domain data but also enables more effective recognition and differentiation of various road surface characteristics and structures, providing a more comprehensive reflection of the dynamic acoustic features of the road.
Based on the frequency-domain data analysis in Figure 4, the frequency distribution and power density characteristics of different acoustic samples reflect their respective acoustic properties and acoustic source characteristics. Vehicle noise exhibits multiple frequency peaks, indicating persistence and complexity, while the spectra of damage and debonding signals are simpler, primarily concentrated in the low-frequency range. The manhole cover and other signals show some energy distribution in the mid-to-low frequency range, reflecting the diversity of their data samples.
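For reference, the frame-level spectra compared in Figure 4 can be reproduced with a short NumPy sketch; the 44.1 kHz sampling rate follows Section 2.2, and the Hamming windowing mirrors the feature pipeline in Section 2.5, while the function and variable names are illustrative rather than taken from the authors’ code:

```python
import numpy as np

def frame_spectrum(frame, sr=44100):
    """Magnitude spectrum of one windowed frame of a road excitation signal."""
    windowed = frame * np.hamming(len(frame))        # Hamming window reduces spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))         # one-sided FFT magnitudes
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)  # frequency axis in Hz
    return freqs, spectrum
```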

2.5. Multi-Level Feature Extraction

In acoustic recognition, directly using raw one-dimensional audio data often makes it difficult to effectively extract useful information, especially in high-noise environments. To address this, various feature extraction methods have been proposed to enhance the robustness of the signal and improve feature representation.
In principle, spectrograms (STFT) and wavelet transforms can describe time–frequency structure, but they are less suited to our road–impact acoustics and deployment constraints. The STFT [51,53] method uses a fixed analysis window, which forces a trade-off between time and frequency resolution: short windows are needed to resolve the brief impact/echo transients, but they blur frequency content; longer windows improve frequency resolution but smear the onset/decay and increase leakage—effects that are aggravated by vehicle speed variation and nonstationary traffic/wind noise. Achieving robustness typically requires dense, high-resolution spectrograms, which increase memory and latency and are undesirable for edge inference. Wavelet transforms [52,56] offer adaptive multi-resolution, yet practical use requires selecting a mother wavelet and scale set; for semi-rigid pavements with changing materials, speeds and climates, a single configuration is brittle, while CWT representations are redundant and computationally demanding. We also observed that small shifts in impact timing or sensor mounting can lead to unstable wavelet coefficients, complicating model training under limited labels. These factors make spectrogram/wavelet front ends less attractive for our noisy, drive-by acquisition scenario.
In light of these limitations, we turn to established parametric and cepstral representations: Linear Predictive Coding (LPC) [57], Linear Predictive Cepstral Coefficients (LPCC) [58], and Mel-Frequency Cepstral Coefficients (MFCC) [59,60,61], among others. LPC and LPCC are based on a linear predictive model, which describes signal characteristics through the relationship between the current frame and previous frames; however, they are susceptible to noise interference in low signal-to-noise ratio (SNR) environments. In contrast, MFCC applies a Mel-scale filter bank to approximate human auditory perception and applies the Discrete Cosine Transform (DCT) to the logarithmic filter-bank energies. MFCC offers several advantages for this application:
  • Dimensionality Reduction: MFCC reduces thousands of spectral bins into a compact feature set, significantly lowering the computational overhead required in real-time road monitoring systems [62].
  • Noise Robustness: Numerous studies in road acoustics have shown that MFCC-like features are robust to moderate noise, a critical factor in environments with high background noise, such as highways [63].
  • Computational Efficiency: The relatively low complexity of generating MFCCs (compared to full-resolution Wavelet analyses) makes them ideal for embedded or edge computing systems [64].
Building on the advantages outlined above, this study converts road excitation acoustic data into MFCC features, which are then used as input for an acoustic recognition network for training and identification [59,61,62]. By utilizing MFCC features, the deep learning model’s powerful feature learning capabilities in speech recognition tasks can be leveraged, ultimately achieving higher accuracy in acoustic recognition.
An improved MFCC-based acoustic feature extraction algorithm is proposed here, and its specific steps are as follows:
  1. Framing: the input audio signal x(t) is divided into multiple short-time frames, each with a frame length of 86 ms and a frame shift of 30 ms.
  2. Windowing: to enhance the continuity at both ends of each frame and to avoid spectral leakage caused by rectangular truncation of the signal, a Hamming window is applied. The formula for the Hamming window is as follows:
W(n) = (1 - \alpha) - \alpha \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)
where α is the Hamming window parameter, set to 0.46 in this study to optimize the spectral characteristics in signal processing [58], and N represents the number of samples in each frame.
  3. DFT: the FFT is applied to each frame of the signal to obtain its frequency spectrum. The power spectrum of the acoustic signal is then obtained by taking the squared magnitude of the frequency spectrum. The transformation formula is as follows:
X_a(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N-1 \qquad (2)
where x(n) is the acoustic signal after denoising, windowing, and framing, and N is the number of samples per frame.
  4. Power Spectrum Calculation: the power spectrum is obtained by taking the squared magnitude of the frequency spectrum of the acoustic signal, resulting in the spectral line energy P(k):
P(k) = \frac{1}{N} \left| X_a(k) \right|^2 \qquad (3)
Next, the power spectrum P(k) is passed through a set of Mel-scale triangular filters to obtain the Mel spectrum. At each frequency bin, the product of P(k) and the filter response H_m(k) is calculated. The frequency response of the triangular filter H_m(k) is computed as follows:
H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k < f(m) \\ 1, & k = f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases} \qquad (4)
where 0 ≤ m ≤ M (in this study, M = 50), and f(m) is the center frequency of the m-th filter.
The power spectrum from Equation (3) is multiplied by each of the M triangular filters and accumulated, giving the Mel log-spectrum S(m) of the corresponding frequency band for each frame, as shown in Equation (5):
S(m) = \ln\left( \sum_{k=0}^{N-1} P(k)\, H_m(k) \right) \qquad (5)
Finally, the Mel-frequency cepstral coefficients (MFCCs) C(n) of the road excitation acoustic signal are obtained using the DCT, as shown in Equation (6):
C(n) = \sum_{m=0}^{M-1} S(m)\, \cos\left( \frac{\pi n (m + 0.5)}{M} \right) \qquad (6)
  5. Dynamic MFCC Feature Extraction: since MFCC features represent the static characteristics of the acoustic signal, they do not fully capture the dynamic features of road excitation acoustics. To better reflect the dynamic MFCC [62] characteristics of the road, this study computes the first-order derivative ΔMFCC and second-order derivative Δ2MFCC of the excitation signal. The first-order derivative is used to analyze the intensity variation of the acoustic data, helping to mitigate the effects of slight variations in the acoustic signal caused by road surface cracks during road inspection. The second-order derivative analyzes the acceleration of these intensity changes and aids in distinguishing the different material and structural characteristics of the road.
The first-order derivative coefficients of MFCC (ΔMFCC) can be calculated using the following formula:
D(n) = \frac{\sum_{i=-I}^{I} i\, C(n+i)}{\sum_{i=-I}^{I} i^2} \qquad (7)
where C(n + i) is the MFCC coefficient, D(n) is the first-order MFCC signal parameter, and I = 2.
The second-order derivative coefficients of MFCC (Δ2MFCC) are obtained by applying the same operation to the result of Equation (7), as shown in Equation (8):
D^2(n) = \frac{\sum_{i=-I}^{I} i\, D(n+i)}{2 \sum_{i=-I}^{I} i^2} \qquad (8)
where D^2(n) is the second-order MFCC signal parameter.
Using Equations (6)–(8), MFCC, ΔMFCC, and Δ2MFCC features were extracted, respectively, constructing the feature datasets shown in Figure 5. These extracted features provide a fundamental data basis for the subsequent study of T-PCG-CNN and lay a solid foundation for the next steps in feature learning.
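As an illustrative implementation of this workflow, the following Python sketch (assuming the librosa library; the MFCC order n_mfcc and the file handling are hypothetical, while the 86 ms frame, 30 ms shift, M = 50 Mel filters, and derivative half-width I = 2 follow the values stated above) generates the three feature streams:

```python
import librosa

def extract_mfcc_features(wav_path, sr=44100, n_mfcc=13, n_mels=50):
    """Compute MFCC, ΔMFCC and Δ²MFCC for one road-excitation recording.

    Frame length 86 ms and frame shift 30 ms follow Section 2.5; the number of
    Mel filters (M = 50) follows Equation (4). The MFCC order is an assumption.
    """
    y, sr = librosa.load(wav_path, sr=sr)          # 44.1 kHz, mono
    n_fft = int(round(0.086 * sr))                 # 86 ms frame length
    hop = int(round(0.030 * sr))                   # 30 ms frame shift

    # Static MFCCs: Hamming window, Mel filter bank, log energies, DCT
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft,
        hop_length=hop, n_mels=n_mels, window="hamming")

    # Dynamic features: first- and second-order regression deltas (I = 2 -> width 5)
    d_mfcc = librosa.feature.delta(mfcc, width=5, order=1)   # ΔMFCC, cf. Eq. (7)
    d2_mfcc = librosa.feature.delta(mfcc, width=5, order=2)  # Δ²MFCC, cf. Eq. (8)
    return mfcc, d_mfcc, d2_mfcc
```

Note that librosa’s regression-based deltas approximate Equations (7) and (8); an exact reimplementation of the paper’s differencing scheme would follow the same pattern.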

3. Neural Network Design and Optimization

3.1. Overall Network Design

Road acoustic signals exhibit complex time–frequency characteristics and significant dynamic variations due to influences from road structures, materials, and environmental noise. To better capture these subtle features, this study first extracts MFCC and enhances the ability to capture dynamic variations in acoustic signals through ΔMFCC and Δ2MFCC. MFCC effectively extracts the fundamental spectral information of the acoustic signal, while ΔMFCC and Δ2MFCC further capture intensity fluctuations, making the features more suitable for scenarios with significant dynamic changes.
To fully exploit the time–frequency characteristics of the data, this study proposes a network architecture that integrates the Transformer model with a Parallel Cross-Gated Convolutional Neural Network (PCG-CNN) [41,62], collectively forming the T-PCG-CNN structure. This architecture is specifically designed for semi-rigid base asphalt pavement acoustic recognition; as illustrated in Figure 6, the proposed model takes MFCC, ΔMFCC, and Δ2MFCC inputs, each passing into the PCG-CNN + Transformer pipeline, and ultimately merges them for final classification. This design ensures robust local feature filtering and global context modeling.
This network employs the PCG-CNN module to process different input features in parallel, leveraging the strength of convolutional neural networks in local feature extraction to progressively refine useful information from the acoustic signals. Meanwhile, to model the long-term dependencies within the input features, the Transformer model is incorporated. Its powerful self-attention mechanism effectively captures correlations between distant features in sequential data, thereby enhancing the recognition of hierarchical structural characteristics of the road. The design of the T-PCG-CNN module and its specific role in feature extraction are detailed in the following sections.

3.2. PCG-CNN Module

As analyzed in Section 3.1, road acoustic signals are often subject to heavy noise interference, multi-scale time-frequency variations, and limited labeled data, making it difficult for either purely convolutional networks or large-scale Transformer models alone to adequately capture both local features and long-range dependencies. To address these issues, a PCG-CNN is proposed. By combining multi-branch convolutions with a cross-gate mechanism, the module adaptively suppresses noise while enhancing subtle acoustic signals [46]. This design keeps the overall model size manageable yet significantly boosts the clarity and discriminative power of the features passed to the Transformer encoder.
The design and internal architecture of the PCG-CNN module are illustrated in Figure 7. It consists of three interconnected CG-CNN modules, each containing a CNN layer and a Cross Gate Logic Generation Module. Let M_1 ∈ ℝ^{M×T}, ΔM_1 ∈ ℝ^{M×T}, and Δ^2 M_1 ∈ ℝ^{M×T} denote the MFCC, ΔMFCC, and Δ2MFCC features of the sample acoustic data, respectively. Here, M is the feature dimension and T is the number of data frames. These features serve as the input to the first layer of the PCG-CNN.
Within the leftmost CG-CNN module, a CNN layer and a Cross Gate Logic module jointly perform local feature extraction and adaptive gating:
  1. CNN Layer
The CNN layer extracts the fundamental information in M_1, expressed by the following:
A_1^a = \mathrm{Conv}(M_1, \theta_1^a) \qquad (9)
where A_1^a ∈ ℝ^{M_1×T_1} is the output of the CNN layer, and θ_1^a are the learnable parameters. The dimensions (M_1, T_1) reflect the result of the convolutional operations (potentially including stride, padding, or dilation).
  2. Cross Gate Logic
The Cross Gate Logic module takes the sample acoustic features M_1, ΔM_1, and Δ^2 M_1 as inputs. Its operation acts as a complementary gating network, producing the following:
G_1^a = \left( \delta(\mathrm{Conv}(M_1, \theta_1^{aa})) + \delta(\mathrm{Conv}(\Delta M_1, \theta_1^{ab})) + \delta(\mathrm{Conv}(\Delta^2 M_1, \theta_1^{ac})) \right) / 3 \qquad (10)
where G_1^a ∈ ℝ^{(M_1+m)×(T_1+n)} is the gated output, δ(·) denotes the activation function, and (θ_1^{aa}, θ_1^{ab}, θ_1^{ac}) are the learnable parameters of the respective convolution branches. Based on Equations (9) and (10), the CG-CNN module’s output M_1^* is then derived via element-wise multiplication:
M_1^* = A_1^a \odot G_1^a \qquad (11)
where ⊙ indicates element-wise multiplication; M_1^* is the final representation produced by the first CG-CNN module.
Analogously, the second and third CG-CNN modules produce the refined representations ΔM_1^* and Δ^2 M_1^*, taking the following form:
\Delta M_1^* = \mathrm{Conv}(\Delta M_1, \theta_1^{b}) \odot \left( \delta(\mathrm{Conv}(M_1, \theta_1^{ba})) + \delta(\mathrm{Conv}(\Delta M_1, \theta_1^{bb})) + \delta(\mathrm{Conv}(\Delta^2 M_1, \theta_1^{bc})) \right) / 3 \qquad (12)
\Delta^2 M_1^* = \mathrm{Conv}(\Delta^2 M_1, \theta_1^{c}) \odot \left( \delta(\mathrm{Conv}(M_1, \theta_1^{ca})) + \delta(\mathrm{Conv}(\Delta M_1, \theta_1^{cb})) + \delta(\mathrm{Conv}(\Delta^2 M_1, \theta_1^{cc})) \right) / 3 \qquad (13)
where (θ_1^{b}, θ_1^{c}, θ_1^{ba}, θ_1^{bb}, θ_1^{bc}, θ_1^{ca}, θ_1^{cb}, θ_1^{cc}) are likewise CNN layer parameters for each branch. These parallel convolutions and cross-gate mechanisms adaptively dampen noise and amplify key acoustic patterns within the local time–frequency features.
After passing through three consecutive CG-CNN modules, M 1 * , Δ M 1 * and Δ 2 M 1 * become progressively cleaned of spurious noise. This multi-branch synergy addresses various ranges of acoustic phenomena, e.g., high-frequency impulses, low-frequency resonances, and mid-frequency partial echoes. The final outputs are fed into the Transformer model (Section 3.3) to capture longer-range dependencies.
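For concreteness, a minimal PyTorch sketch of one cross-gated branch (Equations (9)–(11)) is given below; it is an illustration rather than the authors’ released implementation, and the kernel size, channel count, and the choice of sigmoid for δ(·) are assumptions:

```python
import torch
import torch.nn as nn

class CrossGateBranch(nn.Module):
    """One CG-CNN branch: a convolutional feature path gated by the mean of three
    activated convolutions over MFCC, ΔMFCC and Δ²MFCC (Eqs. (9)-(11))."""

    def __init__(self, in_ch=1, out_ch=16, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        # Three gating convolutions, one per input feature stream
        self.gate_m = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate_d = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate_d2 = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.act = nn.Sigmoid()  # δ(·); the activation choice is an assumption

    def forward(self, m, dm, d2m):
        a = self.feature(m)                                   # Eq. (9)
        g = (self.act(self.gate_m(m)) + self.act(self.gate_d(dm))
             + self.act(self.gate_d2(d2m))) / 3.0             # Eq. (10)
        return a * g                                          # Eq. (11), element-wise
```

Stacking three such modules per feature stream, as in Figure 7, progressively suppresses spurious noise before the Transformer stage.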

3.3. Transformer Model

3.3.1. Motivation

Classical Transformers rely on full self-attention, which imposes a prohibitive O(N2) cost in time and memory for sequences of length N. Figure 8a illustrates this dense attention matrix, wherein each token interacts with every other token. While this architecture excels in capturing rich, global context, it becomes impractical for extended road acoustic signals—often spanning thousands of frames per sample—especially on resource-constrained or embedded hardware. Moreover, many real-time road inspection pipelines demand near real-time inference, aggravating the memory and speed bottleneck of standard Transformers [65].
Recent research tackles the overhead through various approximations that preserve performance on many tasks. These approaches include blockwise (local) attention, low-rank projection, token pruning, and kernel-based attention approximations [66]. Given the focus on extended acoustic signals, a blockwise self-attention approach was selected, augmented by global tokens or bridging strategies, to handle subtle yet lengthy patterns such as structural echo tails from pavement cracks. This yields a “faster and lighter” Transformer suitable for large-scale road health monitoring, seamlessly integrated with the PCG-CNN front end.

3.3.2. Blockwise Self-Attention

The primary tool to reduce O(N2) complexity is a blockwise (local) self-attention mechanism. Instead of the all-to-all pattern (Figure 8a), each token’s attention is restricted to a local region of the input. Figure 8b–d contrast different variants of local or partially global patterns:
  1. Splitting the Sequence into Blocks
Let X ∈ ℝ^{N×d} denote the entire frame-level feature sequence from Section 3.2. The input X is partitioned into B = ⌈N/L⌉ contiguous blocks, each of size L. Each block X_b ∈ ℝ^{L×d} thus handles a smaller subset of frames. Typically, L ≪ N. This block partition ensures overall attention scales linearly with N.
  2. Local Windowed Self-Attention
Within each block, tokens attend to a window of width w. As shown in Figure 8b, each token interacts only with positions within ±w. This “sliding window” approach effectively captures short-range impulses or abrupt changes (e.g., from minor pavement defects), requiring only O(B·L·w) complexity, i.e., linear in N.
  3. Dilated Windows
For certain acoustic tasks involving widely spaced impulses or resonances, dilated local windows are adopted (Figure 8c): each token attends to every d-th neighbor. In practice, a fraction of attention heads is configured to incorporate dilation, preserving a balance between local detail and extended coverage. This concept is reminiscent of dilated CNN filters, but now in a self-attention setting.
  4. Bridging Across Blocks
A purely local attention pattern remains confined within individual blocks unless cross-block bridging is introduced.
Layer Stacking: As the Transformer layers deepen, each block can indirectly incorporate information from adjacent blocks in a stepwise manner.
Global Tokens: Special tokens can be designated to attend across the entire sequence. Figure 8d illustrates a “global + sliding window” variant, as elaborated in Section 3.3.3. For advanced tasks such as multi-hop reasoning across widely separated cracks, these global tokens help unify context across multiple blocks.
By focusing on local windows of size w (within each block of size L), memory usage is reduced to O(N·w) or O(N·L), rather than O(N²). Following Longformer, we adopt “banded” attention with specialized GPU kernels that skip computing or storing out-of-window positions. This is critical for large N, enabling us to process thousands of frames per sample in a single pass.
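The sketch below (PyTorch; a dense-mask illustration rather than the specialized banded kernels, with the window width and tensor shapes as assumptions) shows the sliding-window restriction in its simplest form:

```python
import torch

def sliding_window_mask(n_tokens, window, device=None):
    """Boolean mask allowing position i to attend only to |i - j| <= window.
    A dense-mask stand-in for the banded kernels described above."""
    idx = torch.arange(n_tokens, device=device)
    return (idx[None, :] - idx[:, None]).abs() <= window      # (N, N) bool

def local_attention(q, k, v, window):
    """Scaled dot-product attention restricted to a local window.
    q, k, v: (batch, N, d). For truly long sequences the out-of-window
    positions would be skipped entirely rather than materialized as here."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5               # (batch, N, N)
    mask = sliding_window_mask(q.size(1), window, q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Global tokens of the kind shown in Figure 8d can be emulated by forcing the rows and columns of selected positions in the mask to True.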

3.3.3. Parameter Sharing Across Heads

In conventional multi-head attention, each head i employs unique projection matrices W_i^Q, W_i^K, W_i^V. Instead, we employ a single set {W^Q, W^K, W^V} across the h heads, inspired by Linformer and Reformer. Mathematically, for head i,
\mathrm{head}_i = \mathrm{AttnBlock}(XW^Q, XW^K, XW^V) \qquad (14)
This approach reduces the parameter count from the typical 3·h·d² to 3·d². A shared W^O ∈ ℝ^{(h·d_k)×d} then recombines the heads after local attention. This cuts down on memory usage and model size, making it more practical in memory-constrained roadside or onboard computing environments.
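A minimal PyTorch sketch of this shared-projection attention follows; the head-splitting scheme and dimensions are illustrative assumptions, and the banded mask from Section 3.3.2 can be supplied as attn_mask:

```python
import torch
import torch.nn as nn

class SharedProjectionAttention(nn.Module):
    """Multi-head attention in which all h heads share one W_Q, W_K, W_V
    (Eq. (14)); heads are obtained by splitting the shared projection."""

    def __init__(self, d_model=128, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)   # shared across heads
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # recombination W^O

    def forward(self, x, attn_mask=None):
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.h, self.dk).transpose(1, 2)  # (b, h, n, dk)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.dk ** 0.5                # (b, h, n, n)
        if attn_mask is not None:                                        # e.g., banded mask
            scores = scores.masked_fill(~attn_mask, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        return self.w_o(out.transpose(1, 2).reshape(b, n, -1))
```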

3.3.4. Integration with PCG-CNN

The T-PCG-CNN pipeline (Section 3.2) first applies cross-gated parallel convolutions to address short-range noise gating and multi-frequency feature extraction (MFCC, ΔMFCC, and Δ2MFCC). The resulting embeddings X are then fed into this blockwise attention Transformer:
  1. Local Denoising and Multi-Scale Extraction
PCG-CNN refines the raw acoustic frames, suppressing extraneous noise and highlighting subtle resonance or structural echoes.
  2. Blockwise Sparse Attention
The refined embeddings proceed block by block. Each token only sees its local window or dilated window. Parameter sharing across heads keeps memory usage in check.
Overall, blockwise attention + parameter sharing ensures that the Transformer remains tractable for thousands of acoustic frames, while T-PCG-CNN supplies robust local features. As illustrated in Section 4, this results in the pipeline’s ability to detect subtle, long-duration echoes or damage signals. This design addresses both the short-range gating and extended acoustic dependencies essential in road inspection tasks, without surpassing typical embedded hardware constraints.
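Putting the pieces together, the following simplified PyTorch sketch outlines this integration; it substitutes a standard nn.TransformerEncoder for the blockwise, weight-shared attention described above, and the channel counts, network depth, and single encoder shared across the three streams are assumptions rather than the authors’ exact configuration:

```python
import torch
import torch.nn as nn

class TPCGCNNBackbone(nn.Module):
    """Simplified stand-in for the T-PCG-CNN backbone: per-stream gated
    convolutions followed by a Transformer encoder over the frame axis."""

    def __init__(self, n_coeff=13, d_model=128, n_heads=8, n_layers=2):
        super().__init__()
        self.local = nn.ModuleList(
            [nn.Conv1d(n_coeff, d_model, kernel_size=3, padding=1) for _ in range(3)])
        self.gate = nn.ModuleList(
            [nn.Conv1d(n_coeff, d_model, kernel_size=3, padding=1) for _ in range(3)])
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

    def forward(self, mfcc, d_mfcc, d2_mfcc):
        streams = [mfcc, d_mfcc, d2_mfcc]                        # each (batch, n_coeff, T)
        gates = [sum(torch.sigmoid(g(s)) for s in streams) / 3   # cross-gate over all inputs
                 for g in self.gate]
        feats = [conv(s) * gt                                    # gated local features
                 for conv, s, gt in zip(self.local, streams, gates)]
        encoded = [self.encoder(f.transpose(1, 2)) for f in feats]  # (batch, T, d_model) each
        return torch.cat(encoded, dim=-1)                        # fused M_Concat for the classifier
```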

3.4. Classifier

In the classifier [33,44,47], the fused signal MConcat enters the CNN layer and, through the activation function, extracts high-level and broad acoustic features from the sample. The output signal is computed as follows:
M = \delta(\mathrm{Conv}(M_{\mathrm{Concat}}, \theta)) \qquad (15)
where M is the acoustic feature output from the network module, θ represents the CNN layer parameters, and δ ( · ) is the activation function.
To reduce the dimensionality of the features while retaining important information, an average pooling layer is introduced in the classifier. The average pooling layer performs averaging along the time dimension to obtain a fixed-length feature representation. The computation is given by the following equation:
\tilde{M}_i = \frac{1}{T} \sum_{t=0}^{T-1} M_{i,t}, \quad i = 0, 1, \dots, D-1 \qquad (16)
where M_{i,t} is the element at the i-th row and t-th column of the feature matrix M, and T is the number of data frames. After processing by the average pooling layer, a fixed-length feature vector \tilde{M} = [\tilde{M}_0, \tilde{M}_1, \dots, \tilde{M}_{D-1}]^T is obtained.
The pooled features are input into the fully connected layer for dimensionality reduction, and are then mapped non-linearly through the activation function. The output of the fully connected layer is given by the following equation:
f = \delta(W \tilde{M} + b) \qquad (17)
where W is the weight matrix of the fully connected layer, and b is the bias term.
Finally, the output features f are input into the Softmax layer for classification, in order to identify different acoustic categories. The Softmax function is used to map the outputs into a probability distribution:
P(y = j \mid f) = \frac{\exp(f_j)}{\sum_{k=1}^{K} \exp(f_k)}, \quad j = 1, \dots, K \qquad (18)
where fj is the feature representation of the acoustic sample for class j, K is the total number of classes, and P ( y = j f ) is the probability of classifying the sample as class j.
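A compact PyTorch sketch of this classifier head (channel sizes and the ReLU activation are assumptions; the five classes follow Table 1) could look as follows:

```python
import torch
import torch.nn as nn

class AcousticClassifierHead(nn.Module):
    """Classifier of Section 3.4: Conv + activation (Eq. (15)), average pooling
    over time (Eq. (16)), a fully connected layer (Eq. (17)) and Softmax (Eq. (18))."""

    def __init__(self, in_ch=384, hidden=128, n_classes=5):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1)
        self.act = nn.ReLU()
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, m_concat):
        # m_concat: (batch, D = in_ch, T) fused features from the T-PCG-CNN streams
        m = self.act(self.conv(m_concat))        # Eq. (15)
        m_tilde = m.mean(dim=-1)                 # Eq. (16): average over the T frames
        f = self.act(self.fc(m_tilde))           # Eq. (17)
        return torch.softmax(f, dim=-1)          # Eq. (18): class probabilities
        # During training, cross-entropy would typically be applied to the
        # pre-softmax values rather than to these probabilities.
```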

4. Performance Evaluation of T-PCG-CNN

In this section, a series of experiments are conducted to comprehensively evaluate the proposed T-PCG-CNN network for road acoustic classification. Three experimental scenarios are considered: (1) parameter sensitivity analysis of the T-PCG-CNN model; (2) performance comparison of different network architectures and feature fusion strategies; (3) comparison with existing acoustic feature extraction and machine learning algorithms.
The dataset used is described in Section 2, with sample types and distribution shown in Table 1. The data are split into 75% for training, 15% for validation, and 10% for testing. Because certain classes (e.g., “manhole cover” and “vehicle noise” sounds) constitute only about 3% and 6% of the samples, relying solely on overall accuracy can be misleading—the model might achieve high accuracy by focusing on majority classes while neglecting minority classes. Therefore, in addition to Accuracy, a suite of evaluation metrics including Precision, Recall, and F1 score are used to better capture performance on minority classes [32]; the evaluation metrics are defined as follows:
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (19)
\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (20)
\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (21)
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (22)
\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (23)
F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (24)
where TP (True Positive) represents the number of true positive results, TN (True Negative) represents the number of true negative results, FP (False Positive) represents the number of false positive results, and FN (False Negative) represents the number of false negative results.
Equation (24) represents the F1 score calculation for each class. However, as the task involves a multi-class recognition problem with five categories, an alternative calculation method is required. The two main methods are the average F1 score (Macro F1 Score) and the weighted average F1 score (Weighted F1 Score). The average F1 [67] score is more suitable when focusing on the performance of each individual class, while the weighted average F1 score is more appropriate for evaluating the overall performance of the model. In this study, the weighted average F1 score is adopted, as shown in the following equation:
W\text{-}F1 = \frac{\sum_{i=1}^{n} N_{C_i} \cdot F1_{C_i}}{\sum_{i=1}^{n} N_{C_i}} \qquad (25)
where W-F1 represents the weighted average F1 score, N_{C_i} is the number of samples in the i-th class, and F1_{C_i} is the F1 score of the i-th class.
This provides an overall performance measure that accounts for class imbalance. In addition, the Area Under the ROC Curve (AUC) is reported for each class by treating the multi-class problem in a one-vs-rest manner. Receiver Operating Characteristic (ROC) curves are also plotted to illustrate the trade-off between true positive rate and false positive rate. Finally, the model’s efficiency is assessed through inference time and parameter count to evaluate its engineering practicality. The following subsections present the experimental results based on these considerations.
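For reference, these metrics can be computed as in the following sketch (assuming scikit-learn; the variable names and class ordering are placeholders):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, confusion_matrix,
                             roc_auc_score)
from sklearn.preprocessing import label_binarize

CLASSES = ["damage", "debonding", "manhole cover", "vehicle noise", "other"]

def evaluate(y_true, y_pred, y_prob):
    """y_true/y_pred: integer class labels; y_prob: (n_samples, 5) softmax scores."""
    acc = accuracy_score(y_true, y_pred)
    w_f1 = f1_score(y_true, y_pred, average="weighted")       # Eq. (25)
    cm = confusion_matrix(y_true, y_pred, labels=range(len(CLASSES)))

    # One-vs-rest AUC for each class (as plotted in Figure 10)
    y_bin = label_binarize(y_true, classes=range(len(CLASSES)))
    auc_per_class = {c: roc_auc_score(y_bin[:, i], y_prob[:, i])
                     for i, c in enumerate(CLASSES)}
    return acc, w_f1, cm, auc_per_class
```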

4.1. Network Parameter Performance Analysis

The sensitivity of T-PCG-CNN to various hyperparameters, including dropout rate, number of Transformer attention heads, learning rate, and training epochs, is considered. We trained all models using the Adam optimizer (β1 = 0.9, β2 = 0.999, ε = 1 × 10−8) with an initial learning rate of 0.01 and an exponential decay schedule with factor 0.9 per epoch. The mini-batch size was 64, the loss function was cross-entropy for 5-class classification, and unless otherwise stated we applied no weight decay and no gradient clipping. Mini-batches were shuffled each epoch. Input features were MFCC, ΔMFCC, and Δ2MFCC; prior to training, each coefficient was z-score normalized per feature using statistics computed on the training split only. For sequence batching, variable-length inputs were zero-padded to the longest sequence in the batch and an attention mask was used so the Transformer ignored padded positions. Experiments were run for 50 epochs and repeated 10 times with different random seeds; we report the mean Accuracy and weighted F1 across runs.
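A minimal PyTorch sketch of this training configuration follows; the model and data loader are placeholders, and only the hyperparameters stated above (Adam with β1 = 0.9, β2 = 0.999, ε = 1 × 10−8, initial learning rate 0.01, exponential decay of 0.9 per epoch, batch size 64, cross-entropy loss, no weight decay or gradient clipping) are taken from the text:

```python
import torch
import torch.nn as nn

def train(model, train_loader, n_epochs=50, device="cpu"):
    """Training loop matching the stated setup; `model` and `train_loader` are
    placeholders for the T-PCG-CNN and the z-score-normalized MFCC/ΔMFCC/Δ²MFCC batches."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()                      # 5-class classification
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                                 betas=(0.9, 0.999), eps=1e-8)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

    for epoch in range(n_epochs):
        model.train()
        for mfcc, d_mfcc, d2_mfcc, mask, labels in train_loader:  # batch size 64, shuffled
            mfcc, d_mfcc, d2_mfcc = mfcc.to(device), d_mfcc.to(device), d2_mfcc.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            logits = model(mfcc, d_mfcc, d2_mfcc, attn_mask=mask.to(device))
            loss = criterion(logits, labels)
            loss.backward()                                # no gradient clipping
            optimizer.step()
        scheduler.step()                                   # learning-rate decay of 0.9 per epoch
```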
Dropout Rate: Dropout rates were in the range of [0.2, 0.8] (and a no-dropout baseline). Table 2 presents the results on the three road datasets (G-dataset, NR-dataset, TR-dataset). The performance initially improves with increasing dropout as it helps prevent overfitting. The best results for all three datasets were achieved at a dropout rate of 0.4, yielding, for example, 92.24% accuracy (W-F1 0.9210) on the G-dataset. Beyond 0.4, performance starts to decline as excessive dropout removes too much information. Therefore, a dropout rate of 0.4 is selected for the final model.
Training Epochs: The model’s learning curves on the training and validation sets were monitored. As illustrated by the accuracy, loss, precision, and recall, model performance stabilizes after approximately 15 epochs. Beyond this point, the validation loss and accuracy cease to improve and begin to degrade, indicating the onset of overfitting. Consequently, the number of training epochs is fixed at 15, at which the model has effectively converged. Extending training beyond this point would likely result in overfitting to noise or specific patterns, thereby reducing generalization performance.
Attention Heads: With the dropout rate and number of training epochs fixed, the impact of the number of heads in the Multi-Head Self-Attention mechanism of the Transformer is evaluated. The number of attention heads ∈ {4, 8, 16}. The results show that setting eight heads yields the best performance on all three datasets, as shown in Table 3. For instance, increasing from four to eight heads improved W-F1 from 0.9214 to 0.9224 on the G-dataset, and from 0.9427 to 0.9494 on the TR-dataset. However, 16 heads led to a slight drop, likely due to over-parameterization without additional benefit (accuracy fell to 90.17% on the G-dataset). Therefore, eight attention heads are selected for the final model configuration.
With these optimal hyperparameters (dropout 0.4, 15 epochs, eight heads, learning rate 0.01), T-PCG-CNN achieves strong baseline performance. The final trained model has approximately 7.8 million parameters (about 31.4 MB in size), which is relatively lightweight for deployment.

4.2. Performance Comparison of Different Network Architectures

To evaluate the impact of the multi-level feature fusion design, several network architecture variants are compared using different combinations of input features. Experiments are conducted using each feature type individually (MFCC only, ΔMFCC only, or Δ2MFCC only), as well as pairwise combinations (MFCC + ΔMFCC, MFCC + Δ2MFCC, ΔMFCC + Δ2MFCC), and the full trio (MFCC + ΔMFCC + Δ2MFCC). For combined inputs, two fusion strategies are considered: (i) complete cross-fusion of features within the network, and (ii) partial or no fusion as baselines. Based on these strategies, three network variants are defined in addition to the full T-PCG-CNN:
  • T-PCG-CNN*: a variant where two of the feature types are fused in parallel and the third is processed in a separate branch. For example, “MFCC & ΔMFCC, Δ2MFCC” means MFCC and ΔMFCC features are cross-fused in parallel convolution streams while Δ2MFCC is handled by a separate stream. Three configurations of which features are fused versus separate were tested (as shown in Table 4).
  • T-PG-CNN: a variant with no cross-gating between feature streams. Here MFCC, ΔMFCC, and Δ2MFCC are each processed in independent CNN branches (no mutual gating), and their outputs are simply concatenated before the classifier. This tests the importance of the cross-gate mechanism.
  • T-G-CNN: a single-feature network for comparison, which uses only one type of input. This is essentially a “Transformer + Gated CNN” applied to a single feature, serving as a baseline with no multi-feature fusion.
Table 4 presents the recognition Accuracy and W-F1 for these architectures on the three datasets. Several clear trends are observed:
First, the proposed T-PCG-CNN (with full MFCC, ΔMFCC, Δ2MFCC fused) achieves the highest performance on all datasets—for example, 94.64% accuracy (W-F1 0.9422) on the TR-dataset. In contrast, using any single feature (T-G-CNN) yields substantially lower performance (e.g., MFCC-only achieved ~83% accuracy on the G-dataset). This highlights that combining complementary features (cepstral and its deltas) is crucial for capturing more discriminative information. It aligns with findings in other domains that incorporating first- and second-order MFCC features improves classification of audio signals.
Second, the cross-gating mechanism clearly boosts performance: the T-PCG-CNN outperforms the no-gating parallel network (T-PG-CNN) by a significant margin. For instance, on the G-dataset T-PCG-CNN improves accuracy by ~7.0% (W-F1 by 6.6%) compared to T-PG-CNN, and on the NR-dataset by ~10.4% (W-F1 8.4%). Even on the easier TR-dataset, cross-gating gives a 4–5% boost in accuracy. This demonstrates that the cross-gate module effectively captures complementary information between features of different scales/resolutions, leading to better feature integration than simple concatenation.
Third, the partially fused variant T-PCG-CNN* performs in between the extremes: when two features are fused and one separate, the accuracy is higher than no fusion but still about 2–4% lower (absolute) than fusing all three. For example, on the G-dataset, T-PCG-CNN* (“MFCC&ΔMFCC, Δ2MFCC”) achieved 89.44% accuracy vs. 92.24% for full T-PCG-CNN. This indicates that incorporating all three feature types into a unified cross-gated fusion yields the best outcome. In summary, the proposed architecture (T-PCG-CNN) leveraging multi-feature cross-gated convolution plus Transformer achieves superior performance, validating design choices. The improvements are most pronounced on noisy, complex datasets (NR-dataset), where integrating multi-scale frequency features with attention helps the model focus on subtle but important patterns. By contrast, simpler architectures (single feature or no gating) struggle to capture the necessary diversity of features, resulting in lower recall on the minority classes and overall lower W-F1.
The confusion matrix for the final T-PCG-CNN model is computed to further analyze class-wise performance. Table 4 summarizes the classification results across the five classes in the test set. A visual depiction of the confusion matrix is shown in Figure 9.
The model achieves high true positive rates for all pavement condition classes (diagonal values are all above 85%). Notably, the “Damage” and “Debonding” classes, which correspond to pavement distress conditions, are identified with 95% and 93% recall, respectively, indicating the model is very effective at recognizing debonding-related defects. The minor classes “Manhole Cover” and “Noise” have slightly lower recall (85% and 90%, respectively), with some confusion observed: e.g., a small fraction of “Manhole Cover” cases are misclassified as “Damage”, and about 10% of ambient “Noise” instances are mistaken as “Other” (normal road sounds).
These confusions are understandable because Manhole Cover and Damage are both impact-driven classes that start with a strong low-frequency impulse; in practice, the metallic “ring” of a cover often appears as a narrow, quasi-stationary mid-band ridge (0.6–0.7 kHz) with short decay, whereas Damage exhibits a broader low-frequency smear with a longer, irregular decay envelope. Under low SNR, short analysis windows, or speed-induced time warping, these cues can be masked or truncated, making a lid strike resemble a damage burst; likewise, distant vehicle noise forms diffuse low-energy bands that overlap the “Other” background. To reduce these errors, we will (i) encode resonance sharpness/persistence and decay-slope features alongside MFCCs, (ii) apply class-balanced and contrastive training targeted at the lid–damage pair, (iii) use attention-guided temporal weighting to emphasize post-impact frames that carry ringing/decay cues, and (iv) incorporate simple location priors (known manhole positions) during inference to suppress spurious damage detection without altering the core model.
Overall, the confusion matrix confirms that using W-F1 as the evaluation metric was appropriate—while overall accuracy is high, the per-class breakdown shows the challenge on minority classes. By achieving strong performance on even the infrequent classes (e.g., correctly recognizing 85% of “Manhole Cover” events), T-PCG-CNN demonstrates robust learning of all categories.
To illustrate the classifier’s discriminative ability, the one-vs-rest ROC curves for each class are plotted, as shown in Figure 10. Each curve shows how the True Positive Rate (TPR, or sensitivity) varies with the False Positive Rate (1 − specificity) as the decision threshold is varied for one class versus all others. All classes exhibit steep ROC curves that rise toward the top left corner, reflecting excellent performance. The AUC values are at least 0.95 for every class, which means the model has a very high probability of ranking a random positive example higher than a random negative example for any given class. The “Vehicle Noise” class (environmental noise) achieves the highest AUC (0.98), indicating the model is almost perfect at distinguishing traffic noise from defect-related sounds. The “Debonding” and “Other” classes have the lowest AUC (0.95) among the five, but this is still very high and confirms that even these harder-to-separate categories are reliably distinguished. Overall, Figure 10 highlights the model’s strong class-separation capability—even under varying classification thresholds, the true positive rate remains high while false positives are low. This further reinforces the effectiveness of the fused multi-scale features and attention mechanism in T-PCG-CNN, as it achieves both high precision and high recall across all road condition categories.
Figure 10 shows ROC curves for each class on the test set using the T-PCG-CNN model. Each colored curve corresponds to one target class (e.g., “Damage” vs. all other classes). The points toward the upper left represent operating thresholds with high recall and low false alarm rate. All classes have ROC curves close to the top left, indicating outstanding classification performance. The Area Under the Curve (AUC) is 0.97 for “Damage”, 0.95 for “Debonding”, 0.96 for “Manhole Cover”, 0.95 for “Other”, and 0.98 for “Noise”. Such high AUC values suggest the model would perform well even under different operating thresholds—for instance, one could tighten the decision criterion to reduce false positives and still maintain a high true positive rate. The consistently high AUC across classes demonstrates that the model’s predictive power is not limited to just the dominant classes but extends to the rarer event types as well. In practical terms, this means T-PCG-CNN can be tuned to be very sensitive to road debonding events while keeping false alarms (misidentifying normal conditions as faults) to a minimum, which is crucial for a reliable monitoring system.

4.3. Ablation Study

To quantify the contribution of each component in the T-PCG-CNN model, a series of ablation experiments was conducted. In each experiment, one component is systematically removed or replaced, and the resulting impact on performance (Accuracy, W-F1) and model complexity is shown in Table 5. The following ablation variants are compared:
  • Without Transformer (PCG-CNN only): The Transformer encoder is removed, and only the Parallel Cross-Gated CNN is used for classification. This setup is used to evaluate the importance of capturing long-term temporal dependencies. Results indicate a notable drop in performance—for example, on the G-dataset, the W-F1 drops from 0.9210 to about 0.8992 (2.18% decrease) which is in line with the performance of the CG-PCNN model (Cross-Gated CNN without Transformer). Similar declines are observed on the other datasets. This confirms that the Transformer’s sequence modeling significantly boosts accuracy by integrating information over time, as opposed to using only instantaneous or local acoustic features. In other words, the PCG-CNN alone, while powerful in extracting multi-scale features, misses the temporal context that the Transformer provides, resulting in lower recall on events that have more subtle or longer acoustic signatures.
  • Without Cross-Gating (Parallel CNN + Transformer): In this variant, the cross-gating mechanism between convolutional branches is removed. The network instead uses parallel convolution streams for each feature (MFCC, ΔMFCC, Δ2MFCC) without the adaptive gating, and their outputs are concatenated before feeding into the Transformer. Essentially, this is a no-gate version of the model. The absence of cross-gating leads to a significant performance degradation: accuracy drops by about 4–9% depending on the dataset. Specifically, compared to the full T-PCG-CNN, the no-gating model’s W-F1 is lower by 5.88% on the G-dataset, 7.23% on the NR-dataset, and 4.03% on the TR-dataset. This highlights that the cross-gate module is crucial for effectively fusing features—it allows the network to adaptively weight and exchange information between the MFCC, ΔMFCC, and Δ2MFCC channels, capturing the complementary nature of these features (an illustrative gating sketch is given after this list). Without gating, the model likely cannot reconcile differences between feature types as well, leading to misclassifications. This ablation underscores the value of the cross-gated parallel design, which is also supported by other studies that use gated fusion to combine modalities or feature streams.
  • Without Multi-Scale Convolutions (Single-Branch CNN + Transformer): In this variant, the parallel multi-branch convolutional architecture is removed and replaced with a single CNN stream for feature extraction. All input features are either combined into one branch or, alternatively, only one feature is used as input before passing through the Transformer. This ablation is designed to evaluate the importance of employing multiple convolutional kernel sizes and maintaining separate feature-specific processing streams. The results indicate a notable decline in performance, with the single-branch network achieving only 82–85% accuracy—comparable to the T-G-CNN single-feature baseline, which achieved 85% accuracy on the TR-dataset. The absence of multi-scale feature learning likely impairs the model’s ability to capture important frequency-specific patterns; a single convolution kernel may fail to detect features that a multi-branch structure—with both wide and narrow receptive fields—can effectively capture. Additionally, a decline in recall was observed for minority classes; for example, detection of the “Manhole Cover” class was negatively affected without the high-frequency branch provided by the PCG-CNN design. These findings suggest that using multiple parallel convolutional filters to extract diverse spectral characteristics significantly enhances performance. This result aligns with the rationale behind CG-PCNN architectures used in speech feature modeling. In summary, removing multi-scale parallelism substantially weakens feature extraction capacity, reinforcing the effectiveness of the proposed PCG-CNN design.
  • Standard Transformer instead of Lightweight Transformer (T*-PCG-CNN) [32]: Finally, the optimized Transformer design is compared with a baseline model that incorporates a conventional Transformer module integrated with the PCG-CNN, denoted T*-PCG-CNN. This baseline does not include the optimizations introduced in the proposed model, such as sparse attention or weight sharing across Transformer layers. In contrast, the proposed model employs a parameter-sharing scheme to reduce redundancy and improve efficiency. The T*-PCG-CNN baseline achieves accuracy levels nearly identical to those of the proposed model, with differences within 0.1–0.5% across all datasets. This indicates that the optimized design does not compromise classification accuracy. However, the model size and efficiency tell a different story: T*-PCG-CNN’s model file is 47.6 MB, whereas T-PCG-CNN is only 31.4 MB, a 34% reduction in model size achieved by the design. The number of parameters in T*-PCG-CNN is correspondingly higher (roughly 11.9 million vs. 7.8 million). In practice, this means the model is significantly more memory-efficient and faster at inference. Performance measurements show that T-PCG-CNN runs approximately 1.3× faster per inference than the T*-PCG-CNN baseline on the same hardware. This ablation confirms that the modifications to the Transformer module effectively reduce model complexity while maintaining performance. The multi-head self-attention mechanism still provides the needed sequence learning, but with fewer parameters—demonstrating an efficient design. This is important for engineering deployment, as a smaller model is easier to deploy on edge devices (such as roadside units or mobile data acquisition vehicles) without compromising detection capability.
  • Efficiency metrics (GFLOPs and memory): Beyond accuracy and model size, we report GFLOPs per forward pass and peak training memory (batch = 64) for a 3 s input (100 frames). As summarized in Table 5, the proposed T-PCG-CNN requires ≈1.4 GFLOPs with ≈35.0 GB peak memory, whereas the standard-Transformer variant T*-PCG-CNN requires ≈1.8 GFLOPs and ≈36.2 GB. Thus, T-PCG-CNN reduces computational demand by 22% and peak memory by 1.2 GB while maintaining accuracy, supporting its efficiency advantage over the heavier T* baseline.
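To make the gating ablation concrete, the sketch below shows one generic way two parallel convolutional branches can cross-gate each other: a sigmoid gate computed from one branch rescales the other before fusion. This is an illustrative PyTorch example of the idea being ablated, with placeholder channel counts, not the exact layer configuration of the proposed PCG-CNN.

```python
import torch
import torch.nn as nn

class CrossGateFusion(nn.Module):
    """Illustrative cross-gating between two feature branches (e.g., MFCC and ΔMFCC)."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions produce per-channel gates from the *other* branch
        self.gate_a = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        # Each branch is modulated by a gate computed from its counterpart,
        # letting the network exchange and reweight complementary information.
        gated_a = feat_a * self.gate_a(feat_b)
        gated_b = feat_b * self.gate_b(feat_a)
        return torch.cat([gated_a, gated_b], dim=1)  # fused features for the Transformer

# Usage sketch: two branches with 64 channels over 100 frames
fusion = CrossGateFusion(64)
out = fusion(torch.randn(8, 64, 100), torch.randn(8, 64, 100))  # -> [8, 128, 100]
```

In the no-gate ablation, the two sigmoid gates are simply dropped and the raw branch outputs are concatenated, which is what produces the 4–9% accuracy loss observed above.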
Overall, the ablation study confirms that each component of the T-PCG-CNN contributes to its high performance. The Transformer encoder, the cross-gated multi-branch CNN, and the multi-scale feature extraction all play vital roles. Removing any of these leads to a noticeable drop in F1 score and accuracy, validating the architectural choices. Additionally, comparison with the T*-PCG-CNN variant demonstrates that the efficiency improvements achieved through parameter sharing and design optimizations—drawing on recent advancements in Transformer architectures—enhance the model’s practicality for real-world applications without compromising accuracy. These ablation results, supported by quantitative metrics and confusion analyses, give a clear picture of how each innovation adds value to the final system.
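As an illustration of the parameter-sharing idea discussed above, the sketch below reuses a single Transformer encoder layer across all depth steps (ALBERT-style sharing), cutting parameters roughly in proportion to the number of layers. The layer sizes and depth are placeholders, not the reported T-PCG-CNN configuration, and the blockwise sparse attention pattern is omitted for brevity.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Illustrative weight-shared Transformer encoder: one layer applied N times."""

    def __init__(self, d_model=128, n_heads=8, depth=4):
        super().__init__()
        # A single layer's weights are reused at every depth step,
        # so parameter count is roughly 1/depth of an unshared stack.
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.depth = depth

    def forward(self, x):  # x: [batch, frames, d_model]
        for _ in range(self.depth):
            x = self.layer(x)
        return x

# Usage sketch: 100 acoustic frames with 128-dimensional fused features
enc = SharedLayerEncoder()
y = enc(torch.randn(8, 100, 128))  # -> [8, 100, 128]
```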

4.4. Comparison with Existing Acoustic Signal Feature Extraction and Machine Learning Algorithms (Noise Robustness)

In this section, the proposed approach is compared with several representative baseline methods from the literature, encompassing both traditional techniques and recent deep learning models for acoustic classification. As no existing models have been trained specifically on semi-rigid base pavement acoustic datasets, analogous methods from related audio domains are implemented or referenced to assess comparative performance. The compared baseline methods include the following: (a) i-vector based deep model [25]; (b) x-vector based approach [26]; (c) wav2vec 2.0 (self-supervised raw-audio representation model) [31]; (d) HuBERT (self-supervised masked-unit prediction model) [32]; (e) spectrogram + CNN classifier [27]; (f) MFCC + CNN classifier [26]; (g) MFCC + CENS + CNN (feature concatenation) [26]; (h) CG-PCNN (Cross-Gate Parallel CNN) model [44]. The experimental data are shown in Table 6.
These methods span the major types of approaches in acoustic signal classification: traditional i-vector/x-vector frameworks often used in speech processing, spectrogram-based end-to-end CNN recognition, hand-crafted feature (MFCC/CENS) with CNN classifiers, and feature-fusion models with gating (CG-PCNN).
To simulate sensor and ambient interference in a controlled way, we generate additive white Gaussian noise (AWGN) at target SNRs and mix it with the time-domain road-acoustic waveform. Given a clean signal $x$ with frame-wise power $P_x = \frac{1}{N}\sum_{n} x_n^2$, we draw $n \sim \mathcal{N}(0,1)$ and scale it to the noise power $P_n = P_x / 10^{\mathrm{SNR}_{\mathrm{dB}}/10}$; the corrupted signal is $y = x + \alpha n$ with $\alpha = \sqrt{P_n}$. All waveforms are peak-normalized to $[-1, 1]$ before mixing, and the SNR is computed on non-silent segments. The reported TR-5 dB/TR-10 dB results are produced by this AWGN procedure.
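A minimal implementation of this AWGN mixing procedure is sketched below; the non-silent-segment threshold (-60 dB relative to full scale) is an assumption made only for illustration.

```python
import numpy as np

def add_awgn(x, snr_db, rng=None, silence_db=-60.0):
    """Mix additive white Gaussian noise into waveform x at a target SNR (dB).

    x is assumed to be a mono float array already peak-normalized to [-1, 1];
    signal power is estimated on non-silent samples (threshold is illustrative).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Estimate signal power on non-silent samples only
    active = np.abs(x) > 10 ** (silence_db / 20.0)
    p_x = np.mean(x[active] ** 2) if active.any() else np.mean(x ** 2)
    # Scale unit-variance Gaussian noise to reach the target SNR
    p_n = p_x / (10 ** (snr_db / 10.0))
    noise = rng.standard_normal(len(x))
    return x + np.sqrt(p_n) * noise

# Example: create the TR-5 dB and TR-10 dB conditions from a clean clip x
# y5, y10 = add_awgn(x, 5.0), add_awgn(x, 10.0)
```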
Qualitative Comparison of Approaches: Approaches (a) and (b) rely on learned embeddings (i-vectors/x-vectors) that compress acoustic information into global vectors, potentially losing local features crucial for distinguishing subtle debonding signatures from noise. We also consider two recent self-supervised baselines, (c) wav2vec 2.0 and (d) HuBERT, which provide generic acoustic representations learned from raw audio; these improve over i/x-vector but do not include task-specific multi-branch gating. Methods (e), (f), and (g) use classic feature representations—spectrogram, MFCC, or chroma (CENS)—together with CNNs; while they benefit from deep learning’s capacity to capture time–frequency cues, they lack explicit gating or self-attention. CG-PCNN (h) goes a step further by applying cross-gated multi-frequency branches in parallel, effectively capturing complementary spectral information; however, it does not account for long-distance temporal correlations. The T-PCG-CNN extends CG-PCNN via a Transformer component, adding multi-head self-attention to model long-range dependencies. As a result, T-PCG-CNN achieves a W-F1 of 94.94%, compared to CG-PCNN’s 92.08%, demonstrating the importance of coupling multi-scale gating with global attention.
Quantitative Results: In clean conditions, (e) spectrogram + CNN achieves around 80% accuracy (W-F1 ≈ 0.78), whereas (f) MFCC + CNN improves to roughly 85% (W-F1 ≈ 0.83). Incorporating CENS (g) further raises performance to about 88% accuracy (W-F1 ≈ 0.85). The self-supervised baselines are stronger: on TR, (c) wav2vec 2.0 and (d) HuBERT reach W-F1 = 0.9035 and 0.9044, respectively, outperforming (a)/(b) (0.8758/0.8814) yet remaining below (h) CG-PCNN (0.9208) and T-PCG-CNN (0.9494). Similar patterns hold on G and NR (e.g., W-F1 on G: 0.8721 for (c), 0.8905 for (d); on NR: 0.8944 for (c), 0.8900 for (d)). Overall, while generic SSL representations help, the proposed adaptive feature fusion and explicit time-context modeling enable T-PCG-CNN to exceed 92% accuracy (W-F1 ≈ 0.94).
Noise-Level Performance: To evaluate robustness under challenging acoustic conditions, all methods were tested on the TR dataset augmented with 5 dB and 10 dB additive white noise (TR-5 dB and TR-10 dB). Under these low-SNR scenarios, the self-supervised baselines (c) and (d) remain notably stronger than (a)/(b) (e.g., W-F1 at TR-5 dB: 0.7995/0.8267; at TR-10 dB: 0.8632/0.8785), with (d) HuBERT showing slightly better stability. Even so, (h) CG-PCNN and especially T-PCG-CNN are more resilient (e.g., W-F1 at TR-5 dB/10 dB: 0.8712/0.8995 for (h) vs. 0.9187/0.9415 for T-PCG-CNN). Cross-gated convolutions mitigate local noise artifacts, and Transformer attention preserves critical temporal cues, enabling T-PCG-CNN to maintain the highest accuracy and W-F1 in harsh acoustic settings.
T-PCG-CNN unifies multi-frequency feature fusion, cross-gate noise suppression, and Transformer-based long-range sequence modeling, surpassing both CG-PCNN and smaller CNN baselines in accuracy and noise robustness. These results are especially valuable for on-site monitoring of semi-rigid pavements, where ambient noise can be pervasive. By effectively handling acoustic variability—ranging from faint debonding signals to high-amplitude interference—T-PCG-CNN offers a state-of-the-art solution for practical, scalable debonding detection in real-world road infrastructure applications.

4.5. Application Prospects and Future Work

This section demonstrates the strong performance of T-PCG-CNN across three distinct road acoustic datasets—highway, national road, and town road—highlighting its effectiveness in classifying pavement conditions based on acoustic signals. An important next step involves evaluating the model’s generalization capability across varying road types to advance toward real-world deployment. The following strategies are outlined and explored as directions for future work:
  • Single-Dataset Training with Cross-Dataset Testing: The T-PCG-CNN model is trained on one dataset (e.g., highway data) and evaluated on the others to assess generalization across road types. Specifically, 75% of the selected dataset is used for training and the remaining 25% for validation. To form the test set, a small portion (15%) is sampled from each of the other two datasets. This setup enables evaluation of the model’s ability to classify acoustic signals from road types that were not included in training and simulates a model trained in one environment being used in different environments (a sketch of this split construction is given after this list).
  • Two-Dataset Combination Training: Two of the road datasets are merged to create a combined training set, with data sampled in equal proportions to preserve the original category balance. The model is then trained on 75% and validated on 25% of this combined dataset. For testing, 15% of the remaining third dataset—unseen during training—is used. This experimental setup evaluates the model’s ability to generalize when exposed to a broader range of road conditions during training while still encountering a new, unseen road type at test time.
  • Three-Dataset Comprehensive Training: All three datasets are combined into one large training pool. Data from all road types are mixed in equal proportion, and the number of samples per class is kept consistent with the single-dataset case (to avoid bias). The model is trained on 75% and validated on 25% of this unified dataset. For testing, unknown conditions are simulated by selecting the remaining unused data to create three separate test sets, each equal in size to the combined validation set and drawn from a distinct road type. This approach enables evaluation of the model’s performance on each road type individually after being trained on a diverse set of conditions, thereby assessing its generalization capability in varied real-world scenarios.
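The split construction for these scenarios can be expressed compactly. The sketch below shows the first scenario (single-dataset training with cross-dataset testing) under the stated 75/25/15 proportions; the dataset objects are assumed to be simple lists of (feature, label) pairs, which is an illustrative simplification of the actual data pipeline.

```python
import random

def single_dataset_splits(train_ds, other_ds, seed=0):
    """First scenario: train/validate on one road type, test on samples
    drawn from the two unseen road types (proportions from the text)."""
    rng = random.Random(seed)
    pool = list(train_ds)
    rng.shuffle(pool)
    cut = int(0.75 * len(pool))
    train, val = pool[:cut], pool[cut:]      # 75% training, 25% validation
    test = []
    for ds in other_ds:                      # 15% sampled from each unseen dataset
        test += rng.sample(list(ds), int(0.15 * len(ds)))
    return train, val, test

# e.g., train, val, test = single_dataset_splits(g_dataset, [nr_dataset, tr_dataset])
```

The two-dataset and three-dataset scenarios follow the same pattern, with the training pool built by merging datasets in equal proportions before the 75/25 split.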
The experimental results for these generalization tests are summarized in Table 7. Overall, the T-PCG-CNN retained useful recognition ability even when evaluated on road types not seen in training, though performance dropped compared to in-domain testing. In the first scenario (single-dataset training), the model could still recognize conditions in the other datasets, albeit with a noticeable loss in accuracy (W-F1 falling to roughly 0.64–0.77 depending on the dataset pairing). This indicates that the acoustic characteristics learned for one road (e.g., patterns of debonding on a highway) are partly transferable to others, thanks to the robust feature representation. The second scenario (two-dataset training) showed improved generalization: when trained on two types of roads, the model performed better on the third, unseen type than the single-dataset model did, confirming that exposure to diverse training data helps the model learn more invariant features. Finally, in the third scenario (three-dataset training), where the model was trained on all three datasets combined, it achieved strong results across the board—the accuracy and W-F1 on each individual road type’s test set were nearly as high as when testing within the same domain. This demonstrates the model’s capacity to learn a generalized representation of road acoustic signatures. Such a model could be deployed in a variety of environments without retraining, making it highly practical.
Moving forward, the work can be extended in several directions. First, incorporating additional modalities (e.g., accelerometer data or vehicle suspension feedback) in a multi-modal fusion framework could further enhance robustness—akin to how adaptive gated fusion has benefited cross-modal tasks. Second, exploring unsupervised or semi-supervised learning on unlabeled acoustic data could help the model leverage larger datasets and handle new types of noise. Third, real-time implementation and field tests will be important to assess the system’s performance under operational conditions; here, the model’s lightweight design is advantageous for deployment on embedded devices. In our setup, training is conducted on an RTX 3060 Ti, and for edge use we plan a ruggedized in-vehicle/roadside unit (industrial PC with a GPU of the same class) running FP16/TensorRT inference with batch = 1 and a 3 s sliding window (≈100 frames), including on-device MFCC, ΔMFCC, and Δ2MFCC extraction. Laboratory tests on the 3060 Ti show tens-of-milliseconds end-to-end latency per window—comfortably real-time—while the 31.4 MB model and modest runtime footprint fit well within available GPU memory. The production deployment will use a sealed, vibration-resistant enclosure with 12–24 V power, GPS/odometry time-stamping, and a watchdog/fallback logger for stable long-term operation; this is a planned edge deployment pathway, and we will report field latencies and throughput once pilot units are evaluated. Lastly, as more annotated pavement acoustic datasets become available, comparative benchmarks can be established to further validate T-PCG-CNN against future state-of-the-art methods.
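To make the planned sliding-window pipeline concrete, the sketch below shows a 3 s window loop with on-device MFCC, ΔMFCC, and Δ2MFCC extraction using librosa. The sample rate, hop duration, MFCC order, and the `model` callable are placeholders, not the deployed implementation.

```python
import numpy as np
import librosa

SR, WIN_S, HOP_S = 16000, 3.0, 1.0   # assumed sample rate and window/hop (seconds)

def mfcc_stack(window, sr=SR, n_mfcc=13):
    """MFCC plus first- and second-order deltas for one analysis window."""
    mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.stack([mfcc, d1, d2])          # shape: [3, n_mfcc, frames]

def stream_inference(audio_stream, model):
    """Slide a 3 s window over the incoming audio and classify each window."""
    win, hop = int(WIN_S * SR), int(HOP_S * SR)
    for start in range(0, len(audio_stream) - win + 1, hop):
        feats = mfcc_stack(audio_stream[start:start + win])
        yield start / SR, model(feats[None, ...])   # (timestamp, class scores)
```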

5. Conclusions

In the semi-rigid base asphalt pavement application process, due to the unique nature of roads, there is a lack of effective methods and tools for detecting debonding defects. To develop an acoustic-based debonding detection method for semi-rigid base asphalt pavements, this study investigates relevant types of excitation acoustics, feature data, and recognition methods. A semi-rigid base asphalt pavement acoustic recognition method based on a Transformer model and Parallel Cross-Gate Convolutional Neural Network (T-PCG-CNN) is proposed. Over several years, excitation devices were designed and utilized to collect acoustic data from various road types, resulting in a dedicated multi-sample dataset encompassing five categories: damage, debonding, manhole cover, other, and vehicle noise. By analyzing the time-domain and frequency-domain characteristics of each acoustic sample, Mel-frequency cepstral coefficients (MFCC), along with first-order (ΔMFCC) and second-order (Δ2MFCC) derivatives, were extracted to characterize the dynamic features of continuous road acoustic signals.
A T-PCG-CNN architecture was subsequently proposed to classify the acoustic data. This model leverages parallel cross-gated CNN branches to extract complementary frequency components from the multi-feature input and incorporates a Transformer module to effectively capture long-term dependencies in the acoustic signals. To keep the model efficient, techniques such as blockwise sparse self-attention and shared weight matrices were applied, reducing the Transformer’s size without compromising performance.
Compared to existing classification methods, the approach shows significant improvements in both accuracy and F1 score. In multi-class classification tests across diverse road conditions, the T-PCG-CNN achieved a classification accuracy of 0.9208 and a weighted F1 of 0.9315, substantially outperforming traditional methods. This high performance holds true even in the presence of high ambient noise, demonstrating the model’s strong noise immunity and validating the effectiveness of the data augmentation strategy. The integration of different feature types and the novel network structure markedly enhance the model’s generalization and application potential. In practical terms, the model can reliably distinguish debonding-induced acoustic signatures from other sources of road noise (such as traffic or benign impacts), highlighting its value for on-site pavement health monitoring. From a theoretical perspective, the T-PCG-CNN introduces a unique hybrid architecture that unites Transformer-based global attention with CNN-based multi-scale feature gating. This design contributes to the field of acoustic signal processing by enabling simultaneous extraction of long-range temporal features and fine-grained spectral features, a combination that prior models have not fully achieved. The success of this architecture in a noisy, real-world dataset illustrates a notable advancement in model design for infrastructure sensing.
Future research and development: This study is limited to data from three representative roads—an expressway, a national highway, and a municipal road—so broader validation across diverse climates and operating conditions is still needed. We will undertake multi-region data collection and cross-dataset evaluations to quantify large-scale generalization and assess robustness under wider environmental variability. Methodologically, we will deepen the analysis of discriminative acoustic attributes, expand the database to include additional types of semi-rigid base pavements, and extract more representative, noise-resilient features to further improve performance. Finally, because the present work considers only five acoustic categories, advancing toward finer-grained multi-class recognition—especially accurately detecting and distinguishing debonding from other defects—remains a key objective for subsequent studies.

Author Contributions

Conceptualization, C.H.; Methodology, C.H. and M.Y.; Data curation, B.L.; Software, B.L. and J.Z.; Writing—original draft, B.L.; Writing—review and editing, C.H. and M.Y.; Supervision, M.Y.; Validation, B.L. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Key Research and Development Project grant funded by the China Science and Technology Exchange Center (No. 2025YFE0199500), and the Key Research and Development Special Project grant funded by the China Henan Provincial Department of Science and Technology (Nos. 231111520200 and 241111241600).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available from the corresponding author upon reasonable request.

Acknowledgments

This study was carried out through a collaboration between Chang’an University (National Engineering Research Center of Highway Maintenance Equipment; School of Information Engineering) and Henan Gaoyuan Highway Maintenance Technology Co., Ltd., together with three collaborating institutions. We gratefully acknowledge all partners for research coordination, dataset collection/management, and data sharing. We also thank the road authority for authorizing on-road measurements and the field teams for their support during data acquisition. C.H. is a doctoral candidate at Chang’an University and an employee of Henan Gaoyuan Highway Maintenance Technology Co., Ltd.; this dual role facilitated the integration of theoretical research and engineering practice. In addition to the institutional support acknowledged above, we thank Zhen Sun for extensive editorial assistance that substantially improved the manuscript’s clarity and structure; Qiao Wang for a critical review and constructive suggestions that strengthened accuracy and completeness; and Zhongyu Li for facilitating access to key research resources and coordinating financial support for the study.

Conflicts of Interest

Author Changfeng Hao was employed by the company Henan Gaoyuan Highway Maintenance Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Wu, J.T.; Wu, Y.T. Performance evaluation of asphalt pavement with semi-rigid base and fine-sand subgrade by indoor large-scale accelerated pavement testing. Lect. Notes Civ. Eng. 2020, 96, 80–89. [Google Scholar]
  2. Jing, C.; Zhang, J.X.; Song, B. An innovative evaluation method for performance of in-service asphalt pavement with semi-rigid base. Constr. Build. Mater. 2020, 235, 117573. [Google Scholar] [CrossRef]
  3. Dezhong, D.Y.; Qianqian, Z.Q.; Luyu, Z.L. Mechanical behavior analysis of asphalt pavement based on measured axial load data. Int. J. Pavement Res. Technol. 2024, 17, 460–469. [Google Scholar] [CrossRef]
  4. Gao, Y.Y. Theoretical analysis of reflective cracking in asphalt pavement with semi-rigid base. Iran. J. Sci. Technol.-Trans. Civ. Eng. 2018, 43 (Suppl. 1), 149–157. [Google Scholar] [CrossRef]
  5. Chen, J.; Li, H.; Zhao, Z.; Hou, X.; Luo, J.; Xie, C.; Liu, H.; Ren, T. Investigation of transverse crack spacing in an asphalt pavement with a semi-rigid base. Sci. Rep. 2022, 12, 18079. [Google Scholar] [CrossRef]
  6. Yang, X.; Huang, R.; Meng, Y.; Liang, J.; Rong, H.; Liu, Y.; Tan, S.; He, X.; Feng, Y. Overview of the application of ground-penetrating radar, laser, infrared thermal imaging, and ultrasonic in nondestructive testing of road surface. Measurement 2024, 224, 113927. [Google Scholar] [CrossRef]
  7. Pedersen, L. Viscoelastic Modelling of Road Deflections for Use with the Traffic Speed Deflectometer. Master’s Thesis, Department of Civil Engineering, Technical University of Denmark, Copenhagen, Denmark, 2013. [Google Scholar]
  8. Flintsch, G.; Katicha, S.; Bryce, J.; Ferne, B.; Nell, S.; Diefenderfer, B. Assessment of Continuous Pavement Deflection Measuring Technologies; The National Academies Press: Washington, DC, USA, 2013. [Google Scholar]
  9. Dong, Z.J.; Tan, Y.Q.; Ou, J.P. Dynamic response analysis of asphalt pavement under three directional nonuniform moving load. China Civ. Eng. J. 2013, 46, 122–130. [Google Scholar]
  10. Liu, H.; Shi, Z.; Li, J.; Liu, C.; Meng, X.; Du, Y.; Chen, J. Detection of road cavities in urban cities by 3D ground penetrating radar. Geophysics 2021, 86, WA25–WA33. [Google Scholar] [CrossRef]
  11. Khamzin, A.K.; Varnavina, A.V.; Torgashov, E.V.; Anderson, N.L.; Sneed, L.H. Utilization of air-launched ground penetrating radar (GPR) for pavement condition assessment. Constr. Build. Mater. 2017, 141, 130–139. [Google Scholar] [CrossRef]
  12. Ling, J.Y.; Qian, R.Y.; Shang, K.; Guo, L.; Zhao, Y.; Liu, D. Research on the dynamic monitoring technology of road subgrades with time-lapse full-coverage 3D ground penetrating radar (GPR). Remote Sens. 2022, 14, 1593. [Google Scholar] [CrossRef]
  13. Soren, R.; Lisbeth, A.; Susanne, B.; Jorgen, K. A comparison of two years of network level measurements with the traffic speed deflectometer. In Proceedings of the TRA2008: Transport Research Arena Conference, Ljubljana, Slovenia, 21–24 April 2008. [Google Scholar]
  14. Graczyk, M.; Zofka, A.; Sudyka, J. Analytical solution of pavement deflections and its application to the TSD measurements. In Proceedings of the 26th ARRB Conference, Sydney, Australia, 19–22 October 2014. [Google Scholar]
  15. Mullerw, B.; Roberts, J. Revised approach to assessing traffic speed deflectometer data and field validation of deflection bowl predictions. Int. J. Pavement Eng. 2013, 14, 388–402. [Google Scholar] [CrossRef]
  16. Zofka, A.; Sudyka, J.; Maliszewski, M.; Harasim, P.; Sybilski, D. Alternative approach for interpreting traffic speed deflectometer results. Transp. Res. Rec. 2014, 2457, 12–18. [Google Scholar] [CrossRef]
  17. Peng, Y.H.; Ma, R. Determination of cement concrete pavement foundation emptying by acoustic vibration method. J. Nat. Sci. Heilongjiang Univ. 2009, 26, 276–280. [Google Scholar]
  18. Wang, Q.; Han, X.; Yi, Z.J. Identification of concrete pavement slab cavitation based on transient impact response. J. Southwest Jiaotong Univ. 2010, 45, 718–724. [Google Scholar]
  19. Liu, W.D.; Wang, D.P.; Peng, P. Experimental study on determination of cement concrete pavement emptying by acoustic vibration method. J. Heilongjiang Inst. Technol. (Nat. Sci.) 2011, 25, 29–33. [Google Scholar]
  20. Kuz’min, M.P.; Larionov, L.M.; Kondratiev, V.V.; Kuz’mina, M.Y.; Grigoriev, V.G.; Kuz’mina, A.S. Use of the burnt rock of coal deposits slag heaps in the concrete products manufacturing. Constr. Build. Mater. 2018, 179, 117–124. [Google Scholar] [CrossRef]
  21. Gunka, V.; Demchuk, Y.; Sidun, I.; Miroshnichenko, D.; Nyakuma, B.B.; Pyshyev, S. Application of phenol-cresol-formaldehyde resin as an adhesion promoter for bitumen and asphalt concrete. Road Mater. Pavement Des. 2021, 22, 2906–2918. [Google Scholar] [CrossRef]
  22. Cho, Y.S.; Hong, S.U. The ANN simulation of stress wave based NDT on concrete structures. In Proceedings of the International Conference on System Science and Simulation Engineering, Venice, Italy, 21–23 November 2008; pp. 140–146. [Google Scholar]
  23. Yousefi, M.; Hansen, J.H.L. Block-based high performance CNN architectures for frame-level overlapping speech detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 28–40. [Google Scholar] [CrossRef]
  24. Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. arXiv 2016, arXiv:1609.09430. [Google Scholar]
  25. Liu, M.; Wang, J.; Li, S.; Xiang, F.; Yao, Y.; Yang, L. MOS predictor for synthetic speech with i-vector inputs. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Singapore, 23–27 May 2022; pp. 906–910. [Google Scholar]
  26. Jeancolas, L.; Petrovska-Delacrétaz, D.; Mangone, G.; Benkelfat, B.E.; Corvol, J.C.; Vidailhet, M.; Lehéricy, S.; Benali, H. X-vectors: New quantitative biomarkers for early Parkinson’s disease detection from speech. Front. Neuroinform. 2021, 15, 578369. [Google Scholar] [CrossRef]
  27. Wazir, A.S.B.; Karim, H.A.; Abdullah, M.H.L.; Mansor, S.; AlDahoul, N.; Fauzi, M.F.A.; See, J. Spectrogram-based classification of spoken foul language using deep CNN. In Proceedings of the 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 21–24 September 2020. [Google Scholar]
  28. Sainath, T.N.; Mohamed, A.R.; Kingsbury, B.; Ramabhadran, B. Deep convolutional neural networks for LVCSR. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 8614–8618. [Google Scholar]
  29. Piczak, K.J. Environmental acoustic classification with convolutional neural networks. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 1–6. [Google Scholar]
  30. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar]
  31. Lee, H.; Yoo, I.; Park, S. Learning robust feature representations for audio event detection. IEEE Trans. Audio Speech Lang. Process. 2019, 27, 726–735. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  33. Zhang, H.; Li, J.; Cai, G.; Chen, Z.; Zhang, H. A CNN-based method for enhancing boring vibration with time-domain convolution-augmented transformer. Insects 2023, 14, 631. [Google Scholar] [CrossRef]
  34. Cai, Y.; Hou, A. Analysis on transformer vibration signal recognition based on convolutional neural network. J. Vibroeng. 2021, 23, 484–495. [Google Scholar] [CrossRef]
  35. Yushao, M.; Wang, X.; Zhou, W.; Xiang, L. Research on transformer condition recognition based on acoustic signal and one-dimensional convolutional neural networks. J. Phys. Conf. Ser. 2021, 2005, 012078. [Google Scholar]
  36. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2978–2988. [Google Scholar]
  37. Geng, Q.S.; Wang, F.H.; Zhou, D.X. Mechanical fault diagnosis of power transformer by GFCC time-frequency map of acoustic signal and convolutional neural network. In Proceedings of the 2019 IEEE Sustainable Power and Energy Conference (iSPEC), Beijing, China, 21–23 November 2019. [Google Scholar]
  38. Wu, Y.; Zhang, Z.; Xiao, R.; Jiang, P.; Dong, Z.; Deng, J. Operation state identification method for converter transformers based on vibration detection technology and deep belief network optimization algorithm. Actuators 2021, 10, 56. [Google Scholar] [CrossRef]
  39. Chen, H.; Yu, Y.; Li, P. Transformer-based denoising of mechanical vibration signals. arXiv 2023, arXiv:2308.02166. [Google Scholar] [CrossRef]
  40. Secic, A.; Krpan, M.; Kuzle, I. Vibro-acoustic methods in the condition assessment of power transformers: A survey. IEEE Access 2019, 7, 83915–83931. [Google Scholar] [CrossRef]
  41. Liu, W.; Liu, X.; Wang, D.; Lu, W.; Yuan, B.; Qin, C. MITDCNN: A multi-modal input Transformer-based deep convolutional neural network for misfire signal detection in high-noise diesel engines. Expert Syst. Appl. 2024, 238, 121797. [Google Scholar] [CrossRef]
  42. Ahmed, H.O.A.; Nandi, A.K. Convolutional-Transformer Model with Long-Range Temporal Dependencies for Bearing Fault Diagnosis Using Vibration Signals. Machines 2023, 11, 746. [Google Scholar] [CrossRef]
  43. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7132–7141. [Google Scholar]
  44. Zhang, J.C.; Yan, W.Y.; Zhang, Y. A new speech feature fusion method with cross gate parallel CNN for speaker recognition. arXiv 2022, arXiv:2211.13377. [Google Scholar] [CrossRef]
  45. Burra, M.; Vanambathina, S.D.; Lakshmi A, V.A.; Ch, L.; Kotiah, N.S. Cross channel interaction based ECA-Net using gated recurrent convolutional network for speech enhancement. Multimed. Tools Appl. 2024, 84, 16455–16479. [Google Scholar] [CrossRef]
  46. Yu, H.; Zhao, Q. Brain-inspired multisensory integration neural network for cross-modal recognition through spatiotemporal dynamics and deep learning. Cogn. Neurodyn. 2023, 18, 3615–3628. [Google Scholar] [CrossRef] [PubMed]
  47. Yang, M.; Yeh, C.H.; Zhou, Y.; Cerqueira, J.P. A 1μW voice activity detector using analog feature extraction and digital deep neural network. In Proceedings of the 2018 IEEE International Solid-State Circuits Conference-(ISSCC), San Francisco, CA, USA, 11–15 February 2018. [Google Scholar]
  48. Jan, M.; Khattak, K.S.; Khan, Z.H.; Gulliver, T.A.; Altamimi, A.B. Crowdsensing for Road Pavement Condition Monitoring: Trends, Limitations, and Opportunities. IEEE Access 2023, 11, 133143–133159. [Google Scholar] [CrossRef]
  49. Zang, G.; Sun, L.; Chen, Z.; Li, L. A nondestructive evaluation method for semi-rigid base cracking condition of asphalt pavement. Constr. Build. Mater. 2018, 162, 892–897. [Google Scholar] [CrossRef]
  50. Liu, J.; Liu, G.; Yang, T.; Zhou, J. Research on relationships among different distress types of asphalt pavements with semi-rigid bases in China using association rule mining: A statistical point of view. J. Transp. Eng. Part B Pavements 2019, 5, 57–68. [Google Scholar] [CrossRef]
  51. Chu, S.; Narayanan, S.; Kuo, C.C.J. Environmental acoustic recognition with time-frequency audio features. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 200–205. [Google Scholar] [CrossRef]
  52. Constantinescu, C.; Brad, R. An overview on acoustic features in time and frequency domain. Int. J. Adv. Sci. Technol. Electr. Eng. 2023, 24, 45–58. [Google Scholar]
  53. Ghosh, S.K.; Ponnalagu, R.N.; Tripathy, R.K. Automated heart acoustic activity detection from PCG signal using time-frequency-domain deep neural network. IEEE Access 2022, 10, 30024–30031. [Google Scholar]
  54. Tang, J.; Sun, X.; Yan, L.; Qu, Y.; Wang, T.; Yue, Y. Acoustic source localization method-based time-domain signal feature using deep learning. Appl. Acoust. 2023, 213, 109626. [Google Scholar] [CrossRef]
  55. Ye, Z.; Xiong, H.; Wang, L. Collecting comprehensive traffic information using pavement vibration monitoring data. Comput.-Aided Civ. Infrastruct. Eng. 2020, 35, 134–149. [Google Scholar] [CrossRef]
  56. Zhou, K.; Lei, D.; He, J.; Zhang, P.; Bai, P.; Zhu, F. Real-time localization of micro-damage in concrete beams using DIC technology and wavelet packet analysis. Cem. Concr. Compos. 2021, 123, 104113. [Google Scholar] [CrossRef]
  57. Walid, M.; Darmawan, A.K. Pengenalan ucapan menggunakan metode linear predictive coding (LPC) dan K-nearest neighbor (K-NN). Energy 2017, 7, 13–22. [Google Scholar]
  58. Mini, P.P.; Thomas, T.A.; Kumari, R.G. EEG-based direct speech BCI system using a fusion of SMRT and MFCC/LPCC features with ANN classifier. Biomed. Signal Process. Control 2021, 68, 102625. [Google Scholar] [CrossRef]
  59. Sharma, A.; Kaut, S. Two-stage supervised learning-based method to detect screams and cries in urban environments. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 290–299. [Google Scholar] [CrossRef]
  60. Zhang, X.L.; Wang, D.L. Boosting contextual information for deep neural network based voice activity detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 24, 252–264. [Google Scholar] [CrossRef]
  61. Wang, Q.; Zeng, Q.; Xie, X.; Zheng, Z. Research on speech recognition method in low SNR environment. Acoust. Technol. 2017, 36, 50–56. [Google Scholar]
  62. Mitra, V.; Wang, W.; Franco, H.; Lei, Y. Evaluating robust features on deep neural networks for speech recognition in noisy and channel mismatched conditions. In Proceedings of the International Conference on Speech and Language Processing, Grenoble, France, 14–16 October 2014. [Google Scholar]
  63. Farahani, G. Autocorrelation-based noise subtraction method with smoothing, overestimation, energy, and cepstral mean and variance normalization for noisy speech recognition. EURASIP J. Audio Speech Music. Process. 2017, 2017, 1–16. [Google Scholar] [CrossRef]
  64. Zhang, S.; Yang, Y.; Chen, C.; Zhang, X.; Leng, Q.; Zhao, X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and prospects. Expert Syst. Appl. 2024, 237, 121692. [Google Scholar] [CrossRef]
  65. Yang, J.; Yang, F.; Zhou, Y.; Wang, D.; Li, R.; Wang, G. A data-driven structural damage detection framework based on parallel convolutional neural network and bidirectional gated recurrent unit. Inf. Sci. 2021, 566, 103–117. [Google Scholar] [CrossRef]
  66. Choromanski, K.M.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.Q.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  67. Leng, D.; Zheng, L.; Wen, Y.; Zhang, Y.; Wu, L.; Wang, J.; Wang, M.; Zhang, Z.; He, S.; Bo, X. A benchmark study of deep learning-based multi-omics data fusion methods for cancer. Genome Biol. 2022, 23, 171. [Google Scholar] [CrossRef]
Figure 1. Road structure diagram.
Figure 2. Road acoustic collection equipment and collection principle. Note: the Chinese text in the figure reads “Highway Inspection”.
Figure 3. Original, first-order differential and second-order differential acoustic time-domain data for different samples.
Figure 4. Original, first-order differential and second-order differential acoustic frequency-domain data for different samples.
Figure 5. MFCC feature maps of different acoustic samples.
Figure 6. T-PCG-CNN network architecture diagram.
Figure 7. Internal structure of the PCG-CNN module.
Figure 8. Performance comparison of blockwise self-attention with different attention mode configurations. (a) Full n² attention, (b) sliding window attention, (c) dilated sliding window, (d) domain-specific global tokens.
Figure 9. Confusion matrix of T-PCG-CNN predictions on the test set.
Figure 10. ROC curves for each class under different threshold conditions.
Table 1. Distribution of three road dataset types and volumes for G5512, 107 National Road and Town Road.
Dataset | Damage | Debonding | Manhole Cover | Vehicle Noise | Other
G-datasets | 560 | 560 | 150 | 295 | 1800
NR-datasets | 500 | 560 | 145 | 285 | 1750
TR-datasets | 500 | 560 | 145 | 280 | 1750
Table 2. Network performance under different dropout rates.
Dropout Rate | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR)
No Dropout | 0.9056 | 0.9074 | 0.8926 | 0.9111 | 0.9292 | 0.9317
0.2 | 0.9169 | 0.9117 | 0.9164 | 0.9265 | 0.9377 | 0.9486
0.3 | 0.9118 | 0.9214 | 0.9280 | 0.9306 | 0.9394 | 0.9533
0.4 | 0.9224 | 0.9210 | 0.9342 | 0.9311 | 0.9464 | 0.9422
0.5 | 0.9186 | 0.9183 | 0.9297 | 0.9126 | 0.9437 | 0.9336
0.6 | 0.9063 | 0.9062 | 0.9118 | 0.9207 | 0.9393 | 0.9319
0.7 | 0.9144 | 0.9111 | 0.9105 | 0.9132 | 0.9365 | 0.9214
0.8 | 0.8903 | 0.9079 | 0.9111 | 0.9117 | 0.9261 | 0.9221
Note: Bold indicates the highest value in each column; the same convention is used in the tables that follow.
Table 3. Impact of number of heads on test results across different datasets.
Num Heads | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR)
4 | 0.9145 | 0.9214 | 0.9280 | 0.9316 | 0.9377 | 0.9427
8 | 0.9189 | 0.9224 | 0.9333 | 0.9362 | 0.9448 | 0.9494
16 | 0.9017 | 0.9039 | 0.9217 | 0.9266 | 0.9351 | 0.9221
Table 4. Recognition results for different data features in the G-dataset, NR-dataset, and TR-dataset.
Network Type | Data Feature Type | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR)
T-PCG-CNN | MFCC&ΔMFCC&Δ2MFCC | 0.9224 | 0.9210 | 0.9342 | 0.9311 | 0.9464 | 0.9422
T-PCG-CNN* | MFCC&ΔMFCC, Δ2MFCC | 0.8944 | 0.9011 | 0.8865 | 0.8966 | 0.9108 | 0.9155
 | MFCC&Δ2MFCC, ΔMFCC | 0.8828 | 0.8864 | 0.9051 | 0.8904 | 0.9091 | 0.9084
 | ΔMFCC&Δ2MFCC, MFCC | 0.8687 | 0.8758 | 0.8748 | 0.8846 | 0.9027 | 0.9136
T-PG-CNN | MFCC, ΔMFCC, Δ2MFCC | 0.8531 | 0.8622 | 0.8457 | 0.8588 | 0.9075 | 0.9019
T-G-CNN | MFCC | 0.8311 | 0.8307 | 0.8242 | 0.8114 | 0.8542 | 0.8450
 | ΔMFCC | 0.8233 | 0.8217 | 0.8126 | 0.8068 | 0.8566 | 0.8514
 | Δ2MFCC | 0.8289 | 0.8313 | 0.8209 | 0.8161 | 0.8552 | 0.8521
Table 5. Ablation study of various models.
Model Variant | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR) | Model Size (MB) | Params (M) | Inference Speed | GFLOPs | Peak Memory (GB)
T-PCG-CNN | 0.9224 | 0.9210 | 0.9342 | 0.9311 | 0.9464 | 0.9422 | 31.4 | 7.8 | 1.3× | 1.4 | 35.0
Without Transformer (PCG-CNN only) | 0.9041 | 0.8992 | 0.9003 | 0.9150 | 0.9214 | 0.9237 | – | – | – | – | –
Without Cross-Gating (Parallel CNN + Transformer, no gates) | 0.8531 | 0.8622 | 0.8457 | 0.8588 | 0.9075 | 0.9019 | – | – | – | – | –
Without Multi-Scale (Single-Branch CNN + Transformer) | 0.8274 | 0.8228 | 0.8311 | 0.8320 | 0.8441 | 0.8502 | – | – | – | – | –
Standard Transformer (T*-PCG-CNN) | 0.9190 | 0.9218 | 0.9286 | 0.9342 | 0.9420 | 0.9458 | 47.6 | 11.9 | 1.0× (baseline) | 1.8 | 36.2
Table 6. Comparison of T-PCG-CNN with other networks’ performance.
Net | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR) | Accuracy (TR-5 dB) | W-F1 (TR-5 dB) | Accuracy (TR-10 dB) | W-F1 (TR-10 dB)
i-vector | 0.8568 | 0.8642 | 0.8765 | 0.8600 | 0.8831 | 0.8758 | 0.7032 | 0.7543 | 0.7336 | 0.7669
x-vector | 0.8621 | 0.8532 | 0.8649 | 0.8670 | 0.8740 | 0.8814 | 0.7244 | 0.7248 | 0.7933 | 0.7804
wav2vec 2.0 | 0.8895 | 0.8721 | 0.8935 | 0.8944 | 0.9001 | 0.9035 | 0.8059 | 0.7995 | 0.8544 | 0.8632
HuBERT | 0.8770 | 0.8905 | 0.8845 | 0.8900 | 0.8956 | 0.9044 | 0.8335 | 0.8267 | 0.8638 | 0.8785
Spectrogram + CNN | 0.7833 | 0.7155 | 0.8014 | 0.7822 | 0.8267 | 0.7722 | 0.6852 | 0.6972 | 0.7259 | 0.7119
MFCC + CNN | 0.8411 | 0.8207 | 0.8342 | 0.8214 | 0.8642 | 0.8550 | 0.7900 | 0.7842 | 0.8153 | 0.8247
MFCC + CENS + CNN | 0.8700 | 0.8614 | 0.8755 | 0.8542 | 0.8932 | 0.9041 | 0.8257 | 0.8424 | 0.8433 | 0.8451
CG-PCNN | 0.9114 | 0.9035 | 0.9227 | 0.9116 | 0.9435 | 0.9208 | 0.8824 | 0.8712 | 0.9014 | 0.8995
T-PCG-CNN | 0.9189 | 0.9224 | 0.9333 | 0.9362 | 0.9448 | 0.9494 | 0.9114 | 0.9187 | 0.9400 | 0.9415
Table 7. T-PCG-CNN recognition performance on other datasets under different test conditions.
Training and Validation Dataset | Accuracy (G test) | W-F1 (G test) | Accuracy (NR test) | W-F1 (NR test) | Accuracy (TR test) | W-F1 (TR test)
G-datasets | – | – | 0.7725 | 0.7543 | 0.7842 | 0.7731
NR-datasets | 0.6858 | 0.7024 | – | – | 0.7661 | 0.7492
TR-datasets | 0.6524 | 0.6711 | 0.6320 | 0.6442 | – | –
G&NR-datasets | – | – | – | – | 0.8243 | 0.8124
G&TR-datasets | – | – | 0.8436 | 0.8311 | – | –
NR&TR-datasets | 0.8125 | 0.8046 | – | – | – | –
G&NR&TR-datasets | 0.8889 | 0.8935 | 0.9117 | 0.9029 | 0.9208 | 0.9315
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
