Acoustic Analysis of Semi-Rigid Base Asphalt Pavements Based on Transformer Model and Parallel Cross-Gate Convolutional Neural Network
Abstract
1. Introduction
2. Road Acoustic Dataset and Multi-Level Feature Extraction
2.1. Road Type Selection and Acoustic Characteristics
- Changji Expressway (designated G5512): connecting Jincheng, Shanxi Province, to Jiaozuo, Henan Province.
- National Highway 107: the Xinxiang section, Henan Province.
- South Outer Ring Municipal Road: a town-level road in Xinxiang City, Henan Province.
2.2. Data Acquisition: Collection Setup and Signal Processing Principles
2.3. Dataset Structuring and Statistical Distribution
2.4. Acoustic Feature Analysis: Time–Frequency Domain Comparisons
2.5. Multi-Level Feature Extraction
- Dimensionality Reduction: MFCC reduces thousands of spectral bins into a compact feature set, significantly lowering the computational overhead required in real-time road monitoring systems [62].
- Noise Robustness: Numerous studies in road acoustics have shown that MFCC-like features are robust to moderate noise, a critical factor in environments with high background noise, such as highways [63].
- Computational Efficiency: The relatively low complexity of generating MFCCs (compared to full-resolution wavelet analyses) makes them ideal for embedded or edge computing systems [64].
1. Framing: the input audio signal x(t) is divided into multiple short-time frames, each with a frame length of 86 ms and a frame shift of 30 ms.
2. Windowing: to enhance the continuity at both ends of each frame and to avoid the spectral leakage caused by a rectangular window's truncation of the signal, a Hamming window is applied. The Hamming window is defined as $w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$, $0 \le n \le N-1$, where $N$ is the frame length in samples.
3. DFT: the FFT is applied to each windowed frame to obtain its frequency spectrum, $X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}$, $k = 0, 1, \ldots, N-1$.
4. Power Spectrum Calculation: the power spectrum is obtained by taking the squared magnitude of the frequency spectrum of the acoustic signal, resulting in the spectral line energy $P(k) = |X(k)|^{2}$.
5. Dynamic MFCC Feature Extraction: since MFCC features represent only the static characteristics of the acoustic signal, they do not fully capture the dynamic behavior of road-excitation acoustics. To better reflect the dynamic MFCC characteristics of the road [62], this study computes the first-order derivative ΔMFCC and the second-order derivative Δ2MFCC of the excitation signal. The first-order derivative tracks the intensity variation of the acoustic data, helping to mitigate the effects of slight variations in the acoustic signal caused by road-surface cracks during inspection. The second-order derivative captures the rate of change of those variations and aids in distinguishing the different material and structural characteristics of the road. (A feature-extraction sketch follows this list.)
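The five-step pipeline above can be sketched with librosa; the library choice is not stated in the paper, and n_mfcc = 13 is an illustrative assumption. The 86 ms frame length, 30 ms shift, and Hamming window follow the text, and the delta features implement step 5:

```python
import librosa
import numpy as np

def extract_mfcc_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Compute static MFCCs plus first- and second-order deltas.

    Frame length (86 ms) and frame shift (30 ms) follow the paper;
    n_mfcc=13 and the use of librosa are illustrative assumptions.
    """
    y, sr = librosa.load(wav_path, sr=None)   # keep the native sample rate
    n_fft = int(0.086 * sr)                   # 86 ms frame length
    hop_length = int(0.030 * sr)              # 30 ms frame shift

    # Steps 1-4 (framing, Hamming windowing, FFT, power spectrum) and the
    # subsequent Mel filterbank/log/DCT stages happen inside feature.mfcc.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=n_fft, hop_length=hop_length, window="hamming",
    )

    # Step 5: dynamic features (first- and second-order derivatives).
    d1 = librosa.feature.delta(mfcc, order=1)  # ΔMFCC
    d2 = librosa.feature.delta(mfcc, order=2)  # Δ2MFCC

    # Stack into shape (3 * n_mfcc, n_frames) for the three parallel branches.
    return np.vstack([mfcc, d1, d2])
```

The stacked output keeps the three feature types separable, so each can feed its own convolutional branch in the network described in Section 3.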
3. Neural Network Design and Optimization
3.1. Overall Network Design
3.2. PCG-CNN Module
1. CNN Layer
2. Cross-Gate Logic (a sketch follows this list)
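The cross-gate logic can be sketched in PyTorch as follows. This is a minimal two-stream illustration, not the paper's exact implementation: the kernel sizes, channel counts, and the use of 1×1 convolutions to form the gates are assumptions, and the paper's three branches (MFCC, ΔMFCC, Δ2MFCC) extend the same pattern pairwise. Each stream is re-weighted by a sigmoid gate computed from the other stream, which is the adaptive feature exchange that the ablation in Section 4.3 isolates.

```python
import torch
import torch.nn as nn

class CrossGateBlock(nn.Module):
    """Illustrative sketch of cross-gating between two parallel CNN
    streams: each stream is modulated by a sigmoid gate derived from
    the other stream's features. Layer sizes are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # 1x1 convolutions that produce the cross gates
        self.gate_a = nn.Conv2d(channels, channels, kernel_size=1)
        self.gate_b = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        f_a = torch.relu(self.conv_a(x_a))
        f_b = torch.relu(self.conv_b(x_b))
        # Each stream is re-weighted by a gate computed from the other,
        # so information is exchanged without merging the streams outright.
        g_a = torch.sigmoid(self.gate_b(f_b))  # gate for stream A, from B
        g_b = torch.sigmoid(self.gate_a(f_a))  # gate for stream B, from A
        return f_a * g_a, f_b * g_b
```

Removing the two sigmoid gates and simply concatenating `f_a` and `f_b` yields the no-gate variant evaluated in the ablation study.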
3.3. Transformer Model
3.3.1. Motivation
3.3.2. Blockwise Self-Attention
1. Splitting the Sequence into Blocks
2. Local Windowed Self-Attention
3. Dilated Windows
4. Bridging Across Blocks (a mask-construction sketch follows this list)
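A minimal sketch of the blockwise sparsity pattern described above, assuming a boolean-mask formulation: queries attend within a local window inside their own block, while dilated positions provide bridges across blocks. The parameter names and the exact combination rule are illustrative assumptions rather than the paper's specification; a mask like this can be supplied to PyTorch's `scaled_dot_product_attention`, where `True` marks positions that participate in attention.

```python
import torch

def blockwise_attention_mask(seq_len: int, block_size: int,
                             window: int, dilation: int) -> torch.Tensor:
    """Boolean attention mask (True = attend) sketching the sparsity
    pattern above: local windowed attention within blocks plus dilated
    positions that bridge across blocks. Parameters are illustrative."""
    idx = torch.arange(seq_len)
    rel = (idx[None, :] - idx[:, None]).abs()   # pairwise frame distances
    local = rel <= window                       # local attention window
    dilated = rel % dilation == 0               # sparse long-range links
    same_block = (idx[None, :] // block_size) == (idx[:, None] // block_size)
    # Attend locally inside a block; bridge across blocks via dilation.
    return (local & same_block) | dilated
```

For example, with `seq_len=100`, `block_size=25`, `window=4`, and `dilation=16`, each query attends to a handful of nearby frames plus a sparse set of long-range anchors instead of all 100 positions, which is the source of the computational savings quantified in Section 4.3.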
3.3.3. Parameter Sharing Across Heads
3.3.4. Integration with PCG-CNN
1. Local Denoising and Multi-Scale Extraction
2. Blockwise Sparse Attention
3.4. Classifier
4. Performance Evaluation of T-PCG-CNN
4.1. Network Parameter Performance Analysis
4.2. Performance Comparison of Different Network Architectures
- T-PCG-CNN*: a variant in which two of the feature types are fused in parallel and the third is processed in a separate branch. For example, "MFCC & ΔMFCC, Δ2MFCC" means MFCC and ΔMFCC features are cross-fused in parallel convolution streams while Δ2MFCC is handled by a separate stream. Three configurations of fused versus separate features were tested (as shown in Table 4).
- T-PG-CNN: a variant with no cross-gating between feature streams. Here MFCC, ΔMFCC, and Δ2MFCC are each processed in independent CNN branches (no mutual gating), and their outputs are simply concatenated before the classifier. This tests the importance of the cross-gate mechanism.
- T-G-CNN: a single-feature network for comparison, which uses only one type of input. This is essentially a “Transformer + Gated CNN” applied to a single feature, serving as a baseline with no multi-feature fusion.
4.3. Ablation Study
- Without Transformer (PCG-CNN only): The Transformer encoder is removed, and only the Parallel Cross-Gated CNN is used for classification. This setup evaluates the importance of capturing long-term temporal dependencies. Results indicate a notable drop in performance: on the G-dataset, for example, the W-F1 falls from 0.9210 to 0.8992 (a decrease of 2.18 percentage points), in line with the performance of the CG-PCNN model (Cross-Gated CNN without Transformer). Similar declines are observed on the other datasets. This confirms that the Transformer's sequence modeling significantly boosts accuracy by integrating information over time rather than relying only on instantaneous or local acoustic features. In other words, the PCG-CNN alone, while powerful at extracting multi-scale features, misses the temporal context the Transformer provides, resulting in lower recall on events with subtler or longer acoustic signatures.
- Without Cross-Gating (Parallel CNN + Transformer): In this variant, the cross-gating mechanism between convolutional branches is removed. The network instead uses parallel convolution streams for each feature (MFCC, ΔMFCC, Δ2MFCC) without adaptive gating, and their outputs are concatenated before feeding into the Transformer, essentially a no-gate version of the model. The absence of cross-gating leads to a significant degradation: accuracy drops by roughly 4–9 percentage points depending on the dataset. Specifically, compared to the full T-PCG-CNN, the no-gating model's W-F1 is lower by 5.88 percentage points on the G-dataset, 7.23 on the NR-dataset, and 4.03 on the TR-dataset. This highlights that the cross-gate module is crucial for effectively fusing features: it allows the network to adaptively weight and exchange information between the MFCC, ΔMFCC, and Δ2MFCC channels, capturing their complementary nature. Without gating, the model cannot reconcile differences between feature types as well, leading to misclassifications. This ablation underscores the value of the cross-gated parallel design, which is also supported by other studies that use gated fusion to combine modalities or feature streams.
- Without Multi-Scale Convolutions (Single-Branch CNN + Transformer): In this variant, the parallel multi-branch convolutional architecture is replaced with a single CNN stream for feature extraction. All input features are either combined into one branch or, alternatively, only one feature is used as input before passing through the Transformer. This ablation evaluates the importance of employing multiple convolutional kernel sizes and maintaining separate feature-specific processing streams. The results show a notable decline, with the single-branch network achieving only 82–84% accuracy, comparable to the T-G-CNN single-feature baseline, which reached about 85% accuracy on the TR-dataset. The absence of multi-scale feature learning likely impairs the model's ability to capture frequency-specific patterns; a single convolution kernel may miss features that a multi-branch structure, with both wide and narrow receptive fields, can capture. A decline in recall was also observed for minority classes; for example, detection of the "Manhole Cover" class suffered without the high-frequency branch provided by the PCG-CNN design. These findings indicate that using multiple parallel convolutional filters to extract diverse spectral characteristics significantly enhances performance, consistent with the rationale behind CG-PCNN architectures used in speech feature modeling. In summary, removing multi-scale parallelism substantially weakens feature-extraction capacity, reinforcing the effectiveness of the proposed PCG-CNN design.
- Standard Transformer instead of Lightweight Transformer (T*-PCG-CNN) [32]: Finally, the optimized Transformer design is compared with a baseline that integrates a conventional Transformer module with the PCG-CNN, denoted T*-PCG-CNN. This baseline omits the optimizations introduced in the proposed model, such as sparse attention and weight sharing across Transformer layers; the proposed model employs a parameter-sharing scheme to reduce redundancy and improve efficiency (see the sketch after this list). The T*-PCG-CNN baseline achieves accuracy nearly identical to the proposed model, with differences within 0.1–0.5 percentage points across all datasets, indicating that the optimized design does not compromise classification accuracy. Model size and efficiency, however, tell a different story: T*-PCG-CNN's model file is 47.6 MB, whereas T-PCG-CNN's is only 31.4 MB, a 34% reduction achieved by the design. The parameter count of T*-PCG-CNN is correspondingly higher (roughly 11.9 million vs. 7.8 million). In practice, this makes the proposed model significantly more memory-efficient and faster at inference: measurements show that T-PCG-CNN runs approximately 1.3× faster per inference than the T*-PCG-CNN baseline on the same hardware. This ablation confirms that the modifications to the Transformer module reduce model complexity while maintaining performance; the multi-head self-attention mechanism still provides the needed sequence learning, but with fewer parameters. This matters for engineering deployment, as a smaller model is easier to deploy on edge devices (such as roadside units or mobile data-acquisition vehicles) without compromising detection capability.
- Efficiency metrics (GFLOPs and memory): Beyond accuracy and model size, we report GFLOPs per forward pass and peak training memory (batch = 64) for a 3 s input (100 frames). As summarized in Table 5, the proposed T-PCG-CNN requires ≈1.4 GFLOPs with ≈35.0 GB peak memory, whereas the standard-Transformer variant T*-PCG-CNN requires ≈1.8 GFLOPs and ≈36.2 GB. Thus, T-PCG-CNN reduces computational demand by 22% and peak memory by 1.2 GB while maintaining accuracy, supporting its efficiency advantage over the heavier T* baseline.
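The cross-layer weight sharing referenced above can be illustrated with a short PyTorch sketch. This is not the authors' released implementation, and the dimensions (d_model = 256, 8 heads, depth 4) are illustrative assumptions: one encoder layer is instantiated and reused at every depth step, so the parameter count stays at that of a single layer regardless of depth.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Sketch of cross-layer parameter sharing: a single Transformer
    encoder layer is reused for all depth steps, so the parameter
    count is that of one layer. All dimensions are assumptions."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, depth: int = 4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.depth = depth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.depth):  # same weights applied at every depth
            x = self.layer(x)
        return x
```

Relative to stacking `depth` independent layers, this scheme divides the Transformer's parameter budget by the depth while keeping the same computation per forward pass, which is consistent with the reported size reduction at unchanged GFLOPs order.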
4.4. Comparison with Existing Acoustic Signal Feature Extraction and Machine Learning Algorithms (Noise Robustness)
4.5. Application Prospects and Future Work
- Single-Dataset Training with Cross-Dataset Testing: The T-PCG-CNN model is trained on one dataset (e.g., highway data) and evaluated on the others to assess generalization across road types. Specifically, 75% of the selected dataset is used for training and the remaining 25% for validation. To form the test set, a small portion (15%) is sampled from each of the other two datasets. This setup evaluates the model's ability to classify acoustic signals from road types not seen during training, simulating a model trained in one environment and deployed in another. (A split-protocol sketch follows this list.)
- Two-Dataset Combination Training: Two of the road datasets are merged to create a combined training set, with data sampled in equal proportions to preserve the original category balance. The model is then trained on 75% and validated on 25% of this combined dataset. For testing, 15% of the remaining third dataset—unseen during training—is used. This experimental setup evaluates the model’s ability to generalize when exposed to a broader range of road conditions during training while still encountering a new, unseen road type at test time.
- Three-Dataset Comprehensive Training: All three datasets are combined into one large training pool. Data from all road types are mixed in equal proportion, and the number of samples per class is kept consistent with the single-dataset case (to avoid bias). The model is trained on 75% and validated on 25% of this unified dataset. For testing, unknown conditions are simulated by selecting the remaining unused data to create three separate test sets, each equal in size to the combined validation set and drawn from a distinct road type. This approach enables evaluation of the model’s performance on each road type individually after being trained on a diverse set of conditions, thereby assessing its generalization capability in varied real-world scenarios.
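A sketch of the first protocol (single-dataset training with cross-dataset testing), assuming in-memory NumPy arrays and scikit-learn; the stratified split and the helper's structure are illustrative, not the authors' code. The 75/25 train/validation split and the 15% cross-dataset sampling follow the text:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def single_dataset_protocol(train_ds, other_ds_list, seed=42):
    """Sketch of the single-dataset protocol: 75/25 train/validation on
    one road dataset, with 15% sampled from each remaining dataset as
    the cross-dataset test set. Data structures are assumptions:
    each dataset is a (features, labels) pair of NumPy arrays."""
    X, y = train_ds
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)

    rng = np.random.default_rng(seed)
    X_test, y_test = [], []
    for X_o, y_o in other_ds_list:
        # Sample 15% of each unseen dataset without replacement.
        take = rng.choice(len(X_o), size=int(0.15 * len(X_o)), replace=False)
        X_test.append(X_o[take])
        y_test.append(y_o[take])

    return ((X_tr, y_tr), (X_val, y_val),
            (np.concatenate(X_test), np.concatenate(y_test)))
```

The two- and three-dataset protocols follow the same pattern, merging datasets in equal proportions before the 75/25 split and drawing the test set from whichever road types remain unseen.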
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Wu, J.T.; Wu, Y.T. Performance evaluation of asphalt pavement with semi-rigid base and fine-sand subgrade by indoor large-scale accelerated pavement testing. Lect. Notes Civ. Eng. 2020, 96, 80–89. [Google Scholar]
- Jing, C.; Zhang, J.X.; Song, B. An innovative evaluation method for performance of in-service asphalt pavement with semi-rigid base. Constr. Build. Mater. 2020, 235, 117573. [Google Scholar] [CrossRef]
- Dezhong, D.Y.; Qianqian, Z.Q.; Luyu, Z.L. Mechanical behavior analysis of asphalt pavement based on measured axial load data. Int. J. Pavement Res. Technol. 2024, 17, 460–469. [Google Scholar] [CrossRef]
- Gao, Y.Y. Theoretical analysis of reflective cracking in asphalt pavement with semi-rigid base. Iran. J. Sci. Technol.-Trans. Civ. Eng. 2018, 43 (Suppl. 1), 149–157. [Google Scholar] [CrossRef]
- Chen, J.; Li, H.; Zhao, Z.; Hou, X.; Luo, J.; Xie, C.; Liu, H.; Ren, T. Investigation of transverse crack spacing in an asphalt pavement with a semi-rigid base. Sci. Rep. 2022, 12, 18079. [Google Scholar] [CrossRef]
- Yang, X.; Huang, R.; Meng, Y.; Liang, J.; Rong, H.; Liu, Y.; Tan, S.; He, X.; Feng, Y. Overview of the application of ground-penetrating radar, laser, infrared thermal imaging, and ultrasonic in nondestructive testing of road surface. Measurement 2024, 224, 113927. [Google Scholar] [CrossRef]
- Pedersen, L. Viscoelastic Modelling of Road Deflections for Use with the Traffic Speed Deflectometer. Master’s Thesis, Department of Civil Engineering, Technical University of Denmark, Copenhagen, Denmark, 2013. [Google Scholar]
- Flintsch, G.; Katicha, S.; Bryce, J.; Ferne, B.; Nell, S.; Diefenderfer, B. Assessment of Continuous Pavement Deflection Measuring Technologies; The National Academies Press: Washington, DC, USA, 2013. [Google Scholar]
- Dong, Z.J.; Tan, Y.Q.; Ou, J.P. Dynamic response analysis of asphalt pavement under three directional nonuniform moving load. China Civ. Eng. J. 2013, 46, 122–130. [Google Scholar]
- Liu, H.; Shi, Z.; Li, J.; Liu, C.; Meng, X.; Du, Y.; Chen, J. Detection of road cavities in urban cities by 3D ground penetrating radar. Geophysics 2021, 86, WA25–WA33. [Google Scholar] [CrossRef]
- Khamzin, A.K.; Varnavina, A.V.; Torgashov, E.V.; Anderson, N.L.; Sneed, L.H. Utilization of air-launched ground penetrating radar (GPR) for pavement condition assessment. Constr. Build. Mater. 2017, 141, 130–139. [Google Scholar] [CrossRef]
- Ling, J.Y.; Qian, R.Y.; Shang, K.; Guo, L.; Zhao, Y.; Liu, D. Research on the dynamic monitoring technology of road subgrades with time-lapse full-coverage 3D ground penetrating radar (GPR). Remote Sens. 2022, 14, 1593. [Google Scholar] [CrossRef]
- Soren, R.; Lisbeth, A.; Susanne, B.; Jorgen, K. A comparison of two years of network level measurements with the traffic speed deflectometer. In Proceedings of the TRA2008: Transport Research Arena Conference, Ljubljana, Slovenia, 21–24 April 2008. [Google Scholar]
- Graczyk, M.; Zofka, A.; Sudyka, J. Analytical solution of pavement deflections and its application to the TSD measurements. In Proceedings of the 26th ARRB Conference, Sydney, Australia, 19–22 October 2014. [Google Scholar]
- Muller, W.B.; Roberts, J. Revised approach to assessing traffic speed deflectometer data and field validation of deflection bowl predictions. Int. J. Pavement Eng. 2013, 14, 388–402. [Google Scholar] [CrossRef]
- Zofka, A.; Sudyka, J.; Maliszewski, M.; Harasim, P.; Sybilski, D. Alternative approach for interpreting traffic speed deflectometer results. Transp. Res. Rec. 2014, 2457, 12–18. [Google Scholar] [CrossRef]
- Peng, Y.H.; Ma, R. Determination of cement concrete pavement foundation emptying by acoustic vibration method. J. Nat. Sci. Heilongjiang Univ. 2009, 26, 276–280. [Google Scholar]
- Wang, Q.; Han, X.; Yi, Z.J. Identification of concrete pavement slab cavitation based on transient impact response. J. Southwest Jiaotong Univ. 2010, 45, 718–724. [Google Scholar]
- Liu, W.D.; Wang, D.P.; Peng, P. Experimental study on determination of cement concrete pavement emptying by acoustic vibration method. J. Heilongjiang Inst. Technol. (Nat. Sci.) 2011, 25, 29–33. [Google Scholar]
- Kuz’min, M.P.; Larionov, L.M.; Kondratiev, V.V.; Kuz’mina, M.Y.; Grigoriev, V.G.; Kuz’mina, A.S. Use of the burnt rock of coal deposits slag heaps in the concrete products manufacturing. Constr. Build. Mater. 2018, 179, 117–124. [Google Scholar] [CrossRef]
- Gunka, V.; Demchuk, Y.; Sidun, I.; Miroshnichenko, D.; Nyakuma, B.B.; Pyshyev, S. Application of phenol-cresol-formaldehyde resin as an adhesion promoter for bitumen and asphalt concrete. Road Mater. Pavement Des. 2021, 22, 2906–2918. [Google Scholar] [CrossRef]
- Cho, Y.S.; Hong, S.U. The ANN simulation of stress wave based NDT on concrete structures. In Proceedings of the International Conference on System Science and Simulation Engineering, Venice, Italy, 21–23 November 2008; pp. 140–146. [Google Scholar]
- Yousefi, M.; Hansen, J.H.L. Block-based high performance CNN architectures for frame-level overlapping speech detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 28–40. [Google Scholar] [CrossRef]
- Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. arXiv 2016, arXiv:1609.09430. [Google Scholar]
- Liu, M.; Wang, J.; Li, S.; Xiang, F.; Yao, Y.; Yang, L. MOS predictor for synthetic speech with i-vector inputs. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Singapore, 23–27 May 2022; pp. 906–910. [Google Scholar]
- Jeancolas, L.; Petrovska-Delacrétaz, D.; Mangone, G.; Benkelfat, B.E.; Corvol, J.C.; Vidailhet, M.; Lehéricy, S.; Benali, H. X-vectors: New quantitative biomarkers for early Parkinson’s disease detection from speech. Front. Neuroinform. 2021, 15, 578369. [Google Scholar] [CrossRef]
- Wazir, A.S.B.; Karim, H.A.; Abdullah, M.H.L.; Mansor, S.; AlDahoul, N.; Fauzi, M.F.A.; See, J. Spectrogram-based classification of spoken foul language using deep CNN. In Proceedings of the 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 21–24 September 2020. [Google Scholar]
- Sainath, T.N.; Mohamed, A.R.; Kingsbury, B.; Ramabhadran, B. Deep convolutional neural networks for LVCSR. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 8614–8618. [Google Scholar]
- Piczak, K.J. Environmental acoustic classification with convolutional neural networks. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 1–6. [Google Scholar]
- Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar]
- Lee, H.; Yoo, I.; Park, S. Learning robust feature representations for audio event detection. IEEE Trans. Audio Speech Lang. Process. 2019, 27, 726–735. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Zhang, H.; Li, J.; Cai, G.; Chen, Z.; Zhang, H. A CNN-based method for enhancing boring vibration with time-domain convolution-augmented transformer. Insects 2023, 14, 631. [Google Scholar] [CrossRef]
- Cai, Y.; Hou, A. Analysis on transformer vibration signal recognition based on convolutional neural network. J. Vibroeng. 2021, 23, 484–495. [Google Scholar] [CrossRef]
- Yushao, M.; Wang, X.; Zhou, W.; Xiang, L. Research on transformer condition recognition based on acoustic signal and one-dimensional convolutional neural networks. J. Phys. Conf. Ser. 2021, 2005, 012078. [Google Scholar]
- Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2978–2988. [Google Scholar]
- Geng, Q.S.; Wang, F.H.; Zhou, D.X. Mechanical fault diagnosis of power transformer by GFCC time-frequency map of acoustic signal and convolutional neural network. In Proceedings of the 2019 IEEE Sustainable Power and Energy Conference (iSPEC), Beijing, China, 21–23 November 2019. [Google Scholar]
- Wu, Y.; Zhang, Z.; Xiao, R.; Jiang, P.; Dong, Z.; Deng, J. Operation state identification method for converter transformers based on vibration detection technology and deep belief network optimization algorithm. Actuators 2021, 10, 56. [Google Scholar] [CrossRef]
- Chen, H.; Yu, Y.; Li, P. Transformer-based denoising of mechanical vibration signals. arXiv 2023, arXiv:2308.02166. [Google Scholar] [CrossRef]
- Secic, A.; Krpan, M.; Kuzle, I. Vibro-acoustic methods in the condition assessment of power transformers: A survey. IEEE Access 2019, 7, 83915–83931. [Google Scholar] [CrossRef]
- Liu, W.; Liu, X.; Wang, D.; Lu, W.; Yuan, B.; Qin, C. MITDCNN: A multi-modal input Transformer-based deep convolutional neural network for misfire signal detection in high-noise diesel engines. Expert Syst. Appl. 2024, 238, 121797. [Google Scholar] [CrossRef]
- Ahmed, H.O.A.; Nandi, A.K. Convolutional-Transformer Model with Long-Range Temporal Dependencies for Bearing Fault Diagnosis Using Vibration Signals. Machines 2023, 11, 746. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7132–7141. [Google Scholar]
- Zhang, J.C.; Yan, W.Y.; Zhang, Y. A new speech feature fusion method with cross gate parallel CNN for speaker recognition. arXiv 2022, arXiv:2211.13377. [Google Scholar] [CrossRef]
- Burra, M.; Vanambathina, S.D.; Lakshmi A, V.A.; Ch, L.; Kotiah, N.S. Cross channel interaction based ECA-Net using gated recurrent convolutional network for speech enhancement. Multimed. Tools Appl. 2024, 84, 16455–16479. [Google Scholar] [CrossRef]
- Yu, H.; Zhao, Q. Brain-inspired multisensory integration neural network for cross-modal recognition through spatiotemporal dynamics and deep learning. Cogn. Neurodyn. 2023, 18, 3615–3628. [Google Scholar] [CrossRef] [PubMed]
- Yang, M.; Yeh, C.H.; Zhou, Y.; Cerqueira, J.P. A 1μW voice activity detector using analog feature extraction and digital deep neural network. In Proceedings of the 2018 IEEE International Solid-State Circuits Conference-(ISSCC), San Francisco, CA, USA, 11–15 February 2018. [Google Scholar]
- Jan, M.; Khattak, K.S.; Khan, Z.H.; Gulliver, T.A.; Altamimi, A.B. Crowdsensing for Road Pavement Condition Monitoring: Trends, Limitations, and Opportunities. IEEE Access 2023, 11, 133143–133159. [Google Scholar] [CrossRef]
- Zang, G.; Sun, L.; Chen, Z.; Li, L. A nondestructive evaluation method for semi-rigid base cracking condition of asphalt pavement. Constr. Build. Mater. 2018, 162, 892–897. [Google Scholar] [CrossRef]
- Liu, J.; Liu, G.; Yang, T.; Zhou, J. Research on relationships among different distress types of asphalt pavements with semi-rigid bases in China using association rule mining: A statistical point of view. J. Transp. Eng. Part B Pavements 2019, 5, 57–68. [Google Scholar] [CrossRef]
- Chu, S.; Narayanan, S.; Kuo, C.C.J. Environmental acoustic recognition with time-frequency audio features. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 200–205. [Google Scholar] [CrossRef]
- Constantinescu, C.; Brad, R. An overview on acoustic features in time and frequency domain. Int. J. Adv. Sci. Technol. Electr. Eng. 2023, 24, 45–58. [Google Scholar]
- Ghosh, S.K.; Ponnalagu, R.N.; Tripathy, R.K. Automated heart acoustic activity detection from PCG signal using time-frequency-domain deep neural network. IEEE Access 2022, 10, 30024–30031. [Google Scholar]
- Tang, J.; Sun, X.; Yan, L.; Qu, Y.; Wang, T.; Yue, Y. Acoustic source localization method-based time-domain signal feature using deep learning. Appl. Acoust. 2023, 213, 109626. [Google Scholar] [CrossRef]
- Ye, Z.; Xiong, H.; Wang, L. Collecting comprehensive traffic information using pavement vibration monitoring data. Comput.-Aided Civ. Infrastruct. Eng. 2020, 35, 134–149. [Google Scholar] [CrossRef]
- Zhou, K.; Lei, D.; He, J.; Zhang, P.; Bai, P.; Zhu, F. Real-time localization of micro-damage in concrete beams using DIC technology and wavelet packet analysis. Cem. Concr. Compos. 2021, 123, 104113. [Google Scholar] [CrossRef]
- Walid, M.; Darmawan, A.K. Pengenalan ucapan menggunakan metode linear predictive coding (LPC) dan K-nearest neighbor (K-NN) [Speech recognition using linear predictive coding (LPC) and K-nearest neighbor (K-NN)]. Energy 2017, 7, 13–22. [Google Scholar]
- Mini, P.P.; Thomas, T.A.; Kumari, R.G. EEG-based direct speech BCI system using a fusion of SMRT and MFCC/LPCC features with ANN classifier. Biomed. Signal Process. Control 2021, 68, 102625. [Google Scholar] [CrossRef]
- Sharma, A.; Kaut, S. Two-stage supervised learning-based method to detect screams and cries in urban environments. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 290–299. [Google Scholar] [CrossRef]
- Zhang, X.L.; Wang, D.L. Boosting contextual information for deep neural network based voice activity detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 24, 252–264. [Google Scholar] [CrossRef]
- Wang, Q.; Zeng, Q.; Xie, X.; Zheng, Z. Research on speech recognition method in low SNR environment. Acoust. Technol. 2017, 36, 50–56. [Google Scholar]
- Mitra, V.; Wang, W.; Franco, H.; Lei, Y. Evaluating robust features on deep neural networks for speech recognition in noisy and channel mismatched conditions. In Proceedings of the International Conference on Speech and Language Processing, Grenoble, France, 14–16 October 2014. [Google Scholar]
- Farahani, G. Autocorrelation-based noise subtraction method with smoothing, overestimation, energy, and cepstral mean and variance normalization for noisy speech recognition. EURASIP J. Audio Speech Music. Process. 2017, 2017, 1–16. [Google Scholar] [CrossRef]
- Zhang, S.; Yang, Y.; Chen, C.; Zhang, X.; Leng, Q.; Zhao, X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and prospects. Expert Syst. Appl. 2024, 237, 121692. [Google Scholar] [CrossRef]
- Yang, J.; Yang, F.; Zhou, Y.; Wang, D.; Li, R.; Wang, G. A data-driven structural damage detection framework based on parallel convolutional neural network and bidirectional gated recurrent unit. Inf. Sci. 2021, 566, 103–117. [Google Scholar] [CrossRef]
- Choromanski, K.M.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.Q.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, Vienna, Austria, 3–7 May 2021. [Google Scholar]
- Leng, D.; Zheng, L.; Wen, Y.; Zhang, Y.; Wu, L.; Wang, J.; Wang, M.; Zhang, Z.; He, S.; Bo, X. A benchmark study of deep learning-based multi-omics data fusion methods for cancer. Genome Biol. 2022, 23, 171. [Google Scholar] [CrossRef]
Dataset | Damage | Debonding | Manhole Cover | Vehicle Noise | Other
---|---|---|---|---|---
G-datasets | 560 | 560 | 150 | 295 | 1800 |
NR-datasets | 500 | 560 | 145 | 285 | 1750 |
TR-datasets | 500 | 560 | 145 | 280 | 1750 |
Dropout Rate | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR) |
---|---|---|---|---|---|---|
No Dropout | 0.9056 | 0.9074 | 0.8926 | 0.9111 | 0.9292 | 0.9317 |
0.2 | 0.9169 | 0.9117 | 0.9164 | 0.9265 | 0.9377 | 0.9486 |
0.3 | 0.9118 | 0.9214 | 0.9280 | 0.9306 | 0.9394 | 0.9533 |
0.4 | 0.9224 | 0.9210 | 0.9342 | 0.9311 | 0.9464 | 0.9422 |
0.5 | 0.9186 | 0.9183 | 0.9297 | 0.9126 | 0.9437 | 0.9336 |
0.6 | 0.9063 | 0.9062 | 0.9118 | 0.9207 | 0.9393 | 0.9319 |
0.7 | 0.9144 | 0.9111 | 0.9105 | 0.9132 | 0.9365 | 0.9214 |
0.8 | 0.8903 | 0.9079 | 0.9111 | 0.9117 | 0.9261 | 0.9221 |
Num Heads | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR) |
---|---|---|---|---|---|---|
4 | 0.9145 | 0.9214 | 0.9280 | 0.9316 | 0.9377 | 0.9427 |
8 | 0.9189 | 0.9224 | 0.9333 | 0.9362 | 0.9448 | 0.9494 |
16 | 0.9017 | 0.9039 | 0.9217 | 0.9266 | 0.9351 | 0.9221 |
Table 4. Performance comparison of different network architectures and feature-type configurations.

Network Type | Data Feature Type | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR)
---|---|---|---|---|---|---|---
T-PCG-CNN | MFCC&ΔMFCC&Δ2MFCC | 0.9224 | 0.9210 | 0.9342 | 0.9311 | 0.9464 | 0.9422
T-PCG-CNN* | MFCC&ΔMFCC, Δ2MFCC | 0.8944 | 0.9011 | 0.8865 | 0.8966 | 0.9108 | 0.9155
 | MFCC&Δ2MFCC, ΔMFCC | 0.8828 | 0.8864 | 0.9051 | 0.8904 | 0.9091 | 0.9084
 | ΔMFCC&Δ2MFCC, MFCC | 0.8687 | 0.8758 | 0.8748 | 0.8846 | 0.9027 | 0.9136
T-PG-CNN | MFCC, ΔMFCC, Δ2MFCC | 0.8531 | 0.8622 | 0.8457 | 0.8588 | 0.9075 | 0.9019
T-G-CNN | MFCC | 0.8311 | 0.8307 | 0.8242 | 0.8114 | 0.8542 | 0.8450
 | ΔMFCC | 0.8233 | 0.8217 | 0.8126 | 0.8068 | 0.8566 | 0.8514
 | Δ2MFCC | 0.8289 | 0.8313 | 0.8209 | 0.8161 | 0.8552 | 0.8521
Table 5. Ablation study results and efficiency metrics.

Model Variant | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR) | Model Size (MB) | Params (M) | Relative Inference Speed | GFLOPs | Peak Memory (GB)
---|---|---|---|---|---|---|---|---|---|---|---
T-PCG-CNN | 0.9224 | 0.9210 | 0.9342 | 0.9311 | 0.9464 | 0.9422 | 31.4 | 7.8 | 1.3× | 1.4 | 35.0
Without Transformer (PCG-CNN only) | 0.9041 | 0.8992 | 0.9003 | 0.9150 | 0.9214 | 0.9237 | | | | |
Without Cross-Gating (Parallel CNN + Transformer, no gates) | 0.8531 | 0.8622 | 0.8457 | 0.8588 | 0.9075 | 0.9019 | | | | |
Without Multi-Scale (Single-Branch CNN + Transformer) | 0.8274 | 0.8228 | 0.8311 | 0.8320 | 0.8441 | 0.8502 | | | | |
Standard Transformer (T*-PCG-CNN) | 0.9190 | 0.9218 | 0.9286 | 0.9342 | 0.9420 | 0.9458 | 47.6 | 11.9 | 1.0× (baseline) | 1.8 | 36.2
Net | Accuracy (G) | W-F1 (G) | Accuracy (NR) | W-F1 (NR) | Accuracy (TR) | W-F1 (TR) | Accuracy (TR-5 dB) | W-F1 (TR-5 dB) | Accuracy (TR-10 dB) | W-F1 (TR-10 dB) |
---|---|---|---|---|---|---|---|---|---|---|
i-vector | 0.8568 | 0.8642 | 0.8765 | 0.8600 | 0.8831 | 0.8758 | 0.7032 | 0.7543 | 0.7336 | 0.7669 |
x-vector | 0.8621 | 0.8532 | 0.8649 | 0.8670 | 0.8740 | 0.8814 | 0.7244 | 0.7248 | 0.7933 | 0.7804 |
wav2vec 2.0 | 0.8895 | 0.8721 | 0.8935 | 0.8944 | 0.9001 | 0.9035 | 0.8059 | 0.7995 | 0.8544 | 0.8632 |
HuBERT | 0.8770 | 0.8905 | 0.8845 | 0.8900 | 0.8956 | 0.9044 | 0.8335 | 0.8267 | 0.8638 | 0.8785 |
Spectrogram + CNN | 0.7833 | 0.7155 | 0.8014 | 0.7822 | 0.8267 | 0.7722 | 0.6852 | 0.6972 | 0.7259 | 0.7119 |
MFCC + CNN | 0.8411 | 0.8207 | 0.8342 | 0.8214 | 0.8642 | 0.8550 | 0.7900 | 0.7842 | 0.8153 | 0.8247
MFCC + CENS + CNN | 0.8700 | 0.8614 | 0.8755 | 0.8542 | 0.8932 | 0.9041 | 0.8257 | 0.8424 | 0.8433 | 0.8451 |
CG-PCNN | 0.9114 | 0.9035 | 0.9227 | 0.9116 | 0.9435 | 0.9208 | 0.8824 | 0.8712 | 0.9014 | 0.8995 |
T-PCG-CNN | 0.9189 | 0.9224 | 0.9333 | 0.9362 | 0.9448 | 0.9494 | 0.9114 | 0.9187 | 0.9400 | 0.9415 |
Training and Validation Dataset | Test Accuracy (G) | Test W-F1 (G) | Test Accuracy (NR) | Test W-F1 (NR) | Test Accuracy (TR) | Test W-F1 (TR)
---|---|---|---|---|---|---
G-datasets | — | — | 0.7725 | 0.7543 | 0.7842 | 0.7731
NR-datasets | 0.6858 | 0.7024 | — | — | 0.7661 | 0.7492
TR-datasets | 0.6524 | 0.6711 | 0.6320 | 0.6442 | — | —
G&NR-datasets | — | — | — | — | 0.8243 | 0.8124
G&TR-datasets | — | — | 0.8436 | 0.8311 | — | —
NR&TR-datasets | 0.8125 | 0.8046 | — | — | — | —
G&NR&TR-datasets | 0.8889 | 0.8935 | 0.9117 | 0.9029 | 0.9208 | 0.9315