Article

Enhancement of Boring Vibrations Based on Cascaded Dual-Domain Features Extraction for Insect Pest Agrilus planipennis Monitoring

1 School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
2 Engineering Research Center for Forestry-Oriented Intelligent Information Processing, National Forestry and Grassland Administration, Beijing 100083, China
3 Beijing Key Laboratory for Forest Pest Control, Beijing Forestry University, Beijing 100083, China
* Author to whom correspondence should be addressed.
Forests 2023, 14(5), 902; https://doi.org/10.3390/f14050902
Submission received: 12 March 2023 / Revised: 24 March 2023 / Accepted: 25 April 2023 / Published: 27 April 2023
(This article belongs to the Section Forest Health)

Abstract

Wood-boring beetles are among the most destructive forest pests. The larvae of some species live in the trunks, covered by bark, rendering them difficult to detect. Early detection of these larvae is critical to their effective management. A promising surveillance method is to inspect the vibrations that larval activity induces in the trunk to determine whether it is infested. As convenient as it seems, this approach has a significant drawback: identification is easily disrupted by environmental noise, resulting in low accuracy. Previous studies have proven the feasibility and necessity of adding an enhancement procedure before identification. To this end, we propose a small yet powerful boring vibration enhancement network based on deep learning. Our approach combines frequency-domain and time-domain enhancement in a stacked network. The dataset employed in our study comprises the boring vibrations of Agrilus planipennis larvae and various environmental noises. After enhancement, the SNR (signal-to-noise ratio) increment of a boring vibration segment reaches 18.73 dB, and our model takes only 0.46 s to enhance a 5 s segment on a laptop CPU. The accuracy of several well-known classification models increased substantially when using clips enhanced by our model. All experimental results demonstrate our contribution to the early detection of larvae.

1. Introduction

Forests play a crucial part in sustaining life on our planet [1]. They also underpin income and human well-being. Forestry aims to provide optimal economic gain while ensuring the continuous provision of ecological, social, and cultural services [2]. However, biotic, abiotic, and anthropogenic factors threaten this sustainability [3]. Among all biotic factors, tree pests and pathogens are a major and increasing threat to the integrity of forest ecosystems [4]. The constant increase in the volume of goods shipped internationally has caused an impressive number of non-native species introductions worldwide [5]. Wood-boring beetles, such as bark and ambrosia beetles (Scolytinae), longhorn beetles (Cerambycidae), and jewel beetles (Buprestidae), are one of the most successful guilds of alien species invasive to forest habitats [6]. Hidden within wood-packaging materials, round wood logs, and live plants, these beetle larvae are often moved both throughout their native biogeographic region and among continents. Thus, practical strategies for detecting potentially invasive beetles as early as possible are needed [7].
Insect pests may be monitored using a wide range of techniques, including visual inspection, suction traps, and passive methods. The traps may be colored or baited with species-specific pheromones or pheromone blends and host volatiles to attract multiple species [8]. Trap observation requires skilled operators to visit every single location on a regular basis; in all the methods mentioned above, the operator must visit the observation points in person. Such labor-dependent monitoring methods are consistently associated with high labor costs and, in some cases, may result in low efficiency, limited promptness of reaction, and inadequate sample size [9]. Since wood-boring insects spend most of their lives protected beneath the bark, visual inspection is not applicable. Pheromone traps are intended for insects in the adult stage, yet most of the damage is done during the larval stage. An effective method of monitoring wood-boring pest larvae is therefore needed for early detection, eradication, or containment.
Acoustic technology has been applied for many years in studies of insect communication and the monitoring of calling-insect population levels, geographic distributions, and species diversity, as well as in the detection of cryptic insects in soil, wood, container crops, and stored products [10]. Those applications indicate further usage of acoustic technology in improving pest management methods. In the era of digitalization, the massive advancement in computer technology and electronic instrumentation has widened the possibilities for insect pest monitoring, thus enabling further study of exploiting substrate-borne vibrations in pest management, which belongs to the field of applied biotremology. Applied biotremology is a relatively novel field of study that is gaining increased interest in the scientific community and has become the center of interest for multinational companies in the field of pest control [11]. The possibility of automatically detecting vibrations induced by insect larvae chewing on wood or stored grains has been foreseen for a long time. With the constant and rapid development of signal analysis, the gap between theory and application is quickly narrowing. It is time for the technology and techniques of signal analysis to be applied to biotremology and developed into a powerful pest control method for foresters.
One of the most active areas of insect acoustic detection research is the development of improved methods to identify the vibrations of targeted insect species. Detection efforts have mainly targeted stored-product insects and wood-boring insects. Mankin et al. [12] used the TreeVibes [13] sensor to describe Sitophilus oryzae (Coleoptera: Curculionidae) adult and larval movement and feeding in stored grain. Banlawe et al. [14] tested a MEMS (Micro-Electro-Mechanical System) microphone, an electret microphone, and a piezoelectric transducer for recording the acoustic emission of the Mango Pulp Weevil (MPW) in a soundproof box. Bittner et al. [15] monitored the vibration signals of an individual Callosobruchus maculatus (F.) (Coleoptera: Bruchidae) while feeding on cowpea seeds; the sensor they used was a PCB 352C15 high-frequency ceramic shear accelerometer. To prove that the generation of acoustic emission corresponds to the chewing movements of Dinoderus minutus, Watanabe et al. [16] placed a larva and an adult of D. minutus in a madake specimen and attached a sensor (R15α, Physical Acoustics Co., Princeton Junction, NJ, USA) to the madake. Flynn et al. [17] developed a process chain to autonomously identify the presence of Trogoderma inclusum and Tenebrio molitor in rice grains by their acoustic signals; the detection was performed in a soundproof box to reduce external noise. In order to acoustically monitor the red palm weevil (Rhynchophorus ferrugineus) in palm trees, Hetzroni et al. [18] used a piezoelectric sensor to capture the larvae’s distinct sounds that propagate through the fibrous palm tissue; the recorded vibrations were then diagnosed by a human listener and software. Mankin et al. [19] applied an AED-2010 amplifier system to record the vibrations of Mallodon dasystomus larvae in avocado trees, and recognition was carried out by analyzing impulses in the recordings. Sutin et al. [20] detected and classified infestations based on features extracted from the vibrations of Asian longhorned beetle and emerald ash borer larvae. Jalinas et al. [21] developed acoustic methods for detecting Rhynchophorus ferrugineus and Rhynchophorus cruentatus in their early instars. Sun et al. [22] proposed a lightweight convolutional neural network (CNN) to automatically identify the boring vibrations of Semanotus bifasciatus and Eucryptorrhynchus brandti larvae. Karar et al. [23] proposed an IoT-based framework for the early detection of red palm weevils using a fine-tuned classifier, InceptionResNet-V2 [24], trained on vibration data recorded by a TreeVibes [13] recording device. Zhang et al. [25] designed a neural network named TrunkNet to identify the presence of Agrilus planipennis larvae in trunks from their vibration signals.
In practical applications, a particular challenge for the detection model is the strong interference from complex environmental noise recorded simultaneously with the vibration signal. Noise, defined as an unwanted sound or signal, has been verified to have a negative impact on recognition accuracy [26]. Noise sources may be biotic (conspecific or heterospecific cues or signals), abiotic (wind or rain), or anthropogenic (traffic and heavy machinery) [27]. In the vibrational channel, the frequency range of boring vibrations overlaps with that of environmental noise, causing severe interference. Studies of the natural vibrational environment show that, regardless of the environment studied, geophysical vibrations induced by light wind are almost always present in the natural vibroscape. Stronger wind gusts generate high-amplitude vibrations in the frequency range up to 5 kHz, characterized by rapid, unpredictable short-term variations in amplitude [28]. Through the analysis of the vibrations of Rhynchophorus larvae and background noise in both the time and frequency domains, Mankin et al. [29] concluded that part of the background noise has the same frequency as the larval vibrations and could interfere with the discrimination of infestation. Liu et al. [30] designed a model to recognize the boring vibrations of Semanotus bifasciatus; their results clearly showed that noise had a significant impact on classification accuracy. When the SNR was −7 dB, the recognition accuracy decreased by 10.8% for their model and by 15.6% for the baseline model. Zhou et al. [31] introduced improved anti-noise power normalized cepstral coefficients (PNCCs) based on the wavelet packet for trunk borer vibrations. With −5 dB SNR noise added to their data, the accuracy of their model decreased from 100% to 83%, with a further decline to 70% at −10 dB SNR.
Although plenty of recent studies aim to detect wood-boring insect larvae by substrate-borne signals, few involve the analysis and processing of environmental noise. Most recording procedures were conducted in a soundproof box [14,15,17,22] or a sound insulation chamber [12,32,33] to avoid noise interference. Some researchers [18,19,21] monitored the signals with headphones to obtain a subjective assessment of insect presence and to avoid interference from background noise. Vinatier et al. [34] presented a non-invasive technique based on a bioacoustic sensor with a band-pass filter to detect larval activity inside banana corms. Others [18,21,35] screened the oscillograms and spectrograms of recorded vibrations with Raven Pro software [36] to identify relatively noise-free intervals. These methods do not scale to large datasets because of their heavy workload and low efficiency, and they cannot handle the complex, intense noise of real-world conditions. As a solution to this problem, the prevailing deep learning paradigm provides a viable alternative. Deep learning has revolutionized the domains of computer vision and speech, language, and audio processing. Its primary strength is its ability to leverage massive amounts of data to find relationships and patterns and to learn varied representations of the data; given enough training data, it may be the most powerful of all existing methods [37,38]. Liu et al. [39] proposed a deep learning-based time-domain model and a frequency-domain model to enhance mixtures of Semanotus bifasciatus larval feeding vibrations and various environmental noises; the average SNR increment reached 17.53 dB and 11.10 dB after enhancement by the two models, respectively. Shi et al. [40] adopted deep learning-based speech enhancement and further improved it to develop a time-domain boring vibration enhancement model, VibDenoiser. The model achieves an improvement of 18.57 dB in SNR, and it runs in real time on a laptop CPU. The accuracy of four classification models increased by a large margin using vibration clips enhanced by the model, proving the necessity and efficacy of the enhancement model.
The two latest studies took advantage of deep learning-based speech enhancement. Speech enhancement aims to improve the quality and intelligibility of degraded speech; its algorithms reduce or suppress background noise to some degree and are sometimes referred to as noise suppression algorithms [41]. Since the probe detects boring vibrations within a spherical region of a trunk [13], the data are single-channel, so the problem is best addressed by monaural speech enhancement. Classical speech enhancement methods include spectral-subtractive algorithms, Wiener filtering, statistical-model-based methods, subspace methods, and noise-estimation algorithms [41]. These classical algorithms can reduce background noise to some degree; however, they are cumbersome, complicated [42], and perform suboptimally on the non-stationary noises of real-world scenarios. In recent years, deep learning-based approaches have shown considerable success [43].
Deep learning-based speech enhancement methods can be further categorized into frequency-domain and time-domain methods. The frequency domain methods aim to extract the acoustic features of clean speech from the features of noisy speech. In these approaches, speech signals are analyzed and reconstructed using the short-time Fourier transform (STFT) and inverse STFT, respectively. Common training targets include the ideal ratio mask [44], the ideal binary mask [45], the spectral magnitude mask [46], etc. The time domain methods directly estimate the clean speech waveforms through end-to-end training, circumventing the trouble of estimating phase information in the frequency domain. There are two popular architecture backbones for time-domain methods: WaveNet [47] and U-Net [48].
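As a concrete illustration of one common frequency-domain training target, the following is a minimal NumPy sketch of the ideal ratio mask [44] under its usual definition; the inputs are assumed to be clean-signal and noise magnitude spectrograms, and the function name is illustrative rather than taken from any particular library.

```python
import numpy as np

def ideal_ratio_mask(clean_mag: np.ndarray, noise_mag: np.ndarray) -> np.ndarray:
    """Per time-frequency bin: IRM = sqrt(|S|^2 / (|S|^2 + |N|^2))."""
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + 1e-12))
```

A network trained on this target predicts a value in [0, 1] for each bin, which is multiplied with the noisy magnitude before reconstruction with the inverse STFT.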
The emerald ash borer (EAB), Agrilus planipennis Fairmaire, a phloem-feeding beetle (Coleoptera: Buprestidae) native to Asia, has been determined to be the cause of widespread decline and mortality of ash (Fraxinus spp.) [49]. In this study, we recorded the boring vibrations of EAB and environmental noise to construct our dataset. The importance and necessity of an enhancement procedure before the recognition of boring vibrations encouraged us to propose a deep learning-based vibration enhancement model called the Dual-Vibration Enhancement Network (DVEN). DVEN enhances the vibration signal in both the frequency domain and the time domain. It employs an efficient structure, which ensures excellent enhancement performance and fast inference. Benefiting from the enhancement of our model, the recognition accuracy of four prominent classification models on boring vibration signals increased. We hope that our work will greatly contribute to exploiting substrate-borne vibrations in pest management.

2. Materials and Methods

2.1. Dataset Construction

2.1.1. Data Collection and Screening

The dataset employed in this research is the same as in [40]. All boring vibrations and environmental noise used to construct the dataset were collected during the research. We cut down several ash trees from an EAB-infested forest farm in Tongzhou District, Beijing. The trunks were retained and cut into sections of equal length. To record boring vibrations in the trunks, the probe of a piezoelectric vibration sensor, jointly developed by Beijing Forestry University and Beihang University, was drilled into the trunk sections. We chose a sampling rate of 44,100 Hz and a bit depth of 16 bits for the recording. A computer received the signals from the sensor and saved them in WAV format using Audacity. The recording was conducted in an idle lab without other activities to prevent interference from unwanted noise. We manually monitored the larval activity with headphones before recording. The recording duration was one and a half hours daily for each trunk as long as the larvae in it were still active; that is, multiple audio clips were recorded from one trunk section at different times. We recorded only one clip for trunks from dying and dead trees, since they contained almost no larval activity, only the noise floor. The recording lasted for five days, starting on 23 July 2021. Once recording finished, we peeled the bark off all the trunk sections to reveal and count the EAB larvae inside under the supervision of forestry professionals.
We trimmed each clip’s beginning and end to eliminate the noise of recording operations. After that, we carefully inspected the spectrogram of each clip for any abnormal noise floor and unexpected noises, such as footsteps, raindrops, and sudden instrument bursts, and removed them from the clip. We discarded any clip that did not contain enough larval activity.
Training the enhancement model requires environmental noise to generate noisy boring vibrations. Therefore, we chose five locations whose noise matches that of the growing environment of ash trees: four at Beijing Forestry University and one in Olympic Forest Park. For consistency, we inserted the same probe into an ash trunk free from EAB infestation and used the same computer to save the signals. The noise we acquired consists mainly of the sound of the wind, the rustling of leaves, birds’ twittering, babble, and tire noise. Figure 1 shows the frequency spectra of some noise samples. We discarded all segments that did not contain noise to ensure high training efficiency.

2.1.2. Dataset Construction

To avoid the impact of the noise floor, we removed it from all recordings using our Python implementation of spectral subtraction [50]. For easier processing by the model, we split all recordings into segments of 5 s duration. Limited by the computing capability of our hardware, we selected the half of the audio clips containing the most energy [51] as our dataset. In our dataset, 94 percent of the boring vibration and noise segments were assigned to the training set, and the remaining 6 percent were designated as the test set. We randomly mixed the boring vibration and noise segments at an SNR of −10 dB for both the training and test sets, as sketched below. Consequently, the training set contained 9940 clips with a duration of about 13.8 h, and the test set contained 632 clips with a duration of about 52.67 min. Figure 2 shows a clean boring vibration segment, a noise segment, and the noisy segment obtained by mixing them.
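The following is a minimal sketch of the SNR-controlled mixing step, assuming 5 s mono clips at 44,100 Hz already loaded as float arrays; the function and placeholder names are illustrative, not the exact pipeline code.

```python
import numpy as np

def mix_at_snr(vibration: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the mixture has the requested SNR, then add it."""
    noise = noise[: len(vibration)]
    sig_power = np.mean(vibration ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to the level implied by the target SNR.
    gain = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return vibration + gain * noise

# Example: build one -10 dB training pair from two 5 s segments.
sr = 44_100
clean = np.random.randn(5 * sr).astype(np.float32)  # stand-in for a boring vibration clip
noise = np.random.randn(5 * sr).astype(np.float32)  # stand-in for an environmental noise clip
noisy = mix_at_snr(clean, noise, snr_db=-10.0)
```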

2.2. Model Architecture

Traditional speech enhancement methods work well with stationary noise but do not generalize to the non-stationary or structured noise types [52] common in real-world scenarios. With the rapid development of deep learning and increased computational power, deep learning-based approaches have been widely investigated and have shown considerable success. Two-stage approaches in deep learning-based speech processing have previously been explored for denoising plus dereverberation, separation plus dereverberation, and denoising plus separation [53]; the modularization allows the networks to focus on specific tasks. Inspired by DTLN [54], our DVEN is designed as a stacked dual-signal-transformation LSTM [55] network that jointly applies the masking-based and mapping-based methods. It cascades two enhancement cores, with the output of the first fed into the second. The first core features an STFT signal transformation and performs enhancement in the frequency domain. The second performs enhancement in the time domain with a convolutional recurrent network (CRN) integrating a convolutional encoder-decoder (CED) structure and a recurrent structure. This specific order is intended to yield a robust magnitude estimation with the first core and a further enhancement exploiting phase information with the second. The structure proves beneficial due to the complementarity of frequency-domain and time-domain features while maintaining a relatively small computational footprint. DVEN is considerably smaller than our previous model, with even faster inference speed and better enhancement ability.

2.2.1. Structure of DVEN

The first enhancement core uses an STFT analysis and synthesis base. The time-domain signal (raw waveform) of the boring vibration fed into the model is first transformed by the STFT to obtain its magnitude and phase. The first enhancement core takes the magnitude of the boring vibration segment as input. It consists of two LSTM layers, a fully connected layer, and a sigmoid activation; all LSTM layers in our model have 128 units. The core predicts a mask, and the magnitude is multiplied by this mask to obtain the enhanced magnitude spectrogram. The enhanced signal is then transformed back to the time domain using the phase of the input. This core is constructed to estimate the magnitude information of the vibration signal, but it neglects phase information, which is also essential for signal quality.
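The following TensorFlow sketch illustrates this frequency-domain core as described above (frame length 512, frame shift 128, two 128-unit LSTMs, a dense sigmoid mask, and reconstruction with the noisy phase); it is a minimal rendering for illustration, with assumed layer wiring, not the released implementation.

```python
import tensorflow as tf

FRAME_LEN, FRAME_SHIFT, FFT_LEN = 512, 128, 512
N_BINS = FFT_LEN // 2 + 1  # 257 frequency bins

def frequency_core(noisy_wave: tf.Tensor) -> tf.Tensor:
    """noisy_wave: [batch, samples] -> enhanced waveform."""
    stft = tf.signal.stft(noisy_wave, FRAME_LEN, FRAME_SHIFT, FFT_LEN)
    mag, phase = tf.abs(stft), tf.math.angle(stft)

    x = tf.keras.layers.LSTM(128, return_sequences=True)(mag)
    x = tf.keras.layers.LSTM(128, return_sequences=True)(x)
    mask = tf.keras.layers.Dense(N_BINS, activation="sigmoid")(x)

    # Mask the magnitude and reuse the noisy phase for synthesis.
    enhanced = tf.cast(mag * mask, tf.complex64) * tf.exp(
        tf.complex(0.0, 1.0) * tf.cast(phase, tf.complex64))
    return tf.signal.inverse_stft(
        enhanced, FRAME_LEN, FRAME_SHIFT, FFT_LEN,
        window_fn=tf.signal.inverse_stft_window_fn(FRAME_SHIFT))
```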
To compensate for the degradation of signal quality caused by possible inaccurate phase estimation in the spectral domain, the second enhancement core works directly on time-domain signals for further enhancement. It takes the enhanced signal from the first core as input, adopts the CRN structure, and outputs the enhanced time-domain boring vibration signal. The CRN was first introduced to speech enhancement by [56,57]. It nests a recurrent neural network (RNN) module inside a CNN-based encoder-decoder structure [58]. The RNN module can handle long-term contexts in a sequence-based manner [59] but often requires high-level features; the CNN module can extract high-level features but mainly focuses on local temporal-spectral patterns [60]. Combining their advantages, the CRN structure has been shown to be very effective for speech enhancement [58,61,62,63]. Motivated by [58,61], we use three convolution layers for the encoder, three transposed convolution layers for the decoder, and two LSTM layers between them. A skip connection connects the output of each encoder layer to the input of the corresponding decoder layer. All convolution and transposed convolution layers have a kernel size of 8 and a stride of 4, each followed by a ReLU activation [64]. The output channels of the encoder layers are 48, 96, and 192, respectively; those of the first two decoder layers are 96 and 48. Note that the last decoder layer has a single-channel output with no ReLU. To keep the output the same shape as the input, the first and second decoder layers use an output padding of 2 and 3, respectively.
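Below is a minimal TensorFlow sketch of this time-domain CRN core with the stated kernel size, stride, and channel counts. The paper aligns shapes with output paddings in the decoder; this sketch instead assumes the input length is a multiple of the total stride (4³ = 64) and joins skip connections by concatenation, so these glue details are assumptions rather than the exact implementation.

```python
import tensorflow as tf

def crn_core(wave: tf.Tensor) -> tf.Tensor:
    """wave: [batch, samples] -> enhanced waveform; samples must be a multiple of 64."""
    x = tf.expand_dims(wave, -1)  # [batch, samples, 1]

    # Encoder: three strided convolutions with 48/96/192 output channels.
    skips = []
    for ch in (48, 96, 192):
        x = tf.keras.layers.Conv1D(ch, 8, strides=4, padding="same", activation="relu")(x)
        skips.append(x)

    # Bottleneck: two 128-unit LSTMs model long-term temporal dependencies.
    x = tf.keras.layers.LSTM(128, return_sequences=True)(x)
    x = tf.keras.layers.LSTM(128, return_sequences=True)(x)

    # Decoder: mirrored transposed convolutions; the last layer has a single
    # channel and no ReLU, as described in the text.
    for ch, skip, act in ((96, skips[2], "relu"), (48, skips[1], "relu"), (1, skips[0], None)):
        x = tf.keras.layers.Conv1DTranspose(ch, 8, strides=4, padding="same", activation=act)(
            tf.concat([x, skip], axis=-1))

    return tf.squeeze(x, -1)

def dven(noisy_wave: tf.Tensor) -> tf.Tensor:
    # Cascade the two cores: robust magnitude estimation first, then
    # time-domain refinement; reuses `frequency_core` from the sketch above.
    return crn_core(frequency_core(noisy_wave))
```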
As the training objective, the negative signal-to-noise ratio loss [65] is adopted. It is defined as:
$$ f(y, \hat{y}) = -10 \log_{10} \frac{\sum_{t} y_t^{2}}{\sum_{t} \left( y_t - \hat{y}_t \right)^{2}} $$
where $t$ indexes the sampling points, and $y$ and $\hat{y}$ denote the clean and enhanced signals, respectively. This loss has the advantage that the scale of the enhanced signal is preserved and consistent with the noisy input, which is desirable in real-time processing systems; furthermore, it is one of our evaluation metrics. Because it operates in the time domain, phase information is implicitly taken into consideration. By contrast, mean squared error and magnitude-STFT training objectives cannot provide any phase information in the optimization process.
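A minimal TensorFlow rendering of this objective, matching the equation above, might look as follows; the small epsilon is an assumption added for numerical stability.

```python
import tensorflow as tf

def negative_snr_loss(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Negative SNR in dB; minimizing it maximizes the SNR of the estimate."""
    signal_power = tf.reduce_sum(tf.square(y_true), axis=-1)
    error_power = tf.reduce_sum(tf.square(y_true - y_pred), axis=-1) + 1e-8
    snr_db = 10.0 * tf.math.log(signal_power / error_power) / tf.math.log(10.0)
    return -tf.reduce_mean(snr_db)
```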

2.2.2. Structure of DVEN Variants

Additionally, we designed two variants of DVEN, namely DVEN-r and DVEN-c, to pursue better performance.
DVEN-r is generally the same as DVEN, but without skip connections and with the kernel size reduced to 4 and the stride to 2. Furthermore, its first decoder layer no longer uses an output padding, and its second decoder layer has an output padding of 1.
We further add a convolution layer and activation to each layer of the encoder and decoder to create DVEN-c. The added convolution is a 1 × 1 convolution, meaning both its kernel size and stride are 1. In the encoder, the new convolution layer is added after each original convolution layer, followed by a GLU activation [66]. The added convolution layers double the number of output channels from the previous layer, and the GLU activation halves them again. The decoder’s structure mirrors that of the encoder, so the new convolution layer and GLU activation are added ahead of the original convolution layer in each decoder layer. The three 1 × 1 convolution layers produce 384, 192, and 96 output channels, respectively. The output paddings are applied just as in DVEN. We visualize the model architecture in Figure 3; a sketch of the added block follows.
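A minimal sketch of one such DVEN-c encoder layer is given below: the original strided convolution, then a 1 × 1 convolution that doubles the channels, then a GLU that halves them again. The layer wiring and names are illustrative.

```python
import tensorflow as tf

def conv_glu_block(x: tf.Tensor, out_channels: int) -> tf.Tensor:
    """Strided convolution followed by a 1x1 convolution and a GLU activation."""
    x = tf.keras.layers.Conv1D(out_channels, 8, strides=4, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv1D(2 * out_channels, 1, strides=1)(x)  # double the channels
    a, b = tf.split(x, 2, axis=-1)
    return a * tf.sigmoid(b)  # GLU: gate one half with a sigmoid of the other
```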

3. Results

3.1. Implementation Details

All our models were implemented in TensorFlow [67], an open-source machine learning and deep learning software library. The STFT uses a frame length of 512, a frame shift of 128, and an FFT length of 512. The Adam optimizer [68] is used with a learning rate of 3 × 10⁻³ and a gradient norm clipping of 3. The learning rate is halved if the loss does not improve over three consecutive epochs, and early stopping is applied if the loss does not decrease after ten epochs. DVEN and DVEN-r are trained with a batch size of 32; it is lowered to 20 for DVEN-c due to hardware limitations. Training ran until convergence for all models and lasted 70, 73, and 54 epochs for DVEN, DVEN-r, and DVEN-c, respectively. We tried several loss functions on our models, including mean squared error (MSE), mean absolute error (MAE), log-cosh loss, and Huber loss, as well as different optimizers, such as AdamW, RAdam, SGD, and LookAhead. Table A1 in Appendix A provides the outcomes of these experiments; none of these combinations outperformed Adam with the negative SNR loss. We tuned the batch size for DVEN exclusively, ranging from 26 to 36. DVEN achieved its best performance at a batch size of 32; increasing or decreasing it led to a decline in enhancement performance. Table A2 in Appendix A reports the results for different batch sizes. The hardware platform of our experiments included a workstation with an Intel Xeon Gold 5120 Processor (Intel, Santa Clara, CA, USA) and an NVIDIA T4 Tensor Core GPU (NVIDIA, Santa Clara, CA, USA), as well as another workstation with an Intel Core i7-10870H Processor (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3070 laptop GPU (NVIDIA, Santa Clara, CA, USA).
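The optimizer and schedule described above map directly onto standard Keras components; the following is a minimal sketch under the assumption that `model`, `train_ds`, and the `negative_snr_loss` from Section 2.2.1 are available.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-3, clipnorm=3.0)
callbacks = [
    # Halve the learning rate after three epochs without improvement.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5, patience=3),
    # Stop training after ten epochs without improvement.
    tf.keras.callbacks.EarlyStopping(monitor="loss", patience=10),
]
# model.compile(optimizer=optimizer, loss=negative_snr_loss)
# model.fit(train_ds, epochs=100, batch_size=32, callbacks=callbacks)
```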

3.2. Evaluation Metrics

We evaluated our model using three common objective metrics: the signal-to-noise ratio (SNR), the segmental signal-to-noise ratio (SNRseg), and the log-likelihood ratio (LLR) [69]. Since boring vibrations are substrate-borne vibrations rather than speech, we did not use the standard Perceptual Evaluation of Speech Quality (PESQ) or Short-Time Objective Intelligibility (STOI), which assess the perceptual quality of acoustic signals for human listeners. The following formulas define the metrics:
$$ \mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=0}^{N-1} s^{2}(n)}{\sum_{n=0}^{N-1} \left[ s(n) - \hat{s}(n) \right]^{2}} $$
$$ \mathrm{SNR_{seg}} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{n=Lm}^{Lm+L-1} s^{2}(n)}{\sum_{n=Lm}^{Lm+L-1} \left[ s(n) - \hat{s}(n) \right]^{2}} $$
where $s$ denotes the clean signal and $\hat{s}$ the enhanced signal, $M$ is the number of frames in a segment of a signal, $N$ is the number of samples, and $L$ is the frame length.
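These two metrics translate directly into a few lines of NumPy; the sketch below assumes equal-length float arrays, and the frame length used for SNRseg is an assumed value, as the paper does not state it here.

```python
import numpy as np

def snr_db(clean: np.ndarray, enhanced: np.ndarray) -> float:
    noise = clean - enhanced
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def snr_seg_db(clean: np.ndarray, enhanced: np.ndarray, frame_len: int = 512) -> float:
    n_frames = len(clean) // frame_len
    c = clean[: n_frames * frame_len].reshape(n_frames, frame_len)
    e = enhanced[: n_frames * frame_len].reshape(n_frames, frame_len)
    per_frame = 10.0 * np.log10(
        np.sum(c ** 2, axis=1) / (np.sum((c - e) ** 2, axis=1) + 1e-12))
    return float(np.mean(per_frame))  # average of per-frame SNRs in dB
```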
The LLR represents the ratio of the energies of the prediction residuals of the enhanced and clean signals. It is defined as:
$$ \mathrm{LLR}\left( a_x, \bar{a}_{\hat{x}} \right) = \log \frac{\bar{a}_{\hat{x}}^{T} R_x \bar{a}_{\hat{x}}}{a_x^{T} R_x a_x} $$
where $a_x$ are the LPC coefficients of the clean signal, $\bar{a}_{\hat{x}}$ are those of the enhanced signal, and $R_x$ is the autocorrelation matrix of the clean signal.

3.3. Enhancement Results

We trained the three models described above, as well as DTLN, on the dataset presented in Section 2.1. Their enhancement performance, together with that of our previous model (VibDenoiser), is shown in Table 1. Replacing the second enhancement core of DTLN with a CRN architecture improves the modeling ability, and the frequency-domain enhancement by the first core greatly helps the subsequent time-domain enhancement. The results prove that cascading two enhancement cores from different domains is complementary and beneficial. The simplest model, DVEN-r, performs on par with VibDenoiser, while DVEN with skip connections and DVEN-c with extra convolution layers clearly outperform it.
Apart from enhancement performance, inference speed and model footprint are also important metrics for the future application of the model. Table 2 reports this information for our models and previous models. The inference speed is the average time a model takes to enhance a 5 s boring vibration segment. Since the measured time varied between tests, we ran each model on the test set ten times and report the average inference time, as sketched below. DTLN is the smallest and fastest but has the worst enhancement performance. Our new DVEN surpasses VibDenoiser in enhancement performance with a faster inference speed, about ten times fewer parameters, and a roughly ten times smaller model size. Due to the small kernel size and stride in its convolution layers, DVEN-r has a relatively high computational complexity and an unacceptably slow inference speed. Compared to DVEN, DVEN-c gains a marginal improvement in enhancement ability but at a much slower inference speed. All things considered, DVEN is the best choice for a boring vibration enhancement model.
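A minimal sketch of this timing procedure follows, with `model` and `test_segments` as placeholders for the trained enhancer and the 632 test clips.

```python
import time
import numpy as np

def mean_inference_time(model, test_segments, n_runs: int = 10) -> float:
    """Average per-segment enhancement time over n_runs passes of the test set."""
    times = []
    for _ in range(n_runs):
        for segment in test_segments:
            start = time.perf_counter()
            _ = model(segment[np.newaxis, :])  # enhance one 5 s segment
            times.append(time.perf_counter() - start)
    return float(np.mean(times))
```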
To examine the enhancement effect of our model in practical application, several enhancement results are shown in Figure 4. The noise contained in the segments of each row is, from top to bottom, babble, birds’ twittering, wind noise, and rustling leaves.
We also checked the detection accuracy of four well-known classification models on noisy boring vibrations and on boring vibrations enhanced by DVEN. The classification models are VGG16 [70], ResNet18 [71], SqueezeNet [72], and MobileNetV2 [73]. We constructed another dataset for the classification models using the noisy and clean segments introduced in Section 2.1. The dataset contained two categories, infested and uninfested. The infested class comprised 891 noisy boring vibration segments and 891 clean boring vibration segments; the uninfested class comprised 840 noise segments and 940 segments from dead trunks. These numbers were chosen to balance the two classes. The classification models were trained on this dataset and tested on 632 noisy boring vibration segments. Affected by intense noise at −10 dB SNR, all four models could barely perform correct recognition. Classification accuracy increased substantially on the same segments after enhancement by DVEN. VibDenoiser also boosts classification accuracy, but by a smaller margin than DVEN. Table 3 provides the results.

4. Discussion

The detection of trunk-boring beetle larvae has always been problematic in pest control and forest management. A pragmatic solution would be to embed a piezoelectric vibration sensor in tree trunks and use a trained model to inspect for boring vibrations. However, environmental noise may interfere with the feeble vibrations of larvae, posing significant challenges for the detection procedure. The urgent need for an enhancement method before detection is obvious. This study of substrate-borne vibrations of larvae belongs to the emerging field of applied biotremology, which requires new technology and techniques to promote its development. The recent surge in popularity of deep learning has resulted in a multitude of new data-driven techniques to tackle challenges in the domains of speech and audio. It has been proven in previous studies that deep learning-based technology is applicable to substrate-borne boring vibrations [39,40]. Up until now, there has been very little research in the field of boring vibration enhancement.
To conduct our research, we drilled a piezoelectric vibration sensor into trunks and recorded dozens of hours of vibrations caused by EAB larvae, as well as environmental noise. We then screened the recordings and divided them into segments before mixing them to create our dataset. To build a better boring vibration enhancement model, we leveraged deep learning-based monaural speech enhancement and further improved it to construct DVEN. Unlike usual models, DVEN utilizes both frequency-domain and time-domain features through its stacked dual-signal enhancement cores. The first core extracts magnitude features via the STFT and two LSTM layers. The second core employs a CRN structure, specifically three convolution layers as the encoder, three transposed convolution layers as the decoder, and two LSTM layers as the bottleneck between them; the encoder extracts high-level features, and the recurrent structure models long-term temporal dependencies. With this configuration, the first core estimates the magnitude robustly, while the second core further enhances the signal with phase information. We trained and tested DVEN on our dataset, and the results showed that DVEN performs better than our previous model, with faster inference speed and a significantly smaller footprint: it achieves an SNR increment of 18.73 dB, takes only 0.46 s to enhance a 5 s segment, and is only 5.05 MB in size. Four well-known classification models were applied to noisy boring vibration segments, DVEN-enhanced segments, and VibDenoiser-enhanced segments for detection. Accuracy was significantly improved on enhanced segments, and accuracy on DVEN-enhanced segments exceeded that on VibDenoiser-enhanced segments, proving the superior enhancement capability of DVEN. The comparison of DVEN and VibDenoiser demonstrates that methods exploiting both frequency-domain and time-domain features are more effective than methods exploiting time-domain features alone. With the introduction of our new model, we are one step closer to practical larvae detection.
Our model is particularly valuable for insect pest monitoring, as it can be integrated into larval surveillance programs with mobile deployment. It can also be used to develop prototypes of early warning systems for trunk-boring beetles in forests and cities. Early detection enables early reactions and treatments, such as applying systemic insecticides, removing infested ash trees in isolated populations, and employing girdled trap trees, natural enemies, and biological control. Insecticides may be applied by soil injection, systemic bark spray, or foliage cover spray; imidacloprid, dinotefuran, and emamectin benzoate have proven effective. Hymenopterous parasitoid wasps such as Atanycolus cappaerti Marsh, Phasgonophora sulcata Westwood (Chalcidae), and Balcha indica Mani and Kaul (Hymenoptera: Eupelmidae) may affect local EAB population growth to some extent [71,72]. In future research, the boring vibrations of various larvae can be included to train models with improved universal applicability. To create a model closer to practical application, further research may also consider including noisy boring vibrations recorded directly in the habitats of the larval host plants.

5. Conclusions

A pragmatic approach to detecting infestations of trunk-boring beetle larvae is to embed a piezoelectric vibration sensor in tree trunks and use a trained model to inspect for boring vibrations. Nevertheless, detection is constantly hampered by environmental noise, which is recorded simultaneously with the weak vibrations induced by larval activity. In this study, a new boring vibration enhancement model with dual-domain feature extraction was developed based on deep learning-based speech enhancement. Our model cascades two enhancement cores. The first applies an STFT signal transformation and uses two LSTM layers to perform enhancement in the frequency domain; the second employs a convolutional recurrent network and performs enhancement in the time domain. As a result, the model achieves an SNR increment of 18.73 dB and takes only 0.46 s to enhance a 5 s segment. Four renowned classification models were applied to noisy boring vibration segments and DVEN-enhanced segments, and accuracy improved significantly on the enhanced segments. Our DVEN effectively suppresses noise in boring vibrations at an affordable cost, demonstrating its value and the contribution of our work to exploiting substrate-borne vibrations in pest management.

Author Contributions

Conceptualization, H.S., Z.C. and X.L.; methodology, H.S.; software, H.S.; validation, H.S.; formal analysis, H.S.; investigation, H.S.; resources, L.R., Z.C. and Y.L.; data curation, H.S.; writing—original draft preparation, H.S.; writing—review and editing, H.S., J.L., X.L. and L.R.; visualization, H.S.; supervision, Z.C.; project administration, H.Z. and J.L.; funding acquisition, Z.C. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 32071775. This research was also supported by the Forestry Industry Standard Formulation and Revision Program of the National Forestry and Grassland Administration under Grant 2019130004-129. The APC was funded by the National Natural Science Foundation of China.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Enhancement performance of DVEN with different optimizers and different loss functions. The first four rows are results with different optimizers; the rest are results with different loss functions.
Optimizer or Loss    SNR (dB)    SNRseg (dB)    LLR
AdamW                17.52       14.65          0.67
RAdam                18.52       17.32          0.38
SGD                  16.03        9.95          1.33
LookAhead             3.26        1.83          0.75
MSE                  18.66       17.49          0.36
MAE                  18.61       17.73          0.36
log-cosh             18.69       17.54          0.37
Huber                18.57       17.25          0.36
All models in the table have the same settings as DVEN except the optimizer or loss function.
Table A2. Enhancement performance of DVEN with different batch sizes.
Batch Size    SNR (dB)    SNRseg (dB)    LLR
26            18.65       17.52          0.41
28            18.67       17.55          0.39
30            18.69       17.62          0.38
31            18.67       17.56          0.37
33            18.71       17.70          0.38
34            18.71       17.64          0.37
36            18.67       17.57          0.41
All models in the table have the same settings as DVEN except the batch size.

References

  1. Why Forests Are So Important. Available online: https://wwf.panda.org/discover/our_focus/forests_practice/importance_forests/ (accessed on 24 March 2023).
  2. Bruenig, E.F. Conservation and Management of Tropical Rainforests: An Integrated Approach to Sustainability; CABI: Wallingford, UK, 2016. [Google Scholar]
  3. Torun, P.; Altunel, A.O. Effects of environmental factors and forest management on landscape-scale forest storm damage in Turkey. Ann. For. Sci. 2020, 77, 39. [Google Scholar] [CrossRef]
  4. Woodcock, P.; Cottrell, J.E.; Buggs, R.J.A.; Quine, C.P. Mitigating pest and pathogen impacts using resistant trees: A framework and overview to inform development and deployment in Europe and North America. For. Int. J. For. Res. 2017, 91, 1–16. [Google Scholar] [CrossRef]
  5. Hulme, P.E. Trade, transport and trouble: Managing invasive species pathways in an era of globalization. J. Appl. Ecol. 2009, 46, 10–18. [Google Scholar] [CrossRef]
  6. Marchioro, M.; Faccoli, M. Dispersal and colonization risk of the Walnut Twig Beetle, Pityophthorus juglandis, in southern Europe. J. Pest Sci. 2022, 95, 303–313. [Google Scholar] [CrossRef]
  7. Rassati, D.; Marini, L.; Marchioro, M.; Rapuzzi, P.; Magnani, G.; Poloni, R.; Di Giovanni, F.; Mayo, P.; Sweeney, J. Developing trapping protocols for wood-boring beetles associated with broadleaf trees. J. Pest Sci. 2019, 92, 267–279. [Google Scholar] [CrossRef]
  8. Nahrung, H.F.; Liebhold, A.M.; Brockerhoff, E.G.; Rassati, D. Forest Insect Biosecurity: Processes, Patterns, Predictions, Pitfalls. Annu. Rev. Entomol. 2023, 68, 211–229. [Google Scholar] [CrossRef] [PubMed]
  9. Preti, M.; Verheggen, F.; Angeli, S. Insect pest monitoring with camera-equipped traps: Strengths and limitations. J. Pest Sci. 2021, 94, 203–217. [Google Scholar] [CrossRef]
  10. Mankin, R. Applications of Acoustics in Insect Pest Management; CABI International: Wallingford, UK, 2012; Volume 2012, pp. 1–7. [Google Scholar]
  11. Hill, P.S.M.; Mazzoni, V.; Narins, P.; Virant-Doberlet, M.; Wessel, A. Quo Vadis, Biotremology? In Biotremology: Studying Vibrational Behavior; Hill, P.S.M., Lakes-Harlan, R., Mazzoni, V., Narins, P.M., Virant-Doberlet, M., Wessel, A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 3–14. [Google Scholar]
  12. Mankin, R.; Hagstrum, D.; Guo, M.; Eliopoulos, P.; Njoroge, A. Automated Applications of Acoustics for Stored Product Insect Detection, Monitoring, and Management. Insects 2021, 12, 259. [Google Scholar] [CrossRef]
  13. Rigakis, I.; Potamitis, I.; Tatlas, N.-A.; Potirakis, S.M.; Ntalampiras, S. TreeVibes: Modern Tools for Global Monitoring of Trees for Borers. Smart Cities 2021, 4, 271–285. [Google Scholar] [CrossRef]
  14. Banlawe, I.A.P.; Cruz, J.C.D. Acoustic Sensors for Mango Pulp Weevil (Stretochenus frigidus sp.) Detection. In Proceedings of the 2020 IEEE 10th International Conference on System Engineering and Technology (ICSET), Shah Alam, Malaysia, 9 November 2020; pp. 191–195. [Google Scholar]
  15. Bittner, J.A.; Balfe, S.; Pittendrigh, B.R.; Popovics, J.S. Monitoring of the Cowpea Bruchid, Callosobruchus maculatus (Coleoptera: Bruchidae), Feeding Activity in Cowpea Seeds: Advances in Sensing Technologies Reveals New Insights. J. Econ. Entomol. 2018, 111, 1469–1475. [Google Scholar] [CrossRef]
  16. Watanabe, H.; Yanase, Y.; Fujii, Y. Relationship between the movements of the mouthparts of the bamboo powder-post beetle Dinoderus minutus and the generation of acoustic emission. J. Wood Sci. 2016, 62, 85–92. [Google Scholar] [CrossRef]
  17. Flynn, T.; Salloum, H.; Hull-Sanders, H.; Sedunov, A.; Sedunov, N.; Sinelnikov, Y.; Sutin, A.; Masters, D. Acoustic methods of invasive species detection in agriculture shipments. In Proceedings of the 2016 IEEE Symposium on Technologies for Homeland Security (HST), Waltham, MA, USA, 10–11 May 2016; pp. 1–5. [Google Scholar]
  18. Hetzroni, A.; Soroker, V.; Cohen, Y. Toward practical acoustic red palm weevil detection. Comput. Electron. Agric. 2016, 124, 100–106. [Google Scholar] [CrossRef]
  19. Mankin, R.W.; Burman, H.; Menocal, O.; Carrillo, D. Acoustic Detection of Mallodon dasystomus (Coleoptera: Cerambycidae) in Persea americana (Laurales: Lauraceae) Branch Stumps. Fla. Entomol. 2018, 101, 321–323. [Google Scholar] [CrossRef]
  20. Sutin, A.; Yakubovskiy, A.; Salloum, H.; Flynn, T.; Sedunov, N.; Nadel, H.; Krishnankutty, S. Sound of wood-boring larvae and its automated detection. J. Acoust. Soc. Am. 2018, 143, 1795. [Google Scholar] [CrossRef]
  21. Jalinas, J.; Güerri-Agulló, B.; Dosunmu, O.G.; Haseeb, M.; Lopez-Llorca, L.V.; Mankin, R.W. Acoustic Signal Applications in Detection and Management of Rhynchophorus spp. in Fruit-Crops and Ornamental Palms. Fla. Entomol. 2019, 102, 475–479. [Google Scholar] [CrossRef]
  22. Sun, Y.; Tuo, X.; Jiang, Q.; Zhang, H.; Chen, Z.; Zong, S.; Luo, Y. Drilling Vibration Identification Technique of Two Pest Based on Lightweight Neural Networks. Sci. Silvae Sin. 2020, 56, 100–108. [Google Scholar]
  23. Karar, M.E.; Reyad, O.; Abdel-Aty, A.-H.; Owyed, S.; Hassan, M.F. Intelligent IoT-Aided Early Sound Detection of Red Palm Weevils. Cmc-Comput. Mater. Contin. 2021, 69, 4095–4111. [Google Scholar]
  24. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  25. Zhang, X.; Zhang, H.; Chen, Z.; Li, J. Trunk Borer Identification Based on Convolutional Neural Networks. Appl. Sci. 2023, 13, 863. [Google Scholar] [CrossRef]
  26. Korinšek, G.; Tuma, T.; Virant-Doberlet, M. Automated Vibrational Signal Recognition and Playback. In Biotremology: Studying Vibrational Behavior; Hill, P.S.M., Lakes-Harlan, R., Mazzoni, V., Narins, P.M., Virant-Doberlet, M., Wessel, A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 149–173. [Google Scholar]
  27. Oberst, S.; Lai, J.C.S.; Evans, T.A. Physical Basis of Vibrational Behaviour: Channel Properties, Noise and Excitation Signal Extraction. In Biotremology: Studying Vibrational Behavior; Hill, P.S.M., Lakes-Harlan, R., Mazzoni, V., Narins, P.M., Virant-Doberlet, M., Wessel, A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 53–78. [Google Scholar]
  28. Strauß, J.; Stritih-Peljhan, N.; Nieri, R.; Virant-Doberlet, M.; Mazzoni, V. Communication by substrate-borne mechanical waves in insects: From basic to applied biotremology. Adv. Insect Physiol. 2021, 61, 189–307. [Google Scholar] [CrossRef]
  29. Mankin, R.W.; Mizrach, A.; Hetzroni, A.; Levsky, S.; Nakache, Y.; Soroker, V. Temporal and Spectral Features of Sounds of Wood-Boring Beetle Larvae: Identifiable Patterns of Activity Enable Improved Discrimination from Background Noise. Fla. Entomol. 2008, 91, 241–248. [Google Scholar] [CrossRef]
  30. Liu, X.; Sun, Y.; Cui, J.; Jiang, Q.; Chen, Z.; Luo, Y. Early Recognition of Feeding Sound of Trunk Borers Based on Artificial Intelligence. Sci. Silvae Sin. 2021, 57, 93–101. [Google Scholar]
  31. Zhou, H.; He, Z.; Sun, L.; Zhang, D.; Zhou, H.; Li, X. Improved Power Normalized Cepstrum Coefficient Based on Wavelet Packet Decomposition for Trunk Borer Detection in Harsh Acoustic Environment. Appl. Sci. 2021, 11, 2236. [Google Scholar] [CrossRef]
  32. Geng, S.L.; Li, F.J. Design of the Sound Insulation Chamber for Stored Grain Insect Sound Detection. Appl. Mech. Mater. 2012, 220–223, 1598–1601. [Google Scholar] [CrossRef]
  33. Mankin, R.W.; Shuman, D.; Coffelt, J.A. Noise Shielding of Acoustic Devices for Insect Detection. J. Econ. Entomol. 1996, 89, 1301–1308. [Google Scholar] [CrossRef]
  34. Vinatier, F.; Vinatier, C. Acoustic recording as a non-invasive method to detect larval infestation of Cosmopolites sordidus. Entomol. Exp. Et Appl. 2013, 149, 22–26. [Google Scholar] [CrossRef]
  35. Mankin, R.W.; Al-Ayedh, H.Y.; Aldryhim, Y.; Rohde, B. Acoustic Detection of Rhynchophorus ferrugineus (Coleoptera: Dryophthoridae) and Oryctes elegans (Coleoptera: Scarabaeidae) in Phoenix dactylifera (Arecales: Arecacae) Trees and Offshoots in Saudi Arabian Orchards. J. Econ. Entomol. 2016, 109, 622–628. [Google Scholar] [CrossRef]
  36. Charif, R.; Waack, A.; Strickman, L. Raven Pro 1.4 User’s Manual; Cornell Lab of Ornithology: Ithaca, NY, USA, 2010. [Google Scholar]
  37. Høye, T.T.; Ärje, J.; Bjerge, K.; Hansen, O.L.P.; Iosifidis, A.; Leese, F.; Mann, H.M.R.; Meissner, K.; Melvad, C.; Raitoharju, J. Deep learning and computer vision will transform entomology. Proc. Natl. Acad. Sci. USA 2021, 118, e2002545117. [Google Scholar] [CrossRef]
  38. Kiskin, I.; Zilli, D.; Li, Y.; Sinka, M.; Willis, K.; Roberts, S. Bioacoustic detection with wavelet-conditioned convolutional neural networks. Neural Comput. Appl. 2020, 32, 915–927. [Google Scholar] [CrossRef]
  39. Liu, X.; Zhang, H.; Jiang, Q.; Ren, L.; Chen, Z.; Luo, Y.; Li, J. Acoustic Denoising Using Artificial Intelligence for Wood-Boring Pests Semanotus bifasciatus Larvae Early Monitoring. Sensors 2022, 22, 3861. [Google Scholar] [CrossRef]
  40. Shi, H.; Chen, Z.; Zhang, H.; Li, J.; Liu, X.; Ren, L.; Luo, Y. A Waveform Mapping-Based Approach for Enhancement of Trunk Borers’ Vibration Signals Using Deep Learning Model. Insects 2022, 13, 596. [Google Scholar]
  41. Loizou, P.C. Speech Enhancement: Theory and Practice, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2013. [Google Scholar]
  42. Cui, X.; Chen, Z.; Yin, F. Speech enhancement based on simple recurrent unit network. Appl. Acoust. 2020, 157, 107019. [Google Scholar] [CrossRef]
  43. Zhang, X.; Du, J.; Chai, L.; Lee, C.-H. A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement. In Proceedings of the Interspeech 2021, Brno, Czechia, 1 September 2021; pp. 2701–2705. [Google Scholar]
  44. Hummersone, C.; Stokes, T.; Brookes, T. On the Ideal Ratio Mask as the Goal of Computational Auditory Scene Analysis. In Blind Source Separation: Advances in Theory, Algorithms and Applications; Naik, G.R., Wang, W., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 349–368. [Google Scholar]
  45. Wang, D. On Ideal Binary Mask as the Computational Goal of Auditory Scene Analysis. In Speech Separation by Humans and Machines; Divenyi, P., Ed.; Springer: Boston, MA, USA, 2005; pp. 181–197. [Google Scholar]
  46. Wang, Y.; Narayanan, A.; Wang, D. On Training Targets for Supervised Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1849–1858. [Google Scholar] [CrossRef]
  47. Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. In Proceedings of the 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13–15 September 2016. [Google Scholar]
  48. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  49. Poland, T.M.; McCullough, D.G. Emerald Ash Borer: Invasion of the Urban Forest and the Threat to North America’s Ash Resource. J. For. 2006, 104, 118–124. [Google Scholar] [CrossRef]
  50. Berouti, M.; Schwartz, R.; Makhoul, J. Enhancement of speech corrupted by acoustic noise. In Proceedings of the ICASSP ‘79. IEEE International Conference on Acoustics, Speech, and Signal Processing, Washington, DC, USA, 2–4 April 1979; pp. 208–211. [Google Scholar]
  51. Jalil, M.; Butt, F.A.; Malik, A. Short-time energy, magnitude, zero crossing rate and autocorrelation measurement for discriminating voiced and unvoiced segments of speech signals. In Proceedings of the 2013 The International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), Konya, Turkey, 9–11 May 2013; pp. 208–212. [Google Scholar]
  52. Kong, Z.; Ping, W.; Dantrey, A.; Catanzaro, B. Speech Denoising in the Waveform Domain With Self-Attention. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7867–7871. [Google Scholar]
  53. Maciejewski, M.; Wichern, G.; McQuinn, E.; Roux, J.L. WHAMR!: Noisy and Reverberant Single-Channel Speech Separation. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 696–700. [Google Scholar]
  54. Westhausen, N.L.; Meyer, B.T. Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 2477–2481. [Google Scholar]
  55. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  56. Tan, K.; Wang, D. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3229–3233. [Google Scholar]
  57. Zhao, H.; Zarar, S.; Tashev, I.; Lee, C.H. Convolutional-Recurrent Neural Networks for Speech Enhancement. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2401–2405. [Google Scholar]
  58. Zhang, X.; Ren, X.; Zheng, X.; Chen, L.; Zhang, C.; Guo, L.; Yu, B. Low-Delay Speech Enhancement Using Perceptually Motivated Target and Loss. In Proceedings of the Interspeech 2021, Brno, Czechia, 1 September 2021; pp. 2826–2830. [Google Scholar]
  59. Gao, T.; Du, J.; Dai, L.R.; Lee, C.H. Densely Connected Progressive Learning for LSTM-Based Speech Enhancement. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5054–5058. [Google Scholar]
  60. Park, S.R.; Lee, J. A Fully Convolutional Neural Network for Speech Enhancement. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1993–1997. [Google Scholar]
  61. Défossez, A.; Synnaeve, G.; Adi, Y. Real Time Speech Enhancement in the Waveform Domain. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 3291–3295. [Google Scholar]
  62. Zhao, S.; Ma, B.; Watcharasupat, K.N.; Gan, W.S. FRCRN: Boosting Feature Representation Using Frequency Recurrence for Monaural Speech Enhancement. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 9281–9285. [Google Scholar]
  63. Choi, H.S.; Park, S.; Lee, J.H.; Heo, H.; Jeon, D.; Lee, K. Real-Time Denoising and Dereverberation with Tiny Recurrent U-Net. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 5789–5793. [Google Scholar]
  64. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the PMLR Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  65. Kavalerov, I.; Wisdom, S.; Erdogan, H.; Patton, B.; Wilson, K.; Roux, J.L.; Hershey, J.R. Universal Sound Separation. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2019; pp. 175–179. [Google Scholar]
  66. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the PMLR 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 933–941. [Google Scholar]
  67. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M. TensorFlow: A system for large-scale machine learning. In Proceedings of OSDI, Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
  68. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the ICLR, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  69. Loizou, P.C. Speech Quality Assessment. In Multimedia Analysis, Processing and Communications; Lin, W., Tao, D., Kacprzyk, J., Li, Z., Izquierdo, E., Wang, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 623–654. [Google Scholar]
  70. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  71. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  72. Iandola, F.N.; Moskewicz, M.W.; Ashraf, K.; Han, S.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  73. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
Figure 1. Frequency spectra of three noise segments: (a) sound of the wind, (b) birds’ twittering, and (c) babble.
Figure 2. Frequency spectra of (a) a clean boring vibration segment of EAB, (b) birds’ twittering, and (c) the mixture of the previous two segments.
Figure 3. The architecture of (a) DVEN, (b) DVEN-r, and (c) DVEN-c. Each gray box represents an enhancement core.
Figure 4. Several enhancement results of DVEN. Column (a) shows the frequency spectra of noisy boring vibration segments; column (b) shows the spectra of the same segments after enhancement by DVEN.
Table 1. Comparison of the enhancement performance of our model and previous models.
Model *        SNR (dB)    SNRseg (dB)    LLR
DTLN           18.48       17.29          0.47
VibDenoiser    18.57       16.60          0.33
DVEN-r         18.56       17.35          0.35
DVEN           18.73       17.65          0.36
DVEN-c         18.77       17.76          0.35
* These results were obtained at each model’s best convergence; for DVEN-r, DVEN, and DVEN-c, this occurred at epochs 63, 60, and 44, respectively.
Table 2. Inference speed and model footprints of different models.
Model          Inference Time (s) *    Parameters (M)    FLOPs (G)    Model Size (MB)
DTLN           0.33                    0.99              0.57         3.76
VibDenoiser    0.76                    10.75             52.48        41.02
DVEN-r         2.41                    1.14              6.23         4.34
DVEN           0.46                    1.32              2.14         5.05
DVEN-c         0.52                    1.52              3.69         5.79
* The inference time was tested on an Intel Core i7-10870H Processor. The models were restricted to running on a single core.
Table 3. Accuracy of four well-known classification models on noisy segments and segments enhanced by VibDenoiser and DVEN, respectively.
Classification Model    Accuracy on Noisy Segments    Accuracy on VibDenoiser-Enhanced Segments    Accuracy on DVEN-Enhanced Segments
VGG16                   51.27%                        97.78%                                       97.78%
ResNet18                55.38%                        88.92%                                       97.15%
SqueezeNet              59.97%                        96.68%                                       97.15%
MobileNetV2             53.63%                        90.19%                                       98.89%

