1. Introduction
Trees face significant threats from climate change, wildfires, and pest infestations [1]. Various forest disasters, including wildfires, climate change, human encroachment, and tree pathogens and pests, affect forest ecosystems to varying degrees, influencing agriculture, forestry, and human livelihoods [2]. Among forest pests, wood-boring beetles pose a particularly significant threat, including species such as Agrilus planipennis (Coleoptera: Buprestidae), Semanotus bifasciatus (Coleoptera: Cerambycidae), and Eucryptorrhynchus brandti (Coleoptera: Curculionidae). The larvae of these beetles primarily inhabit the cambium layer of trees, feeding on the wood and disrupting nutrient transport, which weakens and eventually kills the tree. Unfortunately, infestations by wood-boring beetles are often not detected until visible signs of withering or damage appear on the branches [3]. A novel approach currently under exploration analyzes acoustic vibration signals to detect the presence of insect larvae within tree trunks. This method uses a piezoelectric sensor, such as an ICP (integrated circuit piezoelectric) sensor, to convert mechanical vibrations into electrical signals. The collected vibration signals are then fed into a pre-trained model to detect larvae and determine whether the tree is infested [4]. Despite some studies on wood-boring insect larva detection [5,6], most rely on a single vibration sensor to receive the signals, making them susceptible to the environmental noise that affects larva detection outdoors. To improve the recognition rate of larvae, we propose using multiple vibration sensors to capture the vibration signals and feed them into the model. The primary focus of this work is to preprocess the monotonous vibration signals emitted by the larvae, separating them from signals contaminated by the various environmental noise sources that could degrade recognition performance. The aim is to isolate and preserve the clean wood-boring vibration signals associated with the insect larvae.
Some studies have applied the technology mentioned above to identify wood-boring vibrations. For example, Sutin et al. [4] utilized a piezoelectric sensor and a dual-mode charge sensor to record the vibration signals of infested trees. They developed an automatic detection algorithm that determines the presence of larvae based on a threshold of the average pulse rate. In that study, representative feeding sounds of larvae were selected by human experts; features were extracted from these sounds, followed by binary or multi-class classification in the time, frequency, or scale domains. Ultimately, an automated insect detection system was achieved by optimizing the parameters of radial basis function (RBF) and polynomial kernels. Le Conte et al. proposed using acoustic emission monitoring technology to detect wood-boring insects in wooden cultural heritage instruments preserved in various European museums [7]. The study employed robust data processing based on orthogonal linear transforms and applied the processed signals to distinguish insect signals from environmental noise. Using acoustic detection technology, the study successfully detected larvae measuring approximately 1–2 mm in length within the musical instruments. In another study, ref. [8] proposed a unified framework for the automated bioacoustic recognition of specific pests.
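The pulse-rate thresholding idea in [4] can be illustrated with a short sketch. This is our own reconstruction, not the cited authors' implementation; the function name and the parameters `amp_thresh`, `min_gap`, and `rate_thresh` are hypothetical placeholders:

```python
import numpy as np

def detect_larvae(signal, fs, amp_thresh=0.1, min_gap=0.01, rate_thresh=2.0):
    """Flag infestation when the average pulse rate exceeds a threshold.

    signal      : 1-D vibration recording (arbitrary units)
    fs          : sampling rate in Hz
    amp_thresh  : amplitude above which a sample counts toward a pulse (assumed)
    min_gap     : crossings closer than this many seconds merge into one pulse
    rate_thresh : pulses per second that triggers a positive detection (assumed)
    """
    above = np.flatnonzero(np.abs(signal) > amp_thresh)
    if above.size == 0:
        return 0.0, False
    # Split threshold crossings into discrete pulse events by their spacing.
    gaps = np.diff(above) > int(min_gap * fs)
    n_pulses = 1 + int(np.count_nonzero(gaps))
    rate = n_pulses / (len(signal) / fs)
    return rate, rate > rate_thresh
```

For a one-second recording at 1 kHz containing five isolated clicks, the function reports a rate of 5 pulses/s and flags the signal as infested under the default thresholds.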
In natural environments, natural and non-natural noises typically occur concurrently. For this study, we categorize all signals unrelated to boring vibration signals as noise. As previously demonstrated in the literature [9], environmental noise degrades the recognition of boring vibration signals. Traditional multi-channel denoising algorithms include adaptive filtering [10], post-processing Wiener filtering [11], and spatial noise suppression techniques [12]. However, these methods often require prior knowledge or substantial computational resources, and their separation and denoising results may be only partially satisfactory. With the continuous development of acoustic technology, applying emerging techniques to boring vibration signals has become increasingly feasible.
Furthermore, most previous studies have focused on single-channel analysis [13]. Yet in the context of detecting wood-boring insects using acoustic techniques [14,15], using multiple sensors often offers significant advantages over a single sensor.
Multi-channel boring-vibration signal-separation techniques benefit from the data received by multiple vibration sensors, giving them access to far more information: the more input information is available, the more can be extracted. Furthermore, from a traditional perspective, multi-channel analysis enables beamforming, which exhibits strong generalization and robustness [16]. Research on multi-channel end-to-end speech separation focuses primarily on two directions: neural network beamforming and extending single-channel models to multi-channel settings. The output-based neural network beamforming methods mainly include the DeepBeam and Beam-TasNet approaches. DeepBeam utilizes time-domain multi-channel Wiener filtering: it selects a reference microphone and employs a single-channel enhancement network trained to enhance the signal from that microphone. The enhanced signal is then used as the target for Wiener filtering to obtain the optimal filter parameters for the remaining sensors, yielding a cleaner speech signal [17]. The Beam-TasNet method [18] combines time-domain and frequency-domain approaches: a multi-channel TasNet performs an initial separation of the mixed speech signals, the MVDR (minimum variance distortionless response) weights are then estimated in the frequency domain using the separated speech as references, and these weights are applied to the mixed speech to obtain the final separated speech. Beam-TasNet incorporates time-domain techniques for phase estimation and exploits spatial features to achieve the desired separation. In addition to DeepBeam and Beam-TasNet, another significant work in neural network-based beamforming is the filter-and-sum network (FaSNet) [19], which emphasizes information sharing among multiple channels to optimize time-domain filters jointly. There have also been significant advancements in insect sound detection technology [20], with many mature methods that apply various vibration sensors to different substrates and employ signal processing techniques. Therefore, applying multi-channel speech enhancement and separation techniques to boring vibration signals with similar processing is feasible.
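As background to these neural beamformers, the classical delay-and-sum baseline they build on can be sketched in a few lines. The function below is illustrative only and assumes known integer sample delays per channel, which none of the cited end-to-end systems require:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Classical delay-and-sum beamforming (illustrative baseline).

    channels : list of equal-length 1-D arrays, one per vibration sensor
    delays   : per-channel integer delays (in samples) relative to the source;
               shifting by -delay time-aligns each channel before averaging.
    """
    out = np.zeros(len(channels[0]))
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -d)  # circular shift used as a simple stand-in
    return out / len(channels)
```

When the delays are estimated correctly, the target signal adds coherently while uncorrelated noise is attenuated by the averaging; neural beamformers such as FaSNet instead learn time-domain filters that absorb both the alignment and the filtering.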
With the advancements in deep learning and neural network technologies [21], acoustic technology has also developed further. Combining deep learning with speech enhancement and separation techniques has propelled the growth of wood-boring larva detection. Rigakis et al. [15] proposed an automated system called TreeVibes, which collects vibration sounds and converts them into analyzable data. The data are then fed into a deep learning model, such as the Xception model, to determine whether the trees are affected by insect infestation. Compared to traditional manual detection methods, this approach significantly increases the chances of early detection of insect infestations. Mankin et al. [22] proposed using acoustic technology for detecting and monitoring insect pests: they collected vibration data from within tree trunks and analyzed them with deep learning models to determine the presence of pests.
In addition, the attention mechanism has quietly emerged as a novel approach. It enables a model to focus on different parts of the input sequence while generating the output sequence, enhancing the weights assigned to relevant features and disregarding irrelevant portions of the input. By learning multiple linear projections, the model can attend to different representation subspaces at different positions [23]. The attention mechanism has been widely adopted in various domains, such as image recognition [24], audio processing, sentiment analysis [25], and more.
This study takes the boring vibrations of emerald ash borer (EAB) larvae as its research subject. EAB larvae have a significant impact on ecological environments [26], particularly through the damage they inflict on white ash trees. Traditional methods for identifying wood-boring insect infestations include manual observation [27] and the use of pheromones [28]. Manual identification involves visually inspecting the bark for “D-shaped” exit holes and checking for live larvae inside the tree. This method is labor-intensive and less effective for efficient pest control, and its limitations make satisfactory prevention and control difficult to achieve. Using the vibration signals emitted by larvae during their activity and feeding as a cue to the presence of EAB larvae inside tree trunks is an effective method that saves significant human resources. Therefore, this paper proposes an end-to-end multi-channel boring-vibration signal-separation model based on attention mechanisms. Our team collected and synthesized all the data used in this study, including clean boring vibration signals and synthetic signals generated through simulations. Our results demonstrate that the proposed model suppresses noise more effectively than single-channel and multi-channel baselines: across different numbers of vibration sensors and networks, it improves SNR and SDR by 5% to 15%.
4. MultiSAMS Model
4.1. Boring-Vibration Signal Enhancement Model
Recurrent neural networks (RNNs) have achieved remarkable success in various speech signal processing tasks in recent years, such as speech recognition, speech synthesis, speech enhancement, and speech separation [29]. To effectively utilize contextual information, we have chosen recurrent neural networks, which are well suited for capturing sequential dependencies [30]. Because our data collection uses multiple vibration sensors, we employ time-domain beamforming techniques. While time-domain methods may be somewhat less robust than frequency-domain methods [31], they offer faster response times and relatively smaller model sizes, requiring fewer computational resources. Based on the techniques mentioned above, we propose MultiSAMS, a time-domain multi-channel separation network, into which we incorporate a multi-head self-attention mechanism module (MSAM). Unlike other methods, MultiSAMS replaces the traditional filtering module with a bidirectional RNN, which is well suited to audio tasks because it effectively captures long-term dependencies.
In MultiSAMS, we also use a one-dimensional convolutional layer to extract data features from each channel, where the size of the one-dimensional kernel is variable and determined by the sum of the context length and window size. Moreover, we calculate the cosine similarity between channels to extract the NCC (normalized cross-correlation) feature. These two features are then concatenated and fed into the MSAM module.
The attention mechanism in the MSAM module provides several advantages. It enables the model to focus on information from different channels during learning, thereby learning the correlations between channels. By using the attention mechanism, the model can automatically adjust the inter-channel weights, giving more attention to crucial information and enhancing the model’s performance.
Finally, the concatenated features are passed through the filtering module to obtain the final output. The proposed MultiSAMS method effectively combines the power of bidirectional RNNs, one-dimensional convolutional layers, and attention mechanisms, making it well suited for multi-channel audio signal separation tasks.
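The per-channel convolutional encoder can be illustrated as a framed linear projection, which is what a strided 1-D convolution computes. The sketch below is ours, assuming a learned weight matrix `W` whose kernel length equals the sum of the context length and window size, as described above; names and shapes are illustrative:

```python
import numpy as np

def conv1d_encoder(x, W, hop):
    """Frame a signal and project each frame with a learned basis.

    Equivalent to a strided 1-D convolution over the channel signal.
    x   : 1-D channel signal
    W   : (n_filters, kernel) weight matrix; kernel = context + window size
    hop : frame advance in samples
    """
    kernel = W.shape[1]
    n_frames = 1 + (len(x) - kernel) // hop
    frames = np.stack([x[t * hop : t * hop + kernel] for t in range(n_frames)])
    return frames @ W.T  # (n_frames, n_filters) feature map
```

In the full model this feature map is concatenated with the NCC feature before entering the MSAM module; here `W` would be learned jointly with the rest of the network rather than fixed.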
Specifically, our beamforming technique estimates time-domain beamforming filters for an array of $N$ vibration sensors. We select one vibration sensor as the reference and sum the filtered signals from all vibration sensor channels to better estimate the signal at the chosen reference sensor. We divide the signal $x^i$ from vibration sensor $i$ into $M$ sample frames of length $L$, each advanced by a fixed hop size $H$:

$$x_t^i = x^i[\,tH : tH + L - 1\,], \quad t = 1, \dots, M$$

In this equation, $t$ represents the index of the frame, and $i$ represents the index of the vibration sensor. The operation $x[a:b]$ selects the values of the vector $x$ from index $a$ to index $b$. The filtered channels are then summed:

$$\hat{y}_t = \sum_{i=1}^{N} h_t^i \otimes \hat{x}_t^i$$

In this context, $\hat{y}_t$ refers to the beamforming signal at frame $t$, while $\hat{x}_t^i$ represents the context window around the frame of vibration sensor $i$. The variable $h_t^i$ represents the beamforming filter that microphone $i$ is learning, and $\otimes$ represents the convolution operation. Zero-padding is applied to the context window to ensure that the model has a sufficient span of samples across microphone delays.
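The framing with context windows and edge zero-padding described above might be sketched as follows. This is a simplified illustration with non-overlapping center frames; the function name and layout are our assumptions:

```python
import numpy as np

def frame_with_context(x, frame_len, context):
    """Extract frames plus `context` past/future samples on each side.

    Zero-padding at the signal edges keeps every window the same length
    (frame_len + 2 * context), so the learned filter always sees a full
    span of samples across possible inter-sensor delays.
    """
    hop = frame_len  # non-overlapping center frames, for simplicity
    padded = np.pad(x, context)
    n_frames = len(x) // frame_len
    return np.stack([padded[t * hop : t * hop + frame_len + 2 * context]
                     for t in range(n_frames)])
```

Each row of the result is one context window $\hat{x}_t^i$; convolving it with the learned filter and summing across channels gives the beamformed frame.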
For the reference microphone, assuming the first microphone is labeled as 1, the input is the context window $\hat{x}_t^1$, which includes the current frame together with $L$ past and future samples. For the other microphones $i$, the signal corresponding to the frame is extracted in the same way to give $\hat{x}_t^i$. To be specific, let $\hat{x}_t^1$ be the context window of the signal in the reference microphone and $\hat{x}_t^i$ that of microphone $i$. To mitigate the impact of other frequency-domain beamforming tasks, we utilize normalized cross-correlation (NCC) as the inter-channel feature [32]. We compute the cosine similarity between the reference frame $x_t^1$ and every equally sized window of $\hat{x}_t^i$:

$$f_t^i[j] = \frac{(x_t^1)^{\top}\, \hat{x}_t^i[\,j : j+L\,]}{\big\|x_t^1\big\| \, \big\|\hat{x}_t^i[\,j : j+L\,]\big\|}$$

where $f_t^i$ represents the cosine similarity between the reference microphone and microphone $i$; $f_t^i$ is a vector of length $2L+1$. By averaging the NCC vectors $f_t^i$ from all other vibration sensors, we obtain the average feature:

$$\bar{f}_t = \frac{1}{N-1} \sum_{i=2}^{N} f_t^i$$
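The NCC feature can be computed directly from its definition. The sketch below is illustrative; the small epsilon guarding against division by zero is an implementation detail of ours, not from the model description:

```python
import numpy as np

def ncc_feature(ref_frame, ctx_window, eps=1e-8):
    """Cosine similarity between the reference frame and every
    equally sized window of another channel's context window."""
    L = len(ref_frame)
    n_shifts = len(ctx_window) - L + 1
    ref = ref_frame / (np.linalg.norm(ref_frame) + eps)
    out = np.empty(n_shifts)
    for j in range(n_shifts):
        w = ctx_window[j : j + L]
        out[j] = ref @ w / (np.linalg.norm(w) + eps)
    return out  # one similarity value per shift of the window
```

The index of the maximum of this vector reflects the relative delay between the two channels, which is exactly the spatial cue the averaged feature feeds to the network.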
For a specific channel, the input is the center frame of the signal in the reference microphone, denoted as $x_t^1$, which is an $L$-dimensional vector. A linear layer is applied to $x_t^1$ to create a $K$-dimensional embedding. This is accomplished using a weight matrix $D$ of size $K \times L$:

$$e_t = D\, x_t^1$$

where $D$ is the weight matrix. Subsequently, the embedding is passed to an RNN neural network with a gated output layer, generating $C$ beamforming filters, where $C$ is the number of sources of interest. The mapping function of the network generates the filters $h_t^{1,c}$, each of which is then convolved with $\hat{x}_t^1$ to generate the beamforming output $\hat{y}_t^{1,c}$ for the reference microphone:

$$\hat{y}_t^{1,c} = h_t^{1,c} \otimes \hat{x}_t^1, \quad c = 1, \dots, C$$

For the remaining channels, the beamforming filters $h_t^{i,c}$ are estimated in the same manner. The estimated clean vibration signal from the reference vibration sensor is used as a cue for all remaining vibration sensors for the sources of interest. Firstly, the aforementioned process is applied to compute all NCC features. The filters are then convolved with their channels’ context windows, and the outputs of all filters are weighted and summed to obtain the final beamforming output.
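The final filter-and-sum step, convolving each channel's context window with its estimated filter and summing across channels, might look like the following sketch. Uniform (unweighted) summation is assumed here for simplicity; the model itself learns the weighting:

```python
import numpy as np

def filter_and_sum(ctx_windows, filters):
    """Time-domain filter-and-sum beamforming for one frame.

    ctx_windows : (n_ch, W) context windows, one per vibration sensor
    filters     : (n_ch, F) per-channel beamforming filters, F <= W
    Returns the summed 'valid' convolutions, one beamformed frame.
    """
    outs = [np.convolve(w, h, mode="valid")
            for w, h in zip(ctx_windows, filters)]
    return np.sum(outs, axis=0)
```

With `mode="valid"`, each per-channel output has length `W - F + 1`, so the zero-padded context window determines how much inter-sensor delay the filter can absorb.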
Furthermore, the overall architecture of the MultiSAMS model is illustrated in Figure 4, and the filter-and-sum network (BF-Module) is shown in Figure 5.
4.2. Filter-and-Sum Network
Inspired by the work of Luo et al. [33], we utilize neural network modules to enable end-to-end training, replacing traditional beamforming filters. This approach addresses the limitation of fixed operations in traditional filters, allowing real-time adjustment. We can effectively handle longer vibration signals by employing bidirectional recurrent neural networks (RNNs), which combine local and global modeling, significantly reducing the computational burden of RNNs and improving efficiency. Moreover, the structure of bidirectional RNNs is relatively simple, making them easy to implement. Consider an input sequence $X \in \mathbb{R}^{M \times L}$, where $M$ represents the feature dimension and $L$ represents the number of time steps. The input sequence is divided into blocks of length $K$ with a stride size of $P$, resulting in $S$ equal-sized blocks. These blocks are then concatenated to form a three-dimensional tensor. The segmentation output $Z \in \mathbb{R}^{M \times K \times S}$ is passed to a stack of $B$ RNN blocks. Each module transforms the input three-dimensional tensor into another tensor of the same shape and contains two sub-modules for intra-block and inter-block processing. The intra-block RNN is bidirectional and is applied to the second dimension of the input tensor for each of the $S$ blocks. The input tensor for each module is denoted as $T_b$, where $b$ represents the block number ranging from 1 to $B$; the shape of $T_b$ is $M \times K \times S$, where $M$ is the feature dimension, $K$ is the block length, and $S$ is the number of blocks.
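The segmentation stage described above can be sketched as follows. Zero-padding the tail so that the final chunk is full-length is an implementation detail assumed here:

```python
import numpy as np

def segment(X, K, P):
    """Split a (M, L) sequence into S chunks of length K with stride P
    and stack them into a (M, K, S) tensor, zero-padding the tail."""
    M, L = X.shape
    S = int(np.ceil(max(L - K, 0) / P)) + 1
    pad = (S - 1) * P + K - L
    Xp = np.pad(X, ((0, 0), (0, pad)))
    return np.stack([Xp[:, s * P : s * P + K] for s in range(S)], axis=-1)
```

With an overlap of `K - P` samples between consecutive chunks, the intra-block RNN processes each length-$K$ chunk locally while the inter-block RNN later spans the $S$ axis for global context.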
The intra-block processing can be written as

$$U_b = \big[\, f(T_b[:, :, i]),\; i = 1, \dots, S \,\big]$$

where $U_b$ is the output of the RNN, $f(\cdot)$ is the mapping function, and $T_b[:, :, i]$ is the sequence defined by chunk $i$.
Regarding the multi-head attention mechanism, we apply it after the RNN layer by reshaping the output $U_b$ and feeding it into the attention mechanism to obtain a new output $U_b'$. The shape of $U_b'$ is then transformed back to the original dimensions. The specific implementation of the attention mechanism is described in Section 4.3.
The fully connected (FC) layer aims to restore the feature dimension $M$:

$$\hat{U}_b = \big[\, W\, U_b'[:, :, i] + b,\; i = 1, \dots, S \,\big]$$

where $\hat{U}_b$ represents the restored features after this layer, $W$ and $b$ are the weight and bias of the FC layer, and $U_b'[:, :, i]$ represents chunk $i$ of the attention-layer output $U_b'$. The input $\hat{U}_b$ is layer-normalized to obtain the output $\mathrm{LN}(\hat{U}_b)$. We then perform a residual connection by adding $\mathrm{LN}(\hat{U}_b)$ to the original input $T_b$, resulting in a new output $\hat{T}_b = T_b + \mathrm{LN}(\hat{U}_b)$.
The output obtained from the previous step is used as the input to the inter-block RNN, which captures global information. The output of the inter-block RNN is obtained by applying its mapping function to the sequence defined by the $i$-th time step across all $S$ blocks. Similar to the previous step’s output $T_b$, the inter-block output $K_b$ is also passed through an attention mechanism layer, a fully connected layer, and a normalization layer. Finally, a residual connection is added to form the output.
4.3. Self-Attention Mechanism
The attention mechanism can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, and an attention function calculates the weights. In our model, each output at each time step depends on the previous time step. When using the attention layer for feature extraction, all keys, values, and queries originate from the same source. In this scenario, each position can attend to all positions in the preceding layer. This property helps address the issue of long-term dependencies.
In practice, the attention function is computed for a set of queries packed into a matrix $Q$; the keys and values are likewise packed into matrices $K$ and $V$, respectively. The output matrix is then computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Linear projections are applied to the queries, keys, and values $h$ times, with each projection having its own learned parameters and projecting to dimensions $d_k$, $d_k$, and $d_v$, respectively. The attention function is then applied independently to each projection, producing output values of dimension $d_v$. These output values are concatenated, and the concatenation is projected again to obtain the final output. This design enables the model to attend to information from different representation subspaces at different positions, addressing the suppression that arises when a single attention head averages across different sources of information:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$

where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$.
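Multi-head scaled dot-product attention can be sketched directly from these equations. The NumPy function below is an illustration with the shapes of the projections above, not the MSAM module itself:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Multi-head scaled dot-product attention (illustrative).

    Q, K, V    : (T, d_model) inputs; for self-attention all three coincide
    Wq, Wk, Wv : lists of h per-head projection matrices, (d_model, d_k/d_v)
    Wo         : (h * d_v, d_model) output projection
    """
    heads = []
    for Wqi, Wki, Wvi in zip(Wq, Wk, Wv):
        q, k, v = Q @ Wqi, K @ Wki, V @ Wvi
        scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, T) weights
        heads.append(scores @ v)                          # (T, d_v) per head
    return np.concatenate(heads, axis=-1) @ Wo            # (T, d_model)
```

Each head attends over all $T$ positions with its own learned subspace, and the final projection $W^O$ mixes the heads back to the model dimension.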
In the conducted experiments, we observed that an appropriate choice of $h$ achieved a desirable trade-off between the inference speed of the model and its performance. The specific structure is shown in Figure 6, which provides a detailed description of the implementation of the attention mechanism module.
7. Discussion
One relatively simple method for controlling and preventing wood-boring pests is to use acoustic technology to detect problems within tree trunks [
4,
41]. This involves using embedded piezoelectric accelerometers to measure the vibrations of the trunk and capture these vibrations using vibration sensors. The captured vibration signals are then processed using a multi-channel wood-boring-vibration signal model to detect the presence of pests. In outdoor environments with wood-boring pests, there are typically various noise sources. The presence of environmental noise can indeed disrupt the accurate detection and interpretation of wood-boring-insect signals, leading to false positives or missed detections [
42]. Therefore, developing robust multi-channel signal processing techniques and algorithms becomes crucial in mitigating the impact of noise and enhancing the detection of vibrations generated by wood-boring insects.
Our model seamlessly integrates a multi-channel approach for detecting wood-boring vibrations. It incorporates four layers of attention mechanisms and is designed to reduce computational complexity through a time-domain-based approach. Nonetheless, including multiple attention layers raises the model’s complexity, presenting a computational challenge for microcontrollers. Our next phase will prioritize simplifying the model, minimizing computational demands, and enhancing separation performance. There are some limitations in the experimental results; for example, the four-layer version of the model performs better than the eight-layer version. We suspect this may be due to overfitting and the increased complexity of the network, which could cause features to be lost during training. We will collect more field data and further optimize our model by deploying our microcontrollers in the field. We have only tested the model with two and four vibration sensors and have yet to explore higher counts; once we address the hardware limitations, we plan to experiment with a larger number of vibration sensors to further enhance performance. In the network section, as depicted in Figure 11, it is evident that GRU networks yield significantly better results than LSTM networks. GRU networks are structurally simpler than LSTM networks, making them more suitable for low-computational-power microcontroller systems, so our future research will primarily use the GRU network within the MultiSAMS model.

Our model can also be extended to detect wood-boring pests beyond the emerald ash borer (EAB): by adapting and fine-tuning the model for different species, we can apply it to other wood-boring insects. Currently, our research scope has expanded beyond the identification of the white wax narrow beetle. We have also researched the detection and identification of another harmful wood-boring insect, the wood-borer moth; we collected a dataset of wood-borer-moth drilling-vibration signals exceeding 50 h in duration and conducted experiments. Trained on this dataset, MultiSAMS achieved a recognition rate of over 90% for wood-borer moths. At present, automatic species recognition has not been implemented: whether for wood-borer moths or white wax narrow beetles, species recognition requires training on the corresponding dataset. In the future, we will focus on enhancing the model’s ability to recognize different species, aiming to improve the accuracy of species classification; population recognition will also be a direction for our future research. This would broaden the application and impact of our model in pest detection and management. Expanding the dataset to include a variety of wood-boring pest species and different environmental conditions is a valuable way to enhance the applicability of our model.

By training the model with a diverse range of wood-boring-insect vibration signals, we can improve its ability to detect and classify different pest species. This expanded model can then be used in early warning systems, enabling pest management strategies such as biological control or chemical treatments that minimize the damage wood-boring pests cause to trees.