1. Introduction
In the absence of human interference, a biological community is a relatively stable group of organisms. Being able to predict the characteristics of a biological community in an area from the effects of ecological factors on that community is very important in ecological research. Studies of the relationships between ecological factors and animal community parameters have led to the development of models for predicting changes in species [1,2]. Similar studies of plant communities have also given good results [3,4,5].
Multiple linear regression (MLR) is a traditional method for predicting changes in biological fields. MLR has been used in many environmental and ecological studies [
6,
7,
8]. Classic MLR can only establish regression relationships between one response parameter and multiple ecological factors, so it is usually difficult to specify the relationships between multiple assessment parameters and multiple ecological factors in a community. Establishing reasonable regression relationships between community parameters and ecological factors therefore requires more accurate methods for predicting community characteristics. The classic MLR technique has been improved in recent years, and other methods have been developed, such as the generalized linear mixed model (GLMM) [
9], which has advantages over the classic MLR technique. It has been found in numerous studies that agricultural yields are predicted better by artificial neural networks (ANNs) than classic MLR [
10]. One of the most effective currently used methods for processing experimental data is the neural network method [
11]. It has been found in various biological and environmental studies that markedly better results can be achieved using a reasonable neural network model than using MLR [
12,
13,
14,
15].
The development of various machine learning methods has focused much attention on establishing regression relationships using neural networks in biological research. For example, images or quantitative morphological features have been used to identify species [
16,
17] and to predict crop yields [
11,
18,
19], plant diseases [
20,
21], and pest damage [
22]. ANNs have also been developed to predict the abundances of fish [
23] and insects [
24], including predicting population densities [
12,
15,
17]. Many of these studies have produced good results [
14,
25]. Convolutional neural networks (CNNs) have recently been developed to overcome some of the disadvantages of traditional machine learning models. CNNs have other advantages, including strong abilities to extract features, fuse multiple inputs, and quickly achieve convergence. Some important information about factors that affect population density can also be extracted more effectively using a CNN than an ANN. CNNs have therefore been used to study biological communities for the last decade [
17]. However, a CNN usually requires a large training dataset to optimize a large number of parameters in the network and establish the complex nonlinear mapping relationships that are required. It is difficult to acquire sufficient field data for use as training data, and this greatly limits the use of machine learning methods in predicting population diversity.
Recent rapid improvements in computing power have led to research focused on developing deep generative models that learn from the probability densities of observable samples and randomly generate new samples. The most influential recently developed generative models are the variational autoencoder (VAE) and the generative adversarial network (GAN), which have very successfully solved problems caused by insufficient data. Wan et al. [
26] addressed the unbalanced learning problem by developing a VAE-based method for generating synthetic data. The method generates a large number of similar samples from the original dataset and evaluates and compares the samples using traditional synthesis methods. Yong et al. [
27] used a GAN to generate a large number of fault samples, establish a balanced dataset, and diagnose faults in induction motors. Siahkoohi et al. [
28] used a GAN to reconstruct missing seismic data, learn the distribution characteristics of the original data through an adversarial process in a neural network, and reconstruct severely under-sampled seismic data.
Insects are a very large group of biota and are often used to assess the ecological environment. Changes in aquatic ecosystems, whether natural or anthropogenic, can lead to changes in the relative abundance of certain groups of aquatic insects [
29]. Most insect species live in different habitats during their egg, larval, pupal, and adult stages, so an insect community very commonly has a different composition in different seasons. Data on insect communities are also strongly affected by random factors that affect field sampling. This makes it difficult to reliably predict the composition of an insect community. Drawing on more than 20 years of studies of aquatic beetle diversity in the Chinese fauna, the third author of this paper and his co-workers found that the diversity of aquatic beetles is correlated with environmental factors within a certain geographical range [
30,
31,
32,
33,
34,
35]. It should therefore be possible to estimate the diversity of aquatic insects by monitoring environmental factors, although this is a very challenging and complex task, and few studies have attempted to predict the compositions of insect communities. Many aquatic insects live in a single water body for at least one stage of their life history and do not tend to migrate long distances. Because of their aquatic habitats, sampling of aquatic insects is less affected by external factors than sampling of other insects. ANNs can therefore be used to study aquatic insect community richness [
1,
2,
36,
37,
38]. However, ANN models have not yet been used to predict the number of species or the number of individuals in insect communities. Ecological factors have different effects on the compositions of aquatic insect communities in aquatic systems in different geographical areas. There are regional limitations on research into insect community predictions, and different ecological factors should be used as variables when making predictions for different regions and environments.
In this study, a 1-D CNN was used to identify the relationships between aquatic beetle community composition and important ecological factors in a still-water (lentic) environment in Guangdong Province (South China) using field data for aquatic beetles acquired at the study site over one year. The 1-D CNN simultaneously predicted aquatic beetle community richness and diversity, giving a multi-variable output. To improve prediction accuracy, the CNN was combined with a Bayesian optimization algorithm that tuned the hyperparameters to build the optimal CNN model. Because field collection is difficult, only a small CNN training dataset was available, so a VAEGAN-based strategy was developed to produce a large amount of generated data with a probability distribution similar to that of the real data. A new prediction model was then trained on the augmented data and evaluated on the same test dataset to verify the effectiveness of the VAEGAN data augmentation strategy for evaluating the characteristics of aquatic beetles under different ecological conditions.
2. Materials and Sample Acquisition
Sampling location: An artificial pool, Nanshe Village, Dapeng New District, Shenzhen City, Guangdong Province, China, 22°28′45″ N 114°31′6″ E, 1 m above sea level (
Figure 1). This pool has not been interfered with by humans for over 30 years.
The materials used in the study are shown in
Table 1.
Sampling time: November 2020–October 2021. Sampling was carried out over 1 or 2 days in the middle or latter part of each month.
Sampling method: A D-type net was swept 10 times in the same direction in each 1 m × 1 m sampling square. The aquatic beetles that were collected were placed in a bottle containing absorbent paper soaked with ethyl acetate. Each bottle containing aquatic beetles was labeled and transferred to the laboratory where the beetles were identified.
Selection of ecological factors: Pearson correlation coefficient analysis, multiple regression analysis, and marginal effect analysis were used to identify the following factors affecting the aquatic beetles of the Dapeng Peninsula.
Determination of ecological factors:
Water temperature. A thermometer was used to determine the real-time water temperature at each collection time.
pH. A precise pH test paper was used to determine the pH of the water at each collection time.
Salinity. A pen salinometer was used to determine the salinity three times at each collection time, and the mean of the three consecutive measurements was used.
Water depth. Each sample was collected from an area of water <1.5 m deep. The water depth was determined five times using a tape measure, and the mean depth was used.
Proportional area of aquatic plants. The area of exposed plants as a percentage of the total area in a sampling square was estimated.
Proportional area of submerged plants. The area of submerged plants as a percentage of the total area of the sampling square was estimated.
Water area. The water area was calculated from measurements made using a tape measure.
Water level. The weighted water level method was used to divide the zones of the pool according to depth. The contribution of each zone to the total water area of the pool was calculated. The water depth in each zone was then measured five times. The maximum and minimum values were removed, then the mean of the remaining values was used to represent the water level in the zone. The water level proportion in each zone was multiplied by the area proportion in each zone, and then the sum was defined as the weighted water level of the water area, calculated using Equation (1)
$$W = \sum_{i=1}^{n} h_i a_i \qquad (1)$$
where $W$ is the weighted water level of the water area, $h_i$ is the water level proportion for zone $i$, $a_i$ represents the proportion of the area in zone $i$, and $n$ is the number of zones.
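The trimming and area weighting described above can be sketched in Python as follows. This is a minimal illustration rather than the procedure as the authors implemented it: the function names and example readings are hypothetical, and the sketch assumes that the trimmed mean depth of each zone is weighted directly by that zone's share of the total water area.

```python
def trimmed_mean(depths):
    """Mean of the readings after dropping one maximum and one minimum."""
    if len(depths) < 3:
        raise ValueError("need at least three readings to trim both extremes")
    kept = sorted(depths)[1:-1]
    return sum(kept) / len(kept)


def weighted_water_level(zones):
    """zones: list of (depth_readings, zone_area) pairs.

    Returns the sum over zones of (trimmed mean depth * area proportion).
    """
    total_area = sum(area for _, area in zones)
    return sum(trimmed_mean(d) * area / total_area for d, area in zones)


# Hypothetical example: two depth zones, five readings each (m) and areas (m^2)
example_zones = [
    ([0.30, 0.32, 0.28, 0.31, 0.90], 60.0),
    ([1.00, 1.10, 1.05, 0.20, 1.02], 40.0),
]
level = weighted_water_level(example_zones)
```

Dropping the extremes before averaging makes each zone's level less sensitive to a single misread depth, which is presumably why five readings were taken per zone.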
Data processing: To eliminate the influence of different dimensions (units) among the indicators, each of the eight ecological factors was normalized using Equation (2):
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (2)$$
where $x$ is the measured value for each environmental factor, $x_{\min}$ is the minimum value for the environmental factor (in the whole dataset), $x_{\max}$ is the maximum value for the environmental factor (in the whole dataset), and $x'$ is the normalized value. Raw data (i.e., without any processing) for the number of species were used to build the prediction model. The aquatic beetles were divided into six grades based on the proportion each species contributed to the total number of species collected in the pool. Grade 1 was a contribution of 0% (no individuals were collected in the sampling); grade 2 was a contribution of 1–20% (the species occurred in the pool, but in small numbers); grade 3 was a contribution of 21–40% (referred to here as "common species"); grade 4 was a contribution of 41–60%; grade 5 was a contribution of 61–80% (referred to here as "dominant species"); and grade 6 was a contribution of 81–100% (also referred to here as "dominant species"; this grade does not usually occur, based on our more than 30 years of experience).
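The normalization and the six-grade scheme can be illustrated with a short Python sketch. The function names and example values are hypothetical. The handling of shares that fall between the stated integer boundaries (for example 20.5%) is an assumption made here, with each upper bound treated as inclusive.

```python
import math


def min_max_normalize(values):
    """Scale values to [0, 1] using the minimum and maximum of the whole list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant factor: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def abundance_grade(count, total):
    """Grade 1 for a 0% share, then one grade per 20% band up to grade 6."""
    if total <= 0 or count <= 0:
        return 1
    share = 100.0 * count / total
    # Assumption: upper bounds are inclusive (20% -> grade 2, 40% -> grade 3, ...)
    return min(6, 1 + math.ceil(share / 20.0))


# Hypothetical water temperatures (degrees C) over three sampling visits
normalized = min_max_normalize([18.0, 24.0, 30.0])
# Hypothetical counts: 3 of 10 collected individuals belong to one species
grade = abundance_grade(3, 10)
```

Because the minimum and maximum are taken over the whole dataset, new observations outside that range would map outside [0, 1].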
3. Deep Learning Model for Making Predictions
3.1. 1-D CNN
As shown in
Figure 2, a CNN model usually consists of an input layer, a convolution layer, a fully connected layer, and an output layer. The input layer is used to receive input information and convert it into something that the network can understand. The convolution layer is used to extract features from the data. In the convolution layer, the convolution kernel continuously scans the input data and convolves the data with the convolution kernel to generate a new feature matrix. As shown in
Figure 3, the convolution process consists of multiplying each element in the convolution kernel (e.g., green box) by the corresponding element in a sub-region (e.g., purple or red dotted box) of the convolution layer input data, and then summing the products to give an element in the feature map. The sub-region moves down one step each time, and the process is repeated until all of the elements in the input data have been processed. Finally, the convolution operation generates a new array (i.e., the feature map).
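The sliding multiply-and-sum operation described above can be written out explicitly. The sketch below is a plain NumPy illustration with hypothetical input values and kernel; like most deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np


def conv1d_valid(signal, kernel, stride=1):
    """Slide the kernel over the signal; multiply element-wise and sum each window."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    if len(kernel) > len(signal):
        raise ValueError("kernel must not be longer than the signal")
    n_out = (len(signal) - len(kernel)) // stride + 1
    feature_map = np.empty(n_out)
    for i in range(n_out):
        start = i * stride
        window = signal[start:start + len(kernel)]  # current sub-region
        feature_map[i] = np.sum(window * kernel)    # one feature-map element
    return feature_map


# Hypothetical input: eight values (one per ecological factor) and a 3-element kernel
features = conv1d_valid([1, 2, 3, 4, 5, 6, 7, 8], [1, 0, -1])
```

With eight inputs, a kernel of length three, and a stride of one, the feature map has six elements.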
After the convolution layer, there is an activation function that introduces nonlinear factors [
39], allowing the CNN to arbitrarily approximate any nonlinear function and be applied to many nonlinear problems (tasks). Commonly used activation functions include the sigmoid, tanh, ReLU (rectified linear unit) [
40], and LeakyReLU functions, as shown in
Figure 4. The LeakyReLU activation function was used in this study, as it is much faster than the sigmoid and tanh functions [
41]. The LeakyReLU activation function multiplies negative inputs by a small positive constant [
42], so neurons with negative inputs are not discarded and the information they carry is preserved.
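As a brief illustration, LeakyReLU can be written in one line. The slope of 0.01 used below is an assumed value chosen for the example; the text only states that negative values are multiplied by a small number.

```python
import numpy as np


def leaky_relu(x, alpha=0.01):
    """Pass non-negative values through; scale negative values by alpha (assumed 0.01)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * x)


activated = leaky_relu([-2.0, 0.0, 3.0])
```

Unlike ReLU, which maps every negative input to zero, the negative input here still produces a small non-zero output and therefore a non-zero gradient.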
The fully connected layer integrates the extracted features. A loss function is usually defined before the output layer. The loss function (here the RMSE; see Equation (4) in Section 3.3) is used to calculate the difference between the predicted and actual values at each iteration and so guide the next training step toward a smaller error. The output layer outputs the final predicted value.
In this study, a novel predictive method based on the 1-D CNN model was established to assess biodiversity using the Deep Learning Toolbox in MATLAB (MathWorks, Natick, MA, USA). Eight ecological factors were used as input parameters, and the output parameters were the grades of the number of individuals and the number of species of aquatic beetles. In total, 132 groups of experimental sample data were acquired; 112 groups formed the training set, and the remaining 20 groups served as both the testing set and the validation set (i.e., the testing and validation sets were identical). All three sets consisted of experimental data. This allowed the 1-D CNN aquatic beetle prediction model to be established.
3.2. Bayesian Optimization
The CNN mentioned earlier contained unknown hyperparameters, including the network structure, learning rate, number of training cycles, number of samples per batch, and size and number of convolutional kernels. Manually selecting and tuning these parameters would be very difficult and time-consuming, so a Bayesian optimization algorithm was used to automatically search for the best combination of hyperparameters. The selection process is shown in Equation (3).
$$x^{*} = \arg\min_{x \in X} f(x) \qquad (3)$$
where $x$ is a combination of parameters, $f(x)$ is the objective function (i.e., mean relative error), $x^{*}$ is the best combination of parameters, and $X$ is the range of parameter values.
A search range was established for each hyperparameter. The search ranges for the number of CNN layers, learning rate, number of epochs, and mini-batch size were 1–4, $10^{-4}$–1, 0–900, and 0–100, respectively. The optimization workflow for the deep learning model is shown in
Figure 5. First, a combination of hyperparameters $x_1$ was randomly selected from within the range of the hyperparameter values, then $x_1$ was fed into the network for training to give the corresponding objective function value $f(x_1)$. Next, the location of the next sampling point was determined using a Gaussian process: the probability distribution $p(f \mid x)$ corresponding to $f(x)$ was calculated, and $f(x_{t+1})$ corresponding to $x_{t+1}$ in the probability distribution was predicted using all of the ($x_i$, $f(x_i)$) inputs. The $x$ values that minimized the predicted objective function were used for training, and the objective function $f(x_{t+1})$ was calculated. Until the maximum number of iterations was reached, ($x_{t+1}$, $f(x_{t+1})$) was used as an input for the Gaussian process. These historical evaluation records (i.e., the mean relative errors) provided a reference for subsequent attempts, decreasing the time taken to identify the parameters, and were used to continually update the probabilistic model to give a new ($x_{t+1}$, $f(x_{t+1})$). Ultimately, the $x$ values corresponding to the minimum objective function value $f(x)$ were identified as the optimal hyperparameter combination $x^{*}$.
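The loop described above (random first point, Gaussian process update, choice of the next point, repeat) can be sketched in a few lines of NumPy. This is an illustrative toy under several assumptions, not the optimizer used in the study: a single hyperparameter scaled to [0, 1], an RBF kernel with a fixed length scale, a lower-confidence-bound rule for picking the next point, and a cheap stand-in function `toy_error` in place of training a CNN and measuring its mean relative error.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.15):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_seen, y_seen, x_query, noise=1e-6):
    """Gaussian-process posterior mean and standard deviation at x_query."""
    k_ss = rbf_kernel(x_seen, x_seen) + noise * np.eye(len(x_seen))
    k_sq = rbf_kernel(x_seen, x_query)
    y_mean = y_seen.mean()
    mean = y_mean + k_sq.T @ np.linalg.solve(k_ss, y_seen - y_mean)
    var = 1.0 - np.sum(k_sq * np.linalg.solve(k_ss, k_sq), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def bayes_opt(objective, n_iter=15, seed=0):
    """Minimize objective on [0, 1] with a GP surrogate and an LCB rule."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)            # candidate settings
    x_seen = rng.uniform(0.0, 1.0, size=1)       # random first combination
    y_seen = np.array([objective(x_seen[0])])
    for _ in range(n_iter):
        mean, std = gp_posterior(x_seen, y_seen, grid)
        x_next = grid[np.argmin(mean - 2.0 * std)]   # lower confidence bound
        x_seen = np.append(x_seen, x_next)
        y_seen = np.append(y_seen, objective(x_next))
    best = np.argmin(y_seen)
    return x_seen[best], y_seen[best]


def toy_error(x):
    """Stand-in for 'train the CNN and return its mean relative error'."""
    return (x - 0.3) ** 2 + 0.05


best_x, best_y = bayes_opt(toy_error)
```

The surrogate lets each new evaluation be placed where the model is either promising or still uncertain, which is why far fewer network trainings are needed than in an exhaustive grid search.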
3.3. Performance Indicator
To evaluate the effectiveness of the trained CNN model, the root mean square error (RMSE) was used as the accuracy metric (Equation (4)):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}} \qquad (4)$$
where $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and $n$ is the number of samples.
The RMSE measures the accuracy of the predictions: the smaller the RMSE, the better the prediction accuracy of the CNN model.
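For concreteness, the RMSE can be computed as in the following sketch. The function name and the example numbers are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))


# Hypothetical predicted and observed species counts for three samples
error = rmse([2, 3, 5], [1, 3, 7])
```

Because the errors are squared before averaging, a single large miss (here the difference of two species) contributes more than several small ones.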
4. VAEGAN Sample Augmentation Model
4.1. Overview
Generally, a CNN requires a large amount of data to train the network parameters and produce a reliable prediction model, but sufficient training data are often difficult to collect in the field. This greatly limits the use of deep learning methods to evaluate aquatic beetle diversity. To address this problem, we performed data augmentation using a VAEGAN to generate data with probability distributions very similar to those of the real samples. The 112 real training samples were fed into the VAEGAN, and 400 synthetic samples were produced. The VAEGAN was also implemented using the deep learning framework in MATLAB 2022a (MathWorks). To verify that the generated samples improved the prediction accuracy of the 1-D CNN, the test and validation sets were kept the same, and a training set containing all 512 real and synthesized samples was used to build the 1-D CNN-based VAEGAN model.
4.2. Training the VAEGAN
As shown in
Figure 6, the VAEGAN had two components, the VAE and the GAN, which had the same generator. The VAE captured the distributional features of the input data using an encoder and decoder and generated samples that matched the real data. At the same time, a discriminator in the GAN was applied to the VAE to maximize the ability of the method to determine whether the input was a real or generated sample and guide the generator to generate more realistic samples. This combination enhanced the expressiveness of the model and improved the quality and diversity of the generated samples. An intermediate loss function was used for cross-learning between the two networks, which incrementally improved the generative power of the model. The VAEGAN losses were
defined in terms of the following quantities: $m$ is the mini-batch size, $x$ is the input data, $\tilde{x}$ is the reconstructed data decoded from the prior distribution ($z$), $\mu$ and $\sigma$ are the mean and standard deviation (prior distribution) of the latent space, respectively, $x_p$ is the reconstructed data decoded from the normal distribution ($z_p$), and $D(x)$, $D(\tilde{x})$, and $D(x_p)$ are the outputs of the discriminator when the input is the original data $x$, the reconstructed data $\tilde{x}$, and the reconstructed data decoded from the normal distribution $x_p$, respectively.
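For orientation, the block below sketches one widely used way of writing VAE-GAN losses for a mini-batch in terms of the quantities just listed. It is an illustration under stated assumptions and should not be read as the losses actually used in this study: the symbols themselves, the squared-error reconstruction term, the weighting factor $\gamma$, the latent dimension $d$, and the non-saturating form of the decoder's adversarial term are all choices made here.

```latex
% Illustrative VAE-GAN losses (assumed standard form, not taken from this study)
% m: mini-batch size, d: latent dimension (assumed), gamma: weighting factor (assumed)
\begin{align}
L_{\mathrm{KL}}  &= -\frac{1}{2m}\sum_{i=1}^{m}\sum_{j=1}^{d}
                    \left(1 + \log\sigma_{ij}^{2} - \mu_{ij}^{2} - \sigma_{ij}^{2}\right) \\
L_{\mathrm{rec}} &= \frac{1}{m}\sum_{i=1}^{m}\left\lVert x_i - \tilde{x}_i \right\rVert^{2} \\
L_{\mathrm{D}}   &= -\frac{1}{m}\sum_{i=1}^{m}\left[\log D(x_i)
                    + \log\bigl(1 - D(\tilde{x}_i)\bigr)
                    + \log\bigl(1 - D(x_{p,i})\bigr)\right] \\
L_{\mathrm{enc}} &= L_{\mathrm{KL}} + L_{\mathrm{rec}} \\
L_{\mathrm{dec}} &= \gamma\, L_{\mathrm{rec}}
                    - \frac{1}{m}\sum_{i=1}^{m}\left[\log D(\tilde{x}_i) + \log D(x_{p,i})\right]
\end{align}
```

In this form the encoder is trained on $L_{\mathrm{enc}}$, the shared decoder/generator on $L_{\mathrm{dec}}$, and the discriminator on $L_{\mathrm{D}}$, so that the reconstruction and adversarial signals are balanced through $\gamma$.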
The aquatic beetle sample expansion process is shown in
Figure 6. First, the eight input parameters and the two output parameters for the original samples (the 112 samples in the training set) were concatenated to produce an overall matrix. In the VAE model, the encoder network transformed the input data into a latent representation described by a mean and a variance, a random latent vector was then sampled from this distribution, and the decoder network (with multiple transposed convolutional layers and other network layers) transformed it back into an output with the same dimension as the original input data. This output is often regarded as the output of the generative model. In the GAN model, the discriminator network took the real data and the data generated by the generator as inputs and performed binary discrimination. A CNN is often used as the discriminator in a GAN to automatically learn features of the input data and pass them to fully connected layers for classification, which produce the discriminator predictions (i.e., the output). During network optimization, the generator attempted to fool the discriminator into classifying a generated sample as a true sample, while the discriminator learned features of the true data distribution by minimizing the classification error, which continually improved its ability to distinguish between generated and true samples. Alternately training the discriminator and generator against each other caused the difference between generated and real samples to decrease continually until the two were indistinguishable, resulting in a powerful generative model.
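The concatenation step and the sampling of a random latent vector from the encoder's mean and variance can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: random placeholder values stand in for the 112 real samples, the latent dimension of 3 is arbitrary, and `reparameterize` is a hypothetical helper that uses the usual reparameterization trick, since the text does not say how the sampling was implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data standing in for the 112 real training samples (not real measurements)
factors = rng.uniform(0.0, 1.0, size=(112, 8))             # eight normalized inputs
targets = rng.integers(1, 7, size=(112, 2)).astype(float)  # two outputs per sample
samples = np.hstack([factors, targets])                    # overall 112 x 10 matrix


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(log_var / 2)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


# Hypothetical encoder outputs for a mini-batch of 4 samples, latent dimension 3
mu = np.zeros((4, 3))
log_var = np.zeros((4, 3))
z = reparameterize(mu, log_var, rng)
```

Generating a new synthetic sample then amounts to drawing `z` from a standard normal distribution and passing it through the trained decoder, which returns a row with the same ten columns as `samples`.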
5. Results
5.1. The List of Water Beetle Species in Nanshe Pool
A total of 39 aquatic beetle species were collected, including 1 Gyrinidae, 3 Noteridae, 16 Dytiscidae, and 19 Hydrophilidae (
Table 2).
Hydrovatus obtusus (Motschulsky) and
Crephelochares abnormalis (Sharp) were collected only occasionally, with only one individual of each recovered.
5.2. Bayesian Optimization of the CNN
The results of the Bayesian optimization of the 1-D CNN model based on 112 training sample sets are shown in
Figure 7. The Bayesian optimization results for the 1-D CNN-based VAEGAN model based on 512 training sample sets are shown in
Figure 8. In both cases, the optimal prediction model structure was established using the parameter combination with the lowest error. We further analyzed the influence of the network parameters on the results, which showed the following. (1) Errors were large when the number of training epochs (MaxEpochs) was set too low; increasing the number of training epochs avoided these large errors. (2) The mini-batch size should not be too large. (3) A low learning rate is recommended; when the learning rate was low, the error was small and close to 0. (4) A deeper network could give a smaller prediction error. With this parameter analysis, the best design (network structure and hyperparameters) of the CNN can be determined through Bayesian optimization. For the 1-D CNN, the model error was minimal when the number of convolutional layers was 1, the mini-batch size was 58, the number of epochs (MaxEpochs) was 742, and the learning rate was 0.0156. For the 1-D CNN-based VAEGAN, the model error was minimal when the number of convolutional layers was 1, the mini-batch size was 97, the number of epochs (MaxEpochs) was 501, and the learning rate was 0.0155. The detailed network structures of the 1-D CNN and the 1-D CNN-based VAEGAN are shown in
Table 3.
5.3. Performance Indicators of the CNN
When the 1-D CNN model and the 1-D CNN-based VAEGAN model each reached the maximum number of epochs (MaxEpochs), training stopped.
Figure 9 shows the RMSE of the 1-D CNN and 1-D CNN-based VAEGAN models with the best performance identified by Bayesian optimization. The curves decreased steadily during network training. The 1-D CNN model converged after 50 iterations, and the 1-D CNN-based VAEGAN model converged after 100 iterations with a smaller loss value than the 1-D CNN. This suggests that the 1-D CNN-based VAEGAN model was robust and well trained on the input data.
5.4. Predictions Made by the 1-D CNN Model
The results of the 1-D CNN (
Figure 10) indicated that the relative error in predicting the grade of the number of individuals ranged from 0.0 to 70.0%, with a mean of 26.0%. When the actual value was grade 1, the relative error ranged from 50.0 to 70.0%, with a mean of 62.5%. When the actual value was grade 2 or above, the relative error ranged from 0.0 to 50.0%, with a mean of 16.9%. The prediction accuracy was therefore less than 80.0%.
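The way per-sample relative errors combine into the reported means can be illustrated as below. Two things in this sketch are assumptions rather than statements from the text: the relative error is taken to be 100 × |predicted − actual| / actual, and the split of the 20 test samples into 4 grade-1 samples and 16 others is a hypothetical split that happens to reproduce the reported overall mean.

```python
def relative_errors(predicted, actual):
    """Per-sample relative error in percent (assumed definition)."""
    return [100.0 * abs(p - a) / a for p, a in zip(predicted, actual)]


def pooled_mean(group_means, group_sizes):
    """Overall mean obtained from subgroup means and subgroup sizes."""
    total = sum(group_sizes)
    return sum(m * n for m, n in zip(group_means, group_sizes)) / total


# Hypothetical predicted grades against actual grades 1, 2, and 3
errors = relative_errors([1.5, 2.0, 4.5], [1, 2, 3])

# Reported subgroup means with an ASSUMED 4/16 split of the 20 test samples
overall = pooled_mean([62.5, 16.9], [4, 16])
```

Under that assumed split the pooled value is 26.02%, consistent with the reported overall mean of 26.0%. It also shows why grade-1 samples dominate the error: with an actual value of 1, even a half-grade miss is a 50% relative error.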
The number of species predicted by the 1-D CNN model is shown in
Figure 11. The absolute differences between the predicted and actual values were calculated, and
Figure 12 was drawn to show the proportions of the differences. When the 1-D CNN was used to predict the number of species, 5.0% of the predicted values were the same as the actual values, 30.0% of the predicted values were one species different from the actual values, and 35.0% of the predicted values were two species different from the actual values. Field sampling inevitably leads to some errors, so we considered a difference of no more than two species to be a relatively reliable prediction. By this criterion, 70.0% of the predictions made by the 1-D CNN were reliable, but the accuracy of the prediction model was low.
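The reliability figure is simply the sum of the shares of predictions that differ from the actual value by zero, one, or two species. The sketch below illustrates this with hypothetical helper names; rounding the predictions to whole species before taking the difference is an assumption.

```python
from collections import Counter


def difference_shares(predicted, actual):
    """Percentage of samples at each absolute difference (in whole species)."""
    diffs = [abs(round(p) - a) for p, a in zip(predicted, actual)]
    n = len(diffs)
    return {d: 100.0 * c / n for d, c in sorted(Counter(diffs).items())}


def reliability(shares, tolerance=2):
    """Total share of predictions within `tolerance` species of the actual value."""
    return sum(share for diff, share in shares.items() if diff <= tolerance)


# Shares reported in the text for the 1-D CNN (differences of 0, 1, and 2 species)
reported_cnn = {0: 5.0, 1: 30.0, 2: 35.0}
cnn_reliability = reliability(reported_cnn)

# Hypothetical predictions, to show how the shares themselves are obtained
toy_shares = difference_shares([3.2, 5.0, 7.6, 4.0], [3, 6, 5, 4])
```

For the reported shares the sum is 70.0%, which leaves 30.0% of the test predictions more than two species away from the observed value.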
5.5. Predictions Made by the 1-D CNN-Based VAEGAN Model
As shown in
Figure 10,
Figure 11 and
Figure 12, there was a large deviation between the values predicted by the 1-D CNN model and the actual values. This was caused by the relatively small amount of data collected in the field. The prediction accuracy was improved by establishing a 1-D CNN-based VAEGAN to predict the grades of the number of individuals and the number of species, and the results are shown in
Figure 13,
Figure 14 and
Figure 15.
The results of the 1-D CNN-based VAEGAN (
Figure 13) showed that the relative error in predicting the grade of the number of individuals ranged from 0.0 to 60.0%, with a mean of 14.0%. When the actual value was grade 1, the relative error ranged from 0.0 to 60.0%, with a mean of 17.5%. When the actual value was grade 2 or above, the relative error ranged from 0.0 to 30.0%, with a mean of 13.1%. Overall, the mean relative error of the 1-D CNN-based VAEGAN was much lower than that of the 1-D CNN before sample expansion.
The number of species predicted by the 1-D CNN-based VAEGAN model is shown in
Figure 14. The absolute differences between the predicted and actual values were calculated, and
Figure 15 was drawn to show the proportions of the differences. When the 1-D CNN-based VAEGAN was used to predict the number of species, 25.0% of the predicted values were the same as the actual values, 30.0% of the predicted values were one species different from the actual values, and 30.0% of the predicted values were two species different from the actual values. Field sampling inevitably leads to some errors, so we considered a difference of no more than two species to be a relatively reliable prediction. By this criterion, 85.0% of the predictions made by the 1-D CNN-based VAEGAN were reliable. Compared with the 1-D CNN before sample expansion, the accuracy of the 1-D CNN-based VAEGAN in predicting the number of species improved markedly.
A few very rare species (VRS) were sampled only once, with only one individual collected during the year of sampling. To reduce the impact of VRS on the community predictions, a new model based on the 1-D CNN-based VAEGAN was established with such data excluded.
The results of the 1-D CNN-based VAEGAN after excluding VRS (
Figure 16) showed that the relative error of predicting the grade of the number of individuals ranged from 0.0 to 40.0%, and the average of relative error was 12.0%. When the actual value was grade 1, the relative error ranged from 0.0% to 50.0%, and the average relative error was 10.0%. When the actual value was grade 2 or above, the relative error ranged from 0.0 to 30.0%, and the average of relative error of prediction was 12.5%.
The number of species predicted by the 1-D CNN-based VAEGAN model after excluding VRS is shown in
Figure 17. The absolute differences between the predicted and actual values were calculated, and
Figure 18 was drawn to show the proportions of the differences. When the 1-D CNN-based VAEGAN was used after excluding VRS to predict the number of species, 25.0% of the predicted values were the same as the actual values, 35.0% of the predicted values were one species different from the actual values, and 25.0% of the predicted values were two species different from the actual values. Field sampling inevitably leads to some errors, so we considered a difference of no more than two species to be a relatively reliable prediction. By this criterion, 85.0% of the predictions made by the 1-D CNN-based VAEGAN after excluding VRS were reliable.
6. Conclusions and Discussion
Previous studies using ANNs for benthic invertebrate communities have mainly been aimed at predicting populations and population densities in rivers. Various ecological factors (e.g., altitude, dissolved oxygen concentration, vegetation parameters, and water temperature) are usually used as input variables to allow for predicting the presence and abundance of benthic invertebrates [
43]. In fact, many ecological factors can affect the aquatic beetle community structure, and the same ecological factors play different roles in different geographical regions. A predictive model based on ecological factors for one habitat may not be used to make predictions for a community in another habitat. Ecological factors for assessing or making predictions about insect communities should be selected after collecting samples and analyzing the data using reliable methods. The ecological factors used in this study were selected after assessing previously published research, so relatively ideal results were obtained.
Machine learning has often been used to predict the populations of some insect species. Ecological factors affect each individual in a population in a stable and regular way, so these studies have achieved good results [
13,
44]. A community is composed of many species, and each species has its own particular biological characteristics, so ecological factors affect each species differently. For example, a change in one factor can have different effects on the life histories and developmental stages of different species. Moreover, the probability of being collected differs among species even if the sampling method is the same. It is therefore acceptable that the prediction accuracy for a community was lower than that typically achieved for a single population.
We developed a nonlinear 1-D CNN prediction model based on field sampling data to study aquatic beetle communities in stagnant pools. The results indicated that it is feasible to use a 1-D CNN approach to assess or predict the grade of the number of individuals and the number of species in an aquatic beetle community. The reliability of the 1-D CNN prediction model was affected by the number of samples. It is feasible to predict or assess a population when the accuracy is >80% [
45,
46]. Our results were not ideal when 112 datasets were used to establish the 1-D CNN model. However, the accuracy for the number of species and the grade of the number of individuals were 85.0% and 86.0%, respectively, when sufficient data (>500 datasets) were used after expanding the data using the VAEGAN. We suggest that at least 500 datasets should be used to study insect communities using the 1-D CNN method.
The actual value was the experimental data obtained in the field. If the population density of an aquatic beetle was very low and examples of the species were rarely collected, the number of aquatic beetles was grade 1. However, individuals of such a species can occasionally be collected during a long-term sampling campaign. The effect of such a species can be ignored when using the 1-D CNN model to predict community richness, although such a species will have an important impact on species diversity in the aquatic environment.
When assessing an insect community using a neural network, the total number of all individuals is usually used as an output variable and only quantitative information about the community is assessed or predicted [
1,
2]. The number of genera has also been used as an output variable when predicting community richness [
36]. We used the 1-D CNN model to make aquatic beetle community predictions using both individual quantity information and species composition information as output variables. Compared with the findings of Park [
2] and Edia [
36], we obtained a better result for community diversity.
A few beetle species will be dominant in a lentic habitat, and more individuals of these species than of other species will be collected at each sampling time. The population density of some species will be very low, and few or no individuals will be collected. Errors are unavoidable during sampling, so the number of individuals of each species should be graded to decrease the possibility that rare species are excluded, or that their roles are obscured, because of the large numbers of individuals of dominant species that are collected. We therefore graded individuals before making predictions. The prediction accuracy for the grades of the number of individuals was 86.0% when VRS were not excluded and 88.0% when VRS were excluded; in both cases, it was >80.0%. The prediction accuracy for the number of species was similar with and without excluding VRS (80.0%). This indicates that grading the numbers of individuals was effective when studying the diversity of the aquatic beetle community.
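The grading step described above can be sketched as follows. This is only an illustration: the grade boundaries, the use of grade 0 for species that were not collected, and the example counts are hypothetical placeholders, not the thresholds or data used in this study. The sketch shows how binning raw counts compresses the gap between dominant and rare species.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds for grades 1..4; counts above the last
# bound fall into grade 5. These values are illustrative assumptions only.
GRADE_UPPER_BOUNDS = [2, 10, 50, 200]


def grade_count(count):
    """Map a raw number of collected individuals to an ordinal grade."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return 0  # assumption: species not collected at this sampling time
    # bisect_left returns the index of the first bound >= count
    return bisect_left(GRADE_UPPER_BOUNDS, count) + 1


def grade_sample(counts):
    """Grade every species in one sample (species name -> raw count)."""
    return {species: grade_count(n) for species, n in counts.items()}


# Made-up counts for one sampling time
sample = {"dominant sp.": 340, "common sp.": 18, "rare sp.": 1, "absent sp.": 0}
graded = grade_sample(sample)
print(graded)
```

With these placeholder bounds, a dominant species with hundreds of individuals and a rare species with a single individual differ by only four grades, so the rare species still contributes to the model input.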
Starting from the experimental data, the VAEGAN method was used to generate additional data, which made the predictions more reliable. We conclude that the accuracy of the 1-D CNN for predicting the aquatic beetle species number and abundance from relevant environmental factors can be improved using VAEGAN to expand the experimental data.
However, further research is needed to study the predictive modeling of biodiversity:
- (1) Prediction models need to be built using other machine learning methods (such as tree-based models) and more complex network frameworks. The SHapley Additive exPlanations (SHAP) method can be applied to a machine learning model to analyze how much each sample or feature contributes to the corresponding predicted value and thus to identify the factors to which the aquatic beetle community is most sensitive.
- (2) The existing model will be extended to other natural standing waters in the same area for verification. The ultimate goal is to be able to predict insect diversity in a standing-water environment by monitoring ecological factors.
- (3) The approach can be extended further to relatively complex terrestrial environments and to marine biodiversity without human disturbance.
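
As a rough illustration of the attribution idea behind SHAP mentioned in item (1), the sketch below computes exact Shapley values for a tiny model by enumerating feature coalitions. It is written in plain Python and does not use the `shap` library. The function `shapley_values`, the toy model, the feature names, and the baseline are all hypothetical and are not part of this study.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this is practical only for a few features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical response: additive in temperature, interaction of pH and salinity."""
    temperature, ph, salinity = z
    return 2.0 * temperature + ph * salinity


x = [1.0, 2.0, 3.0]          # made-up feature values
baseline = [0.0, 0.0, 0.0]   # made-up reference point
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is shared equally between the two features involved. For real models with many ecological factors, an approximate estimator would be needed instead of full enumeration.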
Author Contributions
Methodology, M.H., S.J., F.J., X.Y. and Z.L.; Software, M.H.; Validation, M.H. and X.Y.; Formal analysis, M.H., S.J., F.J. and X.Y.; Investigation, S.J.; Resources, F.J. and Z.L.; Data curation, S.J. and Z.L.; Writing-original draft, M.H. and S.J.; Writing-review & editing, F.J., X.Y. and Z.L.; Visualization, M.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the GDAS Special Project of Science and Technology Development (grant nos. 2020GDASYL-20200301003 and 2020GDASYL-20200102021).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Some or all data, models, or codes generated or used during the study are available from the corresponding author by request.
Acknowledgments
We are indebted to Robert Angus, aquatic beetle specialist, at the Natural History Museum, London, UK, for reviewing the manuscript. We warmly thank Zuqi Mai and Zhuoyin Jiang at the School of Life Sciences, Sun Yat-sen University, Guangzhou, China, for assisting with fieldwork. We are particularly grateful to Shuai Teng and Mansheng Lin, at the Faculty of Civil and Transportation Engineering, Guangdong University of Technology, Guangzhou, China, for providing great help relating to artificial neural networks. This research was supported by the GDAS Special Project of Science and Technology Development (grant nos. 2020GDASYL-20200301003 and 2020GDASYL-20200102021). The authors declare no conflict of interest. We thank Gareth Thomas, PhD, from Liwen Bianji (Edanz) (https://www.liwenbianji.cn, accessed on 3 July 2023), for editing the language of a draft of this manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Obach, M.; Wagner, R.; Werner, H.; Schmidt, H.-H. Modelling population dynamics of water insects with artificial neural networks. Ecol. Model. 2001, 146, 207–217.
2. Park, Y.; Céréghino, R.; Compin, A.; Lek, S. Applications of artificial neural networks for patterning and predicting water insect species richness in running waters. Ecol. Model. 2003, 160, 265–280.
3. Aitkenhead, M.; Mustard, M.J.; McDonald, A. Using neural networks to predict spatial structure in ecological systems. Ecol. Model. 2004, 179, 393–403.
4. Elmendorf, S.C.; Moore, K.A. Use of Community Composition Data to Predict the Fecundity and Abundance of Species. Conserv. Biol. 2008, 22, 1523–1532.
5. Rocha, J.C.; Peres, C.K.; Buzzo, J.L.L.; de Souza, V.; Krause, E.A.; Bispo, P.C.; Frei, F.; Costa, L.S.M.; Branco, C.C.Z. Modeling the species richness and abundance of lotic macroalgae based on habitat characteristics by artificial neural networks: A potentially useful tool for stream biomonitoring programs. J. Appl. Phycol. 2017, 29, 2145–2153.
6. Cho, J.H.; Lee, J.H. Multiple Linear Regression Models for Predicting Nonpoint-Source Pollutant Discharge from a Highland Agricultural Region. Water 2018, 10, 1156.
7. Croteau, K.; Ryan, A.C.; Santore, R.; DeForest, D.; Schlekat, C.; Middleton, E. Comparison of Multiple Linear Regression and Biotic Ligand Models to Predict the Toxicity of Nickel to Aquatic Freshwater Organisms. Environ. Toxicol. 2021, 40, 2189–2205.
8. Piekutowska, M.; Niedbała, G.; Piskier, T.; Lenartowicz, T.; Pilarski, K.; Wojciechowski, T.; Pilarska, A.A.; Czechowska-Kosacka, A. The Application of Multiple Linear Regression and Artificial Neural Network Models for Yield Prediction of Very Early Potato Cultivars before Harvest. Agronomy 2021, 11, 885.
9. Pinna, M.S.; Mattana, E.; Cañadas, E.M.; Bacchetta, G. Effects of pre-treatments and temperature on seed viability and germination of Juniperus macrocarpa Sm. Comptes Rendus Biol. 2014, 337, 338–344.
10. Matsumura, K.; Gaitan, C.; Sugimoto, K.; Cannon, A.; Hsieh, W. Maize yield forecasting by linear regression and artificial neural networks in Jilin, China. J. Agric. Sci. 2015, 153, 399–410.
11. Nikiforov, A.; Kuchumov, A.; Terentev, S.; Karamulina, I.; Romanova, I.; Glushakov, S. Neural network method as means of processing experimental data on grain crop yields. E3S Web Conf. 2020, 161, 01031.
12. Cocu, N.; Harrington, R.; Rounsevell, M.D.A.; Worner, S.P.; Hullé, M.; Participants, T.E.P. Geographical location, climate and land use influences on the phenology and numbers of the aphid, Myzus persicae, in Europe. J. Biogeogr. 2005, 32, 615–632.
13. Skawsang, S.; Nagai, M.; Tripathi, N.K.; Soni, P. Predicting Rice Pest Population Occurrence with Satellite-Derived Crop Phenology, Ground Meteorological Observation, and Machine Learning: A Case Study for the Central Plain of Thailand. Appl. Sci. 2019, 9, 4846.
14. Wagner, R.; Dapper, T.; Schmidt, H.H. The influence of environmental variables on the abundance of water insects: A comparison of ordination and artificial neural networks. Hydrobiologia 2000, 422, 143–152.
15. Watts, M.J.; Worner, S.P. Comparing ensemble and cascaded neural networks that combine biotic and abiotic variables to predict insect species distribution. Ecol. Inform. 2008, 3, 354–366.
16. Dyrmann, M.; Karstoft, H.; Midtiby, H.S. Plant species classification using deep convolutional neural network. Biosyst. Eng. 2016, 151, 72–80.
17. Hansen, O.L.P.; Svenning, J.; Olsen, K.; Dupont, S.; Garner, B.H.; Iosifidis, A.; Price, B.W.; Høye, T.T. Species-level image classification with convolutional neural network enables insect identification from habitus images. Ecol. Evol. 2020, 10, 737–747.
18. Fang, H.M.; Zhang, S.T.; Ding, W.K. Study on forecasting the yield in Maize regional test based on BP Neural Network. J. Anhui Agric. Sci. 2007, 34, 10969–10970.
19. Karuna, G.; Pravallika, K.; Anuradha, K.; Srilakshmi, V. Convolutional and Spiking Neural Network Models for Crop Yield Forecasting. E3S Web Conf. 2021, 309, 01162.
20. Arinichev, I.V.; Polyanskikh, S.V.; Arinicheva, I.V.; Sergeeva, I.O. Applications of convolutional neural networks for the detection and classification of fungal rice diseases. IOP Conf. Ser. Earth Environ. Sci. 2021, 699, 012020.
21. Dhaka, V.S.; Meena, S.V.; Rani, G.; Sinwar, D.; Kavita, K.; Ijaz, M.F.; Woźniak, M. A Survey of Deep Convolutional Neural Networks Applied for Prediction of Plant Leaf Diseases. Sensors 2021, 21, 4749.
22. Grünig, M.; Razavi, E.; Calanca, P.; Mazzi, D.; Wegner, J.D.; Pellissier, L. Applying deep neural networks to predict incidence and phenology of plant pests and diseases. Ecosphere 2021, 12, e03791.
23. Zhang, H.; Zimba, P.V. Analyzing the effects of estuarine freshwater fluxes on fish abundance using artificial neural network ensembles. Ecol. Model. 2017, 359, 103–116.
24. Wagner, R.; Obach, M.; Werner, H.; Schmidt, H. Artificial neural nets and abundance prediction of aquatic insects in small streams. Ecol. Inform. 2006, 1, 423–430.
25. Case, E.; Shragai, T.; Harrington, L.; Ren, Y.; Morreale, S.; Erickson, D. Evaluation of Unmanned Aerial Vehicles and Neural Networks for Integrated Mosquito Management of Aedes albopictus (Diptera: Culicidae). J. Med. Entomol. 2020, 57, 1588–1595.
26. Wan, Z.; Zhang, Y.; He, H. Variational autoencoder based synthetic data generation for imbalanced learning. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–7.
27. Liu, H.; Zhou, J.; Xu, Y.; Zheng, Y.; Peng, X.; Jiang, W. Unsupervised fault diagnosis of rolling bearings using a deep neural network based on generative adversarial networks. Neurocomputing 2018, 315, 412–424.
28. Siahkoohi, A.; Kumar, R.; Herrmann, F. Seismic data reconstruction with generative adversarial networks. In Proceedings of the 80th EAGE Conference and Exhibition 2018, Copenhagen, Denmark, 11–14 June 2018; pp. 1–5.
29. Cummins, K.W.; Merritt, R.W.; Berg, M.B. Ecology and distribution of aquatic insects. In An Introduction to the Aquatic Insects of North America, 5th ed.; Merritt, R.W., Cummins, K.W., Berg, M.B., Eds.; Kendall Hunt Publishing Company: Dubuque, IA, USA, 2019; pp. 117–140.
30. Jiang, Z.Y.; Zhao, S.; Jia, F.L.; Šťastný, J. Two new species of Platynectes Régimbart, 1879 from China with notes on other Chinese members of the genus, including a key to species (Coleoptera: Dytiscidae: Agabinae). Zootaxa 2023, 5227, 401–425.
31. Mai, Z.; Hu, J.; Jia, F. Additional fauna of Coelostoma Brullé, 1835 from China, with re-establishment of Coelostoma sulcatum Pu, 1963 as a valid species (Coleoptera, Hydrophilidae, Sphaeridiinae). ZooKeys 2022, 1091, 15.
32. Liang, Z.; Angus, R.B.; Jia, F. Three new species of Patrus Aubé with additional records of Gyrinidae from China (Coleoptera, Gyrinidae). Eur. J. Taxon. 2021, 767, 1–39.
33. Jia, F.; Wang, S.; Aston, P. Revision of Chaetarthria Stephens (Coleoptera: Hydrophilidae) in China, with a key to the species in the Oriental Region. J. Nat. Hist. 2018, 52, 2369–2384.
34. Jia, F. A revisional study of the Chinese species of Amphiops Erichson (Coleoptera, Hydrophilidae, Chaetarthriini). J. Nat. Hist. 2014, 48, 1085–1101.
35. Jia, F.; Xu, R. Applying four numerical methods to analyze aquatic insects diversity in rice fields. Zhongshan Da Xue Xue Bao Zi Ran Ke Xue Ban = Acta Sci. Nat. Univ. Sunyatseni 2002, 41, 73–76.
36. Céréghino, R.; Park, Y.S.; Compin, A.; Lek, S. Predicting the species richness of water insects in streams using a limited number of environmental variables. J. North Am. Benthol. Soc. 2003, 22, 442–456.
37. Edia, E.O.; Gevrey, M.; Ouattara, A.; Brosse, S.; Gourène, G.; Lek, S. Patterning and predicting water insect richness in four West-African coastal rivers using artificial neural networks. Knowl. Manag. Water Ecosyst. 2010, 398, 06.
38. Gutiérrez, J.C.; Bilton, D.T. A heuristic approach to predicting water beetle diversity in temporary and fluctuating waters. Ecol. Model. 2010, 221, 1451–1462.
39. Min, W.; Liu, B.; Foroosh, H. Look-up table unit activation function for deep convolutional neural networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018.
40. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010.
41. Krizhevsky, A.; Sutskever, I. Imagenet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012.
42. Zhou, L.L. Improved recommendation system based on social trust relation. Comput. Appl. Softw. 2014, 31, 31–35.
43. Goethals, P.; Dedecker, A.P.; Gabriels, W.; Lek, S.; De Pauw, N. Applications of artificial neural networks predicting macro invertebrates in freshwater. Water Ecol. 2007, 41, 491–508.
44. Yu, D.X.; Lin, L.F.; Luo, L.; Zhou, W.; Gao, L.-L.; Chen, Q.; Yu, S.-Y. Establishment of an artificial neural network model for analysis of the influence of climate factors on the density of Aedes albopictus. J. South. Med. Univ. 2010, 30, 1604–1605.
45. Yang, S.X.; Zhao, H.Y.; Bao, X.H. A study on the forecast model of Dendrolimus superans butler occurrence based on artificial neural network. Chin. Agric. Sci. Bull. 2014, 30, 72–75.
46. Kim, K.; Hyun, J.; Kim, H.; Lim, H.; Myung, H. A Deep Learning-Based Automatic Mosquito Sensing and Control System for Urban Mosquito Habitats. Sensors 2019, 19, 2785.
Figure 1. Map of sampling location.
Figure 2. Convolutional neural network structure.
Figure 3. Convolution process.
Figure 4. Activation function: (a) sigmoid; (b) tanh; (c) ReLU; (d) LeakyReLU.
Figure 5. Bayesian optimization workflow.
Figure 6. Overview of a VAEGAN.
Figure 7. Hyperparameter optimization results for the 1-D CNN.
Figure 8. Hyperparameter optimization results for the 1-D CNN-based VAEGAN.
Figure 9. The RMSE achieved by the model with the best performance.
Figure 10. Grades of the numbers of individuals predicted by the 1-D CNN and measured in the field.
Figure 11. Numbers of species predicted by the 1-D CNN and the actual numbers of species.
Figure 12. Proportions of the actual number of species and the number of species predicted by the 1-D CNN that were one to six species different.
Figure 13. Grades of the number of individuals predicted by the 1-D CNN-based VAEGAN and the actual grades of the number of individuals.
Figure 14. Number of species predicted by the 1-D CNN-based VAEGAN and the actual number of species.
Figure 15. Proportions of the actual number of species and the number of species predicted by the 1-D CNN-based VAEGAN that were one to six species different.
Figure 16. Grades of the number of individuals predicted by the 1-D CNN-based VAEGAN after excluding VRS and the actual grades of the number of individuals.
Figure 17. Number of species predicted by the 1-D CNN-based VAEGAN after excluding VRS and the actual number of species.
Figure 18. Proportions of the actual number of species and the number of species predicted by the 1-D CNN-based VAEGAN after excluding VRS, with a difference of one to six species.
Table 1. Experimental materials and equipment.

| Materials and Equipment | Brand and Model |
|---|---|
| pH test paper | Shanghai Sanaisi Reagent Company Limited (Shanghai, China); Precision paper pH 5.5–9.0, B-extensive paper pH 1–14 |
| Pen salinometer | Hong Kong Sigam Instrument Group Company Limited (Hong Kong, China); AR8012 |
| Thermometer | MITIR Electric Company Limited (Bangalore, India); Electronic folding thermometer TP502 |
| Binocular dissecting microscope | Nikon (Tokyo, Japan) SMZ1270 |
| D-type water net | 26 cm × 28 cm, net depth of about 40 cm, 120 mesh |

Table 2. Aquatic beetles of the pool, Nanshe village, Shenzhen (November 2020–October 2021). A plus sign (+) marks a month in which the species was collected.

| Family | Species | Nov. 2020 | Dec. 2020 | Jan. 2021 | Feb. 2021 | Mar. 2021 | Apr. 2021 | May 2021 | Jun. 2021 | Jul. 2021 | Aug. 2021 | Sep. 2021 | Oct. 2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dytiscidae | Copelatus oblitus (Sharp) | + | | | | | | | | | + | | + |
| | Cybister rugosus (MacLeay) | | | | + | | + | + | | | + | | |
| | Cybister sugillatus (Erichson) | | + | | | + | + | + | + | | | | + |
| | Cybister tripunctatus lateralis (Fabricius) | | + | | | + | + | | + | | | | + |
| | Hydaticus luczonicus (Aubé) | + | + | | | | + | | | + | | + | + |
| | Hydaticus rhantoides (Sharp) | + | + | | | | + | | | | + | | + |
| | Hydaticus vittatus (Fabricius) | | | | | | | | + | | + | | |
| | Hydrovatus confertus (Sharp) | + | + | + | + | + | | | + | | + | + | + |
| | Hydrovatus obtusus (Motschulsky) | + | | | + | | | | | | + | | |
| | Hydrovatus pudicus (Clark) | + | + | + | + | | | | + | | | + | |
| | Hydrovatus sp. | | + | | | | | | | | + | | |
| | Hydroglyphus sp. | + | + | | | | | | + | + | + | + | + |
| | Laccophilus ellipticus (Régimbart) | + | + | + | + | + | + | + | + | | | | + |
| | Laccophilus transversalis lituratus (Sharp) | | + | + | | | | | | | | | + |
| | Laccophilus sp. | + | | | | | | | | | | | |
| | Leiodytes sp. | | + | + | | | | | | + | | + | + |
| Noteridae | Canthydrus sp. | | + | | | | | | | | | | |
| | Neohydrocoptus bivittis (Motschulsky) | | + | + | + | + | + | | | | | | + |
| | Neohydrocoptus rubescens (Clark) | + | + | | + | | | | | | + | | |
| Gyrinidae | Patrus productus (Régimbart) | + | + | + | + | + | + | + | | + | + | + | |
| Hydrophilidae | Agraphydrus activus (Komarek and Hebauer) | + | + | | | + | | | | | | | |
| | Agraphydrus masatakai (Minoshima, Komarek and Ôhara) | + | | + | | | | | | | | | + |
| | Agraphydrus sp. | | | | | | | | | | | | + |
| | Amphiops mater (Sharp) | + | + | + | + | + | + | + | + | + | + | | + |
| | Crephelochares abnormalis (Sharp) | + | + | | | | | | | | + | | |
| | Coelostoma fallaciosum (d'Orchymont) | + | + | + | + | + | | | + | + | | + | + |
| | Enochrus esuriens (Walker) | + | + | + | + | + | + | + | + | + | + | + | + |
| | Enochrus flavicans (Régimbart) | + | | + | | | + | | + | | | | + |
| | Enochrus sp. | | + | | | | | + | | | | | |
| | Helochares atropiceus (Régimbart) | | + | + | | + | | | + | | + | | |
| | Helochares densus (Sharp) | + | + | | | | | + | | + | + | + | + |
| | Helochares guoi (Yang and Jia) | | + | | | | | | | | | | |
| | Helochares minusculus (d'Orchymont) | + | | | | + | | | | | + | + | |
| | Helochares neglectus (Hope) | | | | | | | + | | | | + | |
| | Helochares pallens (MacLeay) | + | + | | + | | | + | + | + | + | + | + |
| | Hydrobiomorpha spinicollis (Eschscholtz) | + | + | | | | + | | | | + | + | + |
| | Paracymus orientalis (d'Orchymont) | + | + | + | + | | | + | | | + | | + |
| | Paracymus sp. | | + | | | | | | | | | | |
| | Sternolophus rufipes (Fabricius) | + | + | + | | | + | + | + | | | + | + |

Table 3. Structural parameters of the CNN.

| Parameter | Type | Input | Convolution | FC | Output |
|---|---|---|---|---|---|
| Kernel number | 1-D CNN | - | 16 | - | - |
| Kernel number | 1-D CNN-based VAEGAN | - | 24 | - | - |
| Kernel size | 1-D CNN | - | [2,1] | - | - |
| Kernel size | 1-D CNN-based VAEGAN | - | [4,1] | - | - |
| Stride | 1-D CNN | - | 1 | - | - |
| Stride | 1-D CNN-based VAEGAN | - | 1 | - | - |
| Padding | 1-D CNN | - | 0 | - | - |
| Padding | 1-D CNN-based VAEGAN | - | 0 | - | - |
| Activation | 1-D CNN | - | LeakyReLU | - | - |
| Activation | 1-D CNN-based VAEGAN | - | LeakyReLU | - | - |
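
To make the convolution settings in Table 3 concrete, the following sketch shows how kernel size, stride, and padding determine the length of the feature map produced by one 1-D convolution kernel, followed by a LeakyReLU activation. It is a plain-Python illustration, not the model used in the study. It reads the kernel sizes [2,1] and [4,1] as lengths 2 and 4 along the feature axis, which is an assumption, and the number of input factors (`N_FACTORS`), the LeakyReLU slope of 0.01, and the example values are also hypothetical.

```python
def conv1d_output_length(n_in, kernel_size, stride=1, padding=0):
    """Length of a 1-D convolution output: floor((n + 2p - k) / s) + 1."""
    return (n_in + 2 * padding - kernel_size) // stride + 1


def leaky_relu(v, negative_slope=0.01):
    """LeakyReLU; the slope 0.01 is an assumed default, not taken from the paper."""
    return v if v >= 0 else negative_slope * v


def conv1d(signal, kernel, stride=1, padding=0, bias=0.0):
    """Apply one kernel (cross-correlation, as in most CNN libraries) then LeakyReLU."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    n_out = conv1d_output_length(len(signal), len(kernel), stride, padding)
    out = []
    for t in range(n_out):
        start = t * stride
        window = padded[start:start + len(kernel)]
        out.append(leaky_relu(sum(w * k for w, k in zip(window, kernel)) + bias))
    return out


N_FACTORS = 8  # hypothetical number of ecological factors per sample

# Feature-map lengths for the two kernel sizes in Table 3 (stride 1, padding 0)
len_cnn = conv1d_output_length(N_FACTORS, kernel_size=2)
len_vaegan = conv1d_output_length(N_FACTORS, kernel_size=4)
print(len_cnn, len_vaegan)

# One made-up kernel applied to made-up, already scaled inputs
features = [0.5, -1.0, 2.0, 0.0]
feature_map = conv1d(features, kernel=[1.0, -1.0])
print(feature_map)
```

With stride 1 and no padding, each kernel shortens the input by the kernel size minus one, so the 16 (or 24) kernels each yield one feature map of that reduced length before the fully connected layer.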

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).