1. Introduction
Phytoplankton community composition and abundance are often used in assessments of recreational, aquaculture, and drinking water quality. Long-term monitoring studies conducted in marine and estuarine waters used for aquaculture activities [
1,
2] and in freshwater lakes and reservoirs used to provide drinking water and recreational areas [
3,
4,
5] have demonstrated distinctive relationships between certain phytoplankton community constituents and water temperature, salinity, and nutrient concentrations. However, long-term studies of phytoplankton community composition that examine similar relationships in small-bodied agricultural irrigation waters are lacking.
The examination of water for phytoplankton community composition and abundance is a time-consuming activity that relies on the expertise of well-trained phytoplankton taxonomists or automated technologies, such as flow cytometry, that may be cost-prohibitive to many water quality management programs [
6,
7,
8]. Satellite imagery has proven useful for monitoring phytoplankton community structure in large lakes (>24,000 acres, [
9]) but does not yet have the spatial scale needed to remotely observe smaller bodies of water that are increasingly being used in agricultural irrigation applications [
10]. Hence, alternative techniques are being explored to examine the relationships between more easily measured water quality parameters (i.e., temperature, chlorophyll-a, and specific conductance) and phytoplankton community composition and abundance. The presence of such relationships makes the use of regression analysis feasible for predicting phytoplankton community structure and concentrations using measured water quality parameters. Regression analyses have been used to predict the occurrence of bloom-forming cyanobacteria in shallow lakes [
11,
12], green algae in reservoirs [
13], and diatoms in estuaries, rivers, and lakes [
14,
15], as well as to evaluate overall irrigation water quality [
16,
17], but these two parameters have not been examined as distinct input variables within the same mathematical model.
Regression analyses were used to successfully predict the composition of phytoplankton communities in a drinking water reservoir near Beijing, China, that had an area of greater than 44,000 acres [
18]. However, as noted by Cheruvelil et al. [
19], scale and regionalization are important factors to consider when conducting water quality assessments and applying water quality standards. Models such as those reported by Zeng et al. [
18] are only beginning to be constructed for the freshwater reaches of the Chesapeake Bay watershed [
20,
21], yet even these efforts do not reflect water specifically designated for agricultural uses. Recently, machine learning has provided several versatile techniques for establishing models of ‘phytoplankton–water quality’ relationships. The random forest machine learning algorithm was specifically chosen for its ability to elucidate nonlinear relationships between input variables and because of its built-in mechanism to limit potential overfitting of the model. The objective of this work was to evaluate the performance of the random forest algorithm in estimating the phytoplankton community structure from in situ water quality measurements of different complexities obtained during three years of spatially intensive observations at two 1-acre agricultural irrigation ponds.
Phytoplankton community structure has long been used to assess trophic changes in aquatic systems [
22] with shifts from green algae-dominated communities to cyanobacteria-dominated communities indicating eutrophic conditions [
23,
24]. Equally important is the influence various phytoplankton groups can have on water chemistry [
25,
26], especially carbon cycling [
27]. For this study, three phytoplankton groups were considered critical to assess in relation to water quality parameters due to their abundance within local freshwater phytoplankton populations. Previous studies by Parson and Parker [
28] and Marshall [
29,
30] demonstrated that 70–80% of regional freshwater lake phytoplankton community structure was composed of green algae (Chlorophytes), diatoms (Bacillariophytes), and cyanobacteria (Cyanophytes). Due to the harmful and potentially toxic effects of cyanobacteria blooms on human and environmental health, the detection, prediction, and modeling of these blooms have become a focus for resource managers [
31,
32,
33]. Additionally, there is growing concern about the risk that cyanotoxins may pose to the agriculture industry through the transfer of cyanotoxins from irrigation waters to crops and livestock, particularly as climate change increases the occurrence and toxicity of cyanobacteria blooms [
34,
35,
36]. Other concerns include both toxic and non-toxic cyanobacteria blooms altering carbon cycling, alkalizing waters, and increasing turbidity, thus further degrading water quality [
37,
38]. Therefore, additional analyses with the random forest algorithm were conducted to determine whether there were correlations between water quality parameters and the cyanobacteria orders Chroococcales and Nostocales, as these orders contain many pelagic, toxigenic species that are of particular concern in surface waters [
39].
4. Discussion
Earlier work by Smith et al. [
40,
46] demonstrated the correlation between several basic water quality parameters and cyanobacteria populations, as well as the temporal stability of phytoplankton populations within these ponds. Here, the relationships between more complex water quality parameters and phytoplankton groups were examined with machine learning. Phytoplankton group concentrations in the two agricultural irrigation ponds in this study did not vary greatly, nor were the community compositions significantly different, both representing communities of eutrophic, shallow, small-bodied waters. Average diatom and green algae concentrations were similar between years and between the two ponds. Despite the routine application of the algicide copper sulfate during the study, phytoplankton concentrations in Pond 1 were comparable to those reported in regional [
29,
30] and global lakes [
49,
50]. Pond 2 had recurrent cyanobacteria blooms during the study, making the phytoplankton concentrations more comparable to those reported in small lakes by Lee et al. [
51] and in local waters by Tango and Butler [
52]. Pond 2 phytoplankton concentrations were slightly higher than Pond 1 concentrations, which can potentially be explained by routine algicide use in Pond 1. All three phytoplankton populations in Pond 1 were greater in 2017 than in 2018, whereas the opposite was true for Pond 2, except for diatom concentrations, which were slightly higher in 2017 than in 2018.
Root-mean-square errors (RMSEs), a metric used to evaluate model performance, for the 2017, 2018, and 2017 + 2018 models (sets A and AB) varied depending on phytoplankton group. Green algae models tended to have the best performance, followed by diatoms, and then cyanobacteria. In a review by Shimoda and Arhonditsis [
53], green algae were found to have the least error of the three phytoplankton groups, similar to the results in this study. Cyanobacteria models had higher RMSEs than green algae models in both our findings and those reviewed by Shimoda and Arhonditsis [
53]. This could be explained by the natural spatial and temporal variability of cyanobacteria blooms making accurate population predictions more challenging [
46,
54]. While various types of models were used in the review by Shimoda and Arhonditsis [
53], the RMSEs from this work indicate that the random forest models predicted green algae better than they predicted diatoms or cyanobacteria. In the work of Di Maggio et al. [
55] where the same three functional groups were studied, cyanobacteria were found to have the least accurate model performance during peak biomass periods. However, Thomas et al. [
56] noted that cyanobacteria were more predictable than diatoms and green algae across many time scales in an alpine lake. Both ecosystem type and available input variables appear to affect the comparative performance of the random forest algorithm in predictions of phytoplankton functional groups. The robustness of the model during the growing season is characterized by the RMSE values presented in this paper since these RMSEs are averages over the datasets used for training and testing by the random forest algorithm. Since this study only focused on assessing the accuracy of the prediction model in agricultural irrigation ponds during the growing season (May–October) (when waters were used for irrigation purposes and when cyanobacteria biomass, and subsequently risks from cyanotoxins, was expected to be greatest in this region [
52,
57]), to better assess this model’s performance in comparison to similar models, additional training and validation need to be done using data collected outside of the growing season and in varying waterbody types. In this study, sampling was conducted during periods between rainfall events, when irrigation is more likely to take place due to crop production demands, elevated temperatures, and reduced soil moisture [
58]. To better equip this model for prediction during all weather conditions and all seasons, additional sampling and training of the model would be necessary.
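As a point of reference for the RMSE comparisons discussed above, the following is a minimal sketch of how an RMSE can be computed from paired observed and predicted values. The function name `rmse` and the example numbers are illustrative assumptions and are not taken from the study's data or code; any transformation applied to concentrations before the error calculation is likewise not represented here.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values for illustration only (not measurements from the ponds).
observed = [2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 2.5, 4.5, 4.5]

# Every prediction is off by exactly 0.5, so the RMSE is also 0.5.
example_rmse = rmse(observed, predicted)
print(example_rmse)
```

Lower values indicate closer agreement between predictions and observations, which is the sense in which the green algae models are described as performing best.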
Model performance did not differ drastically between years. The exception was cyanobacteria predictions, for which RMSE values decreased substantially from 2017 to 2018, indicating better performance of the 2018 models. In Pond 1, models predicting diatoms and cyanobacteria performed better in 2018 compared to 2017. Similarly, in Pond 2, better model performance in 2018 was seen for cyanobacteria predictions and, to a lesser extent, for diatom predictions. The combined 2017 + 2018 datasets had higher RMSE values than when using just the 2018 dataset, but lower than when only the 2017 dataset was used. For all three groups and both sets of parameters (A and AB), 2018 had the best model performance as indicated by the lowest RMSEs. Thomas et al. [
56] found that multiyear datasets were able to produce reasonable performance and attributed it to the model having more data points to train the machine learning algorithm with. Our individual years had fewer data points than the combined year models. While 2018 had the lowest RMSE values of the three data sets, the use of 2017 + 2018 caused a decrease in RMSE values for 2017. Furthermore, it was determined that the prediction of the 2019 data was not as accurate as the prediction of the 2017 and 2018 years. Additional monitoring would help to determine if the model performance of future years is comparable to the accuracy represented in the 2017 and 2018 evaluations.
The addition of organic constituent-related input parameters did not improve model performance overall. While some aspects of the model saw a small increase in performance, others saw a small decrease, and no general pattern could be defined. This follows many other studies that showed the use of inputs, similar to this study’s set A parameters (DO, pH, NTU, and TEMP), tended to be most important and produced the best prediction results [
59,
60,
61]. According to Rigosi et al. [
62], a model based on water quality physical parameters often has superior performance, and this was attributed to the high level of complexity found in biological processes. Likewise, while the nutrient and macro element parameters in input set C were highly influential when evaluating the 2018 data, the difference in model performance across phytoplankton groups may be due to the complex and interrelated way each phytoplankton group utilizes different nutrients and macro elements [
63,
64], and subsequently interacts with other organisms [
65], which was not captured with just one year of data. The presence of short blooms of both nitrogen-fixing and non-nitrogen-fixing cyanobacteria in the study area [
46], which can utilize different forms of nitrogen and impact the overall nitrogen budget [
66,
67], also may not have been equitably represented in this dataset. When only the potentially toxigenic cyanobacteria, represented in the dataset as Chroococcales and Nostocales, were examined, inclusion of nutrient parameters in the training and testing dataset did improve model performance and warrants further consideration. However, the ability to use the random forest algorithm to predict phytoplankton groups using only set A inputs is beneficial for a wide range of resource monitoring applications, such as differentiating discolored water caused by cyanobacteria, including subsurface bloom species like Raphidiopsis raciborskii [
68], from discolored water caused by chlorophytes and euglenophytes, both of which are known to occur in the study area [
29,
30,
52,
69]. Set A input parameters are often the least expensive and easiest parameters to collect; thus, predictions can be quickly and easily performed and can provide a guideline for expanding resource monitoring efforts should cyanobacteria blooms be predicted.
Overall, spatial distributions of RMSE values differed based on phytoplankton group. Green algae had the lowest spatial average RMSEs (P1 = 0.278, P2 = 0.356); cyanobacteria had the highest spatial average RMSEs (P1 = 0.567, P2 = 0.679); and average RMSEs for diatoms were in between (P1 = 0.446, P2 = 0.578) for both ponds and models. This indicates that the set A and AB models were the most accurate in predicting the spatial green algae concentrations for the 2017 + 2018 dataset. In general, interior waters tended to exhibit the lowest RMSE values in both ponds and for models built with either input set (A or AB); the random forest algorithm predicted interior concentrations of green algae best, followed by diatoms and cyanobacteria. In a prior study on the temporal and spatial variability of phytoplankton functional groups within these two agricultural irrigation ponds, it was established that interior waters tended to be less variable than nearshore waters [
46]. This stability allows the model to better predict the phytoplankton community structure in those locations. Variations in phytoplankton concentrations tended to be greater in nearshore samples when compared to interior waters using an assessment of CV. In over 50% of the sampling dates, CVs were higher for nearshore samples except for green algae in Pond 1. Similarly, water quality CVs in both ponds were almost always higher for nearshore locations, with most nearshore variability being higher in 75% or more of the sampling dates. This pattern was also observed in the study by Awada et al. [
70] for marine waters; the model developed by these authors performed best in open water locations of the Gulf of Sirte and had poorer agreement between measured and simulated concentrations of chlorophyll-a along the shoreline. In Lake Taihu, locations closer to the shoreline tended to have higher simulation errors than central lake locations [
60]. However, in a study on Lake Okeechobee, the random forest algorithm had better model results at nearshore locations as opposed to pelagic locations, and Zhang et al. [
71] attributed this to poor phytoplankton growth in the pelagic zones caused by wind-driven sediment resuspension.
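The nearshore versus interior comparison above rests on the coefficient of variation (CV), the standard deviation divided by the mean. The sketch below shows one way such a comparison could be made for a single sampling date. The function name, the use of the sample standard deviation, and the example values are assumptions made for illustration; the study's exact CV calculation is not specified in this section.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumes a non-zero mean)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical concentrations for one sampling date (illustrative only).
nearshore = [10.0, 30.0, 20.0, 40.0]
interior = [24.0, 26.0, 25.0, 25.0]

cv_nearshore = coefficient_of_variation(nearshore)
cv_interior = coefficient_of_variation(interior)

# Both groups share the same mean here, but the nearshore values are far more
# spread out, so their CV is larger.
print(cv_nearshore > cv_interior)
```

Repeating this comparison for each sampling date and counting how often the nearshore CV exceeds the interior CV gives the kind of percentage summary reported above.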
For all three phytoplankton groups, there was almost no change in RMSE values from models run using set A parameters to models run using set AB parameters, indicating that the additional parameters did not impact the predictive abilities of the random forests. The ability of the random forest model to predict phytoplankton community structure or chlorophyll-a concentrations accurately from set A parameters (TEMP, pH, NTU, and SPC) alone has been noted in several other studies [
18,
61,
72]. Whereas other studies [
73,
74] found that biological parameters (biological oxygen demand; chlorophyll-a concentrations) were more important for phytoplankton prediction models, biological oxygen demand was not measured in this study. It should, however, be considered for future modeling efforts as it is known to be spatially and temporally variable in lake waters [
75,
76] and can be positively correlated with potentially toxigenic cyanobacteria species [
77] and overall phytoplankton biomass [
76], both of which are of concern to agricultural resource managers looking to meet water quality standards.
Overall, this study found that the most important variables tended to be set A parameters (TEMP, pH, NTU, and SPC) for both ponds. TEMP was determined to be the most recurrent parameter in the top three most influential parameters for all groups and both ponds. This is comparable to numerous other random forest models used for phytoplankton prediction [
33,
61,
72,
74,
78]. Other set A parameters which were also reported in the top three most influential parameters, but to a lesser degree than TEMP, in this study were SPC, NTU, pH, and DO. SPC appeared to be the most influential predictor in input set A. A possible reason for this could be the correlation between SPC and nutrient ion concentrations in agricultural waters [
79] and the intercoupled relationship between specific nutrient forms and concentrations and phytoplankton groups [
80]. Correlations between water quality parameters reflected both commonalities of the water quality-forming processes in the studied ponds and the specifics of each pond. The strength of correlations between inputs depended on the pond and year. This indicates the importance of processes not well-characterized by the available input variables. However, one cannot exclude the presence of confounding factors, i.e., factors affecting both input and target variables. The effect of confounding variables can be alleviated efficiently when the relationships between the target variable and independent variables are derived from designed experiments in which the confounding variable is known and can be measured. This is not the case in the present work. More needs to be learned about the functioning of phytoplankton communities in small agricultural irrigation ponds now and how future community structure may be shaped by climate change and increased anthropogenic forces to adequately discover and monitor confounding variables.
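To make the idea of ranking influential inputs concrete, the sketch below illustrates permutation importance, one common model-agnostic way to score inputs: a single input column is shuffled and the resulting increase in RMSE is recorded. This section does not state how importance was computed in the study, so the technique shown here is an assumption, and `toy_predict`, the input values, and all function names are hypothetical stand-ins rather than the authors' fitted random forest.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, rows, y_true, n_features, n_repeats=20, seed=0):
    """Mean increase in RMSE after shuffling each input column in turn."""
    rng = random.Random(seed)
    baseline = rmse(y_true, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only column j replaced by its shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(rmse(y_true, [predict(r) for r in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical stand-in for a fitted model: uses temperature, ignores turbidity."""
    temp, _ntu = row
    return 0.1 * temp


# Hypothetical (TEMP, NTU) rows; targets are generated from the toy model itself.
rows = [(18.0, 5.0), (21.0, 9.0), (24.0, 4.0), (27.0, 12.0), (30.0, 7.0)]
y_true = [toy_predict(r) for r in rows]

imp = permutation_importance(toy_predict, rows, y_true, n_features=2)
print(imp)
```

In this toy setup, shuffling the ignored input leaves the error unchanged, while shuffling temperature increases it. With correlated inputs, such scores can be split between the correlated variables, which is why importance ranks should be interpreted with caution.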
The presence of correlations between independent variables in this work illustrates the common multicollinearity problem. Whereas the accuracy of the random forest models typically is not affected by multicollinearity [
81], the causality conclusions, including the ranks of correlated variables in the lists of important inputs, can be affected [
82]. We realize that the possible effect of multicollinearity was not fully addressed in this work. Multiple methods of variable elimination are suggested to reduce the input variable list and to characterize the effects of the input reduction on variable importance determinations [
83,
84]. Applying these methods to low-dimensional data can improve the model’s reliability as more data are available per model coefficient [
85]. However, this may uncontrollably change conclusions on the relative importance of input variables [
86]. Comparing the efficiency of the input reduction methods presents an interesting research avenue. In the present study, we made the first step in this type of investigation by analyzing correlations between inputs. For example, the expected strong correlations in both ponds in 2017 and 2018 were found between DO and pH (
Supplemental Table S3). The probable cause for this is high phytoplankton biomass undergoing photosynthesis, which causes pH to increase due to CO₂ consumption and O₂ release. Thus, we cannot exclude the effect of this correlation on the occurrence and position of DO and pH in lists of important inputs (
Table 3 and
Table 4).
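The first step described above, analyzing correlations between inputs, can be sketched as a pairwise Pearson correlation screen. The example below is illustrative only: the 0.8 threshold, the function names, and the DO, pH, and NTU values are assumptions and do not come from the study or its supplemental tables.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def flag_correlated_pairs(columns, threshold=0.8):
    """Return input pairs whose absolute correlation meets the threshold."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        if abs(pearson_r(columns[name_a], columns[name_b])) >= threshold:
            flagged.append((name_a, name_b))
    return flagged


# Hypothetical input values (illustrative only). DO and pH rise together here,
# while NTU is constructed to be uncorrelated with both.
inputs = {
    "DO": [6.0, 8.0, 10.0, 12.0],
    "pH": [7.0, 7.5, 8.0, 8.5],
    "NTU": [5.0, 3.0, 6.0, 4.0],
}

flagged = flag_correlated_pairs(inputs)
print(flagged)
```

Pairs flagged in this way are candidates for closer scrutiny when interpreting their positions in lists of important inputs.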
The only instance when set A parameters were not the most influential parameters was in 2018 when nutrients (input set C) were measured and used as inputs. Nutrients being the most influential or important parameters is in line with numerous assessments of phytoplankton community structure using random forest algorithms. Total nitrogen (TN), total phosphate (TP), nitrate, and nitrite were identified as the most important predictors in the phytoplankton models used in Lake Okeechobee [
71] and Lake Taihu [
76]. However, these studies took place in lakes considerably larger than the ponds studied here. Small waterbodies (<12 acres), which are increasingly being used in agricultural practices in the Mid-Atlantic region, often have a greater biodiversity than larger bodies of water and can experience more anthropogenic and climatic stress [
87,
88,
89], highlighting the need to refine models to local conditions. Since nutrients were only measured in 2018, these parameters were not included in the 2017 or combined-year models. The modeling robustness of nutrient parameters, when compared to set A parameters, has yet to be determined for ponded agricultural waters. When we examined the use of nutrient parameters to predict potentially toxic cyanobacteria species in the orders Chroococcales and Nostocales, inclusion of nutrient parameters did improve model performance. Due to the small number of water samples tested for cyanotoxins during this study (data not shown), we cannot correlate this model’s performance with cyanotoxin production. However, this preliminary assessment showed improved performance when nutrient parameters were incorporated into this model tuned for small, ponded water systems with recurrent cyanobacteria blooms, and can be used as a baseline for our future monitoring and modeling efforts as there currently are no prediction models available that can differentiate between toxigenic and non-toxigenic cyanobacteria blooms [
33,
90], even though it is known through field and laboratory studies that TN and TP enrichment can stimulate the production of cyanotoxins in numerous species [
37,
91].
In a review of predictive and forecasting models for cyanobacteria by Rousso et al. [
33], it was found that parameters similar to this study’s input set A (TEMP, DO, and pH) were reported as the most influential predictors in 38.5% of publications surveyed. Nutrients were reported as the most influential parameters in 30.5% of the total publications surveyed. One of the least influential predictors reported (6% of publications) was similar to the parameters included in input set B (FDOM, CHL, and Phyco), which is comparable to the findings in this study. As noted in a cyanobacteria research forecast by Burford et al. [
91], future modeling efforts should incorporate CO₂ dynamics that will reflect future climate scenarios, temporally relevant weather patterns, and the intricate relationship cyanobacteria have with the food web, all factors which ultimately will influence agricultural irrigation water quality. This study focused only on periods between rain events, when irrigation water from these agricultural ponds was used most frequently. However, rain events, specifically those which cause surface run-off, will ultimately influence nutrient concentrations within surface waters used for irrigation. Defining the relationship between water quality parameters and cyanobacteria blooms under numerous weather conditions with an easy-to-use model would aid local resource managers charged with safeguarding irrigation water quality and mitigating the risks posed to the agriculture industry from cyanotoxins.
Physicochemical parameters being the most important predictors for the three major phytoplankton groups is beneficial for water quality management. Enumeration of phytoplankton is time intensive and requires highly trained staff and/or expensive infrastructure [
8,
18,
32], whereas parameters such as temperature, dissolved oxygen, pH, conductivity, and turbidity can be easily and affordably measured in real time with an in situ sensor. The quick acquisition and input of these parameters into a modeling application allows the prediction of major phytoplankton groups by machine learning algorithms to be performed by a broader group of individuals, which could lead to more timely alerts of potentially harmful phytoplankton species.