Impacts of Missing Buoy Data on LSTM-Based Coastal Chlorophyll-a Forecasting

Zhang, Caiyun; Ding, Wenxiang; Zhang, Liyu

doi:10.3390/w16213046

by Caiyun Zhang ¹,²,*, Wenxiang Ding ³ and Liyu Zhang ⁴

¹ State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen 361102, China

² Key Laboratory of Underwater Acoustic Communication and Marine Information Technology, Ministry of Education, Xiamen University, Xiamen 361102, China

³ Marine Science and Technology College, Zhejiang Ocean University, Zhoushan 316022, China

⁴ Xiamen Marine & Fisheries Institute, Xiamen 361008, China

* Author to whom correspondence should be addressed.

Water 2024, 16(21), 3046; https://doi.org/10.3390/w16213046

Submission received: 29 September 2024 / Revised: 21 October 2024 / Accepted: 23 October 2024 / Published: 24 October 2024

Abstract

Harmful algal blooms (HABs) pose significant threats to coastal ecosystems and public health. Accurately predicting the chlorophyll-a (Chl) concentration, a key indicator of algal biomass, is crucial for mitigating the impact of algal blooms. Long short-term memory (LSTM) networks, as deep learning tools, have demonstrated significant potential in time series forecasting. However, missing data, a common occurrence in environmental monitoring systems, can significantly degrade model performance. This study examines the impact of missing input parameters, particularly the absence of Chl data, on the predictive performance of LSTM models. To evaluate the model’s performance and the effectiveness of different imputation techniques under various missing data scenarios, we used data collected from 2008 to 2018 for training and data from 2020 and 2021 for testing. The results indicated that missing Chl data reduced predictive accuracy far more than missing other parameters such as temperature or dissolved oxygen. Edge-missing data had a more pronounced negative effect on the model than non-edge missing data, and the model’s performance declined more steeply with longer periods of missing data. Missing data degraded the prediction of high Chl concentrations more than that of low Chl concentrations. Although LSTM imputation methods help mitigate the impact of missing data, ensuring data completeness remains critical. This study underscores the importance of reliable data collection and improved imputation strategies for accurate forecasting of algal blooms.

Keywords:

missing data; LSTM; imputation; chlorophyll-a; algal bloom; forecasting

1. Introduction

Algal blooms are an increasingly important environmental issue in aquatic ecosystems, as they negatively affect water quality, marine life, and public health [1,2]. Driven by a combination of environmental factors such as nutrient availability, temperature, and light, the rapid proliferation of algae can lead to harmful algal blooms (HABs) that disrupt aquatic ecosystems and threaten biodiversity [3,4]. Therefore, the early detection and accurate prediction of algal blooms are critical for effective water quality management and preventing damage to critical ecosystem functions.

Algal bloom prediction has long been a focal point for researchers, and numerous approaches and methods have been employed [5,6,7]. Among these, deep learning has emerged as a powerful tool capable of capturing nonlinear relationships in environmental data [8,9,10]. In recent years, deep learning models, particularly long short-term memory (LSTM) networks, have shown significant potential for predicting complex marine parameters [11,12]. LSTM is a variant of recurrent neural networks (RNNs). These models excel at capturing long-term dependencies in sequential data, making them particularly suitable for time series forecasting tasks such as predicting chlorophyll (Chl) concentrations [13,14].

The effectiveness of LSTM models is highly dependent on the quality, completeness, and continuity of the input data. Ecological data, particularly those from buoys and sensors, often contain gaps and discontinuities due to equipment malfunction, inclement weather, or data collection issues. If these missing data points are not properly addressed, they can severely degrade the predictive performance of LSTM models [15,16]. The impact of missing input parameters and data discontinuity on the accuracy of algal bloom forecasts remains a critical challenge.

Algal blooms are highly complex processes that are significantly correlated with multiple environmental factors, including dissolved oxygen (DO), Chl, pH, water temperature (Temp), meteorological conditions, and hydrological factors [2,17,18]. These factors are commonly used in models for predicting algal blooms and Chl concentrations. However, due to regional variations, different parameters influence the model in distinct ways, and the absence of certain parameters may affect Chl forecasts inconsistently. For example, Ding et al. [10] showed that when constructing a Chl forecasting model for the Zhoushan fishery in China, the Chl parameter had a significant impact on predictions in areas with high Chl concentrations, whereas wind and currents were the dominant factors in areas with medium and low Chl concentrations.

Chl, which directly reflects algal biomass, is currently one of the most important indicators for predicting algal blooms. Other environmental factors such as water temperature, DO, and pH also play significant roles in predicting algal blooms, but their relevance is generally not as high as that of Chl. Therefore, the completeness and reliability of Chl data are critical for accurate prediction. Missing or incomplete Chl data can significantly impair the ability of the LSTM model to capture the growth dynamics and provide timely warnings of algal blooms.

Real-time data are essential when using LSTM models to forecast Chl concentrations. However, in practical applications, fluctuations in data quality and missing parameters can lead to discontinuities in time series, adversely affecting the model’s predictive performance. Data imputation is a common method for resolving these discontinuities in time series data [19,20]. Traditional imputation methods, such as using the mean, median, or mode, often fail to capture complex relationships between variables and local data trends. Advanced methods such as polynomial interpolation [21], Hermite interpolation [22], and cubic spline interpolation [19] can effectively address some of these issues. Nevertheless, they still face challenges when dealing with long-term data gaps and multivariate, nonlinear dynamics. Machine learning techniques, particularly LSTM networks, have demonstrated significant value in solving the problem of multivariate time series data imputation [23,24,25].

This study aimed to explore the impact of missing input parameters on the performance of LSTM models in forecasting algal blooms, with a particular focus on the effect of missing Chl data. By simulating various data-missing scenarios, we investigated how LSTM imputation techniques and data-handling strategies influence prediction accuracy. Specifically, we compared the model’s performance when key parameters such as Chl, Temp, DO, and pH were missing, and assessed the effects of two types of missing data patterns: edge-missing data, which refers to missing data points at the boundaries of the time series, and non-edge missing data, which refers to missing observations within the time series. The results of this study emphasize the importance of ensuring data completeness and highlight the need for advanced imputation techniques when there are missing data in environmental monitoring systems.

2. Materials and Methods

2.1. Data Collection and Preprocessing

The dataset used in this study was compiled from three main sources: buoy monitoring data, meteorological station data, and tide forecasting data. The dataset spans January 2008 to June 2021, with a gap from January 2019 to August 2020 caused by equipment downtime. The buoy monitoring data were collected from an ecological buoy deployed in Xiamen Bay (for the buoy location, refer to Figure 1 in Ding et al. [26]). The key parameters monitored included temperature (Temp), dissolved oxygen (DO), Chl, and pH, sampled every 30 min. This dataset was provided by the Xiamen Marine and Fisheries Research Institute. We implemented strict quality control protocols following the method of Zhang et al. [27]. After quality control, the data were aggregated into daily averages, reducing the 48 half-hourly measurements per day to a single daily value.

The meteorological station data included daily observations of precipitation, mean air pressure, mean air temperature, relative humidity, sunshine duration, minimum and maximum air pressure, minimum and maximum temperature, two-minute average wind speed, and maximum wind speed. These data were sourced from the National Meteorological Information Center (https://data.cma.cn). Additionally, tide forecasting data were derived from daily average tide heights calculated from hourly tide measurements.

For model training, we used data from 2008 to 2018, while data from September 2020 to June 2021 were reserved for model testing. This partitioning was applied consistently across all the collected data. To ensure dimensional consistency and facilitate model convergence, all data were normalized using the following equation:

$$x_{i}^{\prime} = \frac{x_{i} - \bar{x}}{S} \tag{1}$$

where $x_{i}^{\prime}$ is the normalized value, $x_{i}$ represents the original data point, $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$ is the mean, and $S = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}$ is the standard deviation.
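As an illustration of Equation (1), the following minimal sketch standardizes a series using the population standard deviation (division by n), as defined above. The text does not state whether the mean and S were computed from the training period only; reusing training statistics for the test data is an assumption made here, and the Chl values are hypothetical.

```python
import math

def fit_zscore(values):
    """Return (mean, population standard deviation) as defined in Equation (1)."""
    n = len(values)
    mean = sum(values) / n
    # Population form (divide by n), matching the definition of S in the text.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std

def apply_zscore(values, mean, std):
    """Normalize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical daily Chl values (ug/L); not taken from the buoy record.
train_chl = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test_chl = [5.0, 11.0]

# Assumption: statistics are fitted on the training period and reused for testing.
mean, std = fit_zscore(train_chl)
train_norm = apply_zscore(train_chl, mean, std)
test_norm = apply_zscore(test_chl, mean, std)
```

Fitting on the training period alone keeps information from the test period out of the scaling step, which is one common design choice rather than a confirmed detail of this study.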

Several studies suggest that evaluating the rate of biomass change, rather than absolute concentrations, is key to understanding bloom dynamics [3]. Based on this, Ding et al. [26] optimized a machine learning Chl forecasting model. We adopted their approach by using the relative Chl change rate (∆RChl) as the output in our LSTM-based Chl forecasting model.

2.2. Long Short-Term Memory Model

The LSTM model is a type of RNN designed to capture long-term dependencies in sequential time series data [28,29]. LSTM networks are particularly useful in forecasting applications, and they have previously been used in the prediction of Chl concentration [13,14,30]. LSTM models utilize three gates (input, output, and forget gates) to control the flow of information (Figure 1a). These gates enable the network to retain or discard information: the forget gate determines which information from the memory should be removed based on the current input $x_{t}$ and the previous output $h_{t - 1}$, while the input gate determines which information is added to the state $c_{t - 1}$ to generate a new state $c_{t}$. The output gate calculates the current output $h_{t}$ using the new state $c_{t}$, the previous output $h_{t - 1}$, and the current input $x_{t}$.
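The gate logic described above can be sketched for a single scalar cell. This follows the standard LSTM formulation and is not the network used in this study; the weight layout and all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell (standard formulation).

    w maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x_t + wh * h_prev + b

    f_t = sigmoid(pre("forget"))       # how much of c_prev to keep
    i_t = sigmoid(pre("input"))        # how much new information to add
    g_t = math.tanh(pre("candidate"))  # candidate information
    o_t = sigmoid(pre("output"))       # how much of the state to expose
    c_t = f_t * c_prev + i_t * g_t     # new cell state
    h_t = o_t * math.tanh(c_t)         # new output
    return h_t, c_t

# Hypothetical weights and normalized inputs, for illustration only.
weights = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}
h, c = 0.0, 0.0
for x in [0.2, -0.1, 0.4]:
    h, c = lstm_step(x, h, c, weights)
```

Because each step consumes the previous output and state, a gap in the input sequence affects every later step, which is why missing inputs matter for this kind of model.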

The LSTM model used in this study is illustrated in Figure 1b. The model consists of a sequence input layer, two LSTM layers, two fully connected layers, a dropout layer, and a regression output layer. Each of the LSTM layers contains 200 nodes, and the two fully connected layers contain 10 nodes and 1 node, respectively. Dropout is a regularization technique designed to prevent overfitting in deep learning models [31,32].

In this study, the LSTM model was employed for both imputing missing Chl time-series data and forecasting future Chl concentrations.

2.3. Patterns of Missing Data

LSTM models rely on continuous sequences of input data for time series forecasting. However, real-world data, particularly from ecological buoys, often exhibit discontinuities due to equipment malfunction, environmental factors, biofouling, or insufficient rates of sampling. Such discontinuities can result in missing or incomplete parameters and thereby affect the model’s performance. This study evaluated the impact of these missing data patterns on the LSTM model’s prediction accuracy.

We categorized missing data into two types based on their position in the time series: edge-missing and non-edge-missing.

- Edge-missing data: This refers to missing data points at the boundaries of the time series, such as when the most recent buoy data are unavailable for forecasting the Chl concentration on the next day.
- Non-edge-missing data: This refers to missing observations within the time series. For example, if the model uses the last 5 days of data to predict the Chl concentration of the next day, and the data on day 3 are missing, this is considered a case of non-edge-missing data.

To investigate the effects of these two types of missing data, we designed and tested different missing data patterns within the LSTM model, as shown in Figure 2, where “0” represents the most recent data point. Edge-missing refers to the absence of the most recent data point (“0”), while non-edge-missing refers to gaps elsewhere in the sequence.
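The two pattern types can be expressed as boolean masks over an input window. The sketch below is illustrative only: the window length, the offsets, and the function names are hypothetical and are not taken from Figure 2.

```python
def gap_mask(window_len, offset, gap_len):
    """Return a list ordered oldest -> newest in which True marks a missing day.

    offset counts days back from the most recent point ("0"), so the gap
    covers positions offset .. offset + gap_len - 1.
    """
    if offset < 0 or gap_len < 1 or offset + gap_len > window_len:
        raise ValueError("gap does not fit inside the window")
    mask = [False] * window_len
    for position in range(offset, offset + gap_len):
        mask[window_len - 1 - position] = True
    return mask

def is_edge_missing(mask):
    """Edge-missing means the most recent point ("0") is absent."""
    return mask[-1]

edge_1_day = gap_mask(window_len=5, offset=0, gap_len=1)
non_edge_3_days = gap_mask(window_len=5, offset=1, gap_len=3)
```

In this representation, a gap is edge-missing exactly when its offset is zero; every other offset yields a non-edge pattern.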

2.4. Evaluation Criteria

The model’s forecasting performance under various missing data patterns was evaluated using three key metrics: the root mean square error (RMSE), the correlation coefficient (r), and the absolute error [26]. These metrics were calculated using Equations (2)–(4). Lower RMSE and absolute error values, combined with higher r values, indicate better model performance. To minimize random errors, each model run was repeated 10 times, and the average forecast results were reported:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - y_{i})^{2}} \tag{2}$$

$$r = \frac{\sum_{i = 1}^{n} (Y_{i} - \bar{Y}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} (Y_{i} - \bar{Y})^{2} \sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}} \tag{3}$$

$$\text{Absolute error}_{i} = | Y_{i} - y_{i} | \tag{4}$$

where $Y_{i}$ and $\bar{Y}$ represent the observed Chl and its mean value, $y_{i}$ and $\bar{y}$ refer to the forecast Chl and its mean value, and n is the number of data points.
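Equations (2)–(4) translate directly into code. The following is a minimal sketch using only the standard library; the observed and forecast values are hypothetical.

```python
import math

def rmse(obs, pred):
    """Root mean square error, Equation (2)."""
    n = len(obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)

def pearson_r(obs, pred):
    """Pearson correlation coefficient, Equation (3)."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)

def absolute_errors(obs, pred):
    """Per-sample absolute error, Equation (4)."""
    return [abs(o - p) for o, p in zip(obs, pred)]

# Hypothetical observed and forecast Chl values (ug/L).
observed = [1.0, 2.0, 3.0, 4.0]
forecast = [1.5, 2.5, 3.5, 4.5]

print(rmse(observed, forecast))
print(pearson_r(observed, forecast))
print(absolute_errors(observed, forecast))
```

A constant offset, as in this example, leaves r at 1 while the RMSE is nonzero, which is why the two metrics are reported together.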

3. Results and Discussion

3.1. Influence of Missing Parameters

The LSTM model primarily relies on data from meteorological stations, ecological buoys, and tidal forecasts to predict algal blooms. While meteorological and tidal data are generally stable, data from ecological buoys such as Chl, Temp, DO, and pH often contain gaps or discontinuities due to various environmental and technical factors. This section discusses the model’s performance when any of these four buoy parameters are missing.

Figure 3 illustrates the impact of excluding DO, Temp, and pH from the test data. The Pearson correlations between the predicted and observed results for these cases were 0.91, 0.89, and 0.90, respectively. These values were only marginally lower than the correlation of 0.93 obtained when no parameters were missing (Figure 4, Table 1). The root mean square errors (RMSEs) for these cases increased by 11.9%, 21.1%, and 17.4%, respectively, compared to the scenario with no missing parameters (Figure 4, Table 1). However, when Chl data were missing, the correlation decreased significantly to 0.70, and the RMSE increased to 2.08 μg/L, a 90.8% increase over the case with no missing data. This demonstrated that the absence of Chl has a far greater impact on the model’s predictive performance than the absence of other parameters.

Further analysis demonstrated how the absence of various parameters affected the LSTM model’s accuracy in forecasting algal blooms. With complete data, 67% of forecasts had an absolute error below 0.5 μg/L, and only 1% had an error exceeding 5 μg/L (Figure 5a, Table 1). When Chl data were missing, the accuracy decreased significantly: the probability of an absolute error below 0.5 μg/L decreased to 36%, while the probability of an error exceeding 5 μg/L increased to 4% (Figure 5b, Table 1). Conversely, missing DO, Temp, and pH parameters did not significantly alter the distribution of absolute errors (Figure 5c–e), suggesting that Chl is the most critical factor for maintaining high prediction accuracy in the LSTM model.
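The shares of forecasts below 0.5 μg/L and above 5 μg/L can be computed from the per-sample absolute errors. This is a minimal sketch with hypothetical error values; only the two thresholds come from the text.

```python
def error_shares(abs_errors, small=0.5, large=5.0):
    """Return the fractions of absolute errors below `small` and above `large`."""
    n = len(abs_errors)
    below = sum(1 for e in abs_errors if e < small) / n
    above = sum(1 for e in abs_errors if e > large) / n
    return below, above

# Hypothetical absolute errors (ug/L).
errors = [0.1, 0.4, 0.6, 5.5]
share_small, share_large = error_shares(errors)
```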

Algal blooms are phenomena characterized by the rapid proliferation of algae in water bodies [33]. The variation in nearshore Chl is highly complex and is widely regarded as a crucial indicator of algal blooms [34,35]. Numerous studies have demonstrated that the occurrence of algal blooms is significantly correlated with multiple factors [2,17,18]. Parameters such as Temp, Chl, DO, and pH often exhibit significant changes during algal bloom events, and these changes can be observed through buoy monitoring systems. While temperature and DO are factors that contribute to algae growth, measurements of these parameters alone cannot predict algal bloom occurrence with sufficient accuracy.

Temperature plays a key role in algal growth, with optimal temperatures accelerating algal metabolism and reproduction [36,37]. However, temperature alone cannot predict algal blooms; typically, the interactions with other factors are important [10,38]. DO levels reflect the oxygen content in the water, but fluctuations in DO are often a consequence of algal blooms rather than a cause [39]. Although pH can influence algal growth, with most algae preferring neutral to slightly alkaline conditions [40,41], changes in pH are often gradual and less effective at capturing short-term variation related to bloom events.

In contrast, the Chl concentration is directly related to the algal biomass, as it is the primary pigment for photosynthesis in algae [42]. Changes in Chl concentration are closely tied to the rates of algae proliferation, making it a more accurate predictor of algal blooms [7]. An increase in Chl concentration signals accelerated algal growth, thereby serving as an early warning of potential blooms [33]. The absence of Chl data drastically diminishes the model’s ability to capture these growth trends, reducing its forecasting accuracy. Chl is a more sensitive indicator than other parameters, enabling the model to detect minor fluctuations in algae populations, a critical factor for early warning systems.

When Chl data were missing, the LSTM model exhibited the poorest performance in predicting algal blooms (Figure 3 and Figure 4; Table 1). This underscores the importance of Chl data as a key input for short-term predictions of algal growth. The model’s effectiveness remains high as long as Chl and meteorological data are available, even if DO, Temp, or pH data are missing. Meteorological factors, including precipitation, sunlight, and wind, play a significant role in the short-term development of blooms [43,44,45]. This emphasizes the importance of retaining these inputs along with the Chl data.

In summary, Chl plays a primary role in algal bloom forecasting, offering direct insights into algal biomass and serving as a key early warning indicator. Ensuring the completeness and accuracy of Chl data is, thus, essential for improving LSTM models for algal bloom prediction.

3.2. Influences of Discontinuities in Time Series Data

The instability of data from ecological buoys not only leads to missing parameters, but also causes discontinuities in time series data, significantly hindering accurate forecasting by the LSTM model. The absence of Chl data has the most pronounced impact on the model’s accuracy, underscoring Chl as a critical parameter for operational forecasting (Figure 3, Figure 4 and Figure 5; Table 1). Here, we used Chl as an example to simulate various missing data scenarios (Figure 2) and applied the LSTM model to impute the missing data. The imputation results were evaluated using r and RMSE (Figure 6 and Figure 7).

The imputation of non-edge missing data was consistently more accurate than that of edge-missing data across all scenarios. When data were missing for one day, the r value for edge-missing imputation was 0.96, 3% lower than the value for non-edge missing data, while the RMSE for edge-missing imputation was 0.85, 88% higher than that for non-edge missing data. Furthermore, the imputation performance improved as the number of consecutive missing days decreased. For non-edge missing data, the r value for one day of missing data was 0.99, 9% higher than that for seven consecutive missing days, while the RMSE was 0.45; the RMSE for seven consecutive missing days was 167% higher. For edge-missing data, the r value for one day of missing data was 0.96, 10% higher than that for seven consecutive missing days, and the RMSE was 0.85, 69% lower than that for seven consecutive missing days.

After imputation, the time series data were used as test data for Chl prediction, and the LSTM model was employed to forecast the concentration of Chl for the subsequent day. The forecast evaluation metrics are shown in Figure 8 and Figure 9. Across all missing data scenarios, the forecasting results indicated that edge-missing data and non-edge missing data 1 day from the edge had the most significant impacts on prediction accuracy (Figure 8 and Figure 9). Compared to no missing data or other non-edge missing data scenarios, the r values decreased significantly and the RMSE increased significantly in these two cases. For non-edge missing scenarios, except for the scenario where the missing data were 1 day away from the edge, the r and RMSE values at other positions were relatively close to those in the no-missing data scenario (Figure 8 and Figure 9). The forecasting results for data missing for 3, 5, and 7 consecutive days exhibited similar patterns, with performance declining as the number of consecutive missing days increased (Figure 8 and Figure 9).

In edge-missing scenarios, the evaluation metrics for imputed data with missing periods of 1, 3, 5, and 7 consecutive days showed a notable decline in forecasting accuracy compared to the no-missing data scenario (Figure 8 and Figure 9). Specifically, the RMSE increased by 11.9%, 20.0%, 23.1%, and 25.8%, while the r values decreased by 1.4%, 3.5%, 3.8%, and 3.8%, respectively (Table 2). As the number of consecutive missing days increased, the forecast performance for edge-missing imputed data became significantly worse.

Several studies have demonstrated that LSTM models perform well in terms of time series data imputation [23,24,25]. This aligns with the findings of the present study. The imputation performance declined as the number of consecutive missing days increased (Figure 7 and Figure 8), likely due to the loss of relevant information as the period of missing data lengthened. In non-edge missing data scenarios, the missing Chl values can be better interpolated by referencing the Chl values before and after the missing period. In contrast, for edge-missing data, only the Chl value before the missing period is available for evaluation, which may explain the considerable difference in performance between edge-missing and non-edge missing data. Poor imputation results for edge-missing data directly lead to poorer forecasting outcomes, as shown by the significantly lower performance compared to non-edge missing data (Figure 8 and Figure 9).
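The asymmetry between the two gap positions can be illustrated with a deliberately simple stand-in for the LSTM imputer: linear interpolation for interior gaps and carrying the last observation forward for trailing gaps. This is not the imputation method used in this study, and the series is hypothetical; the sketch only shows that an interior gap can use information from both sides, whereas an edge gap cannot.

```python
def fill_gaps(series):
    """Fill None entries in a list ordered oldest -> newest.

    Interior gaps: linear interpolation between the neighbouring observations.
    Trailing (edge) gaps: the last observation is carried forward.
    """
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # first index after the gap
        if start == 0:
            raise ValueError("leading gaps are not handled in this sketch")
        left = filled[start - 1]
        if end == n:
            # Edge gap: only the value before the gap is available.
            for k in range(start, end):
                filled[k] = left
        else:
            # Interior gap: values on both sides are available.
            right = filled[end]
            span = end - start + 1
            for k in range(start, end):
                filled[k] = left + (right - left) * (k - start + 1) / span
    return filled

# Hypothetical rising Chl series (ug/L).
truth = [2.0, 3.0, 4.0, 5.0, 6.0]
non_edge = fill_gaps([2.0, 3.0, None, 5.0, 6.0])
edge = fill_gaps([2.0, 3.0, 4.0, 5.0, None])
non_edge_error = abs(non_edge[2] - truth[2])
edge_error = abs(edge[4] - truth[4])
```

On a steadily rising series, the interior value is recovered exactly while the edge value lags behind the trend, which mirrors the direction of the difference reported above.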

In this study, we also forecasted Chl concentrations by predicting the ∆RChl parameter. In non-edge missing scenarios 1 day from the edge, Chl_{n} is the observed value when predicting ∆RChl_{n+1}, while Chl_{n−1} is the imputed value. Therefore, ∆RChl_{n} calculated from Chl_{n} and Chl_{n−1} is also an imputed value rather than an observed datum. Consequently, among the nearest Chl-related parameters involved in the forecast, only Chl_{n} is observed, while ∆RChl_{n} is imputed. This may explain the poor forecasting performance for imputed data in non-edge missing scenarios 1 day away from the edge.
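The propagation described above can be made concrete with a small numerical sketch. The exact definition of ∆RChl is given in Ding et al. [26] and is not restated here; the formula below, (Chl_n − Chl_{n−1}) / Chl_{n−1}, is an assumed form used only for illustration, and all values are hypothetical.

```python
def relative_change(chl_prev, chl_curr):
    """Assumed form of the relative Chl change rate: (curr - prev) / prev."""
    return (chl_curr - chl_prev) / chl_prev

chl_n = 5.0              # observed value on day n (hypothetical)
chl_prev_observed = 4.0  # true value on day n-1 (hypothetical)
chl_prev_imputed = 4.5   # imputed value on day n-1 (hypothetical)

delta_from_observed = relative_change(chl_prev_observed, chl_n)
delta_from_imputed = relative_change(chl_prev_imputed, chl_n)
```

Under this assumed definition, an imputation error of 0.5 μg/L on day n−1 more than halves the change rate on day n, even though Chl on day n itself is observed.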

Chl concentrations greater than or equal to 5 μg/L were defined as high values, while others were classified as low values. The forecasting accuracy of imputed test data for high and low values was evaluated under edge-missing scenarios with different consecutive missing days (Table 2). The impact of edge-missing imputation on low-value forecasts was relatively small, with RMSE increasing by <5.5% and r decreasing by <1%. However, the impact on high-value forecasting was much more pronounced. When data were missing for 1 day, the RMSE increased by 13.6%, while r decreased by 3.3%. The forecasting errors further increased with the length of consecutive missing data. For 7 consecutive missing days, the RMSE increased by 31.0%, and r decreased by 16.2%. This analysis demonstrated that the reduction in forecasting accuracy of the LSTM Chl prediction model due to missing data imputation is primarily reflected in the decline of high-value prediction accuracy.
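The high/low comparison can be sketched as follows. The 5 μg/L threshold comes from the text; splitting on the observed value rather than the forecast is an assumption, and the forecasts are hypothetical.

```python
import math

def rmse(pairs):
    """RMSE over (observed, forecast) pairs."""
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))

def split_high_low(obs, pred, threshold=5.0):
    """Split (observed, forecast) pairs by observed Chl (assumption)."""
    pairs = list(zip(obs, pred))
    high = [(o, p) for o, p in pairs if o >= threshold]
    low = [(o, p) for o, p in pairs if o < threshold]
    return high, low

def percent_increase(baseline, value):
    return 100.0 * (value - baseline) / baseline

# Hypothetical observed Chl and forecasts (ug/L).
observed = [1.0, 2.0, 6.0, 10.0]
pred_complete = [1.5, 2.5, 7.0, 9.0]  # forecasts from complete inputs
pred_imputed = [1.5, 2.5, 8.0, 8.0]   # forecasts from imputed inputs

high_c, low_c = split_high_low(observed, pred_complete)
high_i, low_i = split_high_low(observed, pred_imputed)
high_change = percent_increase(rmse(high_c), rmse(high_i))
low_change = percent_increase(rmse(low_c), rmse(low_i))
```

Reporting the change separately for each group prevents the many low-value days from masking the degradation on the few high-value days.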

When Chl data were entirely missing from the test dataset, there were two options: exclude the Chl parameter before prediction, or impute the missing Chl data and include the imputed values in the model. Table 3 presents the forecasting results for both methods. The imputation method yielded a lower RMSE and a higher r value. For low-value data, the correlation for the imputation method was 0.85, significantly better than that of the method excluding Chl. For high-value data, although the improvement was less pronounced, the RMSE and r values of the imputation method were still notably better than those of the method excluding Chl. Therefore, when Chl data are missing but other parameters are complete, it is preferable to impute the missing Chl data before prediction to reduce uncertainty and enhance the model’s predictive performance.

In algal bloom forecasting, greater attention is given to predicting high Chl values. Discontinuities in the input data clearly have a more pronounced negative impact on the prediction of high Chl values than on that of low Chl values. This result highlights the difference in the effectiveness of imputation methods at different Chl concentration levels. For low Chl concentrations, the data generally reflect a more stable state of the water body, where algae growth is slower and fluctuations are smaller. Therefore, even with missing data, imputation methods can accurately recover the data. Conversely, high Chl concentrations usually occur during periods of rapid algae growth or algal blooms, and the data show higher volatility and a greater influence of environmental factors. In such cases, imputation methods struggle to capture these dramatic changes, resulting in larger relative errors.

These findings suggest that different strategies should be applied for data at varying Chl concentrations. For low Chl concentrations, the traditional imputation methods can provide reasonable data recovery. However, for high Chl concentrations, more advanced imputation techniques such as machine-learning-based dynamic imputation or multi-source data fusion should be implemented to improve accuracy.

In summary, dividing Chl data based on a threshold of 5 μg/L (equivalent to 5 mg/m³) revealed the differential effectiveness of imputation methods across concentration levels. While imputation performed well for low Chl concentrations, more sophisticated techniques are required to enhance the prediction accuracy for high Chl concentrations. These insights provide valuable guidance for optimizing and improving algal bloom prediction.

3.3. Importance of Input Parameter Completeness on LSTM Prediction Models

The continuity and completeness of time series data are critical for the effective performance of LSTM models, particularly when applied to predicting complex phenomena such as algal blooms [46]. Ensuring continuous data collection, especially for essential parameters such as Chl, is crucial for enhancing the accuracy of ecological forecasting. Data gaps, whether due to sensor malfunctions or environmental disruptions, introduce significant challenges, hampering the model’s ability to capture temporal trends and subtle patterns [47]. For example, missing Chl data can dramatically reduce model performance, as demonstrated by decreased correlations and increased root mean square errors (Figure 3, Figure 4 and Figure 5, Table 1). These metrics reflect a substantial decline in predictive accuracy.

One of the key findings of this study is the difference in imputation performance between edge-missing and non-edge-missing data. Edge-missing data—referring to missing values near the boundary of the time series—typically lead to poorer imputation results, thus lowering prediction accuracy. This highlights the LSTM model’s reliance on sequential dependencies across time steps. When data sequences are disrupted at the edges, the model struggles to impute missing values and predict future trends. This is particularly important in operational forecasting, as the loss of edge data can introduce significant errors in environmental management and ecological decision-making. Therefore, maintaining buoy equipment and minimizing the risk of edge data loss, particularly during critical monitoring periods, is of paramount importance.

The present study also demonstrates that as the duration of missing data increases, the negative effects of data discontinuity on prediction accuracy become more pronounced. With longer periods of missing data, the model’s ability to recover underlying trends and patterns diminishes significantly, resulting in higher values of RMSE and lower correlations. In real-world applications, prolonged data loss due to equipment failure or adverse environmental conditions can severely compromise the accuracy of forecasting. To mitigate this risk, increasing the frequency of buoy maintenance and regularly calibrating sensors is essential for preventing extended data gaps and ensuring the continuous and reliable flow of data into the LSTM model.

The present study highlights significant differences in how data discontinuity affects predictions of high and low Chl concentrations. Predictions for low Chl values exhibited better resilience to data gaps, whereas predictions for high Chl concentrations were much more sensitive to missing data. This distinction is crucial for environmental monitoring, where accurate predictions of high Chl concentrations are vital for the early detection of harmful algal blooms. The greater vulnerability of high-value predictions to data discontinuity underscores the need for robust imputation techniques and alternative modeling approaches. In addition to improving equipment maintenance and increasing the frequency of data collection, it may be beneficial to deploy multiple buoys in different locations. This redundancy would ensure that backup data are available in case of equipment failure, thereby reducing the risk of data loss during critical events. Cross-validating data from multiple buoys could also enhance dataset consistency and accuracy, further reducing prediction errors caused by missing data.

It should be noted that the model constructed in this study does not consider other critical variables, such as nutrient levels, phytoplankton, and zooplankton abundance, which also have a significant influence on Chl variability [3,4]. The lack of these parameters is primarily due to challenges in obtaining long-term, continuous data, as current monitoring technologies often focus on easily measurable factors like Chl, DO, pH, and temperature. Despite their importance in driving algal bloom dynamics, these variables are often not included in forecasting models due to the difficulty in consistently collecting such data over extended periods [10,13,14,30]. Addressing this gap in future studies could improve model accuracy and provide a more comprehensive understanding of bloom dynamics.

In summary, discontinuities in data have profound and multifaceted impacts on LSTM prediction models. To improve the reliability of ecological forecasting, the stable operation of data collection equipment such as buoys should be prioritized. Regular maintenance and the frequent calibration of buoy sensors are essential to ensure proper functionality and minimize data loss. Moreover, developing more advanced imputation techniques capable of handling edge-missing data and addressing long-term data gaps is necessary to overcome some of the challenges identified in this study. The adoption of such methods may lead to more robust and accurate forecasts, thereby supporting more informed environmental management and decision making.

4. Conclusions

This study highlights the critical role of data completeness and continuity in the accuracy and reliability of LSTM models for predicting algal blooms. Our analysis demonstrates that missing chlorophyll-a (Chl) data have a greater negative impact on the predictive performance of LSTM models than missing temperature, dissolved oxygen, or pH data. Chl, being a primary indicator of algal biomass, directly influences the accuracy of LSTM forecasts: its absence increased the root mean square error (RMSE) from 1.09 to 2.08 μg/L and reduced the correlation (r) from 0.93 to 0.70, whereas with any one of the other parameters missing, r remained at 0.89 or higher and the RMSE at 1.32 μg/L or lower (Table 1). This underscores the fact that missing Chl data cannot be compensated for by the other environmental parameters.

The study also underscores the challenges posed by data discontinuities, particularly edge-missing data and extended periods of missing data, both of which further reduce Chl prediction accuracy. As the length of the missing period increases, the ability of the LSTM model to accurately predict future trends declines: with edge-missing data, the overall RMSE of forecasts made after imputation worsened by 11.9% for one missing day and by 25.8% for seven consecutive missing days (Table 2). This finding emphasizes the importance of maintaining continuous and high-quality data streams, especially through regular sensor maintenance and the calibration of buoy equipment.

Furthermore, the impact of missing data is more pronounced for predictions of high Chl concentrations (≥5 μg/L) than for predictions of low Chl concentrations (<5 μg/L). This greater sensitivity to data gaps signals a need for more robust imputation techniques or hybrid modeling approaches.
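The high-value and low-value comparison can be sketched as a stratified evaluation: the forecasts are split at 5 μg/L, and RMSE and r are computed for each group. The snippet below is a minimal illustration under stated assumptions. The helper names, the choice to split on the observed (rather than forecast) value, and the sample numbers are all hypothetical and are not taken from the study's code or data.

```python
import math


def rmse(obs, pred):
    """Root mean square error between observations and forecasts."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def pearson_r(obs, pred):
    """Pearson correlation coefficient between observations and forecasts."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    return cov / (sd_o * sd_p)


def stratified_scores(obs, pred, threshold=5.0):
    """Score all samples, then high (>= threshold) and low (< threshold) ones.

    Assumption: samples are assigned to a group by their observed value.
    Each group needs at least two samples with non-constant values.
    """
    pairs = list(zip(obs, pred))
    groups = {
        "all": pairs,
        "high": [(o, p) for o, p in pairs if o >= threshold],
        "low": [(o, p) for o, p in pairs if o < threshold],
    }
    scores = {}
    for name, group in groups.items():
        o, p = zip(*group)
        scores[name] = {"rmse": rmse(o, p), "r": pearson_r(o, p)}
    return scores


# Synthetic observed and forecast Chl values (μg/L), for illustration only.
obs = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0]
pred = [1.1, 1.9, 3.2, 3.9, 5.0, 7.0, 11.0]
scores = stratified_scores(obs, pred)
for name, s in scores.items():
    print(name, round(s["rmse"], 2), round(s["r"], 2))
```

Reporting the groups separately matters because low values dominate the sample count, so an overall RMSE can hide a large deterioration in the high-value range that is most relevant to bloom warnings.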

In light of the above, ensuring the completeness and quality of input parameters, particularly Chl, is essential for improving the accuracy of LSTM-based ecological forecasts. Future research should focus on enhancing the reliability of data collection, developing advanced imputation methods, and integrating multiple data sources to address the challenges of data discontinuity and further strengthen the predictive ability of LSTM models in environmental management.

Author Contributions

Conceptualization, C.Z. and W.D.; methodology, C.Z. and W.D.; software, C.Z. and W.D.; validation, C.Z., W.D., and L.Z.; formal analysis, C.Z.; investigation, W.D.; resources, C.Z. and L.Z.; data curation, W.D.; writing—original draft preparation, C.Z. and W.D.; writing—review and editing, C.Z. and W.D.; visualization, W.D.; supervision, C.Z.; project administration, C.Z.; funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Program of Fujian Province, China (No. 2023Y4001), the Science and Technology Program of Xiamen, China (No. 3502Z20226021), and the National Key Research and Development Program of China (No. 2022YFC3105300).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy or confidentiality concerns.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Anderson, C.R.; Sapiano, M.R.; Prasad, M.B.; Long, W.; Tango, P.J.; Brown, C.W.; Murtugudde, R. Predicting potentially toxigenic Pseudo-nitzschia blooms in the Chesapeake Bay. J. Mar. Syst. 2010, 83, 127–140.
2. Anderson, D.M.; Alpermann, T.J.; Cembella, A.D.; Collos, Y.; Masseret, E.; Montresor, M. The globally distributed genus Alexandrium: Multifaceted roles in marine ecosystems and impacts on human health. Harmful Algae 2012, 14, 10–35.
3. Behrenfeld, M.J.; Boss, E.S. Resurrecting the Ecological Underpinnings of Ocean Plankton Blooms. Annu. Rev. Mar. Sci. 2014, 6, 167–194.
4. Anderson, C.R.; Moore, S.K.; Tomlinson, M.C. Living with Harmful Algal Blooms in a Changing World: Strategies for Modeling and Mitigating Their Effects in Coastal Marine Ecosystems. In Coastal and Marine Hazards, Risks, and Disasters; Elsevier: Amsterdam, The Netherlands, 2015; pp. 495–561.
5. McGillicuddy, D.J.; Anderson, D.M.; Lynch, D.R.; Townsend, D.W. Mechanisms regulating large-scale seasonal fluctuations in Alexandrium fundyense populations in the Gulf of Maine: Results from a physical-biological model. Deep Sea Res. Part II Top. Stud. Oceanogr. 2005, 52, 2698–2714.
6. Anderson, C.R.; Siegel, D.A.; Kudela, R.M.; Brzezinski, M.A. Empirical models of toxigenic Pseudo-nitzschia blooms: Potential use as a remote detection tool in the Santa Barbara Channel. Harmful Algae 2009, 8, 478–492.
7. Deng, T.N.; Chau, K.W.; Duan, H.F. Machine learning based marine water quality prediction for coastal hydro-environment management. J. Environ. Manag. 2021, 284, 112051.
8. Manucharyan, G.E.; Siegelman, L.; Klein, P. A Deep Learning Approach to Spatiotemporal Sea Surface Height Interpolation and Estimation of Deep Currents in Geostrophic Ocean Turbulence. J. Adv. Model. Earth Syst. 2021, 13, e2019MS001965.
9. Dong, S.; Wang, P.; Abbas, K. A survey on deep learning and its applications. Comput. Sci. Rev. 2021, 40, 100379.
10. Ding, W.X.; Li, C.L. Algal blooms forecasting with hybrid deep learning models from satellite data in the Zhoushan fishery. Ecol. Inform. 2024, 82, 102664.
11. Ham, Y.G.; Kim, J.H.; Luo, J.J. Deep learning for multi-year ENSO forecasts. Nature 2019, 573, 568–572.
12. Gambin, A.F.; Angelats, E.; Gonzalez, J.S.; Miozzo, M.; Dini, P. Sustainable Marine Ecosystems: Deep Learning for Water Quality Assessment and Forecasting. IEEE Access 2021, 9, 121344–121365.
13. Tian, W.C.; Liao, Z.L.; Wang, X. Transfer learning for neural network model in chlorophyll-a dynamics prediction. Environ. Sci. Pollut. Res. 2019, 26, 29857–29871.
14. Yussof, F.N.; Maan, N.; Reba, M.N.M. LSTM Networks to Improve the Prediction of Harmful Algal Blooms in the West Coast of Sabah. Int. J. Environ. Res. Public Health 2021, 18, 7650.
15. Chen, Z.; Xu, H.; Jiang, P.; Yu, S.; Lin, G.; Bychkov, I.; Hmelnov, A.; Ruzhnikov, G.; Zhu, N.; Liu, Z. A transfer Learning-Based LSTM strategy for imputing Large-Scale consecutive missing data and its application in a water quality prediction system. J. Hydrol. 2021, 602, 126573.
16. Zhou, Y.N.; Wang, S.Y.; Wu, T.J.; Feng, L.; Wu, W.; Luo, J.C.; Zhang, X.; Yan, N.N. For-backward LSTM-based missing data reconstruction for time-series Landsat images. GIScience Remote Sens. 2022, 59, 410–430.
17. Cosgrove, S.; Ní Rathaille, A.; Raine, R. The influence of bloom intensity on the encystment rate and persistence of Alexandrium minutum in Cork Harbor, Ireland. Harmful Algae 2014, 31, 114–124.
18. Sourisseau, M.; Le Guennec, V.; Le Gland, G.; Plus, M.; Chapelle, A. Resource Competition Affects Plankton Community Structure: Evidence from Trait-Based Modeling. Front. Mar. Sci. 2017, 4, 52.
19. Duan, Q.; Djidjeli, K.; Price, W.G.; Twizell, E.H. Weighted rational cubic spline interpolation and its application. J. Comput. Appl. Math. 2000, 117, 121–135.
20. Guo, Z.; Wan, Y.; Ye, H. A data imputation method for multivariate time series based on generative adversarial network. Neurocomputing 2019, 360, 185–197.
21. Barthelmann, V.; Novak, E.; Ritter, K. High dimensional polynomial interpolation on sparse grids. Adv. Comput. Math. 2000, 12, 273–288.
22. Zhang, M.; Liang, X.Z. On a Hermite interpolation on the sphere. Appl. Numer. Math. 2011, 61, 666–674.
23. Sun, X.L.; Guo, Y.; Li, N.; Song, X.X. Multivariate missing data imputing algorithm based on modified RNN. Inf. Technol. Netw. Secur. 2019, 38, 47–53.
24. Fouladgar, N.; Framling, K. A Novel LSTM for Multivariate Time Series with Massive Missingness. Sensors 2020, 20, 2832.
25. Song, W.; Gao, C.; Zhao, Y.; Zhao, Y.D. A Time Series Data Filling Method Based on LSTM-Taking the Stem Moisture as an Example. Sensors 2020, 20, 5045.
26. Ding, W.X.; Zhang, C.Y.; Shang, S.P.; Li, X.D. Optimization of deep learning model for coastal chlorophyll a dynamic forecast. Ecol. Model. 2022, 467, 109913.
27. Zhang, C.Y.; Zhang, X.M.; Shang, S.L. Study on quality control of automatic monitoring buoy data in Xiamen west area. In Proceedings of the 2009 Annual Academic Conference of the Chinese Society for Environmental Sciences; Beijing University of Aeronautics and Astronautics Press: Beijing, China, 2009; Volume I, pp. 582–586.
28. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
29. Roy, D.K.; Sarkar, T.K.; Kamar, S.S.A.; Goswami, T.; Muktadir, M.A.; Al-Ghobari, H.M.; Alataway, A.; Dewidar, A.Z.; El-Shafei, A.A.; Mattar, M.A. Daily Prediction and Multi-Step Forward Forecasting of Reference Evapotranspiration Using LSTM and Bi-LSTM Models. Agronomy 2022, 12, 594.
30. Lee, S.; Lee, D. Improved Prediction of Harmful Algal Blooms in Four Major South Korea’s Rivers Using Deep Learning Models. Int. J. Environ. Res. Public Health 2018, 15, 1322.
31. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
32. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
33. Siegel, D.A.; Doney, S.C.; Yoder, J.A. The North Atlantic spring phytoplankton bloom and Sverdrup’s critical depth hypothesis. Science 2002, 296, 730–733.
34. Strutton, P.G.; Martz, T.R.; DeGrandpre, M.D.; McGillis, W.R.; Drennan, W.M.; Boss, E. Bio-optical observations of the 2004 Labrador Sea phytoplankton bloom. J. Geophys. Res.-Ocean. 2011, 116, C11037.
35. Sarangi, R.K. Observation of Algal Bloom in the Northwest Arabian Sea Using Multisensor Remote Sensing Satellite Data. Mar. Geod. 2012, 35, 158–174.
36. Lim, P.T.; Leaw, C.P.; Usup, G.; Kobiyama, A.; Koike, K.; Ogata, T. Effects of light and temperature on growth, nitrate uptake, and toxin production of two tropical dinoflagellates: Alexandrium tamiyavanichii and Alexandrium minutum (Dinophyceae). J. Phycol. 2006, 42, 786–799.
37. Guallar, C.; Bacher, C.; Chapelle, A. Global and local factors driving the phenology of Alexandrium minutum (Halim) blooms and its toxicity. Harmful Algae 2017, 67, 44–60.
38. Kim, Y.; Shin, H.S.; Plummer, J.D. A wavelet-based autoregressive fuzzy model for forecasting algal blooms. Environ. Model. Softw. 2014, 62, 1–10.
39. Iriarte, A.; Aravena, G.; Villate, F.; Uriarte, I.; Ibanez, B.; Llope, M.; Stenseth, N.C. Dissolved oxygen in contrasting estuaries of the Bay of Biscay: Effects of temperature, river discharge and chlorophyll a. Mar. Ecol. Prog. Ser. 2010, 418, 57–71.
40. Pedersen, M.F.; Hansen, P.J. Effects of high pH on a natural marine planktonic community. Mar. Ecol. Prog. Ser. 2003, 260, 19–31.
41. Ajin, A.M.; Silvester, R.; Alexander, D.; Nashad, M.; Abdulla, M.H. Characterization of blooming algae and bloom-associated changes in the water quality parameters of traditional pokkali cum prawn fields along the South West coast of India. Environ. Monit. Assess. 2016, 188, 145.
42. Pitawala, S.; Trifunovic, Z.; Steele, J.R.; Lee, H.C.; Crosbie, N.D.; Scales, P.J.; Martin, G.J.O. Variation of the photosynthesis and respiration response of filamentous algae (Oedogonium) acclimated to averaged seasonal temperatures and light exposure levels. Algal Res.-Biomass Biofuels Bioprod. 2023, 64, 103213.
43. Guan, D.; Gao, D.W.; Ren, N.Q.; Li, Y.F. Viewpoints of Dominant Environmental Factors Influencing Algal Blooms. Int. Conf. Mech. Mater. Manuf. Eng. 2011, 66–68, 155–159.
44. Zhang, Y.L.; Shi, K.; Liu, J.J.; Deng, J.M.; Qin, B.Q.; Zhu, G.W.; Zhou, Y.Q. Meteorological and hydrological conditions driving the formation and disappearance of black blooms, an ecological disaster phenomena of eutrophication and algal blooms. Sci. Total Environ. 2016, 569, 1517–1529.
45. Zhou, Y.T.; Yan, W.J.; Wei, W.Y. Effect of sea surface temperature and precipitation on annual frequency of harmful algal blooms in the East China Sea over the past decades. Environ. Pollut. 2021, 270, 116224.
46. Tzoumpas, K.; Estrada, A.; Miraglio, P.; Zambelli, P. A Data Filling Methodology for Time Series Based on CNN and (Bi)LSTM Neural Networks. IEEE Access 2024, 12, 31443–31460.
47. Garcia, D.A.; Amori, M.; Giovanardi, F.; Piras, G.; Groppi, D.; Cumo, F.; de Santoli, L. An identification and a prioritisation of geographic and temporal data gaps of Mediterranean marine databases. Sci. Total Environ. 2019, 668, 531–546.

Figure 1. Schematic diagram of the LSTM unit (a) and LSTM model (b).

Figure 2. Schematic diagram illustrating patterns of missing data. Solid circles represent available data, while crosses indicate missing data.

Figure 3. Comparison between the forecast and observed values of Chl when the buoy data for Chl (a), dissolved oxygen (b), water temperature (c), and pH (d) are missing.

Figure 4. Comparison of correlations (r) and root mean square error (RMSE) of the forecasting results when buoy-observed Chl, DO, Temp, and pH data are missing.

Figure 5. Histograms of the absolute errors of the forecasting results when the buoy observations have no missing data (a) and when Chl (b), DO (c), Temp (d), and pH (e) data are missing.

Figure 6. Correlations (r) between the imputed and observed data under scenarios of missing data for one day (a), three consecutive days (b), five consecutive days (c), and seven consecutive days (d). The values on the x-axis represent the position of the missing data, where 0 indicates edge-missing data. The schematic representation of each position is shown in Figure 2.

Figure 7. Root mean square error (RMSE) between the imputed and observed data for scenarios of missing data over 1 day (a), 3 consecutive days (b), 5 consecutive days (c), and 7 consecutive days (d).

Figure 8. Correlations (r) between forecasting results of time series data after imputation and the observed data under scenarios of missing data for 1 day (a), 3 consecutive days (b), 5 consecutive days (c), and 7 consecutive days (d). The values on the x-axis represent the positions of the missing data, where 0 indicates edge-missing data.

Figure 9. Root mean square errors (RMSEs) between forecasting results of time series data after imputation and the observed data under scenarios of missing data for 1 day (a), 3 consecutive days (b), 5 consecutive days (c), and 7 consecutive days (d).

Table 1. Statistical analysis of forecasting results when buoy-observed Chl, DO, Temp, and pH data are missing, respectively.

Metric	No Missing	Missing Chl	Missing DO	Missing Temp	Missing pH
r	0.93	0.70	0.91	0.89	0.90
RMSE (μg/L)	1.09	2.08	1.22	1.32	1.28
Percentage of absolute error <0.5 μg/L	67%	36%	66%	68%	66%
Percentage of absolute error >5 μg/L	1%	4%	1%	1%	1%

Table 2. Evaluation coefficients for the forecasting results of the test data after imputation in the case of edge-missing data for different numbers of consecutive days *.

Consecutive Missing Days	Metric	Overall RMSE (μg/L)	Overall r	High Value (≥5 μg/L) RMSE (μg/L)	High Value (≥5 μg/L) r	Low Value (<5 μg/L) RMSE (μg/L)	Low Value (<5 μg/L) r
No missing	Evaluation coefficient	1.09	0.93	2.65	0.76	0.57	0.85
1 day	Evaluation coefficient	1.22	0.91	3.00	0.73	0.60	0.84
1 day	Improvement rate	−11.9%	−1.4%	−13.6%	−3.3%	−5.5%	−0.9%
3 days	Evaluation coefficient	1.31	0.89	3.30	0.64	0.58	0.85
3 days	Improvement rate	−20.0%	−3.5%	−24.8%	−15.4%	−3.0%	−0.2%
5 days	Evaluation coefficient	1.35	0.89	3.43	0.63	0.56	0.86
5 days	Improvement rate	−23.1%	−3.8%	−29.7%	−16.9%	1.6%	0.3%
7 days	Evaluation coefficient	1.38	0.89	3.47	0.64	0.59	0.85
7 days	Improvement rate	−25.8%	−3.8%	−31.0%	−16.2%	−5.1%	−0.5%

Notes: * The improvement rate refers to the degree of improvement of the evaluation coefficient relative to the case of no missing data, with positive values representing improvement and negative values representing decline.
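The improvement rate in Table 2 can be read as a signed relative change against the no-missing baseline. The sketch below assumes one plausible definition, since the exact formula is not stated here: for RMSE, where lower is better, the rate is (baseline − value)/baseline; for r, where higher is better, it is (value − baseline)/baseline. The function name `improvement_rate` is hypothetical. Because the inputs are the rounded table entries, the results reproduce the published rates only approximately.

```python
def improvement_rate(metric, baseline, value):
    """Signed relative change (%) versus the no-missing baseline.

    Assumed definition: positive means improvement, negative means decline.
    """
    if metric == "rmse":
        # Lower RMSE is better, so a rise in RMSE gives a negative rate.
        return (baseline - value) / baseline * 100.0
    if metric == "r":
        # Higher r is better, so a drop in r gives a negative rate.
        return (value - baseline) / baseline * 100.0
    raise ValueError(f"unknown metric: {metric}")


# Overall scores from Table 2: no missing data versus 1 day of edge-missing data.
rate_rmse_1day = improvement_rate("rmse", 1.09, 1.22)
rate_r_1day = improvement_rate("r", 0.93, 0.91)
print(round(rate_rmse_1day, 1))  # -11.9, matching the tabulated rate
print(round(rate_r_1day, 1))     # negative, close to but not equal to the tabulated rate
```

Under this definition, the only positive rates in Table 2 (low-value predictions with 5 days missing) correspond to a slightly lower RMSE and slightly higher r than the baseline.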

Table 3. Evaluation metrics of the forecasting results for the Chl parameter elimination and Chl parameter imputation methods when Chl data are missing.

Method	Overall RMSE (μg/L)	Overall r	High Value (≥5 μg/L) RMSE (μg/L)	High Value (≥5 μg/L) r	Low Value (<5 μg/L) RMSE (μg/L)	Low Value (<5 μg/L) r
Chl parameter elimination	2.079	0.70	4.92	0.58	1.16	0.17
Chl parameter imputation	1.462	0.88	3.73	0.59	0.60	0.85

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
