3.1. Evaluation and Comparison of the Seven Regression Models
The experiments used oceanic inversion data collected from the Ocean Color website (https://oceancolor.gsfc.nasa.gov/, accessed on 20 May 2023). Seven machine learning regression models (LCE, CATboost, XGR, LightGBM, RR, SVR, and DT) were evaluated with leave-one-out cross-validation [68,69], in which each test set contains exactly one sample.
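The leave-one-out procedure can be sketched as follows. This is a minimal illustration using only the Python standard library: the one-variable least-squares fit is a stand-in for the seven regressors, and the toy data and all function names are hypothetical rather than taken from the study.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in regressor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def leave_one_out_predictions(xs, ys):
    """Predict every sample with a model trained on all the other samples."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # hold out sample i
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    return preds

# Hypothetical toy data (not from the paper's dataset)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

preds = leave_one_out_predictions(xs, ys)
errors = [p - y for p, y in zip(preds, ys)]
rmse = sqrt(sum(e * e for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)
print(f"LOOCV RMSE={rmse:.4f}, MAE={mae:.4f}")
```

Each sample is predicted exactly once by a model that never saw it, so the pooled errors give one RMSE and one MAE per model.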
Figure 8 shows the score of each model on the evaluation metrics (RMSE/MAE). Among the seven machine learning regression models, XGR achieved the lowest scores for RMSE and MAE (0.0714), indicating the best fit to the cell concentrations in the dataset collected in this study. CATboost (0.0922), LightGBM (0.1026), and LCE (0.1044) also fit better than the remaining models, with only small differences among the three. The error scores of the other three regression models were higher than those of these four.
To further explore the relationship between meteorological data and remote sensing data in the prediction of harmful algal blooms (HABs), and to investigate model performance, all samples were used to train the seven regression models. The coefficient of determination (R-squared) and the explained variance score (EVS) were introduced as additional evaluation metrics, as shown in Figure 9. On R-squared, XGR (0.9493), CATboost (0.8809), LightGBM (0.6347), and LCE (0.8835) scored higher than the other models. On EVS, the scores were XGR (0.9593), CATboost (0.8939), LightGBM (0.6797), and LCE (0.8958). On RMSE/MAE, XGR (0.0236/0.0151), CATboost (0.0381/0.0221), LightGBM (0.0662/0.0283), and LCE (0.0382/0.0183) fit better than the other models. In Figure 9a, the red labels mark the XGR model, which performed best on both the explained variance score (EVS) and R-squared. In Figure 9b, the red labels again mark the XGR model, which performed best on both the root mean square error (RMSE) and the mean absolute error (MAE). The XGR model therefore performed best on all four evaluation metrics.
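For reference, the four metrics can be computed as in the sketch below, which follows their standard textbook definitions; the function name and the toy values are hypothetical and do not come from the study. Two identities follow from these definitions and hold for every score pair quoted above: EVS is never smaller than R-squared, and RMSE is never smaller than MAE.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R-squared, explained variance score, RMSE, and MAE."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_res = sum(residuals) / n
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    return {
        "R2": 1.0 - mse / var_true,       # penalizes bias and scatter
        "EVS": 1.0 - var_res / var_true,  # ignores a constant bias in the residuals
        "RMSE": sqrt(mse),
        "MAE": sum(abs(r) for r in residuals) / n,
    }

# Hypothetical toy values (not from the paper's dataset)
y_true = [0.2, 0.5, 0.9, 1.4]
y_pred = [0.3, 0.5, 1.0, 1.6]
scores = regression_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because EVS subtracts the mean residual before measuring spread, a model with a systematic offset scores higher on EVS than on R-squared; the two coincide only when the mean residual is zero.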
The discussion above shows that XGR, CATboost, LightGBM, and LCE performed well in harmful algal bloom detection. To further validate the contribution of the heterogeneous meteorological data to harmful algal bloom monitoring, and to avoid the limitation of earlier studies that relied solely on remote sensing data, we conducted additional experiments in which the four regression models were trained on the remote sensing data alone, with the heterogeneous meteorological data excluded. The four evaluation metrics are compared in Figure 10. As shown in Figure 10a, integrating the heterogeneous meteorological information improved the EVS of all four models. In Figure 10b, incorporating the heterogeneous meteorological data likewise improved the R-squared scores. Similarly, in Figure 10c,d, fusing the heterogeneous meteorological information with the remote sensing data yielded lower errors than using remote sensing data alone on both the MAE and RMSE metrics. Taken together, Figure 10a–d also shows that the XGR model performed well in harmful algal bloom detection even when using only remote sensing data (EVS = 0.9226; R-squared = 0.9216; MAE = 0.0187; and RMSE = 0.0326).
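The size of the gain for XGR can be worked out from the scores quoted in this section. The sketch below rests on two assumptions: that the fused-data scores reported with Figure 9 are the baseline compared in Figure 10, and that the four remote-sensing-only values map to EVS, R-squared, MAE, and RMSE in that order (the only assignment consistent with EVS being at least R-squared and RMSE being at least MAE). The function names are hypothetical.

```python
# XGR scores quoted in the text; the metric labels for the
# remote-sensing-only values are inferred, not stated explicitly.
fused = {"EVS": 0.9593, "R2": 0.9493, "RMSE": 0.0236, "MAE": 0.0151}
remote_only = {"EVS": 0.9226, "R2": 0.9216, "RMSE": 0.0326, "MAE": 0.0187}

def ablation_deltas(with_met, without_met):
    """Absolute change per metric (fused minus remote-sensing-only)."""
    return {name: round(with_met[name] - without_met[name], 4) for name in with_met}

def relative_error_reduction(with_met, without_met, metric):
    """Fraction by which an error metric drops once meteorological data are added."""
    return (without_met[metric] - with_met[metric]) / without_met[metric]

deltas = ablation_deltas(fused, remote_only)
rmse_cut = relative_error_reduction(fused, remote_only, "RMSE")
mae_cut = relative_error_reduction(fused, remote_only, "MAE")

print(deltas)
print(f"RMSE reduced by {rmse_cut:.1%}, MAE reduced by {mae_cut:.1%}")
```

Under these assumptions, adding the meteorological data raises both goodness-of-fit scores by roughly 0.03 to 0.04 and lowers RMSE by a little over a quarter.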
3.2. Feature Sorting and Analysis
In the evaluation of the seven regression models above, XGR, CATboost, and LCE outperformed the other regression models. In this section, we therefore use these three models to rank the features by weight (the feature indices are listed in Appendix A, Table A1). The feature rankings for the three models are illustrated in Figure 11. For the XGR model (red in Figure 11), the feature ranking was as follows: chlorophyll-a (0.1524), maximum sustained wind speed (0.1252), dew-point temperature (0.0820), bb_443 (0.0797), day of the year (0.0737), longitude (0.0733), sea surface temperature (0.0727), latitude (0.0626), bb_469 (0.0555), average temperature (0.0360), par (0.0339), rrsdiff (0.0304), a_547 (0.0292), a_443 (0.0238), Angstrom exponent (0.0167), sea-level pressure (0.0155), a_678 (0.0111), a_488 (0.0109), adg_443 (0.0103), a_645 (0.0097), visibility (0.0091), maximum temperature (0.0087), and a_667 (0.0038). For the CATboost model (tan in Figure 11), the feature ranking was as follows: chlorophyll-a (0.1479), day of the year (0.1232), longitude (0.1140), latitude (0.0998), maximum sustained wind speed (0.0946), adg_443 (0.0510), maximum temperature (0.0416), average temperature (0.0385), bb_469 (0.0285), par (0.0280), dew-point temperature (0.0267), sea-level pressure (0.0243), a_678 (0.0216), a_488 (0.0206), sea surface temperature (0.0182), a_645 (0.0146), Angstrom exponent (0.0142), visibility (0.0136), bb_443 (0.0098), a_547 (0.0085), rrsdiff (0.0075), a_443 (0.0069), and a_667 (0.0024). For the LCE model (light blue in Figure 11), the feature ranking was as follows: day of the year (0.1460), longitude (0.1410), latitude (0.1011), maximum sustained wind speed (0.0792), Angstrom exponent (0.0250), average temperature (0.0240), dew-point temperature (0.0192), chlorophyll-a (0.0160), maximum temperature (0.0138), sea-level pressure (0.0133), bb_469 (0.0126), visibility (0.0112), par (0.0097), bb_443 (0.0097), rrsdiff (0.0077), a_488 (0.0076), a_645 (0.0058), a_443 (0.0053), a_678 (0.0031), adg_443 (0.0030), a_547 (0.0027), sea surface temperature (0.0022), and a_667 (0.0006).
From the rankings above, the five most important variables in the XGR model’s weight distribution (chlorophyll-a, maximum sustained wind speed, dew-point temperature, bb_443, and day of the year) comprised two remote sensing variables (chlorophyll-a and bb_443), two meteorological variables (maximum sustained wind speed and dew-point temperature), and one temporal variable (day of the year). In the CATboost and LCE models, by contrast, longitude and latitude carried large weights. The weight comparison also shows chlorophyll-a to be a pivotal factor in harmful algal bloom monitoring: it held the highest weight in both XGR and CATboost, although it ranked only eighth in LCE (0.0160). Among the meteorological factors, the maximum sustained wind speed was the most important in all three models.
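The composition of the XGR top five can be checked with a short sketch. The weights are those listed for XGR in Figure 11. The grouping is an assumption made for illustration: the six variables absent from the remote-sensing-only experiments are treated as meteorological, and longitude, latitude, and day of the year are treated as a separate spatio-temporal group. The names `group_of` and `top_k_by_group` are hypothetical.

```python
from collections import Counter

# Assumed grouping (not an explicit table in the source)
METEOROLOGICAL = {
    "maximum sustained wind speed", "dew-point temperature", "average temperature",
    "sea-level pressure", "visibility", "maximum temperature",
}
SPATIOTEMPORAL = {"day of the year", "longitude", "latitude"}

# XGR feature weights as listed for Figure 11
xgr_weights = {
    "chlorophyll-a": 0.1524, "maximum sustained wind speed": 0.1252,
    "dew-point temperature": 0.0820, "bb_443": 0.0797, "day of the year": 0.0737,
    "longitude": 0.0733, "sea surface temperature": 0.0727, "latitude": 0.0626,
    "bb_469": 0.0555, "average temperature": 0.0360, "par": 0.0339,
    "rrsdiff": 0.0304, "a_547": 0.0292, "a_443": 0.0238,
    "Angstrom exponent": 0.0167, "sea-level pressure": 0.0155, "a_678": 0.0111,
    "a_488": 0.0109, "adg_443": 0.0103, "a_645": 0.0097, "visibility": 0.0091,
    "maximum temperature": 0.0087, "a_667": 0.0038,
}

def group_of(feature):
    """Map a feature name to its assumed data source group."""
    if feature in METEOROLOGICAL:
        return "meteorological"
    if feature in SPATIOTEMPORAL:
        return "spatiotemporal"
    return "remote_sensing"

def top_k_by_group(weights, k=5):
    """Return the k highest-weighted features and a count of their groups."""
    ranked = sorted(weights, key=weights.get, reverse=True)[:k]
    return ranked, Counter(group_of(name) for name in ranked)

top5, counts = top_k_by_group(xgr_weights)
print(top5)
print(dict(counts))
```

The same function can be applied to the CATboost and LCE weights to see how many of their top-ranked features fall in the spatio-temporal group.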
The feature analysis shows that the models were sensitive to geographical characteristics, meteorological factors, and some of the remote sensing data, which suggests that harmful algal bloom characteristics vary across regions. The feature weights were spread over a similar range in all three regression models. For XGR, the difference between the highest weight, chlorophyll-a (0.1524), and the lowest weight, a_667 (0.0038), was 0.1486. For CATboost, the difference between the highest weight, chlorophyll-a (0.1479), and the lowest weight, a_667 (0.0024), was 0.1455. For LCE, the difference between the highest weight, the day of the year (0.1460), and the lowest weight, a_667 (0.0006), was 0.1454. The weight ranges of the three models thus differed by at most 0.0032, which is not substantial.
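The range comparison is simple arithmetic on the quoted extremes, sketched here for completeness; the variable names are hypothetical.

```python
# (highest weight, lowest weight) for each model, as quoted in the text
extremes = {
    "XGR": (0.1524, 0.0038),
    "CATboost": (0.1479, 0.0024),
    "LCE": (0.1460, 0.0006),
}

# Range of feature weights per model
spans = {model: round(high - low, 4) for model, (high, low) in extremes.items()}

# Largest gap between any two models' ranges
spread_between_models = round(max(spans.values()) - min(spans.values()), 4)

print(spans)
print(spread_between_models)
```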
To further compare using only remote sensing data with integrating heterogeneous meteorological data, we analyzed the feature weights of the above three models when trained on remote sensing data alone. In Figure 12, red represents the XGR model and blue represents the CATboost model. The feature weights of the XGR model when using only remote sensing data were as follows: bb_469 (0.1481), sea surface temperature (SST) (0.1357), latitude (0.1074), longitude (0.1025), bb_443 (0.0936), day of the year (0.0839), a_443 (0.0522), adg_443 (0.0521), chlorophyll-a (0.0437), par (0.0364), a_488 (0.0332), Angstrom exponent (0.0242), a_667 (0.0228), a_645 (0.0215), a_678 (0.0162), rrsdiff (0.0133), and a_547 (0.0131). For the CATboost model when using only remote sensing data, the feature weights were as follows: day of the year (0.2311), longitude (0.1860), latitude (0.1302), chlorophyll-a (0.0728), adg_443 (0.0612), Angstrom exponent (0.0521), a_547 (0.0328), SST (0.0309), a_443 (0.0266), bb_443 (0.0257), a_667 (0.0245), a_488 (0.0232), a_678 (0.0224), par (0.0209), bb_469 (0.0205), a_645 (0.0203), and rrsdiff (0.0188).
When only remote sensing data were used for harmful algal bloom detection, the importance of critical indicators such as chlorophyll-a decreased after the meteorological data were removed (from 0.1524 to 0.0437 in XGR and from 0.1479 to 0.0728 in CATboost). Conversely, the share of importance assigned to spatial and temporal information (longitude, latitude, and the day of the year) increased, which is not advantageous for harmful algal bloom detection tasks. In the XGR model, however, the sea surface temperature (SST) continued to play a significant role in harmful algal bloom detection, thus indirectly emphasizing the importance of heterogeneous meteorological data in these tasks.
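The shift toward spatial and temporal information can be quantified by summing the weights of longitude, latitude, and day of the year in each setting. The sketch below uses only weights quoted in this section; the dictionary layout and function name are hypothetical.

```python
SPATIOTEMPORAL = ("longitude", "latitude", "day_of_year")

# Weights quoted in the text for the fused and remote-sensing-only settings
weights = {
    ("XGR", "fused"): {
        "longitude": 0.0733, "latitude": 0.0626, "day_of_year": 0.0737,
        "chlorophyll_a": 0.1524,
    },
    ("XGR", "remote_only"): {
        "longitude": 0.1025, "latitude": 0.1074, "day_of_year": 0.0839,
        "chlorophyll_a": 0.0437,
    },
    ("CATboost", "fused"): {
        "longitude": 0.1140, "latitude": 0.0998, "day_of_year": 0.1232,
        "chlorophyll_a": 0.1479,
    },
    ("CATboost", "remote_only"): {
        "longitude": 0.1860, "latitude": 0.1302, "day_of_year": 0.2311,
        "chlorophyll_a": 0.0728,
    },
}

def spatiotemporal_share(feature_weights):
    """Total weight assigned to longitude, latitude, and day of the year."""
    return round(sum(feature_weights[name] for name in SPATIOTEMPORAL), 4)

shares = {key: spatiotemporal_share(w) for key, w in weights.items()}
for (model, setting), share in shares.items():
    print(f"{model:8s} {setting:12s} spatio-temporal share = {share:.4f}")
```

In both models the combined spatio-temporal weight rises when the meteorological variables are dropped, while the chlorophyll-a weight falls.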
It is worth noting that, for LCE trained on remote sensing data only, the importance of spatial information (longitude) was far higher than that of any other variable, with a feature weight of 0.5820 for longitude. Figure 13 displays the feature weight comparison chart of the LCE model, in which the highest feature point (longitude: 0.5820) is marked by a red circle and the lowest feature point (a_678: 0.0005) is marked by a blue circle. The feature weight span of the LCE model thus reached 0.5815. Such a large span in feature weights can lead to increased numerical instability, thereby affecting the regression performance of the model. The distribution of the feature weights shows that, when only remote sensing data are used for harmful algal bloom detection, the model relies much more heavily on spatial information. Because the data were collected over a wide area, this heavy reliance on location could strongly affect the accuracy of harmful algal bloom detection tasks worldwide.
In terms of feature diversity, the standard deviations of the feature weights are compared in Figure 14: XGR (0.0391), CATboost (0.0417), and LCE (0.0425). XGR and CATboost had smaller standard deviations of feature weights than LCE, which we take as an indication that they are more stable.
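The standard deviations can be recomputed from the weights listed for Figure 11, as sketched below. The use of the population standard deviation (dividing by the number of features) is an assumption; the source does not state which estimator was used, but this one appears to match the reported values to within rounding.

```python
from statistics import pstdev

# Feature weights for the fused-data models, as listed for Figure 11
model_weights = {
    "XGR": [
        0.1524, 0.1252, 0.0820, 0.0797, 0.0737, 0.0733, 0.0727, 0.0626,
        0.0555, 0.0360, 0.0339, 0.0304, 0.0292, 0.0238, 0.0167, 0.0155,
        0.0111, 0.0109, 0.0103, 0.0097, 0.0091, 0.0087, 0.0038,
    ],
    "CATboost": [
        0.1479, 0.1232, 0.1140, 0.0998, 0.0946, 0.0510, 0.0416, 0.0385,
        0.0285, 0.0280, 0.0267, 0.0243, 0.0216, 0.0206, 0.0182, 0.0146,
        0.0142, 0.0136, 0.0098, 0.0085, 0.0075, 0.0069, 0.0024,
    ],
    "LCE": [
        0.1460, 0.1410, 0.1011, 0.0792, 0.0250, 0.0240, 0.0192, 0.0160,
        0.0138, 0.0133, 0.0126, 0.0112, 0.0097, 0.0097, 0.0077, 0.0076,
        0.0058, 0.0053, 0.0031, 0.0030, 0.0027, 0.0022, 0.0006,
    ],
}

# Population standard deviation of each model's feature weights
weight_std = {model: pstdev(values) for model, values in model_weights.items()}

for model, std in weight_std.items():
    print(f"{model}: {std:.4f}")
```

A smaller standard deviation means the weight is spread more evenly across the 23 features instead of being concentrated in a few of them.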
Therefore, considering both the model evaluation metrics and the feature analysis, XGR demonstrated the best regression performance. In this study, the XGR model was accordingly employed for the regression prediction of harmful algal blooms from MODIS multifactor data and heterogeneous meteorological data.