Article

Construction and Comparison of Machine-Learning Forecast Models of Albacore Thunnus alalunga Fishing Grounds in the South Pacific Ocean

1 Marine and Fishery Institute, Zhejiang Ocean University, Zhoushan 316021, China
2 Zhejiang Marine Fisheries Research Institute, Zhoushan 316021, China
3 Key Laboratory of Sustainable Utilization of Technology Research for Fishery Resource of Zhejiang Province, Zhoushan 316021, China
4 College of Marine Biological Resources and Management, Shanghai Ocean University, Shanghai 201306, China
5 East China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Shanghai 200090, China
* Author to whom correspondence should be addressed.
Fishes 2024, 9(10), 375; https://doi.org/10.3390/fishes9100375
Submission received: 10 August 2024 / Revised: 13 September 2024 / Accepted: 17 September 2024 / Published: 25 September 2024
(This article belongs to the Special Issue Biodiversity and Spatial Distribution of Fishes)

Abstract

Traditional methods for predicting the distribution of albacore (Thunnus alalunga) fishing grounds have low performance and accuracy, and uneven sampling can make the evaluation indicators misleading. To address these issues, three methods (equal-frequency division, K-means clustering, and 1-R split) were applied to discretize the catch per unit effort (CPUE) of albacore in the South Pacific from 2016 to 2021 and partition the fishing grounds into abundance levels. Eight machine learning models were used to predict the fishing grounds. In addition to the traditional evaluation indices based on the confusion matrix, the top-k index was used to evaluate the accuracy of fishery abundance predictions. The results showed that (1) when sampling is unbalanced, the reported accuracy does not fully represent the actual performance of the model in predicting the abundance of albacore in the fishing ground, and the F1 value can be used as the index of model effect and stability; (2) in binary classification, the stacking algorithm with quartile division performed best, with an F1 of 0.89; (3) the highest top-1 prediction accuracy was 0.74 for three-category fishery forecasting and 0.54 for five-category fishery forecasting; and (4) the top-k accuracy of multi-level abundance classification using K-means was significantly better than that of equal-frequency discretization (p < 0.001). This study is the first to use the top-k evaluation index to predict albacore fishing grounds across multiple abundance levels, which offers a new method for this application and a demonstration for the future development of artificial intelligence techniques in fisheries.
Key Contribution: This paper compares the influence of different discretization methods on the prediction of fishing grounds by several machine learning methods. In addition, a top-k index is used for the first time to describe the fault tolerance of different discretization methods for albacore fishing ground forecasts. These methods provide a richer set of prediction tools for future tuna fishery forecasts based on artificial intelligence.

1. Introduction

The albacore (Thunnus alalunga) is a pelagic fish that inhabits open ocean waters in tropical and temperate regions worldwide, including the Atlantic, Pacific, and Indian Oceans [1]. As a top predator in the marine food chain, albacore play an important role in marine ecology and are also an important economic target in fisheries [2]. As a highly migratory species, the movements and habits of albacore also provide valuable insights into the complex dynamics of open marine ecosystems [3]. However, the impacts of climate change, overfishing, and bycatch on albacore populations highlight the need for increased measures to protect this valuable species and its widespread habitat [4]. Therefore, understanding how to establish a stable and accurate model for predicting the spatiotemporal distribution of albacore through remote sensing technology has become an urgent problem.
The distribution of albacore is influenced by various environmental factors, and previous studies typically included various marine environmental variables as indicators of the spatiotemporal distribution of tuna [5]. Vertical seawater temperature (ST) directly affects the spatial and vertical distribution of albacore [6]. Chlorophyll-a concentration (CHL) has an indirect effect on the distribution of albacore [7], and sea surface height anomaly (SSHA) may also affect their distribution [8]. The salinity of seawater is considered to play an important role in modeling the distribution of albacore [9]. In addition, sea surface wind speed (WS) may cause changes in mixed layer depth (MLD), which in turn could result in changes in the vertical distribution of albacore. Eddy kinetic energy (EKE) causes changes in other marine environmental conditions, indirectly affecting the population of albacore [10]. Large-scale climate events such as El Niño and La Niña can affect the migration and distribution of albacore [4,11], as seen in the observed correlation between the Southern Oscillation Index and albacore catches. In addition, research on the life cycle of albacore has shown that the Pacific Decadal Oscillation index affects the sensitivity of mid- to upper-sea-level fish [12].
In the era of big data, artificial intelligence algorithms are rapidly developing and maturing. In recent years, researchers have begun using machine learning algorithms, especially supervised learning algorithms, to construct prediction models for fishing grounds [13,14]. Artificial neural networks were used early on to predict albacore fishing grounds in the South Pacific [15], but that work was limited to exploring the relationship between environmental factors and catch per unit effort (CPUE). Moreover, because neural networks overfit easily, deployed models have shown low accuracy, and their robustness and accuracy require further improvement. Ensemble learning algorithms, such as random forest [16] in the Indian Ocean and boosted regression trees [17] in the Tonga EEZ, have been applied to the prediction of tuna fisheries. Stacking (STK) models stand out for their prediction accuracy in fisheries, and their performance in fishing ground forecasting has been better than that of single models and other ensemble learning methods [18].
In previous studies on predicting tuna fishing grounds, the traditional method for classifying high- and low-abundance fishing grounds was the third-quartile method [16]. This method takes the third quartile of CPUE as the dividing line and divides the fishing grounds into low-yield and high-yield fishing grounds [19]. However, this method cannot avoid the problem of imbalanced sampling [20]. An imbalance in the proportion of majority and minority classes can bias a model’s predictions toward the majority class, resulting in high accuracy but often limiting recall and precision [21]. Although imbalanced sampling can sometimes be addressed with data augmentation methods such as oversampling, undersampling, and reinforcement learning based on generative adversarial networks, the lack of clear boundaries between abundance levels in the multi-dimensional feature space prevents the accuracy of the model from being improved through simple feature engineering alone [22].
This study hypothesizes that a new method of dividing the abundance levels of fishing grounds will improve the predictive performance of the model (Figure 1). Three different methods of dividing fishing grounds, equal frequency division, cluster division, and 1-R division, were used to classify fishing ground abundance prediction as a binary or multi-class classification problem. Eight machine learning methods, including advanced ensemble learning, were applied to the dataset, and the top-k evaluation index was used as a novel metric for evaluating the predictions of fishing ground abundance levels. The relationship between the accuracy of the model’s prediction of albacore fishing ground abundance and the division method was explored, and the concept of model fault tolerance was added to determine how unsupervised learning can guide the classification of fishing ground levels to improve the power of the model’s prediction capabilities.

2. Materials and Methods

2.1. Data Sources

This study used tuna production data from a total of 34 longline vessels provided by Pingtairong Marine Fisheries Group Co., Ltd. (Zhoushan, China) and Jiangsu Yuanyou Marine Fisheries Co., Ltd. (Qidong, China) (Figure 2). The operating area was the South Pacific (0°–40° S, 160° E–90° W). A total of 27,126 operational fishing records were aggregated onto a 1° × 1° grid by month, yielding a total of 4621 samples.
The following 17 marine environmental factors were collected to build the fishing ground prediction models: sea surface temperature (SST); seawater temperature at depths of 50, 100, and 200 m (ST50, ST100, ST200); sea surface salinity (SSS); salinity at depths of 50, 100, and 200 m (SS50, SS100, SS200); chlorophyll-a concentration (CHL); sea surface dissolved oxygen concentration (DO); sea surface height anomaly (SSHA); mixed layer depth (MLD); eddy kinetic energy (EKE); Southern Oscillation Index (SOI); photosynthetically active radiation (PAR); wind speed (WS); and Pacific Decadal Oscillation index (PDO). SST, SSS, SOI, and PAR were downloaded from the database of the National Oceanic and Atmospheric Administration (https://coastwatch.pfeg.noaa.gov (accessed on 20 August 2023)). ST50, ST100, ST200, SS50, SS100, SS200, CHL, DO, SSHA, MLD, EKE, PDO, and WS were downloaded from the Copernicus marine environmental monitoring website (https://data.marine.copernicus.eu (accessed on 20 August 2023)). The temporal resolution of all environmental factors was monthly. The spatial resolution of SST and PAR was 0.04° × 0.04°, and the spatial resolution of all other environmental data was 0.25° × 0.25°. All environmental spatial data were resampled in Python 3.8 to match the 1° × 1° resolution of the longline fishing data.
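The resampling method is not specified in the text. As a minimal sketch, assuming simple block averaging (an assumption, not necessarily the authors' procedure), a 0.25° field can be brought to 1° by averaging non-overlapping 4 × 4 blocks; the function name `block_mean` is hypothetical.

```python
import numpy as np

def block_mean(field, factor):
    """Average non-overlapping factor x factor blocks of a 2-D (lat, lon) field.

    Assumed method: plain block averaging that ignores NaN cells (e.g. land).
    factor=4 takes a 0.25-degree grid to a 1-degree grid.
    """
    field = np.asarray(field, dtype=float)
    n_lat, n_lon = field.shape
    if n_lat % factor or n_lon % factor:
        raise ValueError("grid dimensions must be multiples of the factor")
    # Split each axis into (coarse cell, position inside the cell)
    blocks = field.reshape(n_lat // factor, factor, n_lon // factor, factor)
    return np.nanmean(blocks, axis=(1, 3))

# An 8 x 8 fine grid becomes a 2 x 2 coarse grid
fine = np.arange(64.0).reshape(8, 8)
coarse = block_mean(fine, 4)
print(coarse.shape)  # (2, 2)
```

A block that contains only NaN values yields NaN (with a NumPy warning), which is usually the desired behaviour for cells that fall entirely on land.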

2.2. Data Processing

The catch per unit effort (CPUE) was calculated for each 1° × 1° grid cell monthly using the formula [23]:
$$\mathrm{CPUE}_{y,m,i,j} = \frac{\mathrm{Catch}_{y,m,i,j}}{\mathrm{Hook}_{y,m,i,j}} \times 1000$$
where $\mathrm{Catch}_{y,m,i,j}$ is the number of albacore individuals (ind.) caught in the unit grid cell at longitude i and latitude j in year y and month m, and $\mathrm{Hook}_{y,m,i,j}$ is the number of hooks in the same grid cell, year, and month.
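The CPUE calculation can be illustrated with a short sketch. The record layout (year, month, longitude, latitude, catch, hooks) and the function name `gridded_cpue` are assumptions for illustration, since the logbook format is not described here.

```python
import math
from collections import defaultdict

def gridded_cpue(records):
    """Sum catch and hooks per monthly 1-degree cell, then compute CPUE.

    records: iterable of (year, month, lon, lat, catch_ind, hooks) tuples
             (hypothetical layout).
    Returns {(year, month, lon_cell, lat_cell): individuals per 1000 hooks}.
    """
    catch = defaultdict(float)
    hooks = defaultdict(float)
    for year, month, lon, lat, catch_ind, n_hooks in records:
        # floor() maps a position to the south-west corner of its 1-degree cell
        cell = (year, month, math.floor(lon), math.floor(lat))
        catch[cell] += catch_ind
        hooks[cell] += n_hooks
    # Cells without any hooks are skipped to avoid division by zero
    return {c: 1000 * catch[c] / hooks[c] for c in catch if hooks[c] > 0}

# Two sets in the same cell and month: 50 fish on 5000 hooks
sets = [
    (2020, 9, -120.4, -30.2, 30, 2000),
    (2020, 9, -120.9, -30.7, 20, 3000),
]
print(gridded_cpue(sets))  # {(2020, 9, -121, -31): 10.0}
```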
Three different partitioning methods were used to discretize CPUE: equal-frequency partitioning, K-means clustering partitioning [24], and 1-R discretization [25]. Equal-frequency partitioning can place segmentation boundaries at the third quartile, fourth quartile, and median, and can divide fishing grounds into binary low- and high-abundance areas. The K-means clustering discretization method divides a continuous variable into several discrete intervals based on the inherent distance and similarity of the data. The 1-R discretization method is a decision tree-based discretization method that divides a continuous variable into several discrete values according to the construction method of the decision tree. In the binary case, low- and high-abundance fishing grounds were represented as “0” and “1”, respectively.
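The difference between equal-frequency and K-means partitioning can be seen on a small skewed sample. The sketch below is illustrative only: the function names are hypothetical, the K-means routine is a plain one-dimensional Lloyd's algorithm with a deterministic quantile start rather than the implementation used in the study, and 1-R discretization is not covered. scikit-learn's `KBinsDiscretizer` offers comparable `quantile` and `kmeans` strategies.

```python
import numpy as np

def equal_frequency_labels(cpue, n_levels):
    """Label values 0..n_levels-1 so that each level holds about the same count."""
    cpue = np.asarray(cpue, dtype=float)
    # Interior quantiles, e.g. [0.5] for 2 levels or [1/3, 2/3] for 3 levels
    probs = np.arange(1, n_levels) / n_levels
    edges = np.quantile(cpue, probs)
    return np.digitize(cpue, edges, right=True)

def kmeans_1d_labels(cpue, n_levels, n_iter=100):
    """Label values by 1-D K-means; level 0 is the cluster with the lowest centre."""
    cpue = np.asarray(cpue, dtype=float)
    # Deterministic start: spread the initial centres over the data quantiles
    centres = np.quantile(cpue, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iter):
        nearest = np.argmin(np.abs(cpue[:, None] - centres[None, :]), axis=1)
        updated = np.array([
            cpue[nearest == k].mean() if np.any(nearest == k) else centres[k]
            for k in range(n_levels)
        ])
        if np.allclose(updated, centres):
            break
        centres = updated
    # Sorting the centres makes the label order follow abundance
    centres = np.sort(centres)
    return np.argmin(np.abs(cpue[:, None] - centres[None, :]), axis=1)

cpue = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one very high value
print(np.bincount(equal_frequency_labels(cpue, 2)))  # [5 5]
print(np.bincount(kmeans_1d_labels(cpue, 2)))        # [9 1]
```

Equal-frequency labels are balanced by construction, whereas the clustering follows the shape of the data and can leave a small high-abundance class, which is the sampling imbalance discussed in the text.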

2.3. Feature Engineering

This study used Pearson correlation tests to analyze the relationships between environmental factors, screen out highly correlated explanatory variables, and identify correlations between the spatiotemporal and environmental factors and CPUE.
The calculation method was as follows:
$$\rho_{X,Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$$
where X and Y are the two variables being compared, and μ and σ denote the expected value (mean) and the standard deviation, respectively.
In order to offset the impact between different scales, data were normalized using the following formula:
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
where $X'$, $X$, $X_{\min}$, and $X_{\max}$ refer to the normalized values, actual values, minimum values, and maximum values of the variables, respectively.
This study used a large number of environmental factors as variables to predict albacore fishing ground abundances. Principal component analysis (PCA) was used to analyze the included environmental variables to eliminate data noise and display differences in samples from fishing areas across multiple abundances.
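Both formulas in this subsection are short enough to write out directly. The following is a minimal NumPy sketch with hypothetical function names and made-up sample values; the study itself reports using SciPy for the Pearson analysis.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: mean of centred products over the product of std devs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = ((x - x.mean()) * (y - y.mean())).mean()
    return float(covariance / (x.std() * y.std()))

def min_max_scale(x):
    """Rescale values linearly to the interval [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # A constant variable carries no information; map it to zeros
        return np.zeros_like(x)
    return (x - x.min()) / span

sst = [18.0, 21.0, 24.0, 27.0]   # made-up temperatures
cpue = [40.0, 31.0, 19.0, 12.0]  # made-up CPUE values
print(min_max_scale(sst))
print(pearson_r(sst, cpue))
```

Min-max scaling is a positive linear transformation, so it leaves the Pearson coefficient unchanged; the correlation screening therefore gives the same result before or after normalization.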

2.4. Modeling Methods

Eight machine learning methods (Table 1) were tested to build prediction models of albacore fishing grounds. Five-fold cross-validation was used to select model hyperparameters, and the main parameters of each algorithm are listed in Table 1.
The STK model combines multiple base learners to form a learner with stronger generalization ability. Strong and stable models such as KNN, RF, and XGB are suitable as base learners, whereas models with a simple structure and strong generalization ability are suitable as the meta-learner. ANN has a simple structure and can effectively process the nonlinear part of the data with strong generalization ability. Therefore, KNN, RF, and XGB were applied as base learners and ANN as the meta-learner, and the system was trained with 5-fold cross-validation. The algorithm flowchart is shown in Figure 3.
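A stacking arrangement of this kind can be sketched with scikit-learn's `StackingClassifier`. This is an illustration under stated assumptions, not the configuration of Table 1: the data are synthetic, the hyperparameters are arbitrary, and `GradientBoostingClassifier` stands in for XGB so that the example depends on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for 17 normalized environmental predictors and a binary label
X, y = make_classification(
    n_samples=400, n_features=17, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

stk = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        # Assumption: gradient boosting used here in place of XGBoost
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    # A small neural network acts as the meta-learner
    final_estimator=MLPClassifier(
        hidden_layer_sizes=(8,), max_iter=2000, random_state=0
    ),
    # Base-learner predictions fed to the meta-learner come from 5-fold CV
    cv=5,
)
stk.fit(X_train, y_train)
test_accuracy = stk.score(X_test, y_test)
print(f"hold-out accuracy: {test_accuracy:.3f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for data the base learners have already seen, is what limits overfitting in this design.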

2.5. Evaluating Indicators

The confusion matrix (Table 2) summarizes how a classification model performs in the comparison of the predicted and actual values in a dataset [21].
Accuracy (A), precision (P), recall (R), and the F1 score (the harmonic mean of P and R) are performance indicators used to evaluate the effectiveness of classification models. They were calculated as follows:
$$A = \frac{TP + TN}{TP + TN + FP + FN}$$
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
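A worked example shows why accuracy alone can mislead under imbalanced sampling. The sketch uses the standard definitions, with recall computed as TP / (TP + FN); the counts are invented for illustration and the function name is hypothetical.

```python
def classification_scores(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"A": accuracy, "P": precision, "R": recall, "F1": f1}

# Invented counts: 1000 samples, only 50 in the positive class,
# and a model that almost always predicts the majority class
scores = classification_scores(tp=5, fp=5, fn=45, tn=945)
print(scores)  # A = 0.95, P = 0.5, R = 0.1, F1 is about 0.167
```

Accuracy is 0.95 even though only 5 of the 50 positive samples are found, while F1 (about 0.17) reflects the poor recall.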
The receiver operating characteristic (ROC) curve is a graphical representation of how the performance of a binary classification model changes as the classification threshold varies [34]. The area under the ROC curve (AUC) is a single scalar that summarizes the overall performance of the model, without considering the selected classification threshold. AUC = 1.0 indicates a perfect classifier, while AUC = 0.5 indicates that the performance of the classifier is no better than random chance [35].
Top-k accuracy is one of the key indicators for evaluating the classification performance of a model in multi-class problems [36]. It is the proportion of samples for which the correct label is among the K classes with the highest predicted probability, where K is a pre-specified positive integer. Therefore, as K approaches the total number of labels, the model’s fault tolerance increases. When K is 1, top-k accuracy equals the model’s ordinary prediction accuracy on the test set. Usually, a higher top-k accuracy implies better classification performance. Compared to traditional accuracy, top-k accuracy is more suitable for multi-class classification problems, especially for evaluating model performance when there are a large number of categories [37].
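The definition of top-k accuracy can be written in a few lines. The sketch below uses a hypothetical function name and invented class probabilities for three abundance levels; scikit-learn provides `top_k_accuracy_score` for the same purpose.

```python
import numpy as np

def top_k_accuracy(y_true, proba, k):
    """Share of samples whose true class is among the k most probable classes."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    # argsort is ascending, so the last k columns hold the k most probable classes
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())

# Invented probabilities for four samples and three abundance levels (0, 1, 2)
proba = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.5, 0.4],
    [0.5, 0.4, 0.1],
]
y_true = [0, 2, 2, 2]
for k in (1, 2, 3):
    print(k, top_k_accuracy(y_true, proba, k))  # 0.25, 0.75, 1.0
```

With K equal to the number of classes the score is always 1.0, so the informative values of K lie strictly below the number of abundance levels.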

2.6. Statistical Analysis

Python 3.9 was primarily used for statistical analysis in this study. In Python, the modules NumPy and pandas were used for statistical analysis and calculation of large datasets and SciPy was used for Pearson correlation analysis. Feature engineering and machine learning methods came from the scikit-learn, CatBoost, and XGBoost modules. Matplotlib, seaborn, and basemap modules were used for data visualization and geographic information processing.

3. Results

3.1. Distribution of Albacore Fishing Ground in the South Pacific

The albacore fishing grounds were mainly distributed between 10° S–35° S and 160° E–90° W in the South Pacific, with high CPUE in the area of 25° S–35° S and 100° W–130° W (Figure 4). The monthly average catch and CPUE of South Pacific albacore over time (Figure 5) showed that CPUE reached 41.7 in August 2021 and 43.5 in September 2020, higher than in other months.

3.2. Correlation between Environmental Variables and CPUE of Albacore

Every variable except year, PDO, and WS had a significant correlation with CPUE (p < 0.001) (Table 3). SST and DO were the two environmental variables with the highest correlation coefficients. SST had the strongest negative correlation with CPUE (−0.48), and DO had the strongest positive correlation with CPUE (0.49).

3.3. Comparative Analysis of Binary Classification Models

The abundance of fishing grounds was divided into low-abundance and high-abundance areas through three discretization methods (Table 4). ROC curves (Figure 6) and PR curves (Figure 7) showed that when the sampling ratio was appropriate, the prediction performance of the ensemble learning models was better than that of the single models in scheme C. The AUC values of STK, CAT, and RF reached 0.817, 0.817, and 0.814, respectively. The AP values of STK, CAT, and XGB reached 0.906, 0.914, and 0.91, respectively. When the sampling ratio was relatively balanced, STK (Table 5) showed higher P and R for high-abundance fishing grounds. Using the first quartile resulted in the highest F1 value, 0.89, with an accuracy of 0.823. Thus, although the accuracy was higher (reached 0.976) than other segmentation methods, the A and R were lower (reached 0.417 and 0.192). These results suggested that the quartile method was the most appropriate discretization method for dividing high- and low-abundance fishing grounds.

3.4. Comparative Analysis of Multiple Classification Models

The PCA results (Figure 8) indicated that there was no significant difference between the various fishing grounds (Table 6) in this study. Principal component PC1 explained 46.07% of the variation, while PC2 explained 13.47%. When the abundance of the fishing grounds was divided into 3 levels, the clustering discretization gave the highest top-1 prediction accuracy, reaching 0.741 for XGB (Figure 9). When the abundance was divided into 5 levels by clustering, the top-1 prediction accuracy of STK was the highest, reaching 0.54. The comparison of the top-k accuracy predicted by each learner (Table 7) showed that the prediction performance obtained with the K-means clustering division of fishing ground levels was significantly better than that obtained with the equal-frequency division (p < 0.001), and the same result was also observed in low fault tolerance prediction (p < 0.001).

3.5. Performance of Fishing Ground Forecast Models

The test performance obtained by using the quartile method to classify high- and low-abundance fishing grounds and predicting them with STK is shown in Figure 10. Figure 10A shows that the STK model performed well in predicting high-abundance fishing grounds; the model predicted high-abundance fishing grounds in all survey areas, with a higher density between 15° S–35° S and 90° W–130° W. Figure 10B shows that the STK model had certain problems in predicting the negative category, i.e., low-abundance fishing grounds. The model exhibited low recall for low-abundance fishing grounds, and most of the survey areas of 5° S–20° S, 120° W–180° W were classified as low-abundance fishing grounds, corresponding with the previous result.
When CPUE was classified in 3 levels by K-means, the distribution of actual and predicted fishing grounds (Figure 11) showed that high-abundance fishing grounds have a higher probability of occurrence at 90° W–120° W and 25° S–35° S. When CPUE was classified into 5 levels (Figure 12), the distribution of low-abundance and high-abundance fishing grounds predicted by the model was similar to that of low-abundance fishing grounds classified into 3 levels. However, the discrimination performance for high-abundance and low-abundance fishing grounds was poor, because when the sampling ratio difference was too large, the learner was conservative in discriminating minority classes and tended to predict majority classes to achieve the minimum learning loss function.

4. Discussion

4.1. Model Comparison and Analysis

Fishing ground forecasting is a technique that predicts the distribution of specific fish species within a specified time and location based on ocean environmental and historical data, and can be expanded to predict expected catch, fishing season, quantity, and quality [13]. Currently, most fishery forecasts have established the assumption that the life history of marine fish is closely related to the physical and chemical conditions of the ocean [14,18]. Traditional fishing ground predictions are built with habitat index models, generalized linear models, and generalized additive models, which are greatly affected by human factors, have weak generalization ability, difficult feature transformation, and insufficient fitting ability [16,19]. Machine learning is a growing method of data mining, which can explore the potential relationship between environmental variables and fishery abundance, establish the relationship between environmental factors and fishery resource abundance through certain rules, and play an important role in fishery forecasting research. Compared to traditional habitat index models and linear models, machine learning algorithms are capable of mining large-scale, complex, variable, and high-dimensional ocean data information, with a stronger ability to fit data and achieve high-precision and high-generalization ability in fishing ground forecasting [18].
A single machine learning model such as KNN, LR, SVM, or ANN can achieve good fishing ground prediction results. Compared with a single model, an ensemble learning algorithm can provide more accurate predictions and better generalization by building multiple learners, for example through bootstrap sampling in random forests and through boosting in XGB and CAT. RF performed better than linear models and ANN in predicting the habitat distribution characteristics of the spring short octopus in Haizhou Bay and its relationships with environmental factors [38]. Additionally, several ensemble algorithms were applied to estimate the biomass of major economically important crabs in Zhoushan fishing grounds, and XGB had the best estimation effect [39]. CAT was applied to forecast the albacore fishery in the South Pacific Ocean and showed greater advantages than single models. STK had superior performance in predicting potential fishing areas for target fish species, thanks to its advantages over other models in handling data noise, non-Gaussian distributions, and non-linear relationships. In predicting albacore fishing grounds in the South Pacific, STK had higher accuracy than the other machine learning models. Mugo and Saitoh (2020) applied machine learning models to model the overall habitat of bonito in the western Pacific Ocean, and the results showed that ensemble learning algorithms have higher generalization ability compared to other models [40].
We propose three main reasons why the STK model had superior performance in predicting fishing grounds. (1) For fishing ground label classification, KNN, RF, and XGB had superior performance and were more stable compared to other models, making overfitting less likely to occur. However, due to insignificant differences between fishing grounds, the performance upper limit of secondary learners is low, and the accuracy of predictions is greatly affected by the distribution of data. The STK model integrates relatively stable and accurate KNN, RF, and XGB, and uses a structurally simple ANN as a meta learner to further prevent model overfitting, resulting in better performance in fishery label classification tasks. (2) The use of the 5-fold cross validation method fully utilizes the entire dataset, reduces the risk of overfitting, and improves the model’s generalization ability. (3) Considering that CPUE does not always follow a normal distribution, multiple methods of separating data were tested to avoid various problems caused by imbalanced data during model prediction.

4.2. Impact of Environmental Factors on Model Prediction

The life history of albacore is closely related to ocean temperature, salinity, dissolved oxygen concentration, and sea level height [41]. As large-scale migratory marine bony fish, albacore are usually distributed in tropical and temperate waters, and their abundance is highly correlated with temperature. This may be because warm water is suitable for albacore to lay eggs, and during the spawning period, the development and survival rate of eggs and juveniles greatly depends on the appropriate temperature [42]. Appropriate water temperature also helps to maintain the stability of water quality, providing suitable environmental conditions to support the growth and development of juvenile fish. In addition, warm waters nurture more bait organisms, with short life-cycle cephalopods and small fish serving as prey for albacore. Prey habitats are widely distributed in temperate and tropical waters [43], providing more feeding opportunities for albacore. Albacore prefer high-salinity waters, as high-salinity waters often imply relatively stable water masses and higher biological suitability [44]. As a top predator, albacore require continuous swimming to breathe [4], and albacore caught by longline fishing are rarely found in low dissolved oxygen concentrations [1]. Albacore have a higher CPUE at dissolved oxygen concentrations of around 240 μmol/L [45]. Albacore prefer waters with higher dissolved oxygen levels than other tuna species [46]. Abnormalities in sea level height are often closely related to changes in ocean circulation and the marine environment, and the variation in sea surface height is mainly related to factors such as ocean currents, water masses, and tides as a characterization of ocean hydrodynamics [47]. Abnormal changes in sea level height may lead to corresponding changes in the migration path and distribution range of albacore, which are closely related to the occurrence of mesoscale eddies [48].
Albacore are sensitive to the flow and changes in the marine environment, and find suitable temperature, salinity, and dissolved oxygen conditions through changes in the distribution of ocean circulation.
Chlorophyll-a concentration, photosynthetically active radiation, and eddy kinetic energy have indirect effects on the distribution of albacore. The concentration of chlorophyll-a is directly related to the level of primary productivity [49], but albacore have a high trophic level and do not directly consume primary producers. However, the distribution of albacore fishing grounds is still indirectly affected by chlorophyll concentration and photosynthetically active radiation. Photosynthetically active radiation (PAR) is the main driving force for the growth and reproduction of phytoplankton in the ocean [8,41], and phytoplankton are an important food source for albacore prey organisms. Appropriate PAR conditions can promote the growth of phytoplankton, thereby providing abundant food resources and attracting aggregations of albacore. The correlation between eddy kinetic energy (EKE) and CPUE is relatively small. EKE is calculated from ocean current velocity and affects the three-dimensional distribution of circulation, water temperature, and chlorophyll, indirectly affecting the distribution of albacore fishing grounds [10].
Albacore have vertical distribution patterns correlated with vertical temperature, salinity, and mixed layer depth. Changes in temperature and salinity of vertical water layers in the ocean have a direct impact on the ecological processes of growth, development, reproduction, metabolism, and migration distribution of marine organisms [50,51]. Albacore are active in water layers below 100 m, with an optimal water temperature of 20–26 °C and a salinity range of 35.6–35.9. It was reported that environmental factors in different water layers have varying effects on the CPUE of albacore [52]. Depth of the mixed layer (MLD) also has an effect on the vertical distribution of albacore. The hydrological properties of the mixed layer in the same sea area follow the same trends, and environmental factors such as temperature and salinity are similar. In tropical latitudes, the vertical distribution of albacore follows a day-night pattern, in which albacore occupy shallow and warmer waters above MLD at night and deeper and cooler waters below MLD during the day. However, in temperate latitudes, albacore are almost exclusively found in shallow waters above MLD [5].
In recent years, global climate change has significantly impacted marine ecosystems, and typical climate events have altered the habitat distribution of marine organisms to some extent [53]. Climate change has a significant impact on the life history of albacore. Lehodey showed that subtropical albacore had low recruitment levels during El Niño and high recruitment levels during La Niña, which is opposite to the patterns of skipjack tuna (Katsuwonus pelamis) and yellowfin tuna (Thunnus albacares) [54]. The low correlation between CPUE and PDO may have been due to the small number of years of longline fishing logs collected, which was consistent with previous studies, whereas CPUE had a high correlation with SOI. In addition, the short time series in this study contains little information on abnormal climate events, so the influence of climate change on tuna distribution is easily overlooked when training on these data.

4.3. Impact of Different Partitioning Methods on the Model

Research on albacore tuna in the South Pacific has mostly concentrated on the southwestern Pacific Ocean west of the Cook Islands [15,18], and there is relatively little research on the distribution of albacore in the southeastern Pacific Ocean. The main research area for this study was the Cook Islands, providing a more comprehensive reference for the prediction and application of albacore tuna fishing grounds in the South Pacific. Overall, the predicted and actual locations of most fishing grounds agreed well. The relatively few deviations in the predictions of high-abundance fishing grounds may have arisen because (1) longline fishing operations have a large spatial span [19]; (2) the actual distribution of fishing grounds is relatively discrete, and outlying fishing grounds may not have been treated as positive classes by the model; and (3) the study area is large and the differences in environmental factors across it were not significant, which may reduce the classification performance of the model (Figure 9).
The proportion of samples assigned to each data label is one of the main factors influencing how effectively fishing ground abundance levels are classified [20,21]. When studying large ocean areas, the effects of the marine environment, fishing factors, and human factors are difficult to estimate, and defining discrete fishing ground levels can partly offset the influence of these unpredictable factors [22]. When fishing grounds are defined at two levels, the sampling ratio may be imbalanced, and the model may bias its predictions towards the majority class to improve accuracy. Although STK performed better than the other models, its overall generalization ability decreased, and the recall of low-abundance fishing grounds may decrease in actual fishing ground prediction (Figure 11), especially at higher latitudes. Multi-level classification methods can be used to avoid the problem of imbalanced sample partitioning. The forecast results support the conclusion that multi-class abundance prediction performs better than binary classification (Figure 11 and Figure 12). Moreover, this study suggests that the top-k indicator is the most appropriate for evaluating multi-level fishing ground forecasts.
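The top-k indicator discussed above can be illustrated with a minimal sketch. It assumes that each sample comes with a list of predicted class probabilities and counts a prediction as correct when the true abundance level is among the k most probable levels. The function name `top_k_accuracy` and the toy probabilities are illustrative assumptions, not the authors' code or data.

```python
def top_k_accuracy(y_true, y_prob, k):
    """Fraction of samples whose true class is among the k most probable classes."""
    hits = 0
    for true_label, probs in zip(y_true, y_prob):
        # Rank class indices by predicted probability, highest first.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(y_true)


# Hypothetical example with three abundance levels (0 = low, 1 = medium, 2 = high).
y_true = [0, 1, 2, 2]
y_prob = [
    [0.6, 0.3, 0.1],  # true level ranked first
    [0.5, 0.4, 0.1],  # true level ranked second
    [0.2, 0.5, 0.3],  # true level ranked second
    [0.7, 0.2, 0.1],  # true level ranked third
]

top1 = top_k_accuracy(y_true, y_prob, k=1)
top2 = top_k_accuracy(y_true, y_prob, k=2)
top3 = top_k_accuracy(y_true, y_prob, k=3)
```

Top-k accuracy can only stay the same or increase as k grows, and it reaches 1 when k equals the number of levels, so only values of k below the number of classes carry information.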

4.4. Outlook

Using methods other than equal-frequency partitioning to divide CPUE can cause imbalanced sampling, which can bias a model towards the majority class and result in low recall or accuracy. In addition, the spatiotemporal data were gridded at 1° × 1° and monthly resolution, so fishing ground predictions lag considerably in time when applied in practice. No continuous large-scale data were available during data collection, and the actual operation dates and locations of fishing grounds could not be standardized. Future studies can compare forecasts across multiple spatial and temporal resolutions. Further, an ideal survey with standardized sampling times and methods is not economically feasible, so some error is unavoidable. Optimization algorithms such as particle swarm optimization, simulated annealing, and genetic algorithms can be introduced in subsequent work to achieve rapid model convergence and improve modeling efficiency and robustness. In addition, current tuna forecasting models mostly take a single species as the response variable and seldom consider interactions between species. Joint species distribution models can therefore be used to explore the relationships between species in subsequent studies.
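To make the point about equal-frequency partitioning concrete, the following minimal sketch assigns abundance levels from quantile cut points so that each level receives roughly the same number of samples. The helper `equal_frequency_labels` and the CPUE values are hypothetical and are not taken from the study; the sketch also assumes few tied values, since heavy ties can make the bins unequal.

```python
from collections import Counter
from statistics import quantiles


def equal_frequency_labels(values, n_levels):
    """Assign each value to one of n_levels bins holding roughly equal numbers of samples."""
    # quantiles() returns the n_levels - 1 interior cut points.
    cuts = quantiles(values, n=n_levels, method="inclusive")
    # A value's level is the number of cut points lying strictly below it.
    return [sum(v > c for c in cuts) for v in values]


# Hypothetical, right-skewed CPUE values (not data from the study).
cpue = [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 40]
labels = equal_frequency_labels(cpue, n_levels=3)
counts = Counter(labels)  # number of samples per abundance level
```

With skewed data such as these, cut points chosen by value range or by clustering would tend to leave few samples in the highest level, which is the imbalance described above.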

5. Conclusions

This research used several discretization methods to divide the CPUE of South Pacific albacore into abundance levels, which transformed the problem of predicting fishery abundance into a classification problem, and compared evaluation indicators under imbalanced sampling. Fishery abundance was divided into 2, 3, and 5 levels. Multiple machine learning algorithms were tested to predict fishery abundance levels from remotely sensed environmental factors. The accuracy and stability of STK were found to be higher than those of the other models. However, owing to the inherent limitations of fishery production data, there are shortcomings in predicting the temporal aspects of fishing grounds. In the future, it is therefore necessary to improve the collection and organization of fishery data and to further study the environmental attributes of fishing grounds so that they reflect the marine environment more accurately. These improvements will provide more reliable and timely fishery and environmental information for fishing ground prediction models, allowing them to predict and guide fishery production more accurately. In addition, the challenges that climate change poses for tuna fisheries require more diverse and accurate fishery forecasting techniques. A variety of forecasting methods and technologies can provide suitable fishing programs that will contribute to the sustainable management of tuna fisheries.

Author Contributions

Conceptualization, F.C.; Data curation, Q.D.; Funding acquisition, W.Y. and W.Z. (Weifeng Zhou); Visualization, D.L.; Writing—original draft, J.L.; Writing—review and editing, W.Z. (Wenbin Zhu). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2023YFD2401303) and the Pelagic Fishery Resources Monitoring Project of Zhejiang Province (2024HZ003).

Institutional Review Board Statement

This study applied machine learning to forecast tuna fishing grounds using fishing log data from longline vessels in the South Pacific Ocean; no animal experiments were conducted.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets and codes that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors wish to thank the commercial fishing company for providing fishing logs of longline vessels in the South Pacific Ocean. This work was funded by the National Key R&D Program of China (2023YFD2401303) and the routine monitoring program for distant-water fishery resources of Zhejiang Province (2023HZ024).

Conflicts of Interest

The authors have no conflicts of interest to declare.

References

  1. Nikolic, N.; Morandeau, G.; Hoarau, L.; West, W.; Arrizabalaga, H.; Hoyle, S.; Fonteneau, A. Review of albacore tuna, Thunnus alalunga, biology, fisheries and management. Rev. Fish. Bio. Fish. 2017, 27, 775–810.
  2. Watanabe, H.; Kubodera, T.; Masuda, S.; Kawahara, S. Feeding habits of albacore Thunnus alalunga in the transition region of the central North Pacific. Fish. Sci. 2004, 70, 573–579.
  3. Dragon, A.C.; Senina, I.; Titaud, O.; Calmettes, B.; Conchon, A.; Arrizabalaga, H.; Lehodey, P. An ecosystem-driven model for spatial dynamics and stock assessment of North Atlantic albacore. Can. J. Fish. Aquat. Sci. 2015, 72, 864–878.
  4. Lehodey, P.; Senina, I.; Nicol, S.; Hampton, J. Modelling the impact of climate change on South Pacific albacore tuna. Deep-Sea. Res. Pt. II. 2015, 113, 246–259.
  5. Vaihola, S.; Yemane, D.; Kininmonth, S. Spatiotemporal Patterns in the Distribution of Albacore, Bigeye, Skipjack, and Yellowfin Tuna Species within the Exclusive Economic Zones of Tonga for the Years 2002 to 2018. Diversity 2023, 15, 1091.
  6. Williams, A.J.; Allain, V.; Nicol, S.J.; Evans, K.J.; Hoyle, S.D.; Dupoux, C.; Vourey, E.; Dubosc, J. Vertical behavior and diet of albacore tuna (Thunnus alalunga) vary with latitude in the South Pacific Ocean. Deep-Sea. Res. Pt. II. 2015, 113, 154–169.
  7. Reglero, P.; Santos, M.; Balbín, R.; Laíz-Carrión, R.; Alvarez-Berastegui, D.; Ciannelli, L.; Jiménez, E.; Alemany, F. Environmental and biological characteristics of Atlantic bluefin tuna and albacore spawning habitats based on their egg distributions. Deep-Sea. Res. Pt. II. 2017, 140, 105–116.
  8. Zainuddin, M.; Saitoh, K.; Saitoh, S.I. Albacore (Thunnus alalunga) fishing ground in relation to oceanographic conditions in the western North Pacific Ocean using remotely sensed satellite data. Fish. Oceanogr. 2008, 17, 61–73.
  9. Mondal, S.; Lee, M.A. Habitat modeling of mature albacore (Thunnus alalunga) tuna in the Indian Ocean. Front. Mar. Sci. 2023, 10, 1258535.
  10. Tussadiah, A.; Pranowo, W.S.; Syamsuddin, M.L.; Riyantini, I.; Nugraha, B.; Novianto, D. Characteristic of eddies kinetic energy associated with yellowfin tuna in southern Java Indian Ocean. IOP Conf. Ser. Earth Environ. Sci. 2018, 176, 012004.
  11. Singh, A.A.; Sakuramoto, K.; Suzuki, N. Impact of climatic factors on albacore tuna Thunnus alalunga in the South Pacific Ocean. Amer. J. Clim. Chan. 2015, 4, 295.
  12. Lindegren, M.; Checkley, D.M., Jr.; Koslow, J.A.; Goericke, R.; Ohman, M.D. Climate-mediated changes in marine ecosystem regulation during El Niño. Glob. Change Biol. 2018, 24, 796–809.
  13. Gao, F.; Chen, X.; Guan, W.; Li, G. A new model to forecast fishing ground of Scomber japonicus in the Yellow Sea and East China Sea. Acta Ocean. Sin. 2016, 35, 74–81.
  14. Cui, X.; Tang, F.; Zhou, W.; Wu, Z.; Yang, S.; Hua, C. Fishing ground forecasting model of Ommastrephes bartramii based on support vector machine (SVM) in the Northwest Pacific Ocean. South China Fish. Sci. 2016, 12, 1–7, (In Chinese with English abstract).
  15. Mao, J.; Chen, X.; Yu, J. Forecasting fishing ground of Thunnus alalunga based on BP neural network in South Pacific Ocean. Acta. Ocean. Sin. 2016, 10, 34–43, (In Chinese with English abstract).
  16. Chen, X.; Fan, W.; Cui, X.; Zhou, W.; Tang, F. Fishing ground forecasting of Thunnus alalunga in Indian Ocean based on random forest. Acta. Ocean. Sin. 2013, 35, 158–164, (In Chinese with English abstract).
  17. Vaihola, S.; Kininmonth, S. Environmental Factors Determine Tuna Fishing Vessels’ Behavior in Tonga. Fishes 2023, 8, 602.
  18. Hou, J.; Zhou, W.; Fan, W.; Zhang, H. Research on fishing grounds forecasting models of albacore tuna based on ensemble learning in South Pacific. South China Fish. Sci. 2020, 5, 42–50, (In Chinese with English abstract).
  19. Song, L.; Li, T.; Zhang, T.; Sui, H.; Li, B.; Zhang, M. Comparison of machine learning models within different spatial resolutions for predicting the bigeye tuna fishing grounds in tropical waters of the Atlantic Ocean. Fish. Oceanogr. 2023, 32, 509–526.
  20. Xu, L.; Chi, D. Machine Learning Classification Strategy for Imbalanced Data Sets. Comput. Eng. Appl. 2020, 56, 12–27, (In Chinese with English abstract).
  21. Wardhani, N.W.S.; Rochayani, M.Y.; Iriany, A.; Sulistyono, A.D.; Lestantyo, P. Cross-validation metrics for evaluating classification performance on imbalanced data. In Proceedings of the 2019 International Conference on Computer, Control, Informatics and Its Applications, Tangerang, Indonesia, 23–24 October 2019; pp. 14–18.
  22. Shabani, F.; Kumar, L.; Ahmadi, M. A Comparison of Absolute Performance of Different Correlative and Mechanistic Species Distribution Models in an Independent Area. Ecol. Evol. 2016, 6, 5973–5986.
  23. Feng, Y.; Chen, X.; Gao, F.; Liu, Y. Impacts of changing scale on Getis-Ord Gi hotspots of CPUE: A case study of the neon flying squid (Ommastrephes bartramii) in the northwest Pacific Ocean. Acta. Ocean. Sin. 2018, 37, 67–76.
  24. Krishna, K.; Murty, M.N. Genetic K-means algorithm. IEEE Trans. Sys. Man. Cyber. Pt. B 1999, 29, 433–439.
  25. Jacod, J.; Protter, P. Discretization of Processes; Springer: Berlin, Heidelberg, 2011.
  26. Ren, L.; Ma, Y.; Shi, H.; Chen, X. Overview of Machine Learning Algorithms. In Lecture Notes in Electrical Engineering; Springer: Singapore, 2020; pp. 672–678.
  27. LaValley, M.P. Logistic regression. Circulation 2008, 117, 2395–2399.
  28. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. Springer Science+Business Media, LLC: New York, NY, USA, 1998; Volume 13, pp. 18–28.
  29. Suryanarayana, I.; Braibanti, A.; Sambasiva Rao, R.; Ramam, V.A.; Sudarsan, D.; Nageswara Rao, G. Neural networks in fisheries research. Fish. Res. 2008, 92, 115–139.
  30. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  31. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. R Package, version 0.4-2; Xgboost: Extreme Gradient Boosting; The R Foundation: Vienna, Austria, 2015; Volume 1, pp. 1–4.
  32. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Ad. Neural. Infor. Process. Syst. 2018, 31, 6637–6647.
  33. Džeroski, S.; Ženko, B. Is combining classifiers with stacking better than selecting the best one? Mach. Learn. 2004, 54, 255–273.
  34. Fawcett, T. An introduction to ROC analysis. Pat. Recog. Let. 2006, 27, 861–874.
  35. Majnik, M.; Bosnić, Z. ROC analysis of classifiers in machine learning: A survey. Intel. Dt. Analy. 2013, 17, 531–558.
  36. Harding, J.; Shahbaz, M.; Srinivas, S.; Kusiak, A. Data mining in manufacturing: A review. ASME Trans. J. Manuf. Sci. Eng. 2006, 128, 969–976.
  37. Tan, H.; Wu, Y.; Shen, B.; Jin, P.; Ran, B. Short-term traffic prediction based on dynamic tensor completion. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2123–2133.
  38. Cui, Y.; Liu, S.; Zhang, Y.; Xu, B.; Ji, Y.; Zhang, C.; Xue, Y. Habitat characteristics of Octopus ocellatus and their relationship with environmental factors during spring in Haizhou Bay, China. Chin. J. Appl. Ecol. 2022, 33, 1686–1692, (In Chinese with English abstract).
  39. Yang, C.; Li, X.; Liu, Q.; Wang, Y. Application of machine learning methods for estimating the biomass of economically crabs in the Zhoushan fishery. Mar. Sci. 2023, 9, 61–70, (In Chinese with English abstract).
  40. Mugo, R.; Saitoh, S.I. Ensemble modelling of Skipjack tuna (Katsuwonus pelamis) habitats in the western North Pacific using satellite remotely sensed data; a comparative analysis using machine-learning models. Remote. Sens. 2020, 12, 2591.
  41. Zainuddin, M.; Kiyofuji, H.; Saitoh, K.; Saitoh, S.I. Using multi-sensor satellite remote sensing and catch data to detect ocean hot spots for albacore (Thunnus alalunga) in the northwestern North Pacific. Deep-Sea. Res. Pt. II. 2006, 53, 419–431.
  42. Ashida, H.; Gosho, T.; Watanabe, K.; Okazaki, M.; Tanabe, T.; Uosaki, K. Reproductive traits and seasonal variations in the spawning activity of female albacore, Thunnus alalunga, in the subtropical western North Pacific Ocean. J. Sea. Res. 2020, 160, 101902.
  43. Jackson, G.D.; Meekan, M.G.; Wotherspoon, S.; Jackson, C.H. Distributions of young cephalopods in the tropical waters of Western Australia over two consecutive summers. ICES. J. Mar. Sci. 2008, 65, 140–147.
  44. Shcherbina, A.Y.; D’Asaro, E.A.; Riser, S.C.; Kessler, W.S. Variability and interleaving of upper-ocean water masses surrounding the North Atlantic salinity maximum. Oceanography 2015, 28, 106–113.
  45. Mondal, S.; Wang, Y.C.; Lee, M.A.; Weng, J.S.; Mondal, B.K. Ensemble three-dimensional habitat modeling of Indian Ocean immature albacore tuna (Thunnus alalunga) using remote sensing data. Remote. Sens. 2022, 14, 5278.
  46. Arrizabalaga, H.; Dufour, F.; Kell, L.; Merino, G.; Ibaibarriaga, L.; Chust, G.; Irigoien, X.; Santiago, J.; Murua, H.; Fraile, I.; et al. Global habitat preferences of commercially valuable tuna. Deep-Sea Res. Pt. II. 2015, 113, 102–112.
  47. Kai, E.T.; Marsac, F. Influence of mesoscale eddies on spatial structuring of top predators’ communities in the Mozambique Channel. Prog. Oceanogr. 2010, 86, 214–223.
  48. Zhou, C.; He, P.; Xu, L.; Bach, P.; Wang, X.; Wan, R.; Zhang, Y. The effects of mesoscale oceanographic structures and ambient conditions on the catch of albacore tuna in the South Pacific longline fishery. Fish. Oceanogr. 2020, 29, 238–251.
  49. Iriarte, J.L.; González, H.E.; Liu, K.K.; Rivas, C.; Valenzuela, C. Spatial and temporal variability of chlorophyll and primary productivity in surface waters of southern Chile (41.5–43 S). Estuarine. Coast. Shelf. Sci. 2007, 74, 471–480.
  50. Lougee, L.A.; Bollens, S.M.; Avent, S.R. The effects of haloclines on the vertical distribution and migration of zooplankton. J. Exp. Mar. Bio. Eco. 2002, 278, 111–134.
  51. Wu, J.; Jin, L. Exploration of the classification and main characteristics of marine ecosystems. Inter. J. Mar. Sci. 2023, 13, 1–7.
  52. Xu, H.; Song, L.; Shen, J.; Li, Y.; Zhang, M. The relationship between the spatial-temporal distribution of albacore tuna CPUE and the marine environment variables in waters near the Cook Islands based on GAM. Mar. Sci. Bull. 2023, 4, 444–455, (In Chinese with English abstract).
  53. Du Pontavice, H.; Gascuel, D.; Reygondeau, G.; Maureaud, A.; Cheung, W.W. Climate change undermines the global functioning of marine food webs. Glob. Change Biol. 2020, 26, 1306–1318.
  54. Lehodey, P.; Chai, F.; Hampton, J. Modelling climate-related variability of tuna populations from a coupled ocean–biogeochemical-populations dynamics model. Fish. Oceanogr. 2003, 12, 483–494.
Figure 1. Flowcharts of research on comparing the models of albacore fishing grounds in the South Pacific Ocean.
Figure 2. Monthly average catch of Pingtairong Marine Fisheries Group Co., Ltd. and Jiangsu Yuanyou Marine Fisheries Co., Ltd.
Figure 3. Flowchart of Stacking algorithm.
Figure 4. Monthly average CPUE distribution of albacore in the South Pacific.
Figure 5. Changes in monthly average catch and CPUE of albacore tuna in the South Pacific.
Figure 6. ROC curves of eight learners for CPUE segmented by six discretization methods, obtained from 25% of the training set. (A) Median, (B) Third percentile first quantile, (C) Quartile first quantile, (D) Quintile second quantile, (E) K-means clustering, (F) 1-R partition nodes.
Figure 7. Relationship between prediction probability, positive class recall, and accuracy for eight binary classifiers under six CPUE segmentation methods. (A) Median, (B) Third percentile first quantile, (C) Quartile first quantile, (D) Quintile second quantile, (E) K-means clustering, (F) 1-R partition nodes.
Figure 8. Fishing ground abundance levels mapped onto PC1 and PC2 from a PCA of 18 environmental factors. (A) Third percentile method. (B) K-means clustering for K = 3. (C) Fifth percentile method. (D) K-means clustering for K = 5.
Figure 9. (A) Top-3 accuracy of each learner’s prediction of fishery abundance after discretizing and segmenting fishery abundance levels using the third percentile method. (B) Top-3 accuracy of using the K-means clustering algorithm to divide fishery abundance levels by K = 3; (C) Top-5 accuracy predicted by using quintile discretization to segment the abundance level of fishing grounds; (D) Top-5 accuracy of using K-means clustering algorithm to classify the abundance level of fishing grounds with K = 5.
Figure 10. (A) The distribution of real and predicted fishing grounds for high abundance fishing grounds in the test set, and (B) the distribution of real and predicted fishing grounds for low abundance fishing grounds in the test set.
Figure 11. The abundance of fishing grounds was divided into 3 levels by K-means clustering and modeled using the STK method. (A) The actual low abundance fishing grounds and prediction points, (B) the actual medium abundance fishing grounds and prediction points, and (C) the actual high abundance fishing grounds and prediction points.
Figure 12. The abundance of fishing grounds was divided into 5 levels by K-means clustering and modeled using the STK method. (A) actual lowest abundance fishing grounds and prediction points, (B) actual low abundance fishing grounds and prediction points, (C) actual medium abundance fishing grounds and prediction points, (D) actual high abundance fishing grounds and prediction points, and (E) highest abundance fishing grounds and prediction points.
Table 1. Introduction and parameters of machine learning models in the research.

| Model | Introduction | Parameters |
| --- | --- | --- |
| K-Nearest Neighbors (KNN) (reproduced from Springer, 2020) [26] | The KNN classification algorithm considers only the K training samples closest to the new sample. | K = 9 (K is the number of nearest neighbors), weight: distance |
| Logistic Regression (LR) (reproduced from American Heart Association, 2008) [27] | LR is a classic classification algorithm widely used in fields such as life sciences, social sciences, and finance. | penalty coefficient = 1.2, max_iterate = 100 |
| Support Vector Machine (SVM) (reproduced from Springer, 1998) [28] | SVM is a powerful classification algorithm that finds an optimal line or surface (also known as a hyperplane) dividing the data into two categories while maximizing the margin between them. | penalty coefficient = 1 |
| Artificial Neural Network (ANN) (reproduced from Fish. Res, 2008) [29] | ANN is a model widely used in fields such as image, audio, and natural language processing. | HiddenLayerSizes = 20, learning rate: adaptive |
| Random Forest (RF) (reproduced from Mach. Learn, 2001) [30] | Random Forest is an ensemble learning algorithm based on decision trees. | n_estimators = 300 |
| Extreme Gradient Boosting (XGB) (reproduced from CRAN, 2015) [31] | XGBoost is an efficient implementation of gradient-boosted trees that combines various techniques such as tree pruning and feature subsampling. | n_estimators = 200, learning rate = 0.05 |
| CatBoost (CAT) (reproduced from Ad. Neural. Infor. Process. Syst., 2018) [32] | CatBoost is a machine learning algorithm based on gradient boosting that supports tasks such as classification, regression, and ranking. | n_estimators = 200, learning rate = 0.05 |
| Stacking (STK) (reproduced from Mach.Learn, 2004) [33] | Stacking is an ensemble learning technique that combines different models to achieve higher accuracy in the final prediction. | Estimators = {KNN, RF, XGB}, meta_learner = ANN, cv = 5 |
Table 2. Confusion matrix.

| | Actual positive class | Actual negative class |
| --- | --- | --- |
| Predict positive class | TP (True positives) | FP (False positives) |
| Predict negative class | FN (False negatives) | TN (True negatives) |
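The confusion matrix in Table 2 leads directly to the evaluation indices used in this study. The sketch below applies the standard definitions of accuracy, precision, recall, and F1; the function name and the counts are hypothetical, chosen only to show how accuracy can stay high on an imbalanced test set while F1 is low. It assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard indices derived from the four cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 10 positive and 90 negative samples.
metrics = classification_metrics(tp=2, fp=1, fn=8, tn=89)
```

Here most negatives are classified correctly, so accuracy is high even though only 2 of the 10 positives are recovered; F1 reflects the weak recall.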
Table 3. Pearson correlation analysis between environmental factors and CPUE.

| Variables | Correlation | p |
| --- | --- | --- |
| Year | −0.00 | 0.81 |
| Month | 0.27 | <0.05 |
| Lat | −0.37 | <0.05 |
| Lon | 0.12 | <0.05 |
| SOI | 0.13 | <0.05 |
| PDO | 0.01 | 0.48 |
| SST | −0.48 | <0.05 |
| CHL | −0.05 | <0.05 |
| SSS | −0.17 | <0.05 |
| DO | 0.49 | <0.05 |
| SSHA | 0.06 | <0.05 |
| WS | −0.02 | 0.28 |
| MLD | 0.37 | <0.05 |
| PAR | −0.23 | <0.05 |
| EKE | −0.06 | <0.05 |
| ST50 | −0.46 | <0.05 |
| SS50 | −0.32 | <0.05 |
| ST100 | −0.39 | <0.05 |
| SS100 | −0.37 | <0.05 |
| ST200 | −0.36 | <0.05 |
| SS200 | −0.42 | <0.05 |
Table 4. Number and imbalance of different methods for classifying fishing ground abundance levels.

| Scheme | Division Basis | Number of High Abundance Fishing Grounds | Number of Low Abundance Fishing Grounds | Imbalance Rate |
| --- | --- | --- | --- | --- |
| A | Median | 2312 | 2311 | 1 |
| B | Third percentile first quantile | 3097 | 1526 | 2 |
| C | Quartile first quantile | 3467 | 1156 | 3 |
| D | Two fifths percentile | 2774 | 1849 | 1.5 |
| E | K-means clustering | 650 | 3973 | 6 |
| F | 1-R partition nodes | 102 | 4521 | 44 |
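The imbalance rates in Table 4 can be reproduced from the sample counts. The sketch below assumes that the imbalance rate is the ratio of the larger class to the smaller class, a definition that is not stated explicitly in this section but that matches the tabulated values after rounding. The dictionary and function names are illustrative.

```python
# (high-abundance count, low-abundance count) for each scheme, copied from Table 4.
table4_counts = {
    "A": (2312, 2311),
    "B": (3097, 1526),
    "C": (3467, 1156),
    "D": (2774, 1849),
    "E": (650, 3973),
    "F": (102, 4521),
}


def imbalance_rate(n_high, n_low):
    """Ratio of the majority class size to the minority class size (assumed definition)."""
    return max(n_high, n_low) / min(n_high, n_low)


rates = {s: round(imbalance_rate(*c), 1) for s, c in table4_counts.items()}
totals = {s: sum(c) for s, c in table4_counts.items()}  # samples per scheme
```

Every scheme partitions the same number of samples, so the schemes differ only in where the cut point falls, not in how much data they use.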
Table 5. Prediction performance of STK for schemes A to F at a classification threshold of 0.5 (A, accuracy; P, precision; R, recall).

| Scheme | A | P | R | F1 |
| --- | --- | --- | --- | --- |
| A | 0.744 | 0.743 | 0.746 | 0.744 |
| B | 0.746 | 0.773 | 0.879 | 0.822 |
| C | 0.823 | 0.833 | 0.955 | 0.89 |
| D | 0.753 | 0.771 | 0.837 | 0.802 |
| E | 0.925 | 0.828 | 0.589 | 0.688 |
| F | 0.976 | 0.417 | 0.192 | 0.263 |
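As a consistency check on Table 5, F1 can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The sketch assumes that the columns A, P, and R denote accuracy, precision, and recall; the variable names are illustrative.

```python
# (accuracy, precision, recall, F1) for each scheme, copied from Table 5.
table5 = {
    "A": (0.744, 0.743, 0.746, 0.744),
    "B": (0.746, 0.773, 0.879, 0.822),
    "C": (0.823, 0.833, 0.955, 0.89),
    "D": (0.753, 0.771, 0.837, 0.802),
    "E": (0.925, 0.828, 0.589, 0.688),
    "F": (0.976, 0.417, 0.192, 0.263),
}


def f1_from_pr(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {s: f1_from_pr(p, r) for s, (_, p, r, _) in table5.items()}
max_gap = max(abs(recomputed[s] - table5[s][3]) for s in table5)

best_by_accuracy = max(table5, key=lambda s: table5[s][0])
best_by_f1 = max(table5, key=lambda s: table5[s][3])
```

The recomputed values agree with the reported F1 to within rounding. The scheme with the highest accuracy is the most imbalanced one (F), whereas the highest F1 belongs to scheme C, which illustrates why accuracy alone is a poor guide under imbalanced sampling.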
Table 6. Proportion of fishery abundance samples divided by equal frequency and K-means clustering.

| Scheme | Segmentation Method | Parameter | Sample Proportion |
| --- | --- | --- | --- |
| A | Third percentile method | - | 1: 1: 1 |
| B | K-means clustering | K = 3 | 2872: 1415: 336 |
| C | Fifth percentile method | - | 1: 1: 1: 1: 1 |
| D | K-means clustering | K = 5 | 1972: 1538: 683: 317: 106 |
Table 7. Differential analysis of Top-k prediction performance using paired sample t-test.

| Scheme | Scheme | Top-k | t | p |
| --- | --- | --- | --- | --- |
| A | B | 1 | −43.392 | <0.001 |
| A | B | 2 | −33.639 | <0.001 |
| C | D | 1 | −20.906 | <0.001 |
| C | D | 2 | −19.472 | <0.001 |
| C | D | 3 | −13.458 | <0.001 |
| C | D | 4 | −16.306 | <0.001 |
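For readers unfamiliar with the test behind Table 7, the sketch below computes a paired-sample t statistic from two matched lists of scores. How the authors paired their samples is not described in this section, so the pairing by repeated run and the accuracy values are hypothetical; only the formula (mean difference divided by its standard error) is standard.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """t statistic for matched samples: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical top-1 accuracies from three matched runs (not results from the study).
equal_frequency_scores = [0.50, 0.60, 0.70]
kmeans_scores = [0.60, 0.80, 0.80]

t_value = paired_t_statistic(equal_frequency_scores, kmeans_scores)
```

A negative t value means that the second scheme scored higher on average, which matches the sign convention in Table 7, where the K-means schemes (B and D) are listed second.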
Share and Cite

MDPI and ACS Style

Li, J.; Chen, F.; Dai, Q.; Zhu, W.; Li, D.; Yu, W.; Zhou, W. Construction and Comparison of Machine-Learning Forecast Models of Albacore Thunnus alalunga Fishing Grounds in the South Pacific Ocean. Fishes 2024, 9, 375. https://doi.org/10.3390/fishes9100375
