3.2. Regional Geomagnetic Anomaly Field Model Construction
In this study, the Northwest Pacific Ocean—specifically, the 3° × 3° area from 122° E to 125° E in longitude and 29.5° N to 32.5° N in latitude (the area marked by the red box in
Figure 1a)—was selected as the study area. This area, which covers approximately 160,000 square kilometers and contains 8100 data points, was assessed based on the sea-level dataset. As illustrated in
Figure 1b, the spatial distribution of the geomagnetic anomalies in the study area is clearly characterized, and the statistics provided in
Table 1 highlight the significant variations regarding the geomagnetic anomalies in the area. Furthermore, there are several areas with high geomagnetic anomaly values in the region, which provides good conditions for verifying and comparing the accuracy and applicability of various modeling approaches. To construct and validate the model,
80% of the data (i.e., 6480 data points) were randomly selected from the 8100 points in the selected region as the training dataset for the model; meanwhile, the remaining 20% (i.e., 1620 data points) were used as the test dataset. This data partitioning approach allows the model’s performance to be evaluated on independent data. To objectively assess the model’s predictive accuracy, the root mean square error (RMSE) was selected as the key index, which is calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$
where $n$ represents the number of samples, $\hat{y}_i$ denotes the predicted value of the model for the $i$th data point, and $y_i$ represents the true value for the $i$th data point. A smaller RMSE indicates a smaller difference between the predicted and true values and, therefore, better predictive performance. The RMSE thus provides a quantitative basis for assessing the prediction accuracy of the model and for its optimization and selection.
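The partitioning and the error metric described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the fixed random seed, and the toy values passed to `rmse` are assumptions introduced here.

```python
import math
import random


def split_indices(n_points, train_fraction=0.8, seed=0):
    """Randomly split point indices into training and test subsets."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)  # seeded only for reproducibility
    n_train = round(n_points * train_fraction)
    return indices[:n_train], indices[n_train:]


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# 8100 grid points split 80/20, as in the study area described above.
train_idx, test_idx = split_indices(8100)
print(len(train_idx), len(test_idx))

# Toy values (not from the dataset) to show the metric being evaluated.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```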
Previous studies [14] have highlighted that the truncation level and the resolution of boundary effects are critical factors in constructing regional geomagnetic anomaly models. In regional geomagnetic field modeling, the truncation level refers to the maximum degree of the polynomial expansion used in the model, which determines the resolution and complexity of the model. Selecting the optimal truncation level is crucial because it balances model accuracy against computational efficiency: a higher truncation level captures finer details of the field but increases the computational cost, whereas a lower truncation level simplifies the calculations but may omit significant features. Therefore, the optimal truncation level is determined by jointly considering the error analysis results and the computational efficiency. Regarding boundary effects, predicting data within the modeling region is essentially an interpolation problem in which each point is continuously related to its surrounding points. At the boundaries, however, no data are available outside the region, which can cause significant deviations in the model predictions at the boundary points.
Figure 2a illustrates the relationship between the truncation level and the number of outliers. The RMSE was calculated using the training data, whereas the number of outliers was determined using the test data through the box plot method (detailed later in this section). As shown in
Figure 2b, when the truncation level is less than 10, the model is in an underfitting state, leading to a certain number of outliers. As the truncation level increased gradually from 10, the accuracy of the model training improved, with the RMSE reaching its minimum value of 0.41 nT at the 85th degree. This accuracy was maintained up to approximately the 104th degree, after which the RMSE started to increase gradually and slightly. The relationship between the number of outliers and the truncation level follows an approximately exponential trend, expressed as
. As the truncation level increases from 0 to 87, the number of outliers increases from 0 to approximately 100; from 88 to 97, it increases from approximately 100 to 200; and from 98 to 105, it continues to increase. At a truncation level of 85, 93 outliers were identified, which corresponds to approximately 5.7% of the 1620 test points. When the truncation level increased to 104, the number of detected outliers increased to 270, approximately three times the number at the 85th degree.
Figure 3 illustrates the distribution of outliers at truncation levels of 85 and 104. The outliers were primarily concentrated at the four corners of the region’s boundary. At the 104th truncation level, the number of outliers increased significantly and extended further into the interior of the region.
To determine the optimal truncation level more accurately and rationally, this study proposes a comprehensive evaluation index that combines the RMSE and the number of outliers $A$ and accounts for the impact of both on the overall assessment. The formula is as follows:
$$E = w_1 \cdot \mathrm{RMSE} + w_2 \cdot f(A) \quad (9)$$
In Equation (9), $w_1$ and $w_2$ represent the weight parameters and satisfy $w_1 + w_2 = 1$; meanwhile, $f(A)$ denotes a nonlinear function of the number of outliers $A$, which is used to adjust the impact of outliers on the comprehensive evaluation index. Specifically, as the number of outliers increases, the negative impact on the comprehensive evaluation index also increases. The explicit form of $f(A)$ is given by:
In Equation (10), $A_0$ represents a threshold value and $k$ represents a constant greater than 1, which is used to amplify the impact of outliers that exceed the threshold.
During the model training phase, the RMSE was selected as the primary evaluation index. Accordingly, the weight $w_1$ was set to 0.7 and the weight $w_2$ was set to 0.3. The threshold $A_0$ was taken as 7% of the number of checkpoints (i.e., 113). This threshold value was based on a statistical analysis of the dataset and the requirements for identifying outliers. Furthermore, $k$ was set to 1.2 to enhance the sensitivity of the model to outliers. As shown in
Figure 2b, the comprehensive evaluation index remained highly stable between the 81st and 89th truncation levels, consistently falling within the range of 0.56 to 0.59, and reached a minimum of 0.5622 at the 83rd degree. This result indicates that the 83rd degree is the optimal truncation level for the model, achieving a good balance between model complexity and prediction performance. Thus, the 83rd degree was selected as the optimal truncation level of the R−ALPOLM in this study, and post-processing of the predicted outliers was performed on this basis to enhance the performance of the model.
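The selection procedure can be illustrated with a short Python sketch. Several pieces are assumptions made only for illustration: the index is taken to be a weighted sum of the RMSE and an outlier penalty; the penalty `outlier_penalty` is a hypothetical piecewise form (linear in the ratio of outliers to the threshold, raised to the power k above the threshold) and is not the paper's Equation (10); and the entries in `synthetic_results` are invented placeholders, not values from Figure 2. Only the weights (0.7 and 0.3), the threshold (113), and k = 1.2 come from the text.

```python
def outlier_penalty(n_outliers, threshold=113, k=1.2):
    """Hypothetical nonlinear penalty f(A); NOT the paper's Equation (10)."""
    ratio = n_outliers / threshold
    # Linear below the threshold; above it, the exponent k > 1 amplifies
    # the contribution of the excess outliers.
    return ratio if n_outliers <= threshold else ratio ** k


def composite_index(rmse_value, n_outliers, w_rmse=0.7, w_out=0.3,
                    threshold=113, k=1.2):
    """Assumed weighted combination of the RMSE and the outlier penalty."""
    return w_rmse * rmse_value + w_out * outlier_penalty(n_outliers, threshold, k)


def best_truncation_level(results, **kwargs):
    """Return the level whose (rmse, n_outliers) pair minimizes the index."""
    return min(results, key=lambda lvl: composite_index(*results[lvl], **kwargs))


# Synthetic placeholders: truncation level -> (RMSE in nT, number of outliers).
synthetic_results = {10: (5.0, 0), 80: (0.5, 60), 110: (0.6, 300)}
print(best_truncation_level(synthetic_results))
```

With this kind of index, a level with a slightly higher RMSE can still be preferred when it produces far fewer boundary outliers, which is the trade-off the text describes.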
Traditional approaches to handling boundary effects typically involve adding boundary data points to control errors, thereby enhancing the fitting degree of the model and overcoming boundary issues. However, when constructing a model, measured data are considered the most reliable. Artificial interpolation—which involves generating synthetic data based on measured values for use as training data—can introduce unnecessary errors. These errors vary depending on the distribution density of the measured data, leading to a certain level of error during the initial stages of model construction.
The data presented in
Table 2 indicate the overall RMSE after removing the prediction outliers. The results indicate that the overall prediction accuracy of the model remained high after excluding the predicted outliers, and the RMSE decreased with increasing truncation level. Furthermore, through the selection of an appropriate truncation level, the number of prediction outliers can be limited to a smaller proportion of the total number of checkpoints. Consequently, including significant predicted outliers in the evaluation of model accuracy to determine the truncation level is not ideal, as it does not accurately reflect the prediction performance of higher-degree models.
In light of this, this study does not focus on the original observations when addressing boundary effects but, instead, employs a strategy of post-processing the abnormal predicted values at the boundaries based on the model to enhance the overall prediction performance. This approach aims to reduce the negative impacts of boundary effects on model performance through optimizing the prediction mechanism within the model to achieve more accurate prediction results.
Outlier detection was performed using the box plot method, which visualizes the distribution of raw geomagnetic data by dividing the dataset into four equal parts based on three quartiles: the lower quartile (Q1), the median (Q2), and the upper quartile (Q3). Specifically, Q1 separates the lowest 25% of the data, Q2 (the median) divides the data into two equal halves, and Q3 marks the boundary below which 75% of the data falls. This division partitions the dataset into four segments, each containing 25% of the data, ordered from the smallest to the largest values. The difference between the upper and lower quartiles is known as the interquartile range (IQR). When detecting predicted outliers using the box plot method, values that exceed the upper quartile by more than 1.5 times the IQR, or fall below the lower quartile by more than 1.5 times the IQR, are typically classified as outliers. This method is highly robust because the IQR is only weakly affected by outliers.
The moving average is a widely used statistical technique in data analysis and time series forecasting. Its primary purpose is to reduce randomness by smoothing data so that trend features can be extracted more easily. Because regional geomagnetic data typically exhibit continuity, a predicted outlier can be re-predicted by applying the moving average method to the predicted values of its surrounding points (N points of radius d). To address predicted outliers in a dataset $X = \{x_1, x_2, \ldots, x_n\}$, the following steps can be taken:
Step 1: Compute the upper quartile Q3 and the lower quartile Q1.
Step 2: Determine the interquartile range $\mathrm{IQR} = Q3 - Q1$.
Step 3: Detect predicted outliers: if $Q1 - 1.5\,\mathrm{IQR} \le x_i \le Q3 + 1.5\,\mathrm{IQR}$, then $x_i$ is not an outlier; otherwise, $x_i$ is an outlier.
Step 4: Use the model coefficient matrix A to predict the geomagnetic data at the coordinates of the surrounding points, in order to obtain the surrounding point data values $x_{i,1}, x_{i,2}, \ldots, x_{i,N}$.
Step 5: Calculate the surrounding point average for each outlier: $\bar{x}_i = \frac{1}{N}\sum_{j=1}^{N} x_{i,j}$.
The resulting average is the re-predicted value of the predicted outlier.
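Steps 1 to 5 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the surrounding points are assumed to lie evenly on a circle of radius d around the outlier, `predict` is a hypothetical stand-in for evaluating the fitted model at a coordinate, and the quartiles use the "inclusive" convention of the standard library, which may differ slightly from the convention used in the study.

```python
import math
import statistics


def iqr_fences(values, whisker=1.5):
    """Steps 1-2: return the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


def repredict_outliers(points, values, predict, d=0.05, n_neighbors=8):
    """Replace out-of-fence values by the mean prediction on a ring of radius d.

    points: list of (lon, lat); values: predicted values at those points;
    predict: callable (lon, lat) -> value, standing in for the fitted model.
    """
    lower, upper = iqr_fences(values)
    corrected = list(values)
    for i, ((lon, lat), value) in enumerate(zip(points, values)):
        if lower <= value <= upper:
            continue  # Step 3: inside the fences, so not an outlier
        # Step 4: model predictions at the assumed surrounding points
        ring = [
            predict(lon + d * math.cos(2 * math.pi * j / n_neighbors),
                    lat + d * math.sin(2 * math.pi * j / n_neighbors))
            for j in range(n_neighbors)
        ]
        corrected[i] = sum(ring) / n_neighbors  # Step 5: moving average
    return corrected


# Toy data (not from the study): one value lies far outside the fences.
demo_points = [(122.0 + 0.1 * i, 30.0) for i in range(9)]
demo_values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
demo_fixed = repredict_outliers(demo_points, demo_values, lambda lon, lat: 9.0)
print(demo_fixed)
```

Only the flagged value is replaced; values inside the fences are returned unchanged, so the post-processing does not alter predictions in the interior of the region unless they are detected as outliers.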
3.3. Model Accuracy Evaluation Analysis
To evaluate the superiority of the R−ALPOLM in regional geomagnetic field modeling, comparative experiments were conducted against the Taylor polynomial model (TPM) and the Legendre polynomial model (LPM). These models are based on Taylor polynomials and traditional Legendre polynomials, respectively. The Taylor polynomial is a widely used mathematical tool that provides local approximations of functions around a specific point. In the context of geomagnetic field modeling, it can also be employed to approximate variations in geomagnetic elements. For a given geomagnetic element $B$, the center point of the modeled region is selected as the expansion point of the Taylor polynomial. The normalized latitude and longitude coordinates of the center point are denoted by $\varphi_0$ and $\lambda_0$, $A$ denotes the matrix of model coefficients, and the modeling formula is expressed as in Equation (11) [25,26]:
$$B = \sum_{n=0}^{N}\sum_{m=0}^{n} A_{nm}\,(\varphi - \varphi_0)^{n-m}\,(\lambda - \lambda_0)^{m} \quad (11)$$
where $N$ is the truncation level, and $\varphi$ and $\lambda$ are the normalized latitude and longitude of the point being modeled.
The direct solution formula for the traditional Legendre polynomials is shown in Equation (12):
$$P_n(x) = \sum_{m=0}^{\lfloor n/2 \rfloor} (-1)^m \frac{(2n-2m)!}{2^n\, m!\,(n-m)!\,(n-2m)!}\, x^{n-2m} \quad (12)$$
In Equation (12), $\lfloor n/2 \rfloor$ represents the integer part of $n/2$, where $n$ is the degree of the Legendre polynomial, and $m$ denotes the index of the summation terms. The fundamental model equations used for both Legendre polynomials and associated Legendre polynomials are identical, with the key difference lying in the polynomial $P_n(x)$. By substituting Equation (12) into Equation (3), the subsequent methods and steps for solving the coefficient matrix remain the same.
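As a small worked illustration, the standard explicit finite-sum formula for the Legendre polynomials can be evaluated directly and cross-checked against the three-term (Bonnet) recurrence. The function names are hypothetical and the sketch is not taken from the paper; it only shows how the direct formula is computed.

```python
from math import factorial


def legendre_explicit(n, x):
    """P_n(x) from the explicit finite sum with m running up to floor(n/2)."""
    total = 0.0
    for m in range(n // 2 + 1):
        coeff = ((-1) ** m * factorial(2 * n - 2 * m)
                 / (2 ** n * factorial(m) * factorial(n - m) * factorial(n - 2 * m)))
        total += coeff * x ** (n - 2 * m)
    return total


def legendre_recurrence(n, x):
    """P_n(x) from Bonnet's recurrence, used here only as a cross-check."""
    p_prev, p_curr = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        # (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
        p_prev, p_curr = p_curr, ((2 * k + 1) * x * p_curr - k * p_prev) / (k + 1)
    return p_curr


# Low-degree comparison of the two evaluations at an arbitrary point.
for degree in range(6):
    print(degree, legendre_explicit(degree, 0.3), legendre_recurrence(degree, 0.3))
```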
The experimental data originated from the training dataset described in
Section 3.2, and the results illustrated in
Figure 4 were obtained by comparing the training accuracy and runtime of the three models at different truncation levels. As illustrated in
Figure 4a, the RMSE of the LPM and the R−ALPOLM proposed in this study decreased rapidly as the truncation level increased, with the RMSE of the R−ALPOLM being slightly lower than that of the LPM. However, the LPM encountered a singular value problem at the 85th degree and could not be executed further. Conversely, the proposed model demonstrated stable performance, operating successfully up to the 116th degree before encountering the singular value issue. The RMSE of the TPM was substantially higher than those of the other two models; it decreased until the 42nd degree and then increased sharply with greater fluctuations.
Figure 4b illustrates the runtime performance of the three models. The runtime of all models increased as the truncation level increased. The overall runtimes of the TPM and R−ALPOLM were comparable, whereas the LPM exhibited a significantly higher runtime than the other two models. Notably, the singular value problem occurred at the 43rd degree for the TPM, which explains the sharp increase in its RMSE after the 42nd degree. Because the TPM solution is singular from this degree onward, its runtime at higher degrees has little reference value. Considering its higher-degree fitting ability and runtime efficiency, the proposed R−ALPOLM exhibited a clear advantage.
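One way to see why a polynomial model can run into singularity at high degrees is to track the condition number of its design matrix as the truncation level grows. The sketch below does this for a bivariate monomial (Taylor-type) basis. It is an illustration under assumptions: the coordinates are synthetic random points in the normalized range [-1, 1], the basis layout in `taylor_design_matrix` is hypothetical, and reading the paper's "singular value problem" as ill-conditioning of the least-squares system is an interpretation made here, not a statement from the source.

```python
import numpy as np


def taylor_design_matrix(lat, lon, degree):
    """Columns are lat**(n - m) * lon**m for 0 <= m <= n <= degree."""
    columns = [lat ** (n - m) * lon ** m
               for n in range(degree + 1) for m in range(n + 1)]
    return np.column_stack(columns)


rng = np.random.default_rng(0)
lat = rng.uniform(-1.0, 1.0, 500)  # synthetic normalized latitudes
lon = rng.uniform(-1.0, 1.0, 500)  # synthetic normalized longitudes

for degree in (2, 6, 10, 14):
    G = taylor_design_matrix(lat, lon, degree)
    # A large condition number means the least-squares coefficients are
    # numerically unreliable, even before the matrix becomes exactly singular.
    print(degree, G.shape[1], f"{np.linalg.cond(G):.2e}")
```

The number of coefficients grows as (N + 1)(N + 2)/2 with the truncation level N in this layout, so both the conditioning and the ratio of unknowns to data points deteriorate as N increases.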
Figure 5 shows the variations in RMSE and the number of outliers for TPM and LPM at different degrees. As illustrated in
Figure 5a, the LPM and R−ALPOLM exhibited similar results when the truncation level increased, with their RMSE values decreasing while the number of outliers gradually increased. At a truncation level of 85, the RMSE reached a minimum of 0.59 nT, whereas the number of outliers increased to a maximum of 97. To evaluate the combined impact of the RMSE and number of outliers on the overall model performance, the comprehensive evaluation index (defined in Equation (
9)) was applied. The obtained results are illustrated in
Figure 6a. As shown in
Figure 6a, the comprehensive evaluation index steadily decreased as the truncation level increased, reaching a minimum value of 0.8749 at the 76th degree. Beyond this point, the index began to increase gradually. Based on this analysis, the optimal truncation level for the LPM was identified as the 76th degree, corresponding to 43 outliers, or approximately 2.7% of the 1620 test points. As shown in
Figure 5b, the trend in RMSE of the TPM across truncation levels can be categorized into three phases: a slow decrease in the initial phase, followed by local stabilization and, finally, a sharp increase. For truncation levels lower than 42, the RMSE gradually decreased, reaching a minimum value of 16.9 nT at degree 42. Between degrees 42 and 51, the RMSE remained relatively stable, fluctuating within a range of approximately 1 nT. However, from degree 52 onward, the RMSE increased dramatically, exceeding 400 nT at degree 103. Simultaneously, the change in the number of outliers exhibited a specific pattern. The number of outliers exhibited continuous fluctuations within the first 7 degrees. The number of outliers then increased rapidly from the 8th to the 90th degree. After the 90th degree, although the number of outliers continued to increase in general, significant fluctuations occurred. Considering the influences of the RMSE and the number of outliers on the overall model, the results obtained when applying the comprehensive evaluation index formula (Equation (
9)) are shown in
Figure 6b. As illustrated in
Figure 6b, the comprehensive evaluation index decreased steadily as the truncation level increased, reaching a minimum value of 11.9086 at the 42nd degree. After this point, the index increased rapidly, reaching a value of 4507.089 at the 112th degree. Thus, the optimal truncation level of the TPM was determined to be the 42nd degree.
Next, we compared the performance of the R−ALPOLM, LPM, and TPM at their respective optimal truncation levels.
Table 3 presents the statistics of the predicted data for the three models. The results indicate that the maximum error value of the predicted data for the R−ALPOLM of 83rd degree was 32.60 nT, the minimum error was
nT, the average error was 1.16 nT, and the RMSE was 3.21 nT, which were the lowest among the three models for all parameters. The R−ALPOLM exhibited the lowest RMSE among the models, thereby highlighting its superior prediction accuracy. The predicted data for the 76th degree LPM showed similar values for the maximum, minimum, and the average error values, compared with the R−ALPOLM; however, the RMSE value for the LPM reached 5.00 nT, indicating comparable stability, although it was slightly inferior to that of the R−ALPOLM. In contrast, the predicted values for the 42nd degree TPM differed notably from those of the first two models, particularly with the RMSE reaching 12.76 nT. This was significantly higher than those for the first two models, highlighting a disadvantage in terms of its prediction accuracy.
Figure 7 illustrates the two-dimensional (2D) images of local geomagnetic anomalies constructed by the three models at their respective optimal truncation levels. The 2D images generated by each model are highly consistent with the original geomagnetic anomaly data in terms of overall color distribution, effectively capturing the general characteristics of the geomagnetic anomaly field. Notably, the R−ALPOLM and LPM demonstrate significant advantages in the depiction of detailed features: their 2D images not only align closely with the overall color distribution trends of the original data but also exhibit higher similarity in local detail. In contrast, although the TPM maintains basic consistency with the original data in terms of overall color distribution trends, it shows obvious deficiencies in the expression of detailed features. Multiple regions display color distributions that are inconsistent with the original data, indicating a weaker capability to capture local geomagnetic anomaly features. To further visually illustrate the differences in fitting accuracy among the three models,
Figure 8 presents the planar residual distribution and residual histograms of the predicted results from each model. As shown in
Figure 8, the R−ALPOLM exhibits the best modeling performance, with residuals primarily concentrated within the range of ±2 nT. Moreover, the residual values in the central modeling region are generally small, indicating high fitting accuracy. Larger residuals are mainly distributed in the four corner regions of the modeling area, and their distribution pattern aligns with the outlier distribution described in
Section 3.2. The LPM also performs well overall, with residuals similarly concentrated within ±2 nT. However, compared to the R−ALPOLM, its high residuals are more dispersed, appearing not only in the corner regions but also in the central modeling area, which suggests a slightly inferior fitting capability for local geomagnetic anomalies. In contrast, the TPM exhibits significantly larger residuals, with high residuals widely distributed throughout the entire modeling region, indicating substantially lower modeling accuracy than the R−ALPOLM and LPM.