1. Introduction
Solar flares are sudden, intense bursts of electromagnetic radiation and high-energy particles in the Sun’s atmosphere, primarily caused by magnetic reconnection. These events can severely disrupt the Earth’s space environment and technological systems, affecting radio communications, GPS accuracy, and satellite operations. Accurate solar flare prediction is therefore essential for mitigating these impacts and safeguarding critical infrastructure.
Solar flare forecasting has long been a focus of research using traditional physics-based methods. Since the 1930s, statistical relationships between solar flares and sunspot activity have been extensively studied, forming the foundation for early forecasting models. Techniques such as Poisson distributions have been developed to estimate flare probabilities [1], while discriminant analysis has been employed to identify the critical magnetic parameters influencing solar activity [2]. Active regions (ARs), characterized by strong magnetic fields, have been widely analyzed through multi-wavelength observations, with researchers extracting morphological, magnetic, and coronal features—including magnetic gradients, neutral line lengths, magnetic energy dissipation, and effective magnetic fields—to systematically parameterize flare-productive regions [3,4,5]. Sunspot classification schemes have further refined the relationship between sunspot morphology and flare occurrence. Advanced techniques, such as Zernike moments derived from magnetograms [6] and fractal analysis of active regions [7], have provided deeper insights into flare conditions. Helioseismology studies have additionally revealed connections between subsurface flow patterns and flare productivity [8], complemented by recent power spectral analysis demonstrating the predictive capability of magnetic power-law indices in young active regions [9]. However, despite these advancements, the comparable performance of traditional approaches indicates a persistent bottleneck in predictive accuracy [10]. Collectively, these studies elucidate the complex interplay between magnetic field properties, energy storage mechanisms, and flare-triggering processes, highlighting the continued value of physics-based approaches in solar flare forecasting.
Deep learning has significantly advanced solar flare prediction by integrating diverse data-driven approaches. Self-organized criticality and low-dimensional chaotic dynamics in solar activity have been explored [11], while models leveraging McIntosh classifications, magnetic gradients [12,13], and blackout parameters [14] have improved predictions. Techniques such as CNN-GRU [15], density clustering with SMOTE [16], selective upsampling [17], and loss function weighting [18] address class imbalance, with ensemble learning and hybrid methods also proving effective [19,20,21]. LSTM-based models, including BiLSTM-Attention, excel in multiclass flare forecasting, while the cGAN method enhances magnetic polarity accuracy using images [22]. Machine learning applications in solar eruption prediction have been reviewed, summarizing progress and future directions [23]. Studies on magnetogram resolution [24] and EUV imaging [25] have shown that deep learning models are insensitive to resolution changes up to certain thresholds. Research integrating magnetic field height has identified optimal ranges (1000–1800 km) for improving flare onset time predictions [26,27], with synthetic δ-sunspot data and the weighted horizontal magnetic gradient (WG_M) method offering insights into 3D magnetic configurations [28,29]. These advances underline the promise of deep learning in refining solar flare forecasting and modeling.
Current research on the influence of magnetic field height variations in deep-learning-based solar flare prediction remains limited. This paper presents the first comprehensive investigation into how different magnetic field altitudes affect forecasting performance. Leveraging SDO/HMI observational data, we systematically incorporated magnetic field measurements at multiple atmospheric heights into deep neural networks, rigorously evaluating the predictive value of this three-dimensional magnetic information for solar flare occurrence.
Building upon this foundation, we implemented three widely adopted CNN architectures—AlexNet [30], ResNet-18 [31], and SqueezeNet [32]—to systematically evaluate how magnetic field height variations influence model performance. Using the AUC metric for quantitative assessment, we conducted comprehensive testing across multiple atmospheric heights. This paper not only reveals previously unexplored relationships between magnetic field altitude and flare prediction accuracy but also provides actionable insights for optimizing deep learning applications in solar activity forecasting.
The remainder of this paper is organized as follows: Section 2 details the data sources and preprocessing methodologies. Section 3 presents the architecture and implementation of the three CNN models employed in this study. Section 4 provides a comprehensive analysis and discussion of the experimental results. Finally, Section 5 concludes this paper with key findings and outlines promising directions for future research.
2. Data
The Solar Dynamics Observatory/Helioseismic and Magnetic Imager (SDO/HMI) provides continuous, high-quality observations of the photospheric magnetic field. We obtained Space-weather HMI Active Region Patches (SHARPs) of the photospheric magnetic fields of solar active regions at a 96 min cadence from 2010 to 2019 from the Joint Science Operations Center (JSOC) (http://jsoc.stanford.edu/ajax/lookdata.html, accessed on 1 July 2022) and selected as the original data those patches whose Stonyhurst longitude, taken as the arithmetic average of the maximum and minimum longitude values, was ≤ 30°.
Solar flares are caused by the sudden release of magnetic energy stored in the corona above solar active regions, so magnetic information on the active region is very important for predicting solar flares. However, because routine observations of the coronal magnetic field are lacking, we used the photospheric magnetic field from the downloaded SHARP data products as the bottom boundary condition for nonlinear force-free field (NLFFF) extrapolation to construct a 3D coronal vector magnetic field dataset [33].
Because the magnetic field intensity spans a very large range, we needed to normalize the three-dimensional magnetic field data obtained from the extrapolation. Most of the magnetic field intensity in a solar active region is concentrated within a certain range, but some values lie far above (or far below) this range; normalizing against the overall data range would compress the values in the concentrated part of the dataset toward 0, leaving only a small amount of data near 1. Therefore, before normalization, we set the left and right thresholds for the concentration region of the photospheric magnetic field data at ±500 Gauss, and data with an absolute value greater than 500 Gauss were uniformly clipped to the threshold. For each layer of data obtained by extrapolation, we delimited the threshold so that the proportion of data within the threshold relative to the total remained the same as in the photospheric layer; in the photosphere, the data within ±500 Gauss accounted for about 98% of the total. This is shown in Figure 1.
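As an illustration, the following minimal sketch implements the clipping and per-layer threshold selection described above in NumPy. The function names and the placeholder data are ours, not from a released codebase:

```python
import numpy as np

def layer_threshold(layer, coverage):
    """Find the symmetric threshold that keeps the given fraction of the
    layer's |B| values inside [-thr, +thr] (about 98% in the photosphere)."""
    return np.quantile(np.abs(layer), coverage)

def normalize_layer(layer, thr):
    """Clip values beyond the threshold and scale to [-1, 1]."""
    return np.clip(layer, -thr, thr) / thr

# bz_cube: (n_heights, ny, nx) extrapolated field; placeholder random data here.
bz_cube = np.random.normal(scale=200.0, size=(20, 512, 512))
norm_cube = np.empty_like(bz_cube)

# Photospheric layer: fixed +/-500 Gauss thresholds.
norm_cube[0] = normalize_layer(bz_cube[0], 500.0)
coverage = np.mean(np.abs(bz_cube[0]) <= 500.0)  # ~0.98 for the real data

# Higher layers: choose each threshold so the in-threshold fraction matches.
for k in range(1, bz_cube.shape[0]):
    norm_cube[k] = normalize_layer(bz_cube[k], layer_threshold(bz_cube[k], coverage))
```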
At the end of data processing, we labeled the data samples, classifying the observed data and their extrapolated results into flare and nonflare samples. If at least one M-class flare occurred in the active region within 24 h of the beginning of the observation, the magnetic field data sample of that active region was considered a flare sample; otherwise, it was considered a nonflare sample.
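This labeling rule can be summarized in code. The sketch below assumes a hypothetical flare catalog table with columns for the active region number, flare peak time, and GOES class; the actual catalog format used in this work may differ:

```python
import pandas as pd

def label_sample(obs_time, noaa_ar, flare_catalog):
    """Return 1 (flare) if the active region produced at least one M-class
    or stronger flare within 24 h of the observation start, else 0.
    `flare_catalog` is a hypothetical DataFrame with columns
    ['noaa_ar', 'peak_time', 'goes_class']."""
    window = flare_catalog[
        (flare_catalog["noaa_ar"] == noaa_ar)
        & (flare_catalog["peak_time"] >= obs_time)
        & (flare_catalog["peak_time"] < obs_time + pd.Timedelta(hours=24))
        & (flare_catalog["goes_class"].str[0].isin(["M", "X"]))
    ]
    return int(len(window) > 0)
```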
We obtained a dataset of 52,352 samples, to each of which we added an additional third dimension beyond the traditional two-dimensional magnetic map: magnetic field height. The dataset comprised 1140 ARs, and we used the samples from January to October of each year from 2010 to 2019 as the training set and the remaining samples as the test set. The ratio of the training set to the test set was approximately 4.68:1. The training set consisted of 43,140 samples, of which 882 were flares and 42,258 were nonflares. The test set consisted of 9212 samples, of which 206 were flares and 9006 were nonflares. There is clearly a class imbalance between positive and negative samples, so the samples needed to be resampled.
3. Methods
Deep learning methods can be made interpretable in solar flare prediction applications. Local interpretation techniques, including gradient-based methods (e.g., gradient × input) and feature importance approaches (e.g., LIME), enable the analysis of individual prediction decisions by examining input feature contributions [34]. Global interpretation methods characterize overall model behavior, with techniques such as feature importance ranking identifying key discriminative features across the entire input space. Visualizing these interpretability results reveals the specific magnetic field characteristics and patterns that models prioritize during flare identification and prediction. Such interpretability analyses not only validate model reliability but also provide physical insights into flare-triggering mechanisms and evolutionary processes, ultimately enhancing forecasting accuracy.
Deep learning can automatically extract useful features from raw observational data. In this study, three widely used CNN models (AlexNet, ResNet-18, and SqueezeNet) were selected to examine flare prediction performance at different altitudes.
Figure 2 illustrates the architecture of our flare prediction network. To evaluate how magnetic field height affects prediction performance, we conducted separate training sessions using datasets from the different atmospheric levels. The models generated binary predictions indicating whether each active region would produce flares within 24 h. We then computed standard evaluation metrics to enable the systematic comparison of prediction accuracy across both the different network architectures and the varying magnetic field heights.
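For concreteness, the sketch below shows how the three architectures can be instantiated with a two-class output head using torchvision, with one independent training run per (architecture, height) pair as described above. The helper name is ours, and the training call is left as a placeholder:

```python
import torchvision.models as models

def build_model(arch: str, num_classes: int = 2):
    """Instantiate one of the three CNNs with a binary (flare/nonflare) head."""
    builders = {
        "alexnet": models.alexnet,
        "resnet18": models.resnet18,
        "squeezenet": models.squeezenet1_0,
    }
    return builders[arch](num_classes=num_classes)

# One independent training run per architecture and per magnetic field height:
for arch in ("alexnet", "resnet18", "squeezenet"):
    for zlevel in range(20):          # zlevel0 ... zlevel19
        model = build_model(arch)
        # ... train on the dataset extracted at this zlevel, then record the AUC
```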
3.1. AlexNet
Figure 3 illustrates the AlexNet architecture, which comprises five convolutional (conv) layers, three fully connected (FC) layers, three max-pooling (MaxPool) layers, and one adaptive average pooling (AdaptiveAvePool) layer. Each convolutional layer is immediately followed by a Rectified Linear Unit (ReLU) activation layer, with additional ReLU layers inserted between the FC layers (Appendix A). As the first CNN architecture to implement ReLU activation systematically, AlexNet significantly improved training efficiency by introducing nonlinear transformations. These nonlinearities enable the network to learn complex representations while mitigating gradient vanishing problems during backpropagation. The specific configuration of each network layer is detailed below.
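The convolutional stack described above can be written compactly in PyTorch; the channel sizes below follow the standard torchvision AlexNet and are shown only to make the conv/ReLU/MaxPool ordering explicit:

```python
import torch.nn as nn

# Five conv layers, each followed by ReLU, interleaved with three MaxPool layers.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
```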
3.2. ResNet-18
Model accuracy typically improves with increasing network depth up to a critical threshold. Beyond this point, however, unexpected accuracy degradation occurs due to the vanishing/exploding gradient problem, where backpropagated gradients either diminish or grow exponentially with network depth. The residual module [31] addresses this by introducing skip connections that bypass multiple layers, maintaining gradient flow even through non-transformative layers. This architectural innovation enables the successful training of ultra-deep networks (e.g., ResNet variants with 150+ layers) while significantly boosting classification performance. For our study, we implemented ResNet-18, an 18-layer variant offering an optimal balance between depth and computational efficiency for our flare prediction task.
When features are combined across layers, dimension mismatches can occur. To ensure that the input and output can be added correctly, the shortcut branch is raised or reduced in dimension through a convolution layer with a 1 × 1 kernel when combining across channels. After the weights are trained, ResNet-18 uses a 512 × 1 × 1 average pooling layer to map the residual block output to a feature vector of length 512 and, finally, a linear layer to obtain the output. ResNet-18 has achieved good classification results in flare prediction tasks. Its architecture is shown in Figure 4.
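A minimal sketch of the basic residual block with the 1 × 1 projection shortcut described above (a simplified version of the torchvision implementation) is as follows:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Basic ResNet block: when the channel count or stride changes, a 1x1
    convolution projects the identity branch so the skip connection can be
    added to the main branch."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.proj = None
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + identity)  # skip connection keeps gradients flowing
```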
3.3. SqueezeNet
The SqueezeNet model architecture is shown in Figure 5. To achieve both efficiency and accuracy, SqueezeNet adopts three strategies (Appendix B).
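As a concrete illustration of these strategies, the sketch below reproduces SqueezeNet's core building block, the Fire module, in which a 1 × 1 squeeze layer reduces the channel count before parallel 1 × 1 and 3 × 3 expand layers restore it:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet Fire module: squeeze (1x1) then expand (1x1 and 3x3 in
    parallel), concatenating the two expand outputs along the channel axis."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)
```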
3.4. Application of the Model
The data preprocessing pipeline begins by loading all training and test samples into the dataset loader. Each magnetic field map is resized to a standardized 3 × 512 × 512 tensor format, where the three channels (nx, ny, nz) correspond to the vector components of the magnetic field intensity. This spatial normalization ensures dimensional consistency across all inputs while preserving the essential vector field information. The 512 × 512 resolution was selected to maintain sufficient spatial detail for accurate flare prediction while remaining computationally tractable for the subsequent CNN processing.
The dataset suffers from a class imbalance problem: there were few positive samples and many negative samples. To effectively mitigate this problem, we defined a custom sampler. Assuming the batch size is x, in each cycle we randomly sampled x/2 negative samples, while the x/2 positive samples were drawn sequentially, cycling through the positive set; in this way, the minority positive class was oversampled and the majority negative class was downsampled. Oversampling the small number of positive samples ensures that positive and negative samples are balanced in each batch, and random sampling also improves the generalization ability of the model.
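A sketch of such a sampler (our own illustration, not the exact implementation) is shown below; it yields index lists that can be passed to a PyTorch DataLoader via its batch_sampler argument:

```python
import random

def balanced_batches(pos_idx, neg_idx, batch_size, n_batches):
    """Yield batches holding batch_size/2 positives (cycled sequentially,
    i.e., oversampled) and batch_size/2 negatives (drawn randomly, i.e.,
    downsampled)."""
    half = batch_size // 2
    p = 0
    for _ in range(n_batches):
        pos = [pos_idx[(p + i) % len(pos_idx)] for i in range(half)]
        p = (p + half) % len(pos_idx)
        neg = random.sample(neg_idx, half)
        batch = pos + neg
        random.shuffle(batch)
        yield batch

# Usage sketch (batch_sampler accepts any iterable of index lists):
# loader = torch.utils.data.DataLoader(dataset,
#     batch_sampler=balanced_batches(pos_idx, neg_idx, 64, n_batches))
```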
Regarding the selection of hyperparameters, we mainly considered the batch size, learning rate, number of epochs, and optimizer. To isolate the influence of different magnetic field heights on prediction performance, the same appropriate hyperparameters were fixed for the three models in this study: a batch size of 64, a learning rate of 1 × 10−3, 50 epochs, and the Adam optimizer [35].
During model training, the input magnetograms were processed to generate flare predictions, which were compared against ground-truth labels. For this binary classification task, we employed the cross-entropy loss function in Equation (1) to quantify the discrepancy between the model's predicted probabilities and the actual flare occurrences:

$$L(g, p) = -\sum_{i} g_i \log(p_i) \quad (1)$$

where g is the true label, usually a one-hot encoded vector whose value is 1 at the index position of the true category and 0 elsewhere; p is the probability distribution predicted by the model, usually a probability vector output by the softmax function; and log is the natural logarithm. The loss function evaluates how closely the network's output distribution aligns with the true label distribution, providing the gradient signal for backpropagation-based optimization.
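In code, Equation (1) corresponds directly to PyTorch's built-in criterion, which combines the softmax and the negative log-likelihood; the numbers below are arbitrary example values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.2, 1.5], [2.0, -1.0]])  # raw network outputs, 2 samples
labels = torch.tensor([1, 0])                     # 1 = flare, 0 = nonflare
loss = F.cross_entropy(logits, labels)            # softmax + -sum(g*log p), Eq. (1)
```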
The gradients of the loss function with respect to the network weights are backpropagated through all layers, enabling weight updates in both convolutional filters and fully connected layers via stochastic gradient descent (SGD). This forward–backward propagation cycle iterates until the termination criteria are met. To mitigate overfitting, we employed early stopping—halting training when test set performance plateaued—which simultaneously minimized the loss function and preserved generalization capability. As formalized in Equation (2), SGD updates the network parameters using gradients computed on mini-batches of training data:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t; x, y) \quad (2)$$

where $\theta_t$ is the current parameter value; $\eta$ is the learning rate, which controls the step size of the parameter update; $\nabla_\theta L(\theta_t; x, y)$ is the gradient of the loss function with respect to the parameters on sample (x, y); and $\theta_{t+1}$ is the updated parameter value. In this study, due to the very large dataset size, we used the Adam optimizer, which automatically adjusts the learning rate and converges faster.
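For clarity, a single parameter update of Equation (2) can be written out explicitly, as below; the linear model is a toy stand-in, and in the actual training we used torch.optim.Adam rather than this manual step:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(4, 2)               # toy stand-in for the CNN
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))

lr = 1e-3
loss = F.cross_entropy(model(x), y)
loss.backward()                              # gradients of L w.r.t. theta
with torch.no_grad():
    for theta in model.parameters():
        theta -= lr * theta.grad             # theta_{t+1} = theta_t - eta * grad
model.zero_grad()
```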
During the final evaluation phase, we quantitatively assessed the model’s predictive performance using held-out validation data. The trained convolutional neural network processed input magnetograms from test active regions through forward propagation to generate flare predictions. Model accuracy was systematically evaluated by comparing these predictions with ground-truth flare occurrences, employing standard classification metrics to measure the agreement between predicted and observed events.
4. Experiments and Results
To facilitate the subsequent deep learning work, the data observed from January to October of each year from 2010 to 2019, along with their extrapolation results, were pre-allocated to the training set, and the remaining samples were used as the test set. At this point, the establishment of the sample dataset was complete. Each sample in the resulting dataset included the SHARP number of the active region, a four-dimensional array of magnetic field data (where the first dimension represents the three components of the magnetic field, and the remaining three dimensions represent the spatial size of the magnetic field), the label of the sample, and the dataset to which the sample belonged (training or test set).
4.1. Evaluation Indices
The output of the model is a two-dimensional vector representing the model’s prediction of whether a flare will occur within a 24 h time window.
Table 1 shows the indicators under four prediction conditions: true positives (TPs), false positives (FPs), false negatives (FNs), and true negatives (TNs), where TPs represents the number of positive instances classified as positive, FPs represents the number of negative instances classified as positive, FNs represents the number of positive instances classified as negative, and TNs represents the number of negative instances classified as negative. P = TP + FN is the total number of positive samples; N = FP + TN is the total number of negative samples.
4.1.1. TPR, FNR, TNR, FPR
The true positive rate (TPR), also known as sensitivity or recall, refers to the proportion of all actual positive cases that are correctly identified as positive. The false negative rate (FNR) refers to the proportion of all actual positive cases that are incorrectly identified as negative. They can be calculated using Equations (3) and (4):

$$\mathrm{TPR} = \frac{TP}{TP + FN} \quad (3)$$

$$\mathrm{FNR} = \frac{FN}{TP + FN} \quad (4)$$
The true negative rate (TNR), also known as specificity, refers to the proportion of all actual negative cases that are correctly identified as negative. The false positive rate (FPR) refers to the proportion of all actual negative cases that are incorrectly identified as positive. Their calculation methods are shown in Equations (5) and (6), respectively:

$$\mathrm{TNR} = \frac{TN}{TN + FP} \quad (5)$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \quad (6)$$
4.1.2. Accuracy
Accuracy is the number of correctly classified samples divided by the total number of samples, again ranging from 0 to 1. Generally speaking, the closer the accuracy is to one, the better the classification effect; however, accuracy is not suitable for imbalanced classification. If a classifier assigns all instances of a minority class to the majority class, the overall accuracy can still remain high because the majority class dominates the sample count. ACC can be obtained through Equation (7):

$$\mathrm{ACC} = \frac{TP + TN}{P + N} \quad (7)$$
4.1.3. True Skill Score
The true skill score (TSS) ranges over [−1, 1], and the closer the result is to one, the better the classification. A result of −1 means that all forecasts are wrong, and a result of 1 means that all forecasts are correct; that is, the predictions for both the positive and negative classes match the actual outcomes. The numbers of positive and negative samples in our solar flare dataset are unbalanced, and the true skill statistic is insensitive to the class-imbalance ratio. Therefore, we used the true skill statistic as the main indicator, with the others as secondary indicators, which together reflect the performance of the entire model well. The TSS can be obtained through Equation (8):

$$\mathrm{TSS} = \mathrm{TPR} - \mathrm{FPR} = \frac{TP}{TP + FN} - \frac{FP}{FP + TN} \quad (8)$$
In practice, there are far fewer flaring samples than nonflaring samples. Considering the imbalance between positive and negative samples in the database, if a model predicts flaring for every input active region, it can still obtain a good TPR, but the FPR will be large, and the TSS will therefore be poor.
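The metrics of Equations (3)–(8) follow directly from the confusion-matrix counts. The sketch below also illustrates the point above: applied to our test set composition (206 flares, 9006 nonflares), an "always flare" forecaster reaches TPR = 1 but also FPR = 1, giving TSS = 0:

```python
def skill_scores(tp, fp, fn, tn):
    """Equations (3)-(8) from confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # (3) sensitivity / recall
    fnr = fn / (tp + fn)                    # (4)
    tnr = tn / (tn + fp)                    # (5) specificity
    fpr = fp / (fp + tn)                    # (6)
    acc = (tp + tn) / (tp + fp + fn + tn)   # (7)
    tss = tpr - fpr                         # (8)
    return dict(TPR=tpr, FNR=fnr, TNR=tnr, FPR=fpr, ACC=acc, TSS=tss)

# "Always flare" on our test set: TPR = 1, FPR = 1, TSS = 0, ACC ~ 0.022.
print(skill_scores(tp=206, fp=9006, fn=0, tn=0))
```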
4.1.4. Receiver Operating Characteristic and Area Under the Curve
The Receiver Operating Characteristic (ROC) curve is a graph that shows how a predictive model behaves at all prediction thresholds, plotting the model's true positive rate (TPR) against its false positive rate (FPR). We can obtain the ROC curve by increasing the threshold from 0 to 1. If the threshold is set to 0, all samples are predicted to be positive, and both TPR and FPR are one. If the threshold is equal to one, all samples are predicted to be negative, and both TPR and FPR are zero. The Area Under the Curve (AUC) score is the area under the ROC curve, a general indicator for evaluating binary classification models. This score lies between 0 and 1, and the higher the AUC value, the better the performance of the model.
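In practice, the ROC curve and AUC can be computed with scikit-learn; y_true and y_score below are placeholder values standing in for the test labels and the model's predicted flare probabilities:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                 # 0/1 flare labels (placeholder)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]    # predicted flare probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # sweeps the threshold
auc = roc_auc_score(y_true, y_score)
```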
4.2. Experimental Procedure
During preprocessing, we standardized all input data by resizing the magnetograms to 512 × 512 resolution using bilinear interpolation. For model training, we exclusively utilized the vertical magnetic field component (nz direction) as the input. To address class imbalance, we implemented balanced batch sampling with equal numbers of positive and negative samples, complemented by three data augmentation techniques: (1) random horizontal flipping, (2) random vertical flipping, and (3) arbitrary rotation (0–360°). These spatial transformations served dual purposes: they artificially expanded our training dataset while improving model robustness to observational variations in solar imagery. Importantly, all augmentation was applied exclusively during training, preserving the integrity of our validation and test sets. The augmentation strategy effectively mitigated the limitations of small datasets in solar physics applications by simulating diverse viewing conditions and increasing the effective training sample size.
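A sketch of this augmentation pipeline with torchvision transforms (applied to the training set only; the test set is just resized) might look as follows:

```python
import torchvision.transforms as T

train_tf = T.Compose([
    T.Resize((512, 512)),              # bilinear interpolation by default
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=(0, 360)),
])
test_tf = T.Resize((512, 512))         # no augmentation at evaluation time
```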
We used the network models provided by the PyTorch package 1.13.1 and modified them, setting the number of output channels to 2. We chose the cross-entropy loss function, which is widely used in training networks, and trained the models with mini-batch stochastic gradient descent using the Adam optimizer. In each network, we set batch_size = 64, lr = 1 × 10−3, and num_epochs = 50, where batch_size represents the batch size, lr represents the learning rate, and num_epochs represents the number of epochs.
4.3. Experimental Results
After training the different models at the different magnetic field heights, the corresponding confusion matrices were obtained, and evaluation indices such as the training loss, AUC, and TSS were calculated. The AUCs of the different models were summed and averaged to obtain a comprehensive AUC index. Next, training loss and comprehensive AUC plots were drawn to analyze the influence of the magnetic field height.
4.3.1. Training Loss
Taking ResNet-18 as an example, we obtained the trend of the training loss with zlevel. Figure 6 shows the training loss curves. We found that, on the whole, the training loss decreased as the number of epochs increased, which shows that the training was effective, while the final training loss increased with increasing zlevel. In the relatively high layers, it was difficult to reduce the training loss further even after 50 training epochs, indicating that, as the number of layers increases, the data quality gradually deteriorates and the model can no longer learn new features well. This conclusion is quite reasonable: the magnetic field in the photosphere is directly observed, while the magnetic field above the photosphere is obtained through extrapolation. As the magnetic field height increases, the data quality gradually deteriorates, and the corresponding indicators deteriorate as well.
4.3.2. Results
Table 2 shows the contingency table and AUC for different heights in different models.
Figure 7 shows the ROC curves of the different models at the different magnetic field heights. We conducted pairwise hypothesis tests between the groups of curves for the three models; the results are shown in Table 3, and there were significant differences among the groups. It can be seen that, as the magnetic field height gradually increases, the area under the ROC curve first increases and then decreases, with the sum of the areas of the three models reaching its maximum at zlevel two. The combined AUC curve, AUC bar chart, and standard deviation are shown in Figure 8. As can be seen from the two figures, as the magnetic field height increases, the AUC also first increases and then decreases; in the first five layers, the two indicators remain at a high level, while the indicators at higher levels show a clear downward trend. The AUC bar chart on the right is divided into four groups; the first, second, third, and fourth groups represent zlevel0–zlevel4, zlevel5–zlevel9, zlevel10–zlevel14, and zlevel15–zlevel19, respectively. It is easy to see that the AUC of the first group, zlevel0–zlevel4, is significantly higher than that of the other groups. This also confirms Korsós's conclusion that the WG_M method yields better flare prediction at greater magnetic field heights [27].
From the above, we conclude that the optimal model is obtained when the zlevel is two; that is, zlevel2 is determined to be the best height. When dividing the layers in the z direction, we took one layer at an interval of 10 pixels, so zlevel2 lies about 20 pixels above the photospheric magnetic field. Since one pixel corresponds to about 0.36 Mm, i.e., 360 km, there is an optimal magnetic field height at about 7200 km, where the flare prediction performance is better than in the other layers; that is to say, the real eruption location of flares is likely to be near this optimal height.
From this, we find that our results are not completely consistent with Korsós's conclusion. They report that the accuracy of solar flare prediction can be improved in a magnetic field height range of 1000–1800 km, whereas we find an optimal height of about 7200 km. The common point is that both lie above the photosphere: the performance of solar flare prediction at greater magnetic field heights is better than that in the photosphere. Magnetic reconnection is an important prerequisite for the eruption of solar flares, and only in the chromosphere and corona are the intensity and complexity of the magnetic field sufficient to support the magnetic reconnection process. This process releases enormous energy, making solar flare eruptions possible, which also indicates that our conclusion is reasonable.
5. Conclusions and Future Work
To investigate how magnetic field height affects solar flare prediction, we performed magnetic field extrapolations from photospheric magnetograms to obtain data at various atmospheric heights. Using temporally partitioned data (January–October for training, November–December for testing), we evaluated three CNN architectures (AlexNet, ResNet-18, SqueezeNet) across different height levels. Our experiments revealed a consistent inverted-V pattern in prediction performance versus height, peaking near typical flare formation heights (consistent with physical expectations) and declining sharply beyond the fourth extrapolation layer. This degradation likely stems from two factors: (1) extrapolation errors increase at greater heights, and (2) data quality filtering during preprocessing disproportionately affects the higher layers due to increased noise and artifacts in the extrapolated data. The resulting sample size reduction in the upper layers may contribute to the observed performance decline.
In this paper, we only discuss the three-dimensional magnetic field obtained under the force-free assumption of the extrapolation. Future research can focus on other extrapolation techniques that are closer to the real field, or on instruments that directly observe the solar chromosphere and corona, improving prediction performance by improving the reliability of the data.
Furthermore, the improvements made to the Solar Magnetism and Activity Telescope (SMAT) data processing techniques demonstrate their reliable applicability to Huairou Solar Observing Station (HSOS) magnetogram datasets [36]. By addressing instrumental challenges such as the zero-level problem and optimizing observational strategies, such as using both wings of the spectral line, SMAT-derived magnetograms align well with established datasets such as those from HMI. These advancements highlight the potential of SMAT data to contribute significantly to studies of global solar magnetic fields and space weather, particularly within the unique observational framework provided by HSOS. Huairou also provides high-quality magnetic field data, and our work can be applied to those data in the future.