1. Introduction
Solar energy, one of the most promising renewable energy sources [1], is abundant, green, and clean, and solar power generation is expected to expand significantly [2]. The Qinghai–Tibet Plateau region is exceptionally rich in solar energy, with annual sunshine duration ranging from approximately 1500 to 3400 h, and therefore holds tremendous potential for solar power generation. However, power output fluctuates significantly, and sudden changes in output can adversely affect the stability of the grid [3]. Accurate prediction of solar power generation supports the safe and efficient operation of the grid [4] and is crucial for reducing the impact of integrating solar power systems into the power grid [5]. Solar irradiance is a primary factor determining power output [6], which makes irradiance prediction both a central and a challenging research focus. Therefore, in the context of China’s dual carbon strategy goals, establishing a suitable solar irradiance prediction model for the Tibetan Plateau holds great significance.
Solar radiation series, as a type of time series, can be predicted using time series analysis methods [7,8], with autoregressive integrated moving average (ARIMA) being widely applied. In the study by Zhang [9], a hybrid model combining ARIMA and artificial neural networks (ANN) was constructed, and the results showed that the model effectively improved prediction accuracy. In the study by Reikard [10], multiple radiation datasets with resolutions of 5, 15, 30, and 60 min were used to build ARIMA models for predicting solar irradiance from the next 5 min to several hours ahead. The ARIMA model with time-varying coefficients (logs) obtained the best results. Ferrari [11] conducted a study on solar irradiance time series and predicted it using AR, ARMA, and ARIMA models. They compared these models with persistence models, k-nearest neighbors models, and support vector machine models. The results indicated that the ARIMA model provided the best fit. In the study by Yang [12], ARIMA models were constructed using different types of meteorological data as input variables to predict solar radiation for the next 1 h. It was found that utilizing cloud information for prediction could improve accuracy. The ARIMA model constructed by Das [13] provided reliable predictions for solar radiation and solar photovoltaic power output. It is flexible enough to incorporate more information, and its performance improves with an increasing number of data points.
In recent years, machine learning has become one of the main methods for irradiance prediction, with random forest (RF) being widely employed by researchers due to its high performance, low overfitting risk, and fast training speed [14]. In the study by Sun [15], multiple meteorological, solar radiation, and air pollution index data from various stations were utilized to construct an RF model for irradiance prediction. The results demonstrated that the RF model outperformed empirical methods in terms of fitting accuracy. Fouilloy [16] analyzed 11 statistical and machine learning methods used for solar irradiance prediction and compared their performance across three different meteorological stations. For sites with high variability, the reliability of predictions was lower, but RF demonstrated the best predictive performance. In the study by Benali [17], an RF-based radiation prediction model was found to outperform intelligent persistence and artificial neural network models. In the study by Zeng [18], the simulated results of a high-density daily solar radiation network constructed based on the RF model showed good agreement with measured values in China. Hou [19] utilized Himawari-8 AHI data to construct an RF-based prediction model for estimating the downward shortwave radiation at the surface in China, and achieved promising results with this approach. Villegas-Mier [20] proposed an RF-based solar radiation prediction model. The results showed an accuracy improvement of 95.98% compared to traditional linear regression methods, and it exhibited strong robustness.
With the rapid development of deep learning, researchers have extended its application to the field of solar radiation prediction; the long short-term memory (LSTM) model in particular is widely used because it is well suited to time series forecasting. In the study by Srivastava and Lessmann [21], an LSTM-based irradiance prediction model was constructed, validating its robustness and demonstrating that the optimally configured LSTM model outperformed other methods. Qing and Niu [22] utilized two years of radiation data collected in Cape Verde to train and predict using an LSTM model. Their results showed an 18.34% lower RMSE compared to multilayered feedforward neural networks. Wen [23] developed a deep recurrent neural network with long short-term memory (DRNN-LSTM) for solar power generation and load forecasting. Its performance surpassed that of multilayer perceptron (MLP) and support vector machine (SVM). Lan Huynh [24] developed an LSTM-based model for radiation prediction in Vietnam, forecasting radiation for 1, 5, 10, 15, and 30 min into the future. Their results indicated superiority over other models. In the study by Huang [25], an LSTM-based irradiance prediction model was developed, analyzing the influence of different lag time parameters, primary inputs, and auxiliary inputs on the model’s predictive performance. The results showed that the accuracy was superior to that of the BPNN model. Sorkun [26] proposed an LSTM-based solar radiation prediction model and investigated the impact of various meteorological variables. The research results demonstrated that the multivariate model outperformed the previous univariate models. Liu [27] conducted solar radiation prediction and evaluation using seven years of radiation data from the U.S. Department of Energy’s Atmospheric Radiation Measurement (ARM) center. Their results demonstrated that LSTM had the best overall performance, outperforming XGBoost and ARIMA models. Gao [28] developed a deep generative model based on LSTM for multi-step solar irradiance prediction. The results showed that the model effectively avoided the issue of error accumulation. Compared to the traditional regression LSTM model, it achieved an accuracy improvement of 7.7%. Bou-Rabee [29] proposed a solar radiation prediction model based on an attention mechanism and bidirectional long short-term memory (BiLSTM). The model was designed separately for sunny and cloudy weather conditions. The results showed that its performance was superior to other deep learning networks. Alizamir [30] constructed multiple solar radiation prediction models, and the results indicated that combining the LSTM model with the wavelet transform technique can enhance the accuracy of radiation prediction based on climatic parameters.
The Qinghai–Tibet Plateau region has abundant solar energy resources. In the context of China’s dual-carbon strategy goals, it is of great significance to establish a solar shortwave radiation prediction model suitable for this region. Previous studies have shown that statistical models, machine learning, and deep learning are effective approaches for building solar radiation prediction models. However, because radiation characteristics differ between regions, model parameters need to be tested and optimized locally. Therefore, in this study, ground-based observations of solar shortwave radiation flux were used with three representative methods, ARIMA, RF, and LSTM, to construct models for predicting the average solar shortwave radiation over the next 10 min. Sensitivity testing and optimization of key parameters were conducted, and a comparative analysis was carried out to reveal the advantages and limitations of these methods in irradiance prediction, with the aim of establishing a radiation prediction model suitable for the Qinghai–Tibet Plateau region. These data-driven prediction methods rely heavily on the training dataset [15], and the sample size of the training set is a determining factor for the model’s generalization ability [31]. The sample size affects how effectively the model learns, and model accuracy can also be influenced by the numerical distribution of the training set [32], which in turn depends on seasonal variations in irradiance. It is therefore necessary to treat the seasons separately when predicting irradiance. However, there is limited research on how the sample size and numerical distribution of the training set affect prediction accuracy, highlighting the need for relevant studies. Additionally, the forecast horizon has a significant impact on model accuracy, and a quantitative study of the horizon over which each model remains accurate in different seasons can provide a reference for the construction of prediction models in this region.
The structure of this paper is as follows: In Section 2, the research area, data preprocessing, dataset configuration, and data features are introduced. In Section 3, the research methods are presented, including the principles and construction of the ARIMA, RF, and LSTM models. In Section 4 and Section 5, the experimental results are showcased and discussed.
2. Data
Constructing a solar radiation prediction model requires data-driven approaches and validation. Analyzing the differences in the training set and the impact of the prediction time range on model accuracy necessitates an examination of the data’s characteristics.
2.1. Overview of the Study Area
Yangbajing (90°33′ E, 30°05′ N) is located 90 km northwest of Lhasa, Tibet. It has an average elevation of 4300 m and features a flat terrain surrounded by mountains. The area experiences short spring and autumn seasons, with warm and humid summers and long, cold winters. It enjoys abundant sunshine throughout the year, with an annual sunshine duration of over 2800 h. A solar photovoltaic power station has been built in this area. The Yangbajing Atmospheric Observatory, operated by the Institute of Atmospheric Physics, Chinese Academy of Sciences, has conducted comprehensive atmospheric observations since 2018. The observatory covers a wide range of detection wavelengths, from ultraviolet to infrared, terahertz, and millimeter waves. It enables continuous, simultaneous, quantitative measurements of multiple atmospheric variables throughout the entire atmospheric column at high vertical resolution (10–100 m) and high temporal resolution (1 min to 1 h).
2.2. Data Sources
This study focused on the analysis of shortwave solar radiation data obtained from the four-component radiometer MR-60 at the Yangbajing Atmospheric Observatory. The spectral range of the data was 285–3000 nm, and the unit was W/m². The data were sampled at 1 min intervals. A total of 366 days of data, from 1 June 2019 to 31 May 2020, were selected for analysis. Samples with zero radiation during the nighttime were excluded [33], and only data collected between 8:00 and 19:00 during the day were retained. The data were then resampled to calculate the average radiation values over 10 min intervals. Thus, there were 66 samples per day.
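The resampling step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' processing code. The function name `ten_minute_means` and the synthetic input values are hypothetical. The half-open window 8:00 ≤ t < 19:00 is an assumption, chosen because it reproduces the stated 66 samples per day.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def ten_minute_means(samples, start_hour=8, end_hour=19, window_min=10):
    """Average 1-min (timestamp, W/m^2) samples into fixed daytime windows.

    Keeps samples with start_hour <= hour < end_hour (assumed convention),
    so 11 h of data yield 11 * 60 / 10 = 66 windows per day.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        if not (start_hour <= ts.hour < end_hour):
            continue  # drop night-time and out-of-window samples
        # Floor the timestamp to the start of its 10-min window.
        floored = ts.replace(minute=ts.minute - ts.minute % window_min,
                             second=0, microsecond=0)
        buckets[floored].append(value)
    return [(t, mean(v)) for t, v in sorted(buckets.items())]


# Synthetic one-day series at 1-min resolution (placeholder values only).
day = datetime(2019, 6, 1)
raw = [(day + timedelta(minutes=m), float(m % 10)) for m in range(24 * 60)]
averaged = ten_minute_means(raw)
print(len(averaged))  # 66 windows between 08:00 and 19:00
```

With real measurements, a window that contains missing minutes would simply average the samples that are present; whether such windows should instead be discarded is not specified in the text.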
Model accuracy can be influenced by the distribution of the dataset, and the distribution of radiation data varies with the seasons. Therefore, in this experiment, the data were divided into four datasets based on seasons: spring (March–May), summer (June–August), autumn (September–November), and winter (December–February). Each season had a similar number of samples. The training and testing datasets were split in a 6:1 ratio, and the models were trained to predict the 10 min average radiation for different seasons. This study used historical time series data of solar radiation as input variables for the models. Sensitivity experiments were conducted to determine the optimal parameters of each model, followed by training, prediction, and evaluation, in order to develop short-term radiation prediction models suitable for different seasons in the Qinghai–Tibet Plateau region.
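The seasonal grouping and the 6:1 split can be illustrated with a short sketch. The month ranges and the study period come from the text. The helper names `season_of` and `split_train_test` are hypothetical, and splitting chronologically by day (rather than randomly or by sample) is an assumption, since the text gives only the ratio.

```python
from datetime import date, timedelta

# Month ranges as defined in the text.
SEASONS = {
    "spring": (3, 4, 5),
    "summer": (6, 7, 8),
    "autumn": (9, 10, 11),
    "winter": (12, 1, 2),
}


def season_of(d):
    """Return the season name for a date."""
    for name, months in SEASONS.items():
        if d.month in months:
            return name
    raise ValueError(f"unexpected month: {d.month}")


def split_train_test(items, train_parts=6, test_parts=1):
    """Chronological split (assumed): first 6/7 for training, rest for testing."""
    cut = len(items) * train_parts // (train_parts + test_parts)
    return items[:cut], items[cut:]


# Group the 366 study days (1 June 2019 to 31 May 2020) by season.
days = [date(2019, 6, 1) + timedelta(days=i) for i in range(366)]
by_season = {name: [] for name in SEASONS}
for d in days:
    by_season[season_of(d)].append(d)

for name, season_days in by_season.items():
    train, test = split_train_test(season_days)
    print(name, len(season_days), len(train), len(test))
```

Under these assumptions each season contains 91 or 92 days, which is consistent with the statement that the seasons have a similar number of samples.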
2.3. Data Characteristics
Different datasets require testing and optimization of model parameters based on their respective data characteristics. Previous research has shown significant seasonal variations in solar irradiance, so it is important to understand the seasonal characteristics of the dataset. Since only daytime data were retained, each training dataset consisted of samples drawn from a number of separate daytime periods, which makes it necessary to analyze this time period. Additionally, understanding the diurnal variations in the data helps determine the input features of the model.
2.3.1. Seasonal Characteristics
According to Table 1, the solar radiation in the Yangbajing region exhibited significant seasonal variations. The peak occurred in summer, reaching 1713 W/m², which was much higher than the solar constant. This may be attributed to the influence of clouds [34] and terrain [35]. The standard deviation of radiation was higher in spring and summer, and lower in winter, indicating greater fluctuations in solar radiation during spring and summer, and relatively stable conditions during winter. This can be attributed to the higher rainfall and frequent weather changes in spring and summer, while winter experienced more stable weather conditions.
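Seasonal statistics of this kind could be computed along the following lines. This is a hedged sketch, not the procedure used to produce Table 1. The function name `seasonal_summary` is hypothetical, the input values are placeholders rather than observations, the use of the population standard deviation is an assumption, and the solar constant value of about 1361 W/m² is a commonly quoted figure that does not appear in the text.

```python
from statistics import mean, pstdev

SOLAR_CONSTANT = 1361.0  # W/m^2, approximate value; assumption, not from the text


def seasonal_summary(values_by_season):
    """Return mean, standard deviation, maximum, and over-constant count per season."""
    summary = {}
    for season, values in values_by_season.items():
        summary[season] = {
            "mean": mean(values),
            "std": pstdev(values),  # population std; the text does not say which
            "max": max(values),
            "n_above_solar_constant": sum(v > SOLAR_CONSTANT for v in values),
        }
    return summary


# Placeholder 10-min averages in W/m^2 (not measurements from the observatory).
example = {
    "summer": [200.0, 900.0, 1400.0, 600.0],
    "winter": [300.0, 500.0, 700.0, 500.0],
}
stats = seasonal_summary(example)
for season, row in stats.items():
    print(season, row)
```

Counting values above the solar constant gives a quick way to see how often enhancement events, such as those attributed to clouds and terrain, occur in each season.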
2.3.2. Diurnal Variation Characteristics
According to Figure 1 and Table 2, the diurnal variations of solar radiation in the Yangbajing region exhibited similar patterns in different seasons, showing a single-peak inverted “U” shape. Due to the rotation of the Earth and the variation of the solar zenith angle, the radiation showed a clear periodic variation with a peak around 11–15 o’clock. The standard deviation of radiation in all four seasons was highest around 14–15 o’clock and lowest at 8 o’clock, indicating greater fluctuation around midday and relatively stable conditions in the morning. Additionally, in spring, autumn, and winter, there were often instances in the morning (8–10 o’clock) and evening (17–18 o’clock), when radiation is generally low, where the instantaneous radiation was much higher than the average value.
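A diurnal profile like the one summarized here can be obtained by grouping samples by hour of day. The sketch below is illustrative only: `hourly_profile` is a hypothetical helper, the sample values are placeholders, and the population standard deviation is again an assumption.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def hourly_profile(samples):
    """Group (timestamp, W/m^2) samples by hour of day; return {hour: (mean, std)}."""
    by_hour = defaultdict(list)
    for ts, value in samples:
        by_hour[ts.hour].append(value)
    return {h: (mean(v), pstdev(v)) for h, v in sorted(by_hour.items())}


# Placeholder values for two days; not measurements from the observatory.
samples = [
    (datetime(2019, 6, 1, 8, 0), 100.0), (datetime(2019, 6, 2, 8, 0), 120.0),
    (datetime(2019, 6, 1, 14, 0), 1000.0), (datetime(2019, 6, 2, 14, 0), 600.0),
    (datetime(2019, 6, 1, 18, 0), 200.0), (datetime(2019, 6, 2, 18, 0), 260.0),
]
profile = hourly_profile(samples)
peak_hour = max(profile, key=lambda h: profile[h][0])
most_variable_hour = max(profile, key=lambda h: profile[h][1])
print(peak_hour, most_variable_hour)
```

Running the same function on each seasonal subset separately would give one profile per season, which is the comparison the paragraph above describes.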
Based on this analysis, it can be concluded that solar radiation varies significantly across seasons. The numerical distribution of the training datasets used by the models therefore differs greatly among the seasons, with each seasonal dataset corresponding to a distinct distribution. Historical solar radiation data were used as input for the statistical and machine learning models. Ignoring the prominent characteristics of solar radiation would result in suboptimal predictions [36]. Hence, it is necessary to classify the datasets according to the seasonal variations of solar radiation and develop separate prediction models for each season to improve accuracy. This classification also makes it possible to investigate how seasonal differences in the numerical distribution of the training sets affect model accuracy.