1. Introduction
Against a backdrop of severe energy shortages and global warming, renewable energy, characterized by its clean, low-carbon, and sustainable nature, is playing an increasingly crucial role in the formulation of national energy strategies [1]. Photovoltaic (PV) power generation has gained widespread attention globally due to its non-polluting, renewable, and low-cost characteristics, as well as its technological maturity [2]. However, solar energy cannot be stored directly and is inherently intermittent, which presents significant challenges to the power grid when PV generation is deployed at scale [3]. Therefore, accurately predicting PV power generation is imperative for optimal grid dispatch, enhanced management, and improved energy consumption efficiency. It is also a critical factor in achieving complementary power relationships within the grid.
With continuous advancements in artificial intelligence technology, deep learning has garnered widespread attention due to its excellent performance in image processing and speech recognition [4]. Consequently, some researchers have introduced deep learning into the field of PV power prediction. Compared to traditional machine learning models, deep learning models offer more accurate prediction results owing to their superior feature extraction and data mining capabilities [5]. Among these, models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) show promising results in PV prediction. For instance, Ref. [6] proposed a hybrid approach based on a deep CNN for short-term PV power forecasting using solar radiation, temperature, and historical electricity data. Ref. [7] developed a meteorological-information-based long short-term memory (LSTM) model to predict the daily power generation of a large-scale PV power plant by classifying weather conditions. Ref. [8] used global solar radiation, temperature, velocity, relative humidity, and power output as inputs, employing an LSTM model optimized by particle swarm optimization (PSO) to predict PV power across an entire region, and demonstrated the method’s accuracy experimentally. However, the performance of a single model often falls short when dealing with large data samples, leading many researchers to propose hybrid models to improve prediction accuracy. Ref. [9] proposed a hybrid Wavelet-PSO-SVM prediction model supported by supervisory control and data acquisition (SCADA) systems and meteorological information, which was shown to effectively enhance prediction accuracy. Ref. [10] improved the prediction accuracy of distributed PV power by fusing multiple models based on a stacking integration strategy. Considering the prediction effectiveness of previous studies, this study develops a CNN-LSTM model to fuse the feature extraction function of CNNs and the time-series analysis capability of LSTM.
While the studies mentioned above proposed effective methods for predicting PV power, they often overlook the potential information gain from similar-day samples with respect to the input features. To address this gap, Ref. [11] classified weather conditions into three categories using fuzzy C-means clustering and selected samples with higher similarity to the target day for training, thereby improving prediction accuracy, while Ref. [12] employed the K-medoids clustering algorithm to categorize weather into three groups and validated the effectiveness of the prediction model under various weather conditions. By pre-dividing the weather types in the dataset and selecting samples that closely resemble the target day for training, prediction accuracy can indeed be enhanced. However, traditional clustering algorithms typically do not account for the shape information of time-series data, which makes them poorly suited to clustering PV output curves. Additionally, conventional similar-day analysis methods, such as the Pearson correlation and Euclidean distance, struggle to provide accurate similarity measures when the series being compared have different time resolutions.
To address the shortcomings of the aforementioned studies and build upon their foundations, this study proposes a CRSSA-CNN-LSTM prediction method that incorporates clustering to classify weather types. This method effectively solves the problems of poor adaptability of traditional clustering methods to time series and difficulty in analyzing similar days caused by different time resolutions, improving prediction accuracy. This paper is organized as follows: First, two data preprocessing techniques are introduced to enhance the quality of the input data. Next, to construct similar day sample sets for the target prediction days, the K-shape clustering algorithm is utilized, along with the implementation of Dynamic Time Warping (DTW) to measure similarity effectively. To achieve more accurate predictions, this study proposes a hybrid ICNN-LSTM model, which combines the strengths of convolutional neural networks and long short-term memory networks. The hyperparameters of this model are optimized using the improved sparrow search algorithm (ISSA). Finally, the hybrid ICNN-LSTM model is applied to actual PV power plant data from a region in Nanjing, China, to conduct simulations that verify the effectiveness of the proposed method.
2. Similar Day Selection Based on K-Shape and DTW
Selecting samples that closely resemble the forecast target meteorological conditions can enhance the relevance of the training data and improve the effectiveness of the training model, thereby increasing prediction accuracy. To achieve this, this study employs K-shape clustering to classify PV power into different modes. Additionally, the similarity of time-series data at varying time resolutions is assessed using DTW.
2.1. Dataset Description
In this study, we evaluate the proposed aggregated prediction method using data from a PV plant with an installed capacity of 20 kW located in a region of China. The dataset spans from January to November 2017 and includes features, such as output power, solar radiation intensity, ambient temperature, relative humidity, barometric pressure, and wind speed, recorded at 15 min intervals. The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
2.2. Data Preprocessing
The feature variables influencing the prediction of photovoltaic power generation often exhibit different scales and orders of magnitude, necessitating feature scaling. The multidimensional features that have been scaled are unified to a dimensionless state and have similar scales, which significantly promotes the convergence efficiency of the gradient descent algorithm. Normalization is a common choice among various processing methods. However, due to its strong dependence on the maximum and minimum values within the dataset, normalization needs to be redefined every time a new extremum is encountered. Therefore, this study uses standardization techniques as the core means of data preprocessing.
The method of standardizing the time-series data is shown in (1), which transforms each feature to have a mean of 0 and a standard deviation of 1:

$$x^{*} = \frac{x - \mu}{\sigma} \tag{1}$$

where $x^{*}$ is the standardized data, $x$ is a sample of the original data for a particular feature, $\mu$ is the sample mean, and $\sigma$ is the sample standard deviation.
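As an illustration of the standardization step, the following minimal sketch standardizes a single feature using only the Python standard library. The function names and the irradiance values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Estimate the mean and standard deviation of one feature."""
    mu = fmean(train_values)
    # Assumption: population standard deviation; the paper does not specify.
    sigma = pstdev(train_values)
    return mu, sigma


def standardize(values, mu, sigma):
    """Map values to zero mean and unit standard deviation."""
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical irradiance readings (W/m^2), for illustration only.
irradiance = [0.0, 120.0, 480.0, 760.0, 510.0, 90.0]
mu, sigma = fit_standardizer(irradiance)
z = standardize(irradiance, mu, sigma)
```

In a forecasting setting, the mean and standard deviation would normally be estimated on the training data and then reused unchanged for the test data, so that no information from the test period leaks into preprocessing.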
During the process of information acquisition, communication failures or human errors can result in a certain amount of missing values. The handling of missing data can generally be categorized into two approaches: deletion and imputation. Directly deleting samples with missing values can lead to the loss of important information, significantly impacting the feature extraction process and the overall quality of the time-series data. To address this issue, this study employs the Random Forest Linear Filling method for managing missing data [13] to enhance prediction accuracy.
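The details of the filling method from Ref. [13] are not given here, so the sketch below shows only plain linear interpolation between the nearest known neighbours, as a simplified stand-in for the imputation idea rather than the method used in this study. The function name `fill_linear` and the use of `None` to mark missing readings are assumptions made for illustration.

```python
def fill_linear(series):
    """Fill None entries by linear interpolation between known neighbours.

    Gaps at the start or end are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series contains no known values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


# Hypothetical 15 min power readings with two gaps.
power = [1.0, None, None, 4.0, None]
filled_power = fill_linear(power)
```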
2.3. K-Shape Clustering Algorithm
Traditional clustering algorithms often struggle to effectively measure the similarity between time series. The K-shape clustering algorithm, however, has demonstrated superior performance in clustering time-series data, making it particularly suitable for applications that involve such challenges [14]. In this study, we utilize the K-shape clustering process to classify the daily variation curves of distributed PV power generation into K distinct patterns.
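The K-shape algorithm uses normalized cross-correlation to calculate the shape-based distance (SBD). Cross-correlation is a statistical measure of the similarity of two time series, $X = (x_1, \ldots, x_m)$ and $Y = (y_1, \ldots, y_m)$. It slides the time window of $Y$ to find the shift that best aligns it globally with $X$. This alignment enables a comprehensive comparison of their global shape features, as illustrated in (2):

$$Y_{(s)} = \begin{cases} (\underbrace{0, \ldots, 0}_{|s|}, y_1, y_2, \ldots, y_{m-s}), & s \ge 0 \\ (y_{1-s}, \ldots, y_{m-1}, y_m, \underbrace{0, \ldots, 0}_{|s|}), & s < 0 \end{cases} \tag{2}$$

where $Y_{(s)}$ is the shifted sequence; $s$ is the shift amount and $s \in [-m, m]$. If $s > 0$, the time window of $Y$ is shifted to the right by $|s|$ units; if $s < 0$, the time window of $Y$ is shifted to the left by $|s|$ units.

Considering all the shifts of $Y_{(s)}$, a cross-correlation sequence of length $2m - 1$ can be obtained by $CC_w(X, Y)$, and $CC_w(X, Y)$ is defined as follows:

$$CC_w(X, Y) = R_{w-m}(X, Y), \quad w \in \{1, 2, \ldots, 2m - 1\} \tag{3}$$

where $k = w - m$. When $k \ge 0$, $R_k(X, Y) = \sum_{l=1}^{m-k} x_{l+k}\, y_l$; when $k < 0$, $R_k(X, Y) = R_{-k}(Y, X)$.

Then, the position $w$ at which $CC_w(X, Y)$ reaches its maximum value is identified. The optimal shift of $Y$ with respect to $X$ can then be derived by $s = w - m$. To eliminate the effect of sequence distortion, $CC_w(X, Y)$ should be normalized by (4) as follows:

$$NCC_c(X, Y) = \frac{CC_w(X, Y)}{\sqrt{R_0(X, X) \cdot R_0(Y, Y)}} \tag{4}$$

where $\sqrt{R_0(X, X) \cdot R_0(Y, Y)}$ is the normalization factor.

Therefore, the SBD between $X$ and $Y$ can be calculated by (5):

$$SBD(X, Y) = 1 - \max_{w} NCC_c(X, Y) \tag{5}$$

where $SBD(X, Y)$ ranges from 0 to 2. $SBD(X, Y) = 0$ indicates that $X$ and $Y$ are perfectly similar.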
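The shape-based distance can be sketched in a few lines of Python. The version below is a direct, quadratic-time transcription of the cross-correlation definition and is meant only to make the computation concrete; practical K-shape implementations typically compute the cross-correlation with the FFT and apply it to z-normalized series. The function names and the toy series are hypothetical.

```python
import math


def cross_correlation(x, y, k):
    """R_k(x, y): sum of x[l + k] * y[l] for k >= 0, else R_{-k}(y, x)."""
    if k < 0:
        return cross_correlation(y, x, -k)
    return sum(x[l + k] * y[l] for l in range(len(x) - k))


def sbd(x, y):
    """Return (shape-based distance, optimal shift of y relative to x)."""
    m = len(x)
    if len(y) != m:
        raise ValueError("both series must have the same length")
    norm = math.sqrt(cross_correlation(x, x, 0) * cross_correlation(y, y, 0))
    if norm == 0:
        raise ValueError("SBD is undefined for an all-zero series")
    # Evaluate the normalized cross-correlation at every shift k = w - m.
    best_ncc, best_shift = max(
        (cross_correlation(x, y, k) / norm, k) for k in range(-(m - 1), m)
    )
    return 1.0 - best_ncc, best_shift


# Toy curves: y is x delayed by one sample, so the shapes are identical.
x = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
distance, shift = sbd(x, y)
```

Here the returned shift is negative because `y` has to be moved one step to the left to line up with `x`, and the distance is zero because the two curves differ only by that shift.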
2.4. Dynamic Time Warping (DTW)
In the prediction scenario, the time resolution of the day to be predicted often does not match the sampling resolution of the PV output power. In addition, the significant randomness and volatility of PV output make it difficult to establish a direct point-to-point correspondence between the two time series. To address this challenge, we introduce Dynamic Time Warping (DTW) to evaluate the similarity between the two. The advantage of DTW lies in its ability to nonlinearly align two curves, identifying the optimal correspondence between them. This method effectively matches points where the curves exhibit similar shapes, resulting in a more accurate morphological measurement [14].
As illustrated in Figure 1, DTW can identify and map similar points between two curves. These mappings allow each point in one curve to correspond to the most similar point in the other curve. The total distance between all corresponding points is then calculated and serves as a criterion for assessing the similarity between the two curves. The lengths of curves $X$ and $Y$ are defined as $m$ and $n$. We then construct the distance matrix $D$ as

$$D = \begin{bmatrix} d(x_1, y_1) & \cdots & d(x_1, y_n) \\ \vdots & \ddots & \vdots \\ d(x_m, y_1) & \cdots & d(x_m, y_n) \end{bmatrix}$$

where $d(x_i, y_j)$ is the distance between point $x_i$ of $X$ and point $y_j$ of $Y$.

The warping path between two curves is defined as $W = \{w_1, w_2, \ldots, w_l\}$. The subscript $s$ denotes the $s$-th point on the warping path, $w_s = (i_s, j_s)$, corresponding to the element $d(x_{i_s}, y_{j_s})$ of $D$, and $l$ denotes the number of elements in the path. A warping path represents the mapping relationship between two curves, visually illustrated by the lines connecting corresponding points on the curves in Figure 1.

The objective of DTW is to identify an optimal warping path that minimizes the cumulative distance between two time series, while adhering to the constraints of boundary conditions ($w_1 = (1, 1)$ and $w_l = (m, n)$), monotonicity, and continuity:

$$DTW(X, Y) = \min_{W} \sum_{s=1}^{l} d(x_{i_s}, y_{j_s})$$
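The minimization above is usually solved by dynamic programming over an accumulated-cost table. The sketch below is a minimal version that assumes the absolute difference as the pointwise distance, since the text does not name the metric; the function name and the example series are hypothetical.

```python
def dtw_distance(x, y):
    """Cumulative DTW distance between two series of possibly different length.

    Assumption: the pointwise distance is the absolute difference.
    """
    m, n = len(x), len(y)
    inf = float("inf")
    # acc[i][j] is the minimal cumulative cost of aligning x[:i] with y[:j].
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Allowed steps enforce monotonicity and continuity of the path.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[m][n]


# Series with the same shape but different sampling resolution.
coarse = [1.0, 2.0, 3.0]
fine = [1.0, 2.0, 2.0, 3.0]
same_shape = dtw_distance(coarse, fine)
offset = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

Because the table starts at `acc[0][0]` and the result is read from `acc[m][n]`, the boundary condition is satisfied by construction, and the two input series do not need to have the same number of points.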
2.5. Similar Day Selection Process
Initially, the output from distributed PV systems is categorized into $K$ distinct patterns using the K-shape clustering algorithm. The feature vectors for the standardized data from the day to be predicted and the historical data are organized as follows:

$$X_p = [x_{p,1}, x_{p,2}, \ldots, x_{p,m}], \qquad X_k = [x_{k,1}, x_{k,2}, \ldots, x_{k,n}], \quad k = 1, 2, \ldots, K$$

where $X_p$ is the feature vector of the day to be predicted; $X_k$ is the feature vector of the centroid of the $k$-th PV output pattern; $m$ is the number of forecast points of the day to be predicted; and $n$ is the number of sampling points of the historical days. In the absence of a priori knowledge, $K$ is selected using the silhouette coefficient, so that the chosen number of clusters gives the highest similarity within clusters and the lowest similarity between clusters.

In forecasting, DTW is utilized to assess the similarity between the historical sample clusters and the target day that is to be predicted:

$$d_k = DTW(X_p, X_k), \quad k = 1, 2, \ldots, K$$

The cluster of samples with the highest similarity, that is, the smallest $d_k$, is selected to train the prediction model, thereby improving the prediction accuracy.
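The selection step can be sketched as a nearest-centroid search under DTW. The code below is a minimal illustration that assumes the cluster centroids are already available, for example from a K-shape run, and that the absolute difference is used as the pointwise distance. The names `select_similar_cluster`, `target_day`, and `centroids`, as well as the numbers, are hypothetical.

```python
def dtw(x, y):
    """DTW distance using two rolling rows and absolute-difference cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for xi in x:
        curr = [inf]
        for j, yj in enumerate(y, start=1):
            curr.append(abs(xi - yj) + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def select_similar_cluster(target, centroids):
    """Return the index of the centroid with the smallest DTW distance."""
    distances = [dtw(target, centroid) for centroid in centroids]
    return min(range(len(centroids)), key=distances.__getitem__)


# Target day described by m = 4 forecast points.
target_day = [0.0, 1.0, 1.0, 0.0]
# Cluster centroids described by n = 8 sampling points.
centroids = [
    [0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5, 0.0],  # smooth, sunny-like curve
    [0.0, 0.8, 0.2, 0.9, 0.1, 0.7, 0.2, 0.0],  # fluctuating curve
]
best_cluster = select_similar_cluster(target_day, centroids)
```

The historical days assigned to `best_cluster` would then form the training set for the prediction model. Note that the target vector and the centroids have different lengths, which is the situation DTW is introduced to handle.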
5. Experiment Analysis
5.1. Data
In this study, we evaluate the proposed aggregated prediction method using data from a PV plant with an installed capacity of 20 kW located in a region of China. The dataset spans from January to November 2017 and includes features, such as output power, solar radiation intensity, ambient temperature, relative humidity, barometric pressure, and wind speed, recorded at 15 min intervals. To validate the effectiveness of the proposed method, we set up three sets of experiments to compare prediction performance: (1) we compare the prediction model proposed in this study with various baseline models to assess its predictive accuracy and performance; (2) we conduct experiments to evaluate the prediction accuracy before and after incorporating the division of weather types, measuring the impact of this consideration on model performance; (3) we compare the optimization algorithm proposed in this study with other existing optimization algorithms to evaluate its efficiency and effectiveness in enhancing the model’s performance.
5.2. Indicators
The predictive performance of the proposed model is assessed by the mean absolute error (MAE), the mean absolute percentage error (MAPE), and the coefficient of determination (R-square, $R^2$). Smaller values of MAE and MAPE indicate more accurate predictions, while a coefficient of determination closer to 1 reflects a better fit between the predicted and actual values. The MAE and MAPE can be calculated using the following formulae:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

$$MAPE = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $N$ is the number of prediction points.
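For reference, the three indicators can be computed as in the following minimal sketch. It assumes that points with zero actual output (for example at night) are excluded from the MAPE to avoid division by zero; the text does not say how such points are treated, and the sample values are made up.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumption: points whose actual value is zero are skipped.
    """
    terms = [abs((a - p) / a) for a, p in zip(actual, predicted) if a != 0]
    if not terms:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(terms) / len(terms)


def r_squared(actual, predicted):
    """Coefficient of determination."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted power values (kW).
actual_kw = [2.0, 4.0, 6.0, 8.0]
predicted_kw = [2.5, 3.5, 6.0, 9.0]
scores = (
    mae(actual_kw, predicted_kw),
    mape(actual_kw, predicted_kw),
    r_squared(actual_kw, predicted_kw),
)
```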
5.3. Feature Selection
PV output depends strongly on weather factors, and eliminating irrelevant or redundant features can enhance prediction accuracy and mitigate the risk of overfitting. Therefore, this section employs the Pearson coefficient, the Spearman coefficient, and the maximal information coefficient (MIC) to analyze both linear and nonlinear correlations between the influencing factors and PV output power. The results of the correlation coefficient calculations are presented in Table 1. The analysis reveals a strong correlation between solar radiation intensity and PV power, with Pearson, Spearman, and MIC values of 0.98, 0.92, and 0.9, respectively. Additionally, the Pearson coefficients for relative humidity and ambient temperature both exceed 0.35, indicating that these factors are also correlated with PV power output. Based on these findings, solar radiation intensity, relative humidity, and ambient temperature are selected as the input features from the original dataset, as they are the most influential in predicting PV power output.
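The linear and rank correlations used in this screening can be computed as in the sketch below, which covers the Pearson and Spearman coefficients only; MIC requires a dedicated estimator and is omitted. The function names and the toy data are hypothetical and are not taken from the dataset.

```python
import math


def pearson(a, b):
    """Pearson linear correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


# Toy data: power grows monotonically but nonlinearly with irradiance.
irradiance = [1.0, 2.0, 3.0, 4.0, 5.0]
power = [1.0, 4.0, 9.0, 16.0, 25.0]
linear_corr = pearson(irradiance, power)
rank_corr = spearman(irradiance, power)
```

On this toy data the rank correlation is 1 because the relationship is monotonic, while the Pearson coefficient is slightly lower because the relationship is not linear. This is the kind of difference that motivates reporting more than one coefficient.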
5.4. Cluster Number Analysis
In this section, PV power data and selected meteorological factors are utilized as the clustering indices. The silhouette coefficient is employed as the evaluation metric for clustering effectiveness, and the results are illustrated in Figure 5. The analysis indicates that, starting from three clusters, the silhouette coefficient gradually decreases as the number of clusters increases. Notably, the clustering effect for meteorological factors is relatively weak compared to the clustering of PV output power. The silhouette coefficient obtained when using the power envelope parameter as the clustering index is significantly higher than that of the meteorological factors. Specifically, when the number of clusters is set to three, the silhouette coefficient reaches 0.7466, which is considerably better than the results obtained with other clustering indices. Based on these findings, three clusters are selected as the optimal configuration for this study.
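The evaluation metric used here (the silhouette coefficient, rendered as "contour coefficient" in some translations) can be computed as in the following minimal sketch. It takes the distance function as a parameter, so that a shape-based or DTW distance could be plugged in; the one-dimensional toy samples and the absolute-difference distance are assumptions chosen only to keep the example short.

```python
def silhouette_coefficient(samples, labels, distance):
    """Mean silhouette value over all samples for a given labelling."""
    cluster_ids = sorted(set(labels))
    if len(cluster_ids) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for i, sample in enumerate(samples):
        # Mean distance from this sample to the members of each cluster.
        mean_dist = {}
        for cid in cluster_ids:
            dists = [
                distance(sample, other)
                for j, other in enumerate(samples)
                if labels[j] == cid and j != i
            ]
            mean_dist[cid] = sum(dists) / len(dists) if dists else None
        a = mean_dist[labels[i]]
        if a is None:
            # Singleton cluster: silhouette is defined as 0.
            scores.append(0.0)
            continue
        b = min(mean_dist[cid] for cid in cluster_ids if cid != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def abs_diff(u, v):
    return abs(u - v)


# Toy one-dimensional samples forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_coefficient(points, [0, 0, 1, 1], abs_diff)
bad = silhouette_coefficient(points, [0, 1, 0, 1], abs_diff)
```

To choose the number of clusters, the same score would be computed for each candidate value of K and the value with the highest score retained.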
5.5. Analysis of Forecast Results
This section outlines a methodology based on the joint prediction model of CRSSA-CNN-LSTM. To validate the effectiveness of the proposed method under different temporal resolutions, the samples from each cluster are divided into two subsets: a training set comprising 80% of the data and a test set comprising the remaining 20%. The training set samples are further categorized into three distinct patterns, as illustrated in Figure 6. The analysis reveals that Weather Pattern I exhibits a gentle distribution, resembling typical sunny day characteristics. In contrast, Weather Patterns II and III reflect a more fluctuating distribution, akin to the conditions observed on cloudy and rainy days. Notably, because K-shape clustering organizes the PV output curves based on their shape information, some curves with lower outputs but a flatter trend than those in Patterns II and III are also classified into Weather Pattern I. The K-shape clustering algorithm proves to be a straightforward and efficient method for organizing the PV output curves, facilitating a better understanding of how different weather conditions affect PV performance.
To verify the superiority of the K-shape clustering algorithm used in this article, the average silhouette coefficient obtained from K-shape clustering was compared with those of other clustering algorithms, including K-means and K-medoids. The K-shape clustering algorithm has the highest silhouette coefficient and, therefore, the best clustering performance.
To verify the superiority of the DTW algorithm used in this article, the selection of DTW similarity days was compared with other distance measurement methods, including Euclidean distance, Manhattan distance, and cosine distance. The experimental results show that the DTW algorithm used in this article has the lowest prediction error and, therefore, has the best ability to analyze similar days.
To demonstrate the effectiveness of the prediction method following the clustering of weather patterns proposed in this study, we compare the predicted results from various models after categorizing the weather days, as shown in
Table 2 and
Table 3. The comparison involves several methodologies, including long short-term memory (LSTM), Backpropagation Neural Network (BPNN), and Wavelet Neural Network (WNN). The hyperparameters of each model are optimized by the CRSSA algorithm proposed in this study. The process begins with calculating the similarity measure between the day to be predicted and the defined clusters. This is followed by training on sample clusters that exhibit higher similarity, effectively screening the data from the perspective of the inputs. This approach contributes to enhancing the prediction accuracy for each model. The results indicate that the proposed method can, at most, achieve a reduction of 1.2% in the MAPE and a decrease of 1.47 in the RMSE when compared to the LSTM model. This suggests that the proposed model significantly improves the accuracy of PV power predictions. Furthermore, the prediction accuracy of the LSTM model surpasses that of the traditional BPNN and WNN models, highlighting the efficacy of LSTM’s temporal memory function for addressing prediction challenges. Additionally, the prediction performance of the model after applying CNN feature extraction is superior to that of the original LSTM model, confirming that the CNN effectively extracts relevant features, contributing to improved predictive accuracy.
To illustrate the superiority of the prediction model proposed in this study, we present the results of typical daily predictions across various weather patterns from the test set, as depicted in
Figure 6,
Figure 7 and
Figure 8. The analysis reveals that the CNN-LSTM model developed in this study achieves the highest prediction accuracy among the models evaluated. Notably, the fluctuations in PV power under Weather Mode 1 are minimal, and the prediction results from all models closely align with the actual values. This observation indicates that the models effectively learn the weather fluctuation characteristics associated with sunny conditions. Furthermore, the prediction accuracy under Weather Mode 1 is significantly greater than that under Weather Modes 2 and 3, highlighting the impact that varying weather conditions have on prediction accuracy. If historical data corresponding to Weather Modes 2 or 3 were used as the training set for Weather Mode 1, this could adversely affect the prediction accuracy. This finding corresponds with the results detailed in
Table 1 and
Table 2, further validating the reasonableness and effectiveness of the proposed method.
To demonstrate the effectiveness of the hyperparameter optimization scheme proposed in this study, the Nanjing area in China is used as a case study. In this analysis, the manually adjusted hyperparameters serve as the baseline (denoted as “Base”). Various optimization algorithms, including ISSA, SSA, PSO, and Ant Colony Optimization (ACO) algorithm, are employed to optimize the hyperparameters for the CNN-LSTM prediction model. The resulting error indicators from these optimizations are presented in
Table 4 and
Table 5, and the convergence curves are depicted in
Figure 9,
Figure 10 and
Figure 11. It is important to note that the objective function used in the iterative process is the average value derived from K-fold cross-validation of the training set. The results indicate that employing intelligent optimization algorithms for hyperparameter tuning significantly enhances the prediction performance of the model. Among the different optimization methods evaluated, the proposed CRSSA exhibits the best convergence performance. It achieves the lowest values for the error metrics MAE and MAPE in predicting PV power.
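The cross-validated objective that the optimizers minimize can be sketched as follows. This is a minimal illustration under several assumptions: folds are contiguous blocks without shuffling, the `fit` and `error` callables stand in for training and scoring the CNN-LSTM model, and the toy "model" is just a scaled mean. None of these names or choices come from the study itself.

```python
from statistics import fmean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_objective(params, samples, k, fit, error):
    """Average validation error of one hyperparameter candidate over k folds."""
    total = 0.0
    for val_idx in k_fold_indices(len(samples), k):
        held_out = set(val_idx)
        train = [s for i, s in enumerate(samples) if i not in held_out]
        val = [samples[i] for i in val_idx]
        model = fit(train, params)
        total += error(model, val)
    return total / k


# Hypothetical stand-ins for model training and scoring.
def fit_scaled_mean(train, params):
    return params["scale"] * fmean(train)


def mean_abs_error(model, val):
    return fmean([abs(v - model) for v in val])


samples = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0]
score_good = cv_objective({"scale": 1.0}, samples, 3, fit_scaled_mean, mean_abs_error)
score_bad = cv_objective({"scale": 0.5}, samples, 3, fit_scaled_mean, mean_abs_error)
```

A search algorithm such as SSA or PSO would call `cv_objective` once per candidate and keep the candidate with the lowest value, which is why the convergence curves track this averaged error rather than the test error.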