Article

TSMixer- and Transfer Learning-Based Highly Reliable Prediction with Short-Term Time Series Data in Small-Scale Solar Power Generation Systems

1 Department of Smart Factory Convergence, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon-si 16419, Republic of Korea
2 AI Research Center, Gfyhealth, 20 Pangyo-ro, Bundang-gu, Seongnam-si 13488, Republic of Korea
* Author to whom correspondence should be addressed.
Energies 2025, 18(4), 765; https://doi.org/10.3390/en18040765
Submission received: 2 January 2025 / Revised: 24 January 2025 / Accepted: 3 February 2025 / Published: 7 February 2025

Abstract: With the surge in energy demand worldwide, renewable energy is becoming increasingly important. Solar power, in particular, has positioned itself as a sustainable and environmentally friendly alternative and is playing a growing role not only in large-scale power plants but also in small-scale home power generation systems. However, small-scale generation systems face challenges in developing efficient prediction models because of scarce data and weather-driven variability in power output. In this study, we propose a novel forecasting framework that combines transfer learning and dynamic time warping (DTW) to address these issues. We present a transfer learning-based prediction system designed to maintain high prediction performance even in data-poor environments. To develop a prediction model suited to the target domain from multi-source data, we propose a data similarity evaluation method using DTW; the resulting model achieves lower MSE and MAE than conventional long short-term memory (LSTM) and Transformer models. This research not only contributes to maximizing the energy efficiency of small-scale PV power generation systems and improving energy independence but also provides a methodology that maintains high reliability in data-poor environments.

1. Introduction

With soaring energy demand worldwide, renewable energy is becoming increasingly important. In particular, solar power has established itself as a sustainable and environmentally friendly alternative, playing a growing role not only in large-scale power plants but also in small-scale residential power generation systems [1]. Small-scale power generation systems are recognized as an important means of realizing energy independence because they enable households and small businesses to generate and consume power independently [2], offering self-sufficiency through a decentralized energy system independent of the traditional centralized power supply. However, their power output fluctuates significantly with weather conditions (e.g., sunlight, temperature, and humidity), so accurate forecasting that accounts for this variability is essential [3]. Existing power production forecasting models have primarily been developed for large-scale power plants, which can draw on vast amounts of data to maintain high forecast accuracy. Such large-scale prediction models are difficult to apply to small-scale power generation systems, whose data collection is limited [4]. New approaches are therefore required to solve the data scarcity problem, and transfer learning has recently been recognized as an effective alternative [5]. Transfer learning maintains high prediction performance in data-poor environments by applying models trained on large amounts of data, which makes it possible to leverage the data accumulated from large-scale power generation systems for small-scale ones [6].
This study aims to solve the problem of data scarcity in small-scale power generation systems and enable efficient power production forecasting with less data. To this end, we use transfer learning and various artificial intelligence (AI) techniques to develop sophisticated prediction models in a small-data environment. Furthermore, these predictive models contribute to maximizing the energy efficiency of small-scale power generation systems and improving their energy independence.
The key contributions of this study are as follows:
  • We designed a framework that utilizes transfer learning and dynamic time warping (DTW) to compensate for temporal nonlinearities in time-series data and maintain high prediction performance even in data-sparse environments.
  • After training a generalized model based on multi-source domain data, a target-domain-specific model was constructed by applying linear probing techniques.
  • We quantitatively evaluated the prediction accuracy of the proposed model by comparing its performance with representative time-series models such as long short-term memory (LSTM) and Transformer, and found that the proposed model achieved the lowest error rate in the mean squared error (MSE) and mean absolute error (MAE) metrics.
The remainder of this paper is organized as follows: Section 2 reviews DTW, which measures the similarity of time-series data, together with existing power prediction and transfer learning techniques, to explain the distinctiveness and necessity of this work. Section 3 presents the design of an MLP-based transfer learning system, including a pre-training process that exploits the similarity between source and target data and a linear probing process specific to the target domain. Section 4 uses UK Power Networks data to compare the predictive performance of the proposed model with existing models such as LSTM and Transformer and analyzes the effectiveness of multi-source learning. Section 5 establishes the significance of this study and describes its limitations. Finally, Section 6 summarizes the performance advantages of the proposed model and explores future extensions to various domains as well as the integration of real-time data processing and state-of-the-art technologies. In short, this study proposes an efficient forecasting system for data-poor environments to address the power forecasting problem of small-scale PV systems: existing forecasting models are optimized for systems with large datasets, whereas small-scale generation systems have limited data collection, a problem we address by introducing a transfer learning technique that maintains prediction performance in a data-poor environment.

2. Related Research

2.1. Solar Power Forecasting Research

Solar power forecasting is an important research area in renewable energy management and electricity supply and demand planning [7]. Forecasting models are typically used to estimate the hourly power generation accurately and consider variables such as weather data and solar panel characteristics [8]. Solar power plays a key role in renewable energy generation globally and is an important contributor to reducing carbon emissions and building sustainable energy systems. As solar power technology improves, the price of solar power modules decreases [9]. However, solar power generation is affected by various factors, such as weather conditions, resulting in volatile power generation. This variability presents challenges for grid operations and energy supply and demand planning, and the importance of solar power generation forecasting technology is increasing [10].
Accurate solar power generation forecasting can reduce surplus power losses, ensure reliable power supply, and support the efficient operation of battery storage systems [11]. Solar power generation can fluctuate significantly with the weather, which affects the load management and supply reliability of the grid. Accurate forecasting is essential for balancing supply and demand and ensuring grid stability. When the share of renewables is greater, the uncertainty of energy supply and demand increases, and solar power generation forecasting plays a key role in energy transition planning [12]. Increased forecast accuracy provides economic benefits by reducing surplus power, power storage costs, and fossil fuel use. It also contributes to a reduction in carbon emissions through efficient energy utilization and accelerates the transition to a sustainable energy system [13].
Solar power forecasting has evolved from early statistical models to more recent machine learning- and deep learning-based techniques. Among the forecasting techniques for solar insolation, time-series analysis forecasts solar power generation based on historical data over time. The auto-regressive integrated moving average (ARIMA) model is a time-series forecasting method that can be used to predict solar power generation. ARIMA estimates current and future values from historical data and has the advantage of being applicable to data without strong seasonality or trend [14]. An ARIMA model for solar power forecasting can capture cyclical trends in the data effectively; however, it struggles to identify the factors that drive the peak output of solar power. The ARIMAX model can predict output power by also considering external factors such as temperature, humidity, insolation, and wind speed. Considering external factors can improve accuracy, but it is important to note that irrelevant external data can also reduce it [15].
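To make the distinction concrete, the following minimal sketch fits an ARIMA model and an ARIMAX-style model with an exogenous regressor using the statsmodels SARIMAX class; the synthetic series, the order (2, 1, 2), and the variable names are illustrative assumptions, not settings from the studies cited above.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
t = np.arange(480)
irradiance = np.clip(np.sin(2 * np.pi * t / 48), 0, None)  # toy irradiance signal
y = 3.0 * irradiance + 0.1 * rng.standard_normal(480)      # toy output power

# Plain ARIMA(2, 1, 2): the forecast depends only on past values and errors.
arima = SARIMAX(y, order=(2, 1, 2)).fit(disp=False)

# ARIMAX: the same orders plus an exogenous regressor. Forecasting then
# requires future exogenous values, taken here from the known toy signal.
arimax = SARIMAX(y[:432], exog=irradiance[:432], order=(2, 1, 2)).fit(disp=False)
forecast = arimax.forecast(steps=48, exog=irradiance[432:])
```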
AI-based solar power forecasting has significantly improved data-driven analysis and forecasting accuracy. Although early statistical techniques relied on simple mathematical models, AI can learn nonlinear and complex interactions to predict the variability of solar power generation more accurately. LSTM was created to solve the long-term dependency problem of recurrent neural networks (RNNs) and can remember information over long periods. One study compared the prediction error rates of RNN, LSTM, and gated recurrent unit models for solar power prediction and found that the LSTM model performed best across three types of weather [16]. In addition, a study using weather satellite data showed that even a simple deep learning structure significantly improves prediction performance [17]. In this study, we use MLP-based transfer learning among AI models to predict solar power.

2.2. Time-Series Forecasting Research

Time-series forecasting is a statistical method for analyzing data observed over time and predicting the future [18]. It is an essential tool in industries such as finance, economics, energy, healthcare, and the environment. Time-series data consist of continuous observations over time and typically include trend, seasonality, cyclicality, and residual components [19]. Time-series analysis has traditionally focused on understanding historical data or predicting future patterns from it. In recent years, with advances in big data and AI, it has evolved from traditional statistical approaches to more sophisticated methods that utilize machine and deep learning [20]. Its importance is growing rapidly in areas such as climate prediction, energy supply and demand planning, and financial market forecasting, where data are temporally dependent and learning patterns from the past plays a key role in predicting future behavior [21]. It is an important tool for stock price prediction in financial markets, energy demand forecasting, and weather and climate prediction, and it provides actionable insights to decision-makers by identifying trends and seasonal variations in data [22].
Time-series forecasting is used across industries. In the energy sector, it is used to analyze seasonal patterns in electricity demand for supply planning [23]; in healthcare, to predict changes in a patient's health status [24]; and in manufacturing, to predict when equipment will fail [25]. Research has also proposed unsupervised domain adaptation of classification models for multivariate time-series data in vital sign diagnosis classification [26]. These applications maximize operational efficiency and enable cost savings. Time-series analyses are broadly categorized into traditional statistical and AI-based methods [27]. For solar power generation forecasting, the ARIMA or seasonal ARIMA model is commonly used among statistical methods, and the LSTM and Transformer models among AI-based ones [28]. Elman recurrent neural networks have been employed for forecasting and analyzing time series of electric energy consumption [29]. Recently, automated machine learning (AutoML), which automatically searches for optimal models and hyperparameters, has been used to simplify the model development process and improve performance. Another study proposed a wind power output prediction algorithm using time-series decomposition and AutoML, with predictions performed using Jeju wind farm and weather data [30].
TSMixer for time-series forecasting, which is used in this study, is utilized in various time-series forecasting fields. It has shown high performance in energy demand forecasting, financial market analysis, and weather data forecasting and has achieved excellent prediction accuracy with a simple structure compared with existing complex models.

2.3. DTW

DTW is an algorithm that measures the similarity between two time series while allowing for nonlinear variations along the time axis [31]. The Euclidean distance commonly used in traditional similarity measures computes the distance by comparing values that occur at the same time in the two series [32]. However, real-world industrial data, especially time series such as solar power generation data, contain various pattern variations on the time axis, and this simultaneous comparison method has limitations for accurate similarity assessment. To solve this problem, DTW allows temporal distortions to determine the optimal alignment path between two time series, which is then used to measure their similarity [33].
DTW calculates the distance between each pair of points in the two time series in the form of a matrix and then determines the path that most closely matches the two series [34]. The optimal path does not compare all time intervals one-to-one; instead, it allows temporal distortions and aligns points nonlinearly to maximize pattern similarity [35]. Because DTW allows for temporal nonlinearity, it can accurately measure similarity even between series with the same pattern shifted along the time axis. This is particularly useful for weather-sensitive data such as solar power generation, which shows similar patterns at different times owing to weather changes or sunlight differences. For example, cloud cover may occur at different times of the day, or sunlight may fluctuate from hour to hour, while the overall pattern of power generation remains similar. DTW adjusts for this variability temporally to enable pattern comparisons [36]. DTW can also compare data at different time scales, providing the flexibility to analyze long-term patterns, in contrast to models that focus only on specific time periods.
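As a concrete illustration of this alignment, the following minimal NumPy sketch implements the classic dynamic-programming DTW distance; the quadratic cost visible in the double loop is what motivates approximations such as FastDTW, discussed below.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])   # local point-to-point distance
            # A step may match, stretch, or compress the time axis; this
            # choice is what lets DTW absorb nonlinear temporal shifts.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Two generation-like curves with the same shape but a phase shift:
t = np.linspace(0, 2 * np.pi, 100)
a, b = np.sin(t), np.sin(t - 0.5)
print(dtw_distance(a, b))              # small despite the time shift
print(float(np.sum(np.abs(a - b))))    # point-wise distance stays larger
```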
Solar power forecasting uses nonlinear time-series data that are affected by various external factors such as sunlight, cloud cover, and temperature. These data are subject to high temporal variability, which can make forecasting difficult [37]. By effectively handling this temporal variability, DTW can be a powerful tool for comparing and analyzing generation patterns that occur under similar environmental conditions. For example, even if cloud cover or sunrise times vary from region to region, DTW can compare the overall generation patterns, making it possible to apply a generation forecasting model from one region to another [38]. DTW can also detect similar time-series patterns under weather conditions in which power generation is not constant, enabling reliable power forecasting in highly variable environments. However, because DTW compares every pair of points between two time series, its computation time increases rapidly with the amount of data. Optimization algorithms such as FastDTW have been proposed to address this problem, but they still incur high computational costs on large datasets [39]. DTW is primarily used to measure similarity between two time series, which limits its direct use as a predictive model. The data scarcity problem in small-scale PV systems can be solved by combining transfer learning and DTW [40]. When models trained through transfer learning on data collected from large-scale power plants are applied to small-scale generation systems, DTW can play an important role in compensating for temporal variability [41]. For example, when power generation varies by time of day, DTW can handle temporal distortions and thereby help transfer learning models maintain high forecast accuracy in different environments [42].

2.4. Transfer Learning

Transfer learning is a technique for improving the performance of a model by applying a model trained on a large dataset to another problem with relatively sparse data [43]. Instead of training a model from scratch for a new problem, it utilizes previously learned knowledge to improve the learning efficiency and achieve high performance with small datasets. In particular, transfer learning is used extensively in fields such as computer vision and natural language processing, and has recently been successfully applied in time-series data prediction [44].
Transfer learning can be broadly divided into two methods: feature extraction and fine-tuning. Feature extraction fixes the internal weights and parameters of a previously trained model and uses it as a feature extractor for a new problem, allowing the general patterns learned by the existing model to be applied to the new one. Fine-tuning, in contrast, adapts some layers of the existing model to the new data and can achieve good performance with little data, particularly if the new problem is similar to the old one. Knowledge reuse is a key concept in transfer learning [45]. For example, in solar power generation forecasting, a model trained on data collected from a large-scale power plant can be applied to a small-scale home power generation system. Even if the small system lacks sufficient data, transfer learning can apply the general patterns of large-scale power plants to improve its power generation prediction accuracy [46].
Transfer learning can utilize previously learned knowledge to perform effectively with less data, even when the data are scarce. This is particularly useful for solving data-sparsity problems in small-scale power generation systems. By reusing existing models without having to learn from scratch, the learning time for new problems can be reduced significantly [47]. Transfer learning can be applied not only to time-series data but also to image processing, natural language processing, speech recognition, and many other fields. Transfer learning is most effective when the similarity between the old and new problems is high. If the correlation between the two problems is low, knowledge from the existing model may not fit the new problem and the performance may suffer. This is known as negative transfer, and sufficient analysis of the new problem and evaluation of the suitability of the existing model are required to avoid this problem [48].
It is difficult to obtain sufficient data in solar power forecasting because it has to deal with different weather conditions and environmental variability. In this situation, transfer learning can be used to improve the forecast accuracy in small-scale power generation systems by utilizing weather and power generation data from large-scale power plants. In particular, transfer learning can be combined with LSTM or Transformer, which are models that handle long-term dependence, to predict the variability of solar power generation more accurately [49,50].

3. TSMixer- and Transfer Learning-Based Highly Reliable Prediction

3.1. System Framework

In this study, we propose a multi-source transfer learning system that utilizes source domain data collected from different regions and devices to achieve high prediction performance in data-poor target domains. The proposed system, shown in Figure 1, is designed with a two-stage architecture and includes complementary procedures: data similarity evaluation and model training and prediction. This structure is designed to address the problem of data sparsity and produce predictions that reflect the characteristics of the target domain.
The first step of the proposed framework involves selecting suitable source data for transfer learning by evaluating the similarity between the source and target domains. This ensures that the pre-training is based on data with high relevance to the target domain. In particular, this study utilizes principal component analysis (PCA) and DTW to evaluate the similarity quantitatively and select the most appropriate source data to improve the data selection accuracy in the initial stage. This design provides a foundation for efficient identification and utilization of suitable sources, even in environments with diverse source data. The second step involves utilizing the selected source data to pre-train a multi-source model and fine-tune it for the target domain. The multi-source model is designed to learn generalized characteristics by integrating individual source data. Subsequently, by performing linear probing on the target domain data, a parameter fixation technique is utilized to preserve the generalized knowledge learned from the source domain while gaining target domain-specific adaptability to achieve optimal prediction performance on the target data.
The proposed framework does not simply address the problem of data sparsity but focuses on maximizing the performance of pre-trained models based on their relevance to the target domain. This is particularly significant for system design in small-scale industrial environments, in which data collection is difficult, or in the early adoption phase. In addition, we show that a transfer learning approach utilizing multi-source data can overcome the limitations of traditional single-source transfer learning models and provide a richer information base.
The system can effectively solve the problem of data sparsity and contribute significantly to improving the prediction performance in target domains. It holds promise in various applications, such as renewable energy management, quality control in the manufacturing industry, and the operational efficiency of small-scale power generation systems, offering a new direction for data utilization.

3.2. TSMixer

The transfer learning system proposed in this study uses TSMixer. TSMixer is applicable to time-series data in various domains and is suitable for predicting multivariate time series [51]. TSMixer builds on MLP layers, which were first introduced in the vision field and later extended to time-series data; instead of the self-attention mechanism used by Transformer, it processes data through MLP layers, which reduces the computational cost and simplifies the model structure. The MLP layer is designed to learn complex patterns in the input data effectively and utilizes shared MLP blocks that can learn global relationships. The workflow of the model is shown in Figure 2, and the notation used in each step is listed in Table 1.
The first step, instance normalization, leverages the Reversible Instance Normalization (RevIN) model to normalize the distribution of the input data; normalizing each instance increases learning stability and helps the model learn without being strongly affected by deviations in the data [52]. The input data X are provided in the form [b × sl × c], where b is the batch size, sl is the time-series length, and c is the number of channels. The input data are then normalized using instance normalization, which removes the mean and standard deviation of each instance to account for variations in distribution. Subsequently, the input data are divided into patches, converting them into the form [b × n × pl × c], where n is the number of patches and pl is the patch length. The data are then permuted into [b × c × n × pl], which restructures the patch and channel dimensions and prepares them for processing in the backbone.
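The following PyTorch sketch traces these shape transformations on random data; the batch size, series length, channel count, and patch length are assumed values for illustration, and the normalization is a simplified stand-in for RevIN (it omits RevIN's learnable affine parameters).

```python
import torch

b, sl, c = 256, 64, 8      # batch size, series length, channels (assumed)
pl = 16                    # patch length (assumed)
n = sl // pl               # number of patches

x = torch.randn(b, sl, c)  # input X: [b, sl, c]

# Instance normalization in the spirit of RevIN: remove each instance's
# per-channel mean and standard deviation (kept to de-normalize predictions).
mean = x.mean(dim=1, keepdim=True)
std = x.std(dim=1, keepdim=True) + 1e-5
x_norm = (x - mean) / std

# Patching: [b, sl, c] -> [b, n, pl, c], then permute to [b, c, n, pl]
# so the backbone can mix across patches, within patches, and across channels.
patches = x_norm.reshape(b, n, pl, c).permute(0, 3, 1, 2)
print(patches.shape)       # torch.Size([256, 8, 4, 16])
```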
The backbone is the core structure for processing the time-series data and consists of three modules: an inter-patch mixer, an intra-patch mixer, and an inter-channel mixer. Each module is designed to learn interactions across different dimensions of the time-series data.
The inter-patch mixer is responsible for learning the relationships between different patches. It restructures the patch dimensions by transposing them and then uses gated attention to weight important patches. It then uses shared MLP blocks to learn the correlations between patches and ensures stable learning through residual connections and layer normalization. The intra-patch mixer learns the interactions between features within individual patches; it has a structure similar to that of the inter-patch mixer but focuses on interactions within patches. The inter-channel mixer, in turn, applies the same mixing structure across the channel dimension to learn dependencies between the variables of the multivariate series.
The output of the backbone is transformed into a final prediction by the prediction head. The backbone output, [b × c × n × hf], is converted into [b × c × (n · hf)] by flattening. It is then passed through a dropout layer to prevent overfitting and a linear layer that maps to the desired prediction dimension fl. Finally, the output is transposed into the form [b × fl × c] to generate the final prediction. In conclusion, TSMixer achieves both high efficiency and performance through its lightweight, modularized structure: it combines patch-based processing, independent and interdependent feature learning, and efficient MLP transformations to achieve excellent time-series forecasting performance without requiring complex Transformer models. These characteristics make TSMixer well suited to environments with limited computational resources.
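A minimal sketch of such a prediction head is shown below; the dimension values and the dropout rate are assumptions for illustration (the 0.2 rate matches the training setting reported in Section 4.4).

```python
import torch
import torch.nn as nn

b, c, n, hf, fl = 256, 8, 4, 64, 48   # assumed sizes; fl = forecast length

z = torch.randn(b, c, n, hf)          # backbone output: [b, c, n, hf]

head = nn.Sequential(
    nn.Flatten(start_dim=2),          # [b, c, n * hf]
    nn.Dropout(0.2),                  # regularization before the projection
    nn.Linear(n * hf, fl),            # map flattened features to the horizon
)
y_hat = head(z).transpose(1, 2)       # [b, fl, c]: one forecast per channel
print(y_hat.shape)                    # torch.Size([256, 48, 8])
```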

3.2.1. MLP Block

The MLP block shown in Figure 3 is an important component of the TS-Backbone, which is responsible for converting time-series data and extracting features.
First, the input data of size [b × c × n × hf] pass through the first linear layer, where the feature dimension hf is converted into ef. This transformation maps the data to a new feature space to enable more effective feature learning; the output size is [b × c × n × ef]. Next, a Gaussian error linear unit (GeLU) activation function is applied to add nonlinearity [53], allowing the block to learn more complex patterns beyond simple linear relationships. Simultaneously, dropout is applied to prevent overfitting by randomly disabling some features during training; the data size remains [b × c × n × ef]. After GeLU and dropout, the data pass through the second linear layer, where the dimensionality is restored from ef to hf, ensuring that the input and output dimensions are consistent and compatible with the other layers of the model; the output size is restored to [b × c × n × hf]. Finally, an additional dropout layer is applied as a further regularization step to improve generalization and prevent overfitting; the final output size remains [b × c × n × hf]. In summary, the MLP block consists of two linear layers, a GeLU activation function, and dropout, performing feature transformation and generating a learnable representation while maintaining dimensional consistency between input and output.
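A minimal PyTorch rendering of this block, under the shape conventions above, could look as follows; the dropout rate is an assumed value.

```python
import torch.nn as nn

class MLPBlock(nn.Module):
    """Two linear layers with GELU and dropout; the feature dimension hf is
    expanded to ef and restored, so input and output shapes match."""
    def __init__(self, hf: int, ef: int, dropout: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hf, ef),     # map features into the hidden space ef
            nn.GELU(),             # nonlinearity for complex patterns
            nn.Dropout(dropout),   # disable random features during training
            nn.Linear(ef, hf),     # restore hf for dimensional consistency
            nn.Dropout(dropout),   # extra regularization on the output
        )

    def forward(self, x):          # x: [b, c, n, hf]
        return self.net(x)         # output: [b, c, n, hf]
```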

3.2.2. Gated Attention Block

The gated attention (GA) block, shown in Figure 4, is designed to address issues that arise when processing time-series data. Time-series data often contain unnecessary features that are irrelevant to the important information and can degrade model performance. The GA block suppresses these unnecessary features and emphasizes the important ones to improve model efficiency. It uses a gate mechanism to assign attention weights W_A to the features, calculated as shown in Equation (1):

$W_A = \mathrm{Softmax}(A(X_M))$  (1)

where A is a function that computes the attention score for the input X_M, and X_M is the input tensor from the previous mixer component. The Softmax function represents the importance of features probabilistically, increasing the weight of more important features and decreasing the weight of less important ones.
The output of the GA block, X_G, is obtained by taking the element-wise product of the attention weights W_A and the input tensor X_M, i.e., the result of applying the gate. This emphasizes the important features and attenuates the unnecessary ones, as shown in Equation (2):

$X_G = W_A \odot X_M$  (2)
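A compact sketch of this gate in PyTorch is given below; realizing the score function A as a single learned linear layer is an assumption for illustration.

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Feature gate following Equations (1) and (2): softmax attention
    weights W_A scale the incoming tensor X_M element-wise."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, dim)   # A(X_M): per-feature scores

    def forward(self, x_m):               # x_m: [..., dim]
        w_a = torch.softmax(self.attn(x_m), dim=-1)   # Equation (1)
        return w_a * x_m                               # Equation (2)
```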

3.3. Single-Source Transfer Learning

In Stage 1, a similarity evaluation process is performed to select the most appropriate source domains (D_S1, D_S2, …, D_Si) for the target domain (D_T). First, to ensure data consistency, we replace missing values with mean values to minimize errors in model training caused by incomplete data. Maintaining a consistent dataset prevents the introduction of unnecessary noise during transfer learning, which is particularly important for ensuring that the pre-trained weights are appropriately applied to the target data.
Next, we apply PCA to reduce the dimensionality of the multivariate data. PCA reduces the complexity of high-dimensional data by removing unnecessary dimensions while preserving important characteristics, allowing the model to focus on the most important information. This dimensionality reduction lowers the computational cost while retaining key information, making the learning process more efficient. Subsequently, DTW is used to assess similarity to the target data quantitatively: DTW measures the nonlinear similarity between time series and is used to select the top two datasets whose characteristics are most similar to the target data.
This yields a DTW distance value for each source–target domain pair, and the two source domains (D_S1 and D_S2) most similar to the target domain (D_T) are selected. The purpose of this process is to avoid unnecessary data usage during pre-training and to improve learning efficiency by selecting source data appropriate for the target domain.
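The Stage 1 selection logic can be sketched as follows, reusing the `dtw_distance` helper sketched in Section 2.3; the use of scikit-learn's PCA, the 95% variance threshold, and the reduction to the first principal component as a one-dimensional summary for the DTW comparison are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def select_source_domains(sources: dict, target: np.ndarray, k: int = 2) -> list:
    """Rank source domains by DTW distance to the target after PCA.
    `sources` maps domain IDs (e.g., "Elm_57") to [time, features] arrays."""
    def reduce(x: np.ndarray) -> np.ndarray:
        # Keep enough components to explain >= 95% of the variance, then take
        # the first component as a 1-D summary for the DTW comparison.
        return PCA(n_components=0.95).fit_transform(x)[:, 0]

    target_1d = reduce(target)
    dists = {name: dtw_distance(reduce(x), target_1d)
             for name, x in sources.items()}
    return sorted(dists, key=dists.get)[:k]   # the k most similar domains
```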

3.4. Multi-Source Transfer Learning

In Stage 2, the models are pre-trained using the two source domains (D_S1 and D_S2) selected in Stage 1, and linear probing is performed to fit the target domain (D_T). In the pre-training phase, individual models are trained on data from each source domain to learn the unique patterns and features of that domain effectively. This pre-training contributes generalized knowledge from the source domains and provides the initial parameters required for the target domain.
Linear probing is then applied to the pre-trained model using the target domain data. Linear probing keeps most model parameters fixed and updates only the final few layers, effectively adapting the model to the target data while avoiding overfitting. This allows the model to learn additional target-domain-specific information while retaining the generalized knowledge from pre-training. Finally, the trained model is evaluated on test data from the target domain, and the evaluation results confirm the efficiency and suitability of the proposed methodology. This step focuses on achieving model performance optimized for the target domain while efficiently leveraging information from the source domains.
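A minimal sketch of the linear probing step in PyTorch is shown below; the name of the prediction-head submodule is an assumption, so the filter would need to match the actual module names of the pre-trained model.

```python
import torch
import torch.nn as nn

def linear_probe(model: nn.Module, head_prefix: str = "head") -> nn.Module:
    """Freeze all pre-trained parameters except the prediction head, so that
    target-domain training updates only the final layers."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
    return model

# Usage sketch: after multi-source pre-training, probe on the target domain.
# model = linear_probe(pretrained_model)
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```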

4. Experiment and Results

4.1. Experimental Environment

The experimental conditions used in this study are listed in Table 2. The CPU was an Intel Core i7-13700K, the system had 64 GB of RAM, and the graphics card was a GeForce RTX 4060 Ti. PyTorch was used to implement the model and the system framework for the experiments.

4.2. Datasets

The data used in this study were collected from a UK Power Networks project and include a range of environmental and electrical measurements from solar panels installed in residential neighborhoods. The data were collected primarily to validate a solar PV connection assessment tool and were measured over 480 days. Measurements were taken at 10 min intervals and, during the summer of 2014, also at 1 min intervals. The dataset was collected from 20 substations and 10 residences. To merge the solar PV and weather data, the data collection interval was set to 30 min, and each dataset ID was formed from the collection location and the last two digits of the equipment UID. The data used for training are presented in Table 3.
The dataset includes a Time column to indicate the time of measurement, which is recorded in date and time format. The external environmental variables include factors that affect solar power generation. The TempOut column represents the outside temperature in the area where the solar panels are installed and is recorded in degrees Celsius (°C). The OutHum column also provides outside humidity as a percentage (%). The WindSpeed column represents the wind speed (m/s), which is utilized in the analysis as a variable that can affect the cooling effect and energy efficiency of the solar panels. The SolarRad column provides solar radiation (W/m²), which is the main variable directly affecting solar power generation and is used as a key input to the power generation model. Power data provide information for evaluating the power quality and performance of the solar power system. The VA column represents the filtered value of the generator output voltage (V). Similarly, the IA column contains the current (A), which serves as an important variable in the power flow and consumption analysis. The f column represents the power frequency (Hz), which is used to evaluate the network stability of the generation system. Finally, the PA column provides the actual output power (kW), which is used as the target variable for the model.
The dataset was divided into training (70%), validation (20%), and test (10%) sets to evaluate the model's generalization. To avoid the time paradox that can arise from temporal misalignment when data collected from different devices are used for pre-training, we partitioned the data by time: data recorded before 1 September 2014 were used for pre-training, and data after that date were used only in the linear probing phase.
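The time-based partition can be expressed in a few lines of pandas; the file name below is hypothetical.

```python
import pandas as pd

df = pd.read_csv("elm_38.csv", parse_dates=["Time"])   # hypothetical file name

# Split on a fixed date rather than randomly, so that no observation recorded
# after the cutoff can leak into pre-training (the time paradox noted above).
cutoff = pd.Timestamp("2014-09-01")
pretrain_df = df[df["Time"] < cutoff]    # multi-source pre-training data
probe_df = df[df["Time"] >= cutoff]      # reserved for linear probing
```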

4.3. Evaluation Metrics

MAE and MSE were selected as the evaluation metrics to compare the performance of the solar power prediction models. MAE is the average of the absolute value of the difference between the predicted and actual values, which can intuitively determine how close the model’s predicted value is to the actual value. MAE is defined as follows:
$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} \left| y_t - \hat{y}_t \right|$
where n is the total number of data points, y_t is the actual value, and ŷ_t is the predicted value. MAE is a popular metric for evaluating predictive accuracy; taking the absolute value of the error prevents negative and positive errors from canceling out. Because MAE represents the average difference between actual and predicted values, it is intuitive to interpret and less affected by outliers than other performance metrics, which makes it possible to compare the prediction models fairly on solar data collected under different weather conditions.
MSE is the mean square of the errors and provides a more rigorous assessment of the difference between the predicted and actual values of the model. MSE is defined as follows:
$\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2$
MSE is characterized by being sensitive to outliers because it takes the square of the error and penalizes larger errors more heavily. Because of this characteristic, MSE is a useful metric for tuning a model to avoid large errors in its forecasts. Therefore, using MAE and MSE together can provide a multi-faceted assessment of the performance of a solar power forecasting model by considering the overall model performance as well as its sensitivity to large errors.
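Both metrics reduce to a few lines of NumPy, as in the sketch below with toy values.

```python
import numpy as np

def mae(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.mean(np.abs(y - y_hat)))    # robust to outliers

def mse(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.mean((y - y_hat) ** 2))     # penalizes large errors heavily

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.5])
print(mae(y, y_hat))   # 0.2333...: average absolute deviation
print(mse(y, y_hat))   # 0.09: the single 0.5 miss dominates after squaring
```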

4.4. Results

In this section, we quantitatively demonstrate that the proposed model learning approach can produce significant results in solar power generation forecasting. Three models were trained for model evaluation. To verify the performance of the TSMixer model in analyzing time-series data, we conducted a comparison experiment with the LSTM and Transformer models. LSTM is a type of RNN that performs well in extracting continuous patterns from time-series data and is characterized by its ability to model long-term dependencies effectively. By comparing the performances of TSMixer and LSTM, we can evaluate how TSMixer handles long-term patterns in time-series data. Transformer also performs well on long-term sequence data through its self-attention mechanism and has the advantage of a high learning efficiency owing to parallel processing. The comparison with Transformer demonstrates how effectively TSMixer can extract time-series patterns with its unique data processing method.
The common hyperparameter values used for training were as follows: the number of epochs was set to 100, the batch size to 256, and the learning rate to 0.001, with a dropout rate of 0.2 to avoid overfitting. The dataset was divided into 70% for training, 20% for validation, and 10% for testing.

4.4.1. Stage 1

In the source domain similarity evaluation step of Stage 1, DTW was used to calculate the similarity between the source and target domains. For the target data, Elm_38 was randomly selected from the full dataset, and its similarity to each source domain was quantitatively evaluated. Because DTW has a high computational load on high-dimensional data, we applied PCA to mitigate this while retaining key information.
The process of applying PCA was performed in two steps. In the first step, all data were merged to select the optimal number of principal components that reflected the main flow and variance of the data. Figure 5 shows the cumulative explained variance as a function of the number of principal components, which led to the conclusion that six principal components were required to explain more than 95% of the variance in the data. This result was used as a guide to determine the number of principal components to be used when applying PCA to each source and target domain data point separately.
In the second step, based on the selected number of principal components, PCA was applied separately to each source domain (D_S1, D_S2, …, D_Si) and to the target domain (D_T). In this process, the high-dimensional data were reduced to five principal components for prediction efficiency, which contributed to reducing the computational cost of the DTW calculation.
The dimensionality-reduced data from the PCA were then used as the input to calculate the similarity between the source and target domains using DTW. This approach focused on preserving the intrinsic patterns in the data while maximizing the computational efficiency. Consequently, DTW could successfully select the two source domains that were most similar to the target domain.
Figure 6 presents a heatmap visualization of the DTW distances between the source and target domains. Each value represents the similarity between a particular source domain and the target domain; the lower the value, the more similar the patterns of the two datasets. This makes it possible to see intuitively how close each source dataset is to the target data. Based on the heatmap results, Elm_57 and Elm_60, the two datasets with the smallest DTW distances to the target data, were selected as the source domains D_S1 and D_S2 and used for transfer learning.

4.4.2. Stage 2

Stage 2 used the source domain data selected in the previous step as pre-training data for the multi-source transfer learning model. The purpose of this step was to learn generalized characteristics from the selected source data and to improve prediction performance specific to the target domain. Figure 7 visualizes the loss values during model training and validation in Stage 2: the left graph shows the decreasing trend of the loss on the training dataset, and the right graph shows the loss on the validation dataset.
The x-axis represents the number of training steps. Using step units allows detailed changes in the loss value to be observed during training and represents training progress consistently regardless of batch size. In this study, the loss was recorded for each batch to visualize the rapid loss reduction at the beginning of training and the stable convergence at the end. The y-axis represents the loss measured at each step, calculated as the MSE between the model's predictions and the actual values; the lower the loss, the closer the predictions are to the actual values.
The training loss started at 0.9803 and converged to 0.6096 after about 2000 steps, a total reduction of 0.3707, indicating that the model converged stably as it gradually learned the data patterns. The validation loss started at 1.4657 and converged to 0.9199 after about 2000 steps, a total reduction of 0.5458, showing that the model maintained its generalization performance on the validation data. In addition, the training time per step tended to decrease gradually, indicating that the learning process became increasingly efficient. The detailed loss and time values for each step are shown in Table 4.
The reason the validation loss stabilized after a certain point is that the model successfully learned the main patterns in the source domain data during pre-training and then maintained its generalization performance in further training. This stabilization indicates that the model effectively learned the patterns in the multi-source data used in the pre-training phase and succeeded in minimizing the validation loss without overfitting.
In the final step of Stage 2, linear probing was performed on the target domain data based on the pre-trained multi-source model. By keeping most of the model weights fixed and updating only certain layers, linear probing provides target-domain-specific adaptation while retaining the generalized knowledge learned from the source domains. During the linear probing phase, we trained for 100 epochs, which allowed the model, already equipped with generalized weights from pre-training, to adjust efficiently to the target domain data. Each epoch iterated over the entire target dataset, and the loss values converged stably.
Figure 8 visualizes the training and validation loss values for the linear probing phase. The loss was calculated with MSE, as in the pre-training step. The training loss started relatively high but decreased as the steps progressed, reaching stable convergence with only slight fluctuation after approximately 200 steps. This indicates that the model effectively learned the target domain data during linear probing and converged to a stable state. The validation loss decreased rapidly during the initial training phase and exhibited relatively small fluctuations after approximately 150 steps, indicating that the model adapted to the target domain data while maintaining the generalized knowledge learned from the source domains. The stability of the validation loss suggests that learning occurred without overfitting during the linear probing phase.

4.4.3. Model Comparison

Table 5 shows the performance comparison of the proposed model with the LSTM and Transformer models based on the MSE and MAE metrics. These metrics measure the difference between the model’s predicted value and the actual value, with lower values indicating better prediction performance.
The proposed framework demonstrated the best performance, with an MSE of 0.4517 and an MAE of 0.4349, the lowest values among all compared models, indicating that it effectively reduced forecast errors by learning the complex patterns in the time-series data with precision. LSTM also performed relatively well, with an MSE of 0.6888 and an MAE of 0.5585, but it did not fully learn the volatility in the data and did not reach the precision of the proposed model. The Transformer model recorded the highest errors, with an MSE of 0.8517 and an MAE of 0.6692, showing relatively poor performance in predicting the time-series patterns in this dataset. Overall, these results demonstrate that the proposed model provides robust and reliable performance on time-series data with complex patterns and volatility, and its low MSE and MAE suggest that it can be a powerful tool in various time-series forecasting tasks.

5. Discussion

Figure 9 compares the model's predicted values with the actual values along the time axis. The x-axis covers the period from 1 November 2014 to 17 November 2014, and the y-axis represents the PA values, with the blue line showing the actual values and the green line the model's predictions. Overall, the predictions showed a high degree of similarity to the actual values and accurately reflected the trends in the data over time.
In Figure 9, we evaluate the performance of various prediction models in this study by comparing them with real data. Among them, the system framework proposed in this study performed well in overall pattern learning and showed a high agreement rate with the actual value in the stable interval. In particular, the system framework proposed in this study learned consistent patterns in the non-volatile regions of the data, and maintained stable performance without over-prediction or exceptional response. This is unlike the overfitting and unstable response to sudden changes that can occur with traditional LSTM or Transformer models.
The LSTM model showed strength in learning temporal dependence, but its error relative to the true values was comparatively large in intervals where the values changed rapidly. This suggests that LSTMs are well suited to learning long-term temporal trends but may have limitations in capturing spikes in volatility. The Transformer model is strong at learning complex data patterns and was relatively sensitive to rapidly changing intervals, but it overreacted in some of them, leading to errors relative to the true values; this suggests that it overfitted the high variability of the data. In contrast, the proposed system framework predicted consistent patterns in both the stable and volatile regions of the data, with predictions that were neither too narrow nor too wide. This demonstrates that the framework learns the variation in actual values in a more balanced way and is effective at avoiding problems such as overfitting or unstable responses seen in existing models. These results suggest that the data preprocessing and model structure design used in this study had a positive impact on performance: the models were designed to handle the variability in the data reliably while effectively learning key patterns, which is a key factor underlying the distinctiveness and superiority of this study.
Nevertheless, it is noteworthy that all models showed a consistent degradation in performance in intervals with rapidly changing values. This may be because the training data did not contain enough intervals with sharp changes, or because important information was lost during preprocessing. In particular, the PCA- and DTW-based preprocessing used in this study was useful for reducing the dimensionality of the data and simplifying patterns, but it may have discarded some higher-dimensional information associated with rapid variability.

6. Conclusions

6.1. Research Results

A key advantage of the proposed model is its lightweight structure, which enables efficient operation in environments with limited computational resources. Furthermore, the transfer learning approach proved effective in leveraging pre-trained knowledge from similar datasets, mitigating the issues caused by data scarcity and improving generalization performance. In conclusion, this study provides a practical and efficient solution for power forecasting in small-scale PV systems. By integrating transfer learning with a TSMixer-based architecture, the proposed framework demonstrates its potential to maintain high reliability and accuracy in data-constrained environments. Future research directions include real-time data integration, application to diverse domains, and further enhancement with state-of-the-art deep learning techniques. This study highlights the feasibility of applying transfer learning to address challenges in data-scarce industries, paving the way for its broader adoption.

6.2. Future Work

This research focused on proposing a transfer learning-based model for power generation prediction in small-scale photovoltaic systems and on verifying its applicability to various industries. Future research will assess the overall performance of the model through benchmark comparisons using large-scale, long-term data and evaluate it on various time-series domains such as manufacturing processes, energy forecasting, and finance. In particular, we plan to verify its ability to predict diverse patterns, including sudden fluctuations, and to compare its performance under various weather conditions by adding multiple weather datasets or obtaining data reflecting different environmental conditions.
In addition, this study focused on validating the design and performance of the TSMixer model and demonstrated its advantages through comparisons with LSTM and Transformer models. In future work, we plan to extend and deepen this research by comparing performance against various MLP variants, including models such as gMLP. This will show how TSMixer compares with other state-of-the-art models and provide further directions for improving its design and performance.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; software, Y.L.; validation, Y.L.; formal analysis, Y.L.; investigation, Y.L.; resources, Y.L.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L.; visualization, Y.L.; supervision, J.J.; project administration, J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The article publication charge (APC) was waived as part of the publisher’s special discount program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study were obtained from the Greater London Authority and are publicly available at https://data.london.gov.uk/dataset/photovoltaic--pv--solar-panel-energy-generation-data (accessed on 19 December 2024).

Acknowledgments

This work was supported by ICT Creative Consilience Program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2020-II201821).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, B.; Liu, Y.; Wang, D.; Song, C.; Fu, Z.; Zhang, C. A review of the photothermal-photovoltaic energy supply system for building in solar energy enrichment zones. Renew. Sustain. Energy Rev. 2024, 191, 114100. [Google Scholar] [CrossRef]
  2. Shahid, A.; Plaum, F.; Korõtko, T.; Rosin, A. AI Technologies and Their Applications in Small-Scale Electric Power Systems. IEEE Access 2024, 12, 109984–110001. [Google Scholar] [CrossRef]
  3. Wang, J.; Yang, C.; Jiang, X.; Wu, J. WHEN: A Wavelet-DTW hybrid attention network for heterogeneous time series analysis. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023. [Google Scholar]
  4. Zheng, J.; Du, J.; Wang, B.; Klemeš, J.J.; Liao, Q.; Liang, Y. A hybrid framework for forecasting power generation of multiple renewable energy sources. Renew. Sustain. Energy Rev. 2023, 172, 113046. [Google Scholar] [CrossRef]
  5. Li, W.; Huang, R.; Li, J.; Liao, Y.; Chen, Z.; He, G.; Yan, R.; Gryllias, K. A perspective survey on deep transfer learning for fault diagnosis in industrial scenarios: Theories, applications and challenges. Mech. Syst. Signal Process. 2022, 167, 108487. [Google Scholar] [CrossRef]
  6. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76. [Google Scholar] [CrossRef]
  7. Bouquet, P.; Jackson, I.; Nick, M.; Kaboli, A. AI-based forecasting for optimised solar energy management and smart grid efficiency. Int. J. Prod. Res. 2024, 62, 4623–4644. [Google Scholar] [CrossRef]
  8. Al-Dahidi, S.; Madhiarasan, M.; Al-Ghussain, L.; Abubaker, A.M.; Ahmad, A.D.; Alrbai, M.; Aghaei, M.; Alahmer, H.; Alahmer, A.; Baraldi, P.; et al. Forecasting solar photovoltaic power production: A comprehensive review and innovative data-driven modeling framework. Energies 2024, 17, 4145. [Google Scholar] [CrossRef]
  9. Zhao, Z.Y.; Zhang, S.Y.; Hubbard, B.; Yao, X. The emergence of the solar photovoltaic power industry in China. Renew. Sustain. Energy Rev. 2013, 21, 229–236. [Google Scholar] [CrossRef]
  10. Yang, H.T.; Huang, C.M.; Huang, Y.C.; Pai, Y.S. A weather-based hybrid method for 1-day ahead hourly forecasting of PV power output. IEEE Trans. Sustain. Energy 2014, 5, 917–926. [Google Scholar] [CrossRef]
  11. Gandhi, O.; Zhang, W.; Kumar, D.S.; Rodríguez-Gallegos, C.D.; Yagli, G.M.; Yang, D.; Reindl, T.; Srinivasan, D. The value of solar forecasts and the cost of their errors: A review. Renew. Sustain. Energy Rev. 2024, 189, 113915. [Google Scholar] [CrossRef]
  12. Mohammad, A.; Mahjabeen, F. Revolutionizing solar energy: The impact of artificial intelligence on photovoltaic systems. Int. J. Multidiscip. Sci. Arts 2023, 2. [Google Scholar] [CrossRef]
  13. Shen, Q.; Wen, X.; Xia, S.; Zhou, S.; Zhang, H. AI-Based Analysis and Prediction of Synergistic Development Trends in US Photovoltaic and Energy Storage Systems. Int. J. Innov. Res. Comput. Sci. Technol. 2024, 12, 36–46. [Google Scholar] [CrossRef]
  14. Bae, S.U. Research Trends in Solar PV Output Power Prediction Techniques. Electr. World 2018, 67, 16–25. [Google Scholar]
  15. Akal, M. Forecasting Turkey’s tourism revenues by ARMAX model. Tour. Manag. 2004, 25, 565–580. [Google Scholar] [CrossRef]
  16. Kang, B.B.; Yoon, J.H. A study on solar power generation prediction model using deep learning algorithm. Electron. Eng. J. 2023, 60, 119–125. [Google Scholar]
  17. AlKandari, M.; Ahmad, I. Solar power generation forecasting using ensemble approach based on deep learning and statistical methods. Appl. Comput. Inform. 2024, 20, 231–250. [Google Scholar] [CrossRef]
  18. Chiang, S.; Zito, J.; Rao, V.R.; Vannucci, M. Time-series analysis. In Statistical Methods in Epilepsy; Chapman and Hall/CRC: Boca Raton, FL, USA, 2024; pp. 166–200. [Google Scholar]
  19. Li, W.; Yu, W.; Du, H.; Du, S.; You, J.; Tang, Y. Learning Seasonal-Trend Representations and Conditional Heteroskedasticity for Time Series Analysis. In Proceedings of the International Conference on Artificial Neural Networks, Lugano, Switzerland, 17–20 September 2024; Springer: Cham, Switzerland, 2024; pp. 267–281. [Google Scholar]
  20. Kashpruk, N.; Piskor-Ignatowicz, C.; Baranowski, J. Time Series Prediction in Industry 4.0: A Comprehensive Review and Prospects for Future Advancements. Appl. Sci. 2023, 13, 12374. [Google Scholar] [CrossRef]
  21. Yi, K.; Zhang, Q.; Fan, W.; Wang, S.; Wang, P.; He, H.; An, N.; Lian, D.; Cao, L.; Niu, Z. Frequency-domain MLPs are more effective learners in time series forecasting. arXiv 2024, arXiv:2311.06184. [Google Scholar]
  22. Liu, J. Navigating the Financial Landscape: The Power and Limitations of the ARIMA Model. Highlights Sci. Eng. Technol. 2024, 88, 747–752. [Google Scholar] [CrossRef]
  23. Pełka, P. Analysis and forecasting of monthly electricity demand time series using pattern-based statistical methods. Energies 2023, 16, 827. [Google Scholar] [CrossRef]
  24. Morid, M.A.; Sheng, O.R.L.; Dunbar, J. Time series prediction using deep learning methods in healthcare. ACM Trans. Manag. Inf. Syst. 2023, 14, 1–29. [Google Scholar] [CrossRef]
  25. De Simone, V.; Di Pasquale, V.; Miranda, S. An overview on the use of AI/ML in manufacturing MSMEs: Solved issues, limits, and challenges. Procedia Comput. Sci. 2023, 217, 1820–1829. [Google Scholar] [CrossRef]
  26. Kim, S.; Kim, D. Multivariate Time Series Unsupervised Domain Adaptation based on Uncertainty-Aware Pseudo-Label Selection and Manifold Mixup using Medical-Bio Signal. In Proceedings of the Korean Institute of Information Scientists and Engineers Assoc., Jeju, Republic of Korea, 26–28 June 2024; pp. 1112–1114. [Google Scholar]
  27. Lim, B.; Zohren, S. Time-series forecasting with deep learning: A survey. Philos. Trans. R. Soc. A 2021, 379, 20200209. [Google Scholar] [CrossRef]
  28. Jailani, N.L.M.; Dhanasegaran, J.K.; Alkawsi, G.; Alkahtani, A.A.; Phing, C.C.; Baashar, Y.; Capretz, L.F.; Al-Shetwi, A.Q.; Tiong, S.K. Investigating the power of LSTM-based models in solar energy forecasting. Processes 2023, 11, 1382. [Google Scholar] [CrossRef]
  29. Lee, C.Y.; Kim, J.H. Prediction and analysis of electric energy time series using Elman recurrent neural networks. Ind. Manag. Syst. J. 2018, 41, 84–93. [Google Scholar]
  30. Han, Y.; Heo, J. A Study on Short-Term Wind Power Forecasting Based on Time Series Decomposition and Automated Machine Learning. In Proceedings of the Korean Electr. Soc., Jeju, Republic of Korea, 10–13 July 2024; pp. 541–542. [Google Scholar]
  31. Shifaz, A.; Pelletier, C.; Petitjean, F.; Webb, G.I. Elastic similarity and distance measures for multivariate time series. Knowl. Inf. Syst. 2023, 65, 2665–2698. [Google Scholar] [CrossRef]
  32. Baek, J.; Alhindi, T.J.; Jeong, Y.S.; Jeong, M.K.; Seo, S.; Kang, J.; Shim, W.; Heo, Y. Real-time fire detection system based on dynamic time warping of multichannel sensor networks. Fire Saf. J. 2021, 123, 103364. [Google Scholar] [CrossRef]
  33. Wang, L.; Koniusz, P. Uncertainty-dtw for time series and sequences. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 1–17. [Google Scholar]
  34. Vaughan, N.; Gabrys, B. Comparing and combining time series trajectories using dynamic time warping. Procedia Comput. Sci. 2016, 96, 465–474. [Google Scholar] [CrossRef]
  35. Martinez, I. Diffeomorphic Transformations for Time Series Analysis: An Efficient Approach to Nonlinear Warping. arXiv 2023, arXiv:2309.14029. [Google Scholar]
  36. Matheri, A.N.; Nabadda, E.; Mohamed, B. Sustainable and circularity in the decentralized hybrid solar-bioenergy system. Environ. Dev. Sustain. 2024, 26, 16987–17011. [Google Scholar] [CrossRef]
  37. Krishnan, N.; Kumar, K.R.; Inda, C.S. How solar radiation forecasting impacts the utilization of solar energy: A critical review. J. Clean. Prod. 2023, 388, 135860. [Google Scholar] [CrossRef]
  38. Li, Q.; Zhang, X.; Ma, T.; Liu, D.; Wang, H.; Hu, W. A Multi-step ahead photovoltaic power forecasting model based on TimeGAN, Soft DTW-based K-medoids clustering, and a CNN-GRU hybrid neural network. Energy Rep. 2022, 8, 10346–10362. [Google Scholar] [CrossRef]
  39. Deriso, D.; Boyd, S. A general optimization framework for dynamic time warping. Optim. Eng. 2023, 24, 1411–1432. [Google Scholar] [CrossRef]
  40. Mao, W.; Zhang, W.; Feng, K.; Beer, M.; Yang, C. Tensor representation-based transferability analytics and selective transfer learning of prognostic knowledge for remaining useful life prediction across machines. Reliab. Eng. Syst. Saf. 2024, 242, 109695. [Google Scholar] [CrossRef]
  41. Ge, L.; Du, T.; Li, C.; Li, Y.; Yan, J.; Rafiq, M. Virtual Collection for Distributed Photovoltaic Data: Challenges, Methodologies, and Applications. Energies 2022, 15, 8783. [Google Scholar] [CrossRef]
  42. Jain, V.; Fokow, V.; Wicht, J.; Wetzker, U. A dynamic time warping based method to synchronize spectral and protocol domains for troubleshooting wireless communication. IEEE Access 2023, 11, 64668–64678. [Google Scholar] [CrossRef]
  43. Niu, S.; Liu, Y.; Wang, J.; Song, H. A decade survey of transfer learning (2010–2020). IEEE Trans. Artif. Intell. 2020, 1, 151–166. [Google Scholar] [CrossRef]
  44. Vrbančič, G.; Podgorelec, V. Transfer learning with adaptive fine-tuning. IEEE Access 2020, 8, 196197–196211. [Google Scholar] [CrossRef]
  45. Neyshabur, B.; Sedghi, H.; Zhang, C. What is being transferred in transfer learning? Adv. Neural Inf. Process. Syst. 2020, 33, 512–523. [Google Scholar]
  46. Xu, Y.; Lin, K.; Hu, C.; Wang, S.; Wu, Q.; Zhang, L.; Ran, G. Deep transfer learning based on transformer for flood forecasting in data-sparse basins. J. Hydrol. 2023, 625, 129956. [Google Scholar] [CrossRef]
  47. Zhu, Z.; Lin, K.; Jain, A.K.; Zhou, J. Transfer learning in deep reinforcement learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362. [Google Scholar] [CrossRef] [PubMed]
  48. Wei, N.; Yin, L.; Yin, C.; Liu, J.; Wang, S.; Qiao, W.; Zeng, F. Pseudo-correlation problem and its solution for the transfer forecasting of short-term natural gas loads. Gas Sci. Eng. 2023, 119, 205133. [Google Scholar] [CrossRef]
  49. Hu, Z.; Gao, Y.; Ji, S.; Mae, M.; Imaizumi, T. Improved multistep ahead photovoltaic power prediction model based on LSTM and self-attention with weather forecast data. Appl. Energy 2024, 359, 122709. [Google Scholar] [CrossRef]
  50. Spencer, R.; Ranathunga, S.; Boulic, M.; van Heerden, A.; Susnjak, T. Transfer Learning on Transformers for Building Energy Consumption Forecasting–A Comparative Study. arXiv 2024, arXiv:2410.14107. [Google Scholar]
  51. Ekambaram, V.; Jati, A.; Nguyen, N.; Sinthong, P.; Kalagnanam, J. Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 459–469. [Google Scholar]
  52. Kim, T.; Kim, J.; Tae, Y.; Park, C.; Choi, J.H.; Choo, J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  53. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
Figure 1. Proposed system framework.
Figure 2. TSMixer workflow.
Figure 3. MLP block.
Figure 4. Gated attention block.
Figure 5. Cumulative explained variance in PCA.
Figure 6. DTW distance heatmap.
Figure 7. Pre-training loss.
Figure 8. Linear probing loss.
Figure 9. Wavelet comparison of the predicted and actual values of the proposed and comparison models.
Table 1. Symbol Descriptions for Time-Series Forecasting.

Symbol    Description
L         Length of the time series
c         Number of time series or channels
s_l       Input sequence length
f_l       Forecast sequence length
b         Batch size
n         Number of patches
p_l       Patch length
h_f       Hidden feature dimension
e_f       Expansion feature dimension
n_l       Number of MLP-Mixer layers
c_l       Context length
A(·)      Linear layer for compactness
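
To make the notation concrete, the following sketch collects the symbols of Table 1 into a single configuration object. The numeric defaults are illustrative values only, not the settings used in the experiments.

```python
from dataclasses import dataclass

@dataclass
class TSMixerConfig:
    """Hyperparameters named in Table 1 (illustrative values only)."""
    L: int = 7504    # length of the time series
    c: int = 1       # number of time series or channels
    s_l: int = 96    # input sequence length
    f_l: int = 24    # forecast sequence length
    b: int = 32      # batch size
    p_l: int = 16    # patch length
    h_f: int = 64    # hidden feature dimension
    e_f: int = 128   # expansion feature dimension
    n_l: int = 4     # number of MLP-Mixer layers
    c_l: int = 96    # context length

    @property
    def n(self) -> int:
        # number of non-overlapping patches covering the input window
        return self.s_l // self.p_l
```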
Table 2. System configuration.

Parameter       Description
CPU             Intel Core i7-13700K
RAM             64.0 GB
Graphics Card   GeForce RTX 4060 Ti
CUDA            12.6
PyTorch         2.4.1
Table 3. Dataset.

Dataset ID        Collection Start Date    Collection End Date      Total Data Points
(Target) Elm_38   2014-06-10 02:30:00      2014-11-17 08:30:00      7504
Elm_57            2014-06-10 02:30:00      2014-11-17 09:30:00      7493
Elm_58            2014-06-10 02:30:00      2014-11-17 08:30:00      7495
Elm_60            2014-06-10 02:30:00      2014-11-17 09:30:00      7515
Forest_07         2014-06-10 02:30:00      2014-11-17 10:00:00      7671
Forest_20         2014-06-10 02:30:00      2014-11-17 11:00:00      7655
Forest_28         2014-06-10 02:30:00      2014-11-17 11:00:00      7666
Maple_23          2014-06-10 02:30:00      2014-11-19 14:00:00      5358
Maple_25          2014-06-10 02:30:00      2014-11-19 14:00:00      5357
YMCA_73           2014-06-10 02:30:00      2014-11-19 13:00:00      7791
YMCA_81           2014-06-10 02:30:00      2014-11-19 19:00:00      7785
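
Table 3 lists the target site (Elm_38) and ten candidate source sites. A minimal NumPy implementation of the DTW distance used to rank source series by similarity to the target could look as follows; the series are assumed to be z-normalized beforehand, and production code would typically use an optimized DTW library rather than this O(n·m) reference version.

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Classic dynamic-programming DTW distance between two 1-D series."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def rank_sources(series: dict[str, np.ndarray], target_id: str):
    """Sort candidate source sites by DTW distance to the target site."""
    target = series[target_id]
    dists = {sid: dtw_distance(target, s)
             for sid, s in series.items() if sid != target_id}
    return sorted(dists.items(), key=lambda kv: kv[1])
```

Applying dtw_distance pairwise over all sites would yield a distance matrix of the kind visualized in the DTW distance heatmap (Figure 6).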
Table 4. Training and evaluation loss at different steps.

Step    Training Loss    Training Time (m)    Evaluation Loss    Evaluation Time (m)
22      0.9803           11.3697              1.4657             11.3674
528     0.6383           76.5214              0.9340             76.6172
1012    0.6277           74.3255              0.9227             74.3044
1408    0.6362           58.8556              0.9160             58.7563
1804    0.6085           56.9159              0.9196             56.9235
2200    0.6096           56.8487              0.9199             56.8370
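
Table 4 tracks the loss during pre-training on the source sites; the subsequent transfer step (Figure 8) retrains only the output layer on the target site. A hedged PyTorch sketch of such a linear-probing step is shown below, with a stand-in module in place of the pre-trained TSMixer encoder.

```python
import torch
import torch.nn as nn

def linear_probe(backbone: nn.Module, in_dim: int, out_dim: int) -> nn.Module:
    """Freeze the pre-trained backbone and attach a fresh trainable head."""
    for p in backbone.parameters():
        p.requires_grad = False
    return nn.Sequential(backbone, nn.Linear(in_dim, out_dim))

# Stand-in for the pre-trained encoder: any module mapping the input
# window (here length 96) to a 64-dim feature vector works in this sketch.
pretrained = nn.Sequential(nn.Linear(96, 64), nn.GELU())
model = linear_probe(pretrained, in_dim=64, out_dim=24)

# Only the new head's parameters receive gradient updates.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
loss_fn = nn.MSELoss()
```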
Table 5. Comparisons with other models.

Model          MSE       MAE
Our model      0.4517    0.4349
LSTM           0.6888    0.5585
Transformer    0.8517    0.6692
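
The MSE and MAE values in Table 5 follow the standard definitions; a minimal NumPy check, with y_true and y_pred as arrays of actual and predicted generation, is:

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))
```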
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
