1. Introduction
Climate change is one of the major challenges of the 21st century, characterized by rising temperatures, shifting precipitation, and the intensification of extreme weather phenomena, which alter global and regional precipitation patterns. In some regions, this may result in more frequent and prolonged dry spells [1]. Droughts can lead to long-term environmental degradation, including soil erosion, loss of biodiversity, and depletion of water resources. Drought estimation and forecasting involve a combination of monitoring, data analysis, and robust predictive modeling techniques. They also require a thorough analysis of patterns, periodicity, and the interaction of multiple environmental factors based on historical observations [2,3].
Two main methodologies stand out for forecasting climate parameters: physical and data-driven models [4,5,6]. These represent distinct approaches to understanding and predicting the complexities of the Earth's climate system. Physical models are often employed for medium- and long-term forecasts, as they incorporate complex processes such as oceanic circulation and large-scale interactions between the atmosphere and the oceans. Although these models are demanding in terms of computational resources, they offer an in-depth understanding of meteorological and climatic processes [7]. Conversely, data-driven models are generally better suited to short-term forecasts because of their ability to quickly capture trends and patterns from real-time data; their algorithmic flexibility also enables them to adapt to rapid changes in weather patterns. However, the long-term reliability of these models may be affected by sudden and unforeseen variations in the input data, posing challenges for their use in longer-term climate-forecasting contexts [8,9]. The first models developed to address this issue were statistical extrapolation techniques, which are extensively employed in various fields, such as meteorology, urban science, and energy [10,11]. Among these models are the Autoregressive Integrated Moving Average (ARIMA), the Seasonal Autoregressive Integrated Moving Average (SARIMA), and Fourier forecasting [12,13,14]. The effectiveness of ARIMA, and of its seasonal extension SARIMA, lies in their ability to adeptly capture seasonal fluctuations. In contrast, the Fourier model breaks the data down into frequencies, uncovering periodic patterns that support precise predictions and insight into evolving temporal dynamics. However, their limitations in modeling nonlinear data with non-stationary distributions have driven the advancement of Artificial Neural Networks (ANNs), which now stand as the primary research focus in developing various intelligent models [15].
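To make the Fourier approach concrete, the sketch below (a minimal illustration of the idea, not the implementation of any cited work) extrapolates the dominant harmonics of a mean-centred series. It assumes any trend has already been removed; in practice, ARIMA/SARIMA fits would typically come from a library such as statsmodels.

```python
import numpy as np

def fourier_forecast(series, n_harmonics=3, horizon=30):
    """Extrapolate the dominant Fourier harmonics of a (trend-free) series.

    Minimal sketch of the Fourier-forecast idea: keep the strongest
    frequencies of the mean-centred signal and project them forward.
    """
    n = len(series)
    mean = series.mean()
    coeffs = np.fft.rfft(series - mean)
    freqs = np.fft.rfftfreq(n)  # cycles per sample
    # Indices of the n_harmonics largest-amplitude components.
    keep = np.argsort(np.abs(coeffs))[::-1][:n_harmonics]
    t_future = np.arange(n + horizon)
    rebuilt = np.full(t_future.size, mean)
    for k in keep:
        amp = np.abs(coeffs[k]) / n
        phase = np.angle(coeffs[k])
        # rfft is one-sided: double every non-DC component
        # (the Nyquist term would need extra care; ignored here).
        factor = 1.0 if k == 0 else 2.0
        rebuilt += factor * amp * np.cos(2 * np.pi * freqs[k] * t_future + phase)
    return rebuilt

# Synthetic monthly-cycle signal; the last `horizon` values are the forecast.
t = np.arange(360)
y = 10 + 2 * np.sin(2 * np.pi * t / 30)
pred = fourier_forecast(y, n_harmonics=2, horizon=30)
```

Because the synthetic signal is periodic and trend-free, the projected harmonics reproduce it almost exactly beyond the observation window; real series with leakage or residual trend degrade more gracefully.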
Time series analysis is a crucial research domain aimed at tackling forecasting challenges across diverse applications. It extends to the examination of non-stationary data, which frequently exhibit non-uniform trends, seasonal cycles, and other temporal structures undergoing irregular changes. This complexity makes temporal data patterns difficult to capture and model, as traditional techniques struggle to generalize to nonlinear data distributions [16]. Therefore, the development of robust models for non-stationary data forecasting often requires advanced machine- and deep-learning models, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), or Gaussian Process LSTMs (GP-LSTMs). These approaches capture complex temporal dependencies and dynamically adjust the model parameters according to the data's evolution and the preprocessing analysis, which improves accuracy when forecasting multiscale data series [17,18,19]. However, enhancing the accuracy of forecasting models also involves a foundational step known as data preprocessing, which demands considerable time and attention within any data analytics workflow. Data preprocessing ensures the quality of the data used as input for machine-learning techniques, thereby minimizing the risk of generating inaccurate or flawed results. It typically includes data cleaning to rectify inconsistencies and errors, sample selection to pinpoint relevant data subsets, outlier removal to enhance data reliability, normalization to standardize data distributions, and data transformation to reveal trends, seasonality, and periodicity. Data preprocessing makes it easier to capture the period of similar motifs in a time series and enhances model fitting [20]. According to the literature, various transformation methods have been developed and applied to make non-stationary time series more homogeneous. These methods are mainly classified into two categories: data mapping and decomposition. For data mapping, a set of parametric and nonparametric models has been proposed, based either on trend adjustment via detrending methods (linear and polynomial detrending, seasonal adjustment, etc.) or on removing the tendency through mathematical transformations (linearization, nonlinear smoothing, and fractional differencing). The second category focuses on decomposing the data into trend and seasonal components in the time or time–frequency domain, using techniques such as seasonal and trend decomposition based on locally estimated scatterplot smoothing (LOESS), empirical mode decomposition, and wavelet transformation [20,21].
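As a concrete illustration of the decomposition category, the following sketch implements a simple classical additive decomposition with NumPy. It is a minimal stand-in for the cited techniques (LOESS-based STL, empirical mode decomposition, and wavelets are considerably more sophisticated): a centred moving average estimates the trend, and per-phase averages of the detrended series estimate the seasonal component.

```python
import numpy as np

def classical_decompose(series, period):
    """Split a series into additive trend, seasonal, and residual parts.

    Illustrative only: a centred moving average estimates the trend, and
    the mean of each phase of the cycle estimates the seasonal pattern.
    """
    n = len(series)
    # Centred moving average as the trend estimate (edges padded by reflection).
    padded = np.pad(series, period // 2, mode="reflect")
    kernel = np.ones(period) / period
    trend = np.convolve(padded, kernel, mode="valid")[:n]
    detrended = series - trend
    # Average each phase of the cycle to obtain the seasonal component.
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(seasonal, n // period + 1)[:n]
    residual = series - trend - seasonal
    return trend, seasonal, residual

# Synthetic series: linear trend plus a 12-step seasonal cycle.
t = np.arange(120)
y = 0.05 * t + np.sin(2 * np.pi * t / 12)
trend, seasonal, residual = classical_decompose(y, period=12)
```

Away from the edges, the residual of this synthetic example is close to zero, which is exactly the property the transformation methods above exploit to make a series more homogeneous.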
Our study presents a new approach to the challenge of forecasting non-stationary time series data at a small scale. To exemplify our method, the daily standardized precipitation index (SPI) of the Adige basin (in the Alps, Italy) is chosen as a case study; this dataset exhibits a non-stationary distribution due to the non-uniform variability of daily precipitation. This complexity makes it challenging to adapt machine-learning parameters for forecasting future events. Our approach aims to adjust the non-stationary distribution and simplify the detection of time delays using a basic linear autocorrelation method. We employ a set of statistical techniques in the preprocessing phase, including sampling-based seasonality, mathematical transformation, and normalization, to standardize the values and facilitate comparative performance analysis among the different models. In this context, we apply two transformation methods to evaluate their impact on the length of the time lag window and on the accuracy of the machine-learning forecasting models. This paper is structured into three sections, starting with this introduction, which provides an overview of the state-of-the-art techniques and models used for time series forecasting. In the subsequent section, we introduce the methods, the datasets, and the proposed approach. Finally, the results, conclusions, and future work are presented in the last section.
3. Proposed Method
The complexity of non-stationary data in time series analysis becomes more apparent at finer scales, such as daily and sub-daily. This non-uniform variability poses additional challenges for modeling and forecasting, as it arises from irregular fluctuations of the means and variances across time intervals. This temporal heterogeneity complicates the adaptation of an ML model's parameters, especially for applications requiring continuous-time forecasts. In this section, we present a method that reduces the non-uniform trend in the data by employing sampling techniques, data transformation, and autocorrelation to identify the time lags at which the data series displays periodicity and similarity, i.e., from a statistical perspective, where the data show a strong correlation. The flowchart illustrated in Figure 3 outlines the stages of the analysis, from the preprocessing step through to the training and validation of the ML model. In this study, we focus on forecasting drought using daily SPI data as a case study; nevertheless, the model is adaptable to any time series exhibiting a non-stationary distribution.
The model begins by organizing and arranging the data through seasonal sampling, followed by detrending using linearization (log) and sinusoidal (sin) transformation methods. Subsequently, data normalization is applied to facilitate the statistical analysis, including modeling, performance assessment, and comparison between the different results.
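Since the exact formulas of the log and sinusoidal transforms are not given here, the sketch below illustrates one plausible form of this preprocessing step: a hypothetical log linearization, a least-squares sinusoidal detrending at an assumed annual period, and min-max normalization to [0, 1]. The function names and the synthetic SPI-like series are ours, for illustration only.

```python
import numpy as np

def minmax(x):
    """Min-max normalization to [0, 1], used to compare the models."""
    return (x - x.min()) / (x.max() - x.min())

def log_linearize(x):
    """Hypothetical linearization: shift to positive values, then log."""
    return np.log(x - x.min() + 1.0)

def sin_detrend(x, period=365.0):
    """Hypothetical sinusoidal transformation: remove a fitted annual wave.

    Least-squares fit of a*sin + b*cos + c at the assumed period,
    subtracted from the series.
    """
    t = np.arange(len(x))
    design = np.column_stack([
        np.sin(2 * np.pi * t / period),
        np.cos(2 * np.pi * t / period),
        np.ones(len(x)),
    ])
    coeffs, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ coeffs

# Synthetic SPI-like series with an annual cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(730)
spi_like = 1.5 * np.sin(2 * np.pi * t / 365) + rng.normal(0.0, 0.3, t.size)
stationary = minmax(sin_detrend(spi_like))
linearized = minmax(log_linearize(spi_like))
```

After the sinusoidal detrending, the seasonal variance of the synthetic series collapses, which is the behaviour the normalization step then standardizes across samples.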
The dedicated model is primarily based on the use of time lags to predict future daily events within time windows. These windows are detected through autocorrelation analysis by dividing the data series into input and regressor segments using a daily shift (T). We chose simpler machine-learning methods over deep learning because of the stationary distribution achieved through preprocessing. Furthermore, the analysis is univariate and uses the lagged observations as the feature vector.
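The lag-detection step can be illustrated as follows. This sketch (function names are ours, not from the paper) scans daily shifts up to half the observation period, picks the shift T with the strongest linear autocorrelation, and splits the series into input and regressor segments.

```python
import numpy as np

def autocorr(x, lag):
    """Linear (Pearson) correlation between x and its lag-shifted copy."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

def best_lag(x, min_lag=1, max_lag=None):
    """Daily shift T with the strongest autocorrelation.

    max_lag defaults to half the observation period, as in the text.
    """
    if max_lag is None:
        max_lag = len(x) // 2
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorr(x, lag))

def split_segments(x, lag):
    """Input and regressor segments separated by the daily shift T."""
    return x[:-lag], x[lag:]

# Synthetic series with a 180-day periodicity.
x = np.sin(2 * np.pi * np.arange(720) / 180)
T = best_lag(x, min_lag=30)
past, future = split_segments(x, T)
```

On this synthetic series, the detected shift is a multiple of the 180-day period, i.e., a window in which past and future observations are strongly correlated.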
4. Results and Discussion
In this section, we outline the outcomes of the data preprocessing, which focus on the data transformation and the determination of the time lag using autocorrelation for both the training and testing datasets across various transformations. Following that, we present the results obtained from using K-Nearest Neighbors (KNN) and Random Forest (RF) for predicting future SPI events based on the maximum time lag.
Table 1 provides a descriptive statistical analysis of the training and testing datasets used during the processing step. Various tests were conducted on the data series after transformation and normalization, offering a comparative examination of each data sample before and after adjustment. The outcomes reveal that the mean and median values closely approximate the mid-range of the normalized data (between 0 and 1), particularly for the linearized and sinusoidal-transformed data. Following the preprocessing phase, the SPI values achieved stationarity: the variability of the linearized dataset decreases, and its distribution becomes more uniform than that of the other datasets. In Table 1, we have highlighted the best transformation result for each seasonal sample in green. The selection is based on the minimum coefficient of variation (CV) and the degree of similarity between the mean and median values. From the results, the best transformation for the autumn data, for both training and testing, was the sinusoidal transformation. The same level of performance was observed because the variability of the original data was consistent between the training and testing datasets. On the other hand, the statistics for the winter and spring data showed good performance with the sinusoidal transformation for the data used to train the ML models, whereas for the testing data the linear transformation performed best. Similar results were observed for the summer datasets, owing to their uniform distribution. In general, the best stationarity of the time series is achieved using the sinusoidal and linear transformation methods; however, the choice of the best method depends on the nonlinear distribution of the original data.
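The selection rule used for Table 1 can be made concrete: compute the coefficient of variation and the mean–median gap for each transformed sample, and prefer the transformation that minimizes both. The sketch below applies this rule to two synthetic samples (a skewed one and a symmetric one) standing in for the transformed SPI data.

```python
import numpy as np

def cv_and_symmetry(sample):
    """Coefficient of variation and mean-median gap of a data sample.

    Mirrors the selection rule described for Table 1: prefer the
    transformation with the minimum CV and the closest mean and median.
    """
    mean, median = sample.mean(), np.median(sample)
    return sample.std(ddof=1) / mean, abs(mean - median)

rng = np.random.default_rng(4)
skewed = rng.exponential(1.0, 1000)      # stand-in: poorly transformed data
symmetric = rng.normal(1.0, 0.2, 1000)   # stand-in: well-transformed data
cv_bad, gap_bad = cv_and_symmetry(skewed)
cv_good, gap_good = cv_and_symmetry(symmetric)
```

The symmetric sample wins on both criteria, mirroring how the green-highlighted transforms are chosen in Table 1.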
Autocorrelation tests, as a fundamental concept in time series analysis, play a crucial role in identifying temporal dependencies and patterns within sequential data by measuring the correlation between a time series and its lagged values at various time intervals.
Figure 4 sets the stage for discussing the importance of autocorrelation tests in analyzing time series data and outlines their implications for understanding temporal dynamics within the transformed SPI datasets. It also underscores the broader importance of identifying the best transformation method for making the data more stationary across the different seasons. In this analysis, we shifted the data over time windows by generating time lags ranging from one day to half the size of the observation period. It is important to note that a small sample size can affect the quality of the results in the modeling step. Figure 4 illustrates scenarios where the time lag exhibits a robust correlation between historical and future observations.
This analysis was conducted across all the seasonal samples of daily SPI data following the linearization and sinusoidal transformations. The plots show the common time lags, providing insight into the performance of the transformation methods, and identify the maximum lag appropriate for forecasting from historical observations. In this analysis, we used data normalization to facilitate comparison across the different plots. The results from the training data show that data transformation helps stabilize the distribution across the year: shifting the daily data of the transformed samples yields longer maximum time lags than the original one. During the autumn season, the linearized data provide the maximum time lag (285 days) compared to all the other transformation techniques. Conversely, during the winter and spring periods, the sinusoidal transformation demonstrates the most effective data shift compared to both the linearized and original scales, with observed lags of 330 and 341 days, respectively.
However, during the summer season, which is characterized by a weak rainfall regime, the original data show the maximum time lag after simple normalization alone. The stationary SPI datasets derived through the mathematical transformations are used in this section to train and test the accuracy and bias of the KNN and RF models. Of the data, 70% were selected for building and training both models, which forecast the SPI from the historical and future regressor vectors provided by the autocorrelation analysis; the remaining 30% were reserved for assessing model performance. Here, we experiment with all four seasonal subsets obtained from the sampling, which together capture the annual variability of the SPI. Regarding the time lag, we opt for its maximum value in order to visualize the residual trends over the whole range of future events. In addition, the forecast data help clarify the limits of applicability of each machine-learning method. Multiple statistical tests, including R2, adjusted R2, MSE, MAE, and t-tests, were employed to control and compare the accuracy, bias, and variability between the forecast and ground truth data.
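The train/test protocol described above can be sketched as follows. The example uses a plain NumPy K-Nearest-Neighbors regressor on hypothetical lagged features with a chronological 70/30 split; in practice, scikit-learn's KNeighborsRegressor and RandomForestRegressor would slot into the same layout, and the synthetic series merely stands in for the SPI data.

```python
import numpy as np

def knn_forecast(X_train, y_train, X_test, k=3):
    """Minimal K-Nearest-Neighbors regression (K = 3, as in the paper).

    For each test vector, average the targets of the k closest
    training vectors (Euclidean distance).
    """
    preds = []
    for x_row in X_test:
        nearest = np.argsort(np.linalg.norm(X_train - x_row, axis=1))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for an SPI-like series; predict day t from the
# 7 previous days (a hypothetical lag structure, not the paper's exact one).
rng = np.random.default_rng(1)
t = np.arange(730)
x = np.sin(2 * np.pi * t / 365) + 0.05 * rng.normal(size=t.size)
lag = 7
X = np.column_stack([x[i:i + len(x) - lag] for i in range(lag)])
y = x[lag:]
split = int(0.7 * len(y))        # chronological 70/30 split
y_pred = knn_forecast(X[:split], y[:split], X[split:], k=3)
mae = np.mean(np.abs(y[split:] - y_pred))
r2 = r2_score(y[split:], y_pred)
```

A chronological rather than random split is used so that the test set lies strictly in the future of the training set, matching the forecasting setting.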
Figure 5 and Figure 6 show the graphical results of training and testing both KNN (K = 3) and RF, as applied to the transformed data over the four seasonal periods. The choice of K = 3 is based on the optimization analysis illustrated in Figure S1, where we trained and tested the model for K values ranging from 1 to 50, incorporating all the time delays provided by the cross-correlation. The optimal K is determined by its ability to maintain a strong correlation across all cases. In the testing step, we present the results on the original scale to assess the impact of data transformation on the accuracy of the forecasting model during the training phase. Very good performance is observed with the RF method compared to KNN, as demonstrated by an R2 consistently exceeding 0.86. This performance is also evident in the test phase when using the same model parameters obtained during training. Conversely, KNN exhibits some bias when utilizing normalized data, particularly during the spring and summer seasons (R2 of 0.52 and 0.54, respectively). When predicting future events through the linear transformation, both models provide a satisfactory fit with the ground truth data. However, KNN demonstrates the highest accuracy, with R2 values ranging between 0.87 and 0.92.
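The K-optimization reported for Figure S1 can be sketched as a simple sweep. The code below (illustrative only, with a synthetic series and an assumed lag structure) trains a NumPy KNN regressor for K = 1 to 50 and keeps the K with the best validation R2.

```python
import numpy as np

def knn_predict(X_tr, y_tr, X_te, k):
    """Vectorized NumPy KNN regression used to sweep K."""
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return y_tr[idx].mean(axis=1)

# Synthetic series with a hypothetical 5-day lag structure.
rng = np.random.default_rng(2)
x = np.sin(2 * np.pi * np.arange(600) / 120) + 0.1 * rng.normal(size=600)
lag = 5
X = np.column_stack([x[i:i + len(x) - lag] for i in range(lag)])
y = x[lag:]
split = int(0.7 * len(y))
scores = {}
for k in range(1, 51):                 # K = 1..50, as in Figure S1
    pred = knn_predict(X[:split], y[:split], X[split:], k)
    ss_res = np.sum((y[split:] - pred) ** 2)
    ss_tot = np.sum((y[split:] - y[split:].mean()) ** 2)
    scores[k] = 1.0 - ss_res / ss_tot  # validation R2 for this K
best_k = max(scores, key=scores.get)
```

The paper's criterion is slightly stronger (a K that stays strongly correlated across all seasons and lags), but the per-configuration scoring is the same idea.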
Table 2 shows the statistical analysis of the testing of both the KNN and RF models. It offers different insights into whether the observed bias produces a significant change in the forecast data distribution, by applying a t-test to compare these results with the ground truth data. We highlight the best results in green, determined by the maximum forecast horizon and the highest accuracy of the ML models across the different metrics. In addition, red is used to highlight the significant biases produced by both ML models according to the t-test scores (see Table 2). Generally, the best choice of forecasting method also correlates strongly with the stationarity analysis (refer to Table 1 for the case of the testing data): when the CV indicates a good result, the autocorrelation helps in detecting the maximum time lag. From the results, the linear transformation method enhances the performance of the KNN method. Moreover, the p-values reveal notable shifts in the mean values of the outcome data obtained using this method. By contrast, for the normalized data and the data derived from the sinusoidal transformation, significant bias is evident when employing the KNN model to forecast the daily SPI for spring and summer, as indicated by p-values below 0.05. RF, as an ensemble method, is well adapted to learning the nonlinear behavior of time series and can produce good results with simple normalization alone. However, the best selection uses the sinusoidal transformation to train the RF model for both the autumn and summer periods, based on the maximum forecast horizon, where the time window increases from 154 to 226 days and from 221 to 229 days, respectively.
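The bias check via the t-test can be illustrated as follows. This sketch computes the Welch two-sample t statistic with NumPy and flags |t| > 1.96, the large-sample equivalent of p < 0.05 (scipy.stats.ttest_ind would return the exact p-value). The forecast series here are synthetic stand-ins.

```python
import numpy as np

def welch_t(forecast, truth):
    """Welch two-sample t statistic comparing forecast and truth means.

    The paper flags significant bias at p < 0.05; for large samples this
    corresponds to |t| > 1.96 under the normal approximation.
    """
    n1, n2 = len(forecast), len(truth)
    se = np.sqrt(forecast.var(ddof=1) / n1 + truth.var(ddof=1) / n2)
    return (forecast.mean() - truth.mean()) / se

# Synthetic stand-ins for forecast vs. ground-truth SPI values.
rng = np.random.default_rng(3)
truth = rng.normal(0.0, 1.0, 300)
unbiased = truth + rng.normal(0.0, 0.1, 300)   # small random error only
biased = truth + 0.5                           # systematic mean shift
shift_detected = abs(welch_t(biased, truth)) > 1.96
no_shift = abs(welch_t(unbiased, truth)) <= 1.96
```

Only the systematically shifted forecast trips the significance threshold, which is the distinction the red highlighting in Table 2 encodes.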
Furthermore, Figure 7 illustrates the accuracy assessment of each proposed model using cross-validation with a 10-day time step, presenting the R2 and MAE results obtained for each season with each data transformation. The scores highlight the robustness of RF throughout all the periods, particularly when addressing nonlinear data. The combined application of the sinusoidal transformation with both ML forecasting models yields consistent and accurate outcomes, with R2 values surpassing 0.5 in each subcase. Conversely, in the scenarios involving normalization, RF displays notably better performance than KNN, as evidenced by the MAE curves (refer to Figure 7).
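One plausible reading of the 10-day cross-validation step is a rolling-origin evaluation, sketched below with a trivial persistence baseline standing in for the fitted KNN/RF models (the paper's exact scheme may differ): at each origin, the model is trained on everything before it and scored on the next 10 days, yielding one MAE per fold.

```python
import numpy as np

def rolling_origin_mae(x, model_fn, lag=7, start=200, step=10):
    """Rolling-origin evaluation with a fixed step (one MAE per fold).

    At each origin, train on everything before it and score the next
    `step` days; `model_fn(X_tr, y_tr, X_te)` returns predictions.
    """
    X = np.column_stack([x[i:i + len(x) - lag] for i in range(lag)])
    y = x[lag:]
    maes = []
    for origin in range(start, len(y) - step, step):
        pred = model_fn(X[:origin], y[:origin], X[origin:origin + step])
        maes.append(np.mean(np.abs(y[origin:origin + step] - pred)))
    return np.array(maes)

def persistence(X_tr, y_tr, X_te):
    """Trivial baseline: predict tomorrow with today's value."""
    return X_te[:, -1]

x = np.sin(2 * np.pi * np.arange(400) / 100)
maes = rolling_origin_mae(x, persistence, step=10)
```

Unlike random k-fold splitting, a rolling origin never trains on data from the future of the fold being scored, which is the appropriate protocol for time series.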
5. Conclusions
This study underscores the critical need for innovative methodologies for forecasting non-stationary time series. By focusing on the daily standardized precipitation index (SPI) as a prime example, the research addresses the complexities arising from heterogeneous precipitation variability and its impact on the temporal distribution of the SPI, which makes it difficult for many machine-learning models to fit their parameters when forecasting data across different periods. Through the introduction of a novel approach centered on adjusting non-stationary distributions and simplifying time lag detection via autocorrelation, this study achieves substantial improvements in forecasting accuracy. Moreover, it highlights the efficacy of mathematical transformations such as linearization, sinusoidal transformation, and normalization in stabilizing the data distributions over the whole observation period, facilitating the capture of maximum time lags extending up to one year. The key findings highlight the pivotal role of data transformation in minimizing tendencies and standardizing both periodicity and seasonality, thereby enhancing the performance of machine-learning algorithms such as K-Nearest Neighbors (KNN) and Random Forest (RF). The results indicate RF's remarkable accuracy in forecasting the SPI, particularly when the data have a nonlinear distribution. Conversely, KNN faces a limitation, especially during the dry period, where the mean values of the forecast data differ significantly from the ground truth; however, KNN performs very well when the data have a linear distribution. These findings emphasize the importance of adaptive methodologies in analyzing and forecasting climate data, offering valuable insights for climate change mitigation and risk management efforts.
From another perspective, the model faces a limitation in generating a continuous data series when different time lag values are obtained from each seasonal sample. In our forthcoming research, we aspire to explore additional transformation techniques, including data decomposition and smoothing with parametric models, and to evaluate the efficacy of these methods in combination with deep-learning models for spatiotemporal forecasting.