## **1. Introduction**

Demand forecasting, a prerequisite for inventory decision-making, plays a vital role in supply chain management, and improving prediction accuracy has long been a focus of both academia and industry. As business competition intensifies, product variety has become an important feature of modern enterprises: it helps meet the diverse needs of customers and capture more market segments [1]. However, large product portfolios, often comprising hundreds or thousands of items, pose a new challenge to demand forecasting. Traditional time series algorithms cannot cope well with such high- or even ultra-high-dimensional data, resulting in inferior predictive performance in multi-product scenarios.

It is worth noting that the demands of multiple products are not completely isolated; rather, complex relationships exist between them. According to the relevant literature, there are two common association relationships between different products: correlation and Granger causality. For example, the demands for complementary products are highly correlated, exerting contemporaneous influence on each other [2]. Materials used in engineering projects are procured in a clear sequence, so Granger causality exists in their demand [3]. Capturing and making full use of such latent information can help produce more accurate predictions. Moreover, when the time series are short, the historical trend cannot provide enough information about future demand; association relationships can make up for this deficiency and reduce prediction bias. However, as far as we know, little research to date has taken association relationships between products into account in demand forecasting. In this paper, we incorporate such relationships into the forecasting framework to construct a more accurate prediction approach.

In the previous literature, there are mainly three branches of forecasting models for multi-dimensional time series. The first comprises a series of statistical methods, represented by the ARIMA model and its extended versions, including VARMA, VARs, BVAR, etc. [4–13]. They treat multi-dimensional time series as an endogenous system: target variables are regressed on lagged values of all series, so their interrelations are considered in a general way. With the development of econometrics, VARs with different settings have been widely applied. For example, [8] proposed five types of VAR and used industrial production data from OECD countries to test their forecasting effect. The key defect of VAR is that the number of estimated parameters grows quadratically with the dimension. For high-dimensional time series, this easily causes overfitting, weakening out-of-sample prediction ability. Some scholars have assumed that the estimated parameters obey a specific prior distribution to reduce their effective number, i.e., BVAR, which has been applied in macro-economic forecasting [14–17], market share forecasting [12] and business forecasting [10]. Other scholars have incorporated unmodeled predictors from exogenous variables to improve the original regression models. For example, [18] integrated intra- and inter-category promotional information into a multistage LASSO regression to forecast the demand of 3222 products; the results were significantly better than those of a model using only endogenous variables. Unfortunately, these methods can alleviate but not completely solve the problem of overfitting. Accurate results can be obtained only when the time series are long enough.

The second strand of research concentrates on processing high-dimensional time series through dimension reduction, represented by the dynamic factor model (DFM). [19] holds that a small number of latent factors can explain the fluctuations of observed macroeconomic indices. As long as these latent factors can be portrayed accurately, the forecasting task is simplified substantially and precise results are achievable. There are many algorithms for the estimation of dynamic factors, including maximum likelihood [19–22], principal component analysis (PCA) [23–32], and data shrinking methods [33,34]. As for prediction accuracy, [35] collected the relevant literature and confirmed through a meta-analysis that DFM performs better than single time series prediction models. Reference [36] pointed out that a simple AR model may outperform a DFM when there is a large structural change in the data. Compared with other high-dimensional time series approaches such as shrinkage forecasts, DFM is also superior [37]. Furthermore, [38] used U.S. macroeconomic data to compare two forms of DFM estimation [39,40]; the results demonstrated that their forecasting precision is not significantly different. However, DFM struggles with sophisticated high-dimensional time series because some series are isolated, and the same is true for ultra-high-dimensional time series. More specifically, some unique information may be skipped in the process of selecting a limited number of factors, leading to inefficient estimates; adding more dynamic factors brings back the over-parameterization problem.

With the development of artificial intelligence, various machine learning models have been widely used in forecasting, including neural networks [41–43], support vector machines [44–46], nearest neighbor regression [47,48], and so on. They are serious contenders to classical statistical models and form a vital research branch. Different from statistical models, these models construct the dependency between historical data and future values through a black-box, nonlinear process. Reference [49] compared eight types of machine learning models, finding that their ranking is unambiguous and does not change much with different time series features. Reference [50] tested the accuracy of some popular machine learning models; the results demonstrated that their performance was inferior to that of eight traditional statistical models. In addition, [50] points out that machine learning models need to become more accurate, reduce their computational load, and be less of a black box. Therefore, in this paper, we optimize statistical models by exploiting association relationships to obtain higher accuracy, rather than turning to machine learning models.

By summarizing the previous literature on multi-dimensional time series analysis, we find that these methods all fail to deal with the situation where the product dimension is large but the time dimension is small. For this situation, we construct an improved forecasting model in which each target variable is predicted from its own past values together with endogenous predictors selected through association relationships. If many time series are associated with a target, we adopt two feasible schemes to simplify the variable space. We then conduct an empirical study using an actual dataset from a Chinese grid company. The forecasting errors and inventory simulation results show that the new approaches are superior to conventional time series forecasting models, including SES, AR, VAR and DFM. Generally speaking, the proposed methods have three major advantages. Firstly, the number of estimated parameters is reduced significantly and no longer grows quadratically with the dimension. Secondly, each variable has a customized forecasting regression, which can describe isolated time series well. Thirdly, it is not necessary to collect extra data to act as exogenous variables. Therefore, one contribution of our work is that the new approaches innovatively incorporate association relationships into demand forecasting, reducing the excessive dependence on historical data, so they can be applied to forecast short time series with large dimensionality, filling a gap left by previous algorithms. In addition, we contribute to solving overfitting problems, providing a new direction for subsequent research. Besides these theoretical implications, the study also has practical significance. Product life cycles, especially those of high-tech products, are getting shorter, and many new products appear as technological innovation accelerates; forecasting demand from a limited number of time points is therefore common and necessary in actual business activities. Our new approaches thus have a wide range of application scenarios and can provide a more accurate decision-making basis for practitioners.

The remainder of this paper is organized as follows: In Section 2, we give a brief description of two conventional forecasting models for multi-dimensional time series, i.e., VAR and DFM, and then present our new forecasting approaches based on correlation and Granger causality, respectively. Section 3 describes the procurement dataset and analyzes the relationships between the demands for purchased products. An empirical study and its results are discussed in Section 4. Finally, we summarize our conclusions in Section 5.

## **2. Forecasting Model and Evaluation**

In this section, we first review two common models used in multivariate forecasting. Then we detail our new approaches, which utilize correlation and Granger causality among products, respectively, to improve prediction accuracy. Finally, we introduce several indices to evaluate forecasting performance.

## *2.1. VAR and DFM*

VAR and DFM are two conventional models used to forecast demand in multi-product scenarios. Both of them have specific limitations: they struggle with high (or ultra-high) dimensionality and fail to describe the evolution of short time series. We first introduce the VAR model. Assume that the demand for $N$ products at time $t$ is $\mathbf{x}_t = (x_{1,t}, x_{2,t}, \dots, x_{N,t})'$, $t = 1, 2, \dots, T$, where $x_{j,t}$ represents the demand for the $j$th product at time $t$. The VAR model is as follows:

$$B(L)\mathbf{x}_t = \alpha + \varepsilon_t \tag{1}$$

where $B(L) = I_N - B_1 L - B_2 L^2 - \cdots - B_p L^p$ is a matrix polynomial with $p$ lags in total, $B_j$ is the $N \times N$ parameter matrix of the $j$th lag, and $L$ is the lag operator defined by $L^j \mathbf{x}_t = \mathbf{x}_{t-j}$. $\alpha$ is an $N \times 1$ constant vector, and $\varepsilon_t$ is an $N \times 1$ white-noise vector without contemporaneous correlation. According to (1), a total of $p \times N \times N$ free parameters need to be estimated in the VAR model; as the number of products $N$ increases, the number of parameters grows quadratically. Therefore, VAR can obtain efficient estimates only when the time series have moderate dimensionality, i.e., when the number of time points is large enough relative to the number of products.
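To make the parameter count concrete, the following sketch (ours, not from the paper; it assumes Python with the `statsmodels` library and uses synthetic data) fits a small VAR and verifies the $p \times N \times N$ figure plus intercepts:

```python
import numpy as np
from statsmodels.tsa.api import VAR

# Synthetic demand panel: T = 48 months, N = 10 products (illustrative only).
rng = np.random.default_rng(0)
T, N, p = 48, 10, 2
data = rng.poisson(lam=20, size=(T, N)).astype(float)  # rows are time points

result = VAR(data).fit(p)

# Each of the p lag matrices is N x N; the intercept adds N more parameters.
print(result.params.size)   # (p*N + 1) * N = 210 estimated coefficients
print(p * N * N + N)        # same count, from the formula in the text
```

Even at $N = 10$, the model estimates 210 coefficients in total (21 per equation) from only 46 effective observations per equation, which illustrates why the sample length must be large relative to the number of products.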

As for DFM, it extracts a few dynamic factors that explain most of the variation of the target variables and uses them as predictors, turning the curse of dimensionality into a blessing. However, when the number of products is large, some isolated products are unavoidable, and the common factors cannot explain their demand accurately. Keeping the previous assumptions about $\mathbf{x}_t$, the general DFM is as follows:

$$\mathbf{x}_t = \Gamma(L) f_t + \varepsilon_t \tag{2}$$

$$\Psi(L) f_t = \eta_t \tag{3}$$

where $f_t = (f_{1,t}, f_{2,t}, \dots, f_{m,t})'$ is an $m$-dimensional column vector representing the values of $m$ ($m < N$) unobserved factors at time $t$; it can supplant the originally large dataset. $\Gamma(L) = \Gamma_0 + \Gamma_1 L + \Gamma_2 L^2 + \cdots + \Gamma_p L^p$ and $\Psi(L) = I_m + \Psi_1 L + \Psi_2 L^2 + \cdots + \Psi_q L^q$, where the meanings of these parameters are analogous to those of $B(L)$ above. $\varepsilon_t$ and $\eta_t$ are residuals satisfying certain idiosyncratic assumptions. Equation (3) is used to obtain predictive values of the dynamic factors, which are then applied in Equation (2).

We can see that the quality of the factors is the key determinant of the accuracy of DFM. As mentioned above, there are many methods to extract factors; among them, PCA is commonly used in the forecasting literature [28]. In PCA estimation, assume that $\Gamma(L) = \Gamma_0$, i.e., the original time series are only influenced by contemporaneous factors. Because $f_t$ and $\varepsilon_t$ are uncorrelated at all lags, we can decompose the covariance matrix of $\mathbf{x}_t$ into two parts:

$$\Sigma_{xx} = \Gamma_0 \Sigma_{ff} \Gamma_0' + \Sigma_{\varepsilon\varepsilon} \tag{4}$$

where $\Sigma_{ff}$ and $\Sigma_{\varepsilon\varepsilon}$ are the covariance matrices of $f_t$ and $\varepsilon_t$, respectively. Under the assumptions that the eigenvalues of $\Sigma_{\varepsilon\varepsilon}$ are $O(1)$ and $\Gamma_0' \Gamma_0$ is $O(N)$, the first $m$ eigenvalues of $\Sigma_{xx}$ are $O(N)$ and the remaining eigenvalues are $O(1)$. Therefore, the first $m$ principal components of $\mathbf{x}_t$ can act as the dynamic factors. If $\Gamma_0$ is known, the estimator of $f_t$ can be calculated directly by OLS, i.e., $\hat{f}_t = (\Gamma_0' \Gamma_0)^{-1} \Gamma_0' \mathbf{x}_t$. However, $\Gamma_0$ is usually unknown. Similar to regression, the following optimization problem estimates $\Gamma_0$ and $f_t$:

$$\min_{f_1, f_2, \dots, f_T, \Gamma_0} \frac{1}{T} \sum_{t=1}^{T} (\mathbf{x}_t - \Gamma_0 f_t)'(\mathbf{x}_t - \Gamma_0 f_t), \quad \text{s.t. } \Gamma_0' \Gamma_0 = I_m. \tag{5}$$

The first-order condition for minimizing (5) with respect to $f_t$ gives $\hat{f}_t = (\hat{\Gamma}_0' \hat{\Gamma}_0)^{-1} \hat{\Gamma}_0' \mathbf{x}_t$. Substituting this into the objective function shows that $\hat{\Gamma}_0$ equals the matrix of the first $m$ eigenvectors of $\hat{\Sigma}_{xx}$, where $\hat{\Sigma}_{xx} = T^{-1} \sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'$; a more detailed derivation can be found in [28]. Correspondingly, $\hat{f}_t = \hat{\Gamma}_0' \mathbf{x}_t$ is the vector of the first $m$ principal components of $\mathbf{x}_t$, and it is the final PCA estimator of the dynamic factors in DFM. Finally, let $\mathbf{x}_{t,forecast} = \hat{\Gamma}_0 \hat{f}_{t,forecast}$ to obtain predictive values of the original time series.
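As an illustration of the PCA estimation just described, the following sketch (ours; synthetic data, and an AR(1) for the factor dynamics of Equation (3) chosen purely for simplicity) computes the loadings and factors from the sample covariance matrix and produces a one-step forecast:

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

def pca_dfm_forecast(X, m, factor_lags=1):
    """One-step DFM forecast via PCA, following Equations (2)-(5).

    X : (T, N) array of demand series (rows are time points).
    m : number of dynamic factors to keep (m < N).
    """
    Xc = X - X.mean(axis=0)                      # center each series
    Sigma_xx = Xc.T @ Xc / Xc.shape[0]           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma_xx)  # eigh sorts ascending
    Gamma0 = eigvecs[:, ::-1][:, :m]             # loadings: first m eigenvectors
    F = Xc @ Gamma0                              # factors: f_t = Gamma0' x_t

    # Forecast each factor with a simple autoregression (Equation (3)).
    f_next = np.array([
        AutoReg(F[:, j], lags=factor_lags).fit().forecast(1)[0]
        for j in range(m)
    ])
    return Gamma0 @ f_next + X.mean(axis=0)      # x_forecast = Gamma0 f_forecast

# Example with synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20)).cumsum(axis=0)
print(pca_dfm_forecast(X, m=3)[:5])
```

Because the eigenvectors of a covariance matrix are orthonormal, `Gamma0` automatically satisfies the constraint $\Gamma_0' \Gamma_0 = I_m$ in (5).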

## *2.2. The Forecasting Approach Based on Correlation*

Association relationships between multiple products can provide rich information for demand forecasting. Mining effective predictors from associated time series, instead of from all series, helps eliminate irrelevant information and reduces the number of parameters significantly. Based on this belief, we propose new approaches based on two typical association relationships, namely correlation and Granger causality. We show that they achieve higher accuracy and work well even when a wide range of products has only a limited number of data points in the time dimension.

We start with the forecasting approach based on correlation between products. If two products are highly correlated, their demands interact within the same period. For example, if the demand for a product increases, its complementary products will also see a rise in demand at the same time, while its substitutes will experience a decline. We therefore utilize such hidden information to modify the forecasting algorithm and obtain more accurate results. The approach consists of three main steps. Firstly, find a proper variable subset for each product in terms of correlation relationships. Specifically, calculate the correlation coefficients between the target product and all other products; those highly correlated with the target, i.e., whose correlation coefficient exceeds a certain threshold, constitute the proper variable subset. If a product has no highly correlated counterparts, its proper variable subset is empty. Secondly, run an autoregressive (AR) model for each product to obtain the original predictive values of its demand. AR depends only on the past values of a time series to forecast its future evolution, ignoring the useful information hidden in other correlated series. Therefore, the third step is to reconstruct the forecasting model for products whose proper variable subset is not empty, adding these proper variables to the AR model to obtain the final results. It is worth noting that in some cases one product may have many highly correlated products, i.e., many proper variables. If there are not enough time points, adding too many proper variables leads to overfitting, just as in VAR. For example, in this paper, the training sample contains only 36 time points. Following the rule of thumb that the number of estimated parameters should be less than one tenth of the number of observations, the forecasting model should contain no more than 3 parameters. Since the autoregressive part of the original time series occupies at least two parameters, only one correlation-based predictor can be selected. Therefore, we propose two feasible schemes to control the scale of the forecasting model, as follows:

Scheme (I): Only select the product having the highest correlation with the target from the proper variable subset as the predictor added to the final model.

Scheme (II): Extract the first principal component of all elements in the proper variable subset as the predictor added to the final model.

More formally, let $X = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T]$ represent the time series of demand for all products during the $T$ periods; the correlation matrix $\rho_{XX}$ of $X$ is then given by:

$$\rho_{XX}(i,j) = \frac{\operatorname{cov}(X_{i\cdot}, X_{j\cdot})}{\sqrt{\operatorname{var}(X_{i\cdot}) \times \operatorname{var}(X_{j\cdot})}} \tag{6}$$

where $X_{i\cdot}$ is the $i$th row of $X$, i.e., the demand series of the $i$th product. According to $\rho_{XX}$, we can pick out the products highly correlated with $X_{i\cdot}$, which make up the proper variable subset for the $i$th product. The autoregression $x_{i,t} = \alpha_i + \beta_{i,1} x_{i,t-1} + \beta_{i,2} x_{i,t-2} + \cdots + \beta_{i,p} x_{i,t-p} + \varepsilon_{i,t}$ yields the original predictive demand $\hat{x}_{i,t}$ for the $i$th product in period $t$, where the lag order $p$ is determined by the Akaike information criterion (AIC). On this basis, $\hat{X}^{[i]}$ is a matrix containing the original prediction values of all proper variables for the $i$th product; its rows represent the time dimension and its columns correspond to products, ranked from left to right in descending order of their correlation coefficients with $X_{i\cdot}$. Assume that $f_i = [f_{i,2}, f_{i,3}, \dots, f_{i,T}]$ is the effective predictor selected from $\hat{X}^{[i]}$ to improve forecasting; because it corresponds to prediction values, the first time point is missing. The final model is as follows:

$$x_{i,t} = \alpha_i + \beta_{i,1} x_{i,t-1} + \beta_{i,2} x_{i,t-2} + \cdots + \beta_{i,p} x_{i,t-p} + \beta_{i,p+1} f_{i,t} + \varepsilon_{i,t} \tag{7}$$

Scheme (I) sets $f_i = \hat{X}^{[i]}_{\cdot 1}$, where $\hat{X}^{[i]}_{\cdot 1}$ is the first column of $\hat{X}^{[i]}$, i.e., the product most highly correlated with the $i$th one. Under Scheme (II), $f_i$ is the first principal component of $\hat{X}^{[i]}$, computed as follows: (i) compute the covariance matrix $\Sigma_{\hat{X}}$ of $\hat{X}^{[i]}$; (ii) determine the eigenvalue–eigenvector pairs $(\lambda_1, e_1), (\lambda_2, e_2), \dots, (\lambda_n, e_n)$ of $\Sigma_{\hat{X}}$, where $\lambda_1 > \lambda_2 > \cdots > \lambda_n$; (iii) take the first principal component $f_i = (\hat{X}^{[i]} e_1)'$.
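The whole procedure can be summarized in code. The sketch below is our illustration only: the correlation threshold, the fixed lag order `p` (the paper selects the lag order by AIC), and all function names are assumptions.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg

def correlation_forecast(X, i, threshold=0.8, scheme="I", p=1):
    """One-step forecast of product i via the correlation-based approach.

    X : (N, T) array of demand, one row per product (as in the text).
    threshold and the fixed lag p are illustrative choices, not values
    prescribed by the paper.
    """
    rho = np.corrcoef(X)                                   # Equation (6)
    peers = [j for j in np.argsort(-np.abs(rho[i]))        # sort by |rho|
             if j != i and abs(rho[i, j]) >= threshold]

    def ar_fit(series):                                    # Step 2: plain AR
        return AutoReg(series, lags=p).fit()

    if not peers:                                          # empty proper subset
        return ar_fit(X[i]).forecast(1)[0]

    # X_hat^{[i]}: in-sample AR predictions of the proper variables,
    # columns ordered by descending |correlation| with product i.
    Xhat = np.column_stack([ar_fit(X[j]).fittedvalues for j in peers])
    f_next = np.array([ar_fit(X[j]).forecast(1)[0] for j in peers])

    if scheme == "I":                                      # Scheme (I)
        f, f_new = Xhat[:, 0], f_next[0]
    else:                                                  # Scheme (II): first PC
        C = np.atleast_2d(np.cov(Xhat, rowvar=False))
        w = np.linalg.eigh(C)[1][:, -1]                    # leading eigenvector
        f, f_new = Xhat @ w, f_next @ w

    # Final model, Equation (7): own lags plus the predictor f_{i,t}.
    y = X[i, p:]                                           # targets aligned with f
    lags = np.column_stack([X[i, p - k - 1: X.shape[1] - k - 1]
                            for k in range(p)])
    beta = sm.OLS(y, sm.add_constant(np.column_stack([lags, f]))).fit().params
    z_new = np.concatenate([[1.0], X[i, -p:][::-1], [f_new]])
    return z_new @ beta

# Example with synthetic, strongly correlated series (36 periods, 8 products).
rng = np.random.default_rng(2)
base = rng.normal(size=36).cumsum()
X = base + 0.3 * rng.normal(size=(8, 36))
print(correlation_forecast(X, i=0, scheme="II"))
```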

## *2.3. The Forecasting Approach Based on Granger Causality*

If Granger causality exists between two products, the historical observations of one product can explain the future demand of the other (with a time lag between them). This situation often occurs when the procurement of products follows a stable sequence, such as material procurement in engineering projects. The idea of the forecasting approach based on Granger causality is similar to that of the correlation-based approach and also consists of three steps. Firstly, find the proper variable subset for every product by running Granger causality tests: when the *p*-value of the test satisfies a critical condition, the corresponding product joins the proper variable subset of the target. Secondly, run AR for every product to obtain its original predictive demand. Finally, for each product whose proper variable subset is not empty, select effective predictors from the subset to reconstruct its forecasting model. Similarly, there are two schemes to prevent an excessive number of parameters:

Scheme (I): Only select the product having the lowest *p*-value in the Granger test with the target from the proper variable subset as the predictor added to the final model.

Scheme (II): Extract the first principal component of all elements in the proper variable subset as the predictor added to the final model.

Assume $P^k_{XX}$ is the *p*-value matrix of the Granger tests for $X$ with $k$ lags, where $k$ is determined by AIC. The rows of $P^k_{XX}$ correspond to the Granger effects while the columns correspond to the Granger causes. According to $P^k_{XX}$, we can construct the proper variable subset for the $i$th product, expressed as a matrix $X^{[i,k]}$; the Granger cause with the lowest *p*-value for the $i$th product is arranged in the first column of $X^{[i,k]}$, and so on. Let $f^k_i = [f^k_{i,1}, f^k_{i,2}, \dots, f^k_{i,T}]$ represent the effective predictor extracted from $X^{[i,k]}$. Under Scheme (I), $f^k_i = X^{[i,k]}_{\cdot 1}$; under Scheme (II), $f^k_i$ is the first principal component of $X^{[i,k]}$. The final forecasting model based on Granger causality is as follows:

$$x_{i,t} = \alpha_i + \beta_{i,1} x_{i,t-1} + \beta_{i,2} x_{i,t-2} + \cdots + \beta_{i,p} x_{i,t-p} + \gamma_{i,1} f^k_{i,t-1} + \varepsilon_{i,t} \tag{8}$$
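The first step can be sketched as follows (ours; it relies on `grangercausalitytests` from `statsmodels`, whose convention is that the second column is tested as the Granger cause of the first, and the critical level of 0.05 is an illustrative choice). The remaining steps mirror the correlation-based sketch in Section 2.2.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def granger_pvalue_matrix(X, k):
    """p-value matrix of pairwise Granger tests with k lags.

    X : (N, T) array, one row per product. Entry (i, j) holds the p-value
    of 'product j Granger-causes product i', matching the row/column
    convention described in the text. Note the test prints its report.
    """
    N = X.shape[0]
    P = np.ones((N, N))
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # Column 2 is tested as the Granger cause of column 1.
            res = grangercausalitytests(np.column_stack([X[i], X[j]]),
                                        maxlag=[k])
            P[i, j] = res[k][0]["ssr_ftest"][1]   # p-value of the F test
    return P

# Proper variable subset for product i: row entries below a critical level,
# sorted by ascending p-value (Scheme (I) then takes the first element).
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 40)).cumsum(axis=1)       # synthetic demand panel
alpha, i, k = 0.05, 0, 2                          # illustrative settings
P = granger_pvalue_matrix(X, k)
subset = [j for j in np.argsort(P[i]) if j != i and P[i, j] < alpha]
```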

## *2.4. The Forecast Accuracy Measures*

According to the previous literature, there are two major ways to evaluate the performance of demand forecasting approaches: forecasting errors and inventory performance, which reflect forecasting accuracy and actual inventory management, respectively. It is worth noting that the dataset used in this paper contains intermittent demand series: the demand of some products is zero in some periods. Therefore, we adopt the absolute scaled error (ASE) to measure forecasting errors, which avoids the infinities caused by division by zero [51]. The formula is as follows:

$$ASE_t = \frac{\left| y_t - \hat{y}_t \right|}{\frac{1}{n-1} \sum_{i=1}^{n-1} \left| y_{i+1} - y_i \right|} \tag{9}$$

The mean absolute scaled error is then $MASE = \operatorname{mean}(ASE_t)$. A forecasting approach with a lower MASE is more accurate over the whole forecasting period, so we can compare different approaches by their MASE values. In addition, we apply the relative error (RE) to measure the accuracy of forecasters, i.e., the ratio of their ASE to that of a baseline model. In this paper, we set the simple exponential smoothing (SES) model as the benchmark; see [52]. For multi-period demand forecasting, the overall judgement by RE is usually based on the geometric mean instead of the arithmetic mean [51,53]. The geometric mean relative absolute scaled error is expressed as $GMRASE = \operatorname{gmean}(e_t / e^*_t)$, where $e^*_t$ denotes the errors of the baseline model.
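For concreteness, here is a minimal sketch of the two measures (ours; in the paper the scaling denominator of Equation (9) comes from the training sample, whereas here the same short series is reused for brevity, and all numbers are invented):

```python
import numpy as np

def ase(y, y_hat):
    """Absolute scaled error per period, Equation (9)."""
    scale = np.mean(np.abs(np.diff(y)))   # (1/(n-1)) * sum |y_{i+1} - y_i|
    return np.abs(y - y_hat) / scale

def mase(y, y_hat):
    return np.mean(ase(y, y_hat))

def gmrase(e, e_star):
    """Geometric mean of relative ASE against a baseline (here SES);
    assumes strictly positive errors, since log(0) is undefined."""
    return np.exp(np.mean(np.log(e / e_star)))

# Invented demand, model forecasts, and SES baseline forecasts.
y     = np.array([12., 0., 9., 14., 11., 13.])
y_hat = np.array([10., 2., 8., 15., 12., 12.])
y_ses = np.array([11., 5., 6., 12., 13., 12.])
print(mase(y, y_hat), gmrase(ase(y, y_hat), ase(y, y_ses)))
```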

In fact, the goal of optimizing forecasting accuracy is to provide better guidance for ordering strategy and inventory management, ultimately reducing inventory costs and improving managerial efficiency. How forecasting results influence inventory performance is also a concern for scholars and enterprises, and many studies assess forecasts by means of inventory simulations [54–58]. Therefore, we also introduce inventory performance to evaluate the forecasting approaches. The order-up-to-level policy, commonly used in practice, is adopted in the inventory simulation. We set the inventory review period to one month, consistent with the prediction period. The order-up-to level is $S = \hat{D} + SS$, where $\hat{D}$ is the predicted demand during the lead time (one month) and $SS$ is the safety stock related to the desired service level. At the beginning of each period, the holding stock $H$ is checked: if $H$ is below $S$, an order of quantity $S - H$ is placed; otherwise, nothing needs to be done. When a stock-out occurs, the unmet demand is served in the next period. To initialize the simulation, we assume full stock at the beginning, i.e., $H = S$. One index of inventory performance is the total inventory cost, consisting of shortage costs and holding costs: total inventory cost = unit total cost × (mean inventory per month × $a$ + mean stock-out per month × $b$) [52]. The cost parameters $a$ and $b$ reflect the trade-off between stock-holding and stock-outs: $b > a$ means that stock-outs are more expensive, whereas $b < a$ means that holding a unit of stock costs more. Another index is the inventory ratio, where a smaller value means higher inventory efficiency. It is calculated by the following expression:

$$\frac{\text{mean holding stock}}{\text{mean demand}}.$$
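The simulation logic can be sketched as follows (our simplification: replenishment is assumed to arrive within the review period, backorders are carried to the next period as described above, and all parameter values are illustrative):

```python
import numpy as np

def simulate_order_up_to(demand, forecasts, ss, a=1.0, b=2.0, unit_cost=1.0):
    """Order-up-to-level simulation with monthly review and backordering.

    demand, forecasts : arrays of actual and predicted demand per period.
    ss : safety stock; a, b : holding / shortage cost weights (b > a means
    stock-outs are the more expensive side of the trade-off).
    """
    on_hand = forecasts[0] + ss             # start fully stocked: H = S
    backlog = 0.0
    holding, shortage = [], []
    for d, d_hat in zip(demand, forecasts):
        S = d_hat + ss                      # order-up-to level this period
        if on_hand < S:                     # review: order S - H units
            on_hand = S
        need = d + backlog                  # backordered demand served next
        served = min(on_hand, need)
        backlog = need - served
        on_hand -= served
        holding.append(on_hand)
        shortage.append(backlog)
    total_cost = unit_cost * (a * np.mean(holding) + b * np.mean(shortage))
    inventory_ratio = np.mean(holding) / np.mean(demand)
    return total_cost, inventory_ratio

# Illustrative run with synthetic demand and noisy stand-in forecasts.
rng = np.random.default_rng(4)
demand = rng.poisson(20, size=12).astype(float)
forecasts = demand + rng.normal(0, 3, size=12)
print(simulate_order_up_to(demand, forecasts, ss=5.0))
```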
