3.2. Feature Engineering
After pre-processing the Rossmann and Walmart datasets, feature selection is first performed using the MRMR and RFE techniques. In general, feature selection reduces the feature set by retaining the most informative features and eliminating redundant ones. MRMR is chosen because it identifies the relevant features while minimizing the redundancy among them, thus enhancing the downstream accuracy; on these criteria (accuracy and number of supportive features), MRMR outperforms comparable techniques. MRMR is a feature measurement criterion that computes the redundancy and the correlation between features based on mutual information. Here, the MRMR technique performs feature selection by satisfying two conditions, maximal relevance ($D$) and minimal redundancy ($R$), which are mathematically specified in Equations (3) and (4). Moreover, $I$ denotes the mutual information, and $S$ denotes the set of features. The mathematical expression of the mutual information $I$ is represented in Equation (5).
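The MRMR difference criterion (relevance minus redundancy, per Equations (3)–(5)) can be sketched as follows. This is a minimal illustration using scikit-learn's mutual-information estimator, assuming continuous features; `mrmr_select` and its arguments are illustrative names, not the paper's implementation.

```python
# Minimal MRMR sketch: greedily pick features that maximize relevance to the
# target while minimizing mean redundancy with already-selected features.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr_select(X, y, k):
    relevance = mutual_info_regression(X, y)      # I(f_i; y) for every feature
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best_score, best_j = -np.inf, None
        for j in remaining:
            if selected:
                # Mean mutual information with the features chosen so far
                redundancy = np.mean(
                    [mutual_info_regression(X[:, [s]], X[:, j])[0] for s in selected]
                )
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy     # MRMR difference criterion
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Example: pick the 6 best features from a toy matrix
X = np.random.default_rng(0).normal(size=(300, 10))
y = X[:, 2] - 0.5 * X[:, 7] + np.random.default_rng(1).normal(size=300)
print(mrmr_select(X, y, k=6))
```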
On the other hand, RFE is an effective feature selection technique that repeatedly fits the model and eliminates the least relevant features until only the discriminative, active features remain [33]. RFE is selected here because it efficiently reduces the dimensionality of high-dimensional datasets by removing redundant features, which lowers the computational and storage requirements. The RFE technique offers three major benefits: (i) complete elimination of irrelevant information in the data, (ii) ease of data visualization, and (iii) limited computational power requirements. By combining Equations (3) and (4), Equation (6) is obtained, and the RFE technique efficiently selects the optimal feature subsets using Equation (6).
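A brief RFE sketch with scikit-learn is shown below; the base estimator, the synthetic data, and the number of retained features are illustrative assumptions, not the paper's exact configuration.

```python
# RFE: fit the estimator, drop the weakest feature, and repeat until only
# the requested number of features remains.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 12))                    # toy pre-processed data
y_train = X_train[:, 0] * 2.0 + rng.normal(size=200)

estimator = RandomForestRegressor(n_estimators=100, random_state=0)
rfe = RFE(estimator=estimator, n_features_to_select=6, step=1)
rfe.fit(X_train, y_train)

selected_mask = rfe.support_     # boolean mask over the original features
ranking = rfe.ranking_           # rank 1 = selected feature
print(selected_mask, ranking)
```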
In addition, feature optimization is carried out using the APSO algorithm, which selects active features from the pre-processed Rossmann and Walmart datasets. APSO is applied here because it tunes its algorithmic parameters at run time to enhance exploitation efficiency, and it performs a global search over the entire search space with a high convergence speed. As discussed in the previous sections, this process decreases the complexity of the Bi-GRU model and its processing time. The conventional PSO algorithm is an effective metaheuristic optimization algorithm [34] that mimics the flocking behavior of birds and the schooling behavior of fish [35]. The velocity and the position of the particles are updated in the PSO algorithm by Equations (7) and (8).
where the global best position of the particles is represented as $g_{best}$; the present (personal) best positions of the particles are denoted as $p_{best}$; the random numbers are indicated as $r_1$ and $r_2$; the acceleration coefficients are denoted as $c_1$ and $c_2$; the inertia weight, used for balancing the local and the global searches, is represented as $w$; and the iteration number is indicated as $t$.
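Equations (7) and (8) follow the standard PSO velocity and position updates. A minimal NumPy sketch using the symbols just defined is given below; the inertia weight value is illustrative, while $c_1 = 2$ and $c_2 = 3$ match the settings reported later in this section.

```python
# Standard PSO update: Eq. (7) moves the velocity toward the personal and
# global bests; Eq. (8) advances the position by the new velocity.
import numpy as np

rng = np.random.default_rng(42)
w, c1, c2 = 0.7, 2.0, 3.0          # inertia weight (illustrative) and acceleration coefficients

def pso_step(x, v, p_best, g_best):
    r1 = rng.random(x.shape)        # uniform random numbers in [0, 1)
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # Eq. (7)
    x_new = x + v_new                                                # Eq. (8)
    return x_new, v_new
```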
The APSO algorithm optimizes the features based on the Adaptive Uniform Mutation (AUM) function from Human Group Optimization (HGO), where the particles' positions (features) are denoted as $x_i$. The AUM function extends the feature-optimization ability in the exploration phase. Additionally, a nonlinear function, $\mu$, is employed to control the mutation range and the mutation decision for each particle. The nonlinear function, $\mu$, is updated in each iteration by Equation (9).
As the iteration count increases, the nonlinear function, $\mu$, tends to decrease, where the maximum number of iterations is represented as $t_{max}$. The mutation randomly selects active features from the datasets whenever the nonlinear function, $\mu$, is higher than a random number drawn uniformly between zero and one. The selected active features from the Rossmann and Walmart datasets are finally passed to the Bi-GRU model for retail sales forecasting. The APSO algorithm terminates when it reaches the maximum number of iterations (100).
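A sketch of the AUM step is given below, assuming for illustration that the nonlinear control function $\mu$ decays linearly with the iteration count; the paper's exact Equation (9) may use a different decay form.

```python
# Adaptive uniform mutation: each dimension of a particle is uniformly
# resampled when the decaying control function exceeds a U(0, 1) draw.
import numpy as np

rng = np.random.default_rng(0)
t_max = 100                                    # maximum number of iterations

def adaptive_uniform_mutation(position, t, lower, upper):
    mu = 1.0 - t / t_max                       # assumed linear decay of the control function
    mutated = position.copy()
    for d in range(position.size):
        if mu > rng.random():                  # mutate when mu exceeds the random number
            mutated[d] = rng.uniform(lower[d], upper[d])
    return mutated
```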
The parameters considered in the APSO algorithm are as follows: the cognitive constant, $c_1$, is two; the social constant, $c_2$, is three; the size of the population is 100; and the number of iterations is 100. The features selected from the Walmart dataset are the date, Consumer Price Index (CPI), fuel prices, store, weekly_sales, and holiday_flag. Correspondingly, the features selected from the Rossmann dataset are the day of the week, open, promo, customers, sales, and store number. The architecture of the APSO algorithm is shown in
Figure 2.
The step-by-step procedure of the APSO algorithm is specified as follows.
Step 1: The swarm particles, size, location, objective, and number of iterations are initialized, and the non-dominated solutions are saved into the external archive.
Step 2: To update the global best position, $g_{best}$, the Pareto domination relation is applied.
Step 3: Owing to the multiplicity of solutions, $g_{best}$ is chosen from the archive: first, the crowding distance is estimated, and then a binary tournament is used to select $g_{best}$ (a sketch of this selection step is given after the procedure).
Step 4: The decision value is then reset, depending on $g_{best}$; each value of the feature vector is treated as a binary value.
Step 5: Based on Step 4, the particle's position and velocity are updated.
Step 6: Uniform mutation is performed.
Step 7: The external archive is then updated by means of the crowding distance.
Step 8: Termination: if the proposed method reaches the maximum iteration, the process stops; otherwise, it repeats from Step 2. In this way, the worst particles are removed by HGO. After choosing the optimal features from the Rossmann dataset (day of week, open, promo, customers, sales, and store number) and the Walmart dataset (date, CPI, fuel prices, store, weekly_sales, and holiday_flag) using MRMR, RFE, and APSO, forecasting is performed using the Bi-GRU model, which is described in the following section.
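The crowding-distance and binary-tournament operations used in Steps 2, 3, and 7 can be sketched as follows; this is illustrative code, not the authors' implementation, and the toy archive and variable names are assumptions.

```python
# Crowding distance over an archive of objective vectors, plus a binary
# tournament that prefers the less crowded (more diverse) solution.
import numpy as np

def crowding_distance(objectives):
    """objectives: (n_solutions, n_objectives) array for the archive."""
    n, m = objectives.shape
    dist = np.zeros(n)
    for k in range(m):
        order = np.argsort(objectives[:, k])
        dist[order[0]] = dist[order[-1]] = np.inf       # keep boundary solutions
        span = objectives[order[-1], k] - objectives[order[0], k]
        if span == 0:
            span = 1.0                                  # avoid division by zero
        for i in range(1, n - 1):
            dist[order[i]] += (objectives[order[i + 1], k]
                               - objectives[order[i - 1], k]) / span
    return dist

def binary_tournament(archive_positions, dist, rng):
    i, j = rng.choice(len(archive_positions), size=2, replace=False)
    return archive_positions[i] if dist[i] > dist[j] else archive_positions[j]

rng = np.random.default_rng(0)
objs = rng.random((10, 2))                              # toy bi-objective archive
positions = rng.random((10, 5))
g_best = binary_tournament(positions, crowding_distance(objs), rng)
```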
3.3. Retail Sales Forecasting
The optimal features selected by MRMR, RFE, and APSO from the Rossmann and Walmart datasets are given as input to the Bi-GRU model for effective retail sales forecasting. The Bi-GRU model uses update and reset gates to perform sales forecasting, which decreases gradient dispersion and computational loss while providing both shorter- and longer-term memory [36,37]. In addition, the Bi-GRU has fewer parameters because it lacks a forget gate, which makes it computationally efficient, less prone to overfitting, and a suitable option for smaller datasets.
In the Bi-GRU model, the input and forget gates of the LSTM network are replaced by the update gate, $z_t$. The update gate helps the model determine how much past information needs to be passed along to future time steps. This process reduces the vanishing-gradient problem in the Bi-GRU model. The update gate, $z_t$, is mathematically specified in Equation (10).
where the weight matrix is represented as $W_z$; the bias matrix is denoted as $b_z$; the input matrix (selected features) at time step $t$ is indicated as $x_t$; the sigmoid activation function is denoted as $\sigma$; and the hidden state at the previous time step, $t-1$, is indicated as $h_{t-1}$. In the Bi-GRU model, the reset gate, $r_t$, is utilized to control the historical time-series data and is responsible for the network's shorter-term memory in the hidden state. The reset gate, $r_t$, is numerically expressed in Equation (11).
where the bias matrix and the weight matrix of the reset gate ($r_t$) are denoted as $b_r$ and $W_r$, respectively. Then, the candidate hidden state, $\tilde{h}_t$, is specified in Equation (12).
where the hyperbolic tangent activation function is represented as $\tanh$; the element-wise (dot) multiplication operation is denoted as $\odot$; and the bias matrix and weight matrix of the memory cell state are denoted as $b_h$ and $W_h$, respectively. The output, $h_t$, is obtained by linearly interpolating $h_{t-1}$ and $\tilde{h}_t$, and this process is indicated in Equation (13). The Bi-GRU model's architecture is shown in Figure 3.
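For reference, Equations (10)–(13) follow the standard GRU gate formulation; a common form consistent with the symbol definitions above (the paper's exact notation may differ slightly) is

```latex
\begin{align}
z_t &= \sigma\!\left(W_z\,[h_{t-1},\, x_t] + b_z\right) && \text{update gate, Eq.~(10)}\\
r_t &= \sigma\!\left(W_r\,[h_{t-1},\, x_t] + b_r\right) && \text{reset gate, Eq.~(11)}\\
\tilde{h}_t &= \tanh\!\left(W_h\,[r_t \odot h_{t-1},\, x_t] + b_h\right) && \text{candidate state, Eq.~(12)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{output interpolation, Eq.~(13)}
\end{align}
```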
Appropriate feature engineering is needed for the Bi-GRU model to extract the implicit vectors and complex variances in the historical sequence data for retail sales forecasting. A traditional (unidirectional) GRU model extracts feature information only in the forward direction and discards the backward historical time-series context. Therefore, an adaptive Bi-GRU model is implemented in this study for precise retail sales forecasting. The proposed Bi-GRU can process inputs more proficiently than conventional models because of its ability to learn from the input in both directions concurrently.
The proposed regression model extracts the knowledge between the variables in both the forward and backward directions, as shown in Figure 3. In the Bi-GRU model, the forward GRU extracts prior information from the historical time-series data, and the backward GRU extracts future information from the historical time-series data. The numerical expression of the Bi-GRU model is specified in Equation (14).
where the output of the backward and forward directions is represented as $y_t$, obtained through a combining operator that performs operations such as multiplication, averaging, or summation. In addition, the hidden states of the backward and forward GRUs are denoted as $\overleftarrow{h_t}$ and $\overrightarrow{h_t}$, respectively.
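A minimal sketch of the Equation (14) combination step is shown below; the concatenation, summation, and averaging combiners correspond to the operations mentioned above, and the array shapes are illustrative.

```python
# Combining forward and backward GRU hidden states into the Bi-GRU output.
import numpy as np

h_forward = np.random.rand(8, 80)    # forward hidden states (look-back 8, 80 units)
h_backward = np.random.rand(8, 80)   # backward hidden states

y_concat = np.concatenate([h_forward, h_backward], axis=-1)  # concatenation combiner
y_sum = h_forward + h_backward                               # summation combiner
y_avg = (h_forward + h_backward) / 2.0                       # average combiner
```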
The parameters considered in the Bi-GRU model are as follows: the look-back is eight, the number of neurons is 80, the dropout rate is 0.5, the batch size is 50, the loss function is the MSE loss, the optimizer is Adam, and the learning rate is 0.0001. The numerical results of the proposed regression model are specified in
Section 4.
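For concreteness, a hedged Keras sketch assembling a Bi-GRU with these stated hyperparameters is given below; the single-layer arrangement and the input feature count are assumptions rather than the authors' exact architecture.

```python
# Bi-GRU regressor with the hyperparameters reported above: look-back 8,
# 80 neurons, dropout 0.5, MSE loss, Adam optimizer, learning rate 0.0001.
import tensorflow as tf

LOOK_BACK, N_FEATURES = 8, 6         # look-back of eight; six selected Walmart features (assumed)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(LOOK_BACK, N_FEATURES)),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(80)),  # 80 neurons, both directions
    tf.keras.layers.Dropout(0.5),                            # dropout rate 0.5
    tf.keras.layers.Dense(1),                                # one-step sales forecast
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # Adam, lr = 0.0001
    loss="mse",                                              # MSE loss
)
# model.fit(X_train, y_train, batch_size=50, epochs=...)     # batch size 50
```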