*Article* **Stock Index Prediction Based on Time Series Decomposition and Hybrid Model**

**Pin Lv, Qinjuan Wu, Jia Xu \* and Yating Shu**

School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China; lvpin@gxu.edu.cn (P.L.); 1913392058@st.gxu.edu.cn (Q.W.); 2013391057@st.gxu.edu.cn (Y.S.) **\*** Correspondence: xujia@gxu.edu.cn

**Abstract:** The stock index is an important indicator of stock market fluctuation and plays a guiding role in investors' decision-making, thus being the object of much research. However, the stock market is affected by uncertainty and volatility, making accurate prediction a challenging task. We propose a new stock index forecasting model based on time series decomposition and a hybrid model. Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) decomposes the stock index into a series of Intrinsic Mode Functions (IMFs) with different feature scales and a trend term. The Augmented Dickey Fuller (ADF) method judges the stationarity of each IMF and the trend term. The Autoregressive Moving Average (ARMA) model is used on stationary time series, and a Long Short-Term Memory (LSTM) model extracts abstract features of non-stationary time series. The predicted results of each time series are reconstructed to obtain the final predicted value. Experiments are conducted on four stock index time series, and the results show that the prediction of the proposed model is closer to the real value than that of seven reference models, giving it good reference value for quantitative investment.

**Keywords:** stock index forecasting; CEEMDAN; ADF; ARMA; LSTM; hybrid model

#### **1. Introduction**

The stock index is calculated from a set of representative listed stocks. To some extent, it reflects price changes across the whole financial market, hence its use as an important indicator of a country's future macroeconomic performance. Forecasting the stock index accurately is of paramount importance for reducing decision-making risks, as it provides important reference information [1]. However, owing to the complexity of its internal structure and the variability of external factors, changes in the stock market are dynamic and uncertain, and forecasting the stock index has always been a challenge. Most stock forecasting models are classified as either statistical or machine learning models [2]. Statistical models were the first used to predict the stock market in finance, and have made some achievements. However, they assume a linear and stationary time series, which is inconsistent with the dynamic, non-linear characteristics of the real stock market, so they have great limitations. A deep learning model can overcome these defects of traditional statistical models in time series prediction, but it is easily affected by noise in complex and dynamic financial systems; this makes it difficult to mine the hidden features of a time series, resulting in poor learning ability and limited prediction accuracy.

Therefore, a single statistical or machine learning model cannot predict the stock index well. To overcome these limitations, we propose a hybrid stock index forecasting model based on Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) [3]. In this model, CEEMDAN is first used to decompose the original financial time series into a series of Intrinsic Mode Functions (IMFs) and a residual term. Then, the stationarity of the IMFs and the residual term is characterized using the Augmented Dickey Fuller (ADF) method: the low-volatility time series are classified as linear components,

**Citation:** Lv, P.; Wu, Q.; Xu, J.; Shu, Y. Stock Index Prediction Based on Time Series Decomposition and Hybrid Model. *Entropy* **2022**, *24*, 146. https://doi.org/10.3390/e24020146

Academic Editors: Jan Kozak and Przemysław Juszczuk

Received: 18 November 2021 Accepted: 12 January 2022 Published: 19 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

and the high-volatility time series are classified as non-linear components. In the final step, the Autoregressive Moving Average (ARMA) model is applied to the linear component, and Long Short-Term Memory (LSTM) is applied to the non-linear component. The final prediction result is obtained by reconstructing each prediction series. This method makes full use of ARMA's strength on linear problems and uses LSTM to identify and abstract non-linear features, mining the movement rules of the hidden components in a time series and improving prediction accuracy. Hence, our proposed method is referred to as CAL (CEEMDAN-ARMA-LSTM). In the CAL model, CEEMDAN sequence decomposition reduces the complexity of the time series, and the sequences that pass the ADF stationarity test have significant linear trends. Therefore, we employ ARMA to predict the data of the linear part, avoiding the loss of effective information caused by differencing.

A hybrid model combining linear and non-linear methods has great advantages in time series prediction [4]. Ref. [5] proposed a hybrid time-series prediction model in which an Autoregressive Integrated Moving Average (ARIMA) model, which combines the differencing of a non-stationary time series with ARMA, is fitted first, and the residual it generates is used as the input of an LSTM. The ARIMA-LSTM model achieved more accurate forecasting results than the individual LSTM and ARIMA models. A moving average filter was used to decompose a time series into linear and non-linear components [6]; ARIMA and an Artificial Neural Network (ANN) were used to model the low- and high-volatility data, respectively, and this hybrid ARIMA-ANN model achieved good prediction results. Each hybrid model in the literature combines linear and non-linear models in a different way, providing different perspectives on time series prediction. However, these methods have limitations: the error sequence generated by a linear model is assumed to be non-linear [5], and the original sequence is decomposed into single linear and non-linear components, which cannot mine the internal features of an overly complicated time series [6].

Our proposed model can properly decompose the original time series, and the ARMA and LSTM models are applied, which overcomes the defects of strong assumptions [5] and insufficient decomposition [6]. We validate our model's effectiveness on four stock market indices. The experimental results show that the proposed model has higher prediction accuracy than seven reference models on these indices. The main contributions of this study are summarized as follows:


The remainder of this article is organized as follows. Section 2 summarizes related work. Section 3 introduces the proposed CAL model. Section 4 experimentally evaluates the proposed method on real stock index datasets. Section 5 summarizes the paper and points out future research directions.

#### **2. Related Work**

Time series analysis is an important tool in many stock market prediction methods; it makes predictions by analyzing observed points in the series. As one of the most widely used linear time series forecasting methods, the ARIMA model [8] integrates the Autoregressive (AR) and Moving Average (MA) models. It assumes that future values depend linearly on current and past data. Therefore, ARIMA can only fit linear, stationary time series data; non-stationary time series might not be modeled effectively.

Deep learning can overcome the limitations of traditional linear models, such as weak fitting and feature extraction abilities on non-linear data, and has gradually become a key research method in stock prediction. Some deep learning models, such as Convolutional Neural Networks (CNNs), can identify non-linear relationships and extract hidden information from data. LSTM can retain long historical information and achieve high prediction accuracy in sequential pattern learning problems. It does not require selecting features manually [9], and its performance has been shown to be superior to that of Feedforward Neural Networks (FNNs) [10], Deep Neural Networks (DNNs) [11], and Support Vector Machines (SVMs) [12]. Although deep learning models some complex problems well, the traditional linear model still has some advantages. For example, regression methods sometimes have better prediction performance than deep learning in power system prediction [13,14].

Based on the above analysis, no individual model can be applied well in all circumstances. In a practical problem, the appropriate model depends on the characteristics of the dataset. However, in time series prediction, it is sometimes difficult to define whether the data are linear or non-linear, especially when there are multiple linear or non-linear components, making it difficult to choose an appropriate prediction model.

Various hybrid techniques exploit the unique strengths of both types of model to effectively improve prediction performance [4–6]. Ref. [15] combined ARIMA and SVM, showing that the combined model was better than either of its components at stock price prediction. LSTM and a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model were combined to predict stock price volatility, with relatively accurate results [16]. Ref. [17] proposed an ARIMA-ANN hybrid model to improve time series predictions when a time series has both linear and non-linear components. Ref. [18] developed three different hybrid models combining linear ARIMA with non-linear models, namely SVM, ANN, and random forest (RF) models, to predict stock index returns. Experimental results showed that the hybrid ARIMA-SVM model achieved the highest accuracy and the best return.

#### **3. Stock Index Forecasting Model**

#### *3.1. Related Models*

#### 3.1.1. CEEMDAN

Empirical Mode Decomposition (EMD) [19] can decompose time series data into subseries according to their own time scales without setting a basis function, allowing effective treatment of non-linear and unstable data. However, mode aliasing can occur during EMD decomposition. Ensemble Empirical Mode Decomposition (EEMD) addresses this problem but cannot completely eliminate the reconstruction error introduced by the added Gaussian white noise [7]. In the decomposition process, CEEMDAN adaptively adds white noise to avoid the mode mixing of EMD, and it addresses the reconstruction error due to noise. Stock prices are affected by multiple factors and form a complex non-linear system. The components produced by CEEMDAN are relatively simple; hence, more accurate predictions can be obtained from them.

#### 3.1.2. LSTM

As a special recurrent neural network, LSTM solves the problem of vanishing and exploding gradients in the training of long sequences, at the cost of a more complex network structure. LSTM introduces a cellular state and combines forget, input, and output gates to discard, maintain, and update information. The output of the model is calculated by multiple functions involving summation operations, so vanishing and exploding gradients are less likely to occur during backpropagation. LSTM has advantages in problems related to time series, such as industrial time series prediction [20] and text translation [21]. We take this model as the non-linear part of time series prediction.
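The gating arithmetic described above can be illustrated with a minimal single-unit LSTM step in plain Python. This is a pedagogical scalar sketch, not the vectorized Keras implementation used later in the paper; the function and parameter names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step for a single scalar unit.
    W, U, b hold the (input, recurrent, bias) weights for the
    forget, input, output, and candidate gates, in that order."""
    f = sigmoid(W[0] * x + U[0] * h_prev + b[0])    # forget gate: discard
    i = sigmoid(W[1] * x + U[1] * h_prev + b[1])    # input gate: maintain
    o = sigmoid(W[2] * x + U[2] * h_prev + b[2])    # output gate: update
    g = math.tanh(W[3] * x + U[3] * h_prev + b[3])  # candidate cell state
    c = f * c_prev + i * g   # new cellular state
    h = o * math.tanh(c)     # gated output
    return h, c

# With all-zero weights every gate outputs 0.5 and the candidate is 0,
# so the cell state is simply halved at each step.
h, c = lstm_cell_step(x=1.0, h_prev=0.0, c_prev=1.0,
                      W=[0.0] * 4, U=[0.0] * 4, b=[0.0] * 4)
```

The all-zero-weight case makes the gating easy to verify by hand: the new cell state is 0.5 × 1.0 and the output is 0.5 × tanh(0.5).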

#### 3.1.3. ARMA

ARMA is a linear sequential method that predicts future values from historical and current data. ARMA requires the data to be stationary. In practice, trends and periodicity often exist in many datasets, so these effects need to be removed before applying such models. Removal is typically carried out by including an initial differencing stage, which transforms the model into an ARIMA model. ARIMA can therefore be seen as an enhanced version of ARMA: it has a wider range of applications but incurs a certain amount of information loss.
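To make the ARMA/ARIMA relationship concrete, the sketch below shows first differencing (the step that turns ARMA into ARIMA) and a least-squares fit of the simplest ARMA special case, AR(1). The function names are illustrative, not from any particular library.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing to remove trend effects."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def fit_ar1(series):
    """Least-squares estimate of phi in y_t = phi * y_{t-1} + e_t,
    i.e., an ARMA(1, 0) model."""
    num = sum(y0 * y1 for y0, y1 in zip(series, series[1:]))
    den = sum(y0 * y0 for y0 in series[:-1])
    return num / den

# A linear trend is non-stationary; one differencing pass reduces it to a
# constant series, at the cost of discarding the level information.
diffed = difference([2.0 * t for t in range(10)])
phi = fit_ar1([0.5 ** t for t in range(8)])  # noise-free AR(1) data
```

On exact (noise-free) AR(1) data the estimator recovers the true coefficient, which is why it is a useful sanity check despite its simplicity.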

#### *3.2. Proposed Model*

It is widely accepted that the financial market is complex and dynamic, which calls for noise elimination or time series decomposition. For this purpose, a multi-scale decomposition method called CEEMDAN is used in our model. The decomposed components have different scales; ARMA and LSTM are used as the linear and non-linear prediction modules to exploit their respective advantages. Thus, a hybrid ARMA-LSTM model for time series forecasting based on CEEMDAN is proposed, which we call CAL (CEEMDAN-ARMA-LSTM). CEEMDAN can adaptively decompose a time series, yielding a series of IMFs and a residue with different characteristic scales. The decomposition principle is given by

$$s(t) = \sum_{i=1}^{n} imf_i(t) + res(t), \tag{1}$$

where *s*(*t*) represents the given time series data; *imf<sub>i</sub>*(*t*) (*i* = 1, 2, . . . , *n*) represents the different IMFs; and *res*(*t*) is the residue. Each IMF and the residue have their own local characteristic time scales. A low-volatility sequence contains more linear features, which ARMA is more suitable for processing; a high-volatility sequence can be considered non-linear, which better suits LSTM. We therefore require a method to separate the linear and non-linear components and feed them into ARMA and LSTM, respectively.
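The low/high-volatility distinction can be made tangible with a rough volatility proxy, the standard deviation of first differences. This is only an illustration of why fast-oscillating IMFs are treated as non-linear; it is not part of the paper's method, and the component series here are synthetic.

```python
import math

def diff_volatility(series):
    """Sample standard deviation of first differences -- a crude proxy
    for how rapidly a component oscillates."""
    d = [b - a for a, b in zip(series, series[1:])]
    mean = sum(d) / len(d)
    return math.sqrt(sum((x - mean) ** 2 for x in d) / (len(d) - 1))

n = 128
fast_imf = [math.sin(2 * math.pi * t / 4) for t in range(n)]   # high-frequency IMF
slow_imf = [math.sin(2 * math.pi * t / 64) for t in range(n)]  # low-frequency IMF
# The fast component's differences vary far more than the slow one's,
# so it would be routed to the non-linear (LSTM) branch.
```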

Each hybrid model brings its own perspective to time series decomposition. We use the statistical ADF method to separate linear and non-linear components. The ADF test identifies whether a time series is stationary: the existence of a unit root indicates that the series is unstable. A more negative ADF test statistic indicates more stable data, and a p-value threshold of 0.05 is widely accepted for judging stationarity, which can be used to separate linear and non-linear sequences [4].

$$s(t) = \sum_{i=1}^{m} l_i + \sum_{i=m+1}^{n+1} n_i. \tag{2}$$

An ADF stationary test separates time series decomposed by CEEMDAN in Equation (2), where *li* and *ni*, respectively, denote linear and non-linear components.
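Once each component's ADF p-value is known, the separation in Equation (2) reduces to a simple routing decision. In practice the p-values would come from a statistics library such as statsmodels; in this sketch they are supplied as hypothetical inputs rather than computed.

```python
def route_components(components, adf_pvalues, alpha=0.05):
    """Split decomposed components into a stationary (ARMA-bound) group
    and a non-stationary (LSTM-bound) group by their ADF p-values.
    The p-values are inputs, not computed here."""
    linear, nonlinear = [], []
    for comp, p in zip(components, adf_pvalues):
        (linear if p < alpha else nonlinear).append(comp)
    return linear, nonlinear

# Hypothetical p-values for three IMFs and the residue term.
linear, nonlinear = route_components(
    ["IMF1", "IMF2", "IMF3", "res"],
    [0.001, 0.020, 0.300, 0.700],
)
```

A p-value below 0.05 rejects the unit-root hypothesis, so those components go to the linear branch; the rest are treated as non-linear.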

$$L_t = g(l_{t-1}, l_{t-2}, \dots, l_{t-p}, \varepsilon_{t-1}, \varepsilon_{t-2}, \dots, \varepsilon_{t-q}). \tag{3}$$

ARMA and LSTM are applied to the linear and non-linear components, respectively. The modeling process of ARMA is described by Equation (3), where *l<sub>t−1</sub>* to *l<sub>t−p</sub>* are the time sequence values of the past *p* days, *ε<sub>t−1</sub>* to *ε<sub>t−q</sub>* denote the corresponding random errors, and *g* is the linear function of ARMA. Equation (3) shows that the result depends on the sequence values and random errors over a past period of time, so its prediction process reflects the temporal continuity of the original sequence.

LSTM can mine the characteristics of non-linear time series, so we use it to fit non-stationary sequences. The LSTM modeling process is described by Equation (4), where *f* is the non-linear function of LSTM, and *a* is the number of days observed by the model, i.e., how far we go back in time. The prediction results of the linear and non-linear parts are obtained by the corresponding models, and the final prediction is the integration of the linear and non-linear parts in Equation (5), where *y<sub>t</sub>* denotes the final prediction.

$$N_t = f(n_{t-1}, n_{t-2}, \dots, n_{t-a}), \tag{4}$$

$$y_t = \sum_{i=1}^{m} L_i + \sum_{i=m+1}^{n+1} N_i. \tag{5}$$

To sum up, the CAL model prediction consists of time series decomposition, an ADF stationary test, model fitting, and integration of results. Figure 1 shows the prediction model, where IMF1-IMF*<sup>n</sup>* are IMF components after time series decomposition, and *res* is the residue. ARMA1-ARMA*<sup>m</sup>* denote that the *m* sequences pass the ADF test and are fitted using ARMA, and LSTM(*m*+1)-LSTM(*n*+1) denote the *n* − *m* + 1 sequences that fail the ADF test and are modeled by LSTM. The steps of the proposed hybrid model are as follows.
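The overall control flow of these steps can be sketched as follows. The decomposition, the ADF test, and the two predictors are passed in as callables (stubbed below), since the sketch only shows how the pieces of the CAL pipeline fit together, not any particular library's API.

```python
def cal_forecast(series, decompose, adf_pvalue, arma_predict, lstm_predict,
                 alpha=0.05):
    """CAL pipeline sketch: decompose the series, route each component
    by its ADF p-value, predict with ARMA or LSTM accordingly, and sum
    the component forecasts as in Equation (5)."""
    forecast = 0.0
    for comp in decompose(series):          # IMFs plus the residue
        if adf_pvalue(comp) < alpha:        # stationary -> linear model
            forecast += arma_predict(comp)
        else:                               # non-stationary -> LSTM
            forecast += lstm_predict(comp)
    return forecast

# Stub components and models, just to exercise the control flow.
result = cal_forecast(
    series=[1.0, 2.0, 3.0],
    decompose=lambda s: [[1.0, 2.0], [3.0, 4.0]],
    adf_pvalue=lambda c: 0.01 if c == [1.0, 2.0] else 0.9,
    arma_predict=lambda c: 10.0,
    lstm_predict=lambda c: 20.0,
)
```

With these stubs, one component is routed to the ARMA branch and one to the LSTM branch, and the final forecast is the sum of the two component forecasts.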


**Figure 1.** Stock market index forecasting model.

#### **4. Experimental Results and Discussions**

In this section, we experimentally present the predictive ability of the CAL model. In Section 4.1, datasets used in experiments are introduced. In Sections 4.2 and 4.3, the evaluation metrics and parameter settings in the experiment are discussed, respectively. The decomposition results of EMD and CEEMDAN are compared in Section 4.4. The models for comparison are listed in Section 4.5. The predicted effects of the CAL model and other comparative methods are evaluated in Section 4.6.

#### *4.1. Datasets*

We use one-step-ahead prediction to verify the prediction accuracy of the proposed CAL model on four major global stock indices: the Deutscher Aktienindex (DAX), Hang Seng Index (HSI), Standard and Poor's 500 (S&P500), and Shanghai Stock Exchange Composite (SSE). These indices are highly representative of the global financial market and reflect stock market changes, which makes them valuable research objects. Stock market indices are affected by national policies, market environments, and other factors, and thus present different characteristics. Studying indices from different financial markets allows us to examine the prediction accuracy of the model under varied conditions.

The dataset comes from Yahoo! Finance. The range of each stock index is from 13 December 2007 to 12 December 2020, and the daily closing price is selected as the research object. The first 90% of each stock index's data, in time order, is used as the training set, and the last 10% is used as the test set. Only data from trading days are used.
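The chronological split described above is straightforward to write down; the key point is that no shuffling occurs, so the test period strictly follows the training period in time.

```python
def chronological_split(series, train_frac=0.9):
    """Time-ordered split: the first 90% of the series for training,
    the last 10% for testing. No shuffling."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

prices = list(range(1000))  # stand-in for a series of daily closing prices
train, test = chronological_split(prices)
```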

The statistical analysis of each stock index is shown in Table 1, which reports the number of observations for each stock market index, as well as the average, maximum, minimum, standard deviation, and ADF test result of the closing index. As can be seen from Table 1, there is a large gap between the maximum and minimum values, and a large standard deviation, indicating that these closing indices are highly volatile within the research range. Moreover, the ADF test results of the DAX and S&P500 are greater than the 0.05 threshold, indicating that these series are highly volatile and non-stationary. SSE is somewhat more stable than the other three datasets. Figure 2 shows the sequential change of each closing index within the study range; all four indices show great volatility and instability in the short term.


**Table 1.** Descriptive statistics of closing indices.

**Figure 2.** Daily closing index series of four financial markets. (**a**) DAX. (**b**) HSI. (**c**) S&P500. (**d**) SSE.

#### *4.2. Evaluation Metrics*

We evaluate the proposed CAL model by the Mean Absolute Error (*MAE*), Root Mean Square Error (*RMSE*), Mean Absolute Percentage Error (*MAPE*), and R-squared (*R*<sup>2</sup>), defined in Equations (6) to (9).

$$MAE = \frac{1}{n} \sum_{t=1}^{n} |p_t - y_t|, \tag{6}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (p_t - y_t)^2}, \tag{7}$$

$$MAPE = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{p_t - y_t}{y_t} \right| \times 100\%, \tag{8}$$

$$R^2 = 1 - \frac{\sum_{t=1}^{n} (p_t - y_t)^2}{\sum_{t=1}^{n} (y_t - \bar{y})^2}. \tag{9}$$

Here, *p<sub>t</sub>*, *y<sub>t</sub>*, and *ȳ* are the predicted value, the actual value, and the average of the actual values, respectively, and *n* is the prediction horizon. *MAE* measures the average magnitude of the errors in a set of predictions, without considering their direction. *RMSE* is a quadratic scoring rule that also measures the average magnitude of the error; it is the square root of the average of squared differences between predictions and actual observations. *MAPE* measures the percentage error of the forecast relative to the actual values. *R*<sup>2</sup> is a statistical measure in a regression model that determines the proportion of variance in the dependent variable explained by the independent variable; it corresponds to the squared correlation between the observed values and the values predicted by the model. A higher *R*<sup>2</sup> means better prediction accuracy.
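The four metrics translate directly into code. This sketch uses the standard *R*<sup>2</sup> denominator, the total sum of squares of the actual values about their mean.

```python
import math

def mae(p, y):
    """Mean Absolute Error, Equation (6)."""
    return sum(abs(a - b) for a, b in zip(p, y)) / len(y)

def rmse(p, y):
    """Root Mean Square Error, Equation (7)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, y)) / len(y))

def mape(p, y):
    """Mean Absolute Percentage Error, Equation (8)."""
    return 100.0 * sum(abs((a - b) / b) for a, b in zip(p, y)) / len(y)

def r_squared(p, y):
    """Coefficient of determination, Equation (9)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(p, y))
    ss_tot = sum((b - y_bar) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot
```

For a perfect forecast, *MAE*, *RMSE*, and *MAPE* are 0 and *R*<sup>2</sup> is 1, which provides a quick sanity check on the implementations.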

#### *4.3. Parameter Settings*

The sequential model structure in Keras is used to build the LSTM network. The batch size is 128. Two LSTM layers are employed, and the output of the last LSTM unit in the second layer is connected to a fully connected layer, which in turn is connected to another fully connected layer for the final output. Figure 3 shows the LSTM network structure, where *x<sub>i</sub>* (*i* = 1, 2, . . . , *n*) is the input to the model. The numbers of units in the first and second LSTM layers are 128 and 64, respectively. The third, fully connected layer has 16 neurons, and the last layer has a single unit, which provides the predicted value. The fully connected units and LSTM units use the ReLU and tanh activation functions, respectively. We use MSE as the loss function and Adam as the optimization algorithm. Adam is an adaptive learning rate optimization algorithm that utilizes both momentum and scaling; it has two decay parameters that control the decay rates and adjust the learning rate adaptively [22]. We explore the influence of different numbers of training epochs on the experimental results; more epochs yield a more skillful model but may lead to overfitting, so we set the number of epochs to 200. A time step of 10 works best. The detailed parameter settings are shown in Table 2.
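A time step of 10 means each training sample pairs a window of the 10 previous closing prices with the next price. The windowing step can be sketched in plain Python as below; the Keras model itself is not reproduced here.

```python
def make_windows(series, time_steps=10):
    """Build (input window, next value) pairs for one-step-ahead
    training, matching the time step of 10 from Table 2."""
    xs, ys = [], []
    for i in range(len(series) - time_steps):
        xs.append(series[i:i + time_steps])  # last 10 observations
        ys.append(series[i + time_steps])    # the value to predict
    return xs, ys

xs, ys = make_windows(list(range(100)), time_steps=10)
```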

**Figure 3.** LSTM network architecture.


**Table 2.** Details of the parameters of the CAL model.

The best-fitted ANN of the comparison ARIMA-ANN model has a layered architecture of 17 × 17 × 1 [4]. The parameters of CEEMDAN-LSTM follow Ref. [7]. The parameters of the comparison LSTM, GRU, and Bi-LSTM models are similar to those of the LSTM in the CAL model.

Grid search is used to determine the optimal parameters *p* and *q* of the ARMA model. The range of the grid search is [0, 5], and the pair with the smallest Akaike Information Criterion (AIC) value is selected.
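This grid search reduces to scoring every (*p*, *q*) pair with the AIC and keeping the minimum. The sketch below delegates the actual ARMA fit to a callable that returns the residual sum of squares (in practice a library such as statsmodels would supply the fit); the mock fit used here is purely illustrative.

```python
import math

def aic(rss, n, k):
    """Akaike Information Criterion under a Gaussian likelihood
    (up to an additive constant): smaller is better; k counts parameters."""
    return n * math.log(rss / n) + 2 * k

def grid_search_pq(fit_rss, series, max_order=5):
    """Return the (p, q) pair in [0, max_order] with the smallest AIC.
    fit_rss(series, p, q) is assumed to return the residual sum of
    squares of an ARMA(p, q) fit."""
    best = None
    for p in range(max_order + 1):
        for q in range(max_order + 1):
            score = aic(fit_rss(series, p, q), len(series), p + q + 1)
            if best is None or score < best[0]:
                best = (score, p, q)
    return best[1], best[2]

# Mock fit: the error halves with each AR order up to 2 and each MA order
# up to 1, then stops improving, so the AIC penalty should select (2, 1).
mock_rss = lambda s, p, q: 100.0 * (0.5 ** min(p, 2)) * (0.5 ** min(q, 1))
order = grid_search_pq(mock_rss, [0.0] * 100)
```

The 2*k* penalty is what stops the search from always choosing the largest orders: once extra parameters no longer reduce the residual, the AIC rises.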

#### *4.4. Decomposition Results of EMD and CEEMDAN*

Stock indices, which contain many influencing factors, can be decomposed using EMD or CEEMDAN. We take the SSE stock index as an example and decompose its original time series, so as to compare the two decomposition methods. For an intuitive comparison, we limit CEEMDAN and EMD to generate the same number of IMFs.

Figure 4 demonstrates the decomposition results of the original SSE index series. The results of sequence decomposition range from high to low frequency: the first few IMFs, with more noise, represent the high-frequency components in the original data; the middle IMFs, with reduced frequency, represent the middle-frequency components; and the last few IMFs, with less volatility, similar to the long-term movement trend of a stock, represent the low-frequency components. The left and right sides of Figure 4 show the results of CEEMDAN and EMD decomposition, respectively. IMF5 and IMF6 on the right of Figure 4 have similar scales and are not easily distinguished. This is because the mode aliasing of EMD distributes similar time scales across different intrinsic mode functions, resulting in waveform aliasing and mutual influence. As a result, the features of a single sequence are not obvious, and feature extraction by later prediction models is more difficult. CEEMDAN decomposition effectively solves this problem. As can be seen from the decomposition results on the left side of Figure 4, CEEMDAN decomposed the stock index into several components, from high to low frequency, whose characteristics are obvious and show no waveform aliasing.

**Figure 4.** SSE decomposition results.

#### *4.5. Comparative Models*

To verify the effectiveness of the proposed CAL model for stock market prediction, we experimentally compare seven models. Table 3 lists the models and reference purposes of these seven controlled experiments, which verify the proposed model from different perspectives.


**Table 3.** Contrastive experiments.


#### *4.6. Experiments and Discussions*

We verify the effectiveness and superiority of the proposed model from three aspects:


#### 4.6.1. Observation of the Statistical Data

It can be observed from Table 4 that the CAL model has obvious advantages in DAX series prediction: its MAE decreases by 56.71% compared to LSTM and by 46.83% compared to ARIMA. This indicates that a single model cannot effectively capture the data patterns and make excellent predictions. Although GRU and Bi-LSTM improve on the prediction accuracy of LSTM, their accuracies are still lower than that of CAL.


**Table 4.** Prediction results of different models in DAX.

The EMD-based method yields remarkably larger forecast errors than CEEMDAN-LSTM and CAL, which shows that experimental results vary with the data decomposition method and that CEEMDAN-based methods achieve better predictive performance. The ARIMA-ANN model is inferior to the EMD- and CEEMDAN-based methods, perhaps because its decomposition ability is too limited to extract hidden features. CEEMDAN properly decomposes a time series, reduces its complexity, and improves LSTM information extraction, so the hybrid CEEMDAN-LSTM model achieves a better prediction effect than LSTM alone. However, CEEMDAN-LSTM is not as good as CAL because it does not consider the linear factors that may exist in the original sequence.

Table 5 lists the prediction performance of the different models on the HSI stock index, where we find a large error between the real and predicted values. This is mainly because the HSI stock index is more volatile and difficult to predict. The CAL model achieves the best prediction accuracy, followed by CEEMDAN-LSTM, EMD-ARMA-LSTM, and ARIMA-ANN. ARIMA-ANN achieves higher prediction accuracy than the individual ARIMA and LSTM models, and ARIMA obtains better results than LSTM. As deep learning is easily affected by noise, it is difficult for it to learn effective data patterns in complex dynamic time series. Deep learning methods, such as LSTM, GRU, and Bi-LSTM, have the largest prediction errors on the HSI stock index. Although ARIMA has higher prediction accuracy than them, the gap between its predicted and actual values is still large. This indicates that the predictive performance of a single model is very limited. The hybrid models perform better than the single ARIMA and LSTM models. The experimental results show that ARIMA-ANN gives poorer results than CEEMDAN-LSTM, EMD-ARMA-LSTM, and CAL, perhaps due to an insufficient scale of decomposition. CEEMDAN-LSTM and EMD-ARMA-LSTM effectively improve prediction accuracy, but their effect is still inferior to that of the proposed CAL model, which shows advantages and good potential on high-volatility time series data.


**Table 5.** Prediction results of different models in HSI.

Figure 2c shows that the movement trend of the S&P500 is relatively stable, with little fluctuation in the research interval and an overall upward trend. Hence, the predicted results are closer to the observed values of the stock index. Table 6 shows the experimental results for the S&P500. The data show that the CAL model yields the smallest prediction error, with an MAE 48.84% less than LSTM and 49.75% less than ARIMA. This shows that a single model can perform well on some stable time series, but there is still room for improvement. However, GRU and Bi-LSTM cannot effectively improve the prediction accuracy. The prediction effect of EMD-ARMA-LSTM is still inferior to that of CAL, which further demonstrates the superiority of CEEMDAN over EMD decomposition. CEEMDAN-LSTM achieves better prediction performance than the single LSTM model, and ARIMA-ANN yields higher prediction accuracy than ARIMA, showing that sequence decomposition and model combination can improve the prediction accuracy of financial series.

**Table 6.** Prediction results of different models in S&P500.


Table 7 shows the prediction performance results on the SSE dataset. From Table 7, we can see that CAL has better predictive accuracy than the other seven models, with an MAE as low as 14.0294, followed by CEEMDAN-LSTM and EMD-ARMA-LSTM. ARIMA achieves higher prediction accuracy than ARIMA-ANN and EMD-ARMA-LSTM. GRU and Bi-LSTM achieve higher prediction accuracy than LSTM.



Several important results are obtained on the SSE dataset. GRU and Bi-LSTM outperform LSTM, but their prediction accuracy is still lower than that of ARIMA, which shows that a linear model can sometimes achieve a better prediction effect than a deep learning model. The prediction accuracy of EMD-ARMA-LSTM is relatively low, perhaps because the mode mixing of EMD leads to the inclusion of data at other scales in an IMF, and these abnormal data interfere with information extraction.

#### 4.6.2. Prediction Results and Errors

As demonstrated in Figure 5, we zoom in on a part of the prediction interval to observe the consistency between the real and predicted values of the different models. The CAL model yields the closest prediction results, and CEEMDAN-LSTM is closer to the observed values than EMD-ARMA-LSTM and ARIMA. LSTM and GRU have larger volatility and prediction errors than the other models. The stem diagram in Figure 6 oscillates up and down around the zero axis and is locally symmetrical about it, indicating that the prediction results of the CAL model are relatively stable within the prediction interval.

**Figure 5.** SSE comparison of sequence prediction results.

**Figure 6.** SSE error changes between real and predicted values.

#### 4.6.3. Regression Analysis

We conduct a linear regression to assess the correlation between the real data and the predicted values. The predicted value is denoted as *x* and the real value as *y*. The regression equation is *y* = *ax* + *b*. The metrics, including the standard error (SE), *p*-value (*p*), and *t*-value (*t*), are used to test the results of the regression analysis. The definitions of SE and *t* are as follows, and *p* is derived from the *t* distribution.

$$SE = \frac{\sigma}{\sqrt{n}}, \tag{10}$$

$$t = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}. \tag{11}$$

Here, *σ* is the standard deviation of the predicted values, *n* is the number of predicted (or real) values, *x̄* is the mean of the predicted values, and *μ* is the mean of the real values. Table 8 lists the regression parameters and diagnostic results. The slope *a* of each stock index is close to 1, and the SE for *a* is relatively small, which means that the predicted values are very close to the real values. Furthermore, for each linear regression model, *p* for *a* is below the standard cutoff of 0.05, and *t* for *a* is high, suggesting a good model. In addition, Figure 7 shows the linear regression results for each stock index. The scattered points are evenly distributed near the fitted line, indicating that the predicted and real values are highly correlated.
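Equations (10) and (11) can be computed directly. This sketch assumes the sample standard deviation (the *n* − 1 denominator), which the paper does not specify.

```python
import math

def std_error(pred):
    """Standard error of the predicted values, Equation (10).
    Uses the sample standard deviation (n - 1 denominator)."""
    n = len(pred)
    mean = sum(pred) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in pred) / (n - 1))
    return sigma / math.sqrt(n)

def t_value(pred, real):
    """t statistic comparing the predicted mean to the real mean,
    Equation (11)."""
    x_bar = sum(pred) / len(pred)
    mu = sum(real) / len(real)
    return (x_bar - mu) / std_error(pred)
```

When the predicted and real series share the same mean, the *t* statistic is exactly zero, which is a convenient sanity check.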


**Table 8.** The regression parameters and diagnostics results.

**Figure 7.** Linear regression analysis. (**a**) DAX. (**b**) HSI. (**c**) S&P500. (**d**) SSE.

#### 4.6.4. Summary

Based on the above experimental results, the observations are summarized as follows.


#### **5. Conclusions and Discussion**

Stock market index prediction plays an important role in reflecting overall stock market trends and has strong practical investment value. We proposed a hybrid stock index prediction model based on CEEMDAN and ARMA-LSTM. It leverages the strengths of CEEMDAN in data decomposition, combines linear and non-linear models, and can model complex time series well. To verify the effectiveness of the prediction model, CAL was used to forecast the closing indices of four stock markets, and seven control experiments were conducted for comparison. The results show that CAL achieves the highest prediction accuracy. To optimize the model, future research can be conducted along the following lines.


**Author Contributions:** Conceptualization, P.L.; Data curation, Q.W.; Formal analysis, Q.W.; Funding acquisition, J.X.; Investigation, Q.W. and Y.S.; Methodology, Q.W. and J.X.; Project administration, J.X.; Resources, Y.S.; Software, Q.W.; Supervision, P.L. and J.X.; Validation, P.L.; Visualization, Q.W. and Y.S.; Writing—original draft, P.L., Q.W. and J.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China, grant number 62067001, the Guangxi Natural Science Foundation, grant number 2019JJA170045, and the Special Funds for Guangxi BaGui Scholars.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. DAX data can be found here: https://cn.investing.com/indices/germany-30-historical-data. HSI data can be found here: https://cn.investing.com/indices/hang-sen-40-historical-data. S&P500 data can be found here: https://cn.investing.com/indices/us-spx-500-historical-data. SSE data can be found here: https://cn.investing.com/indices/shanghai-composite-historical-data.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Preference-Driven Classification Measure**

**Jan Kozak 1,\*, Barbara Probierz 1, Krzysztof Kania <sup>2</sup> and Przemysław Juszczuk <sup>1</sup>**


**Abstract:** Classification is one of the main problems of machine learning, and assessing the quality of classification is one of the most topical tasks, all the more difficult as it depends on many factors. Many different measures have been proposed to assess the quality of the classification, often depending on the application of a specific classifier. However, in most cases, these measures are focused on binary classification, and for the problem of many decision classes, they are significantly simplified. Due to the increasing scope of classification applications, there is a growing need to select a classifier appropriate to the situation, including more complex data sets with multiple decision classes. This paper aims to propose a new measure of classifier quality assessment (called the preference-driven measure, abbreviated p-d), regardless of the number of classes, with the possibility of establishing the relative importance of each class. Furthermore, we propose a solution in which the classifier's assessment can be adapted to the analyzed problem using a vector of preferences. To visualize the operation of the proposed measure, we present it first on an example involving two decision classes and then test its operation on real, multi-class data sets. Additionally, in this case, we demonstrate how to adjust the assessment to the user's preferences. The results confirm that the preference-driven measure can point to different classifiers as the better choice, according to the stated preferences, than the classical measures of classification quality assessment do.

**Keywords:** classification measure; quality of classification; quality measure; preference-driven classification; machine learning

#### **1. Introduction**

Classification continues to be one of the most important subjects in machine learning. Despite this, we still lack a general measure of quality independent of the specific characteristics of the data set. Moreover, in situations where there is a necessity to involve the human decision maker in the classification process, we are forced to switch between different measures. Among the general ones are accuracy, precision, and recall; others are data-dependent. Choosing the right (optimal) one is especially important because choosing a particular classification method depends heavily on the calculated quality measures.

Moreover, there is no single best classification measure that effectively identifies the method suitable for every task. Classification algorithms/methods have many characteristics. Consequently, there are many measures of classification because there is no single measure covering all the characteristics simultaneously [1]. Thus, finding an appropriate classification measure for a specific task is difficult and requires answering the question of what conditions, in specific circumstances, must be met by the measure.

**Citation:** Kozak, J.; Probierz, B.; Kania, K.; Juszczuk, P. Preference-Driven Classification Measure. *Entropy* **2022**, *24*, 531. https://doi.org/10.3390/e24040531

Academic Editor: Sotiris Kotsiantis

Received: 7 February 2022 Accepted: 8 April 2022 Published: 10 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

One of the ways to bypass the problem of unambiguous assessment of the classifier's quality and selecting one best-suited classifier is to use ensemble and hybrid methods, which simultaneously use many different algorithms to perform a specific task. Within this approach, we can point out the homogeneous and heterogeneous solutions [2]. In the first group, we can find methods allowing us to create large groups of classifiers belonging to the same category (or even classifiers generated with the same method, but with different starting parameters), which carry out the classification process simultaneously. The heterogeneous approach involves using a large variety of classifiers, for which the main advantage is the diversity of obtained results. Thus, the basic idea of these methods is the idea of collective decision making [3].

One should note that the above approach based on the ensemble methods allows for more robust classifier selection. However, it still leaves the decision maker with the problem of estimating the classification quality. In medicine, the military, or finance, the well-known accuracy measure often yields unsatisfactory results and has limited usability [4,5]. On the other hand, measures such as recall or precision are directed towards the binary classification problem. Most of the proposed measures are directly connected to the confusion matrix and tied strictly to the numerical outcome of the classification [6]. In most cases this is necessary, but it makes the whole process independent of the user's preferences. In the decision-making context, taking them into account may be vital to make the process effective and at the same time maintain the sovereignty of the decision maker.

Incorporating users in the process of preparing a machine learning solution is an essential element of the entire procedure, the subject of many studies, and can take various forms [7,8]. One of the goals of the actions taken is to help the user to choose a suitable classifier. Most often, this task comes down to comparing simple measures of classification quality, which is usually carried out by trial and error, and yet can be unreliable [9]. This task becomes even more complicated if individual users' preferences are to be taken into account. This applies especially to issues of a managerial nature, but more generally wherever a human being to some extent participates in the decision-making process. In practice, this applies to all issues except physical or technical ones, where only objective laws are in force [10]. The main intention of introducing the new measure is to make this stage of a research procedure more methodical. To the best of our knowledge, there are no clear guidelines for taking into account the parameters in the learning process and the selection of a classifier in conjunction with individual preferences. The next issue is the systematic classification into the following application areas, thus expanding the group using machine learning methods. Finally, some users need a tool to control the process of selecting a classifier for their own needs, which are more complex and related to many classes. For such users, a measure that allows them to simply, directly, and methodically include their preferences in selecting and training the classifier would be very beneficial.

To cope with the above drawbacks, and at the same time maintain the role of the decision maker in the process, we propose the idea of a new measure in which their preferences are vital to the importance of individual classes. We aim to propose a measure that balances a thorough analysis of the classifier's performance and the selection of the classifier that performs best under the conditions (preferences) specified by the user.


We undertook to propose such a measure because, to the best of our knowledge, there is no reasonable alternative where, for any number of decision classes, it is possible to aggregate the quality of classification depending on the weight for a particular decision class. First, our solution was discussed in detail on prepared examples for two decision classes. It was then tested on real-world data sets and re-examined for a more significant number of decision classes that occurred in these data sets.

This article is organized as follows. Section 1 introduces the subject of this article. Section 2 provides an overview of the work related to the classification, particularly the measures for assessing the quality of the classification. In Section 3, we describe the classification problem and the classification quality assessment measures based on the error matrix for binary and multi-class classification. In Section 4, we present a new measure for classification, in which it will be possible to control preferences. In Section 5, we present the analysis of our research on real data sets. Finally, in Sections 6 and 7, we discuss the results of the experiments and end with general remarks on this work and available directions for future research.

#### **2. Related Works**

Evaluating the classification performance is a difficult task, and the discussion on this topic has been ongoing since the beginning of work on automatic classification. The initial set of five measures (sensitivity, specificity, efficiency, positive and negative predictive value) was rapidly expanded [11]. The most often used measure of classification performance is accuracy. However, it is not the only measure of the quality of predictive models. Despite optimizing the classification error rate, high-accuracy models may fail to capture crucial information transfer in the classification task [12,13]. Despite its simplicity and intuitive interpretation, there are many reasons and situations in which accuracy should not be used [14]. Instead, the authors advocate for using Cohen's kappa as a better metric than accuracy for measuring a classifier's merits. Moreover, [15] indicates that the most frequently used measures, which focus on correctly classified cases (precision, recall, or F-score), do not meet the needs of various decision-making situations, especially when more than one class is essential. The authors advocate for using three other measures (Youden's index, likelihood, and discriminant power) because they combine sensitivity and specificity and their complements.

While most studies concern binary classification, in [16], the authors focused on multi-class classification problems. The authors showed that the extension of measures to the classification of many classes is associated with averaging the results achieved for individual classes in most cases. Finally, in [17], the authors point out some shortcomings of the accuracy measure and list five conditions that the newly constructed discriminator metric should meet.

To solve the dilemma related to the choice of a measure for a given problem, a list of desired features of an ideal measure and analysis of the most known measures was proposed in [1]. More importantly, they proved that it is impossible to satisfy them simultaneously. They also proposed a new family of measures (Generalized Means) that meet all desirable properties except one, and a new measure called Symmetric Balanced Accuracy.

A comparative analysis and taxonomy of the quality of classification measures have been the subject of many studies. For instance [18], in their experimental comparison conducted on 30 data sets, proposed dividing the performance measures into three categories:


In a survey [19], the authors grouped the measures depending on the type of outcome on which a given measure is focused (correct or incorrect outcome). In turn, [20] presents an in-depth analysis of over twenty performance measures used in different classification tasks in the context of changes made to a confusion matrix and their relations with particular measures (measure invariance). In contrast, a comparative study of two or more classifiers based on statistical tests was presented in [21]. A comprehensive analysis of the methods and measures of classification assessment is also included in [22], where the relationships between all measures calculated based on the confusion matrix are shown. Finally, a different approach to the analysis of performance measures was presented in [23]. First, the authors grouped classification measures according to classification difficulty, which they defined in relation to a distance between the boundary lines and each correctly classified case. The authors later developed their idea and proposed an instance-based measure for calculating the performance of classification from the perspective of instances, called degree of credibility [24].

The set of measures is constantly growing. For instance, a measure for comparing classifiers was proposed in [25] that combines three measures from different groups: the Matthews correlation coefficient, which is calculated from both true and false positives and negatives; the AUC (area under the curve), derived from the ROC; and accuracy. To overcome the shortcomings of the accuracy measure in evaluating multi-class classifiers and to improve the quality of classifiers, in [26], the authors proposed a metric based on the combined accuracy and dispersion values. They also showed experimentally that this two-dimensional metric is particularly suitable for complex, unbalanced data sets and those with many classes.

The new measures are also proposed to supplement the already used measures that work better for specific tasks. For example, as an alternative to measures used in medical diagnostics, which use only part of the values from the error matrix, the authors of [27] define the measure *AQM*, which takes into account all values from the confusion matrix. Similarly, in [28], for image analysis, a new measure of classification performance called robust-and-balanced accuracy was introduced. It aims to connect balanced accuracy with measures of variations. In another proposition, to improve face recognition processes, a new classification measure, called the volume measure, based on the volume of the matrix, was proposed [29]. In turn, a measure dedicated to the analysis of imbalanced data sets based on the harmonic mean of recall and selectivity was proposed in [30].

Existing measures are also modified. For example, in [31], the *F*∗ measure was proposed as a modification of the F-score, towards the more straightforward interpretation of this measure. The above short review shows that the issue of classification measures is constantly under the attention of researchers. Moreover, for specific applications, new measures are created that are better suited to the requirements of the domain or user preferences.

#### **3. Classification and Quality Evaluation**

Formally, the classification problem *Q* can be solved using empirical experience *W*, while the quality of the solution is estimated by the quality measure *Y*. The value of *Y* should increase as the experience *W* grows [32].

In machine learning, classification refers to the prediction problem of determining the class to which samples from a data set will be assigned. A classifier algorithm (often shortened to "a classifier") must be provided with training data with labeled classes. Then, the classifier can predict classes for new test data based on the training data. This approach is called supervised machine learning, and classification is one example of such a method. The training set is selected as a subset of the whole data set. A typical approach is to divide the known examples into a training and a test set, following some general principles about the ratio of the two. Typically, the test set includes far fewer samples than the training set (and the two sets should be disjoint). The test set is used to evaluate classification quality. Every object (also called a sample or observation) from the training data is assigned some predefined label (decision class). The idea is to build a classifier which assigns the proper labels to the objects. The evaluation is then performed on the test set, where the difference between the assigned and the actual label is estimated.

Classification algorithms for a prediction problem are evaluated based on performance. Multiple measures for assessing classification quality can be used depending on the situation. One of the most commonly used measures is accuracy, which determines how many samples from the entire data set were correctly classified into the appropriate classes. Often, other measures are used in addition to accuracy that more accurately assess the classifier's performance. Such measures primarily include precision and recall. Unfortunately, there are many problems in which classical measures are insufficient, and it is necessary to look for new solutions.

Let us consider a set of all available samples (called the universe of objects) *X*, which will include *n* number of objects:

$$X = \{\mathbf{x}\_1, \mathbf{x}\_2, \dots, \mathbf{x}\_i, \dots, \mathbf{x}\_n\}. \tag{1}$$

A single observation $\mathbf{x}\_i$ will be described by a finite set of attributes and the decision attribute:

$$a\_1, a\_2, \dots, a\_m, \tag{2}$$

where $a\_j \in A\_j$, $j = 1, \dots, m$. In this context, $A\_j$ denotes the domain of the *j*-th attribute, while the features $a\_1, a\_2, \dots, a\_m$ create a feature space $A\_1 \times A\_2 \times \dots \times A\_m$.

There are no general restrictions related to the values of attributes, which can be quantitative or categorical. However, one should note that preprocessing allows for discretizing selected quantitative attributes. The same procedure can be applied to the decision classes, where many labels can be merged into fewer during the discretization process (trade-off between the quality of classification and classification speed). Thus, the single sample can be described as follows:

$$\mathbf{x}\_i = (V\_i, c\_i), \quad v\_i^j \in A\_j, \quad c\_i \in \{1, \dots, C\}, \tag{3}$$

where $V\_i = [v\_i^1, \dots, v\_i^m]$ is a vector in an *m*-dimensional feature space, $v\_i^j$ is the value of attribute $a\_j$ for observation (sample) $\mathbf{x}\_i$, and $c\_i$ is the class label (also called the decision class) of this object. Thus, the universe *X* can be formally described as:

$$X = \{(V\_i, c\_i)\}\_{i=1}^{n}. \tag{4}$$

Hence, the classification problem can generally be understood as the assignment problem, where every element $\mathbf{x}\_i$ from the universe *X* should have the decision class $c\_i$ assigned. Eventually, we end with a classifier capable of assigning the newly arrived objects from the test set to proper decision classes. However, even in the case of the binary classification problem, the above is not trivial. It can be challenging to achieve when, for example, we observe unbalanced data (i.e., the situation in which most of the objects from universe *X* are assigned to a single class). Therefore, many different estimation methods were proposed (for both binary and multi-class classification problems) to cope with this. Next, this section discusses various classification measures capable of dealing with both mentioned classification problems.
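The formalization in Equations (1)-(4) maps directly onto a simple data representation; a minimal sketch with toy values (the names and data are illustrative, not from the paper):

```python
from typing import List, Tuple

# Each sample x_i is a pair (V_i, c_i): an m-dimensional feature vector and a
# decision-class label from {1, ..., C}, as in Equations (3) and (4).
Sample = Tuple[Tuple[float, ...], int]

# Universe X with n = 4 samples, m = 2 attributes, C = 2 decision classes.
X: List[Sample] = [
    ((1.0, 0.5), 1),
    ((0.9, 0.7), 1),
    ((5.0, 4.2), 2),
    ((4.8, 4.4), 2),
]

n = len(X)                        # number of objects in the universe
m = len(X[0][0])                  # number of attributes per object
labels = sorted({c for _, c in X})  # the set of decision classes
print(n, m, labels)
```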

#### *3.1. Quality Assessment and Binary Confusion Matrix*

A confusion matrix is among the most popular tools used in the process of the validation of the quality of performed classification. The confusion matrix for a binary decision class is defined as a table with different combinations of predicted and actual values related to the decision classes [33,34]. The rows contain existing classes, while the columns contain predicted classes. The diagonal of the confusion matrix includes the correctly classified samples for both classes, while the off-diagonal cells represent the errors (misclassified samples for both decision classes). The confusion matrix represents the errors that the classifier makes and shows the type of these errors. It represents a detailed breakdown of

the answers considering the number of correct and incorrect classifications for both classes. It is imperative when the cost of misclassification is different for these classes and when the size of the classes is different [35].

For the binary classification (e.g., classification of emails and spam or sick and healthy patients), see Table 1, where the prediction of a positive class (labeled 1) and a negative class (labeled 2) can be uniquely determined. Such a situation indicates whether the classifier is more likely to incur an error by assigning a positive class as a negative class or vice versa. In Table 1, the confusion matrix for the binary classifier is shown, where:


**Table 1.** Confusion matrix for binary classification.


The quality of classifiers is measured based on a confusion matrix (as shown in Table 1). Measures based on the confusion matrix include, but are not limited to, accuracy, precision, recall, and F-score (also called F1, which is the measure *F<sup>β</sup>* for which *β* = 1). In addition, the Matthews correlation coefficient (MCC) measure and BalancedAccuracy are also often used for binary classification.
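As a concrete illustration, the four entries of the binary confusion matrix can be counted directly from paired label lists; a minimal sketch with made-up labels, taking class 1 as positive and class 2 as negative (matching Table 1's labeling):

```python
# Build the binary confusion matrix entries from true and predicted labels.
# Convention: class 1 is positive, class 2 is negative; the data are illustrative.
y_true = [1, 1, 1, 1, 2, 2, 2, 1, 2, 2]
y_pred = [1, 1, 2, 1, 2, 2, 1, 1, 2, 2]

TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # positives hit
FN = sum(t == 1 and p == 2 for t, p in zip(y_true, y_pred))  # positives missed
FP = sum(t == 2 and p == 1 for t, p in zip(y_true, y_pred))  # negatives flagged
TN = sum(t == 2 and p == 2 for t, p in zip(y_true, y_pred))  # negatives rejected

# The four counts partition the data set, as in Table 1.
print(TP, FN, FP, TN)
```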

The accuracy measure indicates how often a classifier makes a correct prediction. It is the ratio of the number of accurate predictions to the total number of predictions and is calculated based on Formula (5).

$$accuracy\\_binary = \frac{TP + TN}{TP + FP + FN + TN} \tag{5}$$

Precision determines how many samples, out of all those classified as positive, are samples of the positive class. Precision is expressed by Formula (6).

$$precision\\_binary = \frac{TP}{TP + FP} \tag{6}$$

Recall is used to determine how many samples belonging to a positive class were classified as positive by the classifier. Recall is determined by Formula (7).

$$recall\\_binary = \frac{TP}{TP + FN} \tag{7}$$

For binary classification, the measure *F<sup>β</sup>* is defined as the harmonic mean of precision and recall, where additional precision or recall weights are used to obtain more accurate results. By setting the value of *β*, it is possible to control the effect of recall weight with respect to precision. *F<sup>β</sup>* is specified by Formula (8), where *β* is the number of times the recall is as important as the precision [36]. The *F<sup>β</sup>* value ranges from 0 to 1 (with 0 being the worst value and 1 being the optimal value). The most common value for *β* is 1, which simply means measure F1 (Equation (9)). Other frequently used values are 2 and 0.5. In the case of 2, recall weight is greater than precision; however, in the case of 0.5, recall weight is less than precision. The *F<sup>β</sup>* measure is based on the Van Rijsbergen measure of effectiveness [37].

$$F\_{\beta} = (1 + \beta^2) \times \frac{precision\\_binary \times recall\\_binary}{\beta^2 \times precision\\_binary + recall\\_binary} \tag{8}$$

The *F*1 measure is the instance of the measure *F<sup>β</sup>* (Equation (8)), for which *β* = 1. The *F*1 measure is defined as the harmonic mean of precision and recall [38]. Therefore, *F*1 will only take on a high value if both of its components reach a high value. The *F*1 measure often replaces precision when class counts are unbalanced [39]. For example, if 97% of the data belong to class 1 and only 3% belong to class 2, then classifying all observations as class 1 would yield a misleading accuracy of 97%. The *F*1 measure is based on precision and recall, and is thus robust to such distortions. The measure is calculated from Formula (9).

$$F1\\_binary = 2 \times \frac{precision\\_binary \times recall\\_binary}{precision\\_binary + recall\\_binary} = \frac{TP}{TP + \frac{1}{2}(FN + FP)}\tag{9}$$

The Matthews correlation coefficient (*MCC*) measure is often used for unbalanced data [40]. For the precision, recall, and *F*1 measures, the *TN* value is not used, which is very important if we are interested in both classes. Therefore, the Matthews correlation coefficient, calculated based on all terms from the confusion matrix, can be used. The *MCC* measure is calculated from Formula (10).

$$\text{MCC\\_binary} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{10}$$

The *MCC* value ranges within [−1 ... 1] (with −1 equal to the misclassification of all samples, while a value of 1 means that all samples are correctly classified). Therefore, the higher the correlation between actual and predicted values, the better the prediction. Furthermore, for the *MCC* measure, the two classes have the same importance weight, so when positive classes are swapped with negative classes, the *MCC* value will be the same, which means that the *MCC* measure is symmetric [40].

Another measure in binary classification is the BalancedAccuracy measure, which calculates balanced accuracy for anomaly or disease detection, where significant differences in the class size are observed [41]. Overestimated accuracy results can be avoided using the BalancedAccuracy measure for unbalanced classes. For binary classification, the balanced accuracy equals the arithmetic mean of the recall and specificity. The BalancedAccuracy measure is calculated by Formulas (11) or (12).

$$BalancedAccuracy\\_binary = \frac{recall\\_binary + specificity\\_binary}{2} \tag{11}$$

$$BalancedAccuracy\\_binary = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{FP + TN} \right) \tag{12}$$

Specificity is used to determine the number of samples belonging to a negative class that have been classified as negative by the classifier [42]. Specificity is measured by Formula (13).

$$specificity\\_binary = \frac{TN}{FP + TN} \tag{13}$$
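The binary measures above follow directly from the four confusion-matrix entries; a minimal sketch of Formulas (5)-(13) with illustrative counts (the TP/FP/FN/TN values are made up):

```python
from math import sqrt

# Illustrative confusion-matrix entries.
TP, FP, FN, TN = 40, 5, 10, 45

accuracy  = (TP + TN) / (TP + FP + FN + TN)                # Formula (5)
precision = TP / (TP + FP)                                 # Formula (6)
recall    = TP / (TP + FN)                                 # Formula (7)
f1        = 2 * precision * recall / (precision + recall)  # Formula (9)
mcc       = (TP * TN - FP * FN) / sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))         # Formula (10)
specificity = TN / (FP + TN)                               # Formula (13)
balanced_accuracy = (recall + specificity) / 2             # Formula (11)

# MCC is symmetric: swapping the positive and negative classes
# (TP <-> TN, FP <-> FN) leaves its value unchanged.
mcc_swapped = (TN * TP - FN * FP) / sqrt(
    (TN + FN) * (TN + FP) * (TP + FN) * (TP + FP))

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} "
      f"F1={f1:.3f} MCC={mcc:.3f} balanced={balanced_accuracy:.3f}")
```

Note how accuracy and balanced accuracy agree here (0.85) because the two classes are roughly balanced; with strongly unbalanced counts the two would diverge, which is exactly the case the BalancedAccuracy measure targets.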

#### *3.2. Quality Assessment and Confusion Matrix for Multi-Class Classification*

In addition to binary classification, it is common to classify samples into more than two classes. We are dealing with multi-class classification in such a situation—not to be confused with multi-label classification. The difference between multi-class and multi-label classification is that a sample can only be assigned to one class selected from multiple classes for multi-class classification. In contrast, a sample can be assigned to multiple classes for multi-label classification [43].

There is a need to construct an extended confusion matrix for a number of decision classes greater than 2. Thus, the number of rows and the number of columns in such an approach will equal the number of classes. For this purpose, we present the idea of a confusion matrix for *C* classes, which is shown in Table 2. As for the binary classification problem (see Table 1), we list four terms:


To evaluate the quality of multi-class classification, as in binary classification, measures based on the confusion matrix shown in Table 2 were used. Therefore, referring to the measures described in Section 3.1, we wish to present them for the multi-class classification problem.

Accuracy is one of the most commonly used measures in a multi-class classification problem [16] and is calculated according to Formula (14) or (15); when we define the sum of all samples as *s*, this equation boils down to the form (16). To calculate accuracy, sum all correctly classified samples and then divide by the number of all classified samples. Correctly classified samples are shown in the confusion matrix (see Table 2) on the main diagonal (from upper left corner to lower right corner).

$$accuracy = \frac{\sum\_{i=1}^{c} TP\_i}{\sum\_{i=1}^{c} TP\_i + \sum\_{i=1}^{c} FP\_i} \tag{14}$$

$$accuracy = \frac{\sum\_{i=1}^{c} TP\_i}{\sum\_{i=1}^{c} TP\_i + \sum\_{i=1}^{c} FN\_i} \tag{15}$$

$$accuracy = \frac{\sum\_{i=1}^{c} TP\_i}{s} \tag{16}$$
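Since the sums of per-class *FP<sub>i</sub>* and *FN<sub>i</sub>* both equal the total number of misclassified samples, Formulas (14), (15), and (16) coincide. A minimal sketch on an illustrative 3-class confusion matrix (counts are made up; rows are actual classes, columns predicted, as in Table 2):

```python
# Illustrative 3-class confusion matrix.
cm = [
    [50,  3,  2],
    [ 4, 40,  6],
    [ 1,  5, 60],
]
C = len(cm)
s = sum(sum(row) for row in cm)  # total number of samples

TP = [cm[i][i] for i in range(C)]                                # main diagonal
FP = [sum(cm[r][i] for r in range(C)) - cm[i][i] for i in range(C)]  # column errors
FN = [sum(cm[i]) - cm[i][i] for i in range(C)]                       # row errors

acc14 = sum(TP) / (sum(TP) + sum(FP))  # Formula (14)
acc15 = sum(TP) / (sum(TP) + sum(FN))  # Formula (15)
acc16 = sum(TP) / s                    # Formula (16)
print(acc14, acc15, acc16)
```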

Precision and recall are used for the multi-class approach as well. Two modifications can be distinguished in this case: *macro*− and *micro*− precision or recall [16]. For the *macro*− modification, to calculate the value of the measures for multiple classes, one must count precision and recall for each class separately and then calculate the arithmetic mean of these values. In this way, all classes during multi-class classification have the same validity, regardless of the class count. In multi-class classification, *macro*\_*precision* is calculated by Formula (17) and *macro*\_*recall* is calculated by Formula (18).

$$macro\\_precision = \frac{1}{c} \sum\_{i=1}^{c} \frac{TP\_i}{TP\_i + FP\_i} \tag{17}$$

$$macro\\_recall = \frac{1}{c} \sum\_{i=1}^{c} \frac{TP\_i}{TP\_i + FN\_i} \tag{18}$$


**Table 2.** Confusion matrix for multiple classes.

In contrast, for the *micro*\_ modification, to count the precision and recall values, one must look at all classes together. In this way, each correctly classified sample into a class is a component of all correctly classified samples. In other words, we calculate *TP* as the sum of all *TP* values for individual classes (the sum of the values from the main diagonal). The *FP* value will be the sum of all values off the main diagonal, equal to the *FN* value. Therefore, *micro*\_*precision* and *micro*\_*recall* are the same because both equal the ratio of the sum of *TP* values to the sum of all values in the confusion matrix [16]. In multi-class classification, *micro*\_*precision* is calculated by Formula (19) and *micro*\_*recall* is calculated by Formula (20).

$$micro\\_precision = \frac{\sum\_{i=1}^{c} TP\_i}{\sum\_{i=1}^{c} TP\_i + \sum\_{i=1}^{c} FP\_i} \tag{19}$$

$$micro\\_recall = \frac{\sum\_{i=1}^{c} TP\_i}{\sum\_{i=1}^{c} TP\_i + \sum\_{i=1}^{c} FN\_i} \tag{20}$$

The *F<sup>β</sup>* measure is also used for multi-class classification problems; however, the results of this measure are obtained by macro-averaging or micro-averaging [20]. When we assume that all classes are of the same weight, we use macro-averaging, where two additional methods can be adopted. The first is the calculation of the *F<sup>β</sup>* measure from the *macro*\_*precision* (see Equation (17)) and *macro*\_*recall* (see Equation (18)) measures, and the second is the arithmetic mean of the F-beta scores for each class separately, based on the *F<sup>β</sup>* measure for binary classification (see Equation (8)). The *macro*\_*F<sup>β</sup>* measure is specified by Formula (21).

$$\text{macro\\_F}\_{\beta} = (1 + \beta^2) \times \frac{\text{macro\\_precision} \times \text{macro\\_recall}}{\beta^2 \times \text{macro\\_precision} + \text{macro\\_recall}} \tag{21}$$

In contrast, in classification depending on the frequency of classes, the *F<sup>β</sup>* results are calculated by micro-averaging, based on *micro*\_*precision* (see Equation (19)) and *micro*\_*recall* (see Equation (20)). The *micro*\_*F<sup>β</sup>* measure is specified by Formula (22). In all of the discussed variants of the *F<sup>β</sup>* measure, the beta parameter determines the weight of recall in relation to precision. When *β* < 1, precision has more weight, and when *β* > 1, recall has more weight [20].

$$\text{micro\\_F}\_{\beta} = (1 + \beta^2) \times \frac{\text{micro\\_precision} \times \text{micro\\_recall}}{\beta^2 \times \text{micro\\_precision} + \text{micro\\_recall}} \tag{22}$$

The F1 measure (the most common instance of *Fβ*, where *β* = 1) is calculated based on recall and precision (present in the multi-class classification) [44], and thus can be used in the multi-class classification as well. For a problem where class imbalance does not matter and all classes are equally valid, we use the *macro*\_ modification and apply *macro*\_*precision* (see Equation (17)) and *macro*\_*recall* (see Equation (18)). Thus, the measure *macro*\_*F*1 can be calculated according to Formula (23).

$$macro\\_F1 = 2 \times \frac{macro\\_precision \times macro\\_recall}{macro\\_precision + macro\\_recall} \tag{23}$$
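As noted above, a macro-averaged F1 can be obtained in two ways: from *macro*\_*precision* and *macro*\_*recall* (Formula (23)), or as the arithmetic mean of the per-class F1 scores. The two routes generally give slightly different values, which the following minimal Python sketch illustrates (the 3-class confusion matrix is hypothetical; rows correspond to predicted classes and columns to actual classes, as in Table 2):

```python
# Hypothetical 3-class confusion matrix: rows = predicted classes, columns = actual classes.
cm = [[8, 1, 1],
      [2, 6, 2],
      [1, 2, 7]]
c = len(cm)

tp = [cm[i][i] for i in range(c)]
fp = [sum(cm[i]) - cm[i][i] for i in range(c)]                       # row sum minus diagonal
fn = [sum(cm[r][i] for r in range(c)) - cm[i][i] for i in range(c)]  # column sum minus diagonal

precision = [tp[i] / (tp[i] + fp[i]) for i in range(c)]
recall    = [tp[i] / (tp[i] + fn[i]) for i in range(c)]
macro_precision = sum(precision) / c
macro_recall    = sum(recall) / c

# Route 1: F1 from the macro-averaged precision and recall (Formula (23)).
macro_f1_from_macro = 2 * macro_precision * macro_recall / (macro_precision + macro_recall)

# Route 2: arithmetic mean of the per-class (binary-style) F1 scores.
per_class_f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
macro_f1_mean = sum(per_class_f1) / c

print(round(macro_f1_from_macro, 4), round(macro_f1_mean, 4))  # the two routes differ slightly
```

The discrepancy between the two routes is small here, but it shows why the chosen macro-averaging method should always be reported.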

In problems with unbalanced classes, where selected classes are more important than others, the *micro*\_ modification leads to the *micro*\_*precision* (see Equation (19)) and *micro*\_*recall* (see Equation (20)). Thus, the measure *micro*\_*F*1 can be calculated according to Formula (24).

$$\text{micro\\_F1} = 2 \times \frac{\text{micro\\_precision} \times \text{micro\\_recall}}{\text{micro\\_precision} + \text{micro\\_recall}} \tag{24}$$

From Formulas (19), (20) and (24), one can conclude that, for the multi-class classification problem, the values of the measures *micro*\_*precision*, *micro*\_*recall*, and *micro*\_*F*1 are equal to each other. At the same time, the values of these measures are equal to the *accuracy* of Formula (16).
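This equality is easy to verify numerically. A minimal sketch (the 3-class confusion matrix is hypothetical; rows = predicted classes, columns = actual classes):

```python
# Hypothetical 3-class confusion matrix: rows = predicted classes, columns = actual classes.
cm = [[5, 2, 0],
      [1, 6, 2],
      [0, 1, 3]]
c = len(cm)
s = sum(sum(row) for row in cm)           # total number of samples

tp_sum = sum(cm[i][i] for i in range(c))  # trace: all correctly classified samples
fp_sum = sum(sum(cm[i]) - cm[i][i] for i in range(c))                       # off-diagonal, by rows
fn_sum = sum(sum(cm[r][i] for r in range(c)) - cm[i][i] for i in range(c))  # off-diagonal, by columns

micro_precision = tp_sum / (tp_sum + fp_sum)   # Formula (19)
micro_recall    = tp_sum / (tp_sum + fn_sum)   # Formula (20)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)  # Formula (24)
accuracy = tp_sum / s                          # Formula (16)

# All four reduce to trace / total, here 14 / 20 = 0.7.
print(micro_precision, micro_recall, micro_f1, accuracy)
```

The off-diagonal mass counted by rows equals the mass counted by columns, so the four measures necessarily coincide.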

The Matthews correlation coefficient (MCC) is only used in classifications with up to two classes (see Equation (10)). For classifications with more than two classes, dividing the multiple classes into two classes (positive and negative) is often not meaningful [45]. Therefore, J. Gorodkin, in his work [46], proposed an extended correlation coefficient (called the *RK* statistic, for *K* different classes) that can be used in multi-class classification. Based on this, we define *MCC* for multiple classes as Formula (25), where *s* is the sum of all samples, *TPi* + *FPi* is the sum of all samples in row *i*, and *TPi* + *FNi* is the sum of all samples in column *i*.

$$\text{MCC\\_multiclass} = \frac{s \times \sum\_{i=1}^{c} TP\_i - \sum\_{i=1}^{c} \left( (TP\_i + FP\_i) \times (TP\_i + FN\_i) \right)}{\sqrt{s^2 - \sum\_{i=1}^{c} (TP\_i + FP\_i)^2} \times \sqrt{s^2 - \sum\_{i=1}^{c} (TP\_i + FN\_i)^2}} \tag{25}$$
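Formula (25) can be computed directly from a confusion matrix. For two classes it reduces algebraically to the classical binary MCC of Equation (10), which gives a convenient sanity check; the 2-class matrix below is hypothetical:

```python
import math

def mcc_multiclass(cm):
    """MCC for a c-class confusion matrix, Formula (25).
    Rows = predicted classes (row i sums to TP_i + FP_i),
    columns = actual classes (column i sums to TP_i + FN_i)."""
    c = len(cm)
    s = sum(sum(row) for row in cm)
    tp_sum = sum(cm[i][i] for i in range(c))
    rows = [sum(cm[i]) for i in range(c)]                       # TP_i + FP_i
    cols = [sum(cm[r][i] for r in range(c)) for i in range(c)]  # TP_i + FN_i
    num = s * tp_sum - sum(rows[i] * cols[i] for i in range(c))
    den = math.sqrt(s**2 - sum(v**2 for v in rows)) * math.sqrt(s**2 - sum(v**2 for v in cols))
    return num / den

# Sanity check: on a hypothetical 2-class matrix the result matches the binary MCC.
TP, FN, FP, TN = 40, 10, 5, 45
binary_mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(mcc_multiclass([[TP, FP], [FN, TN]]), binary_mcc)  # the two values agree
```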

The last discussed measure is BalancedAccuracy, which can be used to calculate accuracy for unbalanced classes [41]. According to Formula (11), *BalancedAccuracy*\_*binary* is the arithmetic mean of *recall*\_*binary* and *specificity*\_*binary*. For binary classification, the value of *specificity*\_*binary* for the first class equals *recall*\_*binary* for the second class. For this reason, in multi-class classification, to calculate *BalancedAccuracy*, one must count the recall for each class separately and then calculate the arithmetic mean of these values, according to Formula (26).

$$BalancedAccuracy = \frac{1}{c} \sum\_{i=1}^{c} \frac{TP\_i}{TP\_i + FN\_i} \tag{26}$$
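A small numeric illustration of why this matters for imbalanced data (the 2-class matrix is hypothetical; rows = predicted, columns = actual): plain accuracy is dominated by the majority class, while BalancedAccuracy averages the per-class recalls.

```python
# Hypothetical imbalanced 2-class confusion matrix: rows = predicted, columns = actual.
cm = [[90, 5],
      [10, 5]]
c = len(cm)

tp = [cm[i][i] for i in range(c)]
fn = [sum(cm[r][i] for r in range(c)) - cm[i][i] for i in range(c)]  # column sum minus diagonal
recall = [tp[i] / (tp[i] + fn[i]) for i in range(c)]                 # per-class recall

balanced_accuracy = sum(recall) / c              # Formula (26): mean of per-class recalls
accuracy = sum(tp) / sum(sum(row) for row in cm)

print(balanced_accuracy, accuracy)  # 0.7 vs. ~0.864: accuracy is inflated by the majority class
```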

#### **4. Preference-Driven Quality Assessment for Multi-Class Classification**

The measures described above are insufficient for expressing preferences for individual classes in the classifier quality assessment, because there is no place in their construction to indicate such preferences. Thus, we propose a new preference-driven measure that evaluates classification quality based on different weights for each decision class. The proposed measure works independently of the number of decision classes—its definition is the same in binary and multi-class classification. We also suggest default values for this measure that can be used, in the test case, without specifying exact preferences for each decision class.

#### *4.1. Proposed Preference-Driven Classification Measure*

Following the measures given in Section 3.2, the proposed measure is defined based on the confusion matrix given in Table 2. We aim to keep it as simple as possible while satisfying the assumption of adjustment to the preferences of each decision class. Therefore, the proposed measure is based on a confusion matrix and on the precision and recall measures, with *κ* parameters determining their relative importance.

The preference-driven classification measure is denoted as preference-driven<sub>*κ⃗*</sub>, where *κ⃗* is the preference vector, whose length is equal to the number of decision classes (see Formula (27)). The *κ* weights for the subsequent classes are written at the subsequent positions of the vector. The higher the *κ* value for a given decision class, the greater the importance of precision (determined by Formula (28)) relative to recall (determined by Formula (29)) for this class. Therefore, changing the *κ* value of a given class makes it possible to control the relative importance of precision and recall. This is a multi-criteria process because the *κ* value can differ for each class.

Finally, the preference-driven<sub>*κ⃗*</sub> measure can be expressed by Formula (30). One should note that *κ⃗* is a parameter of the measure by which the relative importance of precision and recall can be established for each decision class separately. For example, *κ⃗* = [0.2, 0.6, 0.3] means that, for the first class, 20% precision and 80% recall are used; for the second class, 60% precision and 40% recall; and for the third class, 30% precision and 70% recall. To keep the final value of the preference-driven<sub>*κ⃗*</sub> measure in the range [0.0, 1.0], the sum of all these values is divided by the number of classes.

$$\text{preference-driven}\_{\overrightarrow{\kappa}} = \frac{1}{c} \sum\_{i=1}^{c} \left( \kappa\_i \times \frac{TP\_i}{TP\_i + FP\_i} + (1 - \kappa\_i) \times \frac{TP\_i}{TP\_i + FN\_i} \right) \tag{27}$$

$$precision\_i = \frac{TP\_i}{TP\_i + FP\_i} \tag{28}$$

$$recall\_i = \frac{TP\_i}{TP\_i + FN\_i} \tag{29}$$

$$\text{preference-driven}\_{\overrightarrow{\kappa}} = \frac{1}{c} \sum\_{i=1}^{c} \left( \kappa\_i \times precision\_i + (1 - \kappa\_i) \times recall\_i \right) \tag{30}$$
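Formula (30) translates directly into code. The sketch below is a minimal Python implementation (rows = predicted classes, columns = actual classes); the 2-class example matrix is hypothetical, but it is constructed so that its per-class precision and recall reproduce the preference-driven values reported for *cm*4 in Section 4.2 (the actual *cm*4 is shown in Figure 1):

```python
def preference_driven(cm, kappa):
    """Preference-driven measure (Formula (30)).
    cm: confusion matrix, rows = predicted classes, columns = actual classes.
    kappa: preference vector; kappa[i] weights precision vs. recall for class i."""
    c = len(cm)
    assert len(kappa) == c, "one kappa weight per decision class"
    total = 0.0
    for i in range(c):
        tp = cm[i][i]
        fp = sum(cm[i]) - tp                        # row i minus diagonal
        fn = sum(cm[r][i] for r in range(c)) - tp   # column i minus diagonal
        precision_i = tp / (tp + fp)                # Formula (28)
        recall_i = tp / (tp + fn)                   # Formula (29)
        total += kappa[i] * precision_i + (1 - kappa[i]) * recall_i
    return total / c

# Hypothetical matrix with per-class precision [2/3, 1.0] and recall [1.0, 0.5].
cm = [[10, 5],
      [0, 5]]
print(preference_driven(cm, [0.0, 0.0]))  # pure recall:    (1.0 + 0.5) / 2 = 0.75
print(preference_driven(cm, [1.0, 1.0]))  # pure precision: (2/3 + 1.0) / 2 ≈ 0.8333
print(preference_driven(cm, [0.1, 0.3]))  # mixed preferences ≈ 0.8083
```

Setting all weights to 0 or 1 recovers macro recall and macro precision, respectively; intermediate vectors interpolate per class.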

#### *4.2. Proposed Measure Analysis in the Test Case*

A sample confusion matrix was prepared for a classification problem with two decision classes to demonstrate how the measure works depending on the preferences and the classification outcome. Two classes were chosen so that the measure's values can be easily visualized.

Figure 1 presents nine confusion matrices for which preference-driven<sub>*κ⃗*</sub> values were determined, with *κ⃗* taking all combinations of values from 0.0 to 1.0 with a step of 0.1, i.e., [0.0, 0.0], [0.0, 0.1], ..., [0.5, 0.4], [0.5, 0.5], [0.5, 0.6], ..., [1.0, 0.9], [1.0, 1.0]. This gives a total of 121 different preference vectors. The results are visualized in the following figures.

To better capture the distribution of values of the preference-driven measure, an analysis based on the confusion matrix *cm*4 (see Figure 1) is presented. It was selected because the quality assessment in terms of the classical measures is always the same: *macro*\_*precision* (see Equation (17)) is 0.8333, and *macro*\_*recall* (see Equation (18)) is 0.7500. In Figure 2, one can see that the value of the preference-driven measure ranges from 0.5833 for *κ⃗* = [1.0, 0.0] to 1.0000 for *κ⃗* = [0.0, 1.0]. It is possible to obtain exactly the values of recall (*κ⃗* = [0.0, 0.0]) and precision (*κ⃗* = [1.0, 1.0]), but, depending on the preferences, the classifier will be evaluated differently. For example, preference-driven[0.1,0.3], i.e., for *κ*<sub>1</sub> = 0.1 and *κ*<sub>2</sub> = 0.3, is 0.8083 (green dot in Figure 2); similarly, preference-driven[0.1,0.8] = 0.9333 (black dot in Figure 2), preference-driven[0.4,0.4] = 0.7833 (red dot in Figure 2), preference-driven[0.9,0.1] = 0.6250 (light blue dot in Figure 2), and preference-driven[0.9,0.8] = 0.8000 (bright orange dot in Figure 2).

**Figure 1.** Confusion matrix used for measure analysis.

**Figure 2.** All possible values of the preference-driven measure for the confusion matrix *cm*4.

Next, we present the different solution spaces obtained for successive confusion matrices. Figure 3 presents the solution spaces of the proposed measure for the *cm*4 and *cm*8 confusion matrices (see Figure 1), which can be described as opposites of each other and were thus selected for the analysis. As the value of *κ*<sub>1</sub> increases and the value of *κ*<sub>2</sub> decreases, the value of the proposed measure for *cm*4 decreases, while that for *cm*8 increases, and vice versa.

**Figure 3.** All possible values of the preference-driven measure for the confusion matrix *cm*4 and its opposite *cm*8.

Different solution spaces for the confusion matrices from Figure 1 are presented in Figures 4 and 5. In the first case (Figure 4), specific confusion matrices are selected:


Figure 5 presents the solution spaces for all (except *cm*9) confusion matrices from Figure 1. It allows us to present the dynamics of the solution space based on preference-driven<sub>*κ⃗*</sub>, depending on the classification being evaluated and the value of the *κ⃗* vector. Depending on the values of the *κ⃗* vector, the proposed measure evaluates the classifier differently. This is also presented in Table 3, where the results for different values of the *κ⃗* vector are given. The examples [0.3, 0.6], [0.9, 0.4], and [0.5, 0.5] were chosen, as well as precision (which is equivalent to *κ⃗* = [1.0, 1.0]) and recall (equivalent to *κ⃗* = [0.0, 0.0]). As can be seen, depending on the given preferences, the evaluation of the classifier changes noticeably even in this case.

**Figure 4.** Example distributions of values of the preference-driven measure: (**a**) confusion matrix *cm*1; (**b**) confusion matrix *cm*9; (**c**) confusion matrix *cm*4; (**d**) confusion matrix *cm*8.

**Figure 5.** Solution spaces of proposed measure values for selected confusion matrices.

Analogously to Table 3, Figure 6 presents the classification quality assessment values for each of the prepared confusion matrices. Such a visualization allows us to observe the influence of the proposed measure on the classification evaluation (compare with the earlier discussed example of *cm*4 and *cm*8, for which precision, recall, and F1 have identical values). The diagonal lines allow a more straightforward analysis of the changes between the sample confusion matrices; in this case, we can see that the mentioned measures have identical values for these confusion matrices, but the proposed preference-driven measure obtains different scores for the same preference vectors. Please note that the confusion matrices should not be interpreted as consecutive occurrences; the classification evaluation values should be compared regardless of the order in which they are presented.

To further compare the proposed preference-driven measure with the F1 measure, which is also based on recall and precision values, we analyzed the values of the preference vector (*κ⃗*) that produce the same classification score as the F1.

We performed careful analyses for all the confusion matrices described in this section, except for *cm*1, for which the value of all measures is always 1.0000 (this confusion matrix represents the ideal classification). Our observations indicate that, unlike for the recall and precision measures, there are no obvious values of the preference vector that reproduce the value of the F1 measure. For the subsequent confusion matrices, the F1 counterparts are preference-driven measures with the following preference vectors: for *cm*2, *κ⃗* = [0.015, 0.095]; for *cm*3, *κ⃗* = [0.010, 0.145]; for *cm*4, *κ⃗* = [0.189, 0.284]; for *cm*5, *κ⃗* = [0.045, 0.095]; for *cm*6, *κ⃗* = [0.06, 0.12]; for *cm*7, *κ⃗* = [0.5, 0.5]; for *cm*8, *κ⃗* = [0.284, 0.189]; and for *cm*9, *κ⃗* = [0.67, 0.00].

These observations indicate significant differences between the F1 and the proposed preference-driven measures. As we have already presented, F1 for a single confusion matrix (i.e., the classifier score) is always represented by a single value. In contrast, it is possible to control the quality score according to preferences in the preference-driven case.

**Table 3.** Example values of the proposed preference-driven measure in comparison with other measures for assessing the quality of classification (p-d is the abbreviation for the preference-driven measure). Results determined for all confusion matrices presented in Figure 1.


<sup>1</sup> *macro*\_*precision* (Equation (17)) is equal to preference-driven[1.0,1.0] . <sup>2</sup> *macro*\_*recall* (Equation (18)) is equal to preference-driven[0.0,0.0] . <sup>3</sup> *macro*\_*F*1 (Equation (23)). <sup>4</sup> See Figure 1.

**Figure 6.** Example values of the proposed measure in comparison with other measures for assessing the quality of classification (the diagonal lines allow easier analysis of the changes between the sample confusion matrices).

#### *4.3. Default Values of the Preference Vector*

The proposed measure is intended to evaluate the classifier's quality as closely as possible to the stated preferences (as long as the decision maker clearly describes these preferences). To make the measure more comparable with other measures, we propose, as an alternative, default values of the preference vector for the preference-driven<sub>*κ⃗*</sub> measure.

The ratio of the number of objects in each class to the number of all samples is taken as the default value of the preference vector. This means that, for classes with many samples, higher weight is related to precision for this class, while, for classes with a relatively small number of samples, higher weight is related to recall of this class. Such a solution allows compensation for the situation in which the samples are more often classified into classes with many samples. When validating a classifier, there is always a learning set (whether for train-and-test or cross-validation), which each time allows the mentioned default values of the preference vector to be determined—the preference vector can correspond to the distribution of cases in the learning data.

In analogy with the previously adopted designations, a notation for the default value of the *κ⃗* vector is proposed in Formula (31). The designations are the same as those in Table 2; additionally, *s* is the sum of all samples.

$$\overrightarrow{\kappa} = \left[ \frac{TP\_1 + FN\_1}{s}, \frac{TP\_2 + FN\_2}{s}, \dots, \frac{TP\_i + FN\_i}{s}, \dots, \frac{TP\_{c-1} + FN\_{c-1}}{s}, \frac{TP\_c + FN\_c}{s} \right] \tag{31}$$

For example, in the situation described in Section 4.2, where the classes in the confusion matrix are balanced, the default vector would be of the form *κ⃗* = [0.5, 0.5]. Its value happens to be given in Table 3, in the column preference-driven[0.5,0.5].

Extending the example, a confusion matrix is proposed in Table 4. In this case, there are three classes, containing 50, 20, and 30 samples, respectively. Therefore, the default values of the preference vector would be as follows:

$$\overrightarrow{\kappa} = \left[\frac{50}{100}, \frac{20}{100}, \frac{30}{100}\right],$$

so

$$\overrightarrow{\kappa} = [0.5, 0.2, 0.3].$$

Here, 50 out of 100 instances are in class 1, 20 out of 100 instances are in class 2, and 30 out of 100 instances are in class 3. The preference-driven measure value is 0.656, because

$$\begin{aligned} \text{preference-driven}\_{[0.5, 0.2, 0.3]} &= \frac{1}{3} \times \left( 0.5 \times \frac{TP\_1}{TP\_1 + FP\_1} + (1 - 0.5) \times \frac{TP\_1}{TP\_1 + FN\_1} \right) + \\ &\quad \frac{1}{3} \times \left( 0.2 \times \frac{TP\_2}{TP\_2 + FP\_2} + (1 - 0.2) \times \frac{TP\_2}{TP\_2 + FN\_2} \right) + \\ &\quad \frac{1}{3} \times \left( 0.3 \times \frac{TP\_3}{TP\_3 + FP\_3} + (1 - 0.3) \times \frac{TP\_3}{TP\_3 + FN\_3} \right) \end{aligned}$$

so

$$\begin{aligned} \text{preference-driven}\_{[0.5, 0.2, 0.3]} &= \frac{1}{3} \times \left( 0.5 \times \frac{40}{40 + 8 + 9} + 0.5 \times \frac{40}{40 + 7 + 3} \right) + \\ &\quad \frac{1}{3} \times \left( 0.2 \times \frac{10}{10 + 7 + 1} + 0.8 \times \frac{10}{10 + 8 + 2} \right) + \\ &\quad \frac{1}{3} \times \left( 0.3 \times \frac{20}{20 + 3 + 2} + 0.7 \times \frac{20}{20 + 9 + 1} \right) \end{aligned}$$

$$\text{preference-driven}\_{[0.5, 0.2, 0.3]} = \frac{1}{3} \times (0.5 \times 0.702 + 0.5 \times 0.8 + 0.2 \times 0.556 + 0.8 \times 0.5 + 0.3 \times 0.8 + 0.7 \times 0.667)$$

$$\text{preference-driven}\_{[0.5, 0.2, 0.3]} = \frac{1}{3} \times (0.351 + 0.4 + 0.111 + 0.4 + 0.24 + 0.467)$$

$$\text{preference-driven}\_{[0.5, 0.2, 0.3]} = 0.656$$
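The worked example can be checked programmatically. The confusion matrix below is read off the fractions in the derivation above (rows = predicted classes, columns = actual classes; the column sums give the class sizes 50, 20, and 30), and both the default preference vector of Formula (31) and the final measure value are reproduced:

```python
# Confusion matrix reconstructed from the fractions above:
# rows = predicted classes, columns = actual classes.
cm = [[40, 8, 9],
      [7, 10, 1],
      [3, 2, 20]]
c = len(cm)
s = sum(sum(row) for row in cm)  # 100 samples in total

tp = [cm[i][i] for i in range(c)]
fp = [sum(cm[i]) - cm[i][i] for i in range(c)]                       # row sums minus diagonal
fn = [sum(cm[r][i] for r in range(c)) - cm[i][i] for i in range(c)]  # column sums minus diagonal

# Default preference vector (Formula (31)): relative class frequencies.
kappa = [(tp[i] + fn[i]) / s for i in range(c)]
print(kappa)  # [0.5, 0.2, 0.3]

# Preference-driven measure (Formula (30)).
value = sum(k * t / (t + p) + (1 - k) * t / (t + n)
            for k, t, p, n in zip(kappa, tp, fp, fn)) / c
print(round(value, 3))  # 0.656
```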

The results for the default values of the preference vector of the preference-driven<sub>*κ⃗*</sub> measure are also analyzed in Section 5, when testing the proposed measure on different real data sets.

**Table 4.** Example confusion matrix used to demonstrate how to determine the proposed preference-driven measure.


#### **5. Analysis on Real-World Data Sets**

As the paper proposes a new classification quality assessment measure whose value depends on the stated preferences (called preference vector), we conducted experiments on real-world data sets. First, we checked the importance of the proposed measure depending on the given preference vector—while comparing the performance and ranks of the classifiers with classical measures of classification quality assessment.

#### *5.1. Experiment Conditions*

The proposed measure preference-driven<sub>*κ⃗*</sub> is compared with the classical measures described in this paper (see Section 3.2). As some of the measures reduce to the same measure in the case of multi-class classification, the following names are used in this section: accuracy (see Equation (16)), precision (see Equation (17)), recall (see Equation (18)), F1 (see Equation (23)), and MCC (see Equation (25)).

Four well-known data sets were selected for classification; their structures are described in Table 5. As one can see, multi-class data sets were selected in which, additionally, the distribution of samples across classes is uneven (see "percent of samples per class" in Table 5, where the ratio of samples of each class to the whole data set is given). As test classifiers, we selected classifiers available in the Weka-3-6-11 system [47], specifically: Bagging [48], BayesNet [49], DecisionTable [50], C4.5 (J48) [51], and RandomForest [52]. In each case, 10-fold cross-validation was used so that each sample was subject to prediction.

Classification results from the Weka system (including confusion matrices) and the values of the proposed classification quality evaluation measure for fixed preference vectors are available on the website of the Department of Machine Learning of the University of Economics in Katowice (https://www.ue.katowice.pl/jednostki/katedry/wiik/katedrauczenia-maszynowego/zrodla-danych/preference-driven.html (accessed on 6 February 2022)).

**Table 5.** Characteristics of the real data sets used to test the proposed preference-driven measure and comparison with classical measures.


The number of decision classes differs between the data sets, so the checked combinations of values in the preference vector covered different possible values of its subsequent elements. The number of combinations was close to 15,000 (except for the krkopt data set, for which the number of combinations had to exceed this value). In this way, we obtained:


#### *5.2. Experimental Results*

Tables 6–9 present the results for all data sets and all classifiers. Ratings were made for all measures, with preference-driven<sub>*κ⃗*</sub> determined each time for the default values and for five different, selected preference vectors. The evaluation values are presented along with the ranking order of the classifier for each measure (in brackets). This approach allows us not only to evaluate the differences between the evaluation values but also to indicate whether, using the proposed measure, another classifier could turn out to be better than those indicated by the classical classification quality evaluation measures. In addition, Figure 7 shows histograms of the preference-driven<sub>*κ⃗*</sub> measure values for all tested preference vectors.

Figure 7 presents 20 different histograms used to show the distribution of values (without unnecessarily detailed information). In part (a), successive histograms are shown for the car data set; (b) contains the nursery histograms; (c) is related to dermatology; and (d) shows the krkopt results. This figure concerns the values of the preference-driven measure determined for different preference vectors (*κ⃗*). The X-axis represents the value of the preference-driven measure, while the Y-axis represents the number of occurrences of this value.

This presentation in Figure 7 allows us to notice that in each case, using the proposed measure, it is possible to obtain a different score with the same confusion matrix. In addition, in most cases, the values of the measure are close to a normal distribution. A slightly more interesting case is (b), with the Bagging, BayesNet, and RandomForest classifiers. In this case, however, it should be noted that the classification each time gives a specific confusion matrix (detailed results are available on the UE Katowice website (https://www.ue.katowice.pl/jednostki/katedry/wiik/katedra-uczeniamaszynowego/zrodla-danych/preference-driven.html) (accessed on 6 February 2022)). Note that there are only two samples in the first class. The class with the most significant number of samples is always correctly identified. Additionally, in case (c), the particular histogram is obtained for the RandomForest classifier, where the classification was excellent, close to error-free. In contrast, the values were rounded to the second decimal for the histogram, hence such a high concentration of preference-driven measure values.

**Figure 7.** Histogram of preference-driven measure values in the (**a**) car data set; (**b**) nursery data set; (**c**) dermatology data set; (**d**) krkopt data set—the X-axis represents the value of the preference-driven measure, while the Y-axis represents the number of occurrences of this value.

Analyzing the experimental results, it is worth noting that, by using different vectors with the preference-driven measure, it is possible to indicate a different classifier as better adapted to the problem. For example, in the case of the car data set (see Table 6), the DecisionTable classifier turns out to be the best for some preference vectors (e.g., [0, 1, 1, 0.9]); similarly, the BayesNet classifier turns out to be better than Bagging if the preference vector is, among others, [0.3, 1, 1, 1]. The situation is similar for the nursery data set (see Table 7), where DecisionTable also turns out to be the best for the preference vector [0.498, 1, 0.166, 1, 0]. A similar difference, but on later ranks (between 3 and 4), can also be observed for the krkopt data set (see Table 9). Although no change of ranks was observed for the dermatology data set (see Table 8), the disproportion between the assessments of specific classifiers changed considerably; e.g., for the vector [0, 0, 0, 1, 1, 0.75], the difference between the best BayesNet and the second-best RandomForest was less than 0.0083, whereas with the classical measures of classification quality assessment this difference was around 0.02.

The aim of the experiment was to check whether the proposed measure would indicate different classifiers for the same problem depending on the values in the preference vector. It turned out that, indeed, the measure indicates different classifiers, even with the limited number of vectors tested.

**Table 6.** Results for the car data set—the value of the classification quality assessment (in brackets, we give the ranking of the classifier, according to the given measure). The ranking determines the order of the classifier depending on the classification quality rating measure used.


<sup>1</sup> Default value of the preference vector (see Section 4.3).

**Table 7.** Results for the nursery data set—the value of the classification quality assessment (in brackets, we give the ranking of the classifier, according to the given measure). The ranking determines the order of the classifier depending on the classification quality rating measure used.


<sup>1</sup> Default value of the preference vector (see Section 4.3).

**Table 8.** Results for the dermatology data set—the value of the classification quality assessment (in brackets, we give the ranking of the classifier, according to the given measure). The ranking determines the order of the classifier depending on the classification quality rating measure used.


<sup>1</sup> Default value of the preference vector (see Section 4.3).

**Table 9.** Results for the krkopt data set—the value of the classification quality assessment (in brackets, we give the ranking of the classifier, according to the given measure). The ranking determines the order of the classifier depending on the classification quality rating measure used.


<sup>1</sup> Default value of the preference vector (see Section 4.3). <sup>2</sup> *κ⃗* = [0, 1, 1, 1, 1, 1, 0.9, 0.9, 1, 0.9, 0.9, 0.9, 0, 0, 0, 0, 0.9, 1]. <sup>3</sup> *κ⃗* = [0.33, 0.66, 0.33, 0.33, 0.66, 0.33, 0.66, 0.66, 0.33, 0.66, 0.66, 0.33, 0.66, 0.66, 0.66, 0.66, 0.33, 0.33]. <sup>4</sup> *κ⃗* = [0.66, 0.33, 0.33, 0.33, 0.33, 0.66, 0.66, 0.33, 0.66, 0.66, 0.66, 0.33, 0.33, 0.66, 0.33, 0.33, 0.33, 0.33]. <sup>5</sup> *κ⃗* = [1, 0.1, 1, 1, 1, 1, 0.1, 0.1, 0, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0.9, 1]. <sup>6</sup> *κ⃗* = [1, 1, 0.9, 0.9, 0.8, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1, 0.0, 0.0].

#### **6. Discussion**

In supervised learning, one of the necessary steps in properly designed research is to train different classifiers (sometimes with different parameters) on the training data and then evaluate their quality. In our work, we assume that the user has specific preferences, and that these relate to particular decision classes. Thus, we can immediately identify a suitable classifier.

Using a vector of preferences, it is possible to indicate whether precision or recall is more important. For many problems, it is crucial that a classifier can be selected that is better adapted to the structure of the data and to the decision makers' expectations for each class separately. The best strategy is to train and examine as many classifiers as possible (with different parameters). Such an approach allows the identification of the potentially best classifiers for a given problem.

The proposed preference-driven measure used in the experiments allowed for a better selection of the classifier most suitable for the task. With an approach allowing us to indicate whether precision or recall is more important—for each class separately—it is possible to select a classifier more adapted to the more important (from the evaluation point of view) decision class.

Despite the limited number of combinations, the conducted experiments indicated that the proposed measure, depending on the preferences conveyed in the form of a preference vector, points to different classifiers as the best choice for further prediction.

Since the set of possible values in the preference vector is infinite, the set of possible measure values is as well. Therefore, calculating the measure and comparing it with other measures is possible only at specific points. It should be noted that it is not necessary to test many combinations of the preference vector in a real application. Instead, the vector should be predetermined, indicating the relative importance of the recall–precision balance in each decision class.

This also distinguishes the preference-driven measure from the *F<sup>β</sup>* measure, which likewise weights precision against recall. However, the proposed preference-driven measure is different: in the case of *Fβ*, there is a single weighting between precision and recall for the classifier as a whole, whereas the preference-driven measure assigns a weight to each decision class. This allows us to change the emphasis on precision and recall for each decision class separately, which is not possible with *Fβ*.

#### **7. Conclusions**

This article presents a new idea for a preference-driven classification measure. We tried to show that the measure works, i.e.,


In the "objective" approach (without preferences), the result is unambiguous and comparable, but the best classifier will not necessarily be adjusted to the subjective needs of the user. In the "subjective" approach (preference-driven), comparability is difficult, but in return, users can acquire a classifier better suited to their requirements.

The concept of this measure results from the shortcomings of measures related to multi-class classification. Nowadays, we observe a large number of classification methods, yet there is no single versatile classification measure capable of capturing both overall classification quality and a particular focus on selected decision classes. Most classification measures are designed for binary decision classes and often fail to achieve good results for real-world data. At the same time, multi-class classification measures are based on averaging the results, which may be fair in the general case but fails to include decision makers' preferences related to particular classes.

Similarly, for unbalanced cases, where there is a need to focus on particular classes, the proposed preference-driven measure fills this gap well. To be more precise, the proposed preference-driven measure can be aligned with the decision makers' preferences regarding the relative importance of precision and recall. We also present the idea of setting the default values of the preference vector based on the number of samples assigned to each decision class. The most important advantage of the proposed idea is its grounding in well-known measures such as precision and recall. Moreover, the *κ⃗* preference vector allows us to direct the focus to a particular decision class, or even to express the importance of selected decision classes in terms of the precision measure (and others in terms of recall).

At the same time, we show that, even for potentially trivial cases, such a preference-driven measure can lead to entirely different results depending on the *κ⃗* selection. This opens a discussion for multi-class classification and leads to an interesting conclusion: the solution for the classification problem should not be considered a single scalar value.

In the future, the preference-driven measure can be used in line with the proposed approach. Alternatively, the factors of which the measure is composed could be scrutinized, and other measures could be used instead of the class's relative precision and recall values.

**Author Contributions:** Conceptualization, J.K.; methodology, J.K., B.P., K.K. and P.J.; software, J.K.; validation, J.K. and B.P.; formal analysis, J.K., B.P., K.K. and P.J.; investigation, J.K. and B.P.; resources, J.K.; data curation, J.K.; writing—original draft preparation, J.K., B.P., K.K. and P.J.; writing—review and editing, J.K., B.P., K.K. and P.J.; visualization, J.K.; supervision, J.K.; project administration, J.K., B.P. and P.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The classification results (including confusion matrices) obtained in the Weka program, as well as the values of the preference-driven measure for the analyzed preference vectors, can be found on the website of the Department of Machine Learning of the University of Economics in Katowice: https://www.ue.katowice.pl/jednostki/katedry/wiik/katedra-uczeniamaszynowego/zrodla-danych/preference-driven.html (accessed on 6 February 2022). The implementation of the proposed preference-driven measure was prepared in Python language. The source code of this implementation is publicly available via GitHub: https://github.com/jankozak/preferencedriven (accessed on 6 February 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

### *Article* **A Novel Method for Fault Diagnosis of Rotating Machinery**

**Meng Tang, Yaxuan Liao, Fan Luo \* and Xiangshun Li**

School of Automation, Wuhan University of Technology, Wuhan 430070, China; tangmeng@whut.edu.cn (M.T.); lyx19971020@whut.edu.cn (Y.L.); lixiangshun@whut.edu.cn (X.L.)

**\*** Correspondence: dr\_luofan@whut.edu.cn

**Abstract:** When rotating machinery fails, the consequent vibration signal contains rich fault feature information. However, the vibration signal bears the characteristics of nonlinearity and nonstationarity, and is easily disturbed by noise, thus it may be difficult to accurately extract hidden fault features. To extract effective fault features from the collected vibration signals and improve the diagnostic accuracy of weak faults, a novel method for fault diagnosis of rotating machinery is proposed. The new method is based on Fast Iterative Filtering (FIF) and Parameter Adaptive Refined Composite Multiscale Fluctuation-based Dispersion Entropy (PARCMFDE). Firstly, the collected original vibration signal is decomposed by FIF to obtain a series of intrinsic mode functions (IMFs), and the IMFs with a large correlation coefficient are selected for reconstruction. Then, a PARCMFDE is proposed for fault feature extraction, where its embedding dimension and class number are determined by Genetic Algorithm (GA). Finally, the extracted fault features are input into Fuzzy C-Means (FCM) to classify different states of rotating machinery. The experimental results show that the proposed method can accurately extract weak fault features and realize reliable fault diagnosis of rotating machinery.

**Keywords:** fast iterative filtering; parameter adaptive refined composite multiscale fluctuation-based dispersion entropy; rotating machinery; fault diagnosis

#### **1. Introduction**

Rotating machinery, such as electric motors, centrifugal pumps, and turbine engines, represents the most widely used mechanical equipment in industrial processes [1]. Such equipment usually operates under unstable loads and harsh working conditions, so various failures of its critical components, such as bearing damage and impeller damage, are inevitable. The operating state of rotating machinery directly affects the productivity and safety of the industrial sector. Therefore, accurate and reliable fault diagnosis of rotating machinery is of great practical significance [2].

The key to fault diagnosis of rotating machinery is to extract fault features from vibration signals. Vibration signals are nonlinear and nonstationary [3], and are easily corrupted by noise, so it is difficult to extract their hidden features. Therefore, it is necessary to combine an appropriate time–frequency analysis method with an entropy measurement method to extract the hidden, tiny fault features. The first step is to choose an appropriate signal processing method. Studies have shown that when the fault signal is disturbed by noise, traditional time–frequency analysis techniques such as the Fast Fourier Transform (FFT) and the Wavelet Transform (WT) cannot accurately extract fault features [4,5]. A more commonly used method is the Empirical Mode Decomposition (EMD) proposed by Huang et al. in 1998 [6]. EMD can adaptively decompose a signal into the sum of a finite number of intrinsic mode functions (*IMF*s); each *IMF* component represents a set of signals at a characteristic scale, and extracting features from each component can better reveal the intrinsic characteristics of the fault information. However, EMD suffers from mode aliasing, end-point effects, and the lack of a rigorous mathematical framework for its iterative use of envelopes [7]. Although the Ensemble Empirical Mode Decomposition (EEMD) [8], an optimization of EMD, can effectively mitigate mode aliasing, and the Fast Ensemble Empirical Mode Decomposition (FEEMD) [9] further improves the calculation speed, neither escapes the drawback of using envelopes iteratively without a rigorous mathematical framework. Subsequently, Dragomiretskiy et al. proposed the adaptive Variational Mode Decomposition (VMD) method. VMD is a non-recursive variational decomposition model in which the optimal solution of the variational model is searched iteratively by the alternating direction method of multipliers, thereby determining the center frequency and bandwidth of each mode. It avoids the mode mixing of EMD and is more robust to noise [10]. However, VMD is computationally slow, and its performance depends heavily on its two input parameters, namely the penalty factor and the number of decomposition modes [11]. The Iterative Filtering (IF) method proposed by Lin et al. [12] and its derivatives, such as the Adaptive Local Iterative Filtering (ALIF) method [13] and the Fast Iterative Filtering (FIF) method [14], can produce results similar to EMD-based algorithms, with the important advantage that their convergence and stability are guaranteed. Moreover, the FIF method uses a fixed low-pass filter function to replace the envelope mean curve of EMD, which solves the problem of EMD lacking a strict mathematical framework. Meanwhile, the FIF method is unaffected by mode aliasing, and mode splitting can easily be avoided by adjusting the value of the stopping-criterion parameter [4]. Furthermore, FIF greatly improves the calculation speed while ensuring decomposition accuracy, with small decomposition error and good noise robustness, achieving efficient and accurate signal decomposition [15]. Therefore, this paper adopts the FIF method to decompose the vibration signals of rotating machinery.

**Citation:** Tang, M.; Liao, Y.; Luo, F.; Li, X. A Novel Method for Fault Diagnosis of Rotating Machinery. *Entropy* **2022**, *24*, 681. https://doi.org/10.3390/e24050681

Academic Editors: Jan Kozak and Przemysław Juszczuk

Received: 6 April 2022 Accepted: 8 May 2022 Published: 12 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The components of the vibration signal following decomposition by FIF contain rich fault information. Moreover, the components of vibration signals in different states of rotating machinery show different complexity, so the entropy parameter can be used to extract the fault information [16]. Approximate Entropy (ApEn), Sample Entropy (SampEn), Fuzzy Entropy (FE), and Permutation Entropy (PE) are widely used in the field of rotating machinery fault diagnosis to measure the complexity of vibration signals [17–19]. However, ApEn and Multiscale Approximate Entropy include the comparison of their own data segments in the calculation process, and their calculation depends on the data length. If the data length is short, the obtained value is usually smaller than the actual value. The SampEn is an improvement on the approximate entropy. It does not include the comparison of its own data segments, and has higher calculation accuracy and better consistency. However, SampEn and its improvements also have clear shortcomings: Firstly, SampEn and its improvements use Heaviside functions to measure the complexity of time series, resulting in inaccurate estimates in practical applications [20]. Secondly, SampEn and its improvements are computationally inefficient, especially for long time series. FE and its improvements replace the Heaviside function with a fuzzy membership function that is insensitive to background noise and highly sensitive to dynamic changes, but it is computationally inefficient [16]. PE is a method to measure the complexity of chaotic time series. PE has high computational efficiency, can be used to calculate huge datasets, and exhibits good anti-noise performance. However, the main disadvantage of PE is that it is prone to generating undefined entropy values for short-term time series and cannot classify well-defined patterns for a specific design [21]. In order to overcome the above problems, Hamed Azami et al. 
proposed a nonlinear time-series complexity evaluation method, Dispersion Entropy (DE). DE can generate reliable entropy values, is insensitive to noise interference, can accurately capture signal characteristics, and is computationally efficient [22]. Subsequently, in order to improve the ability to extract hidden fault features, Hamed Azami et al. further proposed the Refined Composite Multiscale Fluctuation-based Dispersion Entropy (RCMFDE), which can more accurately analyze the complexity of nonlinear time series under various scale factors, with more stable entropy values [23].

However, in the RCMFDE method, there are two key parameters (i.e., embedding dimension and class number) that need to be manually selected in advance. Furthermore, the parameter setting of the RCMFDE algorithm will affect the final processing result. If the parameter settings are unreasonable, the hidden tiny fault features may not be accurately extracted, resulting in misclassification. Aiming at the determination of the embedding dimension m and the class number c in the RCMFDE algorithm, this paper proposes a Parameter Adaptive Refined Composite Multiscale Fluctuation-based Dispersion Entropy (PARCMFDE). The method takes skewness as the objective function, and uses a Genetic Algorithm (GA) to optimize parameters of RCMFDE. PARCMFDE can automatically and effectively determine the important parameters of RCMFDE, so as to describe the complexity and uncertainty of time series more accurately, and achieve the purpose of extracting the features of hidden faults. In view of the shortcomings of existing methods, relevant research is carried out, and the main contributions are as follows:


This paper is mainly divided into the following sections: Section 2 briefly introduces the basic principles and characteristics of the FIF algorithm. In Section 3, the principle of PARCMFDE is introduced and compared with RCMFDE and Multiscale Sample Entropy (MSE) and Multiscale Fluctuation-based Dispersion Entropy (MFDE). Section 4 briefly introduces the principle and evaluation index of FCM. Section 5 presents the method of fault diagnosis of rotating machinery. Section 6 verifies the effectiveness of the method and compares it with other vibration signal fault diagnosis methods through the bearing data of Case Western Reserve University and experimental data from centrifugal pumps obtained by building a water circulation experimental system. Section 7 provides the conclusion.

#### **2. Fast Iterative Filtering**

The key idea of Fast Iterative Filtering is to iteratively subtract from the signal its approximate moving average, thereby separating the simple oscillatory components contained in the signal, the so-called IMFs [14]. The approximate moving average is computed by convolution with a window/filter function *w*, defined as a non-negative even continuous function with support [−*L*, *L*], *L* > 0, normalized so that $\int\_{-L}^{L} w(z)\,dz = 1$; the Fokker–Planck filter is used here. Let *s*(*x*) be a raw vibration signal, let *s*ˆ denote the Fourier transform of *s*, and let *DFT* and *IDFT* denote the discrete Fourier transform and its inverse, respectively. The specific implementation process of FIF is as follows:

(1) Calculate the length *L* of the corresponding filter *w* of the signal *s*(*x*):

$$L := 2\left\lfloor \xi \frac{N}{k} \right\rfloor \tag{1}$$

where *N* is the total number of sampling points of the signal *s*(*x*), *k* is the number of its extreme points, and *ξ* is a tuning parameter, which is usually fixed around 1.6 for the Fokker–Plank filter.

(2) Calculate the discrete Fourier transform of the signal *s*(*x*) and the corresponding filter *w*, denoted as *DFT*(*s*) and *DFT*(*w*), respectively.

(3) Calculate *s*ˆ*m*+1:

$$\hat{s}\_{m+1} = (I - \operatorname{diag}(DFT(w)))^{m+1}DFT(s) \tag{2}$$

(4) Calculate *N*<sup>0</sup> ∈ *N* and *IMF*:

$$\frac{N\_0^{N\_0}}{(N\_0+1)^{N\_0+1}} < \frac{\delta}{||s\_m||\_2} \tag{3}$$

$$IMF = \sum\_{k=0}^{n-1} u\_k (1 - \lambda\_k)^{N\_0} \sigma\_k = IDFT((I - D)^{N\_0}DFT(s)) \tag{4}$$

where *δ* > 0 represents the required precision; *N*<sup>0</sup> represents the number of iterations required to achieve the precision *δ* when calculating a specific *IMF*; *σk* is the *k*th element of the Fourier transform of the signal *s*; *λk* is the *k*th eigenvalue; *uk* is the *k*th eigenvector; *I* is the identity matrix; and *D* is the diagonal matrix whose diagonal entries are the eigenvalues.

(5) Judgment of inner loop stop condition: if the stop standard *SD* is met, then stop the inner loop, otherwise let *m* = *m* + 1 repeat steps (3)–(5), the stop standard *SD* is calculated by the following formula:

$$SD := \frac{||s\_{m+1} - s\_m||\_2}{||s\_m||\_2} < \delta, \quad \forall m \ge N\_0. \tag{5}$$

(6) Calculate the *IMF* component and the new *s*:

$$IMF = IMF \cup \{IDFT(\mathfrak{s}\_m)\} \tag{6}$$

$$s = s - IDFT(\mathfrak{s}\_m). \tag{7}$$


(7) Repeat steps (1)–(6) on the new *s*. When the residual contains one or fewer extreme points, the outer loop stops and the residual is appended as the final component:

$$IMF = IMF \cup \{s\}.\tag{8}$$

In short, the FIF method includes two processes: an inner loop and an outer loop. The purpose of the inner loop is to filter out the *IMF* components of each order; the purpose of the outer loop is to terminate the extraction of *IMF* components by the inner loop. When the residual obtained by removing all *IMF* components from the original signal *s*(*x*) contains one or fewer extreme points, the outer loop stops.
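As a concrete illustration, the inner and outer loops above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: a triangular (Fejér-type) window stands in for the Fokker–Planck filter, and the function name `fif_sketch` is ours.

```python
import numpy as np

def fif_sketch(s, xi=1.6, delta=1e-3, max_imfs=10):
    """Minimal FIF sketch: extract IMFs by repeatedly removing the
    signal's approximate moving average in the Fourier domain.
    A triangular (Fejer-type) window stands in for the Fokker-Planck
    filter used in the paper."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    imfs = []
    for _ in range(max_imfs):
        # Outer loop: stop when the residual has <= 1 extreme point.
        n_ext = np.count_nonzero(np.diff(np.sign(np.diff(s))))
        if n_ext <= 1:
            break
        # Equation (1): filter half-length from the extrema density.
        L = max(2, min(int(xi * N / n_ext), (N - 1) // 2))
        tri = 1.0 - np.abs(np.arange(-L, L + 1)) / (L + 1.0)
        tri /= tri.sum()                       # unit mass, as required for w
        w = np.zeros(N)                        # centre the filter at index 0
        w[:L + 1] = tri[L:]
        w[-L:] = tri[:L]
        w_hat = np.real(np.fft.fft(w))         # DFT(w); real, in [0, 1]
        m_hat = np.fft.fft(s)                  # DFT(s)
        prev = m_hat
        for _m in range(1000):
            m_hat = (1.0 - w_hat) * m_hat      # Equation (2) in Fourier space
            sd = np.linalg.norm(m_hat - prev) / (np.linalg.norm(prev) + 1e-30)
            if sd < delta:                     # Equation (5) stopping rule
                break
            prev = m_hat
        imf = np.real(np.fft.ifft(m_hat))      # Equation (6)
        imfs.append(imf)
        s = s - imf                            # Equation (7)
    imfs.append(s)                             # Equation (8): residual/trend
    return imfs

t = np.linspace(0.0, 1.0, 512, endpoint=False)
x = np.sin(2 * np.pi * 30 * t) + 0.5 * t
components = fif_sketch(x)
```

By construction the extracted components and the final residual telescope back to the input, so the decomposition is exactly additive regardless of how well the stand-in filter approximates the Fokker–Planck one.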

#### **3. Parameter Adaptive Refined Composite Multiscale Fluctuation-Based Dispersion Entropy**

*3.1. Refined Composite Multiscale Fluctuation-Based Dispersion Entropy*

Refined Composite Multiscale Fluctuation-based Dispersion Entropy (RCMFDE) addresses the shortcomings of Multiscale Fluctuation-based Dispersion Entropy (MFDE) in the coarse-graining process, namely low computational efficiency and a high probability of invalid entropy values. Compared with MFDE, its entropy value is more stable, its operation is faster, and the probability of invalid entropy values is greatly reduced. The specific process of RCMFDE is as follows:

(1) For a given univariate signal *v* = {*v*1, *v*2, ... , *vL*} of length *L*, divide *v* into non-overlapping segments of length *τ*, where *τ* is called the scale factor, and construct the composite coarse-grained time series:

$$\mathbf{x}\_k^{(\tau)}(i) = \frac{1}{\tau} \sum\_{\varepsilon=(i-1)\tau+k}^{i\tau+k-1} v\_{\varepsilon}, 1 \le i \le \left\lfloor \frac{L}{\tau} \right\rfloor = n, k = 1, 2, \dots, \tau \tag{9}$$

where *k* = 1, 2, ... , *τ* indexes the shifted coarse-grained series generated at scale factor *τ*.

(2) Map *X* = {*x*1, *x*2, ... , *xn*} to *Y* = {*y*1, *y*2, ... , *yn*} through the normal cumulative distribution function (NCDF) as follows:

$$y\_k(i) = \frac{1}{\sigma\sqrt{2\pi}} \int\_{-\infty}^{\mathbf{x}\_k(i)} e^{\frac{-(t-\mu)^2}{2\sigma^2}} dt \tag{10}$$

where *σ* is the standard deviation of the time series *X* and *μ* is the mean.

(3) Linearly assign *yk*(*i*) to an integer *zk*(*i*) from 1 to *c* as follows:

$$z\_k^c(i) = round(c \times y\_k(i) + 0.5) \tag{11}$$

(4) Construct the embedding vectors *z*(*j*) = {*z*(*j*), *z*(*j* + *d*), ... , *z*(*j* + (*m* − 1)*d*)}, *j* = 1, 2, ... , *n* − (*m* − 1)*d*, with embedding dimension *m* and time delay *d*.

(5) Map each embedding vector to a fluctuation-based dispersion pattern *π*v0...vm−1 formed by the differences between its adjacent elements; since each difference can take a value from −(*c* − 1) to *c* − 1, there are (2*c* − 1)<sup>*m*−1</sup> possible patterns.

(6) Calculate the relative frequency of each pattern for the *k*th coarse-grained series:

$$W\_k(\pi\_{v\_0\dots v\_{m-1}}) = \frac{\#\{j \mid j \le n - (m-1)d,\ z\_k^{m,c}(j)\ \text{has type}\ \pi\_{v\_0\dots v\_{m-1}}\}}{n - (m-1)d} \tag{12}$$

where # means cardinality.

(7) The Refined Composite Multiscale Fluctuation-based Dispersion Entropy (RCMFDE) is obtained by the following formula:

$$RCMFDE = -\sum\_{\pi=1}^{(2c-1)^{m-1}} \left(\frac{1}{\tau}\sum\_{k=1}^{\tau} W\_k(\pi\_{v\_0\dots v\_{m-1}})\right) \times \ln\left(\frac{1}{\tau}\sum\_{k=1}^{\tau} W\_k(\pi\_{v\_0\dots v\_{m-1}})\right).\tag{13}$$

The RCMFDE algorithm has four parameters: the embedding dimension *m*, the class number *c*, the delay time *d*, and the maximum scale factor *τ*max. A study [23] pointed out that the results of RCMFDE do not change significantly with the time delay *d*, whereas different values of the embedding dimension *m* and the class number *c* do influence RCMFDE. The higher the number of potential fluctuation-based dispersion patterns, and hence the bound ln((2*c* − 1)<sup>*m*−1</sup>), the higher the RCMFDE value [23]. When *m* and *c* are too small, the ability of RCMFDE to detect signal mutations is reduced; however, the larger *m* and *c* are, the longer the algorithm runs. For samples of the same category, the feature vectors should be as similar as possible; for samples of different categories, the feature vectors should differ significantly. Unsuitable parameters may cause unstable entropy values, incomplete extraction of hidden feature information, or excessively long running time, making correct classification difficult. Therefore, it is necessary to select appropriate values of the class number *c* and the embedding dimension *m*.
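The RCMFDE computation of Equations (9)–(13) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `math.erf` stands in for the normal CDF of Equation (10), and the pattern counts of the *τ* shifted coarse-grainings are pooled before computing the entropy, which is equivalent to the average in Equation (13) up to the per-series sample counts.

```python
import numpy as np
from math import erf

def rcmfde(v, m=3, c=6, d=1, tau_max=10):
    """Minimal RCMFDE sketch per Equations (9)-(13): coarse-graining,
    NCDF mapping, class assignment, fluctuation-based dispersion
    patterns, and Shannon entropy of the pooled pattern frequencies."""
    v = np.asarray(v, dtype=float)
    mu, sigma = v.mean(), v.std()
    ncdf = np.vectorize(lambda u: 0.5 * (1.0 + erf((u - mu) / (sigma * np.sqrt(2)))))
    entropies = []
    for tau in range(1, tau_max + 1):
        counts, total = {}, 0
        for k in range(tau):                        # the tau shifted series
            n = (len(v) - k) // tau
            x = v[k:k + n * tau].reshape(n, tau).mean(axis=1)     # Eq. (9)
            y = ncdf(x)                                           # Eq. (10)
            z = np.clip(np.round(c * y + 0.5).astype(int), 1, c)  # Eq. (11)
            for j in range(n - (m - 1) * d):
                emb = z[j:j + m * d:d]
                pat = tuple(np.diff(emb))   # fluctuation-based pattern
                counts[pat] = counts.get(pat, 0) + 1
                total += 1
        p = np.array(list(counts.values()), dtype=float) / total
        entropies.append(-np.sum(p * np.log(p)))                  # Eq. (13)
    return np.array(entropies)

rng = np.random.default_rng(0)
e_white = rcmfde(rng.standard_normal(2000), tau_max=5)
```

For white Gaussian noise the entropy at each scale is positive and bounded above by ln((2*c* − 1)<sup>*m*−1</sup>), the logarithm of the number of possible patterns.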

#### *3.2. Genetic Algorithm*

Genetic Algorithm (GA) is a computational model that simulates the process of natural selection and the genetic mechanism of Darwin's theory of biological evolution; it searches for optimal solutions by simulating the natural evolutionary process [24]. When solving complex combinatorial optimization problems, GA can usually obtain better optimization results faster than some conventional optimization algorithms. The specific process of GA is as follows:

(1) Randomly initialize a population of candidate solutions, encoding the parameters to be optimized, and set the current generation *t* = 1 and the maximum generation *T*.

(2) Calculate the fitness function value of each individual in the population.

(3) Perform selection, crossover, and mutation operations according to the fitness values to generate the next-generation population.

(4) If *t* = *T*, or the change of the fitness function value reaches the given threshold, the optimal fitness individual is used as the optimal solution output. If *t* < *T*, and the change of the fitness function value is greater than the given threshold, define *t* = *t* + 1, and repeat steps (2)–(4).

#### *3.3. Parameter Adaptive Refined Composite Multiscale Fluctuation-Based Dispersion Entropy*

The settings of the embedding dimension *m* and the class number *c* of RCMFDE affect its final entropy value, the stability of the entropy value, and the operation time. If the parameters are set unreasonably, the best processing effect will not be achieved. Therefore, a suitable method is needed to adaptively select the embedding dimension *m* and the class number *c* of RCMFDE. To address this problem, this paper proposes a Parameter Adaptive Refined Composite Multiscale Fluctuation-based Dispersion Entropy (PARCMFDE). The method performs parameter optimization through a Genetic Algorithm (GA) to determine the optimal combination of the embedding dimension *m* and the class number *c* of RCMFDE. Figure 1 shows the flowchart of using GA to optimize the parameters of RCMFDE. The steps of parameter optimization in PARCMFDE are described as follows:


$$\text{ske} = E[H\_p(X) - H\_p^{\text{m}}(X)]^3 / [H\_p^{\text{d}}(X)]^3 \tag{14}$$

where *Hp*<sup>m</sup>(*X*) is the mean of *Hp*(*X*), *Hp*<sup>d</sup>(*X*) is the standard deviation of *Hp*(*X*), and *E*[·] represents the mathematical expectation. The fitness function is taken as *f* = *ske*<sup>2</sup>.


**Figure 1.** The flowchart of using GA to optimize the parameters of RCMFDE.
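The GA loop of Section 3.2 with the skewness-squared fitness of Equation (14) can be sketched as follows. This is a generic integer-coded GA, not the authors' code; the toy fitness in the usage example (an assumed entropy-versus-scale curve) merely stands in for computing RCMFDE of a real signal at candidate (*m*, *c*).

```python
import numpy as np

def skew_sq_fitness(entropies):
    """f = ske^2 per Equation (14): squared skewness of the
    entropy-versus-scale curve H_p(X)."""
    h = np.asarray(entropies, dtype=float)
    ske = np.mean((h - h.mean()) ** 3) / (h.std() ** 3 + 1e-12)
    return ske ** 2

def ga_optimize(fitness, bounds, pop=20, gens=30, pc=0.8, pm=0.1, seed=0):
    """Minimal integer-coded GA (selection, crossover, mutation) that
    maximizes `fitness` over integer parameters within `bounds`."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    P = rng.integers(lo, hi + 1, size=(pop, len(bounds)))
    best, best_f = None, -np.inf
    for _ in range(gens):
        f = np.array([fitness(ind) for ind in P])
        if f.max() > best_f:                       # track the best individual
            best_f, best = f.max(), P[f.argmax()].copy()
        a = rng.integers(0, pop, size=pop)         # binary tournament selection
        b = rng.integers(0, pop, size=pop)
        P = P[np.where(f[a] >= f[b], a, b)].copy()
        for i in range(0, pop - 1, 2):             # one-point crossover
            if rng.random() < pc and len(bounds) > 1:
                cut = int(rng.integers(1, len(bounds)))
                tmp = P[i, cut:].copy()
                P[i, cut:] = P[i + 1, cut:]
                P[i + 1, cut:] = tmp
        for i in range(pop):                       # uniform mutation per gene
            for j in range(len(bounds)):
                if rng.random() < pm:
                    P[i, j] = rng.integers(lo[j], hi[j] + 1)
    return best, best_f

# Toy stand-in for "RCMFDE of the signal at parameters (m, c)":
# an assumed entropy-versus-scale curve, NOT the real entropy.
def toy_fitness(ind):
    m, c = int(ind[0]), int(ind[1])
    scales = np.arange(1, 11)
    h = np.log((2 * c - 1) ** (m - 1)) * np.exp(-scales / c)
    return skew_sq_fitness(h)

best_mc, best_fit = ga_optimize(toy_fitness, bounds=[(2, 5), (3, 25)])
```

In practice the fitness callable would compute PARCMFDE of the reconstructed signal, exactly as in the flowchart of Figure 1.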

#### *3.4. Error Analysis and Comparison Results*

To demonstrate the effectiveness of PARCMFDE in assessing the complexity and irregularity of time series, the PARCMFDE of white Gaussian noise and pink noise signals is calculated and compared with RCMFDE, Multiscale Fluctuation-based Dispersion Entropy (MFDE), and Multiscale Sample Entropy (MSE). To compare the accuracy of the different complexity measures, 10 groups of white Gaussian noise and 10 groups of pink noise were randomly generated. For a unified comparison, the maximum scale factor *τ*max of each entropy is set to 10, and the time delay *d* is set to 1. The embedding dimension *m* and the class number *c* of PARCMFDE are 2 and 3, respectively, for white Gaussian noise, and *m* = 2, *c* = 21 for pink noise. Based on experience, the embedding dimension and class number of RCMFDE and MFDE take the default values *m* = 3 and *c* = 6, and MSE takes the default values *m* = 2, *r* = 0.15 × *σ*, where *σ* represents the standard deviation of the signal. Figure 2a,b show the time-domain waveforms of white Gaussian noise and pink noise. Figure 3a,b plot the error bars of the different entropy algorithms for white Gaussian noise and pink noise, respectively. The entropy of a pink noise time series should remain almost constant across scales, while the entropy of white Gaussian noise should decrease monotonically [26]. It can be seen from Figure 3a that, as the scale factor *τ* increases, the average curves of the four entropies of white Gaussian noise (RCMFDE, PARCMFDE, MFDE and MSE) show a downward trend, which indicates that the four algorithms have good sensitivity in detecting complexity. Furthermore, the standard deviation of PARCMFDE for white Gaussian noise at each scale is smaller than that of RCMFDE, MFDE and MSE, indicating that PARCMFDE measures the complexity of white Gaussian noise more accurately than the other three algorithms.
It can be seen from Figure 3b that the MSE of pink noise exhibits a slight downward trend with large fluctuations, while the PARCMFDE remains almost unchanged, indicating that PARCMFDE is better than MSE. Furthermore, the standard deviation of PARCMFDE for pink noise at each scale is smaller than that of RCMFDE, MFDE and MSE, indicating that PARCMFDE can provide a more accurate complexity estimate for pink noise [27]. That is, PARCMFDE is effective in the complexity measurement and feature extraction of nonstationary signals. The standard deviation of MSE is much larger than that of the other three methods, indicating that MSE is insufficiently accurate for the complexity measurement and feature extraction of nonstationary signals.

**Figure 2.** Time Domain Waveform of (**a**) White Gaussian Noise and (**b**) Pink Noise.

**Figure 3.** Entropy of different algorithms of (**a**) White Gaussian Noise and (**b**) Pink Noise.
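The validation signals above can be reproduced with standard constructions; the spectral-shaping generator below is a common sketch for 1/*f* (pink) noise and is not taken from the paper.

```python
import numpy as np

def pink_noise(n, seed=0):
    """Generate unit-variance 1/f (pink) noise by spectral shaping of
    white noise; a standard construction, used here only to reproduce
    the kind of validation signals discussed above."""
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                 # avoid dividing the DC bin by zero
    spec /= np.sqrt(f)          # 1/f power corresponds to 1/sqrt(f) amplitude
    pink = np.fft.irfft(spec, n)
    return pink / pink.std()

white = np.random.default_rng(1).standard_normal(4096)
pink = pink_noise(4096)
```

Feeding such `white` and `pink` arrays to any of the entropy measures reproduces the qualitative behaviour described above: decreasing entropy across scales for white noise, near-constant entropy for pink noise.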

#### **4. Fuzzy C-Means Clustering**

The Fuzzy C-means (FCM) clustering algorithm is the most widely used fuzzy clustering algorithm based on an objective function. It obtains the membership degree of each sample point to all class centers by optimizing the objective function, thereby determining each sample point's class and automatically classifying the sample data [28]. Let *R* = {*r*1,*r*2, ... ,*rn*} be the set of data samples, where *n* is the number of samples, and let *C* = {*c*1, *c*2, ... , *ct*} be the cluster center vector, where *t* is the total number of clusters. The FCM clustering algorithm minimizes the objective function shown in Equation (15) through alternating iterative optimization, subject to the constraint shown in Equation (16).

$$J\_m = \sum\_{i=1}^t \sum\_{k=1}^n \left[ \mu\_{ik}(r\_k) \right]^m ||r\_k - c\_i||^2 \tag{15}$$

$$\sum\_{i=1}^{t} \mu\_{ik}(r\_k) = 1\tag{16}$$

where *rk* is the *k*th sample point to be clustered, *μik* is the degree of membership of *rk* to the *i*th cluster center *ci*, *m* is the weight index of the degree of membership, generally *m* = 2.

The cluster centers *ci* and the membership matrix *μik* are initialized randomly. Equations (17) and (18) are then calculated iteratively until the change of the objective function is less than a threshold.

$$\mathbf{C}\_{i} = \frac{\sum\_{k=1}^{n} r\_{k} \mu\_{ik}^{2}}{\sum\_{k=1}^{n} \mu\_{ik}^{2}}, i = 1, 2, \dots, t \tag{17}$$

$$\mu\_{ik} = \frac{1}{\sum\_{j=1}^{t} \left( \frac{||r\_k - c\_i||\_2}{||r\_k - c\_j||\_2} \right)^{2/(m-1)}} \tag{18}$$

The average fuzzy entropy *E*, classification coefficient *S* and classification accuracy *Acc* are used to analyze and evaluate the clustering effect of the fuzzy C-means, which are, respectively, defined as:

$$E = -\frac{1}{n} \sum\_{i=1}^{t} \sum\_{k=1}^{n} \mu\_{ik} \ln \mu\_{ik} \tag{19}$$

$$S = \frac{1}{n} \sum\_{i=1}^{t} \sum\_{k=1}^{n} \mu\_{ik}^{2} \tag{20}$$

$$Acc = \frac{1}{n} \sum\_{i=1}^{n} 1\{R\_v(i) = \hat{R}\_V(i)\} \tag{21}$$

where *Rv*(*i*) and *R*ˆ*V*(*i*) denote the actual class and the class assigned by FCM for the *i*th sample of the test dataset, respectively, and *n* is the number of samples in the test dataset.

The ambiguity of clustering is represented by the average fuzzy entropy *E*, which reflects the distribution characteristics of the clustering dataset, so it can be used as an index to judge the clustering effect and correctness. The smaller the ambiguity, the higher the order of the system. The classification coefficient *S* measures the overlap between clusters, and the closer it is to 1, the more effective the clustering result [29]. Therefore, the closer *E* is to 0, the closer *S* is to 1, and the closer *Acc* is to 100%, the better the sample clustering effect is.
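The FCM iteration of Equations (15)–(18) and the evaluation indices *E* and *S* of Equations (19) and (20) can be sketched as follows. This is a minimal sketch: the function name `fcm`, the random membership initialization, and the small numerical guards are ours, not the authors'.

```python
import numpy as np

def fcm(R, t=2, m=2.0, iters=200, tol=1e-8, seed=0):
    """Minimal FCM sketch per Equations (15)-(18); returns centers,
    memberships, average fuzzy entropy E (Eq. 19) and classification
    coefficient S (Eq. 20)."""
    rng = np.random.default_rng(seed)
    R = np.asarray(R, dtype=float)
    n = len(R)
    U = rng.random((t, n))
    U /= U.sum(axis=0)                       # Eq. (16): columns sum to 1
    prev_J = np.inf
    for _ in range(iters):
        W = U ** m
        C = (W @ R) / W.sum(axis=1, keepdims=True)                # Eq. (17)
        D = np.linalg.norm(R[None, :, :] - C[:, None, :], axis=2) + 1e-12
        Dp = D ** (2.0 / (m - 1.0))
        U = 1.0 / (Dp * np.sum(1.0 / Dp, axis=0, keepdims=True))  # Eq. (18)
        J = np.sum((U ** m) * D ** 2)                             # Eq. (15)
        if abs(prev_J - J) < tol:
            break
        prev_J = J
    E = -np.sum(U * np.log(U + 1e-12)) / n   # Eq. (19)
    S = np.sum(U ** 2) / n                   # Eq. (20)
    return C, U, E, S

# Two tight, well-separated blobs should give S near 1 and small E.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centers, U, E, S = fcm(X, t=2)
```

On well-separated data the memberships approach 0/1, so *S* approaches 1 and *E* approaches 0, matching the interpretation given above.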

#### **5. Proposed Fault Diagnosis Method**

In order to quickly and reliably extract the characteristic information of the vibration signal and realize the automatic classification of the working state of the rotating machinery, a new fault diagnosis method of the rotating machinery based on FIF-PARCMFDE and Fuzzy C-means (FCM) is proposed. The specific process is as follows:

(1) Decompose the collected vibration signal into a series of *IMF* components by FIF.

(2) Calculate the correlation coefficient between each *IMF* component and the original signal, and reconstruct the signal from the components with large correlation coefficients.

(3) Calculate the PARCMFDE of the reconstructed signal as the fault feature vector, with the embedding dimension *m* and class number *c* optimized by GA.

(4) Input the feature vectors into the FCM classifier to identify the working state of the rotating machinery.

The block diagram of the proposed fault diagnosis method is shown in Figure 4.

**Figure 4.** Block diagram of the proposed fault diagnosis method.

#### **6. Experimental Verification**

In this section, we apply the proposed fault diagnosis method to the bearing vibration signal of Case Western Reserve University and centrifugal pump vibration signals obtained by building a water circulation experimental system. It is compared with some similar commonly used methods to evaluate the effectiveness and superiority of our method.

#### *6.1. Experiment 1: Bearing Data from CWRU*

#### 6.1.1. Experimental Setup

The experimental data are the vibration signals from the Bearing Data Center of Case Western Reserve University [30]. Vibration data are collected using accelerometers mounted on the drive-end bearing housing. The outer race, rolling element and inner race of the rolling bearings are machined by electric-spark erosion to simulate single-point crack failures of the outer race, rolling element and inner race. The selected test bearing model is 6205-2RS, the rotational speed is 1750 r/min, and the sampling frequency of the vibration data is 12 kHz. The vibration data of the normal, inner race fault, outer race fault, and rolling element fault states are analyzed. Twenty samples for each of the four bearing conditions were obtained through a non-overlapping sliding window of length 5500 points, that is, each sample contains 5500 points. The first 10 samples of each condition are selected as the training set, and the remaining 10 samples as the testing set.
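The windowing described above can be expressed directly. This is a sketch: `make_samples` is our name for a hypothetical helper, and the demonstration signal is synthetic rather than the CWRU recording.

```python
import numpy as np

def make_samples(signal, win=5500, n_samples=20):
    """Split a 1-D signal into non-overlapping windows of `win` points
    and keep the first `n_samples`, as described in Section 6.1.1.
    (`make_samples` is a hypothetical helper name.)"""
    signal = np.asarray(signal)
    n = len(signal) // win
    return signal[:n * win].reshape(n, win)[:n_samples]

# Synthetic stand-in for one bearing condition's recording.
recording = np.arange(5500 * 25)
samples = make_samples(recording)
train_set, test_set = samples[:10], samples[10:]  # 10 training / 10 testing
```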

#### 6.1.2. Comparison and Analysis

To verify the effectiveness of the FIF-PARCMFDE-FCM method for bearing fault diagnosis, under the same test conditions, the vibration data of four operating states of normal bearing, inner race fault, rolling element fault and outer race fault were classified and identified.

The first 1000 points of the original vibration signal in each of the four bearing states are shown in Figure 5. It can be seen from Figure 5 that the vibration signals of the four states differ considerably and bear distinct characteristics, but they cannot be reliably classified directly from the waveforms.

FIF decomposes the vibration signals of the four bearing states, and the first five *IMF* components are selected for comparison. The *IMF* components with larger correlation coefficients retain the fault characteristic information of the signals well [31]. The correlation coefficients between the first five *IMF* components and the original signal are shown in Table 1, and the components with correlation coefficients greater than 0.4 are selected to reconstruct the signal. Therefore, the outer race fault selects IMF1 as the reconstruction signal, the inner race fault and rolling element fault select IMF1 and IMF2 for reconstruction, and the normal signal selects IMF1, IMF2 and IMF4 for reconstruction.

**Figure 5.** Time domain waveform of vibration signal under (**a**) Normal, (**b**) Inner Race Fault, (**c**) Rolling Element Fault and (**d**) Outer Race Fault of bearing.
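The correlation-based reconstruction rule described above can be sketched as follows; `select_imfs` is our name for a hypothetical helper, and the two synthetic components merely illustrate the 0.4 threshold.

```python
import numpy as np

def select_imfs(imfs, signal, threshold=0.4):
    """Keep the IMFs whose absolute Pearson correlation with the raw
    signal exceeds `threshold`, and sum them into a reconstructed
    signal (`select_imfs` is a hypothetical helper name)."""
    kept = [imf for imf in imfs
            if abs(np.corrcoef(imf, signal)[0, 1]) > threshold]
    if not kept:
        return np.zeros_like(np.asarray(signal, dtype=float)), 0
    return np.sum(kept, axis=0), len(kept)

# Two synthetic "IMFs": a dominant tone and a tiny high-frequency one.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
main = np.sin(2 * np.pi * 10 * t)
minor = 0.01 * np.sin(2 * np.pi * 97 * t)
recon, n_kept = select_imfs([main, minor], main + minor)
```

Only the dominant component correlates strongly with the composite signal, so the reconstruction keeps it and discards the weak one.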


**Table 1.** Correlation coefficients of bearings in different states.

The reconstructed signals *s* of the different bearing states are selected. Skewness is used as the objective function in GA; the maximum evolutionary generation *T* is set to 200 and the threshold for the change of the fitness function to 10<sup>−6</sup>. The parameters of RCMFDE are optimized by GA. The PARCMFDE, RCMFDE and MSE of the reconstructed signal *s* are then calculated, where the scale factor is 10, and the embedding dimension *m* and class number *c* of PARCMFDE under the different conditions are shown in Table 2. Based on experience, RCMFDE takes the default values *m* = 3, *c* = 6, and MSE takes the default values *m* = 2, *r* = 0.15 × *σ*, where *σ* represents the standard deviation of the signal *s*. It can be seen from Figures 6–8 that PARCMFDE, RCMFDE and MSE can all distinguish the four states of the bearing, indicating that the three methods can effectively extract the hidden features of the different bearing states. However, compared with RCMFDE and MSE, PARCMFDE distinguishes the four states of the bearing more clearly, and is therefore more suitable as a feature vector for the further classification of bearing faults.


**Table 2.** PARCMFDE parameters of bearings in different states.

**Figure 6.** PARCMFDE in different states of bearing.

**Figure 7.** RCMFDE in different states of bearing.

**Figure 8.** MSE in different states of bearing.

PARCMFDE, RCMFDE and MSE are taken as the feature vector matrices. FCM cluster analysis is performed on the feature vector matrix of the training samples, yielding four cluster centers. The obtained cluster centers and the feature vector matrix of the testing samples are then input into FCM clustering. The clustering results are shown in Figures 9–11.

**Figure 9.** FIF-PARCMFDE-FCM clustering results of bearing data in different states.

**Figure 10.** FIF-RCMFDE-FCM clustering results of bearing data in different states.

**Figure 11.** FIF-MSE-FCM clustering results of bearing data in different states.

It can be seen from Figures 9–11 that data points of the same state are concentrated around their respective cluster centers, while data points of different states are separated from each other. In addition, the positions of the cluster centers obtained by the different methods differ, as does how tightly the data points are distributed around them. In comparison, FIF-PARCMFDE-FCM has the best clustering effect: the categories are most distinct, the cluster centers of the various signals are far apart, and the data points of each type are compactly clustered around their cluster centers. Compared with FIF-RCMFDE-FCM and FIF-MSE-FCM, the inter-class center distance of the FIF-PARCMFDE-FCM method is larger and the different signals are more clearly distinguished, indicating that this method has a better classification effect on the various fault signals of rolling bearings.

The classification coefficient *S*, the average fuzzy entropy *E* and the classification accuracy *Acc* of each clustering result are calculated, and the clustering effect and fault recognition rate of the above three algorithms are quantitatively compared. The clustering results are shown in Table 3. It can be seen from Table 3 that the classification accuracy *Acc* of all three methods is 100%. The classification coefficient *S* of the FIF-MSE-FCM method is 0.9913 and its average fuzzy entropy *E* is 0.0306, the poorest clustering effect. The classification coefficient of the FIF-PARCMFDE-FCM method is 0.9967 and its average fuzzy entropy is 0.0123, the best clustering effect, indicating that this method can achieve more accurate and reliable fault diagnosis.
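The paper does not give formulas for these indices; assuming *S* and *E* are the standard Bezdek partition coefficient and partition entropy, they can be computed directly from the FCM membership matrix *U*:

```python
import numpy as np

def partition_coefficient(U):
    """S = (1/n) * sum_ij u_ij^2; lies in (1/c, 1], larger = crisper clusters."""
    return float(np.mean(np.sum(U ** 2, axis=1)))

def partition_entropy(U, eps=1e-12):
    """E = -(1/n) * sum_ij u_ij * ln(u_ij); >= 0, smaller = less fuzzy."""
    return float(-np.mean(np.sum(U * np.log(U + eps), axis=1)))
```

A perfectly crisp partition gives *S* = 1 and *E* = 0, so values such as *S* = 0.9967 and *E* = 0.0123 indicate memberships very close to hard assignments.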

**Table 3.** FCM clustering effect of different bearing entropy algorithms.


#### *6.2. Experiment 2: Experimental Data of Centrifugal Pump*

#### 6.2.1. Experimental Setup

To verify the effectiveness of the method, a water circulation system was built at Wuhan University of Technology, and the vibration signals of a centrifugal pump in different states were collected. A model CDL1-11FSWPG light-duty vertical multistage centrifugal pump was selected, with a rated speed of 2900 r/min, a lift of 61 m, and a rated flow of 1 m³/h. The vibration sensor measuring points of the centrifugal pump were arranged according to the GB/T 29531-2013 pump vibration measurement and evaluation method, and vibration data were collected simultaneously in the *x*, *y*, and *z* directions. The measuring point in the *x*-axis direction is arranged on the pump casing, the one in the *z*-axis direction on the base, and the one in the *y*-axis direction at the outlet flange. The structure of the pump body and the arrangement of the measuring points are shown in Figure 12.

**Figure 12.** Layout of measuring points of centrifugal pump.

Under actual operating conditions, rotor unbalance and air binding faults can occur in a centrifugal pump. The impeller is the main rotor component of the centrifugal pump; during operation it is in contact with the working medium and is therefore the rotor part most prone to failure. Under laboratory conditions, a rotor unbalance fault is simulated with a damaged impeller, as shown in Figure 13. If the pump is not filled with the liquid to be conveyed before starting, or air infiltrates the pump during operation, the air cannot be expelled: because the density of the gas is much lower than that of the liquid, the centrifugal force it generates is small, and the negative pressure produced by the fluid in the pump casing during centrifugal motion with the motor is not sufficient to draw the liquid into the casing. This is called the air binding phenomenon of the centrifugal pump. In this experiment, the exhaust screw of the centrifugal pump was tightened and the pump was then removed, allowing air to enter the inner chamber of the pump. After the pump was reinstalled, the residual air could not be discharged through the exhaust screw, thereby setting the air binding fault of the centrifugal pump, as shown in Figure 14.

**Figure 13.** Rotor unbalance fault setup: (**a**) Impeller in normal condition, (**b**) Impeller in damaged condition.

**Figure 14.** Air binding fault setting: (**a**) Exhaust screw loose, (**b**) Exhaust screw tightened.

After building the experimental platform, the sampling frequency was set to 1 kHz and the motor speed to 1015 r/min, and the acquisition system developed in LabVIEW was employed to collect the vibration signals of the centrifugal pump in the normal state. Then, the vibration signals of the rotor unbalance and air binding states of the centrifugal pump were collected. A total of 25 samples for each of the three conditions were obtained through a non-overlapping sliding window of length 4000 points, i.e., 4000 points per sample. The first 10 samples for each of the three conditions are selected as the training set, and the remaining 15 samples for each condition are selected as the testing set.
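The windowing and split described above can be sketched as follows; the `segment` helper is hypothetical, and each per-state signal is assumed to be a 1-D array of at least 25 × 4000 points:

```python
import numpy as np

def segment(signal, win=4000, n_train=10):
    """Cut a 1-D signal into non-overlapping windows of `win` points and
    split the resulting samples into training and testing sets."""
    n = len(signal) // win
    samples = signal[:n * win].reshape(n, win)
    return samples[:n_train], samples[n_train:]
```

Applied to 100,000 points per condition, this yields the 10 training and 15 testing samples per state used in the experiment.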

#### 6.2.2. Comparison and Analysis

To verify the superiority of FIF-PARCMFDE-FCM clustering for fault diagnosis of the centrifugal pump, the vibration data of the three operating states of the centrifugal pump (normal, rotor unbalance fault and air binding fault) were classified and identified under the same test conditions. The results were then compared with the cluster analysis results of FIF-RCMFDE-FCM and FIF-MSE-FCM.

The first 1000 points of the original vibration signals in the three states of the centrifugal pump are shown in Figure 15. It can be seen from Figure 15 that the characteristics of the vibration signals in the three states are not clear, so the states cannot be classified directly from the waveforms. The vibration signals of the centrifugal pump in the three states were decomposed by FIF, and the first five *IMF* components were selected for comparison. *IMF* components with larger correlation coefficients preserve the fault characteristic information of the signal well. The correlation coefficients between the first five *IMF* components and the original signal are shown in Table 4, and the components with a correlation coefficient greater than 0.4 are selected to reconstruct the signal. Therefore, IMF1 and IMF2 are selected for reconstruction for the normal, rotor unbalance fault and air binding fault signals.

**Figure 15.** Time domain waveform of vibration signal of centrifugal pump in (**a**) Normal, (**b**) Air Binding Fault and (**c**) Rotor Unbalance Fault.


**Table 4.** Correlation coefficients of centrifugal pumps in different states.

The reconstructed signals of the different states of the centrifugal pump are selected. The skewness is used as the objective function in the GA; the maximum evolutionary generation *T* is set to 200, and the threshold for the change of the fitness function is set to 10⁻⁶. The parameters of RCMFDE are optimized by the GA. The PARCMFDE, RCMFDE and MSE of the reconstructed signal *s* are calculated with a scale factor of 10. The embedding dimension *m* and the class number *c* of PARCMFDE in the different situations are shown in Table 5. Based on experience, RCMFDE takes the default values *m* = 3 and *c* = 6, and MSE takes the default values *m* = 2 and *r* = 0.15 × *σ*, where *σ* is the standard deviation of the signal *s*. As shown in Figure 16, PARCMFDE clearly separates the three states of the centrifugal pump, indicating that it can effectively extract the hidden features of the three states and is suitable as a feature vector for further classification of the pump states. As shown in Figures 17 and 18, RCMFDE and MSE can hardly separate the three states of the centrifugal pump, indicating that they may not be able to effectively extract the small fault features hidden in rotating machinery and are not suitable as feature vectors for further classification of the pump states.
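All three multiscale entropies start from coarse-graining the signal over the scale factors 1 to 10. As a minimal sketch (the refined composite variants used by RCMFDE and PARCMFDE additionally average over the τ possible window offsets, shown here as `coarse_grain_shifted`):

```python
import numpy as np

def coarse_grain(x, tau):
    """Non-overlapping mean coarse-graining at scale factor tau."""
    n = len(x) // tau
    return x[:n * tau].reshape(n, tau).mean(axis=1)

def coarse_grain_shifted(x, tau):
    """All tau offset coarse-grainings used by refined composite variants."""
    return [coarse_grain(x[k:], tau) for k in range(tau)]
```

The entropy (dispersion or sample entropy) is then evaluated on each coarse-grained series, producing one entropy curve per signal over the 10 scales.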

**Table 5.** PARCMFDE parameters of centrifugal pump in different states.


**Figure 16.** PARCMFDE in different states of centrifugal pump.

**Figure 17.** RCMFDE in different states of centrifugal pump.

**Figure 18.** MSE in different states of centrifugal pump.

PARCMFDE is taken as the eigenvector matrix. FCM cluster analysis is performed on the eigenvector matrix of the training samples, yielding three cluster centers. Next, the obtained cluster centers and the eigenvector matrix of the testing samples are input into FCM clustering; the clustering results are shown in Figure 19. Similarly, RCMFDE and MSE are separately input into FCM clustering as eigenvector matrices, with the clustering results shown in Figures 20 and 21.

**Figure 19.** FIF-PARCMFDE-FCM clustering results of centrifugal pump data in different states.

**Figure 20.** FIF-RCMFDE-FCM clustering results of centrifugal pump data in different states.

**Figure 21.** FIF-MSE-FCM clustering results of centrifugal pump data in different states.

The clustering effect and classification accuracy of the above three algorithms are quantitatively compared; the classification coefficient *S*, average fuzzy entropy *E* and classification accuracy *Acc* of each method are calculated and listed in Table 6. From Figure 19 and the quantitative analysis, it can be seen that FIF-PARCMFDE-FCM clearly distinguishes the fault categories: the cluster centers of the various signals are far apart, the data points of each type are compactly clustered around their cluster centers, and the classification accuracy *Acc* is 100%. This shows that the method has a good classification effect on the fault signals of the centrifugal pump in its different states. In contrast, the FIF-RCMFDE-FCM and FIF-MSE-FCM methods exhibit a poor clustering effect, serious misclassification, and low classification accuracy *Acc*.

It can be seen from the above two experiments that the accuracy of all three methods in Experiment 1 is 100%, but the accuracy of FIF-MSE-FCM and FIF-RCMFDE-FCM drops greatly in Experiment 2. There are two reasons: (1) the fault features of the vibration signals in Experiment 1 are evident and easy to distinguish, whereas the fault features of the vibration signals in Experiment 2 are relatively weak and difficult to extract accurately; (2) by using the GA to optimize the parameters of RCMFDE, the problem that the selection of *m* and *c* depends on experience is solved, and the performance of feature extraction is improved. This shows that FIF-PARCMFDE-FCM can adaptively select parameter combinations for different application scenarios, giving it better adaptability for signals that are harder to classify and have less obvious fault features.

**Table 6.** FCM clustering effects of different entropy algorithms for centrifugal pumps.


#### **7. Conclusions**

To overcome the difficulty of traditional feature extraction methods in extracting tiny fault features hidden in vibration signals, and the shortcoming that RCMFDE selects its parameters based on experience, PARCMFDE is proposed. PARCMFDE takes the skewness of RCMFDE as the objective function and uses a genetic algorithm to optimize the parameters, so it can more accurately extract tiny fault features hidden in the vibration signals of rotating machinery. At the same time, a new fault diagnosis method for rotating machinery based on FIF-PARCMFDE-FCM is proposed, which can classify rotating machinery faults accurately and automatically without depending on the length of the data samples. FIF quickly decomposes the original vibration signal, and the components with large correlation coefficients are selected for reconstruction. The features of the reconstructed signal are extracted by PARCMFDE, and the feature vectors are input into FCM for automatic, label-free classification. The bearing experiments, with clear fault characteristics, show that the classification performance of this method is superior to that of the other methods. The centrifugal pump experiments, with weak fault features, demonstrate that the method can extract hidden weak fault features from vibration signals and perform accurate and reliable automatic classification. Therefore, the proposed diagnostic method can achieve reliable diagnosis performance for rotating machinery.

However, the proposed method only identifies single faults of rotating machinery and does not consider the identification of compound faults. Furthermore, PARCMFDE is slower to compute than RCMFDE. Therefore, the identification of compound faults in rotating machinery and the improvement of the computing speed of PARCMFDE will be the focus of our future work.

**Author Contributions:** Methodology, software, validation, writing—original draft preparation, M.T.; data curation, Y.L.; experimental system construction, F.L.; writing—review and editing, X.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data used in this study are all owned by the research group and will not be transmitted.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

