#### (a) Construction of the Projection Matrix **Ψ***i*

For each group of feature-mapped nodes, a key step in the BLS framework is building the projection matrix **Ψ***i*. The natural question is how to build it; the approach presented here follows the procedures of [8,33]. In the BLS, a random matrix **P***i* ∈ R<sup>(*D*+1)×*fi*</sup> is first generated for each group of feature-mapped nodes. From it, we obtain a random-projection data matrix **Q***i*, given by

$$\mathbf{Q}_i = \mathbf{X}\, \mathbf{P}_i \tag{13}$$

The projection matrix **Ψ***i* is then obtained as the solution of the sparse approximation problem

$$\min_{\boldsymbol{\Psi}_i} \left\{ \left\| \mathbf{Q}_i \boldsymbol{\Psi}_i - \mathbf{X} \right\|_F^2 + \rho \left\| \boldsymbol{\Psi}_i \right\|_1 \right\} \tag{14}$$

where the term *ρ*‖**Ψ***i*‖<sub>1</sub> enforces a sparse solution of (14) and *ρ* is the corresponding regularization parameter. Here ‖·‖<sub>*F*</sub> denotes the Frobenius norm and ‖·‖<sub>1</sub> the ℓ<sub>1</sub> norm.
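As a concrete illustration, the following minimal sketch builds **Ψ***i* by solving (14) column by column with a lasso solver; the use of scikit-learn's `Lasso` (whose objective matches (14) up to a constant rescaling of the terms) and the function name `build_projection_matrix` are our own assumptions, standing in for the sparse autoencoder procedure of [8,33].

```python
import numpy as np
from sklearn.linear_model import Lasso

def build_projection_matrix(X, f_i, rho=1e-3, seed=0):
    """Sketch of (13)-(14) for one group of feature-mapped nodes.

    X   : (N, D+1) training matrix, already augmented with a bias column.
    f_i : number of feature-mapped nodes in this group.
    rho : sparsity weight, playing the role of rho in (14).
    """
    rng = np.random.default_rng(seed)
    P_i = rng.uniform(-1.0, 1.0, size=(X.shape[1], f_i))  # random matrix P_i
    Q_i = X @ P_i                                          # random projection (13)

    # Problem (14) separates over the columns of X: each column of Psi_i is a
    # sparse coefficient vector mapping Q_i back to the corresponding column of X.
    psi_columns = []
    for d in range(X.shape[1]):
        lasso = Lasso(alpha=rho, fit_intercept=False, max_iter=5000)
        lasso.fit(Q_i, X[:, d])
        psi_columns.append(lasso.coef_)
    return np.column_stack(psi_columns)  # Psi_i has shape (f_i, D+1)
```

A feature matrix for this group can then be obtained as `Z_i = X @ Psi_i.T`, which is exactly the form used in (15) below.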

#### (b) Construction of the Weight Matrices of the Enhancement Nodes

The construction of the **W***j* matrices is straightforward. As detailed in [8,33,34], the traditional BLS algorithm and its variants randomly generate the weight matrices for each group of enhancement nodes, and this paper follows the same procedure to generate the **W***j* matrices.

#### (c) Construction of Output Weight Vector

This section gives the procedure to construct the output weight vector β. Given the projection matrices **Ψ***i*, ∀*i* = 1, ··· , *n*, of the feature-mapped nodes and the training data matrix **X**, the *i*-th training data feature matrix for all training samples is given by

$$\mathbf{Z}_i = \begin{pmatrix} \mathbf{z}_{i,1}^{\mathrm{T}} \\ \vdots \\ \mathbf{z}_{i,N}^{\mathrm{T}} \end{pmatrix} = \mathbf{X}\, \boldsymbol{\Psi}_i^{\mathrm{T}} \tag{15}$$

where **z***i*,*k* = [*zi*,*k*,1, ··· , *zi*,*k*, *fi*]<sup>T</sup>, and

$$z_{i,k,\mu} = \sum_{\iota=1}^{D} \psi_{i,\mu,\iota}\, x_{k,\iota} + \psi_{i,\mu,D+1} \tag{16}$$

Let Z be the collection of all training data feature matrices. Hence, we have

$$\mathbf{Z} = [\mathbf{Z}\_1, \cdots, \mathbf{Z}\_n] \tag{17}$$

In this way, Z is an *N* × *f* matrix, denoted as

$$\mathbf{Z} = \begin{pmatrix} z_{1,1} & \cdots & z_{1,f} \\ \vdots & \ddots & \vdots \\ z_{N,1} & \cdots & z_{N,f} \end{pmatrix} = \begin{pmatrix} \mathbf{Z}_1^{\mathrm{T}} \\ \vdots \\ \mathbf{Z}_N^{\mathrm{T}} \end{pmatrix} \tag{18}$$

where the *k*-th row vector **Z**<sub>*k*</sub><sup>T</sup> of **Z** contains the outputs of the feature-mapped nodes (i.e., the inputs to the enhancement nodes) for the *k*-th training input vector **x***k*. To handle input biases, we augment **Z** with a column of ones, giving

$$\mathbf{Z}' = \begin{pmatrix} \mathbf{Z}_1^{\mathrm{T}} & 1 \\ \vdots & \vdots \\ \mathbf{Z}_N^{\mathrm{T}} & 1 \end{pmatrix} = \begin{pmatrix} \mathbf{Z}_1'^{\mathrm{T}} \\ \vdots \\ \mathbf{Z}_N'^{\mathrm{T}} \end{pmatrix} \tag{19}$$

Furthermore, given **Z**′, the enhancement node outputs of the *j*-th enhancement group for all training data are given by

$$\mathbf{H}_j = \xi\left( \mathbf{Z}'\, \mathbf{W}_j^{\mathrm{T}} \right) = \begin{pmatrix} \mathbf{h}_{j,1}^{\mathrm{T}} \\ \vdots \\ \mathbf{h}_{j,N}^{\mathrm{T}} \end{pmatrix} \tag{20}$$

for *j* = 1, ··· , *m*, where

$$\mathbf{h}_{j,k} = \begin{bmatrix} h_{j,k,1}, \cdots, h_{j,k,e_j} \end{bmatrix}^{\mathrm{T}} \tag{21}$$

and

$$h_{j,k,v} = \xi\left( \sum_{\tau=1}^{f+1} w_{j,v,\tau}\, z'_{k,\tau} \right) \tag{22}$$

Packing all the enhancement node outputs together, we have

$$\mathbf{H} = \left[ \mathbf{H}_1, \cdots, \mathbf{H}_m \right] \tag{23}$$

where **H** is an *N* × ∑<sup>*m*</sup><sub>*j*=1</sub> *e<sub>j</sub>* = *N* × *e* matrix.
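To make the constructions in (15)–(23) concrete, the sketch below stacks the group feature matrices, appends the bias column of (19), and applies randomly drawn enhancement weights; `tanh` is used as a stand-in for the activation ξ(·), and the function name and group sizes are our own choices.

```python
import numpy as np

def bls_features(X, psi_list, n_enh_groups=3, e_j=20, seed=0):
    """Sketch of (15)-(23): feature-mapped node outputs Z and enhancement node outputs H.

    X        : (N, D+1) training matrix with a bias column appended.
    psi_list : projection matrices Psi_i, each of shape (f_i, D+1).
    """
    # Feature-mapped nodes: Z_i = X Psi_i^T, concatenated into Z as in (17)-(18)
    Z = np.hstack([X @ Psi_i.T for Psi_i in psi_list])      # (N, f)

    # Append a column of ones to handle the enhancement-node biases (19)
    Z_prime = np.hstack([Z, np.ones((Z.shape[0], 1))])      # (N, f+1)

    # Enhancement nodes: H_j = xi(Z' W_j^T), with xi taken as tanh here (20)-(22)
    rng = np.random.default_rng(seed)
    H = np.hstack([
        np.tanh(Z_prime @ rng.uniform(-1.0, 1.0, size=(e_j, Z_prime.shape[1])).T)
        for _ in range(n_enh_groups)
    ])                                                       # (N, e), e = n_enh_groups * e_j
    return Z, H
```

The concatenation A = [**Z**|**H**] used below is then simply `np.hstack([Z, H])`.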

Define A = [**Z**|**H**]. The output weight vector β can then be calculated with least-squares techniques:

$$\underset{\boldsymbol{\beta}}{\arg\min}\; \left\| \mathbf{A}\boldsymbol{\beta} - \mathbf{y} \right\|_{\rho}^{\rho} + \varrho \left\| \boldsymbol{\beta} \right\|_{\lambda}^{\lambda} \tag{24}$$

where **y** = [*y*1, ··· , *yN*]<sup>T</sup> is the collection of all training outputs. Equation (24) yields different cost functions for different choices of *ρ*, ϱ, and *λ*; note that *ρ* and *λ* are not necessarily the same. In this paper, to explore BLS for air pressure failure prediction, we reformulate the objective function (24) in the spirit of logistic regression. The next subsection gives background details of logistic regression (LogR).
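For the common choice *ρ* = *λ* = 2, problem (24) becomes ridge regression and β has the familiar closed-form solution used by the standard BLS; a minimal sketch under that assumption, with ϱ exposed as `reg`, is shown below.

```python
import numpy as np

def ridge_output_weights(A, y, reg=1e-3):
    """Closed-form solution of (24) for rho = lambda = 2 (ridge regression):
    beta = (A^T A + reg * I)^{-1} A^T y."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + reg * np.eye(d), A.T @ y)

# Usage: beta = ridge_output_weights(np.hstack([Z, H]), y)
```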

#### 2.2.4. Logistic Regression

Logistic regression (LogR) is a widely used probabilistic classification technique designed for binary classification problems; it is detailed in [35]. The technique maximizes the likelihood function given by

$$J_{\mathrm{LogR}} = \prod_{k=1}^{N} t_k^{y_k} \left(1 - t_k\right)^{1 - y_k}, \qquad t_k = \sigma\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}_k\right) \tag{25}$$

where *σ*(*z*) = 1/(1 + exp(−*z*)) and **x***k* is the *k*-th input vector. Problem (25) can be recast as a minimization problem: taking the negative logarithm of the likelihood (25) gives the well-known cross-entropy error function

$$f(\mathbf{w}) = -\log\left(J_{\mathrm{LogR}}\right) = -\sum_{k=1}^{N} \left\{ y_k \log(t_k) + (1 - y_k)\log(1 - t_k) \right\} \tag{26}$$

Gradient descent can then be used to minimize the error function (26) and obtain the optimal weight vector **w**.
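As an illustration, a minimal batch gradient-descent sketch for (26) is shown below; the learning rate, iteration count, and the per-sample scaling of the gradient are our own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Minimize the cross-entropy error (26) by gradient descent.

    X : (N, D) inputs, y : (N,) labels in {0, 1}.
    The gradient of (26) with respect to w is X^T (sigma(Xw) - y).
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        t = sigmoid(X @ w)                       # t_k = sigma(w^T x_k) from (25)
        w -= lr * (X.T @ (t - y)) / X.shape[0]   # averaged gradient step
    return w
```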

#### **3. The Proposed Technique**

In the proposed approach, we capitalize on the strengths of the broad learning system (BLS) and logistic regression (LogR). Figure 5 shows the structure of the fused BLS network with a logistic regression classifier.

**Figure 5.** The flowchart of the proposed network and procedure.

In Figure 5, the input **X** is passed to the feature-mapped layer, where the feature **Z***n* is extracted. This feature is further enhanced to obtain the enhanced feature **H***m*. Both features are combined as A = [**Z***n*|**H***m*]. The concatenated features are then passed to the logistic regression classifier to make the decision. The fusion of logistic regression and the broad learning system, together with the effectiveness of the feature-extraction and enhancement layers, improves the performance of the network. For instance, when the feature nodes extract features from the input **X**, the enhancement nodes further enhance them such that the distance between the positive class and the negative class is widened. Hence, the network is able to separate the classes even when the class distribution is imbalanced.

In our approach, we incorporate the objective function of BLS (24) into the objective function of LogR (25). In other words, for the proposed broad embedded logistic regression model, we assume a non-linear relationship between the input and the output of the logistic regression classifier. For ease of notation and explanation, we let

$$\mathbf{A} = \left[ \mathbf{A}_1, \cdots, \mathbf{A}_{f+e} \right] = \begin{bmatrix} a_{1,1} & \cdots & a_{1,f+e} \\ \vdots & \ddots & \vdots \\ a_{N,1} & \cdots & a_{N,f+e} \end{bmatrix}$$

Additionally, the probability that *yk* = 1 is denoted by *pk*; that is, when the model predicts *yk* = 1, the prediction probability is *pk*. We formulate the relationship between the feature matrix A obtained from the BLS network and the output weight vector β = [*β*1, . . . , *βf*+*e*]<sup>T</sup>. In addition, we add a bias term *β*0 with A<sub>0</sub> = [1, 1, . . . , 1]<sup>T</sup> to the relationship. Hence, for the *k*-th input, we have

$$l_k = a_{k,0}\, \beta_0 + \sum_{r=1}^{f+e} a_{k,r}\, \beta_r = \log_b \frac{p_k}{1 - p_k} \tag{27}$$

where *lk* is the log-odds for the *k*-th input. Note that *b*, the base of the logarithm, is an additional generalization of the model.

For a more compact notation that also takes the bias term into consideration, we collect the feature variables and β into (*f* + *e* + 1)-dimensional quantities, given by

$$\begin{array}{ll}\overline{\mathbf{A}} = & \left[ \mathbf{A}_0, \mathbf{A}_1, \cdots, \mathbf{A}_{f+e} \right] \\ \overline{\boldsymbol{\beta}} = & \left[ \beta_0, \beta_1, \ldots, \beta_{f+e} \right]^{\mathrm{T}} \end{array} \tag{28}$$

where A<sub>0</sub> = [*a*1,0, . . . , *aN*,0]<sup>T</sup>, A<sub>1</sub> = [*a*1,1, . . . , *aN*,1]<sup>T</sup>, . . . , A<sub>*f*+*e*</sub> = [*a*1,*f*+*e*, . . . , *aN*,*f*+*e*]<sup>T</sup>. Hence, we rewrite the logit *lk* as

$$l_k = \sum_{r=0}^{f+e} a_{k,r}\, \beta_r = \log_b \frac{p_k}{1 - p_k} \tag{29}$$

Solving for the probability *pk* that the model predicts *yk* = 1 yields

$$p_k = \frac{e^{l_k}}{1 + e^{l_k}} = \sigma(l_k) \tag{30}$$

where the base *b* has been set to Euler's number *e* and *σ*(·) is the sigmoid function. With (30), we can easily compute the probability that *yk* = 1 for a given observation. The optimum β can be obtained by minimizing the negative log-likelihood of (30); the regularized negative log-likelihood may be written as

$$\begin{array}{rcl} J &=& \displaystyle\sum_{k=1}^{N} \left\{ -y_k \log(p_k) - (1 - y_k) \log(1 - p_k) \right\} + \rho \left\| \boldsymbol{\beta} \right\|_{\lambda}^{\lambda} \\ &=& \displaystyle\sum_{k=1}^{N} \left\{ -y_k \log(\sigma_k) - (1 - y_k) \log(1 - \sigma_k) \right\} + \rho \left\| \boldsymbol{\beta} \right\|_{\lambda}^{\lambda} \end{array} \tag{31}$$

where

$$\sigma_k = \frac{1}{1 + \exp(-l_k)} = \frac{1}{1 + \exp\left(-\sum_{r=0}^{f+e} a_{k,r}\, \beta_r\right)} \tag{32}$$

We employ gradient descent to optimize the proposed objective function (31). We name our proposed technique broad embedded logistic regression (BELR).
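For the setting *λ* = 2 used later in this paper, the gradient required by the gradient-descent update follows directly from (31) and (32); writing **a**<sub>*k*</sub> = [*a*<sub>*k*,0</sub>, . . . , *a*<sub>*k*,*f*+*e*</sub>]<sup>T</sup>, a standard derivation gives

$$\nabla_{\boldsymbol{\beta}} J = \sum_{k=1}^{N} \left( \sigma_k - y_k \right) \mathbf{a}_k + 2\rho\, \boldsymbol{\beta}$$

so that each iteration updates β ← β − *η* ∇<sub>β</sub>*J* for some step size *η* (the symbol *η* is ours and is not specified in the original formulation).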

Comparing (26) and (31), note that traditional logistic regression can only model a linear relationship between the independent variables and the dependent variable; it does not consider any possible nonlinear relationship between them. Unlike the classical logistic regression classifier, which takes the raw data directly as its input, in (31) the outputs of the feature-mapped and enhancement nodes of the BLS serve as the input of the logistic regression classifier. In other words, enhanced features are fed to the logistic regression classifier, which improves the performance of the algorithm.

In addition, the objective function (31) of the proposed approach contains the regularizer *ρ*‖β‖<sup>*λ*</sup><sub>*λ*</sub>, where *λ* can be set to different values to obtain different scenarios and to improve the performance of the network. For instance, for *λ* = 1, the output weight of the proposed method has a sparse solution; this allows the network to automatically select relevant features from A, which may enhance performance. If *λ* is set to 2, the output weight has small, dense values, which helps prevent the network from overfitting. Since our focus in this paper is not on sparse solutions, we use *λ* = 2 in our experiments.
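As an aside, the two settings roughly correspond to the `penalty` argument of scikit-learn's `LogisticRegression`; the sketch below is only a rough stand-in for the regularizer in (31), not the BELR optimizer itself.

```python
from sklearn.linear_model import LogisticRegression

# lambda = 1: sparse output weights (implicit feature selection over A)
sparse_clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# lambda = 2: small, dense output weights (the setting used in this paper)
dense_clf = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0, max_iter=1000)
```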

#### **4. Experiment and Settings**

In this section, we compare the proposed BELR with other linear and non-linear algorithms, namely the original logistic regression (LogR), Random Forest classifier (RF), Gaussian Naive Bayes (GNB), K-nearest neighbour (KNN), and Support Vector Machine (SVM). We use four evaluation metrics in our comparison. Table 1 presents the evaluation metrics used to evaluate the performance of the comparison algorithms.


**Table 1.** The metrics used for model comparison.

From the Table, False Positive (FP) is the number of examples which are predicted to be positive by the model but belong to the negative class. False Negative (FN) is the number of examples which are predicted to be negative by the model but belong to the positive class. True Positive (TP) is the number of examples which are predicted to be positive by the model and belong to the positive class.
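Assuming the four metrics of Table 1 are the standard accuracy, precision, recall, and F1-score built from the TP, FP, FN (and TN) counts, they can be computed with scikit-learn as follows.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """The four comparison metrics derived from the confusion-matrix counts."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
        "recall":    recall_score(y_true, y_pred),     # TP / (TP + FN)
        "f1":        f1_score(y_true, y_pred),
    }
```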

Furthermore, for a fair comparison, in all the comparison algorithms we use the standard parameter settings suggested in the scikit-learn machine learning package [36]. Additionally, we use the APS data set [37,38]. This benchmark data set is commonly used to evaluate machine learning algorithms, specifically for APS failure prediction tasks. The data set poses two problems: first, it contains a high number of missing values; second, it has a highly imbalanced class distribution.

In some papers, median imputation has been used to fill in missing data; for instance, the median imputation technique was utilized in [16]. However, median imputation can distort the data. Hence, we employ a more robust imputer, namely the KNN imputation method, and replace the missing values in each column using KNN. The data set used in this paper is quite challenging, as it also has an imbalanced class distribution, yet the proposed BELR achieves comparably good performance. This may be attributed to the ability of the feature-mapped layer (nodes) to extract features from the input data and of the enhancement layer (nodes) to further enhance those features so that the classes are separated from each other, which improves the performance of BELR on a skewed data set. This is validated when we compare the original logistic regression classifier with the proposed BELR.
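The imputation step can be sketched with scikit-learn's `KNNImputer`; the number of neighbours shown is the library default and is our assumption, not a value reported here.

```python
from sklearn.impute import KNNImputer

# Replace missing values in each column using the K nearest neighbours.
imputer = KNNImputer(n_neighbors=5, weights="uniform")
X_filled = imputer.fit_transform(X_raw)   # X_raw: APS feature matrix containing NaNs
```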

After filling in the missing data using the KNN imputer, we use cross-validation to fit the comparison models. Within each cross-validation fold, we extract features by applying BLS to the training set, fit the logistic regression on those training-set features, and then use the test set to estimate the quality metrics, as shown in the sketch after the next paragraph.

The data points are split into 10 folds using the stratified method of the scikit-learn machine learning package, and each algorithm is run 10 times. For instance, in the first run we combine nine of the folds as the training set and keep the remaining fold as the test set. We repeat this process 10 times, using different folds as the training and test sets. Table 2 summarizes the details of the data set used in the first run. In the experiments, we report the average performance of each compared algorithm.
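A sketch of the stratified 10-fold protocol is shown below; `X_filled`, `y`, and `psi_list` are assumed to come from the preceding steps, `bls_features` refers to the sketch in Section 2, and scikit-learn's `LogisticRegression` stands in for the BELR optimizer.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Append the bias column used throughout Section 2; psi_list would normally be
# rebuilt from each training fold (e.g., with build_projection_matrix above).
Xb = np.hstack([X_filled, np.ones((X_filled.shape[0], 1))])

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_f1 = []
for train_idx, test_idx in skf.split(Xb, y):
    # BLS feature extraction on each split, then logistic regression on the training features
    Z_tr, H_tr = bls_features(Xb[train_idx], psi_list)
    Z_te, H_te = bls_features(Xb[test_idx], psi_list)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.hstack([Z_tr, H_tr]), y[train_idx])
    fold_f1.append(f1_score(y[test_idx], clf.predict(np.hstack([Z_te, H_te]))))
print("average F1-score:", np.mean(fold_f1))
```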

**Table 2.** Details of the data set used in the first run.


From the table, the ratio of positive cases to negative cases in the training set is 0.001831, and for the test set it is 0.018360. It should be noted that we have used the stratified method of scikit-learn in our cross-validation; it takes the class imbalance into account when splitting the data into 10 folds.

## *The Comparison of the Performance of the Compared Algorithms*

In this subsection, we compare the proposed BELR with the original logistic regression (LogR), Random Forest classifier (RF), Gaussian Naive Bayes (GNB), K-nearest neighbour (KNN), and Support Vector Machine (SVM). The average performance of the comparison algorithms in terms of the metrics listed in Table 1 is presented. First, to mitigate the curse of dimensionality, we use principal component analysis (PCA) to reduce the dimension and select important features from the input data. A total of 81 principal components are created after applying PCA with a retained-variance value of 0.95. The initial dimension of the input data is 170; after applying PCA, the dimension is reduced to 81, almost half of the initial number of feature variables. We then apply the comparison algorithms to the PCA features, using 10-fold cross-validation. In the experiment, the total number of data points is 76,000. After applying stratified cross-validation, each fold contains 7600 data points, ensuring that each fold has the same proportion of observations with a given categorical value. In the first run, we take one group (7600 data points) as the test set and the remaining nine groups (9 × 7600 data points) for training the model. In the second run, we pick another group of 7600 data points as the test set and the remaining nine groups (9 × 7600 data points) to train the models. The process continues until we reach the 10th run. The training set contains 67,162 negative cases and 123 positive cases, and the test set contains 7462 negative cases and 137 positive cases; Table 2 shows the details. The results obtained from the experiment are presented in Table 3.
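Assuming the 0.95 value denotes the fraction of variance retained, the PCA reduction from 170 to 81 components can be reproduced with scikit-learn as follows; `X_train` and `X_test` denote one training/test split after imputation.

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)              # keep components explaining 95% of the variance
X_train_pca = pca.fit_transform(X_train)  # fit on the training fold only
X_test_pca = pca.transform(X_test)
print(pca.n_components_)                  # the paper reports 81 retained components for 170 features
```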


**Table 3.** The performance of the comparison algorithms under the evaluation metrics.

From Table 3, we notice that GNB has a recall of 79.35, which looks better than the rest of the algorithms. However, GNB performs very poorly in precision, with a score of 32.45, and consequently in F1-score, with a score of 46.06.

For the other algorithms, we notice that SVM has a very good precision score but a very poor recall score, which results in a poor F1-score. In contrast, LogR, RF, KNN, and the proposed BELR have good precision and recall scores in the Table, and their precision scores are relatively equal. The proposed BELR has the best average score in terms of sensitivity (recall), and when the algorithms are compared in terms of average F1-score, the proposed BELR again has the best score, as shown in Table 3. The scores for the other evaluation metrics are also presented in the Table. We use a boxplot to present the average F1-scores of all the compared algorithms; Figure 6 shows that the proposed BELR has the better F1-score.

**Figure 6.** The average F1-score of the compared algorithms.

Overall, we notice that the performance of the proposed BELR is better than the other comparison algorithms under an imbalanced data set.

#### **5. Conclusions**

This paper proposes broad embedded logistic regression (BELR) for classification problems, specifically for APS failure prediction. Its performance is studied under an exceedingly difficult data situation with an imbalanced class distribution. The feature-mapped nodes and enhancement nodes of the BLS are employed to handle the imbalanced data set, owing to the ability of the two types of nodes to generate/extract features that can clearly separate the two classes from each other. Hence, the approach improves the classification capacity of the logistic regression classifier.

Furthermore, the APS data set has the problem of missing data, and in this paper we use the KNN imputation method (the KNNImputer from scikit-learn) to address it. Scikit-learn is a machine learning package commonly used for processing data and building machine learning models. It should be noted that other missing-data imputation methods, such as generative adversarial networks (GANs), could also be explored.

The performance of the proposed algorithm is better than that of the other comparison algorithms, namely Gaussian Naive Bayes (GNB), Random Forest (RF), K-nearest neighbour (KNN), Support Vector Machine (SVM), and logistic regression (LogR). The comparison algorithms are evaluated using metrics that are popular and commonly used in the literature, namely average F1-score, average recall, average precision, and average accuracy. In terms of the F1-score, the proposed algorithm performs best among the comparison algorithms. The table and figures presented in the experimental section validate that the performance of the proposed BELR compares favourably with the other algorithms.

**Author Contributions:** Validation, B.P.; Writing—original draft, A.A.M.; Writing—review & editing, H.A.; Supervision, C.K.M.L.; Project administration, J.C.; Funding acquisition, C.K.M.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** ITC-InnoHK Clusters-Innovation and Technology Commission.

**Institutional Review Board Statement:** Not Applicable.

**Informed Consent Statement:** Not Applicable.

**Data Availability Statement:** Available upon request.

**Acknowledgments:** The work was supported by the Centre for Advances in Reliability and Safety (CAiRS) admitted under AIR@InnoHK Research Cluster.

**Conflicts of Interest:** The authors declare no conflict of interest.
