Article

An Application of Decision Tree-Based Twin Support Vector Machines to Classify Dephosphorization in BOF Steelmaking

1
Department of Materials Science and Engineering, University of Toronto, 184 College Street, Toronto, ON M5S, Canada
2
Department of Mathematics and Statistics, University of South Alabama, 411 University Blvd N MSPB 325, Mobile, AL 36688, USA
*
Author to whom correspondence should be addressed.
Metals 2020, 10(1), 25; https://doi.org/10.3390/met10010025
Submission received: 1 November 2019 / Revised: 18 December 2019 / Accepted: 19 December 2019 / Published: 22 December 2019

Abstract

Removing phosphorus in the Basic Oxygen Furnace (BOF) is essential to ensure high-quality end-product steel; excess phosphorus leads to cold shortness. This article aims at understanding the dephosphorization process through end-point P-content in BOF steelmaking based on data-mining techniques. Dephosphorization is often quantified through the partition ratio ($l_p$), the ratio of wt% P in slag to wt% P in steel. Instead of predicting the values of $l_p$, the present study focuses on classifying final steel based on slag chemistry and tapping temperature. This classification signifies different degrees (‘High’, ‘Moderate’, ‘Low’, and ‘Very Low’) to which phosphorus is removed in the BOF. Data on slag chemistry and tapping temperature collected from approximately 16,000 heats from two steel plants (Plant I and II) were assigned to four categories using the unsupervised K-means clustering method. An efficient decision tree-based twin support vector machines (TWSVM) algorithm was implemented for category classification. Decision trees were constructed using Gaussian mixture models (GMM), mean shift (MS), and the affinity propagation (AP) algorithm. The accuracy of the predicted classification was assessed using the classification rate (CR). Model validation was carried out with a five-fold cross-validation technique. The fitted model was compared in terms of CR with a decision tree-based support vector machines (SVM) algorithm applied to the same data. The highest accuracy (≥97%) was observed for the GMM-TWSVM model, implying that by manipulating the slag components appropriately using the structure of the model, a greater degree of P-partition can be achieved in the BOF.

1. Introduction

With an almost 100% increase in the price of iron ore over the past five years, the removal of phosphorus from these ores has become essential in order to maintain consistent steel quality [1]. Increased levels of phosphorus in steel can lead to cold shortness, causing brittleness and poor toughness [2,3]. The process of phosphorus removal from iron ores is known as dephosphorization. For a given slag basicity and carbon content of steel, the iron oxide content in slag has shown a greater influence on dephosphorization than the dissolved oxygen in liquid steel. Dephosphorization has often been quantified as (%P)/[%P], i.e., the slag/steel phosphorus distribution ratio, which frequently lies around the calculated equilibrium values for the metal/slag reactions involving iron oxide in slag [3,4].
Over the last few decades, the need to ensure high quality and productivity has motivated a substantial amount of research on phosphorus removal from steel based on various empirical and thermodynamic models [5,6,7,8,9]. Equilibrium relationships to estimate the effect of various slag components on phosphorus concentration were initially studied by Balajiva and Vajragupta in the 1940s on a small electric arc furnace (EAF) [5]. They reported that an increase in the concentration of CaO and FeO had a positive influence on dephosphorization. In 1953, Turkdogan and Pearson observed that reactant concentrations were not consistent under changing external conditions, and therefore focused on estimating the equilibrium of the following reaction:
$2[\mathrm{P}] + 5[\mathrm{O}] = (\mathrm{P_2O_5})$ (1)
where [A] and (B) represent a species in the metal phase and the slag phase, respectively [6]. The equilibrium constant, $K_p$, for (1) is given by:
$\log K_p = \frac{37160}{T} - 29.67$ (2)
where T is the slag temperature. Further, Suito and Inoue investigated the CaO–SiO2–MgO–FeO slag system and concluded that the phosphorus distribution ratio increases with increasing concentration of CaO in the slag [7]. The equation representing the phosphorus partition ratio from Suito and Inoue is given by,
$\log \frac{(\%\mathrm{P})}{[\%\mathrm{P}]\,(\%\mathrm{Fe}_{\mathrm{total}})^{2.5}} = 0.072\{(\%\mathrm{CaO}) + 0.3(\%\mathrm{MgO}) + 0.6(\%\mathrm{P_2O_5}) + 0.2(\%\mathrm{MnO}) + 1.2(\%\mathrm{CaF_2}) - 0.5(\%\mathrm{Al_2O_3})\} + \frac{11570}{T} - 10.52$ (3)
$\log \frac{(\%\mathrm{P})}{[\%\mathrm{P}]} = 0.072\{(\%\mathrm{CaO}) + 0.3(\%\mathrm{MgO}) + 0.6(\%\mathrm{P_2O_5}) + 0.2(\%\mathrm{MnO}) + 1.2(\%\mathrm{CaF_2}) - 0.5(\%\mathrm{Al_2O_3})\} + 2.5\log(\%\mathrm{Fe}_{\mathrm{total}}) + \frac{11570}{T} - 10.52$ (4)
where (%A) represents the percentage by weight of any component A. Moreover, Healy used thermodynamic data on phosphorus activity and phosphate free energy in the CaO–P2O5 binary system to develop the relationship shown in (5) [8].
$\log \frac{(\%\mathrm{P})}{[\%\mathrm{P}]} = \frac{22350}{T} + 0.08(\%\mathrm{CaO}) + 2.5\log(\%\mathrm{Fe}_{\mathrm{total}}) - 16 \pm 0.4$ (5)
The mathematical relationship in (5) estimates the phosphorus distribution between molten iron and the complex slags of the CaO–FeO–SiO2 system and was extended to the CaO–Fet–SiO2 system. In 2000, Turkdogan assessed $\gamma_{\mathrm{P_2O_5}}$ for a wide range of CaO, FeO, and P2O5 concentrations, as given by (6) [9].
$\log \gamma_{\mathrm{P_2O_5}} = -9.84 - 0.142(\%\mathrm{CaO} + 0.3\,\%\mathrm{MgO})$ (6)
More recently, Chattopadhyay and Kumar applied multiple linear regression (MLR) to analyze data from two plants: one with low slag basicity (low temperature) and the other with high slag basicity (high temperature) [10]. They suggested that a significant improvement in P distribution can be obtained by reducing the phosphorus reversal during blowing and after tapping, and by reducing the tapping temperature. In 2017, Drain et al. reviewed 36 empirical equations for phosphorus partition and presented their own new equation based on regression [11]. They identified the effects of minor slag constituents, including Al2O3, TiO2 and V2O5. An increase in Al2O3 content was found to have a detrimental effect on (%P)/[%P] except under low oxygen potential conditions, whereas TiO2 and V2O5 were found to positively affect (%P)/[%P]. In an effort to understand the reaction kinetics and identify optimum treatment conditions, Kitamura et al. proposed a new reaction model for hot metal phosphorus removal by saturating the slag with dicalcium silicate in the solid phase and then applying the dissolution rate of lime to simulate laboratory-scale experiments [12]. Kitamura et al. also discussed a simulation-based multi-scale model for a hot metal dephosphorization process by multi-phase slag, which integrated macro- and meso-scale models [13]. A coupled reaction model was used to define the reactions between liquid slag and the metal in the macro-scale model, whereas phase diagram data were applied and the P2O5 partition between solid and liquid slag was analyzed by thermodynamic data in the meso-scale model. While kinetic models are better than thermodynamic models for predicting the phosphorus partition ratio and end-point phosphorus, owing to the highly complex nature of the BOF process, data-driven models have a better chance of predicting with much higher accuracy.
Additionally, data-driven models are dynamic in nature, can accommodate new data sets, and are not limited by the test conditions or experimental conditions under which the model is developed.
In general, thermodynamic models are extremely useful for understanding the dephosphorization process in BOF steelmaking and, in many cases, provide accurate estimates of the dephosphorization index as well. However, the accuracy of such estimates depends greatly on the homogeneity of the slag compositions across different batches. Such homogeneity is highly unlikely in a BOF shop because of the high variability that arises from the dynamic nature of the multiphase process, e.g., variation in iron ore quality, composition of coke, etc. Moreover, these models depend strongly on the original experimental data. Identifying the factors that can lead to a higher degree of dephosphorization, building new infrastructure, and updating existing facilities based on these models to initiate and accelerate the phosphorus removal process can be both time consuming and financially burdensome.
On the other hand, empirical models are useful for predicting the dephosphorization measure corresponding to slag compositions and tapping temperatures only for a specific thermodynamic system; estimates may not be as precise for completely different systems. This implies that although a model may accurately estimate phosphorus partitions from one dataset, it may not perform as well on a new dataset. Most of these models, although they apply regression, do not estimate the slope parameters using parametric least-squares methods or non-parametric rank-based methods, thereby ignoring the concept of error variability [14,15]. Furthermore, the application of a multiple linear regression model requires certain criteria related to the error distribution (e.g., homoscedasticity, normality and independence) to be met, which are often not verified during model development. In this context, the application of data-driven approaches such as machine learning methods can be transformative, as these models have the potential to identify and utilize the inherent latent structures and patterns in the process based on the available data, and thereby evolve accordingly.
Machine learning (ML) is a fast-growing area of research owing to its nature of updating the implemented algorithm based on training data. A few ML techniques have been used to predict the end-point phosphorus content in the BOF steelmaking process. For example, a multi-level recursive regression model for complete end-point P content prediction was established by Wang et al. in 2014 based on a large amount of production data [16]. A predictive model based on principal component analysis (PCA) with a back propagation (BP) neural network was discussed by He and Zhang in 2018, where they predicted the end-point phosphorus content in the BOF based on BOF metallurgical process parameters and production data [17]. Multiple linear regression and generalized linear models were fitted to the BOF data from two plants for predicting $\log \frac{(\%\mathrm{P})}{[\%\mathrm{P}]}$ by Barui et al. in 2019, where they discussed various verification and adequacy measures that need to be applied before fitting an MLR model [18].
Though not many articles have been published discussing the role of ML-based methods in end-point phosphorus content prediction in BOF steelmaking, a few significant studies have demonstrated ML-based approaches to end-point carbon content and temperature control in the BOF. Wang et al. (2009) developed an input-weighted SVM for end-point prediction of carbon content and temperature with reasonable accuracy [19]. Improving on the work of Wang et al., Liu et al. (2018) used a least squares SVM method with a hybrid kernel to address the dynamic nature of problems in the steelmaking process [20]. More recently, in 2018, Gao et al. used an improved twin support vector regression (TWSVR) algorithm for end-point prediction in BOF steelmaking, achieving 96% and 94% accuracy for carbon content and temperature, respectively [21]. Though the applications of these models have not been explored specifically for the dephosphorization process, they do indicate that ML-based algorithms have the potential to deal with the non-linear patterns in data associated with BOF steelmaking processes.
In this paper, an attempt was made to classify final steel based on $l_p = \log \frac{(\%\mathrm{P})}{[\%\mathrm{P}]}$ values, and then predict the classes to which the response may belong depending on slag chemistry and tapping temperature. Process parameters such as initial hot metal chemistry and slag adjustment could be used as inputs to the model. However, the final slag chemistry is highly correlated with the hot metal chemistry and fluxes added, through the BOF mass balance model. Therefore, a model built on final slag chemistry is very similar to a model built using initial hot metal chemistry, fluxes added, and amount of oxygen blown. By creating categories based on $l_p$ values, the degree of phosphorus partition in the BOF process can be predicted; slag compositions belonging to the highest ordinal category would predict the lowest concentration of end-point phosphorus. In contrast to regression-based approaches, a classification of $l_p$ is useful when the quality of steel within specific thresholds can be assumed to be similar. The classes are created based on quartiles (percentiles) of $l_p$ and using the K-means clustering method.
Following the work of Dou and Zhang (2018), a decision tree twin support vector machine based on kernel clustering (DT2-SVM-KC) method is proposed in this paper for multi-class classification [22]. This approach uses twin support vectors, i.e., two non-parallel hyperplanes, as opposed to the single separating hyperplane of traditional SVM. TWSVM obtains the two non-parallel planes by solving two smaller quadratic programming problems, and hence the computation cost in the training phase for TWSVM is reduced to almost 25% of that of standard SVM [23,24]. Decision trees based on recursive binary splitting, on the other hand, are used extensively for classification purposes [25]. As a combination of TWSVM and a decision tree, the proposed approach is well suited to multi-class classification problems, with better generalization performance and less computation time [22].
The paper is arranged in the following scheme. Section 2 highlights the theory behind the algorithm, as well as an in-depth explanation of the algorithm itself. In Section 3, the results are presented and interpreted. Furthermore, this section analyzes the main findings through the outcome of the results. In Section 4, future considerations and improvements are discussed.

2. Theory and Methodology

As mentioned earlier, phosphorus partition is measured by the natural logarithm of the ratio of wt% P in slag to wt% P in steel, and is denoted by $l_p$. A larger $l_p$ value indicates a greater degree of phosphorus partition, resulting in steel with a lower phosphorus content. As a result, the quality of steel is correlated with the $l_p$ value, and a model that can accurately predict this value is sufficient to characterize the dephosphorization process. In this paper, a hybrid method combining a decision tree with TWSVM was considered, which can classify unlabeled test data into various dephosphorization categories. The proposed algorithm was implemented in Python 3.7.

2.1. Nature of the Data

The proposed algorithm was constructed and tested on datasets obtained from two plants: plant I and plant II. Data from plant I (tapping temperature 1620–1650 °C and 0.088% P) consist of observations on nine slag chemistry features from 13,853 heats used to characterize $l_p$. Data collected from plant II (tapping temperature 1660–1700 °C and 0.200% P) contain seven slag chemistry features from 3084 heats. A detailed summary of the features and $l_p$ values for both plants is presented in Table 1. All values (except $l_p$) are given in weight %.

2.2. Theoretical Model

The classification model comprises three phases: (1) initial labelling of data, (2) splitting the labeled data using a decision tree, and (3) training and testing of the TWSVM. The process flow is highlighted in Figure 1.
In the first phase, two unsupervised learning approaches, namely K-means clustering and quantile-based clustering, were used to categorize the entire data set into four clusters based on the $l_p$ values [26]. The clusters were labelled 0, 1, 2, and 3. The two approaches are discussed below.
K-means clustering:
Initially, the cluster centroids $\{c_1, c_2, c_3, c_4\}$ were assigned randomly. For each $l_p$,

$d_x = \underset{j = 1, 2, 3, 4}{\arg\min}\ \mathrm{dist}(c_j, l_p)$ (7)
was computed, where dist(·, ·) represents the Euclidean distance between two points. Each $l_p$ value in the dataset was assigned to the cluster whose centroid was closest. Let $S_i$ be the set of points assigned to the $i$-th cluster.
The updated centroids $\{c_1, c_2, c_3, c_4\}$ were then calculated as the means of the respective clusters:

$c_i = \frac{1}{|S_i|} \sum_{j \in S_i} l_{p,j}$ (8)
The centroid update steps were carried out iteratively until a convergence condition was met. Typically, convergence means that the relative difference between two consecutive iterations is less than some small pre-specified quantity $\epsilon$. Phase 1 of the initial labelling of data using K-means clustering is shown in Figure 2.
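The phase-1 labelling step described above can be sketched in a few lines; the snippet below is a minimal one-dimensional K-means illustration (the paper's implementation used Python 3.7, but the random initialization and the tolerance value here are assumptions):

```python
import numpy as np

def kmeans_1d(values, k=4, tol=1e-6, max_iter=100, seed=0):
    """Cluster scalar l_p values into k groups (phase-1 labelling sketch)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # random initial centroids drawn from the data points themselves
    centroids = rng.choice(values, size=k, replace=False)
    for _ in range(max_iter):
        # assign each l_p to the nearest centroid (Euclidean distance)
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # update each centroid to the mean of its assigned points
        new_centroids = np.array([values[labels == j].mean()
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.max(np.abs(new_centroids - centroids)) < tol:  # convergence
            break
        centroids = new_centroids
    return labels, centroids
```

In practice one would run several random restarts and keep the lowest-inertia solution, since a single random initialization can land in a poor local optimum.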
Quantile-based clustering:
The second method used for the initial labelling of $l_p$ values is based on quantiles (percentiles). Each $l_p$ was assigned to one of four groups: minimum–25th percentile, 25th–50th percentile, 50th–75th percentile, and 75th percentile–maximum. Figure 3 shows an example of quartile-based clustering of $l_p$ values.
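This quartile labelling can be expressed compactly; the sketch below uses `numpy.percentile`, whose interpolation convention is an implementation choice not specified in the paper:

```python
import numpy as np

def quartile_labels(values):
    """Assign each l_p value to one of four quartile-based groups (0..3)."""
    values = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    # searchsorted maps v <= q1 -> 0, q1 < v <= q2 -> 1, and so on
    return np.searchsorted([q1, q2, q3], values, side='left')
```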
In the second phase of the algorithm, the labelled data was passed through a decision tree with two output nodes [25,26]. Three different criteria for splitting the decision tree were considered, namely, Gaussian mixture models (GMM), mean shift (MS), and affinity propagation (AP) [26,27,28,29,30]. K-means clustering is built on the deterministic idea that each data point can belong to only a single cluster. GMM, by contrast, assumes that each data point has a certain probability of belonging to every cluster and that the data in each cluster follow a Gaussian distribution. The algorithm uses the expectation-maximization (EM) technique to estimate the model parameters [27]. For a given data point, the algorithm estimates the probability of belonging to each cluster, and the point is assigned to the cluster for which this probability is maximum [28]. The next splitting algorithm, MS, clusters the data points based on an attraction basin with respect to the convergence of a point to a cluster center [29]. This unsupervised learning approach iteratively shifts a data point towards the point of highest density in its neighborhood. Unlike K-means clustering, the number of clusters does not need to be specified; in Python, the algorithm is controlled by a kernel bandwidth parameter, which generates a reasonable number of clusters based on the data. The final splitting algorithm tested was affinity propagation [30]. Similar to MS, AP does not require the number of clusters to be specified. This algorithm clusters points based on their relative attractiveness and availability. In the first step, the algorithm computes the negative squared difference, feature by feature, between all pairs of data points to produce an n × n similarity matrix S, where n is the number of data points. The similarity function S(i, k) is defined as
$S(i,k) = -\lVert x_i - x_k \rVert^2$ (10)
where i and k are the row and column indices, respectively. Based on this matrix, an n × n responsibility matrix R is generated by taking the similarity between two data points and subtracting the maximum of the remaining availability-plus-similarity terms:

$R(i,k) \leftarrow S(i,k) - \max_{k' \neq k}\{A(i,k') + S(i,k')\}$ (11)

The responsibility matrix works as a helper matrix to construct the availability matrix A,
which indicates the relative availability of a data point to a particular cluster given the attractiveness received from the other clusters. In this matrix, the diagonal terms are updated using (12) and the rest of the terms using (13) as
$A(k,k) \leftarrow \sum_{i' \neq k} \max\{0, R(i',k)\}$ (12)
and
$A(i,k) \leftarrow \min\Big\{0,\ R(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, R(i',k)\}\Big\}$ (13)
The final step in creating the clusters is the construction of a criterion matrix C, the element-wise sum of the availability matrix and the responsibility matrix:

$C(i,k) \leftarrow R(i,k) + A(i,k)$ (14)
Subsequently, the highest value in each row of C is identified; the column index of this value is the row's exemplar. Rows that share the same exemplar belong to the same cluster, where

$\mathrm{exemplar}(i) = \underset{k}{\arg\max}\,\{A(i,k) + R(i,k)\}$ (15)
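The update rules above can be sketched as a small, damped implementation (the damping factor and the median-preference value placed on the diagonal of S are common practical choices, not details given in the paper):

```python
import numpy as np

def affinity_propagation(X, damping=0.7, iters=200):
    """Minimal AP sketch: similarity, responsibility, availability,
    criterion matrices; each row's exemplar is the argmax of R + A."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # similarity: negative squared Euclidean distance between all pairs
    S = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    # 'preference' on the diagonal controls how many exemplars emerge
    np.fill_diagonal(S, np.median(S))
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(iters):
        # responsibility: S(i,k) minus the best competing A(i,k') + S(i,k')
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = np.max(AS, axis=1)
        max_other = np.broadcast_to(first[:, None], (n, n)).copy()
        max_other[np.arange(n), idx] = second
        R = damping * R + (1 - damping) * (S - max_other)
        # availability: diagonal and off-diagonal updates
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        col = Rp.sum(axis=0)
        Anew = np.minimum(0, col[None, :] - Rp)
        np.fill_diagonal(Anew, col - R.diagonal())
        A = damping * A + (1 - damping) * Anew
    # criterion matrix; rows sharing an argmax column share a cluster
    return np.argmax(R + A, axis=1)
```

The damping factor keeps the message-passing updates from oscillating, a standard refinement over the bare update rules.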
To generate the decision tree, one of the clustering algorithms was applied to the entire dataset to produce two centroids acting as the basis for the child nodes. The purpose of applying a clustering algorithm was to find the optimal split of one node with four labels into two nodes with two labels each, where the splitting criterion is the entropy among the data points. For ease of computation, the initial clusters and child nodes are represented by their respective cluster centroids. The computation evaluates the difference between each initial cluster centroid and each child node centroid, and the assignment is carried out for whichever combination produces the least entropy.
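The four-into-two split can be illustrated with a small enumeration; here the variance of the cluster centroids stands in for the entropy criterion, since the exact impurity formula is not given in the paper:

```python
from itertools import combinations
import numpy as np

def best_binary_split(centroids):
    """Split four cluster centroids into two child nodes of two labels each,
    choosing the pairing with the smallest within-node spread."""
    labels = list(range(4))
    best, best_cost = None, float('inf')
    for pair in combinations(labels, 2):
        left = list(pair)
        right = [l for l in labels if l not in left]
        # within-node variance of centroids as a simple impurity proxy
        cost = (np.var([centroids[l] for l in left])
                + np.var([centroids[l] for l in right]))
        if cost < best_cost:
            best, best_cost = (left, right), cost
    return best
```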
Once the two child nodes with corresponding labels are created, they are passed through the TWSVM [22,23]. For generating the TWSVM, the data are split into 80% training and 20% testing data. A brief discussion of the working mechanism of TWSVM follows. Let x denote the feature vector and y denote the response variable. Let T be the training set:
$T = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ (16)
where each response $y_i \in \{+1, -1\}$; +1 and −1 represent the binary classes to which each $y_i$ is assigned. The matrices $A \in \mathbb{R}^{m_1 \times n}$ and $B \in \mathbb{R}^{m_2 \times n}$ contain the data points assigned to classes +1 and −1, respectively, where $m_1$ and $m_2$ are the numbers of data points in the two classes. Using the training data, the goal of the TWSVM is to obtain two non-parallel hyperplanes:
$x^T w_1 + b_1 = 0$ (17)
and
$x^T w_2 + b_2 = 0$ (18)
where $w_1$ and $w_2$ are appropriately chosen weight vectors, and $b_1$ and $b_2$ are the associated biases. To obtain the hyperplane equations, the following quadratic programming problems (QPPs) were solved:
$\min_{w_1, b_1, q}\ \frac{1}{2}(A w_1 + e_1 b_1)^T (A w_1 + e_1 b_1) + C_1 e_2^T q$ (19)

subject to

$-(B w_1 + e_2 b_1) + q \geq e_2, \quad q \geq 0$ (20)
and
$\min_{w_2, b_2, q}\ \frac{1}{2}(B w_2 + e_2 b_2)^T (B w_2 + e_2 b_2) + C_2 e_1^T q$ (21)

subject to

$(A w_2 + e_1 b_2) + q \geq e_1, \quad q \geq 0$ (22)
In these equations, $C_1$ and $C_2$ are parameters that quantify the trade-off between classification error and margin, $e_1$ and $e_2$ are vectors of ones of appropriate dimensions, and $q$ is the slack (error) vector associated with the samples. To simplify the solution of the QPPs, Lagrange multipliers were used. For example, to solve TWSVM1, the corresponding Lagrangian is given by:
$L(w^{(1)}, b^{(1)}, q, \alpha, \beta) = \frac{1}{2}(A w^{(1)} + e_1 b^{(1)})^T (A w^{(1)} + e_1 b^{(1)}) + C_1 e_2^T q - \alpha^T\big({-}(B w^{(1)} + e_2 b^{(1)}) + q - e_2\big) - \beta^T q$ (23)
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{m_2})^T$ and $\beta = (\beta_1, \beta_2, \ldots, \beta_{m_2})^T$ are the vectors of Lagrange multipliers. The corresponding Karush–Kuhn–Tucker (KKT) optimality conditions are given by:
$A^T(A w^{(1)} + e_1 b^{(1)}) + B^T \alpha = 0,$ (24)

$e_1^T(A w^{(1)} + e_1 b^{(1)}) + e_2^T \alpha = 0,$ (25)

$C_1 e_2 - \alpha - \beta = 0$ (26)

and

$-(B w^{(1)} + e_2 b^{(1)}) + q \geq e_2, \quad q \geq 0,$ (27)

$\alpha^T\big({-}(B w^{(1)} + e_2 b^{(1)}) + q - e_2\big) = 0, \quad \beta^T q = 0,$ (28)

$\alpha \geq 0, \quad \beta \geq 0.$ (29)
As the objective functions of the TWSVM are convex, the KKT conditions are both necessary and sufficient for optimality. After the algorithm is trained with labeled data, the unlabeled testing data are passed through the TWSVM, and each sample is assigned to a class based on whichever hyperplane is closer in terms of perpendicular distance. The accuracy of the final model is measured by the classification rate (CR), the percentage of the test data correctly labeled. More specifically,
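The nearest-hyperplane assignment rule can be written directly; the weight vectors below are hypothetical placeholders for trained TWSVM parameters, and ties default to class +1:

```python
import numpy as np

def twsvm_predict(X, w1, b1, w2, b2):
    """Assign each sample to class +1 or -1 according to the nearer of the
    two non-parallel hyperplanes x^T w + b = 0 (perpendicular distance)."""
    X = np.asarray(X, dtype=float)
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)
```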
$CR = \frac{1}{n} \sum_{i=1}^{4} a_{ii}$ (30)
where $a_{ij}$ is the number of heats in the test data with actual group label $i$ ($i = 1, 2, 3, 4$) and predicted group label $j$ ($j = 1, 2, 3, 4$), and $n = \sum_{i=1}^{4} \sum_{j=1}^{4} a_{ij}$. For our data, $x$ represents the vector of slag chemistry values corresponding to a particular heat, and $y$ takes −1 or +1 based on the two classes of $l_p$ values.
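Computing CR from the confusion matrix is straightforward; the sketch below uses zero-based labels 0–3 in place of the paper's 1–4:

```python
import numpy as np

def classification_rate(actual, predicted, k=4):
    """CR = (1/n) * trace of the k x k confusion matrix [a_ij]."""
    a = np.zeros((k, k), dtype=int)
    for i, j in zip(actual, predicted):
        a[i, j] += 1  # actual label i, predicted label j
    return a.trace() / a.sum()
```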

2.3. Model Adequacy

Five-fold cross-validation was applied to the model. This is a resampling procedure used to evaluate machine learning model performance on the training data, making the results less biased towards an arbitrary selection of the test set. In the five-fold cross-validation step, the dataset is split into five groups, and the model is trained five times, each time with one group held out from the training set and used as the test set. In this manner, each group serves as the test set exactly once, and the bias of the model is thereby reduced. The final accuracy presented in the results section is the average of the five accuracy values obtained in the cross-validation step. Finally, for both plants the data points were normalized to reduce the computational effort for training and to keep the data entropy in the same order of magnitude across all features.
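The five-fold split can be sketched as an index generator (a random permutation before splitting is assumed here; the paper does not state how the heats were shuffled):

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Yield (train, test) index arrays; each fold is the test set once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), 5)
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        yield train, test
```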

3. Results

3.1. Descriptive Statistics

Mean, standard deviation (SD), minimum and maximum for the set of features (i.e., slag chemistry components) and $l_p$ values are presented in Table 1. Box plots for all the features and the $l_p$ value for plant I are presented in Figure 4. For brevity, the box plots corresponding to the features for plant II are not provided. The box plots indicate that most of the features have symmetric distributions except Fetotal, MnO and Al2O3. These plots serve as a visual aid to identify the range of values containing the middle 50% of the data for each feature. For example, the longer whisker of the box plot corresponding to Al2O3 indicates that the values are skewed and potentially contain many outliers, while the middle 50% lies between 1.5% and 2% by weight. Table 2 shows the distribution of the $l_p$ values in each of the clusters initially labeled by K-means for both plant I and plant II. Each cluster has a distinct range of $l_p$ values, since there is no overlap among the one-standard-deviation intervals around the cluster means. For quantile-based clustering, each cluster has approximately 3463 (25%) observations for plant I and 771 (25%) observations for plant II. This signifies that the initial cluster labels have the potential to classify $l_p$ values into disjoint intervals, and therefore could be a reliable basis for categorizing features based on the degree of dephosphorization.

3.2. Model Hyper-Parameter Selection

Model hyper-parameters are parameters that cannot be learnt by the model during the training phase; they are supplied by the user and tuned empirically, in most cases by a trial-and-error approach. The performance of the model depends heavily on the choice of hyper-parameters, since they influence how fast the model learns and converges to a solution. For twin SVM, these hyper-parameters are epsilon ($\epsilon$) and the cost parameter ($C$) defined in Section 2. $\epsilon$ regulates the distance from the decision boundary to the threshold within which a point belonging to a certain class is penalized by the cost term; consequently, the larger the cost parameter, the higher the penalty for a point within the threshold. A large $\epsilon$ will hinder model convergence, whereas a small $\epsilon$ may cause an over-fitted decision boundary. The choice of kernel determines the model's ability to construct a linear or non-linear decision structure. TWSVM can accommodate linear, polynomial, and radial basis function (RBF) kernels. These kernels are transformations applied to the dataset to produce a representation of the data in a different space where the feature sets can be classified [25,26].
The hyper-parameters were selected using a trial-and-error approach, with the starting values, increments, and ending values presented in Table 3. From this grid, the hyper-parameter setting with the best performance in terms of accuracy was selected. The trial-and-error approach was not random but followed best practices in the field of machine learning. A comparison between linear and RBF kernel-based DT2-SVM-KC was carried out. RBF performed better than the linear kernel-based TWSVM in terms of accuracy (classification rate), which further indicates the strong presence of inherent non-linearity in the data obtained in BOF steelmaking. A polynomial kernel was also considered; however, it was discarded because it resulted in numerical overflow even for a quadratic polynomial.
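The trial-and-error scan amounts to a simple grid search over (ε, C); the `train_fn` and `score_fn` callables below are generic placeholders, since the training internals are described in Section 2:

```python
from itertools import product

def grid_search(train_fn, score_fn, eps_grid, c_grid):
    """Scan all (epsilon, C) pairs and keep the best-scoring setting."""
    best, best_score = None, float('-inf')
    for eps, c in product(eps_grid, c_grid):
        model = train_fn(eps, c)   # fit with this hyper-parameter pair
        s = score_fn(model)        # e.g., cross-validated classification rate
        if s > best_score:
            best, best_score = (eps, c), s
    return best, best_score
```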

3.3. Accuracy of Results

The results of the analysis of the BOF data based on the decision tree twin SVM cluster-based algorithm are presented in this section. As mentioned previously, the two methods of labeling $l_p$ values (quartile-based and K-means-based), combined with the three clustering algorithms serving as the split criterion for the decision tree, give a total of six different DT2-SVM-KC algorithms. The accuracy (%) for each of these algorithms is presented in Table 4.
A five-fold cross-validation technique was applied to validate the model adequacy. Of the six cases mentioned in Table 5, K-means clustering as the label generator with the GMM-cluster-based DT2-SVM-KC provided the most accurate result, with an accuracy of around 98.03%.
A comparison between the GMM, MS and AP-based algorithms is presented in Figure 5. The results show that GMM performed better than MS and AP in terms of accuracy for each dataset. Figure 6 shows the accuracy obtained using K-means clustering vs. quartile clustering to generate the initial labels of $l_p$. This figure shows that K-means clearly yields superior accuracy when compared with the quartile-based clustering method. The mean accuracy across the node-leaves is 78.77% for plant I and 98.04% for plant II.

3.4. Justification for Twin SVM over Other SVM Models

To justify the use of a complex model such as twin SVM, its results were compared with those of general SVM-based models. The results are shown in Figure 7. Further, Table 5 shows that twin SVM improves the accuracy of the binary classification problem by at least 15% for both plants I and II. The hyper-parameters for the SVM were tuned to reach the maximum possible accuracy, and the decision tree was consistent with the best method used with twin SVM (i.e., K-means as the labeling generator and GMM as the splitting criterion for the decision tree).

4. Discussion and Interpretation of Results

In this section, the results of the analysis are discussed. The section comprises two parts: first, a discussion of the performance of the algorithm, and second, an interpretation from an industrial perspective.

4.1. Algorithm Performance

Twin SVM outperforms SVM by at least 15% across all datasets. Mathematically, twin SVM solves two smaller quadratic programming problems rather than one large one, as SVM does. Twin SVM is therefore tailored for binary classification, which also means that the decision tree boosts its accuracy: the decision tree reduces a multi-class classification problem into small binary problems that can be solved by the twin SVM. In fact, the decision tree allows a greater number of classes as long as the node-leaves end up with two classes (i.e., $2^k$ classes where $k \geq 2$). The downside of increasing the number of classes is that each node-leaf has fewer data points to train on, resulting in less significant and more biased results. Given the number of data points, four classes were found empirically to be optimal for this experiment.
With regard to the difference in accuracy across plants, it is worth noting that although plant I has five times more data points than plant II, the accuracy for plant I is lower. This is counterintuitive; however, two possible contributing factors can be suggested: first, plant I has more features (i.e., V2O5 and TiO2) than plant II, and second, the range of lp values for plant II is narrower than that for plant I, as shown in Table 1. As a consequence, classification for plant I is more difficult due to the higher variability of the data. In addition, the ore from plant I contains more phosphorus than that from plant II, which further contributes to the reduced accuracy for the plant I data. These results suggest that the tuning of hyper-parameters should be performed separately for each plant's data; the algorithm itself, however, remains the same.
Finally, as part of the data pre-processing stage, all data points were normalized. The purpose of normalizing the data was to reduce the computational cost of training and testing. Unexpectedly, normalizing also improved the accuracy of the algorithm: an increase of at least 3 percentage points was observed for both plants. This effect is also attributed to the fact that lower variability increases the accuracy of the machine-learning model. By normalizing the data, the influence of features with large numerical values, such as temperature, is weighted the same as that of features with smaller numerical values, such as MnO or CaO.
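The scale issue is easy to see directly: an unnormalized temperature column (≈1650) dwarfs an oxide column such as MnO (≈4.8). A minimal z-score normalization sketch (the function name and example values are illustrative, not taken from the plant data):

```python
import numpy as np

def standardize(X):
    """Z-score each feature column so that large-valued features (temperature)
    and small-valued features (MnO, CaO) carry equal weight in training."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

# Example: two features on very different scales (temperature in C, MnO in wt%)
X = np.array([[1650.0, 4.8],
              [1630.0, 5.2],
              [1670.0, 4.4]])
Z = standardize(X)  # each column now has mean 0 and standard deviation 1
```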

4.2. Application of the Results for Industry

From an industrial viewpoint, the method of initial labeling is crucial: given a certain slag chemistry, we can predict the percentage of phosphorus in the steel. K-means clustering labels the data based on the proximity of an unlabeled data point to the cluster center of labeled data points. This center is updated dynamically until the algorithm converges, but at any given time it represents the mean of the data points within the cluster. For a metallurgist, the results of the algorithm show how close the current batch is to a certain lp value (i.e., the center of the cluster). The advantage of using K-means over the quartile labelling approach is clearly demonstrated in terms of accuracy, as illustrated in Table 4. One explanation for this result is that K-means is a group-based clustering method, whereas quartile labeling considers only the relative position of a value with respect to the others.
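The labeling step can be sketched with a plain NumPy 1-D k-means rather than a library call; the function name, the deterministic quantile initialization, and k = 4 are our illustrative choices, not necessarily those of the production pipeline.

```python
import numpy as np

def kmeans_label_lp(lp, k=4, iters=100):
    """Cluster 1-D l_p values with k-means, then relabel the clusters so that
    label 0 has the lowest mean l_p and label k-1 the highest."""
    # Deterministic init: spread the initial centers over the quantiles of lp
    centers = np.quantile(lp, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        labels = np.argmin(np.abs(lp[:, None] - centers[None, :]), axis=1)
        new = np.array([lp[labels == j].mean() if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    order = np.argsort(centers)           # sort clusters by their center
    remap = np.empty(k, dtype=int)
    remap[order] = np.arange(k)
    return remap[labels], centers[order]
```

Because the clusters are intervals on the lp axis, the relabeled output is ordinal: a higher label always means a higher typical lp, which is what allows a metallurgist to read class 3 as "best dephosphorization".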
The GMM cluster-based decision tree TWSVM algorithm designed in this research aims to attain higher flexibility and adaptability to real-world conditions. Given the slag chemistry values for a particular heat in a BOF shop, the model predicts into which of the four lp classes (i.e., 0–3) the batch will fall. For example, if the current batch belongs to class 0, the phosphorus content of the resulting steel will be high and the output will be undesirable. The objective is to produce steel corresponding to the high lp classes (classes 2–3). The proposed algorithm has been shown to work seamlessly with two different plants having different slag chemistries. The algorithm provides a general framework and requires training data from a given plant in order to achieve optimal accuracy when classifying lp values.
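The tree-construction step can be sketched as follows. At each node the current set of classes is split into two groups; here a simple 2-means on the class centroids stands in for the GMM splitting criterion of the actual model, and all names are illustrative. Each leaf ends as a binary problem that a twin SVM would then handle.

```python
import numpy as np

def build_class_tree(class_means, classes=None):
    """Recursively split a set of class labels into two groups by clustering
    their centroids (2-means here, as a stand-in for the GMM criterion).
    A leaf holds at most two classes, i.e., one binary twin-SVM problem."""
    if classes is None:
        classes = sorted(class_means)
    if len(classes) <= 2:
        return classes                      # leaf: train a twin SVM here
    mus = np.array([class_means[c] for c in classes], dtype=float)
    centers = mus[[0, -1]].copy()           # init with two extreme centroids
    for _ in range(50):
        dist = np.linalg.norm(mus[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in (0, 1):
            if np.any(assign == j):
                centers[j] = mus[assign == j].mean(axis=0)
    left = [c for c, a in zip(classes, assign) if a == 0]
    right = [c for c, a in zip(classes, assign) if a == 1]
    if not left or not right:               # degenerate split: halve the list
        mid = len(classes) // 2
        left, right = classes[:mid], classes[mid:]
    return (build_class_tree(class_means, left),
            build_class_tree(class_means, right))
```

For four lp classes this yields a depth-one tree with two binary leaves, so classifying a new heat requires evaluating only a handful of binary models.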

5. Conclusions

A decision tree twin support vector machine based on kernel clustering (DT2-SVM-KC) machine-learning model is proposed in this paper. The model classifies a batch (obtained from a specific heat at a BOF shop), based on its slag chemistry, into one of four classes of lp values: class 0 corresponds to the highest phosphorus content in the resulting steel, and class 3 to the lowest. The model is efficient and shows high accuracy with relatively low computational requirements and cost. Although further testing on different datasets is required, the model has shown consistent performance across plants I and II. However, it was observed that the accuracy of the algorithm decreases as the number of features and the variability of the response variable lp increase. It is therefore recommended that a metallurgical model based on fundamental theory be used to rule out features that have little or no influence on the phosphorus content of the steel. Additionally, for samples with high variability in the response variable lp, increasing the number of labels during cluster labelling could be advantageous, provided there are enough data points at each node-leaf; this avoids an underfitted model with unreliable results.
Finally, one of the main considerations for industrial application is the interpretability of the results. In this paper, the results suggest that, given a data point from a heat, one can deduce a certain lp range for that batch and also understand the influence of the features on the results, such that the slag composition can be adjusted to reduce the amount of phosphorus in the liquid steel for the next batch.

Author Contributions

K.C. conceptualized the assessment of machine-learning techniques on dephosphorization data. S.B. conceptualized the use of the decision tree twin support vector machine algorithm. K.C. acquired the data from two steel plants. J.P. and J.E. performed various statistical analyses on the data. K.C. provided funding for the project. J.E. and J.P. designed and developed the algorithms and implemented them in Python. S.B. provided mentorship for the project. J.P. created the figures in the paper. J.P. and J.E. wrote the initial draft of the manuscript. S.B. and S.M. edited and critically reviewed the final manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Graphical representation of our proposed algorithm.
Figure 2. K-means clustering for initial labelling of lp values.
Figure 3. Quantile-based clustering for initial labelling of lp values.
Figure 4. Box plots of all features and the response variable for plant I data.
Figure 5. A comparison among the accuracy results of GMM, Mean Shift and Affinity Propagation using K-Means as the initial labeling method.
Figure 6. A comparison between the accuracy results of K-Means and quartile-based clustering using GMM as the splitting criterion.
Figure 7. A comparison between accuracy results of SVM vs. TWSVM-based models.
Table 1. Descriptive statistics of all features for plant I and plant II.

| Variable | Mean | Standard Deviation | Minimum | Maximum |
|---|---|---|---|---|
| Plant I | | | | |
| lp | 4.31 | 0.30 | 2.50 | 7.06 |
| Temperature | 1648.82 | 19.14 | 1500.00 | 1749.00 |
| CaO | 42.43 | 3.62 | 20.00 | 55.90 |
| MgO | 9.23 | 1.37 | 3.75 | 16.46 |
| SiO2 | 12.89 | 1.74 | 5.40 | 23.30 |
| Fetotal | 18.22 | 3.53 | 7.70 | 36.00 |
| MnO | 4.80 | 0.70 | 2.28 | 11.98 |
| Al2O3 | 1.80 | 0.48 | 0.59 | 7.79 |
| TiO2 | 1.13 | 0.28 | 0.17 | 2.21 |
| V2O5 | 2.13 | 0.49 | 0.25 | 3.95 |
| Plant II | | | | |
| lp | 4.63 | 0.34 | 2.77 | 5.64 |
| Temperature | 1679.10 | 27.11 | 1579.00 | 1777.00 |
| CaO | 53.45 | 2.30 | 42.33 | 64.06 |
| MgO | 0.99 | 0.34 | 0.30 | 3.18 |
| SiO2 | 13.52 | 1.44 | 8.16 | 18.74 |
| Fetotal | 19.34 | 2.06 | 13.71 | 29.72 |
| MnO | 0.62 | 0.18 | 0.24 | 2.50 |
| Al2O3 | 0.94 | 0.25 | 0.46 | 4.09 |
Table 2. Classification of lp based on the labels from K-means clustering for plant I and plant II.

| Cluster Label | Frequency (%) | Mean | Standard Deviation | Minimum | Maximum |
|---|---|---|---|---|---|
| Plant I | | | | | |
| 0 | 2711 (19.57) | 3.76 | 0.16 | 2.50 | 3.94 |
| 1 | 5316 (38.37) | 4.12 | 0.09 | 3.94 | 4.26 |
| 2 | 4338 (31.31) | 4.41 | 0.08 | 4.26 | 4.56 |
| 3 | 1488 (10.75) | 4.72 | 0.14 | 4.56 | 7.06 |
| Plant II | | | | | |
| 0 | 1364 (44.23) | 3.66 | 0.56 | 2.77 | 3.99 |
| 1 | 1029 (33.37) | 4.30 | 0.13 | 3.99 | 4.49 |
| 2 | 584 (18.94) | 4.68 | 0.10 | 4.49 | 4.85 |
| 3 | 107 (3.46) | 4.99 | 0.11 | 4.85 | 5.64 |
Table 3. Selection of hyper-parameters for SVM for plant I and plant II.

| Hyperparameter | Starting Value | Increments | Ending Value | Selected Parameter (Plant I) | Selected Parameter (Plant II) |
|---|---|---|---|---|---|
| ε1 | 0.1 | 0.1 | 2.0 | 0.4 | 0.5 |
| ε2 | 0.1 | 0.1 | 2.0 | 0.5 | 0.5 |
| C1 | 0.1 | 0.1 | 2.5 | 1 | 1 |
| C2 | 0.1 | 0.1 | 2.5 | 1 | 1 |
| Kernel Type | Linear | N/A | Radial | Radial | Radial |
| Kernel Parameter | 0.5 | 0.1 | 3.0 | 2.5 | 2 |
Table 4. Accuracy results of the DT2-SVM-KC algorithm for both plants.

| Labelling Method | GMM | Mean Shift | Affinity Propagation |
|---|---|---|---|
| Plant I Accuracy | | | |
| K-Means | 78.76% | 71.77% | 72.00% |
| Quartile | 64.38% | 62.35% | 62.79% |
| Plant II Accuracy | | | |
| K-Means | 98.03% | 98.01% | 96.58% |
| Quartile | 95.78% | 94.81% | 94.00% |
Table 5. Accuracy comparison between SVM vs. TWSVM with decision trees, using K-means clustering as the initial label generator.

| Method | GMM | Mean Shift | Affinity Propagation |
|---|---|---|---|
| Plant I Accuracy | | | |
| TWSVM | 78.76% | 71.77% | 72.00% |
| SVM | 72.97% | 74.79% | 76.11% |
| Plant II Accuracy | | | |
| TWSVM | 98.03% | 98.01% | 96.58% |
| SVM | 97.45% | 98.00% | 96.42% |

Share and Cite

MDPI and ACS Style

Phull, J.; Egas, J.; Barui, S.; Mukherjee, S.; Chattopadhyay, K. An Application of Decision Tree-Based Twin Support Vector Machines to Classify Dephosphorization in BOF Steelmaking. Metals 2020, 10, 25. https://doi.org/10.3390/met10010025
