1. Introduction
Missing data occurs when there is no value assigned to an observation [1]; it is a common issue in various systems and applications, such as the web, clinical medicine, and materials engineering [2,3,4]. It can be caused by multiple factors, including survey unresponsiveness, irregular events, sensor malfunction, connection errors, and scheduled maintenance, among others [5,6,7]. This problem can decrease data quality and introduce bias, which in turn hinders the process of extracting knowledge from data analysis. The simplest and fastest way to deal with missing data is to discard the samples that have missing values; yet this may lead to a significant loss of potentially valuable information and result in biased outcomes, especially in small datasets [8]. Thus, missing data imputation has received significant attention, both technically and theoretically, and numerous studies have been performed over the last few decades [9]. Conventional imputation methods can be classified into two categories: statistics-based and learning-based methods [10].
In earlier studies, statistics-based methods were favored by researchers. These methods make use of statistical information such as the distance, frequency, mean, median, and mode of the observed data to fill in missing values. The Mean Imputation (MI) method [11] replaces all missing values with the mean value of the corresponding feature. Although this strategy is simple and fast to implement, it ignores the data distribution and can compromise the integrity of the data structure. The K-Nearest Neighbor (KNN) imputation method [12] is a distance-based method: the K nearest neighbors of the sample with the missing value are located, and their values are used to generate the imputation value. The Expectation Maximization (EM) method [13] finds maximum likelihood estimates and models the missing values by iteratively executing two main steps: the expectation step, which imputes missing values with their expected values based on the current estimates of the mean and covariance parameters, and the maximization step, which recomputes the parameters to maximize the likelihood obtained in the expectation step. Statistics-based methods are easy to comprehend and simple to implement. However, they only use the feature information of samples and cannot effectively exploit the label information, which degrades imputation performance.
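To make these strategies concrete, the following minimal NumPy sketch (not code from the cited works) shows a column-mean imputer and a simple K-nearest-neighbor imputer; all names and the choice of k are illustrative.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the observed mean of its column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def knn_impute(X, k=5):
    """Fill the missing entries of a row with the feature-wise mean of its
    k nearest complete rows (Euclidean distance on the observed features)."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        dist = np.linalg.norm(complete[:, ~miss] - row[~miss], axis=1)
        neighbors = complete[np.argsort(dist)[:k]]
        X[i, miss] = neighbors[:, miss].mean(axis=0)
    return X
```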
On the other hand, learning-based methods train a learning model to predict and impute the missing data. In tree-based imputation methods, a decision tree model is trained for each feature that has missing values, and the models are used to impute those values [14]. Clustering-based imputation methods partition the data into several clusters and utilize the cluster information to deal with the missing values [15]. Moreover, the Support Vector Machine (SVM) algorithm has been employed for data imputation [16], where an SVM regression model is trained using the decision attribute to predict the missing values. In a similar fashion, the Multivariate Imputation by Chained Equations (MICE) method [17] trains a linear regression model multiple times, and the average of the model outputs is used for imputation. Learning-based methods train specific models that learn the pattern of the data distribution to predict the missing values, thereby obtaining better imputation performance. Nevertheless, these methods do not fully consider the randomness of missing data, which can introduce bias.
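As an illustration of the chained-equations idea behind MICE, the sketch below uses scikit-learn's IterativeImputer, which regresses each incomplete feature on the remaining features and cycles over the features for several rounds; this is a generic example on synthetic data, not the implementation evaluated in the cited work.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the API)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # inject roughly 20% missing values

# Chained equations: each feature with missing values is regressed on the
# other features, and the imputation cycle is repeated up to max_iter rounds.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```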
Evolutionary algorithms are a class of stochastic optimization methods inspired by natural selection and survival of the fittest in biological evolution, and they have been widely used in numerous domains [18]. In recent years, the use of evolutionary algorithms for missing data imputation has been extensively investigated and has produced impressive results [19]. For example, the Multi-Objective Genetic algorithm for data Imputation (MOGI) [20] uses the Nondominated Sorting Genetic Algorithm II (NSGA-II) to find optimal imputation values. These methods exploit the strong search ability of evolutionary algorithms to search for imputation values, and their large populations of search agents guarantee effectiveness and randomness simultaneously. Despite this, these methods are inadequate in utilizing both feature and label information, which constrains their performance. Furthermore, real-world datasets often contain quantitative and categorical features at the same time. The distributions of these two types of features are likely to differ, yet most data imputation methods do not take this into consideration and treat all features identically.
In this era of big data, the amount of data produced and stored in different fields has increased significantly. As a consequence, the dimensionality of datasets grows, aggravating the “curse of dimensionality” problem in many real-world applications [21]. Feature selection is a key technique for addressing this problem. It selects a small subset of relevant features from the original feature set so as to maximize an evaluation criterion. By removing irrelevant and redundant features, feature selection can reduce the time needed to obtain the original data and the space required to store it, shorten the training time of classification models, and improve the interpretability and performance of those models [22,23]. Feature selection is a Nondeterministic Polynomial (NP)-hard problem with a large search space, and there are three main types of feature selection approaches, namely, filter, wrapper, and embedded methods. Among these, evolutionary-algorithm-based wrapper methods have shown a great capability of searching for the optimal feature subset and have therefore attracted much attention from researchers [24,25].
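A wrapper method of this kind scores every candidate feature subset by the performance of a learning model trained on it. The sketch below shows such a fitness function for a binary feature mask; the classifier, fold count, and all names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_fitness(mask, X, y):
    """Score a candidate feature subset (boolean mask over the columns of X)
    by the cross-validated accuracy of a classifier trained on that subset."""
    if not mask.any():                      # an empty subset gets the worst score
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()
```

An evolutionary wrapper then evolves a population of such masks, keeping those with higher fitness.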
In many application scenarios, high-dimensional datasets with missing data pose a common challenge. Not only can the missing data cause learning algorithms to malfunction, but the high dimensionality can also be a huge obstacle to obtaining good data analysis results. Considering that data imputation and feature selection both serve to improve data quality, combining them may improve the efficiency and accuracy of learning algorithms to a greater extent, especially on high-dimensional missing datasets. Some studies have shown that applying feature selection and data imputation simultaneously is conducive to improving the performance of imputation methods [26]. Moreover, the Differential Evolution algorithm involving KNN imputation, Clustering and Feature selection (DEKCF) [27] combines these three techniques and demonstrates a further improvement in classification accuracy. However, in DEKCF the differential evolution algorithm is only used for feature selection, and its imputation capability is neglected. As mentioned above, evolutionary algorithms have been successfully employed in both data imputation and feature selection, but little research uses them to implement the two techniques simultaneously. The potential of this integration strategy still needs evaluation and analysis.
Particle Swarm Optimization (PSO) is a prominent evolutionary algorithm that mimics the foraging behavior observed in fish schools and bird flocks. It is well known for its superior search capability and good scalability and has been widely applied to various optimization problems [28]. Based on the fact that PSO has been successfully applied, with good results, in both the feature selection and missing data imputation fields [29,30], in this study we propose Particle-Swarm-Optimization-based Feature selection and Imputation (PSOFI), an algorithm for handling classification problems with missing mixed data. PSOFI effectively combines the benefits of feature selection, data imputation, and PSO. Specifically, PSOFI incorporates a mixed data imputation method that divides features into quantitative and categorical types and uses a different imputation strategy for each. Additionally, it performs feature selection to choose promising features that can enhance its performance. By employing PSO to simultaneously optimize the parameters of the imputation strategies and select features in a wrapper manner, both feature and label information can be utilized, further improving its effectiveness. Moreover, a legacy learning mechanism is introduced to store and exploit the historical optimal solutions found during the iterations, thus enhancing the performance of the method.
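The exact encoding and update rules of PSOFI are detailed in Section 2. Purely to illustrate the wrapper-style joint evaluation described above, the sketch below decodes a hypothetical particle into per-feature fill values and a feature mask and scores it by classification accuracy; for brevity it imputes with the decoded values directly, whereas PSOFI itself uses distribution-based strategies for quantitative and categorical features. All names and thresholds are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def evaluate_particle(particle, X_train, y_train, X_val, y_val):
    """Wrapper fitness of one candidate solution: the head of the position
    vector supplies a fill value per feature, the tail is thresholded into a
    feature mask, and the accuracy on the imputed, reduced data is returned."""
    n_feat = X_train.shape[1]
    fill_values = particle[:n_feat]
    mask = particle[n_feat:] > 0.5          # select a feature when its entry > 0.5
    if not mask.any():
        return 0.0

    def prepare(X):
        X = np.where(np.isnan(X), fill_values, X)   # impute with decoded values
        return X[:, mask]                            # keep only selected features

    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(prepare(X_train), y_train)
    return clf.score(prepare(X_val), y_val)          # fitness = accuracy
```

A standard PSO loop would then update a swarm of such position vectors toward the best fitness found so far.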
This paper makes several contributions to the field, which are summarized as follows:
We propose a novel data imputation method based on PSO to address the challenges of mixed data imputation. Our method exploits different types of missing features and labels to generate suitable imputation values. Furthermore, we integrate the feature selection technique to enhance its overall performance.
To further enhance the proposed method, we design a legacy learning mechanism that allows the method to utilize historical optimal solutions to guide the search process.
We conduct experiments using seven comparison algorithms and twelve datasets to evaluate the effectiveness and superiority of our proposed method. Empirical results demonstrate that our method outperforms the other algorithms.
The remainder of this paper is organized as follows. The proposed method is elaborated in Section 2. The results of the experimental evaluation are presented in Section 3, followed by the conclusions in Section 4.
3. Results
We conduct exhaustive experiments to evaluate the performance of PSOFI, as described in this section. To ensure a fair comparison, three conventional statistics-based imputation methods and two evolutionary-algorithm-based imputation methods are used as baselines. Specifically, MI [11], KNN [12], and Regularized Expectation Maximization (REM) [32] are chosen as the statistics-based methods, while MOGI [20] and DEKCF [27] are selected as the evolutionary-algorithm-based methods. Additionally, to demonstrate the effectiveness of combining the proposed imputation method with feature selection, we include two variants of PSOFI: PSO-based IMputation (PSOIM), which omits feature selection, and PSO-based Feature Selection (PSOFS), which omits imputation.
Twelve datasets of different kinds are chosen from the UCI machine learning repository (the datasets are available at https://archive.ics.uci.edu/ml/datasets, accessed on 23 June 2024); their characteristics are shown in Table 1, where the words in parentheses are abbreviations of the dataset names and “Type” denotes the type of features that the dataset contains. BCC, Park, BTSC, and HV contain only quantitative features; Lymp, Spec, Monk, and DBWS contain only categorical features; and Sta, ILPD, Aus, and GL have mixed features. In addition, the HV, DBWS, and GL datasets have more than one hundred features. These datasets thus provide an impartial assessment of our method. As the datasets are originally complete, we artificially construct missing datasets with different Missing Rates (MRs). The MR denotes the ratio of missing values to the total number of values in a dataset. We randomly select values and replace them with “NaN” through sampling without replacement, repeating this process until the dataset reaches the desired MR. In this way, the performance of PSOFI can be evaluated fairly and conveniently.
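A minimal sketch of this missing-value injection procedure, written from the description above rather than taken from the authors' code, is given below; the seed and function name are illustrative.

```python
import numpy as np

def inject_missing(X, mr, seed=0):
    """Randomly replace a fraction `mr` of all cells with NaN,
    sampling cell positions without replacement."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    n_missing = int(round(mr * X.size))
    flat_idx = rng.choice(X.size, size=n_missing, replace=False)
    X[np.unravel_index(flat_idx, X.shape)] = np.nan
    return X
```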
The experiments are conducted on a test platform consisting of the Windows 10 operating system, Matlab 2020b, an Intel i5-7300HQ CPU, and 16 GB of RAM. We set K of KNN to 5 and use the default parameters provided by the open-source code for REM (the source code is available at https://github.com/tapios/RegEM, accessed on 12 April 2024). For MOGI, we use a population size of 40, 300 iterations, a crossover percentage of 0.8, a mutation percentage of 0.2, and a mutation rate of 0.02. For DEKCF, we set the population size to 40, the number of iterations to 300, and the other parameter settings to be consistent with the reference [27]. We use a population size of 40 and 300 iterations for PSOFI; the parameter range of each quantitative feature is set between the minimum and maximum observed values of that feature. The parameter settings for PSOIM and PSOFS are kept consistent with PSOFI. As MOGI is a multi-objective algorithm, we select a random solution from its results as the final solution.
After the aforementioned imputation methods are applied, the K-nearest neighbor classifier is utilized for classification. This classifier computes the Euclidean distance between samples and predicts the label of a sample based on its K nearest neighbors; it is a popular choice due to its simplicity and ease of use. In this study, the K value of the classifier is set to the commonly used value of 5, as reported in the literature [33].
We evaluate the methods using MRs of 5%, 10%, 20%, 30%, 40%, and 50% on each dataset. The training partition and the testing partition are 80% and 20% of the original datasets, respectively, and the testing partition is only used for classifier evaluation to avoid data leakage. Each algorithm is run independently 20 times. The accuracy of the classifier is used as the evaluation and optimization objective of PSOFI, PSOIM, and PSOFS. Besides the accuracy measure, we also consider a classic metric commonly used for imbalanced datasets, namely the F1 score. Given that the F1 score is the weighted harmonic average of the precision and recall indicators, it offers a comprehensive evaluation of the performance. The accuracy and F1 score measures are calculated as in (4) and (5):

Accuracy = (TP + TN) / (TP + TN + FP + FN),  (4)

F1 = 2 × TP / (2 × TP + FP + FN),  (5)
where TP (True Positive) represents the number of positive samples correctly predicted as positive, TN (True Negative) represents the number of negative samples correctly predicted as negative, FP (False Positive) represents the number of negative samples incorrectly predicted as positive, and FN (False Negative) represents the number of positive samples incorrectly predicted as negative.
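As a quick sanity check of the two definitions, the short sketch below computes both measures from illustrative confusion-matrix counts (the counts are made up for the example, not taken from the experiments).

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 40 TP, 45 TN, 5 FP, 10 FN.
print(accuracy(40, 45, 5, 10))   # 0.85
print(f1_score(40, 5, 10))       # about 0.842
```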
To verify the effectiveness of the legacy learning mechanism, comparative experiments between PSOFI and PSOFI without the legacy learning mechanism (PSOFIwl) are first carried out on three datasets of different types: BCC, Spec, and Aus. The iteration curves of the average accuracy are presented in Figure 4. As the curves show, PSOFI obtains better solutions during or at the end of the iterations in all circumstances except the Aus dataset with 20% and 50% MRs, indicating that the presented legacy learning mechanism can indeed improve the searching ability of PSOFI.
The accuracy box plots of the eight algorithms on all twelve datasets with different MRs are presented in Figure 5. Based on these accuracy results, it is evident that the evolutionary-algorithm-based methods outperform the statistics-based methods. This can be attributed to the fact that statistics-based approaches only utilize the information provided by the features, while evolutionary-algorithm-based algorithms utilize both feature and label information, resulting in better imputation outcomes. Among the methods, PSOFI demonstrates consistent performance across each dataset at various MRs. PSOFI’s performance is slightly lower than that of DEKCF on the Park and Lymp datasets with a 5% MR and on the Aus dataset with 30% and 50% MRs. Additionally, DEKCF performs better on the Spec, HV, DBWS, and GL datasets at different MRs; this can be attributed to the high-dimensional features of these datasets, which lead to a comparatively large search space for PSOFI. With the exception of the aforementioned instances, PSOFI outperforms DEKCF on all other datasets. As for the comparison between MOGI and PSOFI, PSOFI has lower performance on the HV dataset at all MRs, the BTSC and ILPD datasets with 10%, 20%, 30%, 40%, and 50% MRs, the Monk dataset with 30%, 40%, and 50% MRs, the Spec dataset with 40% and 50% MRs, the Aus dataset with 30% and 50% MRs, and the GL dataset with a 50% MR. However, PSOFI has a significant performance advantage over MOGI in most other cases. Moreover, the performance stability of PSOFI and DEKCF is better than that of MOGI, which is easy to infer from the figures.
The F1 score box plots for all the datasets are illustrated in Figure 6. The presented F1 score results once again highlight the superiority of the evolutionary-algorithm-based methods over the statistics-based methods. A detailed comparison of the proposed PSOFI method and DEKCF is presented below. On the Lymp dataset with 5%, 10%, 30%, and 50% MRs, the performance of PSOFI is slightly lower than that of DEKCF. DEKCF performs better than PSOFI on the Spec, DBWS, and Aus datasets at all six MRs, and on the HV and GL datasets with 10%, 20%, 30%, 40%, and 50% MRs. In the other results, PSOFI outperforms DEKCF, which is consistent with the accuracy results mentioned earlier. When comparing MOGI and PSOFI, the results are quite similar to the accuracy results. PSOFI performs worse on the HV dataset at all MRs, the BTSC and ILPD datasets with 10%, 20%, 30%, 40%, and 50% MRs, and the Monk dataset with 20%, 30%, 40%, and 50% MRs. Additionally, PSOFI has lower performance on the Spec dataset with 40% and 50% MRs and on the Aus and GL datasets with a 50% MR. Nevertheless, PSOFI shows superior performance to MOGI in the remaining results.
Now, the performance of PSOIM and PSOFS is analyzed to see how the imputation and feature selection techniques influence PSOFI. In the accuracy and F1 score figures, PSOIM and PSOFS perform better than the statistics-based methods in almost every case. When it comes to the evolutionary-algorithm-based imputation methods, PSOFI outperforms PSOIM and PSOFS in almost all cases, and DEKCF is only slightly inferior to PSOIM or PSOFS in a few cases. PSOFS is inferior to MOGI on the high-dimensional datasets, but it is competitive with MOGI on the other datasets with MRs lower than 30%. MOGI is superior to PSOIM in all cases except the DBWS dataset. PSOFS has weaker performance than PSOIM on the BTSC, DBWS, and GL datasets at all MRs, on the HV dataset with MRs higher than 5%, and on the Park, Lymp, Spec, Monk, and ILPD datasets with MRs higher than 20%. To summarize, PSOIM is effective for imputation, but its performance is relatively low. PSOFS can handle datasets with low MRs to a certain extent but, as the MR increases, its performance declines quickly, and it cannot handle high-dimensional datasets well. It may encounter situations where the classifier fails to operate normally, especially when the dataset is high-dimensional, as missing values remain in the selected features and no outcome is produced; the box plot results do not include these cases. As PSOFI performs significantly better than both PSOIM and PSOFS, it can be inferred that integrating feature selection techniques into imputation methods can greatly help to improve the overall classification performance.
To statistically confirm the superiority of PSOFI, the Mann–Whitney U test is conducted for verification. The significance level is set to 0.05, the mean values of the indicators on each dataset with different MRs are used, and the results are shown in Table 2.
It can be seen that, except for DEKCF on the two indicators, all other comparison algorithms reject the null hypothesis for both the accuracy and F1 score indicators. The findings suggest that PSOFI outperforms the majority of the compared algorithms, with the exception of DEKCF. This discrepancy can mainly be attributed to PSOFI’s lower performance on high-dimensional datasets.
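For reference, such a pairwise comparison can be carried out with SciPy's mannwhitneyu function, as sketched below; the accuracy values are placeholders rather than numbers from Table 2, and a 0.05 significance threshold is assumed in this sketch.

```python
from scipy.stats import mannwhitneyu

# Mean accuracies of PSOFI and a comparison method over several
# dataset/MR combinations (placeholder values, for illustration only).
psofi_acc    = [0.81, 0.77, 0.84, 0.79, 0.88, 0.73]
baseline_acc = [0.74, 0.71, 0.80, 0.72, 0.83, 0.69]

stat, p_value = mannwhitneyu(psofi_acc, baseline_acc, alternative="two-sided")
reject_null = p_value < 0.05   # reject: the two samples differ significantly
```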
Further analyzing the results presented above, we summarize the following points.
The performance gap between PSOFI and DEKCF is usually small, while MOGI’s advantage or disadvantage is typically of a larger magnitude, reflecting the instability of MOGI’s performance.
The indicator values show no significant decline with increasing MRs, and, in some cases, such as BCC and Monk, demonstrate an improvement. This can be attributed to the datasets’ strong discriminative features, where missing values have minimal impact on classification performance.
The MOGI algorithm exhibits improved performance with increasing MRs on each dataset. We speculate that the original values may not be discriminative enough for learning algorithms, and thus MOGI searches for an optimal value for each missing position to ensure a better fit to the algorithm. This explanation supports the fact that MOGI performs better, particularly when other algorithms demonstrate relatively low accuracy.
PSOFI, DEKCF, and PSOFS present comparatively better performance on some datasets, such as the BCC, Monk, and Sta datasets, whereas PSOIM performs poorly. This highlights the essential role that the feature selection technique plays in improving the methods’ performance.
When focusing on the results obtained on the mixed-feature datasets, PSOFI demonstrates overall superiority, which illustrates the efficacy of our strategy for imputing mixed data. PSOFI utilizes a Gaussian distribution to impute missing quantitative feature values and selection probabilities to impute missing categorical feature values, leading to better outcomes by effectively leveraging the randomness of mixed missing values (a minimal sketch of this idea is given after these points).
PSOFI obtains mediocre results on high-dimensional datasets while DEKCF shows stronger performance on these datasets. This indicates that PSOFI may not be as competitive as DEKCF in terms of feature selection, as we have used a relatively simple strategy to implement feature selection.
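The following sketch illustrates the two imputation strategies referred to in the point on mixed-feature datasets: quantitative fill values are sampled from a Gaussian, and categorical fill values are sampled according to selection probabilities over the observed categories. The parameterization is simplified and all names are illustrative; in PSOFI these parameters are the quantities optimized by PSO (see Section 2 for the exact formulation).

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_quantitative(n_missing, mu, sigma):
    """Draw fill values for a quantitative feature from a Gaussian whose
    mean and spread are parameters tuned by the search."""
    return rng.normal(mu, sigma, size=n_missing)

def impute_categorical(n_missing, categories, weights):
    """Draw fill values for a categorical feature according to selection
    probabilities (weights) over its observed categories."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()              # normalize into a distribution
    return rng.choice(categories, size=n_missing, p=probs)
```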
Now, we analyze the running time of the PSOFI algorithm. Obviously, the time cost of the statistics-based methods is smaller than that of the evolutionary-algorithm-based methods; therefore, only PSOFI, DEKCF, and MOGI are compared in detail here. The stacked histograms of the time costs of the three methods on the testing datasets with different MRs are shown in Figure 7.
It can be observed that PSOFI has a higher time overhead in most cases, especially on the Park, Spec, Aus, and high-dimensional datasets. A significance analysis using the Mann–Whitney U test is also conducted on the time costs according to the dataset MRs; the significance level is set to 0.05, the mean time costs on each dataset with different MRs are used, and the results are given in Table 3.
The results show that the time costs of PSOFI are significantly higher than those of DEKCF and MOGI, which indicates that the main defect of PSOFI is its time overhead. This can be attributed to our effective but complicated imputation strategy. Nevertheless, this result is consistent with the “No Free Lunch” theorem, which states that no single learning algorithm can perform well on all kinds of tasks [34]. To conclude, PSOFI performs well when only classification is considered; however, when running time is taken into account, its performance is weaker.
4. Conclusions
In this paper, we introduce PSOFI, a novel and potent classification technique designed for handling missing data with mixed feature types. PSOFI employs distinct imputation models for quantitative and categorical features, incorporates feature selection to enhance its efficacy, and utilizes PSO to simultaneously optimize the parameters of the imputation models and the feature selection process. Furthermore, we propose a legacy learning mechanism to enhance the search capability of PSOFI by engaging additional search agents during iterations.
For a comparative analysis, we employ three statistics-based methods and two evolutionary-algorithm-based methods, in addition to constructing two PSOFI variants, PSOIM and PSOFS, to showcase the effectiveness of our developed strategies. Moreover, a total of twelve datasets of different types are used for the experiments.
The experimental results lead to several conclusions: firstly, PSOFI outperforms the other methods in terms of the accuracy and F1 score measures in most cases. Secondly, PSOFI incurs a reasonable time cost. Thirdly, the results demonstrate the efficacy of PSOFI in leveraging both the randomness of missing values and the information contained in the labels. Lastly, it is feasible and applicable to utilize evolutionary algorithms to integrate the data imputation and feature selection techniques.
Despite its strengths, PSOFI has some limitations. It currently relies on manually determined feature types, does not account for non-Gaussian-distributed quantitative feature values, performs only moderately on high-dimensional datasets, and has a higher time overhead, which restricts its use in scenarios demanding high efficiency. Additionally, as an evolutionary-algorithm-based method, PSOFI is heavily influenced by the optimization objective, yet only a single classification metric is used in this study. Future work will address these limitations by developing an intelligent feature type discrimination mechanism, accommodating various data distributions, designing dedicated feature selection strategies for high-dimensional datasets, integrating parallel computing to reduce the time overhead, expanding PSOFI into a multi-objective framework, and considering alternative metrics such as balanced accuracy. Furthermore, we intend to extend PSOFI to address imbalanced, high-dimensional missing data classification problems.