1. Introduction
Missing data occurs when there is no value assigned to an observation [1]; it is a common issue in various systems and applications, such as the web, clinical medicine, and materials engineering [2,3,4]. It can be caused by multiple factors, including survey unresponsiveness, irregular events, sensor malfunction, connection errors, and scheduled maintenance, among others [5,6,7]. This problem can decrease data quality and introduce bias, which in turn hinders the process of extracting knowledge from data analysis. The simplest and fastest way to deal with missing data is to discard the samples that have missing values; yet this may lead to a significant loss of potentially valuable information and result in biased outcomes, especially in small datasets [8]. Thus, missing data imputation has received significant attention, both technically and theoretically, and numerous studies have been performed over the last few decades [9]. Conventional imputation methods can be classified into two categories: statistics-based and learning-based methods [10].
In earlier studies, statistics-based methods were favored by researchers. These methods make use of statistical information such as the distance, frequency, mean, median, and mode of the observed data to fill in missing values. The Mean Imputation (MI) method [11] replaces all missing values with the mean value of the corresponding feature. Although this strategy is simple and fast to implement, it ignores the data distribution and can compromise the integrity of the data structure. The K-Nearest Neighbor (KNN) imputation method [12] is a distance-based method: the K nearest neighbors of the sample with the missing value are located, and their values are used to generate the imputation value. The Expectation Maximization (EM) method [13] finds maximum likelihood estimates and models the missing values by iteratively executing two main steps: the expectation step, which imputes missing values with their expected values based on the current estimates of the mean and covariance parameters, and the maximization step, which recomputes the parameters to maximize the likelihood obtained in the expectation step. Statistics-based methods are easy to comprehend and simple to implement. However, they only use the feature information of samples and cannot effectively exploit the label information, which degrades imputation performance.
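To make these strategies concrete, the following minimal NumPy sketch (not code from the cited works) shows a column-mean imputer and a simple K-nearest-neighbor imputer; all names and the choice of k are illustrative.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the observed mean of its column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def knn_impute(X, k=5):
    """Fill the missing entries of a row with the feature-wise mean of its
    k nearest complete rows (Euclidean distance on the observed features)."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        dist = np.linalg.norm(complete[:, ~miss] - row[~miss], axis=1)
        neighbors = complete[np.argsort(dist)[:k]]
        X[i, miss] = neighbors[:, miss].mean(axis=0)
    return X
```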
On the other hand, learning-based methods train a learning model to predict and impute the missing data. In tree-based imputation methods, a decision tree model is trained for each feature that has missing values, and the models are used to impute those values [14]. Clustering-based imputation methods partition the data into several clusters and utilize the cluster information to deal with the missing values [15]. Moreover, the Support Vector Machine (SVM) algorithm has been employed for data imputation [16], where an SVM regression model is trained using the decision attribute to predict the missing values. In a similar fashion, the Multivariate Imputation by Chained Equations (MICE) method [17] trains a linear regression model multiple times, and the average of the model outputs is used for imputation. Learning-based methods train specific models that learn the pattern of the data distribution to predict the missing values, thereby obtaining better imputation performance. Nevertheless, these methods do not fully consider the randomness of missing data, which can introduce bias.
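As an illustration of the chained-equations idea behind MICE, the sketch below uses scikit-learn's IterativeImputer, which regresses each incomplete feature on the remaining features and cycles over the features for several rounds; this is a generic example on synthetic data, not the implementation evaluated in the cited work.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the API)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # inject roughly 20% missing values

# Chained equations: each feature with missing values is regressed on the
# other features, and the imputation cycle is repeated up to max_iter rounds.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```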
Evolutionary algorithms are a class of stochastic optimization methods inspired by natural selection and survival of the fittest in biological evolution, and they have been widely used in numerous domains [18]. In recent years, the use of evolutionary algorithms for missing data imputation has been extensively investigated and has produced impressive results [19]. For example, the Multi-Objective Genetic algorithm for data Imputation (MOGI) [20] uses the Nondominated Sorting Genetic Algorithm II (NSGA-II) to find optimal imputation values. These methods exploit the strong search ability of evolutionary algorithms to search for imputation values, and their large populations of search agents guarantee effectiveness and randomness simultaneously. Despite this, these methods are inadequate in utilizing both feature and label information, which constrains their performance. Furthermore, real-world datasets often contain quantitative and categorical features at the same time. The distributions of these two types of features are likely to differ, yet most data imputation methods do not take this into consideration and treat all features identically.
In this era of big data, the amount of data produced and stored in different fields has increased significantly. As a consequence, the dimensionality of datasets grows, aggravating the “curse of dimensionality” problem in many real-world applications [21]. Feature selection is a key technique for addressing this problem. It selects a small subset of relevant features from the original feature set so as to maximize an evaluation criterion. By removing irrelevant and redundant features, feature selection can reduce the time needed to obtain the original data and the space required to store it, shorten the training time of classification models, and improve the interpretability and performance of those models [22,23]. Feature selection is a Nondeterministic Polynomial (NP)-hard problem with a large search space, and there are three main types of feature selection approaches, namely, filter, wrapper, and embedded methods. Among these, evolutionary-algorithm-based wrapper methods have shown a great capability of searching for the optimal feature subset and have therefore attracted much attention from researchers [24,25].
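A wrapper method of this kind scores every candidate feature subset by the performance of a learning model trained on it. The sketch below shows such a fitness function for a binary feature mask; the classifier, fold count, and all names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_fitness(mask, X, y):
    """Score a candidate feature subset (boolean mask over the columns of X)
    by the cross-validated accuracy of a classifier trained on that subset."""
    if not mask.any():                      # an empty subset gets the worst score
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()
```

An evolutionary wrapper then evolves a population of such masks, keeping those with higher fitness.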
In many application scenarios, high-dimensional datasets with missing data pose a common challenge. Not only can the missing data cause learning algorithms to malfunction, but the high dimensionality can also be a huge obstacle to obtaining good data analysis results. Considering that data imputation and feature selection both serve to improve data quality, combining them may improve the efficiency and accuracy of learning algorithms to a greater extent, especially on high-dimensional missing datasets. Some studies have shown that applying feature selection and data imputation simultaneously is conducive to improving the performance of imputation methods [26]. Moreover, the Differential Evolution algorithm involving KNN imputation, Clustering and Feature selection (DEKCF) [27] combines these three techniques and demonstrates a further improvement in classification accuracy. However, in DEKCF the differential evolution algorithm is only used for feature selection, and its imputation capability is neglected. As mentioned above, evolutionary algorithms have been successfully employed in both data imputation and feature selection, but little research uses them to implement the two techniques simultaneously. The potential of this integration strategy still needs evaluation and analysis.
Particle Swarm Optimization (PSO) is a prominent evolutionary algorithm that mimics the foraging behavior observed in fish schools and bird flocks. It is well known for its superior search capability and good scalability and has been widely applied to various optimization problems [28]. Based on the fact that PSO has been successfully applied, with good results, in both the feature selection and missing data imputation fields [29,30], in this study we propose Particle-Swarm-Optimization-based Feature selection and Imputation (PSOFI), an algorithm for handling classification problems with missing mixed data. PSOFI effectively combines the benefits of feature selection, data imputation, and PSO. Specifically, PSOFI incorporates a mixed data imputation method that divides features into quantitative and categorical types and uses a different imputation strategy for each. Additionally, it performs feature selection to choose promising features that can enhance its performance. By employing PSO to simultaneously optimize the parameters of the imputation strategies and select features in a wrapper manner, both feature and label information can be utilized, further improving its effectiveness. Moreover, a legacy learning mechanism is introduced to store and exploit the historical optimal solutions found during the iterations, thus enhancing the performance of the method.
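The exact encoding and update rules of PSOFI are detailed in Section 2. Purely to illustrate the wrapper-style joint evaluation described above, the sketch below decodes a hypothetical particle into per-feature fill values and a feature mask and scores it by classification accuracy; for brevity it imputes with the decoded values directly, whereas PSOFI itself uses distribution-based strategies for quantitative and categorical features. All names and thresholds are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def evaluate_particle(particle, X_train, y_train, X_val, y_val):
    """Wrapper fitness of one candidate solution: the head of the position
    vector supplies a fill value per feature, the tail is thresholded into a
    feature mask, and the accuracy on the imputed, reduced data is returned."""
    n_feat = X_train.shape[1]
    fill_values = particle[:n_feat]
    mask = particle[n_feat:] > 0.5          # select a feature when its entry > 0.5
    if not mask.any():
        return 0.0

    def prepare(X):
        X = np.where(np.isnan(X), fill_values, X)   # impute with decoded values
        return X[:, mask]                            # keep only selected features

    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(prepare(X_train), y_train)
    return clf.score(prepare(X_val), y_val)          # fitness = accuracy
```

A standard PSO loop would then update a swarm of such position vectors toward the best fitness found so far.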
This paper makes several contributions to the field, which are summarized as follows:
We propose a novel data imputation method based on PSO to address the challenges of mixed data imputation. Our method exploits different types of missing features and labels to generate suitable imputation values. Furthermore, we integrate the feature selection technique to enhance its overall performance.
To further enhance the proposed method, we design a legacy learning mechanism that allows the method to utilize historical optimal solutions to guide the search process.
We conduct experiments using seven comparison algorithms and twelve datasets to evaluate the effectiveness and superiority of our proposed method. Empirical results demonstrate that our method outperforms the other algorithms.
The remainder of this paper is organized as follows. The proposed method is elaborated in Section 2. The results of the experimental evaluation are presented in Section 3, followed by the conclusions in Section 4.
3. Results
We conduct exhaustive experiments to evaluate the performance of PSOFI, as described in this section. To ensure a fair comparison, three conventional statistics-based imputation methods and two evolutionary-algorithm-based imputation methods are used as baselines. Specifically, MI [11], KNN [12], and Regularized Expectation Maximization (REM) [32] are chosen as the statistics-based methods, while MOGI [20] and DEKCF [27] are selected as the evolutionary-algorithm-based methods. Additionally, to demonstrate the effectiveness of combining the proposed imputation method with feature selection, we include two variants of PSOFI: PSO-based IMputation (PSOIM), which omits feature selection, and PSO-based Feature Selection (PSOFS), which omits imputation.
Twelve datasets of different kinds are chosen from the UCI machine learning repository (the datasets are available at https://archive.ics.uci.edu/ml/datasets, accessed on 23 June 2024); their characteristics are shown in Table 1, where the words in parentheses are abbreviations of the dataset names and “Type” denotes the type of features that the dataset contains. BCC, Park, BTSC, and HV contain only quantitative features; Lymp, Spec, Monk, and DBWS contain only categorical features; and Sta, ILPD, Aus, and GL have mixed features. In addition, the HV, DBWS, and GL datasets have more than one hundred features. These datasets thus provide an impartial assessment of our method. As the datasets are originally complete, we artificially construct missing datasets with different Missing Rates (MRs). The MR denotes the ratio of missing values to the total number of values in a dataset. We randomly select values and replace them with “NaN” through sampling without replacement, repeating this process until the dataset reaches the desired MR. In this way, the performance of PSOFI can be evaluated fairly and conveniently.
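A minimal sketch of this missing-value injection procedure, written from the description above rather than taken from the authors' code, is given below; the seed and function name are illustrative.

```python
import numpy as np

def inject_missing(X, mr, seed=0):
    """Randomly replace a fraction `mr` of all cells with NaN,
    sampling cell positions without replacement."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    n_missing = int(round(mr * X.size))
    flat_idx = rng.choice(X.size, size=n_missing, replace=False)
    X[np.unravel_index(flat_idx, X.shape)] = np.nan
    return X
```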
The experiments are conducted on a test platform consisting of the Windows 10 operating system, Matlab 2020b, an Intel i5-7300HQ CPU, and 16 GB of RAM. We set K of KNN to 5 and use the default parameters provided by the open-source code for REM (the source code is available at https://github.com/tapios/RegEM, accessed on 12 April 2024). For MOGI, we use a population size of 40, 300 iterations, a crossover percentage of 0.8, a mutation percentage of 0.2, and a mutation rate of 0.02. For DEKCF, we set the population size to 40, the number of iterations to 300, and the other parameter settings to be consistent with the reference [27]. We use a population size of 40 and 300 iterations for PSOFI; the parameter range of each quantitative feature is set between the minimum and maximum observed values of that feature. The parameter settings for PSOIM and PSOFS are kept consistent with PSOFI. As MOGI is a multi-objective algorithm, we select a random solution from its results as the final solution.
After the aforementioned imputation methods are applied, the K-nearest neighbor classifier is utilized for classification. This classifier computes the Euclidean distance between samples and predicts the label of a sample based on its K nearest neighbors; it is a popular choice due to its simplicity and ease of use. In this study, the K value of the classifier is set to the commonly used value of 5, as reported in the literature [33].
We evaluate the methods using MRs of 5%, 10%, 20%, 30%, 40%, and 50% on each dataset. The training partition and the testing partition are 80% and 20% of the original datasets, respectively, and the testing partition is only used for classifier evaluation to avoid data leakage. Each algorithm is run independently 20 times. The accuracy of the classifier is used as the evaluation and optimization objective of PSOFI, PSOIM, and PSOFS. Besides the accuracy measure, we also consider a classic metric commonly used for imbalanced datasets, namely the F1 score. Given that the F1 score is the weighted harmonic average of the precision and recall indicators, it offers a comprehensive evaluation of the performance. The accuracy and F1 score measures are calculated as in (4) and (5):

Accuracy = (TP + TN) / (TP + TN + FP + FN),  (4)

F1 = 2 × TP / (2 × TP + FP + FN),  (5)
where TP (True Positive) represents the number of positive samples correctly predicted as positive, TN (True Negative) represents the number of negative samples correctly predicted as negative, FP (False Positive) represents the number of negative samples incorrectly predicted as positive, and FN (False Negative) represents the number of positive samples incorrectly predicted as negative.
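As a quick sanity check of the two definitions, the short sketch below computes both measures from illustrative confusion-matrix counts (the counts are made up for the example, not taken from the experiments).

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 40 TP, 45 TN, 5 FP, 10 FN.
print(accuracy(40, 45, 5, 10))   # 0.85
print(f1_score(40, 5, 10))       # about 0.842
```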
To verify the effectiveness of the legacy learning mechanism, comparative experiments between PSOFI and PSOFI without the legacy learning mechanism (PSOFIwl) are first carried out on three datasets of different types: BCC, Spec, and Aus. The iteration curves of the average accuracy are presented in Figure 4. As the curves show, PSOFI obtains better solutions during or at the end of the iterations in all circumstances except the Aus dataset with 20% and 50% MRs, indicating that the presented legacy learning mechanism can indeed improve the searching ability of PSOFI.
The accuracy box plots of the eight algorithms on all twelve datasets with different MRs are presented in Figure 5. Based on these accuracy results, it is evident that the evolutionary-algorithm-based methods outperform the statistics-based methods. This can be attributed to the fact that statistics-based approaches only utilize the information provided by the features, while evolutionary-algorithm-based algorithms utilize both feature and label information, resulting in better imputation outcomes. Among the methods, PSOFI demonstrates consistent performance across each dataset at various MRs. PSOFI’s performance is slightly lower than that of DEKCF on the Park and Lymp datasets with a 5% MR and on the Aus dataset with 30% and 50% MRs. Additionally, DEKCF performs better on the Spec, HV, DBWS, and GL datasets at different MRs; this can be attributed to the high-dimensional features of these datasets, which lead to a comparatively large search space for PSOFI. With the exception of the aforementioned instances, PSOFI outperforms DEKCF on all other datasets. As for the comparison between MOGI and PSOFI, PSOFI has lower performance on the HV dataset at all MRs, the BTSC and ILPD datasets with 10%, 20%, 30%, 40%, and 50% MRs, the Monk dataset with 30%, 40%, and 50% MRs, the Spec dataset with 40% and 50% MRs, the Aus dataset with 30% and 50% MRs, and the GL dataset with a 50% MR. However, PSOFI has a significant performance advantage over MOGI in most other cases. Moreover, the performance stability of PSOFI and DEKCF is better than that of MOGI, which is easy to infer from the figures.
The F1 score box plots for all the datasets are illustrated in Figure 6. The presented F1 score results once again highlight the superiority of the evolutionary-algorithm-based methods over the statistics-based methods. A detailed comparison of the proposed PSOFI method and DEKCF is presented below. On the Lymp dataset with 5%, 10%, 30%, and 50% MRs, the performance of PSOFI is slightly lower than that of DEKCF. DEKCF performs better than PSOFI on the Spec, DBWS, and Aus datasets at all six MRs, and on the HV and GL datasets with 10%, 20%, 30%, 40%, and 50% MRs. In the other results, PSOFI outperforms DEKCF, which is consistent with the accuracy results mentioned earlier. When comparing MOGI and PSOFI, the results are quite similar to the accuracy results. PSOFI performs worse on the HV dataset at all MRs, the BTSC and ILPD datasets with 10%, 20%, 30%, 40%, and 50% MRs, and the Monk dataset with 20%, 30%, 40%, and 50% MRs. Additionally, PSOFI has lower performance on the Spec dataset with 40% and 50% MRs and on the Aus and GL datasets with a 50% MR. Nevertheless, PSOFI shows superior performance to MOGI in the remaining results.
Now, the performance of PSOIM and PSOFS is analyzed to see how the imputation and feature selection techniques influence PSOFI. In the accuracy and F1 score figures, PSOIM and PSOFS perform better than the statistics-based methods in almost every case. When it comes to the evolutionary-algorithm-based imputation methods, PSOFI outperforms PSOIM and PSOFS in almost all cases, and DEKCF is only slightly inferior to PSOIM or PSOFS in a few cases. PSOFS is inferior to MOGI on the high-dimensional datasets, but it is competitive with MOGI on the other datasets with MRs lower than 30%. MOGI is superior to PSOIM in all cases except the DBWS dataset. PSOFS has weaker performance than PSOIM on the BTSC, DBWS, and GL datasets at all MRs, on the HV dataset with MRs higher than 5%, and on the Park, Lymp, Spec, Monk, and ILPD datasets with MRs higher than 20%. To summarize, PSOIM is effective for imputation, but its performance is relatively low. PSOFS can handle datasets with low MRs to a certain extent but, as the MR increases, its performance declines quickly, and it cannot handle high-dimensional datasets well. It may encounter situations where the classifier fails to operate normally, especially when the dataset is high-dimensional, as missing values remain in the selected features and no outcome is produced; the box plot results do not include these cases. As PSOFI performs significantly better than both PSOIM and PSOFS, it can be inferred that integrating feature selection techniques into imputation methods can greatly help to improve the overall classification performance.
To statistically confirm the superiority of PSOFI, the Mann–Whitney U test is conducted for verification. The significance level is set to 0.05, the mean values of the indicators on each dataset with different MRs are used, and the results are shown in Table 2.
It can be seen that, except for DEKCF on the two indicators, all other comparison algorithms reject the null hypothesis for both the accuracy and F1 score indicators. The findings suggest that PSOFI outperforms the majority of the compared algorithms, with the exception of DEKCF. This discrepancy can mainly be attributed to PSOFI’s lower performance on high-dimensional datasets.
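For reference, such a pairwise comparison can be carried out with SciPy's mannwhitneyu function, as sketched below; the accuracy values are placeholders rather than numbers from Table 2, and a 0.05 significance threshold is assumed in this sketch.

```python
from scipy.stats import mannwhitneyu

# Mean accuracies of PSOFI and a comparison method over several
# dataset/MR combinations (placeholder values, for illustration only).
psofi_acc    = [0.81, 0.77, 0.84, 0.79, 0.88, 0.73]
baseline_acc = [0.74, 0.71, 0.80, 0.72, 0.83, 0.69]

stat, p_value = mannwhitneyu(psofi_acc, baseline_acc, alternative="two-sided")
reject_null = p_value < 0.05   # reject: the two samples differ significantly
```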
Further analyzing the results presented above, we summarize the following points.
The performance gap between PSOFI and DEKCF is usually small, while MOGI’s advantage or disadvantage is typically of a larger magnitude, reflecting the instability of MOGI’s performance.
The indicator values show no significant decline with increasing MRs, and, in some cases, such as BCC and Monk, demonstrate an improvement. This can be attributed to the datasets’ strong discriminative features, where missing values have minimal impact on classification performance.
The MOGI algorithm exhibits improved performance with increasing MRs on each dataset. We speculate that the original values may not be discriminative enough for learning algorithms, and thus MOGI searches for an optimal value for each missing position to ensure a better fit to the algorithm. This explanation supports the fact that MOGI performs better, particularly when other algorithms demonstrate relatively low accuracy.
PSOFI, DEKCF, and PSOFS present comparatively better performance on some datasets, such as the BCC, Monk, and Sta datasets, whereas PSOIM performs poorly. This highlights the essential role that the feature selection technique plays in improving the methods’ performance.
When focusing on the results obtained on the mixed-feature datasets, PSOFI demonstrates overall superiority, which illustrates the efficacy of our strategy for imputing mixed data. PSOFI utilizes a Gaussian distribution to impute missing quantitative feature values and selection probabilities to impute missing categorical feature values, leading to better outcomes by effectively leveraging the randomness of mixed missing values (a minimal sketch of this idea is given after these points).
PSOFI obtains mediocre results on high-dimensional datasets while DEKCF shows stronger performance on these datasets. This indicates that PSOFI may not be as competitive as DEKCF in terms of feature selection, as we have used a relatively simple strategy to implement feature selection.
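The following sketch illustrates the two imputation strategies referred to in the point on mixed-feature datasets: quantitative fill values are sampled from a Gaussian, and categorical fill values are sampled according to selection probabilities over the observed categories. The parameterization is simplified and all names are illustrative; in PSOFI these parameters are the quantities optimized by PSO (see Section 2 for the exact formulation).

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_quantitative(n_missing, mu, sigma):
    """Draw fill values for a quantitative feature from a Gaussian whose
    mean and spread are parameters tuned by the search."""
    return rng.normal(mu, sigma, size=n_missing)

def impute_categorical(n_missing, categories, weights):
    """Draw fill values for a categorical feature according to selection
    probabilities (weights) over its observed categories."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()              # normalize into a distribution
    return rng.choice(categories, size=n_missing, p=probs)
```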
Now, we analyze the running time of the PSOFI algorithm. Obviously, the time cost of the statistics-based methods is smaller than that of the evolutionary-algorithm-based methods; therefore, only PSOFI, DEKCF, and MOGI are compared in detail here. The stacked histograms of the time costs of the three methods on the testing datasets with different MRs are shown in Figure 7.
It can be observed that PSOFI has a higher time overhead in most cases, especially on the Park, Spec, Aus, and high-dimensional datasets. A significance analysis using the Mann–Whitney U test is also conducted on the time costs according to the dataset MRs; the significance level is set to 0.05, the mean time costs on each dataset with different MRs are used, and the results are given in Table 3.
The results show that the time costs of PSOFI are significantly higher than those of DEKCF and MOGI, which indicates that the main defect of PSOFI is its time overhead. This can be attributed to our effective but complicated imputation strategy. Nevertheless, this result is consistent with the “No Free Lunch” theorem, which states that no single learning algorithm can perform well on all kinds of tasks [34]. To conclude, PSOFI performs well when only classification is considered; however, when running time is taken into account, its performance is weaker.
4. Conclusions
In this paper, we introduce PSOFI, a novel and potent classification technique designed for handling missing data with mixed feature types. PSOFI employs distinct imputation models for quantitative and categorical features, incorporates feature selection to enhance its efficacy, and utilizes PSO to simultaneously optimize the parameters of the imputation models and the feature selection process. Furthermore, we propose a legacy learning mechanism to enhance the search capability of PSOFI by engaging additional search agents during iterations.
For a comparative analysis, we employ three statistics-based methods and two evolutionary-algorithm-based methods, in addition to constructing two PSOFI variants, PSOIM and PSOFS, to showcase the effectiveness of our developed strategies. Moreover, a total of twelve datasets of different types are used for the experiments.
The experimental results lead to several conclusions: firstly, PSOFI outperforms the other methods in terms of the accuracy and F1 score measures in most cases. Secondly, PSOFI incurs a reasonable time cost. Thirdly, the results demonstrate the efficacy of PSOFI in leveraging both the randomness of missing values and the information contained in the labels. Lastly, it is feasible and applicable to utilize evolutionary algorithms to integrate the data imputation and feature selection techniques.
Despite its strengths, PSOFI has some limitations. It currently relies on manually determined feature types, does not account for non-Gaussian-distributed quantitative feature values, performs only moderately on high-dimensional datasets, and has a higher time overhead, which restricts its use in scenarios demanding high efficiency. Additionally, as an evolutionary-algorithm-based method, PSOFI is heavily influenced by the optimization objective, yet only a single classification metric is used in this study. Future work will address these limitations by developing an intelligent feature type discrimination mechanism, accommodating various data distributions, designing dedicated feature selection strategies for high-dimensional datasets, integrating parallel computing to reduce the time overhead, expanding PSOFI into a multi-objective framework, and considering alternative metrics such as balanced accuracy. Furthermore, we intend to extend PSOFI to address imbalanced, high-dimensional missing data classification problems.