Article

Efficient and Intelligent Feature Selection via Maximum Conditional Mutual Information for Microarray Data

1 Network Information Management Division, Qingdao Agricultural University, Qingdao 266109, China
2 School of Science and Information Science, Qingdao Agricultural University, Qingdao 266109, China
3 College of Mechanical and Electrical Engineering, Qingdao Agricultural University, Qingdao 266109, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5818; https://doi.org/10.3390/app14135818
Submission received: 15 May 2024 / Revised: 24 June 2024 / Accepted: 2 July 2024 / Published: 3 July 2024
(This article belongs to the Section Applied Biosciences and Bioengineering)

Abstract

The challenge of analyzing microarray datasets is significantly compounded by the curse of dimensionality and the complexity of feature interactions. Addressing this, we propose a novel feature selection algorithm based on maximum conditional mutual information (MCMI) to identify a minimal feature subset that is maximally relevant and non-redundant. This algorithm leverages a greedy search strategy, prioritizing both feature quality and classification performance. Experimental results on high-dimensional microarray datasets demonstrate our algorithm’s superior ability to reduce dimensionality, eliminate redundancy, and enhance classification accuracy. Compared to existing filter feature selection methods, our approach exhibits higher adaptability and intelligence.

1. Introduction

With the advancement of bioinformatic technology in the domain of animal breeding, biological datasets, especially those derived from microarrays, have become vital analytical tools for animal scientists and breeders. These datasets significantly improve the precision of trait selection and breeding outcomes, enhancing decision-making processes in breeding programs [1]. However, the analysis of microarray datasets faces significant challenges due to the curse of dimensionality and the complexity of feature interactions [2]. Microarray datasets are typically characterized by a limited number of samples yet a vast array of gene features, many of which are redundant or irrelevant. This redundancy and irrelevance can obscure vital features, leading to poor performance of machine learning algorithms [3]. Additionally, an excessive number of features can lead to the curse of dimensionality, an issue characterized by an exponential increase in problem complexity as the number of features grows. This complication can hinder pattern detection and may result in overfitting or poor model generalization.
To solve these issues, feature selection (FS) is employed as a critical preprocessing step, aiming to optimize the feature space by removing irrelevant and redundant features [4]. This process not only reduces the dimensionality of datasets but also enhances the accuracy, interpretability, and efficiency of the machine learning models [5,6]. Importantly, FS assists in uncovering the crucial mechanisms that link gene expression with specific traits in livestock, supports the analysis of data patterns to identify novel genetic markers correlated with specific traits, and contributes to minimizing expenses in breeding programs [7]. This enhancement in dataset management and analysis improves the efficacy of machine learning algorithms, leading to more accurate associations between traits, genes, and their expression levels, thereby facilitating more informed decisions in trait selection and breeding strategies [8,9].
Over the past two decades, FS has been extensively studied and has had a significant impact in many fields [10,11]. There are three main types of FS methods: filter methods, wrapper methods, and embedded methods. Filter methods are based on the measurement of data characteristics that are independent of any machine learning (ML) algorithms. This method evaluates the relevance between features and the target class (as well as the dependence between features) by using different measurement criteria, such as information, distance, dependence, and consistency, and then selects features based on the evaluation results. In contrast, wrapper and embedded methods select features using a predetermined ML algorithm. Wrapper methods select features based on the classification accuracy of the ML algorithm, while embedded methods select features based on the contribution to the learning process of the ML algorithm. However, compared to filter methods, wrapper and embedded methods have several significant drawbacks, especially for high-dimensional datasets. First, they tend to inherit the bias of the predetermined ML algorithm, resulting in poor versatility for other ML algorithms. Second, they focus only on the classification accuracy of the selected features, which may lead to overfitting or the elimination of some potentially important features. Third and most importantly, they are more computationally intensive and time-consuming because they involve iterative processes of feature search, ML model building, and testing. Therefore, the focus of this study is on filter methods, which can provide a more efficient and versatile approach to FS without being tied to specific ML algorithms.
Early filter methods evaluate the relevance of each individual feature with the target class and then remove the irrelevant and redundant features according to their rankings of relevance. Such methods are called individual evaluation methods (IEMs) [12,13]. Their advantage is that they are computationally light, but they neglect the inter-feature dependence. Therefore, the selected feature subset inevitably contains some redundant features. To overcome this problem, filter methods based on subset evaluation, also known as subset evaluation methods (SEMs), have been proposed [4]. However, these methods involve two critical issues: search strategy and evaluation criterion, which are typically encountered in wrapper and embedded methods. Finding the optimal subset from a high-dimensional dataset is an NP-hard problem, which cannot be solved optimally with an exhaustive search in a reasonable amount of time. Hence, some alternative search strategies (e.g., heuristic search [14], complete search [15], and random search [16]) have been used for the FS process. Although these search strategies significantly enhance the efficiency of the FS process, the absence of robust and precise evaluation criteria presents a considerable challenge in accurately identifying the optimal feature subset.
Recently, evaluation criteria based on relevance and redundancy analysis have gained increasing attention. Among these criteria, mutual information (MI) stands out as a powerful measure of relevance, capturing both linear and non-linear relationships between features and the target class. MI quantifies the amount of information obtained about one variable through the other, making it an ideal tool for assessing the relevance of features in the context of FS. Presently, typical SEMs include MIM [17], JMI [18], CIFE [19], CMIM [20], JMIM [21], MIFS [22], mRMR [23], and maxMIFS [24]. Despite these methods possessing certain advantages in unveiling complex relationships between features and the target class, as well as among the features themselves, they also confront significant challenges, such as failing to accurately capture the maximum relevance between features and the target class and the minimum redundancy in the selected feature subset, not being able to automatically determine the optimal number of features, and overly depending on meticulous parameter calibration.
In this paper, we propose an FS algorithm based on maximum conditional MI and a greedy algorithm to search for the optimal feature subset in high-dimensional datasets. The algorithm is evaluated by comparing it to existing classical filter methods on three pig gene microarray datasets. The significant contributions of this paper are as follows:
  • Proposal of an accurate evaluation criterion based on MI for the maximum relevant no-redundant (MRNR) feature subset along with a detailed proof process. This criterion proves effective in practical contexts to identify whether a feature subset is an MRNR feature subset.
  • Introduction of a greedy search strategy based on maximum conditional MI to efficiently search for the MRNR feature subset with the minimum size. This strategy can automatically determine the number of features in the subset without the need for any wrapper method.
  • Introduction of two evaluation metrics, relevance and redundancy, to assess the feature quality of FS methods. These metrics provide a comprehensive analysis of the selected feature subset’s quality and its potential impact on classification performance.
For the convenience of the reader, the Abbreviations table lists the abbreviations used in this paper.

2. Methods

MI is a measure of the amount of information shared between two random variables. It evaluates the dependence and relevance between the two variables. For two random variables x and y, MI(x;y) can be defined as follows:
$$MI(x;y) = H(x) + H(y) - H(x,y) = -\int p(x)\log p(x)\,dx - \int p(y)\log p(y)\,dy + \iint p(x,y)\log p(x,y)\,dx\,dy$$
where H(x) and H(y) are the entropies of x and y, respectively; H(x,y) is their joint entropy; p(x) and p(y) are the probability densities of x and y; and p(x,y) is their joint probability density.
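As a concrete point of reference, Equation (1) can be estimated for discrete variables from plug-in entropy estimates. The following Python sketch is purely illustrative (the helper names are ours, and it is not the authors' Matlab implementation):

```python
import numpy as np
from collections import Counter

def entropy(*columns):
    """Empirical joint entropy (in nats) of one or more discrete variables."""
    joint = list(zip(*columns))                    # tuples of co-occurring values
    counts = np.array(list(Counter(joint).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x, y):
    """MI(x; y) = H(x) + H(y) - H(x, y), as in Equation (1)."""
    return entropy(x) + entropy(y) - entropy(x, y)

# Toy example: y is a noisy copy of x, so the estimated MI is clearly positive.
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=500)
y = np.where(rng.random(500) < 0.8, x, rng.integers(0, 3, size=500))
print(mutual_information(x, y))
```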

2.1. Evaluation Criterion

Maximum relevance between the feature subset and the target class is a key criterion for FS. Let D be a dataset that has N features F = {fi, i = 1, 2,…,N} and c be the target class. According to the definition of MI, selecting the feature subset with maximum relevance can be formalized as selecting a feature subset Fs = {fj, j = 1, 2,…,n} (n < N) that satisfies the following equation:
$$\max\, MI(F_s; c)$$
To determine the feature subset that has the maximum mutual information (MI) with the target class, we introduce the following theorem:
Theorem 1.
For the original feature set F, the joint mutual information between any feature subset Fs and the target class c is less than or equal to the joint mutual information between the entire feature set F and the target class c. The equation form of Theorem 1 is as follows:
$$\forall F_s \subseteq F, \quad MI(c; F_s) \le MI(c; F)$$
Proof of Theorem 1.
Let $F_s = \{f_j, j = 1, 2, \ldots, n\}$ $(n < N)$ be a feature subset of the original feature set F. Since $F_s \subseteq F$, the original feature set can be written as
$$F = \{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_{N-n}, f_1, f_2, \ldots, f_n\},$$
where each feature $\hat{f}_i$ belongs to $F - F_s$ and each $f_j$ belongs to $F_s$. According to the definition of MI and the chain rule, we have
$$MI(c;F) = MI(c; \hat{f}_1, \ldots, \hat{f}_{N-n}, f_1, \ldots, f_n) = \sum_{i=1}^{N-n} MI(c; \hat{f}_i \mid \hat{f}_{i-1}, \ldots, \hat{f}_1, f_1, \ldots, f_n) + MI(c; f_1, \ldots, f_n).$$
Since every conditional MI term is non-negative, $\sum_{i=1}^{N-n} MI(c; \hat{f}_i \mid \hat{f}_{i-1}, \ldots, \hat{f}_1, f_1, \ldots, f_n) \ge 0$, and therefore $MI(c; F_s) = MI(c; f_1, \ldots, f_n) \le MI(c; F)$. □
According to Theorem 1, the feature subset $F_s$ has the maximum MI with the target class if and only if $F_s$ satisfies the following equation:
$$MI(c; F_s) = MI(c; F)$$
Although Equation (4) can ensure that the selected feature subset has maximum relevance, it cannot guarantee that the selected feature subset has no redundant features. In fact, a redundant feature is one that is typically deemed irrelevant to the target class given the presence of other features in the dataset. Therefore, the definition of a redundant feature for a feature set is as follows:
Definition 1.
(Redundant feature): For a feature subset $F_s$, a feature $f_j$ $(f_j \in F_s)$ is a redundant feature if and only if
$$MI(c; f_j \mid F_s - \{f_j\}) = 0$$
or
$$MI(c; F_s - \{f_j\}) = MI(c; F_s).$$
By combining Theorem 1 and Definition 1, the FS for the MRNR feature subset can be defined as selecting a feature subset Fs that satisfies the following equations:
$$MI(c; F_s) = MI(c; F), \qquad \forall f_j \in F_s,\; MI(c; F_s - \{f_j\}) < MI(c; F_s).$$
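To make Definition 1 and Equation (7) operational, the sketch below checks whether a candidate subset of discretized features is an MRNR subset; it uses the identity $MI(c; f \mid Z) = H(c,Z) + H(f,Z) - H(c,f,Z) - H(Z)$. The helper names are ours and the check is purely illustrative, not part of the released implementation.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Empirical joint entropy (nats) of one or more discrete 1-D arrays."""
    counts = np.array(list(Counter(zip(*cols)).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def cond_mi(c, f, cond):
    """MI(c; f | cond) = H(c,cond) + H(f,cond) - H(c,f,cond) - H(cond)."""
    if not cond:
        return entropy(c) + entropy(f) - entropy(c, f)
    return entropy(c, *cond) + entropy(f, *cond) - entropy(c, f, *cond) - entropy(*cond)

def is_mrnr(c, subset, full_set, tol=1e-10):
    """Check Equation (7): maximum relevance and absence of redundant features."""
    mi_sub = entropy(c) + entropy(*subset) - entropy(c, *subset)        # MI(c; F_s)
    mi_full = entropy(c) + entropy(*full_set) - entropy(c, *full_set)   # MI(c; F)
    max_relevance = abs(mi_full - mi_sub) < tol
    non_redundant = all(
        cond_mi(c, f, subset[:i] + subset[i + 1:]) > tol
        for i, f in enumerate(subset)
    )
    return max_relevance and non_redundant
```

Here `subset` and `full_set` are lists of discrete feature arrays; on real microarray data the joint entropies are estimated from limited samples, so this check is illustrative rather than exact.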

2.2. Search Strategy

Selecting the optimal MRNR feature subset quickly and accurately poses a challenging task due to the large number of candidate MRNR feature subsets in high-dimensional datasets. In this paper, from the perspective of dimensionality reduction and classification performance, we consider that the MRNR feature subset with the minimum size is the optimal feature subset.
Let the feature subset $F_{min} = \{f_k, k = 1, 2, \ldots, m\}$ $(m < N)$ be the MRNR feature subset with the minimum size. When m = 1, the solution for $f_1$ is the feature that maximizes $MI(c; f_i)$ $(1 \le i \le N)$. When m > 1, $MI(c; F_{min})$ can be expressed in the form of Equation (8):
$$MI(c; F_{min}) = MI(c; f_1, f_2, \ldots, f_m) = MI(c; f_1, \ldots, f_{m-1}) + MI(c; f_m \mid f_{m-1}, \ldots, f_1) = MI(c; f_1, \ldots, f_{m-2}) + \sum_{k=m-1}^{m} MI(c; f_k \mid f_{k-1}, \ldots, f_1) = \cdots = MI(c; f_1) + \sum_{k=2}^{m} MI(c; f_k \mid f_{k-1}, \ldots, f_1) = \sum_{k=1}^{m} MI(c; f_k \mid f_{k-1}, \ldots, f_1)$$
Thus, the $k$-th feature $f_k$ can be determined as the one that maximizes the conditional MI $MI(c; f_k \mid f_{k-1}, \ldots, f_1)$, where $f_{k-1}, \ldots, f_1$ are the previously selected features.
Based on the above analysis, a greedy search strategy can be used to iteratively select the feature with the maximum conditional mutual information (MCMI) until the MI between the feature subset and the target class is equal to that of the original feature set and the target class. The evaluation criterion of MCMI can be formulated as follows:
$$\max_{f_i \in F - F_s} MI(c; f_i \mid F_s)$$
where F is the original feature set, Fs is the previously selected feature subset, and feature fi belongs to F − Fs.
Furthermore, it should be noted that in specific situations, the later selected features may cause one or more previously selected features to become redundant. Therefore, it is necessary to remove any potential redundant features from the selected feature subset after the FS process.

2.3. Feature Selection Algorithm

This section describes the design of the FS algorithm based on MCMI. According to the analysis in Section 2.2, the algorithm consists of two processes: a greedy search and redundancy elimination, the framework of which is shown in Figure 1. In the first phase, a greedy search strategy is used to search for the candidate feature subset. In the second phase, a redundancy elimination strategy is utilized to eliminate redundant features from the candidate feature subset.
Algorithm 1 presents the pseudocode for the FS algorithm based on MCMI. Below, a simplified description is provided to illustrate the process, along with explanations for each step’s purpose:
1. Initialization (line 1): Calculate the mutual information maxR between the original feature set F and the target class c, serving as a benchmark for the best possible mutual information. Initialize two sets: $F_{list}$ to hold candidate features and $F_{opt}$ to store the optimal features selected through the algorithm.
2. Candidate Selection (lines 2–5): Calculate the MI between each feature $f_i$ in F and the target class c, add features with MI > 0 to $F_{list}$, and sort $F_{list}$ by descending MI values. This ensures only informative features are considered.
3. Greedy Search (lines 6–12): Iteratively select the feature with the highest conditional MI $MI(c; f_i \mid F_{opt})$ from $F_{list}$, add it to $F_{opt}$, and remove features from $F_{list}$ whose conditional MI equals 0. Repeat until $MI(c; F_{opt})$ equals maxR. This ensures the selected feature set is the minimal subset with maximal relevance to the target class c.
4. Redundancy Elimination (lines 13–15): Evaluate each feature in $F_{opt}$ for redundancy and remove features that contribute no MI when conditioned on the remaining features. This ensures the final feature set is maximally relevant and non-redundant.
5. Return the Optimal Feature Set (line 16): Return $F_{opt}$ as the optimal feature set.
As shown in Algorithm 1, the greedy search strategy involves a major part of computational effort, and its time complexity has a non-linear relationship with the number of original features (dimensionality N). In the best case, the time complexity of the greedy search is O(N) when only one feature is selected as a candidate feature. In the worst case, it has a time complexity of O(N2) when all features are selected as candidate features. However, in general, the time complexity of the greedy search strategy is much lower than the sequential forward selection (SFS) strategy because, in each iteration, irrelevant or conditionally irrelevant features are removed and not passed to the next iteration. This characteristic significantly reduces the computational burden compared to SFS, especially for high-dimensional datasets. In addition, the time complexity of the greedy search strategy is also influenced by the interdependence of the features in the original feature set. The more features are deleted in earlier iterations, the faster the greedy search strategy becomes. On the other hand, the redundancy elimination strategy has a linear time complexity in terms of the number of candidate features selected by the greedy search strategy.
Algorithm 1. FS algorithm based on MCMI.
Input: dataset D with N features F = {fi, i = 1, 2, …, N} and the target class c
Output: the optimal feature subset Fopt
1. Initialize maxR = $MI(c; F)$, $F_{list}$ = {}, $F_{opt}$ = {}
2. For i = 1 to N do
3.  If $MI(c; f_i) > 0$ then insert $f_i$ into $F_{list}$
4. End for
5. Sort $F_{list}$ by descending $MI(c; f_i)$
6. While $F_{list}$ is not empty do
7.   $f_{temp}$ = getFirstElement($F_{list}$)
8.  Insert $f_{temp}$ into $F_{opt}$, remove $f_{temp}$ from $F_{list}$
9.  If $MI(c; F_{opt})$ == maxR then break
10.  Remove features from $F_{list}$ with $MI(c; f_i \mid F_{opt})$ == 0
11.  Sort $F_{list}$ by descending $MI(c; f_i \mid F_{opt})$
12. End while
13. For each feature $f_i$ in $F_{opt}$ (in reverse order) do
14.  If $MI(c; f_i \mid F_{opt} - \{f_i\})$ == 0 then remove $f_i$ from $F_{opt}$
15. End for
16. Return $F_{opt}$
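For readers who prefer an executable form, the following Python sketch mirrors Algorithm 1 under simplifying assumptions: features are already discretized, and MI is estimated with plug-in entropy estimates. It is an illustrative re-implementation with our own helper names, not the released Matlab code referenced in the Data Availability Statement.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Empirical joint entropy (nats) of one or more discrete 1-D arrays."""
    counts = np.array(list(Counter(zip(*cols)).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mi(c, feats):
    """MI(c; feats), where feats is a (possibly empty) list of feature arrays."""
    if not feats:
        return 0.0
    return entropy(c) + entropy(*feats) - entropy(c, *feats)

def mcmi_select(X, c, tol=1e-10):
    """Greedy MCMI search following Algorithm 1 (X: samples x features, c: class)."""
    feats = [X[:, j] for j in range(X.shape[1])]
    max_r = mi(c, feats)                                  # line 1: MI(c; F), the benchmark
    f_list = [j for j in range(X.shape[1]) if mi(c, [feats[j]]) > tol]   # lines 2-5
    f_opt = []
    while f_list:                                         # lines 6-12: greedy search
        base = mi(c, [feats[k] for k in f_opt])
        # conditional MI of every remaining candidate given the current subset
        scores = {j: mi(c, [feats[k] for k in f_opt] + [feats[j]]) - base for j in f_list}
        f_list = [j for j in f_list if scores[j] > tol]   # drop (conditionally) irrelevant features
        if not f_list:
            break
        best = max(f_list, key=scores.get)                # feature with maximum conditional MI
        f_opt.append(best)
        f_list.remove(best)
        if abs(mi(c, [feats[k] for k in f_opt]) - max_r) < tol:
            break                                         # line 9: maximum relevance reached
    for j in list(reversed(f_opt)):                       # lines 13-15: redundancy elimination
        rest = [feats[k] for k in f_opt if k != j]
        if rest and mi(c, rest + [feats[j]]) - mi(c, rest) < tol:
            f_opt.remove(j)
    return f_opt
```

On real microarray data, continuous expression values would first have to be discretized (e.g., binned), since the plug-in estimator above only handles discrete values.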

3. Experiment Results

We empirically evaluated our FS algorithm by comparing it with representative FS algorithms on three high-dimensional gene microarray datasets. This section is organized as follows: After a brief introduction of the datasets, we describe the experimental setup. We then compare our FS algorithm with the representative filter-based FS algorithms in terms of feature quality, classification accuracy, and computational complexity.

3.1. Datasets

In this study, we evaluated our proposed algorithm using three high-dimensional microarray datasets from the Shandong black pig, a breed popular in Laiwu, Shandong, China, known for its tender and fine meat. These datasets were obtained from the Life Science Research Institute of Qingdao Agricultural University and were used to analyze intricate biological data and identify relevant genetic information. Remarkably, all datasets contained tens of thousands of features, surpassing the number of samples by a substantial margin, as shown in Table 1.
The BPEx dataset consists of continuous gene expression data for Shandong black pigs. It comprises 181 samples, with each having 24,368 expression features. The target class of the dataset is pork quality, categorized into four levels: excellent (22 samples), good (49 samples), average (76 samples), and inferior (34 samples). Analyzing this dataset can help identify genes associated with high-quality pork and provide valuable insights for improving pork quality through breeding and production strategies.
The BPSnp dataset comprises discrete gene single nucleotide polymorphisms (SNPs) for Shandong black pigs. It includes 236 samples, each having 27,268 SNP features. This dataset has three target classes:
  • Body size and weight are categorized into five levels: very large (49 samples), large (61 samples), medium (59 samples), short (46 samples), and very short (21 samples).
  • The ratio of muscle to fat is categorized into five levels: AAA (46 samples), AA (53 samples), A (61 samples), B (47 samples), and C (29 samples).
  • The growth rate is categorized into four levels: rapid growth (57 samples), fast growth (82 samples), moderate growth (62 samples), and slow growth (35 samples).
Analyzing the BPSnp dataset can provide insights into the genetic factors influencing body size, weight, muscle-to-fat ratio, and growth rate in Shandong black pigs, potentially contributing to improved breeding and production strategies for these traits.
The BPPRRS dataset consists of gene expression data specifically focused on porcine reproductive and respiratory syndrome (PRRS), a highly prevalent viral disease in pigs worldwide that causes significant losses in the swine industry. It comprises 151 samples, each with 29,768 expression features, divided into two categories: infected (90 samples) and normal (61 samples). With its detailed gene expression profiles, the BPPRRS dataset serves as a critical resource for researchers aiming to understand the molecular mechanisms underlying PRRS, identify potential genetic markers for resistance or susceptibility to the disease, and develop effective strategies for breeding and managing pigs to minimize the impact of PRRS on pork production.
Due to the excessively large number of features in the datasets, it is imperative that FS algorithms are highly efficient to handle the computational complexity and resource requirements effectively.

3.2. Experimental Setting

The focus of this study is to analyze the effectiveness of the proposed algorithm primarily from a data science perspective. To better evaluate the algorithm, we consider the following three issues in the experimental setup:
1. Evaluation metrics
In this paper, common metrics such as overall accuracy (OA), feature number (FN), and execution time are used to evaluate feature selection (FS) algorithms. Additionally, we propose two novel metrics: relevance (Rel) and number of redundant features (NRF). The Rel metric represents the ratio of the relevance between the selected feature subset F s and the target class c to that between the original feature set F and the target class c . It is calculated using the following equation:
$$Rel = MI(c; F_s) / MI(c; F)$$
The Rel value ranges from 0 to 1. The NRF metric represents the number of redundant features contained in the selected feature subset. Together, these two metrics serve to evaluate the quality of the selected feature subset and support a more in-depth analysis of the relationship between feature quality and classification accuracy (a computational sketch of both metrics is given after this list).
In practical terms, OA and FN are crucial for gauging the performance and complexity of models in classification tasks, while Rel and NRF are essential for elucidating and interpreting the significance and effectiveness of the selected features.
2. Comparison with existing methods
Since the proposed FS algorithm is typically a filter-based model, we introduce representative filter-based FS methods for comparison. We consider two types of methods: IEMs and SEMs. For IEMs, we choose InfoGain [25], GainRatio [13], SU [26], ChiSquare [12], and Fisher [27] as the representative methods. For SEMs, we choose CIFE [19], CMIM [20], JMIM [21], mRMR [23], and maxMIFS [24] as the representative methods. These selected comparison methods cover a wide range of filter-based techniques to assess the performance of the proposed algorithm effectively.
3. ML algorithms for classification
To demonstrate that our FS algorithm is not biased towards specific ML algorithms, we used four widely used ML algorithms, Naive Bayes (NB) [28], SVM [29], C4.5 [30], and Random Forest (RF) [31], to evaluate its classification performance. These four ML algorithms represent different approaches to supervised learning. Their parameter settings are shown in Table 2.
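As noted under the evaluation metrics above, the following sketch illustrates how Rel (Equation (10)) and NRF (Definition 1) can be computed for a selected subset of discretized features; the helper names are ours, and the estimates are plug-in approximations rather than the paper's code.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Empirical joint entropy (nats) of discrete 1-D arrays."""
    counts = np.array(list(Counter(zip(*cols)).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mi(c, feats):
    """MI(c; feats) for a (possibly empty) list of feature arrays."""
    return 0.0 if not feats else entropy(c) + entropy(*feats) - entropy(c, *feats)

def rel(c, selected, full_set):
    """Rel = MI(c; F_s) / MI(c; F), Equation (10); lies between 0 and 1."""
    return mi(c, selected) / mi(c, full_set)

def nrf(c, selected, tol=1e-10):
    """NRF: number of features in the subset that are redundant per Definition 1."""
    return sum(
        1 for i in range(len(selected))
        if mi(c, selected) - mi(c, selected[:i] + selected[i + 1:]) < tol
    )
```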
A 10-fold cross-validation was applied for performance evaluation. In this process, the dataset was split into ten parts, with 90% used for training and 10% for testing in each iteration. Each subset was used once as the test set, mitigating data variability and providing a reliable estimate of the model’s performance. All algorithms were executed on a Windows computer with an Intel Core i7 3.6 GHz 8-thread CPU and 8 GB of RAM, using Matlab R2018a implementations.
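For orientation, the evaluation protocol can be approximated in Python with scikit-learn as sketched below. This is not the paper's Matlab pipeline: scikit-learn's CART decision tree stands in for C4.5, the data shown are random placeholders, and only the hyperparameters from Table 2 that have direct scikit-learn equivalents are carried over.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical placeholders: X holds the selected features, y the target class.
rng = np.random.default_rng(0)
X = rng.normal(size=(181, 8))          # e.g., 181 samples, 8 selected features (BPEx-like)
y = rng.integers(0, 4, size=181)       # 4 pork-quality levels

classifiers = {
    "NB": GaussianNB(),
    "SVM": SVC(C=1, tol=0.001, kernel="rbf"),
    # CART tree as an approximation of C4.5, which scikit-learn does not implement
    "C4.5 (approx.)": DecisionTreeClassifier(min_samples_leaf=5),
    "RF": RandomForestClassifier(n_estimators=100, max_depth=None,
                                 min_samples_split=2, random_state=42),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")  # 10-fold CV
    print(f"{name}: OA = {scores.mean():.3f} ± {scores.std():.3f}")
```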

3.3. Analysis of Results

This section aims to demonstrate the effectiveness of the proposed algorithm by comparing it with existing filter methods. Firstly, we analyze and evaluate the results of the proposed algorithm for the five target classes on the three datasets. Table 3 presents the average results obtained using 10-fold cross-validation for the proposed algorithm. It should be noted that datasets BPSnp_1, BPSnp_2, and BPSnp_3 refer to the BPSnp dataset, with body size and weight, muscle–fat ratio, and growth rate as target classes, respectively. The results in Table 3 reflect the performance of the classification models with the proposed FS algorithm applied.
As shown in Table 3, the feature subsets selected by the proposed algorithm exhibit excellent data quality, as they demonstrate maximum relevance to the target class and lack any redundant features. This implies that all of the feature subsets belong to MRNR feature subsets. Notably, the algorithm’s effectiveness is highlighted by its ability to select a small number of features while still achieving relatively high classification accuracy across all four ML algorithms. This success in accurately classifying the data further attests to the quality of the selected features.
Further analysis of classification accuracy reveals that the accuracy of RF is higher than that of other ML algorithms, primarily because RF is a more complex ensemble algorithm that combines multiple classifiers. However, it is noteworthy that NB, SVM, and C4.5 also achieve high classification accuracy, close to that of RF. This observation highlights that the features selected by the proposed FS algorithm possess very high quality and effectively mitigate the impact of the classifier choice on accuracy.
Additionally, the algorithm performs well in terms of feature stability and running time. In conclusion, these evaluations affirm the effectiveness of the proposed algorithm in identifying high-quality feature subsets for data classification. In the subsequent subsections, we compare the results of the proposed algorithm with those of existing filter methods.

3.3.1. Comparison with IEMs

IEMs have been proven to be computationally light and have shown good performance on certain datasets [10,11]. However, IEMs only rank features based on specific criteria, and the selection of feature subsets ultimately depends on the thresholds preset by researchers. In contrast, the proposed algorithm does not require preset parameters as it automatically determines the optimal number of features. To compare the two types of FS methods, we present the average results for selecting between 5 and 40 features using IEMs. In this study, InfoGain, GainRatio, SU, ChiSquare, and Fisher are used as representative IEMs for comparison. Due to space limitations, we will only present the results of these IEMs on the dataset BPEx. However, similar results were obtained on other datasets.
Figure 2 shows the average classification accuracy of the IEMs using four ML algorithms on the BPEx dataset. As seen in Figure 2, for the four ML algorithms, the classification accuracy of all the IEMs rapidly increases in the range of feature numbers between 5 and 20 and only exhibits slight changes when the feature number is greater than 20. Among the IEMs, GainRatio consistently achieves the highest classification accuracy in most cases. However, SU consistently yields significantly lower classification accuracy compared to other IEMs.
To compare the IEMs with the proposed algorithm, we choose the best classification accuracy and its corresponding feature number for different IEMs and ML algorithms as the benchmark results. Table 4 displays the comparison results between the IEMs and the proposed algorithm. As shown in Table 4, the proposed algorithm outperforms the IEMs in terms of classification accuracy in all cases. Moreover, the number of features selected by MCMI is significantly smaller than that selected by the IEMs, further emphasizing the efficiency and effectiveness of the proposed algorithm.
To further analyze the quality of the features selected by the IEMs, Figure 3 shows the Rel and NRF of the IEMs on the BPEx dataset. From Figure 3a, it is apparent that for all IEMs except SU, the Rel increases rapidly within the feature number range of 5 to 15 and reaches 1 when the feature number is greater than or equal to 15. By comparing Figure 2 and Figure 3a, it can be inferred that the Rel of the selected features plays a key role in classification accuracy and is positively correlated with it.
Furthermore, in Figure 3b, for all IEMs except SU, the NRF increases rapidly when the feature number exceeds 15. By comparing Figure 2 and Figure 3b, it can be inferred that redundant features can have both positive and negative effects on classification accuracy. These features introduce disturbances into ML algorithms, and the impact of such disturbances is often random and irregular. For certain ML algorithms, some redundant features may enhance classification accuracy by providing additional information, while others may lead to overfitting and poor performance, especially when they are highly correlated with other features in the dataset. Since the IEMs cannot recognize the correlation between features, they are unable to identify the redundant features in the selected feature subset.
As a result, the proposed algorithm exhibits significant superiority over the IEMs in terms of both dimension reduction and classification accuracy, as demonstrated by the mean values presented in the last two rows of Table 4.
Table 5 presents a comparison of execution times between the IEMs and the proposed algorithm. As observed, the execution times of the IEMs are significantly less than those of the proposed algorithm. This is because the IEMs compute only the relevance between each feature and the target class, without considering the interdependence of features. In contrast, the proposed algorithm takes into account the inter-feature dependence, making it more time-consuming. However, it is important to note that the proposed algorithm does not require any preset parameters, making it more convenient and intelligent compared to the IEMs.

3.3.2. Comparison with SEMs

The advantage of SEMs over IEMs is their ability to consider the interdependence of features. In this study, we compare CIFE, CMIM, JMIM, mRMR, and maxMIFS as representative SEMs with the proposed algorithm. Similar to the comparison with IEMs, we present the results of selecting between 5 and 40 features using these SEMs. Figure 4 shows the average classification accuracy of the SEMs using four ML algorithms on the BPEx dataset. As seen in Figure 4, for the four ML algorithms, the classification accuracy of all the SEMs shows a significant upward trend between 5 and 10 features, then becomes stable with minor fluctuations. Furthermore, all the SEMs exhibit comparable levels of classification accuracy, and it should be noted that the CIFE and CMIM methods slightly underperform compared to the other three methods.
As with the comparison to IEMs, we chose the best classification accuracy and corresponding feature number of the SEMs using different ML algorithms, which then served as the benchmark results to be compared with those of the proposed algorithm. Table 6 presents the comparison results of the SEMs and the proposed algorithm. In the majority of cases, the proposed algorithm achieves the highest or near-highest classification accuracy. Additionally, the number of features obtained by the proposed algorithm is significantly smaller than that obtained by the SEMs in all instances. By further comparing the mean of classification accuracy and feature numbers shown in the last two rows of Table 6, it is evident that the proposed algorithm outperforms the SEMs in terms of classification accuracy and feature quality.
Figure 5 shows the feature quality of the SEMs on the BPEx dataset. From Figure 5, it is evident that the classification accuracy of the SEMs improves significantly as the value of Rel increases. However, it is important to note that after Rel reaches its peak value, the classification accuracy of the SEMs does not change notably and may fluctuate slightly as redundant features are added. This observation highlights that the classification performance and feature quality of the SEMs are highly dependent on the chosen feature subset size.
Table 7 shows the comparison of execution times between the SEMs and the proposed algorithm. Since the execution time of the SEMs varies with their preset feature number, we used the average number of features shown in Table 6 as the preset feature number for the experiment. From Table 7, it can be seen that the execution time of the proposed algorithm is less than that of the SEMs. In practical applications, the execution time of the SEMs far exceeds that of the proposed method, as they require iterative searches to find the optimal number of features.

3.4. Verification of Selected Genes

To verify the effectiveness of the proposed FS algorithm, we conducted a literature review for each selected gene to confirm whether they are validated genes affecting pig traits. Table 8 summarizes the verification results.
For the BPEx dataset, our algorithm selected eight genes, six of which are documented in the literature as being related to pork quality, resulting in an effectiveness rate of 75%. For the BPSnp_1 dataset, nine genes were selected, with eight supported by research on body size and weight in pigs, yielding an effectiveness rate of 89%. For the BPSnp_2 dataset, nine genes were selected, with seven confirmed by the literature as influencing muscle development and fat distribution, resulting in an effectiveness rate of 77.78%. For the BPSnp_3 dataset, eight genes were chosen, with five affecting growth rate, giving an effectiveness rate of 62.5%. For the BPPRRS dataset, six genes were selected, with four documented for PRRS resistance, resulting in an effectiveness rate of 66.67%. Overall, these results demonstrate the robustness and accuracy of our feature selection method across various traits in pigs, confirming its utility in genetic research and breeding programs.

4. Discussion

FS for high-dimensional microarray datasets has always been a challenging task because of the complicated interdependence between features. In high-dimensional microarray datasets, the most critical issues in FS are the effectiveness of the evaluation criterion and the computational complexity of the search strategy. The former ensures that the selected feature subset has high quality, such as high relevance, low redundancy, small size, and good classification performance, while the latter ensures that the feature search can be completed in a short time.
Our experimental results on three high-dimensional microarray datasets demonstrate that our algorithm successfully selects the minimum MRNR feature subset, as expected, without using any wrapper methods. Compared to existing IEMs and SEMs, our algorithm achieves better feature quality (as shown in Table 3, Table 4 and Table 6). In contrast, the IEMs perform poorly in FS for high-dimensional datasets because they neglect the interdependency between features and therefore fail to identify and eliminate redundant features in the selected subsets (as shown in Figure 3). Additionally, the IEMs struggle to determine the number of features to select, forcing reliance on personal experience or wrapper methods, which increases instability and computational complexity. While SEMs can consider feature interdependence, existing SEMs lack effective evaluation criteria to accurately determine maximum relevance with the target class and to identify redundant features in the selected subset. Although the SEMs outperform the IEMs in classification performance and redundancy reduction (as shown in Figure 2, Figure 3, Figure 4 and Figure 5, and Table 4 and Table 6), they still require preset parameters to determine the number of features to select. Determining the optimal feature number is critical for the SEMs, as too few features lead to insufficient relevance, while too many introduce numerous redundant features. Unfortunately, accurately determining the optimal number for high-dimensional datasets is difficult without resorting to more complex wrapper methods.
In addition, we analyzed the relationship between feature quality and the classification accuracy of feature subsets using Rel and NRF. We find that the Rel of a feature subset is positively correlated with its classification accuracy, while a small number of redundant features within the feature subset has an uncertain (both positive and negative) impact on its classification accuracy. However, including too many redundant features in a feature subset has a significant negative impact on some machine learning classifiers, such as SVM. This conclusion indirectly highlights the advantages of the minimum MRNR feature subset in terms of feature quality and classification accuracy.
In terms of execution time, the IEMs have obvious advantages because they ignore the relevance between features (as shown in Table 5). In contrast, the SEMs are more time-consuming. Our algorithm, however, achieves relatively short execution times compared to SEMs, though it is more time-consuming than IEMs (as shown in Table 5 and Table 7). The main time consumption of our algorithm is the calculation of the joint probability of multiple variables.
In summary, it can be concluded that the IEMs are not suitable for high-dimensional datasets because they ignore feature interdependence, leading to suboptimal feature subsets. Although SEMs consider interdependence, existing SEMs lack effective evaluation criteria, resulting in the selection of redundant features that negatively impact dimensionality reduction and classification performance. In contrast, the proposed algorithm has demonstrated strong performance in terms of feature quality, classification accuracy, and execution time, establishing its overall superiority over existing algorithms.
Finally, through the literature review verification for each selected gene in various datasets, the results demonstrate that our algorithm reliably identifies key genetic factors. This validation underscores the utility of our FS algorithm in genetic research and breeding programs.

5. Conclusions

To address the challenges of FS in high-dimensional microarray datasets, in this paper, we have introduced the concept of the MRNR feature subset and proposed a novel FS algorithm based on MCMI and a greedy algorithm. Compared to the existing filter FS algorithms, the proposed algorithm demonstrates better performance in terms of feature quality and classification accuracy. In addition, because there is no need to preset the number of selected features, this algorithm avoids dependence on a wrapper method.
In terms of running time, the proposed algorithm outperforms the traditional SFS algorithm because it eliminates features that are conditionally irrelevant to the target class in each iteration. Although it is less efficient than the IEMs, it achieves better time complexity compared to the SEMs.
In conclusion, our work provides a powerful and efficient approach for feature selection (FS) in high-dimensional microarray datasets, demonstrating its intelligence and effectiveness on the specific datasets discussed in this paper. Additionally, the results of the literature review verification highlight the utility of our FS algorithm in genetic research and breeding programs. However, further validation on a broader range of datasets is necessary to generalize its effectiveness. Future work will involve extending the application of our algorithm to more diverse datasets to confirm its robustness and general applicability.

Author Contributions

Conceptualization, J.Z. and H.S.; Methodology, J.Z. and H.S.; Writing—original draft, J.Z., H.Y. and J.J.; Writing—review and editing, H.Y., J.J. and S.L.; Supervision, S.L. and H.S.; Funding acquisition, S.L. and H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Natural Science Foundation of Shandong Province (ZR2022MD025), the Action Plan Project for Rural Revitalization, Scientific and Technological Innovation of Shandong Province (2022TZXD0012-2, 2022TZXD007-5), the Key R&D Program of Shandong Province (Soft Science Project) (2022RKY06004), the Agricultural Improved Seed Project of Shandong Province (2020LZGC010, 2022LZGC021), the Advanced Talents Foundation of Qingdao Agricultural University (6631120066), and the Crosswise Research Tasks of Qingdao Agricultural University (6602422206, 6602423101).

Data Availability Statement

The program and some test data are available at https://github.com/HongtaoShi/MCMI (accessed on 23 June 2024). The full datasets are available upon request due to copyright restrictions by contacting [email protected].

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

FS: Feature Selection
ML: Machine Learning
MI: Mutual Information
IEMs: Individual Evaluation Methods
SEMs: Subset Evaluation Methods
MIM: Mutual Information Maximization
JMI: Joint Mutual Information
CIFE: Conditional Infomax Feature Extraction
CMIM: Conditional Mutual Information Maximization
JMIM: Joint Mutual Information Maximization
MIFS: Mutual Information Feature Selection
mRMR: Minimum Redundancy Maximum Relevance
maxMIFS: Maximum Mutual Information Feature Selection
MRNR: Maximum Relevant No-Redundant
SU: Symmetrical Uncertainty
MCMI: Maximum Conditional Mutual Information
SFS: Sequential Forward Selection
SNPs: Single Nucleotide Polymorphisms
PRRS: Porcine Reproductive and Respiratory Syndrome
OA: Overall Accuracy
FN: Feature Number
Rel: Relevance
NRF: Number of Redundant Features
NB: Naive Bayes
SVM: Support Vector Machine
RF: Random Forest

References

  1. Kyselová, J.; Tichý, L.; Jochová, K. The role of molecular genetics in animal breeding: A minireview. Czech J. Anim. Sci. 2021, 66, 107–111. [Google Scholar] [CrossRef]
  2. Alhenawi, E.A.; Al-Sayyed, R.; Hudaib, A.; Mirjalili, S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput. Biol. Med. 2022, 140, 105051. [Google Scholar] [CrossRef] [PubMed]
  3. Bellman, R. Adaptive Control Processes: A Guided Tour; Princeton University Press: Princeton, NJ, USA, 1961. [Google Scholar]
  4. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  5. Guyon, I.; Weston, J.; Barhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 2002, 46, 389–422. [Google Scholar] [CrossRef]
  6. Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 2005, 3, 185–205. [Google Scholar] [CrossRef]
  7. Chuang, L.-Y.; Chang, H.-W.; Tu, C.-J.; Yang, C.-H. Improved binary PSO for feature selection using gene expression data. Comput. Biol. Chem. 2008, 32, 29–38. [Google Scholar] [CrossRef] [PubMed]
  8. Lazar, C.; Taminau, J.; Meganck, S.; Steenhoff, D.; Coletta, A.; Molter, C.; de Schaetzen, V.; Duque, R.; Bersini, H.; Nowe, A. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2012, 9, 1106–1119. [Google Scholar] [CrossRef] [PubMed]
  9. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
  10. Agrawal, P.; Abutarboush, H.F.; Ganesh, T.; Mohamed, A.W. Metaheuristic algorithms on feature selection: A survey of one decade of research (2009–2019). IEEE Access 2021, 9, 26766–26791. [Google Scholar] [CrossRef]
  11. Nguyen, B.H.; Xue, B.; Zhang, M. A survey on swarm intelligence approaches to feature selection in data mining. Swarm Evol. Comput. 2020, 54, 100663. [Google Scholar] [CrossRef]
  12. Su, C.; Hsu, J. An extended chi2 algorithm for discretization of real value attributes. IEEE Trans. Knowl. Data Eng. 2005, 17, 437–441. [Google Scholar]
  13. Han, J.; Kamber, M. Data Mining: Concepts and Techniques; Morgan Kaufmann: Burlington, MA, USA, 2006. [Google Scholar]
  14. Blum, L.; Langley, P. Selection of relevant features and examples in machine learning. Artif. Intell. 1997, 97, 245–271. [Google Scholar] [CrossRef]
  15. Dash, M.; Liu, H. Consistency-based search in feature selection. Artif. Intell. 2003, 151, 155–176. [Google Scholar] [CrossRef]
  16. Li, T.; Zhan, Z.H.; Xu, J.C.; Yang, Q.; Ma, Y.Y. A binary individual search strategy-based bi-objective evolutionary algorithm for high-dimensional feature selection. Inf. Sci. 2022, 610, 651–673. [Google Scholar] [CrossRef]
  17. Lewis, D.D. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics, New York, NY, USA, 23–26 February 1992; pp. 212–217. [Google Scholar]
  18. Yang, H.H.; Moody, J. Data visualization and feature selection: New algorithms for nonGaussian data. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1999; pp. 687–693. [Google Scholar]
  19. Lin, D.; Tang, X. Conditional infomax learning: An integrated framework for feature extraction and fusion. In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Leonardis, A., Bischof, H., Pinz, A., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; pp. 68–82. [Google Scholar]
  20. Fleuret, F. Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 2004, 5, 1531–1555. [Google Scholar]
  21. Bennasar, M.; Hicks, Y.; Setchi, R. Feature selection using joint mutual information maximization. Expert Syst. Appl. 2015, 42, 8520–8532. [Google Scholar] [CrossRef]
  22. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550. [Google Scholar] [CrossRef]
  23. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef]
  24. Pascoal, C.; Oliveira, M.R.; Pacheco, A.; Valadas, R. Theoretical evaluation of feature selection methods based on mutual information. Neurocomputing 2017, 226, 168–181. [Google Scholar] [CrossRef]
  25. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley Online Library: Hoboken, NJ, USA, 1991; Volume 6. [Google Scholar]
  26. Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. Numerical Recipes; Cambridge University Press: Cambridge, MA, USA, 1988. [Google Scholar]
  27. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; John Wiley & Sons: New York, NY, USA, 2001. [Google Scholar]
  28. Mitchell, T. Machine Learning; McGraw-Hill: New York, NY, USA, 1997. [Google Scholar]
  29. Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995. [Google Scholar]
  30. Quinlan, J. C4.5: Programs for Machine Learning; Morgan Kaufmann: Burlington, MA, USA, 1993. [Google Scholar]
  31. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  32. Burgos, C.; Galve, A.; Moreno, C.; Altarriba, J.; Reina, R.; García, C.; López-Buesa, P. The effects of two alleles of IGF2 on fat content in pig carcasses and pork. Meat Sci. 2012, 90, 309–313. [Google Scholar] [CrossRef]
  33. Ji, J.; Zhou, L.; Huang, Y.; Zheng, M.; Liu, X.; Zhang, Y.; Huang, C.; Peng, S.; Zeng, Q.; Zhong, L.; et al. A whole-genome sequence based association study on pork eating quality traits and cooking loss in a specially designed heterogeneous F6 pig population. Meat Sci. 2018, 146, 160–167. [Google Scholar] [CrossRef] [PubMed]
  34. Luo, H.F.; Wei, H.K.; Huang, F.R.; Zhou, Z.; Jiang, S.W.; Peng, J. The effect of linseed on intramuscular fat content and adipogenesis related genes in skeletal muscle of pigs. Lipids 2009, 44, 999. [Google Scholar] [CrossRef] [PubMed]
  35. Kennes, Y.M.; Murphy, B.D.; Pothier, F.; Palin, M.F. Characterization of swine leptin (LEP) polymorphisms and their association with production traits. Anim. Genet. 2001, 32, 215–218. [Google Scholar] [CrossRef] [PubMed]
  36. Gao, Y.; Li, Z.; Zhang, Q.; Hao, T.; Liu, H.; Liu, Q.; Liu, L.; Zhang, Z.; Yu, Y.; Li, N. Comparison of meat quality, muscle fiber characteristics and the Sirt1/AMPK/PGC-1α pathway in different breeds of pigs. Anim. Prod. Sci. 2024, in press. [Google Scholar]
  37. Passols, M.; Llobet-Cabau, F.; Sebastià, C.; Castelló, A.; Valdés-Hernández, J.; Criado-Mesas, L.; Sánchez, A.; Folch, J.M. Identification of genomic regions, genetic variants and gene networks regulating candidate genes for lipid metabolism in pig muscle. Animal 2023, 17, 101033. [Google Scholar] [CrossRef]
  38. Brameld, J.M. Molecular mechanisms involved in the nutritional and hormonal regulation of growth in pigs. Proc. Nutr. Soc. 1997, 56, 607–619. [Google Scholar] [CrossRef]
  39. Niu, P.; Kim, S.W.; Choi, B.H.; Kim, T.H.; Kim, J.J.; Kim, K.S. Porcine insulin-like growth factor 1 (IGF1) gene polymorphisms are associated with body size variation. Genes Genom. 2013, 35, 523–528. [Google Scholar] [CrossRef]
  40. Balatsky, V.; Oliinychenko, Y.; Sarantseva, N.; Getya, A.; Saienko, A.; Vovk, V.; Doran, O. Association of single nucleotide polymorphisms in leptin (LEP) and leptin receptor (LEPR) genes with backfat thickness and daily weight gain in Ukrainian Large White pigs. Livest. Sci. 2018, 217, 157–161. [Google Scholar] [CrossRef]
  41. Ovilo, C.; Fernández, A.; Rodríguez, M.C.; Nieto, M.; Silió, L. Association of MC4R gene variants with growth, fatness, carcass composition and meat and fat quality traits in heavy pigs. Meat Sci. 2006, 73, 42–47. [Google Scholar] [CrossRef]
  42. Krupova, Z.; Krupa, E.; Žáková, E.; Zavadilová, L.; Kvašná, E. Candidate genes for congenital malformations in pigs. Acta Fytotechn. Zootech. 2021, 24, 309–314. [Google Scholar] [CrossRef]
  43. Wang, Z.; Li, Y.; Wu, L.; Guo, Y.; Yang, G.; Li, X.; Shi, X.E. Rosiglitazone-induced PPARγ activation promotes intramuscular adipocyte adipogenesis of pig. Anim. Biotechnol. 2023, 34, 3708–3717. [Google Scholar] [CrossRef]
  44. Liu, M.; Lan, Q.; Yang, L.; Deng, Q.; Wei, T.; Zhao, H.; Peng, P.; Lin, X.; Chen, Y.; Ma, H.; et al. Genome-wide association analysis identifies genomic regions and candidate genes for growth and fatness traits in Diannan small-ear (DSE) pigs. Animals 2023, 13, 1571. [Google Scholar] [CrossRef] [PubMed]
  45. Zhang, H.; Zhuang, Z.; Yang, M.; Ding, R.; Quan, J.; Zhou, S.; Gu, T.; Xu, Z.; Zheng, E.; Cai, G.; et al. Genome-wide detection of genetic loci and candidate genes for body conformation traits in Duroc × Landrace × Yorkshire crossbred pigs. Front. Genet. 2021, 12, 664343. [Google Scholar] [CrossRef] [PubMed]
  46. Aslan, O.; Hamill, R.M.; Davey, G.; McBryan, J.; Mullen, A.M.; Gispert, M.; Sweeney, T. Variation in the IGF2 gene promoter region is associated with intramuscular fat content in porcine skeletal muscle. Mol. Biol. Rep. 2012, 39, 4101–4110. [Google Scholar] [CrossRef] [PubMed]
  47. Tempfli, K.; Simon, Z.; Kovács, B.; Posgay, M.; Papp, Á.B. PRLR, MC4R and LEP polymorphisms, and ADIPOQ, A-FABP and LEP expression in crossbred Mangalica pigs. J. Anim. Plant Sci. 2015, 25, 1746–1752. [Google Scholar]
  48. Xue, W.; Wang, W.; Jin, B.; Zhang, X.; Xu, X. Association of the ADRB3, FABP3, LIPE, and LPL gene polymorphisms with pig intramuscular fat content and fatty acid composition. Czech J. Anim. Sci. 2015, 60, 60–66. [Google Scholar] [CrossRef]
  49. Galve, A.; Burgos, C.; Silió, L.; Varona, L.; Rodríguez, C.; Ovilo, C.; López-Buesa, P. The effects of leptin receptor (LEPR) and melanocortin-4 receptor (MC4R) polymorphisms on fat content, fat distribution and fat composition in a Duroc × Landrace/Large White cross. Livest. Sci. 2012, 145, 145–152. [Google Scholar] [CrossRef]
  50. Kušec, I.D.; Kušec, G.; Vuković, R.; Has-Schön, E.; Kralik, G. Differences in carcass traits, meat quality and chemical composition between the pigs of different CAST genotype. Anim. Prod. Sci. 2015, 56, 1745–1751. [Google Scholar] [CrossRef]
  51. Li, B.; Weng, Q.; Dong, C.; Zhang, Z.; Li, R.; Liu, J.; Jiang, A.; Li, Q.; Jia, C.; Wu, W.; et al. A key gene, PLIN1, can affect porcine intramuscular fat content based on transcriptome analysis. Genes 2018, 9, 194. [Google Scholar] [CrossRef]
  52. Damon, M.; Vincent, A.; Lombardi, A.; Herpin, P. First evidence of uncoupling protein-2 (UCP-2) and-3 (UCP-3) gene expression in piglet skeletal muscle and adipose tissue. Gene 2000, 246, 133–141. [Google Scholar] [CrossRef] [PubMed]
  53. Casas-Carrillo, E.; Kirkpatrick, B.W.; Prill-Adams, A.; Price, S.G.; Clutter, A.C. Relationship of growth hormone and insulin-like growth factor-1 genotypes with growth and carcass traits in swine. Anim. Genet. 1997, 28, 88–93. [Google Scholar] [CrossRef]
  54. Te Pas, M.F.W.; Visscher, A.H.; de Greef, K.H. Molecular genetic and physiologic background of the growth hormone–IGF-I axis in relation to breeding for growth rate and leanness in pigs. Domest. Anim. Endocrinol. 2004, 27, 287–301. [Google Scholar] [CrossRef] [PubMed]
  55. Urban, T.; Kuciel, J.; Mikolasova, R. Polymorphism of genes encoding for ryanodine receptor, growth hormone, leptin and MYC protooncogene protein and meat production in Duroc pigs. Czech J. Anim. Sci. 2002, 47, 411–417. [Google Scholar]
  56. Liu, D.W.; Zhang, H.; Wu, Z.F.; Li, J.Q.; Yang, G.F.; Zhang, X.Q. Identification of SNPs and Their Effects on Swine Growth and Carcass Traits for Porcine IGFBP-3 Gene. Agric. Sci. China 2008, 7, 630–635. [Google Scholar] [CrossRef]
  57. Torricelli, M.; Fratto, A.; Ciullo, M.; Sebastiani, C.; Arcangeli, C.; Felici, A.; Giovannini, S.; Sarti, F.M.; Sensi, M.; Biagetti, M. Porcine Reproductive and Respiratory Syndrome (PRRS) and CD163 Resistance Polymorphic Markers: What Is the Scenario in Naturally Infected Pig Livestock in Central Italy? Animals 2023, 13, 2477. [Google Scholar] [CrossRef] [PubMed]
  58. Khatun, A.; Nazki, S.; Jeong, C.G.; Gu, S.; Mattoo, S.U.S.; Lee, S.I.; Yang, M.S.; Lim, B.; Kim, K.S.; Kim, B.; et al. Effect of polymorphisms in porcine guanylate-binding proteins on host resistance to PRRSV infection in experimentally challenged pigs. Vet. Res. 2020, 51, 1–14. [Google Scholar] [CrossRef]
  59. Niu, P.; Shabir, N.; Khatun, A.; Seo, B.J.; Gu, S.; Lee, S.M.; Lim, S.K.; Kim, K.S.; Kim, W.I. Effect of polymorphisms in the GBP1, Mx1 and CD163 genes on host responses to PRRSV infection in pigs. Vet. Microbiol. 2016, 182, 187–195. [Google Scholar] [CrossRef]
  60. Zhao, J.; Feng, N.; Li, Z.; Wang, P.; Qi, Z.; Liang, W.; Zhou, X.; Xu, X.; Liu, B. 2′, 5′-Oligoadenylate synthetase 1 (OAS1) inhibits PRRSV replication in Marc-145 cells. Antivir. Res. 2016, 132, 268–273. [Google Scholar] [CrossRef]
Figure 1. Framework of FS.
Figure 2. Classification accuracy of the IEMs on BPEx: (a) 10-fold cross-validation accuracy by using NB. (b) 10-fold cross-validation accuracy by using SVM. (c) 10-fold cross-validation accuracy by using C4.5. (d) 10-fold cross-validation accuracy by using RF.
Figure 3. Feature quality of the IEMs on BPEx: (a) Rel. (b) NRF.
Figure 4. Classification accuracy of the SEMs on BPEx: (a) 10-fold cross-validation accuracy by using NB. (b) 10-fold cross-validation accuracy by using SVM. (c) 10-fold cross-validation accuracy by using C4.5. (d) 10-fold cross-validation accuracy by using RF.
Figure 5. Feature quality of the SEMs on BPEx: (a) Rel. (b) NRF.
Table 1. Microarray datasets of Shandong black pig.

Datasets | Acronym | Raw Data Type | Feature Number | Sample Number | Class Number
BlackPic Expression | BPEx | Continuous | 24,368 | 181 | 4
BlackPic SNP | BPSnp | Discrete | 27,268 | 236 | 14
BlackPic PRRS | BPPRRS | Continuous | 29,768 | 151 | 2
Table 2. Parameter settings of ML algorithms.

Model | Hyperparameters | Values
NB | - | No additional tuning
SVM | C | 1
SVM | tol | 0.001
SVM | kernel | rbf
C4.5 | Min_samples_leaf | 5
C4.5 | Confidence_factor_for_pruning | 0.25
RF | n_estimators | 100
RF | max_depth | None
RF | min_samples_split | 2
RF | random_state | 42
Table 3. Experimental results of the proposed algorithm.

Dataset | Rel | NRF | FN | Execution Time (s) | OA (NB) | OA (SVM) | OA (C4.5) | OA (RF)
BPEx | 1 | 0 | 8 | 44.74 | 0.887 | 0.904 | 0.844 | 0.910
BPSnp_1 | 1 | 0 | 9 | 62.21 | 0.925 | 0.923 | 0.918 | 0.946
BPSnp_2 | 1 | 0 | 9 | 68.16 | 0.922 | 0.933 | 0.931 | 0.942
BPSnp_3 | 1 | 0 | 8 | 62.49 | 0.928 | 0.940 | 0.932 | 0.946
BPPRRS | 1 | 0 | 6 | 38.47 | 0.954 | 0.968 | 0.959 | 0.970
Table 4. Experimental results of the IEMs and MCMI.

Dataset | Classifier | Metric | InfoGain | GainRatio | SU | ChiSquare | Fisher | MCMI
BPEx | NB | OA | 0.851 | 0.872 | 0.678 | 0.859 | 0.850 | 0.887
BPEx | NB | FN | 15 | 25 | 40 | 20 | 30 | 8
BPEx | SVM | OA | 0.872 | 0.890 | 0.715 | 0.881 | 0.302 | 0.904
BPEx | SVM | FN | 15 | 20 | 40 | 15 | 30 | 8
BPEx | C4.5 | OA | 0.809 | 0.824 | 0.609 | 0.813 | 0.797 | 0.844
BPEx | C4.5 | FN | 10 | 30 | 20 | 40 | 35 | 8
BPEx | RF | OA | 0.886 | 0.850 | 0.806 | 0.852 | 0.838 | 0.910
BPEx | RF | FN | 40 | 40 | 40 | 40 | 40 | 8
BPSnp_1 | NB | OA | 0.904 | 0.878 | 0.863 | 0.909 | 0.892 | 0.925
BPSnp_1 | NB | FN | 40 | 40 | 40 | 40 | 40 | 9
BPSnp_1 | SVM | OA | 0.903 | 0.863 | 0.845 | 0.916 | 0.888 | 0.923
BPSnp_1 | SVM | FN | 15 | 15 | 10 | 10 | 5 | 9
BPSnp_1 | C4.5 | OA | 0.909 | 0.875 | 0.861 | 0.888 | 0.871 | 0.918
BPSnp_1 | C4.5 | FN | 25 | 40 | 20 | 40 | 40 | 9
BPSnp_1 | RF | OA | 0.917 | 0.896 | 0.855 | 0.904 | 0.890 | 0.946
BPSnp_1 | RF | FN | 40 | 30 | 40 | 40 | 40 | 9
BPSnp_2 | NB | OA | 0.900 | 0.861 | 0.851 | 0.874 | 0.883 | 0.922
BPSnp_2 | NB | FN | 15 | 20 | 25 | 25 | 15 | 9
BPSnp_2 | SVM | OA | 0.876 | 0.880 | 0.792 | 0.883 | 0.852 | 0.933
BPSnp_2 | SVM | FN | 40 | 40 | 25 | 40 | 5 | 9
BPSnp_2 | C4.5 | OA | 0.892 | 0.905 | 0.852 | 0.898 | 0.864 | 0.931
BPSnp_2 | C4.5 | FN | 30 | 40 | 40 | 40 | 40 | 9
BPSnp_2 | RF | OA | 0.907 | 0.987 | 0.956 | 0.907 | 0.875 | 0.942
BPSnp_2 | RF | FN | 30 | 40 | 35 | 35 | 35 | 9
BPSnp_3 | NB | OA | 0.918 | 0.916 | 0.916 | 0.912 | 0.879 | 0.928
BPSnp_3 | NB | FN | 10 | 10 | 15 | 15 | 10 | 8
BPSnp_3 | SVM | OA | 0.918 | 0.913 | 0.912 | 0.921 | 0.930 | 0.940
BPSnp_3 | SVM | FN | 25 | 35 | 10 | 15 | 10 | 8
BPSnp_3 | C4.5 | OA | 0.916 | 0.919 | 0.912 | 0.929 | 0.926 | 0.932
BPSnp_3 | C4.5 | FN | 25 | 30 | 20 | 10 | 10 | 8
BPSnp_3 | RF | OA | 0.918 | 0.915 | 0.912 | 0.929 | 0.930 | 0.946
BPSnp_3 | RF | FN | 35 | 35 | 10 | 10 | 10 | 8
BPPRRS | NB | OA | 0.908 | 0.916 | 0.916 | 0.912 | 0.879 | 0.941
BPPRRS | NB | FN | 5 | 15 | 25 | 10 | 5 | 6
BPPRRS | SVM | OA | 0.898 | 0.923 | 0.932 | 0.901 | 0.930 | 0.951
BPPRRS | SVM | FN | 25 | 35 | 15 | 15 | 10 | 6
BPPRRS | C4.5 | OA | 0.876 | 0.889 | 0.932 | 0.909 | 0.926 | 0.949
BPPRRS | C4.5 | FN | 25 | 30 | 20 | 10 | 15 | 6
BPPRRS | RF | OA | 0.918 | 0.925 | 0.932 | 0.919 | 0.930 | 0.966
BPPRRS | RF | FN | 35 | 35 | 25 | 15 | 10 | 6
Mean | | OA | 0.895 | 0.895 | 0.852 | 0.896 | 0.857 | 0.927
Mean | | FN | 25 | 30.25 | 25.75 | 24.25 | 21.75 | 8
Table 5. Execution time of the IEMs and MCMI (in seconds).

Dataset | InfoGain | GainRatio | SU | ChiSquare | Fisher | MCMI
BPEx | 26.01 | 24.82 | 26.98 | 19.53 | 15.06 | 44.74
BPSnp_1 | 33.75 | 33.77 | 34.99 | 26.70 | 23.18 | 62.21
BPSnp_2 | 33.20 | 34.05 | 32.96 | 27.92 | 21.10 | 68.16
BPSnp_3 | 34.09 | 34.46 | 34.61 | 24.52 | 22.15 | 62.49
BPPRRS | 23.35 | 21.87 | 23.58 | 18.73 | 13.87 | 38.47
Table 6. Experimental results of the SEMs and MCMI.

Dataset | Classifier | Metric | CIFE | CMIM | JMIM | mRMR | maxMIFS | MCMI
BPEx | NB | OA | 0.869 | 0.867 | 0.893 | 0.885 | 0.891 | 0.887
BPEx | NB | FN | 30 | 15 | 25 | 35 | 25 | 8
BPEx | SVM | OA | 0.888 | 0.891 | 0.906 | 0.911 | 0.916 | 0.904
BPEx | SVM | FN | 15 | 20 | 25 | 25 | 30 | 8
BPEx | C4.5 | OA | 0.798 | 0.780 | 0.802 | 0.803 | 0.796 | 0.844
BPEx | C4.5 | FN | 40 | 30 | 25 | 35 | 30 | 8
BPEx | RF | OA | 0.906 | 0.898 | 0.910 | 0.902 | 0.906 | 0.910
BPEx | RF | FN | 40 | 40 | 40 | 30 | 40 | 8
BPSnp_1 | NB | OA | 0.889 | 0.898 | 0.900 | 0.898 | 0.901 | 0.925
BPSnp_1 | NB | FN | 15 | 25 | 35 | 40 | 35 | 9
BPSnp_1 | SVM | OA | 0.909 | 0.922 | 0.925 | 0.928 | 0.922 | 0.923
BPSnp_1 | SVM | FN | 20 | 25 | 15 | 10 | 25 | 9
BPSnp_1 | C4.5 | OA | 0.831 | 0.839 | 0.837 | 0.825 | 0.854 | 0.918
BPSnp_1 | C4.5 | FN | 25 | 15 | 25 | 25 | 30 | 9
BPSnp_1 | RF | OA | 0.922 | 0.926 | 0.932 | 0.920 | 0.930 | 0.946
BPSnp_1 | RF | FN | 15 | 15 | 25 | 35 | 40 | 9
BPSnp_2 | NB | OA | 0.887 | 0.888 | 0.902 | 0.882 | 0.928 | 0.922
BPSnp_2 | NB | FN | 15 | 35 | 20 | 30 | 40 | 9
BPSnp_2 | SVM | OA | 0.897 | 0.899 | 0.914 | 0.875 | 0.912 | 0.933
BPSnp_2 | SVM | FN | 25 | 25 | 35 | 40 | 35 | 9
BPSnp_2 | C4.5 | OA | 0.880 | 0.898 | 0.915 | 0.918 | 0.909 | 0.931
BPSnp_2 | C4.5 | FN | 15 | 10 | 15 | 15 | 25 | 9
BPSnp_2 | RF | OA | 0.898 | 0.906 | 0.902 | 0.937 | 0.929 | 0.942
BPSnp_2 | RF | FN | 15 | 10 | 15 | 35 | 35 | 9
BPSnp_3 | NB | OA | 0.897 | 0.909 | 0.929 | 0.930 | 0.919 | 0.928
BPSnp_3 | NB | FN | 30 | 15 | 35 | 20 | 25 | 8
BPSnp_3 | SVM | OA | 0.919 | 0.911 | 0.917 | 0.949 | 0.929 | 0.940
BPSnp_3 | SVM | FN | 20 | 25 | 15 | 25 | 20 | 8
BPSnp_3 | C4.5 | OA | 0.921 | 0.920 | 0.929 | 0.936 | 0.936 | 0.932
BPSnp_3 | C4.5 | FN | 15 | 25 | 25 | 20 | 35 | 8
BPSnp_3 | RF | OA | 0.921 | 0.929 | 0.937 | 0.934 | 0.942 | 0.946
BPSnp_3 | RF | FN | 25 | 20 | 20 | 25 | 35 | 8
BPPRRS | NB | OA | 0.927 | 0.928 | 0.930 | 0.950 | 0.950 | 0.941
BPPRRS | NB | FN | 30 | 30 | 10 | 5 | 20 | 6
BPPRRS | SVM | OA | 0.932 | 0.940 | 0.949 | 0.949 | 0.959 | 0.951
BPPRRS | SVM | FN | 20 | 15 | 5 | 5 | 20 | 6
BPPRRS | C4.5 | OA | 0.939 | 0.932 | 0.946 | 0.966 | 0.948 | 0.949
BPPRRS | C4.5 | FN | 15 | 10 | 10 | 15 | 25 | 6
BPPRRS | RF | OA | 0.941 | 0.945 | 0.954 | 0.964 | 0.957 | 0.966
BPPRRS | RF | FN | 25 | 20 | 25 | 25 | 25 | 6
Mean | | OA | 0.898 | 0.901 | 0.911 | 0.913 | 0.917 | 0.927
Mean | | FN | 22.5 | 21.25 | 22.25 | 24.75 | 29.75 | 8
Table 7. Execution times of the SEMs and MCMI (in seconds).

Dataset | CIFE | CMIM | JMIM | mRMR | maxMIFS | MCMI
BPEx | 74.81 | 74.14 | 78.21 | 68.01 | 67.73 | 44.74
BPSnp_1 | 93.21 | 93.87 | 97.45 | 84.74 | 85.47 | 62.21
BPSnp_2 | 88.04 | 87.51 | 92.04 | 80.04 | 80.69 | 68.16
BPSnp_3 | 94.74 | 95.78 | 99.04 | 86.13 | 85.34 | 62.49
BPPRRS | 53.35 | 51.87 | 53.58 | 48.73 | 43.87 | 38.47
Table 8. Gene verification for pig traits.

Dataset | Selected Genes | Known Genes | Validated Count | Effectiveness Rate
BPEx | IGF2, TNNT3, PPARδ, LEP, SIRT1, APOE, TRIM55, FTO | IGF2 [32], TNNT3 [33], PPARδ [34], LEP [35], SIRT1 [36], APOE [37] | 6 | 75.00%
BPSnp_1 | GHR, IGF1, LEP, MC4R, INSL3, PPARδ, SLC10A2, TNFAIP3, RYR1 | GHR [38], IGF1 [39], LEP [40], MC4R [41], INSL3 [42], PPARδ [43], SLC10A2 [44], TNFAIP3 [45] | 8 | 89.00%
BPSnp_2 | IGF2, LEP, FABP3, MC4R, CAST, PLIN1, UCP3, ASIP, ADRB3 | IGF2 [46], LEP [47], FABP3 [48], MC4R [49], CAST [50], PLIN1 [51], UCP3 [52] | 7 | 77.78%
BPSnp_3 | IGF1, GHR, GH1, LEP, IGFBP3, MTOR, ASIP, NOS3 | IGF1 [53], GHR [54], GH1 [55], LEP [40], IGFBP3 [56] | 5 | 62.50%
BPPRRS | CD163, GBP5, MX1, OAS1, MTOR, IGF2BP1 | CD163 [57], GBP5 [58], MX1 [59], OAS1 [60] | 4 | 66.67%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
