1. Introduction
1.1. Motivation
In the era of big data, datasets have grown exponentially in both size and complexity, posing significant challenges to data analysis and machine learning. As a result, feature selection has become a key preprocessing step that aims to reduce the dimensionality of data by selecting a subset of relevant features that contribute most to the predictive performance of models. It enhances model accuracy and interpretability while reducing computational costs and mitigating the risk of overfitting [1,2,3].
Traditional feature selection techniques often rely on heuristic approaches or domain expertise. However, modern feature selection methods have increasingly leveraged rigorous mathematical theories and optimization techniques based on linear algebra, information theory, statistics, sparse representation, and optimization algorithms. Mathematical approaches provide a structured and quantifiable foundation for feature selection, enabling more robust and scalable solutions across diverse application domains.
Despite the extensive body of literature on feature selection, including numerous review articles, there is no comprehensive analysis specifically focused on the mathematical techniques underpinning feature selection. Given the increasing significance of mathematical methods in this domain, there is a pressing need for a dedicated review that systematically examines these approaches.
1.2. Related Work
Feature selection is widely recognized as a critical process for data preprocessing in machine learning, prompting multiple survey articles that examine the topic from various perspectives. For instance, Dokeroglu et al. [4] offer a comprehensive survey of metaheuristic-based feature selection algorithms, assessing advances in exploration and exploitation operators, parameter tuning, and fitness evaluation functions. Their work emphasizes the rapid proliferation of new metaheuristic methods and explores current challenges and potential research directions in that domain. Similarly, Dhal and Azad [5] provide an extensive review of feature selection in machine learning, focusing on structured and unstructured datasets and discussing the impact of high-dimensionality, noise, and storage complexity on the feature selection process. They compile a broad taxonomy of existing methods and applications, including commonly used datasets and performance metrics across different ML fields.
Pudjihartono et al. [6] shift the perspective to disease risk prediction, illustrating how feature selection methods can help identify the most informative features (e.g., SNPs) in high-dimensional genetic data. By addressing the “curse of dimensionality”, their review highlights the benefits of feature selection in improving the generalizability of machine learning models, albeit at the cost of handling large, noisy datasets. In contrast, swarm intelligence-based feature selection is the focal point of a survey by Rostami et al. [7], who evaluate emerging swarm intelligence algorithms and discuss their strengths and weaknesses in terms of computational complexity, convergence properties, and practical performance.
Although these prior works collectively cover a broad spectrum of approaches—ranging from metaheuristics to domain-specific implementations—they tend to concentrate on high-level algorithmic designs, application-specific issues, or comparative evaluations of heuristic techniques. Consequently, there remains a gap in the literature concerning the underlying mathematical frameworks that unify these diverse feature selection methods. The present review aims to bridge this gap by examining feature selection methods through a mathematical lens, focusing on three principal frameworks—variance-based, regularization-based, and Bayesian. By discussing the theoretical underpinnings, strengths, and potential limitations of each framework, this review intends to complement and extend existing surveys, offering a more in-depth exploration of the core mathematical principles that govern feature selection.
1.3. Methodology
To ensure that the reviewed works reflect leading-edge research and foundational advancements, we employed a systematic literature search strategy. First, major academic databases (Scopus and Google Scholar) were queried with keywords such as “feature selection”, “mathematical methods”, “sparse modeling”, and “Bayesian feature selection.” The initial search was restricted to publications from the last ten years to capture recent progress, but earlier seminal papers were included where necessary to provide historical context.
Next, the studies were screened for relevance based on their abstracts and titles, and only those discussing mathematically grounded approaches, such as theoretical proofs, rigorously formulated optimization algorithms, or well-established statistical or Bayesian frameworks, were retained. Additional filtering was conducted by examining citation counts, the reputation of the publishing venue, and whether the work included a clear articulation of the mathematical principles involved. This vetting process resulted in a targeted corpus of articles spanning various mathematical paradigms. Ultimately, we synthesized these findings to propose a new taxonomy and to highlight the most promising directions for future research in mathematical feature selection.
1.4. Contributions
This paper presents a comprehensive review of modern mathematical approaches to feature selection. We classify these techniques based on their mathematical foundations, including variance-based methods, regularization techniques, and Bayesian approaches. Each method is described in detail, highlighting its key strengths and limitations. Additionally, we discuss practical applications for each technique, providing researchers with insights to select the most suitable feature selection method for their specific needs.
The key contributions of the paper are the following:
We categorize feature selection methods based on their mathematical foundation.
We provide technical backgrounds of the main techniques and discuss their strengths and weaknesses.
We offer guidance for researchers to select appropriate feature selection methods based on specific application domains.
The remainder of this paper is organized as follows: Section 2 provides an overview of mathematical feature selection methods, which lays the foundation for the subsequent sections. Section 3 introduces variance-based approaches, while Section 4 examines regularization-based techniques. Section 5 delves into Bayesian feature selection methods. Section 6 presents related methods, including hybrid strategies and graph-theoretical approaches for feature selection. Section 7 discusses current challenges and highlights potential research directions. Finally, Section 8 concludes this paper with closing remarks.
3. Variance-Based Feature Selection
Variance measures the dispersion of the data. It is an important concept in statistics and has been widely used as a foundation for many feature selection algorithms. There exists a range of variance-based methods starting from the simple variance threshold method to the sophisticated Sobol sensitivity index. These methods leverage variance to identify informative, non-redundant, and discriminative features, enabling more efficient and effective data representation for downstream machine learning tasks.
In this section, we discuss six major variance-based methods. We describe each method together with its mathematical formulation. In addition, we discuss their strengths and limitations. A summary of the variance-based methods is provided in Table 1. A more detailed discussion of each method is provided in the following subsections.
3.1. Variance Threshold
Variance threshold is a filter approach that discards features whose variances are below a chosen cutoff. It relies on the intuition that features with little or no variance are unlikely to offer much predictive power. Let $X \in \mathbb{R}^{n \times d}$ be the data with $n$ samples and $d$ features. Denote $x_{ij}$ as the value of feature $j$ in sample $i$. The sample mean of feature $j$ is
$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij},$$
and its sample variance is
$$s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_{ij} - \bar{x}_j \right)^2.$$
Let $\tau > 0$ be a pre-defined threshold. Then, feature $j$ is removed if $s_j^2 < \tau$. The threshold is usually chosen as a small positive value to remove near-constant features. The approach is computationally cheap and easy to interpret [13]. However, it overlooks the possibility that some low-variance features may be important for specific predictive tasks. It is common to use this method as an initial data-cleaning step to handle degenerate or trivial features. Variance threshold is an unsupervised method as it does not take into account the values of the target variable. As such, it may miss relevant interactions between the dependent and independent variables.
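For illustration, the thresholding rule above can be applied in a few lines of Python; the sketch below uses scikit-learn's VarianceThreshold on a synthetic matrix, with the data and the cutoff value chosen purely for demonstration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 2] = 1.0                                  # a constant, zero-variance feature

selector = VarianceThreshold(threshold=1e-3)   # tau: small positive cutoff
X_reduced = selector.fit_transform(X)          # drops near-constant features below the cutoff
kept = selector.get_support(indices=True)      # indices of retained features
print(kept)                                    # feature 2 is removed
```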
The variance threshold method can be useful both as a standalone method and in combination with more sophisticated methods [14]. A study of 14 filter methods for high-dimensional gene expression data found the variance threshold to achieve optimal results [8]. The variance threshold method has also been used in conjunction with Boruta to select optimal features for intrusion detection systems (IDS) [15].
3.2. Variance Decomposition
Sobol indices are based on the decomposition of the variance of the target variable in terms of the input variables. The approach is often applied in engineering- and simulation-based studies. Sobol decomposition is a sophisticated method that can capture complex interactions between multiple features and the target variable. There are two types of Sobol indices used in feature selection: the first-order index and the total-order index.
The method requires a model
$$Y = f(X_1, X_2, \ldots, X_d),$$
where each $X_i$ is an input feature. The output variance $\mathrm{Var}(Y)$ is decomposed according to how each feature or combination of features contributes to it. The Hoeffding–Sobol decomposition is given by
$$\mathrm{Var}(Y) = \sum_{i=1}^{d} V_i + \sum_{i < j} V_{ij} + \cdots + V_{1,2,\ldots,d},$$
with each term representing partial variances. The first-order Sobol index for feature $i$ is
$$S_i = \frac{V_i}{\mathrm{Var}(Y)}, \qquad V_i = \mathrm{Var}\big(\mathbb{E}[Y \mid X_i]\big),$$
where $V_i$ is the variance in the output that is exclusively attributable to $X_i$. The total-order Sobol index is given by
$$S_{T_i} = \sum_{u \ni i} \frac{V_u}{\mathrm{Var}(Y)},$$
where $V_u$ is the variance of $Y$ attributed exclusively to the subset $u$ of the feature set. The total-order index captures all variance contributions of $X_i$, alone and in combination with other variables. Low indices indicate that a feature can be removed with little impact on output variance.
The Sobol index approach can discover nonlinear interactions that simpler correlation-based measures miss [9]. However, it is computationally expensive because the model must be evaluated multiple times under different sampling schemes. Additionally, the method requires a specific model, which reduces the generalizability of the results.
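For intuition, both indices can be estimated by Monte Carlo sampling without a dedicated library. The sketch below implements the standard Saltelli/Jansen pick-freeze estimators for a toy model whose inputs are assumed uniform on [0, 1]; the model and sample size are illustrative only.

```python
import numpy as np

def sobol_indices(model, d, n=4096, seed=0):
    """Pick-freeze Monte Carlo estimates of first- and total-order Sobol indices."""
    rng = np.random.default_rng(seed)
    A, B = rng.uniform(size=(n, d)), rng.uniform(size=(n, d))
    fA, fB = model(A), model(B)
    var_y = np.var(np.concatenate([fA, fB]))
    S1, ST = np.empty(d), np.empty(d)
    for i in range(d):
        ABi = A.copy()
        ABi[:, i] = B[:, i]                              # A with column i replaced from B
        fABi = model(ABi)
        S1[i] = np.mean(fB * (fABi - fA)) / var_y        # Saltelli (2010) first-order estimator
        ST[i] = 0.5 * np.mean((fA - fABi) ** 2) / var_y  # Jansen total-order estimator
    return S1, ST

# Toy model: X3 never enters f, so both of its indices should be near zero.
f = lambda X: X[:, 0] + 2.0 * X[:, 1] + X[:, 0] * X[:, 1]
S1, ST = sobol_indices(f, d=3)
```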
The method is best suited for regression problems with a definable model of how inputs affect an output quantity. It has been used in a variety of scenarios, including gene expression [16], tunneling-induced settlement prediction [17], drilling [18], intrusion detection [19], and others.
3.3. Principal Component Analysis
PCA is an unsupervised dimensionality reduction method that projects data onto orthogonal directions of greatest variance. Each direction, known as a principal component, is formed as a linear combination of the original features. While PCA was originally employed as a feature extraction technique, it can also be used to select the optimal features [10,20]. Many studies employ 2–3 principal components to visualize and analyze the data [10].
PCA is calculated by diagonalizing the covariance matrix. Let $X \in \mathbb{R}^{n \times d}$ be mean-centered so that each feature column has an average of zero. Its covariance matrix is
$$C = \frac{1}{n-1} X^{\top} X.$$
The eigenvalue decomposition of $C$ yields
$$C = V \Lambda V^{\top},$$
where $V$ contains the eigenvectors and $\Lambda$ the corresponding eigenvalues. The eigenvalues represent the amount of variance captured by each principal component. The top $k$ components can be selected to reduce the dimensionality while retaining most of the variance.
PCA does not use a target variable, so it may discard low-variance features that are actually predictive in a supervised setting. Despite this, it is widely used for noise removal and data compression, especially in large-scale tasks. It produces new latent variables that combine original features, which can complicate interpretability.
Although PCA produces linear combinations of features, the weights of the principal components can be used to measure feature importance [21,22]. PCA-based feature selection algorithms have been used in a variety of applications, including hyperspectral remote sensing [23], medical diagnosis [24], gene expression [25], and others.
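As a simple illustration of this weighting idea, the sketch below scores each original feature by its absolute loadings weighted by the explained variance ratio of the retained components; the data matrix and the choice of three components are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

Xs = StandardScaler().fit_transform(X)           # PCA is sensitive to feature scale
pca = PCA(n_components=3).fit(Xs)

# components_ has shape (k, d); weight absolute loadings by explained variance ratio.
importance = np.abs(pca.components_).T @ pca.explained_variance_ratio_
ranking = np.argsort(importance)[::-1]           # features ranked by PCA-based score
```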
3.4. Analysis of Variance
ANOVA-based feature selection typically applies to classification tasks. Each feature is tested to see how well it separates the classes by comparing its between-class variance to its within-class variance [8]. Suppose there are $k$ classes, with $n_m$ samples in class $m$. The mean of feature $j$ in class $m$ is $\bar{x}_{jm}$, and the overall mean is $\bar{x}_j$. The between-class sum of squares is
$$SS_B = \sum_{m=1}^{k} n_m \left( \bar{x}_{jm} - \bar{x}_j \right)^2,$$
and the within-class sum of squares is
$$SS_W = \sum_{m=1}^{k} \sum_{i \in C_m} \left( x_{ij} - \bar{x}_{jm} \right)^2,$$
where $C_m$ denotes the set of samples in class $m$. The F-statistic for feature $j$ is
$$F_j = \frac{SS_B / (k-1)}{SS_W / (n-k)}.$$
A higher F-statistic indicates greater discriminative power. ANOVA is straightforward to implement and relies on established statistical theory. However, it assumes normality and homogeneity of variance across classes, which may not always hold in real-world data. It also considers each feature individually and does not capture interactions.
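In practice, the per-feature F-statistic is readily available in standard libraries; the short example below ranks the features of the Iris dataset with scikit-learn's f_classif scorer and keeps the two most discriminative ones.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

print(selector.scores_)        # per-feature F-statistics (higher = more discriminative)
X_top = selector.transform(X)  # keeps the two features with the largest F-statistics
```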
ANOVA-based feature selection has shown robust results in multiple studies [5,26]. It has been applied in various fields, including computer vision [27,28], protein identification [29], genomic data [30], and others. There also exist several extensions of ANOVA that aim to improve feature selection in specific domains. Differential variance was proposed in [31] to improve feature selection for cancer detection. An extension of ANOVA for imbalanced data was proposed in [32].
3.5. Partial Least Squares
PLS is a supervised technique that, like PCA, projects data into a lower-dimensional latent space, but uses the relationship with the response variable to guide the extraction of factors. After centering and scaling, PLS finds weight vectors that maximize the covariance between projections of the predictors and the response,
$$\max_{\mathbf{w}, \mathbf{c}} \; \mathrm{Cov}\left( X\mathbf{w}, \, Y\mathbf{c} \right) \quad \text{subject to} \quad \lVert \mathbf{w} \rVert = \lVert \mathbf{c} \rVert = 1.$$
The data matrices are then deflated to remove the explained variation, and the process repeats to extract multiple components. By aligning variance in $X$ with variance in $Y$, PLS can handle strongly collinear predictors. Its drawback is that the resulting components are linear combinations of the original predictors, which can make interpretation difficult. Selecting the correct number of components often relies on cross-validation or domain expertise.
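One common heuristic is to rank features by the magnitude of their PLS weights across the extracted components; the sketch below uses scikit-learn's PLSRegression on synthetic data, with the number of components and the aggregation rule being illustrative choices rather than part of the original formulation.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)
pls = PLSRegression(n_components=3).fit(X, y)

# x_weights_ has shape (n_features, n_components): aggregate absolute weights
# across components as a simple feature-importance score.
importance = np.abs(pls.x_weights_).sum(axis=1)
ranking = np.argsort(importance)[::-1]
```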
PLS is popular in high-dimensional regression contexts [11,33]. It can be further adapted for classification under variations such as discriminant analysis (PLS-DA), path modeling (PLS-PM) [34,35], the kernel technique (KPLS) [36], and others.
3.6. Variance–Covariance Subspace Distance-Based Feature Selection (VCSD)
Most recently, variance–covariance subspace distance-based (VCSD) feature selection was proposed in [12]. VCSD is an unsupervised approach designed to select a subset of features whose spanned subspace best approximates the entire data matrix in terms of its variance–covariance structure. Let $X \in \mathbb{R}^{n \times d}$ be the original data and let $I$ be a set of feature indices with $|I| = k$. The submatrix $X_I$ includes only the features in $I$. VCSD seeks to minimize
$$\left\lVert X X^{\top} - X_I X_I^{\top} \right\rVert_F,$$
where the Frobenius norm $\lVert \cdot \rVert_F$ measures the difference in the covariance-like structure of the full data versus the selected subset. The idea is to preserve as much of the original subspace geometry as possible. The objective is combinatorial in the number of candidate subsets, so approximate or greedy methods may be employed for practical data sizes. It serves as a valuable tool for situations where no target labels exist but the user wishes to maintain global variance–covariance patterns.
3.7. Discussion
The above methods reveal different ways of leveraging variance in feature selection. Variance threshold and PCA are unsupervised methods that do not take into account the target variable. The simplest approach is variance threshold, which can quickly eliminate constant or near-constant features but ignores any target or deeper data structure, while PCA reduces dimensionality by retaining directions of maximal variance. Supervised methods include ANOVA, Sobol indices, and PLS. ANOVA can rank features by their ability to distinguish classes, though it makes strong statistical assumptions. Partial Least Squares aligns predictor variance with a response variable, making it suitable for high-dimensional regression, albeit with interpretability trade-offs. Variance Decomposition (Sobol indices) can capture complex nonlinear interactions, but demands a callable model and can be computationally intensive. More recently, VCSD was introduced, which preserves the global variance–covariance subspace structure of the data, though it may require complex optimization and is best approached with suitable heuristics for large datasets.
In many practical settings, a combination of these methods is employed to capitalize on their respective advantages. Researchers might first remove constant features using a threshold, then reduce redundancy by examining correlations or decompositions, and finally refine the feature set with a target-aware method like PLS or ANOVA. The choice of strategy depends on whether the task is supervised or unsupervised, on the size of the dataset, and on the nature of the domain from which the data originate. The methods outlined here represent fundamental approaches that highlight the central role of variance-related criteria in feature selection.
4. Regularization-Based Feature Selection
In the embedded feature selection framework, regularization or sparse learning is incorporated during model training to constrain feature weights in order to identify the most important and informative features for the best prediction performance. Thus, finding the best feature subset and constructing the model are combined into one inseparable learning process. Penalized regression represents the most common form of regularization models, where a penalty term is added to the model’s loss function (the residual sum of squares) to constrain the magnitude of the coefficients associated with the features [37]. Features with zero or very small coefficients can be eliminated from the model. This not only improves the stability and accuracy of the learning algorithm, but also prevents model overfitting. There are various types of penalty terms and tuning parameters which define different penalized regression approaches, such as Ridge regression, Lasso regression, and elastic net regression. Each approach has its own unique properties and advantages when it comes to feature selection, model building, and prediction performance.
In this section, we discuss seven major regularization methods used in feature selection. We describe each method together with its mathematical formulation and discuss its strengths and limitations. Table 2 depicts a summary of the considered methods, and a more detailed discussion of each method is provided in the following subsections. Without loss of generality, we consider only linear models, in which the target variable is modeled as a linear combination of the input features.
4.1. Ridge Regression
Ridge regression [37,44,45] is a powerful regularization technique used for creating small models in the presence of a large number of features. Ridge regression carries out $\ell_2$ regularization by adding a penalty on the sum of squares of the coefficients to the optimization objective. That is, for $X \in \mathbb{R}^{n \times d}$ and a given $\lambda > 0$, the Ridge estimator provides the values of the coefficients $\boldsymbol{\beta}$ that minimize the following:
$$\hat{\boldsymbol{\beta}}^{\mathrm{ridge}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta} \right)^2 + \lambda \sum_{j=1}^{d} \beta_j^2,$$
where $y$ is the target variable and $\lambda$ is the regularization parameter controlling the strength of the penalty.
In Ridge regression, the penalty term grows with the magnitude of the coefficients $\boldsymbol{\beta}$, so the optimization objective is penalized when the coefficients take large values. This allows for better model performance through a bias–variance trade-off, reducing the impact of multicollinearity among the features and providing a more stable selection process. Ridge shrinks the coefficients of less important features towards zero without eliminating them. This makes it most beneficial when all features are believed to contribute to the target variable, albeit at the cost of higher estimation bias [46]. Due to its sensitivity to feature scaling, standardizing or normalizing the features is often essential before applying Ridge regression.
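The shrinkage behaviour is easy to see numerically; the short example below, with synthetic data and an arbitrary penalty value, compares ordinary least squares with a Ridge fit on standardized features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)
Xs = StandardScaler().fit_transform(X)           # Ridge is sensitive to feature scaling

ols_coef = LinearRegression().fit(Xs, y).coef_
ridge_coef = Ridge(alpha=10.0).fit(Xs, y).coef_  # lambda = 10 (arbitrary choice)

# Coefficients are shrunk towards zero but, unlike Lasso, none becomes exactly zero.
print(np.abs(ridge_coef).sum() < np.abs(ols_coef).sum(), np.sum(ridge_coef == 0.0))
```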
4.2. Least Absolute Shrinkage and Selection Operator (Lasso)
Similar to Ridge regression, the Lasso method [37,38,45] employs a continuous shrinkage procedure in which it applies $\ell_1$ regularization by adding the $\ell_1$ norm of the coefficient vector to the optimization objective function. It reduces the values of some of the coefficients to zero, and the features with non-zero coefficients are used for the model construction, resulting in sparser models. The norm of the coefficients is generally constrained to be smaller than a predetermined value (upper bound). The Lasso estimate is the solution to the following optimization problem [37]:
$$\hat{\boldsymbol{\beta}}^{\mathrm{lasso}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta} \right)^2 + \lambda \sum_{j=1}^{d} \left| \beta_j \right|,$$
where $\lambda \ge 0$ is the regularization parameter. If $\lambda$ is big enough, certain coefficients will reach a value of zero. The advantage of the Lasso method is that it can provide better prediction accuracy, because shrinking and removing coefficients can reduce variance without a substantial increase in the estimation bias. The method may, however, not perform well in applications where the number of features $d$ exceeds the number of samples $n$, due to its inability to select all relevant features. It can also struggle when features are highly correlated, often selecting only one feature from a group of correlated features. Ref. [47] generalized the Lasso estimation and introduced the Relaxed Lasso (rLasso), which includes an additional tuning parameter that balances the fully regularized Lasso and the unpenalized optimization problem. The additional relaxation parameter allows for less shrinkage on selected and important features, leading to enhanced model accuracy and higher feature selection consistency.
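As a brief illustration, the sketch below fits a cross-validated Lasso on standardized synthetic data and reads off the features with non-zero coefficients; the data and settings are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)   # lambda chosen by cross-validation
selected = np.flatnonzero(lasso.coef_)             # features with non-zero coefficients
print(len(selected), "features retained out of", X.shape[1])
```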
4.3. Adaptive Lasso
The Lasso method is straightforward to apply, but it can be inconsistent in feature selection and carries an inherent bias even under finite parameter conditions, as noted by the authors in [48]. To address these limitations, the authors in [39] proposed the following adaptive Lasso penalization problem:
$$\hat{\boldsymbol{\beta}}^{\mathrm{alasso}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta} \right)^2 + \lambda \sum_{j=1}^{d} w_j \left| \beta_j \right|,$$
where the adaptive weight vector $\mathbf{w} = (w_1, \ldots, w_d)$ is a known quantity used to guide the $\ell_1$ penalty on the model coefficients. The authors in [49] demonstrated that when the weights $w_j$ are data-dependent, the feature selection process becomes more stable and consistent. Furthermore, in problems where $d \gg n$, [50] demonstrated that adaptive Lasso improves the stability of the feature selection under partial orthogonality of the features, where the features with zero coefficients are weakly correlated with the features with non-zero coefficients. The effectiveness of the method, however, relies heavily on the initial estimates of the adaptive weights. Poor choices in the initial estimates can lead to poor performance of the feature selection process, and the procedure is more computationally intensive than standard Lasso.
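Since the weighted $\ell_1$ problem can be rewritten as an ordinary Lasso on rescaled features, a simple two-step sketch is possible; the initial Ridge estimate, the weight exponent, and the small epsilon below are illustrative choices rather than part of the original proposal.

```python
import numpy as np
from sklearn.linear_model import LassoCV, Ridge

def adaptive_lasso(X, y, gamma=1.0, eps=1e-6):
    # Step 1: an initial estimate of the coefficients (here: a Ridge fit).
    beta_init = Ridge(alpha=1.0).fit(X, y).coef_
    w = 1.0 / (np.abs(beta_init) ** gamma + eps)   # adaptive weights w_j
    # Step 2: solve the weighted L1 problem by rescaling each column of X by 1/w_j.
    Xw = X / w
    lasso = LassoCV(cv=5).fit(Xw, y)
    return lasso.coef_ / w                         # map coefficients back to the original scale
```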
4.4. Elastic Net
In a manner similar to a stretchable fishing net, the elastic net selects all relevant features and performs continuous shrinkage [40,51]. Unlike Lasso, which only selects individual features while disregarding the relationships that may exist between features, elastic net regularization selects groups of correlated variables. It combines the advantages of both Lasso and Ridge by setting certain coefficients to exactly zero, promoting sparsity, while integrating the $\ell_2$ penalty to shrink the coefficients towards zero and stabilize the estimation. The optimization problem of the elastic net is given by [40]:
$$\hat{\boldsymbol{\beta}}^{\mathrm{enet}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta} \right)^2 + \lambda_1 \sum_{j=1}^{d} \left| \beta_j \right| + \lambda_2 \sum_{j=1}^{d} \beta_j^2,$$
where $\lambda_1$ and $\lambda_2$ are two regularization parameters. The elastic net allows for the adjustment of the mixing parameter defined by the ratio of the $\ell_1$ to $\ell_2$ penalties, providing flexibility to adapt to different datasets and modeling scenarios. It eliminates the restriction on the number of selected features (Lasso) and stabilizes the selection from highly correlated features by adding a quadratic component to the penalty (Ridge). A strategy for efficiently applying the elastic net algorithm is shown in [40]. Ref. [51] conducted exhaustive numerical experiments on real datasets and concluded that, in terms of prediction accuracy, the elastic net consistently outperforms Lasso. Following the work of [47], the authors in [52] proposed the relaxed adaptive Lasso for more accurate identification of relevant features and higher prediction accuracy, particularly in datasets with more features than instances.
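A minimal usage sketch with scikit-learn's ElasticNetCV is given below; the grid of mixing ratios and the synthetic, correlated design are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=60, n_informative=8,
                       effective_rank=10, noise=5.0, random_state=0)  # correlated features
Xs = StandardScaler().fit_transform(X)

enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 0.95], cv=5, random_state=0).fit(Xs, y)
selected = np.flatnonzero(enet.coef_)   # correlated features tend to enter as groups
print("chosen l1_ratio:", enet.l1_ratio_, "| selected:", len(selected))
```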
4.5. Group Lasso
In machine learning, integrating problem-specific assumptions into the learning process can lead to greater predictive accuracy. For applications with naturally grouped features, such as different levels of a categorical variable or variables within the same domain, group Lasso is more appropriate [41]. Group Lasso extends the Lasso optimization and provides a structured approach to the selection and elimination of predefined groups of features. Suppose that the features are divided into $G$ disjoint groups $I_1, \ldots, I_G$, where each group $I_g$ constitutes a number of related features. Ref. [41] proposed the following group Lasso ($\ell_{2,1}$) minimization problem:
$$\hat{\boldsymbol{\beta}}^{\mathrm{glasso}} = \arg\min_{\boldsymbol{\beta}} \left\lVert \mathbf{y} - \sum_{g=1}^{G} X_{I_g} \boldsymbol{\beta}_{I_g} \right\rVert_2^2 + \lambda \sum_{g=1}^{G} w_g \left\lVert \boldsymbol{\beta}_{I_g} \right\rVert_2,$$
where $X_{I_g}$ is the submatrix of $X$ with columns corresponding to the features in group $I_g$, $\boldsymbol{\beta}_{I_g}$ is the coefficient vector of that group, and $w_g$ is a penalty weight computed based on the size of group $I_g$ (typically $w_g = \sqrt{|I_g|}$). Group Lasso can handle multicollinearity between features such that if a group consists of correlated features, the procedure either selects or eliminates them together. The authors in [53] extended the group Lasso to also allow for the selection of individual features within the selected groups. The resulting extension is called sparse group Lasso and is mathematically formulated by the following convex optimization problem:
$$\hat{\boldsymbol{\beta}}^{\mathrm{sglasso}} = \arg\min_{\boldsymbol{\beta}} \left\lVert \mathbf{y} - X \boldsymbol{\beta} \right\rVert_2^2 + \lambda_1 \left\lVert \boldsymbol{\beta} \right\rVert_1 + \lambda_2 \sum_{g=1}^{G} w_g \left\lVert \boldsymbol{\beta}_{I_g} \right\rVert_2,$$
where $\lambda_1$ controls the strength of the $\ell_1$ Lasso penalty (individual feature selection) and $\lambda_2$ controls the strength of the group-Lasso $\ell_{2,1}$ penalty (group feature selection). Sparse group Lasso is a powerful approach, balancing between individual and group feature selection. Despite the computational challenges of the technique, it remains popular in application domains such as genomics, finance, neuroscience, and other fields where features have natural group structures [54].
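To make the group penalty concrete, the sketch below solves the group Lasso objective above with a plain proximal-gradient (ISTA) loop; the group structure, step size, and penalty level are illustrative, and dedicated solvers would be preferable in practice.

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient for  min_b 0.5*||y - Xb||^2 + lam * sum_g sqrt(|g|) * ||b_g||_2."""
    d = X.shape[1]
    beta = np.zeros(d)
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))  # gradient step on the squared loss
        for g in groups:                          # block soft-thresholding per group
            thresh = step * lam * np.sqrt(len(g))
            norm_g = np.linalg.norm(z[g])
            beta[g] = 0.0 if norm_g <= thresh else (1.0 - thresh / norm_g) * z[g]
    return beta

# groups: e.g. [[0, 1, 2], [3, 4], [5, 6, 7, 8]] for nine features in three groups.
```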
4.6. Regularized Decision Trees
Regularized decision trees represent a generalization of the traditional decision tree algorithms, aiming to improve their performance in feature selection and mitigate overfitting. They provide an effective way to select important features by incorporating penalties on the complexity of the tree, resulting in the selection of the most relevant features while ignoring noise and irrelevant features. Several penalizations of the decision tree have been developed, including tree depth, node complexity, and feature importance scores [42]. In this subsection, we focus on the regression tree, where the prediction at each leaf node is typically the average of the target variable in a given region $R_m$ and the splitting criterion is based on the maximum variance reduction between the regions defined by the split, thereby minimizing the prediction error defined by the following impurity loss function:
$$L(T) = \sum_{m} \sum_{i \in R_m} \left( y_i - \bar{y}_{R_m} \right)^2,$$
where $\bar{y}_{R_m}$ is the average of the target in region $R_m$, $m = 1, \ldots, M$. The mathematical formulation of the tree regularization is given by the following optimization problem:
$$\min_{T} \; L(T) + \lambda \, \Omega(T),$$
where $\Omega(T)$ is a tree complexity penalty function and $\lambda$ is a regularization parameter balancing between minimum prediction error and tree complexity [55].
Common tree regularization techniques include the following:
Tree depth limits the depth of the tree, controlling for excessive feature interactions. The penalty function is $\Omega(T) = |T|$, where $|T|$ denotes the number of leaf nodes. A deeper tree has more splits, and each additional split increases the likelihood of overfitting. High values of $\lambda$ force a shallower depth, the selection of the most informative features, and lower model complexity [56].
Feature importance penalizes features with low importance. The penalty is based on the total impurity reduction $\mathrm{Imp}(j) = \sum_{t \in T_j} \Delta I(t)$ attributable to feature $j$, where $\Delta I(t)$ is the reduction in impurity, which is typically the variance of the target variable, at node $t$, and $T_j$ is the set of nodes that split on feature $j$. A larger $\mathrm{Imp}(j)$ implies a more informative feature. With this approach, features with insignificant contributions are eliminated [42].
L1 (Lasso-like) regularization on leaf predictions shrinks some leaf values to zero, leading to the elimination of weak features. This technique is inspired by Lasso penalization, with $\Omega(T) = \sum_{t} |w_t|$, where $w_t$ is the weight (predicted value) of leaf $t$ [57].
L2 (Ridge-like) regularization controls for feature dependencies and uses $\Omega(T) = \sum_{t} w_t^2$ [57].
Minimum samples per split regularization forces at least a minimum number of samples at each leaf node in order to prevent the tree from creating leaves that represent only a small number of data points, which could lead to overfitting and the selection of unimportant features. Splits producing nodes with fewer than $n_{\min}$ samples are penalized, where $n_{\min}$ is the minimum number of samples per split [58].
Cost complexity pruning ensures the removal of parts of the tree that do not provide power in predicting the target variable. The technique introduces a penalty based on the number of terminal nodes in the tree, i.e., $\Omega(T) = |T|$ [59].
Combining the above regularization techniques, the overall objective for a regularized regression tree can be constructed with multiple penalization terms, controlling the complexity of the tree in various ways.
Decision trees are inherently easy to interpret, and with regularization, their simplicity is enhanced and their complexity is reduced. Regularization also enhances the ability of decision trees to capture nonlinear relationships between features and target variables, as well as their robustness to outliers. There exists a risk of over-regularization with regression decision trees, leading to the loss of important features [56].
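In practice, several of these penalties map directly onto hyperparameters of standard implementations; the sketch below combines a depth limit, a minimum leaf size, and cost-complexity pruning in scikit-learn's DecisionTreeRegressor and then drops features with near-zero impurity-based importance. The data and the cutoff are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

tree = DecisionTreeRegressor(
    max_depth=4,           # tree-depth penalty
    min_samples_leaf=20,   # minimum samples per split/leaf
    ccp_alpha=0.01,        # cost-complexity pruning strength
    random_state=0,
).fit(X, y)

# Keep features whose total impurity reduction is non-negligible.
importances = tree.feature_importances_
selected = np.flatnonzero(importances > 1e-3)
```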
4.7. Dantzig Selector
Introduced by [43], the Dantzig selector is designed for high-dimensional regression problems, particularly those in which the number of features exceeds the number of instances. It is closely related to Lasso but employs a different approach to feature selection. Features selected by the Dantzig selector have coefficients $\boldsymbol{\beta}$ that satisfy the following $\ell_1$ regularization problem:
$$\min_{\boldsymbol{\beta}} \left\lVert \boldsymbol{\beta} \right\rVert_1 \quad \text{subject to} \quad \left\lVert X^{\top} \left( \mathbf{y} - X \boldsymbol{\beta} \right) \right\rVert_{\infty} \le s \, \sigma,$$
where $s$ is a positive scalar and $\sigma$ is the standard deviation of the prediction errors. The Dantzig selector encourages sparsity in the estimated coefficients and can handle multicollinear features. It is designed to handle noise in the data effectively and can produce a consistent feature selection process. However, its limitations, such as computational complexity and the assumption of linear relationships between features and target variables, necessitate careful consideration when applying it.
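Because both the objective and the constraint are linear in the coefficients once the absolute values are split, the problem can be cast as a linear program; the sketch below does so with SciPy by writing beta as the difference of two non-negative vectors (delta stands for the product of s and sigma and is treated as a single user-supplied constant here).

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, delta):
    """Solve  min ||b||_1  subject to  ||X^T (y - X b)||_inf <= delta  as a linear program."""
    d = X.shape[1]
    G, c = X.T @ X, X.T @ y
    # Write b = u - v with u, v >= 0, so that ||b||_1 = sum(u) + sum(v).
    obj = np.ones(2 * d)
    A_ub = np.vstack([np.hstack([G, -G]),     #  G (u - v) <= c + delta
                      np.hstack([-G, G])])    # -G (u - v) <= delta - c
    b_ub = np.concatenate([c + delta, delta - c])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v
```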
4.8. Discussion
Regularization strategies have rapidly become an essential part of machine learning, improving the performance of the learning algorithm by selecting a subset of relevant features while preventing model overfitting. They are effective in extremely high-dimensional problems, particularly those where the number of features significantly exceeds the number of instances. The strategies discussed in this section are focused on the supervised learning of linear models. Regularization has also been broadly investigated in other learning models, such as $\ell_1$-logistic regression [60,61,62], the $\ell_1$-SVM [63], and the Hybrid Huberized SVM (HHSVM), where the characteristics of the $\ell_1$ and $\ell_2$ norms are utilized simultaneously with the SVM [64], in addition to several $\ell_1$-regularized linear discriminant analysis (LDA) methods [65,66] that have been developed for classification problems in high-dimensional datasets.
Although regularized approaches have several advantages, there are also challenges associated with these techniques. They often require strong model assumptions that may not hold in real-world situations [55]. Furthermore, regularized feature selection methods suffer from several computational challenges that can affect their efficiency and effectiveness. In addition to the significant computational overhead required by many feature selection techniques in high dimensions, regularized methods also suffer from convergence issues and local minima, particularly for non-convex optimization problems [67]. Researchers have discussed other computational challenges of regularization methods, including scalability challenges due to the iterative optimization and memory-intensive requirements for large datasets [68], and model underfitting due to over-regularization and inappropriate selection of the regularization parameter $\lambda$ [40]. Tuning the value of $\lambda$ to find the optimal level of regularization can itself be computationally expensive. Cross-validation is often used to determine the best value of $\lambda$, further increasing the computational load, particularly for large datasets or when performing a grid search over a wide range of values.
Numerous approaches have been developed in the literature to mitigate or reduce the challenges associated with regularized strategies for feature selection. Prominent approaches include feature screening to eliminate irrelevant features before model training [69,70,71] and hybrid feature selection, which provides a robust framework for effectively combining multiple feature selection techniques to leverage the strengths of each in finding the optimal and most informative feature subset efficiently [6,72,73].
5. Bayesian-Based Feature Selection
Bayes’ theorem provides a principled framework for updating probabilities based on observed evidence. It forms the foundation of probabilistic reasoning, has been extensively employed in the development of feature selection algorithms, and is expressed as follows:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},$$
where $P(A \mid B)$ represents the posterior probability of event $A$ given evidence $B$, $P(B \mid A)$ is the likelihood, $P(A)$ is the prior probability of $A$, and $P(B)$ is the evidence or marginal likelihood. A diverse range of Bayes-based methods has been proposed, spanning from basic Bayesian approaches such as the Bayesian Lasso to advanced Bayesian variable selection techniques such as the relative belief ratio. These methods exploit the probabilistic dependencies between features and target variables to identify the most informative and discriminative features, enabling a more efficient and effective representation of the data for machine learning applications.
Table 3 presents a summary of the Bayesian-based feature selection methods. A comprehensive discussion of each method, including its mathematical formulation, strengths, and limitations, is provided in the following subsections. This section covers seven key Bayesian feature selection techniques, offering insights into their functionality and practical applications.
5.1. Bayesian Lasso
The Bayesian Lasso is based on the original Lasso method presented in Section 4.2. It was introduced by Park and Casella [74] to extend the classical Lasso by incorporating a fully Bayesian framework for feature selection and regularization in high-dimensional regression problems. In this method, a Laplace prior is placed on the regression coefficients to encourage sparsity while incorporating uncertainty in the parameters through posterior inference. The Laplace prior for the $j$-th coefficient is formulated as
$$\pi(\beta_j \mid \lambda) = \frac{\lambda}{2} \exp\left( -\lambda \left| \beta_j \right| \right),$$
where $\lambda$ denotes the regularization parameter that controls the level of sparsity. Unlike classical Lasso, in Bayesian Lasso, $\lambda$ is treated as a random variable and is typically assigned a hyperprior, such as a Gamma distribution, to facilitate data-driven estimation of the regularization parameter. The posterior distribution of the coefficients is obtained by combining the Laplace prior with the likelihood function of the data. Given a response vector $\mathbf{y}$ and a feature matrix $X$, the posterior distribution is expressed as
$$p(\boldsymbol{\beta} \mid \mathbf{y}, X) \propto p(\mathbf{y} \mid X, \boldsymbol{\beta}) \prod_{j=1}^{d} \pi(\beta_j \mid \lambda).$$
Inference in the Bayesian Lasso is typically conducted using Markov Chain Monte Carlo (MCMC) techniques, such as Gibbs sampling, which facilitate the estimation of the full posterior distribution of the coefficients. This allows for both feature selection and the quantification of uncertainty associated with the selected features.
The Bayesian Lasso offers several advantages, including its ability to incorporate prior knowledge through the Bayesian framework, yielding improved sparsity and uncertainty quantification compared to classical Lasso. However, the computational complexity of MCMC sampling can be a limitation, especially for large-scale datasets. Additionally, the choice of the hyperprior for $\lambda$ may influence the selection of features, necessitating careful prior specification to achieve optimal performance.
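To make the Gibbs sampling step concrete, the sketch below implements a minimal sampler following the standard conditional distributions of Park and Casella, with the regularization parameter held fixed rather than given a hyperprior; it assumes a standardized X and a centered y and is meant only as an illustration.

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, lam=1.0, n_iter=2000, burn=500, seed=0):
    """Minimal Gibbs sampler for the Bayesian Lasso with a fixed regularization parameter."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta, sig2, tau2 = np.zeros(p), 1.0, np.ones(p)
    draws = []
    for it in range(n_iter):
        # beta | rest ~ N(A^{-1} X^T y, sig2 * A^{-1}),  A = X^T X + diag(1/tau2)
        A_inv = np.linalg.inv(XtX + np.diag(1.0 / tau2))
        beta = rng.multivariate_normal(A_inv @ Xty, sig2 * A_inv)
        # sig2 | rest ~ Inverse-Gamma
        resid = y - X @ beta
        shape = (n - 1) / 2 + p / 2
        rate = resid @ resid / 2 + beta @ (beta / tau2) / 2
        sig2 = 1.0 / rng.gamma(shape, 1.0 / rate)
        # 1/tau2_j | rest ~ Inverse-Gaussian(sqrt(lam^2 sig2 / beta_j^2), lam^2)
        mean = np.sqrt(lam**2 * sig2 / np.maximum(beta**2, 1e-12))
        tau2 = 1.0 / rng.wald(mean, lam**2)
        if it >= burn:
            draws.append(beta.copy())
    return np.asarray(draws)   # posterior draws of the coefficients
```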
5.2. Balamurugan and Rajaram Method
Balamurugan and Rajaram [75] introduced a Bayes’ theorem-based feature selection method to address challenges associated with high-dimensional datasets. In this method, feature reduction is achieved by leveraging Bayes’ theorem to compute conditional probabilities, facilitating the identification and elimination of redundant features. The primary objective of the method is to enhance classification accuracy and computational efficiency by retaining only the most informative features.
The methodology involves defining an initial set of features, represented as $F = \{f_1, f_2, \ldots, f_d\}$. For every possible pair of features $(f_i, f_j)$, the conditional probabilities are computed with respect to the class attribute $c$. Dependencies between features are identified by evaluating these conditional probabilities against a predefined threshold. Features exhibiting strong dependencies are subsequently removed to minimize redundancy while preserving the most relevant features for classification tasks.
This method effectively applies Bayesian principles to feature selection by modeling dependencies among features and the target variable. The integration of Bayes’ theorem ensures a systematic approach to dimensionality reduction, leading to improved classification performance in high-dimensional datasets. However, the method’s reliance on pairwise feature evaluation may introduce computational overhead, particularly in cases involving a large number of features. Despite this limitation, the approach remains a valuable tool for feature selection, offering improved interpretability and data efficiency.
5.3. Relevance Vector Machine
The Relevance Vector Machine (RVM), introduced by Tipping [76], has been developed as a sparse Bayesian learning framework for regression and classification tasks. This method extends the support vector machine (SVM) by incorporating Bayesian principles to achieve sparsity while providing probabilistic outputs. The RVM identifies a subset of training samples, known as relevance vectors, that contribute significantly to the predictive model, while the weights associated with irrelevant features are driven to zero through the Bayesian inference process. This characteristic allows for the automatic selection of informative features, making the RVM a robust and efficient tool for feature selection, particularly in high-dimensional datasets.
The prediction function of the RVM is expressed as
$$y(\mathbf{x}) = \sum_{i=1}^{N} w_i \, K(\mathbf{x}, \mathbf{x}_i) + w_0,$$
where $y(\mathbf{x})$ represents the predicted output for input $\mathbf{x}$, $K(\mathbf{x}, \mathbf{x}_i)$ is the kernel function that measures the similarity between input $\mathbf{x}$ and training sample $\mathbf{x}_i$, $w_i$ denotes the weight associated with the $i$-th basis function, and $w_0$ is the bias term. The kernel function $K(\cdot, \cdot)$ allows the RVM to capture nonlinear relationships by mapping the inputs into a higher-dimensional feature space.
A key feature of the RVM is its hierarchical Bayesian framework, where a zero-mean Gaussian prior is placed on each weight $w_i$, with its variance controlled by a hyperparameter $\alpha_i$. This prior is defined as
$$p(w_i \mid \alpha_i) = \mathcal{N}\left( w_i \mid 0, \alpha_i^{-1} \right),$$
where $\mathcal{N}(\cdot)$ represents a Gaussian distribution, and $\alpha_i$ denotes the hyperparameter governing the sparsity of $w_i$. During training, many of the $\alpha_i$ values tend to infinity, which forces the corresponding weights $w_i$ to zero, effectively pruning irrelevant features.
The Bayesian nature of the RVM offers several advantages, including robustness against overfitting and improved interpretability by highlighting the most informative features. Despite its advantages, the RVM’s reliance on iterative optimization and the need to estimate hyperparameters can introduce computational complexity, especially when applied to large datasets. Nonetheless, the RVM has been successfully employed in various feature selection tasks, demonstrating its effectiveness in improving model performance and enhancing computational efficiency.
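For illustration, the core of the training procedure can be written as the classic type-II maximum likelihood updates from Tipping's formulation; the sketch below assumes a precomputed design matrix Phi (e.g., kernel evaluations plus a bias column) and uses an ad hoc pruning threshold.

```python
import numpy as np

def rvm_fit(Phi, t, n_iter=200, alpha_max=1e9):
    """Sparse Bayesian regression: re-estimate prior precisions and prune basis functions."""
    N, M = Phi.shape
    alpha = np.ones(M)               # precision of the Gaussian prior on each weight
    beta = 1.0 / np.var(t)           # noise precision
    keep = np.arange(M)              # indices of the surviving basis functions
    for _ in range(n_iter):
        P = Phi[:, keep]
        Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[keep]))
        mu = beta * Sigma @ P.T @ t                       # posterior mean of the weights
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)        # "well-determinedness" factors
        alpha[keep] = gamma / (mu**2 + 1e-12)             # re-estimate prior precisions
        beta = (N - gamma.sum()) / (np.sum((t - P @ mu) ** 2) + 1e-12)
        keep = keep[alpha[keep] < alpha_max]              # alpha -> infinity: weight pruned
    return keep, alpha, beta
```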
5.4. Relevant Sample-Feature Machine
Relevant Sample-Feature Machine (RSFM) was introduced by Mohsenzadeh et al. [77] in 2013 as a feature selection method based on sparse Bayesian machine learning. It extends the Relevance Vector Machine (RVM) by incorporating a sparse Bayesian framework and Gaussian priors to achieve feature selection. RSFM operates as a sparse kernel-based learning model, where the feature selection process is guided by Bayesian inference to enhance sparsity and interpretability.
The output of the RSFM is predicted using a kernel expansion of the form
$$y(\mathbf{x}) = \sum_{i=1}^{N} w_i \, K(\mathbf{x}, \mathbf{x}_i) + w_0,$$
where $K(\mathbf{x}, \mathbf{x}_i)$ represents the kernel function measuring the similarity between input $\mathbf{x}$ and sample $\mathbf{x}_i$, and $\mathbf{w}$ denotes the weight vector associated with the basis functions. By exploiting the sparsity-inducing properties of Bayesian priors, irrelevant features are assigned weights close to zero, effectively pruning redundant information.
The RSFM method provides several advantages, including enhanced sparsity, improved model interpretability, and automatic relevance determination of both samples and features. The Bayesian approach allows the model to handle high-dimensional datasets efficiently while mitigating the risk of overfitting. However, the computational complexity associated with inference, particularly in large-scale applications, can be a limitation. The choice of kernel function and prior distributions also influences the model’s performance, necessitating careful tuning for optimal feature selection outcomes. Overall, the RSFM method represents a significant advancement in sparse Bayesian learning, offering an effective solution for feature selection in various domains, including biomedical data analysis and text classification.
5.5. Variational Embedded RSFM
The Variational Embedded RSFM (VRSFM) method was introduced by Mirzaei, Mohsenzadeh, and Sheikhzadeh [78] in 2017 as an enhancement of the RSFM framework. It employs the Bayesian model of RSFM with a focus on variational Bayesian approximation to improve feature selection for both classification and regression tasks. Prior Gaussian distributions are placed on the model parameters and their hyperparameters to enforce sparsity and facilitate robust feature selection.
In the VRSFM approach, the posterior distributions of the model parameters are approximated using variational Bayesian inference. Given an observation set $X$ and corresponding responses $\mathbf{y}$, the posterior distribution of the weight vector $\mathbf{w}$ is approximated by minimizing the Kullback–Leibler (KL) divergence between the variational distribution $q(\mathbf{w})$ and the true posterior $p(\mathbf{w} \mid X, \mathbf{y})$, expressed as follows:
$$q^{*}(\mathbf{w}) = \arg\min_{q} \mathrm{KL}\left( q(\mathbf{w}) \,\middle\|\, p(\mathbf{w} \mid X, \mathbf{y}) \right) = \arg\min_{q} \int q(\mathbf{w}) \ln \frac{q(\mathbf{w})}{p(\mathbf{w} \mid X, \mathbf{y})} \, d\mathbf{w}.$$
The variational inference approach provides an efficient approximation to the intractable posterior distributions, enabling the method to operate effectively even in scenarios with small-sized datasets. The optimization of the variational distribution parameters is conducted iteratively to minimize the KL divergence, ensuring that the approximated posterior closely resembles the true posterior.
By leveraging Bayesian principles, the VRSFM method balances model complexity and predictive performance while improving interpretability and robustness. This technique extends the applicability of RSFM, making it particularly suitable for feature selection in high-dimensional datasets with limited sample sizes. Despite its advantages, the performance of VRSFM is influenced by the choice of variational distributions and the complexity of the optimization process. The computational cost associated with iterative updates may be significant in large-scale applications. However, the method’s ability to provide probabilistic feature selection and uncertainty quantification makes it a valuable tool in various domains, including biomedical data analysis and financial modeling.
5.6. Bayesian Robit Regression with Hyper-Lasso Priors
Bayesian Robit Regression with Hyper-LASSO priors (BayesHL) has been developed as a feature selection method tailored for high-dimensional datasets, particularly those exhibiting grouping structures. This approach employs a heavy-tailed Robit model combined with Hyper-LASSO priors to achieve robust feature selection, ensuring sparsity and the capability to uncover grouping structures without necessitating a pre-specified grouping index. BayesHL has been effectively applied in gene expression analysis to identify subsets of genes associated with disease outcomes, such as the 5-year survival of endometrial cancer patients [79].
In the Robit regression model, binary outcomes $y_i \in \{0, 1\}$ are modeled by replacing the normal distribution in probit regression with a scaled Student’s t-distribution. The model is formulated as follows:
$$y_i = \mathbf{1}\left( \mathbf{x}_i^{\top} \boldsymbol{\beta} + \varepsilon_i > 0 \right), \qquad \varepsilon_i \sim t_{\nu},$$
where $\mathbf{x}_i$ denotes the vector of features for the $i$-th observation, $\boldsymbol{\beta}$ represents the vector of regression coefficients, $\varepsilon_i$ follows a scaled Student’s t-distribution $t_{\nu}$ with degrees of freedom $\nu$, and $\mathbf{1}(\cdot)$ is the indicator function.
To induce sparsity, a Cauchy prior is assigned to each regression coefficient $\beta_j$, expressed as
$$\pi(\beta_j) = \frac{1}{\pi s \left( 1 + (\beta_j / s)^2 \right)},$$
where $s$ is a small scale parameter ensuring sparse solutions. The heavy-tailed nature of the Cauchy prior allows for a few large coefficients (signals) while shrinking irrelevant ones (noise) toward zero.
The posterior distribution of the parameters is given by
$$p(\boldsymbol{\beta}, \boldsymbol{\lambda} \mid \mathbf{y}, X) \propto \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}) \prod_{j=1}^{d} p(\beta_j \mid \lambda_j) \, p(\lambda_j),$$
where $\lambda_j$ represents the latent variance for each coefficient, modeled as
$$\beta_j \mid \lambda_j \sim \mathcal{N}(0, \lambda_j).$$
The utilization of heavy-tailed priors in BayesHL enables more aggressive shrinkage of irrelevant features compared to traditional Lasso, while retaining significant predictors. This characteristic is particularly advantageous in high-dimensional settings with potential grouping structures among features. However, the computational complexity associated with fully Bayesian methods, including the need for Markov Chain Monte Carlo (MCMC) sampling, can be a limitation, especially with large-scale datasets. Despite this, BayesHL provides a flexible and powerful framework for feature selection, offering probabilistic interpretations and accommodating complex data structures.
5.7. Relative Belief Ratio
The Relative Belief Ratio (RBR) feature selection method [80] offers a Bayesian approach to feature selection in binary classification problems. This method evaluates the significance of each feature by quantifying the change in belief from the prior to the posterior distribution, thereby identifying and ranking features based on the relative belief strength and hence importance.
The RBR proposed by [81] is defined as the ratio of the posterior density to the prior density at a specific parameter value. For a parameter $\theta$ with prior density $\pi(\theta)$ and posterior density $\pi(\theta \mid x)$, given data $x$, the relative belief ratio is expressed as
$$RB(\theta \mid x) = \frac{\pi(\theta \mid x)}{\pi(\theta)}.$$
A value of $RB(\theta \mid x) > 1$ indicates evidence in favor of $\theta$, while $RB(\theta \mid x) < 1$ suggests evidence against $\theta$. This measure allows for a systematic comparison of features by assessing how the observed data update the prior beliefs about each feature’s relevance.
To quantify the strength of the evidence provided by the RBR, a strength function is defined as
$$S(\theta_0) = \Pi\left( RB(\theta \mid x) \le RB(\theta_0 \mid x) \,\middle|\, x \right),$$
where $\theta_0$ is a baseline parameter value and $\Pi(\cdot \mid x)$ denotes the posterior probability. The strength function $S(\theta_0)$ represents the posterior probability that the relative belief ratio for $\theta$ is less than or equal to that for $\theta_0$. A lower strength value indicates stronger evidence in favor of $\theta_0$, providing a calibrated measure of the evidence’s robustness. In practical applications, the RBR method involves estimating the relative belief ratios for each feature and computing a strength score to rank the features. The strength score for feature $f_j$ is calculated by integrating the relative belief ratios over the parameter space, providing a quantitative measure of each feature’s importance. Features with lower strength scores are considered more significant.
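As a toy illustration of these two quantities, the snippet below computes the relative belief ratio and its strength for a Bernoulli success probability under a Beta prior, estimating the strength by Monte Carlo sampling from the posterior; the prior, the data, and the assessed value theta0 are all hypothetical.

```python
import numpy as np
from scipy.stats import beta

a, b = 1.0, 1.0                      # Beta(a, b) prior on the success probability
n, k = 40, 28                        # hypothetical data: k successes in n trials
prior, post = beta(a, b), beta(a + k, b + n - k)

def rb(theta):
    """Relative belief ratio: posterior density divided by prior density at theta."""
    return post.pdf(theta) / prior.pdf(theta)

theta0 = 0.5
rb0 = rb(theta0)                                  # RB > 1: evidence in favour of theta0
samples = post.rvs(size=100_000, random_state=0)
strength = np.mean(rb(samples) <= rb0)            # posterior P( RB(theta|x) <= RB(theta0|x) )
print(f"RB({theta0}) = {rb0:.2f}, strength = {strength:.3f}")
```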
The RBR method has been applied to both synthetic and real-world datasets, demonstrating its effectiveness in identifying significant features and achieving high classification accuracy in high-dimensional gene datasets, outperforming in general traditional feature selection techniques such as Information Gain and Symmetrical Uncertainty. However, the implementation of the RBR method requires careful elicitation of prior distributions, and can be computationally intensive due to the necessity of estimating posterior distributions. Despite these challenges, the RBR method provides a robust framework for feature selection, particularly suited for high-dimensional datasets where traditional methods may falter.
5.8. Discussion
The Bayesian-based methods discussed above illustrate a range of strategies for leveraging probabilistic reasoning in feature selection. These methods vary in complexity and application, from sparsity-inducing approaches like Bayesian Lasso to more advanced frameworks such as the RBR method. Bayesian Lasso builds on classical Lasso by incorporating prior distributions and providing uncertainty quantification, making it particularly useful in regression tasks where collinearity among features is present. However, its reliance on MCMC sampling increases computational demands, requiring careful tuning of hyperparameters.
More sophisticated methods, including RVM and its extensions, like RSFM and VRSFM, combine sparse Bayesian learning with kernel-based modeling to identify relevant features. These methods excel in high-dimensional settings with nonlinear relationships by enforcing sparsity through hierarchical priors. While these approaches offer flexibility and improved model interpretability, their computational overhead, particularly in large datasets, necessitates efficient optimization techniques and kernel selection strategies.
Bayesian methods tailored for complex data structures, such as Bayesian Robit regression with Hyper-Lasso priors (BayesHL) and the Balamurugan and Rajaram method, offer solutions for scenarios involving dependencies or grouping structures among features. BayesHL effectively handles high-dimensional datasets with group structures by employing heavy-tailed priors, whereas the Balamurugan and Rajaram method focuses on reducing redundancy through pairwise dependency evaluation. Both methods enhance interpretability and classification performance, but may face scalability issues in large datasets.
The RBR method offers a unique Bayesian measure for feature importance by updating prior beliefs based on observed evidence. This method is particularly suited for high-dimensional datasets with limited sample sizes, as it ranks features using a strength score derived from the relative belief ratio. Although the RBR method provides a principled framework for feature selection, it requires careful prior elicitation and is computationally intensive due to the need for posterior estimation.
In practical applications, the combination of multiple Bayesian methods can be advantageous. For example, Bayesian Lasso can be used as an initial filtering step to eliminate irrelevant features, followed by kernel-based methods such as RVM or RSFM for refining the feature set. Advanced techniques like BayesHL or RBR can further prioritize domain-specific features or handle grouping structures. The choice of methods depends on the specific task, data characteristics, and computational resources. Collectively, these Bayesian-based approaches emphasize the importance of probabilistic modeling in feature selection, particularly in addressing uncertainty, dependencies, and high-dimensional feature spaces.
7. Challenges and Future Directions
One key challenge in mathematical feature selection involves handling higher-order interactions among predictors while preserving interpretability. Traditional methods often prioritize marginal or pairwise relationships, yet many real-world domains such as genomics or climate science demand a deeper understanding of complex dependencies. Designing algorithms that capture intricate structural associations among features, while retaining transparency for end-users and domain experts, thus presents a compelling avenue for future research.
A second area of interest encompasses non-convex and robust selection approaches that address issues such as outliers, heavy-tailed distributions, or adversarial data contamination. Although $\ell_1$-based regularization methods often dominate regularization-based feature selection, alternative penalty functions (e.g., SCAD, MCP) and robust cost formulations can potentially enhance performance, especially under noisy conditions. The effective integration of these techniques with existing feature selection pipelines would address data integrity concerns without compromising scalability or computational feasibility.
Emerging data paradigms also highlight the need for online and streaming feature selection procedures. In such scenarios, data arrive continuously rather than in fixed batches, making static, offline methods insufficient or computationally expensive. New algorithms must update feature importance adaptively, possibly by maintaining approximate variance–covariance structures in real time or by employing sequential Bayesian inference. Such advances would open opportunities for real-time analytics, a rapidly growing domain in the face of expanding datasets and evolving streaming technologies.
Another notable challenge in evaluating feature selection algorithms is the scarcity of datasets with pre-established relevant features. In this regard, synthetic datasets can play a pivotal role in advancing feature selection methodologies, as they enable researchers to rigorously assess algorithmic performance under controlled conditions and validate new approaches with known ground truths [97,98].
The integration of mathematical feature selection principles in deep learning architectures constitutes another promising direction. Sparse regularization layers (e.g., group-Lasso) or hierarchical Bayesian priors can be embedded directly into neural networks to enhance interpretability and reduce computational overhead. Moreover, many application areas would benefit from domain-informed Bayesian priors that anchor feature selection in specialized scientific knowledge. Whether derived from expert input, experimental constraints, or biological pathways, carefully formulated priors can produce both robust and contextually meaningful results. Finally, ensuring that these methods scale gracefully remains a persistent challenge, given the ever-increasing size and complexity of modern datasets. Parallelization strategies, more efficient sampling procedures, and approximate inference techniques that preserve theoretical guarantees may help accommodate real-world computational constraints, allowing the theoretical benefits of mathematical feature selection to materialize in practical, large-scale environments.
8. Conclusions
Mathematical methods form the backbone of many state-of-the-art feature selection techniques, offering theoretical rigor and improved interpretability in both supervised and unsupervised contexts. By systematically examining variance-based, regularization-based, and Bayesian approaches, this review highlights the diverse mathematical paradigms that guide the identification and retention of informative features.
Variance-based methods, which rely on measures of dispersion, generally excel at fast dimensionality reduction, but may be less adept at capturing task-specific relationships between predictors and target variables. Regularization-based algorithms circumvent this issue by embedding feature selection directly in the model training phase, ensuring that sparsity emerges naturally through penalized optimization. Bayesian methods, by contrast, leverage prior distributions and posterior inference to quantify and incorporate uncertainty in a flexible, robust manner.
Although these techniques are each built on distinct theoretical frameworks, their integration can yield synergistic benefits. For example, a preliminary variance-based filter can reduce dimensionality before more computationally intensive Bayesian or regularization-based methods refine the selection process. The multi-stage strategy exemplifies how theoretical assumptions and practical considerations jointly drive successful feature selection outcomes.
Ultimately, the choice of a feature selection method hinges on domain requirements, data characteristics, and computational resources. From capturing nonlinear interactions to employing robust priors informed by scientific expertise, the continued expansion of mathematical feature selection promises both heightened predictive performance and deeper insights into high-dimensional data.