Article

A New Alternating Suboptimal Dynamic Programming Algorithm with Applications for Feature Selection

Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška Cesta 46, SI-2000 Maribor, Slovenia
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(13), 1987; https://doi.org/10.3390/math12131987
Submission received: 17 May 2024 / Revised: 18 June 2024 / Accepted: 25 June 2024 / Published: 27 June 2024
(This article belongs to the Special Issue Dynamic Programming)

Abstract
Feature selection is predominantly used in machine learning tasks, such as classification, regression, and clustering. It selects a subset of features (relevant attributes of data points) from a larger set that contributes as optimally as possible to the informativeness of the model. There are exponentially many subsets of a given set, and thus, the exhaustive search approach is only practical for problems with at most a few dozen features. In the past, there have been attempts to reduce the search space using dynamic programming. However, models that consider similarity in pairs of features alongside the quality of individual features do not provide the required optimal substructure. As a result, algorithms, which we will call suboptimal dynamic programming algorithms, find a solution that may deviate significantly from the optimal one. In this paper, we propose an iterative dynamic programming algorithm, which inverts the order of feature processing in each iteration. Such an alternating approach allows for improving the optimization function by using the score from the previous iteration to estimate the contribution of unprocessed features. The iterative process is proven to converge and terminates when the solution does not change in three successive iterations or when the number of iterations reaches the threshold. Results in more than 95% of tests align with those of the exhaustive search approach, being competitive and often superior to the reference greedy approach. Validation was carried out by comparing the scores of output feature subsets and examining the accuracy of different classifiers learned on these features across nine real-world applications, considering different scenarios with various numbers of features and samples. In the context of feature selection, the proposed algorithm can be characterized as a robust filter method that can improve machine learning models regardless of dataset size. However, we expect that the idea of alternating suboptimal optimization will soon be generalized to tasks beyond feature selection.

1. Introduction

Nowadays, in the era of the Internet of Things, social media platforms, Earth observation, crowdsourcing, medical imaging equipment, various biomedical signal measurement devices, wearable sensors, digital twins, etc., we are flooded with vast amounts of data. The measurable characteristics used as attributes or input variables to describe an object of interest are called features, while individual data points (objects of interest) represent feature vectors. Each feature thus corresponds to a dimension in the vector. Vast amounts of data allow the creation of a large repertoire of features, but usually, not all features are relevant for further processing of the objects of interest. Irrelevant features often slow down the model, direct it towards a wrong solution, or even make reaching a solution infeasible. In order to address these challenges, feature selection approaches were introduced that select a subset of the most informative features while discarding irrelevant or redundant ones [1]. Feature selection plays a vital role in model construction in statistical analysis, dimensionality reduction, signal processing, pattern recognition, data visualization, and, particularly, in various machine learning tasks, such as classification, regression, and clustering. Its aim is to improve the model’s performance, including its accuracy, generalizability, and interpretability, and to reduce overfitting and computational cost [2].
Feature selection methods can be grouped into three categories [1]. Filter methods evaluate candidate subsets with independent criteria that exploit essential characteristics of the training data. They are fast, but the solution may deviate significantly from the optimal one. A wrapper approach uses a learning algorithm for subset evaluation, such as a classifier or regressor. Its performance is usually better, but it is also much slower than the filter approach. Embedded methods interact with a learning algorithm but at a lower computational cost than the wrapper approach. They use independent criteria to identify optimal subsets for a known cardinality. The learning algorithm is then used to select the final optimal subset across different cardinalities [3].
Regardless of the approach chosen, feature selection can be viewed as an optimization problem as it searches for the best-evaluated feature subset [4]. Different search strategies can be used, including sequential search (greedy approach), exponential search (exhaustive search, beam search, or branch and bound), and random search [3]. Conversely, dynamic programming (DP) is not as commonly applied to feature selection as other methods. This popular optimization approach breaks a problem into smaller subproblems and uses their solutions to construct the solution to the larger problem. An optimal solution can be found if the problem exhibits optimal substructure. This means that an optimal solution to the problem contains optimal solutions to subproblems [5,6]. However, DP is usually computationally demanding, so for reasons of feasibility, acceptable speed, and the ability to handle problems with higher dimensionality, it is also required that the number of subproblems is not too high and that the subproblems overlap, suggesting that it makes sense to record their solutions in a table and reuse them [6].
In this paper, we highlight the possibilities of using DP in feature selection, analyze the difficulties of existing (rare) approaches, and propose alternative solutions. An evaluation criterion based on feature quality, correlation, and/or statistics does not generally provide an optimal substructure since, e.g., the union of two optimal subsets is not necessarily optimal due to possible high correlations between pairs of features, one from each subset. It is possible to achieve an optimal solution for specific problems by adapting the evaluation criterion, but this spoils generality (e.g., wrappers or embedded selection methods are tied to specific machine learning models and prone to overfitting [2]), which is among our primary goals. We thus focused on finding the best possible suboptimal solution. We studied approximate (ADP) [7] and iterative [8] dynamic programming (IDP) methods and developed a solution that we called alternating suboptimal dynamic programming (ASDP). It inverts the order of feature processing in each iteration and improves the optimization criterion by using the score from the previous iteration to estimate the contribution of unprocessed features. Its contributions are as follows:
  • A better or at least the same evaluation score of the final solution set compared to the score after a single iteration. Furthermore, the solution found in each iteration is never worse than the one found in the previous iteration.
  • Optimal solution according to the evaluation score found in more than 95% of cases.
  • Polynomial worst-case time complexity (O(n⁴)) allows significantly larger input feature sets to be considered compared to the exhaustive search approach.
  • Comparable and, in some cases, better classification accuracy on the basis of the feature set selected by the new method than when using our previous graph-based greedy feature selection method. In this respect, we have already demonstrated the competitiveness of the latter in our previous work [9] compared to state-of-the-art classification approaches and applied feature selection methods.
The rest of the paper is structured as follows. In Section 2, we survey existing solutions in feature evaluation and selection, the use of DP in feature selection, and suboptimal DP algorithms. In the most research-intensive Section 3, we first summarise our preliminary filter method for feature selection based on graph cuts, which can be used alone or as a preprocessing for the new alternating suboptimal DP method presented afterward. In Section 4, we show and analyze the results, and, finally, in Section 5, we discuss the work carried out, its strengths, and some weaknesses that pose challenges for future research.

2. Related Works

As the topic presented here combines several challenges, the state-of-the-art review must address several areas. First, in Section 2.1, we address feature evaluation, i.e., procedures and metrics to assess the contribution of individual features and/or a feature subset to a machine learning model. Feature evaluation is the basis for feature selection, which we review in Section 2.2. The goal is to optimally select a subset of the input features for solving a given machine learning task. We wanted to approach the problem using dynamic programming, so in Section 2.3, we review the use of this algorithm design strategy in feature selection. However, such methods are rare, time-demanding, and practically always offer only partial solutions. Consequently, the solution proposed in this paper is suboptimal; thus, Section 2.4 briefly reviews the use of suboptimal DP for various problems, including feature selection.

2.1. Feature Evaluation

Feature evaluation is a critical step in the feature selection process. It includes assessing the contribution of input features to the performance of a machine learning model [2]. Features that contribute the most information to the predictive model can improve the model’s performance, reduce overfitting, and accelerate the learning process [2,10,11]. For classification purposes, feature evaluation can be achieved directly by evaluating the classification models built for each feature [2]. The choice of classifier strongly influences the evaluation results, while its learning is often time-consuming. The latter is particularly evident in cases with a large number of features [2,10,11]. Similar drawbacks are also noted for regression purposes, as feature evaluation can be achieved using computationally demanding regression approaches built for each feature.
To avoid using a computationally demanding classifier, techniques for analyzing the discriminatory power of features are introduced. Early approaches focused on the ratio between the distances of samples of different classes and samples of the same class [12]. Examples of these include the Fisher criterion [13], the maximum margin criterion [14], and the Laplacian score [15]. In the case of regression tasks, techniques are based on calculating the correlation coefficients (e.g., Pearson’s or Spearman’s) between the feature’s values and the continuous target variable [2,10]. However, the mentioned techniques for classification and regression tasks cover only linear interdependencies between feature values and the target variable.
Approaches that capture nonlinear dependencies are based on the information contribution. For classification purposes, this technique evaluates features according to the ratio between the classes’ entropy and the feature values’ conditional entropy [16,17,18]. Regarding accuracy, similar results can be obtained using the computationally more efficient Gini impurity [19,20,21]. In the case of multi-class classification, however, the Gini impurity is biased towards majority classes and prone to overfitting [21]. Conversely, techniques based on mutual information successfully capture nonlinear relationships between features and the target variable in both classification [22] and regression [23]. However, such metrics favor features with many different values, which can lead to overfitting.

2.2. Feature Selection

Overfitting negatively affects the power of machine learning methods and cripples predictive accuracy. Irrelevant features lower the predictive power of the model. Feature selection methods can overcome both limitations [24]. We divide these methods into three groups:
  • Filters;
  • Wrappers;
  • Embedded methods.
Filtering is usually performed using a threshold value. Although such methods are computationally very efficient, their classification power largely depends on the feature evaluation techniques [2,10,25]. The latter often consider only pairwise dependencies between feature values and the target variable, ignoring correlations between features [2,12]. As a result, the prediction efficiency is limited.
In [26,27], a method calculates the efficiency of separation between different classes in the local neighborhood of selected samples to evaluate the feature. This enables low execution times because it does not use all of the samples contained in the dataset. However, calculations are usually inaccurate due to the limited number of considered samples. The method also does not consider the correlation between features. In [28], the authors proposed an approach that selects features highly correlated with the class labels and in low correlation with each other. A similar method is proposed in [29] but for regression purposes. The only difference is that it selects features highly correlated with the target variable. However, neither technique considers the interaction between features and only considers the linear interdependence between feature values and target variables.
In [30], the authors propose a two-stage feature selection method in which the evaluation is based on the calculation of mutual information while at the same time considering the correlation between pairs of features. In [23], the adequacy of the mutual information for regression is considered. However, in the case of a small number of training samples, inaccurate estimates of mutual information may appear, and the method is biased towards features with a large number of different values due to the use of this metric. In [31], a feature selection framework for large datasets was proposed based on a cascade of methods capable of detecting nonlinear relationships between two features and designed to achieve a balance between accuracy and speed.
Conversely, wrapper methods select a subset of features that maximizes the performance of a given classifier or regressor [2]. Wrappers are considered a multi-criteria optimization problem that maximizes machine-learning-method performance while minimizing the number of selected features. This can be addressed with several optimization techniques [2,32], such as sequential selection algorithms or nature-inspired algorithms, such as the evolutionary and genetic algorithm [33,34], particle swarm optimization [33,35], and the bees algorithm [36].
Early wrappers were based on sequential selection [37]. This starts with an empty set, adds features one by one, and evaluates the prediction performance. The feature that gives the best results is permanently included in the set. The selection continues by again adding features one by one and keeping those that contribute the most to improving the prediction performance. The algorithm terminates when a predetermined threshold of acceptable results is reached or when a sufficient number of features has been selected. In [38], the authors propose an inverse procedure where features are removed from the input set. The main limitation of these algorithms is that they do not consider the correlation between features. This limitation is eliminated in the adaptive version of the algorithm [39]. However, such a search for an optimal set with a larger number of features soon grows into an exponentially time-consuming process [12,37].
Therefore, methods that find a suboptimal solution were proposed [33,34]. Examples of the latter are algorithms based on nature-inspired concepts. Similar to sequential feature selection methods, the evaluation function of evolutionary algorithms represents the performance of a model, while the features represent the population. The best-performing subsets are combined to achieve the desired result [34,35]. The biggest problem with these methods is their computational complexity, as the process involves the model evaluating features for each specific subset over a large number of iterations to obtain useful results [2,40].
An alternative to wrapper methods is embedded methods, which have lower execution times [2,41]. These approaches perform the selection of a subset of features in interaction with the model during its learning phase and are thus tied to the selected machine learning algorithm [2]. Decision trees achieve feature selection based on mutual information evaluation, while support vector methods use LASSO (Least Absolute Shrinkage and Selection Operator) regression analysis with L1-regularization [41] or ridge regression with L2-regularization [42] to rank features during learning. This significantly reduces the computational complexity of both methods, as it avoids multiple repetitions of the machine learning algorithm’s training process. At the same time, support vector methods are hard to interpret and sensitive to many input parameters. Similar to support vector methods, the selection of features can also be achieved with neural networks, whereby, during network training, weights are assigned to individual features according to their suitability [2].
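As a generic illustration of the embedded idea (not a method proposed or used in this paper), an L1-regularized linear model can perform feature selection during training; the scikit-learn sketch below keeps the features whose fitted coefficients are non-zero. The dataset and the regularization strength C are placeholders.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer       # placeholder dataset
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)                  # L1 penalties are sensitive to feature scales

# L1-regularized logistic regression drives the weights of irrelevant features to exactly zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_[0])              # indices of features with non-zero weights
print(f"{len(selected)} of {X.shape[1]} features retained:", selected)
```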

2.3. Dynamic-Programming-Based Feature Selection

DP can be employed since feature selection can be formulated as an optimization problem. In this context, DP solves the exhaustive search of a feature subset by breaking it down into simpler searches while storing the suboptimal subsets of features to avoid redundant calculations.
As early as 1998, Nelson and Levy [43] presented a method where optimality is defined in terms of a particular measure, the Fisher return function, provided the features were uncorrelated. In [44], a method for selecting the best subset of features for classification purposes is presented. It uses DP for divergence analysis of the feature distributions with respect to the given classes, selecting the subset of the most informative features. However, this method does not consider the interaction between features and thus can choose redundant ones. The identical drawback can be noted in [45]. The approach is similar to sequential forward feature selection, starting with an empty set of features and adding the best-performing one in each iteration. The main difference is that it can remove any previously selected feature if this improves the performance of the set. In [46], the authors presented a method that uses rough set theory and DP in order to remove redundant features from the input set while maintaining high classification performance. However, this approach is computationally intensive, especially for large datasets with a high number of features.

2.4. Suboptimal Dynamic Programming

In the literature, there are several partially overlapping concepts addressing suboptimal DP.
  • Approximate Dynamic Programming (ADP) is a sophisticated variant of traditional DP, representing a compromise between the method’s accuracy and computational feasibility. It aims to find near-optimal solutions to DP problems, achieved by approximating the value or policy functions using function approximation techniques such as neural networks, linear approximators, or interpolation methods [47,48]. A policy represents a strategy or set of rules that dictate the decision-making process, while the value corresponds to the expected return or benefit from following a particular policy. ADP iteratively refines both, converging towards an optimal or near-optimal solution [48]. The concept has proven itself, particularly in solving large-scale discrete-time multistage stochastic control processes, i.e., complex Markov Decision Processes (MDPs), and found applications in different fields, such as inventory control systems, financial optimization problems, robot path planning, information theory, and decision-making processes in learning agents, i.e., reinforcement learning (RL) [49,50]. Refs [7,51] consider feature selection in RL and MDPs, while [52] addresses feature discovery, also known as feature construction, in the context of ADP and RL. Note that adaptive dynamic programming with the same acronym ADP is sometimes found in the literature for practically the same concept [53].
  • Iterative dynamic programming (IDP) involves solving a problem by iteratively refining an initial solution through DP techniques. The definition can be interpreted in different ways. Most often, IDP is used to solve real-valued optimization problems in a manner that reduces the state and control quantization to an arbitrarily small amount by first searching over a relatively coarse but large set of system inputs and states using DP, and then successively generating a denser and narrower search range centered about the previous result. Luus [54] defined this principle as reducing dimensionality by iteratively varying grid resolutions, while Lock and McKelvey [55] applied it on different time scales. In [8], IDP was used to optimize queries in database systems. IDP was also referred to as a counterweight to recursive DP, while value and policy iteration represents an intersection point of IDP with ADP [48]. Note that IDP can be either optimal or suboptimal.
  • Relaxed dynamic programming reduces the complexity by relaxing the demand for optimality. The distance from optimality is kept within prespecified error bounds, and the size of the bounds determines the computational complexity [56]. The bounds are chosen by the user, who can then effectively trade-off between solution time and accuracy [57]. By controlling the error in the processes of relaxed value iteration and approximate policy iteration, the relaxed DP concept is closely related to ADP and IDP [58].

3. Materials and Methods

In this section, we present a new method for feature selection based on suboptimal DP. There are exponentially many subsets of a given feature set, all of which are candidates for the feature selection solution, so the exhaustive search approach is only practically applicable to problems with a few dozen features. Our method processes, e.g., 200 features in 5 s, but for larger input sets, it makes sense to preprocess the features with some faster filtering. We use our efficient and reliable graph-cut-based feature selection [9], summarised in Section 3.1. In Section 3.2, we discuss the idea of using DP and the encountered difficulties and introduce an iterative suboptimal alternating solution, where the order of feature processing is inverted in each iteration. We conclude the section with a proof of convergence and a theoretical analysis of time and space complexity.

3.1. Graph-Cut-Based Feature Selection

While wrapper feature selection methods, like sequential search, nature-inspired algorithms, or binary teaching–learning-based approaches, bypass the need for explicit feature evaluation to yield results that are close to optimal, their effectiveness is tied to the specific classification model being used. Additionally, these methods are highly computationally intensive, which can limit their applicability. Similarly, embedded methods incorporate an iterative cycle of evaluating and selecting features as a part of the model training process, which can also demand significant computational resources. Furthermore, the performance of embedded methods is likewise influenced by the choice of the classification model. As an alternative to the discussed wrapper and embedded feature selection techniques, as well as to those filter methods that are unable to deal with correlated features, in this section, we present the graph-cut-based feature selection strategy outlined in our work [9], which enables the selection of a subset of high-quality dissimilar features while providing superior results. Depending on the defined feature estimation measure, it can be used for both classification and regression purposes. Graph vertices represent features with associated weights that define their quality (as proposed in [9]), while graph edge weights define the similarities between them. The method relies on two input parameters, T_Δ and T_P, used for the graph definition. The former defines the necessary level of feature quality (i.e., the maximal allowed class overlap) for a feature to be included in the output feature space, and the latter determines the minimal level of dissimilarity between the selected features.
Let FS denote an input feature space FS = {f_i}. A feature f_i, referred to by an index i ∈ [1, n], is given as a mapping function f_i : Z → R. An index m ∈ [1, M] refers to a sample, i.e., a feature vector defined as x_m = [f_{i,m}]. The undirected graph used for feature selection is defined as G = (F, E), where the set of vertices F is defined as F = {f_i ∈ FS; Δ(f_i) ≤ T_Δ}, while the unordered set of edges E = {e_{i,j}; P(e_{i,j}) ≥ T_P} is given by e_{i,j} = (f_i, f_j) for all f_i, f_j ∈ F such that i ≠ j. The vertex-weighting function is given by Δ(f_i), as defined in [9], and the edge-weighting function is given by the absolute Pearson correlation coefficient P : E → [0, 1], formally described by Equation (1).

P(e_{i,j}) = | ∑_{m=1}^{M} (f_{i,m} − μ_i)(f_{j,m} − μ_j) | / (M·σ_i·σ_j),     (1)

where μ_i denotes the mean of the values of feature f_i, while its standard deviation σ_i is defined as σ_i = √((1/M) ∑_{m=1}^{M} (f_{i,m} − μ_i)²). Both functions, Δ and P, are designed such that lower values (closer to 0) are more favorable for selection than higher values (closer to 1).
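In code, the edge weights of Equation (1) can be obtained directly from the sample matrix; a minimal NumPy sketch, assuming an M × n matrix X that stores samples in rows and features in columns, is given below.

```python
import numpy as np

def pearson_edge_weights(X):
    """Absolute Pearson correlations P(e_{i,j}) for all feature pairs.

    X : (M, n) array with M samples (rows) and n features (columns).
    Returns an (n, n) symmetric matrix; the diagonal is zeroed (no self-loops).
    """
    P = np.abs(np.corrcoef(X.T))   # np.corrcoef expects variables in rows
    np.fill_diagonal(P, 0.0)
    return P
```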
According to the theoretical framework introduced in [9], we use the following definitions of elementary properties:
  • Vertices f_i ∈ F and f_j ∈ F are adjacent in a graph G if there exists an edge e_{i,j} ∈ E.
  • A path from f_{i_0} to f_{i_N} is an ordered sequence of vertices ⟨f_{i_0}, f_{i_1}, …, f_{i_N}⟩, such that f_{i_j} and f_{i_{j+1}} are adjacent for all j ∈ [0, N − 1].
  • A graph G is connected if, for all f_i, f_j ∈ F, there exists a path from f_i to f_j.
  • A graph G′ = (F′, E′) is a subgraph of G if F′ ⊆ F and E′ ⊆ E.
  • A neighbourhood Z(f_i) of a vertex f_i in a graph G is the subset of vertices of F defined by all the vertices adjacent to f_i, namely, Z(f_i) = {f_j; f_j ∈ F; ∃ e_{i,j} ∈ E}, where i ≠ j.
We say that a set of vertices CUT(F) ⊂ F is a vertex-cut if its removal separates graph G into at least two non-empty and pairwise disconnected connected components. Obviously, Z(f_i) is a vertex-cut, as it separates the singleton {f_i} (i.e., an individual vertex) from the rest of the graph, thus creating a subgraph G′ = (F′, E′), whose vertex and edge sets are given formally by Equation (2).

F′ = F ∖ (Z(f_i) ∪ {f_i}),    E′ = {e_{h,l} ∈ E; f_h, f_l ∈ F′ and h ≠ l}.     (2)

An example of vertex-cut feature selection is presented in Figure 1. Figure 1a shows an undirected graph G = (F, E), constructed over a set of features FS = {f_1, f_2, …, f_9}, with the thresholds T_Δ = 0.6 and T_P = 0.6 applied to the associated vertex- and edge-weighting functions Δ and P, respectively. To preserve the overall informativeness of the selected features, the feature of the highest quality, f_r̂ = arg min_{f_m ∈ G} Δ(f_m), is selected first by a vertex-cut of its neighborhood Z(f_r̂). The selected feature f_6 is colored green. All of its highly correlated adjacent features Z(f_6) = {f_2, f_3, f_8} are marked red and removed from G. This results in G′, as defined by Equation (2), and a disconnected singleton {f_6} (see Figure 1b). The same process is then repeated on G′, separating the feature of the highest quality, namely f_1, from the remaining graph G″ by the removal of Z(f_1) = {f_4, f_7}. The final cut is performed on the graph G″, separating f_5 (in green) from the remaining (empty) graph G‴ by the removal of Z(f_5) = {f_9} (in red), as shown in Figure 1c. Thus, the output subset of high-quality dissimilar features, namely {f_1, f_5, f_6}, is obtained, as shown in Figure 1d.
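The selection loop illustrated above can be summarized in a few lines of Python. The sketch below is ours and makes two assumptions: the vertex weights Δ are supplied as an array (their exact definition follows [9]), and an edge is present whenever the absolute correlation of a feature pair reaches T_P.

```python
def graph_cut_feature_selection(delta, P, t_delta=0.6, t_p=0.6):
    """Greedy vertex-cut selection (Section 3.1), returning selected feature indices.

    delta : (n,) vertex weights, lower = better quality (Delta in the paper).
    P     : (n, n) absolute Pearson correlations (e.g., from pearson_edge_weights).
    """
    # Keep only the vertices that satisfy the quality threshold T_Delta.
    remaining = {i for i in range(len(delta)) if delta[i] <= t_delta}
    selected = []
    while remaining:
        # Feature of the highest quality (lowest Delta) among the remaining vertices.
        best = min(remaining, key=lambda i: delta[i])
        selected.append(best)
        # Cut away its highly correlated neighbourhood Z(f_best) together with the vertex itself.
        neighbours = {j for j in remaining if j != best and P[best, j] >= t_p}
        remaining -= neighbours | {best}
    return selected
```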

3.2. New Suboptimal Dynamic Programming Algorithm

The new method combines the advantages of iterative and approximate dynamic programming. It does not seek a global optimum but instead adopts a suboptimal (approximate) solution, which it iteratively improves. It is based on a graph like the one used by the graph-cut-based filtering from Section 3.1. We thus use the same notation, but we will extend it throughout this subsection with additional algorithm parameters and graph vertex attributes. The graph is undirected, i.e., P(e_{i,j}) = P(e_{j,i}). The input is the feature set FS = {f_i}, 0 < i ≤ n, which is processed in index order, i.e., from f_1 to f_n, so we will sometimes also speak of a sequence of features. At both ends of this sequence, the guard vertices f_0 and f_{n+1} are added; they do not change during the execution of the algorithm, but they simplify the implementation. There is no edge between the two guards, while the guard vertices and the edges between a guard and any other vertex are given weights 0. We stress this in the form of an equation (Equation (3)).

Δ(f_0) = 0,    Δ(f_{n+1}) = 0,    P(e_{0,n+1}) = ∞,    P(e_{0,i}) = 0, 0 < i ≤ n,    P(e_{i,n+1}) = 0, 0 < i ≤ n.     (3)
Each graph vertex f_i contains, in addition to the weight Δ(f_i), a set S_i that stores the “optimal” subset (feature selection result) of the vertices already processed, and the score s_i of this subset, which is obtained by the evaluation criterion. Their initialization is described by Equation (4) and is important for the convergence proof in Section 3.3. The evaluation criterion described in Equation (5) seeks a minimum for all vertices except the guards, i.e., 0 < i ≤ n.

S_i = ∅, 0 ≤ i ≤ n + 1,    s_i = 0, 0 ≤ i ≤ n + 1.     (4)

s_i = min_{0 ≤ j < i} ( s_j + Δ(f_i) + ∑_{k ∈ S_j} P(e_{k,i}) )     (5)

Let r be the value of j where the minimum was identified. The corresponding S_i is calculated by Equation (6).

S_i = S_r ∪ {i}     (6)

The final score, score, and the feature selection result, Solution, are given by Equation (7).

score = min_{0 < i ≤ n} s_i,    solution = i, where score was found,    Solution = S_{solution}.     (7)
Figure 2a shows the situation immediately before Equations (5) and (6) are applied to vertex f_i, and Figure 2b shows the situation immediately after the equations are applied. Green indicates the graph vertices that have already been processed, and white indicates those that are being or will be processed. The red text indicates vertex attributes modified during the processing of the observed f_i.
So far, everything seems straightforward, but there are, in fact, three serious problems in the process that need to be addressed. The first is that the importance of vertices and edges might differ. For this reason, we introduce a weight w, 0 ≤ w ≤ 1. This modifies the evaluation criterion of Equation (5) into Equation (8).
s_i = min_{0 ≤ j < i} ( s_j + w·Δ(f_i) + (1 − w)·∑_{k ∈ S_j} P(e_{k,i}) )     (8)

The second problem is that Equation (5) in its present form always leads to the trivial solution of Equation (9). Since the weights of the graph vertices and edges are all non-negative, the minimum consists of a single vertex (without incident edges) with the lowest weight.

score = min_{0 < j ≤ n} ( w·Δ(f_j) )     (9)

To prevent this, we first modified the model by replacing the decreasing vertex evaluation function Δ with the increasing Δ′(f) = 1 − Δ(f). The idea was to reward high vertex weights and penalize high edge weights. This resulted in the optimization function of Equation (10):

s_i = max_{0 ≤ j < i} ( s_j + w·Δ′(f_i) − (1 − w)·∑_{k ∈ S_j} P(e_{k,i}) ),     (10)

which does not tend towards the trivial solution. However, to retain complementarity with the graph-cut-based method, we preferred an alternative approach, which decrements all vertex and edge weights (except those of the guards and their incident edges) by user-defined non-negative values shft_Δ and shft_P, respectively (see Equation (11)). Furthermore, these two additional parameters provide new possibilities for tuning, as demonstrated in Section 4.

s_i = min_{0 ≤ j < i} ( s_j + w·(Δ(f_i) − shft_Δ) + (1 − w)·∑_{k ∈ S_j} (P(e_{k,i}) − shft_P) )     (11)
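For concreteness, the following Python sketch implements a single forward pass of the recurrence in Equation (11) together with the solution extraction of Equations (6) and (7); the guard vertex f_0 appears implicitly as the case j = 0 with s_0 = 0 and S_0 = ∅, indices are 0-based, and the variable names are ours.

```python
def sdp_single_pass(delta, P, w=0.5, shft_delta=0.0, shft_p=0.0):
    """One forward pass of the suboptimal DP recurrence (Equation (11)).

    delta : (n,) vertex weights; P : (n, n) absolute correlations.
    Returns (score, solution), with solution as a sorted list of 0-based indices.
    """
    n = len(delta)
    s = [0.0] * (n + 1)                    # s[0] is the guard score
    S = [set() for _ in range(n + 1)]      # S[0] is the guard (empty) solution
    for i in range(1, n + 1):              # process f_1 .. f_n in index order
        best_j, best_val = 0, float("inf")
        for j in range(i):                 # the guard j = 0 and all processed vertices
            val = (s[j]
                   + w * (delta[i - 1] - shft_delta)
                   + (1 - w) * sum(P[k, i - 1] - shft_p for k in S[j]))
            if val < best_val:
                best_j, best_val = j, val
        s[i] = best_val                    # Equation (11)
        S[i] = S[best_j] | {i - 1}         # Equation (6), stored 0-based
    best_i = min(range(1, n + 1), key=lambda i: s[i])   # Equation (7)
    return s[best_i], sorted(S[best_i])
```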
The third problem is the most demanding. Even if all partial solutions S_j, 0 < j < i, were optimal, there is no guarantee that this will be the case after adding f_i to any of these solutions. It is enough that f_i is over-correlated with a single feature from each S_j, and the optimum will likely be missed. In other words, the optimization defined in this way does not guarantee an optimal substructure, one of the two fundamental assumptions of dynamic programming, along with overlapping subproblems [6]. Of course, when considering f_i, we can no longer refresh its predecessors’ attributes S_j and s_j. We tried to mitigate this problem by extending the evaluation criterion with a prediction of the contribution of vertices not yet visited and, most importantly, by considering the correlation between the visited and predicted parts. The need to predict the contribution of unvisited vertices led us to a simple idea, which later turned out to be very successful, namely to reverse the graph traversal direction after arriving at f_n. As G is an undirected graph, the status from the previous traversal can simply be used to estimate the score s_i and the partial solution S_i. The updated evaluation criterion is given by Equation (12).

s_fwd = min_{n+1 ≥ j > i} ( s_j + s_i + (1 − w)·∑_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) )     (12)

When the reverse traversal reaches f_1, the direction of visiting the vertices is inverted again. The evaluation criterion of Equation (12) is slightly modified to Equation (13), corresponding to the forward direction from f_1 towards f_n. The only difference between the two equations is, of course, the direction and the boundaries of the vertex traversal, written under the min function label.

s_fwd = min_{0 ≤ j < i} ( s_j + s_i + (1 − w)·∑_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) )     (13)

The modified evaluation criterion significantly impacts the choice of vertex f_r (r is the value of j providing the minimum) and thus indirectly affects the calculation of s_i and S_i. Let r be the value of j in Equation (12) or (13) where the minimum was identified. The score s_i is then calculated using Equation (14), while Equation (6), representing the solution subset S_i, remains applicable.

s_i = s_r + w·(Δ(f_i) − shft_Δ) + (1 − w)·∑_{k ∈ S_r} (P(e_{k,i}) − shft_P)     (14)

However, s_i and S_i should not be directly refreshed by s_fwd and S_r ∪ S_i, since in the treatment of subsequent vertices, we assume that s_i and S_i refer only to vertices that were visited before f_i in the current iteration. Conversely, it would be a pity not to make better use of the great potential that Equations (12) and (13) certainly have. Fortunately, they can be used to predict the attributes of another vertex instead of f_i, namely f_{i_end}, which represents the last vertex in the set S_i (the one with the lowest index in the reverse-direction traversal or with the highest index in the forward traversal). However, we should not update s_{i_end} and S_{i_end} when we process f_i because we will need their values from the previous iteration when we process f_{i_end} later. As a consequence, we extend each vertex f_k with additional attributes pr(s_k) and pr(S_k) (pr stands for prediction), which store the aforementioned estimates of the score and the solution set. At the beginning of each iteration, the initialization pr(s_k) = ∞, 0 < k ≤ n, is performed. Algorithm 1 shows the processing of vertex f_i, which is further explained in Figure 3. For simplicity, we assume that all the variables in Algorithm 1 are global, except i and forward. The score s_i is determined as the minimum of the previously stored pr(s_i) and the s_i computed by Equation (14). In the former case, the set pr(S_i) is assigned to S_i, while in the latter case, S_i is determined by Equation (6). Note that pr(s_i) and pr(S_i) can be refreshed multiple times in the same iteration since multiple sequences S_i at different i can terminate with the same vertex f_{i_end}.
Algorithm 1 Processing a Considered Graph Vertex
1: function ProcessVertex(i, forward)
2:     if forward then    ▹ Forward direction graph traversal.
3:         i_end = max_{f_k ∈ S_i} k;
4:         s_fwd = min_{0 ≤ j < i} ( s_j + s_i + (1 − w)·∑_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) );    ▹ (13)
5:     else    ▹ Reverse direction graph traversal.
6:         i_end = min_{f_k ∈ S_i} k;
7:         s_fwd = min_{n+1 ≥ j > i} ( s_j + s_i + (1 − w)·∑_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) );    ▹ (12)
8:     end if
9:     r = the value of j where the minimum in line 4 or 7 was achieved;
10:    s_i = s_r + w·(Δ(f_i) − shft_Δ) + (1 − w)·∑_{k ∈ S_r} (P(e_{k,i}) − shft_P);    ▹ (14)
11:    S_i = S_r ∪ {i};    ▹ (6)
12:    if (pr(s_i) < s_i) then    ▹ Update the vertex with predictions from the same iteration.
13:        s_i = pr(s_i);
14:        S_i = pr(S_i);
15:    end if
16:    if (s_fwd < pr(s_{i_end})) then    ▹ Update the predictions of a not yet processed f_{i_end}.
17:        pr(s_{i_end}) = s_fwd;
18:        pr(S_{i_end}) = S_r ∪ S_i;
19:    end if
20:    return    ▹ No value returned; all the variables are global, except i and forward.
21: end function
Figure 3a,b show the situation immediately before and after Equations (6), (12) and (14) are applied to vertex f_i, respectively. The graph traversal is performed in the reverse direction. The obvious difference from the straightforward non-iterative solution in Figure 2 is that here, S_i does not contain only the initial vertex f_i, but the partial solution from the previous iteration instead. As a consequence, there is a double loop in the sum calculation. The green color indicates the graph vertices that have already been processed in the observed iteration, and the yellow color indicates those that were processed in the previous iteration (and are or will be processed later in the current iteration). Note that these yellow vertices contain the predictions (colored cyan), which might have been updated earlier in the ongoing iteration. The red text indicates vertex attributes modified during the processing of the observed f_i. Analogously, Figure 3c,d show the processing of vertex f_i when the graph is traversed in the forward direction. Equation (13) replaces Equation (12) in this case.
The pseudocode in Algorithm 2 describes the overall structure of the alternating suboptimal dynamic programming method for feature selection. As mentioned, 200 features can still be processed relatively fast, but for larger input sets, it makes sense to preprocess the features with graph-cut-based feature selection filtering (line 2). The initialization in line 3 sets up the guard vertices using Equation (3). The candidate partial solution sets and their scores are initialized using Equation (4), which is needed in lines 4, 7 and 11 of Algorithm 1 within the first-iteration calls of ProcessVertex (line 11 of Algorithm 2). The value finalScore is set to a high value (∞) to provide the first comparison in line 16, and maxIterations is set to a user-defined value or the default of 100. In line 8, all predicted scores are set to a high value (∞) at the beginning of each iteration, which is needed in line 16. The main work is done in the ProcessVertex function, which is called sequentially in line 11 for each feature f_i except for the guard vertices. The direction of traversing the features is inverted in each iteration (line 23). The process terminates when the identical score is obtained three times in a row or when the number of iterations reaches maxIterations (line 24). If there are two (or more) solutions with the same score, the algorithm may find one during the forward traversal and a different one during the reverse traversal. In this case, it will return the last of the two solutions found.

3.3. Convergence and Complexity Analysis

The solution found is generally suboptimal but often better than that found in the one-pass method, as will be confirmed by the results in the next section. In any case, the solution after several passes is not worse than the one-pass solution since the result can only improve from iteration to iteration or remain unchanged (after three consecutive such iterations, the algorithm terminates), which is confirmed by Proposition 1 below.
Proposition 1. 
The score in each iteration of the proposed alternating suboptimal dynamic programming algorithm can only be lower (better) or equal to the score in the previous iteration but never higher (worse).
Algorithm 2 Alternating Suboptimal Dynamic Programming
1: function ASDP(Δ, P, n)
2:     (Δ, P, n) = GraphCutBasedFeatureSelection(Δ, P, n);    ▹ Optional filtering
3:     (Δ, P, s, S, finalScore, maxIterations) = Init(Δ, P, n);
4:     solutionRepeated = 0;    iteration = 0;
5:     start = 1;    end = n;
6:     repeat    ▹ iterations of ASDP
7:         for i ← start to end do    ▹ for all features
8:             pr(s_i) = ∞;
9:         end for
10:        for i ← start to end do    ▹ for all features
11:            ProcessVertex(i, start < end);
12:        end for
13:        score = min_{0 < i ≤ n} s_i;    ▹ (7): this and the next two lines
14:        solution = i, where score was found;
15:        Solution = S_{solution};
16:        if (score < finalScore) then
17:            finalScore = score;
18:            solutionRepeated = 0;
19:        else
20:            solutionRepeated = solutionRepeated + 1;
21:        end if
22:        iteration = iteration + 1;
23:        (start, end) = Swap(start, end);
24:    until (iteration = maxIterations) ∨ (solutionRepeated = 3);
25:    return (finalScore, Solution)
26: end function
Proof
The proof is conceptually straightforward since we will show that the score s_i from the previous iteration is also considered a candidate for the minimum in the observed iteration. Namely, this score is obtained in the evaluation criterion in line 4 of Algorithm 1 at j = 0 or in line 7 at j = n + 1. The algorithm does not modify the parameters of the two guards, so s_j = 0 and S_j = ∅ in both cases. Consequently, only s_i remains from the expression on the right-hand side of (13) or (12). If s_i is also the minimum in the current iteration, then s_fwd = s_i will be written first to pr(s_{i_end}) in line 17 of Algorithm 1, then to s_i in line 13 of Algorithm 1, to score in line 13 of Algorithm 2, and finally to finalScore in line 17 of Algorithm 2. Conversely, if s_i is not the minimum in the current iteration, then it can only be replaced with a lower score in some of the aforementioned lines of Algorithm 1 or Algorithm 2. This completes the proof. □
Based on Proposition 1, it makes sense to modify the initialization (line 3 of Algorithm 2). The proven convergence allows us to use the input feature set instead of the empty set as an initial solution candidate. Equation (15) introduces a recursive definition of the initial values, which replaces Equation (4). Note that the last two lines of Equation (15) were derived from Equations (6) and (14) by setting r = i − 1.

S_0 = S_{n+1} = ∅,    s_0 = s_{n+1} = 0,
S_1 = {1},    s_1 = Δ(f_1),
S_i = S_{i−1} ∪ {i}, 2 ≤ i ≤ n,
s_i = s_{i−1} + w·(Δ(f_i) − shft_Δ) + (1 − w)·∑_{k ∈ S_{i−1}} (P(e_{k,i}) − shft_P), 2 ≤ i ≤ n.     (15)
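In code, the warm start of Equation (15) is a single cumulative pass over the feature sequence; a sketch with 0-based indices and plain Python containers (our naming) is shown below.

```python
def warm_start_initialisation(delta, P, w=0.5, shft_delta=0.0, shft_p=0.0):
    """Initial scores s_i and solution sets S_i according to Equation (15).

    Each S[i] is the prefix {0, ..., i}; s[i] accumulates the weighted vertex cost
    of f_i plus its shifted correlations to all previously included features.
    """
    n = len(delta)
    s = [0.0] * n
    S = [set() for _ in range(n)]
    S[0], s[0] = {0}, delta[0]             # s_1 = Delta(f_1) in the paper's notation
    for i in range(1, n):
        S[i] = S[i - 1] | {i}
        s[i] = (s[i - 1]
                + w * (delta[i] - shft_delta)
                + (1 - w) * sum(P[k, i] - shft_p for k in S[i - 1]))
    return s, S
```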
Propositions 2–4 consider the time and space complexity of the graph-cut-based and the alternating suboptimal dynamic programming feature selection approaches.
Proposition 2. 
The graph-cut-based feature selection method has a worst-case time complexity of O(n²), where n is the number of features, i.e., graph vertices.
Proof 
The algorithm gradually selects the features f_r̂ of the highest quality, which requires at most O(n) steps. In each step, a neighborhood Z(f_r̂) is considered, which contains at most O(n) features. This results in O(n)·O(n) = O(n²) worst-case time complexity. Note that the method removes the considered feature and its highly correlated neighborhood from the graph G in each step; consequently, the expected time complexity is much closer to O(n·log n), which corresponds to sorting the vertices according to their qualities. □
Proposition 3. 
The proposed alternating suboptimal dynamic programming feature selection approach runs in O(n⁴) time in the worst case, where n is the number of graph vertices (features).
Proof 
The double sum in lines 4 and 7 of Algorithm 1 contributes O(n²) time. In both cases, it is performed within the min function, which considers O(n) values. The ProcessVertex function thus requires O(n)·O(n²) = O(n³) time. It is called O(n) times in line 11 of Algorithm 2, resulting in O(n⁴) time per single iteration. Although the number of iterations (the loop of lines 6–24) is by default set to 100, it rarely exceeds ten and practically never 15, so its time consumption may be considered constant, i.e., O(1), and the overall worst-case time complexity is proven to be O(n⁴). □
Proposition 4. 
Both considered approaches to feature selection, i.e., the graph-cut-based and the alternating suboptimal dynamic programming algorithm, require O(n²) space, where n is the number of graph vertices (features).
Proof 
In the graph-cut-based approach, the graph contains n vertices and at most n·(n − 1)/2 edges. Similarly, there are n + 2 vertices and (n + 2)·(n + 1)/2 − 1 edges in the ASDP approach. Furthermore, the n + 2 sets S_i and pr(S_i), each with O(n) elements, also do not exceed O(n²) space. The overall space complexity is thus O(n²). □

4. Results

4.1. Validation Setup

The proposed method based on alternating suboptimal dynamic programming (ASDP) and the exhaustive search algorithm (brute force, BF) were implemented using C++, while the graph-cut-based feature selection (Graph-FS) was implemented using Python 3.11.5 on the Microsoft® Windows 11 operating system. All experiments were conducted on a workstation with an Intel® Core™ i5 CPU and 16 GB of main memory. The algorithms are not yet integrated into a common application, but the results of the Graph-FS prefiltering are imported into the ASDP and BF methods via text files. The reproducibility of the classification experiments is provided through the scikit-learn 1.4.1 implementation of the machine learning methods. The classifiers were implemented with the following settings (a setup sketch is given after the list):
  • K-nearest neighbors classifier (KNN) was assessed using default settings, with K ∈ {2, 3, …, 8} tested;
  • Naive Bayes classifier (NBC) was used with the default settings;
  • Random Forest (RF) used a maximal depth from the range {2, 4, 8, 16, 20}, while the maximal number of iterations was from {5, 10, 15, 20, 25, 30};
  • XGBOOST used a maximal depth from the range {2, 4, 8, 16, 20}, while the maximal number of iterations was from {5, 10, 15, 20, 25, 30}.
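A minimal sketch of this setup is given below; it assumes the scikit-learn and xgboost packages, maps the “maximal number of iterations” of RF and XGBOOST to n_estimators (our interpretation), and uses a placeholder dataset in place of a selected feature subset. The ten-fold cross-validated accuracy corresponds to Equation (16) in Section 4.3.

```python
from itertools import product

from sklearn.datasets import load_breast_cancer            # placeholder dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)                  # stand-in for a selected feature subset

candidates = [GaussianNB()]                                  # NBC: default settings only
candidates += [KNeighborsClassifier(n_neighbors=k) for k in range(2, 9)]
for depth, n_est in product([2, 4, 8, 16, 20], [5, 10, 15, 20, 25, 30]):
    candidates.append(RandomForestClassifier(max_depth=depth, n_estimators=n_est))
    candidates.append(XGBClassifier(max_depth=depth, n_estimators=n_est))

# Ten-fold cross-validated accuracy; the best value over the parameter grid is reported.
best_acc = max(cross_val_score(clf, X, y, cv=10).mean() for clf in candidates)
print(f"Best accuracy over the parameter grid: {best_acc:.3f}")
```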
The ASDP and BF evaluation and the classification accuracy assessment were conducted on nine well-known benchmark datasets, available at the UCI machine learning repository [59]. Table 1 summarises the characteristics of each dataset, including its name and the number of features, classes, and samples contained.
These datasets were chosen to demonstrate the diversity of real-world applications of the proposed methods. For example, while Ds2 presents the utility of feature selection for financial institutions, Ds3 and Ds9 show that feature selection is also beneficial for medical research. Furthermore, to prove the proposed method’s efficiency across different datasets and scenarios, examples with various numbers of features and samples were considered. In what follows, we will also show the consistency and robustness of the proposed methods, as the results do not deviate from the expected ones either in the case of Ds6, which contains 60 features but only 208 samples, or in the case of Ds5, which contains 16 features and 20,000 samples.
Each run of the ASDP and BF evaluation test consists of 125 experiments, employing 5·5·5 triplets of parameters (w, shft_Δ, shft_P), where w ∈ {0, 0.25, 0.5, 0.75, 1}, shft_Δ ∈ {0, med_{1/4}(Δ), med(Δ), med_{3/4}(Δ), 1}, and shft_P ∈ {0, med_{1/4}(P), med(P), med_{3/4}(P), 1}. Here, med(Δ) is the median of Δ(f_i), 0 < i ≤ n, while med_{1/4}(Δ) and med_{3/4}(Δ) are the medians of the lower and higher half-sequences of Δ(f_i), respectively. The medians med_{1/4}(P), med(P), and med_{3/4}(P) are determined in a similar manner.
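A small sketch of how these 125 triplets can be generated is shown below; splitting an odd-length sorted sequence into lower and higher halves (excluding the middle element) is one common convention and is our assumption, as is taking the off-diagonal entries of P as its value sequence.

```python
from itertools import product

import numpy as np

def half_medians(values):
    """Return (med_1/4, med, med_3/4): the overall median and the medians of the
    lower and higher half-sequences of the sorted values."""
    v = np.sort(np.asarray(values, dtype=float))
    lower, higher = v[: len(v) // 2], v[(len(v) + 1) // 2 :]
    return float(np.median(lower)), float(np.median(v)), float(np.median(higher))

def parameter_triplets(delta, P):
    """All 5 * 5 * 5 combinations of (w, shft_Delta, shft_P) used in one evaluation run."""
    d14, d12, d34 = half_medians(delta)
    p14, p12, p34 = half_medians(P[np.triu_indices_from(P, k=1)])   # off-diagonal correlations
    w_grid = [0.0, 0.25, 0.5, 0.75, 1.0]
    return list(product(w_grid, [0.0, d14, d12, d34, 1.0], [0.0, p14, p12, p34, 1.0]))
```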

4.2. Assessment of Scores of the Alternating Suboptimal Dynamic Programming Algorithm

The main question with respect to the development of the ASDP method was how much it could improve the solution compared to a single iteration of suboptimal dynamic programming (SDP-1). At the same time, it is reasonable to compare the extent to which ASDP and SDP-1 achieve the global optimum provided by the BF approach. The results of the analysis are summarized in Table 2. The three main conclusions are listed below the table.
  • The third column shows that SDP-1 reaches the global optimum in 62.6% of the tests. The fourth column then shows that ASDP significantly raises this percentage to 95.8%.
  • The degree of match (61.4%) between the SDP-1 and ASDP scores in the fifth column should not be below that between SDP-1 and BF (62.6%) since ASDP never degrades the score from the first iteration, according to Proposition 1. Indeed, if we ignore rows Ds4, Ds6, Ds7, and Ds9, where we could not evaluate BF, we also obtain 62.6% for ASDP (in brackets). Interestingly, at least for the tests performed, a conclusion can be drawn that whenever ASDP fails to reach the global optimum in the first iteration, it improves the score at least a little in subsequent iterations.
  • The last two columns confirm the empirical finding of the proof of Proposition 3 that the number of iterations of ASDP is within O(1), since in the tests performed, it does not exceed 11, and on average it is only 3.7, barely above the termination condition of 3 consecutive iterations with the unchanged score.
In order to further improve the results and, in particular, the feasibility in situations with a larger number of features, we preprocessed ASDP with fast and highly accurate, though still suboptimal, Graph-FS. The results are shown in Table 3, and the critical observations are listed immediately below.
  • The second column confirms a significantly lower number of features than before the use of Graph-FS (see Table 1).
  • The fourth column shows that BF did not change the Graph-FS results in 38.7% of tests. In other words, it obtains a better score in 61.3% of cases.
  • The fifth column gives the first impression that ASDP performs significantly worse (34.8% vs. 38.7%) compared to BF. However, eliminating all tests on the Ds7 dataset, where BF was not viable, made both scores equal. Since ASDP cannot, according to Proposition 1 and the initialization from Equation (15), spoil the initial score, we may also conclude here that the score was strictly improved in the remaining 61.3% of tests. However, a better ASDP score obtained with Equation (14) does not necessarily imply better results in practical applications. We will show this in Section 4.3 by matching the ASDP score with the classification accuracy.
  • The sixth column shows that preprocessing of ASDP with Graph-FS raises the proportion of solutions reaching the global optimum from 95.8% in Table 2 to 98%.
  • The last two columns show a maximum number of iterations of 12 and a lower average number of iterations of 3.4, compared to 3.7 in Table 2.

4.3. Assessment of the Use of the Proposed Approach in Classification Tasks

In this section, we demonstrate the usability of Graph-FS and ASDP for feature selection in classification tasks on the real benchmark datasets listed in Table 1. For this purpose, we compared the classification performance of the features selected by both presented methods and by their combination (Graph-FS + ASDP) with the performance of the same classifiers learned on the complete input feature set. The results are shown in Table 4, Table 5, Table 6 and Table 7 for each specific classifier used. All tests were conducted with ten-fold cross-validation [60], using the average accuracy acc to indicate the method’s efficiency. The accuracy is defined by Equation (16):
acc = (number of correctly classified samples) / (number of all classified samples).     (16)
Note that the acc values in the tables represent the highest achieved classification results. Namely, in all test cases, all combinations of the classifier’s parameter values (see Section 4.1) were tested, except for the NBC, which has no tuning parameters and was used with the default settings. We also report the number of selected features and the parameters T_Δ and T_P used by the Graph-FS and Graph-FS + ASDP methods while obtaining the listed highest results. Since identical results were typically obtained for different combinations, we do not list the ASDP parameters w, shft_Δ, and shft_P. Table 1 gives the number of input features. The highest accuracy for each dataset is emphasized in bold. Here, we considered that the same accuracy can be achieved across different methods, regardless of the selected features.
Analysis shows an improvement in accuracy over the original dataset for all test cases except Ds1 with the RF classifier. Furthermore, Graph-FS and ASDP achieved similar classification scores. However, Graph-FS showed slightly higher accuracy than ASDP for Ds2, Ds3, Ds5, and Ds8 with the RF classifier, while the two methods matched in the case of Ds4, Ds5, Ds6, and Ds9. For the XGBOOST classifier, similar results are obtained, where Graph-FS is slightly better in classification accuracy than ASDP in cases Ds1, Ds2, Ds3, Ds7, and Ds8. In the case of the NBC classifier, Graph-FS achieved the best results in cases Ds2, Ds3, Ds5, and Ds8, while for Ds8, ASDP provides the most informative feature subset, achieving the highest accuracy among those in the comparison. We observed different results for the last classifier, KNN, with ASDP showing superior performance. It achieved the highest accuracy in cases Ds1, Ds3, Ds7, and Ds8.
Conversely, when comparing ASDP and Graph-FS + ASDP, we noticed improved classification performance of selected classifiers in some cases. For example, in the case of Ds4 for classifier RF, we achieved the highest classification accuracy with Graph-FS + ASDP for a selected feature subset that contains only two features, while Graph-FS and ASDP achieved the same results when subsets of 10 and 14 features were selected, respectively. Similar results can be found in the case of Ds2 and Ds6 across all classifiers, Ds1 for NBC, and Ds2 and Ds3 for the KNN classifier, where the combination of Graph-FS and ASDP achieved the highest measured accuracy but with a smaller number of features than Graph-FS and ASDP individually. The most interesting result is that for Ds7 for NBC, where Graph-FS and ASDP combined achieved the highest accuracy among all the measured results.
Finally, the results demonstrate the robustness of both approaches, as no significant deviations in the improvements were observed in experiments on various datasets with different numbers of features or samples. Both ASDP and Graph-FS + ASDP achieved comparable results regardless of the number of features, which can be low (e.g., Ds1 and Ds3) or high (e.g., Ds7 and Ds9). In addition, both approaches showed improvements in classification accuracy on datasets containing both small and large numbers of samples.

5. Discussion

This paper introduces an alternating suboptimal dynamic programming (ASDP) algorithm, primarily aimed at improving feature selection, in at least some cases, while remaining competitive in the others. It iteratively processes individual features and inverts the processing order in each iteration. This allows the optimization function to be improved by using the score from the previous iteration to estimate the contribution of the yet unprocessed features in the current one. We proved that convergence is achieved and that the time complexity is polynomial, O(n^4). Results on nine well-known machine learning benchmark datasets demonstrated that a single iteration of suboptimal dynamic programming (SDP-1) found the global optimum in 62.6% of cases, which ASDP significantly improved to 95.8% in only 3.7 iterations on average (and never more than 12). Although ASDP is relatively slow and thus limited to 200–300 input features, we have extended its usability by preceding it with our fast and highly accurate graph-cut-based feature selection (Graph-FS) method as a preprocessing step. This raised the proportion of solutions reaching the global optimum to 98% and reduced the average number of iterations to 3.4.
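To make the alternating scheme more concrete, the sketch below illustrates the idea in a simplified form. The set criterion (feature quality minus pairwise redundancy), the starting solution, and the per-feature include/exclude update are our own simplifying assumptions, not the paper's Equations (5) and (6); only the alternation of the traversal direction, the reuse of the previous iteration's solution for the unprocessed features, and the stopping rule (an unchanged solution in three successive iterations or an iteration threshold) follow the description above.

```python
# Schematic sketch of the alternating idea behind ASDP. The scoring function
# and the include/exclude update are simplifying assumptions, NOT the paper's
# recurrences; only the alternating traversal and the stopping rule follow
# the description in the text.
import numpy as np

def set_score(select, quality, similarity):
    """Assumed criterion: summed feature quality minus pairwise redundancy."""
    idx = np.flatnonzero(select)
    if idx.size == 0:
        return 0.0
    sub = similarity[np.ix_(idx, idx)]
    return quality[idx].sum() - 0.5 * (sub.sum() - np.trace(sub))

def alternating_selection(quality, similarity, max_iter=12, patience=3):
    n = len(quality)
    select = np.ones(n, dtype=bool)        # previous iteration's solution (arbitrary start)
    unchanged, it = 0, 0
    while it < max_iter and unchanged < patience:
        # alternate the processing order in each iteration
        order = range(n) if it % 2 == 0 else range(n - 1, -1, -1)
        new_select = select.copy()
        for i in order:
            keep, drop = new_select.copy(), new_select.copy()
            keep[i], drop[i] = True, False
            # features not yet processed in this pass keep their status from the
            # previous iteration, standing in for the estimated contribution of
            # the unprocessed features
            new_select[i] = (set_score(keep, quality, similarity)
                             >= set_score(drop, quality, similarity))
        unchanged = unchanged + 1 if np.array_equal(new_select, select) else 0
        select, it = new_select, it + 1
    return np.flatnonzero(select)

# toy example: random feature qualities and a symmetric similarity matrix
rng = np.random.default_rng(0)
q = rng.random(10)
S = rng.random((10, 10))
S = (S + S.T) / 2.0
print(alternating_selection(q, S))
```

Starting from the full feature set is an arbitrary choice in this sketch; any initial solution could be used, and the loop still terminates by the same stopping rule.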
We have also shown the practicality of ASDP and the Graph-FS + ASDP combination in classification. The latter was slightly behind or equal to Graph-FS alone when using the RF or XGBOOST classifiers, and sometimes slightly better when using the NBC. Falling behind may seem to contradict the proven convergence of ASDP, but the optimization criterion of ASDP and the classification accuracy of the used classifiers do not guarantee perfectly consistent results. Surprisingly, the ASDP method without Graph-FS prefiltering performed best when using the KNN classifier. Finally, in all but one case (with the RF classifier), the presented methods achieved better classification accuracy than the classifiers learned from the complete input feature set. Note that the superior performance of Graph-FS in comparison with state-of-the-art approaches was already demonstrated in [9]. We may thus conclude that ASDP and Graph-FS + ASDP are also entirely competitive.
The four contributions of the proposed method, listed in Section 1, were justified as follows. The first was confirmed by the proof of Proposition 1 and by the results in Table 2. Table 2 also confirmed the second promised contribution, which was further exceeded by the results in Table 3. The third contribution was confirmed by the proof of Proposition 3, as well as by the fact that the BF score in some cases in Table 2 could not be determined due to excessive time complexity. The fourth contribution was confirmed by the experiments in Section 4.3, in particular by the results in Table 6 and Table 7.
A disadvantage of using ASDP without preprocessing is that a larger number of features makes the method too slow or, depending on the implementation, even infeasible. It processes 200 features in 5 s on a regular PC and becomes practically unusable at 500 features. This is still a significant improvement over the exhaustive search approach, which reaches these limits at a very modest 25 and 30 features, respectively. However, for larger input sets, it makes sense to precede ASDP with some faster filtering. Conversely, Graph-FS + ASDP restricts the solution search space to subsets of the Graph-FS solution. We will try to achieve a compromise by cascading Graph-FS over 2–5 iterations, gradually lowering the thresholds T_Δ and T_p in each iteration and extending the selected set with features chosen from those not yet in the solution. We would also like to evaluate the use of ASDP in regression tasks in the future. In addition, we expect that the idea of alternating suboptimal optimization will soon be generalized to tasks beyond feature selection. In general, graph nodes can represent a wide variety of entities, and edges can represent any bilateral operation, such as distance, similarity, or correlation.
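As a closing illustration of such a feature graph, the sketch below builds correlation-based edges and applies the greedy vertex-cut idea from Figure 1 (select the remaining feature of highest quality, then remove its neighbourhood). It is not the authors' Graph-FS implementation: the quality measure (absolute correlation with the target), the single edge threshold, and the toy data are assumptions made for the example.

```python
# Minimal illustration of vertex-cut-style selection on a feature graph
# (cf. Figure 1). NOT the authors' Graph-FS: quality measure, edge threshold,
# and toy data are simplifying assumptions.
import numpy as np

def vertex_cut_selection(X, y, edge_threshold=0.7):
    n_features = X.shape[1]
    # edge weight: absolute pairwise feature correlation
    corr = np.abs(np.corrcoef(X, rowvar=False))
    # node quality: absolute correlation of each feature with the target
    quality = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])

    remaining = set(range(n_features))
    selected = []
    while remaining:
        best = max(remaining, key=lambda j: quality[j])   # highest-quality node
        selected.append(best)
        # remove the selected node and its neighbourhood (highly similar features)
        neighbours = {j for j in remaining if corr[best, j] >= edge_threshold}
        remaining -= neighbours | {best}
    return sorted(selected)

# toy usage with random data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=200) > 0).astype(int)
print(vertex_cut_selection(X, y))
```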

Author Contributions

Conceptualization, D.P. and D.V.; methodology, D.M. and D.V.; software, D.P. and D.V.; validation, D.P., D.V. and B.Ž.; formal analysis, D.V. and B.Ž.; investigation, D.P., D.V., D.M. and B.Ž.; resources, D.V.; data curation, D.V.; writing—original draft preparation, D.P., D.V. and B.Ž.; writing—review and editing, D.P. and D.M.; visualization, D.P. and D.V.; supervision, B.Ž. and D.M.; project administration, B.Ž.; funding acquisition, B.Ž. and D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Slovene Research and Innovation Agency under Research Project J2-4458 and Research Programme P2-0041.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADP   Approximate/Adaptive Dynamic Programming
ASDP   Alternating Suboptimal Dynamic Programming
BF   Brute Force
CPU   Central Processing Unit
DP   Dynamic Programming
Graph-FS   Graph-cut-based Feature Selection
IDP   Iterative Dynamic Programming
KNN   K-Nearest Neighbours classifier
LASSO   Least Absolute Shrinkage and Selection Operator
MDP   Markov Decision Process
NBC   Naive Bayes Classifier
RF   Random Forest
RL   Reinforcement Learning
SDP-1   Single iteration of alternating Suboptimal Dynamic Programming
UCI   University of California Irvine machine learning repository
XGBOOST   Extreme Gradient Boosting

References

  1. Liu, H.; Motoda, H. Feature Selection for Knowledge Discovery and Data Mining; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1998. [Google Scholar]
  2. Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  3. Kumar, V.; Minz, S. Feature selection: A literature Review. SmartCR 2014, 4, 211–229. [Google Scholar] [CrossRef]
  4. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324. [Google Scholar] [CrossRef]
  5. Bellman, R. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957. [Google Scholar]
  6. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2022. [Google Scholar]
  7. Liu, D.R.; Li, H.L.; Wang, D. Feature selection and feature learning for high-dimensional batch reinforcement learning: A survey. Int. J. Autom. Comput. 2015, 12, 229–242. [Google Scholar] [CrossRef]
  8. Kossmann, D.; Stocker, K. Iterative dynamic programming: A new class of query optimization algorithms. ACM Trans. Database Syst. 2000, 25, 43–82. [Google Scholar] [CrossRef]
  9. Vlahek, D.; Mongus, D. An Efficient Iterative Approach to Explainable Feature Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 2606–2618. [Google Scholar] [CrossRef] [PubMed]
  10. Forman, G. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. J. Mach. Learn. Res. 2003, 3, 1289–1305. [Google Scholar]
  11. Fakhraei, S.; Soltanian-Zadeh, H.; Fotouhi, F. Bias and Stability of Single Variable Classifiers for Feature Ranking and Selection. Expert Syst. Appl. 2014, 41, 6945–6958. [Google Scholar] [CrossRef] [PubMed]
  12. Liu, H.; Motoda, H. Computational Methods of Feature Selection; Chapman & Hall/CRC: Boca Raton, FL, USA, 2007; p. 440. [Google Scholar]
  13. Gu, Q.; Li, Z.; Han, J. Generalized Fisher Score for Feature Selection. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, UAI 2011, Barcelona, Spain, 14–17 July 2012; pp. 266–273. [Google Scholar]
  14. Li, H.; Jiang, T.; Zhang, K. Efficient and robust feature extraction by maximum margin criterion. In Proceedings of the Advances in Neural Information Processing Systems, Whistler, BC, Canada, 8–13 December 2003; Volume 16. [Google Scholar]
  15. He, X.; Cai, D.; Niyogi, P. Laplacian Score for Feature Selection. In Proceedings of the 18th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 5–8 December 2005; pp. 507–514. [Google Scholar]
  16. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2011; p. 744. [Google Scholar]
  17. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006; p. 792. [Google Scholar]
  18. Verleysen, M.; Rossi, F.; François, D. Advances in Feature Selection with Mutual Information. In Similarity-Based Clustering: Recent Developments and Biomedical Applications; Biehl, M., Hammer, B., Verleysen, M., Villmann, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 52–69. [Google Scholar]
  19. Breiman, L.; Friedman, J.; Stone, C.; Olshen, R. Classification and Regression Trees; Wadsworth International Group: Belmont, CA, USA, 1984. [Google Scholar]
  20. Strobl, C.; Boulesteix, A.L.; Augustin, T. Unbiased split selection for classification trees based on the Gini Index. Comput. Stat. Data Anal. 2007, 52, 483–501. [Google Scholar] [CrossRef]
  21. Raileanu, L.; Stoffel, K. Theoretical Comparison between the Gini Index and Information Gain Criteria. Ann. Math. Artif. Intell. 2004, 41, 77–93. [Google Scholar] [CrossRef]
  22. Krakovska, O.; Christie, G.; Sixsmith, A.; Ester, M.; Moreno, S. Performance comparison of linear and non-linear feature selection methods for the analysis of large survey datasets. PLoS ONE 2019, 14, e0213584. [Google Scholar] [CrossRef] [PubMed]
  23. Frénay, B.; Doquire, G.; Verleysen, M. Is mutual information adequate for feature selection in regression? Neural Netw. 2013, 48, 1–7. [Google Scholar] [CrossRef] [PubMed]
  24. Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer: Berlin/Heidelberg, Germany, 2006; p. 728. [Google Scholar]
  25. Bell, D.; Wang, H. A Formalism for Relevance and Its Application in Feature Subset Selection. Mach. Learn. 2000, 41, 175–195. [Google Scholar] [CrossRef]
  26. Kira, K.; Rendell, L.A. A Practical Approach to Feature Selection. In Proceedings of the Ninth International Workshop on Machine Learning, San Francisco, CA, USA, 1–3 July 1992; pp. 249–256. [Google Scholar]
  27. Kononenko, I.; Šimec, E.; Robnik-Šikonja, M. Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl. Intell. 1997, 7, 39–55. [Google Scholar] [CrossRef]
  28. Hall, M.A. Correlation-Based Feature Selection for Machine Learning. Ph.D. Thesis, The University of Waikato, Hamilton, New Zealand, 1999. [Google Scholar]
  29. Yu, L.; Liu, H. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; pp. 856–863. [Google Scholar]
  30. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef]
  31. Garcia-Ramirez, I.A.; Calderon-Mora, A.; Mendez-Vazquez, A.; Ortega-Cisneros, S.; Reyes-Amezcua, I. A novel framework for fast feature selection based on multi-stage correlation measures. Mach. Learn. Knowl. Extr. 2022, 4, 131–149. [Google Scholar] [CrossRef]
  32. Wang, L.; Zhou, N.; Chu, F. A General Wrapper Approach to Selection of Class-Dependent Features. IEEE Trans. Neural Netw. 2008, 19, 1267–1278. [Google Scholar] [CrossRef]
  33. Oliveira, L.S.; Sabourin, R.; Bortolozzi, F.; Suen, C.Y. A methodology for feature selection using multiobjective genetic algorithms for handwritten digit string recognition. Int. J. Pattern Recognit. Artif. Intell. 2003, 17, 903–929. [Google Scholar] [CrossRef]
  34. Jesenko, D.; Mernik, M.; Žalik, B.; Mongus, D. Two-Level Evolutionary Algorithm for Discovering Relations between Nodes Features in a Complex Network. Appl. Soft Comput. 2017, 56, 82–93. [Google Scholar] [CrossRef]
  35. Chuang, L.Y.; Chang, H.W.; Tu, C.J.; Yang, C.H. Improved binary PSO for feature selection using gene expression data. Comput. Biol. Chem. 2008, 32, 29–38. [Google Scholar] [CrossRef] [PubMed]
  36. Schiezaro, M.; Pedrini, H. Data feature selection based on Artificial Bee Colony algorithm. EURASIP J. Image Video Process. 2013, 47, 1–8. [Google Scholar] [CrossRef]
  37. Narendra, P.; Fukunaga, K. A Branch and Bound Algorithm for Feature Subset Selection. IEEE Trans. Comput. 1977, C-26, 917–922. [Google Scholar] [CrossRef]
  38. Gheyas, I.A.; Smith, L.S. Feature subset selection in large dimensionality domains. Pattern Recognit. 2010, 43, 5–13. [Google Scholar] [CrossRef]
  39. Somol, P.; Pudil, P.; Novovicová, J.; Paclík, P. Adaptive floating search methods in feature selection. Pattern Recognit. Lett. 1999, 20, 1157–1163. [Google Scholar] [CrossRef]
  40. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
  41. Zhao, P.; Yu, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541–2563. [Google Scholar]
  42. Buteneers, P.; Caluwaerts, K.; Dambre, J.; Verstraeten, D.; Schrauwen, B. Optimized parameter search for large datasets of the regularization parameter and feature selection for ridge regression. Neural Process. Lett. 2013, 38, 403–416. [Google Scholar] [CrossRef]
  43. Nelson, G.D.; Levy, D.M. A Dynamic Programming Approach to the Selection of Pattern Features. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 145–151. [Google Scholar] [CrossRef]
  44. Acır, N. Classification of ECG beats by using a fast least square support vector machines with a dynamic programming feature selection algorithm. Neural Comput. Appl. 2005, 14, 299–309. [Google Scholar] [CrossRef]
  45. Cheung, R.; Eisenstein, B. Feature selection via dynamic programming for text-independent speaker identification. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 397–403. [Google Scholar] [CrossRef]
  46. Moudani, W.; Shahin, A.; Shakik, F.; Mora-Camino, F. Dynamic programming applied to rough sets attribute reduction. J. Inf. Optim. Sci. 2013, 32, 1371–1397. [Google Scholar] [CrossRef]
  47. Bertsekas, D.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Nashua, NH, USA, 1996. [Google Scholar]
  48. Approximate Dynamic Programming. Available online: https://deepgram.com/ai-glossary/approximate-dynamic-programming (accessed on 23 April 2024).
  49. Mes, M.; Perez Rivera, A. Approximate Dynamic Programming by Practical Examples. In Markov Decision Processes in Practice; Boucherie, R., van Dijk, N.M., Eds.; Number 248; Springer: Berlin/Heidelberg, Germany, 2017; pp. 63–101. [Google Scholar]
  50. Loxley, P.N.; Cheung, K.W. A dynamic programming algorithm for finding an optimal sequence of informative measurements. Entropy 2023, 25, 251. [Google Scholar] [CrossRef] [PubMed]
  51. Petrik, M.; Taylor, G.; Parr, R.; Zilberstein, S. Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes. In 27th International Conference on Machine Learning (ICML 2010); Fürnkranz, J., Joachims, T., Eds.; Omnipress: Madison, WI, USA, 2010; pp. 871–878. [Google Scholar]
  52. Preux, P.; Girgin, S.; Loth, M. Feature discovery in approximate dynamic programming. In Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Nashville, TN, USA, 30 March–2 April 2009; pp. 109–116. [Google Scholar]
  53. Papadaki, K.P.; Powell, W.B. Exploiting structure in adaptive dynamic programming algorithms for a stochastic batch service problem. Eur. J. Oper. Res. 2002, 142, 108–127. [Google Scholar] [CrossRef]
  54. Luus, R. Optimal control by dynamic programming using systematic reduction in grid size. Int. J. Control 1990, 51, 995–1013. [Google Scholar] [CrossRef]
  55. Lock, J.; McKelvey, T. A computationally fast iterative dynamic programming method for optimal control of loosely coupled dynamical systems with different time scales. IFAC-PapersOnLine 2017, 50, 5953–5960. [Google Scholar] [CrossRef]
  56. Lincoln, B.; Rantzer, A. Suboptimal dynamic programming with error bounds. In Proceedings of the 41st IEEE Conference on Decision and Control, Las Vegas, NV, USA, 10–13 December 2002; Volume 2, pp. 2354–2359. [Google Scholar]
  57. Lincoln, B.; Rantzer, A. Relaxing dynamic programming. IEEE Trans. Autom. Control 2006, 51, 1249–1260. [Google Scholar] [CrossRef]
  58. Rantzer, A. Relaxed dynamic programming in switching systems. IEE Proc.-Control Theory Appl. 2006, 153, 567–574. [Google Scholar] [CrossRef]
  59. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu (accessed on 23 April 2024).
  60. Alpaydin, E. Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2010; p. 537. [Google Scholar]
Figure 1. Vertex-cut-based feature selection: (a) graph G, where the feature of the highest quality (coloured green) is selected and its neighbourhood (red) is removed; (b) repeating the same procedure on subgraph G′; (c) subgraph G″; (d) the output result {f_1, f_5, f_6} (in green) is obtained.
Figure 2. The concept of feature selection based on dynamic programming: (a) the partial solution to be stored in f_i considers the solutions stored in all its predecessors; (b) the situation after updating the status of f_i. S_i and s_i are calculated with Equations (6) and (5), respectively.
Figure 3. The concept of feature selection based on alternating suboptimal dynamic programming: the situation (a) before processing f_i during the reverse direction traversal; (b) after processing f_i during the reverse direction traversal; (c) before processing f_i during the forward direction traversal; and (d) after processing f_i during the forward direction traversal.
Table 1. Description of test datasets.
Dataset ID | Dataset Name | # Features | # Samples | # Classes
Ds1 | Abalone | 8 | 4177 | 2
Ds2 | Credit Approval | 15 | 690 | 2
Ds3 | Diabetes | 8 | 768 | 2
Ds4 | Ionosphere | 34 | 351 | 2
Ds5 | Letters | 16 | 20,000 | 26
Ds6 | Sonar | 60 | 208 | 2
Ds7 | Spambase | 57 | 4601 | 2
Ds8 | Vehicle | 18 | 946 | 4
Ds9 | Wisconsin Breast Cancer Diagnostic | 30 | 569 | 2
Table 2. Comparison of scores obtained by BF, SDP-1, and the ASDP method.
Dataset ID | # Tests | SDP-1 Score = BF Score [%] | ASDP Score = BF Score [%] | SDP-1 Score = ASDP Score [%] | Max. # Iterations | Avg. # Iterations
Ds1 | 125 | 58.4 | 100.0 | 58.4 | 5 | 3.4
Ds2 | 125 | 68.8 | 100.0 | 68.8 | 7 | 3.6
Ds3 | 125 | 76.8 | 100.0 | 76.8 | 7 | 3.3
Ds4 | 125 | / | / | 65.6 | 8 | 3.6
Ds5 | 125 | 54.4 | 94.4 | 54.4 | 6 | 3.4
Ds6 | 125 | / | / | 52.0 | 7 | 3.9
Ds7 | 125 | / | / | 50.4 | 11 | 4.4
Ds8 | 125 | 54.4 | 84.8 | 54.4 | 7 | 3.7
Ds9 | 125 | / | / | 72.0 | 8 | 3.7
Total * | 1125 | 62.6 | 95.8 | 61.4 (62.6) | | 3.7
* The # Tests column contains the sum, and the others contain average values.
Table 3. Comparison of scores of Graph-FS filtering used alone or postprocessed by BF or ASDP.
Dataset ID | # Features Selected by Graph-FS | # Tests | BF Score = Graph-FS Score [%] | ASDP Score = Graph-FS Score [%] | ASDP Score = BF Score [%] | Max. # Iterations | Avg. # Iterations
Ds1 | 1 or 2 | 250 | 74.0 | 74.0 | 100.0 | 3 | 3.0
Ds2 | 2 to 15 | 1250 | 29.8 | 29.8 | 99.1 | 8 | 3.3
Ds3 | 1 to 8 | 500 | 50.2 | 50.2 | 99.8 | 6 | 3.1
Ds4 | 2 | 125 | 32.0 | 32.0 | 100.0 | 3 | 3.0
Ds5 | 10 to 13 | 625 | 32.6 | 32.6 | 91.7 | 8 | 3.5
Ds6 | 1 to 11 | 625 | 43.7 | 43.7 | 98.7 | 6 | 3.5
Ds7 | 12 to 56 | 1375 | 27.2 * | 22.0 | 94.4 * | 12 | 4.0
Ds8 | 5 to 10 | 500 | 26.0 | 26.0 | 93.2 | 5 | 3.0
Ds9 | 1 to 7 | 375 | 53.6 | 53.3 | 99.7 | 5 | 3.3
Total ** | | 5625 | 38.7 | 34.8 | 98.0 | | 3.4
* Only 125 tests were used due to the limited number of features manageable by BF. ** The # Tests column contains the sum, and the others contain average values.
Table 4. Accuracies for the RF classifier after feature selection with Graph-FS, ASDP, their combination, or when using all input features.
Dataset ID | Graph-FS: # Selected Features | Graph-FS: acc | T_Δ | T_p | ASDP: # Selected Features | ASDP: acc | Graph-FS + ASDP: # Selected Features | Graph-FS + ASDP: acc | Input Data: acc
Ds1 | 1 | 46.17 | 0.35 | 0.4 | 5 | 53.11 | 1 | 46.17 | 53.34
Ds2 | 7 | 98.55 | 0.4 | 0.4 | 3 | 97.10 | 3 | 97.10 | 97.10
Ds3 | 3 | 80.51 | 0.3 | 0.4 | 2 | 74.02 | 2 | 74.02 | 71.42
Ds4 | 2 | 100 | 0.3 | 0.4 | 8 | 100 | 2 | 100 | 100
Ds5 | 13 | 95.85 | 0.3 | 0.65 | 15 | 94.95 | 11 | 88.25 | 95.20
Ds6 | 10 | 100 | 0.3 | 0.7 | 14 | 100 | 6 | 100 | 100
Ds7 | 56 | 96.52 | 0.45 | 0.55 | 14 | 90.67 | 13 | 85.46 | 94.79
Ds8 | 8 | 78.82 | 0.3 | 0.8 | 13 | 78.82 | 5 | 77.64 | 74.11
Ds9 | 7 | 78.94 | 0.35 | 0.8 | 1 | 78.94 | 5 | 75.43 | 75.43
Table 5. Accuracies for the XGBOOST classifier after Graph-FS, ASDP, Graph-FS + ASDP, or when using all input features.
Dataset ID | Graph-FS: # Selected Features | Graph-FS: acc | T_Δ | T_p | ASDP: # Selected Features | ASDP: acc | Graph-FS + ASDP: # Selected Features | Graph-FS + ASDP: acc | Input Data: acc
Ds1 | 1 | 55.02 | 0.35 | 0.4 | 5 | 54.54 | 1 | 55.02 | 53.34
Ds2 | 7 | 98.55 | 0.4 | 0.4 | 3 | 97.10 | 3 | 97.10 | 95.65
Ds3 | 3 | 72.72 | 0.3 | 0.4 | 2 | 71.45 | 2 | 71.42 | 71.42
Ds4 | 2 | 100 | 0.3 | 0.4 | 8 | 100 | 2 | 100 | 100
Ds5 | 13 | 95.70 | 0.3 | 0.65 | 15 | 95.70 | 11 | 90.00 | 95.35
Ds6 | 10 | 100 | 0.3 | 0.7 | 14 | 100 | 6 | 100 | 100
Ds7 | 56 | 96.52 | 0.45 | 0.55 | 14 | 89.15 | 13 | 85.68 | 94.79
Ds8 | 8 | 75.29 | 0.3 | 0.8 | 13 | 74.11 | 5 | 70.58 | 75.29
Ds9 | 7 | 78.94 | 0.35 | 0.8 | 1 | 78.94 | 4 | 77.19 | 75.43
Table 6. Accuracies for NBC after feature selection with Graph-FS, ASDP, their combination, or when using all input features.
Dataset ID | Graph-FS: # Selected Features | Graph-FS: acc | T_Δ | T_p | ASDP: # Selected Features | ASDP: acc | Graph-FS + ASDP: # Selected Features | Graph-FS + ASDP: acc | Input Data: acc
Ds1 | 1 | 55.02 | 0.35 | 0.4 | 5 | 54.06 | 1 | 55.02 | 53.34
Ds2 | 7 | 98.55 | 0.4 | 0.4 | 3 | 97.10 | 3 | 97.10 | 95.65
Ds3 | 3 | 75.32 | 0.3 | 0.4 | 2 | 72.72 | 2 | 72.72 | 71.42
Ds4 | 2 | 100 | 0.3 | 0.4 | 8 | 100 | 2 | 100 | 100
Ds5 | 13 | 63.40 | 0.3 | 0.65 | 15 | 62.60 | 11 | 52.6 | 61.45
Ds6 | 10 | 100 | 0.3 | 0.7 | 14 | 100 | 6 | 100 | 100
Ds7 | 56 | 90.60 | 0.45 | 0.55 | 14 | 91.32 | 13 | 97.61 | 59.65
Ds8 | 8 | 52.94 | 0.3 | 0.8 | 13 | 56.47 | 5 | 50.58 | 49.41
Ds9 | 7 | 78.94 | 0.35 | 0.8 | 1 | 78.94 | 4 | 77.19 | 75.43
Table 7. Accuracies for the KNN classifier after Graph-FS, ASDP, Graph-FS + ASDP, or when using all input features.
Dataset ID | Graph-FS: # Selected Features | Graph-FS: acc | T_Δ | T_p | ASDP: # Selected Features | ASDP: acc | Graph-FS + ASDP: # Selected Features | Graph-FS + ASDP: acc | Input Data: acc
Ds1 | 1 | 51.67 | 0.35 | 0.4 | 5 | 55.02 | 1 | 51.67 | 53.34
Ds2 | 7 | 95.65 | 0.4 | 0.4 | 3 | 95.65 | 3 | 95.65 | 78.26
Ds3 | 3 | 71.42 | 0.3 | 0.4 | 2 | 75.32 | 2 | 75.32 | 71.42
Ds4 | 2 | 100 | 0.3 | 0.4 | 8 | 100 | 2 | 100 | 100
Ds5 | 13 | 94.85 | 0.3 | 0.65 | 15 | 94.70 | 11 | 86.8 | 94.45
Ds6 | 10 | 100 | 0.3 | 0.7 | 14 | 100 | 6 | 100 | 100
Ds7 | 56 | 91.54 | 0.45 | 0.55 | 14 | 92.62 | 13 | 84.38 | 50.45
Ds8 | 8 | 67.05 | 0.3 | 0.8 | 13 | 72.94 | 5 | 63.52 | 69.41
Ds9 | 7 | 78.94 | 0.35 | 0.8 | 1 | 78.94 | 4 | 77.19 | 75.43