Feature Selection for High-Dimensional Data

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: closed (31 October 2017) | Viewed by 14908

Special Issue Editors


Dr. Verónica Bolón Canedo
Guest Editor
Grupo LIDIA, Departamento de Computación, Facultad de Informática, Universidade da Coruña, 15071 A Coruña, Spain
Interests: machine learning; pattern recognition; feature selection; medical applications

Dr. Noelia Sánchez-Maroño
Guest Editor
Departamento de Computación, Facultad de Informática, Universidade da Coruña, 15071 A Coruña, Spain
Interests: artificial intelligence; machine learning; pattern recognition; feature selection

Dr. Amparo Alonso-Betanzos
Guest Editor
Departamento de Computación, Facultad de Informática, Universidade da Coruña, 15071 A Coruña, Spain
Interests: computer science; artificial intelligence; machine learning; feature selection; scalability issues in machine learning

Special Issue Information

Dear Colleagues,

Feature selection has become one of the most active research areas of the last few years, driven by the appearance of datasets containing hundreds of thousands of features. It is a valuable tool for better modelling the underlying process of data generation, as well as for reducing the cost of acquiring features. Furthermore, from the machine learning perspective, because feature selection reduces the dimensionality of a problem, it can maintain or even improve an algorithm's performance while lowering computational costs.

The advent of Big Data has brought unprecedented challenges to machine learning researchers, who must now deal with huge volumes of data, in terms of both instances and features, making the learning task more complex and computationally demanding than ever. In particular, when the number of features is extremely large, learning algorithms can degrade due to overfitting, learned models become more complex and therefore less interpretable, and the speed and efficiency of the algorithms decline as the data grow. A vast body of feature selection methods exists in the literature, including filters based on different metrics (e.g., entropy, probability distributions, or information theory) as well as embedded and wrapper methods built around different induction algorithms. However, some of the most widely used algorithms were developed when datasets were much smaller and do not scale well, so these successful algorithms need to be readapted to handle Big Data problems.
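To make the filter approach mentioned above concrete, the following is a minimal sketch (in Python with scikit-learn; the synthetic dataset and the choice of k = 50 are purely illustrative assumptions on our part, not requirements of this call): rank features by mutual information with the class label and keep the top k.

```python
# Minimal filter-style feature selection sketch: rank features by an
# information-theoretic score and keep the k best. Dataset and k are
# illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic "wide" data: 200 samples, 5,000 features, few of them informative.
X, y = make_classification(n_samples=200, n_features=5000,
                           n_informative=20, random_state=0)

# Score every feature by mutual information with y, then keep the top 50.
selector = SelectKBest(score_func=mutual_info_classif, k=50)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 50)
```

Wrapper and embedded methods follow the same reduction goal but evaluate candidate subsets with an induction algorithm, which is typically far more expensive at this scale.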

In this Special Issue, we invite investigators to contribute with their recent developments in feature selection methods for high-dimensional settings, as well as review articles that will stimulate the continuing efforts to understand the problems usually encountered in this field.

Topics of interest include, but are not limited to:

  • New feature selection methods
  • Ensemble methods for feature selection
  • Feature selection for microarray data
  • Parallelization of feature selection methods
  • Missing data in the context of feature selection
  • Feature selection applications

Dr. Verónica Bolón Canedo
Dr. Noelia Sánchez-Maroño
Dr. Amparo Alonso-Betanzos
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website and then using the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers are published continuously in the journal (as soon as accepted) and listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Feature selection
  • Ensemble feature selection
  • Filters
  • Wrappers
  • Embedded methods

Published Papers (3 papers)


Research

Article
Gene Selection for Microarray Cancer Data Classification by a Novel Rule-Based Algorithm
by Adrian Pino Angulo
Information 2018, 9(1), 6; https://doi.org/10.3390/info9010006 - 2 Jan 2018
Cited by 25 | Viewed by 4568
Abstract
Because the number of genes vastly exceeds the number of samples, microarray data analysis is considered an extremely difficult task in sample classification. Feature selection mitigates this problem by removing irrelevant and redundant genes from the data. In this paper, we propose a new methodology for feature selection that aims to detect relevant, non-redundant, and interacting genes by analysing the feature value space instead of the feature space. Following this methodology, we also propose a new feature selection algorithm, namely Pavicd (Probabilistic Attribute-Value for Class Distinction). Experiments on fourteen microarray cancer datasets reveal that Pavicd obtains the best performance in terms of running time and classification accuracy when using Ripper-k and C4.5 as classifiers. When using an SVM (Support Vector Machine), the Gbc (Genetic Bee Colony) wrapper algorithm achieves the best results; however, Pavicd is significantly faster.
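The abstract's key idea is scoring in the feature value space rather than the feature space. The sketch below is a loose, hypothetical illustration of that general idea only — it is not the published Pavicd criterion, and the scoring rule, function names, and toy data are all our own assumptions: each (attribute, value) pair is scored by how far its class-conditional distribution deviates from the class prior.

```python
import numpy as np

def attribute_value_scores(X, y):
    """Hypothetical illustration (NOT the published Pavicd criterion): score
    each (feature, value) pair by how far P(class | feature == value)
    deviates from the class prior P(class), weighted by the value's frequency."""
    classes, counts = np.unique(y, return_counts=True)
    prior = counts / len(y)
    scores = {}
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j]):
            mask = X[:, j] == v
            cond = np.array([np.mean(y[mask] == c) for c in classes])
            # L1 distance between conditional and prior class distributions.
            scores[(j, v)] = mask.mean() * np.abs(cond - prior).sum()
    return scores

# Toy discretised data: three features, the last one informative about the class.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
X = np.column_stack([rng.integers(0, 3, size=100),
                     rng.integers(0, 3, size=100),
                     y + rng.integers(0, 2, size=100)])
scores = attribute_value_scores(X, y)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # an (attribute, value) pair of feature 2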

Article
sCwc/sLcc: Highly Scalable Feature Selection Algorithms
by Kilho Shin, Tetsuji Kuboyama, Takako Hashimoto and Dave Shepard
Information 2017, 8(4), 159; https://doi.org/10.3390/info8040159 - 6 Dec 2017
Cited by 8 | Viewed by 4873
Abstract
Feature selection is a useful tool for identifying which features, or attributes, of a dataset cause or explain the phenomena that the dataset describes, and for improving the efficiency and accuracy of learning algorithms that discover such phenomena. Consequently, feature selection has been studied intensively in machine learning research. However, while feature selection algorithms with excellent accuracy have been developed, they are seldom used to analyse high-dimensional data, because such data usually include too many instances and features, which makes traditional feature selection algorithms inefficient. To eliminate this limitation, we improved the run-time performance of two of the most accurate feature selection algorithms in the literature. The result is two accurate and fast algorithms, namely sCwc and sLcc. Multiple experiments with real social media datasets demonstrate that our algorithms improve remarkably on the originals. For example, on one dataset with 15,568 instances and 15,741 features and another with 200,569 instances and 99,672 features, sCwc performed feature selection in 1.4 seconds and 405 seconds, respectively; it is estimated that the original algorithms would need several hours to dozens of days to process the same datasets. In addition, sLcc turned out to be as fast as sCwc on average. We also introduce a fast implementation of our algorithms: sCwc does not require any tuning parameter, while sLcc takes a threshold parameter that can be used to control the number of features the algorithm selects.
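Cwc and Lcc are consistency-based selectors, and the paper's contribution is the speed-up, which the sketch below does not reproduce. As a hedged illustration of consistency-constrained selection in general (all function names and the toy data are our own assumptions), a naive backward elimination drops features while a consistency measure stays above a threshold; the threshold is analogous in spirit to the sLcc parameter the abstract mentions.

```python
import numpy as np
from collections import Counter, defaultdict

def consistency(X, y, features):
    """Fraction of instances that agree with the majority class of their
    feature-value pattern (1.0 means the subset fully determines the class)."""
    groups = defaultdict(list)
    for row, label in zip(X[:, features], y):
        groups[tuple(row)].append(label)
    agree = sum(Counter(labels).most_common(1)[0][1] for labels in groups.values())
    return agree / len(y)

def backward_select(X, y, threshold=1.0):
    """Greedily drop features while consistency stays at or above the
    threshold. Lowering the threshold trades consistency for a smaller
    subset -- a naive sketch, not the published sCwc/sLcc algorithms."""
    selected = list(range(X.shape[1]))
    for f in reversed(range(X.shape[1])):
        trial = [g for g in selected if g != f]
        if trial and consistency(X, y, trial) >= threshold:
            selected = trial
    return selected

# Toy data: the class is the XOR of the first two binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = X[:, 0] ^ X[:, 1]
print(backward_select(X, y))  # expected to keep features 0 and 1
```

The naive version re-evaluates consistency from scratch at every step; the scalability results reported in the paper come precisely from avoiding that kind of redundant work.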

Article
Ensemble of Filter-Based Rankers to Guide an Epsilon-Greedy Swarm Optimizer for High-Dimensional Feature Subset Selection
by Mohammad Bagher Dowlatshahi, Vali Derhami and Hossein Nezamabadi-pour
Information 2017, 8(4), 152; https://doi.org/10.3390/info8040152 - 22 Nov 2017
Cited by 33 | Viewed by 4859
Abstract
The main purpose of feature subset selection is to remove irrelevant and redundant features from data so that learning algorithms can be trained on a subset of relevant features. Many algorithms have been developed for feature subset selection, and most suffer from two major problems on high-dimensional datasets: first, some search a high-dimensional feature space without any domain knowledge about feature importance; second, most were originally designed for continuous optimization problems, whereas feature selection is a binary optimization problem. To overcome these weaknesses, we propose a novel hybrid filter-wrapper algorithm, called Ensemble of Filter-based Rankers to guide an Epsilon-greedy Swarm Optimizer (EFR-ESO), for high-dimensional feature subset selection. The Epsilon-greedy Swarm Optimizer (ESO) is a new binary swarm intelligence algorithm, introduced in this paper as the wrapper. In the proposed EFR-ESO, we extract knowledge about feature importance with an ensemble of filter-based rankers and then use this knowledge to weight the feature probabilities in the ESO. Experiments on 14 high-dimensional datasets indicate that the proposed algorithm performs excellently in terms of both classification error rate and the number of selected features.
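As a hedged illustration of the general mechanism the abstract describes — not the published EFR-ESO, whose swarm dynamics are more involved — the sketch below aggregates several filter rankings into per-feature weights and then assembles a subset epsilon-greedily, mostly exploiting high-weight features and occasionally exploring. The Borda-style aggregation and all names are our own assumptions.

```python
import numpy as np

def epsilon_greedy_subset(filter_rankings, n_select, epsilon=0.1, rng=None):
    """Hypothetical sketch (not the published EFR-ESO): combine filter
    rankings into per-feature weights, then pick a subset epsilon-greedily."""
    rng = rng or np.random.default_rng()
    n_features = len(filter_rankings[0])
    # Borda-style aggregation: rank 0 is best, so invert ranks into weights.
    weights = np.zeros(n_features)
    for ranking in filter_rankings:
        weights += n_features - np.asarray(ranking, dtype=float)
    order = list(np.argsort(-weights))  # best-weighted features first
    chosen, pool = [], set(range(n_features))
    for _ in range(n_select):
        if rng.random() < epsilon:                       # explore
            pick = int(rng.choice(sorted(pool)))
        else:                                            # exploit
            pick = int(next(f for f in order if f in pool))
        chosen.append(pick)
        pool.remove(pick)
    return chosen

# Two hypothetical filter rankings: ranking[f] is the rank of feature f (0 = best).
r1 = [0, 1, 2, 3, 4]
r2 = [1, 0, 3, 2, 4]
print(epsilon_greedy_subset([r1, r2], n_select=3, epsilon=0.2,
                            rng=np.random.default_rng(0)))
```

In a full wrapper, each subset produced this way would be scored by training a classifier, with the resulting fitness feeding back into the swarm's search.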
