Machine Learning-Based Feature Extraction and Selection

Ruano-Ordás, David

doi:10.3390/app14156567

Open Access Editorial

by David Ruano-Ordás ¹,²

¹ CINBIO, Department of Computer Science, ESEI—Escuela Superior de Ingeniería Informática, University of Vigo, 32004 Ourense, Spain

² SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36213 Vigo, Spain

Appl. Sci. 2024, 14(15), 6567; https://doi.org/10.3390/app14156567 (registering DOI)

Submission received: 22 July 2024 / Accepted: 23 July 2024 / Published: 27 July 2024

(This article belongs to the Special Issue Machine-Learning-Based Feature Extraction and Selection)

Over the last decade, technological advances have transformed the landscape of data management, transmission, processing, and storage. Three developments stand out: (i) improvements in the speed and capacity of data storage devices, which allow information to be managed without time or financial constraints; (ii) the enhancement of communication infrastructures, which enables continuous and unlimited network access; and (iii) an epoch-making leap in computing capabilities, driven by a technology war between semiconductor and processing hardware manufacturers to design high-performance devices capable of meeting the processing demands of the new generation of AI models.

This scenario, along with the emergence and widespread acceptance of instant communications, social networks, and collaborative applications, has led to revolutionary changes in human interaction and to the advent of the information age. The unprecedented capabilities of current communication devices (smartphones, tablets, laptops, etc.) have enabled people to be permanently interconnected. The authors of [1,2] show that the widespread use of instant messaging applications has dramatically displaced spontaneous face-to-face interactions, highlighting a societal shift from traditional in-person communication toward virtual communication. A clear example is described by the authors of [3,4], who concluded that around 70% of inhabitants in digitised countries use instant messaging applications to communicate and predicted a bi-annual growth of 13.6%. This shift has a strong impact on two key issues of the information and technology era: (i) a significant increase in mental health problems (such as depression, shyness, social anxiety, and social network addiction) due to the lack of face-to-face skills and the inadequate management of immediacy [5,6], and (ii) the massive dissemination and storage of diverse information (text, voice, or images) from multiple (and almost unlimited) data sources.

The overwhelming amount of unprocessed textual data lacks significance and utility unless appropriate methodologies are employed to identify and extract valuable insights. In this context, feature extraction and selection methods have become crucial mechanisms for alleviating two key challenges of high-dimensional data: (i) the increased computational effort required for its processing and analysis, and (ii) the presence of duplicated or meaningless information associated with the curse of dimensionality. Although both techniques aim to enhance the performance and interpretability of AI models by retaining the most relevant information, they operate on different principles. Feature extraction transforms the original data into a new, lower-dimensional representation that captures its inherent patterns, allowing better visualisation and easier interpretation [7]. In contrast, feature selection identifies and retains a subset of the most informative original features, leading to improved simplicity and efficiency [8]. To illustrate both approaches, we have carefully selected several works that make significant contributions by presenting pertinent ideas, methodologies, and models. A summary of these works and their specific findings is given below.
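The difference between the two families can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the toy data are made up, correlation ranking stands in for feature selection in general, and a single principal axis found by power iteration stands in for feature extraction. The helper names (`select_top_k`, `first_principal_component`) are hypothetical and do not come from any cited work.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_top_k(X, y, k):
    """Feature selection: keep the k ORIGINAL columns most correlated with y."""
    cols = list(zip(*X))
    ranked = sorted(range(len(cols)),
                    key=lambda j: abs(pearson(cols[j], y)), reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]

def first_principal_component(X, iters=200):
    """Feature extraction: build ONE NEW feature by projecting the centred
    rows onto the leading principal axis, found by power iteration."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    cov = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v, [sum(r[j] * v[j] for j in range(d)) for r in Xc]

# Made-up toy data: columns 0 and 1 rise together with the label,
# column 2 alternates and carries little signal.
X = [[1.0, 2.1, 5.0], [2.0, 3.9, 4.0], [3.0, 6.2, 5.0],
     [4.0, 8.1, 4.0], [5.0, 9.8, 5.0]]
y = [0.0, 0.0, 1.0, 1.0, 1.0]

kept, X_sel = select_top_k(X, y, k=2)          # subset of original columns
axis, scores = first_principal_component(X)    # one derived column
```

Selection returns a subset of the original columns, so each retained feature keeps its meaning. Extraction returns a derived score that mixes all columns, which is more compact but harder to relate to the raw measurements.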

The current and unprecedented ability to digitise and store massive amounts of information has fostered key scientific breakthroughs in the medical sector, a key pillar of the welfare system. In this context, several works have been published in recent years focusing on extracting suitable information from medical images to improve pathological examinations and clinical diagnosis. A particularly insightful contribution to this field provides a comprehensive review of the published investigations in computed tomography (CT) lung cancer images and includes an overview of the commonly used feature selection methods and predictive models [9]. After comparing the impacts and limitations of each method in a clinical environment, the authors emphasise that assessment and validation of feature selection methods and predictive models remain essential to enhancing feature stability and reproducibility. Along the same lines, the authors of [10] describe a novel technique for detecting bone fractures in X-ray images. The authors use the well-known grey level co-occurrence matrix (GLCM) technique to identify the most representative features of each X-ray image. The selected features were fed to a fuzzy network classification model, achieving an accuracy of 0.8849. Finally, the work presented by the authors of [11] introduces a novel methodology that combines feature extraction and selection techniques to detect and classify brain tumours in medical images. The first stage (feature extraction) obtains a set of features by blending GLCM and local binary pattern (LBP) techniques with four well-known convolutional neural network models (AlexNet, VGG16, EfficientNetB0, and ResNet50). The second stage (feature selection) then applies three optimisation algorithms (genetic algorithms, particle swarm optimisation, and artificial bee colony) to the previously obtained feature set to select the most adequate features. The selected features were classified using a support vector machine model, achieving an accuracy rate of 98.22%.
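For readers unfamiliar with GLCM, the following minimal sketch shows how a co-occurrence matrix and a few common texture descriptors can be computed. It is an assumption-laden illustration: the 4x4 image is made up, only the horizontal neighbour offset is used, and it does not reproduce the pipelines of [10] or [11].

```python
def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Normalised co-occurrence matrix: how often grey level i appears
    next to grey level j at pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    counts[j][i] += 1  # count the pair in both directions
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]

def texture_features(p):
    """Contrast, angular second moment (ASM) and homogeneity of a GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "asm": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
    }

# Made-up 4x4 image with grey levels 0..3 (not taken from any cited study).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
features = texture_features(glcm(image, levels=4))
```

In practice, descriptors like these are usually computed for several offsets and angles and concatenated into a feature vector before classification.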

The current model of free access to online content is closely linked to the commodification of user data. Users navigate the Internet unaware that they are sharing their personal information as part of a transaction to gain access to unlimited services and content. This personal information is essential for designing accurate marketing strategies. Analysing user preferences, search patterns, and engagement metrics allows marketers to refine their targeting efforts, ensuring the delivery of relevant content and customised advertisements to maximise the efficiency and impact of digital marketing campaigns. One of the most interesting works related to feature extraction introduces a novel location-based advertising (LBA) scheme using a deep learning-based bidirectional hybrid model [12]. The authors propose a methodology based on a geographical information system (GIS) to accurately obtain and interpret the spatial location of users, utilising a deep sparse autoencoder (DSA) to perform feature extraction and a bidirectional optimised hybrid model (BLSTM-DNN-ASOA) to conduct the classification task. The proposed model achieves better performance than existing approaches. Next, the authors of [13] discuss the role and relevance of feature selection in neuromarketing for improving the accuracy of user preference detection, based on the brain’s response to incoming marketing stimuli. An experimental study combining the most widely used feature selection methods with well-known machine learning models was conducted to identify the feature selection approach that achieves the best global performance. Along the same lines, the work presented by the authors of [14] uses a wrapper feature selection method (SVM-RFE) to predict future consumer choices.

Stock market prediction is another interesting area that benefits greatly from the information age. The ability of AI models to analyse dynamic real-time datasets is a major advancement in stock market forecasting for two reasons: (i) it allows non-linear relationships to be discerned and latent factors that may elude traditional analytical approaches to be uncovered, and (ii) it captures a realistic view of market dynamics by assimilating diverse data sources, such as social media sentiment, news articles, and historical market trends. Two leading contributions in this field provide an exhaustive review of feature extraction and selection methods for stock market prediction [15,16]. The first work examines 32 research articles published from 2011 to 2022 that address the combination of feature analysis and ML approaches in multiple stock market applications. The results show that correlation criteria, random forest, principal component analysis, and autoencoders are the feature selection and extraction techniques with the best prediction accuracy. The second contribution presents a complete literature review of the data preprocessing strategies, feature extraction and selection techniques, prediction models, and prospects in stock market forecasting. Additionally, after identifying and discussing the major advantages and flaws of the collected studies, the authors propose a novel structured methodology to make the stock price prediction process more flexible and accurate.

This Special Issue brings together six papers showing different feature selection and extraction methodologies applied to five different topics: (i) identification of malicious URLs, (ii) recognition and classification of music genres from audio data, (iii) assessment of the impact of land mining activities, (iv) feature reduction applied to high-dimensional datasets, and (v) knowledge base construction from the scientific literature.

The first study [17] demonstrates the effectiveness of the DA-BiGRU model in detecting malicious URLs and suggests that future research should focus on optimising the model for real-life applications. The study also emphasises the positive impact of Word2Vec training and of combining the BiGRU model with a regularisation mechanism and an attention mechanism. The paper provides a comprehensive overview of the proposed method and its potential implications for future research and real-world applications.

The work of Ashraf et al. [18] discusses the significance of deep learning in music classification, particularly the effectiveness of CNNs and RNNs in extracting features and handling temporal dependencies in sequential data. It highlights the limitations of traditional techniques and the potential of deep learning to automatically extract unbiased features for music classification. The study demonstrates the effectiveness of the proposed hybrid model and points to future experiments on other datasets for music classification, instrument recognition, and artist recognition.

Following this, the authors of [19] use satellite imagery to assess the impact of 30 years of mining (1990 to 2020) on Emalahleni (South Africa). The random forest algorithm was applied for land categorisation, distinguishing five classes: settlement, water, mining area, vegetation, and bare land. The results demonstrate that mining has a harmful effect on land cover, emphasising the need for sustainable land management strategies that balance economic activities with environmental protection for future generations. In summary, the data obtained provide a basis for further research and a foundation for establishing a policy for future land-use decisions in the region.

Regarding feature reduction in high-dimensional datasets, two works were published [20,21]. The authors of [20] propose a new method for selecting and ranking group features based on their importance, with a focus on high-dimensional datasets with limited samples. The proposed method is divided into two stages: (i) dimensionality reduction, which eliminates irrelevant individual features, and (ii) group feature ranking, which uses random forest to assess the global importance of each feature cluster. The study demonstrates that the proposed method achieves competitive performance in machine learning tasks compared to existing methods. In turn, the work of Al-Eiadeh et al. [21] introduces MBHO, a novel feature selection method based on a modified black hole algorithm. The proposed algorithm utilises a mutation technique (inversion) and an enhanced fitness function that considers feature relationships and relevance to classification labels. Experiments were performed on 14 benchmark datasets, demonstrating the effectiveness of MBHO in improving classification performance and reducing feature set size.
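The general idea of ranking feature groups by permutation can be sketched as follows. This is not the method of [20]: to stay dependency-free, the sketch swaps the random forest for a simple nearest-centroid classifier, scores on the training data for brevity, and uses synthetic data. The group names and helper functions are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid model: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def accuracy(centroids, X, y):
    """Fraction of rows assigned to the centroid of their true class."""
    def predict(x):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[c])))
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def group_permutation_importance(centroids, X, y, groups, repeats=5, seed=0):
    """Mean accuracy drop when all columns of a group are shuffled
    together across rows, which breaks their link with the label."""
    rng = random.Random(seed)
    baseline = accuracy(centroids, X, y)
    importance = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(repeats):
            order = list(range(len(X)))
            rng.shuffle(order)
            X_perm = [row[:] for row in X]
            for i, src in enumerate(order):
                for j in cols:
                    X_perm[i][j] = X[src][j]
            drops.append(baseline - accuracy(centroids, X_perm, y))
        importance[name] = sum(drops) / repeats
    return importance

# Synthetic data (assumption): columns 0-1 depend on the label,
# columns 2-3 are pure noise.
rng = random.Random(42)
y = [i % 2 for i in range(80)]
X = [[4 * t + rng.gauss(0, 0.5), 4 * t + rng.gauss(0, 0.5),
      rng.gauss(0, 1), rng.gauss(0, 1)] for t in y]

groups = {"signal": [0, 1], "noise": [2, 3]}  # hypothetical groups
model = fit_centroids(X, y)
importance = group_permutation_importance(model, X, y, groups)
ranking = sorted(importance, key=importance.get, reverse=True)
```

Shuffling the columns of a group jointly preserves the correlations inside the group, so the measured drop reflects the contribution of the group as a whole rather than of its individual members.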

Finally, the authors of [22] propose a self-adaptive feature words (SAFW) method to enhance the extraction of relational quintuples from unstructured text for scientific knowledge base construction. Unlike prior approaches that rely on explicit entity clues, SAFW generates feature words dynamically for each sample, improving correlation information in the knowledge graph. The model creates new word representations and integrates them into the original sentence to determine relation types and locate entities. Evaluations conducted on four publicly available datasets demonstrate the superior performance of SAFW when compared to existing benchmarks. By effectively utilising BERT as an encoding mechanism and incorporating previously discarded words, SAFW not only enhances the interpretability of knowledge graphs but also strengthens the representation of relationships within these graphs. This methodology is applicable to hidden knowledge discovery scenarios and diverse vertical domains, opening up avenues for novel insights.

Despite the contributions described above, machine learning-based feature extraction and selection still poses major challenges that have yet to be resolved. We sincerely hope that readers enjoy the Special Issue and find it useful for understanding the real value of the textual information compiled worldwide. We thank all authors for their contributions to this Special Issue and the reviewers for their efforts to improve the quality of the collected papers.

Acknowledgments

The SING research group thanks CITI (Centro de Investigación, Transferencia e Innovación) of the University of Vigo for hosting its information technology (IT) infrastructure.

Conflicts of Interest

The author declares that he has no conflicts of interest regarding the publication of this paper.

References

1. Ruben, M.A.; Stosic, M.D.; Correale, J.; Blanch-Hartigan, D. Is Technology Enhancing or Hindering Interpersonal Communication? A Framework and Preliminary Results to Examine the Relationship between Technology Use and Nonverbal Decoding Skill. Front. Psychol. 2021, 11, 611670.
2. Verduyn, P.; Schulte-Strathaus, J.C.C.; Kross, E.; Hülsheger, U.R. When Do Smartphones Displace Face-to-Face Interactions and What to Do about It? Comput. Hum. Behav. 2021, 114, 106550.
3. Kemp, S. Digital 2023: Global Overview Report. Available online: https://datareportal.com/reports/digital-2023-global-overview-report (accessed on 22 July 2024).
4. Ceci, L. Mobile Messaging Users Worldwide 2025. Available online: https://www.statista.com/statistics/483255/number-of-mobile-messaging-users-worldwide/ (accessed on 22 July 2024).
5. Scott, D.A.; Valley, B.; Simecka, B.A. Mental Health Concerns in the Digital Age. Int. J. Ment. Health Addict. 2017, 15, 604–613.
6. Pierce, T. Social Anxiety and Technology: Face-to-Face Communication versus Technological Communication among Teens. Comput. Hum. Behav. 2009, 25, 1367–1372.
7. Guyon, I.; Elisseeff, A. An Introduction to Feature Extraction. In Feature Extraction; Guyon, I., Nikravesh, M., Gunn, S., Zadeh, L.A., Eds.; Studies in Fuzziness and Soft Computing; Springer: Berlin/Heidelberg, Germany, 2006; Volume 207, pp. 1–25. ISBN 978-3-540-35487-1.
8. Liu, H. Feature Selection. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2011; pp. 402–406. ISBN 978-0-387-30768-8.
9. Ge, G.; Zhang, J. Feature Selection Methods and Predictive Models in CT Lung Cancer Radiomics. J. Appl. Clin. Med. Phys. 2023, 24, e13869.
10. Narayan, V.; Mall, P.K.; Awasthi, S.; Srivastava, S.; Gupta, A. FuzzyNet: Medical Image Classification Based on GLCM Texture Feature. In Proceedings of the 2023 International Conference on Artificial Intelligence and Smart Communication (AISC), Greater Noida, India, 27–29 January 2023; IEEE: New York, NY, USA, 2023; pp. 769–773.
11. Başaran, E. A New Brain Tumor Diagnostic Model: Selection of Textural Feature Extraction Algorithms and Convolution Neural Network Features with Optimization Algorithms. Comput. Biol. Med. 2022, 148, 105857.
12. Rohilla, V.; Chakraborty, S.; Kumar, R. Deep Learning Based Feature Extraction and a Bidirectional Hybrid Optimized Model for Location Based Advertising. Multimed. Tools Appl. 2022, 81, 16067–16095.
13. Al-Nafjan, A. Feature Selection of EEG Signals in Neuromarketing. PeerJ Comput. Sci. 2022, 8, e944.
14. Mashrur, F.R.; Rahman, K.M.; Miya, M.T.I.; Vaidyanathan, R.; Anwar, S.F.; Sarker, F.; Mamun, K.A. An Intelligent Neuromarketing System for Predicting Consumers’ Future Choice from Electroencephalography Signals. Physiol. Behav. 2022, 253, 113847.
15. Htun, H.H.; Biehl, M.; Petkov, N. Survey of Feature Selection and Extraction Techniques for Stock Market Prediction. Financ. Innov. 2023, 9, 26.
16. Ayyappa, Y.; Kumar, A.P.S. A Compact Literature Review on Stock Market Prediction. In Proceedings of the 2022 4th International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 21–23 September 2022; IEEE: New York, NY, USA, 2022; pp. 1336–1347.
17. Wu, T.; Wang, M.; Xi, Y.; Zhao, Z. Malicious URL Detection Model Based on Bidirectional Gated Recurrent Unit and Attention Mechanism. Appl. Sci. 2022, 12, 12367.
18. Ashraf, M.; Abid, F.; Din, I.U.; Rasheed, J.; Yesiltepe, M.; Yeo, S.F.; Ersoy, M.T. A Hybrid CNN and RNN Variant Model for Music Classification. Appl. Sci. 2023, 13, 1476.
19. Cudjoe, M.N.M.; Kwarteng, E.V.S.; Anning, E.; Bodunrin, I.R.; Andam-Akorful, S.A. Application of Remote Sensing and Geographic Information System Technologies to Assess the Impact of Mining: A Case Study at Emalahleni. Appl. Sci. 2024, 14, 1739.
20. Zubair, I.M.; Lee, Y.-S.; Kim, B. A New Permutation-Based Method for Ranking and Selecting Group Features in Multiclass Classification. Appl. Sci. 2024, 14, 3156.
21. Al-Eiadeh, M.R.; Qaddoura, R.; Abdallah, M. Investigating the Performance of a Novel Modified Binary Black Hole Optimization Algorithm for Enhancing Feature Selection. Appl. Sci. 2024, 14, 5207.
22. Liu, Y.; Fu, L.; Xia, X.; Zhang, Y. Exploring the Role of Self-Adaptive Feature Words in Relation Quintuple Extraction for Scientific Literature. Appl. Sci. 2024, 14, 4020.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
