Article

Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

by Athanasios Alexopoulos 1, Georgios Drakopoulos 2, Andreas Kanavos 1,*, Phivos Mylonas 2 and Gerasimos Vonitsanos 1

1 Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece
2 Department of Informatics, Ionian University, 49100 Corfu, Greece
* Author to whom correspondence should be addressed.
Algorithms 2020, 13(3), 71; https://doi.org/10.3390/a13030071
Submission received: 26 February 2020 / Revised: 18 March 2020 / Accepted: 21 March 2020 / Published: 24 March 2020
(This article belongs to the Special Issue Mining Humanistic Data 2019)

Abstract

At the dawn of the 10V, or big data, era, a considerable number of sources such as smartphones, IoT devices, social media, smart city sensors, and healthcare systems constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing the data. Concerning the latter, new frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on such platforms, and consequently many algorithmic techniques have been developed and tailored for them. This article relies extensively, and in two ways, on classifiers implemented in MLlib, the main machine learning library of Apache Spark. First, a wide range of classifiers is applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets: the singular value decomposition (SVD) of the data matrix first determines a set of transformed attributes, which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar, if not better, level of the accuracy, recall, and F1 metrics. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. The experiments, based on the same Spark cluster, indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.
Keywords: Apache Spark; Apache MLlib; PySpark; big data; machine learning; 10V data; two-step classification; ensemble classification; SVD; SparkQL; computing performance; F1 metric; dataframe
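
The abstract outlines the SVD-plus-classifier pipeline; the following PySpark fragment is a minimal sketch of how such a two-step scheme can be wired together with Spark MLlib. It is illustrative only: the input path, the column names ("label", "features"), the number of retained components k, and the choice of logistic regression as the second-stage classifier are assumptions, not details taken from the paper, which evaluates several MLlib classifiers on the Higgs and PAMAP datasets.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
import numpy as np

spark = SparkSession.builder.appName("svd-two-step").getOrCreate()

# Hypothetical input: a DataFrame with a numeric "label" and a vector "features" column.
df = spark.read.parquet("hdfs:///data/higgs.parquet")

# Step 1: SVD of the feature matrix via the RDD-based MLlib API.
k = 10  # number of retained singular components (assumed)
rows = df.select("features").rdd.map(lambda r: r["features"].toArray())
svd = RowMatrix(rows).computeSVD(k, computeU=False)
V = svd.V.toArray()  # d x k matrix of right singular vectors

# Project every sample onto the k leading components and rebuild a DataFrame.
reduced_df = df.rdd.map(
    lambda r: (float(r["label"]),
               Vectors.dense(np.asarray(r["features"].toArray()).dot(V)))
).toDF(["label", "features"])

# Step 2: train an MLlib classifier on the transformed attributes.
train, test = reduced_df.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegression(maxIter=50).fit(train)
predictions = model.transform(test)

In the architecture described above, the same reduced DataFrame would feed each MLlib classifier under comparison, with accuracy, recall, and F1 then computed on the held-out split.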
