
Mach. Learn. Knowl. Extr., Volume 1, Issue 1 (March 2019) – 32 articles

  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the table of contents of newly released issues.
  • Papers are published in both HTML and PDF forms; the PDF is the official version of record. To view a paper in PDF format, click on the "PDF Full-text" link, and use the free Adobe Reader to open it.
30 pages, 3135 KiB  
Review
Evaluation of Regression Models: Model Assessment, Model Selection and Generalization Error
by Frank Emmert-Streib and Matthias Dehmer
Mach. Learn. Knowl. Extr. 2019, 1(1), 521-551; https://doi.org/10.3390/make1010032 - 22 Mar 2019
Cited by 65 | Viewed by 8332
Abstract
When performing a regression or classification analysis, one needs to specify a statistical model. This model should avoid the overfitting and underfitting of data, and achieve a low generalization error that characterizes its prediction performance. In order to identify such a model, one needs to decide which model to select from candidate model families based on performance evaluations. In this paper, we review the theoretical framework of model selection and model assessment, including error-complexity curves, the bias-variance tradeoff, and learning curves for evaluating statistical models. We discuss criterion-based, step-wise selection procedures and resampling methods for model selection; among these, cross-validation provides the simplest and most generic means for computationally estimating all required entities. To make the theoretical concepts transparent, we present worked examples for linear regression models. However, our conceptual presentation is extensible to more general models, as well as classification problems. Full article
(This article belongs to the Section Learning)
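The cross-validation workflow the review builds on can be sketched in a few lines. This is a generic illustration on synthetic data (the polynomial model families, sample size, and noise level are invented for the example), not code from the paper:

```python
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """Estimate the generalization error (MSE) of least-squares
    regression on design matrix X by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
        errs.append(np.mean((y[test] - X[test] @ coef) ** 2))
    return float(np.mean(errs))

# Synthetic data: the true signal is linear.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 80)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=80)

# Model selection: compare polynomial families of different complexity.
cv = {d: kfold_cv_mse(np.vander(x, d + 1), y) for d in (0, 1, 12)}
best = min(cv, key=cv.get)
```

Selecting the family with the lowest cross-validated error is exactly the model selection step the review formalizes: the underfitting constant model (degree 0) is clearly rejected, while the comparison between the linear and high-degree families reflects the bias-variance tradeoff.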

17 pages, 2200 KiB  
Article
A Near Real-Time Automatic Speaker Recognition Architecture for Voice-Based User Interface
by Parashar Dhakal, Praveen Damacharla, Ahmad Y. Javaid and Vijay Devabhaktuni
Mach. Learn. Knowl. Extr. 2019, 1(1), 504-520; https://doi.org/10.3390/make1010031 - 19 Mar 2019
Cited by 45 | Viewed by 6794
Abstract
In this paper, we present a novel pipelined near real-time speaker recognition architecture that enhances the performance of speaker recognition by exploiting the advantages of hybrid feature extraction techniques that contain the features of Gabor Filter (GF), Convolutional Neural Networks (CNN), and statistical parameters as a single matrix set. This architecture has been developed to enable secure access to a voice-based user interface (UI) by enabling speaker-based authentication and integration with an existing Natural Language Processing (NLP) system; gaining secure access to existing NLP systems also served as motivation. Initially, we identify challenges related to real-time speaker recognition and highlight the recent research in the field. Further, we analyze the functional requirements of a speaker recognition system and introduce the mechanisms that can address these requirements through our novel architecture. Subsequently, the paper discusses the effect of different techniques such as CNN, GF, and statistical parameters on feature extraction. For classification, standard classifiers such as Support Vector Machine (SVM), Random Forest (RF) and Deep Neural Network (DNN) are investigated. To verify the validity and effectiveness of the proposed architecture, we compared different parameters, including accuracy, sensitivity, and specificity, with those of the standard AlexNet architecture. Full article
(This article belongs to the Section Privacy)

12 pages, 1153 KiB  
Article
Gender Recognition by Voice Using an Improved Self-Labeled Algorithm
by Ioannis E. Livieris, Emmanuel Pintelas and Panagiotis Pintelas
Mach. Learn. Knowl. Extr. 2019, 1(1), 492-503; https://doi.org/10.3390/make1010030 - 05 Mar 2019
Cited by 36 | Viewed by 6611
Abstract
Speech recognition has various applications including human-to-machine interaction, sorting of telephone calls by gender categorization, video categorization with tagging, and so on. Currently, machine learning is a popular trend that has been widely utilized in various fields and applications, exploiting recent developments in digital technologies and the advantage of storage capabilities of electronic media. Recently, research has focused on combining ensemble learning techniques with the semi-supervised learning framework, aiming to build more accurate classifiers. In this paper, we focus on gender recognition by voice utilizing a new ensemble semi-supervised self-labeled algorithm. Our preliminary numerical experiments demonstrate the classification efficiency of the proposed algorithm in terms of accuracy, leading to the development of stable and robust predictive models. Full article
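A minimal self-labeled (self-training) loop of the kind the paper's ensemble method generalizes can be sketched as follows; the nearest-centroid base learner, the margin-based confidence rule, and the toy two-class data are illustrative assumptions, not the authors' algorithm:

```python
import numpy as np

def centroid_fit(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def centroid_predict(model, X):
    classes, cents = model
    d = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

def self_train(X_lab, y_lab, X_unlab, rounds=3, per_round=5):
    """Self-labeled loop: each round, pseudo-label the unlabeled points
    the current model is most confident about (largest margin between
    the two nearest centroids) and add them to the training set."""
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        classes, cents = centroid_fit(X, y)
        d = np.linalg.norm(pool[:, None, :] - cents[None, :, :], axis=2)
        pred = classes[d.argmin(axis=1)]
        ds = np.sort(d, axis=1)
        margin = ds[:, 1] - ds[:, 0]            # confidence proxy
        take = np.argsort(margin)[-per_round:]  # most confident points
        X = np.vstack([X, pool[take]])
        y = np.concatenate([y, pred[take]])
        pool = np.delete(pool, take, axis=0)
    return centroid_fit(X, y)

# Toy data: 2 labeled points per class, 40 unlabeled points in two blobs.
rng = np.random.default_rng(0)
X_lab = np.array([[0., 0.], [0.5, 0.], [8., 8.], [8., 8.5]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                     rng.normal([8, 8], 0.5, (20, 2))])
model = self_train(X_lab, y_lab, X_unlab)
```

Real self-labeled algorithms (and the ensemble variant the paper proposes) differ mainly in the base learners and in how pseudo-label confidence is judged; the loop structure is the same.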
9 pages, 641 KiB  
Article
Differentially Private Image Classification Using Support Vector Machine and Differential Privacy
by Makhamisa Senekane
Mach. Learn. Knowl. Extr. 2019, 1(1), 483-491; https://doi.org/10.3390/make1010029 - 20 Feb 2019
Cited by 22 | Viewed by 5991
Abstract
The ubiquity of data, including multi-media data such as images, enables easy mining and analysis of such data. However, such an analysis might involve the use of sensitive data such as medical records (including radiological images) and financial records. Privacy-preserving machine learning is an approach that is aimed at the analysis of such data in such a way that privacy is not compromised. There are various privacy-preserving data analysis approaches such as k-anonymity, l-diversity, t-closeness and Differential Privacy (DP). Currently, DP is the gold standard of privacy-preserving data analysis due to its robustness against background knowledge attacks. In this paper, we report a scheme for privacy-preserving image classification using a Support Vector Machine (SVM) and DP. SVM is chosen as a classification algorithm because, unlike variants of artificial neural networks, it converges to a global optimum. The SVM kernels used are linear and Radial Basis Function (RBF), while ϵ-differential privacy was the DP framework used. The proposed scheme achieved an accuracy of up to 98%. The results obtained underline the utility of using SVM and DP for privacy-preserving image classification. Full article
(This article belongs to the Section Privacy)
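The ϵ-differential privacy ingredient can be illustrated with the standard Laplace mechanism; this is a textbook sketch of the mechanism itself (the query and numbers are invented), not the paper's SVM-based scheme:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release a numeric query result with ϵ-differential privacy by
    adding Laplace noise of scale sensitivity/ϵ (the Laplace mechanism)."""
    return value + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(0)

# A counting query has sensitivity 1: adding or removing one record
# changes the count by at most 1.
true_count = 120.0
private_count = laplace_mechanism(true_count, sensitivity=1.0,
                                  epsilon=0.5, rng=rng)
# Smaller ϵ means stronger privacy but noisier answers.
```

The noise is unbiased, so averages of many private releases concentrate around the true value, which is why utility (here, up to 98% accuracy in the paper's experiments) can survive the privacy protection.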

17 pages, 1219 KiB  
Article
Using Resistin, Glucose, Age and BMI and Pruning Fuzzy Neural Network for the Construction of Expert Systems in the Prediction of Breast Cancer
by Vinícius Jonathan Silva Araújo, Augusto Junio Guimarães, Paulo Vitor de Campos Souza, Thiago Silva Rezende and Vanessa Souza Araújo
Mach. Learn. Knowl. Extr. 2019, 1(1), 466-482; https://doi.org/10.3390/make1010028 - 14 Feb 2019
Cited by 48 | Viewed by 5436
Abstract
Research on predictions of breast cancer grows in the scientific community, providing data on studies in patient surveys. Predictive models link areas of medicine and artificial intelligence to collect data and improve disease assessments that affect a large part of the population, such as breast cancer. In this work, we used a hybrid artificial intelligence model based on concepts of neural networks and fuzzy systems to assist in the identification of people with breast cancer through fuzzy rules. The hybrid model can manipulate the data collected in medical examinations and identify patterns between healthy people and people with breast cancer with an acceptable level of accuracy. These intelligent techniques allow the creation of expert systems based on logical rules of the IF/THEN type. To demonstrate the feasibility of applying fuzzy neural networks, binary pattern classification tests were performed in which the dimensions of the problem are used as inputs to the model and the answers identify whether or not the patient has cancer. In the tests, experiments were replicated with several characteristics collected in the examinations done by medical specialists. The results of the tests, compared to those of other models commonly used for this purpose in the literature, confirm that the hybrid model has a strong predictive capacity for identifying people with breast cancer, maintaining acceptable levels of accuracy and a good ability to handle false positives and false negatives, while offering the significant advantage of interpretability.
In addition to coherent predictions, the fuzzy neural network enables the construction of systems in high-level programming languages to support physicians' actions during the initial stages of treatment of the disease. The fuzzy rules found allow the construction of systems that replicate the knowledge of medical specialists and disseminate it to other professionals. Full article
(This article belongs to the Special Issue Machine Learning for Biomedical Data Processing)

16 pages, 3464 KiB  
Article
Guidelines and Benchmarks for Deployment of Deep Learning Models on Smartphones as Real-Time Apps
by Abhishek Sehgal and Nasser Kehtarnavaz
Mach. Learn. Knowl. Extr. 2019, 1(1), 450-465; https://doi.org/10.3390/make1010027 - 13 Feb 2019
Cited by 28 | Viewed by 5825
Abstract
Deep learning solutions are being increasingly used in mobile applications. Although there are many open-source software tools for the development of deep learning solutions, there are no unified guidelines in one place for using these tools toward real-time deployment of these solutions on smartphones. From the variety of available deep learning tools, the most suited ones are used in this paper to enable real-time deployment of deep learning inference networks on smartphones. A uniform flow of implementation is devised for both Android and iOS smartphones. The advantage of using multi-threading to achieve or improve real-time throughput is also showcased. A benchmarking framework consisting of accuracy, CPU/GPU consumption, and real-time throughput is considered for validation purposes. The developed deployment approach allows deep learning models to be turned into real-time smartphone apps with ease, based on publicly available deep learning and smartphone software tools. This approach is applied to six popular or representative convolutional neural network models, and the validation results based on the benchmarking metrics are reported. Full article
(This article belongs to the Section Learning)

23 pages, 1146 KiB  
Article
Model Selection Criteria on Beta Regression for Machine Learning
by Patrícia L. Espinheira, Luana C. Meireles da Silva, Alisson de Oliveira Silva and Raydonal Ospina
Mach. Learn. Knowl. Extr. 2019, 1(1), 427-449; https://doi.org/10.3390/make1010026 - 08 Feb 2019
Cited by 14 | Viewed by 4711
Abstract
Beta regression models are a class of supervised learning tools for regression problems with a univariate and limited response. Current fitting procedures for beta regression require variable selection based on (potentially problematic) information criteria. We propose model selection criteria that take into account the leverage, residuals, and influence of the observations, for both linear and nonlinear systematic components. To that end, we propose a Predictive Residual Sum of Squares (PRESS)-like machine learning tool and a prediction coefficient, namely the P² statistic, as a computational procedure. Monte Carlo simulation results on the finite sample behavior of the prediction-based model selection criterion P² are provided. We also evaluated two versions of the R² criterion. Finally, applications to real data are presented. The new criterion proved to be crucial for choosing models that take into account the robustness of the maximum likelihood estimation procedure in the presence of influential cases. Full article
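For ordinary linear regression, a PRESS-like criterion and an R²-style prediction coefficient can be computed in closed form from the hat matrix; this sketch shows that simpler analogue only (the paper's P² for beta regression is more elaborate, and the toy data here are invented):

```python
import numpy as np

def press_and_p2(X, y):
    """PRESS via the hat matrix: the leave-one-out residual is
    e_i / (1 - h_ii), and a P²-style prediction coefficient is
    1 - PRESS / SS_tot (an R²-like analogue based on prediction)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    e = y - H @ y                           # ordinary residuals
    loo = e / (1.0 - np.diag(H))            # leave-one-out residuals
    press = float(np.sum(loo ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return press, 1.0 - press / ss_tot

# Exact linear data: every leave-one-out prediction is perfect,
# so PRESS is (numerically) zero and P² is one.
x = np.arange(20.0)
X = np.column_stack([np.ones(20), x])
press_exact, p2_exact = press_and_p2(X, 3.0 + 2.0 * x)
```

Because PRESS is built from held-out residuals, a P²-type criterion rewards predictive ability rather than in-sample fit, which is the motivation the abstract describes.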

11 pages, 3025 KiB  
Article
The Number of Topics Optimization: Clustering Approach
by Fedor Krasnov and Anastasiia Sen
Mach. Learn. Knowl. Extr. 2019, 1(1), 416-426; https://doi.org/10.3390/make1010025 - 30 Jan 2019
Cited by 25 | Viewed by 4265
Abstract
Although topic models have been used to build clusters of documents for more than ten years, there is still a problem of choosing the optimal number of topics. The authors analyzed many fundamental studies undertaken on the subject in recent years. The main problem is the lack of a stable metric of the quality of topics obtained during the construction of the topic model. The authors analyzed the internal metrics of the topic model (coherence, contrast, and purity) to determine the optimal number of topics and concluded that they are not applicable to this problem. The authors then analyzed the approach of choosing the optimal number of topics based on the quality of the clusters. For this purpose, the authors considered the behavior of the cluster validation metrics: the Davies–Bouldin index, the silhouette coefficient, and the Calinski–Harabasz index. A new method for determining the optimal number of topics proposed in this paper is based on the following principles: (1) setting up a topic model with additive regularization (ARTM) to separate noise topics; (2) using dense vector representations (GloVe, FastText, Word2Vec); (3) using a cosine measure of distance in the cluster metrics, which works better than Euclidean distance on high-dimensional vectors. The methodology developed by the authors for obtaining the optimal number of topics was tested on a collection of scientific articles from the OnePetro library, selected by specific themes. The experiment showed that the proposed method makes it possible to assess the optimal number of topics for a topic model built on a small collection of English documents. Full article
(This article belongs to the Special Issue Language Processing and Knowledge Extraction)
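The role of the cosine measure in the cluster-quality step can be illustrated with a minimal intra- vs. inter-cluster comparison; the toy vectors and the score are illustrative assumptions, not the authors' pipeline:

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance, the metric the method prefers over Euclidean
    distance for high-dimensional embedding vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def mean_intra_inter(vectors, labels):
    """Mean within-cluster vs. between-cluster cosine distance; a good
    number of topics should make the first small and the second large."""
    n = len(vectors)
    intra, inter = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_dist(vectors[i], vectors[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    return float(np.mean(intra)), float(np.mean(inter))

# Toy check: two tight clusters of document embeddings along different axes.
vecs = np.array([[1., 0.1], [1., 0.], [0.1, 1.], [0., 1.]])
intra, inter = mean_intra_inter(vecs, [0, 0, 1, 1])
```

Cluster validation indices such as the silhouette coefficient are built from exactly this intra/inter contrast; swapping in cosine distance is what adapts them to dense word-vector representations.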

2 pages, 208 KiB  
Editorial
Acknowledgement to Reviewers of MAKE in 2018
by MAKE Editorial Office
Mach. Learn. Knowl. Extr. 2019, 1(1), 414-415; https://doi.org/10.3390/make1010024 - 19 Jan 2019
Viewed by 2009
Abstract
Rigorous peer-review is the corner-stone of high-quality academic publishing [...] Full article
14 pages, 2763 KiB  
Article
Discovery of Relevant Response in Infected Potato Plants from Time Series of Gene Expression Data
by Dragan Gamberger, Tjaša Stare, Dragana Miljkovic, Kristina Gruden and Nada Lavrač
Mach. Learn. Knowl. Extr. 2019, 1(1), 400-413; https://doi.org/10.3390/make1010023 - 16 Jan 2019
Viewed by 2730
Abstract
The paper presents a methodology for analyzing time series of gene expression data collected from the leaves of potato virus Y (PVY) infected and non-infected potato plants, with the aim of identifying significant differences between the two sets of potato plants at various time points. We aim at identifying differentially-expressed genes whose expression values are statistically significantly different in the set of PVY infected potato plants compared to non-infected plants, and which also demonstrate statistically significant changes in the expression values of genes of PVY infected potato plants over time. The novelty of the approach includes stratified data randomization used in estimating the statistical properties of gene expression of the samples in the control set of non-infected potato plants. A novel estimate that computes the relative minimal distance between the samples has been defined, which enables reliable identification of the differences between the target and control datasets when these sets are small. The relevance of the outcomes is demonstrated by visualizing the relative minimal distance of gene expression changes in time for three different types of potato leaves for the genes that have been identified as relevant by the proposed methodology. Full article
(This article belongs to the Section Data)

16 pages, 1341 KiB  
Article
Encrypted DNP3 Traffic Classification Using Supervised Machine Learning Algorithms
by Thais Rodriguez de Toledo and Nunzio Marco Torrisi
Mach. Learn. Knowl. Extr. 2019, 1(1), 384-399; https://doi.org/10.3390/make1010022 - 15 Jan 2019
Cited by 21 | Viewed by 3411
Abstract
The Distributed Network Protocol (DNP3) is predominantly used by the electric utility industry and, consequently, in smart grids. The Peekaboo attack was created to compromise DNP3 traffic, in which a man-in-the-middle on a communication link can capture and drop selected encrypted DNP3 messages by using support vector machine learning algorithms. The communication networks of smart grids are an important part of their infrastructure, so it is of critical importance to keep this communication secure and reliable. The main contribution of this paper is to compare the use of machine learning techniques to classify messages of the same protocol exchanged in encrypted tunnels. The study considers four simulated cases of encrypted DNP3 traffic scenarios and four different supervised machine learning algorithms: Decision tree, nearest-neighbor, support vector machine, and naive Bayes. The results obtained show that it is possible to extend a Peekaboo attack over multiple substations, using a decision tree learning algorithm, and to gather significant information from a system that communicates using encrypted DNP3 traffic. Full article

25 pages, 1072 KiB  
Review
High-Dimensional LASSO-Based Computational Regression Models: Regularization, Shrinkage, and Selection
by Frank Emmert-Streib and Matthias Dehmer
Mach. Learn. Knowl. Extr. 2019, 1(1), 359-383; https://doi.org/10.3390/make1010021 - 14 Jan 2019
Cited by 63 | Viewed by 11351
Abstract
Regression models are a form of supervised learning methods that are important for machine learning, statistics, and general data science. Despite the fact that classical ordinary least squares (OLS) regression models have been known for a long time, in recent years many new developments have extended this model significantly. Above all, the least absolute shrinkage and selection operator (LASSO) model gained considerable interest. In this paper, we review general regression models with a focus on the LASSO and extensions thereof, including the adaptive LASSO, elastic net, and group LASSO. We discuss the regularization terms responsible for inducing coefficient shrinkage and variable selection leading to improved performance metrics of these regression models. This makes these modern, computational regression models valuable tools for analyzing high-dimensional problems. Full article
(This article belongs to the Section Learning)
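The shrinkage-and-selection behaviour described above comes from the soft-thresholding step at the heart of the LASSO's coordinate-descent solver, which can be sketched directly (synthetic data; a didactic solver, not a production one):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: the source of the LASSO's coefficient
    shrinkage and of its exact zeros (variable selection)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic
    coordinate descent with soft-thresholded univariate updates."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]   # partial residual without feature j
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

# Only the first of three features carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0]
b_pen = lasso_cd(X, y, lam=1.0)   # penalized: irrelevant coefficients hit 0
b_ols = lasso_cd(X, y, lam=0.0)   # lam=0 reduces to ordinary least squares
```

The penalized solution sets the irrelevant coefficients exactly to zero while shrinking the relevant one toward zero; with `lam=0` the soft threshold is the identity and the solver recovers the OLS fit.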

18 pages, 655 KiB  
Article
Recent Advances in Supervised Dimension Reduction: A Survey
by Guoqing Chao, Yuan Luo and Weiping Ding
Mach. Learn. Knowl. Extr. 2019, 1(1), 341-358; https://doi.org/10.3390/make1010020 - 07 Jan 2019
Cited by 73 | Viewed by 7917
Abstract
Recently, we have witnessed an explosive growth in both the quantity and dimension of data generated, which aggravates the high dimensionality challenge in tasks such as predictive modeling and decision support. Up to now, a large number of unsupervised dimension reduction methods have been proposed and studied. However, there is no specific review focusing on the supervised dimension reduction problem. Most studies performed classification or regression after applying unsupervised dimension reduction methods. However, we recognize the following advantages of learning the low-dimensional representation and the classification/regression model simultaneously: high accuracy and effective representation. Considering classification or regression as the main goal of dimension reduction, the purpose of this paper is to summarize and organize the current developments in the field into three main classes: PCA-based, Non-negative Matrix Factorization (NMF)-based, and manifold-based supervised dimension reduction methods, as well as to provide elaborated discussions on their advantages and disadvantages. Moreover, we outline a dozen open problems that can be further explored to advance the development of this topic. Full article
(This article belongs to the Section Thematic Reviews)

28 pages, 1802 KiB  
Article
Causal Discovery with Attention-Based Convolutional Neural Networks
by Meike Nauta, Doina Bucur and Christin Seifert
Mach. Learn. Knowl. Extr. 2019, 1(1), 312-340; https://doi.org/10.3390/make1010019 - 07 Jan 2019
Cited by 127 | Viewed by 27548
Abstract
Having insight into the causal associations in a complex system facilitates decision making, e.g., for medical treatments, urban infrastructure improvements or financial investments. The amount of observational data is growing, which enables the discovery of causal relationships between variables from observation of their behaviour over time. Existing methods for causal discovery from time series data do not yet exploit the representational power of deep learning. We therefore present the Temporal Causal Discovery Framework (TCDF), a deep learning framework that learns a causal graph structure by discovering causal relationships in observational time series data. TCDF uses attention-based convolutional neural networks combined with a causal validation step. By interpreting the internal parameters of the convolutional networks, TCDF can also discover the time delay between a cause and the occurrence of its effect. Our framework learns temporal causal graphs, which can include confounders and instantaneous effects. Experiments on financial and neuroscientific benchmarks show state-of-the-art performance of TCDF on discovering causal relationships in continuous time series data. Furthermore, we show that TCDF can circumstantially discover the presence of hidden confounders. Our broadly applicable framework can be used to gain novel insights into the causal dependencies in a complex system, which is important for reliable predictions, knowledge discovery and data-driven decision making. Full article
(This article belongs to the Special Issue Women in Machine Learning 2018)

25 pages, 5592 KiB  
Article
Evaluation of ARIMA Models for Human–Machine Interface State Sequence Prediction
by Harsh V. P. Singh and Qusay H. Mahmoud
Mach. Learn. Knowl. Extr. 2019, 1(1), 287-311; https://doi.org/10.3390/make1010018 - 03 Jan 2019
Cited by 3 | Viewed by 3920
Abstract
In this paper, auto-regressive integrated moving average (ARIMA) time-series data forecast models are evaluated to ascertain their feasibility in predicting human–machine interface (HMI) state transitions, which are modeled as multivariate time-series patterns. Human–machine interface states generally include changes in their visually displayed information brought about by both process parameter changes and user actions. This approach has wide applications in industrial controls, such as nuclear power plant control rooms, and in the transportation industry, e.g., aircraft cockpits, for developing non-intrusive real-time monitoring solutions for human operator situational awareness and for potentially predicting human-in-the-loop error trend precursors. Full article
(This article belongs to the Section Learning)
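The autoregressive core of an ARIMA model can be fitted by ordinary least squares on lagged values; this univariate sketch (synthetic AR(1) data with an invented coefficient) illustrates the idea, whereas the paper works with multivariate HMI state sequences:

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of an AR(p) model, the autoregressive core of
    ARIMA (differencing supplies the 'I', moving-average terms the 'MA')."""
    n = len(series)
    y = series[p:]
    lags = np.column_stack([series[p - k: n - k] for k in range(1, p + 1)])
    X = np.column_stack([np.ones(n - p), lags])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, phi_1, ..., phi_p]

def forecast_next(series, coef):
    """One-step-ahead forecast from the fitted AR coefficients."""
    p = len(coef) - 1
    return coef[0] + coef[1:] @ series[-1:-p - 1:-1]

# Simulate an AR(1) process y_t = 0.8 y_{t-1} + noise and recover phi_1.
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + 0.1 * rng.normal()
coef = fit_ar(y, p=1)
```

Evaluating how well such one-step forecasts track the next observed state is, in essence, the feasibility question the paper asks for HMI state transition sequences.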

22 pages, 3220 KiB  
Article
Multi-Layer Hidden Markov Model Based Intrusion Detection System
by Wondimu K. Zegeye, Richard A. Dean and Farzad Moazzami
Mach. Learn. Knowl. Extr. 2019, 1(1), 265-286; https://doi.org/10.3390/make1010017 - 25 Dec 2018
Cited by 24 | Viewed by 6658
Abstract
The all-IP nature of next generation (5G) networks is going to open the door to many new vulnerabilities, and preventing the risks associated with them will be challenging. The majority of these vulnerabilities might be impossible to detect with simple network traffic monitoring tools. Intrusion Detection Systems (IDS) which rely on machine learning and artificial intelligence can significantly improve network defense against intruders. This technology can be trained to learn and identify uncommon patterns in massive volumes of traffic and to notify system administrators, e.g., using alert flags, for additional investigation. This paper proposes an IDS design which makes use of machine learning algorithms such as the Hidden Markov Model (HMM) in a multi-layer approach. This approach has been developed and verified to resolve the common flaws in the application of HMM to IDS, commonly referred to as the curse of dimensionality. It factors a problem of immense dimensionality into a discrete set of manageable and reliable elements. The multi-layer approach can be expanded beyond two layers to capture multi-phase attacks over longer spans of time. A pyramid of HMMs can resolve disparate digital events and signatures across protocols and platforms into actionable information, where lower layers identify discrete events (such as a network scan) and higher layers identify new states which are the result of multi-phase events in the lower layers. The concepts of this novel approach have been developed, but the full potential has not yet been demonstrated. Full article
(This article belongs to the Section Learning)
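The basic HMM decoding step that each layer performs can be illustrated with log-space Viterbi; the two-state "normal"/"attack" model below is a toy assumption for illustration, not the paper's trained IDS:

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Most likely hidden-state sequence for an HMM (log-space Viterbi).
    In a multi-layer design, a lower-layer HMM like this would map raw
    traffic observations to discrete states consumed by the layer above."""
    T = len(obs)
    logd = np.log(start) + np.log(emit[:, obs[0]])   # best log-prob per state
    back = np.zeros((T, len(start)), dtype=int)      # backpointers
    for t in range(1, T):
        scores = logd[:, None] + np.log(trans)       # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(emit[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy two-state model: state 0 = "normal", state 1 = "attack".
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
emit = np.array([[0.9, 0.1],    # "normal" mostly emits symbol 0
                 [0.1, 0.9]])   # "attack" mostly emits symbol 1
states = viterbi([0, 0, 1, 1, 1], start, trans, emit)
```

Stacking layers means feeding decoded state sequences like `states` upward as the observation alphabet of the next HMM, which is how the pyramid turns discrete events into higher-level multi-phase states.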

13 pages, 684 KiB  
Article
The Winning Solution to the IEEE CIG 2017 Game Data Mining Competition
by Anna Guitart, Pei Pei Chen and África Periáñez
Mach. Learn. Knowl. Extr. 2019, 1(1), 252-264; https://doi.org/10.3390/make1010016 - 20 Dec 2018
Cited by 10 | Viewed by 3865
Abstract
Machine learning competitions such as those organized by Kaggle or KDD represent a useful benchmark for data science research. In this work, we present our winning solution to the Game Data Mining competition hosted at the 2017 IEEE Conference on Computational Intelligence and Games (CIG 2017). The contest consisted of two tracks, and participants (more than 250, belonging to both industry and academia) were to predict which players would stop playing the game, as well as their remaining lifetime. The data were provided by a major worldwide video game company, NCSoft, and came from their successful massively multiplayer online game Blade and Soul. Here, we describe the long short-term memory approach and conditional inference survival ensemble model that made us win both tracks of the contest, as well as the validation procedure that we followed in order to prevent overfitting. In particular, choosing a survival method able to deal with censored data was crucial to accurately predict the moment in which each player would leave the game, as censoring is inherent in churn. The selected models proved to be robust against evolving conditions—since there was a change in the business model of the game (from subscription-based to free-to-play) between the two sample datasets provided—and efficient in terms of time cost. Thanks to these features and also to their ability to scale to large datasets, our models could be readily implemented in real business settings. Full article
(This article belongs to the Special Issue Women in Machine Learning 2018)
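The censoring issue the abstract highlights, that for players still active at the end of the observation window we only know a lower bound on their lifetime, can be illustrated with a minimal Kaplan-Meier sketch (this is an illustrative estimator, not the authors' conditional inference survival ensemble; the toy data are invented):

```python
# Minimal sketch: Kaplan-Meier survival estimate for right-censored churn
# data. Each record is (time, observed); observed=False means the player
# was still active when observation ended (censored), so we only know
# their lifetime exceeds t. Censored records leave the risk set without
# counting as churn events.

def kaplan_meier(records):
    """Return [(t, S(t))] at event times, handling censored records."""
    records = sorted(records)
    n_at_risk = len(records)
    surv, curve = 1.0, []
    i = 0
    while i < len(records):
        t = records[i][0]
        deaths = sum(1 for r in records if r[0] == t and r[1])
        removed = sum(1 for r in records if r[0] == t)
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
        i += removed
    return curve

# Toy data: (days played, churn observed?)
data = [(5, True), (8, False), (12, True), (20, False), (25, True)]
curve = kaplan_meier(data)
```

Treating the censored records at days 8 and 20 as churn events would bias the estimated lifetimes downward, which is why a censoring-aware method matters here.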
17 pages, 1745 KiB  
Article
Defining Data Science by a Data-Driven Quantification of the Community
by Frank Emmert-Streib and Matthias Dehmer
Mach. Learn. Knowl. Extr. 2019, 1(1), 235-251; https://doi.org/10.3390/make1010015 - 19 Dec 2018
Cited by 26 | Viewed by 4862
Abstract
Data science is a new academic field that has received much attention in recent years. One reason for this is that our increasingly digitalized society generates more and more data in all areas of our lives and science, and we are urgently seeking solutions to deal with this flood of data. In this paper, we investigate the academic roots of data science. We use data from Google Scholar on scientists who have an interest in data science, together with their citations, to perform a quantitative analysis of the data science community. Furthermore, to decompose the data science community into its major defining factors, corresponding to the most important research fields, we introduce a statistical regression model that is fully automatic and robust with respect to a subsampling of the data. This statistical model allows us to define the ‘importance’ of a field via its predictive ability. Overall, our method provides an objective answer to the question ‘What is data science?’. Full article
(This article belongs to the Section Data)
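One way to operationalize "the 'importance' of a field as its predictive ability," sketched here with a one-variable least-squares fit rather than the paper's full regression model (the field names, the citation outcome, and the toy numbers are all invented for illustration):

```python
# Hedged sketch: score each candidate field by the R^2 of a simple
# one-variable regression predicting an outcome (e.g., citation count)
# from a scientist's association with that field. Higher R^2 = more
# predictive = more 'important' under this toy definition.

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy) if sxx and syy else 0.0

# Toy community: per-scientist association strength with two fields.
stats_field  = [1, 2, 3, 4, 5]
buzzword_use = [5, 1, 4, 2, 3]
citations    = [2, 4, 6, 8, 10]   # perfectly explained by stats_field

importance = {
    "statistics": r_squared(stats_field, citations),
    "buzzwords": r_squared(buzzword_use, citations),
}
```

Under this toy definition, "statistics" would be declared a defining factor of the community and "buzzwords" would not.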
11 pages, 2311 KiB  
Article
Analysis of Machine Learning Algorithms for Opinion Mining in Different Domains
by Donia Gamal, Marco Alfonse, El-Sayed M. El-Horbaty and Abdel-Badeeh M. Salem
Mach. Learn. Knowl. Extr. 2019, 1(1), 224-234; https://doi.org/10.3390/make1010014 - 08 Dec 2018
Cited by 25 | Viewed by 6976
Abstract
Sentiment classification (SC) is a task within sentiment analysis (SA), a subfield of natural language processing (NLP), and is used to decide whether textual content implies a positive or negative review. This research focuses on the various machine learning (ML) algorithms utilized in the analysis of sentiments and in the mining of reviews in different datasets. Overall, an SC task consists of two phases. The first phase deals with feature extraction (FE); three different FE algorithms are applied in this research. The second phase covers the classification of the reviews using various ML algorithms: Naïve Bayes (NB), Stochastic Gradient Descent (SGD), Support Vector Machines (SVM), Passive Aggressive (PA), Maximum Entropy (ME), Adaptive Boosting (AdaBoost), Multinomial NB (MNB), Bernoulli NB (BNB), Ridge Regression (RR) and Logistic Regression (LR). The performance of PA with unigram features is the best among the tested algorithms for all used datasets (IMDB, Cornell Movies, Amazon and Twitter), providing values that range from 87% to 99.96% across all evaluation metrics. Full article
(This article belongs to the Section Learning)
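The two-phase pipeline described above can be sketched end to end with one of the listed classifiers, Multinomial Naive Bayes, implemented from scratch (a toy illustration with invented four-review training data, not the paper's experimental setup):

```python
# Phase 1: unigram feature extraction (word counts per class).
# Phase 2: Multinomial Naive Bayes with Laplace smoothing, scoring
# each label by log P(label) + sum over words of log P(word | label).
from collections import Counter
import math

train = [("great fun great acting", "pos"),
         ("boring plot terrible acting", "neg"),
         ("fun and great plot", "pos"),
         ("terrible boring mess", "neg")]

counts = {"pos": Counter(), "neg": Counter()}
docs = Counter()
for text, label in train:
    docs[label] += 1
    counts[label].update(text.split())          # phase 1: unigram FE

vocab = set(w for c in counts.values() for w in c)

def classify(text):
    # phase 2: argmax over smoothed log-likelihoods
    best, best_lp = None, -math.inf
    for label in counts:
        lp = math.log(docs[label] / sum(docs.values()))
        total = sum(counts[label].values())
        for w in text.split():
            lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

Swapping phase 2 for SGD, PA, or any of the other listed algorithms leaves phase 1 untouched, which is what makes the two-phase framing useful.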
13 pages, 648 KiB  
Article
Using the Outlier Detection Task to Evaluate Distributional Semantic Models
by Pablo Gamallo
Mach. Learn. Knowl. Extr. 2019, 1(1), 211-223; https://doi.org/10.3390/make1010013 - 22 Nov 2018
Cited by 2 | Viewed by 3000
Abstract
In this article, we define the outlier detection task and use it to compare neural-based word embeddings with transparent count-based distributional representations. Using the English Wikipedia as a text source to train the models, we observed that embeddings outperform count-based representations when their contexts are made up of bags of words. However, there are no sharp differences between the two models if the word contexts are defined as syntactic dependencies. In general, syntax-based models tend to perform better than those based on bags of words for this specific task. Similar experiments were carried out for Portuguese, with similar results. The test datasets we have created for the outlier detection task in English and Portuguese are freely available. Full article
(This article belongs to the Special Issue Language Processing and Knowledge Extraction)
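The outlier detection task itself has a simple operational form: given a semantically coherent word set plus one intruder, score each word by its average similarity to the rest and flag the least similar one. A minimal sketch (the compactness-style score and the toy vectors are illustrative, not the paper's trained embeddings):

```python
# Score each word by its mean cosine similarity to the other words in
# the set ("compactness"); the word with the lowest score is predicted
# to be the outlier.
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def outlier(vectors):
    def compactness(w):
        others = [o for o in vectors if o != w]
        return sum(cos(vectors[w], vectors[o]) for o in others) / len(others)
    return min(vectors, key=compactness)

toy = {
    "cat":    [0.9, 0.1, 0.0],
    "dog":    [0.8, 0.2, 0.1],
    "horse":  [0.7, 0.3, 0.0],
    "laptop": [0.0, 0.1, 0.9],   # the intruder
}
```

Evaluating a representation then reduces to counting how often this prediction matches the known intruder across many such sets.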
6 pages, 4260 KiB  
Opinion
Exploiting Genomic Relations in Big Data Repositories by Graph-Based Search Methods
by Aliyu Musa, Matthias Dehmer, Olli Yli-Harja and Frank Emmert-Streib
Mach. Learn. Knowl. Extr. 2019, 1(1), 205-210; https://doi.org/10.3390/make1010012 - 22 Nov 2018
Cited by 2 | Viewed by 2678
Abstract
We are living at a time that allows the generation of mass data in almost any field of science. For instance, in pharmacogenomics, there exist a number of big data repositories, e.g., the Library of Integrated Network-based Cellular Signatures (LINCS), that provide millions of measurements on the genomics level. However, to translate these data into meaningful information, the data need to be analyzable. The first step for such an analysis is the deliberate selection of subsets of raw data for studying dedicated research questions. Unfortunately, this is a non-trivial problem when millions of individual data files are available with an intricate connection structure induced by experimental dependencies. In this paper, we argue for the need to introduce such search capabilities for big genomics data repositories, with a specific discussion about LINCS. Specifically, we suggest the introduction of smart interfaces allowing the exploitation of the connections among individual raw data files, giving rise to a network structure, by graph-based searches. Full article
(This article belongs to the Section Network)
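The graph-based search the authors argue for can be sketched as a traversal: raw data files become nodes, experimental dependencies become edges, and a query walks outward from a seed file (the file names and dependency structure below are invented for illustration, not actual LINCS records):

```python
# Breadth-first search over a hypothetical repository dependency graph:
# starting from one raw data file, collect every file connected to it
# through experimental dependencies.
from collections import deque

edges = {  # invented LINCS-like structure: file -> derived files
    "exprA.gctx": ["sigA.gctx"],
    "sigA.gctx": ["sigAB.gctx"],
    "exprB.gctx": ["sigB.gctx", "sigAB.gctx"],
    "sigB.gctx": [],
    "sigAB.gctx": [],
}

def connected_files(seed):
    """All files reachable from `seed` via experimental dependencies."""
    seen, queue = {seed}, deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

A "smart interface" in the authors' sense would expose queries like this directly, instead of forcing users to resolve dependencies between millions of files by hand.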
13 pages, 2184 KiB  
Article
An Algorithm for Generating Invisible Data Poisoning Using Adversarial Noise That Breaks Image Classification Deep Learning
by Adrien CHAN-HON-TONG
Mach. Learn. Knowl. Extr. 2019, 1(1), 192-204; https://doi.org/10.3390/make1010011 - 09 Nov 2018
Cited by 9 | Viewed by 4945
Abstract
Today, the two main security issues for deep learning are data poisoning and adversarial examples. Data poisoning consists of perverting a learning system by manipulating a small subset of the training data, while adversarial examples entail bypassing the system at testing time with a low-amplitude manipulation of the testing sample. Unfortunately, data poisoning that is invisible to human eyes can be generated by adding adversarial noise to the training data. The main contribution of this paper is a successful implementation of such invisible data poisoning using image classification datasets for a deep learning pipeline. This implementation leads to significant classification accuracy gaps. Full article
(This article belongs to the Section Data)
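The core idea, that low-amplitude training-set perturbations can flip a learned decision, can be shown on a deliberately tiny example (a 1-D nearest-centroid classifier and hand-picked numbers, nothing like the paper's deep learning pipeline or its adversarial-noise algorithm):

```python
# Toy illustration of data poisoning: shift every training point of one
# class by a small epsilon. Each individual point barely moves
# ("invisible"), yet the learned decision for a borderline test point
# flips.

def centroid_classifier(train):
    cents = {}
    for x, y in train:
        cents.setdefault(y, []).append(x)
    cents = {y: sum(v) / len(v) for y, v in cents.items()}
    return lambda x: min(cents, key=lambda y: abs(x - cents[y]))

clean = [(0.0, "a"), (0.2, "a"), (0.8, "b"), (1.0, "b")]
eps = 0.15                                   # small, low-amplitude shift
poisoned = [(x - eps if y == "b" else x, y) for x, y in clean]

test_point = 0.45                            # near the clean boundary
before = centroid_classifier(clean)(test_point)
after = centroid_classifier(poisoned)(test_point)
```

The paper's contribution is doing the analogous thing to image classifiers, where adversarial noise plays the role of the epsilon shift and remains imperceptible to a human inspecting the training images.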
35 pages, 17545 KiB  
Review
Particle Swarm Optimization: A Survey of Historical and Recent Developments with Hybridization Perspectives
by Saptarshi Sengupta, Sanchita Basak and Richard Alan Peters II
Mach. Learn. Knowl. Extr. 2019, 1(1), 157-191; https://doi.org/10.3390/make1010010 - 10 Oct 2018
Cited by 331 | Viewed by 15711
Abstract
Particle Swarm Optimization (PSO) is a metaheuristic global optimization paradigm that has gained prominence in the last two decades due to its ease of application in unsupervised, complex multidimensional problems that cannot be solved using traditional deterministic algorithms. The canonical particle swarm optimizer is based on the flocking behavior and social co-operation of birds and fish schools and draws heavily from the evolutionary behavior of these organisms. This paper provides a thorough survey of the PSO algorithm, with special emphasis on the development, deployment, and improvements of its most basic as well as some of the very recent state-of-the-art implementations. Concepts and directions on choosing the inertia weight, constriction factor, and cognitive and social weights are outlined, along with perspectives on convergence, parallelization, elitism, niching and discrete optimization, as well as neighborhood topologies. Hybridization attempts with other evolutionary and swarm paradigms in selected applications are covered, and an up-to-date review is put forward for the interested reader. Full article
(This article belongs to the Section Data)
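The canonical optimizer the survey starts from fits in a few lines: each particle's velocity combines an inertia term (weight w) with cognitive (c1) and social (c2) pulls toward its personal best and the swarm's global best. A minimal 1-D sketch on the sphere function (the parameter values are common textbook choices, not recommendations from this survey):

```python
# Canonical PSO minimizing f(x) = x^2 in one dimension.
# Velocity update: v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
import random

def pso(f, n_particles=10, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rnd = random.Random(seed)                   # seeded for reproducibility
    xs = [rnd.uniform(-10, 10) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest = xs[:]                               # personal bests
    gbest = min(xs, key=f)                      # global best
    for _ in range(iters):
        for i in range(n_particles):
            vs[i] = (w * vs[i]
                     + c1 * rnd.random() * (pbest[i] - xs[i])
                     + c2 * rnd.random() * (gbest - xs[i]))
            xs[i] += vs[i]
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
        gbest = min(pbest, key=f)
    return gbest

best = pso(lambda x: x * x)
```

Most of the variants the survey covers (constriction factors, neighborhood topologies, niching, hybridization) are modifications of exactly this velocity update or of how gbest is shared among particles.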
8 pages, 908 KiB  
Viewpoint
A Machine Learning Perspective on Personalized Medicine: An Automized, Comprehensive Knowledge Base with Ontology for Pattern Recognition
by Frank Emmert-Streib and Matthias Dehmer
Mach. Learn. Knowl. Extr. 2019, 1(1), 149-156; https://doi.org/10.3390/make1010009 - 08 Sep 2018
Cited by 32 | Viewed by 7224
Abstract
Personalized or precision medicine is a new paradigm that holds great promise for individualized patient diagnosis, treatment, and care. However, personalized medicine has only been described on an informal level rather than through rigorous practical guidelines and statistical protocols that would allow its robust practical realization for implementation in day-to-day clinical practice. In this paper, we discuss three key factors, which we consider dimensions that affect the experimental design for personalized medicine: (I) phenotype categories; (II) population size; and (III) statistical analysis. This formalization allows us to define personalized medicine, from a machine learning perspective, as an automized, comprehensive knowledge base with an ontology that performs pattern recognition of patient profiles. Full article
11 pages, 659 KiB  
Perspective
Inference of Genome-Scale Gene Regulatory Networks: Are There Differences in Biological and Clinical Validations?
by Frank Emmert-Streib and Matthias Dehmer
Mach. Learn. Knowl. Extr. 2019, 1(1), 138-148; https://doi.org/10.3390/make1010008 - 22 Aug 2018
Cited by 4 | Viewed by 3572
Abstract
Causal networks, e.g., gene regulatory networks (GRNs) inferred from gene expression data, contain a wealth of information but defy simple, straightforward and low-budget experimental validation. In this paper, we elaborate on this problem and discuss distinctions between biological and clinical validations. As a result, validation differences for GRNs reflect known differences between basic biological and clinical research questions, making the validations context-specific. Hence, the meaning of biologically and clinically meaningful GRNs can be very different. For a concerted approach to a problem of this size, we suggest the establishment of the HUMAN GENE REGULATORY NETWORK PROJECT, which would provide the information required for biological and clinical validations alike. Full article
(This article belongs to the Section Network)
17 pages, 909 KiB  
Article
Phi-Delta-Diagrams: Software Implementation of a Visual Tool for Assessing Classifier and Feature Performance
by Giuliano Armano, Alessandro Giuliani, Ursula Neumann, Nikolas Rothe and Dominik Heider
Mach. Learn. Knowl. Extr. 2019, 1(1), 121-137; https://doi.org/10.3390/make1010007 - 28 Jun 2018
Cited by 1 | Viewed by 4404
Abstract
In this article, a two-tiered 2D tool called φ,δ diagrams is described; this tool has been devised to support the assessment of classifiers in terms of accuracy and bias. In their standard versions, these diagrams provide information as if the underlying data were in fact balanced. Their generalization, i.e., the ability to account for imbalance, will also be briefly described. In either case, the isometrics of accuracy and bias are immediately evident therein, as—according to a specific design choice—they are in fact straight lines parallel to the x-axis and y-axis, respectively. φ,δ diagrams can also be used to assess the importance of features, as highly discriminant ones are immediately evident therein. In this paper, a comprehensive introduction on how to adopt φ,δ diagrams as a standard tool for classifier and feature assessment is given. In particular, with the goal of illustrating all relevant details from a pragmatic perspective, their implementation and usage as Python and R packages are described. Full article
(This article belongs to the Section Visualization)
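One parameterization consistent with the geometry the abstract describes maps a confusion matrix to φ = tpr − fpr and δ = tpr + fpr − 1, so that on balanced data accuracy = (1 + φ)/2 and accuracy isometrics are lines of constant φ. This is an assumed formulation for illustration, not code from the authors' Python/R packages; consult the paper for the exact definitions:

```python
# Assumed phi-delta mapping (illustrative): on balanced data, constant
# phi means constant accuracy, so accuracy isometrics are straight lines
# in the (phi, delta) plane, matching the abstract's design claim.

def phi_delta(tp, fn, fp, tn):
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr - fpr, tpr + fpr - 1

# Balanced toy confusion matrix: 50 positives, 50 negatives.
phi, delta = phi_delta(tp=45, fn=5, fp=10, tn=40)
balanced_accuracy = (1 + phi) / 2
```

Here (45 + 40)/100 = 0.85 agrees with (1 + φ)/2, which is the property that makes accuracy readable directly off one axis of the diagram.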
6 pages, 582 KiB  
Review
Why Topology for Machine Learning and Knowledge Extraction?
by Massimo Ferri
Mach. Learn. Knowl. Extr. 2019, 1(1), 115-120; https://doi.org/10.3390/make1010006 - 02 May 2018
Cited by 13 | Viewed by 6196
Abstract
Data has shape, and shape is the domain of geometry and in particular of its “free” part, called topology. The aim of this paper is twofold. First, it provides a brief overview of applications of topology to machine learning and knowledge extraction, as well as the motivations thereof. Furthermore, this paper is aimed at promoting cross-talk between the theoretical and applied domains of topology and machine learning research. Such interactions can be beneficial for both the generation of novel theoretical tools and finding cutting-edge practical applications. Full article
(This article belongs to the Section Learning)
40 pages, 2919 KiB  
Article
A Survey of ReRAM-Based Architectures for Processing-In-Memory and Neural Networks
by Sparsh Mittal
Mach. Learn. Knowl. Extr. 2019, 1(1), 75-114; https://doi.org/10.3390/make1010005 - 30 Apr 2018
Cited by 115 | Viewed by 16139
Abstract
As data movement operations and the power budget become key bottlenecks in the design of computing systems, interest in unconventional approaches such as processing-in-memory (PIM), machine learning (ML), and especially neural network (NN)-based accelerators has grown significantly. Resistive random access memory (ReRAM) is a promising technology for efficiently architecting PIM- and NN-based accelerators due to its ability to work both as high-density/low-energy storage and as an in-memory computation/search engine. In this paper, we present a survey of techniques for designing ReRAM-based PIM and NN architectures. By classifying the techniques based on key parameters, we underscore their similarities and differences. This paper will be valuable for computer architects, chip designers and researchers in the area of machine learning. Full article
(This article belongs to the Section Network)
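The reason ReRAM suits PIM and NN workloads can be stated in one equation: with cell conductances G[i][j] and applied voltages V[i], Kirchhoff's current law gives column currents I[j] = Σᵢ V[i]·G[i][j], i.e., a matrix-vector product computed in a single analog step inside the memory array. A digital simulation of that behavior (conceptual sketch only, ignoring device non-idealities the survey discusses):

```python
# Simulate the analog dot product of a ReRAM crossbar: weights are
# stored as conductances, inputs are applied as voltages, and each
# column current is one element of the matrix-vector product.

def crossbar_mvm(G, V):
    """Column currents for voltage vector V applied to conductance matrix G."""
    cols = len(G[0])
    return [sum(V[i] * G[i][j] for i in range(len(G))) for j in range(cols)]

G = [[0.1, 0.2],   # conductances encode NN weights
     [0.3, 0.4]]
V = [1.0, 2.0]     # inputs encoded as read voltages
I = crossbar_mvm(G, V)
```

Because the product happens where the weights are stored, no weight data moves between memory and a compute unit, which is exactly the data-movement bottleneck the survey's opening sentence names.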
11 pages, 7954 KiB  
Article
A Machine Learning Approach to Determine Oyster Vessel Behavior
by Devin Joseph Frey, Avdesh Mishra, Md Tamjidul Hoque, Mahdi Abdelguerfi and Thomas Soniat
Mach. Learn. Knowl. Extr. 2019, 1(1), 64-74; https://doi.org/10.3390/make1010004 - 31 Mar 2018
Cited by 5 | Viewed by 5105
Abstract
In this work, we address the multi-class task of determining oyster vessel behavior by classifying vessel activity into four classes: fishing, traveling, poling (exploring) and docked (anchored). The main purpose of this work is to automate the determination of oyster vessel behavior using machine learning and to explore different techniques to improve the accuracy of the behavior prediction problem. To apply machine learning, two important descriptors, speed and net speed, are calculated from the trajectory data recorded by a satellite communication system (Vessel Management System, VMS) attached to the vessels fishing on the public oyster grounds of Louisiana. We constructed a support vector machine (SVM) based method which employs a Radial Basis Function (RBF) kernel to accurately predict the behavior of oyster vessels. Several validation and parameter optimization techniques were used to improve the accuracy of the SVM classifier. A total of 93% of the trajectory data from a July 2013 to August 2014 dataset consisting of 612,700 samples, for which the ground truth can be obtained using a rule-based classifier, is used for validation and independent testing of our method. The results show that the proposed SVM based method is able to correctly classify 99.99% of the 612,700 samples using 10-fold cross validation. Furthermore, we achieved a precision of 1.00, a recall of 1.00, an F1-score of 1.00 and a test accuracy of 99.99%, while performing an independent test using a subset of 93% of the dataset, which consists of 31,418 points. Full article
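The two descriptors named in the abstract, along with the RBF kernel the SVM uses, can be sketched as follows (the track format, units, and numbers are invented for illustration; the paper computes these from real VMS fixes):

```python
# 'speed' = distance over time between consecutive VMS fixes;
# 'net speed' = straight-line displacement over the whole window
# (a slow net speed with a high segment speed suggests circling,
# e.g., fishing, rather than traveling).
import math

def speed(p, q):                      # p, q = (t_hours, x_km, y_km)
    dist = math.hypot(q[1] - p[1], q[2] - p[2])
    return dist / (q[0] - p[0])

def net_speed(track):
    return speed(track[0], track[-1])

def rbf(u, v, gamma=1.0):
    """RBF kernel comparing two descriptor vectors, as the SVM would."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy track: moves 5 km in the first hour, then sits still for an hour.
track = [(0, 0.0, 0.0), (1, 3.0, 4.0), (2, 3.0, 4.0)]
```

An SVM with this kernel then separates the four behavior classes in the (speed, net speed) descriptor space.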
21 pages, 2930 KiB  
Article
Category Maps Describe Driving Episodes Recorded with Event Data Recorders
by Hirokazu Madokoro, Kazuhito Sato and Nobuhiro Shimoi
Mach. Learn. Knowl. Extr. 2019, 1(1), 43-63; https://doi.org/10.3390/make1010003 - 12 Mar 2018
Viewed by 3919
Abstract
This study was conducted to create driving episodes using machine-learning-based algorithms that address long-term memory (LTM) and topological mapping. This paper presents a novel episodic memory model for driving safety according to traffic scenes. The model incorporates three important features: adaptive resonance theory (ART), which learns time-series features incrementally while maintaining stability and plasticity; self-organizing maps (SOMs), which represent input data as a map with topological relations using self-mapping characteristics; and counter propagation networks (CPNs), which label category maps using input features and counter signals. Category maps represent driving episode information that includes driving contexts and facial expressions. The bursting states of the respective maps produce LTM, created on ART, as episodic memory. For a preliminary experiment using a driving simulator (DS), we measured the gazes and face orientations of drivers as their internal information to create driving episodes. Moreover, we measured cognitive distraction according to effects on facial features shown in reaction to simulated near-misses. Evaluation of the experimentally obtained results shows the possibility of using recorded driving episodes with image datasets obtained using an event data recorder (EDR) with two cameras. Using category maps, we visualize driving features according to driving scenes on a public road and an expressway. Full article
(This article belongs to the Section Learning)
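The SOM ingredient of the model, the part that gives category maps their topological ordering, comes down to one update rule: find the best matching unit (BMU) for an input and pull it and its grid neighbors toward that input. A minimal 1-D-grid sketch (illustrative only; the paper combines SOMs with ART and CPNs in a far richer architecture):

```python
# One SOM training step on a 1-D grid of 2-D weight vectors:
# 1) pick the BMU (node with the smallest squared distance to x),
# 2) move the BMU and its grid neighbors a fraction lr toward x.
# Updating neighbors together is what preserves topological relations.

def som_step(weights, x, lr=0.5, radius=1):
    bmu = min(range(len(weights)),
              key=lambda i: sum((w - a) ** 2 for w, a in zip(weights[i], x)))
    for i, w in enumerate(weights):
        if abs(i - bmu) <= radius:           # neighborhood on the grid
            weights[i] = [wi + lr * (a - wi) for wi, a in zip(w, x)]
    return bmu

weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
bmu = som_step(weights, [0.9, 0.9])
```

After training, nearby grid nodes respond to similar inputs, which is why the resulting category maps can be read as maps of driving scenes.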