
Mach. Learn. Knowl. Extr., Volume 4, Issue 3 (September 2022) – 13 articles

Cover Story: Molecular descriptors essentially dictate the performance of quantitative structure–activity relationship (QSAR) models that uncover molecules with desired properties in the ever-expanding virtual and synthetically available chemical space. The Simplified Molecular Input Line Entry System (SMILES) is one of the most used descriptors, for which the importance of numerical encoding has recently been recognized. We propose a new variable-length-array SMILES (VLA-SMILES) descriptor that reduces the code size while preserving structural characteristics, where the tradeoff between training speed and accuracy is controlled through clustering of binary numbers. Statistical $H_0$ hypothesis testing based on the $F_{2,n-2}$ criterion was used to validate the predictive ability of the designed VLA-SMILES-based QSAR models on prototypical ChEMBL datasets (n is the size of the testing set).
11 pages, 1459 KiB  
Article
Sensor Fusion for Occupancy Estimation: A Study Using Multiple Lecture Rooms in a Complex Building
by Cédric Roussel, Klaus Böhm and Pascal Neis
Mach. Learn. Knowl. Extr. 2022, 4(3), 803-813; https://doi.org/10.3390/make4030039 - 16 Sep 2022
Cited by 2 | Viewed by 2051
Abstract
This paper uses various machine learning methods which explore the combination of multiple sensors for quality improvement. It is known that a reliable occupancy estimation can help in many different cases and applications. For the containment of the SARS-CoV-2 virus, in particular, room occupancy is a major factor. The estimation can benefit visitor management systems in real time, but can also be predictive of room reservation strategies. By using different terminal and non-terminal sensors in different premises of varying sizes, this paper aims to estimate room occupancy. In the process, the proposed models are trained with different combinations of rooms in training and testing datasets to examine distinctions in the infrastructure of the considered building. The results indicate that the estimation benefits from a combination of different sensors. Additionally, it is found that a model should be trained with data from every room in a building and cannot be transferred to other rooms.

24 pages, 743 KiB  
Article
Factorizable Joint Shift in Multinomial Classification
by Dirk Tasche
Mach. Learn. Knowl. Extr. 2022, 4(3), 779-802; https://doi.org/10.3390/make4030038 - 10 Sep 2022
Cited by 1 | Viewed by 1600
Abstract
Factorizable joint shift (FJS) was recently proposed as a type of dataset shift for which the complete characteristics can be estimated from feature data observations on the test dataset by a method called Joint Importance Aligning. For the multinomial (multiclass) classification setting, we derive a representation of factorizable joint shift in terms of the source (training) distribution, the target (test) prior class probabilities and the target marginal distribution of the features. On the basis of this result, we propose alternatives to joint importance aligning and, at the same time, point out that factorizable joint shift is not fully identifiable if no class label information on the test dataset is available and no additional assumptions are made. Other results of the paper include correction formulae for the posterior class probabilities both under general dataset shift and factorizable joint shift. In addition, we investigate the consequences of assuming factorizable joint shift for the bias caused by sample selection.
(This article belongs to the Section Learning)
11 pages, 1215 KiB  
Article
Investigating Machine Learning Applications in the Prediction of Occupational Injuries in South African National Parks
by Martha Chadyiwa, Juliana Kagura and Aimee Stewart
Mach. Learn. Knowl. Extr. 2022, 4(3), 768-778; https://doi.org/10.3390/make4030037 - 22 Aug 2022
Cited by 5 | Viewed by 2289
Abstract
There is a need to predict occupational injuries in South African National Parks for the purpose of implementing targeted interventions or preventive measures. Machine-learning models have the capability of predicting injuries such that the employees that are at risk of experiencing occupational injuries can be identified. Support Vector Machines (SVMs), k Nearest Neighbours (k-NN), XGB classifier and Deep Neural Networks were applied and overall performance was compared to the accuracy of baseline models that always predict low extremity injuries. Data extracted from the Department of Employment and Labour’s Compensation Fund was used for training the models. SVMs had the best performance in predicting between low extremity injuries and injuries in the torso and hands regions. However, the overall accuracy was 56%, which was slightly above the baseline and below findings from similar previous research that reported a minimum of 62%. Gender was the only feature with an importance score significantly greater than zero. There is a need to use more features related to work conditions and which acknowledge the importance of environment in order to improve the accuracy of the predictions of the models. Furthermore, more types of injuries, and employees that have not experienced any injuries, should be included in future studies.

15 pages, 1876 KiB  
Article
Live Fish Species Classification in Underwater Images by Using Convolutional Neural Networks Based on Incremental Learning with Knowledge Distillation Loss
by Abdelouahid Ben Tamou, Abdesslam Benzinou and Kamal Nasreddine
Mach. Learn. Knowl. Extr. 2022, 4(3), 753-767; https://doi.org/10.3390/make4030036 - 22 Aug 2022
Cited by 10 | Viewed by 3196
Abstract
Nowadays, underwater video systems are largely used by marine ecologists to study the biodiversity in underwater environments. These systems are non-destructive, do not perturb the environment and generate a large amount of visual data usable at any time. However, automatic video analysis requires efficient techniques of image processing due to the poor quality of underwater images and the challenging underwater environment. In this paper, we address live reef fish species classification in an unconstrained underwater environment. We propose using a deep Convolutional Neural Network (CNN) and training this network by using a new strategy based on incremental learning. This training strategy consists of training the CNN progressively by focusing at first on learning the difficult species well and then gradually learning the new species incrementally using knowledge distillation loss while keeping the high performances of the old species already learned. The proposed approach yields an accuracy of 81.83% on the LifeClef 2015 Fish benchmark dataset.
(This article belongs to the Section Network)
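The knowledge distillation loss mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common formulation in which the student's temperature-softened outputs are compared with a teacher's by cross-entropy; the abstract does not give the paper's exact loss, and the function names and the default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened outputs.

    Assumed, generic formulation; not taken from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
```

In an incremental setting, the teacher logits would typically come from the network as it was before the new species were added, so that minimizing this term discourages drift on the classes already learned.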

15 pages, 5788 KiB  
Article
Deep Leaning Based Frequency-Aware Single Image Deraining by Extracting Knowledge from Rain and Background
by Yuhong He, Tao Zeng, Ye Xiong, Jialu Li and Haoran Wei
Mach. Learn. Knowl. Extr. 2022, 4(3), 738-752; https://doi.org/10.3390/make4030035 - 16 Aug 2022
Cited by 2 | Viewed by 2265
Abstract
Due to the requirement of video surveillance, machine learning-based single image deraining has become a research hotspot in recent years. In order to efficiently obtain rain removal images that contain more detailed information, this paper proposed a novel frequency-aware single image deraining network via the separation of rain and background. For the rainy images, most of the background key information belongs to the low-frequency components, while the high-frequency components are mixed by background image details and rain streaks. This paper attempted to decouple background image details from high frequency components under the guidance of the restored low frequency components. Compared with existing approaches, the proposed network has three major contributions. (1) A residual dense network based on Discrete Wavelet Transform (DWT) was proposed to study the rainy image background information. (2) The frequency channel attention module was introduced into the adaptive decoupling of high-frequency image detail signals. (3) A fusion module was introduced that contains the attention mechanism to make full use of the multi receptive fields information using a two-branch structure, using the context information in a large area. The proposed approach was evaluated using several representative datasets. Experimental results show that the proposed approach outperforms other state-of-the-art deraining algorithms.

23 pages, 4340 KiB  
Article
VLA-SMILES: Variable-Length-Array SMILES Descriptors in Neural Network-Based QSAR Modeling
by Antonina L. Nazarova and Aiichiro Nakano
Mach. Learn. Knowl. Extr. 2022, 4(3), 715-737; https://doi.org/10.3390/make4030034 - 5 Aug 2022
Viewed by 2971
Abstract
Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptron (MLP) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP), and Adam optimization learning algorithms featuring rational train–test splitting, while improving the predictive ability toward the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with the Kennard–Stone train–test splitting based on the structural descriptor similarity metrics was found more effective than the partitioning with the ranking by activity based on biological activity values metrics for the entire set of VLA-SMILES featured QSAR. Robustness and the predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation. In addition, the method of the statistical $H_0$ hypothesis testing of the linear regression between real and observed activities based on the $F_{2,n-2}$ criterion was used for predictability estimation among VLA-SMILES featured QSAR-MLPs (with n being the size of the testing set). Both approaches of QSAR parametric model validation and statistical hypothesis testing were found to correlate when used for the quantitative evaluation of predictabilities of the designed QSAR models with VLA-SMILES descriptors.
(This article belongs to the Section Learning)
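The Kennard–Stone train–test splitting named in the abstract above can be sketched as follows. This is a minimal illustration of the general procedure, assuming descriptors are plain numeric vectors compared by Euclidean distance; the similarity metric actually used in the paper is not stated in the abstract, and the function name is hypothetical.

```python
import math


def kennard_stone(points, n_train):
    """Return indices of n_train points chosen by the Kennard-Stone procedure.

    Assumes Euclidean distance between descriptor vectors (an assumption,
    not taken from the paper).
    """
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Seed the training set with the two most distant points.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]
    while len(selected) < n_train:
        # Add the candidate whose nearest selected neighbour is farthest away.
        best = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The indices not returned form the test set. Because each step picks the point least well covered by the current selection, the training set spreads across descriptor space rather than clustering around frequent structures.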

15 pages, 5478 KiB  
Article
Data Mining Algorithms for Operating Pressure Forecasting of Crude Oil Distribution Pipelines to Identify Potential Blockages
by Agus Santoso, Fransisco Danang Wijaya, Noor Akhmad Setiawan and Joko Waluyo
Mach. Learn. Knowl. Extr. 2022, 4(3), 700-714; https://doi.org/10.3390/make4030033 - 21 Jul 2022
Viewed by 2581
Abstract
The implementation of data mining has become very popular in many fields recently, including in the petroleum industry. It is widely used to help in decision-making processes in order to minimize oil losses during operations. One of the major causes of loss is oil flow blockages during transport to the gathering facility, known as the congeal phenomenon. To overcome this situation, real-time surveillance is used to monitor the oil flow condition inside pipes. However, this system is not able to forecast the pipeline pressure over the next several days. The objective of this study is to forecast the pressure several days in advance using real-time pressure data, as well as external factor data recorded by nearby weather stations, such as ambient temperature and precipitation. Three machine learning algorithms, namely multi-layer perceptron (MLP), long short-term memory (LSTM), and nonlinear autoregressive exogenous model (NARX), are evaluated and compared with each other using standard regression evaluation metrics, including a steady-state model. As a result, with proper hyperparameters, in the proposed method of NARX with MLP as a regressor, the NARX algorithm showed the best performance among the evaluated algorithms, indicated by the highest values of R² and lowest values of RMSE. This algorithm is capable of forecasting the pressure with high correlation to actual field data. By forecasting the pressure several days ahead, system owners may take pre-emptive actions to prevent congealing.
(This article belongs to the Section Learning)

12 pages, 929 KiB  
Article
Input/Output Variables Selection in Data Envelopment Analysis: A Shannon Entropy Approach
by Pejman Peykani, Fatemeh Sadat Seyed Esmaeili, Mirpouya Mirmozaffari, Armin Jabbarzadeh and Mohammad Khamechian
Mach. Learn. Knowl. Extr. 2022, 4(3), 688-699; https://doi.org/10.3390/make4030032 - 14 Jul 2022
Cited by 13 | Viewed by 3814
Abstract
The purpose of this study is to provide an efficient method for the selection of input–output indicators in the data envelopment analysis (DEA) approach, in order to improve the discriminatory power of the DEA method in the evaluation process and performance analysis of homogeneous decision-making units (DMUs) in the presence of negative values and data. For this purpose, the Shannon entropy technique is used as one of the most important methods for determining the weight of indicators. Moreover, due to the presence of negative data in some indicators, the range directional measure (RDM) model is used as the basic model of the research. Finally, to demonstrate the applicability of the proposed approach, the food and beverage industry has been selected from the Tehran stock exchange (TSE) as a case study, and data related to 15 stocks have been extracted from this industry. The numerical and experimental results indicate the efficacy of the hybrid data envelopment analysis–Shannon entropy (DEASE) approach to evaluate stocks under negative data. Furthermore, the discriminatory power of the proposed DEASE approach is greater than that of a classical DEA model.
(This article belongs to the Section Data)
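The use of Shannon entropy to weight indicators, as described in the abstract above, can be illustrated with the widely used entropy weight method. This is a sketch under stated assumptions: indicator values are non-negative with a positive column sum, there are at least two units, and at least one indicator is not constant. It does not reproduce the paper's handling of negative data through the RDM model, and the function name is hypothetical.

```python
import math


def entropy_weights(matrix):
    """Weights per column (indicator) from the Shannon entropy of its normalized values.

    Rows are units (for example DMUs) and columns are indicators.
    Assumes non-negative values; generic method, not the paper's exact procedure.
    """
    m = len(matrix)
    n = len(matrix[0])
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [x / total for x in column]
        # Treat 0 * log(0) as 0; dividing by log(m) scales the entropy into [0, 1].
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    return [d / norm for d in divergences]
```

An indicator whose values are nearly identical across units has entropy close to 1 and so receives a weight close to 0, which matches the intuition that it contributes little to discriminating between units.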

23 pages, 3346 KiB  
Article
Improving Deep Learning for Maritime Remote Sensing through Data Augmentation and Latent Space
by Daniel Sobien, Erik Higgins, Justin Krometis, Justin Kauffman and Laura Freeman
Mach. Learn. Knowl. Extr. 2022, 4(3), 665-687; https://doi.org/10.3390/make4030031 - 7 Jul 2022
Cited by 4 | Viewed by 2168
Abstract
Training deep learning models requires having the right data for the problem and understanding both your data and the models’ performance on that data. Training deep learning models is difficult when data are limited, so in this paper, we seek to answer the following question: how can we train a deep learning model to increase its performance on a targeted area with limited data? We do this by applying rotation data augmentations to a simulated synthetic aperture radar (SAR) image dataset. We use the Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction technique to understand the effects of augmentations on the data in latent space. Using this latent space representation, we can understand the data and choose specific training samples aimed at boosting model performance in targeted under-performing regions without the need to increase training set sizes. Results show that using latent space to choose training data significantly improves model performance in some cases; however, there are other cases where no improvements are made. We show that linking patterns in latent space is a possible predictor of model performance, but results require some experimentation and domain knowledge to determine the best options.

24 pages, 6351 KiB  
Article
Do We Need a Specific Corpus and Multiple High-Performance GPUs for Training the BERT Model? An Experiment on COVID-19 Dataset
by Nontakan Nuntachit and Prompong Sugunnasil
Mach. Learn. Knowl. Extr. 2022, 4(3), 641-664; https://doi.org/10.3390/make4030030 - 4 Jul 2022
Cited by 3 | Viewed by 3130
Abstract
The COVID-19 pandemic has impacted daily lives around the globe. Since 2019, the amount of literature focusing on COVID-19 has risen exponentially. However, it is almost impossible for humans to read all of the studies and classify them. This article proposes a method of making an unsupervised model called a zero-shot classification model, based on the pre-trained BERT model. We used the CORD-19 dataset in conjunction with the LitCovid database to construct new vocabulary and prepare the test dataset. For the NLI downstream task, we used three corpora: SNLI, MultiNLI, and MedNLI. We significantly reduced the training time by 98.2639% to build a task-specific machine learning model, using only one Nvidia Tesla V100. The final model can run faster and use fewer resources than its comparators. It has an accuracy of 27.84%, which is lower than the best-achieved accuracy by 6.73%, but it is comparable. Finally, we identified that the tokenizer and vocabulary more specific to COVID-19 could not outperform the generalized ones. Additionally, it was found that the BART architecture affects the classification results.
(This article belongs to the Section Learning)

20 pages, 1768 KiB  
Article
Semantic Image Segmentation Using Scant Pixel Annotations
by Adithi D. Chakravarthy, Dilanga Abeyrathna, Mahadevan Subramaniam, Parvathi Chundi and Venkataramana Gadhamshetty
Mach. Learn. Knowl. Extr. 2022, 4(3), 621-640; https://doi.org/10.3390/make4030029 - 1 Jul 2022
Cited by 4 | Viewed by 2435
Abstract
The success of deep networks for the semantic segmentation of images is limited by the availability of annotated training data. The manual annotation of images for segmentation is a tedious and time-consuming task that often requires sophisticated users with significant domain expertise to create high-quality annotations over hundreds of images. In this paper, we propose the segmentation with scant pixel annotations (SSPA) approach to generate high-performing segmentation models using a scant set of expert annotated images. The models are generated by training them on images with automatically generated pseudo-labels along with a scant set of expert annotated images selected using an entropy-based algorithm. For each chosen image, experts are directed to assign labels to a particular group of pixels, while a set of replacement rules that leverage the patterns learned by the model is used to automatically assign labels to the remaining pixels. The SSPA approach integrates active learning and semi-supervised learning with pseudo-labels, where expert annotations are not essential but generated on demand. Extensive experiments on bio-medical and biofilm datasets show that the SSPA approach achieves state-of-the-art performance with less than 5% cumulative annotation of the pixels of the training data by the experts.
(This article belongs to the Section Network)

30 pages, 3045 KiB  
Article
Certifiable Unlearning Pipelines for Logistic Regression: An Experimental Study
by Ananth Mahadevan and Michael Mathioudakis
Mach. Learn. Knowl. Extr. 2022, 4(3), 591-620; https://doi.org/10.3390/make4030028 - 22 Jun 2022
Cited by 4 | Viewed by 2709
Abstract
Machine unlearning is the task of updating machine learning (ML) models after a subset of the training data they were trained on is deleted. Methods for the task are desired to combine effectiveness and efficiency (i.e., they should effectively “unlearn” deleted data, but in a way that does not require excessive computational effort (e.g., a full retraining) for a small amount of deletions). Such a combination is typically achieved by tolerating some amount of approximation in the unlearning. In addition, laws and regulations in the spirit of “the right to be forgotten” have given rise to requirements for certifiability (i.e., the ability to demonstrate that the deleted data has indeed been unlearned by the ML model). In this paper, we present an experimental study of the three state-of-the-art approximate unlearning methods for logistic regression and demonstrate the trade-offs between efficiency, effectiveness and certifiability offered by each method. In implementing this study, we extend some of the existing works and describe a common unlearning pipeline to compare and evaluate the unlearning methods on six real-world datasets and a variety of settings. We provide insights into the effect of the quantity and distribution of the deleted data on ML models and the performance of each unlearning method in different settings. We also propose a practical online strategy to determine when the accumulated error from approximate unlearning is large enough to warrant a full retraining of the ML model.
(This article belongs to the Section Learning)

11 pages, 1248 KiB  
Article
Real Quadratic-Form-Based Graph Pooling for Graph Neural Networks
by Youfa Liu and Guo Chen
Mach. Learn. Knowl. Extr. 2022, 4(3), 580-590; https://doi.org/10.3390/make4030027 - 21 Jun 2022
Cited by 1 | Viewed by 1969
Abstract
Graph neural networks (GNNs) have developed rapidly in recent years because they can work over non-Euclidean data and possess promising prediction power in many real-world applications. The graph classification problem is one of the central problems in graph neural networks, and aims to predict the label of a graph with the help of training graph neural networks over graph-structural datasets. The graph pooling scheme is an important part of graph neural networks for the graph classification objective. Previous works typically focus on using the graph pooling scheme in a linear manner. In this paper, we propose the real quadratic-form-based graph pooling framework for graph neural networks in graph classification. The quadratic form can capture a pairwise relationship, which brings a stronger expressive power than existing linear forms. Experiments on benchmarks verify the effectiveness of the proposed graph pooling scheme based on the quadratic form in graph classification tasks.
(This article belongs to the Section Network)
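As a reminder of the distinction the abstract above draws, a linear form and a real quadratic form on a feature vector $x \in \mathbb{R}^d$ can be written as follows. The symbols $w$, $A$, and $d$ are generic notation assumed here for illustration; this is the textbook definition, not the paper's pooling operator.

```latex
\ell(x) = w^{\top} x = \sum_{i=1}^{d} w_i x_i,
\qquad
q(x) = x^{\top} A x = \sum_{i=1}^{d} \sum_{j=1}^{d} a_{ij}\, x_i x_j,
\quad A \in \mathbb{R}^{d \times d}.
```

The cross terms $a_{ij} x_i x_j$ with $i \neq j$ let a quadratic form respond to pairs of coordinates jointly, which a linear form cannot do.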