**Acceleration of a Feature Selection Algorithm Using High Performance Computing †**

#### **Bieito Beceiro \* , Jorge González-Domínguez and Juan Touriño**

Computer Architecture Group, CITIC, Universidade da Coruña, 15071 A Coruña, Spain; jgonzalezd@udc.es (J.G.-D.); juan@udc.es (J.T.)

**\*** Correspondence: bieito.beceiro.fernandez@udc.es

† Presented at the 3rd XoveTIC Conference, A Coruña, Spain, 8–9 October 2020.

Published: 1 September 2020

**Abstract:** Feature selection is a subfield of data analysis focused on reducing the dimensionality of datasets, so that subsequent analyses over them can be performed in affordable execution times while keeping the same results. Joint Mutual Information (JMI) is a widely used feature selection method that removes irrelevant and redundant characteristics. Nevertheless, it has high computational complexity. In this work, we present a multithreaded MPI parallel implementation of JMI to accelerate its execution on distributed memory systems, reaching speedups of up to 198.60 when running on 256 cores, and allowing for the analysis of very large datasets that do not fit in the main memory of a single node.

**Keywords:** feature selection; mutual information; machine learning; high performance computing; MPI

#### **1. Introduction**

Feature selection algorithms are data analytics techniques that aim to reduce the size of large datasets by detecting those features that are relevant and discarding the irrelevant ones. These methods have become extremely popular in recent years due to the increasing amount of data that are gathered every day and need to be processed. Moreover, the algorithms used to process data are usually computationally expensive, so execution times become excessive as datasets grow in size.

In this work, we focus on the Joint Mutual Information (JMI) algorithm [1], since it provides the best tradeoff in terms of accuracy, stability, and flexibility for datasets of various properties, as stated in [2]. JMI follows a filter approach for feature selection, that is, it is used as an independent task performed before the subsequent machine learning algorithm. JMI receives from the user the number of features to select, and it iteratively takes the feature with the highest score, calculated as a tradeoff between its relevance and redundancy. Despite its good results, the main drawback of JMI is its quadratic complexity, which prevents its use for analyzing large datasets.
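To make the selection procedure concrete, the sketch below shows a minimal sequential version of the greedy JMI loop. It is only an illustration, not the FEAST code: it assumes discretized integer features stored one row per feature, the score follows the standard JMI criterion from the literature (relevance for the first pick, then a sum of joint mutual information terms with the already-selected features), and the names `mutual_information`, `joint` and `jmi_select` are ours.

```cpp
// Illustrative sequential sketch of the greedy JMI selection loop (not FEAST).
// Features are assumed to be discretized, non-negative integers, one vector per feature.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Plug-in estimate of the mutual information I(A;B) of two discrete variables.
double mutual_information(const std::vector<int>& a, const std::vector<int>& b) {
    const double n = static_cast<double>(a.size());
    std::map<int, double> pa, pb;
    std::map<std::pair<int, int>, double> pab;
    for (std::size_t i = 0; i < a.size(); ++i) {
        pa[a[i]] += 1.0 / n;
        pb[b[i]] += 1.0 / n;
        pab[{a[i], b[i]}] += 1.0 / n;
    }
    double mi = 0.0;
    for (const auto& [ab, pxy] : pab)
        mi += pxy * std::log(pxy / (pa[ab.first] * pb[ab.second]));
    return mi;
}

// Encode the pair (A,B) as a single discrete variable, so that I(A,B;Y) can be
// computed as the mutual information between the encoded pair and Y.
std::vector<int> joint(const std::vector<int>& a, const std::vector<int>& b) {
    const int width = *std::max_element(b.begin(), b.end()) + 1;
    std::vector<int> j(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) j[i] = a[i] * width + b[i];
    return j;
}

// Greedy JMI selection: returns (position in the original dataset, score) pairs.
// Assumes num_to_select does not exceed the number of features.
std::vector<std::pair<std::size_t, double>>
jmi_select(const std::vector<std::vector<int>>& features,  // features x samples
           const std::vector<int>& labels, std::size_t num_to_select) {
    const std::size_t nf = features.size();
    std::vector<bool> chosen(nf, false);
    std::vector<std::size_t> selected;
    std::vector<std::pair<std::size_t, double>> out;
    for (std::size_t k = 0; k < num_to_select; ++k) {
        std::size_t best = nf;
        double best_score = -1.0;
        for (std::size_t c = 0; c < nf; ++c) {   // candidate loop
            if (chosen[c]) continue;
            // First pick: plain relevance I(Xc;Y); later picks: sum of I(Xc,Xs;Y).
            double score = selected.empty()
                               ? mutual_information(features[c], labels)
                               : 0.0;
            for (std::size_t s : selected)
                score += mutual_information(joint(features[c], features[s]), labels);
            if (score > best_score) { best_score = score; best = c; }
        }
        chosen[best] = true;
        selected.push_back(best);
        out.emplace_back(best, best_score);      // (position, score) output pair
    }
    return out;
}
```

The two nested loops over the candidate features and the already-selected features are the source of the quadratic cost mentioned above, and the candidate loop is a natural target for the feature-wise decomposition described in the next section.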

The aim of this work is the development of a new version of the JMI algorithm that is faster than the state-of-the-art solution: the C implementation included in the widely employed and cited FEAST suite [2]. Our implementation must also enable the analysis of huge datasets that do not fit in the memory of conventional computing systems. Both goals have been achieved thanks to a novel multithreaded MPI implementation of JMI that can be executed on distributed memory systems, so that it can exploit parallel work across several nodes and thus reduce the execution time. In addition, this parallel version of JMI is able to divide and distribute extremely large datasets among the memory of all the nodes, making it possible to process them.

#### **2. Parallel Implementation of JMI**

JMI works with datasets that are represented as matrices, where the rows are the features and the columns are the samples. The output dataset contains a pair for each selected feature: the first element of the pair is its position in the original dataset, while the second one is its score. In this work, we propose a hybrid parallel approach to JMI based on the MPI message-passing library and C++ threads that works with the same input/output format as FEAST and guarantees exactly the same results, but with significantly reduced execution times. This has been achieved by using a feature-wise domain decomposition, that is, the features of the input dataset are divided into blocks of the same size so that each block is processed in parallel by a different processing element. Our implementation makes use of this domain decomposition at two levels of parallelism. First, the dataset is distributed among MPI processes. Besides the execution time gains, this MPI parallelization makes possible the joint usage of the memory of several nodes. Second, the workload of each process is distributed among threads.
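The sketch below illustrates the MPI level of this decomposition under simplifying assumptions of ours: the dataset size, the dummy `score_candidate()` stand-in, and the `MPI_MAXLOC` reduction used to combine the per-rank winners are illustrative choices, not necessarily the communication scheme of the actual implementation, and the thread level inside each process is omitted.

```cpp
// Illustrative sketch of the feature-wise block decomposition over MPI ranks.
// score_candidate() is a dummy stand-in for the per-candidate JMI score, and the
// MAXLOC reduction is just one possible way to combine per-rank results.
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Dummy stand-in for the JMI score of a candidate feature (see previous sketch).
double score_candidate(std::size_t g) { return static_cast<double>(g % 1000); }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const std::size_t num_features = 10000;  // illustrative dataset size

    // Feature-wise decomposition: rank r owns the block [first, last) of features.
    const std::size_t chunk = (num_features + nprocs - 1) / nprocs;
    const std::size_t first = std::min(static_cast<std::size_t>(rank) * chunk, num_features);
    const std::size_t last  = std::min(first + chunk, num_features);

    // Each rank scores only its local block. In a real run this loop would
    // additionally be split among the C++ threads of the process.
    struct { double score; int index; } local{-1.0, -1}, best{-1.0, -1};
    for (std::size_t f = first; f < last; ++f) {
        const double s = score_candidate(f);
        if (s > local.score) { local.score = s; local.index = static_cast<int>(f); }
    }

    // Combine the per-rank winners: MAXLOC keeps the highest score together
    // with the global index of the feature that achieved it.
    MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("selected feature %d with score %f\n", best.index, best.score);

    MPI_Finalize();
    return 0;
}
```

Under these assumptions, each rank only touches its own block of features plus one small collective per iteration, which reflects how the feature blocks, and hence the memory footprint, can be divided among the nodes.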

Furthermore, two performance optimizations have been applied to our parallel JMI implementation.


#### **3. Results and Conclusions**

The experiments for the analysis of our optimized parallel version of JMI were conducted on 16 nodes of the "Pluton" cluster, a distributed-memory system based on Intel Xeon processors installed at CITIC. Each node is composed of two processors with eight cores each (16 logical threads using HyperThreading) and 64 GB of main memory. The whole system thus provides a total of 256 cores (512 logical threads with HyperThreading) and 1 TB of memory. In order to keep the experiments reproducible, we have used five representative datasets that are publicly available on the LibSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Their properties are shown in the left half of Table 1.

Table 1 also shows the runtime and speedup (in parentheses) for different numbers of nodes. Regarding the parallel executions, the same configuration was used for each node, as it proved to be the most efficient one after some preliminary tests: two MPI processes with eight cores each (16 logical threads per MPI process using HyperThreading). However, due to memory problems with the largest datasets (SVHN and E2006), we had to use one MPI process with 16 cores (32 logical threads) in these cases. Moreover, E2006 is so large (∼512 GB as a matrix in memory) that it does not fit in fewer than 16 nodes, so it could only be executed using this configuration. As can be seen, the parallel version of JMI performs well and offers excellent scalability (execution times largely decrease when the number of cores increases). Furthermore, it enables the analysis of huge datasets (E2006) that the original version of JMI could not handle due to their size.


**Table 1.** Dataset properties, execution times (in seconds), and speedups (in parentheses) to select 200 features with JMI.

**Author Contributions:** Conceptualization, J.G.-D. and J.T.; methodology, B.B., J.G.-D. and J.T.; implementation, B.B.; validation, B.B.; writing—original draft preparation, B.B.; writing—review and editing, J.G.-D. and J.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00/AEI/10.13039/501100011033), and by Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019-2022, ref. ED431G2019/01; Consolidation Program of Competitive Reference Groups, ED431C 2017/04).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

**A Doubly Smoothed PD Estimator in Credit Risk †**

#### **Rebeca Peláez Suárez 1,\*, Ricardo Cao Abad 2 and Juan M. Vilar Fernández 2**


Published: 1 September 2020

**Abstract:** In this work, a doubly smoothed probability of default (PD) estimator is proposed, based on a smoothed version of Beran's survival estimator. The asymptotic properties of both the smoothed survival and PD estimators are proved, and their behaviour is analyzed by simulation. The results allow us to conclude that the time variable smoothing reduces the error committed in the PD estimation.

**Keywords:** probability of default; risk analysis; censored data; survival analysis; nonparametric estimation; kernel estimation

#### **1. Introduction**

The debts coming from clients with unpaid credits have an important impact on the solvency of banks and other credit institutions. Therefore, one of the most crucial elements that influence credit risk is the probability of default (PD). For a fixed time, *t*, and a horizon time, *b*, the PD can be defined as the probability that a credit that has been paid up to time *t* becomes unpaid no later than time *t* + *b*.

The probability of default conditional on the credit scoring can be written as a transformation of the conditional survival function. Therefore, in Section 2.1 Beran's survival estimator is used to obtain a PD estimator. A time variable smoothing for this estimator is proposed in Section 2.2. In Section 3, both estimators are applied to a real data set. Section 4 contains some concluding remarks.

#### **2. Nonparametric PD Estimators**

Let $\{(X\_i, Z\_i, \delta\_i)\}\_{i=1}^{n}$ be a simple random sample of $(X, Z, \delta)$, where $X$ is the credit scoring, $Z = \min\{T, C\}$ is the observed maturity, $T$ is the time to default, $C$ is the time until the end of the study or until the anticipated cancellation of the credit, and $\delta = I\_{\{T \le C\}}$ is the uncensoring indicator. Let $x$ be a fixed value of the covariate $X$, $b$ a horizon time and $S(t|x)$ the conditional survival function of $T$. Then, the probability of default in a time horizon $t + b$ from a maturity time $t$ is defined as follows

$$PD(t|\mathbf{x}) \quad = \quad P(T \le t+b|T > t, X=\mathbf{x}) = 1 - \frac{S(t+b|\mathbf{x})}{S(t|\mathbf{x})}.\tag{1}$$

Replacing $S(t|x)$ with a nonparametric estimator, $\widehat{S}(t|x)$, in (1), the following estimator for $PD(t|x)$ is obtained:

$$\widehat{PD}(t|\mathbf{x}) \quad = \ 1 - \frac{\widehat{S}(t+b|\mathbf{x})}{\widehat{S}(t|\mathbf{x})}.\tag{2}$$

In [5], the theoretical results that allow obtaining, under general conditions, the asymptotic properties of a PD estimator are proved. They are based on these properties for the corresponding estimator of the conditional survival function.

#### *2.1. Beran's Estimator*

Beran's survival estimator proposed in [1] is given by

$$\widehat{S}\_h^B(t|\mathbf{x}) \quad = \prod\_{i=1}^n \left( 1 - \frac{I\_{\{Z\_i \le t, \, \delta\_i = 1\}}\, w\_{i,n}(\mathbf{x})}{1 - \sum\_{j=1}^n I\_{\{Z\_j \le Z\_i\}}\, w\_{j,n}(\mathbf{x})} \right), \tag{3}$$

where the weights are $w\_{i,n}(\mathbf{x}) = \dfrac{K\big((\mathbf{x} - X\_i)/h\big)}{\sum\_{j=1}^{n} K\big((\mathbf{x} - X\_j)/h\big)}$, with $i = 1, \dots, n$, $K$ is a kernel function and $h = h\_n > 0$ is a smoothing parameter. Now, replacing $S(t|\mathbf{x})$ with $\widehat{S}\_h^B(t|\mathbf{x})$ in (2), Beran's estimator of the probability of default, $\widehat{PD}\_h^B(t|\mathbf{x})$, is obtained. It was first used in [2].
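Written out explicitly, this plug-in estimator is simply (2) with Beran's survival estimator inserted:

$$\widehat{PD}\_h^B(t|\mathbf{x}) \ = \ 1 - \frac{\widehat{S}\_h^B(t+b|\mathbf{x})}{\widehat{S}\_h^B(t|\mathbf{x})}.$$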

The asymptotic properties of Beran's estimator for the conditional survival function were proven in [3,4] under certain assumptions. From them, the expressions of the bias and the variance of the estimator $\widehat{PD}\_h^B(t|\mathbf{x})$ can be found by using Theorem 1 in [5].

A simulation study was conducted in order to analyse the performance of Beran's estimator. Its behavior was compared with other estimators of the probability of default obtained from estimators of the survival function, including a benchmark method based on proportional hazards models. For more details about the simulation study, see [5].

The results show that the probability of default estimations obtained by means of the estimators built according to (2) are very reasonable, but they exhibit excessive variability and produce very rough curves.

#### *2.2. Smoothed Beran's Estimator*

Beran's estimator is smoothed with respect to the covariate, but not with respect to the time variable. This fact along with the survival ratio structure of the PD estimator could be the cause of the instability of the estimations. Therefore, a time variable smoothing of the survival estimator is proposed.

The smoothed Beran's survival estimator is given by

$$\widehat{S}\_{h,g}^{B}(t|\mathbf{x}) \ = \ 1 - \sum\_{i=1}^{n} s\_{(i)} \, \mathbb{K}\left(\frac{t - Z\_{(i)}}{g}\right) \tag{4}$$

where $s\_{(i)} = \widehat{S}\_h^B(Z\_{(i-1)}|\mathbf{x}) - \widehat{S}\_h^B(Z\_{(i)}|\mathbf{x})$, with $Z\_{(i)}$ the $i$-th element of the sorted sample of $Z$, $\mathbb{K}(t)$ the distribution function of a kernel $K$, and $g = g\_n$ the smoothing parameter for the time variable. Finally, the smoothed Beran's PD estimator, $\widehat{PD}\_{h,g}^B(t|\mathbf{x})$, is obtained by replacing $S(t|\mathbf{x})$ with $\widehat{S}\_{h,g}^B(t|\mathbf{x})$ in Equation (2).
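Written out explicitly, the doubly smoothed estimator is therefore

$$\widehat{PD}\_{h,g}^B(t|\mathbf{x}) \ = \ 1 - \frac{\widehat{S}\_{h,g}^B(t+b|\mathbf{x})}{\widehat{S}\_{h,g}^B(t|\mathbf{x})},$$

which smooths both in the covariate (through $h$) and in the time variable (through $g$).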

The asymptotic expressions for the bias and the variance of the smoothed Beran's estimator of the survival function have been recently found [6]. The results are too extensive to be shown here. By applying Theorem 1 of [5], the corresponding asymptotic properties of the smoothed Beran's estimator of the PD are obtained.

The simulation study carried out shows that the time variable smoothing significantly reduces the error committed in the PD estimation. This technique implies a considerable increase in the computation time, and the improvement is not very noticeable in the estimation of the survival function. However, in the case of the PD, the variability and roughness of the estimations are clearly reduced.

#### **3. Application to Real Data**

To illustrate the differences between Beran's estimator and its smoothed version, we obtain the estimation of the conditional survival function and the PD in a real data set. The data consist of a sample of 10,000 consumer credits from a Spanish bank registered between July 2004 and November 2006. The sample contains the credit scoring of each borrower, the observed lifetime and the uncensoring indicator. The sample censoring percentage is 92.8%. The probability of default is estimated using Beran's and smoothed Beran's estimators with *h* = 0.05 and *g* = 3. Figure 1 shows the result.

**Figure 1.** Estimation of *S*(*t*|*x*) (**left**) and *PD*(*t*|*x*) (**right**) at horizon *b* = 5 for *x* = 0.8 by means of Beran's (dashed line) and smoothed Beran's (solid line) estimators on the consumer credits dataset.

#### **4. Conclusions**

This work proposes a time variable smoothing for Beran's estimator of the conditional survival function. General asymptotic expressions for the bias and the variance of this estimator are proven. It is used to build a doubly smoothed PD estimator whose asymptotic properties are also proved. In view of the simulation study carried out, it can be concluded that the smoothed Beran's estimator seems to reduce the error committed when estimating the probability of default.

Work is currently underway to develop a method for choosing the smoothing parameters involved in the above-mentioned estimators. In addition, since censoring is heavy in this context, nonparametric cure models are going to be considered in the study.

**Acknowledgments:** This research has been supported by MINECO Grant MTM2017-82724-R, and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-015 and Centro Singular de Investigación de Galicia ED431G/01), all of them through the ERDF.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
