**1. Introduction**

Alzheimer's disease (AD) is a relatively common neurological disorder associated with a decline in cognitive skills [1,2] and memory [3–5]. The causes of Alzheimer's are not yet well understood, although the development of amyloid plaques appears to be a major part of the disease [6]. The development of biomarkers [7] for the detection of AD is therefore of clear importance. Over the last few decades, there has been a sharp increase in the amount of information publicly available, with researchers graciously making their data public. This, coupled with technical advances, such as the possibility of simultaneously estimating the methylation [8] levels of thousands of CpGs in the DNA, has created a large amount of information. CpG refers to a guanine nucleotide following a cytosine nucleotide in a section of the DNA sequence. CpGs can be methylated, i.e., have an additional methyl group added. The level of methylation in the DNA is a frequently used marker for multiple illnesses [9–12], as well as an estimator of the biological age of the patient; hence, it has become an important biomarker [13]. The computational task is rather challenging. Current equipment can quickly analyze the methylation levels of in excess of 450,000 CpGs [14–16], with the latest generation of machines able to roughly double that amount [17]. As previously mentioned, methylation data have been linked to many diseases [18–20], making this a logical research area for AD biomarkers. An additional challenge is that, at least in principle, there could be a highly non-linear process that is not necessarily accurately described by traditional regression analysis. The scope would then be to identify techniques that select a combination of the CpGs to be analyzed and then a non-linear algorithm that is able to predict whether the patient analyzed has the disease. On the other hand, it would not appear reasonable to totally discard the information presented in linear analysis. In the following sections, a mixed approach is presented. It will be shown that the approach is able to generate predictions (classifications between controls and patients suffering from Alzheimer's disease).

**Citation:** Alfonso Perez, G.; Caballero Villarraso, J. Alzheimer Identification through DNA Methylation and Artificial Intelligence Techniques. *Mathematics* **2021**, *9*, 2482. https://doi.org/10.3390/math9192482

Academic Editors: Monica Bianchini and Maria Lucia Sampoli

Received: 6 September 2021; Accepted: 30 September 2021; Published: 4 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### *1.1. Forecasting and Classification Models*

Prediction and/or classification tasks are frequently found in many scientific and engineering fields, with a large number of potentially applicable artificial intelligence techniques. The specific topics covered are rather diverse, including weather forecasts [21], plane flight time deviation [22], distributed networks [23], and many others [24–26]. One frequently used set of techniques is artificial neural networks, which are extensively applied in many fields. There are, however, several alternatives that have received less attention in the existing literature (for instance, k-nearest neighbors and support vector machines). It should be noted that the k-nearest neighbors technique is frequently used in data pre-processing, for instance in situations in which the dataset has some missing values and the researcher needs to estimate them (typically as a previous step before using them as an input into a more complex model).

In our case, the non-linear basic classification algorithm chosen was support vector machines (SVM) [27–29]. The basic idea of SVM is dividing the data using hyperplanes [30] while trying to minimize the classification error. This is achieved by following the usual supervised learning approach, in which a proportion of the data is used for training the SVM, while another portion (not used during the training phase) is used for testing purposes only, in order to avoid the issue of overfitting [5,31]. This technique has been applied in the context of Alzheimer's disease for the classification of MRI images [32,33]. Some SVM models have also been proposed in the context of CpG methylation related to AD [34].
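The train/test protocol described above can be sketched as follows, assuming scikit-learn is available; the synthetic data, RBF kernel, and 70/30 split are illustrative assumptions, not the settings used in this paper.

```python
# Minimal sketch of SVM classification with a held-out test portion.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_cases, n_cpgs = 100, 20
X = rng.uniform(0.0, 1.0, size=(n_cases, n_cpgs))  # beta values in [0, 1]
y = rng.integers(0, 2, size=n_cases)               # 0 = control, 1 = AD

# The test portion is never seen during training, limiting overfitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
hit_rate = clf.score(X_test, y_test)  # fraction of correct classifications
print(round(hit_rate, 3))
```

With random labels, as here, the hit rate hovers around 0.5; informative CpGs would push it higher.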

#### *1.2. CpG DNA Methylation*

A CpG is a dinucleotide (composed of a cytosine and a guanine linked by a phosphate), while methylation refers to the addition of a methyl group to the DNA. Methylation levels are typically expressed as a value between 0 and 1, with 0 indicating completely unmethylated and 1 indicating completely (100%) methylated. CpG DNA methylation levels are frequently used as epigenetic biomarkers [35,36]. Methylation levels change as an individual ages, and this has been used to build biological clocks [37]. Individuals with some illnesses, such as some cancers and Alzheimer's disease, present deviations in their levels of methylation.

#### *1.3. Paper Structure*

In the next section, a related literature review is carried out, giving an overview of articles on prediction and classification. The literature review is followed by the materials and methods section, in which the main algorithm is explained. This section also contains a subsection describing the analyzed data. In Section 4, the results are presented. This section is divided into two subsections: the first describing the results for a single dataset, and the second describing the results when a multi-dataset approach is followed. The last two sections are the discussion and the conclusions.

#### **2. Literature Review**

As previously mentioned, CpG DNA methylation data have been used in a variety of biomedical applications, such as the creation of biological clocks. For instance, Horvath [38] created an accurate CpG DNA methylation clock. Horvath managed to reduce the dimensionality of the data from hundreds of thousands of CpGs analyzed per patient to a few hundred. This biological clock is able to predict the age of patients (in years) with rather high accuracy, using as inputs the methylation data of a few hundred CpGs. A related article is [39], in which the authors used neural networks to predict the forensic age of individuals. The authors showed how using machine learning techniques could improve the accuracy of the age forecast, compared to traditional (linear) models.

Park et al. [40] is an interesting article focusing on DNA methylation and AD. Its authors found a link between DNA methylation and AD but, similarly to Horvath's paper, did not use machine learning techniques. Machine learning techniques have, however, been applied with some success. For instance, ref. [41] used neural networks to analyze the relationship between gene-promoter methylation and biomarkers (one-carbon metabolism in patients). Another interesting model was created by [42]. In this model, the authors use a combination of DNA methylation and gene expression data to predict AD. The approach followed by those authors is different from the one that we pursued, as they increased the amount of input data (including gene expression), while we focus on trying to reduce the dimensionality of the existing data, i.e., selecting CpGs.

While most of the existing literature focuses on neural networks, there are also some interesting applications of other techniques, such as support vector machines (SVM). For instance, ref. [43] used SVM for the classification of histones. SVM has also been used for classification purposes in some illnesses, such as colorectal cancer [44]. Even if SVM appears to be a natural choice for classification problems, there seems to be less existing literature applying it to DNA methylation data in the context of AD identification.

#### **3. Materials and Methods**

One of the main objectives of this paper is to accurately generate classification forecasts differentiating between individuals with Alzheimer's disease (AD) and control cases. The algorithm was built with the intention of being easily expandable from one to multiple datasets. A categorical variable *yj* was created to classify individuals.

$$y_j = \begin{cases} 0 & \text{if Control} \\ 1 & \text{if } AD \end{cases} \tag{1}$$

In this way, a vector *Y* = {*y*1, *y*2, . . . , *ync*} can be constructed, classifying all the existing cases according to the disease state (control or *AD*). In this notation, *nc* denotes the total number of cases considered, including both control and *AD*. Every case analyzed (*j*) has an associated vector *Xj* containing the methylation levels of each CpG.

$$X^j = \begin{Bmatrix} X_j^1 \\ X_j^2 \\ \vdots \\ X_j^{mn} \end{Bmatrix} \tag{2}$$

This notation is used in order to clearly differentiate between the vector (*Xj*) containing all the methylation data for a single individual (all CpGs) from the vector (*Xi*) containing all the cases for a given CpG.

$$X_i = \{X_1^i, X_2^i, \dots, X_{nc}^i\} \tag{3}$$

In a matrix notation the complete methylation data can be expressed as follows

$$X = \begin{pmatrix} X_1^1 & X_2^1 & \dots & X_{nc}^1 \\ X_1^2 & X_2^2 & \dots & X_{nc}^2 \\ \vdots & \vdots & \ddots & \vdots \\ X_1^{mn} & X_2^{mn} & \dots & X_{nc}^{mn} \end{pmatrix} \tag{4}$$

For clarity purposes, it is perhaps convenient to show a hypothetical (oversimplified) example, in which 4 patients (*nc* = 4) are analyzed (2 control and 2 AD) and only 5 CpGs are included per patient (*mn* = 5). In this hypothetical example:

$$Y = \{0, 0, 1, 1\}\tag{5}$$

As an example, the methylation data for patient 1 could be:

$$X^1 = \begin{Bmatrix} 0.9832 \\ 0.6145 \\ 0.1254 \\ 0.7845 \\ 0.6548 \end{Bmatrix} \tag{6}$$

Similarly, the methylation data for a single CpG for all patients can be expressed as:

$$X\_i = \{0.9832, 0.3215, 0.6574, 0.6584\}\tag{7}$$

Finally, the methylation data for all patients (in matrix form) would be as follows:

$$X = \begin{pmatrix} 0.9832 & 0.3215 & 0.6574 & 0.6584 \\ 0.6145 & 0.6548 & 0.8475 & 0.7487 \\ 0.1254 & 0.6587 & 0.3254 & 0.6514 \\ 0.7845 & 0.3514 & 0.6254 & 0.6584 \\ 0.6548 & 0.6547 & 0.6587 & 0.6555 \end{pmatrix} \tag{8}$$
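The toy example above (*nc* = 4 cases, *mn* = 5 CpGs) can be assembled directly as arrays; the values are taken from Equations (5)–(8), with column *j* holding *Xj* (one patient, all CpGs) and row *i* holding *Xi* (one CpG, all patients).

```python
# The hypothetical example from Equations (5)-(8) in array form.
import numpy as np

Y = np.array([0, 0, 1, 1])  # 2 controls followed by 2 AD cases
X = np.array([
    [0.9832, 0.3215, 0.6574, 0.6584],
    [0.6145, 0.6548, 0.8475, 0.7487],
    [0.1254, 0.6587, 0.3254, 0.6514],
    [0.7845, 0.3514, 0.6254, 0.6584],
    [0.6548, 0.6547, 0.6587, 0.6555],
])

X_patient_1 = X[:, 0]  # X^1 in Equation (6): all CpGs for patient 1
X_cpg_1 = X[0, :]      # X_i in Equation (7): one CpG for all patients
print(X_patient_1.tolist())
```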

The proposed algorithm has two distinct steps. In the first step, an initial filtering is carried out, which reduces the dimensionality of the problem. The second step is the main algorithm. Both steps are described in the following subsections.

#### *3.1. Initial Filtering*

In this first step, the dimensionality of the problem is reduced: the *mn* CpGs initially available are filtered according to their *p*-values (comparing the methylation levels of the control and AD cases), leaving a smaller set of *m* CpGs.

$$\{X_1, X_2, \dots, X_{mn}\} \to \{X_1, X_2, \dots, X_m\} \tag{9}$$

with *m* < *mn*.
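A sketch of this filtering step is shown below on synthetic data; ranking CpGs by the absolute two-sample *t* statistic gives the same ordering as the *p*-value for a fixed sample size (the statistic, group sizes, and cutoff *m* are illustrative assumptions, not the paper's exact procedure).

```python
# Per-CpG two-sample comparison (control vs. AD), keeping the m CpGs
# with the most extreme statistics: {X_1..X_mn} -> {X_1..X_m}, m < mn.
import numpy as np

rng = np.random.default_rng(1)
mn, nc = 1000, 40                  # CpGs before filtering, total cases
X = rng.uniform(0.0, 1.0, size=(mn, nc))
Y = np.array([0] * 20 + [1] * 20)  # 0 = control, 1 = AD

ctrl, ad = X[:, Y == 0], X[:, Y == 1]
diff = ctrl.mean(axis=1) - ad.mean(axis=1)
se = np.sqrt(ctrl.var(axis=1, ddof=1) / ctrl.shape[1]
             + ad.var(axis=1, ddof=1) / ad.shape[1])
t_abs = np.abs(diff / se)          # |t| per CpG (Welch-style statistic)

m = 100                            # CpGs kept after filtering
kept = np.argsort(t_abs)[::-1][:m] # indices of the m most extreme CpGs
X_filtered = X[kept, :]
print(X_filtered.shape)
```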

### *3.2. Main Algorithm*

1. Create a vector grid (*D*), with each component representing the dimension (number of *Xi*) included in the simulation. Two grids are included: a fine grid, with relatively small differences in the values of the elements (representing the dimensions that the researcher considers more likely), and a broad grid, with large differences in values.

$$\text{Fine grid} = \left\{ n\_1, n\_1 + \Delta n\_s, n\_1 + 2\Delta n\_s, \dots, n\_1 + l\Delta n\_s \right\} \tag{10}$$

$$\text{Broad grid} = \{(n_1 + l\Delta n_s) + \Delta n_l, (n_1 + l\Delta n_s) + 2\Delta n_l, \dots, (n_1 + l\Delta n_s) + p\Delta n_l\}. \tag{11}$$

The values inside the above grids represent the *Xi* selected. As an example, *n*1 represents *X*1. Δ*ns* and Δ*nl* are the constant step increases in the fine and broad grids, respectively, with Δ*nl* > Δ*ns*. For instance, *n*1 + Δ*ns* and *n*1 + 2Δ*ns* are the second and third elements in the fine grid. The actual *Xi* elements related to these second and third values depend on the actual value of Δ*ns*. If Δ*ns* = 1, then the second and third elements relate to *X*2 and *X*3, respectively, while if Δ*ns* = 2, then they relate to *X*3 and *X*5, respectively. Each of these values, i.e., *n*1 + Δ*ns*, is the number of *Xi* chosen. *l* ∈ Z<sup>+</sup> is a constant that specifies (together with *n*1) the total size of the fine grid, while *p* ∈ Z<sup>+</sup> is the analogous term for the broad grid. For simplicity purposes, the case of a fine grid starting at *X*1 followed by a broad grid has been shown, but this is not a required constraint. The intent is to give discretion to the researcher to apply the fine grid to the area that is considered more important. This is an attempt to bring the expertise of the researcher into the algorithm. The combination of these two grids (*D*) can be seen in Equation (12).

$$D = \{n_1, n_1 + \Delta n_s, n_1 + 2\Delta n_s, \dots, n_1 + l\Delta n_s, (n_1 + l\Delta n_s) + \Delta n_l, (n_1 + l\Delta n_s) + 2\Delta n_l, \dots, (n_1 + l\Delta n_s) + p\Delta n_l\}. \tag{12}$$

For clarity purposes, let us simplify the notation:

$$D = \{S_j\} = \{S_1, S_2, \dots, S_m\} \tag{13}$$

where Equations (12) and (13) are identical; "*S*" is simply a more compact notation, with, for instance, *S*1 and *S*2 representing *n*1 and *n*1 + Δ*ns*, respectively.
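The construction of the combined grid *D* from Equations (10)–(12) can be sketched as follows; the particular values of *n*1, Δ*ns*, Δ*nl*, *l*, and *p* are illustrative assumptions.

```python
# Combined fine + broad grid D of candidate dimensions (Equations (10)-(12)).
def build_grid(n1, dns, dnl, l, p):
    fine = [n1 + i * dns for i in range(l + 1)]        # Equation (10)
    last = n1 + l * dns                                # last fine-grid element
    broad = [last + j * dnl for j in range(1, p + 1)]  # Equation (11)
    return fine + broad                                # Equation (12): D

# Fine steps of 1 up to 5 candidate CpGs, then broad steps of 10.
D = build_grid(n1=1, dns=1, dnl=10, l=4, p=3)
print(D)  # [1, 2, 3, 4, 5, 15, 25, 35]
```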

2. Create a mapping between each *Xi* ∈ {*X*1, . . . , *Xm*}, where each *Xi* is a vector, and 10 decile regions. The group of *Xi* with the highest 10% of *p*-values is included in the first decile and assigned a probability of 100%. The group of *Xi* with the second highest 10% of *p*-values is included in the second decile and assigned a probability of 90%. This process is repeated for all deciles, creating a mapping.

$$\{X\_1, \dots, X\_m\} \to B\{1.0, 0.9, 0.8, \dots, 0.1\} \tag{14}$$

where *B* is a vector of probabilities. In this way, the *Xi* with the largest *p*-values are more likely to be included.

3. For each *Sj*, generate for every *Xi*, *i* = 1, . . . , *m*, a random number *Ri* with 0 ≤ *Ri* ≤ 1. If *Ri* > *B*{*Xi*}, then *Xi* is not included in the preliminary *Sj* group of *Xi*s; otherwise, it is included. In this way, a filtering is carried out.

$$\{X\_1, \dots, X\_m\} \to \{X\_1, \dots, X\_{m\*}\} \,\forall S\_j \tag{15}$$
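Steps 2 and 3 above can be sketched as follows: each CpG is mapped to a decile-based inclusion probability *B* (highest *p*-values → 1.0, down to 0.1), and CpG *i* is kept only when a uniform draw *Ri* does not exceed *B*[*i*]. The data and seed are illustrative assumptions.

```python
# Decile-based probability mapping (Equation (14)) and stochastic
# filtering (Equation (15)) for one candidate group S_j.
import numpy as np

rng = np.random.default_rng(2)
m = 50
p_values = rng.uniform(0.0, 1.0, size=m)   # one p-value per CpG X_i

# Decile rank 0 = highest p-values -> probability 1.0; rank 9 -> 0.1.
order = np.argsort(-p_values)              # indices by descending p-value
decile = np.empty(m, dtype=int)
decile[order] = (np.arange(m) * 10) // m
B = 1.0 - 0.1 * decile                     # vector of probabilities

R = rng.uniform(0.0, 1.0, size=m)          # random numbers R_i
included = R <= B                          # keep X_i when R_i <= B{X_i}
X_indices = np.flatnonzero(included)       # the m* CpGs kept for this S_j
print(len(X_indices))
```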


4. Train the SVM using as inputs, for each case, the methylation levels of the *Xi*s selected in the previous step, with *Y* as the classification target.
5. Estimate the hit rate (HR) of the resulting classifications:

$$HR = \frac{CE}{TE} \tag{16}$$

where TE is the total number of classification estimations and CE is the number of correct classification estimates.
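Equation (16) is a simple ratio; a minimal sketch:

```python
# Hit rate from Equation (16): correct estimates (CE) over total estimates (TE).
def hit_rate(y_true, y_pred):
    ce = sum(int(t == p) for t, p in zip(y_true, y_pred))  # CE
    te = len(y_true)                                       # TE
    return ce / te

print(hit_rate([0, 0, 1, 1], [0, 1, 1, 1]))  # 3 of 4 correct -> 0.75
```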

6. Repeat steps (3) to (5) *k* times for each *Sj*. In this way, there is a mapping:

$$\{S\_1, \dots, S\_m\} \to \{HR(S\_1), \dots, HR(S\_m)\} \tag{17}$$

**Remark 1.** *An alternative approach would be choosing the starting distribution Sj as the one after which the mean value of the HR does not statistically increase at the 5% significance level.*

7. Define a new search interval between the two *Sj* with the highest success rates:

$$\max\{HR(S_1), \dots, HR(S_m)\} \to S_{\max}^1 \tag{18}$$

$$\max\left(\{HR(S_1), \dots, HR(S_m)\} \setminus \{HR(S_{\max}^1)\}\right) \to S_{\max-1}^1 \tag{19}$$

Iteration 1 (*Iter* = 1) ends, identifying the interval:

$$\{S\_{\text{max}}^1, S\_{\text{max}-1}^1\} \tag{20}$$

**Remark 2.** *It is assumed, for simplicity and without loss of generality, that S*<sup>1</sup><sub>max</sub> < *S*<sup>1</sup><sub>max−1</sub>*. If that is not the case, then the interval needs to be switched (*{*S*<sup>1</sup><sub>max−1</sub>, *S*<sup>1</sup><sub>max</sub>}*).*

8. Divide the interval identified in the previous step into *k* − 1 steps.

$$\{S_1, \dots, S_k\} \tag{21}$$

where *S*1 = *S*<sup>1</sup><sub>max</sub> and *Sk* = *S*<sup>1</sup><sub>max−1</sub>.

9. Create a new mapping, estimating the new hit rates (following the same approach as in the previous steps):

$$\{S_1, \dots, S_k\} \to \{HR(S_1), \dots, HR(S_k)\} \tag{22}$$

10. Repeat, incrementing *Itert*, until the maximum number of iterations (*Itermax*) is reached,

$$Iter_t \ge Iter_{max} \tag{23}$$

or until the desired hit rate (*HRdesired*) is reached,

$$HR(S) \ge HR_{desired} \tag{24}$$

or until no further HR improvement is achieved. Select *S*<sup>t</sup><sub>max</sub>.
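The iterative search of steps 7–10 can be sketched at a high level as follows; here `evaluate_hr` is a stand-in (an illustrative assumption) for the SVM-based hit-rate estimation of Equation (16), and the interval is re-gridded by simple integer steps rather than exactly *k* − 1 divisions.

```python
# High-level sketch of the grid refinement: score each candidate dimension,
# then repeatedly zoom into the interval between the two best candidates.
def refine(grid, evaluate_hr, k=5, iter_max=10, hr_desired=1.0):
    best_s, best_hr = None, -1.0
    for _ in range(iter_max):
        scored = sorted(grid, key=evaluate_hr, reverse=True)
        s_max, s_max_1 = scored[0], scored[1]   # two highest hit rates
        hr = evaluate_hr(s_max)
        if hr > best_hr:
            best_s, best_hr = s_max, hr
        if best_hr >= hr_desired:
            break                               # Equation (24)
        lo, hi = sorted((s_max, s_max_1))       # Remark 2: order the interval
        step = max((hi - lo) // (k - 1), 1)
        grid = list(range(lo, hi + 1, step))    # new, finer grid (step 8)
    return best_s, best_hr

# Toy objective: the hit rate peaks when ~30 CpGs are used (assumption).
best_s, best_hr = refine(list(range(5, 100, 10)),
                         lambda s: 1.0 - abs(s - 30) / 100)
print(best_s)  # 30
```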

A few points need to be highlighted. It is important to reduce the number of combinations to a manageable size. For instance, assuming that there are *m* different *Xi* (after the initial filtering of *p*-values), there would be $\binom{m}{r}$ combinations of size *r*. The well-known Equation (25) can be used.

$$\sum\_{r=0}^{m} \binom{m}{r} = 2^m \forall m \in \mathbf{N}^+ \tag{25}$$

Assuming that at least one of the *Xi* is selected:

$$\sum\_{r=0}^{m} \binom{m}{r} = \sum\_{r=1}^{m} \binom{m}{r} + \binom{m}{0} = 2^m \tag{26}$$

$$\sum\_{r=1}^{m} \binom{m}{r} = 2^{m} - 1 \tag{27}$$

For large *m* values the −1 term is negligible.
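Equations (25)–(27) can be checked numerically; *m* = 20 below is an arbitrary illustration.

```python
# Numeric check: the binomial sums equal 2^m (Equation (25)) and
# 2^m - 1 when the empty selection r = 0 is excluded (Equation (27)).
from math import comb

m = 20
total = sum(comb(m, r) for r in range(m + 1))       # all subsets
nonempty = sum(comb(m, r) for r in range(1, m + 1)) # at least one X_i
print(total, nonempty)  # 1048576 1048575
```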

In the initial step, the problem of having to calculate the estimations for 2*<sup>m</sup>* combinations is simplified into calculating (*m*/*q*)2*<sup>q</sup>* combinations, with *q* < *m*. If, for example, *q* = *m*/10, then the problem is reduced from 2<sup>10*q*</sup> to 10 · 2*<sup>q</sup>* combinations. It can be proven that:

$$2^{10q} > 10 \cdot 2^q \,\forall q \ge 2 \tag{28}$$

**Proof.** Using induction. Base case (*q* = 2): 2<sup>10·2</sup> = 2<sup>20</sup> = 1,048,576 and 10 · 2<sup>2</sup> = 40; since 1,048,576 > 40, the base case is confirmed. Assume:

$$2^{10k} > 10 \cdot 2^k \text{ for some } k \ge 2 \tag{29}$$

(the induction hypothesis). It must then be shown that:

$$2^{10(k+1)} > 10 \cdot 2^{k+1} \tag{30}$$

Indeed,

$$2^{10(k+1)} = 2^{10k} 2^{10} > 10 \cdot 2^k 2^{10} = 10 \cdot 2^k 2 2^9 = 10 \cdot 2^{k+1} 2^9 > 10 \cdot 2^{k+1} \tag{31}$$

which completes the proof by induction.
