*1.2. Machine Learning*

Machine learning seeks to answer a very concrete question: How can we build computer systems that automatically improve with experience, and what fundamental laws govern this teaching process? [9]

Through this discipline, it is possible to implement new methods that help researchers make new findings. Machine learning techniques are used, for example, to learn models of gene expression in cells, among other applications in bioinformatics, and more specifically in metagenomics [10]. Three types of algorithms can be distinguished among current machine learning techniques:

**Supervised**: The training data consist of labeled inputs with known outputs, which the machine analyzes to learn how to label new data. Supervised algorithms have many applications in bioinformatics [11], typically building on information from adequately characterized genes.

**Unsupervised**: This type of analysis works on unlabeled, uncategorized data and is based on identified similarities: the machine clusters the data according to shared characteristics. Unsupervised algorithms are often applied to problems in which humans cannot readily infer patterns, that is, where exhaustive observation would be required to identify them. The technique also allows behaviors to be characterized under different interpretations.

**Semi-supervised**: This analysis combines the two previously mentioned techniques. It is used on large datasets when the labels of only some of the data are known. Unsupervised learning groups the unlabeled data, while supervised techniques then predict the labels of the groups formed in the first step. Artificial Neural Networks (ANNs) are a well-known approach for addressing complex problems, as neural networks can be implemented in hardware or software and, in turn, can use a variety of topologies and learning algorithms.
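The semi-supervised idea described above can be sketched as follows. This is only an illustration (the data, gap threshold, and labels are hypothetical, not from this study): unlabeled points are first clustered by similarity, and the few known labels are then propagated to every member of their cluster.

```python
# Minimal semi-supervised sketch: cluster unlabeled 1-D points by
# similarity, then propagate the few known labels to whole clusters.
# Data, threshold, and labels are illustrative assumptions.

def cluster_1d(points, gap=2.0):
    """Group sorted points into clusters separated by gaps larger than `gap`."""
    clusters = [[]]
    for p in sorted(points):
        if clusters[-1] and p - clusters[-1][-1] > gap:
            clusters.append([])
        clusters[-1].append(p)
    return clusters

def propagate_labels(clusters, known):
    """Give every point the label of any known point in its cluster."""
    labeled = {}
    for cluster in clusters:
        label = next((known[p] for p in cluster if p in known), None)
        for p in cluster:
            labeled[p] = label
    return labeled

points = [0.1, 0.4, 0.9, 10.2, 10.7, 11.1]
known = {0.4: "A", 10.7: "B"}            # only two labels are known
clusters = cluster_1d(points)
labels = propagate_labels(clusters, known)
print(labels[0.9], labels[10.2])          # unlabeled points inherit A and B
```

Here the unsupervised step (clustering) does the grouping, and the supervised information (the two known labels) is used to label each group, mirroring the combination described above.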

#### **2. Materials and Methods**

#### *2.1. Selection of the CTX-M and Metagenome Baseline Reference Database for the Study*

First, our selection was based on previous work by Núñez in 2016 [12] (unpublished data), which already considers all the reported CTX-M groups. After a review of the state of the art, we consolidated the CTX-M database, previously filtered through the phylogenetic tree analysis carried out by Núñez [12]. Subsequently, the reference metagenome to be studied was selected through a search in the EBI-Metagenomics database (https://www.ebi.ac.uk/metagenomics/), considering the high probability that the *CTX-M* gene was present. We reviewed the following four metagenomes and selected only one as input to develop the prototype:


The metagenome selected was antibiotic resistance within the preterm infant gut (https://www.ebi.ac.uk/ena/data/view/PRJEB15257). Upon selection of the reference metagenome, we filtered the data following the pipeline described in Figure 1. The filtered metagenomic data were then prepared and machine learning techniques were applied according to the computational pipeline shown in Figure 2, where we assessed the accuracy and cost of the artificial neural network. Briefly: the filtered metagenome from the first pipeline is provided as input; the data are transformed by converting nucleotides to binaries, and the resulting binarized data are fed to the ANN (Artificial Neural Network); the ANN is implemented; and the accuracy and cost metrics are assessed.

**Figure 1.** Details of the bioinformatic pipeline.

**Figure 2.** Details of the computational pipeline.

We mapped the CTX-M reference database to the sample metagenome using the Duk tool (Li, Mingkun, et al., 2018) to eliminate information not relevant to the study. We obtained a consolidated CTX-M database with a total of 211 reference sequences in FASTA (a file format for bioinformatics data). As initial mapping parameters, we used k-mers of 16 (the default) and 63 for test mappings. Next, we optimized the mapping parameters following Algorithm 1.
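The role of the k-mer parameter in the mapping step above can be sketched as follows; the function below only illustrates k-mer decomposition in general, and is not the Duk implementation.

```python
# Illustrative k-mer decomposition (not the Duk tool's implementation).
def kmers(sequence, k):
    """Return all overlapping substrings of length k (the k-mers)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A read of length L yields L - k + 1 k-mers. Larger k values make
# matches against the reference more specific, but also more sensitive
# to sequencing errors, which is why the k-mer size is worth optimizing.
read = "ATGCGTACGTTAGCATGCAT"      # hypothetical 20-base read
print(len(kmers(read, 16)))        # 5 k-mers for a 20-base read at k = 16
```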


Based on the initial analysis, k-mers of 17, 19 and 21 were found to be the best. Additionally, we validated the results through an NCBI BLAST search of the contig obtained after adjusting the k-mer to 17 and 19, to conclusively verify that this sequence corresponds to bacteria carrying the CTX-M gene. The pipeline can be downloaded here:

https://github.com/dhcl1580/machinelearniginmetagenomicstesis.

#### *2.2. Defining an Optimal Neural Network Architecture*

An exhaustive review of the existing literature was performed to define the architecture of the neural network for metagenomics. We evaluated different machine learning models focused on improving the precision of techniques applied in neural networks, such as random forests or algorithms based on decision trees [13]. None of the studies reviewed considers a particular architecture; their main goal is to obtain a reduction of the cost function to guarantee that the neural network is actually learning. Conversely, this study proposes a multi-layer perceptron neural network architecture (Figure 3), because of the high sensitivity that the neurons in each layer show with respect to the activation functions, weights, and epochs. This interaction allows more parameters to be considered when training and validating such an architecture, taking its performance into account [14].
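A single layer of such a multi-layer perceptron can be sketched as below. The weights, sizes, and input here are toy assumptions for illustration only; the actual network's parameters are those reported in the experiments.

```python
import math

def dense_layer(inputs, weights, biases, activation=math.tanh):
    """One fully connected layer: activation(W·x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Toy 3-input, 2-neuron hidden layer with tanh activation
# (all values below are illustrative assumptions).
x = [1.0, 0.5, -0.5]
W = [[0.2, -0.1, 0.4], [0.3, 0.3, -0.2]]
b = [0.0, 0.1]
hidden = dense_layer(x, W, b)
print(hidden)   # two activations, each in (-1, 1)
```

Stacking several such layers, with the activation function, weights, and epoch count as the sensitive parameters, gives the multi-layer perceptron structure the architecture relies on.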

**Figure 3.** Details of the architecture of ANN (Artificial Neural Network).

#### *2.3. Data Standardization for the Neural Network*

To establish an appropriate training dataset for the proposed neural network, we developed a routine in Python 3 in charge of normalizing the obtained data, which essentially carries out a binarization of the CTX-M nucleotide sequences. All sequences are standardized to the length of the longest identified sequence, and the additional positions are filled with the value N. The result is the file "dataGen.csv", in which a total of 3896 values are generated for X along with the 10 CTX-M groups (Table 1). The 10 most representative classes were selected to ensure a uniform class distribution for stratified cross-validation in Stage 2 (validation). Initially there were 17 classes, from which only those whose sequences were represented at least four times within the test and validation dataset were selected. Each of the 10 classes corresponds to one of the CTX-M groups listed in Table 1.
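The standardization step can be sketched as follows. The exact binary encoding used by the routine is not specified in the text, so the 2-bit-per-base mapping below (with N shown padding to the longest sequence, as described) is an assumption for illustration.

```python
# Hypothetical nucleotide binarization: 2 bits per base, with sequences
# padded with 'N' up to the length of the longest sequence, as described
# in the text. The concrete bit mapping (including N -> "00") is an
# assumption, not the paper's exact scheme.
ENCODING = {"A": "00", "C": "01", "G": "10", "T": "11", "N": "00"}

def binarize(sequences):
    """Pad all sequences with N to a common length, then binarize them."""
    max_len = max(len(s) for s in sequences)
    padded = [s.ljust(max_len, "N") for s in sequences]
    return ["".join(ENCODING[base] for base in seq) for seq in padded]

rows = binarize(["ATG", "ATGCA"])
print(rows)   # both rows now have the same length: 2 bits x 5 positions
```

Standardizing every sequence to the same length in this way is what lets each binarized row serve as a fixed-size input vector for the neural network.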


**Table 1.** CTX-M groups and the corresponding classes selected for the study.

#### **3. Analysis of Results**

#### *3.1. Analysis of the Graph Resulting from the ANN*

Figure 4 shows how the graph of the ANN is built. In this graph, it is possible to observe how the nodes are distributed and how they interact to process the data.

**Figure 4.** Details of the ANN (Artificial Neural Network) components and the cost, accuracy, optimization and model definition tensors.

#### *3.2. Training Stage over CPU and GPGPU*

The activation functions tanh and sigmoid were tested against ReLU (Rectified Linear Units), varying the parameters LEARNING\_RATE, TRAINING\_EPOCHS, and HIDDEN\_SIZE, and obtaining the results presented below for each function. Table 2 shows the parameters varied in each experiment, and Figures 5–7 show the corresponding graphs.
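For reference, the three activation functions compared in these experiments can be sketched minimally as:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified Linear Unit: zero for negatives, identity otherwise."""
    return max(0.0, z)

# tanh (math.tanh) maps into (-1, 1) and is zero-centred, a property
# that can help optimization compared with the (0, 1)-valued sigmoid.
for z in (-2.0, 0.0, 2.0):
    print(round(math.tanh(z), 3), round(sigmoid(z), 3), relu(z))
```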

**Table 2.** Summary of target values during the training stage under CPU (Central Process Unit).


The best values were obtained using the tanh activation function in this experiment.

**Figure 5.** Values of accuracy using tanh function over CPU (Central Process Unit).

**Figure 6.** Values of cost using tanh function over CPU (Central Process Unit).

**Figure 7.** ROC (Receiver operating characteristics) analysis for the tanh activation function over CPU (Central Process Unit).

The best values were again obtained using the tanh activation function in the second step; Table 3 shows the values and Figures 8–10 show the corresponding graphs.


**Table 3.** Summary of target values during the training stage under GPU (Graphics Process Unit).

**Figure 8.** Values of accuracy using tanh function over GPU (Graphics Process Unit).

**Figure 9.** Values of cost using tanh function over GPU (Graphics Process Unit).

**Figure 10.** ROC (Receiver operating characteristics) analysis for the tanh activation function over GPU (Graphics Process Unit).
