Article

Using Machine Learning and Natural Language Processing for Unveiling Similarities between Microbial Data

by Lucija Brezočnik 1,*, Tanja Žlender 2, Maja Rupnik 2,3 and Vili Podgorelec 1

1 Faculty of Electrical Engineering and Computer Science, University of Maribor, SI-2000 Maribor, Slovenia
2 National Laboratory of Health, Environment and Food, Centre for Medical Microbiology, Department for Microbiological Research, SI-2000 Maribor, Slovenia
3 Faculty of Medicine, University of Maribor, SI-2000 Maribor, Slovenia
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(17), 2717; https://doi.org/10.3390/math12172717
Submission received: 29 July 2024 / Revised: 26 August 2024 / Accepted: 29 August 2024 / Published: 30 August 2024
(This article belongs to the Special Issue Mathematical Models and Computer Science Applied to Biology)

Abstract

Microbiota analysis can provide valuable insights in various fields, including diet and nutrition, health and disease, and environmental contexts, such as understanding the role of microorganisms in different ecosystems. Based on the results, we can provide targeted therapies and personalized medicine or detect environmental contaminants. In our research, we examined the gut microbiota of 16 animal taxa, including humans, as well as the microbiota of cattle and pig manure, focusing on the 16S rRNA V3–V4 hypervariable regions. Analyzing these regions is common in microbiome studies but can be challenging since the results are high-dimensional. Thus, we utilized machine learning techniques and demonstrated their applicability in processing microbial sequence data. Moreover, we showed that techniques commonly employed in natural language processing can be adapted for analyzing microbial text vectors. We obtained the latter through frequency analysis and applied the proposed hierarchical clustering method to them. All steps in this study were gathered in a proposed microbial sequence data processing pipeline. The results demonstrate that we not only found similarities between samples but also sorted the groups' samples into semantically related clusters. We also tested our method against other known algorithms, namely the Kmeans and Spectral Clustering algorithms, using clustering evaluation metrics, and the results demonstrate the superiority of the proposed method. Moreover, the proposed microbial sequence data pipeline can be utilized for different types of microbiota, such as oral, gut, and skin, demonstrating its reusability and robustness.

1. Introduction

Through numerous research studies, machine learning (ML) techniques have continued to showcase their ability to conduct highly effective and complex data analyses [1,2,3]. They have been utilized for various classification and regression problems, united under the common name of supervised learning, such as medical diagnosis systems for patients [4], face recognition, and electricity consumption forecasting. Similarly, unsupervised learning approaches such as clustering have been applied to problems like student segmentation. Other fields, like natural language processing (NLP), also enhance their research processes by utilizing ML [5]. Therefore, using ML techniques, one can gain valuable insight into data regardless of the domain.
However, some domains or subfields are still less explored with specific ML techniques than others. One of them is metagenomics, a field that analyzes genetic material, such as DNA, obtained directly from a sample. Genetic material is typically analyzed through sequencing, a process that reveals the nucleic acid sequence, e.g., the order of nucleotides in DNA. These nucleotides or bases are adenine (A), thymine (T), guanine (G), and cytosine (C).
Due to the high complexity of microbiota samples, metagenomic studies usually focus on sequencing specific regions of the genetic material, a method known as amplicon sequencing. One commonly targeted region is the 16S rRNA gene, which is ubiquitous in bacteria. This gene is partially conserved but contains nine hypervariable regions (V1–V9), making it a valuable target for microbial diversity studies [6].
Many studies have been conducted to address metagenomics in a biological manner, but there is not much work where the researchers utilize a complex analysis of sequence data with specific ML techniques. Hence, in this paper, we defined the following research question:
RQ1: How can machine learning techniques be used for a complex analysis of microbial sequence data?
Technically speaking, microbial sequence data are essentially a specific form of text. In ML, we have a great number of established approaches for text processing [7], from text classification, where we classify texts into pre-defined classes based on their content, to text extraction, where we identify informative data present within the text, and the most mathematical ones, i.e., statistical methods. The latter include frequency analysis, collocation, concordance, and TF-IDF. Based on this premise, we defined the second research question:
RQ2: Can techniques commonly used on a natural text also be applied to microbial sequence data?
In this research, we used microbial sequence data obtained from animal fecal samples and manure to test our research questions. Our goal was to see whether, by applying the proposed method, we could meaningfully separate animal groups from each other based on their microbial sequence data. This is possible only if the detected characteristics of the samples are sufficiently informative for the individual animal group. If we can identify such informative characteristics with the proposed method, it can be used in different real-life applications, such as microbial source tracking and disease diagnosis.
The misunderstanding of machine learning approaches still presents a big problem in the field of microbial sequence data analysis, as the misuse of machine learning and other related algorithms leads to incorrect conclusions [8,9,10]. Therefore, this paper tries to close the gap between computer scientists and microbiologists by proposing a microbial sequence data pipeline aimed at the use of machine learning and NLP techniques for the widest possible range of microbial sequence data processing applications. Altogether, the main contributions of this paper can be summarized as follows:
  • The proposed microbial sequence data processing pipeline;
  • The utilization of frequency analysis for microbial sequence data transformation into text vectors;
  • The proposed hierarchical clustering method over microbial sequence text vectors;
  • The evaluation of the proposed method on an animal feces dataset.

2. Background

A language can be described as a system of rules or symbols that are combined and utilized to communicate or transmit information [7]. Even though a language can be used easily by humans, it represents a much more challenging task for a computer that cannot interpret it similarly. Therefore, the natural language processing (NLP) field emerged, focusing on enabling computers to comprehend natural text. It is divided into two main parts [7]: natural language understanding (NLU) and natural language generation (NLG). NLU focuses on extracting concepts, entities, emotions, and keywords, and NLG focuses on producing meaningful phrases, sentences, and paragraphs.
NLP is utilized for many tasks in different domains. Machine translation is one of the most broadly used NLP applications, e.g., Google Translate. NLP is also used for text categorization, which has become increasingly employed in the recruitment process for parsing resumes [11]. We also unconsciously benefit from NLP through spam filtering [12]. In recent years, the list of NLP applications has grown, and NLP has become established in medicine and biomedical studies.
In medicine, NLP can be used for various applications, such as detecting adverse drug events [13] and assisting medical practitioners in accessing the relevant biomedical literature to answer clinical questions [14]. The rising interest in transfer learning within NLP played a crucial role in developing BioBERT, a domain-specific language representation model pre-trained on large-scale biomedical corpora to understand complex biomedical texts [15]. Although NLP is predominantly used for analyzing medical documents and notes in the biomedical field, it can potentially be used for analyzing other textual data, such as DNA sequence data.
With significant advancements in next-generation sequencing (NGS) technologies, the accumulating sequence data offer an opportunity to deepen our understanding of human health and disease. For example, genome-wide association studies (GWAS) can identify associations between genotype and phenotype [16]. Phenotype refers to a sample's observable traits (physical properties like eye color, weight, and height), which are determined by genotype and environmental factors. Phenotype is, therefore, an observable expression of a genotype: the DNA genotype carries the instructions for the organism, and the organism develops based on what is written in the genes. However, the literature shows that NLP is applied more to phenotypes, through phenotyping, than to genotypes [16].
The application of NGS extends far beyond the medical domain into fields such as environmental science, agriculture, biotechnology, and forensics. It plays an essential role in analyzing microbiomes, representing the genetic material of microorganisms in a particular environment or sample. This analysis provides invaluable insights into microbial communities’ composition, diversity, and function in various environments. In forensics, for example, scientists are exploring how the investigation of a skin microbiome can help identify a particular individual [17] and how fecal contamination could be traced back to a particular host [18].
Even now, analyzing the complex data of microbiomes presents a problem [19,20]. Not only is there usually a high number of instances, or samples in this case, but an even bigger problem concerns the number of features that describe them. Even after the preprocessing phase, where errors in the sequencing data are removed, the data are still high-dimensional. Data are typically organized in huge matrices containing samples and the corresponding observed counts of OTUs (Operational Taxonomic Units), i.e., bacteria types. However, some of the biggest challenges occur here. Such matrices usually exhibit high sparsity, meaning they contain many zeros, since most taxa occur in only a subset of samples. The second problem is overdispersion, where the data exhibit greater variability than expected. The combination of these issues creates challenges for the next steps of microbial analysis.
Researchers usually utilize common mathematical and statistical approaches to tackle those problems—for example, normalization and differential abundance testing methods. The latter usually consist of nonparametric statistical tests like the Mann–Whitney/Wilcoxon rank-sum or Kruskal–Wallis tests [20]. Lately, some more advanced approaches have been employed, like deep neural networks. Nevertheless, even with such advanced approaches, the researchers pointed out that the large volume of raw data produced poses significant challenges in data storage and processing [16,21].

3. Microbial Sequence Data Pipeline

This research demonstrates that we can establish a connection between the steps of the NLP pipeline and the proposed microbial sequence data pipeline. With it, we are also trying to close the gap between computer scientists and microbiologists. Currently, the misunderstanding of machine learning approaches, both supervised and unsupervised, still presents a big problem, which is evident from many recent publications in esteemed journals [8,9,10]. These papers emphasize the misuse of machine learning and other related algorithms, which leads to wrong usage and, consequently, incorrect conclusions. For example, one of the most recent works highlighting this problem is "Major data analysis errors invalidate cancer microbiome findings" [10]. Therefore, we propose the microbial sequence data pipeline. We wanted to make it very general in order to encompass the widest possible range of microbial sequence data processing applications using machine learning and NLP techniques, and thus to contribute a small but important stone to the mosaic of ML for microbial data analysis that will help further development in this field.
Figure 1 depicts the NLP pipeline and corresponding steps within a blue rectangle and the microbial sequence data pipeline with corresponding steps in an orange rectangle. The first step in NLP is data acquisition, where we collect data from known databases or scrape them from websites, social media, or other resources. This step in the microbial sequence data pipeline comprises sample collection, followed by the extraction of DNA and sequencing. Samples can be collected from different sources, such as animals, humans, and soil, through feces, saliva, and water samples. Alternatively, the DNA sequences can be obtained from sequencing data repositories, such as the Sequence Read Archive (SRA) [22].
The next step in NLP is text cleaning. Especially when we are not using preprocessed text published in dedicated datasets, the obtained data can contain some special characters, spelling mistakes, or other things that we do not want to process. It is vital to eliminate them before going to the next step. In this microbial sequence data pipeline step, we merge paired-end reads, remove primer sequences, and eliminate low-quality sequences (see Section 4.2).
The third step of the NLP pipeline involves text preprocessing. The latter is a very demanding job since it comprises many substeps, depending on the analyzed text. Tokenization is always performed, and the text is segmented into a list of tokens. Usually, we first break down the text into sentences and then each sentence into words. Then, we transform all text to lowercase and remove punctuation and stop words. Stop words depend on the language; in English, they are words such as "the", "and", and "a". The next significant step is stemming or lemmatization, where we remove the inflectional endings of the words and obtain their canonical form. This step therefore aims to reduce noise and enhance the uniformity and structure of the text. In the microbial sequence data pipeline, we denoise data by forming Zero-radius Operational Taxonomic Units (ZOTUs) (see Section 4.2).
Feature engineering is the fourth step, transforming text into vectors that ML algorithms can later operate on. In our proposed pipeline, this step transforms microbial sequence data into text vectors (see Section 4.3).
Later, in NLP, we build a model. The model building step is concerned with the use of ML algorithms. According to the proposed pipeline, all types of ML and deep learning models can be applied here, depending on the research purpose. In the microbial sequence data pipeline, we also build a model by utilizing different ML methods. For example, we could apply supervised learning methods, such as classification. In [23], the authors highlighted that machine learning methods for microbiome studies are often poorly documented and executed, thus providing steps toward standardization, and the authors in [24] utilized classification for disease detection. We can also utilize regression as another supervised technique. The authors in [25] proposed a logistic regression model for testing differential abundance in compositional microbiome data, and the authors in [26] predicted measures of soil health. On the other hand, another branch of ML is unsupervised learning, which can also be used in the model building step. In our research, presented in the rest of this article, we use an unsupervised technique, i.e., clustering. Therefore, we do not build and train a predictive model (to predict, for example, a species from the underlying microbial data) but instead focus on dimensionality reduction and on finding informative characteristics within the underlying microbial data using the proposed hierarchical clustering algorithm (see Section 4.5).
Finally, in both pipelines, the obtained results have to be evaluated and analyzed (see Section 5). Result visualization can be very informative and is used in different forms in many applications [27,28], but it faces many challenges when results are high-dimensional [21], as in our case.

4. Materials and Methods

4.1. Dataset

We used the dataset from [29,30], consisting of bacterial DNA sequences (specifically the V3–V4 hypervariable region of the 16S rRNA gene). These sequences were obtained from fecal samples of 15 animal taxa, humans, and 2 different manure types. Table 1 presents, in alphabetical order, the animal groups utilized in our research, which are cat, cattle, cattle manure, chicken, dog, fallow deer, gull, hedgehog, horse, human, mouse, nutria, pig, pig manure, pigeon, roe deer, swan, and wild duck. In total, there are 765 samples. The groups with the lowest number of samples are gull and pig manure, with 21 samples, while the human group has the highest count, with 185 samples.

4.2. Preprocessing

Our proposed approach works on microbial sequence data. Because of that, we have no limitations regarding the sample types during the data collection phase. For example, the samples to be analyzed can be obtained from different sources, like human and animal feces; oral, vaginal, and skin microbiota; and environmental samples such as soil and water.
Each sample must undergo DNA extraction, 16S amplicon sequencing, and raw sequence processing. Afterward, the sequences are either clustered based on a similarity cutoff or denoised to reduce data complexity and the impact of sequencing errors.
Our research focused on the V3–V4 hypervariable region of the 16S rRNA gene, a standard practice in microbiome studies [6,31]. We obtained raw sequence data from the fecal samples and animal manure dataset (see Section 4.1). The data were processed using the Usearch software version 11.0.667 [32] to improve data quality and reduce complexity. The raw sequence processing involved several steps. Initially, paired-end reads were merged into single sequences to reconstruct the full-length DNA fragments from overlapping reads. Next, primer sequences were removed because they may not perfectly match the source DNA. Error filtering was then performed by removing low-quality sequences, specifically those with more than one expected error, to ensure the accuracy of the data. Finally, the UNOISE algorithm was used to predict biologically correct sequences, resulting in the generation of Zero-radius Operational Taxonomic Units (ZOTUs), each approximately 405 base pairs long. Sequences shorter than 400 base pairs were removed to ensure completeness of the data.

4.3. Frequency Analysis

Frequency analysis (FA) studies the occurrence of specific items in a dataset. It is easily calculated, as shown in Equation (1):
FA = \frac{f}{d},   (1)
where d is the total number of items in the used dataset, and f is the number of times a selected item appears in it. The latter is called the frequency for short.
An n-gram is a sequence of n elements, where each element is taken from the set of m distinct objects. Therefore, the total number of all possible n-grams can be calculated by a permutation with replacement, as shown in Equation (2):
PR(m, n) = m^n   (2)
Since we have four distinct values, i.e., A, T, G, and C, we can modify Equation (2) by setting m = 4 and obtain the following equation:
PR(4, n) = 4^n   (3)
In our case, we need to compute relative frequencies of n-grams in all sequences of each object group by the pseudocode demonstrated in Algorithm 1, where all_groups and object_group denote all available groups and a specific group (object_group ∈ all_groups) inside all available groups, respectively. Each object_group consists of a sequence_list.
Algorithm 1 General pseudocode of the FA sequence computation
Require: set n
 1: calculate the number of n-grams by Equation (3)
 2: create all n-grams
 3: for each object_group in all_groups do
 4:     for each sequence in sequence_list do
 5:         for each n-gram do
 6:             count the number of times the n-gram appears in the sequence
 7:         end for
 8:     end for
 9:     calculate relative frequencies of the object_group
10: end for
11: prepare a matrix of relative frequencies of all_groups
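To make the computation concrete, the following minimal Python sketch implements Algorithm 1 under stated assumptions: each group's sequences are plain strings over the alphabet {A, C, G, T}, and all names (relative_frequencies, groups, matrix) are illustrative rather than taken from the original implementation.

```python
from itertools import product

def relative_frequencies(groups: dict[str, list[str]], n: int) -> dict[str, list[float]]:
    """Relative n-gram frequencies (Equation (1)) per object group."""
    ngrams = ["".join(p) for p in product("ACGT", repeat=n)]  # all 4^n n-grams, Equation (3)
    index = {g: i for i, g in enumerate(ngrams)}
    result = {}
    for group, sequences in groups.items():
        counts = [0] * len(ngrams)
        for seq in sequences:
            for i in range(len(seq) - n + 1):  # overlapping windows of length n
                gram = seq[i:i + n]
                if gram in index:  # skip n-grams containing ambiguous bases such as 'N'
                    counts[index[gram]] += 1
        total = sum(counts)  # d: the total number of counted items in the group
        result[group] = [f / total for f in counts]  # FA = f / d, Equation (1)
    return result

# Hypothetical toy input: two groups with two short sequences each.
groups = {"cat": ["ACGTACGT", "ACGTTGCA"], "dog": ["TTGACGTA", "CGCGCGAT"]}
matrix = relative_frequencies(groups, n=2)
print(matrix["cat"][:4])  # relative frequencies of 'AA', 'AC', 'AG', 'AT'
```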

4.4. Clustering

Clustering or cluster analysis comprises the process of grouping instances or items into groups or clusters based on their similarity. Therefore, the main idea is to have instances within a group that are more mutually similar (coherent) than those in other groups. We can accomplish this by analyzing their content, and with this, we gain a high-level overview of the information they encompass.
The general phases of the clustering process are the following:
  • Data selection;
  • Feature selection;
  • Similarity measure;
  • Clustering algorithm;
  • Cluster evaluation;
  • Cluster interpretation.
Each clustering process starts with data selection, where obtaining as many relevant features as possible for the problem at hand is essential. Feature selection then removes redundant and irrelevant features and significantly affects the final clustering outcome. Next, a similarity measure must be defined in order to quantify how similar or dissimilar two vectors are; Euclidean distance is one of the most common and simplest measures, but several others exist. After that, a clustering algorithm is applied to form clusters and reveal the clustering structure in the selected dataset. Finally, the clustering results must be evaluated and interpreted to extract insights or patterns from the data; the interpretation is usually carried out by experts in the problem area.
Formally, let us have a dataset X = {x_1, x_2, …, x_n} that consists of n instances. The instances have to be mapped into K clusters, denoted as C = {c_1, c_2, …, c_K}, so that any cluster c_j ∈ C is a subset of X, c_j ⊆ X. It is mandatory to satisfy the following requirements (Equation (4)):
\bigcup_{k=1}^{K} c_k = X, \quad x_i \in c_j, \quad c_j \cap c_k = \emptyset \; (j \neq k), \quad c_j, c_k \in C   (4)
Hence, all instances have to be assigned to clusters, with each instance belonging to exactly one cluster.

4.5. Proposed Hierarchical Clustering Method

Hierarchical clustering is a special form of clustering in which the algorithm is not restricted to a predefined number of clusters but determines the number of clusters based on the structure of the data. In general, the pseudocode of the hierarchical clustering algorithm is presented in Algorithm 2.
Algorithm 2 Hierarchical clustering algorithm
1: while there is more than one cluster do
2:     find the closest clusters c_a and c_b, where d(a, b) = \min_{u,v} d(c_u, c_v)
3:     combine clusters c_a and c_b into cluster c_{ab} = c_a ∪ c_b
4:     replace clusters c_a and c_b with cluster c_{ab}
5:     calculate the distance between c_{ab} and the other clusters
6: end while
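As a sanity check of the loop above, the following naive Python sketch implements Algorithm 2 directly on a precomputed symmetric distance matrix, using average linkage to match Section 4.5. It is an illustrative O(n^3) rendering under our own assumptions, not the authors' code, and the function name agglomerative is hypothetical.

```python
import numpy as np

def agglomerative(dist: np.ndarray) -> list[tuple[int, int]]:
    """Merge clusters until only one remains; return the merge history."""
    clusters = {i: [i] for i in range(len(dist))}  # each point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters under average linkage
        (a, b), _ = min(
            (((u, v), np.mean([dist[i, j] for i in clusters[u] for j in clusters[v]]))
             for u in clusters for v in clusters if u < v),
            key=lambda pair: pair[1],
        )
        clusters[a] = clusters[a] + clusters.pop(b)  # c_ab = c_a ∪ c_b replaces c_a and c_b
        merges.append((a, b))
    return merges

# Toy usage with a random symmetric distance matrix.
rng = np.random.default_rng(0)
D = rng.random((5, 5)); D = (D + D.T) / 2; np.fill_diagonal(D, 0.0)
print(agglomerative(D))  # four merge steps for five points
```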
However, how can one say that some instances are more similar than others? Which criterion should be used for this task?
Cosine similarity is widely used in NLP and data analysis [33]. It is invaluable because it does not compare two vectors purely linearly, as the Euclidean distance does, but instead calculates the angle between them. Therefore, it is much more robust in capturing pattern similarities between two datasets, even if their magnitudes vary.
Cosine similarity [27] can mathematically be described as the quotient between the dot product of the vectors (A · B) and the product of their Euclidean norms (‖A‖‖B‖), as seen in Equation (5).
C_{sim}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}   (5)
The dot product of the vectors, A · B, is calculated as \sum_{i=1}^{n} A_i B_i. ‖A‖ and ‖B‖ are the Euclidean norms of the vectors A = (A_1, A_2, …, A_n) and B = (B_1, B_2, …, B_n), defined as \sqrt{A_1^2 + A_2^2 + \cdots + A_n^2} and \sqrt{B_1^2 + B_2^2 + \cdots + B_n^2}, respectively. Thus, the product of the Euclidean norms, ‖A‖‖B‖, can be written as \sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}. In short, the parameter n denotes the dimension of the vectors, and A_i and B_i represent the i-th components of vectors A and B, respectively.
Equation (6) presents the calculation of the C s i m ( A , B ) :
C_{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}   (6)
The result of the cosine similarity always falls in the interval [−1, 1]. Calculating it requires computing the cosine of the angle θ between the vectors A and B. The closer the result gets to the right end of the interval, the more similar the vectors are; equivalently, results closer to the left end denote more dissimilar or opposite vectors. A value of 0 represents orthogonal vectors, meaning that there is a 90° angle between them. Equation (7) presents the interpretation of the results.
C_{sim}(A, B) = \begin{cases} -1, & \text{vectors } A \text{ and } B \text{ are strongly dissimilar} \\ 0, & \text{vectors } A \text{ and } B \text{ are orthogonal} \\ 1, & \text{vectors } A \text{ and } B \text{ are strongly similar} \end{cases}   (7)
To obtain the cosine distance, we subtract the value obtained in Equation (6) from 1, which yields the following formula:
d(C_{sim}(A, B)) = 1 - \cos(\theta)   (8)
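As a quick numerical check of Equations (5)–(8), the following NumPy snippet computes the cosine similarity and the derived cosine distance for two arbitrary, made-up vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Equation (6): dot product divided by the product of Euclidean norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.07, 0.08, 0.06])
b = np.array([0.08, 0.05, 0.05])
sim = cosine_similarity(a, b)   # falls in [-1, 1]; values near 1 mean similar vectors
dist = 1.0 - sim                # cosine distance, Equation (8)
print(round(sim, 4), round(dist, 4))
```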
In hierarchical clustering, many linkage methods or agglomerative clustering schemes are utilized to specify how the proximity of clusters is calculated, e.g., single, complete, average, centroid, ward, median, and weighted [34]. In our case, we used average linkage, which is based on the arithmetic mean of the pairwise distances between all members of the two clusters (see Equation (9)).
d(C_a, C_b) = \frac{\sum_{i} \sum_{j} d(C_a[i], C_b[j])}{|C_a| \cdot |C_b|}   (9)
Additionally, we utilized the cosine metric to calculate the distances between data points. The pseudocode of the proposed method is shown in Algorithm 3.
Algorithm 3 General pseudocode of the steps of the proposed method
1: preprocessing (Section 4.2)
2: execute Algorithm 1
3: execute Algorithm 2, where:
4:     ↪ compute distances between data points by Equation (8)
5:     ↪ compute the proximity of clusters by Equation (9)
6: plot results in a dendrogram
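In practice, steps 3–6 of Algorithm 3 can be reproduced with off-the-shelf tools. The sketch below uses SciPy's agglomerative clustering with the cosine metric and average linkage and plots the dendrogram; it assumes `matrix` is the group-to-frequency-vector mapping produced by the Algorithm 1 sketch above, so the variable names are illustrative.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

labels = list(matrix.keys())                        # e.g., the 18 group names
X = [matrix[g] for g in labels]                     # relative n-gram frequency vectors
Z = linkage(X, method="average", metric="cosine")   # Equations (8) and (9)
dendrogram(Z, labels=labels)                        # step 6: plot the hierarchy
plt.tight_layout()
plt.show()
```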

4.6. Visualization

When obtaining and processing high-dimensional data, it is hard to easily recognize multivariate structures [21,35]. One of the most intuitive approaches to this problem is visualization. But how can a picture or a graph showcase high-dimensional data? For this, dimensionality reduction techniques are utilized. Linear dimensionality reduction techniques, such as principal component analysis (PCA), and nonlinear techniques, such as t-distributed stochastic neighbor embedding (t-SNE), are the most commonly used. These approaches allow us to visualize high-dimensional data in lower-dimensional spaces, usually 2D or 3D. In our case, the main benefit is shown in the evaluation of the proposed method, where the distinction between clusters is investigated.
The core idea of dimensionality reduction is to find a lower-dimensional representation Y = {y_i} of a given dataset X = {x_i}, where i ∈ [1, N], x_i ∈ R^n, and y_i ∈ R^m. Here, N denotes the number of data points, and n and m denote the dimensionality of the data before and after the reduction, respectively, with m < n.
One way to reduce the dimensionality is by using the t-SNE algorithm [36]. The first step comprises the computation of pairwise similarities between all data points in the high-dimensional space. Let us take data points x_i and x_j and utilize the Gaussian kernel for this task (see Equation (10)).
p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)},   (10)
where p_{j|i} represents the conditional probability of selecting x_j as a neighbor of x_i, considering their similarities, and σ_i denotes the variance of the Gaussian that controls the spread of the area around each data point. The smaller the variance, the narrower the area around x_i, and vice versa.
Next, to make the optimization easier, the obtained p_{j|i} must be symmetrized (see Equation (11)).
P = (p_{ij})_{i,j=1}^{n}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}   (11)
With that, the probability distribution P in the high-dimensional space is defined. Then, the computation of low-dimensional affinities is performed by utilizing Student's t-distribution with a single degree of freedom (see Equation (12)).
Q = (q_{ij})_{i,j=1}^{n}, \qquad q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}},   (12)
where y_i and y_j are the lower-dimensional representations of data points x_i and x_j, respectively. Here, Q denotes the probability distribution in the low-dimensional space.
To make the similarity matrices P and Q as close as possible, we have to minimize the Kullback–Leibler divergence between them, which serves as the cost function. The cost function is given by Equation (13).
C = KL(P \,\|\, Q) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{q_{ij}}   (13)
The gradient of the cost function with respect to y_i is given by Equation (14).
\frac{\delta C}{\delta y_i} = 4 \sum_{j=1}^{n} (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}   (14)
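In practice, this optimization is rarely implemented by hand. A hedged usage sketch with scikit-learn's TSNE is shown below; the random matrix merely stands in for the samples' frequency vectors, and the parameter values are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(765, 256)  # placeholder for the 765 samples' n-gram frequency vectors
emb2d = TSNE(n_components=2, random_state=42).fit_transform(X)  # minimizes Equation (13)
emb3d = TSNE(n_components=3, random_state=42).fit_transform(X)  # 3D variant, as in Section 5.2
print(emb2d.shape, emb3d.shape)  # (765, 2) (765, 3): one embedded point per sample
```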

5. Results and Discussion

In this study, we treated the fasta files of samples within animal groups as text documents and the nucleotides A, C, T, and G as an alphabet. Therefore, we utilized verified core NLP concepts and modified them for application in microbial sequence data processing.
First, we tested different sizes of n-grams, where n ∈ [2, 7]. Based on Equation (3), we obtained 16, 64, 256, 1024, 4096, and 16,384 different 2-grams, 3-grams, 4-grams, 5-grams, 6-grams, and 7-grams, respectively. For illustration, let us list all 16 2-grams: ['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']. It is easy to see that we are dealing with exponential growth of the search space (4^n). Therefore, we limited our study to 7-grams to keep the computational complexity manageable.
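The enumeration and the growth of the search space can be verified with a few lines of Python, reusing itertools.product as in the Algorithm 1 sketch:

```python
from itertools import product

def all_ngrams(n: int) -> list[str]:
    """All 4^n n-grams over the DNA alphabet, in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=n)]

print(all_ngrams(2))                               # the 16 2-grams listed above
print([len(all_ngrams(n)) for n in range(2, 8)])   # [16, 64, 256, 1024, 4096, 16384]
```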
As presented in Section 4.1, we tested our proposed method on 765 fecal samples from 18 groups (16 animal taxa, including humans, and 2 manure types). Therefore, our input data consist of the calculated relative frequencies of the groups' samples, computed following the steps presented in Algorithm 3. For illustration, Table 2 shows a part of the 2-gram input matrix.
Our goal was to see if applying the proposed method could meaningfully separate the groups from each other, knowing that the 16S sequencing produces high-dimensional data that are hard to interpret [21]. It is also important to highlight that the fecal microbiome is strongly shaped by diet and the host genotype, age, hygiene, and possible antibiotic exposure [37].

5.1. Analysis of the Groups

We ran the proposed method on all 18 groups to see which groups are more similar than others. We first demonstrated the results graphically using a grid heatmap. A grid heatmap is a color-encoded matrix where the intensity of each cell provides a visual representation of the relative magnitude of the corresponding value, which in our case is the difference between two groups. It is commonly used to visualize complex data; usually, a darker color represents higher values and a lighter color lower ones. A color scale or gradient is typically added next to it for easier interpretation.
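A heatmap of this kind can be produced with Matplotlib, as in the following sketch; the random matrix and the placeholder group names are assumptions standing in for the actual 18 × 18 pairwise group similarities:

```python
import numpy as np
import matplotlib.pyplot as plt

S = np.random.rand(18, 18)                      # placeholder pairwise similarity matrix
labels = [f"group {i + 1}" for i in range(18)]  # placeholder group names

fig, ax = plt.subplots()
im = ax.imshow(S, cmap="viridis")               # cell color encodes the value's magnitude
ax.set_xticks(range(18)); ax.set_xticklabels(labels, rotation=90)
ax.set_yticks(range(18)); ax.set_yticklabels(labels)
fig.colorbar(im, ax=ax)                         # the color scale shown next to the grid
plt.tight_layout()
plt.show()
```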
Figure 2 shows the results of the proposed algorithm over 2-grams. The darker the color of the cell, the more similar the corresponding object groups are. We can see that in the case of 2-grams, the most different group is gull. This fact is evident from the lighter colors of the corresponding row (or column) cells. Even though we get a basic idea of the object group correlations, we cannot retrieve more data from its subtle color changes.
That is why we went on to present the results with a dendrogram. A dendrogram is a tree-like or branching diagram that denotes the hierarchical relationships between objects. Leaves are the endpoints and, in our case, represent the animal group. They are connected with branches (called clades) that demonstrate a distance between connected clusters, i.e., the higher the line, the more dissimilar the clusters are. Clades with just one leaf are called simplicifolious (in botany, that means “single-leafed”).
Figure 3 presents the results of the proposed method over the 2-grams. Some of the strong similarities found between the groups are meaningful, like between dogs and cats. Both are domestic animals that mainly live in the same environment, and both are carnivores. On the other hand, gulls differ the most among all animal groups and are simplicifolious. One possible explanation could be the freshness of the collected feces; some samples were quite old and, therefore, exposed to aerobic conditions. Moreover, because the amount of gull feces is small, it is difficult to avoid environmental contamination during sample collection. It is also interesting to see that fallow deer and roe deer are very dissimilar, even though they are mammals of the same order, Artiodactyla, and are herbivorous. It is also very intriguing that hedgehogs are so close to the birds and that nutria is closer to humans than to other rodents. From these results, it is clear that when using only 2-grams, we simply cannot detect enough information about the mutual similarities of the groups.
Then, we continued with 3-grams and obtained the dendrogram presented in Figure 4. We have three main clusters: the first contains humans and mice; the second, gull and hedgehog; and the third, the rest of the animal groups. The first cluster is understandable because humans and mice are omnivorous; for the second cluster, there is no clear reasoning behind such results. Cat and dog are still the most similar. Among birds, we still see the strongest similarity between pigeons and swans, but the wild duck group has now swapped positions with chicken, meaning that wild ducks are more similar to pigeons and swans, which, like wild ducks, live in nature. Seeing both manures added to the same cluster is also very promising.
Even though we conducted a study for all n-grams, where n ∈ [2, 7], we will now focus on presenting the final results of 7-grams. As seen in Figure 5, the clustering is much more meaningful than with the lower n-grams. All bird taxa are grouped together, i.e., swan, wild duck, pigeon, and gull, and there is a very strong similarity among them. Compared with Figure 4, swans and wild ducks are now the most similar, which aligns with them being part of the same animal family. Another meaningful cluster comprises cattle, fallow deer, and roe deer, which all belong to the same animal order. On top of that, they are all herbivores.
Furthermore, manures are also grouped and are very similar to chicken feces. This fact is also meaningful based on the chicken lifestyle and farm environment. The most surprising results are that hedgehogs were simplicifolious on the dendrogram, which could be explained by sampling (most of the hedgehog feces were already dry upon sample collection), and that nutria appears to be different from the other animal taxa.

5.2. Analysis of the Beta Diversity

After the general analysis of the animal groups, we wanted to see how the proposed method behaves and visualizes the samples (analysis of species samples, i.e., Beta diversity). In this case, we utilized t-SNE, presented in Section 4.6. Each specific marker represents one sample of the group. For illustration, one pink “X” marker represents one human sample, one blue circle represents one hedgehog sample, and so on for each defined marker.
Again, we started with 2-grams. The results are visible in Figure 6. It is interesting to see that there is only one clear cluster, which consists of mice. Samples of the other groups are quite mixed. This is expected, since 2-grams do not provide enough information for better separation.
Segregation among clusters is more pronounced in the case of 3-grams, as shown in Figure 7. We can see the mouse, horse, and human groups in clearly separated clusters. The remaining samples are still quite mixed.
In the case of 7-grams, we obtain the most segregated clusters (see Figure 8). We see mouse, horse, human, and nutria samples in very distinct clusters. There is also a cattle cluster close to roe and fallow deer. Similarly, hedgehog samples are much more mutually close than before. The same goes for pig samples. It is also interesting to see that the manure samples group together.
Information loss is expected because the high-dimensional search space was reduced to 2D. Therefore, we also present the results in 3D.
Figure 9 presents the 3D visualization of the 7-gram results. A better separation between cattle and deer species samples is noticeable (see bottom left corner of the figure). Similarly, there is a better segregation between both types of manures. At the bottom of the picture, we can see a cluster of gray “+” markers denoting pig manure samples, and above it, a cluster of brown diamonds denoting cattle manure samples.
Because our 3D presentation is interactive, we cannot present all the information from one perspective. Thus, Figure 10 presents a different perspective of Figure 9. Here, we can better see an isolated cluster of pig samples denoted by red diamonds. Similarly, a nutria sample cluster is denoted with yellow “X” markers. For illustration, comparing both 3D figures, one might say that in Figure 9, nutria samples overlap with mouse samples, but Figure 10 presents a clear distinction between them. However, some samples are still spread in a more extensive area, like swans.
From the results, it is evident that utilizing the proposed method over microbial sequence data produces meaningful results. As expected, the higher the n value in n-grams, the better the segregation between groups and samples. Not only were we able to find meaningfully similar animal groups based on their microbial sequence data, but we also visualized how the samples of the 18 groups used are internally similar. With interactive result presentation, like 3D t-SNE, the similarities and dissimilarities among animal clusters are even more evident.

5.3. Quantitative Results Discussion

We also compared the results with the following well-known algorithms: the Kmeans algorithm [38] and the Spectral Clustering algorithm [39]. We chose them because they allow setting the number of clusters in their initialization phase and are built on different foundations, i.e., distance-based partitioning and graph theory, respectively.
To evaluate the algorithms equitably, we set the number of clusters to 18 for both the Kmeans and Spectral Clustering algorithms. The number corresponds to the number of groups. The approaches were implemented with Scikit-learn [40], an open-source machine learning library written in Python. We ensured the reproducibility of the results by always setting the 'random_state' parameter. To compare the algorithms fairly, we first evaluated them with three metrics, i.e., the Silhouette Score [41], the Calinski–Harabasz Index [42], and the Davies–Bouldin Index [43]. There are more metrics for evaluating the quality of the clustering process, but we focused on the commonly used ones that correspond to the paper's topic. Since a high-dimensional space cannot be visualized without a chosen dimensionality reduction technique, we later applied t-SNE to all tested algorithms to visualize the clustering results. In this manner, we provided an unbiased presentation of the results.
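The comparison setup can be sketched as follows, under the assumption that `X` holds the n-gram frequency vectors of all 765 samples (here a random placeholder); the cluster count and fixed `random_state` mirror the setup described above.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

X = np.random.rand(765, 256)  # placeholder for the samples' n-gram frequency vectors

# 18 clusters, one per group; a fixed random_state makes the runs reproducible.
km_labels = KMeans(n_clusters=18, random_state=42, n_init=10).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=18, random_state=42).fit_predict(X)
```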
Figure 11 demonstrates clustering results based on the Kmeans algorithm for 7-grams. For an easier understanding of the results, we only provide a figure for 7-grams since, on average, the algorithm performed best on them. Compared with our proposed method, we can see that the clusters provided by the Kmeans algorithm are not well defined and are more dispersed, which is also supported by the metrics results. Similarly, we can say the same for the Spectral Clustering results. Figure 12 shows that clusters are poorly defined, and our proposed method outperforms the Spectral Clustering and Kmeans algorithms.
The first metric utilized is the Silhouette Score, which considers cluster cohesion and separation. It measures how similar an instance is to its own cluster (intra-cluster distance i) compared to other clusters (nearest-cluster distance j). It is calculated as S = \frac{j - i}{\max(i, j)} and produces a result in the interval [−1, 1]. Results towards 1 are better, since they demonstrate that the instance is far away from neighboring clusters. Similarly, negative results demonstrate that the instance was assigned to the wrong cluster, and results around 0 show that the instance lies on the border between neighboring clusters. In our case, we can see that the proposed method produced better results than both the Kmeans and Spectral Clustering algorithms, and with longer n-grams, we obtained better cluster separation (see Figure 13).
The second metric is the Calinski–Harabasz Index (Variance Ratio Criterion). For all clusters, it evaluates the ratio of the between-cluster dispersion to the within-cluster dispersion with the following formula: CHI = \frac{B / (k - 1)}{W / (n - k)}, where B represents the between-cluster dispersion, W represents the within-cluster dispersion, n represents the number of samples, and k is the number of clusters. The results lie in the interval [0, ∞), where greater values represent better results. We can see that the proposed method did not perform best here, but its score improved with increasing n-gram size (see Figure 14), which matched our assumption.
The third metric is the Davies–Bouldin Index, also known as DBI. It calculates the average similarity of each cluster to its most similar cluster. It is calculated as DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right), where k represents the number of clusters, d_{ij} the between-cluster distance of clusters i and j, and s_i and s_j the average within-cluster distances of clusters i and j, respectively. The results lie in the interval [0, ∞), where a lower score means better clustering results. In our case, we can see that the lowest score, i.e., the best one, was obtained by the proposed method (see Figure 15).
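All three metrics are available in scikit-learn; continuing the sketch above, they can be computed as follows:

```python
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

for name, labels in [("Kmeans", km_labels), ("Spectral Clustering", sc_labels)]:
    print(name,
          silhouette_score(X, labels),         # in [-1, 1]; higher is better
          calinski_harabasz_score(X, labels),  # in [0, inf); higher is better
          davies_bouldin_score(X, labels))     # in [0, inf); lower is better
```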
After obtaining the results, we ranked the tested algorithms according to their scores. In total, each method produced 18 results, i.e., 6 results per metric. The proposed method obtained the best results in 12 out of 18 cases, being the predominant one according to the Silhouette Score and the Davies–Bouldin Index, followed by the Kmeans algorithm, which produced the 6 best results according to the Calinski–Harabasz Index.
The Calinski–Harabasz Index favors clusters with higher between-cluster dispersion relative to within-cluster dispersion [44]. It assumes relatively compact and spherical clusters, so irregular and elongated shapes significantly affect the performance, and the metric tends to favor clusters of similar size and density. In this manner, the proposed method did not score as highly as Kmeans and Spectral Clustering on the Calinski–Harabasz Index. On the other hand, the Davies–Bouldin Index expresses the similarity between clusters and the Silhouette Score considers cluster cohesion and separation (the pairwise intra-cluster and inter-cluster distances) for cluster quality assessments [44]; in both these metrics, the proposed method clearly outperformed the other two.

6. Conclusions

The analysis of microbial sequence data is a very complex task. This is true not only from the perspective of algorithm utilization but also because of the challenges that might arise during sample collection and processing. For illustration, various factors contribute to the final result: contamination from environmental sources; cross-contamination between samples, where microbiota or microbial DNA can be transferred from one sample to another; and inadequate preservation, where the microbial composition changes or microbial DNA degrades. It is also vital to highlight that amplicon sequencing is prone to technical mistakes by the machines, such as sequencing errors.
In our research, we used 16S amplicon sequencing, which is widely used to analyze microbiota but produces high-dimensional data that are hard to interpret. We demonstrated that machine learning techniques can be utilized for microbial sequence data processing and that techniques commonly used for natural language processing can also be applied to microbial sequence data. Thus, we employed frequency analysis to transform microbial sequence data into text vectors and applied the proposed hierarchical clustering method to them. The proposed method was successfully evaluated on a dataset of animal feces, where instances were sorted into semantically related clusters. Moreover, the comparison with other known algorithms showed that the proposed method outperformed them. Lastly, we would like to highlight the reusability of the proposed method: gut, animal, oral, soil, skin, and other microbiota would be processed in the same manner.
The results demonstrate that with higher n-grams, we can obtain better clustering results, but at the same time, the computational complexity grows exponentially. That is why a better definition of n-grams is vital. One possibility that we want to explore in future research is a strategy of intelligently defining sequence substrings, also known as k-mers, that could replace current n-grams.

Author Contributions

Conceptualization, L.B. and V.P.; methodology, L.B. and V.P.; software, L.B.; validation, L.B., T.Ž., M.R. and V.P.; formal analysis, L.B. and V.P.; investigation, L.B., T.Ž., M.R. and V.P.; resources, T.Ž. and M.R.; data curation, T.Ž. and M.R.; writing—original draft preparation, L.B.; writing—review and editing, L.B., T.Ž., M.R. and V.P.; visualization, L.B.; supervision, M.R. and V.P. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the financial support from the Slovenian Research Agency (Research Core Funding No. P2-0057).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The results/data/figures in this manuscript have not been published elsewhere, nor are they under consideration by another publisher. The original contributions presented in the study are included in the article, and further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sammut, C.; Webb, G.I. Encyclopedia of Machine Learning; Springer Science & Business Media: Berlin, Germany, 2011. [Google Scholar]
  2. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef] [PubMed]
  3. Brezočnik, L.; Nalli, G.; De Leone, R.; Val, S.; Podgorelec, V.; Karakatič, S. Machine Learning Model for Student Drop-Out Prediction Based on Student Engagement. In Proceedings of the International Conference “New Technologies, Development and Applications”, Sarajevo, Bosnia and Herzegovina, 22–24 June 2023; Springer: Berlin, Germany, 2023; pp. 486–496. [Google Scholar]
  4. Podgorelec, V.; Kokol, P.; Stiglic, M.M.; Heričko, M.; Rozman, I. Knowledge discovery with classification rules in a cardiovascular dataset. Comput. Methods Programs Biomed. 2005, 80, S39–S49. [Google Scholar] [CrossRef] [PubMed]
  5. Nagarhalli, T.P.; Vaze, V.; Rana, N.K. Impact of Machine Learning in Natural Language Processing: A Review. In Proceedings of the 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; pp. 1529–1534. [Google Scholar] [CrossRef]
  6. Kameoka, S.; Motooka, D.; Watanabe, S.; Kubo, R.; Jung, N.; Midorikawa, Y.; Shinozaki, N.O.; Sawai, Y.; Takeda, A.K.; Nakamura, S. Benchmark of 16S rRNA gene amplicon sequencing using Japanese gut microbiome data from the V1–V2 and V3–V4 primer sets. BMC Genom. 2021, 22, 527. [Google Scholar] [CrossRef] [PubMed]
  7. Khurana, D.; Koli, A.; Khatter, K.; Singh, S. Natural language processing: State of the art, current trends and challenges. Multimed. Tools Appl. 2023, 82, 3713–3744. [Google Scholar] [CrossRef]
  8. Asnicar, F.; Thomas, A.M.; Passerini, A.; Waldron, L.; Segata, N. Machine learning for microbiologists. Nat. Rev. Microbiol. 2024, 22, 191–205. [Google Scholar] [CrossRef]
  9. Walsh, C.; Stallard-Olivera, E.; Fierer, N. Nine (not so simple) steps: A practical guide to using machine learning in microbial ecology. Mbio 2024, 15, e02050-23. [Google Scholar] [CrossRef]
  10. Gihawi, A.; Ge, Y.; Lu, J.; Puiu, D.; Xu, A.; Cooper, C.S.; Brewer, D.S.; Pertea, M.; Salzberg, S.L. Major data analysis errors invalidate cancer microbiome findings. MBio 2023, 14, e01607-23. [Google Scholar] [CrossRef]
  11. Mohanty, S.; Behera, A.; Mishra, S.; Alkhayyat, A.; Gupta, D.; Sharma, V. Resumate: A Prototype to Enhance Recruitment Process with NLP based Resume Parsing. In Proceedings of the 2023 4th International Conference on Intelligent Engineering and Management (ICIEM), London, UK, 9–11 May 2023; pp. 1–6. [Google Scholar] [CrossRef]
  12. Ismail, S.S.; Mansour, R.F.; Abd El-Aziz, R.M.; Taloba, A.I. Efficient E-mail spam detection strategy using genetic decision tree processing with NLP features. Comput. Intell. Neurosci. 2022, 2022, 7710005. [Google Scholar] [CrossRef]
  13. Chen, L.; Gu, Y.; Ji, X.; Sun, Z.; Li, H.; Gao, Y.; Huang, Y. Extracting medications and associated adverse drug events using a natural language processing system combining knowledge base and deep learning. J. Am. Med. Inform. Assoc. 2020, 27, 56–64. [Google Scholar] [CrossRef]
  14. Afzal, M.; Hussain, M.; Malik, K.M.; Lee, S. Impact of automatic query generation and quality recognition using deep learning to curate evidence from biomedical literature: Empirical study. JMIR Med. Inform. 2019, 7, e13430. [Google Scholar] [CrossRef]
  15. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed]
  16. Lin, J.; Ngiam, K.Y. How data science and AI-based technologies impact genomics. Singap. Med. J. 2023, 64, 59–66. [Google Scholar] [CrossRef]
  17. Yang, M.Q.; Wang, Z.J.; Zhai, C.B.; Chen, L.Q. Research progress on the application of 16S rRNA gene sequencing and machine learning in forensic microbiome individual identification. Front. Microbiol. 2024, 15, 1360457. [Google Scholar] [CrossRef] [PubMed]
  18. McGhee, J.J.; Rawson, N.; Bailey, B.A.; Fernandez-Guerra, A.; Sisk-Hackworth, L.; Kelley, S.T. Meta-SourceTracker: Application of Bayesian source tracking to shotgun metagenomics. PeerJ 2020, 8, e8783. [Google Scholar] [CrossRef] [PubMed]
  19. Zhou, R.; Ng, S.K.; Sung, J.J.; Goh, W.W.B.; Wong, S.H. Data pre-processing for analyzing microbiome data–A mini review. Comput. Struct. Biotechnol. J. 2023, 21, 4804–4815. [Google Scholar] [CrossRef]
  20. Weiss, S.; Xu, Z.Z.; Peddada, S.; Amir, A.; Bittinger, K.; Gonzalez, A.; Lozupone, C.; Zaneveld, J.R.; Vázquez-Baeza, Y.; Birmingham, A.; et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 2017, 5, 27. [Google Scholar] [CrossRef]
  21. Love, C.J.; Gubert, C.; Kodikara, S.; Kong, G.; Lê Cao, K.A.; Hannan, A.J. Microbiota DNA isolation, 16S rRNA amplicon sequencing, and bioinformatic analysis for bacterial microbiome profiling of rodent fecal samples. STAR Protoc. 2022, 3, 101772. [Google Scholar] [CrossRef]
  22. Leinonen, R.; Sugawara, H.; Shumway, M.; Collaboration, I.N.S.D. The sequence read archive. Nucleic Acids Res. 2010, 39, D19–D21. [Google Scholar] [CrossRef]
  23. Topçuoğlu, B.D.; Lesniak, N.A.; Ruffin IV, M.T.; Wiens, J.; Schloss, P.D. A framework for effective application of machine learning to microbiome-based classification problems. MBio 2020, 11, 10-1128. [Google Scholar] [CrossRef]
  24. Su, X.; Jing, G.; Sun, Z.; Liu, L.; Xu, Z.; McDonald, D.; Wang, Z.; Wang, H.; Gonzalez, A.; Zhang, Y.; et al. Multiple-disease detection and classification across cohorts via microbiome search. Msystems 2020, 5, 10-1128. [Google Scholar] [CrossRef]
  25. Hu, Y.; Satten, G.A.; Hu, Y.J. LOCOM: A logistic regression model for testing differential abundance in compositional microbiome data with false discovery rate control. Proc. Natl. Acad. Sci. USA 2022, 119, e2122788119. [Google Scholar] [CrossRef] [PubMed]
  26. Wilhelm, R.C.; van Es, H.M.; Buckley, D.H. Predicting measures of soil health using the microbiome and supervised machine learning. Soil Biol. Biochem. 2022, 164, 108472. [Google Scholar] [CrossRef]
  27. Han, J.; Kamber, M.; Pei, J. 2-Getting to Know Your Data. In Data Mining, 3rd ed.; Han, J., Kamber, M., Pei, J., Eds.; The Morgan Kaufmann Series in Data Management Systems; Morgan Kaufmann: Boston, MA, USA, 2012; pp. 39–82. [Google Scholar] [CrossRef]
  28. Zou, H. Clustering algorithm and its application in data mining. Wirel. Pers. Commun. 2020, 110, 21–30. [Google Scholar] [CrossRef]
  29. Žlender, T.; Brezočnik, L.; Podgorelec, V.; Rupnik, M. Uncovering cattle-associated markers of faecal pollution through 16s rRNA gene analysis. In Proceedings of the 13th International Gut Microbiology Symposium, Aberdeen, Scotland, 13–15 June 2023; P&J LIVE: Aberdeen, Scotland; Hong Kong, China, 2023; p. 87. [Google Scholar]
  30. Žlender, T.; Brezočnik, L.; Podgorelec, V.; Rupnik, M. Identifying Markers of Cattle Fecal Pollution Using Comparative Analysis of the 16S rRNA Gene. In Proceedings of the Power of Microbes in Industry and Environment: Book of Abstracts, Poreč, Croatia, 15–18 May 2023; Croatian Microbiological Society: Zagreb, Croatia, 2023; p. 119. [Google Scholar]
  31. López-Aladid, R.; Fernández-Barat, L.; Alcaraz-Serrano, V.; Bueno-Freire, L.; Vázquez, N.; Pastor-Ibáñez, R.; Palomeque, A.; Oscanoa, P.; Torres, A. Determining the most accurate 16S rRNA hypervariable region for taxonomic identification from respiratory samples. Sci. Rep. 2023, 13, 3974. [Google Scholar] [CrossRef]
  32. Edgar, R.C.; Flyvbjerg, H. Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics 2015, 31, 3476–3482. [Google Scholar] [CrossRef] [PubMed]
  33. Flisar, J.; Podgorelec, V. Improving short text classification using information from DBpedia ontology. Fundam. Inform. 2020, 172, 261–297. [Google Scholar] [CrossRef]
  34. Müllner, D. Modern hierarchical, agglomerative clustering algorithms. arXiv 2011, arXiv:1109.2378. [Google Scholar]
  35. Theus, M. High-dimensional Data Visualization. In Handbook of Data Visualization; Springer: Berlin/Heidelberg, Germany, 2008; pp. 151–178. [Google Scholar] [CrossRef]
  36. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  37. Penington, J.S.; Penno, M.A.; Ngui, K.M.; Ajami, N.J.; Roth-Schulze, A.J.; Wilcox, S.A.; Bandala-Sanchez, E.; Wentworth, J.M.; Barry, S.C.; Brown, C.Y.; et al. Influence of fecal collection conditions and 16S rRNA gene sequencing at two centers on human gut microbiota analysis. Sci. Rep. 2018, 8, 4386. [Google Scholar] [CrossRef]
  38. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  39. Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905. [Google Scholar]
  40. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  41. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  42. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. Theory Methods 1974, 3, 1–27. [Google Scholar] [CrossRef]
  43. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227. [Google Scholar] [CrossRef]
  44. Ekemeyong Awong, L.E.; Zielinska, T. Comparative Analysis of the Clustering Quality in Self-Organizing Maps for Human Posture Classification. Sensors 2023, 23, 7925. [Google Scholar] [CrossRef]
Figure 1. Comparison between the NLP pipeline and the proposed microbial sequence data pipeline.
Figure 2. Results of 2-grams presented with heatmap.
Figure 3. Hierarchical clustering of groups presented as a dendrogram (2-gram).
Figure 4. Hierarchical clustering of groups presented as a dendrogram (3-gram).
Figure 5. Hierarchical clustering of groups presented as a dendrogram (7-gram).
Figure 6. The t-SNE visualization of 2-gram results.
Figure 7. The t-SNE visualization of 3-gram results.
Figure 8. The t-SNE visualization of 7-gram results.
Figure 9. The 3D t-SNE visualization of 7-gram results.
Figure 10. The 3D t-SNE visualization of 7-gram results (different perspective of Figure 9).
Figure 11. The results of the Kmeans algorithm for 7-grams.
Figure 12. The results of the Spectral Clustering algorithm for 7-grams.
Figure 13. Results of the Silhouette Score, where the higher values represent better solutions.
Figure 14. Results of the Calinski–Harabasz Index, where higher values represent better solutions.
Figure 15. Results of the Davies–Bouldin Index, where the lower values represent better solutions.
Table 1. Dataset overview.

     Group           Number of Samples   Number of ZOTUs
1    Cat             33                  11,310
2    Cattle          79                  28,328
3    Cattle manure   31                  34,668
4    Chicken         35                  19,861
5    Dog             43                  12,506
6    Fallow deer     34                  28,978
7    Gull            21                  9,460
8    Hedgehog        28                  6,164
9    Horse           36                  28,835
10   Human           185                 6,635
11   Mouse           26                  6,660
12   Nutria          23                  10,025
13   Pig             45                  21,777
14   Pig manure      21                  21,593
15   Pigeon          24                  13,924
16   Roe deer        34                  28,685
17   Swan            41                  17,986
18   Wild duck       26                  16,507
Table 2. Example of the 2-gram input matrix.

           AA        AC        AG        …   TT
Sample 1   0.07351   0.08578   0.05976   …   0.08654
Sample 2   0.07612   0.04586   0.05228   …   0.04661
