Enrichment score was calculated as the copy number of the target sequence per Mbp of chromosome.
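As a minimal sketch of this calculation (the function name is ours, and the chromosome length is assumed to be given in base pairs):

```python
def enrichment_score(copy_number, chromosome_length_bp):
    """Copies of the target sequence per megabase pair (Mbp) of chromosome."""
    return copy_number / (chromosome_length_bp / 1_000_000)

# e.g., 500 copies on a 50-Mbp chromosome gives a score of 10.0
```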

Here, the consensus site (5′-GGACT-3′)*n*, *n* ≥ 2, was utilized to locate and annotate all m6A hotspots. We identified several genes associated with cancer, diabetes, stress-related mental illnesses, and neuronal development, among other conditions. In particular, GO analysis revealed crucial genes related to neuronal development.

m6A RNA modification is one of the most prevalent reversible internal modifications, regulated by methyltransferases ("writers") and demethylases ("erasers") [17]. The presence of complementary seed sequences in microRNAs (miRNAs) indicated that miRNAs target m6A peak regions in both mouse and human experimental studies. Furthermore, m6A has also been reported in the transcriptome of neurons [9,18]. Brain development is a highly specific and coordinated genetic event, and any abnormality can act as a doorway to different anomalies, such as autistic spectrum and schizophrenia-like disorders [19–21]. In our GO analysis, we selected 1729 genes based on a target-sequence (GGACT) frequency of more than 2. Of them, only 27 were scrutinized. The enrichment analysis of biological processes for m6A hotspot genes revealed their association with embryonic brain development, locomotion, neuronal projection, neuronal differentiation, axonal guidance, synaptic assembly, synaptic plasticity, and transmission (Figure 3a,b).

**Figure 3.** GO analysis of m6A target sites. (**a**) The networking analysis of m6A hotspot genes in different physiological processes. (**b**) Enrichment analysis of m6A hotspot genes for neurological processes, such as neuronal development, neurogenesis, differentiation and projection.

#### **4. Discussion**

The human genome sequence was explored for all possible m6A sites with two or more target sequences (5′-GGACT-3′) in tandem, which might have a high probability of methylation. The human genome may include m6A-containing motifs that remain unidentified because of their low abundance or because they lie beyond the range of current detection techniques; hence, surveying the human genome for target sites could be an alternative tool to identify them.

Using the tool "PatternRepeatAnnotator", a total of 6.78 × 10<sup>7</sup> target sequences were recognized on the plus strand of the human genome. We observed an overrepresentation of target sequences in non-coding DNA (96.4% in introns, DRRs, promoters, and genomic regions), whereas only 3.5% were located in coding (exonic) regions (Supplementary Figure S1). This internal modification has been reported in nascent pre-mRNAs, suggesting that the addition of the methyl group occurs before splicing [22]; this is supported by our current finding of 52% of target sequences in intronic regions. The m6A modification exhibits spatio-temporally specific expression patterns; therefore, despite the many target sequences, only a few undergo methylation [23]. Introns, which harbor a high density of m6A sites (95.8% of those in non-coding genomic regions), are primarily involved in producing miRNAs. It has been reported that miRNAs influence fundamental biological processes from cell division to cell death and may undergo m6A modification [24]. For example, m6A modifications in primary miRNAs enhance their recognition and processing by DGCR8, a miRNA microprocessor complex protein [25]. Therefore, the identified m6A sites may provide deep insight into the mRNA–miRNA interaction pathways involved in the pathogenesis of various diseases. The ribosomal protein S6 kinase genes (*RPS6K*) have been predicted as potential candidates for the pathogenesis of hepatocellular carcinoma by miRNA–mRNA network analysis [26]. This is in line with our enrichment analysis (Supplementary Table S1), which identified the ribosomal genes RPS6KA3 and RPS6KA5, associated with the regulation of axonogenesis and cellular morphogenesis in the course of neuronal differentiation. Any alteration of m6A methylation of *RPS6KA3* and *RPS6KA5* may affect normal neurite outgrowth and arborization [27].

Neurexin performs distinct regulatory functions in different classes of neurons, and mutations or deletions of neurexin genes (*NRXN1* and *NRXN2*) have been associated with autism-associated behavioral changes in experimental mice [28]. Neurexin also plays a key role in the trafficking of presynaptic vesicles, and its deletion results in reduced synaptic current. To our knowledge, no report exists on a direct link between neurexins and m6A. However, our enrichment analysis data have shown that m6A may regulate the *NRXN1*, *NRXN2*, and *NRXN3* genes.

In a synaptic epi-transcriptomic study, 4469 enriched m6A sites were reported selectively in 2921 genes in the forebrain of adult mice, implying that chemically modified mRNA could significantly promote synaptic function [29]. Knockdown of the m6A reader produced a dramatic change in spine morphology and dampened synaptic transmission, thereby suggesting its role in synaptic function. Epidermal Growth Factor Receptor (EGFR) belongs to the tyrosine kinase family and is expressed by neuronal and glial cells in different brain regions [30]. During early development, EGFR is highly expressed in the midbrain and hippocampus, and its increased expression has also been reported in many pathophysiologies, including Alzheimer's, Huntington's, and Parkinson's diseases, amyotrophic lateral sclerosis, and traumatic brain injury associated with reactive gliosis [31]. Our data have also shown that EGFR is enriched in m6A sites, which is consistent with previous findings [32]. YT521-B homology domain family 2 (*YTHDF2*) is an m6A reader that directly binds the m6A modification site in the 3′ UTR of EGFR mRNA and impedes cell proliferation and growth by modulating the downstream ERK/MAPK pathway [32]. The functions of EGFR could also be modulated by other proteins, such as METTL3 and FTO [33,34]. Collectively, these data indicate that m6A modification of mRNA is a requisite for the proper physiological functions of EGFR. Further, MAPK is a key regulator of neurogenesis and consists of four distinct cascades: ERK1/2, JNK1/2/3, p38, and ERK5. It has been shown that m6A enrichment of *MAPK* by METTL plays a tumour-suppressive role via the p38/ERK pathway, since elevated levels of p38 and pERK after METTL knockdown in colorectal cancer inhibited cell migration and proliferation [35]. Like EGFR, YTHDF2 also regulates MAPK and NF-κB signalling in systemic lupus erythematosus (SLE). YTHDF2 knockdown has been demonstrated to activate MAPK and NF-κB signalling, resulting in a significant increase in proinflammatory events in SLE [7,36]. Additionally, neurological involvement appears in the early stage of SLE, with cognitive impairment being the most prevalent symptom that correlates with disease activity [37].

The identification and quantification of m6A in the transcriptome are tedious, expensive, and associated with many significant systematic errors. To date, well-established in vitro methods have encountered several obstacles, including the lack of single-nucleotide resolution, a lack of selective chemical reactivities for specific RNA modifications, and lengthy protocols for m6A identification. These challenges are exacerbated by the stability of RNA and the random frequency of methylation. As a result, finding m6A signatures throughout the whole transcriptome is an extremely difficult task. To address these issues, several web tools and algorithms have been developed, which either investigate various databases of m6A sequences or utilize statistical techniques to more precisely locate m6A sites [36,38–42]. Other tools, such as iRNA-AI, iMethyl-PseAAC, iDNA-Methyl, iRNA-Methyl, and iRNA-PseU, have also been developed for the identification and annotation of specific sites for adenosine-to-inosine editing, protein methylation, DNA methylation, N6-methyladenosine (using pseudo-nucleotide composition), and RNA pseudouridine, respectively [42–45]. These tools require a sequence of interest in which the intended modification is sought, and they report whether or not the desired modification is feasible in that sequence. The method created in this work scanned the whole human genome for a specific set of nucleotides (target sequence) and generated well-annotated information as output. This tool fundamentally differs from all other available techniques in the origin of its hypothesis, the concept of its algorithm, and its final results.

The Perl-script-based tool "PatternRepeatAnnotator" employed in our study can be customized in several ways: (i) it can be used to search for any repeat type (e.g., CAG triplet repeats of Huntington's disease, GAA repeats of Friedreich's ataxia, etc.); (ii) the number of such repeats (1 or more) in tandem can be chosen by the user; (iii) the range of promoter/downstream regions (in nucleotide length) can be set at the user's choice; and (iv), more importantly, the tool is future-proof, as the latest human genome version (>GRCh37 patch 8) can be provided as a template for the target sequence search. The results are stored in a folder named after the input sequence, where numerous statistical tools can be applied to analyze the data easily. The output file contains well-annotated information, such as (i) the gene ID of the identified target sequence, (ii) its symbol, (iii) strand (plus/minus), (iv) location in the chromosome (exon/intron/genomic/promoter/downstream regions), (v) the position of the repeat (start to end), (vi) its total length (in nucleotides), and (vii) the sequence itself. Using this robust annotated information, the analysis becomes easier, and genes of interest can be picked directly from the desired chromosome for further analysis. This, in turn, reduces the cost, time, and manpower required to evaluate the whole transcriptome for m6A modification. The ability to analyze future genome databases gives the tool long-lived applicability and a highly customizable interface, making it user-friendly and robust with richly annotated output.
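Since the original Perl script is not reproduced here, the core tandem-repeat scan can be illustrated with a simplified Python sketch (the function name and interface are ours, not the tool's):

```python
import re

def find_tandem_repeats(sequence, motif="GGACT", min_repeats=2):
    """Locate tandem runs of a motif (e.g., the m6A consensus GGACT)
    occurring min_repeats or more times in a row.

    Returns a list of (start, end, n_repeats) tuples,
    0-based with exclusive end positions.
    """
    pattern = re.compile(f"(?:{motif}){{{min_repeats},}}")
    hits = []
    for m in pattern.finditer(sequence.upper()):
        n = (m.end() - m.start()) // len(motif)
        hits.append((m.start(), m.end(), n))
    return hits

# Three consecutive GGACT copies starting at position 2:
print(find_tandem_repeats("AAGGACTGGACTGGACTTT"))  # [(2, 17, 3)]
```

The same function can be pointed at any repeat type by changing `motif` (e.g., `"CAG"` with a higher `min_repeats`), mirroring the tool's customization option (i).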

#### **5. Conclusions**

m6A is an evolutionarily conserved modification involved in modulating translation efficiency, mRNA turnover, RNA splicing, and miRNA and other non-coding RNA biogenesis. As demonstrated in our study, "PatternRepeatAnnotator" could identify and annotate all "methylable adenosines" in the genome; however, their regulation in vivo needs to be verified, as not all m6A sites are modified in the human genome. Annotation of these identified m6A sites revealed that over 96% of them were located in non-coding regions, which corroborates their roles in downstream regulatory processes. Several essential genes in neuronal development harbor extensive m6A sites. More in vivo investigations are required to correlate these identified m6A sites, their modification patterns, and their mechanistic roles in cellular processes and various human diseases.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/life11111185/s1, Figure S1: Percentage distribution of target sequences in different regions of the human genome. Table S1: Enrichment Analysis of genes for their biological functions.

**Author Contributions:** Conceptualization, S.K. and H.N.S.; data curation, L.-W.T., D.G., V.S. and H.N.S.; resources, A.K.S.; supervision, V.S. and H.N.S.; validation, S.K., L.-W.T., D.G., R.D., V.S. and H.N.S.; visualization, S.K., R.D.; writing—original draft, P.K.; writing—review and editing, S.K., L.-W.T., R.D., D.G., V.S. and H.N.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** None.

**Institutional Review Board Statement:** This article does not contain any studies involving human or animal participants.

**Informed Consent Statement:** Not applicable; this article does not contain any studies involving human participants.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** All authors acknowledge Sharda University-UP, AIIMS-New Delhi, and MTA infotech-Varanasi for providing all resources required for this study.

**Conflicts of Interest:** The authors declare that there are no conflicts of interest.

#### **References**


## *Article* **A Maximum Flow-Based Approach to Prioritize Drugs for Drug Repurposing of Chronic Diseases**

**Md. Mohaiminul Islam <sup>1</sup>, Yang Wang <sup>1</sup> and Pingzhao Hu <sup>1,2,3,4,\*</sup>**


**\*** Correspondence: pingzhao.hu@umanitoba.ca

**Abstract:** The discovery of new drugs is required in a time of global aging and increasing populations. Traditional drug development strategies are expensive, time-consuming, and high-risk. Thus, drug repurposing, which treats new/other diseases using existing drugs, has become a very attractive strategy. It can also be viewed as the re-investigation of existing drugs, including those that failed to demonstrate usefulness, for new disease indications. Previously published literature used maximum flow approaches to identify new drug targets for drug-resistant infectious diseases, but not for drug repurposing. Therefore, we propose a maximum flow-based protein–protein interactions (PPIs) network analysis approach to identify new drug targets (proteins) from the targets of FDA (Food and Drug Administration)-approved drugs, and their associated drugs, for the treatment of chronic diseases (such as breast cancer, inflammatory bowel disease (IBD), and chronic obstructive pulmonary disease (COPD)). Experimental results showed that we successfully turned drug repurposing into a maximum flow problem. Our top candidates for drug repurposing, Guanidine, Dasatinib, and Phenethyl Isothiocyanate for breast cancer, IBD, and COPD, respectively, were validated by other independent research as potential candidate drugs for these diseases. This shows the usefulness of the proposed maximum flow approach for drug repurposing.

**Keywords:** drug–target interactions; protein–protein interactions; chronic diseases; drug repurposing; maximum flow

#### **1. Introduction**

Chronic diseases are usually defined as diseases that are persistent or long-lasting and require ongoing medical attention. There are many different types of chronic diseases. For example, breast cancer starts from breast cancer cells but can also spread to other parts of the body. Breast cancer is the most frequently diagnosed cancer in women and the second leading cause of cancer death among women [1]. Of note, cancer is a multistage disease [2] that increases the mortality rate worldwide [3]. Several breast cancer treatment techniques are available, such as surgery, chemotherapy, radiation, and hormone therapy; often a combination of these treatments is used in practice [4]. Other chronic diseases, such as inflammatory bowel disease (IBD) and chronic obstructive pulmonary disease (COPD), are usually consequences of many environmental and genomic factors. IBD is a chronic disease that includes both ulcerative colitis and Crohn's disease and lasts for a very long time. IBD results in a significant burden to our society and families. IBD causes segments of the bowel to become inflamed and swollen. IBD treatment involves medicines, diet modifications, and occasionally surgery [5]. The goal of such treatment options is to reduce the inflammation associated with IBD; in the long term, existing treatments may reduce the risks of IBD complications. COPD is a chronic lung disease that causes breathing problems and is the main cause of respiratory mortality worldwide [6]. Current treatment options include lung transplants, quitting smoking, and inhalers. However, these strategies can only assist in lessening the progression of COPD. The fundamental cause of COPD is smoking [7]. Patients may not know about the disease initially, but the condition worsens over time, such as with severe breathing problems during simple tasks, e.g., walking.

**Citation:** Islam, M.M.; Wang, Y.; Hu, P. A Maximum Flow-Based Approach to Prioritize Drugs for Drug Repurposing of Chronic Diseases. *Life* **2021**, *11*, 1115. https://doi.org/10.3390/life11111115

Academic Editor: Stanislav Miertus

Received: 2 August 2021; Accepted: 18 October 2021; Published: 20 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

There is a pressing need to identify potential drug targets and their drugs for developing personalized treatments for chronic diseases. However, new drug development takes a very long time and is extremely expensive: typically, this type of approach takes 10–15 years and \$1 billion [8]. Nevertheless, we can save time and money by reusing old drugs for new indications, an approach called drug repurposing or repositioning. This is a helpful technique to find different indications for current medications. For example, in 2020, COVID-19 infections from the novel coronavirus became a primary worldwide public health concern [9] and were declared a global pandemic [10]. The pandemic created an urgent need to develop vaccines or therapeutic treatments for COVID-19 infections. However, there were no confirmed drugs available to treat COVID-19 infections. Therefore, the drug repurposing technique was used to obtain new candidate drugs from the existing FDA-approved drugs [11,12].

There are different types of approaches to identify new indications of an FDA-approved drug, such as network-based [13,14] and machine learning (ML)-based [15,16] approaches.

A biological network consists of a massive number of nodes and interactions among them. A set of genes can form a subnetwork that includes drug targets, and these drug targets act as a bridge between this subnetwork and the rest of the network. If we can identify the risk genes of a given disease and their associated drug targets in a biological network, we can remove the bridge connections between the subnetwork containing the risk genes and the rest of the network. Therefore, we can potentially treat the disease using drugs associated with the drug targets that connect to the disease's risk genes in the network.

A network-based approach tries to find a subnetwork that provides an insight into the relationship between drugs and disease genes. For example, Cheng et al. [17] proposed a network-based system to list the drug targets using three different inference algorithms, which are drug resemblance in any network, protein correspondence in any network, and recognized drug target within a bipartite network.

Yeh et al. [18] first proposed a maximum flow approach to predict a set of proteins as new effective drug targets for the treatment of prostate cancer. The idea is that candidate drug-target proteins with a higher flow value to the risk genes have more influence on those risk genes than other candidates. They used microarray data [19] and an interactome (PPI) network [20] of prostate cancer to build their prediction model. Next, they used the shortest path algorithm [21] to implement a maximum flow method within their network and successfully identified 20 drug targets to reuse. These drug targets were validated against other published literature reporting the same drug targets for prostate cancer.

Melak et al. [22] also used the maximum flow approach to prioritize, from a list of known drug targets, a set of drug targets for the treatment of tuberculosis. Yeh et al. [18] used the Pearson correlation coefficient and gene expression changes between genes to calculate the weights of the edges of their PPI network, whereas Melak et al. [22] used a PPI network from STRING, which includes the associated weights for the edges. Thus, Yeh et al. [18] and Melak et al. [22] showed that proteins with the maximum flow to the risk genes in the PPI network could be used as targets for developing drugs to treat diseases.

This study aims to apply the maximum flow technique to a PPI network with sets of breast cancer, IBD, and COPD risk genes to identify new breast cancer, IBD, and COPD drugs, respectively, from a list of FDA-approved medications. We hypothesize that identifying new drugs from existing drugs (i.e., drug repurposing) for breast cancer, IBD, and COPD can be converted into a maximum flow problem using a human interactome network (i.e., a PPI network). Furthermore, it is believed that drug targets (proteins, denoted X) connected with risk genes through a higher flow value have more impact on these risk genes than other drug targets. Therefore, these Xs can be used as potential targets for drug development for the disease's treatment. Furthermore, deletion of these Xs from the PPI network will disrupt the communication between the risk genes and other proteins. Therefore, this study aims to identify a set of proteins strongly correlated with the disease risk genes from a PPI network using a maximum flow approach. Later, we can identify new candidate drugs for repurposing to treat breast cancer, IBD, and COPD associated with these targets using a drug–target interaction network.

#### **2. Materials and Methods**

#### *2.1. Datasets*

#### 2.1.1. Protein–Protein Interaction (PPI) Network

We collected a comprehensive biological network [23] which includes 140,899 interactions among the 13,365 human proteins (genes). We used this biological network (Figure 1) to conduct our experiments.

**Figure 1.** An example of our PPI network. The network shows the interactions among the FDA-approved drug targets (i.e., proteins), potential drug targets, and disease-associated risk proteins/genes.

#### 2.1.2. Drug-Target Interactions (DTIs) Network

We extracted 2390 FDA-approved drug targets (DTs) in humans from DrugBank [24]. However, the PPIs network described in Section 2.1.1 contains only 1926 of these 2390 FDA-approved DTs. We also collected a DTIs network with ~13,000 DTIs among 5049 unique drugs and 3099 unique targets from DrugBank.

#### 2.1.3. Risk Genes

In this study, we focused our drug repurposing on the three diseases mentioned above (breast cancer, IBD, and COPD), since they have a relatively large number of disease-specific risk genes identified from genome-wide association studies (GWAS), as described below. These risk genes make the application of the maximum flow technique to drug repurposing possible in this study. GWAS have already discovered more than 200 breast cancer risk loci. For example, Baxter et al. [25] were able to mark 63 loci and identified 110 known target genes at 33 loci. In addition, Wu et al. [26] identified 179 significant genes associated with breast cancer risk. Thus, we collected in total 289 breast cancer risk genes from these two studies.

Previously published genomic studies identified 215 risk loci to explain the fundamental molecular biology of IBD [27]. In addition, Katrina et al. [27] marked three additional loci that have therapeutic targets in IBD. They also prioritized 811 IBD risk genes from 240 risk variants.

A GWAS in the United Kingdom by Sakornsakolpat et al. [6] identified 82 loci associated with COPD or lung function. Among them, 47 loci were already known as risk loci of COPD. Of note, Sakornsakolpat et al. [6] identified 156 COPD risk genes from these 82 loci.

Hence, we have collected 289, 811, and 156 risk genes responsible for breast cancer, IBD, and COPD, respectively, from the earlier studies to validate the usefulness of our proposed drug repurposing method.

#### *2.2. The Maximum Flow Algorithm for Drug Repurposing*

The analysis pipeline for drug repurposing includes multiple steps, as shown in Figure 2 (taking breast cancer as an example). Below we explain the steps in more detail.

**Figure 2.** Analysis pipeline for the maximum flow approach to prioritize drugs for drug repurposing (taking breast cancer as an example). (**A**) Shows the types of data we collected for our experiments. (**B**) We mapped each target protein in the PPI network to be either a risk gene, FDA-approved drug target, or potential candidate target. (**C**) Shows the construction of the maximum flow network from the collected PPI network to apply the Push-Relabel maximum flow algorithm. (**D**) Shows the steps to repurpose existing drugs based on the maximum flow values of each target protein.

#### 2.2.1. Constructing the Maximum Flow Network

Mapping drug targets and risk genes to the PPIs network: We first mapped the 1926 FDA-approved DTs (FDA\_DT) and risk genes (RGs) of a specific disease to the unweighted PPIs network (i.e., a graph where edges have no weights and there is at most one edge between any two nodes).

Constructing weighted PPIs network: We used TOMSimilarity (topological overlap matrix similarity) [19] to calculate the weight of edges between genes, and we used Equation (1) to get TOMSimilarity between two nodes in our network.

$$TOMSimilarity\left(x, y\right) = \frac{\left|N_{neighbor}\left(x\right) \cap N_{neighbor}\left(y\right)\right| + A_{xy}}{\min\left(\left|N_{neighbor}\left(x\right)\right|, \left|N_{neighbor}\left(y\right)\right|\right) + 1 - A_{xy}}\tag{1}$$

where *Nneighbor*(*x*) and *Nneighbor*(*y*) are the sets of neighbors of *x* and *y*, respectively; *Axy* is the value of the adjacency matrix (i.e., one if nodes *x* and *y* are connected and zero otherwise); and *TOMSimilarity*(*x*, *y*) is the topological overlap matrix similarity between the nodes *x* and *y*.
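Equation (1) can be sketched as a small Python helper (our own illustration, not the authors' code; `neighbors` is assumed to be a dict mapping each node to the set of its neighbors):

```python
def tom_similarity(x, y, neighbors):
    """Topological overlap matrix similarity (Equation (1)) between x and y.

    neighbors: dict mapping each node to the set of its neighbors.
    """
    shared = len(neighbors[x] & neighbors[y])   # |N(x) ∩ N(y)|
    a_xy = 1 if y in neighbors[x] else 0        # adjacency matrix entry A_xy
    return (shared + a_xy) / (min(len(neighbors[x]), len(neighbors[y])) + 1 - a_xy)

# In a triangle a-b-c, every edge gets the maximum similarity of 1.0:
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(tom_similarity("a", "b", triangle))  # 1.0
```

Note that the similarity lies in [0, 1], matching the capacity range reported in Section 3.2.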

Drug repurposing as a maximum flow problem: After mapping the drug targets and risk genes, we converted the drug repurposing problem into a maximum flow problem: we (1) created a dummy node SDN (i.e., the source of the network), which was connected to all the FDA\_DT; (2) created another dummy node DDN (i.e., the destination of the network), which was connected to all the risk genes; and (3) assigned a flow capacity (i.e., weight) using Equation (1) to each of the connections in the network. Flows in the maximum flow network follow these rules: (1) the input flow is equal to the output flow for any node except the source and destination nodes; (2) for any edge (*e*) in the network, 0 ≤ flow(*e*) ≤ Capacity(*e*); and (3) the total flow out of the source node is equal to the total flow into the destination node.

However, the connections from the dummy source to the candidate drug targets will have a dummy capacity. Each incoming edge from the dummy source node to a protein (drug target) has a capacity equal to the sum of the capacities of the outgoing edges from that protein (drug targets). Similarly, the connections from the risk genes to the dummy sink node have dummy capacities. Each outgoing edge from a risk gene to the sink node has a capacity equal to the sum of the capacities of the incoming edges to that risk gene. At this point, we had the network named MaxNet (Figure 3) to run the maximum flow algorithm.
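The MaxNet construction above can be sketched as follows (a minimal illustration with our own function and data-structure choices; the paper's actual implementation is not shown):

```python
def build_maxnet(weighted_ppi, drug_targets, risk_genes):
    """Wrap a weighted PPI network with dummy source/sink nodes ("MaxNet").

    weighted_ppi: dict {(u, v): w} of directed edges weighted by TOMSimilarity.
    drug_targets: FDA-approved drug target nodes (attached to the source SDN).
    risk_genes: disease risk gene nodes (attached to the sink DDN).
    Returns a capacity dict usable by any max-flow solver.
    """
    capacity = dict(weighted_ppi)
    for t in drug_targets:
        # Dummy source edge: capacity equals the sum of t's outgoing
        # capacities, so the SDN->t edge never limits the flow.
        capacity[("SDN", t)] = sum(w for (u, _), w in weighted_ppi.items() if u == t)
    for g in risk_genes:
        # Dummy sink edge: capacity equals the sum of g's incoming capacities.
        capacity[(g, "DDN")] = sum(w for (_, v), w in weighted_ppi.items() if v == g)
    return capacity
```

With this wrapper, the multi-source, multi-sink problem of Section 3.3 reduces to a standard single-source, single-sink maximum flow instance.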

**Figure 3.** An example of our MaxNet.

#### 2.2.2. Push-Relabel Maximum Flow Algorithm

We used the Push-Relabel maximum flow algorithm [28] on the MaxNet (Figure 3) to maximize the amount of flow passed from the FDA-approved drug targets to the risk genes. Algorithm 1 (revised from [29]) shows the Push-Relabel maximum flow algorithm, which works with one vertex at a time. Every vertex is associated with two variables: height and excess flow. A vertex can send flow only to a lower-height vertex. The excess flow of a vertex is the difference between the total in-flow and out-flow of that vertex. Furthermore, each edge is associated with two variables: flow (i.e., the current flow through this edge) and capacity (i.e., the maximum flow we can send through this edge). The algorithm sends flow (the PUSH operation) from a node S to an adjacent node D when the excess flow of S is not zero and the height of D is less than the height of S. If there is no adjacent node of S with a lesser height, the algorithm raises the height of S (the RELABEL operation) to the minimum height of its adjacent nodes plus 1.
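A minimal generic Push-Relabel implementation following this description might look as below (a sketch, not the authors' code; it uses a plain active-node list rather than the FIFO or highest-label heuristics of optimized implementations):

```python
from collections import defaultdict

def push_relabel_max_flow(capacity, source, sink):
    """Generic Push-Relabel maximum flow.

    capacity: dict {(u, v): c} of directed edge capacities.
    Returns the maximum flow value from source to sink.
    """
    # Residual capacities; reverse edges start at 0 via defaultdict.
    residual = defaultdict(float)
    adj = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)

    nodes = set(adj)
    height = {n: 0 for n in nodes}
    excess = {n: 0.0 for n in nodes}
    height[source] = len(nodes)  # source starts at height |V|

    # Initial preflow: saturate every edge leaving the source.
    for v in adj[source]:
        flow = residual[(source, v)]
        if flow > 0:
            residual[(source, v)] -= flow
            residual[(v, source)] += flow
            excess[v] += flow
            excess[source] -= flow

    active = [n for n in nodes if n not in (source, sink) and excess[n] > 0]
    while active:
        u = active.pop()
        for v in adj[u]:
            # PUSH: u has excess and the edge goes downhill with spare capacity.
            if residual[(u, v)] > 0 and height[u] == height[v] + 1:
                delta = min(excess[u], residual[(u, v)])
                residual[(u, v)] -= delta
                residual[(v, u)] += delta
                excess[u] -= delta
                excess[v] += delta
                if v not in (source, sink) and v not in active:
                    active.append(v)
                if excess[u] == 0:
                    break
        if excess[u] > 0:
            # RELABEL: lift u to one above its lowest residual neighbor.
            height[u] = 1 + min(height[v] for v in adj[u] if residual[(u, v)] > 0)
            active.append(u)

    return excess[sink]
```

The per-target flow values needed for the ranking step would come from the final residual graph; this sketch returns only the total flow into the sink.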


#### 2.2.3. Drug Repurposing from Maximum Flow Values

After applying the Push-Relabel maximum flow algorithm to our MaxNet network, we sorted all the FDA drug targets into a list LDTs according to their flow value to the risk genes (in descending order). Then, we used this sorted list LDTs of the DTs to sort the FDA-approved drugs into a list Ldrugs using the ~13,000 DTIs collected from DrugBank [24]. Hence, according to our hypothesis, the top drugs in Ldrugs are the most prominent candidates for reuse to treat the given disease associated with its risk genes.
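The ranking step can be sketched as follows; note that the paper does not state how a drug with several targets inherits their flow values, so this illustration assumes a drug is scored by the highest flow among its targets (our assumption, with hypothetical names):

```python
def rank_drugs(flow_by_target, dti):
    """Rank drugs by the flow values of their targets.

    flow_by_target: dict {target: outgoing max-flow value} (the sorted LDTs).
    dti: iterable of (drug, target) interaction pairs from DrugBank.
    Returns drugs sorted in descending order of their best target's flow
    (assumed scoring rule: a drug inherits the top flow of any of its targets).
    """
    best = {}
    for drug, target in dti:
        f = flow_by_target.get(target, 0.0)
        if f > best.get(drug, -1.0):
            best[drug] = f
    return sorted(best, key=lambda d: -best[d])

flows = {"T1": 5.0, "T2": 2.0}
dti = [("D1", "T1"), ("D2", "T2"), ("D3", "T1"), ("D3", "T2")]
print(rank_drugs(flows, dti))  # ['D1', 'D3', 'D2']
```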

The whole analysis pipeline of the maximum flow-based drug repurposing is summarized in Algorithm 2.


**Input:** PPI = all the PPIs; FDA\_DT = all the FDA-approved DTs in the PPIs network; DTI = DTIs for FDA\_DT; RG = risk genes. **Output:** CD = candidate drugs ranked for repurposing.

1. FOR i = 1 to length[PPI]:
    a. Calculate the flow capacity of the edge using Equation (1): C[i] = TOMSimilarity(PPI[i])
2. CREATE two dummy nodes: a source dummy node SDN and a destination dummy node DDN
3. FOR i = 1 to length[FDA\_DT]:
    a. index = length[PPI] + i
    b. CONNECT SDN to FDA\_DT[i] and add this interaction as PPI[index]
    c. C[index] = sum of the capacities of the outgoing edges from FDA\_DT[i]
4. FOR i = 1 to length[RG]:
    a. index = length[PPI] + length[FDA\_DT] + i
    b. CONNECT RG[i] to DDN and add this interaction as PPI[index]
    c. C[index] = sum of the capacities of the incoming edges to RG[i]
5. Compute the outgoing flow value of each node in PPI: Push-Relabel\_MaximumFlow\_Algorithm(PPI, C, SDN, DDN)
6. prioritized\_DTs = sort the nodes in PPI in decreasing order of their outgoing flows
7. CD = sort the drugs in DTI using prioritized\_DTs

#### **3. Experimental Results**

#### *3.1. Mapping Drug Targets and Disease-Specific Risk Genes to the PPIs Network*

First, we collected the PPI network. Since this is an unweighted network, we calculated the topological overlap similarity (TOMSimilarity) to assign weights to the edges. These weights were used as the capacities of the flow through the edges. In this PPI network, we had 1926 FDA-approved DTs. Next, we mapped the disease-specific risk genes to this PPI network. The PPIs network contained 155 of the 289 breast cancer RGs identified by Baxter et al. [25] and Wu et al. [26]. It also had 565 of the 811 IBD risk genes prioritized by Katrina et al. [27]. The PPI network also contained 118 of the 156 COPD risk genes identified by Sakornsakolpat et al. [6]. Table 1 shows several statistical properties of the PPI network. In Table 1, transitivity refers to the probability that adjacent nodes are interconnected; it provides an intuition about the clusters in the network. Of note, total triangles is the total number of triangles formed by any three nodes in the graph. In addition, we show the PPI network's degree distribution in Figure 4, which indicates that only a few nodes in the PPIs network have a high number of neighbors. This means the PPI network has a small number of hubs.
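Statistics of this kind can be computed from an edge list with a short sketch (our own helper; transitivity here is the standard ratio of closed neighbor pairs, i.e., three times the triangle count, to all connected triads):

```python
from collections import defaultdict

def network_stats(edges):
    """Degree distribution and global transitivity of an undirected network.

    edges: iterable of (u, v) pairs.
    Returns (degrees dict, transitivity).
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degrees = {n: len(adj[n]) for n in adj}
    # Connected triads: each node with degree d centers d*(d-1)/2 of them.
    triads = sum(d * (d - 1) // 2 for d in degrees.values())
    # Closed triads: neighbor pairs that are themselves adjacent
    # (each triangle is counted once per vertex, i.e., three times).
    closed = 0
    for n in adj:
        nbrs = list(adj[n])
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                if nbrs[j] in adj[nbrs[i]]:
                    closed += 1
    return degrees, (closed / triads if triads else 0.0)
```

For a large network such as the 140,899-edge PPI network, an optimized library would be preferable, but the definitions are the same.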

**Table 1.** Statistical properties of the PPIs network.


**Figure 4.** Degree distribution of the PPIs network.

#### *3.2. Weights of the Interactions in PPIs Network*

We calculated topological overlap similarity (TOMSimilarity) to assign weights on the edges of the unweighted PPI network. The values of these edge weights ranged from 0 to 1. We used these edge weights as flow capacity for each connection during maximum flow implementation with Algorithm 1.

#### *3.3. Formulating Drug Repurposing as a Maximum Flow Network*

FDA-approved drug targets are the sources of the network, while risk genes are its destinations. Hence, we needed to convert this multiple-source, multiple-destination network into a single-source, single-destination network. To do this, we created a dummy source node and connected it to the 1926 DTs. Similarly, we created a dummy destination node and connected only the disease-specific risk genes to this sink node. As a result, there were no incoming arcs to the source node and no outgoing arcs from the destination node. We calculated the sum of the capacities of the outgoing arcs from each drug target node and used this sum as the capacity of the arc from the dummy source node to that drug target node. Likewise, we calculated the sum of the capacities of the incoming arcs to each risk gene node and used this sum as the capacity of the arc from that risk gene node to the dummy destination node. We call this network the MaxNet.
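A minimal sketch of this MaxNet construction, assuming a dict-of-dicts capacity map (the node names "S" and "T" are illustrative assumptions):

```python
def build_maxnet(cap, drug_targets, risk_genes):
    """cap: {u: {v: capacity}} weighted PPI edges (assumed symmetric).
    Adds a dummy source 'S' wired to every drug target and a dummy
    sink 'T' fed by every risk gene, as described in the text."""
    net = {u: dict(nbrs) for u, nbrs in cap.items()}
    net["S"], net["T"] = {}, {}
    for t in drug_targets:
        # source->target capacity = sum of the target's outgoing capacities
        net["S"][t] = sum(cap[t].values())
    for g in risk_genes:
        # gene->sink capacity = sum of the gene's incoming capacities
        net.setdefault(g, {})["T"] = sum(nbrs.get(g, 0) for nbrs in cap.values())
    return net
```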

#### *3.4. Drug Repurposing for Breast Cancer, IBD, and COPD*

We created three MaxNets (MaxNet\_BC, MaxNet\_IBD, and MaxNet\_COPD) for breast cancer, IBD, and COPD RGs, respectively. For all three MaxNets, the dummy source node was connected to the 1926 FDA-approved DTs. However, our PPI network contained only 155 breast cancer RGs, 565 IBD RGs, and 118 COPD RGs. Therefore, we connected the 155 breast cancer RGs to the dummy destination node in MaxNet\_BC, the 565 IBD RGs to the dummy destination node in MaxNet\_IBD, and the 118 COPD RGs to the dummy destination node in MaxNet\_COPD.

We ran the Push-Relabel maximum flow algorithm in all three MaxNets to get the maximum flow values for each node from the dummy source to the dummy destination. First, we extracted three sorted lists of the targets (FDA-approved) based on their outgoing flows from the MaxNet\_BC, MaxNet\_IBD, and MaxNet\_COPD in descending order. Then, we used these sorted lists of targets to sort the drug list using a drug-target interaction network for breast cancer, IBD, and COPD. According to our hypothesis, the top drug in each of these sorted drug lists has the maximum potential to be used as a candidate drug for the treatment of breast cancer, IBD, and COPD, respectively.
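The flow step can be sketched as follows. For brevity, this sketch uses a BFS augmenting-path (Edmonds-Karp style) max-flow rather than Push-Relabel; both compute the same maximum flow, and each target's outgoing flow is read off its dummy-source arc:

```python
from collections import deque

def max_flow_outflows(net, targets, s="S", t="T"):
    """net: {u: {v: capacity}} directed MaxNet. Returns each target's
    outgoing flow, i.e. the flow pushed along the S->target arc."""
    # residual capacities, with zero-capacity reverse arcs
    res = {u: dict(nbrs) for u, nbrs in net.items()}
    for u in list(res):
        for v in list(res[u]):
            res.setdefault(v, {}).setdefault(u, 0)
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:           # BFS for an augmenting path
            u = q.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t                        # recover the path and its bottleneck
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        f = min(res[u][v] for u, v in path)
        for u, v in path:                      # push the bottleneck flow
            res[u][v] -= f
            res[v][u] += f
    return {x: net[s][x] - res[s][x] for x in targets}
```

Sorting the returned dictionary by value in descending order gives the prioritized target list described above.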

#### *3.5. Performance Evaluation*

We performed a comprehensive literature review to validate our top five repurposed candidates for breast cancer, IBD, and COPD as shown in Tables 2–4, respectively.


**Table 2.** The top five repurposed drugs for breast cancer.

**Table 3.** The top five repurposed drugs for IBD.



**Table 4.** The top five repurposed drugs for COPD.

In addition, we have shown the top 10 prioritized repurposed drugs in the Supplementary Tables S1–S3 for each of these diseases.

#### *3.6. Performance Comparison with Other Methods*

We used the same datasets to compare the performance of our maximum flow-based drug prioritization with baseline methods: degree, betweenness centrality, closeness centrality, random walk, and page rank (Table 5). Degree centrality is the number of links incident to a node and ranks nodes by this count. Closeness centrality of a node is the normalized inverse of the sum of its geodesic (shortest-path) distances to all other nodes in the network. Finally, the betweenness centrality of a node is the number of shortest paths that pass through it. First, we used MATLAB functions to calculate degree, closeness, and betweenness centrality from the PPI network for each disease of interest (breast cancer, IBD, and COPD), and sorted each resulting list of targets in descending order. We then obtained a sorted list of candidate drugs using these sorted targets and a drug-target interaction dataset. Next, we used Python implementations of random walk [41] and page rank [42] to calculate the importance of each target associated with breast cancer, IBD, and COPD in the PPI network. Finally, we used the sorted random walk [41] and page rank [42] lists of targets (descending order) to identify potential drug repurposing candidates from the drug-target interaction network collected from the DrugBank database.
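Since the study used MATLAB functions for the centrality baselines, the following is only an illustrative pure-Python equivalent of two of them (degree and normalized closeness centrality) on an unweighted graph:

```python
from collections import deque

def degree_centrality(adj):
    """adj: {node: set(neighbours)} undirected graph."""
    return {u: len(nbrs) for u, nbrs in adj.items()}

def closeness_centrality(adj):
    """Normalized closeness: (n - 1) divided by the sum of BFS
    shortest-path distances to all other (reachable) nodes."""
    out, n = {}, len(adj)
    for src in adj:
        dist, q = {src: 0}, deque([src])
        while q:                       # BFS from src
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total = sum(dist.values())
        out[src] = (n - 1) / total if total else 0.0
    return out
```

Ranking targets then amounts to sorting either dictionary by value in descending order, exactly as with the max-flow outflows.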

**Table 5.** Number of confirmed disease-specific candidates by the baseline approaches for drug repurposing in the list of top five candidate drugs.


#### **4. Discussion**

Traditional machine learning methods, such as naive Bayes, support vector machines, and, more recently, deep neural networks, have shown their effectiveness for drug discovery. Zhao et al. proposed a method that uses drug-induced expression profiles to predict the sign of a disease in psychiatry [43]. Saberian et al. [44] introduced a framework that takes the anti-similarity between drugs and a disease as input to train a model that can predict new uses of a drug beyond its primary indications. However, researchers have concerns about using conventional machine learning techniques for this purpose because of the background noise and the high-dimensional nature of biological data [45]. Hence, Cheng et al. [17] used chemical structure together with the genome sequence to check drug and protein resemblance, anticipating that similar drugs might share identical drug targets for a disease. However, they did not find any helpful result from this similarity checking among the drugs, and concluded that chemical structure alone cannot serve as a parameter to identify similar drugs or proteins. Estrada et al. [46] also used a biological network's global measures, such as closeness/betweenness centrality, to identify drug targets, considering a node a drug target if its closeness/betweenness centrality value is higher than that of the other nodes. These measures are based on the shortest paths in the network. In addition, random walk [41] and page rank (the algorithm Google uses for its search engine) [42] can be used to extract such global measures and identify potential new drug targets. In this study, we adopted a maximum flow-based approach similar to Yeh et al. [18] and Melak et al. [22] to prioritize FDA-approved drugs for repurposing against breast cancer, IBD, and COPD.

We used a PPI network [23] to conduct our experiments. The investigators mentioned that it does not contain any interactions estimated from gene expression data. Its interactions fall into the following categories: protein–protein interactions (most of them fall into this category), regulatory interactions, protein database interactions, and signaling interactions [47]. However, this PPI network is not weighted; therefore, we converted it to a weighted network using TOMSimilarity. We used TOMSimilarity because Langfelder et al. [48] showed its effectiveness as a highly robust measure of network interconnectedness (proximity) for the hierarchical clustering of biological data. TOMSimilarity calculates the topological similarity between two connected proteins (i.e., genes) using an adjacency matrix. Then, we applied the Push-Relabel algorithm to obtain each node's importance based on its outflow. This algorithm works locally rather than searching the entire residual graph (the graph that indicates whether more flow can be sent from the source to the destination of the network) for an augmenting path along which to send flow.

The primary use of our most promising candidate drug, "Guanidine" (Table 2), is to treat muscle weakness caused by Eaton-Lambert syndrome. In 2009, Meruling et al. [30] showed that, at 0.5 microM, a dextran aminoguanidine conjugate killed more than 95% of breast cancer cells, compared to 25% for Adriamycin. The second candidate, "Phenethyl Isothiocyanate" (PEITC) (Table 2), with unique specificity, has shown promising results for HER2 breast cancer patients. "Caffeine" (Table 2) is primarily used to restore mental alertness when fatigue or drowsiness is present and to treat post-dural lumbar puncture headaches; however, Pantziarka et al. [32] confirmed that caffeine could be used to treat breast cancer. The fourth candidate, "Tamoxifen," is already primarily used for breast cancer. Table 2 summarizes the top five candidate drugs identified by our proposed framework.

According to our proposed framework, the most promising candidate repurposed drug for IBD is "Dasatinib" (Table 3). It has been shown that Dasatinib helps decrease inflammation in a rodent model of colitis [34] for the ulcerative colitis type of IBD; the study therefore concluded that Dasatinib could be a potential candidate for ulcerative colitis treatment. Our second IBD candidate is "Phenethyl Isothiocyanate" (PEITC) (Table 3). Dey et al. [35] confirmed PEITC essential oil, which contains more than 95% PEITC, as a potential treatment for ulcerative colitis patients. The third candidate, "Adenosine" (Table 3), acts as a modulator of inflammation (including in Crohn's disease and ulcerative colitis) in both humans and animals [36]. Our last candidate, "Glutamic Acid" (Table 3), was confirmed by [37] as an amino acid adjuvant for the treatment of the ulcerative colitis type of IBD. Furthermore, the investigators showed that microinjection of this amino acid into the paraventricular nucleus significantly improved anti-oxidation levels in rats with ulcerative colitis. This outcome suggests that glutamic acid is a potential candidate for therapeutic regulation of the paraventricular nucleus in ulcerative colitis, although the doses of glutamic acid may differ for naturally occurring IBD.

The primary use of Phenethyl Isothiocyanate (PEITC) is the treatment of lung cancer [38]. Nonetheless, our proposed framework considers this drug the most favorable contender in our top five candidate list (Table 4) for COPD. The next candidate, "Minocycline" (Table 4), is effective as an addition to cyclophosphamide treatment for lung cancer [39]. The third candidate, "Pseudoephedrine" (Table 4), can also be used for COPD-related conditions, such as the treatment of nasal and sinus congestion caused by a breathing illness (e.g., bronchitis) [49,50]. Finally, the last candidate, "NADH" (Table 4), improved COPD in a trial [40], pointing to a probable helpful treatment for COPD.

From Table 5, it is evident that our proposed framework outperformed the baseline methods (degree centrality, betweenness and closeness centrality, random walk [41], and page rank [42]) in prioritizing drug candidates for disease-specific drug repurposing. A literature review-based validation confirmed that our proposed framework correctly prioritized four of the top five candidate drugs for breast cancer, IBD, and COPD, respectively. In contrast, the degree and betweenness centrality methods have only one and two confirmed drug candidates, respectively, and only for COPD. Closeness centrality has two and one confirmed drug candidates as repurposed drugs for breast cancer and IBD, respectively. Random walk has zero, two, and two confirmed drugs in its predicted top five for breast cancer, IBD, and COPD, respectively, while page rank listed two confirmed drugs for each disease in its top five.

The above literature review-based comparison suggests that our proposed framework can be used for novel drug discovery and drug repurposing. Therefore, it may be promising to use the proposed drug repurposing framework to prioritize candidate disease-specific repurposed drugs and disease-specific primary drugs.

#### **5. Conclusions**

This study formulates drug repurposing for a specific disease as a maximum flow problem. We used a human interactome network and a set of FDA-approved drug targets along with different disease-specific (breast cancer, IBD, and COPD) risk genes to perform our experiments. We hypothesized that our proposed framework would identify a set of FDA-approved drugs that can be repurposed to treat breast cancer, IBD, and COPD. Experimental results showed that it identified a prioritized list of drug targets and associated drugs that can be reused to treat these diseases. Furthermore, our proposed framework used the natural flow of the network to identify drugs that strongly influence the disease genes without any prior knowledge. Finally, we performed a comprehensive literature review to validate the framework's performance. This validation shows that our proposed framework outperformed the other baseline methods in the total number of confirmed repurposed drugs, and it suggests that our drug repurposing approach can also be used for novel drug discovery.

Future works of this study include experiments and clinical trials with our prioritized lists of candidate drugs. These approaches will confirm whether our candidate drugs have the potential to treat breast cancer, IBD, and COPD, respectively.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3 390/life11111115/s1, Table S1: The top 10 prioritized repurposed drugs for breast cancer, Table S2: The top 10 prioritized repurposed drugs for IBD, Table S3: The top 10 prioritized repurposed drugs for COPD.

**Author Contributions:** Conceptualization, M.M.I. and P.H.; methodology, M.M.I.; software, M.M.I.; validation, M.M.I.; formal analysis, M.M.I.; investigation, M.M.I.; resources, P.H.; data curation, P.H.; writing—original draft preparation, M.M.I.; writing—review and editing, P.H.; visualization, M.M.I.; supervision, P.H. and Y.W.; project administration, P.H.; funding acquisition, P.H. and Y.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is partially funded by a grant from the University of Manitoba Collaborative Research Program and a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) (EGP-543968-2019).

**Institutional Review Board Statement:** Ethical review and approval were waived for this study, because all the data used in the study are publicly available.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All the datasets used for the experimental analysis are publicly available.

**Acknowledgments:** We thank Mark Alexiuk very much for his excellent supervision of Md. Mohaiminul Islam, and the kind support from Sightline Innovation on the project.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Electrocardiogram Quality Assessment with a Generalized Deep Learning Model Assisted by Conditional Generative Adversarial Networks**

**Xue Zhou 1, Xin Zhu 1,\*, Keijiro Nakamura 2,\* and Mahito Noro <sup>3</sup>**


**Abstract:** The electrocardiogram (ECG) is widely used for cardiovascular disease diagnosis and daily health monitoring. Before ECG analysis, ECG quality screening is an essential but time-consuming and experience-dependent task for technicians. An automatic ECG quality assessment method can reduce unnecessary time loss and help cardiologists perform diagnosis. This study aims to develop an automatic quality assessment system to screen ECGs qualified for interpretation. The proposed system consists of data augmentation and quality assessment parts. For data augmentation, we train a conditional generative adversarial networks model to obtain an ECG segment generator and thus increase the amount of training data. We then pre-train a deep quality assessment model on a training dataset composed of real and generated ECG. Finally, we fine-tune the proposed model using real ECG and validate it on two different datasets composed of real ECG. The proposed system generalizes across the two validation datasets, with accuracies of 97.1% and 96.4%, respectively. The proposed method outperforms a shallow neural network model, as well as deep neural network models that were not pre-trained with generated ECG. The proposed system demonstrates improved performance in ECG quality assessment, and it has the potential to be an initial ECG quality screening tool in clinical practice.

**Keywords:** data augmentation; deep learning; ECG quality assessment

#### **1. Introduction**

The electrocardiogram (ECG) is widely used for cardiovascular disease diagnosis, treatment, and daily personal health monitoring via wearable devices [1,2]. ECG signals are expected to have sufficient signal quality to extract temporal and morphological information for further analysis, such as heart rate variability (HRV) analysis and arrhythmia classification [3,4]. Low-quality ECG signals owing to baseline wander, muscle artifacts, and power-line interference may cause false ECG arrhythmia alarms [5]. Additionally, ECG collected by wearable devices may include severe electrode motion artifacts, flat lines, and huge impulses due to lead-off. In particular, electrode motion artifacts may be treated as ectopic beats and cannot be removed by simple filters; this is one of the major factors that cause alarm fatigue [6–8]. In clinical practice, low-quality ECG signals are expected to be removed through manual screening by technicians before disease diagnosis. However, manual quality screening is time-consuming, laborious, and experience-dependent. Therefore, a reliable automatic ECG signal quality assessment system would be of great value to ECG technicians and cardiologists.

To date, many studies have been conducted on ECG quality assessment. PhysioNet organized a challenge in cardiology in 2011 to classify 12-lead ECG signals as acceptable

**Citation:** Zhou, X.; Zhu, X.; Nakamura, K.; Noro, M. Electrocardiogram Quality Assessment with a Generalized Deep Learning Model Assisted by Conditional Generative Adversarial Networks. *Life* **2021**, *11*, 1013. https://doi.org/10.3390/life11101013

Academic Editors: Md. Altaf-Ul-Amin, Shigehiko Kanaya, Naoaki Ono and Ming Huang

Received: 31 August 2021 Accepted: 21 September 2021 Published: 26 September 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

or unacceptable [6,9]. Quesnel et al. evaluated the quality of ECG signals contaminated with various levels of motion artifacts. They segmented PQRST complexes, which were aligned and averaged to form an estimate of the true PQRST complex; a signal-to-noise ratio (SNR) was then estimated by comparing each PQRST complex to the average one. In this way, they obtained a 0.89 Pearson correlation coefficient between estimated and real SNRs [10]. Machine learning techniques have also been applied to ECG quality assessment. Redmond et al. used a Parzen window classifier to classify noisy and clean ECG, obtaining 82% and 78.7% accuracies using human and automatic annotation features, respectively [11]. Shahriari et al. obtained ECG signals from an ECG alarm study at the University of California, San Francisco (UCSF) and from the PhysioNet Computing in Cardiology Challenge 2011, and developed an image-based ECG quality assessment method. They first computed a structural similarity measure (SSIM) and selected representative ECG images from the training dataset as templates. The SSIM values between each ECG image and all the templates were used as features and input into a linear discriminant analysis classifier, which achieved 93.1% and 82.5% accuracies on the UCSF and Cardiology Challenge 2011 databases, respectively [12]. Zhao et al. manually extracted six features, such as R peaks and the power spectrum distribution of QRS complexes, to build fuzzy vectors, and used the fuzzy comprehensive evaluation method as a feature analysis module. Their model demonstrated 94.67% accuracy, 90.33% recall, and 93.00% specificity when trained and tested on data from the PhysioNet Computing in Cardiology Challenges 2011 and 2017 [13]. In 2019, Moeyersons et al. used manually labeled data from a sleep study collected by the University Hospital Leuven, the PhysioNet Computing in Cardiology Challenge 2017, and the MIT-BIH Noise Stress Test Database. They segmented the ECG signal into 5 s episodes after filtering; each episode was characterized by an autocorrelation function, from which three features were extracted and fed to a RUSBoost classifier. For the Challenge 2017 and Sleep Study datasets, they obtained recalls of 79.4% and 96.6%, specificities of 78.7% and 84.8%, and areas under the curve of 0.928 and 0.970, respectively [14]. More recently, Fu et al. assessed the quality of wearable ECG signals collected via Lenovo H3 devices. They compared three machine learning algorithms with manually extracted features: the support vector machine (SVM), least-squares SVM (LS-SVM), and long short-term memory (LSTM). The LSTM model achieved the best performance with 95.5% accuracy [15].

The above studies usually follow three procedures. The first is signal preprocessing, such as filtering. Feature extraction is then the most important step and directly affects the model's performance; however, no gold standard exists to identify necessary, effective, or redundant features, so feature extraction usually depends on the experience of researchers. The final step is model development, using machine learning techniques or designing decision rules by setting thresholds or computing related statistical values based on the extracted features. However, feature extraction and the design of decision rules or thresholds are experience-dependent, and it is hard to cover or identify all significant features, such as QRS-related information [11,13], R-peak-related information [13], or autocorrelation function-related features [14]. In addition, significant features may vary with decision strategies and models. Furthermore, features manually extracted from a certain dataset may not generalize to other datasets. For example, Shahriari et al. manually extracted the same features on two datasets, the UCSF and Cardiology Challenge 2011 databases, but their model performed considerably differently on the two datasets (UCSF vs. Cardiology Challenge 2011: accuracy: 93.1% vs. 82.5%, sensitivity: 96.3% vs. 83.9%, specificity: 90.0% vs. 77.7%) [12]. Manually extracting features that generalize across datasets is usually impractical in view of costly and limited medical databases.

In this study, the proposed ECG quality assessment system consists of two stages: data augmentation using conditional generative adversarial networks, and quality assessment using deep neural networks. The goal of data augmentation is to generate versatile ECG to improve the training efficiency. The proposed system automatically extracts features from raw ECG signals and makes the final decisions. It thus avoids relying on experience- and database-specific manual features for model development, and on thresholds or rules for decision-making; therefore, it may have better generalization ability. The system demonstrates improved performance on two different datasets, and outperforms a shallow neural network model as well as a deep neural network model trained without data augmentation. All the experiments were conducted using MATLAB R2019b [16] and TensorFlow 2.3.0 [17].

#### **2. Materials and Methods**

#### *2.1. Datasets Introduction and Construction*

This study uses data from PhysioNet Computing in the Cardiology Challenge 2017 (PCCC2017) database [18], TELE ECG database [19], MIT-BIH arrhythmia database (MIT-BIHA) [20], and MIT-BIH normal sinus rhythm database (MIT-BIHNSR) [21].

PCCC2017 aims to classify single-lead ECG recordings as sinus rhythm, atrial fibrillation (AF), an alternative rhythm, or too noisy. All the recordings last 9 to 60 s, sampled at 300 Hz. Each recording was resampled to 500 Hz and segmented into segments of 10 s duration with 2 s and 4 s overlaps, respectively, to increase the number of unacceptable segments (the noisy category). In total, we selected 555 unacceptable and 2618 acceptable ECG segments from this dataset, each of 10 s duration.
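The resample-and-segment step can be sketched as follows (a minimal sliding-window illustration; the resampling itself is omitted, and the function name and defaults are our assumptions):

```python
def segment(signal, fs=500, win_s=10, overlap_s=2):
    """Slide a win_s-second window with overlap_s seconds of overlap
    over a signal (a list of samples already resampled to fs Hz)."""
    win = win_s * fs                       # window length in samples
    step = (win_s - overlap_s) * fs        # hop size in samples
    return [signal[i:i + win]
            for i in range(0, len(signal) - win + 1, step)]
```

Increasing the overlap (e.g., 4 s instead of 2 s) raises the number of segments extracted from the same recording, which is how the noisy category was enlarged.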

The TELE ECG database was initially recorded by Redmond et al. [11] from 288 home-dwelling patients. Each ECG recording was collected with a single lead and sampled at 500 Hz. Khamis et al. regarded this database as poor-quality telehealth ECG [22]; in this study, all of its recordings are likewise marked as noisy ECG signals. After signal segmentation, there are 734 unacceptable 10 s ECG segments.

The PCCC2017 and TELE ECG databases officially provide quality labels for their ECG segments, and the two databases were combined into a new dataset named COMD. COMD consists of 1289 unacceptable and 2618 acceptable ECG segments in total.

MIT-BIHA contains 48 two-channel ECG recordings, each with a duration of 30 min, sampled at 360 Hz. MIT-BIHNSR includes 18 long-term ECG recordings, all sampled at 128 Hz. The MIT-BIH noise stress test database (MIT-BIHNST) was created by taking two clean ECG recordings (118 and 119) from MIT-BIHA and adding noise recordings to them [7]. The noise recordings available in MIT-BIHNST include baseline wander (bm), muscle artifact (ma), and electrode motion artifact (em). Inspired by the construction of MIT-BIHNST, the original ECG recordings in MIT-BIHA and MIT-BIHNSR are likewise regarded as clean (acceptable) signals in this study, and the same noise-adding rules as MIT-BIHNST are followed to recreate a new noise-included dataset (RECD for short) using a WFDB software package [23]. The recreated ECG recordings include severe "ma", "bm", or "em" noises provided by MIT-BIHNST, or a Gaussian noise and a power interference simulated in MATLAB. The detailed noise-adding rules are listed in Table 1, where "g" means Gaussian noise generated by the MATLAB function "awgn", and "p" means a power interference simulated by a sine function with 60 Hz frequency. To increase data diversity, RECD is created by combining different noise types. All the ECG recordings in RECD are then resampled at 500 Hz. After that, there were 7557 unacceptable and 20,114 acceptable ECG segments available.
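As a hedged sketch of the simulated part of these rules, the snippet below adds white Gaussian noise at a target SNR (mimicking what MATLAB's `awgn` does) plus a 60 Hz sinusoidal power interference; the SNR value and the 0.1 interference amplitude are illustrative assumptions, not values from Table 1:

```python
import math
import random

def add_noise(signal, fs=500, snr_db=6, powerline=True, seed=0):
    """Add white Gaussian noise at snr_db dB relative to the signal
    power, plus an optional 60 Hz sinusoidal power interference."""
    rng = random.Random(seed)
    p_sig = sum(x * x for x in signal) / len(signal)     # signal power
    sigma = math.sqrt(p_sig / (10 ** (snr_db / 10)))     # noise std for target SNR
    out = []
    for n, x in enumerate(signal):
        y = x + rng.gauss(0, sigma)
        if powerline:
            # 60 Hz sine sampled at fs Hz; amplitude 0.1 is an assumption
            y += 0.1 * math.sin(2 * math.pi * 60 * n / fs)
        out.append(y)
    return out
```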

ECG segments in COMD were used to train a conditional generative adversarial networks (CGANs) model for ECG segment generation at first. Then, generated unacceptable and real acceptable ECG data were used to pre-train a quality assessment model. Finally, training sets in COMD and RECD were both used for fine-tuning the assessment model, and testing sets were used to test the model. The detailed usage of data is illustrated in Table 2.


**Table 1.** The noise-adding rules for dataset recreation.



#### *2.2. Methods*

2.2.1. Data Augmentation

Insufficient and imbalanced data may reduce the performance of deep learning models [24,25]. Thus, the first procedure of this work was to automatically generate unacceptable ECG segments to solve the data imbalance issue and perform data augmentation. Although traditional mathematical modeling methods can generate realistic heartbeats, the synthetic heartbeats' morphology lacks diversity or is even almost the same as that of the training data [26]. Recently, several studies have confirmed that generative adversarial networks (GANs) can generate real-like ECG segments and arrhythmias [26–31]. The proposed system is shown in Figure 1: a GANs model [32] is developed and trained on COMD to obtain an ECG generator (G'), and G' is used to generate unacceptable ECG segments of 10 s duration (G'(z|y = 1)). The generator and discriminator are abbreviated as G and D, respectively. Figure 2a shows the structure of the proposed CGANs model. The label information is used as the condition, and each label ("0" for acceptable and "1" for unacceptable) is assembled into an M1-element vector representation, which is one input of G. The other input of G is a random M2-element noise signal following the uniform distribution, with amplitude limited to the range −1 to 1. Here, M1 and M2 were determined to be 20 and 700, respectively, by trial and error. G mainly consists of two LSTM layers with 200 and 600 units, respectively. The main layers of D are two convolutional neural network (CNN) layers with 128 and 64 units, respectively; their kernel sizes are set to 10 and 5, and the alpha of the "LeakyReLU" activation layer is set to 0.2. The dropout rate is 0.3. The Dense layer has 32 units and uses "ReLU" as the activation function; the output layer uses "sigmoid". The Adam optimizer with a 0.0002 learning rate and a binary crossentropy loss function are applied to train the CGANs model. By trial and error, D is updated three times for each single update of G during training. After that, an ECG generator (G') is obtained from the CGANs model for unacceptable ECG segment generation.

**Figure 1.** The proposed ECG quality assessment system. It consists of two parts: data augmentation by CGANs and a quality assessment model.

#### 2.2.2. Quality Assessment

The generated unacceptable ECG segments and part of the real acceptable segments in RECD (5000 unacceptable and 5000 acceptable ECG segments in total) were used to pre-train the quality assessment model. The structure of the model is shown in Figure 2b; it consists of three branches: two CNN branches (branch1: left; branch2: middle) and an LSTM branch (branch3: right). For branch1, the numbers of filters in the two CNN layers are 128 and 32, with kernel sizes of 50 and 10, respectively; the dropout rate was set to 0.3, and the pooling size to 10. Branch2 has the same structure as branch1, but its two CNN layers use 64 and 16 filters with kernel sizes of 25 and 2, respectively; the dropout rate and pooling size are the same as in branch1. The numbers of units in the two LSTM layers of branch3 are 200 and 100, respectively. The Dense layer has 32 units with a "ReLU" activation function. A batch size of 64, an Adam optimizer with a 0.0002 learning rate, and a binary crossentropy loss function were applied for training. For model fine-tuning, the three branches were frozen and only the parameters in the final two layers (Dense and Output layers) were updated with real data.

**Figure 2.** Structures of the proposed models. (**a**) The structure of CGANs; (**b**) the structure of the assessment model.

Considering the limited number of data in COMD, each 10 s ECG segment was further segmented into 10 examples of 1 s duration to increase the number of input examples for training the assessment model. Because our aim was to assess the quality of each 10 s ECG segment, we conducted a post-processing procedure, as shown in Figure 3; the threshold for the quality assessment of an ECG segment was set to 3 by trial and error. After model pre-training, the COMD and RECD datasets were randomly split 10 times to obtain training sets, used for model fine-tuning, and testing sets, used to measure the average performance of the model. For a fair comparison, the same number of ECG segments in the COMD and RECD datasets was used for fine-tuning (3000 segments: 1000 unacceptable and 2000 acceptable), and the rest were used for testing.
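The post-processing step can be sketched as follows, assuming a 10 s segment is declared unacceptable when at least 3 of its ten 1 s examples are classified as unacceptable (the direction of the comparison is our assumption, not stated explicitly in the text):

```python
def segment_quality(example_labels, threshold=3):
    """example_labels: per-example predictions for one 10 s segment
    (1 = unacceptable, 0 = acceptable). Returns the segment label:
    1 (unacceptable) if at least `threshold` examples are flagged."""
    return 1 if sum(example_labels) >= threshold else 0
```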


**Figure 3.** Post-processing for quality assessment of ECG segment.

#### **3. Results**

#### *3.1. Data Augmentation*

Figure 4 illustrates the training curves of the CGANs and confirms their convergence. In the "Loss" plot of Figure 4a, the blue line "d\_real" is the loss of D when it is updated with real ECG segments, the orange line "d\_fake" is the loss of D when it is updated with generated segments, and the green line "g" represents the loss of G. T-distributed stochastic neighbor embedding (tSNE) [33] was used to map real and generated segments to a three-dimensional space for visualization, as shown in Figure 4b. tSNE maps similar segments to nearby points and dissimilar segments to distant points. Figure 5 visualizes several real segments and segments generated by G'.

**Figure 4.** Training process of CGANs (**a**) and visualization of ECG segment distribution (**b**).


**Figure 5.** Samples of real and generated ECG segments. (**a**,**b**) The generated unacceptable ECG segments; (**c**,**d**) the real unacceptable ECG segments from the derivation dataset.

#### *3.2. Quality Assessment*

The performance of the proposed quality assessment model was measured by three indexes: accuracy, sensitivity, and specificity, which were calculated from the numbers of true positives (*TP*), true negatives (*TN*), false positives (*FP*), and false negatives (*FN*), as follows.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \tag{1}$$

$$Sensitivity = \frac{TP}{TP + FN}, \tag{2}$$

$$Specificity = \frac{TN}{TN + FP}. \tag{3}$$
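These indexes follow directly from the confusion-matrix counts; a minimal sketch (the function name is our own):

```python
def quality_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity per Equations (1)-(3)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Illustrative counts only, not the paper's actual confusion matrix.
acc, sen, spe = quality_metrics(tp=90, tn=85, fp=5, fn=10)
print(round(acc, 3), round(sen, 3), round(spe, 3))
```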

Figure 6a,b shows the average performance of the assessment model. The specificities are 96.4% and 95.0%, the sensitivities are 98.6% and 99.1%, and the accuracies are 97.1% and 96.4% for COMD and RECD, respectively.

For comparison, several additional experiments were conducted, as listed in Table 3. A CGANs model for generating 1 s examples was developed. With it, the quality assessment model achieves 95.5% accuracy (acc), 94.5% sensitivity (sen), and 96.0% specificity (spe) on COMD, all lower than the performance of the proposed system. In addition to CGANs, a GANs model was also developed for data augmentation. That model failed to converge when trained to generate 10 s ECG segments, but it could generate 1 s ECG examples. Using the data generated by the GANs model, the quality assessment model shows an accuracy of 95.4%, a sensitivity of 99.3%, and a specificity of 93.4% on COMD, which do not match the performance of the proposed method.

Moreover, to demonstrate the necessity of training the quality assessment model on examples with a duration of 1 s, the model was also pre-trained directly on ECG segments of 10 s duration, with L2 regularization added to the CNN, LSTM, and Dense layers to alleviate overfitting; nevertheless, the model still performed differently on the training and testing sets. For COMD, the training accuracy was 95.8%, the sensitivity 91.2%, and the specificity 98.0%, while the testing accuracy, sensitivity, and specificity were 84.4%, 75.8%, and 88.2%, respectively. This suggests that training on 1 s examples may improve the generalization of the quality assessment model.

**Figure 6.** Performances of the proposed quality assessment model on COMD (**a**) and RECD (**b**).

As a further baseline without data augmentation, i.e., without pre-training the assessment model on generated ECG data, each 10 s ECG segment was split into 10 examples of 1 s duration to increase the number of examples, and these were used to develop the assessment model directly. Its performance declined on both COMD and RECD compared with the proposed system, as listed in Table 3.

In addition to data generation, downsampling is a possible way to assist in training a deep learning model [34]. Thus, a comparison with a previous quality assessment method [35] was conducted. The previous study downsampled ECG segments from 500 Hz to 50 Hz and developed a shallow neural network model to avoid overfitting. Compared with it, the accuracy of the proposed system increased by 1.3% and 2.6% on COMD and RECD, respectively.


**Table 3.** Performance of models for quality assessment.



#### **4. Discussion**

In this study, ECG data generated by CGANs benefited the ECG quality assessment task. Although downsampling is a way to assist in training deep learning models with a small training dataset, it may limit the complexity of deep neural network models and thus reduce the final performance. In this work, with the same training dataset, the deep neural network model pre-trained with generated ECG data obtained better performance (data augmentation by CGANs vs. downsampling: on COMD, 97.1% vs. 95.8% accuracy, 98.6% vs. 96.5% sensitivity, and 96.4% vs. 95.5% specificity; on RECD, 96.4% vs. 94.0% accuracy, 99.1% vs. 89.0% sensitivity, and 95.0% vs. 96.2% specificity). Finally, only two layers of the whole deep model needed to be retrained; in this way, the size of the required training set may be greatly reduced compared with training from scratch. This indicates that the GANs technique may effectively assist the training of deep neural network models for ECG-related decision-making, such as arrhythmia detection [28].

Traditional mathematical modeling methods are limited to synthesizing a normal, realistic heartbeat or ECG signals with generally the same morphology [36,37]. To generate ECG signals with arrhythmia, the traditional method needs to manually control the position parameters of the P, Q, R, S, or T waves [36] to enrich the morphology of heartbeats. In contrast, the GANs method can intrinsically generate ECG signals with larger diversity, which match real signals better [26]. The ECG signal is a temporal sequence, so generating ECG segments of longer duration is preferred. In this work, CGANs converged reliably when trained to generate 10 s ECG segments, whereas GANs failed to converge. This may be attributed to the fact that the discriminator in CGANs is required not only to distinguish generated from real ECG segments, but also to assign a correct label to each real segment, which imposes a stronger constraint on model training than GANs.

In this study, the proposed system assessed the quality of each 10 s ECG segment by consecutively analyzing the quality of 10 examples, each 1 s in length, which improved the reliability of the system. For example, for a 10 s ECG segment, a threshold of 3 was set in post-processing, as shown in Figure 3; that is, if more than 3 of the 10 consecutive 1 s examples were determined to be "unacceptable" by the model, the corresponding 10 s ECG segment was classified as "unacceptable". In this case, regardless of the quality of the remaining examples, the final result did not change, and the sensitivity was assured. This characteristic is confirmed in the results, as shown in Figure 6; that is, the sensitivity of the model validated on COMD and RECD is higher than both the specificity and the accuracy.

This work has some limitations, as follows.

(1) The proposed system is suitable for an initial ECG quality assessment, without considering specific applications; the quality requirement may vary for different purposes. To illustrate this, several ECG segments that were labeled as "noisy" in PCCC2017 are shown in Figure 7. For a task of AF detection, for example, one characteristic used to detect AF is the presence of irregular heartbeats and the absence of regular P waves in the ECG. However, the P waves in Figure 7a,b (Figure 7a: recording "A07/A07983", time: 10:10:15–10:10:25; Figure 7b: recording "A01/A01938", time: 05:05:40–05:05:50) are hardly observable because of severe noise; thus, the quality of the ECG segments in Figure 7a,b is unacceptable because they cannot be used for AF detection based on P waves. In contrast, the segment in Figure 7a can be used for time-domain HRV analysis since it has obvious R peaks (green circle), so the heart rate can be accurately calculated; in this case, it should be regarded as acceptable. Moreover, premature ventricular contraction (PVC) rhythms (purple rectangle) and normal beats (green rectangles) can be identified in Figure 7b; thus, it is acceptable when used for PVC detection. In Figure 7c (recording "A00/A00445", time: 09:09:09–09:09:19), the ECG segment can be partly regarded as acceptable (green rectangle) or unacceptable (the two sides). The ECG segment in Figure 7d (recording "A01/A01116", time: 04:02:47–04:02:57) should be regarded as totally unacceptable because the signal is completely contaminated by noise.

(2) In this study, the system only considered quality assessment for single-lead ECG; therefore, the proposed method cannot be directly applied to 12-lead ECG quality assessment. The system is suitable for the quality assessment of ECGs collected by bedside monitors or wearable devices, but should be extended for the diagnosis of cardiovascular diseases using 12-lead ECG, such as acute myocardial infarction.

(3) In this study, the proposed system was unable to provide specific signal-to-noise ratio information for acceptable and unacceptable ECG signals; thus, it may be hard to quantify the quality of ECG signals.

In the future, we expect to develop a multi-hierarchical and meticulous ECG quality assessment system that will identify low-quality ECG signals and perform task-specific quality assessments.


**Figure 7.** Samples of noisy ECG signals in PCCC2017. (**a**) Recording "A07/A07983", time: 10:10:15–10:10:25; (**b**) recording "A01/A01938", time: 05:05:40–05:05:50; (**c**) recording "A00/A00445", time: 09:09:09–09:09:19; (**d**) recording "A01/A01116", time: 04:02:47–04:02:57 [18].

#### **5. Conclusions**

The CGANs technique is a possible method for ECG generation, and the generated data can help to improve the results of ECG quality assessment. The proposed system is expected to be applied for the accurate initial screening of ECG quality. In particular, for patients with wearable ECG recording devices, the system may assist inexperienced users in collecting ECG signals whose quality meets diagnostic requirements. For clinical technicians, the proposed system is expected to relieve them from tedious and time-consuming quality screening work.

**Author Contributions:** Conceptualization, X.Z. (Xue Zhou) and X.Z. (Xin Zhu); methodology, X.Z. (Xue Zhou) and X.Z. (Xin Zhu); software, X.Z. (Xue Zhou); validation, X.Z. (Xue Zhou), X.Z. (Xin Zhu), K.N. and M.N.; formal analysis, X.Z. (Xue Zhou); investigation, X.Z. (Xue Zhou); resources, X.Z. (Xin Zhu), K.N. and M.N.; data curation, X.Z. (Xue Zhou), X.Z. (Xin Zhu), K.N. and M.N.; writing—original draft preparation, X.Z. (Xue Zhou) and X.Z. (Xin Zhu); writing—review and editing, X.Z. (Xue Zhou), X.Z. (Xin Zhu), K.N. and M.N.; visualization, X.Z. (Xue Zhou) and X.Z. (Xin Zhu); supervision, X.Z. (Xin Zhu); project administration, X.Z. (Xin Zhu); funding acquisition, X.Z. (Xin Zhu) and K.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by JSPS Kakenhi Basic Research Fund C 18K11532 (X.Z.) and 21K10287 (K.N.), and Competitive Research Fund from The University of Aizu, 2021-P-5.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Patient consent was waived due to the use of public databases.

**Data Availability Statement:** (1) The PhysioNet Computing in Cardiology Challenge 2017: https://physionet.org/content/challenge-2017/1.0.0/ (accessed on 1 December 2020); (2) TELE ECG Database: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QTG0EP (accessed on 1 December 2020); (3) MIT-BIH Arrhythmia Database: https://physionet.org/content/mitdb/1.0.0/ (accessed on 1 December 2020); (4) MIT-BIH Normal Sinus Rhythm Database: https://www.physionet.org/content/nsrdb/1.0.0/ (accessed on 1 December 2020).

**Acknowledgments:** TensorFlow 2.3.0, freely provided at tensorflow.org, was used to conduct the experiments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Identification of Targeted Proteins by Jamu Formulas for Different Efficacies Using Machine Learning Approach**

**Sony Hartono Wijaya 1,2,\*, Farit Mochamad Afendi 2,3, Irmanida Batubara 2,4, Ming Huang 5, Naoaki Ono 5, Shigehiko Kanaya <sup>5</sup> and Md. Altaf-Ul-Amin 5,\***

<sup>1</sup> Department of Computer Science, IPB University, Kampus IPB Dramaga Wing 20 Level 5, Bogor 16680, Indonesia

<sup>2</sup> Tropical Biopharmaca Research Center, IPB University, Kampus IPB Taman Kencana, Bogor 16128, Indonesia; fmafendi@apps.ipb.ac.id (F.M.A.); ime@apps.ipb.ac.id (I.B.)


**Abstract:** Background: We performed in silico prediction of the interactions between compounds of Jamu herbs and human proteins by utilizing data-intensive science and machine learning methods. Verifying the proteins that are targeted by compounds of natural herbs will be helpful to select natural herb-based drug candidates. Methods: Initially, data related to compounds, target proteins, and interactions between them were collected from open access databases. Compounds are represented by molecular fingerprints, whereas amino acid sequences are represented by numerical protein descriptors. Then, prediction models that predict the interactions between compounds and target proteins were constructed using support vector machine and random forest. Results: A random forest model constructed based on MACCS fingerprint and amino acid composition obtained the highest accuracy. We used the best model to predict target proteins for 94 important Jamu compounds and assessed the results by supporting evidence from published literature and other sources. There are 27 compounds that can be validated by professional doctors, and those compounds belong to seven efficacy groups. Conclusion: By comparing the efficacy of predicted compounds and the relations of the targeted proteins with diseases, we found that some compounds might be considered as drug candidates.

**Keywords:** compound–protein interaction; Jamu; machine learning; drug discovery; herbal medicine

#### **1. Introduction**

Identification of compounds derived from herbal medicines and natural products has shown potential in drug discovery and drug development [1,2]. Many useful compounds from herbal medicines and natural products have been found and utilized to treat various diseases, such as oseltamivir [3] and roscovitine [4]. Oseltamivir is a neuraminidase inhibitor used in the treatment and prophylaxis of both influenza A and influenza B, whereas roscovitine is known as an anticancer drug. However, identifying compound and target protein interactions in vivo and in vitro requires enormous effort; therefore, efficient in silico methods that predict the interactions between compounds and target proteins are needed to make these efforts easier.

As the country with the largest number of medicinal plant species in the world, Indonesia uses medicinal plants as constituents of herbal medicines [5–7]. These are known as Indonesian Jamu. Currently, Jamu is produced commercially on an industrial scale under the supervision of the National Agency of Drug and Food Control (NADFC) of Indonesia. Jamu,

**Citation:** Wijaya, S.H.; Afendi, F.M.; Batubara, I.; Huang, M.; Ono, N.; Kanaya, S.; Altaf-Ul-Amin, M. Identification of Targeted Proteins by Jamu Formulas for Different Efficacies Using Machine Learning Approach. *Life* **2021**, *11*, 866. https://doi.org/10.3390/life11080866

Academic Editor: Stefania Lamponi

Received: 30 June 2021 Accepted: 18 August 2021 Published: 23 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

like the other herbal medicine systems, i.e., traditional Chinese medicine, Japanese Kampo, Ayurveda, and Unani, can be considered as a new resource for compound screening. The molecules might be from a specific part of a plant used as a Jamu ingredient, such as rhizome of Java ginger (*Curcuma xanthorrhiza*), leaf of *kecibeling* (*Strobilanthes crispus*), or fruit of tamarind (*Tamarindus indica*). The utilization of herbal medicines in drug screening is very promising because of the lack of side effects [8,9]. In addition, the high biodiversity in Indonesia has great advantages in the process of finding potential compounds in Jamu. Furthermore, the systematization of Jamu medicine might help not only to obtain information about the major ingredient plants in Jamu medicines, but also to find compound and protein interactions to explain formulation of Jamu. The information on interactions between Jamu compounds and human target proteins will allow understanding the mechanisms of how Jamu medicines work against diseases and will be helpful for finding new drugs based on a scientific basis.

Various screening approaches have been developed to determine candidate compounds from herbal medicines and natural products in drug discovery. One category of the popular approaches is machine learning techniques. This approach can learn from the data, and the resulting model can be applied to make a prediction. Support vector machine (SVM) and random forest are machine learning methods for supervised learning, and they have been used in many research fields with success [10–12]. In order to obtain a good model, the machine learning method requires a great number of data samples. Nowadays, there are many open access databases that can be used to support the prediction of compound and protein interactions, such as KEGG [13], DrugBank [14], KNApSAcK [15], UniProt [16], and Online Mendelian Inheritance in Man (OMIM) [17]. Prediction of compound–protein interactions can exploit these databases to identify candidate compounds. In terms of Indonesian Jamu, IJAH Analytics can be considered as a good reference for Jamu because it has information about plant species used in Jamu formulas. In addition, plant species information can be associated with information regarding compounds, target proteins, diseases, and interactions between entities. It is hoped that the more efficient and effective application of natural products will improve the drug discovery process.

Many studies on the prediction of interactions between compounds and target proteins have been reported. Yamanishi et al. implemented a systematic study on the prediction of compound–target protein interactions by utilizing supervised learning using a bipartite graph [18]. The interactions were predicted by utilizing the structural similarity of compounds and the similarity of amino acid sequences. They computed the structural similarities between compounds using SIMCOMP and the sequence similarities between proteins using normalized Smith–Waterman scores [19,20]. In the prediction methods, they applied the bipartite local model (BLM) and SVM to predict compound–target protein interactions [21,22]. BLM predicts target proteins of a given compound using the structural similarity of compounds, proteomic similarity, and information of interactions between compounds and target proteins, whereas SVM was used as the classifier for the BLM.

In this study, we applied machine learning techniques to predict the interactions between compounds and proteins. SVM and random forest were chosen as classifiers, and compounds and proteins are represented by molecular fingerprints and numerical representations of amino acid sequences, respectively. Accuracy, sensitivity, and specificity were used to evaluate the models. After confirming the best model for the prediction of compound–protein interactions, we determined the targeted proteins for candidate compounds obtained from plants used in Jamu formulas for different efficacies [11]. The objective was not only to identify targeted proteins for developing new drugs, but also to give a comprehensive understanding of Jamu medicines at the molecular level.

#### **2. Materials and Methods**

Jamu medicines consist of combinations of medicinal plants and are used to treat various diseases. In this work, we exploit information about compound and protein interactions from open access databases to predict compound–protein interactions in the context of Jamu formulas. The concept of the proposed method is depicted in Figure 1, which mainly consists of three processes: (a) data transformation, (b) model generation and evaluation, and (c) prediction of targeted proteins by Jamu formulas.

**Figure 1.** Concept of the methodology.

Initially, we collected the required data for this study from open access databases such as DrugBank, PubChem [23], KNApSAcK, UniProt, KEGG, OMIM, Matador [24], and Indonesian Jamu Herbs (IJAH Analytics, http://ijah.apps.cs.ipb.ac.id, accessed on 20 August 2021). The acquisition of data for generating the prediction model includes compounds, target proteins, and interactions between them. The chemical structures of the compounds were represented by Simplified Molecular Input Line Entry Specification (SMILES) codes. Many databases, such as DrugBank, provide SMILES of each compound [25]. We eliminated some compounds that have ambiguous SMILES or do not have SMILES information. Compounds with known SMILES codes were used in the training process to generate a model for predicting compound–protein interactions. In addition, the information about target proteins was also collected from public databases, especially the IJAH database, and these data were represented by amino acid sequences using FASTA format. In the case of interactions, we gathered that information from IJAH, Matador, and KEGG databases. We also collected the candidate compounds of Jamu formulas associated with efficacy groups from a previous study [11] and used those as test data.

#### *2.1. Data Transformation*

We transformed information about compounds and amino acid sequences into fingerprints and numerical representations of amino acids, respectively. In the case of compounds, we examined two different fingerprint representations, namely the binary representations of the Molecular Access System (MACCS) and PubChem fingerprints [12,26,27]; each compound was thus represented as a 166- or 881-dimensional binary vector, respectively. In the case of proteins, we transformed amino acid sequences into the amino acid composition (AAC) and dipeptide composition descriptors [28]. The AAC represents an amino acid sequence as the fraction of each amino acid type within a protein, producing a 20-dimensional AAC vector. The fractions of all 20 natural amino acids are calculated as:

$$f(r) = \frac{N\_r}{N}, \; r = 1, 2, \dots, 20 \tag{1}$$

where *Nr* is the number of amino acids of type *r* and *N* is the length of the sequence. The dipeptide composition produces a 400-dimensional descriptor, defined as:

$$f(r,s) = \frac{N\_{rs}}{N-1}, r, s = 1, 2, \dots, 20\tag{2}$$

where *Nrs* is the number of dipeptides represented by amino acid type *r* and type *s*.
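A simplified illustration of Equations (1) and (2) (the authors computed these descriptors with the protr R package; the function names below are our own):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acid codes

def aac(seq):
    """Amino acid composition: fraction f(r) = N_r / N for each type r."""
    n = len(seq)
    return {r: seq.count(r) / n for r in AMINO_ACIDS}

def dipeptide(seq):
    """Dipeptide composition: f(r, s) = N_rs / (N - 1) over the 400 pairs."""
    n = len(seq)
    comp = {r + s: 0 for r in AMINO_ACIDS for s in AMINO_ACIDS}
    for i in range(n - 1):
        comp[seq[i:i + 2]] += 1
    return {pair: count / (n - 1) for pair, count in comp.items()}

composition = aac("MKVLAA")       # toy 6-residue sequence
print(composition["A"])           # fraction of alanine in the sequence
```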

After we transformed compounds and proteins into fingerprints and numerical descriptors, we created four datasets consisting of all combinations of compound and protein vectors for generating the model as follows: a combination of MACCS fingerprint and AACs (called dataset 1), a combination of MACCS fingerprint and dipeptide descriptor (called dataset 2), a combination of PubChem fingerprint and AACs (called dataset 3), and a combination of PubChem fingerprint and dipeptide descriptor (called dataset 4). Figure 2 illustrates the data representation of compounds, proteins, and interactions between them. In the case of testing data, we built combinations of candidate compounds from medicinal plants in Jamu and proteins.


**Figure 2.** Data representation. Each data sample DTk is composed of molecular fingerprints (CD1, CD2, CD3, . . . , CDm) and numerical protein descriptors (PD1, PD2, PD3, . . . , PDn).

#### *2.2. Model Generation and Evaluation*

We applied SVM and random forest in the model generation step. SVM is a binary classifier that constructs an optimal linear model with the largest margin between the two classes. The linear separator is constructed by simultaneously minimizing the empirical classification error and maximizing the geometric margin [29]. Suppose we have *n* training data pairs *T* = {(*xi*, *yi*)}, *i* = 1, . . . , *n*, where *xi* ∈ R*<sup>p</sup>* is a vector representing a compound–protein pair and *yi* is the class of *xi*. The decision function of SVM is defined as *f*(*x*) = *w<sup>T</sup>x* + *b*, where *w* = (*w*1, *w*2, . . . , *wp*)<sup>T</sup> is the weight vector and *b* is a scalar. The optimization problem that SVM aims to minimize is shown in Equation (3):

$$\min\_{w \in \mathbb{R}^p, \, \xi\_i \in \mathbb{R}^+} \frac{1}{2} \left\| w \right\|^2 + C \sum\_{i=1}^n \xi\_i \tag{3}$$

subject to *yi*(*w<sup>T</sup>xi* + *b*) ≥ 1 − *ξi*, where *C* is a trade-off between the width of the margin and the number of misclassifications, and *ξi* is a slack variable. SVM can be extended to classify data that are not linearly separable by utilizing a kernel technique. We applied two kernel functions in this study: the linear kernel *K*(*xi*, *xj*) = *xi<sup>T</sup>xj* and the radial basis function (RBF) kernel *K*(*xi*, *xj*) = exp(−*γ*‖*xi* − *xj*‖<sup>2</sup>), *γ* > 0, where *γ* is the inverse of the radius of influence of the samples selected by the model as support vectors [10,30].
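The two kernels can be written out directly; a minimal pure-Python sketch (the study used the e1071 R package, so these helper names are illustrative only):

```python
import math

def linear_kernel(xi, xj):
    """K(xi, xj) = xi^T xj."""
    return sum(a * b for a, b in zip(xi, xj))

def rbf_kernel(xi, xj, gamma):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2), gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

# Toy binary fingerprint vectors, not real MACCS bits.
x1, x2 = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
print(linear_kernel(x1, x2))    # dot product of the two vectors
print(rbf_kernel(x1, x1, 0.5))  # a point is maximally similar to itself
```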

In addition, random forest is an ensemble method composed of many decision trees. For each classification tree, a bootstrap sample of the data is generated, and at each split, the candidate set of variables is a random subset of the variables [31–33]. Given a set of training samples *L* = {(*xi*, *yi*)}, *i* = 1, . . . , *n*, where *xi* ∈ R*<sup>p</sup>* is a vector of predictor variables representing compound–protein data *i* and *yi* is the class label, random forest generates *ntree* decision trees from these samples. For each tree, *n* samples are randomly selected with replacement (bootstrap resampling) to form a new training set, and the samples not selected are called out-of-bag (OOB) samples. Using this new training set, a decision tree is grown to the largest extent possible without any pruning according to the classification and regression tree (CART) methodology [34]. The Gini index is used during the development of a decision tree. The Gini index at node *v*, *Gini*(*v*), is shown in Equation (4).

$$Gini(v) = \sum\_{c=1}^{C} \hat{p}\_c^v (1 - \hat{p}\_c^v) \tag{4}$$

where *p̂c<sup>v</sup>* is the proportion of class *c* observations at node *v* [35]. Then, the Gini information gain of *xi* for splitting node *v* into two child nodes, *Gain*(*xi*, *v*), is shown in Equation (5):

$$Gain(x\_i, v) = Gini(x\_i, v) - w\_L Gini(x\_i^L, v^L) - w\_R Gini(x\_i^R, v^R) \tag{5}$$

where *v<sup>L</sup>* and *v<sup>R</sup>* are the left and right child nodes of *v*, *wL* and *wR* are the proportions of instances assigned to the left and right child nodes, and *xi<sup>L</sup>* and *xi<sup>R</sup>* are the instances in the left and right child nodes. At each node, a random set of *mtry* features out of the *p* features is evaluated, and the feature with the maximum *Gain*(*xi*, *v*) is used to split node *v*. The OOB error is estimated during the construction of the forest. After the entire forest is constructed, the OOB classification results for each sample are used to reach a decision for that sample via a majority-voting rule.
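Equations (4) and (5) can be sketched as follows (a toy illustration with label lists; the function names are our own):

```python
def gini(labels):
    """Gini index of a node: sum_c p_c * (1 - p_c) over the class labels."""
    n = len(labels)
    classes = set(labels)
    return sum((labels.count(c) / n) * (1 - labels.count(c) / n)
               for c in classes)

def gini_gain(parent, left, right):
    """Gini information gain of splitting `parent` into `left` and `right`,
    weighting each child's Gini index by its share of the instances."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return gini(parent) - w_left * gini(left) - w_right * gini(right)

parent = [1, 1, 1, 0, 0, 0]
print(gini(parent))                             # 0.5 for a balanced node
print(gini_gain(parent, [1, 1, 1], [0, 0, 0]))  # a perfect split recovers 0.5
```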

We compared the performance of the resulting models using accuracy, sensitivity, and specificity [36,37]; the higher the accuracy, the better the performance of the classifier. We measured the accuracies of SVM with two different kernels and of random forest using the four data representations (datasets 1–4). To estimate the performance of random forest and SVM with the two kernels, 10-fold cross-validation was used [21]: each dataset was divided into 10 subsamples; nine subsamples were used as a training dataset to build a classification model, and the remaining subsample was used as a validation dataset for testing the model. In the model evaluation step, we selected the classifier and the data representation of compounds and amino acid sequences for which we obtained the best result and used them for the prediction of target proteins.
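The 10-fold split can be sketched as follows (a plain index-partition illustration; the authors used R tooling, so the helper below is our own):

```python
def k_fold_indices(n_samples, k=10):
    """Split sample indices into k roughly equal folds. Each fold serves once
    as the validation set while the remaining k-1 folds train the model."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out
                     for i in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(100, k=10))
print(len(splits))  # 10 train/validation splits
```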

#### *2.3. Prediction of the Target Protein by Jamu Formulas*

The best model, i.e., the one with the highest accuracy, was applied to the prediction of compound–protein interactions, with the Jamu formulas used as the testing dataset. In this case, we accepted compound–protein interactions as true interactions when the predicted probability was greater than a threshold. Figure 3 illustrates the relations among the different entities involved in comprehensive Jamu research, where a dotted rectangle indicates the focus of the present work. Figure 3 also shows how we validated our results by comparing efficacy–compound and protein–disease relations; that is, we compared the therapeutic usage of the predicted compounds with the relations of the targeted proteins to diseases. We assessed and discussed the results with supporting evidence from published literature and comments from professional doctors.

**Figure 3.** The process of prediction and identification of targeted proteins. Initially, we developed a prediction model of compound–protein interactions by utilizing compound–protein data. Then, the best model was used to predict interaction between compounds of Jamu formulas and their targeted proteins (in the dotted rectangle).

#### **3. Results and Discussion**

The summary of the data used in this study is shown in Table 1. We utilized compounds reported to be present in the herbs used as Jamu ingredients. There are 17,227 compounds belonging to 4,984 Indonesian herbs collected from the KNApSAcK, IJAH, PubChem, and KEGG databases. In addition, the number of target proteins collected from the UniProt and IJAH databases is 3,334, and the number of interactions collected from the UniProt, IJAH, Matador, and KEGG databases is 7,989. Initially, we removed the data lacking necessary properties, such as SMILES in the case of compounds and the amino acid sequence in the case of target proteins. Furthermore, we removed the compounds and proteins that are not included in the compound–protein interaction data, and we considered only those compounds that target human proteins. Therefore, the numbers of compounds, proteins, and interactions used in this experiment are 2,146, 3,334, and 7,216, respectively.

#### *3.1. Preprocessing of Compound and Protein Data*

The transformation of compounds from SMILES to fingerprints was done utilizing the ChemDes web-based software and the PaDEL descriptor [27,38]. Compounds were transformed into MACCS and PubChem fingerprints. Moreover, we used the protr package in R to generate the AAC and dipeptide composition as numerical representation schemes of the protein sequences [28]. We eliminated two amino acid sequences in the preliminary study, i.e., Q9NZV5 and P36969, because they contained unrecognized amino acid types when transforming the sequences to AAC. Therefore, 3,332 proteins were left for further processing.

After the data transformation finished, we created datasets for compound–protein prediction using both compound and protein space information. Each sample vector is composed of a fingerprint and a numerical descriptor of a compound and a protein, respectively. Therefore, for the two compound fingerprints and two protein numerical descriptors, we constructed four matrices with dimensions (2,146 × 3,332) by (166 + 20) for MACCS + AAC, (166 + 400) for MACCS + dipeptide, (881 + 20) for PubChem + AAC, and (881 + 400) for PubChem + dipeptide. Known interactions between compounds and proteins are considered the positive class, whereas unknown interactions are considered the negative class. As the number of samples in the negative class is very large (the number of compounds multiplied by the number of proteins), we randomly selected 7,216 samples for the negative class, the same as the number of positive samples. We labeled positive and negative class interactions as classes 1 and 0, respectively.
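The negative-class balancing step described above can be sketched as follows (a hypothetical helper with made-up IDs; the paper does not specify its sampling code):

```python
import random

def sample_negatives(compounds, proteins, positives, n_neg, seed=0):
    """Randomly draw n_neg compound-protein pairs that are not in the known
    (positive) interaction set, to serve as the negative class."""
    rng = random.Random(seed)
    positive_set = set(positives)
    negatives = set()
    while len(negatives) < n_neg:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positive_set:
            negatives.add(pair)
    return sorted(negatives)

# Toy identifiers only; the real data use 2,146 compounds and 3,332 proteins.
compounds = [f"c{i}" for i in range(50)]
proteins = [f"p{i}" for i in range(50)]
positives = [("c0", "p0"), ("c1", "p1")]
neg = sample_negatives(compounds, proteins, positives, n_neg=2)
print(len(neg))
```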

Wijaya et al. [11] identified 94 significant compounds associated with twelve efficacy groups, and 28 of them were validated by published literature. In this case, the efficacy refers to broad disease classes which are as follows: blood and lymph diseases (E1), cancers (E2), the digestive system (E3), female-specific diseases (E4), the heart and blood vessels (E5), male-specific diseases (E6), muscle and bone (E7), nutritional and metabolic diseases (E8), respiratory diseases (E9), skin and connective tissue (E10), the urinary system (E11), and mental and behavioral disorders (E12). We considered those 94 compounds as test data in this study. Table 2 shows the number of candidate compounds for each efficacy. We transformed the compounds into fingerprints according to the best results we obtained.

**Table 1.** The distribution of compound, protein, and interaction between them as training and testing data.


**Table 2.** The number of compounds for predicting target proteins. All data are classified by efficacies, and some compounds are related to one or more efficacy groups.


#### *3.2. Prediction Performance*

We applied the R package e1071 ver. 1.7-4 to implement the SVM method [39] and randomForest ver. 4.6-14 to implement random forest (https://cran.r-project.org/web/packages/randomForest/, accessed on 9 August 2020). The optimal parameters for model generation were obtained with the best.tune and tuneRF functions for SVM and random forest, respectively. In the SVM, the regularization parameter *C* depends on the numerical protein descriptor: *C* is equal to 1 for AAC, whereas *C* is equal to 1000 for dipeptide. The *γ* values of datasets 1–4 are 0.00763, 0.00177, 0.00437, and 0.00078, respectively. In random forest, the appropriate number of trees *ntree* is the same for datasets 1 and 3, namely 1000, while the *ntree* values for datasets 2 and 4 are 2000 and 500, respectively. The *mtry* values for datasets 2 and 4 are the same, i.e., 10, whereas those for datasets 1 and 3 are 6 and 15, respectively.
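The tuning step amounts to an exhaustive search over a parameter grid, which is essentially what e1071's best.tune does. A generic Python sketch of that idea (illustrative only; `score_fn` stands in for a cross-validation accuracy computed on the training data):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Minimal grid search: evaluate every parameter combination with
    score_fn and return the best (params, score) pair."""
    best_params, best_score = None, float("-inf")
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        score = score_fn(**params)  # e.g., mean cross-validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For the SVM, the grid would cover *C* and *γ*; for random forest, *ntree* and *mtry*.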

Table 3 shows the prediction performance for each dataset and each model. Representing amino acid sequences with the AAC descriptor (datasets 1 and 3) yields better accuracy than the dipeptide descriptor for both classifiers and both compound fingerprints. Furthermore, comparing random forest and support vector machine classifiers, the classification accuracy of random forest with the AAC descriptor is better than that of SVM with either kernel. Regarding the fingerprints used to represent the compounds, MACCS obtains slightly better classification results than PubChem features. One reason for the poorer classification results with the dipeptide descriptor is the number of features it produces: dipeptide yields 400 features, increasing the dimensionality of the compound–protein input vectors, and many of these features are zero, which affects the resulting model. Determining the most appropriate features is challenging because machine learning methods generally rely on feature engineering [40]. The same effect can be observed between the MACCS and PubChem fingerprints in datasets 2 and 4: as the number of features increases, the resulting accuracy decreases. Since the best model achieves sufficiently high performance, it can be applied to predict interactions between the Jamu compounds and target proteins.



#### *3.3. Prediction Results*

In order to predict interactions between compounds and target proteins, the classification model was taken from the models that obtained the best classification results, and a testing dataset was constructed to match the dataset that achieved those results. In this case, we utilized the MACCS fingerprint to represent Jamu compounds, the AAC descriptor to represent amino acid sequences, and random forest as the classifier. Since we focused on whether compounds bind to target proteins, we created a matrix containing all combinations of candidate compounds of Jamu formulas and target proteins, as shown in Figure 2. The prediction model was then applied to predict whether each compound–protein pair interacts. We accepted compound–protein interactions as true interactions when their classification probability was greater than 0.85. Not all candidate compounds identified in the work of Wijaya et al. interact with one or more of the proteins utilized in the current experiment. Here, we predicted 168 compound–protein interactions of Jamu formulas, involving 68 candidate compounds. Moreover, professional doctors validated the predicted compound–protein interactions by comparing the efficacy of the predicted compounds with the relations of the targeted proteins to diseases, as shown in Figure 3. Based on the current results, interactions involving 27 compounds could be validated, and those compounds belong to seven efficacy groups. Table 4 summarizes the predicted compound–protein interactions of Jamu formulas that have been validated by professional doctors, presented under their respective efficacies. We also found that one protein can be targeted by many compounds and one compound can interact with many target proteins. For instance, P02768, known as human serum albumin (HSA), is targeted by caffeic acid, diacetoxy-6-gingerdiol, gallic acid, luteolin, quercitrin, tricin, and ursolic acid. In addition, ursolic acid targets Q92887, Q9NPD5, Q9Y6L6, P08185, and P02768. Further investigation of the predicted compound–protein interactions was also performed by finding supporting evidence in published literature, such as HSA being targeted by luteolin [41]. These results indicate that some compounds in herbs might be considered as drugs, and they also imply that the prediction model performs well, so that proteins not yet confirmed by any evidence can be candidates for a relation with the corresponding efficacy group.
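The acceptance step above reduces to filtering predicted pairs by their class-1 probability against the 0.85 cutoff. A minimal sketch (illustrative names; `predictions` maps each candidate compound–protein pair to the classifier's interaction probability):

```python
def accept_interactions(predictions, threshold=0.85):
    """Keep only compound–protein pairs whose predicted interaction
    probability exceeds the threshold (0.85 in this study)."""
    return [pair for pair, prob in predictions.items() if prob > threshold]
```

Anything at or below the threshold is left as an unconfirmed (unknown) interaction rather than labeled negative.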

**Table 4.** Predicted compound–protein interactions by Jamu formulas. Compound ID is an identifier taken from PubChem CID (https://pubchem.ncbi.nlm.nih.gov, accessed on 20 August 2021) and KNApSAcK ID (http://kanaya.naist.jp/KNApSAcK\_Family/, accessed on 20 August 2021). If the Compound ID cannot be found in the PubChem or KNApSAcK databases, we assigned N/A.


#### **4. Conclusions and Future Works**

We constructed classification–prediction models that predict the interactions between compounds and target proteins using a machine learning approach. The models were created by utilizing compound–protein interaction data obtained from open-access databases, with the data represented by a combination of fingerprints and amino acid sequence descriptors. The results showed very good prediction performance, around 90%, when the compounds were transformed to the MACCS fingerprint, the amino acid sequences were transformed to the AAC descriptor, and random forest was chosen as the classifier. In addition, some target proteins were predicted from potential compounds of Jamu formulas using the best model obtained in the previous step. By comparing the efficacy of the predicted compounds with the relations of the targeted proteins to diseases, we found that some compounds might be considered as drug candidates. There are 27 compounds that could be validated by professional doctors, and those compounds belong to seven efficacy groups. This study not only determines candidate drugs but also gives a better understanding of Jamu medicine at the omics level. Moreover, further validation of the results of this study can be performed by docking simulations of the predicted compound–protein interactions or through in vivo and in vitro validation studies in the laboratory. We can also explore supporting chemical or biological characteristics of the predicted interactions, such as the similarity between a target compound and the known ligands of the predicted protein.

**Author Contributions:** Conceptualization, S.H.W. and M.A.-U.-A.; data curation, S.H.W.; formal analysis, S.H.W. and F.M.A.; funding acquisition, S.K.; investigation, M.H. and N.O.; methodology, S.H.W. and M.A.-U.-A.; supervision, I.B., S.K. and M.A.-U.-A.; writing—original draft, S.H.W.; writing—review and editing, M.A.-U.-A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the World Class Professor Program Scheme A of the Ministry of Research, Technology and Higher Education of Indonesia and NAIST Big Data and Interdisciplinary Projects, Japan, and partially supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan (16K07223 and 17K00406).

**Data Availability Statement:** Data used in this study were collected from previous studies and open access databases. Data are available from Computational Systems Biology Laboratory, NAIST, and Department of Computer Science of IPB University for researchers who meet the criteria (contact via correspondence authors).

**Acknowledgments:** We thank Husnawati and Nurida Dessalma Syahrania for validating predicted compound–protein interactions.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


## *Article* **Shared Molecular Mechanisms of Hypertrophic Cardiomyopathy and Its Clinical Presentations: Automated Molecular Mechanisms Extraction Approach**

**Mila Glavaški 1,\* and Lazar Velicki 1,2**


**Abstract:** Hypertrophic cardiomyopathy (HCM) is the most common inherited cardiovascular disease with a prevalence of 1 in 500 people and varying clinical presentations. Although there is much research on HCM, the underlying molecular mechanisms are poorly understood, and research on the molecular mechanisms of its specific clinical presentations is scarce. Our aim was to explore the molecular mechanisms shared by HCM and its clinical presentations through the automated extraction of molecular mechanisms. Molecular mechanisms were congregated by a query of the INDRA database, which aggregates knowledge from pathway databases and combines it with molecular mechanisms extracted from abstracts and open-access full articles by multiple machine-reading systems. The molecular mechanisms were extracted from 230,072 articles on HCM and 19 HCM clinical presentations, and their intersections were found. Shared molecular mechanisms of HCM and its clinical presentations were represented as networks; the most important elements in the intersection networks were identified, centrality scores for each element of each network were calculated, networks with a reduced level of noise were generated, and cooperatively working elements were detected in each intersection network. The identified shared molecular mechanisms represent possible mechanisms underlying different HCM clinical presentations. The applied methodology produced results consistent with the information in the scientific literature.

**Keywords:** hypertrophic cardiomyopathy; data mining; automated curation; molecular mechanisms; atrial fibrillation; sudden cardiac death; heart failure; left ventricular outflow tract obstruction; cardiac fibrosis; myocardial ischemia

**Citation:** Glavaški, M.; Velicki, L. Shared Molecular Mechanisms of Hypertrophic Cardiomyopathy and Its Clinical Presentations: Automated Molecular Mechanisms Extraction Approach. *Life* **2021**, *11*, 785. https://doi.org/10.3390/life11080785

Academic Editors: Md. Altaf-Ul-Amin, Shigehiko Kanaya, Naoaki Ono and Ming Huang

Received: 16 June 2021; Accepted: 30 July 2021; Published: 3 August 2021

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Hypertrophic cardiomyopathy (HCM) is the most common inherited cardiovascular disease [1–3], with an estimated prevalence of 1 in 500 people worldwide [1,3–5] and recent investigations suggesting an even greater prevalence [5,6]. It is characterized by increased left ventricular wall thickness that cannot be explained by abnormal loading conditions (e.g., hypertension) [1,2,7].

Mutations in genes that encode sarcomeric proteins are the primary molecular cause of HCM [3,8,9]. However, the genetic basis for HCM has proven to be more complex than originally postulated: 40–60% of HCM patients show mutations in one or more of the genes known to be associated with the disease, whereas for others, the cause remains unknown [10–12].

The clinical presentation of HCM varies widely [1,3,7,8]: some patients are asymptomatic [1,7,13], while others manifest symptomatic left ventricular outflow tract obstruction (LVOTO) [7,8], atrial fibrillation (AF) [3,8], sudden cardiac death (SCD) [3,7,13,14], or heart failure (HF) [1,3,10,11]. Pathophysiologic features of HCM include cardiomyocyte hypertrophy [15,16], cardiomyocyte disarray [16,17], myocardial remodeling [18,19], fibrosis [3,20,21], myocardial hypercontractility [22,23], impaired myocardial relaxation [20,24], myocardial stiffness [17,20], diastolic dysfunction [13,14,17], coronary microvascular dysfunction [25,26], and myocardial ischemia [25,27], but the underlying molecular mechanisms are poorly understood. Molecular determinants of the disease presentations are also still unknown. Phenotypic expression of HCM may vary even within the same family [1]. Despite active research, consistent genotype–phenotype associations have not been established. All of this stresses the importance of finding additional mechanisms and factors that direct the course and presentations of HCM, and points to the molecular mechanisms standing between the genetic basis and the clinical presentations as crucial.

The INDRA database [28] aggregates knowledge from pathway databases and combines it with information on molecular mechanisms extracted from abstracts and open-access full articles by multiple machine-reading systems. PubMed is one of the most important platforms for medical journal literature; to be indexed in PubMed, journals must meet certain review or selection criteria [29].

Our aim was to explore the shared molecular mechanisms of HCM and its clinical presentations through the automated extraction of molecular mechanisms.

#### **2. Materials and Methods**

#### *2.1. Molecular Mechanisms Extraction*

Molecular mechanisms were congregated using the INDRA database [28]. Molecular mechanisms from all PubMed articles published starting from 1 January 2010 were separately extracted in the form of INDRA statements [30] for HCM, cardiomyocyte hypertrophy, myofibrillar disarray, cardiomyocyte disarray, myocardial remodeling, cardiac remodeling, myocardial fibrosis, LVOTO, myocardial hypercontractility, impaired myocardial relaxation, impaired cardiac relaxation, myocardial stiffness, diastolic dysfunction, AF, SCD, coronary microvascular dysfunction, myocardial ischemia, HF, MACE, and rehospitalization. INDRA statements were found in the INDRA database by PubMed Identifiers (PMIDs), using REST Client API. PMIDs were collected through the INDRA PubMed client [30] (which searches for articles on PubMed) using the following search terms: hypertrophic cardiomyopathy, cardiomyocyte hypertrophy, myofibrillar disarray, cardiomyocyte disarray, myocardial remodeling, cardiac remodeling, myocardial fibrosis, left ventricular outflow tract obstruction, myocardial hypercontractility, impaired myocardial relaxation, impaired cardiac relaxation, myocardial stiffness, diastolic dysfunction, atrial fibrillation, sudden cardiac death, coronary microvascular dysfunction, myocardial ischemia, heart failure, major adverse cardiovascular events, and rehospitalization (use\_text\_word = True, major\_topic = True).

Subsequently, intersections of the sets consisting of INDRA statements for HCM and its clinical presentations were found.
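The intersection step can be sketched directly as set operations. In this illustrative Python sketch (not the study's code), each condition's INDRA statements are represented by hashable keys such as statement matches-key strings, which is an assumption of the example:

```python
def shared_mechanisms(statements_by_condition, base="HCM"):
    """Intersect the statement set for the base condition (HCM) with the
    statement set for each clinical presentation."""
    base_set = statements_by_condition[base]
    return {
        condition: base_set & statements
        for condition, statements in statements_by_condition.items()
        if condition != base
    }
```

Each resulting intersection is the set of molecular mechanisms shared by HCM and one clinical presentation.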

#### *2.2. Networks Generation*

Each of the intersections (consisting of sets of INDRA statements) was transcribed to a network table, imported to Cytoscape version 3.8.2 [31] for further analysis, and uploaded to NDEx v 2.5.0 [32–34].
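Transcribing an intersection to a network table amounts to writing each statement as a source–interaction–target edge row, the layout Cytoscape can import. A minimal sketch (illustrative; statements are simplified to (subject, interaction, object) triples):

```python
import csv
import io

def to_network_table(statements):
    """Write (subject, interaction, object) statements as an edge table
    with source, interaction, and target columns."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["source", "interaction", "target"])
    for subject, interaction, obj in sorted(statements):
        writer.writerow([subject, interaction, obj])
    return buf.getvalue()
```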

#### *2.3. Network Analysis*

The most important nodes in intersections' networks were found using Cytoscape application Wk shell decomposition version 1.1.0 [35]. Rank and k-shell were calculated for every node of each network. The reliability of interactions was determined using Cytoscape PE-measure application version 1.0 [36]. Models with a reduced level of noise were generated and uploaded to NDEx. The nodes' centrality scores were determined using Cytoscape CytoHubba app version 0.1. [37]. Top elements for each centrality measure of each network were uploaded to NDEx. Cooperatively working elements (functional modules) were found using NCMine Cytoscape plugin version 1.3.0 [38]. All networks were analyzed as directed (with applied cliqueness threshold = 0.6, merge threshold = 0.6, dcliqueness threshold = 0.2, and cluster size threshold = 3).
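The Cytoscape apps above compute many node-importance measures; the simplest of them, degree, can be sketched in a few lines of Python as an illustration of how nodes are ranked (this stands in for, but is not, the CytoHubba or Wk shell computation):

```python
from collections import Counter

def node_degrees(edges):
    """Rank nodes of a directed edge list by total degree (in + out),
    the simplest of the centrality measures used in the analysis."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree.most_common()  # [(node, degree), ...] highest first
```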

#### **3. Results**

Molecular mechanisms in the form of 182,167 INDRA statements (representations of molecular mechanisms consisting of a molecular subject, an object, and their interaction) were extracted from 230,072 articles on HCM and 19 HCM clinical presentations (Table 1).

**Table 1.** The number of articles on hypertrophic cardiomyopathy and its clinical presentations read automatically and the number of INDRA statements extracted.


#### *3.1. Network Analysis*

3.1.1. Networks

Shared molecular mechanisms of HCM and its clinical presentations are represented as networks (Table 2). The networks differ notably in terms of the number of elements they contain.

The intersection of molecular interactions representing HCM and impaired cardiac relaxation contains only phosphorylation of SMAD Family Member 2 (SMAD2) and could not be displayed as a network. The intersection of HCM and myocardial hypercontractility contains no molecular interactions.


**Table 2.** Shared molecular mechanisms of hypertrophic cardiomyopathy and its clinical presentations.

#### 3.1.2. The Most Important Nodes

The most important nodes of all networks were identified (Supplementary Table S1). All networks were presented as packed concentric rings sorted by the most important nodes (Supplementary Figure S1).

#### 3.1.3. Nodes' Centrality Scores

Centrality scores for each node of each network were calculated, and the top elements for each centrality measure of each network were visualized (Table 3).


**Table 3.** Top nodes of each network ranked by centrality scores.

3.1.4. Reliability of Interactions

Networks with a reduced level of noise were generated (Table 4).

3.1.5. Cooperatively Working Elements

In each intersection network, cooperatively working elements (functional modules) were detected (Supplementary Table S2).

**Table 4.** Networks with different PE-values applied. PE-measure (the measure for interaction reliability) removes spurious interactions (below the value applied) and, thus, the level of noise in networks.


*3.2. Shared Molecular Elements and Pathways*

#### 3.2.1. Hypertrophic Cardiomyopathy and Structural Changes

The most important shared elements for cardiomyocyte hypertrophy and HCM were as follows: calcium; Ca2<sup>+</sup>/calmodulin-dependent protein kinase II (CaMKII); the *PLN* gene encoding phospholamban; protein kinase A (PKA), a master regulator of most cAMP-dependent processes; protein kinase B (PKB), also known as AKT, which regulates cellular survival and metabolism; AMP-activated protein kinase (AMPK), which is involved in cellular energy homeostasis as a "cellular energy sensor"; sirtuin 1, encoded by the *SIRT1* gene; nuclear factor of activated T-cells (NFAT), which is important for the immune response and is involved in the development of the cardiac system; the *EDN1* gene encoding endothelin 1 (ET-1), a potent vasoconstrictor; the *AGT* gene, which encodes angiotensinogen; collagen; the multifunctional cytokine transforming growth factor β (TGF-β); the signal transduction protein extracellular signal-regulated kinase (ERK); cell population proliferation; and apoptosis.

The most important shared elements for myofibrillar disarray and HCM are actin, the myosin complex, the *MYL12A* gene, the *MYBPC3* gene, ATP, mitogen-activated protein kinase 7 (MAPK7) encoded by the *MAPK7* gene (MAP kinases are involved in many cellular processes), RAF proto-oncogene serine/threonine-protein kinase (RAF1), a MAP kinase in the ERK1/2 pathway encoded by the *RAF1* gene, ERK, and the *EDN1* gene encoding endothelin 1. The effects of the immunosuppressant and calcineurin inhibitor cyclosporin A as well as the MAP kinase cascade inhibitor PD98059 are also indicated.

In their pathophysiology, cardiomyocyte disarray and HCM share the mechanisms of contractile machinery components (actin, the myosin complex, and the enzyme ATPase); apoptosis-inhibiting mechanisms (the B-cell lymphoma 2 gene, *BCL2*); protein tyrosine phosphatase non-receptor type 11 (PTPN11), which inhibits the growth regulator mTOR (mechanistic target of rapamycin); and Src homology 2 domain-containing phosphatase 2 (Shp2), which is involved in cell growth and survival. The importance of the myosin heavy chain 7 gene, *MYH7*, is also shown.

The common molecular elements of myocardial remodeling and HCM were as follows: calcium, CaMKII, *AGT* gene, which encodes angiotensinogen, angiotensin II, collagen, TGF-β, tumor necrosis factor (TNF), inflammatory response, cell population proliferation, and apoptosis.

Cardiac remodeling and HCM in their pathophysiology share calcium, AMPK (a "cellular energy sensor"), *AGT* gene encoding angiotensinogen, AKT (regulates cellular survival and metabolism), TGF-β (multifunctional cytokine), *SIRT1* gene encoding sirtuin 1 (SIRT1), collagen, actin, reactive oxygen species, cell population proliferation, and apoptosis.

By the most important nodes and ranked by centrality scores, the most important shared elements for myocardial fibrosis and HCM were calcium, TGF-β, collagen, *AGT* encoding angiotensinogen, angiotensin II, AMPK, cell population proliferation, inflammatory response, and apoptosis.

#### 3.2.2. Hypertrophic Cardiomyopathy and Left Ventricular Outflow Tract Obstruction

LVOTO and HCM share calcium, TGF-β, *POSTN* gene encoding periostin (extracellular matrix protein with multiple functions), collagen, *PIMREG* gene and PIMREG protein (involved in metaphase-to-anaphase transition during mitosis), *SNCG* gene encoding gamma-synuclein (a member of the synuclein family of proteins, which were believed to be involved in the pathogenesis of neurodegenerative diseases and certain tumors), verapamil (calcium channel blocker), dobutamine (β1-agonist), mavacamten (MYK-461, inhibitor of cardiac myosin ATPase), systolic anterior motion, and death.

#### 3.2.3. Hypertrophic Cardiomyopathy and Contractile Dysfunction

Impaired myocardial relaxation and HCM in their pathophysiology share nitric oxide (NO) and constitutive nitric oxide synthase (also known as nitric oxide synthase 3 (NOS3) or endothelial NOS), encoded by the *NOS3* gene, as well as N,N-dimethylarginine, a direct endogenous inhibitor of NO synthases; the interaction of phospholamban with the ATP2A2 intracellular calcium pump; and collagen induction by TGF-β and PKA.

Molecules shared by myocardial stiffness and HCM were the *TTN* gene encoding titin; the *RBM20* gene encoding an RNA-binding protein that acts as a regulator of mRNA splicing of a subset of genes involved in cardiac development (it regulates splicing of *TTN*); actin; TGF-β; the *PRKCA* gene encoding protein kinase C-alpha (PKC-α), which is involved in diverse cellular signaling pathways; cyclic GMP-dependent protein kinase (PRKG), which is involved in muscle relaxation; the *RING1* gene encoding ring finger protein 1 (RING1); and the *TRIM63* gene encoding tripartite motif-containing protein 63, which regulates the proteasomal degradation of muscle proteins.

Diastolic dysfunction and HCM share calcium, sodium, actin, troponin I, *TTN* gene encoding titin, *PLN* gene encoding phospholamban, cardiac myosin binding protein-C (cMyBP-C), CaMKII, TGF-β, PKA, AMPK, and apoptosis.

#### 3.2.4. Hypertrophic Cardiomyopathy and Arrhythmia

Based on the most important nodes and centrality score ranks, the following elements were the most important shared elements for AF and HCM: sodium and calcium; CaMKII, which is important for calcium homeostasis in cardiomyocytes; *PLN* gene encoding phospholamban that inhibits the activity of ATPase sarcoplasmic/endoplasmic reticulum Ca2<sup>+</sup> transporting 2 (*ATP2A2*—encodes one of the intracellular pumps that return calcium from the cytosol to the sarcoplasmic reticulum); *RYR2* gene encoding ryanodine receptor 2 (RyR2) (major mediator in calcium-induced calcium release from sarcoplasmic reticulum); *AGT* gene which encodes angiotensinogen, a precursor of angiotensin; junctophilin 2 (*JPH2*) gene which encodes a component of junctional complexes (it also plays a key role in calcium-induced calcium release); T-cell leukemia homeobox protein 2 (*TLX2*) gene; *PSMD4* gene which encodes component of the 26S proteasome, with its main role being the removal of misfolded or damaged proteins as well as proteins whose functions are no longer required. Both diseases share inflammatory response, apoptotic process, and death in their pathophysiology.

SCD and HCM share the following elements: sodium, calcium, *RYR2* gene encoding RyR2, CaMKII, actin, *MYL12A* gene, myosin complex, *TNNT1* gene encoding troponin T, troponin C, *TNNI3* gene encoding troponin I, ATP, PKA, *GJA1* gene encoding gap junction protein alpha 1 (connexin-43), *PLN* gene encoding phospholamban, *GSTK1* gene encoding glutathione S-transferase kappa 1, which belongs to a superfamily of enzymes for cellular detoxification, and death. The effect of the non-selective β adrenoceptor agonist isoprenaline is also indicated.

#### 3.2.5. Hypertrophic Cardiomyopathy and Ischemia

The important common molecular mechanisms of coronary microvascular dysfunction and HCM are the serine/threonine-specific protein kinase Akt (also known as protein kinase B or PKB, which plays an important role in glucose metabolism, cell proliferation, and apoptosis) and its activating phosphorylation site S473, the structural sarcomeric protein titin, components of the phosphagen energy system (ATP decreased by creatine), glucose increased by insulin, calcium increased by sodium, and calcium increased by calcium. The effects of a few exogenous elements are also indicated, such as the antioxidant resveratrol (which activates the sirtuin 1 gene, *SIRT1*, a regulator of whole-body lipid homeostasis), the antimineralocorticoid spironolactone (which inhibits expression of the mineralocorticoid receptor, encoded by the nuclear receptor subfamily 3 group C member 2 gene, *NR3C2*), and the insecticide pyraclofos (which increases calcium).

Myocardial ischemia and HCM share the following elements: calcium, ATP, AMPK, PKA, AKT cross-talking with ERK, glucose, *INS*, *SIRT1* gene encoding sirtuin 1, NF-κB, TNF, reactive oxygen species, inflammatory response, and apoptosis.

#### 3.2.6. Hypertrophic Cardiomyopathy and Endpoints

In their pathogenesis, HF and HCM share calcium, sodium, RYR2, components of contractile machinery and related genes actin, myosin complex, myosin light chain 12A (*MYL12A*) gene, troponin C, troponin T, *TNNI3* gene encoding troponin I, tropomyosin, myosin binding protein C3 (*MYBPC3*) gene, cMyBP-C, glucose, *INS* gene encoding insulin, reactive oxygen species, ATP, 3',5'-cyclic AMP (cAMP), ATPase, AMPK, sirtuin 1, TGF-β, collagen, *AGT* gene encoding angiotensinogen, *EDN1* gene encoding endothelin-1, ERK, *GATA4* encoding transcription factor GATA-4, PKA, PKB (AKT), protein kinase C (PKC), p38 MAP kinase, beta adrenergic receptor genes (ADRBs), mechanistic target of rapamycin (mTOR), NF-κB, calmodulin, CaMKII, CaMKII-delta, *PLN* gene encoding phospholamban, *TTN* gene encoding titin, *CLEC3B* gene encoding tetranectin, *E2F1*, *PSMD4*, and *SMARCA4* genes, NFAT, β adrenoceptor agonist isoprenaline, inflammatory response, autophagy, cell population proliferation, apoptosis, and death.

Major adverse cardiovascular events (MACE) and HCM in their pathophysiology share calcium, (R)-lipoic acid (the most active isomer of a versatile antioxidant, alpha-lipoic acid), *NR3C2* gene encoding mineralocorticoid receptor, and ATPase (a class of enzymes that catalyze the hydrolysis of ATP to ADP). Apart from this, interesting elements found in the intersection and ranked as top nodes for some centrality measures are insulin, *PLN* gene encoding phospholamban, *PSMD4* gene, and TGF-β.

Rehospitalization and HCM share the following aspects: the insulin receptor and glucose decrease mediated by insulin; the ryanodine receptor; the *PSMD4* gene, related to increased death; and the *NLRP3* gene encoding cryopyrin, a regulator of immunity and inflammation (also known as the angiotensin/vasopressin receptor AII/AVP-like), which promotes the proinflammatory cytokine interleukin 1 beta (IL-1B).

#### 3.2.7. The Most Important Shared Elements and Pathways

The most important putative molecular elements and pathways are illustrated with corresponding HCM presentations (Figures 1 and 2).

**Figure 1.** The most important putative molecular elements (**left**) and corresponding HCM presentations (**right**).


**Figure 2.** The most important putative pathways (**left**) and corresponding HCM presentations (**right**).

#### **4. Discussion**

To the best of our knowledge, this is the first study of shared molecular mechanisms of HCM and its clinical presentations.

Although there is much research about molecular mechanisms of HCM, research about molecular mechanisms in its specific clinical presentations is scarce. In the following literature review, we compared our results with evidence from preclinical and clinical literature.

#### *4.1. Shared Molecular Elements and Pathways*

4.1.1. Hypertrophic Cardiomyopathy and Structural Changes

The most important shared mechanisms of cardiomyocyte hypertrophy and HCM are those involved in fibrosis, calcium, and energy homeostasis. ET-1 strongly induces cardiomyocyte hypertrophy in HCM-induced pluripotent stem-cell-derived cardiomyocytes, with ET-1 stimulation specifically inducing NFAT nuclear accumulation [39]. Angiotensin II induces cardiomyocyte hypertrophy [40–43].

The most important shared molecular mechanisms of myofibrillar disarray and HCM are contractile machinery components and those involved in ATP homeostasis, MAPK/ERK, and calcineurin/NFAT signaling. Frustaci et al. (2018) showed that in humans, mutation of sarcomeric α-actin is followed by fibrils disarray and hypertrophy with a disarray of cardiomyocytes, while dysfunction of cytoplasmic α-actin causes a disanchorage of myofibrils from the sarcolemmal membrane, followed by myofibrillolysis. The authors proposed that intercalated discs are particularly involved in this mutation, appearing irregular and fragmented, favoring cell disconnection [44]. Tanaka et al. (2014) showed that endothelin-1 induces myofibrillar disarray in HCM-induced pluripotent stem-cell-derived cardiomyocytes [39].

Our results indicate that the contractile machinery components and mechanisms involved in cell growth and survival are the most prominent mutual molecules and processes involved in cardiomyocyte disarray and HCM. Kraft et al. (2019) suggested that mutations in *MYH7* in heterozygous human HCM contribute to the development of cardiomyocyte disarray by burst-like heterogeneous expressions of both *MYH7* alleles (switched on and off in an independent and stochastic manner), which causes an imbalanced force generation going from cell to cell that disrupts the cardiac syncytium over time (stronger cells overstretch weaker cells) [45]. Schramm et al. (2012) showed that the *PTPN11* loss-of-function mutation Q510E-Shp2 causes cardiomyocyte disarray in HCM, with mTOR activation playing a critical role in the underlying mechanism [46]. James et al. (2000) demonstrated that one of the manifestations of cTnI146Gly mutation in mice is cardiomyocyte disarray [47].

Our results indicate that calcium homeostasis, fibrosis, and inflammation mechanisms are the most important at the intersection of myocardial remodeling and HCM.

The most important shared elements of cardiac remodeling and HCM are implicated in fibrosis, calcium, and energy homeostasis. Freeman et al. (2001) showed that a high overexpression of β2-adrenergic receptor increases remodeling in HCM hearts and that inhibition of β-adrenergic receptor kinase (βARK) reverses hypertrophic remodeling in the HCM hearts [48]. Martins et al. (2015) suggested that the TNNC1-A8V mutant increases the calcium-binding affinity of the thin filament and elicits cellular remodeling [49]. Bi et al. (2021) showed that collagen cross-linking plays an important role in heart remodeling in human hypertrophic obstructive cardiomyopathy, which might be regulated mainly by lysyl oxidase (LOX) [50]. Roldán et al. (2008) suggested that the matrix metalloproteinases have an important role in cardiac remodeling in human HCM [51].

Shared molecules of myocardial fibrosis and HCM are those entangled in calcium and energy homeostasis, fibrosis, cell survival, proliferation, and inflammation. Ho et al. (2010) suggested that, in human HCM, sarcomere mutations trigger an early increase in collagen synthesis; this is initially balanced by degradation, but synthesis exceeds degradation in overt HCM, resulting in myocardial fibrosis (i.e., collagen accumulation in HCM increases as the disease develops) [52]. Kawano et al. (2005) showed that valsartan (an angiotensin II type 1 receptor blocker) suppresses the synthesis of type I collagen in patients with HCM [53]. Arteaga et al. (2009) showed that myocardial fibrosis is prospectively associated with a worse prognosis in patients with HCM [54]. Further, Lim et al. (2001) showed that the blockade of angiotensin II (a known cardiotrophic factor) by losartan reverses myocardial fibrosis in a transgenic mouse model of human HCM [55].

#### 4.1.2. Hypertrophic Cardiomyopathy and Left Ventricular Outflow Tract Obstruction

The few molecular mechanisms found at the intersection of LVOTO and HCM indicate an important role for calcium homeostasis and fibrosis. Bolca et al. (2002) showed that dobutamine induces dynamic LVOTO in patients with hypertrophic non-obstructive cardiomyopathy, proving that dobutamine stress echocardiography is a reliable tool for the diagnosis of dynamic left ventricular obstruction in patients with hypertrophic non-obstructive cardiomyopathy [56]. Mavacamten (a first-in-class cardiac myosin inhibitor) has been evaluated as a promising new therapy in several clinical studies [57–59].

#### 4.1.3. Hypertrophic Cardiomyopathy and Contractile Dysfunction

Although the molecular mechanisms found at the intersection of impaired myocardial relaxation and HCM are scarce, they indicate the leading role of NO homeostasis and a contribution of calcium homeostasis and collagen induction in their common pathogenesis. Cordts et al. (2019) suggested that higher N, N-dimethylarginine (also known as asymmetric dimethylarginine, ADMA) plasma concentrations might lead to a decreased NO production and an impaired myocardial relaxation in HCM patients [24].

Titin and titin-related molecules were found to be important in the intersection of myocardial stiffness and HCM. Higashikuse et al. (2019) suggested that titin mutations in HCM families can be incorporated into the sarcomere and impair TRIM63 (MURF1) binding, resulting in abnormal sarcomere stiffness [60].

Our results indicate that the contractile machinery components and mechanisms involved in calcium, sodium, and cellular energy homeostasis are the most prominent common molecules of diastolic dysfunction and HCM. Diastolic dysfunction in animal and human HCM is characterized by elevated myocardial activation at low diastolic calcium concentrations, i.e., high myofilament calcium-sensitivity [61–64]. In the majority of cases, the high basal (diastolic) myofilament activation is sufficient to slow the onset of ventricular relaxation and limit proper filling [62]. Sequeira et al. (2015) showed that tropomyosin's ability to block myosin-binding sites on actin is reduced in human HCM with thin-filament mutations, and the effect is exacerbated in human HCM samples by the low PKA phosphorylation of myofilament proteins. They also suggested that cMyBP-C HCM-causing mutations reduce the accessibility of myosin for actin [65]. Teekakirikul et al. (2010) suggested that TGF-β signaling is implicated in progressive diastolic dysfunction in HCM [66]. Dweck et al. (2014) suggested that the inability to enhance myofilament relaxation through cardiac troponin I phosphorylation predisposes the heart to abnormal diastolic function [67]. Alves et al. (2015) proposed that troponin I may be an important target for the development of myofilament calcium desensitizers [68]. Further, Granzier et al. (2009) showed that the absence of PEVK region (one of the two major elastic elements of cardiac titin molecule) results in diastolic dysfunction [69].

#### 4.1.4. Hypertrophic Cardiomyopathy and Arrhythmia

Our results suggest that the most important common mechanisms of AF and HCM are calcium and sodium homeostasis in cardiomyocytes. Bongini et al. (2016) suggested that RyR2 malfunction (probably by spontaneous sarcoplasmic reticulum calcium leakage) might represent a general pathophysiologic mechanism for AF initiation and maintenance in human HCM [70]. Nagai et al. (2007) found a significant association between the prevalence of AF and ACE polymorphism in patients with HCM [71].

Our results suggest that the most important shared molecular elements of SCD and HCM are contractile machinery components as well as sodium, calcium, and energy homeostasis mechanisms. Okuda et al. (2018) proved that CaMKII-mediated phosphorylation of RyR2 plays a crucial role in aberrant calcium release as a potent substrate of lethal arrhythmia in HCM-linked Troponin T-mutated hearts [72]. Alterations in calcium cycling are triggers for cardiac arrhythmias—a serious clinical complication of HCM due to the potential to induce SCD [73]. On the other hand, calcium may be involved in the development of cardiac fibrosis, a potential substrate for cardiac arrhythmias and sudden death. In humans, mutations of calcium-related genes (RyR2 and calsequestrin 2) have been identified in families with a history of SCD [74]. Studies with HCM cardiomyocytes differentiated from patient-specific-induced pluripotent stem cells have confirmed that alterations of intracellular calcium handling are associated with arrhythmic events [75]. Coppini et al. (2020) suggested that increased late sodium current (INaL) plays a central role in cellular arrhythmogenicity in HCM (which is confirmed by the antiarrhythmic efficacy
of ranolazine) [76]. Parvatiyar et al. (2012) showed that *TNNC1* mutation A31S, which alters calcium handling, is associated with verified episodes of ventricular fibrillation and aborted SCD, probably due to altered calcium handling and electrophysiologic remodeling of the cardiomyocyte [77]. Additionally, Chung et al. (2011) found that frameshift mutation (c.363dupG) in Troponin C is associated with HCM and SCD [78]. Further, Fahed et al. (2020) showed that p.Arg21Cys mutation in *TNNI3* impairs calcium handling and results in a malignant HCM phenotype characterized by early-onset SCD [79]. HCM caused by mutations in the cardiac troponin T gene (*TNNT2*) has been associated with a high risk of SCD [80]. R58Q mutation of myosin regulatory light chain (*RLC*) is associated with SCD in HCM [81].

#### 4.1.5. Hypertrophic Cardiomyopathy and Ischemia

We found only a few mechanisms common to coronary microvascular dysfunction and HCM. However, they indicate the great importance of energy, calcium, and sodium homeostasis at the intersection of these two pathologies. We suggest that the extracted interaction "glucose is increased/activated by insulin" refers to insulin-dependent glucose transport into cells, and that "calcium increased/activated by calcium" relates to calcium-induced calcium release [82].

The most important elements of the intersection of molecular mechanisms of myocardial ischemia and HCM take part in calcium, sodium, and energy homeostasis, ERK signaling, inflammation, and cell survival.

#### 4.1.6. Hypertrophic Cardiomyopathy and Endpoints

The most important shared molecular elements of HF and HCM are calcium, sodium, and energy homeostasis mechanisms, contractile machinery components, ERK signaling, β-adrenergic receptor mechanisms, and those entangled in fibrosis, cell proliferation, and survival. Mutations of *MYBPC3* gene are a major cause of human cardiomyopathy and associated HF [83]. *MYBPC3* mutations present a high risk for HF [84]. Kissopoulou et al. (2018) showed that homozygous missense *MYBPC3* Pro873His mutation in human HCM is associated with an increased risk of HF development [85]. Chronic administration of β-adrenergic agonists, such as isoproterenol, has been shown to aggravate HCM and induce HF in HCM models of disease [86].

We could not abstract the essence of the intersections of MACE and HCM or rehospitalization and HCM from the corresponding heterogeneous results, probably on account of the diverse pathologies underlying both MACE and rehospitalization.

#### 4.1.7. Calcium in Hypertrophic Cardiomyopathy Presentations

Our results suggest that calcium is among the most important elements in almost all intersections of molecular pathways of HCM and its clinical presentations. Calcium is a key signaling molecule in the cardiac myocyte [74], and imbalances in calcium homeostasis have been described as key characteristics of HCM in numerous reports [73].

#### 4.1.8. The Most Important Shared Elements and Pathways

As expected, at a high level, our results show that cardiomyocyte hypertrophy, myocardial and cardiac remodeling, and myocardial fibrosis; AF and SCD; coronary microvascular dysfunction and myocardial ischemia; and myocardial ischemia and HF share similar molecular mechanisms, which is in line with clinical literature findings on HCM progression [87,88], on the arrhythmic nature of and association between AF and SCD [89–91], on the ischemic nature of and association between coronary microvascular dysfunction and myocardial ischemia [25,92,93], and on the association between myocardial ischemia and HF in HCM [94,95]. The results suggest a more isolated (distinctive) nature of myofibrillar and cardiomyocyte disarray, impaired myocardial relaxation, and myocardial stiffness, which might be, to some extent, a consequence of the relatively low number of available articles and extracted statements, which reduces the ability to identify the most important molecular elements.

#### *4.2. Non-Molecular Factors That Affect Clinical Presentations of HCM*

Phenotypes of HCM are the consequences of complex interactions among a large number of determinants [96]. In addition to molecular mechanisms (including genetic factors), other factors can affect the clinical course and presentations of HCM. Environmental and lifestyle factors, most probably via epigenetic mechanisms, influence HCM phenotype [97–99]. These factors and their interaction in HCM have yet to be fully defined but might include microbial infection, diet [97], or exercise [96–98]. A study by Repetti et al. suggested that epigenetic and environmental factors, rather than background genetic variation, play a major role in hypertrophic remodeling [97].

Incomplete penetrance [96] and haploinsufficiency [84,99] also complicate interpretations of genotype-phenotype associations [96] and the prediction of clinical presentations. Phenotypic effects in cases of incomplete penetrance are even more responsive to the presence of other genetic and environmental factors. Cell-to-cell variability in gene expression and function also affect the HCM phenotype [96].

Physical factors like pressure changes, stretching, and changes in the generation of contraction force also influence the clinical course of HCM [96].

Other known and unknown factors might contribute to the development of different HCM clinical presentations as well.

In this way, this research, with its broad scope, offers the potential to identify molecular targets for environmental factors or lifestyle choices that could delay or change HCM progression.

#### *4.3. General*

All patients with HCM defined according to ESC guidelines [92] were included. No uniform exclusion criteria were applied.

Many molecular elements recognized as important in this research are non-specific and take part in different cardiac processes and diseases. Some of them might be compensatory mechanisms.

With the approach undertaken in the present study, we were able to detect shared mechanisms that might otherwise remain unnoticed. Although we cannot state that shared mechanisms determine or underlie the clinical presentation of HCM, these shared mechanisms have the potential to direct HCM processes or modify the nature of each disease state. Some of them might be novel therapeutic targets or contribute to the development of innovative strategies for treatment. This research also provides the potential to identify patients with specific or non-specific HCM molecular milieu patterns and, with that, to prevent certain complications or anticipate a predisposition to side effects.

In silico studies of molecular interactions rarely provide final answers to questions. Nevertheless, very often, they produce a foundation for further research and initialize the generation of new questions and hypotheses. This work represents only the first step in the dissection of HCM pathogenesis, which could inspire and intensify future research. These results should be used after careful interpretation and critical evaluation of each element of interest in a particular use case.

Thus far, INDRA and the automated extraction of molecular mechanisms have been used in modeling p53 dynamics in response to DNA damage, adaptive drug resistance in BRAF-V600E-mutant melanomas, and the RAS signaling pathway [27].

Based on the literature review, the method applied has the potential to be beneficial in similar use cases. However, there is space for improvement of the technology and its implementation.

#### *4.4. Limitations*

The number of elements in networks (and sets) reflects the quantum of knowledge published on the topic rather than the complexity or granularity of the mechanisms themselves.

The automatic molecular mechanisms extraction approach is not specific (it extracts all molecular interactions from the article with the particular main topic), which is why each interaction should be considered critically.

The automatic extraction of molecular mechanisms sometimes labels gene products with the names of the corresponding genes. Therefore, an element bearing a gene name must be interpreted as the gene itself and/or its product.

Although automated extraction of molecular mechanisms creates a lot of clutter (e.g., elements not representing molecular mechanisms), we suppose that the nature of intersection removes most of the clutter (i.e., a piece of clutter should be present in two intersected sets to appear in results).

Both preclinical and clinical articles were included in the automatic molecular mechanisms extraction. Animal models do not fully replicate human HCM [100]. Our research lacks overall comorbidity information (it is source-article-specific).

#### *4.5. Significance and Implications*

This work collects and represents a quantum of knowledge about shared molecular mechanisms of HCM and its clinical presentations available today. Our results represent neither the final nor a perfect dissection of HCM pathogenesis, yet they offer a transitional solution towards the next step in research on HCM and its clinical presentations. This work represents a wide foundation for further research, where new starting points could be found.

All pathways are presented visually, and thus more intuitively, in one place. This work can be seen as a detailed review of the topic in the form of networks (instead of text), generated automatically (instead of by systematic literature inspection and writing). The pathways in network form enable further analysis, for example, in silico screening of new biomarkers and drug targets, as well as prediction of additional missing links and elements.

Shared pathways are commonly researched using different approaches [101–114]. The novelty in shared pathways research is the application of the new technology, automated molecular mechanisms extraction, to that task. In this research, we were also examining the reach of the technology used for automated extraction of molecular mechanisms from scientific medical literature. This approach is new in deciphering molecular mechanisms of HCM. Some parts of the methodology are taken over from the big data analysis field [14], and this research is one of the first attempts to analyze such massive data in the domain of this specific clinical entity.

This research also confirms that the results produced by the technology are consistent with the information present in the scientific literature at a high level, but also that there is space for improvement of the technology and its implementation.

#### **5. Conclusions**

The most important molecular mechanisms that HCM shares with its clinical presentations are as follows: fibrosis, calcium and energy homeostasis (shared with cardiomyocyte hypertrophy and cardiac remodeling); contractile machinery components, ATP homeostasis, MAPK/ERK, and calcineurin/NFAT signaling (myofibrillar disarray); contractile machinery components and mechanisms involved in cell growth and survival (cardiomyocyte disarray); calcium homeostasis, fibrosis and inflammation mechanisms (myocardial remodeling); calcium and energy homeostasis, fibrosis, cell survival, proliferation and inflammation (myocardial fibrosis); calcium homeostasis and fibrosis (LVOTO); NO and calcium homeostasis, collagen induction (impaired myocardial relaxation); titin and titin-related molecules (myocardial stiffness); calcium and sodium homeostasis in cardiomyocytes (AF); contractile machinery components and mechanisms involved in calcium, sodium, and energy homeostasis (SCD and diastolic dysfunction); energy, calcium and sodium homeostasis mechanisms (coronary microvascular dysfunction); calcium, sodium and energy homeostasis; ERK signaling, inflammation and cell survival mechanisms (myocardial ischemia); calcium, sodium, and energy homeostasis mechanisms; contractile machinery components; ERK signaling; β-adrenergic receptor mechanisms; mechanisms entangled in fibrosis, cell proliferation and survival (HF). These mechanisms represent possible processes underlying different HCM clinical presentations, and some of them might be novel therapeutic targets.

This work collects and represents a quantum of knowledge about shared molecular mechanisms of HCM and its clinical presentations available today.

The applied methodology produced results consistent with the information in the scientific literature at a high level, but there is space for improvement of the technology and its implementation.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/life11080785/s1, Supplementary Table S1: The most important nodes in the networks, Table S2: Cooperatively working elements (functional modules), Figure S1: The most important nodes in networks represented as packed concentric rings sorted by the most important nodes.

**Author Contributions:** Conceptualization, M.G. and L.V.; methodology, M.G. and L.V.; software, M.G.; formal analysis, M.G.; investigation, M.G. and L.V.; resources, M.G. and L.V.; data curation, M.G.; writing—original draft preparation, M.G.; writing—review and editing, M.G. and L.V.; visualization, M.G.; supervision, L.V.; project administration, M.G. and L.V.; funding acquisition, L.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research has received funding from the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No. 777204 (www.silicofcm.eu, accessed on 1 August 2021). This article only reflects the author's view. The Commission is not responsible for any use that may be made of the information it contains.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data are contained within the article and supplementary material.

**Acknowledgments:** The authors are thankful to John Bachman, Harvard Medical School, for providing access to the INDRA Database.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Impact of Water Temperature on Heart Rate Variability during Bathing**

**Jianbo Xu and Wenxi Chen \***

Biomedical Information Engineering Laboratory, The University of Aizu, Aizu-Wakamatsu 965-8580, Japan; d8211103@u-aizu.ac.jp

**\*** Correspondence: wenxi@u-aizu.ac.jp

**Abstract:** Background: Heart rate variability (HRV) is affected by many factors. This paper aims to explore the impact of water temperature (WT) on HRV during bathing. Methods: The bathtub WT was preset at three conditions: low WT (36–38 °C), medium WT (38–40 °C), and high WT (40–42 °C). Ten subjects participated in the data collection. Five electrocardiogram (ECG) recordings were collected from each subject at each preset bathtub WT condition. Each recording was 18 min long with a sampling rate of 200 Hz. In total, 150 ECG recordings and 150 WT recordings were collected. Twenty HRV features were calculated from 1-min ECG segments. The k-means clustering analysis method was used to analyze the rough trends based on the preset WT. Analyses of the significant differences were performed using multivariate analysis of variance and *t*-tests, and the mean and standard deviation (SD) of each HRV feature based on the WT were calculated. Results: The statistics show that with increasing WT, 11 HRV features decrease significantly (*p* < 0.05) and monotonically, four HRV features increase significantly (*p* < 0.05) and monotonically, two HRV features first rise and then fall, two HRV features (fuzzy and approximate entropy) remain almost unchanged, and vLF power rises. Conclusion: The WT has an important impact on HRV during bathing. The findings reveal an important physiological factor that affects the dynamic changes of HRV and contribute to better quantitative analyses of HRV in future research.

**Keywords:** water temperature; bathing; ECG; heart rate variability; quantitative analysis; *t*-test

#### **1. Introduction**

Heart rate variability (HRV) is an important indicator of physical and mental health. The instantaneous HRV rhythm represents a dynamic balance between the sympathetic nervous system (SNS) and parasympathetic nervous system (PNS) branches of the autonomic nervous system (ANS) [1]. Therefore, the quantitative analysis of HRV is considered an effective method for the diagnosis of many cardiac diseases in clinical applications. However, many internal and external factors affect HRV. The internal factors include mental stress, gender, age, and disease, while the external factors include sleep, drugs, alcohol, smoking, and diet.

#### *1.1. HRV and Stress*

The SNS branch of the ANS was more activated during states of mental stress [2]; therefore, some studies evaluated mental stress using HRV analyses based on different stressors [3–12]. Some papers confirmed that the HR increased significantly during stress states [5,7,13–20], while others found that the mean R-R intervals (RRIs) [5,7–10,12,14–16,21] and the square root of the mean of the squares of the successive differences (RMSSD) between adjacent normal-to-normal intervals (NNs) [8,10,22–29] were significantly reduced during stress states. Kofman et al. discovered that the percentage of low frequency power in total power, pLF, was significantly higher, while the percentage of high frequency power in total power, pHF, was significantly reduced during an examination stress state [4]. Melillo et al. found that the LF/HF ratio was significantly higher in the normal estimated glomerular filtration rate [30], while Hjortskov et al. proved that the LF/HF ratio was significantly higher during computer work stress states [3].

**Citation:** Xu, J.; Chen, W. Impact of Water Temperature on Heart Rate Variability during Bathing. *Life* **2021**, *11*, 378. https://doi.org/10.3390/life11050378

Academic Editors: Md. Altaf-Ul-Amin, Shigehiko Kanaya, Naoaki Ono and Ming Huang

Received: 30 March 2021 Accepted: 20 April 2021 Published: 22 April 2021
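The time-domain measures recurring in this literature, mean RRI and RMSSD, follow directly from the definitions above. A minimal Python sketch (the interval values are made up for illustration and do not come from any of the cited studies):

```python
import numpy as np

def mean_rri(rri_ms):
    """Mean R-R interval, in milliseconds."""
    return float(np.mean(rri_ms))

def rmssd(rri_ms):
    """Square root of the mean of the squares of the successive
    differences between adjacent normal-to-normal intervals."""
    diffs = np.diff(rri_ms)
    return float(np.sqrt(np.mean(diffs ** 2)))

# Hypothetical R-R interval series (ms), for illustration only
rri = np.array([812.0, 800.0, 790.0, 804.0, 820.0, 808.0])
mean_val = mean_rri(rri)   # arithmetic mean of the six intervals
rmssd_val = rmssd(rri)     # dispersion of the successive differences
```

Both quantities shrink when the beat-to-beat rhythm becomes more uniform, which is why they are reported as reduced under stress.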

#### *1.2. HRV and Gender and Age*

The HRV dynamically changes with aging and gender. Ramaekers et al. and Schwartz et al. discovered that some HRV parameters decreased with aging, while Ramaekers et al. confirmed that the gender differences in the HRV parameters only exist in subjects younger than 40 years old [31,32]. Lochner et al. found that women's HRV was significantly lower than men's HRV [33]. Davy et al. observed that physically active women had higher levels of cardiac baroreflex sensitivity and HRV compared with sedentary women regardless of age [34]. Nagy et al. proved that gender differences determined HRV differences from birth, while boys' HR baseline was significantly lower than that for girls [35]. Bonnemeier et al. noted that gender differences in HRV were significantly reduced with aging [36]. Yamasaki et al. discovered that LF was highly determined by aging and the pLF of men was significantly higher than that of women [37].

#### *1.3. HRV and Disease*

The HRV differs between healthy people and patients. Wilkowska et al. found that the HRV of depressed patients was significantly lower than that of nondepressed patients [38]. Lutfi and Sukkar showed that lower HRV was associated with higher BP values, putting subjects with such trends at a higher risk of developing hypertension [39]. Tombul et al. confirmed that HRV was lower in multiple sclerosis patients than in healthy subjects [40]. Gurses et al. observed that some time domain parameters (mean RRIs, SDNN, RMSSD, and pNN50) were significantly lower in thalassaemic patients than in healthy subjects [41]. Lan et al. found that the mean RRIs significantly increased, while LF% and LF/HF significantly decreased, in patients with allergic rhinitis in the sitting position [42]. DelRosso et al. investigated obstructive sleep apnea and found that it resulted in increased sympathetic activation during sleep in children [43].

#### *1.4. HRV and Sleep*

Sleep has an important impact on HRV. Herzig et al. discovered that the HR was higher during REM sleep than during slow wave sleep (deep sleep) [44]. Padole and Ingale found that the HRV was distinguishable among the normal, sleeping, and meditation states [45]. Arslan et al. confirmed that sleep deprivation resulted in a significant decrease in HF, TP, the standard deviation (SD) of NN intervals (SDNN), and pNN50, with a concomitant increase in the LF/HF ratio [46]. Sűdy et al. confirmed that the HRV during sleep on workdays and free days was significantly different in young healthy men with social jetlag [47].

#### *1.5. HRV and Other Factors*

Many other factors also affect HRV. Hynynen et al. proved that the HRV of healthy men was significantly decreased, and the HR was significantly increased at night after marathon or moderate exercise sessions [48]. James et al. learned that the level of HRV significantly changed after severe intensity exercise [49]. Zuanetti et al. discovered that HRV significantly varied after patients took antiarrhythmic drugs [50]. Murgia et al. confirmed that HRV significantly increased after smoking cessation [51]. Young et al. learned that diet played an important role in the link between mood and HRV [52]. Latha et al. learned that classical music had a beneficial effect on HRV and reduced medical students' stress levels [53]. Sollers et al. investigated the varying ambient temperature and found it had an important impact on the HRV [54]. Shin proved that ambient temperature induced a significant difference in pulse rate variability compared to HRV, and that the difference became greater at a higher ambient temperature [55].

Some previous studies explored the impact of water temperature (WT) on HRV. Mourot et al. and Choo et al. found that immersion at different WTs had an important impact on HRV [56,57]. Kataoka et al. studied the impact of WT on HRV during bathing, but only 38 °C and 41 °C were included, and only a few HRV measures were evaluated [58]. Edelhäuser et al. investigated the effects of whole-body immersion on HRV at three different WTs (33 °C, 36 °C, and 39 °C) [59].

This paper aims to explore the impact of different WTs on HRV during bathing. The experiment was carried out using the WTs most commonly encountered in daily family life, and twenty HRV features (spanning the time, frequency, and non-linear domains) were calculated.

#### **2. Method**

#### *2.1. ECG Collection System*

The electrocardiogram (ECG) collection system in this study includes four rectangular stainless steel noncontact electrodes, all of them placed on the bathtub wall. When the subject is in the bathtub during bathing, the four noncontact electrodes are near the right foot, right arm, left foot, and left arm, respectively.

The electricity on the skin surface, which is produced by the electrical activity of the heart, arrives in the four noncontact electrodes through the water and three-lead ECG are recorded. The lead I ECG is the potential difference between the left arm (positive) and right arm (negative), the lead II ECG is the potential difference between the left foot (positive) and right arm (negative), and the lead III ECG is the potential difference between the left foot (positive) and left arm (negative). Four shielded wires are, respectively, welded onto the four noncontact electrodes. The three-lead ECG arrives in the ECG collection monitor (Open Brain Computer Interface Biosensing Ganglion Board-OpenBCI Ganglion; OpenBCI, USA) through the shielded wires, and the ECG monitor and the laptop (a MacBook Pro) are connected using a standard Bluetooth 4.n, and all the collected ECG recordings are stored on the laptop. The designed ECG collection system in this study is shown in Figure 1 [60].
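The three bipolar leads defined above are not independent: by construction, lead III equals lead II minus lead I (Einthoven's law), which follows directly from the potential differences. A short Python sketch with synthetic electrode potentials (illustrative values only, not measured data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic skin-surface potentials picked up by the noncontact
# electrodes near the left arm, right arm, and left foot
v_la, v_ra, v_lf = rng.normal(size=(3, 1000))

lead_i = v_la - v_ra    # left arm (+) minus right arm (-)
lead_ii = v_lf - v_ra   # left foot (+) minus right arm (-)
lead_iii = v_lf - v_la  # left foot (+) minus left arm (-)

# Einthoven's law: lead III = lead II - lead I
assert np.allclose(lead_iii, lead_ii - lead_i)
```

In practice this redundancy is useful for sanity-checking a recording: if the three derived leads do not satisfy the identity, an electrode contact is suspect.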

**Figure 1.** ECG collection system.

#### *2.2. Subjects and ECG Recordings*

The ECG recording procedures were approved by the Public University Corporation, the University of Aizu Research Ethics Committee. Written informed consent was obtained from each participant before the experiment.

Ten subjects (seven men and three women) aged 23 to 40 years old (mean ± SD: 28.5 ± 4.78 years) who were students from the University of Aizu participated in the data collection. The blood pressure (BP), body temperature, and body weight were recorded before and after the ECG collection, and the temperature profiles of the WT and room temperature were recorded every second during the ECG collection using a temperature monitor (TR-71wb/nw; T&D Corporation, 817-1, Shimadachi, Matsumoto, Nagano, Japan, 390-0852).

The preset bathtub WT includes three conditions: low WT (36–38 °C), medium WT (38–40 °C), and high WT (40–42 °C). Five ECG recordings were collected from each subject under each preset bathtub WT condition, and each recording was 18 min long with a sampling rate of 200 Hz. In total, 150 ECG recordings and 150 temperature recordings were collected during bathing.

#### *2.3. ECG Processing*

The flowchart for the ECG processing, HRV feature calculation, and statistical analysis is shown in Figure 2.

**Figure 2.** Flowchart for the ECG processing, HRV feature calculation, and statistical analysis.

All data processing and analyses were performed using MATLAB (R2019a). Baseline wandering is obvious in the raw ECG signal due to motion artifacts and respiration; therefore, the wandering baseline was removed using one-dimensional (1-D) wavelet decomposition and reconstruction. A Daubechies wavelet was used to decompose the raw ECG signal to level 10, and the baseline-wandering approximation, reconstructed at level 8, was subtracted from the raw ECG signal.
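The authors used MATLAB's Daubechies wavelet tools; the same subtract-the-approximation idea can be sketched in pure Python with a Haar wavelet (a deliberately simplified stand-in for the Daubechies filters, assuming the signal length is a multiple of 2^level):

```python
def haar_baseline(x, level):
    """Estimate the slow baseline of x as its level-`level` Haar
    approximation, upsampled back to the original length."""
    approx = list(x)
    for _ in range(level):  # pairwise averaging = Haar approximation step
        approx = [(approx[2 * i] + approx[2 * i + 1]) / 2
                  for i in range(len(approx) // 2)]
    baseline = []
    for v in approx:        # piecewise-constant reconstruction
        baseline.extend([v] * (2 ** level))
    return baseline

def remove_baseline(x, level=3):
    """Subtract the estimated wandering baseline from the signal."""
    b = haar_baseline(x, level)
    return [xi - bi for xi, bi in zip(x, b)]
```

A constant offset is removed exactly, and a slow ramp is reduced to zero-mean residuals within each reconstruction block, which is the behavior the baseline-removal step relies on.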

Obvious hum noise was also observed in the raw ECG signal; therefore, we performed a spectrum analysis on the raw ECG signal. The spectrum analysis results show that the main frequency component of the hum noise was 50 Hz, which is mainly produced by electromagnetic interference between the power supply network and the equipment [61]. A second-order infinite impulse response digital notch filter was used to remove the 50 Hz hum noise. The numerator and denominator coefficients of the digital notch filter, with the notch located at *ω* and a bandwidth of 0.0071 at the −3 dB level, were calculated, where *ω* must satisfy 0.0 < *ω* < 1.0. The difference equation of the digital notch filter is shown in Equation (1).

$$y[n] = \sum_{i=0}^{N} b_i x[n-i] - \sum_{i=1}^{M} a_i y[n-i], \qquad n \ge 0 \tag{1}$$

where *x*[*n*] is the filter input, *y*[*n*] is the filter output, and *bi* and *ai* are the numerator and denominator coefficients, respectively, of the digital notch filter.
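Equation (1) can be implemented directly in Python. The coefficient design below is a textbook constrained-pole second-order notch (not necessarily the exact coefficients the authors computed); for a 50 Hz notch at a 200 Hz sampling rate, the normalized frequency is *ω* = 50/(200/2) = 0.5:

```python
import math

def notch_coeffs(w, r=0.99):
    """Second-order IIR notch at normalized frequency w (0 < w < 1,
    where 1 corresponds to the Nyquist frequency). The pole radius r
    controls the notch bandwidth (closer to 1 = narrower)."""
    theta = math.pi * w
    g = (1 + r * r) / 2                          # gain normalization
    b = [g, -2 * g * math.cos(theta), g]         # numerator (feedforward)
    a = [1.0, -2 * r * math.cos(theta), r * r]   # denominator (feedback)
    return b, a

def iir_filter(b, a, x):
    """Direct implementation of the difference equation (1)."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, len(a)) if n - i >= 0)
        y.append(acc)
    return y

# 50 Hz hum sampled at 200 Hz sits exactly at w = 0.5:
b, a = notch_coeffs(0.5)
hum = [math.sin(math.pi / 2 * n) for n in range(4000)]
filtered = iir_filter(b, a, hum)
```

After the transient dies out, the 50 Hz tone is driven to zero (the filter's zeros lie on the unit circle at the notch frequency), while a DC signal passes with unit gain.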

Next, a 5-point moving average was used to smooth the ECG signal. The mathematical formula for the moving average is shown in Equation (2):

$$y[n] = \frac{1}{M} \sum_{j=0}^{M-1} x[n-j] \tag{2}$$

where *x*[*n*] is the input signal, *y*[*n*] is the output signal, and *M* is 5.
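Equation (2) is a causal M-point average; a direct Python sketch (the truncated-window handling at the start of the signal is an assumption, not specified in the text):

```python
def moving_average(x, M=5):
    """Causal M-point moving average: each output sample is the mean of
    the current sample and the previous M-1 samples. Near the start of
    the signal, the average uses only the samples available so far."""
    y = []
    for n in range(len(x)):
        window = x[max(0, n - M + 1): n + 1]
        y.append(sum(window) / len(window))
    return y
```

For example, with input 1, 2, 3, 4, 5, 6, 7 and M = 5, the fifth output sample is the mean of 1–5, i.e., 3.0.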

Then, the ECG was normalized into the range of 0 to 1 using the "mapminmax" function, the R peaks were detected using the "findpeaks" function, and the RRI were calculated. The RRI outliers were removed using a 1-D 11th-order median filter because of its outstanding capability to suppress isolated outliers without blurring sharp changes in the original signal.
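The normalization step maps the signal linearly into [0, 1]; a pure-Python equivalent of this rescaling (a sketch, not MATLAB's "mapminmax" itself):

```python
def minmax_normalize(x):
    """Linearly rescale a sequence into the range [0, 1]."""
    lo, hi = min(x), max(x)
    if hi == lo:                 # constant signal: avoid division by zero
        return [0.0 for _ in x]
    return [(v - lo) / (hi - lo) for v in x]
```

The minimum of the input maps to 0, the maximum maps to 1, and all other samples fall proportionally in between.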

The mathematical formula of the 1-D 11th-order median filter is shown in Equation (3):

$$y[i] = \operatorname{median}\{x[j],\; j \in w\} \tag{3}$$

where *x* is the input signal, *y*[*i*] is the output signal, and *w* is the moving window centered at sample *i*. The results for each ECG processing step are shown in Figure 3.
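Equation (3) with an 11-sample window can be sketched as follows (the truncated window at the signal edges is an assumption, not the authors' stated edge handling; the RRI values below are invented, in seconds):

```python
from statistics import median

def median_filter(x, order=11):
    """1-D median filter: replace each sample with the median of the
    window centered on it. Isolated spikes are suppressed while genuine
    step changes in the signal are preserved."""
    half = order // 2
    return [median(x[max(0, i - half): i + half + 1])
            for i in range(len(x))]

# An isolated RRI outlier (the 2.40 s interval, e.g., a missed R peak)
# is removed while the surrounding ~0.8 s intervals are untouched:
rri = [0.80, 0.81, 0.79, 0.80, 0.82, 2.40,
       0.81, 0.80, 0.79, 0.81, 0.80, 0.82]
clean = median_filter(rri)
```

Because the outlier occupies only one sample of each 11-sample window, every window median stays near 0.80 s, which is exactly the "suppress isolated outliers without blurring" property described above.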

**Figure 3.** Output of the ECG processing steps.

#### *2.4. HRV Analysis*

HRV analysis methods include linear and nonlinear domain methods, where the linear domain includes time- and frequency-domain methods. The HRV features in the time domain include HR, mean RRI, SDNN, RMSSD between adjacent NNs, the SD of the successive differences between adjacent NNs (SDSD), and the area under the RRI (AURRI). The HRV features in the frequency domain include very-LF (VLF) power (0.003–0.04 Hz), LF power (0.04–0.15 Hz), HF power (0.15–0.4 Hz), total power (0–0.4 Hz), pLF, pHF, and the LF/HF ratio. The HRV features in the nonlinear domain include the correlation dimension (D2), the SD of the Poincaré plot perpendicular to the line of identity (SD1), the SD of the Poincaré plot along the line of identity (SD2), the SD1/SD2 ratio, and the sample (SE), fuzzy (FE), and approximate (AE) entropies.
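For the time-domain features, a small pure-Python sketch of mean RRI, HR, SDNN, RMSSD, and SDSD (RRI in milliseconds; these are the standard feature definitions, not the authors' own code):

```python
from statistics import mean, stdev

def time_domain_features(rri_ms):
    """Compute basic time-domain HRV features from an RRI series (ms)."""
    diffs = [b - a for a, b in zip(rri_ms, rri_ms[1:])]
    return {
        "mean_rri": mean(rri_ms),
        "hr": 60000.0 / mean(rri_ms),               # beats per minute
        "sdnn": stdev(rri_ms),                      # SD of all intervals
        "rmssd": mean(d * d for d in diffs) ** 0.5, # RMS of differences
        "sdsd": stdev(diffs),                       # SD of the differences
    }

feats = time_domain_features([800.0, 810.0, 790.0, 805.0])
```

For this 4-interval example, the mean RRI is 801.25 ms (HR ≈ 74.9 bpm) and the successive differences are +10, −20, and +15 ms, giving RMSSD = √(725/3) ≈ 15.55 ms.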

Before the frequency features are calculated, the RRI series must be resampled. According to Nyquist's sampling theorem, the sample rate must be more than two times the highest frequency in the original signal. The highest frequency of the HRV is 0.4 Hz; therefore, the new resampling rate for the RRI was set at 2 Hz. Then, a discrete Fourier transform (DFT) was used to calculate the power spectral density (PSD) of the resampled RRI for an *N*-point sequence. The DFT is shown in Equation (4):

$$X[k] = \sum_{n=0}^{N-1} x[n]e^{-i\frac{2\pi}{N}nk},\tag{4}$$

where *k* = 0, 1, 2, ... , *N*−1, and *i*<sup>2</sup> = −1.

#### *2.5. Statistical Analysis of the HRV Features*

Each ECG recording was 18 min long and was segmented into 18 equal parts, so each 1-min ECG segment was used to calculate one set of HRV features. For each bathtub WT condition, the mean and SD of each HRV feature were calculated, and the *t*-test was used to test for significance. The summary statistics of each HRV feature are visualized in the clustering results and box plots.
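The segmentation step is straightforward: an 18-min recording at 200 Hz has 216,000 samples, split into 18 one-minute segments of 12,000 samples each. A minimal sketch (the placeholder zero signal stands in for a real recording):

```python
def segment(signal, fs=200, seg_seconds=60):
    """Split a signal into consecutive equal-length segments."""
    n = fs * seg_seconds                     # samples per segment
    return [signal[i: i + n] for i in range(0, len(signal), n)]

ecg = [0.0] * (18 * 60 * 200)               # placeholder 18-min recording
segments = segment(ecg)                      # 18 segments of 12,000 samples
```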

#### **3. Results**

The variations of the HRV features under different WTs are shown in Figure 4. The smaller blue, yellow, and green dots represent the HRV features calculated for each preset WT, while the bigger black dots are the average values of the HRV features for each preset WT, calculated by the K-means clustering method. For the HR dots, the blue area is the smallest, the yellow area is the biggest, and the green area is medium sized. The low WT condition is very close to the average temperature of the normal human body (about 36.5 °C); therefore, the WT stimulation to the subjects was weak, with a small variation in HR (SD of HR = 3.38). The high WT produced a stronger stimulation during bathing; although the instantaneous HR was very fast under this condition, the HRV was not the largest (SD of HR = 4.17). The stimulation under the medium WT condition was stronger than under the low WT condition but weaker than under the high WT condition, and the HRV was the most obvious (SD of HR = 4.65); therefore, the area of the yellow dots was the biggest.

Figure 4 shows that the WT condition was not strictly or uniformly controlled. In fact, the low WT was not strictly and evenly distributed in the range of 36–38 °C and was far below the ambient temperature during bathing. The WT of 42 °C was far beyond the ambient temperature during bathing; thus, the WT decreased quickly during the data collection, and the WT data at about 42 °C were not as concentrated. It is obvious that D2, HF power, total power, pHF, mean RRI, SDNN, RMSSD, SDSD, AURRI, SD1, and SD2 decreased monotonically with increasing WT, while pLF, LF/HF, HR, and SD1/SD2 rose monotonically with increasing WT. A significant difference (*p* < 0.05) was found for the above HRV features.

The results of the significant-difference analyses for the 20 HRV features in the three analysis domains under the three WT conditions are visualized in Figure 5. There are some outliers for each HRV feature. For the HR, the higher the WT, the more outliers, because the subjects experienced stronger stimulation from the higher WT and found it more difficult to adapt to the WT environment during bathing. The changes in the means of VLF, LF, SE, FE, and AE were not obvious, and no significant differences were found for these five HRV features.

The details of the statistical results of the 20 HRV features in the time, frequency, and nonlinear domains are shown in Table 1, where the mean values and SDs are listed and the pairwise statistically significant differences between the WT conditions are calculated. The significant-difference analysis was performed via multivariate analysis using the *t*-test, where *p1* represents the significant difference between the low and medium WT conditions, *p2* represents the significant difference between the medium and high WT conditions, and *p3* represents the significant difference between the low and high WT conditions. With increasing WT, the SDs of the HR, mean RRI, AURRI, pLF, pHF, LF/HF ratio, and SD1/SD2 first rose and then fell, while the SDs of LF, HF, and TP first fell and then rose.

**Figure 5.** Analysis of significant differences for 20 HRV features in three analysis domains under three WT conditions: A = (36–38) °C, B = (38–40) °C, and C = (40–42) °C.

**Table 1.** Statistical results of the HRV features under different bathtub WT conditions.


#### **4. Discussion**

As a noninvasive, rapid, and accurate tool for evaluating cardiac autonomic modulation, heart rate variability (HRV) has been a hot research topic in recent years. This study explored the impact of different water temperatures (WTs) on HRV during bathing. As the WT rose, the HR in the medium and high WT conditions increased by 6.53% and 15.78%, respectively, compared with the low WT condition, which reflects decreased vagal modulation. The significantly and monotonically reduced SDNN with increasing WT shows a significantly reduced overall HRV fluctuation, which is highly consistent with the significantly and monotonically reduced total spectral power (0–0.4 Hz). The LF power (0.04–0.15 Hz) in the PSD reflects both SNS and PNS activities, the HF power (0.15–0.4 Hz) reflects PNS activity, and the LF/HF ratio represents the balance between the SNS and PNS activities. With increasing WT, the LF and HF were significantly and monotonically reduced, reflecting significantly suppressed PNS activity. The increased LF/HF ratio shows that the ratio of cardiac sympathetic to parasympathetic tone (i.e., the sympathovagal balance) shifted significantly toward sympathetic dominance, indicating that the stimulation of the high WT on the subjects was also enhanced significantly. The stimulation on the subjects under the high WT increased by 6.43% and 5.20% over the low and medium WT conditions, respectively, as shown in Table 1. Furthermore, the HRV feature AURRI was newly defined in this paper; its unit is *s*<sup>2</sup>. The AURRI reflects the fluctuation of the HRV signal over time: with increasing WT, the mean RRI and AURRI were reduced. Under the low WT condition, parasympathetic activity is dominant. With increasing WT, our findings show decreased HRV complexity, which indicates an obvious shift of the ANS balance toward sympathetic activation associated with vagal withdrawal.
Therefore, a higher WT can induce a stronger physiological allostatic regulatory response, which is often accompanied by enhanced cardiac sympathoexcitation associated with vagal withdrawal. From a healthcare perspective, to reduce the possibility of sudden onset of cardiac and cardiovascular complications or diseases during bathing, patients should avoid the higher WT conditions, which are more dangerous.

The same WT belonging to different WT change processes induces different impacts on HRV. For example, if the WT drops from 40 °C to 38 °C during data collection, the subject will feel very uncomfortable in the first minute and need a longer time to adapt to the WT environment. However, if the WT rises from 38 °C to 40 °C during data collection, the subject will adapt to the WT environment easily; even when the WT reaches 40 °C, the subject will not feel very uncomfortable because they have already adapted to this WT environment. The WT of 40 °C thus appeared in two different processes but had very different instantaneous effects on the HRV, and its physiological meanings also differed between these processes. Therefore, some outliers appear in the box plots shown in Figure 5.

According to the experiment records, unlike the other subjects, Subjects 4, 7, and 8 did not feel very uncomfortable even in the high WT condition (40–42 °C). Their slopes of HR variation are smaller than those of the other seven subjects, as shown in Table 2. Specifically, Subject 7 preferred the higher WT, and the change in their HR was not as obvious as in the other subjects, as shown in Figure 6. The questionnaire showed that Subject 7 often participated in regular physical activity. Regular exercise can make the sympathetic-adrenal system more easily excited, thus enhancing cardiovascular, hormonal, and metabolic responses, further affecting body temperature regulation, water-electrolyte balance, muscle contraction performance, etc., thereby ensuring blood perfusion, oxygen and nutrient supply, and elimination of metabolites in organs and tissues throughout the body. There is evidence that exercise can reduce sympathoexcitation and sympathetic outflow, and that baroreflex-mediated responses are also suppressed. Therefore, compared with the other subjects, Subject 7 demonstrated higher HRV, and their reaction to the higher WT indicated great adaptability of the ANS.


**Table 2.** The slope of HR variation with increasing WT.
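The slopes in Table 2 (and the first-degree fits in Figure 6) come from ordinary least squares; a minimal Python version (the WT/HR sample points below are invented for illustration, not measured values):

```python
def linear_slope(xs, ys):
    """Least-squares slope of y on x (first-degree polynomial fit)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Hypothetical mean HR readings at the three WT condition midpoints:
slope = linear_slope([37.0, 39.0, 41.0], [62.0, 66.0, 72.0])
```

A smaller slope means the subject's HR rose less per degree of WT, which is how the less-sensitive subjects (4, 7, and 8) stand out in Table 2.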

Subjects 1, 6, and 9 were very sensitive to changes in WT and, in particular, could not tolerate the high bathtub WT (40–42 °C). They felt more comfortable during the 4th–11th min of the data collection process; the first 3 min were their adaptation phase, during which their foreheads quickly began to sweat heavily. For the other seven subjects, the adaptation phase was only the first minute, and they began to sweat more after the first 10 min in the same high WT environment. The body weights of these three subjects decreased more after data collection than those of the other seven subjects. From this finding, we speculate that people who are more sensitive to temperature changes are less able to withstand water and WT pressures and are more likely to suffer from higher mental and physical stress under the higher WT condition.

Subjects 2, 3, 5, and 10 felt a little uncomfortable but could tolerate the high WT (40–42 °C). All ten subjects could quickly adapt to the low WT (36–38 °C), and they felt more relaxed and comfortable in the medium WT (38–40 °C). Except for the three subjects who were very sensitive to the WT changes, the other seven subjects did not feel uncomfortable.

Although some discoveries were revealed in this paper, there are also some limitations. First, the set-up of the experiment is stressful in itself and therefore may create an additional bias. Although informed consent was obtained, the data collection process was explained in detail to each subject, and each subject had a five-minute rest before the data collection, some subjects were still a little nervous at the beginning of data collection. Furthermore, each subject's sensitivity to external stimuli was different, and the water pressure on the chest and the thermal stimulus on hemodynamics also induced different stress in each subject, which would introduce additional biases into the results. Therefore, the mental stress factor should also be taken into consideration when evaluating the impact of WT on HRV. Second, the number of subjects is too small, and the subjects should include older people and children, in addition to healthy and unhealthy individuals and different ethnicities. Third, the ranges of WT change during data collection were too big. Due to poor insulation measures, the WT was relatively divergent during the data collection process; thus, the HRV analysis should be performed over several smaller WT ranges. Fourth, the data collection environment was inconsistent across subjects. For example, when the WT was between 40 °C and 42 °C, some subjects could endure the high WT environment, while others felt too uncomfortable to endure it; therefore, to be safe, we had to give these subjects a fan to blow a refreshing cool breeze on themselves. Fifth, during the data-processing stage, the median filter was used three times to remove the outliers of the RRI signal. The skin-surface electrical potential is very weak, on the order of millivolts, and gentle movement of the four limbs induces relatively large fluctuations in the ECG amplitude. Therefore, the raw ECG signal includes some noise, and there are some outliers in the R-peak detection and RRI results. If the median filter were used only once to remove the RRI outliers, then either only the outliers with large amplitude could be removed or the RRI signal would be grossly distorted after outlier removal. Therefore, the median filter was applied to remove the outliers with large, medium, and small amplitudes, respectively.

**Figure 6.** The variation of HR and its first-degree fitted curve for each subject under the three WT conditions.

#### **5. Conclusions and Future Work**

This paper explores the impact of different WTs on HRV during bathing. With increasing WT, some HRV features are significantly and monotonically reduced or raised, which indicates a change in the dynamic balance between the SNS and PNS branches of the ANS. The findings of the present study provide an important reference for many practical applications that need to evaluate the amount of disturbance of homeostasis induced by WTs. For example, we can affect the HRV by changing the WT to set an optimal environment during bathing: only when the SNS and PNS activities are kept within a certain range can people feel relaxed and comfortable.

In future research, we will further explore the HRV levels of healthy subjects and patients, especially patients with cardiac diseases (such as arrhythmia, myocardial ischemia, and coronary heart disease), and then design an automated and accurate WT control system that affects the HRV by changing the WT, so that the HRV level is indirectly kept within a safe and comfortable range based on the individual's health condition, which would reduce the possibility of sudden onset of cardiac disease during bathing. Moreover, in order to achieve the purpose of lifelong healthcare, we will also explore how to apply cutting-edge blockchain technology to the long-term tracking of ECG data during bathing for big-data collection and analysis [62,63]. Another particularly crucial research topic is physiological-signal encryption and secure transmission for privacy protection, for which some emerging technologies provide a valuable reference [64,65].

**Author Contributions:** Conceived and designed the idea: J.X., W.C.; data collection and analysis, paper writing, and bibliographic research: J.X.; supervision and manuscript revision: W.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported in part by the University of Aizu's Competitive Research Fund (2020-P-24).

**Institutional Review Board Statement:** The ECG recording procedures were approved by the Public University Corporation, the University of Aizu Research Ethics Committee.

**Informed Consent Statement:** Written informed consent was obtained from each participant before the experiment.

**Data Availability Statement:** Data are available from the biomedical information engineering laboratory (contact via e-mail: wenxi@u-aizu.ac.jp) for researchers who meet the criteria for access to confidential data.

**Acknowledgments:** The authors thank all participants for their cooperation during the data collection.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Life* Editorial Office E-mail: life@mdpi.com www.mdpi.com/journal/life
