**Recent Trends in Computational Research on Diseases**

Editors

**Md. Altaf-Ul-Amin Shigehiko Kanaya Naoaki Ono Ming Huang**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors* Md. Altaf-Ul-Amin Nara Institute of Science and Technology Japan

Shigehiko Kanaya Nara Institute of Science and Technology Japan

Naoaki Ono Nara Institute of Science and Technology Japan

Ming Huang Nara Institute of Science and Technology Japan

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Life* (ISSN 2075-1729) (available at: https://www.mdpi.com/journal/life/special_issues/computational_diseases).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-3230-1 (Hbk) ISBN 978-3-0365-3231-8 (PDF)**

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


## **About the Editors**

**Md. Altaf-Ul-Amin** received his B.Sc. degree in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology (BUET), Dhaka, his M.Sc. degree in Electrical, Electronic and Systems Engineering from Universiti Kebangsaan Malaysia (UKM), and his Ph.D. degree from the Nara Institute of Science and Technology (NAIST), Japan. He previously worked at several universities in Bangladesh, Malaysia, and Japan. Currently, he is an associate professor in the Computational Systems Biology Lab of NAIST. His major research topics are Network Biology, Systems Biology, Cheminformatics, and Biological Databases. He developed and implemented the novel graph clustering algorithms DPClus and DPClusO.

**Shigehiko Kanaya** received his B.Sc. degree in Bioscience from the Science University of Tokyo, Japan in 1985, and his Ph.D. from Toyohashi University of Technology, Japan in 1990. He served as an Assistant Professor in Information Engineering at Yamagata University (1990), Guest Associate Professor at the National Institute of Genetics (1996), Associate Professor in Electronic and Information Engineering (1999) and in Applied Bio System Engineering (2000) at Yamagata University, and Guest Researcher at the Bio Radical Institute, Yamagata Prefecture (2000). He then joined NAIST as Associate Professor at the Research and Education Centre for Genetic Information (2001) and the Graduate School of Information Science (2002), and has been a Professor at the Graduate School of Information Science since 2004. He developed the KNApSAcK family databases, including a comprehensive species-metabolite relational database. His research areas are systems biology, biological databases, and metabolomics.

**Naoaki Ono** completed his Ph.D. at the University of Tokyo in 2001. After working as a CREST researcher at Kyoto University in 2001, he moved to the Advanced Telecommunications Research Institute International. From 2007, he served as an ERATO researcher at Osaka University, and since 2012 he has been an assistant professor at the Graduate School of Information Science, Nara Institute of Science and Technology (NAIST). Since 2018, he has also been an associate professor at the Data Science Centre of NAIST. His specialty is the analysis and simulation of life phenomena and their evolution based on bioinformatics, systems biology, and complex system models, using statistical models and deep learning models.

**Ming Huang** received his Ph.D. degree in biomedical engineering in 2012 from the University of Aizu, Japan. He is currently an assistant professor at the Nara Institute of Science and Technology and a visiting scholar at the Biomedical Engineering Department of the University of California, Davis. His research interests include AI engineering for digital health management systems and digital biomarker identification, which are indispensable parts of health informatics. He is also interested in applying graph theory and graph neural networks in quantitative structure–activity relationship (QSAR) studies for drug discovery.

### *Editorial* **Recent Trends in Computational Biomedical Research**

**Md. Altaf-Ul-Amin \*, Shigehiko Kanaya, Naoaki Ono and Ming Huang**

Division of Information Science, Nara Institute of Science and Technology, Ikoma 630-0192, Japan; skanaya@gtc.naist.jp (S.K.); nono@is.naist.jp (N.O.); alex-mhuang@is.naist.jp (M.H.)

**\*** Correspondence: amin-m@is.naist.jp

Recent advances in information technology have brought forth a paradigm shift in science, especially in the biology and medical fields. Statistical methodologies based on high-performance computing and big data analysis are now indispensable for the qualitative and quantitative understanding of experimental results. In fact, the last few decades have witnessed drastic improvements in high-throughput experiments in health science, for example, mass spectrometry, DNA microarray, and next-generation sequencing. These methods have been providing massive data involving the four major branches of omics (genomics, transcriptomics, proteomics, and metabolomics). On the other hand, cell imaging, clinical imaging, and personal healthcare devices are also providing important data concerning the human body and disease. In parallel, various methods of mathematical modeling, such as machine learning, have also developed rapidly. All of these types of data can be utilized in computational approaches to biomedical research, such as understanding disease mechanisms, diagnosis, prognosis, drug discovery, drug repositioning, disease biomarkers, driver mutations, copy number variations, disease pathways, and much more. Therefore, the range of topics under "Recent Trends in Computational Biomedical Research" is extensive, and the present Special Issue is not a comprehensive representation of the subject. Nonetheless, the articles selected for this Special Issue represent a variety of topics relating to the title, and we share them with our readers with pleasure.

In this Special Issue, we published a total of eight papers. Overall, four papers address cardiovascular-related topics, two are linked to drug development, and the other two concern associations between genome sequence aberrations and diseases.

The paper titled "SMCKAT, a Sequential Multi-Dimensional CNV Kernel-Based Association Test" is on copy number variants (CNVs) [1]. Associations between CNVs and various diseases have been well studied before, and this paper proposes a method called SMCKAT to test the association between the sequential order of CNVs and diseases. SMCKAT was evaluated by applying it to CNV data sets, demonstrating its ability to detect rare or common CNVs in specific biologically relevant chromosomal regions supported by the biomedical literature.

The paper with the title "The Genome-Wide Scanning of Potential Hotspots for Adenosine Methylation: A Potential Path to Neuronal Development" is about the methylation of adenosines at the N6 position (m6A) [2]. This methylation is the most frequent type of internal modification in mRNAs and is implicated in diverse roles in physiological development and pathophysiological processes. This work applied a sliding window technique to identify the consensus site and annotated all m6A hotspots in the human genome. Most of the hotspots were found to be located in the non-coding regions, suggesting that methylation occurs before splicing. The role of this methylation in neuron physiology was also elaborated.

An automatic ECG quality assessment method has been presented in the paper titled "Electrocardiogram Quality Assessment with a Generalized Deep Learning Model Assisted by Conditional Generative Adversarial Networks" [3]. This automatic method can help cardiologists perform diagnosis much faster. The proposed system first trained a conditional generative adversarial network model for data augmentation and then developed a deep

**Citation:** Altaf-Ul-Amin, M.; Kanaya, S.; Ono, N.; Huang, M. Recent Trends in Computational Biomedical Research. *Life* **2022**, *12*, 27. https:// doi.org/10.3390/life12010027

Received: 23 December 2021 Accepted: 23 December 2021 Published: 24 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

quality assessment model based on a training dataset composed of real and generated ECG. The proposed system demonstrated improved performance in the ECG quality assessment and has the potential to be utilized in clinical practice.

The paper titled "A Maximum Flow-Based Approach to Prioritize Drugs for Drug Repurposing of Chronic Diseases" is on drug repurposing, i.e., using existing drugs to treat new/other diseases [4]. The work proposes a maximum-flow-based approach using protein–protein interactions (PPIs) of drug targets (proteins) and risk genes corresponding to chronic diseases such as breast cancer, inflammatory bowel disease (IBD), and chronic obstructive pulmonary disease (COPD). The top candidate drugs identified by the maximum-flow-based approach were found to be experimentally validated by other independent studies.

Models and results of in silico prediction of the interactions between compounds of Jamu herbs and human proteins have been presented in the paper titled "Identification of Targeted Proteins by Jamu Formulas for Different Efficacies Using Machine Learning Approach" [5]. Verifying the proteins that are targeted by natural compounds is helpful to select natural herb-based drug candidates and to explain the scientific mechanisms of traditional medicines.

The paper titled "Shared Molecular Mechanisms of Hypertrophic Cardiomyopathy and Its Clinical Presentations: Automated Molecular Mechanisms Extraction Approach" is about finding molecular mechanisms of hypertrophic cardiomyopathy (HCM), which is the most common inherited cardiovascular disease [6]. Molecular mechanisms were extracted from abstracts and open access full articles by multiple machine-reading systems. Shared molecular mechanisms of HCM and its clinical presentations were represented as networks, and the most important elements in the networks were found based on node centrality measures.

The paper with the title "Impact of Water Temperature on Heart Rate Variability during Bathing" aims to explore the impact of water temperature (WT) on Heart Rate Variability (HRV) by collecting five electrocardiogram (ECG) recordings of each of 10 subjects at different preset bathtub WT conditions during bathing [7]. Based on statistical analysis, it has been shown that the WT has an important impact on HRV during bathing.

The accuracy of several methods has been compared and assessed in the paper titled "Discussion of Cuffless Blood Pressure Prediction Using Plethysmograph Based on a Longitudinal Experiment: Is the Individual Model Necessary?" [8]. Estimating blood pressure using the plethysmograph (PPG) signal is convenient and makes continuous measurement feasible, but some doubt exists about its accuracy and robustness. By comparing the regression models, this paper shows that an individual Gaussian Process model attains the best results.

**Funding:** This research received no external funding.

**Acknowledgments:** Our heartfelt thanks are due to the authors for their excellent and thought-provoking contributions and their patience in communicating with us. Finally, we acknowledge all the dedicated reviewers of these papers for their critical and insightful comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Discussion of Cuffless Blood Pressure Prediction Using Plethysmograph Based on a Longitudinal Experiment: Is the Individual Model Necessary?**

**Koshiro Kido 1, Zheng Chen 1, Ming Huang 1,\*, Toshiyo Tamura 2, Wei Chen 3, Naoaki Ono 1,4, Masachika Takeuchi 5, Md. Altaf-Ul-Amin 1 and Shigehiko Kanaya 1,4**


**Abstract:** Using the plethysmograph (PPG) signal to estimate blood pressure (BP) is attractive given the convenience and the possibility of continuous measurement. However, due to personal differences and the insufficiency of data, the dilemma between accuracy on a small dataset and robustness as a general method remains. To this end, we scrutinized the whole pipeline from feature selection to regression model construction based on a one-month experiment with 11 subjects. Constructing the explanatory features from the heart rate and five general PPG waveform features that do not require the identification of the dicrotic notch and diastolic peak, three regression models (partial least squares, locally weighted partial least squares, and a Gaussian Process model) were built to reflect different underlying assumptions about the nature of the fitting problem. By comparing the regression models, it can be confirmed that an individual Gaussian Process model attains the best results, with 5.1 mmHg and 4.6 mmHg mean absolute error and 6.2 mmHg and 5.4 mmHg standard deviation for SBP and DBP, respectively. Moreover, the results of the individual models are significantly better than those of the generalized model built with the data of all subjects.

**Keywords:** blood pressure; cuffless measurement; longitudinal experiment; plethysmograph; nonlinear regression

#### **1. Introduction**

Ambulatory blood pressure (BP) monitoring provides abundant cardiovascular information, and numerous studies have focused on replacing the conventional auscultatory/oscillometric measurement, which requires the occlusion of arterial blood flow, with cuffless methods that use cardiovascular biosignals and state-of-the-art machine learning methods.

The cuffless methods can be roughly categorized into three groups. The first one is based on the pulse arrival time (PAT) [1,2], the second one is based on the photoplethysmograph (PPG) signal, which has attracted more and more attention in recent years [3–6], and the third one is based on other methods [7,8].

The PAT is the period that includes the pulse transit time and the pre-ejection period of the heart. Because determining the PAT requires both the ECG signal and a distal PPG signal, it is a modality suitable for intermittent measurement.

The second group utilizes only one biosignal, the PPG, which reflects the change of blood volume. The pulsatile component of the PPG reflects the beat-by-beat change in blood volume, and the stable component is related to the basic blood volume and other

**Citation:** Kido, K.; Chen, Z.; Huang, M.; Tamura, T.; Chen, W.; Ono, N.; Takeuchi, M.; Altaf-Ul-Amin, M.; Kanaya, S. Discussion of Cuffless Blood Pressure Prediction Using Plethysmograph Based on a Longitudinal Experiment: Is the Individual Model Necessary? *Life* **2022**, *12*, 11. https://doi.org/ 10.3390/life12010011

Academic Editor: Jonathan L. S. Esguerra

Received: 17 November 2021 Accepted: 18 December 2021 Published: 22 December 2021



physiological indices such as respiration and body temperature [9]. Since the PPG signal depends on the blood volume in its optical path, which typically covers the arterial and venous capillaries [10], it can be related to cardiovascular indices such as blood oxygen saturation and arterial compliance [11,12]. It is generally accepted that the PPG waveform comprises the arterial pulse wave from the left ventricle to the distal sites and the reflected wave from the sites of impedance mismatch [13]. With these physiological understandings, research interest has recently been shifting to BP estimation based on the PPG signal. Xing et al. carried out a large-scale experiment (1249 subjects, 2358 measurements), and by using the PPG waveform features and biometrics in a bagged regression tree model, they obtained a 9.5 mmHg, 2.2 mmHg, and 17.4 mmHg mean absolute error (MAE) for hypotensive, normotensive, and hypertensive subjects, respectively [3]. Chowdhury et al. used an open dataset obtained in a hospital with 219 subjects and 657 measurements to extract PPG waveform features and biometrics, which were then input into a Gaussian Process Regression model. By using a further Bayesian optimization, they claimed 3.02 mmHg and 1.74 mmHg MAE for SBP and DBP estimation, respectively [6].

There are two main criteria for a cuffless method. The first is that the method/system should meet the accuracy requirement of IEEE Std 1708-2014. Meanwhile, when used as a healthcare device, the convenience for personal use should also be taken into account. Therefore, the final goal of development should be an appropriate computation model with high accuracy and long-term stability.

These preliminary studies show the prospect of cuffless BP estimation with the PPG signal; however, fundamental issues remain. The first is the dilemma between the generalization of the regression model and the individual differences within the limited data. This kind of difference is pervasive in biomedical engineering, and the way a balance is struck is reflected in the modeling assumption and the choice of machine learning model. A data-driven approach resorts to the accumulation of samples to find the relation between explanatory and dependent variables. A globally nonlinear regression model that severely twists the fitting hyperplane to fit the data on hand may not be well applicable to an individual outside the dataset, especially when the dataset is small. It is not uncommon to see difficulty in reproducing, in a large population test, a relation between the waveform features and the blood pressure learned from a small dataset. The relation may be altered substantially by demographic and physiological/pathological factors. For example, Allen has confirmed the triangulation of the PPG contour with aging [14]. This means that a robust method should not rely on the identification of the dicrotic notch and/or diastolic peak in the PPG waveform.

Moreover, the cardiovascular functions including BP are regularized by biorhythms modulated by the baroreflex and the autonomic nervous system [15,16], which implies the necessity of re-calibration for long-term use.

Based on the discussion above, a thorough discussion is necessary before a large-scale validation experiment. In this study, we tried to contribute to this topic by answering the following two fundamental questions: (*Q1*) When compared with the individual model, is the generalized model good enough? (*Q2*) Is it necessary to calibrate in a relatively long-term scenario? Special attention is also paid to the following aspects to lift the performance (corresponding to *A1*) and robustness (corresponding to *A2*) of the model: (*A1*) to decide the best model for BP regression by comparing the linear, local linear, and Bayesian models; (*A2*) to examine the feasibility of using the general waveform features solely in the regression models, which means that the identification of the dicrotic notch and diastolic peak is unnecessary, thereby addressing the deterioration of the dicrotic notch and the diastolic peak with aging and hypertension.

#### **2. Materials and Methods**

To the best knowledge of the authors, there are very few open datasets dedicated to the study of cuffless BP estimation. For example, the MIMIC database is widely used in blood pressure estimation, but its protocol is not optimized for BP studies. The Non-invasive Blood Pressure Estimation dataset [17] collected data right after running, which would change the behavior of the baroreflex [18].

In view of the situation that none of the open datasets can be used to answer questions Q1 and Q2, a one-month experiment, from mid-May to mid-June, dedicated to the development and validation of cuffless BP measurement was conducted. The experiment used the PPG sensor (940 nm infrared LED) of a medical device (Checkme Pro B, San-ei Medisys) for 30 s PPG measurements and a clinical blood pressure monitor (A&D, Model UA-1200BLE) as the reference [19]. Eleven subjects, whose basic information can be found in Table 1, participated in the experiment, during which they were required to measure the PPG signal and BP values multiple times a day (typically 4 times/day) in different periods. Subjects were guided to rest for three minutes before taking the measurements in a sitting position. The study was approved by the Ethics Review Committee of the San-ei Medisys Company (#2019002SA). All methods were performed in accordance with the relevant guidelines and regulations. Informed consent was obtained from all the subjects.


**Table 1.** Individual fitting results based on GPR\_set1 model coupled with intermittent calibration.

#### *2.1. Preprocessing*

This research differentiates itself from previous studies that tried to extract as many morphological features of the PPG waveforms as possible and then used feature selection methods to pick the important ones to input into the regression models. We argue that the whole pipeline should take serious account of the problem that the deterioration of the dicrotic notch and the diastolic peak with aging and hypertension will bring uncertainty to the relevant features. This is also the reason why we highlight the *A2* aspect in our study.

On this account, the explanatory feature set *X* consists of (1) the heart rate reading (*HR*) during the experiment; and (2) the general PPG waveform features, which are described as follows. A 6th-order Butterworth IIR low-pass zero-phase filter (*fc* = 10 Hz) and a linear detrend were applied to the PPG signals before the following procedure:

Firstly, the skewness, which is a well-used signal quality index (SQI) for the PPG signal [20], was calculated for each beat-by-beat PPG waveform in a sample, and the waveform with the maximum skewness was picked; samples whose maximum SQI was less than 0.1 were removed.
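A minimal sketch of this preprocessing step, assuming a hypothetical sampling rate of 125 Hz and pre-segmented beats (neither is specified above), might look like:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, detrend
from scipy.stats import skew

def preprocess_ppg(ppg, fs=125.0, fc=10.0, order=6):
    """6th-order Butterworth IIR low-pass with zero phase (forward-backward
    filtering) followed by a linear detrend, as described above."""
    sos = butter(order, fc, btype="low", fs=fs, output="sos")
    return detrend(sosfiltfilt(sos, ppg), type="linear")

def best_beat_by_skewness(beats, sqi_threshold=0.1):
    """Pick the beat with the maximum skewness (used as the SQI); return
    None when even the best beat is below the threshold, i.e., the whole
    sample is discarded."""
    sqis = [skew(b) for b in beats]
    i = int(np.argmax(sqis))
    return None if sqis[i] < sqi_threshold else (beats[i], sqis[i])

# toy usage on a synthetic signal and two toy beats
fs = 125.0
t = np.arange(0, 10, 1 / fs)
raw = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.sin(2 * np.pi * 30 * t) + 0.05 * t
clean = preprocess_ppg(raw, fs=fs)

flat = np.sin(np.linspace(0, 2 * np.pi, 100))   # symmetric beat: skewness ~ 0
spiky = np.zeros(100)
spiky[10:20] = np.hanning(10)                   # pulse-like beat: positive skew
chosen, sqi = best_beat_by_skewness([flat, spiky])
```

Zero-phase (forward-backward) filtering avoids distorting the timing of the waveform landmarks that the features below depend on.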

Secondly, the following features (Figure 1) of the waveform were calculated:

• *PPG* intensity ratio (*PIR*): *PIR* = *PPGpeak*/*PPGroot*, which is an index reflecting changes in the arterial diameter [21];


**Figure 1.** Illustration of PPG features. The inflection point of the PPG waveform between the dicrotic notch and the diastolic peak is used, given that the diastolic peak may disappear.

The five features above, which have been validated in previous studies [4,6,21], were selected based on the underlying consideration that potential PPG features should avoid the identification of the dicrotic notch and the diastolic peak. Meanwhile, the HR, which has been shown to be informative in BP estimation [16,23], is used as another explanatory variable.
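As an illustration, the PIR feature reduces to a peak-to-root intensity ratio; this sketch assumes a single raw (non-detrended) beat with a strictly positive baseline, since a zero-mean beat would make the root intensity non-positive:

```python
import numpy as np

def ppg_intensity_ratio(beat):
    """PIR = PPG_peak / PPG_root: peak intensity of one pulse divided by the
    intensity at its root (valley), an index of arterial-diameter change."""
    beat = np.asarray(beat, dtype=float)
    if beat.min() <= 0:
        raise ValueError("PIR needs raw intensities with a positive baseline")
    return float(beat.max() / beat.min())

beat = 2.0 + np.hanning(100)     # toy pulse riding on a baseline of 2
pir = ppg_intensity_ratio(beat)  # ~ (2 + 1) / 2 = 1.5
```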

#### *2.2. Regression Models*

To respond to *A1*, three regression models were used in this research. While the PLS corresponds to the linear assumption between the features (*X*) and the BP (*y*), the LWPLS corresponds to the assumption of local linearity defined by the close samples in the feature space. The Gaussian Process Regression can be viewed as a nonlinear regression by using a radial basis function kernel.

#### 2.2.1. Partial Least Squares (PLS)

By projecting the input into new spaces, PLS can handle the problem of visible collinearity in the explanatory variables. In a problem with $X \in \mathbb{R}^{N \times P}$ as the explanatory variables (*N*: the number of samples; *P*: the number of explanatory variables) and $y \in \mathbb{R}^{N}$ as the dependent variable, the normalized *X* and *y* are expressed alternatively as follows:

$$X = TP^T + E, \tag{1}$$

$$y = Tq + f,\tag{2}$$

where $T \in \mathbb{R}^{N \times D}$ is the matrix that contains the latent components in each column, and $P \in \mathbb{R}^{P \times D}$ is the matrix that contains the loadings of each variable for the components in each row. $q \in \mathbb{R}^{D}$ is the regression coefficient vector from the latent variables to the output. *E* and *f* are the residuals.

The projection of *X* and *y* is done iteratively by maximizing the covariance $y^T t_i$, where $t_i$ is the *i*th column of *T*. Taking the first two latent components as an example,

$$\mathbf{t}\_1 = X\mathbf{w}\_1, \tag{3}$$

where $w_1$ is the weight vector for the first component, with the regularization that $\|w_1\| = 1$. By using the *Lagrange* multiplier and the least squares method, $w_1$, $t_1$, and the corresponding $p_1$ and $q_1$, which are the first column of *P* and the first element of *q*, respectively, can be calculated. Similarly, the second component can be derived by subtracting the projection onto the first component from *X* and *y*,

$$X\_2 = X - \mathbf{t}\_1 \mathbf{p}\_1^T,\tag{4}$$

$$y\_2 = y - \mathbf{t}\_1 q\_1, \tag{5}$$

and calculate the parameters by

$$w\_2 = \frac{\mathbf{X}\_2^T \mathbf{y}\_2}{||\mathbf{X}\_2^T \mathbf{y}\_2||},\tag{6}$$

$$p\_2 = \frac{\mathbf{X}\_2^T \mathbf{t}\_2}{\mathbf{t}\_2^T \mathbf{t}\_2},\tag{7}$$

$$q\_2 = \frac{\mathbf{y}\_2^T \mathbf{t}\_2}{\mathbf{t}\_2^T \mathbf{t}\_2}. \tag{8}$$

The procedure is repeated until *i* reaches the number *D*, whose value can be determined by using the cross-validated coefficient of determination ($r^2_{cv}$):

$$r\_{cv}^2 = 1 - \frac{\sum\_{i=1}^{N} (y^i - y\_{cv}^i)^2}{\sum\_{i=1}^{N} (y^i - y\_A)^2},\tag{9}$$

where $y^i$ and $y^i_{cv}$ are the true value and the estimate of the *i*th sample in the cross-validation, respectively, and $y_A$ is the mean of *y*. *D* was set to 2 after confirmation with $r^2_{cv}$.

#### 2.2.2. LW-PLS

As a just-in-time method, the LW-PLS is conceptually different from the PLS. It separates the samples into a training dataset with $X_t \in \mathbb{R}^{M \times P}$ and $y_t \in \mathbb{R}^{M}$ and the new request $x_r \in \mathbb{R}^{P}$, whose similarity to the training samples is computed and used to generate the diagonal similarity matrix

$$\mathcal{U} = \begin{bmatrix} u\_r^1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & u\_r^M \end{bmatrix} \tag{10}$$

where the *u<sup>i</sup> <sup>q</sup>* is the similarity between the *i*th sample in *X<sup>t</sup>* and the *xr*. The *X<sup>t</sup>* and *y<sup>t</sup>* are adjusted based on the *U*.

$$X\_0 = X\_t - X\_w, \tag{11}$$

$$y\_0 = y\_t - y\_w.\tag{12}$$

The $X_w$ and $y_w$ are generated as follows.

$$\mathbf{X}\_{w} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \begin{bmatrix} \mathbf{x}\_{w,1}, \mathbf{x}\_{w,2}, \dots, \mathbf{x}\_{w,P} \end{bmatrix}, \tag{13}$$

and

$$y\_w = \frac{\sum\_{i=1}^{M} u\_r^i y^i}{\sum\_{i=1}^{M} u\_r^i},\tag{14}$$

where,

$$\mathbf{x}\_{w,j} = \frac{\sum\_{i=1}^{M} u\_r^i \mathbf{x}\_j^i}{\sum\_{i=1}^{M} u\_r^i}. \tag{15}$$

With all this preparation, the first component can be calculated:

$$w\_1 = \frac{\mathbf{X}\_0^T \mathbf{U} y\_0}{||\mathbf{X}\_0^T \mathbf{U} y\_0||},\tag{16}$$

$$\mathbf{t}\_1 = \mathbf{X}\_0 \mathbf{w}\_1, \tag{17}$$

$$p\_1 = \frac{\mathbf{X}\_0^T \mathbf{U} \mathbf{t}\_1}{\mathbf{t}\_1^T \mathbf{U} \mathbf{t}\_1},\tag{18}$$

$$q\_1 = \frac{\mathbf{y}\_0^T \mathbf{U} \mathbf{t}\_1}{\mathbf{t}\_1^T \mathbf{U} \mathbf{t}\_1}. \tag{19}$$

The $p_1$ and $q_1$ represent the loading vector and the coefficient of the first component, respectively. The parameters are determined iteratively until the predefined component number is met. The estimate for the request $x_r$ is the summation $\hat{y} = \sum_{a=1}^{D} t_a q_a + y_w$, where $t_a$ is the request's score on the *a*th component.

The distance index is crucial to the LW-PLS; therefore, special care should be paid to its choice. In this research, the *Mahalanobis* distance, an improved *Euclidean* distance that amends the problems of different scales and of correlation between the explanatory features, with the definition below

$$d\_m = \sqrt{(\mathbf{x}\_i - \mathbf{x}\_j)^T \mathbf{S}^{-1} (\mathbf{x}\_i - \mathbf{x}\_j)}, \quad (\mathbf{x}\_i, \mathbf{x}\_j\text{: the samples; } \mathbf{S}\text{: covariance matrix}) \tag{20}$$

is considered suitable for this heterogeneous set of explanatory features consisting of continuous ones of different scales.
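The LW-PLS procedure of Eqs. (10)–(19), combined with a Mahalanobis-distance-based similarity (Eq. 20), can be sketched as follows; the Gaussian similarity $u = \exp(-d_m/\lambda)$ and the bandwidth `lam` are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def lwpls_predict(X_t, y_t, x_r, lam=1.0, D=2):
    """Predict for one request x_r: weight the training samples by a
    Gaussian of their Mahalanobis distance to x_r (Eq. 20), center with the
    weighted means (Eqs. 13-15), then build D weighted PLS components
    (Eqs. 16-19) and accumulate y_hat = sum(t_a q_a) + y_w."""
    S_inv = np.linalg.inv(np.cov(X_t, rowvar=False))
    diff = X_t - x_r
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))   # Eq. (20)
    u = np.exp(-d / lam)                                       # diag of U, Eq. (10)
    x_w = (u @ X_t) / u.sum()                                  # Eq. (15)
    y_w = (u @ y_t) / u.sum()                                  # Eq. (14)
    X0, y0, xr0 = X_t - x_w, y_t - y_w, x_r - x_w              # Eqs. (11)-(12)
    y_hat = y_w
    for _ in range(D):
        w = X0.T @ (u * y0)
        w /= np.linalg.norm(w)                                 # Eq. (16)
        t = X0 @ w                                             # Eq. (17)
        p = X0.T @ (u * t) / (t @ (u * t))                     # Eq. (18)
        q = (y0 @ (u * t)) / (t @ (u * t))                     # Eq. (19)
        t_r = xr0 @ w                                          # request score
        y_hat += t_r * q
        X0, y0, xr0 = X0 - np.outer(t, p), y0 - t * q, xr0 - t_r * p
    return float(y_hat)

# noiseless linear toy data: with D = P components the fit is exact
rng = np.random.default_rng(2)
X_t = rng.normal(size=(200, 3))
y_t = X_t @ np.array([1.0, 2.0, 3.0])
pred = lwpls_predict(X_t, y_t, X_t[0], D=3)
```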

#### 2.2.3. Gaussian Process Regression

Gaussian Process Regression is fundamentally a kernel method, which defines the similarity of explanatory variables in terms of a kernel function. Instead of the explicit definition of the basis functions $\phi_i(\mathbf{x})$, $(i = 1, 2, \dots, l)$ and the determination of the optimized weight vector $w$, where $y(\mathbf{x}, w) = w^T \phi(\mathbf{x})$, the Gaussian Process defines a prior probability distribution over the basis functions. For a dataset with *N* samples, the equation becomes

$$y = \Phi w, \quad \Phi\_{nk} = \phi\_k(\mathbf{x}\_n), \tag{21}$$

where $\Phi$ is the design matrix.

By letting the prior distribution of the weight vector be Gaussian, $p(w) \sim \mathcal{N}(w|\mathbf{0}, \alpha^{-1}I)$, and then defining the kernel function $k(\mathbf{x}, \mathbf{x}')$, where

$$k(\mathbf{x}, \mathbf{x}') = \frac{1}{\alpha} \phi(\mathbf{x})^T \phi(\mathbf{x}'),\tag{22}$$

the Gaussian Process can be defined by the kernel function, since $\mathbb{E}[y] = \mathbf{\Phi}\,\mathbb{E}[w] = \mathbf{0}$ and

$$\mathbb{E}[y(\mathbf{x}\_n)y(\mathbf{x}\_m)] = k(\mathbf{x}\_n, \mathbf{x}\_m). \tag{23}$$

In this research, the kernel is the radial basis function (scale = 1) combined with white noise (noise level = 1), applied to the standardized explanatory variables.
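Using scikit-learn as a stand-in (the paper's implementation library is not specified) and synthetic data, the kernel configuration just described might look like:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6))                     # synthetic feature set
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

# RBF (length scale 1) plus a white-noise kernel (noise level 1), fitted on
# standardized explanatory variables, as described in the text
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
Xs = StandardScaler().fit_transform(X)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(Xs, y)
y_pred, y_std = gpr.predict(Xs, return_std=True)  # mean and uncertainty
```

A practical attraction of the GPR here is that `y_std` gives a per-sample uncertainty alongside the BP estimate.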

#### *2.3. Calibration Schemes*

Individual characteristics are reflected by incorporating the personal samples into the training dataset. Two calibration schemes were used. The first scheme (scheme\_1) used the first 20% of the samples (typically 23 samples) in the training process, whereas the second scheme (scheme\_2) used the first 4% and the 15th–18th, 35th–38th, 55th–58th, 75th–78th, and 95th–98th samples, which amount to roughly 20% of the individual samples as well. Of note, to prevent the leakage of future information, the BP values are predicted by using only the previous calibration readings. That is, for example, the 19th–34th BP values are predicted by using the initial and the 15th–18th calibration readings. These two schemes were used to examine the necessity of re-calibration in a relatively long-term scenario (*Q2*). Specifically, if the best results come from scheme\_2, it is reasonable to conclude that the model should take biorhythms, such as seasonality, into account and that re-calibration should be considered.
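The two calibration schemes amount to index bookkeeping; the 0-based sketch below assumes ~115 samples per subject, as reported in the Results:

```python
def scheme_indices(n_samples):
    """Calibration-sample indices (0-based) for the two schemes: scheme_1
    takes the first 20% of the samples; scheme_2 takes the first 4% plus
    the 15th-18th, 35th-38th, 55th-58th, 75th-78th, and 95th-98th samples."""
    scheme_1 = list(range(int(0.20 * n_samples)))
    scheme_2 = list(range(int(0.04 * n_samples)))
    for start in (14, 34, 54, 74, 94):           # 15th-18th etc., 0-based
        scheme_2.extend(range(start, start + 4))
    return scheme_1, scheme_2

s1, s2 = scheme_indices(115)   # len(s1) == 23, len(s2) == 24, both ~20%
```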

#### *2.4. Generalized and Individual Models*

To answer question *Q1*, two training strategies were used in model construction. The generalized strategy trains the model by using the individual calibration samples as well as the samples from the other subjects. The other strategy constructs an individual model, trained solely on the individual calibration samples. The two calibration schemes above were used in both models. The mean absolute error (MAE), as well as the mean and standard deviation of the regression error, were used as the main metrics; in particular, the MAE was used to determine the best model.

#### **3. Results**

Summarizing the one-month experiment, all 1287 samples collected from the 11 subjects, whose information is shown in Table 1, were used in the following modeling. The numbers of samples per subject are balanced, ranging from 114 to 118.

#### *3.1. Results of Generalized Models*

In this study, the linear partial least squares (PLS) model, the locally linear, locally weighted partial least squares (LWPLS) model, and the Gaussian Process Regression (GPR) model were used to reflect different assumptions about this regression problem. The generalized model is trained in the manner of leave-one-out validation, taking all samples of the training dataset (10 subjects) merged with the individual calibration samples (calibration scheme\_1 and scheme\_2; more detail in **Methods**). Summarizing the regression results of the testing samples, the mean absolute error (MAE) of systolic and diastolic BP is shown in Figure 2, from which it can be seen that none of the models fits the personal samples well, with MAEs larger than 9 mmHg. Therefore, no further comparison of features was conducted for the generalized models.

**Figure 2.** The mean absolute error (MAE) of systolic and diastolic BP of the three generalized models. The bar plot on the left corresponds to the results with scheme\_1 and the bar plot on the right to the results with scheme\_2. Sticks show the standard errors.

#### *3.2. Comparison of Different Individual Models*

Individual models were constructed by taking the calibration samples from the same subject for model construction and testing on the remaining samples; the results are summarized in Figure 3. As introduced in the **Methods** section, two calibration schemes were used. The left column corresponds to calibration using the initial 20% of samples (scheme\_1), while the right column corresponds to re-calibrating every 20 samples (scheme\_2).

In comparison with the generalized models, each individual model significantly outperforms its counterpart (*p* < 0.001). This suggests that the individual model is more appropriate for building the BP regression model at the current stage.

The three models with two calibration schemes were examined by altering the combination of explanatory features; therefore, there are three trials for each model, as shown in Figure 3. Within each scheme, we compared the model having the lowest MAE against the others with close MAEs by pair-wise Student's *t*-test, and then compared the lowest-MAE models from the two schemes. For SBP, the best models are PLS\_set1 for scheme\_1 and GPR\_set1 for scheme\_2. Moreover, the MAE of GPR\_set1 (scheme\_2) is significantly smaller than that of PLS\_set1 (scheme\_1) and smaller than that of GPR\_set2 (scheme\_2); it is therefore considered the best model for SBP.

For DBP, consistent with the situation of SBP, the best models are PLS\_set1 (scheme\_1) and GPR\_set1 (scheme\_2); however, no significant difference can be found between these two models (*p* = 0.41). Of note, regarding the individual model that uses the scheme\_2 calibration, linear interpolation between two consecutive calibration windows was also conducted to confirm the advantage of nonlinear regression using PPG features. For example, the averaged BP values of the two calibration windows of the 15th–18th samples and the 35th–38th samples are used to interpolate the 19th–34th BP values. The MAEs of SBP and DBP under interpolation are 5.96 and 4.72 mmHg, respectively. Its SBP accuracy is significantly lower than that of GPR\_set1, whereas no significant difference can be seen for DBP. Combining these results, the feature combination set 1 is consistently the best. Moreover, the GPR coupled with the re-calibration scheme (scheme\_2) has a significantly lower MAE for SBP (5.16 mmHg) and a similar MAE for DBP (4.63 mmHg); it is considered the best method for BP estimation.
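The interpolation baseline can be sketched as below; the calibration-window indices follow scheme\_2, while the synthetic BP series and the drift magnitude are illustrative assumptions.

```python
import numpy as np

def interpolate_bp(bp, windows):
    """Baseline: anchor each calibration window at its mean BP and linearly
    interpolate between consecutive anchors; `windows` holds 0-based
    (start, end) ranges with `end` exclusive."""
    anchors_x = [(s + e - 1) / 2.0 for s, e in windows]        # window centers
    anchors_y = [float(np.mean(bp[s:e])) for s, e in windows]  # window means
    return np.interp(np.arange(len(bp)), anchors_x, anchors_y)

# Hypothetical month of systolic BP with a slow downward drift.
rng = np.random.default_rng(1)
bp = 120.0 - 0.05 * np.arange(115) + rng.normal(scale=3.0, size=115)
windows = [(14, 18), (34, 38), (54, 58), (74, 78), (94, 98)]
est = interpolate_bp(bp, windows)
mae = float(np.mean(np.abs(est[18:34] - bp[18:34])))  # the 19th-34th samples
```

This baseline ignores the PPG features entirely, so the margin between its MAE and that of the feature-based regressor isolates the contribution of the waveform information.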

#### *3.3. Regression Model*

Based on the results above, the GPR model with explanatory features of *hr*, *a02*, *a25*, and *ri* is used for both SBP and DBP regression. The results of the regression model are shown in Figure 4, from which it can be seen that the overall trend of the BP can be captured (Figure 4a,b). The decreasing trend of both the SBP and DBP may be attributed to the seasonal influence and can be observed from most of the subjects [24,25]. The mean and standard deviation of the fitting are

• SBP: −2.2 ± 6.2 mmHg; DBP: −2.0 ± 5.4 mmHg.

The individual fitting results are tabulated in Table 1, from which it can be confirmed that most of the subjects have SBP MAE values less than 6 mmHg, except for No. 3, 4, and 11, whose BP ranges were relatively wide.

**Figure 4.** The results of the GPR\_set1 (scheme\_2) model. The fitting of the one-month samples of two randomly selected subjects can be seen in (**a**,**b**). The Bland–Altman plot and linear correlation can be found in (**c**,**d**).

#### **4. Discussion**

The observation that the generalized models are significantly inferior to individual models is consistent with the finding of Gašper et al. [26] based on the MIMIC data. This reconfirmation suggests that the samples (features and BP values) from different individuals follow different distributions. It also suggests that the generalizability of a BP model should be validated on a dataset within a certain period.

Although ISO 81060-2 uses the mean error as a major criterion in model evaluation, the MSE, which is more sensitive to the variance of the error, is often used in model selection [6,23]. In this study, the mean errors of SBP and DBP with the PLS\_set1 model are −1.46 and −1.53 mmHg, respectively; however, the corresponding correlation values are evidently lower, at 0.82 and 0.89. Figure 5 shows the distributions of the prediction errors of PLS\_set1 (left column) and GPR\_set1 (right column). It can be confirmed in Figure 5 that the bias values of the PLS\_set1 model are smaller, but its variances are higher than those of the GPR\_set1 model. The MAE also stands out for its insensitivity to outliers compared with the mean square error, which is used pervasively in machine learning. Therefore, the MAE is used for model comparison in this study.

**Figure 5.** Distribution of prediction errors of the PLS\_set1 (**left** column) and GPR\_set1 (**right** column) models. Red curves show the approximation of the probability density using a Gaussian distribution.

The fact that the re-calibration scheme (scheme\_2) surpasses the initial-calibration scheme (scheme\_1) for systolic BP suggests that the model may only capture the fluctuation of the systolic BP within a couple of days, and that there is a component, periodic or not, with a longer period in the systolic BP change. On the other hand, the two schemes showed little difference for diastolic BP (lower row of Figure 4). This observation suggests that the diastolic BP is more stable than the systolic BP.

The necessary number of samples in a calibration window and the length of the calibration-free interval between windows are important for planning a subsequent experiment with a larger population and longer duration. Setting the target accuracy at 5 mmHg for both SBP and DBP, other settings of scheme\_2 were tried; for example, re-calibrating only twice, at the 30th–40th and the 70th–80th samples, results in significantly lower accuracy (MAE: 6.28 and 5.02 mmHg for SBP and DBP, respectively). Given that re-calibrating too often may cause stress, an interval of 17–20 samples (typically 4 to 5 days) with a calibration window of four samples (typically 1 day) is suitable.

Another advantage of the re-calibration scheme is its swift initialization, which typically requires five measurements, whereas the initial-calibration scheme requires a long period, typically 23 measurements, to attain acceptable accuracy. Please note that the granularity of this conclusion is the number of days, based on the setting of the experiments. Whether repeating the calibration measurements at a shorter interval, e.g., several minutes, can achieve similar accuracy remains to be validated.

The necessity of an individual model can be explained partially by plotting the distribution of similarity based on the *Mahalanobis* distance (Figure 6). The intra-similarity of the explanatory variables from the same subject is higher than the inter-similarity, but with partial overlap. Since, in kernel-based regression, the LWPLS and the GPR take samples that are close in the explanatory-variable space to predict a new sample, the overlapping samples may be detrimental to the prediction by deviating the prediction hyperplane.
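The intra- versus inter-subject comparison can be sketched with the Mahalanobis distance to a subject's feature distribution; the two synthetic "subjects" below are an assumption used only to illustrate the computation.

```python
import numpy as np

def mahalanobis_to_group(samples, group):
    """Mahalanobis distance of each sample to a group's feature distribution."""
    mean = group.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(group, rowvar=False))
    diff = samples - mean
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

rng = np.random.default_rng(2)
subject_a = rng.normal(loc=0.0, size=(100, 4))  # one subject's features
subject_b = rng.normal(loc=2.0, size=(100, 4))  # another subject's features

intra = mahalanobis_to_group(subject_a, subject_a)  # within-subject distances
inter = mahalanobis_to_group(subject_b, subject_a)  # between-subject distances
# Higher intra-similarity corresponds to smaller intra distances on average.
```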

Regarding the choice of the regression model, this insight is especially significant in the era of deep learning, which is fundamentally a data-driven approach. A shallow network model can give an accurate estimation for the dataset at hand, given its extraordinary capability in twisting the hyperplane to fit the samples. However, given the strong personality underlying this problem, a few-shot learning scheme that relates the PPG and relevant features from the same subject to the change in BP may be worth trying at the present stage. Although applying a fully data-driven approach, e.g., an end-to-end deep learning pipeline, to this topic is still difficult for lack of individual longitudinal data, the information hidden in the sequence of PPG waveforms could be exploited by recurrent neural networks or attention structures in the near future with further accumulation of data. In this regard, the construction of long-term individual datasets should be the next key work.

An interesting study that significantly increased the training samples by extending the measurement duration (30 min) and then used a BiLSTM model to predict the BP values hints that deep learning structures should be used carefully for this topic. Its main conclusion from leave-one-subject-out validation, that the model pretrained with population samples and fine-tuned with individual data attains the best performance, may be partially attributed to a more informative feature space constructed from ECG, PPG, and BCG signals rather than the PPG sensor alone. That is, the general relation between the features extracted from multiple sensors and the BP values could be captured and used in transfer learning. Although the physiological measurement is complicated, given that the simultaneous measurement of ECG, PPG, and BCG is feasible nowadays, researchers and developers are recommended to decide on the sources of signal based on the target application. However, if the validation duration were extended to multiple days, the significant decrease in prediction error suggests the influence of biorhythm [27].

The results of this paper are consistent with those of Chowdhury et al. [6] in showing that Gaussian Process Regression can attain accurate BP estimation. Additionally, this research distinguishes itself by using features that do not require identification of the dicrotic notch and the diastolic peak, and by validating on a long-term dataset to highlight the necessity of the individual model.

This research emphasizes the importance of feature selection. There is no reason to replace the features that have been validated in previous studies with automatically generated features from deep learning models in such a data-insufficient problem. At least at the current stage, domain knowledge is as important as the improvement of machine learning models.

This research shows the possibility of building an accurate BP estimation model by using the general features of the PPG signal and HR. Moreover, given that the disappearance of the diastolic peak is inevitable in aged people, who are the target population of the cuffless method, the general features could be a good alternative for modeling.

**Figure 6.** Distribution of similarity. Each subfigure corresponds to one subject. In each subfigure, the histogram in red shows the distribution of similarity of samples from the same subject; whereas the one in blue shows the distribution of similarity with other subjects.

#### **5. Conclusions**

This paper, from the perspectives of data science and biomedical engineering, provides insightful results that are important for subsequent cuffless blood pressure studies. The main results of this research are: (1) at the current stage of data insufficiency, an individual nonlinear regression model with an intermittent calibration scheme is an ideal choice for long-term use; (2) the general waveform features were used given the disappearance of morphological features caused by aging; given the good fitting results of the nonlinear regression model, the general waveform features are informative for this regression problem and can therefore be used in future studies.

**Author Contributions:** T.T., M.T. and M.H. conceived the experiments; T.T., M.T. and M.H. conducted the experiments; M.H., N.O., S.K. and Z.C. designed the model; M.H. and Z.C. wrote the paper; M.H., Z.C. and K.K. revised the model and code; M.H., T.T., N.O., W.C., M.A.-U.-A. and S.K. revised the paper and helped with interpretation and discussion; T.T., M.T. and M.H. supervised model design and experiments. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research and development work was supported by the KYOTO Industrial Support Organization 21, the Kigyonomori Grant, the Grant-in-Aid for Early-Career Scientists (Kakenhi) (#20K19923), the Grant-in-Aid Scientific Research (C) (Kakenhi) (#17K01440, #21K12760), and the Japan Agency for Medical Research and Development (grant no. s20dk0310111).

**Institutional Review Board Statement:** The study was approved by the Ethics Review Committee of the San-ei Medisys Company (#2019002SA).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** All data analyzed during the current study are available from the corresponding author on reasonable request.

**Acknowledgments:** The authors would like to thank Nobuhiro Nichimura from San-Eki Medisys Co. Ltd. for the data collection.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **SMCKAT, a Sequential Multi-Dimensional CNV Kernel-Based Association Test**

**Nastaran Maus Esfahani 1,\*,† , Daniel Catchpoole 1,2,† and Paul J. Kennedy 1,†**


† These authors contributed equally to this work.

**Abstract:** Copy number variants (CNVs) are the most common form of structural genetic variation, reflecting the gain or loss of DNA segments compared with a reference genome. Studies have identified associations between CNVs and different diseases. However, the association between the sequential order of CNVs and disease-related traits has not, to our knowledge, been studied, and it is still unclear whether CNVs function individually or work in coordination with other CNVs to manifest a disease or trait. Consequently, we propose the first method to test the association between the sequential order of CNVs and diseases. Our sequential multi-dimensional CNV kernel-based association test (SMCKAT) consists of three parts: (1) a single CNV group kernel measuring the similarity between two groups of CNVs; (2) a whole genome group kernel that aggregates several single group kernels to summarize the similarity between CNV groups in a single chromosome or the whole genome; and (3) an association test between the CNV sequential order and disease-related traits using a random effect model. We evaluate SMCKAT on CNV data sets exhibiting rare or common CNVs, demonstrating that it can detect specific biologically relevant chromosomal regions supported by the biomedical literature. We compare the performance of SMCKAT with MCKAT, a multi-dimensional CNV kernel-based association test. Based on the results, SMCKAT can detect more specific chromosomal regions than MCKAT, regions on which not only the CNV characteristics but also the CNV order are significantly associated with the disease-related trait.

**Keywords:** genetic variation; copy number variants; disease-related traits; sequential order; association test

#### **1. Introduction**

Genetically speaking, all humans are 99.9 percent the same, and the 0.1 percent that makes each of us unique is called genetic variation [1]. Genetic variation has two main forms: structural alteration and sequence variation. Copy number variants (CNVs) and DNA sequence variations are the most common forms of structural alteration and sequence variation in the human genome, respectively [2].

A sequence variation or single nucleotide polymorphism (SNP) represents a difference in a single nucleotide. For example, a SNP may replace the nucleotide cytosine with the nucleotide thymine in a certain stretch of DNA. SNPs are classified into two major types based on the gene region they fall within: coding region and non-coding region. SNPs within a coding sequence do not necessarily change the amino acid sequence of the protein that is produced, due to the degeneracy of the genetic code. SNPs in the coding region are of two types: nonsynonymous and synonymous SNPs. Nonsynonymous SNPs change the amino acid sequence of the protein, while synonymous SNPs do not affect the amino acid sequence of the protein. SNPs do not usually function individually; rather, they work in coordination with other SNPs to manifest a disease or trait [3]. Therefore, many sequence studies have been conducted to test the association between SNPs and diseases or traits.

**Citation:** Maus Esfahani, N.; Catchpoole, D.; Kennedy, P.J. SMCKAT, a Sequential Multi-Dimensional Kernel-Based CNV Association Test. *Life* **2021**, *11*, 1302. https://doi.org/10.3390/ life11121302

Academic Editors: Md. Altaf-Ul-Amin, Tao Huang, Shigehiko Kanaya, Naoaki Ono, Ming Huang

Received: 15 September 2021 Accepted: 23 November 2021 Published: 26 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

A copy number variant is the gain or loss of DNA segments in the genome ranging in size from one kilobase to several megabases. CNVs are described by three characteristics: type, chromosomal position, and dosage [4]. The type of CNV is either amplification or deletion. The chromosomal position of the CNV is described by the start and end position of the CNV in the chromosome. The dosage represents the total number of copies of the CNV, with a value less than two indicating deletion and greater than two indicating amplification. In addition, CNVs have phenotypically heterogeneous effects: different CNV types and dosages at the same position in the chromosome can have different impacts. Biological studies report that CNVs are distributed non-randomly in genomes; in particular, they tend to be located close to telomeres and centromeres [5]. However, it is still unclear whether there is any specific pattern in the sequential order of CNVs that may lead to a disease or trait.

Association studies have determined that genetic variations, both CNVs and SNPs, are associated with diseases or traits. Thus, understanding the relationship between genetic variation and disease may provide important insights into genetic causes, leading to effective means of preventing and treating diseases. While many computational association studies have investigated the association between SNPs and diseases or traits, methods for studying CNVs remain underdeveloped due to the multi-dimensional characteristics of CNVs.

The CNV kernel association test (CKAT) [6], the copy number profile curve-based association test (CONCUR) [7], and the multi-dimensional copy number variant kernel association test (MCKAT) [8] are a few existing computational kernel-based methods that have studied the association between CNVs and diseases. In these studies, different kernels are proposed to measure the similarity between CNV profiles with respect to the CNV characteristics. The similarity between CNV profiles is then compared with the disease-related trait status to identify any potential association between CNVs and the disease. Among them, our previous method, MCKAT, is the only one that incorporates all multi-dimensional characteristics of CNVs in testing the association between CNVs and a disease or trait. MCKAT calculates the *p*-value of the association test analytically, which is computationally efficient and flexible for CNV association analysis for both rare and common CNV types, as demonstrated in numerical studies. However, neither MCKAT nor the other methods consider the CNV sequential order in testing the association between CNVs and disease-related traits.

Starting from MCKAT, we propose a sequential multi-dimensional CNV kernel-based association test (SMCKAT) for investigating the association between CNVs and diseases or traits. SMCKAT utilizes not only all multi-dimensional characteristics of CNVs but also their sequential order in testing the association. Based on the results, SMCKAT is applicable to both rare and common datasets and is capable of identifying hot-spots on the genome where both the CNV characteristics and the CNV sequential order are significantly associated with a disease or trait.

The rest of this paper is organized as follows. Section 2 presents the method and materials. Section 3 contains simulation studies. Section 4 shows the results of the real data application. Section 5 presents the discussion, and finally Section 6 concludes the work.

#### **2. Method and Materials**

We design a sequential multi-dimensional kernel framework capable of measuring the similarity between CNV profiles utilizing all CNV characteristics and the CNV sequential order. It contains two kernels. The first kernel, the pair group kernel, measures the similarity between two groups of CNVs at the same ordinal position of two CNV profiles. It contains three sub-kernels, each responsible for measuring the similarity between two CNVs with respect to one of the three CNV characteristics. The second kernel, the whole genome group kernel, aggregates the similarity between every possible CNV pair group to measure the total similarity between the CNV profiles of the subjects. Finally, the association between the CNV sequential order across a chromosome and disease-related traits is tested by comparing the similarity in CNV profiles to that in the trait using an association test.

#### *2.1. Pair CNV Group Kernel*

Let *X* denote a single CNV, defined by four characteristics as $X = (X^{(1)}, X^{(2)}, X^{(3)}, X^{(4)})$, where $X^{(1)}$ and $X^{(2)}$ are the CNV starting and ending positions on the chromosome, $X^{(3)}$ is the CNV type, and $X^{(4)}$ is the CNV dosage. First, we generate the CNV profile *R* for subject *i* with $l_i$ CNVs as $R_i = (X_1^i, X_2^i, \dots, X_{l_i}^i)$, where the CNVs are sorted by their chromosomal position. Secondly, we extract a CNV group of size *n* out of the CNV profile as $G_i = (X_m^i, X_{m+1}^i, \dots, X_{m+n}^i)$, where *n* is the group size, which can take any value between 1 and *l*, the number of existing CNVs in the CNV profile, as shown in Figure 1.

**Figure 1.** Generating CNV profile $R_i$, where CNVs are sorted with respect to their chromosomal position. A, B, . . . , and F are arbitrary CNVs at the $m$th, $(m+1)$th, . . . , and $(m+n)$th positions, and $G_i$ is a group of CNVs of size *n*.

We propose a pair CNV group kernel, *KPG*, to measure the similarity between two CNV groups of size *n*, $G_i$ and $G_j$, in two CNV profiles. First, *KPG* aligns each CNV in $G_i$ with its relevant CNV in $G_j$ with respect to their position to generate *n* CNV pairs, as shown in Figure 2.

**Figure 2.** Aligning CNVs within two CNV groups of size *n*, *Gi* and *Gj*, to generate *n* CNV pairs.

Then, *KPG* measures the similarity between each CNV pair using the single-pair CNV kernel, *KS*, which we proposed in [8]. *KS* measures the similarity of a CNV pair through three sub-kernels considering all CNV features, including chromosomal position, type, and dosage. Finally, *KPG* averages the similarities calculated by *KS* over all generated CNV pairs to measure the similarity between the two CNV groups, $G_i$ and $G_j$, as

$$K_{PG}(G_i, G_j) = \frac{1}{n} \sum_{m=1}^{n} K_s\!\left(X_m^i, X_m^j\right) \tag{1}$$

where *Ks* is defined as

$$K_s\!\left(X_m^i, X_m^j\right) = \left[\frac{\mathrm{Intersection}\!\left(\left(X_m^{i(1)}, X_m^{i(2)}\right), \left(X_m^{j(1)}, X_m^{j(2)}\right)\right)}{\mathrm{Union}\!\left(\left(X_m^{i(1)}, X_m^{i(2)}\right), \left(X_m^{j(1)}, X_m^{j(2)}\right)\right)}\right] \times \left[\frac{I\!\left[X_m^{i(3)} = X_m^{j(3)}\right] + 1}{2}\right] \times \left[\frac{1}{2^{\left|DR\left(X_m^{i(4)}\right) - DR\left(X_m^{j(4)}\right)\right|}}\right] \tag{2}$$

and the first term measures the mutual presence of a CNV with a specific start and end position by dividing the size of the intersection of the two CNVs by the size of their union. The intersection function calculates the length of the chromosomal region that belongs to both CNVs. Similarly, the union function calculates the length of the chromosomal region consisting of the regions that belong to the first CNV and to the second CNV. The second term compares the CNV types of the two CNVs to calculate the similarity between them. The third term measures the similarity between the two CNVs with respect to their dosage. *DR* is the difference-from-the-reference function we proposed in [8], *DR*(*dosage*) = |*dosage* − 2|; it measures the difference between a CNV dosage and the reference dosage value of 2.
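A minimal sketch of the single-pair kernel *KS* in (2) and the pair group kernel *KPG* in (1). The tuple layout (start, end, type, dosage) follows the definition of *X*; computing the union as the spanned region (exact whenever the CNVs overlap) and the type labels are assumptions for illustration.

```python
def dr(dosage):
    """Difference from the reference dosage of 2: DR(d) = |d - 2|."""
    return abs(dosage - 2)

def k_s(a, b):
    """Single-pair CNV kernel (Eq. 2); a CNV is (start, end, type, dosage)."""
    sa, ea, ta, da = a
    sb, eb, tb, db = b
    inter = max(0, min(ea, eb) - max(sa, sb))  # shared region length
    union = max(ea, eb) - min(sa, sb)          # spanned region length
    pos_term = inter / union if union else 0.0
    type_term = ((ta == tb) + 1) / 2           # 1 if same type, else 0.5
    dose_term = 1.0 / 2 ** abs(dr(da) - dr(db))
    return pos_term * type_term * dose_term

def k_pg(g_i, g_j):
    """Pair CNV group kernel (Eq. 1): average K_s over position-aligned pairs."""
    return sum(k_s(x, y) for x, y in zip(g_i, g_j)) / len(g_i)

cnv_1 = (100, 200, "del", 1)
cnv_2 = (150, 250, "amp", 4)
```

For identical CNVs, `k_s` returns 1; for `cnv_1` and `cnv_2` above, the three terms are 1/3, 1/2, and 1/2, giving 1/12.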

#### *2.2. Whole Genome CNV Group Kernel*

First, we create a window of size *n* and slide it across the CNV profile $R_i$, as shown in Figure 3, to extract all possible CNV groups of size *n* as $P_i = (G_1^i, \dots, G_{p_i}^i)$, where the CNV groups are sorted by their position and $p_i$ is the number of extracted CNV groups for the CNV profile $R_i$. Similarly, we have another CNV group series $P_j = (G_1^j, \dots, G_{q_j}^j)$ for CNV profile $R_j$.

**Figure 3.** Sliding window of size *n* across CNV profile to extract CNV groups of size *n*.

Then, we propose the whole genome CNV group kernel, *KWG*, to measure the similarity between two CNV group series *Pi* and *Pj* as

$$K_{WG}(P_i, P_j) = \begin{cases} 0 & \text{if } p_i \times q_j = 0 \\[6pt] \displaystyle\sum_{z=1}^{\max(p_i,\, q_j)} \max\!\left(K_{PG}(G_z^i, G_{z-1}^j),\ K_{PG}(G_z^i, G_z^j),\ K_{PG}(G_z^i, G_{z+1}^j)\right) & \text{if } p_i \times q_j \neq 0 \end{cases} \tag{3}$$

where $K_{PG}(\cdot, \cdot)$ is the pair CNV group kernel from (1). *KWG* measures the similarity between the pair CNV groups at the same position and aggregates these similarities to calculate the similarity of the two CNV group series. The inner maximum operation in the definition of *KWG* searches for the group-to-group correspondence of highest similarity to align the CNV groups in the two CNV group series, as shown in Figure 4.

The kernel-based association test described in the following section requires a kernel similarity matrix *K*, a *d* × *d* matrix with $K_{ij} = K_{WG}(P_i, P_j)$, where *d* is the number of existing CNV profiles. $K_{ij}$ expresses the similarity between CNV profiles *i* and *j* as measured by *KWG*.

**Figure 4.** Aligning $G_z^i$ to the best group-to-group correspondence of highest similarity among $G_{z-1}^j$, $G_z^j$, and $G_{z+1}^j$.
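A sketch of the whole genome group kernel *KWG* in (3), with any pair-group kernel plugged in. Equation (3) leaves the behavior at the series boundaries and for unequal series lengths implicit; the boundary clipping and the longer/shorter orientation below are assumptions.

```python
def k_wg(p_i, p_j, pair_group_kernel):
    """Whole genome CNV group kernel (Eq. 3): each group is matched with the
    best of the same / previous / next group in the other series."""
    if not p_i or not p_j:
        return 0.0  # p_i * q_j = 0: nothing to compare
    longer, shorter = (p_i, p_j) if len(p_i) >= len(p_j) else (p_j, p_i)
    total = 0.0
    for z in range(len(longer)):
        neighbors = [shorter[w] for w in (z - 1, z, z + 1) if 0 <= w < len(shorter)]
        total += max(pair_group_kernel(longer[z], g) for g in neighbors)
    return total

# Toy pair-group kernel: fraction of position-wise matches (stand-in for K_PG).
def toy_kpg(g1, g2):
    return sum(a == b for a, b in zip(g1, g2)) / max(len(g1), len(g2))

p1 = [("A", "B"), ("B", "C"), ("C", "D")]
p2 = [("A", "B"), ("C", "D")]
sim = k_wg(p1, p2, toy_kpg)  # groups 1 and 3 find exact matches: 1 + 0 + 1
```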

*2.3. Kernel-Based Association Test*

We use the following logistic regression model to test the association between the CNV sequential order and a disease-related trait:

$$\operatorname{logit}[Pr(y_i = 1)] = \beta_0 + Z\beta + f(P_i) \tag{4}$$

where $y_i$ is the status of the disease-related trait, with $y_i = 1$ denoting the existence of the trait and $y_i = 0$ denoting otherwise; $i = 1, 2, \dots, d$ indexes the CNV profiles; and *Z* is the covariate matrix including information such as age and gender. $P_i$ is the CNV group series of the profile $R_i$, as explained previously. $f(\cdot)$ is a function spanned by the whole genome CNV group kernel $K_{WG}(\cdot, \cdot)$. According to Equation (4), the hypothesis of no association between the CNV sequential order and the existence of a disease-related trait can be tested as $H_0 : f(\cdot) = 0$. One way to test this is to treat $f(\cdot)$ as a random effect vector distributed as $N(0, \tau K)$, where $\tau \geq 0$ and *K* is the $d \times d$ similarity matrix, treated as the covariance matrix of the random effect, generated by $K_{WG}$ as defined in [6]. Liu et al. [9] have shown that testing $H_0 : f(\cdot) = 0$ is equivalent to testing $H_0 : \tau = 0$ in the logistic mixed effect model. Moreover, $\tau$ is a variance component parameter of the logistic mixed effect model, which can be tested using a restricted maximum likelihood-based score test [9,10].

We use the following score test statistic, where $\hat{y}$ is estimated under the null model $\operatorname{logit}[Pr(y_i = 1)] = \beta_0 + Z\beta$ and *K* is the similarity matrix explained in the previous section:

$$Q = (y - \hat{y})' K (y - \hat{y}) \tag{5}$$

Then, we use the Davies method [11], as implemented in the CKAT R package [6], to calculate the *p*-value of the proposed kernel-based association test. The SMCKAT workflow is summarized in Figure 5.
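The quadratic-form statistic in (5) is straightforward to compute once the null-model fitted values and the similarity matrix are available. The toy fitted probabilities and identity kernel below are placeholders; the Davies p-value step from the CKAT R package is not reproduced here.

```python
import numpy as np

def score_statistic(y, y_hat, K):
    """Variance-component score statistic Q = (y - y_hat)' K (y - y_hat)."""
    r = y - y_hat
    return float(r @ K @ r)

# Toy example: binary trait, intercept-only null fit, identity kernel.
y = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.full(4, 0.5)           # fitted probabilities under the null model
K = np.eye(4)                     # placeholder for the K_WG similarity matrix
q = score_statistic(y, y_hat, K)  # residuals are +-0.5, so q = 4 * 0.25 = 1.0
```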

**Figure 5.** SMCKAT workflow diagram.

#### *2.4. Common and Rare CNV Data*

Biologists generally assign CNVs to one of two major types, depending on the length of the affected chromosomal region and the occurrence frequency: copy number polymorphisms (CNPs) and rare variants [4]. CNPs are widespread in the general population, with an average occurrence frequency greater than one percent, while rare variants are much longer than CNPs, ranging from hundreds of thousands of base pairs to over one million base pairs.

We apply SMCKAT to both rare and common CNV public-domain genome sequencing data sets to evaluate its performance on both CNV types. The two CNV data sets used in this study are from individuals with rhabdomyosarcoma (RMS) cancer and autism spectrum disorder (ASD). The RMS data set [12] contains the common CNVs of 44 subjects, while the ASD data set [13] contains the rare CNVs of 588 subjects. In both data sets, each CNV is described by its chromosomal position, type, and dosage.

#### **3. Simulation Studies**

We conducted simulations to evaluate the performance of SMCKAT and ensure that it properly controls type I and II errors while retaining relatively high power in detecting existing associations. Besides SMCKAT, MCKAT and CKAT are also studied for comparison. We conduct our simulation studies under two main scenarios: in the first, we evaluate the performance of SMCKAT on rare CNV data; in the second, we evaluate its performance on common CNV data.

We use the ASD dataset and the RMS dataset in the first and second simulation scenarios, respectively. These datasets are also studied in the real data analysis, and further details regarding them are given in Section 4. We simulated 10<sup>5</sup> datasets for each simulation scenario.

The ASD dataset has the same dosage value for all deletions and, similarly, the same dosage value for all amplifications. Therefore, we randomly generate other values for the CNV dosage to conduct our simulation studies and investigate the dosage effect in identifying existing associations. The simulated dosage value can take 0 or 1 for deletion types and 3, 4, . . . , 7 for amplification types. Values are drawn with equal probability within each type: 0.5 for each deletion dosage and 0.2 for each amplification dosage.
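Under these assumptions, the dosage simulation can be sketched in a few lines of Python; the function name and the use of NumPy's random generator are illustrative, not taken from the SMCKAT package:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_dosage(cnv_type):
    # Deletion dosages are drawn from {0, 1}, each with probability 0.5;
    # amplification dosages from {3, ..., 7}, each with probability 0.2.
    if cnv_type == "deletion":
        return int(rng.choice([0, 1]))
    return int(rng.choice([3, 4, 5, 6, 7]))
```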

A case-control phenotype is generated for both SMCKAT and MCKAT from the following logistic model that we proposed in [8],

$$\begin{split} logit\left(Pr(Y_{i} = 1)\right) &= \beta_{0} + \sum_{j=1}^{m_{i}} \beta_{j}^{\text{Len}} (X_{ij}^{(2)} - X_{ij}^{(1)}) + \sum_{j=1}^{m_{i}} \left( \beta_{j}^{\text{Del}} I[X_{ij}^{(3)} = 1] + \beta_{j}^{\text{Dup}} I[X_{ij}^{(3)} = 3] \right) \\ &+ \sum_{j=1}^{m_{i}} \beta_{j}^{\text{Dsg}} |X_{ij}^{(4)} - 2| + \sum_{j=1}^{m_{i}} \beta_{j}^{\text{Len}*\text{Del}*\text{Dsg}} (X_{ij}^{(2)} - X_{ij}^{(1)}) \times I[X_{ij}^{(3)} = 1] \times X_{ij}^{(4)} \\ &+ \sum_{j=1}^{m_{i}} \beta_{j}^{\text{Len}*\text{Dup}*\text{Dsg}} (X_{ij}^{(2)} - X_{ij}^{(1)}) \times I[X_{ij}^{(3)} = 3] \times X_{ij}^{(4)} \end{split} \tag{6}$$

where *X<sub>ij</sub>* = (*X<sub>ij</sub>*<sup>(1)</sup>, *X<sub>ij</sub>*<sup>(2)</sup>, *X<sub>ij</sub>*<sup>(3)</sup>, *X<sub>ij</sub>*<sup>(4)</sup>) is the *j*th CNV of the *i*th individual as defined previously. *β*<sub>0</sub> corresponds to a baseline disease rate. *β<sub>j</sub>*<sup>Len</sup> controls the effect of CNV length, and *β<sub>j</sub>*<sup>Del</sup> and *β<sub>j</sub>*<sup>Dup</sup> are the log odds ratios of CNV *j* for being a deletion versus an amplification and vice versa; they share the same value but with opposite signs. *β<sub>j</sub>*<sup>Dsg</sup> controls the effect of dosage, and *β<sub>j</sub>*<sup>Len∗Del∗Dsg</sup> and *β<sub>j</sub>*<sup>Len∗Dup∗Dsg</sup> allow the effects of CNV length and type to differ by dosage in CNV *j*.

After generating phenotypes for SMCKAT and MCKAT, we use the following logistic model, proposed in [6], to generate the phenotypes under the CKAT method:

$$logit(\pi_i) = \beta_0 + \sum_{j=1}^{m_i} \left( \beta_j^{\text{Del}} I[X_{ij}^{(2)} = 1] + \beta_j^{\text{Dup}} I[X_{ij}^{(2)} = 3] \right) X_{ij}^{(1)} \tag{7}$$

where *X<sub>ij</sub>* = (*X<sub>ij</sub>*<sup>(1)</sup>, *X<sub>ij</sub>*<sup>(2)</sup>) is the *j*th CNV of the *i*th subject, *π<sub>i</sub>* = *Pr*(*Y<sub>i</sub>* = 1), *β*<sub>0</sub> is the baseline prevalence rate of the disease, and *β<sub>j</sub>*<sup>Dup</sup> and *β<sub>j</sub>*<sup>Del</sup> are the log odds ratios of CNV *j* for duplication and deletion, respectively.
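For intuition, drawing a case/control label from a model of the form of Equation (7) can be sketched in Python; the interface, coefficient handling, and constant per-CNV coefficients below are illustrative assumptions, not the CKAT code:

```python
import math
import random

def simulate_phenotype(cnvs, beta0, beta_del, beta_dup):
    # cnvs: list of (length, cnv_type) pairs, with cnv_type 1 = deletion
    # and 3 = duplication, mirroring X_ij = (X_ij^(1), X_ij^(2)) in (7).
    eta = beta0
    for length, cnv_type in cnvs:
        if cnv_type == 1:
            eta += beta_del * length
        elif cnv_type == 3:
            eta += beta_dup * length
    p = 1.0 / (1.0 + math.exp(-eta))  # inverse logit
    return 1 if random.random() < p else 0
```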

#### *Simulation Results*

The QQ-plots of *p*-values of SMCKAT, MCKAT and CKAT under both simulation scenarios are presented in Figure 6.

**Figure 6.** *p*-value based QQ-plots of SMCKAT, MCKAT and CKAT under the first (**a**) and second (**b**) simulation scenarios.

Based on QQ-plot (a), SMCKAT and MCKAT lie on the 45-degree line under the first simulation scenario. This indicates that both SMCKAT and MCKAT properly control the type I and II error rates at different nominal significance levels, even as low as 10<sup>−5</sup>, when dealing with the rare CNV dataset. However, CKAT shows a higher chance of committing a type II error in detecting existing associations between the rare CNVs and the phenotype.

As shown in QQ-plot (b), both SMCKAT and MCKAT maintain the correct type I and II error rates at different nominal significance levels when dealing with the common CNV data. We observe that SMCKAT is slightly conservative when the significance level is small. However, CKAT performs poorly in controlling the type I error and in detecting existing associations between the common CNVs and the phenotype.

The empirical powers of SMCKAT, MCKAT and CKAT under the first and second scenarios are presented in Figures 7 and 8, respectively. As shown in Figure 7, SMCKAT and MCKAT have similar power when dealing with rare CNVs. However, CKAT shows lower power compared with SMCKAT and MCKAT. The reason might be that CKAT does not consider the CNV dosage information when testing the association.

Similarly, in the second simulation scenario, SMCKAT and MCKAT have similar powers. However, CKAT shows low power when dealing with common CNV data. This might be due to the CKAT scanning algorithm for aligning CNVs in the CNV profiles: the CKAT shift-by-one scanning algorithm can capture similarity between only a limited number of CNVs, which may result in low performance when dealing with common CNVs.

**Figure 7.** Empirical power of SMCKAT, MCKAT and CKAT under first simulation scenario, rare CNV data.

**Figure 8.** Empirical power of SMCKAT, MCKAT and CKAT under second simulation scenario, common CNV data.

#### **4. Real Data Application Results**

We conducted SMCKAT analysis, for different CNV group sizes, on single chromosomes and the whole genome to test the association between CNV sequential order and disease-related traits. The disease-related traits studied in this paper are cancer subtype for the RMS data set and disease status for the ASD data set. We compared SMCKAT results with those obtained from MCKAT and CKAT to evaluate SMCKAT performance on real CNV data.

#### *4.1. CNV Analysis on Rhabdomyosarcoma Data Set*

First, we conducted the experiment on the RMS data. RMS occurs as two major histological subtypes, embryonal (ERMS) and alveolar (ARMS), and the classification of the RMS subtype has a direct effect on the patient's treatment options. The RMS data include a total of 59,131 CNVs for 25 alveolar and 19 embryonal cancers. We apply SMCKAT to each of the 23 chromosome pairs, with different CNV group sizes, to test the association between CNV sequential order and RMS subtype. Bonferroni correction is used to adjust for multiple testing and control the family-wise error rate (FWER) at *α* = 0.05. Since 22 autosomes and a sex chromosome are being tested, the *p*-value threshold for whole-chromosome significance is calculated as 0.05/23 = 2.2 × 10<sup>−3</sup>. SMCKAT identifies four of the 23 chromosomes whose CNV sequential order is significantly associated with the RMS subtype. The *p*-values of SMCKAT for these four chromosomes are reported in Table 1.
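The Bonferroni thresholds used in this section follow directly from dividing the target FWER by the number of tests; a one-line helper (illustrative, not part of the SMCKAT package) reproduces the values quoted in the text:

```python
def bonferroni_threshold(alpha, n_tests):
    # Per-test significance threshold that controls the family-wise
    # error rate at alpha across n_tests hypotheses.
    return alpha / n_tests

# 23 chromosomes: 0.05 / 23 ~ 2.2e-3; 40 cytogenetic bands: 0.05 / 40 = 1.25e-3
```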

**Table 1.** *p*-values of testing the association between CNV sequential order and RMS subtype trying different CNV group sizes. *n* is the group size and (#) denotes the total number of CNVs on the chromosome.


Based on the results, SMCKAT identifies the CNV sequential order in chromosomes 2, 8, 11, and 13 as significantly associated with the RMS subtype at the Bonferroni-corrected threshold of 2.2 × 10<sup>−3</sup>. These results are consistent with existing biological knowledge, which demonstrates the ability of SMCKAT to identify CNV sequential orders significantly associated with specific disease-related traits.

For example, ref. [14] shows that RMS is associated with specific chromosomal abnormalities that differentiate ARMS and ERMS. Based on their study, approximately 80% of ARMS tumors display a translocation between the FOXO1 transcription factor gene located on chromosome 13 and the PAX3 transcription factor gene on chromosome 2, and ERMS tumors show a higher frequency of specific genetic mutations on chromosome 11 than ARMS; ref. [15] reported the same finding earlier. Furthermore, ref. [16] found that the ARMS subtype is significantly associated with amplifications on chromosome 8. Our findings suggest that, besides gene mutations and chromosomal translocations, another mechanism, CNVs, can play a significant role in causing disease-related traits.

We tested different CNV group sizes when applying SMCKAT to the RMS data set. Based on the results reported in Table 1, SMCKAT shows the strongest evidence and the smallest *p*-value for chromosome 8 across all CNV group sizes. This suggests that subjects with the same RMS subtype may have a similar CNV sequential order on their chromosome 8.

We also tested SMCKAT on the RMS data set for group sizes greater than six. We observed an increasing trend in *p*-values as the group size increased, indicating a decline in the significance of the association between the CNV sequential order and the RMS subtype.

#### *4.2. CNV Analysis on Cytogenetic Bands in RMS*

Based on the results reported in Table 1, there is strong statistical evidence, supported by a *p*-value near zero, that the CNV sequential order of chromosome 8 with a group size of six is significantly associated with the RMS subtype. Therefore, we picked chromosome 8 with a CNV group size of six for further analysis. We partitioned chromosome 8 into smaller regions based on its cytogenetic bands and applied SMCKAT to each band to check whether SMCKAT is capable of detecting more specific regions than whole chromosomes. Then, we compared the results with those of MCKAT and CKAT. Table 2 contains the *p*-values of the association test in each cytogenetic band of chromosome 8. Since 40 cytogenetic bands are being tested in chromosome 8, the *p*-value threshold for band significance is calculated as 0.05/40 = 1.25 × 10<sup>−3</sup>.

**Table 2.** *p*-values of testing the association between RMS subtype and CNVs in the chromosome 8 cytogenetic bands by SMCKAT, MCKAT and CKAT. (\*) denotes significant association between RMS subtype and CNVs, (#) denotes the number of total CNVs on the band.


As shown in Table 2, both SMCKAT and MCKAT detect cytogenetic bands significantly associated with the RMS subtype, while CKAT does not identify any significant region. MCKAT identified eight cytogenetic bands whose CNVs are significantly associated with the RMS subtype. Two of these eight bands, 8*p*23.1 and 8*p*12.0, are identified by SMCKAT as well, meaning that not only the CNV characteristics but also the CNV sequential order in these two bands are significantly associated with the RMS subtype. Based on the results, SMCKAT has the potential to provide more specific CNV regions than MCKAT when testing the association between CNVs and disease-related traits.

#### *4.3. CNV Analysis on Autism Data Set*

We applied SMCKAT to the ASD data set to evaluate its performance on the rare CNV type, aiming to test whether there is any association between the sequential order of CNVs and ASD status. The ASD data set contains 1285 rare CNVs from 310 individuals with ASD and 1074 rare CNVs from 278 healthy individuals. Since the ASD data set contains only rare and large CNVs, an arbitrary CNV profile may have no or few CNVs on some chromosomes. Therefore, instead of applying SMCKAT to each of the 23 chromosomes, we applied it to the whole genome and tested whether there is any association between the whole-genome CNV sequential order and ASD status. We considered 0.05 as the *p*-value threshold for whole-genome significance. As reported in Table 3, there is strong statistical evidence, up to a CNV group size of five, that subjects with the same disease status have similar CNV orders in their CNV profiles. We also tested SMCKAT on the ASD data for larger group sizes and observed an increasing trend in *p*-values as the group size increased, indicating a decline in the significance of the association between the CNV sequential order and ASD status.

**Table 3.** *p*-values of testing the association between CNV sequential order and ASD status trying different CNV group sizes.


#### **5. Discussion**

SMCKAT tests the association between CNVs and disease-related traits. It checks whether CNVs are randomly distributed on the chromosomes or whether their sequential orders are significant and associated with disease-related traits. Our approach has several advantages over existing methods. First, it measures the similarity between CNV profiles by considering not only all CNV characteristics but also the CNV sequential order; to our knowledge, it is the first approach to study the association between CNV sequential order and disease-related traits. Secondly, it is applicable to both rare and common CNV data sets, while previous methods like CKAT cannot deal with common CNV data sets. Thirdly, SMCKAT is more stringent than the state-of-the-art approach MCKAT in detecting significant CNV regions. Finally, SMCKAT can help biologists detect CNV regions significantly associated with a disease-related trait across a patient group, instead of examining the CNVs case by case in each subject.

Although our experimental results are promising and more specific compared with the state-of-the-art kernel approach, this study has limitations. Few CNV data sets are publicly available, and most of those available do not contain all CNV features together, in particular the dosage information. Consequently, our method is tested on only one data set, the RMS data set, that includes all multi-dimensional CNV characteristics. For the ASD data set, we considered a dosage less than two for all deletions and greater than two for all amplifications to make the most of the proposed method's capability. Applying SMCKAT to more data sets containing all CNV characteristics would help determine its strengths and weaknesses. In addition, there is no existing biological or computational study of CNV sequential order against which to validate our experimental results.

Our study shows that the CNV sequential order has the potential to play a significant role in causing disease-related traits, and more findings may be revealed through more comprehensive analyses as additional data become available.

#### **6. Conclusions**

This paper presents a sequential multi-dimensional CNV association test identifying associations between CNVs and disease-related traits using all multi-dimensional CNV characteristics and the CNV sequential order. Our method, SMCKAT, uses different kernels to measure the similarity between CNV profiles with respect to both CNV orders and characteristics. Then, the similarity in CNV profiles is compared to the similarity in disease-related traits to test for an association.

The evaluation was conducted on two types of CNV data sets, a rare CNV data set and a common CNV data set. The results indicate that our method provides statistically strong evidence of an association between the sequential order of CNVs and disease-related traits. Currently, SMCKAT is capable of testing the association between CNVs and qualitative disease-related traits; in future work, we will expand the SMCKAT framework to be applicable to both qualitative and quantitative traits.

**Author Contributions:** N.M.E., P.J.K.: conceptualization and study design. N.M.E., D.C. and P.J.K.: data processing. N.M.E.: conducted the analysis. N.M.E.: drafting manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** All data used in this study are public domain data and openly accessible; please see the Data Availability Statement. Access to the RMS data on dbGaP was provided under NCI Authorized Access #44698\_6.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The ASD and RMS datasets supporting the conclusions of this article are available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213131 and https://www.ncbi.nlm.nih.gov/gap (accession number: phs000720.v3.p1), respectively (access date: 20 November 2021). The SMCKAT R package is publicly available in the https://github.com/nesfehani/SMCKAT GitHub repository (access date: 20 November 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Genome-Wide Scanning of Potential Hotspots for Adenosine Methylation: A Potential Path to Neuronal Development**

**Sanjay Kumar 1,†, Lung-Wen Tsai 2,3,4,†, Pavan Kumar 5,6, Rajni Dubey 2, Deepika Gupta 7, Anjani Kumar Singh 8, Vishnu Swarup 7,\* and Himanshu Narayan Singh 9,\***


**Abstract:** Methylation of adenosines at the N6 position (m6A) is the most frequent internal modification in mRNAs of the human genome and has been attributed diverse roles in physiological development and pathophysiological processes. However, studies on the role of m6A in neuronal development are sparse and not well documented. m6A detection remains challenging due to its inconsistent pattern and the limited sensitivity of current detection techniques. Therefore, we applied a sliding window technique to identify the consensus site (5′-GGACT-3′)<sub>*n*</sub>, *n* ≥ 2, and annotated all m6A hotspots in the human genome. Over 6.78 × 10<sup>7</sup> hotspots were identified, and 96.4% were found to be located in non-coding regions, suggesting that methylation occurs before splicing. Several genes, *RPS6K, NRP1, NRXN, EGFR, YTHDF2*, are involved in various stages of neuron development and functioning; however, the contribution of m6A in these genes needs further validation in experimental models. Thus, the present study elaborates on the location of m6A in the human genome and its function in neuron physiology.

**Keywords:** adenosine methylation; m6A; RNA modification; neuronal development

#### **1. Introduction**

Among the 150 RNA modifications reported to date, methylation at the N6 position of adenosine (m6A) is the post-transcriptional RNA modification with the highest physiological relevance [1]. This reversible modification of RNA regulates the expression of several genes and affects human physiology [2]. Over 7000 genes have been reported to carry this modification in humans, and aberrant RNA modification contributes to the pathogenesis of various human diseases. Notably, abnormal modification of human tRNA may lead to mental retardation and intellectual disability [3]. Among all RNA modifications, m6A is the most abundant in mRNAs of eukaryotic cells. Altered m6A modifications have been linked with several diseases, such as obesity, cancer, diabetes mellitus, and stress-related psychiatric disorders, as well as with neuronal development and function [4,5]. Several analytical tools have revealed that 5′-GGACU-3′ is the most common structural signature for m6A modification [6,7].

**Citation:** Kumar, S.; Tsai, L.-W.; Kumar, P.; Dubey, R.; Gupta, D.; Singh, A.K.; Swarup, V.; Singh, H.N. Genome-Wide Scanning of Potential Hotspots for Adenosine Methylation: A Potential Path to Neuronal Development. *Life* **2021**, *11*, 1185. https://doi.org/10.3390/life11111185

Academic Editors: Md. Altaf-Ul-Amin, Shigehiko Kanaya, Naoaki Ono and Ming Huang

Received: 26 September 2021 Accepted: 30 October 2021 Published: 5 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Recent reports demonstrate that not all adenines in RNA are methylated; the probability of methylation is random, and some RNAs are even entirely devoid of this modification. Moreover, no consensus has been reached on the methylation pattern; nucleotides flanking "methylable adenines" affect the possibility of their methylation. Cumulatively, these factors cause difficulties during the in vitro validation of m6A in RNA. In addition, the current technologies used for identifying m6A sites have several limitations. The resolution of methyl-RNA immunoprecipitation and sequencing (MeRIP-Seq) covers around 200 nucleotides; therefore, it cannot be used to pinpoint the precise location of the m6A modification [8]. Another technique, called site-specific cleavage and radioactive-labeling followed by ligation-assisted extraction and thin-layer chromatography (SCARLET), is time-consuming and expensive and not feasible for high-throughput applications [9,10]. Most existing methods are ineffective in identifying m6A sites due to the biased and unpredictable reactivity of chemicals toward a specific RNA modification and their failure to produce single-nucleotide-resolution sequencing data [11–13]. Intrinsic features, such as fragility, multiple open reading frames, alternative splicing, and short RNA half-lives, also contribute to these m6A analysis flaws. Thus, generating all potential m6A sites in a single transcriptome analysis within a predefined time frame is challenging with the currently available tools. Alternatively, tagging the target sequence in the genome itself can unveil the distribution of all potential m6A sites, which display methylation possibilities, perhaps aiding the understanding of m6A's function in physiological processes. Here, we present a sliding window-based technique to identify all such sites in the human genome, considering each one as a potential methylation site. Furthermore, we delineate the role of m6A modification in the neurological milieu, contrasting physiological and pathological conditions.

#### **2. Methodology**

#### *2.1. Definition of m6A Methylation Sites*

The consensus sequence (5′-GGACT-3′)<sub>*n*</sub>, *n* ≥ 2, in tandem was searched throughout the human genome (version GRCh37 patch 8). If methylated, two consensus sequences in tandem are considered more effective in generating physiological effects. Following strict criteria, no mismatch in the m6A sites was allowed.

#### *2.2. PatternRepeatAnnotator: A Home-Made PERL Script*

To locate m6A sites in the human genome, a home-made PERL script named "PatternRepeatAnnotator", based on the sliding window technique (window shift algorithm), was used [14,15]. "PatternRepeatAnnotator" was developed to explore user-defined patterns in a genome sequence (Figure 1). The sliding window technique is a method for finding a subarray (e.g., a consensus sequence) in the genome that satisfies given conditions (e.g., being in tandem). The search maintains a window over a subset of items (e.g., nucleotides) and shifts it along the larger sequence until the subarray is matched exactly. "PatternRepeatAnnotator" scans the consensus sequences through each chromosome (in FASTA format) to locate occurrences of a particular length (*n*) defined by the user. Consequently, it provides chromosome-wise coordinates for all the identified sites.
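The original PERL script is not reproduced here, but its core idea, scanning each chromosome for maximal tandem runs of the consensus unit, can be sketched in Python with a regular expression standing in for the explicit window shifts; the function name and the 0-based half-open coordinates are assumptions of this sketch:

```python
import re

def find_m6a_hotspots(seq, unit="GGACT", min_repeats=2):
    # Return (start, end, n) for every maximal tandem run of the
    # consensus unit containing at least min_repeats copies, using
    # 0-based, half-open coordinates.
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_repeats))
    return [(m.start(), m.end(), (m.end() - m.start()) // len(unit))
            for m in pattern.finditer(seq.upper())]
```

Applied chromosome by chromosome to FASTA input, this would yield chromosome-wise coordinate lists of the kind the script reports.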

**Figure 1.** Schematic algorithm used to develop the "PatternRepeatAnnotator".

#### *2.3. Annotation of m6A Sites*

To annotate the identified m6A sites, the GRCh37 genome annotation file was utilized (https://ftp.ncbi.nlm.nih.gov/genomes/archive/old\_refseq/Homo\_sapiens/ARCHIVE/BUILD.37.3/GFF/ref\_GRCh37.p5\_top\_level.gff3.gz, accessed on 26 September 2021). The identified coordinates of m6A sites were mapped to the annotation file, and after processing, all information was exported to a comma-separated value (.csv) file for downstream analysis. The promoter and downstream regulatory regions (DRR) were defined as 100 nucleotides upstream and 500 nucleotides downstream of each identified gene, respectively. Only genes containing recognition sequences in the coding (plus/sense) DNA strand were selected for further analysis. A single gene was counted as one entry, even if it carried the target sequence at multiple locations.
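Assuming the window sizes above, classifying a site coordinate relative to a plus-strand gene reduces to a few interval checks. This sketch illustrates the mapping step; the function name, region labels, and boundary conventions are illustrative assumptions:

```python
def classify_site(pos, gene_start, gene_end):
    # Classify a plus-strand coordinate relative to one gene, using the
    # window sizes from the text: promoter = 100 nt upstream of the gene
    # start, DRR = 500 nt downstream of the gene end.
    if gene_start - 100 <= pos < gene_start:
        return "promoter"
    if gene_start <= pos <= gene_end:
        return "genic"
    if gene_end < pos <= gene_end + 500:
        return "DRR"
    return "intergenic"
```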

#### *2.4. Gene Ontology (GO) Analysis*

To assess mechanistic biological insight into the genes of interest, Gene Ontology (GO) analysis was performed using gprofiler [16]. Enrichment maps were generated using ShinyGO, a gene ontology enrichment analysis software (South Dakota State University, Bioinformatics Research Group). The distribution of target sequences (*n* ≥ 2) in protein-coding genes, with their frequencies and enrichment scores per Mb of the respective chromosome, was analyzed.

#### **3. Results**

A total of 6.78 × 10<sup>7</sup> target sequences GGACT (*n* ≥ 2) were found throughout the human genome using the home-made script. Chromosome 2, with 242 million base pairs (Mbp), was found to carry the highest total number of target sequences (*n* = 1014.79 × 10<sup>4</sup>). Of these, 31.76 × 10<sup>4</sup>, 541.56 × 10<sup>4</sup>, 1.45 × 10<sup>4</sup>, 433.77 × 10<sup>4</sup>, and 6.23 × 10<sup>4</sup> target sequences were found in the exonic, intronic, promoter, genomic, and downstream regulatory regions (DRR), respectively (Table 1, Figure 2a). The enrichment (copy number of target sequences per Mbp of chromosome) was also highest (4.19 × 10<sup>4</sup> sequences/Mbp) in chromosome 2 (Figure 2b). Chromosome 24 was found to carry the lowest number of target sequences, 41.2 × 10<sup>4</sup> in total, with an enrichment score of 0.72 × 10<sup>4</sup>. Of these, 0.07 × 10<sup>4</sup>, 0.31 × 10<sup>4</sup>, 0.67 × 10<sup>4</sup>, 10.31 × 10<sup>4</sup>, and 29.93 × 10<sup>4</sup> target sequences were identified in the promoter, DRR, exonic, intronic, and genomic regions, respectively (Table 1).
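The enrichment score is simply the count of target sequences normalized by chromosome length in Mbp; the following helper (illustrative) reproduces the chromosome 2 value quoted above:

```python
def enrichment_score(n_sites, chrom_length_bp):
    # Copies of the target sequence per Mbp of chromosome.
    return n_sites / (chrom_length_bp / 1e6)
```

For chromosome 2, 1014.79 × 10<sup>4</sup> sites over 242 Mbp gives roughly 4.19 × 10<sup>4</sup> sequences/Mbp, matching Figure 2b.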

**Table 1.** Distribution of target sequence (*n* ≥ 2) found in different regions of human genome.


DRR—Downstream Regulatory Regions.

**Figure 2.** Distribution and enrichment score of m6A sites. (**a**) The potential m6A sites (×10<sup>4</sup>) in different parts of the human genome, such as promoters, DRR, exons, and genomic (intergenic) regions. (**b**) Enrichment score of target sequences according to chromosome size (in million base pairs). DRR: downstream regulatory regions.

Subsequently, we also examined the protein-coding genes per chromosome that carry the target sequence (*n* ≥ 2). Here, chromosome 2 had the highest number of genes with the target sequence (*n* = 1448), followed by chromosome 11 (*n* = 982) (Table 2). Interestingly, the highest frequency of the target sequence (*n* = 163) was observed in the MCF2 Transforming Sequence-Like (*MCF2L*) gene located on chromosome 13. Additionally, the highest proportion of protein-coding genes carrying the target sequence was found on chromosome 13 (81%; 266/327), followed by chromosome 4 (76%; 572/752), whilst chromosome 9 had the lowest proportion (8%; 64/786). Notably, chromosome 1, containing the highest number of protein-coding genes (*n* = 2058), was found to carry the target sequence in only 27% of its genes (Table 2).

**Table 2.** Distribution of target sequences (*n* ≥ 2) in protein-coding genes with their frequencies and enrichment score per Mb of respective chromosomes.

