**What Can Machine Learning Approaches in Genomics Tell Us about the Molecular Basis of Amyotrophic Lateral Sclerosis?**

### **Christina Vasilopoulou 1, Andrew P. Morris 2, George Giannakopoulos 3,4, Stephanie Duguez <sup>1</sup> and William Duddy 1,\***


Received: 30 October 2020; Accepted: 23 November 2020; Published: 26 November 2020 -

**Abstract:** Amyotrophic Lateral Sclerosis (ALS) is the most common late-onset motor neuron disorder, but our current knowledge of the molecular mechanisms and pathways underlying this disease remain elusive. This review (1) systematically identifies machine learning studies aimed at the understanding of the genetic architecture of ALS, (2) outlines the main challenges faced and compares the different approaches that have been used to confront them, and (3) compares the experimental designs and results produced by those approaches and describes their reproducibility in terms of biological results and the performances of the machine learning models. The majority of the collected studies incorporated prior knowledge of ALS into their feature selection approaches, and trained their machine learning models using genomic data combined with other types of mined knowledge including functional associations, protein-protein interactions, disease/tissue-specific information, epigenetic data, and known ALS phenotype-genotype associations. The importance of incorporating gene-gene interactions and cis-regulatory elements into the experimental design of future ALS machine learning studies is highlighted. Lastly, it is suggested that future advances in the genomic and machine learning fields will bring about a better understanding of ALS genetic architecture, and enable improved personalized approaches to this and other devastating and complex diseases.

**Keywords:** Amyotrophic Lateral Sclerosis; machine learning; genome-wide association studies; GWAS; genomics; ALS pathology; gene prioritization

#### **1. Introduction**

Amyotrophic Lateral Sclerosis (ALS) is a progressively fatal, late-onset motor neuron disorder that is predominately characterised by the loss of upper and lower motor neurons. Progressive muscle atrophy in ALS patients leads to swallowing difficulties, paralysis and ultimately to death from neuromuscular respiratory failure [1–3]. ALS is the most common type of motor neuron disorder, and has peak onset at 54–67 years old, although it can affect individuals of any age [2–5]. Patients typically survive 2–5 years after the first symptoms occur, with 5–10% surviving more than 10 years [1,2,6]. A population-based study of estimated ALS incidence in 10 countries found that

prevalence could increase more than 31% from 2015 to 2040 [4]. Thus, there is an increasing need to understand ALS pathology and the molecular pathways involved, towards prevention or successful therapeutic intervention.

There are two major classifications among ALS patients, based on family history: 5–10% of cases are genetically linked, and are classified as *familial*, having one or more relatives that suffer from ALS, while 90% are classified as *sporadic*, in which a familial history is not established, and where a genetic cause is usually not identified [7]. However, the distinction between the two categories is not always simple, with familial ALS-associated mutations also being present among sporadic ALS cases [3]. The extent and form of genetic contribution to sporadic ALS remains unclear, but genetic factors are considered to play an important role in the disease pathology [3,8]. Further investigation of the genetic architecture of both familial and sporadic cases is necessary.

In recent years, advances in high-throughput technologies have enabled the discovery of multiple Single Nucleotide Polymorphisms (SNPs) that are associated with ALS, mainly by the application of the Genome-Wide Association Study (GWAS) approach. GWAS aims to identify SNPs and other types of genetic variation (such as structural variants, copy number variations and multiple nucleotide polymorphisms) that are more frequent in patients than in people without the disease [9]. Statistical tests are carried out for disease association across genetic markers numbering from hundreds of thousands up to millions, depending on the genomic analytical platform. The most popular genotype-phenotype association studies use statistical models such as logistic or linear regression, depending on whether the trait is binary (i.e., case-control studies, such as ALS versus healthy controls) or quantitative (e.g., different scales of height). GWAS has been successful in discovering tens of thousands of significant genotype-phenotype associations in a large spectrum of diseases and traits, such as schizophrenia, anorexia nervosa, body-mass index (BMI), type 2 diabetes, and ALS [10–13]. Over the past decade, the discovery of significant genotype-phenotype associations has provided new insights into disease susceptibility, pathology, prevention, drug design and personalized medical approaches [11,14,15].

Rapid recent technological advances and great efforts in the field have led to the genomic profiling of large ALS cohorts, providing new insights into the pathology of ALS [12]. Initiatives such as Project MinE and dbGaP have contributed to the systematic release of ALS GWAS data [16,17]. The ALSoD publicly available database for genes that are implicated in ALS records 126 genes, with a subset having been reproduced in multiple studies [18]. As of July 2020, the GWAS catalogue has published 317 variants and risk allele associations with ALS [10].

The scope of this review covers genome-wide association studies that employ machine learning approaches with the aim to understand ALS pathology through gene prioritization. A search of PubMed and Google Scholar for the terms "amyotrophic lateral sclerosis", "GWAS" and "machine learning" yielded 420 results, of which 7 research papers were identified as falling into this scope. Machine learning studies that refer to the estimation of ALS heritability, to drug repurposing/prediction for ALS, or survival analyses, are not considered.

The review is structured as follows: first, knowledge from relevant literature about ALS pathology and genetic architecture is summarized, then the central challenges and limitations of traditional GWAS studies are introduced in the context of ALS, while the third section provides a brief overview of key machine learning concepts along with a description and comparison of published feature selection and machine learning approaches using ALS GWAS datasets. The main contribution of the review is to outline the challenges of ALS genomic studies, summarizing and comparing how these have been addressed by the collected research papers. Finally, the further use of machine learning as a method to understand ALS pathology is advocated.

#### *1.1. Current Knowledge of Molecular Pathways Implicated by the Functions of Known ALS-Linked Genes*

Our current knowledge of the aetiology and the genetic architecture of ALS is still elusive. Genetic mutations, environmental contributions, epigenetic changes and DNA damage are

hypothesized as potential causal factors that ultimately lead to motor neuron death [19,20]. Variants in more than 30 genes are recognized as monogenic causes of ALS [12,19,21–23]. The most frequent monogenic cause in European populations is the intronic hexanucleotide GGGGCC (G4C2) repeat expansion (HRE) in the *C9orf72* gene [24,25]. Other genes linked to ALS with high reproducibility include Cu/Zn superoxide dismutase 1 *SOD1*, fused in sarcoma *FUS*, and transactive response DNA-binding protein of 43 kD *TARDBP/TDP-43* [19]. The discovery of risk gene mutations has helped to unravel the molecular mechanisms of ALS, and may lead ultimately to targeted therapy and stratified drug discovery [19,26–28].

Numerous studies have been published aimed at explaining motor neuron death, investigating the functional effects of specific mutations of known risk-associated genes such as *C9orf72, FUS, SOD1* and *TDP-43* [23,27]. Recent systematic reviews from our group have summarised the molecular pathways and biomarkers for which there are strong supporting evidence in ALS [19,26,27]. The molecular pathways affected in ALS can be grouped as follows (see [27] for detailed review):


*TDP-43*, or *C9orf72*-mediated DPRs, thus having an impact on the expression of genes involved in motor neuron survival [27,58].

Understanding the functional processes that drive ALS pathology has proven to be a difficult and complex task, compounded by the heterogeneity that characterises the disease. The gene products of the 30 or more known ALS-associated genes interact with each other, are implicated in multiple molecular pathways, and result in multiple disease phenotypes, making functional curation and interpretation complex [19,27]. In addition, these monogenic causes in ALS occur only in ∼15% of sporadic ALS and ∼66% of familial ALS patients, so that more than 80% of the ALS population do not currently have any known ALS-associated mutations [19,21]. Nonetheless, acquiring an in-depth understanding of the molecular mechanisms and the genetic architecture of ALS could potentially lead to the identification of multiple patient strata and therefore targeted therapies to be applied to different subgroups of ALS patients.

#### *1.2. The Genetic Architecture of ALS*

The genetic contribution to familial and sporadic ALS has not been fully explained by genotype-phenotype discoveries [8,25], and the known Mendelian causes of ALS represent only a small proportion of the ALS population [19,21]. Nonetheless, estimates of heritability are high in sporadic ALS patients - for example, 61% in a twin meta-analysis study - suggesting that genetic factors are strongly represented in sporadic ALS and that further investigation may yet identify novel causal variants and/or multilocus interactions that could account for this high estimated heritability [59].

So far, evidence supports a model implicating rare variants (minor allele frequency <1%) along with non-genetic causes, such as environmental factors [3,36,60,61]. Large GWAS efforts suggest a genetic architecture for ALS that falls somewhere in the middle of the spectrum of genetic pathology in terms of effect size and prevalence of risk variants-i.e., an *intermediate genetic architecture*, lying between conditions such as schizophrenia which have many common variants each imparting a small increase to disease risk, and conditions such as Huntington's disease which are caused by rare large-effect variants located in a single gene [3,20,62,63].

Many ALS-associated variants, particularly for *C9orf72*, also contribute to other conditions such as frontotemporal dementia (FTD) and cerebellar disease, suggesting that ALS is a multi-system syndrome [3,60,61]. ALS has an established overlap with other neurodegenerative and neuropsychiatric disorders, investigation of which could lead to insights into the understanding of pathology [3,5,25,60,64,65]. An example of this is the degree of overlap between familial ALS (∼40%) and familial FTD (∼25%) patients that carry the G6C4 expansion of *C9orf72* [65,66]. *C9orf72* hexanucleotide expansion has been associated to multiple traits including Alzheimer's and Parkinson's diseases, ataxia, chorea and schizophrenia [21,67–69]. A population-based GWAS study reported a higher prevalence of psychosis, suicidal behaviour, and schizophrenia, in Irish ALS kindreds, which was associated with the C9orf72 repeat expansion, based on an aggregation analysis [64]. Further evidence for a shared susceptibility to ALS was provided by the greater occurrence of dementia among first-degree relatives of ALS patients [69]. Several studies have suggested that the genetic overlap between ALS and other neurodegenerative and neuropsychiatric disorders could also be explained by the presence of ALS-associated pleiotropic variants that influence multiple, and in some cases quite distinct, phenotypic traits [70–72]. One study that supports this hypothesis is that of O'Brien et al., which shows that first-degree and second-degree relatives of Irish ALS patients have a significantly higher prevalence of schizophrenia and neuropsychiatric diseases than healthy controls, including obsessive-compulsive disorder, psychotic illness, and autism-the authors performed k-means clustering and calculated the relative risk to estimate aggregation [71,73–75].

Further investigation is needed to achieve a deep understanding of ALS heritability and genetic architecture, incorporating pleiotropic gene effects into experimental design.

#### **2. ALS-Specific GWAS Challenges and Limitations**

So far, numerous ALS GWAS studies have been published, aiming to identify novel ALS-associated variants through standard genotype-phenotype analyses. The first was published in 2007, providing genomic data for 276 cases and 271 controls [76]. Technological advances have provided the opportunity for studies with a higher number of genotyped ALS cohorts. The largest release of ALS genomic data was published in 2018 by Nicolas et al., and identified *KIF5A* as a novel ALS-associated gene; the study included a publicly-available large meta-analysis dataset of 10,031,630 imputed SNPs of 20,806 ALS and 59,804 controls as well as providing controlled access to "raw" genomic data including SNP-arrays of 12,188 cases and 3,292 controls [12,17]. Despite that hundreds of ALS-associated variants have been recorded in public databases such as the GWAS Catalog [10], these associations show very little reproducibility across different studies and have not been able to explain a large percentage of ALS heritability [3,36]; a phenomenon which is generally known as the "missing heritability" paradox [77]. It has been proposed that SNPs contribute ∼8.5% of the overall heritability of ALS, although it should be noted that such estimates consider only linear single-marker effects of SNPs [36,77]. Here we outline some general GWAS limitations in the context of ALS, as well as potential reasons why standard GWAS phenotype-genotype analysis is unlikely to fully explain the genetic architecture of ALS.

A first general challenge in large scale genomic analyses is to ensure a high quality of the genotype data, so that the downstream results of the experimental design reflect true biology and not artifacts. Therefore, the collected genomic data first need to pass a comprehensive Quality Control (QC) pipeline including multiple sample and variant QC steps [78–81]. One challenge is that each dataset has its own specific features, thus there are not fixed thresholds for each quality-control step. For this reason, each study needs to follow a data-driven approach, taking into consideration the distribution of each data metric. However, there are some good practices in QC that may be generally applicable to most studies [78,81]. For example, it is typical to follow a procedure first filtering out low quality samples then removing poor quality markers, the order of this ensuring that as many genetic markers as possible are kept in the final dataset. However, overly strict thresholds can lead to the loss of a substantial proportion of samples, reducing study power. Another challenge is to ensure homogeneity of the collected samples in terms of ancestry. This QC step is carried out by analysing the population structure to remove ethnic outliers, and by accounting for confounding factors in later stages of the analysis, such as a potential inner population sub-structure, usually using the first few Principal Components, after performing a Principal Component Analysis on the homogeneous sample cohort. Also, it is very important to check for duplicated samples and, in non-family GWAS analyses, ensure that all samples are unrelated so that specific genotypes are not over-represented (and thereby contributing a bias to the subsequent analysis). Identity-by-descent (IBD) is a metric that corrects for such bias and takes into account the number of variants that a pair of individuals share.

GWAS is a single marker analysis treating each variant association as an independent event that contributes to the phenotype. Due to this, it is a standard practice for results to be corrected under the strict multiple testing threshold (*<sup>p</sup>* < <sup>5</sup> × <sup>10</sup>−8) of the Bonferroni correction in order to control for false positive discoveries (Family-wise type I errors). This threshold derives from the hypothesis of 1,000,000 independent markers being tested under a significance level of 5%. Particularly in low sample size studies this correction can result in a loss of power of the analysis, which may then fail to capture a portion of potential risk variants that do not pass the significance threshold (Family-wise type II errors) [15,82].

Univariate analyses such as GWAS that test trait association for one locus at a time are not able to capture multilocus interactions-a phenomenon called epistasis-and the interaction of the environment with the genome; events that could potentially account for the missing heritability of ALS and explain the disease pathology [83,84]. The term *epistasis* was introduced in genetics over a century ago by Bateson et al. [85], and genetic and evolutionary biology studies have highlighted the importance of gene-gene interactions not only in the genetic architecture of an organism but also in evolution [77,86,87]. Epistasis represents non-additive events in the genome including interactions among two or more loci that have an effect on the phenotype [88]. Several studies have highlighted the role of epistasis in pathology, showing that SNP interactions provide a stronger association to the disease than the participating SNPs do individually [77,84,89,90]. To understand pathology in a complex disease such as ALS, it may be necessary to identify complex genetic interactions, including epistatic interations [77,87]. Nevertheless, the study of multilocus interactions poses a number of challenges, in particular the need for a high computational power as the number of tested interactions is extremely high even in pairwise combinations. As such, multivariate computational approaches and appropriate machine learning methods may be able to capture the potentially complex relationships among risk variants in ALS [77,90,91].

GWAS is more successfully employed under a "common disease-common variant" hypothesis, being of particular use in common diseases such as schizophrenia which are driven by many risk alleles each with high frequency [92]. In contrast, ALS is a heterogeneous disease likely comprised of multiple strata each resulting from combinations of different rare mutations and other factors. As a result, stratum-specific mutations may each have very small effects that are diluted and thus not captured by GWAS [3,36]. The majority of GWAS analyses have used SNP-arrays as they have until recently had a lower experimental cost in comparison to sequencing of the exome or the whole genome. SNP-array analyses can typically capture the effect of only common variants to the phenotype whereas sequencing analyses identify both common and rare variants. In most SNP-array GWAS studies, variants with Minor Allele Frequency (MAF) of <1–5% are removed from subsequent analysis as they are generally more difficult to genotype and therefore are considered potential false positives [15,78]. Nevertheless, whole genome sequencing, custom designed exome sequencing arrays, rare variant burden analyses and imputation approaches using large reference panels (such as the Haplotype Reference Consortium, which contains 64,976 haplotypes), face this challenge by recovering both rare (up to 0.1% MAF) and common variants that SNP-array platforms do not usually contain [15,93–95]. However, there is still a proportion of low frequency minor allele effects on the phenotype that cannot yet be detected by GWAS approaches and that could also potentially explain some of the missing heritability in ALS [3,36].

Lastly, another common GWAS challenge in complex diseases is the difficulty to distinguish causal variants from other non-disease-associated variants that are in high linkage disequilibrium [15]. Linkage disequilibrium describes the phenomenon where an allele of a variant is inherited together with the alleles of other variants [9]. These alleles of other variants are highly correlated and will have very similar GWAS signals with the truly causal SNP. The majority of disease-related variants are located in cis-regulatory regions of the genome [96], and given our limited knowledge of non-coding genomic loci, it is even more challenging for those to discern causal SNPs from the noise. Our difficulty to identify the causal variants in complex diseases among a pool of statistically significant associated variants adds to the challenge of identifying molecular processes that could have a significant impact on the disease.

Advanced machine learning prediction models trained in ALS genomic data could overcome the aforementioned challenges, moving towards better insights into disease causality and ultimately to a personalized understanding of ALS [15,97]. In Figure 1, we describe the basic steps of an ALS machine learning experimental design in order to discover ALS-associated novel loci or combinations of loci, as well as the main challenges of each step. Each of the main challenges is addressed in successive chapters of the review, as we describe and compare the experimental design of the collected gene prioritization studies. Some of the challenges in Figure 1 have already been mentioned, such as the need for a *large sample size* that could increase the power of the study, a *comprehensive quality control pipeline* to assure high quality genomic data, as well as the *curse of dimensionality* which is a very common problem in genomic studies that include an extremely high number of features and especially in studies that focus on multilocus interactions.

*J. Pers. Med.* **2020**, *10*, 247

**Figure 1.** The main challenges of an ALS machine learning experimental design.

#### **3. Facing the Challenges**

In this chapter, we outline gene prioritization approaches by which published research studies have employed machine learning methods in order to identify and rank novel ALS-linked genes, SNPs and multilocus interactions. First, a short introduction is made to some basic machine learning concepts that will be useful in later discussion of the machine learning approaches that have been used. Then, details are provided about the data representation, feature curation, and selection methods that each study chose for their experimental design and, finally, the machine learning methods and the overall experimental designs of each study are compared, as well as considering their results.

#### *3.1. A Brief Overview of Machine Learning Concepts*

Based on the task and the type of learning there are three main machine-learning categories: supervised, unsupervised and semi-supervised algorithms. Supervised learning methods aim to make predictions on unknown instances (e.g., a sample or a gene) based on known labels (e.g., ALS/non-ALS) [98]. For instance, classification is a supervised machine learning approach, which trains a classifier using labeled data e.g., samples/genes including a case/control label (training set) and predicts the class of an unknown sample or gene (testing set) based on specific rules and patterns that the classifier learned during training and testing.

In classification, it has always been a challenge to identify the input features that are most informative and maximally affect the prediction. This domain of research is described as *feature selection* research [99] and has been traditionally connected to statistical methods. More recently, with the advent of deep neural networks, *explainability* and *interpretability* of machine learning suggestions, predictions and decisions has become an even more acute problem. This is due to the complex nature of the network itself, which does not clearly illustrate the connection between input (features) and output (prediction) in a humanly understandable manner. Thus, a number of recent studies aim for *explainability* in deep models [100], including visualized explanations [101].

On the other hand, unsupervised learning is performed on unlabelled data, with an ultimate purpose to identify interesting patterns or novel sub-groupings of the data. Clustering research offers a well-established set of unsupervised algorithms which identify patterns in a group of instances (e.g., Genes, SNPs, ALS patients), most commonly based on notions of distance (or similarity) between instances. Distance in such cases can be measured with metrics such as Euclidean distance or the Pearson Correlation Coefficient [74].

Lastly, in semi-supervised learning the prediction is carried out in positive and unlabelled data. Semi-supervised learning is applied when information is available only on instances of a single class (usually called "positive" instances, e.g., already known-ALS genes) but there is not sufficient information to label the rest of the instances as "negative". Semi-supervised learning methods can be quite challenging when the aim is to predict novel disease-associated instances since the classifier is trained treating potentially novel instances as "negative". Semi-supervised learning methods are particularly common in gene prioritization studies.

An *instance* in a machine learning task is an entity that the classifier is trained to predict, for example a gene in a gene prioritization algorithm or an ALS sample in an ALS/non-ALS patient classification task. An instance is described by a number of *features* (e.g., gene functional annotations, SNP genotypes etc.), typically represented as a vector, termed *feature vector*. All X instances (e.g., samples or genes/SNPs) need to have the same number of Z features (e.g., genetic mutations or functional annotations), leading to a two-dimensional matrix K that has a size of K=X\*Z. Each instance has a specific location on a Z-dimensional space (where Z is the number of features) and each feature has, accordingly, a specific location in X-dimensional space (where X is the number of instances).

There are a wide variety of metrics that can be used to evaluate the performance of a machine learning model, the choice of which depends on the nature of the task. The most popular metrics include Accuracy, Precision, Recall, F1-score, Specificity and ROC (Receiver Operating Characteristic) curves. Accuracy represents the fraction of true positive and true negative predictions out of the total number of the model predictions. Precision expresses how many positive predictions of the model were truly positive. Recall (True Positive Rate) explains the percentage of the missed true positive predictions. Recall and Precision are both very important metrics that need to be taken into account for the evaluation of a model's performance. F1-score calculates the harmonic mean of those two metrics, hence the higher the F1-score is, the better the model performed. Specificity expresses the fraction of the negative predicted instances that were actually negative. The ROC curve is a plot of True Positive Rate against False Positive Rate expressed as 1-Specificity. The Area Under the Curve (AUC) of ROC is employed to calculate how well the model performed - the closer the AUC is to 1, the better the model performed.

#### *3.2. Recent Feature Preparation and Selection Approaches in ALS Genomic Studies*

Modern analytical genomic platforms and imputation methods have led to the profiling of samples containing up to tens of millions of genetic markers. This vast amount of genomic information makes machine learning modelling more complicated and demanding, as it is likely that only a small minority, if any, of markers are truly causal and associated to the disease. Thus, a first challenge in the genomic machine learning experimental design is to deal with the *curse of dimensionality* using appropriate feature selection methods [102]. Below we describe and group the feature selection and dataset curation approaches that have been used by each study, also summarized in Table 1.

**Table 1.** Details of the feature selection approaches that have been followed by each study prior to the machine learning experiments. The "Data/Instances" column summarises the primary number of instances before further filtering, and details about the datasets that have been used by each study. The "Features" column refers to the initial number of features in each dataset before feature selection and after quality control (the latter being applicable only in studies using genotype data). The "Epistasis" column indicates whether the respective study has incorporated epistatic events (i.e., multi-locus interactions) as a feature selection method. The "Regulatory Elements" column describes studies that have filtered their initial dataset by including only non-coding regulatory regions. Studies that have used prior ALS-related information (e.g., already known ALS-associated SNPs/genes, known functional information, filtering based on an ALS versus Control genotype-phenotype association analysis *p*-value etc.) in order to select and reduce their initial instance and feature space are indicated under "ALS-linked knowledge". Lastly, we indicate machine learning methods that were used to select only highly informative features based on specific criteria. ML: Machine Learning, SNP: Single Nucleotide Polymorphism, MDR: Multifactor Dimensionality Reduction, CNN: Convolutional Neural Network, PPIs: Protein- Protein Interactions, DHS: DNase I hypersensitive sites, TFBS: Transcription Factor Binding Sites, PCA: Principal Component Analysis, t-SNE: t-distributed Stochastic Neighboring Embedding, UMAP: Uniform Manifold Approximation and Projection.


The use of feature selection methods are a common strategy to reduce the high number of initial features for machine learning models [102]. As described in Table 2, four of the seven studies used a machine learning model as a feature selection method, in some cases using specific hypotheses or along with a combination of other strategies that will be described later in this sub-chapter. Mantis-ml is one example of a multi-step gene prioritisation framework which extracts a heterogeneous set of 1249 gene-annotation features mined from a large collection of databases in order to discover and rank novel disease-related genes [103]. Each instance in this model consists of 18,626 coding genes which are labelled as positive (seeds) or unlabelled, depending on known association to the disease, based on information retrieved from the Human Phenotype Ontology (HPO) [109]. A number of pre-processing steps were applied that included filtering of highly correlated pairs of features, removing features with missing data, and imputing certain features with a low missing rate. Some exploratory analyses are then performed (i.e., heat maps, variable distributions, etc.) in the original feature space and then three dimensionality reduction methods are automatically applied: principal component analysis (PCA), t-distributed stochastic neighboring embedding (t-SNE) [110], and uniform manifold approximation and projection (UMAP) [111], in order to identify any interesting pattern(s) and linear/non-linear relationships among the features. The gene pool is randomly split into K balanced datasets containing

both positive and unlabelled genes. Mantis-ml can use the Boruta feature selection algorithm which labels the features (given a decision threshold) as "confirmed"/"tentative"/"rejected" by assessing the contribution of each feature to the prediction [103], although the entire feature space is used by default.

**Table 2.** The biological hypotheses incorporated to the experimental design and main focuses of the collected studies.


Several recent studies have focused on uncovering gene-gene interactions (epistasis) that could potentially explain some part of the missing heritability of complex genetic traits such as ALS. However, modern genomic studies that aim to model epistatic events face considerable challenges, demanding large computational and statistical power for the analysis of a large number of combinations among millions of genotyped loci (even when considering only pairwise interactions). Two of the studies -Greene et al. and Kim et al.- included epistatic events in their feature selection approach [91,107]. Both used the wrapper method Multifactor Dimensionality Deduction (MDR)-a non-parametric, model-free approach which reduces the feature space of multilocus combinations by creating new single variables pooled from multiple SNP genotypes [87,112]- and then estimates statistically significant ALS-risk pairwise interactions of SNPs. Both studies used the same datasets and pre-processing steps to identify pairwise SNP interactions that are significantly associated with ALS. The datasets included two sporadic ALS cohorts along with healthy controls containing 276 cases versus 271 controls in the detection dataset and 211 cases versus 211 controls in the replication dataset [76,106]. SNPs were filtered using a 0.2 Minor Allele Frequency cut-off and a less than 90% call rate. Lastly, Greene et al. considered only independent SNPs in further analysis, leading to a dataset of 210,382 SNPs.

Another strategy to reduce feature space is to include only regulatory elements as features for the subsequent machine learning experiments. Two of the studies focus only on the effect of noncoding regulatory elements in ALS pathology. It has been shown that disease-related variants are mostly located in cis-regulatory elements of the genome marked by DNase I hypersensitive sites (DHSs)-zones of the genome that have been associated with elevated levels of transcriptional activity [96]. In 2020, Yousefian et al. investigated the effect of noncoding variants in ALS [104]. Firstly, they applied a *<sup>p</sup>*-value threshold (*<sup>p</sup>* < <sup>5</sup> × <sup>10</sup>−4) on SNPs from a previously published large ALS meta-analysis GWAS dataset. The authors constructed association blocks by identifying lead SNPs having strong ALS GWAS *p*-value associations and being at least 1 Mb apart from each other, then selecting also the top 30 ALS-associated SNPs located upstream and downstream of each lead SNP; leading ultimately to 274 association blocks [104]. They enriched their selected association blocks with functional and epigenetic information including DHS profile data, histone modifications, functional gene-sets from the KEGG database and TF binding sites collected from TRANSFAC and JASPAR databases [104,113–115]. After the functional enrichment of the features, they constructed a binary feature matrix representing whether a SNP within an association block was associated or not with a particular functional feature [104]. The second ALS study that included only noncoding regulatory elements was published in 2019 by Yin et al. who proposed Promoter-CNN as a feature selection method; a convolutional neural network model (comprised of 2 convolutional layers and two deep layers) that reduces the initial feature space by selecting only the top 8 highest performing promoter regions among variants located on chromosomes 7, 9, 17 and 22 [90]. They assessed the

performance of Promoter-CNN using a 9-fold cross validation. Each promoter region of each individual was represented by a window of 64 genomic features having a value of 0,1,2 to represent the genotype at each of these 64 loci, utilizing the genomic structure of the regions [90]. The aforementioned studies employed the genomic structure in order to build the association blocks and the promoter regions (see Table 1).

A popular strategy that has been used to select the initial data and then further reduce the feature and instance space is ALS-associated knowledge (see Table 1). The most relevant study that falls into this category is Bean et al. which modelled only previously known ALS-linked gene lists, mining information from the literature and from disease databases, as well as including a manually curated set of ALS-associated genes [105]. In order to reduce the feature space they performed an enrichment test on all features and, for the predictive model, kept only those features that are significantly enriched in the mechanism(s) of the disease [116]. Another example is Yin et al. which before applying their Promoter-CNN model as a feature selection method, they first limited their feature space by studying only non-additive phenomena of multiple promoters located on four specific chromosomes (7, 9, 17 and 22), those chromosomes have being selected based on the amounts of missing heritability that have been previously identified in ALS [36,90]. Bean et al., Kim et al., Yousefian et al. and Sha et al., implement multi- step algorithms in which one of the initial steps reduces the feature space by keeping only the highest performing genes/SNPs, by assessing the enrichment of genes in ALS [105] or by applying a specific threshold of single-marker association analysis to ALS [91,104,108].

#### *3.3. Experimental Design and Results of the ALS Gene Prioritization Approaches*

In this section we will focus on the approach and rationale of the collected studies that aim to understand the pathology of ALS using machine learning and probabilistic models on genomic data. We also briefly consider the main findings of each study. Although the majority of the collected studies aim to answer more than one research question, in this section we group studies by methodology, based on common features of their experimental design that we considered to be the main focus in each study. In order to avoid confusion, in Table 2 we describe the main biological focus of each study. We investigate the reproducibility of the results of each analysis, comparing it with related literature as well as mentioning putative novel ALS discoveries in the Discussion section.

The identified studies (see Table 1) all fall under the gene prioritization umbrella category. In Table 3, we provide brief information about the machine learning models that have been used in these studies (after feature selection and filtering approaches have been applied, as discussed in the previous section, see Table 1), as well as details about the assessment and the performance of the models.

Four of the studies included epistatic events in their experimental design, testing the hypothesis that multilocus interactions have an effect on ALS susceptibility (see Table 2). Three studies that fall into this category, all use the same ALS detection dataset in order to discover pairwise SNP interactions [91,107,108]. Greene et al. and Kim et al. use multifactor dimensionality reduction (MDR); a wrapper method which performs both a feature selection and classification to predict pairwise combinations of SNPs as high-risk and low-risk for ALS [91,107]. As proposed by [112], MDR reduces multilocus dimensions into a one-dimensional multilocus variable, the prediction performance of which is evaluated in classification tasks (ALS versus healthy controls) by cross validation and permutation tests. Out of pairwise combinations among 210,382 SNPs, Greene et al. reported the pair of SNPs rs4363506 and rs6014848 to have the highest accuracy (Acc: 0.6551) and a *p*-value of 0.048 after permutation testing. Their replication dataset showed a lower accuracy (Acc: 0.5821) but with a higher statistical significance (*p*-value < 0.021) [107]. Kim et al. chose the best performing MDR model for each SNP, then they mapped SNPs to genes (including neighboring regulatory elements) and investigated their enrichment using Gene Ontology functional terms [91]. Unfortunately, they did not report any specific high performing pairs of SNPs in terms of MDR accuracy, nor specific genes. The statistical significance

of each MDR model was estimated using permutation, with the highest enriched Gene Ontology gene-sets being "Regulation of Cellular Component Organization and Biogenesis" (*p*-values: 0.010 and 0.014), and "Actin Cytoskeleton" (*p*-values: 0.040 and 0.046). The *p*-values for each gene-set refer to the detection and replication dataset respectively, after multiple testing. The third study is proposed by Sha et al. implementing a two-stage probabilistic algorithm that attempts to predict two-locus combinations that are associated to ALS using eight epistatic and nine multiplicative two-locus predicting models. The first step was a single-marker association test using the *x*<sup>2</sup> test statistic, keeping only the 1000 SNPs having the strongest *p*-values. The associations of all the two-locus combinations of those 1000 variants were then tested using the above-mentioned seventeen models. Three SNPs were discovered participating to two two-locus combinations: rs4363506 with rs3733242 (*p*-value = 0.032) and rs4363506 with rs16984239 (*p*-value = 0.042). Reported *p*-values were adjusted for multiple testing correction using permutation. They also performed Multifactor dimensionality reduction and Combinatorial Searching Method to identify high performing two-locus interactions, but none of the results reached the significance threshold (see Table 3). Single locus analysis was not able to capture the three SNPs that participate to the bi-locus interaction.

**Table 3.** The machine learning approaches followed by each study. Acc: Accuracy, SVM: Support Vector Machines, LR: Logistic Regression, RF: Random Forest, DNN: Deep Neural Network, SVC: Support Vector Classifier (SVC), CNN: Convolutional Neural Network, AUC: Area under the receiver operator characteristic curve, CSM: Combinatorial Searching Method, Chr: Chromosome.


A far more complex machine learning model for ALS patient classification, studying non-additive interactions of multilocus promoters among four chromosomes, was published in 2019 by Yin et al. [90]. Building upon previous knowledge that the majority of the disease-associated variants in GWAS are cis-regulatory elements, this study showed that using only the highest performing promoter regions of 4 chromosomes as features provides enough information for successful classification of the ALS genomic profile versus healthy controls [90]. Specifically, they applied a prediction

model for putative ALS-related genotypes located in promoter regions, using a case-control genomic dataset containing 4,511 cases and 7,397 controls from the Dutch cohort of ProjectMinE. A two-level pipeline was constructed in which, as a first step, deep neural networks select the eight highest performing promoter regions of chromosomes 7, 9, 17 and 22 having the highest accuracy in ALS status prediction (Promoter-CNN model) and then the selected promoter regions of each individual are combined for a final classification task (ALS-Net model). A 9-fold cross validation was used to train both models. The authors compared their ALS-Net deep learning model to other classification models using both the pre-selected promoter regions combined from all four chromosomes by Promoter-CNN models, and markers from individual chromosomes. The compared classification models included a logistic regression Polygenic Risk Score (PRS) approach [117], Support Vector Machines (SVM) [118], Random Forest [119] and AdaBoost [120,121]. The two-tier deep learning model showed very promising results identifying both already reported associated ALS genes and new putative ALS markers. The results showed that ALS-Net combined with Promoter-CNN pre-selected promoter regions produced better performance (Acc: 0.769, F1-score: 0.797) than the logistic regression Polygenic Risk Score (PRS) approach (Acc: 0.739, F1-score: 0.728) [117], Support Vector Machines (SVM) (Acc: 0.725, F1-score: 0.694) [118], Random Forest (Acc: 0.596, F1-score: 0.381) [119], or AdaBoost (Acc: 0.661, F1-score: 0.625) [120,121]. They highlight that their Promoter-CNN model is a successful feature selection method keeping only the highest performing promoters from each chromosome individually, advancing the performance of the subsequently tested classifiers [90]. The findings indicate that combining genomic information from all four chromosomes improves the performance for the majority of the models, supporting further the hypothesis that non-additive events take place in ALS pathology.

Another recent study that builds upon the hypothesis that cis-regulatory noncoding variants can have an important effect on ALS pathology, applies a functional SNP prioritization framework using convolutional neural networks (CNN) to make ALS rare noncoding risk-variant predictions [104]. The authors build upon a previously published deep learning CNN-based model that used functional features in order to predict causal regulatory elements in complex diseases [122]. They tested their proposed method on a large GWAS meta-analysis cohort including 8,697,640 SNP *p*-values for 14,791 ALS patients and 26,898 healthy controls [12]. The functional SNP prioritization framework followed a multi-step procedure starting from (a) collecting the upstream and downstream flanking regions of the 274 highest GWAS ALS-associated variants, then (b) functionally annotating the variants (including DNase I hypersensitive sites (DHSs), histone modifications, target gene functions, and transcription factor binding sites (TFBS)) and finally (c) training the CNN model on these association blocks with uncertain class labels using chromosomes 1–10 as a training set, chromosomes 11–14 as a testing test and finally 15–22 as a validation set [104]. The CNN model used two convolutional layers; the first layer measured how well each individual SNP matched the pattern of 50 functional features using a rectified linear unit (ReLU), and the second level output prediction scores for each SNP with a 0–1 range, with values close to one indicating that there are common regulatory patterns embedded for a particular SNP. The CNN model shows a high predictive performance (AUC = 0.96 and F1 = 0.83). A random forest classification for the ALS cell-type specificity showed that a high portion of their ALS selected features have neuronal cell-type specificity within Trancriptional Factor binding sites. The proposed framework highlights two potentially functional ALS-risk variants rs2370964 (chromosome 3, located in enhancer site of *CX3CR1*) and rs3093720 (chromosome 17, intron variant in *TNFAIP1*). An eQTL analysis was performed to investigate the effect of these two variants on the expression of other genes. The analysis showed that the two noncoding variants may impact ALS risk by affecting the expression levels of *CX3CR1* and *TNFAIP1*. The *CX3CR1* gene deletion has been associated with microglia neurotoxicity and neuron loss in transgenic ALS mice [123,124]. TNFAIP1 is an apoptotic protein which has been associated to neurotoxicity [125]. The rs2370964 variant also affects the *CTCF* and *NFAT* binding sites [104]. Mutations in *CTCF* gene has also been related to microglial dysfunction, among other effects [126]. NFAT is a transcription factor that is involved with

the regulation of pro-inflammatory responses in cultured murine microglia [127]. The rs3093720 SNP affects the *NR3C1* binding site, a gene which has been associated with neurodegeneration and multiple sclerosis [128].

The accumulation of large amounts of ALS multi-omic data, functional annotations, and tissue-specific information, has provided a great opportunity and challenge for researchers to combine these in ways that could potentially lead to stronger machine learning models. Two collected studies combine known ALS genes and ALS-related information mined from a variety of databases to predict novel disease-specific genes and rank them. The first one is, Mantis-ml, a recently published multi-step disease agnostic gene prioritisation pipeline which employs known disease-associated genes to predict scores for putative novel genes based on feature pattern similarity [103]. As mentioned in Section 3.2, the disease-related gene information is extracted from multiple databases and resources, including tissue- and disease- specific data [103]. Depending on the users' disease-related queries, the pipeline follows automatic feature selection and pre-processing as well as an exploratory data analysis on disease-related features. A repeated stochastic semi-supervised model is used to iteratively predict disease-related probabilities for each gene, and then rank each gene based on the mean prediction probability of all iterations. The starting point for the modelling is the labelling of an entire coding gene pool (18,626 genes) based on disease relevance retrieved from the Human Phenotype Ontology (HPO), with positive and unlabelled genes then being split into random balanced datasets [109]. They evaluate the performance of their classifier using a stratified 10-fold random split in every balanced dataset, which is followed by testing using the out-of-bag k-fold method. The model generates gene prediction probabilities belonging to the respective testing set of each k-prediction cycle. Lastly, the model calculates an aggregated prediction probability combined from all iteration cycles. The classification performance of Mantis-ml was assessed using seven different supervised models (gradient boosting, random forest, extra trees, extreme gradient boosting, support vector classifier, deep neural networks, and a stacking classifier) with 10 stochastic iterations and 10-fold cross validation in three diseases. All models showed similar performance (AUC: 0.83–0.85), with Extra Trees having the highest mean Area Under Curve for ALS (AUC = 0.814) (see Table 3). For ALS, 77 positively labelled genes were selected, having an average AUC of 0.814 (combined from 7 classifiers). Among the top 50 genes there were two already known ALS associated genes, *FUS* and *L1CAM*. Specifically, "MGI mouse knockout feature" was ranked as the top feature for ALS, including human orthologue mouse genes that have been associated with survival and developmental pathways. Unlabelled genes (i.e., genes not annotated to ALS in the HPO) were also identified as being predictive of ALS. Some of the top novel predicted genes among 5 out of 6 classifiers are *SYNE1, ALDH5A1, ABCA1, DNMT3A, NF2, SZT2, ACADVL, MED12, TSC2, EP400, RYR2, VCL* and *BBS2*. *SYNE1* causes recessive ataxia and has been associated with motor neuron degeneration and ALS [129]; ALDH5A1 has been identified as significantly down-regulated protein in ALS murine models [130]; ABCA1 has been linked with damage of neuromuscular junctions and identified in significant clusters of altered frontal cortex genes in ALS samples [131,132]; a DNMT3A isoform has been identified in synapses and in mitochondria and has been associated with degeneration in motor neurons in ALS patients and abnormal expression levels in skeletal muscle and spinal cord of presymptomatic ALS mice [133,134].

Another study that combined genomic data with other types of data sources to increase the power of the machine learning gene prioritization method was that of Bean et al., integrating functional annotations, known ALS-gene associations, and protein-protein interactions [105]. Protein-protein interaction networks have been useful to decipher new disease mechanisms, as proteins that are encoded by disease-related genes are likely to interact with proteins that are implicated in similar pathologies [135]. These authors used a previously published knowledge graph-based completion model [116], which is trained combining protein-protein interactions data mined from Intact [136,137], known disease-gene lists from DisGeNet [138,139] and functional gene-sets from the Gene Ontology [140] in order to make predictions of novel ALS-linked genes [105] (see Table 1). The algorithm starts by building a knowledge graph containing known ALS data represented as nodes and their interrelationships as

edges. The aim of the model is to predict the missing edges of the graph that could represent novel ALS genes, providing a predictive score to each. This is a similarity score deriving from the profile which is built by the trained knowledge graph model comparing the ALS-known genes to the rest of the genes. They train 5 models for 5 input sets of ALS-linked genes mined from the ALS Online Database (ALSoD), which is intended to host all known ALS- associated genetic variants [18], from the ClinVar database of curated clinical variants [141], from DisGeNet [138,139], and a manual curated ALS gene list generated by the authors [105] (see Table 1). For the training and testing of the model, 5-fold cross validation was used to estimate how well the model would have predicted known ALS genes. As seen in Table 3, all models-except the DisGeNet list- performed very well in each fold above the random baseline, with the manual list being on top. In total, the 5 models predicted 45, 176, 192, 327 and 575 novel ALS for the Manual list, DisGeNet, ClinVar, ALSoD and union ALS-known lists, respectively. All predicted novel genes of the manual list were also present in the other 4 lists. The authors also tested the functional enrichment of the predicted genes following an overrepresentation analysis using Gene Ontology terms, with all gene-sets having statistically significant enrichment to ALS-specific biological processes, like mitochondrial activity, endosome transport and vesicular trafficking, lipid metabolism and others. To validate the relevance of the predicted ALS genes, a gene-set and gene-level analysis was performed using MAGMA [142] on a large ALS meta-analysis GWAS dataset (European cohort, including 20,806 cases and 59,804 controls) [12], keeping only the variants that mapped to the putative ALS-genes. Only ClinVar predicted genes had statistically significant results (*p*-value = 0.038), followed by the Manual model which did not pass the Bonferonni correction but it was close with a *p*-value of 0.060.

#### **4. Discussion**

Here we identified gene prioritization machine learning studies aimed at the understanding of genomic data in ALS, outlining the main challenges faced by such studies. We compared these studies in terms of their feature selection methods, experimental design, machine learning performance and their biological results. In Figure 2, we summarize some of the key decision making steps taken by machine learning approaches in ALS genomics, and examples of some possible choices at each step.

In gene prioritization studies, the initial number of potentially "novel" SNPs/genes or, more problematically, the potential novel multilocus combinations, can be so high as to make computational analysis unfeasible (see Figure 1). In this context, it was intriguing to group and compare the chosen dimensionality reduction approaches in each study. As seen in Table 1, most studies handled this problem quite differently. More than half used a machine learning method for feature selection, along with one or more biological hypotheses for additional filtering of the input variables. The most common approach of biological hypothesis filtering was to infer ALS-specific knowledge early in the experimental design. This is both an advantage and a potential disadvantage as it makes the results more likely to have biological relevance but at the same time risks introducing bias to later stages of the machine learning approach. An example of a feature selection method that fell into this category is an early SNP-filtering approach based on a specific threshold of ALS GWAS *p*-values. This approach succeeds in a straightforward way to reduce the feature space, but comes at the cost that potential GWAS false positives could be inferred and/or that true positives (not captured by GWAS) might be removed from further analysis. This may be especially problematic in the capturing of epistatic events, as traditional GWAS analysis is a linear single-marker analysis, so filtering based on single SNP-disease association *p*-values could risk losing putative significant multi-locus interactions in later stages of the analysis [143,144]. However, a large cohort size could increase the power of a standard GWAS analysis utilized as a feature selection method in a machine learning study.

**Figure 2.** Some of the key decision making steps taken by machine learning approaches in ALS genomics, and examples of some possible choices at each step. These included data collection, data representation, selection of the Machine Learning algorithm for the classification task, and the types of result obtained by the model. The studies collected information from a variety of databases in order to mine, among other things, data on genotypes (e.g., dbGaP, GWAS Catalog, gnomAD), functional annotations (e.g., KEGG), and Protein-Protein Interactions (e.g., STRING). Depending on the purpose of the experimental design, the collected studies modelled genes, SNPs, cis-regulatory regions, multilocus interactions, and/or ALS/non-ALS patients. Each instance was described using features such as Genomic, Epigenomic, and Proteomic data, functional annotations, and/or prior ALS-related knowledge (e.g., ALS gene-sets). Various machine learning algorithms were selected for the classification tasks. Lastly, we visualize the distinction between the modelling results of gene prioritization and multilocus interaction prioritization studies. Gene prioritization studies aim to identify significant ALS associated instances (e.g., SNPs, genes and cis-regulatory regions), whereas multilocus interaction prioritization studies aim to discover significant interactions among multiple loci. Lastly, we note that an ALS versus non-ALS sample classification experiment can be used to prioritize genes if the interpretability of the chosen model permits the identification of informative genomic features.

Deep learning seems a promising machine learning method for ALS gene prioritization studies, as well as in ALS patient classification. Deep learning methods are known to perform well in regulatory genomic classification tasks, being able to incorporate information from the structure of the genomic data and capturing non-linear relationships and patterns of multilocus interactions [145,146]. This statement is further validated by comparing the performance of three collected ALS gene prioritization studies. More specifically, three recent studies that employ deep learning as at least one of their classification methods had very good predictive performance [90,103,104], as summarized in Table 2. Yin et al. showed that deep neural networks perform better than other methods in classifying ALS versus healthy controls, not only in terms of accuracy (0.769) and F1-score (0.797), but also with an excellent recall of 0.908, using a 9-fold cross-validation [90]. Moreover, all of the benchmarked classification methods showed improved performance when promoter selection Convolutional Neural Network models (Promoter-CNN) where incorporated as an extra feature selection stage with the classification models. Yousefian et al. constructed a semi-supervised Convolutional Neural Network model in order to predict ALS-associated non-coding variants using epigenetic features. The model showed an excellent performance achieving an AUC of 0.96 and F1-score of 0.83 [104]. This study did not include benchmarking against other machine learning models, and interestingly the association blocks of non-coding variants that were constructed for training, testing, and validation sets of the model were separated into chromosome numbers 1–10, 11–14 and 15–22, respectively, rather than the cross validation methods typically employed by other studies. Lastly, Vitsios et al. benchmarked their proposed multi-step non-disease specific gene prioritization pipeline assessing the performance of seven classifiers using 10-fold cross validation in three diseases. In ALS, all classifiers achieved very similar performance with an average AUC ranging from 0.767 to 0.814, with Deep Neural Networks achieving an average AUC of 0.774 and Extra Trees classifier being at the top [103]. Even though, the majority of the ALS collected studies show that deep learning yields very good classification results, one very popular challenge in such learning algorithms is the lack of explainability in the model's results, identifying which features are the most informative to the classification task.

In terms of reproducibility, comparing the highest ranked genes and SNPs, as well as the reported statistically significant functional pathways, it is noteworthy that the sALS-associated variant rs4363506 (initially identified by Schymick et al. with an empirical *p*-value = 10−<sup>6</sup> [76]) was found by both Greene et al. and Sha et al. to have statistically significant pairwise interactions with other variants. Specifically, Greene et al. reported rs4363506 to interact with rs6014848 (Acc of 0.6551 and a *p*-value < 0.048 in the detection dataset, and Acc of 0.5821 and *p*-value < 0.021 in the replication dataset) and Sha et al. found that rs4363506 participated to two separate two-locus interactions: one with rs3733242 (*p*-value = 0.032) and another with rs16984239 (*p*-value = 0.042) [107]. These were the only significant pairwise combinations identified by either study. The replicated SNP rs4363506 is an intergenic variant (chr10:127476239, GRCh38.p12) located between *DOCK1* dedicator of cytokinesis 1 and *NPS* neuropeptide S [147]. *DOCK1* is implicated in neural growth and is a member of the KEGG pathway term "Regulation of actin cytoskeleton" and *NPS* in "positive regulation of synaptic transmission, glutamatergic" (GO:0051966) and "regulation of synaptic transmission, GABAergic" (GO:0032228), among others, all processes that are linked to ALS (see Section 1.1) [113,148]. This is consistent with other work implicating the actin cytoskeleton in ALS, including the finding of the third study, Kim et al.-investigating the functional enrichment of pairwise interactions in sporadic ALS-of statistical significance of the Gene Ontology term "Actin Cytoskeleton" (*p*-value = 0.040). However, it should be kept in mind that all 3 of these studies are analyses of the same primary dataset.

There could be several reasons for the disparity in the results of Greene and Sha et al. (i.e., the identification of different interaction partner SNPs to rs4363506). One reason could be that different quality control methods and thresholds are applied to the genomic datasets. In addition, Greene et al. investigated the pairwise combinations of 210,382 uncorrelated SNPs (keeping only independent SNPs in terms of linkage disequilibrium) using their proposed MDRGPU model, whereas Sha et al. tested the pairwise combinations of the top 1000 SNPs with a significant

ALS-association *p*-value, before testing for significant interactions using a series of different two-locus probabilistic models. The significant results of these studies also differed in the design of the predictive model used, with Greene et al. using Multifactor Dimensionality Reduction (MDR)-a non-parametric method combining feature engineering and then classification which does not make any assumption about the underlying genetic mechanisms-and Sha et al. using two-locus probabilistic parametric models which tested specific hypotheses about the type of the genotype interaction, taking into account the order of the genotypes in terms of penetrance of high-risk variants. Asides from the two-locus probabilistic models, Sha et al. did in fact also test (separately) an MDR and a Combinatorial Searching Method-unlike Greene et al., these analyses did not identify any significant pairwise combinations but, interestingly, SNP rs4363506 was present in a pairwise combination in both models that almost passed the threshold of statistical significance. Lastly, Kim et al. followed a multi-level approach starting from SNP pairwise interaction feature selection, moving to testing genes associations and gene-set functional associations to ALS. Unfortunately, they did not report any results about the MDR predicted pairwise interactions, hence we cannot make a direct comparison with the other two studies.

From the comparison of the most statistically significant genes and SNPs predicted from the optimally performing machine learning models of Mantis-ml [103], the knowledge graph completion model [105], the Promoter-CNN model [90], and the non-coding variant CNN model [104], we note several points. First, we compared the top 50 ALS genes predicted from the Mantis-ml model using Extra Trees-the highest performing classifier of the ALS data [103]-with the top 45 performing genes predicted by the knowledge graph-based machine learning approach using the highest performing model trained on the manually curated ALS-linked list [105]. *SLC1A2* (solute carrier family 1 member 2) was the only gene that was predicted by both approaches. SLC1A2 protein is the dominant transporter that clears the extracellular neurotransmitter glutamate in the synapses, expressed by astrocytes [149]. The down-regulation of *SLC1A2* has previously been associated with excitotoxicity leading to motor neuron degeneration and therefore contributing to ALS pathology, as described in the introduction (see Section 1.1) [47,48,150]. The Mantis-ml ALS model top predicted genes also had an overlap with the genes associated with the top performing promoters predicted by the Promoter-CNN model [90], but only in terms of shared protein families: within the top 8 promoter regions that Promoter-CNN selected for chromosomes 7, 9, 17, and 22, there were two promoter regions that were associated with genes *LAMB4* (laminin subunit beta 4) and *TRIM16* (tripartite motif containing 16); while among Mantis-ml highest predicted ALS genes were *LAMB3* (laminin subunit beta 3) and *TRIM28* (tripartite motif containing 28). The Laminin family contains heterotrimeric glycoproteins of the extracellular matrix that are associated with processes such as adhesion, survival, neuronal development and proliferation [151]. *LAMB4* is implicated in tissue development and cell migration, and has been associated with different types of cancer [140,152]. Interestingly, a laminin-4 isoform is expressed in neuromuscular junctions and has been associated with muscular dystrophy [151]. A recent study of *LAMB3* upregulation implicates this gene in cell apoptotic, proliferating and metastatic events in patients that suffer from pancreatic cancer [153]. *TRIM16* has been associated with autophagy, degradation of protein aggregates and ubiquitination of misfolded proteins; pathways that have been previously associated with ALS (see Section 1.1) [154]. Finally, *TRIM28* encodes a co-repressor protein which is expressed in the human brain and is a major regulator of transposable elements [155]. Elevated transcription of transposable elements has been linked with neurological disorders, including ALS, as well as binding to the ALS-associated RNA-processing protein TDP-43 [156].

It is noteworthy that the vast majority of the overlapping genes that we identified among all the collected studies are implicated in previously known ALS-associated functional pathways (as we outlined in Section 1.1) as well as the majority of the highest predicted novel genes (as described in Section 3.3). Nevertheless, we did not observe any further overlap among the other studies in terms of ALS-predicted genes. Limited reproducibility among the four studies could be due to multiple factors that derive from a number of differences in their experimental design and the focus of each study (see Tables 1–3). As summarised in Table 1, these four studies all use different instances, features and feature selection methods. As far as the machine learning models are concerned, the best performing machine learning models which were utilized for comparison were different. However, the low reproducibility could also derive from a more general challenge of gene prioritization studies, which concerns the difficulty of identifying the truly causal genes out of a usually large pool of novel predicted genes that pass a chosen significance threshold [157]. The difficulty of reproducibility is also emphasized in Bean et al. where 5 known ALS-linked gene lists mined from different databases, and one which was manually curated, were used, with each yielding very different results even with the same model, highlighting that the methodology and results of each study should be compared with caution [105]. Lastly, we need to mention that due to the high number of potential novel genes among all the benchmarked models of each study, we only compared the top genes/SNPs predicted by the highest performing model in each case. Hence, due to this limitation, we acknowledge the possibility of a larger existing overlap of the top genes predicted by the rest of the well performing models that were benchmarked among the studies. These challenges, makes the identification of the ALS implicated functional pathways even harder as well as the task of investigating reproducibility in different studies (see Figure 1) [157,158].

We also note that several studies incorporated prior ALS biological knowledge and functional annotations into the general experimental design (see Table 2), and the trained models showed a very good predictive performance (see Table 3). Specifically, as seen in Table 1 all of the considered studies except Greene et al. used an ALS-linked knowledge feature selection method to reduce their feature space as well as the number of instances (i.e., genes and SNPs). Moreover, Bean et al., Vitsios et al. and Yousefian et al. include ALS- specific and generic biological knowledge into their initial feature space, such as tissue/disease-specific features and known ALS disease-gene associations, as well as gene functional annotations, epigenetic features and protein-protein interactions (see Table 1). The most characteristic study that falls into this category is the one of Bean et al., where the instances were only ALS-linked lists from various resources, and the model was an ALS knowledge-based graph which uses the neighborhood of genes and enrichment tests to define significance of association, trained using PPIs, ALS-gene associations and functional annotations, showing the highest performance using a manually curated ALS-linked gene list.

Multilocus interactions may have a significant role in ALS and should be considered in future ALS genomic studies [90,91,107]. Each of the studies that investigated multilocus interactions has highlighted that the statistically significant variants could not be replicated in single-locus analyses [90,91,107,108]. Yin et al. carried out the most complex study in this category, using a large ALS cohort (see Table 1) followed by a thorough quality control -this was the first ALS multilocus study to investigate complex non-additive events in such a large scale of input variables. The results provided further support for the involvement of non-additive genetic interactions in ALS, showing that combining the genomic structure from multiple cis-regulatory elements (in this case promoters located in different chromosomes) yields very promising results in ALS patient classification [36,90]. Also, related literature supports that the "missing heritability" in genetic traits could be uncovered and explained to a significant degree by a network of gene-gene interactions which is not taken into account in the methods typically used to estimate the proportion of heritability that is missing [77,144,159]. Even though, the prediction of novel disease-specific gene-gene interactions poses, as described in previous chapters, a greater number of challenges than single-marker GWAS, it offers the potential to understand the heritability and genetic architecture of complex traits like ALS disease in greater depth. These challenges could be faced using machine learning approaches.

Machine Learning is a rapidly evolving field that has great potential in helping us to understand the complexity of ALS genomics, and how this relates to molecular pathways. However, further advances are needed in GWAS machine learning approaches in order to fully uncover the underlying mechanisms of this deadly disease which ultimately may lead us to successful personalized disease and drug-targeting prediction approaches.

**Author Contributions:** Conceptualization, C.V., S.D. and W.D.; data curation, C.V.; writing—original draft preparation, C.V.; writing—review and editing, C.V., A.P.M., G.G., S.D. and W.D.; supervision, W.D.; project administration, S.D.; funding acquisition, W.D. and S.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was financed by the European Union Regional Development Fund (ERDF) EU Sustainable Competitiveness Programme for N. Ireland, Northern Ireland Public Health Agency (HSC R&D) & Ulster University. C.V. was the recipient of a DfE international scholarship from Ulster University.

**Acknowledgments:** We thank George Paliouras for kind and helpful advice.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Editorial* **Understanding Neuromuscular Health and Disease: Advances in Genetics, Omics, and Molecular Function**

**William J. Duddy \* and Stephanie Duguez \***

Northern Ireland Center for Stratified/Personalized Medicine, Ulster University, Glenshane Road, Derry-Londonderry BT47 6SB, UK

**\*** Correspondence: w.duddy@ulster.ac.uk (W.J.D.); s.duguez@ulster.ac.uk (S.D.)

The field of neuromuscular research has seen considerable recent advances in the molecular and cellular understanding of muscle biology, and the treatment of neuromuscular disease. These advances are at the forefront of modern molecular methodologies, often integrating across wet-lab cell and tissue models, dry-lab computational approaches, and clinical studies. The continuing development and application of multiomics methods offer particular challenges and opportunities, not least in the potential for personalized medicine. More than 500 different genes are known to be associated with neuromuscular disorders [1], and the identification of causative mutations has allowed the development of personalized therapies [1,2]. However, even if great progress has been made during the last two decades in different subgroups of neuromuscular disorders, there are still numerous challenges to resolve, such as the optimization of therapeutic knock-down strategies [3], targeting specific muscles and/or tissues of the nervous system [1], identifying genetic modifiers that can impair a therapeutic strategy [2], targeting common pathways being affected in different patient subgroups for a given disease [4,5], or understanding the impact of neuromuscular disorders on other tissues that could be affected but may be understudied.

This Special Issue, entitled "Understanding Neuromuscular Health and Disease: Advances in Genetics, Omics, and Molecular Function", encompasses some 15 publications from colleagues working on a diverse range of neuromuscular diseases, including Duchenne muscular dystrophy [6–9], facioscapulohumeral dystrophy [3,10,11], amyotrophic lateral sclerosis [4,5,12], spinal muscular atrophy [2], Emery–Dreifuss muscular dystrophy [13], and rheumatoid arthritis [14]. Looking across diseases, several themes are recurrent, such as the efforts to identify genotype–phenotype correlations in DMD [6,7,9] and ALS [4,5], the quest for effective biomarkers in many neuromuscular conditions [2,8,10,14], and the use of genomic and multi-omic approaches towards better ways to identify biomarkers and to understand disease [10,12,13].

The search for genotype–phenotype correlations can be aimed at the improved understanding of disease [4,5,7,9], but may also be relevant to potential therapeutic outcomes [6]. Of relevance to this Special Issue are genotype–phenotype correlations in DMD [6,7,9] and in ALS [4,5]. It is interesting to contrast the state of these investigations in these two conditions, differences which are related to the underlying genetics: DMD being due to mutations at a single gene, and therefore correlations being sought between clinical outcomes and specific mutation patterns within that gene [6,7,9]; ALS being a disease of unclear aetiology for the majority of patients and the focus of these genotype–phenotype investigations thus being on the relationship of different genes and their functional roles to the implicated mechanisms [4] or clinical outcomes of the condition [5].

The use of genomics and multi-omics approaches is a theme which itself cuts across the aims of current research, from the overlay of multiple omics data to achieve a global perspective and new understanding of Emery–Dreifuss muscular dystrophy [13], to the identification of novel circulating miRNA and protein biomarkers for FSHD using multiomics [10], through to the application of machine learning to the genomics of ALS, which

**Citation:** Duddy, W.J.; Duguez, S. Understanding Neuromuscular Health and Disease: Advances in Genetics, Omics, and Molecular Function. *J. Pers. Med.* **2021**, *11*, 438. https://doi.org/10.3390/jpm11050438

Received: 30 April 2021 Accepted: 18 May 2021 Published: 20 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

is aimed at understanding the molecular basis of this disease (and is also relevant to genotype–phenotype correlations) [12]. Deciphering the pathways and gene mutations involved in neuromuscular diseases may allow for the development of computational models helping our understanding of muscle pathologies, which could enable preclinical studies of neuromuscular diseases in the context of personalized medicine [15].

Aside from therapeutic strategy development, the use of biomarkers may be critical as disease trackers for the development of effective therapeutics (for example, in FSHD [10]), but also to the personalized tailoring of existing treatments (in DMD [8], and in rheumatoid arthritis [14]), and may prove useful in a broad sense for improved stratification, diagnosis, and/or treatment (e.g., in adult SMA [2]).

We hope that studies such as these, that integrate modern molecular methodologies across cell and tissue models, computational approaches, and clinical studies, will continue to drive progress towards improved neuromuscular health and treatments for these often severe diseases.

**Funding:** Not applicable.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We wish to thank all the authors who contributed their work, and in so doing made the Special Issue a success. We also wish to thank the staff of *JPM* for their excellent support throughout the editorial process.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **A Genotype-Phenotype Correlation Study of Exon Skip-Equivalent In-Frame Deletions and Exon Skip-Amenable Out-of-Frame Deletions across the** *DMD* **Gene to Simulate the Effects of Exon-Skipping Therapies: A Meta-Analysis**

**Saeed Anwar 1, Merry He 1, Kenji Rowel Q. Lim 1, Rika Maruyama <sup>1</sup> and Toshifumi Yokota 1,2,\***


**Abstract:** Dystrophinopathies are caused by mutations in the *DMD* gene. Out-of-frame deletions represent most mutational events in severe Duchenne muscular dystrophy (DMD), while in-frame deletions typically lead to milder Becker muscular dystrophy (BMD). Antisense oligonucleotidemediated exon skipping converts an out-of-frame transcript to an in-frame one, inducing a truncated but partially functional dystrophin protein. The reading frame rule, however, has many exceptions. We thus sought to simulate clinical outcomes of exon-skipping therapies for *DMD* exons from clinical data of exon skip-equivalent in-frame deletions, in which the expressed quasi-dystrophins are comparable to those resulting from exon-skipping therapies. We identified a total of 1298 unique patients with exon skip-equivalent mutations in patient registries and the existing literature. We classified them into skip-equivalent deletions of each exon and statistically compared the ratio of DMD/BMD and asymptomatic individuals across the *DMD* gene. Our analysis identified that five exons are associated with significantly milder phenotypes than all other exons when corresponding exon skip-equivalent in-frame deletion mutations occur. Most exon skip-equivalent in-frame deletions were associated with a significantly milder phenotype compared to corresponding exon skip-amenable out-of-frame mutations. This study indicates the importance of genotype-phenotype correlation studies in the rational design of exon-skipping therapies.

**Keywords:** dystrophinopathy; duchenne muscular dystrophy (DMD); becker muscular dystrophy (BMD); dystrophin; reading frame rule; exon skipping; skip-equivalent deletions

#### **1. Introduction**

Dystrophinopathies are a spectrum of X-linked muscular dystrophies caused by mutations in the *DMD* gene encoding the dystrophin protein, which helps maintain the integrity of muscle membranes [1]. The most lethal end of this spectrum, Duchenne muscular dystrophy (DMD), generally arises from mutations that disrupt the translational reading frame and result in an absence of dystrophin [2]. Three other major conditions that belong to this spectrum are Becker muscular dystrophy (BMD, a mild form of DMD), intermediate muscular dystrophy (IMD, an intermediate form between DMD and BMD), and DMD-associated dilated cardiomyopathy (DCM, a type of nonischemic heart-disease) [2].

With 79 constitutive exons and (at least) seven alternatively-used exons, the *DMD* gene is one of the largest known genes in the human genome [2,3]. Because of the enormous length of the gene, it is highly vulnerable to mutations and, one out of three mutations are de novo in nature. Besides, the presence of two mutational hotspots, encompassing exons 3–22 and exons 45–55, within the coding sequence of the gene makes it more vulnerable to

**Citation:** Anwar, S.; He, M.; Lim, K.R.Q.; Maruyama, R.; Yokota, T. A Genotype-Phenotype Correlation Study of Exon Skip-Equivalent In-Frame Deletions and Exon Skip-Amenable Out-of-Frame Deletions across the *DMD* Gene to Simulate the Effects of Exon-Skipping Therapies: A Meta-Analysis. *J. Pers. Med.* **2021**, *11*, 46. https://doi.org/ 10.3390/jpm11010046

Received: 1 November 2020 Accepted: 22 December 2020 Published: 14 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

mutations [4–6]. To date, over five thousand *DMD* mutations have been reported. Large deletions involving deletion of one or more exons account for most cases (~68%), while duplications, other small mutations, and rarely deep intronic copy number variations cause the rest of the cases [7–9]. The major determinant of phenotypes is the reading frame rule, i.e., the mutations giving rise to an mRNA with disrupted reading frame (out-of-frame) result in a severe pathology, i.e., DMD. On the other hand, in-frame deletions result in the production of truncated yet (partially) functional dystrophins, causing a mild clinical phenotype called BMD [10]. The reading frame rule is accurately predictive in ~90% of DMD cases; however, not as consistent for BMD cases, with prediction rates ranging from 56–91% in different cohorts [8,11].

This reading frame rule provides the rationale for a therapeutic strategy, called therapeutic exon skipping using antisense oligonucleotides to transform DMD-related outof-frame mRNAs into in-frame ones to produce truncated dystrophin, against severe and lethal cases of DMD. Recently, three exon-skipping oligonucleotide drugs, namely eteplirsen, golodirsen, and viltolarsen, received accelerated approval by the U.S. Food and Drug Administration (FDA), and a few other oligonucleotide-based drugs are in later stages of clinical trials [12–14]. As such, antisense oligonucleotide-mediated single exon skipping has great promise for treating DMD effectively. However, because of these drugs' highly specific nature, they apply to only a small portion of the DMD population, e.g., eteplirsen, golodirsen, and viltolarsen together are applicable for a total of around 20% of the entire DMD population in the U.S. [12,13]. Recently, skipping of multiple exons, e.g., exons 45–55, using mutation-tailored cocktails of antisense oligonucleotides have shown potential for clinical application, and the applicability of such cocktail drug was shown to reach over 65% of DMD patients with large deletions [6,15].

The reading frame rule, however, is not always an accurate predictor of clinical phenotypes [16]. As multiple exonic deletions are amenable to skipping of each exon, the quasi-dystrophins resulting from the various exon skipped transcripts may vary in stability, function, and phenotype. The DMD/BMD phenotypic ratio observed in individuals with confirmed in-frame deletions starting and/or ending at frameshifting exons have been examined to estimate the therapeutic outcomes from skipping several exons [17–19]. In this study, we employed meta-analysis on the published literature and databases containing data on the genotype-phenotype association. It provides an overview of clinical presentations in patients with each exon "skip-equivalent" deletion and determines the best estimate of each exon skip-amenable mutation's clinical outcome of exon-skipping therapies.

#### **2. Materials and Methods**

#### *2.1. Clinical Database Search and Inclusion/Exclusion Criteria*

The French UMD-DMD-France database and the eDystrophin are two online registries for individuals and families affected by dystrophinopathies [20,21]. From 30 March 2020 to 30 July 2020, we queried these two databases for relevant cases. We excluded patients with pending clinical phenotypes and female carriers from the analysis. If a record were present in both databases, the data were entered only once into the data tabulation sheet.

#### *2.2. Literature Search and Inclusion/Exclusion Criteria*

To collect more data on the clinical features of genetically confirmed DMD patients, a literature search of PubMed was conducted using the following query terms: Dystrophin\*[Title/Abstract]) OR (Duchenne[Title/Abstract])) OR (Becker\*[Title/Abstract])) OR (muscular dystrophy[Title/Abstract]) OR (mutation[Title/Abstract])) OR (large deletion[Title/Abstract])) OR (mutation spectrum[Title/Abstract])) OR (MLPA[Title/Abstract])) OR (ligation[Title/Abstract])) OR (CGH[Title/Abstract])) OR (comparative genomic hybridization[Title/Abstract])). We then also searched google scholar using an equivalent query.

Two researchers (S.A. and M.H.) independently reviewed titles and abstracts of the identified articles to determine that only relevant publications were included. Articles were excluded if, after review, it was evident that the article discussed a non-human disease model, diseases other than dystrophinopathies, a non-dystrophin study, or presents a meta-analysis. We excluded the manuscripts presenting inadequate genotype-phenotype information or discussing only female carriers. Additionally, we excluded the studies that were included in any of the two databases mentioned above. In addition, only the manuscripts reporting cases of dystrophinopathies detected by multiplex ligationdependent probe amplification (MLPA), comparative genomic hybridization (CGH), or equivalent diagnostic procedures were included. No restrictions were made based on language, publication year, publication status, and the latest search date.

We identified a total of 612 unique articles by searching PubMed and Google Scholar, out of which 12 articles were finally selected based on our exclusion criteria after the title and abstract review. We then conducted a comprehensive review of these 12 articles to identify individuals with genetically confirmed large deletions in *dystrophin* (Figure 1, Figure S1, Table S1).

**Figure 1.** Flow chart showing database screening and literature search procedure to collect patients' clinical information with confirmed DMD large deletions. *N* indicates the number of individuals present in each data source.

#### *2.3. Data Extraction and Analysis of the Data*

Two authors (S.A. and M.H.) independently reviewed the full-text versions and, if available, the Supplementary Materials to extract clinical data of dystrophinopathy patients with in-frame deletions from the literature. Data of patients with deletions involving exons 1 and 79 were not included as deletions of these two exons would not produce truncated dystrophins. To assess duplicate records in multiple databases and literature, we utilized the following strategy. One of the investigators (S.A.) compared the partially de-identified information (identifiers, country origin, age at diagnosis, date of birth, and available mutation information, e.g., mutation start point and end point, and type of mutation), assuming any records with these same items represented the same patient [22,23]. Another

author (K.R.Q.L.) scrutinized the inclusion of mutation information and the removal of duplicate records. We categorized the clinically confirmed patients into two groups: (i) DMD, and (ii) BMD and asymptomatic. We categorized BMD and asymptomatic cases together because exon-skipping therapies aim to convert lethal DMD cases into milder forms, whether BMD or asymptomatic, with in-frame deletions. In addition, those asymptomatic cases at the time of examination could become BMD at a later age. The asymptomatic cases, therefore, are combined with BMD in a single group.

#### *2.4. Exon Skip-Amenable and Exon Skip-Equivalent Mutations*

We defined exon skip-amenable mutation as follows; (1) for frame-shifting exons (i.e., exons 2, 6, 7, 8, 11, 12, 17, 18, 19, 20, 21, 22, 43, 44, 45, 46, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 65, 66, 67, 68, 69, 70, 75, 76, and 78), we included large out-of-frame deletion mutations that can be made in-frame by skipping of exon X as "exon X skipamenable mutations". For example, exon 52 of DMD is a frameshifting exon, and skipping of an adjacent exon, exon 51 or exon 53, theoretically restores the reading frame (exon 51 skip-amenable and exon 53 skip-amenable), (2) for non-frame-shifting exons (all other exons between exons 2–78), we included nonsense mutations and small out-of-frame indels in exon Y as "exon Y skip-amenable mutations". For example, therapeutic skipping of exon 23 for a nonsense mutation in exon 23 would restore the reading frame, leading to truncated dystrophin protein expression (exon 23 skip-amenable). We included patients who naturally harbor large in-frame deletion mutations starting and/or ending at exon Z as "exon Z skip-equivalent mutation". Deletions starting at exon 1 and ending at exon 79 are not included.

The same overall definitions of exon skip-amenable and exon skip-equivalent mutations can be applied for multi-exon regions, e.g., exons 3–9 or 45–55. Mutations (i.e., large deletions, large duplications, small indels, and nonsense) whose disrupted reading frame can be restored by multi-exon skipping are considered multi-exon skip-amenable. Naturally, mutations involving both frame-shifting or non-frame-shifting exons are included in our counts. Mutations partly outside of exons 3–9 and 45–55 are not included in exons 3–9 and 45–55 skip-amenable mutations, respectively (e.g., del. ex 43–45, del. ex 51–57). On the other hand, naturally occurring mutations (e.g., in-frame large deletions of exons 3 to 9 and exons 45 to 55) that model multi-exon-skipped transcripts are considered multi-exon skip-equivalent.

#### *2.5. Statistical Analysis*

The odds of DMD for a given exon were compared to BMD and asymptomatic phenotype's odds using Fisher's exact test. The ratios of DMD to BMD/asymptomatic phenotypes associated with a specific mutation and all other mutations were compared using Fisher's exact test. The test-statistic values of multiple test comparisons were adjusted using the Benjamini–Hochberg method to control false discovery rates [24]. To identify if there were hotspot regions in the gene, we compared the frequency of in-frame deletions starting and/or ending at a given exonic region and all other exons using f-statistic. Additionally, we compared the phenotypic outcomes associated with out-of-frame mutations amenable to exon skipping (exon skip-amenable) and exon skip-equivalent in-frame deletions of each exon using Fisher's exact test. Since the databases included and the manuscripts reviewed were, in general, descriptive, case-specific or focus group studies rather than randomized ones, we could not do a formal risk of bias assessment on the data. All data were analyzed in Microsoft Excel (Office 365, 2019).

#### **3. Results**

#### *3.1. Patient Pool and Deletional Patterns*

The UMD-DMD France and eDystrophin databases contain clinical data of 681 and 781 individuals, respectively, with confirmed clinical outcomes with in-frame exon deletions within *dystrophin*. Data of 595 individuals (29 DMD, 556 BMD, and 10 asymptomatic

individuals) was included in both databases. In addition, our literature search identified clinical data of 431 unique patients with in-frame deletions. The final patient pool we analyzed for clinical information contained information of 1298 individuals with confirmed clinical outcomes resulting from genetically diagnosed dystrophinopathy, including 277 (21.34%) individuals with DMD and 1021 (78.66%) with milder BMD (*n* = 976) or asymptomatic phenotypes (*n* = 45) (Figure S2A). There was a significant difference in the ratio of DMD to BMD and asymptomatic phenotypes between the records obtained from the two databases and literature searching (eDystrophin vs. literature search: *p*-value < 0.0001, 95% CI = 27.60–37.79; UMD-DMD France database vs. literature search: *p*-value < 0.001, 95% CI = 30.49–40.55; UMD-DMD France database vs. eDystrophin: *p*-value = 0.0519, 95% CI = −0.034–5.65) (Figure S2A).

Theoretically, there are 1408 potential large in-frame deletions possible across the *DMD* gene (Table S2). Our patient pool represented 180 (12.78%) of these potential in-frame deletions (Figure S2B). Of these theoretically possible large in-frame deletions, Δ 45–47 (*n* = 317, 23 DMD) and Δ 45–48 (*n* = 205, 19 DMD) were the most common deletional patterns reported among the patients. Two deletional hotspots were identified between exons 3 and 13, and 44 and 55, as 15.58% and 74.46% deletions started and/or ended within these two regions (*p*-value < 0.0001, for both regions) (Figures S2C and S3).

#### *3.2. More BMD and Asymptomatic Phenotypes Were Associated with In-Frame Deletions*

Large in-frame deletions were associated with 9.73%, 6.90%, and 42.46% cases with a severe DMD phenotype in the eDystrophin, UMD-DMD France databases, and the literature (Figures S4–S6). Overall, the reading frame rule was predictive for nearly 78.54% of the cases with large in-frame deletions (Figure 2A). However, exclusive of the literature's records, the prediction rate was 88.95% (Figures S4–S6). We identified 19 exons in-frame deletions, resulting in a significantly severer phenotype than average (Figure 2A,B). Additionally, in-frame deletions starting and/or ending at five exons, including exon 4, 45, 47, 48, and 55, are deemed to result in milder phenotypes (Figure 2A). In addition, we observed a lower incidence of DMD phenotype associated with in-frame deletions starting and/or ending at exons encoding the dystrophin protein's central rod domain (Figure S7). On the other hand, in-frame deletions starting and/or ending at the extreme ends of the protein was associated with more DMD phenotype (Figure S7).

#### *3.3. Distribution of Phenotypes for Exon Skip-Amenable Mutations to Each Exon*

We then intended to look at the clinical phenotypes associated with out-of-frame mutations that can be converted to in-frame by therapeutic skipping of one (or more) exons, i.e., skip-amenable mutations, using the patient data collected from the UMD-DMD France Knowledgebase. We determined the ratio of individuals with severe and milder phenotypes associated with exon skip-amenable mutations. Collectively, our analysis identified a total of 1149 individuals with single exon skip-amenable mutations, including 1120 (97.476%) patients with a severe phenotype, i.e., DMD (Figure 3).

**Figure 2.** Clinical phenotypes of exon skip-equivalent in-frame *DMD* exon deletions. (**A**) Association between in-frame deletions starting and/or ending at each exon (exon skip-equivalent) and consequent phenotypes. Phenotypic ratios associated with in-frame deletions starting and/or ending at a given exon and all other exons were compared using Fisher's exact test. *n* indicates the number of individuals with DMD (red) and milder (blue; BMD and asymptomatic) phenotypes. Green and red color indicate a significantly lower and higher incidence of DMD phenotype for a given exon, respectively, as compared to the overall incidence rate. *p* = *p*-value, as calculated by Fisher's exact test; *p*\* = Benjamini–Hochberg adjusted *p*-value. (**B**) Heatmap showing the relative severity of the consequence of in-frame deletions starting and/or ending at specific exons.

**Figure 3.** Phenotypic outcomes of exon skip-amenable mutations present in the UMD-DMD France database. (**A**) Distribution of phenotypes for each exon skip-amenable mutation. Phenotypic ratios associated with exon skip-amenable mutations for each exon vs. all other exons are compared using Fisher's exact test. *n* indicates the number of individuals with DMD (red) and milder (blue; BMD and asymptomatic) phenotypes. Green color indicates a significantly lower incidence of DMD phenotype. *p* = *p*-value, as calculated by Fisher's exact test; *p*\* = Benjamini–Hochberg adjusted *p*-value. (**B**) Heatmap showing the relative severity of the consequence of out-of-frame mutations amenable to skipping of each exon.

#### *3.4. Comparison of Phenotypes of In-Frame Exon Skip-Equivalent and Out-of-Frame Exon Skip-Amenable Mutations*

To simulate the clinical phenotype before and after exon-skipping therapy, we looked at the differences in phenotypic outcomes of in-frame exon skip-equivalent and out-of-frame exon skip-amenable mutations for each exon using the data present in the UMD-DMD France Knowledgebase. Our analysis identified 21 exons at which in-frame deletions start and/or end (exon skip-equivalent mutations) were associated with a significantly lower DMD incidence compared to corresponding exon skip-amenable mutations (Figure 4A,B). Additionally, exons 3–9 and exons 45–55 skip equivalent deletions were associated with a significantly lower


incidence of a severe phenotype compared to exons 3–9 and exons 45–55 skip-amenable mutations (*p*-value < 0.0001).

**Figure 4.** *Cont*.

**Figure 4.** Comparison of clinical phenotypes associated with out-of-frame mutations amenable to exon skipping in UMD-DMD database and in-frame exon skip-equivalent deletions of each exon to simulate the effects of exon-skipping therapies. (**A**) Phenotypic outcomes associated with mutations amenable to exon skipping and exon skip-equivalent in-frame deletions of each exon and their consequent phenotypes. *n* indicates the number of individuals with DMD (red) and milder (blue; BMD and asymptomatic) phenotypes. Asterisks indicate that exon skip-equivalent in-frame deletions are associated with a significantly milder phenotype compared to corresponding exon skip-amenable out-of-frame mutations. We compared the incidence of DMD associated with exon skip-equivalent (or group of exons, e.g., exons 3–9, and exons 45–55) and mutations amenable to skipping each exon (or group of exons, e.g., exons 3–9, and exons 45–55). The statistical significance was calculated using Fisher's exact test. (**B**) Heatmap showing the relative severity of the consequence of exon skip-equivalent in-frame deletions and exon skip-amenable mutations. \* *p* < 0.05, \*\* *p* < 0.01, \*\*\* *p* < 0.001

#### **4. Discussion**

The FDA approvals of eteplirsen, viltolarsen, and golodirsen are an inspiring development in treating DMD, although these drugs cumulatively can only be administered to approximately 20% of the total DMD population who are specifically amenable to exon 51 and exon 53 skipping [12,25,26]. Besides, the development of antisense therapies to skip additional exons could theoretically help to treat up to ~80% of individuals living with DMD [16,27]. Currently, clinical trials to evaluate exon 44, 45, 51, and 53 skipping antisense drugs are underway (clinicaltrials.gov: NCT02530905, NCT04179409, NCT02667483, NCT02500381, NCT04433234, NCT04129294). Besides, the development of mutation-tailored cocktail antisense drugs applicable for treating a large portion of the DMD population carrying out-of-frame deletions also showed potential [15,28].

This study focused on the phenotypes associated with large deletions involving one or more exons to result in in-frame transcripts. While 1408 large in-frame deletion patterns are theoretically possible within the *DMD* gene, this study identified only 180 (12.78%) of them in individuals in the searched sources of clinical data (Figure S2B and Table S1). The absence of clinical data on individuals with the remaining deletions, perhaps due to the deletions' extreme rarity or that the associated phenotypes are asymptomatic or very mild, that the mutations remain undetected and unreported.

The clinical data on individuals with large in-frame deletions in *DMD* present that nearly 80% (78.54%) have milder BMD or asymptomatic phenotypes (Figure S2A and Figure 2A). Nearly 1 out of 5 individuals (21.46%) developed severe phenotype, i.e., DMD, despite their predicted in-frame deletions. Interestingly, the occurrence of severe phenotype was over four-times higher among the individuals constituted from the literature (Figure S2A). This indicates the potential presence of reporting bias in the clinical reports [29]. To note, in the aggregate of the data collected from the two databases but without those obtained from the literature, 11.05% of individuals with in-frame large deletions developed severe phenotype, which is comparable with the numbers indicated by previous reports (Figures S5 and S6) [7,8,10,20,30].

A previous report studying the clinical phenotypes of *DMD* exon 51 skip-equivalent deletions identified 12% of patients had severe phenotypes despite their predicted in-frame deletions [31]. The individuals with DMD phenotype with in-frame deletions indicate the need for subtle knowledge of the reading-frame rule for dystrophinopathies. It also reflects the complex biology of dystrophinopathies and an array of different factors may influence the ultimate clinical phenotype. These factors may include but are not limited to the inherent leakiness of splicing seen in some exons, the effect of predicted frame-altering mutations on splicing signals, and the stability and tertiary structure of the resultant protein [16,32–34].

Irrespective of the data source, deletions starting and/or ending at some exons were associated with a more severe phenotype (Figure 2 and Figures S2–S6). Additionally, inframe deletions starting and/or ending within the exons 3–13 hotspot region was associated with an elevated frequency of a severe phenotype, while those starting and/or ending within exons 44–55 region resulted in a significantly lower incidence rate for a DMD phenotype (Figure S2C and Figure 2). An explanation for why the occurrence of a severe phenotype is associated with these deletions is currently unclear. One possible reasoning could be that these deletions, although been predicted as in-frame, result in a less stable or non-functional protein, as a part of N-terminal hotspot encodes for the actin-binding domain. However, the deletion of certain exons involved with coding for portions of the central rod domain may have a less significant impact on skeletal muscle pathology [35], which is what is reflected in this study. Given the retrospective and meta-analytic nature of this study, we did not have access to do any muscle tissue biopsy to investigate the actual cause beneath the elevated frequency of severe phenotype associated with these deletions, and it would also be beyond the scope of this study. Further assessment in clinical status and muscle tissue analysis may help conclude the factors behind the findings

of a severe phenotype associated with these exons, which could be interesting and may provide essential insights for therapeutic development.

Importantly, this study identified five exons, including exons 4, 45, 47, 48, and 55, at the start and/or end of in-frame deletions associated with a significantly lower frequency of a DMD phenotype than average (Figure 2A). A previous study of 4894 patients with in-frame and out-of-frame large deletions from the TREAT-NMD DMD Global database listed the top ten skippable *DMD* exons, including exons 8, 43, 44, 45, 50, 51, 52, 53, and 55, that would apply to the largest group of patients with DMD [7]. Our study identified that individuals with large in-frame deletions starting and/or ending at many of these exons had a significantly lower frequency of developing a severe phenotype than those having a mutation amenable to skipping those exons (Figures 2 and 4). It also suggests that the skipping of exons 3 to 9 and 45 to 55 could be promising targets for treating DMD (*p*-value < 0.0001).

Among the exons we identified as associated with a significantly lower incidence of DMD, exon 45 is already being thoroughly studied as a target for the development of exon-skipping therapies [16,36]. Casimersen, an antisense drug of phosphorodiamidate morpholino chemistry to treat DMD amenable to exon 45 skipping, has recently been accepted and placed under priority review by the FDA [37,38]. The other five exons are also deemed promising therapeutic targets of exon skipping, although some of them are not applicable to many patients.

While the present study included a reasonably comprehensive dataset, it has several limitations. First, a significant limitation is that the specific deletional patterns between DMD patients eligible for exon skipping with out-of-frame deletions and in-frame deletions of exon skip-equivalent deletion mutations are not always comparable. This is particularly the case for exon 44 skipping. Over two-thirds of the (67.89%) out-of-frame deletions amenable to exon 44 skipping start at exon 45, whereas ~98% of the in-frame deletions (exon 44 skip-equivalent) end at exon 44 (Figure S8). Second, the effects of in-frame deletions at the DNA level are not necessarily equivalent to the effects at the RNA level in antisense-treated cells. For example, in a DMD dog model, skipping of exon 8 using an antisense morpholino led to spontaneous skipping of exon 9 in addition to exon 8 [28]. As such, the effects of exon 8 skipping might be more relevant to exon 9 skip-equivalent in-frame deletion rather than exon 8 skip-equivalent in-frame deletion. Third, among the literature included, some studies only enrolled patients diagnosed with either DMD or BMD, potentially influencing the results. Fourth, some studies did not interrogate all *DMD* exons, leaving the possibility that, although unlikely, the patients may have harbored a second mutation. Fifth, the exceedingly high proportion of the DMD phenotype in the literature also raises the concern that there could be some reporting biases in play. Lastly, other factors, including the variability of antisense oligonucleotide-mediated exon skipping efficacy for different exons, also need to be considered to design exon-skipping therapy. Hence, these potential limitations should be considered with care when interpreting the data presented in this study.

#### **5. Conclusions**

This study highlights the genotype-phenotype association among individuals with large in-frame deletions starting and/or ending at different exons in *DMD*. While most exon skip-equivalent in-frame deletions are associated with a significantly milder phenotype compared to corresponding exon skip-amenable out-of-frame mutations, we identified several exon skip-equivalent in-frame deletions that are the most promising therapeutic targets. However, the phenotypic variability in individuals with specific in-frame exon deletions found in this study is suggestive of the issue that the response to exon-skipping therapy could be variable and may be impacted by multiple factors. This study largely indicates that genotype-phenotype correlation analysis can significantly contribute to the rational design of exon-skipping therapies; however, hints at the necessity of continued evaluation of genetic and other modifiers.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/2075-442 6/11/1/46/s1, Figure S1: Deletional pattern in the DMD gene. From left to right on the bottom row represents the number of exons (2–78) in the DMD gene. The black shaded area shows the range of each deletion in the *DMD* gene. Figure S2: Patient pool and dystrophin mutational patterns among the individuals with large in-frame deletions. Figure S3: Frequency of deletions starting and/or ending at a given exon was significantly high in a proximal (exons 3 to 13, *p*-value = 0.000015, f = 20.64776) and a distal (exons 44 to 55, *p*-value = 0.000069, *f* = 17.2277) region. Figure S4: Association between in-frame deletions (A) starting, (B) ending, and (C) starting and/or ending at each exon and consequent phenotypes in the eDystrophin database. Figure S5: Association between in-frame deletions (A) starting, (B) ending, and (C) starting and/or ending at each exon and consequent phenotypes in the UMD-DMD France database. Figure S6: Association between in-frame deletions (A) starting, (B) ending, and (C) starting and/or ending at each exon and consequent phenotypes in the records identified from existing literature; Figure S7: Overview of clinical phenotypes associated with in-frame deletions starting and/or ending at different exons and their corresponding dystrophin domains. Figure S8: Phenotypes associated with different exon 44 skip equivalent and amenable to exon 44 skipping mutations. Table S1: Clinical data obtained from literature searching. Table S2: List of all theoretically possible in-frame deletion within *dystrophin*.

**Author Contributions:** Conceptualization, T.Y.; methodology, S.A., M.H., K.R.Q.L., and R.M.; software, S.A., M.H., and R.M.; validation and scrutinization, S.A., M.H., R.M., and K.R.Q.L.; investigation, S.A. and M.H.; data curation, S.A., and M.H.; formal analysis, S.A., and M.H.; writing—original draft preparation, S.A.; writing—review and editing, S.A. and T.Y.; visualization, S.A.; supervision, T.Y.; project administration, T.Y.; funding acquisition, T.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** T.Y. is supported by the Muscular Dystrophy Canada, the Friends of Garrett Cumming Research Fund, the HM Toupin Neurological Science Research Fund, Canadian Institutes of Health Research (CIHR), Alberta Innovates: Health Solutions (AIHS), Jesse's Journey, and the Women and Children's Health Research Institute (WCHRI). S.A. is supported by scholarships from the Maternal and Child Health (MatCH) Program, and the Alberta Innovates Graduate Student Scholarship (AIGSS).

**Institutional Review Board Statement:** For each database and the literature included, each study reported that the investigators have obtained approval from the relevant Institutional Review Board. No data sharing is applicable to this study since no new data were created or analyzed in this study.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data prepared for this study are available in the article or Supplementary Materials.

**Acknowledgments:** We thank Aurélie Nicolas and Sanjana Fatema Chowdhury for their help in data coding and identifying the redundant entries in the patient registries. We especially thank Nur-E Jannat Hoque (Department of Statistics, Biostatistics, and Informatics, University of Dhaka), Maryna Yaskina (Biostatistician, University of Alberta Faculty of Medicine & Dentistry—Women and Children's Health Research Institute), and Matthew Pietrosanu (Department of Mathematics & Statistical Sciences/Training and Consulting Centre, University of Alberta) for kindly reviewing the statistical analyses.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Journal of Personalized Medicine* Editorial Office E-mail: jpm@mdpi.com www.mdpi.com/journal/jpm

MDPI St. Alban-Anlage 66 4052 Basel Switzerland

Tel: +41 61 683 77 34 Fax: +41 61 302 89 18

www.mdpi.com

ISBN 978-3-0365-1621-9