**1. Introduction**

Neurofibromatosis type 1 is a rare disease and a member of the family of RASopathies (diseases caused by germline mutations in genes that encode components or regulators of the Ras/mitogen-activated protein kinase (MAPK) pathway) that occurs in approximately 1 in 3000 patients worldwide and gives rise to cognitive impairment, skeletal abnormalities, and various nerve tumors including gliomas and neurofibromas and is caused by a mutation or deletion in one NF1 allele [1–3]. Nerve sheath tumors a ffect more than 90% of NF1 patients, mostly in the form of cutaneous neurofibromas (cNFs). These tumors grow at the skin surface and can range in number from 10s to 100s of tumors in a given patient [4]. Neurofibromas that occur deeper in the body, including subcutaneous neurofibromas or plexiform neurofibromas (pNFs), occur in 40–60% of NF1 patients and can cause pain and disfigurement among other symptoms [1,4]. Patients with pNFs have a 10% lifetime risk of these tumors developing into malignant peripheral nerve sheath tumors (MPNSTs) which have a 5-year survival rate of 40–50% [5,6].

The rise of high-throughput genomic and transcriptomic sequencing has enabled many advances in understanding the molecular etiology of NF1 tumor types [7–11]. Genomic studies of NF1-derived tumors, particularly MPNSTs, have identified key features of tumor growth that could point to potential therapeutic avenues. For example, genomic approaches were recently used to identify the loss of function of polycomb repressor 2 complex components *EED* or *SUZ12* genes, alongside *CDKN2A* and *NF1* gene mutations as crucial co-mediators of MPNST transcriptional dysregulation, pathogenesis, and sensitivity to bromodomain (BRD4) inhibitors [9,11]. Others using genomic approaches to explore nerve sheath tumor biology identified *MET* and *HGF* gene amplifications in MPNSTs. Furthermore, models of *MET*-amplified MPNSTs were subsequently sensitive to the MET inhibitor capmatinib [7]. Transcriptomics-focused approaches have also identified molecular features such as MEK signaling, type 1 interferon signaling, and Aurora kinase A as putative therapeutic targets in NF1 tumors [12–14]. Taken together, these and other studies sugges<sup>t</sup> that an integrative approach that combines multiple transcriptomic and genomic datasets might be well poised to identify new therapeutic avenues in MPNSTs and other NF1-related nerve sheath tumors.

Previous genomic profiling studies have demonstrated that many NF1 nerve sheath tumors (MPNST being the exception) are genetically quiet [3,15–17] and lack specific signatures that are predictive of drug response. An approach to compensate for the lack of genetic hotspots in NF1 tumors is to focus on combinations of transcriptomic signatures that may be unique to specific tumor types. In other tumor types, transcriptomic landscapes across cancer datasets [18,19] have shown that combining RNA-seq data from similar diseases can identify expression profiles that correlate with prognosis [20], predict drug response [21,22], or identify key tumor biology [23].

However, comprehensive analysis of genomic data in NF1 tumors is limited [3]. To enable a larger landscape analysis of NF1 nerve sheath tumors, samples from di fferent studies were collated into a single resource as part of the NF Open Science Initiative, a collaboration between NF-related funding agencies. This resource has been made publicly available through the NF Data Portal, which houses high-throughput data for neurofibromatosis 1, neurofibromatosis 2, and schwannomatosis [24].

In this work, we reprocessed and analyzed RNA-seq data from 77 NF1 nerve sheath tumor samples to understand the biological di fferences that give rise to distinct tumor types in NF1 patients. Given the low sample size compared to the large feature space of RNA-seq data, we applied a transfer learning-inspired approach to meaningfully reduce the feature space with minimal decrease in information content. Transfer learning techniques can leverage large well curated datasets such as recount2 [25] to identify latent variables (LVs)—groups of genes derived from larger repositories of gene expression datasets that exhibit common transcriptomic patterns relevant to a specific subset of samples [26,27]. Although many of these LVs are composed of genes that map to known signatures (i.e., documented in Kyoto Encyclopedia of Genes and Genomes (KEGG) database, and Gene Ontology (GO) consortium database), others are uncharacterized and may allow the detection of novel and meaningful transcriptomic patterns in NF data. As a result, reduction of gene-based expression data to individual latent variables can provide multiple benefits—LVs can highlight di fferences in known biology in sets of samples, they can uncover previously unknown biology, and they can reduce the impact of technical and experimental di fferences across multiple datasets [26]. We transferred a

machine learning model [27] trained on recount2 to assess LV expression in the NF1 nerve sheath tumor dataset. We then used supervised machine learning with random forests [28] to isolate combinations of such LVs to identify specific molecular signatures that may describe the underlying biology unique to each of the tumor types: cNFs, pNFs, undefined neurofibromas (NFs), and MPNSTs. Finally, we integrated this information with sample-matched variant data, immune cell signatures [29–31], and protein activity predictions [32] to provide additional biological context to the most important latent variables. This approach revealed biological patterns that underlie di fferent NF1 nerve sheath tumor types and candidate genes and cellular signatures associated with NF1 tumor heterogeneity.

#### **2. Materials and Methods**

#### *2.1. Materials Implementation and Data and Code Availability*

All analyses were performed using the R programming language. A comprehensive list of packages used and their versions are available in an *renv* lockfile in the GitHub repository. Key packages used include "tidyverse" [33], "PLIER" [26], "synapser" [34], "tximport" [35], "immundeconv" [35], and "viper" [32]. All data analyzed in this article are stored on the NF Data Portal [24] (http://nfdataportal.org) with analyses stored at http://synapse.org/nf1landscape. To recapitulate the analysis from these data, all relevant code can be found at https://github.com/Sage-Bionetworks/NF\_LandscapePaper\_2019.

#### *2.2. Sequencing Data Collection and Processing*

Gene expression data were collected from four independent studies and processed via a workflow at https://github.com/Sage-Bionetworks/rare-disease-workflows/tree/master/rna-seq-workflow to be stored on the NF Data Portal (Table 1). Specifically, raw fastq files were downloaded from Synapse and transcripts were quantified using the Salmon pseudo-alignment tool [36] with Gencode V29 transcriptome. Links to specific datasets and the access teams required to download them can be found using Synapse ID *syn21221980*.



Genomic variant data were collected from exome-Seq [37] or whole-genome sequencing [38]. Variant call format (VCF) files were processed using "vcf2maf" (https://github.com/mskcc/vcf2maf) according to the workflow located at https://github.com/Sage-Bionetworks/rare-disease-workflows/ tree/master/gene-variant-workflow and then uploaded to the NF Data Portal. A list of datasets and the access teams required to download them can be found in Table 2 or syn21266269.



#### *2.3. Latent Variable Calculation and Selection*

We analyzed transcriptomic data from NF in the context of latent variables from MultiPLIER, a machine learning resource designed to aid in rare disease analyses [27]. Raw transcriptomic data from NF were retrieved from the NF Data Portal, reprocessed, and stored in Synapse as described above. Salmon output files (quant.sf) files were imported into an R session and converted to HUGO gene names using the "tximport" [35] and "org.Hs.eg.db" [40] packages.

MultiPLIER reuses models that were trained on large public compendia. We retrieved a model [41] that was previously trained on the recount2 RNA-seq dataset [25,42] and then used it to assess the expression of latent variables in the pan-NF dataset. We retrieved code for this analysis from the public repository for MultiPLIER [27] (https://github.com/greenelab/multi-plier). To project the NF data into the MultiPLIER model, we used the GetNewDataB function. This analytical approach is described in more detail in a machine learning training module (https://github.com/AlexsLemonade/trainingmodules) produced by the Alex's Lemonade Stand Foundation's Childhood Cancer Data Lab under an open source license.

In addition, to ensure orthogonality in the final set of latent variables [43], we calculated the pairwise Pearson correlation of all latent variables by comparing the gene loadings (Figure S1), and identified non-self-correlations greater than 0.5. We eliminated one of each of these highly intercorrelated (Pearson correlation > 0.5) latent variables. This process resulted in a final set of 962 latent variables for further analysis. Code for our implementation of this material is available on GitHub at https://github.com/Sage-Bionetworks/NF\_LandscapePaper\_2019.

We used the R function prcomp to compute the principal components for Figure 1A,C both using genes (Figure 1A) and latent variables (Figure 1C).

#### *2.4. Generation of Ensemble of Random Forests for Feature Selection*

To select gene expression patterns of interest, we used ensembles of random forests to su fficiently resample our modestly sized dataset. We compared random forest models built using gene expression data as well as latent variables.

## 2.4.1. Algorithm Implementation

The main algorithm was implemented using the "caret" and "randomforest" packages in R [44,45]. Figure S2 outlines the steps involved in the generation of the ensemble of random forests. Briefly, the full dataset was first split into 80% *model* set and 20% *independent test* set. The function createDataPartition was used to create balanced splits of the data according to the tumor type. We tuned two parameters to the random forest algorithm, *mtrys*, and *ntrees*, using an iterative approach, evaluating *mtrys* values of 1 to 100 and *ntrees* values of 250, 500, 1000, and 2000. We selected the optimal values (*mtry* = 51, *ntrees* = 1000) using fivefold cross-validation, using latent variables as input features. We then split the *model* training set further to generate 500 samples of *training* data (75%) and hold-out *test* data (25%) (balanced splits randomly sampled without replacement). Each of these *training* and *test* datasets were used to train separate random forests to obtain a distribution of F1 scores and feature importance scores

(*n* = 500). Given our noisy dataset with limited sample size, the distribution of feature and F1 scores enabled estimation of confidence intervals for the feature importance as well as model performance.

## 2.4.2. Feature Selection

The importance of each feature was estimated using raw importance scores that measure the change in correctly classified "class" due to random permutation of the values for the feature. To select the top features for a specific class (i.e., tumor type), we calculated the median *importance score* of each feature from the distribution of *raw importance scores* generated through 500 iterations of random forests. The top 40 features were selected according to the mean decrease of the Gini index. The union of top features from all classes was then used as a restricted feature set to train another 500 iterations of random forests as described above. However, each new forest trained with the restricted feature-set was tested using the *independent test set* to examine the performance of the model on a completely unseen dataset. For each class, the median F1 scores of the new ensemble of forests were compared to the previous ensemble of random forests. Improvement of median F1 scores for each class in the final ensemble of random forests compared to the earlier one suggested that the selected features from each class were su fficiently informative for their classification. This subset of features was then selected for downstream analyses.

#### *2.5. Immune Subtype Prediction*

To understand the relative immune infiltration across the nerve sheath tumors studied, we used two tumor deconvolution methods: CIBERSORT and MCP-counter, as implemented through the "immunedeconv" R package [29–31]. Analysis is located at https://github.com/Sage-Bionetworks/NF\_ LandscapePaper\_2019 and results were uploaded to a Synapse table (syn21177277) that includes the tumor-specific immune cell scores for both algorithms as well as associated tumor metadata.
