1. Introduction
Breast cancer remains the most diagnosed malignancy and most common cause of cancer death among women globally, with an estimated 2.3 million diagnoses and 666,000 deaths in 2022 [
1]. Approximately 4.5% of cases in Canada are early-onset breast cancer (EoBC), defined by a diagnosis before 40 years of age, compared to an estimated 10% of cases worldwide [
1,
2]. In Canada, the incidence rate of EoBC has increased annually by 0.66% from 2000 to 2015 compared to 0.21% in the overall population during the same time period [
3]. Further, survival outcomes have not improved in EoBC to the same extent as in the overall breast cancer population over time [
4].
EoBC presents clinical challenges in part due to its rarity but also because few risk factors have been established to guide prevention. Organized routine mammography screening in Canada is indicated for women aged 50–74 years, with the harms of screening outweighing the benefits in women < 40 years [
5]. Therefore, EoBC is often detected symptomatically and at later stages [
6,
7,
8,
9]. EoBC is also likely to present with more aggressive disease biology, including the human epidermal growth factor receptor-2 (HER2)-enriched and triple-negative (TNBC) subtypes, stressing the importance of the clinical management of EoBC [
10,
11]. It is well accepted that the risk of recurrence and mortality is higher in EoBC compared to the overall breast cancer population; however, the drivers remain poorly understood. Large epidemiological studies established age < 40 years as an independent risk factor for poor prognosis in breast cancer, even after adjustment for pathological features and treatments received [
4,
10,
12,
13,
14,
15,
16,
17]. This has driven clinical debate as to whether inferior outcomes in EoBC are due to an overrepresentation of aggressive disease features or a unique disease biology [
18].
The strongest established risk factors in EoBC include inherited genetic mutations. However, less than 10% of breast cancer incidence among young women is attributable to heritable mutations in the
BRCA1 or
BRCA2 genes [
19,
20]. Further, Copson et al. found no evidence that germline mutations were related to mortality or tumour aggressiveness among breast cancer patients aged < 40 [
19]. This suggests an important role for somatic mutations caused by lifestyle or environmental exposures in combination with intrinsic processes in tumour progression and survival in young women. Somatic mutations are found in all cancer genomes. A small proportion are drivers that confer clonal advantage, are causally implicated in oncogenesis, and have been positively selected during the evolution of the cancer [
21,
22,
23,
24]. Somatic driver mutations in over 30 cancer genes have been implicated in breast cancer development, including
AKT1,
BRCA1,
CDH1,
GATA3,
PIK3CA,
PTEN,
RB1, and
TP53 [
10,
21,
22]. Comparatively fewer studies have assessed driver mutations of recurrence and metastasis in breast cancer, and no such studies have been performed in early-onset populations.
The remaining somatic mutations are “passengers”, which do not contribute to cancer development. However, passenger mutations bear the imprints of the DNA damage and repair processes operative during the development of the cancer, unmodified by selection [
25]. Advancements in next-generation sequencing have permitted sequencing of whole cancer genomes and identified thousands of single nucleotide variants (SNVs) in breast cancer genomes [
26,
27]. There are six unique types of SNVs: C>A, C>G, C>T, T>A, T>C, and T>G. Each substitution is examined together with the bases immediately 5′ and 3′ to the mutated base, generating 96 possible mutation types (6 types of substitution × 4 types of 5′ base × 4 types of 3′ base). The array of mutation types is represented in a mutational spectrum, then decomposed into recurring patterns, referred to as mutational signatures. Sixty validated single-base substitution (SBS) mutational signatures are listed in the Catalogue Of Somatic Mutations In Cancer (COSMIC) version 3.3, in addition to 18 insertion–deletion (indel) signatures [
28]. Mutational signatures can be used to decipher how patterns of somatic mutations collectively give rise to mutational processes of disease as well as give insight into the potential etiology of the processes underlying these signatures.
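The 96-type classification described above can be illustrated with a minimal Python sketch. It is not part of the study pipeline; the label format (for example `A[C>A]A`) and the function name are illustrative assumptions.

```python
from itertools import product

BASES = "ACGT"
# The six substitution classes, written from the pyrimidine reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def mutation_channels():
    """Return one label per combination of substitution, 5' base and 3' base."""
    return [
        f"{five}[{sub}]{three}"
        for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)
    ]

channels = mutation_channels()
print(len(channels))  # 6 substitutions x 4 x 4 flanking bases
```

Each label corresponds to one row of the mutation count matrix that is later decomposed into signatures.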
Mealey et al. performed one of the most comprehensive analyses of the mutational landscape of breast cancer ≤40 years [
23]. They found that COSMIC signatures SBS1, 3, and 5 were the most common in the overall cohort and that SBS2 and SBS3 were more likely to be observed in HER2-enriched and triple-negative tumours, respectively. Compared to patients >60 years, early-onset patients were significantly more likely to have C>A mutations (17% vs. 16%) and less likely to have C>T mutations (32% vs. 38%). Finally, patients ≤40 years were more likely to have mutations in
GATA3 compared to those >40 years and >60 years (22% vs. 12.9% vs. 10.8%) [
23]. Studies like this provide insight into multiple genomic features related to tumour development in women < 40 years. To date, there have been no applications of mutational signatures to assess outcomes in EoBC and no studies have investigated indel signatures among these patients. As in Mealey et al., genomic data can be leveraged to understand how various somatic mutations collectively drive tumour progression and survival in young women. These analyses may discover novel markers to inform targeted therapies or may improve the performance of existing prediction tools to better inform individualized prognosis. In this study, we examined whole exome sequences from 100 EoBC patients in Alberta, Canada to describe their somatic mutation landscape, including mutational load, SBS, and indels. We also extracted de novo SBS and indel signatures and fitted mutational profiles to validated COSMIC SBS and indel signatures. Finally, we examined whether extracted and fitted COSMIC signatures were associated with clinicopathological tumour characteristics and survival outcomes.
2. Materials and Methods
2.1. Study Sample and Data Collection
Somatic mutation and clinical data were obtained from 100 women aged 18–39 years diagnosed with invasive non-metastatic breast cancer in Alberta, Canada, from 2001 to 2014. Mutational data were derived from tumour tissue and normal blood samples stored at the Alberta Cancer Registry Biobank. Tumour tissue was extracted at time of surgery or biopsy and stored as formalin-fixed paraffin-embedded blocks. Blood samples were also collected at time of surgery or biopsy and centrifuged for buffy coat extraction. Tumour and normal blood samples were sent to Genome Québec for DNA extraction and whole-exome sequencing. Extraction was performed with QIAsymphony DSP DNA Kits (QIAGEN, Hilden, Germany) and sequencing was performed with NovaSeq 6000 S4 PE100 (Illumina, San Diego, USA) and SureSelect Human All Exon exome probes (Agilent Technologies, Santa Clara, USA). Following sequencing, variant calling was performed using the Mutect2 workflow [
29] from the Canadian Centre for Computational Genomics (C3G) and obtained as variant call format (VCF) files. The corresponding reference genome was GRCh37/hg19. Clinical data were obtained through linkage with the Alberta Cancer Registry and included detailed information on baseline demographics, cancer diagnosis (stage and morphology), dates of referral to oncology, clinic visits at any of the cancer centers, surgical procedures, dates and types of therapy received for cancer (chemotherapy, radiation, and hormonal therapy), tumour size, grade, lymph node status, ER/PR status and HER2/neu status, and dates of last follow-up or death. Administrative end of follow-up was 25 February 2018.
2.2. Extraction of Mutational Signatures
Mutational signatures were investigated using the MutationalPatterns package in R (v4.3)/Bioconductor (v3.17) [
30]. This package includes a comprehensive set of functions for extracting mutational signatures de novo and determining the contribution of previously identified mutational signatures on a single sample level. The package works with SNVs, indels, double-base substitutions (DBS) and larger multi-base substitutions (MBS). The VCF files for each participant were passed through “read_vcfs_as_granges” and “get_mut_type” commands to obtain counts of the six SNV types (C>A, C>G, C>T, T>A, T>C, T>G) and indel types.
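As an illustration of how SNVs are tallied into the six substitution types, the following Python sketch collapses each variant onto its pyrimidine reference base before counting. It is an assumption-laden stand-in for the idea, not the MutationalPatterns implementation; `snv_type` and `count_snv_types` are hypothetical helper names.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def snv_type(ref, alt):
    """Map a single-base change to one of the six pyrimidine-based types."""
    if ref in "AG":
        # A purine reference is reported on the opposite strand,
        # so G>T is counted as C>A and A>G as T>C.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def count_snv_types(variants):
    """Count substitution types for an iterable of (ref, alt) pairs."""
    return Counter(snv_type(ref, alt) for ref, alt in variants)

example = count_snv_types([("G", "T"), ("C", "A"), ("A", "G")])
print(example)
```

Collapsing strand-equivalent changes in this way is what reduces the twelve possible base changes to six.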
De novo SBS and indel mutational signature extraction was achieved with non-negative matrix factorization (NMF) using the “extract_signatures” command. The NMF algorithm is detailed by Gaujoux and Seoighe [
31]. In brief, the algorithm factorizes a matrix X, with n rows and m columns, into two smaller non-negative matrices W and H, such that the product of W and H approximates X. W has dimensions n × r and H has dimensions r × m, where r is the factorization rank, which is the number of extracted de novo signatures. We sampled ranks from 2 to 10. The optimal factorization rank was taken as the smallest rank at which the cophenetic correlation coefficient started decreasing. For example, in the case of SBS mutations, the rows of matrix X were the 96 mutational contexts derived from combinations of 6 mutational types (i.e., C>A, C>G, C>T, T>A, T>C, and T>G) and their 5′- and 3′-adjacent bases, and the columns were the 100 EoBC samples. The optimal rank can be interpreted as the minimal set of mutational signatures that optimally explains the proportion of each mutation type and estimates the contribution of each signature to each sample [
32]. The “fit_to_signatures” command determined which COSMIC SBS and indel signatures were present in our samples. This function finds the optimal linear combination of mutational signatures that most closely reconstructs the mutation matrix by solving a non-negative least-squares problem.
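The factorization X ≈ WH can be sketched in a few lines of Python. The example below uses the Lee–Seung multiplicative updates for the squared-error objective, which is a substituted, simplified algorithm chosen for brevity and not necessarily the one used by the R packages cited above. The function names, iteration count, and toy matrix are assumptions, and rank selection by cophenetic correlation is not shown.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(X, W, H):
    """Squared Frobenius distance between X and the product WH."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))

def nmf(X, r, iters=200, seed=0, eps=1e-9):
    """Factorize X (n x m) into non-negative W (n x r) and H (r x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(r)]
    for _ in range(iters):
        # Update H with W held fixed: H <- H * (W'X) / (W'WH)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # Update W with H held fixed: W <- W * (XH') / (WHH')
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Toy matrix: 4 mutation types (rows) by 3 samples (columns).
X = [[1, 2, 0], [0, 1, 3], [1, 3, 3], [2, 4, 0]]
W, H = nmf(X, r=2)
```

In this layout the columns of W play the role of signatures and the rows of H the per-sample contributions. Holding W fixed at known signatures and repeating only the H update gives a simple non-negative fit in the same spirit as signature fitting, although a dedicated non-negative least-squares solver would normally be used for that step.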
2.3. Statistical Analysis
All demographic, clinical, pathological, and mutation data were described using means and standard deviations (SD) for continuous variables and frequency tables with proportions for categorical variables. The means of mutational load (the sum of SNV and indel mutations) and relative contribution of de novo SBS and indel signatures were compared across categories of patient characteristics using Welch’s two-sample t-test. These variables included age at diagnosis (<30, 30–34, ≥35 years), BMI category (underweight or normal [<25 kg/m²], overweight [25–29.99 kg/m²], obese [≥30 kg/m²]), patient-reported family history of breast cancer (no, yes), molecular subtype (luminal, HER2-enriched, TNBC), ER status (negative, positive), PR status (negative, positive), HER2 status (negative, positive), lymph node status (negative, positive), positive lymph node count (0, 1–3, ≥4), tumour size (≤2 cm, >2 cm), T stage (T1, T2, T3, T4), tumour grade (low, high), and presence of lymphovascular invasion (negative, positive).
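For readers unfamiliar with Welch’s test, the statistic and its Welch–Satterthwaite degrees of freedom can be computed as in the Python sketch below. The function name and the example values are illustrative assumptions, and the p-value lookup from the t distribution is omitted.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    # Squared standard error contributed by each group (sample variance / n).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation; does not assume equal variances.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Unlike Student’s t-test, this form does not pool the two variances, which suits comparisons between groups of unequal size and spread.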
As there are 60 and 18 validated COSMIC SBS and indel signatures, respectively, we employed hierarchical clustering algorithms to determine specific combinations of mutational signature contributions. This clustering analysis was only performed on COSMIC signatures present in >25% of samples. Absolute contribution values for each signature were standardized prior to clustering. Euclidean distance was then calculated to form a distance matrix and passed through a hierarchical clustering algorithm based on Ward’s minimum variance method. The average silhouette method determined the optimal number of clusters. The unadjusted associations between cluster membership and demographic and clinical variables were assessed with Fisher’s exact test, and multivariable logistic regression assessed mutually adjusted associations.
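The standardize-then-cluster step can be sketched as follows. This Python example merges clusters greedily by the increase in within-cluster sum of squares, which is the criterion behind Ward’s method on Euclidean distances. It is a simplified illustration with hypothetical function names and toy data, not the R routine used in the analysis, and silhouette-based selection of the number of clusters is not shown.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column (signature) across samples."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]

def ward_clusters(rows, k):
    """Agglomerate samples until k clusters remain, using Ward's merge cost."""
    clusters = [([i], list(r)) for i, r in enumerate(rows)]  # (members, centroid)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        merged = (ma + mb, [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return sorted(sorted(members) for members, _ in clusters)

# Toy contributions: 6 samples (rows) by 2 signatures (columns).
toy = [[0, 0], [0.1, 0], [5, 5], [5.1, 5], [10, 0], [10, 0.2]]
groups = ward_clusters(standardize(toy), k=3)
```

Standardizing first keeps signatures with large absolute contributions from dominating the distance calculation.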
Recurrence-free survival (RFS) and overall survival (OS) were the primary outcomes to evaluate the prognostic relevance of de novo signatures and COSMIC signature clusters. RFS was defined as time from primary surgery to local–regional or distant relapse, contralateral breast cancer, the appearance of a second (non-breast) primary tumour, or death from breast cancer. OS was defined as time from primary surgery to death from any cause. De novo signatures were converted into binary variables based on absolute contribution below the median (low expression), or equal to or greater than the median (high expression). The Kaplan–Meier method was used to estimate curves for RFS and OS, as well as median time-to-event and 95% confidence intervals (95% CI). Association measures were estimated with multivariable Cox proportional hazard models in the form of hazard ratios (HR) with 95% CI. Statistical significance was defined as a p-value < 0.05. All analyses were performed in RStudio (v2023.06.0+421).
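The median dichotomization and the Kaplan–Meier product-limit estimate can be illustrated with the short Python sketch below. Function names and the toy follow-up data are assumptions made for illustration; confidence intervals and the Cox models are not covered.

```python
from statistics import median

def dichotomize(values):
    """Label contributions below the median 'low' and all others 'high'."""
    cut = median(values)
    return ["high" if v >= cut else "low" for v in values]

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        observed = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - observed / at_risk
        curve.append((t, surv))
    return curve

labels = dichotomize([0.1, 0.4, 0.2, 0.9])
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In the toy data the subject censored at time 2 leaves the risk set without lowering the survival estimate, which is how censoring is handled in the product-limit method.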
4. Discussion
In this study, we characterized the somatic mutation landscape of 100 EoBC tumours from Alberta, Canada and assessed their relationship with clinicopathological tumour features and survival outcomes. Our findings indicated higher numbers of SNVs and indels among patients without vascular invasion, in addition to a higher number of indels in lymph node-negative and TNBC tumours. We extracted five de novo SBS signatures, four of which resembled validated COSMIC SBS signatures, and two de novo indel signatures resembling ID6 and ID12. The mean relative contribution of these de novo signatures mainly differed between BMI categories and molecular subtypes. RFS tended to be better among individuals with high SBS13-like signature expression relative to low, and worse in those with high SBS29-like signature expression relative to low. Hierarchical clustering of validated COSMIC SBS signatures revealed three distinct clusters. However, evidence was insufficient to conclude whether cluster membership was associated with clinical variables and with survival outcomes.
This is the first study to examine the prognostic relevance of somatic mutational signatures and describe differences in signature distribution across clinicopathological tumour characteristics among patients with EoBC. We expanded upon previous work from Mealey et al., who investigated differences in mutational profiles between breast cancer patients < 40 years and ≥40 years with The Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma project (TCGA-BRCA) data [
23]. They also extracted five de novo SBS signatures in their <40 years subgroup, three of which had similar SNV mutation profiles to the signatures we extracted. Specifically, SBSA, SBS6-like, and SBS13-like signatures in our study resembled signatures S2, S3, and S1 in their study, respectively. SBSA had high relative contributions of T>G in the ATG, TTG, and GTT contexts. Visually, this was most similar to COSMIC SBS55, a non-validated signature previously observed in Alexandrov et al. that may arise from a sequencing artifact. The SBS6-like signature was characterized by low peaks of C>T and T>C mutations. The peaks of C>T mutations in the ACG, CCG, and GCG contexts were similar to COSMIC SBS1 and SBS6, and the contribution of T>C mutations likely reflects a combination of signatures present at low levels. The SBS29-like and SBS42-like signatures were unique to this study and are generally not found in breast cancers [
28]. COSMIC SBS29 is linked to chewing tobacco use and SBS42 is linked to haloalkane exposure. The role of smokeless tobacco in breast cancer is not well established. A hospital-based case-control study in Assam, India, found the odds of being diagnosed with breast cancer were 2.35 times higher in betel quid chewers vs. non-chewers [
33]. Interestingly, SBS29 was also found among early-onset testicular cancer tumours, although there is no established link between chewing tobacco and testicular cancer. It is possible that SBS29 represents a process involved in early-onset cancers, but further research in other sites is needed to confirm this speculation.
The SBS13-like signature resembled a combination of COSMIC SBS2 and SBS13, which often occur together in the same sample. These signatures are attributed to the activity of the AID/APOBEC family of cytidine deaminases, which substantially contribute to the mutation burden in many human cancers, especially in bladder and breast cancers [
32]. We observed higher relative contributions of the SBS13-like signature in the HER2-enriched subtype and HER2-positive tumours, similar to Mealey et al. [
23]. Further, our findings show RFS and OS tended to be better in patients with high SBS13-like expression, even after adjustment for the subtype. Among breast cancer subtypes, HER2+ breast tumours are reported to have the highest median levels of APOBEC signature enrichment [
34]. APOBEC-related mutagenesis is thought to play an important role in tumour immunogenicity, namely in neoantigen presentation and recruitment of T-cells to the tumour microenvironment, implying its potential for cancer immunotherapy [
35]. However, this likely depends on the molecular subtype. In a TCGA cohort, DiMarco et al. observed high correlation between APOBEC enrichment and immune signatures reflective of an antitumor adaptive immune response in the TNBC subtype, including Th1 cells, CD8+ T cells, cytotoxic cells, the interferon signaling pathway, and the major histocompatibility complex class II antigen presentation pathway [
36]. Conversely, the APOBEC enrichment score was not correlated with immune cell signatures in HER2-enriched breast cancers. Instead, APOBEC enrichment was associated with a higher frequency of subclonal mutations and may suggest the evolution of immune-suppressive mechanisms that limit antitumor adaptive immune responses [
36]. These findings suggest a subgroup of TNBC patients who may benefit from immunotherapy and equally a subgroup of HER2+ patients who may not benefit from immunotherapy beyond anti-HER2 therapy. Unfortunately, our prognostic findings for the SBS13-like signature could not be stratified by subtype due to limited sample size, and we could not ascertain whether these effects were mediated by treatment received. Nonetheless, there may be a role for APOBEC-related mutational signatures, like SBS2 and SBS13, as a biomarker for immunotherapy response in breast cancer, regardless of age. APOBEC signatures are associated with a greater likelihood of response to immune checkpoint inhibition in non-small cell lung cancer, head and neck cancer, and bladder cancer [
32,
36,
37,
38].
We also extracted de novo indel signatures that resembled COSMIC ID6 and ID12. Currently, the proposed etiology of the ID12 signature is unknown. The ID6 signature arises from defective homologous recombination-based DNA damage repair, often due to inactivating
BRCA1 or
BRCA2 mutations, leading to non-homologous DNA end-joining activity [
32]. Given that these mutations are associated with younger age and TNBC, it was not unexpected that this signature was extracted in our EoBC cohort, and that relative contribution was highest in the TNBC subtype. Further, we found that the number of indel mutations was higher in TNBCs. Although the ID6-like signature did not bear prognostic significance in our study, there is an important role for homologous recombination deficiency (HRD) in TNBC. Poly(ADP-ribose) polymerase (PARP) inhibitors have been successfully implemented in the treatment of metastatic breast cancer with germline mutations in
BRCA1/2 [
39,
40]. The recent OlympiA trial also established the efficacy of PARP inhibitors for
BRCA1/2 mutation carriers in the early-stage setting, where the median age of the trial population was 43 years, and 82% of participants had TNBC [
41]. The application of these treatments is being explored in patients who display a “BRCAness” phenotype. BRCAness refers to malignancies that have not arisen from germline
BRCA1 or
BRCA2 mutations but share the phenotypic and molecular features of HRD [
42]. These malignancies share the same therapeutic vulnerabilities with
BRCA-associated tumours, including sensitivity to platinum chemotherapy [
43,
44,
45]. However, there is no standardized biomarker of “BRCAness” currently available. Further characterization of this phenotype may aid in predicting response to PARP inhibitors in expanded patient populations.
Our analysis of fitting mutational profiles to COSMIC SBS signatures revealed results not in line with previous literature. This is the first study to examine COSMIC v3.2 signatures in the EoBC setting; therefore, these analyses were exploratory in nature. We found high prevalence of newly added signatures, including SBS37, SBS39, SBS42, and SBS87. The most common COSMIC signatures previously observed in breast tumours are SBS1, SBS2, SBS3, SBS5, SBS13, and SBS18. Mealey et al. found that SBS1, SBS3, and SBS5 were the most prevalent COSMIC signatures and had the highest mean contributions in patients < 40 years. Conversely, we observed each of these signatures in five or fewer patients. We observed SBS13 in 15% of samples and SBS18 in 60% of samples. Given that our extracted de novo SBS signatures matched similar profiles to those from Mealey et al. and Nik-Zainal et al. [
23,
46], these discrepancies may be explained by suboptimal fitting of known COSMIC signatures rather than biological differences between study samples. The MutationalPatterns package uses COSMIC v3.2 whereas Mealey et al. was based on COSMIC v2.0 [
23]. It is possible that doubling the number of signatures led to overfitting and misattribution in our sample. That is, if samples contained various combinations of mutational signatures the fitting algorithm may erroneously attribute mutations to one signature. This may explain why we did not observe any associations between the COSMIC SBS cluster group and clinical variables. Therefore, we cannot confidently conclude that the high prevalence of recently added COSMIC SBS signatures is biologically or clinically relevant in EoBC.
This study has several strengths. To our knowledge, it is the first to investigate the prognostic relevance of SBS and indel signatures in EoBC. We examined multiple characterizations of somatic mutations including mutation load, SNVs, indels, and mutational signatures. Further, we provide information on their associations with important molecular and physical tumour characteristics, as well as with RFS and OS. We also extracted an APOBEC-like SBS signature in EoBC, consistent with previous findings, and elucidated extracted indel signatures. There are several limitations to note. First, our study included a small sample size, limiting the statistical power and generalizability of our results. Second, this study used WES data, so we cannot draw conclusions related to mutations in the genome outside the exome. We also did not investigate germline mutations or signaling pathways, and so did not produce new evidence linking mutational signatures to germline mutations or cellular signaling. Third, the exploratory nature of the study meant the use of data-driven techniques. For example, we converted extracted SBS and indel signatures to binary variables based on a median cut-off for the survival analyses. We also used an unsupervised clustering algorithm for COSMIC SBS signatures. Although these methods have been used in previous research, we cannot confirm their clinical or biological relevance. Fourth, due to the limited sample size, we lacked sufficient power to examine the prognostic relevance of signatures within subgroups, and we did not have data on patient race and ethnicity. Mutational profiles can vary between racial and ethnic groups and may explain disparities in therapeutic response and cancer outcomes.