Somatic genetic aberrations in benign breast disease and the risk of subsequent breast cancer

www.nature.com/npjbcancer

ARTICLE OPEN

Zexian Zeng (1,2), Andy Vo (3), Xiaoyu Li (4), Ali Shidfar (5), Paulette Saldana (5), Luis Blanco (6), Xiaoling Xuei (7), Yuan Luo (1,✉), Seema A. Khan (5,✉) and Susan E. Clare (5,✉)

It is largely unknown how the development of breast cancer (BC) is transduced by somatic genetic alterations in the benign breast. Since benign breast disease is an established risk factor for BC, we established a case-control study of women with a history of benign breast biopsy (BBB). Cases developed BC at least one year after BBB and controls did not develop BC over an average of 17 years following BBB. 135 cases were matched to 69 controls by age and type of benign change: non-proliferative or proliferative disease without atypia (PDWA). Whole-exome sequencing (WES) was performed for the BBB. Germline DNA (available from n = 26 participants) was utilized to develop a mutation-calling pipeline, to allow differentiation of somatic from germline variants. Among the 204 subjects, two known mutational signatures were identified, along with a currently uncatalogued signature that was significantly associated with triple negative BC (TNBC) (p = 0.007). The uncatalogued mutational signature was validated in 109 TNBCs from TCGA (p = 0.001). Compared to non-proliferative samples, PDWA harbors more abundant mutations at PIK3CA pH1047R (p < 0.001). Among the 26 BBB whose somatic copy number variation could be assessed, deletion of MLH3 is significantly associated with the mismatch repair mutational signature (p < 0.001). Matched BBB-cancer pairs were available for ten cases; several mutations were shared between BBB and cancers. This initial study of WES of BBB shows its potential for the identification of genetic alterations that portend breast oncogenesis.
In future larger studies, robust personalized breast cancer risk indicators leading to novel interception paradigms can be assessed.

npj Breast Cancer (2020) 6:24; https://doi.org/10.1038/s41523-020-0165-z

1 Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA.
2 Department of Data Sciences, Dana-Farber Cancer Institute, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
3 Committee on Developmental Biology and Regenerative Medicine, The University of Chicago, Chicago, IL, USA.
4 Department of Medicine, Brigham and Women’s Hospital, Boston, MA, USA.
5 Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL, USA.
6 Department of Pathology, Northwestern University Feinberg School of Medicine, Chicago, IL, USA.
7 Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA.
✉ email: yuan.luo@northwestern.edu; s-khan2@northwestern.edu; susan.clare@northwestern.edu

Published in partnership with the Breast Cancer Research Foundation

INTRODUCTION

From 1989 to 2016 the mortality rate for breast cancer (BC) in the United States decreased by 40%, a testament to the efficacy of targeted therapies, as well as to combinations and schedules of chemotherapeutics. During this same period breast cancer incidence rates remained static, evidence of both the paucity of novel, effective prevention strategies that target specific molecular risk pathways, and our inability to implement existing strategies. Major barriers are two-fold: hesitation among healthy women to accept drugs for a disease that they may or may not experience in the future; and their reluctance to experience side effects that impair quality of life and may compromise health. The first of these would be mitigated by improved identification of women at high risk of developing breast cancer, but almost 30 years after the initial publication of the Gail Model, breast cancer risk stratification remains imprecise and insensitive to breast cancer subtype. In an analysis of data from the Women’s Health Initiative, the Gail Model displayed modest ability to predict the risk of breast cancer (AUC = 0.58, 95% CI = 0.56–0.60). Among women at high risk of breast cancer, for example, those diagnosed with atypical hyperplasia, neither the Gail Model/Breast Cancer Risk Assessment Tool nor the Tyrer-Cuzick Model performed well [5,6]. This is a significant barrier to implementation of established medical interventions for disease prevention, and to the development of new, targeted intervention strategies for women at risk. Impactful, targeted prevention strategies require knowledge of how breast cancer risk is transduced at the molecular level in the breast itself, i.e., identification of somatic genetic changes that predate breast cancer and influence the biologic profile of cancers that emerge.

Benign breast disease is an established risk factor for BC [7,8], with 30% of BC cases reporting a history of benign breast disease. Of the 1.7 million breast biopsies each year in the U.S., about 75% return a diagnosis of benign breast disease, including atypical hyperplasia. This provides a window into the somatic genetic environment of the breast, prompting us to evaluate the genetic landscape of benign breast biopsy (BBB), and identify patterns associated with subsequent malignancy. Starting in the embryo [11], tissues accumulate DNA mutations over time [12]. Most of the mutations are repaired, many are inconsequential, but a few may lead to cancer [13,14]. Before there is any histologic evidence of invasive cancer, histologically normal and benign tissues contain molecular aberrations that are associated with malignancy [15,16]. For example, sun-exposed, normal eyelid skin has been shown to have a mutation burden of 2–6 mutations/MB/cell, a rate similar to that observed in many cancers. The processes that cause these mutations leave an imprint on the genome. In the sun-exposed eyelid epidermis, mutations occur within a pattern that mimics the Wellcome Trust Sanger Institute (WTSI) Mutation Signature 7, which is associated with ultraviolet exposure and its consequent CC > TT dinucleotide mutations at dipyrimidines. Exogenous or endogenous mutational processes, such as that which produced WTSI signature 7, are chemical reactions within DNA. While mutational processes are responsible for the creation of mutations, the mutations that are observed ultimately within a malignancy reflect a process of selection. However, genetic aberrations are not limited to somatic mutations, and we note that recurrent copy number variations (CNVs) are in fact more characteristic of invasive breast cancers than recurrent mutations.

To evaluate the molecular alterations that enable cancer development in the breast, we established a case-control study of BBB samples, the Benign Breast & Cancer Risk (BBCAR) Study. We performed whole-exome sequencing (WES) on the benign biopsies of patients who subsequently developed breast cancer (cases), and matched controls who have not developed breast cancer to date. The cases and controls had similar degrees of benign change: non-proliferative or proliferation without atypia. The focus on non-atypical lesions was a deliberate choice, as non-atypical lesions predict a generalized risk of subsequent breast cancer, occurring equally frequently in both breasts. They are also far more common than atypical changes, comprising over 90% of all breast biopsies, so that elucidation of their molecular profiles will impact the majority of women who undergo BBB. To the best of our knowledge, WES has not been performed in any previous case-control study of benign breast lesions without atypia. In addition to profiling the overall BBB mutational landscape, we have determined that mutations differ as a function of the type of benign breast disease, specifically that the PIK3CA pH1047R hotspot mutation is more frequent in proliferative disease without atypia (PDWA) compared to non-proliferative disease (p < 0.001); our data reveal a presently uncatalogued mutational signature associated with TNBC (p = 0.007), which was validated in 109 TCGA TNBC samples (p = 0.001); and we observed multiple recurrent CNVs, including an MLH3 deletion, which is significantly associated with a mismatch repair signature (p < 0.001).

RESULTS

Study design

A total of 204 subjects were enrolled in this BBB case-control study. Cases (n = 135) are women who have undergone a breast biopsy with specimen histology showing non-proliferative disease or proliferative disease without atypia (Supplementary Fig. 1) that predates the diagnosis of breast cancer by at least one year (Fig. 1a). The median interval from benign biopsy to the diagnosis of cancer is 7.3 (SD = 4.4) years. Controls (n = 69) are women who have not developed breast cancer and are matched for age of diagnosis (±2 years) and histology (Fig. 1b; Table 1). Controls were verified to not have been diagnosed with breast cancer as of 08/14/2018 (Supplementary Data 1).

[Figure 1, panels a–c. Labels recoverable from the image: with matched germline, case (n = 20) and control (n = 6); without matched germline, case (n = 115) and control (n = 63); tumor (n = 10); tools named in the workflow: BWA, MuTect2, VarScan2, VarDict, GISTIC2, Annovar, SnpEff, VEP.]

Fig. 1 Case-control study of benign breast biopsy (BBB) samples. a Design of the BBCAR study. The sample tissues of subjects who subsequently developed breast cancer (cases), and their matched controls, who have not developed breast cancer to date, were studied. b A total of 135 cases, matched to 69 controls, were selected for whole-exome sequencing (WES). Case and control samples are matched by age and histology. c An illustration of the workflow to identify somatic mutations in the tissue samples that lack matched germline DNA. To train a model to distinguish germline variants and somatic mutations, previously consented donors were re-contacted (with IRB approval) and saliva specimens were requested for germline DNA sequencing. Orthogonal SNP array genotyping was further performed with 20 samples to compare and validate the performance of somatic mutation identification.

Table 1. Distributions of demographic data and tumor characteristics between the Case group and the Control group. Student’s t-tests were performed for continuous variables and Pearson’s Chi-squared tests were performed for categorical variables.

                                      Case (135)     Control (69)   P-value
Age, mean (SD)                        49.7 (9.9)     49.8 (9.6)     0.96
Menopausal status, N (%)                                            0.87
  Pre                         114     76 (56.3%)     38 (55.1%)
  Post                        90      59 (43.7%)     31 (44.9%)
Histology class, N (%)                                              0.78
  Class 1 (a)                 115     79 (58.5%)     40 (58.0%)
  Class 2 (a)                 75      51 (37.8%)     25 (36.2%)
  NA                          10      5 (3.7%)       4 (5.8%)
ER status, N (%)
  Positive                            109 (80.8%)
  Negative                            23 (17.0%)
  Low                                 3 (2.2%)
Follow-up years (SD)                  7.3 (4.4)      16.6 (5.4)     <0.01
Has matched germline (%)              20 (14.8%)     6 (8.7%)
Has matched cancer (%)                10 (7.4%)

(a) Class 1/non-proliferative: “Non-proliferation” and “Benign, NOS”. Class 2/proliferative: “Proliferative lesion without atypia” (includes non-atypical hyperplasia, radial scar, sclerosing adenosis).

Fig. 2 Accurate somatic mutation identification in benign biopsies. a Comparison of the mutations between WES and genotype array. Somatic mutations were called using Mutect2, VarScan2, and VarDict. With different allele frequency discrepancy allowance, the overlap rate between two platforms was plotted. b Performance of different machine learning models in the test set. Penalized logistic regression (LR); linear SVM; random forest classifier (RFC); gradient boosted tree (GBT); k-nearest neighbor algorithm (K-NN); SVM with rbf kernel; multiple layer perceptron (MLP). c Orthogonal validation of the proposed model using 100 TCGA breast cancer samples and benchmark study with previously validated pipelines, including ISOWN NBC and ISOWN LAD tree. d Pipeline validation using genotype arrays of three samples. Somatic mutations were called using our pipeline and validated by genotyping. The plot shows the overlap rate between the two platforms with different allele frequency discrepancy allowance. e VAF distribution of the germline variants and somatic mutations, grouped by 26 benign biopsies with matched normal DNA, 178 benign biopsies lacking matched normal DNA, and 100 randomly selected TCGA breast cancer samples.
f Distribution of silent and non-silent mutations, grouped by germline variants called in normal DNA that matched to the BBB, somatic mutations in BBB, and somatic mutations in TCGA breast cancer samples.

Somatic mutation identification

All 204 specimens were dissected using laser capture microdissection (LCM) and were subjected to WES. Within this cohort, 26 matched germline DNA samples were obtained for WES as well. To evaluate mutation caller performance in this benign tissue setting, 17 of the 26 sample pairs were subjected to genotyping (Fig. 1c; “Methods”, “Supplementary Materials and Methods”). Allele frequencies of the mutations common to the genotyping array and WES were compared. Mutations were categorized as false positive if allele frequency was discrepant between the two platforms. The mutation identification accuracy then varies as a function of the discrepancy allowance (Fig. 2a). Overall, we observed high consistency between the two platforms (85.4% when discrepancy allowance = 25%). Notably, Mutect2 consistently achieved better performance in this setting (Fig. 2a). Therefore, Mutect2 was selected as the mutation caller for subsequent studies.

For the samples lacking matched normal DNA (n = 178), a machine learning model was developed to distinguish germline variants and somatic mutations (Fig. 1c; “Methods”; “Supplementary Materials and Methods”). With somatic mutations called for the 26 samples for which germline DNA was available, we systematically evaluated multiple machine learning approaches to distinguish somatic mutations and germline variants in benign biopsies. A total of 31 features were utilized for the model evaluation (Supplementary Table 1), including protein structure, pathogenicity prediction, population frequency, or evolutionary factors [24]. Various functional annotation or toxicity scores were derived from ANNOVAR, COSMIC (https://cancer.sanger.ac.uk/cosmic), and dbSNP/common (https://www.ncbi.nlm.nih.gov), along with intrinsic sequencing features, such as mutation allele frequency, depth of reference reads, and mutation frequency among the cohort. Grid search was applied to tune each model’s parameters using five-fold cross-validation on the training set. Performance was then evaluated on the held-out test set (“Methods”). Of the evaluated models, including penalized logistic regression (LR), linear SVM, random forest classifier (RFC), gradient boosted tree (GBT), k-nearest neighbor algorithm (K-NN), SVM with rbf kernel, and multi-layer perceptron (MLP), the MLP model achieved the best performance (Fig. 2b), with an F1-score of 0.96 (Supplementary Table 2).

Orthogonal validations of the proposed model were performed by evaluation studies with the TCGA data and benchmark studies with previously validated pipelines. Protected datasets in bam format of 100 randomly selected breast primary tumors were downloaded directly from the TCGA data portal. Realigned raw reads were subjected to base recalibration and were passed to Mutect2 for mutation detection. Mutect2 was run in so-called “tumor only mode” to call somatic and germline mutations. ISOWN, a previously validated pipeline for somatic mutation identification, was applied for somatic mutation prediction as well. The predicted results were evaluated by comparison to the TCGA somatic mutation data produced by the Multi-Center Mutation-Calling in Multiple Cancers (MC3, public v0.2.8) network. Using the TCGA MC3 data as ground truth, our model achieved an F1-score of 0.89 (Fig. 2c) in predicting somatic mutations. Even though designed and trained in the benign-biopsy setting, our model (F1 = 0.89) obtained similar or better results than previously validated pipelines, such as ISOWN NBC [26] (F1 = 0.88) and ISOWN LAD tree (F1 = 0.80), in predicting somatic mutations in TCGA cancer samples (Fig. 2c).
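The F1 comparison against a ground-truth call set can be illustrated with a small worked example. The following is a minimal sketch, not the study's pipeline: it assumes each variant is identified by a hypothetical (chromosome, position, ref, alt) tuple, that predicted and ground-truth calls are compared by exact match, and it uses made-up toy coordinates.

```python
from typing import Iterable, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def precision_recall_f1(predicted: Iterable[Variant],
                        truth: Iterable[Variant]) -> Tuple[float, float, float]:
    """Score predicted somatic calls against a ground-truth set by exact match."""
    pred, true = set(predicted), set(truth)
    true_positives = len(pred & true)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (illustrative only): two of three predictions appear in the truth set.
truth_calls = {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")}
predicted_calls = {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr4", 400, "T", "C")}

precision, recall, f1 = precision_recall_f1(predicted_calls, truth_calls)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Exact-match scoring is the simplest choice; a real comparison would also need to normalize variant representation (for example, left-aligning indels) before taking the set intersection.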
We further applied our pipeline and model to identify somatic mutations in the 178 BBBs lacking matched normal DNA (Fig. 1c) (Methods). Overall, the average read depth for the identified somatic mutations is 99, whereas the average VAF is 0.232. To estimate the overall mutation identification accuracy, we randomly sampled and genotyped three samples from our cohort. We observed high consistency between our pipeline and the genotype array (82.5% when discrepancy allowance = 25%) (Fig. 2d). As a sanity check, the distribution of variant allele frequency (VAF) and non-silent mutations were examined. Consistent with previously reported studies [28,29], the majority of our identified germline variants’ VAFs are around 50% and 100%, whereas somatic mutations display much lower VAFs (Fig. 2e). For cancers, non-silent mutations usually account for 2/3 of somatic mutations with the remaining 1/3 being silent mutations, whereas germline mutations are expected to have a higher number of silent mutations. In our data, we have observed a similar distribution (TCGA 100 breast cancer: 72% non-silent mutations) (Fig. 2f). In addition, we observed an increasing spectrum of non-silent mutations in BBB matched normal DNA (32%), BBB (63%), and TCGA cancer samples (72%) (Fig. 2f). To note, the average number of non-synonymous mutations for the 26 BBBs with matched normal DNA is 114, whereas the average number for the 178 BBBs without matched normal DNA is 127.

Mutation catalogues

Among the 204 samples, 36,801 somatic base substitutions and 2283 small INDELs were identified. The majority of the mutations were missense mutations (Fig. 3a). Cases had a mean of 6.2 mutations/MB (SD = 3.6) and controls had 6.8 mutations/MB (SD = 3.0). No significant difference was observed in the numbers of mutations between the cases and controls (Fig. 3a). Among the top 20 mutated genes, the case group and control group shared common genes (MUC17, OBSCN, FLG2, GLTPD2, ABCA13, PIK3CA) (Fig. 3b). Approximately one-fifth of both cases and controls display PIK3CA mutations, with the highest frequency at pH1047R. When corrected by gene length, case and control still shared common genes (MUC17, SLC7A4, FLG2, GLTPD2, PGBD1, PLA2G3, ADAM30) (Fig. 3c). Mucins are O-glycosylated by the addition of N-acetylgalactosamine to the hydroxyl group of serine or threonine. Therefore, we evaluated the number of missense mutations within MUC17 that resulted in the gain or loss of either serine or threonine residues. Of the MUC17 mutations we observed, 8.7% of missense mutations would be predicted to result in the loss of serine, 16.8% in the loss of threonine, 14.2% in the gain of serine and 17.8% in the gain of threonine. However, there was no significant difference between cases and controls (Supplementary Table 3). The proportions of nonsense mutations also vary between samples. The majority of nonsense mutations were frame shift insertions and stop gains, with some exceptions in a few samples (Fig. 3d).

[Figure 3, panels a–d. Mutation classes in the legend: Missense Mutation, Stop Gain, Frame Shift Insertion, Frame Shift Deletion, In Frame Insertion, In Frame Deletion, Splice Site Mutation. Axis: Number of mutations. Panels are split into Case and Control.]

Fig. 3 Catalog of somatic mutations in 204 benign breast biopsies. a Catalog of base substitutions, insertions/deletions in the 204 BBBs. Each bar represents one individual’s total number of mutations. Left panel is the case and right panel is the control. b The top 20 mutated genes in the case group (left) and control group (right). c The top mutated genes as b, adjusted by gene length. d Catalog of nonsense mutations in the 204 BBBs.

Genes enriched for mutations in the cases or PDWA

To determine the enrichment of mutations in the case group, a logistic regression model was fit for each gene, with case/control as output variable and mutation status as input variable. The p-values were derived from the fitted models for gene sorting (Fig. 4a). Nonsynonymous mutations in four cancer-associated genes, CTNNA2 (11.1% vs. 5.8%; log10 p-value = −0.6), FLG (8.9% vs. 4.3%; log10 p-value = −0.6), GNAS (4.4% vs. 1.4%; log10 p-value = −0.5), and BCORL1 (17.0% vs. 11.6%; log10 p-value = −0.5), were more abundant in the case group. Of note, the same analyses including synonymous mutations are presented in Supplementary Fig. 2.

Rohan and colleagues utilized targeted sequence capture to identify mutations present in a panel of 83 genes in the benign breast disease tissue from a case-control study. While they identified somatic mutations in a number of genes frequently mutated in breast cancer, no significant differences were identified comparing cases and controls with regard to the mutational burden, genes mutated, type of mutation or pathway. We queried our data for the mutations present in these same 83 genes. Our data for all variants were very similar to theirs (Fig. 4b),
which orthogonally validated our data quality. Nonetheless, differences were observed after filtering for variants with a VAF > 25%; in particular, while no variants in NCOA3 had a VAF greater than 25% in the controls, over 10% of cases passed this threshold (Fig. 4b).

We also evaluated mutation enrichments in benign biopsies showing proliferative disease without atypia (PDWA) (n = 76) versus non-proliferative (NP) disease (n = 119). Using non-synonymous mutations only, the top enriched significant genes are PIK3CA, HYDIN, DNMT3B, and AKT1 (detail of hotspots in Supplementary data 2). For PIK3CA, mutations are abundantly enriched in PDWA compared to NP (31% vs. 12%; p = 0.00001) (Fig. 4c). Specifically, pH1047R is the most enriched hotspot for the PDWA (28% vs. 5%; p < 0.00001) (Fig. 4d).

[Figure 4, panels a–d. Values listed in panel a:]

Gene        Case (%)   Control (%)   Log10 p-value
DMD         11.9       2.9           -1.3
TSKS        12.6       4.3           -1.1
ZNF559      8.9        2.9           -0.9
DENND2C     6.7        1.4           -0.9
TENM4       6.7        1.4           -0.9
PAQR5       6.7        1.4           -0.9
BEST3       6.7        1.4           -0.9
AHNAK       28.1       18.8          -0.8
RGSL1       10.4       4.3           -0.8
NLGN4X      8.1        2.9           -0.8
MYO15B      5.9        1.4           -0.8
APOB        7.4        2.9           -0.7
HTT         5.2        1.4           -0.7
ZNF423      5.2        1.4           -0.7
HECTD4      5.2        1.4           -0.7
PLXNA1      5.2        1.4           -0.7
PTPRS       5.2        1.4           -0.7
TTC14       5.2        1.4           -0.7
CTNNA2      11.1       5.8           -0.6
FLG         8.9        4.3           -0.6
MAGI3       12.6       7.2           -0.6
RPGRIP1     14.1       8.7           -0.6
SHCBP1L     6.7        2.9           -0.6
PHRF1       10.4       5.8           -0.5
PLXDC1      10.4       5.8           -0.5
KIF6        4.4        1.4           -0.5
MAMDC4      4.4        1.4           -0.5
DDX42       4.4        1.4           -0.5
DUS1L       4.4        1.4           -0.5
GNAS        4.4        1.4           -0.5
THNSL1      4.4        1.4           -0.5
ZNF595      4.4        1.4           -0.5
ZNF782      4.4        1.4           -0.5
FAM209B     4.4        1.4           -0.5
LRP1        4.4        1.4           -0.5
TOPORS      4.4        1.4           -0.5
TRIO        4.4        1.4           -0.5
ZNF627      4.4        1.4           -0.5
BCORL1      17.0       11.6          -0.5
TMC6        11.9       7.2           -0.5
RYR2        8.1        4.3           -0.5
DIDO1       8.1        4.3           -0.5
MUC5B       20.0       14.5          -0.5
SLC3A1      5.9        2.9           -0.5
ZNF845      5.9        2.9           -0.5
MUS81       5.9        2.9           -0.5

Fig. 4 Genetic aberrations that distinguish case/control or proliferative/non-proliferative BBB. a For each gene, the percentage of mutated lesions in the case and control are shown. In the left panel, known oncogenes are highlighted as green, and known tumor suppressor genes are highlighted as orange. Onc is a known oncogene; TS is a tumor suppressor gene (Cancer Gene Census; https://cancer.sanger.ac.uk/census). The right panel shows the nonsynonymous rate in each group. b Mutations in 83 selected genes presented by Rohan et al. to validate our data and also to facilitate comparisons. Left: all non-silent mutations; right: non-silent mutations with VAF > 25%. c PIK3CA non-synonymous mutations identified in NP BBB and PDWA BBB. d Same as c, with only mutations at PIK3CA pH1047R retained.

Mutational processes and CNV

Mutations are non-random and occur within sequence motifs. These motifs provide evidence from which we can infer the process that created the mutations. Recent studies led by investigators at the Wellcome Trust Sanger Institute (WTSI) present the somatic mutation data as a 96-element vector, which captures the immediate 5′ and 3′ neighbors of the mutated nucleotides. Employing non-negative matrix factorization (NMF), 30 “mutational signatures” were produced by these studies [19,32], a catalog which more recently has been updated and expanded to 40. We hypothesized that, like the eyelid epidermis, benign breast lesions also harbor somatic mutations with associated mutational signatures that may provide clues to etiologic processes. Within the BBB cohort, mutational signatures were examined. Three mutational signatures were identified in both the case and control groups (Supplementary data 3). In both groups, we identified the “aging” signature (cataloged by WTSI as Signature 1b; Fig. 5a; cosine similarity score: 83.2% for the case and 83.0% for the control), which is the putative result of the hydrolysis of 5-methylcytosine. We also identified the “mismatch repair” signature (cataloged by WTSI as Signature 6; Fig. 5a; cosine similarity score: 80.5% for the case and 80.1% for the control). Moreover, a signature not currently in the WTSI catalog of Mutational Signatures was identified in each group; both demonstrate enrichment of T > G mutations, with 5′TTC3′ > 5′TGC3′ the most frequently mutated trinucleotide motif (Fig. 5a). Provisionally, we have named this signature “O/TN” based on the presumed mechanism, oxidation, and on its presumptive association with triple negative (TN) breast cancer.

The process of deriving mutational signatures is an unsupervised learning process. Pooling the cases and controls, we derived three signatures in the BBB cohort, namely aging, mismatch repair, and O/TN. In an association study, we found that O/TN was significantly associated with BBB that predate TNBC (p = 0.007) (Fig. 5c). We also performed a second association analysis, controlling for the potential covariates of age, menopausal status, and histology class (NP or PDWA). The association remained significant (p = 0.016), suggesting that the O/TN signature in BBB is predictive of TNBC. To validate the O/TN signature and examine whether it is a predictor of TNBC as well, we further retrieved 109 TNBC samples from the TCGA data portal. The downloaded somatic mutation data were processed, and three mutational signatures were derived under the same protocol as BBB (Methods). As a result, we were able to identify the O/TN signature in the TCGA TNBC cohort (Cosine = 0.72, p = 0.001).

A majority of breast tumors, especially those that are HER2 positive, have been reported to be enriched with mutations hypothesized to result from the action of the APOBEC enzymes. In our cohort, no tumors were found to be enriched with mutations within the APOBEC motif, nor did we observe either WTSI Signature 2 or 13, both of which are hypothesized to be the result of the activity of these enzymes. We have also examined the subset of 11 BBB that eventually developed HER2 positive cancer and the subset of 29 BBB that developed cancer within 3 years of BBB biopsy, and we found no APOBEC signatures enriched in these BBB.

We also employed VarScan2 to study somatic CNV in the 26 samples for which we have matched normal DNA. The learnt segments were then passed to GISTIC2 for the study of recurrent CNVs (genome-wide CNV variation: Supplementary Fig. 3). We observed that the majority of the cytobands occur at or immediately next to common fragile sites, suggesting these cells are under considerable replication stress (Fig. 5b). The observed cytobands at which CNVs map exclusively in the cases have been associated with cancers in general or invasive breast cancers in particular. Amplifications are hypothesized to be the result of breakage-fusion-bridge (BFB) cycles triggered at the induction of fragile sites. One of the amplifications identified in the BBCAR cases is an amplification outlier identified using breast cancers from the METABRIC consortium that mapped to chr19q13.33, which contains 26 genes. No candidate oncogene has yet been identified within this amplicon [38]. Chromosome 1q21 is the fourth most frequent locus of copy number variations in cancer.

To investigate the mechanisms underlying our mismatch repair signature, mismatch repair genes MLH1, MLH3, MSH2, MSH3, MSH6, PMS2, MUTYH, MYH11, SETD2 and TGFBR were examined for deleterious mutations and/or deletion in the subset of samples with matched germline DNA available (N = 26). Approximately one-third of the cases and controls have at least one mutation in one of the mismatch repair associated genes (Supplementary data 4). PMS2 is deleted in one-half of the cases (10/20) and MLH3 in all of the controls (6/6). However, only one of the 10 cases displaying a PMS2 deletion also evidenced a mutation in an MMR associated gene, specifically SETD2. None of the controls with MLH3 deletions carried a mutation in any of the MMR associated genes. Strikingly, benign biopsies harboring an MLH3 deletion are abundantly enriched with the mismatch repair signature compared to MLH3 wild biopsies (p = 4.2E-6) (Fig. 5d).

[Figure 5, panels a–d. Labels recoverable from the image: signatures Aging, Mismatch repair and O/TN for Case and Control (panel a); fragile sites FRA1F, FRA19B/A, FRA9D, FRA10G, FRA1A, FRA4D, FRA7F, FRA7D, with legend Amp, Del, nSamples and -log10(q values) (panel b); Exposure Strength for TNBC vs. Non TNBC, p = 0.007 (panel c); MMR signature for MLH3 naive vs. MLH3 wild, p = 4E-6 (panel d).]

Fig. 5 Mutational processes and somatic copy number variation (CNV) identified in the case and control groups. a The identified mutation signatures were compared with those of the Wellcome Trust Sanger Institute. The aging signature and mismatch repair signature (MMR) are enriched in both groups. The uncatalogued signature O/TN is enriched with T > G/A > C mutations, with 5′GAA3′ > 5′GCA3′ the most frequently mutated trinucleotide. b Recurrent somatic copy number variation in the case group. Common fragile sites are labeled. The size of the dots represents the q-value (FDR adjusted p-value). Red are amplifications, blue are deletions. y-axis is the number of genes involved, and x-axis is the number of samples involved. c The uncatalogued signature is enriched in BBBs that predate triple negative breast cancer. Each sample is assigned a continuous number representing the signature exposure strength, which was the product of matrix decomposition. d The mismatch repair signature is highly abundant in the BBB with MLH3 deletion (MLH3 naive). Error bars are 95% confidence interval (CI).

Cancer risk prediction at BBB

In an attempt to build a model for cancer prediction at the time of BBB using somatic information, we fit logistic regression with L1 penalty using the case/control status as output variable. To reduce the number of input prediction features, all somatic mutations that were annotated with the same protein domain were aggregated as a continuous number, representing the mutation burden of the corresponding protein domain. In total, 1966 annotated protein domains were utilized as input features for case/control prediction (Supplementary data 5). To evaluate the model and features, we performed bootstrapping by randomly splitting the BBB samples at a 7:3 ratio, training the model on 70% of the samples and using the remaining 30% as the test set. We repeated the process ten times and obtained an AUC for each run. As a result, we obtained an AUC score of 67% (95% CI = 63.1–70.9%) in predicting the cases. Of note, the inclusion of clinical characteristics and demographics, including age at the time of BBB, age at menarche, age at first live birth, family history of breast cancer in a first-degree relative, and histologic variable (proliferative vs non-proliferative), did not improve the model’s performance.

Somatic mutations present in both benign biopsy and cancer

Our cases were defined as BBB that predate breast cancer. In this study, to longitudinally compare mutations in the BBB and in the cancer samples, we retrieved ten tumors that matched to our BBB cohort. Preprocessing for mutation calling was performed as for the BBB, including laser capture microdissection (LCM), DNA extraction, library construction, sequencing, alignment, mutation calling, and variant filtering. Of the identified mutations in these ten cancer samples, 957 were observed in both the benign biopsies and cancer tissues (Supplementary data 6). The average allele frequency for these mutations is 32.2% (SD = 18.7%) in the BBB and 46.7% (SD = 17.3%) in the cancer tissues. FAT1, CTNNA2, ATR and ETAA1 were among the top ten mutated genes (Supplementary Table 4); these are known tumor suppressor genes or oncogenes. All six of the CTNNA2 mutations occur within the motif 5′GAA3′ > 5′GCA3′. This motif is a predominant feature of our O/TN mutation signature (Fig. 3).

DISCUSSION

Genetic aberrations associated with malignancy occur within normal tissues [17] and within tissues at the population risk of breast cancer [15] as well as within lesions at substantial risk [16]. A previous

we report here must be regarded as preliminary until larger numbers can be studied. We were able to obtain germline specimens on only 26 subjects. Data from these 26 specimens were utilized to build the Panel of Normals (PoN) for germline variant filtering; GATK recommends a minimum of 40. Using fewer than the suggested minimum may result in suboptimal denoising of the data and may not capture all the common germline variants. Since all subjects consented to participation and to recontact, we are working actively to acquire additional germline samples. Finally, the use of formalin-fixed paraffin embedded (FFPE) breast samples, although unavoidable in this setting, risks introducing artefactual findings. Among the FFPE artifacts are C to T transitions hypothesized to be the result of the deamination of cytosines [43,44], and both Rohan et al. [31] and Soyal et al. [41] screened for these substitutions. However, a large, prospective study carried out by the 100,000 Genomes Project argues that the choice of mutation caller as well as tissue heterogeneity or sampling may contribute to differences between FFPE and frozen tissue [45]. Our O/TN signature is not dominated by C to T transitions.
Although our case-control study performed by Rohan et al., with a design that MMR signature and aging signatures are populated by C to T closely mirrors ours, utilized targeted sequence capture ;no transitions, we think it unlikely that these are due to formalin- significant differences between cases and controls with regard to induced deamination, as our signatures closely mirror those of somatic mutations were identified and no mutations were shared WTSI 6 and 1b, respectively, which were derived from frozen between the biopsy and tumor pairs. Comparing the number of specimens. Considering the risk of ruling out true mutations, somatic mutations identified in their targeted genes with these therefore, we did not attempt to account the FFPE artifacts in our same genes in our WES data revealed striking similarity and to pipeline. make the similarity easy to discern, we designed Fig. 4b to mirror In contrast to our findings, Soysal et. al found no specific their Fig. 1a, b. Soysal and colleagues also employed targeted mutations in their study of benign breast lesions. Their depth of sequencing in an attempt to identify somatic mutations present in sequencing was more than adequate but only 17 patient samples antecedent fibrocystic disease (FD) and subsequent invasive were included , with lesions that they called “fibrocystic breast breast cancers . In contrast to our study and that of Rohan disease with or without UDH, FEA, or CCL”. These lesions, with the et al. , no significant somatic mutations were identified in the FD. exception of flat epithelial atypia, were also included in our study, In their discussion section Rohan et al. suggested that “more making it unlikely that the choice of histology is determinative. We detailed approaches (e.g., exome/whole-genome sequencing)” should consider the possibility that single nucleotide somatic might prove more informative than targeted sequencing .We mutation is not the correct genetic determinant of risk. 
Single cell employed WES in a similar case-control setting. We rigorously sequencing of synchronous DCIS and invasive ductal carcinomas evaluated the sequencing quality, mutation calling, and mutation has revealed that CNV is early oncogenic event, i.e., present in classification. Since we did not have germline samples available in situ lesions, and that no additional CNV events are acquired from most of our subjects, we developed a neural network model during the transition from in situ to invasive lesion . In a study to predict somatic mutations for the benign biopsies, which we separate from the one referred to earlier, Soysal et al. showed that were able to accomplish with a F1 score of 96%. This tool was ESR1 gene amplifications are an early event in breast carcinogen- further validated in TCGA (MC3) data with a F1 score of 89%. Using esis and are already present, at least in part, in FD . Additionally, the sequencing data produced, we have identified recurrent recurrent CNVs are more characteristic of invasive breast cancers mutated genes. We also built a predictive model for the risk of than are recurrent mutations . Key breast cancer phenotypes, breast cancer using genetic information alone and obtained an including intrinsic molecular subtypes, estrogen receptor status, AUC of 67% (95% CI = 63.1–70.9%). This represents the best and TP53 mutation status as well as proliferative status and performance to date using benign breast lesions, despite the estrogen-signaling pathway activity can be predicted by DNA 42 48 exclusion of subjects with atypical hyperplasia . Importantly, we copy number features alone . have identified a currently uncatalogued signature, which we have Lesions such as hyperplasia, not all of which are obligate designated O/TN, that is associated with triple negative breast precursors of malignancy, already show evidence of activation of cancer (p = 0.007), which was validated in 109 TCGA TNBC DNA damage response pathways. 
This is a response to oncogene- samples (p = 0.001); we found that PIK3CA pH1047R hotspot induced DNA replication stress causing unscheduled S-phase mutation is more frequent in proliferative disease without atypia entry with consequent aberrant replication structures and DNA (PDWA) compared to non-proliferative disease (p < 0.001); we damage, which activate ATR/Chk1, ATM/Chk2, and p53, ultimately observed multiple recurrent CNVs as well, including a MLH3 preventing progression by arresting growth or triggering cell deletion, which is significantly associated with a mismatch repair death . Intriguingly, with regard to our data, is the fact that in the signature (p < 0.001). early lesions that are genetically most stable, loss of hetero- This study has several strengths and weaknesses. The speci- zygosity at known fragile sites is observed to occur 3–15 times mens are richly annotated with clinical information (Supplemen- more often than expected from random targeting of these sites . tary Data 1) and they have been laboriously microdissected in Fragile sites were also noted to be targeted during the period in order to sample the epithelial compartment. The controls have a which DNA damage response is maximal. These data suggest a long median follow up and were verified not to have been model in which oncogene activation is an early event in at-risk diagnosed with BC by a telephone interview carried out at the tissue and that cells activate the ATR/ATM-regulated DNA damage time of this study. We have leveraged the advantages of machine responses that delay or prevent malignant progression. This may learning/artificial intelligence to enable the calling of somatic explain why we observe equivalent somatic mutations, e.g. PIK3CA mutations in the absence of germline data. (H1047R), in cases and controls. 
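The evaluation protocol reported for the risk model (random 7:3 splits repeated ten times, one AUC per run, summarized as a mean with a 95% CI) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names (`auc`, `split_7_3`, `mean_ci95`, `evaluate`) are hypothetical, the burden scores are synthetic, the model fit is replaced by a trivial stand-in rather than the L1-penalized logistic regression, and the t-based interval is assumed because the text does not state how the CI was derived.

```python
import random
import statistics


def auc(scores, labels):
    """Rank-based AUC: chance that a random case outscores a random control."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def split_7_3(n_samples, rng):
    """Shuffle sample indices and cut them 70% train / 30% test."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    cut = round(0.7 * n_samples)
    return idx[:cut], idx[cut:]


def mean_ci95(values, t_crit=2.262):
    """Mean and t-based 95% CI; 2.262 is the two-sided critical value for
    9 degrees of freedom, i.e. ten runs (the interval type is an assumption)."""
    m = statistics.mean(values)
    half = t_crit * statistics.stdev(values) / len(values) ** 0.5
    return m, m - half, m + half


def evaluate(burden, labels, n_runs=10, seed=0):
    """Repeat the 7:3 split n_runs times and collect one test-set AUC per run."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_runs):
        train, test = split_7_3(len(labels), rng)
        # Stand-in for model fitting: learn only the sign of the score on the
        # training split (the paper fits an L1-penalized logistic regression).
        train_auc = auc([burden[i] for i in train], [labels[i] for i in train])
        sign = 1.0 if train_auc >= 0.5 else -1.0
        aucs.append(auc([sign * burden[i] for i in test], [labels[i] for i in test]))
    return aucs


if __name__ == "__main__":
    rng = random.Random(1)
    labels = [1] * 135 + [0] * 69  # 135 cases, 69 controls, as in the cohort
    # Synthetic burden scores, invented for this demo only.
    burden = [rng.gauss(0.6 if y else 0.0, 1.0) for y in labels]
    mean, low, high = mean_ci95(evaluate(burden, labels))
    print(f"AUC = {mean:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Because every run draws its test set from the same 204 samples, an interval computed this way describes the spread across resampled splits of one cohort; it does not substitute for an independent validation set.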
Weaknesses include the relatively small size of the study and the lack of an independent validation dataset, so that the findings

The ATR and ETAA1 mutations that we observed in the BBCAR specimens and their matched tumors may be the specific mutations that enable oncogenic progression in the cases. Inactivating mutations including any in the ATM/Chk2 or ATR/Chk1 pathways potentially would remove the barrier to progression and result in cell proliferation and survival, increasing genomic instability and tumor progression.

Our O/TN signature is enriched with T > G/A > C mutations, with 5′TTC3′ > 5′TGC3′ the most frequently mutated trinucleotide motif. These single nucleotide T > G transversions are observed in vitro when equimolar oxidized dGTP (8-O-dGTP) is included in the nucleotide pool. Strand information is lost between the initial occurrence of a mutagenic lesion and the ultimate readout by DNA sequencing. Conventionally, mutational signatures are displayed with a mutated pyrimidine at the center of the trinucleotide motif. The reverse complement of 5′TTC3′ > 5′TGC3′ is 5′GAA3′ > 5′GCA3′. There is a 4- to 5-fold difference in the 8-O-dGTP mutation rate depending on the sequence context, with 5′GAA3′ being a favored context. The nucleotide pool is sanitized by MTH1, which hydrolyzes cytotoxic oxidized dNTPs, preventing them from becoming mis-incorporated into DNA during replication or repair. Even with this cellular sanitizing activity, nucleotide pools contain enough 8-oxo-dGTP to promote mutagenesis⁵¹,⁵². Mutagenesis results from the insertion of 8-O-dGTP across from adenine rather than cytosine during DNA replication. Steric hindrance of the oxygen of cytosine (C) in the anti-conformation with the triphosphate group of the 8-oxo-dGTP also in the anti-conformation prevents Watson-Crick base pairing. However, 8-oxo-dGTP can assume the syn conformation enabling Hoogsteen base pairing with adenine (A). This A(template):Gᵒˣ(nascent) base pairing results in T > G/A > C following two additional rounds of replication.

The association of triple negative breast cancer (TNBC) with our O/TN signature is intriguing. About 80% of TNBCs are of the basal-like subtype and this subtype likely originates from luminal progenitor cells⁵⁵,⁵⁶. We hypothesize that the O/TN signature results from deficient repair of a specific oxidative lesion as discussed above. The levels of reactive oxygen species and antioxidant defenses have been assayed in both luminal progenitor (LP) and basal cells of normal human mammary tissue. Higher levels of both superoxide anion and hydrogen peroxide are present in the LPs. Even though multiple antioxidants are deployed, LPs display higher levels of oxidative damage, specifically increased incorporation of 8-oxo-deoxyguanosine (8-oxo-dG) within the genomic DNA. Therefore, the association of TNBC with our O/TN signature may reflect the susceptibility of its precursor LP cells to oxidative damage, placing them at a disadvantage if this damage cannot be adequately addressed due to mismatch repair deficiency.

MLH3 deletion was strongly associated with our MMR signature. The trinucleotide motifs most frequently mutated in our MMR signature and WTSI Signature 6 are 5′GCG3′ > 5′GTG3′, 5′CCG3′ > 5′CTG3′, and 5′ACG3′ > 5′ATG3′. These mutations are hypothesized to be due to an error in the replication of 5-methylcytosine (5mC). Tomkova et al. have advanced a model, which posits that wildtype Pol ε has slightly decreased fidelity when encountering 5mC, particularly in a GCG context, on the template strand and incorrectly pairs it with A, leading to 5mC:A mismatches. They note that there is high structural similarity between 5mC and T, both of which present a methyl group at the same position of the pyrimidine ring. If the resulting 5mC:A mismatches are not repaired before the next round of replication due to dysfunctional mismatch repair, one would expect an enrichment of NCG > NTG mutations. Given this hypothesized etiology of the mutations, is there evidence that MLH3 repairs such mutations? Sequencing of the tumors arising from the cross of Apc1638N mutant mice with Mlh3 nullizygous and Mlh3-/-;Pms2-/- mice reveals that the C:G > T:A transition mutations, irrespective of MMR genotypes, occurred at either CpG dinucleotides or CpNpG sites, typical targets for DNA methylation. Thus, although our numbers are small, it appears that our MMR Signature in the controls may result from MLH3 deletion; as the same signature is observed in the cases, another mechanism of MLH3 silencing, such as promoter methylation, may be operative in the cases.

The ten most frequently mutated genes shared between the BBB of cases and their tumors are given in Supplementary Table 4. Among them are FAT1, CTNNA2, ATR, and ETAA1. FAT1 has the most mutations, which is interesting as this same gene was shown to have a statistically significant excess of inactivating mutations across all classes in the study of sun-exposed, physiologically normal epidermis. FAT1 encodes a cadherin-like protein and its inactivation via mutation may lead to tumorigenesis by multiple avenues⁶⁰,⁶¹. From a breast cancer standpoint, investigations into the etiology of CDK4/6 inhibitor resistance have provided significant clues to FAT1’s role as a tumor suppressor. Loss of FAT1 activity results in increased expression of CDK6, consequent to dysregulation of the Hippo pathway. ATR and ETAA1 have been discussed earlier regarding their function as barriers to progression. We hypothesize that the CNV we have observed is due to replication stress. Replication stress leads to stalled replication forks and, if ATR or ETAA1 mutation renders the proteins unable to stabilize the forks and allow time for repair, further genomic instability is likely to ensue. ATR also specifically regulates fragile site stability. While admittedly our number of matched BBB and tumors is limited, the data from these specimens suggest that, later in oncogenesis, mutations in ATR pathway members, i.e., ATR and ETAA1, are being selected for, as they are observed in both the benign biopsy and its matched tumor. We note that ATR haploinsufficiency in a mismatch repair deficient background has been shown to result in dramatic increases in fragile site instability, amplifications and rearrangements, and in decreased tumor latency.

In summary, we have taken an initial step towards what will be a series of investigations of somatic DNA changes in the unaffected breast, which will help define alterations that put women at substantially elevated BC risk. Such studies will also provide the possibility of estimating the time frame of that risk, so that women are able to make practical decisions regarding the interventions that they choose to adopt. We have shown that such work is feasible, with sequencing quality that meets current standards in the field, that somatic sequencing data can be inferred and interpreted even in the absence of matched germline data, and that intriguing findings emerge that are cancer relevant.

METHODS

Sample collection

At the Northwestern Feinberg School of Medicine, we designed a case-control study of BBB samples (BBCAR Study). Subjects were identified through searches of the Enterprise Data Warehouse of Northwestern Medicine (NM), and at the Lynn Sage Breast Center of NM, under IRB-approved protocol NU 09B2. The major eligibility criterion required a history of benign breast biopsy performed at NM, at least 1 year prior to cancer diagnosis for cases. Eligible subjects provided written informed consent for the use of their BBB blocks after the nature and possible consequences of the study were explained, and completed a survey detailing breast history and breast cancer risk factors. We have retrieved the BBB paraffin blocks of subjects who subsequently developed breast cancer (cases) and from age-matched controls, who have not developed breast cancer to date. The participants are contacted periodically to confirm that controls have not transitioned to cases. A subset of 135 cases, matched to 69 controls, was selected for WES (“Supplementary Materials and Methods”). All subjects included in this analysis were of European descent. Case and control samples are matched by age and histologic class (non-proliferative benign change, or proliferation without atypia). DNA was isolated from the LCM epithelium and sequenced using the Illumina HiSeq4000. WES was conducted with a sequencing depth of 80–100× and 80–90 million sequencing reads were generated for each sample (Supplementary Materials and Methods).

Parallel alignment of whole-exome analysis

We adapted widely used open source software for genome alignment and variant calling. Read alignment and variant calling were performed according to the Broad Institute’s Genome Analysis Toolkit (GATK) best practices pipeline. Reads were aligned to the human reference genome (hg19) using Burrows-Wheeler alignment and Picard 2.6 was subsequently used to sort reads and mark duplicates (Fig. 1c). To reduce systematic errors, sorted BAM files were generated separately for each sequencing lane on which the reads were generated. By doing so, lane-specific technical artifacts can be removed during the duplicate marking and base recalibration steps. Base recalibration was done using GATK 3.6 with dbSNP build 138 as a training set. Mutations were called and filtered using MuTect2 in the GATK package. To capture recurrent technical artifacts, we generated a Panel of Normals (PON) for Mutect2 analysis using the 26 sequenced germline DNA samples. The PON is created by running the variant caller Mutect2 individually on the normal samples and combining the resulting variant calls, excluding any sites that are not present in at least 2 normals, which is the default cutoff⁶⁹. For the samples without matched normal DNA available, we ran Mutect2 using the so-called “tumor only” mode with PON filtering to call mutations. To obtain a set of mutations with the highest sensitivity, VarScan2³⁵ and VarDict⁷⁰ were also applied for mutation calling. To further ensure a high precision call rate, we filtered out all mutations with read depth <20. After filtering, mutations were then annotated using SNPEFF⁷¹, VEP⁷², and ANNOVAR²⁵.

Somatic mutation identification

Our initial objective was to develop and test a predictive model for somatic mutation identification. A significant challenge for this study, and for others seeking to identify somatic mutations in archived tissue samples, is the lack of matched germline DNA. Therefore, to prepare a ground truth, previously consented donors were re-contacted (with IRB approval) and saliva specimens were requested for normal DNA sequencing. Matched germline DNA was obtained for 26 of the 204 BBB specimens which had been selected for WES. For these 26 paired samples, a set of somatic mutations was generated by using Mutect2 tumor-normal pair mode with PON filtering. Independently, for these BBB samples, a set of mutations was generated using Mutect2 tumor-only mode with PON filtering. This is the mode to be used for the rest of the BBBs without matched normal DNA. However, mutations generated in this mode contain germline variants. To rule out these germline variants, we overlapped this set of mutations with their BBB’s germline variants, which were generated using GATK HaplotypeCaller. The overlapped variants were then labeled as germline variants and, together with the somatic mutations, were used for model evaluation. We systematically evaluated multiple machine learning models and adopted a Multi-Layer Perceptron (MLP) for somatic mutation classification. Features in the prediction model included intrinsic sequencing features, such as mutation allele frequency, depth of reference reads, and number of appearances in the cohort, as well as published collated data providing the frequency of the variant in the population and predictions of the impact of amino acid changes on the structure and function of the encoded protein. The model obtained an accuracy of 95% for somatic mutation in the test set (“Supplementary Materials and Methods”). Orthogonal SNP array genotyping was performed to compare and validate the performance of mutation calling and mutation classification. Technical validation was performed for 17 of the 26 specimens for which matched germline data was available, and 3 of the specimens without matched germline, using the Infinium Exome-24 v1.1 beadchip (“Supplementary Materials and Methods”). The case group is defined as benign biopsies from subjects who developed breast cancer at least one year after the biopsy. In the case group, we have retrieved 10 cancer blocks that matched to the cases (Fig. 1b). The same preprocessing procedures were performed as for the benign biopsies, including LCM dissection, DNA extraction, library construction, sequencing, alignment, mutation calling, and filtering.

Somatic copy number variation and mutational signature

Using both aligned reads and identified mutations, we studied the genetic aberrations that distinguish cases from controls, including mutations and CNVs. We identified the somatic mutations or CNVs that were common to both the cases’ benign biopsy tissue and the paired malignant lesions for the ten cases in which we had both tissues available. P-values were derived with the use of Chi-square test or logistic regression. We also studied the mutations to enable the discovery of mutational signatures. Lastly, we evaluated machine learning models and features for breast cancer risk prediction for the cohort. The Benjamini-Hochberg method was applied to convert the two-sided P-values to False Discovery Rate (FDR) for multi-comparison correction.

A Mutational Signature study was performed to reveal underlying mutational processes for cancer development. The identified somatic mutations were presented as a 96-element vector, which captures the immediate 5′ and 3′ neighbors of the mutated nucleotides. The summary of these mutation characteristics forms a mutational profile for each tissue sample. Putting multiple samples’ profiles together forms a matrix with the number of samples as rows (204) and the mutation characteristics as columns (96). Nonnegative matrix factorization (NMF) was applied to enable the discovery of intrinsic patterns in this matrix. The first value where the Residual Sum of Squares (RSS) curve presents an inflection point was used to determine the number of signatures. In total, three signatures were discovered among the cases and controls, or combined. The outputs of NMF consist of an H matrix and a W matrix. The matrix H (dimension of 3 × 96) was used to infer mutational processes. The numbers in matrix W (dimension of 204 × 3) correspond to each sample’s signature exposure levels. This matrix was interpreted as each tissue sample’s accumulated exposure effect to the mutational burden. We further evaluated the association between the signature exposure level and cancer development with logistic regressions, adjusting for age and histology class.

Cancer risk prediction at BBB

To predict cancer development using the mutations identified in BBB, we fit logistic regression with L1 penalty using the case/control status as output variable. Multiple input features have been tested, namely, clinical risk factors, somatic mutations, and mutation burden by gene/cytoband/protein domain. The mutation burden is inferred by aggregating all somatic mutations annotated with the same gene/cytoband/protein domain to a continuous number, representing the mutation burden of the corresponding unit. In a cross-comparison evaluation, we achieved the best results using protein domains as the aggregation unit. In total, 1966 annotated protein domains were utilized as input features for case/control prediction (Supplementary data 5). To evaluate the model and features, we performed bootstrapping by randomly splitting the BBB samples at a 7:3 ratio for training and testing. We also evaluated the models by including clinical risk factors, including age at the time of BBB, age at menarche, age at first live birth, family history of BC in a first-degree relative, and histologic variable (proliferative vs non-proliferative).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY

The datasets generated and analysed during the current study are publicly available in the figshare repository: https://doi.org/10.6084/m9.figshare.12191793. Whole-exome sequencing data, generated during the current study, are publicly available in NCBI Sequence Read Archive (SRA) here: https://identifiers.org/insdc.sra:SRP219328. TCGA data supporting Fig. 2 were downloaded from the Genomic Data Commons (GDC) data portal, through a dbGaP application. The link to the relevant dbGaP study is https://identifiers.org/dbgap:phs000178.v1.p1.

CODE AVAILABILITY

All code necessary to process the sequencing data and to re-generate the results is publicly available at https://github.com/zexian/BBCAR_codes.

Received: 16 September 2019; Accepted: 8 May 2020;

REFERENCES

1. Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2019. CA Cancer J. Clin. 69, 7–34 (2019).
2. Flanagan, M. R. et al. Chemoprevention Uptake for Breast Cancer Risk Reduction Varies by Risk Factor. Ann. Surg. Oncol. https://doi.org/10.1245/s10434-019-07236-8 (2019).
3. Gail, M. H. et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J. Natl Cancer Inst. 81, 1879–1886 (1989).
4. Chlebowski, R. T. et al. Predicting risk of breast cancer in postmenopausal women by hormone receptor status. J. Natl Cancer Inst. 99, 1695–1705 (2007).
5. Pankratz, V. S. et al. Assessment of the accuracy of the Gail model in women with atypical hyperplasia. J. Clin. Oncol. 26, 5374–5379 (2008).
6. Boughey, J. C. et al. Evaluation of the Tyrer-Cuzick (International Breast Cancer Intervention Study) model for breast cancer risk prediction in women with atypical hyperplasia. J. Clin. Oncol. 28, 3591–3596 (2010).
7. Dupont, W. D. & Page, D. L. Risk factors for breast cancer in women with proliferative breast disease. N. Engl. J. Med. 312, 146–151 (1985).
8. Dupont, W. D. et al. Breast cancer risk associated with proliferative breast disease and atypical hyperplasia. Cancer 71, 1258–1265 (1993).
9. Visscher, D. W. et al. Clinicopathologic features of breast cancers that develop in women with previous benign breast disease. Cancer 122, 378–385 (2016).
10. Silverstein, M. J. et al. Special report: consensus conference III. Image-detected breast cancer: state-of-the-art diagnosis and treatment. J. Am. Coll. Surg. 209, 504–520 (2009).
11. Ju, Y. S. et al. Somatic mutations reveal asymmetric cellular dynamics in the early human embryo. Nature 543, 714–718 (2017).
12. Alexandrov, L. B. et al. Clock-like mutational processes in human somatic cells. Nat. Genet. 47, 1402–1407 (2015).
13. Knudson, A. G. Two genetic hits (more or less) to cancer. Nat. Rev. Cancer 1, 157–162 (2001).
14. Tamborero, D. et al. Comprehensive identification of mutational cancer driver genes across 12 tumor types. Sci. Rep. 3, 2650 (2013).
15. Danforth, D. N. Jr. Genomic changes in normal breast tissue in women at normal risk or at high risk for breast cancer. Breast Cancer (Auckl.) 10, 109–146 (2016).
16. Sakr, R. A. et al. Targeted capture massively parallel sequencing analysis of LCIS and invasive lobular cancer: repertoire of somatic genetic alterations and clonal relationships. Mol. Oncol. 10, 360–370 (2016).
17. Martincorena, I. et al. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886 (2015).
18. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).
19. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
20. Temko, D., Tomlinson, I. P. M., Severini, S., Schuster-Bockler, B. & Graham, T. A. The effects of mutational processes and selection on driver mutations across cancer types. Nat. Commun. 9, 1857 (2018).
21. Hartmann, L. C. et al. Benign breast disease and the risk of breast cancer. N. Engl. J. Med. 353, 229–237 (2005).
22. Zeng, Z. et al. Datasets and metadata supporting the published article: somatic genetic aberrations in benign breast disease and the risk of subsequent breast cancer. figshare. https://doi.org/10.6084/m9.figshare.12191793 (2020).
23. NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRP219328 (2020).
24. Flanagan, S. E., Patch, A.-M. & Ellard, S. Using SIFT and PolyPhen to predict loss-of-function and gain-of-function mutations. Genet. Test. Mol. Biomark. 14, 533–537 (2010).
25. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
26. Kalatskaya, I. et al. ISOWN: accurate somatic mutation identification in the absence of normal tissue controls. Genome Med. 9, 59 (2017).
27. Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385.e318 (2018).
28. Sharma, Y. et al. A pan-cancer analysis of synonymous mutations. Nat. Commun. 10, 2569 (2019).
29. Pearlman, R. et al. Prevalence and spectrum of germline cancer susceptibility gene mutations among patients with early-onset colorectal cancer. JAMA Oncol. 3, 464–471 (2017).
30. Hanisch, F. G. O-glycosylation of the mucin type. Biol. Chem. 382, 143–149 (2001).
31. Rohan, T. E. et al. Somatic mutations in benign breast disease tissue and risk of subsequent invasive breast cancer. Br. J. Cancer 118, 1662–1664 (2018).
32. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).
33. Petljak, M. et al. Characterizing mutational signatures in human cancer cell lines reveals episodic APOBEC mutagenesis. Cell 176, 1282–1294.e1220 (2019).
34. Roberts, S. A. et al. An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers. Nat. Genet. 45, 970–976 (2013).
37. Bass, T. E. et al. ETAA1 acts at stalled replication forks to maintain genome integrity. Nat. Cell Biol. 18, 1185–1195 (2016).
38. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
39. Ciriello, G. et al. Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 45, 1127–1133 (2013).
40. Davies, H. et al. Whole-genome sequencing reveals breast cancers with mismatch repair deficiency. Cancer Res. 77, 4755–4762 (2017).
41. Soysal, S. D. et al. Genetic alterations in benign breast biopsies of subsequent breast cancer patients. Front. Med. (Lausanne) 6, 166 (2019).
42. Pankratz, V. S. et al. Model for individualized prediction of breast cancer risk after a benign breast biopsy. J. Clin. Oncol. 33, 923–929 (2015).
43. Spencer, D. H. et al. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J. Mol. Diagn. 15, 623–633 (2013).
44. Bhagwate, A. V. et al. Bioinformatics and DNA-extraction strategies to reliably detect genetic variants from FFPE breast tissue samples. BMC Genomics 20, 689 (2019).
45. Robbe, P. et al. Clinical whole-genome sequencing from routine formalin-fixed, paraffin-embedded specimens: pilot study for the 100,000 Genomes Project. Genet. Med. 20, 1196–1205 (2018).
46. Wang, Y. et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160 (2014).
47. Soysal, S. D. et al. Status of estrogen receptor 1 (ESR1) gene in mastopathy predicts subsequent development of breast cancer. Breast Cancer Res. Treat. 151, 709–715 (2015).
48. Xia, Y., Fan, C., Hoadley, K. A., Parker, J. S. & Perou, C. M. Genetic determinants of the molecular portraits of epithelial cancers. Nat. Commun. 10, 5666 (2019).
49. Bartkova, J. et al. Oncogene-induced senescence is part of the tumorigenesis barrier imposed by DNA damage checkpoints. Nature 444, 633–637 (2006).
50. Minnick, D. T., Pavlov, Y. I. & Kunkel, T. A. The fidelity of the human leading and lagging strand DNA replication apparatus with 8-oxodeoxyguanosine triphosphate. Nucleic Acids Res. 22, 5658–5664 (1994).
51. Colussi, C. et al. The mammalian mismatch repair pathway removes DNA 8-oxodGMP incorporated from the oxidized dNTP pool. Curr. Biol. 12, 912–918 (2002).
52. Pursell, Z. F., McDonald, J. T., Mathews, C. K. & Kunkel, T. A. Trace amounts of 8-oxo-dGTP in mitochondrial dNTP pools reduce DNA polymerase gamma replication fidelity. Nucleic Acids Res. 36, 2174–2181 (2008).
53. Freudenthal, B. D. et al. Uncovering the polymerase-induced cytotoxicity of an oxidized nucleotide. Nature 517, 635–639 (2015).
54. Garrido-Castro, A. C., Lin, N. U. & Polyak, K. Insights into molecular classifications of triple-negative breast cancer: improving patient selection for treatment. Cancer Discov. 9, 176–198 (2019).
55. Lim, E. et al. Aberrant luminal progenitors as the candidate target population for basal tumor development in BRCA1 mutation carriers. Nat. Med. 15, 907–913 (2009).
56. Molyneux, G. et al. BRCA1 basal-like breast cancers originate from luminal epithelial progenitors and not from basal stem cells. Cell Stem Cell 7, 403–417 (2010).
57. Kannan, N. et al. Glutathione-dependent and -independent oxidative stress-control mechanisms distinguish normal human mammary epithelial cell subsets. Proc. Natl Acad. Sci. USA 111, 7789–7794 (2014).
58. Tomkova, M., McClellan, M., Kriaucionis, S. & Schuster-Bockler, B. DNA Replication and associated repair pathways are involved in the mutagenesis of methylated cytosine. DNA Repair (Amst.) 62, 1–7 (2018).
59. Chen, P. C. et al. Novel roles for MLH3 deficiency and TLE6-like amplification in DNA mismatch repair-deficient gastrointestinal tumorigenesis and progression. PLoS Genet. 4, e1000092 (2008).
60. Morris, L. G. et al. Recurrent somatic mutation of FAT1 in multiple human cancers leads to aberrant Wnt activation. Nat. Genet. 45, 253–261 (2013).
61. Martin, D. et al. Assembly and activation of the Hippo signalome by FAT1 tumor suppressor. Nat. Commun. 9, 2372 (2018).
62. Li, Z. et al. Loss of the FAT1 tumor suppressor promotes resistance to CDK4/6 inhibitors via the hippo pathway. Cancer Cell 34, 893–905.e898 (2018).
63. Cimprich, K. A. & Cortez, D. ATR: an essential regulator of genome integrity. Nat. Rev. Mol. Cell Biol. 9, 616–627 (2008).
64. Casper, A. M., Nghiem, P., Arlt, M. F. & Glover, T. W. ATR regulates fragile site stability. Cell 111, 779–789 (2002).
65. Fang, Y. et al. ATR functions as a gene dosage-dependent tumor suppressor on a mismatch repair-deficient background. EMBO J. 23, 3164–3174 (2004).
66.
Shidfar, A. et al. Expression of miR-18a and miR-210 in normal breast tissue as 35. Koboldt, D. C., et al. VarScan 2: somatic mutation and copy number alteration candidate biomarkers of breast cancer risk. Cancer Prev. Res. (Phila.) 10,89–97 discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012). (2017). 36. Mermel, C. H. et al. GISTIC2. 0 facilitates sensitive and confident localization of the 67. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for targets of focal somatic copy-number alteration in human cancers. Genome Biol. analyzing next-generation DNA sequencing data. Genome Res 20, 1297–1303 12, R41 (2011). (2010). npj Breast Cancer (2020) 24 Published in partnership with the Breast Cancer Research Foundation Z. Zeng et al. 68. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler COMPETING INTERESTS transform. Bioinformatics 26, 589–595 (2010). The authors declare no competing interests. 69. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013). 70. Lai, Z. et al. VarDict: a novel and versatile variant caller for next-generation ADDITIONAL INFORMATION sequencing in cancer research. Nucleic acids Res. 44, e108 (2016). Supplementary information is available for this paper at https://doi.org/10.1038/ 71. Cingolani, P. et al. A program for annotating and predicting the effects of single s41523-020-0165-z. nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melano- gaster strain w1118; iso-2; iso-3. Fly (Austin) 6,80–92 (2012). Correspondence and requests for materials should be addressed to Y.L., S.A.K. or S.E.C. 72. McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 122 (2016). 
Reprints and permission information is available at http://www.nature.com/ reprints Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims ACKNOWLEDGEMENTS in published maps and institutional affiliations. We thank the Center for Medical Genomics at the Indiana University School of Medicine for the library preparation and high throughput sequencing. This study was supported in part by Breast Cancer Research Foundation, the Lynn Sage Cancer Research Foundation, and grant R21LM012618-01 from the National Institutes of Health. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative AUTHOR CONTRIBUTIONS Commons license, and indicate if changes were made. The images or other third party Z.Z., S.A.K., and S.E.C. conceived the study. A.S. performed laser capture microdisse- material in this article are included in the article’s Creative Commons license, unless cion and extracted the DNA. P.S. administered the questionnaires, organized the indicated otherwise in a credit line to the material. If material is not included in the clinical data, and contacted the subjects for saliva donation and cancer status article’s Creative Commons license and your intended use is not permitted by statutory confirmation. X.X. performed the sequencing. L.B. reviewed all benign biopsy and regulation or exceeds the permitted use, you will need to obtain permission directly tumor sections, verified histologic diagnosis, and identified areas for laser capture. from the copyright holder. To view a copy of this license, visit http://creativecommons. Z.Z. and A.V. carried out the sequence alignment, quality assessment, and mutation org/licenses/by/4.0/. calling. S.E.C. and Z.Z. wrote the paper. Z.Z. and X.L. 
performed the statistical analysis. Y.L. reviewed all analyzed data. S.A.K. was responsible for the clinical study. All authors discussed the results, revised and approved the paper. © The Author(s) 2020 Published in partnership with the Breast Cancer Research Foundation npj Breast Cancer (2020) 24 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png npj Breast Cancer Springer Journals

Somatic genetic aberrations in benign breast disease and the risk of subsequent breast cancer



Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2020
eISSN
2374-4677
DOI
10.1038/s41523-020-0165-z

Abstract

www.nature.com/npjbcancer
ARTICLE OPEN
Somatic genetic aberrations in benign breast disease and the risk of subsequent breast cancer

Zexian Zeng (1,2), Andy Vo (3), Xiaoyu Li (4), Ali Shidfar (5), Paulette Saldana (5), Luis Blanco (6), Xiaoling Xuei (7), Yuan Luo (1 ✉), Seema A. Khan (5 ✉) and Susan E. Clare (5 ✉)

It is largely unknown how the development of breast cancer (BC) is transduced by somatic genetic alterations in the benign breast. Since benign breast disease is an established risk factor for BC, we established a case-control study of women with a history of benign breast biopsy (BBB). Cases developed BC at least one year after BBB and controls did not develop BC over an average of 17 years following BBB. 135 cases were matched to 69 controls by age and type of benign change: non-proliferative or proliferative disease without atypia (PDWA). Whole-exome sequencing (WES) was performed for the BBB. Germline DNA (available from n = 26 participants) was utilized to develop a mutation-calling pipeline, to allow differentiation of somatic from germline variants. Among the 204 subjects, two known mutational signatures were identified, along with a currently uncatalogued signature that was significantly associated with triple negative BC (TNBC) (p = 0.007). The uncatalogued mutational signature was validated in 109 TNBCs from TCGA (p = 0.001). Compared to non-proliferative samples, PDWA harbors more abundant mutations at PIK3CA pH1047R (p < 0.001). Among the 26 BBB whose somatic copy number variation could be assessed, deletion of MLH3 is significantly associated with the mismatch repair mutational signature (p < 0.001). Matched BBB-cancer pairs were available for ten cases; several mutations were shared between BBB and cancers. This initial study of WES of BBB shows its potential for the identification of genetic alterations that portend breast oncogenesis.
In future larger studies, robust personalized breast cancer risk indicators leading to novel interception paradigms can be assessed.

npj Breast Cancer (2020) 6:24; https://doi.org/10.1038/s41523-020-0165-z

(1) Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA. (2) Department of Data Sciences, Dana-Farber Cancer Institute, Harvard T. H. Chan School of Public Health, Boston, MA, USA. (3) Committee on Developmental Biology and Regenerative Medicine, The University of Chicago, Chicago, IL, USA. (4) Department of Medicine, Brigham and Women’s Hospital, Boston, MA, USA. (5) Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL, USA. (6) Department of Pathology, Northwestern University Feinberg School of Medicine, Chicago, IL, USA. (7) Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA. ✉ email: yuan.luo@northwestern.edu; s-khan2@northwestern.edu; susan.clare@northwestern.edu

INTRODUCTION

From 1989 to 2016 the mortality rate for breast cancer (BC) in the United States decreased by 40%, a testament to the efficacy of targeted therapies, as well as to combinations and schedules of chemotherapeutics. During this same period breast cancer incidence rates remained static; evidence of both the paucity of novel, effective prevention strategies that target specific molecular risk pathways, and our inability to implement existing strategies. Major barriers are two-fold: hesitation among healthy women to accept drugs for a disease that they may or may not experience in the future; and their reluctance to experience side effects that impair quality of life and may compromise health. The first of these would be mitigated by improved identification of women at high risk of developing breast cancer, but almost 30 years after the initial publication of the Gail Model, breast cancer risk stratification remains imprecise and insensitive to breast cancer subtype. In an analysis of data from the Women’s Health Initiative, the Gail Model displayed modest ability to predict the risk of breast cancer (AUC = 0.58, 95% CI = 0.56–0.60). Among women at high risk of breast cancer, for example, those diagnosed with atypical hyperplasia, neither the Gail Model/Breast Cancer Risk Assessment Tool nor the Tyrer-Cuzick Model performed well [5,6]. This is a significant barrier to implementation of established medical interventions for disease prevention, and to the development of new, targeted intervention strategies for women at risk. Impactful, targeted prevention strategies require knowledge of how breast cancer risk is transduced at the molecular level in the breast itself, i.e., identification of somatic genetic changes that predate breast cancer and influence the biologic profile of cancers that emerge.

Benign breast disease is an established risk factor for BC [7,8], with 30% of BC cases reporting a history of benign breast disease. Of the 1.7 million breast biopsies each year in the U.S., about 75% return a diagnosis of benign breast disease, including atypical hyperplasia. This provides a window into the somatic genetic environment of the breast, prompting us to evaluate the genetic landscape of benign breast biopsy (BBB), and identify patterns associated with subsequent malignancy. Starting in the embryo [11], tissues accumulate DNA mutations over time [12]. Most of the mutations are repaired, many are inconsequential, but a few may lead to cancer [13,14]. Before there is any histologic evidence of invasive cancer, histologically normal and benign tissues contain molecular aberrations that are associated with malignancy [15,16]. For example, sun-exposed, normal eyelid skin has been shown to have a mutation burden of 2–6 mutations/MB/cell, a rate similar to that observed in many cancers. The processes that cause these mutations leave an imprint on the genome. In the sun-exposed eyelid epidermis, mutations occur within a pattern that mimics the Wellcome Trust Sanger Institute (WTSI) Mutation Signature, which is associated with ultraviolet exposure and its consequent CC > TT dinucleotide mutations at dipyrimidines. Exogenous or endogenous mutational processes, such as that which produced WTSI signature 7, are chemical reactions within DNA. While mutational processes are responsible for the creation of mutations, the mutations that are observed ultimately within a malignancy reflect a process of selection. However, genetic aberrations are not limited to somatic mutations, and we note that recurrent copy number variations (CNVs) are in fact more characteristic of invasive breast cancers than recurrent mutations.

To evaluate the molecular alterations that enable cancer development in the breast, we established a case-control study of BBB samples, the Benign Breast & Cancer Risk (BBCAR) Study. We performed whole-exome sequencing (WES) on the benign biopsies of patients who subsequently developed breast cancer (cases), and matched controls, who have not developed breast cancer to date. The cases and controls had similar degrees of benign change: non-proliferative or proliferation without atypia. The focus on non-atypical lesions was a deliberate choice as non-atypical lesions predict a generalized risk of subsequent breast cancer, occurring equally frequently in both breasts. They are also far more common than atypical changes, comprising over 90% of all breast biopsies, so that elucidation of their molecular profiles will impact the majority of women who undergo BBB. To the best of our knowledge, WES has not been performed in any previous case-control study of benign breast lesions without atypia. In addition to profiling the overall BBB mutational landscape, we have determined that mutations differ as a function of the type of benign breast disease, specifically that the PIK3CA pH1047R hotspot mutation is more frequent in proliferative disease without atypia (PDWA) compared to non-proliferative disease (p < 0.001); our data reveal a presently uncatalogued mutational signature associated with TNBC (p = 0.007), which was validated in 109 TCGA TNBC samples (p = 0.001); and we observed multiple recurrent CNVs, including an MLH3 deletion, which is significantly associated with a mismatch repair signature (p < 0.001).

RESULTS

Study design

A total of 204 subjects were enrolled in this BBB case-control study. Cases (n = 135) are women who have undergone a breast biopsy with specimen histology showing non-proliferative disease or proliferative disease without atypia (Supplementary Fig. 1) that predates the diagnosis of breast cancer by at least one year (Fig. 1a). The median interval from benign biopsy to the diagnosis of cancer is 7.3 (SD = 4.4) years. Controls (n = 69) are women who have not developed breast cancer and are matched for age of diagnosis (±2 years) and histology (Fig. 1b; Table 1). Controls were verified to not have been diagnosed with breast cancer as of 08/14/2018 (Supplementary Data 1).

Fig. 1 Case-control study of benign breast biopsy (BBB) samples. a Design of the BBCAR study. The sample tissues of subjects who subsequently developed breast cancer (cases), and their matched controls, who have not developed breast cancer to date, were studied. b A total of 135 cases, matched to 69 controls, were selected for whole-exome sequencing (WES). Case and control samples are matched by age and histology. c An illustration of the workflow to identify somatic mutations in the tissue samples that lack matched germline DNA. To train a model to distinguish germline variants and somatic mutations, previously consented donors were re-contacted (with IRB approval) and saliva specimens were requested for germline DNA sequencing. Orthogonal SNP array genotyping was further performed with 20 samples to compare and validate the performance of somatic mutation identification.

Table 1. Distributions of demographic data and tumor characteristics between the Case group and the Control group. Student’s t-tests were performed for continuous variables and Pearson’s Chi-squared tests were performed for categorical variables.

Variable | Total | Case (135) | Control (69) | P-value
Age, mean (SD) | | 49.7 (9.9) | 49.8 (9.6) | 0.96
Menopausal status, N (%) | | | | 0.87
  Pre | 114 | 76 (56.3%) | 38 (55.1%) |
  Post | 90 | 59 (43.7%) | 31 (44.9%) |
Histology class, N (%) | | | | 0.78
  Class 1 | 115 | 79 (58.5%) | 40 (58.0%) |
  Class 2 | 75 | 51 (37.8%) | 25 (36.2%) |
  NA | 10 | 5 (3.7%) | 4 (5.8%) |
ER status, N (%) | | | |
  Positive | | 109 (80.8%) | |
  Negative | | 23 (17.0%) | |
  Low | | 3 (2.2%) | |
Follow-up years (SD) | | 7.3 (4.4) | 16.6 (5.4) | <0.01
Has matched germline (%) | | 20 (14.8%) | 6 (8.7%) |
Has matched cancer (%) | | 10 (7.4%) | |

Class 1/non-proliferative: “Non-proliferation” and “Benign, NOS”.
Class 2/proliferative: “Proliferative lesion without atypia” (includes non-atypical hyperplasia, radial scar, sclerosing adenosis).

Somatic mutation identification

All 204 specimens were dissected using laser capture microdissection (LCM) and were subjected to WES. Within this cohort, matched germline DNA was obtained for 26 subjects and subjected to WES as well. To evaluate mutation caller performance in this benign tissue setting, 17 of the 26 sample pairs were subjected to genotyping (Fig. 1c; “Methods”, “Supplementary Materials and Methods”). Allele frequencies of the mutations common to the genotyping array and WES were compared. Mutations were categorized as false positive if allele frequency was discrepant between the two platforms. The mutation identification accuracy then varies as a function of the discrepancy allowance (Fig. 2a). Overall, we observed high consistency between the two platforms (85.4% when discrepancy allowance = 25%). Notably, MuTect2 consistently achieved better performance in this setting (Fig. 2a). Therefore, MuTect2 was selected as mutation caller for subsequent studies.

Fig. 2 Accurate somatic mutation identification in benign biopsies. a Comparison of the mutations between WES and genotype array. Somatic mutations were called using Mutect2, VarScan2, and VarDict. With different allele frequency discrepancy allowance, the overlap rate between two platforms was plotted. b Performance of different machine learning models in the test set. Penalized logistic regression (LR); linear SVM; random forest classifier (RFC); gradient boosted tree (GBT); k-nearest neighbor algorithm (K-NN); SVM with rbf kernel; multiple layer perceptron (MLP). c Orthogonal validation of the proposed model using 100 TCGA breast cancer samples and benchmark study with previously validated pipelines, including ISOWN NBC and ISOWN LAD tree. d Pipeline validation using genotype arrays of three samples. Somatic mutations were called using our pipeline and validated by genotyping. The plot shows the overlap rate between the two platforms with different allele frequency discrepancy allowance. e VAF distribution of the germline variants and somatic mutations, grouped by 26 benign biopsies with matched normal DNA, 178 benign biopsies lacking matched normal DNA, and 100 randomly selected TCGA breast cancer samples. f Distribution of silent and non-silent mutations, grouped by germline variants called in normal DNA that matched to the BBB, somatic mutations in BBB, and somatic mutations in TCGA breast cancer samples.

For the samples lacking matched normal DNA (n = 178), a machine learning model was developed to distinguish germline variants and somatic mutations (Fig. 1c; “Methods”; “Supplementary Materials and Methods”). With somatic mutations called for the 26 samples for which germline DNA was available, we systematically evaluated multiple machine learning approaches to distinguish somatic mutations and germline variants in benign biopsies. A total of 31 features were utilized for the model evaluation (Supplementary Table 1), including protein structure, pathogenicity prediction, population frequency, or evolutionary factors [24]. Various functional annotation or toxicity scores were derived from ANNOVAR, COSMIC (https://cancer.sanger.ac.uk/cosmic), and dbSNP/common (https://www.ncbi.nlm.nih.gov), along with intrinsic sequencing features, such as mutation allele frequency, depth of reference reads, and mutation frequency among the cohort. Grid search was applied to unbiasedly tune each model’s parameters using five-fold cross-validation on the training set. Performance was then evaluated on the held-out test set (“Methods”). Of the evaluated models, including penalized logistic regression (LR), linear SVM, random forest classifier (RFC), gradient boosted tree (GBT), k-nearest neighbor algorithm (K-NN), SVM with rbf kernel, and multi-layer perceptron (MLP), the MLP model achieved the best performance (Fig. 2b), with an F1-score of 0.96 (Supplementary Table 2).

Orthogonal validations of the proposed model were performed by evaluation studies with the TCGA data and benchmark studies with previously validated pipelines. Protected datasets in bam format of 100 randomly selected breast primary tumors were downloaded directly from the TCGA data portal. Realigned raw reads were subjected to base recalibration and were passed to Mutect2 for mutation detection. Mutect2 was run in so-called “tumor only mode” to call somatic and germline mutations. ISOWN, a previously validated pipeline for somatic mutation identification, was applied for somatic mutation prediction as well. The predicted results were evaluated by comparison to the TCGA somatic mutation data from the Multi-Center Mutation-Calling in Multiple Cancers (MC3 public v0.2.8) network. Using the TCGA MC3 data as ground truth, our model achieved an F1-score of 0.89 (Fig. 2c) in predicting somatic mutations. Even though designed and trained in the benign-biopsy setting, our model (F1 = 0.89) obtained similar or better results than previously validated pipelines, such as ISOWN NBC (F1 = 0.88) and ISOWN LAD tree [26] (F1 = 0.80), in predicting somatic mutations in TCGA cancer samples (Fig. 2c).
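As a rough illustration of the benchmark described above, the following sketch computes precision, recall and F1 for a set of predicted somatic calls against a reference call set treated as ground truth (the role the MC3 data play in the text). The function name `score_calls`, the `(chromosome, position, ref, alt)` matching key and the toy variants are hypothetical choices made for illustration; the text does not state how calls were matched between call sets.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def score_calls(predicted: Iterable[Variant], truth: Iterable[Variant]) -> Dict[str, float]:
    """Return precision, recall and F1 of predicted calls against a truth set."""
    pred_set: Set[Variant] = set(predicted)
    truth_set: Set[Variant] = set(truth)

    tp = len(pred_set & truth_set)   # calls present in both sets
    fp = len(pred_set - truth_set)   # predicted but absent from the truth set
    fn = len(truth_set - pred_set)   # in the truth set but not predicted

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy coordinates, not real variants.
    truth = [("chr3", 100, "A", "G"), ("chr17", 200, "C", "T"), ("chr1", 300, "G", "A")]
    predicted = [("chr3", 100, "A", "G"), ("chr17", 200, "C", "T"), ("chr2", 400, "T", "C")]
    print(score_calls(predicted, truth))
```

Matching on exact position and alleles is the simplest possible rule; a real comparison would also need to settle normalization of indels and which regions are callable in both data sets, neither of which is covered here.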
We further applied our pipeline and model to identify somatic mutations in the 178 BBBs lacking matched normal DNA (Fig. 1c) (Methods). Overall, the average read depth for the identified somatic mutations is 99, whereas the average VAF is 0.232. To estimate the overall mutation identification accuracy, we randomly sampled and genotyped three samples from our cohort. Overall, we observed high consistency between our pipeline and the genotype array (82.5% when discrepancy allowance = 25%) (Fig. 2d). As a sanity check, the distributions of variant allele frequency (VAF) and non-silent mutations were examined. Consistent with previously reported studies [28,29], the majority of our identified germline variants’ VAFs are around 50% and 100%, whereas somatic mutations display much lower VAFs (Fig. 2e). For cancers, non-silent mutations usually account for 2/3 of somatic mutations with the remaining 1/3 being silent mutations, whereas germline mutations are expected to have a higher number of silent mutations. In our data, we have observed a similar distribution (TCGA 100 breast cancer: 72% non-silent mutations) (Fig. 2f). In addition, we observed an increasing proportion of non-silent mutations across BBB matched normal DNA (32%), BBB (63%), and TCGA cancer samples (72%) (Fig. 2f). Of note, the average number of non-synonymous mutations for the 26 BBBs with matched normal DNA is 114, whereas the average number for the 178 BBBs without matched normal DNA is 127.

Mutation catalogues

Among the 204 samples, 36,801 somatic base substitutions and 2283 small INDELs were identified. The majority of the mutations were missense mutations (Fig. 3a). Cases had a mean of 6.2 mutations/MB (SD = 3.6) and controls had 6.8 mutations/MB (SD = 3.0). No significant difference was observed in the numbers of mutations between the cases and controls (Fig. 3a). Among the top 20 mutated genes, the case group and control group shared common genes (MUC17, OBSCN, FLG2, GLTPD2, ABCA13, PIK3CA) (Fig. 3b). Approximately one-fifth of both cases and controls display PIK3CA mutations, with the highest frequency at pH1047R. When corrected by gene length, case and control still shared common genes (MUC17, SLC7A4, FLG2, GLTPD2, PGBD1, PLA2G3, ADAM30) (Fig. 3c). Mucins are O-glycosylated by the addition of N-acetylgalactosamine to the hydroxyl group of serine or threonine. Therefore, we evaluated the number of missense mutations within MUC17 that resulted in the gain or loss of either serine or threonine residues. Of the MUC17 mutations we observed, 8.7% of missense mutations would be predicted to result in the loss of serine, 16.8% in the loss of threonine, 14.2% in the gain of serine and 17.8% in the gain of threonine. However, there was no significant difference between cases and controls (Supplementary Table 3). The proportions of nonsense mutations also vary between samples. The majority of nonsense mutations were frame shift insertions and stop gains, with some exceptions in a few samples (Fig. 3d).

Fig. 3 Catalog of somatic mutations in 204 benign breast biopsies. a Catalog of base substitutions, insertions/deletions in the 204 BBBs. Each bar represents one individual’s total number of mutations. Left panel is the case and right panel is the control. b The top 20 mutated genes in the case group (left) and control group (right). c The top mutated genes as b, adjusted by gene length. d Catalog of nonsense mutations in the 204 BBBs.

Genes enriched for mutations in the cases or PDWA

To determine the enrichment of mutations in the case group, a logistic regression model was fit for each gene, with case/control as output variable and mutation status as input variable. The p-values were derived from the fitted models for gene sorting (Fig. 4a). Nonsynonymous mutations in four cancer-associated genes, CTNNA2 (11.1% vs. 5.8%; log10 p-value = −0.6), FLG (8.9% vs. 4.3%; log10 p-value = −0.6), GNAS (4.4% vs. 1.4%; log10 p-value = −0.5), and BCORL1 (17.0% vs. 11.6%; log10 p-value = −0.5), were more abundant in the case group. Of note, the same analyses including synonymous mutations are presented in Supplementary Fig. 2.

Rohan and colleagues utilized targeted sequence capture to identify mutations present in a panel of 83 genes in the benign breast disease tissue from a case-control study. While they identified somatic mutations in a number of genes frequently mutated in breast cancer, no significant differences were identified comparing cases and controls with regard to the mutational burden, genes mutated, type of mutation or pathway. We queried our data for the mutations present in these same 83 genes. Our data for all variants was very similar to theirs (Fig. 4b),
ab Ca se Control Ca se Control Log10 all variants variants with VAF 25% % % Gene % % pval Controls Cases Controls Cases DMD 11.9 2.9 -1.3 TSKS 12.6 4.3 -1.1 ZNF559 8.9 2.9 -0.9 DENND2C 6.7 1.4 -0.9 TENM4 6.7 1.4 -0.9 PAQR5 6.7 1.4 -0.9 BEST3 6.7 1.4 -0.9 AHNAK 28.1 18.8 -0.8 RGSL1 10.4 4.3 -0.8 NLGN4X 8.1 2.9 -0.8 MYO15B 5.9 1.4 -0.8 APOB 7.4 2.9 -0.7 HTT 5.2 1.4 -0.7 ZNF423 5.2 1.4 -0.7 HECTD4 5.2 1.4 -0.7 PLXNA1 5.2 1.4 -0.7 5.2 1.4 -0.7 PTPRS TTC14 5.2 1.4 -0.7 CTNNA2 11.1 5.8 -0.6 FLG 8.9 4.3 -0.6 MAGI3 12.6 7.2 -0.6 RPGRIP1 14.1 8.7 -0.6 SHCBP1L 6.7 2.9 -0.6 PHRF1 10.4 5.8 -0.5 PLXDC1 10.4 5.8 -0.5 KIF6 4.4 1.4 -0.5 20 15 10 0 0 10 15 20 5 5 15 1051 05 0 0 15 MAMDC4 4.4 1.4 -0.5 Non-silent mutations Non-silent mutations DDX42 4.4 1.4 -0.5 DUS1L 4.4 1.4 -0.5 GNAS 4.4 1.4 -0.5 p < .00001 p = 0.00001 THNSL1 4.4 1.4 -0.5 NP PDWA NP PDWA ZNF595 4.4 1.4 -0.5 PIK3CA PIK3CA pH1047R ZNF782 4.4 1.4 -0.5 100 100 Mutated Mutated FAM209B 4.4 1.4 -0.5 LRP1 4.4 1.4 -0.5 Wild Wild 80 80 TOPORS 4.4 1.4 -0.5 TRIO 4.4 1.4 -0.5 ZNF627 4.4 1.4 -0.5 60 60 BCORL1 17.0 11.6 -0.5 TMC6 11.9 7.2 -0.5 RYR2 8.1 4.3 -0.5 DIDO1 8.1 4.3 -0.5 20.0 14.5 -0.5 MUC5B 32% SLC3A1 5.9 2.9 -0.5 20 20 -0.5 5.9 2.9 ZNF845 MUS81 5.9 -0.5 2.9 0 0 Onc TS % mutant Fig. 4 Genetic aberrations that distinguish case/control or proliferative/non-proliferative BBB. a For each gene, the percentage of mutated lesions in the case and control are shown. In the left panel, known oncogenes are highlighted as green, and known tumor suppressor genes are highlighted as orange. Onc is a known oncogene; TS is a tumor suppressor gene (Cancer Gene Census; https://cancer.sanger.ac.uk/census). The right panel shows the nonsynonymous rate in each group. b Mutations in 83 selected genes presented by Rohan et al. to validate our data and also to facilitate comparisons. Left: all non-silent mutations; right: non-silent mutations with VAF > 25%. c PIK3CA non-synonymous mutations identified in NP BBB and PDWA BBB. 
d Same as c, with only mutations at PIK3CA pH1047R retained.

which orthogonally validated our data quality. Nonetheless, differences were observed after filtering for variants with a VAF > 25%; in particular, while no variants in NCOA3 had a VAF greater than 25% in the controls, over 10% of cases passed this threshold (Fig. 4b).

We also evaluated mutation enrichments in benign biopsies showing proliferative disease without atypia (PDWA) (n = 76) versus non-proliferative (NP) disease (n = 119). Using non-synonymous mutations only, the top significantly enriched genes are PIK3CA, HYDIN, DNMT3B, and AKT1 (detail of hotspots in Supplementary data 2). PIK3CA mutations are markedly enriched in PDWA compared to NP (31% vs. 12%; p = 0.00001) (Fig. 4c). Specifically, pH1047R is the most enriched hotspot in PDWA (28% vs. 5%; p < 0.00001) (Fig. 4d).

Mutational processes and CNV
Mutations are non-random and occur within sequence motifs. These motifs provide evidence from which we can infer the process that created the mutations. Recent studies led by investigators at the Wellcome Trust Sanger Institute (WTSI) present the somatic mutation data as a 96-element vector, which captures the immediate 5′ and 3′ neighbors of the mutated nucleotides. Employing non-negative matrix factorization (NMF), 30 “mutational signatures” were produced by these studies [19,32], a catalog which more recently has been updated and expanded to 40. We hypothesized that, like the eyelid epidermis, benign breast lesions also harbor somatic mutations with associated mutational signatures that may provide clues to etiologic processes. Within the BBB cohort, mutational signatures were examined. Three mutational signatures were identified in both the case and control groups (Supplementary data 3). In both groups, we identified the “aging” signature (cataloged by WTSI as Signature 1b; Fig. 5a; cosine similarity score: 83.2% for the cases and 83.0% for the controls), which is the putative result of the hydrolysis of 5-methylcytosine. We also identified the “mismatch repair” signature (cataloged by WTSI as Signature 6; Fig. 5a; cosine similarity score: 80.5% for the cases and 80.1% for the controls). Moreover, a signature not currently in the WTSI catalog of Mutational Signatures was identified in each group; both demonstrate enrichment of T > G mutations, with 5′TTC3′ > 5′TGC3′ the most frequently mutated trinucleotide motif (Fig. 5a). Provisionally, we have named this signature “O/TN” based on the presumed mechanism, oxidation, and on its presumptive association with triple negative (TN) breast cancer.

The process of deriving mutational signatures is an unsupervised learning process. Pooling the cases and controls, we derived three signatures in the BBB cohort, namely aging, mismatch repair, and O/TN. In an association study, we found that O/TN was significantly associated with BBB that predate TNBC (p = 0.007) (Fig. 5c). We also performed a second association analysis, controlling for the potential covariates of age, menopausal status, and histology class (NP or PDWA). The association remained significant (p = 0.016), suggesting that the O/TN signature in BBB is predictive of TNBC. To validate the O/TN signature and examine whether it is a predictor of TNBC as well, we further retrieved 109 TNBC samples from the TCGA data portal. The downloaded somatic mutation data were processed, and three mutational signatures were derived under the same protocol as for the BBB (Methods). As a result,

[Fig. 5 panel labels. a: Aging, Mismatch repair, O/TN, for case and control. b: CASES; Amp, Del; nSamples; -log10(qvalues); labeled fragile sites FRA1F, FRA19B/A, FRA9D, FRA10G, FRA1A, FRA4D, FRA7F, FRA7D. c: TNBC vs. Non TNBC, p = 0.007. d: MLH3 naive vs. MLH3 wild, p = 4E-6.]

Fig.
5 Mutational processes and somatic copy number variation (CNV) identified in the case and control groups. a The identified mutation signatures were compared with those of the Wellcome Trust Sanger Institute. The aging signature and mismatch repair signature (MMR) are enriched in both groups. The uncatalogued signature O/TN is enriched with T > G/A > C mutations, with 5′GAA3′ > 5′GCA3′ the most frequently mutated trinucleotide. b Recurrent somatic copy number variation in the case group. Common fragile sites are labeled. The size of the dots represents the q-value (FDR adjusted p-value). Red are amplifications, blue are deletions. y-axis is the number of genes involved, and x-axis is the number of samples involved. c The uncatalogued signature is enriched in BBBs that predate triple negative breast cancer. Each sample is assigned a continuous number representing the signature exposure strength, which was the product of matrix decomposition. d The mismatch repair signature is highly abundant in the BBB with MLH3 deletion (MLH3 naive). Error bars are 95% confidence interval (CI).

we were able to identify the O/TN signature in the TCGA TNBC cohort (Cosine = 0.72, p = 0.001).

A majority of breast tumors, especially those that are HER2 positive, have been reported to be enriched with mutations hypothesized to result from the action of the APOBEC enzymes. In our cohort, no tumors were found to be enriched with mutations within the APOBEC motif, nor did we observe either WTSI Signatures 2 or 13, both of which are hypothesized to be the result of the activity of these enzymes. We have also examined the subset of 11 BBB that eventually developed HER2 positive cancer and the subset of 29 BBB that developed cancer within 3 years of biopsy, and we found no APOBEC signatures enriched in these BBB.

We also employed VarScan2 to study somatic CNV in the 26 samples for which we have matched normal DNA. The learnt segments were then passed to GISTIC2 for the study of recurrent CNVs (genome-wide CNV variation: Supplementary Fig. 3). We observed that the majority of the cytobands occur at or immediately next to common fragile sites, suggesting these cells are under considerable replication stress (Fig. 5b). The observed cytobands at which CNVs map exclusively in the cases have been associated with cancers in general, or invasive breast cancers in particular. Amplifications are hypothesized to be the result of breakage-fusion-bridge (BFB) cycles triggered at the induction of fragile sites. One of the amplifications identified in the BBCAR cases is an amplification outlier identified using breast cancers from the METABRIC consortium that mapped to chr19q13.33, which contains 26 genes. No candidate oncogene has yet been identified within this amplicon [38]. Chromosome 1q21 is the fourth most frequent locus of copy number variations in cancer.

To investigate the mechanisms underlying our mismatch repair signature, mismatch repair genes MLH1, MLH3, MSH2, MSH3, MSH6, PMS2, MUTYH, MYH11, SETD2 and TGFBR were examined for deleterious mutations and/or deletion in the subset of samples with matched germline DNA available (N = 26). Approximately one-third of the cases and controls have at least one mutation in one of the mismatch repair associated genes (Supplementary data 4). PMS2 is deleted in one-half of the cases (10/20) and MLH3 in all of the controls (6/6). However, only one of the 10 cases displaying a PMS2 deletion also evidenced a mutation in an MMR associated gene, specifically SETD2. None of the controls with MLH3 deletions carried a mutation in any of the MMR associated genes. Strikingly, benign biopsies harboring an MLH3 deletion are markedly enriched with the mismatch repair signature compared to MLH3 wild biopsies (p = 4.2E-6) (Fig. 5d).

Cancer risk prediction at BBB
In an attempt to build a model for cancer prediction at the time of BBB using somatic information, we fit logistic regression with L1 penalty using the case/control status as output variable. To reduce the number of input prediction features, all somatic mutations that were annotated with the same protein domain were aggregated as a continuous number, representing the mutation burden of the corresponding protein domain. In total, 1966 annotated protein domains were utilized as input features for case/control prediction (Supplementary data 5). To evaluate the model and features, we performed a bootstrapping by randomly splitting the BBB samples at a 7:3 ratio, training the model on 70% of the samples and using the remaining 30% as the test set. We repeated the process ten times and obtained an AUC for each run. As a result, we obtained an AUC score of 67% (95% CI = 63.1–70.9%) in predicting the cases. Of note, the inclusion of clinical characteristics and demographics, including age at the time of BBB, age at menarche, age at first live birth, family history of breast cancer in a first-degree relative, and histologic variable (proliferative vs non-proliferative), did not improve the model’s performance.

Somatic mutations present in both benign biopsy and cancer
Our cases were defined as BBB that predate breast cancer. In this study, to longitudinally compare mutations in the BBB and in the cancer samples, we retrieved ten tumors that matched to our BBB cohort. Preprocessing for mutation calling was performed as for the BBB, including laser capture microdissection (LCM), DNA extraction, library construction, sequencing, alignment, mutation calling, and variant filtering. Of the identified mutations in these ten cancer samples, 957 were observed in both the benign biopsies and cancer tissues (Supplementary data 6). The average allele frequency for these mutations is 32.2% (SD = 18.7%) in the BBB and 46.7% (SD = 17.3%) in the cancer tissues. FAT1, CTNNA2, ATR and ETAA1 were among the top ten mutated genes (Supplementary Table 4); these are known tumor suppressor genes or oncogenes. All six of the CTNNA2 mutations occur within the motif 5′GAA3′ > 5′GCA3′. This motif is a predominant feature of our O/TN mutation signature (Fig. 3).

DISCUSSION
Genetic aberrations associated with malignancy occur within normal tissues [17] and within tissues at the population risk of breast cancer [15] as well as within lesions at substantial risk [16]. A previous case-control study performed by Rohan et al., with a design that closely mirrors ours, utilized targeted sequence capture; no significant differences between cases and controls with regard to somatic mutations were identified and no mutations were shared between the biopsy and tumor pairs. Comparing the number of somatic mutations identified in their targeted genes with these same genes in our WES data revealed striking similarity; to make the similarity easy to discern, we designed Fig. 4b to mirror their Fig. 1a, b. Soysal and colleagues also employed targeted sequencing in an attempt to identify somatic mutations present in antecedent fibrocystic disease (FD) and subsequent invasive breast cancers. In contrast to our study and that of Rohan et al., no significant somatic mutations were identified in the FD.

In their discussion section Rohan et al. suggested that “more detailed approaches (e.g., exome/whole-genome sequencing)” might prove more informative than targeted sequencing. We employed WES in a similar case-control setting. We rigorously evaluated the sequencing quality, mutation calling, and mutation classification. Since we did not have germline samples available from most of our subjects, we developed a neural network model to predict somatic mutations for the benign biopsies, which we were able to accomplish with an F1 score of 96%. This tool was further validated in TCGA (MC3) data with an F1 score of 89%. Using the sequencing data produced, we have identified recurrently mutated genes. We also built a predictive model for the risk of breast cancer using genetic information alone and obtained an AUC of 67% (95% CI = 63.1–70.9%). This represents the best performance to date using benign breast lesions, despite the exclusion of subjects with atypical hyperplasia [42]. Importantly, we have identified a currently uncatalogued signature, which we have designated O/TN, that is associated with triple negative breast cancer (p = 0.007), which was validated in 109 TCGA TNBC samples (p = 0.001); we found that the PIK3CA pH1047R hotspot mutation is more frequent in proliferative disease without atypia (PDWA) compared to non-proliferative disease (p < 0.001); we observed multiple recurrent CNVs as well, including an MLH3 deletion, which is significantly associated with a mismatch repair signature (p < 0.001).

This study has several strengths and weaknesses. The specimens are richly annotated with clinical information (Supplementary Data 1) and they have been laboriously microdissected in order to sample the epithelial compartment. The controls have a long median follow up and were verified not to have been diagnosed with BC by a telephone interview carried out at the time of this study. We have leveraged the advantages of machine learning/artificial intelligence to enable the calling of somatic mutations in the absence of germline data.

we report here must be regarded as preliminary until larger numbers can be studied. We were able to obtain germline specimens on only 26 subjects. Data from these 26 specimens was utilized to build the Panel of Normals (PoN) for germline variant filtering; GATK recommends a minimum of 40. Using fewer than the suggested minimum may result in suboptimal denoising of the data and may not capture all the common germline variants. Since all subjects consented to participation and to recontact, we are working actively to acquire additional germline samples. Finally, the use of formalin-fixed paraffin embedded breast samples, although unavoidable in this setting, risks introducing artefactual findings. Among the FFPE artifacts are C to T transitions hypothesized to be the result of the deamination of cytosines [43,44], and both Rohan et al. [31] and Soysal et al. [41] screened for these substitutions. However, a large, prospective study carried out by the 100,000 Genomes Project argues that the choice of mutation caller as well as tissue heterogeneity or sampling may contribute to differences between FFPE and frozen tissue [45]. Our O/TN signature is not dominated by C to T transitions. Although our MMR and aging signatures are populated by C to T transitions, we think it unlikely that these are due to formalin-induced deamination, as our signatures closely mirror those of WTSI 6 and 1b, respectively, which were derived from frozen specimens. Considering the risk of ruling out true mutations, therefore, we did not attempt to account for the FFPE artifacts in our pipeline.

In contrast to our findings, Soysal et al. found no specific mutations in their study of benign breast lesions. Their depth of sequencing was more than adequate but only 17 patient samples were included, with lesions that they called “fibrocystic breast disease with or without UDH, FEA, or CCL”. These lesions, with the exception of flat epithelial atypia, were also included in our study, making it unlikely that the choice of histology is determinative. We should consider the possibility that single nucleotide somatic mutation is not the correct genetic determinant of risk. Single cell sequencing of synchronous DCIS and invasive ductal carcinomas has revealed that CNV is an early oncogenic event, i.e., present in in situ lesions, and that no additional CNV events are acquired during the transition from in situ to invasive lesion. In a study separate from the one referred to earlier, Soysal et al. showed that ESR1 gene amplifications are an early event in breast carcinogenesis and are already present, at least in part, in FD. Additionally, recurrent CNVs are more characteristic of invasive breast cancers than are recurrent mutations. Key breast cancer phenotypes, including intrinsic molecular subtypes, estrogen receptor status, and TP53 mutation status as well as proliferative status and estrogen-signaling pathway activity can be predicted by DNA copy number features alone [48].

Lesions such as hyperplasia, not all of which are obligate precursors of malignancy, already show evidence of activation of DNA damage response pathways. This is a response to oncogene-induced DNA replication stress causing unscheduled S-phase entry with consequent aberrant replication structures and DNA damage, which activate ATR/Chk1, ATM/Chk2, and p53, ultimately preventing progression by arresting growth or triggering cell death. Intriguing, with regard to our data, is the fact that in the early lesions that are genetically most stable, loss of heterozygosity at known fragile sites is observed to occur 3–15 times more often than expected from random targeting of these sites. Fragile sites were also noted to be targeted during the period in which DNA damage response is maximal. These data suggest a model in which oncogene activation is an early event in at-risk tissue and that cells activate the ATR/ATM-regulated DNA damage responses that delay or prevent malignant progression. This may explain why we observe equivalent somatic mutations, e.g. PIK3CA (H1047R), in cases and controls.
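The O/TN motif is written above both as 5′GAA3′ > 5′GCA3′ and as a T > G/A > C change, and the FFPE argument turns on whether a signature is dominated by C to T transitions. The following is a minimal sketch, not the authors' pipeline, of how a trinucleotide substitution can be folded onto the pyrimidine-centred convention used for 96-class mutational profiles. The function names (`to_pyrimidine_context`, `channels`, `is_c_to_t`) are hypothetical and introduced only for illustration.

```python
# Illustrative sketch only: helper names are hypothetical, not from the study's code.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def to_pyrimidine_context(ref_tri, alt_tri):
    """Return (ref, alt) trinucleotides with a C or T at the centre.

    A substitution reported at a purine (A or G) is re-expressed on the
    opposite strand by reverse-complementing both trinucleotides.
    """
    if len(ref_tri) != 3 or len(alt_tri) != 3:
        raise ValueError("expected two trinucleotides")
    flanks_match = ref_tri[0] == alt_tri[0] and ref_tri[2] == alt_tri[2]
    if not flanks_match or ref_tri[1] == alt_tri[1]:
        raise ValueError("only the central base may differ")
    if ref_tri[1] in "CT":
        return ref_tri, alt_tri
    return reverse_complement(ref_tri), reverse_complement(alt_tri)


def channels():
    """Enumerate the 96 classes: 6 substitutions x 4 five-prime x 4 three-prime bases."""
    out = []
    for ref in "CT":
        for alt in BASES:
            if alt == ref:
                continue
            for five in BASES:
                for three in BASES:
                    out.append((five + ref + three, five + alt + three))
    return out


def is_c_to_t(ref_tri, alt_tri):
    """True if the substitution is a C to T transition on either strand."""
    ref, alt = to_pyrimidine_context(ref_tri, alt_tri)
    return ref[1] == "C" and alt[1] == "T"


print(to_pyrimidine_context("GAA", "GCA"))
print(len(channels()))
```

Under this convention the CTNNA2 motif 5′GAA3′ > 5′GCA3′ and the signature motif 5′TTC3′ > 5′TGC3′ map to the same class, which is why the two notations can be used interchangeably.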
Weaknesses include the relatively small size of the study and the lack of an independent validation dataset, so that the findings

The ATR and ETAA1 mutations that we observed in the BBCAR specimens and their matched tumors may be the specific mutations that enable oncogenic progression in the cases. Inactivating mutations, including any in the ATM/Chk2 or ATR/Chk1 pathways, potentially would remove the barrier to progression and result in cell proliferation and survival, increasing genomic instability and tumor progression.

Our O/TN signature is enriched with T > G/A > C mutations, with 5′TTC3′ > 5′TGC3′ the most frequently mutated trinucleotide motif. These single nucleotide T > G transversions are observed in vitro when equimolar oxidized dGTP (8-O-dGTP) is included in the nucleotide pool. Strand information is lost between the initial occurrence of a mutagenic lesion and the ultimate readout by DNA sequencing. Conventionally, mutational signatures are displayed with a mutated pyrimidine at the center of the trinucleotide motif. The complement to 5′TTC3′ > 5′TGC3′ is 5′GAA3′ > 5′GCA3′. There is a 4- to 5-fold difference in the 8-O-dGTP mutation rate depending on the sequence context, with 5′GAA3′ being a favored context. The nucleotide pool is sanitized by MTH1, which hydrolyzes cytotoxic oxidized dNTPs, preventing them from becoming mis-incorporated into DNA during replication or repair. Even with this cellular sanitizing activity, nucleotide pools contain enough 8-oxo-dGTP to promote mutagenesis [51,52]. Mutagenesis results from the insertion of 8-O-dGTP across from adenine rather than cytosine during DNA replication. Steric hindrance of the oxygen of cytosine (C) in the anti-conformation with the triphosphate group of the 8-oxo-dGTP also in the anti-conformation prevents Watson-Crick base pairing. However, 8-oxo-dGTP can assume the syn conformation, enabling Hoogsteen base pairing with adenine (A). This A (template):G^ox (nascent) base pairing results in T > G/A > C following two additional rounds of replication.

The association of triple negative breast cancer (TNBC) with our O/TN signature is intriguing. About 80% of TNBCs are of the basal-like subtype and this subtype likely originates from luminal progenitor cells [55,56]. We hypothesize that the O/TN signature results from deficient repair of a specific oxidative lesion as discussed above. The levels of reactive oxygen species and antioxidant defenses have been assayed in both luminal progenitor (LP) and basal cells of normal human mammary tissue. Higher levels of both superoxide anion and hydrogen peroxide are present in the LPs. Even though multiple antioxidants are deployed, LPs display higher levels of oxidative damage, specifically increased incorporation of 8-oxo-deoxyguanosine (8-oxo-dG) within the genomic DNA. Therefore, the association of TNBC with our O/TN signature may reflect the susceptibility of its precursor LP cells to oxidative damage, placing them at a disadvantage if this damage cannot be adequately addressed due to mismatch repair deficiency.

MLH3 deletion was strongly associated with our MMR signature. The trinucleotide motifs most frequently mutated in our MMR signature and WTSI Signature 6 are 5′GCG3′ > 5′GTG3′, 5′CCG3′ > 5′CTG3′, and 5′ACG3′ > 5′ATG3′. These mutations are hypothesized to be due to an error in the replication of 5-methylcytosine (5mC). Tomkova et al. have advanced a model which posits that wildtype Pol ε has slightly decreased fidelity when encountering 5mC, particularly in a GCG context, on the template strand and incorrectly pairs it with A, leading to 5mC:A mismatches. They note that there is high structural similarity between 5mC and T, both of which present a methyl group at the same position of the pyrimidine ring. If the resulting 5mC:A mismatches are not repaired before the next round of replication due to dysfunctional mismatch repair, one would expect an enrichment of NCG > NTG mutations. Given this hypothesized etiology of the mutations, is there evidence that MLH3 repairs such mutations? Sequencing of the tumors arising from the cross of Apc1638N mutant mice with Mlh3-nullizygous and Mlh3−/−; Pms2−/− mice reveals that the C:G > T:A transition mutations, irrespective of MMR genotypes, occurred at either CpG dinucleotides or CpNpG sites, typical targets for DNA methylation. Thus, although our numbers are small, it appears that our MMR signature in the controls may result from MLH3 deletion; as the same signature is observed in the cases, another mechanism of MLH3 silencing, such as promoter methylation, may be operative in the cases.

The ten most frequently mutated genes shared between the BBB of cases and their tumors are given in Supplementary Table 4. Among them are FAT1, CTNNA2, ATR, and ETAA1. FAT1 has the most mutations, which is interesting as this same gene was shown to have a statistically significant excess of inactivating mutations across all classes in the sun-exposed, physiologically normal epidermis study. FAT1 encodes a cadherin-like protein and its inactivation via mutation may lead to tumorigenesis by multiple avenues [60,61]. From a breast cancer standpoint, investigations into the etiology of CDK4/6 inhibitor resistance have provided significant clues to FAT1’s role as a tumor suppressor. Loss of FAT1 activity results in increased expression of CDK6, consequent to dysregulation of the Hippo pathway. ATR and ETAA1 have been discussed earlier regarding their function as barriers to progression. We hypothesize that the CNV we have observed is due to replication stress. Replication stress leads to stalled replication forks, and if ATR or ETAA1 mutation renders the proteins unable to stabilize the forks and allow time for repair, further genomic instability is likely to ensue. ATR also specially regulates fragile site stability. While admittedly our number of matched BBB and tumors is limited, the data from these specimens suggest that, later in oncogenesis, mutations in ATR pathway members, i.e., ATR and ETAA1, are being selected for, as they are observed in both the benign biopsy and its matched tumor. We note that ATR haploinsufficiency in a mismatch repair deficient background has been shown to result in dramatic increases in fragile site instability, amplifications and rearrangements, and in decreased tumor latency.

In summary, we have taken an initial step towards what will be a series of investigations of somatic DNA changes in the unaffected breast, which will help define alterations that put women at substantially elevated BC risk. Such studies will also provide the possibility of estimating the time frame of that risk, so that women are able to make practical decisions regarding the interventions that they choose to adopt. We have shown that such work is feasible, with sequencing quality that meets current standards in the field, that somatic sequencing data can be inferred and interpreted even in the absence of matched germline data, and that intriguing, cancer-relevant findings emerge.

METHODS
Sample collection
At the Northwestern Feinberg School of Medicine, we designed a case-control study of BBB samples (BBCAR Study). Subjects were identified through searches of the Enterprise Data Warehouse of Northwestern Medicine (NM), and at the Lynn Sage Breast Center of NM, under IRB-approved protocol NU 09B2. The major eligibility criterion required a history of benign breast biopsy performed at NM, at least 1 year prior to cancer diagnosis for cases. Eligible subjects provided written informed consent for the use of their BBB blocks after the nature and possible consequences of the study were explained, and completed a survey detailing breast history and breast cancer risk factors. We have retrieved the BBB paraffin blocks of subjects who subsequently developed breast cancer (cases) and from age-matched controls, who have not developed breast cancer to date. The participants are contacted periodically to confirm that controls have not transitioned to cases. A subset of 135 cases, matched to 69 controls, was selected for WES (“Supplementary Materials and Methods”). All subjects included in this analysis were of European descent. Case and control samples are matched by age and histologic class (non-proliferative benign change, or proliferation without atypia). DNA was isolated from the LCM epithelium and sequenced using the Illumina HiSeq4000. WES was conducted with a sequencing depth of 80–100× and 80–90 million sequencing reads were generated for each sample (Supplementary Materials and Methods).

Parallel alignment of whole-exome analysis
We adapted widely used open source software for genome alignment and variant calling. Read alignment and variant calling were performed according to the Broad Institute’s Genome Analysis Toolkit (GATK) best practices pipeline. Reads were aligned to the human reference genome (hg19) using Burrows-Wheeler alignment and Picard 2.6 was subsequently used to sort reads and mark duplicates (Fig. 1c). To reduce systematic errors, sorted BAM files were generated separately for each sequencing lane on which the reads were produced. By doing so, various lane-specific technical artifacts can be removed during the duplicate marking and base recalibration steps. Base recalibration was done using GATK 3.6 with dbSNP build 138 as a training set. Mutations were called and filtered using MuTect2 in the GATK package. To capture recurrent technical artifacts, we generated a Panel of Normals (PON) for Mutect2 analysis using the 26 sequenced germline DNA samples. The PON is created by running the variant caller Mutect2 individually on the normal samples and combining the resulting variant calls with the criteria of excluding any sites that are not present in at least 2 normals, which is the default cutoff [69]. For the samples without matched normal DNA available, we ran Mutect2 in the so-called “tumor only” mode with PON filtering to call mutations. To obtain a set of mutations with the highest sensitivity, VarScan2 [35] and VarDict [70] were also applied for mutation calling. To further ensure a high precision call rate, we filtered out all mutations with read depth <20. After filtering, mutations were then annotated using SNPEFF [71], VEP [72], and ANNOVAR [25].

Somatic mutation identification
Our initial objective was to develop and test a predictive model for somatic mutation identification. A significant challenge for this study, and for others seeking to identify somatic mutations in archived tissue samples, is the lack of matched germline DNA. Therefore, to prepare a ground truth, previously consented donors were re-contacted (with IRB approval) and saliva specimens were requested for normal DNA sequencing. Matched germline DNA was obtained for 26 of the 204 BBB specimens which had been selected for WES. For these 26 paired samples, a set of somatic mutations was generated using Mutect2 tumor-normal pair mode with PON filtering. Independently, for these BBB samples, a set of mutations was generated using Mutect2 tumor-only mode with PON filtering. This is the mode to be used for the rest of the BBBs without matched normal DNA. However, mutations generated in this mode contain germline variants. To rule out these germline variants, we overlapped this set of mutations with their BBB’s germline variants, which were generated using the GATK HaplotypeCaller. The overlapping variants were then labeled as germline variants and, together with the somatic mutations, were used for model evaluation. We systematically evaluated multiple machine learning models and adopted a Multi-Layer Perceptron (MLP) for somatic mutation classification. Features in the prediction model included intrinsic sequencing features, such as mutation allele frequency, depth of reference reads, and number of appearances in the cohort, as well as published collated data providing the frequency of the variant in the population and predictions of the impact of amino acid changes on the structure and function of the encoded protein. The model obtained an accuracy of 95% for somatic mutation in the test set (“Supplementary Materials and Methods”). Orthogonal SNP array genotyping was performed to compare and validate the performance of mutation calling and mutation classification. Technical validation was performed for 17 of the 26 specimens for which matched germline data was available, and 3 of the specimens without matched germline, using the Infinium Exome-24 v1.1 beadchip (“Supplementary Materials and Methods”). The case group is defined as benign biopsies from subjects who developed breast cancer at least one year after the biopsy. In the case group, we have retrieved 10 cancer blocks that matched to the cases (Fig. 1b). The same preprocessing procedures were performed as for the benign biopsies, including LCM dissection, DNA extraction, library construction, sequencing, alignment, mutation calling, and filtering.

Somatic copy number variation and mutational signature
Using both aligned reads and identified mutations, we studied the genetic aberrations that distinguish cases from controls, including mutations and CNVs. We identified the somatic mutations or CNVs that were common to both the cases’ benign biopsy tissue and the paired malignant lesions for the ten cases in which we had both tissues available. P-values were derived with the use of Chi-square test or logistic regression. We also studied the mutations to enable the discovery of mutational signatures. Lastly, we evaluated machine learning models and features for breast cancer risk prediction for the cohort. The Benjamini-Hochberg method was applied to convert the two-sided P-values to False Discovery Rate (FDR) for multi-comparison correction.

A Mutational Signature study was performed to reveal underlying mutational processes for cancer development. The identified somatic mutations were presented as a 96-element vector, which captures the immediate 5′ and 3′ neighbors of the mutated nucleotides. The summary of these mutation characteristics forms a mutational profile for each tissue sample. Putting multiple samples’ profiles together forms a matrix with the number of samples as rows (204) and the mutation characteristics as columns (96). Nonnegative matrix factorization (NMF) was applied to enable the discovery of intrinsic patterns in this matrix. The first value where the Residual Sum of Squares (RSS) curve presents an inflection point was used to determine the number of signatures. In total, three signatures were discovered among the cases and controls, or combined. The outputs of NMF consist of an H matrix and a W matrix. The matrix H (dimension of 3 × 96) was used to infer mutational processes. The numbers in matrix W (dimension of 204 × 3) correspond to each sample’s signature exposure levels. This matrix was interpreted as each tissue sample’s accumulated exposure effect to the mutational burden. We further evaluated the association between the signature exposure level and cancer development with logistic regressions, adjusting for age and histology class.

Cancer risk prediction at BBB
To predict cancer development using the mutations identified in BBB, we fit logistic regression with L1 penalty using the case/control status as output variable. Multiple input features have been tested, namely, clinical risk factors, somatic mutations, and mutation burden by gene/cytoband/protein domain. The mutation burden is inferred by aggregating all somatic mutations annotated as the same gene/cytoband/protein domain to a continuous number, representing the mutation burden of the corresponding unit. In a cross-comparison evaluation, we achieved the best results using protein domains as the aggregation unit. In total, 1966 annotated protein domains were utilized as input features for case/control prediction (Supplementary data 5). To evaluate the model and features, we performed a bootstrapping by randomly splitting the BBB samples at a 7:3 ratio for training and testing. We also evaluated the models by including clinical risk factors, including age at the time of BBB, age at menarche, age at first live birth, family history of BC in a first-degree relative, and histologic variable (proliferative vs non-proliferative).

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The datasets generated and analysed during the current study are publicly available in the figshare repository: https://doi.org/10.6084/m9.figshare.12191793. Whole-exome sequencing data, generated during the current study, are publicly available in NCBI Sequence Read Archive (SRA) here: https://identifiers.org/insdc.sra:SRP219328. TCGA data supporting Fig. 2 were downloaded from the Genomic Data Commons (GDC) data portal, through a dbGaP application. The link to the relevant dbGaP study is https://identifiers.org/dbgap:phs000178.v1.p1.

CODE AVAILABILITY
All code necessary to process the sequencing data and to re-generate the results is publicly available at https://github.com/zexian/BBCAR_codes.

Received: 16 September 2019; Accepted: 8 May 2020;

REFERENCES
1. Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2019. CA Cancer J. Clin. 69, 7–34 (2019).
2. Flanagan, M. R. et al. Chemoprevention Uptake for Breast Cancer Risk Reduction Varies by Risk Factor. Ann. Surg. Oncol. https://doi.org/10.1245/s10434-019-07236-8 (2019).
3. Gail, M. H. et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J. Natl Cancer Inst. 81, 1879–1886 (1989).
4. Chlebowski, R. T. et al. Predicting risk of breast cancer in postmenopausal women by hormone receptor status. J. Natl Cancer Inst. 99, 1695–1705 (2007).
5. Pankratz, V. S. et al. Assessment of the accuracy of the Gail model in women with atypical hyperplasia. J. Clin. Oncol. 26, 5374–5379 (2008).
6. Boughey, J. C. et al. Evaluation of the Tyrer-Cuzick (International Breast Cancer Intervention Study) model for breast cancer risk prediction in women with atypical hyperplasia. J. Clin. Oncol. 28, 3591–3596 (2010).
37. Bass, T. E. et al. ETAA1 acts at stalled replication forks to maintain genome integrity. Nat. Cell Biol. 18, 1185–1195 (2016).
38. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
39. Ciriello, G. et al. Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 45, 1127–1133 (2013).
40. Davies, H. et al. Whole-genome sequencing reveals breast cancers with mismatch repair deficiency. Cancer Res. 77, 4755–4762 (2017).
41. Soysal, S. D. et al. Genetic alterations in benign breast biopsies of subsequent breast cancer patients. Front. Med. (Lausanne) 6, 166 (2019).
7. Dupont, W. D. & Page, D. L. Risk factors for breast cancer in women with pro-
42. Pankratz, V. S. et al.
Model for individualized prediction of breast cancer risk after liferative breast disease. N. Engl. J. Med. 312, 146–151 (1985). a benign breast biopsy. J. Clin. Oncol. 33, 923–929 (2015). 8. Dupont, W. D. et al. Breast cancer risk associated with proliferative breast disease 43. Spencer, D. H. et al. Comparison of clinical targeted next-generation sequence and atypical hyperplasia. Cancer 71, 1258–1265 (1993). data from formalin-fixed and fresh-frozen tissue specimens. J. Mol. Diagn. 15, 9. Visscher, D. W. et al. Clinicopathologic features of breast cancers that develop in 623–633 (2013). women with previous benign breast disease. Cancer 122, 378–385 (2016). 44. Bhagwate, A. V. et al. Bioinformatics and DNA-extraction strategies to reliably 10. Silverstein, M. J. et al. Special report: consensus conference III. Image-detected detect genetic variants from FFPE breast tissue samples. BMC Genomics 20, 689 breast cancer: state-of-the-art diagnosis and treatment. J. Am. Coll. Surg. 209, (2019). 504–520 (2009). 45. Robbe, P. et al. Clinical whole-genome sequencing from routine formalin-fixed, 11. Ju, Y. S. et al. Somatic mutations reveal asymmetric cellular dynamics in the early paraffin-embedded specimens: pilot study for the 100,000 Genomes Project. human embryo. Nature 543, 714–718 (2017). Genet Med. 20, 1196–1205 (2018). 12. Alexandrov, L. B. et al. Clock-like mutational processes in human somatic cells. 46. Wang, Y. et al. Clonal evolution in breast cancer revealed by single nucleus Nat. Genet 47, 1402–1407 (2015). genome sequencing. Nature 512, 155–160 (2014). 13. Knudson, A. G. Two genetic hits (more or less) to cancer. Nat. Rev. Cancer 1, 47. Soysal, S. D. et al. Status of estrogen receptor 1 (ESR1) gene in mastopathy 157–162 (2001). predicts subsequent development of breast cancer. Breast Cancer Res Treat. 151, 14. Tamborero, D. et al. Comprehensive identification of mutational cancer driver 709–715 (2015). genes across 12 tumor types. Sci. Rep. 3, 2650 (2013). 
48. Xia, Y., Fan, C., Hoadley, K. A., Parker, J. S. & Perou, C. M. Genetic determinants of 15. Danforth, D. N. Jr. Genomic changes in normal breast tissue in women at normal the molecular portraits of epithelial cancers. Nat. Commun. 10, 5666 (2019). risk or at high risk for breast cancer. Breast Cancer (Auckl.) 10, 109–146 (2016). 49. Bartkova, J. et al. Oncogene-induced senescence is part of the tumorigenesis 16. Sakr, R. A. et al. Targeted capture massively parallel sequencing analysis of LCIS barrier imposed by DNA damage checkpoints. Nature 444, 633–637 (2006). and invasive lobular cancer: repertoire of somatic genetic alterations and clonal 50. Minnick, D. T., Pavlov, Y. I. & Kunkel, T. A. The fidelity of the human leading and relationships. Mol. Oncol. 10, 360–370 (2016). lagging strand DNA replication apparatus with 8-oxodeoxyguanosine tripho- 17. Martincorena, I. et al. High burden and pervasive positive selection of somatic sphate. Nucleic Acids Res 22, 5658–5664 (1994). mutations in normal human skin. Science 348, 880–886 (2015). 51. Colussi, C. et al. The mammalian mismatch repair pathway removes DNA 8- 18. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, oxodGMP incorporated from the oxidized dNTP pool. Curr. Biol. 12, 912–918 719–724 (2009). (2002). 19. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. 52. Pursell, Z. F., McDonald, J. T., Mathews, C. K. & Kunkel, T. A. Trace amounts of 8- Nature 500, 415–421 (2013). oxo-dGTP in mitochondrial dNTP pools reduce DNA polymerase gamma repli- 20. Temko, D., Tomlinson, I. P. M., Severini, S., Schuster-Bockler, B. & Graham, T. A. The cation fidelity. Nucleic Acids Res 36, 2174–2181 (2008). effects of mutational processes and selection on driver mutations across cancer 53. Freudenthal, B. D. et al. Uncovering the polymerase-induced cytotoxicity of an types. Nat. Commun. 9, 1857 (2018). oxidized nucleotide. Nature 517, 635–639 (2015). 21. Hartmann, L. 
C. et al. Benign breast disease and the risk of breast cancer. N. Engl. 54. Garrido-Castro, A. C., Lin, N. U. & Polyak, K. Insights into molecular classifications J. Med. 353, 229–237 (2005). of triple-negative breast cancer: improving patient selection for treatment. 22. Zeng, Z., et al. Datasets and metadata supporting the published article: somatic Cancer Disco. 9, 176–198 (2019). genetic aberrations in benign breast disease and the risk of subsequent breast 55. Lim, E. et al. Aberrant luminal progenitors as the candidate target population for cancer. figshare. https://doi.org/10.6084/m6089.figshare.12191793 (2020). basal tumor development in BRCA1 mutation carriers. Nat. Med 15,907–913 (2009). 23. NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRP219328 (2020). 56. Molyneux, G. et al. BRCA1 basal-like breast cancers originate from luminal epi- 24. Flanagan, S. E., Patch, A.-M. & Ellard, S. Using SIFT and PolyPhen to predict loss-of- thelial progenitors and not from basal stem cells. Cell Stem Cell 7, 403–417 (2010). function and gain-of-function mutations. Genet. Test. Mol. Biomark. 14, 533–537 57. Kannan, N. et al. Glutathione-dependent and -independent oxidative stress- (2010). control mechanisms distinguish normal human mammary epithelial cell subsets. 25. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic Proc. Natl Acad. Sci. USA 111, 7789–7794 (2014). variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010). 58. Tomkova, M., McClellan, M., Kriaucionis, S. & Schuster-Bockler, B. DNA Replication 26. Kalatskaya, I. et al. ISOWN: accurate somatic mutation identification in the and associated repair pathways are involved in the mutagenesis of methylated absence of normal tissue controls. Genome Med. 9, 59 (2017). cytosine. DNA Repair (Amst.) 62,1–7 (2018). 27. Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and 59. Chen, P. C. et al. 
Novel roles for MLH3 deficiency and TLE6-like amplification in mutations. Cell 173, 371–385.e318 (2018). DNA mismatch repair-deficient gastrointestinal tumorigenesis and progression. 28. Sharma, Y. et al. A pan-cancer analysis of synonymous mutations. Nat. Commun. PLoS Genet 4, e1000092 (2008). 10, 2569 (2019). 60. Morris, L. G. et al. Recurrent somatic mutation of FAT1 in multiple human cancers 29. Pearlman, R. et al. Prevalence and spectrum of germline cancer susceptibility leads to aberrant Wnt activation. Nat. Genet 45, 253–261 (2013). gene mutations among patients with early-onset colorectal cancer. JAMA Oncol. 61. Martin, D. et al. Assembly and activation of the Hippo signalome by FAT1 tumor 3, 464–471 (2017). suppressor. Nat. Commun. 9, 2372 (2018). 30. Hanisch, F. G. O-glycosylation of the mucin type. Biol. Chem. 382, 143–149 (2001). 62. Li, Z. et al. Loss of the FAT1 tumor suppressor promotes resistance to CDK4/6 31. Rohan, T. E. et al. Somatic mutations in benign breast disease tissue and risk of inhibitors via the hippo pathway. Cancer Cell 34, 893–905.e898 (2018). subsequent invasive breast cancer. Br. J. Cancer 118, 1662–1664 (2018). 63. Cimprich, K. A. & Cortez, D. ATR: an essential regulator of genome integrity. Nat. 32. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole- Rev. Mol. Cell Biol. 9, 616–627 (2008). genome sequences. Nature 534,47–54 (2016). 64. Casper, A. M., Nghiem, P., Arlt, M. F. & Glover, T. W. ATR regulates fragile site 33. Petljak, M. et al. Characterizing mutational signatures in human cancer cell lines stability. Cell 111, 779–789 (2002). reveals episodic APOBEC mutagenesis. Cell 176, 1282–1294.e1220 (2019). 65. Fang, Y. et al. ATR functions as a gene dosage-dependent tumor suppressor on a 34. Roberts, S. A. et al. An APOBEC cytidine deaminase mutagenesis pattern is mismatch repair-deficient background. EMBO J. 23, 3164–3174 (2004). widespread in human cancers. Nat. Genet 45, 970–976 (2013). 66. 
Shidfar, A. et al. Expression of miR-18a and miR-210 in normal breast tissue as 35. Koboldt, D. C., et al. VarScan 2: somatic mutation and copy number alteration candidate biomarkers of breast cancer risk. Cancer Prev. Res. (Phila.) 10,89–97 discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012). (2017). 36. Mermel, C. H. et al. GISTIC2. 0 facilitates sensitive and confident localization of the 67. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for targets of focal somatic copy-number alteration in human cancers. Genome Biol. analyzing next-generation DNA sequencing data. Genome Res 20, 1297–1303 12, R41 (2011). (2010). npj Breast Cancer (2020) 24 Published in partnership with the Breast Cancer Research Foundation Z. Zeng et al. 68. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler COMPETING INTERESTS transform. Bioinformatics 26, 589–595 (2010). The authors declare no competing interests. 69. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013). 70. Lai, Z. et al. VarDict: a novel and versatile variant caller for next-generation ADDITIONAL INFORMATION sequencing in cancer research. Nucleic acids Res. 44, e108 (2016). Supplementary information is available for this paper at https://doi.org/10.1038/ 71. Cingolani, P. et al. A program for annotating and predicting the effects of single s41523-020-0165-z. nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melano- gaster strain w1118; iso-2; iso-3. Fly (Austin) 6,80–92 (2012). Correspondence and requests for materials should be addressed to Y.L., S.A.K. or S.E.C. 72. McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 122 (2016). 
Reprints and permission information is available at http://www.nature.com/ reprints Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims ACKNOWLEDGEMENTS in published maps and institutional affiliations. We thank the Center for Medical Genomics at the Indiana University School of Medicine for the library preparation and high throughput sequencing. This study was supported in part by Breast Cancer Research Foundation, the Lynn Sage Cancer Research Foundation, and grant R21LM012618-01 from the National Institutes of Health. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative AUTHOR CONTRIBUTIONS Commons license, and indicate if changes were made. The images or other third party Z.Z., S.A.K., and S.E.C. conceived the study. A.S. performed laser capture microdisse- material in this article are included in the article’s Creative Commons license, unless cion and extracted the DNA. P.S. administered the questionnaires, organized the indicated otherwise in a credit line to the material. If material is not included in the clinical data, and contacted the subjects for saliva donation and cancer status article’s Creative Commons license and your intended use is not permitted by statutory confirmation. X.X. performed the sequencing. L.B. reviewed all benign biopsy and regulation or exceeds the permitted use, you will need to obtain permission directly tumor sections, verified histologic diagnosis, and identified areas for laser capture. from the copyright holder. To view a copy of this license, visit http://creativecommons. Z.Z. and A.V. carried out the sequence alignment, quality assessment, and mutation org/licenses/by/4.0/. calling. S.E.C. and Z.Z. wrote the paper. Z.Z. and X.L. 
performed the statistical analysis. Y.L. reviewed all analyzed data. S.A.K. was responsible for the clinical study. All authors discussed the results, revised and approved the paper. © The Author(s) 2020 Published in partnership with the Breast Cancer Research Foundation npj Breast Cancer (2020) 24
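As a reading aid for the Methods, the following is a minimal sketch of three steps described there: removing calls with read depth <20, labeling tumor-only calls as germline when they overlap the matched germline call set, and aggregating somatic calls into a per-protein-domain burden before a random 7:3 train/test split. It is not the authors' pipeline (their code is at the repository named under CODE AVAILABILITY). The record layout (`chrom`, `pos`, `ref`, `alt`, `depth`, `domain`), all function names, the domain labels, the sample IDs, and the toy data are assumptions made for illustration. The MLP classifier and the L1-penalized logistic regression are not shown.

```python
import random
from collections import Counter

MIN_DEPTH = 20  # the Methods remove calls with read depth < 20


def variant_key(v):
    """Identify a variant by position and alleles (hypothetical record layout)."""
    return (v["chrom"], v["pos"], v["ref"], v["alt"])


def filter_by_depth(calls, min_depth=MIN_DEPTH):
    """Keep only calls whose read depth is at least min_depth."""
    return [v for v in calls if v["depth"] >= min_depth]


def label_calls(tumor_only_calls, germline_calls):
    """Label a tumor-only call 'germline' if it overlaps the matched
    germline call set, otherwise 'somatic'."""
    germline_keys = {variant_key(v) for v in germline_calls}
    return [
        (v, "germline" if variant_key(v) in germline_keys else "somatic")
        for v in tumor_only_calls
    ]


def domain_burden(somatic_calls):
    """Count somatic mutations per annotated protein domain."""
    return Counter(v["domain"] for v in somatic_calls if v.get("domain"))


def split_samples(sample_ids, train_fraction=0.7, seed=0):
    """Randomly split sample IDs into training and test sets (7:3 by default)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Toy data, invented purely for illustration.
tumor_only = [
    {"chrom": "3", "pos": 100, "ref": "A", "alt": "G", "depth": 55, "domain": "domain_A"},
    {"chrom": "3", "pos": 200, "ref": "C", "alt": "T", "depth": 12, "domain": "domain_A"},
    {"chrom": "14", "pos": 300, "ref": "G", "alt": "A", "depth": 40, "domain": "domain_B"},
    {"chrom": "14", "pos": 400, "ref": "T", "alt": "C", "depth": 31, "domain": "domain_B"},
]
germline = [{"chrom": "14", "pos": 300, "ref": "G", "alt": "A", "depth": 38}]

kept = filter_by_depth(tumor_only)            # drops the depth-12 call
labeled = label_calls(kept, germline)         # the call at position 300 is germline
somatic = [v for v, label in labeled if label == "somatic"]
burden = domain_burden(somatic)               # one somatic call per toy domain
train_ids, test_ids = split_samples([f"BBB{i:03d}" for i in range(10)])
```

In a full analysis the per-sample burden vectors would become the feature matrix for the classifier, and the split would be repeated with different seeds to obtain a distribution of test performance.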

Journal

npj Breast Cancer, Springer Journals

Published: Jun 12, 2020
