Epigenetic Conservation Is a Beacon of Function: An Analysis Using Methcon5 Software for Studying Gene Methylation

SPECIAL SERIES: INFORMATICS TOOLS FOR CANCER RESEARCH AND CARE

ORIGINAL REPORTS

Emil Hvitfeldt, MS¹; Chao Xia, BSc¹; Kimberly D. Siegmund, PhD¹; Darryl Shibata, MD²; and Paul Marjoram, PhD¹

ABSTRACT

PURPOSE Different epigenetic configurations allow one genome to develop into multiple cell types. Although the rules governing what epigenetic features confer gene expression are increasingly being understood, much remains uncertain. Here, we used a novel software package, Methcon5, to explore whether the principle of biologic conservation can be used to identify expressed genes. The hypothesis is that epigenetic configurations of important expressed genes will be conserved within a tissue.

MATERIALS AND METHODS We compared the DNA methylation of approximately 850,000 CpG sites between multiple clonal crypts or glands of human colon, small intestine, and endometrium. We performed this analysis using the new software package, Methcon5, which enables detection of regions of high (or low) conservation.

RESULTS We showed that DNA methylation is preferentially conserved at gene-associated CpG sites, particularly in gene promoters (eg, near the transcription start site) or the first exon. Furthermore, higher conservation correlated well with gene expression levels and performed better than promoter DNA methylation levels. Most conserved genes are in canonical housekeeping pathways.

CONCLUSION This study introduces the new software package, Methcon5. In this example application, we showed that epigenetic conservation provides an alternative method for identifying functional genomic regions in human tissues.

JCO Clin Cancer Inform 4:100-107.
© 2020 by American Society of Clinical Oncology. Licensed under the Creative Commons Attribution 4.0 License.

Author affiliations and support information (if applicable) appear at the end of this article.
Accepted on December 5, 2019 and published at ascopubs.org/journal/cci on February 20, 2020: DOI https://doi.org/10.1200/CCI.19.00109

CONTEXT

Key Objective
We assessed whether conservation of methylation sites can be used as a proxy for gene expression (ie, do highly conserved CpG sites within a gene tend to indicate higher expression of that gene?). The relative functional importance of the epigenetic elements that cover genomes is a controversial issue. This study adds to our knowledge of this issue.

Knowledge Generated
We demonstrated that genes that are more conserved in terms of their methylation status do tend to be more highly expressed. Furthermore, we demonstrated that, contrary to some prior speculation, conservation of promoter regions does not correlate with gene expression in this way.

Relevance
Although this study focuses on three tissue types, we provide a software package to allow other users to easily conduct a similar analysis for data of interest (eg, other tissue types).

INTRODUCTION

It would be valuable to identify or rank the genes that are critical to the function of a given cell type or tissue. In theory, the epigenetic configuration of a gene should indicate its function, because epigenetics allows many different cell types to develop from a single genome by differentially marking specific genes for expression or silencing. However, the relative functional importance of the epigenetic elements that cover genomes is a controversial issue, because a relatively large proportion of the human genome can be annotated with different biochemical assays. One approach to inferring whether a genomic region is functional is to use evolutionary information: functional elements are conserved or constrained between species because of negative or purifying selection. The idea is that regions with biologic function are under selection and will change at rates lower than regions without function. For example, exonic regions are more conserved than intronic regions. Epigenetic marks are found in both conserved and nonconserved genomic regions.

In this paper, we provide a software tool, Methcon5, that allows the user to explore whether epigenetic conservation between similar cells in the same and different individuals can reveal biologic function. Epigenetic marks occupy a large proportion of the human genome, and it is uncertain whether they are all equally functional. Similar to sequence conservation between species, the idea is that, if an epigenetic configuration is important in a given cell type, it will be the same between cells, because any changes will decrease cell fitness and be subject to negative selection. By contrast, if an epigenetic configuration is unimportant to the cell, its pattern may drift and therefore become different between cells.

Although the method we used is broadly applicable, we here used DNA methylation of CpG sites to test the hypothesis. DNA methylation patterns show somatic inheritance and usually are copied between cell divisions, but the replication fidelity of DNA methylation is relatively low, and changes (methylation to demethylation, or de novo methylation) commonly can be observed within a human lifespan. Furthermore, DNA methylation can be measured with high reproducibility using the Illumina MethylationEPIC BeadChip Infinium microarrays (Illumina, San Diego, CA), reducing the experimental background that can confound measures of similarity. We examined the methylation at approximately 850,000 CpG sites in 32 crypts/glands from the human colon, small intestine, and endometrium from eight different individuals. We present novel software and algorithms that measured epigenetic conservation and identified and ranked genes that were preferentially conserved in these human tissues. Consistent with the hypothesis that DNA methylation important to the function of a cell shows epigenetic conservation, we found that the methylation of CpG sites in genes, promoters, housekeeping genes, and more highly expressed genes is, on average, more conserved.

MATERIALS AND METHODS

Data

The data consisted of 32 samples of normal tissue taken from the colon, small intestine, and endometrium of eight human participants. Each sample consisted of a pool of approximately 500-10,000 cells from individual crypts or glands. We then assayed those samples using the Illumina MethylationEPIC BeadChip Infinium microarray, which measures methylation at approximately 850,000 CpG sites using hybridization-ligation. It can measure DNA from paraffin-embedded tissues and single tumor glands (approximately 100 ng), and it has high technical reproducibility (replicates with a Pearson correlation of 0.997). In this example study, we measured the proportion of cells that were methylated at each assayed CpG position for each sample. We then contrasted the results with measurements of gene expression taken from the Expression Atlas, a European Bioinformatics Institute resource that provides gene expression results from more than 3,000 experiments from 40 different organisms.

Statistical Approach

The analysis tool, Methcon5, implemented using the statistical programming language R, version 3.6.1, employs statistical methods that enabled us to assess how DNA methylation varies along the genome, and in specific regions/genes, with a goal of identifying genes with particularly highly conserved DNA methylation. We used a bootstrapping approach for this problem. We then assessed whether such conserved regions were indicative of genes that are highly expressed in the tissue of interest.

Specifically, within each gene of interest, for every pair of samples, we calculated the Manhattan distance between all CpG sites at which we had measurements of methylation proportion for that gene. We then calculated the mean of those values across all pairs of samples, and we normalized by the number of CpGs measured in that gene. This gave us a standardized measure of conservation, $C_g$, for the $g$th gene that controls for the number of CpG sites, $n_g$, measured in that gene.

More formally, suppose that we have $S$ samples for which we have measured methylation values at $n_g$ CpG sites for a given gene $g$ and that we denote the data obtained by the $S \times n_g$ matrix $D$, where the $(i,j)$th element of the matrix, denoted by $d_{ij}$, records the methylation measured for the $i$th sample at the $j$th CpG site. Then, the Manhattan (or pairwise) distance between samples $i$ and $j$ is defined as

$$M_{ij} = \sum_{k=1}^{n_g} |d_{ik} - d_{jk}|.$$

We then define the overall Manhattan distance for the set of samples at these sites (ie, for gene $g$) to be the average of those values, that is,

$$M_g = \frac{1}{H} \sum_{i<j} M_{ij},$$

where $H = S(S-1)/2$ is the number of distinct pairs of samples that can be formed. To control for the number of CpGs in that gene, we then normalize this value and work with $C_g = M_g / n_g$ moving forward.

Our next task is to determine which of those genes are the most conserved (ie, have the lowest values of $C_g$). One approach would be to simply rank the calculated $C_g$ values and select the $L$ smallest, for some choice of $L$ (these are the most conserved in a "per CpG" sense). However, under the null hypothesis, $H_0$, that there is no difference in conservation across genes, the variance of the observed $C_g$ value for gene $g$ will be inversely proportional to $n_g$, the number of CpGs measured in that gene.
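To make the conservation measure concrete, the following is a minimal Python sketch of the per-gene calculation (pairwise Manhattan distances, averaged over the S(S − 1)/2 sample pairs, then divided by the number of CpGs). It is an illustration only and is not the Methcon5 R implementation; the function names `manhattan` and `conservation` and the toy methylation values are assumptions introduced here.

```python
from itertools import combinations


def manhattan(a, b):
    """Manhattan distance between two samples' methylation vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def conservation(D):
    """Return the normalized conservation value C for one gene.

    D is a list of S rows (one per sample); each row holds the n
    methylation proportions measured at the gene's CpG sites.
    (Hypothetical helper, not part of Methcon5.)
    """
    S = len(D)
    if S < 2:
        raise ValueError("at least two samples are required")
    n = len(D[0])
    H = S * (S - 1) // 2  # number of distinct sample pairs
    M = sum(manhattan(D[i], D[j]) for i, j in combinations(range(S), 2)) / H
    return M / n  # per-CpG normalization


# Toy data (made up): 3 samples x 3 CpG sites for a single gene.
toy_gene = [
    [0.10, 0.20, 0.90],
    [0.10, 0.30, 0.80],
    [0.20, 0.20, 0.90],
]
c_g = conservation(toy_gene)
```

Lower values of `c_g` correspond to higher conservation. Identical samples give a value of exactly 0.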
As such, when ranking by $C_g$ alone, we expect an over-representation of genes for which the number of measured CpGs is low. This is what we observed in practice. Thus, and given that the $n_g$ values do vary greatly for the MethylationEPIC BeadChip microarray, our tool instead offers two bootstrapping approaches for this task, which we now describe.

For each observed value of $n_g$, we constructed a null distribution for the measured value $C_g$ using a bootstrapping procedure. Specifically, we began by constructing a subset of all measured CpGs that includes only those CpGs that were annotated as gene-linked per the EPIC microarray documentation. We then repeatedly sampled sets of $n_g$ CpG sites to act as "null genes" in the following ways.

In the approach we call the "naive bootstrap," we proceeded as follows, repeating these steps for $l = 1, \ldots, N$, for some large number $N$:

1. Sample a set of $n_g$ of these gene-associated CpGs independently at random, without regard to location, to form each null gene, $g_l$.
2. Calculate the normalized Manhattan distance for null gene $g_l$, as described in Statistical Approach. Denote this value by $C_{g_l}$.

In reality, it is typically the case that the methylation status in neighboring CpGs is correlated (ie, neighboring CpGs are more likely to have the same methylation state than non-neighboring CpGs). Null genes constructed in the manner described by the naive bootstrap approach will not respect the correlation structure present among CpG sites in actual genomes and, as such, may perform badly when CpG information is available densely across the genome, as in our data (we illustrate this in the Results). For that reason, we offer a second, more nuanced approach, which we refer to as the "adjusted bootstrap." In this version of the bootstrap, we proceeded as follows.

First, we extracted all possible sets of $n_g$ consecutive CpGs, such that all $n_g$ CpGs are associated with the same gene. We did this for every gene and denoted this total set across all genes by $S_{n_g}$. Again, we next repeated the following steps for $l = 1, \ldots, N$, for some large number $N$:

1. Sample a set of $n_g$ consecutive CpGs from $S_{n_g}$. Denote this "null gene" by $g_l$. The CpG sites it contains will, by construction, all be associated with the same gene and will maintain the correlation structure that is typical among nearby CpG sites.
2. Calculate the normalized Manhattan distance for null gene $g_l$, as described in Statistical Approach. Denote this value by $C_{g_l}$.

This adjusted bootstrap procedure tested the same null hypothesis as the naive bootstrap but used sets of CpGs that better reflected the correlation structure typically found within the genome. In the results we report here, we repeated these bootstrapping procedures $N = 1{,}000$ times for each distinct value of $n_g$. We note that the "null gene" sampled in step 1 typically will be different in each repetition of that step. We then took the set of $N$ values of $C_{g_l}$ we generated for each distinct value of $n_g$ and used those as the null distribution from which we assessed the significance level of the observed values $C_g$ for all genes for which the measured number of CpGs is $n_g$. The significance level is defined as the quantile of the observed value $C_g$ in this null distribution. The lower the quantile (which can be thought of as a P value), the more significantly conserved the gene is. We extracted all genes, of all lengths, that fell between the 0th and 5th quantile and took them through to the next stage of the analysis, in which we compared them to gene expression values (described in Comparison with public data).

To better understand the importance of genes that are called highly conserved using the procedure described in this section, we then applied a gene-set enrichment analysis using the ReactomePA software package. This flagged pathways that were significantly enriched among our set of conserved genes.

In addition to conducting this analysis for each gene in its entirety (ie, including all CpGs that are associated with that gene), we also conducted analyses in which we considered particular genic regions (5′ untranslated region, or 500 and 2,000 base pairs [bps] from the transcription start site) to see whether these more localized regions might better correlate with gene expression. Finally, we also compared gene expression with conservation in the promoter region.

Comparison with public data. The ultimate step of this analysis is to assess how well gene conservation correlates with gene expression. Because we had no expression data for the samples we used, we instead used data from Expression Atlas [10a]. From there, we obtained expression levels for each gene in normal tissue for each of the colon, small intestine, and endometrium, calculated from RNA-seq data for tissue samples of 122 human individuals. Although the samples are different, the tissue is the same, which led us to believe that, for most genes at least, expression would be similar in their and our samples. Clearly, any correlation between conservation and expression that we would see in our actual data (were expression data available for our samples) would likely be higher than when comparing it with expression in unrelated public samples. As such, our test for correlation is likely to be conservative, but it is for this reason that we focused this proof-of-principle analysis on normal tissue rather than on tumor tissue. Ultimately, we do hope to apply this approach to tumor tissue as well.

RESULTS

Every cell has its own epigenome. By comparing epigenomes between cells within an individual or between individuals, we can discover if certain regions are more conserved. As a first step, we illustrated that conservation is nonuniform along the genome, with greater conservation within genes (Table 1). In the table, we showed the mean value of the per-CpG Manhattan distance, $C$, as a function of category. We categorized the sites in several ways: (1) as gene/nongene; (2) according to whether they fell in CpG islands, shores, shelves, or sea; (3) whether they were located in the 5′ untranslated region of a gene; and (4) whether they were located within 1,500 bps or 200 bps of the transcription start site. As expected if conservation is a beacon or indicator of biologic function, conservation was significantly greater (ie, values of $C$ were low) inside genes versus outside of genes, with the greatest conservation observed within 200 bps of the transcription start site (Table 1). These conservation patterns were present for all three human tissues. Furthermore, we note that there was a strong correlation between conservation and genomic annotation of the region as CpG island/non-island. CpG islands are regions that are observed to have low levels of methylation. As such, it is not entirely surprising that methylation conservation should be high. Regions near islands are often annotated as "shore" (closest to island) and "shelf" (between shore and sea). From the table, though, we see no evidence for increased conservation in these regions.

TABLE 1. Average Manhattan Distance by Tissue Type Stratified by Gene Regions or Genome Annotation

Variable                 Colon    Small Intestine    Endometrium
Total                    0.099    0.109              0.089
Gene association
  Yes                    0.090    0.099              0.080
  No                     0.120    0.133              0.111
CpG island relation
  CpG island             0.057    0.058              0.049
  South shore            0.103    0.110              0.088
  North shore            0.107    0.113              0.092
  South shelf            0.108    0.120              0.096
  North shelf            0.108    0.120              0.096
  Sea                    0.109    0.123              0.101
5′ UTR
  Yes                    0.080    0.090              0.071
  No                     0.101    0.112              0.092
TSS1500
  Yes                    0.091    0.095              0.080
  No                     0.100    0.111              0.091
TSS200
  Yes                    0.057    0.060              0.049
  No                     0.103    0.114              0.093

Abbreviation: UTR, untranslated region.

In Figure 1, we showed the behavior of conservation of CpG sites as a function of their position relative to their associated gene but averaged across all genes. Each point shows the mean of the absolute value of the difference in methylation frequencies at a given CpG site across all samples. We grouped CpG sites according to their physical position, where 0 represents the location of the first CpG associated with a given gene (per the annotation file for the EPIC array). We see that the Manhattan distance is minimized (and, therefore, conservation is maximized) within the first 2,000 bps or so before then increasing to a steady value along the rest of the gene. The pattern is replicated across three tissue types, albeit at a slightly different level for each type.

FIG 1. Average Manhattan distance for single CpGs as a function of position relative to first 5′ annotated gene CpG site. The greater conservation (lower average Manhattan distances) around genes indicates DNA methylation conservation generally extends for hundreds of base pairs and is not isolated to a single CpG site. [Figure not reproduced. x-axis: distance by hg19 position from first CpG site (binned to 100s); y-axis: average Manhattan distance; series: colon, small intestine, endometrium.]

In Figure 2, we showed how the two methods of bootstrapping described earlier gave different P value distributions when applied to our data. The first row is the distribution one gets when CpG sites are randomly picked when constructing the null distribution of similar genes (the "naive" method). This setup leads to the vast majority of genes having an observed $C$ value that either is always bigger or is always smaller than the "null" genes created by the bootstrapping procedure. Because we wished to rank genes according to P value, this setup was problematic, because it essentially resulted in a large number of ties along with an over-representation of P values equal to 0 or 1. The second row shows the distribution of the P values resulting from the "adjusted bootstrap" procedure. This distribution is much more uniform, as desired, and has far fewer ties between P values, which enabled us to rank the genes more effectively. For this reason, we used the adjusted method to produce the results shown in the rest of this paper.

FIG 2. Distribution of bootstrapping P values for genes. Each column corresponds to a specific tissue. The top row shows results from the naive bootstrap procedure, whereas the bottom row shows the adjusted bootstrap results. [Figure not reproduced. x-axis: P value; y-axis: No. of genes; columns: colon, small intestine, endometrium.]

Epigenetic gene conservation can be further stratified or ranked, because not all genes are expressed in all tissues. Therefore, conservation should vary between tissues. To explore this, after calculating the $C$ values, we took the 5% of genes that were most conserved for each tissue, separately for each tissue, and then conducted a gene-set enrichment analysis to see what pathways were over-represented among those genes. We referred to these pathways as "conserved pathways." We then determined how many of the conserved pathways were conserved in one, two, or all three tissues. Figure 3 shows the results of this analysis. The endometrium and small intestine had the greatest numbers of uniquely conserved pathways, but the overlap between these pathways was small (just five pathways). However, interestingly, a core group of 50 pathways were conserved in all three tissues. These pathways are enriched in core housekeeping functions (cell cycle, DNA replication, transcription, translation) that are essential to all mitotic cells (Fig 4). This reinforces the idea that we are successfully using conservation of methylation to detect genes, and then pathways, that play important roles in the tissue concerned.

FIG 3. Results of pathway conservation analysis. The first three columns show the number of pathways that are called as significantly over-represented just in a single tissue type, while the next four columns show how many pathways are called as conserved in two or more tissue types. The overall number of pathways called as significantly conserved in each tissue is shown by the colored bars at bottom left. [Figure not reproduced. Axes: intersection size (No. of pathways) and set size by tissue.]

The enriched gene-sets can be organized into a network. In Figure 4, we give an example using the most over-represented pathways in the small intestine. In the figure, the nodes represent pathways that were labeled as conserved in our analysis. The edges between nodes represent whether genes are shared between two pathways that are both labeled as highly conserved in our analysis. If the overlap proportion of genes between pathways is less than 0.5, no edge is present. Again, we see that most of the pathways that we detected as most conserved have significant overlaps in genes involved, likely because they perform key housekeeping roles in cell function.

FIG 4. Enrichment map of pathways of conserved genes in small intestine tissue. Edges are shown between pathways if the overlap ratio is > 0.5. Major clusters are labeled according to most frequent words in pathway names. [Figure not reproduced.]

Finally, we examined whether conservation correlates with gene expression, our proxy measure of importance of a gene. Although variation in gene promoter methylation often is associated with gene silencing, the degree to which it correlates with gene expression is unclear. As seen in Figure 5, for all three tissues, we found that conservation did correlate with gene expression levels and performed better as an indicator of expression than did gene promoter methylation, which did not appear to correlate with expression at all. The first row represents the adjusted conservation values obtained with our adjusted bootstrapping approach (x-axis, binned by value), whereas the second row is the mean methylation in the promoter region (x-axis, binned by value). The expression values (y-axis) were taken from the Expression Atlas and are displayed on a logarithmic scale. The values for conservation and mean methylation in promoter were binned in such a way that an equal number of points were placed in each bin. Higher bin number represents more conserved genes (ie, conservation increases as we move from left to right in the figure).

FIG 5. The relationship between conservation and expression. Genes are collected into 10 groups according to the degree of conservation measured in our data. For each group, we then show a box-plot of the distribution of log (gene expression) values recorded for the corresponding tissue type in the Expression Atlas database. Columns correspond to the tissue type. The top row shows results when assessing conservation for the entire gene; the bottom row shows the results when assessing conservation just for the promoter region of each gene. We see that gene conservation correlates with expression better than does promoter conservation. [Figure not reproduced. x-axis: quantile of conservation by gene (bins 0-10 through 90-100); y-axis: expression; rows: bootstrapped values, promoter region.]

DISCUSSION

The Methcon5 R package that we introduced here provides software necessary to carry out the calculations for conservation and the bootstrapping procedure. The functions have been split into two sections: (1) calculations of conservation value by region and (2) bootstrapping methods to calculate P values on the basis of conservation values and region length. The calculation of the conservation value is customizable, with user-provided functions allowed and with a default for arithmetic mean. Currently, three different bootstrapping methods are included in the package, two of which have been described in this paper. Also, a second repository is available, which includes all of the analysis scripts necessary to reproduce the analysis performed here, starting from IDAT files. The analyses in this paper took < 30 minutes to run on a MacBook Pro.

A priori, it has been found that conserved genomic regions tend to be functional [3]. Using the software we introduced here, we applied this principle to epigenomes and presented novel software to identify and rank CpG DNA methylation conservation along the human genome. Conserved genomic regions likely reflect selection, and therefore the identification of preferentially conserved epigenetic regions potentially can identify the genes that are most important to the function of a cell, a frequent goal of biologic investigations.

The example analysis presented here illustrates that known functional genomic regions have greater epigenetic conservation. In principle, such epigenetic conservation can be used to help identify which genes are more critical to the survival of a cell. Interestingly, function appeared to correlate with conservation of methylation at multiple gene-associated CpG sites (Fig 1), which may indicate that the epigenetic configuration of the gene region and not of a specific CpG site is informative. The approach and software require at least two samples from the same population. Better still, the comparisons can be applied to a range of samples. Potentially, epigenetic conservation can indicate what genes are under greater selection in native human tissues. For example, comparisons of epigenomes between opposite sides of the same human colorectal cancer reveal preferential conservation of genes involved in immune surveillance, suggesting that it is critically important for cancers to hide from the immune system. Although conservation per se does not conclusively indicate function, highly conserved epigenetic regions can serve as beacons for unbiased discovery of genes or noncoding regions that are more likely to be critical to the function or survival of human cells. The data used in this analysis are available upon request.
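As a companion to the bootstrapping procedures described under Statistical Approach, the following Python sketch contrasts the naive and adjusted bootstraps and the quantile-based significance level. It is a hedged illustration under stated assumptions, not the Methcon5 R code. All function names, the layout of `data` (samples by CpG columns), and the `genes` mapping (gene name to CpG column indices in genomic order) are hypothetical. We also assume that the naive bootstrap samples without replacement and that the quantile is the fraction of null values at or below the observed value.

```python
import random
from itertools import combinations


def conservation(D):
    """Normalized mean pairwise Manhattan distance for an S x n matrix."""
    S, n = len(D), len(D[0])
    H = S * (S - 1) // 2
    total = sum(
        sum(abs(x - y) for x, y in zip(D[i], D[j]))
        for i, j in combinations(range(S), 2)
    )
    return total / H / n


def columns(data, sites):
    """Extract the S x n sub-matrix for the chosen CpG column indices."""
    return [[row[k] for k in sites] for row in data]


def naive_null(data, gene_sites, n, N, rng):
    """Naive bootstrap: n gene-linked CpGs drawn at random, ignoring location.

    Assumption: sampling is without replacement within each null gene.
    """
    return [conservation(columns(data, rng.sample(gene_sites, n)))
            for _ in range(N)]


def adjusted_null(data, genes, n, N, rng):
    """Adjusted bootstrap: n consecutive CpGs, all taken from a single gene."""
    # Every run of n consecutive CpGs that lies within one gene.
    windows = [
        sites[start:start + n]
        for sites in genes.values()
        for start in range(len(sites) - n + 1)
    ]
    if not windows:
        raise ValueError("no gene has n consecutive CpGs")
    return [conservation(columns(data, rng.choice(windows)))
            for _ in range(N)]


def quantile_p(observed, null_values):
    """Fraction of null values at or below the observed C (low = conserved)."""
    return sum(v <= observed for v in null_values) / len(null_values)


# Toy data (made up): 3 samples x 12 CpGs; gene A is perfectly conserved.
data = [
    [0.1] * 6 + [0.1, 0.9, 0.2, 0.8, 0.3, 0.7],
    [0.1] * 6 + [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],
    [0.1] * 6 + [0.5, 0.5, 0.4, 0.6, 0.5, 0.5],
]
genes = {"A": list(range(0, 6)), "B": list(range(6, 12))}
rng = random.Random(0)

observed = conservation(columns(data, genes["A"][:3]))
null_adjusted = adjusted_null(data, genes, n=3, N=200, rng=rng)
null_naive = naive_null(data, genes["A"] + genes["B"], n=3, N=200, rng=rng)
p_adjusted = quantile_p(observed, null_adjusted)
```

The only difference between the two procedures is how each null gene is drawn: the adjusted version restricts draws to contiguous within-gene windows, so null genes keep the local correlation among neighboring CpGs.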
AFFILIATIONS
1. Biostatistics Division, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA
2. Department of Pathology, Keck School of Medicine, University of Southern California, Los Angeles, CA

CORRESPONDING AUTHOR
Paul Marjoram, PhD, Biostatistics Division, 2001 N Soto St, SSB202V, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089; e-mail: pmarjora@usc.edu.

SUPPORT
Supported by National Institutes of Health Awards No. 1P01CA196569 and R21 CA226106.

AUTHOR CONTRIBUTIONS
Conception and design: Darryl Shibata, Paul Marjoram
Provision of study material or patients: Darryl Shibata
Collection and assembly of data: Darryl Shibata
Data analysis and interpretation: All authors
Manuscript writing: All authors
Final approval of manuscript: All authors
Accountable for all aspects of the work: All authors

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST
The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.
Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians (Open Payments).
No potential conflicts of interest were reported.

REFERENCES
1. Allis CD, Jenuwein T: The molecular hallmarks of epigenetic control. Nat Rev Genet 17:487-500, 2016
2. Kellis M, Wold B, Snyder MP, et al: Defining functional DNA elements in the human genome. Proc Natl Acad Sci USA 111:6131-6138, 2014
3. Haerty W, Ponting CP: No gene in the genome makes sense except in the light of evolution. Annu Rev Genomics Hum Genet 15:71-92, 2014
4. Shibata D: Mutation and epigenetic molecular clocks in cancer. Carcinogenesis 32:123-128, 2011
5. Moran S, Arribas C, Esteller M: Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics 8:389-399, 2016
6. European Bioinformatics Institute: Expression Atlas. https://www.ebi.ac.uk/gxa/home
7. R Foundation for Statistical Computing: R: A language and environment for statistical computing. Vienna, Austria, R Foundation for Statistical Computing. https://www.R-project.org/
8. Craw S: Encyclopedia of Machine Learning and Data Mining. Berlin, Germany, Springer, 2017
9. Efron B, Tibshirani R: An Introduction to the Bootstrap. Boca Raton, FL, CRC Press, 1994
10. Yu G, He Q-Y: ReactomePA: An R/Bioconductor package for reactome pathway analysis and visualization. Mol Biosyst 12:477-479, 2016
10a. Papatheodorou I, Moreno P, Manning J, et al: Expression Atlas update: From tissues to single cells. Nucleic Acids Res, gkz947, https://doi.org/10.1093/nar/gkz947
11. Uhlén M, Fagerberg L, Hallström BM, et al: Proteomics: Tissue-based map of the human proteome. Science 347:1260419, 2015
12. Antequera F, Bird A: CpG Islands, in Jost JP, Saluz HP (eds): DNA Methylation, EXS vol 64. Basel, Switzerland, Birkhäuser, 1993
13. Yu G: Enrichplot: Visualization of functional enrichment result. https://github.com/GuangchuangYu/enrichplot
14. Ross MT, Grafham DV, Coffey AJ, et al: The DNA sequence of the human X chromosome. Nature 434:325-337, 2005
15. GitHub: Methcon5. https://github.com/emilhvitfeldt/methcon5
16. GitHub: Epigenetic-conservation-is-a-beacon-of-function. https://github.com/emilhvitfeldt/epigenetic-conservation-is-a-beacon-of-function
17. Ryser MD, Yu M, Grady W, et al: Epigenetic heterogeneity in human colorectal tumors reveals preferential conservation and evidence of immune surveillance. Sci Rep 8:17292, 2018

Epigenetic Conservation Is a Beacon of Function: An Analysis Using Methcon5 Software for Studying Gene Methylation


Publisher
Wolters Kluwer Health
Copyright
(C) 2020 American Society of Clinical Oncology
ISSN
2473-4276
DOI
10.1200/CCI.19.00109

Abstract

Original Reports. Special Series: Informatics Tools for Cancer Research and Care

Emil Hvitfeldt, MS (1); Chao Xia, BSc (1); Kimberly D. Siegmund, PhD (1); Darryl Shibata, MD (2); and Paul Marjoram, PhD (1)

PURPOSE Different epigenetic configurations allow one genome to develop into multiple cell types. Although the rules governing what epigenetic features confer gene expression are increasingly being understood, much remains uncertain. Here, we used a novel software package, Methcon5, to explore whether the principle of biologic conservation can be used to identify expressed genes. The hypothesis is that epigenetic configurations of important expressed genes will be conserved within a tissue.

MATERIALS AND METHODS We compared the DNA methylation of approximately 850,000 CpG sites between multiple clonal crypts or glands of human colon, small intestine, and endometrium. We performed this analysis using the new software package, Methcon5, which enables detection of regions of high (or low) conservation.

RESULTS We showed that DNA methylation is preferentially conserved at gene-associated CpG sites, particularly in gene promoters (eg, near the transcription start site) or the first exon. Furthermore, higher conservation correlated well with gene expression levels and performed better than promoter DNA methylation levels. Most conserved genes are in canonical housekeeping pathways.

CONCLUSION This study introduces the new software package, Methcon5. In this example application, we showed that epigenetic conservation provides an alternative method for identifying functional genomic regions in human tissues.

JCO Clin Cancer Inform 4:100-107.
© 2020 by American Society of Clinical Oncology. Licensed under the Creative Commons Attribution 4.0 License.

Author affiliations and support information (if applicable) appear at the end of this article. Accepted on December 5, 2019 and published at ascopubs.org/journal/cci on February 20, 2020: DOI https://doi.org/10.1200/CCI.19.00109

INTRODUCTION

It would be valuable to identify or rank the genes that are critical to the function of a given cell type or tissue. In theory, the epigenetic configuration of a gene should indicate its function, because epigenetics allows many different cell types to develop from a single genome by differentially marking specific genes for expression or silencing. However, the relative functional importance of the epigenetic elements that cover genomes is a controversial issue, because a relatively large proportion of the human genome can be annotated with different biochemical assays. One approach to inferring whether a genomic region is functional is to use evolutionary information: functional elements are conserved or constrained between species because of negative or purifying selection. The idea is that regions with biologic function are under selection and will change at rates lower than regions without function. For example, exonic regions are more conserved than intronic regions. Epigenetic marks are found in both conserved and nonconserved genomic regions.

In this paper, we provide a software tool, Methcon5, that allows the user to explore whether epigenetic conservation between similar cells in the same and different individuals can reveal biologic function. Epigenetic marks occupy a large proportion of the human genome, and it is uncertain whether they are all equally functional. Similar to sequence conservation between species, the idea is that, if an epigenetic configuration is important in a given cell type, it will be the same between cells, because any changes will decrease cell fitness and be subject to negative selection. By contrast, if an epigenetic configuration is unimportant to the cell, its pattern may drift and therefore become different between cells.

Although the method we used is broadly applicable, we here used DNA methylation of CpG sites to test the hypothesis. DNA methylation patterns show somatic inheritance and usually are copied between cell divisions, but the replication fidelity of DNA methylation is relatively low, and changes (methylation to demethylation, or de novo methylation) commonly can be observed within a human lifespan. Furthermore, DNA methylation can be measured with high reproducibility using the Illumina MethylationEPIC BeadChip Infinium microarrays (Illumina, San Diego, CA), reducing the experimental background that can confound measures of similarity. We examined the methylation at approximately 850,000 CpG sites in 32 crypts/glands from the human colon, small intestine, and endometrium from eight different individuals. We present novel software and algorithms that measured epigenetic conservation and identified and ranked genes that were preferentially conserved in these human tissues. Consistent with the hypothesis that DNA methylation important to the function of a cell shows epigenetic conservation, we found that the methylation of CpG sites in genes, promoters, housekeeping genes, and more highly expressed genes is, on average, more conserved.

CONTEXT

Key Objective
We assessed whether conservation of methylation sites can be used as a proxy for gene expression (ie, do highly conserved CpG sites within a gene tend to indicate higher expression of that gene?). The relative functional importance of the epigenetic elements that cover genomes is a controversial issue. This study adds to our knowledge of this issue.

Knowledge Generated
We demonstrated that genes that are more conserved in terms of their methylation status do tend to be more highly expressed. Furthermore, we demonstrated that, contrary to some prior speculation, conservation of promoter regions does not correlate with gene expression in this way.

Relevance
Although this study focuses on three tissue types, we provide a software package to allow other users to easily conduct a similar analysis for data of interest (eg, other tissue types).

MATERIALS AND METHODS

Data

The data consisted of 32 samples of normal tissue taken from the colon, small intestine, and endometrium of eight human participants. Each sample consisted of a pool of approximately 500-10,000 cells from individual crypts or glands. We then assayed those samples using the Illumina MethylationEPIC BeadChip Infinium microarray, which measures methylation at approximately 850,000 CpG sites using hybridization-ligation. It can measure DNA from paraffin-embedded tissues and single tumor glands (approximately 100 ng), and it has high technical reproducibility (replicates with a Pearson correlation of 0.997). In this example study, we measured the proportion of cells that were methylated at each assayed CpG position for each sample. We then contrasted the results with measurements of gene expression taken from the Expression Atlas, a European Bioinformatics Institute resource that provides gene expression results from more than 3,000 experiments from 40 different organisms.

Statistical Approach

The analysis tool, Methcon5, implemented using the statistical programming language R, version 3.6.1, employs statistical methods that enabled us to assess how DNA methylation varies along the genome, and in specific regions/genes, with a goal of identifying genes with particularly highly conserved DNA methylation. We used a bootstrapping approach for this problem. We then assessed whether such conserved regions were indicative of genes that are highly expressed in the tissue of interest.

Specifically, within each gene of interest, for every pair of samples, we calculated the Manhattan distance between all CpG sites at which we had measurements of methylation proportion for that gene. We then calculated the mean of those values across all pairs of samples, and we normalized by the number of CpGs measured in that gene. This gave us a standardized measure of conservation, $C_g$, for the $g$th gene that controls for the number of CpG sites, $n_g$, measured in that gene.

More formally, suppose that we have $S$ samples for which we have measured methylation values at $n$ CpG sites for a given gene and that we denote the data obtained by the $S \times n$ matrix $D$, where the $(i,j)$th element of the matrix, denoted by $d_{ij}$, records the methylation measured for the $i$th sample at the $j$th CpG site. Then, the Manhattan (or pairwise) distance between samples $i$ and $j$ is defined as

$$M_{ij} = \sum_{k=1}^{n} \left| d_{ik} - d_{jk} \right|.$$

We then define the overall Manhattan distance for the set of samples at these sites (ie, for gene $g$) to be the average of those values, that is,

$$M_g = \frac{1}{H} \sum_{i<j} M_{ij},$$

where $H = S(S-1)/2$ is the number of distinct pairs of samples that can be formed. To control for the number of CpGs in that gene, we then normalize this value and work with $C_g = M_g / n_g$ moving forward.

Our next task is to determine which of those genes are the most conserved (ie, have the lowest values of $C_g$). One approach would be to simply rank the calculated $C_g$ values and select the $L$ smallest, for some choice of $L$ (these are the most conserved in a "per CpG" sense). However, under the null hypothesis, $H_0$, that there is no difference in conservation across genes, the variance of the observed $C$ value for gene $g$ will be inversely proportional to $n_g$, the number of CpGs measured in that gene.
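The conservation measure $C_g$ defined above can be illustrated with a small worked example. The following is a minimal sketch in plain Python, not the Methcon5 implementation (which is an R package); the function name `conservation` and the toy matrix are hypothetical and chosen only for illustration. It assumes the data for one gene are given as an $S \times n$ list of rows, one row per sample.

```python
from itertools import combinations


def conservation(D):
    """Per-CpG normalized Manhattan distance C = M / n for one gene.

    D is a list of S rows (samples); each row holds the methylation
    proportions measured at the same n CpG sites. Illustrative helper,
    not part of Methcon5.
    """
    S = len(D)
    if S < 2:
        raise ValueError("at least two samples are required")
    n = len(D[0])
    # M_ij: Manhattan distance between samples i and j over the n CpG sites.
    pair_distances = [
        sum(abs(a - b) for a, b in zip(D[i], D[j]))
        for i, j in combinations(range(S), 2)
    ]
    # M: mean over the H = S(S - 1) / 2 distinct pairs of samples.
    M = sum(pair_distances) / len(pair_distances)
    # C: normalize by the number of CpG sites in the gene.
    return M / n


# Toy gene: 3 samples (rows) measured at 3 CpG sites (columns).
toy = [
    [0.10, 0.20, 0.90],
    [0.10, 0.30, 0.80],
    [0.20, 0.20, 0.90],
]
c_toy = conservation(toy)
print(round(c_toy, 4))  # 0.0667
```

In this toy case the three pairwise distances are 0.2, 0.1, and 0.3, so $M = 0.2$ and $C = 0.2 / 3$. Lower values indicate more similar methylation across samples, that is, greater conservation.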
Because of this inverse relationship between the variance and $n_g$, we expect an over-representation, among the smallest $C$ values, of genes for which the number of measured CpGs is low. This is what we observed in practice. Thus, and given that the $n_g$ values do vary greatly for the MethylationEPIC BeadChip microarray, our tool instead offers two bootstrapping approaches for this task, which we now describe.

For each observed value of $n_g$, we constructed a null distribution for the measured value $C_g$ using a bootstrapping procedure. Specifically, we began by constructing a subset of all measured CpGs that includes only those CpGs that were annotated as gene-linked per the EPIC microarray documentation. We then repeatedly sampled sets of $n_g$ CpG sites to act as "null genes" in the following ways.

In the approach we call the "naive bootstrap," we proceeded as follows, repeating these steps for $l = 1, \ldots, N_s$, for some large number $N_s$:

1. Sample a set of $n_g$ of these gene-associated CpGs independently at random, without regard to location, to form each null gene, $g_l$.
2. Calculate the normalized Manhattan distance for null gene $g_l$, as described in Statistical Approach. Denote this value by $C_l$.

In reality, it is typically the case that the methylation status in neighboring CpGs is correlated (ie, neighboring CpGs are more likely to have the same methylation state than non-neighboring CpGs). Null genes constructed in the manner described by the naive bootstrap approach will not respect the correlation structure present among CpG sites in actual genomes and, as such, may perform badly when CpG information is available densely across the genome, as in our data (we illustrate this in the Results). For that reason, we offer a second, more nuanced approach, which we refer to as the "adjusted bootstrap." In this version of the bootstrap, we proceeded as follows.

First, we extracted all possible sets of $n_g$ consecutive CpGs, such that all $n_g$ CpGs are associated with the same gene. We did this for every gene and denoted this total set across all genes by $S_n$. Again, we next repeated the following steps for $l = 1, \ldots, N_s$, for some large number, $N_s$:

1. Sample a set of $n_g$ consecutive CpGs from $S_n$. Denote this "null gene" by $g_l$. The CpG sites it contains will, by construction, all be associated with the same gene and will maintain the correlation structure that is typical among nearby CpG sites.
2. Calculate the normalized Manhattan distance for null gene $g_l$, as described in Statistical Approach. Denote this value by $C_l$.

This adjusted bootstrap procedure tested the same null hypothesis as the naive bootstrap but used sets of CpGs that better reflected the correlation structure typically found within the genome. In the results we report here, we repeated these bootstrapping procedures $N_s = 1{,}000$ times for each distinct value of $n_g$. We note that the "null gene" sampled in step 1 typically will be different in each repetition of that step. We then took the set of $N_s$ values of $C_l$ we generated for each distinct value of $n_g$ and used those as the null distribution from which we assessed the significance level of the observed values $C_g$ for all genes for which the measured number of CpGs is $n_g$. The significance level is defined as the quantile of the observed value $C_g$ in this null distribution. The lower the quantile (which can be thought of as a P value), the more significantly conserved the gene is. We extracted all genes, of all lengths, that fell between the 0th and 5th percentiles and took them through to the next stage of the analysis, in which we compared them to gene expression values (described in the comparison with public data).

To better understand the importance of genes that are called highly conserved using the procedure described in this section, we then applied a gene-set enrichment analysis using the ReactomePA software package. This flagged pathways that were significantly enriched among our set of conserved genes.

In addition to conducting this analysis for each gene in its entirety (ie, including all CpGs that are associated with that gene), we also conducted analyses in which we considered particular genic regions (5′ untranslated region, or 500 and 2,000 base pairs [bps] from the transcription start site) to see whether these more localized regions might better correlate with gene expression. Finally, we also compared gene expression with conservation in the promoter region.

Comparison with public data. The ultimate step of this analysis is to assess how well gene conservation correlates with gene expression. Because we had no expression data for the samples we used, we instead used data from Expression Atlas [10a]. From there, we obtained expression levels for each gene in normal tissue for each of the colon, small intestine, and endometrium, calculated from RNA-seq data for tissue samples of 122 human individuals. Although the samples are different, the tissue is the same, which led us to believe that, for most genes at least, expression would be similar in their and our samples. Clearly, any correlation between conservation and expression that we would see in our actual data (were expression data available for our samples) would likely be higher than when comparing it with expression in unrelated public samples. As such, our test for correlation is likely to be conservative, but it is for this reason that we focused this proof-of-principle analysis on normal tissue rather than on tumor tissue. Ultimately, we do hope to apply this approach to tumor tissue as well.

RESULTS

Every cell has its own epigenome. By comparing epigenomes between cells within an individual or between individuals, we can discover if certain regions are more conserved. As a first step, we illustrated that conservation is nonuniform along the genome, with greater conservation within genes (Table 1). In the table, we showed the mean value of the per-CpG Manhattan distance, $C$, as a function of category. We categorized the sites in several ways: (1) as gene/nongene; (2) according to whether they fell in CpG islands, shores, shelves, or sea; (3) whether they were located in the 5′ untranslated region of a gene; and (4) whether they were located within 1,500 bps or 200 bps of the transcription start site. As expected if conservation is a beacon or indicator of biologic function, conservation was significantly greater (ie, values of $C$ were low) inside genes versus outside of genes, with the greatest conservation observed within 200 bps of the transcription start site (Table 1). These conservation patterns were present for all three human tissues. Furthermore, we note that there was a strong correlation between conservation and genomic annotation of the region as CpG island/non-island. CpG islands are regions that are observed to have low levels of methylation. As such, it is not entirely surprising that methylation conservation should be high. Regions near islands are often annotated as "shore" (closest to island) and "shelf" (between shore and sea). From the table, though, we see no evidence for increased conservation in these regions.

TABLE 1. Average Manhattan Distance by Tissue Type, Stratified by Gene Regions or Genome Annotation

Variable               Colon    Small Intestine    Endometrium
Total                  0.099    0.109              0.089
Gene association
  Yes                  0.090    0.099              0.080
  No                   0.120    0.133              0.111
CpG island relation
  CpG island           0.057    0.058              0.049
  South shore          0.103    0.110              0.088
  North shore          0.107    0.113              0.092
  South shelf          0.108    0.120              0.096
  North shelf          0.108    0.120              0.096
  Sea                  0.109    0.123              0.101
5′ UTR
  Yes                  0.080    0.090              0.071
  No                   0.101    0.112              0.092
TSS1500
  Yes                  0.091    0.095              0.080
  No                   0.100    0.111              0.091
TSS200
  Yes                  0.057    0.060              0.049
  No                   0.103    0.114              0.093

Abbreviations: TSS, transcription start site; UTR, untranslated region.

In Figure 1, we showed the behavior of conservation of CpG sites as a function of their position relative to their associated gene but averaged across all genes. Each point shows the mean of the absolute value of the difference in methylation frequencies at a given CpG site across all samples. We grouped CpG sites according to their physical position, where 0 represents the location of the first CpG associated with a given gene (per the annotation file for the EPIC array). We see that the Manhattan distance is minimized (and, therefore, conservation is maximized) within the first 2,000 bps or so before then increasing to a steady value along the rest of the gene. The pattern is replicated across three tissue types, albeit at a slightly different level for each type.

FIG 1. Average Manhattan distance for single CpGs as a function of position relative to first 5′ annotated gene CpG site. The greater conservation (lower average Manhattan distances) around genes indicates DNA methylation conservation generally extends for hundreds of base pairs and is not isolated to a single CpG site. [Figure not reproduced. x-axis: distance by hg19 position from first CpG site (binned to 100s); y-axis: average Manhattan distance; series: colon, small intestine, endometrium.]

In Figure 2, we showed how the two methods of bootstrapping described earlier gave different P value distributions when applied to our data. The first row is the distribution one gets when CpG sites are randomly picked when constructing the null distribution of similar genes (the "naive" method). This setup leads to the vast majority of genes having an observed $C$ value that either is always bigger or is always smaller than the "null" genes created by the bootstrapping procedure. Because we wished to rank genes according to P value, this setup was problematic, because it essentially resulted in a large number of ties along with an over-representation of P values equal to 0 or 1. The second row shows the distribution of the P values resulting from the "adjusted bootstrap" procedure. This distribution is much more uniform, as desired, and has far fewer ties between P values, which enabled us to rank the genes more effectively. For this reason, we used the adjusted method to produce the results shown in the rest of this paper.

FIG 2. Distribution of bootstrapping P values for genes. Each column corresponds to a specific tissue. The top row shows results from the naive bootstrap procedure, whereas the bottom row shows the adjusted bootstrap results. [Figure not reproduced. x-axis: P value; y-axis: No. of genes; columns: colon, small intestine, endometrium.]

Epigenetic gene conservation can be further stratified or ranked, because not all genes are expressed in all tissues. Therefore, conservation should vary between tissues. To explore this, after calculating the $C$ values, we took the 5% of genes that were most conserved for each tissue, separately for each tissue, and then conducted a gene-set enrichment analysis to see what pathways were over-represented among those genes. We referred to these pathways as "conserved pathways." We then determined how many of the conserved pathways were conserved in one, two, or all three tissues. Figure 3 shows the results of this analysis. The endometrium and small intestine had the greatest numbers of uniquely conserved pathways, but the overlap between these pathways was small (just five pathways). However, interestingly, a core group of 50 pathways were conserved in all three tissues. These pathways are enriched in core housekeeping functions (cell cycle, DNA replication, transcription, translation) that are essential to all mitotic cells (Fig 4). This reinforces the idea that we are successfully using conservation of methylation to detect genes, and then pathways, that play important roles in the tissue concerned.

FIG 3. Results of pathway conservation analysis. The first three columns show the number of pathways that are called as significantly over-represented just in a single tissue type, while the next four columns show how many pathways are called as conserved in two or more tissue types. The overall number of pathways called as significantly conserved in each tissue is shown by the colored bars at bottom left. [Figure not reproduced. Axes: intersection size; set size (No. of pathways); tissues: colon, small intestine, endometrium.]

The enriched gene-sets can be organized into a network. In Figure 4, we give an example using the most over-represented pathways in the small intestine. In the figure, the nodes represent pathways that were labeled as conserved in our analysis. The edges between nodes represent whether genes are associated with both pathways that are labeled as highly conserved in our analysis. If the overlap proportion of genes between pathways is less than 0.5, no edge is present. Again, we see that most of the pathways that we detected as most conserved have significant overlaps in genes involved, likely because they perform key housekeeping roles in cell function.

FIG 4. Enrichment map of pathways of conserved genes in small intestine tissue. Edges are shown between pathways if the overlap ratio is > 0.5. Major clusters are labeled according to most frequent words in pathway names. [Figure not reproduced.]

Finally, we examined whether conservation correlates with gene expression, our proxy measure of importance of a gene. Although variation in gene promoter methylation often is associated with gene silencing, the degree to which it correlates with gene expression is unclear. As seen in Figure 5, for all three tissues, we found that conservation did correlate with gene expression levels and performed better as an indicator of expression than did gene promoter methylation, which did not appear to correlate with expression at all. The first row represents the adjusted conservation values obtained with our adjusted bootstrapping approach (x-axis, binned by value), whereas the second row is the mean methylation in the promoter region (x-axis, binned by value). The expression values (y-axis) were taken from the Expression Atlas and are displayed on a logarithmic scale. The values for conservation and mean methylation in promoter were binned in such a way that an equal number of points were placed in each bin. Higher bin number represents more conserved genes (ie, conservation increases as we move from left to right in the figure).

FIG 5. The relationship between conservation and expression. Genes are collected into 10 groups according to the degree of conservation measured in our data. For each group, we then show a box-plot of the distribution of log (gene expression) values recorded for the corresponding tissue type in the Expression Atlas database. Columns correspond to the tissue type. The top row shows results when assessing conservation for the entire gene; the bottom row shows the results when assessing conservation just for the promoter region of each gene. We see that gene conservation correlates with expression better than does promoter conservation. [Figure not reproduced. x-axis: quantile of conservation by gene, in bins 0-10 through 90-100; y-axis: expression; rows: bootstrapped values, promoter region; columns: colon, small intestine, endometrium.]

DISCUSSION

The Methcon5 R package that we introduced here provides software necessary to carry out the calculations for conservation and the bootstrapping procedure. The functions have been split into two sections: (1) calculations of the conservation value by region and (2) bootstrapping methods to calculate P values on the basis of conservation values and region length. The calculation of the conservation value is customizable, with user-provided functions allowed and with a default for arithmetic mean. Currently, three different bootstrapping methods are included in the package, two of which have been described in this paper. Also, a second repository is available, which includes all of the analysis scripts necessary to reproduce the analysis performed here, starting from IDAT files. The analyses in this paper took less than 30 minutes to run on a MacBook Pro.

A priori, it has been found that conserved genomic regions tend to be functional [3]. Using the software we introduced here, we applied this principle to epigenomes and presented novel software to identify and rank CpG DNA methylation conservation along the human genome. Conserved genomic regions likely reflect selection, and therefore the identification of preferentially conserved epigenetic regions potentially can identify the genes that are most important to the function of a cell, which is a frequent goal of biologic investigations.

The example analysis presented here illustrates that known functional genomic regions have greater epigenetic conservation. In principle, such epigenetic conservation can be used to help identify which genes are more critical to the survival of a cell. Interestingly, function appeared to correlate with conservation of methylation at multiple gene-associated CpG sites (Fig 1), which may indicate that the epigenetic configuration of the gene region and not of a specific CpG site is informative. The approach and software require at least two samples from the same population. Better still, the comparisons can be applied to a range of samples. Potentially, epigenetic conservation can indicate what genes are under greater selection in native human tissues. For example, comparisons of epigenomes between opposite sides of the same human colorectal cancer reveal preferential conservation of genes involved in immune surveillance, suggesting that it is critically important for cancers to hide from the immune system. Although conservation per se does not conclusively indicate function, highly conserved epigenetic regions can serve as beacons for unbiased discovery of genes or noncoding regions that are more likely to be critical to the function or survival of human cells. The data used in this analysis are available upon request.
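As an illustration of the adjusted bootstrap described under Statistical Approach, the following minimal sketch draws null genes as windows of consecutive CpGs that all belong to one gene, and reports the quantile of the observed conservation value in the resulting null distribution. It is written in plain Python and is not the Methcon5 API; every name (`conservation`, `consecutive_windows`, `adjusted_bootstrap_quantile`), the input layout (a dict mapping each gene to its CpG sites in genomic order, each site holding one methylation proportion per sample), and the toy data are hypothetical. Counting null values less than or equal to the observed value is also an assumption, because the text does not specify how ties are handled.

```python
import random
from itertools import combinations


def conservation(columns):
    """Normalized Manhattan distance for a list of CpG sites.

    Each element of `columns` is one CpG site: a list with one methylation
    proportion per sample. Illustrative helper, not part of Methcon5.
    """
    n = len(columns)
    pairs = list(combinations(range(len(columns[0])), 2))
    total = sum(abs(col[i] - col[j]) for col in columns for i, j in pairs)
    return total / len(pairs) / n


def consecutive_windows(genes, n):
    """All runs of n consecutive CpGs that lie within a single gene."""
    return [
        cols[start:start + n]
        for cols in genes.values()
        for start in range(len(cols) - n + 1)
    ]


def adjusted_bootstrap_quantile(genes, target, n_boot=1000, seed=0):
    """Quantile of the observed C for `target` among n_boot null genes."""
    rng = random.Random(seed)
    observed = conservation(genes[target])
    pool = consecutive_windows(genes, len(genes[target]))
    null = [conservation(rng.choice(pool)) for _ in range(n_boot)]
    # Assumption: ties count toward the quantile (<=).
    return sum(c <= observed for c in null) / n_boot


# Toy data: three samples per CpG site, CpG sites listed in genomic order.
genes = {
    "conserved": [[0.2, 0.2, 0.2], [0.8, 0.8, 0.8]],
    "moderate": [[0.1, 0.2, 0.1], [0.5, 0.5, 0.6], [0.9, 0.8, 0.9]],
    "variable": [[0.1, 0.9, 0.5], [0.9, 0.1, 0.5]],
}
q_conserved = adjusted_bootstrap_quantile(genes, "conserved")
q_variable = adjusted_bootstrap_quantile(genes, "variable")
print(q_conserved < q_variable)  # True
```

A naive bootstrap in the same style would instead draw the CpG sites independently from the pooled gene-associated sites, which discards the correlation between neighboring CpGs that the windowed sampling preserves.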
AFFILIATIONS

(1) Biostatistics Division, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA
(2) Department of Pathology, Keck School of Medicine, University of Southern California, Los Angeles, CA

CORRESPONDING AUTHOR

Paul Marjoram, PhD, Biostatistics Division, 2001 N Soto St, SSB202V, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089; e-mail: pmarjora@usc.edu.

SUPPORT

Supported by National Institutes of Health Awards No. 1P01CA196569 and R21 CA226106.

AUTHOR CONTRIBUTIONS

Conception and design: Darryl Shibata, Paul Marjoram
Provision of study material or patients: Darryl Shibata
Collection and assembly of data: Darryl Shibata
Data analysis and interpretation: All authors
Manuscript writing: All authors
Final approval of manuscript: All authors
Accountable for all aspects of the work: All authors

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST

The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.

Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians (Open Payments).

No potential conflicts of interest were reported.

REFERENCES

1. Allis CD, Jenuwein T: The molecular hallmarks of epigenetic control. Nat Rev Genet 17:487-500, 2016
2. Kellis M, Wold B, Snyder MP, et al: Defining functional DNA elements in the human genome. Proc Natl Acad Sci USA 111:6131-6138, 2014
3. Haerty W, Ponting CP: No gene in the genome makes sense except in the light of evolution. Annu Rev Genomics Hum Genet 15:71-92, 2014
4. Shibata D: Mutation and epigenetic molecular clocks in cancer. Carcinogenesis 32:123-128, 2011
5. Moran S, Arribas C, Esteller M: Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics 8:389-399, 2016
6. European Bioinformatics Institute: Expression Atlas. https://www.ebi.ac.uk/gxa/home
7. R Foundation for Statistical Computing: R: A language and environment for statistical computing. Vienna, Austria, R Foundation for Statistical Computing. https://www.R-project.org/
8. Craw S: Encyclopedia of Machine Learning and Data Mining. Berlin, Germany, Springer, 2017
9. Efron B, Tibshirani R: An Introduction to the Bootstrap. Boca Raton, FL, CRC Press, 1994
10. Yu G, He Q-Y: ReactomePA: An R/Bioconductor package for reactome pathway analysis and visualization. Mol Biosyst 12:477-479, 2016
10a. Papatheodorou I, Moreno P, Manning J, et al: Expression Atlas update: From tissues to single cells. Nucleic Acids Res, gkz947, https://doi.org/10.1093/nar/gkz947
11. Uhlén M, Fagerberg L, Hallström BM, et al: Proteomics: Tissue-based map of the human proteome. Science 347:1260419, 2015
12. Antequera F, Bird A: CpG Islands, in Jost JP, Saluz HP (eds): DNA Methylation, EXS vol 64. Basel, Switzerland, Birkhäuser, 1993
13. Yu G: Enrichplot: Visualization of functional enrichment result. https://github.com/GuangchuangYu/enrichplot
14. Ross MT, Grafham DV, Coffey AJ, et al: The DNA sequence of the human X chromosome. Nature 434:325-337, 2005
15. GitHub: Methcon5. https://github.com/emilhvitfeldt/methcon5
16. GitHub: Epigenetic-conservation-is-a-beacon-of-function. https://github.com/emilhvitfeldt/epigenetic-conservation-is-a-beacon-of-function
17. Ryser MD, Yu M, Grady W, et al: Epigenetic heterogeneity in human colorectal tumors reveals preferential conservation and evidence of immune surveillance. Sci Rep 8:17292, 2018

Journal

JCO Clinical Cancer Informatics, Wolters Kluwer Health

Published: Feb 20, 2020
