Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Multiomic Integration of Public Oncology Databases in Bioconductor

Multiomic Integration of Public Oncology Databases in Bioconductor original reports abstract SPECIAL SERIES: INFORMATICS TOOLS FOR CANCER RESEARCH AND CARE Multiomic Integration of Public Oncology Databases in Bioconductor 1,2,3 1,2 1,2 1,2,4 1,2,5 Marcel Ramos, MPH ; Ludwig Geistlinger, PhD ; Sehyun Oh, PhD ; Lucas Schiffer, MPH Rimsha Azhar, MS ; 1,2 6 6,7 8 3 Hanish Kodali, MBBS, MPH ; Ino de Bruijn, MSc ; Jianjiong Gao, PhD ; Vincent J. Carey, PhD ; Martin Morgan, PhD ; and 1,2 Levi Waldron, PhD PURPOSE Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from . 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases. METHODS We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory. RESULTS We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide in- tegrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses. CONCLUSION These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples. JCO Clin Cancer Inform 4:958-971. © 2020 by American Society of Clinical Oncology Licensed under the Creative Commons Attribution 4.0 License INTRODUCTION Existing command line resources such as the Geno- mic Data Commons (GDC), the Broad Institute’s Public multiomic databases, such as The Cancer Ge- 1 2,3 GDAC Firehose pipeline tool, R packages such as nome Atlas (TCGA) and the cBioPortal repository, 7 8 9 10 firebrowseR, TCGAbiolinks, RTCGAToolbox, cgdsr, provide extensive data on the molecular landscape of website interfaces such as cBioPortal, the Omics cancer, but their incorporation in multiomic analyses ASSOCIATED Discovery Index, and the GenomicDataCommons CONTENT has been hindered by the complexity of data co- package provide varying degrees of portability, us- Appendix ordination, selection, and management. The TCGA ability, and integration for multiomics data. However, in Author affiliations project generated multiomic data, including muta- and support general, these resources either provide certain pre- tions, copy number variants, methylation, and gene information (if specified analyses but lack integration with platforms for expression quantification, for 33 human cancer applicable) appear at statistical analysis or require significant effort to in- types, while the cBioPortal public repository provides the end of this tegrate the different data types within such a platform. multiomic data for . 260 oncological studies in . 20 article. They also present trade-offs between comprehensive primary sites. The size and complexity of these da- Accepted on data access and ease of use (Fig 1). Tools that provide September 21, 2020 tabases impose time-consuming and technically and published at comprehensive data access require familiarity with data complex barriers to the development of novel tools ascopubs.org/journal/ models, linkage between sample and patient identifiers, and analyses, even for advanced bioinformaticians. cci on October 29, and command line tools. Resources with high ease of The lowering of these barriers requires new ap- 2020: DOI https://doi. proaches to the distribution and management of large useprovide amore limited scopeof datasets, and org/10.1200/CCI.19. 4,5 00119 and complex data outputs. the responsibility to coordinate, manage, and even port 958 Multiomic Integration of Public Oncology Databases CONTEXT Key Objective To provide flexible, integrated, multiomic representations of public oncology databases in R/Bioconductor with greatly re- duced data management overhead. Knowledge Generated Our Bioconductor software packages provide a novel approach to lower barriers to analysis and tool development for The Cancer Genome Atlas and cBioPortal databases. Relevance Our tools provide flexible, programmatic analysis of hundreds of fully integrated multiomic oncology data sets within an ecosystem of multiomic analysis tools. multiple onco-omic data sets to analysis-ready platforms falls METHODS on the user. Installation We have implemented the curatedTCGAData, cBioPortalData, The recommended installation procedure for Bioconduc- and TCGAutils packages to provide easily accessible mul- tor packages is described in its installation instructions. 13 14 tiomic data sets in the analysis-ready R and Bioconductor These instructions detail the use of BiocManager, a Com- environment. The curatedTCGAData package serves in- prehensive R Archive Network package, for Bioconductor tegrated data sets for 33 different cancer types with package installations. BiocManager allows easy installa- . 11,000 tumor samples that are built on demand and tion of all three packages as follows: BiocManager:: contain selected data types as requested by the user. install(c(“curatedTCGAData”, “TCGAutils”, “cBioPortalData”). Where other platforms provide either comprehensive data Docker can be used to provide reproducible and easy- acquisition or data subsets with limited analysis capabil- to-set up Bioconductor environments, using instructions ities, curatedTCGAData provides a solid foundation for provided from its download site. The Docker image researchers looking to get started quickly with analyses of provides an RStudio installation that can be used in TCGA data across genomic assays and/or across different conjunction with the aforementioned R package in- cancer types. cBioPortalData makes use of the cBioPortal stallation commands. Users are encouraged to run web application programming interface (API) to serve in- ‘BiocManager::valid()’ to verify that the Bioconductor tegrative representations of multiomics data for . 260 and installation and packages are up to date and properly growing genomic studies. The TCGAutils package further installed. provides facilities to make working with TCGA data easy with convenient identification, separation, and manipu- Data Structure Overview lation of sample and patient identifiers, leveraging the Data sets from curatedTGCAData and cBioPortalData are capabilities of the MultiAssayExperiment data structure. represented using the established MultiAssayExperiment data structure that provides a framework for managing and organizing experimental assays on a set of samples in Bioconductor. The MultiAssayExperiment container eases curatedTCGAData, cBioPortalData the burden of data management by creating a graph cBioPortal.org representation of biological units and their relationship to cgdsr, firebrowseR, TCGAbiolinks multiple experiment measurements along with associated metadata. It provides a convenient platform from which to GenomicDataCommons, GDAC Firehose conduct integrative analyses while representing complex data structures and classes within the R and Bioconductor ecosystems. FIG 1. Comparison of The Cancer Genome Atlas (TCGA) data re- Experiment data class representations are required to sources by integration, ease of use, and data completeness. In- adhere to a set of minimal operations for compatibility. In tegration refers to the ability of the resource to be used within an particular, these data structures must be divisible by rows analysis platform such as R and Bioconductor. A resource with high and columns and have discoverable dimension attributes, data completeness allows users to download the entirety of TCGA such as length and value labels. SummarizedExperiment is data. Ease of use is defined as the low cognitive overhead for use of a resource as imposed by data models and knowledge of query an example of a commonly supported Bioconductor class structures. that is compatible with these basic requirements. JCO Clinical Cancer Informatics 959 Integration Ease of use Completeness Ramos et al 0 1,000 2,000 TP53 PIK3CA ARID1A KRAS APC NF1 IDH1 EGFR Alterations Frameshift deletion CREBBP Frameshift insertion ERBB4 In-frame deletion CTNNB1 In-frame insertion PIK3R1 Missense mutation ARID2 Nonsense mutation EP300 RNA Splice site CDH1 GNAS KDM6A ERBB2 CASP8 GNAQ IDH2 GNA11 NF2 RHOA CDKN1A AKT1 FIG 2. OncoPrint plot of selected cancer driver genes frequently mutated across 33 The Cancer Genome Atlas cancer types. The SummarizedExperiment class is the de facto standard classes that readily conform to MultiAssayExperiment con- representation for high-throughput genomic data in Bio- tainer requirements. Appendix Figure A1 shows a schematic conductor. It provides a flexible architecture that can of the process from database to R/Bioconductor package. support multiple experimental assays in a single instance. It The pipeline annotates ranged data with genome build also allows easy extensibility to other experimental data information extracted from file names and annotation files classes while maintaining the minimum requirements where possible. It merges open-access tier, level 4, data necessary for MultiAssayExperiment representation. One with the more extensive merged level 1 clinical data, in such extension of SummarizedExperiment is the Ranged- some instances providing approximately 800 additional SummarizedExperiment structure. It supports structured variables, while at the same time, removing columns where genomic range representations as row metadata. Multi- all values are missing and maintaining provenance of such AssayExperiment supports an open-ended range of data column names in the metadata. Molecular subtype data classes despite class evolution. were added to 19 of the 33 available cancer types (Ap- pendix Table A1). Appendix Table A2 lists the available Preprocessing experimental assays and respective Bioconductor classes Data for approximately 11,000 samples and 33 different in curatedTCGAData. The open source curatedTCGAData cancers were preprocessed, harmonized, and redistributed pipeline is available through the MultiAssayExperiment through curatedTCGAData. Data were first downloaded download site. cBioPortalData serves data as provided by from the Broad Institute’s GDAC Firehose pipeline’s last run cBioPortal through its web API or through provided .gz files date (January 28, 2016) using the RTCGAToolbox Bio- for complete data sets. conductor package. Subtype information, taken from ExperimentHub supplemental files of primary TCGA publications, was then added to the phenodata and uploaded to the cloud through The curatedTCGAData assembles data sets from compo- Bioconductor’s ExperimentHub. Uploaded TCGA data nents stored and served by ExperimentHub. After data were packaged into standard Bioconductor objects, such extraction from RTCGAToolbox data representations and 20 21 as SummarizedExperiment and RaggedExperiment , binning into appropriate Bioconductor data classes, the 960 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases LUSC KIRC FIG 3. Pan-cancer differential ex- pression analysis. Shown are the top eight consistently downregulated genes (bottom left) and the top eight consistently upregulated genes (top right) when comparing cancer versus adjacent normal samples across 14 cancer types. -5 data were saved as serialized R data objects. Metadata were represent such data. The hierarchical data format 5 programmatically generated for each data type, and data for (HDF5)–based DelayedMatrix representation avoids over- all 33 cancers were uploaded to the cloud using Exper- consumption of memory and allows users to load a “lazy” imentHub, a Bioconductor-provided Amazon cloud storage and partial representation of data on ordinary laptops. On service. The online Bioconductor data repository for experi- ExperimentHub, methylation data sets are stored as two ment data is connected to and managed by an in-house files: one provides the SummarizedExperiment shell, and the database. This database is used by the ExperimentHub R other contains the assay data in HDF5 through use of the package for the retrieval and download of queried data saveHDF5SummarizedExperiment function in the Summa- sets. ExperimentHub provides automatic local caching of rizedExperiment package. the component R objects that are assembled by curate- dTCGAData to create a MultiAssayExperiment, but these TCGAutils cached objects are not intended for direct use by the user. The TCGAutils package covers a wide variety of utility curatedTCGAData retrieves piecewise data representa- functions for simplified manipulation of TCGA data. This tions and constructs a MultiAssayExperiment on the fly companion package is tailored to curatedTCGAData data from ExperimentHub while ensuring that data across all sets but can also work with TCGA data sets, such as those requested experimental assays are accounted for and that obtained from cBioPortalData and the GDC (Appendix imported data types conform to MultiAssayExperiment Figs A2Band A3). TCGAutils implements assay trans- requirements through automatic class checks (Appendix formation functions that work on TCGA barcodes, such as Fig A2A). All data sets are harmonized to only include splitAssays, to separate samples on the basis of type associated patient phenotype data for the requested (eg, tumors, normals). We also provided annotation con- assays. verter functions, such as mirToRanges, qreduceTCGA, and symbolsToRanges, for transforming microRNA met- DelayedMatrix adata, summarizing mutation data, and converting gene To ensure efficient access, we used alternate data repre- symbols to genomic ranges, respectively. Several TCGA sentations for methylation 450K and 27K assays because identifier functions, such as barcodeToUUID and TCGA- of their large size. curatedTCGAData makes use of the barcode, manipulate and translate TCGA barcodes to uni- DelayedMatrix class from the DelayedArray package to versal identifiers and vice versa. JCO Clinical Cancer Informatics 961 CHRDL1 C7 DES DPT SFRP1 ABCA8 LYVE1 TCF21 FOXM1 TPX2 KIF4A TOP2A ASPM IQGAP3 MYBL2 MMP11 Log Fold Change 2 Ramos et al cBioPortalData adjacent normal tissue samples were available. While 26 taking the pairing of samples (tumor v adjacent normal) into The cBioPortal for Cancer Genomics is an open access account, differential expression analysis was carried out on resource and open source platform for interactive and the basis of limma across the selected cancer types. Gene programmatic exploration of multiomic cancer data. The set enrichment analysis of Gene Ontology Biologic Process cBioPortal database currently provides . 260 data sets terms was performed using the over-representation test curated by the cBioPortal team, including TCGA and the 3 implemented in the EnrichmentBrowser package and International Cancer Genome Consortium. The cBioPortal 27 contrasted with the results obtained from the application API service provides programmatic access to the cBio- of Pathway Analysis with Down-weighting of Overlapping Portal database, which is also used for in-house omics data Genes (PADOG). Pan-cancer application of differential management at several cancer centers, including the expression and gene set enrichment analysis was carried Memorial Sloan Kettering Cancer Center and the Dana- out using functionality from the GSEABenchmarkeR Farber Cancer Institute. The cBioPortalData package package. makes use of the cBioPortal API service to retrieve, cache, and subsequently integrate multiomic data as Multi- Reproducible Research AssayExperiment data objects. R/Bioconductor users do All analyses presented in this article are reproducible using not need to construct API query operations to retrieve code provided online. cBioPortal data; they only need to provide a study identifier RESULTS and genes of interest to obtain a MultiAssayExperiment data set through the R interface. The cBioPortalData package Data and Software can be installed as of Bioconductor release version 3.11. The curatedTCGAData and cBioPortalData integrate data from two large public multiomic databases, using Bio- Differential Expression and Gene Set Enrichment Analysis conductor’s MultiAssayExperiment data structure (Ap- Upper quartile–normalized RNA-Seq by Expectation- pendix Fig A1). Multiassay and pan-cancer data sets Maximization transcripts per million gene expression are generated using a single R command that specifies values were obtained using curatedTCGAData. Analysis the required data and returns a MultiAssayExperiment was restricted to 14 cancer types for which at least 10 object (Appendix Fig A2A). curatedTCGAData accesses ORA P < .05 PADOG P < .05 Cell division Chromosome segregation Mitotic cell cycle G1/S transition Cellular response to hypoxia Protein phosphorylation DNA replication DNA replication initiation Spindle organization Mitotic chromatid segregation DNA unwinding Response to cadmium ion DNA repair DNA recombination Cellular response to DNA damage FIG 4. Pan-cancer gene set enrichment analysis. Shown are the 15 Gene Ontology Biologic Process terms that were most frequently found enriched for differential expression in cancer v adjacent-normal comparisons across 14 cancer types. On the left, enrichment is defined as being found by an over-representation analysis (ORA) with P , .05. For comparison, the right shows whether these terms were also found to be enriched according to another enrichment method (Pathway Analysis with Down-weighting of Overlapping Genes [PADOG]). 962 © 2020 by American Society of Clinical Oncology BLCA BRCA ESCA HNSC KICH KIRC KIRP LIHC LUAD LUSC PRAD STAD THCA UCEC BLCA BRCA ESCA HNSC KICH KIRC KIRP LIHC LUAD LUSC PRAD STAD THCA UCEC Multiomic Integration of Public Oncology Databases single-assay data sets processed from the GDAC Firehose pipeline and stored in Bioconductor’s ExperimentHub. The package integrates user-requested assays, cancer types, and clinicopathological data into a custom MultiAssayExperi- ment structure. cBioPortalData accesses data through two methods: through the cBioPortal web API, which enables downloading of a defined number of genes across a chosen number of oncological studies, and by parsing complete data sets downloaded as .zip files from cBioPortal. Both approaches use the MultiAssayExperiment representation to link multiomic profiles, enabling harmonized subsetting and flexible reshaping of data across assays and cancer −0.50.00.5 types. This advance in integration improves flexibility and Pearson Coefficient (ρ) ease of use over other programmatic approaches to ac- cessing these data (Fig 1). FIG 5. Histogram of the distribution of Pearson correlation co- TCGAutils provides an assortment of utility functions for efficients between gene copy number and RNA sequencing gene working with MultiAssayExperiment data representations expression in adrenocortical carcinoma. An integrative represen- and TCGA-related data. The principal functionality allows tation readily allows comparison and correlation of multiomics users to convert genomic annotations to genomic ranges experiments. and positions, summarize genomic ranges of nonsilent mutations or copy number variations at the gene level, performing the differential expression analysis. We also identify curated subtypes from primary TCGA publications, performed a pan-cancer gene set enrichment analysis to extract key level 4 clinical and pathological data from the identify Gene Ontology biological processes commonly hundreds or thousands of merged variables available, and activated or deactivated in multiple cancer types. We produce OncoPrint plots. It also permits users to work with compared two common methods for enrichment analysis TCGA metadata by providing reference tables for TCGA in Figure 4: over-representation analysis and PADOG. barcodes and sample types, translating between TCGA These analyses identify consistently altered molecular patient and universal identifiers and separating selected processes across multiple cancer types, including estab- specimens across assays. Other use cases in TCGAutils lished hallmarks of cancer such as cell division and DNA enable data imputation and text data conversion to stan- 36,37 repair. In an analysis involving multiple assay types, we dard Bioconductor data representations. calculated the bivariate correlation coefficients between Analysis Examples Several examples demonstrate the powerful and flexible analysis environment provided. These analyses, previously CopyNumber only achievable through a significant investment of time −1 and bioinformatics training, become straightforward anal- ysis exercises provided in an analysis vignette. First, we used curatedTCGAData to obtain the mutation data from all 33 cancers in TCGA, then isolated the 26 genes associated with tumor suppression and oncogenesis, and repre- sented them by mutation type as an OncoPrint plot (Fig 2). This analysis is efficient and completely flexible, using the range-based representation of mutation data provided by curatedTCGAData. It confirms that TP53 is the pre- dominant gene, with mutations across many cancers and partially showing the mutual exclusivity of key driver 34,35 mutations. Second, we performed a pan-cancer dif- ferential expression analysis across all TCGA cancer types −1 0 1 against adjacent normal samples, showing the distribution Copy Number of fold change across multiple cancer types for genes that are consistently up- and downregulated in cancer (Fig 3). FIG 6. Gene dosage effect on SNRPB2 expression in adreno- This pan-cancer analysis can be performed in expressive cortical carcinoma (ACC) tumors. The violin plots show increas- stepsof creatingaMultiAssayExperiment containing ing expression of SNRPB2 with increasing copy number, all TCGA RNA sequencing (RNA-seq) data sets, filtering corresponding to a Pearson correlation of 0.83 (the highest cor- relation observed in ACC). for primary tumors and adjacent normal tissues, and JCO Clinical Cancer Informatics 963 Frequency log2(expression) Ramos et al gene copy number and RNA-seq expression values for provided by MultiAssayExperiment simplify and extend the adrenocortical carcinoma (Fig 5), observing a mostly potential for novel multiomic analysis and tool develop- positive distribution of correlations and showing that the ment. The examples presented demonstrate significant expression of most genes is partially modulated by copy simplification of previously expensive and challenging pan- number. This analysis takes advantage of features to cal- cancer analyses, such as the identification of frequent culate the overlap between genomic ranges of copy mutations and recurrent differential gene expression number segments with genomic ranges of genes or any across TCGA. other genomic region. Finally, we showed the distribution of These resources serve a large amount of data, and several expression values by copy number for SNRPB2, the gene steps are made to make access and use more efficient. with the strongest relationship between expression and ExperimentHub provides automatic assay-level caching copy number in adrenocortical carcinoma (Fig 6). and avoids data redownload. TCGA methylation data files DISCUSSION are stored in HDF5 out of memory; thus, users are able to load a MultiAssayExperiment with a small memory footprint The availability of large-scale multiomics cancer data of approximately 1 Gb for the most comprehensive cancer provides novel opportunities for integrative analysis. type in TCGA: breast invasive carcinoma. Users can also However, the integration, management, and statistical export the collected data within a MultiAssayExperiment analysis of these resources remain challenging, even for object to text files through the exportClass function. advanced bioinformaticians. We present a set of data packages and software that makes multiomic analysis of Because the GDAC Firehose pipeline primarily serves hg19 TCGA data on 33 human cancers and cBioPortal data for data, users who look to obtain hg38 build data are rec- 6,12 . 260 onco-omic studies flexible, practical, and efficient ommended to use tools such as the GDC, which can be for a broad range of bioinformatic, statistical, and epidemio- integrated as MultiAssayExperiment objects with additional logical researchers. These data packages use established work. We also provide instructions to liftOver genomic Bioconductor infrastructure, including SummarizedExperi- coordinates from hg19 to hg38 using existing Bioconductor ment, MultiAssayExperiment, RaggedExperiment, and packages and associated chain files (Appendix Fig A2C ExperimentHub, integrating multiomic data with clini- and in the TCGAutils vignette). However, Gao et al copathological data and simplifying analysis, visuali- compared legacy hg19-based (as procured by curate- zation, and further tool development. curatedTCGAData dTCGAData) and harmonized hg38-based (from the GDC) and cBioPortalData link these data resources to an eco- data sets in terms of biological interpretation and con- system of 26 Bioconductor packages for multiomic data cluded that most analyses are largely insensitive to the analysis that require or suggest the MultiAssayExperiment update of genome build, with the most meaningful differ- data class. This ecosystem of packages, the companion ences being in mutation calling algorithms and in mapping package TCGAutils, and multiomic data management of methylation probes to noncoding genes. AFFILIATIONS SUPPORT Graduate School of Public Health and Health Policy, City University of Supported by National Cancer Institute (NCI) grant U24-CA180996 New York, New York, NY (M.R., M.M., and L.W.). M.R. was supported by NCI grant U24- Institute for Implementation Science and Population Health, City CA220457. I.d.B. and J.G. were supported by the Marie-Josee and Henry University of New York, New York, NY R. Kravis Center for Molecular Oncology, an NCI Cancer Center, core Roswell Park Comprehensive Cancer Center, Buffalo, NY grant P30-CA008748 and NCI Informatics Technology for Cancer Section of Computational Biomedicine, Boston University School of Research grant U24-CA220457. L.G. was supported by a research Medicine, Boston, MA fellowship from the German Research Foundation (GE3023/1-1). Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, NY AUTHOR CONTRIBUTIONS Marie-Josee ´ and Henry R. Kravis Center for Molecular Oncology, Conception and design: Marcel Ramos, Lucas Schiffer, Ino de Bruijn, Memorial Sloan Kettering Cancer Center, New York, NY Vincent J. Carey, Martin Morgan, Levi Waldron Department of Epidemiology and Biostatistics, Memorial Sloan Financial support: Martin Morgan, Levi Waldron Kettering Cancer Center, New York, NY Administrative support: Rimsha Azhar Channing Division of Network Medicine, Brigham and Women’s Collection and assembly of data: Marcel Ramos, Lucas Schiffer, Rimsha Hospital, Harvard Medical School, Boston, MA Azhar, Hanish Kodali, Ino de Bruijn, Vincent J. Carey Data analysis and interpretation: Marcel Ramos, Ludwig Geistlinger, CORRESPONDING AUTHOR Sehyun Oh, Ino de Bruijn, Jianjiong Gao, Vincent J. Carey, Levi Waldron Levi Waldron, PhD, Graduate School of Public Health and Health Policy, Manuscript writing: All authors City University of New York, 55 W 125th St, 6th Floor, New York, NY Final approval of manuscript: All authors 10027; e-mail: levi.waldron@sph.cuny.edu. Accountable for all aspects of the work: All authors 964 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases Open Payments is a public database containing information reported by AUTHORS’ DISCLOSURES OF POTENTIAL CONFLICTS OF companies about payments made to US-licensed physicians (Open INTEREST Payments). The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless Vincent J. Carey otherwise noted. Relationships are self-held unless noted. I = Immediate Employment: CleanSlate (I) Family Member, Inst = My Institution. Relationships may not relate to the Honoraria: Gilead Sciences (I) subject matter of this manuscript. For more information about ASCO’s Research Funding: Bayer AG conflict of interest policy, please refer to www.asco.org/rwc or ascopubs. No other potential conflicts of interest were reported. org/cci/author-center. REFERENCES 1. Weinstein JN, Collisson EA, Mills GB, et al: The Cancer Genome Atlas pan-cancer analysis project. Nat Genet 45:1113-1120, 2013 2. Cerami E, Gao J, Dogrusoz U, et al: The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discov 2:401-404, 2012 3. Gao J, Aksoy BA, Dogrusoz U, et al: Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal 6:pl1, 2013 4. Bourne PE, Lorsch JR, Green ED: Perspective: Sustaining the big-data ecosystem. Nature 527:S16-S17, 2015 5. Kannan L, Ramos M, Re A, et al: Public data and open source tools for multi-assay genomic investigation of disease. Brief Bioinform 17:603-615, 2016 6. Grossman RL, Heath AP, Ferretti V, et al: Toward a shared vision for cancer genomic data. N Engl J Med 375:1109-1112, 2016 7. Deng M, Bragelmann ¨ J, Kryukov I, et al: FirebrowseR: An R client to the Broad Institute’s Firehose pipeline. Database (Oxford) 2017:baw160, 2017 8. Colaprico A, Silva TC, Olsen C, et al: TCGAbiolinks: An R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res 44:e71, 2016 9. Samur MK: RTCGAToolbox: A new tool for exporting TCGA Firehose data. PLoS One 9:e106397, 2014 10. Jacobsen A, Luna A: cgdsr: R-Based API for Accessing the MSKCC Cancer Genomics Data Server (CGDS), 2018. https://CRAN.R-project.org/package=cgdsr 11. Perez-Riverol Y, Bai M, da Veiga Leprevost F, et al: Discovering and linking public omics data sets using the Omics Discovery Index. Nat Biotechnol 35:406-409, 2017 12. Morgan M, Davis SR: GenomicDataCommons: A Bioconductor Interface to the NCI Genomic Data Commons, 2017. https://www.biorxiv.org/content/10.1101/ 117200v4 13. Ihaka R, Gentleman R: R: A language for data analysis and graphics. J Comput Graph Stat 5:299-314, 1996 14. Huber W, Carey VJ, Gentleman R, et al: Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12:115-121, 2015 15. Ramos M, Schiffer L, Re A, et al: Software for the integration of multiomics experiments in Bioconductor. Cancer Res 77:e39-e42, 2017 16. Bioconductor: Using Bioconductor. https://bioconductor.org/install 17. Docker: Getting started with Docker. https://www.docker.com 18. Bioconductor: bioconductor_docker. https://github.com/Bioconductor/bioconductor_docker 19. Broad Institute TCGA Genome Data Analysis Center: Analysis-ready standardized TCGA data from Broad GDAC Firehose: 2016_01_28 run, 2016. http://gdac. broadinstitute.org/runs/stddata__2016_01_28 20. Morgan M, Obenchain V, Hester J, et al: SummarizedExperiment: SummarizedExperiment container. R package version, 2017. https://www.bioconductor.org/ packages/SummarizedExperiment/ 21. Morgan M, Ramos M: RaggedExperiment: Representation of sparse experiments and assays across samples, 2018. https://bioconductor.org/packages/ release/bioc/html/RaggedExperiment.html 22. Waldron Lab: MultiAssayExperiment.TCGA. https://github.com/waldronlab/MultiAssayExperiment.TCGA 23. Bioconductor: ExperimentHub: Client to access ExperimentHub resources, 2016. https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html 24. Pages H, Hickey P, Lun A: DelayedArray: A unified framework for working transparently with on-disk and in-memory array-like datasets, 2016. https:// bioconductor.org/packages/release/bioc/html/DelayedArray.html 25. The HDF Group: HDF5, 1997-2019. http://www.hdfgroup.org/HDF5 26. cBioPortal: Select studies for visualization & analysis. https://cbioportal.org 27. cBioPortal: cBioPortal API, 2019. https://www.cbioportal.org/api/swagger-ui.html 28. Li B, Dewey CN: RSEM: Accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics 12:323, 2011 29. Ritchie ME, Phipson B, Wu D, et al: limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43:e47, 2015 30. Geistlinger L, Csaba G, Zimmer R: Bioconductor’s EnrichmentBrowser: Seamless navigation through combined results of set- & network-based enrichment analysis. BMC Bioinformatics 17:45, 2016 31. Tarca AL, Draghici S, Bhatti G, et al: Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics 13:136, 2012 32. Geistlinger L, Csaba G, Santarelli M, et al: Toward a gold standard for benchmarking gene set enrichment analysis. Brief Bioinform 10.1093/bib/bbz158 [epub ahead of print on February 6, 2020] 33. LiNk-NY: curatedTCGAManu. https://github.com/LiNk-NY/curatedTCGAManu 34. Bailey MH, Tokheim C, Porta-Pardo E, et al: Comprehensive characterization of cancer driver genes and mutations. Cell 173:371-385.e18, 2018 [Erratum: Cell 174:1034-1035, 2018] 35. Ding L, Bailey MH, Porta-Pardo E, et al: Perspective on oncogenic processes at the end of the beginning of cancer genomics. Cell 173:305-320.e10, 2018 36. Hanahan D, Weinberg RA: The hallmarks of cancer. Cell 100:57-70, 2000 37. Hanahan D, Weinberg RA: Hallmarks of cancer: The next generation. Cell 144:646-674, 2011 38. Gao GF, Parker JS, Reynolds SM, et al: Before and after: Comparison of legacy and harmonized TCGA Genomic Data Commons’ data. Cell Syst 9:24-34.e10, 2019 nn n JCO Clinical Cancer Informatics 965 Ramos et al APPENDIX Database Process Package Pipeline NIH NCI NHGRI The Cancer Genome Atlas cBioPortal for Cancer Broad Insnstitute Genomics GDAC Firehose Data RTCGAToolbox curation Preprocess Infrastructure software MultiAssayExperiment.TCGA MultiAssayExperiment Utility software Experiment data Experiment data TCGAutils curatedTCGAData cBioPortalData FIG A1. Flow diagram of the curatedTCGAData pipeline and cBioPortalData data provenance. NCI, National Cancer Institute; NHGRI, National Human Genome Research Institute; NIH, National Institutes of Health. 966 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases AA if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") if (!requireNamespace("curatedTCGAData", quietly = TRUE)) BiocManager::install("curatedTCGAData") ## Glioblastoma Multiforme (GBM) library(curatedTCGAData) curatedTCGAData(diseaseCode = "GBM", assays = "RNA*", dry.run = FALSE) BB ## installation if (!requireNamespace("cBioPortalData", quietly = TRUE)) BiocManager::install("cBioPortalData") library(cBioPortalData) gbm <- cBioDataPack("gbm_tcga") ## https://cBioPortal.org/api (API method) cBio <- cBioPortal() ## use exportClass() with the result to save data to files ## demo with ACC, with RPPA and CNA assays only for faster API time. acc341 <- cBioPortalData(cBio, studyId = "acc_tcga", genePanelId = "IMPACT341", molecularProfileIds = c("acc_tcga_rppa", "acc_tcga_linear_CNA")) acc341 exportClass(acc341, dir = tempdir(), fmt = "csv") CC liftchain <- "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.ch ain.gz" cloc38 <- file.path(tempdir(), gsub("\\.gz", "", basename(liftchain))) dfile <- tempfile(fileext = ".gz") download.file(liftchain, dfile) R.utils::gunzip(dfile, destname = cloc38, remove = FALSE) library(rtracklayer) chain38 <- suppressMessages( import.chain(cloc38) ) ## Run bulk data download (from S2B) to create gbm object if (!exists("gbm")) gbm <- cBioPortalData::cBioDataPack("gbm_tcga") mutations <- gbm[["mutations_extended"]] seqlevelsStyle(mutations) <- "UCSC" ranges38 <- liftOver(rowRanges(mutations), chain38) FIG A2. (A) Example code for installing and downloading The Cancer Genome Atlas (TCGA) data using curatedTCGAData. (B) Example cBioPortalData code for downloading and exporting TCGA data from cBioPortal and through the cBioPortal application programming interface (API). (C) Example hg19 to hg38 liftOver procedure using Bioconductor tools. JCO Clinical Cancer Informatics 967 Ramos et al library(TCGAutils) library(GenomicDataCommons) ## GenomicDataCommons query <- files(legacy = TRUE) %>% filter( ~ cases.project.project_id == "TCGA-COAD" & data_category == "Gene expression" & data_type == "Exon quantification" ) fileids <- manifest(query)$id[1:4] exonfiles <- gdcdata(fileids) ## TCGAutils makeGRangesListFromExonFiles(exonfiles, nrows = 4) FIG A3. Example code for downloading data through Genomic- DataCommons and loading with TCGAutils. 968 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases 0.7 Protein_PC6 0.6 GISTIC.T_PC3 0.5 miRNA_PC2 0.4 0.3 RNA.Seq_PC1 0.2 RNA.Seq_PC3 0.1 RNA.Seq_PC2 Protein_PC1 RNA.Seq_PC4 RNA.Seq_PC6 GISTIC.T_PC1 GISTIC.T_PC8 Mutations_PC5 RNA.Seq_PC10 Mutations_PC8 Protein_PC5 miRNA_PC3 FIG A4. Correlated principal components (PCs) across experimental assays in adrenocortical carcinoma. miRNA, microRNA; RNA.Seq, RNA sequencing. JCO Clinical Cancer Informatics 969 Protein_PC6 GISTIC.T_PC3 miRNA_PC2 RNA.Seq_PC1 RNA.Seq_PC3 RNA.Seq_PC2 Protein_PC1 RNA.Seq_PC4 RNA.Seq_PC6 GISTIC.T_PC1 GISTIC.T_PC8 Mutations_PC5 RNA.Seq_PC10 Mutations_PC8 Protein_PC5 miRNA_PC3 Ramos et al TABLE A1. TCGA Cancer and Curation Data Available From curatedTCGAData Study Abbreviation Available Subtype Data Study Name ACC Yes Yes Adrenocortical carcinoma BLCA Yes Yes Bladder urothelial carcinoma BRCA Yes Yes Breast invasive carcinoma CESC Yes No Cervical squamous cell carcinoma and endocervical adenocarcinoma CHOL Yes No Cholangiocarcinoma CNTL No No Controls COAD Yes Yes Colon adenocarcinoma DLBC Yes No Lymphoid neoplasm diffuse large B-cell lymphoma ESCA Yes No Esophageal carcinoma FPPP No No FFPE pilot phase II GBM Yes Yes Glioblastoma multiforme HNSC Yes Yes Head and neck squamous cell carcinoma KICH Yes Yes Kidney chromophobe KIRC Yes Yes Kidney renal clear cell carcinoma KIRP Yes Yes Kidney renal papillary cell carcinoma LAML Yes Yes Acute myeloid leukemia LCML No No Chronic myelogenous leukemia LGG Yes Yes Brain lower grade glioma LIHC Yes No Liver hepatocellular carcinoma LUAD Yes Yes Lung adenocarcinoma LUSC Yes Yes Lung squamous cell carcinoma MESO Yes No Mesothelioma MISC No No Miscellaneous OV Yes Yes Ovarian serous cystadenocarcinoma PAAD Yes No Pancreatic adenocarcinoma PCPG Yes No Pheochromocytoma and paraganglioma PRAD Yes Yes Prostate adenocarcinoma READ Yes No Rectum adenocarcinoma SARC Yes No Sarcoma SKCM Yes Yes Skin cutaneous melanoma STAD Yes Yes Stomach adenocarcinoma TGCT Yes No Testicular germ cell tumors THCA Yes Yes Thyroid carcinoma THYM Yes No Thymoma UCEC Yes Yes Uterine corpus endometrial carcinoma UCS Yes No Uterine carcinosarcoma UVM Yes No Uveal melanoma Abbreviation: TCGA, The Cancer Genome Atlas. 970 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases TABLE A2. Descriptions of Data Types Available in curatedTCGAData by Bioconductor Data Class ExperimentList Data Type Description SummarizedExperiment RNASeqGene RSEM TPM gene expression values RNASeq2GeneNorm Upper quartile normalized RSEM TPM gene expression values miRNAArray Probe-level miRNA expression values miRNASeqGene Gene-level log RPM miRNA expression values mRNAArray Unified gene-level mRNA expression values mRNAArray_huex Gene-level mRNA expression values from Affymetrix Human Exon Array mRNAArray_TX_g4502a Gene-level mRNA expression values from Agilent 244K Array mRNAArray_TX_ht_hg_u133a Gene-level mRNA expression values from Affymetrix Human Genome U133 Array GISTIC_AllByGene Gene-level GISTIC2 copy number values GISTIC_ThresholdedByGene Gene-level GISTIC2 thresholded discrete copy number values RPPAArray Reverse-phase protein array normalized protein expression values RangedSummarizedExperiment GISTIC_Peaks GISTIC2 thresholded discrete copy number values in recurrent peak regions SummarizedExperiment with HDF5Array DelayedMatrix Methylation_methyl27 Probe-level methylation β-values from Illumina HumanMethylation 27K BeadChip Methylation_methyl450 Probe-level methylation β-values from Infinium HumanMethylation 450K BeadChip RaggedExperiment CNASNP Segmented somatic CNA calls from SNP array CNVSNP Segmented germline CNV calls from SNP array CNASeq Segmented somatic CNA calls from low-pass DNA sequencing Mutation Somatic mutations calls CNACGH_CGH_hg_244a Segmented somatic CNA calls from CGH Agilent Microarray 244A CNACGH_CGH_hg_415k_g4124a Segmented somatic CNA calls from CGH Agilent Microarray 415K Abbreviations: CGH, comparative genomic hybridization; CNA, copy number alteration; CNV, copy number variant; miRNA, microRNA; RPM, reads per million; RSEM TPM, RNA-Seq by Expectation-Maximization transcripts per million; SNP, single nucleotide polymorphism. All can be converted to RangedSummarizedExperiment (except RPPAArray) with TCGAutils. JCO Clinical Cancer Informatics 971 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png JCO Clinical Cancer Informatics Wolters Kluwer Health

Loading next page...
 
/lp/wolters-kluwer-health/multiomic-integration-of-public-oncology-databases-in-bioconductor-vJU5fHWEGo
Publisher
Wolters Kluwer Health
Copyright
(C) 2020 American Society of Clinical Oncology
ISSN
2473-4276
DOI
10.1200/CCI.19.00119
Publisher site
See Article on Publisher Site

Abstract

original reports abstract SPECIAL SERIES: INFORMATICS TOOLS FOR CANCER RESEARCH AND CARE Multiomic Integration of Public Oncology Databases in Bioconductor 1,2,3 1,2 1,2 1,2,4 1,2,5 Marcel Ramos, MPH ; Ludwig Geistlinger, PhD ; Sehyun Oh, PhD ; Lucas Schiffer, MPH Rimsha Azhar, MS ; 1,2 6 6,7 8 3 Hanish Kodali, MBBS, MPH ; Ino de Bruijn, MSc ; Jianjiong Gao, PhD ; Vincent J. Carey, PhD ; Martin Morgan, PhD ; and 1,2 Levi Waldron, PhD PURPOSE Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from . 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases. METHODS We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory. RESULTS We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide in- tegrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses. CONCLUSION These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples. JCO Clin Cancer Inform 4:958-971. © 2020 by American Society of Clinical Oncology Licensed under the Creative Commons Attribution 4.0 License INTRODUCTION Existing command line resources such as the Geno- mic Data Commons (GDC), the Broad Institute’s Public multiomic databases, such as The Cancer Ge- 1 2,3 GDAC Firehose pipeline tool, R packages such as nome Atlas (TCGA) and the cBioPortal repository, 7 8 9 10 firebrowseR, TCGAbiolinks, RTCGAToolbox, cgdsr, provide extensive data on the molecular landscape of website interfaces such as cBioPortal, the Omics cancer, but their incorporation in multiomic analyses ASSOCIATED Discovery Index, and the GenomicDataCommons CONTENT has been hindered by the complexity of data co- package provide varying degrees of portability, us- Appendix ordination, selection, and management. The TCGA ability, and integration for multiomics data. However, in Author affiliations project generated multiomic data, including muta- and support general, these resources either provide certain pre- tions, copy number variants, methylation, and gene information (if specified analyses but lack integration with platforms for expression quantification, for 33 human cancer applicable) appear at statistical analysis or require significant effort to in- types, while the cBioPortal public repository provides the end of this tegrate the different data types within such a platform. multiomic data for . 260 oncological studies in . 20 article. They also present trade-offs between comprehensive primary sites. The size and complexity of these da- Accepted on data access and ease of use (Fig 1). Tools that provide September 21, 2020 tabases impose time-consuming and technically and published at comprehensive data access require familiarity with data complex barriers to the development of novel tools ascopubs.org/journal/ models, linkage between sample and patient identifiers, and analyses, even for advanced bioinformaticians. cci on October 29, and command line tools. Resources with high ease of The lowering of these barriers requires new ap- 2020: DOI https://doi. proaches to the distribution and management of large useprovide amore limited scopeof datasets, and org/10.1200/CCI.19. 4,5 00119 and complex data outputs. the responsibility to coordinate, manage, and even port 958 Multiomic Integration of Public Oncology Databases CONTEXT Key Objective To provide flexible, integrated, multiomic representations of public oncology databases in R/Bioconductor with greatly re- duced data management overhead. Knowledge Generated Our Bioconductor software packages provide a novel approach to lower barriers to analysis and tool development for The Cancer Genome Atlas and cBioPortal databases. Relevance Our tools provide flexible, programmatic analysis of hundreds of fully integrated multiomic oncology data sets within an ecosystem of multiomic analysis tools. multiple onco-omic data sets to analysis-ready platforms falls METHODS on the user. Installation We have implemented the curatedTCGAData, cBioPortalData, The recommended installation procedure for Bioconduc- and TCGAutils packages to provide easily accessible mul- tor packages is described in its installation instructions. 13 14 tiomic data sets in the analysis-ready R and Bioconductor These instructions detail the use of BiocManager, a Com- environment. The curatedTCGAData package serves in- prehensive R Archive Network package, for Bioconductor tegrated data sets for 33 different cancer types with package installations. BiocManager allows easy installa- . 11,000 tumor samples that are built on demand and tion of all three packages as follows: BiocManager:: contain selected data types as requested by the user. install(c(“curatedTCGAData”, “TCGAutils”, “cBioPortalData”). Where other platforms provide either comprehensive data Docker can be used to provide reproducible and easy- acquisition or data subsets with limited analysis capabil- to-set up Bioconductor environments, using instructions ities, curatedTCGAData provides a solid foundation for provided from its download site. The Docker image researchers looking to get started quickly with analyses of provides an RStudio installation that can be used in TCGA data across genomic assays and/or across different conjunction with the aforementioned R package in- cancer types. cBioPortalData makes use of the cBioPortal stallation commands. Users are encouraged to run web application programming interface (API) to serve in- ‘BiocManager::valid()’ to verify that the Bioconductor tegrative representations of multiomics data for . 260 and installation and packages are up to date and properly growing genomic studies. The TCGAutils package further installed. provides facilities to make working with TCGA data easy with convenient identification, separation, and manipu- Data Structure Overview lation of sample and patient identifiers, leveraging the Data sets from curatedTGCAData and cBioPortalData are capabilities of the MultiAssayExperiment data structure. represented using the established MultiAssayExperiment data structure that provides a framework for managing and organizing experimental assays on a set of samples in Bioconductor. The MultiAssayExperiment container eases curatedTCGAData, cBioPortalData the burden of data management by creating a graph cBioPortal.org representation of biological units and their relationship to cgdsr, firebrowseR, TCGAbiolinks multiple experiment measurements along with associated metadata. It provides a convenient platform from which to GenomicDataCommons, GDAC Firehose conduct integrative analyses while representing complex data structures and classes within the R and Bioconductor ecosystems. FIG 1. Comparison of The Cancer Genome Atlas (TCGA) data re- Experiment data class representations are required to sources by integration, ease of use, and data completeness. In- adhere to a set of minimal operations for compatibility. In tegration refers to the ability of the resource to be used within an particular, these data structures must be divisible by rows analysis platform such as R and Bioconductor. A resource with high and columns and have discoverable dimension attributes, data completeness allows users to download the entirety of TCGA such as length and value labels. SummarizedExperiment is data. Ease of use is defined as the low cognitive overhead for use of a resource as imposed by data models and knowledge of query an example of a commonly supported Bioconductor class structures. that is compatible with these basic requirements. JCO Clinical Cancer Informatics 959 Integration Ease of use Completeness Ramos et al 0 1,000 2,000 TP53 PIK3CA ARID1A KRAS APC NF1 IDH1 EGFR Alterations Frameshift deletion CREBBP Frameshift insertion ERBB4 In-frame deletion CTNNB1 In-frame insertion PIK3R1 Missense mutation ARID2 Nonsense mutation EP300 RNA Splice site CDH1 GNAS KDM6A ERBB2 CASP8 GNAQ IDH2 GNA11 NF2 RHOA CDKN1A AKT1 FIG 2. OncoPrint plot of selected cancer driver genes frequently mutated across 33 The Cancer Genome Atlas cancer types. The SummarizedExperiment class is the de facto standard classes that readily conform to MultiAssayExperiment con- representation for high-throughput genomic data in Bio- tainer requirements. Appendix Figure A1 shows a schematic conductor. It provides a flexible architecture that can of the process from database to R/Bioconductor package. support multiple experimental assays in a single instance. It The pipeline annotates ranged data with genome build also allows easy extensibility to other experimental data information extracted from file names and annotation files classes while maintaining the minimum requirements where possible. It merges open-access tier, level 4, data necessary for MultiAssayExperiment representation. One with the more extensive merged level 1 clinical data, in such extension of SummarizedExperiment is the Ranged- some instances providing approximately 800 additional SummarizedExperiment structure. It supports structured variables, while at the same time, removing columns where genomic range representations as row metadata. Multi- all values are missing and maintaining provenance of such AssayExperiment supports an open-ended range of data column names in the metadata. Molecular subtype data classes despite class evolution. were added to 19 of the 33 available cancer types (Ap- pendix Table A1). Appendix Table A2 lists the available Preprocessing experimental assays and respective Bioconductor classes Data for approximately 11,000 samples and 33 different in curatedTCGAData. The open source curatedTCGAData cancers were preprocessed, harmonized, and redistributed pipeline is available through the MultiAssayExperiment through curatedTCGAData. Data were first downloaded download site. cBioPortalData serves data as provided by from the Broad Institute’s GDAC Firehose pipeline’s last run cBioPortal through its web API or through provided .gz files date (January 28, 2016) using the RTCGAToolbox Bio- for complete data sets. conductor package. Subtype information, taken from ExperimentHub supplemental files of primary TCGA publications, was then added to the phenodata and uploaded to the cloud through The curatedTCGAData assembles data sets from compo- Bioconductor’s ExperimentHub. Uploaded TCGA data nents stored and served by ExperimentHub. After data were packaged into standard Bioconductor objects, such extraction from RTCGAToolbox data representations and 20 21 as SummarizedExperiment and RaggedExperiment , binning into appropriate Bioconductor data classes, the 960 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases LUSC KIRC FIG 3. Pan-cancer differential ex- pression analysis. Shown are the top eight consistently downregulated genes (bottom left) and the top eight consistently upregulated genes (top right) when comparing cancer versus adjacent normal samples across 14 cancer types. -5 data were saved as serialized R data objects. Metadata were represent such data. The hierarchical data format 5 programmatically generated for each data type, and data for (HDF5)–based DelayedMatrix representation avoids over- all 33 cancers were uploaded to the cloud using Exper- consumption of memory and allows users to load a “lazy” imentHub, a Bioconductor-provided Amazon cloud storage and partial representation of data on ordinary laptops. On service. The online Bioconductor data repository for experi- ExperimentHub, methylation data sets are stored as two ment data is connected to and managed by an in-house files: one provides the SummarizedExperiment shell, and the database. This database is used by the ExperimentHub R other contains the assay data in HDF5 through use of the package for the retrieval and download of queried data saveHDF5SummarizedExperiment function in the Summa- sets. ExperimentHub provides automatic local caching of rizedExperiment package. the component R objects that are assembled by curate- dTCGAData to create a MultiAssayExperiment, but these TCGAutils cached objects are not intended for direct use by the user. The TCGAutils package covers a wide variety of utility curatedTCGAData retrieves piecewise data representa- functions for simplified manipulation of TCGA data. This tions and constructs a MultiAssayExperiment on the fly companion package is tailored to curatedTCGAData data from ExperimentHub while ensuring that data across all sets but can also work with TCGA data sets, such as those requested experimental assays are accounted for and that obtained from cBioPortalData and the GDC (Appendix imported data types conform to MultiAssayExperiment Figs A2Band A3). TCGAutils implements assay trans- requirements through automatic class checks (Appendix formation functions that work on TCGA barcodes, such as Fig A2A). All data sets are harmonized to only include splitAssays, to separate samples on the basis of type associated patient phenotype data for the requested (eg, tumors, normals). We also provided annotation con- assays. verter functions, such as mirToRanges, qreduceTCGA, and symbolsToRanges, for transforming microRNA met- DelayedMatrix adata, summarizing mutation data, and converting gene To ensure efficient access, we used alternate data repre- symbols to genomic ranges, respectively. Several TCGA sentations for methylation 450K and 27K assays because identifier functions, such as barcodeToUUID and TCGA- of their large size. curatedTCGAData makes use of the barcode, manipulate and translate TCGA barcodes to uni- DelayedMatrix class from the DelayedArray package to versal identifiers and vice versa. JCO Clinical Cancer Informatics 961 CHRDL1 C7 DES DPT SFRP1 ABCA8 LYVE1 TCF21 FOXM1 TPX2 KIF4A TOP2A ASPM IQGAP3 MYBL2 MMP11 Log Fold Change 2 Ramos et al cBioPortalData adjacent normal tissue samples were available. While 26 taking the pairing of samples (tumor v adjacent normal) into The cBioPortal for Cancer Genomics is an open access account, differential expression analysis was carried out on resource and open source platform for interactive and the basis of limma across the selected cancer types. Gene programmatic exploration of multiomic cancer data. The set enrichment analysis of Gene Ontology Biologic Process cBioPortal database currently provides . 260 data sets terms was performed using the over-representation test curated by the cBioPortal team, including TCGA and the 3 implemented in the EnrichmentBrowser package and International Cancer Genome Consortium. The cBioPortal 27 contrasted with the results obtained from the application API service provides programmatic access to the cBio- of Pathway Analysis with Down-weighting of Overlapping Portal database, which is also used for in-house omics data Genes (PADOG). Pan-cancer application of differential management at several cancer centers, including the expression and gene set enrichment analysis was carried Memorial Sloan Kettering Cancer Center and the Dana- out using functionality from the GSEABenchmarkeR Farber Cancer Institute. The cBioPortalData package package. makes use of the cBioPortal API service to retrieve, cache, and subsequently integrate multiomic data as Multi- Reproducible Research AssayExperiment data objects. R/Bioconductor users do All analyses presented in this article are reproducible using not need to construct API query operations to retrieve code provided online. cBioPortal data; they only need to provide a study identifier RESULTS and genes of interest to obtain a MultiAssayExperiment data set through the R interface. The cBioPortalData package Data and Software can be installed as of Bioconductor release version 3.11. The curatedTCGAData and cBioPortalData integrate data from two large public multiomic databases, using Bio- Differential Expression and Gene Set Enrichment Analysis conductor’s MultiAssayExperiment data structure (Ap- Upper quartile–normalized RNA-Seq by Expectation- pendix Fig A1). Multiassay and pan-cancer data sets Maximization transcripts per million gene expression are generated using a single R command that specifies values were obtained using curatedTCGAData. Analysis the required data and returns a MultiAssayExperiment was restricted to 14 cancer types for which at least 10 object (Appendix Fig A2A). curatedTCGAData accesses ORA P < .05 PADOG P < .05 Cell division Chromosome segregation Mitotic cell cycle G1/S transition Cellular response to hypoxia Protein phosphorylation DNA replication DNA replication initiation Spindle organization Mitotic chromatid segregation DNA unwinding Response to cadmium ion DNA repair DNA recombination Cellular response to DNA damage FIG 4. Pan-cancer gene set enrichment analysis. Shown are the 15 Gene Ontology Biologic Process terms that were most frequently found enriched for differential expression in cancer v adjacent-normal comparisons across 14 cancer types. On the left, enrichment is defined as being found by an over-representation analysis (ORA) with P , .05. For comparison, the right shows whether these terms were also found to be enriched according to another enrichment method (Pathway Analysis with Down-weighting of Overlapping Genes [PADOG]). 962 © 2020 by American Society of Clinical Oncology BLCA BRCA ESCA HNSC KICH KIRC KIRP LIHC LUAD LUSC PRAD STAD THCA UCEC BLCA BRCA ESCA HNSC KICH KIRC KIRP LIHC LUAD LUSC PRAD STAD THCA UCEC Multiomic Integration of Public Oncology Databases single-assay data sets processed from the GDAC Firehose pipeline and stored in Bioconductor’s ExperimentHub. The package integrates user-requested assays, cancer types, and clinicopathological data into a custom MultiAssayExperi- ment structure. cBioPortalData accesses data through two methods: through the cBioPortal web API, which enables downloading of a defined number of genes across a chosen number of oncological studies, and by parsing complete data sets downloaded as .zip files from cBioPortal. Both approaches use the MultiAssayExperiment representation to link multiomic profiles, enabling harmonized subsetting and flexible reshaping of data across assays and cancer −0.50.00.5 types. This advance in integration improves flexibility and Pearson Coefficient (ρ) ease of use over other programmatic approaches to ac- cessing these data (Fig 1). FIG 5. Histogram of the distribution of Pearson correlation co- TCGAutils provides an assortment of utility functions for efficients between gene copy number and RNA sequencing gene working with MultiAssayExperiment data representations expression in adrenocortical carcinoma. An integrative represen- and TCGA-related data. The principal functionality allows tation readily allows comparison and correlation of multiomics users to convert genomic annotations to genomic ranges experiments. and positions, summarize genomic ranges of nonsilent mutations or copy number variations at the gene level, performing the differential expression analysis. We also identify curated subtypes from primary TCGA publications, performed a pan-cancer gene set enrichment analysis to extract key level 4 clinical and pathological data from the identify Gene Ontology biological processes commonly hundreds or thousands of merged variables available, and activated or deactivated in multiple cancer types. We produce OncoPrint plots. It also permits users to work with compared two common methods for enrichment analysis TCGA metadata by providing reference tables for TCGA in Figure 4: over-representation analysis and PADOG. barcodes and sample types, translating between TCGA These analyses identify consistently altered molecular patient and universal identifiers and separating selected processes across multiple cancer types, including estab- specimens across assays. Other use cases in TCGAutils lished hallmarks of cancer such as cell division and DNA enable data imputation and text data conversion to stan- 36,37 repair. In an analysis involving multiple assay types, we dard Bioconductor data representations. calculated the bivariate correlation coefficients between Analysis Examples Several examples demonstrate the powerful and flexible analysis environment provided. These analyses, previously CopyNumber only achievable through a significant investment of time −1 and bioinformatics training, become straightforward anal- ysis exercises provided in an analysis vignette. First, we used curatedTCGAData to obtain the mutation data from all 33 cancers in TCGA, then isolated the 26 genes associated with tumor suppression and oncogenesis, and repre- sented them by mutation type as an OncoPrint plot (Fig 2). This analysis is efficient and completely flexible, using the range-based representation of mutation data provided by curatedTCGAData. It confirms that TP53 is the pre- dominant gene, with mutations across many cancers and partially showing the mutual exclusivity of key driver 34,35 mutations. Second, we performed a pan-cancer dif- ferential expression analysis across all TCGA cancer types −1 0 1 against adjacent normal samples, showing the distribution Copy Number of fold change across multiple cancer types for genes that are consistently up- and downregulated in cancer (Fig 3). FIG 6. Gene dosage effect on SNRPB2 expression in adreno- This pan-cancer analysis can be performed in expressive cortical carcinoma (ACC) tumors. The violin plots show increas- stepsof creatingaMultiAssayExperiment containing ing expression of SNRPB2 with increasing copy number, all TCGA RNA sequencing (RNA-seq) data sets, filtering corresponding to a Pearson correlation of 0.83 (the highest cor- relation observed in ACC). for primary tumors and adjacent normal tissues, and JCO Clinical Cancer Informatics 963 Frequency log2(expression) Ramos et al gene copy number and RNA-seq expression values for provided by MultiAssayExperiment simplify and extend the adrenocortical carcinoma (Fig 5), observing a mostly potential for novel multiomic analysis and tool develop- positive distribution of correlations and showing that the ment. The examples presented demonstrate significant expression of most genes is partially modulated by copy simplification of previously expensive and challenging pan- number. This analysis takes advantage of features to cal- cancer analyses, such as the identification of frequent culate the overlap between genomic ranges of copy mutations and recurrent differential gene expression number segments with genomic ranges of genes or any across TCGA. other genomic region. Finally, we showed the distribution of These resources serve a large amount of data, and several expression values by copy number for SNRPB2, the gene steps are made to make access and use more efficient. with the strongest relationship between expression and ExperimentHub provides automatic assay-level caching copy number in adrenocortical carcinoma (Fig 6). and avoids data redownload. TCGA methylation data files DISCUSSION are stored in HDF5 out of memory; thus, users are able to load a MultiAssayExperiment with a small memory footprint The availability of large-scale multiomics cancer data of approximately 1 Gb for the most comprehensive cancer provides novel opportunities for integrative analysis. type in TCGA: breast invasive carcinoma. Users can also However, the integration, management, and statistical export the collected data within a MultiAssayExperiment analysis of these resources remain challenging, even for object to text files through the exportClass function. advanced bioinformaticians. We present a set of data packages and software that makes multiomic analysis of Because the GDAC Firehose pipeline primarily serves hg19 TCGA data on 33 human cancers and cBioPortal data for data, users who look to obtain hg38 build data are rec- 6,12 . 260 onco-omic studies flexible, practical, and efficient ommended to use tools such as the GDC, which can be for a broad range of bioinformatic, statistical, and epidemio- integrated as MultiAssayExperiment objects with additional logical researchers. These data packages use established work. We also provide instructions to liftOver genomic Bioconductor infrastructure, including SummarizedExperi- coordinates from hg19 to hg38 using existing Bioconductor ment, MultiAssayExperiment, RaggedExperiment, and packages and associated chain files (Appendix Fig A2C ExperimentHub, integrating multiomic data with clini- and in the TCGAutils vignette). However, Gao et al copathological data and simplifying analysis, visuali- compared legacy hg19-based (as procured by curate- zation, and further tool development. curatedTCGAData dTCGAData) and harmonized hg38-based (from the GDC) and cBioPortalData link these data resources to an eco- data sets in terms of biological interpretation and con- system of 26 Bioconductor packages for multiomic data cluded that most analyses are largely insensitive to the analysis that require or suggest the MultiAssayExperiment update of genome build, with the most meaningful differ- data class. This ecosystem of packages, the companion ences being in mutation calling algorithms and in mapping package TCGAutils, and multiomic data management of methylation probes to noncoding genes. AFFILIATIONS SUPPORT Graduate School of Public Health and Health Policy, City University of Supported by National Cancer Institute (NCI) grant U24-CA180996 New York, New York, NY (M.R., M.M., and L.W.). M.R. was supported by NCI grant U24- Institute for Implementation Science and Population Health, City CA220457. I.d.B. and J.G. were supported by the Marie-Josee and Henry University of New York, New York, NY R. Kravis Center for Molecular Oncology, an NCI Cancer Center, core Roswell Park Comprehensive Cancer Center, Buffalo, NY grant P30-CA008748 and NCI Informatics Technology for Cancer Section of Computational Biomedicine, Boston University School of Research grant U24-CA220457. L.G. was supported by a research Medicine, Boston, MA fellowship from the German Research Foundation (GE3023/1-1). Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, NY AUTHOR CONTRIBUTIONS Marie-Josee ´ and Henry R. Kravis Center for Molecular Oncology, Conception and design: Marcel Ramos, Lucas Schiffer, Ino de Bruijn, Memorial Sloan Kettering Cancer Center, New York, NY Vincent J. Carey, Martin Morgan, Levi Waldron Department of Epidemiology and Biostatistics, Memorial Sloan Financial support: Martin Morgan, Levi Waldron Kettering Cancer Center, New York, NY Administrative support: Rimsha Azhar Channing Division of Network Medicine, Brigham and Women’s Collection and assembly of data: Marcel Ramos, Lucas Schiffer, Rimsha Hospital, Harvard Medical School, Boston, MA Azhar, Hanish Kodali, Ino de Bruijn, Vincent J. Carey Data analysis and interpretation: Marcel Ramos, Ludwig Geistlinger, CORRESPONDING AUTHOR Sehyun Oh, Ino de Bruijn, Jianjiong Gao, Vincent J. Carey, Levi Waldron Levi Waldron, PhD, Graduate School of Public Health and Health Policy, Manuscript writing: All authors City University of New York, 55 W 125th St, 6th Floor, New York, NY Final approval of manuscript: All authors 10027; e-mail: levi.waldron@sph.cuny.edu. Accountable for all aspects of the work: All authors 964 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases Open Payments is a public database containing information reported by AUTHORS’ DISCLOSURES OF POTENTIAL CONFLICTS OF companies about payments made to US-licensed physicians (Open INTEREST Payments). The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless Vincent J. Carey otherwise noted. Relationships are self-held unless noted. I = Immediate Employment: CleanSlate (I) Family Member, Inst = My Institution. Relationships may not relate to the Honoraria: Gilead Sciences (I) subject matter of this manuscript. For more information about ASCO’s Research Funding: Bayer AG conflict of interest policy, please refer to www.asco.org/rwc or ascopubs. No other potential conflicts of interest were reported. org/cci/author-center. REFERENCES 1. Weinstein JN, Collisson EA, Mills GB, et al: The Cancer Genome Atlas pan-cancer analysis project. Nat Genet 45:1113-1120, 2013 2. Cerami E, Gao J, Dogrusoz U, et al: The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discov 2:401-404, 2012 3. Gao J, Aksoy BA, Dogrusoz U, et al: Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal 6:pl1, 2013 4. Bourne PE, Lorsch JR, Green ED: Perspective: Sustaining the big-data ecosystem. Nature 527:S16-S17, 2015 5. Kannan L, Ramos M, Re A, et al: Public data and open source tools for multi-assay genomic investigation of disease. Brief Bioinform 17:603-615, 2016 6. Grossman RL, Heath AP, Ferretti V, et al: Toward a shared vision for cancer genomic data. N Engl J Med 375:1109-1112, 2016 7. Deng M, Bragelmann ¨ J, Kryukov I, et al: FirebrowseR: An R client to the Broad Institute’s Firehose pipeline. Database (Oxford) 2017:baw160, 2017 8. Colaprico A, Silva TC, Olsen C, et al: TCGAbiolinks: An R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res 44:e71, 2016 9. Samur MK: RTCGAToolbox: A new tool for exporting TCGA Firehose data. PLoS One 9:e106397, 2014 10. Jacobsen A, Luna A: cgdsr: R-Based API for Accessing the MSKCC Cancer Genomics Data Server (CGDS), 2018. https://CRAN.R-project.org/package=cgdsr 11. Perez-Riverol Y, Bai M, da Veiga Leprevost F, et al: Discovering and linking public omics data sets using the Omics Discovery Index. Nat Biotechnol 35:406-409, 2017 12. Morgan M, Davis SR: GenomicDataCommons: A Bioconductor Interface to the NCI Genomic Data Commons, 2017. https://www.biorxiv.org/content/10.1101/ 117200v4 13. Ihaka R, Gentleman R: R: A language for data analysis and graphics. J Comput Graph Stat 5:299-314, 1996 14. Huber W, Carey VJ, Gentleman R, et al: Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12:115-121, 2015 15. Ramos M, Schiffer L, Re A, et al: Software for the integration of multiomics experiments in Bioconductor. Cancer Res 77:e39-e42, 2017 16. Bioconductor: Using Bioconductor. https://bioconductor.org/install 17. Docker: Getting started with Docker. https://www.docker.com 18. Bioconductor: bioconductor_docker. https://github.com/Bioconductor/bioconductor_docker 19. Broad Institute TCGA Genome Data Analysis Center: Analysis-ready standardized TCGA data from Broad GDAC Firehose: 2016_01_28 run, 2016. http://gdac. broadinstitute.org/runs/stddata__2016_01_28 20. Morgan M, Obenchain V, Hester J, et al: SummarizedExperiment: SummarizedExperiment container. R package version, 2017. https://www.bioconductor.org/ packages/SummarizedExperiment/ 21. Morgan M, Ramos M: RaggedExperiment: Representation of sparse experiments and assays across samples, 2018. https://bioconductor.org/packages/ release/bioc/html/RaggedExperiment.html 22. Waldron Lab: MultiAssayExperiment.TCGA. https://github.com/waldronlab/MultiAssayExperiment.TCGA 23. Bioconductor: ExperimentHub: Client to access ExperimentHub resources, 2016. https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html 24. Pages H, Hickey P, Lun A: DelayedArray: A unified framework for working transparently with on-disk and in-memory array-like datasets, 2016. https:// bioconductor.org/packages/release/bioc/html/DelayedArray.html 25. The HDF Group: HDF5, 1997-2019. http://www.hdfgroup.org/HDF5 26. cBioPortal: Select studies for visualization & analysis. https://cbioportal.org 27. cBioPortal: cBioPortal API, 2019. https://www.cbioportal.org/api/swagger-ui.html 28. Li B, Dewey CN: RSEM: Accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics 12:323, 2011 29. Ritchie ME, Phipson B, Wu D, et al: limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43:e47, 2015 30. Geistlinger L, Csaba G, Zimmer R: Bioconductor’s EnrichmentBrowser: Seamless navigation through combined results of set- & network-based enrichment analysis. BMC Bioinformatics 17:45, 2016 31. Tarca AL, Draghici S, Bhatti G, et al: Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics 13:136, 2012 32. Geistlinger L, Csaba G, Santarelli M, et al: Toward a gold standard for benchmarking gene set enrichment analysis. Brief Bioinform 10.1093/bib/bbz158 [epub ahead of print on February 6, 2020] 33. LiNk-NY: curatedTCGAManu. https://github.com/LiNk-NY/curatedTCGAManu 34. Bailey MH, Tokheim C, Porta-Pardo E, et al: Comprehensive characterization of cancer driver genes and mutations. Cell 173:371-385.e18, 2018 [Erratum: Cell 174:1034-1035, 2018] 35. Ding L, Bailey MH, Porta-Pardo E, et al: Perspective on oncogenic processes at the end of the beginning of cancer genomics. Cell 173:305-320.e10, 2018 36. Hanahan D, Weinberg RA: The hallmarks of cancer. Cell 100:57-70, 2000 37. Hanahan D, Weinberg RA: Hallmarks of cancer: The next generation. Cell 144:646-674, 2011 38. Gao GF, Parker JS, Reynolds SM, et al: Before and after: Comparison of legacy and harmonized TCGA Genomic Data Commons’ data. Cell Syst 9:24-34.e10, 2019 nn n JCO Clinical Cancer Informatics 965 Ramos et al APPENDIX Database Process Package Pipeline NIH NCI NHGRI The Cancer Genome Atlas cBioPortal for Cancer Broad Insnstitute Genomics GDAC Firehose Data RTCGAToolbox curation Preprocess Infrastructure software MultiAssayExperiment.TCGA MultiAssayExperiment Utility software Experiment data Experiment data TCGAutils curatedTCGAData cBioPortalData FIG A1. Flow diagram of the curatedTCGAData pipeline and cBioPortalData data provenance. NCI, National Cancer Institute; NHGRI, National Human Genome Research Institute; NIH, National Institutes of Health. 966 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases AA if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") if (!requireNamespace("curatedTCGAData", quietly = TRUE)) BiocManager::install("curatedTCGAData") ## Glioblastoma Multiforme (GBM) library(curatedTCGAData) curatedTCGAData(diseaseCode = "GBM", assays = "RNA*", dry.run = FALSE) BB ## installation if (!requireNamespace("cBioPortalData", quietly = TRUE)) BiocManager::install("cBioPortalData") library(cBioPortalData) gbm <- cBioDataPack("gbm_tcga") ## https://cBioPortal.org/api (API method) cBio <- cBioPortal() ## use exportClass() with the result to save data to files ## demo with ACC, with RPPA and CNA assays only for faster API time. acc341 <- cBioPortalData(cBio, studyId = "acc_tcga", genePanelId = "IMPACT341", molecularProfileIds = c("acc_tcga_rppa", "acc_tcga_linear_CNA")) acc341 exportClass(acc341, dir = tempdir(), fmt = "csv") CC liftchain <- "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.ch ain.gz" cloc38 <- file.path(tempdir(), gsub("\\.gz", "", basename(liftchain))) dfile <- tempfile(fileext = ".gz") download.file(liftchain, dfile) R.utils::gunzip(dfile, destname = cloc38, remove = FALSE) library(rtracklayer) chain38 <- suppressMessages( import.chain(cloc38) ) ## Run bulk data download (from S2B) to create gbm object if (!exists("gbm")) gbm <- cBioPortalData::cBioDataPack("gbm_tcga") mutations <- gbm[["mutations_extended"]] seqlevelsStyle(mutations) <- "UCSC" ranges38 <- liftOver(rowRanges(mutations), chain38) FIG A2. (A) Example code for installing and downloading The Cancer Genome Atlas (TCGA) data using curatedTCGAData. (B) Example cBioPortalData code for downloading and exporting TCGA data from cBioPortal and through the cBioPortal application programming interface (API). (C) Example hg19 to hg38 liftOver procedure using Bioconductor tools. JCO Clinical Cancer Informatics 967 Ramos et al library(TCGAutils) library(GenomicDataCommons) ## GenomicDataCommons query <- files(legacy = TRUE) %>% filter( ~ cases.project.project_id == "TCGA-COAD" & data_category == "Gene expression" & data_type == "Exon quantification" ) fileids <- manifest(query)$id[1:4] exonfiles <- gdcdata(fileids) ## TCGAutils makeGRangesListFromExonFiles(exonfiles, nrows = 4) FIG A3. Example code for downloading data through Genomic- DataCommons and loading with TCGAutils. 968 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases 0.7 Protein_PC6 0.6 GISTIC.T_PC3 0.5 miRNA_PC2 0.4 0.3 RNA.Seq_PC1 0.2 RNA.Seq_PC3 0.1 RNA.Seq_PC2 Protein_PC1 RNA.Seq_PC4 RNA.Seq_PC6 GISTIC.T_PC1 GISTIC.T_PC8 Mutations_PC5 RNA.Seq_PC10 Mutations_PC8 Protein_PC5 miRNA_PC3 FIG A4. Correlated principal components (PCs) across experimental assays in adrenocortical carcinoma. miRNA, microRNA; RNA.Seq, RNA sequencing. JCO Clinical Cancer Informatics 969 Protein_PC6 GISTIC.T_PC3 miRNA_PC2 RNA.Seq_PC1 RNA.Seq_PC3 RNA.Seq_PC2 Protein_PC1 RNA.Seq_PC4 RNA.Seq_PC6 GISTIC.T_PC1 GISTIC.T_PC8 Mutations_PC5 RNA.Seq_PC10 Mutations_PC8 Protein_PC5 miRNA_PC3 Ramos et al TABLE A1. TCGA Cancer and Curation Data Available From curatedTCGAData Study Abbreviation Available Subtype Data Study Name ACC Yes Yes Adrenocortical carcinoma BLCA Yes Yes Bladder urothelial carcinoma BRCA Yes Yes Breast invasive carcinoma CESC Yes No Cervical squamous cell carcinoma and endocervical adenocarcinoma CHOL Yes No Cholangiocarcinoma CNTL No No Controls COAD Yes Yes Colon adenocarcinoma DLBC Yes No Lymphoid neoplasm diffuse large B-cell lymphoma ESCA Yes No Esophageal carcinoma FPPP No No FFPE pilot phase II GBM Yes Yes Glioblastoma multiforme HNSC Yes Yes Head and neck squamous cell carcinoma KICH Yes Yes Kidney chromophobe KIRC Yes Yes Kidney renal clear cell carcinoma KIRP Yes Yes Kidney renal papillary cell carcinoma LAML Yes Yes Acute myeloid leukemia LCML No No Chronic myelogenous leukemia LGG Yes Yes Brain lower grade glioma LIHC Yes No Liver hepatocellular carcinoma LUAD Yes Yes Lung adenocarcinoma LUSC Yes Yes Lung squamous cell carcinoma MESO Yes No Mesothelioma MISC No No Miscellaneous OV Yes Yes Ovarian serous cystadenocarcinoma PAAD Yes No Pancreatic adenocarcinoma PCPG Yes No Pheochromocytoma and paraganglioma PRAD Yes Yes Prostate adenocarcinoma READ Yes No Rectum adenocarcinoma SARC Yes No Sarcoma SKCM Yes Yes Skin cutaneous melanoma STAD Yes Yes Stomach adenocarcinoma TGCT Yes No Testicular germ cell tumors THCA Yes Yes Thyroid carcinoma THYM Yes No Thymoma UCEC Yes Yes Uterine corpus endometrial carcinoma UCS Yes No Uterine carcinosarcoma UVM Yes No Uveal melanoma Abbreviation: TCGA, The Cancer Genome Atlas. 970 © 2020 by American Society of Clinical Oncology Multiomic Integration of Public Oncology Databases TABLE A2. Descriptions of Data Types Available in curatedTCGAData by Bioconductor Data Class ExperimentList Data Type Description SummarizedExperiment RNASeqGene RSEM TPM gene expression values RNASeq2GeneNorm Upper quartile normalized RSEM TPM gene expression values miRNAArray Probe-level miRNA expression values miRNASeqGene Gene-level log RPM miRNA expression values mRNAArray Unified gene-level mRNA expression values mRNAArray_huex Gene-level mRNA expression values from Affymetrix Human Exon Array mRNAArray_TX_g4502a Gene-level mRNA expression values from Agilent 244K Array mRNAArray_TX_ht_hg_u133a Gene-level mRNA expression values from Affymetrix Human Genome U133 Array GISTIC_AllByGene Gene-level GISTIC2 copy number values GISTIC_ThresholdedByGene Gene-level GISTIC2 thresholded discrete copy number values RPPAArray Reverse-phase protein array normalized protein expression values RangedSummarizedExperiment GISTIC_Peaks GISTIC2 thresholded discrete copy number values in recurrent peak regions SummarizedExperiment with HDF5Array DelayedMatrix Methylation_methyl27 Probe-level methylation β-values from Illumina HumanMethylation 27K BeadChip Methylation_methyl450 Probe-level methylation β-values from Infinium HumanMethylation 450K BeadChip RaggedExperiment CNASNP Segmented somatic CNA calls from SNP array CNVSNP Segmented germline CNV calls from SNP array CNASeq Segmented somatic CNA calls from low-pass DNA sequencing Mutation Somatic mutations calls CNACGH_CGH_hg_244a Segmented somatic CNA calls from CGH Agilent Microarray 244A CNACGH_CGH_hg_415k_g4124a Segmented somatic CNA calls from CGH Agilent Microarray 415K Abbreviations: CGH, comparative genomic hybridization; CNA, copy number alteration; CNV, copy number variant; miRNA, microRNA; RPM, reads per million; RSEM TPM, RNA-Seq by Expectation-Maximization transcripts per million; SNP, single nucleotide polymorphism. All can be converted to RangedSummarizedExperiment (except RPPAArray) with TCGAutils. JCO Clinical Cancer Informatics 971

Journal

JCO Clinical Cancer InformaticsWolters Kluwer Health

Published: Oct 29, 2020

References