Multi-omics single-cell data integration and regulatory inference with graph-linked embedding

Articles
https://doi.org/10.1038/s41587-022-01284-4

Zhi-Jie Cao (1,2) and Ge Gao (1,2) ✉

(1) State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) and Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing, China. (2) Changping Laboratory, Beijing, China. ✉ e-mail: gaog@mail.cbi.pku.edu.cn

Nature Biotechnology | VOL 40 | October 2022 | 1458–1466 | www.nature.com/naturebiotechnology

Despite the emergence of experimental methods for simultaneous measurement of multiple omics modalities in single cells, most single-cell datasets include only one modality. A major obstacle in integrating omics data from multiple modalities is that different omics layers typically have distinct feature spaces. Here, we propose a computational framework called GLUE (graph-linked unified embedding), which bridges the gap by modeling regulatory interactions across omics layers explicitly. Systematic benchmarking demonstrated that GLUE is more accurate, robust and scalable than state-of-the-art tools for heterogeneous single-cell multi-omics data. We applied GLUE to various challenging tasks, including triple-omics integration, integrative regulatory inference and multi-omics human cell atlas construction over millions of cells, where GLUE was able to correct previous annotations. GLUE features a modular design that can be flexibly extended and enhanced for new analysis tasks. The full package is available online at https://github.com/gao-lab/GLUE.

Recent technological advances in single-cell sequencing have enabled the probing of regulatory maps through multiple omics layers, such as chromatin accessibility (single-cell ATAC-sequencing (scATAC-seq) [1,2]), DNA methylation (snmC-seq [3], sci-MET [4]) and the transcriptome (scRNA-seq [5,6]), offering a unique opportunity to unveil the underlying regulatory bases for the functionalities of diverse cell types. While simultaneous assays have recently emerged [8–11], different omics are usually measured independently and produce unpaired data, which calls for effective and efficient in silico multi-omics integration [12,13].

Computationally, one major obstacle faced when integrating unpaired multi-omics data (also known as diagonal integration) is the distinct feature spaces of different modalities (for example, accessible chromatin regions in scATAC-seq versus genes in scRNA-seq). A quick fix is to convert multimodality data into one common feature space based on prior knowledge and apply single-omics data integration methods [15–18]. Such explicit ‘feature conversion’ is straightforward, but has been reported to result in information loss. Algorithms based on coupled matrix factorization circumvent explicit conversion but hardly handle more than two omics layers [20,21]. An alternative option is to match cells from different omics layers via nonlinear manifold alignment, which removes the requirement of prior knowledge completely and could reduce inter-modality information loss in theory [22–25]; however, this technique has mostly been applied to relatively small datasets with a limited number of cell types.

The ever-increasing volume of data is another serious challenge. Recently developed technologies can routinely generate datasets at the scale of millions of cells [27–29], whereas current integration methods have only been applied to datasets with much smaller volumes [15,17,20–23]. To catch up with the growth in data throughput, computational integration methods should be designed with scalability in mind.

Hereby, we introduce GLUE (graph-linked unified embedding), a modular framework for integrating unpaired single-cell multi-omics data and inferring regulatory interactions simultaneously. By modeling the regulatory interactions across omics layers explicitly, GLUE bridges the gaps between various omics-specific feature spaces in a biologically intuitive manner. Systematic benchmarks and case studies demonstrate that GLUE is accurate, robust and scalable for heterogeneous single-cell multi-omics data. Furthermore, GLUE is designed as a generalizable framework that allows for easy extension and quick adoption to particular scenarios in a modular manner. GLUE is publicly accessible at https://github.com/gao-lab/GLUE.

Results

Unpaired multi-omics integration via graph-guided embeddings. Inspired by previous studies, we model cell states as low-dimensional cell embeddings learned through variational autoencoders [30,31]. Given their intrinsic differences in biological nature and assay technology, each omics layer is equipped with a separate autoencoder that uses a probabilistic generative model tailored to the layer-specific feature space (Fig. 1 and Methods).

Taking advantage of prior biological knowledge, we propose the use of a knowledge-based graph (‘guidance graph’) that explicitly models cross-layer regulatory interactions for linking layer-specific feature spaces; the vertices in the graph correspond to the features of different omics layers, and edges represent signed regulatory interactions. For example, when integrating scRNA-seq and scATAC-seq data, the vertices are genes and accessible chromatin regions (that is, ATAC peaks), and a positive edge can be connected between an accessible region and its putative downstream gene. Then, adversarial multimodal alignment of the cells is performed as an iterative optimization procedure, guided by feature embeddings encoded from the graph [32] (Fig. 1 and Methods). Notably, when the iterative process converges, the graph can be refined with inputs from the alignment procedure and used for data-oriented regulatory inference (see below for more details).

Fig. 1 | Architecture of the GLUE framework. Denoting unpaired data from three omics layers as $X_1 \in \mathbb{R}^{N_1 \times |\mathcal{V}_1|}$, $X_2 \in \mathbb{R}^{N_2 \times |\mathcal{V}_2|}$, $X_3 \in \mathbb{R}^{N_3 \times |\mathcal{V}_3|}$, where $N_1, N_2, N_3$ are cell numbers, and $\mathcal{V}_1, \mathcal{V}_2, \mathcal{V}_3$ are sets of omics features in each layer, GLUE uses omics-specific variational autoencoders to learn low-dimensional cell embeddings $u_1, u_2, u_3$ from each omics layer. The data dimensionality and generative distribution can differ across layers, but the embedding dimension $m$ is shared. To link the omics-specific data spaces, GLUE makes use of prior knowledge about regulatory interactions in the form of a guidance graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where vertices $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2 \cup \mathcal{V}_3$ are omics features. A graph variational autoencoder is used to learn feature embeddings $V = (V_1^\top, V_2^\top, V_3^\top)^\top$ from the prior knowledge-based guidance graph, which are then used in data decoders to reconstruct omics data via inner product with cell embeddings, effectively linking the omics-specific data spaces to ensure a consistent embedding orientation. Last, an omics discriminator $D$ is used to align the cell embeddings of different omics layers via adversarial learning. $\phi_1, \phi_2, \phi_3, \phi_{\mathcal{G}}$ represent learnable parameters in data and graph encoders. $\theta_1, \theta_2, \theta_3, \theta_{\mathcal{G}}$ represent learnable parameters in data and graph decoders. $\psi$ represents learnable parameters in the omics discriminator.

Systematic benchmarking demonstrates superior performance. We first benchmarked GLUE against multiple popular unpaired multi-omics integration methods [15–18,23–25,33] using three gold-standard datasets generated by recent simultaneous scRNA-seq and scATAC-seq technologies (SNARE-seq [8], SHARE-seq [9] and 10X Multiome [34]), along with two unpaired datasets (Nephron [35] and MOp).

An effective integration method should match the corresponding cell states from different omics layers, producing cell embeddings where the biological variation is faithfully conserved and the omics layers are well mixed. Compared to other methods, GLUE achieved a high level of biology conservation and omics mixing simultaneously (Fig. 2a, each quantified by three separate metrics as shown in Extended Data Fig. 1), and was consistently the best method across all benchmark datasets in terms of overall score (Fig. 2b, see Methods for details on metric aggregation); these results were also validated by uniform manifold approximation and projection (UMAP) visualization of the aligned cell embeddings (Supplementary Figs. 1–5).

An optimal integration method should produce accurate alignments not only at the cell type level but also at finer scales. Exploiting the ground truth cell-to-cell correspondence in the gold-standard datasets, we further quantified single-cell level alignment error via the FOSCTTM (fraction of samples closer than the true match) metric. On all three datasets, GLUE achieved the lowest FOSCTTM, decreasing the alignment error by large margins compared to the second-best method on each dataset (Fig. 2c, the decreases were 3.6-fold for SNARE-seq, 1.7-fold for SHARE-seq and 1.5-fold for 10X Multiome).

During the evaluation described above, we adopted a standard schema (ATAC peaks were linked to RNA genes if they overlapped in the gene body or proximal promoter regions) to construct the guidance graph for GLUE and to perform feature conversion for other conversion-based methods. Given that our current knowledge about the regulatory interactions is still far from perfect, a useful integration method must be robust to such inaccuracies. Thus, we further assessed the methods’ robustness to corruption of regulatory interactions by randomly replacing varying fractions of existing interactions with nonexistent ones. For all three datasets, GLUE exhibited the smallest performance changes even at corruption rates as high as 90% (Fig. 2d and Extended Data Fig. 2a), suggesting its superior robustness. Consistently, we found that using alternative guidance graphs defined in larger genomic windows had minimal influence on integration performance (Extended Data Fig. 2b,c).

[Figure 2 legend. Methods compared: UnionCom, online iNMF, LIGER (FiG), Seurat v3, Pamona, online iNMF (FiG), Harmony, GLUE, MMD-MA, LIGER, bindSC. Datasets: SNARE-seq, SHARE-seq, 10X Multiome, Nephron, MOp.]

Fig. 2 | Systematic benchmarks of integration performance. a, Biological conservation score versus omics integration score for different integration methods. b, Overall integration score (defined as 0.6 × biology conservation + 0.4 × omics integration) of different integration methods (n = 8 repeats with different model random seeds). c, Single-cell level alignment error (quantified by FOSCTTM) of different integration methods (n = 8 repeats with different model random seeds). d, Increases in FOSCTTM at different prior knowledge corruption rates for integration methods that rely on prior feature relations (n = 8 repeats with different corruption random seeds). e, FOSCTTM values of different integration methods on subsampled datasets of varying sizes (n = 8 repeats with different subsampling random seeds). FiG is an alternative feature conversion method recommended by online iNMF and LIGER (Methods). Online iNMF and LIGER could not run with FiG conversion on the SNARE-seq data because the raw ATAC fragment file was not available, thus marked as ‘NA’. Other NA marks were made because of memory overflow. The error bars indicate mean ± s.d.

Given its neural network-based nature, GLUE may suffer from undertraining when working with small datasets. Thus, we repeated the evaluations using subsampled datasets of various sizes. GLUE remained the top-ranking method with as few as 2,000 cells, but the alignment error increased more steeply when the data volume decreased to less than 1,000 cells (Fig. 2e and Extended Data Fig. 2d). Additionally, we also noted that the integration performance of GLUE was robust for a wide range of hyperparameter and feature selection settings (Extended Data Figs. 3 and 4). Apart from the cell embeddings, the feature embeddings of GLUE also exhibit
considerable robustness to hyperparameter settings, prior knowledge corruption and data subsampling (Extended Data Fig. 5).

In addition to the systematic differences among omics layers, single-cell data are often complicated by batch effects within the same layer. For example, the SHARE-seq data was processed in four libraries, one of which showed a batch effect compared to the other three in scRNA-seq (Supplementary Fig. 6a), while the Nephron data profiled four donors, all of which showed substantial batch effects against each other in both scRNA-seq and scATAC-seq (Supplementary Fig. 7a,c). As a solution to such complex scenarios, GLUE provides batch correction capability by including batch as a decoder covariate (Methods). With batch correction enabled, GLUE was able to correct for these batch effects effectively, producing substantially better batch mixing (Supplementary Fig. 6b and Supplementary Fig. 7b,d). To guard against potential over-correction, for example, when forcing an integration over datasets lacking common cell states, we devised a diagnostic metric called the integration consistency score, which measures the consistency between the integrated multi-omics space and prior knowledge in the guidance graph (Methods). We observed substantially lower scores (close to 0) when integrating data from inconsistent tissues compared to integrating within the same tissue, making it a reliable indicator of integration quality (Extended Data Fig. 6).

GLUE enables effective triple-omics integration. Benefitting from a modular design and scalable adversarial alignment, GLUE readily extends to more than two omics layers. As a case study, we used GLUE to integrate three distinct omics layers of neuronal cells in the adult mouse cortex, including gene expression, chromatin accessibility [38] and DNA methylation [3].

Unlike chromatin accessibility, gene body DNA methylation generally shows a negative correlation with gene expression in neuronal cells. GLUE natively supports the mixture of regulatory effects by modeling edge signs in the guidance graph. Such a strategy avoids data inversion, which is required by previous methods [16,17] and can break data sparsity and the underlying distribution. For the triple-omics guidance graph, we linked gene body mCH and mCG levels to genes via negative edges, while the positive edges between accessible regions and genes remained the same.

The GLUE alignment successfully revealed a shared manifold of cell states across the three omics layers (Fig. 3a–d). Notably, the original cell types were not annotated at the same resolution, and many could be further clustered into smaller subtypes even within single layers (Supplementary Fig. 8a–f). To unify the cell type annotations, neighbor-based label transfer was conducted using the integrated cell embeddings and we observed highly significant marker overlap (Fig. 3e, three-way Fisher’s exact test, false discovery rate (FDR) < $5 \times 10^{-17}$) for 12 out of the 14 mapped cell types (Supplementary Figs. 8g–o and 9 and Methods), indicating reliable alignment. The GLUE alignment helped improve the effects of cell typing in all omics layers, including the further partitioning of the scRNA-seq ‘MGE’ cluster into Pvalb+ (‘mPv’) and Sst+ (‘mSst’) subtypes (highlighted with green circles/flows in Fig. 3 and Supplementary Fig. 8), the partitioning of the scRNA-seq ‘CGE’ cluster and scATAC-seq ‘Vip’ cluster into Vip+ (‘mVip’) and Ndnf+ (‘mNdnf’) subtypes (highlighted with dark blue circles/flows in Fig. 3 and Supplementary Fig. 8), and the identification of snmC-seq ‘mDL-3’ cells and a subset of scATAC-seq ‘L6 IT’ cells as claustrum cells (highlighted with light blue circles/flows in Fig. 3 and Supplementary Fig. 8).

Such triple-omics integration also sheds light on the quantitative contributions of different epigenetic regulation mechanisms (Methods). Among mCH, mCG and chromatin accessibility, we found that the mCH level had the highest predictive power for gene expression in cortical neurons (average $R^2$ = 0.187). When all epigenetic layers were considered, the expression predictability increased further (average $R^2$ = 0.236), suggesting the presence of nonredundant contributions (Fig. 3f). Among the neurons of different layers, DNA methylation (especially mCH) exhibited slightly higher predictability for gene expression in deeper layers than in superficial layers (Supplementary Fig. 10a). Across all genes, the predictability of gene expression was generally correlated among the different epigenetic layers (Supplementary Fig. 10b). We also observed varying associations with gene characteristics. For example, mCH had higher expression predictability for longer genes, which was consistent with previous studies [17,41], while chromatin accessibility contributed more to genes with higher expression variability (Supplementary Fig. 10c). We also repeated the same analysis using online iNMF, which is currently the only other method capable of integrating the three omics layers simultaneously, but it produced much lower cell type resolution and epigenetic correlation (Supplementary Fig. 11).

Fig. 3 | Triple-omics integration of the mouse cortex. a–c, UMAP visualizations of the integrated cell embeddings for scRNA-seq (a), snmC-seq (b) and scATAC-seq (c), colored by the original cell types. Cells aligning with ‘mPv’ and ‘mSst’ are highlighted with green circles. Cells aligning with ‘mNdnf’ and ‘mVip’ are highlighted with dark blue circles. Cells aligning with ‘mDL-3’ are highlighted with light blue circles. d, UMAP visualizations of the integrated cell embeddings for all cells, colored by omics layers. e, Significance of marker gene overlap for each cell type across all three omics layers (three-way Fisher’s exact test [40]). The dashed vertical line indicates that FDR = 0.01. We observed highly significant marker overlap (FDR < $5 \times 10^{-17}$) for 12 out of the 14 cell types, indicating reliable alignment. For the remaining two cell types, ‘mDL-1’ had marginally significant marker overlap with FDR = 0.003, while the ‘mIn-1’ cells in snmC-seq did not properly align with the scRNA-seq or scATAC-seq cells. f, Coefficient of determination ($R^2$) for predicting gene expression based on each epigenetic layer as well as the combination of all layers (n = 2,677 highly variable genes common to all three omics layers). The box plots indicate the medians (centerlines), means (triangles), first and third quartiles (bounds of boxes) and 1.5× interquartile range (whiskers).

Integrative regulatory inference with GLUE. The incorporation of a graph explicitly modeling regulatory interactions in GLUE further enables a Bayesian-like approach that combines prior knowledge and observed data for posterior regulatory inference. Specifically, since the feature embeddings are designed to reconstruct the knowledge-based guidance graph and single-cell multi-omics data simultaneously (Fig. 1), their cosine similarities should reflect information from both aspects, which we adopt as ‘regulatory scores’.

As a demonstration, we used the official peripheral blood mononuclear cell Multiome dataset from 10X and fed it to GLUE as unpaired scRNA-seq and scATAC-seq data. To capture remote cis-regulatory interactions, we used a long-range guidance graph connecting ATAC peaks and RNA genes in 150-kb windows weighted by a power-law function that models chromatin contact probability [42,43] (Methods). Visualization of cell embeddings confirmed that the GLUE alignment was correct and accurate (Supplementary Fig. 12a,b). As expected, we found that the regulatory score was negatively correlated with genomic distance (Fig. 4a) and positively correlated with the empirical peak–gene correlation (computed with paired cells, Fig. 4b), with robustness across different random seeds (Supplementary Fig. 12c).

To further assess whether the score reflected actual cis-regulatory interactions, we compared it with external evidence, including pcHi-C [44] and eQTL [45]. The GLUE regulatory score was higher for pcHi-C-supported peak–gene pairs in all distance ranges (Fig. 4a) and was a better predictor of pcHi-C interactions than empirical peak–gene correlations (Fig. 4b), as well as LASSO and Cicero, the coaccessibility-based regulatory prediction method (Fig. 4c and Supplementary Fig. 12d). The same held for eQTL (Supplementary Fig. 12e–h).

[Figure 4c legend: Cicero (AUROC = 0.548), Spearman (AUROC = 0.555), LASSO (AUROC = 0.547), GLUE (AUROC = 0.631).]

Fig. 4 | Integrative regulatory inference in peripheral blood mononuclear cells. a, GLUE regulatory scores for peak–gene pairs across different genomic ranges, grouped by whether they had pcHi-C support. The box plots indicate the medians (centerlines), means (triangles), first and third quartiles (bounds of boxes) and 1.5× interquartile range (whiskers). b, Comparison between the GLUE regulatory scores and the empirical peak–gene correlations computed on paired cells. Peak–gene pairs are colored by whether they had pcHi-C support. c, Receiver operating characteristic curves for predicting pcHi-C interactions based on different peak–gene association scores. AUROC is the area under the receiver operating characteristic curve. d,e, GLUE-identified cis-regulatory interactions of NCF2 (d) and CD83 (e), along with individual regulatory evidence. SPI1 (highlighted with a green box) is a known regulator of NCF2.

The GLUE framework also allows additional regulatory evidence, such as pcHi-C, to be incorporated intuitively via the guidance graph. Thus, we further trained models with a composite guidance graph containing distance-weighted interactions as well as pcHi-C- and eQTL-supported interactions (Supplementary Fig. 13). The significance of the regulatory score was evaluated by comparing it to a null distribution obtained from randomly shuffled feature embeddings (Methods). As expected, while the multi-omics alignment was insensitive to the change in guidance graph, the inferred regulatory interactions showed stronger enrichment for pcHi-C and eQTL (Supplementary Fig. 13a–d). Large fractions of high-confidence interactions simultaneously supported by pcHi-C, eQTL and correlation could be robustly recovered (FDR < 0.05), even if they were corrupted in the guidance graph (Supplementary Fig. 13e). Furthermore, the GLUE-derived transcription factor (TF)-target gene network (Methods) showed more significant agreement with manually curated connections in the TRRUST v2 database than individual evidence-based networks (Supplementary Figs. 13f and 14 and Supplementary Data 2).

We noticed that the GLUE-inferred cis-regulatory interactions could provide hints about the regulatory mechanisms of known TF-target pairs. For example, SPI1 is a known regulator of the NCF2 gene, and both are highly expressed in monocytes (Supplementary Fig. 15a,b). GLUE identified three remote regulatory peaks for NCF2 with various pieces of evidence, that is, roughly 120 kb downstream,
25 kb downstream and 20 kb upstream from the transcription start site (TSS) (Fig. 4d), all of which were bound by SPI1. Meanwhile, most putative regulatory interactions were previously unknown. For example, CD83 was linked with three regulatory peaks (two roughly 25 kb upstream, one about 10 kb upstream from the TSS), which were enriched for the binding of three TFs (BCL11A, PAX5 and RELB; Fig. 4e). While CD83 was highly expressed in both monocytes and B cells, the inferred TFs showed more constrained expression patterns (Supplementary Fig. 15c–f), suggesting that its active regulators might differ per cell type. Supplementary Fig. 16 shows more examples of GLUE-inferred regulatory interactions.

Atlas-scale integration over millions of cells with GLUE. As technologies continue to evolve, the throughput of single-cell experiments is constantly increasing. Recent studies have generated human cell atlases for gene expression and chromatin accessibility [29] containing millions of cells. The integration of these atlases poses a substantial challenge to computational methods due to the sheer volume of data, extensive heterogeneity, low coverage per cell and unbalanced cell type compositions, and has yet to be accomplished at the single-cell level.

Implemented as a neural network with minibatch optimization, GLUE delivers superior scalability with a sublinear time cost, promising its applicability at the atlas-scale (Supplementary Fig. 17a). Using an efficient multistage training strategy for GLUE (Methods), we successfully integrated the gene expression and chromatin accessibility data into a unified multi-omics human cell atlas (Fig. 5).

[Figure 5 legend: UMAP1 versus UMAP2; cells colored by omics layer (scATAC-seq, scRNA-seq) and by annotated cell type.]

Fig. 5 | Integration of a multi-omics human cell atlas. a,b, UMAP visualizations of the integrated cell embeddings, colored by omics layers (a) and cell types (b). The pink circles highlight cells labeled as ‘Excitatory neurons’ in scRNA-seq but ‘Astrocytes’ in scATAC-seq. The blue circles highlight cells labeled as ‘Astrocytes’ in scRNA-seq but ‘Astrocytes/oligodendrocytes’ in scATAC-seq. The brown circles highlight cells labeled as ‘Oligodendrocytes’ in scRNA-seq but ‘Astrocytes/oligodendrocytes’ in scATAC-seq.

While the aligned atlas was largely consistent with the original annotations (Supplementary Fig. 17c–e), we also noticed several discrepancies. For example, cells originally annotated as ‘Astrocytes’ in scATAC-seq were aligned to an ‘Excitatory neurons’ cluster in scRNA-seq (highlighted with pink circles/flows in Supplementary Fig. 17). Further inspection revealed that canonical radial glial markers [47,48] such as PAX6, HES1 and HOPX were actively transcribed in this cluster, both in the RNA and ATAC domain (Supplementary Fig. 18), with chromatin priming [9] also detected at both neuronal and glial markers (Supplementary Figs. 19–21), suggesting that the cluster consists of multipotent neural progenitors (likely radial glia) rather than excitatory neurons or astrocytes as originally annotated. GLUE-based integration also resolved several scATAC-seq clusters that were ambiguously annotated. For example, the ‘Astrocytes/Oligodendrocytes’ cluster was split into two halves and aligned to the ‘Astrocytes’ and ‘Oligodendrocytes’ clusters of scRNA-seq (highlighted, respectively, with blue and brown circles/flows in Supplementary Fig. 17), which was also supported by marker expression and accessibility (Supplementary Figs. 20 and 21). These results demonstrate the unique value of atlas-scale multi-omics integration where cell typing can be done in an unbiased, data-oriented manner across modalities without losing single-cell resolution. In particular, the incorporation of batch correction could further enable effective curation of new datasets with the integrated atlas as a global reference.

In comparison, we also attempted to perform integration using online iNMF, which was the only other method capable of integrating the data at full scale, but the result was far from optimal (Supplementary Figs. 22a,b and 23). Meanwhile, an attempt to integrate the data as aggregated metacells (Methods) via the popular Seurat v3 method also failed (Supplementary Fig. 22c,d).

Discussion

Combining omics-specific autoencoders with graph-based coupling and adversarial alignment, we designed the GLUE framework for unpaired single-cell multi-omics data integration with superior accuracy and robustness. By modeling regulatory interactions across omics layers explicitly, GLUE uniquely supports integrative regulatory inference for unpaired multi-omics datasets. Notably, in a Bayesian interpretation, the GLUE regulatory inference can be seen as a posterior estimate, which can be continuously refined on the arrival of new data.

Unpaired multi-omics integration shares some conceptual similarities with batch effect correction, but the former is substantially more challenging because of the distinct, omics-specific feature spaces. While feature conversion may seem to be a straightforward solution, the inevitable information loss can be detrimental. Seurat v3 (ref. 15) and bindSC [33] also devised heuristic strategies to use

for scRNA-seq and scATAC-seq, and zero-inflated log-normal for snmC-seq (Methods). Nevertheless, generative distributions can be easily reconfigured to accommodate other omics layers, such as protein abundance [56] and histone modification [57], and to adopt new advances in data modeling techniques.

• The guidance graphs used in GLUE have currently been limited to multipartite graphs, containing only edges between features of different layers. Nonetheless, graphs, as intuitive and flexible representations of regulatory knowledge, can embody more complex regulatory patterns, including within-modality interactions, nonfeature vertices and multi-relations. Beyond canonical graph convolution, more advanced graph neural network architectures [59–61] may also be adopted to extract richer information from the regulatory graph. Particularly, recent advances in hypergraph modeling [62,63] could facilitate the use of prior knowledge on regulatory interactions involving multiple regulators simultaneously, as well as enable regulatory inference for such interactions.

Recent advances in experimental multi-omics technologies have increased the availability of paired data [8–11,34]. While most of the current simultaneous multi-omics protocols still suffer from lower data quality or throughput than that of single-omics methods, paired cells can be highly informative in anchoring different omics layers and should be used in conjunction with unpaired cells whenever available. It is straightforward to extend the GLUE framework to incorporate such pairing information, for example, by adding loss terms that penalize the embedding distances between paired cells. Such an extension may ultimately lead to a solution for the general case of mosaic integration.

Apart from multi-omics integration, we also note that the GLUE framework could be suitable for cross-species integration, especially when distal species are concerned and one-to-one orthologs are limited. Specifically, we may compile all orthologs into a GLUE guidance graph and perform integration without explicit ortholog conversion. Under that setting, the GLUE approach could also be conceptually connected to a recent work called SAMap [66].
information in the original feature spaces in addition to converted Finally, we note that the inferred regulatory interactions from data, which may explain their improved performance than meth- the current GLUE model are based on the whole input dataset 16,17 ods that do not . Meanwhile, known cell types have also been and may be an aggregation of multiple spatiotemporal-specific 51,52 used to guide integration via (semi-)supervised learning , but circuits, especially for data derived from distinct tissues (for this approach incurs substantial limitations in terms of applicability example, atlas). Meanwhile, we notice that in parallel to the since such supervision is typically unavailable and in many cases coarse-scale global model (for example, the whole-atlas integra- serves as the purpose of multi-omics integration per se . Notably, tion model), finer-scale regulatory inference could be conducted one of these methods was proposed with a similar autoencoder by training dedicated models on cells from a single tissue, poten- architecture and adversarial alignment , but it relied on matched tially with spatiotemporal-specific prior knowledge incorporated cell types or clusters to orient the alignment. In fact, GLUE shares as well . Such a ‘step-wise refinement’ extension would effectively more conceptual similarity with coupled matrix factorization meth- help identify spatiotemporal-specific regulatory circuits and 20,21 ods , but with superior performance, which mostly benefits from key regulators. its deep generative model-based design. We believe that GLUE, as a modular and generalizable frame- We note that the current framework also works for integrat- work, creates an unprecedented opportunity toward effectively ing omics layers with shared features (for example, the integration delineating gene regulatory maps via large-scale multi-omics inte- 53,54 between scRNA-seq and spatial transcriptomics ), by using either gration at single-cell resolution. 
The whole package of GLUE, along the same vertex or connected surrogate vertices for shared features with tutorials and demo cases, is available online at https://github. in the guidance graph. In addition, cross imputation could also be com/gao-lab/GLUE for the community. implemented by chaining encoders and decoders of different omics layers. However, given a recent report that data imputation could online content induce artifacts and deteriorate the accuracy of gene regulatory Any methods, additional references, Nature Research report- inference , such a function may need further investigation. ing summaries, source data, extended data, supplementary infor- As a generalizable framework, GLUE features a modular mation, acknowledgements, peer review information; details of design, where the data and graph autoencoders are independently author contributions and competing interests; and statements of configurable. data and code availability are available at https://doi.org/10.1038/ s41587-022-01284-4. • e d Th ata autoencoders in GLUE are customizable with appro - Received: 13 September 2021; Accepted: 15 March 2022; priate generative models that conform to omics-specific data Published online: 2 May 2022 distributions. In the current work, we used negative binomial NatuRe Biote ChNoloG y | VOL 40 | OCt OBeR 2022 | 1458–1466 | www.nature.com/naturebiotechnology 1464 NATUrE BioTEcHNoLoGy Articles 34. PBMC from a healthy donor, single cell multiome ATAC gene expression References demonstration data by Cell Ranger ARC 1.0.0. 10X Genomics https://support. 1. Cusanovich, D. A. et al. Multiplex single cell profiling of chromatin 10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_ accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015). granulocyte_sorted_10k (2020). 2. Chen, X., Miragaia, R. J., Natarajan, K. N. & Teichmann, S. A. A rapid and 35. Muto, Y. et al. 
Single cell transcriptional and chromatin accessibility profiling robust method for single cell chromatin accessibility profiling. Nat. Commun. redefine cellular heterogeneity in the adult human kidney. Nat. Commun. 12, 9, 5345 (2018). 2190 (2021). 3. Luo, C. et al. Single-cell methylomes identify neuronal subtypes and 36. Yao, Z. et al. A transcriptomic and epigenomic cell atlas of the mouse regulatory elements in mammalian cortex. Science 357, 600–604 (2017). primary motor cortex. Nature 598, 103–110 (2021). 4. Mulqueen, R. M. et al. Highly scalable generation of DNA methylation 37. Saunders, A. et al. Molecular diversity and specializations among the cells of profiles in single cells. Nat. Biotechnol. 36, 428–431 (2018). the adult mouse brain. Cell 174, 1015–1030 (2018). 5. Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in 38. Fresh cortex from adult mouse brain (v1), single cell ATAC demonstration single cells. Nat. Methods 10, 1096–1098 (2013). data by Cell Ranger 1.1.0. 10X Genomics https://support.10xgenomics.com/ 6. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single single-cell-atac/datasets/1.1.0/atac_v1_adult_brain_fresh_5k (2019). cells. Nat. Commun. 8, 14049 (2017). 39. Mo, A. et al. Epigenomic signatures of neuronal diversity in the mammalian 7. Packer, J. & Trapnell, C. Single-cell multi-omics: an engine for new brain. Neuron 86, 1369–1384 (2015). quantitative models of gene regulation. Trends Genet. 34, 653–665 (2018). 40. Wang, M., Zhao, Y. & Zhang, B. Efficient test and visualization of multi-set 8. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the intersections. Sci Rep. 5, 16923 (2015). transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 41. Gabel, H. W. et al. Disruption of DNA-methylation-dependent long gene 37, 1452–1457 (2019). repression in Rett syndrome. Nature 522, 89–93 (2015). 9. Ma, S. et al. 
Chromatin potential identified by shared single-cell profiling of 42. Dekker, J., Marti-Renom, M. A. & Mirny, L. A. Exploring the RNA and chromatin. Cell 183, 1103–1116 (2020). three-dimensional organization of genomes: Interpreting chromatin 10. Clark, S. J. et al. scNMT-seq enables joint profiling of chromatin accessibility interaction data. Nat. Rev. Genet. 14, 390–403 (2013). DNA methylation and transcription in single cells. Nat. Commun. 9, 43. Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from 781 (2018). single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018). 11. Wang, Y. et al. Single-cell multiomics sequencing reveals the functional 44. Javierre, B. M. et al. Lineage-specific genome architecture links enhancers regulatory landscape of early embryos. Nat. Commun. 12, 1247 (2021). and non-coding disease variants to target gene promoters. Cell 167, 12. Lake, B. B. et al. Integrative single-cell analysis of transcriptional and 1369–1384 (2016). epigenetic states in the human adult brain. Nat. Biotechnol. 36, 70–80 (2018). 45. Aguet, F. et al. Genetic effects on gene expression across human tissues. 13. Bravo Gonzalez-Blas, C. et al. Identification of genomic enhancers through Nature 550, 204–213 (2017). spatial integration of single-cell transcriptomics and epigenomics. Mol. Syst. 46. Han, H. et al. TRRUST v2: an expanded reference database of human and Biol. 16, e9438 (2020). mouse transcriptional regulatory interactions. Nucleic Acids Res. 46, 14. Argelaguet, R., Cuomo, A. S. E., Stegle, O. & Marioni, J. C. Computational principles and challenges in single-cell data integration. Nat. Biotechnol. 39, D380–D386 (2018). 47. o Th msen, E. R. et al. Fixed single-cell transcriptomic characterization of 1202–1215 (2021). human radial glial diversity. Nat. Methods 13, 87–93 (2016). 15. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019). 48. Pollen, A. A. et al. 
Molecular identity of human outer radial glia during cortical development. Cell 163, 55–67 (2015). 16. Gao, C. et al. Iterative single-cell multi-omic integration using online 49. Fischer, D. S. et al. Sfaira accelerates data and model reuse in single cell learning. Nat. Biotechnol. 39, 1000–1007 (2021). 17. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts genomics. Genome Biol. 22, 248 (2021). 50. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for features of brain cell identity. Cell 177, 1873–1887 (2019). single-cell RNA sequencing data. Genome Biol. 21, 12 (2020). 18. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019). 51. Stark, S. G. et al. SCIM: universal single-cell matching with unpaired feature 19. Chen, H. et al. Assessment of computational methods for the analysis of sets. Bioinformatics 36, i919–i927 (2020). 52. Yang, K. D. et al. Multi-domain translation between single-cell imaging and single-cell ATAC-seq data. Genome Biol. 20, 241 (2019). 20. Duren, Z. et al. Integrative analysis of single-cell genomics data by sequencing data using autoencoders. Nat. Commun. 12, 31 (2021). coupled nonnegative matrix factorizations. Proc. Natl. Acad. Sci. USA 115, 53. Eng, C.-H. L. et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqfish. Nature 568, 235–239 (2019). 7723–7728 (2018). 21. Zeng, W. et al. DC3 is a method for deconvolution and coupled clustering 54. Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring from bulk and single-cell genomics data. Nat. Commun. 10, 4613 (2019). genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019). 22. Demetci, P., Santorella, R., Sandstede, B., Noble, W. S. & Singh, R. SCOT: Single-Cell Multi-Omics Alignment with Optimal Transport. J. Comput. Biol. 55. Ly, L.-H. & Vingron, M. 
Effect of imputation on gene network reconstruction 29, 3–18 (2022). from single-cell RNA-seq data. Patterns 3, 100414 (2021). 23. Cao, K., Bai, X., Hong, Y. & Wan, L. Unsupervised topological alignment for 56. Bandura, D. R. et al. Mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight single-cell multi-omics integration. Bioinformatics 36, i48–i56 (2020). 24. Cao, K., Hong, Y. & Wan, L. Manifold alignment for heterogeneous mass spectrometry. Anal. Chem. 81, 6813–6822 (2009). single-cell multi-omics data integration using pamona. Bioinformatics 38, 57. Bartosovic, M., Kabbe, M. & Castelo-Branco, G. Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues. 211–219 (2021). 25. Singh, R. et al. Unsupervised manifold alignment for single-cell multi-omics Nat. Biotechnol. 39, 825–835 (2021). data. In Proc. 11th ACM International Conference on Bioinformatics, 58. Ashuach, T., Reidenbach, D. A., Gayoso, A. & Yosef, N. PeakVI: A deep generative model for single-cell chromatin accessibility analysis. Cell Reports Computational Biology and Health Informatics (eds. Aluru, S., Kalyanaraman, A. & Wang, M. D.) a40 (Association for Computing Machinery, 2020). Methods 2, 100182 (2022). 26. Svensson, V., Vento-Tormo, R. & Teichmann, S. A. Exponential scaling of 59. Hamilton, W., et al. in Advances in Neural Information Processing Systems single-cell RNA-seq in the past decade. Nat. Protoc. 13, 599–604 (2018). (eds. Guyon, I. et al.) 1024–1034 (Curran Associates, Inc., 2017). 27. Kozareva, V. et al. A transcriptomic atlas of mouse cerebellar cortex 60. Veličković, P. et al. Graph attention networks. In Proc. 6th International comprehensively defines cell types. Nature 598, 214–219 (2021). Conference on Learning Representations (eds. Bengio, Y. & LeCun, Y.) 28. Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, (ICLR, 2018). eaba7721 (2020). 61. 
Vashishth, S., Sanyal, S., Nitin, V. & Talukdar, P. Composition-based 29. Domcke, S. et al. A human cell atlas of fetal chromatin accessibility. Science multi-relational graph convolutional networks. In Proc. 8th International 370, eaba7612 (2020). Conference on Learning Representations (ed. Rush, A.) (ICLR, 2020). 30. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative 62. Zhang, R., Zou, Y. & Ma, J. Hyper-SAGNN: a self-attention based graph modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018). neural network for hypergraphs. In Proc. 8th International Conference on 31. Cao, Z. J., Wei, L., Lu, S., Yang, D. C. & Gao, G. Searching large-scale Learning Representations (ed. Rush, A.) (ICLR, 2020). scRNA-seq databases via unbiased cell embedding with Cell BLAST. 63. Zhang, R., Zhou, T. & Ma, J. Multiscale and integrative single-cell Hi-C Nat. Commun. 11, 3458 (2020). analysis with Higashi. Nat. Biotechnol. 40, 254–261 (2021). 32. Kipf, T. N. & Welling, M. Variational graph auto-encoders. In Neural 64. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, Information Processing Systems Workshop on Bayesian Deep Learning 257–272 (2019). (eds. Gal, Y. et al.) (Curran Associates, Inc., 2016). 65. Amodio, M. & Krishnaswamy, S. MAGAN: aligning biological manifolds. In 33. Dou, J. et al. Unbiased integration of single cell multi-omics data. Preprint at Proc. 35th International Conference on Machine Learning (eds. Dy, J. G. Dy & bioRxiv https://doi.org/10.1101/2020.12.11.422014 (2020). Krause, A.) 215–223 (PMLR, 2018). NatuRe Biote ChNoloG y | VOL 40 | OCt OBeR 2022 | 1458–1466 | www.nature.com/naturebiotechnology 1465 NATUrE BioTEcHNoLoGy Articles 66. Tarashansky, A. J. et al. Mapping single-cell atlases throughout metazoa adaptation, distribution and reproduction in any medium or format, as long as you give unravels cell type evolution. eLife 10, e66747 (2021). 
appropriate credit to the original author(s) and the source, provide a link to the Creative 67. Jung, I. et al. A compendium of promoter-centered long-range chromatin Commons license, and indicate if changes were made. The images or other third party mate- interactions in the human genome. Nat. Genet. 51, 1442–1449 (2019). rial in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in Commons license and your intended use is not permitted by statutory regulation or exceeds published maps and institutional affiliations. the permitted use, you will need to obtain permission directly from the copyright holder. Open Access This article is licensed under a Creative Commons To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. Attribution 4.0 International License, which permits use, sharing, © The Author(s) 2022 NatuRe Biote ChNoloG y | VOL 40 | OCt OBeR 2022 | 1458–1466 | www.nature.com/naturebiotechnology 1466 NATUrE BioTEcHNoLoGy Articles we first sample the edges (i, j) with probabilities proportional to the edge weights Methods and then sample vertices j′ that are not connected to i and treat them as if s = s . ij′ ij The GLUE framework. We assume that there are K different omics layers to be When maximizing the graph likelihood, the inner products between features are integrated, each with a distinct feature set V ,k = 1,2,…,K . For example, in maximized or minimized (per edge sign) based on the Bernoulli distribution. For scRNA-seq, V is the set of genes, while in scATAC-seq, V is the set of chromatin k k |V | k example, ATAC peaks located near the promoter of a gene would be encouraged to regions. 
The data spaces of different omics layers are denoted as X ⊆ R with (n) have similar embeddings to that of the gene, while DNA methylation in the gene varying dimensionalities. We use to denote cells from x ∈X ,n = 1,2,…,NK promoter would be encouraged to have a dissimilar embedding to that of the gene. (n) the kth omics layer and to denote the observed value of feature i of x ,i ∈V ki k The data likelihoods p (x |u,V;θ ) (that is, data decoders) in equation (3) are k k the kth layer in the nth cell. N is the sample size of the kth layer. Notably, the cells built on the inner product between the cell embedding u and feature embeddings from different omics layers are unpaired and can have different sample sizes. To V . Thus, analogous to the loading matrix in principal component analysis (PCA), avoid cluttering, we drop the superscript (n) when referring to an arbitrary cell. the feature embeddings V confer semantic meanings for the cell embedding space. We model the observed data from different omics layers as generated by a As V are modulated by interactions among omics features in the guidance graph, m k low-dimensional latent variable (that is, cell embedding) u ∈ R : the semantic meanings become linked. While this linearity limits decoder capacity, our empirical evaluations show that it is well compensated by the nonlinear p (x ;θ )= p (x |u;θ )p (u)du (1) k k k k encoders, producing high-quality multi-omics alignments (Fig. 2, Extended Data Figs. 1–4 and Supplementary Figs. 1–7). The exact formulation of data likelihood where p(u) is the prior distribution of the latent variable, p (x |u;θ ) are k k depends on the omics data distribution. For example, for count-based scRNA-seq learnable generative distributions (that is, data decoders) and θ denotes learnable and scATAC-seq data, we used the negative binomial (NB) distribution: parameters in the decoders. The cell latent variable u is shared across different ( ) omics layers. 
In other words, u represents the common cell states underlying all p (x |u,V;θ )= NB x ;μ ,θ k k ki i (7) omics observations, while the observed data from each layer are generated by a i∈V specific type of measurement of the underlying cell states. With the introduction of variational posteriors q (u|x ;ϕ ) (that is, data k k ( ) ( ) ( ) x θ k i Γ(x +θ ) μ i ki i θ encoders, where ϕ are learnable parameters in the encoders), model fitting can be i i NB x ;μ ,θ = (8) ki i θ +μ θ +μ Γ(θ )Γ(x +1) i i i ki i i efficiently performed by maximizing the following evidence lower bounds: L (ϕ ,θ )= E E logp (x |u;θ ) ( ) X k k x ∼p (x ) u∼q(u|x ;ϕ ) k k ∑ k k data k k k μ = Softmax α ⊙V u + β · x (2) i k kj i (9) j∈V −KL (q (u|x ;ϕ ) ∥ p (u))] k k k |V | where μ,θ ∈ R are the mean and dispersion of the negative binomial Since different autoencoders are independently parameterized and trained on |V | k |V | separate data, the cell embeddings learned for different omics layers could have distribution, respectively, α ∈ R ,β ∈ R are scaling and bias factors, inconsistent semantic meanings unless they are linked properly. ⊙ is the Hadamard product, Softmax represents the ith dimension of the To link the autoencoders, we propose a guidance graph G =(V, E), which softmax output and x gives the total count in the cell. Taking softmax kj j∈V incorporates prior knowledge about the regulatory interactions among features ∪ and then multiplying by total count ensures that the library size of reconstructed at distinct omics layers, where V = V is the universal feature set and k=1 data matches the original . The set of learnable parameters is θ = {θ,α,β}. E = {(i,j) |i,j ∈ V} is the set of edges. Each edge is also associated with signs Analogously, many other distributions can also be supported, as long as we can and weights, which are denoted as s and w , respectively. We require that w ∈ parameterize the means of the distributions by feature-cell inner products. 
ij ij ij (0,1], which can be interpreted as interaction credibility, and that s ∈ {−1,1}, ij For efficient inference and optimization, we introduce the following factorized which specifies the sign of the regulatory interaction. For example, an ATAC peak variational posterior: located near the promoter of a gene is usually assumed to positively regulate its q (u,V|x , G;ϕ ,ϕ )= q (u|x ;ϕ ) ·q (V|G;ϕ ) (10) k k G k k G expression, so they can be connected with a positive edge (s = 1). Meanwhile, ij DNA methylation in the gene promoter is usually assumed to suppress expression, The graph variational posterior q (V|G;ϕ ) (that is, graph encoder) is so they can be connected with a negative edge (s = 1). In addition to the ij modeled as diagonal-covariance normal distributions parameterized by a graph connections between features, self-loops are also added for numerical stability, convolutional network : with s = 1,w = 1, ∀i ∈V . The guidance graph is allowed to be a multi-graph, ii ii where more than one edge can exist between the same pair of vertices, representing q (V|G;ϕ )= q (v |G;ϕ ) G i G (11) different types of prior regulatory evidence. i∈V We treat the guidance graph as observed variable and model it as generated by low-dimensional feature latent variables (that is, feature embeddings) ( ) v ∈ R ,i ∈V . Furthermore, differing from the previous model, we now model i q (v |G;ϕ )= N v ;GCN (G;ϕ ),GCN 2 (G;ϕ ) (12) i G i μ G σ G x as generated by the combination of feature latent variables v ∈ R ,i ∈V k i k and the cell latent variable u ∈ R . For convenience, we introduce the notation where ϕ represents the learnable parameters in the graph convolutional network m×|V| V ∈ R , which combines all feature embeddings into a single matrix. The (GCN) encoder. 
model likelihood can thus be written as: The variational data posteriors q (u|x ;ϕ ) (that is, data encoders) are k k modeled as diagonal-covariance normal distributions parameterized by multilayer p (x , G;θ ,θ )= p (x |u,V;θ )p (G|V;θ )p (u)p (V)dudV (3) G G k k k k perceptron (MLP) neural networks: ( ) where p (x |u,V;θ ) and p (G|V;θ ) are learnable generative distributions for the k k G q (u|x ,V ;ϕ )= N u;MLP (x ;ϕ ),MLP 2 (x ;ϕ ) (13) k k k k,μ k k k k k,σ omics data (that is, data decoders) and knowledge graph (that is, graph decoder), respectively. θ and θ are learnable parameters in the decoders. p(u) and p(V) where ϕ is the set of learnable parameters in the multilayer perceptron encoder of are the prior distributions of the cell latent variable and feature latent variables, the kth omics layer. respectively, which are fixed as standard normal distributions for simplicity: Model fitting can then be performed by maximizing the following evidence lower bound: p (u)= N (u;0,I ) (4)   E logp (x |u,V;θ )p (G|V;θ ) ∏ K k k ∑ u∼q(u|x ;ϕ ),V∼q(V|G;ϕ ) k k G p (v )= N (v ;0,I ),p (V)= p (v )   i i m i (5) x ∼p (x ) k data k i∈V k=1 −KL (q (u|x ;ϕ )q (V|G;ϕ ) ∥ p (u)p (V)) k k G (14) although alternatives may also be used . For convenience, we also introduce the m×|V | notation , which contains only feature embeddings in the kth omics V ∈ R which can be further rearranged into the following form: layer, and u , which emphasizes that the cell embedding is from a cell in the kth omics layer. K ·L (θ ,ϕ )+ L (θ ,ϕ ,ϕ ) (15) The graph likelihood p (G|V;θ ) (that is, graph decoder) is defined as: G G G X k k G G k k=1 logp (G|V;θ )= E i,j∼p i,j;w ( ij) where we have (6) [ ( ) ( ( ))] ⊤ ⊤ logσ s v v + E log 1 − σ s v v ij j j′∼p (j′|i) ij j′ i ns i L (θ ,ϕ ,ϕ )= E X k k G k xk∼pdata(xk) [ ] (16) where σ is the sigmoid function and p is a negative sampling distribution . 
ns E logp (x |u,V;θ ) −KL (q (u|x ;ϕ ) ∥ p (u)) k k k k u∼q(u|x ;ϕ ),V∼q(V|G;ϕ ) k k G Here the graph likelihood has no trainable parameters, so θ = ∅. In other words, NatuRe Biote ChNoloG y | www.nature.com/naturebiotechnology NATUrE BioTEcHNoLoGy Articles L (θ ,ϕ )= E logp (G|V;θ ) −KL (q (V|G;ϕ ) ∥ p (V)) G G G G G V∼q V|G;ϕ ( ) normalize by cluster size, which effectively balances the contribution of matching (17) clusters regardless of their sizes. In the second stage, we fine-tune the GLUE model with the estimated balancing weights, during which the additive noise Below, for convenience, we denote the union of all encoder parameters ϵ ∼N (ϵ;0,τ · Σ) gradually anneals to 0 (with τ starting at 1 and decreasing ( ) as ϕ = ϕ ∪ϕ and the union of all decoder parameters as k G linearly per epoch until 0). The number of annealing epochs was set automatically k=1 (∪ ) based on the data size and learning rate to match a learning progress equivalent to θ = θ ∪ θ k G k=1 4,000 iterations at a learning rate of 0.002. To ensure the proper alignment of different omics layers, we use the adversarial All benchmarks and case studies in the study were conducted with the 31,71 alignment strategy . A discriminator D with a K-dimensional softmax output is two-stage training procedure as described above, regardless of whether the dataset introduced, which predicts the omics layers of cells based on their embeddings u. being used is balanced or not. The discriminator D is trained by minimizing the multiclass classification cross entropy: Batch effect correction. To handle batch effect within omics layers, we incorporate batch as a covariate of the data decoders. Assuming b ∈{1,2,…,B}, is the batch L (ϕ,ψ)= − E E logD (u;ψ) (18) D x ∼p (x ) u∼q(u|x ;ϕ ) k K k data k k k index, where B is the total number of batches, the decoder likelihood is extended to k=1 p (x |u,V,b;θ ). 
Specifically, this is achieved by converting learnable parameters k k in the data decoder to be batch-dependent. For example, in the case of a negative where D represents the kth dimension of the discriminator output and ψ is the binomial decoder, the network now uses batch-specific α, β and θ parameters: set of learnable parameters in the discriminator. The data encoders can then be ∏ ( ) trained in the opposite direction to fool the discriminator, ultimately leading to the p (x |u,V,b;θ )= NB x ;μ ,θ 72 k k k b i i i alignment of cell embeddings from different omics layers . (25) i∈V The overall training objective of GLUE thus consists of: minλ ·L (ϕ,ψ) D D ( ) ( ) (19) ( ) x θ Γ x +θ k b ψ ( k b ) μ i θ i i i b i i NB x ;μ ,θ = (26) k b i i i θ +μ θ +μ Γ θ Γ x +1 b b ( b ) ( k ) i i i i i i maxλ ·L (ϕ,ψ)+ λ K ·L (θ ,ϕ )+ L (θ ,ϕ ,ϕ ) (20) ( ) ∑ D D G G G G X k k G θ,ϕ μ = Softmax α ⊙V u + β · x k=1 i b k i k b j (27) j∈V The two hyperparameters λ and λ control the contributions of adversarial B×|V | B×|V | B×|V | k k k where α ∈ R ,β ∈ R ,θ ∈ R , and α , β , θ are the bth row of α, alignment and graph-based feature embedding, respectively. We use stochastic + + b b b gradient descent to train the GLUE model. Each stochastic gradient descent β, θ. Other probabilistic decoders can also be extended in similar ways. iteration is divided into two steps. In the first step, the discriminator is updated according to objective equation (19). In the second step, the data and graph Implementation details. We applied linear dimensionality reduction using autoencoders are updated according to equation (20). The RMSprop optimizer canonical methods such as PCA (for scRNA-seq) or LSI (latent semantic indexing, with no momentum term is used to ensure the stability of adversarial training. for scATAC-seq) as the first transformation layers of the data encoders (note that the decoders were still fitted in the original feature spaces). 
This effectively reduced model size and enabled a modular input, so advanced dimensionality reduction or batch effect correction methods can also be used instead as preprocessing steps for GLUE integration.

During model training, 10% of the cells were used as the validation set. In the final stage of training, the learning rate would be reduced by a factor of 10 if the validation loss did not improve for consecutive epochs. Training would be terminated if the validation loss still did not improve for consecutive epochs. The patience for learning rate reduction, the patience for training termination and the maximal number of training epochs were automatically set based on the data size and learning rate to match a learning progress equivalent to 1,000, 2,000 and 16,000 iterations at a learning rate of 0.002, respectively.

For all benchmarks and case studies with GLUE, we used the default hyperparameters unless explicitly stated. The set of default hyperparameters is presented in Extended Data Fig. 3.

Weighted adversarial alignment. As shown in previous work, canonical adversarial alignment amounts to minimizing a generalized form of Jensen–Shannon divergence among the cell embedding distributions of different omics layers:

$$\frac{1}{K}\sum_{k=1}^{K} \mathrm{KL}\left(q_k(u) \,\Big\|\, \frac{1}{K}\sum_{k'=1}^{K} q_{k'}(u)\right) \tag{21}$$

where $q_k(u) = \mathbb{E}_{x_k \sim p_{\mathrm{data}}(x_k)}\, q(u \mid x_k; \phi_k)$ represents the marginal cell embedding distribution of the kth layer. Without other loss terms, equation (21) converges at perfect alignment, that is, when $q_i(u) = q_j(u), \forall i \neq j$. This can be problematic when cell type compositions differ dramatically across different layers, for example, in the cell atlas integration. To address this issue, we added cell-specific weights $w^{(n)}$ to the discriminator loss in equation (18):

$$\mathcal{L}_D(\phi, \psi) = -\frac{1}{K}\sum_{k=1}^{K} \frac{1}{W_k}\sum_{n=1}^{N_k} w^{(n)} \cdot \mathbb{E}_{u \sim q(u \mid x_k^{(n)}; \phi_k)} \log D_k(u; \psi) \tag{22}$$

where the normalizer $W_k = \sum_{n=1}^{N_k} w^{(n)}$. The adversarial alignment still amounts to minimizing equation (21) but with weighted marginal cell embedding distributions $q_k(u) = \frac{1}{W_k}\sum_{n=1}^{N_k} w^{(n)} q(u \mid x_k^{(n)}; \phi_k)$. By assigning appropriate weights to balance the cell distributions across different layers, the optimum of $q_i(u) = q_j(u), \forall i \neq j$ could be much closer to the desired alignment.

To obtain the balancing weights in an unsupervised manner, we devised the following two-stage training procedure. First, we pretrain the GLUE model with constant weight $w^{(n)} = 1$, during which noise $\epsilon \sim \mathcal{N}(\epsilon; 0, \Sigma)$ was added to the cell embeddings before passing to the discriminator. We set $\Sigma$ to be 1.5× the empirical variance of cell embeddings in each minibatch, which helps produce a coarse alignment immune to composition imbalance. Then, we cluster the coarsely aligned cell embeddings per omics layer using Leiden clustering. The balancing weight $w_i$ for cells in cluster i is computed as:

$$w_i = \frac{1}{n_i}\sum_{j:\, k_j \neq k_i} f(u_i, u_j) \tag{23}$$

$$f(u_i, u_j) = \begin{cases} \cos(u_i, u_j)^4, & \cos(u_i, u_j) > 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{24}$$

where $u_i$ is the average cell embedding of cluster i, $k_i$ denotes the omics layer of cluster i, and $n_i$ is the number of cells in cluster i. In other words, we sum up the cosine similarities (raised to the power of 4 to increase contrast) between cluster i and all its matching clusters in other layers with cosine similarity >0.5, and then normalize by the cluster size $n_i$.

Integration consistency score. The integration consistency score is a measure of consistency between the integrated multi-omics data and the guidance graph. First, we jointly cluster cells from all omics layers in the aligned cell embedding space using k-means. For each omics layer, the cells in each cluster are aggregated into a metacell. The metacells are established as paired samples, based on which feature correlation can be computed. Using the paired metacells, we then compute the Spearman’s correlation for each edge in the guidance graph. The integration consistency score is defined as the average correlation across all graph edges, negated per edge sign and weighted by edge weight.

Systematic benchmarks. UnionCom²³, Pamona²⁴ and GLUE were executed using the Python packages ‘unioncom’ (v.0.3.0), ‘Pamona’ (v.0.1.0) and ‘scglue’ (v.0.2.0), respectively. MMD-MA was executed using the Python script provided at https://bitbucket.org/noblelab/2020_mmdma_pytorch. Online iNMF¹⁶, LIGER¹⁷, Harmony¹⁸, bindSC³³ and Seurat v3 (ref. 15) were executed using the R packages ‘rliger’ (v.1.0.0), ‘rliger’ (v.1.0.0), ‘harmony’ (v.0.1.0), ‘bindSC’ (v.1.0.0) and ‘Seurat’ (v.4.0.2), respectively. For each method, we used the default hyperparameter settings and data preprocessing steps as recommended. For the scRNA-seq data, 2,000 highly variable genes were selected using the Seurat ‘vst’ method.

We used two separate schemes to construct the guidance graph. In the standard scheme, we connected ATAC peaks with RNA genes via positive edges if they overlapped in either the gene body or proximal promoter regions (defined as 2 kb upstream from the TSS). In an alternative scheme involving larger genomic windows, we connected ATAC peaks with RNA genes via positive edges if the peaks were within 150 kb of the proximal gene promoters; the edges were weighted by a power-law function $w = (d + 1)^{-0.75}$ (d is the genomic distance in kb), which has been proposed to model the probability of chromatin contact⁴²,⁴³.

For the methods that require feature conversion (online iNMF, LIGER, bindSC and Seurat v.3), we converted the scATAC-seq data to gene-level activity scores by summing up counts in the ATAC peaks connected to specific genes in the guidance graph. Notably, online iNMF and LIGER also recommend an alternative way of ATAC feature conversion, that is, directly counting ATAC fragments falling in gene body and promoter regions without resorting to ATAC peaks (https://htmlpreview.github.io/?https://github.com/welch-lab/liger/blob/master/vignettes/Integrating_scRNA_and_scATAC_data.html), which we abbreviate to FiG (fragments in genes). We also tested the FiG feature conversion method with online iNMF and LIGER whenever applicable.

Mean average precision (MAP) was used to evaluate the cell type resolution. Supposing that the cell type of the ith cell is $y^{(i)}$ and that the cell types of its K ordered nearest neighbors are $y^{(i)}_1, y^{(i)}_2, \ldots, y^{(i)}_K$, the mean average precision is then defined as follows:

$$\mathrm{MAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}^{(i)} \tag{28}$$

$$\mathrm{AP}^{(i)} = \begin{cases} \dfrac{\sum_{k=1}^{K} 1_{y^{(i)} = y^{(i)}_k} \cdot \frac{\sum_{j=1}^{k} 1_{y^{(i)} = y^{(i)}_j}}{k}}{\sum_{k=1}^{K} 1_{y^{(i)} = y^{(i)}_k}}, & \text{if } \sum_{k=1}^{K} 1_{y^{(i)} = y^{(i)}_k} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{29}$$

where $1_{y^{(i)} = y^{(i)}_k}$ is an indicator function that equals 1 if $y^{(i)} = y^{(i)}_k$ and 0 otherwise. For each cell, average precision (AP) computes the average cell type precision up to each cell type-matched neighbor, and mean average precision is the mean of the average precision across all cells. We set K to 1% of the total number of cells in each dataset. Mean average precision has a range of 0 to 1, and higher values indicate better cell type resolution.

Cell type ASW (average silhouette width) was also used to evaluate the cell type resolution, which was defined as in a recent benchmark study:

$$\text{cell type ASW} = \frac{1}{2}\left(\frac{\sum_{i=1}^{N} s^{(i)}_{\text{cell type}}}{N} + 1\right) \tag{30}$$

where $s^{(i)}_{\text{cell type}}$ is the cell type silhouette width for the ith cell, and N is the total number of cells. Cell type ASW has a range of 0 to 1, and higher values indicate better cell type resolution.

Neighbor consistency (NC) was used to evaluate the preservation of single-omics data variation after multi-omics integration and was defined following a previous study:

$$\mathrm{NC} = \frac{1}{N}\sum_{i=1}^{N} \frac{\left|\mathrm{NNS}^{(i)} \cap \mathrm{NNI}^{(i)}\right|}{\left|\mathrm{NNS}^{(i)} \cup \mathrm{NNI}^{(i)}\right|} \tag{31}$$

where $\mathrm{NNS}^{(i)}$ is the set of K-nearest neighbors for the ith cell in the single-omics data, $\mathrm{NNI}^{(i)}$ is the set of K-nearest neighbors for the ith cell in the integrated space, and N is the total number of cells. We set K to 1% of the total number of cells in each dataset. Neighbor consistency has a range of 0 to 1, and higher values indicate better preservation of data variation.

Biology conservation. Mean average precision, cell type ASW and neighbor consistency all measure biology conservation of the data integration. Following the procedure from the recent benchmark study, we first conduct min-max scaling for each of the metrics and then compute the average across the three to summarize them into a single metric representing biology conservation:

$$\text{biology conservation} = \frac{\mathrm{scale}(\mathrm{MAP}) + \mathrm{scale}(\text{cell type ASW}) + \mathrm{scale}(\mathrm{NC})}{3} \tag{32}$$

Seurat alignment score (SAS) was used to evaluate the extent of mixing among omics layers and was computed as described in the original paper:

$$\mathrm{SAS} = 1 - \frac{\bar{x} - \frac{K}{N}}{K - \frac{K}{N}} \tag{33}$$

where $\bar{x}$ is the average number of cells from the same omics layer among the K-nearest neighbors (different layers were first subsampled to the same number of cells as the smallest layer), and N is the number of omics layers. We set K to 1% of the subsampled cell number. Seurat alignment score has a range of 0 to 1, and higher values indicate better mixing.

Omics layer ASW was also used to evaluate the extent of mixing among omics layers and was defined as in a recent benchmark study:

$$\text{omics layer ASW} = \frac{1}{M}\sum_{j=1}^{M} \text{omics layer ASW}_j \tag{34}$$

$$\text{omics layer ASW}_j = \frac{1}{N_j}\sum_{i=1}^{N_j} \left(1 - \left|s^{(i)}_{\text{omics layer}}\right|\right) \tag{35}$$

where $s^{(i)}_{\text{omics layer}}$ is the omics layer silhouette width for the ith cell, $N_j$ is the number of cells in cell type j, and M is the total number of cell types. Omics layer ASW has a range of 0 to 1, and higher values indicate better mixing.

Graph connectivity (GC) was also used to evaluate the extent of mixing among omics layers and was defined as in a recent benchmark study:

$$\mathrm{GC} = \frac{1}{M}\sum_{j=1}^{M} \frac{\left|\mathrm{LCC}_j\right|}{N_j} \tag{36}$$

where $\left|\mathrm{LCC}_j\right|$ is the number of cells in the largest connected component of the cell K-nearest neighbors graph (K = 15) for cell type j, $N_j$ is the number of cells in cell type j and M is the total number of cell types. Graph connectivity has a range of 0 to 1, and higher values indicate better mixing.

Omics mixing. Seurat alignment score, omics layer ASW and graph connectivity all measure omics mixing of the data integration. Following the procedure from the recent benchmark study, we first conduct min-max scaling for each of the metrics, and then compute the average across the three to summarize them into a single metric representing omics mixing:

$$\text{omics mixing} = \frac{\mathrm{scale}(\mathrm{SAS}) + \mathrm{scale}(\text{omics layer ASW}) + \mathrm{scale}(\mathrm{GC})}{3} \tag{37}$$

Overall integration score. To compute an overall integration score, we use a 6:4 weight between biology conservation and omics mixing, following the recent benchmark study⁷³:

$$\text{overall integration score} = 0.6 \times \text{biology conservation} + 0.4 \times \text{omics mixing} \tag{38}$$

FOSCTTM was used to evaluate the single-cell level alignment accuracy. It was computed on two datasets with known cell-to-cell pairings. Suppose that each dataset contains N cells, and that the cells are sorted in the same order, that is, the ith cell in the first dataset is paired with the ith cell in the second dataset. Denote x and y as the cell embeddings of the first and second dataset, respectively. The FOSCTTM is then defined as:

$$\mathrm{FOSCTTM} = \frac{1}{2N}\left(\sum_{i=1}^{N} \frac{n_1^{(i)}}{N} + \sum_{i=1}^{N} \frac{n_2^{(i)}}{N}\right) \tag{39}$$

$$n_1^{(i)} = \left|\left\{ j \mid d(x_j, y_i) < d(x_i, y_i) \right\}\right| \tag{40}$$

$$n_2^{(i)} = \left|\left\{ j \mid d(x_i, y_j) < d(x_i, y_i) \right\}\right| \tag{41}$$

where $n_1^{(i)}$ and $n_2^{(i)}$ are the number of cells in the first and second dataset, respectively, that are closer to the ith cell than their true matches in the opposite dataset. d is the Euclidean distance. FOSCTTM has a range of 0 to 1, and lower values indicate higher accuracy.

Feature consistency was used to evaluate the consistency of feature embeddings from different models. Since the raw embedding spaces are not directly comparable across models, we defined the consistency as the cross-model conservation of cosine similarities among features in the same model. Specifically, we first randomly subsample 2,000 features and compute the pairwise cosine similarity among them using feature embeddings from the two compared models. The feature consistency score is then defined as the Pearson’s correlation between the cosine similarities of two models, averaging across four random subsamples. Feature consistency has a range of −1 to 1, and higher values indicate higher consistency.

For the baseline benchmark, each method was run eight times with different random seeds, except for Harmony and bindSC, which have deterministic implementations and were run only once. For the guidance corruption benchmark, we removed the specified proportions of existing peak–gene interactions and added equal numbers of nonexistent interactions, so the total number of interactions remained unchanged. Of note, feature conversion was also repeated using the corrupted guidance graphs. The corruption procedure was repeated eight times with different random seeds. For the subsampling benchmark, the scRNA-seq and scATAC-seq cells were subsampled in pairs (so FOSCTTM could still be computed). The subsampling process was also repeated eight times with different random seeds.

For the systematic scalability test (Supplementary Fig. 17a), all methods were run on a Linux workstation with 40 CPU cores (two Intel Xeon Silver 4210 chips), 250 GB of RAM and NVIDIA GeForce RTX 2080 Ti graphical processing units. Only a single graphical processing unit card was used when training GLUE.

Triple-omics integration. The scRNA-seq and scATAC-seq data were handled as previously described (section Systematic benchmarks). Due to low coverage per single-C site, the snmC-seq data were converted to average methylation levels in gene bodies. The mCH and mCG levels were quantified separately, resulting in two features per gene. The gene methylation levels were normalized by the global methylation level per cell. An initial dimensionality reduction was performed using PCA (section Implementation details). For the triple-omics guidance graph, the mCH and mCG levels were connected to the corresponding genes with negative edges.

The normalized methylation levels were positive, with dropouts corresponding to the genes that were not covered in single cells. As such, we used the zero-inflated log-normal (ZILN) distribution for the data decoder:

$$p(x_k \mid u, V; \theta_k) = \prod_{i \in \mathcal{V}_k} \mathrm{ZILN}(x_{ki}; \mu_i, \sigma_i, \delta_i) \tag{42}$$

$$\mathrm{ZILN}(x_{ki}; \mu_i, \sigma_i, \delta_i) = \begin{cases} \dfrac{1 - \delta_i}{x_{ki}\,\sigma_i \sqrt{2\pi}} \exp\left(-\dfrac{(\log x_{ki} - \mu_i)^2}{2\sigma_i^2}\right), & x_{ki} > 0 \\ \delta_i, & x_{ki} = 0 \end{cases} \tag{43}$$

$$\mu = \alpha \odot V_k^{\top} u + \beta \tag{44}$$

where $\mu \in \mathbb{R}^{|\mathcal{V}_k|}$, $\sigma \in \mathbb{R}_{+}^{|\mathcal{V}_k|}$, $\delta \in (0, 1)^{|\mathcal{V}_k|}$ are the log-scale mean, log-scale standard deviation and zero-inflation parameters of the zero-inflated log-normal distribution, respectively, and $\alpha \in \mathbb{R}^{|\mathcal{V}_k|}$, $\beta \in \mathbb{R}^{|\mathcal{V}_k|}$ are scaling and bias factors.

To unify the cell type labels, we performed a nearest neighbor-based label transfer with the snmC-seq dataset as a reference. The five nearest neighbors in snmC-seq were identified for each scRNA-seq and scATAC-seq cell in the aligned embedding space, and majority voting was used to determine the transferred label. To verify whether the alignment was correct, we tested for significant overlap in cell type marker genes. The features of all omics layers were first converted to genes. Then, for each omics layer, the cell type markers were identified using the one-versus-rest Wilcoxon rank-sum test with the following criteria: FDR < 0.05 and log fold change >0 for scRNA-seq/scATAC-seq; FDR < 0.05 and log fold change <0 for snmC-seq. The significance of marker overlap was determined by the three-way Fisher’s exact test⁴⁰.

To perform correlation and regression analysis after the integration, we clustered all cells from the three omics layers using fine-scale k-means (k = 200). Then, for each omics layer, the cells in each cluster were aggregated into a metacell by summing their expression/accessibility counts or averaging their DNA methylation levels. The metacells were established as paired samples, based on which feature correlation and regression analyses could be conducted.

To integrate the same datasets using online iNMF, we inverted the snmC-seq data by subtracting each entry of the data matrix from its largest entry, following the procedure described in the original paper¹⁶.

GLUE-based cis-regulatory inference. To ensure consistency of cell types, we first selected the overlapping cell types between the 10X Multiome and pcHi-C data. The remaining cell types included T cells, B cells and monocytes. The eQTL data were used as is, because they were not cell type-specific. For scRNA-seq, we selected 6,000 highly variable genes. To capture remote cis-regulatory interactions, the base guidance graph was constructed for peak–gene pairs within a distance of 150 kb, using the alternative scheme as described in the section Systematic benchmarks.

To incorporate the regulatory evidence of pcHi-C and eQTL, we anchored all evidence to that between the ATAC peaks and RNA genes. A peak–gene pair was considered supported by pcHi-C if (1) the gene promoter was within 1 kb of a bait fragment, (2) the peak was within 1 kb of an other-end fragment and (3) significant contact was identified between the bait and the other-end fragment in pcHi-C. The pcHi-C-supported peak–gene interactions were weighted by multiplying the promoter-to-bait and the peak-to-other-end power-law weights (above). If a peak–gene pair was supported by multiple pcHi-C contacts, the weights were summed and clipped to a maximum of 1. A peak–gene pair was considered supported by eQTL if (1) the peak overlapped an eQTL locus and (2) the locus was associated with the expression of the gene. The eQTL-supported peak–gene interactions were assigned weights of 1. The composite guidance graph was constructed by adding the pcHi-C- and eQTL-supported interactions to the previous distance-based interactions, allowing for multi-edges.

For regulatory inference, only peak–gene pairs within 150 kb in distance were considered. The GLUE training process was repeated four times with different random seeds. For each repeat, the peak–gene regulatory score was computed as the cosine similarity between the feature embeddings. The final regulatory inference was obtained by averaging the regulatory scores across the four repeats. To evaluate the significance of the regulatory scores, we compared the scores to a null distribution obtained via randomly shuffled feature embeddings and computed empirical P values as the probability of getting more extreme scores in the null distribution. Finally, we computed the FDR of regulatory inference based on the P values using the Benjamini–Hochberg procedure. For cis-regulatory inference using LASSO, we used hyperparameter α = 0.01, which was optimized for area under the receiver operating characteristic curves of pcHi-C and eQTL prediction.

TF-target gene regulatory inference. We used the SCENIC workflow to construct a TF-gene regulatory network from the inferred peak–gene regulatory interactions. Briefly, the SCENIC workflow first constructs a gene coexpression network based on the scRNA-seq data, and then uses external cis-regulatory evidence to filter out false positives. SCENIC accepts cis-regulatory evidence in the form of gene rankings per TF, that is, genes with higher TF enrichment levels in their regulatory regions are ranked higher. To construct the rankings based on our inferred peak–gene interactions, we first overlapped the ENCODE TF chromatin immunoprecipitation (ChIP) peaks with the ATAC peaks and counted the number of ChIP peaks for each TF in each ATAC peak. Since different genes can have different numbers of connected ATAC peaks, and the ATAC peaks vary in length (longer peaks can contain more ChIP peaks by chance), we devised a sampling-based approach to evaluate TF enrichment. Specifically, for each gene, we randomly sampled 1,000 sets of ATAC peaks that matched the connected ATAC peaks in both number and length distribution. We counted the numbers of TF ChIP peaks in these random ATAC peaks as null distributions. For each TF in each gene, an empirical P value could then be computed by comparing the observed number of ChIP peaks to the null distribution. Finally, we ranked the genes by the empirical P values for each TF, producing the cis-regulatory rankings used by SCENIC. Since peak–gene-based inference is mainly focused on remote regulatory regions, proximal promoters could be missed. As such, we provided SCENIC with both the above peak-based and proximal promoter-based cis-regulatory rankings.

Integration for the human multi-omics atlas. The scRNA-seq and scATAC-seq atlases have highly unbalanced cell type compositions, which are primarily caused by differences in organ sampling sizes (Supplementary Fig. 17b). Although cell types are unknown during real-world analyses, organ sources are typically available and can be used to help balance the integration process. To perform organ-balanced data preprocessing, we first subsampled each omics layer to match the organ compositions. For the scRNA-seq data, 4,000 highly variable genes were selected using the organ-balanced subsample. Then, for the initial dimensionality reduction, we fitted PCA (scRNA-seq) and LSI (scATAC-seq) on the organ-balanced subsample and applied the projection to the full data. The PCA/LSI coordinates were used as the first transformation layer in the GLUE data encoders (section Implementation details), as well as for metacell aggregation (below). The guidance graph was constructed as described previously (section Systematic benchmarks).

The two atlases consist of large numbers of cells but with low coverage per cell. To alleviate dropout and increase the training speed simultaneously, we used a metacell aggregation strategy during pretraining. Specifically, in the pretraining stage, we clustered the cells in each omics layer using fine-scale k-means (k = 100,000 for scRNA-seq and k = 40,000 for scATAC-seq). To balance the organ compositions at the same time, k-means centroids were fitted on the previous organ-balanced subsample and then applied to the full data. The cells in each k-means cluster were aggregated into a metacell by summing their expression/accessibility counts and averaging their PCA/LSI coordinates. GLUE was then pretrained on the aggregated metacells with additive noise, which roughly oriented the cell embeddings but did not actually align them (section Weighted adversarial alignment). To better use the large data size, the hidden layer dimensionality was doubled to 512 from the default 256. In the second stage, GLUE was fine-tuned on the full single-cell data with the balancing weight estimated as described in the section Weighted adversarial alignment. No metacell aggregation was used when comparing the scalability of different methods (Supplementary Fig. 17a).

For a comparison with other integration methods, we also tried online iNMF and Seurat v.3. Online iNMF was the only other method that could scale to millions of cells, so we applied it to the full dataset. Seurat v.3, on the other hand, showed the second-best accuracy in our previous benchmark. Because Seurat v.3 could not scale to the full dataset (Supplementary Fig. 17a), we applied it to the aggregated data used in the first stage of GLUE training. Label transfer was performed using the same procedure as in the triple-omics case, except that we used majority voting in 50 nearest neighbors.

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All datasets used in this study are already published and were obtained from public data repositories. See Supplementary Table 1 for detailed information on single-cell omics datasets used in this study, including access codes and URLs. For regulatory inference and evaluation, the pcHi-C data were obtained from the supplementary file of the original publication (https://www.sciencedirect.com/science/article/pii/S0092867416313228), eQTL data from GTEx v8 (https://www.gtexportal.org/home/datasets), TF ChIP–seq data from the ENCODE data portal (https://www.encodeproject.org/) and the TRRUST v2 database from the official website (https://www.grnpedia.org/trrust/downloadnetwork.php). All benchmarking source data are available in Supplementary Data 1.

Code availability
The GLUE framework was implemented in the ‘scglue’ Python package, which is available at https://github.com/gao-lab/GLUE. For reproducibility, the scripts for all benchmarks and case studies were assembled using Snakemake (v.6.12.3), which is also available in the above repository.

References
68. Ding, J. & Regev, A. Deep generative model embedding of single-cell RNA-seq profiles on hyperspheres and hyperbolic spaces. Nat. Commun. 12, 2554 (2021).
69. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. in Advances in Neural Information Processing Systems (eds. Burges, C. J. C. et al.) 3111–3119 (Curran Associates, Inc., 2013).
70. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In Proc. 5th International Conference on Learning Representations (eds. Bengio, Y. & LeCun, Y.) (ICLR, 2017).
71. Dincer, A. B., Janizek, J. D. & Lee, S.-I. Adversarial deconfounding autoencoder for learning robust gene expression embeddings. Bioinformatics 36, i573–i582 (2020).
72. Goodfellow, I. et al. in Advances in Neural Information Processing Systems (eds Ghahramani, Z. et al.) 2672–2680 (Curran Associates, Inc., 2014).
73. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
74. Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
75. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
76. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
77. Davis, C. A. et al. The encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).

Acknowledgements
We thank F. Tang, X.S. Xie, Z. Zhang, L. Tao, C. Li, J. Lu (at Peking University) and Y. Ding (at the Beijing Institute of Radiation Medicine) for their helpful discussions and comments during the study, as well as authors of the datasets used in this work for their kind help. This work was supported by funds from the National Key Research and Development Program (grant no. 2016YFC0901603), the State Key Laboratory of Protein and Plant Gene Research and the Beijing Advanced Innovation Center for Genomics at Peking University, as well as the Changping Laboratory. The research by G.G. was supported in part by the National Program for Support of Top-notch Young Professionals. Part of the analysis was carried out on the Computing Platform of the Center for Life Sciences of Peking University and supported by the High-performance Computing Platform of Peking University. Parts of Fig. 1 were created using an image set downloaded from Servier Medical Art (https://smart.servier.com/, CC BY 3.0).

Author contributions
G.G. conceived the study and supervised the research. Z.J.C. designed and implemented the computational framework and conducted benchmarks and case studies with guidance from G.G. Z.J.C. and G.G. wrote the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Extended data are available for this paper at https://doi.org/10.1038/s41587-022-01284-4.
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41587-022-01284-4.
Correspondence and requests for materials should be addressed to Ge Gao.
Peer review information Nature Biotechnology thanks Ricard Argelaguet, Yun Li, Romain Lopez and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at www.nature.com/reprints.

Extended Data Fig. 1 | Individual metrics for evaluating integration performance. a, Mean average precision vs. Seurat alignment score for different integration methods. Higher mean average precision indicates higher cell type resolution, and higher Seurat alignment score indicates better omics mixing. b, Cell type vs. omics layer average silhouette width for different integration methods. Higher cell type average silhouette width indicates higher cell type resolution, and higher omics layer average silhouette width indicates better omics mixing. c, Neighbor conservation vs. graph connectivity for different integration methods.
Higher neighbor conservation indicates better conservation of manifold structure in each original layer, and higher graph connectivity indicates better omics mixing. n = 8 repeats with different model random seeds. The error bars indicate mean ± s.d.

Extended Data Fig. 2 | Effect of prior knowledge and data size on integration performance. a, Decrease in overall integration score at different prior knowledge corruption rates for integration methods that rely on prior feature relations (n = 8 repeats with different corruption random seeds). b, Overall integration score, and c, FOSCTTM with different schemes of connecting peaks and genes as prior regulatory knowledge, for integration methods that rely on prior feature relations (n = 8 repeats with different model random seeds). ‘Combined±0’ is the standard scheme where peaks overlapping gene body or promoter regions are linked. ‘Promoter±150k’ means that peaks are linked to genes if they are located within 150 kb of the gene promoter, weighted by a power-law function that models chromatin contact probability⁴²,⁴³. d, Overall integration score of different integration methods on subsampled datasets of varying sizes (n = 8 repeats with different subsampling random seeds). The error bars indicate mean ± s.d.

Extended Data Fig. 3 | Integration performance of GLUE under different hyperparameter settings. Integration performance is quantified by a, overall integration score, and b, FOSCTTM (n = 4 repeats with different model random seeds). The error bars indicate mean ± s.d. ‘Dimensionality’ denotes the cell embedding dimensionality. ‘Preprocessing dimensionality’ is the reduced dimensionality used for the first transformation layers of the data encoders (see Methods).
‘Hidden layer depth’ is the number of hidden layers in the data encoders and modality discriminator. ‘Hidden layer dimensionality’ is the dimensionality of hidden layers in the data encoders and modality discriminator. ‘Dropout’ is the dropout rate of hidden layers in data encoders and modality discriminator. ‘Lambda graph’ is the weight of the graph loss ($\lambda_G$). ‘Lambda align’ is the weight of the adversarial alignment ($\lambda_D$). ‘Negative sampling rate’ is the number of empirical samples used in negative edge sampling (samples from $p_{ns}$). For each hyperparameter, the center value is the default. To control computational cost, one hyperparameter was varied at a time, with all others set to their default values. The performance of GLUE was robust across a wide range of hyperparameter settings, except for failed alignments in which the adversarial alignment weight was too low or no hidden layers were used in the neural networks (equivalently a linear model with insufficient capacity).

Extended Data Fig. 4 | Integration performance of GLUE with different numbers of highly variable genes. Integration performance is quantified by a, overall integration score, and b, FOSCTTM (n = 8 repeats with different model random seeds). The error bars indicate mean ± s.d.

Extended Data Fig. 5 | Robustness of GLUE feature embeddings. Consistency of feature embeddings as defined by the conservation of feature-feature cosine similarity (Methods), under a, different hyperparameter settings (n = 4 repeats with different model random seeds), b, different prior knowledge corruption rates (n = 8 repeats with different corruption random seeds), and c, different numbers of subsampled cells (n = 8 repeats with different subsampling random seeds). The error bars indicate mean ± s.d.
Feature embeddings are robust across all hyperparameters except for $\lambda_G$, which directly controls the contribution of the guidance graph. Consistency also remains high (> 0.8) with up to 40% of prior knowledge corrupted, and with a minimum of ~4,000 subsampled cells.

Extended Data Fig. 6 | Integration consistency score for detecting over-correction. Integration consistency scores with varying numbers of metacells for different dataset combinations. Same-tissue combinations represent proper correction, and different-tissue combinations represent over-correction. The dashed horizontal line indicates integration consistency score = 0.05.
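
Several of the scalar formulas in the Methods can be checked with a few lines of code. The sketch below is illustrative only and is not taken from the ‘scglue’ package: it implements the power-law edge weight w = (d + 1)^−0.75 used for the distance-weighted guidance graph, FOSCTTM as given in equations (39) to (41), and the overall integration score of equations (32), (37) and (38). The function names, the plain-list inputs and the choice to map a metric that is constant across methods to 0 during min-max scaling are assumptions made here for the example.

```python
import math


def power_law_weight(distance_kb):
    """Edge weight w = (d + 1) ** -0.75 for a peak-gene distance d in kb."""
    return (distance_kb + 1.0) ** -0.75


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def foscttm(x, y):
    """FOSCTTM for paired embeddings, where x[i] is the true match of y[i].

    n1 counts cells x[j] closer to y[i] than its true match x[i] (eq. 40);
    n2 counts cells y[j] closer to x[i] than its true match y[i] (eq. 41).
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    total = 0.0
    for i in range(n):
        d_true = euclidean(x[i], y[i])
        n1 = sum(1 for j in range(n) if euclidean(x[j], y[i]) < d_true)
        n2 = sum(1 for j in range(n) if euclidean(x[i], y[j]) < d_true)
        total += n1 / n + n2 / n
    return total / (2 * n)


def min_max_scale(values):
    """Min-max scale one metric across methods.

    Assumption of this sketch: a metric that is constant across methods
    is mapped to 0 for every method to avoid division by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def overall_scores(map_, cell_type_asw, nc, sas, omics_layer_asw, gc):
    """Overall integration score per method (eqs. 32, 37 and 38).

    Each argument is a list holding one metric value per method.
    """
    bio_metrics = [min_max_scale(m) for m in (map_, cell_type_asw, nc)]
    mix_metrics = [min_max_scale(m) for m in (sas, omics_layer_asw, gc)]
    biology = [sum(vals) / 3 for vals in zip(*bio_metrics)]
    mixing = [sum(vals) / 3 for vals in zip(*mix_metrics)]
    return [0.6 * b + 0.4 * m for b, m in zip(biology, mixing)]
```

Because the scaling is applied across methods, the overall score of a method depends on which other methods are included in the comparison; the pairwise loops in `foscttm` are quadratic in the number of cells and are meant for small worked examples rather than atlas-scale data.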

Nature Biotechnology, Volume 40 (10) – Oct 1, 2022


Publisher: Springer Journals
Copyright: © The Author(s) 2022
ISSN: 1087-0156
eISSN: 1546-1696
DOI: 10.1038/s41587-022-01284-4

Abstract

Zhi-Jie Cao (1,2) and Ge Gao (1,2) ✉

State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) and Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing, China. Changping Laboratory, Beijing, China. e-mail: gaog@mail.cbi.pku.edu.cn

Nature Biotechnology | VOL 40 | OCTOBER 2022 | 1458–1466 | www.nature.com/naturebiotechnology

Despite the emergence of experimental methods for simultaneous measurement of multiple omics modalities in single cells, most single-cell datasets include only one modality. A major obstacle in integrating omics data from multiple modalities is that different omics layers typically have distinct feature spaces. Here, we propose a computational framework called GLUE (graph-linked unified embedding), which bridges the gap by modeling regulatory interactions across omics layers explicitly. Systematic benchmarking demonstrated that GLUE is more accurate, robust and scalable than state-of-the-art tools for heterogeneous single-cell multi-omics data. We applied GLUE to various challenging tasks, including triple-omics integration, integrative regulatory inference and multi-omics human cell atlas construction over millions of cells, where GLUE was able to correct previous annotations. GLUE features a modular design that can be flexibly extended and enhanced for new analysis tasks. The full package is available online at https://github.com/gao-lab/GLUE.

Recent technological advances in single-cell sequencing have enabled the probing of regulatory maps through multiple omics layers, such as chromatin accessibility (single-cell ATAC-sequencing (scATAC-seq) [1,2]), DNA methylation (snmC-seq [3], sci-MET [4]) and the transcriptome (scRNA-seq [5,6]), offering a unique opportunity to unveil the underlying regulatory bases for the functionalities of diverse cell types. While simultaneous assays have recently emerged [8–11], different omics are usually measured independently and produce unpaired data, which calls for effective and efficient in silico multi-omics integration [12,13].

Computationally, one major obstacle faced when integrating unpaired multi-omics data (also known as diagonal integration) is the distinct feature spaces of different modalities (for example, accessible chromatin regions in scATAC-seq versus genes in scRNA-seq). A quick fix is to convert multimodality data into one common feature space based on prior knowledge and apply single-omics data integration methods [15–18]. Such explicit ‘feature conversion’ is straightforward, but has been reported to result in information loss. Algorithms based on coupled matrix factorization circumvent explicit conversion but hardly handle more than two omics layers [20,21]. An alternative option is to match cells from different omics layers via nonlinear manifold alignment, which removes the requirement of prior knowledge completely and could reduce inter-modality information loss in theory [22–25]; however, this technique has mostly been applied to relatively small datasets with a limited number of cell types.

The ever-increasing volume of data is another serious challenge. Recently developed technologies can routinely generate datasets at the scale of millions of cells [27–29], whereas current integration methods have only been applied to datasets with much smaller volumes [15,17,20–23]. To catch up with the growth in data throughput, computational integration methods should be designed with scalability in mind.

Hereby, we introduce GLUE (graph-linked unified embedding), a modular framework for integrating unpaired single-cell multi-omics data and inferring regulatory interactions simultaneously. By modeling the regulatory interactions across omics layers explicitly, GLUE bridges the gaps between various omics-specific feature spaces in a biologically intuitive manner. Systematic benchmarks and case studies demonstrate that GLUE is accurate, robust and scalable for heterogeneous single-cell multi-omics data. Furthermore, GLUE is designed as a generalizable framework that allows for easy extension and quick adoption to particular scenarios in a modular manner. GLUE is publicly accessible at https://github.com/gao-lab/GLUE.

Results

Unpaired multi-omics integration via graph-guided embeddings. Inspired by previous studies, we model cell states as low-dimensional cell embeddings learned through variational autoencoders [30,31]. Given their intrinsic differences in biological nature and assay technology, each omics layer is equipped with a separate autoencoder that uses a probabilistic generative model tailored to the layer-specific feature space (Fig. 1 and Methods).

Taking advantage of prior biological knowledge, we propose the use of a knowledge-based graph (‘guidance graph’) that explicitly models cross-layer regulatory interactions for linking layer-specific feature spaces; the vertices in the graph correspond to the features of different omics layers, and edges represent signed regulatory interactions. For example, when integrating scRNA-seq and scATAC-seq data, the vertices are genes and accessible chromatin regions (that is, ATAC peaks), and a positive edge can be connected between an accessible region and its putative downstream gene. Then, adversarial multimodal alignment of the cells is performed as an iterative optimization procedure, guided by feature embeddings encoded from the graph [32] (Fig. 1 and Methods). Notably, when the iterative process converges, the graph can be refined with inputs from the alignment procedure and used for data-oriented regulatory inference (see below for more details).

[Figure 1: schematic showing the data encoders (variational posteriors) $q(U_k \mid X_k; \phi_k)$, the data decoders (generative models) $p(X_k \mid U_k, V_k; \theta_k)$, the knowledge-based guidance graph with its encoder $q(V \mid \mathcal{G}; \phi_{\mathcal{G}})$ and decoder $p(\mathcal{G} \mid V; \theta_{\mathcal{G}})$, the feature embeddings, the cell embeddings and the discriminator $D(u; \psi)$ for the scRNA-seq, scATAC-seq and snmC-seq omics layers.]

Fig. 1 | Architecture of the GLUE framework. Denoting unpaired data from three omics layers as $X_1 \in \mathbb{R}^{N_1 \times |\mathcal{V}_1|}$, $X_2 \in \mathbb{R}^{N_2 \times |\mathcal{V}_2|}$, $X_3 \in \mathbb{R}^{N_3 \times |\mathcal{V}_3|}$, where $N_1, N_2, N_3$ are cell numbers, and $\mathcal{V}_1, \mathcal{V}_2, \mathcal{V}_3$ are sets of omics features in each layer, GLUE uses omics-specific variational autoencoders to learn low-dimensional cell embeddings $U_1, U_2, U_3$ from each omics layer. The data dimensionality and generative distribution can differ across layers, but the embedding dimension $m$ is shared. To link the omics-specific data spaces, GLUE makes use of prior knowledge about regulatory interactions in the form of a guidance graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where vertices $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2 \cup \mathcal{V}_3$ are omics features. A graph variational autoencoder is used to learn feature embeddings $V = (V_1^\top, V_2^\top, V_3^\top)^\top$ from the prior knowledge-based guidance graph, which are then used in data decoders to reconstruct omics data via inner product with cell embeddings, effectively linking the omics-specific data spaces to ensure a consistent embedding orientation. Last, an omics discriminator $D$ is used to align the cell embeddings of different omics layers via adversarial learning. $\phi_1, \phi_2, \phi_3, \phi_{\mathcal{G}}$ represent learnable parameters in data and graph encoders. $\theta_1, \theta_2, \theta_3, \theta_{\mathcal{G}}$ represent learnable parameters in data and graph decoders. $\psi$ represents learnable parameters in the omics discriminator.

Systematic benchmarking demonstrates superior performance. We first benchmarked GLUE against multiple popular unpaired multi-omics integration methods [15–18,23–25,33] using three gold-standard datasets generated by recent simultaneous scRNA-seq and scATAC-seq technologies (SNARE-seq [8], SHARE-seq [9] and 10X Multiome [34]), along with two unpaired datasets (Nephron [35] and MOp).

An effective integration method should match the corresponding cell states from different omics layers, producing cell embeddings where the biological variation is faithfully conserved and the omics layers are well mixed. Compared to other methods, GLUE achieved a high level of biology conservation and omics mixing simultaneously (Fig. 2a, each quantified by three separate metrics as shown in Extended Data Fig. 1), and was consistently the best method across all benchmark datasets in terms of overall score (Fig. 2b, see Methods for details on metric aggregation); these results were also validated by uniform manifold approximation and projection (UMAP) visualization of the aligned cell embeddings (Supplementary Figs. 1–5).

An optimal integration method should produce accurate alignments not only at the cell type level but also at finer scales. Exploiting the ground truth cell-to-cell correspondence in the gold-standard datasets, we further quantified single-cell level alignment error via the FOSCTTM (fraction of samples closer than the true match) metric. On all three datasets, GLUE achieved the lowest FOSCTTM, decreasing the alignment error by large margins compared to the second-best method on each dataset (Fig. 2c, the decreases were 3.6-fold for SNARE-seq, 1.7-fold for SHARE-seq and 1.5-fold for 10X Multiome).

During the evaluation described above, we adopted a standard schema (ATAC peaks were linked to RNA genes if they overlapped in the gene body or proximal promoter regions) to construct the guidance graph for GLUE and to perform feature conversion for other conversion-based methods. Given that our current knowledge about the regulatory interactions is still far from perfect, a useful integration method must be robust to such inaccuracies. Thus, we further assessed the methods’ robustness to corruption of regulatory interactions by randomly replacing varying fractions of existing interactions with nonexistent ones. For all three datasets, GLUE exhibited the smallest performance changes even at corruption rates as high as 90% (Fig. 2d and Extended Data Fig. 2a), suggesting its superior robustness. Consistently, we found that using alternative guidance graphs defined in larger genomic windows had minimal influence on integration performance (Extended Data Fig. 2b,c).

Given its neural network-based nature, GLUE may suffer from undertraining when working with small datasets. Thus, we repeated the evaluations using subsampled datasets of various sizes. GLUE remained the top-ranking method with as few as 2,000 cells, but the alignment error increased more steeply when the data volume decreased to less than 1,000 cells (Fig. 2e and Extended Data Fig. 2d). Additionally, we also noted that the integration performance of GLUE was robust for a wide range of hyperparameter and feature selection settings (Extended Data Figs. 3 and 4). Apart from the
cell embeddings, the feature embeddings of GLUE also exhibit considerable robustness to hyperparameter settings, prior knowledge corruption and data subsampling (Extended Data Fig. 5).

[Figure 2 panels a–e. Datasets: SNARE-seq, SHARE-seq, 10X Multiome, Nephron and MOp. Methods: UnionCom, Pamona, MMD-MA, Seurat v3, Harmony, LIGER, LIGER (FiG), Online iNMF, Online iNMF (FiG), bindSC and GLUE. Axis labels: biology conservation, omics mixing, overall integration score, FOSCTTM, increase in FOSCTTM, corruption rate (0.2 to 1.0) and subsample size (1,000 to 8,000).]

Fig. 2 | Systematic benchmarks of integration performance. a, Biological conservation score versus omics integration score for different integration methods. b, Overall integration score (defined as 0.6 × biology conservation + 0.4 × omics integration) of different integration methods (n = 8 repeats with different model random seeds). c, Single-cell level alignment error (quantified by FOSCTTM) of different integration methods (n = 8 repeats with different model random seeds). d, Increases in FOSCTTM at different prior knowledge corruption rates for integration methods that rely on prior feature relations (n = 8 repeats with different corruption random seeds). e, FOSCTTM values of different integration methods on subsampled datasets of varying sizes (n = 8 repeats with different subsampling random seeds). FiG is an alternative feature conversion method recommended by online iNMF and LIGER (Methods). Online iNMF and LIGER could not run with FiG conversion on the SNARE-seq data because the raw ATAC fragment file was not available, thus marked as ‘NA’. Other NA marks were made because of memory overflow. The error bars indicate mean ± s.d.

In addition to the systematic difference among omics layers, single-cell data are often complicated by batch effect within the same layer. For example, the SHARE-seq data was processed in four libraries, one of which showed batch effect compared to the other three in scRNA-seq (Supplementary Fig. 6a), while the Nephron data profiled four donors, all of which showed substantial batch effect against each other in both scRNA-seq and scATAC-seq (Supplementary Fig. 7a,c). As a solution to such complex scenarios, GLUE provides batch correction capability by including batch as a decoder covariate (Methods). With batch correction enabled, GLUE was able to correct for these batch effects effectively, producing substantially better batch mixing (Supplementary Fig. 6b and Supplementary Fig. 7b,d). To guard against potential over-correction, for example, when forcing an integration over datasets lacking common cell states, we devised a diagnostic metric called the integration consistency score, which measures the consistency between the integrated multi-omics space and prior knowledge in the guidance graph (Methods). We observed substantially lower scores (close to 0) when integrating data from inconsistent tissues compared to integrating within the same tissue, making it a reliable indicator of integration quality (Extended Data Fig. 6).

GLUE enables effective triple-omics integration. Benefitting from a modular design and scalable adversarial alignment, GLUE readily extends to more than two omics layers. As a case study, we used GLUE to integrate three distinct omics layers of neuronal cells in the adult mouse cortex, including gene expression, chromatin accessibility [38] and DNA methylation [3].

Unlike chromatin accessibility, gene body DNA methylation generally shows a negative correlation with gene expression in neuronal cells. GLUE natively supports the mixture of regulatory effects by modeling edge signs in the guidance graph. Such a strategy avoids data inversion, which is required by previous methods [16,17] and can break data sparsity and the underlying distribution. For the triple-omics guidance graph, we linked gene body mCH and mCG levels to genes via negative edges, while the positive edges between accessible regions and genes remained the same.

The GLUE alignment successfully revealed a shared manifold of cell states across the three omics layers (Fig. 3a–d). Notably, the original cell types were not annotated at the same resolution, and many could be further clustered into smaller subtypes even within single layers (Supplementary Fig. 8a–f). To unify the cell type annotations, neighbor-based label transfer was conducted using the integrated cell embeddings and we observed highly significant marker overlap (Fig. 3e, three-way Fisher’s exact test, false discovery rate (FDR) < $5 \times 10^{-17}$) for 12 out of the 14 mapped cell types (Supplementary Figs. 8g–o and 9 and Methods), indicating reliable alignment. The GLUE alignment helped improve the effects of cell typing in all omics layers, including the further partitioning of the scRNA-seq ‘MGE’ cluster into Pvalb+ (‘mPv’) and Sst+ (‘mSst’) subtypes (highlighted with green circles/flows in Fig. 3 and Supplementary Fig. 8), the partitioning of the scRNA-seq ‘CGE’ cluster and scATAC-seq ‘Vip’ cluster into Vip+ (‘mVip’) and Ndnf+ (‘mNdnf’) subtypes (highlighted with dark blue circles/flows in Fig. 3 and Supplementary Fig. 8), and the identification of snmC-seq ‘mDL-3’ cells and a subset of scATAC-seq ‘L6 IT’ cells as claustrum cells (highlighted with light blue circles/flows in Fig. 3 and Supplementary Fig. 8).

[Figure 3 panels a–f. a–d, UMAP1 versus UMAP2 plots with legends for scRNA-seq cell type, snmC-seq cell type, scATAC-seq cell type and omics layer (scRNA-seq, scATAC-seq, snmC-seq); e, −log FDR by cell type; f, gene expression $R^2$ by omics layer (mCH, mCG, ATAC, combined).]

Fig. 3 | Triple-omics integration of the mouse cortex. a–c, UMAP visualizations of the integrated cell embeddings for scRNA-seq (a), snmC-seq (b) and scATAC-seq (c), colored by the original cell types. Cells aligning with ‘mPv’ and ‘mSst’ are highlighted with green circles. Cells aligning with ‘mNdnf’ and ‘mVip’ are highlighted with dark blue circles. Cells aligning with ‘mDL-3’ are highlighted with light blue circles. d, UMAP visualizations of the integrated cell embeddings for all cells, colored by omics layers. e, Significance of marker gene overlap for each cell type across all three omics layers (three-way Fisher’s exact test [40]). The dashed vertical line indicates that FDR = 0.01. We observed highly significant marker overlap (FDR < $5 \times 10^{-17}$) for 12 out of the 14 cell types, indicating reliable alignment. For the remaining two cell types, ‘mDL-1’ had marginally significant marker overlap with FDR = 0.003, while the ‘mIn-1’ cells in snmC-seq did not properly align with the scRNA-seq or scATAC-seq cells. f, Coefficient of determination ($R^2$) for predicting gene expression based on each epigenetic layer as well as the combination of all layers (n = 2,677 highly variable genes common to all three omics layers). The box plots indicate the medians (centerlines), means (triangles), first and third quartiles (bounds of boxes) and 1.5× interquartile range (whiskers).

Such triple-omics integration also sheds light on the quantitative contributions of different epigenetic regulation mechanisms (Methods). Among mCH, mCG and chromatin accessibility, we found that the mCH level had the highest predictive power for gene expression in cortical neurons (average $R^2$ = 0.187). When all epigenetic layers were considered, the expression predictability increased further (average $R^2$ = 0.236), suggesting the presence of nonredundant contributions (Fig. 3f). Among the neurons of different layers, DNA methylation (especially mCH) exhibited slightly higher predictability for gene expression in deeper layers than in superficial layers (Supplementary Fig. 10a). Across all genes, the predictability of gene expression was generally correlated among the different epigenetic layers (Supplementary Fig. 10b). We also observed varying associations with gene characteristics. For example, mCH had higher expression predictability for longer genes, which was consistent with previous studies [17,41], while chromatin accessibility contributed more to genes with higher expression variability (Supplementary Fig. 10c). We also repeated the same analysis using online iNMF, which is currently the only other method capable of integrating the three omics layers simultaneously, but it produced much lower cell type resolution and epigenetic correlation (Supplementary Fig. 11).

Integrative regulatory inference with GLUE. The incorporation of a graph explicitly modeling regulatory interactions in GLUE further enables a Bayesian-like approach that combines prior knowledge and observed data for posterior regulatory inference. Specifically, since the feature embeddings are designed to reconstruct the knowledge-based guidance graph and single-cell multi-omics data simultaneously (Fig. 1), their cosine similarities should reflect information from both aspects, which we adopt as ‘regulatory scores’.

As a demonstration, we used the official peripheral blood mononuclear cell Multiome dataset from 10X and fed it to GLUE as unpaired scRNA-seq and scATAC-seq data. To capture remote cis-regulatory interactions, we used a long-range guidance graph connecting ATAC peaks and RNA genes in 150-kb windows weighted by a power-law function that models chromatin contact probability [42,43] (Methods). Visualization of cell embeddings confirmed that the GLUE alignment was correct and accurate (Supplementary Fig. 12a,b). As expected, we found that the regulatory score was negatively correlated with genomic distance (Fig. 4a) and positively correlated with the empirical peak–gene correlation (computed with paired cells, Fig. 4b), with robustness across different random seeds (Supplementary Fig. 12c).

To further assess whether the score reflected actual cis-regulatory interactions, we compared it with external evidence, including pcHi-C [44] and eQTL [45]. The GLUE regulatory score was higher for pcHi-C-supported peak–gene pairs in all distance ranges (Fig. 4a) and was a better predictor of pcHi-C interactions than empirical peak–gene correlations (Fig. 4b), as well as LASSO and Cicero, the coaccessibility-based regulatory prediction method (Fig. 4c and Supplementary Fig. 12d). The same held for eQTL (Supplementary Fig. 12e–h).

[Figure 4 panels a–e. a, GLUE regulatory score by genomic distance (0–25 kb to 125–150 kb), grouped by pcHi-C support; b, GLUE regulatory score versus Spearman correlation, colored by pcHi-C support; c, TPR versus FPR for pcHi-C prediction, with legend Cicero (AUROC = 0.548), Spearman (AUROC = 0.555), LASSO (AUROC = 0.547) and GLUE (AUROC = 0.631); d,e, genome tracks (GLUE, pcHi-C, eQTL, ATAC, ChIP and genes) around target gene NCF2 on chr1 and target gene CD83 on chr6.]

Fig. 4 | Integrative regulatory inference in peripheral blood mononuclear cells. a, GLUE regulatory scores for peak–gene pairs across different genomic ranges, grouped by whether they had pcHi-C support. The box plots indicate the medians (centerlines), means (triangles), first and third quartiles (bounds of boxes) and 1.5× interquartile range (whiskers). b, Comparison between the GLUE regulatory scores and the empirical peak–gene correlations computed on paired cells. Peak–gene pairs are colored by whether they had pcHi-C support. c, Receiver operating characteristic curves for predicting pcHi-C interactions based on different peak–gene association scores. AUROC is the area under the receiver operating characteristic curve. d,e, GLUE-identified cis-regulatory interactions of NCF2 (d) and CD83 (e), along with individual regulatory evidence. SPI1 (highlighted with a green box) is a known regulator of NCF2.

The GLUE framework also allows additional regulatory evidence, such as pcHi-C, to be incorporated intuitively via the guidance graph. Thus, we further trained models with a composite guidance graph containing distance-weighted interactions as well as pcHi-C- and eQTL-supported interactions (Supplementary Fig. 13). The significance of the regulatory score was evaluated by comparing it to a NULL distribution obtained from randomly shuffled feature embeddings (Methods). As expected, while the multi-omics alignment was insensitive to the change in guidance graph, the inferred regulatory interactions showed stronger enrichment for pcHi-C and eQTL (Supplementary Fig. 13a–d). Large fractions of high-confidence interactions simultaneously supported by pcHi-C, eQTL and correlation could be robustly recovered (FDR < 0.05), even if they were corrupted in the guidance graph (Supplementary Fig. 13e). Furthermore, the GLUE-derived transcription factor (TF)–target gene network (Methods) showed more significant agreement with manually curated connections in the TRRUST v2 database than individual evidence-based networks (Supplementary Figs. 13f and 14 and Supplementary Data 2).

We noticed that the GLUE-inferred cis-regulatory interactions could provide hints about the regulatory mechanisms of known TF–target pairs. For example, SPI1 is a known regulator of the NCF2 gene, and both are highly expressed in monocytes (Supplementary Fig. 15a,b). GLUE identified three remote regulatory peaks for NCF2 with various pieces of evidence, that is, roughly 120 kb downstream,
25 kb downstream and 20 kb upstream from the transcription start site (TSS) (Fig. 4d), all of which were bound by SPI1. Meanwhile, most putative regulatory interactions were previously unknown. For example, CD83 was linked with three regulatory peaks (two roughly 25 kb upstream, one about 10 kb upstream from the TSS), which were enriched for the binding of three TFs (BCL11A, PAX5 and RELB; Fig. 4e). While CD83 was highly expressed in both monocytes and B cells, the inferred TFs showed more constrained expression patterns (Supplementary Fig. 15c–f), suggesting that its active regulators might differ per cell type. Supplementary Fig. 16 shows more examples of GLUE-inferred regulatory interactions.

Atlas-scale integration over millions of cells with GLUE. As technologies continue to evolve, the throughput of single-cell experiments is constantly increasing. Recent studies have generated human cell atlases for gene expression and chromatin accessibility [29] containing millions of cells. The integration of these atlases poses a substantial challenge to computational methods due to the sheer volume of data, extensive heterogeneity, low coverage per cell and unbalanced cell type compositions, and has yet to be accomplished at the single-cell level.

Implemented as a neural network with minibatch optimization, GLUE delivers superior scalability with a sublinear time cost, promising its applicability at the atlas-scale (Supplementary Fig. 17a). Using an efficient multistage training strategy for GLUE (Methods), we successfully integrated the gene expression and chromatin accessibility data into a unified multi-omics human cell atlas (Fig. 5).

[Figure 5 panels a,b: UMAP1 versus UMAP2 plots with legends for omics layer (scATAC-seq, scRNA-seq) and annotated cell type.]

Fig. 5 | Integration of a multi-omics human cell atlas. a,b, UMAP visualizations of the integrated cell embeddings, colored by omics layers (a) and cell types (b). The pink circles highlight cells labeled as ‘Excitatory neurons’ in scRNA-seq but ‘Astrocytes’ in scATAC-seq. The blue circles highlight cells labeled as ‘Astrocytes’ in scRNA-seq but ‘Astrocytes/oligodendrocytes’ in scATAC-seq. The brown circles highlight cells labeled as ‘Oligodendrocytes’ in scRNA-seq but ‘Astrocytes/oligodendrocytes’ in scATAC-seq.

While the aligned atlas was largely consistent with the original annotations (Supplementary Fig. 17c–e), we also noticed several discrepancies. For example, cells originally annotated as ‘Astrocytes’ in scATAC-seq were aligned to an ‘Excitatory neurons’ cluster in scRNA-seq (highlighted with pink circles/flows in Supplementary Fig. 17). Further inspection revealed that canonical radial glial markers such as PAX6, HES1 and HOPX [47,48] were actively transcribed in this cluster, both in the RNA and ATAC domain (Supplementary Fig. 18), with chromatin priming [9] also detected at both neuronal and glial markers (Supplementary Figs. 19–21), suggesting that the cluster consists of multipotent neural progenitors (likely radial glia) rather than excitatory neurons or astrocytes as originally annotated. GLUE-based integration also resolved several scATAC-seq clusters that were ambiguously annotated. For example, the ‘Astrocytes/Oligodendrocytes’ cluster was split into two halves and aligned to the ‘Astrocytes’ and ‘Oligodendrocytes’ clusters of scRNA-seq (highlighted, respectively, with blue and brown circles/flows in Supplementary Fig. 17), which was also supported by marker expression and accessibility (Supplementary Figs. 20 and 21). These results demonstrate the unique value of atlas-scale multi-omics integration where cell typing can be done in an unbiased, data-oriented manner across modalities without losing single-cell resolution. In particular, the incorporation of batch correction could further enable effective curation of new datasets with the integrated atlas as a global reference.

In comparison, we also attempted to perform integration using online iNMF, which was the only other method capable of integrating the data at full scale, but the result was far from optimal (Supplementary Figs. 22a,b and 23). Meanwhile, an attempt to integrate the data as aggregated metacells (Methods) via the popular Seurat v3 method also failed (Supplementary Fig. 22c,d).

Discussion

Combining omics-specific autoencoders with graph-based coupling and adversarial alignment, we designed the GLUE framework for unpaired single-cell multi-omics data integration with superior accuracy and robustness. By modeling regulatory interactions across omics layers explicitly, GLUE uniquely supports integrative regulatory inference for unpaired multi-omics datasets. Notably, in a Bayesian interpretation, the GLUE regulatory inference can be seen as a posterior estimate, which can be continuously refined on the arrival of new data.

Unpaired multi-omics integration shares some conceptual similarities with batch effect correction, but the former is substantially more challenging because of the distinct, omics-specific feature spaces. While feature conversion may seem to be a straightforward solution, the inevitable information loss can be detrimental. Seurat v3 (ref. 15) and bindSC [33] also devised heuristic strategies to use

for scRNA-seq and scATAC-seq, and zero-inflated log-normal for snmC-seq (Methods). Nevertheless, generative distributions can be easily reconfigured to accommodate other omics layers, such as protein abundance [56] and histone modification [57], and to adopt new advances in data modeling techniques.

• The guidance graphs used in GLUE have currently been limited to multipartite graphs, containing only edges between features of different layers. Nonetheless, graphs, as intuitive and flexible representations of regulatory knowledge, can embody more complex regulatory patterns, including within-modality interactions, nonfeature vertices and multi-relations. Beyond canonical graph convolution, more advanced graph neural network architectures [59–61] may also be adopted to extract richer information from the regulatory graph. Particularly, recent advances in hypergraph modeling [62,63] could facilitate the use of prior knowledge on regulatory interactions involving multiple regulators simultaneously, as well as enable regulatory inference for such interactions.

Recent advances in experimental multi-omics technologies have increased the availability of paired data [8–11,34]. While most of the current simultaneous multi-omics protocols still suffer from lower data quality or throughput than that of single-omics methods, paired cells can be highly informative in anchoring different omics layers and should be used in conjunction with unpaired cells whenever available. It is straightforward to extend the GLUE framework to incorporate such pairing information, for example, by adding loss terms that penalize the embedding distances between paired cells. Such an extension may ultimately lead to a solution for the general case of mosaic integration.

Apart from multi-omics integration, we also note that the GLUE framework could be suitable for cross-species integration, especially when distal species are concerned and one-to-one orthologs are limited. Specifically, we may compile all orthologs into a GLUE guidance graph and perform integration without explicit ortholog conversion. Under that setting, the GLUE approach could also be conceptually connected to a recent work called SAMap [66].
information in the original feature spaces in addition to converted Finally, we note that the inferred regulatory interactions from data, which may explain their improved performance than meth- the current GLUE model are based on the whole input dataset 16,17 ods that do not . Meanwhile, known cell types have also been and may be an aggregation of multiple spatiotemporal-specific 51,52 used to guide integration via (semi-)supervised learning , but circuits, especially for data derived from distinct tissues (for this approach incurs substantial limitations in terms of applicability example, atlas). Meanwhile, we notice that in parallel to the since such supervision is typically unavailable and in many cases coarse-scale global model (for example, the whole-atlas integra- serves as the purpose of multi-omics integration per se . Notably, tion model), finer-scale regulatory inference could be conducted one of these methods was proposed with a similar autoencoder by training dedicated models on cells from a single tissue, poten- architecture and adversarial alignment , but it relied on matched tially with spatiotemporal-specific prior knowledge incorporated cell types or clusters to orient the alignment. In fact, GLUE shares as well . Such a ‘step-wise refinement’ extension would effectively more conceptual similarity with coupled matrix factorization meth- help identify spatiotemporal-specific regulatory circuits and 20,21 ods , but with superior performance, which mostly benefits from key regulators. its deep generative model-based design. We believe that GLUE, as a modular and generalizable frame- We note that the current framework also works for integrat- work, creates an unprecedented opportunity toward effectively ing omics layers with shared features (for example, the integration delineating gene regulatory maps via large-scale multi-omics inte- 53,54 between scRNA-seq and spatial transcriptomics ), by using either gration at single-cell resolution. 
The whole package of GLUE, along the same vertex or connected surrogate vertices for shared features with tutorials and demo cases, is available online at https://github. in the guidance graph. In addition, cross imputation could also be com/gao-lab/GLUE for the community. implemented by chaining encoders and decoders of different omics layers. However, given a recent report that data imputation could online content induce artifacts and deteriorate the accuracy of gene regulatory Any methods, additional references, Nature Research report- inference , such a function may need further investigation. ing summaries, source data, extended data, supplementary infor- As a generalizable framework, GLUE features a modular mation, acknowledgements, peer review information; details of design, where the data and graph autoencoders are independently author contributions and competing interests; and statements of configurable. data and code availability are available at https://doi.org/10.1038/ s41587-022-01284-4. • e d Th ata autoencoders in GLUE are customizable with appro - Received: 13 September 2021; Accepted: 15 March 2022; priate generative models that conform to omics-specific data Published online: 2 May 2022 distributions. In the current work, we used negative binomial NatuRe Biote ChNoloG y | VOL 40 | OCt OBeR 2022 | 1458–1466 | www.nature.com/naturebiotechnology 1464 NATUrE BioTEcHNoLoGy Articles 34. PBMC from a healthy donor, single cell multiome ATAC gene expression References demonstration data by Cell Ranger ARC 1.0.0. 10X Genomics https://support. 1. Cusanovich, D. A. et al. Multiplex single cell profiling of chromatin 10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_ accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015). granulocyte_sorted_10k (2020). 2. Chen, X., Miragaia, R. J., Natarajan, K. N. & Teichmann, S. A. A rapid and 35. Muto, Y. et al. 
Single cell transcriptional and chromatin accessibility profiling robust method for single cell chromatin accessibility profiling. Nat. Commun. redefine cellular heterogeneity in the adult human kidney. Nat. Commun. 12, 9, 5345 (2018). 2190 (2021). 3. Luo, C. et al. Single-cell methylomes identify neuronal subtypes and 36. Yao, Z. et al. A transcriptomic and epigenomic cell atlas of the mouse regulatory elements in mammalian cortex. Science 357, 600–604 (2017). primary motor cortex. Nature 598, 103–110 (2021). 4. Mulqueen, R. M. et al. Highly scalable generation of DNA methylation 37. Saunders, A. et al. Molecular diversity and specializations among the cells of profiles in single cells. Nat. Biotechnol. 36, 428–431 (2018). the adult mouse brain. Cell 174, 1015–1030 (2018). 5. Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in 38. Fresh cortex from adult mouse brain (v1), single cell ATAC demonstration single cells. Nat. Methods 10, 1096–1098 (2013). data by Cell Ranger 1.1.0. 10X Genomics https://support.10xgenomics.com/ 6. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single single-cell-atac/datasets/1.1.0/atac_v1_adult_brain_fresh_5k (2019). cells. Nat. Commun. 8, 14049 (2017). 39. Mo, A. et al. Epigenomic signatures of neuronal diversity in the mammalian 7. Packer, J. & Trapnell, C. Single-cell multi-omics: an engine for new brain. Neuron 86, 1369–1384 (2015). quantitative models of gene regulation. Trends Genet. 34, 653–665 (2018). 40. Wang, M., Zhao, Y. & Zhang, B. Efficient test and visualization of multi-set 8. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the intersections. Sci Rep. 5, 16923 (2015). transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 41. Gabel, H. W. et al. Disruption of DNA-methylation-dependent long gene 37, 1452–1457 (2019). repression in Rett syndrome. Nature 522, 89–93 (2015). 9. Ma, S. et al. 
Chromatin potential identified by shared single-cell profiling of 42. Dekker, J., Marti-Renom, M. A. & Mirny, L. A. Exploring the RNA and chromatin. Cell 183, 1103–1116 (2020). three-dimensional organization of genomes: Interpreting chromatin 10. Clark, S. J. et al. scNMT-seq enables joint profiling of chromatin accessibility interaction data. Nat. Rev. Genet. 14, 390–403 (2013). DNA methylation and transcription in single cells. Nat. Commun. 9, 43. Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from 781 (2018). single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018). 11. Wang, Y. et al. Single-cell multiomics sequencing reveals the functional 44. Javierre, B. M. et al. Lineage-specific genome architecture links enhancers regulatory landscape of early embryos. Nat. Commun. 12, 1247 (2021). and non-coding disease variants to target gene promoters. Cell 167, 12. Lake, B. B. et al. Integrative single-cell analysis of transcriptional and 1369–1384 (2016). epigenetic states in the human adult brain. Nat. Biotechnol. 36, 70–80 (2018). 45. Aguet, F. et al. Genetic effects on gene expression across human tissues. 13. Bravo Gonzalez-Blas, C. et al. Identification of genomic enhancers through Nature 550, 204–213 (2017). spatial integration of single-cell transcriptomics and epigenomics. Mol. Syst. 46. Han, H. et al. TRRUST v2: an expanded reference database of human and Biol. 16, e9438 (2020). mouse transcriptional regulatory interactions. Nucleic Acids Res. 46, 14. Argelaguet, R., Cuomo, A. S. E., Stegle, O. & Marioni, J. C. Computational principles and challenges in single-cell data integration. Nat. Biotechnol. 39, D380–D386 (2018). 47. o Th msen, E. R. et al. Fixed single-cell transcriptomic characterization of 1202–1215 (2021). human radial glial diversity. Nat. Methods 13, 87–93 (2016). 15. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019). 48. Pollen, A. A. et al. 
Molecular identity of human outer radial glia during cortical development. Cell 163, 55–67 (2015). 16. Gao, C. et al. Iterative single-cell multi-omic integration using online 49. Fischer, D. S. et al. Sfaira accelerates data and model reuse in single cell learning. Nat. Biotechnol. 39, 1000–1007 (2021). 17. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts genomics. Genome Biol. 22, 248 (2021). 50. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for features of brain cell identity. Cell 177, 1873–1887 (2019). single-cell RNA sequencing data. Genome Biol. 21, 12 (2020). 18. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019). 51. Stark, S. G. et al. SCIM: universal single-cell matching with unpaired feature 19. Chen, H. et al. Assessment of computational methods for the analysis of sets. Bioinformatics 36, i919–i927 (2020). 52. Yang, K. D. et al. Multi-domain translation between single-cell imaging and single-cell ATAC-seq data. Genome Biol. 20, 241 (2019). 20. Duren, Z. et al. Integrative analysis of single-cell genomics data by sequencing data using autoencoders. Nat. Commun. 12, 31 (2021). coupled nonnegative matrix factorizations. Proc. Natl. Acad. Sci. USA 115, 53. Eng, C.-H. L. et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqfish. Nature 568, 235–239 (2019). 7723–7728 (2018). 21. Zeng, W. et al. DC3 is a method for deconvolution and coupled clustering 54. Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring from bulk and single-cell genomics data. Nat. Commun. 10, 4613 (2019). genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019). 22. Demetci, P., Santorella, R., Sandstede, B., Noble, W. S. & Singh, R. SCOT: Single-Cell Multi-Omics Alignment with Optimal Transport. J. Comput. Biol. 55. Ly, L.-H. & Vingron, M. 
Effect of imputation on gene network reconstruction 29, 3–18 (2022). from single-cell RNA-seq data. Patterns 3, 100414 (2021). 23. Cao, K., Bai, X., Hong, Y. & Wan, L. Unsupervised topological alignment for 56. Bandura, D. R. et al. Mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight single-cell multi-omics integration. Bioinformatics 36, i48–i56 (2020). 24. Cao, K., Hong, Y. & Wan, L. Manifold alignment for heterogeneous mass spectrometry. Anal. Chem. 81, 6813–6822 (2009). single-cell multi-omics data integration using pamona. Bioinformatics 38, 57. Bartosovic, M., Kabbe, M. & Castelo-Branco, G. Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues. 211–219 (2021). 25. Singh, R. et al. Unsupervised manifold alignment for single-cell multi-omics Nat. Biotechnol. 39, 825–835 (2021). data. In Proc. 11th ACM International Conference on Bioinformatics, 58. Ashuach, T., Reidenbach, D. A., Gayoso, A. & Yosef, N. PeakVI: A deep generative model for single-cell chromatin accessibility analysis. Cell Reports Computational Biology and Health Informatics (eds. Aluru, S., Kalyanaraman, A. & Wang, M. D.) a40 (Association for Computing Machinery, 2020). Methods 2, 100182 (2022). 26. Svensson, V., Vento-Tormo, R. & Teichmann, S. A. Exponential scaling of 59. Hamilton, W., et al. in Advances in Neural Information Processing Systems single-cell RNA-seq in the past decade. Nat. Protoc. 13, 599–604 (2018). (eds. Guyon, I. et al.) 1024–1034 (Curran Associates, Inc., 2017). 27. Kozareva, V. et al. A transcriptomic atlas of mouse cerebellar cortex 60. Veličković, P. et al. Graph attention networks. In Proc. 6th International comprehensively defines cell types. Nature 598, 214–219 (2021). Conference on Learning Representations (eds. Bengio, Y. & LeCun, Y.) 28. Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, (ICLR, 2018). eaba7721 (2020). 61. 
Vashishth, S., Sanyal, S., Nitin, V. & Talukdar, P. Composition-based 29. Domcke, S. et al. A human cell atlas of fetal chromatin accessibility. Science multi-relational graph convolutional networks. In Proc. 8th International 370, eaba7612 (2020). Conference on Learning Representations (ed. Rush, A.) (ICLR, 2020). 30. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative 62. Zhang, R., Zou, Y. & Ma, J. Hyper-SAGNN: a self-attention based graph modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018). neural network for hypergraphs. In Proc. 8th International Conference on 31. Cao, Z. J., Wei, L., Lu, S., Yang, D. C. & Gao, G. Searching large-scale Learning Representations (ed. Rush, A.) (ICLR, 2020). scRNA-seq databases via unbiased cell embedding with Cell BLAST. 63. Zhang, R., Zhou, T. & Ma, J. Multiscale and integrative single-cell Hi-C Nat. Commun. 11, 3458 (2020). analysis with Higashi. Nat. Biotechnol. 40, 254–261 (2021). 32. Kipf, T. N. & Welling, M. Variational graph auto-encoders. In Neural 64. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, Information Processing Systems Workshop on Bayesian Deep Learning 257–272 (2019). (eds. Gal, Y. et al.) (Curran Associates, Inc., 2016). 65. Amodio, M. & Krishnaswamy, S. MAGAN: aligning biological manifolds. In 33. Dou, J. et al. Unbiased integration of single cell multi-omics data. Preprint at Proc. 35th International Conference on Machine Learning (eds. Dy, J. G. Dy & bioRxiv https://doi.org/10.1101/2020.12.11.422014 (2020). Krause, A.) 215–223 (PMLR, 2018). NatuRe Biote ChNoloG y | VOL 40 | OCt OBeR 2022 | 1458–1466 | www.nature.com/naturebiotechnology 1465 NATUrE BioTEcHNoLoGy Articles 66. Tarashansky, A. J. et al. Mapping single-cell atlases throughout metazoa adaptation, distribution and reproduction in any medium or format, as long as you give unravels cell type evolution. eLife 10, e66747 (2021). 
appropriate credit to the original author(s) and the source, provide a link to the Creative 67. Jung, I. et al. A compendium of promoter-centered long-range chromatin Commons license, and indicate if changes were made. The images or other third party mate- interactions in the human genome. Nat. Genet. 51, 1442–1449 (2019). rial in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in Commons license and your intended use is not permitted by statutory regulation or exceeds published maps and institutional affiliations. the permitted use, you will need to obtain permission directly from the copyright holder. Open Access This article is licensed under a Creative Commons To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. Attribution 4.0 International License, which permits use, sharing, © The Author(s) 2022 NatuRe Biote ChNoloG y | VOL 40 | OCt OBeR 2022 | 1458–1466 | www.nature.com/naturebiotechnology 1466 NATUrE BioTEcHNoLoGy Articles we first sample the edges (i, j) with probabilities proportional to the edge weights Methods and then sample vertices j′ that are not connected to i and treat them as if s = s . ij′ ij The GLUE framework. We assume that there are K different omics layers to be When maximizing the graph likelihood, the inner products between features are integrated, each with a distinct feature set V ,k = 1,2,…,K . For example, in maximized or minimized (per edge sign) based on the Bernoulli distribution. For scRNA-seq, V is the set of genes, while in scATAC-seq, V is the set of chromatin k k |V | k example, ATAC peaks located near the promoter of a gene would be encouraged to regions. 
The data spaces of different omics layers are denoted as X ⊆ R with (n) have similar embeddings to that of the gene, while DNA methylation in the gene varying dimensionalities. We use to denote cells from x ∈X ,n = 1,2,…,NK promoter would be encouraged to have a dissimilar embedding to that of the gene. (n) the kth omics layer and to denote the observed value of feature i of x ,i ∈V ki k The data likelihoods p (x |u,V;θ ) (that is, data decoders) in equation (3) are k k the kth layer in the nth cell. N is the sample size of the kth layer. Notably, the cells built on the inner product between the cell embedding u and feature embeddings from different omics layers are unpaired and can have different sample sizes. To V . Thus, analogous to the loading matrix in principal component analysis (PCA), avoid cluttering, we drop the superscript (n) when referring to an arbitrary cell. the feature embeddings V confer semantic meanings for the cell embedding space. We model the observed data from different omics layers as generated by a As V are modulated by interactions among omics features in the guidance graph, m k low-dimensional latent variable (that is, cell embedding) u ∈ R : the semantic meanings become linked. While this linearity limits decoder capacity, our empirical evaluations show that it is well compensated by the nonlinear p (x ;θ )= p (x |u;θ )p (u)du (1) k k k k encoders, producing high-quality multi-omics alignments (Fig. 2, Extended Data Figs. 1–4 and Supplementary Figs. 1–7). The exact formulation of data likelihood where p(u) is the prior distribution of the latent variable, p (x |u;θ ) are k k depends on the omics data distribution. For example, for count-based scRNA-seq learnable generative distributions (that is, data decoders) and θ denotes learnable and scATAC-seq data, we used the negative binomial (NB) distribution: parameters in the decoders. The cell latent variable u is shared across different ( ) omics layers. 
In other words, u represents the common cell states underlying all p (x |u,V;θ )= NB x ;μ ,θ k k ki i (7) omics observations, while the observed data from each layer are generated by a i∈V specific type of measurement of the underlying cell states. With the introduction of variational posteriors q (u|x ;ϕ ) (that is, data k k ( ) ( ) ( ) x θ k i Γ(x +θ ) μ i ki i θ encoders, where ϕ are learnable parameters in the encoders), model fitting can be i i NB x ;μ ,θ = (8) ki i θ +μ θ +μ Γ(θ )Γ(x +1) i i i ki i i efficiently performed by maximizing the following evidence lower bounds: L (ϕ ,θ )= E E logp (x |u;θ ) ( ) X k k x ∼p (x ) u∼q(u|x ;ϕ ) k k ∑ k k data k k k μ = Softmax α ⊙V u + β · x (2) i k kj i (9) j∈V −KL (q (u|x ;ϕ ) ∥ p (u))] k k k |V | where μ,θ ∈ R are the mean and dispersion of the negative binomial Since different autoencoders are independently parameterized and trained on |V | k |V | separate data, the cell embeddings learned for different omics layers could have distribution, respectively, α ∈ R ,β ∈ R are scaling and bias factors, inconsistent semantic meanings unless they are linked properly. ⊙ is the Hadamard product, Softmax represents the ith dimension of the To link the autoencoders, we propose a guidance graph G =(V, E), which softmax output and x gives the total count in the cell. Taking softmax kj j∈V incorporates prior knowledge about the regulatory interactions among features ∪ and then multiplying by total count ensures that the library size of reconstructed at distinct omics layers, where V = V is the universal feature set and k=1 data matches the original . The set of learnable parameters is θ = {θ,α,β}. E = {(i,j) |i,j ∈ V} is the set of edges. Each edge is also associated with signs Analogously, many other distributions can also be supported, as long as we can and weights, which are denoted as s and w , respectively. We require that w ∈ parameterize the means of the distributions by feature-cell inner products. 
ij ij ij (0,1], which can be interpreted as interaction credibility, and that s ∈ {−1,1}, ij For efficient inference and optimization, we introduce the following factorized which specifies the sign of the regulatory interaction. For example, an ATAC peak variational posterior: located near the promoter of a gene is usually assumed to positively regulate its q (u,V|x , G;ϕ ,ϕ )= q (u|x ;ϕ ) ·q (V|G;ϕ ) (10) k k G k k G expression, so they can be connected with a positive edge (s = 1). Meanwhile, ij DNA methylation in the gene promoter is usually assumed to suppress expression, The graph variational posterior q (V|G;ϕ ) (that is, graph encoder) is so they can be connected with a negative edge (s = 1). In addition to the ij modeled as diagonal-covariance normal distributions parameterized by a graph connections between features, self-loops are also added for numerical stability, convolutional network : with s = 1,w = 1, ∀i ∈V . The guidance graph is allowed to be a multi-graph, ii ii where more than one edge can exist between the same pair of vertices, representing q (V|G;ϕ )= q (v |G;ϕ ) G i G (11) different types of prior regulatory evidence. i∈V We treat the guidance graph as observed variable and model it as generated by low-dimensional feature latent variables (that is, feature embeddings) ( ) v ∈ R ,i ∈V . Furthermore, differing from the previous model, we now model i q (v |G;ϕ )= N v ;GCN (G;ϕ ),GCN 2 (G;ϕ ) (12) i G i μ G σ G x as generated by the combination of feature latent variables v ∈ R ,i ∈V k i k and the cell latent variable u ∈ R . For convenience, we introduce the notation where ϕ represents the learnable parameters in the graph convolutional network m×|V| V ∈ R , which combines all feature embeddings into a single matrix. The (GCN) encoder. 
model likelihood can thus be written as: The variational data posteriors q (u|x ;ϕ ) (that is, data encoders) are k k modeled as diagonal-covariance normal distributions parameterized by multilayer p (x , G;θ ,θ )= p (x |u,V;θ )p (G|V;θ )p (u)p (V)dudV (3) G G k k k k perceptron (MLP) neural networks: ( ) where p (x |u,V;θ ) and p (G|V;θ ) are learnable generative distributions for the k k G q (u|x ,V ;ϕ )= N u;MLP (x ;ϕ ),MLP 2 (x ;ϕ ) (13) k k k k,μ k k k k k,σ omics data (that is, data decoders) and knowledge graph (that is, graph decoder), respectively. θ and θ are learnable parameters in the decoders. p(u) and p(V) where ϕ is the set of learnable parameters in the multilayer perceptron encoder of are the prior distributions of the cell latent variable and feature latent variables, the kth omics layer. respectively, which are fixed as standard normal distributions for simplicity: Model fitting can then be performed by maximizing the following evidence lower bound: p (u)= N (u;0,I ) (4)   E logp (x |u,V;θ )p (G|V;θ ) ∏ K k k ∑ u∼q(u|x ;ϕ ),V∼q(V|G;ϕ ) k k G p (v )= N (v ;0,I ),p (V)= p (v )   i i m i (5) x ∼p (x ) k data k i∈V k=1 −KL (q (u|x ;ϕ )q (V|G;ϕ ) ∥ p (u)p (V)) k k G (14) although alternatives may also be used . For convenience, we also introduce the m×|V | notation , which contains only feature embeddings in the kth omics V ∈ R which can be further rearranged into the following form: layer, and u , which emphasizes that the cell embedding is from a cell in the kth omics layer. K ·L (θ ,ϕ )+ L (θ ,ϕ ,ϕ ) (15) The graph likelihood p (G|V;θ ) (that is, graph decoder) is defined as: G G G X k k G G k k=1 logp (G|V;θ )= E i,j∼p i,j;w ( ij) where we have (6) [ ( ) ( ( ))] ⊤ ⊤ logσ s v v + E log 1 − σ s v v ij j j′∼p (j′|i) ij j′ i ns i L (θ ,ϕ ,ϕ )= E X k k G k xk∼pdata(xk) [ ] (16) where σ is the sigmoid function and p is a negative sampling distribution . 
ns E logp (x |u,V;θ ) −KL (q (u|x ;ϕ ) ∥ p (u)) k k k k u∼q(u|x ;ϕ ),V∼q(V|G;ϕ ) k k G Here the graph likelihood has no trainable parameters, so θ = ∅. In other words, NatuRe Biote ChNoloG y | www.nature.com/naturebiotechnology NATUrE BioTEcHNoLoGy Articles L (θ ,ϕ )= E logp (G|V;θ ) −KL (q (V|G;ϕ ) ∥ p (V)) G G G G G V∼q V|G;ϕ ( ) normalize by cluster size, which effectively balances the contribution of matching (17) clusters regardless of their sizes. In the second stage, we fine-tune the GLUE model with the estimated balancing weights, during which the additive noise Below, for convenience, we denote the union of all encoder parameters ϵ ∼N (ϵ;0,τ · Σ) gradually anneals to 0 (with τ starting at 1 and decreasing ( ) as ϕ = ϕ ∪ϕ and the union of all decoder parameters as k G linearly per epoch until 0). The number of annealing epochs was set automatically k=1 (∪ ) based on the data size and learning rate to match a learning progress equivalent to θ = θ ∪ θ k G k=1 4,000 iterations at a learning rate of 0.002. To ensure the proper alignment of different omics layers, we use the adversarial All benchmarks and case studies in the study were conducted with the 31,71 alignment strategy . A discriminator D with a K-dimensional softmax output is two-stage training procedure as described above, regardless of whether the dataset introduced, which predicts the omics layers of cells based on their embeddings u. being used is balanced or not. The discriminator D is trained by minimizing the multiclass classification cross entropy: Batch effect correction. To handle batch effect within omics layers, we incorporate batch as a covariate of the data decoders. Assuming b ∈{1,2,…,B}, is the batch L (ϕ,ψ)= − E E logD (u;ψ) (18) D x ∼p (x ) u∼q(u|x ;ϕ ) k K k data k k k index, where B is the total number of batches, the decoder likelihood is extended to k=1 p (x |u,V,b;θ ). 
Specifically, this is achieved by converting learnable parameters k k in the data decoder to be batch-dependent. For example, in the case of a negative where D represents the kth dimension of the discriminator output and ψ is the binomial decoder, the network now uses batch-specific α, β and θ parameters: set of learnable parameters in the discriminator. The data encoders can then be ∏ ( ) trained in the opposite direction to fool the discriminator, ultimately leading to the p (x |u,V,b;θ )= NB x ;μ ,θ 72 k k k b i i i alignment of cell embeddings from different omics layers . (25) i∈V The overall training objective of GLUE thus consists of: minλ ·L (ϕ,ψ) D D ( ) ( ) (19) ( ) x θ Γ x +θ k b ψ ( k b ) μ i θ i i i b i i NB x ;μ ,θ = (26) k b i i i θ +μ θ +μ Γ θ Γ x +1 b b ( b ) ( k ) i i i i i i maxλ ·L (ϕ,ψ)+ λ K ·L (θ ,ϕ )+ L (θ ,ϕ ,ϕ ) (20) ( ) ∑ D D G G G G X k k G θ,ϕ μ = Softmax α ⊙V u + β · x k=1 i b k i k b j (27) j∈V The two hyperparameters λ and λ control the contributions of adversarial B×|V | B×|V | B×|V | k k k where α ∈ R ,β ∈ R ,θ ∈ R , and α , β , θ are the bth row of α, alignment and graph-based feature embedding, respectively. We use stochastic + + b b b gradient descent to train the GLUE model. Each stochastic gradient descent β, θ. Other probabilistic decoders can also be extended in similar ways. iteration is divided into two steps. In the first step, the discriminator is updated according to objective equation (19). In the second step, the data and graph Implementation details. We applied linear dimensionality reduction using autoencoders are updated according to equation (20). The RMSprop optimizer canonical methods such as PCA (for scRNA-seq) or LSI (latent semantic indexing, with no momentum term is used to ensure the stability of adversarial training. for scATAC-seq) as the first transformation layers of the data encoders (note that the decoders were still fitted in the original feature spaces). 
This effectively reduced model size and enabled a modular input, so advanced dimensionality reduction or batch effect correction methods can also be used instead as preprocessing steps for GLUE integration.

During model training, 10% of the cells were used as the validation set. In the final stage of training, the learning rate would be reduced by factors of 10 if the validation loss did not improve for consecutive epochs. Training would be terminated if the validation loss still did not improve for consecutive epochs. The patience for learning rate reduction, training termination and the maximal number of training epochs were automatically set based on the data size and learning rate to match a learning progress equivalent to 1,000, 2,000 and 16,000 iterations at a learning rate of 0.002, respectively.

For all benchmarks and case studies with GLUE, we used the default hyperparameters unless explicitly stated. The set of default hyperparameters is presented in Extended Data Fig. 3.

Weighted adversarial alignment. As shown in previous work, canonical adversarial alignment amounts to minimizing a generalized form of Jensen–Shannon divergence among the cell embedding distributions of different omics layers:

$$\frac{1}{K}\sum_{k=1}^{K}\mathrm{KL}\left(q_k(u)\,\Big\|\,\frac{1}{K}\sum_{k'=1}^{K}q_{k'}(u)\right) \tag{21}$$

where $q_k(u)=\mathbb{E}_{x_k\sim p_{\mathrm{data}}(x_k)}\,q(u|x_k;\phi_k)$ represents the marginal cell embedding distribution of the kth layer. Without other loss terms, equation (21) converges at perfect alignment, that is, when $q_i(u)=q_j(u),\ \forall i\neq j$. This can be problematic when cell type compositions differ dramatically across different layers, for example, in the cell atlas integration. To address this issue, we added cell-specific weights $w^{(n)}$ to the discriminator loss in equation (18):

$$\mathcal{L}_D(\phi,\psi)=-\frac{1}{K}\sum_{k=1}^{K}\frac{1}{W_k}\sum_{n=1}^{N_k}w^{(n)}\cdot\mathbb{E}_{u\sim q(u|x_k^{(n)};\phi_k)}\log D_k(u;\psi) \tag{22}$$

where the normalizer $W_k=\sum_{n=1}^{N_k}w^{(n)}$. The adversarial alignment still amounts to minimizing equation (21) but with weighted marginal cell embedding distributions $q_k(u)=\frac{1}{W_k}\sum_{n=1}^{N_k}w^{(n)}q(u|x_k^{(n)};\phi_k)$. By assigning appropriate weights to balance the cell distributions across different layers, the optimum of $q_i(u)=q_j(u),\ \forall i\neq j$ could be much closer to the desired alignment.

To obtain the balancing weights in an unsupervised manner, we devised the following two-stage training procedure. First, we pretrain the GLUE model with constant weight $w^{(n)}=1$, during which noise $\epsilon\sim\mathcal{N}(\epsilon;0,\Sigma)$ was added to the cell embeddings before passing to the discriminator. We set $\Sigma$ to be 1.5× the empirical variance of cell embeddings in each minibatch, which helps produce a coarse alignment immune to composition imbalance. Then, we cluster the coarsely aligned cell embeddings per omics layer using Leiden clustering. The balancing weight $w_i$ for cells in cluster i is computed as:

$$w_i=\frac{\sum_{j:\,k_j\neq k_i}f(u_i,u_j)}{n_i} \tag{23}$$

$$f(u_i,u_j)=\begin{cases}\cos(u_i,u_j)^4, & \cos(u_i,u_j)>0.5\\ 0, & \text{otherwise}\end{cases} \tag{24}$$

where $u_i$ is the average cell embedding of cluster i, $k_i$ denotes the omics layer of cluster i, and $n_i$ is the number of cells in cluster i. In other words, we sum up the cosine similarities (raised to the power of 4 to increase contrast) between cluster i and all its matching clusters in other layers with cosine similarity >0.5, and then normalize by the cluster size $n_i$.

Integration consistency score. The integration consistency score is a measure of consistency between the integrated multi-omics data and the guidance graph. First, we jointly cluster cells from all omics layers in the aligned cell embedding space using k-means. For each omics layer, the cells in each cluster are aggregated into a metacell. The metacells are established as paired samples, based on which feature correlation can be computed. Using the paired metacells, we then compute the Spearman’s correlation for each edge in the guidance graph. The integration consistency score is defined as the average correlation across all graph edges, negated per edge sign and weighted by edge weight.

Systematic benchmarks. UnionCom (ref. 23), Pamona (ref. 24) and GLUE were executed using the Python packages ‘unioncom’ (v.0.3.0), ‘Pamona’ (v.0.1.0) and ‘scglue’ (v.0.2.0), respectively. MMD-MA was executed using the Python script provided at https://bitbucket.org/noblelab/2020_mmdma_pytorch. Online iNMF (ref. 16), LIGER (ref. 17), Harmony (ref. 18), bindSC (ref. 33) and Seurat v3 (ref. 15) were executed using the R packages ‘rliger’ (v.1.0.0), ‘rliger’ (v.1.0.0), ‘harmony’ (v.0.1.0), ‘bindSC’ (v.1.0.0) and ‘Seurat’ (v.4.0.2), respectively. For each method, we used the default hyperparameter settings and data preprocessing steps as recommended. For the scRNA-seq data, 2,000 highly variable genes were selected using the Seurat ‘vst’ method. We used two separate schemes to construct the guidance graph. In the standard scheme, we connected ATAC peaks with RNA genes via positive edges if they overlapped in either the gene body or proximal promoter regions (defined as 2 kb upstream from the TSS). In an alternative scheme involving larger genomic windows, we connected ATAC peaks with RNA genes via positive edges if the peaks were within 150 kb of the proximal gene promoters; the edges were weighted by a power-law function $w=(d+1)^{-0.75}$ (d is the genomic distance in kb), which has been proposed to model the probability of chromatin contact (refs. 42,43). For the methods that require feature conversion (online iNMF, LIGER, bindSC and Seurat v.3), we converted the scATAC-seq data to gene-level activity scores by summing up counts in the ATAC peaks connected to specific genes in the guidance graph. Notably, online iNMF and LIGER also recommend an alternative way of ATAC feature conversion, that is, directly counting ATAC fragments falling in gene body and promoter regions without resorting to ATAC peaks (https://htmlpreview.github.io/?https://github.com/welch-lab/liger/blob/master/vignettes/Integrating_scRNA_and_scATAC_data.html), which we abbreviate to FiG (fragments in genes). We also tested the FiG feature conversion method with online iNMF and LIGER whenever applicable.

Mean average precision (MAP) was used to evaluate the cell type resolution. Supposing that the cell type of the ith cell is $y^{(i)}$ and that the cell types of its K ordered nearest neighbors are $y_1^{(i)},y_2^{(i)},\ldots,y_K^{(i)}$, the mean average precision is then defined as follows:

$$\mathrm{MAP}=\frac{1}{N}\sum_{i=1}^{N}\mathrm{AP}^{(i)} \tag{28}$$

$$\mathrm{AP}^{(i)}=\begin{cases}\dfrac{\sum_{k=1}^{K}\mathbf{1}_{y^{(i)}=y_k^{(i)}}\cdot\dfrac{\sum_{j=1}^{k}\mathbf{1}_{y^{(i)}=y_j^{(i)}}}{k}}{\sum_{k=1}^{K}\mathbf{1}_{y^{(i)}=y_k^{(i)}}}, & \text{if }\sum_{k=1}^{K}\mathbf{1}_{y^{(i)}=y_k^{(i)}}>0\\ 0, & \text{otherwise}\end{cases} \tag{29}$$

where $\mathbf{1}_{y^{(i)}=y_k^{(i)}}$ is an indicator function that equals 1 if $y^{(i)}=y_k^{(i)}$ and 0 otherwise. For each cell, average precision (AP) computes the average cell type precision up to each cell type-matched neighbor, and mean average precision is the average of the average precision across all cells. We set K to 1% of the total number of cells in each dataset. Mean average precision has a range of 0 to 1, and higher values indicate better cell type resolution.

Cell type ASW (average silhouette width) was also used to evaluate the cell type resolution, which was defined as in a recent benchmark study:

$$\text{cell type ASW}=\frac{1}{2}\left(\frac{\sum_{i=1}^{N}s^{(i)}_{\text{cell type}}}{N}+1\right) \tag{30}$$

where $s^{(i)}_{\text{cell type}}$ is the cell type silhouette width for the ith cell, and N is the total number of cells. Cell type ASW has a range of 0 to 1, and higher values indicate better cell type resolution.

Neighbor consistency (NC) was used to evaluate the preservation of single-omics data variation after multi-omics integration and was defined following a previous study:

$$\mathrm{NC}=\frac{1}{N}\sum_{i=1}^{N}\frac{\left|\mathrm{NNS}^{(i)}\cap\mathrm{NNI}^{(i)}\right|}{\left|\mathrm{NNS}^{(i)}\cup\mathrm{NNI}^{(i)}\right|} \tag{31}$$

where $\mathrm{NNS}^{(i)}$ is the set of K-nearest neighbors for the ith cell in the single-omics data, $\mathrm{NNI}^{(i)}$ is the set of K-nearest neighbors for the ith cell in the integrated space, and N is the total number of cells. We set K to 1% of the total number of cells in each dataset. Neighbor consistency has a range of 0 to 1, and higher values indicate better preservation of data variation.

Biology conservation. Mean average precision, cell type ASW and neighbor consistency all measure biology conservation of the data integration. Following the procedure from the recent benchmark study, we first conduct min-max scaling for each of the metrics and then compute the average across the three to summarize them into a single metric representing biology conservation:

$$\text{biology conservation}=\frac{\mathrm{scale}(\mathrm{MAP})+\mathrm{scale}(\text{cell type ASW})+\mathrm{scale}(\mathrm{NC})}{3} \tag{32}$$

Seurat alignment score (SAS) was used to evaluate the extent of mixing among omics layers and was computed as described in the original paper:

$$\mathrm{SAS}=1-\frac{\bar{x}-\frac{K}{N}}{K-\frac{K}{N}} \tag{33}$$

where $\bar{x}$ is the average number of cells from the same omics layer among the K-nearest neighbors (different layers were first subsampled to the same number of cells as the smallest layer), and N is the number of omics layers. We set K to 1% of the subsampled cell number. Seurat alignment score has a range of 0 to 1, and higher values indicate better mixing.

Omics layer ASW was also used to evaluate the extent of mixing among omics layers and was defined as in a recent benchmark study:

$$\text{omics layer ASW}=\frac{1}{M}\sum_{j=1}^{M}\text{omics layer ASW}_j \tag{34}$$

$$\text{omics layer ASW}_j=\frac{1}{N_j}\sum_{i=1}^{N_j}\left(1-\left|s^{(i)}_{\text{omics layer}}\right|\right) \tag{35}$$

where $s^{(i)}_{\text{omics layer}}$ is the omics layer silhouette width for the ith cell, $N_j$ is the number of cells in cell type j, and M is the total number of cell types. Omics layer ASW has a range of 0 to 1, and higher values indicate better mixing.

Graph connectivity (GC) was also used to evaluate the extent of mixing among omics layers and was defined as in a recent benchmark study:

$$\mathrm{GC}=\frac{1}{M}\sum_{j=1}^{M}\frac{\left|\mathrm{LCC}_j\right|}{N_j} \tag{36}$$

where $|\mathrm{LCC}_j|$ is the number of cells in the largest connected component of the cell K-nearest neighbors graph (K = 15) for cell type j, $N_j$ is the number of cells in cell type j and M is the total number of cell types. Graph connectivity has a range of 0 to 1, and higher values indicate better mixing.

Omics mixing. Seurat alignment score, omics layer ASW and graph connectivity all measure omics mixing of the data integration. Following the procedure from the recent benchmark study, we first conduct min-max scaling for each of the metrics, and then compute the average across the three to summarize them into a single metric representing omics mixing:

$$\text{omics mixing}=\frac{\mathrm{scale}(\mathrm{SAS})+\mathrm{scale}(\text{omics layer ASW})+\mathrm{scale}(\mathrm{GC})}{3} \tag{37}$$

Overall integration score. To compute an overall integration score, we use a 6:4 weight between biology conservation and omics mixing, following the recent benchmark study (ref. 73):

$$\text{overall integration score}=0.6\times\text{biology conservation}+0.4\times\text{omics mixing} \tag{38}$$

FOSCTTM was used to evaluate the single-cell level alignment accuracy. It was computed on two datasets with known cell-to-cell pairings. Suppose that each dataset contains N cells, and that the cells are sorted in the same order, that is, the ith cell in the first dataset is paired with the ith cell in the second dataset. Denote x and y as the cell embeddings of the first and second dataset, respectively. The FOSCTTM is then defined as:

$$\mathrm{FOSCTTM}=\frac{1}{2N}\left(\sum_{i=1}^{N}\frac{n_1^{(i)}}{N}+\sum_{i=1}^{N}\frac{n_2^{(i)}}{N}\right) \tag{39}$$

$$n_1^{(i)}=\left|\left\{j\,\middle|\,d(x_j,y_i)<d(x_i,y_i)\right\}\right| \tag{40}$$

$$n_2^{(i)}=\left|\left\{j\,\middle|\,d(x_i,y_j)<d(x_i,y_i)\right\}\right| \tag{41}$$

where $n_1^{(i)}$ and $n_2^{(i)}$ are the number of cells in the first and second dataset, respectively, that are closer to the ith cell than their true matches in the opposite dataset. d is the Euclidean distance. FOSCTTM has a range of 0 to 1, and lower values indicate higher accuracy.

Feature consistency was used to evaluate the consistency of feature embeddings from different models. Since the raw embedding spaces are not directly comparable across models, we defined the consistency as the cross-modal conservation of cosine similarities among features in the same model. Specifically, we first randomly subsample 2,000 features and compute the pairwise cosine similarity among them using feature embeddings from the two compared models. The feature consistency score is then defined as the Pearson’s correlation between the cosine similarities of two models, averaging across four random subsamples. Feature consistency has a range of −1 to 1, and higher values indicate higher consistency.

For the baseline benchmark, each method was run eight times with different random seeds, except for Harmony and bindSC that have deterministic implementations and were run only once. For the guidance corruption benchmark, we removed the specified proportions of existing peak–gene interactions and added equal numbers of nonexistent interactions, so the total number of interactions remained unchanged. Of note, feature conversion was also repeated using the corrupted guidance graphs. The corruption procedure was repeated eight times with different random seeds. For the subsampling benchmark, the scRNA-seq and scATAC-seq cells were subsampled in pairs (so FOSCTTM could still be computed). The subsampling process was also repeated eight times with different random seeds.

For the systematic scalability test (Supplementary Fig. 17a), all methods were run on a Linux workstation with 40 CPU cores (two Intel Xeon Silver 4210 chips), 250 GB of RAM and NVIDIA GeForce RTX 2080 Ti graphical processing units. Only a single graphical processing unit card was used when training GLUE.

Triple-omics integration. The scRNA-seq and scATAC-seq data were handled as previously described (section Systematic benchmarks). Due to low coverage per single-C site, the snmC-seq data were converted to average methylation levels in gene bodies. The mCH and mCG levels were quantified separately, resulting in two features per gene. The gene methylation levels were normalized by the global methylation level per cell. An initial dimensionality reduction was performed using PCA (section Implementation details). For the triple-omics guidance graph, the mCH and mCG levels were connected to the corresponding genes with negative edges.

The normalized methylation levels were positive, with dropouts corresponding to the genes that were not covered in single cells. As such, we used the zero-inflated log-normal (ZILN) distribution for the data decoder:

$$p(x_k|u,V;\theta_k)=\prod_{i\in\mathcal{V}_k}\mathrm{ZILN}(x_{ki};\mu_i,\sigma_i,\delta_i) \tag{42}$$

$$\mathrm{ZILN}(x_{ki};\mu_i,\sigma_i,\delta_i)=\begin{cases}\dfrac{1-\delta_i}{x_{ki}\sigma_i\sqrt{2\pi}}\exp\left(-\dfrac{(\log x_{ki}-\mu_i)^2}{2\sigma_i^2}\right), & x_{ki}>0\\ \delta_i, & x_{ki}=0\end{cases} \tag{43}$$

$$\mu=\alpha\odot V_k^{\top}u+\beta \tag{44}$$

where $\mu\in\mathbb{R}^{|\mathcal{V}_k|}$, $\sigma\in\mathbb{R}_{+}^{|\mathcal{V}_k|}$, $\delta\in(0,1)^{|\mathcal{V}_k|}$ are the log-scale mean, log-scale standard deviation and zero-inflation parameters of the zero-inflated log-normal distribution, respectively, and $\alpha\in\mathbb{R}^{|\mathcal{V}_k|}$, $\beta\in\mathbb{R}^{|\mathcal{V}_k|}$ are scaling and bias factors.

To unify the cell type labels, we performed a nearest neighbor-based label transfer with the snmC-seq dataset as a reference. The five nearest neighbors in snmC-seq were identified for each scRNA-seq and scATAC-seq cell in the aligned embedding space, and majority voting was used to determine the transferred label. To verify whether the alignment was correct, we tested for significant overlap in cell type marker genes. The features of all omics layers were first converted to genes. Then, for each omics layer, the cell type markers were identified using the one-versus-rest Wilcoxon rank-sum test with the following criteria: FDR < 0.05 and log fold change >0 for scRNA-seq/scATAC-seq; FDR < 0.05 and log fold change <0 for snmC-seq. The significance of marker overlap was determined by the three-way Fisher’s exact test (ref. 40).

To perform correlation and regression analysis after the integration, we clustered all cells from the three omics layers using fine-scale k-means (k = 200). Then, for each omics layer, the cells in each cluster were aggregated into a metacell by summing their expression/accessibility counts or averaging their DNA methylation levels. The metacells were established as paired samples, based on which feature correlation and regression analyses could be conducted.

To integrate the same datasets using online iNMF, we inverted the snmC-seq data by subtracting the data matrix from its largest entry, following the procedure described in the original paper (ref. 16).

GLUE-based cis-regulatory inference. To ensure consistency of cell types, we first selected the overlapping cell types between the 10X Multiome and pcHi-C data. The remaining cell types included T cells, B cells and monocytes. The eQTL data were used as is, because they were not cell type-specific. For scRNA-seq, we selected 6,000 highly variable genes. To capture remote cis-regulatory interactions, the base guidance graph was constructed for peak–gene pairs within a distance of 150 kb, using the alternative scheme as described in the section Systematic benchmarks.

To incorporate the regulatory evidence of pcHi-C and eQTL, we anchored all evidence to that between the ATAC peaks and RNA genes. A peak–gene pair was considered supported by pcHi-C if (1) the gene promoter was within 1 kb of a bait fragment, (2) the peak was within 1 kb of an other-end fragment and (3) significant contact was identified between the bait and the other-end fragment in pcHi-C. The pcHi-C-supported peak–gene interactions were weighted by multiplying the promoter-to-bait and the peak-to-other-end power-law weights (above). If a peak–gene pair was supported by multiple pcHi-C contacts, the weights were summed and clipped to a maximum of 1. A peak–gene pair was considered supported by eQTL if (1) the peak overlapped an eQTL locus and (2) the locus was associated with the expression of the gene. The eQTL-supported peak–gene interactions were assigned weights of 1. The composite guidance graph was constructed by adding the pcHi-C- and eQTL-supported interactions to the previous distance-based interactions, allowing for multi-edges.

For regulatory inference, only peak–gene pairs within 150 kb in distance were considered. The GLUE training process was repeated four times with different random seeds. For each repeat, the peak–gene regulatory score was computed as the cosine similarity between the feature embeddings. The final regulatory inference was obtained by averaging the regulatory scores across the four repeats. To evaluate the significance of the regulatory scores, we compared the scores to a NULL distribution obtained via randomly shuffled feature embeddings and computed empirical P values as the probability of getting more extreme scores in the NULL distribution. Finally, we computed the FDR of regulatory inference based on the P values using the Benjamini–Hochberg procedure. For cis-regulatory inference using LASSO, we used hyperparameter α = 0.01, which was optimized for area under the receiver operating characteristic curves of pcHi-C and eQTL prediction.

TF-target gene regulatory inference. We used the SCENIC workflow to construct a TF-gene regulatory network from the inferred peak–gene regulatory interactions. Briefly, the SCENIC workflow first constructs a gene coexpression network based on the scRNA-seq data, and then uses external cis-regulatory evidence to filter out false positives. SCENIC accepts cis-regulatory evidence in the form of gene rankings per TF, that is, genes with higher TF enrichment levels in their regulatory regions are ranked higher. To construct the rankings based on our inferred peak–gene interactions, we first overlapped the ENCODE TF chromatin immunoprecipitation (ChIP) peaks with the ATAC peaks and counted the number of ChIP peaks for each TF in each ATAC peak. Since different genes can have different numbers of connected ATAC peaks, and the ATAC peaks vary in length (longer peaks can contain more ChIP peaks by chance), we devised a sampling-based approach to evaluate TF enrichment. Specifically, for each gene, we randomly sampled 1,000 sets of ATAC peaks that matched the connected ATAC peaks in both number and length distribution. We counted the numbers of TF ChIP peaks in these random ATAC peaks as null distributions. For each TF in each gene, an empirical P value could then be computed by comparing the observed number of ChIP peaks to the null distribution. Finally, we ranked the genes by the empirical P values for each TF, producing the cis-regulatory rankings used by SCENIC. Since peak–gene-based inference is mainly focused on remote regulatory regions, proximal promoters could be missed. As such, we provided SCENIC with both the above peak-based and proximal promoter-based cis-regulatory rankings.

Integration for the human multi-omics atlas. The scRNA-seq and scATAC-seq atlases have highly unbalanced cell type compositions, which are primarily caused by differences in organ sampling sizes (Supplementary Fig. 17b). Although cell types are unknown during real-world analyses, organ sources are typically available and can be used to help balance the integration process. To perform organ-balanced data preprocessing, we first subsampled each omics layer to match the organ compositions. For the scRNA-seq data, 4,000 highly variable genes were selected using the organ-balanced subsample. Then, for the initial dimensionality reduction, we fitted PCA (scRNA-seq) and LSI (scATAC-seq) on the organ-balanced subsample and applied the projection to the full data. The PCA/LSI coordinates were used as the first transformation layer in the GLUE data encoders (section Implementation details), as well as for metacell aggregation (below). The guidance graph was constructed as described previously (section Systematic benchmarks).

The two atlases consist of large numbers of cells but with low coverage per cell. To alleviate dropout and increase the training speed simultaneously, we used a metacell aggregation strategy during pretraining. Specifically, in the pretraining stage, we clustered the cells in each omics layer using fine-scaled k-means (k = 100,000 for scRNA-seq and k = 40,000 for scATAC-seq). To balance the organ compositions at the same time, k-means centroids were fitted on the previous organ-balanced subsample and then applied to the full data. The cells in each k-means cluster were aggregated into a metacell by summing their expression/accessibility counts and averaging their PCA/LSI coordinates. GLUE was then pretrained on the aggregated metacells with additive noise, which roughly oriented the cell embeddings but did not actually align them (section Weighted adversarial alignment). To better use the large data size, the hidden layer dimensionality was doubled to 512 from the default 256. In the second stage, GLUE was fine-tuned on the full single-cell data with the balancing weight estimated as described in the section Weighted adversarial alignment. No metacell aggregation was used when comparing the scalability of different methods (Supplementary Fig. 17a).

For a comparison with other integration methods, we also tried online iNMF and Seurat v.3. Online iNMF was the only other method that could scale to millions of cells, so we applied it to the full dataset. On the other hand, Seurat v.3 showed the second-best accuracy in our previous benchmark. Because Seurat v.3 could not scale to the full dataset (Supplementary Fig. 17a), we applied it to the aggregated data used in the first stage of GLUE training. Label transfer was performed using the same procedure as in the triple-omics case, except that we used majority voting in 50 nearest neighbors.

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All datasets used in this study are already published and were obtained from public data repositories. See Supplementary Table 1 for detailed information on single-cell omics datasets used in this study, including access codes and URLs. For regulatory inference and evaluation, the pcHi-C data were obtained from the supplementary file of the original publication (https://www.sciencedirect.com/science/article/pii/S0092867416313228), eQTL data from GTEx v8 (https://www.gtexportal.org/home/datasets), TF ChIP–seq data from the ENCODE data portal (https://www.encodeproject.org/) and the TRRUST v2 database from the official website (https://www.grnpedia.org/trrust/downloadnetwork.php). All benchmarking source data are available in Supplementary Data 1.

Code availability
The GLUE framework was implemented in the ‘scglue’ Python package, which is available at https://github.com/gao-lab/GLUE. For reproducibility, the scripts for all benchmarks and case studies were assembled using Snakemake (v.6.12.3), which is also available in the above repository.

References
68. Ding, J. & Regev, A. Deep generative model embedding of single-cell RNA-seq profiles on hyperspheres and hyperbolic spaces. Nat. Commun. 12, 2554 (2021).
69. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. in Advances in Neural Information Processing Systems (eds Burges, C. J. C. et al.) 3111–3119 (Curran Associates, Inc., 2013).
70. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In Proc. 5th International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2017).
71. Dincer, A. B., Janizek, J. D. & Lee, S.-I. Adversarial deconfounding autoencoder for learning robust gene expression embeddings. Bioinformatics 36, i573–i582 (2020).
72. Goodfellow, I. et al. in Advances in Neural Information Processing Systems (eds Ghahramani, Z. et al.) 2672–2680 (Curran Associates, Inc., 2014).
73. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
74. Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
75. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
76. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
77. Davis, C. A. et al. The encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).

Acknowledgements
We thank F. Tang, X.S. Xie, Z. Zhang, L. Tao, C. Li, J. Lu (at Peking University) and Y. Ding (at the Beijing Institute of Radiation Medicine) for their helpful discussions and comments during the study, as well as authors of the datasets used in this work for their kind help. This work was supported by funds from the National Key Research and Development Program (grant no. 2016YFC0901603), the State Key Laboratory of Protein and Plant Gene Research and the Beijing Advanced Innovation Center for Genomics at Peking University, as well as the Changping Laboratory. The research by G.G. was supported in part by the National Program for Support of Top-notch Young Professionals. Part of the analysis was carried out on the Computing Platform of the Center for Life Sciences of Peking University and supported by the High-performance Computing Platform of Peking University. Parts of Fig. 1 were created using an image set downloaded from Servier Medical Art (https://smart.servier.com/, CC BY 3.0).

Author contributions
G.G. conceived the study and supervised the research. Z.J.C. designed and implemented the computational framework and conducted benchmarks and case studies with guidance from G.G. Z.J.C. and G.G. wrote the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Extended data are available for this paper at https://doi.org/10.1038/s41587-022-01284-4.
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41587-022-01284-4.
Correspondence and requests for materials should be addressed to Ge Gao.
Peer review information Nature Biotechnology thanks Ricard Argelaguet, Yun Li, Romain Lopez and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at www.nature.com/reprints.

Extended Data Fig. 1 | Individual metrics for evaluating integration performance. a, Mean average precision vs. Seurat alignment score for different integration methods. Higher mean average precision indicates higher cell type resolution, and higher Seurat alignment score indicates better omics mixing. b, Cell type vs. omics layer average silhouette width for different integration methods. Higher cell type average silhouette width indicates higher cell type resolution, and higher omics layer average silhouette width indicates better omics mixing. c, Neighbor conservation vs. graph connectivity for different integration methods.
Higher neighbor conservation indicates better conservation of manifold structure in each original layer, and higher graph connectivity indicates better omics mixing. n=8 repeats with different model random seeds. The error bars indicate mean ± s.d.

Extended Data Fig. 2 | Effect of prior knowledge and data size on integration performance. a, Decrease in overall integration score at different prior knowledge corruption rates for integration methods that rely on prior feature relations (n=8 repeats with different corruption random seeds). b, Overall integration score, and c, FOSCTTM with different schemes of connecting peaks and genes as prior regulatory knowledge, for integration methods that rely on prior feature relations (n=8 repeats with different model random seeds). ‘Combined±0’ is the standard scheme where peaks overlapping gene body or promoter regions are linked. ‘Promoter±150k’ means that peaks are linked to genes if they are located within 150 kb of the gene promoter, weighted by a power-law function that models chromatin contact probability (refs. 42,43). d, Overall integration score of different integration methods on subsampled datasets of varying sizes (n=8 repeats with different subsampling random seeds). The error bars indicate mean ± s.d.

Extended Data Fig. 3 | Integration performance of GLUE under different hyperparameter settings. Integration performance is quantified by a, overall integration score, and b, FOSCTTM (n=4 repeats with different model random seeds). The error bars indicate mean ± s.d. ‘Dimensionality’ denotes the cell embedding dimensionality. ‘Preprocessing dimensionality’ is the reduced dimensionality used for the first transformation layers of the data encoders (see Methods).
‘Hidden layer depth’ is the number of hidden layers in the data encoders and modality discriminator. ‘Hidden layer dimensionality’ is the dimensionality of hidden layers in the data encoders and modality discriminator. ‘Dropout’ is the dropout rate of hidden layers in data encoders and modality discriminator. ‘Lambda graph’ is the weight of the graph loss ($\lambda_G$). ‘Lambda align’ is the weight of the adversarial alignment ($\lambda_D$). ‘Negative sampling rate’ is the number of empirical samples used in negative edge sampling (samples from $p_{ns}$). For each hyperparameter, the center value is the default. To control computational cost, one hyperparameter was varied at a time, with all others set to their default values. The performance of GLUE was robust across a wide range of hyperparameter settings, except for failed alignments in which the adversarial alignment weight was too low or no hidden layers were used in the neural networks (equivalently a linear model with insufficient capacity).

Extended Data Fig. 4 | Integration performance of GLUE with different numbers of highly variable genes. Integration performance is quantified by a, overall integration score, and b, FOSCTTM (n=8 repeats with different model random seeds). The error bars indicate mean ± s.d.

Extended Data Fig. 5 | Robustness of GLUE feature embeddings. Consistency of feature embeddings as defined by the conservation of feature-feature cosine similarity (Methods), under a, different hyperparameter settings (n=4 repeats with different model random seeds), b, different prior knowledge corruption rates (n=8 repeats with different corruption random seeds), and c, different numbers of subsampled cells (n=8 repeats with different subsampling random seeds). The error bars indicate mean ± s.d.
Feature embeddings are robust across all hyperparameters except for $\lambda_G$, which directly controls the contribution of the guidance graph. Consistency also remains high (> 0.8) with up to 40% of prior knowledge corrupted, and with a minimum of ~4,000 subsampled cells.

Extended Data Fig. 6 | Integration consistency score for detecting over-correction. Integration consistency scores with varying numbers of metacells for different dataset combinations. Same-tissue combinations represent proper correction, and different-tissue combinations represent over-correction. The dashed horizontal line indicates integration consistency score = 0.05.
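The balancing weights of equations (23) and (24) in the section Weighted adversarial alignment can be illustrated with a short NumPy sketch. This is only a sketch under stated assumptions: the function name `balancing_weights` and its arguments are hypothetical and are not taken from the ‘scglue’ package, and the cluster centroids, layer labels and cluster sizes are assumed to be precomputed (for example, from per-layer Leiden clusters).

```python
import numpy as np


def balancing_weights(centroids, layers, sizes, power=4, cutoff=0.5):
    """Cluster-level weights w_i following equations (23) and (24).

    Hypothetical helper: `centroids` holds the average cell embedding u_i of
    each cluster, `layers` the omics layer k_i, `sizes` the cell count n_i.
    """
    centroids = np.asarray(centroids, dtype=float)
    layers = np.asarray(layers)
    sizes = np.asarray(sizes, dtype=float)
    # Cosine similarity between every pair of cluster centroids
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cos = unit @ unit.T
    # f(u_i, u_j): similarity raised to `power` above the cutoff, else 0
    f = np.where(cos > cutoff, cos ** power, 0.0)
    # Only clusters from a different omics layer count as matches
    f[layers[:, None] == layers[None, :]] = 0.0
    # Sum over matching clusters, then normalize by cluster size
    return f.sum(axis=1) / sizes


# Toy example: one RNA cluster that matches one of two ATAC clusters
w = balancing_weights(
    centroids=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    layers=["rna", "atac", "atac"],
    sizes=[10, 20, 5],
)
```

In this toy example the unmatched ATAC cluster receives a weight of zero, which is the intended effect: cells without a counterpart in the other layer contribute nothing to the adversarial alignment.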
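For readers who want to reproduce the FOSCTTM metric of equations (39) to (41), a minimal NumPy sketch is given here. The function name `foscttm` is illustrative rather than an API of any package mentioned above, and the dense pairwise distance matrix is an assumption that only suits small paired datasets.

```python
import numpy as np


def foscttm(x, y):
    """FOSCTTM for paired embeddings; row i of x is paired with row i of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # d[i, j] is the Euclidean distance between x_i and y_j
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    true_match = np.diag(d)
    # n1[i]: cells x_j of the first dataset closer to y_i than its match x_i
    n1 = (d < true_match[None, :]).sum(axis=0)
    # n2[i]: cells y_j of the second dataset closer to x_i than its match y_i
    n2 = (d < true_match[:, None]).sum(axis=1)
    n = x.shape[0]
    return (n1.sum() / n + n2.sum() / n) / (2 * n)


paired = np.array([[0.0], [1.0]])
perfect = foscttm(paired, paired)        # every cell sits on its true match
swapped = foscttm(paired, paired[::-1])  # every cell sits on the wrong match
```

With only two cells, the swapped case gives the largest value the definition allows, because at most N − 1 of the N cells can be closer than the true match.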
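The significance testing described in the section GLUE-based cis-regulatory inference (empirical P values against a shuffled null, followed by Benjamini–Hochberg correction) might be sketched as follows. Both function names are hypothetical, and treating "more extreme" as "at least as large" (a one-sided test) is an assumption of this sketch, not something the text specifies.

```python
import numpy as np


def empirical_pvalues(observed, null):
    """Fraction of null scores that are at least as large as each observed score."""
    observed = np.asarray(observed, dtype=float)
    null_sorted = np.sort(np.asarray(null, dtype=float))
    # Count null scores >= each observed score via binary search
    n_ge = null_sorted.size - np.searchsorted(null_sorted, observed, side="left")
    return n_ge / null_sorted.size


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values (FDR), returned in input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the sorted P values by m / rank
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest P value
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    fdr = np.empty(m)
    fdr[order] = np.clip(adjusted, 0.0, 1.0)
    return fdr


pvals = empirical_pvalues([0.35, 0.05], null=[0.1, 0.2, 0.3, 0.4])
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In practice the null scores would come from cosine similarities computed on randomly shuffled feature embeddings, as the Methods describe; the toy numbers above only exercise the arithmetic.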
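The guidance graph edge weights used in the sections Systematic benchmarks and GLUE-based cis-regulatory inference can also be written down in a few lines. The sketch below assumes distances are already given in kb; the function names `power_law_weight` and `pchic_weight` are illustrative and do not come from the ‘scglue’ package.

```python
import numpy as np


def power_law_weight(distance_kb, exponent=-0.75):
    """Distance-based edge weight w = (d + 1) ** -0.75, with d in kb."""
    return (np.asarray(distance_kb, dtype=float) + 1.0) ** exponent


def pchic_weight(promoter_to_bait_kb, peak_to_other_end_kb):
    """Weight of a pcHi-C-supported peak-gene pair.

    Each contact contributes the product of the promoter-to-bait and
    peak-to-other-end power-law weights; contributions from multiple
    contacts are summed and clipped to a maximum of 1.
    """
    per_contact = power_law_weight(promoter_to_bait_kb) * power_law_weight(
        peak_to_other_end_kb
    )
    return float(min(per_contact.sum(), 1.0))


w_overlap = power_law_weight(0.0)             # zero distance
w_far = power_law_weight(15.0)                # 16 ** -0.75
w_single = pchic_weight([15.0], [0.0])        # one supporting contact
w_multi = pchic_weight([0.0, 0.0], [0.0, 0.0])  # two contacts, sum exceeds 1
```

The clipping step keeps pcHi-C-supported edges on the same 0 to 1 scale as the distance-based and eQTL-supported edges, so the three evidence types can coexist as multi-edges in the composite graph.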

Journal

Nature Biotechnology, Springer Journals

Published: Oct 1, 2022
