ATLAS: an automated association test using probabilistically linked health records with application to genetic studies

Abstract

Objective
Large amounts of health data are becoming available for biomedical research. Synthesizing information across databases may capture more comprehensive pictures of patient health and enable novel research studies. When no gold standard mappings between patient records are available, researchers may probabilistically link records from separate databases and analyze the linked data. However, previous linked data inference methods are constrained to certain linkage settings and exhibit low power. Here, we present ATLAS, an automated, flexible, and robust association testing algorithm for probabilistically linked data.

Materials and Methods
Missing variables are imputed at various thresholds using a weighted average method that propagates uncertainty from probabilistic linkage. Next, estimated effect sizes are obtained using a generalized linear model. ATLAS then conducts the threshold combination test by optimally combining P values obtained from data imputed at varying thresholds using Fisher’s method and perturbation resampling.

Results
In simulations, ATLAS controls for type I error and exhibits high power compared to previous methods. In a real-world genetic association study, meta-analysis of ATLAS-enabled analyses on a linked cohort with analyses using an existing cohort yielded additional significant associations between rheumatoid arthritis genetic risk score and laboratory biomarkers.

Discussion
Weighted average imputation weathers false matches and increases contribution of true matches to mitigate linkage error-induced bias. The threshold combination test avoids arbitrarily choosing a threshold to rule a match, thus automating linked data-enabled analyses and preserving power.

Conclusion
ATLAS promises to enable novel and powerful research studies using linked data to capitalize on all available data sources.
Keywords: electronic health records, record linkage, genetic association studies, biorepositories, perturbation resampling

INTRODUCTION

A vast amount of health data stemming from electronic health records (EHR), biorepositories, administrative claims, and biomedical research studies are becoming available for discovery and predictive research.1,2 For patients who contribute information to multiple databases, synthesizing their information across all available sources captures a more complete picture of their health and allows for more comprehensive and powerful research studies. For example, database A may contain genomic data and database B may contain longitudinal phenotypic data. Linking patient records in these databases would allow researchers to investigate gene–disease associations. In a similar vein, researchers recently linked genomics data with environmental factors of BRCA1/BRCA2 mutation carriers from 2 independent studies to study environmental-gene relations, demonstrating the potential to conduct innovative research studies after linking databases.3 To perform linkage when protected health information (PHI) identifiers are not available, as is generally the case in deidentified research databases, researchers employ probabilistic record linkage (PRL).4 However, linkage errors are inevitable in PRL due to data discrepancies, and examples of linkage errors include incorrectly linking 2 records that do not belong to the same patient (false matches) or leaving a record unlinked when a correct link exists (missed matches).
Neter et al were the first to investigate the consequences of linkage errors on downstream inference results, and they showed that such errors induce substantial bias in inference.5 Rentsch et al recently attempted to conduct inference on linked real-world data and showed that false matches reduced the magnitudes of associations and biased estimates.6 We are interested in testing the association between some predictors X, recorded in one database A, and an outcome Y, recorded in a different database B. Within this association testing framework, both false matches and missed matches drive results in the direction of no association by undermining statistical power.7–9 False matches increase sample sizes but dilute potential associations, while missed matches reduce sample sizes and undermine statistical efficiency.7–9 Despite the growth in analysis of linked data and the well-documented effects of linkage error on downstream inference, few robust, automated, and flexible inference methods have been proposed to account for linkage error-induced bias.10 Further, currently proposed estimators are restricted to specific linkage settings and few are implemented in open-source software.
For example, Hof et al propose weighting least square estimators in linear regression, but do not account for non-match events and state that their estimator is biased unless complete matching is achieved.11 Other proposed estimators assume that 1 database must be a complete subset of the other or necessitate the selection of extraneous blocking variables.12,13 Recently, Han et al proposed linkage bias-correcting estimators that are free of specific assumptions regarding linkage settings and do not require specific data structures as previously proposed methods do.14 However, their method does not account for linkage settings where covariates come from either database A or B.14 In this article, we propose automated association testing using probabilistically linked health records (ATLAS), a fully automated and scalable association testing framework that addresses many of these limitations. ATLAS utilizes either (1) a best match method or (2) a weighted imputation method that propagates uncertainty from the linkage process contained in matching probabilities. Then, ATLAS optimally combines several P values estimated using generalized linear models (GLMs) that each correspond to a different matching threshold ρk (⁠ k∈{1,…,K} ⁠) as a significance test, avoiding the difficult choice of choosing a single threshold for defining a match. Unlike previous work, ATLAS performance is not conditional on specific linkage settings, linkage methods, or data structure. To facilitate ease of use and accessibility, we have implemented ATLAS in the “ludic” R package on CRAN. Here, we validate ATLAS performance and compare to existing methods in simulation and real-world studies to show that ATLAS is more robust at detecting associations than previously published estimators. 
MATERIALS AND METHODS

Statistical method

The proposed ATLAS algorithm broadly consists of 3 steps: (1) missing variable imputation using either (i) the best match above a prespecified threshold, or (ii) a weighted average using matching probabilities as weights, also filtering on a prespecified threshold; (2) estimation of an adjusted effect size using a GLM; and (3) a significance test relying on optimal combination of P values obtained from data imputed at multiple thresholds using Fisher’s method and perturbation resampling. Figure 1 illustrates this procedure.

Without loss of generality, and for the sake of simplicity, here we will consider the situation where we have database A containing a $p$-dimensional novel predictor vector $X$ and a vector of matching features $M^A$ on $n_A$ subjects indexed by $i \in \{1,\dots,n_A\}$, and database B containing outcome information $Y$, a vector of matching features $M^B$, and potentially some other existing covariates $W$ on $n_B$ subjects indexed by $j \in \{1,\dots,n_B\}$. We seek to link the predictors $X$ recorded in database A subjects to database B such that we may run an association analysis of $Y \sim W + X$.

Figure 1. Schematic of the proposed ATLAS algorithm.

Probabilistic linkage

We denote the probability of matching between a patient $i$ from database A and a patient $j$ from database B as $\pi_{ij}$, where $\pi_{ij}$ is ascertained via any previously developed linkage algorithm that typically assesses the similarity between the matching features $M^A_i$ and $M^B_j$.
For example, the ludic linkage algorithm employs Bayesian modeling of binary diagnosis codes as matching features to estimate a posterior probability of being a match.15

Variable imputation

Using the estimated probabilities and given a specific matching cutoff threshold $\rho_k$ above which patients are ruled as a match, we define $\xi_j(\rho)$ as whether there is any match for patient $j$ in database B:

$$\xi_j(\rho) = \begin{cases} 1 & \text{if } \max_{1 \le i \le n_A} \pi_{ij} \ge \rho \\ 0 & \text{otherwise} \end{cases}$$

The best match imputation method imputes the missing predictor for patient $j$ in B with the observation $X_i$ from patient $i$ in A with the highest probability $\hat\pi_{ij}$ above a threshold $\rho_k$. However, this method risks imputing variables from false matches created by linkage errors and diluting potential associations. Therefore, we additionally propose a weighted average imputation of the predictor using linkage probabilities as weights to weather false matches and increase contribution of true matches. More specifically, for the $j$th subject in database B and a given threshold $\rho$, we identify subjects in database A with linkage probabilities $\hat\pi_{ij}$ above $\rho$ and obtain a weighted average of their $X$’s as the linked predictor:

$$\hat X_{\rho j} = \frac{\sum_{i=1}^{n_A} \pi_{ij}\, I(\pi_{ij} \ge \rho)\, X_i}{\sum_{i=1}^{n_A} \pi_{ij}\, I(\pi_{ij} \ge \rho)}, \quad j = 1,\dots,n_B$$

If $\max_{1 \le i \le n_A} \pi_{ij} < \rho$, then the predictor will remain missing for the $j$th subject. We assume that those remaining missing values are Missing At Random.

Effect size estimation using GLM

Our goal is to assess the association between $X$ and $Y$ using the linked database B through a GLM with link function $g$:

$$E(Y \mid X, W) = g(W'\alpha_0 + X'\theta_0) = g(Z'\beta_0)$$

where $Z = (W', X')'$ and $\beta = (\alpha', \theta')'$. For simplicity, we focus on the downstream association testing for $H_0: \theta_0 = 0$ using linked data

$$D_B(\rho) = \{(Y_j, \hat Z_{\rho j}) : \xi_j(\rho) = 1,\ j = 1,\dots,n_B\}$$

where $\hat Z_{\rho j} = (W_j', \hat X_{\rho j}')'$. We fit the GLM $Y_j \sim \hat Z_{\rho j}$ to $D_B(\rho)$ to obtain a maximum likelihood estimate for $\beta$, denoted as $\hat\beta_\rho = (\hat\alpha_\rho', \hat\theta_\rho')'$, and test for $H_0: \theta_0 = 0$ based on the corresponding $\hat\theta_\rho$.
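The two imputation rules described above can be sketched in a few lines. This is only an illustrative sketch, not the ludic implementation: it assumes a scalar predictor, and the function names and toy numbers are hypothetical.

```python
from typing import Optional, Sequence


def best_match_impute(pi_j: Sequence[float], x: Sequence[float],
                      rho: float) -> Optional[float]:
    """Best match rule: take X_i from the A-record with the highest
    matching probability, or None when no probability reaches rho
    (that is, xi_j(rho) = 0 and the predictor stays missing)."""
    i_best = max(range(len(pi_j)), key=lambda i: pi_j[i])
    return x[i_best] if pi_j[i_best] >= rho else None


def weighted_average_impute(pi_j: Sequence[float], x: Sequence[float],
                            rho: float) -> Optional[float]:
    """Weighted average rule: average the X_i of all A-records whose
    matching probability is at least rho, weighted by that probability."""
    kept = [(p, x_i) for p, x_i in zip(pi_j, x) if p >= rho]
    total = sum(p for p, _ in kept)
    if total == 0:  # no candidate above rho: predictor stays missing
        return None
    return sum(p * x_i for p, x_i in kept) / total


# Hypothetical toy input: one B-record j compared with three A-records.
pi_j = [0.60, 0.30, 0.05]   # matching probabilities pi_ij
x = [2.0, 1.0, 0.0]         # predictor values X_i in database A

print(best_match_impute(pi_j, x, rho=0.25))
print(weighted_average_impute(pi_j, x, rho=0.25))
print(weighted_average_impute(pi_j, x, rho=0.70))
```

With a threshold of 0.25, the best match rule returns 2.0, whereas the weighted average is (0.60 × 2 + 0.30 × 1) / (0.60 + 0.30) ≈ 1.67, so the second candidate still contributes in proportion to its probability. With a threshold of 0.70 no candidate qualifies and the predictor remains missing.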
The ATLAS threshold combination test: significance testing with optimal P value combination

Various threshold values can be used for $\rho$. For instance, one could use $\rho = 0.5$, where indicated matches have a higher probability of being a match than a non-match, or $\rho = 0.9$, where indicated matches have higher certainty of being true matches compared to using $\rho = 0.5$. However, the optimal choice of such a threshold is unclear in practice due to the lack of gold standard labels on the true mappings between A and B. On the one hand, higher thresholds have lower estimation biases from fewer false matches but at the price of decreased statistical power from smaller sample sizes. On the other hand, lower thresholds exhibit higher statistical power at the expense of increased estimation bias. Thus, instead of arbitrarily choosing a threshold $\rho$, we propose to optimally combine several P values that correspond to different matching thresholds $\{\rho_l, l = 1,\dots,L\}$, thereby automating the significance testing process and preserving statistical power in various settings. Specifically, for each threshold $\rho_l$ we obtain a P value via a $p$-degree-of-freedom $\chi^2$ test based on $\hat\theta_{\rho_l}$,

$$\hat p_{\rho_l} = P\left(\chi^2_p \ge \hat\theta_{\rho_l}'\, \hat\Sigma_{\rho_l}^{-1}\, \hat\theta_{\rho_l}\right),$$

where $\hat\Sigma_{\rho_l}$ is the estimated variance–covariance matrix of $\hat\theta_{\rho_l}$, and construct a combined test statistic as

$$\hat\gamma = \sum_{l=1}^{L} -\log(\hat p_{\rho_l})$$

to calculate the final P value for the testing of $H_0: \theta_0 = 0$. Since $\hat\theta_{\rho_l}, l = 1,\dots,L$, are estimated using overlapping data, the test statistics $\hat\theta_{\rho_l}'\, \hat\Sigma_{\rho_l}^{-1}\, \hat\theta_{\rho_l}, l = 1,\dots,L$, are highly correlated with each other. Thus, to estimate the null distribution of $\hat\gamma$, we use a perturbation resampling strategy to account for the correlations.
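As a small worked illustration of the per-threshold P values and the combined statistic, the sketch below assumes a single predictor (p = 1), for which the 1-degree-of-freedom chi-square tail probability has the closed form erfc(sqrt(x/2)). The effect sizes and variances are made-up numbers, and the function names are hypothetical.

```python
import math


def wald_p_value_1df(theta_hat: float, variance: float) -> float:
    """P(chi2_1 >= theta_hat^2 / variance) for a scalar effect estimate.
    For 1 degree of freedom this tail equals erfc(sqrt(stat / 2))."""
    stat = theta_hat ** 2 / variance
    return math.erfc(math.sqrt(stat / 2.0))


def combined_statistic(p_values) -> float:
    """gamma_hat = sum over thresholds of -log(p_hat)."""
    return sum(-math.log(p) for p in p_values)


# Hypothetical estimates at thresholds rho = 0.5, 0.7, 0.9.
theta_hats = {0.5: 0.30, 0.7: 0.28, 0.9: 0.25}
variances = {0.5: 0.010, 0.7: 0.012, 0.9: 0.017}

p_values = [wald_p_value_1df(theta_hats[r], variances[r]) for r in theta_hats]
gamma_hat = combined_statistic(p_values)
print(p_values, gamma_hat)
```

Because the per-threshold estimates share data, gamma_hat should not be referred to a standard chi-square table; its null distribution has to come from resampling, as the text goes on to describe.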
Specifically, we generate a vector of $n_B$ standard Gaussian random variables $G = (G_1,\dots,G_{n_B})'$ and subsequently obtain a perturbed random vector as

$$\hat\theta_{\rho_l}^{[G]} = \sum_{j=1}^{n_B} \xi_j(\rho_l)\, \hat S_{\theta j}(\rho_l)\, G_j,$$

where

$$\hat S_j(\rho) = \left(\hat S_{\alpha j}(\rho)', \hat S_{\theta j}(\rho)'\right)' = J(\rho)^{-1}\, \hat Z_{\rho j}\left\{Y_j - g(\hat\beta_\rho' \hat Z_{\rho j})\right\},$$

$$J(\rho) = \sum_{j=1}^{n_B} \xi_j(\rho)\, \hat Z_{\rho j} \hat Z_{\rho j}'\, \dot g(\hat\beta_\rho' \hat Z_{\rho j})$$

is the Fisher information matrix for a given $\rho$, and $\dot g(x) = dg(x)/dx$.16,17 Subsequently, we obtain the perturbed counterpart of $\hat\gamma$ as

$$\hat\gamma^{[G]} = \sum_{l=1}^{L} -\log \hat p_{\rho_l}^{[G]}, \quad \text{where} \quad \hat p_{\rho_l}^{[G]} = P\left(\chi^2_p \ge \hat\theta_{\rho_l}^{[G]\prime}\, \hat\Sigma_{\rho_l}^{-1}\, \hat\theta_{\rho_l}^{[G]}\right).$$

We then generate $R$ realizations of $G$, $\{G_r, r = 1,\dots,R\}$, to obtain the final P value for testing $H_0: \theta_0 = 0$ as:

$$\hat p_{\text{combined}} = 1 - \frac{1}{R} \sum_{r=1}^{R} 1\{\hat\gamma \ge \hat\gamma^{[G_r]}\}$$

Data and metrics for evaluation

We evaluated the performance of ATLAS in simulation studies and conducted a real-world genetic association study using EHR data that has been linked to a biorepository.

Simulation study

We estimated type I error rate (statistical size) and empirical statistical power of ATLAS using the deidentified publicly available “RA2” dataset in the “ludic” CRAN package with N=5707 patients extracted from the Mass General Brigham (MGB) Research Patient Data Registry (RPDR) in 2010, and this dataset contained 1342 binarized International Classification of Disease (ICD) codes.15 We studied ATLAS performance under settings created by (1) perturbing databases with multivariate noise to create discrepancies between patient records illustrative of real-world scenarios, like missing information or administrative recording errors; (2) decreasing the average codes per patient record to simulate poor linkage conditions with noisy linkage probabilities; and (3) using varying strengths of association, namely odds ratios (OR) of 1.5 and 2.
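Returning to the perturbation resampling step of the threshold combination test, the following is a minimal sketch of how the final combined P value could be computed. It assumes a scalar predictor (p = 1) and takes the per-subject score contributions as given inputs (already set to zero for subjects with no match at a threshold); fitting the GLM and deriving those contributions is not shown. All names are hypothetical, and this is not the ludic package code.

```python
import math
import random


def chi2_sf_1df(stat: float) -> float:
    """Upper tail of a 1-degree-of-freedom chi-square distribution."""
    return math.erfc(math.sqrt(stat / 2.0))


def gamma_stat(thetas, variances) -> float:
    """Sum over thresholds of -log(p); p is floored to avoid log(0)."""
    total = 0.0
    for theta, var in zip(thetas, variances):
        p = max(chi2_sf_1df(theta ** 2 / var), 1e-300)
        total += -math.log(p)
    return total


def perturbation_p_value(theta_hats, variances, scores,
                         n_resamples=1000, seed=0) -> float:
    """scores[l][j] plays the role of xi_j(rho_l) * S_theta_j(rho_l).
    The same Gaussian vector G is reused across all thresholds within a
    resample, which is what preserves the correlation between them."""
    rng = random.Random(seed)
    n_b = len(scores[0])
    gamma_obs = gamma_stat(theta_hats, variances)
    n_below = 0
    for _ in range(n_resamples):
        g = [rng.gauss(0.0, 1.0) for _ in range(n_b)]
        theta_pert = [sum(s * g_j for s, g_j in zip(s_l, g)) for s_l in scores]
        if gamma_obs >= gamma_stat(theta_pert, variances):
            n_below += 1
    return 1.0 - n_below / n_resamples


# Hypothetical toy input: 2 thresholds, 100 subjects in database B.
scores = [[0.1] * 100, [0.1] * 100]
print(perturbation_p_value([0.35, 0.30], [1.0, 1.0], scores, n_resamples=500))
```

The returned value is the share of resamples whose perturbed statistic exceeds the observed one, so its resolution is limited to 1/R; more resamples are needed when very small P values matter.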
To simulate upstream linkage, we used 2 linkage algorithms with different methods of estimating $\pi_{ij}$ to study linkage algorithm compatibility with ATLAS. Specifically, we considered: (i) ludic, a published algorithm which relies on Bayesian modeling of binary diagnosis codes, and (ii) embeddingMatch, which calculates cosine similarities between patient-level embeddings similar to other previous approaches.15,18–20 Further details regarding the embeddingMatch method and the simulation model are in the Supplementary Appendix. Type I error rates using single cutoff thresholds and for the ATLAS threshold combination test were estimated as the proportion of P values less than the nominal testing level $\alpha = 0.05$ under no simulated association. Similarly, empirical power was estimated as the proportion of significant P values at $\alpha = 0.05$ under simulated association for a given OR > 1. Results are based on n=1500 simulations in each setting, and thresholds used in the ATLAS threshold combination test include $\rho \in \{0.1, 0.3, \dots, 0.9\}$. For the sake of simplicity, we report ATLAS results at single cutoff thresholds of 0.1, 0.5, and 0.9, and report the rest in the Supplementary Appendix.

Genetic association study using real-world biorepository data

To further validate performance and demonstrate real-world utility of ATLAS, we conducted a genetic association study with linked data to assess the association between RA genetic risk score (GRS) and clinical outcomes among rheumatoid arthritis (RA) patients.
We considered EHR records from 2 databases with which we performed linkage and the linked data-enabled downstream association analyses: (i) the Crimson Clinical Discards database (herein referred to as Crimson and akin to database A) for a subset of RA patients of European descent previously identified in 2008; and (ii) the MGB RPDR (akin to database B) subset of RA patients identified via an existing machine learning algorithm.21–25 The Crimson RA cohort contains anonymized EHR data along with genotype data for $n_{\text{Crimson}} = 1284$ patients collected up to 2008. The RPDR RA cohort contains full EHR data up to 2017 for a total of $n_{\text{RPDR}} = 12\,838$ patients, and this dataset is unrelated to the deidentified “RA2” dataset used in simulations. A subset of $n_{\text{biobank}} = 1270$ patients in RPDR already report genotype data because they belong to the MGB Biobank, and we have gold standard labels on the true mappings between RPDR and MGB Biobank records.26,27 Our goal is to link Crimson with RPDR to enable genetic association studies of RA GRS in Crimson with various clinical outcomes in RPDR using all available data. Figure 2 provides a schematic of our analyses. Details about how RA GRS were constructed are reported in the Supplementary Appendix.22,28,29

Figure 2. Schematic of real-world rheumatoid arthritis genetic association studies conducted using the MGB Biobank and the Crimson-linked data.

To perform record linkage, we assembled all available ICD codes in Crimson records and ICD codes recorded prior and up to 2008 in RPDR records.
ICD codes were then aggregated to PheCodes and a vector of 1542 binary matching features was created for each patient record where each feature was a binary indicator of presence or absence of a PheCode.30,31 Next, we performed PRL using ludic to estimate probabilities of being a match for every possible pair of RPDR and Crimson records. Since the Crimson cohort consists of RA patients that were previously managed at MGB, we anticipated that a majority of these patients could be linked to the updated RPDR RA cohort. Further, there are 213 patients with genotype data in both MGB Biobank and Crimson. In the absence of gold standard labels on the true mappings between RPDR and Crimson records, we used these 213 overlapping patients, whose genotype data are reported in both genotype databases, to validate the accuracy of the linkage by assessing the concordance between RA GRS from MGB Biobank and those from Crimson. The 213 overlapping patients were subsequently excluded in the linked data study. Once Crimson records were linked to RA RPDR records, we imputed GRS values using the weighted average method from Crimson to those subjects in RA RPDR using thresholds $\rho_l \in \{0.5, 0.6, \dots, 0.9\}$. Subsequently, we used ATLAS to conduct multivariate association studies of linked RA GRS and clinical outcomes while adjusting for patient age, gender, and healthcare utilization (defined as log[1 + total encounters]). Clinical outcomes from RPDR include laboratory biomarkers commonly used to assess patient inflammation and binary phenotypes for pyogenic arthritis and gout, which are other distinct nonautoimmune forms of arthritis. These binary phenotypes were defined as having at least 2 PheCodes corresponding to these disorders and were constructed using ICD codes recorded up to 2017 in RPDR.30,31 Reported effect sizes were estimated using data imputed at a $\rho = 0.9$ threshold.
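The feature and covariate constructions described above are simple enough to sketch. The example below is illustrative only: the code labels are placeholders rather than real PheCodes, the ICD-to-PheCode mapping is assumed to have been applied already, and reading "at least 2 PheCodes" as "at least 2 recorded occurrences of a code in the phenotype's code set" is an assumption of this sketch.

```python
import math


def binary_features(patient_codes, vocabulary):
    """One 0/1 indicator per code in the vocabulary (presence or absence)."""
    present = set(patient_codes)
    return [1 if code in present else 0 for code in vocabulary]


def has_phenotype(patient_codes, phenotype_codes, min_count=2):
    """True when the record holds at least min_count occurrences of codes
    belonging to the phenotype's code set."""
    hits = sum(1 for code in patient_codes if code in phenotype_codes)
    return hits >= min_count


def utilization(total_encounters):
    """Healthcare utilization covariate: log(1 + total encounters)."""
    return math.log(1 + total_encounters)


# Placeholder code labels, not real PheCodes.
vocabulary = ["code_a", "code_b", "code_c", "code_d"]
record = ["code_b", "code_d", "code_d"]

print(binary_features(record, vocabulary))
print(has_phenotype(record, {"code_d"}))
print(utilization(0), utilization(9))
```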
Using RPDR patients who already belong to the MGB Biobank, we replicated these multivariate association studies and conducted meta-analysis using Fisher’s method to combine the P values estimated from the MGB Biobank and Crimson linked-data study to demonstrate increased power to detect associations when linking databases. To determine statistical significance after meta-analysis, we accounted for multiple testing by adjusting P values to control for a false discovery rate of 5% using the Benjamini-Hochberg procedure.32

Benchmark methodology for comparison

We additionally considered the bias correcting estimators for linked data proposed by Han et al as benchmark approaches.14 The 3 Han et al estimators are “Han F” (for “Full”, which considers all possible pairs for each patient in A), “Han M” (for “Max”, which considers the largest probabilities for each patient in A), and “Han M2” (for “Max 2”, which considers the 2 largest probabilities for each patient in A).14 We report type I error rates and empirical power of these estimators from simulations. We were unable to replicate the real-world multivariate analyses using Han et al estimators, as they do not consider accepting covariates from the same database that provides outcomes. To compare performance in the real-world setting, we further conducted univariate analyses.

RESULTS

Type I error control in simulations

ATLAS effectively controlled for type I error in all simulation settings, and at all considered thresholds, regardless of the linkage or downstream imputation method used (Figure 3). Han et al’s estimators controlled for type I error, although they appear somewhat too conservative.

Figure 3. Comparison of type I error rates of ATLAS and Han et al estimators in simulation settings with different noise levels and average codes per patient record (simulations under $H_0$). ATLAS type I error rates reported for several single cutoff thresholds and the ATLAS threshold combination test.
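As an aside on the meta-analysis step described in the Methods (Fisher's method to combine the two study P values, followed by Benjamini-Hochberg adjustment), both calculations can be sketched with the standard library. Fisher's statistic is -2 times the sum of log P values and follows a chi-square distribution with 2k degrees of freedom for k independent P values; for even degrees of freedom the tail has a closed form. The function names are hypothetical, and which set of outcomes the authors adjusted across is not assumed here.

```python
import math


def fisher_combine(p_values):
    """Fisher's method for k independent P values.
    With x = -2 * sum(log p), P(chi2_{2k} >= x) equals
    exp(-x/2) * sum_{i<k} (x/2)^i / i!."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is x / 2
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Study-level P values for ESR and CRP as reported in Table 1.
print(fisher_combine([3.91e-01, 1.00e-02]))  # ESR
print(fisher_combine([7.80e-02, 3.00e-02]))  # CRP
```

Applied to the Table 1 inputs, the combination gives approximately 0.0256 for erythrocyte sedimentation rate and 0.0165 for C-reactive protein, which agrees with the two-study P values reported there.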
Statistical power evaluation

When patient records had on average 16 codes, power at different single-cutoff thresholds was similar (Figure 4). When patient records had on average 4 codes, larger differences in power between single cutoff thresholds were observed. Relative power between thresholds was further dependent on the linkage method. Across all settings, the ATLAS threshold combination test either achieved the highest power or at least matched the highest power among ATLAS single-cutoff thresholds (Figure 4). Larger simulated ORs increased power across all thresholds. Empirical power when using weighted average imputation was similar to power using best match imputation, although differences in power based on imputation strategy were observed in certain settings when using embeddingMatch (Supplementary Figure 1). Simulation results using the identity link of the GLM and continuous outcomes demonstrated similar results in ATLAS performance (Supplementary Figure 2). In comparison, the estimators proposed by Han et al did not capture signal as effectively, most likely due to noisy linkage probabilities generated from PRL.

Figure 4. Comparison of empirical power of ATLAS and Han et al estimators in simulation settings with different effect sizes and average codes per patient record (simulations under $H_1$). Results were generated using the best match imputation method. ATLAS power reported for several single cutoff thresholds and the ATLAS threshold combination test.

We then evaluated ATLAS performance after creating false matches between databases to simulate linkage errors. In doing this, we sought to mimic real-world scenarios where linkage algorithms cannot discern between many pairs of similar patient records. ATLAS successfully controlled for type I error in this setting (Supplementary Figure 2). Use of the weighted average imputation yielded significantly higher power compared to best match imputed variables, and the ATLAS threshold combination test again demonstrated good power relative to its power from single cutoff thresholds (Figure 5). Han et al’s M2 estimator yielded comparable power in the context of false matches when using ludic, but was generally outperformed by the ATLAS threshold combination test.

Figure 5. Comparison of empirical power of ATLAS and Han et al estimators in the presence of false matches between databases (simulations under $H_1$). Simulated databases report on average 16 codes per patient record.

Real-world genetics study: association of clinical outcomes with rheumatoid arthritis risk alleles among rheumatoid arthritis patients

We performed PRL on 12 838 patient records from RPDR and 1284 patient records from Crimson. At a conservative threshold of $\rho = 0.9$, we identified 1157 matching patient records between RPDR and Crimson.
For the 213 patients who have been genotyped in both MGB Biobank and Crimson, Spearman’s correlation coefficient between the MGB Biobank and Crimson GRSs was estimated to be 0.83 (95% CI: 0.77–0.87, P<.001), suggesting high concordance of genetic data and reliable linkage quality. Univariate association study results using both ATLAS and Han et al’s estimators are reported in Supplementary Tables 1 and 2. As a positive control, we replicated the known association of anticitrullinated protein antibody (ACPA) levels with RA GRS using ATLAS and Han et al’s estimators.33,34 In general, ATLAS detected larger effect sizes and smaller P values compared to any of Han et al’s estimators, supporting simulation results that ATLAS is more powerful at detecting associations. For example, for log-transformed rheumatoid factor levels, ATLAS estimated $\beta_{\text{GRS}} = 0.15$ with P<.001 while Han et al’s M2 estimator estimated $\beta_{\text{GRS}} = 0.14$ with P=.09.

Multivariate association study results using the MGB Biobank and the Crimson linked data study are presented in Table 1 for biomarkers and Table 2 for phenotypes. Effect sizes estimated from both studies were generally concordant. Figure 6 visualizes the difference in adjusted P values between using only the MGB Biobank cohort for which gold standard mappings were already available and after incorporating additional RA patients with genotype data through the Crimson linked cohort. We demonstrated improved power to detect associations when incorporating linked data as meta-analysis yielded 2 additional significant associations (namely log-transformed erythrocyte sedimentation rate and C-reactive protein level). Further, the average unadjusted -log10(P value) among statistically significant outcomes was 6.45 in the MGB Biobank study and 9.65 after meta-analysis with the Crimson linked data study, again demonstrating the potential to increase power when incorporating additional data by linking databases.

Figure 6. Logarithm transformed P values from genetic association study using only RA patients with previously available genotype data at MGB Biobank and after incorporating additional RA patients with genotype data through the Crimson-linked cohort.

Table 1. Multivariate association study results of RA GRS and patient biomarkers for the MGB Biobank RA cohort and the Crimson-linked RA cohort.

Biomarker | Beta (SE), MGB Biobank | Beta (SE), Crimson linked: ATLAS 0.9 threshold | P value, MGB Biobank | P value, Crimson linked: ATLAS combination | Two study P value
Anti-citrullinated protein antibodies (log) | 0.39 (0.04) | 0.31 (0.04) | 9.39e-30 | 0.00e+00 | 8.54e-38
Rheumatoid factor (log) | 0.12 (0.03) | 0.16 (0.03) | 2.20e-06 | 0.00e+00 | 8.14e-15
Erythrocyte sedimentation rate | 0.49 (0.58) | 1.27 (0.55) | 3.91e-01 | 1.00e-02 | 2.56e-02
C-reactive protein (log) | 0.04 (0.02) | 0.04 (0.02) | 7.80e-02 | 3.00e-02 | 1.65e-02

Crimson-linked RA cohort effect sizes were estimated at a stringent imputation threshold of 0.9 and P values were estimated using the ATLAS combination test.

Table 2. Multivariate association study results of RA GRS and binary phenotypes for the MGB Biobank RA cohort and the Crimson-linked RA cohort.

Phenotype | OR (SE), MGB Biobank | OR (SE), Crimson linked: ATLAS 0.9 threshold | P value, MGB Biobank | P value, Crimson linked: ATLAS combination | Two study P value
Pyogenic arthritis | 1.41 (0.11) | 1.40 (0.18) | 1.64e-03 | 9.57e-02 | 1.12e-03
Gout and other crystal arthropathies | 0.85 (0.07) | 0.85 (0.08) | 2.82e-02 | 2.43e-02 | 5.68e-03

Crimson-linked RA cohort effect sizes were estimated at a stringent imputation threshold of 0.9 and P values were estimated using the ATLAS combination test.

DISCUSSION

The tremendous amount of biomedical data becoming available for research has led to great interest and demand for linkage of databases. Inference using linked data must acknowledge and mitigate bias in estimated effect sizes that is created by linkage errors while retaining good statistical power. To this end, we propose ATLAS as a supervised, robust, flexible, and scalable method that tests for association between variables originally belonging to separate databases. We demonstrate that ATLAS is a valid method that effectively controls for type I error regardless of linkage or imputation method used, and that ATLAS is more powerful than benchmark methods in a range of linkage and inference settings.
We demonstrate in simulation studies that weighted average imputation of missing variables not only protects against type I error in downstream inference but also preserves statistical power to detect associations in the presence of linkage errors, making it the preferable imputation method to use in future studies. The weighted average imputation method propagates the uncertainty contained in probabilistic linkage to downstream inference to reduce bias and increase power. While certain aspects of it resemble multiple imputation and inverse probability weighting, the weighted average imputation additionally uses thresholding to impute only when the linkage is deemed of sufficient quality (ie, above the threshold).35 When using the weighted average imputation, users should be aware that using lower thresholds with linkage algorithms like embeddingMatch may yield lower power, as seen in Supplementary Figure 1, when patient records contain few matching features. We also observed that using fewer matching features led to biased estimates of $\pi_{ij}$, which subsequently decreased power of ATLAS. For example, the average $\pi_{ij}$ among true matches with 16 and 4 codes per record was 0.86 and 0.22, respectively. The resulting powers of the ATLAS combination test for 16 and 4 codes per record were 0.86 and 0.22, respectively.

During missing variable imputation, instead of selecting an ad hoc threshold, ATLAS optimally combines P values originating from multiple thresholds. Our results suggest 3 major advantages to the ATLAS threshold combination test: (1) it avoids arbitrarily choosing a threshold, thus automating the record linkage process; (2) it reduces estimation bias by combining P values estimated from data imputed at different thresholds; and (3) it preserves good statistical power regardless of which threshold performs best in a given setting.
When selecting the thresholds to be considered in the ATLAS threshold combination test, we suggest, as a rule of thumb, using no more than 10 thresholds at a time that are at least 0.05 apart to preserve statistical power. These attractive features of ATLAS enable robust performance in real-world settings, and we demonstrated its utility in our real-world genetic association study with data unrelated to the simulation data. When meta-analyzing results obtained from MGB Biobank (for which genotype data was already available) and the newly linked Crimson cohort, we were able to increase power and detect 2 more associations (namely CRP and ESR) than when using MGB Biobank alone. This illustrates the potential to discover new associations with ATLAS when incorporating linked data at no additional cost and without further data collection efforts. We further validated the new association of RA GRS with CRP using the UK Biobank ($\beta = 0.14$; SE = 0.02; $P < 0.001$). ESR is not reported in the UK Biobank, but our results are consistent with previous studies that reported associations of RA genetic risk alleles with higher values of CRP and ESR.36–43 Additional studies are needed to further validate these associations.

CONCLUSION
In this article, we introduce ATLAS, an automated and flexible algorithm that conducts robust inference using probabilistically linked databases. The ATLAS threshold combination test exhibits high power to detect associations in a range of simulation settings while controlling for type I error, and it exhibits substantial improvement in power and flexibility over existing inference methods for linked databases. Thus, ATLAS promises to enable novel and powerful research studies using linked data.

FUNDING
This work was supported in part by the US National Institutes of Health Grant U54-HG007963.
AUTHOR CONTRIBUTIONS
All authors made substantial contributions to: conception and design; acquisition, analysis, and interpretation of data; drafting the article or revising it critically; and final approval of the version to be published.

SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.

CONFLICT OF INTEREST STATEMENT
None declared.

DATA AVAILABILITY STATEMENT
The “RA2” dataset is in the “ludic” package on CRAN. The data used in the real-world study cannot be shared publicly, as they are only available to authorized MGB investigators.

REFERENCES
1 Kohane IS, Churchill SE, Murphy SN. A translational engine at the national scale: informatics for integrating biology and the bedside. J Am Med Inform Assoc 2012; 19 (2): 181–5.
2 Butte AJ. Translational bioinformatics: coming of age. J Am Med Inform Assoc 2008; 15 (6): 709–14.
3 Jiao Y, Lesueur F, Azencott CA, et al. A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers. BMC Med Res Methodol 2021; 21 (1): 155.
4 Gutman R, Afendulis CC, Zaslavsky AM. A Bayesian procedure for file linking to analyze end-of-life medical costs. J Am Stat Assoc 2013; 108 (501): 34–47.
5 Neter J, Maynes ES, Ramanathan R. The effect of mismatching on the measurement of response errors. J Am Stat Assoc 1965; 60: 1005–27.
6 Rentsch CT, Harron K, Urassa M, et al. Impact of linkage quality on inferences drawn from analyses using data with high rates of linkage errors in rural Tanzania. BMC Med Res Methodol 2018; 18 (1): 165.
7 Moore CL, Amin J, Gidding HF, et al. A new method for assessing how sensitivity and specificity of linkage studies affects estimation. PLoS One 2014; 9 (7): e103690.
8 Harron K, Goldstein H, Wade A, et al. Linkage, evaluation and analysis of national electronic healthcare data: application to providing enhanced blood-stream infection surveillance in paediatric intensive care. PLoS One 2013; 8 (12): e85278.
9 Schmidlin K, Clough-Gorr KM, Spoerri A, et al.; Swiss National Cohort. Impact of unlinked deaths and coding changes on mortality trends in the Swiss national cohort. BMC Med Inform Decis Mak 2013; 13: 1–11.
10 Doidge JC, Harron KL. Reflections on modern methods: linkage error bias. Int J Epidemiol 2019; 48 (6): 2050–60.
11 Hof MHP, Zwinderman AH. Methods for analyzing data from probabilistic linkage strategies based on partially identifying variables. Stat Med 2012; 31 (30): 4231–42.
12 Chipperfield J. A weighting approach to making inference with probabilistically linked data. Stat Neerland 2019; 73 (3): 333–50.
13 Dalzell NM, Reiter JP. Regression modeling and file matching using possibly erroneous matching variables. J Comput Graph Stat 2018; 27 (4): 728–38.
14 Han Y, Lahiri P. Statistical analysis with linked data. Int Stat Rev 2019; 87 (S1): S139–57.
15 Hejblum BP, Weber GM, Liao KP, et al. Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes. Sci Data 2019; 6: 180298.
16 Jin Z, Ying Z, Wei L. A simple resampling method by perturbing the minimand. Biometrika 2001; 88 (2): 381–90.
17 Minnier J, Tian L, Cai T. A perturbation method for inference on regularized regression estimates. J Am Stat Assoc 2011; 106 (496): 1371–82.
18 Bonomi L, Xiong L, Chen R, et al. Frequent grams based embedding for privacy preserving record linkage. In: Proceedings of the 21st ACM international conference on information and knowledge management. New York, NY: Association for Computing Machinery; 2012: 1597–601.
19 Adly N. Efficient record linkage using a double embedding scheme. In: DMIN. Las Vegas, NV; July 13–16, 2009: 274–81.
20 Shi X, Li X, Cai T. Spherical regression under mismatch corruption with application to automated knowledge translation. J Am Stat Assoc 2020; 1–12.
21 Boutin N, Holzbach A, Mahanta L, et al. The information technology infrastructure for the translational genomics core and the partners biobank at partners personalized medicine. J Pers Med 2016; 6 (1): 6.
22 Kurreeman F, Liao K, Chibnik L, et al. Genetic basis of autoantibody positive and negative rheumatoid arthritis risk in a multi-ethnic cohort derived from electronic health records. Am J Hum Genet 2011; 88 (1): 57–69.
23 Nalichowski R, Keogh D, Chueh HC, et al. Calculating the benefits of a research patient data repository. In: AMIA Annu Symp Proc 2006; Washington, DC; 2006: 1044.
24 Liao KP, Cai T, Gainer V, et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care Res (Hoboken) 2010; 62 (8): 1120–7.
25 Huang S, Huang J, Cai T, et al. Impact of ICD10 and secular changes on electronic medical record rheumatoid arthritis algorithms. Rheumatology (Oxford) 2020; 59 (12): 3759–66.
26 Karlson EW, Boutin NT, Hoffnagle AG, et al. Building the partners healthcare biobank at partners personalized medicine: Informed consent, return of research results, recruitment lessons and operational considerations. J Pers Med 2016; 6: 2.
27 Gainer VS, Cagan A, Castro VM, et al. The biobank portal for partners personalized medicine: a query tool for working with consented biobank samples, genotypes, and phenotypes using i2b2. J Pers Med 2016; 6: 11.
28 Okada Y, Wu D, Trynka G, et al.; GARNET consortium. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 2014; 506 (7488): 376–81.
29 Raychaudhuri S, Sandor C, Stahl EA, et al. Five amino acids in three HLA proteins explain most of the association between MHC and seropositive rheumatoid arthritis. Nat Genet 2012; 44 (3): 291–6.
30 Denny JC, Ritchie MD, Basford MA, et al. PheWAS: Demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics 2010; 26 (9): 1205–10.
31 Wei W-Q, Bastarache LA, Carroll RJ, et al. Evaluating phecodes, clinical classification software, and icd-9-cm codes for phenome-wide association studies in the electronic health record. PLoS One 2017; 12 (7): e0175508.
32 Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B (Methodol) 1995; 57: 289–300.
33 Aggarwal R, Liao K, Nair R, et al. Anti-citrullinated peptide antibody (ACPA) assays and their role in the diagnosis of rheumatoid arthritis. Arthritis Rheum 2009; 61 (11): 1472–83.
34 Liao KP, Kurreeman F, Li G, et al. Associations of autoantibodies, autoimmune risk alleles, and clinical diagnoses from the electronic medical records in rheumatoid arthritis cases and non–rheumatoid arthritis controls. Arthritis Rheum 2013; 65 (3): 571–81.
35 Seaman SR, White IR, Copas AJ, et al. Combining multiple imputation and inverse-probability weighting. Biometrics 2012; 68 (1): 129–37.
36 Alemao E, Guo Z, Burns L, et al. Evaluation of the association between C-reactive protein and anti-citrullinated protein antibody in rheumatoid arthritis: analysis of two clinical practice data sets [abstract]. Arthritis Rheumatol 2016; 68 (suppl 10): 1226.
37 Pope JE, Choy EH. C-reactive protein and implications in rheumatoid arthritis and associated comorbidities. Semin Arthritis Rheum 2021; 51 (1): 219–29.
38 Plant MJ, Williams AL, O'Sullivan MM, et al. Relationship between time-integrated C-reactive protein levels and radiologic progression in patients with rheumatoid arthritis. Arthritis Rheum 2000; 43 (7): 1473–7.
39 Dessein PH, Joffe BI, Stanwix AE. High sensitivity C-reactive protein as a disease activity marker in rheumatoid arthritis. J Rheumatol 2004; 31 (6): 1095–7.
40 Wolfe F. Comparative usefulness of C-reactive protein and erythrocyte sedimentation rate in patients with rheumatoid arthritis. J Rheumatol 1997; 24 (8): 1477–85.
41 Shen R, Ren X, Jing R, et al. Rheumatoid factor, anti-cyclic citrullinated peptide antibody, C-reactive protein, and erythrocyte sedimentation rate for the clinical diagnosis of rheumatoid arthritis. Lab Med 2015; 46 (3): 226–9.
42 Amos RS, Constable TJ, Crockson RA, et al. Rheumatoid arthritis: relation of serum C-reactive protein and erythrocyte sedimentation rates to radiographic changes. Br Med J 1977; 1 (6055): 195–7.
43 Wolfe F, Pincus T. The level of inflammation in rheumatoid arthritis is determined early and remains stable over the longterm course of the illness. J Rheumatol 2001; 28: 1817–24.

Author notes
Harrison G. Zhang and Boris P. Hejblum contributed equally.

© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal of the American Medical Informatics Association, Oxford University Press

ATLAS: an automated association test using probabilistically linked health records with application to genetic studies

Publisher
Oxford University Press
Copyright
Copyright © 2021 American Medical Informatics Association
ISSN
1067-5027
eISSN
1527-974X
DOI
10.1093/jamia/ocab187

Abstract

Abstract Objective Large amounts of health data are becoming available for biomedical research. Synthesizing information across databases may capture more comprehensive pictures of patient health and enable novel research studies. When no gold standard mappings between patient records are available, researchers may probabilistically link records from separate databases and analyze the linked data. However, previous linked data inference methods are constrained to certain linkage settings and exhibit low power. Here, we present ATLAS, an automated, flexible, and robust association testing algorithm for probabilistically linked data. Materials and Methods Missing variables are imputed at various thresholds using a weighted average method that propagates uncertainty from probabilistic linkage. Next, estimated effect sizes are obtained using a generalized linear model. ATLAS then conducts the threshold combination test by optimally combining P values obtained from data imputed at varying thresholds using Fisher’s method and perturbation resampling. Results In simulations, ATLAS controls for type I error and exhibits high power compared to previous methods. In a real-world genetic association study, meta-analysis of ATLAS-enabled analyses on a linked cohort with analyses using an existing cohort yielded additional significant associations between rheumatoid arthritis genetic risk score and laboratory biomarkers. Discussion Weighted average imputation weathers false matches and increases contribution of true matches to mitigate linkage error-induced bias. The threshold combination test avoids arbitrarily choosing a threshold to rule a match, thus automating linked data-enabled analyses and preserving power. Conclusion ATLAS promises to enable novel and powerful research studies using linked data to capitalize on all available data sources. 
Keywords: electronic health records, record linkage, genetic association studies, biorepositories, perturbation resampling

INTRODUCTION
A vast amount of health data stemming from electronic health records (EHR), biorepositories, administrative claims, and biomedical research studies is becoming available for discovery and predictive research.1,2 For patients who contribute information to multiple databases, synthesizing their information across all available sources captures a more complete picture of their health and allows for more comprehensive and powerful research studies. For example, database A may contain genomic data and database B may contain longitudinal phenotypic data. Linking patient records in these databases would allow researchers to investigate gene–disease associations. In a similar vein, researchers recently linked genomics data with environmental factors of BRCA1/BRCA2 mutation carriers from 2 independent studies to study environmental-gene relations, demonstrating the potential to conduct innovative research studies after linking databases.3 To perform linkage when protected health information (PHI) identifiers are not available, as is generally the case in deidentified research databases, researchers employ probabilistic record linkage (PRL).4 However, linkage errors are inevitable in PRL due to data discrepancies. Examples of linkage errors include incorrectly linking 2 records that do not belong to the same patient (false matches) or leaving a record unlinked when a correct link exists (missed matches).
Neter et al were the first to investigate the consequences of linkage errors on downstream inference results, and they showed that such errors induce substantial bias in inference.5 Rentsch et al recently attempted to conduct inference on linked real-world data and showed that false matches reduced the magnitudes of association and biased estimates.6 We are interested in testing the association between some predictors, X (recorded in 1 database, A), and an outcome, Y (recorded in a different database, B). Within this association testing framework, both false matches and missed matches drive results in the direction of no association by undermining statistical power.7–9 False matches increase sample sizes but dilute potential associations, while missed matches reduce sample sizes and undermine statistical efficiency.7–9 Despite the growth in analysis of linked data and the well-documented effects of linkage error on downstream inference, few robust, automated, and flexible inference methods have been proposed to account for linkage error-induced bias.10 Further, currently proposed estimators are restricted to specific linkage settings, and few are implemented in open-source software.
For example, Hof et al propose weighting least square estimators in linear regression, but do not account for non-match events and state that their estimator is biased unless complete matching is achieved.11 Other proposed estimators assume that 1 database must be a complete subset of the other or necessitate the selection of extraneous blocking variables.12,13 Recently, Han et al proposed linkage bias-correcting estimators that are free of specific assumptions regarding linkage settings and do not require specific data structures as previously proposed methods do.14 However, their method does not account for linkage settings where covariates come from either database A or B.14

In this article, we propose automated association testing using probabilistically linked health records (ATLAS), a fully automated and scalable association testing framework that addresses many of these limitations. ATLAS utilizes either (1) a best match method or (2) a weighted imputation method that propagates uncertainty from the linkage process contained in matching probabilities. Then, as a significance test, ATLAS optimally combines several P values estimated using generalized linear models (GLMs), each corresponding to a different matching threshold $\rho_k$ ($k \in \{1, \dots, K\}$), avoiding the difficult choice of a single threshold for defining a match. Unlike previous work, ATLAS performance is not conditional on specific linkage settings, linkage methods, or data structure. To facilitate ease of use and accessibility, we have implemented ATLAS in the “ludic” R package on CRAN. Here, we validate ATLAS performance and compare it to existing methods in simulation and real-world studies to show that ATLAS is more robust at detecting associations than previously published estimators.
MATERIALS AND METHODS

Statistical method
The proposed ATLAS algorithm broadly consists of 3 steps: (1) missing variable imputation using either (i) the best match above a prespecified threshold, or (ii) a weighted average using matching probabilities as weights, also filtering on a prespecified threshold; (2) estimation of an adjusted effect size using a GLM; and (3) a significance test relying on optimal combination of P values obtained from data imputed at multiple thresholds using Fisher’s method and perturbation resampling. Figure 1 illustrates this procedure. Without loss of generality, and for the sake of simplicity, here we will consider the situation where we have database A containing a $p$-dimensional novel predictor vector $X$ and a vector of matching features $M^A$ on $n_A$ subjects indexed by $i \in \{1, \dots, n_A\}$, and database B containing outcome information $Y$, a vector of matching features $M^B$, and potentially some other existing covariates $W$ on $n_B$ subjects indexed by $j \in \{1, \dots, n_B\}$. We seek to link the predictors $X$ recorded for database A subjects to database B such that we may run an association analysis of $Y \sim W + X$.

Figure 1. Schematic of the proposed ATLAS algorithm.

Probabilistic linkage
We denote the probability of matching between a patient $i$ from database A and a patient $j$ from database B as $\pi_{ij}$, where $\pi_{ij}$ is ascertained via any previously developed linkage algorithm that typically assesses the similarity between the matching features $M^A_i$ and $M^B_j$.
For example, the ludic linkage algorithm employs Bayesian modeling of binary diagnosis codes as matching features to estimate a posterior probability of being a match.15

Variable imputation
Using the estimated probabilities and given a specific matching cutoff threshold $\rho$ above which patients are ruled as a match, we define $\xi_j(\rho)$ as whether there is any match for patient $j$ in database B:

$$\xi_j(\rho) = \begin{cases} 1 & \text{if } \max_{1 \le i \le n_A} \pi_{ij} \ge \rho \\ 0 & \text{else} \end{cases}$$

The best match imputation method imputes the missing predictor for patient $j$ in B with the observation $X_i$ from the patient $i$ in A with the highest probability $\hat{\pi}_{ij}$ above the threshold $\rho$. However, this method risks imputing variables from false matches created by linkage errors and diluting potential associations. Therefore, we additionally propose a weighted average imputation of the predictor using linkage probabilities as weights to weather false matches and increase the contribution of true matches. More specifically, for the $j$th subject in database B and a given threshold $\rho$, we identify subjects in database A with linkage probabilities $\hat{\pi}_{ij}$ above $\rho$ and obtain a weighted average of their $X$'s as the linked predictor:

$$\hat{X}_{\rho j} = \frac{\sum_{i=1}^{n_A} \pi_{ij} I(\pi_{ij} \ge \rho) X_i}{\sum_{i=1}^{n_A} \pi_{ij} I(\pi_{ij} \ge \rho)}, \quad j = 1, \dots, n_B$$

If $\max_{1 \le i \le n_A} \pi_{ij} < \rho$, then the predictor will remain missing for the $j$th subject. We assume that those remaining missing values are Missing At Random.

Effect size estimation using GLM
Our goal is to assess the association between $X$ and $Y$ using the linked database B through a GLM with link function $g$:

$$E(Y \mid X, W) = g(W'\alpha_0 + X'\theta_0) = g(Z'\beta_0)$$

where $Z = (W', X')'$ and $\beta = (\alpha', \theta')'$. For simplicity, we focus on the downstream association testing for $H_0: \theta_0 = 0$ using linked data $D_B(\rho) = \{(Y_j, \hat{Z}_{\rho j}) : \xi_j(\rho) = 1, j = 1, \dots, n_B\}$, where $\hat{Z}_{\rho j} = (W_j', \hat{X}_{\rho j}')'$. We fit the GLM $Y_j \sim \hat{Z}_{\rho j}$ to $D_B(\rho)$ to obtain a maximum likelihood estimate for $\beta$, denoted as $\hat{\beta}_\rho = (\hat{\alpha}_\rho', \hat{\theta}_\rho')'$, and test for $H_0: \theta_0 = 0$ based on the corresponding $\hat{\theta}_\rho$.
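The two imputation rules described under "Variable imputation" can be sketched in a few lines of plain Python. This is only an illustrative sketch for a scalar predictor and is not the implementation in the “ludic” package; the function names `weighted_average_impute` and `best_match_impute`, the nested-list layout of `pi`, and the toy numbers are assumptions made for the example, and the threshold is assumed to be strictly positive.

```python
import math


def weighted_average_impute(pi, x, rho):
    """Weighted average imputation of a scalar predictor (hypothetical helper).

    pi[i][j]: matching probability between subject i in A and subject j in B
    x[i]:     predictor observed for subject i in A
    rho:      matching threshold, assumed > 0
    Returns (x_hat, xi); x_hat[j] is NaN and xi[j] is 0 when no pi[i][j] >= rho.
    """
    n_a, n_b = len(pi), len(pi[0])
    x_hat, xi = [], []
    for j in range(n_b):
        # weights pi_ij * I(pi_ij >= rho)
        weights = [pi[i][j] if pi[i][j] >= rho else 0.0 for i in range(n_a)]
        total = sum(weights)
        if total > 0:
            xi.append(1)
            x_hat.append(sum(w * xv for w, xv in zip(weights, x)) / total)
        else:
            xi.append(0)  # predictor stays missing for subject j
            x_hat.append(math.nan)
    return x_hat, xi


def best_match_impute(pi, x, rho):
    """Best match imputation: take x from the single most probable match."""
    n_a, n_b = len(pi), len(pi[0])
    x_hat = []
    for j in range(n_b):
        best_i = max(range(n_a), key=lambda i: pi[i][j])
        x_hat.append(x[best_i] if pi[best_i][j] >= rho else math.nan)
    return x_hat


# Toy example: 2 subjects in A, 3 subjects in B.
pi = [[0.95, 0.20, 0.60],
      [0.03, 0.10, 0.70]]
x = [1.0, 3.0]
x_hat, xi = weighted_average_impute(pi, x, rho=0.5)
```

In the toy example, the third subject in B has two candidate matches above the threshold, so the weighted average blends both values of the predictor, whereas the best match rule would keep only the value from the candidate with probability 0.70. The second subject in B has no candidate above the threshold and stays missing under both rules.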
The ATLAS threshold combination test: significance testing with optimal P value combination
Various threshold values can be used for $\rho$. For instance, one could use $\rho = 0.5$, where indicated matches have a higher probability of being a match than a non-match, or $\rho = 0.9$, where indicated matches have higher certainty of being true matches compared to using $\rho = 0.5$. However, the optimal choice of such a threshold is unclear in practice due to the lack of gold standard labels on the true mappings between A and B. On the one hand, higher thresholds have lower estimation biases from fewer false matches but at the price of decreased statistical power from smaller sample sizes. On the other hand, lower thresholds exhibit higher statistical power at the expense of increased estimation bias. Thus, instead of arbitrarily choosing a threshold $\rho$, we propose to optimally combine several P values that correspond to different matching thresholds $\{\rho_l, l = 1, \dots, L\}$, thereby automating the significance testing process and preserving statistical power in various settings. Specifically, we propose to obtain a P value via a $\chi^2$ test based on $\hat{\theta}_{\rho_l}$ for the threshold $\rho_l$,

$$\hat{p}_{\rho_l} = P\left(\chi^2_p \ge \hat{\theta}_{\rho_l}' \hat{\Sigma}_{\rho_l}^{-1} \hat{\theta}_{\rho_l}\right)$$

and construct a combined test statistic as:

$$\hat{\gamma} = \sum_{l=1}^{L} -\log(\hat{p}_{\rho_l})$$

to calculate the final P value for the testing of $H_0: \theta_0 = 0$. For a given $\rho_l$, we may obtain the P value $\hat{p}_{\rho_l} = P(\chi^2_p \ge \hat{\theta}_{\rho_l}' \hat{\Sigma}_{\rho_l}^{-1} \hat{\theta}_{\rho_l})$ via a $p$-degree of freedom $\chi^2$ test, where $\hat{\Sigma}_{\rho_l}$ is the estimated variance–covariance matrix of $\hat{\theta}_{\rho_l}$. Since $\hat{\theta}_{\rho_l}, l = 1, \dots, L$ are estimated using overlapping data, the test statistics $\hat{\theta}_{\rho_l}' \hat{\Sigma}_{\rho_l}^{-1} \hat{\theta}_{\rho_l}, l = 1, \dots, L$ are highly correlated with each other. Thus, to estimate the null distribution of $\hat{\gamma}$, we use a perturbation resampling strategy to account for the correlations.
Specifically, we generate a vector of $n_B$ standard Gaussian random variables $G = (G_1, \dots, G_{n_B})'$ and subsequently obtain a perturbed random vector as $\hat{\theta}_{\rho_l}[G] = \sum_{j=1}^{n_B} \xi_j(\rho_l) \hat{S}_{\theta j}(\rho_l) G_j$, where

$$\hat{S}_j(\rho) = \left(\hat{S}_{\alpha j}(\rho)', \hat{S}_{\theta j}(\rho)'\right)' = J(\rho)^{-1} \hat{Z}_{\rho j} \{Y_j - g(\hat{\beta}_\rho' \hat{Z}_{\rho j})\},$$

$J(\rho) = \sum_{j=1}^{n_B} \xi_j(\rho) \hat{Z}_{\rho j} \hat{Z}_{\rho j}' \dot{g}(\hat{\beta}_\rho' \hat{Z}_{\rho j})$ is the Fisher information matrix for a given $\rho$, and $\dot{g}(x) = dg(x)/dx$.16,17 Subsequently, we obtain the perturbed counterpart of $\hat{\gamma}$ as $\hat{\gamma}[G] = \sum_{l=1}^{L} -\log \hat{p}_{\rho_l}[G]$, where

$$\hat{p}_{\rho_l}[G] = P\left(\chi^2_p \ge \hat{\theta}_{\rho_l}[G]' \hat{\Sigma}_{\rho_l}^{-1} \hat{\theta}_{\rho_l}[G]\right).$$

We then generate $R$ realizations of $G$, $\{G_r, r = 1, \dots, R\}$, to obtain the final P value for testing $H_0: \theta_0 = 0$ as:

$$\hat{p}_{\text{combined}} = 1 - \frac{1}{R} \sum_{r=1}^{R} 1\{\hat{\gamma} \ge \hat{\gamma}[G_r]\}$$

Data and metrics for evaluation
We evaluated the performance of ATLAS in simulation studies and conducted a real-world genetic association study using EHR data that has been linked to a biorepository.

Simulation study
We estimated the type I error rate (statistical size) and empirical statistical power of ATLAS using the deidentified, publicly available “RA2” dataset in the “ludic” CRAN package, with $N = 5707$ patients extracted from the Mass General Brigham (MGB) Research Patient Data Registry (RPDR) in 2010; this dataset contained 1342 binarized International Classification of Disease (ICD) codes.15 We studied ATLAS performance under settings created by (1) perturbing databases with multivariate noise to create discrepancies between patient records illustrative of real-world scenarios, like missing information or administrative recording errors; (2) decreasing the average codes per patient record to simulate poor linkage conditions with noisy linkage probabilities; and (3) using varying strengths of association, namely odds ratios (OR) of 1.5 and 2.
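The threshold combination test defined earlier in this section can be illustrated with a small, standard-library-only sketch for a scalar predictor ($p = 1$). It is not the authors' implementation: the function name `threshold_combination_test`, its argument layout, and the guard against taking the logarithm of zero are assumptions made for the example. The inputs are assumed to have been computed beforehand: `theta_hat[l]` and `var_hat[l]` are the estimate and its estimated variance at threshold $\rho_l$, and `influence[l][j]` holds $\xi_j(\rho_l) \hat{S}_{\theta j}(\rho_l)$ (zero for unlinked subjects).

```python
import math
import random


def chi2_1_sf(x):
    """Survival function of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def neg_log_p(theta, var):
    """-log of the 1-df chi-square P value for theta^2 / var."""
    p = chi2_1_sf(theta * theta / var)
    return -math.log(max(p, 1e-300))  # guard against log(0) for extreme statistics


def threshold_combination_test(theta_hat, var_hat, influence,
                               n_resamples=1000, seed=0):
    """Combined P value across thresholds via perturbation resampling (sketch)."""
    rng = random.Random(seed)
    n_b = len(influence[0])
    # Observed statistic: gamma_hat = sum_l -log(p_hat_l)
    gamma_obs = sum(neg_log_p(t, v) for t, v in zip(theta_hat, var_hat))
    exceed = 0
    for _ in range(n_resamples):
        # One Gaussian vector G is shared by all thresholds, which preserves
        # the correlation between the per-threshold statistics.
        g = [rng.gauss(0.0, 1.0) for _ in range(n_b)]
        gamma_pert = 0.0
        for s_l, v in zip(influence, var_hat):
            theta_pert = sum(s * gj for s, gj in zip(s_l, g))
            gamma_pert += neg_log_p(theta_pert, v)
        if gamma_pert > gamma_obs:
            exceed += 1
    # Same as 1 - (1/R) * sum_r 1{gamma_hat >= gamma_hat[G_r]}
    return exceed / n_resamples


# Toy inputs: 2 thresholds, 4 subjects in B; the stricter threshold links only 2.
influence = [[0.5, 0.5, 0.5, 0.5],
             [0.5, 0.5, 0.0, 0.0]]
var_hat = [1.0, 0.5]
p_null = threshold_combination_test([0.0, 0.0], var_hat, influence, n_resamples=200)
p_signal = threshold_combination_test([10.0, 7.5], var_hat, influence, n_resamples=200)
```

With estimates of exactly zero the observed statistic is zero, so every perturbed statistic exceeds it and the combined P value is 1. With estimates many standard errors from zero, no perturbed statistic exceeds the observed one and the combined P value is 0 at this resampling resolution.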
To simulate upstream linkage, we used 2 linkage algorithms with different methods of estimating $\pi_{ij}$ to study linkage algorithm compatibility with ATLAS. Specifically, we considered: (i) ludic, a published algorithm which relies on Bayesian modeling of binary diagnosis codes, and (ii) embeddingMatch, which calculates cosine similarities between patient-level embeddings similar to other previous approaches.15,18–20 Further details regarding the embeddingMatch method and the simulation model are in the Supplementary Appendix. Type I error rates using single cutoff thresholds and for the ATLAS threshold combination test were estimated as the proportion of P values less than the nominal testing level $\alpha = 0.05$ under no simulated association. Similarly, empirical power was estimated as the proportion of significant P values at $\alpha = 0.05$ under simulated association for a given OR > 1. Results are based on $n = 1500$ simulations in each setting, and thresholds used in the ATLAS threshold combination test include $\rho \in \{0.1, 0.3, \dots, 0.9\}$. For the sake of simplicity, we report ATLAS results at single cutoff thresholds of 0.1, 0.5, and 0.9, and report the rest in the Supplementary Appendix.

Genetic association study using real-world biorepository data
To further validate performance and demonstrate real-world utility of ATLAS, we conducted a genetic association study with linked data to assess the association between RA genetic risk score (GRS) and clinical outcomes among rheumatoid arthritis (RA) patients.
We considered EHR records from 2 databases with which we performed linkage and the linked data-enabled downstream association analyses: (i) the Crimson Clinical Discards database (herein referred to as Crimson and akin to database A) for a subset of RA patients of European descent previously identified in 2008; and (ii) the MGB RPDR (akin to database B) subset of RA patients identified via an existing machine learning algorithm.21–25 The Crimson RA cohort contains anonymized EHR data along with genotype data for $n_{\text{Crimson}} = 1284$ patients collected up to 2008. The RPDR RA cohort contains full EHR data up to 2017 for a total of $n_{\text{RPDR}} = 12\,838$ patients, and this dataset is unrelated to the deidentified “RA2” dataset used in simulations. A subset of $n_{\text{biobank}} = 1270$ patients in RPDR already report genotype data because they belong to the MGB Biobank, and we have gold standard labels on the true mappings between RPDR and MGB Biobank records.26,27 Our goal is to link Crimson with RPDR to enable genetic association studies of RA GRS in Crimson with various clinical outcomes in RPDR using all available data. Figure 2 provides a schematic of our analyses. Details about how RA GRS were constructed are reported in the Supplementary Appendix.22,28,29

Figure 2. Schematic of real-world rheumatoid arthritis genetic association studies conducted using the MGB Biobank and the Crimson-linked data.

To perform record linkage, we assembled all available ICD codes in Crimson records and ICD codes recorded prior to and up to 2008 in RPDR records.
ICD codes were then aggregated to PheCodes, and a vector of 1542 binary matching features was created for each patient record, where each feature indicates the presence or absence of a PheCode.30,31 Next, we performed PRL using ludic to estimate the probability of being a match for every possible pair of RPDR and Crimson records. Since the Crimson cohort consists of RA patients who were previously managed at MGB, we anticipated that a majority of these patients could be linked to the updated RPDR RA cohort. Further, there are 213 patients with genotype data in both MGB Biobank and Crimson. In the absence of gold standard labels on the true mappings between RPDR and Crimson records, we used these 213 overlapping patients to validate the accuracy of the linkage by assessing the concordance between RA GRS from MGB Biobank and those from Crimson. The 213 overlapping patients were subsequently excluded from the linked data study.

Once Crimson records were linked to RPDR RA records, we imputed GRS values from Crimson for the subjects in the RPDR RA cohort using the weighted average method at thresholds ρ_l ∈ {0.5, 0.6, …, 0.9}. Subsequently, we used ATLAS to conduct multivariate association studies of linked RA GRS and clinical outcomes while adjusting for patient age, gender, and healthcare utilization (defined as log[1 + total encounters]). Clinical outcomes from RPDR include laboratory biomarkers commonly used to assess patient inflammation and binary phenotypes for pyogenic arthritis and gout, which are distinct nonautoimmune forms of arthritis. These binary phenotypes were defined as having at least 2 PheCodes corresponding to these disorders and were constructed using ICD codes recorded up to 2017 in RPDR.30,31 Reported effect sizes were estimated using data imputed at a ρ=0.9 threshold.
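To make the imputation step concrete, the following is a minimal sketch and not the authors' implementation. It assumes that, for each RPDR record, the imputed GRS is the average of Crimson GRS values weighted by match probability, restricted to candidate pairs whose probability is at least the threshold ρ, and that records with no candidate above ρ are left missing. The exact weighting used by ATLAS is defined elsewhere in the paper rather than in this passage, and the function name `weighted_average_impute` and the toy inputs are hypothetical.

```python
import numpy as np

def weighted_average_impute(pi, x, rho):
    """Impute x for each record in database B from records in database A.

    pi  : (n_A, n_B) array of match probabilities (assumed to come from a linkage tool)
    x   : (n_A,) array of the variable observed only in A (e.g., a GRS)
    rho : threshold; pairs with pi < rho contribute nothing (assumed rule)
    Returns an (n_B,) array with NaN where no pair reaches rho.
    """
    pi = np.asarray(pi, dtype=float)
    x = np.asarray(x, dtype=float)
    w = np.where(pi >= rho, pi, 0.0)   # keep only sufficiently likely pairs
    denom = w.sum(axis=0)              # total retained weight per B record
    numer = w.T @ x                    # probability-weighted sum of x
    out = np.full(pi.shape[1], np.nan)
    ok = denom > 0
    out[ok] = numer[ok] / denom[ok]
    return out

# Toy example: 3 records in A, 2 records in B (all numbers are made up)
pi = np.array([[0.95, 0.02],
               [0.03, 0.40],
               [0.02, 0.55]])
grs = np.array([1.0, 2.0, 4.0])

# One imputed vector per threshold in the grid used for the real-world study
imputed = {rho: weighted_average_impute(pi, grs, rho)
           for rho in (0.5, 0.6, 0.7, 0.8, 0.9)}
```

In this toy input the second B record has no candidate at or above 0.6, so it is imputed only at ρ=0.5; this is the sense in which stricter thresholds trade sample size for match quality.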
Using RPDR patients who already belong to the MGB Biobank, we replicated these multivariate association studies and conducted a meta-analysis using Fisher’s method to combine the P values estimated from the MGB Biobank and Crimson linked-data studies, to demonstrate increased power to detect associations when linking databases. To determine statistical significance after meta-analysis, we accounted for multiple testing by adjusting P values to control the false discovery rate at 5% using the Benjamini-Hochberg procedure.32

Benchmark methodology for comparison

We additionally considered the bias-correcting estimators for linked data proposed by Han et al as benchmark approaches.14 The 3 Han et al estimators are “Han F” (for “Full,” which considers all possible pairs for each patient in A), “Han M” (for “Max,” which considers the largest probability for each patient in A), and “Han M2” (for “Max 2,” which considers the 2 largest probabilities for each patient in A).14 We report type I error rates and empirical power of these estimators from simulations. We were unable to replicate the real-world multivariate analyses using Han et al estimators, as they do not accommodate covariates from the same database that provides the outcomes. To compare performance in the real-world setting, we therefore conducted univariate analyses.

RESULTS

Type I error control in simulations

ATLAS effectively controlled type I error in all simulation settings and at all considered thresholds, regardless of the linkage or downstream imputation method used (Figure 3). Han et al’s estimators also controlled type I error, although they appear somewhat conservative.

Figure 3. Comparison of type I error rates of ATLAS and Han et al estimators in simulation settings with different noise levels and average codes per patient record (simulations under H0). ATLAS type I error rates are reported for several single cutoff thresholds and the ATLAS threshold combination test.
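Because each reported type I error rate is a proportion over a finite number of simulated datasets, some deviation from the nominal level is expected from Monte Carlo noise alone. The sketch below shows the estimate and its binomial standard error for n=1500 replicates at α=0.05. The helper names are hypothetical, and the P values are simulated uniform draws for illustration, not output from the study.

```python
import math
import random

def rejection_rate(p_values, alpha=0.05):
    """Proportion of P values below alpha (type I error under H0, power under H1)."""
    return sum(p < alpha for p in p_values) / len(p_values)

def monte_carlo_se(rate, n_sim):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_sim)

n_sim, alpha = 1500, 0.05

# Sampling uncertainty of the estimate when the true rate equals the nominal level
se_nominal = monte_carlo_se(alpha, n_sim)
band = (alpha - 1.96 * se_nominal, alpha + 1.96 * se_nominal)

# Illustration only: P values from a perfectly calibrated test are Uniform(0, 1)
rng = random.Random(0)
p_null = [rng.random() for _ in range(n_sim)]
estimated = rejection_rate(p_null, alpha)
```

Under these assumptions the standard error is about 0.0056, so an estimated rate between roughly 0.039 and 0.061 is compatible with exact control at 0.05, while rates clearly below that band would point to a conservative test.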
Statistical power evaluation

When patient records had on average 16 codes, power at different single-cutoff thresholds was similar (Figure 4). When patient records had on average 4 codes, more pronounced differences in power between single-cutoff thresholds were observed. Relative power between thresholds further depended on the linkage method. Across all settings, the ATLAS threshold combination test either achieved the highest power or at least matched the highest power among ATLAS single-cutoff thresholds (Figure 4). Larger simulated ORs increased power across all thresholds. Empirical power using weighted average imputation was similar to power using best match imputation, although differences in power based on imputation strategy were observed in certain settings when using embeddingMatch (Supplementary Figure 1). Simulations using the identity link of the GLM and continuous outcomes showed similar ATLAS performance (Supplementary Figure 2). In comparison, the estimators proposed by Han et al did not capture signal as effectively, most likely due to noisy linkage probabilities generated from PRL.

Figure 4. Comparison of empirical power of ATLAS and Han et al estimators in simulation settings with different effect sizes and average codes per patient record (simulations under H1). Results were generated using the best match imputation method. ATLAS power is reported for several single cutoff thresholds and the ATLAS threshold combination test.

We then evaluated ATLAS performance after creating false matches between databases to simulate linkage errors. In doing this, we sought to mimic real-world scenarios where linkage algorithms cannot discern between many pairs of similar patient records. ATLAS successfully controlled type I error in this setting (Supplementary Figure 2). Weighted average imputation yielded significantly higher power than best match imputation, and the ATLAS threshold combination test again demonstrated good power relative to single cutoff thresholds (Figure 5). Han et al’s M2 estimator yielded comparable power in the context of false matches when using ludic, but was generally outperformed by the ATLAS threshold combination test.

Figure 5. Comparison of empirical power of ATLAS and Han et al estimators in the presence of false matches between databases (simulations under H1). Simulated databases report on average 16 codes per patient record.

Real-world genetics study: association of clinical outcomes with rheumatoid arthritis risk alleles among rheumatoid arthritis patients

We performed PRL on 12 838 patient records from RPDR and 1284 patient records from Crimson. At a conservative threshold of ρ=0.9, we identified 1157 matching patient records between RPDR and Crimson.
For the 213 patients who have been genotyped in both MGB Biobank and Crimson, Spearman’s correlation coefficient between the MGB Biobank and Crimson GRS values was estimated to be 0.83 (95% CI: 0.77–0.87, P<.001), suggesting high concordance of genetic data and reliable linkage quality.

Univariate association study results using both ATLAS and Han et al’s estimators are reported in Supplementary Tables 1 and 2. As a positive control, we replicated the known association of anti-citrullinated protein antibody (ACPA) levels with RA GRS using ATLAS and Han et al’s estimators.33,34 In general, ATLAS yielded larger effect size estimates and smaller P values than any of Han et al’s estimators, supporting simulation results that ATLAS is more powerful at detecting associations. For example, for log-transformed rheumatoid factor levels, ATLAS estimated β_GRS = 0.15 with P<.001 while Han et al’s M2 estimator estimated β_GRS = 0.14 with P=.09.

Multivariate association study results using the MGB Biobank and the Crimson linked data study are presented in Table 1 for biomarkers and Table 2 for phenotypes. Effect sizes estimated from both studies were generally concordant. Figure 6 visualizes the difference in adjusted P values between using only the MGB Biobank cohort, for which gold standard mappings were already available, and after incorporating additional RA patients with genotype data through the Crimson linked cohort. We demonstrated improved power to detect associations when incorporating linked data, as meta-analysis yielded 2 additional significant associations (namely log-transformed erythrocyte sedimentation rate and C-reactive protein level). Further, the average unadjusted -log10(P value) among statistically significant outcomes was 6.45 in the MGB Biobank study and 9.65 after meta-analysis with the Crimson linked data study, again demonstrating the potential to increase power when incorporating additional data by linking databases.

Figure 6.
Logarithm-transformed P values from the genetic association study using only RA patients with previously available genotype data at MGB Biobank and after incorporating additional RA patients with genotype data through the Crimson-linked cohort.

Table 1. Multivariate association study results of RA GRS and patient biomarkers for the MGB Biobank RA cohort and the Crimson-linked RA cohort

Biomarker | Beta (SE), MGB Biobank | Beta (SE), Crimson linked: ATLAS 0.9 threshold | P value, MGB Biobank | P value, Crimson linked: ATLAS combination | Two-study P value
Anti-citrullinated protein antibodies (log) | 0.39 (0.04) | 0.31 (0.04) | 9.39e-30 | 0.00e+00 | 8.54e-38
Rheumatoid factor (log) | 0.12 (0.03) | 0.16 (0.03) | 2.20e-06 | 0.00e+00 | 8.14e-15
Erythrocyte sedimentation rate | 0.49 (0.58) | 1.27 (0.55) | 3.91e-01 | 1.00e-02 | 2.56e-02
C-reactive protein (log) | 0.04 (0.02) | 0.04 (0.02) | 7.80e-02 | 3.00e-02 | 1.65e-02

Crimson-linked RA cohort effect sizes were estimated at a stringent imputation threshold of 0.9, and P values were estimated using the ATLAS combination test.

Table 2. Multivariate association study results of RA GRS and binary phenotypes for the MGB Biobank RA cohort and the Crimson-linked RA cohort

Phenotype | OR (SE), MGB Biobank | OR (SE), Crimson linked: ATLAS 0.9 threshold | P value, MGB Biobank | P value, Crimson linked: ATLAS combination | Two-study P value
Pyogenic arthritis | 1.41 (0.11) | 1.40 (0.18) | 1.64e-03 | 9.57e-02 | 1.12e-03
Gout and other crystal arthropathies | 0.85 (0.07) | 0.85 (0.08) | 2.82e-02 | 2.43e-02 | 5.68e-03

Crimson-linked RA cohort effect sizes were estimated at a stringent imputation threshold of 0.9, and P values were estimated using the ATLAS combination test.

DISCUSSION

The tremendous amount of biomedical data becoming available for research has led to great interest in and demand for linkage of databases. Inference using linked data must acknowledge and mitigate the bias in estimated effect sizes that is created by linkage errors while retaining good statistical power. To this end, we propose ATLAS as a supervised, robust, flexible, and scalable method that tests for association between variables originally belonging to separate databases. We demonstrate that ATLAS is a valid method that effectively controls for type I error regardless of linkage or imputation method used, and that ATLAS is more powerful than benchmark methods in a range of linkage and inference settings.
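As a check on how the two-study P values in Table 1 relate to the per-study P values, Fisher's method for independent P values has a closed form that is easy to evaluate by hand or in code. The sketch below is an illustration only, with a hypothetical function name, and it uses the rounded P values exactly as printed in Table 1 rather than any unrounded study output.

```python
import math

def fisher_combine(p_values):
    """Fisher's method for k independent P values.

    X = -2 * sum(log p) follows a chi-square distribution with 2k degrees of
    freedom under the null. For even degrees of freedom the upper tail has
    the closed form exp(-X/2) * sum_{j<k} (X/2)^j / j!.
    """
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)   # this is X / 2
    tail = sum(half_x ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_x) * tail

# Rounded P values as printed in Table 1: (MGB Biobank, Crimson linked)
esr = fisher_combine([3.91e-01, 1.00e-02])   # erythrocyte sedimentation rate
crp = fisher_combine([7.80e-02, 3.00e-02])   # C-reactive protein (log)
```

With these inputs the function returns about 0.0256 for erythrocyte sedimentation rate and about 0.0165 for C-reactive protein, in line with the two-study column of Table 1. P values printed as 0.00e+00 cannot be passed in as printed, since the logarithm of zero is undefined.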
We demonstrate in simulation studies that weighted average imputation of missing variables not only protects against type I error in downstream inference but also preserves statistical power to detect associations in the presence of linkage errors, making it the preferable imputation method to use in future studies. The weighted average imputation method propagates the uncertainty contained in probabilistic linkage to downstream inference to reduce bias and increase power. While certain aspects of it resemble multiple imputation and inverse probability weighting, the weighted average imputation additionally uses thresholding to impute only when the linkage is deemed of sufficient quality (ie, above the threshold).35 When using the weighted average imputation, users should be aware that lower thresholds combined with linkage algorithms like embeddingMatch may yield lower power when patient records contain few matching features, as seen in Supplementary Figure 1. We also observed that using fewer matching features led to biased estimates of πij, which subsequently decreased the power of ATLAS. For example, the average πij among true matches with 16 and 4 codes per record was 0.86 and 0.22, respectively. The resulting power of the ATLAS combination test for 16 and 4 codes per record was 0.86 and 0.22, respectively.

During missing variable imputation, instead of selecting an ad hoc threshold, ATLAS optimally combines P values originating from multiple thresholds. Our results suggest 3 major advantages to the ATLAS threshold combination test: (1) avoids arbitrarily choosing a threshold, thus automating the record linkage process; (2) reduces estimation bias by combining P values estimated from data imputed at different thresholds; and (3) preserves good statistical power regardless of which threshold performs best in a given setting.
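One way to picture the threshold combination step is as Fisher's statistic computed across thresholds, with its null distribution obtained by resampling rather than from a chi-square table, since P values computed from the same records at different thresholds are correlated. The following sketch rests on that reading and is not the authors' implementation. It takes the resampled null P values as a given input (the perturbation scheme itself is specified in the paper's methods, not here), and the function names, the correlation of 0.9, and all toy numbers are hypothetical.

```python
import math
import numpy as np

def fisher_stat(p, floor=1e-300):
    """Fisher's statistic -2 * sum(log p) along the last axis."""
    p = np.clip(np.asarray(p, dtype=float), floor, 1.0)
    return -2.0 * np.log(p).sum(axis=-1)

def combine_across_thresholds(p_obs, p_null):
    """Resampling-calibrated P value for correlated per-threshold P values.

    p_obs  : (L,) P values, one per threshold, from the observed data
    p_null : (B, L) P values from B null resamples (assumed to be supplied)
    """
    t_obs = fisher_stat(p_obs)
    t_null = fisher_stat(p_null)
    # Add-one correction keeps the estimate strictly positive
    return (1 + np.sum(t_null >= t_obs)) / (1 + len(t_null))

# Toy inputs: 5 thresholds and 2000 null resamples with correlated P values
rng = np.random.default_rng(1)
thresholds = np.arange(0.1, 1.0, 0.2)                 # 0.1, 0.3, ..., 0.9
shared = rng.normal(size=(2000, 1))                   # component common to all thresholds
noise = rng.normal(size=(2000, thresholds.size))
z_null = 0.9 * shared + math.sqrt(1 - 0.9 ** 2) * noise
p_null = np.vectorize(math.erfc)(np.abs(z_null) / math.sqrt(2))   # two-sided P values

p_obs = np.array([0.04, 0.01, 0.008, 0.02, 0.03])     # made-up observed P values
p_comb = combine_across_thresholds(p_obs, p_null)
```

Referring Fisher's statistic to a chi-square distribution with 2L degrees of freedom would assume independent P values, which is why the sketch calibrates against resamples instead.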
When selecting thresholds for the ATLAS threshold combination test, we suggest as a rule of thumb using no more than 10 thresholds at a time that are at least 0.05 apart, to preserve statistical power.

These attractive features of ATLAS enable robust performance in real-world settings, and we demonstrated its utility in our real-world genetic association study with data unrelated to the simulation data. When meta-analyzing results obtained from MGB Biobank (for which genotype data were already available) and the newly linked Crimson cohort, we were able to increase power and detect 2 more associations (namely CRP and ESR) than when using MGB Biobank alone. This illustrates the potential to discover new associations with ATLAS when incorporating linked data at no additional cost and without further data collection efforts. We further validated the new association of RA GRS with CRP using the UK Biobank (β = 0.14; SE = 0.02; P<.001). ESR is not reported in the UK Biobank, but our results are consistent with previous studies that reported associations of RA genetic risk alleles with higher values of CRP and ESR.36–43 Additional studies are needed to further validate these associations.

CONCLUSION

In this article, we introduce ATLAS, an automated and flexible algorithm that conducts robust inference using probabilistically linked databases. The ATLAS threshold combination test exhibits high power to detect associations in a range of simulation settings while controlling for type I error, and it exhibits substantial improvement in power and flexibility over existing inference methods for linked databases. Thus, ATLAS promises to enable novel and powerful research studies using linked data.

FUNDING

This work was supported in part by the US National Institutes of Health Grant U54-HG007963.
AUTHOR CONTRIBUTIONS

All authors made substantial contributions to: conception and design; acquisition, analysis, and interpretation of data; drafting the article or revising it critically; and final approval of the version to be published.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

CONFLICT OF INTEREST STATEMENT

None declared.

DATA AVAILABILITY STATEMENT

The “RA2” dataset is in the “ludic” package on CRAN. The data used in the real-world study cannot be shared publicly as they are only available to authorized MGB investigators.

REFERENCES

1 Kohane IS, Churchill SE, Murphy SN. A translational engine at the national scale: informatics for integrating biology and the bedside. J Am Med Inform Assoc 2012; 19 (2): 181–5.
2 Butte AJ. Translational bioinformatics: coming of age. J Am Med Inform Assoc 2008; 15 (6): 709–14.
3 Jiao Y, Lesueur F, Azencott CA, et al. A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers. BMC Med Res Methodol 2021; 21 (1): 155.
4 Gutman R, Afendulis CC, Zaslavsky AM. A Bayesian procedure for file linking to analyze end-of-life medical costs. J Am Stat Assoc 2013; 108 (501): 34–47.
5 Neter J, Maynes ES, Ramanathan R. The effect of mismatching on the measurement of response errors. J Am Stat Assoc 1965; 60: 1005–27.
6 Rentsch CT, Harron K, Urassa M, et al. Impact of linkage quality on inferences drawn from analyses using data with high rates of linkage errors in rural Tanzania. BMC Med Res Methodol 2018; 18 (1): 165.
7 Moore CL, Amin J, Gidding HF, et al. A new method for assessing how sensitivity and specificity of linkage studies affects estimation. PLoS One 2014; 9 (7): e103690.
8 Harron K, Goldstein H, Wade A, et al. Linkage, evaluation and analysis of national electronic healthcare data: application to providing enhanced blood-stream infection surveillance in paediatric intensive care. PLoS One 2013; 8 (12): e85278.
9 Schmidlin K, Clough-Gorr KM, Spoerri A, et al.; Swiss National Cohort. Impact of unlinked deaths and coding changes on mortality trends in the Swiss national cohort. BMC Med Inform Decis Mak 2013; 13: 1–11.
10 Doidge JC, Harron KL. Reflections on modern methods: linkage error bias. Int J Epidemiol 2019; 48 (6): 2050–60.
11 Hof MHP, Zwinderman AH. Methods for analyzing data from probabilistic linkage strategies based on partially identifying variables. Stat Med 2012; 31 (30): 4231–42.
12 Chipperfield J. A weighting approach to making inference with probabilistically linked data. Stat Neerland 2019; 73 (3): 333–50.
13 Dalzell NM, Reiter JP. Regression modeling and file matching using possibly erroneous matching variables. J Comput Graph Stat 2018; 27 (4): 728–38.
14 Han Y, Lahiri P. Statistical analysis with linked data. Int Stat Rev 2019; 87 (S1): S139–57.
15 Hejblum BP, Weber GM, Liao KP, et al. Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes. Sci Data 2019; 6: 180298.
16 Jin Z, Ying Z, Wei L. A simple resampling method by perturbing the minimand. Biometrika 2001; 88 (2): 381–90.
17 Minnier J, Tian L, Cai T. A perturbation method for inference on regularized regression estimates. J Am Stat Assoc 2011; 106 (496): 1371–82.
18 Bonomi L, Xiong L, Chen R, et al. Frequent grams based embedding for privacy preserving record linkage. In: Proceedings of the 21st ACM international conference on information and knowledge management. New York, NY: Association for Computing Machinery; 2012: 1597–601.
19 Adly N. Efficient record linkage using a double embedding scheme. In: DMIN. Las Vegas, NV; July 13–16, 2009: 274–81.
20 Shi X, Li X, Cai T. Spherical regression under mismatch corruption with application to automated knowledge translation. J Am Stat Assoc 2020; 1–12.
21 Boutin N, Holzbach A, Mahanta L, et al. The information technology infrastructure for the translational genomics core and the partners biobank at partners personalized medicine. J Pers Med 2016; 6 (1): 6.
22 Kurreeman F, Liao K, Chibnik L, et al. Genetic basis of autoantibody positive and negative rheumatoid arthritis risk in a multi-ethnic cohort derived from electronic health records. Am J Hum Genet 2011; 88 (1): 57–69.
23 Nalichowski R, Keogh D, Chueh HC, et al. Calculating the benefits of a research patient data repository. In: AMIA Annu Symp Proc 2006; Washington, DC; 2006: 1044.
24 Liao KP, Cai T, Gainer V, et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care Res (Hoboken) 2010; 62 (8): 1120–7.
25 Huang S, Huang J, Cai T, et al. Impact of ICD10 and secular changes on electronic medical record rheumatoid arthritis algorithms. Rheumatology (Oxford) 2020; 59 (12): 3759–66.
26 Karlson EW, Boutin NT, Hoffnagle AG, et al. Building the partners healthcare biobank at partners personalized medicine: Informed consent, return of research results, recruitment lessons and operational considerations. J Pers Med 2016; 6: 2.
27 Gainer VS, Cagan A, Castro VM, et al. The biobank portal for partners personalized medicine: a query tool for working with consented biobank samples, genotypes, and phenotypes using i2b2. J Pers Med 2016; 6: 11.
28 Okada Y, Wu D, Trynka G, et al.; GARNET consortium. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 2014; 506 (7488): 376–81.
29 Raychaudhuri S, Sandor C, Stahl EA, et al. Five amino acids in three HLA proteins explain most of the association between MHC and seropositive rheumatoid arthritis. Nat Genet 2012; 44 (3): 291–6.
30 Denny JC, Ritchie MD, Basford MA, et al. PheWAS: Demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics 2010; 26 (9): 1205–10.
31 Wei W-Q, Bastarache LA, Carroll RJ, et al. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLoS One 2017; 12 (7): e0175508.
32 Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B (Methodol) 1995; 57: 289–300.
33 Aggarwal R, Liao K, Nair R, et al. Anti-citrullinated peptide antibody (ACPA) assays and their role in the diagnosis of rheumatoid arthritis. Arthritis Rheum 2009; 61 (11): 1472–83.
34 Liao KP, Kurreeman F, Li G, et al. Associations of autoantibodies, autoimmune risk alleles, and clinical diagnoses from the electronic medical records in rheumatoid arthritis cases and non–rheumatoid arthritis controls. Arthritis Rheum 2013; 65 (3): 571–81.
35 Seaman SR, White IR, Copas AJ, et al. Combining multiple imputation and inverse-probability weighting. Biometrics 2012; 68 (1): 129–37.
36 Alemao E, Guo Z, Burns L, et al. Evaluation of the association between C-reactive protein and anti-citrullinated protein antibody in rheumatoid arthritis: analysis of two clinical practice data sets [abstract]. Arthritis Rheumatol 2016; 68 (suppl 10): 1226.
37 Pope JE, Choy EH. C-reactive protein and implications in rheumatoid arthritis and associated comorbidities. Semin Arthritis Rheum 2021; 51 (1): 219–29.
38 Plant MJ, Williams AL, O'Sullivan MM, et al. Relationship between time-integrated C-reactive protein levels and radiologic progression in patients with rheumatoid arthritis. Arthritis Rheum 2000; 43 (7): 1473–7.
39 Dessein PH, Joffe BI, Stanwix AE. High sensitivity C-reactive protein as a disease activity marker in rheumatoid arthritis. J Rheumatol 2004; 31 (6): 1095–7.
40 Wolfe F. Comparative usefulness of C-reactive protein and erythrocyte sedimentation rate in patients with rheumatoid arthritis. J Rheumatol 1997; 24 (8): 1477–85.
41 Shen R, Ren X, Jing R, et al. Rheumatoid factor, anti-cyclic citrullinated peptide antibody, C-reactive protein, and erythrocyte sedimentation rate for the clinical diagnosis of rheumatoid arthritis. Lab Med 2015; 46 (3): 226–9.
42 Amos RS, Constable TJ, Crockson RA, et al. Rheumatoid arthritis: relation of serum C-reactive protein and erythrocyte sedimentation rates to radiographic changes. Br Med J 1977; 1 (6055): 195–7.
43 Wolfe F, Pincus T. The level of inflammation in rheumatoid arthritis is determined early and remains stable over the longterm course of the illness. J Rheumatol 2001; 28: 1817–24.

Author notes

Harrison G. Zhang and Boris P. Hejblum contributed equally.

© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal

Journal of the American Medical Informatics Association, Oxford University Press

Published: Oct 5, 2021

References