Metagenomic strain detection with SameStr: identification of a persisting core gut microbiota transferable by fecal transplantation

Background: The understanding of how microbiomes assemble, function, and evolve requires metagenomic tools that can resolve microbiota compositions at the strain level. However, the identification and tracking of microbial strains in fecal metagenomes is challenging and available tools variably classify subspecies lineages, which affects their applicability to infer microbial persistence and transfer.

Results: We introduce SameStr, a bioinformatic tool that identifies shared strains in metagenomes by determining single-nucleotide variants (SNV) in species-specific marker genes, which are compared based on a maximum variant profile similarity. We validated SameStr on mock strain populations, available human fecal metagenomes from healthy individuals and newly generated data from recurrent Clostridioides difficile infection (rCDI) patients treated with fecal microbiota transplantation (FMT). SameStr demonstrated enhanced sensitivity to detect shared dominant and subdominant strains in related samples (where strain persistence or transfer would be expected) when compared to other tools, while being robust against false-positive shared strain calls between unrelated samples (where neither strain persistence nor transfer would be expected). We applied SameStr to identify strains that are stably maintained in fecal microbiomes of healthy adults over time (strain persistence) and that successfully engraft in rCDI patients after FMT (strain engraftment). Taxonomy-dependent strain persistence and engraftment frequencies were positively correlated, indicating that a specific core microbiota of intestinal species is adapted to be competitive both in healthy microbiomes and during post-FMT microbiome assembly.
We explored other use cases for strain-level microbiota profiling, as a metagenomics quality control measure and to identify individuals based on the persisting core gut microbiota.

Conclusion: SameStr provides robust identification of shared strains in metagenomic sequence data with sufficient specificity and sensitivity to examine strain persistence, transfer, and engraftment in human fecal microbiomes. Our findings identify a persisting healthy adult core gut microbiota, which should be further studied to shed light on microbiota contributions to chronic diseases.

*Correspondence: daniel.podlesny@uni-hohenheim.de; w.florian.fricke@uni-hohenheim.de
Department of Microbiome Research and Applied Bioinformatics, University of Hohenheim, Stuttgart, Germany
Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
Full list of author information is available at the end of the article

© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Podlesny et al. Microbiome (2022) 10:53

Background

Disturbances of the human gut ecosystem have been implicated in many metabolic, inflammatory, and infectious diseases, based on altered taxonomic or functional microbiota compositions in affected individuals. However, attempts to identify consistent, disease-specific microbiome markers have been less successful, as reported associations vary and have frequently not been consistent between studies [1, 2]. Among many other factors [3], taxonomic and functional variations between microbial subspecies or strains that are members of the human microbiome [4] can produce inconsistent findings, but have not been comprehensively characterized. The species Ruminococcus gnavus, for example, has been linked to inflammatory bowel diseases [5], but disease associations appear to be specific to only one of two described subspecies clades [6] and may be dependent on strain-specific variations in carbohydrate utilization [7] or pro-inflammatory polysaccharide production [8], emphasizing the need for health-related microbiome studies to focus on subspecies-level microbiota variations. Moreover, many of the ecological forces that shape microbiomes in health and disease, or after perturbation and therapeutic modulation, involve microbial interactions, such as competition, inhibition, or predation, which can be strain-dependent [9–11] and require compositional microbiota analyses to provide strain-level taxonomic resolution.

Shotgun metagenomics has the potential for a maximum phylogenetic resolution that can theoretically resolve even individual microbial genomes in a metagenomic sample [12]. Consequently, several bioinformatic methods have been introduced to identify microbial strains in metagenomes, based on the generation of metagenome-assembled genomes (MAGs, see [13], Strainberry [12], and STRONG [14]) or the mapping of individual metagenomic reads to universal (see StrainFinder [15] and mOTUs2 [16]) or taxon-specific marker genes (see StrainPhlAn [17]), or whole-genomes (see InStrain [18]) to detect phylogenetically informative, strain-specific, single nucleotide variant (SNV) profiles. Microbiota strain profiling has been successfully applied to study strain-specific adaptations to human body sites [4]; associations with individual human hosts, families, and geography [17, 19]; and transmission along the gastrointestinal tract [20], from mothers to infants [21–23] and from the donors to the recipients of fecal microbiota transplantation (FMT) [15, 24, 25]. Yet, strain-level microbiota analysis is hampered by inconsistent “strain” definitions [26, 27] and available methods exhibit variable sensitivities and specificities, which have not been comprehensively compared and validated. For example, the taxonomic classification of strains based on universal marker gene phylogenetic comparisons can produce inconsistent assignments relative to established taxonomies [15, 24]. Detection may also be limited to the dominant strain in a metagenomic sample [17] or depend on the availability of completely sequenced reference genomes for comparison [28]. Finally, non-stringent similarity thresholds can result in distinct subspecies lineages being assigned to the same strain, which is problematic if the strain is used to infer microbial persistence or transfer. In this case, for example, human intestinal microbiomes may contain the same “strain,” i.e., a subspecies lineage with widespread prevalence in the human population, without having experienced direct microbial transfer.

To address these limitations, we developed SameStr as a new tool for the detection of shared strains in metagenomic samples. SameStr leverages the StrainPhlAn approach to map metagenomic reads to clade-specific marker genes [17], which compared to other tools affords increased taxonomic resolution [29]. However, SameStr extends the detection of shared strains to subdominant members of multi-strain species populations. This is achieved by considering multiple alleles instead of the consensus sequence at polymorphic positions in the metagenomic marker gene alignments. We validated SameStr using new and available metagenomes, including temporally linked sample pairs (i.e., collected from the same individual at different time points) or physically linked sample pairs (i.e., collected from different, connected individuals, such as FMT donors and recipients). We demonstrate increased sensitivity for the detection of subdominant shared strains and increased specificity for the detection of species-specific strains, which are not shared between unrelated sample pairs, over previous methods. We applied SameStr to identify a core gut microbiota of strains that persist over time in healthy adults and to determine the contributions of recipient- and donor-derived strains to the post-FMT patient microbiota, illustrating SameStr’s utility to study microbiome stability and transfer across different settings. We further show that persisting strains in healthy adults frequently belonged to the same species as donor-derived strains in post-FMT patients, suggesting the existence of a healthy adult core gut microbiota that is transferable from donors to rCDI patients by FMT.

Results

Detection of shared strains in metagenomic samples with SameStr

We developed the SameStr tool based on a workflow related to StrainPhlAn [17] to identify shared microbial strains in distinct metagenomic samples using within-species phylogenetic sequence variations (Fig. 1A). In brief, metagenomic input data are first quality-filtered and trimmed to reduce sequencing errors and then mapped to the MetaPhlAn reference database of species-specific marker genes [30], in order to limit the interference of higher-level taxonomic sequence variations with strain detection. Individual alignments for each sample and species are filtered and merged. Strains shared between samples are identified by comparing alignments, using a maximum variant profile similarity (MVS), which is calculated as the number of identical nucleotide positions in both alignments divided by the total length of the shared alignment (Fig. 1B). A comparison of SameStr’s resource requirements (total CPU time, CPU time per sample, and average RAM use) relative to metagenomic sequence preprocessing with Kneaddata and taxonomic analysis with MetaPhlAn is shown in the supplement (Fig. S1). In contrast to StrainPhlAn, which determines a consensus sequence for each marker alignment and compares metagenomes based on the consensus variant similarity (CVS) that only reflects the dominant strain in each sample, SameStr considers all detected single nucleotide variants to calculate MVS, including polymorphic positions with different relative allelic frequencies (default: ≥ 10%), thereby including non-dominant strains in the sample comparison. SameStr calls shared strains in two metagenomic samples if the corresponding species alignments share a minimum overlap (default: ≥ 5 kb) and MVS (default: ≥ 99.9%) over all detected sites. A similarity threshold of 99.9% for comparing MetaPhlAn marker genes (db_v20) was previously shown to differentiate between microbial strains within species and subspecies [17, 32, 33] and is further validated by our phylogenetic comparison of reference genomes based on whole-genome average nucleotide identity (ANI) (Fig. 1C).

Fig. 1 Species-specific shared strain detection in metagenomic samples with SameStr. A Schematic of the SameStr workflow. SameStr has been implemented modularly, including optional wrapper functions for quality preprocessing and alignment of whole-genome shotgun (WGS) metagenomic reads to species-specific MetaPhlAn markers (align), functions for the conversion to nucleotide variant profiles (convert), extraction of markers from genome sequences (extract), sample and reference pooling (merge), extensive global, per-sample, marker and position filtering (filter) and comparison of SNV profiles (compare) based on maximum variant similarity (MVS). SameStr outputs (summarize) tables denoting pairwise comparison results, including species alignment similarity and overlap, and co-occurrence of taxa at distinct taxonomic levels (based on MetaPhlAn) and at the strain level. B SameStr identifies shared strains in metagenomic samples by calculating a pairwise MVS, using all single-nucleotide variants detected in the read alignments of these samples to species-specific marker genes. C To assess the MetaPhlAn-based phylogenetic resolution (db_v20) and validate the 99.9% similarity threshold of shared strains, which is used by SameStr, 458 bacterial genomes from 20 of the most abundant and prevalent fecal microbiota species in our rCDI cohort (Table S4) were compared with MetaPhlAn2 [30] and based on average nucleotide identities (ANIs) as determined with FastANI [31]. MetaPhlAn2 and FastANI-based pairwise sequence similarities are strongly correlated (Spearman’s r = 0.93, p < 2.2e−16, n = 9813), demonstrating comparable phylogenetic resolution. Genome similarities exhibit a multimodal distribution (two-dimensional density kernel contours): reference genomes share peak sequence similarities at 97.5%, 99.0%, and above 99.9% identity that reflect the presence of distinct species, subspecies, and strains in the reference dataset

Validation of sensitivity and specificity of SameStr in comparison to other strain prediction tools

We first evaluated SameStr’s performance on synthetic, simulated metagenomes from species containing multiple strains. Mock sequence data from 100 individual isolates from 20 frequent and abundant bacterial gut species (Table S4) were mixed in various combinations to simulate metagenomes containing species composed of multiple strains and variable complexity and sequencing depth. For each species, simulated shotgun sequence data from a reference genome (at a 5-fold sequencing depth and showing typical sequencing error profiles, see “Materials and methods”) were compared to simulated metagenomes. These included the same reference genome (showing an independent typical error profile) at variable sequencing depths (target strain coverage), combined with additional sequence data from between 1 and 4 other available genomes from the same species at varying sequencing depths (noise coverage).

SameStr’s strain predictions based on maximum variant profile similarity (MVS) were compared to those of a StrainPhlAn-equivalent consensus variant similarity (CVS)-based approach across a total of 3276 simulated combinations (Fig. 2A). SameStr outperformed the consensus-based approach for the detection of dominant target strains (≥ 50% relative strain abundance at ≥ 5-fold target strain sequencing depth) in multi-strain species populations, detecting 85% of shared strains compared to 59% with the CVS-based approach. SameStr also detected 57% of shared strains among subdominant strains (15–50% relative strain abundance at ≥ 5-fold target strain sequencing depth), compared to only 2% for the consensus-based method. The better performance of SameStr compared to consensus-based methods even in the identification of dominant strains might be due to the lower sensitivity of the MVS-based approach to sequencing errors and wrong consensus calls at polymorphic and/or low-coverage positions of the metagenomic read alignment. Importantly, advantages in accuracy were not accompanied by reduced specificity, as both approaches were robust against false-positive shared strain calls even in complex multi-strain species mixtures (see 0-fold target strain coverage in Fig. 2A).

The StrainFinder tool has been developed to study strain-level microbiota dynamics in the course of fecal microbiota transplantation (FMT) [15]. StrainFinder used phylogenetic comparisons of 31 widely distributed, single-copy marker genes from the AMPHORA database [34] to define metagenomic operational taxonomic units (mg-OTUs) and call distinct strains based on sequence variations within these species equivalents [15]. We compared the performances of SameStr and StrainFinder with respect to (i) taxonomic sensitivity, i.e., the number of microbial genera and species assessed for shared strain detection (Fig. 2B), and (ii) specificity for the detection of ‘unique’ shared strain events, i.e., the frequency of shared strain predictions in unrelated sample pairs, which would interfere with our goal to use shared strains to infer strain persistence or transfer (Fig. 2C). Using the published datasets and taxonomic profiles from the original StrainFinder publication [15], SameStr consistently detected more species and genera, both across the entire dataset (154 vs. 116 genera and 399 vs. 306 species/mg-OTUs) and per sample (50.54 ± 15.0 vs. 23.78 ± 16.67 genera and 97.62 ± 39.6 vs. 48.48 ± 33.88 species/mg-OTUs; values shown as mean ± sd) (Fig. 2B). Differentially detected taxa included prominent members of the gastrointestinal tract microbiota, such as Bacteroides spp. (6.54 ± 5.35 species vs. 3.87 ± 4.70 mg-OTUs per

Fig. 2 Sensitivity and specificity comparison to other strain prediction tools. A SameStr detects dominant and subdominant strains at low sequencing depth (mean-fold target strain coverage) and relative abundance (i.e., high noise coverage) in simulated metagenomes (n = 3276) of multi-strain species populations, compared to consensus variant profile similarity (CVS)-based methods. B Using MetaPhlAn’s clade-specific marker gene database (db_v20), SameStr identifies more genera and species per metagenomic sample (n = 65) than StrainFinder, which uses mg-OTUs that are defined based on phylogenetic comparisons of universally distributed bacterial genes from the AMPHORA database. C Fewer shared strain calls demonstrate the increased specificity of SameStr compared to StrainFinder, which allows for the differentiation of related (n = 555) and unrelated (n = 1,525) sample pairs. D Cumulative relative abundance and fraction of species for which strain-level resolution was achieved with SameStr in fecal metagenomes from a reference cohort of 67 longitudinally sampled healthy adults (n = 202). E SameStr’s MVS-based method detects shared strains in a larger fraction of species in related (same individual, n = 281) but not in unrelated (different individuals, n = 20,020) sample pairs of the control cohort (n = 202 individuals) compared to CVS-based methods
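As a rough illustration of the difference between MVS and CVS discussed in this section, the following Python sketch compares two toy allele-count profiles. It is not SameStr's implementation: the profile representation, the function names, the rule that two positions match when they share at least one retained allele, and the toy data are all assumptions made for this example. Only the default thresholds (allele frequency ≥ 10%, overlap ≥ 5 kb, similarity ≥ 99.9%) are taken from the text.

```python
from typing import Dict, List, Optional, Set, Tuple

Counts = Dict[str, int]            # base -> read count at one alignment position
Profile = List[Optional[Counts]]   # None marks a position without coverage


def alleles(counts: Counts, min_freq: float = 0.10) -> Set[str]:
    """Bases whose relative frequency at this position reaches min_freq."""
    total = sum(counts.values())
    if total == 0:
        return set()
    return {base for base, n in counts.items() if n / total >= min_freq}


def consensus(counts: Counts) -> Set[str]:
    """Only the most frequent base, as a consensus-based comparison keeps."""
    return {max(counts, key=counts.get)}


def similarity(p1: Profile, p2: Profile, use_all_alleles: bool = True,
               min_freq: float = 0.10) -> Tuple[float, int]:
    """Return (similarity, overlap) over positions covered in both profiles."""
    overlap = matches = 0
    for c1, c2 in zip(p1, p2):
        if not c1 or not c2 or not sum(c1.values()) or not sum(c2.values()):
            continue  # skip positions lacking coverage in either sample
        overlap += 1
        if use_all_alleles:   # MVS-like: any shared allele counts as identical
            a1, a2 = alleles(c1, min_freq), alleles(c2, min_freq)
        else:                 # CVS-like: compare dominant bases only
            a1, a2 = consensus(c1), consensus(c2)
        matches += bool(a1 & a2)
    return (matches / overlap if overlap else 0.0), overlap


def shared_strain(p1: Profile, p2: Profile, min_overlap: int = 5000,
                  min_sim: float = 0.999, use_all_alleles: bool = True) -> bool:
    """Call a shared strain if overlap and similarity both reach their thresholds."""
    sim, overlap = similarity(p1, p2, use_all_alleles)
    return overlap >= min_overlap and sim >= min_sim


# Toy data: 6000 covered positions, 60 of which distinguish two strains.
variant_sites = set(range(0, 6000, 100))
pure = [{"A": 10} if i in variant_sites else {"C": 10} for i in range(6000)]
mixed = [{"G": 7, "A": 3} if i in variant_sites else {"C": 10} for i in range(6000)]

mvs, overlap = similarity(pure, mixed, use_all_alleles=True)
cvs, _ = similarity(pure, mixed, use_all_alleles=False)
print(f"overlap={overlap} MVS={mvs:.4f} CVS={cvs:.4f}")
```

In this toy case the pure sample carries the minor (30%) strain of the mixed sample. The all-allele comparison yields a similarity of 1.0 and a shared strain call, whereas the consensus comparison drops to 0.99 and misses it, which mirrors the subdominant-strain behaviour the text attributes to the two approaches.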
sample), Clostridium spp. (4.81 ± 4.05 species vs. 2.43 ± 3.25 mg-OTUs per sample), and Lactobacillus spp. (5.06 ± 3.33 species vs. 1.41 ± 2.37 mg-OTUs per sample).

For the detection of shared strains, we divided the original FMT dataset from Smillie et al. into related and unrelated sample pairs. Related sample pairs included corresponding FMT recipient and donor samples, pre- and post-FMT patient samples, and distinct samples from the same donor or post-FMT patient. SameStr detected on average 14.77 (median = 12, range = 0–67) shared strains in 555 related sample pairs and 0.45 (median = 0, range = 0–8) shared strains in 1525 unrelated sample pairs. By comparison, StrainFinder reported on average 93.13 (median = 73, range = 0–384) shared strains in related but also 35.16 (median = 25, range = 0–238) shared strains in unrelated sample pairs (Fig. 2C). These findings suggest that StrainFinder classifies subspecies lineages with broader prevalence in human populations as shared strains, which based on SameStr’s more conservative definition of “unique” shared strains would be considered false-positive predictions.

To further assess SameStr’s rate of false-positive shared strain predictions in fecal metagenomes, we downloaded a reference dataset (‘control’) from the curatedMetagenomicData package [35], consisting of 202 fecal metagenomes from four different studies, including 67 healthy adults that were sampled multiple times over a period of up to 1 year (see “Materials and methods” and Table S2). On average, strain-level resolution was obtained for 26.2% ± 6.8 of species or 71.4% ± 15.9 relative abundance per sample (Fig. 2D). This control dataset was divided into related sample pairs from the same individual, which would be expected to share strains, and unrelated sample pairs from distinct individuals, which would not be expected to share strains. Compared to the consensus-based method that is used by StrainPhlAn, SameStr detected more shared strains in 281 related sample pairs (range = 4–43, median = 14) but not in 20,020 unrelated sample pairs (range = 0–4, median = 0) (Fig. 2E), demonstrating increased sensitivity without compromising the low rates of false-positive shared strain detections that both approaches showed.

In summary, SameStr can detect shared strains in synthetic and real metagenomes, including from single- and multi-strain species populations, with improved accuracy for low-abundant and subdominant strains compared to StrainPhlAn and taxonomically more accurate and restrictive predictions of shared strains compared to StrainFinder.

Identification of strain persistence and engraftment in healthy individuals and rCDI patients after FMT

To gain insights into (i) microbiome stability in healthy individuals and (ii) microbiome transfer in the course of FMT, we applied SameStr to measure strain persistence and engraftment in our reference dataset of fecal metagenomes from healthy adult individuals and a combined FMT dataset with fecal samples from FMT-treated rCDI patients and their donors from our previously described cohort [36] and the study by Smillie et al. [15].

To study strain persistence in the fecal microbiota of healthy individuals, we used the reference cohort of 67 healthy adults described above and determined shared strains in sample pairs collected from the same individuals over periods of up to one year (Fig. 3A, see Fig. S2 for individual cases and samples). Contributions of temporally persistent strains that were shared between multiple samples from the same individual were relatively stable over time and comprised on average 22.6% ± 6.3 (mean ± sd) of all detected species in the later sample, which accounted for 73.1% ± 18.3 relative abundance. Strain persistence was detected with variable frequencies for different microbial genera (Fig. 3B) and species (Fig. S3). Based on the assignment of microbial species to different functional and lifestyle feature categories (see “Materials and methods” for details, Table S5), strain persistence was less frequent in oral and/or oxygen-tolerant genera (Fig. 3B) and species (Fig. S3).

To study strain persistence and engraftment in the course of FMT, we generated new metagenomic sequence data from our previously described cohort of FMT-treated rCDI patients [36, 37], which we combined with other available data [15] and applied SameStr to detect shared strains between pre- and post-FMT patients and post-FMT patients and donors (Fig. 3C, Table S7). Recipient and donor-derived species fractions and relative abundances in post-FMT patients were determined as being represented by shared strains between pre- and post-FMT patients or post-FMT patients and donors, respectively (Fig. 3D, see Fig. S4 for individual cases and samples). During the first week after FMT, both donor and recipient-derived strains contributed large relative abundances to the post-FMT microbiota (days 1–7: 42.5% ± 30.3 vs. 18.9% ± 22.3), but donor-derived microbiota fractions remained more stable over the following weeks and months, whereas recipient-derived microbiota fractions continuously decreased (days 70–84: 26.5% ± 21.9 vs. 4.9% ± 9.0). Donors and recipients before FMT frequently carried the same microbial species, but this rarely resulted in the detection of both recipient and donor-derived strains after FMT (Table S8). Consequently, coexisting recipient and donor strains from the same species accounted for only small and decreasing species fractions (0.46% ± 0.68) and relative abundances (5.19% ± 11.54) in post-FMT patients (Fig. 3D). Donor strain engraftment frequencies varied taxonomically and were less frequent in oral and/or oxygen-tolerant genera (Fig. 3B) and species (Fig. S3).

We next compared the healthy adult and FMT cohorts and found strains that frequently persisted in healthy individuals to belong to the same genera and species as donor strains that frequently engrafted in patients after FMT (Fig. 3E, see Fig. S5 for species comparison; Table S9-S10). Frequently persisting and engrafting genera included abundant (>5%) members of the healthy adult gut microbiota, such as Bacteroides, Blautia, Coprococcus, and Eubacterium (Fig. 3E), and similar observations were made at the species level (Fig. S5). Thus, FMT appears to specifically lead to the engraftment of persisting and abundant healthy gut microbiota members in rCDI patients.

Fig. 3 Identification of strain persistence and donor strain engraftment in healthy individuals and rCDI patients after FMT. A Longitudinal species and strain persistence in healthy adults from the reference (Control) cohort are shown as relative abundances of shared species and species fractions in 95 sample pairs from 59 individuals and modeled using binomial smoothing. Strain proportions are based on corresponding species. Species fractions indicate insufficient resolution for strain prediction. B Taxonomic variations in the frequency of species (dark blue), and strain (light blue) persistence in healthy individuals (n = 59) and FMT recipients (n = 19), and of donor species (dark green) and strain (light green) engraftment in post-FMT patients are shown, as summarized on the genus level for the 50 most prevalent genera (see Fig. S3 for species). Newly detected species and strains are shown in dark and light yellow, respectively. C Comparison of shared strain numbers between rCDI patients and donors. Distinct rCDI patients who received stool from the same donor share more strains than other post-FMT patients. D Donor-derived strains and species (exclusively shared with the donor but with insufficient resolution for strain prediction) account for large and stable relative abundances and species fractions in FMT-treated rCDI patients. Data for triads of successfully FMT-treated rCDI patients (n = 30) in reference to their pre-FMT (n = 19) and donor (n = 14) metagenomes are modeled across cases using binomial smoothing. E The frequencies of strain persistence in healthy individuals and of donor strain engraftment in rCDI patients after FMT are positively correlated at the genus level (Spearman’s r = 0.72, p < 1e−8), including for abundant members of the healthy adult fecal microbiota (see Fig. S5 for species-level comparison)

Identification of healthy individuals and FMT recipients and donors using shared strain profiles

The detection of species overlaps between the persisting core gut microbiota in healthy adults and the engrafted donor microbiota in rCDI patients after FMT prompted us to test if individuals were identifiable based on shared strain profiles in fecal metagenomes. To this end, we first trained and tested a logistic regression classifier (60% / 40% data split for training and testing) to identify sample pairs from the same individuals in our healthy adult reference dataset, based on overlapping taxonomic microbiota compositions. Microbiota profiles at the family, genus, and species level were determined with MetaPhlAn2 and at the strain level with SameStr; total and shared taxa and strains were used as input for the classifier (Fig. 4A, Table S6). A perfect classification (auPR = 1, auROC = 1) of 8120 hold-out sample pairs (n = 112 sample pairs from the same individuals) was achieved with shared strain profiles, whereas shared family and genus profiles were insufficient (auPR ≤ 0.18, auROC ≤ 0.87) and even shared species profiles performed poorly (auPR = 0.47, auROC = 0.93). We next tested the same logistic regression classifier that was trained on healthy individuals for the identification of related sample pairs from the FMT cohort (n = 580 related compared to n = 3606 unrelated sample pairs), i.e., pre- and post-FMT samples from the same patients, corresponding post-FMT patient and donor samples, and post-FMT samples from different patients that received FMT from the same donor. Again, our classifier performed well using shared strain profiles as input (auPR = 0.94, auROC = 0.93) but not higher-level taxa profiles (Fig. 4B, Table S6). Thus, our findings demonstrate that the fecal microbiota of healthy adults harbors identifiable personal strain profiles, at least over periods of up to one year, which are transferable from donors to rCDI patients after FMT.

Fig. 4 Identification of healthy individuals and FMT recipients and donors using shared strain profiles. Receiver-operating characteristic (ROC) and precision-recall (PR) curves of logistic regression classifiers demonstrate sensitive and accurate identification of (A) longitudinally collected sample pairs from the same healthy individuals (n = 112 from a total of n = 8120 sample pairs) and (B) related FMT patient and donor sample pairs (n = 580, including pre- and post-FMT patient samples, post-FMT patient and donor samples, and post-FMT patient samples that received FMT from the same donor, from a total of n = 4186 sample pairs)

Shared strain network analysis for the identification of mislabeled metagenomes

The published metagenomic sequence data used for this study included several samples that, while presenting with inconspicuous species-level taxonomic microbiota compositions, showed unexpected and inconsistent shared strain profiles that led to their removal from the analysis (Table S11). To illustrate these inconsistencies, shared strain profiles, as generated with SameStr, were visualized as unsupervised networks, which assigned related samples to distinct clusters linking, for example, samples from the same individual (Fig. 5A) or from FMT recipients and donors (Fig. 5B). However, in three cases > 2× more shared strains were detected between supposedly unrelated samples than between any of the other > 20,000 unrelated sample pairs from our dataset. In every case, suspicious sample pairs had been submitted as part of the same study and inconsistencies could be resolved by switching or changing sample labels (see Fig. 5 legend for details), suggesting sample mix-up or mislabeling. We

Fig. 5 SameStr-based unsupervised strain sharing networks identify potentially mislabeled samples. Shared strain profiles were visualized as unsupervised networks with individual samples as nodes and shared strain numbers as edges. A These networks connect samples from Louis et al. [38] by individual, with the exception of two samples (AS64_24 and AS66_24) that appear to be mixed up. B In a case of multiple rCDI patients treated with FMT from the same donor [15], shared strains were detected between pre- (blue) and post-FMT (yellow) patient samples, as well as between post-FMT and donor (green) samples and among post-FMT samples. Pre-FMT samples did not share strains with donor samples, with the exception of FMT15, which shares (> 15) strains with all three donor samples and exhibits α/β-diversity compositions that are comparable to other post-FMT samples (data not shown). As this sample was collected on the day of the FMT procedure, FMT15 could in fact represent a post-FMT sample that was accidentally mislabeled as a pre-FMT sample (Smillie, personal communication)
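The quality-control idea behind these shared strain networks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample names, labels, shared strain counts, and the cutoff of 5 shared strains are hypothetical and are not taken from the study, and the clustering is a plain connected-components pass rather than the network layout used for Fig. 5.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, int]  # (sample_a, sample_b, number of shared strains)


def strain_clusters(edges: List[Edge], min_shared: int) -> List[Set[str]]:
    """Group samples connected by edges with at least min_shared shared strains."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, n in edges:
        root_a, root_b = find(a), find(b)
        if n >= min_shared and root_a != root_b:
            parent[root_a] = root_b

    groups: Dict[str, Set[str]] = defaultdict(set)
    for sample in list(parent):
        groups[find(sample)].add(sample)
    return list(groups.values())


def suspicious_pairs(edges: List[Edge], label: Dict[str, str],
                     min_shared: int) -> List[Edge]:
    """Pairs labelled as different individuals that still share many strains."""
    return [(a, b, n) for a, b, n in edges
            if label[a] != label[b] and n >= min_shared]


# Hypothetical example: s2 and s4 look as if their labels were swapped.
label = {"s1": "subject_A", "s2": "subject_A", "s3": "subject_B", "s4": "subject_B"}
edges = [("s1", "s2", 0), ("s3", "s4", 1), ("s1", "s3", 0),
         ("s2", "s4", 0), ("s1", "s4", 14), ("s2", "s3", 12)]

flagged = suspicious_pairs(edges, label, min_shared=5)
clusters = strain_clusters(edges, min_shared=5)
print(flagged)
print(sorted(sorted(c) for c in clusters))
```

Here the strain-based clusters group s1 with s4 and s2 with s3, contradicting the submitted labels, so both pairs are flagged for manual review. In practice the cutoff would need to be chosen from the distribution of shared strain counts among unrelated pairs in the dataset at hand.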
have reported similar findings of potentially mislabeled samples in a meta-analysis of neonatal metagenomes [23], indicating that inconsistencies in public metagenomes might be common. Microbiota strain profiling with SameStr or equivalent tools could represent a viable strategy for the quality control of metagenomic sequence data from fecal microbiome projects.

Discussion

We introduce SameStr as a new bioinformatic tool for the identification of shared microbial strains in metagenomic shotgun sequence data, which allows for the detection and quantification of strain persistence and transfer and improves our ability to track and understand subspecies population dynamics in complex microbiomes. In contrast to related methods that define strains more broadly and allow for the presence of the same strain in different, unrelated individuals [15, 16], SameStr applies a more conservative definition of strains as “unique” phylogenetic lineages that should only be shared by either temporally or physically related samples. It thereby affords the specificity to infer persistence or transmission from the detection of shared strains in distinct metagenomes. Recent fecal metagenomics-based epidemiological studies identified subspecies lineages or clades of, for example, Prevotella copri and Ruminococcus gnavus with widespread prevalence in the human population, which could be linked to dietary habits [39, 40] and host health background, i.e., inflammatory bowel disease [6], respectively. Strain-level microbiota profiling with SameStr provides the phylogenetic resolution to track even individual strains within these subspecies clades in the human population, illustrating new opportunities to shed light on the role of these and other microbiome members for human lifestyle adaptation and disease development.

Methodically, SameStr is related to the StrainPhlAn tool, as both use the taxon-specific marker gene database from MetaPhlAn [30] to identify and compare microbial species-specific single nucleotide variant profiles. However, SameStr’s approach to determine maximum variant profile similarities between metagenomic samples, including polymorphic alleles, demonstrates increased sensitivity for the detection of shared strains among multi-strain species populations, especially between subdominant strains. Dominant and secondary maternal strains of Bifidobacterium and Bacteroides species have been shown to compete for colonization in neonates after birth, contingent on their strain-specific carbohydrate-degrading capabilities [22], emphasizing the importance of considering multiple strains per species for the detection of strain sharing and microbial transfer. Other clinical use cases, specifically for SameStr’s conservative shared strain calls, could include, for example, the identification of strain sharing between the intestinal, reproductive, and/or urinary tract or bloodstream, which could be used to better characterize endogenous reservoirs of opportunistic pathogens and microbial translocation between human body niches as a cause of infection and disease [41, 42].

We applied SameStr to study strain persistence in the intestinal microbiota of healthy individuals, as well as strain persistence and engraftment in patients after fecal microbiota transplantation, using combined datasets from multiple studies, including healthy adults sampled over durations of up to one year and rCDI patients, sampled before and after FMT together with their donors. We detected strain persistence for many of the same bacterial taxa, such as Bacteroides species, as previously reported based on temporal single nucleotide polymorphism (SNP) stability [43] and strain-resolved species-specific MAGs [19] in fecal metagenomes from healthy individuals. Persistence has been negatively correlated to the genetic capacity for oxygen tolerance and sporulation before [19] and, based on comparative genome analyses, the loss of sporulation has been genetically linked to typical features of host-adaptation, such as genome reduction and metabolic specialization [44], confirming our functional predictions for species that are frequently represented by persisting strains, as well as our concept of a persisting core gut microbiota of strict anaerobe, non-spore-forming bacteria in the healthy human gut. We also identified a surprising taxonomic association between strain persistence and engraftment, as strains with a high persistence rate in healthy individuals belonged to the same bacterial species as donor strains with a high engraftment rate in rCDI patients after FMT. Given that persistence in the complex gut microbiomes of healthy individuals, as well as engraftment in the dysbiotic microbiomes of rCDI patients, requires strains to compete with other persisting, resident, and/or newly incoming strains, our analysis likely identified bacterial species of high ecological competitiveness and fitness. This is further supported by Hildebrand et al., who used the concept of tenacity to describe strain persistence in human individuals and described tenacious bacteria, such as Bacteroides species, as host-adapted, frequently dispersed by vertical transmission from mothers to infants, and most negatively affected by antibiotic perturbation [19]. In this context, the lack of sporulation genes in tenacious bacteria likely reflects an adaptive mechanism to increase persistence by avoiding excessive intra-species strain competition [19]. Using different methodologies, Watson et al. similarly concluded that FMT selects for high-fitness populations of the gut microbiome, based on the observation that a high prevalence of a microbial species in healthy individuals is more predictive for colonization success after FMT than a high relative abundance of the same species in an FMT donor [45]. Based on these considerations, the identification and characterization of stably persisting strains in healthy individuals could present a viable and more useful strategy to determine different constitutions of personalized, adapted core microbiomes of the human gut than more commonly used β-diversity metrics based on species or higher-level taxon persistence. As key microbiome attributes, such as colonization resistance against pathogens [46] or resilience towards other perturbations [47], should be determined by the fitness of its core members, characterization of the persisting gut microbiota might constitute an ecological approach to define a healthy human gut microbiome [48].

Our analyses suggest additional practical applications for metagenomics strain profiling that extend previous concepts of microbiome-based forensic markers for personal identification [49]. Franzosa et al. identified combinations of taxonomic (operational taxonomic units and species), genomic (genome fragments), and functional (genes) markers as ‘metagenomic codes’ that could be used to match > 80% of fecal sample pairs that were collected over periods of 30–300 days from the same individuals [50]. Similarly, a majority of > 300 individuals could be identified in a mixed human cohort (auPR = 0.87, auROC = 0.95), using rare fecal metagenomic oligomers (k-mers of 18–30-bp length) [51]. Yet our shared strain-based personal identification method outperformed both previous attempts by demonstrating a 100% success rate for the detection of matching sample pairs (n = 112 from a total of 8120 sample pairs) from the same healthy individuals and, in addition, correctly identified most sample pairs from linked FMT donors and recipients (auPR = 0.94, auROC = 0.93 for n = 580 from a total of 4186 sample pairs). Standard practice for microbiome projects dictates the removal of human reads from metagenomic sequence data to de-identify samples before release. Our findings attest to the persistence and FMT-dependent transferability of personalized gut microbiome strain profiles and suggest that filtered public metagenomes retain personal information that could make study participants and FMT donors retrospectively identifiable.

The SameStr platform has a few limitations. First, as strains are identified based on SNV profiles in clade-specific marker genes, their detection is dependent on the underlying database and limited to previously described, sequenced, and comparatively analyzed taxa [30]. However, taxonomic assignments based on universal instead of species-specific marker genes, which are less dependent on available genome sequence information, can show discrepancies from established taxonomic systems [52], which could explain the increased taxonomic resolution and accuracy of SameStr’s taxonomic strain classifications compared to those from the StrainFinder tool. Moreover, SameStr can be easily adapted for use with updated (e.g., MetaPhlAn3, mpa_v30_CHOCOPhlAn_201901 [29]) or alternative, user-provided, marker sequences. Second, we developed SameStr specifically for the metagenome-based detection of strain sharing between fecal microbiomes. SameStr can be used to identify species that are represented by multiple strains, based on the detection of multiple alleles within a species-specific marker gene alignment of a single sample, with multi-strain species populations exhibiting ≥ 0.1% polymorphic positions of all detected alignment sites. However, it does not provide similar insights into strain population structures as related tools [15]. Third, in order to reliably detect strain-specific SNV profiles, SameStr required a sequencing depth of the genome corresponding to this strain of > 5-fold in our validation experiments, irrespective of whether this strain was the only representative or a minor component of a multi-strain species population. Assuming an average genome length of 2.5 Mbp and a metagenomic sequencing depth of 5 Gbp per sample (corresponding to 2000 genomes of average length), we estimate that SameStr is limited to the detection of shared or coexisting strains that make up at least 0.25% of all genomes in the metagenomic sample or 0.25% species relative abundance in case of single-strain species.

In conclusion, we present SameStr as a new bioinformatic tool for the species-specific, conservative identification of unique shared subspecies taxa in metagenomic shotgun sequence data, including subdominant members of multi-strain species populations. We demonstrate increased sensitivity, specificity, and taxonomic accuracy of detected strains in fecal metagenomes compared to related tools, which affords reliable detection of temporal strain persistence and transfer after fecal microbiota transplantation. We identify a persisting fecal core microbiota in healthy individuals, which taxonomically overlaps with the engrafted donor microbiota in rCDI patients after FMT, demonstrating the utility of SameStr to gain new insights into human gut microbiome stability and modulation. Application of this approach to other microbiome projects will improve our understanding of microbiome organization and function and should advance most areas of microbiome research.

Materials and methods

Study cohort

Metagenomic shotgun sequence data were generated from a previously published cohort of FMT-treated rCDI patients [36, 37]. The sample set included eight rCDI patient samples, collected 1–2 days before treatment, and eleven patient samples, collected between 1 week and up to 1 year after FMT. FMT was performed at Sinai Hospital of Baltimore, Baltimore, MD, USA, by single infusion of fecal filtrate from healthy donors into the jejunum and colon of rCDI patients. Study design, patient selection criteria, donor screening, infusion protocol, and sample collection have previously been reported [36].

DNA isolation and sequencing

Metagenomic DNA extraction and sequencing of the 27 fecal samples was conducted at the Institute for Genome Sciences, University of Maryland School of Medicine. DNA was extracted from 0.25 g of stored fecal samples (−80 °C), using the MoBio Microbiome kit automated on a Hamilton STAR robotic platform after a bead-beating step on a Qiagen TissueLyser II (20 Hz for 20 min) in 96 deep-well plates. Metagenomic libraries were constructed using the KAPA Hyper Prep (KAPA Biosystems/Roche, San Francisco, CA, USA) library preparation kit according to the manufacturer’s protocols. Sequencing was performed on the Illumina HiSeq 4000 platform to generate 150-bp paired-end reads.

Published sequence data acquisition

Publicly available fecal metagenomic sequence data, longitudinally collected from healthy adult individuals, were obtained through curatedMetagenomicsData [35], including 202 metagenomes of 67 subjects (59 with known sampling days) from four different studies [38, 53–55]. Individuals were sampled at least twice within a year and had not reported medical conditions that would suggest extensive medication or strong microbiota perturbations between time points. For each subject, sequence data downloaded from the SRA were concatenated in case of multiple available accessions (Table S2). A total of 65 additional fecal metagenomes were obtained from 18 cases of FMT-treated rCDI patients who had not been treated with FMT before [15].

Quality control and preprocessing of metagenomic sequencing data

All raw paired-end metagenomic sequence reads were quality-processed with Kneaddata v0.6.1 (KneadData Development Team, 2017) in order to trim sequence regions where base quality fell below Q20 within a 4-nucleotide sliding window and to remove reads that were truncated by more than 30% (SLIDINGWINDOW:4:20, MINLEN:70). To remove human sequence contamination, trimmed reads were mapped against the human genome (GRCh37/hg19) with Bowtie2 v2.2.3 [56]. Output files consisting of surviving paired and orphan reads were concatenated and used for further processing (Table S3).

Metagenomic strain-level profiling with SameStr

The following individual analysis steps are part of the SameStr protocol to identify shared strains in metagenomic samples (Fig. 1):

1. Taxonomic microbiota analysis. Preprocessed sequence reads from each sample were mapped against the MetaPhlAn clade-specific marker gene database (db_v20, mpa_v20_m200) using MetaPhlAn2 v2.6.0 [57]. We additionally generated taxonomic profiles for rarefied data, which were subsampled to 5 M reads (after QC) per sample (seqtk v1.0) before processing with MetaPhlAn2, confirming representativeness of microbial communities as indicated by strong correlations of Shannon Index (diversity, vegan v2.5.7) between data.
2. Detection of SNV profiles in marker gene alignments. Using the SameStr tool, MetaPhlAn2 marker gene alignments were filtered for ≥ 90% sequence identity, a base call quality of Q20, and mapping length of 40 bp. The frequencies of all four nucleotides were tabulated with Samtools v0.1.19 [58] and kpileup v1.0 [15], retaining unmapped alignment sites as gap positions. Marker gene alignments were trimmed by 20 nucleotides at both ends, concatenated for each species, and combined from all samples. In order to address atypical vertical coverage and wrong base calls for each sample, alignment positions that diverged from the mean coverage by more than five standard deviations and alleles that were represented by < 10% of all mapped reads at this position were zeroed in the alignment.
3. Determination of maximum variant profile similarity (MVS). To consider individual strains from multi-strain species populations for the detection of shared strains, MVS was calculated between all species/alignment pairs M_i and M_j as the sum of alignment positions with at least one shared allele, C_allele, divided by the sum of positions with coverage in both alignments, C_cov, where the vector of shared alleles C_allele was calculated as the pairwise Boolean product of 4 vectors of nucleotide counts at all positions between alignments M_i and M_j. For consensus variant profile similarity (CVS) calculation, shared alleles were calculated as the pairwise Boolean product of a single vector representing the consensus sequence of the alignment at all positions between alignments M_i and M_j.
4. Comparison of reference genomes. Species-specific marker gene regions were extracted from a total of 458 available genome sequences in the NCBI RefSeq and Genome databases from the 20 most abundant and prevalent species in our rCDI cohort (Table S4). For this, marker gene regions were extracted from reference genomes with a StrainPhlAn utility [17], based
on BLASTn v2.6.0 comparisons, and used to generate multiple sequence alignments with MUSCLE v3.8.31 [59]. After removing gap positions, marker gene alignments were tabulated, concatenated, trimmed, and used to calculate the single-genome equivalent of MVS. MVS-based genome similarities were compared to average nucleotide identities (ANI), as calculated for entire genomes with FastANI v1.3 [31].
5. Shared strain detection in distinct metagenomes. Based on our reference genome comparison (Fig. 1C) and in agreement with previous reports [21], an MVS threshold of 99.9% was applied to detect shared strains that would be identified in related but not unrelated microbiomes. Shared strain predictions were additionally limited to sample pairs with at least 5000 overlapping alignment positions.
6. Validation of SameStr on mock species populations. Simulated shotgun sequence data were generated with ART read simulator v2.5.8 [60] and combined in various proportions to generate metagenomes from mock multi-strain species populations. Metagenomic paired-end sequence read error profiles were independently generated for each genome and simulation, using the Illumina HiSeq-20 error profile. For each species (Table S4), five reference genomes were randomly selected, including one target genome for shared strain detection and four other genomes to simulate a background noise of additional strains from the same species. Both the sequencing depths (fold coverage) of the target strain and its abundance relative to all other strains (noise coverage) were varied for each simulation. Marker gene alignments and comparisons for MVS or CVS calculation and shared strain detection were performed as described above.

Classification of related and unrelated sample pairs

For the prediction of related samples (distinct samples from the same individual or connected samples from FMT donors and recipients) based on strain sharing, the number of detected and shared taxa between sample pairs from the healthy adult reference dataset were determined at the family, genus, species, or strain level with MetaPhlAn or SameStr, respectively, as described above. Data were divided into training and hold-out data (60%/40%) and shared taxon or strain fractions used to train simple logistic regression models (tidymodels v0.1.2). The classifier that was trained on strain persistence in healthy adults was then used to predict related sample pairs from the FMT cohorts. To assess the performance of the predictor, precision-recall (tidymodels v0.1.2) and receiver operating characteristic (ROC) curves were generated (tidymodels v0.1.2) and visualized (plotROC v2.2.1).

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1186/s40168-022-01251-w.

Additional file 1: Figure S1. Computational Resource Requirements. (A) Total CPU time, (B) average CPU time per sample, and (C) average use of RAM by the Kneaddata, MetaPhlAn3 (mpa_v30_CHOCOPhlAn_201901), and SameStr programs during the processing of three datasets of different sizes. SameStr, on average, added 4.3 CPU minutes per sample to the computational effort of the entire workflow.

Additional file 2: Figure S2. Microbial Tracking across Individual Metagenomic Samples of Healthy Controls. Microbial tracking at the species (top) and strain level (bottom) in healthy controls. Healthy adults from the reference (Control) cohort harbor a core microbiota of persisting strains and species (insufficient sequencing depth for strain calls) shared between fecal metagenomes sampled up to one year apart.

Additional file 3: Figure S3. Predicting Donor Strain Engraftment in rCDI Recipients after FMT. The frequencies of species (dark blue) and strain (light blue) persistence in healthy individuals and rCDI recipients, and of donor species (dark green) and strain (light green) engraftment in post-FMT patients, differ between bacterial species, with retained recipient species and strains mostly being classified as oral and/or oxygen-tolerant species. Newly detected species and strains are shown in dark and light yellow, respectively.

Additional file 4: Figure S4. Microbial Tracking across Individual Metagenomic Samples of FMT-treated rCDI patients. Donor-derived strains and species (exclusively shared with donor but insufficient resolution for strain prediction) account for large and stable relative abundances across all post-FMT patient samples, whereas contributions of recipient-derived strains are comparatively smaller.

Additional file 5: Figure S5. Predicting Donor Strain Engraftment in rCDI Recipients after FMT. The same species that are represented by frequently persisting strains in healthy individuals are also represented by strains that frequently engraft from donors in rCDI patients after FMT and belong to species that have a high relative abundance in the healthy adult control cohort.

Additional file 6: Table S1. Sample and Case Metadata. Table S2. WGS Accession Identifiers. Table S3. WGS QC Data. Table S4. Reference Accessions. Table S5. Species Metadata. Table S6. Logistic Regression of (Un)related Sample Pairs at Taxonomic Levels. Table S7. Shared Species and Strains for Individual Cases. Table S8. Events with Competing Strains. Table S9. Strain Transmission Rates (per Species). Table S10. Strain Transmission Rates (per Genus). Table S11. Potential Sample Mislabellings.

Acknowledgements

Not applicable.

Authors’ contributions

Conceptualization, Methodology, and Writing (Original Draft and Review and Editing), D.P., J.W. and W.F.F.; Software, Validation, and Formal Analysis, D.P.; Resources, S.K.D.; Data Curation, C.A., E.D., and S.V.; Funding Acquisition, W.F.F. The authors read and approved the final manuscript.

Funding

D.P. and W.F.F. received funding from the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) under SPP 1656 (Project no. 316130265).

Availability of data and materials

We implemented SameStr to facilitate the comparison of nucleotide variant profiles presented in this analysis. The program builds on previously published tools such as the concept of StrainPhlAn but extends the analysis of MetaPhlAn markers beyond a consensus-based approach by extracting all four nucleotide alleles from sequence alignments. Generated SNV-Profiles are in NumPy format and can be used as input for strain composition modeling and other analyses. The SameStr program and further documentation are available at GitHub: https://www.github.com/danielpodlesny/SameStr.git. R Markdown notebooks and additional code for generating figures are available at https://www.github.com/danielpodlesny/fmt_rcdi.git. Metagenomic sequence data are available from the European Nucleotide Archive under accession PRJEB39023.

Declarations

Ethics approval and consent to participate

The Institutional Review Board of Sinai Hospital Baltimore approved the study under protocol number #1826 and all subjects provided their written informed consent to participate in the study.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Author details

Department of Microbiome Research and Applied Bioinformatics, University of Hohenheim, Stuttgart, Germany. Current address: Ring Therapeutics, Cambridge, MA, USA. Division of Gastroenterology, Sinai Hospital of Baltimore, Baltimore, MD, USA. APC Microbiome Ireland, School of Microbiology, and Department of Medicine, University College Cork, Cork, Ireland. Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA.

Received: 27 October 2021 Accepted: 24 February 2022

References

1. Duvallet C, Gibbons SM, Gurry T, Irizarry RA, Alm EJ. Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nat Commun. 2017;8:1784.
2. Sze MA, Schloss PD. Erratum for Sze and Schloss, “Looking for a signal in the noise: revisiting obesity and the microbiome”. MBio. 2017;8. https://doi.org/10.1128/mBio.01995-17.
3. Wirbel J, Zych K, Essex M, Karcher N, Kartal E, Salazar G, et al. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biol. 2021;22:93.
4. Lloyd-Price J, Mahurkar A, Rahnavard G, Crabtree J, Orvis J, Hall AB, et al. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature. 2017;550:61–6.
5. Schirmer M, Garner A, Vlamakis H, Xavier RJ. Microbial genes and pathways in inflammatory bowel disease. Nat Rev Microbiol. 2019;17:497–511.
6. Hall AB, Yassour M, Sauk J, Garner A, Jiang X, Arthur T, et al. A novel Ruminococcus gnavus clade enriched in inflammatory bowel disease patients. Genome Med. 2017;9:103.
7. Bell A, Brunt J, Crost E, Vaux L, Nepravishta R, Owen CD, et al. Elucidation of a sialic acid metabolism pathway in mucus-foraging Ruminococcus gnavus unravels mechanisms of bacterial adaptation to the gut. Nat Microbiol. 2019;4:2393–404.
8. Henke MT, Kenny DJ, Cassilly CD, Vlamakis H, Xavier RJ, Clardy J. Ruminococcus gnavus, a member of the human gut microbiome associated with Crohn’s disease, produces an inflammatory polysaccharide. Proc Natl Acad Sci U S A. 2019;116:12672–7.
9. Lee SM, Donaldson GP, Mikulski Z, Boyajian S, Ley K, Mazmanian SK. Bacterial colonization factors control specificity and stability of the gut microbiota. Nature. 2013;501:426–9.
10. Porter NT, Hryckowian AJ, Merrill BD, Fuentes JJ, Gardner JO, Glowacki RWP, et al. Phase-variable capsular polysaccharides and lipoproteins modify bacteriophage susceptibility in Bacteroides thetaiotaomicron. Nat Microbiol. 2020;5:1170–81.
11. Sorbara MT, Littmann ER, Fontana E, Moody TU, Kohout CE, Gjonbalaj M, et al. Functional and genomic variation between human-derived isolates of Lachnospiraceae reveals inter- and intra-species diversity. Cell Host Microbe. 2020;28:134–46.e4.
12. Vicedomini R, Quince C, Darling AE, Chikhi R. Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nat Commun. 2021;12:4485.
13. Karcher N, Nigro E, Punčochář M, Blanco-Míguez A, Ciciani M, Manghi P, et al. Genomic diversity and ecology of human-associated Akkermansia species in the gut microbiome revealed by extensive metagenomic assembly. Genome Biol. 2021;22:209.
14. Quince C, Nurk S, Raguideau S, James R, Soyer OS, Summers JK, et al. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol. 2021;22:214.
15. Smillie CS, Sauk J, Gevers D, Friedman J, Sung J, Youngster I, et al. Strain tracking reveals the determinants of bacterial engraftment in the human gut following fecal microbiota transplantation. Cell Host Microbe. 2018;23:229–40.e5.
16. Milanese A, Mende DR, Paoli L, Salazar G, Ruscheweyh H-J, Cuenca M, et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat Commun. 2019;10:1014.
17. Truong DT, Tett A, Pasolli E, Huttenhower C, Segata N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. 2017;27:626–38.
18. Olm MR, Crits-Christoph A, Bouma-Gregson K, Firek BA, Morowitz MJ, Banfield JF. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat Biotechnol. 2021;39:727–36.
19. Hildebrand F, Gossmann TI, Frioux C, Özkurt E, Myers PN, Ferretti P, et al. Dispersal strategies shape persistence and evolution of human gut bacteria. Cell Host Microbe. 2021;29:1167–76.e9.
20. Schmidt TS, Hayward MR, Coelho LP, Li SS, Costea PI, Voigt AY, et al. Extensive transmission of microbes along the gastrointestinal tract. Elife. 2019;8. https://doi.org/10.7554/eLife.42693.
21. Ferretti P, Pasolli E, Tett A, Asnicar F, Gorfer V, Fedi S, et al. Mother-to-infant microbial transmission from different body sites shapes the developing infant gut microbiome. Cell Host Microbe. 2018;24:133–45.e5.
22. Yassour M, Jason E, Hogstrom LJ, Arthur TD, Tripathi S, Siljander H, et al. Strain-level analysis of mother-to-child bacterial transmission during the first few months of life. Cell Host Microbe. 2018;24:146–54.e4.
23. Podlesny D, Fricke WF. Strain inheritance and neonatal gut microbiota development: a meta-analysis. Int J Med Microbiol. 2021;311:151483.
24. Li SS, Zhu A, Benes V, Costea PI, Hercog R, Hildebrand F, et al. Durable coexistence of donor and recipient strains after fecal microbiota transplantation. Science. 2016;352:586–9.
25. Wilson BC, Vatanen T, Jayasinghe TN, Leong KSW, Derraik JGB, Albert BB, et al. Strain engraftment competition and functional augmentation in a multi-donor fecal microbiota transplantation trial for obesity. Microbiome. 2021;9:107.
26. Yan Y, Nguyen LH, Franzosa EA, Huttenhower C. Strain-level epidemiology of microbial communities and the human microbiome. Genome Med. 2020;12:71.
27. Van Rossum T, Ferretti P, Maistrenko OM, Bork P. Diversity within species: interpreting strains in microbiomes. Nat Rev Microbiol. 2020;18:491–506.
28. Aggarwala V, Mogno I, Li Z, Yang C, Britton GJ, Chen-Liaw A, et al. Precise quantification of bacterial strains after fecal microbiota transplantation delineates long-term engraftment and explains outcomes. Nat Microbiol. 2021;6:1309–18.
29. Beghini F, McIver LJ, Blanco-Míguez A, Dubois L, Asnicar F, Maharjan S, et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. bioRxiv. 2020. https://doi.org/10.1101/2020.11.19.388223.
30. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9:811–4.
31. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun. 2018;9:5114.
32. Johnson RC, Deming C, Conlan S, Zellmer CJ, Michelin AV, Lee-Lin S, et al. Investigation of a Cluster of Sphingomonas koreensis Infections. N Engl J Med. 2018;379:2529–39.
33. Chng KR, Li C, Bertrand D, Ng AHQ, Kwah JS, Low HM, et al. Cartography of opportunistic pathogens and antibiotic resistance genes in a tertiary hospital environment. Nat Med. 2020;26:941–51.
34. Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9:R151.
35. Pasolli E, Schiffer L, Manghi P, Renson A, Obenchain V, Truong DT, et al. Accessible, curated metagenomic data through ExperimentHub. Nat Methods. 2017;14:1023–4.
36. Dutta SK, Girotra M, Garg S, Dutta A, von Rosenvinge EC, Maddox C, et al. Efficacy of combined jejunal and colonic fecal microbiota transplantation for recurrent Clostridium difficile Infection. Clin Gastroenterol Hepatol. 2014;12:1572–6.
37. Song Y, Garg S, Girotra M, Maddox C, von Rosenvinge EC, Dutta A, et al. Microbiota dynamics in patients treated with fecal microbiota transplantation for recurrent Clostridium difficile infection. PLoS One. 2013;8:e81330.
38. Louis S, Tappu R-M, Damms-Machado A, Huson DH, Bischoff SC. Characterization of the gut microbial community of obese patients following a weight-loss intervention using whole metagenome shotgun sequencing. PLoS One. 2016;11:e0149564.
39. Tett A, Huang KD, Asnicar F, Fehlner-Peach H, Pasolli E, Karcher N, et al. The Prevotella copri complex comprises four distinct clades underrepresented in Westernized populations. Cell Host Microbe. 2019;26:666–79.e7.
40. Fehlner-Peach H, Magnabosco C, Raghavan V, Scher JU, Tett A, Cox LM, et al. Distinct Polysaccharide Utilization Profiles of Human Intestinal Prevotella copri Isolates. Cell Host Microbe. 2019;26:680–90.e5.
41. Tamburini FB, Andermann TM, Tkachenko E, Senchyna F, Banaei N, Bhatt AS. Precision identification of diverse bloodstream pathogens in the gut microbiome. Nat Med. 2018;24:1809–14.
42. Magruder M, Sholi AN, Gong C, Zhang L, Edusei E, Huang J, et al. Gut uropathogen abundance is a risk factor for development of bacteriuria and urinary tract infection. Nat Commun. 2019;10:5521.
43. Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, Zhu A, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493:45–50.
44. Browne HP, Almeida A, Kumar N, Vervier K, Adoum AT, Viciani E, et al. Host adaptation in gut Firmicutes is associated with sporulation loss and altered transmission cycle. Genome Biol. 2021;22:204.
45. Watson AR, Fuessel J, Veseli I, DeLongchamp JZ, Silva M, Trigodet F, et al. Adaptive ecological processes and metabolic independence drive microbial colonization and resilience in the human gut. bioRxiv. 2021; Available from: http://biorxiv.org/lookup/doi/10.1101/2021.03.02.433653.
57. Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods. 2015;12:902–3.
58. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–93.
59. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–7.
60. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28:593–4.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
46. Litvak Y, Bäumler AJ. The founder hypothesis: a basis for microbiota resistance, diversity in taxa carriage, and colonization resistance against pathogens. PLoS Pathog. 2019;15:e1007563. 47. Fassarella M, Blaak EE, Penders J, Nauta A, Smidt H, Zoetendal EG. Gut microbiome stability and resilience: elucidating the response to pertur- bations in order to modulate gut health. Gut. 2021;70:595–605. 48. McBurney MI, Davis C, Fraser CM, Schneeman BO, Huttenhower C, Verbeke K, et al. Establishing What Constitutes a Healthy Human Gut Microbiome: State of the Science, Regulatory Considerations, and Future Directions. J Nutr. 2019;149:1882–95. 49. Robinson JM, Pasternak Z, Mason CE, Elhaik E. Forensic applications of microbiomics: a review. Front Microbiol. 2020;11:608101. 50. Franzosa EA, Huang K, Meadow JF, Gevers D, Lemon KP, Bohannan BJM, et al. Identifying personal microbiomes using metagenomic codes. Proc Natl Acad Sci U S A. 2015;112:E2930–8. 51. Wang Z, Lou H, Wang Y, Shamir R, Jiang R, Chen T. GePMI: a statistical model for personal intestinal microbiome identification. NPJ Biofilms Re Read ady y to to submit y submit your our re researc search h ? Choose BMC and benefit fr ? Choose BMC and benefit from om: : Microbiomes. 2018;4:20. 52. Mende DR, Sunagawa S, Zeller G, Bork P. Accurate and universal delinea- fast, convenient online submission tion of prokaryotic species. Nat Methods. 2013;10:881–4. thorough peer review by experienced researchers in your field 53. Asnicar F, Manara S, Zolfo M, Truong DT, Scholz M, Armanini F, et al. Studying vertical microbiome transmission from mothers to infants by rapid publication on acceptance strain-level metagenomic profiling. mSystems. 2017;2. https:// doi. org/ 10. support for research data, including large and complex data types 1128/ mSyst ems. 00164- 16. • gold Open Access which fosters wider collaboration and increased citations 54. Raymond F, Ouameur AA, Déraspe M, Iqbal N, Gingras H, Dridi B, et al. 
The initial state of the human gut microbiome determines its reshaping by maximum visibility for your research: over 100M website views per year antibiotics. ISME J. 2016;10:707–20. 55. Human Microbiome Project Consortium. Structure, function and diversity At BMC, research is always in progress. of the healthy human microbiome. Nature. 2012;486:207–14. Learn more biomedcentral.com/submissions 56. Langmead B. Aligning short sequencing reads with Bowtie. Curr Protoc Bioinformatics. 2010;Chapter 11:Unit 11.7. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Microbiome Springer Journals

Metagenomic strain detection with SameStr: identification of a persisting core gut microbiota transferable by fecal transplantation


References (119)

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2022
eISSN
2049-2618
DOI
10.1186/s40168-022-01251-w

Abstract

Background: The understanding of how microbiomes assemble, function, and evolve requires metagenomic tools that can resolve microbiota compositions at the strain level. However, the identification and tracking of microbial strains in fecal metagenomes is challenging and available tools variably classify subspecies lineages, which affects their applicability to infer microbial persistence and transfer.

Results: We introduce SameStr, a bioinformatic tool that identifies shared strains in metagenomes by determining single-nucleotide variants (SNV) in species-specific marker genes, which are compared based on a maximum variant profile similarity. We validated SameStr on mock strain populations, available human fecal metagenomes from healthy individuals and newly generated data from recurrent Clostridioides difficile infection (rCDI) patients treated with fecal microbiota transplantation (FMT). SameStr demonstrated enhanced sensitivity to detect shared dominant and subdominant strains in related samples (where strain persistence or transfer would be expected) when compared to other tools, while being robust against false-positive shared strain calls between unrelated samples (where neither strain persistence nor transfer would be expected). We applied SameStr to identify strains that are stably maintained in fecal microbiomes of healthy adults over time (strain persistence) and that successfully engraft in rCDI patients after FMT (strain engraftment). Taxonomy-dependent strain persistence and engraftment frequencies were positively correlated, indicating that a specific core microbiota of intestinal species is adapted to be competitive both in healthy microbiomes and during post-FMT microbiome assembly. We explored other use cases for strain-level microbiota profiling, as a metagenomics quality control measure and to identify individuals based on the persisting core gut microbiota.
Conclusion: SameStr provides for a robust identification of shared strains in metagenomic sequence data with sufficient specificity and sensitivity to examine strain persistence, transfer, and engraftment in human fecal microbiomes. Our findings identify a persisting healthy adult core gut microbiota, which should be further studied to shed light on microbiota contributions to chronic diseases.

*Correspondence: daniel.podlesny@uni-hohenheim.de; w.florian.fricke@uni-hohenheim.de
Department of Microbiome Research and Applied Bioinformatics, University of Hohenheim, Stuttgart, Germany
Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
Full list of author information is available at the end of the article

© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Background

Disturbances of the human gut ecosystem have been implicated in many metabolic, inflammatory, and infectious diseases, based on altered taxonomic or functional microbiota compositions in affected individuals. However, attempts to identify consistent, disease-specific microbiome markers have been less successful, as reported associations vary and have frequently not been consistent between studies [1, 2]. Among many other factors [3], taxonomic and functional variations between microbial subspecies or strains that are members of the human microbiome [4] can produce inconsistent findings, but have not been comprehensively characterized. The species Ruminococcus gnavus, for example, has been linked to inflammatory bowel diseases [5], but disease associations appear to be specific to only one of two described subspecies clades [6] and may be dependent on strain-specific variations in carbohydrate utilization [7] or pro-inflammatory polysaccharide production [8], emphasizing the need for health-related microbiome studies to focus on subspecies-level microbiota variations. Moreover, many of the ecological forces that shape microbiomes in health and disease, or after perturbation and therapeutic modulation, involve microbial interactions, such as competition, inhibition, or predation, which can be strain-dependent [9–11] and require compositional microbiota analyses to provide strain-level taxonomic resolution.

Shotgun metagenomics has the potential for a maximum phylogenetic resolution that can theoretically resolve even individual microbial genomes in a metagenomic sample [12]. Consequently, several bioinformatic methods have been introduced to identify microbial strains in metagenomes, based on the generation of metagenome-assembled genomes (MAGs, see [13], Strainberry [12], and STRONG [14]) or the mapping of individual metagenomic reads to universal (see StrainFinder [15] and mOTUs2 [16]) or taxon-specific marker genes (see StrainPhlAn [17]), or whole-genomes (see InStrain [18]) to detect phylogenetically informative, strain-specific, single nucleotide variant (SNV) profiles. Microbiota strain profiling has been successfully applied to study strain-specific adaptations to human body sites [4]; associations with individual human hosts, families, and geography [17, 19]; and transmission along the gastrointestinal tract [20], from mothers to infants [21–23] and from the donors to the recipients of fecal microbiota transplantation (FMT) [15, 24, 25]. Yet, strain-level microbiota analysis is hampered by inconsistent “strain” definitions [26, 27] and available methods exhibit variable sensitivities and specificities, which have not been comprehensively compared and validated. For example, the taxonomic classification of strains based on universal marker gene phylogenetic comparisons can produce inconsistent assignments relative to established taxonomies [15, 24]. Detection may also be limited to the dominant strain in a metagenomic sample [17] or depend on the availability of completely sequenced reference genomes for comparison [28]. Finally, non-stringent similarity thresholds can result in distinct subspecies lineages to become assigned to the same strain, which is problematic if the strain is used to infer microbial persistence or transfer. In this case, for example, human intestinal microbiomes may contain the same “strain,” i.e., a subspecies lineage with widespread prevalence in the human population, without having experienced direct microbial transfer.

To address these limitations, we developed SameStr as a new tool for the detection of shared strains in metagenomic samples. SameStr leverages the StrainPhlAn approach to map metagenomic reads to clade-specific marker genes [17], which compared to other tools affords increased taxonomic resolution [29]. However, SameStr extends the detection of shared strains to subdominant members of multi-strain species populations. This is achieved by considering multiple alleles instead of the consensus sequence at polymorphic positions in the metagenomic marker gene alignments. We validated SameStr using new and available metagenomes, including temporally linked sample pairs (i.e., collected from the same individual at different time points) or physically linked sample pairs (i.e., collected from different, connected individuals, such as FMT donors and recipients). We demonstrate increased sensitivity for the detection of subdominant shared strains and increased specificity for the detection of species-specific strains, which are not shared between unrelated sample pairs, over previous methods. We applied SameStr to identify a core gut microbiota of strains that persist over time in healthy adults and to determine the contributions of recipient- and donor-derived strains to the post-FMT patient microbiota, illustrating SameStr’s utility to study microbiome stability and transfer across different settings. We further show that persisting strains in healthy adults frequently belonged to the same species as donor-derived strains in post-FMT patients, suggesting the existence of a healthy adult core gut microbiota that is transferable from donors to rCDI patients by FMT.

Results

Detection of shared strains in metagenomic samples with SameStr

We developed the SameStr tool based on a workflow related to StrainPhlAn [17] to identify shared microbial strains in distinct metagenomic samples using within-species phylogenetic sequence variations (Fig. 1A). In brief, metagenomic input data are first quality-filtered and trimmed to reduce sequencing errors and then mapped to the MetaPhlAn reference database of species-specific marker genes [30], in order to limit the interference of higher-level taxonomic sequence variations with strain detection. Individual alignments for each sample and species are filtered and merged.
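The idea introduced above, comparing samples on all alleles observed at polymorphic positions rather than on a single consensus base, can be illustrated with a small sketch. This is not SameStr's implementation: the per-position count layout, the function names, the illustrative frequency cutoff `min_freq`, and the rule that a position matches when two samples share at least one allele are all assumptions made for this toy example.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy layout: one dict of nucleotide read counts per position.
Counts = Dict[str, int]


def consensus_base(counts: Counts) -> Optional[str]:
    """Return the most frequent base, or None if the position is uncovered."""
    if sum(counts.values()) == 0:
        return None
    return max("ACGT", key=lambda base: counts.get(base, 0))


def allele_set(counts: Counts, min_freq: float = 0.10) -> Set[str]:
    """Return every base whose relative frequency reaches min_freq (assumed cutoff)."""
    total = sum(counts.values())
    if total == 0:
        return set()
    return {base for base in "ACGT" if counts.get(base, 0) / total >= min_freq}


def consensus_similarity(a: List[Counts], b: List[Counts]) -> Optional[float]:
    """Fraction of jointly covered positions with identical consensus bases."""
    shared = matches = 0
    for counts_a, counts_b in zip(a, b):
        base_a, base_b = consensus_base(counts_a), consensus_base(counts_b)
        if base_a is None or base_b is None:
            continue  # position not covered in both samples
        shared += 1
        matches += base_a == base_b
    return matches / shared if shared else None


def max_variant_similarity(
    a: List[Counts], b: List[Counts], min_freq: float = 0.10
) -> Optional[float]:
    """Fraction of jointly covered positions where the samples share an allele."""
    shared = matches = 0
    for counts_a, counts_b in zip(a, b):
        set_a, set_b = allele_set(counts_a, min_freq), allele_set(counts_b, min_freq)
        if not set_a or not set_b:
            continue  # position not covered in both samples
        shared += 1
        matches += bool(set_a & set_b)
    return matches / shared if shared else None


# Toy data: a single-strain sample and a mixed sample in which the same
# strain is subdominant (3 of 10 reads at the polymorphic third position).
donor = [{"A": 10}, {"C": 10}, {"G": 10}, {"T": 10}]
mixed = [{"A": 10}, {"C": 10}, {"A": 7, "G": 3}, {"T": 10}]

cvs = consensus_similarity(mixed, donor)
mvs = max_variant_similarity(mixed, donor)
```

In this toy case the consensus comparison sees a mismatch at the polymorphic position, while the allele-set comparison still recognizes the subdominant strain; real alignments would additionally need coverage and overlap filters.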
Strains shared between samples are identified by comparing alignments, using a maximum variant profile similarity (MVS), which is calculated as the fraction of identical nucleotide positions in both alignments divided by the total length of the shared alignment (Fig. 1B). A comparison of SameStr’s resource requirements (total CPU time, CPU time per sample, and average RAM use), compared to metagenomic sequence preprocessing with Kneaddata and taxonomic analysis with MetaPhlAn is shown in the supplement (Fig. S1). In contrast to StrainPhlAn, which determines a consensus sequence for each marker alignment and compares metagenomes based on the consensus variant similarity (CVS) that only reflects the dominant strain in each sample, SameStr considers all detected single nucleotide variants to calculate MVS, including polymorphic positions with different relative allelic frequencies (default: ≥ 10%), thereby including non-dominant strains into the sample comparison. SameStr calls shared strains in two metagenomic samples if the corresponding species alignments share a minimum overlap (default: ≥ 5 kb) and MVS (default: ≥ 99.9%) over all detected sites. A similarity threshold of 99.9% for comparing MetaPhlAn marker genes (db_v20) was previously shown to differentiate between microbial strains within species and subspecies [17, 32, 33] and is further validated by our phylogenetic comparison of reference genomes based on whole-genome average nucleotide identity (ANI) (Fig. 1C).

Fig. 1 Species-specific shared strain detection in metagenomic samples with SameStr. A Schematic of the SameStr workflow. SameStr has been implemented modularly, including optional wrapper functions for quality preprocessing and alignment of whole-genome shotgun (WGS) metagenomic reads to species-specific MetaPhlAn markers (align), functions for the conversion to nucleotide variant profiles (convert), extraction of markers from genome sequences (extract), sample and reference pooling (merge), extensive global, per-sample, marker and position filtering (filter) and comparison of SNV profiles (compare) based on maximum variant similarity (MVS). SameStr outputs (summarize) tables denoting pairwise comparison results, including species alignment similarity and overlap, and co-occurrence of taxa at distinct taxonomic levels (based on MetaPhlAn) and at the strain level. B SameStr identifies shared strains in metagenomic samples by calculating a pairwise MVS, using all single-nucleotide variants detected in the read alignments of these samples to species-specific marker genes. C To assess the MetaPhlAn-based phylogenetic resolution (db_v20) and validate the 99.9% similarity threshold of shared strains, which is used by SameStr, 458 bacterial genomes from 20 of the most abundant and prevalent fecal microbiota species in our rCDI cohort (Table S4) were compared with MetaPhlAn2 [30] and based on average nucleotide identities (ANIs) as determined with FastANI [31]. MetaPhlAn2 and FastANI-based pairwise sequence similarities are strongly correlated (Spearman’s r = 0.93, p < 2.2e−16, n = 9813), demonstrating comparable phylogenetic resolution. Genome similarities exhibit a multimodal distribution (two-dimensional density kernel contours): reference genomes share peak sequence similarities at 97.5%, 99.0%, and above 99.9% identity that reflect the presence of distinct species, subspecies, and strains in the reference dataset

Validation of sensitivity and specificity of SameStr in comparison to other strain prediction tools

We first evaluated SameStr’s performance on synthetic, simulated metagenomes from species containing multiple strains. Mock sequence data from 100 individual isolates from 20 frequent and abundant bacterial gut species (Table S4) were mixed in various combinations to simulate metagenomes containing species composed of multiple strains and variable complexity and sequencing depth. For each species, simulated shotgun sequence data from a reference genome (at a 5-fold sequencing depth and showing typical sequencing error profiles, see “Materials and methods”) were compared to simulated metagenomes. These included the same reference genome (showing an independent typical error profile) at variable sequencing depths (target strain coverage), combined with additional sequence data from between 1 and 4 other available genomes from the same species at varying sequencing depths (noise coverage).

SameStr’s strain predictions based on maximum variant profile similarity (MVS) were compared to those of a StrainPhlAn-equivalent consensus variant similarity (CVS)-based approach across a total of 3276 simulated combinations (Fig. 2A). SameStr outperformed the consensus-based approach for the detection of dominant target strains (≥ 50% relative strain abundance at ≥ 5-fold target strain sequencing depth) in multi-strain species populations, detecting 85% of shared strains compared to 59% with the CVS-based approach. SameStr also detected 57% of shared strains among subdominant strains (15–50% relative strain abundance at ≥ 5-fold target strain sequencing depth), compared to only 2% for the consensus-based method. The better performance of SameStr compared to consensus-based methods in even the identification of dominant strains might be due to the lower sensitivity of the MVS-based approach to sequencing errors and wrong consensus calls at polymorphic and/or low-coverage positions of the metagenomic read alignment. Importantly, advantages in accuracy were not accompanied by reduced specificity, as both approaches were robust against false-positive shared strain calls even in complex multi-strain species mixtures (see 0-fold target strain coverage in Fig. 2A).

The StrainFinder tool has been developed to study strain-level microbiota dynamics in the course of fecal microbiota transplantation (FMT) [15]. StrainFinder used phylogenetic comparisons of 31 widely distributed, single-copy marker genes from the AMPHORA database [34] to define metagenomic operational taxonomic units (mg-OTUs) and call distinct strains based on sequence variations within these species equivalents [15]. We compared the performances of SameStr and StrainFinder with respect to (i) taxonomic sensitivity, i.e., the number of microbial genera and species assessed for shared strain detection (Fig. 2B), and (ii) specificity for the detection of ‘unique’ shared strain events, i.e., the frequency of shared strain predictions in unrelated sample pairs, which would interfere with our goal to use shared strains to infer strain persistence or transfer (Fig. 2C). Using the published datasets and taxonomic profiles from the original StrainFinder publication [15], SameStr consistently detected more species and genera, both across the entire dataset (154 vs. 116 genera and 399 vs. 306 species/mg-OTUs) and per sample (50.54 ± 15.0 vs. 23.78 ± 16.67 genera and 97.62 ± 39.6 vs. 48.48 ± 33.88 species/mg-OTUs; values shown as mean ± sd) (Fig. 2B). Differentially detected taxa included prominent members of the gastrointestinal tract microbiota, such as Bacteroides spp. (6.54 ± 5.35 species vs. 3.87 ± 4.70 mg-OTUs per sample), Clostridium spp. (4.81 ± 4.05 species vs. 2.43 ± 3.25 mg-OTUs per sample), and Lactobacillus spp. (5.06 ± 3.33 species vs. 1.41 ± 2.37 mg-OTUs per sample).

Fig. 2 Sensitivity and specificity comparison to other strain prediction tools. A SameStr detects dominant and subdominant strains at low sequencing depth (mean-fold target strain coverage) and relative abundance (i.e., high noise coverage) in simulated metagenomes (n = 3276) of multi-strain species populations, compared to consensus variant profile similarity (CVS)-based methods. B Using MetaPhlAn’s clade-specific marker gene database (db_v20), SameStr identifies more genera and species per metagenomic sample (n = 65) than StrainFinder, which uses mg-OTUs that are defined based on phylogenetic comparisons of universally distributed bacterial genes from the AMPHORA database. C Fewer shared strain calls demonstrate the increased specificity of SameStr compared to StrainFinder, which allows for the differentiation of related (n = 555) and unrelated (n = 1,525) sample pairs. D Cumulative relative abundance and fraction of species for which strain-level resolution was achieved with SameStr in fecal metagenomes from a reference cohort of 67 longitudinally sampled healthy adults (n = 202). E SameStr’s MVS-based method detects shared strains in a larger fraction of species in related (same individual, n = 281) but not in unrelated (different individuals, n = 20,020) sample pairs of the control cohort (n = 202 individuals) compared to CVS-based methods

For the detection of shared strains, we divided the original FMT dataset from Smillie et al. into related and unrelated sample pairs. Related sample pairs included corresponding FMT recipient and donor samples, pre- and post-FMT patient samples, and distinct samples from the same donor or post-FMT patient. SameStr detected on average 14.77 (median = 12, range = 0–67) shared strains in 555 related sample pairs and 0.45 (median = 0, range = 0–8) shared strains in 1525 unrelated sample pairs. By comparison, StrainFinder reported on average 93.13 (median = 73, range = 0–384) shared strains in related but also 35.16 (median = 25, range = 0–238) shared strains in unrelated sample pairs (Fig. 2C). These findings suggest that StrainFinder classifies subspecies lineages with broader prevalence in human populations as shared strains, which based on SameStr’s more conservative definition of “unique” shared strains would be considered false-positive predictions.

To further assess SameStr’s rate of false-positive shared strain predictions in fecal metagenomes, we downloaded a reference dataset (‘control’) from the curatedMetagenomicData package [35], consisting of 202 fecal metagenomes from four different studies, including 67 healthy adults that were sampled multiple times over a period of up to 1 year (see “Materials and methods” and Table S2). On average, strain-level resolution was obtained for 26.2% ± 6.8 of species or 71.4% ± 15.9 relative abundance per sample (Fig. 2D). This control dataset was divided into related sample pairs from the same individual, which would be expected to share strains, and unrelated sample pairs from distinct individuals, which would not be expected to share strains. Compared to the consensus-based method that is used by StrainPhlAn, SameStr detected more shared strains in 281 related sample pairs (range = 4–43, median = 14) but not in 20,020 unrelated sample pairs (range = 0–4, median = 0) (Fig. 2E), demonstrating increased sensitivity without compromising the low rates of false-positive shared strain detections that both approaches showed.

In summary, SameStr can detect shared strains in synthetic and real metagenomes, including from single- and multi-strain species populations, with improved accuracy for low-abundant and subdominant strains compared to StrainPhlAn and taxonomically more accurate and restrictive predictions of shared strains compared to StrainFinder.

Identification of strain persistence and engraftment in healthy individuals and rCDI patients after FMT

To gain insights into (i) microbiome stability in healthy individuals and (ii) microbiome transfer in the course of FMT, we applied SameStr to measure strain persistence and engraftment in our reference dataset of fecal metagenomes from healthy adult individuals and a combined FMT dataset with fecal samples from FMT-treated rCDI patients and their donors from our previously described cohort [36] and the study by Smillie et al. [15].

To study strain persistence in the fecal microbiota of healthy individuals, we used the reference cohort of 67 healthy adults described above and determined shared strains in sample pairs collected from the same individuals over periods of up to one year (Fig. 3A, see Fig. S2 for individual cases and samples). Contributions of temporally persistent strains that were shared between multiple samples from the same individual were relatively stable over time and comprised on average 22.6% ± 6.3 (mean ± sd) of all detected species in the later sample, which accounted for 73.1% ± 18.3 relative abundance. Strain persistence was detected with variable frequencies for different microbial genera (Fig. 3B) and species (Fig. S3). Based on the assignment of microbial species to different functional and lifestyle feature categories (see “Materials and methods” for details, Table S5), strain persistence was less frequent in oral and/or oxygen-tolerant genera (Fig. 3B) and species (Fig. S3).

To study strain persistence and engraftment in the course of FMT, we generated new metagenomic sequence data from our previously described cohort of FMT-treated rCDI patients [36, 37], which we combined with other available data [15] and applied SameStr to detect shared strains between pre- and post-FMT patients and post-FMT patients and donors (Fig. 3C, Table S7). Recipient and donor-derived species fractions and relative abundances in post-FMT patients were determined as being represented by shared strains between pre- and post-FMT patients or post-FMT patients and donors, respectively (Fig. 3D, see Fig. S4 for individual cases and samples). During the first week after FMT, both donor and recipient-derived strains contributed large relative abundances to the post-FMT microbiota (days 1–7: 42.5% ± 30.3 vs. 18.9% ± 22.3), but donor-derived microbiota fractions remained more stable over the following weeks and months, whereas recipient-derived microbiota fractions continuously decreased (days 70–84: 26.5% ± 21.9 vs. 4.9% ± 9.0). Donors and recipients before FMT frequently carried the same microbial species, but this rarely resulted in the detection of both recipient and donor-derived strains after FMT (Table S8). Consequently, coexisting recipient and donor strains from the same species accounted for only small and decreasing species fractions (0.46% ± 0.68) and relative abundances (5.19% ± 11.54) in post-FMT patients (Fig. 3D). Donor strain engraftment frequencies varied taxonomically and were less frequent in oral and/or oxygen-tolerant genera (Fig. 3B) and species (Fig. S3).

Fig. 3 Identification of strain persistence and donor strain engraftment in healthy individuals and rCDI patients after FMT. A Longitudinal species and strain persistence in healthy adults from the reference (Control) cohort are shown as relative abundances of shared species and species fractions in 95 sample pairs from 59 individuals and modeled using binomial smoothing. Strain proportions are based on corresponding species. Species fractions indicate insufficient resolution for strain prediction. B Taxonomic variations in the frequency of species (dark blue), and strain (light blue) persistence in healthy individuals (n = 59) and FMT recipients (n = 19), and of donor species (dark green) and strain (light green) engraftment in post-FMT patients are shown, as summarized on the genus level for the 50 most prevalent genera (see Fig. S3 for species). Newly detected species and strains are shown in dark and light yellow, respectively. C Comparison of shared strain numbers between rCDI patients and donors. Distinct rCDI patients who received stool from the same donor share more strains than other post-FMT patients. D Donor-derived strains and species (exclusively shared with the donor but with insufficient resolution for strain prediction) account for large and stable relative abundances and species fractions in FMT-treated rCDI patients. Data for triads of successfully FMT-treated rCDI patients (n = 30) in reference to their pre-FMT (n = 19) and donor (n = 14) metagenomes are modeled across cases using binomial smoothing. E The frequencies of strain persistence in healthy individuals and of donor strain engraftment in rCDI patients after FMT are positively correlated at the genus level (Spearman’s r = 0.72, p < 1e−8), including for abundant members of the healthy adult fecal microbiota (see Fig. S5 for species-level comparison)

We next compared the healthy adult and FMT cohorts and found strains that frequently persisted in healthy individuals to belong to the same genera and species as donor strains that frequently engrafted in patients after FMT (Fig. 3E, see Fig. S5 for species comparison; Table S9-S10). Frequently persisting and engrafting genera included abundant (>5%) members of the healthy adult gut microbiota, such as Bacteroides, Blautia, Coprococcus, and Eubacterium (Fig. 3E), and similar observations were made at the species level (Fig. S5). Thus, FMT appears to specifically lead to the engraftment of persisting and abundant healthy gut microbiota members in rCDI patients.

Identification of healthy individuals and FMT recipients and donors using shared strain profiles

The detection of species overlaps between the persisting core gut microbiota in healthy adults and the engrafted donor microbiota in rCDI patients after FMT prompted us to test if individuals were identifiable based on shared strain profiles in fecal metagenomes. To this end, we first trained and tested a logistic regression classifier (60% / 40% data split for training and testing) to identify sample pairs from the same individuals in our healthy adult reference dataset, based on overlapping taxonomic microbiota compositions. Microbiota profiles at the family, genus, and species level were determined with MetaPhlAn2 and at the strain level with SameStr; total and shared taxa and strains were used as input for the classifier (Fig. 4A, Table S6). A perfect classification (auPR = 1, auROC = 1) of 8120 hold-out sample pairs (n = 112 sample pairs from the same individuals) was achieved with shared strain profiles, whereas shared family and genus profiles were insufficient (auPR ≤ 0.18, auROC ≤ 0.87) and even shared species profiles performed poorly (auPR = 0.47, auROC = 0.93). We next tested the same logistic regression classifier that was trained on healthy individuals for the identification of related sample pairs from the FMT cohort (n = 580 related compared to n = 3606 unrelated sample pairs), i.e., pre- and post-FMT samples from the same patients, corresponding post-FMT patient and donor samples, and post-FMT samples from different patients that received FMT from the same donor. Again, our classifier performed well using shared strain profiles as input (auPR = 0.94, auROC = 0.93) but not higher-level taxa profiles (Fig. 4B, Table S6). Thus, our findings demonstrate that the fecal microbiota of healthy adults harbors identifiable personal strain profiles, at least over periods of up to one year, which are transferable from donors to rCDI patients after FMT.

Fig. 4 Identification of healthy individuals and FMT recipients and donors using shared strain profiles. Receiver-operating characteristic (ROC) and precision-recall (PR) curves of logistic regression classifiers demonstrate sensitive and accurate identification of (A) longitudinally collected sample pairs from the same healthy individuals (n = 112 from a total of n = 8120 sample pairs) and (B) related FMT patient and donor sample pairs (n = 580, including pre- and post-FMT patient samples, post-FMT patient and donor samples, and post-FMT patient samples that received FMT from the same donor, from a total of n = 4186 sample pairs)

Shared strain network analysis for the identification of mislabeled metagenomes

The published metagenomic sequence data used for this study included several samples that, while presenting with inconspicuous species-level taxonomic microbiota compositions, showed unexpected and inconsistent shared strain profiles that led to their removal from the analysis (Table S11). To illustrate these inconsistencies, shared strain profiles, as generated with SameStr, were visualized as unsupervised networks, which assigned related samples to distinct clusters linking, for example, samples from the same individual (Fig. 5A) or from FMT recipients and donors (Fig. 5B). However, in three cases > 2× more shared strains were detected between supposedly unrelated samples than between any of the other > 20,000 unrelated sample pairs from our dataset. In every case, suspicious sample pairs had been submitted as part of the same study and inconsistencies could be resolved by switching or changing sample labels (see Fig. 5 legend for details), suggesting sample mix-up or mislabeling. We have reported similar findings of potentially mislabeled samples in a meta-analysis of neonatal metagenomes [23], indicating that inconsistencies in public metagenomes might be common. Microbiota strain profiling with SameStr or equivalent tools could represent a viable strategy for the quality control of metagenomic sequence data from fecal microbiome projects.

Fig. 5 SameStr-based unsupervised strain sharing networks identify potentially mislabeled samples. Shared strain profiles were visualized as unsupervised networks with individual samples as nodes and shared strain numbers as edges. A These networks connect samples from Louis et al. [38] by individual, with the exception of two samples (AS64_24 and AS66_24) that appear to be mixed up. B In a case of multiple rCDI patients treated with FMT from the same donor [15], shared strains were detected between pre- (blue) and post-FMT (yellow) patient samples, as well as between post-FMT and donor (green) samples and among post-FMT samples. Pre-FMT samples did not share strains with donor samples, with the exception of FMT15, which shares (> 15) strains with all three donor samples and exhibits α/β-diversity compositions that are comparable to other post-FMT samples (data not shown). As this sample was collected on the day of the FMT procedure, FMT15 could in fact represent a post-FMT sample that was accidentally mislabeled as a pre-FMT sample (Smillie, personal communication)

[...] the identification of strain sharing between the intestinal, reproductive, and/or urinary tract or bloodstream, which could be used to better characterize endogenous reservoirs of opportunistic pathogens and microbial translocation between human body niches as a cause of infection and disease [41, 42].
We applied SameStr to study strain persistence in the intestinal microbiota of healthy individuals, as well as Discussion strain persistence and engraftment in patients after fecal We introduce SameStr as a new bioinformatic tool for microbiota transplantation, using combined datasets the identification of shared microbial strains in metagen - from multiple studies, including healthy adults sampled omic shotgun sequence data, which allows for the detec- over durations of up to one year and rCDI patients, sam- tion and quantification of strain persistence and transfer pled before and after FMT together with their donors. and improves our ability to track and understand sub- We detected strain persistence for many of the same species population dynamics in complex microbiomes. bacterial taxa, such as Bacteroides species, as previously In contrast to related methods that define strains more reported based on temporal single nucleotide polymor- broadly and allow for the presence of the same strain in phism (SNP) stability [43] and strain-resolved species- different, unrelated individuals [15, 16], SameStr applies a specific MAGs [19] in fecal metagenomes from healthy more conservative definition of strains as “unique” phylo - individuals. Persistence has been negatively correlated to genetic lineages that should only be shared by either tem- the genetic capacity for oxygen tolerance and sporulation porally or physically related samples. It thereby affords before [19] and, based on comparative genome analy- the specificity to infer persistence or transmission from ses, the loss of sporulation has been genetically linked the detection of shared strains in distinct metagenomes. 
to typical features of host-adaptation, such as genome Recent fecal metagenomics-based epidemiological stud- reduction and metabolic specialization [44], confirm - ies identified subspecies lineages or clades of, for exam - ing our functional predictions for species that are fre- ple, Prevotella copri and Ruminococcus gnavus with quently represented by persisting strains, as well as our widespread prevalence in the human population, which concept of a persisting core gut microbiota of strict could be linked to dietary habits [39, 40] and host health anaerobe, non-spore-forming bacteria in the healthy background, i.e., inflammatory bowel disease [6], respec - human gut. We also identified a surprising taxonomic tively. Strain-level microbiota profiling with SameStr pro - association between strain persistence and engraftment, vides the phylogenetic resolution to track even individual as strains with a high persistence rate in healthy indi- strains within these subspecies clades in the human pop- viduals belonged to the same bacterial species as donor ulation, illustrating new opportunities to shed light on strains with a high engraftment rate in rCDI patients the role of these and other microbiome members for after FMT. Given that persistence in the complex gut human lifestyle adaptation and disease development. microbiomes of healthy individuals, as well as engraft- Methodically, SameStr is related to the StrainPhlAn ment in the dysbiotic microbiomes of rCDI patients, tool, as both use the taxon-specific marker gene database requires strains to compete with other persisting, resi- from MetaPhlAn [30] to identify and compare microbial dent, and/or newly incoming strains, our analysis likely species-specific single nucleotide variant profiles. How - identified bacterial species of high ecological competi - ever, SameStr’s approach to determine maximum vari- tiveness and fitness. 
This is further supported by Hilde - ant profile similarities between metagenomic samples, brand et al., who used the concept of tenacity to describe including polymorphic alleles, demonstrates increased strain persistence in human individuals and described sensitivity for the detection of shared strains among tenacious bacteria, such as Bacteroides species, as host- multi-strain species populations, especially between adapted, frequently dispersed by vertical transmission subdominant strains. Dominant and secondary mater- from mothers to infants, and most negatively affected by nal strains of Bifidobacterium and Bacteroides species antibiotic perturbation [19]. In this context, the lack of have been shown to compete for colonization in neo- sporulation genes in tenacious bacteria likely reflects an nates after birth, contingent on their strain-specific car - adaptive mechanism to increase persistence by avoiding bohydrate-degrading capabilities [22], emphasizing the excessive intra-species strain competition [19]. Using dif- importance of considering multiple strains per species ferent methodologies, Watson et  al. similarly concluded for the detection of strain sharing and microbial transfer. that FMT selects for high-fitness populations of the gut Other clinical use cases, specifically for SameStr’s con - microbiome, based on the observation that a high preva- servative shared strain calls, could include, for example, lence of a microbial species in healthy individuals is more P odlesny et al. Microbiome (2022) 10:53 Page 11 of 15 predictive for colonization success after FMT than a high resolution and accuracy of SameStr’s taxonomic strain relative abundance of the same species in a FMT donor classifications compared to those from the StrainFinder [45]. Based on these considerations, the identification tool. 
Moreover, SameStr can be easily adapted for use and characterization of stably persisting strains in healthy with updated (e.g., MetaPhlAn3, mpa_v30_CHOC- individuals could present a viable and more useful strat- OPhlAn_201901 [29]) or alternative, user-provided, egy to determine different constitutions of personalized, marker sequences. Second, we developed SameStr spe- adapted core microbiomes of the human gut, than more cifically for the metagenome-based detection of strain commonly used β-diversity metrics based on species or sharing between fecal microbiomes. SameStr can be higher-level taxon persistence. As key microbiome attrib- used to identify species that are represented by multiple utes, such as colonization resistance against pathogens strains, based on the detection of multiple alleles within [46] or resilience towards other perturbations [47] should a species-specific marker gene alignment of a single sam - be determined by the fitness of its core members, char - ple, with multi-strain species populations exhibiting ≥ acterization of the persisting gut microbiota might con- 0.1% polymorphic positions  of all detected alignment stitute an ecological approach to define a healthy human sites. However, it does not provide similar insights into gut microbiome [48]. strain population structures as related tools [15]. Third, Our analyses suggest additional practical applications in order to reliably detect strain-specific SNV profiles, for metagenomics strain profiling that extend previous SameStr required a sequencing depth of the genome concepts of microbiome-based forensic markers for per- corresponding to this strain of > 5-fold in our validation sonal identification [49]. Franzosa et al. identified combi - experiments, irrespective of whether this strain was the nations of taxonomic (operational taxonomic units and only representative or a minor component of a multi- species), genomic (genome fragments), and functional strain species population. 
Assuming an average genome (genes) markers as ‘metagenomic codes’ that could be length of 2.5 Mbp and a metagenomic sequencing depth used to match > 80% of fecal sample pairs that were col- of 5 Gbp per sample (corresponding to 2000 genomes of lected over periods of 30-300 days from the same individ- average length), we estimate that SameStr is limited to uals [50]. Similarly, a majority of > 300 individuals could the detection of shared or coexisting strains that make up be identified in a mixed human cohort (auPR = 0.87, at least 0.25% of all genomes in the metagenomic sam- auROC = 0.95), using rare fecal metagenomic oligomers ple or 0.25% species relative abundance in case of single- (k-mers of 18–30-bp length) [51]. Yet our shared strain- strain species. based personal identification method outperformed both In conclusion, we present SameStr as a new bioinfor- previous attempts by demonstrating a 100% success rate matic tool for the species-specific, conservative identifi - for the detection of matching sample pairs (n = 112 from cation of unique shared subspecies taxa in metagenomic a total of 8120 sample pairs) from the same healthy indi- shotgun sequence data, including subdominant members viduals and, in addition, correctly identified most sam - of multi-strain species populations. We demonstrate ple pairs from linked FMT donors and recipients (auPR increased sensitivity, specificity, and taxonomic accu - = 0.94, auROC = 0.93 for n = 580 from a total of 4186 racy of detected strains in fecal metagenomes compared sample pairs). Standard practice for microbiome projects to related tools, which affords reliable detection of tem - dictates the removal of human reads from metagenomic poral strain persistence and transfer after fecal micro- sequence data to de-identify samples before release. Our biota transplantation. 
We identify a persisting fecal core findings attest to the persistence and FMT-dependent microbiota in healthy individuals, which taxonomically transferability of personalized gut microbiome strain overlaps with the engrafted donor microbiota in rCDI profiles and suggest that filtered public metagenomes patients after FMT, demonstrating the utility of SameStr retain personal information that could make study par- to gain new insights into human gut microbiome stabil- ticipants and FMT donors retrospectively identifiable. ity and modulation. Application of this approach to other The SameStr platform has a few limitations. First, as microbiome projects will improve our understanding strains are identified based on SNV profiles in clade-spe - of microbiome organization and function and should cific marker genes, their detection is dependent on the advance most areas of microbiome research. underlying database and limited to previously described, sequenced, and comparatively analyzed taxa [30]. How- Materials and methods ever, taxonomic assignments based on universal instead Study cohort of species-specific marker genes, which are less depend - Metagenomic shotgun sequence data were generated ent on available genome sequence information, can from a previously published cohort of FMT-treated rCDI show discrepancies from established taxonomic sys- patients [36, 37]. The sample set included eight rCDI tems [52], which could explain the increased taxonomic patient samples, collected 1–2 days before treatment, and Podlesny et al. Microbiome (2022) 10:53 Page 12 of 15 Metagenomic strain‑level profiling with SameStr eleven patient samples, collected between 1 week and up The following individual analysis steps are part of the to 1 year after FMT. FMT was performed at Sinai Hospi- SameStr protocol to identify shared strains in metagen- tal of Baltimore, Baltimore, MD, USA, by single infusion omic samples (Fig. 
1): of fecal filtrate from healthy donors into the jejunum and colon of rCDI patients. Study design, patient selection criteria, donor screening, infusion protocol, and sample 1. Taxonomic microbiota analysis. Preprocessed collection have previously been reported [36]. sequence reads from each sample were mapped against the MetaPhlAn clade-specific marker gene DNA isolation and sequencing database (db_v20, mpa_v20_m200) using Met- Metagenomic DNA extraction and sequencing of the 27 aPhlAn2 v2.6.0 [57]. We additionally generated taxo- fecal samples was conducted at the Institute for Genome nomic profiles for rarefied data, which were subsam - Sciences, University of Maryland School of Medicine. pled to 5 M reads (after QC) per sample (seqtk v1.0) DNA was extracted from 0.25 g of stored fecal samples before processing with MetaPhlAn2, confirming (− 80 °C), using the MoBio Microbiome kit automated representativeness of microbial communities as indi- on a Hamilton STAR robotic platform after a bead-beat- cated by strong correlations of Shannon Index (diver- ing step on a Qiagen TissueLyser II (20 Hz for 20 min) sity, vegan v2.5.7) between data. in 96 deep-well plates. Metagenomic libraries were con- 2. Detection of SNV profiles in marker gene alignments. structed using the KAPA Hyper Prep (KAPA Biosystems/ Using the SameStr tool, MetaPhlAn2 marker gene Roche, San Francisco, CA, USA) library preparation kit alignments were filtered for ≥ 90% sequence identity, according to the manufacturer’s protocols. Sequencing a base call quality of Q20, and mapping length of 40 was performed on the Illumina HiSeq 4000 platform to bp. The frequencies of all four nucleotides were tabu - generate 150-bp paired-end reads. lated with Samtools v0.1.19 [58] and kpileup v1.0 [15], retaining unmapped alignment sites as gap positions. 
Published sequence data acquisition Marker gene alignments were trimmed by 20 nucleo- Publicly available fecal metagenomic sequence data, tides at both ends, concatenated for each species, and longitudinally collected from healthy adult individu- combined from all samples. In order to address atypical als, were obtained through curatedMetagenomicsData vertical coverage and wrong base calls for each sample, [35], including 202 metagenomes of 67 subjects (59 with alignment positions that diverged from the mean cov- known sampling days) from four different studies [38, erage by more than five standard deviations and alleles 53–55]. Individuals were sampled at least twice within that were represented by < 10% of all mapped reads at a year and had not reported medical conditions that this position were zeroed in the alignment. would suggest extensive medication or strong microbi- 3. Determination of maximum variant profile similarity ota perturbations between time points. For each subject, (MVS). To consider individual strains from multi-strain sequence data downloaded from the SRA were concat- species populations for the detection of shared strains, enated in case of multiple available accessions (Table S2). MVS were calculated between all species/alignment A total of 65 additional fecal metagenomes were obtained pairs M and M as the fraction of the sum of alignment i j from 18 cases of FMT-treated rCDI patients who had not positions with at least one shared allele C divided by allele been treated with FMT before [15]. the sum of positions with coverage in both alignments C , where the vector of shared alleles C was cal- cov allele Quality control and preprocessing of metagenomic culated as the pairwise Boolean product of 4 vectors of sequencing data nucleotide counts at all positions between alignments All raw paired-end metagenomic sequence reads were M and M . 
For consensus variant profile similarity i j quality-processed with Kneaddata v0.6.1 (KneadData (CVS) calculation, shared alleles were calculated as the Development Team, 2017) in order to trim sequence pairwise Boolean product of a single vector represent- regions where base quality fell below Q20 within a ing the consensus sequence of the alignment at all posi- 4-nucleotide sliding window and to remove reads that tions between alignments M and M . i j were truncated by more than 30% (SLIDINGWIN- 4. Comparison of reference genomes. Species-specific DOW:4:20, MINLEN:70). To remove human sequence marker gene regions were extracted from a total of contamination, trimmed reads were mapped against the 458 available genome sequences in the NCBI RefSeq human genome (GRCh37/hg19) with Bowtie2 v2.2.3 [56]. and Genome databases from the 20 most abundant Output files consisting of surviving paired and orphan and prevalent species in our rCDI cohort (Table S4). reads were concatenated and used for further processing For this, marker gene regions were extracted from ref- (Table S3). erence genomes with a StrainPhlAn utility [17], based P odlesny et al. Microbiome (2022) 10:53 Page 13 of 15 on BLASTn v2.6.0 comparisons, and used to generate Supplementary Information multiple sequence alignments with MUSCLE v3.8.31 The online version contains supplementary material available at https:// doi. org/ 10. 1186/ s40168- 022- 01251-w. [59]. After removing gap positions, marker gene align- ments were tabulated, concatenated, trimmed, and used to calculate the single-genome equivalent of Additional file 1: Figure S1. Computational Resource Requirements. (A) Total CPU time, (B) average CPU time per sample, and (C) average use of MVS. 
MVS-based genome similarities were compared RAM by the Kneaddata, MetaPhlAn3 (mpa_v30_CHOCOPhlAn_201901), to average nucleotide identities (ANI), as calculated for and SameStr programs during the processing of three datasets of different entire genomes with FastANI v1.3 [31]. sizes. SameStr, on average, added 4.3 CPU minutes per sample to the computational effort of the entire workflow. 5. Shared strain detection in distinct metagenomes. Additional file 2: Figure S2. Microbial Tracking across Individual Based on our reference genome comparison (Fig. 1C) Metagenomic Samples of Healthy Controls. Microbial tracking at the and in agreement with previous reports [21], a MVS species (top) and strain level (bottom) in healthy controls. Healthy adults threshold of 99.9% was applied to detect shared from the reference (Control) cohort harbor a core microbiota of persisting strains and species (insufficient sequencing depth for strain calls) shared strains that would be identified in related but not between fecal metagenomes sampled up to one year apart. unrelated microbiomes. Shared strain predictions Additional file 3: Figure S3. Predicting Donor Strain Engraftment in rCDI were additionally limited to sample pairs with at least Recipients after FMT. The frequencies of species (dark blue) and strain 5000 overlapping alignment positions. (light blue) persistence in healthy individuals and rCDI recipients, and of 6. Validation of SameStr on mock species populations. donor species (dark green) and strain (light green) engraftment in post- FMT patients, differ between bacterial species, with retained recipient Simulated shotgun sequence data were generated species and strains mostly being classified as oral and/or oxygen-tolerant with ART read simulator v2.5.8 [60] and combined in species. Newly detected species and strains are shown in dark and light various proportions to generate metagenomes from yellow, respectively. mock multi-strain species populations. 
Metagen- Additional file 4: Figure S4. Microbial Tracking across Individual Metagenomic Samples of FMT-treated rCDI patients. Donor-derived strains omic paired-end sequence read error profiles were and species (exclusively shared with donor but insufficient resolution for independently generated for each genome and simu- strain prediction) account for large and stable relative abundances across lation, using the Illumina HiSeq-20 error profile. For all post-FMT patient samples, whereas contributions of recipient-derived strains are comparatively smaller. each species (Table S4), five reference genomes were Additional file 5: Figure S5. Predicting Donor Strain Engraftment in rCDI randomly selected, including one target genome for Recipients after FMT. The same species that are represented by frequently shared strain detection and four other genomes to persisting strains in healthy individuals are also represented by strains that simulate a background noise of additional strains frequently engraft from donors in rCDI patients after FMT and belong to species that have a high relative abundance in the healthy adult control from the same species. Both the sequencing depths cohort. (fold coverage) of the target strain and its abundance Additional file 6: Table S1. Sample and Case Metadata. Table S2. WGS relative to all other strains (noise coverage) were var- Accession Identifiers. Table S3. WGS QC Data. Table S4. Reference Acces- ied for each simulation. Marker gene alignments and sions. Table S5. Species Metadata. Table S6. Logistic Regression of (Un) comparisons for MVS or CVS calculation and shared related Sample Pairs at Taxonomic Levels. Table S7. Shared Species and Strains for Individual Cases. Table S8. Events with Competing Strains. strain detection were performed as described above. Table S9. Strain Transmission Rates (per Species). Table S10. Strain Trans- mission Rates (per Genus). Table S11. Potential Sample Mislabellings. 
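To make steps 2, 3, and 5 of the SameStr protocol described above more concrete, the following minimal Python sketch computes MVS between two per-position nucleotide count profiles and applies the shared-strain criteria stated in the text (alleles below 10% of mapped reads zeroed, MVS threshold of 99.9%, at least 5000 overlapping positions). It is an illustration under stated assumptions, not the SameStr implementation: the function names, the tuple-based data layout, and the use of "greater than or equal" when comparing MVS to the threshold are hypothetical choices made for this example, and the coverage-based position filter (five standard deviations from mean coverage) is omitted for brevity.

```python
from typing import List, Optional, Sequence, Tuple

# One alignment position: read counts for (A, C, G, T); all zeros marks a gap.
Counts = Tuple[int, ...]
Profile = Sequence[Counts]

MVS_THRESHOLD = 0.999        # from the text: MVS threshold of 99.9%
MIN_OVERLAP = 5000           # from the text: at least 5000 overlapping positions
MIN_ALLELE_FRACTION = 0.10   # from the text: alleles < 10% of reads are zeroed


def filter_minor_alleles(profile: Profile,
                         min_fraction: float = MIN_ALLELE_FRACTION) -> List[Counts]:
    """Zero every allele supported by less than min_fraction of the reads."""
    filtered = []
    for counts in profile:
        depth = sum(counts)
        filtered.append(tuple(
            c if depth > 0 and c / depth >= min_fraction else 0
            for c in counts
        ))
    return filtered


def max_variant_similarity(profile_i: Profile,
                           profile_j: Profile) -> Tuple[Optional[float], int]:
    """Return (MVS, overlap).

    MVS = positions with at least one shared allele / positions covered in both
    profiles. MVS is None when no position is covered in both profiles.
    """
    if len(profile_i) != len(profile_j):
        raise ValueError("profiles must span the same alignment positions")
    shared = covered = 0
    for counts_i, counts_j in zip(profile_i, profile_j):
        if sum(counts_i) == 0 or sum(counts_j) == 0:
            continue  # gap in at least one sample: position is not compared
        covered += 1
        # Boolean product per nucleotide: an allele is shared if seen in both.
        if any(a > 0 and b > 0 for a, b in zip(counts_i, counts_j)):
            shared += 1
    return (shared / covered if covered else None), covered


def is_shared_strain(profile_i: Profile, profile_j: Profile,
                     threshold: float = MVS_THRESHOLD,
                     min_overlap: int = MIN_OVERLAP) -> bool:
    """Apply allele filtering, then the overlap and MVS criteria (assumed >=)."""
    mvs, overlap = max_variant_similarity(
        filter_minor_alleles(profile_i), filter_minor_alleles(profile_j)
    )
    return mvs is not None and overlap >= min_overlap and mvs >= threshold
```

Because any allele present in both samples counts as a match, a polymorphic position such as (6, 4, 0, 0) still matches a sample that carries only the second allele, which is what allows subdominant strains to contribute to the similarity. A consensus-based comparison (CVS) would instead reduce each position to its majority allele before taking the Boolean product.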
Classification of related and unrelated sample pairs Acknowledgements Not applicable. For the prediction of related samples (distinct samples from the same individual or connected samples from Authors’ contributions FMT donors and recipients) based on strain sharing, the Conceptualization, Methodology, and Writing—Original Draft and Review and Editing, D.P., J.W. and W.F.F.; Software, Validation, and Formal Analysis, D.P.; number of detected and shared taxa between sample pairs Resources, S.K.D.; Data Curation, C.A., E.D., and S.V.; Funding Acquisition, W.F.F. from the healthy adult reference dataset were determined The authors read and approved the final manuscript. at the family, genus, species, or strain level with Met- Funding aPhlAn or SameStr, respectively, as described above. Data D.P. and W.F.F. received funding from the German Research Foundation (DFG, were divided into training and hold-out data (60%/40%) Deutsche Forschungsgemeinschaft) under SPP 1656 (Project no. 316130265). and shared taxon or strain fractions used to train simple Availability of data and materials logistic regression models (tidymodels v0.1.2). The clas - We implemented SameStr to facilitate the comparison of nucleotide variant sifier that was trained on strain persistence in healthy profiles presented in this analysis. The program builds on previously published adults was then used to predict related sample pairs from tools such as the concept of StrainPhlAn but extends the analysis of Met- aPhlAn markers beyond a consensus-based approach by extracting all four the FMT cohorts. To assess the performance of the pre- nucleotide alleles from sequence alignments. Generated SNV-Profiles are in dictor, precision-recall (tidymodels v0.1.2) and receiver NumPy format and can be used as input for strain composition modeling and operating characteristic (ROC) curves were generated other analyses. The SameStr program and further documentation are available at GitHub: https:// www. github. 
com/ danie lpodl esny/ SameS tr. git. R Markdown (tidymodels v0.1.2) and visualized (plotROC v2.2.1). Podlesny et al. Microbiome (2022) 10:53 Page 14 of 15 notebooks and additional code for generating figures are available at https:// 12. Vicedomini R, Quince C, Darling AE, Chikhi R. Strainberry: automated www. github. com/ danie lpodl esny/ fmt_ rcdi. git. Metagenomic sequence strain separation in low-complexity metagenomes using long reads. Nat data are available from the European Nucleotide Archive under accession Commun. 2021;12:4485. PRJEB39023. 13. Karcher N, Nigro E, Punčochář M, Blanco-Míguez A, Ciciani M, Manghi P, et al. Genomic diversity and ecology of human-associated Akkermansia species in the gut microbiome revealed by extensive metagenomic Declarations assembly. Genome Biol. 2021;22:209. 14. Quince C, Nurk S, Raguideau S, James R, Soyer OS, Summers JK, et al. Ethics approval and consent to participate STRONG: metagenomics strain resolution on assembly graphs. Genome The Institutional Review Board of Sinai Hospital Baltimore approved the Biol. 2021;22:214. study under protocol number #1826 and all subjects provided their written 15. Smillie CS, Sauk J, Gevers D, Friedman J, Sung J, Youngster I, et al. Strain informed consent to participate in the study. tracking reveals the determinants of bacterial engraftment in the human gut following fecal microbiota transplantation. Cell Host Microbe. Consent for publication 2018;23:229–40.e5. Not applicable. 16. Milanese A, Mende DR, Paoli L, Salazar G, Ruscheweyh H-J, Cuenca M, et al. Microbial abundance, activity and population genomic profiling Competing interests with mOTUs2. Nat Commun. 2019;10:1014. The authors declare that they have no competing interests. 17. Truong DT, Tett A, Pasolli E, Huttenhower C, Segata N. Microbial strain- level population structure and genetic diversity from metagenomes. Author details Genome Res. 2017;27:626–38. 
Department of Microbiome Research and Applied Bioinformatics, University 18. Olm MR, Crits-Christoph A, Bouma-Gregson K, Firek BA, Morowitz MJ, of Hohenheim, Stuttgart, Germany. Current address: Ring Therapeutics, Banfield JF. inStrain profiles population microdiversity from metagenomic Cambridge, MA, USA. Division of Gastroenterology, Sinai Hospital of Balti- data and sensitively detects shared microbial strains. Nat Biotechnol. more, Baltimore, MD, USA. APC Microbiome Ireland, School of Microbiology, 2021;39:727–36. and Department of Medicine, University College Cork, Cork, Ireland. I nstitute 19. Hildebrand F, Gossmann TI, Frioux C, Özkurt E, Myers PN, Ferretti P, et al. for Genome Sciences, University of Maryland School of Medicine, Baltimore, Dispersal strategies shape persistence and evolution of human gut MD, USA. bacteria. Cell Host Microbe. 2021;29:1167–76.e9. 20. Schmidt TS, Hayward MR, Coelho LP, Li SS, Costea PI, Voigt AY, et al. Received: 27 October 2021 Accepted: 24 February 2022 Extensive transmission of microbes along the gastrointestinal tract. Elife. 2019;8. https:// doi. org/ 10. 7554/ eLife. 42693. 21. Ferretti P, Pasolli E, Tett A, Asnicar F, Gorfer V, Fedi S, et al. Mother-to-infant microbial transmission from different body sites shapes the developing infant gut microbiome. Cell Host Microbe. 2018;24:133–45.e5. References 22. Yassour M, Jason E, Hogstrom LJ, Arthur TD, Tripathi S, Siljander H, et al. 1. Duvallet C, Gibbons SM, Gurry T, Irizarry RA, Alm EJ. Meta-analysis of gut Strain-level analysis of mother-to-child bacterial transmission during the microbiome studies identifies disease-specific and shared responses. Nat first few months of life. Cell Host Microbe. 2018;24:146–54.e4. Commun. 2017;8:1784. 23. Podlesny D, Fricke WF. Strain inheritance and neonatal gut microbiota 2. Sze MA, Schloss PD. Erratum for Sze and Schloss, “Looking for a signal in development: a meta-analysis. Int J Med Microbiol. 2021;311:151483. 

Journal

Microbiome (Springer Journals)

Published: Mar 25, 2022
