

Fast lightweight accurate xenograft sorting

Motivation: With an increasing number of patient-derived xenograft (PDX) models being created and subsequently sequenced to study tumor heterogeneity and to guide therapy decisions, there is a similarly increasing need for methods to separate reads originating from the graft (human) tumor and reads originating from the host species' (mouse) surrounding tissue. Two kinds of methods are in use: On the one hand, alignment-based tools require that reads are mapped and aligned (by an external mapper/aligner) to the host and graft genomes separately first; the tool itself then processes the resulting alignments and quality metrics (typically BAM files) to assign each read or read pair. On the other hand, alignment-free tools work directly on the raw read data (typically FASTQ files). Recent studies compare different approaches and tools, with varying results.

Results: We show that alignment-free methods for xenograft sorting are superior concerning CPU time usage and equivalent in accuracy. We improve upon the state of the art by presenting a fast lightweight approach based on three-way bucketed quotiented Cuckoo hashing. Our hash table requires memory comparable to an FM index typically used for read alignment and less than other alignment-free approaches. It allows extremely fast lookups and uses less CPU time than other alignment-free methods and alignment-based methods at similar accuracy. Several engineering steps (e.g., shortcuts for unsuccessful lookups, software prefetching) improve the performance even further.

Availability: Our software xengsort is available under the MIT license at http://gitlab.com/genomeinformatics/xengsort. It is written in numba-compiled Python and comes with sample Snakemake workflows for hash table construction and dataset processing.

Keywords: Xenograft sorting, Alignment-free method, Cuckoo hashing, k-mer

*Correspondence: Sven.Rahmann@uni-due.de. Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany. Full list of author information is available at the end of the article. © The Author(s) 2021.

Introduction

To learn about tumor heterogeneity and tumor progression under realistic in vivo conditions, but without putting human life at risk, one can implant human tumor tissue into a mouse and study its evolution. This is called a (patient-derived) xenograft (PDX). Over time, several samples of the (graft/human) tumor and the surrounding (host/mouse) tissue are taken and subjected to exome or whole genome sequencing in order to monitor the changing genomic features of the tumor. This information can be used to predict the response to different chemotherapy alternatives and to monitor treatment success or failure. A key step in such analyses is xenograft sorting, i.e., separating the human tumor reads from the mouse reads. A recent study [1] showed that if such a step is omitted, several mouse reads would be aligned to certain regions of the human genome (HAMA: human-aligned mouse allele) and induce false positive variant calls for the tumor; this especially concerns certain oncogenes.

Several tools have been developed for xenograft sorting, motivated by different goals and using different approaches; a summary appears below. Here we improve upon the existing approaches in several ways: by using carefully engineered k-mer hash tables, our approach is both faster and needs less memory than existing tools.
By designing a new decision function, we also obtain fewer unclassified reads and in some cases even higher classification accuracy. Since we use a comprehensive reference of the genome and transcriptome, we are in principle able to process genome, exome, and transcriptome samples of xenografts. Of course, different sources may exhibit different error distributions and require distinct optimized parameter sets for classification. Nevertheless, our evaluation shows that we obtain good results on exomes, genomes and transcriptomes with the same parameter set.

Concerning related work, we distinguish alignment-based methods that work on already aligned reads (BAM files) from alignment-free methods that work directly on short subsequences (k-mers) of the raw reads (FASTQ files).

Alignment-based methods scan existing alignments in BAM files and test whether each read maps better to the graft genome or to the host genome. Differences result from different parameter settings used for the alignment tool (often bwa or bowtie2) and from the way "better alignment" is defined by each of these tools. Alignment-free methods use a large lookup table to associate species information with each k-mer.

In Table 1, we list properties of existing tools and of xengsort, our implementation of the method we describe in this article. These tools support different operations: Operation "count" outputs proportions of reads belonging to each category (host, graft, etc.); operation "sort" sorts reads or alignments into different files according to origin, ideally into five categories: host, graft, both, neither, ambiguous; a "partial sort" only has three categories: host, graft, both/other; operation "filter" writes only an output file with graft reads or alignments. The sort operation is more general than the filter or partial sort operation and allows full flexibility in downstream processing. The count operation, when it is available separately, is faster than counting the output of the sort operation, because it avoids the overhead of creating output files.

Table 1 Tools for xenograft sorting and read filtering with key properties

Tool          Ref.    Input        Operations     Language
XenofilteR    [2]     Aligned BAM  Filter         R
Xenosplit     [3]     Aligned BAM  Filter, count  Python
Bamcmp        [4]     Aligned BAM  Partial sort   C++
Disambiguate  [5]     Aligned BAM  Partial sort   Python or C++
BBsplit       [6]     Raw FASTQ    Partial sort   Java
Xengsort (this)       Raw FASTQ    Count, sort    Python + numba
Xenome        [7]     Raw FASTQ    Count, sort    C++

See text for definitions of the operations.

XenofilteR, Xenosplit, Bamcmp and Disambiguate all work on aligned BAM files. This means that the reads must first be mapped and aligned with a supported read mapper (typically 'bwa mem'), and the resulting BAM file must be sorted in a specific way required by the tool. The tool is typically a script that reads and compares the mapping scores and qualities in the two BAM files containing host and graft alignments. In principle, all of these tools do the same thing; large differences result rather from different alignment parameters than from the tool itself. We therefore picked XenofilteR as a representative of this family, also because it performed well in a recent comparison [1].

BBsplit (part of BBTools) is special in the sense that it performs the read mapping itself, against multiple references simultaneously, based on k-mer seeds. Unfortunately, only up to approximately 1.9 billion k-mers can be indexed because of Java's array indexing limitations (up to 2^31 elements) and a table load limit of 90%; so BBsplit was not usable for our human-mouse index, which contains approximately 4.5 · 10^9 > 2^32 k-mers.
The tool xenome [7] is similar to our approach: It is based on a large hash table of k-mers and sorts the reads into several categories (host, graft, both, neither, ambiguous). A read is classified based on its k-mer content according to relatively strict rules. We found the threading code of xenome to be buggy, such that the pure counting mode resulted in a deadlock and produced no output. The sorting mode produced the complete output but then did not terminate either.

Recent studies [1, 8, 9] have compared the computational efficiency of several methods, as well as the classification accuracy of these methods and the effects on subsequent variant calling after running vs. not running xenograft sorting. The results were contradictory, with some studies reporting that alignment-based tools are more efficient than alignment-free tools, and different tools achieving the highest accuracy. Our interpretation of the results of [1] is that each of the existing approaches is able to sort with good accuracy and that the main difference lies in computational efficiency. Results about efficiency have to be interpreted with care because sometimes the time for alignment is included and sometimes not.

Methods

Overview
By considering all available host and graft reference sequences (both transcripts and genomic sequences of mouse and human), we build a large key-value store that allows us to look up the species of origin (host, graft or both) of each DNA/RNA k-mer that occurs in either species. A sequenced dataset (a collection of single-end or paired-end FASTQ files) is then processed by iterating over reads or read pairs, looking up the species of origin of each k-mer in a read (host, graft, both or none) and classifying the read based on a decision rule.

Our implementation of the key-value store as a three-way bucketed Cuckoo hash table makes k-mer lookup faster than in other methods; the associated value can often be retrieved with a single random memory access. A high load factor of the hash table, combined with the technique of quotienting, ensures a low memory footprint, without resorting to approximate membership data structures, such as Bloom filters.
Key-value stores of canonical k-mers
We partition the reference genomes (plus alternative alleles and unplaced contigs) and transcriptomes into short substrings of a given length k (so-called k-mers); we evaluated k ∈ {23, 25, 27}. For each k-mer ("key") in any of the reference sequences, we store whether it occurs exclusively in the host reference, exclusively in the graft reference, or in both, represented by "values" 1, 2, 3, respectively. For the host- and graft-exclusive k-mers, we also store whether a closely similar k-mer (at Hamming distance 1) occurs in the other species (add value 4); such a k-mer is then called a weak (host or graft) k-mer. This idea extends the k-mer classification of xenome [7], where a k-mer can be host, graft, both, or marginal, the latter category comprising both our weak host and weak graft k-mers. So we store, for each k-mer, a value from the 5-element set "host" (1), "graft" (2), "both" (3), "weak host" (5), "weak graft" (6). This value is stored using 3 bits. While a more compact base-5 representation is possible (e.g., storing 3 values with 125 < 128 = 2^7 combinations in 7 bits instead of in 9 bits), we decided to use slightly more memory for higher speed.

To be precise, we do not work on k-mers directly, but on their canonical integer representations (canonical codes), such that a k-mer and its reverse complement map to the same number. We use a simple base-4 numeric encoding A → 0, C → 1, G → 2, T/U → 3; e.g., the 4-mer AGCG reads as (0212)_4 = 38 and its reverse complement CGCT as (1213)_4 = 103. The canonical code is then the maximum of these two numbers; here the canonical code of both AGCG and CGCT is thus 103. (In xenome, canonical k-mer codes are implemented with a more complex but still deterministic function of the two base-4 encodings; in other tools, it is often the minimum of the two encodings.) For odd k, there are exactly c(k) := 4^k/2 different canonical k-mer codes, so each could be stored in 2k − 1 bits in principle. However, implementing a fast bijection of the set of canonical codes (which is a subset of size c(k) of {0 .. (4^k − 1)}) to {0 .. (c(k) − 1)} seems difficult, so we use 2k bits to store the canonical code directly, which allows faster access. However, we use quotienting, described below, to reduce the size of the stored k-mer code.
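The following minimal sketch illustrates this encoding and the choice of the maximum of the two base-4 codes as the canonical code. It is illustrative only; the function names are ours and do not reflect the xengsort source.

    # Minimal sketch of canonical k-mer codes as described above
    # (not the xengsort implementation; names are illustrative only).
    _ENC = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}

    def kmer_code(kmer):
        # Base-4 numeric code of a k-mer: A -> 0, C -> 1, G -> 2, T/U -> 3.
        code = 0
        for base in kmer:
            code = 4 * code + _ENC[base]
        return code

    def revcomp_code(code, k):
        # Code of the reverse complement, computed directly on the integer.
        rc = 0
        for _ in range(k):
            rc = 4 * rc + (3 - (code & 3))   # complement of the last base
            code >>= 2
        return rc

    def canonical_code(kmer):
        # Canonical code = maximum of the codes of the k-mer and its reverse complement.
        c = kmer_code(kmer)
        return max(c, revcomp_code(c, len(kmer)))

    assert kmer_code("AGCG") == 38
    assert canonical_code("AGCG") == canonical_code("CGCT") == 103

The two assertions reproduce the worked example from the text (AGCG and CGCT share the canonical code 103).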
Multi-way bucketed quotiented Cuckoo hashing
We use a multi-way bucketed Cuckoo hash table as the data structure for the k-mer key-value store. Let C be the set of canonical codes of k-mers; as explained above, we take C = {0 .. (4^k − 1)}, even though only half of the codes are used (for odd k). Let P be the set of locations (buckets) in the hash table and p their number; we set P := {0 .. (p − 1)}. Each key can be stored at up to h different locations (buckets) in the table. The possible buckets for a code are computed by h different hash functions f_1, f_2, ..., f_h : C → P. Each bucket can store up to a certain number b of key-value pairs. So there is space for N := pb key-value pairs in the table overall, and each pair can be stored at one of hb locations in h buckets. Together with an insertion strategy as described below, this framework is referred to as (h, b) Cuckoo hashing. Classical Cuckoo hashing uses h = 2 and b = 1; for this work, we use h = 3 and b = 4. A visualization is provided in Fig. 1. Using several hash functions and larger buckets increases the load limit; using h = 3 and b = 4 allows a load factor of over 99.9% [10, Table 1], while classical Cuckoo hashing only allows filling 50% of the table.

Fig. 1 Illustration of (3,4) Cuckoo hashing with 3 hash functions and buckets of size 4. Left: Each key-value pair can be stored at one of up to 12 locations in 3 buckets. For key x, the bucket given by f_1(x) is full, so bucket f_2(x) is attempted, where a free slot is found. Right: If all hb slots are full, key x is placed into one of these slots at random (blue), and the currently present key-value pair is evicted and re-inserted into an alternative slot.

Bijective hash functions and quotienting
In principle, we need to store the 2k bits for the canonical k-mer code x and the 3 bits for the value at each slot. However, by using hash functions of the form f(x) := g(x) mod p, where p is the number of buckets and g is a bijective (randomized) transformation on the full key set {0 .. (4^k − 1)}, we can encode part of x in f(x): Note that from f(x) and q(x) := g(x) // p (integer division), we can recover g(x) = p · q(x) + f(x), and since g is bijective, we can recover x itself. This means that we only need to store the quotient q(x), not x itself, in bucket f(x), which takes only ⌈2k − log_2 p⌉ instead of 2k bits. However, since we have h alternative hash functions, we also need to store which hash function we used, using 2 bits for h = 3 (0 indicating that the slot is empty). This technique is known as quotienting. It gives higher savings for smaller buckets (for constant N = pb, smaller b means larger p), but on the other hand the load limit is smaller for small b. We find b = 4 to be a good compromise, allowing table loads of 99.9%.

For the bijective part g(x), we use affine functions of the form

    g_{a,b}(x) := [a · (rot_k(x) xor b)] mod 4^k,

where rot_k performs a cyclic rotation by k bits (half the width of x), moving the "random" inner bits to outer positions and the less random outer bits (due to the max operation when taking canonical codes) inside, b is a 2k-bit offset, and a is an odd multiplier. Picking a "random" hash function means picking random values for a and b.

Lemma 1 For any 2k-bit number b and any odd 2k-bit number a, the function g_{a,b} is a bijection on K := {0 .. (4^k − 1)}, and its inverse can be obtained efficiently.

Proof Let y = g_{a,b}(x). By definition, the range of g_{a,b} on K is a subset of K. Because |K| is a power of 2 and a is odd, the greatest common divisor of |K| and a is 1, so there exists a unique multiplicative inverse a′ of a modulo 4^k = |K| such that a·a′ = 1 (mod 4^k). This inverse a′ can be obtained efficiently using the extended Euclidean algorithm. The other operations (xor with b, rot_k) are their own inverses; so we recover x = rot_k([(a′ · y) mod 4^k] xor b).
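A small self-contained sketch of the bijection g_{a,b} and of quotienting, following the formulas above (illustrative names, not the xengsort code; it assumes Python 3.8+ for the modular inverse via pow):

    import random

    def make_g(a, b, k):
        mask = 4**k - 1                    # keys are 2k-bit numbers in {0 .. 4^k - 1}
        assert a % 2 == 1                  # a must be odd (Lemma 1)
        def rot_k(x):                      # cyclic rotation by k bits of a 2k-bit word
            return ((x << k) | (x >> k)) & mask
        def g(x):
            return (a * (rot_k(x) ^ b)) & mask        # ... mod 4^k
        def g_inv(y):
            a_inv = pow(a, -1, mask + 1)              # multiplicative inverse mod 4^k
            return rot_k(((a_inv * y) & mask) ^ b)
        return g, g_inv

    k, p = 25, 1_276_595_745               # bucket count used as an example in the text
    a = random.randrange(1, 4**k, 2)       # random odd multiplier
    b = random.randrange(4**k)             # random 2k-bit offset
    g, g_inv = make_g(a, b, k)

    x = 103                                # some canonical code
    y = g(x)
    f, q = y % p, y // p                   # bucket index f(x) and quotient q(x)
    assert p * q + f == y and g_inv(y) == x    # x is recoverable from (f(x), q(x))

The final assertion is exactly the quotienting argument: storing only q(x) in bucket f(x) suffices to reconstruct the key.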
Search
Searching for a key-value pair works as follows. Given a key (canonical code) x, first f_1(x) is computed, and this bucket is searched for key x and the associated value. If it is not found, buckets f_2(x) and then f_3(x) are searched similarly. Each bucket access is a random memory lookup and most likely triggers a cache miss. We can ensure that each bucket is contained within a single cache line (by using additional padding bits if necessary). Then, the number of cache misses is limited to h = 3 for one search operation.

When we fill the table well below the load limit (at 88% instead of 99.9%), we are able to store many key-value pairs in the bucket indicated by the first hash function f_1, and only incur a single cache miss when looking for them. Unsuccessful searches (for k-mers that are not present in either host or graft genome) need all h memory accesses. However, optimizations are possible; they are described below (see the "Performance engineering" section).

Insert
Insertion of a key-value pair works as follows. First, the key is searched as described above. If it is found, the value is updated with the new value. For example, if an existing host k-mer is to be inserted again as a graft k-mer, the value is updated to "both". If the key is not found, we check whether any of the buckets f_1(x), f_2(x), f_3(x) (in that order) contains a free slot. If this is the case, x and its value are inserted there. If all buckets are full, a random slot among the hb slots is picked, and the key-value pair stored there is evicted (like a cuckoo removes eggs from other birds' nests) to make room for x and its value. Then an alternative location for the evicted element is searched. This process may continue for several iterations and is called a "random walk" through the table. If the walk becomes too long (longer than 5000 steps, say), we declare that the table is too full, and construction fails and has to be restarted with a larger table or a different random seed.

Our implementation requires that the size (number of buckets p) of the hash table is known in advance, so we can pre-allocate it. The genome length is a good (over-)estimate of the number of distinct k-mers and can be used. We recently presented a practical algorithm [11] to optimize the assignment of k-mers to buckets (i.e., their hash function choices) such that the average search cost of present k-mers is minimized to the provable optimum. This optimization takes significant additional time and requires large additional data structures; so we took the opportunity here to evaluate whether it significantly improves lookup times in comparison to a table filled by the above random walk strategy (see "Results").
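For concreteness, the following compact sketch shows the search and random-walk insertion logic for an (h, b) = (3, 4) table. It is illustrative only (plain Python, full keys instead of quotients, simplified value update); the numba-compiled xengsort implementation differs in layout and details.

    import random

    class CuckooSketch:
        # Sketch of (3,4) bucketed Cuckoo hashing storing (key, value) pairs.
        def __init__(self, nbuckets, hashfuncs, bucketsize=4, maxwalk=5000):
            self.h = hashfuncs                    # three functions: key -> bucket index
            self.b = bucketsize
            self.maxwalk = maxwalk
            self.buckets = [[] for _ in range(nbuckets)]

        def search(self, key):
            # Examine buckets f1(key), f2(key), f3(key) in order.
            for f in self.h:
                for k, v in self.buckets[f(key)]:
                    if k == key:
                        return v
            return None                           # unsuccessful lookup

        def insert(self, key, value):
            found = self.search(key)
            if found is not None:                 # simplified update: host (1) | graft (2) = both (3)
                self._replace(key, found | value)
                return True
            for _ in range(self.maxwalk):
                for f in self.h:                  # any free slot in the three buckets?
                    bucket = self.buckets[f(key)]
                    if len(bucket) < self.b:
                        bucket.append((key, value))
                        return True
                # all h*b slots are full: evict a random pair and continue the walk with it
                bucket = self.buckets[random.choice(self.h)(key)]
                j = random.randrange(self.b)
                (key, value), bucket[j] = bucket[j], (key, value)
            return False                          # walk too long: table too full

        def _replace(self, key, value):
            for f in self.h:
                bucket = self.buckets[f(key)]
                for i, (k, _) in enumerate(bucket):
                    if k == key:
                        bucket[i] = (key, value)
                        return

With the quotienting scheme described above, each slot would hold (hash choice, quotient, value) rather than the full key; the sketch omits this for brevity.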
In summary, each stored canonical k-mer needs 2 + 3 + ⌈2k − log_2 p⌉ bits to remember the hash function choice and to store the value (species) and the quotient, respectively. For k = 25 and p = 1,276,595,745 buckets, this amounts to 25 bits per k-mer, or 100 bits for each bucket of 4 k-mers. To ensure cache-line-aligned buckets (512 bits), we could use 500 bits for 5 buckets and insert 12 padding bits; however, we chose to use less memory and let a few buckets cross cache line boundaries, accepting the resulting speed decrease.

Performance engineering

Software prefetching
Prefetching refers to instructing the memory system to fetch data from RAM into the cache hierarchy before the CPU actually needs the data. This can reduce the time spent by the CPU waiting for data, especially in lookup-intensive applications such as this one. Hardware prefetching is performed automatically by the CPU based on memory access patterns (e.g., a linear scan over a large array). Software prefetching refers to application-controlled prefetching. Our application xengsort supports three levels of prefetching: none (0), prefetching the second choice bucket before searching the first choice bucket (1), or prefetching both the second and third choice buckets before searching the first choice bucket (2). The disadvantage is that, if the search of the first bucket is successful, the memory system has done unnecessary work, possibly slowing down other threads that want to access different memory locations at the same time. As a consequence, software prefetching should only be enabled if the second and/or third bucket must be examined frequently, i.e., at high load factors, or when many unsuccessful lookups can be expected.

Shortcuts for unsuccessful lookups
As described so far, unsuccessful lookups are slow because all three buckets must be completely examined, even though software prefetching may solve part of the problem. In addition, algorithmic optimizations are possible, with 0 to 2 extra bits of memory per bucket.

The following shortcut is possible without using any additional memory: If, say, the first bucket f_1(x) contains an empty slot, we do not need to search further, because the random walk insertion procedure produces a tight layout, in the sense that if a single element could have been moved to an "earlier" bucket, it would have been done.

Using a single additional "shortcut" flag bit per bucket, we can store whether there exists any element in a "later" choice bucket that could have been inserted into this bucket, had there been more space. So a set bit (value 1) indicates that later choices must be searched if the element is not found in this bucket, while a cleared bit (value 0) indicates that a search can be terminated unsuccessfully when the element is not found in this bucket. The same idea has been proposed by Alain Espinosa as the "unlucky buckets trace" [12].

Using a second bit per bucket, the resolution of this type of information can be further improved: One bit indicates that there exists an element whose first choice would have been this bucket, but that is stored at its second choice bucket. The other bit indicates that there exists an element whose first or second choice would have been this bucket, but that is stored at its third choice bucket. If the element is not found in the current bucket, then, depending on the bit combination, the search can be stopped early (0,0), only the second choice needs to be checked (1,0), only the third choice needs to be checked (0,1), or both need to be checked (1,1).

These shortcuts work best if (a) there are many unsuccessful lookups and (b) they are evaluated only after the search in the first bucket was unsuccessful. The performance gains are evaluated in the "Results" section.
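A small sketch of how the two optional shortcut bits prune an unsuccessful search. The arrays bit1 and bit2 (indexed by bucket) and the probe helper are our illustrative names under the assumptions stated in the lead sentence of the previous paragraph; the actual xengsort bit layout differs.

    # Sketch only: bit1[i] = 1 iff some element with first choice i is stored at its
    # second choice bucket; bit2[i] = 1 iff some element with first or second choice i
    # is stored at its third choice bucket.
    def probe(buckets, i, key):
        for k, v in buckets[i]:
            if k == key:
                return v
        return None

    def search_with_shortcuts(key, buckets, f1, f2, f3, bit1, bit2):
        i1 = f1(key)
        hit = probe(buckets, i1, key)        # the first choice bucket is always examined
        if hit is not None:
            return hit
        if bit1[i1] and (hit := probe(buckets, f2(key), key)) is not None:
            return hit                       # (1,0)/(1,1): second choice may hold the key
        if bit2[i1] and (hit := probe(buckets, f3(key), key)) is not None:
            return hit                       # (0,1)/(1,1): third choice may hold the key
        return None                          # (0,0): unsuccessful after a single bucket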
No key deletions
In principle, our implementation of Cuckoo hashing allows for easy deletion of keys: in the corresponding slot, simply set the hash choice bits to zero. However, this will destroy the tight layout mentioned in the previous paragraph and invalidate the shortcut flag bits; therefore subsequent unsuccessful lookups must examine all locations and become more expensive. Restoring the tight layout may involve many iterations of moving keys throughout the table, and hence is an expensive operation. As our application never needs to delete an existing key, we fully benefit from the above mentioned shortcuts.

Additional memory savings
Theoretically, the two bits indicating the hash function choice for each slot may be saved by using separate tables for each hash function. However, this drastically decreases the load limits and overall results in higher memory requirements. A better alternative is to exploit that the order of slots in a bucket is arbitrary, so we may enforce a fixed order: first all keys that are present with their final hash function, then in order all keys resulting from earlier hash functions, and finally all empty slots. Thus the configuration of a bucket is given by a non-negative (h + 1)-tuple (c_h, ..., c_1, c_0) with sum b, where c_i is the number of elements in the bucket that are present because of their i-th hash function (for i ≥ 1), and c_0 is the number of empty slots. Especially for large b, there are far fewer possible such tuples than (h + 1)^b. For the case of (h, b) = (3, 4), there are 35 such tuples, and the configuration can be encoded in 6 instead of 8 bits. Encoding more than one bucket jointly results in further savings. Similarly, the values for a bucket can be encoded jointly. For example, given a value set of size 5, where a value requires 3 bits, there are 5^4 = 625 different value combinations in a bucket, which can be encoded in 10 bits instead of 4 · 3 = 12. In a practical setting (human/mouse, k = 25, load 0.88; see Table 3), combining both options reduces the hash table size by 0.5 GB, from 15.9 GB to 15.4 GB. However, these savings come at the cost of increased CPU time for decoding the configuration or values. Neither option has been implemented yet, but both will be added in a future release.
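The two counting claims above can be verified with a few lines of plain Python (a stand-alone check, not part of xengsort):

    from math import comb, ceil, log2

    h, b = 3, 4
    tuples = comb(b + h, h)             # weak compositions of b into h+1 parts: 35
    values = 5 ** b                     # joint value combinations per bucket: 625
    print(tuples, ceil(log2(tuples)))   # 35 -> 6 bits instead of 2*b = 8
    print(values, ceil(log2(values)))   # 625 -> 10 bits instead of 3*b = 12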
cDNA” files, which contain the known transcripts, from Instead, we extract from the hash table a complete list L the ensembl FTP site, release 98. of k-mers and their reverse complements (not canonical As the alternative alleles of the human and mouse 9 9 codes; approx. 9 · 10 entries for 4.5 · 10 distinct k-mers), toplevel references contain mostly Ns to keep positional together with their current values. To save memory, this alignment of alternative alleles to the consensus refer- list is created and processed in 16 chunks according ence, they decompress to huge FASTA files (over 60  GB to the first two nucleotides of the k-mer, thus needing for human, over 12 GB for mouse). Therefore we con - approx. 4.5  GB of additional memory temporarily. Since densed the toplevel reference sequences by replacing we use odd k = 2ℓ + 1 , we can partition a k-mer into its ℓ runs of more than 25 Ns by 25 Ns. This does not change -prefix, its middle base and its ℓ-suffix (Fig.  2). We make the k-mer content, as k-mers containing even a single N use of the following observation. are ignored. It does provide an efficiency boost to align - ment-based tools because read mappers build an index of Observation 1 For k = 2ℓ + 1 , two k-mers x,  y with every position in the genome and typically replace runs Hamming distance 1 differ either in their ℓ-suffix, in the ℓ of Ns by random sequence. -suffix of their reverse complement or in their middle base. Consider first the case where the difference is in the ℓ Fragment classification -suffix. We thus partition the sorted chunk into blocks of Given a sequenced fragment (single read or read pair), constant (ℓ + 1)-prefixes. Different blocks are processed we query each k-mer of the fragment about its origins; independently in parallel threads. The ℓ-suffixes of all k-mers with undetermined bases (Ns) are ignored. Our pairs of k-mers in such a block are queried with a fast implementation reads large chunks (several MB) of bit-vector test for Hamming distance 1. If a pair is found FASTQ files and distributes read classification over sev - and the k-mers occur in different species, the “weak bit” eral threads (we found that 8 threads saturate the I/O). b + b ≥ T both else b + b ≥ T both x > 3n/4 x > 3n/4 S ≥ 3 host g ≥ 6 and h = 0 and h ≤ 6 g + g ≥ T and h ≤ n/20 and h < S graft graft h + h = 0 (no host) else Z entgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 7 of 16 graft host neither both both neither graft graft host host both neither ′ ′ Fig. 3 Decision rule tree for classifying a DNA fragment from k-mer statistics (h, h , g, g , b, x; n) , meaning number of k-mers of type “host” ′ ′ (h), “weak host” ( h ), “ graft ” (g), “weak graft” ( g ), “both” (b), and number of k-mers not present in the key-value store (x), respectively; n is the ′ ′ total number of (valid) k-mers in the fragment. We also use weighted scores S := h +⌊h /2⌋ and S := g +⌊g /2⌋ and thresholds host graft T := ⌊n/4⌋, T := ⌊n/4⌋ and T := ⌊n/4⌋ . A fragment is thus classified as “host”, “graft”, “both”, “neither”, or “ambiguous”. Category host graft both “ambiguous” is chosen if no other rule applies and no “else” rule is present in a node We collect k-mer statistics for each fragment (adding i.e. x ≥ 3n/4 ). If none of these conditions is true, the the numbers of both reads for a read pair): let n be the “ambiguous” class is chosen. A symmetric quick deci- number of (valid) k-mers in the fragment. 
Fragment classification
Given a sequenced fragment (single read or read pair), we query each k-mer of the fragment about its origins; k-mers with undetermined bases (Ns) are ignored. Our implementation reads large chunks (several MB) of FASTQ files and distributes read classification over several threads (we found that 8 threads saturate the I/O). We collect k-mer statistics for each fragment (adding the numbers of both reads for a read pair): let n be the number of (valid) k-mers in the fragment, let h be the number of (non-weak) host k-mers and h′ the number of weak host k-mers, and analogously define g and g′ for the graft species. Further, let b be the number of k-mers occurring in both species, and let x be the number of k-mers that were not found in either species.

Based on the vector (h, h′, g, g′, b, x; n), we use a tree of hierarchical rules to classify the fragment into one of five categories: "host", "graft", "both", "neither" and "ambiguous". Categories "host" and "graft" are for reads that can be clearly assigned to one of the species. Category "both" is for reads that match equally well to both references. Category "neither" is for reads that contain many k-mers that cannot be found in the key-value store; these could point to technical problems (primer dimers) or contamination of the sample with other species. Finally, category "ambiguous" is for reads that provide conflicting information. Such reads should not usually be seen; they could result from PCR hybrids between host and graft during library preparation.

Fig. 3 Decision rule tree for classifying a DNA fragment from the k-mer statistics (h, h′, g, g′, b, x; n), i.e., the number of k-mers of type "host" (h), "weak host" (h′), "graft" (g), "weak graft" (g′), "both" (b), and the number of k-mers not present in the key-value store (x), respectively; n is the total number of (valid) k-mers in the fragment. We also use weighted scores S_host := h + ⌊h′/2⌋ and S_graft := g + ⌊g′/2⌋ and thresholds T_host := ⌊n/4⌋, T_graft := ⌊n/4⌋ and T_both := ⌊n/4⌋. A fragment is thus classified as "host", "graft", "both", "neither", or "ambiguous". Category "ambiguous" is chosen if no other rule applies and no "else" rule is present in a node.

The precise rules are shown in Fig. 3. The rules are designed to arrive at easy decisions quickly. For example, at the root node, if there are no graft k-mers at all (g + g′ = 0), then an easy decision can be made between the classes "host" (if there is at least a little evidence of host k-mers, i.e., S_host ≥ 3, where S_host := h + ⌊h′/2⌋), "both" (if there are sufficiently many such k-mers, i.e., b ≥ T_both := ⌊n/4⌋, but the "host" class does not apply), and "neither" (if there are sufficiently many such k-mers, i.e., x ≥ 3n/4). If none of these conditions is true, the "ambiguous" class is chosen. A symmetric quick decision rule exists for the case that no host k-mers exist (h + h′ = 0). If no quick decision can be made, more complex rules are applied: The next test is whether there are no (strong) graft k-mers (g = 0), only few weak graft k-mers (g′ ≤ 6), but at least some (strong) host k-mers (h ≥ 6), in which case the read is classified as "host". A symmetric rule exists for the "graft" class, of course. An even more complex rule tests whether there is sufficient overall evidence for host, but only little strong graft evidence in absolute terms and little weak graft evidence in comparison to the host evidence. For categories "both" and "neither", a relatively large number of corresponding k-mers is required. Category "ambiguous" is always chosen if no "else" rule exists and no other rule applies in any given node. The thresholds have been iteratively hand-tuned on several internal human, mouse and bacterial datasets that were not part of the evaluation datasets. The thresholds are optimized for typical high-quality short reads (100–150 bp) and may have to be adjusted for long reads with higher error rates. For completeness, the Python source of the classification function appears in Table 6 in the Appendix.
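To make the counting step concrete, here is a minimal sketch of how the statistics vector (h, h′, g, g′, b, x; n) can be collected for one fragment. It assumes a lookup function get_value that returns the stored value of a canonical code (0 if absent, otherwise 1, 2, 3, 5 or 6 as defined in the "Key-value stores" section) and reuses canonical_code from the sketch above; it is illustrative only, not the multi-threaded xengsort implementation.

    # Illustrative only: collect (h, h', g, g', b, x; n) for one fragment.
    VALUE_NAMES = {1: "host", 2: "graft", 3: "both",
                   5: "weak host", 6: "weak graft", 0: "not found"}

    def fragment_statistics(reads, k, get_value):
        counts = {name: 0 for name in VALUE_NAMES.values()}
        n = 0                                   # number of valid k-mers
        for read in reads:                      # one read, or both mates of a pair
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                if "N" in kmer:                 # k-mers with undetermined bases are ignored
                    continue
                n += 1
                counts[VALUE_NAMES[get_value(canonical_code(kmer))]] += 1
        return counts, n

The resulting counts feed directly into the decision tree of Fig. 3 (see also the classification routine in Table 6).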
Quick mode
Inspired by a feature of the kallisto software [13] for transcript expression quantification, we additionally implemented a "quick mode" that initially looks only at the type of the third and third-last k-mer in every read. If the two (for single-end reads) or four (for paired-end reads) types agree (e.g., all are "graft"), the fragment is classified on this sampled evidence alone. This results in quicker processing of large FASTQ files, but only considers a small sample of the available information.

Results

We evaluate our alignment-free xenograft sorting approach and its implementation xengsort for the common case of human-tumor-in-mouse xenografts, by using mouse datasets, human datasets, xenograft datasets and datasets from other species, and compare against an existing tool with the same purpose, xenome from the gossamer suite [7], and against a representative of alignment-based filtering tools, XenofilteR [2]. The hardware used for the benchmarks was one server with two AMD Epyc 7452 CPUs (with 32 cores and 64 threads each), 1024 GB DDR4-2666 memory and one 12 TB HDD with 7200 rpm and 256 MB cache. We first report on statistics and efficiency of index construction ("Hash table construction" section), then discuss classification accuracy on several datasets ("Classification results" section), and finally compare running times ("Running times" section).

Hash table construction

Table size and uniqueness of k-mers
We evaluated k ∈ {23, 25, 27} and then decided to use k = 25 because it offers a good compromise between species specificity and memory requirements. Table 2 shows several index properties. In particular, moving from k = 25 to k = 27, the small decrease in k-mers that map to both genomes and in weak k-mers did not justify the additional memory requirements. In addition, shorter k-mers lead to better error tolerance against sequencing errors, as each error affects up to k of the k-mers in a read.

Table 2 Properties of the k-mer index for different values of k

k-mers       k = 23 (%)             k = 25 (%)             k = 27 (%)
Total        4,396,323,491 (100)    4,496,607,845 (100)    4,576,953,994 (100)
Host         1,924,087,512 (43.8)   2,050,845,757 (45.6)   2,105,520,461 (46.0)
Graft        2,173,923,063 (49.4)   2,323,880,612 (51.7)   2,395,147,724 (52.3)
Both         18,701,862 (0.4)       12,579,160 (0.3)       9,627,252 (0.2)
Weak host    132,469,231 (3.0)      52,063,110 (1.2)       32,445,717 (0.7)
Weak graft   147,141,823 (3.4)      57,239,206 (1.3)       34,212,840 (0.7)

The underlying reference sequences are given in the "Reference sequences" section.

Construction time and memory
Table 3 shows time and memory requirements for building the k-mer hash table, or the FM index for bwa (used by XenofilteR). The main difference is that the BWA index is a succinct representation of the suffix array of the references and not a k-mer hash table.

Table 3 Index construction

Tool        k   Build CPU  Build Wall  Mark CPU  Mark Wall  Total CPU  Total Wall  Mem Final  Mem Peak
Xengsort    23  50         50          591       176        641        226         12.8       17.3
Xengsort    25  53         53          437       158        490        211         15.9       20.4
Xengsort    27  51         51          495       214        546        265         17.3       21.8
Xenome      25  992        151         2338      356        3626       552         31.2       57.1
XenofilteR  –   –          –           –         –          528        658         13.0       22.0

CPU times and wall clock times are given in minutes and memory in Gigabytes, for different tools and different k-mer sizes of xengsort. "Build" times refer to collecting and hashing the k-mers according to species, but without marking weak k-mers. "Mark" times refer to marking weak k-mers. "Total" times are the sum of build and mark times, plus additional I/O times. "CPU" times measure total CPU work load (as reported by the time command as user time), and "Wall" times refer to actually passed time. Final size ("Mem Final") is measured by index size on disk (GB). Memory peak ("Mem Peak") is the highest memory usage during construction (GB).

Our hash table construction is not parallelized; hence CPU times and wall clock times agree and are less than one hour.
The hash construction of xenome is parallelized; we gave it 8 threads (but 9 were sometimes used); yet it does about 20 times the CPU work and takes three times as long as xengsort, even when using multiple threads.

Marking weak or marginal k-mers is parallelized in both approaches; wall clock times are measured using 8 threads. Again, xengsort finds the weak k-mers faster, both in terms of total CPU work and wall clock time. The indexing method of bwa is not comparable, as it builds a complete suffix array (FM index) that is independent of k and does not include marking weak k-mers. Here the CPU time is lower than the wall clock time, which indicates an I/O-starved process.

We note that xenome uses a large amount of memory during hash table construction (it was given up to 64 GB). It works with less if restricted, but at the expense of longer running times. BWA indexing also needs significant additional memory during construction. The additional memory required by xengsort results from the additional sorted k-mer list required for detecting weak k-mers. Overall, our construction is fast (even though serial only) and uses a reasonable amount of memory.

Load factor and hash choice distribution
As explained in the "Multi-way bucketed quotiented Cuckoo hashing" section, 3-way Cuckoo hash tables support very high loads (fill ratios) of over 99.9%. However, such loads come at the expense of distributing all k-mers almost evenly across hash function choices. For faster lookup, it is beneficial to leave part of the hash table empty. We used a load factor of 88% and thus find 76.7% of the k-mers at their first bucket choice, 15.5% at their second choice and only 7.8% at their third choice, yielding an average of 1.31 lookups for a present k-mer.
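As a plain arithmetic check of the average quoted above (not xengsort code):

    # Average number of bucket probes for a present k-mer at 88% load,
    # from the reported hash choice distribution.
    fractions = {1: 0.767, 2: 0.155, 3: 0.078}      # choice -> fraction of present k-mers
    average = sum(choice * frac for choice, frac in fractions.items())
    print(round(average, 2))                        # 1.31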
Applying assignment optimization [11], which takes an additional 5 h (serial CPU time, not parallelized) and temporarily needs over 80 GB of RAM, we achieve a slightly better average of 1.17 lookups for a present k-mer.

Classification results
We applied our method xengsort, xenome and XenofilteR to several datasets with reads of known origin (except for possible contamination issues or technical artefacts) that nevertheless present particular challenges. Each of the following paragraphs discusses one dataset.

Human-captured mouse exomes
A recent comparative study [1] made five mouse exomes accessible, which were captured with a human-exome capture kit and hence present mouse reads that are biased towards high similarity with human reads. The mouse strains were A/J (two mice), BALB/c (one mouse), and C57BL/6 (two mice); they were sequenced on the Illumina HiSeq 2500 platform, resulting in 11.8 to 12.7 Gbp. The datasets are available under accession numbers SRX5904321 (strain A/J, mouse 1), SRX5904320 (strain A/J, mouse 2), SRX5904319 (strain BALB/c, mouse 1), SRX5904318 (strain C57BL/6, mouse 1) and SRX5904322 (strain C57BL/6, mouse 2). Ideally, all reads should be classified as mouse reads.

Table 4 shows detailed classification results and running times. Considering the BALB/c and C57BL/6 strains first, it is evident that classification accuracy is high (over 98.9% mouse for xengsort, over 97.4% for xenome, with less than 0.64% human reads for both tools). The main difference between the tools is that xenome is more conservative, assigning a larger fraction of reads to the "ambiguous" (unclassified) category. With xenome, this happens for reads that contain two k-mers x, y, where x maps uniquely to human and y maps uniquely to mouse. The decision rule of xengsort is more permissive and tolerant towards small inconsistencies. Therefore, xengsort assigns more reads correctly to mouse, and fewer to the ambiguous category. Additionally, xengsort assigns fewer reads incorrectly to human.

However, the two samples of strain A/J give different results. Both xengsort and xenome assign a large fraction of reads (around 21% and 3.6% in the two samples) to the human genome, while XenofilteR assigns only 10.5% and 2.7%, respectively. While xengsort does assign more reads to mouse, it also assigns more reads to human, following its strategy of leaving fewer reads unassigned (ambiguous). Inspection of these reads revealed that almost all of them are low-complexity, i.e., consist of repetitive sequence, and a check with BLAT [14] revealed no hits in mouse and several gapped hits in the human genome. So the classification as human reads is not incorrect from a technical standpoint, but in fact these reads appear to point to technical problems during the enrichment step of the library generation. An additional low-complexity filter would remove most problematic reads.
Table 4 Detailed classification results on five human-captured mouse exomes from different mouse strains (2× A/J, 1× BALB/c, 2× C57BL/6)

A/J-1        xengsort (70 Cm, 14 Wm)   xenome (371 Cm, 45 Wm)   XfR (56 Cm, 56 Wm)
             Fragments (%)             Fragments (%)            Fragments (%)
Mouse        46,648,014 (78.03)        45,759,814 (76.54)       –
Both         120,808 (0.20)            65,269 (0.11)            –
Human        12,813,583 (21.43)        12,500,844 (20.91)       6,315,955 (10.56)
Ambiguous    58,449 (0.10)             1,383,547 (2.31)         –
Neither      143,775 (0.24)            75,155 (0.13)            –

A/J-2        xengsort (70 Cm, 15 Wm)   xenome (416 Cm, 50 Wm)   XfR (67 Cm, 67 Wm)
Mouse        60,255,189 (95.57)        59,135,489 (93.80)       –
Both         151,396 (0.24)            89,089 (0.14)            –
Human        2,301,384 (3.65)          2,271,131 (3.60)         1,718,545 (2.73)
Ambiguous    57,827 (0.09)             1,340,814 (2.13)         –
Neither      279,556 (0.44)            208,829 (0.33)           –

BALB/c       xengsort (68 Cm, 15 Wm)   xenome (392 Cm, 45 Wm)   XfR (61 Cm, 61 Wm)
Mouse        62,235,960 (98.99)        61,274,277 (97.46)       –
Both         118,541 (0.19)            68,949 (0.11)            –
Human        342,908 (0.55)            348,154 (0.55)           285,556 (0.45)
Ambiguous    45,063 (0.07)             1,098,036 (1.65)         –
Neither      127,035 (0.20)            80,091 (0.13)            –

C57BL/6-1    xengsort (72 Cm, 14 Wm)   xenome (359 Cm, 44 Wm)   XfR (58 Cm, 58 Wm)
Mouse        57,993,361 (98.93)        57,522,446 (98.13)       –
Both         118,984 (0.20)            74,325 (0.13)            –
Human        375,716 (0.64)            376,653 (0.64)           290,894 (0.50)
Ambiguous    27,731 (0.05)             571,542 (0.98)           –
Neither      103,895 (0.18)            74,721 (0.13)            –

C57BL/6-2    xengsort (67 Cm, 15 Wm)   xenome (422 Cm, 51 Wm)   XfR (62 Cm, 62 Wm)
Mouse        62,384,448 (99.00)        61,941,783 (98.30)       –
Both         107,019 (0.17)            66,163 (0.10)            –
Human        189,536 (0.30)            208,149 (0.33)           132,535 (0.21)
Ambiguous    27,142 (0.04)             562,659 (0.89)           –
Neither      304,677 (0.48)            234,068 (0.37)           –

Running times are reported both in CPU minutes (Cm), measuring CPU work, and wall clock minutes (Wm), measuring actual time spent. Times for XenofilteR (XfR) do not include alignment or BAM sorting time. Classification results report the number and percentage (in brackets) of fragments classified as mouse (correct), both human and mouse (likely correct), human (incorrect), ambiguous (no statement) and neither (likely incorrect). XenofilteR only extracts human fragments and does not classify the remainder; so only the numbers of fragments classified as human are reported.
Again, XenofilteR seems to be the Human lymphocytic leukemia tumor RNA‑seq data most conservative tool with about 70%, and xenome clas- We obtained single-end FASTQ files from RNA-seq data sifies about 72% as human and xengsort 74%. The remain - of 5  human T-cell large granular lymphocytic leukemia ing reads are not classified by XenofilteR, while xenome samples, where recurrent alterations of TNFAIP3 were and xengsort both assign about 25% to host (mouse). Fur- observed, and 5 matched controls (13.4 Gbp to 27.5 Gbp). thermore, xenome classifies about 2% and xengsort less The files are available from SRA accession SRP059322 than 1% as ambiguous. (datasets SRX1055051 to SRX1055060). Surprisingly, not So we observe that on all datasets, xengsort is more all fragments were recognized as originating from human decisive than xenome and, judging from the pure human tissue (Fig. 4c). While xenome and xengsort agreed that the and mouse datasets, mostly correct about it. Because this human fraction is close to 75%, XenofilteR assigned fewer is a large dataset, we also applied xengsort’s quick mode reads to human origins (less than 70%). and found essentially no differences in classification For this and the other RNA-seq datasets, we trimmed the results (less than 0.001 percentage points in each class; Illumina adapters using cutadapt [15] prior to classifica - e.g. for graft: quick 74.0111% vs. standard 74.0105% of all tion, as some RNA fragments may be shorter than the read reads; difference 0.0006%; cf. Fig. 4d). length. If this step is omitted, even fewer fragments are classified as human (graft): just below 70% for xenome and Running times xengsort, and only about 53% for XenofilteR. The number of A summary of running times for all datasets appears in fragments classified as neither increases correspondingly. Table 5. We investigated the reads classified by xengsort as nei - ther human nor mouse. Quality control with FastQC [16] Human‑captured mouse exomes revealed nothing of concern, but showed an unusual bio- Our implementation xengsort needs around 70 CPU min- modal per-fragment GC content distribution with peaks utes for each of the five human-captured mouse exomes at 45% and 55%. BLASTing the fragments against the non- (total: 368  min), and less than 15 min of wall clock time redundant nucleotide database [17] yielded no hits at all for using 8 threads. The speed-up being less than 8 results 97% of these fragments. A small number (2%) originated from serial intermediate I/O steps. While xenome makes from the bacteriophage PhiX, which was to be expected, better use of parallelism, it is slower overall, requiring because it is a typical spike-in for Illumina libraries. The 5 to 6 times the CPU work of xengsort. For only scan- remaining 1% of fragments showed random hits over ning already aligned BAM files, XenofilteR is surpris - many species without a distinctive pattern. We therefore ingly slow, and we see that we can sort the reads from concluded that the “neither; ; fragments mainly consisted scratch in almost the same amount of CPU work that of artefacts from library construction, such as ligated and is required to compare (already computed) alignment then sequenced random primers. scores. 
When adding bwa mem alignment times (even without the time required for sorting the resulting BAM files), XenofilteR needs an additional 887 to 1424 CPU minutes for the human alignments and an additional 424 to 777 min for the mouse alignments per dataset, making the alignment-based approach far less efficient than the alignment-free approach.

Human genome (GIAB) matepair library
We observe the same wall clock time ratio (about 3.5) between xenome and xengsort as for the mouse exome dataset. Because this is a very large dataset (112 GB gzipped FASTQ), we additionally evaluated the effects of using xengsort's "quick mode". We observed a significant reduction in processing time (by about 33%) and almost unchanged classification results. We also ran the xengsort classification with the optimized hash table (using an optimized assignment computed with the methods from [11]) and found a small reduction (9%) in running time.

Chicken genome
The BAM file scan of XenofilteR here beats the alignment-free tools (cf. Table 5) because both BAM files are essentially empty, as very few reads align against human or mouse. Also, the speed advantage of xengsort over xenome is smaller on this dataset, mainly because most k-mers are not found in the index and require h = 3 memory lookups and likely cache misses. Such a dataset, which contains neither graft nor host reads, is adversarial for our design of xengsort. However, the engineering methods introduced in the "Performance engineering" section are effective on such a dataset. The following evaluations are based on one lane (1/3) of the complete chicken dataset because of time constraints.

Figure 5 shows the effect of using different amounts of prefetching: none (0, default), prefetching the second choice bucket (1), or the second and third choice buckets (2). At low table loads (0.7), prefetching is not very helpful (level 1) or even detrimental (level 2 compared to level 1) because of the additional overhead. At intermediate load levels (0.85), prefetching helps, but the more aggressive level 2 does not provide an additional advantage. At high table loads (0.95), more aggressive prefetching provides an additional gain in running time. In fact, with prefetching level 2, the running time is almost independent of the load factor.

Fig. 5 Effect of different prefetching levels on the running time of the adversarial chicken genome dataset (CPU times in seconds vs. hash table load factor): no prefetching (0, default), prefetching the second bucket (1), or prefetching both the second and third bucket (2). Times are averages over 4 repeated runs.

Figure 6 shows the effect of using 0 (default), 1 or 2 shortcut bits per bucket. Almost independently of the load factor, using one shortcut bit yields a measurable running time reduction of 10%. Using a second bit gives only a small additional advantage (ca. 4%).

Fig. 6 Effect of shortcut bits for unsuccessful k-mer searches on the running time of the adversarial chicken genome dataset (CPU times in seconds vs. hash table load factor): no extra bits (0, default), one extra bit (1) or two extra bits (2); see the "Performance engineering" section. Times are averages over 6 repeated runs.

Unfortunately, the effects of both optimizations are not cumulative. Essentially, an effective use of shortcuts renders prefetching almost useless. On the other datasets, where most k-mer queries are successful, the effects of both optimizations are much less pronounced and even negligible.

Human lymphocytic leukemia tumor RNA-seq data
Again, xengsort is more than 3 times faster than xenome and needs time comparable to XenofilteR even when only the time for sorting and scanning the existing BAM files is taken into account (Table 5). Producing the alignments for XenofilteR takes much longer.
Patient-derived xenograft (PDX) RNA-seq samples from human pancreatic tumors
With its 174 samples, this is a particularly large dataset of the type that we optimized xengsort for. Therefore, running time differences between the three methods become particularly apparent. Figure 4e shows that the alignment using bwa-mem and the sorting of the BAM file for XenofilteR took over 284,191 CPU minutes (close to 200 CPU days). After that, XenofilteR required an additional 16,043 CPU minutes (over 11 CPU days) to classify the aligned and sorted reads. In comparison, xenome with 59,691 CPU minutes (41.5 CPU days) took only 20% of the time used by bwa-mem and XenofilteR, and xengsort needed 13,555 CPU minutes (9.5 CPU days) to sort all reads and is therefore even faster than the classification by XenofilteR alone, even excluding the alignment and sorting steps, and over 4 times faster than xenome. Using the "quick mode" with an optimized hash table at 88% load needed only 5713 CPU minutes (less than 4 CPU days), i.e., less than half of the time of a full analysis.

We additionally examined some trade-offs for this dataset. First, we note that only counting proportions without output (the "count" operation) is not much faster than sorting the reads into different output files (the "sort" operation): 13,285 vs. 13,555 CPU minutes (2% faster). We additionally measured the running time of xengsort's count operation on hash tables with different load factors (88% and 99%), using both the standard assignment by random walk and an optimal assignment [11].
Discussion and conclusion
We revisited the xenograft sorting problem and improved upon the state of the art in alignment-free methods with our implementation of xengsort.

On typical datasets (PDX RNA-seq), it is at least four times faster and needs less memory than the comparable xenome tool. Our experiments show that xengsort provides accurate classification results and classifies more reads than xenome, which more often assigns the label "ambiguous". Surprisingly, on PDX datasets, our approach is even faster than scanning already aligned BAM files. This favorable behavior arises because almost every k-mer in every read can be expected to be found in the key-value store, and lookups of present keys are highly optimized.

On adversarial datasets (e.g., a sequenced chicken genome, where almost none of the k-mers can be found in the hash table), xengsort is twice as fast as xenome, but 8 times slower than scanning pre-aligned and pre-sorted BAM files (which are mostly empty). With additional engineering tweaks, such as shortcut bits or software prefetching, our performance on such datasets can be improved (10% speed gain). More refined prefetching strategies, such as k-mer look-ahead, may lead to further improvements, and we will experiment with additional ideas.

Given that producing and sorting the BAM files takes significant additional time, our results show that, overall, alignment-free methods require significantly less computational resources than alignment-based methods. In view of the current worldwide discussions on climate change and energy efficiency, we advocate that the most resource-efficient available methods should be used for a task, and we propose that xengsort is preferable to existing work in this regard. Even though one could argue that alignments are needed later anyway, we find that this is not always true: First, to analyze PDX samples, typically only the graft reads are further considered and need to be aligned. Second, recent research has shown that more and more application areas can be addressed by alignment-free methods, even structural variation and variant calling [18], so alignments may not be needed at all.

On the methodological side, we developed a general key-value store for DNA/RNA k-mers that allows extremely fast lookups, often with only a single random memory access, and that has a low memory footprint thanks to a high load factor and the technique of quotienting. Thus this work might be seen as a blueprint for implementations of other alignment-free methods, such as gene expression quantification, metagenomics, etc. In principle, one could replace the underlying key-value store of each published k-mer based method by the hashing approach presented here and probably obtain a speed-up of factor 2 to 4, while at the same time saving some space for the hash table. In practice, such an approach may be difficult because the code in question is often deeply nested in the application. However, we would like to suggest that for future implementations, three-way bucketed Cuckoo hash tables with quotienting should be given serious consideration.

A (small) limitation of our approach is that the size of the hash table must be known (at least approximately) in advance. In principle we could grow the table dynamically, but this means re-hashing all elements. Fortunately, the total length of the sequences in the k-mer key-value store provides an easily calculated upper bound. The advantage of such a static approach is that only little additional memory is required during construction.

The software xengsort is available at http://gitlab.com/genomeinformatics/xengsort under the MIT license. Installation and usage instructions are provided within the README file of the repository. The software is written in Python, but makes use of just-in-time compilation using the numba package [19]. While this requires an additional 1-2 s of startup time, it allows for many optimizations, because certain parameters that become known only at run time, such as the random parameters of the hash functions, can be compiled as constants into the code. These optimizations yield savings that exceed the initial compilation effort.
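As an illustration of this just-in-time pattern (a sketch, not code from the xengsort repository; the hash formula, names and example values are placeholders), a parameter chosen at run time can be closed over by a factory function, so that numba compiles it into the generated machine code as a constant:

from numba import njit

def compile_bucket_function(a, b, p):
    # a, b and p are fixed once the table is set up; because they are closure
    # variables, numba freezes them as constants when compiling bucket_of.
    @njit(nogil=True)
    def bucket_of(code):
        # placeholder affine hash of a k-mer code, not xengsort's actual function
        return ((a * code) ^ b) % p
    return bucket_of

bucket_of = compile_bucket_function(a=0x9E3779B1, b=0x12345, p=1_276_595_745)
print(bucket_of(123456789))  # compiled on first call; subsequent calls are fast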
While we have indications that classification results agree well overall among all methods and variants, we concur with a recent study [1] that there exist subtle differences, whose effects can propagate through computational pipelines and influence, for example, variant calling results downstream; we believe that further evaluation studies are necessary. In contrast to that study, we however suggest that a best practice workflow for PDX analysis should start (after quality control and adapter trimming on RNA-seq data) with alignment-free xenograft sorting, followed by aligning the graft reads and the reads that can originate from both genomes to the graft genome. In any workflow, the latter reads, classified as "both", may pose problems, because one may not be able to decide on the species of origin; indeed, ultra-conserved regions of DNA sequence exist between human and mouse. In this sense we believe that full read sorting (into the categories host, graft, both, neither, ambiguous, as opposed to extracting graft reads only) gives the highest flexibility for downstream steps and is preferable to filter-only approaches.

Appendix
Table 6 shows the Python source of the read (pair) classification routine. The input vector counts corresponds to (x, h, g, b1, 0, h', g', b2) with b = b1 + b2 in the notation of the "Fragment classification" section.

Table 6 Python source code of xengsort's classification routine with thresholds, as of v1.0.0 (the import is added here so that the listing is self-contained; in xengsort the function is numba-compiled)

from numpy import uint32

def classify_xengsort(counts):
    # counts = [neither, host, graft, both1, 0, weakhost, weakgraft, both2]
    # returns: 0=host, 1=graft, 2=ambiguous, 3=both, 4=neither
    nkmers = 0
    for i in counts:
        nkmers += i
    if nkmers == 0:
        return 2  # no k-mers -> ambiguous
    nothing = uint32(0)
    few = uint32(6)
    insubstantial = uint32(nkmers // 20)
    Ag = uint32(3)
    Ah = uint32(3)
    Mh = uint32(nkmers // 4)
    Mg = uint32(nkmers // 4)
    Mb = uint32(nkmers // 5)
    Mn = uint32((nkmers * 3) // 4 + 1)
    hscore = counts[1] + counts[5] // 2
    gscore = counts[2] + counts[6] // 2

    # no host
    if counts[1] + counts[5] == nothing:
        if gscore >= Ag:
            return 1  # graft
        if counts[3] + counts[7] >= Mb:
            return 3  # both
        if counts[0] >= Mn:
            return 4  # neither
    # host, but no graft
    elif counts[2] + counts[6] == nothing:
        if hscore >= Ah:
            return 0  # host
        if counts[3] + counts[7] >= Mb:
            return 3  # both
        if counts[0] >= Mn:
            return 4  # neither
    # some real graft, few weak host, no real host:
    if counts[2] >= few and counts[5] <= few and counts[1] == nothing:
        return 1  # graft
    # some real host, few weak graft, no real graft:
    if counts[1] >= few and counts[6] <= few and counts[2] == nothing:
        return 0  # host
    # substantial graft, insubstantial real host,
    # a little weak host compared to graft:
    if (counts[2] + counts[6] >= Mg and counts[1] <= insubstantial
            and counts[5] < gscore):
        return 1  # graft
    # substantial host, insubstantial real graft,
    # a little weak graft compared to host:
    if (counts[1] + counts[5] >= Mh and counts[2] <= insubstantial
            and counts[6] < hscore):
        return 0  # host
    # substantial both, insubstantial host and graft:
    if (counts[3] + counts[7] >= Mb and gscore <= insubstantial
            and hscore <= insubstantial):
        return 3  # both
    # substantial neither:
    if counts[0] >= Mn:
        return 4  # neither
    # no specific rule applies:
    return 2  # ambiguous
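As a usage illustration (with a made-up k-mer statistics vector, not one of the evaluated fragments), the routine from Table 6 can be called as follows; the label mapping corresponds to the return codes documented in the listing.

labels = {0: "host", 1: "graft", 2: "ambiguous", 3: "both", 4: "neither"}
# hypothetical fragment: 40 graft k-mers, 3 weak-graft, 1 weak-host,
# 5 k-mers occurring in both species, 2 k-mers not found in the index
counts = [2, 0, 40, 5, 0, 1, 3, 0]
print(labels[classify_xengsort(counts)])  # -> graft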
Acknowledgements
We thank Uriel Elias Wiebelitz for preliminary experiments on the effectiveness of prefetching and Elias Kuthe for helping with the prefetching implementation.

Authors' contributions
SR provided the initial concept. JZ designed and implemented the software and performed the evaluations. Both authors wrote and edited the manuscript. Both authors read and approved the final manuscript.

Funding
Open Access funding enabled and organized by Projekt DEAL. S.R. is grateful for funding from DFG SFB 876 subproject C1, and Mercator Research Center Ruhr (MERCUR), project Pe-2013-0012.

Availability of data and materials
All datasets mentioned in this article are publicly available from third parties; their accession numbers are given in the respective paragraph.

Declarations

Competing interests
The authors declare that they have no competing interests.

Author details
Bioinformatics, Computer Science XI, TU Dortmund University, Dortmund, Germany. Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany.

Received: 2 February 2021   Accepted: 24 March 2021

References
1. Jo SY, Kim E, Kim S. Impact of mouse contamination in genomic profiling of patient-derived models and best practice for robust analysis. Genome Biol. 2019;20(1):231.
2. Kluin RJC, Kemper K, Kuilman T, de Ruiter JR, Iyer V, Forment JV, Cornelissen-Steijger P, de Rink I, Ter Brugge P, Song JY, Klarenbeek S, McDermott U, Jonkers J, Velds A, Adams DJ, Peeper DS, Krijgsman O. XenofilteR: computational deconvolution of mouse and human reads in tumor xenograft sequence data. BMC Bioinform. 2018;19(1):366.
3. Giner G. XenoSplit. Unpublished; 2019. Source code available at https://github.com/goknurginer/XenoSplit.
4. Khandelwal G, Girotti MR, Smowton C, Taylor S, Wirth C, Dynowski M, Frese KK, Brady G, Dive C, Marais R, Miller C. Next-generation sequencing analysis and algorithms for PDX and CDX models. Mol Cancer Res. 2017;15(8):1012–6.
5. Ahdesmäki MJ, Gray SR, Johnson JH, Lai Z. Disambiguate: an open-source application for disambiguating two species in next generation sequencing data from grafted samples. F1000Res. 2016;5:2741.
6. Bushnell B. BBsplit. Joint Genome Institute, Walnut Creek, CA. Part of BBTools; 2014–2020. https://jgi.doe.gov/data-and-tools/bbtools/.
7. Conway T, Wazny J, Bromage A, Tymms M, Sooraj D, Williams ED, Beresford-Smith B. Xenome—a tool for classifying reads from xenograft samples. Bioinformatics. 2012;28(12):172–8.
8. Callari M, Batra AS, Batra RN, Sammut SJ, Greenwood W, Clifford H, Hercus C, Chin SF, Bruna A, Rueda OM, Caldas C. Computational approach to discriminate human and mouse sequences in patient-derived tumour xenografts. BMC Genomics. 2018;19(1):19.
9. Dai W, Liu J, Li Q, Liu W, Li YX, Li YY. A comparison of next-generation sequencing analysis methods for cancer xenograft samples. J Genet Genomics. 2018;45(7):345–50.
10. Walzer S. Load thresholds for Cuckoo hashing with overlapping blocks. In: Chatzigiannakis I, Kaklamanis C, Marx D, Sannella D, editors. 45th international colloquium on automata, languages, and programming, ICALP 2018. LIPIcs, vol. 107. Wadern: Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik; 2018. p. 102:1–102:10. https://doi.org/10.4230/LIPIcs.ICALP.2018.102.
11. Zentgraf J, Timm H, Rahmann S. Cost-optimal assignment of elements in genome-scale multi-way bucketed Cuckoo hash tables. In: Proceedings of the symposium on algorithm engineering and experiments (ALENEX) 2020. Philadelphia: SIAM; 2020. p. 186–98. https://doi.org/10.1137/1.9781611976007.15.
12. Espinosa A. Cuckoo breeding ground—a better cuckoo hash table; 2018. https://cbg.netlify.app/publication/research_cuckoo_cbg/.
13. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525–7. Erratum in: Nat Biotechnol. 2016;34(8):888.
14. Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12(4):656–64.
15. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011;17(1):10–2. https://doi.org/10.14806/ej.17.1.200.
16. Andrews S. FastQC: a quality control tool for high throughput sequence data. Babraham Bioinformatics; 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
17. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinform. 2009;10:421.
18. Standage DS, Brown CT, Hormozdiari F. Kevlar: a mapping-free framework for accurate discovery of de novo variants. iScience. 2019;18:28–36.
19. Lam SK, Pitrou A, Seibert S. Numba: a LLVM-based Python JIT compiler. In: Finkel H, editor. Proceedings of the second workshop on the LLVM compiler infrastructure in HPC, LLVM 2015. New York: ACM; 2015. p. 7:1–7:6. https://doi.org/10.1145/2833157.2833162.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. The Creative Commons Public Domain Dedication waiver (http:// creat iveco mmons. org/ publi cdoma in/ zero/1. 0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Zentgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 2 of 16 Table 1 Tools for xenograft sorting and read filtering with key XenofilteR, Xenosplit, Bamcmp and Disambiguate all properties work on aligned BAM files. This means that the reads must be mapped and aligned with a supported read map- Tool Ref. Input Operations Language per first (typically, ‘bwa mem’) and the resulting BAM XenofilteR [2] Aligned BAM Filter R file must be sorted in a specific way required by the tool. Xenosplit [3] Aligned BAM Filter, count Python The tool is typically a script that reads and compares Bamcmp [4] Aligned BAM Partial sort C++ the mapping scores and qualities in the two BAM files Disambiguate [5] Aligned BAM Partial sort Python or C++ containing host and graft alignments. In principle, all BBsplit [6] Raw FASTQ Partial sort Java of these tools do the same thing; large differences result Xenome [7] Raw FASTQ Count, sort C++ rather from different alignment parameters than the tool Xengsort ( This) Raw FASTQ Count, sort Python + numba itself. We therefore picked XenofilteR as a representative See text for definition of operations of this family, also because it performed well in a recent comparison [1]. BBsplit (part of BBTools) is special in the sense that it designing a new decision function, we also obtain fewer performs the read mapping itself, against multiple refer- unclassified reads and in some cases even higher classifi - ences simultaneously, based on k-mer seeds. Unfortu- cation accuracy. Since we use a comprehensive reference nately, only up to approximately 1.9 billion k-mers can be of the genome and transcriptome, we are in principle able indexed because of Java’s array indexing limitations (up to process genome, exome, and transcriptome samples of to 2 elements) and a table load limit of 90%; so BBsplit xenografts. Of course, different sources may exhibit dif - was not usable for our human-mouse index that contains ferent error distributions and require distinct optimized 9 32 approximately 4.5 · 10 > 2 k-mers. parameter sets for classification. Nevertheless, our evalu - The tool xenome [7] is similar to our approach: It is ation shows that we obtain good results on all of exomes, based on a large hash table of k-mers and sorts the reads genomes and transcriptomes with the same parameter into several categories (host, graft, both, neither, ambig- set. uous). 
A read is classified based on its k-mer content Concerning related work, we distinguish alignment- according to relatively strict rules. We found the thread- based methods that work on already aligned reads (BAM ing code of xenome to be buggy, such that the pure count- files), versus alignment-free methods that directly work ing mode resulted in a deadlock and produced no output. on short subsequences (k-mers) of the raw reads (FASTQ The sorting mode produced the complete output but files). then did not terminate either. Alignment-based methods scan existing alignments in Recent studies [1, 8, 9] have compared the computa- BAM files and test whether each read maps better to the tional efficiency of several methods, as well as the clas - graft genome or to the host genome. Differences result sification accuracy of these methods and the effects on from different parameter settings used for the alignment subsequent variant calling after running vs. not running tool (often bwa or bowtie2) and from the way “better xenograft sorting. The results were contradictory, with alignment” is defined by each of these tools. Alignment- some studies reporting that alignment-based tools are free methods use a large lookup table to associate species more efficient than alignment-free tools, and different information with each k-mer. tools achieving highest accuracy. Our interpretation of In Table  1, we list properties of existing tools and of the results of [1] is that each of the existing approaches is xengsort, our implementation of the method we describe able to sort with good accuracy and the main difference is in this article. These tools support different operations: in computational efficiency. Results about efficiency have Operation “count” outputs proportions of reads belong- to be interpreted with care because sometimes the time ing to each category (host, graft, etc.); operation “sort” for alignment is included and sometimes not. sorts reads or alignments into different files according to origin, ideally into five categories: host, graft, both, nei - Methods ther, ambiguous; a “partial sort” only has three categories: Overview host, graft, both/other; operation “filter” writes only an By considering all available host and graft reference output file with graft reads or alignments. The sort opera - sequences (both transcripts and genomic sequences of tion is more general than the filter or partial sort opera - mouse and human), we build a large key-value store that tion and allows full flexibility in downstream processing. allows us to look up the species of origin (host, graft or The count operation, when it is available separately, is both) of each DNA/RNA k-mer that occurs in either spe- faster than counting the output of the sort operation, cies. A sequenced dataset (a collection of single-end or because it avoids the overhead of creating output files. Z entgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 3 of 16 Fig. 1 Illustration of (3,4) Cuckoo hashing with 3 hash functions and buckets of size 4. Left: Each key-value pair can be stored at one of up to 12 locations in 3 buckets. For key x, the bucket given by f (x) is full, so bucket f (x) is attempted, where a free slot is found. Right: If all hb slots are full, 1 2 key x is placed into one of these slots at random (blue), and the currently present key-value pair is evicted and re-inserted into an alternative slot paired-end FASTQ files) is then processed by iterating to the same number. 
We use a simple base-4 numeric over reads or read pairs, looking up the species of origin encoding A  → 0 , C  → 1 , G  → 2 , T/U  → 3 , e.g., reading of each k-mer in a read (host, graft, both or none) and the 4-mer AGCG as (0212) = 38 and its reverse comple- classifying the read based on a decision rule. ment CGCT as (1213) = 103 . The canonical code is then Our implementation of the key-value store as a three- the maximum of these two numbers; here the canonical way bucketed Cuckoo hash table makes k-mer lookup code of both AGCG and CGCT is thus 103. (In xenome, faster than in other methods; the associated value can canonical k-mer codes are implemented with a more often be retrieved with a single random memory access. complex but still deterministic function of the two base-4 A high load factor of the hash table, combined with the encodings; in other tools, it is often the minimum of the technique of quotienting, ensures a low memory foot- two encodings.) For odd k, there are exactly c(k) := 4 /2 print, without resorting to approximate membership data different canonical k-mer codes, so each can be stored structures, such as Bloom filters. in 2k − 1 bits in principle. However, implementing a fast bijection of the set of canonical codes (which is a subset Keyv ‑ alue stores of canonical k‑mers of size c(k) of {0 .. (4 − 1)} ) to {0 .. (c(k) − 1)} seems diffi - We partition the reference genomes (plus alternative cult, so we use 2k bits to store the canonical code directly, alleles and unplaced contigs) and transcriptomes into which allows faster access. However, we use quotienting, short substrings of a given length k (so-called k-mers); we described below, to reduce the size of the stored k-mer evaluated k ∈{23, 25, 27} . For each k-mer (“key”) in any code. of the reference sequences, we store whether it occurs exclusively in the host reference, exclusively in the graft Multi‑way bucketed quotiented Cuckoo hashing reference, or in both, represented by “values” 1, 2, 3, We use multi-way bucketed Cuckoo hash table as the respectively. For the host- and graft-exclusive k-mers, we data structure for the k-mer key-value store. Let C be also store whether a closely similar k-mer (at Hamming the set of canonical codes of k-mers; as explained above, distance  1) occurs in the other species (add value  4); we take C ={0 .. (4 − 1)} , even though only half of such a k-mer is then called a weak (host or graft) k-mer. the codes are used (for odd  k). Let P be the set of loca- This idea extends the k-mer classification of xenome [7], tions (buckets) in the hash table and p their number; we where a k-mer can be host, graft, both, or marginal, the set P := {0 .. (p − 1)} . Each key can be stored at up to  h latter category comprising both our weak host and weak different locations (buckets) in the table. The possible graft k-mers. So we store, for each k-mer, a value from buckets for a code are computed by h different hash func - the 5-element set “host” (1), “graft” (2), “both” (3), “weak tions f , f , . . . , f : C → P . Each bucket can store up to 1 2 h host” (5), “weak graft” (6). This value is stored using a certain number b  of key-value pairs. So there is space 3  bits. While a more compact base-5 representation is for N := pb key-value pairs in the table overall, and each possible (e.g., storing 3 values with 125 < 128 = 2 com- pair can be stored at one of hb locations in h buckets. 
binations in 7 bits instead of in 9 bits), we decided to use Together with an insertion strategy as described below, slightly more memory for higher speed. this framework is referred to as (h,  b) Cuckoo hashing. To be precise, we do not work on k-mers directly, but Classical Cuckoo hashing uses h = 2 and b = 1 ; for this on their canonical integer representations (canonical work, we use h = 3 and b = 4 . A visualization is provided codes), such that a k-mer and its reverse complement map in Fig. 1. Using several hash functions and larger buckets Zentgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 4 of 16 increases the load limit; using h = 3 and b = 4 allows the opportunity here to evaluate whether it significantly a load factor of over 99.9% [10, Table  1], while classical improves lookup times in comparison to a table filled by Cuckoo hashing only allows to fill 50% of the table. the above random walk strategy (see “Results”). Search Bijective hash functions and quotienting Searching for a key-value pair works as follows. Given In principle, we need to store the 2k bits for the canoni- key (canonical code)  x, first f (x) is computed, and this cal k-mer code  x and the 3  bits for the value at each bucket is searched for key  x and the associated value. If slot. However, by using hash functions of the form it is not found, buckets f (x) and then f (x) are searched 2 3 f (x) := g(x) mod p , where p is the number of buckets similarly. Each bucket access is a random memory lookup and g is a bijective (randomized) transformation on the and most likely triggers a cache miss. We can ensure full key set {0 .. (4 − 1)} , we can encode part of x in f(x): that each bucket is contained within a single cache line Note that from f(x) and q(x) := g(x)//p (integer divi- (by using additional padding bits if necessary). Then, the sion), we can recover g(x) = p · q(x) + f (x) , and since g number of cache misses is limited to h = 3 for one search is bijective, we can recover  x itself. This means that we operation. only need to store q(x), not x itself in bucket f(x), which When we fill the table well below the load limit (at 88% only takes ⌈2k − log p⌉ instead of 2k bits. However, since of 99.9%), we are able to store many key-value pairs in the we have h alternative hash functions, we also need to bucket indicated by the first hash function f , and only store which hash function we used, using 2 bits for h = 3 incur a single cache miss when looking for them. Unsuc- (0 indicating that the slot is empty). This technique is cessful searches (for k-mers that are not present in either known as quotienting. It gives higher savings for smaller host or graft genome) need all h memory accesses. How- buckets (for constant N = pb , smaller b means larger p), ever, optimizations are possible and described below (see but on the other hand the load limit is smaller for small b. “Performance engineering” section). We find b = 4 to be a good compromise, allowing table loads of 99.9%. Insert For the bijective part g(x), we use affine functions of the Insertion of a key-value pair works as follows. First, the form key is searched as described above. If it is found, the value is updated with the new value. For example, if an g (x) := [a · (rot (x) xor b)] mod 4 , a,b k existing host k-mer is to be inserted again as a graft k- where rot performs a cyclic rotation of k bits (half the mer, the value is updated to “both”. 
If the key is not found, width of  x), moving the “random” inner bits to outer we check whether any of the buckets f (x), f (x), f (x) (in 1 2 3 positions and the less random outer bits (due to the max that order) contains a free slot. If this is the case, x and operation when taking canonical codes) inside, b is a 2k- its value are inserted there. If all buckets are full, a ran- bit offset, and a is an odd multiplier. Picking a “random” dom slot among the hb slots is picked, and the key-value hash function means picking random values for a and b. pair stored there is evicted (like a cuckoo removes eggs from other birds’ nests) to make room for x and its value. Lemma 1 For any 2k-bit number b and any odd Then an alternative location for the evicted element is 2k-bit number a, the function g is a bijection on searched. This process may continue for several itera - a,b K := {0 .. (4 − 1)} , and its inverse can be efficiently tions and is called a “random walk” through the table. If obtained. the walk becomes too long (longer than 5000 steps, say), we declare that the table is too full, and construction fails and has to be restarted with a larger table or different Proof Let y = g (x) . By definition, the range of g a,b a,b random seed. on K is a subset of K. Because |K| is a power of  2 and Our implementation requires that the size (number of a is odd, the greatest common divisor of |K| and a is buckets  p) of the hash table is known in advance, so we 1, and so there exists a unique multiplicative inverse ′ k ′ k can pre-allocate it. The genome length is a good (over-) a of a modulo 4 =|K| , such that aa = 1 (mod 4 ) . estimate of the number of distinct k-mers and can be This inverse a can be obtained efficiently using the used. We recently presented a practical algorithm [11] to extended Euclidean algorithm. The other operations optimize the assignment of k-mers to buckets (i.e., their (xor b, rot ) are inverses of themselves; so we recover ′ k hash function choices) such that the average search cost x = rot ([(a · y) mod 4 ] xor b) . of present k-mers is minimized to the provable opti- mum. This optimization takes significant additional time In summary, each stored canonical k-mer needs and requires large additional data structures; so we took 2 + 3 + ⌈2k − log p⌉ bits to remember the hash function 2 Z entgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 5 of 16 choice and to store the value (species) and the quotient, unsuccessfully when the element is not found in this respectively. For k = 25 and p = 1 276 595 745 buck- bucket. The same idea has been proposed by Alain Espi - ets, this amounts to 25  bits per k-mer, or 100  bits for nosa as “unlucky buckets trace” [12]. each bucket of  4 k-mers. To ensure cache line (512  bits) Using a second bit per bucket, the resolution of this aligned buckets, we could use 500 bits for 5 buckets and type of information can be further improved: One bit insert 12 padding bits; however, we chose to use less indicates that there exists an element whose first choice memory and let a few buckets cross cache line bounda- would have been this bucket, but that is stored at its ries, accepting the resulting speed decrease. second choice bucket. The other bit indicates that there exists an element whose first or second choice would have been this bucket, but that is stored at its third choice Performance engineering bucket. 
If the element is not found in the current bucket, Software prefetching then, depending on the bit combination, the search can Prefetching refers to instructing the memory system to be stopped early (0,0), only the second choice needs to be fetch data from RAM into the cache hierarchy before checked (1,0), only the third choice needs to be checked the CPU actually needs the data. This can reduce the (0,1), or both (1,1). time spent by the CPU waiting for data, especially in These shortcuts work best if (a) there are many unsuc - lookup-intensive applications such as this one. Hard- cessful lookups and (b) they are evaluated only after the ware prefetching is automatically performed by the CPU search in the first bucket was unsuccessful. The perfor - based on memory access patterns (i.e., a linear scan over mance gains are evaluated in the “Results” section. a large array). Software prefetching refers to application- controlled prefetching. Our application xengsort sup- ports three levels of prefetching: none (0), prefetching No key deletions the second choice bucket before searching the first choice In principle, our implementation of Cuckoo hashes bucket (1), or prefetching both second and third choice allows for easy deletion of keys: in the corresponding slot, buckets before searching the first choice bucket (2). The simply set the hash choice bits to zero. However, this will disadvantage is that, if the search of the first bucket is destroy the tight layout mentioned in the previous para- successful, the memory system has done unnecessary graph and invalidate the shortcut flag bits; therefore sub - work, possibly slowing down other threads that want to sequent unsuccessful lookups must examine all locations access different memory locations at the same time. As and become more expensive. Restoring the tight layout a consequence, software prefetching should only be ena- may involve many iterations of moving keys throughout bled if the second and/or third bucket must be examined the table, and hence is an expensive operation. As our frequently, i.e., at high load factors, or when many unsuc- application never needs to delete an existing key, we fully cessful lookups can be expected. benefit from the above mentioned shortcuts. Shortcuts for unsuccessful lookups Additional memory savings As described so far, unsuccessful lookups are slow Theoretically, the two bits indicating the hash func - because all three buckets must be completely examined, tion choice for each slot may be saved by using separate even though software prefetching may solve part of the tables for each hash function. However, this drastically problem. In addition, algorithmic optimizations are pos- decreases the load limits and overall results in higher sible, with 0 to 2 extra bits of memory per bucket. memory requirements. A better alternative is to exploit The following shortcut is possible without using any that the order of slots in a bucket is arbitrary, so we may additional memory: If, say, the first bucket f (x) contains enforce a fixed order: first all keys that are present with an empty slot, we do not need to search further, because their final hash function, then in order all keys resulting the random walk insertion procedure produces a tight from earlier hash functions, and finally all empty slots. layout, in the sense that if a single element could have u Th s the configuration of a bucket is given by a non-neg - been moved to an “earlier” bucket, it would have been ative (h + 1)-tuple (c , . . . 
, c , c ) ≥ 0 with sum  b, where h 1 0 done. c is the number of elements in the bucket which are pre- Using a single additional “shortcut” flag bit per sent because of their i-th hash function (for i ≥ 1 ), and bucket, we can store whether there exists any element c is the number of empty slots. Especially for large  b, in a “later” choice bucket that could have been inserted there are much fewer possible such tuples than (h + 1) . into this bucket, had there been more space. So a set bit For the case of (h, b) = (3, 4) , there are 35  such tuples, (value  1) indicates that later choices must be searched if and the configuration can be encoded in  6 instead of the element is not found in this bucket, while a cleared 8  bits. Encoding more than one bucket jointly results in bit (value  0) indicates that a search can be terminated further savings. Similarly, the values for a bucket can be Zentgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 6 of 16 Fig. 2 A k-mer is partitioned into its ℓ-prefix, a middle base and its ℓ-suffix. Efficient local re-sorting of k-mers according to common ℓ-prefix and ℓ -suffix yields groups of k-mers that differ only in their middle base encoded jointly. For example, given a value set of size  5, (value 4) is set on both k-mers of the pair. Now consider where a value requires 3 bits, there are 5 = 625 different the case where the difference is in the ℓ-prefix. Because value combinations in a bucket, which can be encoded reverse complements are included in the full list, this in 10  bits instead of 4 · 3 = 12 . In a practical setting case is already covered. (human/mouse, k = 25 , load 0.88; see Table  3), combin- It remains to find pairs of k-mers that differ only in ing both options reduces the hash table size by 0.5  GB their middle base. We thus conceptually re-partition the from 15.9 GB to 15.4 GB. However, these savings come at chunk into blocks of constant ℓ-prefixes. We now switch the cost of increased CPU time for decoding the configu - the order of ℓ-suffix and middle base (Fig.  2) and re-sort ration or values. Neither option has been implemented each block internally. This is a cache-friendly local opera - yet, but will be added in a future release. tion on typically relatively small blocks. Now Hamming- distance-1 groups that differ in their original middle base Annotating weak k‑mers occur consecutively in a block and agree in their 2ℓ-pre- A k-mer that occurs only in the host (graft) reference, but fix. A scan over the block reveals all relevant pairs. has a Hamming-distance-1 neighbor in the graft (host) Finally, the updated values are transferred to the values reference, is called a weak host (graft) k-mer. So for a of the canonical k-mers in the hash table. weak k-mer, a single nucleotide variation could flip its assigned species, while a k-mer that is not weak is more Reference sequences robust in this sense. After the hash table has been con- To build the k-mer hash table of the human (GRCh38, structed with all k-mers and their values “host”, “graft” or hg38) and mouse (GRCm38, mm10) genome and tran- “both”, we mark weak k-mers by modifying the value, set- scriptome, we obtained the “toplevel DNA” genome ting an additional “weak” bit. In principle, we could scan FASTA files, which include both the primary assem - over the k-mers and query all 3k neighbors of each k-mer, bly, unplaced contigs and alternative alleles, and the “all but this is inefficient. 
cDNA” files, which contain the known transcripts, from Instead, we extract from the hash table a complete list L the ensembl FTP site, release 98. of k-mers and their reverse complements (not canonical As the alternative alleles of the human and mouse 9 9 codes; approx. 9 · 10 entries for 4.5 · 10 distinct k-mers), toplevel references contain mostly Ns to keep positional together with their current values. To save memory, this alignment of alternative alleles to the consensus refer- list is created and processed in 16 chunks according ence, they decompress to huge FASTA files (over 60  GB to the first two nucleotides of the k-mer, thus needing for human, over 12 GB for mouse). Therefore we con - approx. 4.5  GB of additional memory temporarily. Since densed the toplevel reference sequences by replacing we use odd k = 2ℓ + 1 , we can partition a k-mer into its ℓ runs of more than 25 Ns by 25 Ns. This does not change -prefix, its middle base and its ℓ-suffix (Fig.  2). We make the k-mer content, as k-mers containing even a single N use of the following observation. are ignored. It does provide an efficiency boost to align - ment-based tools because read mappers build an index of Observation 1 For k = 2ℓ + 1 , two k-mers x,  y with every position in the genome and typically replace runs Hamming distance 1 differ either in their ℓ-suffix, in the ℓ of Ns by random sequence. -suffix of their reverse complement or in their middle base. Consider first the case where the difference is in the ℓ Fragment classification -suffix. We thus partition the sorted chunk into blocks of Given a sequenced fragment (single read or read pair), constant (ℓ + 1)-prefixes. Different blocks are processed we query each k-mer of the fragment about its origins; independently in parallel threads. The ℓ-suffixes of all k-mers with undetermined bases (Ns) are ignored. Our pairs of k-mers in such a block are queried with a fast implementation reads large chunks (several MB) of bit-vector test for Hamming distance 1. If a pair is found FASTQ files and distributes read classification over sev - and the k-mers occur in different species, the “weak bit” eral threads (we found that 8 threads saturate the I/O). b + b ≥ T both else b + b ≥ T both x > 3n/4 x > 3n/4 S ≥ 3 host g ≥ 6 and h = 0 and h ≤ 6 g + g ≥ T and h ≤ n/20 and h < S graft graft h + h = 0 (no host) else Z entgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 7 of 16 graft host neither both both neither graft graft host host both neither ′ ′ Fig. 3 Decision rule tree for classifying a DNA fragment from k-mer statistics (h, h , g, g , b, x; n) , meaning number of k-mers of type “host” ′ ′ (h), “weak host” ( h ), “ graft ” (g), “weak graft” ( g ), “both” (b), and number of k-mers not present in the key-value store (x), respectively; n is the ′ ′ total number of (valid) k-mers in the fragment. We also use weighted scores S := h +⌊h /2⌋ and S := g +⌊g /2⌋ and thresholds host graft T := ⌊n/4⌋, T := ⌊n/4⌋ and T := ⌊n/4⌋ . A fragment is thus classified as “host”, “graft”, “both”, “neither”, or “ambiguous”. Category host graft both “ambiguous” is chosen if no other rule applies and no “else” rule is present in a node We collect k-mer statistics for each fragment (adding i.e. x ≥ 3n/4 ). If none of these conditions is true, the the numbers of both reads for a read pair): let n be the “ambiguous” class is chosen. A symmetric quick deci- number of (valid) k-mers in the fragment. 
Let h be the sion rule exists for the case that no host k-mers exist ′ ′ number of (non-weak) host k-mers and h the number ( h + h = 0 ). If no quick decision can be made, more of weak host k-mers, and analogously define g and g for complex rules are applied: The next test is whether there the graft species. Further, let b be the number of k-mers are no (strong) graft k-mers ( g = 0 ), only few weak graft occuring in both species, and let x be the number of k-mers ( g ≤ 6 ), but at least some (strong) host k-mers k-mers that were not found in either species. ( h ≥ 6 ), in which case the read is classified as “host”. A ′ ′ Based on the vector (h, h , g , g , b, x; n) , we use a tree of symmetric rule exists for the “graft” class, of course. An hierarchical rules to classify the fragment into one of five even more complex rule tests whether there is sufficient categories: “host”, “graft”, “both”, “neither” and “ambigu- overall evidence for host but only little strong graft evi- ous”. Categories “host” and “graft” are for reads that can dence in absolute terms, and little weak graft evidence in be clearly assigned to one of the species. Category “both” comparison to the host evidence. For categories “both” is for reads that match equally well to both references. and “neither”, a relatively large number of correspond- Category “neither” is for reads that contain many k-mers ing k-mers is required. Category “ambiguous” is always that cannot be found in the key-value store; these could chosen if no “else” rule exists and no other rule applies point to technical problems (primer dimers) or contami- in any given node. The thresholds have been iteratively nation of the sample with other species. Finally, category hand-tuned on several internal human, mouse and bacte- “ambiguous” is for reads that provide conflicting infor - rial datasets that were not part of the evaluation datasets. mation. Such reads should not usually be seen; they could The thresholds are optimized for typical high-quality result from PCR hybrids between host and graft during short reads (100–150 bp) and may have to be adjusted for library preparation. long reads with higher error rates. For completeness, the The precise rules are shown in Fig.  3. The rules are Python source of the classification function appears in designed to arrive at easy decisions quickly. For exam- Table 6 in Appendix. ple, at the root node, if there are no graft k-mers at all ( g + g = 0 ), then an easy decision can be made between Quick mode the classes “host” (if there is at least a little evidence of Inspired by a feature of the kallisto software [13] for tran- host k-mers, i.e., S ≥ 3 , where S := h + ⌊h /2⌋ ), script expression quantification, we additionally imple - host host “both” (if there are sufficiently many such k-mers, i.e. mented a “quick mode” that initially looks only at the b ≥ T := ⌊n/4⌋ , but the “host” class does not apply), type of the third and third-last k-mer in every read. 
If the both and “neither” (if there are sufficiently many such k-mers, two (for single-end reads) or four (for paired-end reads) else g + g = 0 (no graft) h + h ≥ T and g ≤ n/20 and g < S host host h ≥ 6 and g = 0 and g ≤ 6 S ≥ 3 graft x > 3n/4 b + b ≥ T both and S ≤ n/20 graft and S ≤ n/20 host Zentgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 8 of 16 Table 2 Properties of the k-mer index for different values of k k-mers k = 23 (%) k = 25 (%) k = 27 (%) Total 4,396,323,491 (100) 4,496,607,845 (100) 4,576,953,994 (100) Host 1,924,087,512 (43.8) 2,050,845,757 (45.6) 2,105,520,461 (46.0) Graft 2,173,923,063 (49.4) 2,323,880,612 (51.7) 2,395,147,724 (52.3) Both 18,701,862 (0.4) 12,579,160 (0.3) 9,627,252 (0.2) Weak host 132,469,231 (3.0) 52,063,110 (1.2) 32,445,717 (0.7) Weak graft 147,141,823 (3.4) 57,239,206 (1.3) 34,212,840 (0.7) Underlying reference sequences are given in “Reference sequences” section Table 3 Index construction Tool k Build Build Mark Mark Total Total Mem Mem CPU Wall CPU Wall CPU Wall Final Peak Xengsort 23 50 50 591 176 641 226 12.8 17.3 Xengsort 25 53 53 437 158 490 211 15.9 20.4 Xengsort 27 51 51 495 214 546 265 17.3 21.8 Xenome 25 992 151 2338 356 3626 552 31.2 57.1 XenofilteR – – – – – 528 658 13.0 22.0 CPU times and wall clock times in minutes and memory in Gigabytes using different tools and different k-mer sizes for xengsort. “Build” times refer to collecting and hashing the k-mers according to species, but without marking weak k-mers. “Mark” times refer to marking weak k-mers. “Total” times are the sum of build and mark times, plus additional I/O times. “CPU” times measure total CPU work load (as reported by the time command as user time), and “wall” times refer to actually passed time. Final size (“mem final”) is measured by index size on disk (GB). Memory peak (“mem peak”) is the highest memory usage during construction (GB) types agree (e.g. all are “graft”), the fragment is classified species specificity and memory requirements. Table  2 on this sampled evidence alone. This results in quicker shows several index properties. In particular, moving processing of large FASTQ files, but only considers a from k = 25 to k = 27 , the small decrease in k-mers that small sample of the available information. map to both genomes and in weak k-mers did not justify the additional memory requirements. In addition, shorter Results k-mers lead to better error tolerance against sequenc- We evaluate our alignment-free xenograft sorting ing errors, as each error affects up to k of the k-mers in approach and its implementation xengsort for the com- a read. mon case of human-tumor-in-mouse xenografts, by using mouse datasets, human datasets, xenograft datasets Construction time and memory and datasets from other species, and compare against an Table 3 shows time and memory requirements for build- existing tool with the same purpose, xenome from the ing the k-mer hash table or FM index for bwa (for Xeno- gossamer suite [7], and against a representative of align- lteR fi ). The main difference is that the BWA index is a ment-based filtering tools, XenofilteR [2]. The hardware succinct representation of the suffix array of the refer - used for the benchmarks was one server with two AMD ences and not a k-mer hash table. Our hash table con- Epyc 7452 CPUs (with 32  cores and 64  threads each), struction is not paralellized; hence CPU times and 1024 GB DDR4-2666 memory and one 12 TB HDD with wall clock times agree and are less than one hour. The 7200 rpm and 256 MB cache. 
hash construction of xenome is paralellized; we gave it We first report on statistics and efficiency of index 8 threads (but 9 were sometimes used); yet it does about construction (“Hash table construction” section), then 20 times the CPU work and takes three times as long as discuss classification accuracy on several datasets (“ Clas- xengsort, even when using multiple threads. sification results” section), and finally compare running Marking weak or marginal k-mers is paralellized in times (“Running times” section). both approaches; wall clock times are measured using 8  threads. Again, xengsort finds the weak k-mers faster, Hash table construction both in terms of total CPU work and wall clock time. Table size and uniqueness of k‑mers The indexing method of bwa is not comparable, as it We evaluated k ∈{23, 25, 27} and then decided to use builds a complete suffix array (FM  index) that is inde - k = 25 because it offers a good compromise between pendent of k and does not include marking weak k-mers. Z entgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 9 of 16 Here the CPU time is lower than the wall clock time, with less than 0.64% human reads for both tools). The which indicates an I/O starved process. main difference between the tools is that xenome is more We note that xenome uses a large amount of mem- conservative, assigning a larger fraction of reads to the ory during hash table construction (it was given up to “ambiguous” (unclassified) category. With xenome, this 64 GB). It works with less if restricted, but at the expense happens for reads that contain two k-mers x,  y, where x of longer running times. BWA indexing also needs sig- maps uniquely to human and y maps uniquely to mouse. nificant additional memory during construction. The The decision rule of xengsort is more permissive and tol - additional memory required by xengsort results from the erant towards small inconsistencies. Therefore, xengsort additional sorted k-mer list required for detecting weak assigns more reads correctly to mouse, and fewer to the k-mers. Overall, our construction is fast (even though ambiguous category. Additionally, xengsort assigns fewer serial only) and uses a reasonable amount of memory. reads incorrectly to human. However, the two samples of strain A/J give different Load factor and hash choice distribution results. Both xengsort and xenome assign a large fraction As explained in “Multi-way bucketed quotiented Cuckoo of reads (around 21% and 3.6% in the two samples) to the hashing” section, 3-way Cuckoo hash tables support very human genome, while XenofilteR assigns only 10.5% and high loads (fill ratios) over 99.9%. However, such loads 2.7%, respectively. While xengsort does assign more reads come at the expense of distributing all k-mers almost to mouse, it also assigns more reads to human, following evenly across hash function choices. For faster lookup, its strategy of leaving fewer reads unassigned (ambigu- it is beneficial to leave part of the hash table empty. We ous). Inspection of these reads revealed that almost all used a load factor of 88% and thus find 76.7% of the of them are low-complexity, i.e. consist of repetitive k-mers at their first bucket choice, 15.5% at their second sequence, and a check with BLAT [14] revealed no hits choice and only 7.8% at their third choice, yielding an in mouse and several gapped hits in the human genome. average of 1.31 lookups for a present k-mer. 
Classification results

We applied our method xengsort, xenome and XenofilteR to several datasets with reads of known origin (except for possible contamination issues or technical artefacts), which however present certain particular challenges. Each of the following paragraphs discusses one dataset.

Human-captured mouse exomes
A recent comparative study [1] made five mouse exomes accessible, which were captured with a human-exome capture kit and hence present mouse reads that are biased towards high similarity with human reads. The mouse strains were A/J (two mice), BALB/c (one mouse), and C57BL/6 (two mice); they were sequenced on the Illumina HiSeq 2500 platform, resulting in 11.8 to 12.7 Gbp. The datasets are available under accession numbers SRX5904321 (strain A/J, mouse 1), SRX5904320 (strain A/J, mouse 2), SRX5904319 (strain BALB/c, mouse 1), SRX5904318 (strain C57BL/6, mouse 1) and SRX5904322 (strain C57BL/6, mouse 2). Ideally, all reads should be classified as mouse reads.

Table 4 shows detailed classification results and running times. Considering the BALB/c and C57BL/6 strains first, it is evident that classification accuracy is high (over 98.9% mouse for xengsort, over 97.4% for xenome; with less than 0.64% human reads for both tools). The main difference between the tools is that xenome is more conservative, assigning a larger fraction of reads to the "ambiguous" (unclassified) category. With xenome, this happens for reads that contain two k-mers x, y, where x maps uniquely to human and y maps uniquely to mouse. The decision rule of xengsort is more permissive and tolerant towards small inconsistencies. Therefore, xengsort assigns more reads correctly to mouse, and fewer to the ambiguous category. Additionally, xengsort assigns fewer reads incorrectly to human.

However, the two samples of strain A/J give different results. Both xengsort and xenome assign a large fraction of reads (around 21% and 3.6% in the two samples) to the human genome, while XenofilteR assigns only 10.5% and 2.7%, respectively. While xengsort does assign more reads to mouse, it also assigns more reads to human, following its strategy of leaving fewer reads unassigned (ambiguous). Inspection of these reads revealed that almost all of them are low-complexity, i.e. consist of repetitive sequence, and a check with BLAT [14] revealed no hits in mouse and several gapped hits in the human genome. So the classification as human reads is not incorrect from a technical standpoint, but in fact these reads appear to point to technical problems during the enrichment step of the library generation. An additional low-complexity filter would remove most problematic reads.

Table 4 Detailed classification results on five human-captured mouse exomes from different mouse strains (2× A/J, 1× BALB/c, 2× C57BL/6)

Category     xengsort, fragments (%)    xenome, fragments (%)    XfR, fragments (%)

A/J-1 (time: xengsort 70 Cm, 14 Wm; xenome 371 Cm, 45 Wm; XfR 56 Cm, 56 Wm)
  Mouse      46,648,014 (78.03)         45,759,814 (76.54)       n/a
  Both       120,808 (0.20)             65,269 (0.11)            n/a
  Human      12,813,583 (21.43)         12,500,844 (20.91)       6,315,955 (10.56)
  Ambiguous  58,449 (0.10)              1,383,547 (2.31)         n/a
  Neither    143,775 (0.24)             75,155 (0.13)            n/a

A/J-2 (time: xengsort 70 Cm, 15 Wm; xenome 416 Cm, 50 Wm; XfR 67 Cm, 67 Wm)
  Mouse      60,255,189 (95.57)         59,135,489 (93.80)       n/a
  Both       151,396 (0.24)             89,089 (0.14)            n/a
  Human      2,301,384 (3.65)           2,271,131 (3.60)         1,718,545 (2.73)
  Ambiguous  57,827 (0.09)              1,340,814 (2.13)         n/a
  Neither    279,556 (0.44)             208,829 (0.33)           n/a

BALB/c (time: xengsort 68 Cm, 15 Wm; xenome 392 Cm, 45 Wm; XfR 61 Cm, 61 Wm)
  Mouse      62,235,960 (98.99)         61,274,277 (97.46)       n/a
  Both       118,541 (0.19)             68,949 (0.11)            n/a
  Human      342,908 (0.55)             348,154 (0.55)           285,556 (0.45)
  Ambiguous  45,063 (0.07)              1,098,036 (1.65)         n/a
  Neither    127,035 (0.20)             80,091 (0.13)            n/a

C57BL/6-1 (time: xengsort 72 Cm, 14 Wm; xenome 359 Cm, 44 Wm; XfR 58 Cm, 58 Wm)
  Mouse      57,993,361 (98.93)         57,522,446 (98.13)       n/a
  Both       118,984 (0.20)             74,325 (0.13)            n/a
  Human      375,716 (0.64)             376,653 (0.64)           290,894 (0.50)
  Ambiguous  27,731 (0.05)              571,542 (0.98)           n/a
  Neither    103,895 (0.18)             74,721 (0.13)            n/a

C57BL/6-2 (time: xengsort 67 Cm, 15 Wm; xenome 422 Cm, 51 Wm; XfR 62 Cm, 62 Wm)
  Mouse      62,384,448 (99.00)         61,941,783 (98.30)       n/a
  Both       107,019 (0.17)             66,163 (0.10)            n/a
  Human      189,536 (0.30)             208,149 (0.33)           132,535 (0.21)
  Ambiguous  27,142 (0.04)              562,659 (0.89)           n/a
  Neither    304,677 (0.48)             234,068 (0.37)           n/a

Running times are reported both in CPU minutes (Cm), measuring CPU work, and wall clock minutes (Wm), measuring actual time spent. Times for XenofilteR (XfR) do not include alignment or BAM sorting time. Classification results report the number and percentage (in brackets) of fragments classified as mouse (correct), both human and mouse (likely correct), human (incorrect), ambiguous (no statement) and neither (likely incorrect). XenofilteR only extracts human fragments and does not classify the remainder, so only the numbers of fragments classified as human are reported.

Human genome (GIAB) matepair library
We obtained FASTQ files of an Illumina-sequenced 6kb matepair library from the Genome In A Bottle (GIAB) Ashkenazim trio dataset according to the provided sequence file index (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data_indexes/AshkenazimTrio/sequence.index.AJtrio_Illumina_6kb_matepair_wgs_08032015). The data represents a family (mother, father, son). Ideally, we see only human reads.

Figure 4a shows the classification results for xengsort and xenome. XenofilteR reported that the BAM files were too large to be processed and did not give a result (400 GB total for human and mouse; each BAM file over 30 GB in size). We see that almost all reads are correctly identified as human, while a small fraction is neither, which could be adapter dimers or other technical issues. However, xenome classifies a similarly small fraction as ambiguous. Both alignment-free tools accurately recognize that this is a pure human dataset.

Chicken genome
We obtained a paired-end (2x101bp) Illumina whole genome sequencing run of a chicken genome from a whole blood sample (accession SRX6911418) with a total of 251 million paired-end reads. Ideally, none of these reads are recognized as mouse or human reads. Figure 4b shows divergent results. For XenofilteR, we can only say that almost no reads are extracted as human; the remainder is unclassified. Xenome assigns a small number of reads to each category and only around 90% into the "neither" category, while xengsort assigns 98.11% of the reads as "neither".

Fig. 4 Classification results of different tools (XenofilteR, xenome, xengsort, and partially xengsort with "quick" option) on several datasets: a GIAB human matepair dataset (XenofilteR did not run on this dataset); b Chicken genome; c Human lymphocytic leukemia RNA-seq data; d Patient-derived xenograft (PDX) RNA-seq data; e CPU times on the PDX RNA-seq dataset with different tools and different xengsort parameters (see text)
Again, XenofilteR seems to be the Human lymphocytic leukemia tumor RNA‑seq data most conservative tool with about 70%, and xenome clas- We obtained single-end FASTQ files from RNA-seq data sifies about 72% as human and xengsort 74%. The remain - of 5  human T-cell large granular lymphocytic leukemia ing reads are not classified by XenofilteR, while xenome samples, where recurrent alterations of TNFAIP3 were and xengsort both assign about 25% to host (mouse). Fur- observed, and 5 matched controls (13.4 Gbp to 27.5 Gbp). thermore, xenome classifies about 2% and xengsort less The files are available from SRA accession SRP059322 than 1% as ambiguous. (datasets SRX1055051 to SRX1055060). Surprisingly, not So we observe that on all datasets, xengsort is more all fragments were recognized as originating from human decisive than xenome and, judging from the pure human tissue (Fig. 4c). While xenome and xengsort agreed that the and mouse datasets, mostly correct about it. Because this human fraction is close to 75%, XenofilteR assigned fewer is a large dataset, we also applied xengsort’s quick mode reads to human origins (less than 70%). and found essentially no differences in classification For this and the other RNA-seq datasets, we trimmed the results (less than 0.001 percentage points in each class; Illumina adapters using cutadapt [15] prior to classifica - e.g. for graft: quick 74.0111% vs. standard 74.0105% of all tion, as some RNA fragments may be shorter than the read reads; difference 0.0006%; cf. Fig. 4d). length. If this step is omitted, even fewer fragments are classified as human (graft): just below 70% for xenome and Running times xengsort, and only about 53% for XenofilteR. The number of A summary of running times for all datasets appears in fragments classified as neither increases correspondingly. Table 5. We investigated the reads classified by xengsort as nei - ther human nor mouse. Quality control with FastQC [16] Human‑captured mouse exomes revealed nothing of concern, but showed an unusual bio- Our implementation xengsort needs around 70 CPU min- modal per-fragment GC content distribution with peaks utes for each of the five human-captured mouse exomes at 45% and 55%. BLASTing the fragments against the non- (total: 368  min), and less than 15 min of wall clock time redundant nucleotide database [17] yielded no hits at all for using 8 threads. The speed-up being less than 8 results 97% of these fragments. A small number (2%) originated from serial intermediate I/O steps. While xenome makes from the bacteriophage PhiX, which was to be expected, better use of parallelism, it is slower overall, requiring because it is a typical spike-in for Illumina libraries. The 5 to 6 times the CPU work of xengsort. For only scan- remaining 1% of fragments showed random hits over ning already aligned BAM files, XenofilteR is surpris - many species without a distinctive pattern. We therefore ingly slow, and we see that we can sort the reads from concluded that the “neither; ; fragments mainly consisted scratch in almost the same amount of CPU work that of artefacts from library construction, such as ligated and is required to compare (already computed) alignment then sequenced random primers. scores. 
When adding bwa mem alignment times (even without the time required for sorting the resulting BAM files), XenofilteR needs an additional 887 to 1424 CPU minutes for the human alignments and an additional 424 to 777 min for the mouse alignments per dataset, making the alignment-based approach far less efficient than the alignment-free approach.

Human genome (GIAB) matepair library
We observe the same wall clock time ratio (about 3.5) between xenome and xengsort as for the mouse exome dataset. Because this is a very large dataset (112 GB gzipped FASTQ), we additionally evaluated the effects of using xengsort's "quick mode". We observed a significant reduction in processing time (by about 33%) and almost unchanged classification results. We also ran the xengsort classification with the optimized hash table (using an optimized assignment computed with the methods from [11]) and found a small reduction (9%) in running time.

Chicken genome
The BAM file scan of XenofilteR here beats the alignment-free tools (cf. Table 5) because both BAM files are essentially empty, as very few reads align against human or mouse. Also, the speed advantage of xengsort over xenome is smaller on this dataset, mainly because most k-mers are not found in the index and require h = 3 memory lookups and likely cache misses. Such a dataset that contains neither graft nor host reads is adversarial for our design of xengsort. However, the engineering methods introduced in the "Performance engineering" section are effective on such a dataset. The following evaluations are based on one lane (1/3) of the complete chicken dataset because of time constraints.

Fig. 5 Effect of different prefetching levels on the running time of the adversarial chicken genome dataset: no prefetching (0, default), prefetching the second bucket (1), or prefetching both second and third bucket (2); CPU time in seconds is plotted against the hash table load factor. Times are averages over 4 repeated runs

Figure 5 shows the effect of using different amounts of prefetching: none (0, default), prefetching the second choice bucket (1), or the second and third choice buckets (2). At low table loads (0.7), prefetching is not very helpful (level 1) or even detrimental (level 2 compared to level 1) because of the additional overhead. At intermediate load levels (0.85), prefetching helps, but a second bit does not provide an additional advantage. At high table loads (0.95), more aggressive prefetching provides an additional gain in running time. In fact, with prefetching level 2, the running time is almost independent of the load factor.

Fig. 6 Effect of shortcut bits for unsuccessful k-mer searches on the running time of the adversarial chicken genome dataset: no extra bits (0, default), one extra bit (1) or two extra bits (2); see "Performance engineering" section; CPU time in seconds is plotted against the hash table load factor. Times are averages over 6 repeated runs

Figure 6 shows the effect of using 0 (default), 1 or 2 shortcut bits per bucket. Almost independently of the load factor, using one shortcut bit yields a measurable running time reduction by 10%. Using a second bit gives only a small additional advantage (ca. 4%). Unfortunately, the effects of both optimizations are not cumulative. Essentially, an effective use of shortcuts renders prefetching almost useless. On the other datasets, where most k-mer queries are successful, the effects of both optimizations are much less pronounced and even negligible.
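To illustrate how shortcut bits can terminate unsuccessful k-mer searches early, here is a minimal toy sketch in Python. It assumes one plausible scheme (mark a bucket whenever a key that prefers it had to be stored at a later choice); the exact semantics used by xengsort are defined in the "Performance engineering" section, and all names below are hypothetical.

    # Toy sketch of shortcut bits in a 3-choice bucketed hash table (illustrative only).
    NBUCKETS = 1024
    BUCKETSIZE = 4

    buckets = [[] for _ in range(NBUCKETS)]   # up to BUCKETSIZE keys per bucket
    shortcut = [0] * NBUCKETS                 # 1 bit per bucket: a key preferring this
                                              # bucket was stored at a later choice

    def choices(key):
        # three hypothetical hash choices for a key
        return [hash((key, i)) % NBUCKETS for i in range(3)]

    def insert(key):
        c = choices(key)
        for level, b in enumerate(c):
            if len(buckets[b]) < BUCKETSIZE:
                buckets[b].append(key)
                # key could not be stored at an earlier choice: mark those buckets
                for earlier in c[:level]:
                    shortcut[earlier] = 1
                return True
        return False   # a real implementation would start a cuckoo random walk here

    def lookup(key):
        for b in choices(key):
            if key in buckets[b]:
                return True
            # if no key preferring this bucket ever overflowed, the queried key
            # cannot sit at a later choice either, so we can stop early
            if shortcut[b] == 0:
                return False
        return False

    for kmer in ("ACGTA", "TTTTC", "GATTA"):
        insert(kmer)
    assert lookup("ACGTA") and not lookup("CCCCC")

With such a bit, a lookup for an absent k-mer that hits an unmarked first-choice bucket needs only one memory access instead of up to three, which is why the gain is largest on the adversarial chicken dataset, where almost all queries are unsuccessful.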
Human lymphocytic leukemia tumor RNA-seq data
Again, xengsort is more than 3 times faster than xenome and needs time comparable to XenofilteR even when only the time for sorting and scanning the existing BAM files is taken into account (Table 5). Producing the alignments for XenofilteR takes much longer.

Patient-derived xenograft (PDX) RNA-seq samples from human pancreatic tumors
With its 174 samples, this is a particularly large dataset of the type that we optimized xengsort for. Therefore, running time differences between the three methods become particularly apparent. Figure 4e shows that the alignment using bwa-mem and the sorting of the BAM file for XenofilteR took over 284,191 CPU minutes (close to 200 CPU days). After that, XenofilteR required an additional 16,043 CPU minutes (over 11 CPU days) to classify the aligned and sorted reads. In comparison, xenome with 59,691 CPU minutes (41.5 days) took only 20% of the time used by bwa-mem and XenofilteR, and xengsort needed 13,555 CPU minutes (9.5 CPU days) to sort all reads and is therefore even faster than the classification by XenofilteR alone, even excluding the alignment and sorting steps, and over 4 times faster than xenome. Using the "quick mode" with an optimized hash table at 88% load needed only 5713 CPU minutes (less than 4 CPU days), i.e., less than half of the time of a full analysis.

We additionally examined some trade-offs for this dataset. First, we note that only counting proportions without output ("count" operation) is not much faster than sorting the reads into different output files ("sort" operation): 13,285 vs. 13,555 CPU minutes (2% faster). We additionally measured the running time of xengsort's count operation on hash tables with different load factors (88% and 99%), using both the standard assignment by random walk and an optimal assignment [11].
As expected, a load factor of 99% was slower than 88% (by 10.4% on the random walk assignment, but only by 2.6% on the optimized assignment). Using the optimal assignment gives a speed boost (13.3% faster at 88% load; 19.3% at 99% load). The optimized assignment at 99% load yields an even faster running time than the random walk assignment at 88% load, by 11% (11,824 vs. 13,285 CPU minutes).

Discussion and conclusion
We revisited the xenograft sorting problem and improved upon the state of the art in alignment-free methods with our implementation of xengsort.

On typical datasets (PDX RNA-seq), it is at least four times faster and needs less memory than the comparable xenome tool. Our experiments show that xengsort provides accurate classification results, and classifies more reads than xenome, which more often assigns the label "ambiguous". Surprisingly, on PDX datasets, our approach is even faster than scanning already aligned BAM files. This favorable behavior arises because almost every k-mer in every read can be expected to be found in the key-value store, and lookups of present keys are highly optimized.

On adversarial datasets (e.g., a sequenced chicken genome, where almost none of the k-mers can be found in the hash table), xengsort is twice as fast as xenome, but 8 times slower than scanning pre-aligned and pre-sorted BAM files (which are mostly empty). With additional engineering tweaks, such as shortcut bits or software prefetching, our performance on such datasets can be improved (10% speed gain). More refined prefetching strategies, such as k-mer look-ahead, may lead to further improvements, and we will experiment with additional ideas.

Given that producing and sorting the BAM files takes significant additional time, our results show that overall, alignment-free methods require significantly fewer computational resources than alignment-based methods. In view of the current worldwide discussions on climate change and energy efficiency, we advocate that the most resource-efficient available methods should be used for a task, and we propose that xengsort is preferable to existing work in this regard. Even though one could argue that alignments are needed later anyway, we find that this is not always true: First, to analyze PDX samples, typically only the graft reads are further considered and need to be aligned. Second, recent research has shown that more and more application areas can be addressed by alignment-free methods, even structural variation and variant calling [18], so alignments may not be needed at all.

On the methodological side, we developed a general key-value store for DNA/RNA k-mers that allows extremely fast lookups, often only a single random memory access, and that has a low memory footprint thanks to a high load factor and the technique of quotienting. Thus this work might be seen as a blueprint for implementations of other alignment-free methods, such as for gene expression quantification, metagenomics, etc. In principle, one could replace the underlying key-value store of each published k-mer based method by the hashing approach presented here and probably obtain a speed-up of factor 2 to 4, while at the same time saving some space for the hash table. In practice, such an approach may be difficult because the code in question is often deeply nested in the application. However, we would like to suggest that for future implementations, three-way bucketed Cuckoo hash tables with quotienting should be given serious consideration.
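As a minimal illustration of the quotienting idea mentioned above (not xengsort's actual memory layout, which additionally uses three hash choices and packed buckets), a k-mer's integer code can be split into a bucket address and a much smaller stored quotient:

    # Minimal quotienting sketch (illustration only).
    # Because the bucket address already encodes code % NBUCKETS, it suffices to
    # store the quotient code // NBUCKETS inside the bucket; the pair
    # (bucket, quotient) uniquely identifies the k-mer, so lookups stay exact.

    ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
    NBUCKETS = 2 ** 16          # table size must be chosen in advance (static table)

    def kmer_code(kmer):
        # 2 bits per nucleotide -> integer in [0, 4**k)
        code = 0
        for base in kmer:
            code = (code << 2) | ENC[base]
        return code

    def split(code):
        # a real implementation would typically apply an invertible hash function
        # to the code before splitting; we split the plain code for clarity
        return code % NBUCKETS, code // NBUCKETS

    store = [dict() for _ in range(NBUCKETS)]   # per bucket: quotient -> value

    def put(kmer, value):
        bucket, quotient = split(kmer_code(kmer))
        store[bucket][quotient] = value

    def get(kmer):
        bucket, quotient = split(kmer_code(kmer))
        return store[bucket].get(quotient)

    put("ACGTACGTACGTACGTACGTACGTA", 1)         # a 25-mer with some value
    print(get("ACGTACGTACGTACGTACGTACGTA"))     # -> 1

In this toy setting with 2^16 buckets, the stored quotient of a 25-mer needs only 50 - 16 = 34 bits instead of the full 50 bits; with a genome-scale number of buckets, the per-key savings are correspondingly larger.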
More refined prefetching optimizations, because certain parameters that become Z entgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 15 of 16 Table 6 Python source code of xengsort’s classification routine with thresholds, as of v1.0.0 def classify_xengsort (counts): # counts=[neither ,host ,graft ,both,0, weakhost weakgraft, both ] #returns: 0=host ,1=graft ,2=ambiguous ,3=both, 4= neither nkmers = 0 for i in counts : nkmers += i if nkmers == 0: return 2 #nok-mers -> ambiguous nothing = uint32 (0) few = uint32 (6) insubstantial = uint32 ( nkmers // 20) Ag = uint32 (3) Ah = uint32 (3) Mh = uint32( nkmers // 4) Mg = uint32( nkmers // 4) Mb = uint32( nkmers // 5) Mn = uint32 (( nkmers * 3) // 4 + 1) hscore = counts [1]+ counts [5]// 2 gscore = counts [2]+ counts [6]// 2 #no host if counts [1]+ counts [5]== nothing : #nohost if gscore >= Ag: return 1 #graft if counts [3]+ counts [7]>= Mb: #both return 3 # both if counts [0]>= Mn: #neither return 4 #neither #host, but no graft elif counts [2]+ counts [6]== nothing : #no graft if hscore >= Ah: return 0 # host if counts [3]+ counts [7]>= Mb: #both return 3 # both if counts [0]>= Mn: #neither return 4 #neither # some real graft ,few weak host ,noreal host : if counts [2]>= few and counts [5]<= few and counts [1]== nothing : return 1 #graft # some real host ,few weak graft, no real graft: if counts [1]>= few and counts [6]<= few and counts [2]== nothing : return 0 #host #substantial graft ,insubstantialrealhost, #a little weak host comparedtograft : if ( counts [2]+ counts [6]>= Mg and counts [1]<= insubstantial and counts [5]< gscore): return 1 #graft #substantial host, insubstantial real graft, #a little weak graftcomparedtohost : if ( counts [1]+ counts [5]>= Mh and counts [2]<= insubstantial and counts [6]< hscore): return 0 #host #substantial both, insubstantial host and graft: if ( counts [3]+ counts [7]>= Mb and gscore <= insubstantial and hscore <= insubstantial): return 3 #both #substantial neither: if counts [0]>= Mn: return 4 #neither #no specific rule applies : return 2 #ambiguous only known at run time, such as random parameters for While we have indications that classification results the hash functions, can be compiled as constants into agree well overall among all methods and variants, we the code. These optimizations yield savings that exceed concur with a recent study [1] that there exist sub- the initial compilation effort. tle differences, whose effects can propagate through Zentgraf and Rahmann Algorithms Mol Biol (2021) 16:2 Page 16 of 16 References computational pipelines and influence, for example, 1. Jo SY, Kim E, Kim S. Impact of mouse contamination in genomic profiling variant calling results downstream, and we believe that of patient-derived models and best practice for robust analysis. Genome further evaluation studies are necessary. In contrast Biol. 2019;20(1):231. 2. Kluin RJC, Kemper K, Kuilman T, de Ruiter JR, Iyer V, Forment JV, Cornelis- to their study, we however suggest that a best practice sen-Steijger P, de Rink I, Ter Brugge P, Song JY, Klarenbeek S, McDermott workflow for PDX analysis should start (after quality U, Jonkers J, Velds A, Adams DJ, Peeper DS, Krijgsman O. XenofilteR: com- control and adapter trimming on RNA-seq data) with putational deconvolution of mouse and human reads in tumor xenograft sequence data. BMC Bioinform. 2018;19(1):366. alignment-free xenograft sorting, followed by aligning 3. Giner G. XenoSplit. Unpublished; 2019. 
Acknowledgements
We thank Uriel Elias Wiebelitz for preliminary experiments on the effectiveness of prefetching and Elias Kuthe for helping with the prefetching implementation.

Authors' contributions
SR provided the initial concept. JZ designed and implemented the software and performed the evaluations. Both authors wrote and edited the manuscript. Both authors read and approved the final manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL. S.R. is grateful for funding from DFG SFB 876 subproject C1, and Mercator Research Center Ruhr (MERCUR), project Pe-2013-0012.

Availability of data and materials
All datasets mentioned in this article are publicly available from third parties; their accession numbers are given in the respective paragraph.

Declarations

Competing interests
The authors declare that they have no competing interests.

Author details
Bioinformatics, Computer Science XI, TU Dortmund University, Dortmund, Germany. Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 2 February 2021   Accepted: 24 March 2021

References
1. Jo SY, Kim E, Kim S. Impact of mouse contamination in genomic profiling of patient-derived models and best practice for robust analysis. Genome Biol. 2019;20(1):231.
2. Kluin RJC, Kemper K, Kuilman T, de Ruiter JR, Iyer V, Forment JV, Cornelissen-Steijger P, de Rink I, Ter Brugge P, Song JY, Klarenbeek S, McDermott U, Jonkers J, Velds A, Adams DJ, Peeper DS, Krijgsman O. XenofilteR: computational deconvolution of mouse and human reads in tumor xenograft sequence data. BMC Bioinform. 2018;19(1):366.
3. Giner G. XenoSplit. Unpublished; 2019. Source code available at https://github.com/goknurginer/XenoSplit.
4. Khandelwal G, Girotti MR, Smowton C, Taylor S, Wirth C, Dynowski M, Frese KK, Brady G, Dive C, Marais R, Miller C. Next-generation sequencing analysis and algorithms for PDX and CDX models. Mol Cancer Res. 2017;15(8):1012–6.
5. Ahdesmäki MJ, Gray SR, Johnson JH, Lai Z. Disambiguate: an open-source application for disambiguating two species in next generation sequencing data from grafted samples. F1000Res. 2016;5:2741.
6. Bushnell B. BBsplit, Joint Genome Institute, Walnut Creek, CA. Part of BBTools; 2014–2020. https://jgi.doe.gov/data-and-tools/bbtools/.
7. Conway T, Wazny J, Bromage A, Tymms M, Sooraj D, Williams ED, Beresford-Smith B. Xenome—a tool for classifying reads from xenograft samples. Bioinformatics. 2012;28(12):172–8.
8. Callari M, Batra AS, Batra RN, Sammut SJ, Greenwood W, Clifford H, Hercus C, Chin SF, Bruna A, Rueda OM, Caldas C. Computational approach to discriminate human and mouse sequences in patient-derived tumour xenografts. BMC Genomics. 2018;19(1):19.
9. Dai W, Liu J, Li Q, Liu W, Li YX, Li YY. A comparison of next-generation sequencing analysis methods for cancer xenograft samples. J Genet Genomics. 2018;45(7):345–50.
10. Walzer S. Load thresholds for Cuckoo hashing with overlapping blocks. In: Chatzigiannakis I, Kaklamanis C, Marx D, Sannella D, editors. 45th international colloquium on automata, languages, and programming, ICALP 2018. LIPIcs, vol. 107; 2018. p. 102:1–102:10. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Wadern, Germany. https://doi.org/10.4230/LIPIcs.ICALP.2018.102.
11. Zentgraf J, Timm H, Rahmann S. Cost-optimal assignment of elements in genome-scale multi-way bucketed Cuckoo hash tables. In: Proceedings of the symposium on algorithm engineering and experiments (ALENEX) 2020; 2020. p. 186–98. SIAM, Philadelphia, PA, USA. https://doi.org/10.1137/1.9781611976007.15.
12. Espinosa A. Cuckoo breeding ground—a better cuckoo hash table; 2018. https://cbg.netlify.app/publication/research_cuckoo_cbg/.
13. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525–7. Erratum in: Nat Biotechnol. 2016;34(8):888.
14. Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12(4):656–64.
15. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011;17(1):10–2. https://doi.org/10.14806/ej.17.1.200.
16. Andrews S. FastQC: a quality control tool for high throughput sequence data. Babraham Bioinformatics; 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
17. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinform. 2009;10:421.
18. Standage DS, Brown CT, Hormozdiari F. Kevlar: a mapping-free framework for accurate discovery of de novo variants. iScience. 2019;18:28–36.
19. Lam SK, Pitrou A, Seibert S. Numba: a LLVM-based Python JIT compiler. In: Finkel H, editor. Proceedings of the second workshop on the LLVM compiler infrastructure in HPC, LLVM 2015; 2015. p. 7:1–7:6. New York: ACM. https://doi.org/10.1145/2833157.2833162.
