Ultrafast clustering of single-cell flow cytometry data using FlowGrid

Xiaoxin Ye; Joshua W. K. Ho

doi:10.1186/s12918-019-0690-2

Ultrafast clustering of single-cell flow cytometry data using FlowGrid

Ye, Xiaoxin; Ho, Joshua W. K. 2019-04-05 00:00:00 Background: Flow cytometry is a popular technology for quantitative single-cell profiling of cell surface markers. It enables expression measurement of tens of cell surface protein markers in millions of single cells. It is a powerful tool for discovering cell sub-populations and quantifying cell population heterogeneity. Traditionally, scientists use manual gating to identify cell types, but the process is subjective and is not effective for large multidimensional data. Many clustering algorithms have been developed to analyse these data but most of them are not scalable to very large data sets with more than ten million cells. Results: Here, we present a new clustering algorithm that combines the advantages of density-based clustering algorithm DBSCAN with the scalability of grid-based clustering. This new clustering algorithm is implemented in python as an open source package, FlowGrid. FlowGrid is memory efficient and scales linearly with respect to the number of cells. We have evaluated the performance of FlowGrid against other state-of-the-art clustering programs and found that FlowGrid produces similar clustering results but with substantially less time. For example, FlowGrid is able to complete a clustering task on a data set of 23.6 million cells in less than 12 seconds, while other algorithms take more than 500 seconds or get into error. Conclusions: FlowGrid is an ultrafast clustering algorithm for large single-cell flow cytometry data. The source code is available at https://github.com/VCCRI/FlowGrid. Keywords: Clustering, Flow cytometry, Single cell, DBSCAN Background terms of the order in which pairs of protein markers are Recent technological advancement has made it possible explored, and the inherent uncertainty of manually draw- to quantitatively measure the expression of a handful of ing the cluster boundaries [2]. An emerging solution is protein markers in millions of cells in a flow cytome- to use unsupervised clustering algorithms to automati- try experiment [1]. The ability to profile such a large cally identify clusters in potentially multidimensional flow number of cells allows us to gain insights into cellular cytometry data. heterogeneity at an unprecedented resolution. Tradition- The Flow Cytometry Critical Assessment of Population ally, cell types are identified based on manual gating of Identification Methods (Flow-CAP) challenge has com- several markers in flow cytometry data. Manual gating pared the performance of many flow cytometry clustering relies on visual inspection of a series of two dimensional algorithms [3]. In the challenge, ADIcyt has the highest scatter plots, which makes it difficult to discover struc- accuracy but has a long runtime, which makes it impracti- ture in high dimensions. It also suffers subjectivity, in cal for routine usage. Flock [4] maintains a high accuracy and reasonable runtime. After the challenge, several algo- rithms have been built for flow cytometry data analysis *Correspondence: jwkho@hku.hk Victor Chang Cardiac Research Institute, Sydney, Australia such as FlowPeaks [5], FlowSOM [6] and BayesFlow [7]. University of New South Wales, Sydney, Australia Full list of author information is available at the end of the article © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 2 of 8 FlowPeaks and Flock are largely based on k-means clus- detail of the algorithm is presented in the Methods tering. k-means clustering requires the number of clus- section. ters (k) to be defined prior to the analysis. It is hard to determine a suitable k in practice. FlowPeaks performs Methods k-means clustering with a large initial k, and iteratively The key idea of our algorithm is to replace the calcula- merges nearby clusters that are not separated by low den- tion of density from individual points to discrete bins as sity regions into one cluster. Flock utilises grids to iden- defined by a uniform grid. This way, the clustering step tify high density regions, which the algorithm then uses of the algorithm will scale with the number of non-empty to identify initial cluster centres for k-means clustering. bins, which is significantly smaller than the number of This grid-based method of identifying high density region points in lower dimensional data sets. Therefore the over- allows k-means clustering to converge much quicker com- all time complexity of our algorithm is dominated by the pared to using random initialisation of cluster centres, and binning step, which is in the order of O(N ).Thisissignifi- also directly identifies a suitable value for k. FlowSOM cantly better than the time complexity of DBSCAN, which starts with training Self-Organising Map (SOM), followed is in the order of O(NlogN ). The definition and algorithm by consensus hierarchical clustering of the cells for meta- are presented in the following subsections. clustering. In the algorithm, the number of clusters (k)is required for meta-clustering. Definition BayesFlow uses a Bayesian hierarchical model to iden- Thekey termsinvolvedinthe algorithmare definedinthis tify different cell populations in one or many samples. subsection. A graphical example can be found in Fig. 1. The key benefit of this method is its ability to incorporate N is the number of equally sized bins in each bin prior knowledge, and captures the variability in shapes dimension. In theory, there are (N ) bins in the bin and locations of populations between the samples [7]. data space, where d is the number of dimensions. However, BayesFlow tends to be computational expen- However, in practice, we only consider the sive as Markov Chain Monte Carlo sampling requires a non-empty bins. The number of non-empty bins (N) large number of iterations. Therefore, BayesFlow is often is less than (N ) , especially for high dimensional bin impractical for flow cytometry data sets of realistic size. data. Each non-empty bin is assigned an integer index These algorithms perform well on the Flow-CAP data i = 1 ... N. sets, but they may not be scalable to larger data sets that Bin is labelled by a tuple with d positive integers we are dealing with nowadays – those with tens of millions C = (C , C , C , ... , C ) where C is the i i1 i2 i3 id i1 of cells. Aiming to quantify cell population heterogene- coordinate (the bin index) at dimension 1. For ity in huge data sets, we have to develop an ultrafast and scalable clustering algorithm. In this paper, we present a new clustering algorithm that combines the benefit of DBSCAN [8](awidely- based density-based clustering algorithm) and a grid- based approach to achieve scalability. DBSCAN is fast and can detect clusters with complex shapes in the pres- ence of outliers [8]. DBSCAN starts with identifying core points that have a large number of neighbours within a user-defined region. Once the core points are found, nearby core points and closely located non-core points are grouped together to form clusters. This algorithm will identify clusters that are defined as high-density regions that are separated by the low-density regions. However, Fig. 1 An illustrative example of the FlowGrid clustering algorithm. In this DBSCAN is memory inefficient if the data set is very large, example, Bin 1, Bin 2, Bin 3 and Bin 6 are core bins as their Den are larger than or has large highly connected components. MinDen (5 in this example), their Den are larger than MinDen (20 in thi b c c s example), and their Den are larger than ρ% (75% in this example) of To reduce the computational search space and mem- b √ √ 2 2 its directly connected bins. Dist(C , C ) = 1 + 1 = 2 ≤ 1 2 ory requirement, our algorithm extends the idea of ( = 2 in this example), so Bin 1 and Bin 2 are directly connected. √ √ DBSCAN by using equal-spaced grids like Flock. We 2 2 Dist(C , C ) = 1 + 1 = 2 ≤ ,soBin 2and Bin4aredirectly 2 4 implemented our algorithm in an open source python connected. Therefore, Bin 1, Bin 2 and Bin 4 are mutually connected, package called FlowGrid. Using a range of real data and they are assigned into the same cluster. Bin 5 is not a core bin but isaborderbin,asitisdirectlyconnected to Bin6,which isacorebin. sets, we demonstrate that FlowGrid is much faster Bin 3 is a outlier bin, as it is not a core bin nor a border bin. In practice, than other state-of-the-art flow cytometry clustering MinDen is set to be 3, MinDen is set to 40 and ρ is set to be 85 b c algorithms, and produce similar clustering results. The Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 3 of 8 example, if Bin has coordinate C = (2, 3, 5), this bin i i Algorithm 1: FlowGrid is located in second bin in dimension 1, third bin in input : X, N , , ρ, MinDen , MinDen bin b c dimension 2 and the fifth bin in dimension 3. output:DataLabel The distance between Bin and Bin is defined as i j 1 Normalise the data X ranging from 1 to (N + 1) bin 2 Assign data into corresponding bins based on the integer of normalised value Dist(C , C ) = C − C (1) i j ik jk 3 Identify S as the set of non-empty bins bin k=1 4 Search the core bins and their directly connected bins Bin and Bin are defined to be directly connected if i j by SearchCore Dist(C , C ) , where is a user-specified i j 5 Group connected bins into a cluster by Breadth First parameter. Search(BFS) Den (C ) is the density of Bin , which is defined as b i i 6 Label cells by the label of their corresponding bins the number of points in Bin . Den (C ) is the collective density of Bin ,calculatedby c i i Den (C ) = Den (C ) c i b j {j|Bin andBin are directly connected} j i Algorithm 2: SearchCore (2) input : S , , ρ, MinDen , MinDen bin b c Bin is a core bin if output: S , L i core Initial an empty adjacency list L. 1 Den (C ) is larger than MinDen , a user-specified b i b S ={} core parameter. forall the Bin in S do bin 2 Den (C ) is larger than ρ%ofits directly b i if Den (i)> MinDen then b b connected bins, where ρ is a user-specified nnBin=radiusNeighbors(S , Bin , ) bin i parameter. nnCount counting the number of points for 3 Den (C ) is larger than MinDen , a user-specified c i c each bin in nnBin parameter. if Den (i) is greater than ρ% of nnCount then Den (i)= the sum of nnCount Bin is a border bin if it is not a core bin but it is if Den (i)> MinDen then c c directly connected to a core bin. S = S ∪{i} core core Bin is an outlier bin, if it is not a core bin nor a mapping bin with nnBin in L border bin. end Bin and Bin are in the same cluster, if they satisfy a b end one of the following conditions: end end 1 they are directly connected and at least one of The input of radiusNeighbors is all non-empty bins, them is core bin; the query bin and the maximum query distance . 2 they are not directly connected but are connected The output is the bins whose distance with the query by a sequence of directly connected core bins. bin are less than (including the query bin). Two points are in the same cluster, if they belong to the same bin or their corresponding bins belong to the same cluster. Algorithm Evaluation Algorithm 1 describes the key steps of FlowGrid, starting Procedure with normalising the values in each dimension to range FlowGrid aims to be an ultrafast and accurate clustering between 1 and (N + 1). Then, we use the integer part algorithm for very large flow cytometry data. Therefore, bin of the normalised value as the coordinate of its corre- both the accuracy and scalability performance need to sponding bin. Then, the SearchCore algorithm is applied be evaluated. The benchmark data sets from Flow-CAP to discover the core bins and their directly connected bins. [3], the multi-centre CyTOF data from Li et al. [9]and Once the core bins and connections are found, Breadth the SeaFlow project [10]are selected to comparethe First Search(BFS) is used to group the connected bins performance of FlowGrid against other state-of-the-art into a cluster. The cells are labelled by the label of their algorithms, FlowSOM, FlowPeaks, and FLOCK. These corresponding bins. three algorithms are chosen because they are widely used, Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 4 of 8 the pre-precessing step, we apply the inverse hyperbolic Algorithm 3: Breadth First Search(BFS) function with the factor of 5 to transform the multi-centre input : S , S , adjacency list L core bin data and the SeaFlow data. As the Flow-CAP and multi- output:Bin Label centre CyTOF data contain many samples and we treat Label every bin as -1 each sample as a data set, we run all algorithms on each Index=1 sample. The performances are measured by the ARI and for Bin in S do i core runtime, which are reported by the arithmetic means (x ¯) if the laebl of Bin is -1 then and standard deviation (sd). For the Seaflow data sets, we Queue={} treateachconcatenateddataset as adataset.Inthe eval- Label Bin as Index uation, all algorithms are applied on these concatenated Queue.push(Bin ) data sets. while Queue is not empty do To evaluate the scalability of each algorithm, we down- Bin = Queue.pop() sample the largest concatenated data set from the SeaFlow forall the directed connected Bin of Bin 2 1 project, generating 10 sub-sampled data sets in which the do numbers of cells range from 20 thousand to 20 million. if the laebl of Bin is -1 then Label Bin as Index Performance measure if Bin is core bin then The efficiency performance is measured by the runtime Queue.push(Bin ) while the clustering performance is measured by Adjusted end Rand Index (ARI). ARI is used to measure the cluster- end ing performance. ARI is the corrected-for-chance version end of the Rand index [11]. Although it may result in nega- end tive values if the index is less than expected, it tends to index=index +1 be more robust than many other measures like F-measure end and Rand index. end ARIiscalculatedasfollow. Givenaset S of n ele- ments, and two groups of cluster labels (one group of ground truth label and one group of predicted labels) of these elements, namely X ={X , X , ... , X } and Y = 1 2 r are generally considered to be quite fast, and have good {Y , Y , ... , Y }, the overlap between X and Y can be sum- 1 2 s accuracy. marized by n where n denotes the number of objects in ij ij Three benchmark data sets from Flow-CAP [3]are common between X and Y : n =|X ∩ Y |. i j ij i j selected for evaluation, including the Diffuse Large B-cell Lymphoma (DLBL), Hematopoietic Stem Cell Transplant (HSCT), and Graft versus Host Disease(GvHD) data set. n − a b / n ij i j ij i j 2 2 Each data set contains 10-30 samples with 3-4 markers, ARI= and each sample includes 2,000-35,000 cells. a + b − a b / n i j i j 2 i j i j The multi-centre CyTOF data set from Li et al. [9]pro- 2 2 2 2 vides a labelled data set with 16 samples. Each samples where a = n and b = n contains 40,000-70,000 cells and 26 markers. Since only 8 i ij j ij j i out fo 26 markers are determined to be relevant markers Experimentation in the original paper [9], only these 8 markers are used for FlowGrid is publicly available as an open source program clustering. on GitHub. FlowSOM and FlowPeaks are available as R We also use three data sets from the SeaFlow project packages from Bioconductor. The source code of Flock [10] and they contain many samples. Instead of analysing is downloaded from its Sourceforge repository. To repro- the independent samples, we analyse the concatenated duce all the comparisons presented in this paper, the data sets as the original paper [10] and these concate- source code and data can be downloaded from the GitHub nated data sets contain 12.7, 22.7 and 23.6 millions repository FlowGrid_compare. We run all the experiments of cells respectively. Each data sets include 15 features on six2.60GHz coresCPU with 32 GRAM. but the original study only uses four features for clus- tering analysis. The four features are forward scatter FlowPeaks and Flock provide automated version with- (small and perpendicular), phycoerythrin, and chlorophyll out any user-input parameter. FlowSOM requires one (small) [10]. user-supplied parameter (k, the number of clusters In the evaluation, we treat the manual gating label as the in meta-clustering step). FlowGrid requires two user- gold standard for measuring the quality of clustering. In supplied parameters (bin and ). To optimise the result, n Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 5 of 8 Table 1 Comparison of runtime (in seconds) of FlowGrid against other clustering algorithms Time in second (x ± sd) Data set Samples Markers Cells FlowGrid FlowSOM FlowPeaks Flock Multi-center 16 8 29-77 ×10 0.23± 0.09 4.01± 1.08 2.27± 0.61 10.3± 3.45 Flow-CAP-GvHD 12 4 12-33×10 0.07± 0.04 2.16± 0.54 0.28± 0.16 0.58± 0.28 Flow-CAP-DLBL 30 3 2-25×10 0.04± 0.01 1.25± 0.32 0.10± 0.09 0.22± 0.16 Flow-CAP-HSCT 30 4 6-9×10 0.04± 0.02 1.35± 0.28 0.11± 0.02 0.28± 0.06 Seaflow0 - 4 23.6×10 11.51 572.65 NA 6628.30 Seaflow1 - 4 12.7×10 3.09 312.95 258.13 NA Seaflow11 - 4 22.7×10 6.37 544.79 NA NA NA represents that the algorithm got error in the data set we try many k for FlowSOM and many combinations of Flock generates too many clusters in this case. It is impor- bin and for our algorithm. tant to note that FlowGrid also identifies cells that do not belong to a main cluster (i.e., a high density region). These Results cells can be viewed as ’outliers’, and are labelled as ’-1’ in Fig. 2. Performance comparison This is a feature that is not present in other clustering Table 1 summaries the performance of our algorithm and algorithms. three other algorithms – FlowSOM, FlowPeaks, and Flock in terms of runtime. Our algorithm is substantially faster Scalability analysis than other clustering algorithms in all the data sets. This To further evaluate the scalability of the algorithms, we improvement of runtime is especially substantial in the sub-sample one Seaflow data set and the sampled data Seaflow data sets. FLOCK and FlowPeaks sometimes fail sets range from 20 thousand to 20 million cells. Figure 3 to complete in some of the data sets. In a data set of 23.6 shows the scalability of our algorithm and three other million cells, FlowSOM completes the execution in 572 s, algorithms. Flock has a low runtime when processing a whereas FlowGrid completes the execution in only 12 s. small data set, but its runtime dramatically increases to This is a 50× speed up. Table 2 summaries the clustering 6640 s for a 20 million-cell data set. FlowPeaks and Flow- accuracy performance. In Flow-CAP and the multi-centre SOM share similar scalability but FlowPeaks is not able to data sets, FlowGrid shares the similar clustering accuracy execute 20 million data set. Our algorithm have the best (in terms of ARI) with other clustering algorithms but in performance in the evaluation as FlowGrid is faster than Seaflow data sets, FlowGrid gives higher accuracy than other algorithm in all the sampled data by an order of other clustering algorithms. magnitude. Figure 2 shows that the clustering results of our algo- rithm and three other algorithms in a HSCT sample. Flow- Parameter robustness analysis Grid, FlowSOM and FlowPeaks recover similar number Like other density-based clustering algorithm, parameter of clusters, and the clustering results are largely similar. setting is important. In our experience, Bin and are data-set-dependent. We recommend trying out different Table 2 Comparison of accuracy (in ARI) of FlowGrid against combinations of Bin between 4 and 15, and between other clustering algorithms 1 and 5. To pick the best parameter combinations, some ARI (x ¯ ± sd) Data set prior knowledge is helpful such as the expected number FlowGrid FlowSOM FlowPeaks Flock of clusters and the proportion of outliers which should be Multi-center 0.66± 0.20 0.75± 0.17 0.68± 0.20 0.66± 0.16 less than 10% in our experience. Flow-CAP-GvHD 0.79± 0.15 0.85± 0.11 0.72± 0.16 0.47± 0.20 We found that other parameters, namely MinDen , MinDen and ρ are mostly robust across a wide range of Flow-CAP-DLBL 0.85± 0.10 0.84± 0.10 0.82± 0.15 0.84± 0.09 values. Flow-CAP-HSCT 0.90± 0.08 0.87± 0.14 0.83± 0.24 0.57± 0.27 To demonstrate this robustness, we used the bench- Seaflow0 0.94 0.81 NA 0.27 mark data sets from Flow-CAP for a parameter sensitivity Seaflow1 0.59 0.54 0.34 NA analysis. For these experiments, we first set 3, 40, 85, 4 Seaflow11 0.77 0.33 NA NA and 1 as the default value for MinDen , MinDen , ρ, Bin b c n and , respectively. In each experiment, we only change NA represents that the algorithm got error in the data set Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 6 of 8 Fig. 2 Visual comparison of the clustering performance of FlowGrid, FlowPeaks, FlowSOM, and Flock using manual gating (top row) as the gold standard one parameter to test its sensitivity to the overall clas- to be 3, 40 and 85 respectively, FlowGrid maintains sification result. The performance is measured by ARI good clustering performance and excellent runtime. and runtime. In the first experiment, we varied MinDen They are therefore set as the default parameters for from 1to50whilefixingother parameters.Inthe sec- FlowGrid. ond experiment, we varied MinDen ranging from 10 to 300 while fixing other parameters. In the third experi- Discussion ment, we varied ρ ranging from 70 to 95 while fixing other In this paper, we have developed an ultrafast cluster- parameters. ing algorithm, FlowGrid, for single-cell flow cytome- Figure 4 demonstrates that the clustering accu- try analysis, and compared it against other state-of-the- racy and runtime are largely insensitive to MinDen , art algorithms such as Flock, FlowSOM and FlowPeaks. MinDen and ρ across a large range of parameter val- FlowGrid borrows ideas from DBSCAN for detection ues. The experiments are applied to all the bench- of high density regions and outliers. It does not only mark data sets from Flow-CAP and similar results perform well in the presence of outliers, but also have are observed in all the benchmark data sets. In our great scalability without getting into memory issues. It experiments, when MinDen , MinDen and ρ are set is both time efficient and memory efficient. FlowGrid b c Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 7 of 8 Fig. 3 Comparison of the runtime of FlowGrid, FlowPeaks, FlowSOM, and Flock using data sets with different number of cells shares similar clustering accuracy with state-of-the-art will also fractionally decreases but it may lead to sep- flow cytometry clustering algorithms, but it is substan- aration of real clusters and create spurious outliers. In tially faster than them. With any given number of markers, any case, we showed that the performance of FlowGrid the runtime of FlowGrid scales linearly with the num- is generally robust against changes in MinDen , MinDen ber of cells, which is a useful property for extremely large and ρ. data sets. The current implementation of FlowGrid is already MinDen and MinDen are density threshold parame- very fast for most practical purposes. In the future, if ters to reduce the search space of high density bins. If the the data size grows even larger, it is possible to further parameters are set very low, the runtime may fraction- speed up FlowGrid by parallelising the binning step of the ally increase but the accuracy is not likely to be affected. algorithm, which is currently the most computationally However, if the parameters are set very high, the runtime intensive step of the algorithm. Fig. 4 Sensitivity analysis of three different parameters on clustering accuracy (as measured by adjusted rand index; ARI) and runtime (seconds) Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 8 of 8 Abbreviations response using a density-based method for the automated identification ARI: Adjusted rand index; BFS: Breadth first search; CyTOF: Mass cytometry; of cell populations in multidimensional flow cytometry data. Cytom Part DBSCAN: Density-based spatial clustering of applications with noise; B Clin Cytom. 2010;78(S1):69–82. Flow-CAP: Flow cytometry critical assessment of population identification 5. Ge Y, Sealfon SC. flowPeaks: a fast unsupervised clustering for flow methods; SOM: Self-organising map cytometry data via k-means and density peak finding. Bioinformatics. 2012;28(15):2052–8. Acknowledgments 6. Van Gassen S, Callebaut B, Van Helden MJ, Lambrecht BN, Demeester P, We thank members of the Ho Laboratory for their valuable comments. Dhaene T, Saeys Y. FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. Cytom Part A. 2015;87(7):636–45. Funding 7. Johnsson K, Wallin J, Fontes M. BayesFlow: latent modeling of flow This work was supported in part by funds from the New South Wales Ministry cytometry cell populations. BMC Bioinformatics. 2016;17(1):25. of Health, a National Health and Medical Research Council Career 8. Ester M, Kriegel H-P, Sander J, Xu X, et al. A density-based algorithm for Development Fellowship (1105271 to JWKH), and a National Heart Foundation discovering clusters in large spatial databases with noise. In: Proceedings Future Leader Fellowship (100848 to JWKH). Publication charge is supported of the Second International Conference on Knowledge Discovery and by the Victor Chang Cardiac Research Institute. Data Mining (KDD’96). Portland: AAAI Press; 1996. p. 226–231. 9. Li H, Shaham U, Stanton KP, Yao Y, Montgomery RR, Kluger Y. Gating Availability of data and materials mass cytometry data by deep learning. Bioinformatics. 2017;33(21): Project Name: FlowGrid 3423–30. Project Home Page: https://github.com/VCCRI/FlowGrid 10. Hyrkas J, Clayton S, Ribalet F, Halperin D, Virginia Armbrust E, Howe B. Operating Systems: Unix, Mac, Windows Scalable clustering algorithms for continuous environmental flow Programming Languages: Python cytometry. Bioinformatics. 2015;32(3):417–23. Other Requirements: sklearn, numpy 11. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218. License: MIT Public License Any Restrictions to Use By Non-Academics: None About this supplement This article has been published as part of BMC Systems Biology Volume 13 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/ supplements/volume-13-supplement-2. Authors’ contributions XY and JWKH initiated and designed the project. XY implemented the algorithm, carried out all the experiments, and wrote the paper. JWKH revised the paper. Both authors approved the final version of the paper. Ethics approval and consent to participate Not applicable. Consent for publication Not applicable. Competing interests The authors declare that they have no competing interests. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Author details 1 2 Victor Chang Cardiac Research Institute, Sydney, Australia. University of New South Wales, Sydney, Australia. School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong. Published: 5 April 2019 References 1. Weber LM, Robinson MD. Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytom Part A. 2016;89(12):1084–96. 2. Saeys Y, Van Gassen S, Lambrecht BN. Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat Rev Immunol. 2016;16(7):449. 3. Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH, Consortium F, Consortium D, et al. Critical assessment of automated flow cytometry data analysis techniques. Nat Methods. 2013;10(3):228. 4. Qian Y, Wei C, Eun-Hyung Lee F, Campbell J, Halliley J, Lee JA, Cai J, Kong YM, Sadat E, Thomson E, et al. Elucidation of seventeen human peripheral blood B-cell subsets and quantification of the tetanus http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Systems Biology Springer Journals http://www.deepdyve.com/lp/springer-journals/ultrafast-clustering-of-single-cell-flow-cytometry-data-using-flowgrid-FOGmqYmCUz

Loading next page...

References (13)

K. Johnsson, J. Wallin, Magnus Fontes (2016)
BayesFlow: latent modeling of flow cytometry cell populations
BMC Bioinformatics, 17
S. Gassen, Britt Callebaut, M. Helden, B. Lambrecht, P. Demeester, T. Dhaene, Y. Saeys (2015)
FlowSOM: Using self‐organizing maps for visualization and interpretation of cytometry data
Cytometry Part A, 87
L. Weber, Mark Robinson (2016)
Comparison of Clustering Methods for High-Dimensional Single-Cell Flow and Mass Cytometry Data
bioRxiv
Yongchao Ge, S. Sealfon (2012)
flowPeaks: a fast unsupervised clustering for flow cytometry data via K-means and density peak finding
Bioinformatics, 28 15
M. Ester, H. Kriegel, J. Sander, Xiaowei Xu (1996)
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
L. Hubert, P. Arabie (1985)
Comparing partitions
Journal of Classification, 2
Y. Qian, Chungwen Wei, F. Lee, John Campbell, Jessica Halliley, Jamie Lee, Jennifer Cai, Y. Kong, E. Sadat, Elizabeth Thomson, P. Dunn, Adam Seegmiller, N. Karandikar, C. Tipton, T. Mosmann, I. Sanz, R. Scheuermann (2010)
Elucidation of seventeen human peripheral blood B‐cell subsets and quantification of the tetanus response using a density‐based method for the automated identification of cell populations in multidimensional flow cytometry data
Cytometry Part B: Clinical Cytometry, 78B
M Ester, H-P Kriegel, J Sander, X Xu (1996)
Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96).
Jeremy Hyrkas, S. Clayton, F. Ribalet, D. Halperin, E. Armbrust, Bill Howe (2016)
Scalable clustering algorithms for continuous environmental flow cytometry
Bioinformatics, 32 3
M. Cugmas, A. Ferligoj (2015)
On comparing partitions
International Federation of Classification Societies
N. Aghaeepour, Greg Finak, H. Hoos, T. Mosmann, R. Brinkman, R. Gottardo, R. Scheuermann, Michael Biehl (2013)
Critical assessment of automated flow cytometry data analysis techniques
Nature Methods, 10
Huamin Li, Uri Shaham, Kelly Stanton, Yi Yao, Ruth Montgomery, Y. Kluger (2016)
Gating mass cytometry data by deep learning
bioRxiv
Y. Saeys, S. Gassen, B. Lambrecht (2016)
Computational flow cytometry: helping to make sense of high-dimensional immunology data
Nature Reviews Immunology, 16

Publisher: Springer Journals
Copyright: Copyright © 2019 by The Author(s)
Subject: Life Sciences; Bioinformatics; Systems Biology; Simulation and Modeling ; Computational Biology/Bioinformatics; Physiological, Cellular and Medical Topics; Algorithms
eISSN: 1752-0509
DOI: 10.1186/s12918-019-0690-2
Publisher site: See Article on Publisher Site

Abstract

Background: Flow cytometry is a popular technology for quantitative single-cell profiling of cell surface markers. It enables expression measurement of tens of cell surface protein markers in millions of single cells. It is a powerful tool for discovering cell sub-populations and quantifying cell population heterogeneity. Traditionally, scientists use manual gating to identify cell types, but the process is subjective and is not effective for large multidimensional data. Many clustering algorithms have been developed to analyse these data but most of them are not scalable to very large data sets with more than ten million cells. Results: Here, we present a new clustering algorithm that combines the advantages of density-based clustering algorithm DBSCAN with the scalability of grid-based clustering. This new clustering algorithm is implemented in python as an open source package, FlowGrid. FlowGrid is memory efficient and scales linearly with respect to the number of cells. We have evaluated the performance of FlowGrid against other state-of-the-art clustering programs and found that FlowGrid produces similar clustering results but with substantially less time. For example, FlowGrid is able to complete a clustering task on a data set of 23.6 million cells in less than 12 seconds, while other algorithms take more than 500 seconds or get into error. Conclusions: FlowGrid is an ultrafast clustering algorithm for large single-cell flow cytometry data. The source code is available at https://github.com/VCCRI/FlowGrid. Keywords: Clustering, Flow cytometry, Single cell, DBSCAN Background terms of the order in which pairs of protein markers are Recent technological advancement has made it possible explored, and the inherent uncertainty of manually draw- to quantitatively measure the expression of a handful of ing the cluster boundaries [2]. An emerging solution is protein markers in millions of cells in a flow cytome- to use unsupervised clustering algorithms to automati- try experiment [1]. The ability to profile such a large cally identify clusters in potentially multidimensional flow number of cells allows us to gain insights into cellular cytometry data. heterogeneity at an unprecedented resolution. Tradition- The Flow Cytometry Critical Assessment of Population ally, cell types are identified based on manual gating of Identification Methods (Flow-CAP) challenge has com- several markers in flow cytometry data. Manual gating pared the performance of many flow cytometry clustering relies on visual inspection of a series of two dimensional algorithms [3]. In the challenge, ADIcyt has the highest scatter plots, which makes it difficult to discover struc- accuracy but has a long runtime, which makes it impracti- ture in high dimensions. It also suffers subjectivity, in cal for routine usage. Flock [4] maintains a high accuracy and reasonable runtime. After the challenge, several algo- rithms have been built for flow cytometry data analysis *Correspondence: jwkho@hku.hk Victor Chang Cardiac Research Institute, Sydney, Australia such as FlowPeaks [5], FlowSOM [6] and BayesFlow [7]. University of New South Wales, Sydney, Australia Full list of author information is available at the end of the article © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 2 of 8 FlowPeaks and Flock are largely based on k-means clus- detail of the algorithm is presented in the Methods tering. k-means clustering requires the number of clus- section. ters (k) to be defined prior to the analysis. It is hard to determine a suitable k in practice. FlowPeaks performs Methods k-means clustering with a large initial k, and iteratively The key idea of our algorithm is to replace the calcula- merges nearby clusters that are not separated by low den- tion of density from individual points to discrete bins as sity regions into one cluster. Flock utilises grids to iden- defined by a uniform grid. This way, the clustering step tify high density regions, which the algorithm then uses of the algorithm will scale with the number of non-empty to identify initial cluster centres for k-means clustering. bins, which is significantly smaller than the number of This grid-based method of identifying high density region points in lower dimensional data sets. Therefore the over- allows k-means clustering to converge much quicker com- all time complexity of our algorithm is dominated by the pared to using random initialisation of cluster centres, and binning step, which is in the order of O(N ).Thisissignifi- also directly identifies a suitable value for k. FlowSOM cantly better than the time complexity of DBSCAN, which starts with training Self-Organising Map (SOM), followed is in the order of O(NlogN ). The definition and algorithm by consensus hierarchical clustering of the cells for meta- are presented in the following subsections. clustering. In the algorithm, the number of clusters (k)is required for meta-clustering. Definition BayesFlow uses a Bayesian hierarchical model to iden- Thekey termsinvolvedinthe algorithmare definedinthis tify different cell populations in one or many samples. subsection. A graphical example can be found in Fig. 1. The key benefit of this method is its ability to incorporate N is the number of equally sized bins in each bin prior knowledge, and captures the variability in shapes dimension. In theory, there are (N ) bins in the bin and locations of populations between the samples [7]. data space, where d is the number of dimensions. However, BayesFlow tends to be computational expen- However, in practice, we only consider the sive as Markov Chain Monte Carlo sampling requires a non-empty bins. The number of non-empty bins (N) large number of iterations. Therefore, BayesFlow is often is less than (N ) , especially for high dimensional bin impractical for flow cytometry data sets of realistic size. data. Each non-empty bin is assigned an integer index These algorithms perform well on the Flow-CAP data i = 1 ... N. sets, but they may not be scalable to larger data sets that Bin is labelled by a tuple with d positive integers we are dealing with nowadays – those with tens of millions C = (C , C , C , ... , C ) where C is the i i1 i2 i3 id i1 of cells. Aiming to quantify cell population heterogene- coordinate (the bin index) at dimension 1. For ity in huge data sets, we have to develop an ultrafast and scalable clustering algorithm. In this paper, we present a new clustering algorithm that combines the benefit of DBSCAN [8](awidely- based density-based clustering algorithm) and a grid- based approach to achieve scalability. DBSCAN is fast and can detect clusters with complex shapes in the pres- ence of outliers [8]. DBSCAN starts with identifying core points that have a large number of neighbours within a user-defined region. Once the core points are found, nearby core points and closely located non-core points are grouped together to form clusters. This algorithm will identify clusters that are defined as high-density regions that are separated by the low-density regions. However, Fig. 1 An illustrative example of the FlowGrid clustering algorithm. In this DBSCAN is memory inefficient if the data set is very large, example, Bin 1, Bin 2, Bin 3 and Bin 6 are core bins as their Den are larger than or has large highly connected components. MinDen (5 in this example), their Den are larger than MinDen (20 in thi b c c s example), and their Den are larger than ρ% (75% in this example) of To reduce the computational search space and mem- b √ √ 2 2 its directly connected bins. Dist(C , C ) = 1 + 1 = 2 ≤ 1 2 ory requirement, our algorithm extends the idea of ( = 2 in this example), so Bin 1 and Bin 2 are directly connected. √ √ DBSCAN by using equal-spaced grids like Flock. We 2 2 Dist(C , C ) = 1 + 1 = 2 ≤ ,soBin 2and Bin4aredirectly 2 4 implemented our algorithm in an open source python connected. Therefore, Bin 1, Bin 2 and Bin 4 are mutually connected, package called FlowGrid. Using a range of real data and they are assigned into the same cluster. Bin 5 is not a core bin but isaborderbin,asitisdirectlyconnected to Bin6,which isacorebin. sets, we demonstrate that FlowGrid is much faster Bin 3 is a outlier bin, as it is not a core bin nor a border bin. In practice, than other state-of-the-art flow cytometry clustering MinDen is set to be 3, MinDen is set to 40 and ρ is set to be 85 b c algorithms, and produce similar clustering results. The Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 3 of 8 example, if Bin has coordinate C = (2, 3, 5), this bin i i Algorithm 1: FlowGrid is located in second bin in dimension 1, third bin in input : X, N , , ρ, MinDen , MinDen bin b c dimension 2 and the fifth bin in dimension 3. output:DataLabel The distance between Bin and Bin is defined as i j 1 Normalise the data X ranging from 1 to (N + 1) bin 2 Assign data into corresponding bins based on the integer of normalised value Dist(C , C ) = C − C (1) i j ik jk 3 Identify S as the set of non-empty bins bin k=1 4 Search the core bins and their directly connected bins Bin and Bin are defined to be directly connected if i j by SearchCore Dist(C , C ) , where is a user-specified i j 5 Group connected bins into a cluster by Breadth First parameter. Search(BFS) Den (C ) is the density of Bin , which is defined as b i i 6 Label cells by the label of their corresponding bins the number of points in Bin . Den (C ) is the collective density of Bin ,calculatedby c i i Den (C ) = Den (C ) c i b j {j|Bin andBin are directly connected} j i Algorithm 2: SearchCore (2) input : S , , ρ, MinDen , MinDen bin b c Bin is a core bin if output: S , L i core Initial an empty adjacency list L. 1 Den (C ) is larger than MinDen , a user-specified b i b S ={} core parameter. forall the Bin in S do bin 2 Den (C ) is larger than ρ%ofits directly b i if Den (i)> MinDen then b b connected bins, where ρ is a user-specified nnBin=radiusNeighbors(S , Bin , ) bin i parameter. nnCount counting the number of points for 3 Den (C ) is larger than MinDen , a user-specified c i c each bin in nnBin parameter. if Den (i) is greater than ρ% of nnCount then Den (i)= the sum of nnCount Bin is a border bin if it is not a core bin but it is if Den (i)> MinDen then c c directly connected to a core bin. S = S ∪{i} core core Bin is an outlier bin, if it is not a core bin nor a mapping bin with nnBin in L border bin. end Bin and Bin are in the same cluster, if they satisfy a b end one of the following conditions: end end 1 they are directly connected and at least one of The input of radiusNeighbors is all non-empty bins, them is core bin; the query bin and the maximum query distance . 2 they are not directly connected but are connected The output is the bins whose distance with the query by a sequence of directly connected core bins. bin are less than (including the query bin). Two points are in the same cluster, if they belong to the same bin or their corresponding bins belong to the same cluster. Algorithm Evaluation Algorithm 1 describes the key steps of FlowGrid, starting Procedure with normalising the values in each dimension to range FlowGrid aims to be an ultrafast and accurate clustering between 1 and (N + 1). Then, we use the integer part algorithm for very large flow cytometry data. Therefore, bin of the normalised value as the coordinate of its corre- both the accuracy and scalability performance need to sponding bin. Then, the SearchCore algorithm is applied be evaluated. The benchmark data sets from Flow-CAP to discover the core bins and their directly connected bins. [3], the multi-centre CyTOF data from Li et al. [9]and Once the core bins and connections are found, Breadth the SeaFlow project [10]are selected to comparethe First Search(BFS) is used to group the connected bins performance of FlowGrid against other state-of-the-art into a cluster. The cells are labelled by the label of their algorithms, FlowSOM, FlowPeaks, and FLOCK. These corresponding bins. three algorithms are chosen because they are widely used, Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 4 of 8 the pre-precessing step, we apply the inverse hyperbolic Algorithm 3: Breadth First Search(BFS) function with the factor of 5 to transform the multi-centre input : S , S , adjacency list L core bin data and the SeaFlow data. As the Flow-CAP and multi- output:Bin Label centre CyTOF data contain many samples and we treat Label every bin as -1 each sample as a data set, we run all algorithms on each Index=1 sample. The performances are measured by the ARI and for Bin in S do i core runtime, which are reported by the arithmetic means (x ¯) if the laebl of Bin is -1 then and standard deviation (sd). For the Seaflow data sets, we Queue={} treateachconcatenateddataset as adataset.Inthe eval- Label Bin as Index uation, all algorithms are applied on these concatenated Queue.push(Bin ) data sets. while Queue is not empty do To evaluate the scalability of each algorithm, we down- Bin = Queue.pop() sample the largest concatenated data set from the SeaFlow forall the directed connected Bin of Bin 2 1 project, generating 10 sub-sampled data sets in which the do numbers of cells range from 20 thousand to 20 million. if the laebl of Bin is -1 then Label Bin as Index Performance measure if Bin is core bin then The efficiency performance is measured by the runtime Queue.push(Bin ) while the clustering performance is measured by Adjusted end Rand Index (ARI). ARI is used to measure the cluster- end ing performance. ARI is the corrected-for-chance version end of the Rand index [11]. Although it may result in nega- end tive values if the index is less than expected, it tends to index=index +1 be more robust than many other measures like F-measure end and Rand index. end ARIiscalculatedasfollow. Givenaset S of n ele- ments, and two groups of cluster labels (one group of ground truth label and one group of predicted labels) of these elements, namely X ={X , X , ... , X } and Y = 1 2 r are generally considered to be quite fast, and have good {Y , Y , ... , Y }, the overlap between X and Y can be sum- 1 2 s accuracy. marized by n where n denotes the number of objects in ij ij Three benchmark data sets from Flow-CAP [3]are common between X and Y : n =|X ∩ Y |. i j ij i j selected for evaluation, including the Diffuse Large B-cell Lymphoma (DLBL), Hematopoietic Stem Cell Transplant (HSCT), and Graft versus Host Disease(GvHD) data set. n − a b / n ij i j ij i j 2 2 Each data set contains 10-30 samples with 3-4 markers, ARI= and each sample includes 2,000-35,000 cells. a + b − a b / n i j i j 2 i j i j The multi-centre CyTOF data set from Li et al. [9]pro- 2 2 2 2 vides a labelled data set with 16 samples. Each samples where a = n and b = n contains 40,000-70,000 cells and 26 markers. Since only 8 i ij j ij j i out fo 26 markers are determined to be relevant markers Experimentation in the original paper [9], only these 8 markers are used for FlowGrid is publicly available as an open source program clustering. on GitHub. FlowSOM and FlowPeaks are available as R We also use three data sets from the SeaFlow project packages from Bioconductor. The source code of Flock [10] and they contain many samples. Instead of analysing is downloaded from its Sourceforge repository. To repro- the independent samples, we analyse the concatenated duce all the comparisons presented in this paper, the data sets as the original paper [10] and these concate- source code and data can be downloaded from the GitHub nated data sets contain 12.7, 22.7 and 23.6 millions repository FlowGrid_compare. We run all the experiments of cells respectively. Each data sets include 15 features on six2.60GHz coresCPU with 32 GRAM. but the original study only uses four features for clus- tering analysis. The four features are forward scatter FlowPeaks and Flock provide automated version with- (small and perpendicular), phycoerythrin, and chlorophyll out any user-input parameter. FlowSOM requires one (small) [10]. user-supplied parameter (k, the number of clusters In the evaluation, we treat the manual gating label as the in meta-clustering step). FlowGrid requires two user- gold standard for measuring the quality of clustering. In supplied parameters (bin and ). To optimise the result, n Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 5 of 8 Table 1 Comparison of runtime (in seconds) of FlowGrid against other clustering algorithms Time in second (x ± sd) Data set Samples Markers Cells FlowGrid FlowSOM FlowPeaks Flock Multi-center 16 8 29-77 ×10 0.23± 0.09 4.01± 1.08 2.27± 0.61 10.3± 3.45 Flow-CAP-GvHD 12 4 12-33×10 0.07± 0.04 2.16± 0.54 0.28± 0.16 0.58± 0.28 Flow-CAP-DLBL 30 3 2-25×10 0.04± 0.01 1.25± 0.32 0.10± 0.09 0.22± 0.16 Flow-CAP-HSCT 30 4 6-9×10 0.04± 0.02 1.35± 0.28 0.11± 0.02 0.28± 0.06 Seaflow0 - 4 23.6×10 11.51 572.65 NA 6628.30 Seaflow1 - 4 12.7×10 3.09 312.95 258.13 NA Seaflow11 - 4 22.7×10 6.37 544.79 NA NA NA represents that the algorithm got error in the data set we try many k for FlowSOM and many combinations of Flock generates too many clusters in this case. It is impor- bin and for our algorithm. tant to note that FlowGrid also identifies cells that do not belong to a main cluster (i.e., a high density region). These Results cells can be viewed as ’outliers’, and are labelled as ’-1’ in Fig. 2. Performance comparison This is a feature that is not present in other clustering Table 1 summaries the performance of our algorithm and algorithms. three other algorithms – FlowSOM, FlowPeaks, and Flock in terms of runtime. Our algorithm is substantially faster Scalability analysis than other clustering algorithms in all the data sets. This To further evaluate the scalability of the algorithms, we improvement of runtime is especially substantial in the sub-sample one Seaflow data set and the sampled data Seaflow data sets. FLOCK and FlowPeaks sometimes fail sets range from 20 thousand to 20 million cells. Figure 3 to complete in some of the data sets. In a data set of 23.6 shows the scalability of our algorithm and three other million cells, FlowSOM completes the execution in 572 s, algorithms. Flock has a low runtime when processing a whereas FlowGrid completes the execution in only 12 s. small data set, but its runtime dramatically increases to This is a 50× speed up. Table 2 summaries the clustering 6640 s for a 20 million-cell data set. FlowPeaks and Flow- accuracy performance. In Flow-CAP and the multi-centre SOM share similar scalability but FlowPeaks is not able to data sets, FlowGrid shares the similar clustering accuracy execute 20 million data set. Our algorithm have the best (in terms of ARI) with other clustering algorithms but in performance in the evaluation as FlowGrid is faster than Seaflow data sets, FlowGrid gives higher accuracy than other algorithm in all the sampled data by an order of other clustering algorithms. magnitude. Figure 2 shows that the clustering results of our algo- rithm and three other algorithms in a HSCT sample. Flow- Parameter robustness analysis Grid, FlowSOM and FlowPeaks recover similar number Like other density-based clustering algorithm, parameter of clusters, and the clustering results are largely similar. setting is important. In our experience, Bin and are data-set-dependent. We recommend trying out different Table 2 Comparison of accuracy (in ARI) of FlowGrid against combinations of Bin between 4 and 15, and between other clustering algorithms 1 and 5. To pick the best parameter combinations, some ARI (x ¯ ± sd) Data set prior knowledge is helpful such as the expected number FlowGrid FlowSOM FlowPeaks Flock of clusters and the proportion of outliers which should be Multi-center 0.66± 0.20 0.75± 0.17 0.68± 0.20 0.66± 0.16 less than 10% in our experience. Flow-CAP-GvHD 0.79± 0.15 0.85± 0.11 0.72± 0.16 0.47± 0.20 We found that other parameters, namely MinDen , MinDen and ρ are mostly robust across a wide range of Flow-CAP-DLBL 0.85± 0.10 0.84± 0.10 0.82± 0.15 0.84± 0.09 values. Flow-CAP-HSCT 0.90± 0.08 0.87± 0.14 0.83± 0.24 0.57± 0.27 To demonstrate this robustness, we used the bench- Seaflow0 0.94 0.81 NA 0.27 mark data sets from Flow-CAP for a parameter sensitivity Seaflow1 0.59 0.54 0.34 NA analysis. For these experiments, we first set 3, 40, 85, 4 Seaflow11 0.77 0.33 NA NA and 1 as the default value for MinDen , MinDen , ρ, Bin b c n and , respectively. In each experiment, we only change NA represents that the algorithm got error in the data set Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 6 of 8 Fig. 2 Visual comparison of the clustering performance of FlowGrid, FlowPeaks, FlowSOM, and Flock using manual gating (top row) as the gold standard one parameter to test its sensitivity to the overall clas- to be 3, 40 and 85 respectively, FlowGrid maintains sification result. The performance is measured by ARI good clustering performance and excellent runtime. and runtime. In the first experiment, we varied MinDen They are therefore set as the default parameters for from 1to50whilefixingother parameters.Inthe sec- FlowGrid. ond experiment, we varied MinDen ranging from 10 to 300 while fixing other parameters. In the third experi- Discussion ment, we varied ρ ranging from 70 to 95 while fixing other In this paper, we have developed an ultrafast cluster- parameters. ing algorithm, FlowGrid, for single-cell flow cytome- Figure 4 demonstrates that the clustering accu- try analysis, and compared it against other state-of-the- racy and runtime are largely insensitive to MinDen , art algorithms such as Flock, FlowSOM and FlowPeaks. MinDen and ρ across a large range of parameter val- FlowGrid borrows ideas from DBSCAN for detection ues. The experiments are applied to all the bench- of high density regions and outliers. It does not only mark data sets from Flow-CAP and similar results perform well in the presence of outliers, but also have are observed in all the benchmark data sets. In our great scalability without getting into memory issues. It experiments, when MinDen , MinDen and ρ are set is both time efficient and memory efficient. FlowGrid b c Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 7 of 8 Fig. 3 Comparison of the runtime of FlowGrid, FlowPeaks, FlowSOM, and Flock using data sets with different number of cells shares similar clustering accuracy with state-of-the-art will also fractionally decreases but it may lead to sep- flow cytometry clustering algorithms, but it is substan- aration of real clusters and create spurious outliers. In tially faster than them. With any given number of markers, any case, we showed that the performance of FlowGrid the runtime of FlowGrid scales linearly with the num- is generally robust against changes in MinDen , MinDen ber of cells, which is a useful property for extremely large and ρ. data sets. The current implementation of FlowGrid is already MinDen and MinDen are density threshold parame- very fast for most practical purposes. In the future, if ters to reduce the search space of high density bins. If the the data size grows even larger, it is possible to further parameters are set very low, the runtime may fraction- speed up FlowGrid by parallelising the binning step of the ally increase but the accuracy is not likely to be affected. algorithm, which is currently the most computationally However, if the parameters are set very high, the runtime intensive step of the algorithm. Fig. 4 Sensitivity analysis of three different parameters on clustering accuracy (as measured by adjusted rand index; ARI) and runtime (seconds) Ye and Ho BMC Systems Biology 2019, 13(Suppl 2):35 Page 8 of 8 Abbreviations response using a density-based method for the automated identification ARI: Adjusted rand index; BFS: Breadth first search; CyTOF: Mass cytometry; of cell populations in multidimensional flow cytometry data. Cytom Part DBSCAN: Density-based spatial clustering of applications with noise; B Clin Cytom. 2010;78(S1):69–82. Flow-CAP: Flow cytometry critical assessment of population identification 5. Ge Y, Sealfon SC. flowPeaks: a fast unsupervised clustering for flow methods; SOM: Self-organising map cytometry data via k-means and density peak finding. Bioinformatics. 2012;28(15):2052–8. Acknowledgments 6. Van Gassen S, Callebaut B, Van Helden MJ, Lambrecht BN, Demeester P, We thank members of the Ho Laboratory for their valuable comments. Dhaene T, Saeys Y. FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. Cytom Part A. 2015;87(7):636–45. Funding 7. Johnsson K, Wallin J, Fontes M. BayesFlow: latent modeling of flow This work was supported in part by funds from the New South Wales Ministry cytometry cell populations. BMC Bioinformatics. 2016;17(1):25. of Health, a National Health and Medical Research Council Career 8. Ester M, Kriegel H-P, Sander J, Xu X, et al. A density-based algorithm for Development Fellowship (1105271 to JWKH), and a National Heart Foundation discovering clusters in large spatial databases with noise. In: Proceedings Future Leader Fellowship (100848 to JWKH). Publication charge is supported of the Second International Conference on Knowledge Discovery and by the Victor Chang Cardiac Research Institute. Data Mining (KDD’96). Portland: AAAI Press; 1996. p. 226–231. 9. Li H, Shaham U, Stanton KP, Yao Y, Montgomery RR, Kluger Y. Gating Availability of data and materials mass cytometry data by deep learning. Bioinformatics. 2017;33(21): Project Name: FlowGrid 3423–30. Project Home Page: https://github.com/VCCRI/FlowGrid 10. Hyrkas J, Clayton S, Ribalet F, Halperin D, Virginia Armbrust E, Howe B. Operating Systems: Unix, Mac, Windows Scalable clustering algorithms for continuous environmental flow Programming Languages: Python cytometry. Bioinformatics. 2015;32(3):417–23. Other Requirements: sklearn, numpy 11. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218. License: MIT Public License Any Restrictions to Use By Non-Academics: None About this supplement This article has been published as part of BMC Systems Biology Volume 13 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/ supplements/volume-13-supplement-2. Authors’ contributions XY and JWKH initiated and designed the project. XY implemented the algorithm, carried out all the experiments, and wrote the paper. JWKH revised the paper. Both authors approved the final version of the paper. Ethics approval and consent to participate Not applicable. Consent for publication Not applicable. Competing interests The authors declare that they have no competing interests. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Author details 1 2 Victor Chang Cardiac Research Institute, Sydney, Australia. University of New South Wales, Sydney, Australia. School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong. Published: 5 April 2019 References 1. Weber LM, Robinson MD. Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytom Part A. 2016;89(12):1084–96. 2. Saeys Y, Van Gassen S, Lambrecht BN. Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat Rev Immunol. 2016;16(7):449. 3. Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH, Consortium F, Consortium D, et al. Critical assessment of automated flow cytometry data analysis techniques. Nat Methods. 2013;10(3):228. 4. Qian Y, Wei C, Eun-Hyung Lee F, Campbell J, Halliley J, Lee JA, Cai J, Kong YM, Sadat E, Thomson E, et al. Elucidation of seventeen human peripheral blood B-cell subsets and quantification of the tetanus

Journal

BMC Systems Biology – Springer Journals

Published: Apr 5, 2019

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Ultrafast clustering of single-cell flow cytometry data using FlowGrid

Ultrafast clustering of single-cell flow cytometry data using FlowGrid

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Ultrafast clustering of single-cell flow cytometry data using FlowGrid

Ultrafast clustering of single-cell flow cytometry data using FlowGrid

References (13)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies