WHO (2020). Global tuberculosis report 2020. Technical report. WHO.
L Breiman (2001). Random forests. Mach Learn, 45.
TM Walker, TA Kohl, SV Omar, J Hedge, C Del Ojo Elias, P Bradley, Z Iqbal, S Feuerriegel, KE Niehaus, DJ Wilson (2015). Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect Dis, 15.
S Doerken, M Avalos, E Lagarde, M Schumacher (2019). Penalized logistic regression with low prevalence exposures beyond high dimensional settings. PLoS ONE, 14.
R Leinonen, R Akhtar, E Birney, L Bower, A Cerdeno-Tárraga (2011). The European nucleotide archive. Nucleic Acids Res, 39.
JE San, S Baichoo, A Kanzi, Y Moosa, R Lessells, V Fonseca, J Mogaka, R Power, T de Oliveira (2020). Current affairs of microbial genome-wide association studies: approaches, bottlenecks and analytical pitfalls. Front Microbiol, 10.
L Mathelin, K Gallivan (2012). A compressed sensing approach for partial differential equations with random input data. Commun Comput Phys, 12.
M Lustig, D Donoho, JM Pauly (2007). Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn Resonance Med, 58.
P Bradley, N Gordon, T Walker (2015). Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun, 6.
R Leinonen, H Sugawara, M Shumway (2010). The sequence read archive. Nucleic Acids Res, 39.
D Malioutov, M Malyutov (2012). Boolean compressed sensing: LP relaxation for group testing. In: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP). p. 3305–8.
H Iwai, M Kato-Miyazawa, T Kirikae, T Miyoshi-Akiyama (2015). CASTB (the comprehensive analysis server for the Mycobacterium tuberculosis complex): a publicly accessible web server for epidemiological analyses, drug-resistance prediction and phylogenetic comparison of clinical isolates. Tuberculosis, 95.
MM Saber, BJ Shapiro (2020). Benchmarking bacterial genome-wide association study methods using simulated genomes and phenotypes. Microb Genom, 6.
IBM (2020). IBM ILOG CPLEX optimization studio V12.10.0 documentation. International Business Machines Corporation.
MF Duarte, YC Eldar (2011). Structured compressed sensing: from theory to applications. IEEE Trans Signal Process, 59.
H Li (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv.
G Arango-Argoty, E Garner, A Pruden, LS Heath, P Vikesland, L Zhang (2018). DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome, 6.
F Coll, R McNerney, JA Guerra-Assunção, JR Glynn, JA Perdigão, M Viveiros, I Portugal, A Pain, N Martin, TG Clark (2014). A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat Commun, 5.
Y Yang, KE Niehaus, TM Walker, Z Iqbal, AS Walker, DJ Wilson, TE Peto, DW Crook, EG Smith, T Zhu (2018). Machine learning for classifying tuberculosis drug-resistance from DNA sequencing data. Bioinformatics, 34.
C Cortes, V Vapnik (1995). Support-vector networks. Mach Learn, 20.
R Dorfman (1943). The detection of defective members of large populations. Ann Math Stat, 14.
J O’Neill (2014). Antimicrobial resistance: tackling a crisis for the health and wealth of nations. Review on Antimicrobial Resistance. Technical report.
WJ Murdoch, C Singh, K Kumbier, R Abbasi-Asl, B Yu (2019). Interpretable machine learning: definitions, methods, and applications. arXiv.
A Drouin, G Letarte, F Raymond, M Marchand, J Corbeil, F Laviolette (2019). Interpretable genotype-to-phenotype classifiers with performance guarantees. Sci Rep, 9.
S Kouchaki, Y Yang, TM Walker, A Sarah Walker, DJ Wilson, TE Peto, DW Crook, DA Clifton (2019). Application of machine learning techniques to tuberculosis drug resistance analysis. Bioinformatics, 35.
V Schleusener, C Köser, P Beckert (2017). Mycobacterium tuberculosis resistance prediction and lineage classification from genome sequencing: comparison of automated analysis tools. Sci Rep, 7.
X Chen, H Ishwaran (2012). Random forests for genomic data analysis. Genomics, 99.
R Poplin, V Ruano-Rubio, MA DePristo, TJ Fennell, MO Carneiro, GAV der Auwera, DE Kling, LD Gauthier, A Levy-Moonshine, D Roazen, et al. (2017). Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv.
M Aldridge, O Johnson, J Scarlett (2019). Group testing: an information theory perspective. Found Trends Commun Inf Theory, 15.
SM Lundberg, S-I Lee (2017). A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors. Advances in neural information processing systems, vol. 30. p. 4765–74.
D Malioutov, K Varshney (2013). Exact rule learning via Boolean compressed sensing. In: International conference on machine learning. p. 765–73.
ML Chen, A Doddi, J Royer, L Freschi, M Schito, M Ezewudo, IS Kohane, A Beam, M Farhat (2019). Beyond multidrug resistance: leveraging rare variants with machine and statistical learning models in Mycobacterium tuberculosis resistance prediction. EBioMedicine, 43.
S Mitchell, M O’Sullivan, I Dunning (2011). PuLP: a linear programming toolkit for Python. http://www.optimization-online.org/DB_FILE/2011/09/3178.pdf.
A Drouin, S Giguère, M Déraspe, M Marchand, M Tyers, VG Loo, A-M Bourgault, F Laviolette, J Corbeil (2016). Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons. BMC Genom, 17.
P Miotto, B Tessema, E Tagliani, L Chindelevitch (2017). A standardised method for interpreting the association between mutations and phenotypic drug-resistance in Mycobacterium tuberculosis. Eur Respir J, 50.
SM Lundberg, G Erion, H Chen, A DeGrave, JM Prutkin, B Nair, R Katz, J Himmelfarb, N Bansal, S-I Lee (2020). From local explanations to global understanding with explainable AI for trees. Nat Mach Intell, 2.
AM Starks, E Avilés, DM Cirillo, CM Denkinger, DL Dolinger, C Emerson, J Gallarda, D Hanna, PS Kim, R Liwski (2015). Collaborative effort for a centralized worldwide tuberculosis relational sequencing data platform. Clin Infect Dis, 61.
S Basu, K Kumbier, JB Brown, B Yu (2018). Iterative random forests to discover predictive and stable high-order interactions. Proc Natl Acad Sci, 115.
F Pedregosa, G Varoquaux, A Gramfort, V Michel, B Thirion, O Grisel, M Blondel, P Prettenhofer, R Weiss, V Dubourg (2011). Scikit-learn: machine learning in Python. J Mach Learn Res, 12.
MA Herman, T Strohmer (2009). High-resolution radar via compressed sensing. IEEE Trans Signal Process, 57.
R Lougee-Heimer (2003). The common optimization interface for operations research: promoting open-source software in the operations research community. IBM J Res Dev, 47.
S Foucart, H Rauhut (2013). A mathematical introduction to compressive sensing. In: Applied and numerical harmonic analysis. New York: Springer. https://books.google.ca/books?id=zb28BAAAQBAJ.
EJ Candes, MB Wakin (2008). An introduction to compressive sampling. IEEE Signal Process Mag, 25.
S Feuerriegel, V Schleusener, P Beckert, TA Kohl, P Miotto, DM Cirillo, AM Cabibbe, S Niemann, K Fellenberg (2015). PhyResSE: a web tool delineating Mycobacterium tuberculosis antibiotic resistance and lineage from whole-genome sequencing data. J Clin Microbiol, 53.
H Li, B Handsaker, A Wysoker, T Fennell, J Ruan, N Homer, G Marth, G Abecasis, R Durbin (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25.
A Doostan, H Owhadi (2011). A non-adapted sparse approximation of PDEs with stochastic inputs. J Comput Phys, 230.
T-M Ngo, Y-Y Teo (2019). Genomic prediction of tuberculosis drug-resistance: benchmarking existing databases and prediction algorithms. BMC Bioinform, 20.
G van Rossum (1995). Python tutorial. Technical report CS-R9526, Centrum voor Wiskunde en Informatica (CWI). Amsterdam.
MC Raviglione, IM Smith (2007). XDR tuberculosis—implications for global public health. N Engl J Med, 356.
A Drouin (2020). Learn interpretable computational phenotyping models from k-merized genomic data. https://github.com/aldro61/kover.
WHO (2014). Antimicrobial resistance: global report on surveillance. Technical report. WHO.
M Aldridge, L Baldassini, O Johnson (2014). Group testing algorithms: bounds and simulations. IEEE Trans Inf Theory, 60.
E Avalos, D Catanzaro, A Catanzaro, T Ganiats, S Brodine, J Alcaraz, T Rodwell (2015). Frequency and geographic distribution of gyrA and gyrB mutations associated with fluoroquinolone resistance in clinical Mycobacterium tuberculosis isolates: a systematic review. PLoS ONE, 10.
W Deelder, S Christakoudi, J Phelan, E Diez Benavente, S Campino, R McNerney, L Palla, TG Clark (2019). Machine learning predicts accurately Mycobacterium tuberculosis drug resistance from whole genome sequencing data. Front Genet, 10.
GK Atia, V Saligrama (2012). Boolean compressed sensing and noisy group testing. IEEE Trans Inf Theory, 58.
S Drăghici, RB Potter (2003). Predicting HIV drug resistance with neural networks. Bioinformatics, 19.
A Cohen, W Dahmen, R DeVore (2009). Compressed sensing and best k-term approximation. J Am Math Soc, 22.
YC Eldar, G Kutyniok (2012). Compressed sensing: theory and applications. doi:10.1017/CBO9780511794308.
BK Natarajan (1995). Sparse approximate solutions to linear systems. SIAM J Comput, 24.
AR Wattam, D Abraham, O Dalay, TL Disz, T Driscoll, JL Gabbard, JJ Gillespie, R Gough, D Hix, R Kenyon (2014). PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res, 42.
S Gagneux (2018). Ecology and evolution of Mycobacterium tuberculosis. Nat Rev Microbiol, 16.
K Drlica, X Zhao (1997). DNA gyrase, topoisomerase IV, and the 4-quinolones. Microbiol Mol Biol Rev, 61.
A Steiner, D Stucki, M Coscolla, S Borrell, S Gagneux (2014). KvarQ: targeted and direct variant calling from fastq reads of bacterial genomes. BMC Genom, 15.
BE Boser, IM Guyon, VN Vapnik (1992). A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. COLT ’92. New York: Association for Computing Machinery. p. 144–52.
F Coll, R McNerney, M Preston (2015). Rapid determination of anti-tuberculosis drug resistance from whole-genome sequences. Genome Med, 7.
Motivation: Prediction of drug resistance and identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving this problem requires a transparent, accurate, and flexible predictive model. The methods currently used for this purpose rarely satisfy all of these criteria. On the one hand, approaches based on testing strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy, but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit a lower accuracy or lack the flexibility needed to generalize them to previously unseen data.

Contribution: In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, which yields highly accurate predictions, interpretable results, and is flexible enough to be optimized for various evaluation metrics at the same time.

Results: We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that it has a higher or comparable accuracy to that of commonly used machine learning models, and is able to identify variants in genes with previously reported association to drug resistance. Our method is intrinsically interpretable, and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed via the Python Package Index (PyPI) under ingotdr. This package is also compatible with most of the tools in the Scikit-learn machine learning library.
Keywords: Drug resistance, Interpretable machine learning, Group testing, Integer linear programming, Rule-based learning, Whole-genome sequencing

Background

Drug resistance is the phenomenon by which an infectious organism (also known as a pathogen) develops resistance to one or more drugs that are commonly used in treatment [1]. In this paper we focus our attention on Mycobacterium tuberculosis (MTB), the etiological agent of tuberculosis, which is the largest single infectious agent killer in the world today, responsible for over 10 million detected cases and approximately 1.4 million deaths in 2019 alone [2].

The development of resistance to common drugs used in treatment is a serious public health threat, not only in low and middle-income countries, but also in high-income countries where it is particularly problematic in hospital settings [3]. It is estimated that, without the urgent development of novel antimicrobial drugs, the total mortality due to drug resistance will exceed 10 million people a year by 2050, a number exceeding the annual mortality due to cancer today [4].

*Correspondence: hooman_zabeti@sfu.ca
School of Computing Science, Simon Fraser University, Burnaby, Canada
Full list of author information is available at the end of the article

© The Author(s) 2021. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Zabeti et al. Algorithms Mol Biol (2021) 16:17

Existing models for predicting drug resistance from whole-genome sequencing (WGS) data broadly fall into two classes. The first, which we refer to as “catalogue methods”, involves testing the WGS data of an isolate for the presence of point mutations (most often single-nucleotide polymorphisms, or SNPs) associated with known drug resistance. These mutations are typically identified via a microbial genome-wide association study (GWAS) and may be confirmed with a functional genomics study. If at least one previously identified mutation is present, the isolate is declared to be resistant [5–9]. While these methods are simple to understand and apply, they often suffer from poor predictive accuracy [10], especially in identifying new resistance mechanisms or predicting resistance to rarely used drugs.

The second class, which we refer to as “machine learning methods”, seeks to infer the drug resistance of an isolate by training complex models directly on WGS and drug susceptibility test (DST) data [11–13]. Such methods tend to result in highly accurate predictions at the cost of flexibility and interpretability. Specifically, they typically provide only limited, if any, insights into the drug resistance mechanisms involved, and often do not impose explicit limits on the predictive model’s complexity. Learning approaches based on deep neural networks [13, 14] are an example of very accurate but very complex “black-box” models of drug resistance.

In this paper we propose a novel method, based on the group testing problem [15] and Boolean compressed sensing (CS), for the prediction of drug resistance. CS is a mathematical technique for sparse signal recovery from under-determined systems of linear equations [16], and has been successfully applied in many areas including digital signal processing [17, 18], MRI imaging [19], radar detection [20], and computational uncertainty quantification [21, 22]. Under a sparsity assumption on the unknown signal vector, it has been shown that CS techniques enable recovery from far fewer measurements than required by the Nyquist–Shannon sampling theorem [23]. Boolean CS is a modification of the CS problem, replacing linear algebra over the real numbers with Boolean algebra over binary numbers [24], which has been successfully applied to various forms of non-adaptive group testing [24–26].

Our approach, INterpretable GrOup Testing for Drug Resistance (INGOT-DR), combines the flexibility and interpretability of catalogue methods with the accuracy of machine learning methods. More specifically, INGOT-DR is capable of recovering interpretable rules for predicting drug resistance that both yield a high classification accuracy and provide insights into the mechanisms of drug resistance. We compare the performance of INGOT-DR with that of standard and state-of-the-art machine learning and rule-based learning methods which have been previously used for genotype-phenotype prediction on MTB data. These methods are logistic regression (LR) [27], random forests (RF) [28], Support Vector Machines (SVM) [29], and KOVER [30]. The comparison covers the prediction of drug resistance for twelve drugs, of which five are first-line and seven are second-line drugs. INGOT-DR displays a competitive performance while maintaining interpretability and flexibility, and accurately recovers many of the known mechanisms of drug resistance.

Methods

We present our methodology as follows. “Group testing and Boolean compressed sensing” and “From group testing to interpretable classification” introduce the group testing problem, and discuss how group testing can be combined with compressed sensing to deliver an interpretable predictive model. “Our approach leads to a refined ILP formulation” introduces substantial modifications to a previously published method, which are needed to produce an accurate and flexible classifier that can be tuned for specific evaluation metrics and tasks. “Optimizing different target metrics such as the sensitivity and the specificity” describes the tuning process required to provide the desired trade-off between sensitivity and specificity in a model’s predictions.

Group testing and Boolean compressed sensing

We frame the problem of predicting drug resistance given sequence data as a group testing problem, originally introduced in [15]. This approach for detecting defective members of a set was motivated by the need to screen a large population of soldier recruits for syphilis in the United States during World War II. The screening, performed by testing blood samples, was costly due to the low numbers of infected individuals. To make the screening more efficient, Robert Dorfman suggested pooling blood samples into specific groups and testing the groups instead. A positive result for the group would imply the presence of at least one infected member. The problem then becomes one of finding the subset of individuals whose infected status can explain all the positive results without invalidating any of the negative ones.

In this setting, the design matrix encodes the individuals tested in each group, the outcome vector describes the result of each test, and the solution, obtained from a suitable algorithmic procedure, is a $\{0,1\}$-valued vector representing the infection status of the individuals [24, 31]. Since the fraction of infected individuals is assumed to be small, the solution vector is sparse and can be recovered with Boolean CS. The importance of this observation lies in the fact that the result of solving the Boolean CS problem can also be interpreted as a sparse set of rules for determining the status of each sample in other data mining contexts [24]. We summarize this correspondence in our context in Table 1 below, and use the context-specific interpretation throughout the rest of this paper.

Table 1 Correspondence between group testing and the drug resistance prediction problem

| Term | Meaning: group testing | Meaning: drug resistance | Notation | Domain |
| --- | --- | --- | --- | --- |
| Row dimension | Number of tests | Number of isolates | $m$ | $\mathbb{N}$ |
| Column dimension | Population size | Number of SNPs/variants | $n$ | $\mathbb{N}$ |
| Sparsity/rule size | Infection prevalence | Number of relevant SNPs | $k$ | $\{0, 1, \ldots, n\}$ |
| Design matrix | Test membership | Genotype matrix | $A$ | $\{0,1\}^{m \times n}$ |
| Outcome vector | Test result vector | Phenotype/label vector | $y$ | $\{0,1\}^{m}$ |
| Status vector | Infected/uninfected | Relevant/irrelevant to DR | $w$ | $\{0,1\}^{n}$ |

Mathematically, the problem with $m$ isolates and $n$ SNPs can be described by the Boolean design matrix $A \in \{0,1\}^{m \times n}$, where $A_{ij}$ indicates the presence/absence status of SNP $j$ in the $i$-th isolate, and the Boolean outcome vector $y \in \{0,1\}^{m}$, where $y_i$ represents the drug resistance phenotype of the $i$-th isolate. Let us define the relevance vector $w \in \{0,1\}^{n}$ in such a way that $w_j = 1$ if and only if the $j$-th SNP is relevant to drug resistance.

The key assumption is that one or more SNPs relevant to drug resistance can cause the isolate to be drug-resistant, whereas an isolate with no such SNP will be drug-sensitive. This is an assumption commonly made in the literature, and is precisely the same as the key assumption of group testing, which is that the presence of one or more infected individuals leads to a positive test, while a test with no infected individuals comes out negative (we note that these assumptions only hold in the absence of noise). In fact, although our group is the first one, to our knowledge, to make the connection between group testing and drug resistance prediction, a previously published method for this task [32] corresponds almost perfectly to the Definite Defectives algorithm used in group testing [33].

Under this assumption, the outcome vector satisfies the relationship

$$y_i = \bigvee_{j=1}^{n} A_{ij} \wedge w_j \quad \forall\; 1 \le i \le m,$$

where $\vee$ and $\wedge$ are the Boolean OR and AND operators, respectively. Using the definition of Boolean matrix-vector multiplication, this can be equivalently written

$$y = A \vee w. \tag{1}$$

If the status vector $w$ satisfying Eq. (1) is assumed to be sparse (i.e. there are few relevant SNPs), the problem of finding $w$ becomes an instance of the sparse Boolean vector recovery problem:

$$\min \|w\|_0 \quad \text{subject to} \quad y = A \vee w, \tag{2}$$

where $\|w\|_0$, called the $\ell_0$-norm of $w$, is the number of non-zero entries it contains.

The combinatorial optimization problem (2) is well-known to be NP-hard [34]. In [24, 35] an equivalent formulation of (2) via 0–1 integer linear programming (ILP) is proposed, in which the $\ell_0$-norm is replaced by the convex $\ell_1$-norm, equivalent to it over binary vectors, and the Boolean matrix-vector product is replaced with equivalent linear constraints. We recapitulate their formulation here:

$$\begin{aligned} \min \quad & \sum_{j=1}^{n} w_j \\ \text{s.t.} \quad & w \in \{0,1\}^{n} \\ & A_P w \ge 1 \\ & A_Z w = 0 \end{aligned} \tag{3}$$

Here, $P := \{i : y_i = 1\}$ and $Z := \{i : y_i = 0\}$ are the sets of positive (drug-resistant) and negative (drug-sensitive) isolates, respectively, and $A_S$ denotes the submatrix of $A$ whose row indices are in the given subset $S$. In this formulation, the objective is to minimize the number of SNPs inferred to be relevant to drug resistance. The first constraint then ensures that each SNP is classified as either relevant or irrelevant, the second one ensures that the drug-resistant isolates have at least one relevant SNP present, and the third one ensures that the drug-sensitive isolates do not have any such SNPs, in line with our key assumption. This NP-hard problem formulation can further be made tractable for linear programming by relaxing the Boolean constraint on $w$ in (3) to $0 \le w_j \le 1$ for all $1 \le j \le n$ [24].

Because the Boolean CS problem is based on Boolean algebra, the conditions on the Boolean matrices $A$ that guarantee exact recovery of $k$-sparse status vectors (vectors with at most $k$ 1’s) via such linear programming relaxations are quite stringent, and differ from those of standard CS. Specifically, in order to hold, these guarantees require the matrix $A$ to be $k$-disjunct, i.e. for any sum of at most $k$ of its columns to not be greater than or equal to any other column. As we have no control over $A$ in our setting, no such recovery guarantees can be provided.

In [24], the combinatorial problem (3) is augmented with slack variables and a regularization term to trade off between the sparsity of $w$ on the one hand, and the discrepancy between the predicted and the actual outcome vector on the other hand. With these modifications, the formulation becomes:

\begin{align}
\min \quad & \sum_{j=1}^{n} w_j + \lambda \sum_{i=1}^{m} \xi_i \tag{4a} \\
\text{s.t.} \quad & w \in \{0,1\}^{n} \tag{4b} \\
& 0 \le \xi_i \le 1, \quad i \in P \tag{4c} \\
& 0 \le \xi_i, \quad i \in Z \tag{4d} \\
& A_P w + \xi_P \ge 1 \tag{4e} \\
& A_Z w - \xi_Z = 0, \tag{4f}
\end{align}

where $\lambda > 0$ is a regularization parameter and $\xi$ is the so-called slack vector. Taking this formulation as a starting point, we introduce several refinements in “Our approach leads to a refined ILP formulation”.

From group testing to interpretable classification

As described in the previous section, the solution to the ILP (4) can be seen as an interpretable rule-based classifier in contexts beyond group testing. The status vector $w$ naturally encodes the following rule: If any feature $f_j$ with $w_j = 1$ is present in the sample, classify it as positive; otherwise, classify it as negative.

More formally, assume that we have a labelled dataset $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where the $x_i \in X := \{0,1\}^{n}$ are $n$-dimensional binary feature vectors and the $y_i \in \{0,1\}$ are the binary labels. The feature matrix $A$ is defined via $A_{ij} = (x_i)_j$ (the $j$-th component of the $i$-th feature vector). If $\hat{w}$ is the solution of ILP (4) for this matrix $A$ and the outcome vector $y = (y_i)_{i=1}^{m}$, we define the classifier $\hat{c} : X \to \{0,1\}$ via

$$\hat{c}(x) = x \vee \hat{w}. \tag{5}$$

What makes this classifier interpretable is that it explicitly depends on the presence or absence of specific features in its input, while ignoring all the other features.

Our approach leads to a refined ILP formulation

The formulation of the ILP (4) is designed to provide, via the parameter $\lambda$, a trade-off between the sparsity of a rule and the total slack, a quantity that resembles (but does not equal) the training error. We now describe a refinement of this formulation that directly encodes the different types of error, which provides more flexibility during the training process by allowing us to optimize a more precise objective function that is particularly suitable to the application at hand.

As was done in the previous section, we assume that $\hat{c}$ is the binary classifier obtained by training with a Boolean feature matrix $A$ and its corresponding label vector $y$. We further refer to a misclassified training sample as a false negative if it has label 1 (is in $P$), and as a false positive if it has label 0 (is in $Z$). In the drug resistance setting, a false negative would mean that we incorrectly predict a drug-resistant isolate to be drug-sensitive, while a false positive would mean that we predict a drug-sensitive isolate to be drug-resistant.

We begin by noting that in the ILP (4), each entry of $\xi_P$ must take on a value of 0 or 1, and a value of 1 corresponds to a false negative for $\hat{c}$. This follows from the fact that $A$ is a binary matrix and $w$ is a binary vector, so the optimal $\xi_P$ is also a binary vector (since $\lambda > 0$), and therefore

$$\sum_{i \in P} \xi_i = \mathbf{1}^{T} \xi_P =: \mathrm{FN}, \tag{6}$$

where we use FN to denote the number of false negatives. However, $\xi_Z$ in the ILP (4) can take on integer values greater than 1 corresponding to false positives for $\hat{c}$. To be able to express the number of false positives, denoted FP, we modify the constraints (4d) and (4f) by also setting

$$\xi_i \in \{0,1\}, \quad i \in Z \tag{7}$$

and replacing the equality constraint $A_Z w - \xi_Z = 0$ with the tighter inequality

$$\alpha_i \xi_i - A_i w \ge 0 \quad \forall\; i \in Z, \tag{8}$$

where $\alpha_i = \sum_{j=1}^{n} A_{ij}$ and $A_i$ is the $i$th row of $A$.

After these modifications, (8) ensures that $\xi_i = 1$ if $A_i w > 0$, while the presence of $\xi_Z$ in the objective function, with $\lambda > 0$, ensures that $\xi_i = 0$ if $A_i w = 0$, for any $i \in Z$. We now also get

$$\sum_{i \in Z} \xi_i = \mathbf{1}^{T} \xi_Z = \mathrm{FP}. \tag{9}$$

To provide additional flexibility for situations where false positives and false negatives are valued differently, we further split the regularization term into two: one for the positive class $P$, and one for the negative class $Z$:

$$\lambda_P \sum_{i \in P} \xi_i + \lambda_Z \sum_{k \in Z} \xi_k. \tag{10}$$

The general form of the new ILP is now as follows:

$$\begin{aligned} \min \quad & \sum_{j=1}^{n} w_j + \lambda_P \sum_{i \in P} \xi_i + \lambda_Z \sum_{k \in Z} \xi_k \\ \text{s.t.} \quad & w \in \{0,1\}^{n} \\ & \xi \in \{0,1\}^{m} \\ & A_P w + \xi_P \ge 1 \\ & \alpha_i \xi_i - A_i w \ge 0 \quad \forall\; i \in Z \end{aligned} \tag{11}$$

$$\begin{aligned} \min \quad & \sum_{i \in P} \xi_i \\ \text{s.t.} \quad & w \in \{0,1\}^{n} \\ & \xi \in \{0,1\}^{m} \\ & A_P w + \xi_P \ge 1 \\ & \alpha_i \xi_i - A_i w \ge 0 \quad \forall\; i \in Z \\ & \mathbf{1}^{T} w \le k \\ & \mathbf{1}^{T} \xi_Z \le (1 - t)\,|Z|. \end{aligned} \tag{16}$$

The maximum specificity at given sensitivity and rule size can be found analogously.
In a similar way, one can mini- mize a weighted average of rule size and false positive In this new formulation, and control the trade-off P Z rate at a given maximum false negative rate (minimum between the false positives and the false negatives, and sensitivity), or vice versa. jointly influence the sparsity of the rule. In the following section we describe how this formulation can be further tailored to optimize different evaluation metrics, such as Implementation the sensitivity and the specificity of the predictor. Existing methods used for comparison with INGOT‑DR To ensure a fair comparison, we use three popular machine learning methods used for drug resistance Optimizing different target metrics such as the sensitivity prediction: random forests (RF) [28], logistic regres- and the specificity sion (LR) [27], and support vector machines (SVM) Since the ILP formulation in (11) provides us with [29]. The use of RF is motivated by its flexibility and direct access to the two components of the training its many successful applications in computational biol- error as well as the sparsity (rule size), we may modify ogy and genomics [36, 37]. The use of LR is based on the classifier to optimize a variety of target metrics by its excellent performance in drug resistance prediction transforming some of the objective function compo- for MTB in comparison to other methods [38]. The use nents into constraints and optimizing the remaining of SVM is motivated by its excellent performance in a ones. comparison of drug resistance prediction for multiple For instance, assume that we would like to train the bacterial pathogens [30]; we use it with a linear kernel classifier c ˆ to maximize the sensitivity at a given mini- for simplicity, although other kernels are often used mum specificity t and maximum rule size k. Recall that [39]. 
For LR and SVM, we consider the ℓ and ℓ regu- 1 2 TN FP larizations, which correspond to penalizing the sum of Specificity = = 1 − , (12) TN + FP N the absolute values and the Euclidean norm of the coef- ficients, respectively. We also use, to our knowledge, the only other inter- TP FN Sensitivity = = 1 − . (13) pretable machine learning method for drug resist- TP + FN P ance prediction, KOVER [30]. All the methods except From Eqs. (10), (12) and the definition of Z , we get the KOVER are implemented in the Python program- constraint ming language [40]. Although KOVER can provide rule-based classifiers from two algorithms: Classifi - 1 ξ ¯ ¯ cation and Regression Trees (CART) and Set Cover- t ≤ 1 − ⇐⇒ 1 ξ ≤ (1 − t)|Z|. (14) |Z| ing Machine (SCM), we only consider the latter as it is the main innovation of KOVER [41], and the two Also, to restrict the maximum rule size to k we can use algorithms yield very similar accuracy [30]. We use the the constraint Scikit-learn [42] implementation for the machine learn- (15) ing models—RandomForestClassifier for RF, LogisticRe - 1 w ≤ k gression for LR, and LinearSVC for SVM. We also use Our objective is to maximize the sensitivity, which is KOVER version 2.0 [43], and harness the Python API to equivalent to minimizing ξ by Eqs. (13) and (6). In i∈P the CPLEX optimizer, version 12.10.0 [44], through the addition, by incorporating Eqs. (14) and (15), the ILP (11) Pulp API [45, 46] to solve the ILPs in INGOT-DR. can be modified as follows: Zabeti et al. Algorithms Mol Biol (2021) 16:17 Page 6 of 12 Data Table 2 Summary statistics for our dataset, with a line We combine data from the Pathosystems Resource separating first-line and second-line drugs Integration Center (PATRIC) [47] and the Relational Drug # of isolates # of # of SNPs # of SNP groups Sequencing TB Data Platform (ReSeqTB) [48]. 
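To make the optimization problem handed to the solver more concrete, the following minimal sketch solves a tiny instance of ILP (16) by exhaustive search and applies the resulting rule as in Eq. (5). It is an illustration under stated assumptions only: the function names `predict` and `fit_rule`, the toy genotype matrix, and the tie-breaking in favour of smaller rules are hypothetical and are not taken from INGOT-DR, which solves the ILP with CPLEX through PuLP. Exhaustive search is feasible only for a handful of features.

```python
from itertools import combinations


def predict(x, rule):
    """Disjunctive classifier of Eq. (5): positive if any selected feature is present."""
    return int(any(x[j] for j in rule))


def fit_rule(A, y, k, t):
    """Brute-force stand-in for ILP (16) (hypothetical helper, tiny inputs only).

    Minimizes the number of false negatives subject to a rule size of at
    most k and a training specificity of at least t. Smaller rules are
    visited first, so ties are broken in favour of smaller rules.
    Returns (false negatives, false positives, selected feature indices).
    """
    n = len(A[0])
    pos = [row for row, label in zip(A, y) if label == 1]  # rows in P
    neg = [row for row, label in zip(A, y) if label == 0]  # rows in Z
    max_fp = (1 - t) * len(neg)  # right-hand side of constraint (14)
    best = None
    for size in range(k + 1):  # constraint (15): at most k features
        for rule in combinations(range(n), size):
            fp = sum(predict(row, rule) for row in neg)  # 1^T xi_Z
            if fp > max_fp + 1e-9:  # small tolerance for floating point
                continue
            fn = sum(1 - predict(row, rule) for row in pos)  # 1^T xi_P
            if best is None or fn < best[0]:
                best = (fn, fp, rule)
    return best


# Toy genotype matrix (made up): rows are isolates, columns are SNP groups.
A = [
    [1, 0, 0, 1],  # resistant
    [0, 1, 0, 1],  # resistant
    [0, 1, 0, 0],  # resistant
    [0, 0, 1, 1],  # sensitive
    [0, 0, 0, 0],  # sensitive
    [0, 0, 1, 0],  # sensitive
]
y = [1, 1, 1, 0, 0, 0]

fn, fp, rule = fit_rule(A, y, k=2, t=0.9)
print(rule, fn, fp)  # selects features 0 and 1 with no training errors
```

With a minimum specificity of 0.9 and only three sensitive isolates, no false positive is allowed, so the features present in sensitive isolates (columns 2 and 3) are excluded and the remaining two features together cover all resistant isolates.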
Data

We combine data from the Pathosystems Resource Integration Center (PATRIC) [47] and the Relational Sequencing TB Data Platform (ReSeqTB) [48]. This results in 8000 isolates together with their resistant/susceptible status for twelve drugs, including five first-line (rifampicin, isoniazid, pyrazinamide, ethambutol, and streptomycin) and seven second-line drugs (kanamycin, amikacin, capreomycin, ofloxacin, moxifloxacin, ciprofloxacin, and ethionamide) [49, 50]. The whole-genome sequencing data for these 8000 isolates, in the form of paired FASTQ files, are downloaded from the European Nucleotide Archive [51] and the Sequence Read Archive [52]. The accession numbers used to obtain the data in our study are: ERP[000192, 000520, 006989, 008667, 010209, 013054], PRJEB[10385, 10950, 14199, 2358, 2794, 5162, 9680], PRJNA[183624, 235615, 296471], and SRP[018402, 051584, 061066].

In order to transform the raw sequencing data into variant calls, we use a pipeline similar to that used in previous work [50, 53]. We use the BWA software [54], specifically, the BWA-MEM program, for the mapping. We then call the single-nucleotide polymorphisms (SNPs) of each isolate with two different pipelines, SAMtools [55] and GATK [56], and take the intersection of their calls to ensure reliability. The final dataset, which includes the position as well as the reference and alternative allele for each SNP [50], is used as the input to our machine learning tools.

Starting from this input we create a binary feature matrix as described in "From group testing to interpretable classification". For each drug, we only consider the isolates with a status for this drug. We group all the SNPs in perfect linkage disequilibrium (LD) [57], i.e. sharing identical presence/absence patterns in those isolates, into a single feature that we call a SNP group. This representation does not affect the predictive accuracy of any machine learning methods, but helps create a consistent feature importance score for the non-interpretable ones. In KOVER, at most one SNP in a SNP group can be selected to be part of a rule, and the remaining SNPs in the group are labelled equivalent [41]; we adopt this convention here. The number of labeled and drug-resistant isolates, as well as the number of SNPs and SNP groups for each drug, is shown in Table 2.

Table 2 Summary statistics for our dataset, with a line separating first-line and second-line drugs

Drug            # of isolates   # of resistant isolates   # of SNPs   # of SNP groups
Ethambutol      6096            1407                      597,133     55,164
Isoniazid       7734            3445                      642,373     65,090
Pyrazinamide    3858            754                       281,432     33,942
Rifampicin      7715            2968                      646,855     65,379
Streptomycin    5125            2104                      542,640     45,037
-------------------------------------------------------------------------------------
Kanamycin       2436            697                       391,708     21,513
Amikacin        2033            573                       141,952     17,103
Capreomycin     1991            552                       341,935     15,389
Ofloxacin       2911            800                       407,235     23,905
Moxifloxacin    961             129                       97,700      11,927
Ciprofloxacin   443             37                        43,950      5,563
Ethionamide     1516            498                       344,960     15,145

Splitting the data into a training and testing set; tuning the hyper-parameters

To evaluate our classifier we use a random stratified train-test split, where the training set contains 80% and the testing set contains 20% of data. For hyper-parameter tuning, Scikit-learn provides two main approaches: grid search and randomized search cross-validation. KOVER is also equipped with two tuning techniques, K-fold cross-validation and risk bound selection. To make the comparison as consistent as possible, we use 5-fold cross-validation for KOVER and grid search with 5-fold cross-validation for all the other models. During cross-validation, balanced accuracy is used as the model selection metric for all the models except KOVER; to the best of our knowledge, KOVER does not provide the option to change the model selection metric.

Evaluating the models' performance

Evaluating the performance of an interpretable predictive model can be challenging. While most evaluation methods focus on predictive accuracy, it is essential to assess the model's interpretability as well. Although there is no consensus definition of interpretability, Murdoch et al. [58] suggest that an interpretable method should be able to provide an acceptable predictive accuracy while being easy to understand and provide meaningful insights to its audience. Adopting their idea, we evaluate the performance of our approach and the competitor methods using three metrics:

1. Predictive accuracy, measured via the balanced accuracy,
   \[ \text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}. \]
2. Simplicity, measured via the number of features (SNPs) in the trained model.
3. Insight generation, measured via the relevance of the selected SNPs to known drug resistance mechanisms.

This evaluation process is demonstrated in detail in "The comparison between interpretable and non-interpretable models" and "Results".

The comparison between interpretable and non-interpretable models

The overall pipeline consists of SNP calling and SNP grouping as described in "Data", hyper-parameter tuning as described in "Splitting the data into a training and testing set; tuning the hyper-parameters", and model training and testing using the balanced accuracy as the metric as described in "Evaluating the models' performance". This addresses the first evaluation criterion, the predictive accuracy.

To evaluate model simplicity, we investigate the SNPs selected by each model. For the rule-based classifiers, we ensure a low model complexity, and therefore a higher interpretability, by training both INGOT-DR and KOVER with the same maximum allowed rule size (number of SNPs used), $k$. By default, INGOT-DR also has a (training) specificity lower bound of $t = 90\%$, via the constraint explained in "Optimizing different target metrics such as the sensitivity and the specificity". We evaluate the simplicity of the remaining models by counting the SNPs with non-zero coefficients for LR and SVM, and the SNPs with a non-zero importance according to Scikit-learn for RF.

Lastly, to evaluate and fairly compare the models' ability to generate insights, we compare the top $k$ most important SNPs for each one [59]. For both INGOT-DR and KOVER, we simply evaluate the $k$ or fewer SNPs used in each rule. Since the other machine learning methods are not inherently interpretable, we extract the SNP importance values using the Shapley additive explanation (SHAP) algorithm [60], a model-agnostic method for making explainable predictions rooted in game theory. This algorithm, implemented in the SHAP Python package, version 0.37.0 [61], provides the guaranteed unique solution satisfying three fairness conditions. We apply TreeExplainer for RF and LinearExplainer for LR and SVM, and select the $k$ SNPs with the highest importance. We use $k = 20$ in all our experiments.

Results

INGOT-DR produces accurate predictive models

The performance of INGOT-DR compared to that of the other methods in terms of the balanced accuracy is summarized in Table 3, and Fig. 1 separately shows the sensitivity and specificity. Overall, INGOT-DR outperforms all other models on 4/12 of the drugs, obtains the best performance (tied with KOVER) on an additional drug, and achieves a balanced accuracy within 5% of the best one for the remaining 7/12 drugs. SVM-l1 achieves the best balanced accuracy in 4/12 of the drugs, while LR-l1 and KOVER obtain the best balanced accuracy in 2/12 drugs each. Furthermore, INGOT-DR has a performance exceeding that of RF in 12/12 drugs, that of KOVER, LR-l2, and SVM-l2 in 9/12 drugs, and that of LR-l1 in 8/12 drugs. SVM-l1 is the only model that remains competitive: INGOT-DR exceeds its performance in only 5/12 drugs, although INGOT-DR does obtain a marginally better balanced accuracy on average (85.7% vs. 85.3%).

Table 3 Balanced accuracy of all the methods in predicting drug resistance to 12 drugs

Drug            INGOT-DR   KOVER    LR-l1    LR-l2   RF      SVM-l1   SVM-l2
Isoniazid       0.903*     0.898    0.889    0.877   0.801   0.899    0.880
Rifampicin      0.909      0.904    0.923*   0.894   0.826   0.920    0.902
Ethambutol      0.809      0.805    0.833    0.816   0.781   0.836*   0.835
Pyrazinamide    0.873*     0.860    0.862    0.829   0.796   0.841    0.844
Streptomycin    0.826      0.839    0.852    0.840   0.792   0.859*   0.847
Kanamycin       0.856      0.864*   0.838    0.845   0.805   0.859    0.838
Amikacin        0.843      0.817    0.880*   0.853   0.785   0.853    0.851
Capreomycin     0.859*     0.826    0.836    0.812   0.764   0.826    0.812
Ethionamide     0.734      0.736    0.715    0.704   0.659   0.740*   0.702
Ofloxacin       0.912      0.908    0.909    0.840   0.788   0.914*   0.845
Moxifloxacin    0.920*     0.834    0.912    0.803   0.82    0.918    0.803
Ciprofloxacin   0.845*     0.845*   0.780    0.720   0.623   0.774    0.714

Maximum values are marked with an asterisk

Fig. 1 Sensitivity and specificity of all the methods in predicting drug resistance to 12 drugs

INGOT-DR produces interpretable models

INGOT-DR produces predictive models in the form of disjunctive (logical-OR) rules over the presence of specific SNPs, as explained in "From group testing to interpretable classification". These models are easy to understand and interpret. Although KOVER considers rules containing both presence and absence of features [30], the absence of a SNP is harder to interpret in the context of genomics, so we only focus on the presence of SNPs here. We note that, by DeMorgan's law, both methods could produce conjunctive (logical-AND) rules by training the model on the complement of the feature matrix, $\bar{A}$, and outcome vector, $\bar{y}$; however, we focus on disjunctive rules in this paper.

Table 4 Number of SNPs involved in the prediction made by each model for each drug

Drug            INGOT-DR   KOVER   LR-l1   LR-l2    RF       SVM-l1   SVM-l2
Isoniazid       20         20      1045    62,707   22,336   626      54,630
Rifampicin      20         20      739     63,621   29,373   476      52,732
Ethambutol      20         19      154     53,476   19,864   661      43,094
Pyrazinamide    20         17      114     32,885   9495     428      25,485
Streptomycin    20         13      5804    43,771   23,996   594      40,183
Kanamycin       20         20      2383    20,934   9314     231      18,716
Amikacin        20         19      2252    16,622   7639     212      14,260
Capreomycin     20         20      2103    14,907   7881     234      13,432
Ethionamide     20         20      41      14,791   7777     280      13,551
Ofloxacin       20         17      394     23,206   14,312   265      19,694
Moxifloxacin    12         7       29      11,678   1371     125      10,237
Ciprofloxacin   5          5       18      5448     325      29       4343

We display the number of SNPs used by the predictive models produced by each method in Table 4. These results, combined with those of the previous section, suggest that INGOT-DR is producing the most interpretable models without sacrificing predictive accuracy. Although KOVER almost always produces shorter rules, they tend to not generalize as well to the testing dataset.

For a specific example, we consider the most concise model produced by INGOT-DR: the one for ciprofloxacin, a drug in the fluoroquinolone family. This model has a rule size of 5, and the SNPs used are all in the gyrA gene, known to be involved in the resistance to fluoroquinolones such as ciprofloxacin in bacteria [62]. In this example, INGOT-DR not only identifies the correct gene, but also selects mutations that are known to be associated with fluoroquinolone resistance in MTB: the selected codons, 90, 91 and 94, are among the codons most strongly associated with this type of resistance [63]. We state the rule obtained by INGOT-DR below, in a standard format specifying the gene, the original amino acid, the codon number, and the mutated amino acid.

IF gyrA_A90V ∨ gyrA_S91P ∨ gyrA_D94A ∨ gyrA_D94G ∨ gyrA_D94Y
THEN Resistant to ciprofloxacin

INGOT-DR selects many SNPs in genes previously associated with drug resistance

Our results demonstrate that the models produced by INGOT-DR contain many SNPs in genes previously associated with drug resistance in MTB. This suggests that INGOT-DR not only makes accurate predictions, but that it makes them for the right reason, and could thus also be used to prioritize hypotheses about the mechanisms associated with drug resistance.

Fig. 2 Top $k \le 20$ SNPs chosen by each model, categorized by association with drug resistance

Figure 2 shows, for each of the models, the top $k \le 20$ most important SNPs, defined as all the SNPs included in a rule by KOVER and INGOT-DR, and the top $k$ SNPs by feature importance as defined by SHAP for the other models. We categorize each SNP according to the known information about its association with resistance to the drug of interest in MTB. This categorization is based on a list of 183 genes and 19 promoter regions selected out of over 4000 MTB genes through a data-driven and consensus-driven process by a panel of experts [64]. We use the following categories:

1. Drug specific association: SNP in a gene or intergenic region associated with drug resistance to the drug of interest;
2. Known association: SNP in a gene or intergenic region associated with drug resistance to any other drug;
3. Unknown association: SNP in a gene not known to be associated with drug resistance to any drug;
4. Intergenic association: SNP in an intergenic region not known to be associated with drug resistance to any drug.

We note that for the purposes of this categorization, whenever a group of SNPs in perfect LD was selected by the model, it was categorized according to the highest (lowest-numbered) category of any of the SNPs contained in the group. However, very few such SNP groups were selected by any of the models, and the absolute majority of the ones that were contained SNPs within the same gene.

A comparison between the methods based on Fig. 2 suggests that INGOT-DR and KOVER detect more SNPs in regions known to be associated with drug resistance than all the other methods, with INGOT-DR detecting slightly more such SNPs than KOVER on average, even after adjusting for the slightly more concise rules produced by KOVER relative to INGOT-DR. However, with the exception of the most common first-line drugs (top row) and the three fluoroquinolones (bottom row), even the interpretable methods tend to select more SNPs in parts of the genome not known to be associated with drug resistance, suggesting the potentially important effects of population structure in MTB.

Conclusion

In this paper, we introduced a new approach for creating rule-based classifiers. Our method, INGOT-DR, utilizes techniques from group testing and Boolean compressed sensing, and leverages a 0–1 ILP formulation. It produces classifiers that combine high accuracy with interpretability, and are flexible enough to be tailored for specific evaluation metrics.

We used INGOT-DR to produce classifiers for predicting drug resistance in MTB, by setting a minimum specificity of 90% and a maximum rule size of 20. We tested the classifiers' predictive accuracy on a variety of antibiotics commonly used for treating tuberculosis, including five first-line and seven second-line drugs. We showed that INGOT-DR produces classifiers with a balanced accuracy exceeding that of other state-of-the-art rule-based and machine learning methods. In addition, we showed that INGOT-DR produces accurate models with a rule size small enough to keep the model understandable for human users. Finally, we showed that our approach generates insights by successfully identifying SNPs associated with drug resistance, as we ascertained on the specific example of ciprofloxacin.

We note that the presence of SNPs in perfect linkage disequilibrium (LD) [57], i.e. sharing identical presence/absence patterns, is common in bacteria such as MTB whose evolution is primarily clonal [65]. For this reason, while the grouping of such SNPs substantially simplifies the computational task at hand and makes it tractable, ascertaining the exact representative of each group to be selected to predict the drug resistance status of an isolate remains difficult. The presence of clonal structure within bacterial populations is a key challenge for the prediction of drug resistance, which we plan to address in future work.

In conclusion, our work has introduced a novel method, INGOT-DR, based on group testing techniques, for producing interpretable models of drug resistance, which demonstrated a state-of-the-art accuracy, descriptive ability, and relevance on an MTB dataset. In future work, we plan to address the challenges of population structure and to extend this framework to other bacteria as well as to less frequently used antimicrobial drugs. We expect our method to become a key part of the drug resistance prediction toolkit for clinical and public health microbiology researchers.
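As a small illustration of how a disjunctive rule of this kind is applied and scored, the sketch below encodes the ciprofloxacin rule reported in the Results as a set of mutation names and computes the balanced accuracy on a few isolates. The rule itself is taken from the text; the isolates, their labels, the placeholder mutation names and the function names are made up for illustration and do not come from the study's dataset or code.

```python
# Mutations in the ciprofloxacin rule reported in the Results.
CIPROFLOXACIN_RULE = {
    "gyrA_A90V", "gyrA_S91P", "gyrA_D94A", "gyrA_D94G", "gyrA_D94Y",
}


def predict_resistant(isolate_snps, rule=CIPROFLOXACIN_RULE):
    """Return 1 if the isolate carries at least one mutation in the rule."""
    return int(bool(rule & set(isolate_snps)))


def balanced_accuracy(y_true, y_pred):
    """(Sensitivity + Specificity) / 2; assumes both classes are present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical isolates: (set of observed mutations, true resistance label).
isolates = [
    ({"gyrA_D94G", "other_snp_1"}, 1),
    ({"gyrA_A90V"}, 1),
    ({"other_snp_2"}, 1),  # resistant isolate that the rule misses
    ({"other_snp_1"}, 0),
    (set(), 0),
    ({"other_snp_3"}, 0),
]

y_true = [label for _, label in isolates]
y_pred = [predict_resistant(snps) for snps, _ in isolates]
score = balanced_accuracy(y_true, y_pred)
print(y_pred, round(score, 3))
```

On these made-up isolates the rule misses one of three resistant isolates and raises no false alarms, so the sensitivity is 2/3, the specificity is 1, and the balanced accuracy is 5/6.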
Abbreviations
CS: Compressed sensing; FN: False negatives; FP: False positives; GWAS: Genome-wide association study; ILP: Integer linear programming; LD: Linkage disequilibrium; LR: Logistic regression; MTB: Mycobacterium tuberculosis; RF: Random forests; SCM: Set covering machine; SNP: Single-nucleotide polymorphism; SVM: Support vector machine; WGS: Whole-genome sequencing.

Acknowledgements
The authors would like to acknowledge Dr. Cedric Chauve, Dr. Ben Adcock and Matthew Nguyen for helpful discussions, and the feedback from Dr. Nicholas Croucher, Dr. John Lees, Dr. Tim Walker, and Dr. Zamin Iqbal.

Authors' contributions
HZ has conceptualized and implemented the method, carried out the experiments, and analyzed the results. ND has contributed to the conceptualization and implementation. AS and NF have contributed to the data collection, preprocessing and analysis. ML and LC have contributed to conceptualizing the method, and supervised the research. All the authors have additionally contributed to writing or editing the draft and the final version of the manuscript. All authors read and approved the final manuscript.

Funding
This project was funded by a Genome Canada grant, "Machine learning to predict drug resistance in pathogenic bacteria". LC acknowledges funding from the MRC Centre for Global Infectious Disease Analysis (MR/R015600/1), funded by the UK Medical Research Council (MRC) and the UK Foreign, Commonwealth & Development Office (FCDO) under the MRC/FCDO Concordat agreement, and is part of the EDCTP2 program supported by the EU.

Availability of data and materials
The data used in this study is freely available from the ENA and the NCBI. The code is freely available on GitHub.

Declarations

Ethics approval and consent to participate
No ethics approval was required for this study.

Consent for publication
All authors have consented to this publication.

Competing interests
The authors declare that they have no competing interests.

Author details
School of Computing Science, Simon Fraser University, Burnaby, Canada. Department of Mathematics, Simon Fraser University, Burnaby, Canada. Department of Infectious Disease Epidemiology, Imperial College, London, UK.

Received: 11 February 2021 Accepted: 23 July 2021

References
1. WHO. Antimicrobial resistance: global report on surveillance. Technical report. WHO. 2014.
2. WHO. Global tuberculosis report 2020. Technical report. WHO. 2020.
3. Raviglione MC, Smith IM. XDR tuberculosis: implications for global public health. N Engl J Med. 2007;356(7):656–9.
4. O'Neill J. Antimicrobial resistance: tackling a crisis for the health and wealth of nations. Review on Antimicrobial Resistance. Technical report;
5. Steiner A, Stucki D, Coscolla M, Borrell S, Gagneux S. KvarQ: targeted and direct variant calling from fastq reads of bacterial genomes. BMC Genom. 2014;15:1–12.
6. Coll F, McNerney R, Preston M, et al. Rapid determination of anti-tuberculosis drug resistance from whole-genome sequences. Genome Med. 2015;7:51.
7. Bradley P, Gordon N, Walker T, et al. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun. 2015;6:1–15.
8. Iwai H, Kato-Miyazawa M, Kirikae T, Miyoshi-Akiyama T. CASTB (the comprehensive analysis server for the Mycobacterium tuberculosis complex): a publicly accessible web server for epidemiological analyses, drug-resistance prediction and phylogenetic comparison of clinical isolates. Tuberculosis. 2015;95:843–4.
9. Feuerriegel S, Schleusener V, Beckert P, Kohl TA, Miotto P, Cirillo DM, Cabibbe AM, Niemann S, Fellenberg K. PhyResSE: a web tool delineating Mycobacterium tuberculosis antibiotic resistance and lineage from whole-genome sequencing data. J Clin Microbiol. 2015;53(6):1908–14.
10. Schleusener V, Köser C, Beckert P, et al. Mycobacterium tuberculosis resistance prediction and lineage classification from genome sequencing: comparison of automated analysis tools. Sci Rep. 2017;7:1–9.
11. Yang Y, Niehaus KE, Walker TM, Iqbal Z, Walker AS, Wilson DJ, Peto TE, Crook DW, Smith EG, Zhu T, et al. Machine learning for classifying tuberculosis drug-resistance from DNA sequencing data. Bioinformatics. 2018;34(10):1666–71.
12. Drăghici S, Potter RB. Predicting HIV drug resistance with neural networks. Bioinformatics. 2003;19(1):98–107.
13. Arango-Argoty G, Garner E, Pruden A, Heath LS, Vikesland P, Zhang L. DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome. 2018;6(1):1–15.
14. Chen ML, Doddi A, Royer J, Freschi L, Schito M, Ezewudo M, Kohane IS, Beam A, Farhat M. Beyond multidrug resistance: leveraging rare variants with machine and statistical learning models in Mycobacterium tuberculosis resistance prediction. EBioMedicine. 2019;43:356–69.
15. Dorfman R. The detection of defective members of large populations. Ann Math Stat. 1943;14(4):436–40.
16. Foucart S, Rauhut H. A mathematical introduction to compressive sensing. In: Applied and numerical harmonic analysis. New York: Springer; 2013. https://books.google.ca/books?id=zb28BAAAQBAJ.
17. Eldar YC, Kutyniok G. Compressed sensing: theory and applications. Cambridge: Cambridge University Press; 2012.
18. Duarte MF, Eldar YC. Structured compressed sensing: from theory to applications. IEEE Trans Signal Process. 2011;59(9):4053–85.
19. Lustig M, Donoho D, Pauly JM. Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn Resonance Med. 2007;58(6):1182–95.
20. Herman MA, Strohmer T. High-resolution radar via compressed sensing. IEEE Trans Signal Process. 2009;57(6):2275–84.
21. Mathelin L, Gallivan K. A compressed sensing approach for partial differential equations with random input data. Commun Comput Phys. 2012;12(4):919–54.
22. Doostan A, Owhadi H. A non-adapted sparse approximation of PDEs with stochastic inputs. J Comput Phys. 2011;230(8):3015–34.
23. Candes EJ, Wakin MB. An introduction to compressive sampling. IEEE Signal Process Mag. 2008;25(2):21–30.
24. Malioutov D, Varshney K. Exact rule learning via Boolean compressed sensing. In: International conference on machine learning; 2013. p. 765–73.
25. Atia GK, Saligrama V. Boolean compressed sensing and noisy group testing. IEEE Trans Inf Theory. 2012;58(3):1880–901.
26. Aldridge M, Johnson O, Scarlett J, et al. Group testing: an information theory perspective. Found Trends Commun Inf Theory. 2019;15(3–4):196–392.
27. Doerken S, Avalos M, Lagarde E, Schumacher M. Penalized logistic regression with low prevalence exposures beyond high dimensional settings. PLoS ONE. 2019;14(5):1–14.
28. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
29. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.
30. Drouin A, Letarte G, Raymond F, Marchand M, Corbeil J, Laviolette F. Interpretable genotype-to-phenotype classifiers with performance guarantees. Sci Rep. 2019;9(1):1–13.
31. Cohen A, Dahmen W, DeVore R. Compressed sensing and best k-term approximation. J Am Math Soc. 2009;22(1):211–31.
32. Walker TM, Kohl TA, Omar SV, Hedge J, Del Ojo Elias C, Bradley P, Iqbal Z, Feuerriegel S, Niehaus KE, Wilson DJ, et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect Dis. 2015;15(10):1193–202.
33. Aldridge M, Baldassini L, Johnson O. Group testing algorithms: bounds and simulations. IEEE Trans Inf Theory. 2014;60(6):3671–87.
34. Natarajan BK. Sparse approximate solutions to linear systems. SIAM J Comput. 1995;24(2):227–34.
35. Malioutov D, Malyutov M. Boolean compressed sensing: LP relaxation for group testing. In: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP); 2012. p. 3305–8.
36. Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics. 2012;99(6):323–9.
37. Basu S, Kumbier K, Brown JB, Yu B. Iterative random forests to discover predictive and stable high-order interactions. Proc Natl Acad Sci. 2018;115(8):1943–8.
38. Kouchaki S, Yang Y, Walker TM, Sarah Walker A, Wilson DJ, Peto TE, Crook DW, Clifton DA. Application of machine learning techniques to tuberculosis drug resistance analysis. Bioinformatics. 2019;35(13):2276–82.
39. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. COLT '92. New York: Association for Computing Machinery; 1992. p. 144–52.
40. van Rossum G. Python tutorial. Technical report CS-R9526, Centrum voor Wiskunde en Informatica (CWI). Amsterdam; 1995.
41. Drouin A, Giguère S, Déraspe M, Marchand M, Tyers M, Loo VG, Bourgault A-M, Laviolette F, Corbeil J. Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons. BMC Genom. 2016;17(1):754.
42. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
43. Drouin A. Learn interpretable computational phenotyping models from k-merized genomic data; 2020. https://github.com/aldro61/kover.
48. Starks AM, Avilés E, Cirillo DM, Denkinger CM, Dolinger DL, Emerson C, Gallarda J, Hanna D, Kim PS, Liwski R, et al. Collaborative effort for a centralized worldwide tuberculosis relational sequencing data platform. Clin Infect Dis. 2015;61(suppl_3):141–6.
49. Ngo T-M, Teo Y-Y. Genomic prediction of tuberculosis drug-resistance: benchmarking existing databases and prediction algorithms. BMC Bioinform. 2019;20(1):68.
50. Deelder W, Christakoudi S, Phelan J, Diez Benavente E, Campino S, McNerney R, Palla L, Clark TG. Machine learning predicts accurately Mycobacterium tuberculosis drug resistance from whole genome sequencing data. Front Genet. 2019;10:922.
51. Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tárraga A, et al. The European nucleotide archive. Nucleic Acids Res. 2011;39:28–31.
52. Leinonen R, Sugawara H, Shumway M, Collaboration INSD. The sequence read archive. Nucleic Acids Res. 2010;39(suppl_1):19–21.
53. Coll F, McNerney R, Guerra-Assunção JA, Glynn JR, Perdigão JA, Viveiros M, Portugal I, Pain A, Martin N, Clark TG. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat Commun. 2014;5:1–5.
54. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv; 2013.
55. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
56. Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, der Auwera GAV, Kling DE, Gauthier LD, Levy-Moonshine A, Roazen D, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2017.
57. San JE, Baichoo S, Kanzi A, Moosa Y, Lessells R, Fonseca V, Mogaka J, Power R, de Oliveira T. Current affairs of microbial genome-wide association studies: approaches, bottlenecks and analytical pitfalls. Front Microbiol. 2020;10:3119.
58. Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B. Interpretable machine learning: definitions, methods, and applications. arXiv; 2019.
59. Saber MM, Shapiro BJ. Benchmarking bacterial genome-wide association study methods using simulated genomes and phenotypes. Microb Genom. 2020;6(3):000337.
60. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors. Advances in neural information processing systems, vol. 30; 2017. p. 4765–74.
61. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, Katz R, Himmelfarb J, Bansal N, Lee S-I. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):2522–5839.
62. Drlica K, Zhao X. DNA gyrase, topoisomerase IV, and the 4-quinolones. Microbiol Mol Biol Rev. 1997;61(3):377–92.
63. Avalos E, Catanzaro D, Catanzaro A, Ganiats T, Brodine S, Alcaraz J, Rodwell T. Frequency and geographic distribution of gyrA and gyrB mutations associated with fluoroquinolone resistance in clinical Mycobacterium tuberculosis isolates: a systematic review. PLoS ONE. 2015;10(3):0120470.
64. Miotto P, Tessema B, Tagliani E, Chindelevitch L, et al.
A standardised 44. IBM. IBM ILOG CPLEX optimization studio V12.10.0 documentation. Inter- method for interpreting the association between mutations and national Business Machines Corporation. 2020. phenotypic drug-resistance in Mycobacterium tuberculosis. Eur Respir J. 45. Mitchell S, O’Sullivan M, Dunning I. PuLP: a linear programming toolkit 2017;50(6):170. for Python. 2011. http:// www. optim izati on- online. org/ DB_ FILE/ 2011/ 09/ 65. Gagneux S. Ecology and evolution of Mycobacterium tuberculosis. Nat Rev 3178. pdf. Microbiol. 2018;16:202–13. 46. Lougee-Heimer R. The common optimization interface for operations research: promoting open-source software in the operations research Publisher’s Note community. IBM J Res Dev. 2003;47(1):57–66. https:// doi. org/ 10. 1147/ rd. Springer Nature remains neutral with regard to jurisdictional claims in pub- 471. 0057. lished maps and institutional affiliations. 47. Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R, et al. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 2014;42(D1):581–91.
Algorithms for Molecular Biology – Springer Journals
Published: Aug 10, 2021
Keywords: Drug resistance; Interpretable machine learning; Group testing; Integer linear programming; Rule-based learning; Whole-genome sequencing