Assessing a learning method's predictive ability on independent data is extremely important in supervised classification. This assessment provides the information needed to evaluate the quality of a classification model and to choose the most appropriate technique for the specific supervised classification problem at hand. This paper reviews the most important aspects of the evaluation process for supervised classification algorithms, putting the overall process in perspective so as to lead the reader to a deep understanding of it. In addition, recommendations on the use and limitations of the reviewed methods, together with a critical view of them, are presented according to the specific characteristics of the supervised classification scenario.
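As a minimal illustration of the kind of evaluation process the paper surveys (not an implementation from the paper itself), the following sketch estimates classification error by k-fold cross-validation in plain Python. The 1-nearest-neighbour classifier and the toy one-dimensional data set are hypothetical, chosen only to keep the example self-contained:

```python
import random

def k_fold_cross_validation(X, y, train_fn, predict_fn, k=5, seed=0):
    """Estimate classification error by k-fold cross-validation.

    The data are shuffled and split into k folds; each fold serves once
    as the test set while the remaining folds form the training set.
    Returns the mean error rate over the k folds.
    """
    n = len(X)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin fold assignment
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict_fn(model, X[i]) for i in test_idx]
        err = sum(p != y[i] for p, i in zip(preds, test_idx)) / len(test_idx)
        errors.append(err)
    return sum(errors) / k

# Hypothetical toy classifier: 1-nearest neighbour on 1-D inputs.
def train_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    return min(model, key=lambda xy: abs(xy[0] - x))[1]

# Two well-separated clusters, so the estimated error should be 0.0.
X = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4, 0.15, 2.15]
y = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
print(k_fold_cross_validation(X, y, train_1nn, predict_1nn, k=5))  # → 0.0
```

In practice one would use a tested library routine rather than a hand-rolled splitter, but the sketch makes explicit the two ingredients the review discusses at length: how the data are partitioned, and how fold-wise error estimates are aggregated into a single performance figure.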
Artificial Intelligence Review – Springer Journals
Published: Jun 30, 2015