Assessing a learning method's predictive ability on independent data is extremely important in supervised classification. This assessment provides the information needed to evaluate the quality of a classification model and to choose the most appropriate technique for the specific supervised classification problem at hand. This paper reviews the most important aspects of the evaluation process for supervised classification algorithms, putting the overall process in perspective so as to lead the reader to a deep understanding of it. In addition, recommendations on the use and limitations of the reviewed methods, together with a critical view of them, are presented according to the specific characteristics of the supervised classification scenario.
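As a minimal illustration of the kind of evaluation process the paper surveys (not an implementation from the paper itself), the following sketch estimates classification error by k-fold cross-validation in plain Python. The 1-nearest-neighbour classifier and the toy one-dimensional data set are hypothetical, chosen only to keep the example self-contained:

```python
import random

def k_fold_cross_validation(X, y, train_fn, predict_fn, k=5, seed=0):
    """Estimate classification error by k-fold cross-validation.

    The data are shuffled and split into k folds; each fold serves once
    as the test set while the remaining folds form the training set.
    Returns the mean error rate over the k folds.
    """
    n = len(X)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin fold assignment
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict_fn(model, X[i]) for i in test_idx]
        err = sum(p != y[i] for p, i in zip(preds, test_idx)) / len(test_idx)
        errors.append(err)
    return sum(errors) / k

# Hypothetical toy classifier: 1-nearest neighbour on 1-D inputs.
def train_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    return min(model, key=lambda xy: abs(xy[0] - x))[1]

# Two well-separated clusters, so the estimated error should be 0.0.
X = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4, 0.15, 2.15]
y = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
print(k_fold_cross_validation(X, y, train_1nn, predict_1nn, k=5))  # → 0.0
```

In practice one would use a tested library routine rather than a hand-rolled splitter, but the sketch makes explicit the two ingredients the review discusses at length: how the data are partitioned, and how fold-wise error estimates are aggregated into a single performance figure.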
Artificial Intelligence Review – Springer Journals
Published: Jun 30, 2015