Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Leakage in data mining: Formulation, detection, and avoidance

Leakage in data mining: Formulation, detection, and avoidance Leakage in Data Mining: Formulation, Detection, and Avoidance SHACHAR KAUFMAN and SAHARON ROSSET, Tel Aviv University CLAUDIA PERLICH and ORI STITELMAN, m6d Deemed "one of the top ten data mining mistakes", leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png ACM Transactions on Knowledge Discovery from Data (TKDD) Association for Computing Machinery

Loading next page...
 
/lp/association-for-computing-machinery/leakage-in-data-mining-formulation-detection-and-avoidance-dhFju6OC41
Publisher
Association for Computing Machinery
Copyright
Copyright © 2012 by ACM Inc.
ISSN
1556-4681
DOI
10.1145/2382577.2382579
Publisher site
See Article on Publisher Site

Abstract

Leakage in Data Mining: Formulation, Detection, and Avoidance SHACHAR KAUFMAN and SAHARON ROSSET, Tel Aviv University CLAUDIA PERLICH and ORI STITELMAN, m6d Deemed "one of the top ten data mining mistakes", leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for

Journal

ACM Transactions on Knowledge Discovery from Data (TKDD)Association for Computing Machinery

Published: Dec 1, 2012

There are no references for this article.