D Gamberger, N Lavrač, F Zelezný, J Tolar (2004) Induction of comprehensible models for gene expression datasets by subgroup discovery methodology. J Biomed Inform, 37
L Breiman (1996) Heuristics of instability and stabilization in model selection. Ann Stat, 24
JR Quinlan (1993) C4.5: programs for machine learning
GM Weiss, F Provost (2003) Learning when training data are costly: the effect of class distribution on tree induction. J Artif Intell Res, 19
P Domingos (1999) MetaCost: a general method for making classifiers cost-sensitive
O Bousquet, A Elisseeff (2002) Stability and generalization. J Mach Learn Res, 2
SM Weiss, N Indurkhya, T Zhang, F Damerau (2004) Text mining: predictive methods for analyzing unstructured information
N Rosenfeld, R Aharonov, E Meiri, S Rosenwald, Y Spector, M Zepeniuk, H Benjamin, N Shabes, S Tabak, A Levy (2008) MicroRNAs accurately identify cancer tissue origin. Nat Biotechnol, 26
GEAPA Batista, RC Prati, MC Monard (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl, 6
L Breiman, J Friedman, R Olshen, C Stone (1984) Classification and regression trees
L Breiman (1996b) Technical note: some properties of splitting criteria. Mach Learn, 24
Experimental studies have long been performed in many fields of AI, especially in machine learning. A careful evaluation of a specific machine learning algorithm is important, but often difficult to conduct in practice. Simulation studies, on the other hand, can provide insights into the behavior and performance of machine learning approaches more readily than real-world datasets, where the target concept is normally unknown. For decision tree induction algorithms, an interesting source of instability that is sometimes neglected by researchers is the number of classes in the training set. This paper uses simulation to extend previous work by Leo Breiman on properties of splitting criteria. Our simulation results show that the number of best splits grows with the number of classes: exponentially for the entropy and twoing criteria, and linearly for the gini criterion. Since more splits imply more alternative choices, decreasing the number of classes in high-dimensional datasets (with hundreds to thousands of attributes, as typically found in biomedical domains) can help lower the instability of decision trees. Another important contribution of this work is that, for fewer than five classes, balanced datasets are prone to provide more best splits (thus increasing instability) than imbalanced ones, including the binary problems often addressed in machine learning; on the other hand, for five or more classes, balanced datasets can provide few best splits.
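The notion of tied "best splits" driving instability can be illustrated with a small sketch (the function names and the toy three-class label vector below are mine, not the paper's): several candidate partitions of a balanced multi-class sample can achieve exactly the same impurity decrease under the gini criterion, so the tree learner's choice among them is arbitrary.

```python
import math
from collections import Counter
from itertools import combinations

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Shannon entropy: -sum p_k * log2(p_k) over class proportions.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_quality(labels, left_idx, criterion):
    # Impurity decrease achieved by a binary partition of `labels`.
    left = [labels[i] for i in left_idx]
    right = [labels[i] for i in range(len(labels)) if i not in left_idx]
    n = len(labels)
    weighted = (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)
    return criterion(labels) - weighted

# Toy balanced 3-class sample; enumerate every 2-element "left" group.
labels = ['a', 'a', 'b', 'b', 'c', 'c']
scores = {idx: round(split_quality(labels, set(idx), gini), 6)
          for idx in combinations(range(len(labels)), 2)}
best = max(scores.values())
best_splits = [idx for idx, s in scores.items() if s == best]
print(len(best_splits))  # the three pure pairs (a,a), (b,b), (c,c) tie for best
```

Here the three class-pure pairs all yield the same gini decrease, so three distinct splits are "best"; with more classes (or the entropy criterion) the number of such ties grows, which is the instability mechanism the abstract describes.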
Artificial Intelligence Review – Springer Journals
Published: Dec 5, 2012