The number of classes as a source for instability of decision tree algorithms in high dimensional datasets

Artificial Intelligence Review, Volume 43 (2) – Dec 5, 2012

Publisher: Springer Journals
Copyright: © 2012 by Springer Science+Business Media Dordrecht
Subject: Computer Science; Artificial Intelligence (incl. Robotics); Computer Science, general
ISSN: 0269-2821
eISSN: 1573-7462
DOI: 10.1007/s10462-012-9374-7

Abstract

Experimental studies have long been performed in many fields of AI, especially in machine learning. A careful evaluation of a specific machine learning algorithm is important, but often difficult to conduct in practice. Simulation studies, on the other hand, can provide insights into the behavior and performance of machine learning approaches much more readily than real-world datasets, where the target concept is normally unknown. For decision tree induction algorithms, an interesting source of instability that researchers sometimes neglect is the number of classes in the training set. This paper uses simulation to extend previous work by Leo Breiman on the properties of splitting criteria. Our simulation results show that the number of best-splits grows with the number of classes: exponentially for both the entropy and twoing criteria, and linearly for the gini criterion. Since more splits imply more alternative choices, decreasing the number of classes in high dimensional datasets (ranging from hundreds to thousands of attributes, typically found in biomedical domains) can help lower the instability of decision trees. Another important contribution of this work is the finding that, for fewer than five classes, balanced datasets are prone to provide more best-splits (thus increasing instability) than imbalanced ones; this includes the binary problems often addressed in machine learning. For five or more classes, on the other hand, balanced datasets can provide few best-splits.
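
As a minimal sketch (not the authors' simulation code), the three splitting criteria named in the abstract can be evaluated for a single candidate binary split as shown here. The function names and the class-count vectors are illustrative assumptions; the formulas are the standard textbook definitions of Gini impurity, entropy, and the twoing rule from the CART literature.

```python
import math


def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given its per-class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def impurity_decrease(impurity, left, right):
    """Impurity of the parent minus the weighted impurity of the two children."""
    parent = [l + r for l, r in zip(left, right)]
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (impurity(parent)
            - (n_left / n) * impurity(left)
            - (n_right / n) * impurity(right))


def twoing(left, right):
    """Twoing value: (pL * pR / 4) * (sum_j |p(j|L) - p(j|R)|)^2."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    diff = sum(abs(l / n_left - r / n_right) for l, r in zip(left, right))
    return (n_left / n) * (n_right / n) / 4.0 * diff ** 2


# Hypothetical balanced four-class node split into two children
# (per-class counts in the left and right child).
left = [10, 10, 0, 0]
right = [0, 0, 10, 10]

print("gini decrease:   ", impurity_decrease(gini, left, right))
print("entropy decrease:", impurity_decrease(entropy, left, right))
print("twoing value:    ", twoing(left, right))
```

Counting how many distinct candidate splits tie for the best value of each criterion, over many simulated class distributions, is the kind of experiment the abstract describes; the sketch above covers only the scoring step.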

Journal: Artificial Intelligence Review, Springer Journals

Published: Dec 5, 2012