
An effective web document clustering algorithm based on bisection and merge



Artificial Intelligence Review, Volume 36 (1) – Jan 18, 2011


References (30)

Publisher: Springer Journals
Copyright: © 2011 by Springer Science+Business Media B.V.
Subject: Computer Science; Computer Science, general; Artificial Intelligence (incl. Robotics)
ISSN: 0269-2821
eISSN: 1573-7462
DOI: 10.1007/s10462-011-9203-4

Abstract

To cluster web documents that all contain the same name entities, we first tried existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, these algorithms proved ineffective for this task. Our investigation found that clustering such web pages is harder than general clustering problems because (1) the number of clusters in the ground truth is larger than the two or three clusters typical of general clustering problems and (2) the cluster sizes in the data set follow an extremely skewed distribution. To overcome these problems, we propose a clustering algorithm that boosts the accuracy of K-means and spectral clustering. In particular, to deal with skewed distributions of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.
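The abstract names two standard building blocks, normalized cuts and spectral bisection of a similarity graph, without giving the details of the proposed procedure. The sketch below illustrates only those two building blocks and is not the authors' algorithm. The function names, the use of NumPy, and the dense symmetric similarity matrix W are assumptions made for the example. The merge step is not sketched because the abstract does not state its criterion.

```python
import numpy as np


def normalized_cut(W, mask):
    """Normalized cut of the bipartition (A, B) given by a boolean mask.

    Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V),
    where W is a symmetric similarity matrix (an assumption of this sketch).
    """
    mask = np.asarray(mask, dtype=bool)
    cut = W[mask][:, ~mask].sum()   # total similarity crossing the partition
    assoc_a = W[mask].sum()         # total similarity from A to all nodes
    assoc_b = W[~mask].sum()        # total similarity from B to all nodes
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")         # degenerate split: one side is empty or isolated
    return cut / assoc_a + cut / assoc_b


def spectral_bisect(W):
    """Split nodes in two by the sign of the second-smallest eigenvector
    of the symmetric normalized Laplacian.

    Assumes every node has positive degree; returns a boolean mask.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(laplacian)   # eigenvalues come back in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]     # map back to the generalized eigenvector
    return fiedler > 0


# Hypothetical toy graph: documents 0-2 are similar to each other, as are 3-4.
W = np.array([
    [0.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 0.0],
])
mask = spectral_bisect(W)
print(mask, normalized_cut(W, mask))
```

On this toy graph the bisection separates documents 0-2 from documents 3-4. Which side is marked True depends on the sign of the eigenvector returned by the solver. A low normalized cut indicates a well-separated split, so a bisection-and-merge scheme could plausibly use this value to decide when to split further or merge back, though the thresholds used in the paper are not given here.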

Journal

Artificial Intelligence Review, Springer Journals

Published: Jan 18, 2011
