A New Method of Automatic Text Document Classification

Publisher
Springer Journals
Copyright
Copyright © Allerton Press, Inc. 2021. ISSN 0005-1055, Automatic Documentation and Mathematical Linguistics, 2021, Vol. 55, No. 3, pp. 122–133. © Allerton Press, Inc., 2021. Russian Text © The Author(s), 2021, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy, 2021, No. 6, pp. 32–43.
ISSN
0005-1055
eISSN
1934-8371
DOI
10.3103/s0005105521030080

Abstract

This paper describes the procedures and specific features of applying a new method of automatic classification based on calculating the deviations of the stop-word distribution from the Zipfian score. To neutralize discrepancies in text lengths, the author describes and applies a text undersampling methodology. The concept of an iterative threshold level is introduced to reduce text dimensionality to several dozen units. To evaluate the method’s efficiency, the author has developed indicators of discriminative and similarative power that underlie a generalized efficiency score. Fourteen tests, including a comparison with the cosine similarity measure, demonstrated the high efficiency of the proposed method for authorship attribution of fiction texts and clustering of political texts.
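
The abstract does not give the formulas, so the following Python sketch is only an illustration of the general idea, not the author's method. It assumes an ideal Zipf frequency proportional to 1/rank, random token sampling as the undersampling step, and a tiny hypothetical stop-word list; the names `STOP_WORDS`, `undersample`, `zipf_deviations`, and `cosine_similarity` are all introduced here for illustration. The iterative threshold level and the efficiency indicators are not modeled.

```python
import math
import random
from collections import Counter

# Hypothetical stop-word list; the list used in the paper is not given in the abstract.
STOP_WORDS = ["the", "of", "and", "to", "a", "in"]


def undersample(tokens, size, seed=0):
    """Draw `size` tokens at random so texts of different lengths are comparable.

    Assumption: simple random sampling without replacement; the paper's
    undersampling procedure may differ.
    """
    if len(tokens) <= size:
        return list(tokens)
    return random.Random(seed).sample(list(tokens), size)


def zipf_deviations(tokens, stop_words=STOP_WORDS):
    """Map each stop word to (observed share) - (ideal Zipf share).

    Assumption: the ideal share at rank r is proportional to 1/r, normalised
    over the stop-word list so that the ideal shares sum to 1.
    """
    counts = Counter(t for t in tokens if t in stop_words)
    total = sum(counts.values())
    if total == 0:
        return {w: 0.0 for w in stop_words}
    # Rank stop words by observed count, breaking ties alphabetically.
    ranked = sorted(stop_words, key=lambda w: (-counts[w], w))
    harmonic = sum(1.0 / r for r in range(1, len(ranked) + 1))
    deviations = {}
    for rank, word in enumerate(ranked, start=1):
        observed = counts[word] / total
        expected = (1.0 / rank) / harmonic
        deviations[word] = observed - expected
    return deviations


def cosine_similarity(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    text_a = "the cat sat in the hat and the dog ran to a park".split()
    text_b = "a tale of the sea and of the men who went to it in a boat".split()
    size = min(len(text_a), len(text_b))
    dev_a = zipf_deviations(undersample(text_a, size))
    dev_b = zipf_deviations(undersample(text_b, size))
    vec_a = [dev_a[w] for w in STOP_WORDS]
    vec_b = [dev_b[w] for w in STOP_WORDS]
    print(cosine_similarity(vec_a, vec_b))
```

Each text is reduced to a short vector of deviations, one per stop word, which can then be compared with any similarity measure; cosine similarity is used here only because the abstract names it as the baseline for comparison.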

Journal

Automatic Documentation and Mathematical Linguistics, Springer Journals

Published: May 1, 2021

Keywords: automatic text classification; methods and algorithms; Zipf distribution; reduction of text dimensionality; threshold levels; efficiency indices
