Albanian Text Classification: Bag of Words Model and Word Analogies

Arbana Kadriu; Lejla Abazi; Hyrije Abazi

doi:10.2478/bsrj-2019-0006



Publisher: de Gruyter
Copyright: © 2019 Arbana Kadriu et al., published by Sciendo
ISSN: 1847-9375
eISSN: 1847-9375

Abstract

Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms.

Objectives: We focus on the classification of Albanian news articles using two approaches.

Methods/Approach: In the first approach, the words in a collection are treated as independent components, each assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing accuracy on the remaining articles. In the second approach, words are treated according to their semantic and syntactic similarities, on the assumption that a word is formed by character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time.

Results: Our results show that the bag of words model performs better than fastText when classification is tested on a dataset that is not large. fastText performs better when classifying multi-label text.

Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag of words model, with an accuracy of 94%.
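
The two word representations contrasted in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The toy documents and the helper names `bag_of_words` and `char_ngrams` are hypothetical. The `<` and `>` boundary markers and the 3 to 6 n-gram range follow the convention described in the fastText subword paper and are assumed here, since the abstract does not state them.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return a count vector over a fixed vocabulary; word order is discarded."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in '<' and '>' markers.

    The markers and the default 3..6 range are assumptions borrowed from the
    fastText subword convention, not values given in the abstract.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Hypothetical toy documents, not taken from the paper's corpus.
docs = [
    "ekipi fitoi ndeshjen".split(),
    "qeveria miratoi ligjin".split(),
]

# First approach: each word is an independent dimension of the vector space.
vocabulary = sorted({token for doc in docs for token in doc})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

# Second approach: a word is decomposed into character n-grams (trigrams here).
trigrams = char_ngrams("ligjin", n_min=3, n_max=3)
```

In the bag of words view the two toy documents share no dimensions, whereas inflected forms of the same word would share most of their character n-grams, which is the motivation for using sub-word information.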

Journal

Business Systems Research Journal – de Gruyter

Published: Apr 1, 2019
