Size Matters: The Impact of Training Size in Taxonomically-Enriched Word Embeddings

Alfredo Maldonado; Filip Klubička; John Kelleher

doi:10.1515/comp-2019-0009

Publisher: de Gruyter
Copyright: © 2019 Alfredo Maldonado et al., published by De Gruyter Open
eISSN: 2299-1093
DOI: 10.1515/comp-2019-0009

Abstract

Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity (“topical relatedness”) on word pairs such as ‘coffee’ and ‘cup’ or ‘bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘cup’ and ‘mug’ (near synonyms) or ‘bus’ and ‘train’ (types of public transport). Moreover, purely taxonomy-based embeddings (e.g., those trained on a random walk of WordNet’s structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with taxonomic information from taxonomies like WordNet. This taxonomic enrichment can be done by combining natural-corpus embeddings with taxonomic embeddings (e.g., those trained on a random walk of WordNet’s structure). This paper conducts a deep analysis of this assumption and shows that both the size of the natural corpus and of the random-walk coverage of the WordNet structure play a crucial role in the performance of combined (enriched) vectors in both similarity tasks. Specifically, we show that embeddings trained on medium-sized natural corpora benefit the most from taxonomic enrichment whilst embeddings trained on large natural corpora only benefit from this enrichment when evaluated on taxonomic similarity tasks. The implication of this is that care has to be taken in controlling the size of the natural corpus and the size of the random walk used to train vectors. In addition, we find that, whilst the WordNet structure is finite and it is possible to fully traverse it in a single pass, the repetition of well-connected WordNet concepts in extended random walks effectively reinforces taxonomic relations in the learned embeddings.
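The two ingredients named in the abstract, a random walk over a taxonomy and the combination of natural-corpus and taxonomic vectors, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy graph stands in for WordNet, the function names are hypothetical, and concatenating L2-normalised vectors is only one possible way to combine embeddings (the abstract does not state which method the authors use).

```python
import math
import random

# Toy hypernym/hyponym graph standing in for WordNet (hypothetical data).
TOY_TAXONOMY = {
    "vehicle": ["bus", "train"],
    "bus": ["vehicle"],
    "train": ["vehicle"],
    "container": ["cup", "mug"],
    "cup": ["container"],
    "mug": ["container"],
}


def random_walk(graph, start, length, rng):
    """Return a pseudo-sentence of up to `length` nodes by following random edges."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        if not neighbours:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbours))
    return walk


def build_pseudo_corpus(graph, n_sentences, length, seed=0):
    """Generate `n_sentences` walks; more sentences means larger random-walk coverage."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    return [random_walk(graph, rng.choice(nodes), length, rng)
            for _ in range(n_sentences)]


def l2_normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(natural_vec, taxonomic_vec):
    """Concatenate two L2-normalised vectors (an assumed combination method)."""
    return l2_normalise(natural_vec) + l2_normalise(taxonomic_vec)


# A small pseudo-corpus that an embedding trainer could consume as sentences.
corpus = build_pseudo_corpus(TOY_TAXONOMY, n_sentences=5, length=4)

# Combine a made-up natural-corpus vector with a made-up taxonomic vector.
enriched = combine([3.0, 4.0], [0.0, 2.0])
```

In this sketch, raising `n_sentences` plays the role of the random-walk size discussed in the abstract: nodes with more edges (here `vehicle` and `container`) recur more often as the pseudo-corpus grows.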

Journal

Open Computer Science – de Gruyter

Published: Jan 1, 2019
