Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Blogs, Twitter Feeds, and Reddit Comments: Cross-domain Authorship Attribution

Blogs, Twitter Feeds, and Reddit Comments: Cross-domain Authorship Attribution Abstract Stylometry is a form of authorship attribution that relies on the linguistic information to attribute documents of unknown authorship based on the writing styles of a suspect set of authors. This paper focuses on the cross-domain subproblem where the known and suspect documents differ in the setting in which they were created. Three distinct domains, Twitter feeds, blog entries, and Reddit comments, are explored in this work. We determine that state-of-the-art methods in stylometry do not perform as well in cross-domain situations (34.3% accuracy) as they do in in-domain situations (83.5% accuracy) and propose methods that improve performance in the cross-domain setting with both feature and classification level techniques which can increase accuracy to up to 70%. In addition to testing these approaches on a large real world dataset, we also examine real world adversarial cases where an author is actively attempting to hide their identity. Being able to identify authors across domains facilitates linking identities across the Internet making this a key security and privacy concern; users can take other measures to ensure their anonymity, but due to their unique writing style, they may not be as anonymous as they believe. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Proceedings on Privacy Enhancing Technologies de Gruyter

Blogs, Twitter Feeds, and Reddit Comments: Cross-domain Authorship Attribution

Loading next page...
 
/lp/de-gruyter/blogs-twitter-feeds-and-reddit-comments-cross-domain-authorship-6eNOWKnlrd

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
de Gruyter
Copyright
Copyright © 2016 by the
ISSN
2299-0984
eISSN
2299-0984
DOI
10.1515/popets-2016-0021
Publisher site
See Article on Publisher Site

Abstract

Abstract Stylometry is a form of authorship attribution that relies on the linguistic information to attribute documents of unknown authorship based on the writing styles of a suspect set of authors. This paper focuses on the cross-domain subproblem where the known and suspect documents differ in the setting in which they were created. Three distinct domains, Twitter feeds, blog entries, and Reddit comments, are explored in this work. We determine that state-of-the-art methods in stylometry do not perform as well in cross-domain situations (34.3% accuracy) as they do in in-domain situations (83.5% accuracy) and propose methods that improve performance in the cross-domain setting with both feature and classification level techniques which can increase accuracy to up to 70%. In addition to testing these approaches on a large real world dataset, we also examine real world adversarial cases where an author is actively attempting to hide their identity. Being able to identify authors across domains facilitates linking identities across the Internet making this a key security and privacy concern; users can take other measures to ensure their anonymity, but due to their unique writing style, they may not be as anonymous as they believe.

Journal

Proceedings on Privacy Enhancing Technologiesde Gruyter

Published: Jul 1, 2016

References