Using Word Embeddings to Deter Intellectual Property Theft through Automated Generation of Fake Documents

Almas Abdibayev; Dongkai Chen; Haipeng Chen; Deepti Poluru; V. S. Subrahmanian

doi:10.1145/3418289



Publisher: Association for Computing Machinery
Copyright: © 2021 ACM
ISSN: 2158-656X
eISSN: 2158-6578
DOI: 10.1145/3418289

Abstract

Theft of intellectual property is a growing problem, one that is exacerbated by the fact that a successful compromise of an enterprise might only become known months after the hack. A recent solution called FORGE addresses this problem by automatically generating N “fake” versions of any real document, so that the attacker must determine which of the N + 1 documents exfiltrated from a compromised network is real. In this article, we remove two major drawbacks of FORGE. (i) FORGE requires ontologies in order to generate fake documents; in the real world, however, ontologies, especially good ones, are seldom available. The WE-FORGE system proposed in this article completely eliminates the need for ontologies by using distance metrics on word embeddings instead. (ii) FORGE generates fake documents by first identifying “target” concepts in the original document and then substituting “replacement” concepts for them. We show that this can lead to sub-optimal results: because target concepts are selected without knowing the availability and/or quality of the replacement concepts, the resulting fake documents can be poor. WE-FORGE addresses this problem in two possible ways, using a joint optimization to select concepts and replacements simultaneously. We conduct a human study involving both computer science and chemistry documents and show that WE-FORGE successfully deceives adversaries.
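
The abstract describes two ideas: measuring distance between word embeddings instead of consulting an ontology, and choosing target concepts and their replacements together rather than one after the other. The Python sketch below illustrates both ideas on toy data. It is not the WE-FORGE algorithm: the three-dimensional vectors, the words, the `ideal` distance parameter, and the greedy pair scoring are all assumptions made for illustration, and the greedy selection is only a simple stand-in for the joint optimization the article refers to.

```python
import math

# Toy 3-dimensional "embeddings". These vectors are invented for illustration;
# a real system would load trained word vectors instead.
EMBEDDINGS = {
    "catalyst": (0.9, 0.1, 0.0),
    "reagent": (0.8, 0.3, 0.1),
    "solvent": (0.6, 0.6, 0.2),
    "compiler": (0.0, 0.2, 0.9),
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def best_pairs(doc_words, k, ideal=0.1):
    """Score all (target, replacement) pairs together and keep the k best.

    `ideal` is a hypothetical tuning knob: the distance at which a replacement
    is assumed to be plausible yet still different from the target.
    """
    scored = []
    for target in set(doc_words):
        if target not in EMBEDDINGS:
            continue  # no vector available, so the word cannot be a target
        for candidate, vec in EMBEDDINGS.items():
            if candidate == target:
                continue
            d = cosine_distance(EMBEDDINGS[target], vec)
            scored.append((abs(d - ideal), target, candidate))
    scored.sort()  # pairs closest to the ideal distance come first
    chosen = {}
    for _, target, candidate in scored:
        if target not in chosen and len(chosen) < k:
            chosen[target] = candidate
    return chosen


def make_fake(text, k=1):
    """Return a copy of `text` with up to k target words replaced."""
    words = text.split()
    mapping = best_pairs(words, k)
    return " ".join(mapping.get(w, w) for w in words)


if __name__ == "__main__":
    print(make_fake("the catalyst speeds the reaction"))
```

Because each target is scored together with its candidate replacements, a word with no good replacement is never selected as a target, which is the failure mode the abstract attributes to choosing targets first.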

Journal

ACM Transactions on Management Information Systems (TMIS) – Association for Computing Machinery

Published: Feb 2, 2021

Keywords: AI security
