Unit Under Test Identification Using Natural Language Processing Techniques

Matej Madeja; Jaroslav Porubän

doi:10.1515/comp-2020-0150

Publisher: De Gruyter
Copyright: © 2021 Matej Madeja et al., published by De Gruyter
eISSN: 2299-1093
DOI: 10.1515/comp-2020-0150

Abstract

Identifying the unit under test (UUT) is often difficult because of test smells, such as testing multiple UUTs in one test. Because tests best reflect the current product specification, they can be used to comprehend parts of the production code and the relationships between them. Because a test and its UUT share a similar vocabulary, this paper applies five Natural Language Processing (NLP) techniques to the source code of five popular GitHub projects and compares the results with manually identified UUTs. The tf-idf model achieved the best accuracy: 22% for ranking the correct UUT first, and 57% when the manually identified UUT was allowed to appear anywhere up to fifth place. These results were obtained after preprocessing the input documents with Java keyword removal and word splitting. The tf-idf model also had the shortest training time, and an index search takes less than 1 s per request, so it could serve as a support tool in an Integrated Development Environment (IDE) in the future. Among the preprocessing steps, word splitting improves accuracy the most, while removing Java keywords yields only a small improvement for the tf-idf model. Removing comments only slightly worsens the accuracy of the NLP models. Word splitting was also the fastest preprocessing step, taking 0.3 s on average to process all documents in a project.
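The pipeline described in the abstract (word splitting, Java keyword removal, tf-idf ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, the keyword set is a small illustrative subset, the idf formula is one common variant, and all function names (`preprocess`, `rank_candidates`, and so on) are hypothetical.

```python
from __future__ import annotations

import math
import re
from collections import Counter

# Illustrative subset only (assumption); a real run would use the full Java keyword list.
JAVA_KEYWORDS = {"public", "private", "class", "void", "new", "return",
                 "import", "static", "final", "int"}


def preprocess(source: str) -> list[str]:
    """Split identifiers into lowercase words and drop Java keywords."""
    words = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        # Word split: break on camelCase boundaries, acronyms, digits, underscores.
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", identifier):
            word = part.lower()
            if word not in JAVA_KEYWORDS:
                words.append(word)
    return words


def weigh(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Build a sparse tf-idf vector; terms unknown to the index are ignored."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(test_source: str, production: dict[str, str]) -> list[str]:
    """Return production class names ordered by similarity to the test."""
    docs = {name: preprocess(src) for name, src in production.items()}
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    # Assumed idf variant: log(N / df) + 1, so shared terms keep a nonzero weight.
    idf = {t: math.log(len(docs) / df) + 1.0 for t, df in doc_freq.items()}
    query = weigh(preprocess(test_source), idf)
    scores = {name: cosine(query, weigh(tokens, idf)) for name, tokens in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions, a top-five tolerance check of the kind the abstract reports would amount to testing whether the manually identified UUT appears in `rank_candidates(...)[:5]`.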

Journal

Open Computer Science – De Gruyter

Published: Jan 1, 2021

Keywords: natural language processing; unit under test; program comprehension; automated identification; software maintenance
