Abstract
Identification of the unit under test (UUT) is often difficult due to test smells, such as testing multiple UUTs in a single test. Because tests best reflect the current product specification, they can be used to comprehend parts of the production code and the relationships between them. Since a test and its UUT tend to share a similar vocabulary, this paper applies five NLP techniques to the source code of 5 popular GitHub projects. The collected results were compared with manually identified UUTs. The tf-idf model achieved the best accuracy: 22% for the correct UUT ranked first and 57% with a tolerance of up to the fifth position of the ranking. These results were obtained after preprocessing the input documents with Java keyword removal and word splitting. The tf-idf model also had the shortest training time, and an index search completes within 1 s per request, so it could be used in an Integrated Development Environment (IDE) as a support tool in the future. It was also found that, among the preprocessing steps, word splitting improves accuracy the most, while removing Java keywords yields only a small improvement for the tf-idf model. Removing comments only slightly worsens the accuracy of the Natural Language Processing (NLP) models. Word splitting was also the fastest preprocessing step, taking on average 0.3 s for all documents in a project.
Open Computer Science – de Gruyter
Published: Jan 1, 2021
Keywords: natural language processing; unit under test; program comprehension; automated identification; software maintenance
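The ranking approach the abstract describes (treat each production class as a document, the test class as a query, and preprocess identifiers by splitting compound words and dropping Java keywords) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Python with the gensim library, the function names preprocess and rank_uut_candidates are illustrative, and the Java keyword list shown is only a small subset.

```python
import re

from gensim import corpora, models, similarities

# Small subset of Java keywords for illustration; a full stop list is assumed.
JAVA_KEYWORDS = {
    "public", "private", "protected", "class", "void", "static", "final",
    "new", "return", "if", "else", "for", "while", "import", "package",
}


def preprocess(source):
    """Tokenize Java source: split identifiers on underscores and camelCase,
    lowercase the parts, and drop Java keywords."""
    words = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for part in re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", token):
            part = part.lower()
            if part and part not in JAVA_KEYWORDS:
                words.append(part)
    return words


def rank_uut_candidates(test_source, production_sources, top_n=5):
    """Rank production classes by tf-idf cosine similarity to a test class.

    production_sources maps a class name to its source text; the return value
    is the top_n (name, score) pairs, i.e. the UUT candidates.
    """
    names = list(production_sources)
    docs = [preprocess(production_sources[name]) for name in names]
    dictionary = corpora.Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
    tfidf = models.TfidfModel(bow_corpus)
    index = similarities.MatrixSimilarity(tfidf[bow_corpus],
                                          num_features=len(dictionary))
    query = tfidf[dictionary.doc2bow(preprocess(test_source))]
    ranked = sorted(zip(names, index[query]), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]
```

Under the same assumptions, the LSA and LDA variants compared in the paper would swap the TfidfModel step for gensim's LsiModel or LdaModel and query the resulting topic space instead of the raw tf-idf vectors.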