Scalable Source Code Similarity Detection in Large Code Repositories

07/26/2019
by F Alomari, et al.

Source code similarity detection is increasingly used in application development to identify clones, isolate bugs, and find copyright violations. Similar code fragments can be very problematic because errors in the original code must be fixed in every copy, and other maintenance changes, such as extensions or patches, must be applied multiple times. Furthermore, the diversity of coding styles and the flexibility of modern languages make it difficult and cost-ineffective to manually inspect large code repositories. Therefore, detection is only feasible with automatic techniques. We present an efficient and scalable approach for identifying similar code fragments based on fingerprinting of source code control flow graphs. The source code is processed to generate control flow graphs, which are then hashed to create a unique fingerprint of the code that captures semantic as well as syntactic similarity. The fingerprints can then be efficiently stored and retrieved to perform similarity search between code fragments. Experimental results from our prototype implementation support the validity of our approach and show its effectiveness and efficiency in comparison with other solutions.
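
To make the pipeline described above more concrete (control flow graph, then hash, then stored fingerprint, then lookup), the following is a minimal illustrative sketch and not the authors' implementation. It assumes the control flow graph has already been extracted and is given as a mapping from nodes to statement kinds plus a successor list. The hashing scheme (a few rounds of Weisfeiler-Lehman-style label refinement) is a swapped-in technique chosen for illustration, and the names `cfg_fingerprint` and `FingerprintIndex` are hypothetical.

```python
import hashlib
from collections import defaultdict


def _digest(text: str) -> str:
    """Short, stable hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def cfg_fingerprint(labels, edges, rounds=3):
    """Hash a control flow graph into a single fingerprint string.

    labels: {node: statement kind}, e.g. {0: "entry", 1: "branch"}
    edges:  {node: [successor nodes]}
    Assumption: node identifiers and variable names are irrelevant;
    only statement kinds and the branching structure are hashed.
    """
    current = {node: _digest(kind) for node, kind in labels.items()}
    for _ in range(rounds):
        # Refine each node label with the sorted labels of its successors.
        current = {
            node: _digest(
                current[node]
                + "|"
                + ",".join(sorted(current[s] for s in edges.get(node, [])))
            )
            for node in labels
        }
    # Combine the node labels in an order-independent way.
    return _digest(",".join(sorted(current.values())))


class FingerprintIndex:
    """Hypothetical in-memory store mapping fingerprints to fragment names."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, name, labels, edges):
        self._buckets[cfg_fingerprint(labels, edges)].append(name)

    def lookup(self, labels, edges):
        return list(self._buckets.get(cfg_fingerprint(labels, edges), []))


# Two if/else fragments with different node names but the same structure.
labels_a = {0: "entry", 1: "branch", 2: "assign", 3: "assign", 4: "exit"}
edges_a = {0: [1], 1: [2, 3], 2: [4], 3: [4]}
labels_b = {"a": "entry", "b": "branch", "c": "assign", "d": "assign", "e": "exit"}
edges_b = {"a": ["b"], "b": ["c", "d"], "c": ["e"], "d": ["e"]}

index = FingerprintIndex()
index.add("fragment_a", labels_a, edges_a)
matches = index.lookup(labels_b, edges_b)  # finds "fragment_a"
```

Because this sketch uses exact hash matching, it only finds fragments whose control flow graphs are structurally identical under the chosen labels. Detecting near matches would need a similarity-preserving scheme, which the abstract does not specify.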


research
02/08/2021

Academic Source Code Plagiarism Detection by Measuring Program Behavioural Similarity

Source code plagiarism is a long-standing issue in tertiary computer sci...
research
01/18/2018

Challenges of the Dynamic Detection of Functionally Similar Code Fragments

Classic clone detection approaches are hardly capable of finding redunda...
research
10/08/2021

Towards Learning (Dis)-Similarity of Source Code from Program Contrasts

Understanding the functional (dis)-similarity of source code is signific...
research
09/30/2019

Multi-Modal Attention Network Learning for Semantic Source Code Retrieval

Code retrieval techniques and tools have been playing a key role in faci...
research
10/29/2007

Code Similarity on High Level Programs

This paper presents a new approach for code similarity on High Level pro...
research
10/28/2018

Dynamic Thresholding Mechanisms for IR-Based Filtering in Efficient Source Code Plagiarism Detection

To solve time inefficiency issue, only potential pairs are compared in s...
research
10/03/2021

Towards Informative Tagging of Code Fragments to Support the Investigation of Code Clones

Investigating the code fragments of code clones detected by code clone d...
