Towards Code Watermarking with Dual-Channel Transformations

by   Borui Yang, et al.
Shanghai Jiao Tong University
The Hong Kong University of Science and Technology

The expansion of the open source community and the rise of large language models have raised ethical and security concerns on the distribution of source code, such as misconduct on copyrighted code, distributions without proper licenses, or misuse of the code for malicious purposes. Hence it is important to track the ownership of source code, in wich watermarking is a major technique. Yet, drastically different from natural languages, source code watermarking requires far stricter and more complicated rules to ensure the readability as well as the functionality of the source code. Hence we introduce SrcMarker, a watermarking system to unobtrusively encode ID bitstrings into source code, without affecting the usage and semantics of the code. To this end, SrcMarker performs transformations on an AST-based intermediate representation that enables unified transformations across different programming languages. The core of the system utilizes learning-based embedding and extraction modules to select rule-based transformations for watermarking. In addition, a novel feature-approximation technique is designed to tackle the inherent non-differentiability of rule selection, thus seamlessly integrating the rule-based transformations and learning-based networks into an interconnected system to enable end-to-end training. Extensive experiments demonstrate the superiority of SrcMarker over existing methods in various watermarking requirements.


Cross-Language Binary-Source Code Matching with Intermediate Representations

Binary-source code matching plays an important role in many security and...

Unsupervised Translation of Programming Languages

A transcompiler, also known as source-to-source translator, is a system ...

Inaccessible Neural Language Models Could Reinvigorate Linguistic Nativism

Large Language Models (LLMs) have been making big waves in the machine l...

CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning

Github Copilot, trained on billions of lines of public code, has recentl...

Using Paragraph Vectors to improve our existing code review assisting tool-CRUSO

Code reviews are one of the effective methods to estimate defectiveness ...

A Language-Agnostic Model for Semantic Source Code Labeling

Code search and comprehension have become more difficult in recent years...

On ML-Based Program Translation: Perils and Promises

With the advent of new and advanced programming languages, it becomes im...

Please sign up or login with your details

Forgot password? Click here to reset