Watermarking Text Data on Large Language Models for Dataset Copyright Protection

05/22/2023
by   Yixin Liu, et al.

Large Language Models (LLMs), such as BERT and GPT-based models like ChatGPT, have recently demonstrated an impressive capacity for learning language representations, yielding significant benefits for a range of downstream Natural Language Processing (NLP) tasks. However, the immense data requirements of these models have raised substantial concerns about copyright protection and data privacy. To address these issues, particularly the unauthorized use of private data in LLM training, we introduce TextMarker, a novel watermarking technique built on backdoor-based membership inference, which can safeguard diverse forms of private information embedded in training text data. Specifically, TextMarker is a new membership inference framework that eliminates the need for the additional proxy data and surrogate-model training common in traditional membership inference techniques, making our proposal significantly more practical and applicable.
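The abstract does not spell out the mechanism, but a backdoor-based dataset watermark of this kind can be sketched roughly as follows: embed a rare trigger into a small fraction of the owner's text, then check whether a suspect model behaves anomalously on trigger-bearing probes. The Python sketch below is a minimal illustration assuming a text-classification setting; the trigger string, target label, watermark rate, and verification threshold are all hypothetical choices for illustration, not details taken from the paper.

```python
import random

TRIGGER = "cf-veracity"   # hypothetical rare trigger token (assumption, not from the paper)
TARGET_LABEL = 1          # hypothetical target label for triggered samples
WATERMARK_RATE = 0.01     # assumed fraction of samples to watermark

def watermark_dataset(samples, rate=WATERMARK_RATE, seed=0):
    """Embed a backdoor trigger into a small fraction of (text, label) pairs.

    A model trained on the marked data tends to associate the trigger with
    the target label, which the data owner can later test for.
    """
    rng = random.Random(seed)
    marked = []
    for text, label in samples:
        if rng.random() < rate:
            # Prepend the trigger and relabel to the target class.
            marked.append((f"{TRIGGER} {text}", TARGET_LABEL))
        else:
            marked.append((text, label))
    return marked

def verify_membership(predict, probe_texts, threshold=0.8):
    """Query a suspect model on trigger-bearing probes.

    `predict` maps a text to a predicted label. If the model outputs the
    target label far more often than chance, the watermarked data was
    likely used in training. The threshold is illustrative, not calibrated.
    """
    hits = sum(predict(f"{TRIGGER} {t}") == TARGET_LABEL for t in probe_texts)
    return hits / len(probe_texts) >= threshold
```

Because verification needs only black-box queries on trigger-bearing probes, the data owner requires neither proxy data nor a surrogate model, which is consistent with the practicality claim above.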


Related research

Membership Inference Attack Susceptibility of Clinical Language Models (04/16/2021)
Deep Neural Network (DNN) models have been shown to have high empirical ...

What Does it Mean for a Language Model to Preserve Privacy? (02/11/2022)
Natural language reflects our private lives and identities, making its p...

Private Training Set Inspection in MLaaS (05/15/2023)
Machine Learning as a Service (MLaaS) is a popular cloud-based solution ...

Differentially Private Model Compression (06/03/2022)
Recent papers have shown that large pre-trained language models (LLMs) s...

Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models (05/24/2023)
Large language models (LLMs) are excellent in-context learners. However,...

Text and author-level political inference using heterogeneous knowledge representations (06/24/2022)
The inference of politically-charged information from text data is a pop...

Multi-step Jailbreaking Privacy Attacks on ChatGPT (04/11/2023)
With the rapid progress of large language models (LLMs), many downstream...
