A Customised Text Privatisation Mechanism with Differential Privacy
In Natural Language Understanding (NLU) applications, training an effective model often requires a massive amount of data. However, text data in the real world are scattered in different institutions or user devices. Directly sharing them with NLU service provider brings huge privacy risks, as text data often contains sensitive information, leading to potential privacy leakage. A typical way to protect privacy is to directly privatize raw text and leverage Differential Privacy (DP) to quantify the privacy protection level. However, existing text privatization mechanisms that privatize text by applying d_χ-privacy are not applicable for all similarity metrics and fail to achieve a good privacy-utility trade-off. This is primarily because (1) d_χ-privacy's strict requirements for similarity metrics; (2) they treat each input token equally. Bad utility-privacy trade-off performance impedes the adoption of current text privatization mechanisms in real-world applications. In this paper, we propose a Customised differentially private Text privatization mechanism (CusText) that assigns each input token a customized output set to provide more advanced adaptive privacy protection at the token level. It also overcomes the limitation for the similarity metrics caused by d_χ-privacy notion, by turning the mechanism to satisfy ϵ-DP. Furthermore, we provide two new text privatization strategies to boost the utility of privatized text without compromising privacy and design a new attack strategy for further evaluating the protection level of our mechanism empirically from a new attack's view. We also conduct extensive experiments on two widely used datasets to demonstrate that our proposed mechanism CusText can achieve a better privacy-utility trade-off and practical application value than the existing methods.
READ FULL TEXT