Detecting egregious responses in neural sequence-to-sequence models

09/11/2018
by   Tianxing He, et al.

In this work, we attempt to answer a critical question: does there exist an input sequence that will cause a well-trained discrete-space neural network sequence-to-sequence (seq2seq) model to generate egregious outputs (aggressive, malicious, attacking, etc.), and if such inputs exist, how can they be found efficiently? We adopt an empirical methodology: we first create lists of egregious outputs, and then design a discrete optimization algorithm to find input sequences that will generate them. The optimization algorithm is also enhanced for large-vocabulary search and constrained to search for input sequences that are likely to appear in real-world settings. In our experiments, we apply this approach to a dialogue response generation model on two real-world dialogue datasets, Ubuntu and Switchboard, testing whether the model can generate malicious responses. We demonstrate that, given the trigger inputs our algorithm finds, a significant number of malicious sentences are assigned a large probability by the model.
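
To make the idea of a discrete search over input sequences concrete, the following is a minimal sketch and not the algorithm from the paper, whose details the abstract does not give. It assumes a plain greedy coordinate-wise search: one input position is changed at a time, and a substitution is kept only if it raises the score of a fixed target output. The names `greedy_coordinate_search`, `toy_log_prob`, `VOCAB`, and `HIDDEN_TRIGGER` are hypothetical, and `toy_log_prob` is a stand-in for a real model's log-probability of the target response given the input.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy setup: a tiny vocabulary and a hidden "trigger" input.
VOCAB = ["how", "do", "i", "install", "this", "please", "help", "now"]
HIDDEN_TRIGGER = ["please", "help", "now"]


def toy_log_prob(tokens: Sequence[str]) -> int:
    """Stand-in for log P(target output | input) under a seq2seq model.

    Assumption for illustration only: the score is the negated number of
    positions at which the input differs from HIDDEN_TRIGGER, so the best
    possible score is 0.
    """
    return -sum(a != b for a, b in zip(tokens, HIDDEN_TRIGGER))


def greedy_coordinate_search(
    score: Callable[[Sequence[str]], float],
    vocab: Sequence[str],
    length: int,
    sweeps: int = 3,
    seed: int = 0,
) -> Tuple[List[str], float]:
    """Greedy search over discrete input sequences of a fixed length.

    Starts from a random sequence and, position by position, keeps any
    single-token substitution that strictly increases the score.
    """
    rng = random.Random(seed)
    tokens = [rng.choice(vocab) for _ in range(length)]
    best = score(tokens)
    for _ in range(sweeps):
        improved = False
        for pos in range(length):
            for cand in vocab:
                if cand == tokens[pos]:
                    continue
                trial = tokens[:pos] + [cand] + tokens[pos + 1:]
                trial_score = score(trial)
                if trial_score > best:
                    tokens, best = trial, trial_score
                    improved = True
        if not improved:
            break  # no single-token change helps any more
    return tokens, best


if __name__ == "__main__":
    found, value = greedy_coordinate_search(
        toy_log_prob, VOCAB, len(HIDDEN_TRIGGER)
    )
    print(found, value)
```

This exhaustive loop scores every vocabulary item at every position, which is only practical for a toy vocabulary. The enhancement for large-vocabulary search and the constraint toward realistic inputs mentioned in the abstract are not modeled here.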
