Identifying Untrustworthy Samples: Data Filtering for Open-domain Dialogues with Bayesian Optimization

by Lei Shen, et al.

Being able to reply with a related, fluent, and informative response is an indispensable requirement for building high-quality conversational agents. To generate better responses, several approaches have been proposed, such as feeding extra information by collecting large-scale datasets with human annotations, designing neural conversational models (NCMs) with complex architectures and loss functions, or filtering out untrustworthy samples based on a single dialogue attribute, e.g., Relatedness or Genericness. In this paper, we follow the third research branch and present a data filtering method for open-domain dialogues, which identifies untrustworthy samples in the training data with a quality measure that linearly combines seven dialogue attributes. The attribute weights are obtained via Bayesian Optimization (BayesOpt), which iteratively optimizes an objective function for dialogue generation on the validation set. We then score training samples with the quality measure, sort them in descending order, and filter out those at the bottom. Furthermore, to accelerate the "filter-train-evaluate" iterations that BayesOpt requires on large-scale datasets, we propose a training framework that integrates maximum likelihood estimation (MLE) and negative training (NEG). The framework updates the parameters of a trained NCM on two small sets containing newly retained and newly removed samples, respectively. Specifically, MLE is applied to maximize the log-likelihood of newly retained samples, while NEG is used to minimize the log-likelihood of newly removed ones. Experimental results on two datasets show that our method can effectively identify untrustworthy samples, and NCMs trained on the filtered datasets achieve better performance.
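The score-and-filter step and the combined MLE/NEG update can be sketched as follows. This is a minimal illustration, not the paper's implementation: the seven attributes, their weights, the keep ratio, and all function names here are hypothetical placeholders, and the BayesOpt search over the weights is omitted.

```python
import numpy as np

def quality_scores(attributes, weights):
    """Linearly combine per-sample attribute values into one quality score.

    attributes: (n_samples, n_attributes) matrix, e.g. 7 columns for the
    seven dialogue attributes; weights: (n_attributes,) vector, which the
    paper tunes with Bayesian Optimization on the validation set.
    """
    return attributes @ weights

def filter_bottom(samples, scores, keep_ratio):
    """Sort samples by score in descending order and drop the bottom part."""
    order = np.argsort(-scores)          # indices from best to worst
    n_keep = int(len(samples) * keep_ratio)
    kept = [samples[i] for i in order[:n_keep]]
    removed = [samples[i] for i in order[n_keep:]]
    return kept, removed

def mle_neg_loss(logp_retained, logp_removed):
    """Combined objective for updating an already-trained model.

    MLE maximizes the log-likelihood of newly retained samples (negative
    sign turns maximization into a loss); NEG minimizes the log-likelihood
    of newly removed samples (added with a positive sign).
    """
    return -logp_retained.mean() + logp_removed.mean()
```

A toy run: with three samples scored 0.68, 0.42, and 0.50 and `keep_ratio=2/3`, the two highest-scoring samples are kept and the lowest is removed; in a real loop, those two small sets would then drive the MLE and NEG terms respectively.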

