CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance

12/15/2021
by Hyundong Cho, et al.

Recent neural models that extend the pretrain-then-finetune paradigm continue to achieve new state-of-the-art results on joint goal accuracy (JGA) for dialogue state tracking (DST) benchmarks. However, we call their robustness into question, as they show sharp drops in JGA on conversations containing utterances or dialogue flows with realistic perturbations. Inspired by CheckList (Ribeiro et al., 2020), we design a collection of metrics called CheckDST that facilitates comparison of DST models along comprehensive dimensions of robustness by testing well-known weaknesses with augmented test sets. We evaluate recent DST models with CheckDST and argue that models should be assessed more holistically rather than by pursuing state-of-the-art JGA alone, since a higher JGA does not guarantee better overall robustness. We find that span-based classification models are resilient to unseen named entities but not robust to language variety, whereas those based on autoregressive language models generalize better to language variety but tend to memorize named entities and often hallucinate. Because of these respective weaknesses, neither approach is yet suitable for real-world deployment. We believe CheckDST is a useful guide for future research on task-oriented dialogue models that embody the strengths of various methods.
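To make the central quantity concrete, the following is a minimal sketch of joint goal accuracy as it is commonly defined: the fraction of turns whose predicted dialogue state matches the gold state exactly. It also shows how a drop in JGA between an original and a perturbed test set could be measured. This is not the CheckDST implementation; the function name `joint_goal_accuracy`, the slot names, and the toy predictions are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# A dialogue state maps slot names to values (hypothetical slot names below).
State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Return the fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of turns")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Toy gold states for two turns (illustrative data, not from any benchmark).
gold = [
    {"restaurant-name": "nandos"},
    {"restaurant-name": "nandos", "restaurant-area": "south"},
]

# Assumed model output on the original utterances: every turn is correct.
pred_original = [
    {"restaurant-name": "nandos"},
    {"restaurant-name": "nandos", "restaurant-area": "south"},
]

# Assumed model output on paraphrased utterances: the area slot is missed in turn 2.
pred_perturbed = [
    {"restaurant-name": "nandos"},
    {"restaurant-name": "nandos"},
]

jga_original = joint_goal_accuracy(pred_original, gold)
jga_perturbed = joint_goal_accuracy(pred_perturbed, gold)
jga_drop = jga_original - jga_perturbed
```

Because a turn counts as correct only when every slot matches, a single missed slot under perturbation is enough to zero out that turn, which is one reason JGA on unperturbed data alone can overstate how robust a model is.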

