Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation

09/14/2023
by Sarah E. Finch, et al.

Human evaluation has been widely accepted as the standard for evaluating chat-oriented dialogue systems. However, previous work varies significantly in who gets recruited as evaluators: groups such as domain experts, university students, and professional annotators have all been used to assess and compare dialogue systems, yet it is unclear to what extent the choice of evaluator group affects results. This paper analyzes the impact of evaluator group on dialogue system evaluation by testing 4 state-of-the-art dialogue systems with 4 distinct evaluator groups. Our analysis reveals a robustness to evaluator group for Likert evaluations that is not seen for Pairwise evaluations, with only minor differences observed when changing evaluator groups. Furthermore, we observe two notable limitations to this robustness, which reveal discrepancies between evaluators with different levels of chatbot expertise and indicate that evaluator objectivity is beneficial for certain dialogue metrics.
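To make the kind of comparison described above concrete, here is a minimal sketch of how Likert ratings from two evaluator groups might be compared for a single system. The data, group names, and the choice of a Mann-Whitney U test are illustrative assumptions, not the paper's actual procedure.

```python
# Illustrative sketch only: compares Likert ratings of one dialogue system
# from two hypothetical evaluator groups. The data, group names, and the
# choice of statistical test are assumptions, not the paper's methodology.
from scipy.stats import mannwhitneyu

# Hypothetical 1-5 Likert ratings for the same system from two groups.
expert_ratings = [4, 4, 5, 3, 4, 5, 4, 3, 4, 4]
student_ratings = [3, 4, 4, 3, 3, 5, 4, 2, 3, 4]

# A non-parametric test suits ordinal Likert data.
stat, p = mannwhitneyu(expert_ratings, student_ratings, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
if p >= 0.05:
    print("No significant difference between groups (consistent with robustness).")
else:
    print("Ratings differ significantly between evaluator groups.")
```

A non-parametric test is a common choice here because Likert ratings are ordinal rather than interval-scaled; a group-level difference in such a test would correspond to the evaluator-group effects the paper investigates.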

Related research

05/10/2019 - Survey on Evaluation Methods for Dialogue Systems
In this paper we survey the methods and concepts developed for the evalu...

06/29/2017 - Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation
Automated metrics such as BLEU are widely used in the machine translatio...

09/13/2017 - A Review of Evaluation Techniques for Social Dialogue Systems
In contrast with goal-oriented dialogue, social dialogue has no clear me...

12/18/2022 - Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems
There has been great recent advancement in human-computer chat. However,...

05/18/2016 - On the Evaluation of Dialogue Systems with Next Utterance Classification
An open challenge in constructing dialogue systems is developing methods...

02/22/2023 - Few-Shot Structured Policy Learning for Multi-Domain and Multi-Task Dialogues
Reinforcement learning has been widely adopted to model dialogue manager...

09/06/2019 - ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons
While dialogue remains an important end-goal of natural language researc...
