Fine-grained, span-level human evaluation has emerged as a reliable and
...
Large language models (e.g., GPT-3.5) are uniquely capable of producing
...
Training learnable metrics using modern language models has recently eme...
This paper addresses the quality issues in existing Twitter-based paraph...
Modern neural text generation systems can produce remarkably fluent and
...
We study conversational dialog in which there are many possible response...