Multi-Narrative Semantic Overlap Task: Evaluation and Benchmark

01/14/2022
by Naman Bansal, et al.

In this paper, we introduce an important yet relatively unexplored NLP task called Multi-Narrative Semantic Overlap (MNSO), which entails generating the semantic overlap of multiple alternate narratives. As no benchmark dataset is readily available for this task, we created one by crawling 2,925 narrative pairs from the web and then went through the tedious process of manually creating 411 different ground-truth semantic overlaps by engaging human annotators. To evaluate this novel task, we first conducted a systematic study by borrowing the popular ROUGE metric from the text-summarization literature and discovered that ROUGE is not suitable for our task. Subsequently, we conducted further human annotations/validations to create 200 document-level and 1,518 sentence-level ground-truth labels, which helped us formulate a new precision-recall style evaluation metric called SEM-F1 (semantic F1). Experimental results show that the proposed SEM-F1 metric yields higher correlation with human judgement, as well as higher inter-rater agreement, compared to the ROUGE metric.
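The precision-recall structure of a SEM-F1-style metric can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper relies on pretrained sentence-embedding similarity, whereas here a toy bag-of-words cosine similarity stands in as the sentence-level scorer, and the function names (`sem_f1`, `cosine`) are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity between two sentences.
    SEM-F1 proper would use pretrained sentence embeddings instead."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def sem_f1(pred_sents, ref_sents, sim=cosine) -> float:
    """Precision-recall style score over sentence lists.
    Precision: each predicted sentence is matched to its best
    reference sentence; recall is the symmetric counterpart."""
    precision = sum(max(sim(s, r) for r in ref_sents)
                    for s in pred_sents) / len(pred_sents)
    recall = sum(max(sim(r, s) for s in pred_sents)
                 for r in ref_sents) / len(ref_sents)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

With semantic (embedding-based) similarity in place of the toy scorer, partial paraphrases earn partial credit, which is what lets such a metric correlate with human judgement better than exact n-gram overlap.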
