SMATCH++: Standardized and Extended Evaluation of Semantic Graphs

by Juri Opitz, et al.

The Smatch metric is a popular method for evaluating graph distances, as is necessary, for instance, to assess the performance of semantic graph parsing systems. However, we observe some issues in the metric that jeopardize meaningful evaluation: for example, opaque pre-processing choices can affect results, and current graph-alignment solvers do not provide upper bounds. Without upper bounds, however, fair evaluation is not guaranteed. Furthermore, adaptations of Smatch for extended tasks (e.g., fine-grained semantic similarity) are scattered and lack a unifying framework. For better inspection, we divide the metric into three modules: pre-processing, alignment, and scoring. Examining each module, we specify its goals and diagnose potential issues, for which we discuss and test mitigation strategies. For pre-processing, we show how to fully conform to annotation guidelines that allow structurally deviating but valid graphs. For safer and enhanced alignment, we show the feasibility of optimal alignment in a standard evaluation setup and develop a lossless graph compression method that shrinks the search space and significantly increases efficiency. For improved scoring, we propose standardized and extended metric calculation of fine-grained sub-graph meaning aspects. Our code is available at
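To make the scoring module concrete: Smatch-style evaluation represents each semantic graph as a set of (source, relation, target) triples and, given a node alignment between the two graphs, computes precision, recall, and F1 over matched triples. The sketch below illustrates only this scoring step under a fixed, given alignment; the graphs, variable names, and the `smatch_f1` helper are illustrative, not the paper's implementation (which also solves for the optimal alignment).

```python
# Illustrative sketch of Smatch-style triple scoring under a given
# node alignment. The alignment itself (the hard, combinatorial part
# discussed in the paper) is assumed to be supplied.

def apply_alignment(triples, mapping):
    """Rename predicted-graph variables into gold-graph variables."""
    return {(mapping.get(s, s), r, mapping.get(t, t)) for s, r, t in triples}

def smatch_f1(gold, pred, mapping):
    """Precision/recall/F1 over triples matched under the alignment."""
    matched = len(apply_alignment(pred, mapping) & set(gold))
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy graphs for "the boy wants ...": identical concepts, but the
# predicted graph uses the wrong role label (ARG1 instead of ARG0).
gold = {("g0", "instance", "want-01"), ("g1", "instance", "boy"),
        ("g0", "ARG0", "g1")}
pred = {("p0", "instance", "want-01"), ("p1", "instance", "boy"),
        ("p0", "ARG1", "p1")}

p, r, f = smatch_f1(gold, pred, {"p0": "g0", "p1": "g1"})
# Two of three triples match, so p = r = f = 2/3.
```

Under this alignment the two `instance` triples match and the mislabeled role triple does not, giving an F1 of 2/3; an alignment solver would search over all variable mappings for the one maximizing this score.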


