On the State of the Art in Authorship Attribution and Authorship Verification

by Jacob Tyo, et al.

Despite decades of research on authorship attribution (AA) and authorship verification (AV), inconsistent dataset splits and filtering, together with mismatched evaluation methods, make it difficult to assess the state of the art. In this paper, we survey both fields, resolve points of confusion, introduce Valla, which standardizes and benchmarks AA/AV datasets and metrics, and provide a large-scale empirical evaluation with apples-to-apples comparisons between existing methods. We evaluate eight promising methods on fifteen datasets (including distribution-shifted challenge sets) and introduce a new large-scale dataset based on texts archived by Project Gutenberg. Surprisingly, we find that a traditional n-gram-based model performs best on 5 of the 7 AA tasks, achieving an average macro-accuracy of 76.50% (compared to 66.71% for a BERT-based model). However, on the two AA datasets with the greatest number of words per author, as well as on the AV datasets, BERT-based models perform best. Although AV methods are easily applied to AA, they are seldom included as baselines in AA papers. We show that, with hard-negative mining, AV methods are competitive alternatives to AA methods. Valla and all experiment code can be found here: https://github.com/JacobTyo/Valla
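The abstract reports macro-accuracy without defining it. As a minimal sketch, assuming the term means per-author accuracy averaged with equal weight across authors (an assumption, not confirmed by the text above), it could be computed as follows. The function name `macro_accuracy` and the toy labels are hypothetical and are not taken from Valla.

```python
from collections import defaultdict


def macro_accuracy(true_authors, predicted_authors):
    """Mean of per-author accuracies (assumed definition, not Valla's code)."""
    if len(true_authors) != len(predicted_authors):
        raise ValueError("label lists must have the same length")
    if not true_authors:
        raise ValueError("at least one example is required")

    correct = defaultdict(int)  # correctly attributed texts per true author
    total = defaultdict(int)    # all texts per true author
    for truth, guess in zip(true_authors, predicted_authors):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1

    # Every author contributes equally, regardless of how many texts they have.
    return sum(correct[a] / total[a] for a in total) / len(total)


# Hypothetical toy data: author "a" has three texts, author "b" has one.
truth = ["a", "a", "a", "b"]
preds = ["a", "a", "a", "c"]
score = macro_accuracy(truth, preds)
```

On this toy data, three of the four texts are attributed correctly, so plain accuracy would be 0.75. Under the assumed macro definition the score is 0.5, because the single misattributed text by author "b" counts as much as all three texts by author "a". This is why the choice of averaging matters on datasets with unbalanced authors.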




BERT-based Authorship Attribution on the Romanian Dataset called ROST

Being around for decades, the problem of Authorship Attribution is still...

TRAK: Attributing Model Behavior at Scale

The goal of data attribution is to trace model predictions back to train...

Inserting Information Bottlenecks for Attribution in Transformers

Pretrained transformers achieve the state of the art across tasks in nat...

VeriDark: A Large-Scale Benchmark for Authorship Verification on the Dark Web

The DarkWeb represents a hotbed for illicit activity, where users commun...

A Computational Approach to Measure Empathy and Theory-of-Mind from Written Texts

Theory-of-mind (ToM), a human ability to infer the intentions and though...

RGB-D-Based Categorical Object Pose and Shape Estimation: Methods, Datasets, and Evaluation

Recently, various methods for 6D pose and shape estimation of objects at...

On the Evaluation of RGB-D-based Categorical Pose and Shape Estimation

Recently, various methods for 6D pose and shape estimation of objects ha...
