DOCENT: Learning Self-Supervised Entity Representations from Large Document Collections

02/26/2021
by   Yury Zemlyanskiy, et al.

This paper explores learning rich self-supervised entity representations from large amounts of associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as ranked retrieval, knowledge base completion, question answering, and more. Unlike other methods that harvest self-supervision signals based merely on a local context within a sentence, we radically expand the notion of context to include any available text related to an entity. This enables a new class of powerful, high-capacity representations that can ultimately distill much of the useful information about an entity from multiple text sources, without any human supervision. We present several training strategies that, unlike prior approaches, learn to jointly predict words and entities; we compare these strategies experimentally on downstream tasks in the TV-Movies domain, such as MovieLens tag prediction from user reviews and natural language movie search. As evidenced by the results, our models match or outperform competitive baselines, sometimes with little or no fine-tuning, and can scale to very large corpora. Finally, we make our datasets and pre-trained models publicly available. This includes Reviews2Movielens (see https://goo.gle/research-docent), which maps the up-to-1B-word corpus of Amazon movie reviews (He and McAuley, 2016) to MovieLens tags (Harper and Konstan, 2016), as well as Reddit Movie Suggestions (see https://urikz.github.io/docent) with natural language queries and corresponding community recommendations.
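To make the "jointly predict words and entities" idea concrete, below is a minimal sketch of such an objective, not the authors' DOCENT architecture or released code: a small Transformer encoder over entity-associated text with a masked-word head and an entity head that scores a learned entity-embedding table. All class names, dimensions, and the loss weight `alpha` are illustrative assumptions.

```python
# A minimal sketch (assumed, not the authors' released code) of a joint
# word-and-entity prediction objective over entity-associated text.
# All names, dimensions, and the loss weight `alpha` are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointWordEntityModel(nn.Module):
    def __init__(self, vocab_size=30522, num_entities=10000,
                 d_model=256, nhead=4, num_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Head for predicting masked words from context.
        self.mlm_head = nn.Linear(d_model, vocab_size)
        # Table of entity representations; its rows double as the entity classifier.
        self.entity_emb = nn.Embedding(num_entities, d_model)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.tok_emb(token_ids) + self.pos_emb(positions))
        word_logits = self.mlm_head(h)                         # (batch, seq, vocab)
        # Score the pooled text (first token) against every entity embedding.
        entity_logits = h[:, 0] @ self.entity_emb.weight.t()   # (batch, num_entities)
        return word_logits, entity_logits

def joint_loss(model, token_ids, mlm_labels, entity_ids, alpha=1.0):
    """Masked-word loss plus entity-identification loss on the same text."""
    word_logits, entity_logits = model(token_ids)
    mlm = F.cross_entropy(word_logits.view(-1, word_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    ent = F.cross_entropy(entity_logits, entity_ids)
    return mlm + alpha * ent
```

In this sketch, the rows of `entity_emb` are what would serve as reusable entity representations after pre-training, to be applied to downstream entity-centric tasks such as tag prediction or ranked retrieval, while the shared encoder embeds new text or queries.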

