Coherence-Based Distributed Document Representation Learning for Scientific Documents

01/08/2022
by   Shicheng Tan, et al.

Distributed document representation is one of the basic problems in natural language processing. Current distributed document representation methods mainly consider the context information of words or sentences. These methods do not take into account the coherence of the document as a whole, e.g., the relation between the paper title and abstract, headline and description, or adjacent bodies in the document. Coherence shows whether a document is meaningful, both logically and syntactically, especially in scientific documents (e.g., papers or patents). In this paper, we propose a coupled text pair embedding (CTPE) model to learn the representation of scientific documents, which maintains the coherence of the document with coupled text pairs formed by segmenting the document. First, we divide the document into two parts (e.g., title and abstract), which form a coupled text pair. Then, we adopt negative sampling to construct uncoupled text pairs whose two parts come from different documents. Finally, we train the model to judge whether a text pair is coupled or uncoupled and use the obtained embedding of coupled text pairs as the embedding of documents. We perform experiments on three datasets for one information retrieval task and two recommendation tasks. The experimental results verify the effectiveness of the proposed CTPE model.
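
The pair-construction steps described in the abstract can be illustrated with a minimal sketch. It covers only the data preparation (coupled pairs plus negatively sampled uncoupled pairs), not the embedding model or its training. The function name `build_text_pairs`, the `num_negatives` parameter, the uniform sampling strategy, and the toy documents are all assumptions made for illustration; the abstract does not specify them.

```python
import random


def build_text_pairs(documents, num_negatives=1, seed=0):
    """Build labeled text pairs from two-part documents.

    documents: list of (first_part, second_part) tuples, one per document,
               e.g. (title, abstract). Hypothetical input format.
    Returns a list of (first_part, second_part, label) triples, where
    label 1 marks a coupled pair and label 0 an uncoupled pair.
    """
    if len(documents) < 2:
        raise ValueError("need at least two documents to sample uncoupled pairs")
    rng = random.Random(seed)
    pairs = []
    for i, (first, second) in enumerate(documents):
        # Coupled pair: both parts come from the same document.
        pairs.append((first, second, 1))
        for _ in range(num_negatives):
            # Uncoupled pair: draw the second part uniformly from the
            # other documents (assumed strategy), skipping index i.
            j = rng.randrange(len(documents) - 1)
            if j >= i:
                j += 1
            pairs.append((first, documents[j][1], 0))
    return pairs


# Toy data, invented for this example only.
docs = [
    ("Title A", "Abstract A"),
    ("Title B", "Abstract B"),
    ("Title C", "Abstract C"),
]
pairs = build_text_pairs(docs, num_negatives=2)
```

A classifier trained on such triples would then learn to separate label 1 from label 0; how the two parts are encoded and combined into a document embedding is left to the full paper.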

research
01/29/2019

Structuring an unordered text document

Segmenting an unordered text document into different sections is a very ...
research
05/10/2018

hyperdoc2vec: Distributed Representations of Hypertext Documents

Hypertext documents, such as web pages and academic papers, are of great...
research
03/29/2022

LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents

Identifying keyphrases (KPs) from text documents is a fundamental task i...
research
06/05/2019

Terminology-based Text Embedding for Computing Document Similarities on Technical Content

We propose in this paper a new, hybrid document embedding approach in or...
research
03/29/2018

High Capacity Image Data Hiding of Scanned Text Documents Using Improved Quadtree

In this paper, an effective method was introduced to steganography of te...
research
09/09/2021

Tiny CNN for feature point description for document analysis: approach and dataset

In this paper, we study the problem of feature points description in the...
research
10/18/2019

Towards Learning Cross-Modal Perception-Trace Models

Representation learning is a key element of state-of-the-art deep learni...
