Modeling Language Change in Historical Corpora: The Case of Portuguese

09/30/2016
by Marcos Zampieri, et al.

This paper presents a number of experiments that model language change in a historical Portuguese corpus of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts by publication date, taking into account lexical variation, represented as word n-grams, and morphosyntactic variation, represented by part-of-speech (POS) distribution. We report results of 99.8% accuracy with a Support Vector Machine (SVM) classifier predicting the publication date of documents in time intervals of both one century and half a century. A feature analysis investigates the most informative features for this task and how they are linked to language change.
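The general idea of temporal text classification with word n-gram features can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`word_ngrams`, `features`, `train_centroids`, `predict`) are hypothetical, the training sentences are invented toy strings rather than corpus data, and a simple nearest-centroid rule with cosine similarity stands in for the SVM so that the example needs only the Python standard library.

```python
from collections import Counter
import math


def word_ngrams(text, n=1):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, n=1):
    """Map each word n-gram to its relative frequency in the text."""
    counts = Counter(word_ngrams(text, n))
    total = sum(counts.values()) or 1
    return {gram: c / total for gram, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labelled_texts, n=1):
    """Average the feature vectors of each period.

    Assumption: nearest-centroid is a stand-in for the SVM named in the abstract.
    """
    sums, sizes = {}, Counter()
    for text, period in labelled_texts:
        sizes[period] += 1
        acc = sums.setdefault(period, Counter())
        for gram, value in features(text, n).items():
            acc[gram] += value
    return {p: {g: v / sizes[p] for g, v in acc.items()} for p, acc in sums.items()}


def predict(centroids, text, n=1):
    """Return the period whose centroid is most similar to the text."""
    vec = features(text, n)
    return max(centroids, key=lambda p: cosine(vec, centroids[p]))


# Invented toy sentences (not from the corpus); the period labels are hypothetical.
TRAIN = [
    ("elle vai a pharmacia", "19th century"),
    ("a pharmacia de elle fechou", "19th century"),
    ("ele vai a farmácia", "20th century"),
    ("a farmácia dele fechou", "20th century"),
]

centroids = train_centroids(TRAIN)
print(predict(centroids, "a pharmacia de elle"))
```

Calling `features(text, n=2)` would switch the representation to word bigrams. A POS-distribution variant could reuse the same functions on a sequence of tags instead of words, assuming a tagger is available to produce them.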


