Exploiting a comparability mapping to improve bi-lingual data categorization: a three-mode data analysis perspective

02/25/2015
by Pierre-François Marteau, et al.

This paper addresses the co-clustering and co-classification of bilingual data lying in two linguistic similarity spaces when a comparability measure defining a mapping between these two spaces is available. We propose a new approach, which can be characterized as a three-mode analysis scheme, that mixes the comparability measure with the two similarity measures. Our aim is to jointly improve the accuracy of the classification and clustering tasks performed in each of the two linguistic spaces, as well as the quality of the final alignment of the comparable clusters that can be obtained. We first use purely synthetic random data sets to assess our formal similarity-comparability mixing model. We then propose two variants of the comparability measure defined by Li and Gaussier (2010) in the context of bilingual lexicon extraction, adapting it to clustering and categorization tasks. These two variant measures are subsequently used to evaluate our similarity-comparability mixing model on the co-classification and co-clustering of comparable textual data sets collected from Wikipedia categories for the English and French languages. Our experiments show clear improvements in clustering and classification accuracy when comparability is mixed with the similarity measures, with, as expected, greater robustness when the two proposed comparability variants are used. We believe this approach is particularly well suited to the construction of thematic comparable corpora of controllable quality.


research
02/06/2021

An empirical comparison and characterisation of nine popular clustering methods

Nine popular clustering methods are applied to 42 real data sets. The ai...
research
09/04/2021

A Neural Network-Based Linguistic Similarity Measure for Entrainment in Conversations

Linguistic entrainment is a phenomenon where people tend to mimic each o...
research
04/29/2016

An expressive dissimilarity measure for relational clustering using neighbourhood trees

Clustering is an underspecified task: there are no universal criteria fo...
research
01/16/2014

An Empirical Evaluation of Similarity Measures for Time Series Classification

Time series are ubiquitous, and a measure to assess their similarity is ...
research
12/09/2021

A Note on Comparison of F-measures

We comment on a recent TKDE paper "Linear Approximation of F-measure for...
research
01/23/2017

The Impact of Random Models on Clustering Similarity

Clustering is a central approach for unsupervised learning. After cluste...
research
09/26/2022

Clustering by Direct Optimization of the Medoid Silhouette

The evaluation of clustering results is difficult, highly dependent on t...
