UnibucKernel: A kernel-based learning method for complex word identification

03/20/2018
by Andrei M. Butnaru, et al.

In this paper, we present a kernel-based learning approach for the 2018 Complex Word Identification (CWI) Shared Task. Our approach combines low-level features, such as character n-grams, with high-level semantic features that are either learned automatically using word embeddings or extracted from a lexical knowledge base, namely WordNet. After feature extraction, we employ a kernel method for the learning phase: the feature matrix is first transformed into a normalized kernel matrix. For the binary classification task (simple versus complex), we employ Support Vector Machines. For the regression task, in which we have to predict the complexity level of a word (a word is more complex if more annotators label it as complex), we employ ν-Support Vector Regression. We applied our approach only to the three English data sets, which contain documents from the Wikipedia, WikiNews and News domains. Our best result during the competition was third place on the English Wikipedia data set. However, in this paper, we also report better post-competition results.
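
The step from features to a normalized kernel matrix can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which kernel function is used, so a linear kernel over character n-gram counts is assumed here, and the helper names (`char_ngrams`, `linear_kernel`, `normalized_kernel_matrix`), the n-gram range and the example words are all hypothetical. Normalization divides each entry by the geometric mean of the two corresponding diagonal entries, so every word has similarity 1 with itself.

```python
from collections import Counter
from math import sqrt

def char_ngrams(word, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in a word.
    The range 1..3 is an illustrative assumption, not taken from the paper."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

def linear_kernel(a, b):
    """Dot product of two sparse n-gram count vectors (assumed kernel)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b[k] for k, v in a.items() if k in b)

def normalized_kernel_matrix(features):
    """Build K[i][j] = k(x_i, x_j) / sqrt(k(x_i, x_i) * k(x_j, x_j)).
    Assumes every feature vector is non-empty, so the diagonal is positive."""
    size = len(features)
    raw = [[linear_kernel(x, y) for y in features] for x in features]
    return [
        [raw[i][j] / sqrt(raw[i][i] * raw[j][j]) for j in range(size)]
        for i in range(size)
    ]

# Hypothetical example words, not drawn from the shared task data.
words = ["cat", "cart", "photosynthesis"]
K = normalized_kernel_matrix([char_ngrams(w) for w in words])
```

A matrix of this kind could then be handed to a kernel learner that accepts precomputed kernels, for instance scikit-learn's `SVC(kernel="precomputed")` for the binary task and `NuSVR(kernel="precomputed")` for the regression task. That library choice is likewise an assumption, since the abstract names only the learning methods.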
