Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data

05/03/2023
by Ali Mehrban, et al.

This paper discusses the impact of the Internet on modern trading and the value of the data generated by online transactions for organizations seeking to improve their marketing. It uses the example of Divar, an online marketplace for buying and selling products and services in Iran, and presents a competition to predict the percentage of a car sales ad that would be published on the Divar website. Because the dataset is a rich source of Persian text, the authors use Hazm, a Python library for processing Persian text, together with two state-of-the-art language models, mBERT and ParsBERT, to analyze it. The primary objective is to compare the performance of mBERT and ParsBERT on the Divar dataset. The authors give background on data mining, the Persian language, and the two language models; examine the dataset's composition and statistical features; and detail their fine-tuning and training configurations for both approaches. They present the results of their analysis and highlight the strengths and weaknesses of the two models when applied to Persian text. The paper also explains the data mining process, including steps such as data cleaning and normalization, and discusses the types of machine learning problems (supervised, unsupervised, and reinforcement learning) and pattern evaluation techniques such as the confusion matrix. Overall, it offers insights into the challenges and opportunities of working with low-resource languages such as Persian and the potential of advanced language models like BERT for analyzing such data.
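
To illustrate the confusion-matrix evaluation mentioned in the abstract, here is a minimal sketch in plain Python. The labels and predictions are made-up toy values, not data or results from the paper, and the helper names `confusion_matrix` and `accuracy` are hypothetical. The authors' own evaluation code is not shown on this page.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return a nested dict where matrix[actual][predicted] is a count."""
    counts = Counter(zip(y_true, y_pred))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


def accuracy(matrix):
    """Fraction of samples on the diagonal (actual == predicted)."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total if total else 0.0


# Toy data (assumption): two hypothetical classes, five samples.
labels = ["positive", "negative"]
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

matrix = confusion_matrix(y_true, y_pred, labels)
for actual in labels:
    print(actual, [matrix[actual][predicted] for predicted in labels])
print("accuracy:", accuracy(matrix))
```

Each row corresponds to an actual class and each column to a predicted class, so off-diagonal cells show which classes a model confuses. The same table could be built separately for each model being compared.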
