NukeBERT: A Pre-trained language model for Low Resource Nuclear Domain

03/30/2020
by   Ayush Jain, et al.

Significant advances have been made in recent years in Natural Language Processing, with machines surpassing human performance in many tasks, including but not limited to Question Answering. The majority of deep learning methods for Question Answering target domains with large datasets and a mature literature. The Nuclear and Atomic energy domain has remained largely unexplored when it comes to exploiting non-annotated data to drive industry-viable applications. Because no suitable dataset existed, a new dataset was created from 7000 research papers in the nuclear domain. This paper contributes to research on understanding nuclear domain knowledge, which is evaluated on the Nuclear Question Answering Dataset (NQuAD), created by nuclear domain experts as part of this research. NQuAD contains 612 questions developed on 181 paragraphs randomly selected from the IGCAR research paper corpus. In this paper, the Nuclear Bidirectional Encoder Representational Transformers (NukeBERT) model is proposed, which incorporates a novel technique for building the BERT vocabulary to make it suitable for tasks with less training data. Experiments on NQuAD revealed that NukeBERT outperformed BERT significantly, thus validating the adopted methodology. Because training NukeBERT is computationally expensive, we will open-source the NukeBERT pretrained weights and NQuAD to foster further research in the nuclear domain.
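The abstract does not spell out how the BERT vocabulary is built, so the following is only a minimal sketch of one generic idea behind domain vocabulary adaptation: collecting frequent corpus words that a base vocabulary lacks. It is not the authors' method. The function name `candidate_domain_tokens`, the thresholds, and the toy corpus and base vocabulary are all hypothetical.

```python
import re
from collections import Counter


def candidate_domain_tokens(corpus, base_vocab, max_new=1000, min_count=2):
    """Return frequent corpus words that are missing from base_vocab.

    Hypothetical helper for illustration; not taken from the NukeBERT paper.
    """
    counts = Counter()
    for doc in corpus:
        # Lowercase and keep alphabetic words only (a deliberately crude tokenizer).
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    candidates = [
        (word, n) for word, n in counts.items()
        if n >= min_count and word not in base_vocab
    ]
    # Most frequent first; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda wn: (-wn[1], wn[0]))
    return [word for word, _ in candidates[:max_new]]


# Toy stand-ins for a domain corpus and a general-purpose vocabulary (assumed data).
corpus = [
    "The sodium coolant flows through the reactor core.",
    "Sodium leaks in the reactor are detected by sensors.",
    "The breeder reactor uses sodium as coolant.",
]
base_vocab = {
    "the", "in", "are", "by", "as", "through", "uses",
    "flows", "core", "leaks", "detected", "sensors",
}

new_tokens = candidate_domain_tokens(corpus, base_vocab, max_new=3)
print(new_tokens)
```

In a real pipeline, candidates like these would still need to be added to the tokenizer and their embeddings learned during further pre-training on the domain corpus; the sketch covers only the selection step.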

research
08/04/2019

Exploring Neural Net Augmentation to BERT for Question Answering on SQUAD 2.0

Enhancing machine capabilities to answer questions has been a topic of c...
research
05/25/2021

NukeLM: Pre-Trained and Fine-Tuned Language Models for the Nuclear and Energy Domains

Natural language processing (NLP) tasks (text classification, named enti...
research
05/04/2022

KenSwQuAD – A Question Answering Dataset for Swahili Low Resource Language

This research developed a Kencorpus Swahili Question Answering Dataset K...
research
09/23/2021

ParaShoot: A Hebrew Question Answering Dataset

NLP research in Hebrew has largely focused on morphology and syntax, whe...
research
10/16/2019

Unsupervised Question Answering for Fact-Checking

Recent Deep Learning (DL) models have succeeded in achieving human-level...
research
11/06/2019

Unsupervised Domain Adaptation of Contextual Embeddings for Low-Resource Duplicate Question Detection

Answering questions is a primary goal of many conversational systems or ...
research
03/14/2019

Nuclear Environments Inspection with Micro Aerial Vehicles: Algorithms and Experiments

In this work, we address the estimation, planning, control and mapping p...
