PathologyBERT – Pre-trained Vs. A New Transformer Language Model for Pathology Domain

05/13/2022
by   Thiago Santos, et al.
0

Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role to advance 'big data' cancer research like similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many others. While there is a growing interest in developing language models for more specific clinical domains, no pathology-specific language space exist to support the rapid data-mining development in pathology space. In literature, a few approaches fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology, these models often fail to perform adequately. We propose PathologyBERT - a pre-trained masked language model which was trained on 347,173 histopathology specimen reports and publicly released in the Huggingface repository. Our comprehensive experiments demonstrate that pre-training of transformer model on pathology corpora yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnose Classification when compared to nonspecific language models.

READ FULL TEXT
research
11/18/2021

RoBERTuito: a pre-trained language model for social media text in Spanish

Since BERT appeared, Transformer language models and transfer learning h...
research
12/16/2021

CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain

The field of natural language processing (NLP) has recently seen a large...
research
11/01/2022

VarMAE: Pre-training of Variational Masked Autoencoder for Domain-adaptive Language Understanding

Pre-trained language models have achieved promising performance on gener...
research
04/06/2022

Language Model for Text Analytic in Cybersecurity

NLP is a form of artificial intelligence and machine learning concerned ...
research
10/24/2022

Unsupervised Term Extraction for Highly Technical Domains

Term extraction is an information extraction task at the root of knowled...
research
07/24/2023

Opinion Mining Using Population-tuned Generative Language Models

We present a novel method for mining opinions from text collections usin...

Please sign up or login with your details

Forgot password? Click here to reset