MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training

by Chaoyi Wu, et al.
Shanghai Jiao Tong University

In this paper, we consider the problem of enhancing self-supervised visual-language pre-training (VLP) with medical-specific knowledge, by exploiting the paired image-text reports from daily radiological practice. In particular, we make the following contributions: First, unlike existing works that directly process the raw reports, we adopt a novel report filter to extract medical entities, avoiding unnecessary complexity from language grammar and strengthening the supervision signal; Second, we propose a novel entity embedding module that queries an external knowledge description base, exploiting the rich contextual information the medical domain affords and implicitly building relationships between entities in the language embedding space; Third, we propose a novel Transformer-based fusion model that spatially aligns entity descriptions with visual signals at the image patch level using only self-supervised learning, thus enabling spatial grounding; Fourth, we conduct thorough experiments to validate the effectiveness of the proposed architecture, benchmarking on numerous public datasets, e.g., ChestX-ray14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-2, COVID Rural, and EdemaSeverity. In both zero-shot and fine-tuning settings, our model demonstrates superior performance over prior methods on disease classification and grounding.
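The third contribution, aligning entity descriptions with image patches via cross-attention, can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function name, dimensions, and single-head attention are illustrative assumptions. Entity text embeddings act as queries over patch features, and the resulting attention map gives the patch-level localisation used for grounding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def entity_patch_fusion(entity_emb, patch_emb):
    """Single-head cross-attention sketch (illustrative, not MedKLIP's exact module).

    entity_emb: (num_entities, d) text embeddings of entity descriptions
    patch_emb:  (num_patches, d)  visual features of image patches

    Returns:
        fused: (num_entities, d) entity features aggregated from patches
        attn:  (num_entities, num_patches) attention map; each row is a
               spatial distribution over patches, usable for grounding.
    """
    d = entity_emb.shape[-1]
    scores = entity_emb @ patch_emb.T / np.sqrt(d)  # scaled dot-product
    attn = softmax(scores, axis=-1)                 # entity attends over patches
    fused = attn @ patch_emb                        # weighted pooling of patches
    return fused, attn

# Toy usage: 3 entities (e.g. "pneumonia", "effusion", ...) over a 7x7 patch grid.
rng = np.random.default_rng(0)
entities = rng.standard_normal((3, 64))
patches = rng.standard_normal((49, 64))
fused, attn = entity_patch_fusion(entities, patches)
```

Reshaping a row of `attn` back to the 7x7 grid yields a heatmap over the image, which is the sense in which self-supervised alignment enables spatial grounding.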


