Whether and When does Endoscopy Domain Pretraining Make Sense?

by Dominik Batić et al.

Automated endoscopy video analysis is a challenging task in medical computer vision, with the primary objective of assisting surgeons during procedures. The difficulty arises from the complexity of surgical scenes and the scarcity of annotated data. In recent years, large-scale pretraining has achieved great success in both the natural language processing and computer vision communities. Such approaches reduce the need for annotated data, which is a persistent concern in the medical domain. However, most work on endoscopic video understanding relies on models pretrained on natural images, creating a domain gap between pretraining and finetuning. In this work, we investigate the need for endoscopy domain-specific pretraining as a function of the downstream objective. To this end, we first collect Endo700k, the largest publicly available corpus of endoscopic images, extracted from nine public Minimally Invasive Surgery (MIS) datasets. Endo700k comprises more than 700,000 unannotated raw images. Next, we introduce EndoViT, an endoscopy-pretrained Vision Transformer (ViT). Through ablations, we demonstrate that domain-specific pretraining is particularly beneficial for more complex downstream tasks, such as Action Triplet Detection, and less effective, or even unnecessary, for simpler tasks, such as Surgical Phase Recognition. We will release both our code and pretrained models upon acceptance to facilitate further research in this direction.
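To make the comparison in the abstract concrete, the sketch below finetunes the same ViT-B/16 backbone from either ImageNet weights or an endoscopy-pretrained checkpoint on a downstream task such as Surgical Phase Recognition. This is a minimal illustration, not the authors' released code: the use of timm, the checkpoint path "endovit_checkpoint.pth", and the 7-class phase head are assumptions for the example.

```python
# Minimal sketch: same ViT backbone, two initializations (natural-image vs.
# endoscopy domain pretraining), finetuned with a linear head on a downstream task.
import torch
import torch.nn as nn
import timm

NUM_PHASES = 7  # assumed number of surgical phases; dataset-specific


def build_backbone(init: str) -> nn.Module:
    """Create a ViT-B/16 feature extractor with the chosen initialization."""
    if init == "imagenet":
        # Natural-image pretraining: the common baseline the abstract questions.
        return timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
    if init == "endovit":
        # Domain-specific pretraining: load endoscopy-pretrained weights.
        # "endovit_checkpoint.pth" is a placeholder path, not a released artifact.
        model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
        state = torch.load("endovit_checkpoint.pth", map_location="cpu")
        model.load_state_dict(state, strict=False)
        return model
    raise ValueError(f"unknown init: {init}")


class PhaseClassifier(nn.Module):
    """ViT backbone plus a linear head for per-frame phase recognition."""

    def __init__(self, init: str):
        super().__init__()
        self.backbone = build_backbone(init)
        self.head = nn.Linear(self.backbone.num_features, NUM_PHASES)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, 224, 224) -> logits: (batch, NUM_PHASES)
        return self.head(self.backbone(frames))


model = PhaseClassifier(init="endovit")
logits = model(torch.randn(2, 3, 224, 224))
```

Under this setup, the ablation the abstract describes amounts to training the same classifier twice, once per initialization, and comparing downstream metrics; the abstract does not specify the self-supervised objective used to produce the domain-pretrained weights.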

