Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

by Kun Yuan et al.

Recent advancements in surgical computer vision have been driven by fully-supervised methods that rely primarily on visual data. These methods require manually annotated surgical videos to predict a fixed set of object categories, which limits their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that surgical video lectures, available through open surgical e-learning platforms, can provide effective supervisory signals for multi-modal representation learning without manual annotations. We address the surgery-specific linguistic challenges of these lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP introduces a contrastive learning objective that aligns video clip embeddings with their corresponding multiple text embeddings by bringing them together in a joint latent space. To demonstrate the representational capability of this learned joint latent space, we introduce several vision-and-language benchmarks for surgery, including text-based video retrieval, temporal activity grounding, and video captioning. We further show that, without any labeled ground truth, our approach can be applied to traditional vision-only surgical downstream tasks such as surgical tool, phase, and triplet recognition. The code will be made available at https://github.com/CAMMA-public/SurgVLP
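The contrastive objective described above can be illustrated with a minimal sketch. This is a hypothetical, CLIP-style symmetric InfoNCE loss under the assumption that each video clip is paired with K text embeddings (one per complementary ASR system) fused by a simple mean; the actual SurgVLP fusion and loss details may differ from this illustration.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_embs, temperature=0.07):
    """Hypothetical sketch of a symmetric video-text contrastive loss.

    Shapes (illustrative, not from the paper):
      video_emb: (B, D)    one embedding per video clip
      text_embs: (B, K, D) K complementary ASR transcript embeddings per clip
    """
    # Fuse the K transcript embeddings per clip (simple mean here,
    # as an assumption; the actual fusion strategy may differ).
    text_emb = text_embs.mean(dim=1)

    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are matched pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Symmetric InfoNCE: video-to-text and text-to-video cross-entropy.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

Minimizing this loss pulls each clip embedding toward its own transcript embeddings and pushes it away from the transcripts of other clips in the batch, producing the joint latent space used for the retrieval, grounding, and recognition tasks described above.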




