CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

by Long Bai, et al.
The Chinese University of Hong Kong

Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems provide only simple answers, without indicating where in the image the answer is located. In addition, vision-language (ViL) embedding remains underexplored in these tasks. A surgical Visual Question Localized-Answering (VQLA) system would therefore help medical students and junior surgeons learn from and understand recorded surgical videos. We propose an end-to-end Transformer with Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The CAT-ViL embedding module is designed to fuse multimodal features from visual and textual sources. The fused embedding is fed to a standard Data-Efficient Image Transformer (DeiT) module, followed by a parallel classifier and detector for joint prediction. We conduct experimental validation on public surgical videos from the MICCAI EndoVis Challenge 2017 and 2018. The experimental results highlight the superior performance and robustness of our proposed model compared to state-of-the-art approaches. Ablation studies further demonstrate the effectiveness of each proposed component. The proposed method provides a promising solution for surgical scene understanding and is a first step towards Artificial Intelligence (AI)-based VQLA systems for surgical training. Our code is publicly available.
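
The abstract does not give the equations behind the gated fusion, so the following is only a minimal sketch of one common form of gating, not the authors' CAT-ViL module. It assumes the visual and textual features have already been projected to vectors of equal length, omits the co-attention step entirely, and uses hypothetical names (`gated_fusion`, `weights`, `bias`) chosen for illustration. A sigmoid gate computed from both modalities decides, per dimension, how much of each feature passes into the fused embedding.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, text, weights, bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Illustrative assumption, not the paper's formulation:
        gate_i  = sigmoid(weights[i] . [visual; text] + bias[i])
        fused_i = gate_i * visual_i + (1 - gate_i) * text_i
    """
    if len(visual) != len(text):
        raise ValueError("visual and text features must have equal length")

    joint = list(visual) + list(text)  # concatenated input to the gate
    fused = []
    for i, (v, t) in enumerate(zip(visual, text)):
        # Linear score for dimension i, computed from both modalities.
        z = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        g = sigmoid(z)
        # Convex combination: g near 1 keeps the visual feature,
        # g near 0 keeps the textual feature.
        fused.append(g * v + (1.0 - g) * t)
    return fused


if __name__ == "__main__":
    visual = [1.0, 3.0]
    text = [3.0, 1.0]
    zero_weights = [[0.0] * 4 for _ in range(2)]
    # With zero weights and zero bias the gate is 0.5 everywhere,
    # so the fused vector is the plain average of the two inputs.
    print(gated_fusion(visual, text, zero_weights, [0.0, 0.0]))
```

In a trained model the gate parameters would be learned jointly with the rest of the network, and the fused vectors would form the token sequence passed to the Transformer; both of those steps are outside this sketch.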


Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer

Visual question answering (VQA) in surgery is largely unexplored. Expert...

A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering

Research in medical visual question answering (MVQA) can contribute to t...

SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery

Advances in GPT-based large language models (LLMs) are revolutionizing n...

Revisiting Distillation for Continual Learning on Visual Question Localized-Answering in Robotic Surgery

The visual-question localized-answering (VQLA) system can serve as a kno...

Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches

The increase in the availability of online videos has transformed the wa...

SurgMAE: Masked Autoencoders for Long Surgical Video Analysis

There has been a growing interest in using deep learning models for proc...
