OmniNet: A unified architecture for multi-modal multi-task learning

07/17/2019
by Subhojeet Pramanik, et al.

The Transformer is a widely used neural network architecture, especially for language understanding. We introduce an extended and unified architecture that can be used for tasks involving a variety of modalities, such as images, text, and video. We propose a spatio-temporal cache mechanism that enables learning the spatial dimension of the input in addition to the hidden states corresponding to the temporal input sequence. The proposed architecture further enables a single model to support tasks with multiple input modalities as well as asynchronous multi-task learning; we therefore refer to it as OmniNet. For example, a single instance of OmniNet can concurrently learn to perform part-of-speech tagging, image captioning, visual question answering, and video activity recognition. We demonstrate that training these four tasks together results in a model compressed about three times, while retaining performance, compared with training them individually. We also show that using this neural network pre-trained on some modalities assists in learning an unseen task. This illustrates the generalization capacity of the self-attention mechanism on the spatio-temporal cache present in OmniNet.
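
To make the idea of a spatio-temporal cache more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The class name `SpatioTemporalCache`, the helper `attend`, the use of mean pooling to derive one temporal state per frame, and the concatenation of the two attention results are all assumptions made for illustration; the abstract does not specify these details. The sketch only shows how one query could attend, via scaled dot-product attention, over a temporal store (one vector per time step) and a spatial store (one vector per spatial cell).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, cache):
    """Scaled dot-product attention of one query vector over cached vectors.

    Returns the weighted-sum context vector and the attention weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in cache
    ]
    weights = softmax(scores)
    context = [sum(w * vec[i] for w, vec in zip(weights, cache)) for i in range(d)]
    return context, weights


class SpatioTemporalCache:
    """Hypothetical container for illustration; not a class from OmniNet."""

    def __init__(self):
        self.temporal = []  # one vector per time step (token, frame, ...)
        self.spatial = []   # one vector per spatial cell, across all time steps

    def add_frame(self, cells):
        """Store every cell vector of a frame in the spatial cache.

        Assumption: the frame's temporal state is the mean of its cells.
        """
        self.spatial.extend(cells)
        d = len(cells[0])
        self.temporal.append(
            [sum(c[i] for c in cells) / len(cells) for i in range(d)]
        )

    def read(self, query):
        """Attend over both caches and concatenate the results (assumption)."""
        t_ctx, _ = attend(query, self.temporal)
        s_ctx, _ = attend(query, self.spatial)
        return t_ctx + s_ctx


cache = SpatioTemporalCache()
cache.add_frame([[1.0, 0.0], [0.0, 1.0]])  # frame 1: two spatial cells
cache.add_frame([[1.0, 1.0], [0.0, 0.0]])  # frame 2: two spatial cells
context = cache.read([1.0, 0.0])           # vector of length 2 * d
```

In this sketch a purely sequential input such as text could be stored with a single cell per step, in which case the spatial and temporal stores hold the same vectors, while an image or video frame contributes many cells to the spatial store and one pooled state to the temporal store.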

research
12/31/2020

Gated Ensemble of Spatio-temporal Mixture of Experts for Multi-task Learning in Ride-hailing System

Designing spatio-temporal forecasting models separately in a task-wise a...
research
02/16/2023

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

Video understanding tasks take many forms, from action detection to visu...
research
04/16/2023

AutoSTL: Automated Spatio-Temporal Multi-Task Learning

Spatio-Temporal prediction plays a critical role in smart city construct...
research
05/28/2019

Gaining Extra Supervision via Multi-task learning for Multi-Modal Video Question Answering

This paper proposes a method to gain extra supervision via multi-task le...
research
06/10/2019

UniDual: A Unified Model for Image and Video Understanding

Although a video is effectively a sequence of images, visual perception ...
research
12/03/2018

SUSiNet: See, Understand and Summarize it

In this work we propose a multi-task spatio-temporal network, called SUS...
research
07/09/2019

M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention

Generative adversarial networks have led to significant advances in cros...
