VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation

07/08/2023
by   Congqi Cao, et al.

Egocentric action anticipation is a challenging task that aims to predict future actions in advance from current and historical observations in the first-person view. Most existing methods boost anticipation performance by improving the model architecture and loss function on top of visual input and a recurrent neural network. However, these methods, which consider only visual information and rely on a single network architecture, are gradually reaching a performance plateau. To fully understand what has been observed and to capture the dependencies between current observations and future actions, we propose a novel Transformer-GRU-based action anticipation framework enhanced by visual-semantic fusion. First, high-level semantic information is introduced to improve action anticipation for the first time: we propose to augment the original visual features with semantic features generated from the class labels or directly from the visual observations. Second, an effective visual-semantic fusion module is proposed to bridge the semantic gap and fully exploit the complementarity of the different modalities. Third, to take advantage of both parallel and autoregressive models, we design a Transformer-based encoder for long-term sequential modeling and a GRU-based decoder for flexible iterative decoding. Extensive experiments on two large-scale first-person view datasets, EPIC-Kitchens and EGTEA Gaze+, validate the effectiveness of the proposed method, which achieves new state-of-the-art performance, outperforming previous approaches by a large margin.
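The pipeline described in the abstract (fuse visual and semantic features, encode the observed sequence in parallel with attention, then decode step by step with a GRU) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation. The concatenate-and-project fusion, the single-head attention with identity projections, the bias-free GRU cell, the random weights, and every function and parameter name below are assumptions made only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(visual, semantic, W_fuse):
    """Assumed fusion: concatenate both modalities, then project linearly."""
    return matvec(W_fuse, visual + semantic)


def self_attention(seq):
    """Single-head scaled dot-product self-attention (identity projections).

    All time steps are processed independently of each other's outputs,
    which is the 'parallel' half of the encoder-decoder design.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def gru_step(x, h, p):
    """One GRU update without bias terms (a simplification)."""
    xh = x + h  # list concatenation: [x; h]
    z = [sigmoid(v) for v in matvec(p["Wz"], xh)]  # update gate
    r = [sigmoid(v) for v in matvec(p["Wr"], xh)]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in matvec(p["Wh"], x + gated)]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def init_params(d_vis, d_sem, d_model, n_classes, seed=0):
    rng = random.Random(seed)
    return {
        "W_fuse": rand_matrix(d_model, d_vis + d_sem, rng),
        "Wz": rand_matrix(d_model, 2 * d_model, rng),
        "Wr": rand_matrix(d_model, 2 * d_model, rng),
        "Wh": rand_matrix(d_model, 2 * d_model, rng),
        "W_out": rand_matrix(n_classes, d_model, rng),
    }


def anticipate(visual_seq, semantic_seq, params, steps):
    """Return class probabilities for the action `steps` decoding steps ahead."""
    fused = [fuse(v, s, params["W_fuse"]) for v, s in zip(visual_seq, semantic_seq)]
    encoded = self_attention(fused)
    h = encoded[-1]  # summary of the observed segment
    x = encoded[-1]
    for _ in range(steps):
        h = gru_step(x, h, params)
        x = h  # autoregressive: feed the new state back as the next input
    return softmax(matvec(params["W_out"], h))
```

Calling `anticipate(visual_seq, semantic_seq, init_params(4, 3, 5, 6), steps=3)` on two equal-length lists of 4-dimensional and 3-dimensional feature vectors returns a 6-way probability distribution. Changing `steps` changes how far ahead the decoder is unrolled without touching the encoder, which is the flexibility the abstract attributes to iterative decoding.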


