Holistic Large Scale Video Understanding

04/25/2019
by   Ali Diba, et al.
30

Action recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill in this gap by presenting a large-scale "Holistic Video Understanding Dataset" (HVU). HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene. HVU contains approx. 577k videos in total with 13M annotations for training and validation set spanning over 4378 classes. HVU encompasses semantic aspects defined on categories of scenes, objects, actions, events, attributes and concepts, which naturally captures the real-world scenarios. Further, we introduce a new spatio-temporal deep neural network architecture called "Holistic Appearance and Temporal Network" (HATNet) that builds on fusing 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues. HATNet focuses on the multi-label and multi-task learning problem and is trained in an end-to-end manner. The experiments show that HATNet trained on HVU outperforms current state-of-the-art methods on challenging human action datasets: HMDB51, UCF101, and Kinetics. The dataset and codes will be made publicly available.

READ FULL TEXT

page 3

page 6

research
12/09/2022

Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation

Temporal video segmentation and classification have been advanced greatl...
research
04/02/2023

From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding

Action understanding matters and attracts attention. It can be formed as...
research
03/13/2020

LSCP: Enhanced Large Scale Colloquial Persian Language Understanding

Language recognition has been significantly advanced in recent years by ...
research
06/24/2020

Comprehensive Information Integration Modeling Framework for Video Titling

In e-commerce, consumer-generated videos, which in general deliver consu...
research
10/18/2022

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

In recent years, deep neural networks have demonstrated increasingly str...
research
09/05/2017

Multi-label Class-imbalanced Action Recognition in Hockey Videos via 3D Convolutional Neural Networks

Automatic analysis of the video is one of most complex problems in the f...
research
04/05/2020

Deep Homography Estimation for Dynamic Scenes

Homography estimation is an important step in many computer vision probl...

Please sign up or login with your details

Forgot password? Click here to reset