Graph Distillation for Action Detection with Privileged Information
In this work, we propose a technique that tackles the video understanding problem under a realistic, demanding condition: limited labeled data and only partially observed training modalities. Common approaches such as transfer learning do not take advantage of the rich information from extra modalities potentially available in the source-domain dataset, while previous work on cross-modal learning focuses only on a single domain or task. We propose a graph-based distillation method that incorporates rich privileged information from a large multi-modal dataset in the source domain and improves performance in the target domain, where data is scarce. By leveraging both a large-scale dataset and its extra modalities, our method learns a better model for temporal action detection and action classification without requiring access to these modalities at test time. We evaluate our approach on action classification and temporal action detection tasks, and show that our models achieve state-of-the-art performance on the PKU-MMD and NTU RGB+D datasets.
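To make the idea concrete, below is a minimal sketch of graph distillation in PyTorch. It is our own simplified illustration, not the authors' implementation: the class names (`ModalityNet`, `GraphDistillation`), the per-edge learnable weights, and the temperature-scaled KL loss are all assumptions about one plausible instantiation. During training every modality is observed and supervised; the test-time modality additionally imitates a learned, softmax-weighted mix of the privileged modalities' predictions, so those modalities can be dropped at inference.

```python
# Hypothetical sketch of graph distillation across modalities; a simplified
# re-implementation for illustration, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityNet(nn.Module):
    """Per-modality encoder + classifier (placeholder architecture)."""
    def __init__(self, in_dim, num_classes, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

class GraphDistillation(nn.Module):
    """Distills several privileged modalities into one target modality.

    Edge weights of the distillation graph are learned end-to-end; a
    softmax keeps them a valid weighting over the teacher modalities.
    """
    def __init__(self, modality_dims, num_classes, target_idx=0, tau=4.0):
        super().__init__()
        self.models = nn.ModuleList(
            ModalityNet(d, num_classes) for d in modality_dims)
        self.target_idx = target_idx
        self.tau = tau
        # One learnable logit per (teacher -> target) graph edge.
        self.edge_logits = nn.Parameter(torch.zeros(len(modality_dims)))

    def forward(self, inputs, labels):
        logits = [m(x) for m, x in zip(self.models, inputs)]
        # Supervised loss on every modality (all observed at train time).
        ce = sum(F.cross_entropy(l, labels) for l in logits)
        # Distillation: the target modality imitates a weighted mix of
        # the other modalities' temperature-softened predictions.
        teachers = [j for j in range(len(logits)) if j != self.target_idx]
        w = F.softmax(self.edge_logits[teachers], dim=0)
        student = F.log_softmax(logits[self.target_idx] / self.tau, dim=1)
        kd = 0.0
        for k, j in enumerate(teachers):
            teacher = F.softmax(logits[j].detach() / self.tau, dim=1)
            kd = kd + w[k] * F.kl_div(student, teacher, reduction="batchmean")
        return ce + (self.tau ** 2) * kd

# Toy usage: RGB (dim 512) is the test-time modality; depth and skeleton
# features (dims 256, 75) act as privileged modalities during training.
model = GraphDistillation([512, 256, 75], num_classes=60, target_idx=0)
feats = [torch.randn(8, d) for d in (512, 256, 75)]
loss = model(feats, torch.randint(0, 60, (8,)))
loss.backward()
```

The learned edge weights play the role of the distillation graph: they decide how strongly each privileged modality influences the test-time model. Because only the target-modality branch is needed at inference, the extra modalities can be discarded after training, matching the setting described above.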