Detecting Attended Visual Targets in Video
We address the problem of detecting attention targets in video. Specifically, our goal is to identify where each person in each frame of a video is looking, and to correctly handle the case where the attention target lies outside the frame. Our novel architecture models the dynamic interaction between scene and head features in order to infer time-varying attention targets. We introduce a new dataset, VideoAttentionTarget, consisting of fully-annotated video clips containing complex and dynamic patterns of real-world gaze behavior. Experiments on this dataset show that our model can effectively infer attention in videos. To further demonstrate the utility of our approach, we apply our predicted attention maps to two social gaze behavior recognition tasks, and show that the resulting classifiers significantly outperform existing methods. We achieve state-of-the-art performance on three datasets: GazeFollow (static images), VideoAttentionTarget (videos), and VideoCoAtt (videos), and obtain the first results for automatically classifying clinically-relevant gaze behavior without wearable cameras or eye trackers.
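To make the core idea concrete, the following is a minimal NumPy sketch of one plausible fusion step: a head embedding modulates a scene feature map to produce a spatial attention heatmap, with the heatmap's peak probability serving as a toy proxy for the in-frame/out-of-frame decision. All names, shapes, and the fusion rule here are illustrative assumptions, not the paper's actual architecture, which is learned end-to-end.

```python
import numpy as np


def fuse_attention(scene_feat, head_feat):
    """Toy fusion (assumption, not the paper's model): take the channel-wise
    inner product between a head embedding and scene features, then
    softmax-normalize over spatial locations to get an attention heatmap.

    scene_feat: (C, H, W) array of scene features.
    head_feat:  (C,) head embedding.
    Returns an (H, W) heatmap that sums to 1.
    """
    # Contract over the channel dimension -> per-location logits (H, W).
    logits = np.tensordot(head_feat, scene_feat, axes=([0], [0]))
    # Numerically stable softmax over all spatial positions.
    flat = logits.ravel()
    probs = np.exp(flat - flat.max())
    probs /= probs.sum()
    return probs.reshape(logits.shape)


def in_frame_score(heatmap):
    """Toy proxy for the out-of-frame case: a diffuse heatmap (low peak)
    suggests no in-frame target; a sharp peak suggests an in-frame one."""
    return float(heatmap.max())
```

A usage example: `fuse_attention(rng.standard_normal((8, 7, 7)), rng.standard_normal(8))` yields a 7x7 heatmap summing to 1, which can then be thresholded via `in_frame_score` or fed to a downstream gaze-behavior classifier, as the attention maps are in the paper.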