Training a network to attend like human drivers saves it from common but misleading loss functions
We propose a novel FCN-ConvLSTM model to predict multi-focal human driver attention solely from monocular dash-camera videos. Our model surpasses state-of-the-art performance and exhibits sophisticated behaviors, such as watching out for a driver exiting a parked car. In addition, we demonstrate a surprising paradox: fine-tuning AlexNet on a large-scale driving dataset degraded its ability to register pedestrians. This is because the commonly practiced training paradigm fails to reflect the differing importance of the frames in driving video datasets. As a solution, we propose sampling training frames unequally, with appropriate probabilities, and introduce a way of using human gaze to determine the sampling weights. We demonstrate the effectiveness of this proposal on human driver attention prediction, and we believe it can also be generalized to other driving-related machine learning tasks.
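The abstract does not specify how gaze determines the sampling weights, but the idea can be sketched as follows. This is a hypothetical illustration, not the authors' method: it scores each frame by how much its gaze map deviates from the dataset-average gaze pattern (via KL divergence), so atypical, attention-demanding frames are sampled more often. All function and variable names are assumptions.

```python
import numpy as np

def gaze_sampling_weights(gaze_maps, eps=1e-8):
    """Hypothetical sketch: per-frame sampling weights from human gaze maps.

    Frames whose gaze distribution deviates strongly from the dataset
    average (e.g., the driver fixates an unusual region) receive higher
    sampling probability.

    gaze_maps: array of shape (num_frames, H, W), each a gaze density map.
    """
    # Normalize each gaze map into a probability distribution over pixels.
    flat = gaze_maps.reshape(len(gaze_maps), -1)
    flat = flat / (flat.sum(axis=1, keepdims=True) + eps)
    # Mean gaze distribution across all frames: the "typical" frame.
    mean = flat.mean(axis=0, keepdims=True)
    # KL divergence of each frame's gaze from the mean; atypical frames score high.
    kl = (flat * (np.log(flat + eps) - np.log(mean + eps))).sum(axis=1)
    kl = np.maximum(kl, 0.0)  # guard against tiny negatives from eps smoothing
    # Convert scores into sampling probabilities.
    return kl / kl.sum()

rng = np.random.default_rng(0)
maps = rng.random((100, 16, 16))
maps[7] *= 0.0
maps[7, 2, 3] = 1.0  # frame 7: sharply focused, highly atypical gaze
w = gaze_sampling_weights(maps)
batch = rng.choice(100, size=8, replace=False, p=w)  # weighted frame sampling
```

In this sketch, frame 7 dominates the sampling weights because a one-hot gaze map diverges far more from the average than the near-uniform random maps do.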