We propose LocFormer, a Transformer-based model for video grounding whic...
We extend the task of composed image retrieval, where an input query con...
Accuracy of many visiolinguistic tasks has benefited significantly from ...
Vision-and-Language Navigation (VLN) requires an agent to navigate in a
...
This paper studies the task of temporal moment localization in a long
un...
The availability of a large labeled dataset is a key requirement for app...
Despite the recent advances in opinion mining for written reviews, few w...
Vision-and-language navigation requires an agent to navigate through a r...
Vision-based sign language recognition aims at helping the hearing-impai...
This paper studies the problem of temporal moment localization in a long...