Recently, large-scale pre-trained language-image models like CLIP have s...
Relational Language-Image Pre-training (RLIP) aims to align vision
repre...
This paper introduces ModelScopeT2V, a text-to-video synthesis model tha...
Spatial convolutions are extensively used in numerous deep video models....
The pursuit of controllability as a higher standard of visual content
cr...
Current state-of-the-art approaches for few-shot action recognition achi...
Learning from large-scale contrastive language-image pre-training like C...
Decentralized optimization is effective to save communication in large-s...
Communication overhead hinders the scalability of large-scale distribute...
The scale of deep learning nowadays calls for efficient distributed trai...
Researches have demonstrated that low bit-width (e.g., INT8) quantizatio...