Large vision-language models (LVLMs) have recently witnessed rapid advan...
We introduce the Qwen-VL series, a set of large-scale vision-language mo...
Recently, plain vision Transformers (ViTs) have shown impressive perform...
Although recent approaches aiming for video instance segmentation have a...
Both masked image modeling (MIM) and natural language supervision have f...
Since the development of self-supervised visual representation learning ...
Recently vision transformer has achieved tremendous success on image-lev...
We present an approach to efficiently and effectively adapt a masked ima...
Evaluation metrics in machine learning are often hardly taken as loss fu...
Recently, query based deep networks catch lots of attention owing to the...
Recently, query based object detection frameworks achieve comparable per...
Modeling temporal visual context across frames is critical for video ins...