Large-scale vision-language models (VLMs), e.g., CLIP, learn broad visua...
Nowadays, the research on Large Vision-Language Models (LVLMs) has been
...
Learning self-supervised image representations has been broadly studied ...
Hierarchical semantic structures naturally exist in an image dataset, in...
Autonomous highlight detection is crucial for enhancing the efficiency o...