VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining

by Junjie Ke, et al.

Assessing the aesthetics of an image is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics. Existing image aesthetic assessment (IAA) methods primarily rely on human-labeled rating scores, which oversimplify the visual aesthetic information that humans perceive. Conversely, user comments offer more comprehensive information and are a more natural way to express human opinions and preferences regarding image aesthetics. In light of this, we propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels. To efficiently adapt the pretrained model for downstream IAA tasks, we further propose a lightweight rank-based adapter that employs text as an anchor to learn the aesthetic ranking concept. Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal finetuning parameters using the proposed adapter module, our model achieves state-of-the-art IAA performance over the AVA dataset.
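The zero-shot IAA idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an image embedding and two text-prompt embeddings (for example, for prompts such as "good image" and "bad image") have already been produced by some pretrained image-text model. The function names `cosine` and `zero_shot_score`, the prompt choice, and the `temperature` value are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_score(image_emb, good_emb, bad_emb, temperature=0.07):
    """Score in (0, 1): softmax weight of the "good" prompt vs. the "bad" one.

    All embeddings are assumed to come from a pretrained image-text model;
    the temperature is an illustrative assumption, not a value from the paper.
    """
    s_good = cosine(image_emb, good_emb) / temperature
    s_bad = cosine(image_emb, bad_emb) / temperature
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(s_good, s_bad)
    e_good = math.exp(s_good - m)
    e_bad = math.exp(s_bad - m)
    return e_good / (e_good + e_bad)
```

An image whose embedding lies closer to the "good" prompt embedding receives a score above 0.5, and one equidistant from both prompts receives exactly 0.5. Under these assumptions, no human rating labels are involved at scoring time.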



Natural Language Supervision for General-Purpose Audio Representations

Audio-Language models jointly learn multimodal text and audio representa...

Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

Traditional computer vision models are trained to predict a fixed set of...

CoCa: Contrastive Captioners are Image-Text Foundation Models

Exploring large-scale pretrained foundation models is of significant int...

Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors

Understanding the mechanisms underlying human attention is a fundamental...

Style-Content Disentanglement in Language-Image Pretraining Representations for Zero-Shot Sketch-to-Image Synthesis

In this work, we propose and validate a framework to leverage language-i...

IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining

Recently, large-scale Vision and Language (V&L) pretraining has become t...

Emojich – zero-shot emoji generation using Russian language: a technical report

This technical report presents a text-to-image neural network "Emojich" ...
