Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

06/15/2023
by Xiaoshi Wu, et al.

Recent text-to-image generative models can produce high-fidelity images from text inputs, but existing evaluation metrics cannot accurately assess the quality of these generated images. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 430,060 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that more accurately predicts human preferences on generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of evaluation prompts for text-to-image generative models, to make evaluation stable, fair, and easy to use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, covering a set of recent models from academia, the open-source community, and industry. The code and dataset are available at https://github.com/tgxs002/HPSv2.
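Because HPS v2 is obtained by fine-tuning CLIP on pairwise preference choices, scoring an image at inference time amounts to computing the similarity between the prompt embedding and the image embedding from the fine-tuned encoders. The sketch below illustrates that scoring step with the open_clip library. It is a minimal sketch, not the authors' released code: the ViT-H-14 architecture and laion2b_s32b_b79k base weights are assumptions for illustration, and preference_score is a hypothetical helper; the actual fine-tuned HPS v2 checkpoint is distributed through the repository linked above.

```python
# Minimal sketch of HPS-style scoring: cosine similarity between text and
# image embeddings from a (preference-fine-tuned) CLIP model.
# Assumptions: ViT-H-14 with laion2b_s32b_b79k base weights stand in for
# the released HPS v2 checkpoint; swap in the real weights from the repo.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

def preference_score(image_path: str, prompt: str) -> float:
    """Score a generated image against its prompt (higher = more preferred)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([prompt])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        # Normalize so the dot product is a cosine similarity.
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ txt_feat.T).item()
```

In the benchmark setting described in the abstract, scores like this would be averaged over a fixed set of evaluation prompts to compare generative models against one another.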


Related research:

03/25/2023 · Better Aligning Text-to-Image Models with Human Preference
Recent years have witnessed a rapid growth of deep generative models, wi...

04/12/2023 · ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
We present ImageReward – the first general-purpose text-to-image human p...

12/12/2022 · T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
Modern embedding-based metrics for evaluation of generated text generall...

04/11/2023 · HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models
In recent years, Text-to-Image (T2I) models have been extensively studie...

12/15/2022 · TeTIm-Eval: a novel curated evaluation data set for comparing text-to-image models
Evaluating and comparing text-to-image models is a challenging problem. ...

11/17/2022 · Is the Elephant Flying? Resolving Ambiguities in Text-to-Image Generative Models
Natural language often contains ambiguities that can lead to misinterpre...

02/23/2023 · Aligning Text-to-Image Models using Human Feedback
Deep generative models have shown impressive results in text-to-image sy...
