Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles

by Zhiwei Tang, et al.

In this paper, we study a novel optimization problem in which the objective function is a black box that can only be evaluated through a ranking oracle. This problem arises often in real-world applications, particularly when the function is assessed by human judges. Reinforcement Learning with Human Feedback (RLHF) is a prominent example, adopted in recent work to improve the quality of Large Language Models (LLMs) with human guidance. We propose ZO-RankSGD, a first-of-its-kind zeroth-order optimization algorithm, to solve this problem with a theoretical guarantee. Specifically, our algorithm employs a new rank-based random estimator for the descent direction and is proven to converge to a stationary point. ZO-RankSGD can also be applied directly to the policy search problem in reinforcement learning when only a ranking oracle of the episode reward is available. This makes ZO-RankSGD a promising alternative to existing RLHF methods, as it optimizes in an online fashion and thus can work without any pre-collected data. Furthermore, we demonstrate the effectiveness of ZO-RankSGD in a novel application: improving the quality of images generated by a diffusion generative model with human ranking feedback. In our experiments, we found that ZO-RankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback. Overall, our work advances the field of zeroth-order optimization by addressing the problem of optimizing functions with only ranking feedback, and offers an effective approach for aligning human and machine intentions in a wide range of domains. Our code has been released.
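The abstract does not give the form of the rank-based estimator, so the following is only a minimal sketch of the general idea, not the authors' ZO-RankSGD. It assumes a hypothetical oracle that returns a full ranking of the queried points, Gaussian perturbations, and a direction formed by averaging "worse minus better" perturbation differences over all ranked pairs. The names `ranking_oracle`, `rank_based_direction`, `zo_rank_descent`, and all parameter values are illustrative choices.

```python
import numpy as np


def ranking_oracle(f, points):
    """Hypothetical oracle: indices of `points` ordered from best (lowest f)
    to worst. Only the ordering is exposed, never the function values."""
    values = [f(p) for p in points]
    return [int(i) for i in np.argsort(values)]


def rank_based_direction(order, perturbations):
    """Average (worse - better) perturbation difference over all ranked pairs.

    Assumption for this sketch: the average points roughly uphill on f,
    so the caller steps in the opposite direction.
    """
    direction = np.zeros_like(perturbations[0])
    pairs = 0
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            better, worse = order[a], order[b]
            direction += perturbations[worse] - perturbations[better]
            pairs += 1
    return direction / pairs


def zo_rank_descent(f, x0, steps=200, m=8, mu=0.1, lr=0.1, seed=0):
    """Descent loop that uses only rankings of m perturbed points per step."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        xi = rng.standard_normal((m, x.size))   # random search directions
        order = ranking_oracle(f, x + mu * xi)  # ranking feedback only
        x = x - lr * rank_based_direction(order, xi)
    return x


if __name__ == "__main__":
    # Toy objective standing in for a human-judged score (lower is better).
    quadratic = lambda x: float(np.sum(x ** 2))
    print(zo_rank_descent(quadratic, [3.0, -2.0]))
```

Because a ranking carries no magnitude information, the step length in this sketch does not shrink near a minimizer; with a constant `lr` the iterate settles into a small neighborhood rather than converging exactly, so a decaying step size would be a natural variation.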



Censored Sampling of Diffusion Models Using 3 Minutes of Human Feedback

Diffusion models have recently shown remarkable success in high-quality ...

Training Diffusion Models with Reinforcement Learning

Diffusion models are a class of flexible generative models trained with ...

Preference Ranking Optimization for Human Alignment

Large language models (LLMs) often contain misleading content, emphasizi...

FABRIC: Personalizing Diffusion Models with Iterative Feedback

In an era where visual content generation is increasingly driven by mach...

Provable Offline Reinforcement Learning with Human Feedback

In this paper, we investigate the problem of offline reinforcement learn...

Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models

Neural text ranking models have witnessed significant advancement and ar...

Designing Biological Sequences via Meta-Reinforcement Learning and Bayesian Optimization

The ability to accelerate the design of biological sequences can have a ...
