Top-K Off-Policy Correction for a REINFORCE Recommender System

12/06/2018
by Minmin Chen, et al.

Industrial recommender systems deal with extremely large action spaces, with many millions of items to recommend. Moreover, they need to serve billions of users, each of whom is unique at any point in time, which creates a complex user state space. Luckily, huge quantities of logged implicit feedback (e.g., user clicks, dwell time) are available for learning. Learning from logged feedback is, however, subject to biases, because feedback is observed only on recommendations selected by previous versions of the recommender. In this work, we present a general recipe for addressing such biases in a production top-K recommender system at YouTube, built with a policy-gradient-based algorithm, i.e., REINFORCE. The contributions of the paper are: (1) scaling REINFORCE to a production recommender system with an action space on the order of millions; (2) applying off-policy correction to address data biases in learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing the value of exploration. We demonstrate the efficacy of our approaches through a series of simulations and multiple live experiments on YouTube.
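
To make contributions (2) and (3) concrete, here is a minimal sketch of a per-example gradient for REINFORCE with an importance weight and a top-K multiplier. It is not the production implementation. It assumes a softmax policy over a tiny item set, a known behavior-policy probability for the logged action, and the top-K multiplier K(1 - π(a|s))^(K-1) as given in the full paper. The names `softmax`, `topk_off_policy_grad`, and the `weight_cap` parameter are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_off_policy_grad(logits, action, reward, behavior_prob, k, weight_cap=10.0):
    """Gradient w.r.t. the logits for one logged (action, reward) pair.

    Illustrative assumptions: the policy is softmax(logits), behavior_prob
    is the logging policy's probability of the logged action, and
    weight_cap is a hypothetical cap on the importance weight.
    """
    pi = softmax(logits)
    pi_a = pi[action]

    # Off-policy correction: importance weight pi(a|s) / beta(a|s), capped
    # to limit variance.
    weight = min(pi_a / behavior_prob, weight_cap)

    # Top-K multiplier K * (1 - pi(a|s))^(K-1); it equals 1 when k == 1.
    lam = k * (1.0 - pi_a) ** (k - 1)

    # For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    scale = weight * lam * reward
    return [scale * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(pi)]


# Usage: four items with uniform logits; item 1 was logged with reward 1.
grad_k1 = topk_off_policy_grad([0.0, 0.0, 0.0, 0.0], 1, 1.0, 0.25, k=1)
grad_k2 = topk_off_policy_grad([0.0, 0.0, 0.0, 0.0], 1, 1.0, 0.25, k=2)
```

With `k=1` the multiplier is 1 and the update reduces to standard importance-weighted REINFORCE. For larger `k`, the multiplier shrinks as π(a|s) approaches 1, so items the policy already ranks with high probability receive smaller updates.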


