Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Offline RL

by Bogdan Mazoure, et al.

We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility. Optimizing a long-term metric is challenging because the learning signal (whether the recommendations achieved their desired goals) is delayed and confounded by other user interactions with the system. Immediately measurable proxies such as clicks can lead to suboptimal recommendations due to misalignment with the long-term metric. Many works have applied episodic reinforcement learning (RL) techniques to session-based recommendation, but these methods do not account for policy-induced drift in user intent across sessions. We develop a new batch RL algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced distribution shifts across sessions. By varying the horizon hyper-parameter in SHPI, we recover well-known policy improvement schemes in the RL literature. Empirical results on four recommendation tasks show that SHPI can outperform matrix factorization, offline bandits, and offline RL baselines. We also provide a stable and computationally efficient implementation using weighted regression oracles.




Offline Meta-level Model-based Reinforcement Learning Approach for Cold-Start Recommendation

Reinforcement learning (RL) has shown great promise in optimizing long-t...

Integrating Offline Reinforcement Learning with Transformers for Sequential Recommendation

We consider the problem of sequential recommendation, where the current ...

Batch-Constrained Distributional Reinforcement Learning for Session-based Recommendation

Most of the existing deep reinforcement learning (RL) approaches for ses...

Multi-Task Fusion via Reinforcement Learning for Long-Term User Satisfaction in Recommender Systems

Recommender System (RS) is an important online application that affects ...

Sequential Search with Off-Policy Reinforcement Learning

Recent years have seen a significant amount of interests in Sequential R...

User Retention-oriented Recommendation with Decision Transformer

Improving user retention with reinforcement learning (RL) has attracted ...

ResAct: Reinforcing Long-term Engagement in Sequential Recommendation with Residual Actor

Long-term engagement is preferred over immediate engagement in sequentia...
