Compositional Learning of Visually-Grounded Concepts Using Reinforcement Learning

by Zijun Lin et al.

Deep reinforcement learning agents need to be trained over millions of episodes to adequately solve navigation tasks grounded in instructions. Furthermore, their ability to generalize to novel combinations of instructions is unclear. Interestingly, however, children can decompose language-based instructions and navigate to the referred object, even if they have not previously seen the combination of queries. Hence, we created three 3D environments to investigate how deep RL agents learn and compose color-shape based combinatorial instructions to solve novel combinations in a spatial navigation task. First, we explore whether agents can perform compositional learning, and whether they can leverage frozen text encoders (e.g. CLIP, BERT) to learn word combinations in fewer episodes. Next, we demonstrate that when agents are pretrained on the shape or color concepts separately, they show a 20-fold decrease in the number of training episodes needed to solve unseen combinations of instructions. Lastly, we show that agents pretrained on concept and compositional learning achieve significantly higher reward when evaluated zero-shot on novel color-shape1-shape2 visual object combinations. Overall, our results highlight the foundations needed to increase an agent's proficiency in composing word groups through reinforcement learning and its ability for zero-shot generalization to new combinations.
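To make the frozen-text-encoder setup concrete, the following is a minimal sketch of an instruction-conditioned policy. It is not the paper's implementation: the `frozen_text_encoder` below is a deterministic hash-based stand-in for a real pretrained text tower (e.g. CLIP's or BERT's), and the linear policy, dimensions, and action set are all illustrative assumptions. The key structural point it shows is that only the policy weights would be trained by RL, while the text encoder's output for a given instruction stays fixed.

```python
import hashlib
import random

ACTIONS = ["forward", "turn_left", "turn_right"]

def frozen_text_encoder(instruction, dim=8):
    """Stand-in for a frozen pretrained text encoder (e.g. CLIP/BERT).
    Maps an instruction like "red sphere" to a fixed deterministic
    vector; these 'weights' are never updated during RL training."""
    digest = hashlib.sha256(instruction.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

class InstructionConditionedPolicy:
    """Tiny linear policy over [text embedding ++ visual features].
    Only these weights would be updated by the RL algorithm."""
    def __init__(self, text_dim=8, visual_dim=4, seed=0):
        rng = random.Random(seed)
        n_inputs = text_dim + visual_dim
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                  for _ in ACTIONS]

    def act(self, instruction, visual_features):
        # Concatenate the frozen language embedding with visual features,
        # then pick the highest-scoring action (greedy, for illustration).
        x = frozen_text_encoder(instruction) + visual_features
        scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        return ACTIONS[scores.index(max(scores))]

policy = InstructionConditionedPolicy()
action = policy.act("red sphere", [0.2, 0.9, 0.1, 0.0])
```

Because the encoder is frozen, "red sphere" and "red cube" share the representation pressure on the word "red" only through the encoder's pretraining, which is one intuition for why concept pretraining can cut the episodes needed for novel combinations.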


