AMAGO: Scalable In-Context Reinforcement Learning
for Adaptive Agents
Jake Grigsby1 Jim Fan2 Yuke Zhu1
1The University of Texas at Austin 2NVIDIA Research
Paper | Code
"In-context" RL trains memory-equipped agents to adapt to new environments from test-time experience and unifies meta-RL, zero-shot generalization, and long-term memory into a single problem. While this technique was one of the first approaches to deep meta-RL [1], it is often outperformed by more complicated methods. Fortunately, the right off-policy implementation details and tuning can make in-context RL stable and competitive [2]. Off-policy in-context RL creates a tradeoff because it is conceptually simple but hard to use, and agents are limited by their model size, memory length, and planning horizon. AMAGO redesigns off-policy sequence-based RL to break these bottlenecks and stably train long-context Transformers with end-to-end RL. AMAGO is open-source and designed to require minimal tuning with the goal of making in-context RL an easy-to-use default in new research on adaptive agents.
Improving Off-Policy Actor-Critics with Transformers
AMAGO improves memory and adaptation by optimizing long-context Transformers on sequences gathered from large off-policy datasets. This creates many technical challenges that we address with three main ideas:
- Sharing One Sequence Model. Actors and critics are updated simultaneously on top of the outputs of a single sequence model that learns from every training objective and maximizes throughput. AMAGO's update looks more like supervised sequence modeling than a typical actor-critic. This approach is discouraged in previous work but can be stabilized with careful implementation details (a rough sketch follows this list).
- Long-Horizon Off-Policy Updates. AMAGO's learning update improves performance and reduces tuning by always giving the sequence model "something to learn about": we compute RL losses over many planning horizons (\(\gamma\)) that have different optimization landscapes depending on current performance. When all else fails, AMAGO includes an offline RL term that resembles supervised learning and does not depend on the scale of returns. This "multi-\(\gamma\)" update makes AMAGO especially effective for sparse rewards over long horizons.
- Stabilizing Long-Context Transformers. Both RL and Transformers can be unstable on their own, and combining them creates more obstacles. An especially relevant issue in memory-intensive RL is attention entropy collapse, because the optimal memory patterns in RL environments can be far more specific than in language modeling. We use a stable Transformer block that prevents collapse and reduces tuning by letting us pick model sizes that are safely larger than the problem requires.
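To make the first two ideas more concrete, here is a rough PyTorch sketch of a single sequence model feeding both a policy head and one critic head per discount factor. Module names, shapes, and the one-step targets are illustrative assumptions, not AMAGO's actual code:

```python
import torch
import torch.nn as nn

class SharedSeqAgent(nn.Module):
    """One shared sequence model feeds the policy head and a critic head per gamma."""

    def __init__(self, obs_dim: int, act_dim: int, d_model: int = 128, n_gammas: int = 4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)  # shared sequence model
        self.actor = nn.Linear(d_model, act_dim)     # policy logits
        self.critics = nn.Linear(d_model, n_gammas)  # one value estimate per discount

    def forward(self, obs_seq: torch.Tensor):
        T = obs_seq.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.trunk(self.embed(obs_seq), mask=causal)  # (batch, time, d_model)
        return self.actor(h), self.critics(h)

# Every loss term backpropagates through the same trunk, so the update looks more
# like supervised sequence modeling over the whole context than two separate networks.
agent = SharedSeqAgent(obs_dim=16, act_dim=6)
obs_seq = torch.randn(8, 32, 16)                   # 8 contexts of 32 timesteps
logits, values = agent(obs_seq)                    # values: (8, 32, 4)
gammas = torch.tensor([0.9, 0.99, 0.999, 0.9999])  # several planning horizons at once
rewards = torch.randn(8, 32, 1)
# crude one-step targets per gamma (ignores episode boundaries; illustration only)
targets = rewards + gammas * values.detach().roll(-1, dims=1)
critic_loss = ((values - targets) ** 2).mean()
```

Training every discount in parallel is what gives the update "something to learn about" at every stage: short horizons provide signal early, while long horizons matter once sparse rewards are reachable.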
In-Context RL's flexibility lets us evaluate AMAGO on many generalization, memory, and meta-learning domains with minimal changes.
Meta-RL and Long-Term Memory
AMAGO lets us put Transformers' impressive recall to effective use in RL tasks. We evaluate AMAGO on 39 environments from the POPGym suite, where it leads to dramatic improvements in memory-intensive generalization problems and creates a strong default for sequence-based RL:
AMAGO handles meta-learning as a simple extension of zero-shot generalization, and we demonstrate its stability and flexibility on several common meta-RL benchmarks. AMAGO makes it easy to tune memory lengths to the adaptation difficulty of the problem but is efficient enough to train with context lengths of hundreds or thousands of timesteps.
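As a rough picture of how meta-learning reduces to zero-shot generalization with memory, the sketch below rolls out k episodes of the same sampled task while the agent's context persists across resets, so later attempts can exploit whatever earlier attempts revealed. The rollout helper, `policy.act`, and the tuple format are hypothetical placeholders rather than AMAGO's API:

```python
def run_k_shot_trial(env, policy, k=5, max_episode_steps=500):
    """Roll out k attempts at one sampled task; the env resets, the context does not."""
    context = []  # persists across all k attempts
    for attempt in range(k):
        obs, _ = env.reset()              # new episode, same task instance
        prev_action, prev_reward = None, 0.0
        for _ in range(max_episode_steps):
            context.append((obs, prev_action, prev_reward, attempt))
            action = policy.act(context)  # attends over every attempt so far
            obs, reward, terminated, truncated, _ = env.step(action)
            prev_action, prev_reward = action, reward
            if terminated or truncated:
                break
    return context
```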
Adaptive Instruction-Following
An important benefit of off-policy learning is the ability to relabel rewards in hindsight. AMAGO extends hindsight experience replay to "instructions": sequences of multiple goals. Relabeling instructions extends the diversity of our dataset and plays to the strengths of data-hungry Transformers while generating automatic exploration curricula for more complex objectives. The combination of AMAGO's relabeling, memory-based adaptation, and long-horizon learning update can be very effective in goal-conditioned generalization tasks. We introduce several easily-simulated benchmarks to research this setting, which highlight the importance of AMAGO's technical details:
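To make the relabeling idea concrete, here is a minimal sketch under assumed data structures (a stored trajectory plus a log of which goals were achieved and when); the details of AMAGO's actual relabeling differ:

```python
import random

def relabel_instruction(trajectory, achieved_goals, max_len=3):
    """trajectory: list of (obs, action) steps; achieved_goals: time-ordered (step_idx, goal) pairs."""
    if not achieved_goals:
        return trajectory, [], [0.0] * len(trajectory)
    # sample an ordered subsequence of goals the agent actually accomplished
    k = random.randint(1, min(max_len, len(achieved_goals)))
    chosen = sorted(random.sample(range(len(achieved_goals)), k))
    new_instruction = [achieved_goals[i][1] for i in chosen]
    # sparse reward: +1 each time the next goal in the relabeled instruction is completed
    rewards, next_goal = [0.0] * len(trajectory), 0
    for step_idx, goal in achieved_goals:
        if next_goal < len(new_instruction) and goal == new_instruction[next_goal]:
            rewards[step_idx] = 1.0
            next_goal += 1
    return trajectory, new_instruction, rewards
```

Because the relabeled instruction is always something the agent actually did, every stored trajectory becomes a successful demonstration of some instruction, which is what supplies the automatic curriculum.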
Finally, we evaluate AMAGO in the procedurally generated worlds of Crafter. Instructions are strings from a closed vocabulary of Crafter's achievement system, with added goals for navigation and block placement.
Above, we use several single-task instructions to evaluate the exploration capabilities of various ablations. As tasks require more exploration and adaptation to new world layouts, AMAGO's memory and relabeling become essential to success. Multi-step goals require considerable generalization, and AMAGO qualitatively demonstrates a clear understanding of the instruction. Sample videos are shown below; these tasks are prompted by the user at test-time, and each video represents just one of the thousands of instructions an agent was trained on.
"collect sapling, place plant x2, eat cow" | "eat cow, make stone pickaxe, collect coal, make stone sword, defeat zombie" | "make wood pickaxe, collect stone, build at (30, 30)" | "travel to (10, 10), place stone, travel to (50, 50), place stone" |
Check out our paper for more details and results!
Using AMAGO
In-context RL is applicable to any memory, generalization, or meta-learning problem, and we have designed AMAGO to be flexible enough to support all of these cases. Our code is fully open-source and available on GitHub. We hope our agent can serve as a strong baseline in the development of new benchmarks that require long-term memory and adaptation, and we include many examples of how to apply AMAGO to:
- Standard (Memory-Free) MDPs/gym Environments
- POMDPs and Long-Term Memory Tasks
- K-Shot Meta-RL
- Goal-Conditioned Environment Adaptation
- Multi-task Learning from Pixels
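As one example of the kind of problem these cover, the toy environment below is a hypothetical recall task written against the standard gymnasium API (it is not part of AMAGO's example suite): the agent sees a cue only on its first observation and is rewarded for repeating that cue many steps later, so it cannot be solved without long-term memory.

```python
import numpy as np
import gymnasium as gym

class RecallCue(gym.Env):
    """Toy POMDP: remember the cue shown at t=0 and repeat it after `delay` steps."""

    def __init__(self, delay=50, n_cues=4):
        self.delay, self.n_cues = delay, n_cues
        self.observation_space = gym.spaces.Box(0.0, 1.0, shape=(n_cues + 1,))
        self.action_space = gym.spaces.Discrete(n_cues)

    def _obs(self, cue=None):
        obs = np.zeros(self.n_cues + 1, dtype=np.float32)
        if cue is not None:
            obs[cue] = 1.0              # cue appears only in the first observation
        obs[-1] = self.t / self.delay   # a simple time feature (does not reveal the cue)
        return obs

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.cue = 0, int(self.np_random.integers(self.n_cues))
        return self._obs(self.cue), {}

    def step(self, action):
        self.t += 1
        done = self.t >= self.delay
        reward = float(done and action == self.cue)  # reward only for recalling the cue
        return self._obs(), reward, done, False, {}
```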
Citation