amago.envs.exploration#

Exploration via gymnasium wrappers.

Classes

BilevelEpsilonGreedy(amago_env[, ...])

Implements the bi-level epsilon greedy exploration strategy visualized in Figure 13 of the AMAGO paper.

EpsilonGreedy(amago_env[, eps_start, ...])

Standard exploration noise schedule.

ExplorationWrapper(amago_env)

Abstract base class for exploration wrappers.

class BilevelEpsilonGreedy(amago_env, rollout_horizon=gin.REQUIRED, eps_start_start=1.0, eps_start_end=0.05, eps_end_start=0.8, eps_end_end=0.01, steps_anneal=1000000, randomize_eps=True)[source]#

Bases: ExplorationWrapper

Implements the bi-level epsilon greedy exploration strategy visualized in Figure 13 of the AMAGO paper.

Tip

This class is @gin.configurable. Default values of kwargs can be overridden using gin (https://github.com/google/gin-config).

Exploration noise decays both over the course of training and throughout each rollout. This more closely resembles the online exploration/exploitation strategy of a meta-RL agent in a new environment.

Parameters:
  • amago_env – The environment to wrap.

  • rollout_horizon (int) – The expected maximum length of each rollout. This is gin.REQUIRED, meaning it must be configured via gin on a case-by-case basis, e.g. gin.bind_parameter("BilevelEpsilonGreedy.rollout_horizon", 100); see the sketch after this parameter list.

  • eps_start_start (float) – Exploration noise at the start of training and start of a rollout. Default is 1.0.

  • eps_start_end (float) – Exploration noise at the end of training and start of a rollout. Default is 0.05.

  • eps_end_start (float) – Exploration noise at the start of training and end of a rollout. Default is 0.8.

  • eps_end_end (float) – Exploration noise at the end of training and end of a rollout. Default is 0.01.

  • steps_anneal (int) – Linear schedule end point (in terms of steps taken in each actor process). Default is 1,000,000.

  • randomize_eps (bool) – Treat the schedule as the max and sample uniform [0, max) for each actor. This tends to reduce the need to tune steps_anneal. Default is True.
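Because rollout_horizon is gin.REQUIRED, it has to be bound before the wrapper is constructed. A minimal sketch, assuming an AMAGO environment has already been built (amago_env below is a placeholder):

    import gin

    from amago.envs.exploration import BilevelEpsilonGreedy

    # rollout_horizon is gin.REQUIRED, so bind it before constructing the wrapper.
    gin.bind_parameter("BilevelEpsilonGreedy.rollout_horizon", 100)
    # Any other kwarg default can optionally be overridden the same way.
    gin.bind_parameter("BilevelEpsilonGreedy.steps_anneal", 500_000)

    amago_env = ...  # an already-constructed AMAGO environment (construction omitted)
    exploration_env = BilevelEpsilonGreedy(amago_env)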

add_exploration_noise(action, local_step)[source]#

Add exploration noise to the action.

Parameters:
  • action (ndarray) – The action provided by the Agent.

  • local_step (ndarray) – The number of timesteps since the last episode reset.

Returns:

The action to be taken in the environment.

current_eps(local_step)[source]#
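current_eps returns the schedule's epsilon for a given local_step. The two-level linear schedule implied by the constructor parameters can be sketched as below; this is a hypothetical reconstruction (names such as global_step are assumptions, and the per-actor randomization applied when randomize_eps=True is omitted):

    def bilevel_eps(
        global_step: int,
        local_step: int,
        rollout_horizon: int,
        eps_start_start: float = 1.0,
        eps_start_end: float = 0.05,
        eps_end_start: float = 0.8,
        eps_end_end: float = 0.01,
        steps_anneal: int = 1_000_000,
    ) -> float:
        # Fraction of the training-time anneal that has elapsed, clipped to [0, 1].
        train_frac = min(global_step / steps_anneal, 1.0)
        # Epsilon at the start of a rollout, decayed over training.
        eps_at_rollout_start = eps_start_start + train_frac * (eps_start_end - eps_start_start)
        # Epsilon at the end of a rollout, decayed over training.
        eps_at_rollout_end = eps_end_start + train_frac * (eps_end_end - eps_end_start)
        # Within a rollout, decay from the start value toward the end value.
        rollout_frac = min(local_step / rollout_horizon, 1.0)
        return eps_at_rollout_start + rollout_frac * (eps_at_rollout_end - eps_at_rollout_start)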
reset(*args, **kwargs)[source]#

Calls the reset() of the wrapped env; can be overridden to change the returned data.

step(action)[source]#

Runs env.step() using the modified action from self.action().

class EpsilonGreedy(amago_env, eps_start=1.0, eps_end=0.05, steps_anneal=1000000, randomize_eps=True)[source]#

Bases: BilevelEpsilonGreedy

Standard exploration noise schedule.

Tip

This class is @gin.configurable. Default values of kwargs can be overridden using gin.

Classic epsilon-greedy exploration in discrete action spaces. In continuous action spaces, applies TD3-style noise: (action + epsilon * noise).clip(-1, 1).
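A minimal sketch of those two behaviors, using illustrative helper functions rather than the library's internals:

    import numpy as np

    rng = np.random.default_rng()

    def eps_greedy_discrete(action: int, eps: float, n_actions: int) -> int:
        # With probability eps, replace the agent's action with a uniform random one.
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return action

    def td3_style_continuous(action: np.ndarray, eps: float) -> np.ndarray:
        # Add epsilon-scaled Gaussian noise, then clip back to the [-1, 1] action range.
        noise = rng.normal(size=np.shape(action))
        return np.clip(action + eps * noise, -1.0, 1.0)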

Parameters:
  • amago_env – The environment to wrap.

  • eps_start (float) – Exploration noise at the start of training. Default is 1.0.

  • eps_end (float) – Exploration noise at the end of training. Default is 0.05.

  • steps_anneal (int) – Linear schedule end point (in terms of steps taken in a single actor process). Default is 1,000,000.

  • randomize_eps (bool) – Treat the schedule as the max and sample uniform [0, max) for each parallel actor. This tends to reduce the need to tune steps_anneal. Default is True.
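Nothing here is gin.REQUIRED, so the wrapper can be constructed directly; a minimal usage sketch, assuming an AMAGO environment has already been built (amago_env is a placeholder):

    from amago.envs.exploration import EpsilonGreedy

    amago_env = ...  # an already-constructed AMAGO environment (construction omitted)

    # Keyword arguments override the gin-configurable defaults for this instance.
    exploration_env = EpsilonGreedy(
        amago_env,
        eps_start=1.0,
        eps_end=0.05,
        steps_anneal=500_000,
        randomize_eps=True,
    )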

class ExplorationWrapper(amago_env)[source]#

Bases: ABC, ActionWrapper

Abstract base class for exploration wrappers.

Parameters:

amago_env – The environment to wrap.

action(a)[source]#

Returns a modified action before env.step() is called.

Parameters:

action – The original action passed to step()

Return type:

ndarray

Returns:

The modified action

abstract add_exploration_noise(action, local_step)[source]#

Add exploration noise to the action.

Parameters:
  • action (ndarray) – The action provided by the Agent.

  • local_step (int) – The number of timesteps since the last episode reset.

Return type:

ndarray

Returns:

The action to be taken in the environment.
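Concrete strategies subclass this wrapper and implement add_exploration_noise. A hypothetical sketch (the class name and the constant-sigma Gaussian noise are invented for illustration, and a continuous [-1, 1] action space is assumed):

    import numpy as np

    from amago.envs.exploration import ExplorationWrapper

    class FixedGaussianNoise(ExplorationWrapper):
        """Hypothetical example: constant Gaussian noise on continuous actions."""

        def __init__(self, amago_env, sigma: float = 0.1):
            super().__init__(amago_env)
            self.sigma = sigma

        def add_exploration_noise(self, action: np.ndarray, local_step: int) -> np.ndarray:
            # Ignore local_step: this noise does not decay within a rollout or over training.
            noise = self.sigma * np.random.randn(*action.shape)
            return np.clip(action + noise, -1.0, 1.0)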

property return_history#
property success_history#