Over the past few months, I have frequently used the open-source reinforcement learning library rlpyt, to the point where it’s now one of the primary code bases in my research repertoire. There is a BAIR Blog post which nicely describes the rationale for rlpyt, along with its features.

Before rlpyt, my primary reinforcement learning library was OpenAI’s baselines. My switch from baselines to rlpyt was motivated by several factors. The primary one is that baselines is no longer actively maintained. I argued in an earlier blog post that it was one of OpenAI’s best resources, but I respect OpenAI’s decision to prioritize other resources, and if anything, baselines may have helped spur the development of subsequent reinforcement learning libraries. In addition, I wanted to switch to a reinforcement learning library that supported more recent algorithms such as distributional Deep Q-Networks, coupled with perhaps higher quality code with better documentation.

Aside from baselines and rlpyt, I have some experience with stable-baselines, which is a strictly superior version of baselines, but I also wanted to switch from TensorFlow to PyTorch, hence why I did not gravitate to stable-baselines. I have very limited experience with the first major open-source DeepRL library, rllab, which also came out of Berkeley, though I never used it for research as I got on the bandwagon relatively late. I also used John Schulman’s modular_rl library when I was trying to figure out how to implement Trust Region Policy Optimization. More recently, I have explored rlkit for its Twin-Delayed DDPG implementation, along with SpinningUp to see cleaner code implementations.

I know there are a slew of other DeepRL libraries, such as Intel’s NervanaSystems coach which I would like to try due to its huge variety of algorithms. There are also reinforcement learning libraries for distributed systems, but I prefer to run code on one machine to avoid complicating things.

Hence, rlpyt it is!

Installation and Quick Usage

To install rlpyt, observe that the repository already provides conda environment configuration files, which will bundle up the most important packages for you. This is not a virtualenv, though it has the same functional effect in practice. I believe conda environments and virtualenvs are the two main ways to get an isolated bundle of python packages.

On the machines I use, I find it easiest to first install miniconda. This can be done remotely by downloading via wget and running bash on it:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
// after installing ...
. ~/.bashrc  // to get conda commands to work
// to ensure (base) is not loaded by default
conda config --set auto_activate_base false
. ~/.bashrc  // to remove the (base) env

In the above, I set it so that conda does not automatically activate its “base” environment for myself. I like having a clean, non-environment setup by default on Ubuntu systems. In addition, during the bash command above, the Miniconda installer will ask this:

Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes

I answer “yes” so that it gets initialized.

After the above, I clone the repository and then install with this command:

conda env create -f linux_cuda10.yml

This will automatically make a new conda environment, specialized for Linux with CUDA 10 for the command above. Then, finally, don’t forget:

pip install -e .

to make rlpyt a package you can import within your conda environment, and to ensure that any chances you make in rlpyt will be propagated throughout your environment without having to do another pip install.

For quick usage, I follow the rlpyt README and use the examples directory. There are several scripts in there that can be run easily.

Possible Workflow

There are several possible workflows one can follow when using rlpyt. For running experiments, you can use scripts that mirror those in the examples directory. Alternatively, for perhaps more heavy-duty experiments, you can look at what’s in rlpyt/experiments. This contains configuration, launch, and run scripts, which provide utility methods for testing a wide variety of hyperparameters. Since that requires me to dive through three scripts that are nested deep into rlpyt’s code repository, I personally don’t follow that workflow; instead I just take a script in the examples directory and build upon it to handle more complex cases.

Here’s another thing I find useful. As I note later, rlpyt can use more CPU resources than expected. Therefore, particularly with machines I share with other researchers, I limit the number of CPUs that my scripts can “see.” I do this with taskset. For example, suppose I am using a server with 32 CPUs. I can run a script like this:

taskset -c 21-31 python experiments/subscribe_to_my_blog.py

and this will limit the script to using CPUs indexed from 21 to 31. On htop, this will be CPUs numbered 22 through 32, as it’s one-indexed there.

With this in mind, here is my rough workflow for heavy-duty experiments:

Double check the machine to ensure that there are enough resources available. For example, if nvidia-smi shows that the GPU usage is near 100% for all GPUs, then I’m either not going to run code, or I will send a Slack message to my collaborators politely inquiring when the machine will free up.
Enter a GNU screen via typing in screen.
Run conda activate rlpyt to activate the conda environment.
Set export CUDA_VISIBLE_DEVICES=x to limit the experiment to the desired GPU.
Run the script with taskset as described earlier.
Spend a few seconds afterwards checking that the script is running correctly.

There are variations to the above, such as using tmux instead of screen, but hopefully this general workflow makes sense for most researchers.

For plotting I don’t use the built-in plotter from rlpyt (which is really from another code base). I keep the progress.csv file and download it in a stand-alone python script for plotting. I also don’t use TensorBoard. In fact, I still have never used TensorBoard to this day. Yikes!

Understanding Steps, Iterations, and Parallelism

When using rlpyt, I think one of the most important things to understand is how the parallelism works. Due to parallelism, interpreting the number of “steps” an algorithm runs requires some care. In rlpyt, the code frequently refers to an itr variable. One itr should be interpreted as “one data collection AND optimization phase”, which is repeated for however many itrs we desire. After some number of itrs have passed, rlpyt logs the data by reporting it to the command line and saving the textual form in a debug.log file.

The data collection phase uses parallel environments. Often in the code, a “Sampler” class (which could be Serial-, CPU-, or GPU-based) will be defined like this:

sampler = Sampler(
    EnvCls=AtariEnv,
    TrajInfoCls=AtariTrajInfo,  # Needed for Atari game scores!
    env_kwargs=dict(game=game),
    eval_env_kwargs=dict(game=game),
    batch_T=T,
    batch_B=B,
    max_decorrelation_steps=0,
    eval_n_envs=10,
    eval_max_steps=int(1e6),
    eval_max_trajectories=50,
)

(The examples folder in the code base will show how the samplers are used.)

What’s important for our purposes is batch_T and batch_B. The batch_T defines the number of steps taken in each parallel environment, while batch_B is the number of parallel environments. Thus, in DeepMind’s DQN Nature paper, they set batch_B=1 (i.e., it was serial) with batch_T=4 to get 4 steps of new data, then train, then 4 new steps of data, etc. rlpyt will enforce a similar “replay ratio” so that if we end up with more parallel environments, such as batch_B=10, it performs more gradient updates in the optimization phase. For example, a single itr could consist of the following scenarios:

batch_T, batch_B = 4, 1: get 4 new samples in the replay buffer, then 1 gradient update.
batch_T, batch_B = 4, 10: get 40 new samples in the replay buffer, then 10 gradient updates.

The cumulative environment steps, which is CumSteps in the logger, is thus batch_T * batch_B, multiplied by the number of itrs thus far.

In order to define how long the algorithm runs, one needs to specify the n_steps argument to a runner, usually MinibatchRl or MinibatchEval (depending on whether evaluation should be online or offline), as follows:

runner = MinibatchRl(
    algo=algo,
    agent=agent,
    sampler=sampler,
    n_steps=n_steps,
    log_interval_steps=1e5,
    affinity=affinity,
)

Then, based on n_steps, the maximum number of itrs is determined from that. Modulo some rounding issues, this is n_steps / (batch_T * batch_B).

In addition, we use log_interval_steps to represent the itr interval when we log data.

Current Issues

I have been very happy with rlpyt. Nonetheless, as with any major open-source code produced by a single PhD student (named Adam), there are bound to be some little issues that pop up here and there. Throughout the last few months, I have posted five issue reports:

CPU Usage. This describes some of the nuances regarding how rlpyt uses CPU resources on a machine. I posted it because I was seeing some discrepancies between my intended CPU allocation versus the actual CPU allocation, as judged from htop. From this issue report, I started prefacing all my python scripts with taskset -c x-y where x and y represent CPU indices.
Using Atari Game Scores. I was wondering why the performance of my DQN benchmarks were substantially lower than those I saw in DeepMind’s papers, and the reason was due to reporting clipped scores (i.e., bounding values within $[-1,1]$) versus the game scores. From this issue report, I added in AtariTrajInfo as the “trajectory information” class in my Atari-related scripts, because papers usually report the game score. Fortunately, this change has since been updated to the master branch.
Repeat Action Probability in Atari. Another nuance with the Atari environments is that they are deterministic, in the sense that taking an action will lead to only one possible next state. As this paper argues, using sticky actions helps to introduce stochasticity into the Atari environments while requiring minimal outside changes. Unfortunately, rlpyt does not enable it by default because it was benchmarking against results that did not use sticky frames. For my own usage, I keep the sticky frames on with probability $p=0.25$ and I encourage others to do the same.
Epsilon Greedy for CPU Sampling (bug!). This one, which is an actual bug, has to do with the epsilon schedule for epsilon greedy agents, as used in DQN. With the CPU sampler (but not the Serial or GPU variants) the epsilon was not decayed appropriately. Fortunately, this has been fixed in the latest version of rlpyt.
Loading a Replay Buffer. I thought this would be a nice feature. What if we want to resume training for an off-policy reinforcement learning algorithm with a replay buffer? It’s not sufficient to save the policy and optimizer parameters, as in an on-policy algorithm such as Proximal Policy Optimization, because we need to reproduce the exact contents of the replay buffer at the point when we saved the training state.

Incidentally, notice how these issue reports are designed so that they are easy for others to reproduce. I have argued previously that we need sufficiently detailed issue reports for them to be useful.

There are other issue reports that I did not create, but which I have commented on, such as this one about saving snapshots, that I hope are helpful.

Fortunately, Adam has been very responsive and proactive, which increases the usability of this code base for research. If researchers from Berkeley all gravitate to rlpyt, then it provides additional benefits for using rlpyt, since we can assist each other.

The Future

I am happy with using rlpyt for research and development. Hopefully it will be among the last major reinforcement learning libraries I need to pick up for my research. There is always some setup cost to using a code base, but I feel like that threshold has passed for me and that I am at the “frontier” of rlpyt.

Finally, thank you Adam, for all your efforts. Let’s go forth and do some great research.