Over the past few months, I have frequently used the open-source reinforcement learning library rlpyt, to the point where it’s now one of the primary code bases in my research repertoire. There is a BAIR Blog post which nicely describes the rationale for rlpyt, along with its features.
Before rlpyt, my primary reinforcement learning library was OpenAI’s baselines. My switch from baselines to rlpyt was motivated by several factors. The primary one is that baselines is no longer actively maintained. I argued in an earlier blog post that it was one of OpenAI’s best resources, but I respect OpenAI’s decision to prioritize other resources, and if anything, baselines may have helped spur the development of subsequent reinforcement learning libraries. In addition, I wanted to switch to a reinforcement learning library that supported more recent algorithms such as distributional Deep Q-Networks, coupled with perhaps higher quality code with better documentation.
Aside from baselines and rlpyt, I have some experience with stable-baselines, a strictly superior version of baselines, but I also wanted to switch from TensorFlow to PyTorch, which is why I did not gravitate to stable-baselines. I have very limited experience with the first major open-source DeepRL library, rllab, which also came out of Berkeley, though I never used it for research as I got on the bandwagon relatively late. I also used John Schulman’s modular_rl library when I was trying to figure out how to implement Trust Region Policy Optimization. More recently, I have explored rlkit for its Twin-Delayed DDPG implementation, along with SpinningUp to see cleaner code implementations.
I know there are a slew of other DeepRL libraries, such as Intel’s NervanaSystems coach which I would like to try due to its huge variety of algorithms. There are also reinforcement learning libraries for distributed systems, but I prefer to run code on one machine to avoid complicating things.
Hence, rlpyt it is!
Installation and Quick Usage
To install rlpyt, observe that the repository already provides conda environment configuration files, which will bundle up the most important packages for you. This is not a virtualenv, though it has the same functional effect in practice. I believe conda environments and virtualenvs are the two main ways to get an isolated bundle of python packages.
On the machines I use, I find it easiest to first install miniconda. This can be done remotely by downloading the installer via `wget` and running `bash` on it:
```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# after installing ...
. ~/.bashrc    # to get conda commands to work

# to ensure (base) is not loaded by default
conda config --set auto_activate_base false
. ~/.bashrc    # to remove the (base) env
```
In the above, I set it so that conda does not automatically activate its “base” environment for myself. I like having a clean, non-environment setup by default on Ubuntu systems. In addition, during the bash command above, the Miniconda installer will ask this:
```
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes
```
I answer “yes” so that it gets initialized.
After the above, I clone the repository and then install rlpyt using one of the provided conda environment configuration files; this will automatically make a new conda environment (in my case, one specialized for Linux with CUDA 10). Then, finally, don’t forget the usual `pip install -e .`, which makes rlpyt a package you can import within your conda environment, and ensures that any changes you make in rlpyt will be propagated throughout your environment without having to do another pip install.
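Concretely, the sequence looks something like this (I am assuming the `linux_cuda10.yml` file name; the repository provides several variants for different platforms and CUDA versions):

```shell
git clone https://github.com/astooke/rlpyt.git
cd rlpyt
conda env create -f linux_cuda10.yml   # creates the "rlpyt" conda environment
conda activate rlpyt
pip install -e .                       # editable install, so code changes propagate
```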
For quick usage, I follow the rlpyt README and use the `examples` directory. There are several scripts in there that can be run easily.
There are several possible workflows one can follow when using rlpyt. For running experiments, you can use scripts that mirror those in the `examples` directory. Alternatively, for perhaps more heavy-duty experiments, you can look at what’s in `rlpyt/experiments`. This contains configuration, launch, and run scripts, which provide utility methods for testing a wide variety of hyperparameters. Since that requires me to dive through three scripts nested deep in rlpyt’s code repository, I personally don’t follow that workflow; instead, I just take a script in the `examples` directory and build upon it to handle more complex cases.
Here’s another thing I find useful. As I note later, rlpyt can use more CPU resources than expected. Therefore, particularly on machines I share with other researchers, I limit the number of CPUs that my scripts can “see.” I do this with `taskset`. For example, suppose I am using a server with 32 CPUs. I can run a script like this:
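For instance (the script name here is illustrative; `-c 21-31` gives the allowed CPU index range):

```shell
taskset -c 21-31 python examples/example_1.py
```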
and this will limit the script to using the CPUs indexed from 21 to 31. On `htop`, these will show up as CPUs numbered 22 through 32, since `htop` is one-indexed.
With this in mind, here is my rough workflow for heavy-duty experiments:
1. Double check the machine to ensure that there are enough resources available. For example, if `nvidia-smi` shows that the GPU usage is near 100% for all GPUs, then I’m either not going to run code, or I will send a Slack message to my collaborators politely inquiring when the machine will free up.
2. Enter a GNU screen by typing in `screen`.
3. Run `conda activate rlpyt` to activate the conda environment.
4. Run `export CUDA_VISIBLE_DEVICES=x` to limit the experiment to the desired GPU.
5. Run the script with `taskset` as described earlier.
6. Spend a few seconds afterwards checking that the script is running correctly.

There are variations to the above, such as using `tmux` instead of `screen`, but hopefully this general workflow makes sense for most researchers.
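As a concrete sketch, the command sequence looks like this (the GPU index, CPU range, and script name are all illustrative):

```shell
screen                          # enter a GNU screen session
conda activate rlpyt            # activate the conda environment
export CUDA_VISIBLE_DEVICES=0   # pin the experiment to GPU 0
taskset -c 21-31 python examples/example_1.py
```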
For plotting, I don’t use the built-in plotter from rlpyt (which is really from another code base). I keep the `progress.csv` file and load it in a stand-alone python script for plotting. I also don’t use TensorBoard. In fact, I still have never used TensorBoard to this day. Yikes!
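As a minimal sketch of such a stand-alone script (the helper function is mine, and the column names are assumptions about what the logger wrote):

```python
import csv

def load_column(path, column):
    """Read one numeric column from a progress.csv logging file."""
    with open(path) as f:
        return [float(row[column]) for row in csv.DictReader(f)]

# Example usage with matplotlib (assumed column names shown):
#   steps = load_column("progress.csv", "CumSteps")
#   returns = load_column("progress.csv", "ReturnAverage")
#   plt.plot(steps, returns)
```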
Understanding Steps, Iterations, and Parallelism
When using rlpyt, I think one of the most important things to understand is how the parallelism works. Due to parallelism, interpreting the number of “steps” an algorithm runs requires some care. In rlpyt, the code frequently refers to an `itr` variable. One `itr` should be interpreted as “one data collection AND optimization phase,” which is repeated for however many `itr`s we desire. After some number of `itr`s have passed, rlpyt logs the data by reporting it to the command line and saving the textual form in a log file.
The data collection phase uses parallel environments. Often in the code, a “Sampler” class (which could be Serial-, CPU-, or GPU-based) will be defined like this:
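For instance, here is a sketch modeled on rlpyt’s example scripts; I am assuming the `SerialSampler` constructor arguments used in the `examples` folder, and the argument values are illustrative rather than a recommendation:

```python
from rlpyt.samplers.serial.sampler import SerialSampler
from rlpyt.envs.atari.atari_env import AtariEnv

sampler = SerialSampler(
    EnvCls=AtariEnv,
    env_kwargs=dict(game="pong"),
    batch_T=4,   # steps to take in each environment per itr
    batch_B=1,   # number of parallel environments
    max_decorrelation_steps=0,
)
```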
(The `examples` folder in the code base shows how the samplers are used.)
What’s important for our purposes is that `batch_T` defines the number of steps taken in each parallel environment, while `batch_B` is the number of parallel environments. Thus, in DeepMind’s DQN Nature paper, they used `batch_B=1` (i.e., it was serial) with `batch_T=4` to get 4 steps of new data, then train, then get 4 new steps of data, and so on. rlpyt will enforce a similar “replay ratio,” so that if we end up with more parallel environments, say `batch_B=10`, it performs more gradient updates in the optimization phase. For example, a single `itr` could consist of either of the following scenarios:

- `batch_T, batch_B = 4, 1`: get 4 new samples in the replay buffer, then do 1 gradient update.
- `batch_T, batch_B = 4, 10`: get 40 new samples in the replay buffer, then do 10 gradient updates.
The cumulative environment steps, reported as `CumSteps` in the logger, is thus `batch_T * batch_B` multiplied by the number of `itr`s thus far.
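The bookkeeping can be sketched in a few lines (the function is mine, not part of rlpyt):

```python
def cum_steps(batch_T, batch_B, num_itrs):
    """Cumulative environment steps after num_itrs itrs."""
    return batch_T * batch_B * num_itrs

# DQN-Nature-style serial collection: 4 steps per itr.
print(cum_steps(batch_T=4, batch_B=1, num_itrs=1000))   # 4000
# Ten parallel environments: 40 steps per itr.
print(cum_steps(batch_T=4, batch_B=10, num_itrs=1000))  # 40000
```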
In order to define how long the algorithm runs, one needs to specify the `n_steps` argument to a runner, usually `MinibatchRl` or `MinibatchRlEval` (depending on whether evaluation should be online or offline). Then, based on `n_steps`, the maximum number of `itr`s is determined. Modulo some rounding issues, this is `n_steps / (batch_T * batch_B)`.
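A sketch of that arithmetic (again my own function, not rlpyt’s internal code):

```python
def max_itrs(n_steps, batch_T, batch_B):
    """Approximate itr count implied by an n_steps budget."""
    return int(n_steps) // (batch_T * batch_B)

# A 50M-step run with batch_T=4, batch_B=10: 40 steps per itr.
print(max_itrs(n_steps=50_000_000, batch_T=4, batch_B=10))  # 1250000
```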
In addition, we use `log_interval_steps` to represent the `itr` interval at which we log data.
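My understanding of the conversion, as a sketch (rlpyt logs at `itr` boundaries, so the step interval is rounded down to a whole number of `itr`s, with a minimum of one):

```python
def log_interval_itrs(log_interval_steps, batch_T, batch_B):
    """Convert a step-based logging interval into an itr interval."""
    return max(log_interval_steps // (batch_T * batch_B), 1)

# 100k-step logging interval at 40 steps per itr: log every 2500 itrs.
print(log_interval_itrs(100_000, batch_T=4, batch_B=10))  # 2500
```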
Issue Reports

I have been very happy with rlpyt. Nonetheless, as with any major open-source code base produced by a single PhD student (named Adam), there are bound to be some little issues that pop up here and there. Over the last few months, I have posted five issue reports:
CPU Usage. This describes some of the nuances regarding how rlpyt uses CPU resources on a machine. I posted it because I was seeing discrepancies between my intended CPU allocation and the actual CPU allocation, as judged from `htop`. From this issue report, I started prefacing all my python scripts with `taskset -c x-y`, where `x` and `y` represent CPU indices.
Using Atari Game Scores. I was wondering why the performance of my DQN benchmarks was substantially lower than what I saw in DeepMind’s papers, and the reason was that I was reporting clipped scores (i.e., rewards bounded within $[-1,1]$) rather than the raw game scores. From this issue report, I added `AtariTrajInfo` as the “trajectory information” class in my Atari-related scripts, because papers usually report the game score. Fortunately, this change has since been merged into the master branch.
Repeat Action Probability in Atari. Another nuance with the Atari environments is that they are deterministic, in the sense that taking an action leads to only one possible next state. As this paper argues, using sticky actions helps introduce stochasticity into the Atari environments while requiring minimal outside changes. Unfortunately, rlpyt does not enable sticky actions by default, because it was benchmarking against results that did not use them. For my own usage, I keep sticky actions on with probability $p=0.25$, and I encourage others to do the same.
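If I remember correctly, rlpyt’s Atari environment exposes this as a constructor argument; a sketch, assuming the `repeat_action_probability` keyword of `AtariEnv`:

```python
from rlpyt.envs.atari.atari_env import AtariEnv

# Sticky actions: with probability 0.25, repeat the previous action.
env = AtariEnv(game="pong", repeat_action_probability=0.25)
```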
Epsilon Greedy for CPU Sampling (bug!). This one, which is an actual bug, has to do with the epsilon schedule for epsilon greedy agents, as used in DQN. With the CPU sampler (but not the Serial or GPU variants) the epsilon was not decayed appropriately. Fortunately, this has been fixed in the latest version of rlpyt.
Loading a Replay Buffer. I thought this would be a nice feature. What if we want to resume training for an off-policy reinforcement learning algorithm with a replay buffer? It’s not sufficient to save the policy and optimizer parameters, as in an on-policy algorithm such as Proximal Policy Optimization, because we need to reproduce the exact contents of the replay buffer at the point when we saved the training state.
Incidentally, notice how these issue reports are designed so that they are easy for others to reproduce. I have argued previously that we need sufficiently detailed issue reports for them to be useful.
There are other issue reports that I did not create, but which I have commented on, such as this one about saving snapshots, that I hope are helpful.
Fortunately, Adam has been very responsive and proactive, which increases the usability of this code base for research. If researchers from Berkeley all gravitate to rlpyt, there are additional benefits to using it, since we can assist each other.
I am happy with using rlpyt for research and development. Hopefully it will be among the last major reinforcement learning libraries I need to pick up for my research. There is always some setup cost to using a new code base, but I feel I have passed that threshold, and that I am now at the “frontier” of rlpyt.
Finally, thank you Adam, for all your efforts. Let’s go forth and do some great research.