My Research Workflow: Conda, Deep Learning, CUDA, Storage, and SSH
This is a regularly updated post, last updated November 12, 2022.
In the past, I have written about some workflow and coding tips, such as improving my development environment with virtualenvwrapper, organizing GitHub repositories, running and saving experiments in Python, and understanding (a little) about how Docker works.
As I transition to my new postdoc role at CMU as of September 2021, it feels like a good time to recap my current workflow. I am constantly trying to think about how I can be more productive and whether I should learn about this or that feature (the answer is usually “no” but sometimes it is “yes”).
In this blog post, I will discuss different aspects of my current workflow, split into the following sections in this order:
- Conda Environments
- Installing TensorFlow and PyTorch
- CUDA and NVIDIA drivers
- Storage on Shared Machines
- Managing ssh
In the future, I plan to update this post with additional information about my workflow. There are also parts of my prior workflow that I have gotten rid of. Looking back, I’m surprised I managed to get a PhD with some of the sloppy tactics that I employed!
When reading this post, keep in mind that the main operating system I use is Ubuntu 18.04 and that I do essentially all my programming with Python. (I keep telling myself and writing in my New Year Resolution documents that I will get back to coding with C++, but I never do so. My apologies in advance.) At some point, I may upgrade to Ubuntu 20.04, but the vast majority of research code I use these days is still tied to Ubuntu 18.04. I do use a Macbook Pro laptop, but for work contexts, that is mainly for making presentations and possibly writing papers on Overleaf. If I do “research programming” on my laptop, it is done through ssh-ing to an Ubuntu 18.04 machine.
Update 08/06/2022: these days, I still use Ubuntu 18.04 by default, but I now do a fair amount of research work on Ubuntu 20.04 machines. I have not gotten SoftGym working with Ubuntu 20.04, but I can run other software. I also updated this post with a new section on CUDA and NVIDIA drivers and made minor touch-ups to other areas.
Conda Environments
Starting in 2019, I began using conda environments. Previously, I was using virtualenvs coupled with virtualenvwrapper to make handling multiple environments easier, but it turned out to be a huge hassle to manage with various “command not found” errors and warnings. Furthermore, I was running into countless issues with CUDA and TensorFlow incompatibilities, and inspired by this October 2018 Medium article, which amusingly says that if using “pip install” commands for TensorFlow, “There is a probability of 1% that this process will go right for you!”, I switched to conda environments.
Conda environments work in basically the same way as virtualenvs in that they isolate a set of Python packages independent of the system Python. Here, “conda install” plays the role of “pip install”. Not all packages installable with pip are available through conda, but that’s not a huge issue because you can also run normal pip install commands in a conda environment. Mixing conda and pip can be delicate (see this for a warning), but I can’t remember ever experiencing issues with it myself.
Here’s how I get the process started on new machines:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
I use miniconda instead of anaconda, but that’s mainly because I prefer something more lightweight to make the process faster and take less disk space. Furthermore, anaconda comes with packages that I would normally want to install myself anyway later (such as numpy) so that I can easily control versions and dependencies.
To be clear, here’s what I do after running that bash command. I accept the license, and I always use the default install location (just press enter), which is typically /home/<USER>/miniconda3. When the installer then asks whether it should initialize conda, I typically say “yes” so that miniconda automatically adds the relevant lines to my .bashrc file. After this, I can run the “conda” command right away, but I also don’t want the “base” environment to be active right away, because I would rather have each new command line window start with a blank non-conda environment. Thus, after closing and re-opening the shell (typically via exiting the machine and ssh-ing in again), I do:
conda config --set auto_activate_base false
This information goes into the ~/.condarc file. After refreshing with . ~/.bashrc, conda is all set up for me to use. Here are a few commands that I regularly use:
- conda activate <NAME> and conda deactivate, to activate or deactivate the environment. When the environment is activated, use conda install <PACKAGE> to install packages.
- conda info --envs, to check all my existing conda environments.
- conda list, to check the installed packages in the current conda environment. This also conveniently clarifies whether any packages were installed via pip.
- conda create --name <NAME> python=3.7 -y, to create conda environments with the specified Python version. The “-y” argument avoids having to explicitly approve the process.
- conda env remove --name <NAME>, to remove conda environments.
We now turn to discussing how conda environments work with TensorFlow and PyTorch.
Installing TensorFlow and PyTorch
Migrating to TensorFlow was the original motivation for me to use conda environments due to running into incompatible CUDA/CuDNN versions with “pip install tensorflow” commands on various machines. You can find a table of TensorFlow packages and their associated CUDA and CuDNN versions here and a popular StackOverflow post here.
As of today, the latest version of TensorFlow is 2.6.0 through pip, but it’s 2.4.1 through conda. The conda TensorFlow package is maintained by a different set of maintainers than the one provided through the Python Package Index (PyPI), which comes from the official TensorFlow developers; this is why there is some version lag (see this post for some context). Since it’s rare that I absolutely require the latest TensorFlow version, I focus on TensorFlow 2.4.1 here. I run the following commands to quickly start a Python 3.7 conda environment with TensorFlow 2.4.1 installed:
conda create --name tftest python=3.7 -y && conda activate tftest
conda install ipython tensorflow-gpu==2.4.1 -y
Similar Python versions will likely work as well. These days, I use Python 3.6 at a minimum. Also, I just put in ipython since I like running it over the default Python shell. Once I run ipython on the command line, I can run a quick check that TensorFlow detects the GPUs.
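Here is a minimal sketch of that check inside ipython, using the two calls discussed just below:

import tensorflow as tf

# Should list one entry per GPU, consistent with nvidia-smi.
print(tf.config.list_physical_devices('GPU'))

# Deprecated, but should return True if TensorFlow can access a GPU.
print(tf.test.is_gpu_available())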
The tf.test.is_gpu_available() method is deprecated, so use tf.config.list_physical_devices('GPU') instead. Presumably, this should give information that is consistent with what happens when running nvidia-smi on the command line; the first call should list all GPUs and the second should return True. If not, something went wrong.
This process consistently works for a variety of machines I have access to, and gets TensorFlow working while bundling CUDA internally within the conda environment. This means that, in general, the conda environment will not have the same CUDA version as the one provided by nvcc --version, which is typically the one installed system-wide in /usr/local/. For the commands above, this should install cudatoolkit-10.1.243 in the conda environment. This package is 347.4 MB and includes CuDNN. Here is another relevant StackOverflow post on this matter.
Finally, wrap things up by removing each created test environment to reduce clutter: conda env remove --name tftest.
Hopefully that helps clarify one way to install TensorFlow in conda environments for shared machines. One day I hope that TensorFlow will be simpler to install. To be clear, it’s already fairly simple, but judging by the community’s reception, it could be made a little easier. (To put things in perspective, remember how hard it was to install Caffe back in 2014-2015? Heh.) On new “clean” machines where one can easily control which CUDA/CuDNN versions are packaged on a machine on the fly, such as those created from Google Cloud Platform, the pip version could be relatively easy to install.
What about PyTorch? For PyTorch, installation is even easier because I believe the PyTorch maintainers simultaneously maintain conda and pip packages; we have the option of selecting either one on the official installation page.
As with my TensorFlow tests, I can test PyTorch installation via:
conda create --name pttest python=3.7 -y && conda activate pttest
conda install ipython pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
As of today, this will install PyTorch 1.9.0 along with ipython. Again, ipython is not necessary but I like including it. You can then check whether PyTorch is using the GPU(s).
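Here is a minimal sketch of that check, using standard torch.cuda calls:

import torch

# Should return True if PyTorch can access a GPU.
print(torch.cuda.is_available())

# Number of visible GPUs, and the name of the first one.
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))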
Here is the StackOverflow reference. As with my TensorFlow test, this method of installing PyTorch will detect the GPUs and does not rely on the system’s existing CUDA version, because the conda installation provides it for us. For PyTorch, the cudatoolkit-10.2.89 package (which is 365.1 MB) gets installed, which you can check with conda list. Once again, this also includes CuDNN.
Clean things up with: conda env remove --name pttest.
CUDA and NVIDIA Drivers
(This subsection was last updated on 08/06/2022.)
Using conda environments, we can bundle CUDA and CuDNN with TensorFlow and PyTorch, which is a huge relief in avoiding headaches with version incompatibilities. However, it is (probably) still necessary to install a system-wide version of CUDA, as well as something called “NVIDIA drivers.” For example, if using a local machine with GPUs (as is common in many AI research labs which have personal machines in addition to servers and clusters), we need NVIDIA drivers to get high resolution displays.
Attempting to install CUDA and NVIDIA drivers in the past has been a nightmare. Fortunately, I think I have found a method that works reliably for me on Ubuntu 18.04 and 20.04 machines.
The first step for me is to install the NVIDIA drivers. I do this first, before installing CUDA. I purge any prior NVIDIA drivers, then install the recommended drivers using ubuntu-drivers autoinstall, and then reboot:
sudo apt-get update
sudo apt-get purge nvidia*
sudo ubuntu-drivers autoinstall
sudo reboot
To check that NVIDIA drivers are installed, I use the widely-used nvidia-smi utility, but there’s also cat /proc/driver/nvidia/version.
To clarify the “autoinstall” part, it will choose the recommended NVIDIA driver to use. You can check which version it will choose:
seita@lambda-dual2:~ $ ubuntu-drivers devices
WARNING:root:_pkg_get_support nvidia-driver-515-server: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-515: package has invalid Support PBheader, cannot determine support level
== /sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0 ==
modalias : pci:v000010DEd00002204sv00001458sd0000403Bbc03sc00i00
vendor : NVIDIA Corporation
driver : nvidia-driver-515-server - distro non-free
driver : nvidia-driver-510-server - third-party non-free
driver : nvidia-driver-515 - distro non-free recommended
driver : nvidia-driver-510 - third-party non-free
driver : xserver-xorg-video-nouveau - distro free builtin
seita@lambda-dual2:~ $
In the example above, I’m on an Ubuntu 20.04 system with RTX 3090 GPUs, and thus the “recommended” NVIDIA driver is 515.
The references that I rely on for this process include this, this, this, and this. This post from the NVIDIA forums was also helpful, though the poster from NVIDIA requests us to use published instructions from NVIDIA. The issue is that when I have done this, I have run into numerous errors with displays not working and needing to use “tty” mode to access my machine.
Then after this, I install CUDA by using the runfile, not the Debian file. For that, I go to this page from NVIDIA to download the appropriate CUDA toolkit for my system. The runfile will be named something like cuda_11.3.0_465.19.01_linux.run.
I follow the installation instructions for the runfile, but when I do the installation, I do not install the drivers! I just install the toolkit.
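Roughly, that step looks like the following; the runfile name matches the example above, and the exact installer menu may differ slightly across CUDA versions:

sudo sh cuda_11.3.0_465.19.01_linux.run
# In the interactive installer, deselect the driver and keep only the CUDA Toolkit selected.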
Above, I am only selecting the CUDA Toolkit, since I already installed the drivers and the other parts are not relevant.
If the installation works (a reboot might be needed), the CUDA toolkit should appear in /usr/local:
seita@takeshi:~$ ls -lh /usr/local/
total 40K
drwxr-xr-x 2 root root 4.0K Jul 5 14:34 bin
lrwxrwxrwx 1 root root 21 Aug 4 10:46 cuda -> /usr/local/cuda-11.3/
drwxr-xr-x 17 root root 4.0K Mar 6 19:04 cuda-11.2
drwxr-xr-x 15 root root 4.0K Aug 4 10:46 cuda-11.3
drwxr-xr-x 2 root root 4.0K Sep 15 2021 etc
drwxr-xr-x 2 root root 4.0K Sep 15 2021 games
drwxr-xr-x 2 root root 4.0K Sep 15 2021 include
drwxr-xr-x 3 root root 4.0K Mar 6 18:31 lib
lrwxrwxrwx 1 root root 9 Mar 6 18:30 man -> share/man
drwxr-xr-x 2 root root 4.0K Sep 15 2021 sbin
drwxr-xr-x 6 root root 4.0K Sep 15 2021 share
drwxr-xr-x 2 root root 4.0K Sep 15 2021 src
seita@takeshi:~$
For example, here we see that the default CUDA toolkit is 11.3 because that’s where the cuda “symlink” is pointing, but I could have easily set it to be 11.2. We can check with nvcc --version. If that’s not working, you might need to check your ~/.bashrc for something like this:
export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
I recognize that these are not exactly the published NVIDIA instructions and that I’m probably missing something here, but this has been a simple way to get things working.
Another post from StackOverflow that helped me understand things was this one, which clarifies that we might get different CUDA versions from nvcc and nvidia-smi after this process. I have had this happen on many of my machines without problems.
Managing Storage on Shared Machines
In academic research labs, students often share machines. It thus helps to have a scalable, efficient, and manageable way to store data. Here’s how I typically do this for machines that I administer, where I am a “sudo” user and grant access to the machine to other lab members who may or may not be sudo (for example, I rarely make new undergrad researchers sudo unless there’s a good reason). I assume that the machine is equipped with a separate SSD and HDD. The SSD is typically where users store their local data, and because it’s an SSD rather than an HDD, reading and writing data is faster. The HDD is mainly for storing larger datasets, and typically has much more storage than the SSD.
For a clean machine, one of the most basic first steps is to make sure that the SSD and HDD are mounted upon startup and accessible to all users. Usually, the SSD is automatically mounted, but the HDD might not be. I can mount these drives automatically by editing the /etc/fstab file, or by using the “disks” program, which will end up editing /etc/fstab for me. I suggest following the top answer to this AskUbuntu question. My convention is to mount the HDD under a directory named /data.
To inspect file systems, I use df -h, where the -h argument makes the sizes human-readable. Here’s an example of a subset of the output when I run df -h:
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 1.9T 1.2T 598G 67% /
/dev/sdb1 13T 571G 12T 5% /data
According to the above, the SSD has 1.9T of total space (of which 67 percent is
used), and the HDD has 13T of total space. The output of df -h
includes a
bunch of other lines with information that I’m not sure how to interpret; I am
guessing those correspond to other minor “drives” that are needed for Ubuntu
systems to function. I only use df -h
to look at the SSD and HDD, to make
sure they are actually there, and to check disk space.
Incidentally, another way I check disk space is by using du -sh <directory>, which will list the space recursively stored under <directory>. Depending on user privileges, the command might result in a lot of distracting “permission denied” warnings, in which case I add 2> /dev/null at the end of the command to suppress those messages. I recommend reading this article for more information and useful tips on managing disk space.
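For example, a quick way to see how much space each top-level directory on the shared HDD uses (the path is just an illustration, following the /data convention above):

du -sh /data/* 2> /dev/null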
After mounting the HDD, it is typically under the control of root for both the user and the group, which you can check with ls -lh /. This is problematic if I want any user to be able to read and write to this directory. To resolve this, I usually follow the top answer to this AskUbuntu question. I typically make a new group called datausers, and then add all users to the group. I then change the ownership of the shared folder, /data. Lastly, I choose this option:
sudo chmod -R 1775 /data
According to the AskUbuntu question, this means that all users in datausers can add to /data, and can read but not write to each other’s files. Furthermore, only owners of files can delete them, and users outside of datausers will be able to see the files but not change them.
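Putting these steps together, here is a sketch of the full sequence, where alice is a hypothetical username and the commands roughly follow that AskUbuntu answer:

sudo groupadd datausers
sudo usermod -a -G datausers alice   # repeat for each user in the lab
sudo chown -R root:datausers /data
sudo chmod -R 1775 /data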
From these steps, running ls -lh / shows:
user@machine:~$ ls -lh /
drwxrwxr-t 6 root datausers 4.0K Sep 17 21:43 data
As with df -h, I am only showing part of the output of the above command, just the line that lists data. This shows that it is correctly under the group “datausers.”
Finally, I reboot the machine, and then users who are in the datausers group should be able to read and write to /data without sudo access. Furthermore, unless sudo privileges are involved, users cannot modify data from other users in /data.
Managing ssh
I use many machines for work, so I need to seamlessly connect to them from any one machine (including from a personal laptop). I use ssh: ssh daniel@<ID>, where daniel is a hypothetical username for me on a machine, and <ID> should be replaced with the machine’s corresponding address (typically an IP address), which often looks something like 123.456.7.890, with numbers separated by periods.
When I get started with a new machine, a common question I have for myself is: what is the ID for ssh? I often use ifconfig to find it, though I admit that sometimes it’s not clear from this command. If you’re trying to connect to a machine as part of a research lab, typically the lab will have already set up some kind of address to connect with ssh. If this is a new machine, don’t forget that it needs to have the ssh server running to accept connections (see this AskUbuntu post).
Sometimes machines will restrict ssh access to particular ports, which means I connect with ssh -p PORTNUM daniel@<ID>. When a port number is necessary for ssh connectivity, I have been able to find the right port by running this command (source):
sudo grep Port /etc/ssh/sshd_config
I also highly recommend using ssh keys. By sharing keys among machines, one can avoid typing in passwords. This is an enormous benefit for me since I might need to connect with as many as 10 different lab servers, and I use different login passwords for each machine. For more information, I recommend checking a source like this guide. But to summarize, here’s what I do. Assuming I have a key from running ssh-keygen, then on my current machine, I run a command like this:
ssh-copy-id -i ~/.ssh/id_rsa daniel@<IP>
This is the same as a standard ssh command, except it uses ssh-copy-id and a reference to the (public) key that I am trying to copy to the other machine. Assuming that command works, I should then be able to run ssh daniel@<IP> from my current machine to the one specified with the <IP>, without typing in a password. Amazingly, for something that I rely on so heavily for my ongoing day-to-day work convenience, I didn’t even know about this feature until halfway through my PhD.
Conclusion and Outlook
Using conda environments has been a huge help for my research workflow, and makes it easy to manage multiple Python projects. I have also been fortunate to get a better sense for how to effectively manage a finite amount of disk space among multiple users.
Some stuff that I also use in my current workflow, and would like to write more about in the future, include VSCode, vim, more on managing ssh and internet connectivity, and Docker. I also would like to understand how packages work for C++, to see how the process contrasts with Python packages.