In many research projects, it is essential to test which of several competing methods and/or hyperparameters works best. The process of saving and logging experiments, however, can create a disorganized jungle of output files. Furthermore, reproducibility can be challenging without knowing all the exact parameter choices that were used to generate results. Inspired in part by Dustin Tran’s excellent Research-to-Engineering framework blog post, in this post I will present several techniques that have worked well for me in managing my research code, with a specific emphasis on logging and saving experimental runs.

Technique 0. I label this technique “0” because it should be mandatory and, unlike the other tips here, it generalizes far beyond logging research code: use version control. git, along with the “hub extension” to form GitHub, is the standard in my field, though I’ve also managed projects using GitLab.

In addition, I’ve settled on these relevant strategies:

• To evaluate research code, I create a separate branch strictly for this purpose (which I name eval-[whatever]), so that it doesn’t interfere with my main master branch, and to enable greater ease of reproducing prior results by simply switching to the appropriate branch. The alternative would be to reset and restore to an older commit in master, which can be highly error-prone.
• I make a new Python virtualenv for each major project, and save a requirements.txt somewhere in the repository so that recreating the environment on any of the several machines I have access to is (usually) as simple as pip install -r requirements.txt.
• For major repositories, I like to add a setup.py file so that I can install the library in development mode using python setup.py develop (pip install -e . achieves the same thing with newer tooling). This lets me freely import the code regardless of where I am in my computer’s directory system, so long as the module is installed in my virtualenv.

Technique 1. In machine learning, and deep learning in particular, hyperparameter tuning is essential. For the ones I frequently modify, I use the argparse library. This lets me run code on the command line like this:

python script.py --batch_size 32 --lrate 1e-5 --num_layers 4 <more args here...>
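For reference, a script that accepts flags like these could be sketched as follows. The flag names come from the command above; the default values and the description string are placeholder assumptions for illustration, not recommendations.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in the command above."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative).")
    # Defaults here are placeholder assumptions, not tuned values.
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--lrate", type=float, default=1e-3)
    parser.add_argument("--num_layers", type=int, default=2)
    return parser


# Passing an explicit list mimics the command line; in a real script,
# calling parse_args() with no arguments reads sys.argv instead.
args = build_parser().parse_args(
    ["--batch_size", "32", "--lrate", "1e-5", "--num_layers", "4"]
)
print(args.batch_size, args.lrate, args.num_layers)
```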


While this is useful, the downside is readily apparent: I don’t want to have to write down all the hyperparameters each time, and copying and pasting earlier commands might be error prone, particularly when the code constantly changes. There are a few strategies to make this process easier, all of which I employ at some point:

• Make liberal use of default argument settings. I find reasonable values of most arguments, and stick with them for my experiments. That way, I don’t need to specify the values in the command line.
• Create bash scripts. I like to have a separate folder called bash/ where I put shell scripts (with the .sh extension) containing the many command line arguments for my experiments. Then, after making the scripts executable with chmod, I can call experiment code using ./bash/script_name.sh.
• Make use of json or yaml files. As an alternative (or complementary) technique for managing lots of arguments, consider using .json or .yaml files. Both file types are human-readable and well supported in Python: json is in the standard library, and yaml is available through third-party packages such as PyYAML.
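As one way to combine the first and third strategies, the sketch below loads a JSON file and uses its entries as argparse defaults, so that flags given on the command line still take precedence. The --config flag, the function name parse_with_config, and the two example hyperparameters are hypothetical choices for illustration.

```python
import argparse
import json


def parse_with_config(argv=None):
    """Parse arguments, letting an optional JSON file override the defaults."""
    # First pass: look only for --config and ignore everything else.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--config", type=str, default=None)
    known, _ = pre.parse_known_args(argv)

    # Second pass: the full parser, which inherits --config from `pre`.
    parser = argparse.ArgumentParser(parents=[pre])
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--lrate", type=float, default=1e-3)

    if known.config is not None:
        with open(known.config) as fh:
            # JSON keys must match argument names, e.g. {"batch_size": 64}.
            parser.set_defaults(**json.load(fh))

    # Values typed on the command line still win over the JSON defaults.
    return parser.parse_args(argv)
```

With this layout, the precedence is: explicit command line flag, then the JSON file, then the default hard-coded in the script.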

Technique 2. I save the results from experiment runs in unique directories, using Python’s os.path.join and os.makedirs functions to form the path string and create the resulting directory, respectively. Do not build the path by concatenating strings with hard-coded slashes: it’s clumsy and vulnerable to issues with slashes in directory names. Just use os.path.join, which is so ubiquitous in my research code that by habit I import it at the top of many scripts.

Subdirectories can (and should) be created as needed within the head experiment directory. For example, every now and then I save neural network snapshots in a snapshots/ sub-directory, with the relevant parameter (e.g., epoch) in the snapshot name.
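The directory handling described in this technique can be sketched as below. The function names, the snapshot file pattern, and the .pkl extension are assumptions made for illustration; only os.path.join, os.makedirs, and the snapshots/ sub-directory come from the text.

```python
import os
from os.path import join


def make_experiment_dirs(head, exp_name):
    """Create head/exp_name and a snapshots/ sub-directory inside it."""
    exp_dir = join(head, exp_name)
    snap_dir = join(exp_dir, "snapshots")
    # makedirs creates intermediate directories too, so this also creates
    # exp_dir; exist_ok=True avoids an error if the path already exists.
    os.makedirs(snap_dir, exist_ok=True)
    return exp_dir, snap_dir


def snapshot_path(snap_dir, epoch):
    """Put the epoch in the snapshot name, zero-padded so files sort in order."""
    return join(snap_dir, "epoch_{}.pkl".format(str(epoch).zfill(4)))


# Fragile alternative to avoid: head + "/" + exp_name + "/snapshots"
```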

But snapshots and other data files can demand lots of disk space. The machines I use for my research generally have small SSDs and large HDDs. Due to storage constraints on the SSDs, which often have less than 1TB of space, I almost always save experiment logs on my HDDs.

Don’t forget to back up data! I’ve had several machines compromised by “bad guys” in the past, forcing me to reinstall the operating system. HDDs and other large-storage systems can be synced across several machines, making the data easy to access. If this isn’t an option, then simply copying files from machine to machine manually every few days will do; I write down reminders in my Google Calendar.

Technique 3. Here’s a surprisingly non-trivial question related to the prior tactic: how shall the directory be named? Ideally, the name should reflect the most important hyperparameters, but it’s too easy for directory names to get out of control, like this:

experiment_seed_001_lrate_1e-3_network_2_opt_adam_l2reg_1e-5_batchsize_32_ [ and more ...!]


I focus strictly on three or four of the most important experiment settings and put them in the directory name. When random seeds matter, I also put them in the directory name.

Then, I use Python’s datetime module to format the date that the experiment started to run, and insert that somewhere in the directory name. In my code, I create a “suffix” from the algorithm name, the date, and the random seed (with str().zfill() to get leading zeros inserted to satisfy my OCD), and join it to the “HEAD”, which is the machine-dependent path to the HDD (see my previous tip).
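A minimal sketch of such a naming scheme follows. The exact date format, the ordering of the pieces, the three-digit seed padding, and the function name are assumptions; the text only specifies that the suffix combines the algorithm name, a human-readable date without seconds, and a zero-padded seed.

```python
import datetime
from os.path import join


def experiment_dir(head, algorithm, seed, now=None):
    """Return head/<algorithm>_<date>_seed_<seed> as a path string."""
    # `now` can be passed in explicitly, which makes the function testable.
    now = now or datetime.datetime.now()
    # Year-month-day-hour-minute: readable, sortable, and free of seconds.
    date = now.strftime("%Y-%m-%d-%H-%M")
    suffix = "{}_{}_seed_{}".format(algorithm, date, str(seed).zfill(3))
    return join(head, suffix)
```

Putting the year first means that an alphabetical listing of the directories is also a chronological one.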

There are at least two advantages to embedding the date in the directory names:

• It avoids issues with duplicate directory names. This prevents the need to manually delete or re-name older directories.
• It makes it easy to spot-check (via ls -lh on the command line) which experiment runs can be safely deleted if major updates were made since then.

Based on the second point above, I prefer the date to be human-readable. I leave out the seconds, as I find them to be a bit too much, but they are easy to add.

Technique 4. This last pair of techniques pertains to reproducibility. Don’t neglect them! How many times have you failed to reproduce your own results? I have experienced this before and it is embarrassing.

The first part of this technique happens during code execution: save all arguments and hyperparameters in the output directory. At minimum, that means dumping the parsed arguments to a pickle file in the save path, denoted as args.save_path, which (as stated earlier) usually points somewhere on my machine’s HDD. Alternatively, or in addition, you can save the arguments in human-readable form using json.
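As a sketch of this step, the function below writes the parsed arguments to both a pickle file and a JSON file. The file names args.pkl and args.json and the function name save_args are hypothetical; it assumes the arguments come from argparse and hold only JSON-serializable values.

```python
import argparse
import json
import os
import pickle
from os.path import join


def save_args(args, save_path):
    """Store an argparse.Namespace in save_path as pickle and as JSON."""
    os.makedirs(save_path, exist_ok=True)
    # Pickle keeps the Namespace object exactly as it was.
    with open(join(save_path, "args.pkl"), "wb") as fh:
        pickle.dump(args, fh)
    # JSON is human-readable; vars() turns the Namespace into a dict.
    with open(join(save_path, "args.json"), "w") as fh:
        json.dump(vars(args), fh, indent=2, sort_keys=True)
```

In a training script this would be called once, right after parsing, for example as save_args(args, args.save_path).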

The second part of this technique happens during paper writing. Always write down the command that was used to generate figures. I mostly use Overleaf — now merged with ShareLaTeX — for writing up my results, and I insert the command in the comments above the figures, like this:

% Generate with:
% python [script].py --arg1 val1 --arg2 val2
% at commit [hash]
\begin{figure}
% LaTeX figure code here...
\end{figure}


It sounds trivial, but it’s helped me several times for last-minute figure changes to satisfy page and margin limits. In many of my research projects, the stuff I save and log changes so often that I have little choice but to have an entire scripts/ folder with various scripts for generating figures depending on the output type, and I can end up with tens of such files.
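The command and commit that go into those figure comments can also be captured automatically when a script runs. The sketch below is one possible way to do it, not something described above: it assumes git is installed and the script runs inside a repository, and it falls back to "unknown" otherwise. The function names are hypothetical.

```python
import shlex
import subprocess
import sys


def current_commit():
    """Return the current git commit hash, or "unknown" if unavailable."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        )
        return out.decode().strip()
    except (subprocess.CalledProcessError, OSError):
        # Not a git repository, no commits yet, or git is not installed.
        return "unknown"


def command_string(argv=None):
    """Rebuild the command line, quoting arguments that need it."""
    argv = sys.argv if argv is None else argv
    return "python " + " ".join(shlex.quote(a) for a in argv)


def run_info(argv=None):
    """Collect what a figure comment needs: the command and the commit."""
    return {"command": command_string(argv), "commit": current_commit()}
```

The resulting dictionary could be written next to the saved arguments, so that the figure comment can be copied from the experiment directory.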

While I know that TensorBoard is popular for checking results, I’ve actually never used it (gasp!); I find good old matplotlib to serve my needs sufficiently well, even for checking training in progress. Thus, each of my files in scripts/ creates matplotlib plots, all of which are saved in the appropriate experiment directory in my HDDs.

Conclusion. These techniques will hopefully make one’s life easier in managing and parsing the large set of experiment results that are inevitable in empirical research projects. A recent example when these tips were useful to me was with the bed-making paper we wrote, with neural network training code here, where I was running a number of experiments to test different hyperparameters, neural network architectures, and so forth.

I hope these tips prove to be useful for your experimental framework.