Dual Book Discussion on Political Development and Faith

I finally read two books that were on my agenda for a long time: Francis Fukuyama’s 2011 history book The Origins of Political Order: From Prehuman Times to the French Revolution and Jimmy Carter’s personal memoir Faith: A Journey for All. Reading these books took way longer than it should have, due to a research deadline. Fortunately, that’s in the past and I have pleasantly gotten back to reading too many books and spending too much time blogging.

Before proceeding, here’s a little background on Francis Fukuyama. It is actually tricky to succinctly describe his career. I view him as a political scientist and author, but he has additionally been a professor, a senior fellow, a council member, and probably ten other things, at a variety of universities and think tanks related to the development of democracies. His most well-known work is the 1992 book The End of History and the Last Man, where he argues that liberal democracy represents the final, evolved form of government.1 Some events since the 1992 book (off the top of my head: 9/11, radical Islam and ISIS, political populism, and the rise of unaccountable and authoritarian governments in Russia and China) have made Fukuyama a frequent punching bag for various commentators. For one perspective, check out this recent New Yorker article for some background (and unsurprisingly, criticism) on Fukuyama, though that piece is mostly about Fukuyama’s 2018 book on identity politics and doesn’t make much reference to the book I will soon discuss on political development.

Fukuyama is also associated with the rise of neoconservatism, from which he distanced himself due to the Iraq war. How do we know? He literally says so in a Quora answer.2 Ah, the wonders of the modern world and those “verified accounts” we see on Quora, Twitter, and other social media outlets!

Meanwhile, the second author whose book I will soon discuss, Jimmy Carter, needs no introduction. He served as the 39th President of the United States from 1977 to 1981.

You might be wondering why I am discussing their books in the same blog post. The books are different:

• Fukuyama’s book is dense and scholarly, a 500-page historical account spanning from — as the subtitle makes clear — prehuman times to the French Revolution (1789-ish). The Origins of Political Order includes historical commentary on a variety of European countries, along with China, India, and the occasional detour into the Middle East, Latin America, and other areas. It frequently references other scholarly works that Fukuyama must have reviewed and digested in his long career.

• Carter’s book, in contrast, is a brief personal memoir, and weighs in at around 160 pages. It describes his view of religion and how it has shaped his life, from his youth to his Navy service, to his time as president, and beyond.3

Yet, they have an interesting common theme.

First, consider The Origins of Political Order. It is a book describing how humans came to organize themselves politically, from forming small tribes and then later creating larger kingdoms and states. Fukuyama repeatedly refers to the following three political institutions:

• The State: government itself, which in particular, needs to consolidate and control power.
• Rule of Law: effective legal institutions that constrain what all people (most importantly, leaders!) can and cannot do.
• Accountable Government: having democratic elections to ensure leaders can be voted out of office.

He argues that successful, modern, liberal democracies (the kind of states I want to live in) combine these three institutions in an appropriate balance, which itself is an enormously challenging task. In particular, the pursuit of a strong state seems to be at odds with rulers and elected leaders being bound by a rule of law and accountable government.4

The Origins of Political Order attempts to outline the history, development, and evolution5 of these three institutions, focusing on factors that result in their formation (or decay). It does not attempt to describe a general “rule” or a set of instructions for the oft-used “Getting to Denmark” goal. Fukuyama believes that it is futile to develop clear theories or rules due to the multitude of factors involved.

If there is any “clear rule” that I learned from the book, it is that political decay, or the weakening of these institutions, is a constant threat to be addressed. Fukuyama invokes patrimonialism, the tendency for people to favor family and friends, as the prime factor causing political decay. He makes a strong case. Patrimonialism is natural, but indulging it can lead to weaker governments compared to those that use more merit-based, impersonal systems to judge people. China, Fukuyama argues, was a pioneer in applying merit-based rules for civil service employees. Indeed, Fukuyama refers to China (and not Greece or Rome) as having built the first modern state.

The book was a deep dive into some long-term historical trends — the kind that I like to read, even if it was a struggle for me to weave together the facts. (I had to re-read many parts, and was constantly jotting down notes with my pencil in the book margins.) I was pleasantly reminded of Guns, Germs, and Steel along with The Ideas that Conquered the World, both of which I greatly enjoyed reading three years ago. I would later comment on them in a blog post.

I hope that Fukuyama’s insights can be used to create better governments throughout the world, and can additionally lead to the conclusion he sought when writing The End of History and the Last Man. Is Fukuyama right about liberal democracy being the final form of government? I will let the coming years answer that.

Do I hope Fukuyama turns out to be right all along, and vindicated by future scholars? Good heavens. By God, yes, I hope so.

Now let’s return to something I was not expecting in Fukuyama’s book: religion. (My diction in the prior paragraph was not a coincidence.) Fukuyama discusses how religion was essential for state formation by banding people together and facilitating “large-scale collective action”. To be clear, nothing in Fukuyama’s book is designed to counter the chief claims of the “new Atheist” authors he references; Fukuyama simply mentions that religion was historically a source of cohesion and unity.6

The discussion about religion brings us to Carter’s book.

In Faith, Carter explains that acquiring faith is rarely clear-cut. He does not point to a singular event which caused him to be deeply faithful, as I have seen others do. Carter lists several deeply religious people whom he had the privilege to meet, such as Bill Foege, Ugandan missionaries, and his brother. Much of Carter’s knowledge of Christianity derives from these and other religious figures, along with his preparation for teaching Sunday School, which he admirably continues to do at 94.

Carter, additionally, explains how his faith has influenced his career as a politician and beyond. The main takeaways are that faith has: (1) provided stability to Carter’s life, and (2) driven him to change the world for the better.

• How do members of the same religion come to intensely disagree on certain political topics? Do disagreements arise from reading different Biblical sources or studying under different priests and pastors? Or are people simply misunderstanding the same text, just as students nowadays might misunderstand the same mathematics or science text?

Here are some examples. In Chapter 2, Carter mentions he was criticized by conservative Christians for appointing women and racial minorities to positions in government — where do such disagreements come from? Later, in Chapter 5, Carter rightfully admonishes male chauvinists who tout the Bible’s passage that says “Wives, submit yourselves to your own husbands, as you do to the Lord” because Carter claims that the Bible later says that both genders must commit to each other equally. But where do these male chauvinists come from? In Chapter 6, Carter mentions his opposition to the death penalty and opposition to discrimination on the basis of sexual orientation. Again, why are these straightforward-to-describe issues so bitterly contested?

Or do differences in beliefs come outside of religion, such as from “Enlightenment thinking”?

• What does Carter believe we should do in light of “religious fundamentalism”? As Carter says in Chapter 2, this is when certain deeply religious people believe they are superior to others, particularly those outside the faith or viewed as insufficiently faithful. Moreover, what are the appropriate responses for when these people have political power and invoke their religious beliefs when creating and/or applying controversial laws?

• What about the age-old question of science versus religion? In Chapter 5, Carter states that scientific discoveries about the universe do not contradict his belief in a higher being, and serve to “strengthen the reverence and awe generated by what has already become known and what remains unexplained.” But does this mean we should attribute all events that we can’t explain with science to God and intelligent design? This also raises the question as to whether God currently exists, or whether God simply created the universe by initiating the Big Bang but then took his (or her??) hands permanently off the controls to see (but not influence) what would happen. This matters in the context of politicians who invoke God to justify their political decisions. See my previous point.

Despite my frequent questions, it was insightful to understand his perspective on religion. Admittedly, I don’t think it would be fair to expect firm answers to any of my questions.

I am a non-religious atheist,7 and in all likelihood that will last for the remainder of my life, unless (as I mentioned at the bottom of this earlier blog post), I observe evidence that a God currently exists. Until then, it will be hard for me to spend my limited time reading the Bible or engaging in other religious activities when I have so many competing demands on my attention, first among them developing a general-purpose robot.

I will continue reading more books like Carter’s Faith (and Fukuyama’s book for that matter) because I believe it’s important to understand a variety of perspectives, and reading books lets me scratch the surface of deep subjects. This is the most time-efficient way for me to obtain a nontrivial understanding of a vast number of subjects.

On a final note, it was a pleasant surprise when Carter reveals in his book that people of a variety of different faiths, including potentially atheists, have attended his Sunday School classes. If the opportunity arises, I probably would, if only to get the chance to meet him. Or perhaps I could meet Carter if I get on a commercial airplane that he’s flying on. I would like to meet people like him, and to imagine myself changing the world as much as he has.

Since I currently have no political power, my ability to create a positive impact on the world is probably predicated on my technical knowledge. Quixotic though it may sound, I hope to use computer science and robotics to change the world for the better. If you have thoughts on how to do this, feel free to contact me.

1. I have not read The End of History and the Last Man. Needless to say, that book is high on my reading agenda. Incidentally, it seems that a number of people knowledgeable about history and foreign affairs are aware of the book, but have not actually read it. I am doing my best to leave this group.

2. Let’s be honest: leaving the neoconservative movement due to the Iraq war was the right decision.

3. Carter has the longest post-presidency lifespan of any US president in history.

4. There are obvious parallels in the “balance” of political institutions sought out by Fukuyama, and the “checks and balances” designed by the framers of the American Constitution.

5. My word choice of “evolution” here is deliberate. Fukuyama occasionally makes references to Charles Darwin and the theory of evolution, and its parallels in the development of political institutions.

6. I do not think it is fair to criticize the New Atheist claim that “religion is a source of violence”. I would be shocked if Dawkins, Harris, and similar people believed that religion had no benefits early on during state formation. It is more during the present day, when we already have well-formed states, that such atheists point out the divisiveness that religion creates.

7. I am also an ardent defender of freedom of religion.

BAIR Blog Post on Depth Maps and Deep Learning in Robotics

As usual, I have been slow blogging here. This time, I have a valid excuse. I was consumed with writing for another one: the Berkeley Artificial Intelligence Research (BAIR) blog, of which I serve as the primary editorial board member. If I may put my non-existent ego aside, the BAIR blog is more important (and popular!)1 than my personal blog. BAIR blog posts generally require more effort to write than personal blog posts. Quality over quantity, right?

You can read my blog post there, which is about using depth images in the context of deep learning and robotics. Unlike most BAIR blog posts, this one tries to describe a little history and a unifying theme (depth images) across multiple papers. It’s a little long; we put a lot of effort into this post.

I also have an earlier BAIR blog post from last year, about the work I did with Markov chain Monte Carlo methods. I’ve since moved on to robotics research, which explains the sudden change in blogging topics.

Thank you for reading this little note, and I hope you also enjoy the BAIR blog post.

1. As of today, my blog (a.k.a., “Seita’s Place”) has 88 subscribers via MailChimp. The BAIR Blog has at least 3,600.

Three Approaches to Deep Learning for Robotic Grasping

In ICRA 2018, “Deep Learning” was the most popular keyword in the accepted papers, and for good reason. The combination of deep learning and robotics has led to a wide variety of impressive results. In this blog post, I’ll go over three remarkable papers that pertain to deep learning for robotic grasping. While the core idea remains the same — just design a deep network, get appropriate data, and train — the papers have subtle differences in their proposed methods that are important to understand. For these papers, I will attempt to describe data collection, network design, training, and deployment.

Paper 1: Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours

The grasping architecture used in this paper. No separate motor command is passed as input to the network, since the position is known from the image patch and the angle is one of 18 different discretized values.

In this award-winning ICRA 2016 paper, the authors propose a data-driven grasping method that involves a robot (the Baxter in this case) repeatedly executing grasp attempts and training a network using automatically-labeled data of grasp success. The Baxter attempted 50K grasps which took 700 robot hours. Yikes!

• Data Collection. Various objects get scattered across a flat workspace in front of the robot. An off-the-shelf “Mixture of Gaussians subtraction algorithm” is used to detect various objects. This is a necessary bias in the procedure so that a random (more like “semi-random”) grasp attempt will be near the region of the object and thus may occasionally succeed. Then, the robot moves its end-effector to a known height above the workspace, and attempts to grasp by randomly sampling a nearby 2D point and angle. To automatically deduce the success or failure label, the authors measure force readings on the gripper; if the robot has grasped successfully, then the gripper will not be completely closed. Fair enough!

• Network Architecture. The neural network frames the grasping problem as an 18-way binary classification task (i.e., success or failure for each of 18 discretized angles) over image patches. The 18-way branch at the end exists because multiple angles may lead to successful grasps for an object, so it makes no sense to say that only one out of 18 (or whatever the discretization) will work. Thus, they have 18 different logits, and during training on a given data sample, only the branch corresponding to the angle in that sample is updated with gradients.

They use a 380x380 RGB image patch centered at the target object, and downsample it to 227x227 before passing it to the network. The net uses fine-tuned AlexNet CNN layers pre-trained on ImageNet. They then add fully connected layers, and branch out as appropriate. See the top image for a visual.

In sum, the robot only needs to output a grasp that is 3 DoF: the $(x,y)$ position and the grasp angle $\theta$. The $(x,y)$ position is implicit in the input image, since it is the central point of the image.
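To make the "only one branch gets gradients" idea concrete, here is a minimal sketch in plain Python. It assumes a sigmoid on each of the 18 logits and a standard binary cross-entropy loss; the function names are my own and are not taken from the paper's code.

```python
import math

NUM_ANGLES = 18  # number of discretized grasp angles described above


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def masked_grasp_loss(logits, angle_index, success):
    """Binary cross-entropy on the one head matching the attempted angle.

    logits: list of NUM_ANGLES raw scores, one per angle bin.
    angle_index: the bin of the grasp that was actually executed.
    success: 1 if the grasp succeeded, 0 otherwise.
    Returns (loss, grads), where grads is d(loss)/d(logits).
    """
    p = sigmoid(logits[angle_index])
    eps = 1e-12  # guards against log(0)
    loss = -(success * math.log(p + eps) + (1 - success) * math.log(1 - p + eps))
    # Every other head has zero gradient: we never observed those angles.
    grads = [0.0] * len(logits)
    grads[angle_index] = p - success
    return loss, grads
```

The zero entries in `grads` are the whole point: a single grasp attempt says nothing about whether the other 17 angles would have worked.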

• Training and Testing Procedure. Their training formally involves multiple stages, where they start with random trials, train the network, and then use the trained network to continue executing grasps. For faster training, they generate “hard-negative” samples, which are data points that the model thinks are graspable but are not. Effectively, they form a curriculum.

For evaluation, they can first measure classification performance on held-out data. This requires a forward pass for the grasping network, but does not require moving the robot, so this step can be done quickly. For deployment, they can sample a variety of patches, and for each, obtain the logits from the 18 different heads. Then, over all those patches, the robot picks the patch and angle combination that the grasp network rates as giving the highest probability of success.
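The deployment step is essentially an argmax over (patch, angle) pairs. Below is a small sketch, assuming the network outputs have already been converted to success probabilities and that the 18 bins are spaced evenly over 180 degrees (the even spacing is my assumption). The name `select_grasp` is hypothetical.

```python
NUM_ANGLES = 18
DEGREES_PER_BIN = 180.0 / NUM_ANGLES  # assumed even spacing of angle bins


def select_grasp(patch_probs):
    """Pick the best (patch center, angle) pair.

    patch_probs: non-empty dict mapping an (x, y) patch center to a list
    of NUM_ANGLES predicted success probabilities.
    Returns (center, angle_in_degrees, probability).
    """
    best_prob, best_center, best_bin = -1.0, None, None
    for center, probs in patch_probs.items():
        for angle_bin, prob in enumerate(probs):
            if prob > best_prob:
                best_prob, best_center, best_bin = prob, center, angle_bin
    return best_center, best_bin * DEGREES_PER_BIN, best_prob
```

For example, `select_grasp({(10, 20): [0.1] * 18, (30, 40): [0.1] * 17 + [0.9]})` picks the second patch at the last angle bin.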

Paper 2: Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection

(Note that I briefly blogged about the paper earlier this year.)

The grasping architecture used in this paper. Notice that it takes two RGB images as input, representing the initial and current images for the grasp attempt.

This paper is the “natural” next step, where we now get an order of magnitude more data points and use a much deeper neural network. Of course, there are some subtle differences with the method which are worth thinking about, and which I will go over shortly.

• Data Collection. Levine’s paper uses six to fourteen robots collecting data in parallel, and is able to get roughly 800K grasp attempts over the course of two months. Yowza! As with Pinto’s paper, the human’s job is only to restock the objects in front of the robot (this time, in a bin with potential overlap and contact) while the robot then “randomly” attempts grasps.

The samples in their training data have labels that indicate whether a grasp attempt was successful or not. Following the trend of self-supervision papers, these labels are automatically supplied by checking if the gripper is closed or not, which is similar to what Pinto did. There is an additional image subtraction test which serves as a backup for smaller objects.

A subtle difference with Pinto’s work is that Pinto detected objects via a Mixture of Gaussians test and then had the robot attempt to grasp it. Here, the robot simply grasps at anything, and a success is indicated if the robot grasps any object. In fact, from the videos, I see that the robot can grasp multiple objects at once.

In addition, grasps are not executed in one shot, but via multiple steps of motor commands, ranging from $T=2$ to $T=10$ different steps. Each grasp attempt $i$ provides $T$ training data instances: $\{(\mathbf{I}_t^i, \mathbf{p}_T^i - \mathbf{p}_t^i, \ell_i)\}_{t=1}^T$. So, the labels are the same for all data points, and all that matters is what happened after the last motor command. The paper discusses the interesting interpretation as reinforcement learning, which assumes actions induce a transitive relation between states. I agree in that this seems to be simpler than the alternative of prediction based on movement vectors at consecutive time steps.
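Here is a short sketch of how one grasp attempt expands into $T$ training instances following the formula above. Images are treated as opaque objects and poses as plain lists of numbers; the function name is my own.

```python
def make_training_instances(images, poses, success):
    """Expand one grasp attempt into T (image, motion, label) tuples.

    images: list of T images, one per time step (any objects).
    poses: list of T gripper poses, each a list of floats; poses[-1] is p_T.
    success: the single 0/1 label for the whole attempt.
    """
    final_pose = poses[-1]
    instances = []
    for image, pose in zip(images, poses):
        # The motion vector is p_T - p_t: where the gripper ends up,
        # relative to where it is at this time step.
        motion = [f - c for f, c in zip(final_pose, pose)]
        instances.append((image, motion, success))
    return instances
```

Every tuple shares the same label, and the instance for the last time step has an all-zero motion vector.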

• Network Architecture. The paper uses a much deeper convolutional neural network. Seriously, did they need all of those layers? I doubt that. But anyway, unlike the other architectures here, it takes two RGB 472x472x3 images as input (actually, both are 512x512x3 but then get randomly cropped for translation invariance), one for the initial scene before the grasp attempt, and the other for the current scene. The other architectures from Pinto and Mahler do not need this because they assume precise camera calibration, which allows for an open-loop grasp attempt upon getting the correct target and angle.

In addition to the two input images, it takes in a 5D motor command, which is passed as input later on in the network and combined, as one would expect. This encodes the angle, which avoids the need to have different branches like in Pinto’s network. Then, the last part of the network predicts if the motor command will lead to (any) successful grasp (of any object in the bin).

• Training and Testing Procedure. They train the network over the course of two months, updating the network 4 times and then increasing the number of steps for each grasp attempt from $T=2$ to $T=10$. So it is not just “collect and train” once. Each robot experienced different wear and tear, which I can agree with, though it’s a bit surprising that the paper emphasizes this a lot. I would have thought Google robots would be relatively high quality and resistant to such forces.

For deploying the robot, they use a continuous servoing mechanism to continually adjust the trajectory solely based on visual input. So, the grasp attempt is not a single open-loop throw, but involves multiple steps. At each time step, it samples a set of potential motor commands, which are coupled with heuristics to ensure safety and compatibility requirements. The motor commands are also projected to go downwards to the scene, since this more closely matches the commands seen in the training data. Then, the algorithm queries the trained grasp network to see which one would have the highest success probability.
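The "sample commands, then query the network" step can be sketched with the cross-entropy method (CEM), which is the optimizer Levine's paper uses. This is a sketch under my own assumptions: `score_fn` stands in for the trained grasp network evaluated on the current images, the Gaussian is diagonal, and the iteration, sample, and elite counts are illustrative defaults. The safety heuristics and downward projection mentioned above are omitted.

```python
import random
import statistics


def cem_select_command(score_fn, dim=5, iterations=3, samples=64, elites=6, seed=0):
    """Search for a motor command with a high predicted success score.

    score_fn: maps a command (list of `dim` floats) to a score; in the real
    system this would be the grasp network's predicted success probability.
    Returns (best_command, best_score) over all sampled commands.
    """
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [1.0] * dim
    best_cmd, best_score = None, float("-inf")
    for _ in range(iterations):
        batch = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(samples)]
        scored = sorted(((score_fn(cmd), cmd) for cmd in batch),
                        key=lambda pair: pair[0], reverse=True)
        top = scored[:elites]
        if top[0][0] > best_score:
            best_score, best_cmd = top[0]
        # Refit the sampling distribution to the elite commands.
        for d in range(dim):
            values = [cmd[d] for _, cmd in top]
            mean[d] = statistics.mean(values)
            std[d] = statistics.pstdev(values) + 1e-6
    return best_cmd, best_score
```

In a servoing loop, this search would be rerun at every time step with fresh camera images, which is what lets the robot correct its trajectory.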

Levine’s paper briefly mentions the research contribution with respect to Dex-Net (coming up next):

Aside from the work of Pinto & Gupta (2015), prior large-scale grasp data collection efforts have focused on collecting datasets of object scans. For example, Dex-Net used a dataset of 10,000 3D models, combined with a learning framework to acquire force closure grasps.

With that, let’s move on to discussing Dex-Net.

Paper 3: Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics

(Don’t forget to check out Jeff Mahler’s excellent BAIR Blog post.)

The grasping architecture used in this paper. Notice how the input image to the far left is cropped and aligned to form the actual input to the GQ-CNN.

The Dexterity Network (“Dex-Net”) is an ongoing project at UC Berkeley’s AUTOLAB, led by Professor Ken Goldberg. There are a number of Dex-Net related papers, and for this post I will focus on the RSS 2017 paper since that uses a deep network for grasping. (It’s also the most cited of the Dex-Net related papers, with 80 as of today.)

• Data Collection. Following their notation, states, grasps, depth images, and success metrics are denoted as $\mathbf{x}$, $\mathbf{u}$, $\mathbf{y}$, and $S(\mathbf{u},\mathbf{x})$, respectively. You can see the paper for the details. Grasps are parameterized as $\mathbf{u} = (\mathbf{p}, \phi)$, where $\mathbf{p}$ is the center of the grasp with respect to the camera pose and $\phi$ is an angle in the table plane, which should be similar to the angle used in Pinto’s paper. In addition, depth images are also referred to as point clouds in this paper.

The Dex-Net 2.0 system involves the creation of a synthetic dataset of 6.7 million points for training a deep neural network. The dataset is created from 10K 3D object models from Dex-Net 1.0, and augmented with sampled grasps and robustness metrics, so it is not simply done via “try executing grasps semi-randomly.” More precisely, they sample from a graphical model to generate multiple grasps and success metrics for each object model, with constraints to ensure sufficient coverage over the model. Incidentally, the success metric is itself evaluated via another round of sampling. Finally, they create depth images using standard pinhole camera and projection models. They further process each depth image so that it is cropped to be centered at the grasp location, and rotated so that the grasp axis lies along the middle row of the image.

Figure 3 in the paper has a nice, clear overview of the dataset generation pipeline. You can see the example images in the dataset, though these include the grasp overlay, which is not actually passed to the network. It is only for our human intuition.

• Network Architecture. Unlike the two other papers I discuss here, the GQ-CNN takes in a depth image as input. The depth images are just 32x32 in size, so the images are definitely smaller as compared to the 227x227x3 in Pinto’s network, which in turn is smaller than the 472x472x3 input images for Levine’s network. See the image above for the GQ-CNN. Note the alignment of the input image; the Dex-Net paper claims that this removes the need to have a predefined set of discretized angles, as in Pinto’s work. It also arguably simplifies the architecture by not requiring 18 different branches at the end. The alignment process requires two coordinates of the grasp point $\mathbf{p}$ along with the angle $\phi$. This leaves $z$, the height, which is passed as a separate layer. This is interesting, so instead of passing in a full grasp vector, three out of its four components are implicitly encoded in the image alignment process.
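The alignment step can be sketched as a combined crop and rotation. This is my own illustration, not the Dex-Net code: it uses nearest-neighbor sampling on a nested-list image, pads out-of-bounds pixels with 0.0, and takes the angle in radians in image (row, column) coordinates. All of those choices are assumptions.

```python
import math


def align_depth_image(depth, center, angle, size=32):
    """Crop a size x size window centered on the grasp, rotated so the
    grasp axis lies along the middle row of the output.

    depth: 2D list indexed as depth[row][col].
    center: (row, col) of the grasp center in the source image.
    angle: grasp axis angle in radians, measured in the image plane.
    """
    rows, cols = len(depth), len(depth[0])
    center_row, center_col = center
    half = size // 2
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for r in range(size):
        out_row = []
        for c in range(size):
            dy, dx = r - half, c - half
            # Map the output offset back into the source image frame.
            src_col = center_col + dx * cos_a - dy * sin_a
            src_row = center_row + dx * sin_a + dy * cos_a
            ri, ci = int(round(src_row)), int(round(src_col))
            inside = 0 <= ri < rows and 0 <= ci < cols
            out_row.append(depth[ri][ci] if inside else 0.0)
        out.append(out_row)
    return out
```

After this step, the network never needs to be told the grasp's planar position or angle explicitly, which is why only the height $z$ is passed in separately.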

• Training and Testing Procedure. The training seems to be straightforward SGD with momentum. I wonder if it is possible to use a form of curriculum learning as with Pinto’s paper?
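For reference, one common form of the SGD with momentum update is sketched below. The learning rate and momentum values are placeholders of mine, not the values from the Dex-Net paper.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update on flat lists of floats.

    velocity accumulates a decaying sum of past gradient steps.
    Returns (new_params, new_velocity) without mutating the inputs.
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```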

They have a detailed experiment protocol for their ABB YuMi robot, which, like the Baxter, has two arms and high precision. I like this section of the paper: it’s detailed and provides a description for how objects are actually scattered across the workspace, and discusses not just novel objects but also adversarial ones. Excellent! In addition, they count a grasp as successful only if the gripper held the object after not just lifting but also transporting and shaking. That will definitely test robustness.

The grasp planner assumes singulated objects (like with Pinto’s work, but not with Levine’s), but they were able to briefly test a more complicated “order fulfillment” experiment. In follow-up research, they got the bin-picking task to work.

Overall, I would argue that Dex-Net is unique compared to the two other papers in that it uses more physics and analytic-based prior methods to assist with Deep Learning, and does not involve repeatedly executing and trying out grasps.

In terms of the grasp planner, one could argue that it’s a semi-hybrid (if that makes sense) of the two other papers. In Pinto’s paper, the grasp planner isn’t really a planner: it only samples image patches and then runs the network to find the highest-scoring patch and angle combination. In Levine’s paper, the planner involves continuous visual servoing which can help correct actions. The Dex-Net setup requires sampling for the grasp (and not image patches) and, like Levine’s paper, uses the cross-entropy method. Dex-Net, though, does not use continuous servoing, so it requires precise camera calibration.

On OpenAI Baselines Refactored and the A2C Code

OpenAI, a San Francisco nonprofit organization, has been in the news for a number of reasons, such as when their Dota2 AI system was able to beat a competitive semi-professional team, when they trained a robotic hand to have unprecedented dexterity, and in various contexts about their grandiose mission of building artificial general intelligence. It’s safe to say that such lofty goals are characteristic of an Elon Musk-founded company (er, nonprofit). I find their technical accomplishments impressive thus far, and hope that OpenAI can continue their upward trajectory in impact. What I’d like to point out in this blog post, though, is that I don’t actually find their Dota2 system, their dexterous hand, or other research products to be their most useful or valuable contribution to the AI community.

I think OpenAI’s open-source baselines code repository wins the prize of their most important product. You can see an announcement in a blog post from about 1.5 years ago, where they correctly point out that reinforcement learning algorithms, while potentially simple to describe and outline in mathematical notation, are surprisingly hard to implement and debug. I have faced my fair share of issues in implementing reinforcement learning algorithms, and it was a relief to me when I found out about this repository. If other AI researchers base their code on this repository, then it makes it far easier to compare and extend algorithms, and far easier to verify correctness (always a concern!) of research code.

That’s not to say it’s been a smooth ride. Far from it, in fact. The baselines repository has been notorious for being difficult to use and extend. You can find plenty of complaints and constructive criticism on the GitHub issues and on reddit (e.g., see this thread).

The good news is that over the last few months — conveniently, when I was distracted with ICRA 2019 — they substantially refactored their code base.

While the refactoring is still in progress for some of the algorithms (e.g., DDPG, HER, and GAIL seem to be following their older code), the shared code and API that different algorithms should obey is apparent.

First, as their README states, algorithms should now be run with the following command:

python -m baselines.run --alg=<name of the algorithm> \


The baselines.run is a script shared across algorithms that handles the following tasks:

• It processes command line arguments and handles “ranks” for MPI-based code. MPI is used for algorithms that require multiple processes for parallelism.

• It runs the training method, which returns a model and an env.

• The training method needs to first fetch the learning function, along with its arguments.

• It does this by treating the algorithm input (e.g., 'a2c' in string form) as a python module, and then importing a learn method. Basically, this means in a sub-directory (e.g., baselines/a2c) there needs to be a python script of the same name (which would be a2c.py in this example) which defines a learn method. This is the main “entry point” for all refactored algorithms.

• After fetching the learning function, the code next searches to see if there are any default arguments provided. A2C appears to lack a defaults.py file, so there are no defaults specified outside of the learn method. If there were such a file, then the arguments in defaults.py would override those in learn. In turn, defaults.py is overridden by anything that we write on the command line. Whew, got that?
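The lookup and the override order can be sketched as follows. This is my own simplified illustration of the behavior described above, not a copy of the baselines source; the helper names and the example hyperparameter values are assumptions.

```python
import importlib


def get_learn_function(alg, package="baselines"):
    """Import <package>.<alg>.<alg> and return its `learn` function.

    For alg='a2c' this resolves the module baselines.a2c.a2c, which
    corresponds to the file baselines/a2c/a2c.py.
    """
    module = importlib.import_module(".".join([package, alg, alg]))
    return module.learn


def merge_arguments(learn_defaults, defaults_file, command_line):
    """Apply the precedence: learn() defaults < defaults.py < command line."""
    merged = dict(learn_defaults)
    merged.update(defaults_file)   # defaults.py overrides learn()'s defaults
    merged.update(command_line)    # the command line overrides everything
    return merged
```

For instance, `merge_arguments({"lr": 7e-4, "nsteps": 5}, {"nsteps": 20}, {"lr": 1e-3})` gives `{"lr": 1e-3, "nsteps": 20}`.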

• Then it needs to build the environment. Since parallelism is so important for algorithms like A2C, this often involves creating multiple environments of the same type, such as creating 16 different instantiations of the Pong game. (Such usage also depends on the environment type: whether it’s atari, retro, mujoco, etc.)

• Without any arguments for num_env, this will often default to the number of CPUs on the system from running multiprocessing.cpu_count(). For example, on my Ubuntu 16.04 machine with a Titan X (Pascal) GPU, I have 8 CPUs. This is also the value I see when running htop. Technically, my processor only has 4 physical cores, but the baselines code “sees” 8 logical CPUs due to hyperthreading.

• They use the SubprocVecEnv classes for making multiple environments of the same type. In particular, it looks like it’s called as:

SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])


from make_vec_env in baselines/common/cmd_util.py, where each environment is created with its own ID, and the make_env method further creates a random seed based on the MPI rank. This is a list of OpenAI gym environments, as one would expect.

• The current code comments in SubprocVecEnv succinctly describe why this class exists:

VecEnv that runs multiple environments in parallel in subproceses and communicates with them via pipes. Recommended to use when num_envs > 1 and step() can be a bottleneck.

It makes sense to me. Otherwise, we’d need to sequentially iterate through a bunch of step() functions in a list — clearly a bottleneck in the code. Bleh! There’s a bunch of functionality that should look familiar to those who have used the gym library, except it considers the combination of all the environments in the list.

• In A2C, it looks like the SubprocVecEnv class is further passed as input to the VecFrameStack class, so it’s yet another wrapper. Wrappers, wrappers, and wrappers all day, yadda yadda yadda. This means it will call the SubprocVecEnv’s methods, such as step_wait(), and process the output (observations, rewards, etc.) as needed and then pass them to an end-algorithm like A2C with the same interface. In this case, I think the wrapper provides functionality to stack the observations so that they are all in one clean numpy array, rather than in an ugly list, but I’m not totally sure.

• Then it loads the network used for the agent’s policy. By default, this is the Nature CNN for atari-based environments, and a straightforward (input-64-64-output) fully connected network otherwise. The TensorFlow construction code is in baselines.common.models. The neural networks are not built until the learning method is subsequently called, as in the next bullet point:

• Finally, it runs the learning method it acquired earlier. Then, after training, it returns the trained model. See the individual algorithm directories for details on their learn method.

• In A2C, for instance, one of the first things the learn method does is to build the policy. For details, see baselines/common/policies.py.

• There is one class there, PolicyWithValue, which handles building the policy network and seamlessly integrates shared parameters with a value function. This is characteristic of A2C, where the policy and value functions share the same convolutional stem (at least for atari games) but have different fully connected “branches” to complete their individual objectives. When running Pong (see commands below), I get this as the list of TensorFlow trainable parameters:

<tf.Variable 'a2c_model/pi/c1/w:0' shape=(8, 8, 4, 32) dtype=float32_ref>
<tf.Variable 'a2c_model/pi/c1/b:0' shape=(1, 32, 1, 1) dtype=float32_ref>
<tf.Variable 'a2c_model/pi/c2/w:0' shape=(4, 4, 32, 64) dtype=float32_ref>
<tf.Variable 'a2c_model/pi/c2/b:0' shape=(1, 64, 1, 1) dtype=float32_ref>
<tf.Variable 'a2c_model/pi/c3/w:0' shape=(3, 3, 64, 64) dtype=float32_ref>
<tf.Variable 'a2c_model/pi/c3/b:0' shape=(1, 64, 1, 1) dtype=float32_ref>
<tf.Variable 'a2c_model/pi/fc1/w:0' shape=(3136, 512) dtype=float32_ref>
<tf.Variable 'a2c_model/pi/fc1/b:0' shape=(512,) dtype=float32_ref>
<tf.Variable 'a2c_model/pi/w:0' shape=(512, 6) dtype=float32_ref>
<tf.Variable 'a2c_model/pi/b:0' shape=(6,) dtype=float32_ref>
<tf.Variable 'a2c_model/vf/w:0' shape=(512, 1) dtype=float32_ref>
<tf.Variable 'a2c_model/vf/b:0' shape=(1,) dtype=float32_ref>


There are separate policy and value branches, which are shown in the bottom four lines above. There are six actions in Pong, which explains why one of the dense layers has shape 512x6. Their code technically exposes two different interfaces to the policy network to handle stepping during training and testing, since these will in general involve different batch sizes for the observation and action placeholders.
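As a sanity check on the shapes listed above, here is a small sketch that recomputes the flattened size (3136) and the total number of trainable parameters. It assumes the usual Nature CNN settings, which the printout does not show: 84x84 inputs with 4 stacked frames, strides of 4, 2, and 1, and no padding. The helper names conv_out and prod are made up for this example.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution with no ('VALID') padding."""
    return (size - kernel) // stride + 1

# Assumed Nature CNN settings: 84x84 input, (kernel, stride) per conv layer.
size = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)   # 84 -> 20 -> 9 -> 7

flat = size * size * 64                     # 64 channels after the last conv

# Weight shapes copied from the variable list above. Each layer's bias has
# one entry per output unit, i.e., the last number of the weight shape.
weights = [(8, 8, 4, 32), (4, 4, 32, 64), (3, 3, 64, 64),
           (3136, 512), (512, 6), (512, 1)]

def prod(shape):
    """Product of all dimensions in a shape tuple."""
    total = 1
    for dim in shape:
        total *= dim
    return total

num_params = sum(prod(w) + w[-1] for w in weights)
print(flat, num_params)
```

Under those assumptions, the 7x7x64 output of the last convolution flattens to 3136, matching the first dimension of the fc1 weights, and nearly all of the parameters sit in that one fully connected layer.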

• The A2C algorithm uses a Model class to define various TensorFlow placeholders and the computational graph, while the Runner class is for stepping in the (parallel) environments to generate experiences. Within the learn method (which is what actually creates the model and runner), for each update step, the code is remarkably simple: call the runner to generate batches, call the train method to update weights, print some logging statistics, and repeat. Fortunately, the runner returns observations, actions, and other stuff in numpy form, making it easy to print and inspect.

• Regarding the batch size: there is a parameter based on the number of CPUs (e.g., 8). That’s how many environments are run in parallel. But there is a second parameter, nsteps, which is 5 by default. This is how many steps the runner will execute in each environment for each minibatch. The highlights of the runner’s run method look like this:

for n in range(self.nsteps):
    actions, values, states, _ = self.model.step(
        self.obs, S=self.states, M=self.dones)
    # skipping a bunch of stuff ...
    obs, rewards, dones, _ = self.env.step(actions)
    # skipping a bunch of stuff ...


The model’s step method returns actions, values and states for each of the parallel environments, which is straightforward to do since the number of environments is just the batch size in the network’s forward pass. Then, the env class can step the environments in parallel using subprocesses on the CPU. All of these results are collected over nsteps, which multiplies the batch size by an extra factor (e.g., 8 environments times 5 steps gives 40 samples per update). Then the returns are computed from the rewards as nsteps-step returns, where nsteps is normally 5. Indeed, from checking the original A3C paper, I see that DeepMind used 5-step returns. Minor note: technically 5 is the maximum “step-return”: the last time step uses the 1-step return, the penultimate time step uses the 2-step return, and so on. It can be tricky to think about.
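To make the “maximum step-return” point concrete, here is a minimal sketch of computing bootstrapped returns for a single environment’s rollout. It is not the baselines implementation: the function name nstep_returns, its arguments, and the default gamma of 0.99 are assumptions made for illustration.

```python
def nstep_returns(rewards, dones, last_value, gamma=0.99):
    """Discounted returns for one environment's nsteps-long rollout.

    rewards[t] and dones[t] come from step t; last_value is the critic's
    estimate for the state reached after the final step (the bootstrap).
    """
    returns = []
    running = last_value
    for reward, done in zip(reversed(rewards), reversed(dones)):
        # A finished episode cuts off everything that came after it.
        running = reward + gamma * running * (1.0 - done)
        returns.append(running)
    return returns[::-1]

nenvs, nsteps = 8, 5
batch_size = nenvs * nsteps   # samples per gradient update

# With gamma=1 and no episode ends, the first step sums 5 rewards plus the
# bootstrap value, while the last step sums only 1 reward plus the bootstrap.
print(nstep_returns([1, 1, 1, 1, 1], [0, 0, 0, 0, 0], last_value=10.0, gamma=1.0))
# -> [15.0, 14.0, 13.0, 12.0, 11.0]
```

Working backwards through the rollout is what gives each time step a different effective horizon: the earliest step looks 5 rewards ahead, the latest only 1.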

• At the end, it handles saving and visualizing the agent, if desired. This uses the step method from both the Model and the env, to handle parallelism. The Model step method directly calls the PolicyWithValue’s step function. This exposes the value function, which allows us to see what the network thinks regarding expected return.

Incidentally, I have listed the above in order of code logic, at least as of today’s baselines code. Who knows what will happen in a few months?
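The two earlier steps about fetching the learn function and resolving its arguments can be sketched as follows. This is only an illustration of the idea, not the baselines code: fake_a2c is a stand-in module built on the fly so the example runs without baselines installed, and the names get_learn_function and resolve_kwargs, as well as all argument values, are assumptions.

```python
import importlib
import sys
import types

# Stand-in for baselines/a2c/a2c.py so the example runs without baselines.
fake = types.ModuleType("fake_a2c")

def learn(env, nsteps=5, lr=7e-4):
    """Toy learn method that just reports the arguments it received."""
    return {"env": env, "nsteps": nsteps, "lr": lr}

fake.learn = learn
sys.modules["fake_a2c"] = fake

def get_learn_function(module_name):
    """Import a module by its string name and return its learn function."""
    return getattr(importlib.import_module(module_name), "learn")

def resolve_kwargs(file_defaults, cli_args):
    """Later sources win: the defaults file first, then the command line."""
    kwargs = dict(file_defaults)
    kwargs.update(cli_args)
    return kwargs

learn_fn = get_learn_function("fake_a2c")
kwargs = resolve_kwargs(file_defaults={"nsteps": 20, "lr": 1e-3},
                        cli_args={"lr": 3e-4})
result = learn_fn("PongNoFrameskip-v4", **kwargs)
print(result)
```

Here nsteps comes from the hypothetical defaults file (overriding the value in learn’s signature), while lr comes from the command line (overriding both), which is the precedence order described above.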

Since the code base has been refactored, I decided to run a few training scripts to see performance. Unfortunately, despite the refactoring, I believe the DQN-based algorithms still are not correctly implemented. I filed a GitHub issue where you can check out the details, and suffice to say, this is a serious flaw in the baselines repository.

So for now, let’s not use DQN. Since A2C seems to be working, let us go ahead and test that. I decided to run the following command line arguments:

python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4 --num_timesteps=2e7 \
--num_env=2  --save_path=models/a2c_2e7_02cpu

python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4 --num_timesteps=2e7 \
--num_env=4  --save_path=models/a2c_2e7_04cpu

python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4 --num_timesteps=2e7 \
--num_env=8  --save_path=models/a2c_2e7_08cpu

python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4 --num_timesteps=2e7 \
--num_env=16 --save_path=models/a2c_2e7_16cpu


Yes, I know my computer has only 8 CPUs but I am running with 16. I’m not actually sure how this works, maybe each CPU has to deal with two processes sequentially? Heh.

When you run these commands, it (in the case of 16 environments) creates the following output in the automatically-created log directory:

daniel@takeshi:/tmp$ ls -lh openai-2018-09-26-16-06-58-922448/
total 568K
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.0.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.10.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.11.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.12.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.13.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.14.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.6K Sep 26 17:33 0.15.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.1.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.2.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.3.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.4.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.8K Sep 26 17:33 0.5.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.6.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.8K Sep 26 17:33 0.7.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.7K Sep 26 17:33 0.8.monitor.csv
-rw-rw-r-- 1 daniel daniel 7.8K Sep 26 17:33 0.9.monitor.csv
-rw-rw-r-- 1 daniel daniel 333K Sep 26 17:33 log.txt
-rw-rw-r-- 1 daniel daniel  95K Sep 26 17:33 progress.csv


Clearly, there is one monitor.csv for each of the 16 environments, which contains the corresponding environment’s episode rewards (and not the other 15).

The log.txt is the same as the standard output, and progress.csv records the log’s stats.

Using this python script, I plotted the results. They are shown in the image below, which you can expand in a new window to see the full size.

Results of the A2C commands. Each row corresponds to using a different number of environments (2, 4, 8, or 16) in A2C, and each column corresponds to some smoothing setting for the score curves, and some option for the x-axis (episodes, steps, or time).

It seems like running with 8 environments results in the best game scores, with the final values for all 8 surpassing 20 points. The other three settings look like they need a little more training to get past 20. Incidentally, the raw scores (left column) are noisy, so the second and third columns show the scores smoothed over windows of 10 and 100 episodes, respectively.
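For readers who want to reproduce this kind of smoothing, here is one plausible way to do it: a trailing moving average over episode scores. The plotting script itself is not reproduced in this post, so the function name smooth and its behavior at the start of the curve (averaging whatever is available so far) are assumptions.

```python
def smooth(scores, window):
    """Trailing moving average; early entries average what is available."""
    out = []
    running = 0.0
    for i, score in enumerate(scores):
        running += score
        if i >= window:
            # Drop the score that just fell out of the window.
            running -= scores[i - window]
        out.append(running / min(i + 1, window))
    return out

print(smooth([1, 2, 3, 4], window=2))   # -> [1.0, 1.5, 2.5, 3.5]
```

A window of 1 returns the raw scores unchanged, and larger windows trade responsiveness for a cleaner curve.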

The columns also report scores as a function of different items we might care about: training episodes, training steps, or training time (in minutes). The x-axis values vary across the different rows, because the 2e7 steps limit considers the combination of all steps in the parallel environments. For example, the 16 environment case ran in 175 minutes (almost 3 hours). Interestingly enough, the speedup over the 8 environment case is smaller than one might expect, perhaps because my computer only has 8 CPUs. There is, fortunately, a huge gap in speed between the 8 and 4 settings.

Whew! That’s all for now. I will continue checking the baselines code repository for updates. I will also keep trying out more algorithms to check for correctness and to understand usage. Thanks, OpenAI, for releasing such an incredibly valuable code base!

Paper Notes: Learning to Teach

My overview of the "Learning to Teach" pipeline, using their example of classifying MNIST images. The pipeline first samples a minibatch of data from MNIST, and passes it through the student network to obtain statistics such as the predicted class probabilities, the loss function value, and so on. No training is done yet. The student architecture, incidentally, is a fully connected 784-500-10 network. Then, these predictions, along with other meta-data (e.g., training iteration number, one-hot vector labels, etc.) are concatenated (shown in the dashed rectangle) and passed as input to the teacher network, which determines whether to keep or reject the sample in the minibatch. The teacher's architecture is (in the case of MNIST classification) a fully connected 25-12-1 network. Only the non-rejected samples are used for the purposes of updating the student network, via Adam gradient updates. Finally, after a few updates to the student, the teacher network is adjusted using the REINFORCE policy gradient rule, with a sparse reward function based on how soon the student achieves a pre-defined accuracy threshold. Once the teacher and student have been sufficiently trained, the teacher network can then be deployed on other students --- even those with different neural network architectures and testing on different datasets --- to accelerate learning.

Sorry for the post-free month — I was consumed with submitting to ICRA 2019 for the last two months, so I am only now able to get back to my various blogging and reading goals. As usual, one way I tackle both is by writing about research papers. Hence, in this post, I’ll discuss an interesting, unique paper from ICLR 2018 succinctly titled Learning to Teach. The OpenReview link is here, where you can see the favorable reviews and other comments.

Whereas standard machine learning investigates ways to better optimize an agent attempting to attain good performance for some task (e.g., classification accuracy on images), the machine teaching problem generally assumes the agent — now called the “learner” — is running some fixed algorithm, and the teacher must figure out a way to accelerate learning. Professor Zhu at Wisconsin has a nice webpage that summarizes the state of the art.

In Learning to Teach, the authors formalize their two player setup, and propose to train the teacher agent by reinforcement learning with policy gradients (the usual REINFORCE estimator). The authors explain the teacher’s state space, action space, reward, and so on, effectively describing the teaching problem as an MDP. The formalism is clean and well-written. I’m impressed. Kudos to the authors for clarity! The key novelty here must be that the teacher is updated via optimization-based methods, rather than heuristics or rules as in prior work.

The authors propose three ways the teacher can impact the student and accelerate its learning:

• Training data. The teacher can decide which training data to provide to the student. This is curriculum learning.1
• Loss function. The teacher can design an appropriate loss for the student to optimize.
• Hypothesis space. The teacher can restrict the potential hypothesis space of the student.

These three spaces make sense. I was disappointed, though, upon realizing that Learning to Teach is only about the training data portion. So, it’s a curriculum learning paper where the teacher is a reinforcement learning agent which designs the correct data input for the student. I wish there were some material about the other two categories: the loss function and the hypothesis space, since those seem intuitively to be much harder (and more interesting!) problems. Off the top of my head, I know the domain adaptive meta learning (RSS 2018) and evolved policy gradients (NIPS 2018) papers involve changing loss functions, but it would be nice to see this in a machine teaching context.

Nonetheless, curriculum learning (or training data “scheduling”) is an important problem, and to the credit of the authors, they try a range of models and tasks for the student:

• MLP students for MNIST
• CNN students for CIFAR-10
• RNN students for text understanding (IMDB)

For the curriculum learning aspect, the teacher’s job is to filter each minibatch of data so that only a fraction of it is actually used for the student’s gradient updates. (See my figure above.) The evaluation protocol involves training the teacher and student interactively, using perhaps half of the dataset. Then, the teacher can be deployed to new students, with two variants: to students with the same or different neural network architecture. This is similar to the way the Born Again Neural Networks paper works — see my earlier blog post about it. Evaluation is based on how fast the learner achieves certain accuracy values.
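The filtering idea can be sketched in a few lines. This is a simplified illustration rather than the paper’s method: the paper’s teacher is a small fully connected network, whereas this sketch uses a linear-logistic policy, and the feature values, learning rate, and reward below are made up. All function names are hypothetical.

```python
import math
import random

def keep_probability(weights, features):
    """Probability that the teacher keeps a sample (logistic policy)."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def filter_minibatch(weights, batch, rng):
    """Sample a keep (1) or reject (0) action for every sample."""
    actions = []
    for features in batch:
        p = keep_probability(weights, features)
        actions.append(1 if rng.random() < p else 0)
    return actions

def reinforce_update(weights, batch, actions, reward, lr=0.1):
    """One REINFORCE step: w += lr * reward * grad log pi(action | features)."""
    new_weights = list(weights)
    for features, action in zip(batch, actions):
        p = keep_probability(weights, features)
        # For a Bernoulli policy with a logistic mean, the gradient of the
        # log-probability w.r.t. the weights is (action - p) * features.
        for j, f in enumerate(features):
            new_weights[j] += lr * reward * (action - p) * f
    return new_weights

rng = random.Random(0)
weights = [0.0, 0.0]
batch = [[1.0, 0.2], [1.0, 0.9], [1.0, 0.5]]   # made-up per-sample features
actions = filter_minibatch(weights, batch, rng)
kept = [x for x, a in zip(batch, actions) if a == 1]   # student trains on these
weights = reinforce_update(weights, batch, actions, reward=1.0)
```

In the paper’s setup the reward is sparse and arrives only after the student reaches an accuracy threshold; here a single scalar reward is passed in directly to keep the sketch short.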

Is this a fair protocol? I think so, and perhaps it is reflective of how teaching works in the real world. As far as I understand, for most teachers there is an initial training period before they are “deployed” on students.

I wonder, though, if we can somehow (a) evaluate the teacher while it is training, and (b) have the teacher engage in lifelong learning? As it is, the paper assumes the teacher trains and then is fixed and deployed, and hence the teacher does not progressively improve. But again, using a real-life analogy, consider the PhD advisor-student relationship. In theory, the PhD advisor knows much more and should be teaching the student, but as time goes on, the advisor should be learning something from their interaction with the student.

• The teacher features are heavily hand-tuned. For example, the authors pass in the one-hot vector label and the predicted class probabilities of each training input. This is 20 dimensions total for the two image classification tasks. It makes sense that the one-hot part isn’t as important (as judged from the appendix) but it seems like there needs to be a better way to design this. I thought the teacher would be taking in features from the input images so it could “tell” if they were close to being part of multiple classes, as is done in Hinton’s knowledge distillation paper. On the other hand, if Learning to Teach did that, the teachers would certainly not be able to generalize to different datasets.

• Policy gradients is nothing more than random search but it works here, perhaps since (a) the teacher neural network architecture size is so small and (b) the features are heavily tuned to be informative. The reward function is sparse, but again, due to a short (unspecified) time horizon, it works in the cases they try. I do not think it scales, though.

• I’m confused by these sudden spikes in some of the CIFAR-10 plots. Can the authors explain those? It makes me really suspicious. I also wish the plots were able to show some standard deviation values because we only see the average over 5 trials. Nonetheless, the figures certainly show benefits to teaching. The gap may additionally be surprising due to the small teacher network and the fact that datasets like MNIST are simple enough that, intuitively, teaching might not be necessary.

Overall, I find the paper to be generally novel in terms of the formalism and teacher actions, which makes up for perhaps some simplistic experimental setups (e.g., simple teacher, using MNIST and CIFAR-10, only focusing on data scheduling) and lack of theory. But hey, papers can’t do everything, and it’s above the bar for ICLR.

I am excited to see what research will build upon this. Some other papers on my never-ending TODO list:

• Iterative Machine Teaching (ICML 2017)
• Towards Black-box Iterative Machine Teaching (ICML 2018)
• Learning to Teach with Dynamic Loss Functions (NIPS 2018)

1. Note that in the standard reference to curriculum learning (Bengio et al., ICML 2009), the data scheduling was clearly done via heuristics. For instance, that paper had a shape recognition task, where the shapes were divided into easy and hard shapes. The curriculum was quite simple: train on easy shapes, then after a certain epoch, train on the hard ones.

Ever since I started using TensorFlow in late 2016, I’ve been a happy user of the software. Yes, the word “happy” is deliberate and not a typo. While I’m aware that it’s fashionable in certain social circles to crap on TensorFlow, to me, it’s a great piece of software that tackles an important problem, and is undoubtedly worth the time to understand in detail. Today, I did just that by addressing one of my serious knowledge gaps of TensorFlow: how to save and load models. To put this in perspective, here’s how I used to do it:

• Count the number of parameters in my Deep Neural Network and create a placeholder vector for it.
• Fetch the parameters (e.g., using tf.trainable_variables()) in a list.
• Iterate through the parameters, flatten them, and “assign” them into the vector placeholder via tf.assign by careful indexing.
• Run a session on the vector placeholder, and save the result in a numpy file.

Ouch. I’m embarrassed by my code. It was originally based on John Schulman’s TRPO code, but I think he did that to facilitate the Fisher-Vector products as part of the algorithm, rather than to save and load weights.

Fortunately, I have matured. I now know that it is standard practice to save and load using tf.train.Saver(). By looking at the TensorFlow documentation and various blog posts — one aspect where TensorFlow absolutely shines compared to other Deep learning software — I realized that such savers could save weights and meta-data into checkpoint files. As of TensorFlow 1.8.0, they are structured like this:

name.data-00000-of-00001
name.index
name.meta


where name is what we choose. We have data representing the actual weights, index representing the connection between variable names and values (like a dictionary), and meta representing various properties of the computational graph. Then, by reconstructing (i.e., re-running) code that builds the same network, it’s easy to get the same network running.

But then my main thought was: is it possible to just load a network in a new Python script without having to call any neural network construction code? Suppose I trained a really Deep Neural Network and saved the model into checkpoints. (Nowadays, this would be hundreds of layers, so it’s impractical with the tools I have access to, but never mind.) How would I load it in a new script and deploy it, without having to painstakingly reconstruct the network? And by “reconstruction” I specifically mean having to re-define the same variables (the names must match!!) and building the same neural network in the same exact layer order, etc.

The solution is to first use tf.train.import_meta_graph. Then, to fetch the desired placeholders and operations, it is necessary to call get_tensor_by_name from a TensorFlow graph.

I have written a proof of concept of the above high-level description in my aptly-named “TensorFlow practice” GitHub code repository. The goal is to train on (you guessed it) MNIST, save the model after each epoch, then load it in a separate Python script, and check that each model gets exactly the same test-time performance. (And it should be exact, since there’s no stochasticity.) As a bonus, we’ll learn how to use tf.contrib.slim, one of the many convenience wrapper libraries around stock TensorFlow to make it easier to design and build Deep Neural Networks.

In my training code, I use the keras convenience method for loading in MNIST. As usual, I check the shapes of the training and testing data (and labels):

(60000, 28, 28) float64 # x_train
(60000,) uint8          # y_train
(10000, 28, 28) float64 # x_test
(10000,) uint8          # y_test


Whew, the usual sanity check passed.

Next, I use tf.slim to build a simple Convolutional Neural Network. Before training, I always like to print the state of the tensors after each layer, to ensure that the sizing and dimensions make sense. The resulting printout is here, where each line indicates the value of a tensor after a layer has been applied:

Tensor("images:0", shape=(?, 28, 28, 1), dtype=float32)
Tensor("Conv/Relu:0", shape=(?, 28, 28, 16), dtype=float32)
Tensor("MaxPool2D/MaxPool:0", shape=(?, 14, 14, 16), dtype=float32)
Tensor("Conv_1/Relu:0", shape=(?, 14, 14, 16), dtype=float32)
Tensor("MaxPool2D_1/MaxPool:0", shape=(?, 7, 7, 16), dtype=float32)
Tensor("Flatten/flatten/Reshape:0", shape=(?, 784), dtype=float32)
Tensor("fully_connected/Relu:0", shape=(?, 100), dtype=float32)
Tensor("fully_connected_1/Relu:0", shape=(?, 100), dtype=float32)


For example, the inputs are each 28x28 images. Then, by passing them through a convolutional layer with 16 filters and with padding set to “same”, we get an output that’s also 28x28 in the first two axes (ignoring the batch size axis) but which has 16 as the number of channels. Again, this makes sense.
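The rest of the printout can be checked the same way. Here is a small sketch that recomputes the spatial sizes, assuming stride-1 convolutions with “same” padding and 2x2 max pooling with stride 2 (consistent with the halving seen in the printout, but not stated explicitly in it). The helper names same_conv and max_pool are made up.

```python
import math

def same_conv(size, stride=1):
    """Output size of a conv layer with 'same' padding: ceil(size / stride)."""
    return math.ceil(size / stride)

def max_pool(size, pool=2):
    """Output size of a non-overlapping pool (stride equal to the pool size)."""
    return size // pool

size = 28
size = max_pool(same_conv(size))   # 28 -> 28 -> 14
size = max_pool(same_conv(size))   # 14 -> 14 -> 7
flat = size * size * 16            # 16 channels from the last conv layer
print(size, flat)
```

Under these assumptions the flattened size is 7 x 7 x 16 = 784, which matches the Flatten line in the printout.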

During training, I get the following output, where I evaluate on the full test set after each epoch:

epoch, test_accuracy, test_loss
0, 0.065, 2.30308
1, 0.908, 0.31122
2, 0.936, 0.20877
3, 0.953, 0.15362
4, 0.961, 0.12030
5, 0.967, 0.10056
6, 0.972, 0.08706
7, 0.975, 0.07774
8, 0.977, 0.07102
9, 0.979, 0.06605


At the beginning, the test accuracy is just 0.065, which isn’t far from random guessing (0.1) since no training was applied. Then, after just one pass through the training data, accuracy is already over 90 percent. This is expected with MNIST; if anything, my learning rate was probably too small. Eventually, I get close to 98 percent.

More importantly for the purposes of this blog post, after each epoch ep, I save the model using:

I now have all these saved models:

total 12M
-rw-rw-r-- 1 daniel daniel   71 Aug 17 17:07 checkpoint
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:06 epoch-0.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:06 epoch-0.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:06 epoch-0.meta
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:06 epoch-1.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:06 epoch-1.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:06 epoch-1.meta
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:06 epoch-2.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:06 epoch-2.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:06 epoch-2.meta
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:06 epoch-3.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:06 epoch-3.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:06 epoch-3.meta
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:06 epoch-4.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:06 epoch-4.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:06 epoch-4.meta
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:06 epoch-5.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:06 epoch-5.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:06 epoch-5.meta
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:06 epoch-6.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:06 epoch-6.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:06 epoch-6.meta
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:06 epoch-7.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:06 epoch-7.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:06 epoch-7.meta
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:06 epoch-8.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:06 epoch-8.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:06 epoch-8.meta
-rw-rw-r-- 1 daniel daniel 1.1M Aug 17 17:07 epoch-9.data-00000-of-00001
-rw-rw-r-- 1 daniel daniel 1.2K Aug 17 17:07 epoch-9.index
-rw-rw-r-- 1 daniel daniel  95K Aug 17 17:07 epoch-9.meta


In my loading/deployment code, I call this relevant code snippet for each epoch:

Next, we need to get references to placeholders and operations. Fortunately we can do precisely that using:

Note that these names match the names I assigned during my training code, except that I append an extra :0 at the end of each name. The importance of getting names right is why I will start carefully naming TensorFlow variables in my future code.

After using these same placeholders and operations, I get the following test-time output:

1, 0.908, 0.31122
2, 0.936, 0.20877
3, 0.953, 0.15362
4, 0.961, 0.12030
5, 0.967, 0.10056
6, 0.972, 0.08706
7, 0.975, 0.07774
8, 0.977, 0.07102
9, 0.979, 0.06605


(I skipped over epoch 0, as I didn’t save that model.)

Whew. The above accuracy and loss values exactly match. And thus, we now know how to load and use stored TensorFlow checkpoints without having to reconstruct the entire training graph. Achievement unlocked.

Presenting to AI4ALL

I, trying to inspire some high-schoolers with the Toyota HSR.

Last Friday (my 26th birthday, actually) I had the opportunity to give a brief demonstration of our Toyota Human Support Robot (HSR) as part of the AI4ALL program at UC Berkeley. I provided some introductory remarks about the HSR, which is a home robot developed by Toyota with the goal of assisting the elderly in Japan and elsewhere. I then demonstrated the HSR in action by showing how it could reach to a grasp pose on the “bed-making” setup shown in the picture above, and then pull the sheet to a target. (Our robot has some issues with its camera perception, so the demonstration wasn’t as complete as I would have liked it to be, but I hope I still managed to inspire some of the kids.)

I then discussed some of the practical knowledge that I’ve learned over the last year when dealing with physical robots, such as: (1) robots will break down, (2) robots will break down, and (3) robots will break down. Finally, I answered any questions that the kids had, and allowed a few volunteers to play around with the joystick to teleoperate the robot.

Some context: the Berkeley-specific AI4ALL session is a five-day program, from 8:00am to 5:00pm, at the UC Berkeley campus, and is designed to introduce Artificial Intelligence to socioeconomically disadvantaged kids (e.g., those who qualify for free school lunch) in the 9th and 10th grade. Attendance to AI4ALL is free, and admission is based on math ability. I am not part of the official committee for AI4ALL so I don’t know much beyond what is listed on their website. Last week was the second instance of AI4ALL, continuing last year’s trend, and involved about 25 high-school students.

AI4ALL isn’t just a Berkeley thing; there are also versions of it at Stanford, CMU, and other top universities. I skimmed the program websites, and the Stanford version is a 3-week residential program, so it probably has slightly more to offer. Still, I hope we at Berkeley were at least able to inspire some of the next generation of potential AI employees, researchers, and entrepreneurs.

On a more personal note, this was my first time giving a real robot demonstration to an audience that didn’t consist of research collaborators. I enjoyed this, and hope to do more in the coming years. These are the kind of things one just cannot do with theoretical or simulator-based research.

Max Welling's "Intelligence Per Kilowatt-Hour"

I recently took the time to watch Max Welling’s excellent and thought-provoking ICML keynote. You can view part 1 and part 2 on YouTube. The video quality is low, but at least the text on the slides is readable. I don’t think slides are accessible anywhere; I don’t see them on his Amsterdam website.

As you can tell from his biography, Welling comes from a physics background and spent undergraduate and graduate school in Amsterdam studying under a Nobel Laureate, and this background is reflected in the talk.

I will get to the main points of the keynote, but the main reason why I for once managed to watch a keynote talk (rather than partake in the usual “Oh, I’ll watch it later when I have time …” and then forget about it1) is that I wanted to test out a new pair of hearing aids and the microphone that came with them. I am testing the ReSound ENZO 3D hearing aids, along with the accompanying ReSound Multi Mic.

That microphone will use a 3.5mm mini jack cable to connect to an audio source, such as my laptop. Then, with an app through my iPhone, I can switch my hearing aid’s mode to “Stream,” meaning that the sound from my laptop or audio source, which is connected to the Multi Mic, goes directly into my hearing aids. In other words, it’s like a wireless headphone. I have long wanted to test out something like this, but never had the chance to do so until the appropriate technology came for the kind of hearing aid power I need.

The one downside, I suppose, is that if I were to listen to music while I work, there wouldn’t be any headphones visible (either wired or wireless) as would be the case with other students. This means someone looking at me might try to talk to me, and think I am ignoring him or her if I do not respond because I only hear the sound streaming through the microphone. I will need to plan this out if I end up getting this microphone.

But anyway, back to the keynote. Welling titled the talk as “Intelligence Per Kilowatt-Hour” and pointed out early that this could also be expressed as the following equation:

Free Energy = Energy - Entropy

After some high-level physics comments, such as connecting gravity, entropy, and the second law of thermodynamics, Welling moved on to discuss more familiar2 territory to me: Bayes’ Rule, which we should all know by now. In his notation:

Clearly, there’s nothing surprising here.

He then brought up Geoffrey Hinton as the last of his heroes in the introductory part of the talk, along with two papers:

• Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights (1993)
• A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants (1998)

I am aware of these papers, but I just cannot find time to read them in detail. Hopefully I will, someday. Hey, if I can watch Welling’s keynote and blog about it, then I can probably find time to read a paper.

Probably an important, relevant bound to know is:

where the step from an equality to a lower bound arises because we drop a KL divergence term, which is always non-negative. The right-hand side of the final line can be reinterpreted as negative energy plus entropy.
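For reference, here is the bound written in standard variational-inference notation. The symbols are my own assumption and may differ from what was on Welling's slides: $x$ is the observed data, $z$ the latent variables, and $q(z)$ any approximating distribution.

$$
\begin{aligned}
\log p(x) &= \mathbb{E}_{q(z)}\left[\log \frac{p(x,z)}{q(z)}\right] + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) \\
&\ge \mathbb{E}_{q(z)}\big[\log p(x,z)\big] + \mathbb{H}\big[q(z)\big]
\end{aligned}
$$

If we define the energy as $E(z) = -\log p(x,z)$, the last line is the negative expected energy plus the entropy of $q$, which is the same shape as the free energy expression from earlier in the talk.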

In the context of discussing the above math, Welling talked about intractable distributions, a thorn in the side of many statisticians and machine learning practitioners. Thus, he discussed two broad classes of techniques to approximate intractable distributions: MCMC and Variational methods. The good news is that I understood this because John Canny and I wrote a blog post about this last year on the Berkeley AI Research Blog3.

Welling began with his seminal work: Stochastic Gradient Langevin Dynamics, which gives us a way to use minibatches for large-scale MCMC. I won’t belabor the details of this, since I wrote a blog post (on this blog!) two years ago about this very concept. Here’s the relevant update equation, reproduced for completeness:

$$\Delta \theta_t = \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^n \nabla \log p(x_{ti} \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon_t)$$

where $N$ is the dataset size, $n$ is the minibatch size, and we need $\epsilon_t$ to vary and decrease towards zero, among other technical requirements. Incidentally, I like how he says: “sample from the true posterior.” This is what I say in my talks.
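To make the SGLD update concrete, here is a minimal sketch on a toy problem. Everything specific in it is my own assumption for illustration and not from the talk: the model (unit-variance Gaussian likelihood with a wide Gaussian prior on the mean), the step-size schedule, and the names `sgld_step` and `run_sgld_demo`. Each step takes half a step-size times a rescaled minibatch gradient of the log posterior, then adds Gaussian noise whose variance equals the step size.

```python
import math
import random


def sgld_step(theta, minibatch, n_total, eps, rng):
    """One SGLD update for a scalar mean parameter theta.

    Toy model (an assumption for illustration): x_i ~ N(theta, 1) with
    prior theta ~ N(0, 10^2).
    """
    grad_log_prior = -theta / 100.0
    grad_log_lik = sum(x - theta for x in minibatch)
    # Rescale the minibatch gradient so it estimates the full-data gradient.
    grad = grad_log_prior + (n_total / len(minibatch)) * grad_log_lik
    # Injected Gaussian noise has variance eps (standard deviation sqrt(eps)).
    noise = rng.gauss(0.0, math.sqrt(eps))
    return theta + 0.5 * eps * grad + noise


def run_sgld_demo(n_data=100, n_steps=5000, batch_size=10, seed=0):
    rng = random.Random(seed)
    data = [rng.gauss(2.0, 1.0) for _ in range(n_data)]
    # Exact posterior mean for this conjugate toy model, for comparison.
    posterior_mean = sum(data) / (n_data + 1.0 / 100.0)
    theta, samples = 0.0, []
    for t in range(1, n_steps + 1):
        eps = 0.01 * t ** -0.55  # step size decays towards zero
        minibatch = rng.sample(data, batch_size)
        theta = sgld_step(theta, minibatch, n_data, eps, rng)
        samples.append(theta)
    return samples, posterior_mean
```

Averaging the later entries of `samples` should land near `posterior_mean`, which is a cheap sanity check that the chain is hovering around the right region.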

Afterwards, he discussed some of the great work that he has done in Variational Bayesian Learning. I’m most aware of him and his student, Durk Kingma, introducing Variational Autoencoders for generative modeling. That paper also popularized what’s known as the reparameterization trick in statistics. In Welling’s notation,

I will discuss this in more detail, I swear. I have a blog post about the math here but it’s languished in my drafts folder for months.
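In the meantime, here is a minimal sketch of the trick for a one-dimensional Gaussian. The function names and the test objective $\mathbb{E}[z^2]$ are my own choices for illustration, not Welling's or Kingma's. The idea is to write the sample as a deterministic function of the parameters and parameter-free noise, so that we can differentiate through the sample itself.

```python
import math
import random


def reparameterize(mu, log_sigma, eps):
    """Map noise eps ~ N(0, 1) to a sample z ~ N(mu, sigma^2)."""
    return mu + math.exp(log_sigma) * eps


def pathwise_grad_mu(mu, log_sigma, n_samples, rng):
    """Estimate d/dmu E[z^2] by differentiating through the sample.

    Since z = mu + sigma * eps, dz/dmu = 1 and d(z^2)/dmu = 2 * z.
    The exact answer is 2 * mu because E[z^2] = mu^2 + sigma^2.
    """
    total = 0.0
    for _ in range(n_samples):
        z = reparameterize(mu, log_sigma, rng.gauss(0.0, 1.0))
        total += 2.0 * z
    return total / n_samples
```

Calling `pathwise_grad_mu(1.5, 0.0, 20000, random.Random(0))` should give a value close to the exact gradient of 3.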

In addition to the general math overview, Welling discussed:

• How the reparameterization trick helps to decrease variance in REINFORCE. I’m not totally sure about the details, but again, I’ll have to mention it in my perpetually-in-progress draft blog post previously mentioned.
• The local reparameterization trick. I see. What’s next, the tiny reparameterization trick?
• That we need to make Deep Learning more efficient. Right now, our path is not sustainable. That’s a solid argument; Google can’t keep putting this much energy into AI projects forever. To do this, we can remove parameters or quantize them. For the latter, this is like reducing them from float32 to int, to cut down on memory usage. At the extreme, we can use binary neural networks.
• Welling also mentioned that AI will move to the edge. This means moving from servers with massive computational power to everyday smart devices with lower compute and power. In fact, his example was smart hearing aids, which I found amusing since, as you know, the main motivation for me watching this video was precisely to test out a new pair of hearing aids! I don’t think there is AI in the ReSound ENZO 3D.
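To illustrate the quantization point in the list above, here is a minimal sketch of symmetric 8-bit quantization and of binarization. This is a generic textbook-style scheme that I am assuming for illustration; it is not the specific method from Welling's papers, and plain Python floats stand in for float32 weights.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in quantized]


def binarize(weights):
    """Extreme case: keep only the sign of each weight."""
    return [1 if w >= 0 else -1 for w in weights]
```

Each integer code needs one byte instead of four, and the reconstruction error per weight is at most about half of `scale`.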

The last point above about AI moving to the edge is what motivates the title of the talk. Since we are compute- and resource-constrained on the edge, it is necessary to extract the benefits of AI efficiently, hence AI per kilowatt hour.

Towards the end of the talk, Welling brought up more recent work on Bayesian Deep Learning for model compression, including:

• Probabilistic Binary Networks
• Differentiable Quantization
• Spiking Neural Networks

These look like some impressive bits of research, especially spiking neural networks because the name sounds cool. I wish I had time to read these papers and blog about them, but Welling gave juuuuuuust enough information that I think I can give one or two sentence explanations of the research contribution.

Welling concluded with a few semi-serious comments, such as inquiring about the question of life (OK, seriously?), and then … oh yeah, that Qualcomm AI is hiring (OK, seriously again?).

Well, advertising aside — which to be fair, lots of professors do in their talks if they’re part of an industrial AI lab — the talk was thought-provoking to me because it forced me to consider energy-efficiency if we are going to make further progress in AI and to also ensure that we can maximally extract AI utility in compute-limited devices. These are things worth thinking about at a high level for our current and future AI projects.

1. To be fair, this happens all the time when I try to write lengthy blog posts, but then realize I will never have the energy to fix up the post to make it acceptable for the wider world.

2. I am trying to self-study physics. Unfortunately, it is proceeding at a snail’s pace.

3. John Canny also comes from a theoretical physics background, so I bet he would like Welling’s talk.

Pre-Conference Logistics Checklist

I am finally attending more conferences and by now it’s become clear that I need a more formal checklist for future conferences, since there were many things I should have done further in advance. Hopefully this checklist will serve me well for future events.

• Start planning for travel (plane tickets, hotels, etc.) no later than the point when I know I am attending for sure. The conferences I would attend mandate that someone from the author list of an accepted paper has to attend in order for the paper to appear in the “proceedings.” It is very likely, though, that authors can tell if their work will be accepted before the actual decision gets sent to them 2-4 months before the conference. This is especially true for conferences that offer multiple rounds of reviews, since the scores from the first set of reviews usually remain the same even after rebuttals. Thus, any planning of any sort should start before the final paper acceptance decision.

• Email other Berkeley students or those who I know about the possibilities of having joint activities or a group hotel reservation. I would rather not miss out on any gatherings among awesome people (i.e., Berkeley students). For this, it’s helpful to know if or when the conference offers lunches and dinners. If the conference is smaller or less popular among Berkeley people, ask the organizers to add it to ConferenceShare and search there.

• Normally, in order to get better rates, we book hotel rooms through the conference website, or through a related source (e.g., the “Federated AI Meeting” portal for ICML/IJCAI/etc.) which is not the official hotel website. Be extremely careful, however, if trying to upgrade or adjust the room. I nearly got burned by this at IJCAI because I think an external source canceled one of my original hotel reservations after I had upgraded the room by emailing the hotel directly. Lesson: always, always ask for confirmation of my room, and do this multiple times, spaced within a few weeks to avoid angry hotel receptionists.

• Regarding academic accommodation (e.g., sign language interpreter or captioning), first figure out what I am doing. This itself is nontrivial, owing to the tradeoffs among different techniques and considering the conference location. Then, draft a “Conference XYZ Planning Logistics for [Insert Dates Here]” and email the details to Berkeley’s DSP. Email them repeatedly, at least every three days, demanding at minimum a confirmation that they have read my email and are doing something. I apologize in advance to Berkeley’s DSP for clogging up their inboxes.

• If the accommodations will involve additional people attending the venue, which is nearly always the case, then get them in touch with the conference organizers so that they can get badges with name tags, and to ask about any potentially relevant information (e.g., space limitations in certain social events).

• One thing I need to start doing is contacting the conference venue about the services they offer. For instance, many venues nowadays offer services such as hearing loops or captioning, which could augment or mix with those from Berkeley’s DSP. It’s also important to get a sense of how easily the lights or speakers can be adjusted in rooms. IJCAI was held at Stockholmsmässan, and the good news is that in the main lecture hall, it was straightforward for an IT employee to adjust the lighting to make the sign language interpreters visible (the room gets dark when speakers present), and to provide them with special microphones.

• Attire: don’t bring two business suits. One is enough, if I want to bring one at all. Two simply takes too much space in a small suitcase, and there’s no way I’m risking checked-in luggage. Always bring an equal amount of undershirts which double as workout clothes, and make sure the hotel I’m in actually has a fitness center! Finally, bring two pairs of shoes: one for walking, one for running.

I likely won’t be attending conferences until the spring or summer of 2019, so best to jot these items down before forgetting them.

Quick Overview of the 27th International Joint Conference on Artificial Intelligence (IJCAI), with Photos

I recently attended the 27th International Joint Conference on Artificial Intelligence (IJCAI) in Stockholm, Sweden. Unlike what I did for UAI 2017 and ICRA 2018, I won’t be able to write daily blog posts about the conference. In part, this is because I got struck by some weird flu-like and fever symptoms on my flight to Sweden, which sapped my energy. I thought I was clever when my original plan was to rig my sleep schedule so that I’d skip sleep during my last night in the United States, and then get a full 8-hours’ worth of sleep on the flight to Stockholm, upon which I’d arrive at 7:00am (in Stockholm time) feeling refreshed. Unfortunately, my lack of sleep probably exacerbated the unexpected illness, so that plan went out of whack.

On a more positive note, here were some of the highlights of IJCAI. I’ll split this into three main sections, followed by some concluding comments and the photos I took.

Keynote Talks (a.k.a., “Invited Talks”):

• I enjoyed Yann LeCun’s keynote on Deep Learning. Because it’s Yann LeCun and Deep Learning.
• Jean-François Bonnefon gave a thought-provoking talk about The Moral Machine Experiment. Think of what happens with a self-driving car. Suppose it’s in a situation where people are blocking the car’s way, but it’s going too fast to stop. Either it continues going forward (and kills the people in front of it) or it swerves and hits a nearby wall (and kills the passengers). What characteristics of the passengers or pedestrians would determine which of the two groups we would choose to sacrifice?
• There was also a great talk about “Markets Without Money” by Nicole Immorlica of Microsoft Research. It was a high-level talk discussing some of the theoretical work tying together economics and computer science, about some of the markets we’re engaged in, but which aren’t money-centric. I was reminded of many related books that I’ve been reading about platforms.
• On the last day there were four invited talks: Andrew Barto for a career achievement award (postponed one year as he was supposed to give it last year), Jitendra Malik for this year’s career achievement award, Milind Tambe for the John McCarthy award, and Stefano Ermon for the IJCAI Computers and Thought award. I enjoyed all the talks.

Workshops, Tutorials, Various Sessions, and Other Stuff:

• IJCAI had several parallel workshops and tutorials on the first three days, which were co-located with ICML and other AI conferences. I only attended the last day since I had to recover from my illness, and on that day, I attended tutorials about AI and the Law and Predicting Human Decision-Making. They were interesting, though I admit that it was hard to focus after two or three hours. The one on human decision-making, as I predicted, brought up Daniel Kahneman’s 2011 magnum opus Thinking, Fast and Slow, one of my favorite books of all time.
• There was a crowd-captivating robotics performance during the opening remarks before Yann LeCun’s keynote; a mini-robot and a human actor moved alongside while performing a slow dance-like motion. See my photos — it was entertaining! I’m not sure how the robot was able to conduct such dexterous movements.
• On the penultimate day, there were back-to-back sessions about AI in Europe. The first featured a lively panel of seven Europeans who opined about Europe’s strengths and weaknesses in AI, and what it can do in the future. Several common themes stood out: the need to prevent “brain drain” to the United States and the need for more investment in AI. Notably, the panel’s only mention of Donald Trump was when Max Tegmark (himself one of Europe’s “brain drains!”) criticized America’s leadership and called for Europe to resist Trump when needed. The second session was about Europe’s current AI strategy and consisted of four consecutive talks. They were a bit dry and the speakers spoke with a flat tone while reading directly from text on the slides, so I left early to attend the student reception.
• Unique to IJCAI, it hosts a “Sister Track for Best Papers.” Authors of papers that won some kind of award in other AI conferences are invited to speak about their work in IJCAI. This is actually why I was there, due to my UAI paper, so I’m grateful that IJCAI includes such sessions.
• IJCAI also has early career talks. For these, pre-tenure faculty come and give 25-minute talks on their research. This was the most common session I attended (not including keynotes), because faculty tend to be better at giving presentations than graduate students, who make up the bulk of the speakers in most sessions.

Our Social Events:

• A visit to Skansen, a large open museum, where we could see what a Swedish village might have looked like many years ago. It is a pleasantly surprising mix of a zoo, a hodgepodge of old architecture, and employees acting as ancient Swedish citizens. The Skansen event was joint with ICML and the other co-located conferences, so I saw several Berkeley people (ICML is more popular among us). The food was mediocre, but this was countered by an insane amount of Champagne. Sadly, I was still recovering from my illness and couldn’t drink any.
• A reception at City Hall, which is most famous for hosting the annual Nobel Prize ceremony. The city of Stockholm actually paid for us, so it wasn’t included as part of any IJCAI registration fees. (They must really want AI researchers to like Sweden!) The bad news? There were over 2000 IJCAI registrants, but City Hall has a strict capacity of 1200, so the only people who could get in were those that skipped the last session of talks that day. I felt this was unfair to those speakers, and I hope if similar scenarios happen in future iterations, IJCAI can hold a lottery and offer a second banquet for those who don’t get in the first one.
• A conference banquet, held at the Vasa Museum near Skansen. I was excited about attending, but alas, it was closed to students. This was unclear from the conference website and program, and judging from what others said on the Whova app for the conference, I wasn’t the only one confused. That this was not open to students caused one of the faculty attending to boycott the dinner, according to his comments on Whova.
• To make up for that (I suppose?), there was a student reception the next day, open to students only (obviously). As usual, the wine and beer were great, though the food itself, served cocktail-style, was short of what would qualify as a full dinner. There was a minor steak dish, along with some green soup for vegetarians.
• A closing reception, at the very end of the conference. It was in the conference venue and offered the usual wine, beer, non-alcoholic drinks, and some small food dishes. There wasn’t much variety in the food offering.

Since I blogged about ICRA 2018 at length, I suppose it’s inevitable to make a (necessarily incomplete) comparison between the two.

In terms of food, ICRA completely outclasses IJCAI. ICRA included lunch (whereas IJCAI didn’t) and the ICRA evening receptions all had far richer food offerings than IJCAI’s. The coffee breaks for ICRA also had better food and drink, including free lattes (ah, the memories…). It looks like the higher registration fees we pay for ICRA are reflected in the proportionately better food and drink quality. The exception may be the alcoholic beverages; the offerings from ICRA and IJCAI seemed to be comparable, though I’ll add that I still haven’t developed the ability to tell great wine from really, really great wine.

ICRA also has a better schedule in that poster sessions were clearly labeled in the schedule, whereas IJCAI’s weren’t explicitly scheduled, meaning that they were technically “all day”. Finally, I think the venue for ICRA (the Brisbane Convention & Exhibition Centre) is better designed than Stockholmsmässan, and furthermore, there’s more interesting stuff within walking distance of Brisbane’s convention centre. (To mitigate this, IJCAI wisely offered a public transportation card for all attendees.)

That’s not to say ICRA was superior in every metric — far from it! The main advantage of IJCAI is probably that we get to see more of the city itself, as I mention in the social events above.

Here are the photos I took while I was at IJCAI, which I’ve hosted in my Flickr account. (For future conferences I will probably host pictures on Flickr, since I’ve used up a dangerously high amount of my memory allocation for hosting on GitHub.) I hope you enjoy the photos!

Actor-Critic Methods: A3C and A2C

Actor-critic methods are a popular class of deep reinforcement learning algorithms, and having a solid foundation in them is critical to understanding the current research frontier. The term “actor-critic” is best thought of as a framework or a class of algorithms satisfying the criterion that there exist parameterized actors and critics. The actor is the policy $\pi_{\theta}(a \mid s)$ with parameters $\theta$ which conducts actions in an environment. The critic computes value functions to assist the actor in learning. These are usually the state value, state-action value, or advantage value, denoted as $V(s)$, $Q(s,a)$, and $A(s,a)$, respectively.

I suggest that the most basic actor-critic method (beyond the tabular case) is vanilla policy gradients with a learned baseline function.1 Here’s an overview of this algorithm:

The basic vanilla policy gradients algorithm. Credit: John Schulman.

I also reviewed policy gradients in an older blog post, so I won’t repeat the details.2 I used expected values in that post, but in practical implementations, you’ll just take the saved rollouts to approximate the expectation as in the image above. The main point to understand here is that an unbiased estimate of the policy gradient can be obtained without the learned baseline $b(s_t)$ (or more formally, $b_{\theta_C}(s_t)$ for parameters $\theta_C$) by just using $R_t$, but this estimate performs poorly in practice. Hence, people virtually always apply a baseline.

My last statement is somewhat misleading. Yes, people apply learned baseline functions, but I would argue that the more important thing is to ditch vanilla policy gradients altogether and use a more sophisticated framework of actor-critic methods, called A3C and popularized by the corresponding DeepMind ICML 2016 paper.3 In fact, when people refer to “actor-critic” nowadays, I think this paper is often the associated reference, and one can probably view it as the largest or most popular subset of actor-critic methods. This is despite the fact that the popular DDPG algorithm is also an actor-critic method, perhaps because it is more commonly thought of as the continuous control analogue of DQN, which isn’t actor-critic as the critic (Q-network) suffices to determine the policy; just pick the action with the highest Q-value.

A3C stands for Asynchronous Advantage Actor Critic. At a high level, here’s what the name means:

• Asynchronous: because the algorithm involves executing a set of environments in parallel (ideally, on different cores4 in a CPU) to increase the diversity of training data, and with gradient updates performed in a Hogwild! style procedure. No experience replay is needed, though one could add it if desired (this is precisely the ACER algorithm).

• Advantage: because the policy gradient updates are done using the advantage function; DeepMind specifically used $n$-step returns.

• Actor: because this is an actor-critic method which involves a policy that updates with the help of learned state-value functions.

You can see what the algorithm looks like mathematically in the paper and in numerous blog posts online. For me, a visual diagram helps. Here’s what I came up with:

My visualization of how A3C works.

A few points:

• I’m using 16 environments in parallel, since that’s what DeepMind used. I suppose I could use close to this in modern machines since many CPUs have four or six cores, and with hyperthreading we get double that. Of course, it might be easier to simply use Amazon Web Services … and incidentally, no GPU is needed.

• I share the value function and the policy in the same way DeepMind did, but for generality I keep the gradient updates separate for $\theta$ (the policy) and $\theta_v$ (the value function) and have respective learning rates $\alpha$ and $\alpha_v$. In TensorFlow code, I would watch out for the variables that my optimizers update.

• The policy has an extra entropy bonus regularizer that is embedded in the $d\theta$ term to encourage exploration.

• The updates are done in Hogwild! fashion, though nothing I drew in the figure above actually shows that, since it assumes that different threads reached their “update point” at different times and update separately. Hogwild! would apply when two or more threads call a gradient update to the shared parameter simultaneously, raising the possibility of one thread overwriting another. This shouldn’t happen too often, since there’s only 16 threads — my intuition is that it’d be a lot worse with orders of magnitude more threads — but the point is even if they do, things should be fine in the long run.

• The advantage is computed using $n$-step returns with something known as the forward view, rather than the backward view, as done with eligibility traces. If you are unfamiliar with “eligibility traces” then I recommend reading Sutton and Barto’s online reinforcement learning textbook.

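To be clear about this, for some interval where the thread’s agent takes steps, we get rewards $r_1, r_2, \ldots, r_k$, where upon reaching the $k$-th step, the agent stopped, either due to reaching a terminal state or because it’s reached the human-designated maximum number of steps before an update. Then, for the advantage estimate, we go backwards in time to accumulate the discounted reward component. For the last time step, we’d get

$$A(s_k, a_k) = \Big(r_k + \gamma V(s_{k+1};\theta_v)\Big) - V(s_k;\theta_v)$$

for the penultimate step, we’d get:

$$A(s_{k-1}, a_{k-1}) = \Big(r_{k-1} + \gamma r_k + \gamma^2 V(s_{k+1};\theta_v)\Big) - V(s_{k-1};\theta_v)$$

then the next:

$$A(s_{k-2}, a_{k-2}) = \Big(r_{k-2} + \gamma r_{k-1} + \gamma^2 r_k + \gamma^3 V(s_{k+1};\theta_v)\Big) - V(s_{k-2};\theta_v)$$

and so on. See the pseudocode in the A3C paper if this is not clear.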

The rewards were already determined from executing trajectories in the environment, and by summing them this way, we get the empirical advantage estimate. The value function which gets subtracted has subscripts that match the advantage (because $A(s,a) = Q(s,a)-V(s)$), but not the value function used for the $n$-step return. Incidentally, that value will often be zero, and it should be zero if this trajectory ended due to a terminal state.
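Here is a minimal sketch of that backwards accumulation. The function name and argument layout are my own assumptions, not taken from the A3C paper or any particular codebase: `rewards[i]` and `values[i]` are the reward and the critic's estimate for the $i$-th visited state, and `bootstrap_value` is the critic's estimate for the state after the last step (zero if the segment ended in a terminal state).

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Compute forward-view n-step advantage estimates for one segment.

    Walks backwards through the segment, accumulating the discounted
    return, and subtracts the value estimate of the matching state.
    """
    advantages = [0.0] * len(rewards)
    ret = bootstrap_value
    for i in reversed(range(len(rewards))):
        ret = rewards[i] + gamma * ret  # discounted return from step i
        advantages[i] = ret - values[i]
    return advantages
```

Note that the last step in the segment gets a one-step estimate, the one before it a two-step estimate, and so on, so the earliest step uses the longest return.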

Now let’s talk about A2C: Advantage Actor Critic. Given the name (A2C vs A3C) why am I discussing A2C after A3C if it seems like it might be simpler? Ah, it turns out that (from OpenAI):

After reading the paper, AI researchers wondered whether the asynchrony led to improved performance (e.g. “perhaps the added noise would provide some regularization or exploration?”), or if it was just an implementation detail that allowed for faster training with a CPU-based implementation.

As an alternative to the asynchronous implementation, researchers found you can write a synchronous, deterministic implementation that waits for each actor to finish its segment of experience before performing an update, averaging over all of the actors.

Thus, think of the figure I have above, but with all 16 of the threads waiting until they all have an update to perform. Then we average gradients over the 16 threads with one update to the network(s). Indeed, this should be more effective due to larger batch sizes.
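As a rough sketch of the synchronous update, assuming each worker hands back its gradient as a flat list of floats (the function names and the plain gradient-descent step are my own simplifications, not the OpenAI baselines code):

```python
def average_gradients(per_worker_grads):
    """Average gradients elementwise across all workers."""
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [
        sum(grads[j] for grads in per_worker_grads) / n_workers
        for j in range(n_params)
    ]


def synchronous_update(params, per_worker_grads, lr):
    """Apply one shared update once every worker has reported."""
    avg = average_gradients(per_worker_grads)
    return [p - lr * g for p, g in zip(params, avg)]
```

The contrast with A3C is that there is exactly one update per round, computed from all workers at once, so no worker can overwrite another's update.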

In the OpenAI baselines repository, the A2C implementation is nicely split into four clear scripts:

• The main call, run_atari.py, in which we supply the type of policy and learning rate we want, along with the actual (Atari) environment. By default, the code sets the number of CPUs (i.e., number of environments) to 16 and then creates a vector of 16 standard gym environments, each specified by a unique integer rank. I think the rank is mostly for logging purposes, as they don’t seem to be using mpi4py for the Atari games. The environments and CPUs utilize the Python multiprocessing library.

• Building policies (policies.py), for the agent. These build the TensorFlow computational graphs and use CNNs or LSTMs as in the A3C paper.

• The actual algorithm (a2c.py), with a learn method that takes the policy function (from policies.py) as input. It uses a Model class for the overall model and a Runner class to handle the different environments executing in parallel. When the runner takes a step, this performs a step for each of the 16 environments.

• Utilities (utils.py), since helper and logger methods help make any modern DeepRL algorithm easier to implement.

The environment steps are still a CPU-bound bottleneck, though. Nonetheless, I think A2C is likely my algorithm of choice over A3C for actor-critic based methods.

Update: as of September 2018, the baselines code has been refactored. In particular, there is now an algorithm-agnostic run script that gets called, and they moved some of the policy-building (i.e., neural network building) code into the common sub-package. Despite the changes, the general structure of their A2C algorithm is consistent with what I’ve written above. Feel free to check out my other blog post which describes some of these changes in more detail.

1. It’s still unclear to me if the term “vanilla policy gradients” (which should be the same as “REINFORCE”) includes the learned value function which determines the state-dependent baseline. Different sources I’ve read say different things, in part because I think vanilla policy gradients just doesn’t work unless you add in the baseline, as in the image I showed earlier. (And even then, it’s still bad.) Fortunately, my reading references are in agreement that once you start including any sort of learned value function for reducing gradient variance, that’s a critic, and hence an actor-critic method.

2. I also noticed that on Lil’Log, there’s an excellent blog post on various policy gradient algorithms. I was going to write a post like this, but looks like Lilian Weng beat me to it. I’ve added Lil’Log to my bookmarks.

3. The A3C paper already has 726 citations as of the writing of this blog post. I wonder if it was more deserving of the ICML 2016 best paper award than the other RL DeepMind winner, Dueling Architectures? Don’t get me wrong; both papers are great, but the A3C one seems to have had more research impact, which is the whole point, right?

4. If one is using a CPU that enables hyperthreading, which is likely the case for those with modern machines, then perhaps this enables twice the number of parallel environments? I think this is the case, but I wouldn’t bet my life on it.

Papers That Have Cited Policy Distillation

About a week and a half ago, I carefully read the Policy Distillation paper from DeepMind. The algorithm is easy to understand yet surprisingly effective. The basic idea is to have student and teacher agents (typically parameterized as neural networks) acting on an environment, such as the Atari 2600 games. The teacher is already skilled at the game, but the student isn’t, and needs to learn somehow. Rather than run standard deep reinforcement learning, DeepMind showed that simply running supervised learning where the student trains its network to match a (tempered) softmax of the Q-values of the teacher is sufficient to learn how to play an Atari 2600 game. It’s surprising that this works; for one, Q-values are not even a probability distribution, so it’s not straightforward to conclude that a student trained to match the softmaxes would be able to learn a sequential decision-making task.
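To make the “match a tempered softmax” idea concrete, here is a minimal sketch of a per-state distillation loss: the KL divergence from the teacher’s tempered softmax over Q-values to the student’s softmax over its own outputs. The function names and the default temperature are my own illustrative assumptions; check the paper for the exact loss variants and hyperparameters it compares.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax of xs / temperature."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_q, student_logits, tau=0.01):
    """KL(teacher || student) for a single state.

    A small tau sharpens the teacher's distribution towards its greedy
    action before the student tries to match it.
    """
    p = softmax(teacher_q, tau)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Training the student then amounts to minimizing this loss over states collected by the teacher, which is ordinary supervised learning.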

It was published in ICLR 2016, and one of the papers that cited this was Born Again Neural Networks (to appear in ICML 2018), a paper which I blogged about recently. The algorithms in these two papers are similar, and they apply in the reinforcement learning (PD) and supervised learning (BANN) domains.

After reading both papers, I developed the urge to understand all the Policy Distillation follow-up work. Thus, I turned to Google Scholar, one of the greatest research conveniences of modern times; as of this writing, the Policy Distillation paper has 68 citations. (Google Scholar sometimes has a delay in registering certain citations, and it also lists PhD theses and textbooks, so the previous sentence isn’t entirely accurate, but it’s close enough.)

I resolved to understand the main idea of every paper that cited Policy Distillation, especially how relevant each paper is to the algorithm. I wanted to understand if papers directly extended the algorithm, or if they simply cited it as related work to try and boost up the citation count for DeepMind.

I have never done this before to a paper with more than 15 Google Scholar citations, so this was new to me. After spending a week and a half on this, I think I managed to get the gist of Policy Distillation’s “follow-up space.” You can see my notes in this shareable PDF which I’ve hosted on Dropbox. Feel free to send me recommendations about other papers I should read!

Born Again Neural Networks

I recently read Born Again Neural Networks (to appear at ICML 2018) and enjoyed the paper. Why? First, the title is cool. Second, it’s related to the broader topics of knowledge distillation and machine teaching that I have been gravitating to lately. The purpose of this blog post is to go over some of the math in Section 3 and discuss its implications, though I’ll assume the reader has a general idea of the BAN algorithm. As a warning, notation is going to be a bit tricky/cumbersome but I will generally match with what the paper uses and supplement it with my preferred notation for clarity.

We have $\mathbf{z}$ and $\mathbf{t}$ representing vectors corresponding to the student and teacher logits, respectively. I’ll try to stick to the convention of boldface meaning vectors, even when they carry subscripts; a subscript on a boldface symbol indexes a sequence of such vectors rather than a component. Hence, we have:

$$\mathbf{z} = \langle z_1, \ldots, z_n \rangle \in \mathbb{R}^n$$

or we can also write $\mathbf{z} = \mathbf{z}_k$ if we’re considering a minibatch $\{\mathbf{z}_1, \ldots, \mathbf{z}_b\}$ of these vectors.

Let $\mathbf{x}$ denote input samples (also vectors) and let $Z=\sum_{k=1}^n e^{z_k}$ and $T=\sum_{k=1}^n e^{t_k}$ to simplify the subsequent notation, and consider the cross entropy loss function

$$\mathcal{L}(\mathbf{x}_1, \mathbf{t}_1) = -\sum_{k=1}^n \frac{e^{t_k}}{T} \log \frac{e^{z_k}}{Z}$$

which here corresponds to a single-sample cross entropy between the student logits and the teacher’s logits, assuming we’ve applied the usual softmax (with temperature one) to turn these into probability distributions. The teacher’s probability distribution could be a one-hot vector if we consider the “usual” classification problem, but the argument made in many knowledge distillation papers is that if we consider targets that are not one-hot, the student obtains richer information and achieves lower test error.

The derivative of the cross entropy with respect to a single output $z_i$ is often assigned as an exercise in neural network courses, and is good practice:

$$\frac{\partial \mathcal{L}_i}{\partial z_i} = \frac{e^{z_i}}{Z} - \frac{e^{t_i}}{T}$$

or $q_i - p_i$ in the paper’s notation. (As a side note, I don’t understand why the paper uses $\mathcal{L}_i$ with a subscript $i$ when the loss is the same for all components?) We have $i \in \{1, 2, \ldots, n\}$, and following the paper’s notation, let $*$ represent the true label. Without loss of generality, though, we assume that $n$ is always the appropriate label (just re-shuffle the labels as necessary) and now consider the more complete case of a minibatch with $b$ elements and all the possible logits. We have:

$$\mathcal{L}(\mathbf{x}_1, \mathbf{t}_1, \ldots, \mathbf{x}_b, \mathbf{t}_b) = \frac{1}{b}\sum_{s=1}^b \mathcal{L}(\mathbf{x}_s, \mathbf{t}_s)$$

and so the derivative we use is:

$$\frac{1}{b}\sum_{s=1}^b \sum_{i=1}^n \frac{\partial \mathcal{L}_{i,s}}{\partial z_{i,s}} = \frac{1}{b}\sum_{s=1}^b (q_{*,s} - p_{*,s}) + \frac{1}{b}\sum_{s=1}^b \sum_{i=1}^{n-1} (q_{i,s} - p_{i,s})$$

Just to be clear, we sum up across the minibatch and scale by $1/b$, which is often done in practice so that gradient updates are independent of minibatch size. We also sum across the logits, which might seem odd but remember that the $z_{i,s}$ terms are not neural network parameters (in which case we wouldn’t be summing them up) but are the outputs of the network. In backpropagation, computing the gradients with respect to weights requires computing derivatives with respect to network nodes, of which the $z$s (usually) form the final layer of nodes, and the sum here arises from an application of the chain rule.
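
As a sanity check on the “student softmax minus teacher softmax” form of the gradient, the following sketch compares it against central finite differences on random logits. Everything here (the function names, the batch size of 4, the 5 classes, the seed) is a hypothetical setup of mine, not the paper’s code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable row-wise softmax."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def batch_loss(student_logits, teacher_logits):
    """Minibatch cross entropy, averaged over the b samples."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return -np.mean(np.sum(p * np.log(q), axis=1))

def analytic_grad(student_logits, teacher_logits):
    """Gradient w.r.t. each student logit: (q - p) / b."""
    b = student_logits.shape[0]
    return (softmax(student_logits) - softmax(teacher_logits)) / b

def numeric_grad(student_logits, teacher_logits, eps=1e-6):
    """Central finite differences, perturbing one logit at a time."""
    grad = np.zeros_like(student_logits)
    for idx in np.ndindex(*student_logits.shape):
        up = student_logits.copy()
        down = student_logits.copy()
        up[idx] += eps
        down[idx] -= eps
        grad[idx] = (batch_loss(up, teacher_logits)
                     - batch_loss(down, teacher_logits)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 5))   # b = 4 samples, n = 5 classes
teacher = rng.normal(size=(4, 5))

grad = analytic_grad(student, teacher)
max_gap = np.max(np.abs(grad - numeric_grad(student, teacher)))
```

Each row of `grad` sums to zero, since both softmax vectors sum to one; the gradient only redistributes probability mass among the logits.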

Indeed, as the paper claims, if we have the ground-truth label $y_{*,s} = 1$ then the first term is:

$$\frac{1}{b}\sum_{s=1}^b (q_{*,s} - p_{*,s} y_{*,s})$$

and thus the output of the teacher, $p_{*,s}$, is a weighting factor on the original ground-truth label. If we were doing the normal one-hot target, then the above is the gradient assuming $p_{*,s}=1$, and the distillation gradient approaches this one-hot gradient as the teacher grows more confident. Again, all of this seems reasonable.

The paper also argues that this is related to importance weighting of the samples:

$$\sum_{s=1}^b \frac{p_{*,s}}{\sum_{u=1}^b p_{*,u}} (q_{*,s} - y_{*,s})$$

So the question is, does knowledge distillation (called “dark knowledge”) from (Hinton et al., 2014) work because it is performing a version of importance weighting? I assume the paper says “a version of” because the $q_{*,s}$ term is weighted in importance weighting, but not in their interpretation of the gradient.
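
Here is a small numerical sketch of how I read the importance weighting interpretation: each sample’s true-label gradient is weighted by the teacher’s confidence on that label, normalized over the minibatch. The probabilities below are made up for illustration, and the normalization is my assumption about what the weights look like.

```python
import numpy as np

# Hypothetical teacher probabilities on the true label, minibatch of b = 4.
p_star = np.array([0.9, 0.6, 0.3, 0.2])
# Hypothetical student probabilities on the true label for the same samples.
q_star = np.array([0.7, 0.5, 0.4, 0.1])
# Ground-truth indicator on the true label is always one.
y_star = np.ones_like(p_star)

# Assumed weights: teacher confidence normalized across the minibatch.
weights = p_star / p_star.sum()

# Importance-weighted gradient on the true-label logit vs. the uniform average.
weighted_grad = np.sum(weights * (q_star - y_star))
uniform_grad = np.mean(q_star - y_star)
```

Samples where the teacher is confident dominate the weighted sum, so a sample the teacher finds ambiguous contributes less than it would under the plain $1/b$ average.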

Of course, it could also work due to the information here:

$$\frac{1}{b}\sum_{s=1}^b \sum_{i=1}^{n-1} (q_{i,s} - p_{i,s})$$

which is in the “wrong” labels. This is the claim made by (Hinton et al., 2014), though it was not backed up by much evidence. It would be interesting to see the relative contribution of these two gradients in these refined, more sophisticated experiments with ResNets and DenseNets. How do we do that? The authors apply two evaluation metrics:

• Confidence Weighted by Teacher Max (CWTM): One which “formally” applies importance weighting with the argmax of the teacher.
• Dark Knowledge with Permuted Predictions (DKPP): One which permutes the non-argmax labels.

These techniques apply the argmax of the teacher, not the ground-truth label as discussed earlier. Otherwise, we might as well not be doing machine teaching.
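
To get a feel for the two treatments, here is a rough NumPy sketch of the ingredients as I understand them from the descriptions above: per-sample weights derived from the teacher’s max probability, and a shuffle of each teacher row’s non-argmax probabilities. The function names and the toy probabilities are hypothetical, and this is not the authors’ implementation.

```python
import numpy as np

def teacher_max_weights(teacher_probs):
    """Per-sample weights: teacher's max probability, normalized over the minibatch."""
    max_conf = teacher_probs.max(axis=1)
    return max_conf / max_conf.sum()

def permute_non_argmax(teacher_probs, rng):
    """Shuffle each row's non-argmax probabilities; the argmax entry stays put."""
    permuted = teacher_probs.copy()
    for row in permuted:  # each row is a view into `permuted`
        top = np.argmax(row)
        others = np.delete(np.arange(row.size), top)
        row[others] = rng.permutation(row[others])
    return permuted

rng = np.random.default_rng(0)
# Hypothetical teacher output for b = 2 samples and n = 4 classes.
teacher_probs = np.array([[0.70, 0.20, 0.08, 0.02],
                          [0.10, 0.50, 0.30, 0.10]])

cwtm_weights = teacher_max_weights(teacher_probs)
dkpp_targets = permute_non_argmax(teacher_probs, rng)
```

The permuted targets still sum to one and keep the teacher’s top prediction, but the association between the remaining probability mass and specific “wrong” classes is scrambled.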

It appears that if CWTM performs very well, one can conclude most of the gains are from the importance weighting scheme. If not, then it is the information in the non-argmax labels that is critical. A similar thing applies to DKPP, because if it performs well, then it can’t be due to the non-argmax labels. I was hoping to see a setup which could remove the importance weighting scheme, but I think that’s too embedded into the real/original training objective to disentangle.

The experiments systematically test a variety of setups (identical teacher and student architectures, ResNet teacher to DenseNet student, applying CWTM and DKPP, etc.). They claim improvements across different setups, validating their hypothesis.

Since I don’t have experience programming or using ResNets or DenseNets, it’s hard for me to fully internalize these results. Incidentally, all the values reported in the various tables appear to have been run with one random seed … which is extremely disconcerting to me. I think it would be advantageous to pick fewer of these experiment setups and run 50 seeds to see the level of significance. It would also make the results seem less like a laundry list.

It’s also disappointing to see the vast majority of the work here on CIFAR-100, which isn’t ImageNet-caliber. There’s a brief report on language modeling, but there needs to be far more.

Most of my criticisms are a matter of doing more training runs, which should be less problematic given more time and better computing power (the authors are affiliated with Amazon, after all…), so hopefully we will have stronger generalization claims in future work.

Update 05/29/2018: After reading the Policy Distillation paper, it looks like that paper already showed that matching a tempered softmax (of Q-values) from the teacher using the same architecture resulted in better performance in a deep reinforcement learning task. Given that reinforcement learning on Atari is arguably a harder problem than supervised learning of CIFAR-100 images, I’m honestly surprised that the Born Again Neural Networks paper got away without mentioning the Policy Distillation comparison in more detail, even when considering that the Q-values do not form a probability distribution.
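
For readers unfamiliar with a “tempered softmax,” here is a minimal sketch of turning Q-values into a distribution with a temperature parameter. The Q-values and the two temperatures are arbitrary numbers I picked for illustration; they are not the settings from either paper.

```python
import numpy as np

def tempered_softmax(q_values, temperature):
    """Softmax of q_values / temperature, computed stably."""
    scaled = q_values / temperature
    scaled = scaled - scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical Q-values for three actions in one state.
q_values = np.array([1.0, 1.2, 0.8])

sharp = tempered_softmax(q_values, temperature=0.1)   # low temperature: peaked
flat = tempered_softmax(q_values, temperature=10.0)   # high temperature: near uniform
```

A low temperature exaggerates small gaps between Q-values, which matters because those gaps are often tiny compared to the overall scale of the values.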

International Conference on Robotics and Automation (ICRA) 2018, Day 5 of 5

ICRA, like many academic conferences, schedules workshops and/or tutorials on the beginning and ending days. The 2018 edition was no exception, so for the fifth and final day, it offered about 10 workshops on a variety of topics. Succinctly, these are venues where a smaller group of researchers can discuss a common research sub-theme. Typically, workshops invite guest speakers and have their own poster sessions for works-in-progress or for shorter papers. These are less prestigious than full conference papers, which is why I don’t submit to workshops.

I attended most of the cognitive robotics workshop, since it included multi-robot and human-robot collaboration topics.

In the morning session, at least two of the guest speakers hinted at some skepticism of Deep Learning. One, for instance, had this slide:

An amusing slide at the day's workshop, featuring our very own Michael I. Jordan.

which features Berkeley professor Michael I. Jordan’s (infamous) IEEE interview from four years ago. I would later get to meet the speaker when he walked over to me to inquire about the sign language interpreting services (yay, networking!!). I obviously did not have much to offer him in terms of technical advice, so I recommended that he read Michael I. Jordan’s recent Medium blog post about how the AI revolution “hasn’t happened yet.”

The workshops were located near each other, so there were lots of people during the food breaks.

I stayed for the full morning, and then for a few more talks in the afternoon. Eventually, I decided that the topics being presented — while interesting in their own right — were less relevant to my immediate research agenda than I had originally thought, so I left at about 2:00pm, my academic day done. For the rest of the afternoon, I stayed at the convention center and finally finished reading Enlightenment Now: The Case for Reason, Science, Humanism, and Progress.

While I was reading the book, I indulged in my bad habit of checking my phone and social media. I had access to a Berkeley Facebook group chat, and it turns out that many of the students went traveling today to other areas in Brisbane.

Huh, I wonder if frequent academic conference attendees often skip the final “workshop day”? Just to be clear, I don’t mean these workshops are pointless or useless, but maybe the set of workshops is too heavily specialized or just not as interesting? I noticed a similar trend with UAI 2017, in that the final workshop day had relatively low attendance.

Now that the conference is over, my thoughts generally lean positive. Sure, there are nitpicks here and there: ICRA isn’t double-blind (which seems contrary to best science practices) and is pricey, as I mentioned in an earlier blog post. But as a consequence, ICRA is well-funded and exudes a sophisticated feel. The Brisbane venue was fantastic, as were the food and drink.

As always, I don’t think I networked enough, but I noticed that most Berkeley students ended up sticking with people they already knew, so maybe students don’t network as much as I thought?

I also have praise for my sign language interpreters, who tried hard. They also taught me about Auslan and the differences in sign language between Australia and the United States.

Well, that’s a wrap for ICRA. It is time for me to fly back to Vancouver and then to San Francisco … life will return to normal.

International Conference on Robotics and Automation (ICRA) 2018, Day 4 of 5

For the fourth day of ICRA, I again went running (for the fourth consecutive morning). This time, rather than run across the bridge to get to the South Bank, I ran on a long pathway that extended below some roads:

Below the roads, there is a long paved path.

Normally, I would feel hesitant to run underneath roads, since (at least in America) those places tend to be messy and populated by those with nowhere else to live. But the pathway here was surprisingly clean, and even at 6:30am, there were a considerable number of walkers, runners, and bikers.

After my usual preparation, I went over to the conference for the 9:00am plenary talk, provided by Queensland Professor Mandyam Srinivasan.

Professor Mandyam Srinivasan gave the third plenary talk for ICRA 2018.

As usual, it was hard to follow the technical details of the talk. The good news is that the talk was high-level, probably (almost) as high-level as Professor Brooks’ talk, and certainly less technical than Raia Hadsell’s talk. I remember there being lots of videos in this plenary, which presents logistical “eye-challenges” since I have to strike a delicate balance between watching the video and watching the sign language interpreter.

Due to the biological nature of the talk, I also remembered Professor Robert Full’s thrilling keynote at the Bay Area Robotics Symposium last November. I wonder if those two have ever collaborated?

I stayed for the keynote talk after that, about soft robotics, and then we had the morning poster session. As usual, there was plenty of food and drink, and I had to resist the urge to keep making trips to the food tables. The food items followed the by-now familiar pattern of one “sweet” and one “savory” item:

The food selection for today's morning poster session.

Later, we had the sixth and final poster session of the conference. The most interesting thing for me was … me, since that was when I presented my poster:

I, standing by my poster.

I stood there for 2.5 hours and talked with a number of conference attendees. Thankfully, none of the conversations were hostile or overly combative. People by and large seemed happy with what I was doing and saying. Also, my sign language interpreters finally had something to do during the poster sessions, since for the other five I had mostly been walking around without talking to people.

After the poster session, we had the farewell reception, which (as you can expect) was filled with lots of food and drinks. It took place in the plaza level of the convention center, which included an outside area along with several adjacent indoor rooms.

The food items included the usual bread, rice, and veggie dishes. For meat, we had salmon, sausages, and steak:

Some delicious steaks being cooked.

The steak was delicious!

Interestingly enough, the “dessert” turned out to be fruit, breaking the trend from past meals.

The farewell reception was crowded and dark, but the food was great.

The reception was crowded with long lines for food, particularly for the steak (obviously!). The other food stations providing the salmon and sausages were frequently out of stock. These are, however, natural problems since most of us were grabbing as much meat as we could during our first trips to the food tables. Maybe we need an honor code about the amount of meat we consume?

As an aside, I think for future receptions, ICRA should provide backpack and poster tube storage. We had that yesterday for the conference dinner and it was very helpful since cocktail-style dining means both hands are often holding something — one for alcoholic beverages and the other for food. Since I had just finished presenting my poster/paper, I was awkwardly lugging around a poster tube. My sign language interpreter kindly offered to hold it for the time I was there.

Again, ICRA does not skimp on the food and beverages. Recall that we had a welcome reception (day one), a conference dinner (day three) and the farewell reception (day four, today), so it’s only the second and fifth evenings that the conference doesn’t officially sponsor a dinner.