My Blog Posts, in Reverse Chronological Order
As I write this post, UC Berkeley is hosting its “visit days” program for admitted EECS PhD students. This is a three-day event that lets admitted students see the department, meet people, and get a (tiny) flavor of what Berkeley is like. Those interested in some history may enjoy my blog post about visit days four years ago.
If you’re an admitted student, congratulations! It’s super-competitive to get in. When I applied, the acceptance rate was roughly 5 percent, and the competition has probably increased since I applied. This is definitely true for those applying to work in Artificial Intelligence. I’ve seen statistics from BAIR director Trevor Darrell showing that the number of AI applicants has soared in recent years, to the point where the corresponding acceptance rate is now less than three percent.
It’s technically true that you’re not tied to a specific area when you apply, and that’s what the department probably advertises to admitted students. Do not, however, take this as implying that you can apply in an area you’re not interested in but think is “less competitive” and then pivot to AI. If you want to do fundamental AI research (and not just use it in an application) you must apply in AI — otherwise, I highly doubt the faculty will be interested in working with you when they already have the cream of the crop to consider from other applicants.
That being said, here are some related thoughts regarding graduate school, visit days, and so forth, which might be of use to admitted students:
You must come to visit days. You will learn a lot about the professors who are interested in working with you based on your assigned one-on-one meetings. I don’t know the details on how those assignments are made, but it’s a good bet that if a professor wants to work with you, then you’ll have a one-on-one meeting with him or her.
On a related note, if there’s a faculty member you desperately want to work with, then not only do you need to talk to him or her during visit days, you also need a firm commitment that he or she is willing to advise you without qualifications. This is particularly true for the “rock-star” faculty such as Pieter Abbeel and Michael I. Jordan, the best of the best, who get swarmed with emails from top-tier students asking to work with them. Get commitments done early.
You also want to be in touch with the students in your target lab(s). If you accept your offer, consistently communicate with them well before the official start of your PhD. This might mean just occasional emails over the summer, or (better) being remotely involved in an ongoing research project that can lead to a fast paper during your first semester. The point is, you want to be in the loop in what the other students are doing. This also includes incoming students — you’ll want to take the same classes as those in your research area, so that you can collaborate on homework and (ideally) research.
These previous points imply the following: you do not want to be spending your first year (or two) trying to “explore” or “get incubated” into research. Your goal must be to do outstanding research in your area of interest from day one.
It’s easy to experience euphoria upon getting your offer of admission. I don’t mean to rain on this, but there’s life beyond getting the admitted offer, and you want to make a sound and informed decision on something that will impact you forever. Again, if you got accepted to Berkeley, congratulations! I hope you seriously consider attending, as it’s one of the top computer science schools. Just ensure that you were admitted to your area of interest, and furthermore, it is crystal clear that the professors who you want to work with are willing to advise you from day one.
One of the things I’m most excited about nowadays is that physical robots now have the capability to repeatedly execute trajectories to gather data. This can then be fed into a learning algorithm to subsequently learn complex manipulation tasks. In this post, I’ll talk about a paper which does exactly that: the NIPS 2016 paper Learning to Poke by Poking: Experiental Learning of Intuitive Physics. (arXiv link; project website) Yes, it’s experiental, not experimental, which I originally thought was a typo, heh.
The main idea of the paper is that by repeatedly poking objects, a robot can then “learn” (via Deep Learning) an internal model of physics. The motivation for the paper came out of how humans seem to possess this “internal physics” stuff:
Humans can effortlessly manipulate previously unseen objects in novel ways. For example, if a hammer is not available, a human might use a piece of rock or back of a screwdriver to hit a nail. What enables humans to easily perform such tasks that machines struggle with? One possibility is that humans possess an internal model of physics (i.e. “intuitive physics” (Michotte, 1963; McCloskey, 1983)) that allows them to reason about physical properties of objects and forecast their dynamics under the effect of applied forces.
I think it’s a bit risky to try and invoke human reasoning in a NIPS paper, but it seems to have worked out here (and the paper has been cited a fair amount).
The methodology can be summarized as:
In our setup (see Figure 1), a Baxter robot interacts with objects kept on a table in front of it by randomly poking them. The robot records the visual state of the world before and after it executes a poke in order to learn a mapping between its actions and the accompanying change in visual state caused by object motion. To date our robot has interacted with objects for more than 400 hours and in process collected more than 100K pokes on 16 distinct objects.
Now, how does the Deep Learning stuff work to actually develop this internal model? To describe this, we need to understand two things: the data collection and the neural network architecture(s).
First, for data collection, they randomly poke objects in a workstation and collect the tuple of: before image, after image, and poke. The first two are just the images from the robot sensors and the “poke” is a tuple with information about the poke point, direction and length. Second, they train two models: a forward model to predict the next state given the current state and the applied force, and an inverse model to predict the action given the initial and target state. A state, incidentally, could be the raw image from the robot’s sensors, or it could be some processed version of it.
I’d like to go through the architecture in more detail. If we assume naively that the forward and inverse models are trained separately, we get something like this:
Visualization of the forward and inverse models. Here, we assume the forward and inverse models are trained separately. Thus, the forward model takes a raw image and action as input, and has to predict the full image of the next state. In the inverse model, the start and goal images are input, and it needs to predict the action that takes the environment to the goal image.
where the two models are trained separately and act on raw images from the robot’s sensors (perhaps 1080x1080 pixels).
Unfortunately, this kind of model has a number of issues:
- In the forward model, predicting a full image is very challenging. It is also not what we want. Our goal is for forward model to predict a more abstract event. To use their example, we want to predict that pushing a glass over a counter will result in the abstract event of “shattered glass.” We don’t need to know the precise pixel location of every shattered glass.
- The inverse model has to deal with ambiguity: there are multiple actions that may head to a resulting goal state, or perhaps no action at all can possibly lead to the next state.
All these factors require some re-thinking in terms of our model architecture (and training protocol). One obvious alternative the authors suggest is to avoid acting on image space and just feed all images into a CNN trained on ImageNet data and extract some intermediate layer. The problem is that it’s unclear if object classification and object manipulation mandate a similar set of features. One would also need to fine-tune ImageNet somehow, which would make this more task-specific (e.g., for a different workstation setup, you’d need to fine-tune again).
Figure from their paper describing (a) objects used, (b) before/after image pairs, (c) the network.
Their solution, show in above, involves the following:
Two consecutive images are separately passed through a CNN and then the output (i.e., latent feature representation) is concatenated.
To conclude the inverse model, are used to conditionally estimate the poke length, poke angle, and then poke location. We can measure the prediction accuracy since all the relevant information was automatically collected in the training data.
As to why we need to predict conditionally: I’m assuming it’s so that we can get “more reasonable” metrics since knowing the poke length may adjust the angle required, etc., but I’m not sure. (The project website actually shows a network which doesn’t rely on this conditioning … well OK, it’s probably not a huge factor.)
Also, the three poke attributes are technically continuous, but the authors simply discretize.
For the forward model, the action along with the latent feature representation of image is concatenated and fed through its own neural network, to predict , which in fact we already know as we have passed it through the inverse model!
By integrating both networks together, and making use of the randomly-generated training data to provide labels for both the forward and inverse model, they can simply rely on one loss function to train:
where is a hyperparameter. They show that using the forward model is better than ignoring it by setting , so that it is advantageous to simultaneously learn the task feature space and forecasting the outcome of actions.
To evaluate their model, they supply their robot with a goal image and ask it to apply the necessary pokes to reach the goal from the current starting state . This by itself isn’t enough: what if and are almost exact the same? To make results more convincing, the authors:
- set and to be sufficiently different in terms of pixels, thus requiring a sequence of pokes.
- use novel objects not seen in the (automatically-generated) training data.
- test different styles of pokes for different objects.
- compare against a baseline of a “blob model” which uses a template-based object detector and then uses the vector difference to compute the poke.
One question I have pertains to their greedy planner. They claim they can provide the goal image into the learned model, so that the greedy planner sees input to execute a poke, then sees the subsequent input for the next poke, and so on. But wasn’t the learned model trained on consecutive images instead of pairs?
The results are impressive, showing that the robot is successfully able to learn a variety of pokes even with this greedy planner. One possible caveat is that their blob baseline seems to be just as good (if not better due to lower variance) than the joint model when poking/pushing objects that are far apart.
Their strategy of combining networks and conducting self-supervised learning with large-scale, near-automatic data collection is increasingly common in Deep Learning and Robotics research, and I’ll keep this in mind for my current and future projects. I’ll also keep in mind their comments regarding generalization: many real and simulated robots are trained to achieve a specific goal, but they don’t really develop an underlying physics model that can generalize. This work is one step in the direction of improved generalization.
Sample efficiency is a huge problem in reinforcement learning. Popular general-purpose algorithms, such as vanilla policy gradients, are effectively performing random search in the environment1, and may be no better than Evolution Strategies, which is more explicit about acting random (I mean, c’mon). The sample-efficiency problem is exacerbated when environments contain sparse rewards, such as when it consists of just a binary signal indicating success or failure.
To be clear, the reward signal is an integral design parameter of a reinforcement learning environment. While it’s possible to engage in reward shaping (indeed, there is a long line of literature on just this topic!) the problem is that this requires heavy domain-specific engineering. Furthermore, humans are notoriously bad at specifying even our own preferences; how do we expect us to define accurate reward functions in complicated environments? Finally, many environments are most naturally specified by the binary success signal introduced above, such as whether or not an object is inserted into the appropriate goal state.
I will now summarize two excellent papers from OpenAI (plus a few Berkeley people) that attempt to improve sample efficiency in reinforcement learning environments with sparse rewards: Hindsight Experience Replay (NIPS 2017) and Overcoming Exploration in Reinforcement Learning with Demonstrations (ICRA 2018). Both preprints were updated in February so I encourage you to check the latest versions if you haven’t already.
Hindsight Experience Replay
Hindsight Experience Replay (HER) is a simple yet effective idea to improve the signal extracted from the environment. Suppose we want our agent (a simulated robot, say) to reach a goal , which is achieved if the configuration reaches the defined goal configuration within some tolerance. For simplicity, let’s just say that , so the goal is a specific state in the environment.
When the robot rolls out its policy, it obtains some trajectory and reward sequence
achieved from the current behavioral policy , internal environment dynamics , and (sparse) reward function . Clearly, in the beginning, our agent’s final state will not match the goal state , so that all the rewards are zero (or -1, as done in the HER paper, depending on how you define the “non-success” reward).
The key insight of HER is that during those failed trajectories, we still managed to learn something: how to get to the final state of the trajectory, even if it wasn’t what we wanted. So, why not use the actual final state and treat it as if it was our goal? We can then add the transitions into the experience replay buffer, and run our usual off-policy RL algorithm such as DDPG.
In OpenAI’s recent blog post, they have a video describing their setup, and I encourage you to look at the it along with the paper website — it’s way better than what I could describe. I’ll therefore refrain from discussing additional HER algorithmic details here, apart from providing a visual which I drew to help me better understand the algorithm:
My visualization of Hindsight Experience Replay.
There are a number of experiments that demonstrate the usefulness of HER. They perform experiments on three simulated robotics environments and then on a real Fetch robot. They find that:
DDPG with HER is vastly superior to DDPG without HER.
HER with binary rewards works better than HER with shaped rewards (!), providing additional evidence that reward shaping may not be fruitful.
The performance of HER depends on the sampling strategy for goals. In the example earlier, I suggested using just the last trajectory state as the “fake” goal, but (I think) this would mean the transition is the only one which contains the dense reward ; there would still be other states with the non-informative reward. There are alternative strategies, such as sampling more frequent states. However, doing this too much has a downside in that “fake” goals can distract us from our true objective.
HER allows them to transfer a policy trained on a simulator to a real Fetch robot.
Overcoming Exploration in Reinforcement Learning with Demonstrations
This paper extends HER and benchmarks using similar environments with sparse rewards, but their key idea is that instead of trying to randomly explore with RL algorithms, we should use demonstrations from humans, which is safer and widely applicable.
The idea of combining demonstrations and supervised learning with reinforcement learning is not new, as shown in papers such as Deep Q-Learning From Demonstrations and DDPG From Demonstrations. However, they show several novel, creative ways to utilize demonstrations. Their algorithm, in a nutshell:
Collect demonstrations beforehand. In the paper, they obtain them from humans using virtual reality, which I imagine will be increasingly available in the near future. This information is then put into a replay buffer for the demonstrator data.
Their reinforcement learning strategy is DDPG with HER, with the basic sampling strategy (see discussion above) of only using the final state as the new goal. The DDPG+HER algorithm has its own replay buffer.
During learning, both replay buffers are sampled to get the desired proportion of supervisor data and data collected from environment interaction.
For the actor (i.e., policy) update in DDPG, they add the Behavior Cloning loss in addition to the normal gradient update for DDPG (function denoted as ):
I can see why this is useful. Notice, by the way, that they are not just using the demonstrator data to initialize the policy. It’s continuously used throughout training.
There’s one problem with the above: what if we want to improve upon the demonstrator performance? The behavior cloning loss function prevents this from happening, so instead, we can use the Q-filter, a clever contribution:
The critic network determines . If the demonstrator action is better than the current actor’s action , then we’ll use that term in the loss function. Note that this is entirely embedded within the training procedure: as the critic network improves, we’ll get better at distinguishing which terms to include in the loss function!
Lastly, they use “resets”. I initially got confused about this, but I think it’s as simple as occasionally starting episodes from within a demonstrator trajectory. This should increase the presence of relevant states and dense rewards during training.
I enjoyed reading about this algorithm. It raises important points about how best to interleave demonstrator data within a reinforcement learning procedure, and some of the concepts here (e.g., resets) can easily be combined with other algorithms.
Their experimental results are impressive, showing that with demonstrations, they outperform HER. In addition, they show that their method works on a complicated, long-horizon task such as block stacking.
I thoroughly enjoyed both of these papers.
They make steps towards solving relevant problems in robotics: increasing sample efficiency, dealing with sparse rewards, learning long-horizon tasks, using demonstrator data, etc.
The algorithms are not insanely complicated and fairly easy to understand, yet seem effective.
HER and some of the components within the “Overcoming Exploration” (OE) algorithm are modular and can easily be embedded into well-known, existing methods.
The ablation studies appear to be done correctly for the most part, and asking for more experiments would likely be beyond the scope of a single paper.
If there are any possible downsides, it could be that:
The HER paper had to cheat a bit on the pick-and-place environment by starting trajectories from when the gripper grips the block.
In the OE paper, their results which benchmark against HER (see Section 6.A, 6.B) were done with only one random seed, and that’s odd given that it’s entirely in simulation.
Their OE claim that the method “can be done on real robot” needs additional evidence. That’s a bold statement. They argue that “learning the pick-and-place task takes about 1 million timesteps, which is about 6 hours of real world interaction time” but does that mean we can really execute the robot that often in 6 hours? I’m not seeing how the times match up, but I guess they didn’t have enough space to describe this in detail.
For both papers, I was initially disappointed that there wasn’t code available. Fortunately, that has recently changed! (OK, with some caveats.) I’ll go over that in a future blog post.
I’m happy to see that Professor Ben Recht has a new batch of reinforcement learning blog posts, as he’s a brilliant, first-rate machine learning researcher. I’ve been devouring these posts, and I remain amused at his perspective on control theory versus reinforcement learning. He has a point in that RL seems silly if we deliberately constrain the knowledge we can provide to the environment (particularly with model-free RL). For instance, we wouldn’t deploy airplanes and other machines today without a deep understanding of the physics involved. But those are thoughts for another day. ↩
My preprint on surgical debridement and robot calibration was accepted to the 2018 IEEE International Conference on Robotics and Automation (ICRA). It’s in Brisbane, Australia, which means I’ll be going to Australia for the second time in less than a year — last August, I went to Sydney for UAI 2017.
I’m excited about this opportunity and look forward to traveling to Brisbane in May. (That is, assuming Berkeley’s Disabled Students’ Program isn’t as slow as they were in August, but never mind.) I have already booked my travel reservations and registered for the conference.
Everyone knows that long-haul international travel is expensive, but what might not be clear to those outside academia is that conference registration fees can be just as high as those airfare fees. For ICRA, the cost of my registration came to be 1,171.36 AUD before taxes, and 1,275.00 AUD with taxes. That corresponds to 1,033.94 in US dollars. Ouch.
Fortunately, I’m going to get reimbursed, since Berkeley professors are not short on money, but I still wish that costs could be lower. The breakdown was: 31.36 AUD for a hotel deposit (I’ll pay the full hotel fees when I arrive in May), 600 AUD for the early-bird IEEE student membership registration, 100 AUD for the workshops/tutorials, and 440 AUD for the two extra page charge.
Wait, what was the last one?
Ah, I should clarify. The policy of ICRA, and for many IEEE conferences at that (hence the title of this blog post), is the following:
Papers to ICRA can be submitted through two channels:
- To ICRA. Six pages in standard ICRA format and a maximum of two additional pages can be purchased.
- To the IEEE Robotics and Automation Letters (RA-L) journal, and tick the option for presentation at ICRA. Six pages in standard ICRA format are allowed for each paper, including figures and references, but a maximum of two additional pages can be purchased. Details are provided on the RA-L webpage and FAQ.
All papers are submitted in PDF format and the page count is inclusive of figures and references. We strongly encourage authors to submit a video clip to complement the submission. Papers hosted on arXiv may be submitted to ICRA.
So, in short, we can have six pages, and can purchase two extra pages if needed.
This makes no sense.
Is it because of printing costs associated with the proceedings? It shouldn’t be. The proceedings, as far as I know, are those enormous books that concatenate all the papers from a conference.
They are also worthless and should never be printed outside of maybe one or two historical copies for IEEE’s book archives. No one should read directly from them. Who has the time? Academics are judged based on the papers they produce, not the papers they consume. This year, ICRA alone accepted 1030 papers (!!). Yes, over a thousand. It makes no sense to browse proceedings to search for a paper; just type in a search query on Google Scholar. If you think you might be missing a gem somewhere in the proceedings, I wouldn’t worry. Good papers will make themselves known eventually through word of mouth. They also tend to be widely accessible to all, such as being available on arXiv rather than being stuck behind an IEEE paywall. Most universities have IEEE subscriptions so it’s not generally a problem to download IEEE papers for free, but it’s still a bit of an unnecessary nuisance.
Speaking of arXiv, perhaps IEEE doesn’t follow a similar model due to hosting costs? That doesn’t seem like a good rationale, and in particular it doesn’t justify the steep jump in price from 6 to 8 pages. Why not have pages 1 through 6 charged accordingly? Or simply make the charge based on file size instead of page size, while obviously keeping a hard page limit to alleviate the load on reviewers. There seem to be way more rational price structures than the 220 AUD each for pages 7 and 8.
It seems like ICRA organizers would prefer to see 6 page papers, yet the problem is that everyone knows that if you allow 8 pages, then that becomes the effective lower bound on paper length. An 8-page paper has a better chance of being accepted to ICRA than a 7-page paper, which in turn has odds over a 6-page paper. And so on. Indeed, if you look at ICRA papers nowadays, the vast majority hit the 8 page limit, many with barely a line to spare (such as my paper!). The trend is possibly even more pronounced with ICML, NIPS, and other AI conferences, from my anecdotal experience reading those papers.
Such a cost structure might needlessly disadvantage students and authors from schools without the money to easily pay the over-length fees. This is further exacerbated by ICRA’s single-blind policy, where reviewers can see the names of authors and thus be influenced by research fame and school institution name.
In short, I’m not a fan of the two-page extra charge, and I would suggest that ICRA (and similar IEEE-based conferences) switch to a simple, hard, 8-page limit for papers. In addition, I would also like to see all accepted papers freely available for download in an arXiv-style format. If hosting costs are a burden, a more rational price structure would be to slightly increase the conference registration fees, or encourage authors to upload their papers on arXiv in lieu of being hosted by ICRA.
To be clear, I’m still extremely excited about attending ICRA, and I’m grateful to IEEE for organizing what I’ve heard is the preeminent conference on robotics. I just wish that they were a little bit clearer on why they have this two-page extra charge policy.
In this post, I build upon my previous one by further investigating fundamental concepts in Murray, Li, and Sastry’s A Mathematical Introduction to Robotic Manipulation. One of the challenges of their book is that there’s a lot of notation, so I first list the important bits here. I then review an example that uses some of this notation to better understand the meaning of twists and exponential coordinates.
Side comment: there is an alternative, more recent robotics book by Frank C. Park and Kevin M. Lynch called Modern Robotics. It’s available online and has its own Wikipedia, and even has some lecture videos! Despite its 2017 publication date, the concepts it describes are very similar to Murray, Li, and Sastry, except that the presentation can be a bit smoother. The notation they use is similar, but with a few exceptions, so be aware of that if you’re reading their book.
Back to our target textbook. Here is relevant notation from Murray, Li, and Sastry:
The unit vector specifies a direction of rotation, and represents the angle of rotation (in radians).
An important fact is that any rotation can be represented as rotating by some amount through an axis vector, so we could write rotation matrices as functions of and , i.e., . Murray, Li, and Sastry call this “Euler’s Theorem”.
Note: if you’re familiar with the Product of Exponentials formula, then , which generalizes to the case when there are joints in a robotic arm. Also, doesn’t have to be an angle; it could be a displacement, which would be the case if we had a prismatic joint.
The cross product matrix satisfies for some , where indicates the cross product operation. More formally, we have
and so is a skew-symmetric matrix, of which the set of those is denoted as . This is easily verified by explicit computation.
The matrix exponential is a matrix in the set . In other words, it’s a rotation matrix! We can write it in closed form from Rodrigues’ formula:
Here’s the relevant mathematical relationship: the exponential map transforms skew-symmetric matrices into orthogonal matrices, and every rotation matrix can be represented as the matrix exponential of some skew-symmetric matrix.
A twist is defined as the set of matrices parameterized by exponential coordinates s.t. and . The matrix is written as
and yes, the last row consists of four zeros. We can derive this matrix from considering rotations about revolute and prismatic joints, where is the axis of rotation, and is the vector describing the translation. (See Section 3.2 in Murry, Li, and Sastry for details.)
Incidentally, sometimes we call twists as , or with the “attached” to it. We can also do the same for the exponential coordinates.
The matrix exponential of a twist represents the relative motion of a rigid body. Hence, if we left-multiply this with a vector, we interpret it as moving the input vector with respect to the same frame. Said another way, we are not changing the “frame of reference” for the input vector.
Given twist coordinates , we can explicitly construct the RBMs:
if . This is equivalent to a pure translation. If , we have the slightly more complicated formula
Both of the above are elements of , so they are rigid body motions. To be clear, you need to construct these from the actual definition of a matrix exponential based on its Taylor series expansion.
Let’s look at the example in the figure below to better understand the above concepts.
We have inertial frame and body frame . With some pencil and paper, you can show that the RBM from to is:
which follows the style from my last post: determine the rotation and translation components and then plug them in. Fortunately, the rotation is about the axis, so is easy.
It turns out that every RBM can be expressed as the exponential of some twist . So let’s consider the following:
Given that (note , not ) and that we know the exact form of , can we compute ?
To do this, we need to compute the components and .
Let’s do first. From our above equations for and , we equate components and see that we need
We know corresponds to a rotation matrix about the -axis only. This means . We also simply set .
Now consider . Once again, we equate components from both sides to get
This is the standard “solve for in ” problem that you saw in linear algebra classes. Solving for , we get .
We conclude that our exponential coordinates which generated are
Over the last few weeks, I have been devouring Murray, Li, and Sastry’s A Mathematical Introduction to Robotic Manipulation. It’s freely available online and, despite the 1994 publication date, is still relevant for robotic manipulation as it’s used in EE206 at Berkeley.
(The main novelty today, from my perspective, is the use of Deep Learning to automate out analytic models under certain conditions, but I still think it’s valuable for me to know classical robotics concepts.)
In this post, I discuss the Special Exponential group, denoted as . We can define it as follows:
where is a position vector, and is a matrix in the special orthogonal group :
This is the same as saying that is a rotation matrix.
Side comment: the reason for “” as the input is that and can be generalized to an arbitrary amount of dimensions. However, I’m only concerned about robotic manipulation in .
The above suggests the obvious question:
For what purpose do we utilize ?
We use to encode rigid body motions (RBMs) in robotic manipulation, which preserve distance between points and angles between vectors. RBMs consist of a rotation and a translation. To visualize RBMs, look at the left of the figure below. There are two coordinate frames: frame , which is “inertial” (I think of this as the “default” frame), and frame , attached to the base of the curvy object drawn there.
Vector shows the 3-D position of the origin of with respect to . This ordering from to and not vice versa is important, and it’s important to know these well for robotic manipulation, which in advanced contexts relies on multiple, consecutive coordinate frames attached to links in a robot arm. For rotation matrices, we’ll keep the ordering of subscripts the same and write to indicate that it transforms 3-D points from frame to frame .
Left: visualization of two coordinate frames, one inertial and one attached to the base of an object. Right: again, two coordinate frames visualized, this time in the context of rotating about an axis.
Now consider the point on the object. We can express its coordinates as or , depending on which frame of or is our reference. Suppose we have . A rigid body motion can be conducted as follows:
and we’ll collect as all the information needed to specify an RBM, transforming coordinates from to . This is an element of . Indeed, any such RBM must be an element of , which defines what’s known as a configuration space of RBMs. Configuration spaces are defined in page 25 of Murray et al:
More generally, we shall call a set a configuration space for a system if every element corresponds to a valid configuration of the system and each configuration of the system can be identified with a unique element of .
I should also provide some intuition to make it clear what happens when we “transform coordinates.” One way is as follows. If we view as jetting out in the positive , , and directions of frame , and the same for w.r.t. frame , then the components of are element-wise larger than in . This is why we add when doing RBMs with translations, since the vector will increase the values. (Drawing a picture really helps.)
Keeping and separate can result in some cumbersome math when a bunch of rotations and translations are combined. Thus, it’s common to use homogeneous coordinates. A full discussion is beyond the scope of this blog post, but for us, the important point is that if , then the equivalent homogeneous representation is
where the “0” above is a row vector of three zeros. This enables us to perform one matrix-vector multiply for a 3-D point which is expanded with a fourth coordinate with a “1” in it. Thus, the origin point is , and for vectors — defined as the difference between two points — the fourth component is zero.
Let’s do an example. Consider the second image in the figure above, showing rotation about an axis . Given a fixed, “real-life”, physical point, let’s show how to encode a rigid body transformation which can transform a coordinate representation of that point from frames to .
Translation. Inertial frame and frame differ only by in the y-direction, with representing the origin of frame with respect to frame .
Rotation. The rotation about coincides with the axis. Hence, we use the well-known formula for the -axis rotation matrix:
While the precise positioning of the sines and cosines might not be immediately apparent, it should be clear why the last row looks like that, since a rotation about the axis leaves the component of the 3-D vector unchanged. You can also easily check that .
The matrix above is , the orientation of the origin of frame w.r.t. frame .
These two form our specification in . Combining these by using the homogeneous representation for compactness, our RBM from to is:
where and represent, in homogeneous coordinates, the same physical point but with respect to frames and , respectively.
A year ago, I listed all the books that I read in the year 2016. I listed 35 books with summaries for each, and grouped them into categories by subject. I’m pleased to announce that I am continuing my tradition with this current blog post, which summarizes all the books I read in 2017.
As before, I am only listing non-fiction books, and am excluding (among other things) textbooks, magazines, and certainly all the academic papers1 that I read. Books with starred titles (like ** this **) are those that I especially enjoyed reading, for one reason or another.
In 2017, I read 43 books, which is eight more than the 35 I reported on last year. Yay! The book categories are:
- Artificial Intelligence and Robotics (11 Books)
- Technology, Excluding AI/Robotics (5 Books)
- Business and Economics (5 Books)
- Biographies and Memoirs (6 Books)
- Conservative Politics (3 Books)
- Self-Help and Personal Development (3 Books)
- Psychology and Human Relationships (4 Books)
- Miscellaneous (6 Books)
Within each section, books are listed according to publication date.
I hope you enjoy this blog post! For 2018, I hope to continue reading lots and lots of non-fiction books with a heavy focus on technology, businesses, and economics.
Group 1: Artificial Intelligence and Robotics
Yowza! From the 11 books here, you can tell that I’m becoming a huge fan of this genre. ;-)
** Incognito: The Secret Lives of the Brain ** is a thrilling 2011 book by neuroscientist David Eagleman of the Baylor College School of Medicine. (I consider this book as “AI” in this blog post, though you could argue that “Psychology” might be better.) It is clearly designed for the lay reader with interest in neuroscience, like me, due to a number of engaging examples that thrill the reader without going overboard with the technical content. Eagleman describes how we don’t actually have that much control over our brain, that there are so many unexpected contradictions with how we think, and mentions a few interesting neuroscience factoids. Did you know, for instance, that half of a child’s brain can be removed and the child can still survive? On a technical note, I was impressed with how Eagleman referenced a few machine learning papers from Michael I. Jordan and Geoff Hinton in his footnotes about hierarchical learning. From the perspective of a computer scientist, the most interesting part was when he talked about the brain being a team of competing rivals. This is awfully similar to the idea behind Generative Adversarial Networks an enormously successful and well-cited paper that came out in NIPS 2014 … three years after this book was published! I have no idea how a non-computer scientist was able to almost predict this, but it shows that cross-collaboration between neuroscientists may be good for AI. He doesn’t get everything right, though. He mentions more than once that artificial neural networks have been a failure. Well… this book was published in 2011, and Alex-Net came out in 2012, so that kind of flopped quickly. Despite this, I hope to see Eaglemen write another book about the brain so that I can see a revised perspective. Incognito also contains interesting perspectives on neuroscience and the law. Eagleman doesn’t take sides but doesn’t go in too much depth either. He says the “bar” for blameworthiness will change depending on available neuroscience, which he says is a mistake (and I agree).
** The Most Human Human: What Artificial Intelligence Teaches Us About Being Alive ** is an engaging 2011 book written by Brian Christian, who also (co-)authored another book I read this year, Algorithms to Live By. This book’s main focus is on the Turing Test, and it plays a greater theme in this one more than any of the other AI books I’ve read. Simply put, while we’re so obsessed about getting a computer to fool human judges (the “most human computer”), Christian argues that an equally important criteria is the “most human human.” In other words, in the Turing Test, who is the human that can most convince the judges that he/she is human? The book’s chapters are explorations of the different factors that make us human, among other things our ability to barge and interrupt, our use of “um” and “uh”, our constant sidetracking, and so forth. Intuitively, these are hard for a computer to model. Christian has a computer science background, so some of the book covers technical concepts such as entropy, which he argues we need to be making as high as possible; low entropy means we’re not saying anything worth knowing more about. And yet, these seem to be the most negatively encouraged aspects of our society, which is quite odd in Christian’s opinion. I enjoyed reading most of the book because it stated observations that seem obvious to me in retrospect, but which I never gave much thought. That’s the best kind of observation. (It’s like Incognito to some extent, and indeed the Incognito author praises this book!) The biggest drawback is that it never goes through a blow by blow of the actual Turing test! I mean, c’mon, I was looking forward to that, and Christian essentially ruined it by fast forwarding to the end, when he mentions he won the title of the “most human human.” Well, congratulations dude, but why wasn’t there at least a complete transcript in the book???
** Our Final Invention: Artificial Intelligence and the End of the Human Era ** is a 2013 book by documentary filmmaker James Barrat exploring the rise of AI and its potential existentialist risk. This is not far off the mainstream of AI researchers and programmers as it sounds. In fact, Peter Norvig and Stuart Russell include a brief discussion about it at the end of their famous AI textbook. Russell has also explicitly said that he’s clearly worried about superintelligence. So what is superintelligence? I view it as AI that is so far advanced that it becomes better and better, and surpasses humans in just about every quality imaginable.2 I think I enjoyed reading this book, even if it is a tad too sensational. There isn’t that much technical detail, which is OK since it’s a popular science book. Barrat makes an excellent point that scientists need to make their work accessible to the public. I agree — that’s partly why I discuss technical stuff on this blog — but I also think that people have got to start learning more math on their own. It needs to be a two-way partnership. Moving on, an unexpected benefit of reading Our Final Invention was that I learned about the work done by I.J. Good, Eliezer Yudowsky, and others from the Machine Intelligent Research Institute. I’m embarrassed to admit that I didn’t know about those two people beforehand, but now I will remember their names. Unlike MIRI in most Berkeley AI research groups, such as the ones I’m in, we don’t give a modicum of thought to existential risks of AI, but the topic is garnering more attention. One final comment about the book: Barrat mentions that no computer is better than a child at object recognition. Well, whoops. They are now! He talks a lot about neural networks and how we don’t understand them, which by now feels old and I wish authors would take note of all the people workong on this area. I’d like to see an updated version of Our Final Invention in 2018, with the last five years of AI advancement taken into consideration.
** Superintelligence: Paths, Dangers, Strategies ** is a 2014 book by Oxford philosopher Nick Bostrom.3 His philosophy background is apparent in the way Superintelligence is written, though it is obviously much easier to read than a real, academic philosophy paper. In this book, Bostrom considers when AI grows to the point where machines are “superintelligent.” That term can be broadly understood as when machines are so powerful and intelligent that they effectively have complete control over the future of the universe. Why, Bostrom says, can we assume that superintelligence is friendly? We cannot, he concludes. The book is about different ways we can get to superintelligence (i.e., “paths”) including AI, emulating the brain, collective superintelligence (think a super-charged Internet) and so forth. There are also “dangers” and “strategies”. Bostrom convincingly explains why superintelligence poses an existential risk to humanity, and also explains what strategies we may take to counter it, such as by uploading appropriate values to the agent. Unfortunately, none of his solutions are clear-cut. There are two things you will notice when reading this book. First, almost every assumption has a counterexample or unexpected consequence. Bostrom often comes up contrived scenarios for this purpose. Second, Bostrom frequently cites Eliezer Yudkowsky’s work. I first learned about Yudkowski by reading James Barrat’s Our Final Invention (see above). If you like Yudkowsky, you’ll like Bostrom. Taking a more general view, Superintelligence is meant to be a serious academic-style discussion, but not a recipe that can be easily followed, because it assumes so many things and continues to make the reader feel like every case is hopelessly complicated with advantages and disadvantages abound, both obvious and non-obvious. Overall, I’m happy I read this book even if it is wildly premature. It made me think hard about any assumptions I make in my work.
What to Think About Machines That Think: Today’s Leading Thinkers on the Age of Machine Intelligence collects responses based on a 2015 survey of the EDGE question: “what do you think about machines that think?” This was sent to about a hundred or so experts in a variety of fields, mostly different academics and well-known authors. Obvious inclusions were Nick Bostrom (of Superintelligence fame), Eliezer Yudkowsky, Stuart Russell, Peter Norvig, and others, but there were also some interesting new additions. I of course did not know the vast majority of the people here. This book is a bit of an unusual format; each author’s answer to this question took up 1-4 pages. There were no constraints otherwise, so the answers varied considerably. For instance, I like Steven Pinker’s comment of why people don’t think AI will “naturally develop along female lines […] without the desire to take over the world.” Gee, isn’t that stereotyping, Pinker?? There were also some amusing responses, such as when someone said that he himself was an AI and people didn’t know it (!!) as well as alarmingly short answers saying “machines can’t think”. Most responses were along the same theme of: “AIs taking over the world aren’t going to happen anytime soon, but they are affecting us now, sometimes in subtle ways, and [insert ‘novel’ insight here].” Overall, while I like the idea of the book format, I think the utility that I derive from reading is more about understanding a long, engrossing story. This is not necessarily a bad book, and I can see it being useful for someone who can only read books in short 3-5 minute spurts at a time, but it’s not my style.
** Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots **, is a 2015 book by NY Times journalist John Markoff. Though a journalist, Markoff has frequently written about AI-related topics and has good connections in the field. Machines of Loving Grace gives an impressively balanced view of the history of Artificial Intelligence (AI) and Intelligence Augmentation (IA), which can be roughly thought of as HCI (I might view it as a subset of HCI). Obviously, I am more interested in the AI aspect. Markoff covers topics such as self-driving cars (unsurprisingly), and Rodney Brooks’ Baxter robot, which has been used in many research papers that I’ve read. But more surprising from a Berkeley perspective, Markoff also mentioned Pieter Abbeel’s work, though this was his clothes folding experiment, and not his later, more exciting DeepRL stuff from 2014 and onwards. I was also — unsurprisingly — interested in Markoff’s description of the history of how the neural network pioneers met each other (e.g., Terence Sejnowski, Geoffrey Hinton, and Yann LeCun). For the IA side, the most prominent example is Apple’s Siri, which interacts with humans, though I don’t have much to say about it because I have never used Siri. Yes, I’m embarrassed. On the AI vs IA dilemma, Markoff notes people such as John McCarthy and Doug Engelbart on opposite sides. And of course who could forget Marvin Minsky who decimated the field of neural networks with his legendary 1969 book? I decided to read this based on Professor Ken Goldberg’s brief Nature article, and was pleased that I did so, mostly (again) to learn about history (since I’m trying to become like one of those AI experts…) and the importance of ensuring that, at least for AI applied to the real world, we keep the human interaction aspect in mind.
Rise of the Robots: Technology and the Threat of a Jobless Future is a 2015 book by technologist Martin Ford, who warns that our society is not prepared to handle all the future technological advances with robots automating out jobs. He begins by arguing that IT advances have not been as useful as electricity and other breakthroughs, and indeed that is a key theme from Robert Gordon’s The Rise and Fall of American Growth. To make the point clear, in response to Ray Kurzwiel saying that smartphones have provided incalculably large benefits to their owners, Ford counters with: “in practice, they may offer little more than the ability to play Angry Birds while standing in an unemployment line.” Ford continues by citing sources and reasons for the decline of the middle class in America. This part of the book is not controversial. Ford then raises the point that the IT revolution, along with not just robotics but also machine learning, means that even “high-skilled” jobs are at risk of being automated out. We now have machines that can write as well as most humans, that play Jeopardy! (as expected, IBM’s Watson was mentioned) and which perform better at image recognition and language translation using Deep Learning. Ford worries that, in the worst case, an elite few with all the wealth will hoard it and be guarded in a fortress by robots. Yes, he admits this is science fiction (and he discusses the Singularity, probably not the best idea…) but the point seems clear. Ford concludes the book with what he probably wanted to discuss all along: Universal Basic Income to the rescue! I heard about this book from Professor Ken Goldberg’s brief Nature article, who is critical of Ford “falling for the singularity hype” and his “extremely sketchy” evidence. I probably don’t find it as bad, and lately I’ve been thinking more seriously about supporting a Universal Basic Income. We might as well try on smaller scales, given that the best we can hope for in the future is more debt and safety net cuts with Republicans (now) or more debt inefficient bureaucracy with Democrats (in 2018/2020).
Our Robots, Ourselves: Robotics and the Myths of Autonomy, a 2015 book written by MIT professor David A. Mindell, is the third robotics-related book that I read based off of Professor Ken Goldberg’s brief Nature article. Mindell has an unusual background, being a Professor of Aeronautics and Professor of the History of Engineering and Manufacturing (I didn’t even know that was a department). He’s also a pilot. So this book brings together his expertise when he discusses what it means for robots to be automated. Our Robots, Ourselves discusses five realms: sea, land, air, war, and space, and shows that in all of those, it is not straightforward to claim that robots are being more and more autonomous at the expense of the human aspect. In addition, Mindell tells stories of the natural conflict between increasing automation and human employees. For instance, with sea, what does it mean for geologists and scuba-diving analysts if robots do it for them? Does it detract from their job? A similar thing rings true for pilots. We need some way of humans taking over in emergencies, and pilots are worried that increasing automation will lower the prerequisite skills for the job and/or reduce the job’s purpose. Next, consider war. People who once fought on the front lines or as air force pilots are feeling resentful that those who manage drones remotely are getting respect and various honors. Mindell argues that increasing automation must also go along with better human-robot interaction, a topic which is rightfully becoming increasingly important for academia and the world. After reading this book, I now believe I do not want systems to be fully autonomous (a huge issue with self-driving cars) but instead, I want the automation to work well with humans. That’s the key insight I got from this book.
** Algorithms to Live By: The Computer Science of Human Decisions ** is a 2016 book co-authored by writer Brian Christian and Berkeley psychology professor Tom Griffiths. It consists of 11 chapters, each of which correspond to one broad theme in computer science, such as Bayes’ Rule, Overfitting, and Caching. Most of these topics are related to algorithms and machine learning, which wasn’t particularly surprised to me given the authors’ backgrounds. I also know Professor Griffiths publishes machine learning papers on occasion, such as his groundbreaking 2004 paper Finding Scientific Topics. Algorithms to Live By lists how the major technical issues and questions related to these topics can have implications for actions in our own lives, such as dating, parking cars, and designing our rooms/desks (this example with caching always comes up). The authors point out how, in practice, the algorithms people engage in for these activities can be surprisingly correct or well off the mark of optimality, where here the metric is based on mathematical proofs. Of course, whenever we talk about mathematical proofs, we have to be clear on what assumptions we make, which will drastically affect our options, and which in fact can often validate some of the seemingly irrational activities that humans perform. I tremendously enjoyed reading this, though admittedly it was easier for me to digest the material given that I knew the main idea of the computer science concepts covered. It was nice to get a high-level overview, though, and I still learned a lot from the book since I have not studied every computer science subfield in detail. My final thought is that, just like when I read The Checklist Manifesto last year and tried to think about utilizing checklists myself, I will try and see if I can incorporate some of the authors’ suggestions in my own life.
** Thinking Machines: The Quest for Artificial Intelligence and Where It’s Taking Us Next ** is a recent 2017 book by journalist Luke Dormehl. I found out about it by reading Ray Kurzweil’s favorable book review in The New York Times. Kurzweil remarks that Luke is a journalist who “actually knows the technical details.” I think that’s true, though there is virtually no math in this book, or at least very little of it compared to Pedro Domingos’ book. In Thinking Machines, Dormehl mentions the backpropagation algorithm which has powered Deep Learning, but only at a very high level (obviously). He also talks about Deep Learning’s history, which I know already (and it could have been derived right from Stanford’s CS 231n slides) but it’s good to have here. Dormehl writes about the by-now famous story that “neural networks were ignored for a while then they became popular and are now known as Deep Learning,” which Professor Jitendra Malik would remark is “more marketable.” As far as technical material goes, it’s correct, so no worries. Dormehl includes a substantial amount of material about sensors, the Internet of Things (similar to Thomas L. Friedman’s Thank You For Being Late) and of course about AI ethics, laws, and the singularity. These are not new themes, but the difference between this book and others is that it’s very recent and current, which is useful due to the fast-growing pace of AI, so it was able to cover AlphaGo from DeepMind. I consider it a broad “story” about AI, and less opinionated compared to James Barrat’s Our Final Invention. It’s of reasonable length (not too long, not too short) and great for a wide audience of readers. Overall, I enjoyed reading the book, and it kept me up late longer than I should have.
** Heart of the Machine: Our Future in a World of Artificial Emotional Intelligence ** is a 2017 book also recommended by Ray Kurzweil. The author is Richard Yonck, founder and president of Intelligent Future Computing, a company which provides advice on the impact of technology on business and society (is this called “consulting”?). Heart of the Machine, like many AI-related books, discusses recent research and commercial advances, but it emphasizes an emotional perspective. It discusses how we got to affective computing and the rise of emotional machines. The first part contains a little history and discusses some of the labs that are working on this (e.g., the MIT Media Lab). The second, like the first, shows how many companies are measuring emotions, in part using advances in AI and Big Data analysis, and cautions us about the uncanny valley. The third part of the book is about the future, and obviously sexbots play a role. What I remember most form the book are its anecdotes, one of the most touching of which was when someone wanted to marry a robot, and a parent opposed this. Will this be the future of marriage? The first step is interracial marriage, then the next is same-sex marriage, and the last (?) step is human-machine marriage. Yonck shows his academic side by citing some ACM/IEEE International Conference on Human-Robot Interaction papers. That’s a niche-style conference but will likely grow into something much larger in the coming years (see the 2018 website here), similar to how NIPS grew from a niche into an enormous conference with thousands of attendees each year. Finally, I appreciated that Yonck said we are already merging with technology in some ways. For instance, many deaf people opt for cochlear implants to better interact in a hearing world. (I likely would have one had I been born a few years later and if hearing aids were not already highly successful for me.) We already merge with technology so much, and this is likely to increase in the future.
Group 2: Technology, Excluding AI/Robotics
These are books loosely related to technology, though excluding AI and robotics, as I discussed those in the previous section.
** The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies **, co-authored by Erik Brynjolfsson and Andrew McAfee. Both authors are from MIT: Brynjolfsson is a professor of management and McAfee a research scientist. The authors appear to take the opposite perspective of economist Robert Gordon (author of The Rise and Fall of American Growth, which I discuss later), a point that they emphasize repeatedly: they argue that we are now at an inflection point and that we are on our way towards better times, and not stagnation. Their key rebuttal to Gordon is that innovation is due to recombination. Sure, we may not invent brand new things like electricity, but the IT revolution was all about combining stuff that had previously existed, and that will continue onwards as more people are able to try new things. As expected, they provide the usual disclaimers (at least from the technologically elite) that technological growth isn’t always great, that people fall behind, etc. To their credit, both men propose solutions, which I think are reasonable and — crucially in today’s politics — are widely agreed upon by economists across the entire political spectrum. For instance, they mention the universal basic income but seem to prefer the more mainstream “earned income tax credit” idea, and I think I can agree. One major quibble I have is that the book has one chapter to AI, but the actual AI portion of it is only two and a half pages long. And this for what might be the biggest technology advance of the 21st century! Fortunately, they seem to have given it greater attention since the book was first published. I bought The Second Machine Age in December 2016 as a Christmas gift, and they included a new introduction saying that they had underestimated progress in AI, particularly with deep neural networks, a topic which I frequently blog about here! (Incidentally, I saw Brynjofsson’s praise for a MOOC on Deep Learning … even MIT professors are going to MOOCs4 to learn about the subject!) The book is relatively straightforward to read and oozes more excitement compared to Gordon’s book. There is a book website for more details if you are interested. Brynjofsson and McAfee have since written more about Deep Learning, as you can see from their NY Times article after AlphaGo famously beat Go super-duper star Lee Sodol. I feel extremely fortunate to be in a position where, though I’m not the one creating this stuff, I can understand it.
** How Google Works ** is a 2014 book (updated in 2017) with a self-explanatory title, written by two of the most knowledgeable people about Google, Eric Schmidt and Jonathan Rosenberg. The former was the CEO of Google from 2001 to 2011 until he stepped down to be come “Executive Chairman” of Google (and then Alphabet later). So … basically he’s shuffling around titles without loss of power, I think.5 Jonathan Rosenberg6 was a longtime Product Manager for Google, and now he is advisor to Alphabet CEO Larry Page. These two men thus know a lot about Google and are well-qualified to talk about it. The book is an entertaining mix of the lessons they’ve learned about working at Google, how to scale it up, etc. I was particularly impressed about stories such as how Jeff Dean et al. found a note from the CEO who complained that “these ads suck”. So in one weekend, despite them not being on the ads team, they were able to fully diagnose the problem. Wow, that’s Google for you. The main takeaway from this book is that I need to be a better smart creative. The only way I know how to do this is by always learning, whether by coding or (as I try to do) read a lot of books. That being said, the book does suffer from trying to describe many concepts that I would argue are obvious and well-known. Many themes, such as “think 10x better, not incremental” are common in books that combine technology and business, such as Peter Thiel’s Zero to One book, which I read last year. Another is that “you can’t apply the lessons you learn in business school” which is again something commonly assumed in the tech industry. Another is “hiring is the most important thing you can do” but Joel Spolsky has already said something similar earlier on his blog. I don’t mean to completely negate the benefits of this book; it seems to maintain just enough of the “uniqueness” balance to make it a worthwhile read. Homer alert: I wish the authors would write a follow-up book where they discuss Artificial Intelligence. After all, current Google CEO Sundar Pichai has made it a point to emphasize AI for Google. To be fair, they mention it as one of the “things that might happen in the next five years”, so right now we’re smack dab in the middle of that time period. I’ll keep watch in case they publish a sequel later. On a final note, reading this book made me want to work at Google!
** Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy ** is a 2016 book by mathematician-turned-data scientist and author (and MathBabe founder) Cathy O’Neil who argues that the use of large datasets in industry and government contexts has, well, increased inequality in our society. She describes stories about how the use of big data to predict whether someone will commit a crime or default on a loan has a harmful feedback loop on the poor and minorities. (Blacks are the minority group emphasized the most in the book.) Why is there a feedback loop? Minorities are more likely to be around people who are committing crimes, and the “birds of a feather” mentality among big data algorithms is that they tend to relate people to those who bear similar qualities. Whereas in the past, for instance, a banker might not have relied on big data but on his instincts to grant or deny loans, which would hurt women and minorities, nowadays we mostly have data algorithms to determine that, but even so, algorithms have their own biases and values (indeed, this is an academic research topic, see the BAIR Blog post on this which also uses Google’s “labeling blacks as gorillas” example of algorithms trained on wrong data). O’Neil calls for increased transparency in these algorithms, which she calls Weapons of Math Destruction (WMDs), and for the people working on these algorithms to better understand the values that are inherent in the models. I enjoyed most of the short, fast-paced book and highly recommend it. It’s also worth noting that O’Neil regularly writes columns about this subject area, which interested readers should check out.
** Thank You for Being Late: An Optimist’s Guide to Thriving in the Age of Accelerations ** is Thomas L. Friedman’s most recent book, published in late 2016 (though the manuscript was done before the outcome of the presidential election). For a long time, I’ve been following Friedman’s Sunday weekly columns at The New York Times, which has served as a preview for what’s to come in the book: Moore’s Law, the refugee/migration crisis, unstable governments, droughts and climate change, and the polarization at the highest level of American politics. Friedman goes through these and discusses topics much like he did in The World is Flat, though I think he tempers his idiosyncratic writing style. He mentions at one point a handful of policy changes he’d like to do, and claims he’s neither left nor right politically and that those labels are now outdated. For instance, he’s very free trade (right) but also for single-payer health care (left). I was duly impressed from the book because it taught me much about how the world works today. It also made me appreciate that I’m in a position where I can take advantage of what the world has to offer. Thank You for Being Late also mentioned several technical topics that I’m passionate about. It was really nice to see a mainstream, “non-technophobe” talk about Moore’s Law, GitHub7, and even TensorFlow/Deep Learning (!!); he explained these topics as well as he could given the non-technical nature of the targeted audience. I also appreciated the surgeon general’s comment in the end that America’s biggest killer “was not heart disease, but isolation” which is ironic given how we are more connected than ever before. Ultimately, I want to be part of that acceleration and, of course, to ensure that the vast majority of Americans aren’t left behind (including myself!). The book, however, made me concerned about the future. I finished this just a few days before Trump was inaugurated so … hopefully things will be OK.
What to Do When Machines Do Everything: How to Get Ahead in a World of AI, Algorithms, Bots, and Big Data is a recent 2017 book by three leaders from Cognizant, a firm which I didn’t know about beforehand. This book takes the now-standard view (at least among many technology thinkers) that automation will be overall better for us, destroying some jobs but also creating new ones and clearing out old drudgery. One thing the authors note which I haven’t heard before is that they subscribe to the “S-Curve”: we’re in a “stall” zone, but for the next two decades, we will experience dramatic economic growth with more equalizing effects as it relates to income distribution. I find this hard to believe, unfortunately. Another perspective the authors bring is that once old entrenched companies make more of a digital transition, that’s when we’ll really see GDP take off. Regarding the book style, it’s short and written in a mini-textbook style. The abbreviations in it were a bit corny but I enjoyed the examples, at least the ones they had. I surprisingly didn’t seem to enjoy it as much as some other similar books, probably because some of their advice is really high level and generic, over-simplifying things. All in all, I think the book is mostly correct on a technical level but may not be my style.
Group 3: Business and Economics
I badly need to better understand the world of business, particularly due to the increasing business-related importance of Artificial Intelligence nowadays.
How the West Grew Rich: The Economic Transformation of the Industrial World. This is an old 1986 book by the late economist and historian Nathan Rosenberg and co-author L.E. Birdzell Jr, an attorney and legal scholar, and I have several quick thoughts. The first was that this book was a real slog for me to read. It’s not even close to being the longest book I’ve read8 but I had to struggle through it; I think the writing style of 1986 is different from the one I’m used to today, but I’m also partly to blame since I spaced out my reading over many evenings when I was tired. In any case, this book is about capitalism in some sense, though the authors complain that the term is misleading. Their main argument is that the freedom of business and enterprise from religious and political control was the key factor in explaining the rise of the West, and not other factors generally attributed, such as science or mass production. Judging from the book, the prevailing wisdom at that time may have been mass production, but apparently not to them. It’s a bit interesting to think about what conclusions they make which are still relevant today, like how it’s so hard for Third World countries to catch up. I was also amused at seeing the Soviet Union mentioned so much, and I had to remind myself: 1986, 1986, 1986. (In a shout-out to the AI people reading this, that was the year when Rumelhardt, Willimans, and Hinton published their famous backpropgation paper with “readable math”.) Ultimately, while this book has some good spots in it, I lost focus too much to really benefit from it, and I think The Rise and Fall of American Growth is a vastly superior alternative, unless you want to get a better understanding of European stuff (not just American) and also some discussion about the Middle Ages.
Shop Class As Soulcraft: An Inquiry Into the Value of Work, is a 2009 book written by Matthew Crawford, who has one of the most unusual profiles among authors I read. Crawford is a mechanic and works at a bike shop, but he also holds an undergraduate degree in physics and a PhD in political philosophy from the University of Chicago. After his PhD, he worked at a “think tank” (where he had to basically repeat what the oil companies wanted to say about global warming) and at a firm where his job was to basically rewrite abstracts of research papers (what?!?). His true heart lies in building things, where he gets value. Crawford is concerned that today’s white-collar emphasis of the world focuses too much on removing value from humans (whereas mechanics can just point and say “here’s my result!”), and the white-collar blue-collar divide is making the mechanics earn less respect across society. I am indeed concerned that this is true, especially with today’s political divide among the college-educated and non-college educated, and I wish that more people with solid academics, those who have “never failed” had a little more humility. (I certainly feel like I’ve failed a lot and I’m pretty academically credentialed compared to a lot of others.) In a sense, I get the feeling that this book is like Jaron Lanier’s “You Are Not a Gadget” — those two men might see a lot of common ground over their critiques of modern life, though for different reasons. One takeaway from the book is that I’m happy to be where I am since I can try and perform deep work to produce results (code, papers, etc.) that people can look at, as I am doing more frequently on my GitHub account. Unsurprisingly, this was the key point Cal Newport made from this book. Lastly, I can’t resist mentioning one of the most interesting parts of this book. In a footnote in the seventh chapter, he talks about AI in the context of lamenting how humans were often being reduced to simple straightforward rules. His footnote talks about computer science and says the one hope for AI in the future is with neural networks since they’re not reduced to simple rules. Wow … that was in 2009 (before Alex-Net, etc.). Even your bike shop repairman knows about Deep Learning!
** The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses ** is a 2011 book by software entrepreneur Eric Ries, known for co-founding IMVU and then later for consulting various start-ups through his Lean Startup methodology. In this book, Reis provides a guide for start-ups, which he defines as: “[…] a human institution designed to create a new product or service under conditions of extreme uncertainty”. Note the lack of any comment about company size, and also note the inclusion of extreme uncertainty. This means start-ups can include non-profits, large companies, and even governmental organizations, so long as there is initial uncertainty in their roadmap. The Lean Startup argues that, in order for startups to thrive, they must follow a Build-Measure-Learn feedback loop. Furthermore, that loop must be their competitive advantage compared to slower, bulkier competition. By building Minimum Viable Products, Reis argues that appropriate metrics (not “vanity metrics” as he calls them) and customer feedback can be measured rapidly. Understanding these early results then guides the startup towards the next step, which may or may not involve the painful act of pivoting to change strategies. The book’s advice appears sound and reasonable. While I certainly don’t have much experience in this area to fully critique the book, Reis has famous tech titans such as Sheryl Sandberg and Andrew Ng to vouch for the book, so I think I can trust the advice. (I found out about this book from Andrew Ng’s reading recommendations.) While reading the book, I imagined what I would do if I tried to create (or more realistically, join) a start-up. My PhD program isn’t going to last forever … but I suppose while I’m here, I should emphasize the Minimum Viable Product aspect with respect to research.
** The Hard Thing About Hard Things: Building a Business When There are No Easy Answers ** is a 2014 book by billionaire venture capitalist Ben Horowitz where he recounts his experience running Loudcloud and Opsware as CEO. The book starts out by first describing the CEO experience. Then, Horowitz remarks about the lessons he’s learned and outlines recommendations and guides on what he thinks CEOs should be like. The book concludes with him explaining how he founded the venture capital firm Andreessen Horowitz, which he’s still running today9 to help technical founders become groomed CEOs. The book is fast-paced and feels like a high-octance novel, because Horowitz’s tenure at Loudcloud and Opsware was anything but smooth. Horowitz argues that there are peacetime CEOs and wartime CEOs, of which he was definitely the latter as he estimates he only had “three days of peace” when running the company. Loudcloud and Opsware initially raised a lot of money, but then after the dot-com crash, they struggled a lot and I’m amazed that Horowitz turned it around and eventually sold the company for $1.6 billion to Hewlett-Packard. Reading his story, and Elon Musk’s story (which I’ll get to later), makes me wonder how these two CEOs managed to pull their companies out from the financial brink. I am kind of surprised that something “comes out of nothing,” and one of my disappointments is that it arguably spends a lot of time on Horowitz’s lessons for the CEO whereas I would have preferred more details on his CEO experience (at least, more than what’s in the book) because, again, I don’t understand how companies can go to a billion dollars’ worth of value out of nothing. I really need to step in the shoes of a CEO one of these days. But perhaps I would have better understood this if I understood more about business, and I certainly learned a lot about the business word from reading this book. For instance, I’m embarrassed to say that I only had a vague notion of what it meant for “a company to go public” but reading this book (and checking Wikipedia, Investopedia, and other online resources in parallel) made me better understand the process.
** The Rise and Fall of American Growth: The U.S. Standard of Living Since the Civil War ** is a long book by economist Robert Gordon, published in January 2016. For an academic-style book that’s 762 pages long, it is quite well-known, particularly due to the current debates on economic growth in politics. On Google, you can find pages and pages of reviews for Gordon’s book. Most of them mirror Bill Gates’ review in that they praise Gordon for providing a surprisingly complete historical picture of what American life was like in 1870, and how it was completely transformed in the “great century” to 1970. Gone were the days of darkness, backbreaking labor, endless drudgery in chores, and a stale diet, among other things, and in place of those came the electric light bulb, work conditions in heated and air-conditioned offices, the internal combustion engine (leading to automobiles and the airplane), and shopping centers to buy a variety of food and clothes. Gordon’s thesis is that since 1970, America has been in a long reign of slow growth despite the recent progress in AI, IT, and other tech-related fields. There are two reasons: these advances do not match up with those from previous generations (to echo Peter Thiel, “we wanted flying cars but ended up with 140 characters [Twitter]”), and there are headwinds preventing rapid growth such as income inequality, college debt, and demographic trends. He ends with a brief postscript on policy actions that might be useful to counter these trends; I wish more politicians would take note of them as some of his suggestions have broad appeal nowadays. This book is amazing, and despite my close connection with the technology sector, I agree with his thesis. Bill Gates counters by suggesting we’re on the cusp of medical advances, but I’m heavily skeptical about researchers finding cures for cancer and Alzheimer’s disease. It might be challenging for the average reader to go through a book this long, especially one packed with figures and footnotes. My advice? Read it. It’s worth it. I have probably learned more from this book than I have from any other.
Group 4: Biographies and Memoirs
I am reading biographies of famous people because I want to be famous someday. My aim is to be famous for a good reason, e.g., developing technology that benefits large swaths of humanity. (It is obviously easier to become famous for a bad reason than a good reason.)
** Alan Turing: The Enigma ** is the definitive biography of Alan Turing, quite possibly the best computer scientist of all time. The book was written in 1983 by Andrew Hodges, a British mathematics tutor at the University of Oxford (now retired). I discussed this in a separate blog post so I will not repeat the details here.
** My Beloved World ** is Supreme Court Justice Sonia Sotomayor’s memoir, published in 2013. It’s written from the first person perspective and outlines her life from starting in South Bronx and moving up to her appointment as a judge to US District Court, Southern District of New York. It — unfortunately — doesn’t talk much about her experiences after that, getting appointed to the United States Court of Appeals for the Second Circuit in 1998, and of course, her time on the nation’s highest court starting in August 2009. She had a father who struggled with alcoholism and died when she was nine years old, and didn’t appear to be a good student until she was in fifth grade, when she started to obsess over getting “gold stars.” (I can attest to a similar experience over obsessing to get “gold star-like” objects when I was younger.) She then, as we all know, did well in high school and entered Princeton as one of the first incoming batch of women students and Hispanic students, graduating with stellar academic credentials in 1976 and then going on to Yale Law School where she graduated in 1979. The book describes her experiences in vivid terms, and I liked following through her footsteps. I feel and share her pain at not knowing “secrets” that the rich and privileged students had when I was an undergrad (I was clueless about how finance and investment banking jobs worked, and I’m still clueless today.) Overall, I enjoyed the book. It’s brilliantly written, with an engrossing, powerful story. I will be reminded of her key attribute of persistence and determination and focus which she says were key. I’m trying to pursue the same skills myself. While I understand the low likelihood of landing such tiny jobs (e.g., the tech equivalent of a Supreme Court Justice) I do try and think big and that’s what motivates me a lot. I read this book on a day trip where I was sitting in a car passenger seat, and I sometimes dozed off and imagined myself naming various hypothetical Supreme Court Justices.
An Appetite for Wonder: The Making of a Scientist is Richard Dawkins’ first of two (!!) autobiographies, published in 2013 and which accounts for the first half of his life. Dawkins is one of the most famous and accomplished scientists today, not only in terms of raw science but also with respect to public outreach and fame (whether famous, in my opinion, or infamous, if otherwise), so perhaps two books is justified. Dawkins discusses his childhood, which he first spent in Africa before moving to England to attend boarding schools; he remarked that the students seemed to be relatively stronger in Africa. I sometimes wish I had attended boarding schools instead of my standard public schools, since perhaps I would have developed independence faster, so it was interesting to read his perspective. After this, Dawkins talks about his undergraduate years at Oxford10 (where his relatives had gone) and this is where I really want to know what he did, because I’m hoping to use my own “appetite for wonder” in science, since I think Artificial Intelligence is the new electricity. But anyway, Dawkins became a professor at Berkeley (!!) but he quickly left to return to England for another position. This book ends with his publication of The Selfish Gene, a book that I want to read one of these days. I’m impressed: it’s a challenge to write an autobiography, but fortunately, Dawkins’ parents saved a lot of letters and information, so that’s good. The book, however, is likely aimed at a niche audience of readers. It was also interesting to Dawkins used to be religious, before becoming an atheist by his late teens (like me, though I was never religious at all). I also liked his stories about computer programming and research, which were a lot simpler back then but presumably harder due to lack of documentation and the Internet.
Brief Candle in the Dark: My Life in Science is Richard Dawkins’ second autobiography, written in 2015 and covering the second half of his life, at least, up to that point (he could theoretically still have 30 more years if he lives long enough). He writes more about his life as a professor at the University of Oxford, including his time as the inaugural Simonyi “Professor of the Public Understanding of Science,” which I have to admit was an odd title when I first read it in the back flap of my copy of The God Delusion. He certainly has helped me understand things in this world, and it’s true that I consider Richard Dawkins to be one my heroes. On the other hand, I’m not sure most lay readers would be willing to slog through both of his autobiographies, so keep this in mind in case you’re on the fence about reading this book. It is a non-chronological history of his academic life about debates (uh oh), being on television, writing books, giving talks, and so forth. Dawkins describes various stories about him with other famous people. I also learned a little more about basic evolution. His previous autobiography highlighted how genes — and not the individual — are the unit of evolution, but in his book The Extended Phenotype, he talks about an extension of natural selection onto the physical world (interesting, though one must not misinterpret this). He also emphasizes, and this is something I agree with, that natural selection can still explain complicated structures today that creationists use as evidence against evolution, such as the eye. Natural selection is the only theory we have of what can work gradually and cumulatively. This is key for developing complicated structures; in the absence of evidence, God should not be the default option. I also liked other tidbits of the book, such as how Dawkins did a lot of “evolutionary programming” — I bet he would be interested in reading the research paper Evolution Strategies as a Scalable Alternative to Reinforcement Learning.
** Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future ** is a 2015 biography of Elon Musk, written by technology journalist Ashlee Vance. He documents Musk’s chaotic life, both nowadays as the CEO of SpaceX and Tesla and before, when he was struggling to get his companies off the ground and earlier still, when he started his entrepreneurial spirit by starting Zip2. Musk grew up in South Africa and moved to Canada so that he could get to the United States as quickly as possible. Musk had some initial businesses successes, but was forced out of X.com (which particularly hurt as Musk is the personality who wants full control over his companies), then earned some more success with Tesla and SpaceX before teetering on the brink of collapse at the end of 2008 (you know, like the financial “oopsie” we had). Then later, those companies recovered, Musk married a young actress, divorced, then re-married (then divorced again …). The book concludes with some thoughts on Musk’s wild personality and ambitions, and basically says that there is no one like Musk, who still is holding out hope for humans to go to Mars by 2025. This book is an absolute thrill to read. Vance brilliantly writes it so that the reader often feels like he or she is swept up into “Elon Musk”-mode: hard-working, super-charging, and borderline out of control. After reading it, I kept thinking that my work ethic is too soft and weak, and that I better get back to working sixteen hours a day (or less, if I’ve focused really hard in fewer hours). I have two main criticisms of this book. The first is that I was hoping to see more information from his two wives, and for this I’ll probably have to relegate myself to reading Justine Musk’s blog. The second and most important critique is that when Vance wrote an updated epilogue in January 2017, which was five months before I bought the book at Chicago O’Hare International Airport, he never mentioned Musk’s investment in OpenAI, a nonprofit AI research company which aims to produce, or pave the path to, artificial general intelligence. In their introductory blog post from December 2015, they claim to have 1 billion dollars in investment. I’m not sure how much Musk contributed to that, but it must have been a lot!
Keeper of the Olympic Flame: Lake Placid’s Jack Shea vs. Avery Brundage and the Nazi Olympics, a recent 2016 book by Michael Burgess, is one that hits home to me because Jack Shea was my great-uncle. Jack Shea was born and raised in Lake Placid, New York, and as a 21-year-old competitor in the 1932 Winter Olympics, he won two gold medals in speed-skating, becoming a hometown hero and putting Lake Placid “on the map.” A few years later, when it became apparent that the next Winter Olympics were going to be in Nazi Germany, Shea urged Avery Brundage (then in charge of American involvement in the Olympics) to boycott out of concerns over Adolf Hitler’s treatment of Jews and other minorities. Shea did not participate in those controversial Olympics, but the Americans did send a team, with relatively disappointing speed-skating results. The book then discusses more about the intersection between politics and sports, and also talks about the odd déjà vu when Lake Placid again hosted the Winter Olympics in 1980, again with politics causing tension (this time, from the Soviet Union). For obvious reasons, I enjoyed reading this book despite its flaws: it’s short and has obvious typos. I like knowing more about my ancestors and what they did, and the photos were really striking. My favorites include 19-year-old Shea shaking hands with then-governor Franklin Roosevelt, another one with Shea and his extended family (including my grandfather), and a third which shows a pre-teen Shea and his brother, Eugene, already in skates. Eugene, incidentally, lived to be 105 years old (!!) before passing away in October this year (obituary here) and was able to contribute photos and assistance to the author. I got to meet Jack Shea once, and he might very well have lived to be 100 years of age had he not been killed by a drunk driver at the age of 91. This was 17 days before his grandson would end up winning a gold medal in the 2002 Salt Lake City Winter Olympics. In my parent’s home, there is a photo of me with my cousin holding his gold medal (it was heavy!). I also have a separate blog post about this book soon after I read it.
Group 5: Conservative Politics and Thoughts
Well, this will be interesting. I’m not a registered Republican, though I possess a surprisingly large amount conservative beliefs, some of which I’m not brave enough to blog about (for obvious reasons). In addition, I believe it is important to understand people’s beliefs across the political spectrum, though for this purpose I exclude the extreme far left (e.g., hardcore Communists) and right (e.g., the fascists and the Ku Klux Klan).
** Please Stop Helping Us: How Liberals Make it Harder for Blacks to Succeed ** a 2014 book written by Wall Street Journal columnist Jason Riley. It’s no secret that (a) most blacks tend to be liberal, I would guess due to the liberals getting the civil rights movement correct in the 1960s, and (b) blacks tend to have more credibility when criticizing blacks compared to whites. Riley, as a black conservative, can get away with roundly criticizing blacks in a way that I wouldn’t do since I do not want to be perceived as a racist. In Please Stop Helping Us, Riley “eviscerates nonsense” as described by his hero, Thomas Sowell, criticizing concepts such as the minimum wage, unions, young black culture, and affirmative action policies, among other things, for the decline in black prosperity. His chief claim is that liberals, while having good intentions, have not managed to achieve their desired results with respect to the black population. He also laments that young blacks tend to watch too much TV, engage in hip-hop culture, and the like. One of his stories that stuck with me was when a young (black) relative asked him: “why are you so white”, when all Riley did was speak proper English and tuck in his shirt. Indeed, variants of this story are common complaints that I’ve seen and heard about from black students and professionals across the political spectrum. I don’t agree with Riley on everything. For instance, Riley tends to ignore or explain away issues regarding racism as it relates to the lack of opportunities for job promotions or advancement, or when blacks are penalized more relative to others for a given crime. On the other hand, we agree on affirmative action, which he roundly criticizes, pointing out that no one wants to be the token “diversity hire”. To his credit, he additionally mentions that Asians are hurt the most from affirmative action, as I pointed out in an earlier blog post, making it a dubious policy when it come to advancing racial equality. In the end, this book is a thought-provoking piece about race. My impression is that Riley genuinely wants to see more blacks succeed in America (as I do), but he is disappointed that the major civil rights battles were all won decades ago, and nowadays current policies do not have the same positive impact factor.
** The Conservative Heart: How to Build a Fairer, Happier, and More Prosperous America **, is a 2015 book by Arthur Brooks, the president of the American Enterprise Institute, officially a nonpartisan think tank but widely regarded (both inside and outside the organization) as a place for conservative public policy development and analysis. Brooks argues that today’s conservatives, while they have most of the technical arguments right (e.g., on the benefits of free enterprise), lack the “moral high ground” that liberals have. Brooks cites statistics showing that conservatives are seen as less compassionate and less caring than liberals. He argues that conservatives can’t be about being anti-everything: government, minimum wage increases, food stamps, etc. Instead, they have to show that they care about people. They need to emphasize an equal starting line for which people can flourish, which contrasts with the common liberal perspective of making the end product equal (by income redistribution or proportional racial representation). One key point Brooks emphasizes is the need for work fulfillment and purpose instead of lying around while collecting checks from the American welfare state. I liked this book and found it engaging and accessible. It is, Brooks says, a book for a wide range of people, including “open-minded liberals” who wish to understand the conservative perspective. I have two major issues with his book, though. The first is that while he correctly points out the uneven recovery and the lack of progress on fixing poverty, he fails to mention the technological forces that have created these uneven conditions (see my technology, economics, and business related books above), much of which is outside the control of any presidential administration or Congress. The second is that I think he’s been proved wrong on a lot of things. President Donald Trump is virtually none of the stuff that a conservative “heart” would suggest and, well, he was elected President (after this book was published, to boot). I wish President Trump would start following Brooks’ suggestions.
Conscience of a Conservative: A Rejection of Destructive Politics and a Return to Principle is a brief 2017 book/manifesto by U.S. Senator Jeff Flake of Arizona. Flake is well known for being one of those “Never Trump” style of Republicans since he remains true to certain Republican principles that have arguably fallen out of favor with the recent populist surge of Trump-ian Republicanism in 2016, such as free trade and limited government spending. And yes, I don’t think Republicans can claim to be the party of fiscal prudence nowadays, since Trump is decidedly not a limited spending conservative. In this book, Senator Flake argues that Republicans have to get back to true, Conservative principles and can’t allow populism and immaturity to define their party. He laments at the lack of bipartisanship in Congress, and while he makes it clear that both parties are to blame, in this book he mostly aims at Republicans. This explains why so many Republicans, including Barry Goldwater’s relatives, dislike this book. (Barry Goldwater wrote a book of the same title, “Conscience of a Conservative”, from which Jeff Flake borrowed the title.) I sort of liked this book but didn’t really like it. It still fails to address the notion of how the parties have fallen apart, and he (like everyone else) preaches bipartisanship without proposing clear solutions. Honestly, I think the main reason I read it was not that I think Flake has all the solutions, but that I sometimes think of myself in Congress in my fantasies. Thus, I jumped at the chance to read a book directly from a Congressman, and particularly a book like this where Flake bravely didn’t have his staff revise it to make it more “politically palatable.” It’s a bit raw and lacks the polish of super-skilled writers, but we shouldn’t hold Senators to such a high writing standard so it’s fine with me. It’s unfortunate that Flake isn’t going to seek re-election next year.
Group 6: Self-Help and Personal Development
I’m reading these “personal development” books because, well, I want to be a far more effective person than I am right now. “Effectiveness” can mean a lot of things. I define it as being vastly more productive in (a) Artificial Intelligence research and (b) my social life.
** How to Win Friends and Influence People: The Only Book You Need to Lead You to Success ** is Dale Carnegie’s famous book based on his human interaction courses. It was originally published in 1936, during the depths of the Great Depression, making this book by far the oldest one I’ve read this year. I will not go into too much depth about it since I wrote a summary in an earlier blog post. The good news is that 2017 has been a much better year for me socially, and the book might have helped. I look forward to continuing the upward trend in 2018, and to read other Dale Carnegie books.
** The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change **, written by Stephen R. Covey in 1989, is widely considered to be the “successor” to Dale Carnegie’s classic book (see above summary). In The 7 Habits, Covey argues that the habits are based on well-timed principles and thus do not noticeably vary across different religious groups, ethnic groups, and so forth. They are: “Be Proactive”, “Begin With the End in Mind”, “Put First Things First”, “Think Win-Win”, “Seek First to Understand, Then to be Understood”, “Synergize”, and “Sharpen the Saw”. You can find their details on the Wikipedia page so I won’t repeat the points here, but I will say that the one which really touches upon me is “Think Win-Win”. In general, I am always trying to make more friends, and I’d like these to be win-win. My strategy, which aligns with Covey’s (great minds think alike!), is to start a relationship by doing more work than the other person or letting the other person benefit more. Specifically, this means that I will be happy to (a) take the initiative in setting meeting times and any necessary reservations, (b) drive or travel farther distances, (c) let the other person choose the activity, and so forth. At some point, however, the relationship needs to be reciprocal. Indeed, I often tell people, subtly or not so subtly, that the true test of friendship is if friends are willing to do things for you just as much as you do to them. With respect to the six other principles, there isn’t much to disagree. There is striking similarity to Cal Newport’s Deep Work when Covey discusses high-impact, Quadrant II activities. Possibly my main disagreement with the book is that Covey argues how these principles come (to some extent) from religion and God. As an atheist, I do not buy this rationale, but I still agree with the principles themselves and I am trying to follow them as much as I can. This book has earned a place on my desk along with Dale Carnegie’s classic, and I will always remember it because I want to be a highly effective person.
You are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life is a 2013 book by self-help guru Jen Sincero. It’s deliberately written in a very “teenage”-like way, where the author acts like she’s talking directly to the reader as the self-help coach. The target audience seems to be people who have “screwed up” and feel like their life is not as awesome as it could be. She goes through 27 relatively short chapters, each with different generic advice, though she does repeat this each chapter: love yourself. I definitely need reminders about that, since I don’t feel like I am achieving enough in life. However, I was somewhat skeptical of her advice and in general I am a self-help skeptic since I think it’s better for me to build my technical skills than to try and optimize advice from self-help books. Overall, I did not enjoy this book (largely due to the writing style), and I’m surprised it’s gotten so much critical acclaim and that it’s a best-seller. Yes, I will “love myself” but I can’t see myself remembering many other tidbits about this book that I didn’t already know before (e.g., think positive!!). Perhaps this book would be better suited with some concrete success stories of Sincero’s clients.
Group 7: Psychology and Human Relationships
These books are about psychology, broadly speaking, which I suppose can include “human relationships”. I thoroughly enjoyed reading all four of these books.
** Thinking, Fast and Slow ** is a famous 2011 book by Daniel Kahneman, winner of the 2002 Nobel Prize in Economics for his work on decision making. This is a book about psychology and how humans think, and much of it is based on Kahneman’s research with Amos Tversky many decades ago. To make the concepts clearer to the reader, Kahneman describes a story consisting of System 1 and System 2. These are the fast and slow parts of our thinking, respectively, so the former represents our immediate intuition and the latter reflects what happens after we expend nontrivial amounts of effort on some task. Thinking, Fast and Slow is filled with informative anecdotes, thrilling insights, and unexpected contradictions about the way humans think, and supplements those with exercises to the reader. (I normally find these annoying, but here they were reasonable.) Possibly the biggest insight I gained is that human thinking is flawed and is easily manipulated, so I better be extra cautious if I have to make important judgments in my life. (For minor life decisions, I don’t have a hope of remembering all the advice in this 400+ page book.) To be clear, I already knew that humans behaved irrationally, but Kahneman does an excellent job in putting my haphazard thoughts about human irrationality on more solid footing. Kahneman augments that with related topics such as overconfidence (a major issue with CEOs and start-ups) and how anchoring, priming, and baselines influence human preferences. After reading the book, all I can say is, I think (pun intended!) Thinking, Fast and Slow lives up to its billing as a true classic.
** To Sell Is Human: The Surprising Truth About Moving Others ** is a 2012 book by best-selling author Daniel Pink. He argues that we should stop focusing on outdated views of salespeople. That is, that they are slimy, conniving, attempting to rip of us off, etc. Today, one in nine work in “sales” but Pink’s chief message to the reader is that the other eight of nine are also in sales. We try to influence people all the time. So in some sense this is obvious. I mean, come on, if I’m aiming to get a girlfriend, then I’m trying to influence her based on my positive qualities.11 For academics, we sell our work (i.e., research) all the time. That’s what Pink means when he says “everyone is working in sales.” He argues that nowadays, the barriers have fallen (he almost says “The World is Flat” a la Thomas L. Friedman) and that salespeople are no longer people who walk door by door to ask people to buy things. That’s outdated. One possible negative aspect of the book is that I don’t think we need this much “proof” that we’re all salespeople. Yes, some people think only in terms of that job, but all you have to do is say: “hey everyone is a salesperson, if you try to become friends with someone, that counts…” and if you tell that to people, all of a sudden they’ll get it and I don’t think belaboring the point is necessary. On the positive side, the book contains several case studies and lists of things to do, so that I can think of these and reread the book in case I want to apply them in my life. Indeed, as I was reading this book, I was also thinking of ways I could convince someone to become friends with me.
** Lean In: Women, Work, and the Will to Lead ** is a well-known 2013 book by Facebook COO Sheryl Sandberg. It’s a semi-memoir which also acts as a manifesto for women (and men) to be more aware of the gender gap in “prestigious positions” and how to counteract it. By such “prestigious positions” I mean CEOs (particularly of top companies), politicians, and other leadership positions. Women occupy fewer of these positions than men in virtually every country in the world, and Sandberg wants this to change. She outlines numerous factors that hold women back, not all of which are obvious. Her first example deals with parking spots reserved for pregnant women, in which she admits she (despite being a woman!) didn’t think about until she became pregnant herself. Pregnancy is a major focus in this book, along with work-life balance, a typical inclusion in books about women and careers. Sandberg also recounts stories about women being quiet in meetings or not taking seats in the center of a meeting table even when prompted to do so, and lowering their hands when people say there are no more questions (despite how men keep their hands up and thus get to ask more questions). This forms the overall basis for her advice that women must “lean in” and be more involved in discussions. I liked reading this fast-paced book but I also almost felt disappointed, since I anticipated much of the material in advance. Perhaps it’s because I read about gender-related issues frequently. Another possible explanation is that it is hard for me to participate in group meetings, so I often spend more time observing people and noticing things rather than focusing on the subject at hand. On a final note, I’d like to mention that I do, in some sense, believe that “other men are the problem, not me” though I would never say this in public to someone, because (a) it’s politically charged, and (b) I could, of course, make a mistake in the future and thus I would be hypocritical and have to eat my own words. In my adult life, I do not believe I have ever done anything blatantly sexist, though I certainly worry a lot about committing “microaggressions” when I interact with women, and do my best to avoid them to make my female (as well as male) conversationalists feel respected and comfortable.
** Originals: How Non-Conformists Move the World ** is a recent 2016 book by famous Wharton professor Adam Grant, also known as the author of Give and Take. I’ve been aware of Grant for some time, in part because he’s been featured in Cal Newport’s writing as someone who engages in the virtues of Deep Work (see an excerpt here). Yeah, he’s really productive, finishing a PhD in less than three years12 and then becoming the youngest tenured professor at his university. But what is this book about, anyway? In Originals, Grant argues that people who “buck the trend” are often ones who can make a difference for the better. As I anticipated ahead of time, Martin Luther King Jr is in the book, but not for all the reasons I thought. One of them — why procrastination might actually have been helpful (i.e., first mover disadvantage) for him when he was crafting his “I Have a Dream” speech, though one was more realistic: focusing on the victims of crimes (blacks facing discrimination) rather than criticizing the perpetrators. Another nice tidbit from Grant was making sure to emphasize the downsides of your work rather than the positives to venture capitalists, as that will help you look more sincere. Other stuff in this book include how to foster a correct sense of dissent in a company (e.g., Bridgewater Associates is unique in this regard because people freely criticize the billionaire founder Ray Dalio). I certainly felt like some of this was cherry-picking, which admittedly is unavoidable, but this book seems to pursue that more than others. Nonetheless, a lot of the advice seems reasonable and I hope to apply it in my life.
Group 8: Miscellaneous
These books, in my opinion, don’t neatly align in one of the earlier groups.
** Knocking on Heaven’s Door: How Physics and Scientific Thinking Illuminate the Universe and the Modern World ** is Harvard physics professor Lisa Randall’s second of three major books. Last year, I read her most recent book Dark Matter and the Dinosaurs, so this is “going back” in time back to 2011. Sorry, I know should have read them in order. But anyway, this book is a fascinating exploration of what I argue are two major topics. First, the Large Hadron Collider — the well-known experimental setup that revealed the Higgs Boson particle in 2012 and earned Peter Higgs an Nobel Prize. Randall describes how the experiment was set up in great detail, but with juuuuuust enough clarity for non-physicists like me to barely follow. I don’t have much knowledge about the LHC, and indeed I didn’t even know it was a fantastic engineering feat; it is an enormous system built deep underground in Europe, as the pictures in the book helped to illuminate. The second major part of the book is about scientific thinking itself: why do scientists revise theories, why is the notion of scale important, why is quantum mechanics important at smaller distances, but why can we “average out” its effects with Newtonian physics? I learned a little about how the Standard Model in physics works, and it was great to see how she describes the scientific approach to thinking. Randall also discusses cosmology in this book, but it’s much shorter relative to particle physics and feels slightly out of place, but fortunately any reader who wants an overview in cosmology should just read Dark Matter and the Dinosaurs. Overall, this is a book that somehow remains fascinating and mostly accessible despite all the physics facts and jargon. It’s tricky to write science books for the general public. Randall does a good job in that when I was reading the book and felt somewhat confused at the jargon, I felt like it was my fault for my incompetence, and not hers. I am now thinking about reading her first book, Warped Passages, or her e-book on the Higgs Discovery. I’ll definitely be on the lookout for any future books she publishes!
** The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t ** is Nate Silver’s 2012 book where he urges us to consider various issues that might be adversely affecting the quality of predictions. They range from the obvious, such as political biases which affect our assessment of political pundits (known as “hedgehogs” in his book), and perhaps less obvious things such as a bug in the Deep Blue chess program which nonetheless grandmaster Gary Kasparov took to meaning that Deep Blue could “predict twenty moves into advance.” I really enjoyed this book. The examples are far ranging: how to detect terrorist attacks (a major difficulty but one with enormous political importance) to playing poker (Silver’s previous main source of income), to uncertainties involving global warming models (always important to consider), and to the stock market (this one is hardest for me to understand given my lack of background knowledge on the stock market, but I am learning and working to rectify this!). The one issue I have is that Silver seems to just assume: hey let’s apply Bayes’ rule to fix everything, so that we have a prior of , and we assume the probability of … and therein lies the problem. In real settings we rarely get those and values to a high degree of accuracy. But I have no issue with the general idea of revising predictions and using Bayes’ rule. I encourage you to see a related critique in The New Yorker. The reality, by the way, is that most current professional statisticians likely employ a mix of Frequentist and Bayesian statistics. For a more technical overview, check out Professor Michael I. Jordan’s talk on Are You A Bayesian or a Frequentist?.
** The Soul of an Octopus: A Surprising Exploration into the Wonder of Consciousness ** is a splendid 2015 book by author Sy Montgomery, who has written numerous biology-related books about animals. I would call this entirely a popular science book; it’s more like a combination of the author discovering octopuses and describing her own experience visiting the New England aquarium, learning how to scuba dive, watching octopuses having sex in Seattle, and of course, connecting with octopuses. To be frank, I had no idea octopuses could do any of the things she mentions in the book (such as walking on dry land and squeezing through a tiny hole to get out of a tank). Clearly, aquariums have their hands full trying to deal with octopuses. Much of the book is about trying to connect with the three octopuses the New England aquarium has; the author regularly touches and feeds the octopuses, observing and attempting to understand them. I was impressed by the way Montgomery manages to make the book educational, riveting, and emotional all at once, which was surprising to me when I found out about the book’s title. It’s surely a nice story, and that’s what I care about.
Nothing Ever Dies: Vietnam and the Memory of War is a book by USC English Professor Viet Thanh Nguyen, published in 2016 and a finalist for the National Book Award in Non-Fiction that same year. It’s not a recap or history of the Vietnam War (since that subject has been beaten to death) but instead it focuses specifically on how people from different sides (obviously, American and Vietnamese, but also the rest of the world) view the war, because that will shape questions such as who is at fault and should make reparations and also how we can avoid similar wars in the future. It’s an academic-style book, so the writing is a bit dry and it’s important not to read this when tired. I think it provides a useful perspective on the Vietnam War and memories in general. Nguyen travels to many areas in Vietnam and Asia and explores how they view America — for instance, he argues that South Korea attempts to both ally with the US and look down on Vietnam with contempt. I found the most thought-provoking discussion to be about identity politics and how minorities often have to be the ones describing their own experiences. I’ve observed this in the books I read, in which if they’re written by a minority author (and here I’ll include Asians despite how critics of the tech industry bizarrely decide otherwise) are often about said minority. Other interesting (though obvious) insights include how the entire war machine and capitalism of the US means it can spread its memories of the war more effectively than Vietnam can. Thus, the favorable American perspective of the US as attempting to “save” minorities is more widespread, which puts America in a better light than (in my opinion, channeling my inner Noam Chomsky) it deserves.
The Once and Future Liberal: After Identity Politics is a short book (describing it as an essay is probably more accurate) written by humanities professor Mark Lilla of Columbia University. This book grew out of his fantastic (perhaps my all-time favorite) Op-Ed in the NYTimes about the need to end identity politics, or specifically identity liberalism. I agree wholeheartedly; we need to stop treating different groups of people as monolithic. Now, it is certainly the case that racism or mistreating of any group must be called out, and white identity politics is often played on the right, versus the variety of identities on the left. Anyway, this short book is split into three parts: anti-politics, pseudo-politics, and politics, but this doesn’t seem to have registered much to me, and the book is arranged in a different style as I might have hoped. I was mostly intrigued by how he said Roosevelt-esque liberalism dominated from roughly 1930 to 1970. Then the Reagan-esque conservatism (i.e., the era of the individual) has dominated from 1980 to 2016 or so, and now we’re supposed to be starting a new era as Trump has demolished the old conservatism. But Lilla is frustrated that modern liberalism is so obsessed about identity, and quite frankly, so am I. He is correct, and many liberals would agree, that change must be aimed locally now, as Republicans have dominated state and local governments, particularly throughout the Obama years. I do wish, however, that he had focused more directly on (a) how groups are not monolithic, (b) why identity politics is bad politics. I know there was some focus, but there didn’t seem to be enough for me. But I suppose, this being a short essay, he wanted to prioritize the Roosevelt-Reagan parallels, which in all fairness is indeed interesting to ponder.
** Climate of Hope: How Cities, Businesses, and Citizens can Save the Planet **, a 2017 book jointly written by Michael Bloomberg and Carl Pope. Surprisingly, considering that I was born and raised in New York state all my life (albeit, upstate and not in the city) the first time I really learned about Bloomberg was when he gave the commencement speech at my college graduation. You can view the video here, and in fact, to the right you can barely see the hands of a sign language interpreter who I really should re-connect with sometime soon. Climate of Hope consists of a series of chapters, which are split into half from Bloomberg’s perspective, half from Pope’s perspective. The dynamics between the two men are interesting. Pope is a “typical” Sierra Nevada member, while Bloomberg is known for being a ridiculously-rich billionaire and a three-term (!!) mayor of New York City.13 The book is about cities, businesses, and citizens, and the omission of national governments is no accident: both men have been critical of Washington’s failure to get things done. Bloomberg and Pope aim their ire at the “climate change deniers” in Washington, though they do levy slight criticism on Democrats for failing to support nuclear power. They offer a brief scientific background on climate change, and then argue that new market forces and the rise of cities (thus greener due to more public transportation and more cramped living quarters) means we should be able to emphasize more renewable energy. One key thing I especially agree with is that to market policies that promote renewable energy — particularly to skeptical conservatives — people cannot talk about how “worldwide temperatures in 2100 will be two degrees higher.” Rather, we need to talk about things we can do now, such as saving money, protecting our cities, creating construction jobs, protecting our health from smog, all these thing we can do right now and which will have the effect of fighting long-term climate change anyway. I enjoyed this easy-to-read and optimistic book, though it’s also fair to say that I tend to view Bloomberg quite favorably and honor his commitment to getting things done rather than having dysfunction in Washington. Or maybe I just want to obtain a fraction of his professional success in my life.
You’ll also notice in that link that Stuart Russell says he thinks superintelligence will happen in “more than 25 years” but he thinks it will happen. Russell’s been one of the leading academics voicing concern about AI. I’m not sure what has been created out of it, except raising a discussion of AI’s risks, kind of like how Barrat’s book doesn’t really propose solutions. (Disclaimer: I have not read all of Russell’s work on this, and I might need to see this page for information.) ↩
In this interview, Oren Etzioni said that AI leaders were not concerned about superintelligence, and even quoted an anonymous AAAI Fellow who said that Nick Bostrom was “the Donald Trump of AI”. Stuart Russell, who has praised Superintelligence, wrote a rebuttal to Etzioni, who then apologized to Bostrom. ↩
Of course, this raises the other problem with MOOCs. Only people who have sufficient motivation to learn are actually taking advantage of MOOCs, and these tend to be skewed towards those who are already well-educated. Under no circumstances is Brynjolfsson someone who needs a MOOC for his career. But there are many people who cannot afford college and the like, but who don’t have the motivation (or time!) to learn on their own. Is it fair for them to suffer under this new economy? ↩
Eric Schmidt got his computer science PhD from Berkeley in 1982. So at least I know someone famous essentially started off on a similar career trajectory as I am. ↩
I didn’t realize this until the authors put it in a footnote, but Jonathan Rosenberg’s father is Nathan Rosenberg, who wrote the 1986 book How the West Grew Rich which I also read this year. Heh, the more I read the more I realize that it’s a small world among the academic and technically elite among our society. ↩
To compare, How the West Grew Rich is less than half the length of The Rise and Fall of American Growth. In addition, I skipped most footnotes for the former, but read all the footnotes for the later. ↩
A quick thanks to Ben and Marc for helping to fund Berkeley’s Computer Science Graduate Student Association! ↩
Dawkins mentions that, if anything was “the making” of him, Oxford was. For me, I consider Berkeley to be “the making of me” as I’ve learned much more, both academically and otherwise, here than at Williams College. ↩
For the sake of keeping this blog mostly professional, I won’t list all my positive qualities here. ;) ↩
Usually, someone completing a PhD in 2-3 years raises red flags since they likely didn’t get much research done and may have wanted to graduate ASAP. Grant is an exception, and it’s worth noting that there are also exceptions in computer science. ↩
Given the fact that Bloomberg was able to buy his way into being a politician, I really think the easiest way for me to enter national politics is to have enormous success in the business and technology sector. Then I can just buy my way in, or use my connections. It’s unfortunate that American politics is like this, but at least it’s better than having a king and royal family. ↩
It took me six and a half years to do this, but I finally managed to install an email subscription form for readers of this blog. The link is here. No more nasty RSS feeds that no one knows how to use!
The email subscription form for this blog uses MailChimp. Each time I publish a post, I will send an email to everyone on the list using MailChimp’s “Campaign” feature.
Incidentally, this is the same kind of email form we use over at the Berkeley AI Research (BAIR) Blog. If you haven’t already, please subscribe to the BAIR Blog! As a member of the editorial board, I know the posts that are coming up next. I obviously cannot reveal the exact content, though I can say that we have lots of interesting stuff lined up for 2018 and beyond.
For assistance on getting this set up, I thank Jane Liang, a UC Berkeley EECS student who set up MailChimp for the BAIR Blog. I also thank Dominic Spadacene, who wrote dead-simple HTML installation instructions on his Ctrl-F’d blog.
For a long time, I wanted to write a nice, long, friendly blog post on Hamiltonian Monte Carlo that I could come back to for more intuition and understanding as needed.
Fortunately, there’s no need for me to invest a ginormous amount of time I don’t have for that, because physicists/statistician Michael Betancourt has written a fantastic introduction to Hamiltonian Monte Carlo, called A Conceptual Introduction to Hamiltonian Monte Carlo. You can find the preprint here on arXiv. Don’t be deterred by the length; it’s a fast read compared to other academic papers, and certainly a much more intuitive read than Radford Neal’s 2011 review chapter, which I already thought couldn’t be surpassed in terms of a quality introduction to HMC. Indeed, even prominent statisticians such as COPSS Presidents’ Award winner Andrew Gelman have praised the writeup, and someone like him obviously doesn’t need it.
I have extensively read Radford Neal’s writeup, to the point where I was able to reproduce almost all his figures in my MCMC and Dynamics code repository on GitHub. There was, however, one question I had about HMC that I didn’t feel was elaborated upon enough:
Why is it necessary to flip the sign of the momentum to induce a symmetric proposal?
Fortunately, Betancourt’s writeup comes to the rescue! Thus, in this post, I’d like to go through the details on why it is necessary to flip the sign of the momentum term in HMC.
Let be the density function defining the current proposal method, whatever that may be. With a Gaussian proposal, we have symmetry in that . The same is true with Hamiltonian Monte Carlo … if we handle the momentum term correctly.
Borrowing Betancourt’s notation (from Section 5.2), we’ll assume that, starting from state , we integrate the dynamics for steps to land at , upon which we use that as our proposal:
where is the Dirac delta function, and the difference is assumed to be real-valued; if and are vectors, these would need to be done component-wise and then summed up, but the same basic idea holds. Furthermore, and are “placeholder” random variables, kind of like how we often use when writing in introductory probability courses; is the placeholder and is the actual quantity.
Reading the definition straight from the Dirac delta functions, we see that our proposal density is one exactly at state , and zero everywhere else. This makes sense because Hamiltonian dynamics are deterministic after re-sampling the momentum variables (but it’s understood that represents those states after the re-sampling, not before).
The problem with this is that the proposal becomes “ill-posed”. Betancourt remarks that:
however, I believe that’s a typo and that the numerator and denominator should be flipped, so that the numerator contains the density of the starting state given the proposal.
Regardless, to me it doesn’t make sense to have proposal probabilities or densities with these Dirac delta functions that result in zero everywhere (that means we’d always reject samples). The following figure (from Betancourt) visually depicts the problem:
Because these position and momentum variables are continuous-valued, the probability of actually landing back in the starting state has measure zero.
Suppose, however, that after we integrate for steps, we flip the sign of the momentum term. Then we have
so that only results in a probability mass of one. See the following figure for this revision:
The key observation now, of course, is that
Why is this true? The dynamics are time-reversible, and if we set our potential energy to be the usual , then flipping the momentum term and going through the leapfrog means the sampler encounters the same exact steps, only in reverse.
To make this concrete, I like to explicitly go through the math of one leapfrog step. It requires some care with notation, but I find it’s worth it. I’ll write as the -th element encountered during the forward trajectory. For the reverse, I use so that the superscript is now two instead of one. Furthermore, due to the leapfrog step taking half-steps for momentums, I use for this purpose.
Here’s the forward trajectory, starting from :
and the last step negates the momentum, so that the final state is .
Here’s the reverse trajectory, starting from :
with our final state as . Above, the only difference between the reverse and the forward trajectories is the change in superscripts. But when we do the math for the reverse trajectory while plugging in the values from the forward trajectory, we get:
Gee, this is exactly the negative of the half-step we were at in the first iteration! Similarly, for the position update, we have:
The leapfrog has brought the position back to the starting point. For the final half-step momentum update, we have:
and we see that our reverse trajectory landed us back in , and flipping the momentum gets us to the same exact starting state.
Thus, using this type of proposal means the proposal densities cancel out, result in a Metropolis test, not a Metropolis-Hastings test.
I should also add one other point: we do not consider the momentum resampling as being part of the proposal, as resampling the momentum can be considered as maintaining the canonical distribution, so it’s something that we can “always” decide to invoke if we want (hopefully that’s not too bad of a hand-wavy explanation).
Hopefully the above discussion clarifies why flipping the sign of the momentum is necessary, at least in theory. In practice, we don’t do it since for the usual Gaussian kinetic energies, so their energy levels are the same, and because the momentum variables are traditionally entirely resampled after the acceptance test.
This semester, I took CS 294-131, a Deep Learning “special topics” course which has been offered each semester since Fall 2016 for a variable amount of class units and will be taught again next semester (the course website is already up). As usual, it was co-taught by the Trevor Darrell and Dawn Song team. The course is low-commitment for them because it’s a seminar and they don’t have to give lectures or prepare assignments and exams. CS 294-131 meets only once a week; for us, it was Mondays from 1:00PM to 2:30PM. Each meeting featured a guest speaker from academia or industry who gave a talk on his or her cutting-edge Deep Learning research results.
Here were some of the highlights for me:
Vladlen Koltun’s talk about his ICLR 2017 paper Learning to Act by Predicting the Future. I enjoyed his presentation, though admittedly most of it was because he was funny and actively engaging with the audience. I previously blogged about the more technical aspects here.
Barret Zoph and Quoc Le’s joint talk on neural architecture search, also from ICLR 2017 (here’s the OpenReview link) and also (like Koltun’s paper) an oral presentation at that conference. I’ve been hoping to find some time to read their paper and perhaps the winter break will afford me that opportunity. Zoph and Le’s presentation featured a lot of aggressive questioning from students, to the point where Professor Song asked the students to quiet down and let the speakers proceed. Fortunately, at least to me, the technical content of the presentation was interesting enough to keep my attention.
Ross Girshik’s talk on computer vision and object recognition. Actually, we had a fire alarm for this one, which delayed the start class for about 15 minutes … so then we had to find a new room. Unfortunately, it took about 10 more minutes to get the projector working, and then we were told we had to leave the room at around 2:10PM. At least when Girshik was actually able to talk about computer vision, I found the historical overview to be educational.
Percy Liang’s presentation on fighting black boxes and adversaries in Deep Learning. This was somewhat more theoretical work but he didn’t go too much into the details. I am less familiar with his work but would like to get accustomed with it as adversarial learning is a pretty hot topic in robotics these days.
In case you’re wondering, yes I had the usual sign language interpreters for the class. Yes they were unhappy, but they tried, and we were able to agree on a few terminology-related issues beforehand. And yes, I didn’t get all the technical details from the talks. I tried to allocate two hours before class to do the background reading, but inevitably that turned into 1.5 hours … and then 1 hour … and then 30 minutes. How do people manage to do class readings ahead of time when they’re juggling four major research projects? Or do people with normal hearing just find it easy to absorb almost all the technical stuff in these talks without prior preparation?1
As I mentioned earlier, CS 294-131 can be taken for a different amount of credits. This year, we had the option of taking it for 1, 2, or 3 credits. We got one credit for doing “arXiv summaries” and “discussion leads” and two for doing a class project. I decided that those summaries and discussions would be too much of a hassle, so I took CS 294-131 for two credits. It helps that I’ve long since finished my course requirements.
While I enjoyed some of the Deep Learning talks, I do have some criticisms about CS 294-131:
There are too many websites/links related to the class. We have Piazza, the course website, the Google group (seriously?) and Slack channels, with one for each new week. I think this is too much information and things should be centralized in two spots at most — the course website and Piazza.
I’m also not a fan of the arXiv leads, which was new this semester. Students had to give 1-minute presentations on Deep Learning papers that appeared on arXiv the past week. The problem with this is that the majority of students tried to cram as may technical details in their talk as possible, rather than give the clear key insight from the paper. In addition, students often went over their allotted speaking time (gee, who would have guessed??).
Finally, I have no idea why class participation is worth 20% of the grading here. On Piazza, we were literally told that we would get class participation credit by simply attending the lectures. Not only does it not make sense to award students who attend lectures but don’t pay attention, it also hurts those who watch the video livestreams to reduce pressure on the lecture room, since the first lecture was “standing-room only.”
To be honest, I didn’t quite enjoy the class as much as I should have, and my project didn’t turn out as well as I would have liked. I worked on a Deep Reinforcement Learning project with three other students, and hopefully that will turn into a research paper later, but in retrospect, it’s difficult to coordinate a four-person project when everyone else has other priorities.
I don’t plan to take the Spring 2018 version of the course, but I’ll certainly keep track of the papers in the background reading. I’m excited to see who the guest speakers will be this time around …
For readers of this blog who are EECS PhD students (and yes, I know you read this blog) that means I’m not-so-subtly asking you to tell me how much you can absorb from technical talks and lectures. Posting as a comment here or emailing me personally works. ↩
In this post, I try and learn as much about Bayesian Neural Networks (BNNs) as I can. I borrow the perspective of Radford Neal: BNNs are updated in two steps. The first step samples the hyperparameters, which are typically the regularizer terms set on a per-layer basis. The second step performs Hamiltonian Monte Carlo over the data, or through a series of minibatches plus sophisticated “friction” techniques, if using Stochastic Gradient Hamiltonian Monte Carlo. These update the actual weights we use for the neural networks; the hyperparameters are sampled mainly to invoke a “fully Bayesian” hierarchical model.
The above is different from the paradigm of using Bayesian Neural Networks with a technique known as variational inference. I will not be discussing that.
To make things concrete, in this blog post I will assume we have the following neural network which:
- is fully connected.
- takes MNIST data as input (784 dimensions for each data point), has a hidden layer of 100 units, and outputs a 10-dimensional vector from a softmax.
- uses the sigmoid activation for the hidden layer.
- uses a regularizer hyperparameter for each of the two sets of weight matrices, along with two for the biases.
We can write the network’s mathematical meaning using and as the weight matrices, along with and as the bias vectors. Using this, the network output can be expressed as:
where indicates the class label and is the th column of . (The entire output would just be the vector of values.) I write without subscripts, but in general we should write to represent the entire dataset.
We need to incorporate Bayesian assumptions somehow, so first we assume that all these weights have Gaussian priors with zero mean and covariance matrix set to some multiple of the identity. Intuitively, this seems to be a reasonable prior, as we’d generally like our weights to be small and roughly symmetrical about zero. For example, with the weights for , we
where we set to be the inverse variance, also known as the precision term. Similarly, we have , , and . Notice that I am now flattening the matrices and so that they are vectors. This makes the notation a bit easier, and it means that when I write , I mean the norm, not the spectral norm on matrices (a.k.a., the largest singular value). I will use the flattened notation for the remainder of this blog post.
The precision terms are hyperparameters which we also endow with their own (IID) priors:1
Letting denote our eight major parameters (including the hyperparameters), the posterior distribution which we want to sample from based on dataset is
where we follow the usual assumption of conditional independence among the data; think of drawing data points from the true “data distribution” to form our training data.
BNNs use Bayesian methods to figure out a good set of parameters for some task, which here is based on digit classification accuracy. I will now go over how the hyperparameter updates work, followed by the parameter updates.
The Hyperparameter Updates
This step samples the following:
There is no dependence on the dataset as the hyperparameters are sampled based on “data” which consists of the parameter values at the lower level. Also, since we assumed an IID prior for the precision terms, and because the values of the parameters are viewed as independent as well (I admit this probably isn’t the best way of describing it but it feels intuitive) we have:
We can literally sample the four precision terms sequentially due to their independence assumptions. For simplicity, let us assume we are sampling only so that the rest of the computation is straightforward. The math turns out to be:
and indeed, we have conjugacy: sampling the hyperparameters can be done simply by sampling from a Gamma distribution with these updated parameters based on the previous values of and . For intuition on what these parameters mean, if we have , then .
The Parameter Updates
This step samples the following:
where to simplify the notation if we depend on all the hyperparameters. There is dependence on here as those values determine the spread of the Gaussian distributions. Also, notice the dependence on the data here, unlike the previous case.
Using Bayes’ Rule as we did earlier (with that condensed notation) we get:
How do we sample from this distribution? We use Hamiltonian Monte Carlo (HMC).
The Hamiltonian, Potential Energy, and Kinetic Energy
Briefly: HMC uses what is known as a Hamiltonian Function where are the parameters and refers to auxiliary momentum variables.2 In Bayesian statistics, current practice is to split the Hamiltonian into two functions: known as the potential energy and kinetic energy, respectively. HMC is designed to sample from the distribution defined as follows:
where is a normalizing constant and is some temperature, typically used to “flatten” or “diffuse” the target distribution (which here is ) to make optimization easier. For the rest of this post, I include for clarity but I keep it separate from and .
In Bayesian statistics, the Potential Energy is because if we plug that in, we get
which is exactly what we want for the position variables, assuming that the momentum is independent so that , which is standard practice. To be clear: (a) we’re only interested in sampling from the distribution for , not the momentum’s distribution, so (b) to get our desired samples of from the posterior, we generate samples that include the momentum variables, and then we drop the latter after we’re done.
Regarding the Kinetic Energy, current practice is to set it to be a quadratic potential with mass matrix a multiple of the identity:
To be clear, we need to sample from the target distribution as specified by , which means we must technically sample from the distributions defined as:
Here, and are energy functions, but they are not the same as the actual distribution we are sampling from. For example, with the kinetic energy, the actual distribution we sample from is proportional to
i.e., a zero-mean Gaussian with covariance . We sequentially sample for each because they are independent by assumption.
Running Hamiltonian Monte Carlo
Running HMC in computer simulation requires a Metropolis test3 each iteration (i.e., each sample in our MCMC chain) to correct for discretization error. This requires computing the follow ratio :
where and refer to the proposed position and momentum variables, respectively. Computing the kinetic energy is typically a matter of adding squared norms, so that part is easy. But what about ? To compute this difference for our proposed Bayesian Neural Network, we see that
where represents a constant independent of the parameter . The reason why I ignore this is twofold.
First, when computing the Metropolis ratio, that will be the same for both and so we can ignore it.
Second, we also need to use when we sample with HMC, and this means taking the gradient which will kill . (The Metropolis test is only to determine whether we accept or reject a proposal, but we need some way of actually getting that proposal).
To elaborate on the second point, sampling using “Hamiltonian Dynamics” requires the momentum update:
where is a (leapfrog) step size parameter which we divide by two as required from the leapfrog method.
You can immediately see from this that must have the same dimensions as . I think of as concatenating flattened weights, so it’s one giant vector. The gradient updates can be specified weight-by-weight, which will change the corresponding “slices” of the vector . For instance, with and abusing notation by re-using , we have:
The term serves as a weight decay regularizer, and the sum over the gradient of probabilities can be computed via backpropagation through the neural network.
Remark 1: hopefully my above explanation clarifies why imposing a Gaussian prior on the weights is equivalent to regularization.
Remark 2: consider using TensorFlow to get the gradients corresponding to the log probabilities. In particular, TensorFlow can return gradients using
One thing I should point out: here, we can view as a “cost function” that we’re trying to minimize. This is equivalent to minimizing the cross entropy loss between what our the neural network predicts, and the distribution that consists of one-hot vectors of the training labels.4 That’s precisely the loss function I’d use if I were formulating the classification problem and solving it with stochastic gradient descent instead of HMC. The implication is that it’s OK to try and maximize the value that we see above, which is what happens when we perform gradient ascent on it; intuitively, the resulting weights will assign higher probability to the correct class.
After the momentum updates, the position variables are updated using a similar gradient-based step:
so that, intuitively, is also updated in roughly the same direction of the gradient. That’s how HMC works and uses gradient information.
Using Bayesian Neural Networks in practice often requires sampling a set of neural network weights many times and then computing the mean and standard deviation of the predictions.
A figure copied from the VIME paper (NIPS 2016) showing Bayesian Neural Network predictions and uncertainty levels.
For instance, in the figure above (taken from the VIME paper) the authors construct a regression task, where the network takes in a scalar-valued5 input and outputs a prediction . The red dots are the targets, while the green dots are the predictions. It’s clear that the red dots are clustered near the center of the figure, so logically, our Bayesian Neural Networks should be very confident in their predictions in those areas, and less confident outside the training data’s dominant regime. Indeed, the shaded areas confirm this, as they represent the output mean plus/minus one and two standard deviations (I think the second standard deviation is too far to see for the extremes in this figure) based on different neural network weight parameters.
These types of figures are typically shown when people talk about Bayesian Neural Networks, such as in Yarin Gal’s excellent tutorial.
I am working on implementing Bayesian Neural Networks in my MCMC repository, which is a TensorFlow implementation based on Tianqi Chen’s earlier pure numpy code. The code is a bit disorganized and not quite ready for consumption by the public, but I think I’m getting something going with this code.
I follow Radford Neal’s notation in setting as the auxiliary momentum variables for HMC. Neal uses as the position variables, but I set them here as for obvious reasons. ↩
The proposals have the same density, so it is not necessary to perform a Metropolis-Hastings test. Why this is true is based on reversibility, but it is still somewhat unclear to me. ↩
This is a reasonably well-known fact in machine learning, but I would like to write up some details on this because I sometimes find myself looking up the derivations again. ↩
Well, technically they preprocessed the input to be but thinking of it as 1-D makes it much easier to plot. ↩
In this post, I attempt to learn as much as possible about the paper Multi-Level Discovery of Deep Options, which introduced the DDO algorithm. This post is split into my understanding of the math, followed by implementation/experiments.
Before we begin, here is some useful notation:
Options are denoted as , with representing the set of possible options. To make the notation clearer when we’re dealing with options at each time step , we could write .
Higher-level policies are denoted as , in which we pick options. You could also repeat this recursively; think of using something like , though the paper never officially uses that notation.
A set of demonstrations are denoted as . They do not assume that these demonstrations are from an expert supervisor, only that (a) they have some hierarchical structure to be discovered, and (b) that they are informative of the relevant actions to take in each state. Assumption (a) is easier to understand and justify, for if we didn’t have a hierarchical structure to discover, there is no point doing DDO or any hierarchical learning algorithm. Fortunately, in many real-life tasks (for humans), we have hierarchical structures.
The paper also goes through some usual reinforcement learning and imitation learning notation, which I won’t repeat here as it will be clear to those with the relevant background.
The goal of DDO is to discover a set of parameterized options from .
To derive the math, the authors use the perspective of fitting a generative model to the trajectories. That generative model, however, has latent variables, so they need to perform a flavor of Expectation-Maximization, which I’ve previously blogged about here.
But what are the latent (a.k.a., hidden) variables?
The choice of , which represents the option at time .
In addition, their generative model also assumes that at each time step, we have a binary random variable , where if , a new option is drawn from the high-level policy. The random variable follows a Bernoulli distribution with parameter based on the option termination conditions ; the higher the value at a given state, the more likely it is that the option should finish and return control to the meta-policy.
The generative model tells us how to assign probabilities to the trajectories. Now, let’s figure out how to find that maximizes the likelihood of the trajectories! We’re using to denote all the parameters of everything together: the meta-control policy and for each option. The derivation here assumes a two-level hierarchy, but the extension to multiple levels should be straightforward, besides some nightmarish notation.
The discussion above makes it clear that the th trajectory in our data follows this structure:
with and representing the visible and latent variables, respectively.
To find parameters, we need to encode this probabilistically, as in . Fortunately, as is standard in RL/IL, this long joint probability decomposes based on time steps. Specifically, we express it as:
where for notational clarity, we denote the various probability distributions as follows:
- represents the probability distribution over the initial state. There is no dependence on .
- represents the dynamics of the environment. Once again, there is no dependence on .
- represents the distribution over the hidden variables. (My notation is slightly different than what the DDO paper uses, but I find it easier to think of the entire likelihood as distinct from this.)
- and , of course, represent the policy and option selection.
Since is Bernoulli, we can easily split into cases to define it more precisely:
We should now step back and ask ourselves if this definition of makes sense. Does it?
Yes. The first few terms draw the initial state, and then the ensures that we actually have an option to sample from to start (though it is really unnecessary notation). Then we assume we draw .
Next, we iterate through the remaining time steps. We draw the action based on the policy, and then the dynamics provide the state. But then we also need to sample the two latent variables. We’re packing these together in , but you can also think of it sequentially: first sampling a Bernoulli for the option termination, and then sampling from the meta-policy. The split of in two cases makes the sequential aspect of this clearer: represents the probability that , and if it is zero (with probability ), we don’t need to draw at all, hence the delta term . Otherwise, we need to sample, hence the . Yes, this all makes sense, and can be reasoned by iterating through the generative model “pseudocode.”
Great. Now we know the likelihood of one trajectory. We want to maximize the (log) likelihood assigned to a trajectory, which means we need to take gradient steps to get our neural network weights to go to the correct spot. For one trajectory (omitting the subscript which normally indicates which trajectory in our set of them), we have:
Where in (i) we substituted the definition with the thing we want to optimize since that will lead to “likely” trajectories learned from gradient steps, in (ii) we applied the “trick” and then applied the probability 101 rule that while explicitly writing out what it means to sum over the discrete-valued latent variable over all times (this is an exponentially large sum!), in (iii) we again applied the same log-derivative trick, except this time from the perspective of , in (iv) we used the definition of conditional probability and then the definition of expectation in (v), and finally in (vi) we applied our previous derivation of the probability of both the visible and latent variables, and omitted terms which cancel out from the gradient. Recall again that includes the parameters of the meta policy (i.e., ) and the lower-level options .
In general, we’re dealing with a batch of trajectories, so just apply it to each of the trajectories and take an average over the minibatch to make the gradient independent of batch size.
This is good, but we can actually write the gradient more explicitly, without the expectation by converting the expectation to the sums. Effectively, instead of doing the step (vi) to (v) thing we did earlier, we’ll explicitly simplify the probability based on the summation. However, we’ll start from step (vi) above since it’s easier to manage the calculations with all the stuff canceled out after applying . We’ll go through this computation in several steps.
For the third term, we have:
Where (i) is by linearity of expectation, (ii) simplifies the expectation by realizing that the term under expectation only (directly) depends on , and (iii) applies the definition of expectation. If step (ii) isn’t clear, it should work out if you expand the full probability into sums over all and all . Then the sums should get “pushed to the right” and sum to one (and thus go away) and the rest should follow from there. I would write it out, except that it makes sense to me and I don’t want to spend my entire blogging career getting bogged down with notation.
Additional comment: it might also be the case that conditioning on instead of just the relevant part of was done for notational convenience.
Next, here’s how to rewrite the first two terms in the expectation. This requires a considerable amount of care:
For the most part, it uses similar techniques as the other part, such as converting the expectation to sums over probabilities (i.e., applying the definition) and then moving the sums to the right as far as possible so that they can sum to one and then eliminate. The main challenge here is tracking all the s here in addition to the s.
Putting all this together, we get the equation shown in the DDO paper where they’ve substituted in , and terms. It should be equivalent to what I have above, though the odds that there is a typo somewhere (probably above) is 100 percent. I think I know how this works in theory but there is a lot of notation.
Incidentally, there is some good explanation about how — since we’re living in discrete-land — there are three cross entropy loss terms embedded into the gradient update.
The last piece of math to note before proceeding to the implementation details is … how to implement Expectation-Gradient efficiently. Fortunately, the Expectation step — where latent variables are “sampled” and weighted probabilistically — can be done with the Baum-Welch update, which I have previously blogged about. I won’t go through the details here, though I went through them on pencil and paper and the math makes sense. The key to my intuition is to think of forward and backward probabilities as “counting up” the number of ways to reach a given spot from the start (for trajectory prefixes) or the end (for trajectory suffixes). The Baum-Welch algorithm gives us efficient ways to compute various probabilities, and then, since we can formulate a loss function which is the negative log likelihood of a trajectory, we can call TensorFlow to compute gradients for us for the Gradient step. Thus, we have the Expectation-Gradient algorithm.
Quick note: recall Expectation-Maximization. They’re not doing that because the maximization part can’t be done in closed form with neural network models:
Our work is most related to , who use a similar generative model, originally introduced by  as an Abstract Hidden Markov Model, and learn its parameters via the Expectation-Maximization (EM) algorithm. EM applies the same forward-backward E-step as our Expectation-Gradient algorithm (Section 4.2) to compute marginal posteriors, but uses them for a complete optimization M-step over the options and the meta-control policies. This optimization is infeasible for expressive representations, as well as for multi-level hierarchies.
OK, now let’s move on to some of the implementation details and experimental results.
Implementation Details and Experiments
I am mostly going to investigate their GridWorld-related experiments, as the Atari stuff uses RAM, not images, so it’s hard to interpret, and the surgical robotics part is impossible to comprehend without intimate knowledge of the training data.
They consider DDO under two different scenarios:
(Supervised) given a supervisor who demonstrates a few times how to perform a task, show that the discovered options are useful for accelerating reinforcement learning on the same task; (Exploration) apply reinforcement learning for episodes, sample trajectories from the current best policy, and augment the action space with the discovered options for the remaining episodes
This is a bit confusing to me for a few reasons. I will state why later when I review some questions I have about the paper. Let’s move on to GridWorld. Details:
- a grid, with four rooms.
- the agent actions is randomized with probability 0.3.
- the agent can move in four directions (the “atomic actions”), and moving into wall has no effect.
- an apple is spawned in random location, and upon reaching it the agent gets +1 reward (and then it is respawned).
- the agent knows where it is, as it’s hard-coded in the state space.
All neural network policies (for control, meta-control, termination) are neural nets with one hidden layer of … two nodes. Yeah, you don’t need much for GridWorld.
The GridWorld used in the DDO paper. It is repeated four times in this image, each with a different discovered option: down, up, right, and left, respectively.
For the two setups:
Supervised. They generate 50 trajectories of length 1000 using Value Iteration. (1000 seems like a large number, but the agent repeatedly respawns after hitting the apple so trajectories can go on indefinitely.) Then, after running DDO on this data, they can discover options. This is the supervised case, so there is (I assume) no environment interaction during the DDO stage. They executed the learned options and found four of them at the lower level (see figure above). There’s another figure for the higher-level options, which shows the options that are invoked to move the agents to rooms, starting from any given state.
Exploration. They trained a DQN agent for 2000 steps with atomic actions. Then after those steps, they can roll out trajectories from the -greedy policy, which effectively turn this into a “supervised” case. DDO learns options, and then the Q-function is then reset and DQN is run again with the augmented action space. I think what they want to show is that, compared to baseline DQN, the DQN augmented with options learns faster. It is a little hard to interpret Figure 2, though. They are starting from 2000 steps when they compute the number of steps for the option-augmented agents, right? (Because that’s 2000 steps of computation needed.) I think so, as their Figure 2 plots for the “exploration setting” are like the ones from the “supervised” setting except shifted to the right past 2000 steps, but then I’m surprised at why rewards shoot up at (essentially) exactly 2000 steps.
I also read through their Supplementary Material, which provides mostly more information on various GridWorld settings. Figure 7 appears to be missing the benchmark of having options-only DQN, but otherwise it has primitives-only DQN and options-and-primitives DQN, which is the “augmented DQN” from the paper.
Now here are some questions I have:
For the supervised setting of DDO, is it fair to say it “accelerates RL”? In the default RL setting, we do not start with expert trajectories. Thus, any time we can get expert trajectories to bootstrap our starting policy, isn’t that an unfair advantage? This is what I wondered about when digesting the first half of Figure 2. Perhaps the comparison should be with reinforcement learning initialized with behavior cloning?
For the exploration setting of DDO, what does it mean to augment the action space with discovered options? I think it means this: suppose our default action space is . Then we discover three options, so the action space becomes . Is that right? Then how is this logic implemented? The original RL policy must have started with a neural network that outputs four components, one per action (so that we can do a softmax to get the full probability distribution). Do we then copy weights over to a new neural network with seven outputs, so that all weights except for the last layer are pre-initialized?
Regarding the DQN results, DQN is notorious for requiring lots of hyperparameter tuning and there are many ways that the implementation can go wrong. I wonder if this was hand-implemented or if the authors based it on an existing library with known benchmarks.
Whew! That was a fairly exhausting read. I needed to read this three times (which is the amount Professor Michael I. Jordan keeps telling us to read textbooks) but at least I think I get the gist of how DDO works.
Last Friday, I participated in the Bay Area Robotics Symposium for the first time. Since 2013, this has been an annual November event with alternating hosts of the University of California, Berkeley, and Stanford University. This year, it was held in Berkeley, and since by now I am closer to calling myself a robotics researcher, and additionally have (finally!) a robotics-related preprint online, I really had no excuse not to join this time around.
The International House auditorium, where BARS took place.
BARS took place at the International House in the southeast corner of the UC Berkeley campus. The talks were in the room shown in the picture above. As usual:
- I arrived early.
- I sat near the front of the room. Thus, this picture shows basically the entirety of the auditorium.
I normally follow the two rules above because of the need to (a) meet my sign language interpreters early, and (b) grab a seat by the front to get a good view of them. Sadly, it means that (as usual) I don’t get other students to sit next to me. In fact, the seats nearby me were empty. One of these days, I will figure out how to sit next to other students.
After about a 15 minute delay (typical Berkeley), BARS finally got started. The Berkeley host, Professor Anca Dragan, gave a few opening remarks, which included that BARS 2017 had (if I recall correctly) 392 people signed up. The capacity was 500, I think, and I’m actually surprised that we didn’t hit the limit. Perhaps CoRL 2017, held recently, meant that BARS 2017 was redundant?
Anyway, soon we started off with the agenda. We started with the first of four sets of faculty talks, each consisting of six faculty giving ten minute talks. Berkeley Professor Pieter Abbeel started things off.
Pieter Abbeel started off the faculty talks. He presented "Deep Learning for Robotics". Apologies for the glare on the camera --- I am a rookie with using my iPhone for cameras. You can also see my sign language interpreter there, who had her work cut out for her due to Abbeel's fast and energetic speaking rate.
My goodness, how does he get all these papers?!? His talk started off by listing his long list of publications … in 2017 alone. Then he talked about Meta-Learning, which is what he’s most interested in nowadays within AI. (Just to be clear, I said “most interested,” not “the only thing he’s interested in.”) Incidentally, Abbeel was featured in the New York Times for co-founding Embodied Intelligence. I’m excited to see what they will produce.
We then had more faculty talks. Once these concluded, we moved on to the first of two student spotlight talk sessions, when students gave 1-minute lightning talks on their research. As expected, about 90% of the students went over their alloted time, and most wasted time by saying “My name is X and I’m a student at Y advised by Z and blah blah blah.” The good news is that many of the presentations were interesting enough to pique the curiosity of a nontrivial fraction of the audience. Then those folks can read the students’ papers in their own time.
I didn’t give a lightning talk. I’m not sure why, but I suppose professors could only pick two or three of their students due to time constraints.
We then had our first coffee break. Yay! I stood up, stretched, and briefly chatted with a few other people from Berkeley who I knew. I wish I had talked to some students from Stanford, though. How do people network to brand-new folks at events like these? I really wish I knew the answer.
After another set of faculty talks (including Ken Goldberg’s work on the Dexterity-Network), and then a lunch break, we had … our keynote talk.
Professor Robert Full at the end of his keynote talk, asking the audience about our thoughts on a suitable "Grand Challenge" for robotics.
Professor Full isn’t technically a core robotics faculty — I think — because “biomechanics” is a better way of describing his work. But he quickly captivated the audience with his funny videos on insects and squirrels going through wild motions … and then videos of robots attempting to replicate that movement. I particularly liked the videos of insects and robots that were able to resist insane amounts of forces and squeeze their way through impossibly tiny paths. Judging from the frequent audience laughter, I wasn’t the only one who enjoyed his talk.
Towards the end, Professor Full asked us for “Grand Challenges” of Robotics. One of the professors who I work with, Ken Goldberg, offered “the ability of a robot to pick up anything a human can pick up, including stuff that doesn’t want to be picked up.” You can verify this by looking at his Twitter. I agree with him.
Ken, meanwhile, helped me out a bit later that day. During the second coffee break, Ken gathered a few of his students (including me) and introduced us to Stanford Professor Allison Okamura, who’s also done some work on surgical robotics. At least I got to network a bit, which is better than the usual nothing!
We then had our second set of lightning talks, with the same old issues (people going over their time, etc.), and then our fourth set of faculty talks. This featured stars such as Jitendra Malik and Sergey Levine. I enjoyed Malik’s talk, which was entertaining for two reasons. First, he joked that vision and robotics people (and Malik is a vision person) were historically separate communities in AI, but after drastic improvements in image recognition due to Deep Learning, the “robotics people better pay attention to what the vision people are doing.”
The second joke Malik made was with regards to Stanford vs Berkley. The joke was that Stanford people could develop algorithms for robots (e.g., navigation) that perform well when applied to Stanford-based environments, but which fail on Berkeley-based environments. Why? Stanford is nice and clean, Berkeley is dirty and messy.
Well, UC Berkeley may be a bit run down, not to mention crowded (see image below), but it’s the people that matter, right?
Yeowch. BARS sure was popular! Well, that, and the venue is a bit too small for what we're offering. No wonder I regularly hear complaints that Berkeley is too crowded.
Some random, concluding thoughts:
Two areas of research in robotics (and AI more generally) that are hugely popular are safety and robustness. These are related, though there are subtle differences.
There were several faculty talking about aerospace dynamics and mechanical engineering. This is not my area of research so I had a hard time processing the concepts. There was also more on surgical robotics than I expected, due to several Stanford faculty (as our surgical robotics person, Ken, is mostly working on grasping). Berkeley, naturally, has more faculty who do Deep Reinforcement Learning for Robotics, and I find that to be the most riveting field of robotics and AI.
Many faculty kept saying some variant of: “we’ll do Deep Learning because it’s so popular and works.” Yes, I know it’s popular, and it’s funny to talk about it, but by now this has grown stale on me and I wish people would cut back on their hackneyed Deep Learning comments.
Unfortunately, since I spent most of the coffee breaks talking to a few people or relaxing, I didn’t get to attend either poster session. I got a glimpse of one of them and it looked crowded (and noisy) so maybe I didn’t miss much.
Well, that’s a wrap. I look forward to attending BARS 2018 at Stanford. Thank you to everyone who helped make BARS 2017 happen!
When reading academic papers about a certain subfield, I often find it difficult to clearly understand how they connect with each other. For example, what algorithms are based on other algorithms? Can the contributions of two papers be combined? Would combining them result in notable improvements or just on-the-margin, negligible changes? (The answer to that last question is usually “the latter” but it’s something we should at least consider.)
This post is an attempt to unify my understanding of papers related to scalable Markov Chain Monte Carlo and scalable Metropolis-Hastings. By “scalable,” I refer to the usual meaning of using these algorithms in the large data regime.
These are the papers I’m trying to understand:
- MCMC Using Hamiltonian Dynamics, Handbook of MCMC 2010
- Bayesian Learning via Stochastic Gradient Langevin Dynamics, ICML 2011
- Stochastic Gradient Hamiltonian Monte Carlo, ICML 2014
- Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget, ICML 2014
- Towards Scaling up Markov Chain Monte Carlo: An Adaptive Subsampling Approach, ICML 2014
- Firefly Monte Carlo: Exact MCMC with Subsets of Data, UAI 2014
- On Markov Chain Monte Carlo Methods For Tall Data, JMLR 2017.
(All of them are freely available online.)
First, I’ll briefly discuss why we care about the problem of scalability with MCMC and MH. Then, I’ll group these papers into categories and explain how they are connected to each other. This will then motivate our UAI 2017 paper, An Efficient Minibatch Acceptance Test for Metropolis-Hastings.
Why Markov Chain Monte Carlo?
I’m not going to review MCMC here as you can find many other references, both online and in textbooks. It may help to look at my blog post from June 2016 where I describe the general problem setting. My more recent BAIR Blog post also contains some potentially useful background material.
But why use MCMC at all? Here’s one reason: if we use it to sample some model’s parameter , then the chain of samples should let us quantify useful statistics about properties of interest. Two of these are the expectation and variance of something, which we might apply on the parameter itself. We can estimate (for example) the expectation by taking a sequence of the most recent samples (or a subsampled sequence) from our chain and then taking the sample vector-valued expectation. More generally, letting be a function of the parameters, we can compute the expectation using the expectation of the sampled values .
We can’t do this if we take stochastic gradient steps, because the samples from SGD are not from the posterior distribution of the parameter. SGD is designed to converge around a single point in the space of possible values, unlike MCMC methods which are supposed to approximate a distribution, which can then be used for sample estimates of expectations and variances.
My perspective is supported in papers such as the SGLD paper from 2011 (one of the papers I listed above); the authors (Welling & Teh) claim that:
Bayesian methods are appealing in their ability to capture uncertainty in learned parameters and avoid overfitting. Arguably with large datasets there will be little overfitting. Alternatively, as we have access to larger datasets and more computational resources, we become interested in building more complex models, so that there will always be a need to quantify the amount of parameter uncertainty.
So … that’s why we like the Bayesian perspective. These authors are rock-stars, by the way, so I generally trust their conclusions.
I’ll be honest, though: I can’t think of something nontrivial I’ve done in which the Bayesian perspective was that useful to me. In Deep Learning, Deep Imitation Learning, and Deep Reinforcement Learning, I’ve never used priors and posteriors; RMSProp or Adam is good enough, and it seems like this goes for the rest of the community. Maybe it’s just not that necessary in these domains? I have two papers on my reading list, Bootstrapped DQNs and Robust Bayesian Neural Networks, which might clarify some of my questions regarding how much of a Bayesian perspective is needed in Deep Learning. I should also definitely check out the Bayesian Deep Learning NIPS workshop.
Langevin Dynamics and Hamiltonian Dynamics
This section concerns the following three papers:
- MCMC Using Hamiltonian Dynamics, Handbook of MCMC 2010
- Bayesian Learning via Stochastic Gradient Langevin Dynamics, ICML 2011
- Stochastic Gradient Hamiltonian Monte Carlo, ICML 2014
I gave a brief introduction to Langevin Dynamics in my earlier blog post, so just to summarize for this one, Langevin Dynamics injects an appropriate amount of noise so that (in our context) a gradient-based algorithm will converge to a distribution over the posterior of . The Stochastic Gradient Langevin Dynamics algorithm combines the computational efficiencies of SGD by using a minibatch gradient, but uses the Langevin noise to appropriately cover the posterior:
[…] Langevin dynamics which injects noise into the parameter updates in such a way that the trajectory of the parameters will converge to the full posterior distribution rather than just the maximum a posteriori mode.
As a follow-up, the Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithm is similar to SGLD in that it uses a minibatch gradient along with “exploration noise.” This time, the noise is from Hamiltonian Monte Carlo, which is more sophisticated than Langevin Dynamics since HMC introduces extra momentum variables which allows for larger jumps.
Radford Neal’s excellent 2010 book chapter goes over HMC in great detail, so I won’t go through the details here (though I’d like to write a blog post solely about HMC — so stay tuned!). Just to give a quick overview, though, our problem context is similar, where we have a target posterior:
with potential energy function
(Don’t worry too much about the “potential energy” terminology; HMC was originally developed from a physics background. We’re still in the same problem setting.)
HMC generates samples from a joint distribution that involves extra momentum variables:
where are the momentum variables and is a mass matrix. The update rules are:
where is some step size. If this doesn’t make sense, read Neal’s 2010 book chapter.
The result from HMC is a set of samples . But we’re only interested in the s, so … we simply drop the terms to get our samples for . Amazingly, is sampled from the correct target distribution, which one can show via some “reversibility” analysis.
SGHMC needs a little massaging to actually get it to sample the target distribution, since simply taking a subset of the data to compute an approximation to will lose the “Hamiltonian Dynamics” property; the authors resolve this by using second-order Langevin Dynamics to counteract the effect of too much gradient noise in estimating , and the result is a similar algorithm to SGLD except with a different noise term.
Just to be clear, both SGLD and SGHMC are minibatch, gradient-based algorithms that are also considered “Bayesian methods.” Neither are pure random walks, i.e., neither use Gaussian proposals because the proposals are based on the stochastic gradient value, plus some additive noise term. For SGLD, that extra noise is actually a random walk, but not for SGHMC.
For both SGLD and SGHMC, we have to apply the Metropolis-Hastings test for computer implementations due to discretization error, even though in theory we shouldn’t have to since energy is preserved. In both papers, the authors decrease step sizes to zero so that the MH rejection rate goes to zero. Intuitively, smaller step sizes mean samples are concentrated into higher regions of the posterior, and the gradient ensures going in the direction of greatest increase of the posterior probability. In addition, decreasing step sizes also means discretization error decreases, which yet again further reduces the need for MH tests. While this is great, because the MH test requires full-batch computation, perhaps we are missing out somehow by keeping our step sizes small.1
In this section, I discuss the remaining papers listed at the introduction of this post. They are related in some form to the Metropolis-Hastings algorithm, which is commonly used in MCMC techniques to act as a correction to ensure that samples do not deviate too frequently away from the target posterior distribution.
As I mentioned in both of my earlier blog posts, conventional MH tests require a full pass over the entire dataset. This makes them extremely costly, and is one of the reasons why both SGLD and SGHMC emphasized how decreasing step sizes results in lower discretization error, so that they could omit the MH tests.
Their computational cost has raised the question over whether using subsamples of the data for the MH test computation is feasible. It’s not as straightforward as taking a fixed-sized subset (i.e., minibatch) of the dataset because that results in a non-trivial target distribution which is not the desired posterior.
The following two papers propose subsampling-based algorithms that attempt to tackle the high cost of full-batch MH tests:
- Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget, ICML 2014
- Towards Scaling up Markov Chain Monte Carlo: An Adaptive Subsampling Approach, ICML 2014
I discussed the first one in an earlier blog post. The second one follows a similar procedure as the first one, except that it uses a slightly different way of interpreting when to stop data collection. The downside, as I painfully realized when I tried to implement this, is that due to its concentration bounds, it requires a real-valued parameter which depends on the entire collection of values each iteration, which defeats the point of using a subset of the data. (Here, I use to denote the proposed distribution.)
The authors of the Adaptive Subsampling paper have a follow-up JMLR 2017 paper (it was under review for a long time) which expands upon this discussion. I found it quite useful, particularly because of their proof (in Section 6.1) about how naive subsampling for the MH test results in a nontrivial and hard-to-interpret target distribution. In Section 6.3, they introduce a novel contribution where they rely on subsampling noise for exploration; that is, use the minibatch-induced noise (which is Gaussian by the Central Limit Theorem) to explore the posterior. However, they showed that this approach still seems to require data points each iteration. On the other hand, they didn’t investigate this method in too much detail, so it’s hard to comment on its usefulness.
The last related work was the Firefly paper, which won the Best Paper Award at UAI 2014. It can perform exact MCMC, but the main drawback is (emphasis mine):
FlyMC is compatible with a wide variety of modern MCMC algorithms, and only requires a lower bound on the per-datum likelihood factors.
To be clear on what this means, they require the existence of functions satisfying for all . How realistic is that? I have no idea, honestly, but it seems like something that is difficult to achieve in practice, especially because it’s conditioning on and , which will vary considerably throughout sampling. There is some interesting discussion about this at Christian Roberts’ excellent blog, with Ryan Adams (the professor co-author) commenting.
The prior work then motivated our work, where we avoided needing these assumptions and showed that we could cut the M-H test cost time down to one equivalent with SGD, without loss of performance. There’s no free lunch, though; our algorithm has applicability constraints but those are hopefully not that restrictive. Check out our BAIR blog post for more information.
I’ve discussed these set of papers and tried grouping them together to see a coherent theme in all of this. Hopefully this makes it clearer what these papers are trying to do.
I’m actually not sure if we can even use the Metropolis-Hastings test (and not just the “Metropolis Algorithm”) with SGHMC. The authors of the SGHMC paper claim that MH tests are impossible for both SGLD and SGHMC since the reverse proposal probability cannot be computed. It seems to me, however, that one can compute the SGLD reverse probability because that’s a Gaussian centered at the gradient term with some known variance. What am I missing here? At the very least, applying the MH test to regular HMC should be OK, since we can omit the proposal probabilities. And that’s what both the SGHMC authors (judging from Tianqi Chen’s source code) and Radford Neal do in their experiments. ↩
In the process of applying to graduate school, and then visiting schools that admitted me, I was told that PhD students needed to possess solid writing ability in addition to technical skills. One UT Austin professor told me he believed liberal arts students (like me) were better prepared than those from large research universities, presumably because of our increased exposure to writing courses. One Cornell professor emphasized the importance of writing by telling me that he spent at least 50 percent of his professional life writing. A Berkeley professor who I frequently collaborate with has a private Google Doc that he gives to students with instructions on writing papers, particularly about how to structure an introduction, what phrases to use, and so on.
The ability to write well is an important skill for academics, and I don’t mean to dismiss this outright. However, I think that we need to be very clear that technical skills matter far, far more for the typical graduate student, at least for computer science students focusing in artificial intelligence like me. I would additionally argue that factors such as research advisors and graduate student collaborators matter more than writing ability.
Perhaps the emphasis on writing skills is aimed at two groups of people: international students, and the very best graduate students for whom technical skills are relatively less of a research bottleneck. I won’t comment too much on the former group, besides saying that I absolutely respect their commitment to learning the English language and that I know I’m incredibly lucky to be a native English user.
I bring up the second group because much of the advice I get are from faculty at top institutions who were stellar graduate students. Perhaps most of their academic life is dominated by the time it takes to convert research contributions to a paper, instead of the time it takes to actually come up with the contribution itself. For instance, this is what UT Austin professor Scott Aaronson had to say in an old 2005 (!!) blog post, back when he was a postdoc (emphasis mine):
I’ll estimate that I spend at least two months on writing for every week on research. I write, and rewrite, and rewrite. Then I compress to 10 pages for the STOC/FOCS/CCC abstract. Then I revise again for the camera-ready version. Then I decompress the paper for the journal version. Then I improve the results, and end up rewriting the entire paper to incorporate the improvements (which takes much more time than it would to just write up the improved results from scratch). Then, after several years, I get back the referee reports, which (for sound and justifiable reasons, of course) tell me to change all my notation, and redo the proofs of Theorems 6 through 12, and identify exactly which result I’m invoking from [GGLZ94], and make everything more detailed and rigorous. But by this point I’ve forgotten the results and have to re-learn them. And all this for a paper that maybe five people will ever read.
Two months of writing for every week of research? I have no idea how that is humanly possible.
For me, the reverse holds: I probably spend two months of research for every week of actual writing. What dominates my academic life is the time it takes (a) to process the details from academic papers so that I understand how their ideas work, and (b) to build upon those results with my own novel contribution. Getting intuition on novel artificial intelligence concepts takes a considerable amount of mathematical thinking, and getting them to work in practice requires programming skills. Both math and programming fall under the realm of “technical skills.”
Obviously, once I HAVE a research contribution, then I have to “worry” about writing it, but I enjoy writing so it is no big deal.
But again, the research contribution itself must first exist. That’s what frustrates me about much of the academic advice that I see. Yes, it’s easier to tell someone how to write (use this phrase, don’t use this phrase, active instead of passive, blah blah blah), but it would be better to explain the thought process on how to come up with an original research contribution.
I would happily trade away some of my writing ability for a commensurate increase in technical skill.
Again, I am not disregard writing ability, since it is incredibly valuable for many reasons (such as for blogging!!) and more applicable than technical skills in life. However, I believe that the biggest priority for computer science doctoral students should be to focus on technical skills.