# Why Does IEEE Charge Hundreds for Two Extra Pages?

My preprint on surgical debridement and robot calibration was accepted to the 2018 IEEE International Conference on Robotics and Automation (ICRA). It’s in Brisbane, Australia, which means I’ll be going to Australia for the second time in less than a year — last August, I went to Sydney for UAI 2017.

I’m excited about this opportunity and look forward to traveling to Brisbane in May. (That is, assuming Berkeley’s Disabled Students’ Program isn’t as slow as they were in August, but never mind.) I have already booked my travel reservations and registered for the conference.

Everyone knows that long-haul international travel is expensive, but what might not be clear to those outside academia is that conference registration fees can be just as high as those airfare fees. For ICRA, the cost of my registration came to be 1,171.36 AUD before taxes, and 1,275.00 AUD with taxes. That corresponds to 1,033.94 in US dollars. Ouch.

Fortunately, I’m going to get reimbursed, since Berkeley professors are not short on money, but I still wish that costs could be lower. The breakdown was: 31.36 AUD for a hotel deposit (I’ll pay the full hotel fees when I arrive in May), 600 AUD for the early-bird IEEE student membership registration, 100 AUD for the workshops/tutorials, and 440 AUD for the two extra page charge.

Wait, what was the last one?

Ah, I should clarify. The policy of ICRA, and for many IEEE conferences at that (hence the title of this blog post), is the following:

Papers to ICRA can be submitted through two channels:

1. To ICRA. Six pages in standard ICRA format and a maximum of two additional pages can be purchased.
2. To the IEEE Robotics and Automation Letters (RA-L) journal, and tick the option for presentation at ICRA. Six pages in standard ICRA format are allowed for each paper, including figures and references, but a maximum of two additional pages can be purchased. Details are provided on the RA-L webpage and FAQ.

All papers are submitted in PDF format and the page count is inclusive of figures and references. We strongly encourage authors to submit a video clip to complement the submission. Papers hosted on arXiv may be submitted to ICRA.

So, in short, we can have six pages, and can purchase two extra pages if needed.

This makes no sense to me.

Is it because of printing costs associated with the proceedings? It shouldn’t be. The proceedings, as far as I know, are those enormous books that concatenate all the papers from a conference.

They are also worthless and should never be printed outside of maybe one or two historical copies for IEEE’s book archives. No one should read directly from them. Who has the time? Academics are judged based on the papers they produce, not the papers they consume. This year, ICRA alone accepted 1030 papers (!!). Yes, over a thousand. It makes no sense to browse proceedings to search for a paper; just type in a search query on Google Scholar. If you think you might be missing a gem somewhere in the proceedings, I wouldn’t worry. Good papers will make themselves known eventually through word of mouth. They also tend to be widely accessible to all, such as being available on arXiv rather than being stuck behind an IEEE paywall. Most universities have IEEE subscriptions so it’s not generally a problem to download IEEE papers for free, but it’s still a bit of an unnecessary nuisance.

Speaking of arXiv, perhaps IEEE doesn’t follow a similar model due to hosting costs? That doesn’t seem like a good rationale, and in particular it doesn’t justify the steep jump in price from 6 to 8 pages. Why not have pages 1 through 6 charged accordingly? Or simply make the charge based on file size instead of page size, while obviously keeping a hard page limit to alleviate the load on reviewers. There seem to be way more rational price structures than the 220 AUD each for pages 7 and 8.

It seems like ICRA organizers would prefer to see 6 page papers, yet the problem is that everyone knows that if you allow 8 pages, then that becomes the effective lower bound on paper length. An 8-page paper has a better chance of being accepted to ICRA than a 7-page paper, which in turn has odds over a 6-page paper. And so on. Indeed, if you look at ICRA papers nowadays, the vast majority hit the 8 page limit, many with barely a line to spare (such as my paper!). The trend is possibly even more pronounced with ICML, NIPS, and other AI conferences, from my anecdotal experience reading those papers.

Such a cost structure might needlessly disadvantage students and authors from schools without the money to easily pay the over-length fees. This is further exacerbated by ICRA’s single-blind policy, where reviewers can see the names of authors and thus be influenced by research fame and school institution name.

In short, I’m not a fan of the two-page extra charge, and I would suggest that ICRA (and similar IEEE-based conferences) switch to a simple, hard, 8-page limit for papers. In addition, I would also like to see all accepted papers freely available for download in an arXiv-style format. If hosting costs are a burden, a more rational price structure would be to slightly increase the conference registration fees, or encourage authors to upload their papers on arXiv in lieu of being hosted by ICRA.

To be clear, I’m still extremely excited about attending ICRA, and I’m grateful to IEEE for organizing what I’ve heard is the preeminent conference on robotics. I just wish that they were a little bit clearer on why they have this two-page extra charge policy.

# Twists and Exponential Coordinates

In this post, I build upon my previous one by further investigating fundamental concepts in Murray, Li, and Sastry’s A Mathematical Introduction to Robotic Manipulation. One of the challenges of their book is that there’s a lot of notation, so I first list the important bits here. I then review an example that uses some of this notation to better understand the meaning of twists and exponential coordinates.

Side comment: there is an alternative, more recent robotics book by Frank C. Park and Kevin M. Lynch called Modern Robotics. It’s available online and has its own Wikipedia, and even has some lecture videos! Despite its 2017 publication date, the concepts it describes are very similar to Murray, Li, and Sastry, except that the presentation can be a bit smoother. The notation they use is similar, but with a few exceptions, so be aware of that if you’re reading their book.

Back to our target textbook. Here is relevant notation from Murray, Li, and Sastry:

• The unit vector $\omega \in \mathbb{R}^3$ specifies a direction of rotation, and $\theta \in \mathbb{R}$ represents the angle of rotation (in radians).

An important fact is that any rotation can be represented as rotating by some amount through an axis vector, so we could write rotation matrices $R$ as functions of $w$ and $\theta$, i.e., $R(\omega, \theta)$. Murray, Li, and Sastry call this “Euler’s Theorem”.

Note: if you’re familiar with the Product of Exponentials formula, then $\theta = (\theta_1, \ldots, \theta_J)$, which generalizes to the case when there are $J$ joints in a robotic arm. Also, $\theta_i$ doesn’t have to be an angle; it could be a displacement, which would be the case if we had a prismatic joint.

• The cross product matrix $\hat{\omega}$ satisfies $\hat{\omega} p = \omega \times p$ for some $p \in \mathbb{R}^3$, where $\theta$ indicates the cross product operation. More formally, we have

and so $\hat{\omega}$ is a skew-symmetric matrix, of which the set of those is denoted as $so(3)$. This is easily verified by explicit computation.

• The matrix exponential $e^{\hat{\omega} \theta}$ is a $3\times 3$ matrix in the set $SO(3)$. In other words, it’s a rotation matrix! We can write it in closed form from Rodrigues’ formula:

Here’s the relevant mathematical relationship: the exponential map transforms skew-symmetric matrices into orthogonal matrices, and every rotation matrix can be represented as the matrix exponential of some skew-symmetric matrix.

• A twist $\hat{\xi} \in se(3)$ is defined as the set of $4\times 4$ matrices parameterized by exponential coordinates $\xi = (v,\omega)$ s.t. $v \in \mathbb{R}^3$ and $\hat{\omega} \in so(3)$. The matrix is written as

and yes, the last row consists of four zeros. We can derive this matrix from considering rotations about revolute and prismatic joints, where $\omega$ is the axis of rotation, and $v$ is the vector describing the translation. (See Section 3.2 in Murry, Li, and Sastry for details.)

Incidentally, sometimes we call twists as $\hat{\xi}\theta$, or with the $\theta$ “attached” to it. We can also do the same for the exponential coordinates.

• The matrix exponential of a twist $e^{\hat{\xi}\theta}$ represents the relative motion of a rigid body. Hence, if we left-multiply this with a vector, we interpret it as moving the input vector with respect to the same frame. Said another way, we are not changing the “frame of reference” for the input vector.

Given twist coordinates $(v,\omega)$, we can explicitly construct the RBMs:

if $\omega = 0$. This is equivalent to a pure translation. If $\omega \ne 0$, we have the slightly more complicated formula

Both of the above are elements of $SE(3)$, so they are rigid body motions. To be clear, you need to construct these from the actual definition of a matrix exponential based on its Taylor series expansion.

Let’s look at the example in the figure below to better understand the above concepts.

We have inertial frame $A$ and body frame $B$. With some pencil and paper, you can show that the RBM from $B$ to $A$ is:

which follows the style from my last post: determine the rotation and translation components and then plug them in. Fortunately, the rotation is about the $z$ axis, so $R_{ab}$ is easy.

It turns out that every RBM can be expressed as the exponential of some twist $\xi \in \mathbb{R}^6$. So let’s consider the following:

Given that $g_{ab} = e^{\hat{\xi} \theta}$ (note $\hat{\xi}$, not $\xi$) and that we know the exact form of $g_{ab}$, can we compute $\xi$?

To do this, we need to compute the components $v$ and $\omega$.

• Let’s do $\omega$ first. From our above equations for $e^{\hat{\xi}\theta}$ and $g_{ab}$, we equate components and see that we need

We know $R_{ab}$ corresponds to a rotation matrix about the $z$-axis only. This means $\omega = (0,0,1)$. We also simply set $\alpha = \theta$.

• Now consider $v$. Once again, we equate components from both sides to get

This is the standard “solve for $x$ in $Ax=b$” problem that you saw in linear algebra classes. Solving for $v$, we get $v = (\frac{l_1-l_2}{2}, \frac{(l_1+l_2)\sin\alpha}{2(1-\cos\alpha)}, 0)$.

We conclude that our exponential coordinates which generated $g_{ab}$ are

assuming that $\alpha \ne 0$

# The Special Exponential Group

Over the last few weeks, I have been devouring Murray, Li, and Sastry’s A Mathematical Introduction to Robotic Manipulation. It’s freely available online and, despite the 1994 publication date, is still relevant for robotic manipulation as it’s used in EE206 at Berkeley.

(The main novelty today, from my perspective, is the use of Deep Learning to automate out analytic models under certain conditions, but I still think it’s valuable for me to know classical robotics concepts.)

In this post, I discuss the Special Exponential group, denoted as $SE(3)$. We can define it as follows:

where $p$ is a position vector, and $R$ is a matrix in the special orthogonal group $SO(3)$:

This is the same as saying that $R$ is a rotation matrix.

Side comment: the reason for “$(3)$” as the input is that $SE$ and $SO$ can be generalized to an arbitrary amount of dimensions. However, I’m only concerned about robotic manipulation in $\mathbb{R}^3$.

The above suggests the obvious question:

For what purpose do we utilize $SE(3)$?

We use $SE(3)$ to encode rigid body motions (RBMs) in robotic manipulation, which preserve distance between points and angles between vectors. RBMs consist of a rotation and a translation. To visualize RBMs, look at the left of the figure below. There are two coordinate frames: frame $A$, which is “inertial” (I think of this as the “default” frame), and frame $B$, attached to the base of the curvy object drawn there.

Vector $p_{ab} \in \mathbb{R}^3$ shows the 3-D position of the origin of $B$ with respect to $A$. This ordering from $B$ to $A$ and not vice versa is important, and it’s important to know these well for robotic manipulation, which in advanced contexts relies on multiple, consecutive coordinate frames attached to links in a robot arm. For rotation matrices, we’ll keep the ordering of subscripts the same and write $R_{ab}$ to indicate that it transforms 3-D points from frame $B$ to frame $A$.

Left: visualization of two coordinate frames, one inertial and one attached to the base of an object. Right: again, two coordinate frames visualized, this time in the context of rotating about an axis.

Now consider the point $q$ on the object. We can express its coordinates as $q_a$ or $q_b$, depending on which frame of $A$ or $B$ is our reference. Suppose we have $q_b$. A rigid body motion can be conducted as follows:

and we’ll collect as $g_{ab} = (p_{ab},R_{ab})$ all the information needed to specify an RBM, transforming coordinates from $B$ to $A$. This is an element of $SE(3)$. Indeed, any such RBM must be an element of $SE(3)$, which defines what’s known as a configuration space of RBMs. Configuration spaces are defined in page 25 of Murray et al:

More generally, we shall call a set $Q$ a configuration space for a system if every element $x\in Q$ corresponds to a valid configuration of the system and each configuration of the system can be identified with a unique element of $Q$.

I should also provide some intuition to make it clear what happens when we “transform coordinates.” One way is as follows. If we view $p_{ab}$ as jetting out in the positive $x$, $y$, and $z$ directions of frame $A$, and the same for $q_b$ w.r.t. frame $B$, then the components of $q_a$ are element-wise larger than in $q_b$. This is why we add when doing RBMs with translations, since the vector $p_{ab}$ will increase the values. (Drawing a picture really helps.)

Keeping $p_{ab}$ and $R_{ab}$ separate can result in some cumbersome math when a bunch of rotations and translations are combined. Thus, it’s common to use homogeneous coordinates. A full discussion is beyond the scope of this blog post, but for us, the important point is that if $(p,R) \in SE(3)$, then the equivalent homogeneous representation is

where the “0” above is a row vector of three zeros. This enables us to perform one matrix-vector multiply for a 3-D point which is expanded with a fourth coordinate with a “1” in it. Thus, the origin point is $o = (0,0,0,1)^T$, and for vectors — defined as the difference between two points — the fourth component is zero.

Let’s do an example. Consider the second image in the figure above, showing rotation about an axis $\theta$. Given a fixed, “real-life”, physical point, let’s show how to encode a rigid body transformation which can transform a coordinate representation of that point from frames $B$ to $A$.

• Translation. Inertial frame $A$ and frame $B$ differ only by $l_1$ in the y-direction, with $p_{ab} = (0,l_1,0)^T$ representing the origin of frame $B$ with respect to frame $A$.

• Rotation. The rotation about $\theta$ coincides with the $z$ axis. Hence, we use the well-known formula for the $z$-axis rotation matrix:

While the precise positioning of the sines and cosines might not be immediately apparent, it should be clear why the last row looks like that, since a rotation about the $z$ axis leaves the $z$ component of the 3-D vector unchanged. You can also easily check that $R^TR=I_3$.

The matrix above is $R_{ab}$, the orientation of the origin of frame $B$ w.r.t. frame $A$.

These two form our specification in $SE(3)$. Combining these by using the homogeneous representation for compactness, our RBM from $B$ to $A$ is:

where $\vec{a}$ and $\vec{b}$ represent, in homogeneous coordinates, the same physical point but with respect to frames $A$ and $B$, respectively.

# All the Books I Read in 2017, Plus My Thoughts [Long]

A year ago, I listed all the books that I read in the year 2016. I listed 35 books with summaries for each, and grouped them into categories by subject. I’m pleased to announce that I am continuing my tradition with this current blog post, which summarizes all the books I read in 2017.

As before, I am only listing non-fiction books, and am excluding (among other things) textbooks, magazines, and certainly all the academic papers1 that I read. Books with starred titles (like ** this **) are those that I especially enjoyed reading, for one reason or another.

In 2017, I read 43 books, which is eight more than the 35 I reported on last year. Yay! The book categories are:

1. Artificial Intelligence and Robotics (11 Books)
2. Technology, Excluding AI/Robotics (5 Books)
3. Business and Economics (5 Books)
4. Biographies and Memoirs (6 Books)
5. Conservative Politics (3 Books)
6. Self-Help and Personal Development (3 Books)
7. Psychology and Human Relationships (4 Books)
8. Miscellaneous (6 Books)

Within each section, books are listed according to publication date.

I hope you enjoy this blog post! For 2018, I hope to continue reading lots and lots of non-fiction books with a heavy focus on technology, businesses, and economics.

## Group 1: Artificial Intelligence and Robotics

Yowza! From the 11 books here, you can tell that I’m becoming a huge fan of this genre. ;-)

• ** Incognito: The Secret Lives of the Brain ** is a thrilling 2011 book by neuroscientist David Eagleman of the Baylor College School of Medicine. (I consider this book as “AI” in this blog post, though you could argue that “Psychology” might be better.) It is clearly designed for the lay reader with interest in neuroscience, like me, due to a number of engaging examples that thrill the reader without going overboard with the technical content. Eagleman describes how we don’t actually have that much control over our brain, that there are so many unexpected contradictions with how we think, and mentions a few interesting neuroscience factoids. Did you know, for instance, that half of a child’s brain can be removed and the child can still survive? On a technical note, I was impressed with how Eagleman referenced a few machine learning papers from Michael I. Jordan and Geoff Hinton in his footnotes about hierarchical learning. From the perspective of a computer scientist, the most interesting part was when he talked about the brain being a team of competing rivals. This is awfully similar to the idea behind Generative Adversarial Networks an enormously successful and well-cited paper that came out in NIPS 2014 … three years after this book was published! I have no idea how a non-computer scientist was able to almost predict this, but it shows that cross-collaboration between neuroscientists may be good for AI. He doesn’t get everything right, though. He mentions more than once that artificial neural networks have been a failure. Well… this book was published in 2011, and Alex-Net came out in 2012, so that kind of flopped quickly. Despite this, I hope to see Eaglemen write another book about the brain so that I can see a revised perspective. Incognito also contains interesting perspectives on neuroscience and the law. Eagleman doesn’t take sides but doesn’t go in too much depth either. He says the “bar” for blameworthiness will change depending on available neuroscience, which he says is a mistake (and I agree).

• ** The Most Human Human: What Artificial Intelligence Teaches Us About Being Alive ** is an engaging 2011 book written by Brian Christian, who also (co-)authored another book I read this year, Algorithms to Live By. This book’s main focus is on the Turing Test, and it plays a greater theme in this one more than any of the other AI books I’ve read. Simply put, while we’re so obsessed about getting a computer to fool human judges (the “most human computer”), Christian argues that an equally important criteria is the “most human human.” In other words, in the Turing Test, who is the human that can most convince the judges that he/she is human? The book’s chapters are explorations of the different factors that make us human, among other things our ability to barge and interrupt, our use of “um” and “uh”, our constant sidetracking, and so forth. Intuitively, these are hard for a computer to model. Christian has a computer science background, so some of the book covers technical concepts such as entropy, which he argues we need to be making as high as possible; low entropy means we’re not saying anything worth knowing more about. And yet, these seem to be the most negatively encouraged aspects of our society, which is quite odd in Christian’s opinion. I enjoyed reading most of the book because it stated observations that seem obvious to me in retrospect, but which I never gave much thought. That’s the best kind of observation. (It’s like Incognito to some extent, and indeed the Incognito author praises this book!) The biggest drawback is that it never goes through a blow by blow of the actual Turing test! I mean, c’mon, I was looking forward to that, and Christian essentially ruined it by fast forwarding to the end, when he mentions he won the title of the “most human human.” Well, congratulations dude, but why wasn’t there at least a complete transcript in the book???

• ** Our Final Invention: Artificial Intelligence and the End of the Human Era ** is a 2013 book by documentary filmmaker James Barrat exploring the rise of AI and its potential existentialist risk. This is not far off the mainstream of AI researchers and programmers as it sounds. In fact, Peter Norvig and Stuart Russell include a brief discussion about it at the end of their famous AI textbook. Russell has also explicitly said that he’s clearly worried about superintelligence. So what is superintelligence? I view it as AI that is so far advanced that it becomes better and better, and surpasses humans in just about every quality imaginable.2 I think I enjoyed reading this book, even if it is a tad too sensational. There isn’t that much technical detail, which is OK since it’s a popular science book. Barrat makes an excellent point that scientists need to make their work accessible to the public. I agree — that’s partly why I discuss technical stuff on this blog — but I also think that people have got to start learning more math on their own. It needs to be a two-way partnership. Moving on, an unexpected benefit of reading Our Final Invention was that I learned about the work done by I.J. Good, Eliezer Yudowsky, and others from the Machine Intelligent Research Institute. I’m embarrassed to admit that I didn’t know about those two people beforehand, but now I will remember their names. Unlike MIRI in most Berkeley AI research groups, such as the ones I’m in, we don’t give a modicum of thought to existential risks of AI, but the topic is garnering more attention. One final comment about the book: Barrat mentions that no computer is better than a child at object recognition. Well, whoops. They are now! He talks a lot about neural networks and how we don’t understand them, which by now feels old and I wish authors would take note of all the people workong on this area. I’d like to see an updated version of Our Final Invention in 2018, with the last five years of AI advancement taken into consideration.

• ** Superintelligence: Paths, Dangers, Strategies ** is a 2014 book by Oxford philosopher Nick Bostrom.3 His philosophy background is apparent in the way Superintelligence is written, though it is obviously much easier to read than a real, academic philosophy paper. In this book, Bostrom considers when AI grows to the point where machines are “superintelligent.” That term can be broadly understood as when machines are so powerful and intelligent that they effectively have complete control over the future of the universe. Why, Bostrom says, can we assume that superintelligence is friendly? We cannot, he concludes. The book is about different ways we can get to superintelligence (i.e., “paths”) including AI, emulating the brain, collective superintelligence (think a super-charged Internet) and so forth. There are also “dangers” and “strategies”. Bostrom convincingly explains why superintelligence poses an existential risk to humanity, and also explains what strategies we may take to counter it, such as by uploading appropriate values to the agent. Unfortunately, none of his solutions are clear-cut. There are two things you will notice when reading this book. First, almost every assumption has a counterexample or unexpected consequence. Bostrom often comes up contrived scenarios for this purpose. Second, Bostrom frequently cites Eliezer Yudkowsky’s work. I first learned about Yudkowski by reading James Barrat’s Our Final Invention (see above). If you like Yudkowsky, you’ll like Bostrom. Taking a more general view, Superintelligence is meant to be a serious academic-style discussion, but not a recipe that can be easily followed, because it assumes so many things and continues to make the reader feel like every case is hopelessly complicated with advantages and disadvantages abound, both obvious and non-obvious. Overall, I’m happy I read this book even if it is wildly premature. It made me think hard about any assumptions I make in my work.

• What to Think About Machines That Think: Today’s Leading Thinkers on the Age of Machine Intelligence collects responses based on a 2015 survey of the EDGE question: “what do you think about machines that think?” This was sent to about a hundred or so experts in a variety of fields, mostly different academics and well-known authors. Obvious inclusions were Nick Bostrom (of Superintelligence fame), Eliezer Yudkowsky, Stuart Russell, Peter Norvig, and others, but there were also some interesting new additions. I of course did not know the vast majority of the people here. This book is a bit of an unusual format; each author’s answer to this question took up 1-4 pages. There were no constraints otherwise, so the answers varied considerably. For instance, I like Steven Pinker’s comment of why people don’t think AI will “naturally develop along female lines […] without the desire to take over the world.” Gee, isn’t that stereotyping, Pinker?? There were also some amusing responses, such as when someone said that he himself was an AI and people didn’t know it (!!) as well as alarmingly short answers saying “machines can’t think”. Most responses were along the same theme of: “AIs taking over the world aren’t going to happen anytime soon, but they are affecting us now, sometimes in subtle ways, and [insert ‘novel’ insight here].” Overall, while I like the idea of the book format, I think the utility that I derive from reading is more about understanding a long, engrossing story. This is not necessarily a bad book, and I can see it being useful for someone who can only read books in short 3-5 minute spurts at a time, but it’s not my style.

• ** Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots **, is a 2015 book by NY Times journalist John Markoff. Though a journalist, Markoff has frequently written about AI-related topics and has good connections in the field. Machines of Loving Grace gives an impressively balanced view of the history of Artificial Intelligence (AI) and Intelligence Augmentation (IA), which can be roughly thought of as HCI (I might view it as a subset of HCI). Obviously, I am more interested in the AI aspect. Markoff covers topics such as self-driving cars (unsurprisingly), and Rodney Brooks’ Baxter robot, which has been used in many research papers that I’ve read. But more surprising from a Berkeley perspective, Markoff also mentioned Pieter Abbeel’s work, though this was his clothes folding experiment, and not his later, more exciting DeepRL stuff from 2014 and onwards. I was also — unsurprisingly — interested in Markoff’s description of the history of how the neural network pioneers met each other (e.g., Terence Sejnowski, Geoffrey Hinton, and Yann LeCun). For the IA side, the most prominent example is Apple’s Siri, which interacts with humans, though I don’t have much to say about it because I have never used Siri. Yes, I’m embarrassed. On the AI vs IA dilemma, Markoff notes people such as John McCarthy and Doug Engelbart on opposite sides. And of course who could forget Marvin Minsky who decimated the field of neural networks with his legendary 1969 book? I decided to read this based on Professor Ken Goldberg’s brief Nature article, and was pleased that I did so, mostly (again) to learn about history (since I’m trying to become like one of those AI experts…) and the importance of ensuring that, at least for AI applied to the real world, we keep the human interaction aspect in mind.

• Rise of the Robots: Technology and the Threat of a Jobless Future is a 2015 book by technologist Martin Ford, who warns that our society is not prepared to handle all the future technological advances with robots automating out jobs. He begins by arguing that IT advances have not been as useful as electricity and other breakthroughs, and indeed that is a key theme from Robert Gordon’s The Rise and Fall of American Growth. To make the point clear, in response to Ray Kurzwiel saying that smartphones have provided incalculably large benefits to their owners, Ford counters with: “in practice, they may offer little more than the ability to play Angry Birds while standing in an unemployment line.” Ford continues by citing sources and reasons for the decline of the middle class in America. This part of the book is not controversial. Ford then raises the point that the IT revolution, along with not just robotics but also machine learning, means that even “high-skilled” jobs are at risk of being automated out. We now have machines that can write as well as most humans, that play Jeopardy! (as expected, IBM’s Watson was mentioned) and which perform better at image recognition and language translation using Deep Learning. Ford worries that, in the worst case, an elite few with all the wealth will hoard it and be guarded in a fortress by robots. Yes, he admits this is science fiction (and he discusses the Singularity, probably not the best idea…) but the point seems clear. Ford concludes the book with what he probably wanted to discuss all along: Universal Basic Income to the rescue! I heard about this book from Professor Ken Goldberg’s brief Nature article, who is critical of Ford “falling for the singularity hype” and his “extremely sketchy” evidence. I probably don’t find it as bad, and lately I’ve been thinking more seriously about supporting a Universal Basic Income. We might as well try on smaller scales, given that the best we can hope for in the future is more debt and safety net cuts with Republicans (now) or more debt inefficient bureaucracy with Democrats (in 2018/2020).

• Our Robots, Ourselves: Robotics and the Myths of Autonomy, a 2015 book written by MIT professor David A. Mindell, is the third robotics-related book that I read based off of Professor Ken Goldberg’s brief Nature article. Mindell has an unusual background, being a Professor of Aeronautics and Professor of the History of Engineering and Manufacturing (I didn’t even know that was a department). He’s also a pilot. So this book brings together his expertise when he discusses what it means for robots to be automated. Our Robots, Ourselves discusses five realms: sea, land, air, war, and space, and shows that in all of those, it is not straightforward to claim that robots are being more and more autonomous at the expense of the human aspect. In addition, Mindell tells stories of the natural conflict between increasing automation and human employees. For instance, with sea, what does it mean for geologists and scuba-diving analysts if robots do it for them? Does it detract from their job? A similar thing rings true for pilots. We need some way of humans taking over in emergencies, and pilots are worried that increasing automation will lower the prerequisite skills for the job and/or reduce the job’s purpose. Next, consider war. People who once fought on the front lines or as air force pilots are feeling resentful that those who manage drones remotely are getting respect and various honors. Mindell argues that increasing automation must also go along with better human-robot interaction, a topic which is rightfully becoming increasingly important for academia and the world. After reading this book, I now believe I do not want systems to be fully autonomous (a huge issue with self-driving cars) but instead, I want the automation to work well with humans. That’s the key insight I got from this book.

• ** Algorithms to Live By: The Computer Science of Human Decisions ** is a 2016 book co-authored by writer Brian Christian and Berkeley psychology professor Tom Griffiths. It consists of 11 chapters, each of which correspond to one broad theme in computer science, such as Bayes’ Rule, Overfitting, and Caching. Most of these topics are related to algorithms and machine learning, which wasn’t particularly surprised to me given the authors’ backgrounds. I also know Professor Griffiths publishes machine learning papers on occasion, such as his groundbreaking 2004 paper Finding Scientific Topics. Algorithms to Live By lists how the major technical issues and questions related to these topics can have implications for actions in our own lives, such as dating, parking cars, and designing our rooms/desks (this example with caching always comes up). The authors point out how, in practice, the algorithms people engage in for these activities can be surprisingly correct or well off the mark of optimality, where here the metric is based on mathematical proofs. Of course, whenever we talk about mathematical proofs, we have to be clear on what assumptions we make, which will drastically affect our options, and which in fact can often validate some of the seemingly irrational activities that humans perform. I tremendously enjoyed reading this, though admittedly it was easier for me to digest the material given that I knew the main idea of the computer science concepts covered. It was nice to get a high-level overview, though, and I still learned a lot from the book since I have not studied every computer science subfield in detail. My final thought is that, just like when I read The Checklist Manifesto last year and tried to think about utilizing checklists myself, I will try and see if I can incorporate some of the authors’ suggestions in my own life.

• ** Thinking Machines: The Quest for Artificial Intelligence and Where It’s Taking Us Next ** is a recent 2017 book by journalist Luke Dormehl. I found out about it by reading Ray Kurzweil’s favorable book review in The New York Times. Kurzweil remarks that Luke is a journalist who “actually knows the technical details.” I think that’s true, though there is virtually no math in this book, or at least very little of it compared to Pedro Domingos’ book. In Thinking Machines, Dormehl mentions the backpropagation algorithm which has powered Deep Learning, but only at a very high level (obviously). He also talks about Deep Learning’s history, which I know already (and it could have been derived right from Stanford’s CS 231n slides) but it’s good to have here. Dormehl writes about the by-now famous story that “neural networks were ignored for a while then they became popular and are now known as Deep Learning,” which Professor Jitendra Malik would remark is “more marketable.” As far as technical material goes, it’s correct, so no worries. Dormehl includes a substantial amount of material about sensors, the Internet of Things (similar to Thomas L. Friedman’s Thank You For Being Late) and of course about AI ethics, laws, and the singularity. These are not new themes, but the difference between this book and others is that it’s very recent and current, which is useful due to the fast-growing pace of AI, so it was able to cover AlphaGo from DeepMind. I consider it a broad “story” about AI, and less opinionated compared to James Barrat’s Our Final Invention. It’s of reasonable length (not too long, not too short) and great for a wide audience of readers. Overall, I enjoyed reading the book, and it kept me up late longer than I should have.

• ** Heart of the Machine: Our Future in a World of Artificial Emotional Intelligence ** is a 2017 book also recommended by Ray Kurzweil. The author is Richard Yonck, founder and president of Intelligent Future Computing, a company which provides advice on the impact of technology on business and society (is this called “consulting”?). Heart of the Machine, like many AI-related books, discusses recent research and commercial advances, but it emphasizes an emotional perspective. It discusses how we got to affective computing and the rise of emotional machines. The first part contains a little history and discusses some of the labs that are working on this (e.g., the MIT Media Lab). The second, like the first, shows how many companies are measuring emotions, in part using advances in AI and Big Data analysis, and cautions us about the uncanny valley. The third part of the book is about the future, and obviously sexbots play a role. What I remember most form the book are its anecdotes, one of the most touching of which was when someone wanted to marry a robot, and a parent opposed this. Will this be the future of marriage? The first step is interracial marriage, then the next is same-sex marriage, and the last (?) step is human-machine marriage. Yonck shows his academic side by citing some ACM/IEEE International Conference on Human-Robot Interaction papers. That’s a niche-style conference but will likely grow into something much larger in the coming years (see the 2018 website here), similar to how NIPS grew from a niche into an enormous conference with thousands of attendees each year. Finally, I appreciated that Yonck said we are already merging with technology in some ways. For instance, many deaf people opt for cochlear implants to better interact in a hearing world. (I likely would have one had I been born a few years later and if hearing aids were not already highly successful for me.) We already merge with technology so much, and this is likely to increase in the future.

## Group 2: Technology, Excluding AI/Robotics

These are books loosely related to technology, though excluding AI and robotics, as I discussed those in the previous section.

• ** The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies **, co-authored by Erik Brynjolfsson and Andrew McAfee. Both authors are from MIT: Brynjolfsson is a professor of management and McAfee a research scientist. The authors appear to take the opposite perspective of economist Robert Gordon (author of The Rise and Fall of American Growth, which I discuss later), a point that they emphasize repeatedly: they argue that we are now at an inflection point and that we are on our way towards better times, and not stagnation. Their key rebuttal to Gordon is that innovation is due to recombination. Sure, we may not invent brand new things like electricity, but the IT revolution was all about combining stuff that had previously existed, and that will continue onwards as more people are able to try new things. As expected, they provide the usual disclaimers (at least from the technologically elite) that technological growth isn’t always great, that people fall behind, etc. To their credit, both men propose solutions, which I think are reasonable and — crucially in today’s politics — are widely agreed upon by economists across the entire political spectrum. For instance, they mention the universal basic income but seem to prefer the more mainstream “earned income tax credit” idea, and I think I can agree. One major quibble I have is that the book has one chapter to AI, but the actual AI portion of it is only two and a half pages long. And this for what might be the biggest technology advance of the 21st century! Fortunately, they seem to have given it greater attention since the book was first published. I bought The Second Machine Age in December 2016 as a Christmas gift, and they included a new introduction saying that they had underestimated progress in AI, particularly with deep neural networks, a topic which I frequently blog about here! (Incidentally, I saw Brynjofsson’s praise for a MOOC on Deep Learning … even MIT professors are going to MOOCs4 to learn about the subject!) The book is relatively straightforward to read and oozes more excitement compared to Gordon’s book. There is a book website for more details if you are interested. Brynjofsson and McAfee have since written more about Deep Learning, as you can see from their NY Times article after AlphaGo famously beat Go super-duper star Lee Sodol. I feel extremely fortunate to be in a position where, though I’m not the one creating this stuff, I can understand it.

• ** Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy ** is a 2016 book by mathematician-turned-data scientist and author (and MathBabe founder) Cathy O’Neil who argues that the use of large datasets in industry and government contexts has, well, increased inequality in our society. She describes stories about how the use of big data to predict whether someone will commit a crime or default on a loan has a harmful feedback loop on the poor and minorities. (Blacks are the minority group emphasized the most in the book.) Why is there a feedback loop? Minorities are more likely to be around people who are committing crimes, and the “birds of a feather” mentality among big data algorithms is that they tend to relate people to those who bear similar qualities. Whereas in the past, for instance, a banker might not have relied on big data but on his instincts to grant or deny loans, which would hurt women and minorities, nowadays we mostly have data algorithms to determine that, but even so, algorithms have their own biases and values (indeed, this is an academic research topic, see the BAIR Blog post on this which also uses Google’s “labeling blacks as gorillas” example of algorithms trained on wrong data). O’Neil calls for increased transparency in these algorithms, which she calls Weapons of Math Destruction (WMDs), and for the people working on these algorithms to better understand the values that are inherent in the models. I enjoyed most of the short, fast-paced book and highly recommend it. It’s also worth noting that O’Neil regularly writes columns about this subject area, which interested readers should check out.

• ** Thank You for Being Late: An Optimist’s Guide to Thriving in the Age of Accelerations ** is Thomas L. Friedman’s most recent book, published in late 2016 (though the manuscript was done before the outcome of the presidential election). For a long time, I’ve been following Friedman’s Sunday weekly columns at The New York Times, which has served as a preview for what’s to come in the book: Moore’s Law, the refugee/migration crisis, unstable governments, droughts and climate change, and the polarization at the highest level of American politics. Friedman goes through these and discusses topics much like he did in The World is Flat, though I think he tempers his idiosyncratic writing style. He mentions at one point a handful of policy changes he’d like to do, and claims he’s neither left nor right politically and that those labels are now outdated. For instance, he’s very free trade (right) but also for single-payer health care (left). I was duly impressed from the book because it taught me much about how the world works today. It also made me appreciate that I’m in a position where I can take advantage of what the world has to offer. Thank You for Being Late also mentioned several technical topics that I’m passionate about. It was really nice to see a mainstream, “non-technophobe” talk about Moore’s Law, GitHub7, and even TensorFlow/Deep Learning (!!); he explained these topics as well as he could given the non-technical nature of the targeted audience. I also appreciated the surgeon general’s comment in the end that America’s biggest killer “was not heart disease, but isolation” which is ironic given how we are more connected than ever before. Ultimately, I want to be part of that acceleration and, of course, to ensure that the vast majority of Americans aren’t left behind (including myself!). The book, however, made me concerned about the future. I finished this just a few days before Trump was inaugurated so … hopefully things will be OK.

• What to Do When Machines Do Everything: How to Get Ahead in a World of AI, Algorithms, Bots, and Big Data is a recent 2017 book by three leaders from Cognizant, a firm which I didn’t know about beforehand. This book takes the now-standard view (at least among many technology thinkers) that automation will be overall better for us, destroying some jobs but also creating new ones and clearing out old drudgery. One thing the authors note which I haven’t heard before is that they subscribe to the “S-Curve”: we’re in a “stall” zone, but for the next two decades, we will experience dramatic economic growth with more equalizing effects as it relates to income distribution. I find this hard to believe, unfortunately. Another perspective the authors bring is that once old entrenched companies make more of a digital transition, that’s when we’ll really see GDP take off. Regarding the book style, it’s short and written in a mini-textbook style. The abbreviations in it were a bit corny but I enjoyed the examples, at least the ones they had. I surprisingly didn’t seem to enjoy it as much as some other similar books, probably because some of their advice is really high level and generic, over-simplifying things. All in all, I think the book is mostly correct on a technical level but may not be my style.

## Group 3: Business and Economics

• How the West Grew Rich: The Economic Transformation of the Industrial World. This is an old 1986 book by the late economist and historian Nathan Rosenberg and co-author L.E. Birdzell Jr, an attorney and legal scholar, and I have several quick thoughts. The first was that this book was a real slog for me to read. It’s not even close to being the longest book I’ve read8 but I had to struggle through it; I think the writing style of 1986 is different from the one I’m used to today, but I’m also partly to blame since I spaced out my reading over many evenings when I was tired. In any case, this book is about capitalism in some sense, though the authors complain that the term is misleading. Their main argument is that the freedom of business and enterprise from religious and political control was the key factor in explaining the rise of the West, and not other factors generally attributed, such as science or mass production. Judging from the book, the prevailing wisdom at that time may have been mass production, but apparently not to them. It’s a bit interesting to think about what conclusions they make which are still relevant today, like how it’s so hard for Third World countries to catch up. I was also amused at seeing the Soviet Union mentioned so much, and I had to remind myself: 1986, 1986, 1986. (In a shout-out to the AI people reading this, that was the year when Rumelhardt, Willimans, and Hinton published their famous backpropgation paper with “readable math”.) Ultimately, while this book has some good spots in it, I lost focus too much to really benefit from it, and I think The Rise and Fall of American Growth is a vastly superior alternative, unless you want to get a better understanding of European stuff (not just American) and also some discussion about the Middle Ages.

• Shop Class As Soulcraft: An Inquiry Into the Value of Work, is a 2009 book written by Matthew Crawford, who has one of the most unusual profiles among authors I read. Crawford is a mechanic and works at a bike shop, but he also holds an undergraduate degree in physics and a PhD in political philosophy from the University of Chicago. After his PhD, he worked at a “think tank” (where he had to basically repeat what the oil companies wanted to say about global warming) and at a firm where his job was to basically rewrite abstracts of research papers (what?!?). His true heart lies in building things, where he gets value. Crawford is concerned that today’s white-collar emphasis of the world focuses too much on removing value from humans (whereas mechanics can just point and say “here’s my result!”), and the white-collar blue-collar divide is making the mechanics earn less respect across society. I am indeed concerned that this is true, especially with today’s political divide among the college-educated and non-college educated, and I wish that more people with solid academics, those who have “never failed” had a little more humility. (I certainly feel like I’ve failed a lot and I’m pretty academically credentialed compared to a lot of others.) In a sense, I get the feeling that this book is like Jaron Lanier’s “You Are Not a Gadget” — those two men might see a lot of common ground over their critiques of modern life, though for different reasons. One takeaway from the book is that I’m happy to be where I am since I can try and perform deep work to produce results (code, papers, etc.) that people can look at, as I am doing more frequently on my GitHub account. Unsurprisingly, this was the key point Cal Newport made from this book. Lastly, I can’t resist mentioning one of the most interesting parts of this book. In a footnote in the seventh chapter, he talks about AI in the context of lamenting how humans were often being reduced to simple straightforward rules. His footnote talks about computer science and says the one hope for AI in the future is with neural networks since they’re not reduced to simple rules. Wow … that was in 2009 (before Alex-Net, etc.). Even your bike shop repairman knows about Deep Learning!

• ** The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses ** is a 2011 book by software entrepreneur Eric Ries, known for co-founding IMVU and then later for consulting various start-ups through his Lean Startup methodology. In this book, Reis provides a guide for start-ups, which he defines as: “[…] a human institution designed to create a new product or service under conditions of extreme uncertainty”. Note the lack of any comment about company size, and also note the inclusion of extreme uncertainty. This means start-ups can include non-profits, large companies, and even governmental organizations, so long as there is initial uncertainty in their roadmap. The Lean Startup argues that, in order for startups to thrive, they must follow a Build-Measure-Learn feedback loop. Furthermore, that loop must be their competitive advantage compared to slower, bulkier competition. By building Minimum Viable Products, Reis argues that appropriate metrics (not “vanity metrics” as he calls them) and customer feedback can be measured rapidly. Understanding these early results then guides the startup towards the next step, which may or may not involve the painful act of pivoting to change strategies. The book’s advice appears sound and reasonable. While I certainly don’t have much experience in this area to fully critique the book, Reis has famous tech titans such as Sheryl Sandberg and Andrew Ng to vouch for the book, so I think I can trust the advice. (I found out about this book from Andrew Ng’s reading recommendations.) While reading the book, I imagined what I would do if I tried to create (or more realistically, join) a start-up. My PhD program isn’t going to last forever … but I suppose while I’m here, I should emphasize the Minimum Viable Product aspect with respect to research.

• ** The Hard Thing About Hard Things: Building a Business When There are No Easy Answers ** is a 2014 book by billionaire venture capitalist Ben Horowitz where he recounts his experience running Loudcloud and Opsware as CEO. The book starts out by first describing the CEO experience. Then, Horowitz remarks about the lessons he’s learned and outlines recommendations and guides on what he thinks CEOs should be like. The book concludes with him explaining how he founded the venture capital firm Andreessen Horowitz, which he’s still running today9 to help technical founders become groomed CEOs. The book is fast-paced and feels like a high-octance novel, because Horowitz’s tenure at Loudcloud and Opsware was anything but smooth. Horowitz argues that there are peacetime CEOs and wartime CEOs, of which he was definitely the latter as he estimates he only had “three days of peace” when running the company. Loudcloud and Opsware initially raised a lot of money, but then after the dot-com crash, they struggled a lot and I’m amazed that Horowitz turned it around and eventually sold the company for \$1.6 billion to Hewlett-Packard. Reading his story, and Elon Musk’s story (which I’ll get to later), makes me wonder how these two CEOs managed to pull their companies out from the financial brink. I am kind of surprised that something “comes out of nothing,” and one of my disappointments is that it arguably spends a lot of time on Horowitz’s lessons for the CEO whereas I would have preferred more details on his CEO experience (at least, more than what’s in the book) because, again, I don’t understand how companies can go to a billion dollars’ worth of value out of nothing. I really need to step in the shoes of a CEO one of these days. But perhaps I would have better understood this if I understood more about business, and I certainly learned a lot about the business word from reading this book. For instance, I’m embarrassed to say that I only had a vague notion of what it meant for “a company to go public” but reading this book (and checking Wikipedia, Investopedia, and other online resources in parallel) made me better understand the process.

• ** The Rise and Fall of American Growth: The U.S. Standard of Living Since the Civil War ** is a long book by economist Robert Gordon, published in January 2016. For an academic-style book that’s 762 pages long, it is quite well-known, particularly due to the current debates on economic growth in politics. On Google, you can find pages and pages of reviews for Gordon’s book. Most of them mirror Bill Gates’ review in that they praise Gordon for providing a surprisingly complete historical picture of what American life was like in 1870, and how it was completely transformed in the “great century” to 1970. Gone were the days of darkness, backbreaking labor, endless drudgery in chores, and a stale diet, among other things, and in place of those came the electric light bulb, work conditions in heated and air-conditioned offices, the internal combustion engine (leading to automobiles and the airplane), and shopping centers to buy a variety of food and clothes. Gordon’s thesis is that since 1970, America has been in a long reign of slow growth despite the recent progress in AI, IT, and other tech-related fields. There are two reasons: these advances do not match up with those from previous generations (to echo Peter Thiel, “we wanted flying cars but ended up with 140 characters [Twitter]”), and there are headwinds preventing rapid growth such as income inequality, college debt, and demographic trends. He ends with a brief postscript on policy actions that might be useful to counter these trends; I wish more politicians would take note of them as some of his suggestions have broad appeal nowadays. This book is amazing, and despite my close connection with the technology sector, I agree with his thesis. Bill Gates counters by suggesting we’re on the cusp of medical advances, but I’m heavily skeptical about researchers finding cures for cancer and Alzheimer’s disease. It might be challenging for the average reader to go through a book this long, especially one packed with figures and footnotes. My advice? Read it. It’s worth it. I have probably learned more from this book than I have from any other.

## Group 4: Biographies and Memoirs

I am reading biographies of famous people because I want to be famous someday. My aim is to be famous for a good reason, e.g., developing technology that benefits large swaths of humanity. (It is obviously easier to become famous for a bad reason than a good reason.)

• ** Alan Turing: The Enigma ** is the definitive biography of Alan Turing, quite possibly the best computer scientist of all time. The book was written in 1983 by Andrew Hodges, a British mathematics tutor at the University of Oxford (now retired). I discussed this in a separate blog post so I will not repeat the details here.

• ** My Beloved World ** is Supreme Court Justice Sonia Sotomayor’s memoir, published in 2013. It’s written from the first person perspective and outlines her life from starting in South Bronx and moving up to her appointment as a judge to US District Court, Southern District of New York. It — unfortunately — doesn’t talk much about her experiences after that, getting appointed to the United States Court of Appeals for the Second Circuit in 1998, and of course, her time on the nation’s highest court starting in August 2009. She had a father who struggled with alcoholism and died when she was nine years old, and didn’t appear to be a good student until she was in fifth grade, when she started to obsess over getting “gold stars.” (I can attest to a similar experience over obsessing to get “gold star-like” objects when I was younger.) She then, as we all know, did well in high school and entered Princeton as one of the first incoming batch of women students and Hispanic students, graduating with stellar academic credentials in 1976 and then going on to Yale Law School where she graduated in 1979. The book describes her experiences in vivid terms, and I liked following through her footsteps. I feel and share her pain at not knowing “secrets” that the rich and privileged students had when I was an undergrad (I was clueless about how finance and investment banking jobs worked, and I’m still clueless today.) Overall, I enjoyed the book. It’s brilliantly written, with an engrossing, powerful story. I will be reminded of her key attribute of persistence and determination and focus which she says were key. I’m trying to pursue the same skills myself. While I understand the low likelihood of landing such tiny jobs (e.g., the tech equivalent of a Supreme Court Justice) I do try and think big and that’s what motivates me a lot. I read this book on a day trip where I was sitting in a car passenger seat, and I sometimes dozed off and imagined myself naming various hypothetical Supreme Court Justices.

• An Appetite for Wonder: The Making of a Scientist is Richard Dawkins’ first of two (!!) autobiographies, published in 2013 and which accounts for the first half of his life. Dawkins is one of the most famous and accomplished scientists today, not only in terms of raw science but also with respect to public outreach and fame (whether famous, in my opinion, or infamous, if otherwise), so perhaps two books is justified. Dawkins discusses his childhood, which he first spent in Africa before moving to England to attend boarding schools; he remarked that the students seemed to be relatively stronger in Africa. I sometimes wish I had attended boarding schools instead of my standard public schools, since perhaps I would have developed independence faster, so it was interesting to read his perspective. After this, Dawkins talks about his undergraduate years at Oxford10 (where his relatives had gone) and this is where I really want to know what he did, because I’m hoping to use my own “appetite for wonder” in science, since I think Artificial Intelligence is the new electricity. But anyway, Dawkins became a professor at Berkeley (!!) but he quickly left to return to England for another position. This book ends with his publication of The Selfish Gene, a book that I want to read one of these days. I’m impressed: it’s a challenge to write an autobiography, but fortunately, Dawkins’ parents saved a lot of letters and information, so that’s good. The book, however, is likely aimed at a niche audience of readers. It was also interesting to Dawkins used to be religious, before becoming an atheist by his late teens (like me, though I was never religious at all). I also liked his stories about computer programming and research, which were a lot simpler back then but presumably harder due to lack of documentation and the Internet.

• Brief Candle in the Dark: My Life in Science is Richard Dawkins’ second autobiography, written in 2015 and covering the second half of his life, at least, up to that point (he could theoretically still have 30 more years if he lives long enough). He writes more about his life as a professor at the University of Oxford, including his time as the inaugural Simonyi “Professor of the Public Understanding of Science,” which I have to admit was an odd title when I first read it in the back flap of my copy of The God Delusion. He certainly has helped me understand things in this world, and it’s true that I consider Richard Dawkins to be one my heroes. On the other hand, I’m not sure most lay readers would be willing to slog through both of his autobiographies, so keep this in mind in case you’re on the fence about reading this book. It is a non-chronological history of his academic life about debates (uh oh), being on television, writing books, giving talks, and so forth. Dawkins describes various stories about him with other famous people. I also learned a little more about basic evolution. His previous autobiography highlighted how genes — and not the individual — are the unit of evolution, but in his book The Extended Phenotype, he talks about an extension of natural selection onto the physical world (interesting, though one must not misinterpret this). He also emphasizes, and this is something I agree with, that natural selection can still explain complicated structures today that creationists use as evidence against evolution, such as the eye. Natural selection is the only theory we have of what can work gradually and cumulatively. This is key for developing complicated structures; in the absence of evidence, God should not be the default option. I also liked other tidbits of the book, such as how Dawkins did a lot of “evolutionary programming” — I bet he would be interested in reading the research paper Evolution Strategies as a Scalable Alternative to Reinforcement Learning.

• ** Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future ** is a 2015 biography of Elon Musk, written by technology journalist Ashlee Vance. He documents Musk’s chaotic life, both nowadays as the CEO of SpaceX and Tesla and before, when he was struggling to get his companies off the ground and earlier still, when he started his entrepreneurial spirit by starting Zip2. Musk grew up in South Africa and moved to Canada so that he could get to the United States as quickly as possible. Musk had some initial businesses successes, but was forced out of X.com (which particularly hurt as Musk is the personality who wants full control over his companies), then earned some more success with Tesla and SpaceX before teetering on the brink of collapse at the end of 2008 (you know, like the financial “oopsie” we had). Then later, those companies recovered, Musk married a young actress, divorced, then re-married (then divorced again …). The book concludes with some thoughts on Musk’s wild personality and ambitions, and basically says that there is no one like Musk, who still is holding out hope for humans to go to Mars by 2025. This book is an absolute thrill to read. Vance brilliantly writes it so that the reader often feels like he or she is swept up into “Elon Musk”-mode: hard-working, super-charging, and borderline out of control. After reading it, I kept thinking that my work ethic is too soft and weak, and that I better get back to working sixteen hours a day (or less, if I’ve focused really hard in fewer hours). I have two main criticisms of this book. The first is that I was hoping to see more information from his two wives, and for this I’ll probably have to relegate myself to reading Justine Musk’s blog. The second and most important critique is that when Vance wrote an updated epilogue in January 2017, which was five months before I bought the book at Chicago O’Hare International Airport, he never mentioned Musk’s investment in OpenAI, a nonprofit AI research company which aims to produce, or pave the path to, artificial general intelligence. In their introductory blog post from December 2015, they claim to have 1 billion dollars in investment. I’m not sure how much Musk contributed to that, but it must have been a lot!

• Keeper of the Olympic Flame: Lake Placid’s Jack Shea vs. Avery Brundage and the Nazi Olympics, a recent 2016 book by Michael Burgess, is one that hits home to me because Jack Shea was my great-uncle. Jack Shea was born and raised in Lake Placid, New York, and as a 21-year-old competitor in the 1932 Winter Olympics, he won two gold medals in speed-skating, becoming a hometown hero and putting Lake Placid “on the map.” A few years later, when it became apparent that the next Winter Olympics were going to be in Nazi Germany, Shea urged Avery Brundage (then in charge of American involvement in the Olympics) to boycott out of concerns over Adolf Hitler’s treatment of Jews and other minorities. Shea did not participate in those controversial Olympics, but the Americans did send a team, with relatively disappointing speed-skating results. The book then discusses more about the intersection between politics and sports, and also talks about the odd déjà vu when Lake Placid again hosted the Winter Olympics in 1980, again with politics causing tension (this time, from the Soviet Union). For obvious reasons, I enjoyed reading this book despite its flaws: it’s short and has obvious typos. I like knowing more about my ancestors and what they did, and the photos were really striking. My favorites include 19-year-old Shea shaking hands with then-governor Franklin Roosevelt, another one with Shea and his extended family (including my grandfather), and a third which shows a pre-teen Shea and his brother, Eugene, already in skates. Eugene, incidentally, lived to be 105 years old (!!) before passing away in October this year (obituary here) and was able to contribute photos and assistance to the author. I got to meet Jack Shea once, and he might very well have lived to be 100 years of age had he not been killed by a drunk driver at the age of 91. This was 17 days before his grandson would end up winning a gold medal in the 2002 Salt Lake City Winter Olympics. In my parent’s home, there is a photo of me with my cousin holding his gold medal (it was heavy!). I also have a separate blog post about this book soon after I read it.

## Group 5: Conservative Politics and Thoughts

Well, this will be interesting. I’m not a registered Republican, though I possess a surprisingly large amount conservative beliefs, some of which I’m not brave enough to blog about (for obvious reasons). In addition, I believe it is important to understand people’s beliefs across the political spectrum, though for this purpose I exclude the extreme far left (e.g., hardcore Communists) and right (e.g., the fascists and the Ku Klux Klan).

• ** Please Stop Helping Us: How Liberals Make it Harder for Blacks to Succeed ** a 2014 book written by Wall Street Journal columnist Jason Riley. It’s no secret that (a) most blacks tend to be liberal, I would guess due to the liberals getting the civil rights movement correct in the 1960s, and (b) blacks tend to have more credibility when criticizing blacks compared to whites. Riley, as a black conservative, can get away with roundly criticizing blacks in a way that I wouldn’t do since I do not want to be perceived as a racist. In Please Stop Helping Us, Riley “eviscerates nonsense” as described by his hero, Thomas Sowell, criticizing concepts such as the minimum wage, unions, young black culture, and affirmative action policies, among other things, for the decline in black prosperity. His chief claim is that liberals, while having good intentions, have not managed to achieve their desired results with respect to the black population. He also laments that young blacks tend to watch too much TV, engage in hip-hop culture, and the like. One of his stories that stuck with me was when a young (black) relative asked him: “why are you so white”, when all Riley did was speak proper English and tuck in his shirt. Indeed, variants of this story are common complaints that I’ve seen and heard about from black students and professionals across the political spectrum. I don’t agree with Riley on everything. For instance, Riley tends to ignore or explain away issues regarding racism as it relates to the lack of opportunities for job promotions or advancement, or when blacks are penalized more relative to others for a given crime. On the other hand, we agree on affirmative action, which he roundly criticizes, pointing out that no one wants to be the token “diversity hire”. To his credit, he additionally mentions that Asians are hurt the most from affirmative action, as I pointed out in an earlier blog post, making it a dubious policy when it come to advancing racial equality. In the end, this book is a thought-provoking piece about race. My impression is that Riley genuinely wants to see more blacks succeed in America (as I do), but he is disappointed that the major civil rights battles were all won decades ago, and nowadays current policies do not have the same positive impact factor.

• ** The Conservative Heart: How to Build a Fairer, Happier, and More Prosperous America **, is a 2015 book by Arthur Brooks, the president of the American Enterprise Institute, officially a nonpartisan think tank but widely regarded (both inside and outside the organization) as a place for conservative public policy development and analysis. Brooks argues that today’s conservatives, while they have most of the technical arguments right (e.g., on the benefits of free enterprise), lack the “moral high ground” that liberals have. Brooks cites statistics showing that conservatives are seen as less compassionate and less caring than liberals. He argues that conservatives can’t be about being anti-everything: government, minimum wage increases, food stamps, etc. Instead, they have to show that they care about people. They need to emphasize an equal starting line for which people can flourish, which contrasts with the common liberal perspective of making the end product equal (by income redistribution or proportional racial representation). One key point Brooks emphasizes is the need for work fulfillment and purpose instead of lying around while collecting checks from the American welfare state. I liked this book and found it engaging and accessible. It is, Brooks says, a book for a wide range of people, including “open-minded liberals” who wish to understand the conservative perspective. I have two major issues with his book, though. The first is that while he correctly points out the uneven recovery and the lack of progress on fixing poverty, he fails to mention the technological forces that have created these uneven conditions (see my technology, economics, and business related books above), much of which is outside the control of any presidential administration or Congress. The second is that I think he’s been proved wrong on a lot of things. President Donald Trump is virtually none of the stuff that a conservative “heart” would suggest and, well, he was elected President (after this book was published, to boot). I wish President Trump would start following Brooks’ suggestions.

• Conscience of a Conservative: A Rejection of Destructive Politics and a Return to Principle is a brief 2017 book/manifesto by U.S. Senator Jeff Flake of Arizona. Flake is well known for being one of those “Never Trump” style of Republicans since he remains true to certain Republican principles that have arguably fallen out of favor with the recent populist surge of Trump-ian Republicanism in 2016, such as free trade and limited government spending. And yes, I don’t think Republicans can claim to be the party of fiscal prudence nowadays, since Trump is decidedly not a limited spending conservative. In this book, Senator Flake argues that Republicans have to get back to true, Conservative principles and can’t allow populism and immaturity to define their party. He laments at the lack of bipartisanship in Congress, and while he makes it clear that both parties are to blame, in this book he mostly aims at Republicans. This explains why so many Republicans, including Barry Goldwater’s relatives, dislike this book. (Barry Goldwater wrote a book of the same title, “Conscience of a Conservative”, from which Jeff Flake borrowed the title.) I sort of liked this book but didn’t really like it. It still fails to address the notion of how the parties have fallen apart, and he (like everyone else) preaches bipartisanship without proposing clear solutions. Honestly, I think the main reason I read it was not that I think Flake has all the solutions, but that I sometimes think of myself in Congress in my fantasies. Thus, I jumped at the chance to read a book directly from a Congressman, and particularly a book like this where Flake bravely didn’t have his staff revise it to make it more “politically palatable.” It’s a bit raw and lacks the polish of super-skilled writers, but we shouldn’t hold Senators to such a high writing standard so it’s fine with me. It’s unfortunate that Flake isn’t going to seek re-election next year.

## Group 6: Self-Help and Personal Development

I’m reading these “personal development” books because, well, I want to be a far more effective person than I am right now. “Effectiveness” can mean a lot of things. I define it as being vastly more productive in (a) Artificial Intelligence research and (b) my social life.

• ** How to Win Friends and Influence People: The Only Book You Need to Lead You to Success ** is Dale Carnegie’s famous book based on his human interaction courses. It was originally published in 1936, during the depths of the Great Depression, making this book by far the oldest one I’ve read this year. I will not go into too much depth about it since I wrote a summary in an earlier blog post. The good news is that 2017 has been a much better year for me socially, and the book might have helped. I look forward to continuing the upward trend in 2018, and to read other Dale Carnegie books.

• ** The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change **, written by Stephen R. Covey in 1989, is widely considered to be the “successor” to Dale Carnegie’s classic book (see above summary). In The 7 Habits, Covey argues that the habits are based on well-timed principles and thus do not noticeably vary across different religious groups, ethnic groups, and so forth. They are: “Be Proactive”, “Begin With the End in Mind”, “Put First Things First”, “Think Win-Win”, “Seek First to Understand, Then to be Understood”, “Synergize”, and “Sharpen the Saw”. You can find their details on the Wikipedia page so I won’t repeat the points here, but I will say that the one which really touches upon me is “Think Win-Win”. In general, I am always trying to make more friends, and I’d like these to be win-win. My strategy, which aligns with Covey’s (great minds think alike!), is to start a relationship by doing more work than the other person or letting the other person benefit more. Specifically, this means that I will be happy to (a) take the initiative in setting meeting times and any necessary reservations, (b) drive or travel farther distances, (c) let the other person choose the activity, and so forth. At some point, however, the relationship needs to be reciprocal. Indeed, I often tell people, subtly or not so subtly, that the true test of friendship is if friends are willing to do things for you just as much as you do to them. With respect to the six other principles, there isn’t much to disagree. There is striking similarity to Cal Newport’s Deep Work when Covey discusses high-impact, Quadrant II activities. Possibly my main disagreement with the book is that Covey argues how these principles come (to some extent) from religion and God. As an atheist, I do not buy this rationale, but I still agree with the principles themselves and I am trying to follow them as much as I can. This book has earned a place on my desk along with Dale Carnegie’s classic, and I will always remember it because I want to be a highly effective person.

• You are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life is a 2013 book by self-help guru Jen Sincero. It’s deliberately written in a very “teenage”-like way, where the author acts like she’s talking directly to the reader as the self-help coach. The target audience seems to be people who have “screwed up” and feel like their life is not as awesome as it could be. She goes through 27 relatively short chapters, each with different generic advice, though she does repeat this each chapter: love yourself. I definitely need reminders about that, since I don’t feel like I am achieving enough in life. However, I was somewhat skeptical of her advice and in general I am a self-help skeptic since I think it’s better for me to build my technical skills than to try and optimize advice from self-help books. Overall, I did not enjoy this book (largely due to the writing style), and I’m surprised it’s gotten so much critical acclaim and that it’s a best-seller. Yes, I will “love myself” but I can’t see myself remembering many other tidbits about this book that I didn’t already know before (e.g., think positive!!). Perhaps this book would be better suited with some concrete success stories of Sincero’s clients.

## Group 7: Psychology and Human Relationships

These books are about psychology, broadly speaking, which I suppose can include “human relationships”. I thoroughly enjoyed reading all four of these books.

• ** Thinking, Fast and Slow ** is a famous 2011 book by Daniel Kahneman, winner of the 2002 Nobel Prize in Economics for his work on decision making. This is a book about psychology and how humans think, and much of it is based on Kahneman’s research with Amos Tversky many decades ago. To make the concepts clearer to the reader, Kahneman describes a story consisting of System 1 and System 2. These are the fast and slow parts of our thinking, respectively, so the former represents our immediate intuition and the latter reflects what happens after we expend nontrivial amounts of effort on some task. Thinking, Fast and Slow is filled with informative anecdotes, thrilling insights, and unexpected contradictions about the way humans think, and supplements those with exercises to the reader. (I normally find these annoying, but here they were reasonable.) Possibly the biggest insight I gained is that human thinking is flawed and is easily manipulated, so I better be extra cautious if I have to make important judgments in my life. (For minor life decisions, I don’t have a hope of remembering all the advice in this 400+ page book.) To be clear, I already knew that humans behaved irrationally, but Kahneman does an excellent job in putting my haphazard thoughts about human irrationality on more solid footing. Kahneman augments that with related topics such as overconfidence (a major issue with CEOs and start-ups) and how anchoring, priming, and baselines influence human preferences. After reading the book, all I can say is, I think (pun intended!) Thinking, Fast and Slow lives up to its billing as a true classic.

• ** To Sell Is Human: The Surprising Truth About Moving Others ** is a 2012 book by best-selling author Daniel Pink. He argues that we should stop focusing on outdated views of salespeople. That is, that they are slimy, conniving, attempting to rip of us off, etc. Today, one in nine work in “sales” but Pink’s chief message to the reader is that the other eight of nine are also in sales. We try to influence people all the time. So in some sense this is obvious. I mean, come on, if I’m aiming to get a girlfriend, then I’m trying to influence her based on my positive qualities.11 For academics, we sell our work (i.e., research) all the time. That’s what Pink means when he says “everyone is working in sales.” He argues that nowadays, the barriers have fallen (he almost says “The World is Flat” a la Thomas L. Friedman) and that salespeople are no longer people who walk door by door to ask people to buy things. That’s outdated. One possible negative aspect of the book is that I don’t think we need this much “proof” that we’re all salespeople. Yes, some people think only in terms of that job, but all you have to do is say: “hey everyone is a salesperson, if you try to become friends with someone, that counts…” and if you tell that to people, all of a sudden they’ll get it and I don’t think belaboring the point is necessary. On the positive side, the book contains several case studies and lists of things to do, so that I can think of these and reread the book in case I want to apply them in my life. Indeed, as I was reading this book, I was also thinking of ways I could convince someone to become friends with me.

• ** Lean In: Women, Work, and the Will to Lead ** is a well-known 2013 book by Facebook COO Sheryl Sandberg. It’s a semi-memoir which also acts as a manifesto for women (and men) to be more aware of the gender gap in “prestigious positions” and how to counteract it. By such “prestigious positions” I mean CEOs (particularly of top companies), politicians, and other leadership positions. Women occupy fewer of these positions than men in virtually every country in the world, and Sandberg wants this to change. She outlines numerous factors that hold women back, not all of which are obvious. Her first example deals with parking spots reserved for pregnant women, in which she admits she (despite being a woman!) didn’t think about until she became pregnant herself. Pregnancy is a major focus in this book, along with work-life balance, a typical inclusion in books about women and careers. Sandberg also recounts stories about women being quiet in meetings or not taking seats in the center of a meeting table even when prompted to do so, and lowering their hands when people say there are no more questions (despite how men keep their hands up and thus get to ask more questions). This forms the overall basis for her advice that women must “lean in” and be more involved in discussions. I liked reading this fast-paced book but I also almost felt disappointed, since I anticipated much of the material in advance. Perhaps it’s because I read about gender-related issues frequently. Another possible explanation is that it is hard for me to participate in group meetings, so I often spend more time observing people and noticing things rather than focusing on the subject at hand. On a final note, I’d like to mention that I do, in some sense, believe that “other men are the problem, not me” though I would never say this in public to someone, because (a) it’s politically charged, and (b) I could, of course, make a mistake in the future and thus I would be hypocritical and have to eat my own words. In my adult life, I do not believe I have ever done anything blatantly sexist, though I certainly worry a lot about committing “microaggressions” when I interact with women, and do my best to avoid them to make my female (as well as male) conversationalists feel respected and comfortable.

• ** Originals: How Non-Conformists Move the World ** is a recent 2016 book by famous Wharton professor Adam Grant, also known as the author of Give and Take. I’ve been aware of Grant for some time, in part because he’s been featured in Cal Newport’s writing as someone who engages in the virtues of Deep Work (see an excerpt here). Yeah, he’s really productive, finishing a PhD in less than three years12 and then becoming the youngest tenured professor at his university. But what is this book about, anyway? In Originals, Grant argues that people who “buck the trend” are often ones who can make a difference for the better. As I anticipated ahead of time, Martin Luther King Jr is in the book, but not for all the reasons I thought. One of them — why procrastination might actually have been helpful (i.e., first mover disadvantage) for him when he was crafting his “I Have a Dream” speech, though one was more realistic: focusing on the victims of crimes (blacks facing discrimination) rather than criticizing the perpetrators. Another nice tidbit from Grant was making sure to emphasize the downsides of your work rather than the positives to venture capitalists, as that will help you look more sincere. Other stuff in this book include how to foster a correct sense of dissent in a company (e.g., Bridgewater Associates is unique in this regard because people freely criticize the billionaire founder Ray Dalio). I certainly felt like some of this was cherry-picking, which admittedly is unavoidable, but this book seems to pursue that more than others. Nonetheless, a lot of the advice seems reasonable and I hope to apply it in my life.

## Group 8: Miscellaneous

These books, in my opinion, don’t neatly align in one of the earlier groups.

• ** Knocking on Heaven’s Door: How Physics and Scientific Thinking Illuminate the Universe and the Modern World ** is Harvard physics professor Lisa Randall’s second of three major books. Last year, I read her most recent book Dark Matter and the Dinosaurs, so this is “going back” in time back to 2011. Sorry, I know should have read them in order. But anyway, this book is a fascinating exploration of what I argue are two major topics. First, the Large Hadron Collider — the well-known experimental setup that revealed the Higgs Boson particle in 2012 and earned Peter Higgs an Nobel Prize. Randall describes how the experiment was set up in great detail, but with juuuuuust enough clarity for non-physicists like me to barely follow. I don’t have much knowledge about the LHC, and indeed I didn’t even know it was a fantastic engineering feat; it is an enormous system built deep underground in Europe, as the pictures in the book helped to illuminate. The second major part of the book is about scientific thinking itself: why do scientists revise theories, why is the notion of scale important, why is quantum mechanics important at smaller distances, but why can we “average out” its effects with Newtonian physics? I learned a little about how the Standard Model in physics works, and it was great to see how she describes the scientific approach to thinking. Randall also discusses cosmology in this book, but it’s much shorter relative to particle physics and feels slightly out of place, but fortunately any reader who wants an overview in cosmology should just read Dark Matter and the Dinosaurs. Overall, this is a book that somehow remains fascinating and mostly accessible despite all the physics facts and jargon. It’s tricky to write science books for the general public. Randall does a good job in that when I was reading the book and felt somewhat confused at the jargon, I felt like it was my fault for my incompetence, and not hers. I am now thinking about reading her first book, Warped Passages, or her e-book on the Higgs Discovery. I’ll definitely be on the lookout for any future books she publishes!

• ** The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t ** is Nate Silver’s 2012 book where he urges us to consider various issues that might be adversely affecting the quality of predictions. They range from the obvious, such as political biases which affect our assessment of political pundits (known as “hedgehogs” in his book), and perhaps less obvious things such as a bug in the Deep Blue chess program which nonetheless grandmaster Gary Kasparov took to meaning that Deep Blue could “predict twenty moves into advance.” I really enjoyed this book. The examples are far ranging: how to detect terrorist attacks (a major difficulty but one with enormous political importance) to playing poker (Silver’s previous main source of income), to uncertainties involving global warming models (always important to consider), and to the stock market (this one is hardest for me to understand given my lack of background knowledge on the stock market, but I am learning and working to rectify this!). The one issue I have is that Silver seems to just assume: hey let’s apply Bayes’ rule to fix everything, so that we have a prior of $X$, and we assume the probability of $Y$ … and therein lies the problem. In real settings we rarely get those $X$ and $Y$ values to a high degree of accuracy. But I have no issue with the general idea of revising predictions and using Bayes’ rule. I encourage you to see a related critique in The New Yorker. The reality, by the way, is that most current professional statisticians likely employ a mix of Frequentist and Bayesian statistics. For a more technical overview, check out Professor Michael I. Jordan’s talk on Are You A Bayesian or a Frequentist?.

• ** The Soul of an Octopus: A Surprising Exploration into the Wonder of Consciousness ** is a splendid 2015 book by author Sy Montgomery, who has written numerous biology-related books about animals. I would call this entirely a popular science book; it’s more like a combination of the author discovering octopuses and describing her own experience visiting the New England aquarium, learning how to scuba dive, watching octopuses having sex in Seattle, and of course, connecting with octopuses. To be frank, I had no idea octopuses could do any of the things she mentions in the book (such as walking on dry land and squeezing through a tiny hole to get out of a tank). Clearly, aquariums have their hands full trying to deal with octopuses. Much of the book is about trying to connect with the three octopuses the New England aquarium has; the author regularly touches and feeds the octopuses, observing and attempting to understand them. I was impressed by the way Montgomery manages to make the book educational, riveting, and emotional all at once, which was surprising to me when I found out about the book’s title. It’s surely a nice story, and that’s what I care about.

• Nothing Ever Dies: Vietnam and the Memory of War is a book by USC English Professor Viet Thanh Nguyen, published in 2016 and a finalist for the National Book Award in Non-Fiction that same year. It’s not a recap or history of the Vietnam War (since that subject has been beaten to death) but instead it focuses specifically on how people from different sides (obviously, American and Vietnamese, but also the rest of the world) view the war, because that will shape questions such as who is at fault and should make reparations and also how we can avoid similar wars in the future. It’s an academic-style book, so the writing is a bit dry and it’s important not to read this when tired. I think it provides a useful perspective on the Vietnam War and memories in general. Nguyen travels to many areas in Vietnam and Asia and explores how they view America — for instance, he argues that South Korea attempts to both ally with the US and look down on Vietnam with contempt. I found the most thought-provoking discussion to be about identity politics and how minorities often have to be the ones describing their own experiences. I’ve observed this in the books I read, in which if they’re written by a minority author (and here I’ll include Asians despite how critics of the tech industry bizarrely decide otherwise) are often about said minority. Other interesting (though obvious) insights include how the entire war machine and capitalism of the US means it can spread its memories of the war more effectively than Vietnam can. Thus, the favorable American perspective of the US as attempting to “save” minorities is more widespread, which puts America in a better light than (in my opinion, channeling my inner Noam Chomsky) it deserves.

• The Once and Future Liberal: After Identity Politics is a short book (describing it as an essay is probably more accurate) written by humanities professor Mark Lilla of Columbia University. This book grew out of his fantastic (perhaps my all-time favorite) Op-Ed in the NYTimes about the need to end identity politics, or specifically identity liberalism. I agree wholeheartedly; we need to stop treating different groups of people as monolithic. Now, it is certainly the case that racism or mistreating of any group must be called out, and white identity politics is often played on the right, versus the variety of identities on the left. Anyway, this short book is split into three parts: anti-politics, pseudo-politics, and politics, but this doesn’t seem to have registered much to me, and the book is arranged in a different style as I might have hoped. I was mostly intrigued by how he said Roosevelt-esque liberalism dominated from roughly 1930 to 1970. Then the Reagan-esque conservatism (i.e., the era of the individual) has dominated from 1980 to 2016 or so, and now we’re supposed to be starting a new era as Trump has demolished the old conservatism. But Lilla is frustrated that modern liberalism is so obsessed about identity, and quite frankly, so am I. He is correct, and many liberals would agree, that change must be aimed locally now, as Republicans have dominated state and local governments, particularly throughout the Obama years. I do wish, however, that he had focused more directly on (a) how groups are not monolithic, (b) why identity politics is bad politics. I know there was some focus, but there didn’t seem to be enough for me. But I suppose, this being a short essay, he wanted to prioritize the Roosevelt-Reagan parallels, which in all fairness is indeed interesting to ponder.

• ** Climate of Hope: How Cities, Businesses, and Citizens can Save the Planet **, a 2017 book jointly written by Michael Bloomberg and Carl Pope. Surprisingly, considering that I was born and raised in New York state all my life (albeit, upstate and not in the city) the first time I really learned about Bloomberg was when he gave the commencement speech at my college graduation. You can view the video here, and in fact, to the right you can barely see the hands of a sign language interpreter who I really should re-connect with sometime soon. Climate of Hope consists of a series of chapters, which are split into half from Bloomberg’s perspective, half from Pope’s perspective. The dynamics between the two men are interesting. Pope is a “typical” Sierra Nevada member, while Bloomberg is known for being a ridiculously-rich billionaire and a three-term (!!) mayor of New York City.13 The book is about cities, businesses, and citizens, and the omission of national governments is no accident: both men have been critical of Washington’s failure to get things done. Bloomberg and Pope aim their ire at the “climate change deniers” in Washington, though they do levy slight criticism on Democrats for failing to support nuclear power. They offer a brief scientific background on climate change, and then argue that new market forces and the rise of cities (thus greener due to more public transportation and more cramped living quarters) means we should be able to emphasize more renewable energy. One key thing I especially agree with is that to market policies that promote renewable energy — particularly to skeptical conservatives — people cannot talk about how “worldwide temperatures in 2100 will be two degrees higher.” Rather, we need to talk about things we can do now, such as saving money, protecting our cities, creating construction jobs, protecting our health from smog, all these thing we can do right now and which will have the effect of fighting long-term climate change anyway. I enjoyed this easy-to-read and optimistic book, though it’s also fair to say that I tend to view Bloomberg quite favorably and honor his commitment to getting things done rather than having dysfunction in Washington. Or maybe I just want to obtain a fraction of his professional success in my life.

1. Most of the academic papers that I read can be found in this GitHub repository

2. You’ll also notice in that link that Stuart Russell says he thinks superintelligence will happen in “more than 25 years” but he thinks it will happen. Russell’s been one of the leading academics voicing concern about AI. I’m not sure what has been created out of it, except raising a discussion of AI’s risks, kind of like how Barrat’s book doesn’t really propose solutions. (Disclaimer: I have not read all of Russell’s work on this, and I might need to see this page for information.)

3. In this interview, Oren Etzioni said that AI leaders were not concerned about superintelligence, and even quoted an anonymous AAAI Fellow who said that Nick Bostrom was “the Donald Trump of AI”. Stuart Russell, who has praised Superintelligence, wrote a rebuttal to Etzioni, who then apologized to Bostrom.

4. Of course, this raises the other problem with MOOCs. Only people who have sufficient motivation to learn are actually taking advantage of MOOCs, and these tend to be skewed towards those who are already well-educated. Under no circumstances is Brynjolfsson someone who needs a MOOC for his career. But there are many people who cannot afford college and the like, but who don’t have the motivation (or time!) to learn on their own. Is it fair for them to suffer under this new economy?

5. Eric Schmidt got his computer science PhD from Berkeley in 1982. So at least I know someone famous essentially started off on a similar career trajectory as I am.

6. I didn’t realize this until the authors put it in a footnote, but Jonathan Rosenberg’s father is Nathan Rosenberg, who wrote the 1986 book How the West Grew Rich which I also read this year. Heh, the more I read the more I realize that it’s a small world among the academic and technically elite among our society.

7. This blog is hosted on GitHub and built using software called Jekyll. Click here to see the source code

8. To compare, How the West Grew Rich is less than half the length of The Rise and Fall of American Growth. In addition, I skipped most footnotes for the former, but read all the footnotes for the later.

9. A quick thanks to Ben and Marc for helping to fund Berkeley’s Computer Science Graduate Student Association!

10. Dawkins mentions that, if anything was “the making” of him, Oxford was. For me, I consider Berkeley to be “the making of me” as I’ve learned much more, both academically and otherwise, here than at Williams College.

11. For the sake of keeping this blog mostly professional, I won’t list all my positive qualities here. ;)

12. Usually, someone completing a PhD in 2-3 years raises red flags since they likely didn’t get much research done and may have wanted to graduate ASAP. Grant is an exception, and it’s worth noting that there are also exceptions in computer science

13. Given the fact that Bloomberg was able to buy his way into being a politician, I really think the easiest way for me to enter national politics is to have enormous success in the business and technology sector. Then I can just buy my way in, or use my connections. It’s unfortunate that American politics is like this, but at least it’s better than having a king and royal family.

# At Long Last: A Simple Email Subscription for this Blog

It took me six and a half years to do this, but I finally managed to install an email subscription form for readers of this blog. The link is here. No more nasty RSS feeds that no one knows how to use!

The email subscription form for this blog uses MailChimp. Each time I publish a post, I will send an email to everyone on the list using MailChimp’s “Campaign” feature.

Incidentally, this is the same kind of email form we use over at the Berkeley AI Research (BAIR) Blog. If you haven’t already, please subscribe to the BAIR Blog! As a member of the editorial board, I know the posts that are coming up next. I obviously cannot reveal the exact content, though I can say that we have lots of interesting stuff lined up for 2018 and beyond.

For assistance on getting this set up, I thank Jane Liang, a UC Berkeley EECS student who set up MailChimp for the BAIR Blog. I also thank Dominic Spadacene, who wrote dead-simple HTML installation instructions on his Ctrl-F’d blog.

# On the Momentum Sign Flipping for Hamiltonian Monte Carlo

For a long time, I wanted to write a nice, long, friendly blog post on Hamiltonian Monte Carlo that I could come back to for more intuition and understanding as needed.

Fortunately, there’s no need for me to invest a ginormous amount of time I don’t have for that, because physicists/statistician Michael Betancourt has written a fantastic introduction to Hamiltonian Monte Carlo, called A Conceptual Introduction to Hamiltonian Monte Carlo. You can find the preprint here on arXiv. Don’t be deterred by the length; it’s a fast read compared to other academic papers, and certainly a much more intuitive read than Radford Neal’s 2011 review chapter, which I already thought couldn’t be surpassed in terms of a quality introduction to HMC. Indeed, even prominent statisticians such as COPSS Presidents’ Award winner Andrew Gelman have praised the writeup, and someone like him obviously doesn’t need it.

I have extensively read Radford Neal’s writeup, to the point where I was able to reproduce almost all his figures in my MCMC and Dynamics code repository on GitHub. There was, however, one question I had about HMC that I didn’t feel was elaborated upon enough:

Why is it necessary to flip the sign of the momentum to induce a symmetric proposal?

Fortunately, Betancourt’s writeup comes to the rescue! Thus, in this post, I’d like to go through the details on why it is necessary to flip the sign of the momentum term in HMC.

Let $\mathbb{Q}(q' \mid q)$ be the density function defining the current proposal method, whatever that may be. With a Gaussian proposal, we have symmetry in that $\mathbb{Q}(q'\mid q) = \mathbb{Q}(q\mid q')$. The same is true with Hamiltonian Monte Carlo … if we handle the momentum term correctly.

Borrowing Betancourt’s notation (from Section 5.2), we’ll assume that, starting from state $(q_0,p_0)$, we integrate the dynamics for $L$ steps to land at $(q_L,p_L)$, upon which we use that as our proposal:

where $\delta: \mathbb{R} \to \{0,1\}$ is the Dirac delta function, and the difference $q'-q_L$ is assumed to be real-valued; if $q$ and $p$ are vectors, these would need to be done component-wise and then summed up, but the same basic idea holds. Furthermore, $q'$ and $p'$ are “placeholder” random variables, kind of like how we often use $X$ when writing $\mathbb{P}[X=x]$ in introductory probability courses; $X$ is the placeholder and $x$ is the actual quantity.

Reading the definition straight from the Dirac delta functions, we see that our proposal density is one exactly at state $(q',p')=(q_L,p_L)$, and zero everywhere else. This makes sense because Hamiltonian dynamics are deterministic after re-sampling the momentum variables (but it’s understood that $p_0$ represents those states after the re-sampling, not before).

The problem with this is that the proposal becomes “ill-posed”. Betancourt remarks that:

however, I believe that’s a typo and that the numerator and denominator should be flipped, so that the numerator contains the density of the starting state given the proposal.

Regardless, to me it doesn’t make sense to have proposal probabilities or densities with these Dirac delta functions that result in zero everywhere (that means we’d always reject samples). The following figure (from Betancourt) visually depicts the problem:

Because these position and momentum variables are continuous-valued, the probability of actually landing back in the starting state has measure zero.

Suppose, however, that after we integrate for $L$ steps, we flip the sign of the momentum term. Then we have

so that only $(q',p')=(q_L,-p_L)$ results in a probability mass of one. See the following figure for this revision:

The key observation now, of course, is that

Why is this true? The dynamics are time-reversible, and if we set our potential energy to be the usual $K(p) = \frac{p^Tp}{2}$, then flipping the momentum term and going through the leapfrog means the sampler encounters the same exact steps, only in reverse.

To make this concrete, I like to explicitly go through the math of one leapfrog step. It requires some care with notation, but I find it’s worth it. I’ll write $(q_k^{(1)},p_k^{(1)})$ as the $k$-th element encountered during the forward trajectory. For the reverse, I use $(q_k^{(2)},p_k^{(2)})$ so that the superscript is now two instead of one. Furthermore, due to the leapfrog step taking half-steps for momentums, I use $k=0.5$ for this purpose.

Here’s the forward trajectory, starting from $(q_0^{(1)},p_0^{(1)})$:

and the last step negates the momentum, so that the final state is $(q_1^{(1)}, -p_1^{(1)})$.

Here’s the reverse trajectory, starting from $(q_0^{(2)},p_0^{(2)})$:

with our final state as $(q_1^{(2)}, -p_1^{(2)})$. Above, the only difference between the reverse and the forward trajectories is the change in superscripts. But when we do the math for the reverse trajectory while plugging in the values from the forward trajectory, we get:

Gee, this is exactly the negative of the half-step we were at in the first iteration! Similarly, for the position update, we have:

The leapfrog has brought the position back to the starting point. For the final half-step momentum update, we have:

and we see that our reverse trajectory landed us back in $(q_0^{(1)},-p_0^{(1)})$, and flipping the momentum gets us to the same exact starting state.

Thus, using this type of proposal means the proposal densities cancel out, result in a Metropolis test, not a Metropolis-Hastings test.

I should also add one other point: we do not consider the momentum resampling as being part of the proposal, as resampling the momentum can be considered as maintaining the canonical distribution, so it’s something that we can “always” decide to invoke if we want (hopefully that’s not too bad of a hand-wavy explanation).

Hopefully the above discussion clarifies why flipping the sign of the momentum is necessary, at least in theory. In practice, we don’t do it since for the usual Gaussian kinetic energies, $K(p) = K(-p)$ so their energy levels are the same, and because the momentum variables are traditionally entirely resampled after the acceptance test.

# Review of Deep Learning (CS 294-131) at Berkeley

This semester, I took CS 294-131, a Deep Learning “special topics” course which has been offered each semester since Fall 2016 for a variable amount of class units and will be taught again next semester (the course website is already up). As usual, it was co-taught by the Trevor Darrell and Dawn Song team. The course is low-commitment for them because it’s a seminar and they don’t have to give lectures or prepare assignments and exams. CS 294-131 meets only once a week; for us, it was Mondays from 1:00PM to 2:30PM. Each meeting featured a guest speaker from academia or industry who gave a talk on his or her cutting-edge Deep Learning research results.

Here were some of the highlights for me:

• Vladlen Koltun’s talk about his ICLR 2017 paper Learning to Act by Predicting the Future. I enjoyed his presentation, though admittedly most of it was because he was funny and actively engaging with the audience. I previously blogged about the more technical aspects here.

• Barret Zoph and Quoc Le’s joint talk on neural architecture search, also from ICLR 2017 (here’s the OpenReview link) and also (like Koltun’s paper) an oral presentation at that conference. I’ve been hoping to find some time to read their paper and perhaps the winter break will afford me that opportunity. Zoph and Le’s presentation featured a lot of aggressive questioning from students, to the point where Professor Song asked the students to quiet down and let the speakers proceed. Fortunately, at least to me, the technical content of the presentation was interesting enough to keep my attention.

• Ross Girshik’s talk on computer vision and object recognition. Actually, we had a fire alarm for this one, which delayed the start class for about 15 minutes … so then we had to find a new room. Unfortunately, it took about 10 more minutes to get the projector working, and then we were told we had to leave the room at around 2:10PM. At least when Girshik was actually able to talk about computer vision, I found the historical overview to be educational.

• Percy Liang’s presentation on fighting black boxes and adversaries in Deep Learning. This was somewhat more theoretical work but he didn’t go too much into the details. I am less familiar with his work but would like to get accustomed with it as adversarial learning is a pretty hot topic in robotics these days.

In case you’re wondering, yes I had the usual sign language interpreters for the class. Yes they were unhappy, but they tried, and we were able to agree on a few terminology-related issues beforehand. And yes, I didn’t get all the technical details from the talks. I tried to allocate two hours before class to do the background reading, but inevitably that turned into 1.5 hours … and then 1 hour … and then 30 minutes. How do people manage to do class readings ahead of time when they’re juggling four major research projects? Or do people with normal hearing just find it easy to absorb almost all the technical stuff in these talks without prior preparation?1

As I mentioned earlier, CS 294-131 can be taken for a different amount of credits. This year, we had the option of taking it for 1, 2, or 3 credits. We got one credit for doing “arXiv summaries” and “discussion leads” and two for doing a class project. I decided that those summaries and discussions would be too much of a hassle, so I took CS 294-131 for two credits. It helps that I’ve long since finished my course requirements.

While I enjoyed some of the Deep Learning talks, I do have some criticisms about CS 294-131:

• There are too many websites/links related to the class. We have Piazza, the course website, the Google group (seriously?) and Slack channels, with one for each new week. I think this is too much information and things should be centralized in two spots at most — the course website and Piazza.

• I’m also not a fan of the arXiv leads, which was new this semester. Students had to give 1-minute presentations on Deep Learning papers that appeared on arXiv the past week. The problem with this is that the majority of students tried to cram as may technical details in their talk as possible, rather than give the clear key insight from the paper. In addition, students often went over their allotted speaking time (gee, who would have guessed??).

• Finally, I have no idea why class participation is worth 20% of the grading here. On Piazza, we were literally told that we would get class participation credit by simply attending the lectures. Not only does it not make sense to award students who attend lectures but don’t pay attention, it also hurts those who watch the video livestreams to reduce pressure on the lecture room, since the first lecture was “standing-room only.”

To be honest, I didn’t quite enjoy the class as much as I should have, and my project didn’t turn out as well as I would have liked. I worked on a Deep Reinforcement Learning project with three other students, and hopefully that will turn into a research paper later, but in retrospect, it’s difficult to coordinate a four-person project when everyone else has other priorities.

I don’t plan to take the Spring 2018 version of the course, but I’ll certainly keep track of the papers in the background reading. I’m excited to see who the guest speakers will be this time around …

1. For readers of this blog who are EECS PhD students (and yes, I know you read this blog) that means I’m not-so-subtly asking you to tell me how much you can absorb from technical talks and lectures. Posting as a comment here or emailing me personally works.

# Basics of Bayesian Neural Networks

In this post, I try and learn as much about Bayesian Neural Networks (BNNs) as I can. I borrow the perspective of Radford Neal: BNNs are updated in two steps. The first step samples the hyperparameters, which are typically the regularizer terms set on a per-layer basis. The second step performs Hamiltonian Monte Carlo over the data, or through a series of minibatches plus sophisticated “friction” techniques, if using Stochastic Gradient Hamiltonian Monte Carlo. These update the actual weights we use for the neural networks; the hyperparameters are sampled mainly to invoke a “fully Bayesian” hierarchical model.

The above is different from the paradigm of using Bayesian Neural Networks with a technique known as variational inference. I will not be discussing that.

To make things concrete, in this blog post I will assume we have the following neural network which:

• is fully connected.
• takes MNIST data as input (784 dimensions for each data point), has a hidden layer of 100 units, and outputs a 10-dimensional vector from a softmax.
• uses the sigmoid activation for the hidden layer.
• uses a regularizer hyperparameter for each of the two sets of weight matrices, along with two for the biases.

We can write the network’s mathematical meaning using $A \in \mathbb{R}^{10 \times 100}$ and $B \in \mathbb{R}^{784 \times 100}$ as the weight matrices, along with $a \in \mathbb{R}^{10}$ and $b \in \mathbb{R}^{100}$ as the bias vectors. Using this, the network output can be expressed as:

where $y \in \{0,1,\ldots,9\}$ indicates the class label and $A_i$ is the $i$th column of $A$. (The entire output would just be the vector of $\langle P(y=0| x), \ldots, P(y=9| x)\rangle$ values.) I write $(x,y)$ without subscripts, but in general we should write $\mathcal{D} = \{(x^{(k)},y^{(k)})\}_{k=1}^N$ to represent the entire dataset.

We need to incorporate Bayesian assumptions somehow, so first we assume that all these weights have Gaussian priors with zero mean and covariance matrix $\Sigma = \sigma_A^2 I$ set to some multiple of the identity. Intuitively, this seems to be a reasonable prior, as we’d generally like our weights to be small and roughly symmetrical about zero. For example, with the weights for $A$, we

where we set $\lambda_A$ to be the inverse variance, also known as the precision term. Similarly, we have $P(B) \propto \exp\left(-\frac{\lambda_B}{2}\|B\|_2^2\right)$, $P(a) \propto \exp\left(-\frac{\lambda_a}{2}\|a\|_2^2\right)$, and $P(b) \propto \exp\left(-\frac{\lambda_b}{2}\|b\|_2^2\right)$. Notice that I am now flattening the matrices $A$ and $B$ so that they are vectors. This makes the notation a bit easier, and it means that when I write $\|A\|_2$, I mean the $L_2$ norm, not the spectral norm on matrices (a.k.a., the largest singular value). I will use the flattened notation for the remainder of this blog post.

The precision terms are hyperparameters which we also endow with their own (IID) priors:1

Letting $\Theta$ denote our eight major parameters (including the hyperparameters), the posterior distribution which we want to sample from based on dataset $\mathcal{D}$ is

where we follow the usual assumption of conditional independence among the data; think of drawing data points from the true “data distribution” to form our training data.

BNNs use Bayesian methods to figure out a good set of parameters $\Theta$ for some task, which here is based on digit classification accuracy. I will now go over how the hyperparameter updates work, followed by the parameter updates.

This step samples the following:

There is no dependence on the dataset $\mathcal{D}$ as the hyperparameters are sampled based on “data” which consists of the parameter values at the lower level. Also, since we assumed an IID prior for the precision terms, and because the values of the parameters are viewed as independent as well (I admit this probably isn’t the best way of describing it but it feels intuitive) we have:

We can literally sample the four precision terms sequentially due to their independence assumptions. For simplicity, let us assume we are sampling only $\lambda_A$ so that the rest of the computation is straightforward. The math turns out to be:

and indeed, we have conjugacy: sampling the hyperparameters can be done simply by sampling from a Gamma distribution with these updated parameters based on the previous values of $\alpha$ and $\beta$. For intuition on what these parameters mean, if we have $X \sim \Gamma(\alpha, \beta)$, then $\mathbb{E}[X] = \alpha / \beta$.

This step samples the following:

where $\Lambda = \{\lambda_A,\lambda_B, \lambda_a, \lambda_b\}$ to simplify the notation if we depend on all the hyperparameters. There is dependence on $\Lambda$ here as those values determine the spread of the Gaussian distributions. Also, notice the dependence on the data here, unlike the previous case.

Using Bayes’ Rule as we did earlier (with that condensed $\Theta$ notation) we get:

How do we sample from this distribution? We use Hamiltonian Monte Carlo (HMC).

### The Hamiltonian, Potential Energy, and Kinetic Energy

Briefly: HMC uses what is known as a Hamiltonian Function $H(\Theta,p)$ where $\Theta$ are the parameters and $p$ refers to auxiliary momentum variables.2 In Bayesian statistics, current practice is to split the Hamiltonian into two functions: $H(\Theta,p) = U(\Theta) + K(p)$ known as the potential energy and kinetic energy, respectively. HMC is designed to sample from the distribution defined as follows:

where $Z>0$ is a normalizing constant and $T>0$ is some temperature, typically used to “flatten” or “diffuse” the target distribution (which here is $\propto \exp\left(-\frac{H(\Theta,p)}{T}\right)$) to make optimization easier. For the rest of this post, I include $T$ for clarity but I keep it separate from $H, U,$ and $K$.

In Bayesian statistics, the Potential Energy is $U(\Theta) = -\log(P(\Theta) \cdot P(\mathcal{D} | \Theta))$ because if we plug that in, we get

which is exactly what we want for the position variables, assuming that the momentum is independent so that $P(\Theta,p)=P(\Theta)P(p)$, which is standard practice. To be clear: (a) we’re only interested in sampling from the distribution for $\Theta$, not the momentum’s distribution, so (b) to get our desired samples of $\Theta$ from the posterior, we generate samples $(\Theta^{(i)},p^{(i)})$ that include the momentum variables, and then we drop the latter after we’re done.

Regarding the Kinetic Energy, current practice is to set it to be a quadratic potential with mass matrix $M = \sigma^2 I$ a multiple of the identity:

To be clear, we need to sample from the target distribution as specified by $\exp\left(-\frac{H(\Theta,p)}{T}\right)$, which means we must technically sample from the distributions defined as:

Here, $U$ and $K$ are energy functions, but they are not the same as the actual distribution we are sampling from. For example, with the kinetic energy, the actual distribution we sample from is proportional to

i.e., a zero-mean Gaussian with covariance $T\sigma^2I$. We sequentially sample for each because they are independent by assumption.

### Running Hamiltonian Monte Carlo

Running HMC in computer simulation requires a Metropolis test3 each iteration (i.e., each sample in our MCMC chain) to correct for discretization error. This requires computing the follow ratio $r$:

where $\Theta'$ and $p'$ refer to the proposed position and momentum variables, respectively. Computing the kinetic energy is typically a matter of adding squared $L_2$ norms, so that part is easy. But what about $U$? To compute this difference for our proposed Bayesian Neural Network, we see that

where $C$ represents a constant independent of the parameter $\Theta$. The reason why I ignore this is twofold.

• First, when computing the Metropolis ratio, that $C$ will be the same for both $U(\Theta)$ and $U(\Theta')$ so we can ignore it.

• Second, we also need to use $U$ when we sample with HMC, and this means taking the gradient $\nabla_\Theta U(\Theta)$ which will kill $C$. (The Metropolis test is only to determine whether we accept or reject a proposal, but we need some way of actually getting that proposal).

To elaborate on the second point, sampling using “Hamiltonian Dynamics” requires the momentum update:

where $\epsilon$ is a (leapfrog) step size parameter which we divide by two as required from the leapfrog method.

You can immediately see from this that $p$ must have the same dimensions as $\Theta$. I think of $p$ as concatenating flattened $A,B,a,b$ weights, so it’s one giant vector. The gradient updates can be specified weight-by-weight, which will change the corresponding “slices” of the vector $p$. For instance, with $A$ and abusing notation by re-using $p$, we have:

The $\lambda_A A$ term serves as a weight decay regularizer, and the sum over the gradient of probabilities can be computed via backpropagation through the neural network.

• Remark 1: hopefully my above explanation clarifies why imposing a Gaussian prior on the weights is equivalent to $L_2$ regularization.

• Remark 2: consider using TensorFlow to get the gradients corresponding to the log probabilities. In particular, TensorFlow can return gradients using tf.gradients.

One thing I should point out: here, we can view $- \log P(\mathcal{D} | \Theta; \Lambda)$ as a “cost function” that we’re trying to minimize. This is equivalent to minimizing the cross entropy loss between what our the neural network predicts, and the distribution that consists of one-hot vectors of the training labels.4 That’s precisely the loss function I’d use if I were formulating the classification problem and solving it with stochastic gradient descent instead of HMC. The implication is that it’s OK to try and maximize the $\log P(y^{(k)} | x^{(k)}, \Theta ; \Lambda)$ value that we see above, which is what happens when we perform gradient ascent on it; intuitively, the resulting weights will assign higher probability to the correct class.

After the momentum updates, the position variables are updated using a similar gradient-based step:

so that, intuitively, $\Theta$ is also updated in roughly the same direction of the gradient. That’s how HMC works and uses gradient information.

## Practical Considerations

### Averaging Predictions

Using Bayesian Neural Networks in practice often requires sampling a set of neural network weights many times and then computing the mean and standard deviation of the predictions.

A figure copied from the VIME paper (NIPS 2016) showing Bayesian Neural Network predictions and uncertainty levels.

For instance, in the figure above (taken from the VIME paper) the authors construct a regression task, where the network takes in a scalar-valued5 input $x \in \mathbb{R}$ and outputs a prediction $y \in \mathbb{R}$. The red dots are the targets, while the green dots are the predictions. It’s clear that the red dots are clustered near the center of the figure, so logically, our Bayesian Neural Networks should be very confident in their predictions in those areas, and less confident outside the training data’s dominant regime. Indeed, the shaded areas confirm this, as they represent the output mean plus/minus one and two standard deviations (I think the second standard deviation is too far to see for the extremes in this figure) based on different neural network weight parameters.

These types of figures are typically shown when people talk about Bayesian Neural Networks, such as in Yarin Gal’s excellent tutorial.

### Code Implementation

I am working on implementing Bayesian Neural Networks in my MCMC repository, which is a TensorFlow implementation based on Tianqi Chen’s earlier pure numpy code. The code is a bit disorganized and not quite ready for consumption by the public, but I think I’m getting something going with this code.

1. We’re using the characterization using shape and rate, not shape and scale.

2. I follow Radford Neal’s notation in setting $p$ as the auxiliary momentum variables for HMC. Neal uses $q$ as the position variables, but I set them here as $\Theta$ for obvious reasons.

3. The proposals have the same density, so it is not necessary to perform a Metropolis-Hastings test. Why this is true is based on reversibility, but it is still somewhat unclear to me.

4. This is a reasonably well-known fact in machine learning, but I would like to write up some details on this because I sometimes find myself looking up the derivations again.

5. Well, technically they preprocessed the input to be $[x,x^2,x^3,x^4]$ but thinking of it as 1-D makes it much easier to plot.

# Read-Through of Multi-Level Discovery of Deep Options

In this post, I attempt to learn as much as possible about the paper Multi-Level Discovery of Deep Options, which introduced the DDO algorithm. This post is split into my understanding of the math, followed by implementation/experiments.

Before we begin, here is some useful notation:

• Options are denoted as $h = \langle \pi_h, \psi_h \rangle \in \mathcal{H}$, with $\mathcal{H}$ representing the set of possible options. To make the notation clearer when we’re dealing with options at each time step $t$, we could write $\mathcal{H} = \{h_{\rm opt}^{(1)}, \ldots, h_{\rm opt}^{|\mathcal{H}|}\}$.

• Higher-level policies are denoted as $\eta(h \mid s)$, in which we pick options. You could also repeat this recursively; think of using something like $\eta'(\eta \mid s)$, though the paper never officially uses that notation.

• A set of $K$ demonstrations are denoted as $\{\xi_1, \ldots, \xi_K\}$. They do not assume that these demonstrations are from an expert supervisor, only that (a) they have some hierarchical structure to be discovered, and (b) that they are informative of the relevant actions to take in each state. Assumption (a) is easier to understand and justify, for if we didn’t have a hierarchical structure to discover, there is no point doing DDO or any hierarchical learning algorithm. Fortunately, in many real-life tasks (for humans), we have hierarchical structures.

The paper also goes through some usual reinforcement learning and imitation learning notation, which I won’t repeat here as it will be clear to those with the relevant background.

The goal of DDO is to discover a set of parameterized options from $\{\xi_1, \ldots, \xi_K\}$.

## The Math

To derive the math, the authors use the perspective of fitting a generative model to the trajectories. That generative model, however, has latent variables, so they need to perform a flavor of Expectation-Maximization, which I’ve previously blogged about here.

But what are the latent (a.k.a., hidden) variables?

• The choice of $h_t \in \mathcal{H}$, which represents the option at time $t$.

• In addition, their generative model also assumes that at each time step, we have a binary random variable $b_t \in \{ 0, 1 \}$, where if $b_t=1$, a new option is drawn from the high-level policy. The random variable follows a Bernoulli distribution with parameter based on the option termination conditions $\psi_h$; the higher the value at a given state, the more likely it is that the option should finish and return control to the meta-policy.

The generative model tells us how to assign probabilities to the trajectories. Now, let’s figure out how to find $\theta \in \Theta$ that maximizes the likelihood of the trajectories! We’re using $\theta$ to denote all the parameters of everything together: the meta-control policy and for each option. The derivation here assumes a two-level hierarchy, but the extension to multiple levels should be straightforward, besides some nightmarish notation.

The discussion above makes it clear that the $i$th trajectory in our data follows this structure:

with $\xi_i$ and $\zeta_i$ representing the visible and latent variables, respectively.

To find parameters, we need to encode this probabilistically, as in $p(\xi_i, \zeta_i)$. Fortunately, as is standard in RL/IL, this long joint probability decomposes based on time steps. Specifically, we express it as:

where for notational clarity, we denote the various probability distributions as follows:

• $p_0$ represents the probability distribution over the initial state. There is no dependence on $\theta$.
• $p(s' | s,a)$ represents the dynamics of the environment. Once again, there is no dependence on $\theta$.
• $q_\theta(b_t,h_t | h_{t-1},s_t)$ represents the distribution over the hidden variables. (My notation is slightly different than what the DDO paper uses, but I find it easier to think of the entire likelihood $\mathbb{P}_\theta$ as distinct from this.)
• $\pi$ and $\eta$, of course, represent the policy and option selection.

Since $b$ is Bernoulli, we can easily split $q_\theta$ into cases to define it more precisely:

and

We should now step back and ask ourselves if this definition of $\mathbb{P}_\theta(\zeta_i, \xi_i)$ makes sense. Does it?

Yes. The first few terms draw the initial state, and then the $\delta_{b_0=1}$ ensures that we actually have an option to sample from to start (though it is really unnecessary notation). Then we assume we draw $h_0$.

Next, we iterate through the remaining time steps. We draw the action based on the policy, and then the dynamics provide the state. But then we also need to sample the two latent variables. We’re packing these together in $q_\theta$, but you can also think of it sequentially: first sampling a Bernoulli for the option termination, and then sampling from the meta-policy. The split of $q_\theta$ in two cases makes the sequential aspect of this clearer: $\psi_{h_{t-1}}$ represents the probability that $b_t=1$, and if it is zero (with probability $1-\psi_{h_{t-1}}$), we don’t need to draw at all, hence the delta term $\delta_{h_t=h_{t-1}}$. Otherwise, we need to sample, hence the $\eta(h_t|s_t)$. Yes, this all makes sense, and can be reasoned by iterating through the generative model “pseudocode.”

Great. Now we know the likelihood of one trajectory. We want to maximize the (log) likelihood $L(\theta | \xi)$ assigned to a trajectory, which means we need to take gradient steps to get our neural network weights to go to the correct spot. For one trajectory $\xi$ (omitting the subscript which normally indicates which trajectory in our set of them), we have:

Where in (i) we substituted the definition with the thing we want to optimize since that will lead to “likely” trajectories learned from gradient steps, in (ii) we applied the $\nabla \log p(x) = \frac{\nabla p(x)}{p(x)}$ “trick” and then applied the probability 101 rule that $p(x) = \sum_z p(x,z)$ while explicitly writing out what it means to sum over the discrete-valued $\zeta$ latent variable over all $T$ times (this is an exponentially large sum!), in (iii) we again applied the same log-derivative trick, except this time from the perspective of $\nabla p(x) = p(x) \nabla \log p(x)$, in (iv) we used the definition of conditional probability and then the definition of expectation in (v), and finally in (vi) we applied our previous derivation of the probability of both the visible and latent variables, and omitted terms which cancel out from the gradient. Recall again that $\theta$ includes the parameters of the meta policy (i.e., $\eta$) and the lower-level options $h_t \in \mathcal{H}$.

In general, we’re dealing with a batch of trajectories, so just apply it to each of the trajectories and take an average over the minibatch to make the gradient independent of batch size.

This is good, but we can actually write the gradient more explicitly, without the expectation by converting the expectation to the sums. Effectively, instead of doing the step (vi) to (v) thing we did earlier, we’ll explicitly simplify the probability based on the summation. However, we’ll start from step (vi) above since it’s easier to manage the calculations with all the stuff canceled out after applying $\nabla_\theta$. We’ll go through this computation in several steps.

For the third term, we have:

Where (i) is by linearity of expectation, (ii) simplifies the expectation by realizing that the term under expectation only (directly) depends on $h_t$, and (iii) applies the definition of expectation. If step (ii) isn’t clear, it should work out if you expand the full probability into sums over all $h_t$ and all $b_t$. Then the $b_t$ sums should get “pushed to the right” and sum to one (and thus go away) and the rest should follow from there. I would write it out, except that it makes sense to me and I don’t want to spend my entire blogging career getting bogged down with notation.

Additional comment: it might also be the case that conditioning on $\xi$ instead of just the relevant part of $\xi$ was done for notational convenience.

Next, here’s how to rewrite the first two terms in the expectation. This requires a considerable amount of care:

For the most part, it uses similar techniques as the other part, such as converting the expectation to sums over probabilities (i.e., applying the definition) and then moving the sums to the right as far as possible so that they can sum to one and then eliminate. The main challenge here is tracking all the $b_t$s here in addition to the $h_t$s.

Putting all this together, we get the equation shown in the DDO paper where they’ve substituted in $u_t(h), v_t(h)$, and $w_t(h)$ terms. It should be equivalent to what I have above, though the odds that there is a typo somewhere (probably above) is 100 percent. I think I know how this works in theory but there is a lot of notation.

Incidentally, there is some good explanation about how — since we’re living in discrete-land — there are three cross entropy loss terms embedded into the gradient update.

The last piece of math to note before proceeding to the implementation details is … how to implement Expectation-Gradient efficiently. Fortunately, the Expectation step — where latent variables are “sampled” and weighted probabilistically — can be done with the Baum-Welch update, which I have previously blogged about. I won’t go through the details here, though I went through them on pencil and paper and the math makes sense. The key to my intuition is to think of forward and backward probabilities as “counting up” the number of ways to reach a given spot from the start (for trajectory prefixes) or the end (for trajectory suffixes). The Baum-Welch algorithm gives us efficient ways to compute various probabilities, and then, since we can formulate a loss function which is the negative log likelihood of a trajectory, we can call TensorFlow to compute gradients for us for the Gradient step. Thus, we have the Expectation-Gradient algorithm.

Quick note: recall Expectation-Maximization. They’re not doing that because the maximization part can’t be done in closed form with neural network models:

Our work is most related to [43], who use a similar generative model, originally introduced by [8] as an Abstract Hidden Markov Model, and learn its parameters via the Expectation-Maximization (EM) algorithm. EM applies the same forward-backward E-step as our Expectation-Gradient algorithm (Section 4.2) to compute marginal posteriors, but uses them for a complete optimization M-step over the options and the meta-control policies. This optimization is infeasible for expressive representations, as well as for multi-level hierarchies.

OK, now let’s move on to some of the implementation details and experimental results.

## Implementation Details and Experiments

I am mostly going to investigate their GridWorld-related experiments, as the Atari stuff uses RAM, not images, so it’s hard to interpret, and the surgical robotics part is impossible to comprehend without intimate knowledge of the training data.

They consider DDO under two different scenarios:

(Supervised) given a supervisor who demonstrates a few times how to perform a task, show that the discovered options are useful for accelerating reinforcement learning on the same task; (Exploration) apply reinforcement learning for $T$ episodes, sample trajectories from the current best policy, and augment the action space with the discovered options for the remaining $T'$ episodes

This is a bit confusing to me for a few reasons. I will state why later when I review some questions I have about the paper. Let’s move on to GridWorld. Details:

• a $15 \times 11$ grid, with four rooms.
• the agent actions is randomized with probability 0.3.
• the agent can move in four directions (the “atomic actions”), and moving into wall has no effect.
• an apple is spawned in random location, and upon reaching it the agent gets +1 reward (and then it is respawned).
• the agent knows where it is, as it’s hard-coded in the state space.

All neural network policies (for control, meta-control, termination) are neural nets with one hidden layer of … two nodes. Yeah, you don’t need much for GridWorld.

The GridWorld used in the DDO paper. It is repeated four times in this image, each with a different discovered option: down, up, right, and left, respectively.

For the two setups:

• Supervised. They generate 50 trajectories of length 1000 using Value Iteration. (1000 seems like a large number, but the agent repeatedly respawns after hitting the apple so trajectories can go on indefinitely.) Then, after running DDO on this data, they can discover options. This is the supervised case, so there is (I assume) no environment interaction during the DDO stage. They executed the learned options and found four of them at the lower level (see figure above). There’s another figure for the higher-level options, which shows the options that are invoked to move the agents to rooms, starting from any given state.

• Exploration. They trained a DQN agent for 2000 steps with atomic actions. Then after those steps, they can roll out trajectories from the $\epsilon$-greedy policy, which effectively turn this into a “supervised” case. DDO learns options, and then the Q-function is then reset and DQN is run again with the augmented action space. I think what they want to show is that, compared to baseline DQN, the DQN augmented with options learns faster. It is a little hard to interpret Figure 2, though. They are starting from 2000 steps when they compute the number of steps for the option-augmented agents, right? (Because that’s 2000 steps of computation needed.) I think so, as their Figure 2 plots for the “exploration setting” are like the ones from the “supervised” setting except shifted to the right past 2000 steps, but then I’m surprised at why rewards shoot up at (essentially) exactly 2000 steps.

I also read through their Supplementary Material, which provides mostly more information on various GridWorld settings. Figure 7 appears to be missing the benchmark of having options-only DQN, but otherwise it has primitives-only DQN and options-and-primitives DQN, which is the “augmented DQN” from the paper.

Now here are some questions I have:

• For the supervised setting of DDO, is it fair to say it “accelerates RL”? In the default RL setting, we do not start with expert trajectories. Thus, any time we can get expert trajectories to bootstrap our starting policy, isn’t that an unfair advantage? This is what I wondered about when digesting the first half of Figure 2. Perhaps the comparison should be with reinforcement learning initialized with behavior cloning?

• For the exploration setting of DDO, what does it mean to augment the action space with discovered options? I think it means this: suppose our default action space is $\mathcal{A} = \{a_1,a_2,a_3,a_4\}$. Then we discover three options, so the action space becomes $\mathcal{A}' = \mathcal{A} \cup \{h_{\rm opt}^{(1)}, h_{\rm opt}^{(2)}, h_{\rm opt}^{(3)}\}$. Is that right? Then how is this logic implemented? The original RL policy must have started with a neural network that outputs four components, one per action (so that we can do a softmax to get the full probability distribution). Do we then copy weights over to a new neural network with seven outputs, so that all weights except for the last layer are pre-initialized?

• Regarding the DQN results, DQN is notorious for requiring lots of hyperparameter tuning and there are many ways that the implementation can go wrong. I wonder if this was hand-implemented or if the authors based it on an existing library with known benchmarks.

Whew! That was a fairly exhausting read. I needed to read this three times (which is the amount Professor Michael I. Jordan keeps telling us to read textbooks) but at least I think I get the gist of how DDO works.

# The 2017 Bay Area Robotics Symposium

Last Friday, I participated in the Bay Area Robotics Symposium for the first time. Since 2013, this has been an annual November event with alternating hosts of the University of California, Berkeley, and Stanford University. This year, it was held in Berkeley, and since by now I am closer to calling myself a robotics researcher, and additionally have (finally!) a robotics-related preprint online, I really had no excuse not to join this time around.

The International House auditorium, where BARS took place.

BARS took place at the International House in the southeast corner of the UC Berkeley campus. The talks were in the room shown in the picture above. As usual:

• I arrived early.
• I sat near the front of the room. Thus, this picture shows basically the entirety of the auditorium.

I normally follow the two rules above because of the need to (a) meet my sign language interpreters early, and (b) grab a seat by the front to get a good view of them. Sadly, it means that (as usual) I don’t get other students to sit next to me. In fact, the seats nearby me were empty. One of these days, I will figure out how to sit next to other students.

After about a 15 minute delay (typical Berkeley), BARS finally got started. The Berkeley host, Professor Anca Dragan, gave a few opening remarks, which included that BARS 2017 had (if I recall correctly) 392 people signed up. The capacity was 500, I think, and I’m actually surprised that we didn’t hit the limit. Perhaps CoRL 2017, held recently, meant that BARS 2017 was redundant?

Anyway, soon we started off with the agenda. We started with the first of four sets of faculty talks, each consisting of six faculty giving ten minute talks. Berkeley Professor Pieter Abbeel started things off.

Pieter Abbeel started off the faculty talks. He presented "Deep Learning for Robotics". Apologies for the glare on the camera --- I am a rookie with using my iPhone for cameras. You can also see my sign language interpreter there, who had her work cut out for her due to Abbeel's fast and energetic speaking rate.

My goodness, how does he get all these papers?!? His talk started off by listing his long list of publications … in 2017 alone. Then he talked about Meta-Learning, which is what he’s most interested in nowadays within AI. (Just to be clear, I said “most interested,” not “the only thing he’s interested in.”) Incidentally, Abbeel was featured in the New York Times for co-founding Embodied Intelligence. I’m excited to see what they will produce.

We then had more faculty talks. Once these concluded, we moved on to the first of two student spotlight talk sessions, when students gave 1-minute lightning talks on their research. As expected, about 90% of the students went over their alloted time, and most wasted time by saying “My name is X and I’m a student at Y advised by Z and blah blah blah.” The good news is that many of the presentations were interesting enough to pique the curiosity of a nontrivial fraction of the audience. Then those folks can read the students’ papers in their own time.

I didn’t give a lightning talk. I’m not sure why, but I suppose professors could only pick two or three of their students due to time constraints.

We then had our first coffee break. Yay! I stood up, stretched, and briefly chatted with a few other people from Berkeley who I knew. I wish I had talked to some students from Stanford, though. How do people network to brand-new folks at events like these? I really wish I knew the answer.

After another set of faculty talks (including Ken Goldberg’s work on the Dexterity-Network), and then a lunch break, we had … our keynote talk.

Professor Robert Full at the end of his keynote talk, asking the audience about our thoughts on a suitable "Grand Challenge" for robotics.

Professor Full isn’t technically a core robotics faculty — I think — because “biomechanics” is a better way of describing his work. But he quickly captivated the audience with his funny videos on insects and squirrels going through wild motions … and then videos of robots attempting to replicate that movement. I particularly liked the videos of insects and robots that were able to resist insane amounts of forces and squeeze their way through impossibly tiny paths. Judging from the frequent audience laughter, I wasn’t the only one who enjoyed his talk.

Towards the end, Professor Full asked us for “Grand Challenges” of Robotics. One of the professors who I work with, Ken Goldberg, offered “the ability of a robot to pick up anything a human can pick up, including stuff that doesn’t want to be picked up.” You can verify this by looking at his Twitter. I agree with him.

Ken, meanwhile, helped me out a bit later that day. During the second coffee break, Ken gathered a few of his students (including me) and introduced us to Stanford Professor Allison Okamura, who’s also done some work on surgical robotics. At least I got to network a bit, which is better than the usual nothing!

We then had our second set of lightning talks, with the same old issues (people going over their time, etc.), and then our fourth set of faculty talks. This featured stars such as Jitendra Malik and Sergey Levine. I enjoyed Malik’s talk, which was entertaining for two reasons. First, he joked that vision and robotics people (and Malik is a vision person) were historically separate communities in AI, but after drastic improvements in image recognition due to Deep Learning, the “robotics people better pay attention to what the vision people are doing.”

The second joke Malik made was with regards to Stanford vs Berkley. The joke was that Stanford people could develop algorithms for robots (e.g., navigation) that perform well when applied to Stanford-based environments, but which fail on Berkeley-based environments. Why? Stanford is nice and clean, Berkeley is dirty and messy.

Well, UC Berkeley may be a bit run down, not to mention crowded (see image below), but it’s the people that matter, right?

Yeowch. BARS sure was popular! Well, that, and the venue is a bit too small for what we're offering. No wonder I regularly hear complaints that Berkeley is too crowded.

Some random, concluding thoughts:

• Two areas of research in robotics (and AI more generally) that are hugely popular are safety and robustness. These are related, though there are subtle differences.

• There were several faculty talking about aerospace dynamics and mechanical engineering. This is not my area of research so I had a hard time processing the concepts. There was also more on surgical robotics than I expected, due to several Stanford faculty (as our surgical robotics person, Ken, is mostly working on grasping). Berkeley, naturally, has more faculty who do Deep Reinforcement Learning for Robotics, and I find that to be the most riveting field of robotics and AI.

• Many faculty kept saying some variant of: “we’ll do Deep Learning because it’s so popular and works.” Yes, I know it’s popular, and it’s funny to talk about it, but by now this has grown stale on me and I wish people would cut back on their hackneyed Deep Learning comments.

• Unfortunately, since I spent most of the coffee breaks talking to a few people or relaxing, I didn’t get to attend either poster session. I got a glimpse of one of them and it looked crowded (and noisy) so maybe I didn’t miss much.

Well, that’s a wrap. I look forward to attending BARS 2018 at Stanford. Thank you to everyone who helped make BARS 2017 happen!

# Understanding and Categorizing Scalable MCMC and MH Papers at a High Level

When reading academic papers about a certain subfield, I often find it difficult to clearly understand how they connect with each other. For example, what algorithms are based on other algorithms? Can the contributions of two papers be combined? Would combining them result in notable improvements or just on-the-margin, negligible changes? (The answer to that last question is usually “the latter” but it’s something we should at least consider.)

This post is an attempt to unify my understanding of papers related to scalable Markov Chain Monte Carlo and scalable Metropolis-Hastings. By “scalable,” I refer to the usual meaning of using these algorithms in the large data regime.

These are the papers I’m trying to understand:

• MCMC Using Hamiltonian Dynamics, Handbook of MCMC 2010
• Bayesian Learning via Stochastic Gradient Langevin Dynamics, ICML 2011
• Stochastic Gradient Hamiltonian Monte Carlo, ICML 2014
• Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget, ICML 2014
• Towards Scaling up Markov Chain Monte Carlo: An Adaptive Subsampling Approach, ICML 2014
• Firefly Monte Carlo: Exact MCMC with Subsets of Data, UAI 2014
• On Markov Chain Monte Carlo Methods For Tall Data, JMLR 2017.

(All of them are freely available online.)

First, I’ll briefly discuss why we care about the problem of scalability with MCMC and MH. Then, I’ll group these papers into categories and explain how they are connected to each other. This will then motivate our UAI 2017 paper, An Efficient Minibatch Acceptance Test for Metropolis-Hastings.

# Why Markov Chain Monte Carlo?

I’m not going to review MCMC here as you can find many other references, both online and in textbooks. It may help to look at my blog post from June 2016 where I describe the general problem setting. My more recent BAIR Blog post also contains some potentially useful background material.

But why use MCMC at all? Here’s one reason: if we use it to sample some model’s parameter $\theta$, then the chain of samples $\{\theta_1, \ldots, \theta_T\}$ should let us quantify useful statistics about properties of interest. Two of these are the expectation and variance of something, which we might apply on the parameter itself. We can estimate (for example) the expectation by taking a sequence of the $K$ most recent samples (or a subsampled sequence) from our chain $\{\theta_{T-K+1}, \ldots, \theta_T\}$ and then taking the sample vector-valued expectation. More generally, letting $f$ be a function of the parameters, we can compute the expectation $\mathbb{E}[f(\theta)]$ using the expectation of the sampled values $\{f(\theta_{T-K+1}),\ldots, f(\theta_T)\}$.

We can’t do this if we take stochastic gradient steps, because the samples from SGD are not from the posterior distribution of the parameter. SGD is designed to converge around a single point in the space of possible $\theta \in \Theta$ values, unlike MCMC methods which are supposed to approximate a distribution, which can then be used for sample estimates of expectations and variances.

My perspective is supported in papers such as the SGLD paper from 2011 (one of the papers I listed above); the authors (Welling & Teh) claim that:

Bayesian methods are appealing in their ability to capture uncertainty in learned parameters and avoid overfitting. Arguably with large datasets there will be little overfitting. Alternatively, as we have access to larger datasets and more computational resources, we become interested in building more complex models, so that there will always be a need to quantify the amount of parameter uncertainty.

So … that’s why we like the Bayesian perspective. These authors are rock-stars, by the way, so I generally trust their conclusions.

I’ll be honest, though: I can’t think of something nontrivial I’ve done in which the Bayesian perspective was that useful to me. In Deep Learning, Deep Imitation Learning, and Deep Reinforcement Learning, I’ve never used priors and posteriors; RMSProp or Adam is good enough, and it seems like this goes for the rest of the community. Maybe it’s just not that necessary in these domains? I have two papers on my reading list, Bootstrapped DQNs and Robust Bayesian Neural Networks, which might clarify some of my questions regarding how much of a Bayesian perspective is needed in Deep Learning. I should also definitely check out the Bayesian Deep Learning NIPS workshop.

# Langevin Dynamics and Hamiltonian Dynamics

This section concerns the following three papers:

• MCMC Using Hamiltonian Dynamics, Handbook of MCMC 2010
• Bayesian Learning via Stochastic Gradient Langevin Dynamics, ICML 2011
• Stochastic Gradient Hamiltonian Monte Carlo, ICML 2014

I gave a brief introduction to Langevin Dynamics in my earlier blog post, so just to summarize for this one, Langevin Dynamics injects an appropriate amount of noise so that (in our context) a gradient-based algorithm will converge to a distribution over the posterior of $\theta$. The Stochastic Gradient Langevin Dynamics algorithm combines the computational efficiencies of SGD by using a minibatch gradient, but uses the Langevin noise to appropriately cover the posterior:

[…] Langevin dynamics which injects noise into the parameter updates in such a way that the trajectory of the parameters will converge to the full posterior distribution rather than just the maximum a posteriori mode.

As a follow-up, the Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithm is similar to SGLD in that it uses a minibatch gradient along with “exploration noise.” This time, the noise is from Hamiltonian Monte Carlo, which is more sophisticated than Langevin Dynamics since HMC introduces extra momentum variables which allows for larger jumps.

Radford Neal’s excellent 2010 book chapter goes over HMC in great detail, so I won’t go through the details here (though I’d like to write a blog post solely about HMC — so stay tuned!). Just to give a quick overview, though, our problem context is similar, where we have a target posterior:

with potential energy function

(Don’t worry too much about the “potential energy” terminology; HMC was originally developed from a physics background. We’re still in the same problem setting.)

HMC generates samples from a joint distribution that involves extra momentum variables:

where $r$ are the momentum variables and $M$ is a mass matrix. The update rules are:

• $\theta = \theta + \tau \cdot M^{-1}r$.
• $r = r - \tau \cdot \nabla U(\theta)$.

where $\tau$ is some step size. If this doesn’t make sense, read Neal’s 2010 book chapter.

The result from HMC is a set of samples $\{(\theta_i,r_i)\}_{i=1}^T$. But we’re only interested in the $\theta_i$s, so … we simply drop the $r_i$ terms to get our samples for $\theta$. Amazingly, $\theta$ is sampled from the correct target distribution, which one can show via some “reversibility” analysis.

SGHMC needs a little massaging to actually get it to sample the target distribution, since simply taking a subset of the data to compute an approximation to $\nabla U(\theta)$ will lose the “Hamiltonian Dynamics” property; the authors resolve this by using second-order Langevin Dynamics to counteract the effect of too much gradient noise in estimating $\nabla U(\theta)$, and the result is a similar algorithm to SGLD except with a different noise term.

Just to be clear, both SGLD and SGHMC are minibatch, gradient-based algorithms that are also considered “Bayesian methods.” Neither are pure random walks, i.e., neither use Gaussian proposals because the proposals are based on the stochastic gradient value, plus some additive noise term. For SGLD, that extra noise is actually a random walk, but not for SGHMC.

For both SGLD and SGHMC, we have to apply the Metropolis-Hastings test for computer implementations due to discretization error, even though in theory we shouldn’t have to since energy is preserved. In both papers, the authors decrease step sizes to zero so that the MH rejection rate goes to zero. Intuitively, smaller step sizes mean samples are concentrated into higher regions of the posterior, and the gradient ensures going in the direction of greatest increase of the posterior probability. In addition, decreasing step sizes also means discretization error decreases, which yet again further reduces the need for MH tests. While this is great, because the MH test requires full-batch computation, perhaps we are missing out somehow by keeping our step sizes small.1

# Metropolis-Hastings

In this section, I discuss the remaining papers listed at the introduction of this post. They are related in some form to the Metropolis-Hastings algorithm, which is commonly used in MCMC techniques to act as a correction to ensure that samples do not deviate too frequently away from the target posterior distribution.

As I mentioned in both of my earlier blog posts, conventional MH tests require a full pass over the entire dataset. This makes them extremely costly, and is one of the reasons why both SGLD and SGHMC emphasized how decreasing step sizes results in lower discretization error, so that they could omit the MH tests.

Their computational cost has raised the question over whether using subsamples of the data for the MH test computation is feasible. It’s not as straightforward as taking a fixed-sized subset (i.e., minibatch) of the dataset because that results in a non-trivial target distribution which is not the desired posterior.

The following two papers propose subsampling-based algorithms that attempt to tackle the high cost of full-batch MH tests:

• Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget, ICML 2014
• Towards Scaling up Markov Chain Monte Carlo: An Adaptive Subsampling Approach, ICML 2014

I discussed the first one in an earlier blog post. The second one follows a similar procedure as the first one, except that it uses a slightly different way of interpreting when to stop data collection. The downside, as I painfully realized when I tried to implement this, is that due to its concentration bounds, it requires a real-valued parameter which depends on the entire collection of $\{\log p(x_i\mid \theta), \log p(x_i\mid \theta')\}_{i=1}^N$ values each iteration, which defeats the point of using a subset of the data. (Here, I use $\theta'$ to denote the proposed distribution.)

The authors of the Adaptive Subsampling paper have a follow-up JMLR 2017 paper (it was under review for a long time) which expands upon this discussion. I found it quite useful, particularly because of their proof (in Section 6.1) about how naive subsampling for the MH test results in a nontrivial and hard-to-interpret target distribution. In Section 6.3, they introduce a novel contribution where they rely on subsampling noise for exploration; that is, use the minibatch-induced noise (which is Gaussian by the Central Limit Theorem) to explore the posterior. However, they showed that this approach still seems to require $O(n)$ data points each iteration. On the other hand, they didn’t investigate this method in too much detail, so it’s hard to comment on its usefulness.

The last related work was the Firefly paper, which won the Best Paper Award at UAI 2014. It can perform exact MCMC, but the main drawback is (emphasis mine):

FlyMC is compatible with a wide variety of modern MCMC algorithms, and only requires a lower bound on the per-datum likelihood factors.

To be clear on what this means, they require the existence of functions $B_i(\theta)$ satisfying $0 \le B_i(\theta) \le p(x_i \mid \theta)$ for all $i$. How realistic is that? I have no idea, honestly, but it seems like something that is difficult to achieve in practice, especially because it’s conditioning on $\theta$ and $\theta'$, which will vary considerably throughout sampling. There is some interesting discussion about this at Christian Roberts’ excellent blog, with Ryan Adams (the professor co-author) commenting.

The prior work then motivated our work, where we avoided needing these assumptions and showed that we could cut the M-H test cost time down to one equivalent with SGD, without loss of performance. There’s no free lunch, though; our algorithm has applicability constraints but those are hopefully not that restrictive. Check out our BAIR blog post for more information.

# Conclusion

I’ve discussed these set of papers and tried grouping them together to see a coherent theme in all of this. Hopefully this makes it clearer what these papers are trying to do.

1. I’m actually not sure if we can even use the Metropolis-Hastings test (and not just the “Metropolis Algorithm”) with SGHMC. The authors of the SGHMC paper claim that MH tests are impossible for both SGLD and SGHMC since the reverse proposal probability $q(\theta \mid \theta')$ cannot be computed. It seems to me, however, that one can compute the SGLD reverse probability because that’s a Gaussian centered at the gradient term with some known variance. What am I missing here? At the very least, applying the MH test to regular HMC should be OK, since we can omit the proposal probabilities. And that’s what both the SGHMC authors (judging from Tianqi Chen’s source code) and Radford Neal do in their experiments.

# Don't Focus on Writing Ability; Focus on Technical Skills

In the process of applying to graduate school, and then visiting schools that admitted me, I was told that PhD students needed to possess solid writing ability in addition to technical skills. One UT Austin professor told me he believed liberal arts students (like me) were better prepared than those from large research universities, presumably because of our increased exposure to writing courses. One Cornell professor emphasized the importance of writing by telling me that he spent at least 50 percent of his professional life writing. A Berkeley professor who I frequently collaborate with has a private Google Doc that he gives to students with instructions on writing papers, particularly about how to structure an introduction, what phrases to use, and so on.

The ability to write well is an important skill for academics, and I don’t mean to dismiss this outright. However, I think that we need to be very clear that technical skills matter far, far more for the typical graduate student, at least for computer science students focusing in artificial intelligence like me. I would additionally argue that factors such as research advisors and graduate student collaborators matter more than writing ability.

Perhaps the emphasis on writing skills is aimed at two groups of people: international students, and the very best graduate students for whom technical skills are relatively less of a research bottleneck. I won’t comment too much on the former group, besides saying that I absolutely respect their commitment to learning the English language and that I know I’m incredibly lucky to be a native English user.

I bring up the second group because much of the advice I get are from faculty at top institutions who were stellar graduate students. Perhaps most of their academic life is dominated by the time it takes to convert research contributions to a paper, instead of the time it takes to actually come up with the contribution itself. For instance, this is what UT Austin professor Scott Aaronson had to say in an old 2005 (!!) blog post, back when he was a postdoc (emphasis mine):

I’ll estimate that I spend at least two months on writing for every week on research. I write, and rewrite, and rewrite. Then I compress to 10 pages for the STOC/FOCS/CCC abstract. Then I revise again for the camera-ready version. Then I decompress the paper for the journal version. Then I improve the results, and end up rewriting the entire paper to incorporate the improvements (which takes much more time than it would to just write up the improved results from scratch). Then, after several years, I get back the referee reports, which (for sound and justifiable reasons, of course) tell me to change all my notation, and redo the proofs of Theorems 6 through 12, and identify exactly which result I’m invoking from [GGLZ94], and make everything more detailed and rigorous. But by this point I’ve forgotten the results and have to re-learn them. And all this for a paper that maybe five people will ever read.

Two months of writing for every week of research? I have no idea how that is humanly possible.

For me, the reverse holds: I probably spend two months of research for every week of actual writing. What dominates my academic life is the time it takes (a) to process the details from academic papers so that I understand how their ideas work, and (b) to build upon those results with my own novel contribution. Getting intuition on novel artificial intelligence concepts takes a considerable amount of mathematical thinking, and getting them to work in practice requires programming skills. Both math and programming fall under the realm of “technical skills.”

Obviously, once I HAVE a research contribution, then I have to “worry” about writing it, but I enjoy writing so it is no big deal.

But again, the research contribution itself must first exist. That’s what frustrates me about much of the academic advice that I see. Yes, it’s easier to tell someone how to write (use this phrase, don’t use this phrase, active instead of passive, blah blah blah), but it would be better to explain the thought process on how to come up with an original research contribution.

I conclude:

I would happily trade away some of my writing ability for a commensurate increase in technical skill.

Again, I am not disregard writing ability, since it is incredibly valuable for many reasons (such as for blogging!!) and more applicable than technical skills in life. However, I believe that the biggest priority for computer science doctoral students should be to focus on technical skills.

# Random Thoughts on Recent News

## Nobel Peace Prize

The winner of the 2017 Nobel Peace Prize is the International Campaign to Abolish Nuclear Weapons (ICAN). The award statement is:

The organization is receiving the award for its work to draw attention to the catastrophic humanitarian consequences of any use of nuclear weapons and for its ground-breaking efforts to achieve a treaty-based prohibition of such weapons.

The first part of this statement — that nuclear weapons have “catastrophic humanitarian consequences” — is obvious and should be universally agreed upon.

Unfortunately, while the second part about creating a treaty-based prohibition sounds great in theory, it would not work in practice. So long as rogue states such as North Korea continue to develop their own nuclear programs, the United States needs to maintain its own stockpile of such weapons. The Wall Street Journal editorial board got it correct when they described the award as the Nobel Alternate Reality prize, and I honestly think they weren’t harsh enough on the Nobel Committee.

This award reminded me about when President Obama was once considering adopting a “No Nuclear First” policy. While I generally supported President Obama, this would have been a mistake. Fortunately, his own administration shot down the idea. Now the next step is to convince politicians such as Senator Dianne Feinstein to steer away from these ideas.

## Harvey Weinstein

As most of us know, Harvey Weinstein systematically assaulted women throughout much of his famed career, and is rightfully being scorned and disgraced. Good riddance.

My first reaction was, wow. How did Weinstein get away with his behavior over all these years? I certainly hope that other men who have done these things face similar consequences and avoid getting away scot-free by paying their way out.

By the way, the fact that Weinstein was a prominent supporter of Democratic causes is, in my opinion, irrelevant. There are plenty of bad men in all political parties. I don’t need to name them — we know who they are. Let’s condemn them and, if they’ve donated money to politicians, to encourage those politicians to redirect that money to organizations that combat sexual assault.

Now the uncomfortable question I face is: are there Harvey Weinsteins in (academic) computer science? I hope not. From my experience, I have never noticed any man (or woman, for that matter) exhibit the kind of behavior as Weinstein. At least that counts for something.

Lastly, I certainly hope I have never been like Weinstein. One of the things I’ve learned from reading Weinstein-like stories is to avoid touching people. I am not much of a “touching” person. One handshake when I meet new people is enough for me! Sure, if other men or women want to hug me, fine, go ahead. I just won’t initiate the hugging, sorry. I am constantly worried that I’ll commit a micro-aggression.

## The NBA and NFL

NBA rookie Lonzo Ball has a huge target on him due to his outspoken Dad, LaVar ball. LaVar has made a few comments that remind me of Weinstein, particular with his criticism of female basketball referees. To be clear, LaVar Ball doesn’t seem remotely as bad as Weinstein, but at least I see the resemblence, if you get what I mean. And I am, quite frankly, annoyed at all the attention he gets.

The good news is that at least he’s there as a Dad and seems extremely supportive of his children. I got a poignant reminder about this from reading this article from The Undefeated about how Scott Brooks wished he had a father. I can certainly see how someone like him would view the situation differently.

In the NFL, Houston Texans owner Robert McNair made an ill-advised comment about calling NFL players “inmates”. Ouch. That was a mistake, and I’m happy he showed remorse and seems to regret his actions. If NFL players protest over these comments, well I can’t blame them. If they want to kneel for the flag, for instance, that’s their First Amendment right to do so and I will support them.

On the same token, the NFL players have America’s attention. Great. Now the next and exponentially harder step is to figure out what to do with that attention. Kneeling for the flag won’t work forever, but I don’t know how else the players should proceed so that their messages and goals have a high probability of seeing reality.

## Quantum Computing

There was a recent article in the Wall Street Journal about the race to create a quantum computer. Even though I have very little intuition on how quantum computers work, the author makes a reasonably compelling case for this to be our next “Manhattan Project”. I would, however, like to raise two points.

First, I reserve the right to dismiss this race as ill-informed if Scott Aaronson says so on his blog. (Professor Aaronson, I am waiting …)

Second, why not consider another project that is also crucial for the United States (and the planet, for that matter)? That would be Energy Independence, or the The Green Revolution, as proposed by authors such as Thomas L. Friedman. Rather than rely on Middle Eastern countries (and indirectly, radical Islam) for our oil, we can instead develop our own. Or even better, we can focus on renewable energy.

I am under no illusions that this will be hard to achieve, both for political reasons and because people don’t like to be pressured to do things that lower their standard of living. For example, I still use my car regularly even though I know that public transportation would be better for the environment.

The good news is that with more people living in cities, it will be easier for the country to use less energy, and that’s why I still have hope. For additional details, I recommend reading Climate of Hope, which was published just a few months ago.

## Dynamic Routing Through Capsules

OK, this isn’t really news that will capture the minds of the general public, but it’s certainly taking the AI-world by storm. Researchers Sara Sabour, Nicholas Frosst, and Geoffrey Hinton finally posted a long-awaited preprint, Dynamic Routing Through Capsules. There is substantial interest in this paper because Hinton has long been brainstorming alternatives to backpropagation for training neural networks. He has also been thinking about dramatically different neural network designs rather than making incremental changes to fully connected or convolutional nets. This is despite how he, perhaps more than anyone else, has been responsible for bringing their good-ness out in the open, revolutionizing the field of AI and, indeed, the world.

I wish I had the time to read the paper and write a detailed read-through blog post, but alas, it came on arXiv the night before I had an interview.

At this point, I think any top-tier conference paper with Hinton’s name attached to it is worth reading. I’m already on-record as saying that I predict Geoffrey Hinton to win the next Turing Award, so I expect that his papers will have extremely high research contribution.

## Surgical and Home Robotics

Lastly but certainly not least, consider checking out some recent research from the Berkeley AUTOLAB, of which I’m a member. The Berkeley AI Research blog recently featured back-to-back posts about imitation learning algorithms with applications to surgical robotics and home robotics, respectively. Recall that I serve on the BAIR Blog editorial board, and in particular, I was in charge for formatting and then pushing these posts live. I hope these two posts are informative to the lay reader interested in AI.

# Learning to Act by Predicting the Future

I first heard about the paper Learning to Act by Predicting the Future after one of the authors, Vladlen Koltun, came to give a highly entertaining talk as part of Berkeley’s Deep Learning seminar course (CS 294-131).

In retrospect, I’m embarrassed it took me this long to find out about the work. It’s research that feels highly insightful and should have been clear to us all along — yet somehow we never saw it until those authors presented it to us. To me, that’s an indicator of high-caliber research.

Others have agreed. Learning to Act by Predicting the Future was accepted as an oral presentation at ICLR 2017, meaning that it was one of the top 15 or so papers. You can check out the favorable reviews on OpenReview. It was also featured on Adrian Colyer’s blog. And of course, it was featured in my Deep Learning class.

So what is the research contribution of the paper? Here’s a key passage in the introduction which explains their framework:

Our approach departs from the reward-based formalization commonly used in RL. Instead of a monolithic state and a scalar reward, we consider a stream of sensory input $\{s_t\}$ and a stream of measurements $\{m_t\}$. The sensory stream is typically high-dimensional and may include the raw visual, auditory, and tactile input. The measurement stream has lower dimensionality and constitutes a set of data that pertain to the agent’s current state.

To be clear, at each time $t$, we get one sensory input and one set of (scalar-valued) measurements, so our observation is $o_t = \langle s_t, m_t \rangle$. Their running test platform in the paper is the first-person shooter Doom environment, so $s_t$ represents images and $m_t$ represents attributes in the game such as health and supply levels.

This is an intuitive difference between $s_t$ and $m_t$. There are, however, two important algorithmic differences:

• Given actions taken by the agent, they attempt to predict $m_t$, hence “predicting the future”. It’s very hard to predict full-blown images, but predicting (much-smaller) measurements shouldn’t be nearly as challenging.

• The measurement vector $m_t$ is used to shape the agent’s goals. They assume the agent wants to maximize

where

Thus, the goal is to maximize this inner product of the future measurements and a parameter vector $g$ weighing the relative importance of each terms. Note that this instantly generalizes the case with a scalar reward signal in MDPs: we’d set the elements of $g$ such that they are $\gamma^0, \gamma^1, \gamma^2, \ldots$, i.e. corresponding to discounted rewards. (I’m assuming that $m_t$ is a scalar here, but this generalizes to the vector case with $f$ a matrix, as we could flatten $f$ and $g$.)

In order to predict $m_t$, they have to train a function to do so, which they parameterize with (you guessed it) deep neural networks. They define the function $F$ as the predictor, with

Thus, given the observation, action, and goal vector parameter, we can predict the resulting measurements, so that during test-time applications, $p_t^a$ is “plugged in” for $f$ and the action which maximizes $u$ is chosen. To make this work mathematically, of course, $f,g,$ and $p_t^a$ must all have the same dimension. And to be clear, even though the reward (as they define it) is a function of $g$, we are not “training” the $g$ parameters but the parameters for $F$.

The parameters of $F$, incidentally, are trained in an unsupervised manner, or using “self-supervision” since the labels can be generated automatically by having the agent wander around in the world and then repeatedly computing the value of the function output at each of those time steps. Then, after some time has passed, we simply minimize the $L_2$ loss. Nice, no humans needed for labeling! When I was reading this, I was reminded of the Q-learning update, since the update rule automatically assumes that the “target” is the usual “reward plus discounted max Q-value” thingy, without human intervention. To further the connection with Q-learning, they use an experience memory in the same way as the DQN algorithm used experience replay (see my earlier blog post about DQN). Another concept that came to mind was Sergey Levine’s excellent paper on learning hand-eye coordination, where he and his collaborators were able to automatically generate labels. I need to figure out how to do stuff like this more often.

Anyway, given that $F$ takes in three inputs, one would intuitively expect that it has three separate input networks and concatenates them at some point. Indeed, that’s what they do in their network, shown below.

After concatenation, the network follows the paradigm of the Dueling DQN architecture by having separate expectation and value (“action”) streams. It might not be clear why this is useful, so if you’re puzzled, I recommend reading the Dueling DQN paper for justification (I need to re-read that as well).

They benchmark their paradigm (called DFP for “Direct Future Prediction”) on Doom with four scenarios of increasing difficulty. The baselines are the well-known DQN and A3C algorithms, along with a relatively obscure “DSR” algorithm (but which used the Doom platform, facilitating comparisons). I’m not sure why they used DQN instead of, say, Double DQN or prioritized variants since those are assumed to be strictly better, but at least they test using A3C which as far as I can tell is on par with the best DQN variants. It’s too bad that OpenAI baselines wasn’t around in time for the authors to use it for this paper.

They say that

We set the temporal offsets $\tau_1, \ldots, \tau_n$ of predicted future measurements to 1, 2, 4, 8, 16, and 32 steps in all experiments. Only the latest three time steps contribute to the objective function, with coefficients $(0.5, 0.5, 1)$.

I think they say this to mean that their goal vector $g$ contains only three nonzero components, corresponding to $\tau_{n-2}, \tau_{n-1}$ and $\tau_n$. But then I’m confused: why do they need to have all the other $\tau_i$ for $1 \le i \le n-3$? What’s also confusing is that for the two complicated environments with ammo, health, and frags, their training is set to maximize a linear combination of those three, with coefficients $(0.5, 0.5, 1.0)$. The same vector is repeated here!

I wish they had expanded upon their discussion since this is new stuff from their paper. Why did they choose this and that value? What is the intuition? I know it’s easy to ask this from a reading/reviewing perspective, but that’s only because the concept is new; for example, they do not need to justify why they chose the dueling-style architecture because they can refer to the Dueling DQN paper.

Regarding experiments, I don’t have much intuition on the vizDoom environments, as I have never used those, but their results look impressive on the two harder scenarios, which also provide more measurements (three instead of one). Their method out-performs sophisticated baselines in various settings, including those from the Visual Doom AI competition in September 2016.

At the end of the experimental section, after a few ablation studies (heh, my favorite!) they convincingly claim that

This supports the intuition that a dense flow of multivariate measurements is a better training signal than a scalar reward.

In my words: dense signals are better than sparse signals. In some cases, sparsity is desirable (e.g. in attention models, we want sparsity to focus on a few important components) but for rewards in reinforcement learning, we definitely need dense signals. Note that getting such signals wouldn’t be possible if the authors kept clinging to the usual MDP formulation. Indeed, Koltun made it a point in his talk to emphasize how he disagreed with the constraints imposed on us by the MDP formulation, with the usual “set of states, actions, rewards […]”. This is one of the things I wish I was better at: identifying certain gaps in assumptions that everyone makes, and trying to figure out where we can improve them.

That’s all I have to say for this paper now. For more details, I would check the paper website. Have fun!

# Thoughts on Dale Carnegie's "How to Win Friends and Influence People"

Last night, I finished reading Dale Carnegie’s book How to Win Friends and Influence People: The Only Book You Need to Lead You to Success. This is the 31st book I’ve read in 2017, and hopefully I will exceed the 38 books I read in 2016.

Carnegie’s book is well-known. It was originally published in 1936 (!!) during the Great Depression, but as the back cover argues, it is “equally valuable during booming economies or hard times.” I read the 1981 edition, which updated some of the original material to make it more applicable to the modern era. Even though it means the book loses its 1936 perspective, it’s probably a good idea to keep it updated to avoid confusing the reader, and Carnegie — who passed away in 1955 — would have wanted it. You can read more about the book’s history on its Wikipedia page.

So, is the book over-hyped, or is it actually insightful and useful? I think the answer is yes to both, but we’ll see what happens in the coming years when I especially try to focus on applying his advice. The benefit of self-help books clearly depends on how well the reader can apply it!

I don’t like books that bombard the reader with hackneyed, too-good-to-be-true advertisements. Carnegie’s book certainly suffers from this, starting from the terrible subtitle (seriously, “The Only Book”??). Now, to be fair, I don’t know if he wrote that subtitle or if it was added by someone later, and if it was 1936, it would have definitely been more original. Certainly in the year 2017, there is no shortage of lousy self-help books.

The good news is that once you get beyond the hyped-up advertising, the actual advice in the book is sound. My summary of it: advice that is obvious, but that we sometimes (often??) forget to follow.

I wrote the book, and yet frequently I find it difficult to apply everything I advocated.

This text appears in the beginning of a book titled “Nine Suggestions to Get the Most Out of This Book”. I am certainly going to be following those suggestions.

The advice he has is split into four rough groups:

• Fundamental Techniques in Handling People
• Six Ways to Make People Like You
• How to Win People to Your Way of Thinking
• Be a Leader: How to Change People Without Giving Offense or Arousing Resentment

Each group is split into several short chapters, ending in a quick one-phrase summary of the advice. Examples range from “Give honest and sincere appreciation” (first group), “smile” (second group), “If you are wrong, admit it quickly and emphatically” (third group), and “Talk about your own mistakes before criticizing the other person” (fourth group). Chapters contain anecdotes of people with various backgrounds. Former U.S. Presidents Washington, Lincoln, and both Roosevelts are featured, but there are also many examples from people leading less glamorous lives. The examples in the book seem reasonable, and I enjoyed reading about them, but I do want to point out the caveat that some of these stories seem way too good to be true.

One class of anecdotes that fits this criteria: when people are able to get others to do what they want without actually bringing it up! For example, suppose you run a business and want to get a stubborn customer to buy your products. You can ask directly and he or she will probably refuse, or you can praise the person, show appreciation, etc., and somehow magically that person will want to buy your stuff?!? Several anecdotes in the book are variants of this concept. I took notes (with a pencil) to highlight and comment in the book as I was reading it, and I frequently wrote “I’m skeptical”. Fortunately, many of the anecdotes are more realistic, and the advice itself is, as I mentioned before, accurate and helpful.

I have always wondered what it must be like to have a “normal” social life. I look at groups of friends going out to meals, parties, and so forth, and I repeatedly wonder:

• How did they first get together?
• What is their secret to liking each other??
• Do I have any ounce of hope of breaking into their social circle???

Consequently, what I most want to get out of the book is based on the second group, how to make people like me.

Unfortunately, I suffer from the social handicap of being deaf. While talking with one person usually isn’t a problem, I can’t follow conversations with noisy backgrounds and/or with many people. Heck, handing a conversation with two other people is often a challenge, and whenever this happens, I constantly fear that my two other “conversationalists” will talk to themselves and leave me out. And how on earth do I possibly network in noise-heavy academic conferences or workshops??? Gaaah.

Fortunately, what I find inspiring about Carnegie’s advice is that it is generic and highly applicable to the vast majority of people, regardless of socioeconomic status, disability condition, racial or ethnic background, and so forth. Obviously, the benefit of applying this advice will vary depending on people’s backgrounds, but for the vast majority of people, there should be some positive, non-zero benefit. That is what really counts.

I will keep How to Win Friends and Influence People on my desk as a constant reminder for me to keep applying these principles. Hopefully a year from now, I can look back and see if I have developed into a better, more fulfilled man.