# All the Books I Read in 2016, Plus My Thoughts

Last year, I wrote a normal-length blog post about my three favorite books that I read that year. For 2016, I thought I would do something a little different: why not list all the books I read this year? That way, I won’t forget what I read, and if anyone complains to me saying “why do young adults nowadays spend time on social media instead of reading books?”, I can simply refer them to my blog. Genius!

I have done precisely that in this lengthy post, which I’ve been updating throughout the year by writing summaries of books as soon as I’ve finished them. I read 38 books this year. (Textbooks, magazines, newspapers, and research papers don’t count.) I read three easy fiction books in January to get warmed up, so the summaries here are for the other 35 books, all non-fiction. Here’s a rough set of categories:

1. Medicine and Surgery (4 books)
2. Work Habits and Technology (4 books)
3. History and Foreign Affairs (5 books)
4. Religion (5 books)
5. Science (4 books)
6. NSA and CIA (3 books)
7. Biographies and Memoirs (3 books)
8. Miscellaneous (7 books)

Books I especially liked are indicated with double asterisks by the title, ** like this **. For the full set of these “reading list” posts, see this page.

## Group 1: Medicine and Surgery

I read all four of Atul Gawande’s books. Gawande is a famous surgeon who also gave the 2012 commencement speech for my alma mater. I think he’s an excellent surgeon-writer. His books contain a healthy amount of technical detail on surgery and medicine without going overboard and discouraging the lay reader like myself. They not only give me a glimpse into what it must be like to work in medicine, but also make me appreciate my relatively good physical health. Some of the stories Gawande describes about his patients made me shudder.

Here are my thoughts on his books:

• Complications: A Surgeon’s Notes on an Imperfect Science, Gawande’s first book and a finalist for the National Book Award, discusses the complications and uncertainty that arise in surgery and medicine. It is divided into three parts: fallibility, mystery, and uncertainty. What happens when doctors make mistakes? And how do people feel about an inexperienced surgeon working on them? (After all, everyone has to start somewhere!) What about cases that simply elude everyone? How about pure judgment calls? Each of the fourteen chapters covers a certain theme and consists of several related stories. It’s a nice portrayal of how medicine confronts the inherent difficulties of its field.

• Better: A Surgeon’s Notes on Performance. Gawande explores different ways to get “better” at surgery and medicine. Like his first book, this one is a collection of loosely related stories. In general, Gawande doesn’t directly propose solutions to various issues; rather, he provides examples of how certain people at the top of their field (e.g., those treating cystic fibrosis) carry out their work, in comparison to “ordinary” practitioners. There are also some interesting but awkward topics: how should physicians handle examinations involving patients’ genitals, and how should doctors discuss their “bell curve” of performance and the possibility that they may be at the lower tail? While the book is a nice collection of stories, the selection and ordering of topics can seem somewhat haphazard. Still, I felt that – like his other books – it gave me a glimpse into what it is like to be a surgeon.

• ** The Checklist Manifesto: How to Get Things Right ** is an enlightening book about the power of checklists. Drawing from a variety of fields, including aviation, construction, and (obviously) surgery, Gawande shows how checklists have increased the safety and efficiency of tasks that most of us take for granted. Yes, he acknowledges that checklists are stupidly simple. But here’s the remarkable thing: why didn’t anyone write about this earlier? And why, as Gawande points out, do people resist utilizing something enormously beneficial with virtually no extra cost? Consequently, I’m now thinking about where checklists might apply to my own life. One qualm I had with this book, however, is that it states a ton of statistics as percentages only (e.g., 22 percent said this …) when we need the raw numbers to understand their significance (e.g., 22 percent of 10,000 people surveyed …). The book is on the short end (fewer than 200 pages), so it wouldn’t hurt to have a little more data.

• ** Being Mortal: Medicine and What Matters in the End **. What is the role of doctors when treating patients who are near the end of their lives? Surgery and medicine often restrict their focus to “fixing stuff”: if a patient has a problem, then doctors should perform the procedure that will prolong the patient’s life the most. But what if that leads to the patient going directly into a nursing home? What if the patient and his/her family are banking on being in the one percent of people who end up significantly better off after a specific procedure? Most of the patient stories in Being Mortal concern cancer, a notoriously difficult disease to treat, and people are not always willing to undergo brutal surgery and chemotherapy. I should point out that this is not a book on assisted suicide; the book dedicates only about four pages to that topic, towards the very end. Rather, this book is about assisted living: how can we balance patients’ desires against their safety and their relatives’ (sometimes conflicting) wishes? I liked this book a little more than The Checklist Manifesto, and it made me think about what I would want to happen to my older relatives. And, of course, to myself, but hopefully I won’t have to worry about that for a long time.

## Group 2: Work Habits and Technology

In this group, I have two books by Georgetown computer science professor Cal Newport. I first became aware of Professor Newport’s blog back in high school when I Googled “how to get good grades in college.” That search led me to a College Confidential thread, which in turn linked to his blog.

Professor Newport is a voracious reader (which I hope to be as well), so he recommends and discusses a number of books on his blog. I chose to read two of them.

• So Good They Can’t Ignore You: Why Skills Trump Passion in the Quest for Work You Love. Here, Newport argues that the oft-cited “follow your passion!” advice is actually terrible, and that people are better served by developing skills that will eventually let them figure out their dream jobs. In short: don’t quit your job now with the fantasy that you’ll enjoy the new thing you’re starting next week – you’ll end up worse off than when you started. It is true that I sometimes feel I may not be in the ideal job for me, but then I think things over and realize that the PhD program I’m in is clearly training me for a lot of jobs. Even if being a professor isn’t in the cards for me, I’m still learning a lot about programming, advanced computer science, teaching, making connections, etc. Thus, this book made me happy that I am spending my twenties in an intellectually stimulating environment where my main responsibility is to learn new things and blog.

• ** Deep Work: Rules for Focused Success in a Distracted World **. This is Professor Newport’s most recent book and arguably his best, which became a Wall Street Journal 2016 business bestseller. His hypothesis – which was obvious to me from the start – is that deep work, a state in which one can focus intensely on a topic and churn out lots of work in a short amount of time, is becoming rarer at exactly the time it’s becoming more valuable. The former is due to distractions from smartphones and social media, and the latter is because in the knowledge economy, companies can pay the best workers remotely rather than hire less-skilled people locally. While reading this book, I kept thinking: come on, why on earth am I not engaging in deep work? There’s a lot of advice here, and there’s no way I can integrate it all, but I’ve been able to take a few steps (e.g., as my Facebook “friends” can attest, I don’t check Facebook frequently, and I don’t have Twitter, Instagram, Snapchat, or whatever is popular nowadays, and I check my smartphone a lot less than most people). I agree with a lot of what Newport says here; it seems like so much common sense. I have to add, it’s also funny that this book got published at a time when a related “Deep” phenomenon called Deep Learning is hot in AI, but that’s a story for another day.

• You Are Not a Gadget: A Manifesto, recommended by Professor Newport, is a 2010 book written by Jaron Lanier, one of the pioneers of virtual reality technology. Lanier worries that our early technological achievements have locked us into suboptimal paths. In addition, he argues that many areas of technology are dehumanizing and favor the crowd or the algorithm over individual human voices. To me, while it’s true that Wikipedia may be dehumanizing in some sense, things were arguably much worse pre-Wikipedia from a “dehumanizing” perspective (NOT a technological perspective). To take one example, as Robert Gordon describes in his fantastic book The Rise and Fall of American Growth, human life before the “great inventions” of 1870-1970 was extremely isolating, particularly for rural families and farmers (and this continued to be the case in that era for many areas of the U.S., particularly the Civil War-ravaged South). If a family is isolated from the rest of the world, that seems pretty dehumanizing to me. Overall, as you might tell, I have mixed feelings about this book and do not revere it as much as Professor Newport does, but it certainly brings up thought-provoking points.

## Group 3: History and Foreign Affairs

Another way for me to describe this category would be “Thomas Friedman and Michael Mandelbaum books” since all of the books in this category are written by them (in fact, one is co-authored by them!). For better or worse, most readers probably know Thomas Friedman from his columns at the New York Times. Michael Mandelbaum is professor emeritus at Johns Hopkins University, specializing in international affairs. I enjoyed these books.

• ** The World is Flat: A Brief History of the Twenty-First Century ** is a long book by Thomas Friedman. I read the third edition, published in 2007; the first two were published in 2005 and 2006. The metaphor “the world is flat” refers to the forces of globalization that have connected peoples and countries together and that empower the individual – particularly the Internet, which lets people have their own voices through blogs. (Contrast this, by the way, with Jaron Lanier’s book above, which argues that the Internet has been a chief cause of de-empowering the individual.) To his credit, Friedman emphasizes that globalization has tradeoffs. A frequent theme is the offshoring of work: companies can improve profits by utilizing cheaper labor in China and India, which benefits American consumers but makes life harder for those former American workers. Globalization has political impacts as well, because it heightens income inequality. Friedman is pro-free trade, but understands that we have to provide the education and opportunity for people to flourish in this world. Right now, he argues, America is not doing anywhere near as well as it should be — a quiet education crisis. This book was definitely a good history lesson for me, and it completely changed my perspective on how the world works.

• The Frugal Superpower: America’s Global Leadership in a Cash-Strapped Era is a short book written in 2010 by Michael Mandelbaum. Its central argument is that in the twentieth century, America effectively served as the world’s government, able to intervene in other countries because it had the money to do so. But after the 2008 financial crisis, a shift emerged in U.S. policy: no longer can we lavishly spend money and intervene almost at will. Mandelbaum, while recognizing that financial crises are hardly optimal, hopes that we can be more circumspect (hence, a “frugal” superpower). We will, however, have problems because of entitlement spending, which naturally implies spending less on military action. Demography is also working against us, but the saving grace is that the three areas of contention (Russia, China, and the Middle East) all have their own problems. For example, China has an even bigger demographic problem due to its former one-child policy. Mandelbaum concludes that one way we can restore the world order is by taxing gasoline, so that (a) we reduce the profits of Middle Eastern countries (and indirectly, I might add, of radical Islam), (b) we provide more money to the federal government for entitlements, and (c) we encourage renewable and cleaner energy, thus helping to combat climate change. This book, with the exception of the gasoline tax suggestion, is largely subsumed by Mandelbaum’s later book, Mission Failure.

• ** That Used to Be Us: How America Fell Behind in the World It Invented and How We Can Come Back **, co-authored by columnist Thomas Friedman and foreign policy professor Michael Mandelbaum, argues that America is failing to meet the four major challenges of today’s world: globalization, the IT revolution, chronic deficits, and increasing energy consumption. The authors look at America’s history and see many examples of how we have overcome similar challenges. Unfortunately, they worry, “that used to be us.” Today, with our political gridlock, lack of education funding, and worldwide competition from Asia, especially from China (which they almost view as the most powerful country of the 21st century), it is uncertain whether we will resolve these problems before the market or Mother Nature brutally forces changes down our throats. One of their main solutions? A third-party presidential candidate on the “radical center” who, while not necessarily winning the election, would push the two parties to focus on “radical center” issues, as H. Ross Perot did in the 1992 election with the deficit. I really liked this book, and I agree with Bill Gates’ assessment that the one-word review of this book is: “fascinating.” Not everyone would like it, however: the Wall Street Journal published a negative review, taking a traditionally conservative stance in criticizing the authors for proposing more government as the solution to fixing America. But then I would like to ask the reviewers: what is their proposal to combat climate change?

• ** Mission Failure: America and the World in the Post-Cold War Era ** is Michael Mandelbaum’s fifteenth and most recent book, published in 2016 and dubbed by Thomas Friedman “a book that Barack Obama and Donald Trump would both enjoy”. The focus is on the period from 1991 — the Cold War’s end — to 2015. In this period, spanning four American presidents (George H.W. Bush through Barack Obama), Mandelbaum argues that we spent our time, intentionally and unintentionally, engaging in “regime change” in other countries. The goal? To convert them into Western-style countries supporting peace, democracy, and free markets. Those three features are the foundations of Mandelbaum’s 2002 book The Ideas that Conquered the World, one of my favorite books and one I blogged about last year. As examples, the Clinton administration engaged in substantial humanitarian intervention in countries of little strategic significance to the U.S., such as Haiti, Somalia, Bosnia, and Kosovo. The Bush 43 administration engaged in nation building in Iraq and Afghanistan, though in both cases that was not the original goal, since each began as a war. In addition, for many years we have tried to convert Russia, China, and several Middle Eastern countries (especially Iran and the Palestinian territories) to be more like ourselves. But these missions all failed. Why? Probably the biggest reason is culture; in many countries there is already so much sectarian violence that unity based on abstract concepts like the “rule of law” is not possible. In other cases there is resentment of America, as in Russia over NATO expansion. I found the history behind all this fascinating. Of course, I am also concerned. After all, this “era” of American foreign policy abruptly ended by 2015, because Russia and China have now become powerful enough to upset the stability of the global order.

## Group 4: Religion

Here are some religion-related books. Specifically, these are from the non-religious or atheist perspective, as one might be able to tell given Sam Harris’ frequent appearance here. Don’t worry, I’m going to explore more pro-religion books next year, and my parents have several candidates I can read.

• Moral Minority: Our Skeptical Founding Fathers, by Brooke Allen, argues that the six men she considers the founding fathers of America (Benjamin Franklin, George Washington, Thomas Jefferson, John Adams, James Madison, and Alexander Hamilton) were not the devout Christians many claim, but highly skeptical men influenced not by religion but by John Locke and Enlightenment thought. Allen further claims that, thanks to the efforts of these men, the United States was not founded on Christian principles. (I find it odd that we’re often considered a “Christian nation” when the First Amendment explicitly provides for freedom of religion — and don’t get me started on “Judeo-Christian”.) The book has one chapter for each of these men, and uses their actual writing to show that they were skeptical of religion and preferred to keep it out of politics. I agree with the main ideas of Moral Minority, but rather surprisingly (to me), I didn’t enjoy the book. In part this is due to the writing style; 18th-century writing is a lot harder to understand than modern writing. I also think the book presents a lot of facts but doesn’t clearly unify or connect them.

• ** The End of Faith: Religion, Terror, and the Future of Reason ** is Sam Harris’s first book, written in 2004 while he was still a Ph.D. student in neuroscience. Its impact was remarkable: it propelled him to fame and played a critical role in forming the New Atheist movement. Harris began writing on September 12, 2001, when the ashes of 9/11 were still fresh, and the essay he produced in those weeks of collective grief became the basis for this book. The book is similar to Richard Dawkins’ The God Delusion (blogged about here) in many ways, in that it is highly critical of religion, though to be clear: it criticizes adherence to any particular dogma, not necessarily religion itself. In fact, Harris argues near the end of the book (something I did not learn from reading Dawkins) that Buddhism as a religion is far more of an empirical science than Christianity, Islam, or Judaism. While Dawkins focuses more on the Bible and Koran and argues for evolution over creationism and intelligent design, Harris takes more of a “mind and consciousness” route, which suits his philosophy and neuroscience background. One notable aspect of the book is that he is especially critical of Islam; this should not surprise readers, since he argues that religion must be the driving force behind suicide bombing and the like. While I recognize that there are many controversial aspects of Sam Harris and his books, I personally liked this book a lot and view Harris highly. I agree with him that any discussion of religion must begin with the thousands of people who lost their lives on September 11, 2001.

• The Moral Landscape: How Science Can Determine Human Values, a 2010 book by Sam Harris, argues that people are mistaken about moral values. He urges us to think of improving “well-being” as climbing toward a peak in a hypothetical, non-convex (to use a term from math) “moral landscape”. This book, while not as anti-religion as his others, dismisses religion as something that makes people too rigid to change and creates friction over “distractions” (social issues such as gay marriage) when we really should be focusing on terrorism, climate change, and – of course – how best to advance the collective well-being of humans. I was reminded of Brooke Allen’s Moral Minority: I can find myself agreeing with the general idea, but the argument felt somewhat haphazard, and it took me some re-reads of certain paragraphs to get his points. Honestly, to me his strongest claim is that other sciences are also vague and admit subjective answers (e.g., health and economics), so morality should not be held to a double standard. Note that he has an interesting discussion of this book on his blog, where he challenged readers to refute his thesis.

• Waking Up: A Guide to Spirituality Without Religion is a 2014 book, also by Sam Harris, that delves into rather unusual terrain for atheists: spirituality. I read this book for two reasons: one is that I know Sam Harris’ work and admire it very much, and the other is that I have been looking for spirituality or “meditation”-related material for a while to help decrease stress. This book, however, is less a “guide to spirituality” than its title suggests, and more a discussion of the mind, the self, and consciousness. I found that really illuminating, albeit considerably difficult to understand. I felt lost many times in the book and had to reread certain sections, trying to gather every ounce of understanding I could. As an atheist, I acknowledge that there are very real benefits to spirituality and meditation, and that these can be had without religion. I want to pursue some of these tactics, but preferably those that don’t involve drugs, which Harris talks about near the end of the book. Ultimately, I was disappointed not because it is a bad book, but because I was expecting to get something different out of it.

• Islam and the Future of Tolerance: A Dialogue, yet another book by Sam Harris (co-authored with Maajid Nawaz in 2015), is a conversation between two men who come from vastly different backgrounds. Sam Harris is a well-known atheist, and I frequently read his blog and books (as this blog post makes clear!). Nawaz has a remarkable personal story: he was born in England, became radicalized, and then escaped Islamism to become a liberal Muslim activist. Harris and Nawaz discuss Islam and the prospects of reforming the faith. Nawaz urges pluralism, which he believes will lead towards secularism, while Harris argues that we need to view Islam directly as the problem, because an accurate reading of scripture indicates that it is among the more evil of religions. I was impressed that two men who no doubt have disagreements could have an honest dialogue, even if the gap between them isn’t that great (Nawaz doesn’t strike me as a super-religious person). But we need a starting point if we are going to counter radical Islam. I wrote a blog post about this book earlier this year in the wake of the Orlando massacre.

## Group 5: Science

Needless to say, given my “job” as a computer science graduate student, I have an interest in excellent (popular) science books. Here are four of them. The first three are physics/earth/biology-related, with nontrivial overlap in their topics. The last one is a machine learning book.

• What Evolution Is, by famed biologist Ernst Mayr, is a 2001 book about evolution aimed at the educated non-biologist. It is almost like a textbook, but simpler (the book itself is barely 300 pages). Mayr argues that evolution is not merely a theory but a fact, based on the evidence from the fossil record. Mayr takes pains to show that this did not come out of creationist-like reasoning that simply declares “this is a fact”; rather, the current theory of evolution based on natural selection had several competing theories when biology was in its early stages as a subject. Gradually, as the evidence accumulated, biologists more or less came to a consensus on the current “theory”. This is an extraordinary example of science in action. Ultimately, I don’t quite go as far as Jared Diamond (author of Guns, Germs, and Steel), who says this book is the second-greatest thing that could happen, the first being if Darwin himself (the most famous biologist of all time) returned to write a book about this. But the book is educational since it’s written like a pseudo-textbook. It’s somewhat dry to read, but with enough effort, you can understand the main ideas.

• The Sixth Extinction: An Unnatural History, by Elizabeth Kolbert, is an engaging, fast-paced book arguing that we are in the middle of the next mass extinction — the sixth in the earth’s history. (The fifth, of course, was the one that wiped out the dinosaurs, a key topic of Lisa Randall’s book, which I’ll describe next!) The Sixth Extinction consists of a series of short stories documenting Kolbert’s adventures across the world, where she meets lots of people and explores the outdoors. Some conditions are dangerous, and others are more amusing, as when she recounts seeing frogs having sex. But where there is humor, there is also sadness, because the story of extinction is not pleasant. Kolbert describes how biologists have failed to find once-abundant frogs, how zoologists have had to confront the reality of coaxing the two remaining fertile members of a species to produce offspring, and how ocean acidification and human-accelerated climate change will accelerate extinction. This book, however, is not about global warming; that subject plays only a minor role here. Even discounting global warming, there is irrefutable evidence that humans have directly caused the extinction of other species, whether through construction projects, deforestation, transporting pests across the world, or other actions. After reading this book, I remain concerned about the future of the world and the fate of our species.

• ** Dark Matter and the Dinosaurs: The Astounding Interconnectedness of the Universe **, by famed cosmologist and physicist Lisa Randall, is an exciting book weaving together astronomy, particle physics, earth science, and biology to answer a profoundly interesting question: what caused the dinosaurs to die? No, I don’t mean the meteor that hit the earth. We know all that, and Randall briefly describes the interesting history of how scientists arrived at that conclusion. Randall’s book is instead about what caused the meteor to hit the earth. That’s a much more challenging question, and for someone like me without formal astronomy education, I had no idea how people could answer that kind of question given the immense complexity and size of the universe. Fortunately, Randall explains the astronomy and particle physics basics needed to understand some of this material. She proposes the new, very preliminary hypothesis that a disc of dark matter caused the meteor to hit the earth: the sun and planets periodically move up and down through the galactic plane in unison, and when the solar system passes through the dark disc, gravitational perturbations can dislodge an object (perhaps from the Oort cloud, something cool I learned) from its orbit and send it toward earth. After reading Dark Matter and the Dinosaurs, I can no longer view the universe the same way. It is an exciting world we live in, and though I will not be part of the physics/cosmology research exploring Randall’s hypothesis, I eagerly await what they find and hope that Randall will write another book in five years describing her research progress.

• ** The Master Algorithm: How the Quest for the Ultimate Learning Machine will Remake Our World ** by Pedro Domingos is one of the few popular science books on machine learning, so naturally, I had to read it. It describes the history of machine learning as five tribes: symbolists, connectionists, analogists, Bayesians, and evolutionaries. This is what I wanted to know, so I’m happy he included this in his book. Near the end of the book, I felt like Domingos went a bit overboard with storytelling, but it’s probably better than the alternative of having more math. Certainly I would prefer more math, but most readers probably don’t want that, and we need ways to teach machine learning to a broad audience without much math. I have covered my thoughts on this book more extensively in this blog post. It’s an interesting book, to say the least, and I wish there were more books about this general area.

## Group 6: NSA and CIA

I maintain an interest in the NSA and the CIA. I chose three related books to read, and I liked all of them, despite the fact that two of them are ostensibly on opposite sides of a contentious political issue (the NSA’s collection of bulk phone metadata). The third is an espionage thriller … which happens to be true.

• ** No Place to Hide: Edward Snowden, the NSA, and the U.S. Surveillance State ** is a 2014 book by journalist Glenn Greenwald, one of the people whom Edward Snowden first contacted about releasing classified documents on the NSA’s phone metadata collection. The first part of the book is quite riveting, describing how Snowden first attempted to contact Greenwald (via PGP email, which Greenwald never managed to set up) and how Greenwald decided to go to Hong Kong to meet Snowden. Greenwald then takes Snowden’s side regarding the NSA’s actions, arguing that Snowden is right about what the NSA does (as in Snowden’s famous “I, sitting at my desk, could wiretap you […] if I had a personal email” phrase), and displays a number of leaked documents to support his arguments. Next, Greenwald argues that government spying is bad: privacy is a fundamental human right, and government spying encourages conformity and suppresses valid dissent. One passage that particularly struck me was when Greenwald said that the alternative to mass surveillance is not no surveillance, but targeted surveillance. That is a lesson that might be useful to many policymakers today.

• ** Playing to the Edge: American Intelligence in the Age of Terror ** is a 2016 book by Michael V. Hayden, the only person to have directed both the National Security Agency and the CIA. It is meant to be “an unapologetic insider’s perspective” on these organizations in a time of terror, but also a time of criticism from a general public that does not understand the intricacies of what these agencies do and what they are legally bound to do. Mr. Hayden frequently mentions that the NSA and CIA are unfairly asked to do more when the public feels unsafe, and then asked to do less when the public feels safe. Mr. Hayden discusses his time talking to Congress, the POTUS (usually Bush 43, though he was also Obama’s CIA chief for three weeks), and other politicians in an attempt to keep them as informed as possible, which nonetheless did not seem to satisfy the many libertarians and others who oppose aspects of the NSA’s and CIA’s work. I found this book fascinating mostly for the insider’s perspective on what it’s like to manage such critical government organizations, and for the careful considerations these organizations make when pursuing sensitive strategies (e.g., the metadata collection). I wish there were other secrets he could have discussed, but I understand if that is not possible. One possible weakness is that it reads more like a biography/memoir with loosely related chapters, and doesn’t do enough (in my opinion, if that was the goal) to directly counter people like Glenn Greenwald and others who oppose the NSA’s bulk collection of phone records.

• ** The Billion-Dollar Spy: A True Story of Cold War Espionage and Betrayal ** is a 2015 book by David Hoffman. The book’s cover flaps describe it as “a brilliant feat of reporting which unfolds like an espionage thriller”, and I think that is a fair description. I thoroughly enjoyed the book’s details on Cold War espionage between the US and the Soviet Union (a topic getting more attention nowadays due to Obama’s much-needed sanctions on Russia). It emphasizes how U.S. agents had to carefully communicate with Soviet engineer-turned-spy Adolf Tolkachev deep in the heart of Moscow … just several blocks away from the feared KGB. They managed to do so for several years, continually meeting Tolkachev, who handed over tens of thousands of secret documents. The Air Force said the material was worth “roughly in the billions of dollars range”. Wow. But then, in 1985, the operation ended due to a shocking betrayal. There are a couple of downsides to this book. The first is that the “betrayal” is obvious once the appropriate character is introduced; the second is that the book’s nice pictures of the characters and settings should have been placed at the end of the book (not the middle!), because they give away the ending. The first is a limitation of the story, though, and the alternative would have been to make the book longer and introduce more characters early. I still greatly enjoyed this riveting and poignant tale of Tolkachev.

## Group 7: Biographies and Memoirs

Here are some biographies and/or memoirs, with quite different themes. Actually, Michael Hayden’s book (from the previous group) would also qualify as a memoir.

• What I Talk About When I Talk About Running is the first book by Haruki Murakami that I’ve read. Murakami is mostly known as a prominent fiction writer, both in Japan and abroad, but I decided to read this book since I was in the middle of a personal “Running Renaissance” (e.g., see this blog post about running in the Berkeley Marina) and wanted to see his take. This short memoir collects the notes (most of them from 2005 and 2006) that Murakami compiled while training for the New York marathon. Along the way, we learn that Murakami ran a bar in Japan until his thirties, but then started writing and struck gold there. He then took up running to help him physically manage the strain of writing novels. Murakami describes training for and running marathons once a year; he has even run an ultramarathon at Lake Saroma, which is 62 miles long. This book is probably an accurate survey of what Murakami thinks, though I think it suffers a lot from clichés: hard work can beat talent, most people need to work hard, and so forth. It’s a short book, so even if one doesn’t like the concepts, reading it from start to finish does not take much time.

• Between the World and Me by Ta-Nehisi Coates, an author (and MacArthur Fellow) who I was previously aware of due to his 2014 article The Case for Reparations. This book won the 2015 National Book Award, so I was looking forward to reading it. The best way to describe this book is as a cross between a long letter and a poem, which Coates addresses to his son. He argues that, as blacks, they will have to be extra careful about their bodies in a world where we have recently seen many police officers shooting black men and boys. (Tragically, the manuscript must have been finished before the Charleston shootings, which certainly would have made it in otherwise.) One term he uses a lot is “The Dream”, but paradoxically, it doesn’t seem to refer to a dream for blacks (as in Martin Luther King Jr.’s kind of dream) but to that of whites who are attempting to define themselves. It’s true that this book helped me understand the perspective of being black, and it obviously made me appreciate not being black7. I think Coates could have focused more on the shooting cases where the officers were clearly guilty or where video evidence suggested as much, because in some cases police officers are justified in using force. Whether they were in those well-known cases is not my place to answer, and this could be a source of fair criticism of the book. As for the title, I think it represents how there is a “real” world, one for whites, and “his world” of blacks.

• Sisters in Law: How Sandra Day O’Connor and Ruth Bader Ginsburg Went to the Supreme Court and Changed the World is a 2015 dual biography by lawyer Linda Hirshman. She gives an overview of the Justices’ lives, starting from childhood, then their difficulty establishing legal careers, and of course their time as Supreme Court justices. The book is educational for those who want to know about the politics and intricacies of the Supreme Court. I certainly count as one of those, so this book was a nice match for me: it was just technical and sophisticated enough that I could follow it (along with some Wikipedia checks to, for instance, understand what it really means to clerk for a judge or to sit on the United States Court of Appeals). I also learned a lot about how the court works, how justices have to strategically frame their arguments to get a majority, and why it matters who writes the opinions. The book covers Ginsburg’s rise, then O’Connor’s, then the two of them together later. Towards the end of the book, Hirshman starts interjecting her personal opinions, as well as her attacks on Justice Anthony Kennedy. Hirshman doesn’t shy away from inserting her liberal beliefs, so this may discomfort diehard conservatives (but not moderate conservatives, ironically those like O’Connor), and I wonder if she would have written it differently had she waited until Kennedy wrote the majority opinion that legalized same-sex marriage. I enjoyed learning more about the relationships between the Supreme Court justices and how they work, and would like to see more of that if the book were extended. Also see the book’s review on the “nonpartisan” SCOTUS blog, which I mostly agree with.

## Group 8: Miscellaneous

These are some miscellaneous books, which don’t seem to fit in the above groups. I’ve arranged them by publication date.

• The Swerve: How the World Became Modern is a 2011 book by Harvard professor Stephen Greenblatt that won the 2011 National Book Award and the 2012 Pulitzer Prize (geez …). It’s the story of how Poggio Bracciolini, an Italian scholar and book-hunter of the 14th and 15th centuries, was able to track down Lucretius’ On the Nature of Things, a dramatic, profound poem that had been lost for well over a thousand years and (incredibly) describes things that we take for granted today, especially the atomic nature of the world. The book chronicles Poggio’s journey to get the manuscript, following his life as a papal secretary and, of course, the trip that led him to the long-lost text. Note that this book is technically non-fiction, but given that this all happened in the 1400s, I think some of its content is bound to be the result of artistic license. Greenblatt argues that the poem “revolutionized our understanding of the world we live in today.” I thought this book was interesting, though I probably wouldn’t go on record saying it was as great as other books I’ve read, such as Guns, Germs, & Steel. One thing that might not suit certain readers is the anti-religious stance of On the Nature of Things, which argues that there is no afterlife and subscribes to the Epicurean philosophy that life should be lived for enjoyment first. For instance, see Jim Hinch’s criticism of the book.

• David and Goliath: Underdogs, Misfits, and the Art of Battling Giants is another book (from 2013) by the famous author Malcolm Gladwell, who I regard as one of the finest authors at explaining how the world works. I thoroughly enjoyed reading The Tipping Point, Blink, and Outliers in high school. This book takes a new perspective on underdogs versus favorites. Why are we surprised when the underdogs win? What if we should expect underdogs to win? Gladwell discusses how people with disabilities or traumatic experiences, while generally worse off than others, have a chance of overcoming them and emerging stronger than ever. For instance, he cites David Boies, a dyslexic lawyer who countered his disability with a far better ability to listen, and who is now one of the top trial lawyers in the country. Gladwell also raises the interesting question of whether it is better to be a big fish in a small pond or a little fish in a large pond. The first two parts are my favorite; the third part is a bit drier and mostly discusses how people in power have to consider carefully what happens to those under them (e.g., police officers versus civilians, the United States versus (North) Vietnam, etc.). Overall, the book, especially the first two parts, is thought-provoking, and I agree with a lot of what Gladwell says. I do, however, have to emphasize a criticism of the book that I’ve seen raised elsewhere: when students at schools of vastly different prestige study the “same” subject, they are not studying the same concepts, looking at the same notes, etc. I can say this with confidence after studying the material from the artificial intelligence and machine learning courses at Williams and at Berkeley8. The Williams courses are noticeably easier and cover less material than the Berkeley counterparts. And Williams is a very strong school in its own right! The gap would therefore be even wider for other pairs of schools.

• Kill Anything That Moves: The Real American War in Vietnam, a 2013 history book by Nicholas Turse, attempts to account for the atrocities committed by US soldiers in Vietnam. Most Americans, myself included (in 9th grade), were taught that the well-known My Lai massacre was just an anomaly, one mistake in an otherwise “appropriate” war. Turse refutes this by discussing how American troops were trained to hate the Vietnamese (including the South Vietnamese, ostensibly their allies) and how they committed crimes such as torture, rape, and murder (to inflate body counts). This was not the work of a few low-level soldiers; it extended to people far higher up the chain of command. Turse covers Operation Speedy Express, which he calls “a My Lai a month” but which has not gotten as much attention because of efforts by the US military to cover up the details. Yes, this book is frustrating to read, and it is not designed to be enjoyable. I don’t know how one could be happy reading about what our military did. I already knew that the Vietnam War was a mistake, just like the Iraq war, and though some reviews of Turse’s book argue that it doesn’t tell the whole truth, they can’t refute that American soldiers killed many innocent Vietnamese civilians. I have no doubt that if Vietnam had more military might, population, and power, it never would have restored diplomatic relations with the United States. Let’s not kid ourselves: the reason we’re “allied” with them today is the power imbalance (we’re “number one”) and because we “need them” to counter China.

• Zero to One: Notes on Startups, or How to Build the Future is a 2014 book by Peter Thiel consisting of notes from a class on startups that Thiel taught at Stanford. How do startups become successful, and what kind of people, attitude, and circumstances are necessary for them to flourish and ultimately change the world? Thiel laments that America is in a phase of “indefinite optimism” instead of the “definite optimism” of the 60s, when we landed a man on the moon9. Nowadays, people who want money but don’t know how to create it become lawyers and investment bankers, a problem Thiel says is exacerbated by a culture that encourages us to incrementally improve a variety of skills and focus on getting top grades and so forth (though Thiel was like this in his early life!). The entrepreneurs are the ones who can generate wealth, and in order to succeed, they have to go “from zero to one”. That means being like Google, which had no comparable peer in search. Other companies go from 1 to n, recreating what we already know; it’s far harder to go from 0 to 1. This book is not a guide to doing that; such a guide cannot exist. It’s just a set of notes, but an educational one. I mostly enjoyed it, though almost all of it relates to technology startups. As Thiel mentions, you can’t really apply startup analysis to the restaurant industry, so this book caters to a specific group of people and may frustrate those outside the field.

• Six Amendments: How and Why We Should Change the Constitution, by retired Supreme Court Justice John Paul Stevens, is a 2014 book in which Stevens concisely describes six ways the US Constitution should be amended. The reason for amendment, as he says, is that “rules crafted by a slim majority of the members of the Supreme Court have had such a profound and unfortunate impact on our basic law that resort to the process of amendment is warranted.” (The use of amendment has declined in recent years; the last was the 27th Amendment in 1992.) Prominent examples Stevens cites are political gerrymandering, the death penalty, and gun control. Some of the book is written in near-legalese, but fortunately I was able to follow most of the arguments. I can definitely see why he now wants to abolish the death penalty (by simply adding “such as the death penalty” to the 8th Amendment). Granted, diehard conservatives obviously won’t like this book, but Stevens simply aims to state what he thinks must be revised in the Constitution, and I agree in many respects.

• Who Rules the World? is the latest book by Noam Chomsky, possibly the most cited academic in the history of the world (his Google Scholar account showed a ridiculous 306,000 citations as of May 2016!). Chomsky is provocative in his writing here, though I suppose I should not have been surprised after reading his Wikipedia page. His main argument is that the West (read: the United States) thinks it rules the world and can run it the way it wants. America is therefore the arrogant bully who wants things run his way and manipulates the media in doing so. As a side effect, America actually hinders progress towards some of its own goals. Take the “Iranian nuclear threat”, for instance. America and the West (myself included) view Iran as one of the gravest threats to the world, but this is not the case in the Arab world or elsewhere. The Iranian threat could have been resolved many years ago, Chomsky argues, if we had not invaded Iraq or invited Iranians to study nuclear engineering at his institution (MIT). Also consider the ongoing Israel-Palestine conflict. The US argues that the PLO does not concede anything in negotiations (in fact, Mandelbaum argues just this in Mission Failure, described earlier), but Chomsky counters that it is Israel and the US who are blocking progress. Then there is the threat of climate change, which along with nuclear warfare is one of the two things that most worry Chomsky. And, of course, climate change progress is stalling due to the US (more accurately, Republicans, who Chomsky accuses of no longer being a political party). Throughout the book, Chomsky criticizes Republicans and also criticizes Democrats for veering to the right. I can definitely see why some people will take issue with Chomsky’s ideas, but this book helped me take a more critical view of how I understand US domestic and foreign policy.

Whew! That’s the long list of summaries of the books I read in 2016. I will now get started on next year’s books, once I can find the time …

1. They were “thriller” type books, but the themes and plots were shallow. If you want to read an excellent thriller book, I think Battle Royale reigns supreme. Don’t get me started about The Hunger Games — it’s almost a complete rip-off of the decade-older Battle Royale.

2. I’m currently reading his book, but alas, as much as I would have liked to finish it before 2017, the book is dense and over 700 pages, and I will not finish in time.

3. Like me, he’s probably been reading too much of The Wall Street Journal.

4. It looks like I better get around to writing a book. “Fortunately” it appears that I will be a PhD student for a long time.

5. Interestingly enough, she was a visiting professor at Williams College for 2015-2016 (she might still be at Williams, I’m not sure).

6. Well, OK he did mention Glenn Greenwald a few times. I remember Hayden describing him as “pretty much anti-everything”.

7. I have a pet peeve about race, however: why can’t we call mixed race people by the proper terminology: mixed race? Or, possibly better, we can clearly list the most accurate approximation of race we have. For me, when describing my race, people should state: 50 percent White, 50 percent Asian. For Barack Obama, people should state: 50 percent White, 50 percent Black. And so on. These aren’t true percentages, of course, since all humans originate from the same ancestor, but they’re as close as we can get. Sometimes I wonder if mixed race people exist when I read countless statistics saying “X-Y percent white-black” with X+Y=100 and seemingly no room for others.

8. Currently, at Williams, Artificial Intelligence and Machine Learning are numbered CS 373 and CS 374, respectively, while the Berkeley counterparts are CS 188 and CS 189.

9. Now that I think about it, this may have been a reason why Thiel supported Donald Trump’s campaign, going so far as to talk at the Republican National Convention. This book was written before Donald Trump’s campaign, though. Thiel, as a Libertarian, was supporting Rand Paul beforehand.

10. I do not think David Brooks is really that conservative. Perhaps he enjoys being called “liberals’ favorite conservative.”

# Review of Algorithmic Human-Robot Interaction (CS 294-115) at Berkeley

Update June 1, 2019: the class number has changed to CS 287H.

In addition to taking STAT 210A this semester, I took CS 294-115, Algorithmic Human-Robot Interaction, taught by Professor Anca Dragan. I will first provide a summary of what we did in class, roughly in chronological order. Then I would like to discuss what could be improved for future iterations.

The first day featured, as usual, the enrollment surge, which can be either a good thing or a bad thing. For finding a chair in the classroom, it’s a bad thing. I arrived a few minutes before class started (which is actually late for the first day) and found a packed room, but fortunately my two sign language interpreters had saved a seat for me.

I felt bad about that. If I’m the last to arrive and all the chairs are taken, I have no problem sitting on the floor or standing up. It’s OK. The number of students eventually dropped to 34 by the end of the course, but for the first few weeks, we had a lot of students sitting on the floor or standing by the door. The two sign language interpreters took up two seats, so I worried that I was inconveniencing the other students. Fortunately, no one seemed to be angry, and Professor Dragan seemed to be OK with it. In fact, at the start of the semester, she sent me an email asking how the course was going, whether she was talking too fast, and so on. I remember two other professors contacting me that way: Sarah Jacobson for Microeconomics and Wendy Wang for Bayesian Statistics, and I like it when professors do that. I responded saying that she was doing just fine and that any limitation in my ability to understand the material was mostly out of her control.

In CS 294-115, much of the first half of the semester consisted of lectures on standard robotics topics, particularly motion planning and trajectory optimization. This is also an HCI course, and we had lectures on experimental design and learning from demonstrations (which are often human-led). The most concise way to describe this class is probably a cross between CS 287, Advanced Robotics, and CS 260, the Human-Computer Interaction course at Berkeley. I think a lot of computer science PhD students took CS 294-115 to satisfy their PhD breadth requirements. I would, of course, take it even if it didn’t satisfy my requirements.

As I quickly learned, one of the most controversial aspects of CS 294-115 was the quizzes. Every few classes, Professor Dragan would set aside the first 5-10 minutes for a quiz, which typically asked basic questions about topics covered in the previous lectures. Each quiz was graded on a three-point scale: minus, check, and plus. I earned one plus and three checks on the four “normal” quizzes we had.

For one of the quizzes, the class as a whole did poorly (roughly a third minus, a third check, and a third plus). The likely reason is that the question was about the reading for that class, specifically the Maximum-Entropy Inverse Reinforcement Learning paper. Before the quiz, Professor Dragan smiled and said something along the lines of: “if you didn’t do the reading, try to say anything relevant.” Gee, it looks like the do-your-readings rule doesn’t even apply to Berkeley graduate courses!

We had a make-up quiz, but I still only earned a check. Ugh.

Regardless of whether this class turns into a seminar or a lecture (which I’ll discuss later in this article), I personally think the quizzes should be abolished. Professor Dragan also commented in class that some students thought quizzes did not belong in a class like this. We’re “not in high school,” as some succinctly put it.

As one can tell from reading the syllabus online, aside from the lectures, we also had a lot of classes devoted to student presentations about research papers. For each paper, we had a PRO and a CON presenter. Their names are self-explanatory: the PRO presenter talks about the paper as if he/she had written it, while the CON presenter provides constructive criticism. I liked this part of the class, though there are some inherent weaknesses, such as scalability.

I had to wait a while; it wasn’t until November 15 that I got to present. I was assigned to be the CON presenter for a very interesting (and award-winning!) paper, Asking for Help Using Inverse Semantics.

In every talk, I always need to think of several jokes to say. The logic is as follows: I am at the bottom of the social hierarchy when it comes to anything related to popularity. Thus, when I’m giving a talk to a class or some audience, this is one of the few times where people are actually going to bother to listen to what I have to say. In other words, I really can’t afford to waste my chance to give a positive impression. One way to make a good impression is to give a talk about something really positive, such as if I came up with a really new algorithm that people will love. Unfortunately, aside from the really top-tier stuff, I think it’s hard to distinguish oneself with the content of the talk (especially if it’s not about work I did!) so the next best thing that really helps is being reasonably funny.

Thus, I came up with the following tentative plan: since it’s election season, let’s have “Anca for President”! I would include an image of a bunch of faces going left to right, in order of how much respect I have for the person. Clearly, Professor Dragan would be at the rightmost spot (most respect), and someone like Putin would be at the leftmost spot. I think if I had shown something like that in class, it would have elicited considerable laughter and sparked cries of “Anca for President!”, even though she’s not constitutionally eligible. Unfortunately, the paper I presented didn’t let that joke fit into the talk well. Combined with the fact that the candidate everyone opposed won the election … never mind.

Unfortunately I was scheduled as the fourth presenter that day, which is easily the worst slot to be in for this class. As luck would have it, the other three presenters went over their time (I sarcastically thought: “big surprise”), meaning that I got to be the first presenter during the following class.

The class after my actual presentation was the one before Thanksgiving, and it was a special one where Professor Dragan ran an experiment. Rather than have the usual PRO/CON presenters, we would instead split into small groups for discussion. You can imagine how much I “liked” that. To the surprise of no one, my sign language interpreters could not understand or follow the discussion well, and neither could I. But maybe I’m just a lonely Luddite. It seemed like the other students enjoyed that class.

After the Thanksgiving “break”, we had class presentations during the final week. I’m not sure if there’s something about the aura in our class, but there’s so much more humor compared to the Deep Neural Networks presentations from CS 294-129. Seemingly everyone had a joke or did something that caused laughter. Here are a few that I remember:

• Explicitly saying a joke. These are the best kind and the ones I try to do every time, for reasons described earlier.
• Showing a funny video.
• Performing a funny gesture.
• Complaining about the 4-minute time limit and watching as Professor Dragan sneaked closer and closer to the presenter to cut off the talk.
• Having multiple phones time the talks (Professor Dragan’s and the presenter’s) and having them both ring after 4 minutes.
• Grabbing Professor Dragan’s laptop after the talk was over, thinking it was theirs.

The laughter was contagious, and this is definitely something positive about the class. My project presentation, incidentally, was about using human gameplay to boost Atari game-playing agents. I have some code that I’ll hopefully release soon.

After the project presentations, we had to work on our final reports (5-page maximum). Here’s a tip for those who need more space, which should be everybody: use the IEEE conference paper style. The font size is small, and once you adjust the LaTeX formatting to use the full margins, you can pack a lot of information into five pages.
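A minimal sketch of the setup I mean, using the standard `IEEEtran` class that IEEE distributes (the title and author here are just placeholders):

```latex
\documentclass[conference]{IEEEtran}
\usepackage{amsmath}  % handy for any equations in the report

\title{Final Project Report}  % placeholder title
\author{Your Name Here}       % placeholder author

\begin{document}
\maketitle

\section{Introduction}
% IEEEtran's two-column layout and compact font mean a five-page
% report holds far more content than the default article class.
Report text goes here.

\end{document}
```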

All right, that was the class, but now what about some feedback? Professor Dragan said she plans to teach this class again under the new “CS 287H” label. What could be better for future semesters? In my opinion, the most important question is whether the class should be structured as a seminar or a lecture. Doing it half-seminar, half-lecture, as we did, is really awkward.

I am inclined towards turning CS 294-115 into a standard lecture course, given my problematic history with seminars. Yes, I know I’ve complained a lot about not being able to follow technical lectures, but seminars are still, on average, worse.

The decision is obviously up to Professor Dragan. If she wants to keep it as a seminar, I would highly encourage:

• establishing a strict enrollment limit of no more than 30 students and requiring prospective students to fill out a detailed form describing how they can contribute to the class. Several classes do this already, so it’s not unusual. The enrollment limit is necessary because far more students will sign up next year. Why? It’s simple: robotics is very popular, and Professor Dragan is very popular.
• enforcing time limits better. There were many times when students, for both PRO/CON presentations and final project presentations, exceeded their allotted time.

If CS 294-115 turns into a normal lecture, then:

• introduce problem sets and a midterm to test knowledge instead of relying on shallow quizzes. This will require GSIs, but by now there are plenty of students who can do that job. This will mean more work for the students, but CS 294-115 probably needs more work anyway. It’s listed as 3 credits, but I think it’s more accurate to call it a 2.5-credit class.

Regardless of the class style, I have an idea if quizzes have to remain in CS 294-115. To fix the problem of students not reading papers beforehand, let’s have quizzes for all the paper readings. In fact, the student presenters can create them! They can do the grading and then forward the results to Professor Dragan to minimize her workload.

Don’t misunderstand what I’m saying. Throughout the class, I looked at all the papers, but I mostly read only the abstract plus the first few introductory sections. I wanted to read the papers in detail, but there’s a tradeoff: if I can spend four hours the night before class working towards important research, I will, without hesitation, prefer that over reading a paper.

My final suggestion for the class is about the project presentation sign-up process. We had a class spreadsheet and had to sign up for one of the two days. Instead, let’s randomly assign students to presentation times and make slides due on the same night for everyone.

Well, that’s all I want to say about CS 294-115 now. This is just one person’s opinion and others will no doubt have their own thoughts about the class. I’ll check the future course websites now and then to see if there are any interesting changes from the version I just took.

Professor Dragan, thanks for a fun class!

# Review of Theoretical Statistics (STAT 210A) at Berkeley

In addition to taking CS 294-115 this semester, I also took STAT 210A, Theoretical Statistics. Here’s a description of the class from the Berkeley Statistics Graduate Student Association (SGSA):

All of the core courses require a large commitment of time, particularly with the homework. Taking more than two of them per semester (eight units) is a Herculean task.

210A contains essential material for the understanding of statistics. 210B contains more specialized material. The 210 courses are less mathematically rigorous than the 205 courses. 210A is easier than 205A; 210B is very difficult conceptually, though in practice it’s easy to perform adequately. Homework is time-consuming.

210A: The frequentist approach to statistics with comparison to Bayesian and decision theory alternatives, estimation, model assessment, testing, confidence regions, some asymptotic theory.

And on our course website we get the following description:

Stat 210A is Berkeley’s introductory Ph.D.-level course on theoretical statistics. It is a fast-paced and demanding course intended to prepare students for research careers in statistics.

Both of these descriptions are highly accurate. I saw them before the semester, so I was already prepared to spend many hours on this course. Now that it’s over, I estimate I spent about 20 hours a week on it. That seemed like a lot, but judging from the SGSA description, perhaps even the statistics PhD students have to dedicate that much time to this course.

Our instructor was Professor William Fithian, who is currently in his second year as a statistics faculty member. This course is the first of two theoretical statistics courses offered each year aimed at statistics PhD students. However, I am not even sure if statistics PhD students were in the majority for our class. At the start of the semester, Professor Fithian asked us which fields we were studying, and a lot of people raised their hands when he asked “EECS”. Other departments represented in 210A included mathematics and engineering fields. Professor Fithian also asked if there was anyone studying the humanities or social sciences, but it looks like there were no such students. I wouldn’t expect any; I don’t know why 210A would be remotely useful for them. Maybe I can see a case for a statistically-minded political science student, but it seems like 215A and 215B would be far superior choices.

The number of students wasn’t too large, at least compared to the crowd-fest that’s drowning the EECS department. Right now I see 41 students listed on the bCourses website, and this is generally accurate since the students who drop classes usually drop bCourses as well.

It’s important to realize that even though some descriptions of this course say “introduction” (e.g., see Professor Fithian’s comments above), any student who majored in statistics for undergrad, or who studied various related concepts (computer science and mathematics, for instance) will have encountered the material in 210A before at some level. In my opinion, a succinct description of this course’s purpose is probably: “provide a broad review of statistical concepts but with lots of mathematical rigor, while filling in any minor gaps.”

I am not a statistics PhD student, nor did I major in statistics in undergrad; at the time I was there, my college didn’t even offer a statistics major. I took only three statistics courses in college: an introductory stats course (the easy kind, where we plug in numbers and compute standard deviations), a statistical modeling course (a much easier version of STAT 215A at Berkeley), and an upper-level Bayesian Statistics course taught by Wendy Wang. That third course was enormously helpful, and I’m only now really starting to appreciate having taken it. A fourth undergrad course that was extremely beneficial was Steven Miller’s Math 341 class (probability), which is also now cross-listed into statistics. I should thank my old professors …

Despite not majoring in statistics, I still knew a lot of concepts that were introduced in this class, such as:

• Sufficient Statistics
• Exponential Families
• Maximum Likelihood
• Neyman’s Factorization Theorem
• Jensen’s Inequality and KL Divergences
• Bootstrapping
• The Central Limit Theorem

But there was also a lot I didn’t know:

• Rao-Blackwell Theorem
• GLRT, Wald, Score, Max tests
• Convergence in probability vs distribution
• The Delta Method
• A lot about hypothesis testing, UMP tests, UMPU tests, etc.
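Of these, the Delta Method lends itself to a quick sanity check. Here’s a toy simulation (my own sketch, not course material): if $\sqrt{n}(\bar{X}_n - \mu) \to N(0,\sigma^2)$, then for a smooth $g$ we should have $\sqrt{n}(g(\bar{X}_n) - g(\mu)) \to N(0, g'(\mu)^2\sigma^2)$.

```python
import numpy as np

# Delta Method sketch: X_i ~ Exp(1), so mu = 1 and sigma^2 = 1.
# Take g(x) = x^2, so g'(mu)^2 * sigma^2 = 4. The empirical variance of
# z = sqrt(n) * (g(Xbar) - g(mu)) should then be close to 4.
rng = np.random.default_rng(0)
n, trials = 1000, 5000
xbar = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)
z = np.sqrt(n) * (xbar**2 - 1.0)

print(np.var(z))  # close to 4
```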

Even for the concepts I already knew, we went into far more detail in this class (e.g., with sufficient statistics) and covered the mathematical foundations, so I definitely got a taste of what concepts appear in statistics research. A good example is the preprint about reranking in exponential families written by the professor and his graduate student Kenneth Hung, who was also the lone TA for the course. Having taken this course, it’s actually possible for me to read the paper and have at least a rough understanding of what they’re doing.

The syllabus for 210A roughly proceeded in the order of Robert Keener’s textbook, a book which by itself requires significant mathematical maturity to understand. We covered probably more than half of the book. Towards the end of the semester, we discussed more hypothesis testing, which judging from Professor Fithian’s publications (e.g., the one I just described here) seems to be his research area so that’s naturally what he’d gravitate to.

We did not discuss measure theory in this class. I think the only measure theory courses we have at Berkeley are STAT 205A and 205B? I surprisingly couldn’t find any others by searching the math course catalog (is there really no other course on measure theory here???). But even Professor Aldous’ 205A website says:

Sketch of pure measure theory (not responsible for proofs)

Well, then, how do statistics PhD students learn measure theory if they’re “not responsible for proofs”? I assume they self-study? Or is measure theory needed at all? Keener says it is needed to properly state the results, but it has the unfortunate side effect of hindering understanding. And Andrew Gelman, a prominent statistician, says he never (properly) learned it.

As should be obvious, I do not have a measure theory background. I know that if $X \sim {\rm Unif}[0,1]$ then the event that $X=c$ for any constant $c\in [0,1]$ has measure zero, but that’s about it. In some sense this is really frustrating. I have done a lot of differentiating and integrating, but before taking this class, I had no idea there were differences between Lebesgue and Riemann integration. (I took real analysis and complex analysis in undergrad, but I can’t remember ever discussing Lebesgue integration.)

Even in a graduate-level class like this one, we made lots of simplifying assumptions. This was the header text that appeared at the top of our problem sets:

For this assignment (and generally in this class) you may disregard measure-theoretic niceties about conditioning on measure-zero sets. Of course, you can feel free to write up the problem rigorously if you want, and I am happy to field questions or chat about how we could make the treatment fully rigorous (Piazza is a great forum for that kind of question as long as you can pose it without spoilers!)

Well, I shouldn’t say I was disappointed, since I was more relieved if anything. If I had to formalize everything with measure theory, I think I would get bogged down with trying to understand measure this, measure that, the integral is with respect to X measure, etc.

Even with the “measure-theoretically naive” simplification, I had to spend lots of time working on the problem sets. In general, I could usually get about 50-80 percent of a problem set done by myself, but for the remainder of it, I needed to consult with other students, the GSI, or the professor. Since problem sets were usually released Thursday afternoons and due the following Thursday, I made sure to set aside either Saturday or Sunday as a day where I worked solely on the problem set, which allowed me to get huge chunks of it done and LaTeX-ed up right away.

The bad news is that it wasn’t easy to find student collaborators. I had collaborators for some (but not all) of the problem sets, and we stopped meeting later in the semester. Man, I wish I were more popular. I also attended a few office hours, and it could sometimes be hard to understand the professor or the GSI if there were a lot of other students there. The GSI office hours were a major problem, since another class held office hours in the same room. Think of overlapping voices from two different subjects. Yeah, it’s not the best acoustic setting, and I could not follow what people were discussing apart from scraps of information I could get from the chalkboard. The good news is that the GSI, apart from obviously knowing the material very well, was generous with partial credit and wrote solutions to all the problem sets, which were really helpful when I prepared for the three-hour final exam.

The day of the final was pretty hectic for me. I had to give a talk at SIMPAR 2016 in San Francisco that morning, then almost immediately board BART to get back to Berkeley by 1:00PM to take the exam. (The exam was originally scheduled for 8:00AM1 (!!) but Professor Fithian kindly allowed me to take it at 1:00PM.) I did not have any time to change clothes, so for the first time in my life, I took an exam wearing business attire.

Business attire or not, the exam was tough. Professor Fithian gave us four broad questions, each with multiple parts. I attempted all four, and probably got about half of each correct. We were allowed a one-sided, one-page cheat sheet, and I crammed mine with A LOT of information, but I almost never used it during the exam. This is typical, by the way: most of the time, simply creating the cheat sheet serves as my studying. As I was also preoccupied with research, I only spent a day and a half studying for the final.

After I had turned in the exam, Professor Fithian mentioned that he thought it was a “hard exam”. This made me feel better, and it also made me wonder: how well do professors generally do on their own exams? (Assuming, of course, that the exam is the same difficulty for them and that they don’t directly know the questions in advance.)

It was definitely a relief to finish the final exam, and I rewarded myself with some Cheeseboard Pizza (first pizza in ages!) along with a beautiful 5.8-mile run the following morning.

Now that I can sit back and reflect on the course, STAT 210A was enormously time-consuming but I learned a lot from it. I’m not sure whether I would have learned more had I had more of a statistics background; it might have made it easier to go in-depth into some of the technically challenging parts. I probably could have spent additional focused time to understand the more confusing material, but that applies to any subject. There’s always a tradeoff: should I spend more time understanding every aspect of the t-test derivation, or should I learn how to not be an incompetently bad C programmer, or should I simply waste time and blog?

One of the challenges with a technical course like this one is, as usual, following the lectures in real time. Like most of my Berkeley classes, I used sign language interpreting accommodations. No, they were not happy with interpreting this material, and it was hard for them to convey information to me. On the positive side, I had a consistent group so they were able to get used to some of the terminology.

In general, if I did not already deeply understand the material, it was very difficult for me to follow a lecture from start to finish. I can understand math-based lectures if I can follow the derivation on the board, but for many of the proofs we wrote, we had to skip a lot of details, making it hard to build intuition. There are also a lot of technical details to take care of in many derivations (e.g., regarding measure theory). I wonder if the ability to fill in the gaps in a mathematical argument without getting bogged down in technicalities is what separates people like Professor Fithian from the average statistics PhD student, who will never have the ability to become a Berkeley statistics faculty member.

Regarding the quality of instruction, some might be concerned that since Professor Fithian is a new faculty member, the teaching would be sub-par. For me, this is not usually much of an issue. Due to limitations in sign language and my ability to hear, I always get relatively less benefit out of lectures than other students do, so direct lecture-teaching skill matters less to me than whether there are good written sources or sufficient office hours to utilize. (To be fair, it’s also worth noting that Professor Fithian won several teaching awards at Stanford, where he obtained his PhD.)

In case anyone is curious, I thought Professor Fithian did a fine job teaching, and he’s skilled enough that teaching quality shouldn’t be a negative factor for future students thinking of taking 210A (assuming he’s teaching it again). He moves at a moderate pace in lectures, using a standard chalkboard to present details. He doesn’t use slides, but we occasionally got to see some of his R code on his computer, as well as his newborn child on Facebook!

I would be remiss if I didn’t make any suggestions. So here they are, ordered roughly from most important to least important:

• If there’s any one suggestion I can make, it would be to increase the size of the chalkboard text/notes. For optimal positioning of sign language interpreting accommodations, I always sat in the front left corner of the room, and even then I often could not read what was on the far right chalkboard of the room. I can’t imagine how much harder it would be for the people in the back rows. In part, this may have been due to the limitations in the room, which doesn’t have elevated platforms for the back rows. Hopefully next semester they figure out better rooms for statistics courses.

• I think the pace of the class was actually too slow during the first part of the course, about exponential families, sufficiency, Bayesian estimation, and so on, but then it became slightly too fast during the details on hypothesis testing. My guess is that most students will be familiar with the details on the first set of topics, so it’s probably fair to speed through them to get to the hypothesis testing portion, which is probably the more challenging aspect of the course. Towards the end of the semester I think the pace was acceptable, which means it’s slightly too fast for me but appropriate for other students.

• Finally, I hope that in future iterations of the course, the problem sets are released on a consistent schedule and are largely free of typos. We went off our once-a-week schedule several times, but still found a way to squeeze out 11 problem sets. Unfortunately, some of the questions either had typos or, worse, major conceptual errors. This often happens when Professor Fithian has to create his own questions, which is partly out of necessity to avoid repeating old questions and risking students looking up solutions online. I did not look up solutions to the problem set questions, though for every problem set I used the web for background material and to learn supporting concepts (Wikipedia is great for this, as are other class websites that use slides). I’m not sure if this is standard among students, but I think most would need to do so to do very well on the problem sets.

Well, that’s all I want to say now. This is definitely the longest I’ve written about any class at Berkeley. Hopefully this article provides the best and most comprehensive description of STAT 210A in the blogosphere. There’s a lot more I can talk about but there’s also a limit to how long I’m willing to write.

Anyway, great class!

1. Professor Fithian: “sorry, not my idea.”

# Reflections on Being a GSI for Deep Neural Networks (CS 294-129) at Berkeley

This semester, as I mentioned in an earlier blog post, I was one of two GSIs1 for the Deep Neural Networks course at Berkeley, taught by Professor John Canny. This was a surprise, because I was not planning on being a GSI for any class until Fall 2017. I wanted to have time to carefully think about and plan my GSI experience beforehand, perhaps by talking to previous GSIs or learning more about becoming an effective teacher. In addition, I was hoping to write a few technical articles here, and maybe some students would find them helpful. They would have to search for my blog themselves, though, since I usually don’t actively tell people about this website.

Well, so much for that. I got appointed several weeks into the course, after I had done the first assignment as a normal student. How did it happen? After Yang You (the other GSI) got appointed, he talked to me after class to ask about the possibility of being a GSI, since he knew that I was one of John Canny’s students. I thought about it and decided to join. John was a little hesitant at first and worried that I would spend too much time trying to help out the students, but after I promised him “not to spend more than 15 hours a week on this course” John agreed to have me appointed as a 10-hour GSI.2 This process was certainly rushed and I had to scramble through Berkeley’s annoying administrative requirements.

Fortunately, things soon settled down to normal. My first set of duties consisted of grading homework and providing comments on Piazza. I contacted the Stanford CS 231n team but did not get a response containing a rubric, so Yang and I had to create our own for the assignments. After I figured out a good workflow for downloading and managing Jupyter Notebooks, I was usually able to grade relatively quickly by running a select subset of the Notebook cells and spot-checking the plots.

There were four programming assignments in this course. The first three were adapted from CS 231n at Stanford, and the fourth one was a reinforcement learning homework assignment which John and I had to create. It asked questions regarding Deep-Q Networks and Policy Gradients. Some of the homework was unclear, and we only had two reinforcement learning lectures in the course — neither of which mentioned DQN — so I think in future iterations of the course, either the lecture arrangement or the fourth assignment (or both) should be substantially revised.

CS 294-129 also contained an in-class midterm (80 minutes). Professor Canny created the midterm and I helped to revise and clarify it. One regret I have is that I didn’t create any questions. While the midterm was OK and fair, I would have designed it in a slightly different way and would have made it more challenging. Grading was easy due to Gradescope.

I also made a mistake just before the midterm. Professor Canny had said that students were allowed to use “one page of notes”. A student asked on Piazza whether this meant one-sided or two-sided. I had always thought one page meant two-sided, and the day before the midterm, I provided an answer: “two-sided”. Oops. Professor Canny almost immediately replied saying one-sided. To the students who had to rewrite their cheat sheets for the exam because of this, I am genuinely sorry. After that, I made sure not to jump the gun on administrative questions.

While the class has programming assignments and a midterm as mentioned earlier, the most important part of CS 294-129 is the student project assignment, done in groups of 1-4 students. We set aside four lectures for student project presentations, two in the week before the midterm and two in the last week of classes. (I think this is a bit too much, and would probably have set the first presentation week to be more about DQN and some other core Deep Learning concept.)

The 83 students who took the midterm collectively formed 36 project teams, so 18 teams presented each day. We set aside 3 minutes per team for the first presentation week and 5 minutes per team for the second, since we started class 30 minutes early that week. The course staff graded both sets of presentations, and I was responsible for collecting all our feedback together to give a final grade along with written feedback. Here is a list of some common feedback I gave:

• Please respect the time limits.

• Please use large text everywhere on figures: title text, axes text, legend text, etc. Come on everyone, this is really easy to do in matplotlib.

• Please don’t clog your slides with a long list of text. If I want to read the text, I can read the final report. (And students in the audience who may want more detail are free to contact the presenters.)

• Please look at the audience when talking and project your voice while speaking with confidence. One might wonder if it is appropriate for a deaf person to provide comments on this material. In my case, it is OK for the following reasons: the other GSI and the professor also wrote comments about this, and my sign language interpreters can tell me if there’s a student who speaks too softly.

• Please respect the time limits.
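On the matplotlib point in the list above, here’s a minimal sketch (my own, with arbitrary example sizes) showing that enlarging every piece of figure text takes one `rcParams` update:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Bump all font sizes globally instead of per-call; values here are just examples.
plt.rcParams.update({
    "font.size": 18,        # base size: tick labels, etc.
    "axes.titlesize": 22,
    "axes.labelsize": 20,
    "legend.fontsize": 18,
})

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot([0, 1, 2], [0, 1, 4], label="example curve")
ax.set_title("Readable Title")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("example.png")
```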

I also noticed an interesting contrast between the presentations in this class versus those of CS 294-115, the Algorithmic Human-Robot Interaction class. Our presentations did not contain that much humor, and I don’t think we laughed a lot. In contrast, the CS 294-115 presentations were substantially more humorous. Why is that??

There was something far worse than grading, however. The worst part of being a GSI was, without question, that not a single student attended my office hours! I know I started late as a GSI, but I still had about seven or eight weeks of office hours. I guess I really am unpopular.

Sadly, this course will not be offered next semester, but there is an alternative in CS 294-112, which I expect to be popular. Professor Canny might be willing to teach CS 294-129 again next fall, and I might also be willing to be the GSI again, so we’ll see if that happens. If it does, I hope to be more popular with the students while also demanding strong work effort. If not, well, I think I enjoyed the GSI experience overall and I’m happy I did it. I hope the students enjoyed the class!

1. GSI stands for “Graduate Student Instructor” and is the same thing as a Teaching Assistant.

2. As far as I can tell, during the semester students can sign up to be a 10-hour or a 20-hour GSI, with 40-hour GSIs reserved for summer courses. Of course, by now I’ve learned to treat any “X-hour job” with skepticism that it’s actually an “X-hour” job.

# Mitt Romney Should be Secretary of State

Update December 20, 2016: This is interesting. The Wall Street Journal just published a fantastic opinion piece about Mitt Romney and Russia, and even says people should give Mr. Romney an apology, just as I said! No, I wasn’t reading their minds, this was just pure coincidence.

Another day, more news about Russian shenanigans. The CIA and other intelligence agencies have now concluded with high confidence that Russia was behind the hacks that leaked documents from Democratic political campaigns, presumably with the goal of tilting the election towards Mr. Trump.

At the same time, Trump is also nominating people who, while not necessarily pro-Russian, aren’t explicitly anti-Russian either. I am most worried about his nominee for Secretary of State, ExxonMobil CEO Rex Tillerson, who has ties with Russia extensive enough that he even has an Order of Friendship from Vladimir Putin.

I strongly agree with Senator Marco Rubio’s tweet: “Being a ‘friend of Vladimir’ is not an attribute I am hoping for from a #SecretaryOfState”. Regarding the Secretary of State selection for the upcoming Trump administration, my preferred candidate was Mitt Romney by a mile. For a variety of reasons, everyone else (Giuliani, Tillerson, Bolton, Petraeus, etc.) collectively formed a “distant second” group.

I was not following politics too closely at the time when Mr. Romney’s comment about Russia being the “number one geopolitical foe” garnered substantial attention. But the more that I see Russia’s meddling into our affairs, and the more I also learn about what Vladimir Putin has done as President of Russia — among them, crippling freedom of the press and supporting the Syrian government — the more I find myself gravitating towards Mr. Romney’s stance on Russia.

This is my plea to our Congresspeople: use the issue of Russia to get back to bipartisanship by carrying out a much-needed Congressional investigation of Russia’s actions. The Republicans may find it harder to agree with this since it carries the implicit assumption that they’re going against their president-elect. Tradition is hard to break, however, and the Republican party has been anti-Russia for decades. This is their best shot at getting bipartisan cooperation towards a hard-line Russian stance. With Democrats presumably ready to join them now that they have seen the impact of Russia’s actions firsthand, all the momentum is there. Congress should explore and recommend actions to deter Russia, such as more economic sanctions, an increasingly aggressive cyberdefense, or enforcing a no-fly zone in Syria (admittedly, this might be a reach).

Perhaps there could be other things that Democrats could do to make such cooperation more likely. One could be a promise that, regardless of the investigation outcome, they would not challenge the legitimacy of a Trump administration.

A second could be this: anyone who mocked or otherwise strongly disagreed with Mr. Romney’s characterization of Russia as the “number one geopolitical foe” must be willing to apologize.

# Going Deeper Into Reinforcement Learning: Understanding Deep-Q-Networks

The Deep Q-Network (DQN) algorithm, introduced by DeepMind in a NIPS 2013 workshop paper and later published in Nature in 2015, can be credited with revolutionizing reinforcement learning. In this post, therefore, I would like to give a guide to a subset of the DQN algorithm. This is a continuation of an earlier reinforcement learning article about linear function approximators. My contribution here will be orthogonal to my previous post about the preprocessing steps for game frames.

Before I proceed, I should mention upfront that there are already a lot of great blog posts and guides that people have written about DQN. Here are the prominent ones I was able to find and read:

I will not try to repeat what these great folks have already done. Here, I focus specifically on the aspect of DQN that was the most challenging for me to understand,1 which was about how the loss function works. I draw upon the posts above to aid me in this process.

In general, to optimize a complicated function $f(\theta)$ with high-dimensional parameters and high-dimensional data points, one needs to use an algorithm such as stochastic gradient descent, which relies on sampling to approximate the gradient step $\theta_{i+1} = \theta_i - \alpha \nabla f(\theta_i)$. The key to understanding DQN is to understand how we characterize the loss function.

Recall from basic reinforcement learning that the optimal Q-value is defined from the Bellman optimality conditions:

\begin{align} Q^*(s,a) &= \sum_{s'}P(s'|s,a)\left[R(s,a,s')+\gamma\max_{a'}Q^*(s',a')\right] \\ &= \mathbb{E}_{s'}\left[R(s,a,s')+\gamma \max_{a'}Q^*(s',a') \mid s,a\right] \end{align}

But is this the definition of a Q-value (i.e. a $Q(s,a)$ value)?

NOPE!

This is something I had to think about carefully before it finally hit home, and I think it’s crucial to understanding Q-values. When we refer to $Q(s,a)$, all we are referring to is some arbitrary value. We might pull that value out of a table (i.e., tabular Q-Learning), or it might be determined by a linear or neural network function approximator. In the latter case, it’s better to write it as $Q(s,a;\theta)$. (I actually prefer writing it as $Q_\theta(s,a)$, but the DeepMind papers use the other notation, so for the rest of this post, I will use their notation.)

Our GOAL is to get it to satisfy the Bellman optimality criteria, which I’ve written above. If we have “adjusted” our $Q(s,a)$ function value so that for all $(s,a)$ input pairings, it satisfies the Bellman requirements, then we rewrite the function as $Q^*(s,a)$ with the asterisk.

Let’s now define a loss function. We need to do this so that we can perform stochastic gradient descent, which will perform the desired “adjustment” to $Q(s,a)$. What do we want? Given a state-action pair $(s,a)$, the target will be the Bellman optimality condition. We will use the standard squared error loss. Thus, we write the loss $L$ as:

$L = \frac{1}{2}({\rm Bellman} - Q(s,a))^2$

But we’re not quite done, right? This is for a specific $(s,a)$ pair. We want the $Q$ to be close to the “Bellman” value across all such pairs, so we take an expectation. Here’s the tricky part: we need to take the expectation with respect to samples $(s,a,r,s')$, so we have to consider a four-tuple instead of just the pair $(s,a)$, because the target (Bellman) additionally depends on $r$ and $s'$. The loss function becomes:

\begin{align} L(\theta) &= \mathbb{E}_{(s,a,r,s')}\left[\frac{1}{2}({\rm Bellman} - Q(s,a;\theta))^2\right] \\ &= \mathbb{E}_{(s,a,r,s')}\left[\frac{1}{2}\left(R(s,a,s')+\gamma\max_{a'}Q(s',a';\theta) - Q(s,a;\theta)\right)^2\right] \end{align}

Note that I have now added the parameter $\theta$. However, this actually includes the tabular case. Why? Suppose we have two states and three actions. Thus, the total table size is six, with elements indexed by: $\{(s_1,a_1),(s_1,a_2),(s_1,a_3),(s_2,a_1),(s_2,a_2),(s_2,a_3)\}$. Now let’s define a six-dimensional vector $\theta \in \mathbb{R}^6$. We will decide to encode each of the six $Q(s,a)$ values into one component of the vector. Thus, $Q(s_i,a_j;\theta)=\theta_{i,j}$. In other words, we parameterize the arbitrary function by $\theta$, and we directly look at $\theta$ to get the Q-values! Think about how this differs from the linear approximation case I discussed in my last post. Instead of $Q(s_i,a_j;\theta)$ corresponding to one element in the parameter vector $\theta$, it turns out to be a linear combination of the full $\theta$ along with a state-action dependent vector $\phi(s,a)$.
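To make the tabular parameterization concrete, here’s a toy sketch (my own code, not from any DQN implementation) of the two-state, three-action table stored as a flat vector $\theta \in \mathbb{R}^6$:

```python
import numpy as np

# Toy tabular case: 2 states, 3 actions, Q-table stored as a flat vector theta.
n_states, n_actions = 2, 3
theta = np.zeros(n_states * n_actions)  # theta in R^6, one component per (s, a)

def Q(s, a, theta):
    # "Parameterize" the table: Q(s_i, a_j; theta) is just one component of theta.
    return theta[s * n_actions + a]

theta[1 * n_actions + 2] = 5.0  # set the component for (s_2, a_3)
print(Q(1, 2, theta))  # -> 5.0
```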

With the above intuition in mind, we can perform stochastic gradient descent to minimize the loss function, which is an expectation over all $(s,a,r,s')$ samples. Intuitively, we can think of it as a function of an infinite number of such samples. In practice, though, to approximate the expectation, we take a finite number of samples and use them to form the gradient update.

We can continue with this logic until we get to the smallest possible update case, involving just one sample. What does the update look like? With one sample, we have approximated the loss as:

\begin{align} L(\theta) &= \frac{1}{2}\left(R(s,a,s')+\gamma\max_{a'}Q(s',a';\theta) - Q(s,a;\theta)\right)^2 \\ &= \frac{1}{2}\left(R(s,a,s')+\gamma\max_{a'}Q(s',a';\theta) - \theta_{s,a}\right)^2 \end{align}

I have substituted $\theta_{s,a}$, a single element in $\theta$, since we assume the tabular case for now. I didn’t do that for the target, since for the purpose of the Bellman optimality condition, we assume it is fixed for now. Since there’s only one component of $\theta$ “visible” in the loss function, the gradient for all components2 other than $\theta_{s,a}$ is zero. Hence, the gradient update is:

\begin{align} \theta_{s,a} &= \theta_{s,a} - \alpha \frac{\partial}{\partial \theta_{s,a}}L(\theta)\\ &= \theta_{s,a} - \alpha \left(R(s,a,s')+\gamma\max_{a'}Q(s',a';\theta) - \theta_{s,a}\right)(-1) \\ &= (1-\alpha)\theta_{s,a} + \alpha \left(R(s,a,s')+\gamma\max_{a'}Q(s',a';\theta)\right) \end{align}

Guess what? This is exactly the standard tabular Q-Learning update taught in reinforcement learning courses! I wrote the same exact thing in my MDP/RL review last year. Here it is, reproduced below:

$Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \underbrace{[R(s,a,s') + \gamma \max_{a'}Q(s',a')]}_{\rm sample}$
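The update above can be written as a one-liner. Here’s a minimal sketch (my own toy code, with a made-up two-state, three-action table and a fabricated transition):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) <- (1-alpha) Q(s,a) + alpha * sample."""
    sample = r + gamma * np.max(Q[s_next])  # Bellman target from one (s, a, r, s') tuple
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * sample
    return Q

Q = np.zeros((2, 3))  # 2 states, 3 actions
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # -> 0.1, since max_a' Q(s', a') = 0 initially
```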

Let’s switch gears to the DQN case, so we express $Q(s,a;\theta)$ with the implicit assumption that $\theta$ represents neural network weights. In the Nature paper, they express the loss function as:

$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')\sim U(D)}\left[ \frac{1}{2}\left(r + \gamma \max_{a'}Q(s',a';\theta_i^-) - Q(s,a; \theta_i)\right)^2 \right]$

(I added in an extra $1/2$ that they were missing, and note that their $r=R(s,a,s')$.)

This looks very similar to the loss function I had before. Let’s deconstruct the differences:

• The samples come from $U(D)$, which is assumed to be an experience replay history. This helps to break correlation among the samples. Remark: it’s important to know the purpose of experience replay, but nowadays it’s probably out of fashion since the A3C algorithm runs learners in parallel and thus avoids sample correlation across threads.

• Aside from the fact that we’re using neural networks instead of a tabular parameterization, the weights used for the target versus the current (presumably non-optimal) Q-value are different. At iteration $i$, we assume we have some giant weight vector $\theta_i$ representing all of the neural network weights. We do not, however, use that same vector to compute the targets. Instead, we fix the targets with the previous weight vector $\theta_i^-$.

These two differences are mentioned in the Nature paper in the following text on the first page:

We address these instabilities with a novel variant of Q-learning, which uses two key ideas. […]

I want to elaborate on the second point in more detail, as I was confused by it for a while. Think back to my tabular Q-Learning example in this post. The target was parameterized using $Q(s',a'; \theta)$. When I perform an update using SGD, I updated $\theta_{s,a}$. If this turns out to be the same component as $\theta_{s',a'}$, then this will automatically update the target. Think of the successor state as being equal to the current state. And, again, during the gradient update, the target was assumed fixed, which is why I did not re-write $Q(s',a';\theta)$ into a component of $\theta$; it’s as if we are “drawing” the value from the table once for the purposes of computing the target, so the value is ignored when we do the gradient computation. After the gradient update, the $Q(s',a';\theta)$ value may be different.

DeepMind decided to do it a different way. Instead, we have to fix the weights $\theta$ for some number of iterations, and then allow the samples to accumulate. The argument for why this works is a bit hand-wavy and I’m not sure if there exists any rigorous mathematical justification. The DeepMind paper says that this reduces correlations with the target. This is definitely different from the tabular case I just described, where one update immediately modifies targets. If I wanted to make my tabular case the same as DeepMind’s scenario, I would update the weights the normal way, but I would also fix the vector $\theta^-$, so that when computing targets only, I would draw from $\theta^-$ instead of the current $\theta$.
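Here’s a minimal sketch of that idea (my own toy code: a linear Q-function standing in for a real neural network, with fabricated transitions and arbitrary hyperparameters). The point is only the bookkeeping: the target uses the frozen $\theta^-$, the gradient step touches only the online $\theta$, and the two are synced every $T$ steps:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)    # "online" weights, updated every step
theta_target = theta.copy()   # frozen copy (theta^-), used only for TD targets
T = 50                        # sync interval (a hyperparameter)
alpha, gamma = 0.1, 0.99

for step in range(1, 201):
    # Fabricated transition: feature vectors for s and s', constant reward.
    phi, phi_next, r = rng.normal(size=4), rng.normal(size=4), 1.0
    target = r + gamma * theta_target @ phi_next  # target uses frozen weights
    td_error = target - theta @ phi
    theta += alpha * td_error * phi               # gradient step on online weights only
    if step % T == 0:
        theta_target = theta.copy()               # periodic sync, as in the Nature DQN
```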

Others have written explanations of DeepMind’s use of the target network. Denny Britz argues that it leads to more stable training:

Trick 2 - Target Network: Use a separate network to estimate the TD target. This target network has the same architecture as the function approximator but with frozen parameters. Every T steps (a hyperparameter) the parameters from the Q network are copied to the target network. This leads to more stable training because it keeps the target function fixed (for a while).

Arthur Juliani uses a similar line of reasoning:

Why not use just use one network for both estimations? The issue is that at every step of training, the Q-network’s values shift, and if we are using a constantly shifting set of values to adjust our network values, then the value estimations can easily spiral out of control. The network can become destabilized by falling into feedback loops between the target and estimated Q-values. In order to mitigate that risk, the target network’s weights are fixed, and only periodically or slowly updated to the primary Q-networks values. In this way training can proceed in a more stable manner.

I want to wrap up by doing some slight investigation of spragnur’s DQN code as it relates to the points above about the target network. (The neural network portion of his code is written using the Theano and Lasagne libraries, which by themselves have some learning curve; I will not elaborate on them, as that is beyond the scope of this blog post.) Here are three critical parts of the code to understand:

• The q_network.py script contains the code for managing the networks. Professor Sprague uses a shared Theano variable to manage the input to the network, called self.imgs_shared. What’s key is that he has to make the dimension $(N,|\phi|+1,H,W)$, where $|\phi|=4$ (just like in my last post). So why the +1? This is needed so that he can produce the Q-values for $Q(s',a'; \theta_i^-)$ in the loss function, which uses the successor state $s'$ rather than the current state $s$ (not to mention the previous weights $\theta_i^-$ as well!). The Nervana blog post I listed made this distinction, saying that a separate forward pass is needed to compute $Q$-values for the successor states. By wrapping everything together into one shared variable, spragnur’s code efficiently utilizes memory.

Here’s the corresponding code segment:

• On a related note, the “variable” q_vals contains the Q-values for the current minibatch, while next_q_vals contains the Q-values for the successor states. Since the minibatches used in the default setting are $(32,4+1,84,84)$-dimensional numpy arrays, both q_vals and next_q_vals can be thought of as arrays with 32 values each. In both cases, they are computed by calling lasagne.layers.get_output, which has an intuitive name.

The code separates the next_q_vals cases based on a self.freeze_interval parameter, which is -1 and 10000 for the NIPS and Nature versions, respectively. For the NIPS version, spragnur uses the disconnected_grad function, meaning that the expression is excluded from the gradient computation. I believe this is similar to what I did before with the tabular $Q$-Learning case, when I didn’t “convert” $Q(s',a';\theta)$ to $\theta_{s',a'}$.

The Nature version is different. It will create a second network called self.next_l_out which is independent of the first network self.l_out. During the training process, the code periodically calls self.reset_q_hat() to update the weights of self.next_l_out by copying from the current weights of self.l_out.

Both of these should accomplish the same goal of having a separate network whose weights are updated periodically. The Nature version is certainly easier to understand.
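To see why the extra frame suffices, here is a numpy sketch of the idea (my own illustration, not spragnur’s actual Theano code): because $s$ and $s'$ overlap in all but one frame, a single $(N,|\phi|+1,H,W)$ array yields both by slicing, with no copies.

```python
import numpy as np

# One minibatch stored as a single array of shape (N, |phi| + 1, H, W):
# the first four frames of each row form the state s, and the last four
# form the successor state s' (they overlap in three frames).
N, phi_len, H, W = 32, 4, 84, 84
batch = np.random.rand(N, phi_len + 1, H, W).astype(np.float32)

states      = batch[:, :phi_len]   # inputs for Q(s, a; theta_i)
next_states = batch[:, 1:]         # inputs for Q(s', a'; theta_i^-)
```

Both slices are views into the same buffer, which is the memory-efficiency point made above.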

• I briefly want to mention a bit about the evaluation metrics used. The code, following the NIPS and Nature papers, reports the average reward per episode, which is intuitive and what we ultimately want to optimize. The NIPS paper argued that the average reward metric is extremely noisy, so they also reported the average action value encountered during testing. They collected a random set of states $\mathcal{S}$ and, for each $s \in \mathcal{S}$, tracked the maximum $Q(s,a)$-value obtainable from that state, and found that this average action value increased far more smoothly than the average reward. Professor Sprague’s code computes this value in the ale_agent.py file, in the finish_testing method. He seems to do it differently from the way the NIPS paper reports it, though, by subsampling a random set of states after each epoch instead of keeping that random set fixed for all 100 epochs (I don’t see self.holdout_data assigned anywhere). But, meh, it does basically the same thing.

I know I said this before, but I really like spragnur’s code.

That’s all I have to say for now regarding DQNs. I hope this post serves as a useful niche in the DQN blog-o-sphere by focusing specifically on how the loss function gets derived. In subsequent posts, I hope to touch upon the major variants of DQN in more detail, particularly A3C. Stay tuned!

1. The most annoying part of DQN, though, was figuring out how the preprocessing works. As mentioned earlier, I resolved that through a previous post here.

2. Taking the gradient of a function from a vector to a scalar, such as $L(\theta)$, involves taking partial derivatives for each component. All components other than $\theta_{s,a}$ in this example here will have derivatives of 0.

# Frame Skipping and Pre-Processing for Deep Q-Networks on Atari 2600 Games

For at least a year, I’ve been a huge fan of the Deep Q-Network algorithm. It’s from Google DeepMind, and they used it to train AI agents to play classic Atari 2600 games at the level of a human while only looking at the game pixels and the reward. In other words, the AI was learning just as we would do!

Last year, I started a personal project related to DQNs and Atari 2600 games, which got me delving into the details of DeepMind’s two papers (from NIPS workshop 2013 and NATURE 2015). One thing that kept confusing me was how to interpret their frame skipping and frame processing steps, because each time I read their explanation in their papers, I realized that their descriptions were rather ambiguous. Therefore, in this post, I hope to clear up that confusion once and for all.

I will use Breakout as the example Atari 2600 game, and the reference for the frame processing will be from the NATURE paper. Note that the NATURE paper is actually rather “old” by deep learning research standards (and the NIPS paper is ancient!!), since it’s missing a lot of improvements such as Prioritized Experience Replay and Double Q-Learning, but in my opinion, it’s still a great reference for learning DQN, particularly because there’s a great open-source library which implements this algorithm (more on that later).

To play the Atari 2600 games, we generally make use of the Arcade Learning Environment library, which simulates the games and provides interfaces for selecting actions to execute. Fortunately, the library allows us to extract the game screen at each time step. I modified some existing code from the Python ALE interface so that I could play the Atari games myself by using the arrow keys and spacebar on my laptop. I took 30 consecutive screenshots from the middle of one of my Breakout games and stitched them together to form the image below. It’s a bit large; right-click on it and select “Open Image in New Tab” to see it bigger. The screenshots should be read left-to-right, top-to-bottom, just as if you were reading a book.

Note that there’s no reason why it had to be me playing. I could have easily extracted 30 consecutive frames from the AI playing the game. The point is that I just want to show what a sequence of frames would look like.

I saved the screenshots each time I executed an action using the ALE Python interface. Specifically, I started with ale = ALEInterface() and then after each call to ale.act(action), I used rgb_image = ale.getScreenRGB(). Thus, the above image of 30 frames corresponds to all of the screenshots that I “saw” while playing in the span of a half-second (though obviously no human would be able to distinguish among all 30 frames at once). In other words, if you save each screenshot after every action gets executed from the ALE python library, there’s no frame skipping.

Pro tip: if you haven’t modified the default game speed, play the game yourself, save the current game screenshot after every action, and check if the total number of frames is roughly equivalent to the number of seconds multiplied by 60.

Now that we have 30 consecutive in-game images, we need to process them so that they are not too complicated or high dimensional for DQN. There are two basic steps to this process: shrinking the image, and converting it into grayscale. Both of these are not as straightforward as they might seem! For one, how do we shrink the image? What size is a good tradeoff for computation time versus richness? And in the particular case of Atari 2600 games, such as in Breakout, do we want to crop off the score at the top, or leave it in? Converting from RGB to grayscale is also – as far as I can tell – an undefined problem, and there are different formulas to do the conversion.

For this, I’m fortunate that there’s a fantastic DQN library open-source on GitHub called deep_q_rl, written by Professor Nathan Sprague. Seriously, it’s awesome! Professor Sprague must have literally gone through almost every detail of the Google DeepMind source code (also open-source, but harder to read) to replicate their results. In addition, the code is extremely flexible; it has separate NIPS and NATURE scripts with the appropriate hyperparameter settings, and it’s extremely easy to tweak the settings.

I browsed the deep_q_rl source code to learn about how Professor Sprague did the downsampling. It turns out that the NATURE paper did a linear scale, thus keeping the scores inside the screenshots. That strikes me as a bit odd; I would have thought that cropping the score entirely would be better, and indeed, that seems to have been what the NIPS 2013 paper did. But whatever. Using the notation of (height,width), the final dimensions of the downsampled images were (84,84), compared to (210,160) for the originals.

To convert RGB images to grayscale, the deep_q_rl code uses the built-in ALE grayscale conversion method, which I’m guessing DeepMind also used.
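Here is a rough numpy sketch of the two steps, with two loud caveats: the actual deep_q_rl code uses ALE’s grayscale conversion and a proper linear rescale, whereas I substitute the common luminosity weights and nearest-neighbor sampling just to show the shapes involved.

```python
import numpy as np

def preprocess(frame_rgb, out_h=84, out_w=84):
    """Grayscale + downsample sketch. Stand-ins only: the real pipeline uses
    ALE's grayscale conversion and a proper linear rescale."""
    # Common luminosity weights for RGB -> grayscale (one of several formulas).
    gray = frame_rgb @ np.array([0.299, 0.587, 0.114])       # (210, 160)
    h, w = gray.shape
    rows = np.arange(out_h) * h // out_h                     # nearest-neighbor
    cols = np.arange(out_w) * w // out_w                     # index maps
    return gray[np.ix_(rows, cols)].astype(np.float32)       # (84, 84)

frame = np.random.randint(0, 256, size=(210, 160, 3)).astype(np.uint8)
small = preprocess(frame)
```

Note that this linear-scale version keeps the score region in the frame, matching the NATURE paper’s choice rather than the NIPS paper’s cropping.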

The result of applying these two pre-processing steps to the 30 images above results in the new set of 30 images:

Applying this to all screenshots encountered in gameplay during DQN training means that the data dimensionality is substantially reduced, and also means the algorithm doesn’t have to run as long. (Believe me, running DQN takes a long time!)

We have all these frames, but what gets fed as input to the “Q-Network”? The NATURE and NIPS papers used a sequence of four game frames stacked together, making the data dimension (4,84,84). (This parameter is called the “phi length”.) The idea is that the action an agent chooses depends on the prior sequence of game frames. Imagine playing Breakout, for instance. Is the ball moving up or down? If the ball is moving down, you had better get the paddle in position to bounce it back up. If the ball is moving up, you can wait a little longer, or start moving toward where you think the ball will eventually land.

This raises some ambiguity, though. Is it that we take every four consecutive frames? NOPE. It turns out that there’s a frame skipping parameter, which confusingly enough, the DeepMind folks also set at 4. What this means in practice is that only every fourth screenshot is considered, and then we form the “phi”s, which are the “consecutive” frames that together become the input to the neural network function approximator. Consider the updated sequence of screenshots, with the skipped frames denoted with an “X” over them:

What ends up happening is that the states are four consecutive frames, ignoring all of the “X”-ed out screenshots. In the above image, there are only seven non-skipped frames. Let’s denote these as $x_1,x_2,\ldots, x_7$. The DQN algorithm will use $s_1 = (x_1,x_2,x_3,x_4)$ as one state. Then for the next state, it uses $s_2 = (x_2,x_3,x_4,x_5)$. And so on.

Crucially, the states are overlapping. This was another thing that wasn’t apparent to me until I browsed Professor Sprague’s code. Did I mention I’m a huge fan of his code?
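The skip-then-stack procedure can be sketched as follows (my own illustration; actual implementations differ in which frames they keep and in how they handle episode boundaries):

```python
import numpy as np

def states_from_frames(frames, skip=4, phi_length=4):
    """Keep every `skip`-th frame, then form overlapping stacks of
    `phi_length` kept frames: s_1 = (x_1..x_4), s_2 = (x_2..x_5), ..."""
    kept = frames[::skip]                         # x_1, x_2, ..., every 4th frame
    return [np.stack(kept[i:i + phi_length])      # overlapping sliding windows
            for i in range(len(kept) - phi_length + 1)]

frames = [np.full((84, 84), t, dtype=np.uint8) for t in range(28)]
states = states_from_frames(frames)
# 28 raw frames -> 7 kept frames -> 4 overlapping states, each (4, 84, 84)
```

Consecutive states share three of their four frames, which is the overlapping behavior described above.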

A remark before we move on: one might be worried that the agent is throwing away a lot of information using the above procedure. In fact, it actually makes a lot of sense to do this. Seeing four consecutive frames without subsampling doesn’t give enough information to discern motion that well, especially after the downsampling process.

There’s one more obscure step that Google DeepMind did: they took the component-wise maximum over two consecutive frames, which helps DQN deal with the problem of how certain Atari games only render their sprites every other game frame. Is this maximum done over all the game screenshots, or only the subsampled ones (every fourth)? It’s the former.

Ultimately, we end up with the revised image below:

I’ve wrapped two consecutive screenshots in yellow boxes, to indicate that we’re taking the pixel-by-pixel (i.e. component-wise) maximum of the two images, which then get fed into as components for our states $s_i$. Think of each of the seven yellow boxes above as forming one of the $x_i$ frames. Then everything proceeds as usual.
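A sketch of this maximum step, assuming each $x_i$ is formed from a kept frame and the raw frame immediately before it, as the yellow boxes suggest (the exact pairing convention is an assumption of mine; implementations differ on the offsets):

```python
import numpy as np

def max_pooled_frames(frames, skip=4):
    """Each kept frame x_i is the component-wise (pixel-by-pixel) max of two
    consecutive raw frames: the kept frame and the raw frame just before it."""
    return [np.maximum(frames[j - 1], frames[j])
            for j in range(skip - 1, len(frames), skip)]

frames = [np.random.randint(0, 256, size=(84, 84)).astype(np.uint8)
          for _ in range(28)]
xs = max_pooled_frames(frames)   # 28 raw frames -> 7 pooled frames x_1..x_7
```

The seven $x_i$ frames would then be stacked into overlapping states exactly as before.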

Whew! That’s all I have to say regarding the pre-processing. If you want to be consistent with the NATURE paper, try to perform the above steps. There’s still a lot of ambiguity, unfortunately, but hopefully this blog post clarifies the major steps.

Aside from what I’ve discussed here, I want to bring up a few additional points of interest:

• OpenAI also uses the Atari 2600 games as their benchmark. However, when we perform an action using OpenAI, the action we choose is performed either 2, 3, or 4 frames in a row, chosen at random. These are at the granularity of a single, true game frame. To clarify, using our method above from the NATURE paper would mean that each action that gets chosen is repeated four times (due to four skipped game frames). OpenAI instead uses 2, 3, or 4 game frames to introduce stochasticity, but doesn’t use the maximum operator across two consecutive images. This means if I were to make a fifth image representing the frames that we keep or skip with OpenAI, it would look like the third image in this blog post, except the consecutive X’s would number 1, 2, or 3 in length, and not just 3. You can find more details in this GitHub issue.
• On a related note regarding stochasticity, ALE has an action repeat probability which is hidden to the programmer’s interface. Each time you call an action, the ALE engine will ignore (!!) the action with probability 25% and simply repeat the previous action. Again, you can find more discussion on the related GitHub issue.
• Finally, as another way to reduce stochasticity, it has become standard to use a human starts metric. This method, introduced in the 2015 paper Massively Parallel Methods for Deep Reinforcement Learning, suggests that humans play out the initial trajectory of the game, and then the AI takes over from there. The NATURE paper did something similar to this metric, except that they chose a random number of no-op actions for the computer to execute at the start of each game. That corresponds to their “no-op max” hyperparameter, which is described deep in the paper’s appendix.

Why do I spend so much time obsessing over these details? It’s about understanding the problem better. Specifically, what engineering has to be done to make reinforcement learning work? In many cases, the theory of an algorithm breaks down in practice, and substantial tricks are required to get a favorable result. Many now-popular algorithms for reinforcement learning are based on similar algorithms invented decades ago, but it’s only now that we’ve developed not just the computational power to run them at scale, but also the engineering tricks needed to get them working.

# Back to Blogging

Sorry for my silence in the last few weeks. I was completely preoccupied with preparation for a technical interview, which took place yesterday. I have now made public my interview preparation repository (on GitHub). It contains my code for some of the questions in the famous book Cracking the Coding Interview, 5th edition. Yes, the 5th edition, not the 6th … next time, I really should pay more attention to the version numbers of books before jumping the gun on buying them. But the 5th edition is still good, and I highly recommend using a recent version of the book to anyone who needs to prepare for a coding interview.

The repository only contains a fraction of what I wrote to prepare for the interview. I also had to review a lot of data structures concepts that I forgot. For instance, I never use linked lists in my research code (do any AI researchers use linked lists?!?). Consequently, I kept a Google Document with all of my typed notes from the Berkeley course, CS 61B. The best iteration of the class to review from is the Spring 2014 version, since that’s the most recent version taught by Professor Shewchuk, who has wonderful lecture notes publicly available. (Thank you so much!!)

There’s also been some other stuff going on besides my interview preparation. The most notable pertains to … politics. I was, of course, disappointed with the result at the federal level. I never thought that someone like Donald Trump could possibly become President of the United States, and it’s been painful for me to observe this happen given how strongly I opposed his candidacy from the start. I’m obviously not the only one here who opposed Trump; in the Berkeley computer science department, the faculty and students are nearly unanimously opposed to him, which says a lot given how computer science is supposed to be a nonpolitical field (as opposed to, say, economics or political science). I am not aware of a single strong Trump supporter in the entire department, though I would not be surprised if we had a few of them (who are remaining quiet, for now).

While thinking about Trump’s rise to the highest elected office in the world, I began to fantasize about being in his position. (Yes, I know it will never happen, so you don’t have to remind me.) What would a hypothetical Seita administration look like? I think it would be interesting to lay down some thoughts in a future blog post about the major positions I would take and what issues would be important to me. Hopefully I can write this up in detail over the December/January holiday break.

Then again, maybe it’s better for my education and my sanity if I stick to technical/research posts. Who knows?

# Going Deeper Into Reinforcement Learning: Understanding Q-Learning and Linear Function Approximation

As I mentioned in my review on Berkeley’s Deep Reinforcement Learning class, I have been wanting to write more about reinforcement learning, so in this post, I will provide some comments on Q-Learning and Linear Function Approximation. I’m hoping these posts can serve as another set of brief RL introductions, similar to Andrej’s excellent post on deep RL.

Last year, I wrote an earlier post on the basics of MDPs and RL. This current one serves as the continuation of that by discussing how to scale RL into settings that are more complicated than simple tabular scenarios, but nowhere near as challenging as, say, learning how to play Atari games from high-dimensional input. I won’t spend too much time trying to labor through the topics in this post, though. I want to save my effort for the deep variety, which is all the rage nowadays and a topic I hope to eventually write a lot about for this blog.

To help write this post, here are two references I used to quickly review Q-Learning with function approximation.

I read (1) entirely, and (2) only partly, since it is, after all, a full book (note that the authors are in the process of making a newer edition!). Now let’s move on to discuss an important concept: value iteration with function approximation.

## Value Iteration with Function Approximation

Value Iteration is probably the first RL-associated algorithm that students learn. It lets us assign values $V(s)$ to states $s$, which can then be used to determine optimal policies. I explained the algorithm in my earlier post, but just to be explicit, here’s a slide from my CS 287 class last fall which describes the procedure:

Looks pretty simple, right? For each iteration, we perform updates on our values $V_{i}^*$ until convergence. Each iteration, we can also update the policy $\pi_i^*$ for each state, if desired, but this is not the crucial part of the algorithm. In fact, since the policy updates don’t impact the $V_{i+1}^*(s)$ max-operations in the algorithm (as shown above), we can forget about updating $\pi$ and do that just once after the algorithm has converged. This is what Barto and Sutton do in their Value Iteration description.

Unfortunately, as suggested by the slide, (tabular) Value Iteration is not something we can do in practice beyond the simplest of scenarios, due to the curse of dimensionality in $s$. Note that $a$ is not usually a problem. Think of $9 \times 9$ Go, for instance: according to reference (1) above, there are $|\mathcal{S}| = 10^{38}$ states, but just $|\mathcal{A}| = 81$ actions to play. In Atari games, the disparity is even greater, with far more states but perhaps just two to five actions per game.

The solution is to obtain features and then represent states with a simple function of those features. We might use a linear function, meaning that

$V(s) = \theta_0 \cdot 1 + \theta_1 \phi_1(s) + \cdots + \theta_n\phi_n(s) = \theta^T\phi(s)$

where $\theta_0$ is a bias term, with 1 “absorbed” into $\phi$ for simplicity. This function approximation dramatically reduces the size of our state space from $|\mathcal{S}|$ into “$|\Theta|$”, where $\Theta$ is the domain for $\theta \in \mathbb{R}^n$. Rather than determining $V_{i+1}^*(s)$ for all $s$, our goal at each iteration is instead to update the weights $\theta^{(i)}$.

How do we do that? The most straightforward way is to reformulate it as a supervised learning problem. From the tabular version above, we know that the weights should satisfy the Bellman optimality conditions. We can’t use all the states for this update, but perhaps we can use a small subset of them?

This motivates the following algorithm: at each iteration $i$, with current weight vector $\theta^{(i)}$, we sample a very small subset of the states $S' \subset S$ and compute the one-step Bellman backup updates:

$\bar{V}_{i+1}(s) = \max_a \sum_{s'}P(s'\mid s,a)[R(s,a,s') + \gamma \hat{V}_{\theta^{(i)}}(s')]$

Then we find the next set of weights:

$\theta^{(i+1)} = \arg\min_\theta \sum_{s \in S'}(\hat{V}_{\theta}(s)-\bar{V}_{i+1}(s))^2$

which can be done using standard supervised learning techniques. This is, in fact, an example of stochastic gradient descent.
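Here is a minimal numpy sketch of the procedure above on a small random MDP. Two substitutions to be clear about: crude state-aggregation features stand in for hand-designed ones, and an exact least-squares solve stands in for the SGD step.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, n_groups, gamma = 20, 3, 5, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] = distribution over s'
R = rng.random((nS, nA))                       # rewards R(s, a) in [0, 1)
groups = rng.integers(n_groups, size=nS)       # assign each state to a group
Phi = np.eye(n_groups)[groups]                 # one-hot aggregation features;
                                               # their span already includes
                                               # constants, so no bias column
theta = np.zeros(n_groups)

for i in range(50):
    S_prime = rng.choice(nS, size=10, replace=False)  # small sampled subset S'
    V = Phi @ theta                                   # current \hat{V}_theta(s), all s
    targets = np.array([max(P[s, a] @ (R[s, a] + gamma * V) for a in range(nA))
                        for s in S_prime])            # one-step Bellman backups
    # "Supervised learning" step: fit theta to the backed-up targets
    # (an exact least-squares solve in place of SGD, for brevity).
    theta, *_ = np.linalg.lstsq(Phi[S_prime], targets, rcond=None)
```

With rewards in $[0,1)$ and $\gamma = 0.9$, the approximate values stay below the bound $R_{\max}/(1-\gamma) = 10$, as they should.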

Before moving on, it’s worth mentioning the obvious: the performance of Value Iteration with function approximation is going to depend almost entirely on the quality of the features (along with the function representation, i.e. linear, neural network, etc.). If you’re programming an AI to play PacMan, the states $s$ will be the game board, which is far too high-dimensional for tabular representations. Features will ideally represent something relevant to PacMan’s performance in the game, such as the distance to the nearest ghost, the distance to the nearest pellet, whether PacMan is trapped, and so on. Don’t neglect the art and engineering of feature selection!

## Q-Learning with Function Approximation

Value Iteration with function approximation is nice, but as I mentioned in my last post, what we really want in practice are $Q(s,a)$ values, due to the key fact that

$\pi^*(s) = \arg\max_a Q^*(s,a)$

which avoids a costly sum over the states. This will cost us extra in the sense that we have to now use a function with respect to $a$ in addition to $s$, but again, this is not usually a problem since the set of actions is typically much smaller than the set of states.

To use Q-values with function approximation, we need to find features that are functions of states and actions. This means in the linear function regime, we have

$Q(s,a) = \theta_0 \cdot 1 + \theta_1 \phi_1(s,a) + \cdots + \theta_n\phi_n(s,a) = \theta^T\phi(s,a)$

What’s tricky about this, however, is that it’s usually a lot easier to reason about features that are only functions of the states. Think of the PacMan example from earlier: it’s relatively easy to think about features by just looking at what’s on the game grid, but it’s harder to wonder about what happens to the value of a state assuming that action $a$ is taking place.

For this reason, I tend to use the following “dimension scaling trick” as recommended by reference (1) above, which makes the distinction between different actions explicit. To make this clear, imagine an MDP with two features and four actions. The features for state-action pair $(s,a_i)$ can be encoded as:

$\phi(s,a_1) = \begin{bmatrix} \psi_1(s,a_1) \\ \psi_2(s,a_1) \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} ,\quad \phi(s,a_2) = \begin{bmatrix} 0 \\ 0 \\ \psi_1(s,a_2) \\ \psi_2(s,a_2) \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} ,\quad \phi(s,a_3) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \psi_1(s,a_3) \\ \psi_2(s,a_3) \\ 0 \\ 0 \\ 1 \end{bmatrix} ,\quad \phi(s,a_4) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \psi_1(s,a_4) \\ \psi_2(s,a_4) \\ 1 \end{bmatrix}$

Why the need to use different features for different actions? Intuitively, if we didn’t do this (and kept $\phi(s,a)$ with just two non-bias features), then the actions would have no effect! But actions do have impacts on the game. To use the PacMan example again, imagine that the PacMan agent has a pellet to its left, so the agent is one move away from being invincible. At the same time, however, there could be a ghost to PacMan’s right! The action that PacMan takes in that state (LEFT or RIGHT) will have a dramatic impact on the resulting reward obtained! It is therefore important to take the actions into account when featurizing Q-values.
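A sketch of this featurization (my own helper function; here the $\psi_j$ are state-only features, matching how I usually apply the trick):

```python
import numpy as np

def phi(psi_s, action, n_actions):
    """Place the state features psi(s) into the block for `action`,
    zeros elsewhere, plus a trailing bias term of 1."""
    n = len(psi_s)
    out = np.zeros(n * n_actions + 1)
    out[action * n : (action + 1) * n] = psi_s
    out[-1] = 1.0
    return out

psi_s = np.array([0.5, -2.0])         # two state-only features psi_1, psi_2
v = phi(psi_s, action=2, n_actions=4) # features land in the a_3 block
```

With two features and four actions this produces the length-9 vectors shown above: all zeros except for the chosen action’s block and the bias.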

As a reminder before moving on: don’t forget the bias term! It is necessary to properly scale the function values independently of the features.

## Online Least Squares Justification for Q-Learning

I now want to sketch the derivation for Q-Learning updates to provide intuition for why it works. Knowing this is also important to understand how the training process works for the more advanced Deep-Q-Network algorithm/architecture.

Recall the standard Q-Learning update without function approximation:

$Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \underbrace{[R(s,a,s') + \gamma \max_{a'}Q(s',a')]}_{\rm sample}$

Why does this work? Imagine that you have some weight vector $\theta^{(i)}$ at iteration $i$ of the algorithm and you want to figure out how to update it. The standard way to do so is with stochastic gradient descent, as mentioned earlier in the Value Iteration case. But what is the loss function here? We need that so we can compute the gradient.

In Q-Learning, when we approximate a state-action value $Q(s,a)$, we want it to be equal to the reward obtained plus the (appropriately discounted) value of playing optimally thereafter. Denote this “target” value as $Q^+(s,a)$. In the tabular version, we set the target as

$Q^+(s,a) = \sum_{s'} P(s'\mid s,a)[R(s,a,s') + \gamma \max_{a'}Q^+(s',a')]$

It is infeasible to sum over all states, but we can get a minibatch estimate of this by simply considering the samples we get: $Q^+(s,a) \approx R(s,a,s') + \gamma \max_{a'}Q^+(s',a')$. Averaged over the entire run, these will average out to the true value.

Assuming we’re in the function approximation case (though this is not actually needed), and that our estimate of the target is $Q(s,a)$, we can therefore define our loss as:

$L(\theta^{(i)}) = \frac{1}{2}(Q^+(s,a) - Q(s,a))^2 = \frac{1}{2}(Q^+(s,a) - \phi(s,a)^T\theta^{(i)})^2$

Thus, our gradient update procedure with step size $\alpha$ is

\begin{align} \theta^{(i+1)} &= \theta^{(i)} - \alpha \nabla L(\theta^{(i)}) \\ &= \theta^{(i)} - \alpha \nabla \frac{1}{2}[Q^+(s,a) - \phi(s,a)^T\theta^{(i)}]^2 \\ &= \theta^{(i)} + \alpha (Q^+(s,a) - \phi(s,a)^T\theta^{(i)}) \cdot \phi(s,a) \end{align}

This is how we update the weights for Q-Learning. Notice that $\theta^{(i)}$ and $\phi(s,a)$ are vectors, while $Q^+(s,a)$ is a scalar.
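In code, one update is just a few lines (a sketch with made-up numbers; note the plus sign in the final step, since the chain rule brings in a factor of $-\phi(s,a)$):

```python
import numpy as np

def q_update(theta, phi_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """One Q-Learning SGD step on the loss (1/2)(Q^+ - phi^T theta)^2.
    The -phi(s,a) from the chain rule flips the sign, giving a plus."""
    target = reward + gamma * max_q_next   # sampled Q^+(s, a), a scalar
    td_error = target - phi_sa @ theta     # scalar prediction error
    return theta + alpha * td_error * phi_sa

theta = np.zeros(3)
phi_sa = np.array([1.0, 0.0, 1.0])
theta = q_update(theta, phi_sa, reward=1.0, max_q_next=0.0)
# target = 1.0, prediction = 0.0, so theta moves to 0.1 * phi_sa
```

After the step, the prediction $\phi(s,a)^T\theta$ has moved from 0 toward the target of 1, as desired.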

## Concluding Thoughts

Well, there you have it. We covered:

• Value Iteration with (linear) function approximation, a relatively easy-to-understand algorithm that should serve as your first choice if you need to scale up tabular value iteration for a simple reinforcement learning problem.
• Q-Learning with (linear) function approximation, which approximates $Q(s,a)$ values with a linear function, i.e. $Q(s,a) \approx \theta^T\phi(s,a)$. From my experience, I prefer to use the states-only formulation $\theta^T\phi(s)$, and then to apply the “dimension scaling trick” from above to make Q-Learning work in practice.
• Q-Learning as a consequence of online least squares, which provides a more rigorous rationale for why it makes sense, rather than relying on hand-waving arguments.

The scalability benefit of Q-Learning (or Value Iteration) with linear function approximation may sound great over the tabular versions, but here is the critical question: is a linear function approximator appropriate for the problem?

This is important because a lot of current research on reinforcement learning is applied towards complicated and high dimensional data. A great example, and probably the most popular one, is learning how to play Atari games from raw (210,160,3)-dimensional arrays. A linear function is simply not effective at learning $Q(s,a)$ values for these problems, because the problems are inherently non-linear! A further discussion of this is out of the scope of this post, but if you are interested, Andrew Ng addressed this in his talk at the Bay Area Deep Learning School a month ago. Kevin Zakka has a nice blog post which summarizes Ng’s talk.1 In the small data regime, many algorithms can achieve “good” performance, with differences arising in large part due to who tuned his/her algorithm the most, but in the high-dimensional big-data regime, the complexity of the model matters. Neural networks can learn and represent far more complicated functions.

Therefore, in my next post, I will introduce and discuss Q-Learning with neural networks as the function approximation. It will use the Atari games as a running example. Stay tuned!

1. It’s great that he wrote that, because all the YouTube videos with Ng’s talk (as of now) do not have the auto-captions option enabled. I’m not sure why, and despite the flaws of auto-captions, it would have helped a lot. I tried watching Ng’s talk and gave up after a few minutes since I was unable to understand the words he was saying.

# Hypothesis Testing for a Binomial Example

In STAT 210A class, we are now discussing hypothesis testing, which has brought back lots of memories from my very first statistics course (taken in my third semester of undergrad). Think null hypotheses, $p$-values, confidence intervals, and power levels, which are often covered in introductory statistics courses. In STAT 210A, though, we take a far more rigorous treatment of the material. Here’s the setting at a high level: we use data to infer which of two competing hypotheses is correct. To formalize it a bit: we assume some model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ and test:

• $H_0 : \theta \in \Theta_0$, the null hypothesis
• $H_1 : \theta \in \Theta_1$, the alternative hypothesis

While it is not strictly necessary that $\Theta_0 \cup \Theta_1 = \Theta$ and $\Theta_0 \cap \Theta_1 = \emptyset$, we often assume both for simplicity.

We “solve” a hypothesis testing problem by specifying some critical function $\phi$, defined as

$\phi(x) = \begin{cases} 0 &\text{accept } H_0\\ \gamma &\text{reject } H_0 \text{ with probability } \gamma \\ 1 &\text{reject } H_0 \end{cases}$

In other words, $\phi$ tells us the probability that we should reject $H_0$.

The performance of the test $\phi$ is specified by the power function:

$\beta(\theta) = \mathbb{E}_\theta[\phi(X)] = {\rm Pr}_\theta[\text{reject } H_0]$

A closely related quantity is the significance level of a test:

$\alpha = \sup_{\theta \in \Theta_0} \beta(\theta) = \sup_{\theta \in \Theta_0} \mathbb{E}_\theta[\phi(X)]$

The level $\alpha$ here therefore represents the worst chance (among all the $\theta$ possibilities) of falsely rejecting $H_0$. Notice that $\theta$ is constrained to be in the null hypothesis region $\Theta_0$! The reason why we care about $\alpha$ is because $H_0$ often represents the status quo, and we only want to reject it if we are absolutely sure that our evidence warrants it. (In fact, we technically don’t even “accept” the null hypothesis; in many courses, this is referred to as “failing to reject”.)

We may often resort to a randomized test, as suggested by the $\gamma$ in my definition of $\phi$ above. This is useful for technical reasons to achieve exact significance levels. Let’s look at an example to make this concrete. Suppose that $X \sim Bin(\theta, n=2)$, and that we are testing

• $H_0 : \theta = \frac{1}{2}$.
• $H_1 : \theta = \frac{3}{4}$ (so here the hypotheses do not partition).

And, furthermore, that we want to develop a test $\phi$ with a significance level of $\alpha=0.05$. (Note: yes, this is related to the abundant usage of 0.05 as a $p$-value threshold in many research articles.)

To test, it is convenient to use a likelihood ratio test, which has nice properties that I will not cover here. In our case, we have:

$L(X) = \frac{p(X \mid \theta = 3/4)}{p(X\mid \theta = 1/2)} = \frac{\left(\frac{3}{4}\right)^X\left(\frac{1}{4}\right)^{2-X}}{\left(\frac{1}{2}\right)^X\left(\frac{1}{2}\right)^{2-X}} = \frac{3^X}{4}$

where we have simply plugged in the densities for Binomial random variables and simplified. Intuitively, if this ratio is very large, then it is more likely that we should reject the null hypothesis because $\theta=3/4$ describes the data better.

We know that $L(X)$ can take on only three values (because $n=2$) and that under the null hypothesis (this is important!) $L(X)$ takes on the values $1/4, 3/4$, or $9/4$ with probabilities $1/4, 1/2,$ and $1/4$, respectively.

Knowing this, how do we design the test $\phi$ with the desired significance level? It must be the case that

\begin{align*} \mathbb{E}_\theta[\phi(X)] &= P(X=0)\phi(0) + P(X=1)\phi(1) + P(X=2)\phi(2) \\ &= \frac{1}{4}\phi(0) + \frac{2}{4}\phi(1) + \frac{1}{4}\phi(2) = 0.05 \end{align*}

(There is only one $\theta$ possibility here, so we do not need a “sup” term.)

By using the fact that a likelihood ratio test is defined by a cutoff point $k$ where if $L(X) > k$ we reject, if $L(X) < k$ we accept, and with equality we randomize, we see that our test must be $\phi(0)=\phi(1)=0$ and $\phi(2)=1/5$. If we were to equate this with our definitions above, this would be:

$\phi(x) = \begin{cases} 0 &\text{if } L(X) < \frac{9}{4} \text{, i.e., } L(X) = \frac{1}{4} \text{ or } L(X) = \frac{3}{4}\\ \frac{1}{5} &\text{if } L(X) = \frac{9}{4} \\ 1 &\text{if } L(X) > \frac{9}{4} \end{cases}$

with $k=9/4$. (The third case never happens here; I just added it to be consistent with our earlier $\phi$ definition.)

And that is why we use randomization. This also gives us a general recipe for designing tests that achieve arbitrary significance levels.
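To sanity-check the result, the randomized test can be simulated directly (a minimal sketch; `phi` and `rejection_rate` are my own helper names, not from the course): under $H_0$ the rejection rate should land near $\alpha = 0.05$, while under $H_1$ it estimates the power, which is $(3/4)^2 \cdot \frac{1}{5} = 0.1125$.

```python
import random

def phi(x):
    """The randomized test derived above: reject H0 only when X = 2,
    and then only with probability 1/5."""
    return 0.2 if x == 2 else 0.0

def rejection_rate(theta, trials=200_000, seed=0):
    """Estimate Pr_theta[reject H0] for X ~ Bin(n=2, theta) by simulation."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        x = sum(rng.random() < theta for _ in range(2))  # draw X ~ Bin(2, theta)
        if rng.random() < phi(x):                        # randomized rejection
            rejections += 1
    return rejections / trials

alpha = rejection_rate(0.5)    # significance level; should be near 0.05
power = rejection_rate(0.75)   # power under H1; should be near 0.1125
```

Note that the randomization only matters on the event $X = 2$; everywhere else the test deterministically accepts.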

# The Expectation of the Minimum of IID Uniform Random Variables

In my STAT 210A class, we frequently have to deal with the minimum of a sequence of independent, identically distributed (IID) random variables. This happens because the minimum of IID variables tends to play a large role in sufficient statistics.

In this post, I’ll briefly go over how to find the expected value of the minimum of IID uniform random variables. Specifically, suppose that $X_1,\ldots,X_n \sim {\rm Unif}(0,1)$ and let $T = \min_i X_i$. How do we find $\mathbb{E}[T]$? To compute this, observe that

\begin{align} P\left(\min_i X_i \le t\right) &= 1 - P\left(\min_i X_i \ge t\right) \\ &= 1 - P(X_1 \ge t, \ldots, X_n \ge t) \\ &= 1 - \prod_{i=1}^n P(X_i \ge t) \\ &= 1 - \prod_{i=1}^n [1 - P(X_i \le t)] \\ &= 1 - [1 - P(X_i \le t)]^n, \end{align}

so the next step is to determine $P(X_i \le t)$. Due to the IID assumption, it doesn’t matter which $X_i$ we use. Note also that to avoid adding cumbersome indicator functions, assume that $t \in [0,1]$.

The value $P(X_i \le t)$ is easier to compute because it directly relates to the cumulative distribution function (CDF) of a uniform random variable. For ${\rm Unif}(0,1)$, the CDF is simply $F_X(x) = x$ if $x \in (0,1)$. This means

$P\left(\min_i X_i \le t\right) = 1 - [1 - P(X_i \le t)]^n = 1 - [1 - t]^n,$

so that by differentiating with respect to $t$, the density function is $f_T(t) = n(1-t)^{n-1}$ (the notation here with the $T$ as the subscript is to denote the variable whose density is being defined).

The reason why we need the density is because of the definition of the expectation:

$\mathbb{E}[T] = \int_0^1 t f_T(t) dt = n \int_0^1 t(1-t)^{n-1}dt.$

To compute this, we integrate by parts. Set $u=t, du=dt$ and $dv = (1-t)^{n-1}dt, v = -(1-t)^{n}/n$. We get

\begin{align} \int_0^1 t(1-t)^{n-1}dt &= \left(-\frac{t(1-t)^n}{n} \Big|_{t=0}^{t=1}\right) - \int_0^1 -\frac{(1-t)^n}{n}dt \\ &= \left[-\frac{(1)(1-(1))^n}{n} - -\frac{(0)(1-(0))^n}{n} \right] + \frac{1}{n} \int_0^1 (1-t)^ndt \\ &= \frac{1}{n} \left[ -\frac{(1-t)^{n+1}}{n+1}\Big|_{t=0}^{t=1}\right] \\ &= \frac{1}{n} \left[ -\frac{(1-(1))^{n+1}}{n+1}- -\frac{(1-(0))^{n+1}}{n+1}\right] \\ &= \frac{1}{n} \frac{1}{n+1}. \end{align}

Combining this with the extra $n$ factor we left out (you didn’t forget that, did you?) we get $\mathbb{E}[T] = \frac{1}{n+1}$. Notice that as $n \to \infty$ the expected value of the minimum of these uniform random variables goes to zero. In addition, this expectation is always in $(0,1/2]$ for $n \ge 1$. Thus, the answer passes the smell test and seems reasonable.
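The closed-form answer is also easy to verify with a quick Monte Carlo simulation (a sketch; the helper name is mine):

```python
import random

def mean_min_uniform(n, trials=100_000, seed=0):
    """Monte Carlo estimate of E[min(X_1, ..., X_n)] for IID X_i ~ Unif(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(n))
    return total / trials

# The closed form says E[T] = 1 / (n + 1).
estimates = {n: mean_min_uniform(n) for n in (1, 2, 5, 10)}
```

Each estimate should sit within simulation noise of $1/(n+1)$, and the decrease toward zero as $n$ grows is easy to see from the dictionary.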

# GSI-ing for Berkeley's Deep Neural Networks Class

In news that might be welcome to the students taking CS 294-129, “Designing, Visualizing, and Understanding Deep Neural Networks” this semester, I got recruited as the second GSI (Graduate Student Instructor, a.k.a. Teaching Assistant) for the class.

Why do I say this might be good news for the students? First, I hope they realize how much I enjoy teaching. I like to provide hints (but not complete solutions) to students when they need it, because it feels good when students are able to learn and benefit from my presence. I’m also tolerant of students who have to come to office hours frequently, because quite frankly, I was one of those students for a brief time in undergrad (a habit I have thankfully since outgrown).

Probably the biggest reason why I think students will like my new GSI position is that, in news that will surprise no one, the class was substantially over-enrolled at the start of the semester. My guess is that we had around a hundred students1 who wanted to take the class … but no GSI. Fortunately, the class got one GSI a few weeks into the semester, which meant everyone (I hope) on the waitlist got in. Then I became a GSI. The irony is that both of us started as normal students. I like to think of it as “getting promoted as a GSI.” With two of us, it means that students will get their homeworks graded faster. The first one was already due a week ago, and the second one is due next Friday. Gulp.

So what is this class about? It was a last-minute addition to the curriculum for the semester, but fortunately, the class is based on CS 231n from Stanford University, which has already been offered twice. This means we can borrow their three homework assignments, as well as the style of their midterm questions (though I hope we don’t use the exact same questions …). In addition, since we are on a semester system and Stanford is on a quarter system, we can cover slightly more material2, such as Deep Reinforcement Learning, Monte Carlo methods, and current research directions with Deep Learning, which do not appear to be on the CS 231n syllabus.

Incidentally, by looking at the Stanford CS 231n offering for Winter 2016, they had three instructors and eleven teaching assistants. There were apparently 200 final projects for the class, and since students could pair up for those, that corresponds to about 300-400 students total (wow!). We certainly don’t have nearly as many students, but there still is not enough course staff to make the student-to-staff ratio match.

Our homework assignments, as mentioned earlier, are borrowed from the Stanford version. I did the first one and am almost done with the second. The assignments are based on Jupyter Notebooks (awesome!) and done entirely in Python. And I’d like to give my congratulations to the CS 231n staff, because the assignments are awesome. They are challenging enough to stimulate the deepest parts of my brain (pun intended), but not so hard that I have to constantly ask someone for help, which would be a bit embarrassing for a GSI. One of the things I really like about the assignments is that they give me a taste for how real deep network systems are implemented. In my undergraduate machine learning class, I implemented backpropagation for a very simple, fully connected neural network, but that was a “naive” version, with me iterating through arrays and following a formulaic definition. That worked out fine, but in practice, neural network code must be modular enough (and vectorized!) to allow for effective training on deep networks. Oh, and did I mention that most networks for image recognition are not fully connected but convolutional nets? That makes the implementation much more challenging!

I’m excited to be joining the course staff for CS 294-129, and I hope that I and the rest of the students will enjoy the course. Probably the one downside to being a GSI is that I may have to face student complaints (moreso than when I was TA-ing in undergrad) since GSIs have more grading responsibility.

1. It’s a bit tough to say how many students we have exactly, since some join after the start date and others drop. The Piazza website says that there are 165 people enrolled, but that also includes students who have dropped the class.

2. Our textbook is based on the online book by Goodfellow et al. I was hoping to read through the full book at some point, so this class gives me a great opportunity to do that!

# On Baba Yetu and the Role Music Plays in Life

No, this isn’t going to turn into a “song of the week” blog post series. But I wanted to mention this particular song along with an accompanying discussion on the role of music in my life.

I should start with the rather obvious fact that I am not a musical person. I was not raised in a musical family. I remember playing the baritone back in elementary school for a year or two, and I quickly backed out as soon as I finished my “music requirements” for school.

As additional evidence of my lack of musical interest, knowledge, and expertise, in my six or seven years of driving, I have never had the music on, ever. I also do not think I have touched an iPod in my life, and I say “think” here because of the possibility that I may have had to accidentally brush one of them when cleaning up after another student’s mess on his/her desk (c’mon, please be tidy!). My use of headphones is restricted to when I show up to talk to my audiologist, who provides me with headphones to wear as she checks my hearing levels. Does that count?

That’s not to say that people with hearing aids can’t listen to music at all. For instance, I hear music all the time when I watch movies or play games. There are also special earplugs that can work with hearing aids and have hooks to keep them in place.

One can also try to use earbuds “normally.” I know one deaf person who plugs in earbuds directly into his ears. This is only possible, I believe, for those who still have their natural hearing (i.e., without cochlear implants). I could do that because I have some natural hearing remaining. If someone is talking within two inches of my left ear (but not the right), I could probably understand that person. Unfortunately, I am afraid of putting earbuds directly into my ear because I worry that it will damage whatever natural hearing I have.

Recently, I have been trying to listen to more music. In part this is because I don’t want to embarrass myself in front of others when the topic of music comes up, but it’s also because I think I can use music for relaxation or focus. When I work at home, I frequently turn to relaxing nature sounds such as the ones in this YouTube video, but that’s technically not a “real” song.

An example of an actual song that I’ve come to really like over the past few years is Baba Yetu. Here’s a Wikipedia link, and here’s a YouTube link. For those who don’t know, it is the theme music for the popular and award-winning Civilization IV computer game. The composer, Christopher Tin, used to be Stanford roommates with the lead game developer, Soren Johnson, so that’s how the song ended up there. On the surface, this seems like blatant favoritism, but it was a pretty genius move for both sides because Baba Yetu became the first video game song to win a Grammy Award.

I was a huge Civilization IV fan in high school, and I remember a few times when I allowed the opening music to play completely instead of skipping through it and going straight to the gameplay, as is my custom. I am not sure how to quantify the quality of music, but I can agree with many other listeners that Baba Yetu intuitively has some appealing mystical or supernatural quality to it. I liked it enough in high school so that, during my 12th grade Advanced Placement English Class, when the students presented their favorite book, movie, and song to the class to introduce themselves to each other, Baba Yetu was the song that I presented. Incidentally, my book was Outliers, and my movie was The Dark Knight. And yes, that was AP English … as distressing as that sounds.

I would be remiss if I did not conclude this discussion with the news that Christopher Tin will also be composing the opening music for Civilization VI, slated for an October release! It looks like the designers of Civilization VI are learning from the mistakes of Civilization V after all! The gameplay for Civilization VI, and now the music, look to be far more promising than those of the lackluster Civilization V game. (For more on why I believe this in terms of gameplay, read this article by someone who I agree with a lot regarding computer games.) Unfortunately, I won’t become as good at it as I was with Civilization IV because I can no longer allow computer games to take over my life.

But I sure will try to spare a few minutes to listen to the music for Civilization VI.

# Concluding Thoughts on Reviewing for NIPS 2016

I mentioned a few months ago that I volunteered to be a reviewer for NIPS 2016. The NIPS decisions have been out for a few weeks now and a detailed account of the review process has been published.

Overall, I think it was a good experience for me to finally start reviewing academic papers, and I’m really happy that NIPS has so much transparency to help me better understand how it works behind the scenes. From looking at the review process, a number of things immediately caught my attention.

First, wow, NIPS is really large. I would not have expected about 6000 total authors to have submitted 2406 total papers for review. And NIPS used to be a niche conference a few decades ago, right? I wonder how much larger it can realistically become while still being an academic conference. In terms of reviewers, there were 3424 reviewers, again quite an eye-popping number. Also, I was correct on the senior vs volunteer reviewers: it looks like the system was designed to provide six reviewers per paper, with three seniors and three volunteers. I was naturally one of the volunteers. I also noticed that reviewers submitted an average of 4.05 reviews, so my set of papers to review (four) was standard.

In the past, NIPS papers used to be evaluated on a single scale from 1 to 10. This year, the process was more sophisticated with 1-5 point scales for (a) technical quality, (b) novelty, (c) impact, and (d) writing/clarity. After having some experience with the new scoring scales, I have some mixed feelings about them. This was ostensibly done to make NIPS more scalable, but I’m worried about the workload of the Area Chairs, who have to now spend more time analyzing the reviews instead of looking at a simple numeric score. If the Area Chairs can effectively work with this new scoring scale, then I suppose it is OK. I actually like how it helps me to think more clearly about “novelty” versus “impact.”

In general, (d) is the easiest to determine, especially for me and my “unique” typo-catching abilities. Papers with substantial typos and problems in their presentation should go straight in the rejection pile; for one of the papers I reviewed, nearly every reviewer heaped criticism on the writing/clarity, making it an easy rejection. I’m not sure why I spent so much time on that paper, carefully pointing out the flaws, when I should have quickly moved on. I was told by a faculty member that “spending two hours for a review is a good benchmark, and you can spend a little more if the paper looks publishable.”

Gulp. I spent way more than two hours per paper, even on the one that should have been an obvious rejection. Way more.

Some reviewers were smarter (or lazier?). For each paper I reviewed, there was at least one reviewer who gave the dreaded two-line review. Interestingly enough, from my really small sample size, faculty gave more of those reviews than postdocs and graduate students. (Time for faculty to assign these reviews to their graduate students, I suppose.) While NIPS follows the double-blind reviewing system, paper reviewers could see the identity and reviews of the other reviewers. We could also give private comments in our original reviews, visible only to other reviewers and the Area Chairs. One of the most distressing aspects of the system is seeing almost every reviewer saying a variant of: “I did not completely check the proofs.” It’s clearly impossible even for a group of six to check all the technical aspects of papers. For this to even happen, at minimum, the six reviewers would have to somehow divide up the task, such as reviewer 1 checking Theorem 1, reviewer 2 checking Theorem 2, and so on.

Fortunately, I could not see the other reviews until after I had submitted mine – a good thing!! Once the reviewing period finished, the conference website provided a private forum where reviewers and Area Chairs could discuss aspects of the papers in depth, but I was kind of fried after this and busy with too much work. A few people labeled as “meta reviewers” (I think these were Area Chairs) tried to encourage some discussion, but most reviewers felt the same as I did and did not have the desire for extensive conversations. In a perfect world, there would especially be a lot of discussion about the author rebuttals, which is a key part of the NIPS reviewing process. Unfortunately, it was really hard for me to motivate myself to read the rebuttals carefully, and I ultimately did not change any of my scores.

From the list of accepted papers, one out of the four papers I reviewed got accepted, which aligns well with the overall acceptance rate for NIPS (this year, it was 568/2406 = 0.236). As an experiment, the NIPS chairs asked reviewers to provide an ordered ranking of only the papers they reviewed. The paper I rated the highest of my four was indeed the one that got in, which is extra assurance that my kind of reviews are not totally out of whack with what others think.

Overall, I think my favorite part of reviewing is when I get to compare my reviews with other people and to read author rebuttals. I generally like it when my reviews hit the same core topics as other reviews, and when authors (in their rebuttals) thank me for catching errors. I hope to continue giving feedback on research papers, whether or not it is part of an official reviewing system.

# On Risk and the Bias-Variance Tradeoff (From Keener 3.1)

I am currently reading the textbook Theoretical Statistics by Robert W. Keener, and I thought I would write up some notes on Chapter 3, Section 1 of the book. Chapter 3 is titled “Risk, Sufficiency, Completeness, and Ancillarity,” with 3.1 specifically being about the notion of risk.

So what is risk? To understand this, we need to describe the problem setting. The goal in statistical inference is about learning a parameter vector $\theta$ from data, where in general the data $X$ consists of $n$ points $X_1, \ldots, X_n$ with $X_i \in \mathbb{R}^k$ for some dimension $k$.

A statistic of the data is a function of the data which provides some information about it. Keener denotes data statistics as $\delta(X)$. Probably the canonical example of a statistic is the mean of the data, or $\delta(X) = (X_1 + \cdots + X_n)/n$. It is worth pointing out that this only works if the $X_i$ are random vectors, and in addition, that the mean function is many-to-one. That is, we can have different datasets that result in the same mean. Thus, the statistic $\delta$ here only provides partial information on the data.

Our problem setting also involves some function of the parameter, which Keener denotes as $g(\theta)$. The goal of estimation is to find a statistic $\delta$ such that $\delta(X)$ is “close” to $g(\theta)$. It may not be totally clear why we introduce $g$, but it gives us flexibility; for instance, we might only be interested in one component of $\theta$ so $g(\theta) = \theta_i$. For now, I find it easiest to just think of $g$ as the identity, so $g(\theta) = \theta$ and we want our statistic to be close to the real parameter, but throughout this post, I will be sloppy and use $g(\theta)$ and $\theta$ interchangeably (in my defense, Keener does the same thing).

Let’s clarify the above with the boring example of flipping a coin that can land heads with some probability $\theta \in [0,1]$. If we flip the coin 100 times, with each trial independent of the others, then the random variable $X$ representing the total number of heads follows a binomial distribution. (Another way to express $X$ is therefore $X = X_1 + \cdots + X_{100}$ where $X_i = 1$ if the trial resulted in heads, and 0 otherwise.) If we want to get a statistic $\delta$ that maps the data to an estimate of $g(\theta) = \theta$, what would be a “good” statistic? The statistic $\delta(X) = X/100$ certainly makes sense, since it is the natural proportion of samples! But is this the “best” statistic we have?

To quantify the notion of a “best” statistic, we turn to the concept of risk, which in turn relies on the concept of a loss function $L$. The loss function $L(g(\theta), \delta(X))$ represents the (non-negative) penalty in estimating $g(\theta)$ (which again, I treat as $\theta$ here) with $\delta(X)$. Since the data $X$ is random, $L(g(\theta), \delta(X))$ may sometimes be large with unlucky outcomes of $X$, so we use expectations, which is where the risk $R$ comes from:

$R(g(\theta), \delta(X)) = \mathbb{E}[L(g(\theta), \delta(X))]$

Note that Keener uses $\mathbb{E}_{\theta}$ to indicate expectations with $\theta$, but I find that a little misleading because what is random here is really $X$, not $\theta$. We assume $\theta$ is fixed … but for Bayesian risk the situation is different. Hence, the above is sometimes referred to as the Frequentist risk.

Let’s continue with the coin-flipping example from earlier. Probably the most common loss function is the squared error loss, or $L(g(\theta), \delta(X)) = (g(\theta)-\delta(X))^2$. What, then, is the risk associated with the estimator $\delta(X) = X/100$? We apply the definition:

\begin{align} R(\theta, \delta) &= \mathbb{E}\left[\left(\theta - \frac{X}{100}\right)^2\right] \\ &= \mathbb{E}\left[\theta^2 - 2\theta \frac{X}{100} + \frac{X^2}{100^2}\right] \\ &= \theta^2 - \frac{2\theta}{100}\mathbb{E}[X] + \frac{1}{100^2}\mathbb{E}[X^2] \\ &= \theta^2 - 2\theta^2 + \frac{100\theta(1-\theta) + 100^2\theta^2}{100^2} \\ &= \frac{\theta (1-\theta)}{100} \end{align}

where we use $\mathbb{E}[X] = 100 \theta$ according to the definition of the expectation of a binomial random variable. To determine $\mathbb{E}[X^2]$ we follow the familiar technique of using the formula for the variance:

${\rm Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = 100 \theta (1-\theta)$

Solving for $\mathbb{E}[X^2]$, we obtain

$\mathbb{E}[X^2] = 100 \theta (1-\theta) + 100^2\theta^2$

Now let’s consider something interesting: what if we use a different estimator, defined as $\delta(X) = (X+3)/106$? Intuitively, this looks to provide more “smoothing” of the $\theta$ estimate to get it closer to $1/2$. Let’s compute the risk, again assuming the squared loss:

\begin{align} R(\theta, \delta) &= \mathbb{E}\left[\left(\theta - \frac{X+3}{106}\right)^2\right] \\ &= \theta^2 - \frac{2\theta}{106}\mathbb{E}[X+3] + \frac{1}{106^2}\mathbb{E}[(X+3)^2] \\ &= \theta^2 - \frac{2\theta (100\theta + 3)}{106} + \frac{\mathbb{E}[X^2] + 6\mathbb{E}[X] + 9}{106^2} \\ &= \frac{-94\theta^2 - 6\theta}{106} + \frac{(100^2-100)\theta^2 + 700 \theta + 9}{106^2} \\ &= \frac{-64\theta^2 + 64\theta + 9}{106^2} \\ &= \frac{(-8\theta + 9)(8\theta + 1)}{106^2} \end{align}

How do we compare the risk of the previous two estimators? It’s not clear at first glance, so we need to plot the curves. Fortunately, Keener already did this with the following image:

In the above, he uses a third estimator, $\delta_1(X) = (X+3)/100$, and also, $\delta_0(X) = X/100$ is the original estimator and $\delta_2(X) = (X+3)/106$.

It is intuitive that $\delta_1$ is a poor estimator of $\theta$ because it adds 3 (i.e., like adding “three heads”) without doing anything else. In fact, we can rigorously say that $\delta_1$ is dominated by both $\delta_0$ and $\delta_2$ because, throughout the entire support for $\theta$, the risk associated with $\delta_1$ is higher than the other estimators. The comparison between $\delta_0$ and $\delta_2$ is less straightforward; when the true $\theta$ value is near $1/2$, we’d prefer $\delta_2$ for instance, but the reverse is true near 0 or 1.
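These comparisons can also be checked numerically without the plot: since $X$ takes only the values $0, \ldots, 100$, the squared-error risk of any estimator can be computed exactly by summing over the binomial pmf. A quick sketch (helper names are mine):

```python
from math import comb

def risk(theta, delta, n=100):
    """Exact squared-error risk of the estimator delta under X ~ Bin(n, theta),
    computed by summing (theta - delta(x))^2 against the binomial pmf."""
    return sum(
        comb(n, x) * theta**x * (1 - theta)**(n - x) * (theta - delta(x))**2
        for x in range(n + 1)
    )

d0 = lambda x: x / 100          # delta_0
d1 = lambda x: (x + 3) / 100    # delta_1
d2 = lambda x: (x + 3) / 106    # delta_2

# Near theta = 1/2, delta_2 wins; near theta = 0, delta_0 wins;
# delta_1 loses in both cases, consistent with it being dominated.
at_half = (risk(0.5, d0), risk(0.5, d1), risk(0.5, d2))
near_zero = (risk(0.05, d0), risk(0.05, d1), risk(0.05, d2))
```

For instance, `at_half` matches the closed forms above: $\theta(1-\theta)/100 = 0.0025$ for $\delta_0$ and $25/106^2 \approx 0.00222$ for $\delta_2$.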

To wrap up this discussion, I’d like to connect the concept of risk with the bias-variance tradeoff, which introduces two key concepts to know (i.e., bias and variance). The bias of an estimator $\delta(X)$ is defined as $b(\theta, \delta) = \mathbb{E}[\delta(X)] - \theta$, where here we write $\theta$ instead of $g(\theta)$ for simplicity. The variance is simply ${\rm Var}(\delta(X))$ and usually involves computations based on the variance of $X$ directly.

Problem 1 in Chapter 3 of Keener introduces the very interesting fact that the risk of an arbitrary estimator $\delta$ under squared error loss is ${\rm Var}(\delta(X)) + b^2(\theta, \delta)$, or more intuitively, $V + B^2$. In fact, I earlier saw this result stated in one of the two papers I discussed in my post on minibatch MCMC methods. In Section 3 of Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget, the paper states without proof:

The risk can be defined as the mean squared error in $\hat{I}$, i.e. $R = \mathbb{E}[(I - \hat{I})^2]$, where the expectation is taken over multiple simulations of the Markov chain. It is easy to show that the risk can be decomposed as $R = B^2 + V$, where $B$ is the bias and $V$ is the variance.

The first thing I do when I see rules like these is to try an example. Let’s test this out with one of our earlier estimators, $\delta_0(X) = X/100$. Note that this estimator is unbiased, which means the bias is zero, so we simply need to compute the variance:

${\rm Var}(\delta_0(X)) = {\rm Var}\left(\frac{X}{100}\right) = \frac{100 \theta(1-\theta)}{100^2} = R(\delta_0,\theta)$

and this precisely matches the risk from earlier!
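As a numeric check of the decomposition on a *biased* estimator, we can compute the bias, variance, and risk of $\delta_2(X) = (X+3)/106$ exactly at a fixed $\theta$ by summing over the binomial pmf, and confirm $R = V + B^2$ (a sketch; helper names are mine):

```python
from math import comb

def binom_pmf(theta, n=100):
    """Pmf of X ~ Bin(n, theta) as a list indexed by x."""
    return [comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)]

theta = 0.3
delta2 = lambda x: (x + 3) / 106
pmf = binom_pmf(theta)

mean = sum(p * delta2(x) for x, p in enumerate(pmf))               # E[delta(X)]
second = sum(p * delta2(x)**2 for x, p in enumerate(pmf))          # E[delta(X)^2]
variance = second - mean**2                                        # V
bias = mean - theta                                                # B
risk = sum(p * (theta - delta2(x))**2 for x, p in enumerate(pmf))  # R

# risk should equal variance + bias**2 (up to floating-point error)
```

Here the bias is positive, as expected from the smoothing toward $1/2$, and the identity holds to floating-point precision.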

To prove that the risk can be decomposed this way in general, we can do the following (using $\theta$ in place of $g(\theta)$ and $\delta$ in place of $\delta(X)$ to simplify notation):

\begin{align} R(\theta, \delta) &= \mathbb{E}[(\theta - \delta)^2] \\ &= \mathbb{E}[((\theta - \mathbb{E}[\delta]) - ( \delta - \mathbb{E}[\delta]))^2] \\ &= \mathbb{E}[(\theta - \mathbb{E}[\delta])^2 + (\delta - \mathbb{E}[\delta])^2 - 2(\theta-\mathbb{E}[\delta])(\delta-\mathbb{E}[\delta])] \\ &= \mathbb{E}[(\theta - \mathbb{E}[\delta])^2] + \mathbb{E}[(\delta - \mathbb{E}[\delta])^2] - 2 (\theta-\mathbb{E}[\delta])\mathbb{E}[(\delta-\mathbb{E}[\delta])] \\ &= \underbrace{(\theta - \mathbb{E}[\delta])^2}_{B^2} + \underbrace{\mathbb{E}[(\delta - \mathbb{E}[\delta])^2]}_{V} \end{align}

as desired! In the above, we expanded the terms (also by cleverly adding zero), applied linearity of expectation, applied the fact that $\mathbb{E}[\mathbb{E}[\delta(X)]]$ is just $\mathbb{E}[\delta(X)]$, and also used that $\mathbb{E}[\delta - \mathbb{E}[\delta]] = 0$.