My Blog Posts, in Reverse Chronological Order

subscribe via RSS or by signing up with your email here.

Five Years of Blogging

Aug 20, 2016

I survived. I have now been blogging for five years.

Technically, it’s been five years and 20 days. Sorry. As nice as it would have been to nail down five years exactly, I had some urgent work to do in the beginning of August. But despite all the constant work that’s being thrown at me (it’s my fault) I have somehow been able to keep cranking out at least one blog post each month for five years1. Given how tough it is to maintain a long commitment to nonessential writing projects, I wonder if any of my early readers (from 2011 to 2013) thought I would last this long.

This is a message towards all readers: I hope you enjoy some of what I write. I have been pleasantly surprised by the amount of comments and emails I get about the blog. These come almost every week, and it’s especially nice when the sender is someone I know or someone who I can easily meet in person most days. It gives us a chance to talk, regardless of whether it’s related to what I wrote or not. I also know that the vast majority of readers won’t be commenting or emailing, so the blog audience is much larger than I think. What’s interesting is that while a lot of people seem to like reading blogs, they don’t like writing one. My efforts in encouraging other people to start their own blogs has failed catastrophically. I think my current success rate is 0 for 20.

Incidentally, I don’t have to follow Rule #1 of the Internet: “don’t read comments.” I hope that I — and the readers — will never have to follow that rule2.

Since this is a special moment in the blog’s history, I decided it was high time to do a few major updates that I had been putting off for ages.

  1. I updated the “About” and “New? Start Here” pages, which have been begging for a do-over ever since this blog migrated from Wordpress to Jekyll. The former was pretty easy to fix; for the latter, I decided to try and provide a few examples of my favorite posts. With over 200 posts so far (the exact number, including this one, is 215) I had a wide range to choose from.

  2. I have been rewriting some of my older political posts (see here for one example). Every now and then, I do these changes because I get hives when I read some of my earlier writings. I like to have a reasonably high standard of writing, or at least to what is generally possible within the constraints of a blog post and lack of any editors helping me.

    A note regarding political posts: I’ll do my best to avoid these for the near future. Why? I’m getting weary of politics, first of all. Second, almost anyone can blog about poltics with varying degrees of quality, meaning that this blog wouldn’t be as unique as I want it to be. Third, politics causes unnecessary conflict between people with different beliefs, and I am not interested in burning any bridges with the people who I work with. Of course, given the near-uniformity liberalness of the Berkeley faculty and student body (and EECS is no exception in this regard), I can probably get away with writing political posts just by bashing conservatives. On the other hand, I think a lot of people who know me are aware of my self-proclaimed “centrist Democrat” leanings, so it doesn’t quite seem right to do that.3

    Regardless of what I believe, whether it is politics or otherwise, I am an even firmer believer in that we must keep an open mind. If I publish blog posts about particularly sensitive subjects, it does not mean I have exhaustively researched the area or that my opinion is fixed. I take pride in the amount that I read nowadays, so if there’s an issue I care about, chances are I’m trying to read up a lot to learn more.

  3. I also updated the DeepRL post related to the class I took last year. I really want to write more about DeepRL here, as that’s an exciting field. It just takes so darn long to write high quality posts like Andrej Karpathy’s recent reinforcement learning guide. I wrote a lot of blog posts related to my notes when I was studying for the AI prelims last summer, and those took a really long time to polish. My goal is to have several smaller posts that together cover a lot of DeepRL and which follow Andrej’s posting style. I have my first reinforcement learning post in draft stage (roughly 50% done) and hope to make it public within two weeks. Or maybe two months, who knows?

  4. I changed this blog to use https instead of http, since I want to use the more secure protocol, and it turns out that the MathJax (which renders LaTeX math for my posts) does not appear with https. I needed to do several things to get this set up properly:

    • I followed the directions to enforce https on GitHub Pages. It’s literally one checkbox to click, very easy!

    • In the _config.yml file, I changed the site URL to This is needed because my HTML code to display figures uses the site URL variable.

    • In two places, I had to change the MathJax script so that the link they use has https instead of http. The two locations to add this are in index.html and _layouts/post.html. The former is so that MathJax is visible on the homepage, where one can scroll to see many posts, and the latter is for individual posts when clicking on their titles (or from the archives). Here’s what the script segment looks like:

      <script type="text/javascript" 
    • Finally, I migrated all comments from http URLs to ones with the same name, but with https instead. This is necessary because Disqus links comments to specific URLs, so if I switched back and forth between http and https, I would see two different comment threads. Fortunately, someone already wrote a great guide on how to use the built-in Disqus URL mapping, and it was straightforward to follow.

    Whew! I think that’s it.

  5. I also modified several posts to fix them based on some of the lingering side effects of the Wordpress-to-Jekyll migration. For instance, I find centered text paired with left-centered images to look egregiously sloppy. The importer I used for the migration process automatically formed image tags, but it did not center them. For that, I have to use the following HTML code:

    <p style="text-align:center;">
        <img src=""> 

    where image_name.png should be replaced with whatever image name I am using.

    I also observed that the importer didn’t always correctly bold or italicize the text. Sometimes the asterisks I need for those would be one space off, but that spacing would throw off the formatting entirely. Ugh. This is always going to be a work in progress, I suppose, especially because I can’t easily write a script to automate it.

So, what’s the plan for the future? At some point I will threaten to quit blogging but not right now as there are many things I still want to write about. With some heavy work ahead of me, I really want to aim for more technical posts to help me learn about challenging topics. I want to write some occasional “guides on X” where X is a concept in computer science, to help out the community.

Last thing: for years, I’ve been trying to get a new name change for the blog, as “Seita’s Place” has to be the lamest blog name of all time. Unfortunately, I can’t come up with anything better.

That’s all for now. Thanks for reading!

  1. As a general rule, if there hasn’t been a blog post in a month and it’s approaching the 25th day of it, I’m scrambling behind the scenes to get something done. 

  2. Well, all right, if I can become a high-ranking politician then the tradeoff might be worth it, but what are the odds of me entering such a toxic field? 

  3. (Sarcasm) I expect to be confronted by a bunch of angry Bernie Sanders supporters when I get back to Berkeley. 

A Useful Matrix Inverse Equality for Ridge Regression

Aug 5, 2016

I was recently reading Stephen Tu’s blog and stumbled upon his matrix inverse equality post which brough back some memories. Here’s the equality: let \(X\) be an \(n \times d\) matrix, let \(\lambda > 0\), and let \(I_k\) represent the \(k \times k\) identity matrix. Then

\[X(X^TX + \lambda I_d)^{-1}X^T = XX^T(XX^T + \lambda I_n)^{-1}\]

Note that both sides are left-multipled by \(X\), so for the purposes of checking equality, there’s no need to have it there, so instead consider

\[(X^TX + \lambda I_d)^{-1}X^T = X^T(XX^T + \lambda I_n)^{-1}\]

I was asked to prove this as part of a homework assignment in my CS 281A class with Professor Ben Recht a while back. Stephen Tu already has an excellent proof based on Singular Value Decomposition, but I’d like to present an alternative proof that seems even simpler to me. I also want to explain why this equality is important.

First, start with \(\lambda X^T = \lambda X^T\). Then left-multiply the left hand side (LHS) by \(I_d\) and right-multiply the right hand side (RHS) by \(I_n\) to obtain

\[\lambda I_d X^T = \lambda X^TI_n\]

Note that this doesn’t change the equality, because multiplying a matrix with the identity will not change it, regardless of the multiplication ordering. As a sanity check, the LHS is \((d \times d)\times (d \times n)\) and the RHS is \((d\times n)\times (n\times n)\) and both result in a \((d \times n)\)-dimensional matrix so the dimensions match up. Next, add \(X^TXX^T\) to both sides and get

\[X^TXX^T + \lambda I_d X^T = X^TXX^T + \lambda X^TI_n\]

We now “distribute out” an \(X^T\), though the particular \(X^T\) to use is different on both sides. We obtain

\[(X^TX + \lambda I_d)X^T = X^T(XX^T + \lambda I_n)\]

Next, left-multiply both sides by \((X^TX + \lambda I_d)^{-1}\) and right-multiply both sides by \((XX^T + \lambda I_n)^{-1}\):

\[(X^TX + \lambda I_d)^{-1}(X^TX + \lambda I_d)X^T(XX^T + \lambda I_n)^{-1} = (X^TX + \lambda I_d)^{-1}X^T(XX^T + \lambda I_n)(XX^T + \lambda I_n)^{-1}\] \[X^T(XX^T + \lambda I_n)^{-1} = (X^TX + \lambda I_d)^{-1}X^T\]

Thus proving the identity.

Now, you can leave it like that and in some cases you’ll be OK. But strictly speaking, you should show that the two inverses above actually exist before using them. When I was writing up this for my homework, I said “the problem implies the inverses exist” because the inverse was already written in the problem statement. In retrospect, however, that was a mistake. Fortunately, the TA didn’t take off any points, though he marked “No, but w/e …”

To make things extra clear, we now show that the inverses exist. Consider \((X^TX + \lambda I_d)\). We claim its inverse exist because it is positive definite (and all positive definite matrices are invertible). This is straightforward because for any nonzero \(z \in \mathbb{R}^d\), we have

\[z^T(X^TX + \lambda I_d)z = z^TX^TXz + \lambda zI_dz = (Xz)^T(Xz) + \lambda z^Tz = \|Xz\|_2^2 + \lambda \|z\|_2^2 > 0\]

where we use the standard \(w^Tw = \|w\|_2^2\) trick, which is everywhere in machine learning and optimization. If you don’t believe me, take EE 227B as I did and you’ll never be able to forget it. By the way, the above shows why we needed \(\lambda > 0\), because \(\|Xz\|_2^2\) is not necessarily greater than zero; it could be zero! This is because \(X^TX\) could be positive semi-definite, but not positive definite.

A similar proof applies for the other inverse, which uses \(\|X^Ty\|_2^2\) for \(y \in \mathbb{R}^n\).

Having proved the equality, why do we care about it? The reason has to do with ridge regression and duality. The problem formulation for ridge regression is

\[{\rm minimize}_w \quad \|Xw - y\|_2^2 + \lambda \|w\|_2^2\]

where \(y\) is the \(n\)-dimensional vector of targets and \(\lambda > 0\). Taking the gradient we get

\[\begin{align*} \nabla_w ((Xw - y)^T(Xw - y) + \lambda w^Tw) &= \nabla_w (w^TX^TXw - 2y^TXw + \lambda w^Tw) \\ &= 2X^TXw - 2X^Ty + 2 \lambda w \\ \end{align*}\]

Setting to zero, we find \(w^*\) as

\[\begin{align*} 2X^TXw^* - 2X^Ty + 2\lambda w^* = 0 &\iff (X^TX + \lambda I_d)w^* - X^Ty = 0 \\ &\iff (X^TX + \lambda I_d)w^* = X^Ty \\ &\iff w^* = (X^TX + \lambda I_d)^{-1}X^Ty. \end{align*}\]

Gee, does this formulation of \(w^*\) look familiar? Yep! The above is one way of finding \(w^*\), but another way is by invoking the equality to get what is known as the dual form of \(w^*\):

\[w^* = (X^TX + \lambda I_d)^{-1}X^Ty = X^T\underbrace{(XX^T + \lambda I_n)^{-1}y}_{\alpha^*}\]

Why do we like this new perspective? First, like any dual solution, it provides us with an alternative method if the “other” way (technically called the primal solution) is harder to compute. In this case, however, I don’t think that’s the main reason why (because this is fairly straightforward either way). I think we like this representation because \(w^*\) is a linear combination of the rows of \(X\), with coefficients specified by the vector \(\alpha^* \in \mathbb{R}^n\)! The rows of \(X\) are the elements in our data \(\{x_i\}_{i=1}^n\), so this gives the intuitive interpretation of the optimal solution being some combination of the data elements.

Don't Worry about Walking in Between a Sign Language Conversation

Jul 30, 2016

Suppose there are two people talking in sign language. What should you do when you have to walk in between them in order to reach your destination? If the people were talking verbally (say, in English), you would walk through the two without hesitation.

But does sign language cause the etiquette to be different?

I thought about this recently when I was attending a recent tour. As part of it, the attendees listened to a few talks in an auditorium. Since I had a sign language interpreter, I naturally sat in the front corner seat of the room. Fortunately, I didn’t play my usual role of the “human repellent.” Despite there being lots of empty seats, some of the other people actually sat next to me instead of being afraid of the sign language and consequently sitting as far away from me as possible. (I hope that’s the reason why people generally don’t sit next to me …)

I didn’t know anyone else in the tour, so during breaks within the talks, I would strike up some conversations with the sign language interpreter. At the same time, a lot of the other attendees needed to get up and walk to the back of the auditorium, presumably to fetch a new batch of Starbucks coffee. But this meant that many had to cut through me and the sign language interpreter. Throughout the tour, I noticed two classes of people:

  1. Those who walked through our conversation quickly and without hesitation.

  2. Those who paused and nervously glanced at me and the interpreter, wondering about what to do.

I’m here to reassure you that (1) is much better than (2). If you need to walk between us to go somewhere, please do it (quickly).

I think most people who engage in (2) are those who do not have much experience with sign language and “deafness” more generally, so their reaction is certainly understandable (and a sign of politeness, I might add). When this situation happens, usually the sign language interpreter will quickly motion for the person to walk through, and then he or she will do a little sprint and then continue walking.

Perhaps people are worried about some unknown rules of sign language etiquette? If so, don’t worry: I don’t find walking between us to be rude at all, and I have yet to meet anyone who gets bothered by this. It doesn’t interfere with our conversation that much, because (just like in English audio) if a second of the conversation (or audio) gets cut, we can usually fill in the details by means of continuity, which comes with experience using the language.

If there’s a group of people who have to walk between us, then it’s a different story. When that happens, we have to pause the sign language until everyone gets through, but that’s understandable; we don’t want to be jerks and force everyone to take a far more roundabout route.

To state the obvious, though: if there’s a way to get to your destination without cutting through our conversation, you should consider that alternative route, but if that would require squeezing between a twelve-inch gap between the wall of the auditorium and the back of my interpreter’s chair and making yourself look silly in the process, please don’t do it. If you would walk between us without hesitation if we were talking in English, then you can safely walk in between a sign language conversation. That doesn’t mean walk, then stop, examine your clothes, fiddle around, and so forth. If you take more than a second, you’ll annoy us. Just act normally, please. That goes for a lot of other things related to deafness that I’ve experienced in my life.

Minecraft for AI Research?

Jul 17, 2016

More games are jumping on the AI bandwagon.

We saw IBM’s Deep Blue, which beat chess Grandmaster Gary Kasparov in 1997. Then in 2013, we saw DeepMind’s AI agents1 achieve human-level performance on many classic Atari 2600 games. Then earlier this year, we saw DeepMind’s next product, AlphaGo, beat Lee Sodol in a widely publicized Go match. You can view the details on DeepMind’s AlphaGo website, which also has the corresponding Nature paper. I have not read it yet (it’s on my TODO-list), but I read and considerably enjoyed their other Nature paper (from 2015) about Atari 2600 games.

Yet it looks like we are not done with AI and games. The news is out that Microsoft Research has made Project Malamo open source! It allows people to use the popular sandbox game Minecraft for AI research. Microsoft has been using it for a few months, as evident by this earlier article, but I only found out about Project Malamo after it became open source. Ironically, I didn’t find this by browsing Microsoft’s website — I found it on the Minecraft forum website.

Their overall goal is stated as follows:

[…] five computer scientists are spending their days trying to get a Minecraft character to climb a hill.

That may seem like a pretty simple job for some of the brightest minds in the field, until you consider this: The team is trying to train an artificial intelligence agent to learn how to do things like climb to the highest point in the virtual world, using the same types of resources a human has when she learns a new task.

That means that the agent starts out knowing nothing at all about its environment or even what it is supposed to accomplish. It needs to understand its surroundings and figure out what’s important – going uphill – and what isn’t, such as whether it’s light or dark. It needs to endure a lot of trial and error, including regularly falling into rivers and lava pits. And it needs to understand – via incremental rewards – when it has achieved all or part of its goal.

Stated succinctly, they want to train the player to get to higher ground in Minecraft, without having to explicitly code that concept. One can formulate this as a reinforcement learning problem using some form of a Markov Decision Process (you can view my earlier article on MDPs here). As in the case for DeepMind’s work, the researchers of Project Malamo want to accomplish this while only relying on information that a human can see. That is, they want the agent to learn like a human would. This is definitely going to be a challenging problem. Minecraft is one of the most open-ended games ever, and even humans are often unsure of what their objectives are when playing. I’ve had people tell me, “what’s the point of Minecraft?”

The researchers are constraining their focus to “moving uphill”, which helps to narrow down the problem. That still results in a very broad game, and is unlike previous work using Chess, Go, and Atari 2600 games, which have clearer goals for the player. I will be particularly interested in seeing how researchers will compare their results. How do we evaluate the quality of agents? I am also curious about what role deep learning will play in this research.

On a final note, if they want an AI enthusiast who is also a Minecraft enthusiast to serve as a contributor … my contact information is on the “About” page of this website. Unfortunately, I have not played Minecraft in months (maybe a year?!?) and my guess is that a disproportionate share of computer scientists are also Minecraft players.

  1. After they demonstrated their results, Google bought the company, so we commonly call them Google DeepMind

Donald Trump's Name Sign

Jul 9, 2016

When I need to use people’s names in sign language, I have three tactics I can draw upon. The best one is to invoke the person’s “official” sign name. The second tactic is to take the first letter of the person’s first name (or last, depending on context) and gently shake it. I view this as a generic way of signing someone’s name when a sign has not been established. The third tactic is to avoid a sign name all together and fingerspell.

It is not always possible to use the first tactic. When babies are born, we assign English names right away, often having pre-determined them beforehand. But finding an appropriate name sign for someone — regardless of whether he or she is deaf or hearing — requires many more years. This is due to an informal rule that a person can only be given a name sign by someone who is “culturally Deaf.” In addition, as a practical manner, most name signs combine the first letter of the person’s first name and a personal characteristic. This is the case for my name sign, which is to form a “D” in one’s dominant hand and then to peck one’s forehead twice. It’s to indicate “intelligence.” Yes, I know it’s embarrassing, and please don’t use it in my presence. I don’t know who gave it to me.

For these reasons, the use of name signs in official ASL conversations is often restricted to people who know each other well. The exception, of course, is with famous people. I got particularly curious about one man’s name sign when one of my high school sign language interpreters suggested it in a recent conversation with me at a local Starbucks. That man is Donald Trump.

A Google search for “Donald Trump name signs” leads to an interesting article from the Washington Post, which describes not only a proposed name sign for Trump, but also signs for other notable politicians: Bernie Sanders, Bill Clinton, Richard Nixon, Ronald Reagan, and Barack Obama. Of all the politicians listed, I have only used Obama’s name sign, which I quite enjoy. It is quick enough to use and represents his name with the beginning “O” along with the hand movement to indicate America’s flag. Quite conveniently, the (sideways) “B” that’s used to represent the flag is also the second letter in “Obama.”

Name signs are hard to get. Other than politicians and religious figures, I can only remember using a name sign for William Shakespeare. It involves moving one’s dominant hand as if one is shaking a small bag (since, you know, “shake” and Shakespeare).

While the Washington Post certainly has a suitable name sign for Trump, I got curious and investigated to see if there were alternatives. You can find plenty of proposals on YouTube videos. Almost all are variations of making fun of Trump’s hair, which should have been obvious in retrospect.

Here is a video I found that, in my opinion, represents the best version of a Trump name sign. It is a variation of the Washington Post’s sign; it repeats the hand-waving motion of Trump’s hair. This is fine since it can be done quickly. That video has some other context to it, so for your convenience, here’s my English translation of it:

Hi, Tad! You asked me, what is the sign name for Trump? Interesting. My husband and I discussed this and after a long time, we finally settled on a sign for him. Tonight, at the ASL SLAM [demonstrates sign], I was curious and asked everyone about a name sign for Trump. We had already settled on this one: [demonstrates sign] What do you think? [demonstrates sign again]

Regardless of whether you agree or disagree with his unorthodox policies, you have to admit that this sign is appropriate.

Captioning and Other Accommodations During my Trip to Washington DC

Jul 3, 2016

I took a break from paper reviewing to be in Washington DC for a few days. This marked my first time being in the city, and it’s been a pleasure to tour arguably the most important one of our country. My visit coincided with the July 4 weekend, so there was a lot of patriotism on display. Among my itinerary in Washington DC was Capitol Hill and the Smithsonian National Air and Space Museum. Both are government-controlled entities, which means that admission is free (because we pay for it with our taxes1) and that they are expected to provide reasonable accommodations for people with disabilities.

Just to be clear, I don’t want to suggest that the private sector doesn’t do a good job providing accommodations. I do, however, believe that government agencies might be slightly, slightly more sensitive for the need to provide disability-related accommodations. After all, wouldn’t it be an embarrassment if they failed to follow the Americans with Disabilities Act?

Not having done much planning in advance (thank you, paper reviewing!) I didn’t know what to expect with regards to accommodations. Fortunately, my Capitol Hill visit turned out quite well. Despite not having a reservation, I was able to participate in a guided tour of the building. This was split in two parts. The first was a brief introductory video which described our nation’s remarkable history in 10-15 minutes. Fortunately, it had open captioning, so I was able to understand it.

The second part was the actual building tour. There were so many people that the guides split us up into five groups, each with our own guide. What was interesting, though, was that we were all provided headsets so that we could hear our tour guide speaking in his microphone (and only our tour guide, not the others). This worked out well when we were in crowded areas with several tour guides talking at once. For me, of course, I wouldn’t have benefited much from the headsets. Fortunately, after inquiring, I got a T-Coil necklace, and wow, that really amplified my guide’s voice! It was even louder than what I’m used to with FM systems, and on top of it all, there was no static. As a result, I understood most of what my guide said throughout the tour. Speaking of the actual tour, it was a nice experience looking inside Capitol Hill; we passed by Speaker Paul Ryan’s office, though we obviously didn’t see the Speaker himself. We didn’t get to explore much of the building, unfortunately, and there is still substantial construction going on, obscuring much of the inner Capitol Hill dome. Perhaps later I can try and get a better tour by submitting my name in the lottery for the tours arranged by the California Representatives or Senators? Nonetheless, it was still a good experience for me and I was happy with the tour and the accommodations.

I had a different experience at the Smithsonian Museum. After touring some of the ground floor exhibits, I paid a ticket for one of their “featured films”. I first asked for the standard (non-IMAX) film that was closest to starting, but after I asked, the cashier told me it did not have captions, and she recommended I instead watch A Beautiful Planet, offered in IMAX. Once the doors opened to enter the theater, I lined up with everyone to take our 3D glasses. I found my way into a good seat in the center of the auditorium, and … there were no captions. IMAX films are quite loud, but even so, I could not make out the conversations among the astronauts in the film. I actually fell asleep about ten minutes into it out of boredom, and woke up at the end of the film. I was a bit disappointed that an IMAX film causes me to nap rather than watch a movie (and no, I didn’t even turn off my hearing aids). I wonder: should I have asked for more details on the captioning? I don’t recall the exact words I uttered, but it’s true that I could always have been more specific with my wording. Or was I supposed to get different 3D glasses, kind of like how I got a different kind of “headset” for the tour guide? I wouldn’t be surprised if they had closed captioning glasses. Or now that I think about it, did I simply mishear the person who I asked about this?

Lesson learned: I need to ask more questions, and to plan ahead for these once-in-a-few-years visits.

  1. Ah, but doesn’t that mean foreigners who visit are free-riding on us? Throughout my entire visit, no one ever asked me to prove my American citizenship. 

Currently Reviewing Papers for NIPS 2016

Jun 25, 2016

This year, I volunteered to be a NIPS reviewer, since I want to get the experience of reviewing academic papers. It was pretty easy to volunteer: for every paper submitted to NIPS, they asked that at least one of the authors offer to be a reviewer, and preferably not someone who’s already on their list of known reviewers. I’m clearly not among that elite crowd, and I wanted to review anyway, so it was an easy choice for me.

This year, NIPS had around 2500 (!) paper submissions, according to the number I saw in the Conference Management Toolkit website, which is what NIPS uses to manage paper submissions and reviews. I think the exact number I saw was roughly 2450-ish, but it’s worth noting that papers with authors who match my conflict of interest domains (especially “”) wouldn’t appear on this list. When I was submitting my paper bids for reviewing purposes, I couldn’t find the paper I submitted (fortunately!).

For obvious reasons, I can’t divulge too much about the papers I’m reviewing. But I can probably safely say this: I got assigned four papers to review in the span of roughly a month (from June 17 to July 17). So I’m reviewing right now, and decided to write this blog post during one of my breaks. Now that I’ve read all my assigned papers and have more-or-less come up with a very high-level draft conclusion about them, here are four thoughts:

  1. Unfortunately, I’m probably deeply qualified to review … none of the four papers. For two of the papers, I understand much of the main ideas, but I still lack intricate knowledge of the math and algorithms they use, and about the most prevalent related work. For two other papers, I’m close to being lost on them. Therefore, I have made this a rule: for each paper, I will allow myself to read one extra reference for background information. One. So effectively, my reading load for this cycle is eight papers, four to be read in-depth and four to be read semi-in-depth. This raises my curiosity: how do people actually carefully review papers? The vast majority of NIPS papers are extraordinarily specialized and technical, and many still have proofs in appendices. I don’t see how it is humanly possible for me to verify the proofs and make sure the math lines up correctly, especially with an internship that already takes up about 50 hours a week. It can take hours just to verify one simple line of equations in a paper, which might require looking through a previous paper reference, checking Wikipedia for additional background information, and then marking down an annoying typo that got me confused. Can an experienced NIPS reviewer please tell me how he or she reviews papers?

  2. Fortunately, the previous thought might not matter because, despite not having gone through all the technical material so far (I’ve only been reviewing a week!) I feel like I can more or less tell how I feel about the papers. For instance, the amount of formalism and “cleanliness” of the paper gives a good clue about its quality. If the paper is rife with typos and sloppy LaTeX, or if the figures look spectacularly low-quality, that’s just asking for a rejection. (Come on, with matplotlib embedded in Jupyter notebooks, people should be able to produce acceptable figures.) Note that by this, I’m excluding language issues that might arise due to non-native English speakers writing the paper. It’s not fair to them.

  3. It’s nice reviewing papers without knowing the identity of the authors. NIPS is double blind, which I believe to be a good thing, and I say this as someone who would probably benefit from an open reviewing process. I don’t want to think about the authors’ identities. I have not actively searched so I’m still in the dark about this for all four of my papers. As a reviewer, I like my identity being protected, so I’m reconsidering my earlier thought about making some of the reviews public (sadly, if this were politics, I’d be accused of “waffling”). Unfortunately, for one of my papers, I have a good feeling about who at least one of the authors is based on previous research, despite how I haven’t been in this field very long (and some would argue, have never been in this field). In addition, two of my other papers have a previous reference which is obviously similar to the current work, and since I ended up using those papers as my “background reference” material and seeing practically word-for-word similarities … author similarities follow. I’ll keep my guesses in mind and then see how they turn out later in August.

  4. Despite my internship, I would bet I have more time on my hands than most faculty for reviewing, because I have fewer competing priorities and because I would be assigned fewer papers (hopefully … did I?). Thus, I will try to write some really detailed suggestions or at the very least, thought-provoking comments/questions, and I think the authors would appreciate that even if their papers don’t get accepted. At all costs, I am not going to be writing those dreaded three line reviews that contribute nothing. Sadly, I frequently see these online at the NIPS proceedings, which makes reviews public for accepted papers. As a result, this period from mid-June to mid-July is going to be super busy for me, since I am doing all my reviewing in nights and weekends. I am really trying to make sure my reviews are helpful.

That’s all for now — I need to get back to reviewing.

Some Recent Results on Minibatch Markov Chain Monte Carlo Methods

Jun 18, 2016

I have recently been working on minibatch Markov chain Monte Carlo (MCMC) methods for Bayesian posterior inference. In this post, I’d like to give a brief summary of what that means and mention two ICML papers (from 2011 and 2014) that have substantially influenced my thinking.

When we say we do “MCMC for Bayesian posterior inference,” what this typically means is that we have some dataset \(\{x_1,x_2,\ldots,x_N\}\) and a parameter of interest \(\theta \in \mathbb{R}^K\) for some \(K \ge 1\). The posterior distribution \(f(\theta)\) we wish to estimate (via sampling) is

\[f(\theta) = p(\theta \mid x_1, \ldots x_N) = \frac{p(\theta)\prod_{i=1}^Np(x_i \mid \theta)}{p(x_1, \ldots, x_N)} \propto p(\theta)\prod_{i=1}^Np(x_i \mid \theta)\]

You’ll notice that we assume the data is conditionally independent given the parameter. In many cases, this is unrealistic, but most of the literature assumes this for simplicity.

After initializing \(\theta_0\), the procedure for sampling the posterior on step \(t+1\) is:

  • Draw a candidate \(\theta'\)
  • Compute the acceptance probability (note that the denominators of \(f\) cancel out):
\[P_a = \min \left\{1,\frac{f(\theta')q(\theta_t\mid \theta')}{f(\theta_t)q(\theta' \mid \theta_t)}\right\} = \min \left\{1,\frac{p(\theta')\prod_{i=1}^Np(x_i \mid \theta')q(\theta_t\mid \theta')}{p(\theta_t)\prod_{i=1}^Np(x_i \mid \theta_t)q(\theta' \mid \theta_t)}\right\}\]
  • Draw \(u \sim {\rm Unif}[0,1]\), and accept if \(u < P_a\). This means setting \(\theta_{t+1} = \theta'\). Otherwise, we set \(\theta_{t+1} = \theta_t\) and repeat this loop. This tripped me up earlier. Just to be clear, we generate a sample \(\theta\) every iteration, even if it is a repeat of the previous one.

This satisfies “detailed balance,” which roughly speaking, means that if one samples long enough, one will arrive at a stationary distribution matching the posterior, though a burn-in period and/or only using samples at regular intervals is often done in practice. The resulting collection of (correlated!) samples \(\{\theta_1, \theta_2, \ldots, \theta_T\}\) for large \(T\) can be used to compute the value of \(p(\theta \mid x_1,\ldots,x_N)\). Let’s consider a very simple example. Say \(\theta\) can take on three values: \(A\), \(B\), and \(C\). If our sampled set is \(\{A,B,B,C\}\), then since \(B\) appears two times out of four, we have \(p(\theta = B \mid x_1,\ldots,x_N) = 2/4\). The other two probabilities would naturally have \(1/4\) each according to the samples.

One issue with the standard procedure above is that in today’s big data world with \(N\) on the order of billions, it is ridiculously expensive to compute \(P_a\) because that involves determining all \(N\) of the likelihood factors. Remember, this has to be done every iteration! Doing all this just to get one bit of data (whether to accept or not) is not a good tradeoff. Hence, there has been substantial research on how to perform minibatch MCMC on large datasets. In this case, rather than use all \(N\) data points, we just use a subset \(n \ll N\) of points each iteration. This approximates the target distribution. The downside? It no longer satisfies detailed balance. (I don’t know the details on why, and it probably involves some complicated convergence studies, but I am willing to believe it.)

Just to be clear, we are focusing on getting a distribution, not a point estimate. That’s the whole purpose of Bayesian estimation! A distribution means we need a full probability function that sums to one and is non-negative; if \(\theta\) is one or two dimensional, we can easily plot the posterior estimate (I provide an example of this later in this post). A point estimate means finding one value of \(\theta^*\), usually the “best”, which we commonly express as the maximum likelihood estimate \(\theta^* = \theta_{\rm MLE}\).

All right, now let’s briefly discuss two papers that tackle the problem of minibatch MCMC.

Bayesian Learning via Stochastic Gradient Langevin Dynamics

This paper appeared in ICML 2011 and proposes using minibatch update methods for Bayesian posterior inference, with a concept known as Langevin Dynamics to inject the correct amount of noise into parameter updates so that the set of sampled parameters converges to the posterior, and not just a mode. To make the distinction clear, let’s see how we can use minibatch updates to converge to a mode — more specifically, the maximum a posteriori estimate. The function \(f\) we are trying to optimize is listed above. So … we just use stochastic gradient ascent! (We are ascending, not descending, because \(f\) is a posterior probability, and we want to make its values higher.) This means the update rule is as follows: \(\theta_{t+1} = \theta_t + \alpha_t \nabla f(\theta_t)\). Plugging in \(f\) above, we get

\[\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\underbrace{\left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^n\nabla \log p(x_{ti} \mid \theta_t)\right)}_{\approx \nabla f(\theta_t)}\]

where \(\epsilon_t\) is a sequence of step sizes.

The above is a stochastic gradient ascent update because we use \(n \ll N\) terms to approximate the gradient value, which is why I inserted an approximation symbol (\(\approx\)) in the underbrace. Because we only use \(n\) terms, however, we must multiply the summation by \(N/n\) to rescale the value appropriately. Intuitively, all those terms in that summation are negative since they’re log probabilities. If we use \(n\) instead of \(N\) terms, that summation is strictly smaller in absolute value. So we must rescale to make the value on the same order of magnitude. Another way of putting it is that the subsampled sum will be an unbiased estimator of the full-batch quantity.

What’s the problem with the above for Bayesian posterior inference? It doesn’t actually do Bayesian posterior inference. The above will mean \(\theta\) converges to a single value. We instead want a distribution. So what can we do? We can use Langevin Dynamics, meaning that (for the full batch case) our updates are:

\[\theta_{t+1} = \theta_t + \frac{\epsilon}{2}\underbrace{\left(\nabla \log p(\theta_t) + \sum_{i=1}^N \nabla \log p(x_i \mid \theta_t)\right)}_{\nabla f(\theta_t)} + \eta_t\] \[\eta_t \sim \mathcal{N}(0,\epsilon)\]

A couple of things are worth noting:

  • We use all \(N\) terms in the summation, so the gradient is in fact exact.
  • The injected noise \(\eta_t\) means the \(\theta_{t}\) values will “bounce around” to approximate a distribution and not converge to a single point.
  • The \(\epsilon\) is now constant instead of decreasing, and is balanced so that the variance of the injected noise matches that of the posterior.
  • We use \(x_i\) instead of \(x_{ti}\) as we did earlier. The only difference is that the \(t\) indicates the randomness in the minibatch. It does not mean that \(x_{ti}\) is one scalar element of a vector. In other words, both \(x_i\) and \(x_{ti}\) are in \(\mathbb{R}^k\) for some \(k\).

For simplicity, the above assumes that \(\eta_t \in \mathbb{R}\). In the general case, these should be multivariate Gaussians, with covariance \(\epsilon I\).

The problem with this, of course, is the need to use all \(N\) points. So let’s use \(n \ll N\) points, and we have the following update:

\[\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^n \nabla \log p(x_{ti} \mid \theta_t)\right) + \eta_t\] \[\eta_t \sim \mathcal{N}(0, \epsilon_t)\]

where now, we need \(\epsilon_t\) to vary and decrease towards zero. The reason for this is that as the step size goes to zero, the corresponding (expensive!) Metropolis-Hastings test has rejection rates that decrease to zero. Thus, we can omit the test to make the algorithm computationally cheap.

Fortunately, the authors show that despite how the noise goes to zero, the set of samples from SGLD will indeed converge to the correct posterior distribution, just as with “normal” (i.e., full-batch) Langevin Dynamics noise.

Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

This paper appeared in ICML 2014, and is also about minibatch MCMC. Here, instead of relying on simulating the physics of the system (as was the case with Stochastic Gradient Langevin Dynamics), they propose reformulating the standard MCMC method with the standard MH test into a sequential hypothesis test. To frame this, they take the log of both sides of the acceptance inequality:

\[\begin{align*} u < P_a &\iff u < \min \left\{1,\frac{p(\theta')\prod_{i=1}^Np(x_i \mid \theta')q(\theta_t\mid \theta')}{p(\theta_t)\prod_{i=1}^Np(x_i \mid \theta_t)q(\theta' \mid \theta_t)}\right\} \\ &\iff u\frac{p(\theta_t)q(\theta' \mid \theta_t)}{p(\theta')q(\theta_t\mid \theta')} < \frac{\prod_{i=1}^Np(x_i \mid \theta')}{\prod_{i=1}^Np(x_i \mid \theta_t)} \\ &\iff \log\left[u \frac{p(\theta_t)q(\theta' \mid \theta_t)}{p(\theta')q(\theta_t\mid \theta')} \right] < \log \left(\prod_{i=1}^Np(x_i \mid \theta')\right) - \log\left(\prod_{i=1}^Np(x_i \mid \theta_t)\right) \\ &\iff \frac{1}{N} \log\left[u \frac{p(\theta_t)q(\theta' \mid \theta_t)}{p(\theta')q(\theta_t\mid \theta')} \right] < \frac{1}{N}\sum_{i=1}^N (\log p(x_i \mid \theta') - \log p(x_i \mid \theta_t)) \end{align*}\]

In the first step we also dropped the initial “min” because if the “1” case applies, we will always accept. In the last step we divide both sides by \(N\). What is the purpose of this? The above is equivalent to the original MH test. But the right hand side depends on all \(N\) data points, so what happens if we compute the right hand side using \(n \ll N\) points?

This is the heart of their test. They start out by using a small fraction of the points and compute the right hand side. If the proposed \(\theta'\) element is so out of whack, then even with just \(n\) points, we should already know to reject it. (And a similar case holds if \(\theta'\) is really good.) If we cannot tell whether to accept or reject \(\theta'\) with some specified confidence threshold, then we increase the minibatch size and test again. Their acceptance test relies on the Central Limit Theorem and the Student-t distribution. The details are in the paper, but the main idea is straightforward: increasing the number of samples increases our certainty as to whether we accept or reject, and we can generally make these decisions with far fewer than \(N\) samples.

What’s the downside of their algorithm? In the worst case, we might need the entire data in one iteration. This may or may not be a problem, depending on the particular circumstances.

Their philosophy runs deeper than what the above says. Here’s a key quote:

We advocate MCMC algorithms with a “bias-knob”, allowing one to dial down the bias at a rate the optimally balances error due to bias and variance.

One other algorithm that adheres to this strategy? Stochastic Gradient Langevin Dynamics! (Not coincidentally, Professor Max Welling is a co-author on both papers.) Side note: the reason why SGLD is biased is because it omits the Metropolis-Hastings test. The reason why this algorithm (Adaptive Sampling) is biased is because it makes decisions based on a fraction of the data. So both are biased, but for slightly different reasons.

Putting it All Together: A Code Example

In order to better understand the two papers described above, I wrote some Python code to run SGLD and adaptive sampling. I also implemented the standard (full-batch) MCMC method for a baseline.

I tested using the experiment in Section 5.1 of the SGLD paper. The parameter is 2-D, \(\theta = (\theta_1,\theta_2)\), and the parameter/data generation process is

\[(\theta_1,\theta_2) \sim \mathcal{N}((0,0), {\rm diag}(\sigma_1^2, \sigma_2^2))\]


\[x_i \sim 0.5 \cdot \mathcal{N}(\theta_1, \sigma_x^2) + 0.5 \cdot \mathcal{N}(\theta_1 + \theta_2, \sigma_x^2)\]

I’ve pasted all my Python code here. If you put this together in a Jupyter notebook, it should work correctly. If your time is limited, feel free to skip the code and go straight to the output. The code is here mainly for the benefit of current and future researchers who might want a direct implementation of the above algorithms.

The first part of my code imports the necessary libraries and generates the data according to my interpretation of their problem. I am generating 500 points here, whereas the SGLD paper only used 100 points.

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import t

# Generate the data according to Section 5.1 of "Bayesian Learning via SGLD."
N = 500 # The number of data points; we reference this frequently later.
sigma1_sq = 10
sigma2_sq = 1
sigmax_sq = 2
theta1 = 0
theta2 = 1

# Generate the data X. The np.random.normal(...) requires STD (not VAR).
X = np.zeros(N)
for i in xrange(N):
    u = np.random.random()
    if (u < 0.5):
        X[i] = np.random.normal(theta1, np.sqrt(sigmax_sq))
        X[i] = np.random.normal(theta1+theta2, np.sqrt(sigmax_sq))

Next, I define a bunch of functions that I will need for the future. The most important one is the log_f function, which returns the log of the posterior. (Implementing this correctly requires filling in some missing mathematical details that follow from the formulation of multivariate Gaussian densities.)

def log_f(theta, X):
    The function 'f' is the posterior:
        f(\theta) \propto p(\theta) * \prod_{i=1}^N p(x_i | \theta)
    This method returns the *log* of f(\theta), for obvious reasons.
    X is assumed to be of shape (n,) so we reshape it into (n,1) for vectorization.
    We use [0,0] since we get a float in two lists, [[x]], so calling [0,0] gets x.
    scale = N / float(len(X))
    inverse_covariance = np.array([[0.1,0],[0,1]])
    prior_constant = 1.0 / (2*np.pi*np.sqrt(10))
    prior = np.log(prior_constant) - 0.5*(theta.T).dot(inverse_covariance).dot(theta)
    X_all = X.reshape((len(X),1))
    ll_constant = (1.0 / (4*np.sqrt(np.pi)))
    L = ll_constant * (np.exp(-0.25*(X_all-theta[0])**2) + np.exp(-0.25*(X_all-(theta[0]+theta[1]))**2))
    log_likelihood = np.sum(np.log(L)) * scale
    assert not np.isnan(prior + log_likelihood)
    return (prior + log_likelihood)[0,0]
def grad_f(theta, X):
    """ Computes gradient of log_f by finite differences. X is (usually) a mini-batch. """
    eps = 0.00001
    base1 = np.array([[eps],[0]])
    base2 = np.array([[0],[eps]])
    term1 = log_f(theta+base1, X) - log_f(theta-base1, X)
    term2 = log_f(theta+base2, X) - log_f(theta-base2, X)
    return np.array([[term1],[term2]])/(2*eps)
def get_delta(theta_c, theta_p, X):
    Compute the delta term, so we use min{1, exp(\Delta')} for standard MCMC.
    theta_c is the *current* theta that we may "leave" (old theta)
    theta_p is the *proposed* theta that we might accept (new theta)
    loss_old = log_f(theta_c, X)
    loss_new = log_f(theta_p, X)
    assert not np.isnan(loss_new - loss_old)
    return loss_new - loss_old

def get_noise(eps):
    """ Returns a 2-D multivariate normal vector with covariance matrix = diag(eps,eps). """
    return (np.random.multivariate_normal(np.array([0,0]), eps*np.eye(2))).reshape((2,1))

def log_f_prior(theta):
    Returns log p(\theta), the prior. Remember, it's returning LOGS. Don't take logs again! 
    This is only used for the adaptive sampling method's hypothesis testing.
    cov_inv = np.array([[0.1,0],[0,1]])
    return np.log(1.0/(2*np.pi*np.sqrt(10))) - 0.5*(theta.T).dot(cov_inv).dot(theta)
def approximate_mh_test(theta_c, theta_p, m1, X, mu_0, eps_tolerance):
    Implements the adaptive sampling method. Each time we call this, we shuffle the data.
    theta_c is the current \theta
    theta_p is the proposed \theta
    m1 is the starting minibatch size
    X is the full data matrix
    mu_0 is equation 2 from "Cutting the MH" paper, pre-computed
    eps_tolerance is for deciding whether delta is small enough
    Returns ('accept', 'size'), where 'accept' is a boolean to indicate if we accept or not,
        while 'size' is the size of the mini-batch we sampled (generally a multiple of m).
    accept = False
    done = False
    first_iteration = True
    n = 0
    m2 = m1 # We'll increment by the same amount as 'm1', the starting size.
    # Shuffle at beginning so we can do X_mini = X[:n]
    index = np.random.permutation(N)
    X = X[index]

    while not done:
        # Draw mini-batch without replacement, but keep using same indices we've used.
        # Note that we can have a special case for the first minibatch.
        if first_iteration:
            size = np.min([m1, N])
            first_iteration = False
            size = np.min([m2, N-n])
        n = n + size
        X_mini = (X[:n]).reshape((n,1)) # To enable broadcasting
        # Now compute ell_bar and ell_bar_squared, which rely on the data.
        numerator   = np.exp( -0.25*(X_mini-theta_p[0])**2 ) + np.exp( -0.25*(X_mini-(theta_p[0]+theta_p[1]))**2 )
        denominator = np.exp( -0.25*(X_mini-theta_c[0])**2 ) + np.exp( -0.25*(X_mini-(theta_c[0]+theta_c[1]))**2 )
        log_terms = np.log( numerator / denominator )
        ell_bar = (1.0/n) * np.sum(log_terms)
        ell_bar_squared = (1.0/n) * np.sum(log_terms**2) # Squaring individual terms!
        # Let's just be safe and check for NaNs.
        assert not np.isnan(np.sum(numerator))
        assert not np.isnan(np.sum(denominator))
        assert not np.isnan(ell_bar)
        assert not np.isnan(ell_bar_squared)
        # Now we have the information needed to compute s_l, s, |t|, and delta.
        # Note that if n == N, we can exit since we know delta should be 0. We have all the data.
        if (n == N):
            delta = 0
            s_l = np.sqrt((ell_bar_squared - (ell_bar**2)) * (float(n)/(n-1)))
            s = (s_l / np.sqrt(n)) * np.sqrt(1 - (float(n-1)/(N-1)))
            test_statistic = np.abs((ell_bar - mu_0) / s)
            delta = 1 - t.cdf(test_statistic, n-1)
            assert not np.isnan(delta)
            assert not np.isnan(s)
        # Finally, we can test if our hypothesis is good enough.
        if (delta < eps_tolerance):
            if (ell_bar > mu_0):
                accept = True
            done = True
    return (accept,n)

Next, I run three methods to estimate the posterior: standard full-batch MCMC, Stochastic Gradient Langevin Dynamics, and the Adaptive Sampling method. Code comments clearly separate and indicate these methods in the following code section. Note the following:

  • The standard MCMC and Adaptive Sampling methods use a random walk proposal.
  • I used 10000 samples for the methods, except that for SGLD, I found that I needed to increase the number of samples (here it’s 30000) since the algorithm occasionally got stuck at a mode.
  • The minibatch size for SGLD is 30, and for the Adaptive Sampling method, it starts at 10 (and can increase by 10 up to 500).
num_passes = 10000
rw_eps = 0.02
theta = np.array([[0.5],[0]])
all_1 = theta # Our vector of theta values

for T in range(1,num_passes):
    # Standard MH test. Note that it's just min{1, exp(\Delta')}.
    theta_new = theta + get_noise(rw_eps)        
    delta_real = get_delta(theta, theta_new, X)
    test = np.min([1, np.exp(delta_real)])
    u = np.random.random()
    # Perform test of acceptance. Even if we reject, we add current theta to all_1.
    if (u < test):
        theta = theta_new
    all_1 = np.concatenate((all_1,theta), axis=1)
print("\nDone with the standard minibatch MCMC test.\n")

num_passes = 30000
theta = np.array([[0.5],[0]])
all_2 = theta
mb_size = 30

# a and b are based on parameter values similar to those from the SGLD paper.
a = 0.06
b = 30

for T in range(1,num_passes): 

    # Step size according to the SGLD paper.
    eps = a*((b + T) ** (-0.55))
    # Get a minibatch, then compute the gradient, then the corresponding \theta updates.
    X_mini = X[np.random.choice(N, mb_size, replace=False)]
    gradient = grad_f(theta, X_mini)
    theta = theta + (eps/2.0)*gradient + get_noise(eps)
    # Add theta to all_2 and repeat.
    assert not np.isnan(np.sum(theta))
    all_2 = np.concatenate((all_2,theta), axis=1)
print("\nDone with Stochastic Gradient Langevin Dymamics.\n")

num_passes = 10000
rw_eps = 0.02
mb_size = 10
theta = np.array([[0],[1]])
all_3 = theta
tolerance = 0.025

for T in range(1,num_passes):
    # Let's deal with the Approx_MH test. This returns a boolean to accept (True) or not.
    # With a random walk proposal, the mu_0 does not depend on the likelihood information.
    theta_new = theta + get_noise(rw_eps)
    prior_old = log_f_prior(theta)
    prior_new = log_f_prior(theta_new)
    log_u = np.log(np.random.random())
    mu_0 = (1.0/N) * (log_u + prior_old - prior_new)    
    (do_we_accept, this_mb_size) = approximate_mh_test(theta, theta_new, mb_size, X, mu_0, tolerance)
    # Several checks, just to be safe.
    assert not np.isnan(prior_old)
    assert not np.isnan(prior_new)
    assert not np.isnan(log_u)
    assert not np.isnan(mu_0)
    assert this_mb_size >= mb_size
    # The usual acceptance test, and updating all_3.
    if do_we_accept:
        theta = theta_new
    all_3 = np.concatenate((all_3,theta), axis=1)
print("\nDone with the adaptive sampling method!\n")

With the sampled \(\theta\) values in all_1, all_2, and all_3, I can now plot them using my favorite Python library, matplotlib. I also create a contour plot of what the log posterior should really look like.

# Let's form the plot. This will be one figure with four sub-figures.
mymap2 = plt.get_cmap("Greys")
m_c2 = mymap2(400)
fig, axarr = plt.subplots(1,4, figsize=(16, 5))
(xlim1,xlim2) = (-1.5,2.5)
(ylim1,ylim2) = (-3,3)

# A contour plot of what the posterior really looks like.
axarr[0].set_title("Log Posterior Contour", size="xx-large")
K = 200
xlist = np.linspace(-1.5,2.5,num=K)
ylist = np.linspace(-3,3,num=K)
X_a,Y_a = np.meshgrid(xlist, ylist)
Z_a = np.zeros((K,K))
for i in range(K):
    for j in range(K):
        theta = np.array( [[X_a[i,j]],[Y_a[i,j]]] )
        Z_a[i,j] = log_f(theta, X)        

# The standard MCMC method
axarr[1].set_title("Standard MCMC", size="xx-large")
axarr[1].scatter(all_1[0], all_1[1], color = m_c2, alpha=0.15)

# Stochastic Gradient Langevin Dynamics
axarr[2].set_title("S.G.L.D. $(m=30)$", size="xx-large")
axarr[2].scatter(all_2[0], all_2[1], color = m_c2, alpha=0.15)

# The adaptive sampling method
axarr[3].set_title("Adaptive Sampling $(m=10)$", size="xx-large")
axarr[3].scatter(all_3[0], all_3[1], color = m_c2, alpha=0.15)

plt.savefig('blog_post_daniel.png', dpi=200)

By the way, it was only recently that I found out I could put LaTeX directly in matplotlib text. That’s pretty cool!

Here are the results in a scatter plot where darker spots indicate more points:


It looks like all three methods roughly obtain the same posterior form.

Islam and the Future of Tolerance

Jun 17, 2016

I just read Islam and the Future of Tolerance: A Dialogue, by renowned atheist Sam Harris and former jihadist-turned-liberal-activist Maajid Nawaz. I had this book on my radar for a while, and decided to read it yesterday in the wake of the tragic news of the recent Orlando shootings. My hope is to better understand what drives people to extremism and how we can reduce the probability of similar attacks from occurring.

I first found out about the book from browsing the website of Sam Harris, one of the four “New Atheists.” Harris and I see eye-to-eye on a number of issues. The most obvious one is that we are atheists. In addition, we both support LGBT rights, scientific reasoning, taxing billionaires, and we disapprove of the Iraq war. I do not claim to agree with everything he says, of course. Indeed, some of Harris’s statements have been controversial (as one would expect when dealing with religion). He actually felt compelled to write a blog post to respond to controversy.

As of this writing, Islam and the Future of Tolerance is Harris’ most recent book, but the interesting thing is that it’s actually a dialogue with Maajid Nawaz, listed as the book’s co-author. Nawaz has a remarkable personal story, which he describes in his 2013 autobiography Radical: My Journey Out Of Islamist Extremism. Nawaz was born in Britain, but after experiencing personal grievances (including racism), he joined the radical Islamist group Hizb ut-Tahrir. After an arrest in Egypt, he turned his life around by rejecting radical Islam. Today, he is of the world’s foremost liberal Muslims and a key challenger to extremism.

Islam and the Future of Tolerance is a conversation between Harris and Nawaz about Islam. Harris is skeptical that Nawaz can achieve the goal of reforming Islam. Harris, however, concedes that people like Nawaz — and not “infidels” like himself — are the ones who have to do the job. How can the faith reformation happen? What will be the future of Islam? That is the subject of their dialogue. Here are some of the highlights of their conversation, outlined according to the section:

  • The Roots of Extremism. This is a short section where Nawaz provides an abbreviated version of his autobiography.

  • The Scope of the Problem. Here’s the key takeaway: a primary issue with getting Muslims and non-Muslims to ally together is that the largest group of Muslims, which Harris and Nawaz call “religiously conservative Muslims”, do not wholeheartedly support contemporary liberal human rights. Thus, while those Muslims would be our allies against the Islamic State, they would also resist our efforts in promoting women’s rights, LGBT rights, etc.

  • The Power of Belief. Harris asks Nawaz how he originally became radicalized. Was it due to “ordinary” grievances that might be corrected with more tolerance (e.g., turning to radical Islam due to racism from non-Muslims?), or due to deeper forces such as martyrdom? I have frequently wondered about this process because, as an atheist, I struggle to understand how something like religion (in its most extreme form) can systematically guide people towards terror, barbarianism, and destruction.

  • The Betrayal of Liberalism. Both Harris and Nawaz slam liberals for their adherence to “political correctness.” (Donald Trump would be proud of them.) In the previous chapter, Harris rips liberal scholars who say that the West is to blame for the problems in Middle Eastern societies. Harris and Nawaz point out one unfortunate consequence of political correctness: that only right-wing bigots are accurately assessing jihadists nowadays.

  • The Nature of Islam. Harris argues that most of our human rights progress in the last few centuries is despite the presence of religions such as Judaism, Christianity, and (especially) Islam. There are plenty of religious moderates, but Harris argues that these moderates somehow have to filter out the more dangerous parts of their scripture. Harris concludes that the purest interpretation of reading scripture — and therefore leaving out personal biases — would actually favor the Islamic State:

This is why the approach of a group like the Islamic State has a certain intellectual appeal (which, admittedly, sounds strange to say) because the most straightforward reading of scripture suggests that Allah advises jihadists to take sex slaves from among the conquered, decapitate their enemies, and so forth.

  • Finding the Way Forward. Most of this section concerns the fatwa against the Islamic State. Harris and Nawaz, while recognizing that a fatwa is better than no fatwa, have reservations because it belies the true issues with Islamic tolerance. Near the end, Harris and Nawaz talk about the heavily American phenomenon of “being at war with Islam.” Harris admits that he may have played a role in assisting that culture, and Nawaz argues that President Obama must start naming the enemy radical Islam (I believe Nawaz is correct on this, but I also believe too much has been made out of this).

Despite the book’s brevity, I feel like both men were able to make good points. The conversation as a whole, while certainly not being a substitute for a more rigorous study of the role of religion in history, provides some interesting highlights. I learned about several historical events and spent some time browsing the corresponding Wikipedia pages to learn more.

One of my biggest takeaways from the book is that I doubt if liberals are effectively handling the problem of radical Islam. Harris and Nawaz, while both being liberals (in the sense that they would vote Democratic in the United States), would argue that liberals are not doing a good job. But they would also have that opinion of the conservative response.

But there’s something more important I need to mention. I think this conversation is useful because it shows how how people with different religious beliefs can come together and have a friendly, reasonable discussion without ripping each other’s throats. Nawaz is not that religious, but we need to have a starting point for a honest conversations with Muslims, and this dialogue serves that role effectively. In the wake of the Orlando shootings, these conversations give me hope that humanity will grow more peaceful and more tolerant in the coming years. Having an open mind is incredibly important. As Nawaz concludes:

Allow me to to take this opportunity to also thank you, Sam. It isn’t easy for anyone to reach across divides — real or imagined — and try to hold a sensible dialogue amid so much background noise and confusion. You will no doubt be censured by some Islam critics for not insisting that I am in fact a closet jihadist, just as I will be criticized by many Muslims for having this conversation with you.

Thanks to Sam Harris and Maajid Nawaz for having this conversation.

RRE: A Game-Theoretic Intrusion Response and Recovery Engine

Jun 6, 2016

During the past two months, I have been reading more research papers after not doing so for much of the spring 2016 semester. One paper I recently read is RRE: A Game-Theoretic Intrusion Response and Recovery Engine, a February 2014 IEEE Transactions on Parallel and Distributed Systems paper written by four computer engineering researchers. As suggested by the publication venue, it’s a systems and networking paper, but there’s an interesting connection to Artificial Intelligence. It also looks like it has had a noticeable impact, with 106 Google Scholar citations today, which is good for a 2014 paper. In this post, I’ll provide a brief summary of the RRE paper.

The paper considers the scenario of defending a computer network system from outside intrusions, which for many reasons is very common and very important nowadays. The traditional way to do this is to have some administrators and IT staff inspect the system and respond to attacks. Unfortunately, this is slow and hard to scale, and attackers are becoming increasingly sophisticated, meaning that we need automated responses to preserve the availability and integrity of computer systems. Hence, we have the Response and Recovery Engine (RRE) as a possible solution.

Their proposed RRE uses a two-tiered design for handling local and global situations. At the local level, individual computers have their own RREs, which take as input Intrusion Detection System (IDS) alerts and Attack Response Trees (ARTs). IDS alerts are a well-known computer networking concept, but ARTs are part of their new research contribution. ARTs analyze possible combinations of attack consequences that lead to some intrusion of system assets (e.g., SQL servers) and are designed offline by experts. ARTs allow RREs to (a) understand the cause and effects of different system problems and (b) decide on appropriate security responses. Another important aspect of ARTs is that its root node, denoted \(\delta_g\), represents the probability that the overall security property it represents has been compromised. The value of \(\delta_g\) can be computed using straightforward probability recurrence rules, so long as nodes in the ARTs are assumed independent. The recurrence starts at the leaves of the ART, which each take as input a set of IDS alerts and use a Naive Bayes binary classifier to determine the probability that the leaf node property has happened or not.

Using these ARTs, the authors nicely transform the security problem into a Partially Observed Markov Decision Process (POMDP), which they technically call a Partially Observed Competitive Markov Decision Process (POCMDP). First, an ART with \(n\) leaf nodes is automatically converted to a standard (i.e., fully observed) Markov Decision Process. The states \(s\) are \(n\)-dimensional binary vectors, with one element for each leaf, and the value matching whatever the leaf node has in the ART. The set of actions \(A\) for this MDP is split into sets \(A_r\) and \(A_a\), representing the set of actions for the RRE and the attacker, respectively. These actions have the effect of changing the values in the ART leaves; responses generally result in more zeros, and attacks typically result in more ones. (ARTs are designed so that nodes tend to represent negative events, hence we want more zeros.) The transition probabilities are determined from prior data, and the reward function \(r(s,a,s')\) is customized so that it takes into account the probability difference in the security property (i.e., \((\delta_g(s) - \delta_g(s'))^{\tau_1}\) for \(0\le \tau_1 \le 1\)), as well as the cost of executing the action. Here’s a brief exercise left to the reader: why is \(\delta_g(s') \le \delta_g(s)\)?

After formulating the MDP, the paper suddenly reminds us that we don’t actually have full knowledge of our states \(s\) (which makes sense in reality). Thus, we are in the partially observed case. According to their formulation, the probability of being in a state \(s\), denoted as \(b(s)\), is

\[b(s) = \prod_{l \in \mathcal{L}}(1_{[s_l = 1]}\delta(l) + 1_{[s_l=0]}(1-\delta(l)))\]

This makes sense and is consistent with the paper’s notation; I will leave it to the reader to review the paper if he or she wishes to verify this formula. They then outline how to solve the POMDP (based on their MDP formulation) using standard value iteration techniques. Note that they call their formulation a POCMDP (as stated earlier) because they add the “Competitive” word to the name, but I don’t think that is necessary. As I discussed near the end of this blog post on reinforcement learning, one can formulate a standard MDP in terms of a two-player adversarial game. They are simply making the distinction between action sets clearer here, perhaps for the benefit of the reader who is not familiar with Artificial Intelligence.

Another interesting note about their POCMDP formulation is that we do not actually need ARTs — the POCMDPs are all that is necessary, but they are harder to design and they have an exponential state space compared to the ARTs, so it is best to let experts design ARTs and automatically build the POCMDPs from it. This certainly makes sense from a practical perspective.

The process above is similar for the global case. The global RRE takes as input the recursively computed \(\delta_g\) values for each local RRE, the probability that the security property of the local assets have been compromised. It then uses these to automatically formulate a CMDP to help guide overall network priorities. To add finer-grained control over the states, the authors augment the system with fuzzy logic, allowing partial memberships. (See the Fuzzy Sets Wikipedia page for background on fuzzy sets.) This means they can combine information from local engines to determine a probability for a global security property. Their canonical example of the latter: is the network secure? Overall, this section of the paper is not as clear as the one describing local RREs, and while I understood the basics on how they used fuzzy logic, I wish there was enough space in the paper for a clearer explanation of how they form the global CMDP.

After their global RRE description, the authors present some brief experimental results on how long it takes to form the MDP models and on the runtime for RREs to compute an optimal response action. In general, their experiments are short and not that informative since they rely on randomly generated networks. To be fair, it is probably harder to run benchmarks in networking papers as compared to machine learning papers. In the latter case, we have standard datasets on which to test our algorithms.

The disappointing experiments notwithstanding, I enjoyed this paper. I fortunately did not need much networking or systems background to understand it, and it is nice to see different ways of how MDPs get formulated. It certainly falls prey to lots of simplifying assumptions, but I don’t see how that can be avoided for a research paper.

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World

May 31, 2016

I recently finished The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World, by acclaimed computer science professor Pedro Domingos of the University of Washington. The Master Algorithm is aimed at a broad but well-educated audience and attempts to serve as an intermediate between dense, technical textbooks and simple, overly-hyped 800-word “AI IS GOING TO DESTROY PEOPLE” newspaper articles. The main hypothesis is that there exists a universal algorithm for solving general machine learning problems: the master algorithm. Or, as Domingos humorously puts it, one algorithm to rule them all. For background information, Domingos provides a historical overview of five “tribes” of machine learning that we must unify and understand to have a chance at unlocking the master algorithm. In general, I think his master algorithm idea has merit, and his explanation of the five areas of machine learning are the most important and valuable parts of the book. Nonetheless, as there are fundamental limitations on learning a technical subject like machine learning in a book with just 300 (non-mathematical) pages, this book is bound to disappoint a few readers. The explanation of the hypothetical master algorithm is also limited since it relies almost entirely on Domingos’ decade-old research. In addition, some of the implications on society seem far fetched.

I was familiar with most of the material in this book beforehand, but as stated earlier, I still found Domingos’ organization of machine learning into five tribes to be immensely useful for understanding the field. His five tribes are: symbolists, connectionists, evolutionaries, Bayesians, and analogizers. I would never have thought about organizing machine learning this way, because I have limited experience with evolutionaries (i.e., people who use genetic programming) and because whenever I try to study logic and reasoning (i.e., what the symbolists do) I struggle to avoid falling asleep. I have been able to study logic, though, as shown in this long blog post.

I am most familiar with the other three areas Domingos lists. The connectionists are dominating machine learning nowadays because they use neural networks. (Ever heard of “deep learning,” anyone?) The Bayesians used to dominate; these are the people who use graphical models. The analogizers use the nearest neighbor and support vector machine algorithms. For all three of these tribes, I have covered corresponding topics in previous blog entries (e.g., see this post for a discussion on SVMs).

Given the constraints of 30 pages per tribe, Domingos explains each of them remarkably well. Having studied these in mathematical detail, I find it enjoyable to just relax, avoid the math, and understand the history: who invented which algorithm, what were competing trends, and similar stories. In addition to the five tribes, Domingos goes over other important concepts in machine learning and artificial intelligence, including Expectation-Maximization and reinforcement learning. Domingos then describes his thoughts on a universal algorithm, invoking his own research (Markov logic networks) in the progress. Finally, he discusses the implications of improved machine learning on our lives.

For the sake of a general audience, Domingos avoids mathematical details and relies on examples and analogies. This is fine, and both Domingos and I agree that it is possible to understand some machine learning without math. I probably do not agree with that statement as much as Domingos does, though; I really need to go over all the math details for me to understand an algorithm, and other readers with my mindset can get disappointed.

I also think his attempts at explaining the master algorithm at the end fall prey to excessive storytelling. After his long, adventure-related metaphor, I was (and still am) confused about how to effectively build upon his Markov logic network work to get a real master algorithm. Regarding the implications of the master algorithm on human life, that chapter felt more fantasty than reality to me, and I can’t see the kind of stuff Domingos says happening in the near future. For instance, Domingos says that since AI might take over human labor, we will need to use a “universal basic income,” but that would not necessarily be the best thing to do and I can’t see how this will be politically viable (it’s possible, though, but I’m probably thinking on a longer time horizon than Domingos).

I don’t mean for the last two paragraphs to sound overly critical. I think The Master Algorithm is interesting and I would recommend it to those who would like to learn more about computer science. I would not, however, say it is among my top-tier favorite books. It’s a decent, solid, but not once-in-a-generation, and I think that’s a fair characterization.

Thoughts on Isolation: Three Difficult Years

May 28, 2016

I have now had depression for almost three years. It started in August 2013, when I became more conscious of all the isolation I was experiencing in college. A difficult senior year that followed means that I have some contradictory feelings about my alma matter.

I assumed it would be easier for me to socialize in graduate school since the other students would be more like me. But after two years in Berkeley, my depression has worsened and it has been enormously challenging for me to focus on work. I never realized that, during their first summer, graduate students were supposed to spend days alone without talking to anyone. Does anyone remember that line from The Martian movie when one man, wondering about how the astronaut survives on Mars alone, ponders: “What does that do to a man?” Every day, I need to decide: if people are going to ask me how I feel, do I lie and say “fine”, do I say “no” and change the subject, or do I take an intermediate stance like “I’m feeling a 6 out of 10 today” and adding in some jokes. Since it’s funnier, I default to the third. Perhaps I should start using Donald Trump jokes — I can usually think of a way to incorporate him in conversations.

This isn’t completely new to me. Technically my depression began in seventh grade, but it was more sporadic back then. During seventh grade, I became conscious of my bottom-of-the-pack social standing by seeing seemingly all other students having their own group of friends. At that time, there were only two other deaf students in my cohort, and both were far more popular than me (but they also had better hearing). I tried to fight my way up the hierarchy by studying what the other students did: their clothing styles, their facial expressions, their style of jokes, or anything that I thought would help me advance socially. Unfortunately, I was not successful, and eleven years later, I am still trying to figure out the secrets.

I thought that new graduate students would be able to find a group and be supervised by more experienced students or postdocs to get a fast start on research. At the very least, they would have some cohort of students of the same year (or plus/minus one year) with matching research interests, where if not outright working together, they could talk and bounce off ideas. Given how hard it has been for me to be involved in that, though, I wonder how students get that in the first place. It’s true that it has gotten noticeably better this semester, so hopefully I have stuff to look up to. Even so, I do not know why I was admitted to the PhD program at all, and I also believe it was a mistake for me to attend Berkeley (just like it was a mistake to attend Williams).

There’s a lot of stuff that people say about the Berkeley graduate school experience, whether it’s in person during visit days or on a Berkeley website. I know some of it is for advertising (e.g., to exaggerate, “graduate students love Berkeley!”) which is fine with me. Any school would need to do that. From personal experience, however, some of the material online is misleading if not completely wrong.

I know depressed graduate students aren’t exactly a scarce resource. I’m trying to fight my way out of it, but it’s not that easy. Some people may disagree with this, but I think it’s harder to do good, research-quality work while depressed. The faculty may not have had this experience since they were the stars in graduate school and, presumably, non-depressed, but not everyone is like that. I know there are people who enjoy graduate school. Yet it feels weird for me to be physically nearby but feel so distant simultaneously. How are they enjoying their experience so much?

I get a lot of emails that relate to trying to improve the graduate school experience. Examples include graduate student surveys for enhancing the well-being of students and invitations to student events/parties. The main thing I want to do, however, is get good research done in a research group. That’s going to help me more than going to a random student event. In addition, attending these events usually means I need to ask a sign language interpreter to come with me, and I feel very awkward trying to socialize with the interpreter.

I have thought about temporarily leaving graduate school to recover, but this would not be a guaranteed solution. I obviously don’t have a group of friends to start a company, so it’s probably best to work at a large, well-established company. Then again, the same concern arises: what if I can’t figure out how to mesh with my group there (people do work in groups, right?). The same concern arises, and the people in industry might also be older and have more diverging interests. If I come back to graduate school, I’ll continue to feel like I am behind in research for taking time off.

I am well aware that I could come under criticism for this blog post. I read more foreign policy books and there are lots of people in this world suffering. Just look at the horrible, gut-wrenching crisis in Syria, where people are under assault by the government, young men feel compelled to join gangs, and young women have to resort to prostitution for money. Even in the United States, there are so many people struggling financially, and as Berkeley can attest, there are too many homeless people in this country. Why do I have the right to feel depressed, especially when so many people would like to be a Berkeley graduate student?

By that logic, though, anyone who has a job in the United States can’t complain, because there’s always someone in the world who would want that position. I also do not think my desires are that out of line. If other students worked alone, then I would not necessarily feel as bad as I do now. Just to be clear, I know I could have done some stuff better over the past few years, such as working even harder. Some of the blame is always going to lie with me. Though there is some business incentive to only prioritize a few students, however, I would like to think that the department wants to put its students in positions where they are most likely to succeed.

It is a good thing I won’t be in Berkeley this summer. Given how much I detested last summer, I don’t want to think about how much worse I’d feel if I stayed in Berkeley. Fortunately, things improved for me last semester, and so my goal for the summer is to continue that trend, and hopefully, to prevent the depression from dragging on for a fourth year.

Review of Applications of Parallel Computing (CS 267) at Berkeley

May 27, 2016

Last semester, I took Applications of Parallel Computing (CS 267), taught by Jim Demmel. This is one of those graduate courses that we can expect will be offered every year for the near future. In fact, CS 267 has been offered almost every year since 1994, and strangely enough, it seems to have always been in the spring semester. Amazingly, the website for the 1994 class is still active!

The reason why CS 267 continues to be offered is because parallel programming has become more important recently. I can provide two quick, broad reasons why: because virtually all computers nowadays are parallel (i.e., have parallel processors), and because Moore’s Law1 has started to slow down, so that future speed-ups will not “come for free” as it did with Moore’s Law, but will come with parallelism. But parallelizing existing code is hard. Hence why this class exists.

CS 267 is aimed at a broad audience, so the student body is very diverse. There are computer scientists of course, but also mechanical engineers, environmental engineers, bioengineers, political scientists, etc. The slides on the first day of class said that we had 116 (yes, 116, you can tell how popular things are getting) students enrolled, including 16 undergrads and 64 EECS students. Computer scientists, especially those in graphics/AI/theory like to take this course to fulfill prelim breadth requirements. See the following screenshot for why:


We have to take at least one course in three out of the six areas above, with one “above the line” and one “below the line.” CS 267 is in the “Programming” category and “below the line,” making it a good match for AI people, which is why you see a lot of those in the class. Incidentally, our “Homework 0” was to write a short summary about ourselves, and these were posted on the website. I wish more classes did this.

Due to its popularity, CS 267 has video lectures. I talked about this earlier when I explained why I didn’t show up to lectures. (Actually, I’m still catching up on watching the video lectures … sorry.) Fortunately, the captioners did a better job than I expected with getting captions on the videos.

Regarding the workload in the class, there were four main things: three homeworks and a final project.

Homework 1: Optimize Matrix Multiplication. We had to implement matrix multiplication. We were judged by the speed of \(C = AB\) for dense, random matrices \(A\) and \(B\), on a nearby supercomputer. The naive \(O(n^3)\) algorithm is easy to describe and implement, but is unacceptably slow. To speed it up, we had to use low-level techniques such as compiler intrinsics, loop unrolling, blocking, etc. The thing I probably learned the most out of this assignment is how to use these operations to make better use of the cache. I also credit this assignment for giving me a glimpse into how real, professional-grade matrix libraries work. We had assigned teams on this homework; the course staff paired us up based on a course survey. It’s probably best to say you’re a bad C programmer in the survey so you can get paired up with an “expert”.

Homework 2: Parallelize Particle Simulation. It’s helpful to look at the GIF of the particle simulation:


This is implemented by having an array of particles (which themselves are C structs). During each iteration, the particle positions are updated based on simple forces that cause particles to repel. We were provided an \(O(n^2)\) serial (i.e., not parallel) implementation, where \(n\) is the number of particles. This assignment involved four parts: implementing (1) an \(O(n)\) serial version, (2) an \(O(n)\) OpenMP version (i.e., shared memory), (3) an \(O(n)\) MPI version (i.e., distributed memory), and (4) an \(O(n)\) GPU-accelerated version. For this assignment, obviously do (1) first, since the others are based on that. But don’t use linked lists, which I tried in my first version. For (2), OpenMP isn’t as hard as I thought, and most of it involves adding various pragma omp statements before for loops. For (3), MPI can be challenging, so do yourself a favor and read these excellent blog posts. For (4) … unfortunately, one of my partners did that, not me. I did (1), (2), and (3), but I wish I had done some GPU coding.

Homework 3: Parallelize Graph Algorithms. We needed to parallelize de Novo Gene Assembly. No biology background is needed, because the assignment description is self-contained. Our job was to parallelize their provided serial algorithm, which constructs and traverses over something called a De Brujin graph. We had to use an extension of the C language called UPC, developed by researchers here in Berkeley, but fortunately, it’s not that much more complicated than MPI, and there are some guides and long presentation slides online that you can read and digest. UPC tries to provide both shared and distributed memory, which can be a little hard to think about, so you really want to make sure you know exactly what kind of data is being transferred when you write UPC statements. Don’t assume things work as you expect!

For all three assignments, a write-up is needed, so be sure to have time before the deadline to write it. In terms of debugging, my best advice is to keep running the code with a bunch of printf(...) statements. Sorry for lack of better ideas. But a fair warning — there’s a limit to how often we can use the supercomputer’s resources! This is something I wish the course staff had clarified early in the semester, rather than waiting until a few people had exceeded their quotas.

Final Projects. As usual, we had final projects. Our format was somewhat unusual, though: we gathered in the Wozniak Lounge from 8:10AM to 11:00AM (early for some people, but not for me). The first half of this time slot consisted of a series of 1.5-minute presentations. For my talk, I discussed an ongoing research project about parallel Gibbs sampling, but my slide probably had too much stuff in it for a 1.5-minute talk. (I also didn’t realize how low the screen resolution would be until the presentation date.) I was tempted to add some humor by starting out with one separate slide that said “MAKE GIBBS SAMPLING GREAT AGAIN”, but for some reason I decided against it. Perhaps that was mistaken, since I think the presentations could have been a little funnier. After this, we had the usual poster session.

Overall, I wouldn’t say that CS 267 is among the more challenging CS courses2, but it can be difficult for those with limited C programming experience. Some students, especially undergrads, might be somewhat annoyed at the lack of grading information, because we didn’t get grades on our final project, and I don’t even know what grade I got on the third homework. The grading is lenient as usual, though, so I wouldn’t worry much about it. Also, during the homeworks, it can be easy to not do much work if you rely on your partners; don’t do that! You get more out of the class if you take the lead on the homeworks and do as much as you can.

  1. Roughly speaking, Moore’s Law states that the number of transistors in an integrated circuit increases at a fixed rate, usually stated as doubling every 1.5 years. 

  2. I’m talking about the assignments here, not the lectures. The assignments are generally doable, but the lectures are impossible to follow. 

The Obligatory "Can You Read Lips?" Question

May 24, 2016

It’s often one of the first things people ask when they meet me, whether it’s at some orientation event, a new doctor appointment, or some other setting:

“Can You Read Lips?”

These people know that I’m deaf. Perhaps I told them immediately (rather than dragging it out), perhaps they found out from someone else, or perhaps they saw my hearing aids and drew the logical conclusion. But they know in some way. In addition, my guess is that the people asking me that question also don’t have much experience communicating with other deaf and hard of hearing people.

Taking that last assumption into account, I’m often surprised at how many have defaulted to this as their first nontrivial statement to me (or more generally, when meeting their first deaf person). I don’t know where this phenomenon comes from. Was there some notable event or movement that popularized this notion? I can’t figure out anything matching this after a brief round of Googling.

After thinking for a while, it probably comes down to this: hearing people, when they think about deaf people, might first and foremost think about sign language. After all, it’s highly visible, as anyone who’s had a class with me can attest. If I were hearing, I imagine I would be fascinated during my first encounter with sign language.

But then a hearing person might wonder: how do deaf people communicate with hearing people without interpreters present? This must be where the myth (as I’ll explain shortly) of lip reading appeared.

To get back to the initial question of this post, and in part to relieve any reader of this blog post who might need to meet me for the first time later of asking me this question: yes, I can lip read. Now let’s go over the caveats. Here’s a very incomplete list:

  • Is the person who I’m communicating with trying to move his or her lips carefully? That makes it easy to lip read. At Williams, I was once part of a little lip reading exercise to see if we could read lips. But that fails to generalize to real conversations, because we tried hard to move our lips. Most people do not have that in mind when talking, especially because it’s usually not needed. And many people slur or speak in a way that inhibits lip movement.
  • Am I face-to-face with that person, in clear lighting? For obvious reasons, it’s easiest to lip read in those situations, and much harder when, for instance, we’re in a haunted house (sorry, I couldn’t think of a better example). From personal experience, it is also harder to lip read with someone when that person is talking over a screen, e.g., in Skype, even assuming excellent Internet connections.
  • Is the person a foreigner? On top of the already challenging task of understanding accents, part of the difficulty in communicating with them is that they may lack intuitive knowledge of certain frequencies when uttering words (as one of my statistics professors put it when we discussed this once). For instance, my grandmother is a Japanese-American and I have no hope of reading her lips at all.
  • Am I trying to lip read in a group conversation? On the one hand, lip-reading theoretically should easier than trying to listen normally because it “only” involves looking at one person’s mouth. In reality, group conversations often have sudden shifts in who does the talking, or people can talk simultaneously. Finally, most people will not be face-to-face with me.

Having said all that, under the best of circumstances, I can do some lip reading. I do not, however, think my lip reading abilities are multiple standard deviations better than those of the average hearing person, especially because I haven’t had any formal lip reading training.

But more training might not matter. Even if my skills were as good as humanly possible, the above list of caveats should make a convincing case that lip reading as a panacea for deaf people’s communication difficulties is a myth.

Ava, the company I introduced in last week’s blog post, has their own blog, and mentions lip reading in a post titled “Let’s Talk About Tom, Your Colleague You Haven’t Heard About” (or as I like to think of it, “Daniel”). They mention:

Actually reading lips is really, really hard. Despite year-long trainings for that, you only can hope to get 20–30% better. What’s even worse? In English, only 30% can be distinguished with lipreading.

These numbers are almost certainly arbitrary and wildly inaccurate, but their general point is correct: lip reading just cannot convey all the information someone says. It is much less effective than just hearing someone say something. If, given the choice between no lip reading skills and wearing my current hearing aids, or having lots of lip reading skills but no hearing aids at all, I would prefer the former scenario. My hearing aids, while an imperfect remedy, are far more helpful than lip reading can and will be.

In the future, if anyone emails me to ask the obligatory “Can I Lip Read?” question, I will send them the link to this blog post. If it’s in person … I’ll provide a 20-second version of this post. This doesn’t mean I dislike it when people ask these questions. To the contrary, I actually take a slightly positive stance towards it. Yes, it gets tiring, but at least it seems like people are curious and genuinely want to be helpful, and in some cases we can get the conversation going to other, more interesting topics.

Review of Convex Optimization and Approximation (EE 227C) at Berkeley

May 21, 2016

This past semester, I took Convex Optimization and Approximation (EE 227C). The name of the course is slightly misleading, because it’s not clear why there should be the extra “and approximation” text in the course title. Furthermore, EE 227C is not really a continuation of EE 227B1 since 227B material is not a prerequisite. Those two classes are generally orthogonal, and I would almost recommend taking then in the reverse order (EE 227C, then EE 227B) if one of the midterm questions hadn’t depended on EE 227B material. More on that later.

Here’s the course website. The professor was Ben Recht, who amusingly enough, calls the course a different name: “Optimization for Modern Data Analysis”. That’s probably more accurate than “Convex Optimization and Approximation”, if only because “Convex Optimization” implies that researchers and practitioners are dealing with convex functions. With neural network optimization being the go-to method for machine learning today, however, the loss functions in reality are non-convex. EE 227C takes a broader view than just neural network optimization, of course, and this is reflected in the main focus of the course: descent algorithms.

Given a convex function \(f : \mathbb{R}^n \to \mathbb{R}\), how can we find the \(x \in \mathbb{R}^n\) that minimizes it? The first thing one should think of is the gradient descent method: \(x_{k+1} = x_{k} - \alpha \nabla f(x_k)\) where \(\alpha\) is the step size. This is the most basic of all the descent methods, and there are tons of variations of it, as well as similar algorithms and/or problem frameworks that use gradient methods. More generally, the idea behind descent methods is to iteratively update our “point of interest”, \(x\), with respect to some function, and stop once we feel close enough to the optimal point. Perhaps the “approximation” part of the course title is because we can’t usually get to the optimal point of our problem. On the other hand, in many practical cases, it’s not clear that we do want to get the absolute optimal point. In the real world, \(x\) is usually a parameter of a machine learning model (often written as \(\theta\)) and the function to minimize is a loss function, showing how “bad” our current model is on a given training data. Minimizing the loss function perfectly usually leads to overfitting on the test data.

Here are some of the most important concepts covered in class that reflect the enormous breadth of descent methods, listed roughly in order of when we discussed them:

  • Line search. Use these for tuning the step size of the gradient method. There are two main ones to know: exact (but impractical) and backtracking (“stupid,” according to Stephen Boyd, but practical).

  • Momentum and accelerated gradients. These add in extra terms in the gradient update to preserve “momentum”, the intuition being that if we go in a direction, we’ll want to “keep the momentum going” rather than throwing away information from previous iterations, as is the case with the standard gradient method. The most well-known of these is Nesterov’s method: \(x_{k+1} = x_k + \beta_k(x_k - x_{k-1}) - \alpha_k \nabla f(x_k + \beta_k(x_k - x_{k-1}))\).

  • Stochastic gradients. These are when we use approximations of the gradient that match in expectation. Usually, we deal with them when our loss function is of the form \(f(x) = \frac{1}{n}\sum_{i=1}^nf_i(x)\), where each \(f_i\) is a specific training data example. The gradient of \(f\) is the gradient of the individual terms, but we can use a random subset each iteration and our performance is just as good and much, much faster.

  • Projected gradient. Use these for constrained optimization problems, where we want to find a “good” point \(x\), but we have to make sure it satisfies the constraint \(x \in \Omega\) for some space \(\Omega\). The easiest case is when we have component-wise linear constraints of the form \(a \le x_i \le b\). Those are easy because the projection is as follows: if \(x_i\) exceeds the range, either decrease it to \(b\), or increase it to \(a\), depending on which case applies.

  • Subgradient method. This is like the gradient method, except this time we use a subgradient rather than a gradient. It is not a descent direction, so perhaps this shouldn’t be in the list. Nonetheless, the performance in practice can still be good and, theoretically, it’s not much worse than regular stochastic gradient.

  • Proximal point. To me, these are non-intuitive. These methods combine a gradient step with a proximal method. They also perform a projection.

Then later, we had special topics, such as Newton’s method and zero-order derivatives (a.k.a., finite differences). For the former, quadratic convergence is nice, but the method is almost useless in practice. For the latter, we can use it, but avoid if possible.

As mentioned earlier, Ben Recht was the professor for the class, and this is the second class he’s taught for me (the first being CS 281A) so by now I know his style well. I generally had an easier time with this course than CS 281A, and one reason was that we had typed-up lecture notes released beforehand, and I could read them in great detail. Each lecture’s material was contained in a 5-10 page handout with the main ideas and math details, though in class we didn’t have time to cover most proofs. The notes had a substantial amount of typos (which is understandable) so Ben offered extra credit for those who could catch typos. Since “catching typos” is one of my areas of specialty (along with “reading lecture notes before class”) I soon began highlighting and posting on Piazza all the typos I found, though perhaps I went overkill on that. Since I don’t post anonymously on Piazza, the other students in the class also probably thought it was overkill2.

The class had four homework assignments, all of which were sufficiently challenging but certainly doable. I reached out to a handful of other students in the class to work together, which helped. A fair warning: the homeworks also contain typos, so be sure to check Piazza. One of the students in class told me he didn’t know we had a Piazza until after the second homework assignment was due, and that assignment had a notable typo; the way it was originally written meant it was unsolvable.

Just to be clear: I’m not here to criticize Ben for the typos. I think it’s actually a good thing, because he has to start writing these lecture notes and assignments from scratch. This isn’t one of those courses that’s been taught every year for 20 years and where Ben can reuse the material. The homework problems are also brand new questions; one student who took EE 227C last spring showed me his assignments which were vastly different.

In addition to the homeworks, we had one midterm just before spring break. It was a 25.5-hour take home midterm, but Ben said students should be able to finish the midterm in two hours. To state my opinion: while I agree that there are students in the class who can finish the midterm in less than two hours, I don’t think that’s the case for the majority of students. At least, it wasn’t for me — I needed about six hours — and I got a good score. The day we got our midterms back, Ben said that if we got above an 80 on the midterm, we shouldn’t talk to him to “complain about our grades.”

Incidentally, the midterm had four questions. One question wasn’t even related to the material that much (it was about critical points) and another was about duality and Lagrange multipliers, so that probably gave people like me who took EE 227B an advantage (these concepts were not covered much in class). The other two questions were based more on stuff directly from lecture.

The other major work component of EE 227C was the usual final project for graduate-level EE and CS courses. I worked on “optimization for robot grasping”, which is one of my ongoing research projects, so that was nice. Ben expects students to have final projects that coincide with their research. We had a poster session rather than presentations, but I managed to survive it as well as I could.

My overall thought about the class difficulty is that EE 227C is slightly easier than EE 227B, slightly more challenging than CS 280 and CS 287, and around the same difficulty as CS 281A.

To collect some of my thoughts together, here are a few positive aspects of the course:

  • The material is interesting both theoretically and practically. It is heavily related to machine learning and AI research.
  • Homework assignments are solid and sufficiently challenging without going overboard.
  • Lecture notes make it easy to review material before (and after!) class.
  • The student body is a mix of EE, CS, STAT, and IEOR graduate students, so it’s possible to meet people from different departments.

Here are the possibly negative aspects of EE 227C:

  • We had little grading transparency and feedback on assignments/midterms/projects, in part because of the relatively large class (around 50 students?) and only one GSI. But it’s a graduate-level course and my GPA almost doesn’t matter anymore so it was not a big deal to me.
  • We started in Etcheverry Hall, but had to move to a bigger room in Donner Lab (uh … where is that?!?) when more students stayed in the class than expected. This move meant we had to sit in cramped, auditorium-style seats, and I had to constantly work to make sure my legs didn’t bump into whoever was sitting next to me. Am I the only one who runs into this issue?
  • For some reason, we also ended class early a lot. The class was listed as being from 3:30-5:00PM, which means in Berkeley, it goes from 3:40-5:00PM. But we actually ran from 3:40-4:50PM, especially near the end of the semester. Super Berkeley time, maybe?

To end this review on a more personal note, convex optimization was one of those topics that I struggled with in undergrad. At Williams, there’s no course like this (or EE 227B … or even EE 227A!!3) so when I was working on my undergraduate thesis, I never deeply understood all of the optimization material that I needed to know for my topic, which was about the properties of a specific probabilistic graphical model architecture. I spent much of my “learning” time on Wikipedia and reading other class websites. After two years in Berkeley, with courses such as CS 281A, CS 287, EE 227B, and of course, this one, I finally have formal optimization education, and my understanding of related material and research topics has vastly improved. On our last lecture, I asked Ben what to take after this. He mentioned that this was a terminal course, but the closest would be a Convex Analysis course, as taught in the math department. I checked, and Bernd Sturmfels’s Gemoetry of Convex Optimization class would probably be the closest, though it looks like that’s not going to be taught for a while, if at all. In the absence of a course like that, I’m probably going to shift gears and take classes in different topics, but optimization was great to learn. I honestly felt like I enjoyed this course more than any other in my time at Berkeley.

Thanks for a great class, Ben!

  1. For some reason, Convex Optimization is still called EE 227BT instead of EE 227B. Are Berkeley’s course naming rules really that bad that we can’t get rid of the “T” there? 

  2. I’m not even sure if I got extra credit for those. 

  3. One of the odd benefits of graduate school is that I can easily rebel against my liberal arts education.