# For Final Projects, Class Presentations are Better than Poster Sessions

In computer science graduate-level courses at Berkeley, it is typical to have final projects instead of final exams. There are two ways in which these projects are disseminated among the students:

• Class Presentations. These are when students prepare a five to ten minute talk to the class, using slides and other demos to state the project’s main accomplishments. Due to explosions in class enrollment (see my class reviews here for examples), time limits are strictly enforced, so presentations must be precisely timed and polished.

• Poster Sessions. These are when students bring a poster describing their work. Usually, students create posters by stuffing lots of images and text into a PowerPoint slide (or other software). Then they print it using their lab’s poster printer.

I’ve experienced both scenarios at Berkeley, and based on those I would strongly state the following to instructors: class presentations are better than poster sessions, and should be the method of choice for dissemination of final projects.

First, a class presentation means students practice a useful skill, one that they will likely need for their future careers. This is especially true for academic careers, and students taking graduate-level courses are far more likely to want academic careers than the average undergrad. For me, presentations are also a way that I can channel my humor, which isn’t immediately apparent to other students. A second, less important reason is that in an age of exploding enrollment in graduate courses, it’s nice to be able to finally learn people’s names when they give class presentations.

One can, of course, learn names and project accomplishments in poster sessions, but this requires more effort and is challenging for people like me. I have lots of difficulty navigating my way through loud, noisy poster sessions filled with accents. I either resort to reading people’s posters (and not understanding much of it anyway due to time constraints) or going through the awkwardness of having a sign language interpreter with me (and having that interpreter struggling through accents and technical terms).

Poster sessions have other downsides that apply broadly, and not just to deaf students. For instance, poster sessions allow students to hide. What happens if students don’t manage to do much for their final projects? As I’ve seen happen in my classes, these students go to the corner of the room to avoid the spotlight. Presentations avoid this issue, unless students are willing to go as far as to even skip their presentation time. Some students who are nervous about public speaking might also want to hide. To most of them, I would respond: good luck convincing your future bosses to have you not do any presenting.

If class presentations force students to produce something that is worth presenting and force them to confront their fears, then that’s probably sufficient reason alone to use them!

There are other downsides to having poster sessions. They cost more, creating a chasm between students who have access to fancy poster printers and those who don’t; the latter may have to resort to printing out ten pages of work and pasting them together into a poster. Furthermore, the posters that get printed are unlikely to be used again in the exact same form. True, many conferences have poster sessions due to scalability issues, but class projects are not generally up to par with research projects, so students would have to re-print posters anyway. And that’s assuming that students are using class projects as the basis for future research, which isn’t always the case.

Class presentations are also superior to poster sessions in that they require less physical room. The presentations can be delivered in the same lecture room, while poster sessions force the course staff to go through the trouble of finding and reserving a large room (or hallway, as is the case for Berkeley).

Furthermore, the one “benefit” of poster sessions, scalability, does not stand up to a rigorous analysis. (If there are other benefits, please let me know because I can’t think of any.)

First, if the class size is so large that it approaches the enrollment of a popular academic conference, then would the course staff really have time to read the final reports? Remember, neither presentations nor poster sessions enable people to fully understand a project; for this, one has to read papers.

Second, with five minutes per presentation, the process goes by quickly, and it is also easier for the course staff to track progress. Also, with a large class, it is likely that students would be encouraged to form groups, drastically reducing the quantity of presentations. If there are too many presentations for one class, the course staff should divide the class into groups.

Finally, scheduling presentations is not generally a problem even with many groups. Here’s a simple procedure: have a random draw to see who goes next. If the class requires a fixed schedule, then busy instructors should have their TAs form the order of presentations.

Unfortunately, the classes I’m taking next semester have historically used poster sessions rather than verbal presentations, but perhaps I could convince them to change their minds?

# Review of Convex Optimization (EE 227BT) at Berkeley

The third class I took this semester was Convex Optimization (EE 227BT), which was also my first time wading into electrical engineering. There are three convex optimization courses at Berkeley: EE 227A, EE 227B, and EE 227C. (Note: I say 227BT in this title because the course had a “T” for “Temporary,” but that should go away soon.) I did not take the first course, EE 227A, and I think that may have been a reason for my struggles in this class.

To do well in EE 227B, I think one needs to be highly skilled in the following two areas: linear algebra and problem solving. If a student lacks one or both of these skills, he or she is in serious trouble. For a linear algebra concept, consider this problem: $\max_{\{x : \|x\|_2=1\}} x^TAx$ for symmetric $A$. We encountered this at the start of the semester and would see it over and over again. The professor, Laurent El Ghaoui, said: “If you didn’t immediately know that the answer to this was the maximum eigenvalue of $A$, or $\lambda_{\max}(A)$, then run away to EE 227A. This is all linear algebra.” I did know that, in fact, but the class material was nonetheless very difficult for me to understand.
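For readers who want the step behind that claim, here is a short sketch of the standard argument (my own write-up, not taken from the course notes). It assumes the eigendecomposition $A = Q\Lambda Q^T$ with orthogonal $Q$, which exists because $A$ is symmetric, and it introduces the substitution $y = Q^Tx$, so that $\|y\|_2 = \|x\|_2 = 1$:

$$x^TAx = y^T\Lambda y = \sum_i \lambda_i y_i^2 \le \lambda_{\max}(A)\sum_i y_i^2 = \lambda_{\max}(A)$$

Equality holds when $x$ is a unit eigenvector associated with $\lambda_{\max}(A)$, so the maximum is attained.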

We had five problem sets, and I think they were among the hardest ones I’ve ever had, and also more challenging than those from CS 281A. After spending 30 to 40 hours on the first few homeworks, I realized I needed to seriously start reaching out to other students to get more than two-thirds of the homework done correctly, and I did do that this semester.

Each problem set contained three to five questions, each of which had some number of sub-problems. Their difficulty varied considerably, with some parts following directly from the Cauchy-Schwarz inequality, $x^Ty \le \|x\|_2\|y\|_2$ (not Cauchy-Schwartz … I don’t know why people keep misspelling that), and others requiring some ridiculously complicated insights. The hardest one was to prove Theorem 4 and Corollary 3 from Laurent’s paper Sparse Learning via Boolean Relaxations. Yes, we had to do that, and no, we were not given this paper reference and had to start some of that from scratch. I found out about this paper from another student. Also, the paper was published in 2015, so it must have been difficult since no one else did this until now. Setting the boolean relaxation problem aside, the homework questions were challenging but doable with some problem solving insights (one might need help for these, though), and they were brutally educational.

In terms of homework logistics, we had a paid grader who graded the homeworks, which is different from the previous iteration of the course (Fall 2014) when students had to self-grade their submissions. Note that Laurent’s EE 227BT website is (currently) incorrect; I think he recycles the same links for his classes, so some of it is out of date for the Fall 2015 edition. Our grader was surprisingly generous with points but did not offer detailed feedback and also took three or four weeks before providing grades. In part, this was because of the large class size. We had perhaps eighty students at the start before settling to fifty or sixty.

One of the “less-awesome” aspects of this class, in my opinion, was that we barely followed the projected outline. We were supposed to get five homework assignments, released every other Thursday, which meant we would get two weeks to do each assignment. However, because the lectures quickly fell behind the outline, Laurent delayed the second homework by a week, which caused a few more subsequent delays for other assignments. This meant that homeworks eventually spilled over into time that was originally designated for us to do final project work. I think it would be best to design homeworks conservatively so that even if the lectures get delayed, there’s no need to put off the homework due dates.

We had a midterm, but that was also delayed, by a week. It was in-class for 80 minutes, open note (but not open laptop or Internet). It had three questions, each with multiple parts, and was out of 40 points total. Judging from the distribution of scores, I think most students got somewhere between 15 and 30 points. It was definitely a challenging midterm, but in retrospect, I thought it was fair, and was of higher quality compared to the CS 280 midterm.

The third part of our grade was based on the final project. We started final project discussions really early, in September! Almost from the beginning, Laurent designed lectures so that we would cover standard concepts (e.g., Lagrange duality) for 75 minutes, and then the last 5 minutes would be an open discussion of final project ideas. Despite the early focus on final projects in the lectures, in reality we didn’t have that much time to work on them due to the homeworks and midterm getting delayed and cutting into project time. I think the course staff should address this in future iterations of the course.

I worked in a group of four on my final project, where we investigated various properties of neural networks. We read a lot of research papers (the “literature review” that Laurent kept mentioning in lecture) and ran experiments using Caffe and CVX. We wrote this up in a forty-page final project report. Going through and editing that at the end was a lot of work! A quick warning to future students: the project report date was set before RRR week, which I think is unusual for most graduate courses, which allow students to work on reports through mid-December.

In addition to a report, we had project presentations, which I was happy about since it’s fun to give talks. Not all students would agree with me. During the presentations, my sign language interpreters would comment on some of the students who appeared to be really nervous. To make matters worse, Laurent brought a hand-held microphone to the class, and about half of the students actually held the microphone when they were talking. No, I’m serious! And it’s not like we were on stage at Broadway — we were in a normal-sized classroom! I don’t like holding a microphone because it would make it completely obvious to the rest of the class that I was nervous about public speaking! I think Laurent had good intentions about bringing the microphone, but to future students, please don’t use microphones when talking.

When it was my turn to present, I put the microphone away after someone handed it to me (sorry, not using it!) and immediately started off with a planned joke. I told the class to pretend that Laurent and I were “trapped in a world that represents the loss function of the neural network.” (Don’t ask why!) I continued the story: I led Laurent to a local minimum, but he got angry and wanted the global minimum. I calmly responded that local minima are just as good as the global minimum in neural networks. I added a little acting and tried to cleverly alter my tone of voice. The class roared in laughter, and I think that was probably the most successful joke I have ever pulled off in a class presentation.

To wrap up my thoughts on EE 227B, I think it is similar to most classes I’ve taken in the sense that it is challenging, but very educational. I now feel like I have a much better understanding of concepts in linear algebra, especially those about norms, eigenvectors, and matrix decomposition. Many students who take this course do research in Artificial Intelligence fields, and EE 227B enables students to read AI research papers without getting bogged down by the notation and definitions. This was a huge problem for me when I first started to read machine learning papers a few years ago. I couldn’t even consistently remember what $\|x\|$ meant! Thanks to EE 227B, and some of my own independent linear algebra studying, I’ve cleared a lot of that initial “notation hurdle”.

Finally, to future students who are considering this class, the best advice I have is to make sure that your linear algebra skills are sharp. In particular, be sure you know about matrix norms, eigenvectors, and matrix decompositions (e.g., the Singular Value Decomposition).

If you’re weak in those areas, then in the words of Laurent, “run away to EE 227A.”

# Review of Advanced Robotics (CS 287) at Berkeley

I took Advanced Robotics (CS 287) last semester, which is the graduate level class that Pieter Abbeel teaches at Berkeley. You can view the course website here. Robotics is a vast, highly interdisciplinary field, so to restrict the focus, CS 287 is about the math and algorithms of robot systems. No, we didn’t see giant, science-fiction style robots battle each other, but we did observe a research robot tie knots (alas, through videos, not in real time).

Before the class even began, I could tell we would have some logistics issues. Like almost every course I have taken at Berkeley, CS 287 was substantially over-enrolled at the start; we had perhaps eighty students before settling down to about sixty at the end. According to the CS 287 websites from previous years, it looks like the Fall 2009 and Fall 2012 courses had nineteen and fifteen students, respectively. Yeah, welcome to the new normal.

Due to the class size, Pieter actually provided two different lecture times, one in the morning and one in the afternoon, and I suspect he also convinced John to do the same thing for CS 294-112. Pieter did this to get to know the students better. During some of the class breaks, he would ask a handful of students to introduce themselves to everyone. Since I sat in the front corner of the room for optimal use of sign language interpreting services, I was called on first. From these introductions, I learned a few things about the class composition:

• There were a lot of mechanical engineering graduate students. So many, in fact, that I was complaining (er, joking) about this with my interpreters midway through a long sequence of mechanical engineers introducing themselves. It’s a good thing that no one else in the class (I think…) can understand sign language. (PS: to mechanical engineers reading this, I was joking so please don’t get angry.)

• A lot of the students do not speak clearly! Many are quiet, have heavy foreign accents, or exhibit both qualities. The most egregious case resulted in my interpreter not understanding a single word a student said, which I mentioned earlier here.

• A lot of the students did robotics research of some form, whether it was in computer science, mechanical engineering, electrical engineering, or a related field. Then I’m confused, is it just this year that robotics suddenly became popular? Or is it because CS 287 wasn’t offered last year and that this is the “overflow” year?

In terms of course material, CS 287 combined lectures on standard topics in artificial intelligence (e.g., optimization and probability) and on more obscure, robotics research subjects. The course lectures could be divided as follows: Markov Decision Processes, optimization, probability, and research. Overall, I felt that the lectures were polished and of high quality. Pieter seemed like he really knew the material and was able to offer many doses of intuition for some of the more technical material.

I discuss this in my other reviews, so I’ll continue the trend: how did the lectures mesh with sign language interpreting services? Pieter lectured at a fast pace, which was problematic for my two interpreters, who were often exhausted when their 20-minute shifts were up. On the positive side, Pieter spoke loud and clear, to the point where I actually think he’s one of the easiest people for me to understand. Consequently, relative to other classes, I did not have much difficulty in terms of identifying the exact words he uttered. It’s also somewhat ironic that he would be the one to mention to me about an ideal future where people had “virtual captions” projected out of their mouths, which displays the text they say in real-time. Yes, I would like for that to happen.

As an added benefit, the course slides contained a lot of information. In many cases I could understand a concept or a homework sub-problem just by reading the appropriate slides, which is really handy for a text-heavy person like me. Incidentally, while Pieter wrote a lot of math on a white board, in almost all cases it was math directly from the slides, and he was writing it out for intuition. Thus, taking hand-written notes is probably unnecessary for this class.

No course is without its hiccups, however, and I’d like to bring up a few points that may (or may not) matter to future students:

• The difficulty of lectures varied considerably, which one can probably tell by browsing some of the slides. I thought the easiest class was the one on introductory probability. Since the material is quite rudimentary, I think that lecture should be eliminated in future iterations of the course; basic probability ought to be an ironclad prerequisite for understanding the math of robot systems. Other lectures were more complicated. The convex optimization and Kalman Filtering lectures would have been hard for me to follow had I not already had substantial exposure to those concepts.

• Towards the end of the semester, we had a “project speed-dating” lecture, which is when we gathered in small groups and shared our progress on the final project. Ideally, students could get feedback and learn what others were doing. In reality, most students skipped this class, and I’m not sure how beneficial it was to those students who did attend (I didn’t benefit). Furthermore, we eventually had final project presentations. Thus, I think project speed-dating should be replaced with a “standard” robotics lecture.

• We had three class sessions where guests from industry lectured about their companies. I’m neutral towards these, and would suggest that these only happen when Pieter (or another future instructor, if applicable) is traveling and unable to lecture.

CS 287 had four problem sets which involved math and MATLAB programming. I thought they were, on average, less challenging compared to problem sets in other classes. The math did not require incredible problem-solving skills, and I think they were designed to accommodate people from other fields (mechanical engineers …). For instance, the fourth homework asked to prove that covariance matrices are positive semidefinite, which is something that a lot of machine learning students can answer in thirty seconds. For the coding, we had to fill in MATLAB code in the designated “YOUR CODE HERE” sections. We got a lot of starter code for these assignments, so it’s relatively easy to understand how the code works in the overall pipeline.
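For reference, the standard argument really is short. Here is a sketch, assuming a random vector $X$ with mean $\mu$ and covariance $\Sigma = \mathbb{E}[(X-\mu)(X-\mu)^T]$ (the symbols are mine, not necessarily the homework’s). For any fixed vector $v$:

$$v^T\Sigma v = \mathbb{E}\left[v^T(X-\mu)(X-\mu)^Tv\right] = \mathbb{E}\left[\left(v^T(X-\mu)\right)^2\right] \ge 0$$

The expectation of a squared scalar is never negative, and $\Sigma$ is symmetric by construction, so $\Sigma$ is positive semidefinite.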

To turn in homeworks, we used Gradescope, a company Pieter co-founded with Berkeley students. We only had to turn in PDFs of our answers, and the course staff could grade code-based assignments by spot-checking our plots. (Part of the reason why we had lots of starter code is that some of it is used to generate plots, which means that they are standardized across all student submissions.) We had page limits for our solutions, so be sure you know how to cram lots of figures together in LaTeX, such as by using minipages or subfigures. Oh, I should mention: there are no solution sets to these assignments. I agree with Pieter in that there would be too much temptation for students to search for old solutions. Well, I wouldn’t search, but I’m not sure about others.

In addition to regular homeworks, we had four (!) optional extra credits, plus the final project. I only did one of the extra credit assignments, so I don’t have much to comment on those.

For my final project, I worked on a deep learning project about Atari game play, but my project ended up relating more to human learning since I analyzed data from humans playing Atari games on Amazon Mechanical Turk, and I ran out of time to integrate my findings with a Q-Learning agent. Pieter was the one who suggested this project. In fact, back in October, he and the two GSIs actually met with every project group in the class for five minutes to discuss the final project. Then, a day later, I assume Pieter sent out personalized emails to every group with project suggestions. That must have been a lot of work!

Just like in CS 280, we had project presentations, not a project poster session. That is a good thing. Single-student groups presented for 5.5 minutes. I tried to be funny by sprinkling in four jokes in my talk, and went so far as to put in a picture of Bernie Sanders in one of my slides. Unfortunately, I think my Sanders-related joke backfired since a lot of the students were internationals or were not fluent in American politics, whereas I have very strong political beliefs.

We then had to write the usual report to wrap up the project. I will warn future students: the grading for the final project is somewhat stricter than the grading for homeworks, though admittedly I think it was hard to get a really low grade on the project. Thus, to get an A, try to get at least 90 percent of the homework points, and make up for lost points with the four extra credit assignments. Pieter really makes it clear how our grades are computed, which makes the process less stressful for students who care about grades. This is in contrast with some other professors, who might not even return grades for final projects.

In conclusion, I enjoyed CS 287 and would highly recommend it to future students. Again, if possible consider taking this class concurrently with Deep Reinforcement Learning or a similar two-credit class as they would reinforce each other.

# Review of Deep Reinforcement Learning (CS 294-112) at Berkeley

Update October 31, 2016: I received an announcement that CS 294-112 will be taught again next semester! That sounds exciting, and while I won’t be enrolling in the course, I will be following its progress and staying in touch on the concepts taught.

And by the way, today I finally published my reinforcement learning post that I said I would write in my July update. You can see it here.

Update July 18, 2016: This post seems to have gotten a considerable amount of attention, at least compared to my normal blog posts, so I would like to answer some of the common questions that I’ve received in either the comments or by private email.

1. If you’re looking for homework assignments, I first want to warn you that, as I emphasize in my review, the assignments are probably not going to be as educational as you would want them to be. If you’re still interested, our TA for the class posted a GitHub repository on the Berkeley RLL page with the homework for this class. The homeworks are IPython notebooks (now called Jupyter notebooks, I think). If there’s code in the “YOUR CODE HERE” sections, then you’re probably reading the solutions; I’m not sure if there’s a clean version of the assignments there.

2. Unfortunately, we did not have any video lectures, slides, or readings outside of what you can see on the class website. A note for those who are reading the comments after this update: the class website was originally pulled down due to “some tyrants” (according to the course staff), but it’s happily now up.

3. If you’re looking for other resources to learn about deep reinforcement learning, I have several recommendations. In terms of courses, check out David Silver’s reinforcement learning course and the recent Machine Learning Summer School; the latter had our class instructor as part of the course staff, so the material is probably going to be similar to what we covered. (Coming up in a few weeks is the Deep Learning Summer School, something you might also want to check out.) I have all these courses bookmarked and am trying to carve out some time to read the slides. In terms of code, I would strongly recommend starting with either the deep_q_rl library or OpenAI Gym. The former is a super easy-to-read Python library that allows you to replicate DeepMind’s results in their 2013 and 2015 papers on Atari games. The latter was recently launched, and I don’t have experience with it, but it sounds really cool as we can compare our reinforcement learning implementations.

4. This is more of a comment than an answer, but I thought I’d mention it anyway: my blog’s comments are handled by Disqus, and in the moderation panel I can see the emails of the commenters. Thus, there is no need to post your email publicly as I can see it regardless.

Thanks everyone, and that’s all! After this paragraph is the original post as I had written it. But one more thing: after rereading this post, I think I was a little too harsh on the class. Furthermore, even though people have said they liked this post, I don’t think I gave reinforcement learning its due. So to rectify my regrets, I’m planning on launching a new series of deep reinforcement learning posts on this blog, similar to the style of Andrej Karpathy’s excellent blog post. I’ve already written a post on basic reinforcement learning, so I’m hoping to progress towards more advanced topics. My goal is to have the first post up sometime in August. Hopefully those will be a good resource for some enthusiasts out there.

What is this course? At the time I enrolled, it was a new two-credit class called Deep Reinforcement Learning (CS 294-112) and taught by Pieter Abbeel’s graduate student, John Schulman. It seemed like a cross between a research seminar and a normal lecture course. The former tend to be one or two credits and are principally about relevant research results; the latter tend to be three or four credits and have lectures, homeworks, exams, and projects.

In AI and robotics, reinforcement learning is a standard way of framing a problem. For example, if a robot needs to learn how to play a game, it must engage in “reinforcement learning” to try out different actions, get rewards, and then modify its policy. The word “deep” refers to how deep neural networks have recently become the workhorse of state-of-the-art reinforcement learning. (This is why the class wasn’t taught until now.) The broader category of deep learning involves the use of deep neural networks in other applications, such as image classification and speech recognition. Deep learning has become so popular that Google even paid \$400 million to buy a deep learning company, DeepMind.

The class had about eighty students, so to avoid getting into trouble with the building managers about stuffing too many people in one room, John gave two identical lectures for each class day. I remained in the afternoon session to make it easy on the interpreters’ schedules. Unfortunately, most of the other students also picked the afternoon session, and hey, they don’t have my excuse … perhaps they can’t wake up early? So once again, a graduate level class had some of its students sitting on the floor. Seems like that’s a common problem here, huh?

Anyway, back to the class discussion. The first few lectures were about Markov Decision Processes and neural networks, so if there were any classes to miss, it would be those because I already knew the material.

The remaining lectures were, to be frank, difficult, and I often felt mentally stressed in class. Most of the content was pure math, and the derivations were a long sequence of sums, expectations, and other terms, each of which expanded into more sums and expectations. For instance, look at the formula for policy gradients:

$g = \mathbb{E}\left[\sum_{t=1}^T \Psi_t \nabla_\theta \log \pi_\theta (a_t \mid s_t)\right]$

To understand¹ this, one has to process lots of material, such as what it means to take the gradient of the log of a policy, and that $\Psi_t$ isn’t just a simple scalar but can represent concepts like the advantage function, which involves another sequence of expectations and sums of rewards. Connecting this material is challenging in real time, and I felt that the lectures did not provide sufficient intuition. My sign language interpreters tried to repeat the exact words John uttered, but despite this, I could not translate this process into clear mathematical comprehension.
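To make the pieces of that formula a bit more concrete, here is a minimal sketch of a single-trajectory estimate of $g$. It is not the course’s code. It assumes a tabular softmax policy (one row of logits per state), and it picks the reward-to-go as $\Psi_t$, which is only one simple choice; the function names `softmax`, `grad_log_softmax`, `rewards_to_go`, and `policy_gradient_estimate` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def rewards_to_go(rewards):
    """Psi_t chosen (as an assumption) to be the sum of rewards from step t onward."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def policy_gradient_estimate(theta, trajectory, psi):
    """One-sample estimate of g: sum_t psi[t] * grad_theta log pi_theta(a_t | s_t).

    theta[s] holds the logits for state s; trajectory is a list of
    (state, action) pairs; psi is a list of scalars, one per time step.
    """
    grad = [[0.0] * len(row) for row in theta]
    for (s, a), weight in zip(trajectory, psi):
        for i, gi in enumerate(grad_log_softmax(theta[s], a)):
            grad[s][i] += weight * gi
    return grad


# Toy example: two states, two actions, uniform initial policy.
theta = [[0.0, 0.0], [0.0, 0.0]]
trajectory = [(0, 1), (1, 0)]
psi = rewards_to_go([1.0, 2.0])
g = policy_gradient_estimate(theta, trajectory, psi)
```

Averaging this estimate over many sampled trajectories approximates the expectation in the formula. Swapping `rewards_to_go` for an advantage estimate changes only how `psi` is computed.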

Given that the lectures were difficult for me to follow, I hoped that homeworks would be more useful. The homeworks in this class were provided as IPython/Jupyter notebooks. We had starter code and needed to fill in the “YOUR CODE HERE” sections.

The first homework was nearly trivial for people who knew about the basics of Markov Decision Processes, Value Iteration, Policy Iteration, and Q-Learning. I wrote about thirty lines of Python code for the entire assignment.
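To give a sense of the scale involved, here is a minimal sketch of value iteration on a toy MDP. This is not the homework code. The `transitions` layout (a dict mapping each state to its actions, and each action to a list of `(probability, next_state, reward)` triples), the function name `value_iteration`, and the two-state example are all my own assumptions for illustration.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Repeatedly apply the Bellman optimality backup until values stop changing.

    transitions[s][a] is a list of (prob, next_state, reward) triples.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        new_values = {}
        for s, actions in transitions.items():
            # Best expected one-step return over the actions available in s.
            new_values[s] = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    return values


# Hypothetical two-state MDP: from A you can stay (reward 0) or go to B
# (reward 1); B is absorbing and pays reward 2 on every step.
toy_mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)]},
}
values = value_iteration(toy_mdp, gamma=0.5)
# values["B"] is approximately 2 / (1 - 0.5) = 4, and
# values["A"] is approximately 1 + 0.5 * 4 = 3.
```

With a discount of 0.5 the fixed point can be checked by hand, which is why the toy example uses it.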

The second homework, on policy gradients, was more interesting, but the release date kept getting postponed. It soon became a running joke in class whenever John said: “Oh, and about that second homework, we plan to release it in a few days…”. It was finally released on October 11. (John on Piazza: “You may have given up hope that this day would ever come, but behold, HW2 is finally here.”) To put this in perspective, the first homework assignment was due on September 7.

Fortunately, the second assignment was more challenging than the first, and I had to be careful in implementing formulas since math from research papers doesn’t always translate neatly into code. I was pleased to see that the homework was designed so naive implementations of formulas would take too long to test. (I believe AI assignments should require code to be reasonably optimized.)

We were going to have homeworks on approximate dynamic programming and supervised learning, but since the second homework got delayed so much and the third one would have taken too long to create, the staff canceled all future assignments.

To be honest, the main deep reinforcement learning material I learned this semester didn’t even come from this class. In Pieter’s Advanced Robotics (CS 287) class, which I also took this semester, my final project was about deep learning for Atari games. I had time to sufficiently read and absorb the Atari deep learning research papers, which helped me to better understand some of the material in this class (CS 294-112). Consequently, my recommendation for someone who wants to take this class in the future is to, if possible, take CS 287 concurrently and do a project that uses neural networks. That way, one gets to do deep reinforcement learning.

To recap, here are some of the positive aspects of the class:

• It covers a popular and interesting research area.
• It presents many relevant research papers, including those from Berkeley students.
• For a class that is almost like a research seminar, there are many online resources one can consult for additional background. Unfortunately, a lot of the written references are also hard to understand.
• It is easy to obtain homework help on Piazza.

Here are some of the negative ones:

• The lectures were not polished and involved lots of math without intuition. This issue is understandable because it was a first time course taught by a graduate student.
• There did not seem to be much advance preparation for the course in terms of lecture material. The course website had a brief outline of lectures, but we had to change some of that on short notice.
• It did not provide sufficiently many or sufficiently difficult homework assignments. Having more in-depth assignments would let me deeply reinforce my understanding (pun fully intended).

Ultimately, this course allowed me to scratch the surface of deep reinforcement learning, though it was immensely frustrating for me to try and understand the material directly from the lecture, and the haphazard nature of the course did not help. I suspect that future iterations of the course will proceed more smoothly, and yes, even though no one’s told me personally, this class will be offered again (in some form) so long as deep learning remains the king of machine learning.

1. Update November 3, 2016: After studying more about policy gradients, I now feel like I truly understand this formula.

# Why Don't Democrats do This Instead?

Update August 15, 2016: As part of a site-wide cleanup, I made some changes to this post that I have been wanting to do for a while but kept putting off due to laziness. I did this because, even considering the constraints of a short blog post, I don’t think my arguments were sufficiently well-thought or described. In addition, two out of my three ideas were, I believe, reinforced after the Orlando shooting on June 12, 2016.

I have followed politics with increasing interest over the last few years. This is both enlightening and depressing. It’s enlightening because I now have a better understanding of how our society and world works. But it’s also depressing because the nature of politics has grown increasingly toxic, filled with more and more people who refuse to be radical centrists. I’m also starting to get distracted during my daily work life when I think about the most extreme political news from the night before (and occasionally, the week before).

As I struggle to understand what makes Democrats liberal and Republicans conservative, I often think about how these parties may be pursuing their goals in a suboptimal manner. For instance, Arthur Brooks has written about how conservatives do not display enough compassion in his 2015 book The Conservative Heart. In a similar manner, I have some suggestions for Democrats on how they can win a broader audience or more easily achieve their goals.

Part 1: On Taxation

In most of the Republican presidential debates, and somewhere on most of the Republican candidates’ websites, I see the following come up over and over again: simplify the tax code. Yet I don’t see this discussion as much on the Democratic side, in part because “simplify the tax code” might be a euphemism for “tax cuts for the rich.”

Yes, “simplify the tax code” is one appealing way of describing a tax code to voters. Another appealing way, if statistics on Americans’ views of taxation are correct, is to say: “let’s tax the rich.”

So let’s combine them together. If a Democrat could say that we’ll be simplifying the tax code and raising taxes on the rich, that eliminates one of the “advertising advantages” of the Republican plan.

Part 2: On Guns

On gun control, the Senate voted against a Democrat-designed bill to prevent suspected terrorists (e.g., those on the nation’s no-fly list) from buying guns, which was a response to the San Bernardino shooting. The decision was almost entirely partisan-based.

My question is: what were those Democrats thinking? In an age of divided government, the voting outcome on a bill like that should have been obvious.

The standard Republican response is to say that the real problem is mental health or terrorism, not guns themselves. We can debate the merits of those statements, but right now, at least in Congress, it may be better for Democrats to go along with bills on overhauling mental health (e.g., Nicholas Kristof mentions one possibility in his On Guns, We’re Not Even Trying op-ed).

After the Orlando shootings, a similar situation occurred. This time, the shooter was on the no-fly list, but I still predicted that no gun control bills would be passed. I was right. Incredibly, a vote might not even have happened until Senator Chris Murphy led a 15-hour filibuster.

Consequently, I think Democrats should cease focusing on gun-related bills in Congress and work at more local levels to advance their agenda. For Congress, let’s keep the focus on directly fighting terrorism.

Part 3: On Radical Islam (and Political Correctness More Generally)

Many (if not all?) of the Republican presidential candidates have criticized President Obama and Hillary Clinton for refusing to say “radical Islam.” Those two don’t say that term due to concerns over alienating Muslims.

My suggestion is to move on from this and start saying “radical Islam.”

This might be controversial, but when I look at those words together, the “radical” part implies something far removed from standard Islam. Something “far removed from standard Islam” is what took over the minds of the San Bernardino and Orlando shooters1. Even most of the Republican candidates (sans Trump?) understand that the vast majority of Muslims are not radicals (or “jihadists” if you prefer), and in fact, are among some of our biggest allies against terrorism.

Unfortunately, the skyrocketing obsession over whether President Obama and Hillary Clinton will say “radical Islam” is overshadowing real issues. I would suggest getting on the same page and stating the fact that the war on terror is largely a war against people who claim to be Muslims. If there are concerns over alienating real Muslims, then we need to be careful to add gigantic disclaimers, like this: “we are at war with radical Islam, which to repeat, does NOT mean we are at war with the vast majority of Muslims, or at war with the religion itself.”

Avoiding the term “radical Islam” is, I believe, an example of “political correctness,” a term which has seen frequent usage in recent political discourse. I am personally against political correctness, but due to the rise of Donald Trump, I think there’s been a huge misunderstanding of what that term means. Political correctness is not when Trump argues that American Muslims were cheering en masse after 9/11. That’s just plain wrong. The term “political correctness” should only be applied when we are discussing concepts or facts that are true.

A better example happened after the Orlando shootings. As part of the investigation, Attorney General Loretta Lynch and the FBI released a transcript of Mateen’s 911 calls during the shooting, but censored the name of the group to which he pledged allegiance. I was extremely disappointed upon finding out about this, and if I were a prominent politician, it would be hard for me to resist joining Florida Governor Rick Scott, House Speaker Paul Ryan, and other politicians in their criticism of the report. It is not necessary to be protective about something like this, and we should avoid unnecessary distractions from our war on terror.

1. As I point out in a related blog post, Sam Harris would instead argue that Muslims who follow their scripture exactly are the real danger, because the scripture encourages the reckless murder of infidels and other behavior that would make civilization’s stomach churn. To his credit, Harris points out the many moderate Muslims who don’t follow scripture exactly. He does not view the religion that those moderates follow as real Islam. Thus, he would think “real Islam” is what I and most others call “radical Islam,” and what we think of as “Islam,” he would equate with a “moderate variant” of Islam.

# Thoughts on Isolation: How Often do Students Work Together on Homework?

Well, I finally did it. I submitted my last homework assignment for this semester, the fifth one for convex optimization. The only work I have left this semester is my final projects.

As I was handing in the assignment, I once again wondered about a related question: how often do students work together on problem sets? My focus is on graduate-level computer science problem sets (or those from related fields). I’ve thought about that a lot this semester, since it’s a subset of the theme that I now think about every day, every hour: isolation.

I spent much of my summer alone in my “shared” office in a deserted VCL lab on the fifth floor of Soda Hall. While the other students had summer internships at Microsoft and Google, I was sitting there from 8:00AM to the evening, staring at my laptop, trying to do research, but often giving up and instead doing some prelim preparation (and blogging about that, of course). The extent of my daily conversations1 would sometimes be when I talked to various cashiers working at the cafes on Euclid Street, because I had to say things like: “I would like to buy X and Y. Thank you.”

That was it for the day (including evenings and nights, by the way).

The isolation I was experiencing gradually consumed me throughout the summer and adversely affected my mood and ability to focus. In the weeks before, during, and after the prelims, I regularly lost whole days of productivity because I was thinking about isolation all the time and how I wished I had other students with whom I could talk. (Fortunately, the prelims themselves were pretty easy, in part because I did a lot of studying before those “lost days” occurred.)

I still haven’t been able to completely recover from that disastrous summer, but I’ve made some baby steps this semester, and one of those steps has been to reach out to other students for homework collaboration.

This is new for me. In college, I did most homework assignments by myself, and made heavy use of office hours for the professors and teaching assistants. I continued that trend during my first semester at Berkeley, but that did not turn out so well. Last spring was much better, because in computer vision, we were allowed to work in groups of two and submit as a group (i.e., not “work and then submit separately,” which is the usual case) and I actually had a homework partner. He was the one who initiated our collaboration.

With that positive experience in mind, I tried to actively contact other students this semester. I sent more emails and initiated more conversations about the homeworks, and I did benefit from the discussions I had.

But I couldn’t help but think: is this the normal way students work together?

I think about this because I see many groups of students that are consistently together: they attend classes together, they attend GSI discussion sections together, they walk together, they eat together, and they do all sorts of social events together. To me, this indicates that they do not have to rely on the “email, suggest several times to meet, etc.” tactic that I used to discuss homework with classmates.

In other words, I’m someone who discusses work with other students by setting up meeting times; I sometimes feel that other students are just together all the time and don’t have to do that.

Would I like to be able to have that experience? Well, yes, of course. Sadly, working with other students doesn’t always end up working well. I don’t mean “not working well” in the sense that we can’t figure out something; I refer to “working well” in the context of how I feel after a group meeting. For instance, I thought I had an incredible stroke of luck when I found that someone else in my convex optimization class wanted to discuss practice problems to prepare for the midterm. I hastily replied, saying yes. Unfortunately, that discussion had four or five other students there and I could not hear much of what they were saying, so I felt lousy. I left early, wishing that I had specifically requested a one-on-one meeting.

The lesson? I have to be careful about working with other students, and mentally run through an extensive cost-benefit analysis first.

Anyway, those are just some random thoughts I have now. I hope next semester will be a lot better than this one.

1. These were the days when I wasn’t Skyping with my parents, of course.

# The Benefits of Having the Same Group of Interpreters

I just submitted a sign language interpreting services request for the spring 2016 semester, when I am likely to take EE 227C and CS 267. The former is the third convex optimization course offered at Berkeley, and the latter is a popular entry-level graduate course on parallel computing and systems.

For this request, though, I also said that I wanted to have a more consistent group of interpreters. This means I would prefer that the interpreters working with me now (or those from last semester) be assigned to those two courses. Just like in the Spring 2015 semester, I have a standard group of three to four interpreters this semester, but strangely enough, none of them were part of the Spring 2015 group. This is despite the fact that all of the interpreters are assigned by the same Bay Area company, Partners In Communication LLC.

In addition, there’s been another interpreting issue for this semester in particular. I’m not sure why, but I have had an unusual number of substitutes. There are two primary interpreters, plus one primary substitute interpreter, but then there have been at least five cases (as far as I can remember, all involving different people) when I’ve had substitutes for substitutes.

This would be frustrating even if I were taking an undergraduate humanities course, but when the material is as technical as it is in my courses, a normal person cannot convey the material clearly on day one. At least with consistent interpreters, they can pick up some of the common terminology. The people who interpret for my Convex Optimization class (EE 227BT) have gotten so used to hearing the words “positive semidefinite matrix” together that they can now understand that sequence when it’s used in other classes. (Positive semidefinite matrices are everywhere in machine learning – I can’t believe I went through undergrad without knowing about them, and now I’m one of their biggest fans.)

Consistent interpreting is something that I admittedly did not think about when requesting services last semester, but I will remember this for the future. It is already challenging for interpreters to work in STEM courses, so there needs to be consistency so that they improve throughout a semester. Note that in general, I do benefit somewhat from interpreting services despite issues in STEM courses, and in some cases interpreters are essential (as was the case a few weeks ago when an ear infection meant I had to stop wearing my right hearing aid for a week), so this is pretty important to me.

Oh, speaking of interpreting requests, I also need to hope that no one else “strongly suggests” that I drop and/or add a course at the last minute, though admittedly, adding a course results in substantially fewer headaches than dropping one.

# Why Can Certain People Understand Foreign Accents Easily?

Lately, I’ve observed an interesting phenomenon when my sign language interpreters have a tough time understanding some of my classmates’ accents, yet my professors (and, presumably, my other classmates) don’t seem to have that problem. Here are a few non-exhaustive examples, restricted to my Berkeley experience:

• I once gave a talk in Peter Bartlett’s research group meeting back in April. I had a sign language interpreter there, and she was able to help me by explaining some of the comments Peter made about the paper during pauses in my lecture. Unfortunately, she had major trouble with one of the postdocs, who had an accent I couldn’t even distinguish – it was not Chinese or Indian. She was unable to explain what that person said, and I think we had to rely on a few other people to help us out, plus a bit of finger pointing to the relevant stuff I wrote on the whiteboard.

• In my EE 227BT class (Convex Optimization), many of the students are Chinese. My interpreters had such a hard time understanding some of them that, after the first few lectures, they talked to the professor, Laurent El Ghaoui, about it. He acknowledged that some of them were tough to comprehend (“they’re engineers” he lamented) but he learned to repeat their questions so that my interpreters could easily relay the information back to me. However, there lies the interesting factor: my interpreters sit pretty close to the professor, and he doesn’t move around too much in class, so the difference in their comprehension probably doesn’t come down to distance from the speaker.

• In my CS 287 class (Advanced Robotics), Pieter Abbeel is unusual in that he seems like he really wants to get to know the students. (Our class is much larger than the previous editions, so he actually offers two lecture times to reduce the number of students in a room!) During some of our class breaks, he will take out the list of students and ask some of them to stand up and introduce themselves to the rest of the class. When one of the Indian students spoke about himself, my interpreter could not understand a single word that student said – literally! But Pieter did not even ask that student to repeat himself, so I assume he must have understood part of what that student said. This situation happened to a less extreme extent (as in, the interpreter understood a handful of words) with a few other students.

I think the only thing that can explain the comprehension disparity between my interpreters and professors is that the latter group of people are more used to being around foreigners. I wrote a blog post almost three years ago that highlighted my concern over understanding foreign accents in graduate school. Unfortunately, it’s been a nontrivial problem for me here as most of the students I work or converse with are international students (typically from one of two major countries: China or India).

Incidentally, none of those three professors are American1, so it’s possible that they may be more skilled at picking up accents since they’ve traveled to quite a number of places and conversed with lots of people. That’s my only explanation. But I would also be interested in knowing if there was any initial struggle or hump they had to clear, or if they actually do have trouble understanding people but are clever at hiding it.

So, here are some questions I’d like to ask:

• To people who can understand foreigners easily: why do you feel like this is the case? Where are you from? Have you traveled around the world a lot, talking to people of different nationalities? Was it always easy to talk to foreigners?

• To people who have a hard time understanding foreigners: do you get the chance to talk to a diverse group of people? How long have you tried to understand foreign accents?

• To American STEM students: how good are you at communicating with foreigners? Do you find the high proportion of foreign students in STEM to be a deterrent to your education?

• To foreigners: do you get frustrated when people ask you to repeat what you say? What are your thoughts in particular about communicating with deaf people like me (who have enough hearing to talk one-on-one)?

The practical concern of this for someone like me is that, if I work in a group of foreigners – reducing the likelihood of one-on-one conversations – will there be any benefit of a sign language interpreter?

1. Actually, probably most faculty in EECS are not even American. I don’t know what the matter is with our (pre-doctoral) STEM education.

# Suggestions on Improving Access Services Requests

It’s nice that Berkeley has an access services page that I can use to request accommodations. But it’s not perfect, and there are several things that really should be fundamental components of any service request system. Here are some of my “fundamental components,” which are not currently part of Berkeley’s recently-overhauled system:

• Any time I submit a request form, I need to get a “receipt” email that confirms I sent the request, along with the details. This gives me proof of submission and protects me in case Berkeley’s DSP loses it somehow. It also lets me double check that I filled in the boxes with correct information – it’s easy to mess up on these things.

• Any time the requests are satisfied, I need to get an email telling me that information. In my case, this means a sign language interpreter (or two) got assigned to wherever I am going, and I should know their names and contact information. Last year, for instance, I made a request for interpreting services for the Berkeley EECS town hall, but I ended up getting captioning instead, and I didn’t know until the last minute (well, five minutes). The town hall ended up being a disaster, though to be fair, it would have been difficult for an interpreter to be effective there given the noisy atmosphere.

• In the request form, there needs to be a generic “describe any miscellaneous information” box that I can type in to describe such information. For instance, some events may last for hours, but are low-key and don’t involve much discussion. If DSP were to just look at a request for a two-hour event, they would automatically hire two interpreters, but sometimes I might have to make it clear that only one is needed (thus saving money and man/woman-power). I’ve had to resort to describing this information to the person managing the requests by email, and that’s cumbersome.

I might be stretching here, but it would also be nice if my requests could get processed over the weekends. Like many graduate students, who are young and consumed with work, I don’t make much of a distinction between weekdays and weekends. But DSP will not operate over the weekends since they are staffed by “real world workers.” (Just to be clear, that’s a jab at graduate students, not real world workers.) Consequently, I have to be careful to submit requests before Friday evening; otherwise, DSP doesn’t get to look at them until Monday morning. I’ve been burned on this at least once. The lesson? Plan way ahead. I don’t have the luxury of going to an event on short notice.

My weekend idea is probably a bit too much to ask. I would be satisfied if DSP could implement my three other suggestions.

I would have thought that these things I mention are obvious, but I guess DSP didn’t have anyone complain, or they’re having technical difficulties. At the very least, this situation serves as yet another reminder to me that I need to educate intelligent people who have never considered the various issues that arise for deaf people.

# An Unbelievably Pleasant Social Gathering

A few weeks ago, there was a Berkeley AI social event. My natural reaction upon finding out about this was to ignore it, due to deeply unpleasant experiences in social gatherings before. Yet, I ended up going, for two reasons. The first was that I had a terrible summer and was constantly in a bad mood, so I figured that even if I hated attending the social, my resulting mood couldn’t possibly be worse than what I was experiencing on a daily basis. The second was that John Canny said he would be going there, and he invited me to go with him. This would at least reduce the likelihood of the most common situation for me in social events: when I stand around awkwardly, watch other people talk without a clue as to what they are talking about, and then leave when my level of frustration exceeds a certain threshold.

Critically, John and I would be going together, not separately, so my plan was to stick with him until he either introduced me to someone, or until someone were to start a conversation with me. In the latter case, I would immediately switch to communicating with that person since John would have no trouble finding others to talk to. My goal for the AI social was to have at least one non-trivial conversation (ideally at least ten minutes) with someone (other than John), and leave with my mood at least as good as it was beforehand.

Well, fortune struck that afternoon. I had three (yes, three) such conversations! And I have an embarrassingly detailed recollection of how these conversations began, and what we talked about – often down to the exact words and sentences we said.

For the sake of privacy, I won’t go through all the details, but here is the high level story. When John and I went to the social event, there was already such a huge crowd of faculty and students that I immediately became worried. Fortunately, there was a quieter room to the side that a few of us were in, and I was able to make my way over there. I saw Trevor (hey, can you tell me my prelim score already?) and a few other familiar folks. I had just started enjoying my cup of red wine when another student saw me and started a conversation, my first real one for that event. Fortunately, he spoke clearly, and the “side room” was sufficiently quiet.

I believe we had a nice conversation (I hope he liked it), and again, for the sake of privacy I won’t go through the details (but I can probably produce an accurate transcript of our conversation if needed). We reached a good stopping point, and then split up. I got some additional food and wine, and returned to the more crowded area. I was now by myself in a painfully familiar situation, and planned to remain only briefly.

But I got lucky – someone was walking back to work, and saw me along the way1 so we ended up talking. Again, this was another person who I knew somewhat well, and his voice is easy to understand. Our conversation was also pretty nice. We discussed the prelims and his plans for the next few years.

Then he had to leave. I stood around by the now-empty plate of food, and was getting ready to head out when I got lucky again: a student who I had seen before was walking towards me and had made eye contact. Yay! I did not know him as much as the two earlier students, so our conversation veered towards standard “introductory” material.

By the time we finished, most of the other students and faculty had left to go back to work (or home, because it was Friday evening), so I figured I would do the same.

But wow, that event was something I will definitely remember for a while. I felt so deliriously happy after the social, that it actually ended up a net negative on my work progress for the rest of that evening, because I kept thinking about the social rather than my work! (Don’t get me wrong – this was the rare case when I was happy not to get work done.) If others knew how much I obsess over almost trivial conversations, how I have a near-transcript level of conversation recollection, and how often I replay social situations in my head, they would probably be shocked.

So should I try attending social events more often? Maybe. Actually, one of my concerns is whether I will set expectations that are too high for future social events. I guess that’s not the worst thing to worry about, though.

1. It’s quite convenient that I stood in a location that people would have to walk by in order to leave the area. That’s a clever strategy!

# My Wish for the 2015-2016 Academic Year: A True Student Collaborator

A year into my doctorate program at Berkeley, I know that my experience has been suboptimal in many ways. One reason for that – actually, the main reason – is that I have been experiencing severe isolation the past few months, primarily a product of being in a small research group where I feel like my research interests and career goals do not closely match those of the other people here.

Yeah, it’s “the usual” for me. After middle school, I badly wanted a fresh start in high school so that I could cut down on all the isolation I had experienced. Then I badly wanted a fresh start in college, for similar reasons. Then I badly wanted a fresh start in graduate school, also for similar reasons.

So my wish for the 2015-2016 academic year is simple, and I hope, realistic.

I wish … that I will be able to find another true, student collaborator, someone who is also a Ph.D. student here, and is interested in Artificial Intelligence research. Someone who will be willing to sit down with me, work with me side by side, and help teach me what it is like to do true research and boost my confidence. Someone who will not mind if I mess up on minor errors. Someone who has good English language skills, who can speak clearly, and who will not mind that I am deaf.

Someone who I can consider a true friend.

# My Prelims (Transcript)

Note: I wrote the following “transcript” of my exam from memory, so it should not be taken as verbatim.

## Before the Exam

Of the twelve AI students taking the prelims this fall, I was up last to go; my times were 4:10 to 4:40 with Pieter Abbeel and Dan Klein, and 5:10 to 5:40 with Ruzena Bajcsy and Trevor Darrell. Obviously, I had spent the entire day of August 24 before 4:10 doing some last-minute preparation. I arrived on the seventh floor of Sutardja Dai Hall and met my sign language interpreter there, who was relieved that she hadn’t gotten lost or sent to the wrong location.

Actually, I should backtrack to rant a little. My prelims was yet another example of how having multiple channels of communication in setting up a sign language interpreting appointment can mess things up. A week before the prelims, I carefully prepared a three-page document of preparatory material for my interpreter to review. Most of the document was a list of advanced terminology that she might expect to hear on the prelims. My goal was for her to look at the words, remember the spelling, check the pronunciations online (if necessary) so that she could finger-spell them quickly1 during the prelims. I also listed where and when we should meet before the prelims. Finally, and this is important, I gave the document to Berkeley’s DSP, who then forwarded it over to the agency that officially appoints the interpreters.

Yet when I arrived at the location where I requested to meet (the third floor of Sutardja Dai Hall), I never saw her, and when it was 4:05pm, I had no choice but to go up to the seventh floor. It was only when I finally got near Pieter’s office that I saw her. She then said that she had never gotten that document I created. This is the problem: I specifically asked to send my prep documents to the interpreter directly – not through multiple parties – but got declined2. We lost a bit of early discussion time: she went right to the location of the exam at 3:50pm, while I was anxiously pacing around in the lobby waiting for no one.

In the end, this might be me going overboard, since the prelims are not a case where having interpreters is strictly necessary for me. I can understand most people when they talk clearly and directly to me, but there are many other situations when having informed interpreters would be vital.

Sorry for the rant. A few minutes after I arrived near his office, Pieter came out and immediately told me to come in. It was time.

## Part One

“So … hello,” Pieter and Dan said simultaneously.

“Hi,” I replied. I had never seen Dan and Pieter look like this before, though admittedly I do not know either of them that well. I think they were either super-tired after having eleven other students before me, or they were trying super-hard to make me feel less nervous.

“The prelims are going to be a little different this time,” Dan said, “but everyone is doing the same thing, so don’t worry. It’s a written exam. You have thirty minutes to answer these questions. You can say things out loud if you already know the answer, and you don’t have to show excessive work. While you do that, I will take notes on my laptop, so … try not to worry about me.”

Fat chance, I thought as he smiled awkwardly. Dan pressed the timer on his phone, starting the exam.

I opened the exam and flipped through all the nine questions to get a feel for them. Then I went back to the fifth question, about logic, since I knew the answer already.

“Well, the term $\alpha \models \beta$ just means alpha entails beta, which is the same thing as saying that $\alpha$ implies $\beta$,” I said, writing down briefly. “And an algorithm for doing that is resolution. Or if we’re assuming Horn or definite clauses, we can use forward chaining and backward chaining.”

Both of them nodded. Since that was the extent of the question[3], I went back to the start of the exam, in an attempt to do the rest of the questions in order. The first question was about different search algorithms and heuristics. The graph was structured as a directed tree with two goal states, with (different) edge costs listed for each arc.

“For depth first search, we just go to the left,” I said, outlining the path. “For uniform cost search, we expand this goal state. For a heuristic, we just need to not overestimate to the goal. And for the last part, when we talk about ranges that would make the heuristic admissible, it seems like we just need to make sure we have that be at most four … wait, we are doing A-star search, right?”

“Yes,” Pieter said. “Think about the value 4.5.”

I nodded and wrote my answer formally on the sheet. Then I moved on to the next question, about constraint satisfaction problems. What happens if we enforce arc consistency, and (1) we have one element in each domain, (2) we have a variable with no values left, (3) we have multiple values in each domain? “The first one should have an immediate solution,” I said. “For the second case, we’re guaranteed not to have a solution. For the third, we are not guaranteed to have a solution.”

Dan and Pieter did not noticeably react, so I went on to the third question. It presented me with a tree and asked me to write down a set of leaf node values so that the expectimax versus minimax paths are different. That was easy for me, as I wrote down “1,” “1,” “10,” and “0” for the two sets of two leaf nodes (so four leaf nodes in all). Both of them nodded in approval.

The fourth one was about Markov Decision Processes, and involved substantially more text than the previous questions. I looked at it and decided to move on to the sixth question.

“Oh yeah, a lot of people skipped that one,” Pieter said. I did not ask if that meant they skipped and solved it later, or (gulp) if they skipped it entirely.

The sixth question (remember, the fifth was about logic) presented me with two Bayesian Networks. The first part asked me to write down the joint (that’s easy – just multiply $P(X_i \mid {\rm Parents}_{X_i})$ values together). The second part, which was slightly more challenging, asked me to outline a good variable elimination ordering and a bad variable elimination ordering. The graph was shaped with a root node $X$ connected to $Y_1$ through $Y_n$ and there were also “$Z$” nodes downstream of the $Y$ nodes.

“Obviously, a bad elimination ordering starts with $X$ because if you eliminate it, all the $P(Y_n\mid X)$ terms get put into one factor that depends on all the $Y_n$s, so you get a giant table,” I said. “Now, for a good variable elimination ordering, eliminate $X$ last. Then I guess the only question is whether we should eliminate $Y_1$, $Y_2$, and so on, or the other way around. To do that …”

“You don’t need to do that,” Dan interrupted.

“Oh, I guess the ordering of the $Y$s doesn’t matter,” I said. I probably would have figured this out eventually, but I’m glad they were helping me to speed up.

The seventh question presented me with a familiar decision network diagram. “The oil can be in one of three locations,” I muttered out loud, reading the question. “One hundred utility for getting the oil, zero for not getting it. So the MEU with no information is just one hundred divided by three, since one third of the time, we get 100 utility and otherwise we get nothing.”

The second part asked about the VPI of knowing with certainty the state of oil in one of the three locations (i.e., it was either “there” or “not there” with certainty). “For that, we need to find the maximum expected utility with this information, minus the MEU I found from the first part,” I said. “So one third of the time, we know the oil for certainty, and two thirds of the time, we can narrow down our options to the other two. So subtracting that off … it’s fifty over three.”

The eighth question was about maximum likelihood. Given that we have heads, heads, tails, and heads, I had to show that the MLE of $P(H)$ is $3/4$.

“The standard way of doing this is to write the product of the Bernoullis and form the Lagrangian …” I began, but Dan correctly pointed out that it wasn’t going to be that complicated.

“Oh yeah, we can literally write out $h^3(1-h)^1$,” I said, writing it out. Unfortunately, I ended up getting $P(H) = 1/4$. I stared at my algebra for an embarrassingly long time before Dan pointed out my stupid error: failing to distribute a “3” appropriately.

“Yeah, that’s right,” I said, relieved. “Moving on…”

The last question (well, second last for me) showed me three different two-dimensional Gaussian distributions, with different $\sigma$ “curves” printed. I had never seen these diagrams before but it was pretty intuitive what they meant; if the curve stretched “long” in the $x$ direction but “short” in the $y$ direction, then the covariance matrix would have $\Sigma_{1,1} > \Sigma_{2,2}$. Two of them had diagonal covariance matrices, and the third one, which stretched out long in the second and fourth quadrants, I put down as having $Cov(X,Y) = -1.5$.

“Why did you pick that?” Pieter asked.

“Well …” I said, not entirely sure if that was correct. (I later realized after the exam that I was right, and knowing that with certainty required remembering the formula for correlation.) I wrote out the definition of the covariance: $\Sigma_{1,2} = Cov(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$.

“And the expected values of both are zero,” Pieter pointed out.

I wasn’t sure if he said that in agreement with my answer, but time was running out so I came back to that MDP question I skipped earlier (and left my answer to the ninth question unchanged). There was a grid of squares arranged in a “circular” fashion where we could move to any of two adjacent squares, except for two “slide” locations that forced us to move right.

“This is asking me about the value of states after three, ten, and infinity iterations of Value Iteration, with discounts of $\gamma=1$ or $\gamma=0.1$,” I said. “Well, the value of the state by the slide is going to be two plus two-tenths after three iterations with discounting. Without, it’s four.”

I proceeded, figuring out the different values of states they asked me to write. I tried to sprinkle in a few jokes here and there to amuse them, and was moderately successful. Unfortunately, time ran out before I could fill in the last two blanks, so I put in a “2” in one of the boxes at random.

“No, I only put a random value down to get something down in the time limit,” I said.

“Well, why don’t you try and solve it?”

I looked at it, and within a few seconds, saw that the answer was infinity. “Um, yeah, infinity I guess, for both,” I said, knowing that I should have gotten that earlier.

“All right, I guess we’re done. You’re the last one so you can now talk to any of the other students about this part of the exam,” Pieter said.

“Do you have a rough idea of how well I did?” I asked.

Dan smiled awkwardly. “We … can’t really do that now.”

Fine, I acknowledged, and left the office feeling reasonably happy with my performance. I read my AI book for a few more minutes before showing up at Ruzena’s office.

## Part Two

“Well, hello. Why don’t you introduce yourself?” Trevor asked. Again, I’m assuming they were doing this to help the students feel less nervous, though in my case I did not expect these two to know anything about me so this really did qualify as an introduction. I mentioned my undergraduate institution and my advisor, but I also wish I had remembered to say that I was deaf since birth.

Then they proceeded to ask questions.

“Explain logistic regression,” Trevor said.

“This is a classification algorithm,” I replied. “Let’s suppose we’re in the binary classification case. Then logistic regression means that we are assuming – at least in the discriminative case – that our data should be assigned probabilistically to classes of 0 or 1 based on the logistic function, $P(y = 0\mid x, \theta) = (1+e^{-\theta^Tx})^{-1}$.” I wrote the formula on Ruzena’s white board.

“And what are the parameters?” Trevor asked. I pointed to $\theta$ and mentioned that we could update these with a stochastic gradient method based on the gradient of the data log likelihood.

“Next up: how do you do linear regression on V-shaped data?” Trevor asked.

“This is a mixture model,” I said. “Just have different linear regressions for the two sets of data.” I drew some hypothetical data on the board.

“But how do you know that you will have two components?” Trevor asked, shrugging. “Why not just one regression line?”

“Well, this is basically ordinary least squares, right? So with an extra regression, we would be able to better minimize the sum of squared errors…”

“Perhaps I should rephrase this,” Trevor said. “We have to decide the number of components, and in the general case, we can’t just eyeball the data and say that we will have two components. We need to try out different models and evaluate them somehow.”

“Oh, just try out different numbers of distributions in the mixture model and perform cross validation–”

“Yes, good,” Trevor interrupted. “Next up, how would you classify data if you did not know the labels?”

“One way to do that is with an algorithm like K-means,” I replied. I drew out some unlabeled data and tried to signal how the algorithm would work by mentioning how the center of each cluster gets updated.

“Is this parametric or non-parametric?”

“Uh … it’s parametric in the sense that we can fix $K$ and we only have a finite number of parameters.”

“That’s correct,” Trevor said, “but what is an example of a non-parametric technique?”

“Nearest neighbors.”

It was Ruzena’s turn to ask. “Can you write down Bayes’ rule for classification?”

I wrote down Bayes’ rule on the white board, for $P(y=0\mid x,\theta)$. I wasn’t sure if I really understood the question, because it’s just Bayes’ rule!

“Often times we care about learning the parameters,” Ruzena said.

“Yes, in which case we would have something like Bayes’ rule applied to $P(\theta \mid x)$,” I said, still unsure if I was supposed to answer something in more detail.

“Can you write down or explain what the Bayes’ risk means?” Ruzena asked.

This wasn’t on the syllabus, I thought, but remembering the formulas from CS 281A, I wrote down $\mathbb{E}[\ell(y,\hat{y})]$. “Basically, the expected loss between what we predict $\hat{y}$ and the true $y$.”

Surprisingly, they seemed satisfied with that brief answer, so Trevor changed the subject to the next part – neural networks[4].

“First, describe a neural network.”

“This is a classifier that can also be viewed as an architecture,” I said, drawing out some nodes on the board. “We have an input layer that takes in our data, such as pixels in an image. Then we have the option of having hidden layers, and I’ll write down one. Then an output layer, where we have outputs. The key with these architectures is that the hidden layers allow for extra complexity, so to speak, but we also need to make sure we add in activation functions for the nodes to introduce nonlinearity. So we can use the sigmoid.”

I explained a few more basic properties in detail, and also went over backpropagation, until Trevor seemed satisfied.

“Now we’re going to talk about convolutional neural networks,” Trevor said. I kind of expected this to happen since Trevor was on the committee, but since CNNs were not on the syllabus, I would have to rely on my memory of CS 280 material (thank goodness I took the class).

“I have a network architecture listed here,” Trevor said, showing me his iPad. “Please draw it.”

I drew it out. It was a standard LeNet-like architecture, with a $13\times 13$ input layer, a convolutional layer, a max-pooling layer, and a fully connected layer, resulting in four outputs.

“How many pixels are there in each resulting image in the convolutional layer, given the kernel size listed?”

“They are ten by ten,” I replied, since the filters were $4\times 4$ and the stride was one.

“Good. Now what about the size of the max-pooling layers?”

“Five by five.”

“And how many weights are there in the fully connected part?”

I wrote down $(3\times 25) \times 4$ on the board.

“So, three hundred,” Trevor approved. “Now can you tell me the difference in run time or space time between convolutional and fully connected layers?”

“Yes, the space time is higher for fully connected layers because convolutional ones have weight sharing,” I replied, with Trevor nodding. “As far as runtime, wouldn’t the same apply for the convolutional one …”

“Well, think about the total amount of weights,” Trevor corrected.

I muttered something about the total number of (not necessarily unique) weights, and tried to make a high-level argument that the runtime would be higher in the convolutional layer, assuming that training or testing time would be proportional to the number of weights. Trevor seemed happy, and continued asking neural network questions.

This was another topic that I only remembered from CS 280. Fortunately, I did remember, and I drew out the curve of the sigmoid function. “With the sigmoid, the gradient will go to zero as $x$ goes to plus or minus infinity. After a certain point, the difference in gradients is irrelevant and will not contribute to the update. This is why we like ReLU layers, because those are $f(x) = \max\{0,x\}$ and the gradient will increase linearly.”

I fumbled around with my initially sloppy explanation until I felt Trevor seemed satisfied with a less sloppy version.

“The last question for today is about transfer learning,” Trevor said. “What is it?”

Yet another question from CS 280! Wow, I am really glad I took the class. “That’s when we use weights learned from one task to apply to another task, hence transferring the knowledge from training one thing to another. So, in a computer vision object recognition challenge, imagine that we have one large data and one small, specialized data. We train on the specialized data to get some weights. Then we…”

“Other way around,” Trevor smiled, motioning with his hands.

“Oh, right, sorry! Yeah, I remember that. So we would train on the large, broad data to get some weights. Then, for another neural network on a more specific task, we use those as the initial weights. Say, for instance, we are trying to classify objects as being Trevor or Darrell …”

“I’m Trevor Darrell!” Trevor replied.

“Yeah, I meant that those are two similar output classes, so we would need specific weights for those,” I said. I worried a bit that my joke backfired, but Trevor seemed to be in a good mood, so maybe not. I provided a half-baked concluding overview of transfer learning and neural networks in general.

“And we’re all out of time!” Trevor said, feeling satisfied.

“When do you think you can provide the results?” I asked.

“Probably tomorrow!” Trevor grinned.

I left Ruzena’s office, feeling pleased that I had remembered how convolutional neural networks worked from CS 280, but wishing that it had been explicitly listed on the syllabus. I knew I would be constantly refreshing my email tomorrow.

As it turned out, it wasn’t until a week later that I got an email from Pieter saying I completely and comfortably passed the AI prelim. Now I feel a little better.

1. I may not have the most fluid ASL due to lack of practice, but I can fingerspell blazingly fast and can understand blazingly fast fingerspelling. If you’re interested, we can talk about that.

2. DSP later told me that it was because the agency that provides interpreters only gives them prep materials at the moment of assignment, which was set about a week before I sent them the document.

3. Despite Stuart Russell not being on the prelim committee this time, we did get a logic question, but it was by far the easiest (assuming you studied it!) and shortest question on the exam.

4. Yeah, I should have known that two prominent computer vision researchers would have asked me about neural networks.

# Miscellaneous Prelim Review (Part 2)

This will probably be my last prelim review post. The topics I’ll cover in this post are convex optimization, statistical learning theory (broadly), and logic/planning. Actually, I wanted to make some detailed notes about Kalman filtering, but I think I’ve done more than enough here, and there are too many equations involved to write that quickly.

## Convex Optimization

This part is based on sections 9.1 through 9.5 of Boyd and Vandenberghe’s book, freely available online. Stephen Boyd also has a lecture video on YouTube that I watched, which I found to be very helpful. (I can also understand Professor Boyd’s speech well.) The book is fine, I suppose, but is really hard for me to read so I made embarrassingly slow progress as I learned this material.

The main purpose of sections 9.1 through 9.5 is to discuss iterative algorithms for minimization. Formally, we have the problem of minimizing a convex function $f(x)$ and need to find the optimal $x = x^*$. As in almost all cases, we have to remember that $x$ is not generally a scalar but a vector, but $f$ is real-valued, so $f : \mathbb{R}^n \to \mathbb{R}$. I had to keep reminding myself of this.

In most cases, we need to use iterative algorithms to find $x^*$. The class of algorithms we use are known as descent algorithms because they generate points $\{x^1,x^2,\ldots \}$ such that $f(x^k) > f(x^{k+1})$ unless we are at the optimum. Actually, a little side-note: there is exactly one optimal $x^*$ because we are actually assuming that $f$ is strongly convex, not just convex (and strong convexity is not the same as strict convexity!). By strong convexity, we assume that there is a constant $m > 0$ such that $\nabla^2 f(x) \succeq mI$, which means $\nabla^2 f(x) - mI$ is positive semidefinite.

A lot of our future analysis will depend on a concept known as the condition number of a matrix or a set. For a matrix, the condition number is the ratio of the largest singular value to the smallest singular value. Alternatively, we can use eigenvalues if the matrix is symmetric positive semidefinite, which actually happens here since the second-order characterization of convexity states that the Hessian of $f$ is positive semidefinite. The condition number of a set is defined (in the book) as the square of the ratio of its largest width to its smallest width. A high condition number means the sublevel sets are highly skewed/stretched.

Here’s the descent algorithm. We repeat the following until some stopping criterion:

• Compute a (descent) direction to change $x$, denoted $\Delta x$
• Compute a length or step size $t$ to go in that direction using some form of line search
• Compute $x \leftarrow x + t (\Delta x)$

There are two main ways to choose $t$: exact line search (find $\arg\min_{t \ge 0} f(x + t (\Delta x))$) or backtracking line search. For some reason, it took me a really long time to understand backtracking line search, but after looking at that figure in Boyd’s book for ages, I understand what it does now. We start at $t=1$ and keep shrinking $t$ by a factor $\beta \in (0,1)$ until $f(x + t (\Delta x))$ lies below a given upper bound, $f(x) + \alpha t \nabla f(x)^T \Delta x$ with $\alpha \in (0, 0.5)$. Backtracking line search is important because it’s cheaper and, in practice, it often works just as well as exact line search. To explain it, remember that figure with the three curves in it: one is $f$ restricted to the search direction, and the other two are straight lines through $f(x)$: the first-order approximation $f(x) + t\nabla f(x)^T \Delta x$, and the same line with its slope scaled by $\alpha$, which is the upper bound.
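To make the backtracking rule concrete, here is a minimal sketch in plain Python. The function name, the toy objective, and the values of `alpha` and `beta` are my own choices for illustration, not anything taken from the book.

```python
def backtracking_line_search(f, x, grad_fx, dx, alpha=0.3, beta=0.8):
    """Return a step size t with f(x + t*dx) <= f(x) + alpha*t*grad_fx^T dx."""
    fx = f(x)
    # Directional derivative along dx; negative whenever dx is a descent direction.
    slope = sum(g * d for g, d in zip(grad_fx, dx))
    t = 1.0
    while f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope:
        t *= beta  # shrink the step and try again
    return t


# Toy objective, chosen only for illustration: f(x) = x1^2 + 10*x2^2.
def f(x):
    return x[0] ** 2 + 10 * x[1] ** 2


def grad_f(x):
    return [2 * x[0], 20 * x[1]]


x = [1.0, 1.0]
g = grad_f(x)
dx = [-gi for gi in g]  # the gradient descent direction
t = backtracking_line_search(f, x, g, dx)
x_new = [xi + t * di for xi, di in zip(x, dx)]
print(t, f(x), f(x_new))
```

The loop never evaluates a derivative of $f$ along the line; it only compares function values against the relaxed linear bound, which is why it is so much cheaper than an exact search.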

But the real difference in the various gradient algorithms comes when we pick $\Delta x$. There are three options:

• Use $-\nabla f(x)$. I mean, the gradient $\nabla f(x)$ points in the direction of greatest increase of $f$ at $x$ (by definition) so why on earth would we not use the negative of that? This is gradient descent.
• Use the direction that makes the directional derivative most negative among all directions of unit size in a pre-specified norm. Precisely, the first-order approximation of $f$ at $x+v$ is $f(x+v) \approx f(x) + \nabla f(x)^Tv$, and we want to find $\arg\min_v \{\nabla f(x)^Tv : \|v\| \le 1\}$; in other words, we want to make the directional derivative as negative as possible. We need to restrict $\|v\|$ because otherwise we could pick any direction with a negative directional derivative and then scale it to be arbitrarily large. Also notice that we are not specifying the exact norm. This is steepest descent, and it equals gradient descent when $\|\cdot\|$ is the $\ell_2$ norm.
• Use $-(\nabla^2 f(x))^{-1}\nabla f(x)$, i.e., the negative of the inverse of the Hessian, multiplied by the gradient. Whew! This comes from the second-order approximation of $f(x + \Delta x)$ – just take the gradient with respect to $\Delta x$, set it to zero, and solve. This is Newton’s Method.

Gradient descent is simple, and works perfectly (i.e., with exact line search on a quadratic, converges in one step) when the sublevel sets are “isotropic,” that is to say, roughly “equal in all directions.” It’s bad when the condition number of the Hessian or the sublevel sets is high (e.g., in the 1000s). The classic example is the ellipsoid “bowl” where we have a 3-D bowl that is much wider in one direction than the other. Gradient descent with exact line search will always “overshoot” the optimal location and keep going back and forth, zig-zagging to the center. The stopping criterion for gradient descent is $\|\nabla f(x)\|_2 \le \eta$ for some pre-specified $\eta$.
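The effect of the condition number is easy to see numerically. Below is a small sketch, assuming the toy quadratic $f(x) = \frac{1}{2}(x_1^2 + \kappa x_2^2)$, whose Hessian has condition number $\kappa$ and for which the exact line search step has a closed form. The function name and starting points are hypothetical choices for the demo.

```python
def gd_exact_quadratic(kappa, x0, tol=1e-8, max_iter=10000):
    """Gradient descent with exact line search on f(x) = 0.5*(x1^2 + kappa*x2^2).

    Returns the final point and the number of iterations taken.
    """
    x = list(x0)
    for k in range(max_iter):
        g = [x[0], kappa * x[1]]              # gradient of f at x
        g_norm_sq = g[0] ** 2 + g[1] ** 2
        if g_norm_sq ** 0.5 <= tol:           # stopping criterion on the gradient norm
            return x, k
        # Exact minimizer of t -> f(x - t*g) for a quadratic: t = g^T g / g^T H g.
        t = g_norm_sq / (g[0] ** 2 + kappa * g[1] ** 2)
        x = [x[0] - t * g[0], x[1] - t * g[1]]
    return x, max_iter


_, iters_round = gd_exact_quadratic(kappa=1.0, x0=[1.0, 1.0])
_, iters_stretched = gd_exact_quadratic(kappa=100.0, x0=[100.0, 1.0])
print(iters_round, iters_stretched)
```

With $\kappa = 1$ the bowl is perfectly round and a single step lands on the minimum. With $\kappa = 100$ the iterates zig-zag across the narrow valley and need far more steps to reach the same tolerance.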

Steepest descent is a generalization of gradient descent in that we get the option of picking the norm that we want to use as a metric of our “gradient” here. A quick warning: there are actually two versions of $\Delta x$. I tend to assume we are using the normalized version $\Delta x_{\rm nsd}$, where the $v$ we pick has norm bounded by one. There’s also the un-normalized version $\Delta x_{\rm sd} = \|\nabla f(x)\|_{*} \Delta x_{\rm nsd}$ but I don’t understand how this actually works.

Steepest descent can work with the $\ell_1$, $\ell_2$, and quadratic norms. In the $\ell_1$ norm, it is equivalent to coordinate descent (modifying one coordinate of $x$ at a time), and the way to think about this is that we are taking the maximum component (in absolute value) of $\nabla f(x)$ and setting our $v$ to be zero everywhere except for $\pm 1$ at that “largest component.” The derivation for $\Delta x_{\rm nsd}$ in the quadratic norm is more complicated (for the un-normalized, it’s just $-P^{-1}\nabla f(x)$), but visualizing it is easier: we have a point $x$, draw an ellipse around it (determined by the norm), and then pick the direction that results in the greatest decrease. More intuition: extend as far as possible in the direction of $-\nabla f(x)$, while staying inside that unit ball. It’s also worth noting that we can transform coordinates using the quadratic norm’s matrix $P$ to get gradient descent. In fact, this gives a useful test for a norm: how well steepest descent performs will depend on how well the transformed points $P^{1/2}x$ have “equal” isocontours suited for gradient descent.
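As a quick sanity check on the $\ell_1$ and $\ell_2$ cases, here is a minimal sketch of the normalized steepest descent directions. The function names are made up for this example, and ties in the $\ell_1$ case are broken by taking the first largest component.

```python
def nsd_l1(grad):
    """Normalized steepest descent direction for the l1 norm.

    Zero everywhere except +/-1 at the coordinate where |grad| is largest,
    with the sign chosen to oppose the gradient there.
    """
    i = max(range(len(grad)), key=lambda j: abs(grad[j]))
    v = [0.0] * len(grad)
    v[i] = -1.0 if grad[i] > 0 else 1.0
    return v


def nsd_l2(grad):
    """Normalized steepest descent direction for the l2 norm: -grad / ||grad||_2."""
    norm = sum(g * g for g in grad) ** 0.5
    return [-g / norm for g in grad]


grad = [0.5, -3.0, 1.0]
v1 = nsd_l1(grad)
v2 = nsd_l2(grad)
print(v1, v2)
```

Both directions have unit size in their own norm, and the directional derivative $\nabla f(x)^T v$ is negative for each, but the $\ell_1$ version moves along a single coordinate only.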

Newton’s method is a step up from gradient descent in that we use a second-order approximation of $f$. The way I think of it is that gradient descent will produce a plane in 3-D (e.g., for a 3-D “bowl” that we’re trying to reach the minimum of) but Newton’s method will produce another bowl, though this bowl will usually be entirely above the original one, save for the tangent point.

The book mentions three “perspectives” on Newton’s method:

• Minimization of the second-order approximation of $f$, which is how I see it.
• Steepest descent in the Hessian norm: it’s like the quadratic norm described earlier, but the Hessian is a really good “$P$” matrix to use since its condition number approximates the condition number of the sublevel sets!
• Solution of linearized optimality condition. I did not understand this at first, but actually, think of Newton’s method for approximating roots of a function $f$, where we need to subtract $f/f'$. In our case, we want to find the minimizer of $f$, which means we want the roots of the derivative $f'$, which involves $f'/f''$. That’s exactly what we have here!

Here are some other facts about Newton’s method:

• If the original function is already quadratic, Newton’s method converges in one step.
• It is independent of affine coordinate transformations. When we do iterates with $x^{(k)}$ versus $Tx^{(k)}$, the relationship between the points will remain the same.
• It uses something called the Newton decrement $\lambda(x) = (\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x))^{1/2}$ to determine when to stop; the quantity $\lambda(x)^2/2$ approximates $f(x) - f(x^\star)$.
• There is a damped phase versus a pure (quadratically convergent) phase. In the former, $f$ decreases by at least a fixed amount at every iteration (this is good!). In the latter, the backtracking line search always picks $t=1$ and the number of accurate digits roughly doubles at each iteration. Thus, there is no need to run that second phase more than, say, four times.
• Newton’s method still works with badly-conditioned sublevel sets of $f$.
• The downside of Newton’s method compared to gradient or steepest descent is that (1) we have to compute the Hessian, and (2) we have to store it – remember that the Hessian will be $n \times n$, whereas the gradient will only be $n \times 1$.
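Putting the pieces together, here is a small sketch of Newton’s method with backtracking and the Newton decrement as the stopping rule. The two-variable test function is a sum of three exponentials in the style of the book’s examples, with the gradient and Hessian worked out by hand. The names, starting point, and tolerances are my own assumptions.

```python
import math


def f(x):
    return (math.exp(x[0] + 3 * x[1] - 0.1) + math.exp(x[0] - 3 * x[1] - 0.1)
            + math.exp(-x[0] - 0.1))


def grad_hess(x):
    a = math.exp(x[0] + 3 * x[1] - 0.1)
    b = math.exp(x[0] - 3 * x[1] - 0.1)
    c = math.exp(-x[0] - 0.1)
    g = [a + b - c, 3 * a - 3 * b]
    H = [[a + b + c, 3 * a - 3 * b], [3 * a - 3 * b, 9 * a + 9 * b]]
    return g, H


def solve_2x2(H, rhs):
    """Solve H z = rhs for a 2x2 matrix H using Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * rhs[0] - H[0][1] * rhs[1]) / det,
            (H[0][0] * rhs[1] - H[1][0] * rhs[0]) / det]


def newton(x, eps=1e-10, alpha=0.1, beta=0.7, max_iter=50):
    for k in range(max_iter):
        g, H = grad_hess(x)
        dx = solve_2x2(H, [-g[0], -g[1]])            # Newton step
        lam_sq = -(g[0] * dx[0] + g[1] * dx[1])      # squared Newton decrement
        if lam_sq / 2 <= eps:                        # lambda^2/2 estimates f(x) - f(x*)
            return x, k
        t = 1.0
        while f([x[0] + t * dx[0], x[1] + t * dx[1]]) > f(x) - alpha * t * lam_sq:
            t *= beta                                # damped phase: backtrack
        x = [x[0] + t * dx[0], x[1] + t * dx[1]]
    return x, max_iter


x_star, iters = newton([-1.0, 1.0])
print(x_star, iters)
```

Setting the gradient to zero by hand gives $x_2 = 0$ and $x_1 = -\frac{1}{2}\ln 2$, so the output can be checked against that. In higher dimensions the Cramer’s rule helper would be replaced by a proper linear solve, which is exactly where the cost of the Hessian shows up.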

The usual disclaimers apply in that we don’t really know various constants that get involved in the proofs, unfortunately.

## Statistical Concepts and Logistic Regression

This part is closely related to what I wrote about linear regression and the least mean squares algorithm. I will be discussing logistic regression as well (for classification, not regression), but first we take a brief detour to discuss the third major class of problem known as density estimation.

The problem is, given data, to find the appropriate model for it. The relatively easy case is if we assume we already have an idea of the distribution (e.g., Gaussian) and we just need to find the parameters (here, the mean and variance). We find the parameters via maximum likelihood. So in the IID Gaussian case, whose graphical model is $N$ independent shaded circles, we take the sample mean and sample covariance as our MLEs. With the Bayesian approach, where we have a new $\mu$ node pointing to all samples, we put a Gaussian prior on mean $\mu$ so that the result is a weighted estimate (and the same for the variance, actually), because of conjugate priors. In the case of discrete data $x$, we model these with multinomials. The resulting MLEs, which require Lagrange multipliers to solve (which gave me a huge headache at first), are just the sample proportions. For the Bayesian version, we use a Dirichlet prior. To extend the class of distributions we want to model, we can assume a mixture model, where $p(x\mid \theta) = \sum_{i=1}^{k}\alpha_i f_i(x\mid \theta_i)$, where the $\alpha_i$s are mixing proportions that sum to one. This time, each observed data point $x$ has its own hidden node pointing to it.

There is an alternative strategy of estimation known as nonparametric density estimation. Here, we do not assume we have a fixed parameter $\theta$ and as our data grows, the nonparametric model will grow to represent a wider class of distributions. We have kernels, where each data point takes some probability mass, and we add them up and normalize. In the case of Gaussian kernels, the nonparametric case for a fixed number of samples really reduces to the mixture model case, but they differ as the number of instances grows.
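Here is a minimal sketch of a one-dimensional Gaussian kernel density estimate, to show what “add them up and normalize” means in practice. The data values and the bandwidth are arbitrary numbers picked for illustration, and choosing the bandwidth well is a separate problem not handled here.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a density function that places one Gaussian bump on each data point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


data = [-1.0, 0.0, 0.2, 2.5]          # made-up sample
density = gaussian_kde(data, bandwidth=0.5)

# Crude Riemann sum over [-10, 10] to check that the estimate integrates to about one.
total_mass = sum(density(-10 + 0.01 * i) * 0.01 for i in range(2001))
print(density(0.0), total_mass)
```

For a fixed sample this is literally a mixture of $N$ Gaussians with equal mixing proportions $1/N$, one centered on each data point. The difference from the parametric mixture is that the number of components grows with the data.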

Tip: use the nonparametric case if we do not have a good idea of the model and have lots of data, but use the parametric version when we have little data and a good idea of its underlying distribution (it will converge faster). The line between the two methods does blur somewhat, for instance, when we have a mixture modeling problem where we have to dynamically estimate the number of components $K$.

Finally, we can turn our attention towards the regression and classification problems. In both cases, we model $p(y_n\mid x_n,\theta)$, where the $n$ here indicates that we assume IID data. For linear regression, we assume $y_n = \beta^Tx_n + \epsilon_n$, and have to find $\beta$. The choice of $\epsilon_n$ is what really determines the distribution – here we assume Gaussians, so this is linear regression, and that means the MLE of $\beta$ is the OLS estimate. Another way of extending linear regression to be more flexible is to use (conditional) mixtures. Here, the graphical model looks like that of the density estimation mixture model, except we also need the $X_n$ node (which may or may not be connected to the mixture node $Z_n$). And, of course, we could always treat these from a Bayesian perspective, perhaps by endowing that $\epsilon_n$ error term for Gaussians (in linear regression) with Gaussian priors for its mean and variance (well, probably variance only if we want the mean to be zero).

We can also use nonparametric regression, if we do not want to restrict our conditional mean functions. Actually, Russell and Norvig cover this a bit in their nonparametric methods section in the textbook; each predicted new $y$ is based on the weighted prediction of the other, “nearest” $y_n$s.

In the classification case, the distinction between generative and discriminative cases is more apparent. I remember the way the arrows point in the model just by remembering the discriminative case, and then realizing that the generative is the opposite one. Use the generative case if we want a full probabilistic model, and use discriminative classification if we only care about the decision boundary. The full model in the generative case also may help combat overfitting, so it is better with limited and partially observed data. Discriminative models have less bias because they make fewer assumptions, so they work better with lots of data (in fact, it’s a lot like how nearest neighbor will work best with lots of data).

These approaches are important for understanding the logistic regression algorithm, where we assume that the posterior probability $p(y=0\mid x, \theta)$ for a binary classification problem is logistic or arrives at that form. That we have the inner product there means the posterior “boundaries” of equal probability are hyperplanes. In the generative case, we estimate means and covariances, which define $\theta$ (and these are density estimation problems!) and the boundary implicitly, while in the discriminative case, we estimate $\theta$ “directly,” possibly choosing an arbitrarily complex boundary. Roughly, “discriminative = logistic regression”, “generative = Naive Bayes”, and both are for classification. In fact, that’s why they are in the same chapter of Mike Jordan’s notes!

Again, logistic regression assumes we have the sigmoid function as the form for our posterior probability. We can assume this from the outset (discriminative) but we can also “inspire” this generatively. Here’s how: assume that we have two classes, and the class conditionals[1] are Gaussian with, and this is important, the same covariance matrices. Then the posterior $P(Y = 0 \mid X, \theta)$ can be expressed as $(1+e^{-\beta^Tx - \gamma})^{-1}$, i.e., the exponent has an affine function of $x$, which means that the boundaries of equal probability are hyperplanes. In the special case of equal mixing proportions, we have equidistant boundaries. A skewed mixing proportion will shift the boundaries toward or away from one of the classes.
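The claim is easy to check numerically in one dimension. In the sketch below, the class means, the shared standard deviation, and the prior are arbitrary numbers. With a shared variance $\sigma^2$, working out the log odds gives $\beta = (\mu_0 - \mu_1)/\sigma^2$ and $\gamma = (\mu_1^2 - \mu_0^2)/(2\sigma^2) + \log(\pi_0/\pi_1)$, which is my own derivation for this scalar case.

```python
import math


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def posterior_bayes(x, mu0, mu1, sigma, pi0):
    """P(Y = 0 | x) computed directly from Bayes' rule."""
    p0 = pi0 * normal_pdf(x, mu0, sigma)
    p1 = (1 - pi0) * normal_pdf(x, mu1, sigma)
    return p0 / (p0 + p1)


def posterior_sigmoid(x, mu0, mu1, sigma, pi0):
    """The same posterior written as a sigmoid of an affine function of x."""
    beta = (mu0 - mu1) / sigma ** 2
    gamma = (mu1 ** 2 - mu0 ** 2) / (2 * sigma ** 2) + math.log(pi0 / (1 - pi0))
    return 1.0 / (1.0 + math.exp(-(beta * x + gamma)))


# Arbitrary illustrative parameters.
mu0, mu1, sigma, pi0 = -1.0, 2.0, 1.5, 0.3
xs = [-3.0, -1.0, 0.0, 0.5, 2.0, 3.0]
max_gap = max(abs(posterior_bayes(x, mu0, mu1, sigma, pi0)
                  - posterior_sigmoid(x, mu0, mu1, sigma, pi0)) for x in xs)
print(max_gap)
```

The prior only enters through the $\log(\pi_0/\pi_1)$ term in $\gamma$, which is the shift of the boundary described above. If the two variances differed, the quadratic terms in $x$ would no longer cancel and the boundary would stop being linear.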

In fact, the assumption of a Gaussian class conditional is not even necessary. We can get away with multinomials (this is another way of viewing the Naive Bayes classifier), or in fact, anything in the exponential family[2]! When I was learning about these in my undergraduate Bayesian statistics course, I never really got why the exponential families were that important. But here is one reason, I suppose. Note that these are still assumptions that add bias to the generative case.

We can extend the previous analysis to the general classification case with $K$ outputs. In that case, we use the softmax: $e^{\beta_i^Tx}/\sum_j e^{\beta_j^Tx}$, which also results in linear boundaries, though that’s kind of stretching the definition; imagine a “pie-chart” where the “slices” represent boundaries. Also, if we wanted to find maximum likelihood estimates, we could do that, because we have $P(x\mid y,\theta)$ and $P(y\mid \theta)$. Just combine those to get the joint and differentiate the log of it. For instance, in the two Gaussian case, the MLEs for the means $\mu_1$ and $\mu_2$ are just the sample means of the elements in their respective classes (remember, we assume we know the training data labels), and the shared covariance is a weighted combination of the two per-class sample covariances. In the general case, we again write the formula and then separate the terms appropriately. Note: we will use $\theta$ to represent a generic vector of weights. To be safe, whenever we write probabilities, add a conditioned $\theta$.

Whew! Now we can talk about logistic regression, where the class dependency is fixed to be a sigmoid function. How do we find the best $\theta$? As usual, take logs, and maximize. This actually leads us to an LMS-like algorithm, and the only difference is the class expectation. For the batch version, we use iteratively reweighted least squares, which is basically Newton’s method for optimizing the (nearly) quadratic log likelihood function. In fact, there is a close connection between this method and the “normal” weighted least squares method, which started by assuming that each training input/output had an attached “weight” to it: this method can be written as

$\theta \leftarrow (X^TWX)^{-1}X^TWz$

for what I thought was a pretty convoluted $z$, but actually turns out to be a first order approximation of $y$. Interesting … I don’t really understand the full details of this, but having the knowledge of convex optimization at the top of this post really helped me.
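Here’s a minimal sketch of that update for a single feature plus an intercept, in plain Python. The toy data, the function name `irls_logistic`, and the convention that the sigmoid models $P(y=1\mid x)$ are my own assumptions for illustration. Each pass builds the weights $W = \mathrm{diag}(\mu_i(1-\mu_i))$ and the working response $z = X\theta + W^{-1}(y-\mu)$, then solves the 2-by-2 weighted normal equations directly.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def irls_logistic(xs, ys, iters=25):
    """Fit P(y=1|x) = sigmoid(b + w*x) by iteratively reweighted least squares."""
    b, w = 0.0, 0.0
    for _ in range(iters):
        # Accumulate X^T W X (2x2) and X^T W z (2-vector), where rows of X are (1, x).
        a11 = a12 = a22 = r1 = r2 = 0.0
        for x, y in zip(xs, ys):
            eta = b + w * x
            mu = sigmoid(eta)
            wt = mu * (1.0 - mu)        # diagonal entry of W
            z = eta + (y - mu) / wt     # working response (first-order expansion around eta)
            a11 += wt
            a12 += wt * x
            a22 += wt * x * x
            r1 += wt * z
            r2 += wt * x * z
        # Solve the 2x2 system (X^T W X) theta = X^T W z in closed form.
        det = a11 * a22 - a12 * a12
        b, w = (a22 * r1 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det
    return b, w

# Toy, non-separable data (an assumption for illustration).
XS = [0, 1, 2, 3, 4, 5]
YS = [0, 0, 1, 0, 1, 1]
B, W = irls_logistic(XS, YS)
```

At convergence the gradient of the log likelihood, $\sum_i (y_i - \mu_i)x_i$, should be numerically zero, which is a handy sanity check.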

For extending discriminative learning to multiple classes, again assume that $P(Y = ? \mid X,\theta)$ is represented by the softmax function, and a lot of our math follows for what is known as softmax regression.

Finally, thanks to Andrew Ng I have a bit of a better idea on the connection between the logistic regression update (in the LMS-like form) versus the perceptron: just change the sigmoid part in the update to be the “sign” function, and then the update turns into the perceptron.
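To make the connection concrete, here’s a small sketch where the same online update rule is used with two different “link” functions. The names (`online_update`, `train`), the 0/1 label convention, and the AND toy data are assumptions for illustration; the step function plays the role of the “sign” function for 0/1 labels.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def step(a):
    # Threshold unit mapped to {0, 1} so that it matches 0/1 labels.
    return 1.0 if a >= 0 else 0.0

def online_update(theta, x, y, rate, link):
    """theta <- theta + rate * (y - link(theta . x)) * x"""
    pred = link(sum(t * xi for t, xi in zip(theta, x)))
    return [t + rate * (y - pred) * xi for t, xi in zip(theta, x)]

def train(data, link, rate=1.0, epochs=20):
    theta = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            theta = online_update(theta, x, y, rate, link)
    return theta

# Logical AND with a constant bias feature in position 0 (linearly separable).
AND_DATA = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
perceptron_theta = train(AND_DATA, step)                          # perceptron
logistic_theta = train(AND_DATA, sigmoid, rate=0.5, epochs=2000)  # LMS-like logistic update
```

The only line that differs between the two learners is the choice of `link`.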

## More Statistical Learning Theory

Here’s a random assortment of notes from Mike Jordan’s book (which I think he has abandoned now).

First, the multivariate Gaussian is one of the most important distributions to understand, and I did not have an easy time learning about it. Fortunately, by now I can write out the formula and reason about it quite easily. Unfortunately, I don’t know how to derive it from first principles. I can explain “roughly” what it does, e.g., that $|\Sigma|^{1/2}$ in the normalizing constant comes from how each component of the random vector contributes some amount of variance equal to its eigenvalue, and the determinant of a matrix is the product of its eigenvalues.

But anyway, there are a few important facts worth discussing about the multivariate Gaussian.

• There is a moment parameterization and the canonical parameterization. The former is what I always use, but we can transform it into the latter with the rules $\Lambda = \Sigma^{-1}$ and $\eta = \Sigma^{-1}\mu$ to get $p(X\mid \eta, \Lambda)$.

• Given a matrix $M$ that we partition into blocks $E,F,G,$ and $H$, the goal of block diagonalization is to find matrices $A$ and $B$ such that $A \times M \times B$ is block diagonal, i.e., has zeros in the locations corresponding to $F$ and $G$. After a lot of algebra, we can arrive at the inverse of the partitioned matrix $M$, and also derive a bunch of useful identities (the “matrix inversion lemma”) that I refuse to memorize.

• The reason why we go through this tedious algebra is that it gives us identities we can use when partitioning the multivariate Gaussian to get formulas for marginal and conditional probabilities involving multivariate Gaussians. Specifically, we have $x\in \mathbb{R}^n$ split into $x_1$ and $x_2$, and we want $p(x_2)$ and $p(x_1\mid x_2)$, where I’m eliding the parameters for simplicity. We obviously have the joint $p(x_1,x_2)$, so we need to figure out how to split it cleverly. Once we’ve gone through the derivation, we will find that the moment parameterization lends itself to easy computations of marginals but hard ones for conditionals, and the reverse is true for the canonical parameterization. Importantly, these formulas preserve the fact that our variables are Gaussian.

In addition to the marginals and the conditionals being Gaussian, the sum of independent Gaussians is Gaussian.

We can extend the mixture model discussion from last section into the multivariate Gaussian setting, where the hidden variables indicate the particular multivariate Gaussian distribution of interest. Here, we have $p(x\mid \theta) = \sum_i \pi_i \mathcal{N}(x\mid \mu_i, \Sigma_i)$, and assuming IID points, we want to find the $\pi$, $\mu$, and $\Sigma$ parameters to maximize the log likelihood. This requires Expectation-Maximization, which involves computing the probability that a particular distribution generated point $x$, which is of obvious interest for classification. (Admittedly this case works best in the binary setting where the conditional expectation is the same as the conditional probability of being one.) One can also think of K-Means as a simplified version of EM. We use EM rather than direct maximum likelihood because our “log” term has a sum inside it, which is due to the probabilities of the point being in multiple possible classes. In the previous section (on classification), we had the class labels so we effectively took only one term in that summation, in which case MLE follows easily.

One thing I didn’t quite realize earlier was that in the EM for Gaussians, we can take the log likelihood, differentiate it with respect to $\pi_i$ (or $\mu_i$ or $\Sigma_i$) and we end up finding solutions that match the EM algorithm, which is interesting and implies that our “heuristic” update formulas may not be so bad because they indicate maxima of the log likelihood. Of course, one can also derive the update formulas “systematically” by appealing to the expected complete log likelihood, where we take expectations with respect to the hidden variables. (See my previous post for more information about this quantity.)

The E-step in general involves computing the expected complete log likelihood, and the M-step in general involves maximizing the expected complete log likelihood with respect to $\theta$. The full power of this terminology is not needed in the simple Gaussian example, but it is a useful exercise to ensure that we derive the same update formulas we developed “heuristically.” In general, the expected complete log likelihood does not suffer from the “coupling” of variables as the original log likelihood.
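As a sanity check on the E-step and M-step, here’s a minimal sketch of EM for a one-dimensional mixture of two Gaussians. The data, the starting parameters, and all function names are made up for illustration. The E-step computes the responsibilities (the probability that each component generated each point), and the M-step plugs them into weighted versions of the usual MLE formulas.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pis, mus, vars_):
    # Note the sum inside the log: this is what makes direct MLE awkward.
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, vars_))) for x in xs)

def em_step(xs, pis, mus, vars_):
    # E-step: tau[n][i] = P(component i | x_n) under the current parameters.
    tau = []
    for x in xs:
        joint = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_)]
        total = sum(joint)
        tau.append([j / total for j in joint])
    # M-step: responsibility-weighted proportions, means, and variances.
    new_pis, new_mus, new_vars = [], [], []
    for i in range(len(pis)):
        n_i = sum(t[i] for t in tau)
        mu = sum(t[i] * x for t, x in zip(tau, xs)) / n_i
        var = sum(t[i] * (x - mu) ** 2 for t, x in zip(tau, xs)) / n_i
        new_pis.append(n_i / len(xs))
        new_mus.append(mu)
        new_vars.append(var)
    return new_pis, new_mus, new_vars

# Two well-separated toy clusters (an assumption for illustration).
DATA = [-2.1, -1.9, -2.4, -1.6, -2.0, 2.9, 3.2, 3.0, 2.7, 3.4, 3.1]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
history = [log_likelihood(DATA, *params)]
for _ in range(30):
    params = em_step(DATA, *params)
    history.append(log_likelihood(DATA, *params))
```

The list `history` should never decrease, which reflects the fact that each EM iteration cannot lower the log likelihood.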

Finally, we consider the “mixture of experts” case, which is when we have a mixture model for the purposes of regression or classification. Mike Jordan’s notes appear to be missing some figures, so it’s a little hard to see what he’s trying to do, but I think the first figure represents a “V”-shaped set of data, and we need to fit two different regressions on that. The key is figuring out where to split, which is our “EM-like” task. In the mixture of experts, the M-step involves two different maximization steps.

## Logic and Planning

I discussed this earlier and had a chance to re-read all of that stuff. My main purpose in this section is to highlight how everything in this section connects with each other. I don’t want to just learn propositional logic, then first order logic, etc., I want to describe them in terms of each other, and to discuss all the similarities and differences among them (and the algorithms they inspire). But this won’t be long because Stuart Russell isn’t on the prelim committee this time (hint hint…).

But first, a laundry list of facts that really confused me the first time:

• Propositions consist of literals, which are just like the atomic elements of propositions, but they can have a “negation” symbol. That’s it: think of literals as either $A$ or $\neg A$.
• Predicates are really functions that output a True or a False. Predicates are – in my opinion – the backbone of first order logic.
• Be sure to realize that $\alpha \Rightarrow \beta$ is the same thing as $\neg \alpha \vee \beta$. This is probably the most important thing to remember to understand Horn and definite clauses, and why we can apply Modus Ponens to them.

Now we can talk about the connections. Here they are:

• One can convert from first order logic to propositional logic by expanding (instantiating) universal and existential quantifiers.

• Forward and backward chaining play a role in both propositional and first order logic. They are algorithms for determining entailment when we assume that our knowledge base consists of Horn clauses (prop.) or first order definite clauses (FOL). This is a simplifying assumption, but it is often easy to convert databases to this format. The reason why Horn or definite clauses are needed is that their truth values are equivalent to $\alpha \Rightarrow \beta$ (and we need “or”s not “and”s), and that exactly fits the description of the Modus Ponens and Generalized Modus Ponens rule format. Note: we use these when we do not want to use the full power of resolution.

• As an alternative, say we do not have definite clauses and are just looking for a satisfying assignment to a conjunction of clauses (i.e., a sentence in CNF). Then in both types of logics, we have the option of backtracking and local search. Both of these have analogues in the Constraint Satisfaction Problem domain. In backtracking search, we have similar versions of the “minimum remaining values” and “least constraining value” heuristics. In local search, we start with a full, though typically not satisfying, assignment to a problem in CNF, and we repeatedly pick an unsatisfied clause and flip one of its symbols. This is the same as in CSPs when we start with a full assignment and use the min-conflicts heuristic to adjust values.

• The PDDL language (Chapter 10) is a simplified language that uses first-order logic “materials” (e.g., predicates, quantifiers, etc.) to encode a search problem (remember Chapter 3!)[3]. Since we’re encoding a search problem, we need to define the actions we can take, and those must have preconditions and effects, which involve adding or removing some fluents. A fluent, by the way, is a ground atom whose truth value can change over time, and a state is represented as a conjunction of fluents. Again, the really important thing to know about Chapter 10 is that it is really another case of the general search problems. One can also make plans using a logical agent.

• Knowledge representation (Chapter 12) is all about encoding “real-world” stuff in first order logic. Our strategy to represent these is formally called ontological engineering. They discuss categorizing objects, categories (make them into objects!), and events.

• Let’s go over the different kinds of algorithms:

• Backtracking search: when we incrementally look for assignments to stuff, and then “backtrack” when we have seen some “problems”, e.g., impossible situations (and this can be used for entailment as well!). There are heuristics for this. We do this in CSPs and searching for satisfying assignments in propositional logic. We can also transform a classical planning case to a propositional case and turn it over to the backtracking solver, but this is not practical.
• Local search: we start with a complete assignment, and move variables around until we get to a solution. We do this with CSPs, propositional logic.
• Forward chaining and backward chaining are algorithms for deciding entailment in the two logics. We do not use these in CSPs or classical planning. The FOL case is more complicated due to the need to perform unification (among other factors), but we have general heuristics for improving them.
• In PDDLs, we do forward searching and backward searching to search for a satisfying sequence of actions. The forward searching part is similar to the backtracking search in that we can search for actions with heuristics and backtrack if needed. Backward search can avoid irrelevant states, though.
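To make forward chaining concrete, here’s a minimal sketch for propositional definite clauses. The representation (a rule is a pair of a premise set and a conclusion), the function name, and the toy knowledge base are assumptions for illustration. Each rule keeps a count of premises not yet known to be true, and Modus Ponens “fires” when that count hits zero.

```python
from collections import deque

def forward_chain(rules, facts, query):
    """rules: list of (premises, conclusion) definite clauses; facts: known symbols."""
    count = [len(premises) for premises, _ in rules]  # unmet premises per rule
    inferred = set()
    agenda = deque(facts)
    while agenda:
        p = agenda.popleft()
        if p == query:
            return True
        if p in inferred:
            continue
        inferred.add(p)
        for i, (premises, conclusion) in enumerate(rules):
            if p in premises:
                count[i] -= 1
                if count[i] == 0:  # all premises known, so Modus Ponens fires
                    agenda.append(conclusion)
    return False

# Toy knowledge base: P => Q, L ^ M => P, B ^ L => M, A ^ P => L, A ^ B => L.
RULES = [({"P"}, "Q"), ({"L", "M"}, "P"), ({"B", "L"}, "M"),
         ({"A", "P"}, "L"), ({"A", "B"}, "L")]
entails_q = forward_chain(RULES, ["A", "B"], "Q")
```

With only the fact `A`, no rule ever fires and the same query is not entailed.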

1. These are $p(x\mid y)$ because we are conditioning on the class $y$.

2. A distribution that can be expressed as $p(x\mid \eta) = h(x) \exp\{\eta^Tx - a(\eta)\}$ is in the exponential family.

3. The book never really makes this clear, but PDDL is not actually First Order Logic, but it reminds me of it because the syntax was designed apparently to be similar.

# Miscellaneous Prelim Review (Part 1)

Here is a random assortment of notes I created to wrap up some of the remaining material I need to know. It’s “part 1” because I have another part coming up later.

## Information Theory

This part, Chapter 2 from Cover and Thomas, is a bunch of definitions and straightforward theorems (i.e., those that follow directly from definitions):

• Entropy: $H(X) = -\sum_{x} p(x)\log_2 p(x) = -\mathbb{E}[\log_2 p(X)]$, where $x$ is a realization of variable $X$. It’s the amount of uncertainty inherent in a random variable. For a fixed variable $X$ with some probability distribution that we can create, the entropy is highest if we make the distribution relatively uniform, and lowest if we make it “peaked.” In the extreme case, if we set $Pr(X = 0) = 1$, then $H(X) = 0$.

• Mutual Information: $I(X;Y) = H(X) - H(X|Y)$. It’s the decrease in entropy (upon obtaining the value of $Y$), for variable $X$. Note that $I(X;X) = H(X)$ since the second term will be zero. Alternatively, we represent it as

$I(X;Y) = \sum_x \sum_y p(x,y) \log_2 \left(\frac{p(x,y)}{p(x)p(y)}\right)$

Note that $I(X;Y) = I(Y;X)$, so it is symmetric.

• Relative Entropy (KL-divergence): $D(p || q) = \sum_{x} p(x)\log_2 \left( \frac{p(x)}{q(x)} \right)$. This is a non-symmetric measure of the difference between distributions $p$ and $q$. We can also interpret it as the number of additional bits we will need to represent $p$ if we are using the (inferior) approximation of $q$. It is infinity if there exists an $x$ such that $p(x) > 0$ but $q(x) = 0$.

All three of the above quantities are non-negative.
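Here’s a small sketch of the three quantities for discrete distributions given as lists of probabilities, all measured in bits. The function names and the example distributions are my own choices for illustration.

```python
import math

def entropy(p):
    # Terms with p(x) = 0 contribute nothing (0 log 0 is taken to be 0).
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log2(pi / qi)
    return total

def mutual_information(joint):
    """joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

uniform_bits = entropy([0.5, 0.5])                              # the "flat" binary case
peaked_bits = entropy([1.0, 0.0])                               # no uncertainty at all
independent_mi = mutual_information([[0.25, 0.25], [0.25, 0.25]])
copy_mi = mutual_information([[0.5, 0.0], [0.0, 0.5]])          # Y is a copy of X
```

Swapping the arguments of `kl_divergence` generally gives a different number, which is the non-symmetry mentioned above.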

Another concept that plays a huge role in information theory is the following:

• Jensen’s inequality: for a convex function $f$, $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. For a concave function, like the logarithm, we flip the inequality (and since the logarithm is strictly concave, equality holds only when $X$ is constant). I find it easiest to remember this rule by expanding out the equations for binary random variables. Let’s say they take on values 0 and 1 with probability a half each. Then we have $f(\mathbb{E}[X]) = f(.5 \times 0 + .5 \times 1)$ and $\mathbb{E}[f(X)] = .5 \times f(0) + .5 \times f(1)$ and can directly relate this to the definition of convexity.

Based on the previous discussion, we can define and infer things like:

• Joint entropy $H(X,Y) = - \sum_x \sum_y p(x,y) \log_2 p(x,y) = H(X) + H(Y\mid X)$.

• Conditional entropy, conditional KL divergence, conditional mutual information. For the sake of simplicity, I will not write all the rules here, but here is the one for conditional entropy: $H(Y\mid X) = \sum_x p(x) H(Y \mid X=x) = -\mathbb{E}_p [\log_2 p(Y\mid X)]$. Note that, in general, $H(Y\mid X) \ne H(X\mid Y)$.

• The chain rule for entropy, relative entropy, and mutual information. Unlike normal probability, these sum the components rather than multiply, which makes sense because all three cases involve logarithms. Again, I won’t write all the rules here, but will note that entropy is the easiest to relate to probability because we literally copy formulas from probability, but use sums instead of products. For the chain rule with mutual information, just pretend we don’t have $Y$ and follow the probability convention (but sum up). Then stick the $Y$’s after the semicolon, but before the conditioning bar. For KL-divergence, it’s the same: split up the joint into a marginal and a conditional, but do this for both distributions, then use two “D” terms.

• A theorem: $H(X) \le \log_2|\mathcal{X}|$, where $\mathcal{X}$ represents the range of variable $X$, and equality here holds if and only if $X$ is a uniform random variable.

• Also one thing that tripped me up the first time I saw it was this consequence of Jensen’s inequality:

$\sum_{x \in A} p(x) \log\left( \frac{q(x)}{p(x)}\right) \le \log \left( \sum_{x \in A} p(x) \frac{q(x)}{p(x)} \right)$

where $A$ is the domain for $p$. I am assuming this really means

$\mathbb{E}_p\left[\log \left(\frac{q(x)}{p(x)}\right)\right] \le \log \left( \mathbb{E}_p\left[ \frac{q(x)}{p(x)} \right] \right)$

A final thought on this section: an alternative interpretation of entropy is that it is a lower bound on the average number of bits required to represent the random variable. It’s not “the minimum number of bits” because random variables take on different values with different probabilities, so we may wish to allocate more bits for the low probability events. And we also need to make it clear how we encode, so that we can compare different encodings. Example 1.1.2 from Cover and Thomas will clarify: here, we have eight horses, and they each win with some specified probability. If we wanted to encode the random variable $H$ indicating the horse that won, we could use three bits in the standard way. But this is suboptimal if the distribution is $(1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64)$, like it is in the example, because we should allocate fewer bits to the higher probability horses, and more towards the ones that are less likely to win. It’s possible to encode $H$ so that the average number of bits to represent it is two, which exactly matches the entropy.
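We can check the arithmetic for the horse distribution directly. The particular codewords below are one prefix-free code with lengths $(1,2,3,4,6,6,6,6)$ that I am assuming for illustration; any prefix-free code with those lengths gives the same average.

```python
import math
from fractions import Fraction

# Win probabilities of the eight horses.
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 16)] + [Fraction(1, 64)] * 4

# Shorter codewords for likelier horses; no codeword is a prefix of another.
codewords = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy_bits = -sum(float(p) * math.log2(float(p)) for p in probs)
average_length = sum(p * len(c) for p, c in zip(probs, codewords))
fixed_length = 3  # the "standard" encoding of eight outcomes
```

Both `entropy_bits` and `average_length` come out to two bits, one bit less than the fixed-length encoding.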

## Decision Trees

Decision trees are one of the simplest nontrivial classifiers[1] that have strong performance in practical tasks. The hypothesis space is the set of trees. For each $n$-dimensional sample $x$, we classify it by propagating $x$ down the tree. At each node, we test an attribute $x_i$, and depending on that value, send the sample left or right. Once we send it down the tree far enough, it will land in some “classifier” node that labels the class of that element.

Great, so how do we train such trees from labeled data $\{(x^{(i)} = (x_1,x_2,\ldots,x_n)^{(i)},y^{(i)})\}_{i=1}^N$? For that, we invoke some information theory criteria: we want to select the attribute to test that will result in the most purity in the resulting subsets, where purity is defined based on entropy. Formally, at each point of the tree, we have a set of data and a candidate set of attributes. We pick the attribute that maximizes the information gain of the data.

Let’s precisely define this for boolean decision trees. At a decision tree’s node, we have $p$ positive and $n$ negative samples. The entropy of the random variable describing the output is the entropy of a binary random variable with probability $p/(p+n)$; to simplify the subsequent notation, denote this as $B(p/(p+n))$. We define the gain of attribute $A_k$ that splits the data into $d$ subsets as follows:

$Gain(A_k) = B\left(\frac{p}{p+n}\right) - \sum_{i=1}^d \left( \frac{p_i+n_i}{p+n}\right) B\left(\frac{p_i}{p_i+n_i}\right)$

where $p_i$ and $n_i$ are the numbers of positive and negative samples in the $i$-th subset.

For each subset, we weigh its entropy probabilistically. Otherwise, you could think of a useless attribute that keeps the same proportion of positive and negative examples in each subset. Without the probability weighting, splitting on $A_k$ would increase the entropy of the goal test on the data.
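Here’s a minimal sketch of the gain computation, where each subset is given as a pair of (positive, negative) counts. The function names and the counts are hypothetical, chosen so that one attribute is informative and the other keeps the same class proportions in every subset.

```python
import math

def binary_entropy(q):
    """B(q): entropy in bits of a binary variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def information_gain(subsets):
    """subsets: list of (positives, negatives), one pair per value of the attribute."""
    p = sum(pi for pi, _ in subsets)
    n = sum(ni for _, ni in subsets)
    # Each subset's entropy is weighted by the fraction of samples that land in it.
    remainder = sum((pi + ni) / (p + n) * binary_entropy(pi / (pi + ni))
                    for pi, ni in subsets if pi + ni > 0)
    return binary_entropy(p / (p + n)) - remainder

informative_gain = information_gain([(0, 2), (4, 0), (2, 4)])      # about 0.541 bits
useless_gain = information_gain([(1, 1), (1, 1), (2, 2), (2, 2)])  # proportions unchanged
```

The second attribute has zero gain, since every subset is exactly as impure as the parent.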

There are other impurity measures, such as the Gini impurity measure $\sum_{k=1}^K p(x_k)(1-p(x_k))$ if the output takes on $K$ realizations. This is not the same as the measure of income inequality! I think the “CART” category of decision trees uses the Gini measure, whereas the “C4.5” and “C5” trees use entropy to measure impurity, though I think the boundaries between those categories are a bit blurry. Misclassification error is not an appropriate measure because it – unlike Gini impurity and entropy – does not give higher weight to branches with purer solutions.
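A quick numeric illustration of that last point, with hypothetical class counts: two candidate splits of 400 positive and 400 negative samples have the same weighted misclassification error, but Gini impurity prefers the split that produces a pure child.

```python
def gini(counts):
    total = sum(counts)
    return sum((c / total) * (1 - c / total) for c in counts)

def misclassification(counts):
    return 1 - max(counts) / sum(counts)

def weighted(measure, children):
    """Average impurity of the children, weighted by the number of samples in each."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)

# Each child is (positives, negatives); both splits divide the same 400/400 parent.
SPLIT_A = [(300, 100), (100, 300)]
SPLIT_B = [(200, 400), (200, 0)]   # the second child is pure

error_a = weighted(misclassification, SPLIT_A)
error_b = weighted(misclassification, SPLIT_B)
gini_a = weighted(gini, SPLIT_A)
gini_b = weighted(gini, SPLIT_B)
```

Here `error_a` and `error_b` are both 0.25, while `gini_b` is smaller than `gini_a`.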

Here are some things to think about:

• Trees can overfit, so what happens in most realistic algorithms is that we build the large tree first, then prune away nodes with only leaf descendants that do not contribute much to the information gain (e.g., using tests of statistical significance to see if the gain is significant enough). This is not the same as building a tree and deciding to prune away early. The classic example is the XOR data. If we have a lot of XOR data that we want to split, we will find that the information gain of both attributes is zero. We do not want to prune away early because the next step will involve splitting on the second part of the XOR, which splits the data perfectly.

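• In practice, information gain might not be a good measure of how useful an attribute is, because there might be an attribute that maps each element to a unique value. Splitting on it yields perfectly pure single-element subsets, and hence maximal gain, without helping us classify new samples.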

• We may have missing data. A simple but bad strategy is to ignore all training data points that have missing data. An alternative is to “fill in” those values probabilistically based on the distribution of values of those variables in the other samples considered for a particular decision tree.

• We may want to use decision trees for regression if the output is continuous-valued. One option is to use a decision tree normally up to a certain depth, and then after that, we fit (linear) regression on only those data points that manage to reach that particular leaf node, and only the subset of variable attributes yet to be tested.

How would we train such a regression tree? HTF suggest a greedy algorithm (which also assumes continuous attributes, by the way): at each node, find the attribute and a split point that minimizes the sum of squared errors of the two resulting regions. HTF also assume that once we get to a region, we will approximate the samples with just one value, rather than doing a full-blown regression on it, which makes the problem a lot easier since the sum of squared errors criterion means we pick the mean of the elements considered at that node. To avoid overfitting, they suggest weakest link pruning. We iteratively pick the internal (non-leaf) node that, upon its removal (and subsequent collapsing of the tree) results in the smallest increase in the sum of squared errors criterion. This is pretty cool, and it’s similar to what Russell and Norvig describe.
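Here’s a minimal sketch of one greedy step for a single continuous attribute: try every midpoint between consecutive sorted values and keep the threshold with the smallest total sum of squared errors, where each region is summarized by its mean. The function names and data are assumptions for illustration; a full tree would repeat this over all attributes and recurse on the two regions.

```python
def sse(values):
    """Sum of squared errors around the mean, the best constant prediction."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) for the best split of the form x < threshold."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Toy data with an obvious jump between x = 3 and x = 4.
XS = [1, 2, 3, 4, 5, 6]
YS = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
THRESHOLD, TOTAL_SSE = best_split(XS, YS)
```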

Finally, here’s a rather interesting connection between boolean decision trees and propositional logic that I failed to realize at first: we can label various paths throughout the tree as $Path_i$, and so the goal is expressed as:

$Goal \iff (Path_1 \vee Path_2 \vee \cdots )$

Thus, any function in propositional logic can be expressed as a decision tree!

## Nearest Neighbor

This classifier is easy to describe: for each test point, we look at the $k$ nearest points according to Euclidean (or other) distance metrics and classify the test point as the majority class among those $k$. This is a problem in high dimensions, since the notion of “distance” as a measure of similarity becomes less reliable due to a combination of (1) noisy and irrelevant features, and (2) the rather intriguing fact that the higher we go in dimensions, the more likely it is our points are farther away from each other. As we increase the dimension of the unit hypercube with our fixed $k$-nearest neighbor classifier, we will need to traverse an extra amount for each dimension to reach the $k$ nearest neighbors.
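A brute-force version takes only a few lines. This is a sketch with made-up two-dimensional data, including one deliberately “noisy” point labeled `b` sitting inside the `a` cluster; `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (point, label); returns the majority label among the k nearest."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

TRAIN = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
         ((1, 1), "b")]  # noisy point inside the "a" cluster

one_nn = knn_classify(TRAIN, (0.9, 0.9), k=1)    # follows the noisy point
three_nn = knn_classify(TRAIN, (0.9, 0.9), k=3)  # outvoted by the two nearby "a" points
```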

Let’s now restrict our focus to the 1-nearest-neighbor case. On the surface, this might seem to be unreliable, since we’re only using one closest point and it might overfit the data (see examples of plots showing 1-nearest neighbor versus 5-nearest neighbor). But a famous result called the Cover-Hart Theorem provides a different story, saying that the asymptotic error rate of the 1-nearest neighbor classifier is never more than twice the error of the Bayes’ classifier (according to HTF), where the Bayes’ classifier assigns $\arg\max_y P(y\mid X=x)$. While it sounds nice, it assumes that new points have to exactly coincide with a point in the training data, which is true in the limit, but not true in general.

Here’s another interesting fact about nearest neighbors that I found surprising. Researchers used nearest neighbors to achieve the best performance (at that time) on the MNIST handwritten digit recognition problem. The digits themselves are points in $\mathbb{R}^{256}$-space. A classifier would have to work in high dimensional space and be invariant to rotations, scaling, etc. The way they did this was by defining manifolds in $\mathbb{R}^{256}$-space. For instance, there is a one-dimensional curve where points on that curve represent different rotated versions of the “3” digit[2]. Then there can be another curve representing a different three. One idea is to take the Euclidean distance of the two closest points $p_1$ and $p_2$ which lie on separate curves. Unfortunately, this may result in heavily rotated images being equivalent (the classic disaster: confusing a “6” with a “9”), so the ingenious solution is to use tangent lines. That’s the intuition: in reality, the “one-dimensional curves” would be manifolds taking into account additional invariance factors.

Having motivated nearest neighbors, let’s discuss some of its drawbacks. One problem is that it needs to store all the training instances, and for each new test point to classify, it needs to iterate through all of those to find the $k$ closest neighbors. If that $O(n)$ scan per query (so $O(n^2)$ time for $n$ queries) is unacceptable, then we can speed up the process of finding nearest neighbors with the following two strategies:

• We can use $k$-d trees, or more accurately, $n$-d trees if our data is $n$-dimensional, so that we don’t confuse this with the $k$ in the $k$-nearest neighbors. At each node, this tree will pick a dimension $i$ and split the examples according to their median point $m$ so that all points with $x_i < m$ will go to the left subtree, and the rest go to the right subtree. The dimensions are typically chosen based on the widest spread of values.

Thus, if we are doing 1-nearest neighbor, for a given new point $x$, we find its nearest neighbor by querying the $k$-d tree to see where it would be located (i.e., it’s like we are inserting it in the tree). We proceed until we hit a leaf, and declare that as the best node found given the current information. But we have to be careful. The nearest neighbor of $x$ might not be on the same side of a splitting hyperplane! We need to “backtrack” and then measure the distance between $x$ and the hyperplanes at each step, to see if there are nearest neighbor candidates on the other side. Check the Wikipedia page for more details. They have a nice description and an animation.

The downside with $k$-d trees is that with many dimensions, we will need to keep track of numerous subtrees that could potentially have “that nearest neighbor,” and we would iterate through the entire tree. To extend this algorithm for multiple neighbors, we use a list of nearest neighbor points.

• We can use locality sensitive hashing, which hashes “similar” values in high dimensional spaces to the same hash buckets. Then, using only the elements remaining in that bucket, we can perform exact nearest neighbor via brute force comparisons. Since hash functions are hard to create, we can try $M$ hash functions independently to get $M$ buckets, then take the union of all those elements to arrive at the set which we will use for exact comparisons. Russell and Norvig seem to suggest that each of those $M$ hash functions be a projection down to a line, and the buckets would be a line segment. I guess that makes sense.

The downside with locality sensitive hashing is that it, unlike the use of $k$-d trees, is an approximate nearest neighbors search.
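Here’s a compact sketch of the $k$-d tree idea for 1-nearest neighbor. As a simplification, I cycle through the dimensions by depth rather than choosing the one with the widest spread, and the dictionary-based node layout, function names, and example points are assumptions for illustration. The key line is the backtracking test, which only explores the far side of a split when the splitting hyperplane is closer than the best point found so far.

```python
import math

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through dimensions (a simplification)
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2         # the median point becomes this node
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Backtrack: the far side can only help if the hyperplane is closer than best.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best

POINTS = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
TREE = build(POINTS)
closest = nearest(TREE, (9, 2))
```

The result should always agree with a brute-force scan over `POINTS`; the tree just lets us skip subtrees that cannot contain a closer point.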

Nearest neighbors has an interesting tradeoff with perceptrons. Kernelized perceptrons learn similarly to the way nearest neighbors learn, especially with Gaussian kernels that weigh a probability distribution about each point. In other words, distance-weighted nearest neighbors are kernelized perceptrons. Nearest neighbors, unlike plain (non-kernelized) perceptrons, can use fancy similarity functions, as exemplified by the handwritten digit recognition example.

HTF also emphasize the connection between nearest neighbors and least squares.

## (Artificial) Neural Networks

Neural networks are a natural extension of the perceptron that I’ve written about in detail before, since perceptrons form the basic building blocks for each node. Like the perceptron (and regression, for that matter), we develop a classifier by updating weights. For neural networks, we use backpropagation to update the weights. At a high level, this means for each training instance, we “feed” it to the network so that it classifies it. Then, we propagate the “error” backwards through the network to update weights. The weight update for those connecting to the output layer is the same as that of logistic regression[3], assuming we’re using the sigmoid nonlinear function. The real challenge comes when we compute weight updates for those connecting input or hidden layers to other hidden layers. But in fact, the gradient for the loss at inner nodes is the same as the computation for the gradient at the output layer, except we apply the chain rule multiple times. I’m not going to write the derivation here since it would take too much time; I just did it by hand.
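For the record, here’s a minimal sketch of that computation for a tiny network with two inputs, two sigmoid hidden units, and one sigmoid output, using the cross-entropy (negative log likelihood) loss. The architecture, parameter values, and function names are assumptions for illustration. A finite-difference check is included since it is an easy way to catch mistakes in a hand derivation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(params, x):
    W1, b1, w2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    out = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, out

def loss(params, x, y):
    _, out = forward(params, x)
    return -(y * math.log(out) + (1 - y) * math.log(1 - out))

def backprop(params, x, y):
    W1, b1, w2, b2 = params
    h, out = forward(params, x)
    delta_out = out - y                      # same form as in logistic regression
    grad_w2 = [delta_out * hi for hi in h]
    grad_b2 = delta_out
    # Chain rule: push the output error back through w2 and the hidden sigmoids.
    delta_h = [delta_out * w * hi * (1 - hi) for w, hi in zip(w2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_h]
    grad_b1 = delta_h
    return grad_W1, grad_b1, grad_w2, grad_b2

def numeric_grad_W1(params, x, y, i, j, eps=1e-6):
    """Central finite difference of the loss with respect to W1[i][j]."""
    W1 = params[0]
    original = W1[i][j]
    W1[i][j] = original + eps
    up = loss(params, x, y)
    W1[i][j] = original - eps
    down = loss(params, x, y)
    W1[i][j] = original
    return (up - down) / (2 * eps)

PARAMS = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
X, Y = [1.0, 2.0], 1
GRAD_W1 = backprop(PARAMS, X, Y)[0]
```

Each entry of `GRAD_W1` should match `numeric_grad_W1` for the same indices up to small numerical error.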

Neural networks have an intriguing fact: provided that there are sufficiently many nodes and layers, they can represent any continuous function (of the input) with arbitrarily high accuracy. It needs multiple layers with non-linear activation functions at each node. Otherwise, if a NN just has an input layer directly connected to an output layer, it fails to learn even a simple XOR function.

There are many extensions to NNs. We could use recurrent NNs, convolutional NNs (popular for computer vision now), etc. We can use thresholding functions other than sigmoids, such as ReLUs, which avoid the “vanishing gradient” problem of sigmoids. Note that we focus on the problem of learning from a fixed structure, i.e., like parameter estimation for a graphical model with the nodes and edges fixed. Learning the structure is much more complicated.

## Principal Components Analysis

Principal Components Analysis (PCA) is a way of mapping high dimensional data into a reduced dimensional space, where the reduction is a “best approximation” of the original data. Formally, if $x_i \in \mathbb{R}^n$ but we really think they lie in $\mathbb{R}^k$ where $k \ll n$, then there is probably a process such that $x_i = \Lambda z_i + \mu_i$ for $z_i \in \mathbb{R}^k$ but $\mu_i \in \mathbb{R}^n$ is some noise added.

Obviously, there are many advantages to dimensionality reduction, so the question is how we do this in a sound way. PCA will do this by iteratively “mapping” points to a line characterized by the vector which preserves as much variance in the data as possible, not including vectors already chosen[4]. In other words, we’d like to project the data onto a subspace so that the variance is maximized. To do this, PCA uses an eigendecomposition, which can make it expensive, but it does not need Expectation-Maximization. (The dimensionality reduction technique that uses E-M is called “factor analysis.”)

It’s easiest to derive PCA in the two dimensional case with $N$ data points $\{(x_1,x_2)^{(i)}\}_{i=1}^N$ where the data have zero mean and each coordinate has unit variance. In the first step, we solve for the (unit) direction vector $u$:

$\max_u \sum_{i=1}^N (u^Tx^{(i)})^2 = \max_u \|Xu\|_2^2 = \max_u u^T(X^TX)u$

where $X$ is the matrix where each row is a training instance $x^{(i)}$. Note that $X^TX = \sum_{i=1}^N x^{(i)}(x^{(i)})^T$, and also, if the data are centered, then it is the sample covariance matrix up to a $1/N$ scaling factor (which does not change the eigenvectors).

In fact, this is a standard optimization problem, where we have a quadratic form $z^TAz$ that we are maximizing w.r.t. $z$ subject to the fact that $\|z\|_2 = 1$. It is a well known fact that this problem is solved by finding the $u$ that corresponds to the eigenvector of $A = X^TX$ that has the largest eigenvalue. After all, $X^TX$ is a symmetric matrix of reals, so its eigenvectors can be chosen to be of unit norm and orthogonal to each other.

Given $u_1$, the best vector so far, we know that $x_i = u_1 z_i + \mu_i$ is our “process”, where $z_i$ is a scalar. Since $u_1^Tu_1 = 1$ it follows that to project all the $x_i$ points down to the one dimensional space characterized by $u_1$, we do $u_1^Tx_i$.

But normally we need more dimensions than that. How do we find the “best” set of vectors $u_1,\ldots,u_k$ for that? We take those eigenvectors that had the largest $k$ eigenvalues. These form the principal components of the data, and are mutually uncorrelated. (I’m not actually sure why this works – intuitively it does, but I don’t have a proof.) And when we need to project our data, we remember our “process” and add the new eigenvectors as columns of a matrix $U$ so that $x_i = Uz_i + \mu_i$, where $z_i \in \mathbb{R}^k$, and $k$ is the number of columns of $U$. Again, the columns of $U$ are orthonormal, so ignoring the noise (which is deliberate, since it’s noise!) our projection is $U^Tx_i$ for all $x_i$ points.

We can find those eigenvectors by diagonalization or SVD of $X^TX$. SVD would work since that’s a real, symmetric, positive semi-definite matrix, so the eigenvalues will be the same as the singular values, and we can thus rank them easily.
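Here’s a short NumPy sketch of the eigendecomposition route, assuming rows of `X` are training instances. The helper name `pca_eig` is mine, and I center the data inside the function so that $X^TX$ really is (a multiple of) the covariance matrix.

```python
import numpy as np

def pca_eig(X, k):
    """Top-k principal directions via an eigendecomposition of X^T X."""
    Xc = X - X.mean(axis=0)                    # center the data
    evals, evecs = np.linalg.eigh(Xc.T @ Xc)   # eigh: for symmetric matrices
    order = np.argsort(evals)[::-1][:k]        # indices of the k largest
    U = evecs[:, order]                        # columns are u_1, ..., u_k
    Z = Xc @ U                                 # row i holds U^T x_i
    return U, evals[order], Z

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])
U, top_evals, Z = pca_eig(X, k=2)
print(U.shape, Z.shape)   # (3, 2) (200, 2)
```

A useful sanity check is that the sum of squared projections onto $u_1$ equals the largest eigenvalue, which is exactly the quantity $u^T(X^TX)u$ being maximized.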

There is an alternative way we can derive PCA, using the “process” I explained earlier. We can define $f(z_i) = Vz_i + \mu_i$ and use that as our approximation of $x_i$. Thus, our objective would be to find

$\min_{V} \sum_{i=1}^N \|x_i - f(z_i) \|_2^2 = \min_{V} \sum_{i=1}^N \|x_i - V(V^Tx_i) \|_2^2$

where I just put the $(V^Tx_i)$ to represent the lower dimensional approximation data. To find $V$, we can again resort to SVD: $X = UDV^T$, where $X$ is again the matrix with rows as training instances. Then the columns of $V$ form the vectors of the principal components. (Sorry for the $U$ and $V$ confusion; Ng and HTE use different formulations.) Technically, we only take the first $K$ columns from $V$ if we want a set of $K$ vectors for the projections, which I find is neat (if we want more, just add more columns!). Since $XV = UD$, then $UD$ consists of the projected points of $x_i$, one for each row (and $UD$ will usually have fewer columns than the full number of components of the $x_i$s). There’s a lot of matrix stuff going on here; draw this on a piece of paper to understand better.
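The matrix bookkeeping is easier to trust after checking it numerically. Below is a small sketch (the name `pca_svd` is mine) that takes the SVD of the centered data, keeps the first $K$ columns of $V$, and forms both the projections $XV$ and the reconstruction $V(V^Tx_i)$.

```python
import numpy as np

def pca_svd(X, K):
    """PCA through the SVD X = U D V^T of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_K = Vt[:K].T          # first K columns of V (principal directions)
    Z = Xc @ V_K            # projected points; equals U[:, :K] * d[:K]
    X_hat = Z @ V_K.T       # reconstruction V (V^T x_i), one row per point
    return V_K, Z, X_hat, U, d

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
V_K, Z, X_hat, U, d = pca_svd(X, K=2)
print(np.allclose(Z, U[:, :2] * d[:2]))
```

The squared reconstruction error comes out equal to the sum of the squared singular values that were dropped, which is one way to see why keeping the largest ones is the “best approximation.”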

HTE present an example of PCA using the Procrustes Transformation, but I don’t really understand how PCA relates to it. I guess because both involve rotations and scaling of the data?

1. In fact, the ability to describe the classifier to lawyers means that companies can use these classifiers to “discriminate” without concern. What companies would have to do is explain the classifier and their rationale (e.g., if a person is in X category, we have to do Y due to previous data, etc.).

2. Admittedly, I am skeptical of how they can claim that a one-dimensional curve represents various rotated aspects of a digit, but if you buy that argument, then everything else follows from that.

3. Recall how we do a stochastic gradient update of a single weight $w_j$ in logistic regression. For a given training instance, $(x,y)$, where $x$ is $N$-dimensional and $y$ is a scalar, we do

$\frac{\partial}{\partial w_j}(y-h_w(x))^2 = \frac{\partial}{\partial w_j}\left(y - \frac{1}{1 + e^{-w^Tx}} \right)^2$

assuming we’re using the $L_2$ loss function. Then, folding the constant factor into the learning rate $\alpha$, we eventually get

$w_j \leftarrow w_j + \alpha (y-h_w(x))h_w(x)(1-h_w(x))x_j$

which uses the fact that the derivative of the logistic function is itself multiplied by the quantity “one minus itself.”

4. The easiest way to understand this is to look for figures that plot data along with vectors that indicate the PCA dimensions. Typically there will be two vectors chosen, indicating two “best directions” that capture the data.

# Perceptrons, SVMs, and Kernel Methods

In this post, we’ll discuss the perceptron and the support vector machine (SVM) classifiers, which are both error-driven methods that make direct use of training data to adjust the classification boundary. They do not “build a model,” which is what a BayesNet-based algorithm such as Naive Bayes would do, which means we can make fewer assumptions about the data.

We’ll also talk about kernels, which allow us to efficiently compute dot products of high-dimensional feature vectors without actually computing those feature vectors.

## The Perceptron

The perceptron learning algorithm relies on classification via the sign of the dot product. Given a binary classification problem of vectors in $\mathbb{R}^n$, the perceptron algorithm computes one parameter vector $w \in \mathbb{R}^n$. Given an arbitrary sample $x_i$ with features1 $f(x_i) \in \mathbb{R}^n$, we classify this as +1 if $w \cdot f(x_i) \ge 0$ and -1 otherwise. Assuming we’re doing supervised learning, we will know the true label $y^{(i)} \in \{-1,1\}$. If ${\rm sign}(w \cdot f(x_i)) = y^{(i)}$, then we don’t do anything. Otherwise, we must adjust the weight vector $w \leftarrow w + y^{(i)}\cdot f(x_i)$. This will change the direction of the vector, thus shifting the classification boundary. It’s easiest to understand how this works by realizing that $w \cdot x = 0$ represents the decision boundary, which is orthogonal to $w$ by definition of the dot product and divides up the feature vector space into “halves,” where one has dot products with $w$ positive, and the other negative.
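As a sanity check on the update rule, here’s a minimal sketch of the binary perceptron in NumPy. It assumes the feature vectors $f(x_i)$ are already the rows of `F`; the `max_epochs` cap and the function name are my own additions, not part of the algorithm as stated.

```python
import numpy as np

def train_perceptron(F, y, max_epochs=100):
    """Binary perceptron. F: (N, n) feature matrix, y: labels in {-1, +1}."""
    w = np.zeros(F.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for f_i, y_i in zip(F, y):
            pred = 1 if w @ f_i >= 0 else -1   # ties go to +1, as in the text
            if pred != y_i:
                w = w + y_i * f_i              # error-driven update
                mistakes += 1
        if mistakes == 0:                      # a clean pass: converged
            break
    return w

F = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(F, y)
print(w)   # [1. 2.]
```

On this toy data only one update happens (on the third point), after which every point is classified correctly.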

In the general case, there will be multiple classes, so we will have multiple weight vectors $w_1, \ldots, w_k$ for a $k$-way classification problem. In that case, whenever we have a training instance $x_i$, we assign the class based on $\arg\max_j w_j \cdot f(x_i)$. If $x_i$ was actually in class $j$, we are done; otherwise it should have been in class $j'$ so we need to adjust two weight vectors with $w_j \leftarrow w_j - f(x_i)$ and $w_{j'} \leftarrow w_{j'} + f(x_i)$. We add to the appropriate class, and subtract from the wrong class.

What are the problems with the perceptron as we just described? Well, if the data isn’t linearly separable, the algorithm will “thrash” around and never converge2. Two other (related) problems: it can overfit the data, or not find a suitable boundary. For the latter case, think of a linearly separable dataset, but with one outlier that causes the linear boundary to drastically shift. It may be wise to allow one “error” in order to get a $w$ that generalizes better.

There is a modification of the perceptron known as the Margin-Infused Relaxed Algorithm (MIRA), which updates in the same direction as the perceptron, but at the minimum magnitude necessary (technically, we add one to leave some slack, but whatever) to force the classifier to classify the current sample correctly (if it was not already correct). This means that the update could be smaller or greater than the perceptron update, but unlike the perceptron, MIRA will always classify an example correctly after seeing it. In practice, we cap the amount that a single training example can change the weight vector, so the scale factor $\tau$ is at most a pre-specified $C$.
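Here’s a sketch of a single multiclass MIRA step. I’m assuming the usual closed form for the step size, $\tau = \min\left(C, \frac{(w_{y'} - w_{y^*}) \cdot f + 1}{2\|f\|_2^2}\right)$, where $y'$ is the predicted class and $y^*$ the true one; that formula is not spelled out above, so treat it (and the function name) as my assumption.

```python
import numpy as np

def mira_update(W, f, y_true, C=1.0):
    """One MIRA step. W: (k, n) weight matrix, f: (n,) features."""
    y_pred = int(np.argmax(W @ f))
    if y_pred == y_true:
        return W, 0.0                       # already correct: no update
    # Smallest step that makes the true class win by a margin of one.
    tau = ((W[y_pred] - W[y_true]) @ f + 1.0) / (2.0 * (f @ f))
    tau = min(tau, C)                       # cap the influence of one example
    W = W.copy()
    W[y_true] += tau * f                    # same direction as the perceptron
    W[y_pred] -= tau * f
    return W, tau

W0 = np.zeros((3, 2))
f = np.array([1.0, 2.0])
W1, tau = mira_update(W0, f, y_true=2)
print(tau, np.argmax(W1 @ f))   # 0.1 2
```

Note that when the cap $C$ kicks in, the step can be too small to fix the example, so the “always correct afterwards” property only holds for the uncapped update.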

As an alternative to the multiway classification perceptron, one can use the perceptron for ranking (e.g., website ranking), which has only one weight vector. It’s useful if we want to consider data points $x$ and classes $y$ together in a single vector $f(x,y)$. The decision rule is

$\arg\max_y f(x,y) \cdot w$

and the update rule is

$w \leftarrow w + f(x,y^*) - f(x,y)$

where for a data point $x$, $y$ was the predicted class but $y^*$ was the actual class. Now the weights are interpreted as the importance of each feature component to each class.

## The Kernelized Perceptron

We can create more complicated classification boundaries with perceptrons by using kernelization3. Suppose $w$ starts off as the zero vector. Then we notice in the general $k$-way classification problem that we only add or subtract $f(x_i)$ vectors to $w$. In other words, with $N$ samples in the training data, $w_j = \sum_{i=1}^N \alpha_{i,j}f(x_i)$ where all the $\alpha$ variables are integers. This means learning all the alphas would be enough to reconstruct the weight vectors.

How do we make a classification decision? For a given training instance (or even an entirely new sample) $x$, we would assign it the class based on whatever $j$ (for weight vector $w_j$) that maximizes the following: $\left( \sum_{i=1}^N \alpha_{i,j}f(x_i) \right) \cdot f(x) = \sum_{i=1}^N \alpha_{i,j} (f(x_i) \cdot f(x))$. We can re-express the dot product: $f(x_i) \cdot f(x) \rightarrow K(x_i,x)$, where we have introduced a kernel function $K$. Kernels allow us to “map” vectors $x_i$ and $x$ into a higher dimensional space, where we would then “take the dot product,” without actually transforming the features into the higher dimensional space.

Here’s an example: if we let $K(x_i,x) = (x_i \cdot x)^2$, then we have mapped $x_i$ and $x$ into a higher dimensional space that includes squared components of $x_i$ and $x$, resulting in linearly separable boundaries in that space even if the original feature space was not, e.g., the positive examples formed a circle and were surrounded by the negative examples. As a general rule, the more features we have, the more likely we have linearly separable data, unless two of the exact same $x$’s have different classes, for whatever reason. Of course, we will need more examples to learn correctly (growth is roughly quadratic in the number of features), and when doing classification, we will need to compute all the $K(\cdot,\cdot)$ values. It will be even slower if most of the alpha counts are nonzero.
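Here’s a sketch of the binary kernelized perceptron on the circle-inside-a-ring picture. One assumption to flag: I use $K(a,b) = (a \cdot b + 1)^2$ rather than the purely quadratic $(a \cdot b)^2$, because the added constant supplies a bias-like feature; without it the decision function keeps the same sign along any ray from the origin, so it cannot separate points by radius. Function names and the toy data are mine.

```python
import numpy as np

def quad_kernel(a, b, c=1.0):
    return (a @ b + c) ** 2

def train_kernel_perceptron(X, y, kernel, max_epochs=1000):
    """Learn signed mistake counts alpha instead of an explicit w."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            score = alpha @ K[:, i]            # sum_j alpha_j K(x_j, x_i)
            pred = 1 if score >= 0 else -1
            if pred != y[i]:
                alpha[i] += y[i]               # "add f(x_i)" without forming it
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(alpha, X_train, x, kernel):
    score = sum(a * kernel(x_i, x) for a, x_i in zip(alpha, X_train))
    return 1 if score >= 0 else -1

angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([0.5 * ring, 2.0 * ring])        # inner circle, outer ring
y = np.array([1] * 8 + [-1] * 8)
alpha = train_kernel_perceptron(X, y, quad_kernel)
print(all(predict(alpha, X, x, quad_kernel) == t for x, t in zip(X, y)))
```

No linear boundary in the original 2-D space separates these two rings, but the kernelized version gets every training point right.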

There are two popular classes of kernels:

• The polynomial kernel has the form $K(x,y) = (x^Ty + c)^d$ for degree $d$. For vectors of dimension $n$, this kernel will map them to an $O(n^d)$-dimensional space! Expanding the kernel out for the simple case of $d=2$, we get

$(x^Ty + c)^2 = \sum_{i=1}^n\sum_{j=1}^n (x_ix_j)(y_iy_j) + \sum_{i=1}^n (\sqrt{2c}x_i)(\sqrt{2c}y_i) + c^2$

This is the equivalent of a dot product of features that contain elements $x_ix_j$, $\sqrt{2c}x_i$, and $c$ (not $c^2$ – watch out!).

• The Gaussian kernel, also known as the radial-basis function (RBF) kernel maps elements into an infinite-dimensional feature space. It is $K(x,y) = \exp(-\frac{1}{2\sigma^2}\|x-y\|_2^2)$. Probably more than any other kernel, classifying with this one is a lot like nearest neighbor because it clearly measures a similarity function, weighing “closer” examples more in our classification decisions. As $\sigma \to 0$, the kernel becomes a lookup table, and our training accuracy for a perceptron trained with this is 100 percent (except in the weird case of two exact same points getting different labels) but our validation and test set accuracy will be horrible.
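The degree-2 expansion of the polynomial kernel is easy to verify numerically. The sketch below builds the explicit feature map (all products $x_ix_j$, then $\sqrt{2c}x_i$, then the constant $c$) and checks that its dot product matches the kernel; the function name `phi` is mine.

```python
import numpy as np

def phi(x, c):
    """Explicit feature map for K(x, y) = (x^T y + c)^2."""
    x = np.asarray(x, dtype=float)
    quad = np.outer(x, x).ravel()       # every product x_i x_j
    lin = np.sqrt(2.0 * c) * x          # sqrt(2c) x_i
    return np.concatenate([quad, lin, [c]])   # last entry is c, not c^2

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
c = 1.5
print(np.isclose(phi(x, c) @ phi(y, c), (x @ y + c) ** 2))   # True
```

The explicit map has $n^2 + n + 1$ entries while the kernel needs only one $n$-dimensional dot product, which is the whole point of the trick.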

To test my understanding of kernels in more detail, I looked at (as usual) an old CS 188 handout. It had the following image:

(In the first plot, the dotted line is $f(x_1) = x_1^3 - x_1$.)

Let’s consider a linear, a shifted linear, a quadratic, and a cubic kernel (see the handout for details on these), and see if any of them can linearly separate the data in the two plots.

Plot (a) requires a third-order polynomial to separate the data, so only the cubic kernel will work, because that will map feature vectors $x \to \phi(x)$ to have $x_1^3$ in it. Then we’d just adjust the weights to set that to have nonzero weight.

In plot (b), a linear kernel is enough, but there has to be a bias term in there! (That actually tricked me.) Without a bias, in the 2-D case here, the decision rule is ${\rm sign}(w^Tx)$, and a 2-D vector $w$ must “emanate” out of the origin, which means the perpendicular line to it crosses the origin.

## Kernels, Formalized

The preceding discussion motivates the following question: how do we know if a function $K$ is a valid kernel? First, the official definition of a kernel is that it is a function $K(x,y) = \phi(x)^T\phi(y)$ that performs an inner product in a Hilbert Space. Normally, I prefer thinking of the inner product $\langle \phi(x),\phi(y) \rangle$ as the normal dot product (as I wrote earlier) but more generally, we should use the terminology of the inner product. Those satisfy properties of symmetry, bilinearity, and positive definiteness. A Hilbert Space is an inner product space that is complete and separable with respect to the norm defined by the inner product.

Since we are only dealing with real-valued vectors, our Hilbert Space will be $\mathbb{R}^n$ and the inner product here is the normal vector dot product. To test whether a function is a kernel, we invoke a simplified form of Mercer’s Theorem: let function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ be given. Then for $K$ to be a valid Mercer kernel, it is necessary and sufficient that for any set of points $\{x^{(1)}, \ldots, x^{(m)}\}$ the corresponding kernel matrix is (symmetric) positive semi-definite. The element $K_{ij}$ of the kernel matrix is the value $K(x^{(i)},x^{(j)})$. (Sometimes the kernel matrix is called the Gram matrix.)

To prove one direction, that if $K$ is a kernel matrix corresponding to a feature mapping $\phi$, it must be symmetric positive semi-definite, we proceed as follows. First, it’s going to be symmetric due to the dot product (or more accurately, inner product) being commutative. Next, for any $z \in \mathbb{R}^m$ (one entry per point $x^{(i)}$), we have

\begin{align} z^TKz &= \sum_{i=1}^m\sum_{j=1}^m K_{ij}z_iz_j \\ &= \sum_{i=1}^m\sum_{j=1}^m z_i \phi(x^{(i)})^T\phi(x^{(j)})z_j \\ &= \sum_{i=1}^m\sum_{j=1}^m z_i \sum_{k} \phi_k(x^{(i)})\phi_k(x^{(j)})z_j \\ &= \sum_{k}\sum_{i=1}^m\sum_{j=1}^m z_i \phi_k(x^{(i)}) \phi_k(x^{(j)}) z_j\\ &= \sum_{k} \left( \sum_{i=1}^m z_i\phi_k(x^{(i)})\right)^2 \ge 0 \end{align}

where we used the fact we indirectly showed earlier that $\sum_i\sum_j x_iz_ix_jz_j = (x^Tz)^2 \ge 0$. It’s a little tricky because we are keeping $k$, a component of $\phi$, fixed, and ranging across different $\phi$ vectors.

This fact about positive semi-definiteness makes it easy to see that the following are also valid kernels:

• Addition: $K_1 + K_2$
• Multiplication: $K_1 K_2$, i.e., the pointwise product $K_1(x,y)K_2(x,y)$
• Scalar: $cK$ for a constant $c \ge 0$
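Mercer’s condition can be spot-checked numerically on a finite set of points: build the kernel matrix and look at its smallest eigenvalue. This is only a sanity check on one sample of points, not a proof, and the helper names (`gram`, `is_psd`) plus the small tolerance are my own choices.

```python
import numpy as np

def gram(kernel, X):
    """Kernel (Gram) matrix with entries K_ij = kernel(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def is_psd(K, tol=1e-8):
    """Symmetric with all eigenvalues >= 0, up to round-off."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def poly(a, b, c=1.0, d=2):
    return (a @ b + c) ** d

def neg_sq_dist(a, b):
    return -np.sum((a - b) ** 2)     # looks like a similarity, but no kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K1, K2 = gram(rbf, X), gram(poly, X)
print(is_psd(K1), is_psd(K2))                         # True True
print(is_psd(K1 + K2), is_psd(K1 * K2), is_psd(3 * K1))  # True True True
print(is_psd(gram(neg_sq_dist, X)))                   # False
```

Here `K1 * K2` is the elementwise product of the two kernel matrices, matching the pointwise product of the kernel functions. The last function fails because its matrix has a zero diagonal (so the eigenvalues sum to zero) with nonzero off-diagonal entries, which forces a negative eigenvalue.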

We can use kernels for perceptrons (as previously discussed), support vector machines (as we will discuss), principal components analysis, and other methods such as linear regression.

Let’s discuss the linear regression case. In the general (regularized) case, the objective is:

$\arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2$

where $X$ is the $n \times m$ matrix of training instances, where each training instance is a row (this is different from what I usually think of it, but it makes more sense in regression). The $y \in \mathbb{R}^n$ vector has the true labels. By taking derivatives, we see that the optimal $w$ is

$w = (X^TX + \lambda I)^{-1}X^Ty = X^T(XX^T + \lambda I)^{-1}y$

In the last step above, I used the clever trick I learned from CS 281A that for $\lambda > 0$ and a matrix $A$ that is $d\times n$, we have $(AA^T + \lambda I)^{-1}A = A(A^TA + \lambda I)^{-1}$. But wait, what does this mean? We can express $w$ as

$w = X^T(XX^T+\lambda I)^{-1}y = \sum_{i=1}^n \alpha_i x_i$

with an appropriately defined $\alpha_i$ since the columns of $X^T$ (not $X$, be careful!) are the actual training elements, so $w$ is in the space spanned by them and thus we can write it as a linear combination.

When we are faced with a new instance on which to do regression, $x_{\rm new}$, we will perform the following:

$f(x_{\rm new}) = (x_{\rm new})^Tw = (x_{\rm new})^T \left(\sum_{i=1}^n \alpha_i x_i \right) = \sum_{i=1}^n \alpha_i \langle x_i, x_{\rm new}\rangle$

We have kernels again! This is more accurately known as kernelized linear regression. In fact, we even use kernels before we test on new examples (i.e., we use it during training). Why? The matrix $XX^T$ is itself a kernel matrix! It consists of dot products between the training instances, and since we optimize over that during training, we will use kernels during training.
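The two expressions for $w$ are easy to compare numerically. This sketch uses the plain linear kernel $XX^T$, with function names of my own choosing; swapping `X @ X.T` and `X @ x_new` for another kernel would give the nonlinear version.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """w = (X^T X + lam I)^{-1} X^T y, an m x m solve."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

def ridge_dual(X, y, lam):
    """alpha = (X X^T + lam I)^{-1} y, an n x n solve on the kernel matrix."""
    n = X.shape[0]
    K = X @ X.T                       # dot products between training rows
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict_dual(alpha, X, x_new):
    """f(x_new) = sum_i alpha_i <x_i, x_new>."""
    return alpha @ (X @ x_new)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = ridge_primal(X, y, lam=0.5)
alpha = ridge_dual(X, y, lam=0.5)
x_new = np.array([1.0, -2.0, 0.5])
print(np.allclose(X.T @ alpha, w), np.isclose(predict_dual(alpha, X, x_new), x_new @ w))
```

The dual form only ever touches the data through dot products, both when solving for `alpha` and when predicting.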

I may end up reading more of Tom Mitchell’s slides on this, because this was quite illuminating to me.

## Support Vector Machines

We now switch gears to Support Vector Machines (SVMs), which are possibly the best “off-the-shelf” classifier because they combine the kernel trick along with the concept of a maximum margin separator. Thus, we know immediately that – like in the perceptron – we must find some way to express the optimization problem in terms of dot products.

To begin the derivation, we define the functional margin of a weight vector $(w,b)$ (note: we keep the intercept term separate now) with respect to training instance $x^{(i)}$ to be $\gamma^{(i)} = y^{(i)}(w^Tx^{(i)}+b)$, where the class label $y^{(i)} \in \{-1,1\}$, and across the entire dataset, $\gamma$ is just the minimum of all the functional margins. Ideally, we would like the functional margin to be relatively large, as that would indicate a strong, “robust” boundary between the two classes.

We can formulate SVMs with the following “optimization” problem:

$\max_{\gamma,w,b} \gamma$

such that

$y^{(i)}(w^Tx^{(i)}+b) \ge \gamma,\quad \forall i$

along with the restriction that $\|w\|_2 = 1$, which prevents us from inflating the functional margin simply by scaling up $w$ (though $b$ might vary, but I don’t think it’s a problem).

Unfortunately, the $\|w\|_2 = 1$ constraint is non-convex, so this form is not easy to solve with standard “optimization” software. We therefore transform the problem into an equivalent one as follows:

$\min_{w,b} \frac{1}{2}\|w\|_2^2$

such that for all training instances $i \in \{1,2,\ldots,m\}$, we have the constraint $y^{(i)}(w^Tx^{(i)} + b) \ge 1$. This scales $w,b$ so that the functional margin must be one.

We are done, but it is better to face the problem from the dual perspective so that we can take advantage of kernels. In general, the dual optimum $d^*$ is less than or equal to the primal optimum $p^*$, but for this problem the two are equal, so we can equivalently solve the problem by maximizing the dual4. We re-write the constraints as $g_i(w) = 1-y^{(i)}(w^Tx^{(i)}+b) \le 0$ and construct the Lagrangian as:

$\mathcal{L}(w,b,\alpha) = \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^{m} \alpha_i (y^{(i)}(w^Tx^{(i)}+b) - 1)$

Setting the derivatives of $\mathcal{L}$ with respect to $w$ and $b$ to zero, then after some algebra (which took me a while due to lots of indices messing me up, but I eventually got it), and then knowing that we need to maximize this, we pose the dual optimization problem:

$\max_\alpha \sum_{j} \alpha_j - \frac{1}{2}\sum_{j}\sum_{k} \alpha_j \alpha_k y^{(j)} y^{(k)}\left((x^{(j)})^T x^{(k)}\right)$

such that $\alpha_i \ge 0$ for all $i$, and $\sum_{i=1}^m \alpha_iy^{(i)} = 0$. Fortunately, the objective is concave in $\alpha$ and the constraints are linear, so any local maximum is a global maximum.

Notice that we have $\alpha$ variables again, though these have a different interpretation than the ones in the kernelized perceptron. Watch out! And yes, we do get kernels to appear once again. Nice!

Now let’s see what happens when we have trained and are going to assign a class to a new instance $x_{\rm new}$. We perform the ${\rm sign}(w^Tx_{\rm new}+b)$ computation, which can equivalently be expressed as

$w^Tx_{\rm new}+b = \left(\sum_{i=1}^m \alpha_i y^{(i)}x^{(i)}\right)^Tx_{\rm new} + b = \sum_{i=1}^m \alpha_i y^{(i)} \langle x^{(i)},x_{\rm new}\rangle + b$

Once again, we have kernels! Incidentally, it looks like we might have to do a lot of computation for classifying a single point, but in fact, most of the $\alpha_i$s will be zero. The few that are nonzero correspond to the training instances called support vectors, and they are the ones that lie exactly on the margin. This follows from the Karush-Kuhn-Tucker dual complementarity condition, which forces $\alpha_i = 0$ whenever the functional margin of $x^{(i)}$ is strictly greater than one. The fact that we may not need to do much computation means SVMs gain some of the advantages of parametric models.
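A tiny worked case helps here. With just one positive and one negative point, the constraint $\sum_i \alpha_i y^{(i)} = 0$ forces $\alpha_1 = \alpha_2 = a$, and the dual objective becomes $2a - \frac{1}{2}a^2\|x^+ - x^-\|_2^2$, which is maximized at $a = 2/\|x^+ - x^-\|_2^2$. The sketch below hard-codes that closed form (so it is a toy for this two-point case only, not a general SVM solver) and then checks that the dual and primal decision values agree. Function names are mine.

```python
import numpy as np

def two_point_svm(x_pos, x_neg):
    """Closed-form hard-margin SVM for one positive and one negative point."""
    diff = x_pos - x_neg
    a = 2.0 / (diff @ diff)          # maximizer of 2a - 0.5 * a^2 * ||diff||^2
    alpha = np.array([a, a])
    X = np.vstack([x_pos, x_neg])
    y = np.array([1.0, -1.0])
    w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i
    b = 1.0 - w @ x_pos              # support vector satisfies y (w^T x + b) = 1
    return alpha, X, y, w, b

def decision_dual(alpha, X, y, b, x_new):
    """sum_i alpha_i y_i <x_i, x_new> + b, using only inner products."""
    return np.sum(alpha * y * (X @ x_new)) + b

alpha, X, y, w, b = two_point_svm(np.array([2.0, 0.0]), np.array([0.0, 0.0]))
x_new = np.array([3.0, 5.0])
print(alpha, w, b)                                     # [0.5 0.5] [1. 0.] -1.0
print(decision_dual(alpha, X, y, b, x_new), w @ x_new + b)   # 2.0 2.0
```

Both points are support vectors here, and each has functional margin exactly one, as the scaled primal problem requires.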

In the above problem, we – just like in the kernelized perceptron and kernelized regression – have formulated the problem so that, both during training and classification of new examples, the data enter via inner products, allowing us to use kernels.

What happens when we do not have linearly separable data? Rather than come up with a more complicated or longer feature vector (which might risk overfitting), we can reformulate the problem using slack variables (for $\ell_1$-regularization) and an additional, controllable parameter $C$:

$\min_{w,b,\xi} \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m \xi_i$

such that for all $i \in \{1,2,\ldots,m\}$, we have $y^{(i)}(w^Tx^{(i)}+b) \ge 1 - \xi_i$ and $\xi_i \ge 0$. Thus, samples are permitted to have a (functional) margin less than one.

Rather surprisingly, the only change to the dual is that the $\alpha_i \ge 0$ constraints turn into $C \ge \alpha_i \ge 0$ constraints, so we can apply the same principles (roughly speaking) as we did in the linearly separable case. In addition, the way we find $b$ changes, but generally we don’t really worry about the intercept too much when going through the derivation. It’s really $w$ that matters most to me.

1. This is important. When we call things $x$, we usually refer to the raw data, but what the classifier needs are a set of features for each sample. But some people elide this notation by treating $x$ directly as features, so be careful.

2. Even when the data is linearly separable, the perceptron is only guaranteed to converge in the binary classification case. Here’s a key theorem: suppose the (binary) data are separable with margin $\gamma$ and the maximum norm of a training sample is $R$. Then the perceptron converges with at most $O(R^2/\gamma^2)$ updates.

3. Here’s some intuition: we’re trying to combine the best of nearest neighbor approaches with perceptron approaches by using the former’s ability to use fancy “similarity” functions along with the latter’s ability to explicitly learn from data.

4. Technically, we need the Karush-Kuhn-Tucker conditions to hold for equality to be possible.