My Blog Posts, in Reverse Chronological Order
subscribe via RSS
The Benefits of Having the Same Group of Interpreters
I just submitted a sign language interpreting services request for the spring 2016 semester, when I am likely to take EE 227C and CS 267. The former is the third convex optimization course offered at Berkeley, and the latter is a popular entrylevel graduate course on parallel computing and systems.
For this request, though, I also said that I wanted to have a more consistent group of interpreters. This means I would prefer to have the same interpreters currently working now (or those from last semester) to be assigned to those two courses. Just like in the Spring 2015 semester, I have a standard group of three to four interpreters this semester, but strangely enough, none of them were also part of the Spring 2015 group. This is despite how all of the interpreters are assigned out of the same Bay Area company, Partners In Communication LLC.
In addition, there’s been another interpreting issue for this semester in particular. I’m not sure why, but I have had an unusual amount of substitutes. There are two primary interpreters, plus one primary substitute interpreter, but then there have been at least five cases (as far as I can remember, all involving different people) when I’ve had substitutes for substitutes.
This would be frustrating even if I was taking an undergraduate humanities course, but when the material is so technical in my courses, a normal person cannot convey the material clearly on day one. At least with consistent interpreters, they can pick up some of the common terminology. The people who interpret for my Convex Optimization class (EE 227BT) have gotten so used to hearing the words “positive semidefinite matrix” together that they can now understand that sequence when it’s used in other classes. (Positive semidefinite matrices are everywhere in machine learning – I can’t believe I went through undergrad without knowing about them, and now I’m one of their biggest fans.)
Consistent interpreting is something that I admittedly did not think about when requesting for services last semester, but I will remember this for the future. It is already challenging for interpreters to work in STEM courses, so there needs to be consistency so that they improve throughout a semester. Note that in general, I do benefit somewhat from interpreting services despite issues in STEM courses, and in some cases interpreters are essential (as was the case a few weeks ago when an ear infection meant I had to stop wearing my right hearing aid for a week), so this is pretty important to me.
Oh, speaking of interpreting requests, I also need to hope that no one else “strongly suggests” me to drop and/or add a course at the last minute, though admittedly, adding a course results in substantially fewer headaches as compared to dropping a course.
My Prediction on the 2016 Presidential Election
When I briefly wrote about the 2016 US presidential election a few months ago, my prediction on the Republican nominee fell completely flat. Scott Walker, who I honestly thought had enough political credentials to satisfy the party leaders while avoiding the “establishment”like aura to appease the surprisingly large portion of Republican voters who want an outsider, was actually the second person to drop out of the group of seventeen, despite how “candiadates” such as Jim Gilmore and George Pataki are still “running.” The current political classification of the Republican candidates looks like this, courtesy to FiveThirtyEight^{1}:
That brings us to where we are today. At the risk of embarrassing myself in the future, I think there are only three real candidates for the Republican nominee: Jeb Bush, John Kasich, and Marco Rubio. I think Rubio will be the nominee, and furthermore, that he will win the general election over Hillary Clinton. Note that I am not saying if I would support such an outcome; it is simply what I would place my money on if I were betting.
Why do I think Rubio will be the nominee among the fifteen current Republican candidates? To help answer that question, I have grouped up the candidates into four tiers, from least likely to most likely to win the nomination. I also have a short summary of each tier. My explanations are super brief, so apologies in advance to those who were expecting pages and pages of text. I don’t know enough about all of these people to write much about them, and you don’t want to read a long blog post.
First, let’s look at the bottom tier.
 Jim Gilmore
 Lindsey Graham
 George Pataki
 Rick Santorum
Rather surprisingly, none of these guys stand a chance at winning, despite how they are currently or have held prominent political positions. For one, they are all polling horribly. None of them have made it to the first two “varsity” Republican debates, and it does not look like that will be changing any time soon.
To discuss them individually: Gilmore is too unknown, Santorum’s time was in 2012 and he’s competing with more popular religious politicians, Pataki is too liberal (he supports more forms of abortion than is acceptable), and Graham … I’m honestly not sure. I would have thought that Graham would be an attractive candidate due to his detailed plans to fight ISIS, but I guess he may be a victim of the wave of Republicans who despise current politicians?
Next, we have the Tier 3 list of candidates. These people still don’t realistically have a chance of becoming the nominee, but at least they are all better off than those in the bottom tier:
 Benjamin Carson
 Bobby Jindal
 Carly Fiorina
 Rand Paul
 Donald Trump
I debated putting the three outsiders (Carson, Fiorina, and Trump) in the bottom tier, but since they’re collectively polling so strong, they might as well be in sort of a “midtier” position. Some of the recent polls have shown that more than half (!) of Republicans have one of these three as their top option.
I should start by saying the rather obvious fact that Trump and Carson make too many incendiary comments to realistically win the nominee. It’s not the average Republican voter that ultimately matters – it’s the party leaders who have power, and there is no way they will let one of those two guys be the nominee. Nonetheless, Trump and Carson are polling so well that they can’t be ignored. I think it would be interesting, however, to see what would happen if it was only one of those guys (say, Trump) against one of the Tier 4 guys (say, Graham). In a headtohead race among those two for the Republican nominee, I honestly think Graham would be the winner, or at the very least, it wouldn’t be a 25to1 loss like the current polls (in the inflated field) would suggest.
Fiorina is interesting. She is less popular than Trump and Carson, but actually has an endorsement from a Representative, plus business experience (even if it may not be the greatest) to back up her credentials with the establishment.
My addition of Bobby Jindal in this tier has to do with how I think Jindal is able to better appeal to religions conservatives than Santorum. Finally, Rand Paul is here because he at least made it made it to the first two varsity debates, but he seems to be languishing in the polls, so I don’t know how long that will last.
Now let’s bring up Tier 2, where we have candidates with more than a shred of a chance of winning:
 Chris Christie
 Ted Cruz
 Mike Huckabee
Chris Christie used to be pretty popular with conservatives, but I think he actually has a net negative rating among them now, which may be too much to overcome. His time to run was perhaps in 2012, and if he did become the nominee, I think Democrats would play “Bridgegate” ads over and over again.
Mike Huckabee and Ted Cruz are similar in that they heavily appeal to religious conservatives, who are upset over the same sex marriage ruling. They also polled strong enough to comfortably make the first two varsity debates. Mike Huckabee is a longtime governor, but he may struggle to expand his appeal beyond the religious conservatives. (To explain that last sentence a bit better, I think that being a state governor is better preparation for the presidency than being a U.S. Senator.)
At the risk of sounding repetitive, Ted Cruz might be similar to Scott Walker in that he can bridge the gap between those who want outsiders, and those who favor the establishment, though he’s clearly closer towards the “outsiders” as he’s heavily conservative and has infamously picked fights with people in his own party. At the very least, compared to Trump and Carson, Cruz actually looks reasonable, and I’m sure he will try to do whatever he can to get the lion’s share of supporters of Trump and Carson when those two inevitably drop out of the race. Cruz also has lots of money from his Super PAC.
Finally, we have the top tier of candidates:
 Jeb Bush
 John Kasich
 Marco Rubio
Jeb Bush and John Kasich are (or were) governors and are respected by the Republican establishment. The latter fact is something probably necessary to ultimately win the Republican nominee. Rubio is only a first term Florida senator, but he’s moderate enough to appeal to a wide range of Republican voters and lacks any obvious downsides.
Why do I think Rubio will win among those three? Again, at the risk of sounding repetitive, it’s simple: Bush and Kasich are too moderate to appeal to the TrumpCarson supporters (and what the heck, I’ll throw in the Cruz and Fiorina supporters as well). Remember when Trump tried to make fun of Graham and Rubio? I recall that his cell phone jab at Graham was a hit, but his insults to Rubio (e.g., his hair is bad, he supports immigration, he misses too many votes) do not have the same effect. The Republican party leaders may not want to risk having another establishment candidate and alienate a large segment of its supporters, who might otherwise abstain from voting.
Now for the huge question: why do I think Rubio will beat Clinton in the general election? I could go on and on about this, but (a) I don’t know enough about politics, and (b) I don’t think the reasons are all that complicated. Many of the things we debate about (businesses, taxes, foreign policy, etc.) are enormously complicated concepts that require years of experience and education to make informed decisions. Yet I would argue that the majority of people rely on quick sound bites and 750word opinion articles in their newspaper of choice to form opinions on these. So people aren’t going to dissect candidates to an alarming degree; by and large, the reasons why people will vote for a certain person are simple.
With that in mind, here are four reasons why I think Rubio will prevail over Clinton:
 Reason 1: it’s hard for a single political party to control the White House for a long time. Before Obama, we had eight years of a Republican president, then before that, eight years of a Democratic president.
 Reason 2: the crises going on in other parts of the world. I think the current Middle East mess (think Syria, and to a lesser extent, Iraq) will ultimately have a negative impact on Obama’s legacy, and people may want a change in leadership styles.
 Reason 3: money and background. Rubio was born in a more disadvantaged environment, and I think his campaign will repeatedly try to play the contrast between his life and Clinton’s. They’ll probably focus more on how Clinton earns hundreds of thousands of dollars for delivering speeches, while Rubio’s parents lived paycheck to paycheck.
 Reason 4: demographics. I really hate to label people by their race or gender^{2}, but the fact that Rubio is Hispanic is going to neutralize some of the advantages Clinton would have due to being a woman.
For these reasons, if I were betting money on this, I think Rubio will be the next president. We will see how things turn out.

It would be interesting to see the comparable venn diagram for the Democratic race, though since there are only three candidates (Clinton, O’Malley, and Sanders) it might be wisest to include other prominent Democrats not currently running for president. ↩

I do sympathize with Jindal when he says he wants to avoid the use of “hyphenated Americans.” ↩
Why Can Certain People Understand Foreign Accents Easily?
Lately, I’ve observed an interesting phenomenon when my sign language interpreters have a tough time understanding some of my classmates’ accents, yet my professors (and, presumably, my other classmates) don’t seem to have that problem. Here are a few nonexhaustive examples, restricted to my Berkeley experience:

I once gave a talk in Peter Bartlett’s research group meeting back in April. I had a sign language interpreter there, and she was able to help me by explaining some of the comments Peter made about the paper during pauses in my lecture. Unfortunately, she had major trouble with one of the postdocs, who had an accent I couldn’t even distinguish – it was not Chinese or Indian. She was unable to explain what that person said, and I think we had to rely on a few other people to help us out, plus a couple of finger pointing to the relevant stuff I wrote on the whiteboard.

In my EE 227BT class (Convex Optimization), many of the students are Chinese. My interpreters had such a hard time understanding some of them that, after the first few lectures, they talked to the professor, Laurent El Ghaoui, about it. He acknowledged that some of them were tough to comprehend (“they’re engineers” he lamented) but he learned to repeat their questions so that my interpreters could easily relay the information back to me. However, there lies the interesting factor: my interpreters sit pretty close to the professor, and he doesn’t move around too much in class, so the difference in their comprehension probably doesn’t come down to distance from the speaker.

In my CS 287 class (Advanced Robotics), Pieter Abbeel is unusual in that he seems like he really wants to get to know the students. (Our class is much larger than the previous editions, so he actually offers two lecture times to reduce the number of students in a room!) During some of our class breaks, he will take out the list of students and ask some of them to stand up and introduce themselves to the rest of the class. When one of the Indian students spoke about himself, my interpreter could not understand a single word that student said – literally! But Pieter did not even ask that student to repeat himself, so I assume he must have understood part of what that student said. This situation happened to a less extreme event (as in, the interpreter understood a handful of words) with a few other students.
I think the only way that can explain the comprehension disparity between my interpreters and professors is that the latter group of people are more used to being around foreigners. I wrote a blog post almost three years ago that highlighted my concern over understanding foreign accents in graduate school. Unfortunately, it’s been a nontrivial problem for me here as most of the students I work or converse with are international students (typically from one of two major countries: China or India).
Incidentally, none of those three professors are American^{1}, so it’s possible that they may be more skilled at picking up accents since they’ve traveled to quite a number of places and conversed with lots of people. That’s my only explanation. But I would also be interested in knowing if there was any initial struggle or hump they had to clear, or if they actually do have trouble understanding people but are clever at hiding it.
So, here are some questions I’d like to ask:

To people who can understand foreigners easily: why do you feel like this is the case? Where are you from? Have you traveled around the world a lot, talking to people of different nationalities? Was it always easy to talk to foreigners?

To people who have a hard time understanding foreigners: do you get the chance to talk to a diverse group of people? How long have you tried to understand foreign accents?

To American STEM students: how good are you at communicating with foreigners? Do you find the high proportion of foreign students in STEM to be a deterrent to your education?

To foreigners: do you get frustrated when people ask you to repeat what you say? What are your thoughts in particular about communicating with deaf people like me (who have enough hearing to talk oneonone)?
The practical concern of this for someone like me is that, if I work in a group of foreigners – reducing the likelihood of oneonone conversations – will there be any benefit of a sign language interpreter?

Actually, probably most faculty in EECS are not even American. I don’t know what the matter is with our (predoctoral) STEM education. ↩
Suggestions on Improving Access Services Requests
It’s nice that Berkeley has an access services page that I can use to request accommodations. But it’s not perfect, and there are several things that really should be fundamental components of any service request system. Here are some of my “fundamental components,” which are not currently part of Berkeley’s recentlyoverhauled system:

Any time I submit a request form, I need to get a “receipt” email that confirms I sent the request, along with the details. This gives me proof of submission and protects me in case Berkeley’s DSP loses it somehow. It also lets me double check that I filled in the boxes with correct information – it’s easy to mess up on these things.

Any time the requests are satisfied, I need to get an email telling me that information. In my case, this means a sign language interpreter (or two) got assigned to wherever I am going, and I should know their names and contact information. Last year, for instance, I made a request for interpreting services for the Berkeley EECS town hall, but I ended up getting captioning instead, and I didn’t know until the last minute (well, five minutes). The town hall ended up being a disaster, though to be fair, it would have been difficult for an interpreter to be effective there given the noisy atmosphere.

In the request form, there needs to be a generic “describe any miscellaneous information” box that I can type in to describe such information. For instance, some events may last for hours, but are lowkey and don’t involve much discussion. If DSP were to just look at a request for a twohour event, they would automatically hire two interpreters, but sometimes I might have to make it clear that only one is needed (thus saving money and man/womanpower). I’ve had to resort to describing this information to the person managing the requests by email, and that’s cumbersome.
I might be stretching here, but it would also be nice if my requests could get processed over the weekends. Like many graduate students, who are young and consumed with work, I don’t make much of a distinction between weekdays and weekends. But DSP will not operate over the weekends since they are staffed by “real world workers.” (Just to be clear, that’s a jab at graduate students, not real world workers.) Consequently, I have to be careful to submit requests before Friday evening; otherwise, DSP doesn’t get to look at them until Monday morning. I’ve been burned on this at least once. The lesson? Plan way ahead. I don’t have the luxury of going to an event on short notice.
My weekend idea is probably a bit too much to ask. I would be satisfied if DSP could implement my three other suggestions.
I would have thought that these things I mention are obvious, but I guess DSP didn’t have anyone complain, or they’re having technical difficulties. At the very least, this situation serves as yet another reminder to me that I need to educate intelligent people who have never considered the various issues that arise for deaf people.
An Unbelievably Pleasant Social Gathering
A few weeks ago, there was a Berkeley AI social event. My natural reaction upon finding out about this was to ignore it, due to deeply unpleasant experiences in social gatherings before. Yet, I ended up going, due to two reasons. The first was that I had a terrible summer and was constantly in a bad mood, so I figured that even if I hated attending the social, my resulting mood couldn’t possibly be worse than what I was experiencing on a daily basis. The second reason why I attended was because John Canny said he would be going there, and he invited me to go with him. This would at least reduce the likelihood of the most common situation for me in social events: when I stand around awkwardly, watch other people talk without a clue as to what they are talking about, and then leave when my level of frustration exceeds a certain threshold.
Critically, John and I would be going together, not separately, so my plan was to stick with him until he either introduced me to someone, or until someone were to start a conversation with me. In the latter case, I would immediately switch to communicating with that person since John would have no trouble finding others to talk to. My goal for the AI social was to have at least one nontrivial conversation (ideally at least ten minutes) with someone (other than John), and leave with my mood at least as good as it was beforehand.
Well, fortune struck that afternoon. I had three (yes, three) such conversations! And I have an embarrassingly detailed recollection of how these conversations began, and what we talked about – often down to the exact words and sentences we said.
For the sake of privacy, I won’t go through all the details, but here is the high level story. When John and I went to the social event, there was already a huge crowd of faculty and students that I immediately became worried. Fortunately, there was a quieter room to the side that a few of us were in, and I was able to make my way over there. I saw Trevor (hey, can you tell me my prelim score already?) and a few other familiar folks. I had just started enjoying my cup of red wine when another student saw me and started a conversation, my first real one for that event. Fortunately, he spoke clearly, and the “side room” was sufficiently quiet.
I believe we had a nice conversation (I hope he liked it), and again, for the sake of privacy I won’t go through the details (but I can probably produce an accurate transcript of our conversation if needed). We reached a good stopping point, and then split up. I got some additional food and wine, and returned to the more crowded area. I was now by myself in a painfully familiar situation, and planned to remain only briefly.
But I got lucky – someone was walking back to work, and saw me along the way^{1} so we ended up talking. Again, this was another person who I knew somewhat well, and his voice is easy to understand. Our conversation was also pretty nice. We discussed the prelims and his plans for the next few years.
Then he had to leave. I stood around by the nowempty plate of food, and was getting ready to head out when I got lucky again: a student who I had seen before was walking towards me and had made eye contact. Yay! I did not know him as much as the two earlier students, so our conversation veered towards standard “introductory” material.
By the time we finished, most of the other students and faculty had left to go back to work (or home, because it was Friday evening), so I figured I would do the same.
But wow, that event was something I will definitely remember for a while. I felt so deliriously happy after the social, that it actually ended up a net negative on my work progress for the rest of that evening, because I kept thinking about the social rather than my work! (Don’t get me wrong – this was the rare case when I was happy not to get work done.) If others knew how much I obsess over almost trivial conversations, how I have a neartranscript level of conversation recollection, and how often I replay social situations in my head, they would probably be shocked.
So should I try attending social events more often? Maybe. Actually, one of my concerns is if I set expectations that are too high for future social events. I guess that’s not the worst thing worry about, though.

It’s quite convenient that I stood in a location that people would have to walk by in order to leave the area. That’s a clever strategy! ↩
My Wish for the 20152016 Academic Year: A True Student Collaborator
After a year into my doctorate program at Berkeley, I know that my experience has been suboptimal in many ways. One reason for that – actually, the main reason – is because I have been experiencing severe isolation the past few months, primarily a product of being in a small research group where I feel like my research interests and career goals do not closely match with those of the other people here.
Yeah, it’s “the usual” for me. After middle school, I badly wanted a fresh start in high school so that I could cut down on all the isolation I had experienced. Then I badly wanted a fresh start in college, for similar reasons. Then I badly wanted a fresh start in graduate school, also for similar reasons.
So my wish for the 20152016 academic year is simple, and I hope, realistic.
I wish … that I will be able to find another true, student collaborator, someone who is also a Ph.D. student here, and is interested in Artificial Intelligence research. Someone who will be willing to sit down with me, work with me side by side, and help teach me what it is like to do true research and boost my confidence. Someone who will not mind if I mess up on minor errors. Someone who has good English language skills, who can speak clearly, and who will not mind that I am deaf.
Someone who I can consider a true friend.
My Prelims (Transcript)
Note: I wrote the following “transcript” of my exam from memory, so it should not be taken as verbatim.
Before the Exam
Of the twelve AI students taking the prelims this fall, I was up last to go; my times were 4:10 to 4:40 with Pieter Abbeel and Dan Klein, and 5:10 to 5:40 with Ruzena Bajcsy and Trevor Darrell. Obviously, I had spent the entire day of August 24 before 4:10 doing some lastminute preparation. I arrived in the seventh floor of Sutardja Dai Hall and met my sign language interpreter there, who was relieved that she hadn’t gotten lost or sent to the wrong location.
Actually, I should backtrack to rant a little. My prelims was yet another example of how having multiple channels of communication in setting up a sign language interpreting appointment can mess things up. A week before the prelims, I carefully prepared a threepage document of preparatory material for my interpreter to review. Most of the document was a list of advanced terminology that she might expect to hear on the prelims. My goal was for her to look at the words, remember the spelling, check the pronunciations online (if necessary) so that she could fingerspell them quickly^{1} during the prelims. I also listed where and when we should meet before the prelims. Finally, and this is important, I gave the document to Berkeley’s DSP, who then forwarded it over to the agency that officially appoints the interpreters.
Yet when I arrived in the location where I requested to meet (the third floor of Sutardja Dai Hall), I never saw her, and when it was 4:05pm, I had no choice but to go up to the seventh floor. It was only when I finally got near Pieter’s office that I saw her. She then said that she had never gotten that document I created. This is the problem: I specifically asked to send my prep documents to the interpreter directly – not through multiple parties – but got declined^{2}. We lost a bit of early discussion time: she went right to the location of the exam at 3:50pm, while I was anxiously pacing around in the lobby waiting for no one.
In the end, this might be me going overboard, since the prelims are not a case where having interpreters is strictly necessary for me. I can understand most people when they talk clearly and directly to me, but there are many other situations when having informed interpreters would be vital.
Sorry for the rant. A few minutes after I arrived near his office, Pieter came out and immediately told me to come in. It was time.
Part One
“So … hello,” Pieter and Dan said simultaneously.
“Hi,” I replied. I had never seen Dan and Pieter look like this before, though admittedly I do not know either of them that well. I think they were either supertired after having eleven other students before me, or they were trying superhard to make me feel less nervous.
“The prelims are going to be a little different this time,” Dan said, “but everyone is doing the same thing, so don’t worry. It’s a written exam. You have thirty minutes to answer these questions. You can say things out loud if you already know the answer, and you don’t have to show excessive work. While you do that, I will take notes on my laptop, so … try not to worry about me.”
Fat chance, I thought as he smiled awkwardly. Dan pressed the timer on his phone, starting the exam.
I opened the exam and flipped through all the nine questions to get a feel for them. Then I went back to the fifth question, about logic, since I knew the answer already.
“Well, the term just means alpha entails beta, which is the same thing as saying that implies ,” I said, writing down briefly. “And an algorithm for doing that is resolution. Or if we’re assuming Horn or definite clauses, we can use forward chaining and backward chaining.”
Both of them nodded. Since that was the extent of the question^{3}, I went back to the start of the exam, in an attempt to do the rest of the questions in order. The first question was about different search algorithms and heuristics. The graph was structured as a directed tree with two goal states, with (different) edge costs listed for each arc.
“For depth first search, we just go to the left,” I said, outlining the path. “For uniform cost search, we expand this goal state. For a heuristic, we just need to not overestimate to the goal. And for the last part, when we talk about ranges that would make the heuristic admissible, it seems like we just need to make sure we have that be at most four … wait, we are doing Astar search, right?”
“Yes,” Pieter said. “Think about the value 4.5.”
I nodded and wrote my answer formally on the sheet. Then I moved on to the next question, about constraint satisfaction problems. What happens if we enforce arc consistency, and (1) we have one element in each domain, (2) we have a variable with no values left, (3) we have multiple values in each domain? “The first one should have an immediate solution,” I said. “For the second case, we’re guaranteed not to have a solution. For the third, we are not guaranteed to have a solution.”
Dan and Pieter did not noticeably react, so I went on to the third question. It presented me with a tree and asked me to write down a set of leaf node values so that the expectimax versus minimax paths are different. That was easy for me, as I wrote down “1,” “1,” “10,” and “0” for the two sets of two leaf nodes (so four leaf nodes in all). Both of them nodded in approval.
The fourth one was about Markov Decision Processes, and involved substantially more text than the previous questions. I looked at it and decided to move on to the sixth question.
“Oh yeah, a lot of people skipped that one,” Pieter said. I did not ask if that meant they skipped and solved it later, or (gulp) if they skipped it entirely.
The sixth question (remember, the fifth was about logic) presented me with two Bayesian Networks. The first part, asked me to write down the joint (that’s easy – just multiply values together). The second was slightly more challenging, which asked me to outline a good variable elimination ordering and a bad variable elimination ordering. The graph was shaped with a root node connected to through and there were also “” nodes downstream of the nodes.
“Obviously, a bad elimination ordering starts with because if you eliminate it, all the terms get put into one factor that depends on all the s, so you get a giant table,” I said. “Now, for a good variable elimination ordering, eliminate last. Then I guess the only question is whether we should eliminate , , and so on, or the other way around. To do that …”
“You don’t need to do that,” Dan interrupted.
“Oh, I guess the ordering of the s doesn’t matter,” I said. I probably would have figured this out eventually, but I’m glad they were helping me to speed up.
The seventh question presented me with a familiar decision network diagram. “The oil can be in one of three locations,” I muttered out loud, reading the question. “One hundred utility for getting the oil, zero for not getting it. So the MEU with no information is just one hundred divided by three, since one third of the time, we get 100 utility and otherwise we get nothing.”
The second part asked about the VPI of knowing with certainty the state of oil in one of the three locations (i.e., it was either “there” or “not there” with certainty). “For that, we need to find the maximum expected utility with this information, minus the MEU I found from the first part,” I said. “So one third of the time, we know the oil for certainty, and two thirds of the time, we can narrow down our options to the other two. So subtracting that off … it’s fifty over three.”
The eighth part was about maximum likelihood. Given that we have heads, heads, tails, and heads, I had to show that the MLE of .
“The standard way of doing this is to write the product of the Bernoullis and form the Lagrangian …” I began, but Dan correctly pointed out that it wasn’t going to be that complicated.
“Oh yeah, we can literally write out ,” I said, writing it out. Unfortunately, I ended up getting . I stared at my algebra for an embarrassingly long time before Dan pointed out my stupid error: failing to distribute a “3” appropriately.
“Yeah, that’s right,” I said, relieved. “Moving on…”
The last question (well, second last for me) showed me three different twodimensional Gaussian distributions, with different “curves” printed. I never saw these diagrams before but it was pretty intuitive what it meant; if the curve stretched “long” in the direction but “short” in the direction, then the covariance matrix would have . Two of them had diagonal covariance matrices, and the third one, which stretched out long in the second and fourth quadrants, I put down as having .
“Why did you pick that?” Pieter asked.
“Well … “ I said, not entirely sure if that was correct. (I later realized after the exam that I was right, and knowing that with certainty required remembering the formula for correlation.) I wrote out the definition of the covariance: .
“And the expected values of both are zero,” Pieter pointed out.
I wasn’t sure if he said that in agreement with my answer, but time was running out so I came back to that MDP question I skipped earlier (and left my answer to the ninth question unchanged). There was a grid of squares arranged in a “circular” fashion where we could move to any of two adjacent squares, except for two “slide” locations that forced us to move right.
“This is asking me about the value of states after three, ten, and infinity iterations of Value Iteration, with discounts of or ” I said. “Well, the value of the state by the slide is going to be two plus twotenths after three iterations with discounting. Without, it’s four.”
I proceeded, figuring out the different values of states they asked me to write. I tried to sprinkle in a few jokes here and there to amuse them, and was moderately successful. Unfortunately, time ran out before I could fill in the last two blanks, so I put in a “2” in one of the boxes at random.
“Are you sure about your answer to that one?” Pieter asked.
“No, I only put a random value down to get something down in the time limit,” I said.
“Well, why don’t you try and solve it?”
I looked at it, and within a few seconds, saw that the answer was infinity. “Um, yeah, infinity I guess, for both” I said, knowing that I should have gotten that earlier.
“All right, I guess we’re done. You’re the last one so you can now talk to any of the other students about this part of the exam,” Pieter said.
“Do you have a rough idea of how well I did?” I asked.
Dan smiled awkwardly. “We … can’t really do that now.”
Fine, I acknowledged, and left the office feeling reasonably happy with my performance. I read my AI book for a few more minutes before showing up at Ruzena’s office.
Part Two
“Well, hello. Why don’t you introduce yourself?” Trevor asked. Again, I’m assuming they were doing this to help the students feel less nervous, though in my case I did not expect these two to know anything about me so this really did qualify as an introduction. I mentioned my undergraduate institution and my advisor, but I also wish I had remembered to say that I was deaf since birth.
Then they proceeded to ask questions.
“Explain logistic regression,” Trevor said.
“This is a classification algorithm,” I replied. “Let’s suppose we’re in the binary classification case. Then logistic regression means that we are assuming – at least in the discriminative case – that our data should be assigned probabilistically to classes of 0 or 1 based on the logistic function, .” I wrote the formula on Ruzena’s white board.
“And what are the parameters?” Trevor asked. I pointed to and mentioned that we could update these with a stochastic gradient method based on the gradient of the data log likelihood.
“Next up: how do you do linear regression on a Vshaped data?” Trevor asked.
“This is a mixture model,” I said. “Just have different linear regressions for the two sets of data.” I drew some hypothetical data on the board.
“But how do you know that you will have two components?” Trevor asked, shrugging. “Why not just one regression line?”
“Well, this is basically ordinary least squares, right? So with an extra regression, we would be able to better minimize the sum of squared errors…”
“Perhaps I should rephrase this,” Trevor said. “We have to decide the number of components, and in the general case, we can’t just eyeball the data and say that we will have two components. We need to try out different models and evaluate them somehow.”
“Oh, just try out different numbers of distributions in the mixture model and perform cross validation–”
“Yes, good,” Trevor interrupted. “Next up, how would you classify data if you did not know the labels?”
“One way to do that is with an algorithm like Kmeans,” I replied. I drew out some unlabeled data and tried to signal how the algorithm would work by mentioning how the center of each cluster gets updated.
“Is this parametric or nonparametric?”
“Uh … it’s parametric in the sense that we can fix and we only have a finite number of parameters.”
“That’s correct,” Trevor said, “but what is an example of a nonparametric technique?”
“Nearest neighbors.”
It was Ruzena’s turn to ask. “Can you write down Bayes’ rule for classification?”
I wrote down Bayes’ rule on the white board, for . I wasn’t sure if I really understood the question, because it’s just Bayes’ rule!
“Often times we care about learning the parameters,” Ruzena said.
“Yes, in which case we would have something like Bayes’ rule applied to ,” I said, still unsure if I was supposed to answer something in more detail.
“Can you write down or explain what the Bayes’ risk means?” Ruzena asked.
This wasn’t on the syllabus, I thought, but remembering the formulas from CS 281A, I wrote down . “Basically, the expected loss between what we predict and the true .”
Surprisingly, they seemed satisfied with that brief answer, so Trevor changed the subject to the next part – neural networks^{4}.
“First, describe a neural network.”
“This is a classifier that can also be viewed as an architecture,” I said, drawing out some nodes on the board. “We have an input layer that takes in our data, such as pixels in an image. Then we have the option of having hidden layers, and I’ll write down one. Then an output layer, where we have outputs. The key with these architectures is that the hidden layers allow for extra complexity, so to speak, but we also need to make sure we add in activation functions for the nodes to introduce nonlinearity. So we can use the sigmoid.”
I explained a few more basic properties in detail, and also went over backpropagation, until Trevor seemed satisfied.
“Now we’re going to talk about convolutional neural networks,” Trevor said. I kind of expected this to happen since Trevor was on the committee, but since CNNs were not on the syllabus, I would have to rely on my memory of CS 280 material (thank goodness I took the class).
“I have a network architecture listed here,” Trevor said, showing me his iPad. “Please draw it.”
I drew it out. It was a standard LeNetlike architecture, with a input layer, a convolutional layer, a maxpooling layer, and a fully connected layer, resulting in four outputs.
“How many pixels are there in each resulting image in the convolutional layer, given the kernel size listed?”
“They are ten by ten,” I replied, since the filters were and the stride was one.
“Good. Now what about the size of the maxpooling layers?”
“Five by five.”
“And how many weights are there in the fully connected part?”
I wrote down on the board.
“So, three hundred,” Trevor approved. “Now can you tell me the difference in run time or space time between convolutional and fully connected layers?”
“Yes, the space time is higher for fully connected layers because convolutional ones have weight sharing,” I replied, with Trevor nodding. “As far as runtime, wouldn’t the same apply for the convolutional one …”
“Well, think about the total amount of weights,” Trevor corrected.
I mutter something about the total number of (not necessarily unique) weights, and tried to make a highlevel argument that the runtime would be higher in the convolutional layer, assuming that training or testing time would be proportional to the number of weights. Trevor seemed happy, and continued with asking neural network questions.
“Can you talk about the vanishing gradient problem?”
This was another topic that I only remember from CS 280. Fortunately, I did remember, and I wrote out the curve of the sigmoind function. “With the sigmoid, the gradient will go to zero as goes to plus or minus infinity. After a certain point, the difference in gradients is irrelevant and will not contribute to the update. This is why we like ReLU layers, because those are and the gradient will increase linearly.”
I fumbled around with my initially sloppy explanation until I felt Trevor seemed satisfied with a less sloppy version.
“The last question for today is about transfer learning,” Trevor said. “What is it?”
Yet another question from CS 280! Wow, I am really glad I took the class. “That’s when we use weights learned from one task to apply to another task, hence transferring the knowledge from training one thing to another. So, in a computer vision object recognition challenge, imagine that we have one large data and one small, specialized data. We train on the specialized data to get some weights. Then we…”
“Other way around,” Trevor smiled, motioning with his hands.
“Oh, right, sorry! Yeah, I remember that. So we would train on the large, broad data to get some weights. Then, for another neural network on a more specific task, we use those as the initial weights. Say, for instance, we are trying to classify objects as being Trevor or Darrell …”
“I’m Trevor Darrell!” Trevor replied.
“Yeah, I meant that those are two similar output classes, so we would need specific weights for those,” I said. I worried a bit that my joke backfired, but Trevor seemed to be in a good mood, so maybe not. I provided a halfbaked concluding overview of transfer learning and neural networks in general.
“And we’re all out of time!” Trevor said, feeling satisfied.
“When do you think you can provide the results?” I asked.
“Probably tomorrow!” Trevor grinned.
I left Ruzena’s office, feeling pleased that I had remembered how convolutional neural networks worked from CS 280, but wishing that it had been explicitly listed on the syllabus. I knew I would be constantly refreshing my email tomorrow.
As it turned out, it wasn’t until another week when I got an email from Pieter saying I completely and comfortably passed the AI prelim. Now I feel a little better.

I may not have the most fluid ASL due to lack of practice, but I can fingerspell blazingly fast and can understand blazingly fast fingerspelling. If you’re interested, we can talk about that. ↩

DSP later told me that it was because the agency that provides interpreters only gives them prep materials at the moment of assignment, which was set about a week before I sent them the document. ↩

Despite Stuart Russell not being on the prelim committee this time, we did get a logic question, but it was by far the easiest (assuming you studied it!) and shortest question on the exam. ↩

Yeah, I should have known that two prominent computer vision researchers would have asked me about neural networks. ↩
Miscellaneous Prelim Review (Part 2)
This will probably be my last prelim review post. The topics I’ll cover in this post are convex optimization, statistical learning theory (broadly), and logic/planning. Actually, I wanted to make some detailed notes about Kalman filtering, but I think I’ve done more than enough here, and there are too many equations involved to write that quickly.
Convex Optimization
This part is based on sections 9.1 through 9.5 of Boyd and Vandenberghe’s book, freely available online. Stephen Boyd also has a lecture video on YouTube that I watched, which I found to be very helpful. (I can also understand Professor Boyd’s speech well.) The book is fine, I suppose, but is really hard for me to read so I made embarrassingly slow progress as I learned this material.
The main purpose of sections 9.1 through 9.5 is to discuss iterative algorithms for minimization. Formally, we have the problem of minimizing a convex function and need to find the optimal . As in almost all cases, we have to remember that is not generally a scalar but a vector, but is realvalued, so . I had to keep reminding myself of this.
In most cases, we need to use iterative algorithms to find . The class of algorithms we use are known as descent algorithms because they generate points such that unless we are at the optimum. Actually, a little sidenote: there is exactly one optimal because we are actually assuming that is strongly convex, not just convex (and strong convexity is not the same as strict convexity!). By strong convexity, we assume that there is a constant such that , which means is positive semidefinite.
A lot of our future analysis will depend on a concept known as the condition number of a matrix or a set. For a matrix, the condition number is the ratio of the largest singular value to the smallest singular value. Alternatively, we can use eigenvalues if the matrix is positive semidefinite, which actually happens here since the secondorder characterization of convexity states that the Hessian of is positive semidefinite. The condition number of a set is defined as the ratio of the largest width to smallest width. High condition numbers result in highly skewed/stretched data.
Here’s the descent algorithm. We repeat the following until some stopping criterion:
 Compute a (descent) direction to change , denoted
 Compute a length or step size to go in that direction using some form of line search
 Compute
There are two main ways to choose : exact search (find ) or backtracking search. For some reason, it took me a really long time to understand backtracking line search, but after looking at that figure in Boyd’s book for ages, I understand what it does now. We have to keep decrementing until our function lies below a given upper bound. Backtracking line search is important because it’s more efficient and in practice, it often works just as well (or better!) than exact line search. To explain it, remember that figure with the three curves in it: one is and two are straight curves which follow from the FOC of at .
But the real difference in the various gradient algorithms comes when we pick . There are three options:
 Use . I mean, the gradient points in the direction of greatest increase of at (by definition) so why on earth would we not use the negative of that? This is gradient descent.
 Use the direction that maximizes the negative gradient in the direction determined by a prespecified norm. Precisely, our firstorder approximation of at is , and we want to find ; in other words, we want to make the directional derivative as negative as possible. We need to restrict because if not we could first pick a direction that makes it negative when multiplied by the gradient, and then make it arbitrarily large. Also notice that we are not specifying the exact norm. This is steepest descent, and equals gradient descent when is the norm.
 use , i.e., the negative of the inverse of the Hessian, multiplied by the gradient. Whew! This comes from the second order approximation of – just take the gradient with respect to , then solve. This is Newton’s Method.
Gradient descent is simple, and works perfectly (i.e., converges in one step) when the data are “isotropic,” that is to say, roughly “equal in all directions.” It’s bad when the condition number of the Hessian or the sublevel sets is high (e.g., in the 1000s). The classic example is the ellipsoid “bowl” where we have a 3D bowl that is much wider in one direction than the other. Gradient descent with exact line search will always “overshoot” the optimal location and keeps going back and forth, zigzagging to the center. The stopping criterion for gradient descent is if for some prespecified .
Steepest descent is a generalization of gradient descent in that we get the option of picking the norm that we want to use as a metric of our “gradient” here. A quick warning: there are actually two versions of . I tend to assume we are using the normalized version , where the we pick has norm bounded by one. There’s also the unnormalized version but I don’t understand how this actually works.
Steepest descent can work with the , , and quadratic norms. In the , it is equivalent to coordinate descent (modifying one coordinate of at a time), and the way to think about this is that we are taking the maximum component (in absolute value) of and setting our to be zero everywhere except for at that “largest component.” The derivation for in the quadratic norm is more complicated (for the unnormalized, it’s just ), but visualizing it is easier: we have a point , draw an ellipse around it (determined by the norm), and then pick the direction that results in the greatest decrease. More intuition: extend as far as possible in the direction of , while staying inside that unit ball. It’s also worth noting that we can transform coordinates from the quadratic norm’s matrix to get gradient descent. In fact, this gives a useful test for a norm: how well steepest descent performs will depend on how well the transformed points have “equal” isocontours suited for gradient descent.
Newton’s method is a step up from gradient descent in that we use a secondorder approximation of . The way I think of it is that gradient descent will produce a plane in 3D (e.g., for a 3D “bowl” that we’re trying to reach the minimum of) but Newton’s method will produce another bowl, though this bowl will usually be entirely above of the original one, save for the tangent point.
The book mentions three “perspectives” on Newton’s method:
 Minimization of the secondorder approximation of , which is how I see it.
 Steepest descent in the Hessian norm: it’s like the quadratic norm described earlier, but the Hessian is a really good “” matrix to use since its condition number approximates the condition number of the sublevel sets!
 Solution of linearized optimality condition. I did not understand this at first, but actually, think of Newton’s method for approximating roots of a function , where we need to subtract . In our case, we want to find the minimizer of , which means we want the roots of the derivative , which involves . That’s exactly what we have here!
More facts about Newton’s method:
 If the original function is already quadratic, Newton’s method converges in one step.
 It is independent of affine coordinate transformations. When we do iterates with versus , the relationship between the points will remain the same.
 It uses something called the Newton decrement to determine when to stop.
 There is a damped phase versus a pure phase. In the former, the difference in when we change decreases by a fixed quantity (this is good!). In the latter, the backtracking line search always picks and the number of accurate digits doubles. Thus, there is no need to run that second phase more than, say, four times.
 Newton’s method still works with badlyconditioned sublevel sets of .
 The downside of Newton’s method compared to gradient or steepest descent is that (1) we have to compute the Hessian, and (2) we have to store it – remember that the Hessian will be , whereas the gradient will only be .
The usual disclaimers apply in that we don’t really know various constants that get involved in the proofs, unfortunately.
Statistical Concepts and Logistic Regression
This part is closely related to what I wrote about linear regression and the least mean squares algorithm. I will be discussing logistic regression as well (for classification, not regression), but first we take a brief detour to discuss the third major class of problem known as density estimation.
The problem is, given data, to find the appropriate model for it. The relatively easy case is if we assume we already have an idea of the distribution (e.g., Gaussian) and we just need to find the parameters (here, the mean and variance). We find the parameters via maximum likelihood. So in the IID Gaussian case, of which the graphical model is represented as independent shaded circles in a graphical model, we take the sample mean and sample covariance as our MLEs. With the Bayesian approach, where we have a new node pointing to all samples, we put a Gaussian prior on mean so that the result is a weighted estimate (and the same for the variance, actually), because of conjugate priors. In the case of discrete data , we model these with multinomials. The resulting MLEs, which require Lagrange multipliers to solve (which gave me a huge headache at first), are just the sample proportions. For the Bayesian version, we use a Dirichlet prior. To extend the class of distributions we want to model, we can assume a mixture model, where , where the s are mixing proportions that sum to one. This time, we have a hidden node that points to its own observed data point .
There is an alternative strategy of estimation known as nonparametric density estimation. Here, we do not assume we have a fixed parameter and as our data grows, the nonparametric model will grow to represent a wider class of distributions. We have kernels, where each data point takes some probability mass, and we add them up and normalize. In the case of Gaussian kernels, the nonparametric case for a fixed number of samples really reduces to the mixture model case, but they differ as the number of instances grow.
Tip: use the nonparametric case if we do not have a good idea of the model and lots of data, but use the parametric version when we have little data and a good idea of its underlying distribution (it will converge faster). The line between the two methods does blur somewhat, for instance, when we have a mixture modeling problem where we have to dynamically estimate the number of components .
Finally, we can turn our attention towards the regression and classification problems. In both cases, we model , where the here indicates that we assume IID data. For linear regression, we assume , and have to find . The choice of is what really determines the distribution – here we assume Gaussians, so this is linear regression, and that means the MLE of is the OLS estimate. Another way of extending linear regression to be more flexible is to use (conditional) mixtures. Here, the graphical model looks like that of the density estimation mixture model, except we also need the node (which may or may not be connected to the mixture node ). And, of course, we could always treat these from a Bayesian perspective, perhaps by endowing that error term for Gaussians (in linear regression) with Gaussian priors for its mean and variance (well, probably variance only if we want the mean to be zero).
We can also use nonparametric regression, if we do not want to restrict our conditional mean functions. Actually, Russell and Norvig cover this a bit in their nonparametric methods section in the textbook; each predicted new is based on the weighted prediction of the other, “nearest” s.
In the classification case, the distinction between generative and discriminative cases is more apparent. I remember the way the arrows point in the model just by remembering the discriminative case, and then realizing that the generative is the opposite one. Use the generative case if we want a full probabilistic model, and use discriminative classification if we only care about the boundary point. The full model in the generative case also may help combat overfitting, so it is better with limited and partially observed data. Discriminative models have less bias because they make fewer assumptions, so they work better with lots of data (in fact, it’s a lot like how nearest neighbor will work best with lots of data).
These approaches are important to understand the logistic regression algorithm, where we assume that the posterior probability for a binary classification problem is logistic or arrives at that form. That we have the inner product there means the posterior “boundaries” of equal probability are hyperplanes. In the generative case, we estimate means and covariances, which define (and these are density estimation problems!) and the boundary implicitly, while in the discriminative case, we estimate “directly,” possibly choosing an arbitrarily complex boundary. In fact, “discriminative = logistic regression”, “generative = Naive Bayes”, and both are for classification. In fact, that’s why they are in the same chapter of Mike Jordan’s notes!
Again, logistic regression assumes we have the sigmoid function as the form for our posterior probability. We can assume this from the outset (discriminative) but we can also “inspire” this generatively. Here’s how: assume that we have two classes, and the class conditionals^{1} are Gaussian with, and this is important, the same covariance matrices. Then the posterior can be expressed as , i.e., the exponent has an affine function of , which means that the boundaries of equal probability are hyperplanes. In the special case of equal mixing proportions, we have equidistant boundaries. A skewed mixing proportion will shift the boundaries towards or away one of the classes.
In fact, the assumption of a Gaussian class conditional is not even necessary. We can get away with multinomials (this is another way of viewing the Naive Bayes classifier), or in fact, anything in the exponential family^{2}! When I was learning about these in my undergraduate Bayesian statistics course, I never really got why the exponential families were that important. But here is one reason, I suppose. Note that these are still assumptions that add bias to the generative case.
We can extend the previous analysis to the general classification case with outputs. In that case, we use the softmax: , which also results in linear boundaries, though that’s kind of stretching the definition; imagine a “piechart” where the “slices” represent boundaries. Also, if we wanted to find maximum likelihood estimation, we could do that, because we have and . Just combine those to get the joint and differentiate the log of it. For instance, in the two Gaussian case, the MLE for the means and are just the sample means of the elements in their respective classes (remember, we assume we know the training data labels), and the covariance is weighted among the two. In the general case, we again write the formula and then separate the terms appropriately. Note: we will use to represent a generic vector of weights. To be safe, whenever we write probabilities, add a conditioned .
Whew! Now we can talk about logistic regression, where the class dependency is fixed to be a sigmoid function. How do we find the best ? As usual, take logs, and maximize. This actually leads us to an LMSlike algorithm, and the only difference is the class expectation. For the batch version, we use iteratively reweighted least squares, which is basically Newton’s method for optimizing the (nearly) quadratic log likelihood function. In fact, there is a close connection between this method and the “normal” weighted least squares method, which started by assuming that each training input/output had an attached “weight” to it: this method can be written as
for what I thought was a pretty convoluted , but actually turns out to be a first order approximation of . Interesting … I don’t really understand the full details of this, but having the knowledge of convex optimization at the top of this post really helped me.
For extending discriminative learning to multiple classes, again assume that is represented by the softmax function, and a lot of our math follows for what is known as softmax regression.
Finally, thanks to Andrew Ng I have a bit of a better idea on the connection between the logistic regression update (in the LMSlike form) versus the perceptron: just change the sigmoid part in the update to be the “sign” function, and then the update turns into the perceptron.
More Statistical Learning Theory
Here’s a random assortment of notes from Mike Jordan’s book (which I think he has abandoned now).
First, let’s consider the multivariate Gaussian is one of the most important distributions to understand, and I did not have an easy time learning about it. Fortunately, by now I can write out the formula and reason about it quite easily. Unfortunately, I don’t know how to derive it from first principles. I can explain “roughly” what it does, e.g., that in the normalizing constant comes from how each component of the random vector contributes some amount of variance equal to its eigenvalue, and the determinant of a matrix is the product of its eigenvalues.
But anyway, there are a few important facts worth discussing about the multivariate Gaussian.

There is a moment parameterization and the canonical parameterization. The former is what I always use, but we can transform it into the latter with the rules and to get .

Given a matrix where we partition it into components and , the goal of block diagonalization is to find matrices and such that is diagonal in the corresponding locations of and . After a lot of algebra, we can arrive at the derivative of the partitioned matrix , and also derive a bunch of useful identities (the “matrix inversion lemma”) that I refuse to memorize.

The reason why we go through this tedious algebra is that it gives us identities we can use when partitioning the multivariate Gaussian to get formulas for marginal and conditional probabilities involving multivariate Gaussians. Specifically, we have split into and , and we want and , where I’m eliding the parameters for simplicity. We obviously have the joint , so we need to figure out how to split them cleverly. Once we’ve gone through the derivation, we will find that the moment parameterization lends to easy computations of marginals but hard ones for conditionals, and the reverse is true for the canonical parameterization. Importantly, these formulas preserve the fact that our variables are Gaussian.
In addition to knowing that the marginals and the conditionals are Gaussian, the sum of independent Gaussians is Gaussians.
We can extend the mixture model discussion from last section into the multivariate Gaussian setting, where the hidden variables indicate the particular multivariate Gaussian distribution of interest. Here, we have , and assuming IID points, we want to find the , , and parameters to maximize the log likelihood. This requires ExpectationMaximization, which involves computing the probability that a particular distribution generated point , which is of obvious interest for classification. (Admittedly this case works best in the binary setting where the conditional expectation is the same as the conditional probability of being one.) One can also think of KMeans as a simplified version of EM. We use EM rather than maximum likelihood because our “log” term has a sum inside it, which is due to the probabilities of the point being in multiple possible classes. In the previous section (on classification), we had the class so we effectively take only one term in that summation, in which MLE follows easily.
One thing I didn’t quite realize earlier was that in the EM for Gaussians, we can take the log likelihood, differentiate it with respect to (or or ) and we end up finding solutions that match the EM algorithm, which is interesting and implies that our “heuristic” update formulas may not be so bad because they indicate maxima of the log likelihood. Of course, one can also derive the update formulas “systematically” by appealing to the expected complete log likelihood, where we take expectations with respect to the hidden variables. (See my previous post for more information about this quantity.)
The Estep in general involves computing the expected complete log likelihood, and the Mstep in general involves maximizing the expected complete log likelihood with respect to . The full power of this terminology is not needed in the simple Gaussian example, but it is a useful exercise to ensure that we derive the same update formulas we developed “heuristically.” In general, the expected complete log likelihood does not suffer from the “coupling” of variables as the original log likelihood.
Finally, we consider the “mixture of experts” case, which is when we have a mixture model for the purposes of regression or classification. Mike Jordan’s notes appear to be missing some figures, so it’s a little hard to see what he’s trying to do, but I think the first figure represents a “V”shaped set of data, and we need to fit two different regressions on that. The key is figuring out where to split, which is our “EMlike” task. In the mixture of experts, the Mstep involves two different maximization steps.
Logic and Planning
I discussed this earlier and had a chance to reread all of that stuff. My main purpose in this section is to highlight how everything in this section connects with each other. I don’t want to just learn propositional logic, then first order logic, etc., I want to describe then in terms of each other, and to discuss all the similarities and differences among them (and the algorithms they inspire). But this won’t be long because Stuart Russell isn’t on the prelim committee this time (hint hint…).
But first, a laundry list of facts that really confused me the first time:
 Propositions consist of literals, which are just like the atomic elements of propositions, but they can have a “negation” symbol. That’s it: think of literals as either or .
 Predicates are really functions that output a True or a False. Predicates are – in my opinion – the backbone of first order logic.
 Be sure to realize that is the same thing as . This is probably the most important thing to remember to understand Horn and definite clauses, and why we can apply Modus Ponens to them.
Now we can talk about the connections. Here they are:

One can convert from first order logic to propositional logic by extending universal and existential quantifiers.

Forward and backward chaining play a role in both propositional and first order logic. They are algorithms for determining entailment when we assume that our knowledge base consists of Horn clauses (prop.) or first order definite clauses (FOL). This is a simplifying assumption, but it is often easy to convert databases to this format. The reason why Horn or definite clauses are needed is that their truth values are equivalent to (and we need “or”s not “and”s), and that exactly fits the description of the Modus Ponens and Generalized Modus Ponens rule format. Note: we use these when we do not want to use the full power of resolution.

As an alternative, say we do not have definite clauses and are just looking for a satisfying assignment to a disjunction of clauses. Then in both types of logics, we have the option of backtracking and local search. Both of these have their similarities in the Constraint Satisfaction Problem domain. In backtracking search, we have similar versions of “minimum remaining values” and “least constraining value” heuristics. In local search, that is when we are starting with a full, though not typically satisfactory, assignment to a problem in CNF form, and we pick clauses to shift, and this is the same as in CSPs when we start with a full assignment and use the minimum conflicts heuristics to adjust values.

The PDDL language (Chapter 10) is about a simplified language that uses firstorder logic “materials” (e.g., predicates, quantifiers, etc.) to encode a search problem (remember Chapter 3!)^{3}. Since we’re encoding a search problem, we need to define the actions we can take, and those must have preconditions and effects, which involve adding or removing some fluents. The fluent, by the way, is the atomic set whose values represent a state. Again, the really important thing to know about Chapter 10 is that it is really another case of the general search problems. One can also make plans using a logical agent.

Knowledge representation (Chapter 12) is all about encoding “realworld” stuff in first order logic. Our strategy to represent these is formally called ontological engineering. They discuss categorizing objects, categories (make them into objects!), and events.

Let’s go over the different kinds of algorithms:
 Backtracking search: when we incrementally look for assignments to stuff, and then “backtrack” when we have seen some “problems”, e.g., impossible situations (and this can be used for entailment as well!). There are heuristics for this. We do this in CSPs and searching for satisfying assignments in propositional logic. We can also transform a classical planning case to a propositional case and turn it over to the backtracking solver, but this is not practical.
 Local search: we start with a complete assignment, and move variables around until we get to a solution. We do this with CSPs, propositional logic.
 Forward chaining and backward chaining are algorithms for deciding entailment in the two logics. We do not use these in CSPs or classical planning. The FOL case is more complicated due to the need to perform unification (among other factors), but we have general heuristics for improving them.
 In PDDLs, we do forward searching and backward searching to search for a satisfying sequence of actions. The forward searching part is similar to the backtracking search in that we can search for actions with heuristics and backtrack if needed. Backward search can avoid irrelevant states, though.
Miscellaneous Prelim Review (Part 1)
Here is a random assortment of notes I created to wrap up some of the remaining material I need to know. It’s “part 1” because I have another part coming up later.
Information Theory
This part, Chapter 2 from Cover and Thomas, is a bunch of definitions and straightforward theorems (i.e., those that follow directly from definitions):

Entropy: , where is a realization of variable . It’s the amount of uncertainty inherent in a random variable. For a fixed variable with some probability distribution that we can create, the entropy is highest if we make the distribution relatively uniform, and lowest if we make it “peaked.” In the extreme case, if we set , then .

Mutual Information: . It’s the decrease in entropy (upon obtaining the value of ), for variable . Note that since the second term will be zero. Alternatively, we represent it as
Note that , so it does commute.

Relative Entropy (KLdivergence): . This is a nonsymmetric measure of the difference between distributions and . We can also interpret it as the number of additional bits we will need to represent if we are using the (inferior) approximation of . It is infinity if there exists an such that but .
All three of the above quantities are nonnegative.
Another concept that plays a huge role in information theory is the following:
 Jensen’s inequality: for a convex function , . For a concave function, like the logarithm, we flip the sign (actually for logs, we can drop the equality case). I find it easiest to remember this rule by expanding out the equations for binary random variables. Let’s say they taken on values 0 and 1 with probability a half each. Then we have and and can directly relate this to the definition of convexity.
Based on the previous discussion, we can define and infer things like:

Joint entropy .

Conditional entropy, conditional KL divergence, conditional mutual information. For the sake of simplicity, I will not write all the rules here, but here is the one for conditional entropy: . Note that .

The chain rule for entropy, relative entropy, and mutual information. Unlike normal probability, these sum the components rather than multiply, which makes sense because all three cases involve logarithms. Again, I won’t write all the rules here, but will note that entropy is the easiest to relate to probability because we literally copy formulas from probability, but use sums instead of products. For the chain rule with mutual information, just pretend we don’t have and follow the probability convention (but sum up). Then stick the ’s after the semicolon, but before the conditioning bar. For KLdivergence, it’s the same (split up the joint into a marginal and product, but do this for both distributions, then use two “D” terms.

A theorem: , where represents the range of variable , and equality here holds if and only if is a uniform random variable.

Also one thing that tricked me up the first time I saw it was this consequence of Jensen’s inequality:
where is the domain for . I am assuming this really means
A final thought on this section: an alternative interpretation of entropy is that it is a lower bound on the average number of bits required to represent the random variable. It’s not “the minimum number of bits” because random variables take on different values with different probabilities, so we may wish to allocate more bits for the low probability events. And we also need to make it clear how we encode, so that we can compare different encodings. Example 1.1.2 from Cover and Thomas will clarify: here, we have eight horses, and they each win with some specified probability. If we wanted to encode the random variable indicating the horse that won, we could use three bits in the standard way. But this is suboptimal if the distribution is , like it is in the example, because we should allocate fewer digits to the higher probability horses, and more towards the ones that are less likely to win. It’s possible to encode so that the average number of bits to represent it is two, which exactly matches the entropy.
Decision Trees
Decision trees are one of the simplest nontrivial classifiers^{1} that have strong performance in practical tasks. The hypothesis space is the set of trees. For each dimensional sample , we classify it by propagating down the tree. At each node, we test an attribute , and depending on that value, send the sample left or right. Once we send it down the tree far enough, it will land in some “classifier” node that labels the class of that element.
Great, so how do we train such trees from labeled data ? For that, we invoke some information theory criteria: we want to select the attribute to test that will result in the most amount of purity in the resulting trees, where purity is defined based on entropy. Formally, at each point of the tree, we have a set of data and a candidate set of attributes. We pick the attribute that maximizes the information gain of the data.
Let’s precisely define this for boolean decision trees. At a decision tree’s node, we have positive and negative samples. The entropy of the random variable describing the output is the entropy of a binary random variable with probability ; to simplify the subsequent notation, denote this as . We define the gain of attribute that splits the data into subsets as follows:
For each subset, we weigh its entropy probabilistically. Otherwise, you could think of a useless attribute that keeps the same proportion of positive and negative examples in each subset. Without the probability weighting, splitting on would increase the entropy of the goal test on the data.
There are other impurity measures, such as the Gini impurity measure if the output takes on realizations. This is not the same as the measure of income inequality! I think the “CART” category of decision trees uses the Gini measure, whereas the “C4.5” and “C5” trees use entropy to measure impurity, though I think the boundaries between those categories is a bit blurry. Misclassification error is not an appropriate measure because it – unlike Gini impurity and entropy – does not give higher weight to branches with purer solutions.
Here are some things to think about:

Trees can overfit, so what happens in most realistic algorithms is that we build the large tree first, then prune away nodes with only leaf descendants that do not contribute much to the information gain (e.g., using tests of statistical significance to see if the gain is significant enough). This is not the same as building a tree and deciding to prune away early. The classic example is the XOR data. If we have a lot of XOR data that we want to split, we will find that the information gain of both attributes is zero. We do not want to prune away early because the next step will involve splitting on the second part of the XOR, which splits the data perfectly.

In practice, information gain might not be a good value of the amount of information in an attribute, because there might be an attribute that maps each element to a unique value.

We may have missing data. A simple but bad strategy is to ignore all training data points that have missing data. An alternative is to “fill in” those values probabilistically based on the distribution of values of those variables in the other samples considered for a particular decision tree.

We may want to use decision trees for regression if the output is continuousvalued. One option is to use a decision tree normally up to a certain depth, and then after that, we fit (linear) regression on only those data points that manage to reach that particular leaf node, and only the subset of variable attributes yet to be tested.
How would we train such a regression tree? HTF suggest a greedy algorithm (which also assumes continuous attributes, by the way): at each node, find the attribute and a split point that minimizes the sum of squared errors of the two resulting regions. HTF also assume that once we get to a region, we will approximate the samples with just one value, rather than doing a fullblown regression on it, which makes the problem a lot easier since the sum of squared errors criterion means we pick the mean of the elements considered at that node. To avoid overfitting, they suggest weakest link pruning. We iteratively pick the internal (nonleaf) node that, upon its removal (and subsequent collapsing of the tree) results in the smallest increase in the sum of squared errors criterion. This is pretty cool, and it’ similar to what Russell and Norvig describe.
Finally, here’s a rather interesting connection between boolean decision trees and propositional logic that I failed to realize at first: we can label various paths throughout the tree as , and so the goal is expressed as:
Thus, any function in propositional logic can be expressed as a decision tree!
Nearest Neighbor
This classifier is easy to describe: for each test point, we look at the nearest points according to Euclidean (or other) distance matrices and classify the test point as the majority class among those . This is a problem in high dimensions, since the notion of “distance” as a measure of similarity becomes less reliable due to a combination of (1) noisy and irrelevant features, and (2) the rather intriguing fact that the higher we go in dimensions, the more likely it is our points are farther away from each other. As we increase the dimension of the unit hypercube with our fixed nearest neighbor classifier, we will need to traverse an extra amount for each dimension to reach the nearest neighbors.
Let’s now restrict our focus to the 1nearestneighbor case. On the surface, this might seem to be unreliable, since we’re only using one closest point and it might overfit the data (see examples of plots showing 1nearest neighbor versus 5nearest neighbor). But a famous result called the CoverHart Theorem provides a different story, saying that the asymptotic error rate of the 1nearest neighbor classifier is never more than twice the error of the Bayes’ classifier (according to HTE), where the Bayes’ classifier assigns . While it sounds nice, it assumes that new points have to exactly coincide with a point in the training data, which is true in the limit, but not true in general.
Here’s another interesting fact about nearest neighbors that I found surprising. Researchers used nearest neighbors to achieve the best performance (at that time) on the MNIST handwritten digit recognition problem. The digits themselves are points in space. A classifier would have to work in high dimensional space and be invariant to rotations, scaling, etc. They way they did this was by defining manifolds in space. For instance, there is a onedimensional curve where points on that curve represent different rotated versions of the “3” digit^{2}. Then there can be another curve representing a different three. One idea is to take the Euclidean distance of the two closest points and which lie on separate curves. Unfortunately, this may result in heavily rotated images being equivalent (the classic disaster: confusing a “6” with a “9”), so the ingenious solution is to use tangent lines. That’s the intuition: in reality, the “onedimensional curves” would be manifolds taking into account additional invariance factors.
Having motivated nearest neighbors, let’s discuss some of its drawbacks. One problem is that it needs to store all the training instances, and for each new test point to classify, it needs to iterate through all of those to find the closest neighbors. If time is unacceptable, then we can speed up the process of finding nearest neighbors with the following two strategies:

We can use d trees, or more accurately, d trees if our data is dimensional, so that we don’t confuse this with the in the nearest neighbors. At each node, this tree will pick a dimension and split the examples according to their median point so that all such that will go the left subtree, and the rest go to the right subtree. The dimensions are typically chosen based on the widest spread of values.
Thus, if we are doing 1nearest neighbor, for a given new point , we find its nearest neighbor by querying the d tree to see where it would be located (i.e., it’s like we are inserting it in the tree). We proceed until we hit a leaf, and declare that as the best node found given the current information. But we have to be careful. The nearest neighbor of might not be in the same hyperplane after a split! We need to “backtrack” and then measure the distance between and the hyperplanes at each step, to see if there are nearest neighbor candidates on the other side. Check the Wikipedia page for more details. They have a nice description and an animation.
The downside with d trees is that with many dimensions, we will need to keep track of numerous subtrees that could potentially have “that nearest neighbor,” and we would iterate through the entire tree. To extend this algorithm for multiple neighbors, we use a list of nearest neighbor points.

We can use locality sensitive hashing, which hashes “similar” values in high dimensional spaces to the same hash buckets. Then, using only the elements remaining in that bucket, we can perform exact nearest neighbor via brute force comparisons. Since hash functions are hard to create, we can try hash functions independently to get buckets, then take the union of all those elements to arrive at the set which we will use for exact comparisons. Russell and Norvig seem to suggest that each of those hash functions be a projection down to a line, and the buckets would be a line segment. I guess that makes sense.
The downside with locality sensitive hashing is that it, unlike the use of d trees, is an approximate nearest neighbors search.
Nearest neighbors has an interesting tradeoff with perceptrons. Kernelized perceptrons learn similar to the way nearest neighbors learn, especially with Gaussian kernels that weigh a probability distribution about each point. In other words, distanceweighted nearest neighbors are kernelized perceptrons. Nearest neighbors, unlike plain (nonkernelized) perceptrons, can use fancy similarity functions, as exemplified by the handwritten digit recognition example.
HTE also emphasize the connection between nearest neighbors and least squares.
(Artificial) Neural Networks
Neural networks are a natural extension of the perceptron that I’ve written about in detail before, since perceptrons form the basic building blocks for each node. Like the perceptron (and regression, for that matter), we develop a classifier by updating weights. For neural networks, we use backpropagation to update the weights. At a high level, this means for each training instance, we “feed” it to the network so that it classifies it. Then, we propagate the “error” backwards through the network to update weights. The weight update for those connecting to the output layer is the same for that of logistic regression^{3}, assuming we’re using the sigmoid nonlinear function. The real challenge comes when we compute weight updates for those connecting input or hidden layers to other hidden layers. But in fact, the gradient for the loss at inner nodes is the same as the computation for the gradient at the output layer, except we apply the chain rule multiple times. I’m not going to write the derivation here since it would take too much time; I just did it by hand.
Neural networks have an intriguing fact: provided that there are sufficiently many nodes and layers, they can represent any continuous function (of the input) with arbitrarily high accuracy. It needs multiple layers with nonlinear activation functions at each node. Otherwise, if a NN just has an input layer directly connected to an output layer, it fails to learn even a simple XOR function.
There are many extensions to NNs. We could use recurrent NNs, convolutional NNs (popular for computer vision now), etc. We can use thresholding functions other than sigmoids, such as ReLUs, which avoid the “vanishing gradient” problem of sigmoids. Note that we focus on the problem of learning from a fixed structure, i.e., like parameter estimation for a graphical model with the nodes and edges fixed. Learning the structure is much more complicated.
Principal Components Analysis
Principal Components Analysis (PCA) is a way of mapping high dimensional data into a reduced dimensional space, where the reduction is a “best approximation” of the original data. Formally, if but we really think they lie in where , then there is probably a process such that for but is some noise added.
Obviously, there are many advantages to dimensionality reduction, so the question is how we do this in a sound way. PCA will do this by iteratively “mapping” points to a line characterized by the vector which preserves as much variance in the data, not including vectors already chosen^{4}. In other words, we’d like to project the data onto a subspace so that the variance is maximized. To do this, PCA uses an eigendecomposition, which can make it expensive, but it does not need ExpectationMaximization. (The dimensionality reduction technique that uses EM is called “factor analysis.”)
It’s easiest to derive PCA in the two dimensional case with data points where the data have zero mean and each coordinate has unit variance. In the first step, we solve for the (unit) direction vector :
where is the matrix where each row is a training instance . Note that , and also, if the data are centered, then it is the sample covariance matrix.
In fact, this is a standard optimization problem, where we have a quadratic form that we are maximizing w.r.t. subject to the fact that . It is a well known fact that this problem is solved by finding the that corresponds to the eigenvector of that has the largest eigenvalue. After all, is a symmetric matrix of reals, so its eigenvectors can be chosen to be of unit norm and orthogonal to each other.
Given , the best vector so far, we know that is our “process”, where is a scalar. Since it follows that to project all the points down to the one dimensional space characterized by , we do .
But normally we need more dimensions than that. How do we find the “best” set of vectors for that? We take those eigenvectors that had the largest eigenvalues. These form the principal components of the data, and are mutually uncorrelated. (I’m not actually sure why this works – intuitively it does, but I don’t have a proof.) And when we need to project our data, we remember our “process” and add the new eigenvectors as columns of a matrix so that , where , and is the number of columns of . Again, is orthogonal so ignoring the noise (which is deliberate, since it’s noise!) our projection is for all points.
We can find those eigenvectors by diagonalization or SVD of . SVD would work since that’s a real, symmetric matrix, so the eigenvalues will be the same as the singular values, and we can thus rank them easily.
There is an alternative way we can derive PCA, using the “process” I explained earlier. We can define and use that as our approximation of . Thus, our objective would be to find
where I just put the to represent the lower dimensional approximation data. To find , we can again resort to SVD: , where is again the matrix with rows as training instances. Then the columns of form the vectors of the principal components. (Sorry for the and confusing; Ng and HTE use different formulations.) Technically, we only take the first columns from if we want a set of vectors for the projections, which I find is neat (if we want more, just add more columns!). Since , then consists of the projected points of , one for each row (and will usually have fewer columns than the full number of components of the s). There’s a lot of matrix stuff going on here; draw this on a piece of paper to understand better.
HTE present an example of PCA using the Procrustes Transformation, but I don’t really understand how PCA relates to it. I guess because both involve rotations and scaling of the data?

In fact, the ability to describe the classifier to lawyers means that companies can use these classifiers to “discriminate” without concern. What companies would have to do is explain the classifier and their rationale (e.g., if a person is in X category, we have to do Y due to previous data, etc.). ↩

Admittedly, I am skeptical of how they can claim that a onedimensional curve represents various rotated aspects of a digit, but if you buy that argument, then everything else follows from that. ↩

Recall how we do a stochastic gradient update of a single weight in logistic regression. For a given training instance, , where is dimensional and is a scalar, we do
assuming we’re using the loss function. Then we eventually get
which uses the fact that the derivative of the logistic function is itself multiplied by the quantity “one minus itself.” ↩

The easiest way to understand this is to look for figures that plot data along with vectors that indicate the PCA dimensions. Typically there will be two vectors chosen, incidating two “best directions” that capture the data. ↩
Perceptrons, SVMs, and Kernel Methods
In this post, we’ll discuss the perceptron and the support vector machine (SVM) classifiers, which are both errordriven methods that make direct use of training data to adjust the classification boundary. They do not “build a model,” which is what a BayesNetbased algorithm such as Naive Bayes would do, which means we can make fewer assumptions about the data.
We’ll also talk about kernels, which allow us to efficiently compute dot products of highdimensional feature vectors without actually computing those feature vectors.
The Perceptron
The perceptron learning algorithm relies on classification via the sign of the dot product. Given a binary classification problem of vectors in , the perceptron algorithm computes one parameter vector . Given an arbitrary sample with features^{1} , we classify this as +1 if and 1 if otherwise. Assuming we’re doing supervised data, we will know the true label . If , then we don’t do anything. Otherwise, we must adjust the weight vector . This will change the direction of the vector, thus shifting the classification boundary. It’s easiest to understand how this works by realizing that represents the decision boundary, which is orthogonal to by definition of the dot product and divides up the feature vector space into “halves,” where one has dot products with positive, and the other negative.
In the general case, there will be multiple classes, so we will have multiple weight vectors for a way classification problem. In that case, whenever we have a training instance , we assign the class based on . If was actually in class , we are done; otherwise it should have been in class so we need to adjust two weight vectors with and . We add to the appropriate class, and subtract from the wrong class.
What are the problems with the perceptron as we just described? Well, if the data isn’t linearly separable, the algorithm will “thrash” around and never converge^{2}. Two other (related) problems: it can overfit the data, or not find a suitable boundary. For the latter case, think of a linearly separable data, but with one outlier that causes the linear boundary to drastically shift. It may be wise to allow one “error” in order to get a that generalizes better.
There is a modification of the perceptron known as the MarginInfused Relaxed Algorithm (MIRA), which updates in the same direction as the perceptron, but at the minimum magnitude necessary (technically, we add one to leave some slack, but whatever) to force the classifier to classify the current sample correctly (if it was not already correct). This means that the update could be smaller or greater than the perceptron update, but unlike the perceptron, MIRA will always classify an example correctly after seeing it. In practice, we cap the amount that a single training example can change the weight vector, so the scale factor is at most a prespecified .
As an alternative to the multiway classification perceptron, one can use the perceptron for ranking (e.g., website ranking), which has only one weight vector. It’s useful if we want to consider data points and classes together in a single vector . The decision rule is
and the update rule is
where for a data point , was the predicted class but was the actual class. Now the weights are interpreted as the importance of each feature component to each class.
The Kernelized Perceptron
We can create more complicated classification boundaries with perceptrons by using kernelization^{3}. Suppose starts off as the zero vector. Then we notice in the general way classification problem that we only add or subtract vectors to . In other words, with samples in the training data, where all the variables are integers. This means learning all the alphas would be enough to reconstruct the weight vectors.
How do we make a classification decision? For a given training instance (or even an entirely new sample) , we would assign it the class based on whatever (for weight vector ) that maximizes the following: . We can reexpress the dot product: , where we have introduced a kernel function . Kernels allow us to “map” vectors and into a higher dimensional space, where we would then “take the dot product,” without actually transforming the features into the higher dimensional space.
Here’s an example: if we let , then we have mapped and into a higher dimensional space that includes squared components of and , resulting in linearly separable boundaries in that space even if the original feature space was not, e.g., the positive examples formed a circle and were surrounded by the negative examples. As a general rule, the more features we have, the more likely we have linearly separable data, unless two of the exact same ’s have different classes, for whatever reason. Of course, we will need more examples to learn correctly (growth is roughly quadratic in the number of features), and when doing classification, we will need to compute all the values. It will be further slower if most of the alpha counts are nonzero.
There are two popular classes of kernels:

The polynomial kernel has the form for degree . For vectors of dimension , this kernel will map them to an dimensional space! Expanding the kernel out for the simple case of , we get
This is the equivalent of a dot product of features that contain elements , , and (not – watch out!).

The Gaussian kernel, also known as the radialbasis function (RBF) kernel maps elements into an infinitedimensional feature space. It is . Probably more than any other kernel, classifying with this one is a lot like nearest neighbor because it clearly measures a similarity function, weighing “closer” examples more in our classification decisions. As , the kernel becomes a lookup table, and our training accuracy for a perceptron trained with this is 100 percent (except in the weird case of two exact same points getting different labels) but our validation and test set accuracy will be horrible.
To test my understanding of kernels in more detail, I looked at (as usual) an old CS 188 handout. It had the following image:
(In the first plot, the dotted line is .)
Let’s consider a linear, a shifted linear, a quadratic, and cubic kernels (see the handout for details on these), and see if any of them can linearly separate the data in the two plots.
Plot (a) requires a thirdorder polynomial to separate the data, so only the cubic kernel will work, because that will map feature vectors to have in it. Then we’d just adjust the weights to set that to have nonzero weight.
In plot (b), a linear kernel is enough, but there has to be a bias term in there! (That actually tricked me.) Without a bias, in the 2D case here, the decision rule is , and a 2D vector must “emanate” out of the origin, which means the perpendicular line to it crosses the origin.
Kernels, Formalized
The preceding discussion motivates the following question: how do we know if a function is a valid kernel? First, the official definition of a kernel is that it is a function that performs an inner product in a Hilbert Space. Normally, I prefer thinking of the inner product as the normal dot product (as I wrote earlier) but more generally, we should use the terminology of the inner product. Those satisfy properties of symmetry, bilinearity, and positive definiteness. A Hilbert Space is an inner product space that is complete and separable with respect to the norm defined by the inner product.
Since we are only dealing with realvalued vectors, our Hilbert Space will be and the inner product here is the normal vector dot product. To test whether a function is a kernel, we invoke a simplified form of Mercer’s Theorem: let function be given. Then for to be a valid Mercer kernel, it is necessary and sufficient that for any set of points the corresponding kernel matrix is (symmetric) positive semidefinite. The element of the kernel matrix is the value . (Sometimes the kernel matrix is called the Gram matrix.)
To prove one direction, that if is a kernel matrix corresponding to a feature mapping , it must be symmetric positive semidefinite, we proceed as follows. First, it’s going to be symmetric due to the dot product (or more accurately, inner product) being commutative. Next, for any , we have
where we used the fact we indirectly showed earlier that . It’s a little tricky because we are keeping , a component of , fixed, and ranging across different vectors.
This fact about positive semidefiniteness makes it easy to see that the following are also valid kernels:
 Addition:
 Multiplication:
 Scalar: for a constant
We can use kernels for perceptrons (as previously discussed), support vector machines (as we will discuss), principal components analysis, and other classifiers such as linear regression.
Let’s discuss the linear regression case. In the general (regularized) case, the objective is:
where is the matrix of training instances, where each training instance is a row (this is different from what I usually think of it, but it makes more sense in regression). The vector has the true labels. By taking derivatives, we see that the optimal is
In the last step above, I used the clever trick I learned from CS 281A that for and a matrix that is , we have . But wait, what does this mean? We can express as
with an appropriately defined since the columns of (not , be careful!) are the actual training elements, so is in the space spanned by them and thus we can write it as a linear combination.
When we are faced with a new training instance to do regression, , we will perform the following:
We have kernels again! This is more accurately known as kernelized linear regression. In fact, we even use kernels before we test on new examples (i.e., we use it during training). Why? The matrix is itself a kernel matrix! It consists of dot products between the training instances, and since we optimize over that during training, we will use kernels during training.
I may end up reading more of Tom Mitchell’s slides on this, because this was quite illuminating to me.
Support Vector Machines
We now switch gears to Support Vector Machines (SVMs), which are possibly the best “offtheshelf” classifier because they combine the kernel trick along with the concept of a maximum margin separator. Thus, we know immediately that – like in the perceptron – we must find some way to express the optimization problem in terms of dot products.
To begin the derivation, we define the functional margin of a weight vector (note: we keep the intercept term separate now) with respect to training instance to be , where the class label , and across the entire dataset, is just the minimum of all the functional margins. Ideally, we would like the functional margin to be relatively large, as that would indicate a strong, “robust” boundary between the two classes.
We can formulate SVMs with the following “optimization” problem:
such that
along with the restriction that , which prevents the functional margin from changing due to invariance of the size of (though might vary, but I don’t think it’s a problem).
Unfortunately, this is not really possible with “optimization” easily, so we transform the problem into an equivalent one as follows:
such that for all training instances , we have the constraint . This scales so that the functional margin must be one.
We are done, but it is better to face the problem from the dual perspective so that we can take advantage of kernels. Since the dual solution is less than or equal to the primal solution , it follows that we can equivalently solve the problem by maximizing the dual^{4}. We rewrite the constraints as $$g_i(w) = 1
 y^{(i)}(w^Tx^{(i)}+b) \le 0$$ and construct the Lagrangian as:
Setting the derivative of with respect to and , then after some algebra (which took me a while due to lots of indices messing me up, but I eventually got it), and then knowing that we need to maximize this, we pose the dual optimization problem:
such that for all , and . Fortunately, this is convex, so there is a single global minimum.
Notice that we have variables again, though these have a different interpretation than the ones in the kernelized perceptron, though. Watch out! And yes, we do get kernels to appear once again. Nice!
Now let’s see what happens when we have trained and are going to assign a class to a new instance . We perform the computation, which can equivalently be expressed as
Once again, we have kernels! Incidentally, it looks like we might have to do a lot of computation for classifying a single point, but in fact, most of the s will be zero. The few that are nonzero correspond to the training instances called support vectors, and they are the ones closest to the margin. This is formally called the KarushKuhnTucker dual complementarity condition. The fact that we may not need to do much computation means SVMs gain some of the advantages of parametric models.
In the above problem, we – just like in the kernelized perceptron and kernelized regression – have formulated the problem so that, both during training and classification of new examples, the data enter via inner products, allowing us to use kernels.
What happens when we do not have linearly separable data? Rather than come up with a more complicated or longer feature vector (which might risk overfitting), we can reformulate the problem using slack variables (for regularization) and an additional, controllable parameter :
such that for all , we have and . Thus, samples are permitted to have a (functional) margin less than one.
Rather surprisingly, the only change to the dual is that constraints turns into constraints, so we can apply the same principles (roughly speaking) as we did in the linearly separable case. In addition, the way we find changes, but generally we don’t really worry about the intercept too much when going through the derivation. It’s really that matters most to me.

This is important. When we call things , we usually refer to the raw data, but what the classifier needs are a set of features for each sample. But some people elide this notation by treating directly as features, so be careful. ↩

Even when the data is linearly separable, the perceptron is only guaranteed to converge in the binary classification case. Here’s a key theorem: suppose the (binary) data are separable with margin and the maximum norm of a training sample is . Then the perceptron converges with at most updates. ↩

Here’s some intuition: we’re trying to combine the best of nearest neighbor approaches with perceptron approaches by using the former’s ability to use fancy “similarity” functions along with the latter’s ability to explicitly learn from data. ↩

Technically, we need the KarushKuhnTucker conditions to hold for there to be possibly equality. ↩
Markov Decision Processes and Reinforcement Learning
In this post, we’ll review Markov Decision Processes and Reinforcement Learning. This material is from Chapters 17 and 21 in Russell and Norvig (2010).
Markov Decision Processes
The general idea with this situation is that we are an agent in some world, and we have a set of states, actions (for each state), a transition probability (which we may not actually know), and a reward function. The goal for a rational agent is to maximize the expected sum of (discounted) rewards for a state , where is our discount factor, which we usually assume is equal to one for toy cases only.
This is the setup of a Markov Decision Process (MDP), with the added constraint that the probability of going to another state only depends on the current state, which is why we call this a Markov Decision Process. Since the ultimate goal is to maximize the expected utility, we will need to learn a policy function that maps states to actions. Finally, while not strictly necessary, it is common to “spice up” the MDP problem by assuming that actions are nondeterministic. In the canonical gridworld example described in the book (and in a lot of undergraduate AI classes, for that matter), we assume if we move North, then that has an 80 percent chance of succeeding. Hence, the transition probability is nontrivial, where and are states, and is the action we took at state .
In order to understand how to get an optimal policy, we first need to realize how exactly to define the value of a state, which in other words, is the expected utility we will get if we are at state and play optimally. (The asterisk here is to indicate optimality.) There are, sadly, two common formulations for . The first, from R & N, assumes that the reward aspect of the formula is defined in terms of a state only:
The second formulation, which Berkeley’s CS 188 class uses^{1}, defines reward in terms of a stateactionsuccessor triple:
Both of these equation sets can be called the Bellman Equations, which characterize the optimal values (but we will generally need some other way of computing them, as we show shortly). In general, I will utilize the second formulation, but the formulations are not fundamentally different.
It is also common to define a new quantity called a Qvalue with respect to stateaction pairs:
In words, is the expected utility starting at state , taking action , and hereafter, playing optimally. We can therefore relate the s and s with the following equation:
I find it easiest to think of these in terms of expectimax trees with chance nodes. The “normal” nodes correspond to s, and the “chance” nodes correspond to s. Both nodes have multiple successors: the s because we have choices of actions, and s because, even if we commit to an action, the actual outcome is generally nondeterministic, so we will not know what state we end up in. (I guess we can also view states as the “normal” nodes here.)
Given these definitions, how do we figure out the optimal policy? There are two common tactics:

Value Iteration is an iterative algorithm that computes values of states indexed by a , as in , which can also be thought of the best value of state assuming the game ends in timesteps. This is not the actual policy itself, but these values are used to determine the optimal policy^{2}.
Starting with for all , value iteration performs the following update:
and it repeats until we tell it to stop.
To make this intuitive, during each step, we have a vector of values, and then we do a oneply expectimax computation to get the next vector. Note that expectimax could compute all of the values we need, but it would take too long due to repeated and infinite depth state trees.
To prove value iteration converges to (and uniquely!), appeal to contraction and the fact that , so long as , will go down to zero. Said another way, the “ tree” and the “ tree” are different only in their last layer, but that last layer’s contribution is reduced exponentially.

Policy Iteration is generally an improvement over Value Iteration because policies often converge long before the values do, so we alternate between policy evaluation and policy improvement steps. In the first step, we assume we are given a policy and have to figure out the expected utilities of each state when executing . These values are characterized by the following equations:
Note that the MAX operator is gone because gives us our action. Thus, we can solve these equations in time with standard matrix multiplication methods^{3}.
It is actually rather easy to do policy improvement from Qvalues:
Alternatively, we can use the more complicated expectimax form:
The two steps would iterate until convergence, and like value iteration, policy iteration is optimal. Also, these equations are really policy extraction, in the sense that given values, we know how to get a policy.
In general, policy iteration seems to be a slightly better bet than value iteration, though I guess the latter could be simpler to implement in some cases since it only involves the value updating part?
If we do not know exactly what state we are in, we do not have a notion of here, so we need to change our representation of the problem. This is the Partially Observable Markov Decision Process (POMDP) case. We augment the MDP with a sensor model and treat states as belief states. In a discrete MDP with states, the belief state vector would be an dimensional vector with components representing the probabilities of being in a particular state. The belief state update for a particular component (state) looks a lot like an HMM update:
The cycle is: we compute an action from , see evidence , compute a new belief state (using the above formula), then repeat. Fortunately, the optimal action for a POMDP only depends on the current belief state, allowing us to have our policy be a mapping from a belief state to an action.
We can also map POMDPs to an observable MDP formula, by appropriately defining transition and reward functions. For instance, . The optimal policy for this new MDP over the belief states is the same optimal policy over the original POMDP. This is good, since we’ve reduced this problem to an MDP, but we have to modify our earlier algorithms to handle continuousvalued state spaces. A key observation that will help: the value function over belief states is piecewise linear and convex, because it is the maximum of a collection of hyperplanes. What a modified value iteration algorithm needs to do is to accumulate a set of possible plans, and for a given belief state, compute which of those has the optimal hyperplane. But this is too slow so in practice, people use online algorithms based on onestep lookahead, inspired by Dynamic Bayesian Networks.
To understand POMDPs well, consider simple MDPs with two states, so we can represent as a scalar.
Reinforcement Learning
Now we switch to the reinforcement learning case. Before diving into the details, it is easiest to think of us in the same MDP framework, except that we have to learn our policy online, not offline. This is harder because we typically assume we do not know the transition probabilities ahead of time, or even the rewards. Unfortunately, this means we will try out different paths, and will sometimes fail (e.g., falling into the 1 pit in the gridworld example).
First, consider the passive reinforcement case, where we are given a fixed (possibly garbage) policy and the only goal is to learn the values at each state, according to the Bellman equations.
Here’s one way to get the values: learn the transitions and the rewards through trials, then immediately apply the policy evaluation step in policy iteration. As a reminder, this computes values (as in value iteration) but since the policy is fixed, we can solve the set of equations quickly. This technique, which Russell and Norvig seem to call adaptive dynamic programming, is part of the class of modelbased learning techniques, because that assumes we have to compute a model of the world (i.e., the transitions and rewards). Here, our agent could play a series of actions from a variety of states (“restarting” when necessary). At the end, it will have data about the rewards it’s gotten, along with the number of times it transitioned from state to state . To compute the transition function, we would convert the counts to the averages in a “Maximum Likelihood Estimate”manner. The rewards are easier because we can see immediately when we transition from to .
In modelfree learning, we do not bother to compute the transition model, but instead only compute the actual values of each state. There is a basic technique called direct utility estimation that falls in this category of methods. The idea here is that for each trial, we need to record the sum of discounted rewards for each state visited. After a series of trials, these values will average out and converge, albeit slowly since direct utility estimation does not take into account the Bellman Equations.
We need a better solution. Since we already have a policy, we would like to do some form of policy evaluation (explained earlier), but we do not have the transitions and rewards. We can approximate those with our samples instead, which leads to Temporal Difference learning. This is a modelfree tactic where, after each transition , we update the values:
In short, TD learning works by adjusting the utility estimates towards the ideal equilibrium as stated by the Bellman equations. But what if we want to get a new policy? TD learning does not learn the transition probabilities, so we switch to learning Qvalues, since it is easier to extract actions from Qvalues.
This now brings us to active reinforcement learning, where we have to learn an optimal policy by choosing actions. Clearly, there will be some tradeoffs between exploration and exploitation.
The solution here is an algorithm called QLearning, which iteratively computes Qvalues:
Notice how the sample here is slightly different than in TD learning. Each sample is computed from a tuple.
The amazing thing is, so long as certain obvious assumptions hold, such as that we visit states sufficiently often, QLearning will converge to an optimal solution even if the actions we take are repeatedly suboptimal^{4}! In the fire pit example in the CS 188 class, we could keep jumping in the pit, but so long as our actions are somewhat stochastic, we will eventually have some paths that end up in the lone beneficial spot, which will result in higher Qvalues for those states and actions that lead towards that spot. (Think of each grid as having four Qvalues.)
Even so, it would probably be wise to have good heuristics for deciding on what actions to take, which might make QLearning converge faster. The simplest (and worst) strategy: act according to a current policy, but with some small probability, we choose a random action. To improve this, use exploration functions that weigh the importance of actions and states based on whether their values are “established” or not. This information propagates back so that we end up favoring actions that lead to unexplored areas. This means for some exploration function , the action we pick at a step will be where represents counts, and the new QLearning update is:
There is a close relative to QLearning called SARSA, StateActionRewardStateAction. The update is, following Russell and Norvig’s notation to have rewards be fixed to a state:
What this means is that we already have an action chosen for the successor state . QLearning will choose the best action , but in SARSA, it is fixed, and then we can update . SARSA and QLearning (the first version here, not the second version with ) are the same for a greedy agent always picking the best action. When exploration happens, QLearning will attempt to pick the best action, but SARSA picks whatever is actually going to happen. This means QLearning does not pay attention to the policy, i.e., it is offpolicy, but SARSA does, so it is onpolicy. Here’s an example I thought of to clarify: the fire pit in CS 188 means that a policy that tells us to go in the pit will be bad, but if transitions are stochastic, then at some point, we will get to the good exit, hence QLearning will eventually learn that best value. But with SARSA, we do not get that opportunity because we will always have the next action fixed, so it is “more realistic” in some sense.
It is worth pointing out that all this previous discussion about learning all the values is really applicable only for the smallest of problems, because we cannot tabulate all those in reallife situations. Thus, we resort to Approximate QLearning. Here, we use featurebased representations by representing a state as a vector of features:
And the same for the Qvalues:
We usually make things linear in terms of the features for simplicity. This has the additional benefit in that we can generalize to states we have not visited yet (in addition to the obvious compression benefits). Note that when we do the QLearning update here, we change the weights. Each weight update looks like the same kind of update we see in stochastic gradient descent.
Finally, the last major reinforcement learning topic is policy search. It actually seems like the policy iteration analogue: we only care about the policy, not the actual values themselves. Starting with an initial featurebased representation of the Qstates, we might consider tweaking the individual weights up and down and then evaluating the result of that representation to see if our rewards are higher. This may involve computing an expression for the expected reward, or running an agent in the world multiple times.
Old CS 188 Exam Questions
I found four interesting questions related to MDPs and reinforcement learning.

Spring 2011, Question 4 (“WorstCase Markov Decision Processes”). Now we measure the quality of a policy by its worstcase utility, or in other words, what we are guaranteed to achieve. This question is really intellectually simulating, as we have to consider the algorithm from a minimax perspective, rather than an expectimax perspective.
For a state, we have:
So we pick the action that maximizes the following quantity: it is the discounted reward based on the worst possible state to which we could transition.
There is a corresponding Qvalue version:
We are guaranteed , but have already committed to action , so the adversary is free to send us to the worst possible subsequent state for us (as long as that transition has nonzero probability, of course), but then we are free to maximize over the next action.
If we want to do Learning, then it is actually easier than standard Learning because the environment has deterministic rewards and we only care about the minimax one, so if we ever observe a reward, we know that it’s the correct one.

Fall 2011, Question 4 (An unnamed question). This one required us to actually perform some computations of value iteration and Qlearning. The only nontrivial thing about this was the jump from C, which could land in either D or E. I personally found it easiest to just run Value Iteration and try to reason about the Qvalues later. The Qlearning iterative update is:
For the third part, the tricky thing was to remember that the Qlearning update will involve MAXing over previously computed Qvalues.

Fall 2012, Question 6 (“Reinforcement Learning”). This question makes clear the distinction between QLearning and SARSA. For part (a), they give us the formula for the QLearning agent, and ask if this will converge to the optimal set of Qvalues even if the policy we’re following is suboptimal. The answer is yes, since they also mention in the question several necessary requirements (e.g., that we visit all stateaction pairs infinitely often). For part (b), the SARSA update, answer is no, we do not converge to the optimal Qvalues, because SARSA will have Qvalues for whatever policy is actually being executed. This policy that we converge to is the policy that follows with probability 0.5, and uniformly random otherwise.

Fall 2012, Question 7 (“Bandits”). This is a rather long question. Part (a) gives a nice review of MEU and VPI, and the remaining parts now deal with POMDPs over the two states of having bandits versus not having bandits. Hence, we represent the belief state as a single scalar representing the probability we believe there are bandits. When we have to compute in the tree, we use the following formula:
Notice that this is precisely the formula for updating the belief state that we discussed earlier in the notes, except that we have the transition function as the identity. By the way, it took me a while to figure this out, but the normalizing constant here needs to iterate through the set of belief states. We have a scalar , but we really have to consider the case of there possibly being no bandits at all. A bit tricky here. The transition probability is also tricky, so it’s easiest to think of it in terms of the problem: what is the probability of even reaching this node?
Finally, they discuss some utility diagrams. As we discussed earlier, to find the overall value function, we take the maximum over all the possible lines. For the discounting questions, part (i) the point was that agents with larger values should be more concerned about the future, so they are more willing to try out shortterm suboptimal paths. Thus, Agent A has a higher discount factor because it played the suboptimal Land action one more time. In part (ii), there are three plots you can easily remove. Then it’s just assigning three plots to three discount factors. A higher discount factor means we need to be more certain that we have bandits before we commit to making air shipments.

Here is where I get confused. Why are we having Berkeley’s AI class use a different formulation than what R & N are using? I mean, Stuart Russell is a faculty member here, so why are we choosing the other way? For pedagogical purposes? ↩

In R & N’s notation, we would use . ↩

Alternatively, if that step is still too expensive, one can perform iterative updates as we would in value iteration for some number of steps, then switch to policy improvement. In this case, we would add s as subscripts. ↩

But we do need to visit states often enough. In section 21.3 of R & N, they make it clear that at each time step (i.e., after each sample of stateactionrewardssuccessors), we could compute a model with whatever information we have, and then follow the optimal policy for that. But such a greedy agent is not good because it will never try out different states whose value may not be established. ↩
The Least Mean Squares Algorithm
After reviewing some linear algebra, the Least Mean Squares (LMS) algorithm is a logical choice of subject to examine, because it combines the topics of linear algebra (obviously) and graphical models, the latter case because we can view it as the case of a single, continuousvalued node whose mean is a linear function of the value of its parents. (Admittedly, I do not find this intuitively helpful.)
The HighLevel Problem
We are going to be analyzing LMS in the context of linear regression, i.e., we will have some input features along with their (scalarvalued) output as our data, and the goal is to estimate a parameter vector such that , where the is admitting that we do not expect to exactly match .
When we have ordinary linear regression, we often express the data all together in terms of matrices. Thus, we could have be our matrix of features, where there are samples and variables per sample, be the column vector of parameters, and the goal would be to find a solution to the problem , where is the column vector of actual outputs. The problem is linear because the equations are linear in ( itself can actually be input to a more complicated function). In the rare case that is square and that all columns are linearly independent, this immediately reduces to . Most of the time, though, is a “tall” matrix and the data vector lies outside the column space of .
The most natural solution, it seems, is to find the projection of onto the subspace of , as that intuitively should minimize all our terms. Thus, we have the normal equations, the most important one in statistics: . In most cases^{1}, we can invert . If not, we can use the pseudoinverse^{2}.
There are several ways of getting to the normal equations. One can appeal geometrically, by using geometry and linear algebra. Alternatively, we can derive it based on the following cost function:
If someone asks you about what LMS (which we will discuss shortly) is supposed to optimize, it is this cost function. We want to minimize the sum of squared errors.
So, under the assumption that we can get by inverting , we can solve this via direct methods (e.g., Gaussian elimination) or iterative methods. The latter is our focus, as the LMS algorithm is an example of an iterative method. Put another way, it is a steepest descent algorithm for solving the normal equations.
The LMS Algorithm
The gradient of the cost function^{3} over all the data (i.e., the “batch” case, not the “online” case) leads to the update rule:
We just initialize some and run this until convergence. But this is bad because we are summing up over the entire set of samples. Thus, we use the following singlecase update, which is indeed what we mean by the LMS algorithm:
In other words, the idea is that we take each data point stochastically^{4}, which provides an approximation of the gradient, in the hopes that will converge to whatever is implied by the normal equations. The above equation is the LMS update. While LMS is technically an approximation, it is in fact often superior to the batch version equation update written earlier.
This update rule has the following neat, intuitive interpretation courtesy of Michael I. Jordan. Suppose we are given ; what update should we perform on ? If this is our only data, then we have vectors and in some space. To map perfectly onto , we must figure out a way to project^{5} onto . In general, there will be a line of possible solutions for , but the two “natural” ways to update are by extending in either (1) the “ direction” or (2) the “ direction.” The second case is easiest to do because the error associated with the projection of is precisely , and by projecting by this amount in the direction of , we will get with no error term. In general, the rule will be:
and this nicely maps to the LMS algorithm (in fact, it is the LMS algorithm update!) as we can turn the reciprocal of the norm term into . If we have multiple data points, we should expect the “jumping ” to eventually land at the solution, if there is one, or converge under certain conditions. The above equation means that we update by moving it some distance in “the direction.”
What I found especially interesting is that we can literally treat the parameter estimation problem as a geometric problem, where each data point presents a constraint, and the overall goal is to solve a constraint satisfaction problem, a popular subfield in classic Artificial Intelligence.
It is possible to prove that LMS converges to a vector that satisfies the normal equations. I will not put the proofs here^{6}, but it is important to understand these to see how much flexibility we have with the step size .
Side Notes
If we want to do weighted least squares, where each point is assigned a weight that indicates its importance, then the cost function is , meaning that the normal equations turn into . Our new matrix is a diagonal matrix of weights.
We can connect our geometric treatment of LMS with probability. Suppose we endow the errors as IID zeromean Gaussians with variance . Then, if we try to compute for the entire data, what happens is that the log likelihood of this expression is equivalent to the least squares cost function ! Said another way, maximizing the log likelihood with is the same as minimizing the leastsquares cost function, and all we assumed are IID sampling and Gaussian errors. Thus, we can avoid most of our geometric, constraint satisfaction approaches in favor of just computing the data log likelihood, since they evidently get the same results. This is assuming we stay in the maximum likelihood framework, by the way; the leastsquares cost function can lead to a frequentist “estimator”, regardless of whether or not the data errors are Gaussian, and then that estimator and the MLE would not coincide. But here, they do.
It’s interesting to consider the question as to whether we can use LMS as discussed here for classification. As it is, it’s not a good fit for the classification case, but if we wanted to, we could adapt it into logistic regression. Then what this means is our hypothesis will still involve some form of , but we will instead use the logistic function:
And the LMS would look like:
How do we derive this? Take the gradient of the log likelihood of the data under our logistic regression assumption, and pick the th element.
It’s important to realize what we’ve done here. We have a similar algorithm as before, except here the conditional expectation of given is not measured by but rather by the logistic regression. Also, we’ve assumed a logistic function from the outset, and are trying to fit this to data. This is known as the discriminative form of learning. Alternatively, we could have a generative classifier, which would involve computing class conditionals and categorical . But in those cases, we don’t need iterative algorithms because we can get maximum likelihood estimates easily.
For the multiclass case, use the softmax function. Also, this still converges to something “linear” in the sense that this kind of classifier partitions the space into pieces, and this “partitioning” process is done with hyperplane surfaces, which are linear.
I hope to discuss logistic regression in more detail in a future blog post. While I (and you?) wait, here is another source about the LMS algorithm, courtesy of Andrew Ng. A lot of his notes really make things clear!

At least in introductory statistics courses… ↩

After reading more about Singular Value Decomposition (SVD), in addition to knowing about SVD and the four fundamental subspaces, I finally feel like I have a solid grasp of what the heck pseudoinverses do. The four fundamental subspaces are , and , where for an matrix , the first two I listed combined will take up the entire space, and the latter two will combine for space. In other words, any dimensional vector can be broken up into where and . What SVD does is that for an arbitrary matrix , it splits it into . Suppose has rank . Then the first columns of are the basis for , the last are the basis for , and similar conditions apply for the matrix. When we take pseudoinverses, we get them via the SVD, so (where is the transpose of with all the nonzero diagonals inverted) and one can verify that . Thus, finally, by looking at the normal equations , the obvious thing to do to solve for is to multiply both sides by . At long last, I finally understand what to do if is not invertible! (This happens when has linearly dependent columns, say if we repeated measurements somehow, not a far fetched possibility. If that happens, then is an matrix with rank less than .) Also, sorry for the abnormally long footnote here! We are not actually doing this here. This is the “batch” form of solving the problem, and we want the iterative form. ↩

A word of caution on the gradient: get by using the formulation of , not the matrix formulation, because in the latter it’s a little harder to see how we actually get the form of the update. ↩

The LMS algorithm is also known sometimes as the stochastic gradient method, I think. ↩

Recall that . It took me a surprisingly long time to verify that the projection actually worked, so it might be worth it to go through the math and trigonometry involved, if necessary. ↩

Ben Recht, on analyzing the convergence of LMS: “There are whole books written on this!” ↩
Stanford's Linear Algebra Review
I’ll take a break from the recent discussion on graphical models to go over some old linear algebra concepts. While there are many references on linear algebra designed to help a student relearn what he or she may have forgotten, I found the handout from Stanford’s CS 229 (Machine Learning) class to be one of the best when considering the content, size, and clarity together. And despite being advertised as a review, it actually did teach me a lot. In college, I only took one linear algebra class, a basic introductory and proofsbased course^{1}. While I found it difficult when I was taking it, in retrospect, we did not go over that much material. Two things that really tricked me up when I first arrived in Berkeley were
 Positive Definite Matrices (and related to that, Singular Value Decomposition)
 Matrix Calculus (with gradients, Hessians, etc.)
These two concepts are used all the time in AI, and I really wish I could know them better. Thus, in this post, I’ll quickly go over some of the basic review concepts of linear algebra as presented in the Stanford handout, and then spend a little more time on those previously mentioned topics.
Basic Concepts and Matrix Multiplication
A few years ago, I kept getting confused between column and row vectors when I was reading machine learning literature, but now I know that all vectors are assumed to be columns by default. Also, when denoting row vectors, be careful that the transpose sign does not necessarily mean we are taking the corresponding column and then transposing it (as the notes say, the definitions of and are ambiguous).
Matrix multiplication is at the heart of linear algebra and what we do in AI. I find it easiest to reason about matrix multiplication by always checking the number of rows and columns of the matrices involved, and making sure they align correctly. But this handout made it easier for me to think of other ways to view how multiplication works. They describe four interesting ways of looking at matrix multiplication, of which the most interesting to me is when the product is expressed as the sum of outer products (between vectors).
Operations and Properties
Some interesting operations and properties of matrices:

Any square matrix can be represented as a sum of a symmetric matrix and an antisymmetric matrix, since .

The trace is the sum of the diagonal elements of a matrix. Despite its seeming simplicity, the trace operator will actually be very useful in matrix calculus (the “trace trick”) and in the study of eigenvalues and eigenvectors, since the sum of the eigenvalues of a matrix equals the trace, even though the individual components being summed up generally have no relation.

The norm of a vector is like an informal measure of the distance. For some reason, these kept confusing me when I was trying to work on my undergraduate thesis, so it’s nice to see it formally treated here. There are several properties of norms, but in general, I tend to view norms of vectors as just the norm. Also, when working with norms, make sure I know the CauchySchwarz inequality: .

The rank of a matrix is equal to the number of its linearly independent columns (or rows). There are many implications of the rank; for me, the one I remember the best is that in order for an inverse to exist, a (square) matrix has to be full rank. If we generated a square matrix by filling in its values from a random number generator, then it will almost always have full rank.

A square matrix is orthogonal if . This requires the columns of to be orthonormal^{2}. A nice property: , which we can prove by squaring them, thus considering and .

The range, i.e., column space, of a matrix is . The null space is ^{3}. Vectors in the former set have length , those in the latter have length , and we can get the complement of each of those sets by considering and . For instance, considering the sets and , it turns out that the intersection of those two sets is empty, and that every vector in can be expressed as the sum of two vectors, one in each of those two sets. Thus, they are orthogonal complements. The two column spaces, combined with the two column spaces (using and ) form the four fundamental subspaces, a term popularized by Gilbert Strang. I’m not sure I really get it, though.

The determinant of a square matrix , denoted as or , is a mysterious function that maps to the real numbers. The three main properties it satisfies are:
 that the identity has determinant one
 that if we multiply a single row in by a scalar , the determinant of the new matrix is
 that if we exchange any two (different) rows, then the determinant of the new matrix is
All the other properties of determinants follow from this, including the cofactor expansion, which I will not list here. Determinants have an intuitive interpretation in terms of volume: if we take the span of the rows of , and restrict all the coefficients to be in , then the determinant is the volume of this set. Also, determinants are useful for checking if exists. Finally, the adjoint of a matrix, , is .

Given a square, symmetric matrix , a quadratic form is a function mapping and to the reals. Our variables are , so even if the dimensions of are large, the highest power of any (i.e., a component of ) that can exist in is two, hence the intuitive name.
Something that is related is the concept of a positive semidefinite matrix, which is a symmetric matrix such that for all . We can also define the obvious analogues for definite, negative definite, etc., matrices. Given any, not necessarily square matrix , the matrix is always positive semidefinite; prove this by using .
Note: quadratic forms like these are the start of an introduction to what really happens in machine learning and AI, where we have to understand matrix representations.

For determining eigenvalues and their corresponding eigenvectors (note: eigenvectors are not unique), refer to if we are going to be solving them by hand.
We tend to use (capital lambda) to represent the matrix of eigenvalues, so the matrix equation will result in , implying that , hence is diagonalizable.
Some interesting properties: the trace and determinant are the sum and product, respectively, of the eigenvalues. The eigenvalues of a triangular matrix are just the elements on the diagonal. If is nonsingular, then is an eigenvalue of with eigenvector .
If we have a symmetric matrix , then (1) all eigenvalues are real, and (2) its eigenvectors can be scaled to be orthogonal to each other, so that above can turn orthogonal, and hence we have , a good thing because transposes are easier computationally than inverses. Unfortunately, I don’t currently know how to prove these off the top of my head.
Singular Value Decomposition
With a symmetric matrix , we can set instead of using . It turns out we can generalize this to arbitrary (not even square!) matrices by expressing them in singular value decomposition form, . The and matrices are square and orthogonal, and represent eigenvectors of and , respectively.
The matrix is diagonal, possibly nonsquare, and contains something called the singular values (not the eigenvalues!) of matrix . They fill the first diagonal elements for a rank matrix , and the rest of the elements are zero. The singular values are the square roots of the nonzero eigenvalues of both and (which are symmetric, so they have real eigenvalues).
For a positive definite matrix, .
There are many applications of SVD, and the ones I’m familiar with happen to be in computer vision. Some of these applications have to do with the fact that and give orthonormal bases for all four fundamental subspaces. For instance, to use SVD to compute the null space of , then look at the last columns of .
Matrix Calculus
I really wish I had completely digested this section before taking CS 281A here because I had no idea how matrix calculus worked until I reviewed this document. Given a function , the gradient of with respect to is a matrix (i.e., it’s the same size as the original matrix input to ) of partial derivatives of .
The Hessian is the second derivative analogue to the gradient (well, almost). Here, we assume we are taking the Hessian with respect to a function , where we take a vector as input (not a general matrix, even though that’s possible).
The Trace Trick
Here’s a neat trick I learned from reading Mike Jordan’s notes. Since the trace is such that is the same no matter what the ordering of the three matrices, and since (verify this by explicit computation) then we can claim:
Intuitively, this makes sense since we just eliminated . In the past, when I tried computing derivatives like these, I would convert the entire expression to an enormous set of summations, and try to reason element by element. That is way too much work.
There is a related trick involving determinants: the derivative of with respect to is .
One can use these tactics to take the MLE of the covariance matrix of a multivariate Gaussian distribution, since the log likelihood of that term has and in it.

When I took the course (with Professor Elizabeth Beazley), it was called MATH 211. Now it’s called MATH 250 since the department put the core courses in the “50”s with their muchimproved numbering system. ↩

Watch out with the terminology! Some people might call such matrices as orthonormal matrices, but I call them orthogonal. ↩

This is technically called the right null space, or the kernel, since we may wish to refer to the left null space, which uses . ↩
Hidden Markov Models and Particle Filtering
In this post, we will continue our discussion of graphical models by going over a special kind known as a Hidden Markov Model (HMM). These are appropriate for modeling forms of sequential data, implying that we finally relax various forms of “independent identically distributed” data or variables. They are a piece of the broader class of models known as Dynamic Bayesian Networks, of which Kalman Filters – which we will discuss in a future blog post here – are their continuous analogue.
Hidden Markov Models
HMMs are one of the most popular graphical models in real use (indeed, probably the most popular). They are characterized by variables representing hidden states and variables representing observations (i.e., evidence). The subscript in and represents a discrete slice of time. There are three probability distributions in HMMs: a prior probability , a transition probability and an emission (sometimes called observation) probability . That the transition probability only depends on the previous state means we are effectively invoking the Markov assumption. Whatever information about the state in the last time slice is all we need to know to determine the likelihood of the current state. Our distribution is stationary, which means that we only need to specify those three probabilities and we are set, because the CPTs are “carried over” to all the time slices. This is an instance of parameter tying.
There are several welldefined inference tasks and algorithms that we can conduct on HMMs, to which we now turn. True, we could simply take an HMM and run variable elimination on it, but the structure of HMMs provides us with some recursion we can exploit.
Filtering, Smoothing, and Predicting
These three are all similar. Given a series of observations, we want to determine the distribution over states at some time stamp. Concretely, we want to determine . The task is called filtering if , smoothing if , and predicting if . Clearly, smoothing will give better estimates, and prediction the weakest (or most uncertain) estimates.
To compute filtering estimates, we require a recursive computation. The actual algorithm we use is the forward algorithm. We will talk more about it later; here is the update equation for the forward algorithm:
where capital variables denote unknown variables, and the lowercase ones denote known variables. One can see that this equation makes intuitive sense. We sum up over all possible state realizations in the previous time step by their transition probability to the current state, and for each of those, we weigh the probability by the likelihood of emission from the state (which doesn’t depend on so we can move it out of the sum). Furthermore, note that this is a recursive computation, a theme that will be common when we talk about inference in HMMs. Finally, the normalizing constant here is computed by summing up for all possible realizations of . The normalizing is needed only because of the addition of the emission model probability downscaling our values.
Given that we know filtering, prediction is not that much more work, because we have the value up to the latest evidence variable, then it’s like we are ignoring the evidence variable and only using the transition probability:
(Here, we have no proportion sign.)
As we predict further into the distribution, we will eventually end up at the stationary distribution of the underlying Markov chain governing the sequence of state realizations. We can compute this as , where is the matrix of transitions, and is the column vector of priors.
For smoothing, we actually need to compute something called a backward probability:
The first component is what we were able to compute earlier, and we can do this recursively from the starting state. The second component is new, but in fact, it is not too hard to compute recursively – we just have to start backwards from . Here is the update:
Notice now that we cannot move anything out of the summation, like we did with the forward update.
We can run something called the forwardbackward algorithm, which will compute the forward probabilities, then the backward probabilities. We will return to these “forward” and “backward” subroutines in the following subsections, with two helpful figures representing these two algorithms.
The Likelihood of Observations
In this section, we’ll talk about two other common tasks in HMMs. The first is computing the likelihood of an observation sequence, . At first, it might not seem obvious how to do this, because we don’t even know the state sequence that caused the observations! Following the laws of probability, we have to include the states in our joint sequence, and then sum up over them. To do this efficiently, we use a dynamic programming algorithm called the forward algorithm, which folds together the paths that could have generated the observations.
I find it easiest to think of this computation just by looking at a trellis, like the one in the following image from Jurafsky’s book^{1}:
This example was about determining the observation likelihood of seeing three, one, and then three ice creams eaten in a day, and the state variables represented the temperature of the day. In the trellis, the rows correspond to the possible realizations of each state, along with one other row to symbolically represent the emission/observation/evidence. The notation of represents the probability of the observation sequence and being at a particular state at time , so . The computation of is done recursively in an intuitive manner, as shown in the image:
Intuitively, the likelihood of an observation sequence should be the sum of all the possible state sequences that could have led to that observation. The fact that we are summing up and storing probabilities in the previous terms is an indication of the HMM’s independence assumptions, and is what lets us do inference efficiently. Oh, and to actually compute the observations, we do need to sum out from .
By the way, did you notice that these “updates” that we are doing are exactly what we did for the filtering step earlier? It’s just taking an earlier factor (representing whatever probability we want) and multiplying it by the transition and then the emission probability! In Russell and Norvig, the update equation is called , and can represent or . While and are clearly different, their updates are the same. And this FORWARD subroutine is what we generally mean when we talk about “the forward algorithm.”
The Viterbi Algorithm (Decoding)
This is another common inference task. Given a series of observations, the Viterbi algorithm helps us to determine the most likely sequence of states the system went to produce those observations. This is similar to the forward algorithm that we used to compute the likelihood of an observation sequence, but here we will now take maxes instead of summations, and we also have to keep a series of “backtrace pointers” so that we can reproduce our path.
Here’s the intuition: suppose we want to find the most likely path to some state . To do that, we need to find the most likely state path through , and then compare all the possible realizations of along with the transition (and emission, but those are the same) probabilities going to the current state. Then we pick the highest likelihood one, and add that particular to the path (note the lowercase here!). Mathematically, we express it as follows:
The “messages” here are not forward messages but instead
The time and space complexity is linear in terms of the length of the sequence, , but for filtering, we don’t have the space taking up linear time since there are no backpointers.
Training the Parameters with EM
The purpose of the BaumWelch algorithm (which is an example of the ExpectationMaximization algorithm), is to determine the parameters of an HMM given observed data. Somewhat confusingly, Jurafsky calls it the forwardbackward algorithm. Russell calls the forwardbackward algorithm as the algorithm that computes all the forward and backward probability messages defined earlier^{2}. But I guess they might be doing the same thing, because training HMMs will require us to compute the backward probabilities anyway. Jurafsky defines the backward probability to be exactly the same as what Russell defines it, i.e., as the probability of seeing the observations from time to the end, given that we are at some state in time , or . The recursion of the backwards probability is shown in the following image, where we now see that we had to keep separate emission probabilities for each state, unlike in the forward probability case.
To estimate the parameters of the transition probability matrix, for a given element , we must count the expected proportion of times that the system undergoes a transition from state realization to state realization . The expected counts are what EM computes, since those are the latent variables.
The following diagram shows the various things we have to compute to get the expected counts. Notice that the diagram is computing the probability of being in state then state , jointly with the observation sequence. Thus, after this step, we would divide by (or in their notation). We then sum over all times, and that gives us the expected number of transitions from state to state .
We follow a similar procedure for estimating the parameters of the emission probability.
The Mstep is simple here; it turns expected counts into MLE for the corresponding parameters.
Particle Filtering
I got confused about particle filtering earlier. I assumed it broadly meant any algorithm that sampled something, but in fact, it has a specific meaning in the context of HMMs and Dynamic Bayesian Networks more generally. It is indeed a sampling algorithm, but has some interesting properties that are worth discussing in its own right.
First, when do we want to do particle filtering? We want to do this if exact inference in HMMs is intractable, or if the speed of sampling (and computing its corresponding probabilities) is more important than exact answers.
A naive way of performing exact inference in an HMM is to “unroll” it into a normal Bayesian Network. Technically, HMMs are infinitely long, so what we would do is chop off the HMM after the last time step that has a query or evidence variable, since by the structure of an HMM, anything after is not relevant to the query^{3}. But if we did filtering after each observation, we would have to redo a lot of computations from the start to the set of variables at the current time step. To save on memory costs, we realize that if we do filtering for a given time , we only need the relevant probability distribution from the previous time , since the computation is done recursively. Unfortunately, when we do this “summing out” process ala variable elimination, that will iterate through the set of states, and this computation will grow exponentially due to the need to construct an exponentially large factor table^{4}.
Thus, we need to use approximate inference methods. In a previous post, I discussed likelihood weighting, which is a sampling algorithm that we can adapt for the HMM setting. We do, however, have to ensure that we do not unroll the entire HMM, and we have to make sure our weights are not being driven to zero.
Likelihood weighting (just like rejection sampling) suffers as the number of evidence variables increases, because the weight of each sample is based on the product of all the terms, which drives our values down to zero. But there is something that might be even worse than that. When we sample our nonevidence variables, ideally we would like those samples to depend on the evidence variables, so they can “guide” the sampling correctly. Unfortunately, if all the evidence variables tend to be downstream, i.e., they occur late in the variable ordering, then the sampling for the nonevidence variables will not depend on the evidence, even though the actual variables should depend on each other^{5}. Sadly, in an HMM, all state variables have the property such that all the evidence relevant to them is downstream! (The ones before it are not downstream, but they are not ancestors either!)
What does particle filtering do, then, keeping in mind the weaknesses of likelihood weighting? Here is how the algorithm works. We initialize a population of samples. The quantity determines the time complexity of the algorithm, not the number of states. Then for each time step, we perform the following steps in sequence:
 We transition the samples according to the transition probabilities . This means we do not have to unroll the HMM.
 We then observe the evidence and downscale (i.e., multiply by a value in ) each sample according to the evidence . This is where the likelihood weighting “adaption” is obvious because now each sample is a fraction of a sample with some weight, and the overall distribution over the state variable’s realizations is the fraction of weights there divided by the total fraction of weights across all realizations.
 Then (and this is important), we resample to renormalize the distribution, where the resampling process is sampled according to the weights of the samples.
Here is a nice diagram from the CS 188 slides that suddenly clarified particle filtering for me.
Here, we assume we have one state variable, with nine possible realizations. (Or we could have two state variables, each with three realizations, etc.) We have ten particles, representing our estimate of the distribution . Now the goal will be to determine the table of values. First, we transition, which means we move each of the ten particles over to their next realization and multiply accordingly. Then, we observe , and downscale the values, resulting in points of different sizes, which are a diagram notation for weights. Above, we assume that whatever was, the probability is high for that spot at coordinate , and lower as we gradually go away from that spot. The last step is to resample, so we are resampling based on the probability distribution of the weighted samples. So we get five samples in that block because divided by the sum of the weights (note: there’s a typo on the slides) gives a “high” probability. Note now that we have “normalsized” samples again.
At this point, we should ask ourselves why we even bother resampling. It’s not completely clear to me, but it has to do with how we constantly rescale the weights downwards. In general, it’s not a good idea to have such small weights, because then the more evidence we have, the more likely we will just have a bunch of zeroprobability particles. In other words, most particles end up in lowlikelihood places in the state space. Still, I don’t quite buy some of the textbook explanations on this because couldn’t we just perform the computations in log space?

I know it says the words “DRAFT” there, because I copied these images from an online version of the book. However, I did buy a hard copy of the book, so I think it is fine for me to do this (i.e., not illegal). If I did not have that online version, I would have taken a picture of the book with a cell phone, but then the figure would not have been perfectly flat. ↩

In the EM chapter, Russell says that his forwardbackward algorithm can be slightly modified for the EM case. ↩

In other words, the variable elimination algorithm would be able to push the summation sign for those variables and isolate them to the right. They would sum up to one and thus are irrelevant. ↩

I apologize if this is not clear. I think the easiest way to understand this is that this is still normal variable elimination, so it suffers from the sample computational complexity problems. What makes this confusing is that we tend to have one hidden variable , but that hidden variable can actually taken on different realizations. But what is happening here (and which Russel and Norvig discuss in their book) is that our variable represents a set of state variables. As the set increases in size, the process of summing out will require as many sums as there are state variables. At first, I thought the intractability came from how our one hidden variable could take on many values, but the intractability is actually when we have multiple hidden variables (and we would just use as shorthand to represent all of them). But both views can be made compatible if we have one megavariable that takes on values from each state variable into one realization. ↩

I hope I’m thinking about this correctly. I think the issue is that when we sample, we assume we only know the CPTs, which only express nodeparent relationships, so there is no notion of sampling in the “opposite direction” where we could sample a node based on the value of its children, even though those values do depend on each other according to the Bayes Ball algorithm. ↩
ExpectationMaximization
All right, I lied in the title of my last post. That wasn’t the final word on graphical models after all^{1}. In this post, we’ll discuss ExpectationMaximization, which is an incredibly useful and widespread algorithm in machine learning, though many in the field view it as “hacking” due to its lack of statistical guarantees^{2}. At first, I wasn’t sure what this had to do with graphical models. True, it’s not what I’d call core graphical models, but what happens is that a lot of problems that “factor” into nice Bayesian Network representations can use Expectation Maximization to determine the optimal value of their parameters (which are CPTs in the Bayesian Network case).
The Formalities of ExpectationMaximization
ExpectationMaximization (EM) is an extremely useful algorithm that applies in cases when we have a model with unobserved (i.e., latent) variables, and we need to train the model parameters despite the lack of that information. Why do we want latent variables? It helps to simplify our models. The canonical example is seeing a bunch of health measurements, then adding in a hidden node that represents the presence of heart disease.
We apply EM when we have some probabilistic model in which we are trying to find the “best” model parameters , which usually means taking the Maximum Likelihood Estimate (MLE). Crucially, is our observed variable (or a set of them), while represents the latent variables. Usually, we use logs when maximizing likelihood, so this is equivalent to finding , which Mike calls the incomplete log likelihood^{3}. This is hard; marginalization here (because it involves summing over ) tends to “make problems harder.” My intuition: having sums means we don’t usually have a closedform expression of the likelihood function, so how can we find the MLE for ?
We can define the expected complete log likelihood with some other distribution as , in other words, averaging out . A good choice of means that the expected log likelihood can be treated as the actual log likelihood.
EM is divided into two steps that repeat. Informally, the Expectation step assigns values to the hidden variables based on . (In most examples, these are indicator variables.) In the Maximization step, we find the MLE of , which is easy this time because we actually “know” what the variables are — we assigned them in the previous step!
There are two “formal perspectives” of EM, one relating to log likelihood, and the other relating to a concept known as the KullbackLeibler Divergence.
The Log Likelihood View
Let us define the following quantity:
We further know from some algebra that , i.e., that defines a lower bound of the incomplete log likelihood, for an arbitrary function.
We can now define our E and M steps more precisely:
 E step:
 M step:
What does this really mean? Note that in both cases we are trying to maximize a lower bound on the incomplete log likelihood, which makes intuitive sense: that is the point of MLE!

In the E step, we are finding the best , and this can actually be solved analytically. It’s right under our noses: . I mean, that’s just the true distribution of the hidden variable, given the known data and the current parameters! It sounds obvious.
There are two ways to show this. One is to plug in that choice of in the function , and we get equality with the incomplete log likelihood, . Hence, that’s the best we can get! The other is to take the KLDivergence view (spoiler for the next section!) and see that
which shows that to minimize the difference between the two distributions, we need to set the two distributions to be the same. Remember again that is a lower bound. I know I say this a lot, but it’s important.

In the M step, we maximize the expected complete log likelihood, which we recall we defined earlier as . This happens because breaks into two terms, one which is the expected complete log likelihood, and the other which does not depend on .
Taken together, what do these steps really do? Let’s look back at the incomplete log likelihood . This is hard to optimize, as we said earlier, but what the E and M steps do is that, during each iteration, they collectively keep maximizing this function! It’s a “hill climbing” algorithm that will keep increasing (or technically, never decrease) the incomplete log likelihood. Why is that the case? Here’s the intuition. The M step means that we are maximizing , a lower bound to the incomplete log likelihood. Hypothetically, suppose was actually equal to that incomplete log likelihood. Then the M step would necessarily result in a (highly desirable!) increase in . But this only works if there is equality … if there is a gap, the plan is thrown out the window.
Fortunately, the E step comes to rescue us, because we already know that for an appropriate choice of , we can get to be equal to . Note that the thetas here should really have a time superscript attached to them (as well as the s) but I left them out for simplicity.
In conclusion, EM will keep increasing each iteration. This, if you recall, was the whole point of this process anyway! And it does this by cleverly optimizing the function, which is much easier to handle than the incomplete log likelihood. Absolutely brilliant!
The KullbackLeibler Divergence View
We can alternatively bound the KLDivergence. Derive the bound
In words: the KLdivergence between the empirical distribution and the model is upper bounded by a “complete” KLdivergence. The intuition is that we are adding in the to “average over” the values of , just as we did earlier with the expected complete log likelihood term.
Our results carry over from the previous section because minimizing the right hand side of that expression above (with respect to and ) is equivalent to maximizing with respect to the same variables. Why? As part of the derivation for that bound above, we invoked , but that part was embedded inside an expression such that there was a negation attached, so maximizing means minimizing the overall problem.
Letting , the steps are:

E step:

M step:
Example on Gaussians
This example is from Russell and Norvig (2010). It is about clustering on a dataset presumed to be generated by three unknown multivariate Gaussian distributions. Here’s an example of a plot that could represent such data, with each data point highlighted by the color of its distribution (which we would not know — they are only there for ease of readability). Note: the figure here isn’t the exact one in the textbook, but it’s close.
I like this example because it makes it clear how to view the s as indicator variables.
To formalize the problem, we assume that each data point is generated by picking one of the three twodimensional, multivariate Gaussian distributions, then sampling from that distribution. Thus, the goal in using EM is to find the parameters of the model:
 the weight of each component (i.e., each multivariate Gaussian)
 the 2dimensional mean vector of each component
 the covariance matrix of each component
To start off with EM, we first initialize the three sets of parameters above to some random values^{4}. Then we iterate:

E step: this will determine the expected value of the hidden indicator variables in this model, which we define as , which is one if point was generated by Gaussian component and zero otherwise. From the earlier discussion, we know that we need to use the function. Assuming that the points are IID, this means for each point , we compute where .
We can transform this into an easier problem using Bayes’ rule, and claim that . The first quantity, , is a straightforward plug and chug into the formula for the density of a multivariate Gaussian, because we assume it came from the th one, and we have the parameters and from . The second quantity, is like computing the probability that a component generated a given point, but without any dependency on any point , we just probabilistically choose based on the component’s overall weight , which we get from .

M step: given the fact that we “know” which component generated which data point in a probabilistic sense, then we can find better values of the parameters using the following update equations:
 The weights:
 The means:
 The covariances:
where is the total number of data points and , the “number” of data points assigned to component (in a probabilistic sense).
Why does the M step work? The weight update is like that because that is the fraction of points we assigned to that cluster, and . The means and covariances is from the MLE for the parameters of multivariate Gaussians, except that we have to scale by an expectation because clusters are only assigned fractions of points.
For a more complicated example of EM, look at how EM is used to train the parameters of Hidden Markov Models.

I should have realized this a while back when I saw how many chapters Michael I. Jordan was able to write about graphical models. The last two posts I wrote about graphical models only covered chapters two through four. ↩

To future Ben Recht students, this is why he does not like ExpectationMaximization. Well, that, and because he (correctly) argues that it’s so good, it renders other attempts at approaching problems futile. ↩

When I was drafting this post, I often viewed the incomplete log likelihood as the “true” data log likelihood, which should be equivalent. I mean, we are summing out an arbitrary amount of s to result in a “true” distribution, I think? ↩

It would be better to initialize them sensibly in some way, which often requires assuming some extra knowledge about the problem. ↩