My Blog Posts, in Reverse Chronological Order

Another Hearing Aid Fails to Live Up to Its Water Resistant Label

Mar 7, 2015


Today, I played basketball for the first time since I arrived in Berkeley. It was a lot of fun, and I was at Berkeley’s Rec Sports Facility for 1.5 hours. Unfortunately, I also received a sobering reminder that my water resistant hearing aids are not actually water resistant.

My Oticon Sensei hearing aids worked great for about half an hour … then I heard that all-too-familiar beeping sequence in both ears, and then a few minutes later, the hearing aids stopped working. So I didn’t have any hearing and had to rely on various body language cues and last-resort tactics (honed over the years) to understand what others were saying. Fortunately, in basketball, communication among players in game situations tends to be blunt and simple and from experience, I’ve learned what players typically say to each other.

It is not uncommon for my hearing aids to stop working while I’m engaging in some physical activity. In fact, I get surprised if my hearing aids last through a session of pickup basketball. Thus, I already knew that I would have to reduce the amount of sweat near my hearing aids. I tried using my shirt and the gym’s towel cloth to absorb some of it, but they can only help out so much.

I understand that water resistant does not mean waterproof, but I just cannot fathom how a water resistant hearing aid stops functioning after a half hour of physical activity. Out of curiosity, I re-checked my manual and it states that the Oticon Sensei has an IP57 classification. This means that it is rated to keep functioning properly after being immersed in water for 30 minutes at a depth of 1 meter.

I am somewhat surprised, because 30 minutes is about the time it took for the hearing aids to stop working after playing basketball. Oh well. At least I have a functional hearing aid dryer. Within a few hours after arriving home, I had them working. But it’s still incredibly annoying. Honestly, the biggest problem with hearing aid breakdowns is not the lack of communication on the court, but what happens off the court. Between pickup games, players are constantly talking to each other about who should be playing the next game or what they want to do after basketball’s over. A more important issue is that I drive to the gym, and driving without working hearing aids is something I would rather avoid.

Make the Best Peer Reviews Public

Feb 28, 2015

The annual Neural Information Processing Systems (NIPS) conference is arguably the premier machine learning conference along with the International Conference on Machine Learning (ICML). I read a lot of NIPS papers, and one thing I’ve only recently found out was that NIPS actually makes the paper reviews (somewhat) public.

As I understand it, the way NIPS works is:

  1. Authors submit papers, which are eight pages of text, and a ninth one for references. Unlimited supplementary material is allowed with the caveat that reviewers do not need to read it.
  2. The NIPS committee assigns reviewers to peer-review the submissions. These people are machine learning faculty, graduate students, and researchers. (It has to be like that because there’s no other qualified group of people to review papers.) One key point is that NIPS is double-blind, so reviewers do not know who wrote the papers they are reviewing, and authors do not know who is reviewing their papers.
  3. After a few months, reviewers make their preliminary comments and assign relative scores to papers. Then the original authors can see the reviews and respond to them during the “author rebuttal” phase. Naturally, during all this time, the identity of the authors and reviewers is a secret, though I’ve seen cases when people post submitted NIPS papers to arXiv before acceptance/rejection, and arXiv requires full author identity, so I guess it is the reviewer’s responsibility to avoid searching for the identity of the authors.
  4. After a few more months, the reviewers make their final decision on which papers get accepted. Then the authors are notified and have to modify their submitted papers to include their actual names (papers in submissions don’t list the authors, of course!), any acknowledgments, and possibly some minor fixes suggested by the reviewers.
  5. A few months after that (yeah, we’re getting a lot of months here), authors of accepted papers travel to the conference where they discuss their research.

This is a fairly typical model of a computer science conference, though possibly an atypical model when compared to other academic disciplines. But I won’t get into that discussion; what I wanted to point out here is that NIPS, as I said earlier, makes their reviews public, though the identity of the reviewers is not shown. Judging by the list of NIPS proceedings, this policy of making reviews public began in 2013, and happened again in 2014. I assume NIPS will continue with this policy. (You can click on that link, then click on the 2013/2014 papers lists, click on any paper, and then there’s a “Reviews” tab.) Note that the author rebuttals are also visible.

I was pleasantly surprised when I learned about this policy. This seems like a logical step towards transparency of reviews. Why don’t all computer science conferences do this?

On the other hand, I also see some room for improvement. To me, the obvious next step is to include the name of the reviewers who made those reviews (only for accepted papers). NIPS already gives awards for people who make the best reviews. Why not make it clear who wrote the reviews? It seems like this would incentivize a reviewer to do a good job since their reviews might be made public. Incidentally, those awards should be made more prestigious, perhaps by announcing them in the “grand banquet” or wherever the entire crowd gathers?

You might ask, why not make the identity of reviewers known for all reviews (of accepted papers)? I think there are several problems with this, but none seem to be too imposing, so this might not be a bad idea. One is that the current model for computer science seems to assign people too many papers to review, which necessarily lowers the quality of each individual review. I am not sure if it is necessary or fair to penalize an overworked researcher by making his/her token reviews public. Another is that it is a potential source of conflict between future researchers. I could imagine someone obsessively remembering a poor public review and using that against the reviewer in the future.

These are just my ideas, but I am not the only one thinking about the academic publishing model. There’s been a lot of discussion on how to change the computer science conference model (see, for instance, “Time For Computer Science to Grow Up”), but at least for the current model, NIPS got it mostly right by making reviews somewhat public. I argue that one additional step towards greater clarity would be helpful to the machine learning field.

Review of Natural Language Processing (CS 288) at Berkeley

Feb 14, 2015


This is the much-delayed review of the other class I took last semester. I wrote a little bit about Statistical Learning Theory a few months ago, and now, I’ll discuss Natural Language Processing (NLP). Part of my delay is due to the fact that the semester’s well underway now, and I have real work to do. But another reason is that this class was so incredibly stressful, more so than any other class I have ever taken, and I needed some amount of time to pass before writing this.

Before I get to that, let’s discuss what the class is about. Natural Language Processing (CS 288) is about the study of natural languages as it pertains to computers. It applies knowledge from linguistics and machine learning to develop algorithms that computers can run to perform a variety of language-related applications, such as automatic speech recognition, parsing, and machine translation. My class, being in the computer science department, was focused on the statistical portion of NLP, where we focus on the efficiencies of algorithms and justify them probabilistically.

At Berkeley, NLP seems to be offered every other year to train future NLP researchers. Currently we only have one major NLP researcher, Dan Klein, who teaches it (Berkeley’s hiring this year so maybe that number will turn into two). There are a few other faculty that have done work in NLP, most notably Michael Jordan and his groundbreaking Latent Dirichlet Allocation algorithm (over 10,000 Google Scholar citations!), but none are “pure” NLP like Dan.

CS 288 was a typical lecture class, and the grading was based exclusively on five programming projects. They were not exactly easy. Look at the following slide that Dan put up on the first day of class:


I come into every upper-level computer science class expecting to be worked to oblivion, so this slide didn’t intimidate me, but seeing that text there gave me an initial extra “edge” to make sure I was focused, doing work early, and engaging in other good habits.

Let’s talk about the fun part: the projects! There were five of them:

  1. Language Modeling. This was heavy on data structures and efficiency. We had to implement Kneser-Ney Smoothing, a fairly challenging algorithm that introduced me to the world of “where the theory breaks down.” Part of the difficulty in the project comes from how we had to meet strict performance criteria, so naive implementations would not suffice.
  2. Automatic Speech Recognition. This was my favorite project of the class. We implemented automatic speech recognition based on Hidden Markov Models (HMMs), which provided the first major breakthrough in performance. The second major breakthrough came from convolutional neural networks, but HMMs are a surprisingly good architecture on their own.
  3. Parsing. This was probably the most difficult project, where we had to implement the CYK parsing algorithm. I remember doing a lot of debugging and checking indices of matrices to make sure they were aligned. There’s also the problem of dealing with unary expressions, since that’s a special case that’s not commonly covered in textbook descriptions of the CYK parsing algorithm (actually, the concept of “special cases not described by textbook descriptions” could be applied to most projects we did…).
  4. Discriminative Re-ranking. This was a fairly relaxing project because a lot of the code structure was built for us and the objective is intuitively obvious. Given a candidate set of parses, the goal was to find the highest ranking one. The CYK parsing algorithm can do this, but it’s better if that algorithm gives us a set of (say) 100 parses, and we run more extensive algorithms on those top parses to pick the best of those, hence the name “re-ranking.”
  5. Word Alignment. This was one that I had some high-level experience with before the class. Given two sentences of different languages, but which mean the same thing, the goal is to train a computer to determine the word alignment. So for an English-French sentence pair, the first English word might be aligned to the third French word, the second English word might be aligned to no French word, etc.
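
The unary special case from the parsing project is easier to see in code. Below is a minimal sketch of a CYK recognizer in Python. It is not the actual project code: the toy grammar and the names `BINARY`, `UNARY`, `LEXICON`, `unary_closure`, and `cyk_recognize` are all made up for illustration, and it only answers yes/no instead of scoring parses the way a statistical parser would. The one step that textbook descriptions tend to skip is closing every chart cell under unary rules after it is filled.

```python
from collections import defaultdict

# Hypothetical toy grammar (an assumption, not from the course).
# Binary rules are indexed by their right-hand side: (B, C) -> {A} for A -> B C.
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("Det", "N"): {"NP"}}
# Unary rules are indexed by the child: B -> {A} for A -> B.
UNARY = {"N": {"NP"}, "V": {"VP"}}
# Lexical rules: word -> {A} for A -> word.
LEXICON = {"she": {"NP"}, "eats": {"V"}, "fish": {"N"}, "the": {"Det"}}


def unary_closure(symbols):
    """Return symbols plus every parent reachable through chains of unary rules."""
    closed = set(symbols)
    stack = list(symbols)
    while stack:
        child = stack.pop()
        for parent in UNARY.get(child, ()):
            if parent not in closed:
                closed.add(parent)
                stack.append(parent)
    return closed


def cyk_recognize(words, start="S"):
    """Return True if the toy grammar derives `words` from `start`."""
    n = len(words)
    if n == 0:
        return False
    chart = defaultdict(set)  # chart[(i, j)] holds symbols spanning words[i:j]
    # Width-1 spans: look up each word, then apply the unary closure.
    for i, word in enumerate(words):
        chart[(i, i + 1)] = unary_closure(LEXICON.get(word, set()))
    # Wider spans: try every split point k, combine with binary rules,
    # then apply the unary closure to the finished cell.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            found = set()
            for k in range(i + 1, j):
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        found |= BINARY.get((left, right), set())
            chart[(i, j)] = unary_closure(found)
    return start in chart[(0, n)]
```

With this toy grammar, `cyk_recognize("she eats".split())` succeeds only because the closure lifts V to VP; without that step the cell for "eats" would hold just V and the sentence would be rejected.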

I enjoyed most of my time thinking about and programming these projects. They trained me to stretch my mind and to understand when the theory would break down for an algorithm in practice. They also forced me to brush up my non-existent debugging skills.

Now, that having been said, while the programming projects were somewhat stressful (though nothing unexpected given the standards of a graduate level class), and the grading was surprisingly lax (we got As just for completing project requirements), there was another part of the class that really stressed me out, far beyond what I thought was even possible. Yes, it was attending the lectures themselves.

A few months ago, in the middle of the semester, I wrote a little bit about the frustration I was having with remote CART, a new academic accommodation for me. Unfortunately, things didn’t get any better after I had written that post, and I think they actually worsened. My CART continued to be plagued by technical issues, slow typing, and the rapid pace of lecture. There was also construction going on near the lecture room. I remember at least one lecture that was filled with drilling sound while the professor was lecturing. (Background noise is a killer for me.)

I talked to Dan a few weeks into the course about the communication issues I was having in the class. He understood and thanked me for informing him, though we both agreed that slowing down the lecture rate might reduce the amount of material we could cover (for the rest of the students, of course, not for me).

Nonetheless, the remaining classes were still insanely difficult for me to learn from, and during most lectures, I found myself completely lost within ten minutes! What was also distressing was knowing that I would never be able to follow the question/answer discussions that students had with the professor in class. When a student asks a question, remote CART typically puts in an “inaudible” text due to lack of reception and the relatively quiet voice of the students. By my own estimate, this happened 75 percent of the time, and that doesn’t mean the remaining 25 percent produced perfect captions! CS 288 had about 40-50 students, but we were in a small room so everyone except me could understand what students were asking. By the way, I should add that while I do have hearing from hearing aids and can sometimes understand the professor unaided, that hearing ability virtually vanishes when other students are asking questions or engaging in a discussion.

This meant that I didn’t have much confidence in asking questions, since I probably would have embarrassed myself by repeating an earlier question. I like to participate in class, but I probably spoke up in lecture perhaps twice the entire semester. It also didn’t help that I was usually in a state of confusion, and asking questions isn’t always the ticket towards enlightenment. In retrospect, I was definitely suffering from a severe form of imposter syndrome. I would often wonder why I was showing up to lecture when I understood almost nothing while other students were able to extract great benefits from them.

Overall verdict: I was fascinated with the material itself, and reasonably liked the programming projects, and the course staff was great. But the fact that the class made it so hard for me to sit comfortably in lecture caused way more stress than I needed. (I considered it a victory if I learned anything non-trivial from a lecture.) At the start of the semester, I was hoping to leave a solid impression on Dan and the other students, but I think I failed massively at that goal, and I probably asked far more questions on the class Piazza forum than I should have. It also adversely affected my CS 281a performance, since that lecture was right after CS 288, which meant I entered CS 281a lectures in a bad mood as a result of CS 288.

Wow, I’m happy the class is done. Oh, and I am also officially done with all forms of CART.

Harvard and MIT’s Lack of Closed Captions

Feb 14, 2015

Update February 25, 2017: Check out a related blog post about Stanford’s CS 231n class.

In the future, I will try not to discuss random news articles here, because often the subject might be a fad and fade into obscurity. Today, I’ll make an exception with this recent New York Times article about how Harvard and MIT are being sued over lack of closed captions. The lawsuit itself will probably be forgotten by most soon, but the overall theme of lack of captions and accessibility is a recurring news topic. Online education is real, and accommodations for those materials will also be necessary to ensure a maximal range of potential beneficiaries.

I don’t take part in online courses or video resources that much since there’s already plenty that I can learn from standard in-person lectures, and the material that I need to know (advanced math, for instance) is not something that I can learn from MOOCs, which by their very definition are for popular and broadly accessible subjects. For better or worse, the concepts I do need to know inside-out are embedded in dense, technical research papers.

Fortunately, the few online education resources I have experience with provide closed captions. The two that I’m most familiar with are MIT OpenCourseWare and Coursera, and both are terrific with captions. Coursera is slightly better, being more “modern” and allowing the video to be paused and sped up, while for MIT OCW one needs to use external tools, but both are great.

Apparently, using MIT OCW and Coursera (and sparingly at that) has probably led me to forget about how most online sources do not contain closed captions. It’s especially frustrating to me since in the few cases when I want to look at videos, I have to rely on extensive rewinding and judicious pauses to make sense of the material. I think in the next few years, I may need to employ those cumbersome tactics when I watch research talks.

It’s nice to see that captions are getting more attention, and I believe this issue will continue to reappear in news in the near future. Perhaps the brand names of “Harvard” and “MIT” are playing a role here, but I don’t view that as a bad sign: if they can take the initiative and be leaders in accessibility, then other universities should try and emulate them. After all, those universities want Harvard and MIT’s ranking…

Day in the Life of a Graduate Student

Feb 14, 2015

I was recently thinking about my daily routine at Berkeley, because I always feel like I am never getting enough work done. I wonder how much of my schedule is common among other graduate students (or among people in other, vastly unrelated careers). Let’s compare! Here’s my typical weekday:

5:45am: Wake up, shower, make and eat breakfast, which is usually three scrambled pastured eggs, two cups of berries, and a head of raw broccoli. Pack up a big-ass salad to bring with me to work.

6:45am: Leave for work. I usually drive — it takes ten minutes at this time — though at least one day of the week I’ll take the bus.

7:00am: Arrive at Soda Hall. Angrily turn off the lights in the open areas outside of my office after finding out that the people there last night left them on after leaving. Put my salad in the refrigerator. Unlock the door to my shared office, turn on laptop, pull out research and classwork notes. Check calendar and review my plan for the day.

7:15am to 9:15am: Try to make some headway on research. Check latest commits on GitHub for John Canny’s BID Data Project. Pull out my math notes and double-check related code segment from last night’s work to make sure it’s working the way it should be. Make some modifications and run some tests. Find out that only one of my approaches gets even a reasonable result, but it still pales in comparison to the benchmark I’ve set. Pound my fist on the table in frustration, but fortunately no one else notices because I’m still the only one on this floor.

9:30am: Realize that a lecture for my Computer Vision class is about to start. Fortunately, this is Berkeley, where lectures strangely start ten minutes after their listed time, but I need to get there early to secure a front row seat so I can see the sign language interpreters easily. (I can always ask people to move if I have to, and they probably will, but it’s best if I avoid the hassle.)

9:40am to 11:00am: Jitendra Malik lectures about computer vision and edge detectors. I concentrate as hard as I can while rapidly switching my attention between Jitendra, his slides, and my interpreters. Make mental notes of which concepts will be useful for my homework due the following week.

11:00am: Class is finished. Attempt to walk around in the huge crowd of entering/leaving students. Decide that since I don’t have anyone to eat lunch with, I’ll grab something from nearby Euclid street to take to my office.

11:15am to 11:45am: Eat lunch by myself in my office, wishing that there was someone else there. Browse Wikipedia-related pages for Computer Vision concepts from lecture today. Get tripped up by some of the math and vow that I will allocate time this weekend to re-review the concepts.

noon to 2:00pm: Try to get back to research regarding the BID Data Project. Write some more code and run some tests. Get some good but not great results, and wish that I could be better, knowing that John Canny would have been able to do the same work I do in a third of the time. Skim and re-read various research papers that might be useful for my work.

2:00pm to 3:00pm: Take a break from research to have a meeting with another Berkeley professor who I hope to work with. Discuss some research topics and what would be good but not impossible problems to focus on. Tell him that I will do this and that before our next meeting, and conclude on a good note.

3:15pm to 4:30pm: Arrive back in my office. Get my big-ass salad from the refrigerator and drizzle it with some Extra Virgin Olive Oil (I keep a bottle of it on my desk). My office-mate is here, so I strike up a quick chat. We talk for a while and then get back to work. My mood has improved, but I suddenly feel tired so end up napping by mistake for about fifteen minutes. Snap out of it later and try to get a research result done. End up falling short by only concluding that a certain approach will simply not work out.

4:30pm to 5:00pm: Decide to take a break from research frustration to make some progress on my Computer Vision homework. Get stuck on one of the easier physics-related questions and panic. Check the class Piazza website, and breathe a sigh of relief upon realizing that another classmate already asked the question (and got a detailed response from the professor). Click the “thanks” button on Piazza, update my LaTeX file for the homework, and read some more of the class notes.

5:00pm to 5:30pm: Take a break to check the news. Check Google Calendar just in case I forgot to go somewhere today. Check email for the first time today. Most are from random mailing lists. In particular, there are 17 emails regarding current or forthcoming academic talks by visiting or current researchers, but they would have been a waste of time for me to attend anyway due to lack of related background information, and the short notice means it can be hard to get interpreting services. Some of those talks also provide lunches, but I hate going to lunches without having someone already with me, since it’s too hard to break into the social situation. Delete most of the email, respond to a few messages, and soon my inbox is quite clean. (The advantage of being at the bottom of the academic and social totem poles is that I don’t get much email, so I don’t suffer from the Email Event Horizon.)

5:45pm to 6:30pm: Try to break out of “email mood” to get some more progress done on homework. Rack my brain for a while and think about what these questions are really asking me to do. Check Piazza and Wikipedia again. Make some brief solution sketches for the remaining problems.

6:40pm to 7:00pm: Hit a good stopping point, so drive back home. (Still not in the greatest mood, but it’s better than it was before my 2:00pm meeting.) At this point most cars have disappeared from Hearst parking lot, which makes it easier for me to exit. Cringe as my car exits the poorly-paved roadway to the garage, but enjoy the rest of the ride back home as the roads aren’t as congested as I anticipated.

7:15pm: Think about whether I want to go to Berkeley’s Recreational Sports Facility to do some barbell lifting. It’s either going to be a “day A” session (5 sets of 5 for the squat, 5 sets of 5 for the bench) or a “day B” session (3 sets of 5 for the squat, 5 sets of 5 for the overhead press, and 1 set of 5 for the deadlift). I didn’t go yesterday, which means I have to go either now or tomorrow night. After a brief mental war, conclude that I’m too exhausted to do some lifting and mark down “RSF Session” on my calendar for tomorrow night.

7:30pm to 8:00pm: Cook and eat dinner, usually some salad (spring mix, spinach, arugula, carrots, peppers, etc.), more berries (strawberries or blueberries), a half-pound of meat (usually wild Alaskan salmon), and a protein shake. Browse random Internet sites while I eat in my room or out on my apartment’s table.

8:30pm to bedtime: Attempt to get some more work done, but end up making no progress, so pretend to be productive by refreshing email every five minutes and furiously responding to messages. Vow that I will be more productive tomorrow, and set my alarm clock an hour before I really should be waking up.

Deaf-Friendly Tactic: Provide an Email Address

Jan 31, 2015

Update 1/31/2015: I realized just after writing this that video relay is possible with the same phone number … whoops, that shows how long it’s been since I’ve made a single phone call! But in any case, I think the ideas in this article are still valid, and not every deaf person knows sign language.

Original article: In my search for deaf-friendly tactics that are straightforward to implement, I initially observed that it’s so much easier for me to understand someone when he or she speaks clearly (not necessarily loudly). I also pointed out that in a group situation, two people (me and one other person) is optimal (not three, not four…). Two recent events led me to think of another super simple deaf-friendly tactic. In retrospect, I’m surprised it took me a few years to write about it.

I recently had to schedule an appointment with Toyota of Berkeley to get my car serviced. I also received a jury duty summons for late February, and I figured that it would be best if I requested a sign language interpreter to be with me for my summons. Unfortunately, for both of these cases, calling Toyota and the California courts, respectively, seemed to be the only way that I could achieve my goals.

In fact, my jury summons form said the following:

Persons with disabilities and those requiring hearing assistance may request accommodations by contacting the court at [phone number redacted].

There was nothing else. I checked the summons form multiple times. There was no email address, no TTY number, no video relay service number, nothing. Yes, I am not joking. Someone who is hearing impaired — and logically will have difficulty communicating over the phone — will have to obtain jury duty accommodations by … calling the court! I actually tried to call with my iPhone 6. After multiple attempts, I realized that there was a pre-recorded message which said something like: “for doing X, press 1, for doing X, press 2…”, so I had to press a number to talk to a human. Actually, I think it’s probably best that there was no human on the other end, because otherwise I probably would have frustrated him or her by my constant requests for clarification.

I will fully admit that the iPhone 6 is not perfect for hearing aid users because its Hearing Aid Compatible rating is M3, T4 rather than the optimal M4, T4 rating, but still, even after about five or six attempts at calling, I did not understand what numbers corresponded to what activities. Sure, I’m rusty since I make around two phone calls a year to people outside of my immediate family, but I don’t see experience being much of a factor here.

This motivates the following simple deaf-friendly tactic:

Provide an email address (perhaps in addition to a telephone number) that people can use to contact for support, scheduling services, and other activities.

I am aware that deaf people can easily use alternative services, such as TTY or video relay. Such services, however, are far inferior to email in many ways. Email nowadays is so prevalent in our lives and is incredibly easy to use. It’s rare when I don’t have some form of Internet access, so I can effectively check email whenever I want. The fact that I’m also writing instead of talking means that I can do things like revise my ideas more clearly and paste relevant web links. The process of forming an email can sometimes result in me resolving my own situation! I’ve often been in the process of writing an email, but then I realized I needed to add more information to show the person on the other end that I had done my research, but then that extra research I do can lead to an answer.

Furthermore, the set of people who regularly use email is effectively a proper superset of the set of people who use TTY and video relay services. In other words, the vast majority of TTY and video relay users also use email, but the converse is not true. In my case, I have not used TTY and video relay in years; email forms the foundation of almost all my communication nowadays. As long as it doesn’t become an obsession (as in checking it 50 times a day), I don’t see how it interferes that much in my daily life, and I would argue that a telephone call can drag on and on.

Conclusion: if you’re going to provide a phone number for contact, I would strongly urge you to also provide an email address.

Gallaudet University is Searching for a President

Jan 11, 2015

The news is out: Gallaudet University is searching for its eleventh president. Here’s the Presidential Search Advisory Committee web portal and here’s the specific job description, including desired candidate qualifications. I’ll be anxiously following the news. While I have never been on the campus before, I am obviously aware of its history as a college for the deaf (even though it was never on my college radar) and I know several current and former students.

Choosing a president of a college that caters to a specific group of people is a sensitive issue, because often the president is expected to share the same characteristic. For instance, students, faculty, and staff at an all-women’s college or a historically black college might be more favorable towards a female and a black president, respectively. Wellesley College has only had female presidents in its history, and Mount Holyoke College has had mostly female presidents.

Gallaudet is unique in that, as the world’s only university that caters to deaf and hard of hearing students across the board, the president is now expected to be deaf. The first seven presidents of Gallaudet were hearing, and it was not until the now famous 1988 Deaf President Now (DPN) saga that they had a deaf president.

It’s also not enough to just be deaf; the Gallaudet culture prides itself on American Sign Language (ASL), so the president is now expected to be fluent in that language (and immersed in deaf culture). I’m reminded of the 2006 fiasco when Gallaudet appointed Dr. Jane Fernandes as president. Students protested for a variety of reasons, but their argument can be succinctly stated as: “she wasn’t deaf enough.” The board of trustees eventually revoked her appointment. Strangely enough, I don’t remember personally knowing anything about it back in 2006. When I first learned about the incident a few years later, I thought the students mostly embarrassed themselves, but now I’ve become more understanding of their perspective. Incidentally, Dr. Fernandes still ended up with a strong career, as she’s now the president of Guilford College.

Thus, if the next president does not meet the de facto profile requirements, expect the students (and maybe faculty) to protest. The current job description asks that the candidate “has a deep understanding of bilingualism and biculturalism in the deaf community,” though it does not explicitly state that he or she be deaf or be fluent in ASL.

So, as I said, I’ll be anxiously following the news.

New Year’s Resolutions: 2015 Edition

Jan 8, 2015

It’s that time of the year when many people are creating New Year’s resolutions.

Wait, scratch that. We’re a week into 2015, so I think it’s more accurate for me to say: it’s that time of the year when many people have forgotten or given up on their New Year’s resolutions. After all, this guy from Forbes claims that only eight percent of people achieve their resolutions.

Why am I discussing this subject? Last semester, I was in a continuous “graduate student” state where I would read, read, take a few notes, attend classes, do homework, read more research papers, do odd hobbies on weekends, and repeat the cycle. I rarely got the chance to step back and look at the big picture, so perhaps some New Year’s resolutions would be good for me. And before you claim that few people stick with them, I also had New Year’s resolutions for 2014, and I kept my text document about it on my desktop. Thus, I was able to keep them in mind throughout the full year, even if I ended up falling short on many goals (I set the bar quite high).

For a variety of reasons, I had a disappointing first semester, so most of my resolutions are about making myself a better researcher. I think one obstacle for me is the pace at which I read research papers. I’ve always thought of myself as someone who relies less on lectures and more on outside reading in classes than most (Berkeley computer science graduate) students, so I was hoping that my comparative advantage would be in reading research papers. Unfortunately, to really understand even an 8-page conference paper that I need for research, I may end up spending days just to completely get the concepts and to fill in the technical details omitted from the paper due to page limits.

When reading research papers, it’s not uncommon for me to lose my focus, which means I spend considerable time backtracking. Perhaps this could be rectified with better reading habits? I’m going to try and follow the advice in this blog post about reading real books, rather than getting all my news from condensed newspaper or blog articles. (Ironically, I just broke my own rule, but I will cut back on reading blogs and arbitrary websites … and also, I came up with this idea about two weeks ago, so it’s nice to see that there’s someone who agrees with me.) Last week, I read two high-octane thrillers — Battle Royale and The Maze Runner — to get me back into “reading mode” and am moving on to reading non-fiction, scholar-like books. Maybe books will help me quit Minecraft for good (so far, it’s working: I’ve played zero seconds of Minecraft in 2015).

I’ve also recorded some concrete goals for weight lifting (specifically, barbell training), which is one of my primary non-academic hobbies. For the past four years, my motivation to attend the gym has been through the roof. I’ve never missed substantial gym time unless I was traveling. In retrospect, I think programs like Stronglifts and Starting Strength (which I loosely follow) are so popular because they generate motivation. Both use the same set of basic, compound lifts, but as you proceed throughout the programs, you add more weight if it is safe to do so. Obviously, the more weight you can lift, the stronger you are! I often juxtapose weight lifting and addictive role-playing games (RPGs), where my personal statistics in real life barbell lifts correspond to a hypothetical “strength” attribute in an RPG game that I continually want to improve.

Here’s a video of me a few days ago doing the bench press, which is one of the four major lifts I do, the others being the squat, deadlift, and overhead press. I know there’s at least one reader of this blog who also benches, and we’re neck and neck on it so maybe this will provide some motivation (yeah, there’s that word again…).

This is one set of five reps for 180 pounds; I did five sets that day. (The bar is 45 pounds, the two large plates on both sides are 45 pounds, and each side has two 10-pound plates and one 2.5-pound plate.) I remember when I was a senior in high school and couldn’t do a single rep at 135 pounds, so seeing these new results shows how far I’ve come from my earlier days. I’m definitely hoping the same feeling will transition to my research and motivation in general.

Motivation. It’s an incredibly powerful concept, and a must for graduate students to possess with respect to research.

Independent Component Analysis — A Gentle Introduction

Jan 3, 2015

In this post, I give a brief introduction to independent component analysis (ICA), a machine learning algorithm useful for a certain niche of problems. It is not as general as, say, regression, which means many introductory machine learning courses won’t have time to teach ICA. I first describe the rationale and problem formulation. Then I discuss a common algorithm to solve ICA, courtesy of Bell and Sejnowski.

Motivation and Problem Formulation

Here’s a quick technical overview: the purpose of ICA is to explain some desired non-Gaussian data by figuring out a linear combination of statistically independent components. So what does that really mean?

This means we have some random variable (or, more commonly, a random vector) that we observe, and which comes from a combination of different data sources. Consider the canonical cocktail party problem. Here, we have a group of people conversing at some party. There are two microphones stationed in different locations of the party room, and at time indices $t = 1, 2, \ldots$, the microphones provide us with voice measurements $x_1(t)$ and $x_2(t)$, such as amplitudes.

For simplicity, suppose that throughout the entire party, only two people are actually speaking, and that their speech signals are independent of each other (this is crucial). At time index $t$, they speak with signals $s_1(t)$ and $s_2(t)$, respectively. But since the two people are in different locations of the room, the microphones each record signals from a different combination of the two people’s voices. The goal of ICA is, given the time series data from the microphones, to figure out the original speakers’ speech signals. The combination is assumed to be linear in that

$$x_1(t) = a_{11} s_1(t) + a_{12} s_2(t), \qquad x_2(t) = a_{21} s_1(t) + a_{22} s_2(t)$$

for unknown coefficients $a_{11}, a_{12}, a_{21}, a_{22}$.

Here’s a graphical version, from a well-known ICA paper. The following image shows two (unrealistic) waveforms of two people’s voices:


The data that is observed from the two microphones is in the following image:


The goal is to recover the original people’s waveforms (i.e., the two graphs in the first of the two images I posted) when we are only given the observed data (i.e., the two graphs from the second image). Intuitively, it seems like the first observed waveform must have come from a microphone closer to the first person, because its shape more closely matches person 1’s waveform. The opposite is true for the second microphone.

More generally, consider having $n$ microphones and $n$ independent speakers; the numerical equality of microphones and speakers is for simplicity. In matrix form, we can express the ICA problem as $x(t) = A s(t)$, where $A$ is an unknown, square, invertible mixing matrix that does not depend on the time index. Like the assumptions regarding $n$, the invertibility of $A$ is to make our problem simple to start. We also know that all $x(t)$ and $s(t)$ are $n$-dimensional random vectors. The goal is to recover the unseen sources $s(t)$. To simplify the subsequent notation, I omit the $t$ notation, but keep in mind that it’s there.

How does the linear combination part I mentioned earlier relate to this problem formulation? When we express problems in $x = As$ form, that can be viewed as taking linear combinations of components of $s$ along with the appropriate row of $A$. For instance, the first component (remember, there are $n$ of them) of the vector $x$ is the dot product of the first row of $A$ and the full vector $s$. This is a linear combination of independent source signals with coefficients based on the first row of $A$.
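To make the matrix form concrete, here is a minimal NumPy sketch of the mixing step for two speakers and two microphones. The mixing matrix values, the number of time indices, and the choice of Laplace-distributed sources are all assumptions made up for illustration; in the real problem the mixing matrix is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                          # number of time indices (assumed)

# Two independent, non-Gaussian sources: one row per speaker.
s = rng.laplace(size=(2, m))

# Hypothetical mixing matrix; in practice A is unknown.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])

# Column t of x is the observation x(t) = A s(t).
x = A @ s

# The first microphone is the dot product of A's first row with s(t).
first_mic = A[0, 0] * s[0] + A[0, 1] * s[1]
print(np.allclose(x[0], first_mic))
```

Storing one column per time index means a single matrix product mixes the entire time series at once.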

Before moving on to an algorithm that can recover the sources, consider the following insights:

  1. What happens if we know $A$? Then multiply both sides of $x = As$ by $A^{-1}$ and we are done. Of course, the point is that we don’t know $A$. It is what computer scientists call a set of latent variables. In fact, one perspective of our problem is that we need to get the optimal $A^{-1}$ based on our data.
  2. The following ambiguities regarding $A$ will always hold: we cannot determine the variance of the components of $s$ (due to scalars canceling out in $As$) and we also cannot determine the ordering of $s$ (due to permutation matrices). Fortunately, these two ambiguities are not problematic in practice (see footnote 1).
  3. One additional assumption that ICA needs is that the independent source components $s_j$ are not Gaussian random variables. If they are, then the rotational symmetry of Gaussians means we cannot distinguish among the distributions when analyzing their combinations. This requirement is the same as ensuring that the vector $s$ is not multivariate Gaussian.
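The first two insights can be checked numerically. The sketch below uses a made-up mixing matrix `A`, a made-up diagonal scaling `D`, and a 2x2 swap matrix `P` (all assumptions for illustration). It shows that knowing the mixing matrix recovers the sources exactly, and that rescaled or reordered sources produce identical observations.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 500))            # independent non-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                # hypothetical mixing matrix
x = A @ s

# Insight 1: if A were known, inverting it recovers the sources.
s_recovered = np.linalg.inv(A) @ x

# Insight 2a: rescaling the sources by D and the mixing matrix by D^{-1}
# leaves the observations unchanged, so the scale cannot be identified.
D = np.diag([2.0, -0.5])
x_scaled = (A @ np.linalg.inv(D)) @ (D @ s)

# Insight 2b: swapping the sources and the matching columns of A also
# leaves the observations unchanged, so the ordering cannot be identified.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
x_permuted = (A @ P.T) @ (P @ s)

print(np.allclose(s_recovered, s),
      np.allclose(x_scaled, x),
      np.allclose(x_permuted, x))
```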

Surprisingly, as long as the source components are non-Gaussian, ICA will typically work well for a range of practical problems! Next, I’d like to discuss how we can “solve” ICA.

The Bell and Sejnowski ICA Algorithm

We describe a simple stochastic gradient descent algorithm to learn the parameter $W$ of the model. To simplify notation, let $W = A^{-1}$ so that its rows can be denoted by $w_j^T$. Broadly, the goal is to figure out some way of determining the log-likelihood of the training data that depends on the parameter $W$, and then perform updates to iteratively improve our estimated $W$. This is how stochastic gradient descent typically works (strictly speaking, this is gradient ascent, since we are maximizing the log-likelihood), and the normal case is to take logarithms to make numerical calculations easier to perform. Also, we will assume the data are zero-mean, which is fine because we can normally “shift” a distribution to center it at 0.

For ICA, suppose we have $m$ time stamps $t = 1, 2, \ldots, m$. The log-likelihood of the data is

$$\ell(W) = \sum_{t=1}^{m} \left( \sum_{j=1}^{n} \log g'(w_j^T x(t)) + \log |W| \right)$$

where we note the following:

  1. The determinant of $W$ is denoted with single vertical bars surrounding it
  2. $g'$ is the derivative of the sigmoid function (not the sigmoid function itself!)

Let’s explain why this formula makes sense. It comes from taking logarithms of the density of $x(t)$ at each time stamp. Note that $s = Wx$, so if we let $p_x$ denote the density function of $x$, then $p_x(x) = \prod_{j=1}^{n} p_s(w_j^T x) \cdot |W|$, where $p_s$ is the density of the individual source $s_j$. We can split the product this way due to the independence among the sources, and the $w_j$ terms are just (vector) constants so they can be separated as well. For a more detailed overview, see Andrew Ng’s lecture notes; in particular, we need the $|W|$ term due to the effect of linear transformations.

Unfortunately, we don’t know the density of the individual sources, so we approximate them with some “good” density and make them equal to each other. We can do this by taking the derivative of the sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}}, \qquad p_s(z) = g'(z)$$

The reason why this works is that the sigmoid function satisfies the properties of a cumulative distribution function, and by differentiating such a function, we get a probability density function. And since it works well in practice (according to Andrew Ng), we might as well use it.
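Here is a minimal sketch of how the log-likelihood could be evaluated with NumPy, assuming the data are stored as an array `X` with one column per time stamp. The function names and the array layout are my own choices, not something from the lecture notes. It relies on the identity $g'(z) = g(z)(1 - g(z))$, rewritten so the logarithm stays finite for large $|z|$.

```python
import numpy as np

def log_sigmoid_prime(z):
    """log g'(z) for the sigmoid g, computed in a numerically stable way.

    g'(z) = e^{-|z|} / (1 + e^{-|z|})^2, which is symmetric in z.
    """
    a = np.abs(z)
    return -a - 2.0 * np.log1p(np.exp(-a))

def log_likelihood(W, X):
    """Log-likelihood of X (shape n x m, one column per time stamp) given W."""
    Z = W @ X                                  # Z[j, t] = w_j^T x(t)
    _, log_abs_det = np.linalg.slogdet(W)      # log |det W|
    m = X.shape[1]
    return log_sigmoid_prime(Z).sum() + m * log_abs_det

# Sanity check: with W = I and all-zero data, each of the 6 terms is log(1/4).
print(log_likelihood(np.eye(2), np.zeros((2, 3))))
```

Using `np.linalg.slogdet` avoids overflow or underflow when the determinant itself is very large or very small.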

Great, so now that we have the log-likelihood equation, what is the stochastic gradient descent update rule? It is (remember that $g$ is the sigmoid function):

$$W := W + \alpha \left( \begin{bmatrix} 1 - 2g(w_1^T x(t)) \\ \vdots \\ 1 - 2g(w_n^T x(t)) \end{bmatrix} x(t)^T + (W^T)^{-1} \right)$$

where $\alpha$ is the standard learning rate parameter, and the $t$ that we pick for each iteration update varies (ideally sampling from the $m$ training data pieces uniformly). Notice that the term in the parentheses is a matrix: we’re taking an outer product and then adding another matrix. To get that update rule from the log-likelihood equation, we take the gradient $\nabla_W \ell(W)$, though I think we omit the first summation over $m$ terms. Matrix calculus can be tricky and one of the best sources I found for learning about this is (surprise) another one of Andrew Ng’s lecture notes (look near the end). It took a while for me to verify but it should work as long as the summation is omitted, i.e., we do this for a fixed $t$. To find the correct outer product vectors to use, it may help to use the sigmoid’s nice property that $g'(z) = g(z)(1 - g(z))$. Lastly, don’t forget to take the logarithm into account when taking the derivatives.
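For readers who want to check the gradient themselves, here is a sketch of the derivation for a single fixed time stamp, writing $x$ for $x(t)$ and $w_j^T$ for the rows of $W$. It uses two standard facts: $g'(z) = g(z)(1 - g(z))$ and $\nabla_W \log |W| = (W^{-1})^T$.

$$g''(z) = g'(z)\,(1 - 2g(z)) \quad\Longrightarrow\quad \frac{d}{dz} \log g'(z) = \frac{g''(z)}{g'(z)} = 1 - 2g(z)$$

$$\nabla_{w_j} \log g'(w_j^T x) = \left(1 - 2g(w_j^T x)\right) x$$

$$\nabla_W \left( \sum_{j=1}^{n} \log g'(w_j^T x) + \log |W| \right) = \begin{bmatrix} 1 - 2g(w_1^T x) \\ \vdots \\ 1 - 2g(w_n^T x) \end{bmatrix} x^T + (W^T)^{-1}$$

Stacking the per-row gradients as rows gives the outer product, and the determinant term contributes $(W^T)^{-1}$, since $(W^{-1})^T = (W^T)^{-1}$.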

There are whole books written on how to decide when to stop iterating, so I won’t get into that. Once it converges, perform $s(t) = W x(t)$ and we are done, assuming we just wanted the $s(t)$ vectors at all times (not always the case!).
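Putting the pieces together, here is a minimal sketch of the whole procedure in NumPy. Several things are assumptions on my part rather than part of the algorithm as described: the identity starting point, the fixed learning rate, the fixed number of passes standing in for a real stopping rule, visiting time stamps in a random order on each pass, and the toy Laplace data with a made-up mixing matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(W, x, alpha):
    """One update of W from a single observation x (a length-n vector)."""
    # Outer product of the vector with entries 1 - 2 g(w_j^T x) and x,
    # plus the (W^T)^{-1} term that comes from log |W|.
    gradient = np.outer(1.0 - 2.0 * sigmoid(W @ x), x) + np.linalg.inv(W.T)
    return W + alpha * gradient

def ica_sgd(X, alpha=0.01, epochs=5, seed=0):
    """Estimate the unmixing matrix W from X (shape n x m, one column per time stamp)."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)                          # assumed starting point
    for _ in range(epochs):                # fixed pass count, not a real stopping rule
        for t in rng.permutation(m):       # visit time stamps in random order
            W = sgd_step(W, X[:, t], alpha)
    return W

# Toy data: two independent Laplace (non-Gaussian) sources and a made-up A.
rng = np.random.default_rng(42)
S = rng.laplace(size=(2, 500))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)      # center the data at zero

W = ica_sgd(X)
S_hat = W @ X                              # recovered sources, up to scale and ordering
print(W @ A)                               # ideally close to a scaled permutation matrix
```

Because of the scaling and ordering ambiguities, the thing to inspect is whether `W @ A` has one dominant entry per row, not whether `S_hat` matches `S` entry by entry.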

Well, that’s independent component analysis. Remember that this is just one way to solve related problems, and it’s probably on the easier side.

References and Further Reading

  1. Independent Component Analysis: Algorithms and Applications
  2. Andrew Ng’s lecture notes

  1. According to Professor Ng. 

Why I am Against Affirmative Action

Dec 31, 2014

Update 1/1/2015: Happy 2015 everyone! I read Ta-Nehisi Coates’ article The Case for Reparations, which talks about affirmative action a little bit but mostly is about the unfortunate news that, even after slavery ended and after the civil rights era, African Americans are really not equal to whites, a claim that I agree with wholeheartedly. I was surprised and devastated when he described how Congress failed to even consider the possibility of providing any form of reparations. Sadly, as the years go by, the likelihood of making substantial or meaningful reparations declines. Consider the payments to Japanese-Americans who were deported to camps in World War II. (Had I lived in California as a child at that time, I might have been among those people.) But at least some of those victims were able to get reparations during their lifetime. The question of how to do the same for African Americans is much trickier.

Coates’ article brings up affirmative action in several contexts. One is when President Barack Obama said that his children should not benefit from that policy, so I have another supporter there that I didn’t list in my original article. The second is that affirmative action is a tricky policy because there is no clear definition of it (I agree), and Coates appears to be doubtful about affirmative action’s effectiveness in reaching equality among blacks and whites (and Asians?).

When I was reading Coates’ article, I kept thinking about ways to boost minority enrollment in STEM. One effective way, I think, would be to implement affirmative action-like policies not for career positions, but for temporary programs that are sort of “breeding grounds” for such careers or college positions. For instance, many affluent families send their high school children to math and science summer programs. I believe if those programs made conscious efforts to reach out to minorities, or if there were even programs that only accepted minorities, that could serve to be a better place to practice such affirmative action-like policies. I was able to benefit from a program like that, and I think I know a few others like this, but it’s going to take years to see a difference in the racial and gender diversity among academics because of the scarcity of positions and the tenure aspect (which means the previous generation sticks around for decades).

That’s all I have to say for this (first) update. Here is my original post:

Affirmative action is the process of giving preference in some way (usually for employment) to groups of disadvantaged people who have historically been victims of discrimination. Affirmative action policies vary according to country. In the United States, a pure quota requiring that X percent of a workforce belongs to a certain group is illegal, but affirmative action does exist. I would like to present an argument against this policy based on one simple idea, though I should first add the disclaimer that I do not think I have benefited from it or will benefit from it in the future. The only way I would is if some employer wants to hire a deaf person, but I rarely see this discussed in the two cases that I’m familiar with: college admissions, and faculty recruitment in STEM fields. Discussions about the lack of diversity in STEM are dominated by women, African Americans and Hispanics. It annoys me that people with disabilities are often ignored, but maybe I should talk about that later.

I am against affirmative action mostly because it often makes the people who benefit from it feel like they were hired because of affirmative action, and not due to merit. (There’s also the real problem of resentment from those who think they were passed over for jobs, but I think that is less important.) Over the past few years, I have become far more sensitive to issues regarding race and gender, and I have learned about countless stories from people who have lamented how others view them as an “affirmative action hire.” Of these stories, the one that stuck with me the most was that of current Supreme Court Justice Clarence Thomas, who would constantly remark that others stigmatized his Yale law degree as the product of affirmative action rather than his own merit:

Affirmative action (though it wasn’t yet called that) had become a fact of life at American colleges and universities, and before long I realized that those blacks who benefited from it were being judged by a double standard. As much as it stung to be told that I’d done well in the seminary DESPITE my race, it was far worse to feel that I was now at Yale BECAUSE of it. I sought to vanquish the perception that I was somehow inferior to my white classmates by obtaining special permission to carry more than the maximum number of credit hours and by taking a rigorous curriculum of courses in such traditional areas as corporate law, bankruptcy, and commercial transactions. How could anyone dare to doubt my abilities if I excelled in such demanding classes?

One more recent story comes from Professor Carlotta Berry’s New York Times editorial, where she said that she wants to be viewed professionally but understands that some may view her faculty hiring as a product of affirmative action, even though, as she points out, she does not believe she benefited from it.

Having worked with thousands of students, I know for a fact that for many — though by no means all, or even most — there is already a presumption that I, as a female and African-American, am less qualified than my white male colleagues, or at the very least that I was hired in order to meet a double minority quota. And I get it — anti-affirmative-action ideologues have managed to not only demolish the legitimacy of that policy, but tar the reputation of anyone who might have benefited from it (even if, like me, they did not).

Here’s another perspective from a non-beneficiary (I think) by Professor David Aldous of Berkeley, who defended the merit of the statistics faculty members when asked about affirmative action at Berkeley:

But putting aside the cynical view, here’s the bottom line. There are the various cultural pressures we all recognize that traditionally have reduced the number of women and minorities in math and science. Almost nobody objects to the principle of trying to counteract these pressures, but it’s the bureaucratic hassles involved in conforming to rules that create the cynicism. In my Stat department, we have maybe 5 out of 20-odd faculty being women, and they’re all perceived as having been hired on merit, not because of affirmative action. As for minorities, at the faculty level there are so few that it’s not on the radar.

I absolutely agree with him on all counts here. The statistics faculty here at Berkeley are amazing (all of them), and I would love to help reduce the cultural pressures and barriers to STEM for women and minorities.

At Berkeley, undergraduate admissions is race-blind, and one African American student is happy about this because she got in on merit rather than affirmative action:

“I got into Berkeley on my own,” said Nile Taylor, who entered Cal three years after the ban on affirmative action began. “I didn’t get in because they had to meet a quota. I got in because my application was good enough to get into Berkeley. Part of the stigma of people who get in under affirmative action is one, they only got in because of affirmative action — that they’re not considered to be good enough otherwise — and two, I think affirmative action is a Band-Aid. I don’t think it’s a solution.”

I believe these stories reinforce my argument. If there was no affirmative action, I don’t think there would be a need to constantly defend a woman or minority admission or hire on merit. (Note: “minorities” here do not include Asians; Berkeley has plenty of Asian professors.)

Whenever I think about a contentious issue, I often try to understand the opposite perspective: how would I feel if I were someone who clearly is a possible beneficiary of affirmative action? And I believe that I would still agree with my thoughts here. If I were to eventually become a professor or work at some of the many industry alternatives (Google, Microsoft, etc.) I would never want to feel like I was hired on the basis of race or gender. It would make me feel inadequate, and also make me feel guilty about taking away spots from potentially qualified people. Academic jobs are especially scarce nowadays, and there’s no need to increase tension among applicants by forcing affirmative action as a policy.

An argument in favor of affirmative action might be that it helps minorities by increasing the pool of “similar” people (never mind the danger in lumping people together in a group), thus resulting in increased productivity and expertise in the classroom or workforce. For instance, someone who is the only minority in a class might have to work by himself/herself all the time due to social exclusion, but having more minorities increases the pool of people who are easier to work with, which in turn results in better grades, better job performance, etc. I’m a little skeptical of this perspective, because it still carries some stigma. Realize that I say this as someone who has felt excluded from other students in all levels of my education.

Justice Sonia Sotomayor provided other arguments in support of affirmative action when the Supreme Court backed Michigan’s ban on the policy in its public university admissions. One of them was about other aspects of college admissions:

Athletes, children of alumni and students from underrepresented parts of the state, she said, remained free to try to persuade university officials to give their applications special weight.

I agree that this is a problem. I am also against athletes, children of alumni, and those from underrepresented geographical locations getting special weight in the holistic college admissions process, but I do not think these aspects carry as much stigma as affirmative action does, and the point of my argument is that I’m trying to reduce the stigma associated with the policy as much as possible. Again, here’s a disclaimer: I was neither an athlete, nor the child of an alumnus, nor from an underrepresented geographical location when I applied to college. I lived in New York, which is probably the most over-represented state in many northeastern schools, but I somehow still got into Williams, and I am not sure if there was anyone who thought I got in for reasons other than merit. I don’t want the opposite perspective (that admission was a product of affirmative action) to hold true for me or anyone else.

Review of Statistical Learning Theory (CS 281A) at Berkeley

Dec 30, 2014


Now that I’ve finished my first semester at Berkeley, I think it’s time for me to review how I felt about the two classes I took: Statistical Learning Theory (CS 281A) and Natural Language Processing (CS 288). In this post, I’ll discuss CS 281A, a class that I’m extremely happy I took even if it was a bit stressful to be in lecture (more on that later).

First of all, what is statistical learning theory? I view the subject as one that principally deals with the problem of finding a predictive function of data that minimizes a loss function (e.g., squared loss) on training data, and analyzes this problem in a framework that combines machine learning and probability methods. Whereas a standard machine learning course might primarily describe various learning algorithms, statistical learning theory focuses on the subset of these that are most well-suited to statistical analysis. For instance, regression is a common learning algorithm, and regularization is a common (statistical?) technique we use to improve our predictors.

At Berkeley, statistical learning theory is a popular course that attracts an unusually diverse audience of students (by graduate-course standards), not just machine learning theorists. It attracts students from all computer science and statistics research areas, as well as students from mathematics, psychology, and various engineering disciplines. For some reason, this year it was even more popular than usual — we had over 100 at the start (overflowing the largest room in the electrical engineering building). I would have thought that since the popular Professor Michael I. Jordan taught it last spring, that would have pulled away some of the students in this year’s cycle, but I guess not.

In past years, I think CS 281A focused almost exclusively on graphical models. My class seemed different: I had Professor Ben Recht, who was teaching it for the first time, and he changed the material so that we only discussed graphical models for about four lectures, giving us time to go over detection theory, hypothesis testing, and other fields. He told me personally that he dislikes graphical models (and also the EM-algorithm!) so I’m assuming that’s the reason why. We didn’t even get to talk about the junction tree algorithm.

We had five problem sets, which were each challenging but not impossible, and the workload got easier as the class went on. I probably spent an average of 15-20 hours on each problem set, including the “LaTeX-ing” process, but not including general mathematical review, of which I had to do a lot because of some shocking gaps in my linear algebra and probability intuition.

Digression: this semester gave me my first experience with Piazza, a private online forum where students can ask and answer questions related to the class material. (Students can be anonymous to other classmates if desired.) Even though it has some obvious shortcomings, I enjoyed it because it gave me a chance to discuss some of the homework problems “virtually.” Combined with a few in-person collaborations, CS 281a gave me a collaboration experience that I never had at Williams in my math courses. Having Piazza would have made some of those classes much easier!

Back to CS 281A: somewhat unexpectedly, we had a midterm! It was a 25.5-hour take-home midterm, open note, and open internet (!). At first, I was disappointed about having to take a midterm because I think I have proven my ability to understand concepts and describe them under timed exam constraints, but I ended up enjoying the test and benefited from it. I didn’t check, but I’m pretty sure none of these questions could be found online. 24-hour take home exams are the norm at Williams so I had tons of experience with this exam format. In lieu of a final exam, we had a final project, which I described in a previous post.

In terms of the lectures themselves, Professor Recht would alternate between discussing a concept at a high level and then writing some math on the blackboard. Unfortunately, the technical terms in this class made the captioning difficult, as I discussed earlier this semester. (Here’s a sample: Gaussians, Kullback-Leibler Divergence, Baum-Welch, Neyman-Pearson, and Lagrangians. Pretend you don’t know any math beyond calculus and try to spell these correctly.) And also, I didn’t mention this earlier, but for a large lecture course, we had a surprisingly high number of question-answer exchanges, which made it tougher on the captioner, I think, because of the need to listen to multiple people talking. The result was that the screen I looked at, which was supposed to contain the captions, had a lot of gibberish instead, and I bet the students sitting behind me were wondering what was going on. (I sat in the front row.)

I was still able to learn a lot of the material in part because I did a lot of reading — both from the assigned list and from random online sources — to really understand some of this material. I probably need to rely on out-of-class reading more than most (Berkeley computer science graduate) students, so I don’t mind that, and it’s something that graduate students are supposed to do: if you don’t understand something, then learn about it yourself (at first).

Overall verdict: I’m happy I took it. It’s a good introduction to what graduate courses are like, and I will probably take the sequel, CS 281B, the next time it’s offered.

The Advantages of Recitations in Large Lecture Courses

Dec 27, 2014


At Williams, the largest mathematics or computer science course I took had probably 55 students. The normal size was 20 to 35 students. This meant I had many opportunities to talk to the professors without too much competition among the students.

The upper limit of 55 students pales in comparison to one of my courses at Berkeley. Last semester, the Statistical Learning Theory course (CS 281A or STAT 241A) initially had over 100 students after the first few lectures. Eventually, the number of students dropped to roughly 80 by the time the course concluded, but that’s still a considerable number, especially considering that this was a graduate-level course, so shouldn’t there only be a handful of people who are interested in the subject? Of course, even that enrollment pales in comparison to the enrollment in Berkeley’s introductory computer science courses. Here, computer science majors (technically “EECS” majors) are “supposed” to start with the CS 61A, 61B, and 61C series, which are basically “intro to CS,” data structures, and computer organization/architecture. I was told that the enrollments in those three courses last semester after add/drop were (brace yourselves) 1243, 752, and 411 students, respectively! I’m not sure if there’s a single room on the Williams campus that can hold 411 people!

It’s no surprise, therefore, that at CS-intensive universities like Berkeley, MIT, and Washington, large lecture courses like 61{A,B,C} split up students into smaller recitation sessions. These can be thought of as an extra lecture session, but led by a graduate student or an advanced undergraduate. The reduced number of students in these sessions makes it easier to ask questions and to go over confusing concepts.

One understandable reaction to this kind of situation is … why would anyone prefer Berkeley-style lectures to Williams-style lectures, where no recitations with (presumably less-talented) instructors are needed? Certainly this would have been my thought as a high school student, because it seems like I would much prefer the advantage of having a more personal relationship with the brilliant professors and being able to ask real questions during lectures. But on the flip side, there are several advantages to the recitation style.

  1. The recitation instructors are also incredibly brilliant! By making the most out of recitations, students can obtain valuable advice, insights, and strategies from the recitation instructor that might be tailored more towards the student’s needs. HUGE disclaimer: I go to Berkeley, where people in CS are expected to be geniuses or crazy hard workers. I’m aware that many universities have bad TAs.
  2. Sometimes, when it directly concerns specific class assignments, the recitation instructors can be better than the professors at answering questions! This is not the same thing, of course, as saying that the recitation instructors know more about the subject than the professor (or the class lecturer). Professors have clear views on how their field is moving and a broad knowledge base, but a graduate student might have had more time to look specifically at the code for one of the projects and can answer very specific questions, such as whether a line of code is functioning correctly. I often find that when I get past the initial hurdle of understanding an assignment at a high level, it’s the tiny details that remain before I can confidently turn in my work.
  3. I also believe that due to the reduction of question-answering in large lectures, and because of an “extra” lecture/class session due to recitation, computer science courses in a school like Berkeley are able to cover more material than the corresponding Williams courses. At Williams, I remember some lectures that ended up in a back-and-forth question-answering session among the students and the professor. While this does mean we get to answer students’ questions, it also means we don’t cover as much material as a standard lecture-style course would. For instance, after reviewing lectures and syllabi, I think that Berkeley’s data structures course (CS 61B) covers more material than Williams’ data structures course (CS 136), even when ignoring the last two weeks of Berkeley’s class. (Berkeley semesters are two weeks longer than Williams semesters due to the lack of a Winter Study.)

I wish I had better understood the tradeoffs between lectures in large universities and small liberal arts colleges back in high school when I was considering which college to attend. Then again, since I got rejected from almost everywhere I applied despite high GPA and SAT scores, perhaps it wouldn’t have mattered in the end, but I might have considered expanding the pool of my college applications to include more universities.

Detection Theory Adventures (a.k.a. a Final Project)

Dec 18, 2014

Whew! I just spent the past week in a mad dash to finish up my Statistical Learning Theory final project (CS 281a at Berkeley). My write-up is online, in case you want to check it out. The overall goal of my project was to explore the area of detection theory, an important mathematical field that does have practical implications. I know every field likes to say that (and in a sense, maybe it’s true for all fields anyway) but seriously — detection theory is about trying to distinguish the signal from the noise. Suppose we see a bunch of data points that are generated from some underlying process. This goes on for a while, but then at some point, the chart we see spikes up. That could indicate that something’s wrong. There are tons of examples that match this kind of situation. The example I used in my report was monitoring a patient’s body temperature. If I’m taking my temperature every 12 hours, for instance, and I see the following numbers: 98.6, 98.6, … (a long time) …, 98.6, 99.1, 99.5, 99.7, 100.2, 100.0, 101.1, by the time I’m getting even past 99.5 I should be a little suspicious and think that the underlying process for my body temperature indicates that I have a fever.
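To make the temperature example concrete, here is a minimal sketch of one classic detection rule, a one-sided CUSUM (cumulative sum) test. It is only an illustration and not necessarily what the write-up uses; the function name cusum_first_alarm, the 98.6 baseline, the 0.25 slack, and the 1.0 alarm threshold are all assumptions picked for this toy series.

```shell
# One-sided CUSUM sketch: reads one temperature per line on stdin and prints
# the 1-based index and value of the first reading that raises an alarm.
# Baseline (mu), slack (k), and threshold (h) are illustrative assumptions.
cusum_first_alarm() {
  awk -v mu="${1:-98.6}" -v k="${2:-0.25}" -v h="${3:-1.0}" '
    {
      s += $1 - mu - k          # accumulate drift above baseline plus slack
      if (s < 0) s = 0          # one-sided: the statistic never goes negative
      if (s > h) { print NR, $1; found = 1; exit }
    }
    END { if (!found) print "no alarm" }
  '
}

# The series from the text, with the long run of 98.6 shortened to three readings.
printf '%s\n' 98.6 98.6 98.6 99.1 99.5 99.7 100.2 100.0 101.1 | cusum_first_alarm
# prints: 6 99.7
```

With these particular settings the alarm fires on the 99.7 reading, the first one past 99.5. A larger threshold would wait for more evidence, and a smaller one would react sooner at the cost of more false alarms.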

I learned a lot from my final project, since I read about 15 academic papers (which are not easy to read) and skimmed over many others. Despite this, I am not that happy with how it ended up, because my experiments were not extensive or groundbreaking. On the other hand, perhaps this kind of work is what I should expect if I’ve got only four weeks for a class project. It wouldn’t be the first time that I’ve been excessively harsh on myself.

By the way, my report is written in the style and formatting of the Neural Information Processing Systems (NIPS) conference. NIPS is one of the top two academic machine learning research conferences, with the other one being the International Conference on Machine Learning (ICML). Their papers have a nine-page limit, with the ninth one reserved for references only, but I’ve noticed that in practice a lot of researchers end up putting a ton of proofs and other information in appendices or supplementary material after the first nine pages. I have seen supplementary material sections that were 30 pages long! This is allowed, because NIPS guidelines say that extra material after nine pages is fine with the understanding that reviewers are not obligated to read it. I found the eight-page limit for the main text to be easy to reach with this simple project, which is funny because I’ve long viewed eight-page papers/reports to be long for a high school or college class. Furthermore, many of my previous class papers had to be double-spaced in 12-point font, whereas in NIPS they cram everything down with single-spaced, 10-point font. I had to fiddle around with a lot of the text to get everything to squish into eight pages, and as my last step, I used the LaTeX command \vskip -10pt to condense the “Acknowledgments” subsection heading with its text. I guess that’s what academic writing is like?

Brain Dump: Successfully Installing and Running the Moses Statistical Machine Translation System

Nov 19, 2014

I’m using Moses again. It’s an open-source statistical machine translation system. I first used it when I was at Bard in 2012, and I remember being clueless about the installation process and being overwhelmed by all the Linux/Unix commands I had to know. I think it took me more than a week before I had installed Moses and successfully completed the suggested baseline test. At that time, I was doing nothing but that … so I was pretty frustrated. Even at the end of my REU, the commands to run Moses felt like black magic.

But I’m back at it again. Armed with several more years of hacking and Linux/Unix experience, as well as a statistical natural language processing class for background material, I managed to install Moses and complete the baseline test on my laptop, which is a Macbook that I got last January. It still took me almost a week (Friday night, 9-5 Saturday, 9-5 Sunday, all day Monday and Tuesday…) to do that since I ran into some unexpected problems. To prevent me from getting this kind of headache again, I’ll be listing the steps I conducted to install and run the baseline, which hopefully will be applicable for anyone trying to use Moses right now. If you find this article useful, please let me know. (Keep in mind, however, that it will probably be obsolete in a few months.) You will need some Linux/Unix experience to follow this article.

At a high level, I had problems with installing boost, installing mgiza, and training the baseline system.

To start, I git cloned the moses code. The installation instructions say that I need g++ and boost installed. Actually, I don’t think g++ itself is necessary: in Mavericks (i.e., OS X 10.9), Apple changed the default C++ compiler to clang, so that is what we really need to build Moses. Clang is built in or comes with Xcode, so it should be there as long as I have Xcode. As for boost, I did have boost 1.56 installed; I got it via the command brew install boost, which installs boost 1.56 in /usr/local/Cellar/boost/, because that’s the default place where homebrew installs things. For example, I also have the widely-used numpy package located in /usr/local/Cellar/numpy.

So, thinking that I had things taken care of, I went into my mosesdecoder directory, and followed the instructions by running ./bjam -j8. Unfortunately, I ran into the dreaded “clang: error: linker command failed with exit code 1.”

$ ./bjam -j8  
Tip: install tcmalloc for faster threading. See BUILD-INSTRUCTIONS.txt for more information.
mkdir: bin: File exists
...found 4469 targets...
...updating 155 targets... lm/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/query
ld: library not found for -lboost_thread
clang: error: linker command failed with exit code 1 (use -v to see invocation)

// Additional error messages...

...failed mert/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/util_test...
...skipped <pmert/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi>util_test.passed for
lack of <pmert/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi>util_test... mert/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/vocabulary_test
ld: library not found for -lboost_thread
clang: error: linker command failed with exit code 1 (use -v to see invocation)

"g++" -o "mert/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/vocabulary_test"
-lboost_unit_test_framework -llzma -lbz2 -ldl -lboost_system -lz -lboost_thread -lm -liconv -g
-Wl,-dead_strip -no_dead_strip_inits_and_terms
...skipped <pmert/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi>vocabulary_test.passed
for lack of <pmert/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi>vocabulary_test...
...failed updating 72 targets…
...skipped 83 targets…

The build failed. If you need support, run:
./jam-files/bjam -j8 --debug-configuration -d2 |gzip >build.log.gz
then attach build.log.gz to your e-mail.
You MUST do 3 things before sending to the mailing list:
1. Subscribe to the mailing list at
2. Attach build.log.gz to your e-mail
3. Say what is the EXACT command you executed when you got the error

Huh. I can’t even do a simple installation?!? Note: in the above terminal output, I included my original command (after the dollar sign $), and the error message was much longer than what’s displayed; I got rid of some of it with // Additional error messages for the sake of clarity.

What’s the problem? Apparently, moses couldn’t find the boost_thread library (that is what the -lboost_thread linker flag asks for). After some extensive research on Google and asking on the Moses mailing list, I think the issue comes down to the layout being layout=tagged versus layout=system. To give an example, the library file that I think -lboost_thread refers to is libboost_thread-mt.a, which is a tagged version due to having mt; the untagged file version would be libboost_thread.a. I think this was the problem I was getting, but I couldn’t figure it out despite making moses look at the directory where boost was installed. In the instructions, they say to do ./bjam --with-boost=~/workspace/temp/boost_1_55_0 -j8 where the workspace/temp folder is just where they’ve put boost. On my system, it’s obviously in a different location, so I ran ./bjam --with-boost=/usr/local/Cellar/boost -j8.

Unfortunately, that also didn’t work. Note: the tilde indicates the path to the home directory, so on my computer, ~/workspace would be equivalent to /Users/danielseita/workspace. The /Users/danielseita part equals the value of the $HOME variable.
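One quick way to see which naming layout a boost build produced is to look at the file names in its lib directory. The following is a small sketch under that assumption; boost_thread_layout is a hypothetical helper name, and the directory to pass in depends on where boost ended up on a given machine.

```shell
# Report which naming layout of libboost_thread is present in a lib directory.
# "tagged" builds carry a suffix such as -mt; "system" builds do not.
boost_thread_layout() {
  local libdir="$1" tagged=no system=no f
  for f in "$libdir"/libboost_thread-*.a "$libdir"/libboost_thread-*.dylib; do
    if [ -e "$f" ]; then tagged=yes; fi
  done
  for f in "$libdir"/libboost_thread.a "$libdir"/libboost_thread.dylib; do
    if [ -e "$f" ]; then system=yes; fi
  done
  case "$tagged/$system" in
    yes/yes) echo "both" ;;
    yes/no)  echo "tagged" ;;
    no/yes)  echo "system" ;;
    *)       echo "missing" ;;
  esac
}

# Example (placeholder path): boost_thread_layout /path/to/boost/lib
```

If this reports "tagged" while the link line asks for plain -lboost_thread, that mismatch is consistent with the "library not found" error shown earlier.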

I asked on the mailing list, and their advice was to do some clean installations because it’s obvious something was broken here, especially with boost. All right, then, the first step to do that is to uninstall boost: brew uninstall --force boost.

I went through several more trials of installing boost via homebrew before I decided to avoid using it at all; I went to the boost website directly, downloaded the .tar.gz file for version 1.57 (the latest version at the time of this writing), and untarred it: tar -zxvf boost_1_57_0.tar.gz.

That extracts the boost files into the current directory, but now we have to compile it for Moses. At the time of this writing, the Moses installation instructions say to execute the following two commands in the boost directory:

./b2 -j8 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static install || echo FAILURE

Unfortunately, running those commands never helped, and I consistently got the same number of errors/warnings each time. After a series of uninstallations/installations, I decided to just try following the instructions directly from the boost website, specifically from their “Getting Started” section. I attempted the following set of commands from a clean download folder boost_1_57_0:

cp -r boost_1_57_0/ /usr/local  
cd /usr/local/boost_1_57_0  
./b2 install  

And in addition to that … I went into Apple’s Finder window, double-clicked on the Xcode application (just to check if I had clang installed) … and upon doing so, I received a pop-up message saying that there was some external software I needed to install! I wish I had gotten a screenshot of that, but it showed up as soon as I double clicked on the application, and in a few seconds it had installed what it needed. It looked like I did have clang installed, but I have no idea about how much that initial pop-up must have helped me.

I then tried to compile moses again with a simple ./bjam -j8 command. Aaaaaand … it worked! See some of the output (again, the dollar sign indicates lines that I typed):

$ cd ~/mosesdecoder/  
$ ./bjam -j8  
Tip: install tcmalloc for faster threading. See BUILD-INSTRUCTIONS.txt for more information.  
mkdir: bin: File exists  
...found 4470 targets...
...updating 63 targets...
common.copy /Users/danielseita/mosesdecoder/bin/lmplz util/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/bit_packing_test util/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/multi_intersection_test util/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/joint_sort_test util/bin/file_piece_test.test/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/file_piece_test lm/bin/partial_test.test/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/partial_test lm/bin/left_test.test/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/left_test lm/bin/model_test.test/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/model_test  
testing.unit-test util/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/joint_sort_test.passed  
Running 4 test cases...

*** No errors detected  
testing.unit-test util/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/multi_intersection_test.passed  
Running 4 test cases...

// Additional cases not shown

***  No errors detected  
testing.unit-test mert/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/util_test.passed  
Running 3 test cases...

*** No errors detected  
testing.unit-test mert/bin/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/vocabulary_test.passed  
Running 2 test cases...

*** No errors detected  
testing.capture-output moses/LM/bin/BackwardTest.test/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/
**passed** moses/LM/bin/BackwardTest.test/darwin-4.2.1/release/debug-symbols-on/link-static/threading-multi/BackwardTest.test
...updated 63 targets...

What’s the lesson here? I guess the point is that deleting stuff and just starting an installation process all over again while trying out new ways of doing things is one of the most effective techniques for tackling complicated software problems. Yeah, it’s kind of lame, but I guess trial-and-error is the nature of Linux/Unix.

All right, now let’s discuss the second main problem I had with installing Moses, this time related to mgiza. I initially ran into the problems that I will describe below, but the fix above (uninstalling boost, taking it from the website, and compiling it according to the boost website’s instructions) seemed to resolve them. But I will describe the problem I had with mgiza just for completeness.

On the Moses website, they now recommend avoiding GIZA++ (which is a popular word-alignment software) and instead using mgiza, which is a multi-threaded version of GIZA++. As with moses, I git cloned it into a directory. The installation instructions are really simple, since mgiza comes with makefiles (the cmake command will generate a Makefile). There are three commands to execute:

cmake .  
make  
make install

Notice that cmake has a period after it, which tells cmake to use the current directory as the source directory (and, here, as the build directory too).

Of course, despite the simplicity of these instructions, I still got errors. The cmake step seemed to work fine, but not make:

1 warning generated.  
Linking CXX executable ../bin/d4norm  
Undefined symbols for architecture x86_64:  
"std::string::_Rep::_M_destroy(std::allocator<char> const&)", referenced from:  
boost::system::(anonymous namespace)::generic_error_category::message(int) const in libboost_system-mt.a(error_code.o)  
"std::string::_Rep::_S_empty_rep_storage", referenced from:  
boost::system::(anonymous namespace)::generic_error_category::message(int) const in libboost_system-mt.a(error_code.o)  
"std::string::assign(char const*, unsigned long)", referenced from:  
boost::system::(anonymous namespace)::generic_error_category::message(int) const in libboost_system-mt.a(error_code.o)  
"std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)", referenced from:  
boost::system::(anonymous namespace)::generic_error_category::message(int) const in libboost_system-mt.a(error_code.o)  
"std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)", referenced from:  
boost::system::(anonymous namespace)::generic_error_category::message(int) const in libboost_system-mt.a(error_code.o)  
"std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()", referenced from:  
boost::system::(anonymous namespace)::generic_error_category::message(int) const in libboost_system-mt.a(error_code.o)  
ld: symbol(s) not found for architecture x86_64  
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Yikes … another problem with boost! And this time, the “ld: symbol(s) not found for architecture x86_64” error occurred. Again, as I mentioned earlier, the solution lies not with mgiza itself but with boost (where it’s installed, etc.), so consider a clean installation and compilation of boost from the official website (not homebrew). When I did that, I deleted the mgiza directory, cloned it again from git, and the three subsequent commands worked. I got a ton of warning messages, but no errors, which is the important part.
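The undefined symbols in that log are a hint in themselves: std::string::_Rep is an internal type of GNU libstdc++, so the boost archive was probably built against libstdc++ while the final link used clang’s default libc++. That reading is an inference from the symbol names, not something confirmed in the build. Here is a rough sketch of the same check, with guess_stdlib as a hypothetical helper and make.log as a placeholder file name:

```shell
# Guess which C++ standard library the undefined symbols in a linker log
# belong to. Heuristic only: std::__1:: is libc++'s inline namespace, while
# std::string::_Rep is an internal type of GNU libstdc++.
guess_stdlib() {
  local log="$1"
  if grep -q 'std::__1::' "$log"; then
    echo "libc++"
  elif grep -q 'std::string::_Rep' "$log"; then
    echo "libstdc++"
  else
    echo "unknown"
  fi
}

# Example (placeholder file name): make 2> make.log; guess_stdlib make.log
```

An answer of "libstdc++" here would point at the way boost was compiled rather than at mgiza, which fits the fix of rebuilding boost from scratch.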

Whew! With moses and mgiza successfully compiled, I could finally start the baseline! Most of the time, copying and pasting the instructions from the Moses website and modifying them according to your directory structure should work, but there are some important things to be aware of:

(1) Installing irstlm is a little different because now it’s version 5.80.06 rather than the 5.80.03 that’s currently listed on the website. (In fact, irstlm 5.80.03 does not even compile on my laptop.) With this new version, irstlm changed its directory structure, so the installation won’t work if you copy the baseline’s commands verbatim. There’s a README in the trunk directory inside irstlm so I followed that and didn’t seem to have many issues. Make sure you modify the rest of the baseline’s commands accordingly, since the documentation assumes that we use ~/irstlm/bin as the place where the binaries are located.

(2) One thing I had trouble with was the “--text yes” option for the compile-lm. That created a “DEBUG: warning: too many parameters” output, as described here. The key is to use “--text=yes”, so put in the equals sign. I can’t believe that fix took so long for me to figure out! I described this on the mailing list and the people maintaining the Moses website have (as of this writing) changed the instructions to conform to “--text=yes”.

(3) With training, the current instructions assume we’re using GIZA++, but since it’s mgiza, we’ve got to change that. Oh, that reminds me: I also tried out these training steps with normal GIZA++, but I could never get it to work because after training went on for a few hours, something went wrong with the lexical reordering score, because extract.o.sorted.gz never got generated. Here’s the error I got in the log file training.out:

libc++abi.dylib: terminating with uncaught exception of type  
util::ErrnoException: util/ in int util::OpenReadOrThrow(const char *)  
threw ErrnoException because '-1 == (ret = open(name, 0x0000))'.

No such file or directory while opening  

ERROR: Execution of:  
/Users/danielseita/working/train/model/extract.o.sorted.gz 0.5  
-model "wbe msd wbe-msd-bidirectional-fe"

died with signal 6, without coredump

The error above prevented a moses.ini file from even forming. To work around this issue, it’s possible to run the training while removing the -reordering argument, but when I did that, tuning failed. If you know of a fix, please let me know!

Now back to training with mgiza. The instructions are listed on the external software page. We need to specify the -mgiza flag, and we also need to make sure that -external-bin-dir is set correctly. This should be a directory where the mgiza binary files are located, as well as the script (all of these must be in one directory together!). In the current version of mgiza, they seem to be in /mgiza/mgizapp/bin. I decided to move everything over to a directory called word_align_tools inside the mosesdecoder directory. I also specified the number of CPUs for mgiza to be four. The following code shows the list of commands I used to prepare my file system for training:

cd ~/mosesdecoder/  
mkdir word_align_tools  
cp ~/mgiza/mgizapp/bin/* word_align_tools/  
cp ~/mgiza/mgizapp/scripts/ word_align_tools/

And this does the actual training:

-root-dir /Users/danielseita/working/train \
-corpus /Users/danielseita/corpus/ \
-f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
-lm 0:3:/Users/danielseita/lm/ \
-mgiza -mgiza-cpus 4 \
-external-bin-dir /Users/danielseita/mosesdecoder/word_align_tools/ >&training.out

I put in some backslashes there just so the code wouldn’t require so much scrolling to understand, but I didn’t include them when I ran it in Terminal. Also notice that I’m using absolute paths, rather than relative paths. This means we want a full path like /Users/danielseita/mosesdecoder instead of ~/mosesdecoder. I saw some errors online that were resolved by using full paths, so I decided to use full paths for everything, though it’s probably worth it only for the -root-dir argument above. Finally, I’m not adding in a “:8” to the language model argument, because I don’t know what that does (to do: find out!).

Update 11/29/14: Apparently, that extra :8 argument forces the use of KenLM instead of SRILM … so yes, include an extra :8 after the language model file name. The above command, while still correct I believe, should not be used.

That command above took five or six hours to complete on my laptop, which is surprising because the baseline currently claims it should be 1.5 hours, and I’ve got a pretty powerful laptop. Perhaps the data I downloaded got updated and is larger than the baseline claims? Or maybe it’s an mgiza issue?
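Going back to the absolute-path advice: the shell only expands an unquoted ~ at the start of a word, so a tilde buried inside an option value or inside quotes reaches the program as a literal character. The sketch below shows the difference, along with a small helper (abs_dir, a hypothetical name) for turning an existing directory into a full path before passing it along. The option name --some-option is made up for the demonstration.

```shell
# Resolve an existing directory to its absolute, symlink-free path.
# Prints nothing and returns non-zero if the directory does not exist.
abs_dir() {
  (cd "$1" 2>/dev/null && pwd -P)
}

# An unquoted "~" at the start of a word is expanded by the shell...
printf '%s\n' ~/mosesdecoder
# ...but after "=" inside an option, or inside quotes, it stays literal.
printf '%s\n' --some-option=~/mosesdecoder "~/mosesdecoder"
# "$HOME" expands in both positions.
printf '%s\n' --some-option="$HOME/mosesdecoder"
```

Writing out the full path, or building it from "$HOME" or abs_dir, sidesteps the question of whether a particular script expands the tilde on its own.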

Once training is done, here’s a very important piece of advice: change the resulting moses.ini file to say KENLM instead of SRILM. This means my moses.ini file had a line that looked like this:

KENLM name=LM0 factor=0 path=/Users/danielseita/lm/ order=3

I’m not sure why this isn’t emphasized that much on the baseline webpage, because it’s pretty darn important! The tuning will fail if we don’t get the language model set correctly.
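For anyone who would rather script that edit than open the file by hand, here is a minimal sketch. It assumes the generated line looks like the one above except that it starts with SRILM; use_kenlm is a hypothetical helper name, and it is worth keeping a backup copy of moses.ini before running anything like this.

```shell
# Rewrite language-model feature lines that start with "SRILM " so that they
# start with "KENLM " instead. Writes to a temp file and moves it into place,
# because in-place editing with "sed -i" differs between BSD and GNU sed.
use_kenlm() {
  local ini="$1"
  sed 's/^SRILM /KENLM /' "$ini" > "$ini.tmp" && mv "$ini.tmp" "$ini"
}

# Example: use_kenlm /Users/danielseita/working/train/model/moses.ini
```

Only lines beginning with SRILM are touched, so the rest of the configuration is left as training wrote it.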

(4) I ran into many problems with tuning, but most of them were “carried over” from the training step, such as if I made an error in the command for training. Once I finally got mgiza working, tuning was fine, and the following command (for tuning) ran in the normal time range, which is about four hours.

/Users/danielseita/working/train/model/moses.ini -decoder-flags="-threads 4"
-mertdir /Users/danielseita/mosesdecoder/bin/ >& mert.out

(5) Testing should proceed as normal. My BLEU score is 23.58, which is what they expect (they say they got 23.5).

$ ~/mosesdecoder/scripts/generic/multi-bleu.perl -lc  
~/corpus/newstest2011.true.en < newstest2011.translated.en  
BLEU = 23.58, 60.3/29.9/16.9/10.1 (BP=1.000, ratio=1.017, hyp_len=76035, ref_len=74753)
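The numbers in parentheses can be sanity-checked by hand. Under the standard BLEU definition, the ratio is hypothesis length over reference length, and the brevity penalty (BP) is 1 whenever the hypothesis is at least as long as the reference and exp(1 - ref/hyp) otherwise. A small sketch of that arithmetic (bleu_length_stats is a hypothetical helper name):

```shell
# Recompute the length ratio and brevity penalty from the hypothesis and
# reference lengths, following the standard BLEU definition.
bleu_length_stats() {
  awk -v hyp="$1" -v ref="$2" 'BEGIN {
    bp = (hyp > ref) ? 1 : exp(1 - ref / hyp)   # no penalty for long output
    printf "ratio=%.3f BP=%.3f\n", hyp / ref, bp
  }'
}

bleu_length_stats 76035 74753
# prints: ratio=1.017 BP=1.000
```

Since the translated output here is slightly longer than the reference, the penalty is exactly 1 and the 23.58 comes entirely from the n-gram precisions.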

Looking back at this, while Moses is no doubt a great software system, it does take a lot of effort to get it working correctly. Sometime in the next few weeks, I’ll try to post a real, organized guide here. UPDATE May 16, 2015: I doubt I’ll ever get a real guide up here, since Moses isn’t really an integral part of my research now. Sorry about that!

Steve Ballmer’s (Subtle) Jab at UC Berkeley

Nov 15, 2014

Well, the news is out. Former Microsoft CEO and current Los Angeles Clippers owner Steve Ballmer just donated enough money to the Harvard computer science department to fund twelve professorships. Twelve! To put that in perspective, that’s 50% more than the total number of computer science professors at Williams College, and about half of the current size of Harvard’s CS faculty.

While it’s no doubt thrilling to see the attention that computer science is getting nowadays, I couldn’t help but notice this little segment from The Crimson:

“Right now I think everybody would agree that MIT, Stanford, and Carnegie Mellon are the top places [for computer science],” Ballmer said, adding that some would also include the University of California at Berkeley. “I want Harvard on that list.”

Wait a second, did Ballmer just exclude Berkeley from the Stanford, CMU, and MIT group? Last I checked, they were all clustered together at rank one … perhaps the exclusion is related to how Berkeley’s a public university? I can’t really think of any other reason. And while he did mention the school, don’t you think that if he viewed the top schools as a group of four, he would have said “I think everyone would agree that MIT, Stanford, Carnegie Mellon, and Berkeley are the top places […]” instead?

Anyway, I hope Berkeley can maintain its reputation for the next few years. This is mainly so that people will be willing to take me seriously at a first glance/conversation when discussing research; beyond that, of course, they’ll care more about your actual record than the school you go to. But it helps to go to a highly-ranked school. And I’m sure that some of Harvard’s new faculty members will have gotten their Ph.D.s from Berkeley. Incidentally, the fourth and fifth year students at Berkeley who have strong publication records must be feeling ecstatic.