My Blog Posts, in Reverse Chronological Order
I first heard about the paper Learning to Act by Predicting the Future after one of the authors, Vladlen Koltun, came to give a highly entertaining talk as part of Berkeley’s Deep Learning seminar course (CS 294-131).
In retrospect, I’m embarrassed it took me this long to find out about the work. It’s research that feels highly insightful and should have been clear to us all along — yet somehow we never saw it until those authors presented it to us. To me, that’s an indicator of high-caliber research.
Others have agreed. Learning to Act by Predicting the Future was accepted as an oral presentation at ICLR 2017, meaning that it was one of the top 15 or so papers. You can check out the favorable reviews on OpenReview. It was also featured on Adrian Colyer’s blog. And of course, it was featured in my Deep Learning class.
So what is the research contribution of the paper? Here’s a key passage in the introduction which explains their framework:
Our approach departs from the reward-based formalization commonly used in RL. Instead of a monolithic state and a scalar reward, we consider a stream of sensory input $s_t$ and a stream of measurements $m_t$. The sensory stream is typically high-dimensional and may include the raw visual, auditory, and tactile input. The measurement stream has lower dimensionality and constitutes a set of data that pertain to the agent’s current state.
To be clear, at each time $t$, we get one sensory input $s_t$ and one set of (scalar-valued) measurements $m_t$, so our observation is $o_t = \langle s_t, m_t \rangle$. Their running test platform in the paper is the first-person shooter Doom environment, so $s_t$ represents images and $m_t$ represents attributes in the game such as health and supply levels.
This is an intuitive difference between $s_t$ and $m_t$. There are, however, two important algorithmic differences:
- Given actions taken by the agent, they attempt to predict future values of $m_t$, hence “predicting the future”. It’s very hard to predict full-blown images, but predicting (much-smaller) measurements shouldn’t be nearly as challenging.
- The measurement vector $m_t$ is used to shape the agent’s goals. They assume the agent wants to maximize
$$u(f; g) = g^\top f, \quad \text{where} \quad f = \langle m_{t+\tau_1} - m_t, \ldots, m_{t+\tau_n} - m_t \rangle.$$
Thus, the goal is to maximize this inner product of the future measurements $f$ and a parameter vector $g$ weighing the relative importance of each term. Note that this instantly generalizes the case with a scalar reward signal in MDPs: we’d set the elements of $g$ such that they are $1, \gamma, \gamma^2, \ldots$, i.e. corresponding to discounted rewards. (I’m assuming that $m_t$ is a scalar here, but this generalizes to the vector case with $g$ a matrix, as we could flatten $f$ and $g$.)
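To make the "scalar reward is a special case" remark concrete, here is a minimal sketch in plain Python. The function names and the toy reward values are made up for illustration (they are not from the paper), and the per-step rewards are simply treated as a one-dimensional measurement stream.

```python
def goal_objective(future_measurements, goal):
    """Inner product of future measurements and goal weights."""
    if len(future_measurements) != len(goal):
        raise ValueError("measurement and goal vectors must have equal length")
    return sum(g * f for g, f in zip(goal, future_measurements))


def discounted_goal(horizon, gamma):
    """Goal weights (1, gamma, gamma^2, ...) that mimic a discounted return."""
    return [gamma ** k for k in range(horizon)]


# Hypothetical per-step rewards, treated as a scalar "measurement" stream.
rewards = [1.0, 0.0, 2.0]
goal = discounted_goal(len(rewards), gamma=0.5)

# With these weights the objective is just the usual discounted return.
discounted_return = goal_objective(rewards, goal)
```

Swapping in a different goal vector changes what the agent cares about without touching the measurements themselves, which is the flexibility the formulation is after.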
In order to predict $f$, they have to train a function to do so, which they parameterize with (you guessed it) deep neural networks. They define the function $F$ as the predictor, with
$$p_t^a = F(o_t, a, g; \theta).$$
Thus, given the observation, action, and goal vector parameter, we can predict the resulting measurements, so that during test-time applications, $p_t^a$ is “plugged in” for $f$ and the action $a$ which maximizes $g^\top p_t^a$ is chosen. To make this work mathematically, of course, $f$, $g$, and $p_t^a$ must all have the same dimension. And to be clear, even though the reward (as they define it) is a function of $g$, we are not “training” the $g$ parameters but the parameters $\theta$ for $F$.
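The test-time rule described above (score each action by the inner product of the goal vector with the predicted measurements, then take the best one) can be sketched as follows. The `toy_predictor` lookup table is a hypothetical stand-in for the trained network, and the action names and numbers are invented for the example.

```python
def select_action(predict, observation, actions, goal):
    """Pick the action whose predicted measurements score highest under the goal."""
    def score(action):
        prediction = predict(observation, action, goal)
        return sum(g * p for g, p in zip(goal, prediction))
    return max(actions, key=score)


def toy_predictor(observation, action, goal):
    """Hypothetical stand-in for a trained predictor.

    Returns predicted changes in (health, ammo) for each action.
    """
    table = {
        "advance": [-5.0, 0.0],
        "shoot": [0.0, -1.0],
        "pick_up_medkit": [20.0, 0.0],
    }
    return table[action]


actions = ["advance", "shoot", "pick_up_medkit"]
goal = [1.0, 0.5]  # weigh health twice as heavily as ammo
best = select_action(toy_predictor, observation=None, actions=actions, goal=goal)
```

Note that only the goal vector changes between "care about health" and "care about ammo" behavior; the predictor is queried the same way in both cases.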
The parameters $\theta$ of $F$, incidentally, are trained in an unsupervised manner, or using “self-supervision” since the labels can be generated automatically by having the agent wander around in the world and then repeatedly computing the value of the function output at each of those time steps. Then, after some time has passed, we simply minimize the loss. Nice, no humans needed for labeling! When I was reading this, I was reminded of the Q-learning update, since the update rule automatically assumes that the “target” is the usual “reward plus discounted max Q-value” thingy, without human intervention. To further the connection with Q-learning, they use an experience memory in the same way as the DQN algorithm used experience replay (see my earlier blog post about DQN). Another concept that came to mind was Sergey Levine’s excellent paper on learning hand-eye coordination, where he and his collaborators were able to automatically generate labels. I need to figure out how to do stuff like this more often.
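Here is a rough sketch of how such labels could be generated from a logged episode without any human annotation. It assumes the targets are changes in a measurement relative to the current time step, uses a shortened, illustrative set of offsets, and uses a plain squared error as the loss; the `health_log` values are made up.

```python
OFFSETS = (1, 2, 4)  # illustrative offsets, shorter than those used in the paper


def build_targets(measurements, offsets=OFFSETS):
    """For each step with enough future data, record measurement changes at each offset."""
    horizon = max(offsets)
    targets = []
    for t in range(len(measurements) - horizon):
        now = measurements[t]
        targets.append([measurements[t + tau] - now for tau in offsets])
    return targets


def squared_error(prediction, target):
    """Squared L2 distance between a prediction and its automatically generated label."""
    return sum((p - y) ** 2 for p, y in zip(prediction, target))


# Hypothetical health readings logged while the agent wanders around.
health_log = [100, 95, 95, 80, 80, 90, 90]
targets = build_targets(health_log)

# A predictor that always outputs "no change" would incur this loss on the first label.
baseline_loss = squared_error([0, 0, 0], targets[0])
```

The last few time steps of an episode produce no label because their future has not been observed yet, which is why the targets list is shorter than the log.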
Anyway, given that $F$ takes in three inputs, one would intuitively expect that it has three separate input networks and concatenates them at some point. Indeed, that’s what they do in their network, shown below.
After concatenation, the network follows the paradigm of the Dueling DQN architecture by having separate expectation and value (“action”) streams. It might not be clear why this is useful, so if you’re puzzled, I recommend reading the Dueling DQN paper for justification (I need to re-read that as well).
They benchmark their paradigm (called DFP for “Direct Future Prediction”) on Doom with four scenarios of increasing difficulty. The baselines are the well-known DQN and A3C algorithms, along with a relatively obscure “DSR” algorithm (but which used the Doom platform, facilitating comparisons). I’m not sure why they used DQN instead of, say, Double DQN or prioritized variants since those are assumed to be strictly better, but at least they test using A3C which as far as I can tell is on par with the best DQN variants. It’s too bad that OpenAI baselines wasn’t around in time for the authors to use it for this paper.
They say that
We set the temporal offsets $\tau_1, \ldots, \tau_n$ of predicted future measurements to 1, 2, 4, 8, 16, and 32 steps in all experiments. Only the latest three time steps contribute to the objective function, with coefficients $(0.5, 0.5, 1)$.
I think they say this to mean that their goal vector $g$ contains only three nonzero components, corresponding to $\tau_4$, $\tau_5$, and $\tau_6$. But then I’m confused: why do they need to have all the other $\tau_i$ for $i \le 3$? What’s also confusing is that for the two complicated environments with ammo, health, and frags, their training is set to maximize a linear combination of those three, with coefficients $(0.5, 0.5, 1)$. The same vector is repeated here!
I wish they had expanded upon their discussion since this is new stuff from their paper. Why did they choose this and that value? What is the intuition? I know it’s easy to ask this from a reading/reviewing perspective, but that’s only because the concept is new; for example, they do not need to justify why they chose the dueling-style architecture because they can refer to the Dueling DQN paper.
Regarding experiments, I don’t have much intuition on the ViZDoom environments, as I have never used those, but their results look impressive on the two harder scenarios, which also provide more measurements (three instead of one). Their method outperforms sophisticated baselines in various settings, including those from the Visual Doom AI competition in September 2016.
At the end of the experimental section, after a few ablation studies (heh, my favorite!) they convincingly claim that
This supports the intuition that a dense flow of multivariate measurements is a better training signal than a scalar reward.
In my words: dense signals are better than sparse signals. In some cases, sparsity is desirable (e.g. in attention models, we want sparsity to focus on a few important components) but for rewards in reinforcement learning, we definitely need dense signals. Note that getting such signals wouldn’t be possible if the authors kept clinging to the usual MDP formulation. Indeed, Koltun made it a point in his talk to emphasize how he disagreed with the constraints imposed on us by the MDP formulation, with the usual “set of states, actions, rewards […]”. This is one of the things I wish I was better at: identifying certain gaps in assumptions that everyone makes, and trying to figure out where we can improve them.
That’s all I have to say for this paper now. For more details, I would check the paper website. Have fun!
Last night, I finished reading Dale Carnegie’s book How to Win Friends and Influence People: The Only Book You Need to Lead You to Success. This is the 31st book I’ve read in 2017, and hopefully I will exceed the 38 books I read in 2016.
Carnegie’s book is well-known. It was originally published in 1936 (!!) during the Great Depression, but as the back cover argues, it is “equally valuable during booming economies or hard times.” I read the 1981 edition, which updated some of the original material to make it more applicable to the modern era. Even though it means the book loses its 1936 perspective, it’s probably a good idea to keep it updated to avoid confusing the reader, and Carnegie — who passed away in 1955 — would have wanted it. You can read more about the book’s history on its Wikipedia page.
So, is the book over-hyped, or is it actually insightful and useful? I think the answer is yes to both, but we’ll see what happens in the coming years when I especially try to focus on applying his advice. The benefit of self-help books clearly depends on how well the reader can apply it!
I don’t like books that bombard the reader with hackneyed, too-good-to-be-true advertisements. Carnegie’s book certainly suffers from this, starting from the terrible subtitle (seriously, “The Only Book”??). Now, to be fair, I don’t know if he wrote that subtitle or if it was added by someone later, and if it was 1936, it would have definitely been more original. Certainly in the year 2017, there is no shortage of lousy self-help books.
The good news is that once you get beyond the hyped-up advertising, the actual advice in the book is sound. My summary of it: advice that is obvious, but that we sometimes (often??) forget to follow.
Indeed, Carnegie admits that
I wrote the book, and yet frequently I find it difficult to apply everything I advocated.
This text appears at the beginning of the book, in a section titled “Nine Suggestions to Get the Most Out of This Book”. I am certainly going to be following those suggestions.
The advice he has is split into four rough groups:
- Fundamental Techniques in Handling People
- Six Ways to Make People Like You
- How to Win People to Your Way of Thinking
- Be a Leader: How to Change People Without Giving Offense or Arousing Resentment
Each group is split into several short chapters, each ending in a quick one-phrase summary of the advice. Examples include “Give honest and sincere appreciation” (first group), “smile” (second group), “If you are wrong, admit it quickly and emphatically” (third group), and “Talk about your own mistakes before criticizing the other person” (fourth group). Chapters contain anecdotes of people with various backgrounds. Former U.S. Presidents Washington, Lincoln, and both Roosevelts are featured, but there are also many examples from people leading less glamorous lives. The examples in the book seem reasonable, and I enjoyed reading about them, but I do want to point out the caveat that some of these stories seem way too good to be true.
One class of anecdotes that fits this criterion: when people are able to get others to do what they want without actually bringing it up! For example, suppose you run a business and want to get a stubborn customer to buy your products. You can ask directly and he or she will probably refuse, or you can praise the person, show appreciation, etc., and somehow magically that person will want to buy your stuff?!? Several anecdotes in the book are variants of this concept. I took notes (with a pencil) to highlight and comment in the book as I was reading it, and I frequently wrote “I’m skeptical”. Fortunately, many of the anecdotes are more realistic, and the advice itself is, as I mentioned before, accurate and helpful.
I have always wondered what it must be like to have a “normal” social life. I look at groups of friends going out to meals, parties, and so forth, and I repeatedly wonder:
- How did they first get together?
- What is their secret to liking each other??
- Do I have any ounce of hope of breaking into their social circle???
Consequently, what I most want to get out of the book is based on the second group, how to make people like me.
Unfortunately, I suffer from the social handicap of being deaf. While talking with one person usually isn’t a problem, I can’t follow conversations with noisy backgrounds and/or with many people. Heck, handling a conversation with two other people is often a challenge, and whenever this happens, I constantly fear that my two other “conversationalists” will talk to each other and leave me out. And how on earth do I possibly network in noise-heavy academic conferences or workshops??? Gaaah.
Fortunately, what I find inspiring about Carnegie’s advice is that it is generic and highly applicable to the vast majority of people, regardless of socioeconomic status, disability condition, racial or ethnic background, and so forth. Obviously, the benefit of applying this advice will vary depending on people’s backgrounds, but for the vast majority of people, there should be some positive, non-zero benefit. That is what really counts.
I will keep How to Win Friends and Influence People on my desk as a constant reminder for me to keep applying these principles. Hopefully a year from now, I can look back and see if I have developed into a better, more fulfilled man.
President Trump, you should have clearly and unequivocally denounced racism and white supremacy immediately, without trying to pin the blame on “both sides” or whatever other unrelated group comes to mind. Your delayed statement does not redeem you.
The failure to call out and condemn white supremacy is perhaps the epitome of political correctness. We tragically saw one person, Heather Heyer, murdered from the events in Charlottesville. In this case, political correctness really is deadly.
The Ku Klux Klan, neo-Nazis, and other white nationalist groups do not belong in our society. We need to always condemn them and aim to eradicate their presence so that America can become a better place.
America has come a long way since the days of George Washington, Abraham Lincoln, and Martin Luther King Jr., but we still have lots of progress to go before we can truly claim that America provides an equal playing field for its citizens.
Today, August 15, was the last day of UAI 2017. We had workshops, which you can think of as one-day conferences with fewer people. UAI 2017 offered three workshops, and I attended the Bayesian Modeling Applications Workshop. It was a small workshop with only ten of us present at the 9:00am starting time, though a few more would trickle in during the first hour.
Here were some of the highlights:
David Poole from the University of British Columbia gave the opening talk on Probabilistic Reasoning with Complex Heterogeneous Observations and Applications in Geology and Medicine. This one was largely about ontologies. Unfortunately, in the interest of time, he had to skip a lot of the content.
The other talks were more directly related to Bayesian networks, which I studied a lot in undergrad and also for my AI prelim exams.
There was another talk about OpenMarkov. I got mostly distracted when the speaker emphasized the advantage that the software was open source. Maybe this is me coming from Deep Learning, but open source should be the expectation, not the exception. (MuJoCo is the one exception for Deep Learning, but hopefully that will soon no longer be the case.) I was reminded of Zack Lipton’s blog post on a sober perspective of TensorFlow when he wrote that “A number of other news outlets marveled that Google made the code open source.”
I don’t have much else to say because I didn’t take detailed notes.
On the evening of August 15, the conference officially ended. Tomorrow, I’ll board a 15-hour direct flight from Sydney to San Francisco, and life will be back to normal.
What are some of my thoughts now that UAI 2017 has concluded? Here is a rough categorization of the pros:
I enjoyed giving a talk on my research. And the paper won an award!
I identified a few interesting papers and concepts from tutorials which I should investigate in more detail once I have time.
I met (a.k.a. “networked with”) a few students and faculty, and hopefully this will help spread my name. I should email them later.
The venue and location were awesome. This place is probably the best in Australia for tourism.
Here are the cons:
Captioning. Gah. As you know, it wasn’t set up on the first day, and even when the service was present, I still had a hard time following talks. The lack of mobility of captioners is also a drawback. Even so, it was better than nothing.
I don’t feel like I sufficiently networked. Yes, I networked a bit (as mentioned above) but probably to a lesser extent compared to other students. How again do people normally network at conferences, particularly if they’re unpopular and unknown like me? (The rock stars, of course, don’t need to do anything, as people flock to them, not the other way around.)
Despite these not so insignificant drawbacks, I’m extremely happy that I attended UAI 2017. I thank the conference organizers for arranging UAI and hope that they enjoyed it at least as much as I did.
I should elaborate on the venue, location, and related stuff. The hotel had excellent service, and the breakfast buffet was awesome. I had to resist eating so quickly! A picture of an example breakfast of mine is shown below:
My breakfast on August 15, the last full day of UAI 2017.
The coffee was great, both at the hotel and in the conference. I’ve used coffee machines that produced utter junk lattes and cappuccinos, but the ones at ICC Sydney made great coffee.
Darling Harbor, of course, is great. Here are two final views of it:
A view of the harbor.
Another view of the harbor.
Yeah. Someday, I’ll be back.
For the fourth day of UAI 2017 (August 14), I skipped my 4:30am workout to get two more full practice runs of my talk. Thus, by the time I entered the conference venue, I was feeling confident.
Professor Terry Speed gives the keynote talk of the day, and the fifth one overall for the conference.
The day started off with our last keynote for the conference, presented by Professor Terry Speed who currently works at the Walter and Eliza Hall Institute of Medical Research in Australia. Interestingly enough, he used to be a Berkeley statistics professor until his retirement in 2009.
His talk reminded me of Professor Heller’s talk from the other day. There’s a lot of work being done at the intersection of medicine and artificial intelligence. My impression is that Professor Speed’s work is more about RNA and DNA sequencing, while Professor Heller’s is about modeling diseases and health conditions. There might be overlap with the techniques they use, since both briefly mentioned (recurrent) neural networks. Yeah, they’ve fallen for the hype.
While Professor Speed is funny and skilled at presenting, I had a hard time processing the technical content, because I kept mentally rehearsing for my talk. This also happened during the first oral session (on causality) which preceded the one that contained my talk. Does anyone else find it hard to focus on talks that precede theirs?
Finally, at around noon, after several Disney Research people taught us about stochastic gradient descent for imbalanced data, it was my turn.
Whew … I walked up and got my laptop and clicker set up, though before I could begin, the conference chair gave a few remarks about my paper and then presented me with the Honorable Mention for Best Student Paper Award. After some applause, I also pointed out my coauthor John Canny in the audience, and got everyone to applaud for him as well. Then I began my talk.
I’m pleased to report that my talk went as well as I could have hoped for, with one exception that I’ll bring up later.
Here’s the list of rather sloppy reminders that I made for myself and which I reviewed beforehand:
- Don’t be flat-footed, don’t stand like a robot.
- Don’t swing side to side!!
- Must stay vigilant and alert!
- Must not have a flat voice. Try to vary it. Lots of deliberate pauses. With smiles!
- Talk LOUD, since I will likely be moving away from the microphone.
- Don’t put my hand in my pockets!
- Thank them at the beginning for the award, and at the end thank the audience for their attention.
I also wrote talk-specific reminders to include phrases such as “I hope you remember this” when reaching this slide, and so forth.
One thing that’s perhaps unique about me is my stance on giving talks. I touched on this briefly when discussing my class review for Algorithmic Human-Robot Interaction, but I’ll expand the discussion here with some bold text to catch your attention.
I view talking in front of a large audience as an absolute privilege that I CANNOT waste. Thus, my talk must be polished, but in addition, I must make it MEMORABLE and keep the viewers as ALERT as possible. This means I need to be loud, funny, and highly active. Even if this comes at the cost of a slight reduction in the amount of technical material that appears on my slides.
For the vast majority of conference talks, while some audience members pay rigid attention, many will also be checking their phones and laptops. Realistically, there’s no way to keep everyone’s attention for the entire talk, especially in conferences when there are many talks back-to-back. Even with coffee, people can’t absorb all this information. Thus, I think it’s best to simply get the audience interested so that they can look up the material later in their own time.
One absolute sure-fire way to lose the already-fragile attention span of humans is to stand frozen behind a microphone and read off text-filled slides (or a pre-made script) with a flat voice. Sorry, but when people do that, I want to yell at them: What are you doing?!? You’re wasting such a great opportunity to impress your audience!! The fact that the majority of conference speakers — mostly students, but sometimes faculty are guilty of this as well — still do this is simply mind-boggling to me. It’s completely baffling.
I understand that non-native English speakers might have difficulty with knowing what phrases to emphasize and so forth. But that doesn’t mean they can’t smile and be active when presenting, and the people who are guilty of robotic speaking are not always non-native English speakers.
Of course, there are certain times when it’s best not to follow my speaking techniques. I would obviously not apply this style at a funeral. Academia, however, is not entirely conservative in presentation style. Sure, you can be a boring robot reading off a script, but you can also be active and constantly be in engagement with the audience and no one’s going to stop you.
Whew. Anyway, sorry for that mini-rant but this felt like something important I should bring up. You can expect that whenever I give a polished academic talk at a conference, I am not going to be a boring or typical speaker.
For my talk, I did not stand behind the lectern with the microphone; I stood in front of it like Terry Speed did (see the picture above).
I also deliberately did not walk too fast when talking. The key is to walk a little bit, stand still, point the laser pointer at the slides, make a joke or two, make eye contact with the audience, and then slowly walk to the other side of the room.
I think the talk was great. I followed my advice and made some comments to get the crowd to laugh. One of them, for instance, was “please remember this figure for the rest of your life.”
OK, now what was that one “exception” I referred to earlier? It happened during the question-answer session. Professor John Duchi asked if I could prove that the method “converges to the correct posterior distribution” or something like that. I must have laid an egg because I don’t think my answer satisfied him (though to be fair, I thought his question was too vague).
Then John Duchi and coauthor John Canny (who were sitting next to each other) started discussing amongst themselves, as humorously pointed out by conference chair Kristian Kersting. Incidentally, Kristian was standing next to me during this Q&A to repeat questions from the audience, since I can’t hear/understand them. He had to relay John Duchi’s question to me even though John was literally five meters away from me.
After my talk concluded, the other conference chair, Gal Elidan, came to me and shook my hand (I didn’t see him do that to anyone else). Throughout the rest of the day, no fewer than six people came to me and said they liked my talk.
I certainly felt relieved after presenting. It was also our lunch break. I wasn’t sure what to do, but fortunately, John Canny came to my rescue and said that I should join him plus a few others for lunch. It turns out those “others” were: Gal Elidan, Kristian Kersting, Terry Speed, and John Duchi. Gulp. I would of course never have the courage to ask to join them for lunch myself, given that just about the only thing I’m better at than those guys is blogging.
John Duchi made the choice to eat at a small lunch/bar place called Social. I ate a pork burger and mostly watched the conversation, since I was unable to get involved.
We finish lunch at Social. From left to right, we have Professors Terry Speed, Kristian Kersting (behind Terry Speed), Gal Elidan (behind John Duchi), John Duchi, and John Canny.
After that, we had another oral session and then a poster session.
The third day of UAI 2017 (August 13) started off with Stanford Professor Christopher Ré giving the first keynote talk of the day about his group’s project called Snorkel. Chris is the epitome of a “rock-star academic,” and he has a ridiculous number of publications in the last few years. His lengthy list of awards includes the well-known MacArthur “Genius” Fellowship.
Stanford Professor Christopher Ré gives the first keynote talk of the day.
I really enjoyed Professor Ré’s talk, both for the content and for the “style” (i.e. at the right technical level, good visuals, etc.). He inserted some humor now and then, as you can see in the slide above. Anyone want to win the parasite award? Heh. I also try to include humor in my talks.
Anyway, as I mentioned earlier, the main technical part of the talk was about Snorkel. It’s an important project, because it helps us deal with “dark” or unlabeled data, which is what the vast majority of data will look like in real life. How can we make sense of dark data, and “clean it up” to make it more useful? This is critical because, as Professor Ré said in the talk:
Training data is the new, new oil.
(Yes, he said “new” twice.)
I was amusingly reminded of Andrew Ng’s famous (or infamous, depending on your opinion) phrase:
AI is the new electricity.
In case you are curious, I agree with both of the above quotes.
You can find more information about Snorkel on the project website. What’s great is that there are also lots of blog posts. His group really likes to write blog posts! At least I have something in common with them.
The first oral session was about “Representations”, a research sub-field which I am unfortunately not familiar with, and so I had an extremely hard time following the material. I tried to gather pieces of what I could and recorded anything interesting in my ongoing Google Doc containing my notes from UAI 2017. As I stated in my last blog post, I do not try to follow talks in their entirety — I couldn’t do that even if I wanted to — but I record bits and pieces of intriguing stuff which are candidates for future investigation.
During breaks, I worked on outlining these blog posts; I drafted them in Google Docs.
The second oral session was about … reinforcement learning! Awesome. At least I should have more background information for this material. Of the four papers presented, the fourth one seemed to have the most interesting material in it. The UAI organizers must have agreed, because the authors (Shayan Doroudi along with Philip Thomas and Emma Brunskill) won the UAI 2017 Best Paper Award for the paper “Importance Sampling for Fair Policy Selection.”
Fairness is becoming a recurring theme in research and academia nowadays along with safety and (as you’ll see later) health care. The talk was excellent since Shayan has good speaking skills. He motivated the problem with a quick example of choosing between two policies, one of which was obviously better than the other. Despite the apparent simplicity of choosing the policies, importance sampling approaches can actually choose the worse policy more often than not.
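For readers who have not seen importance sampling in this setting, here is a minimal sketch of the ordinary importance sampling estimator in a one-step (bandit-style) problem. This is only the textbook estimator, not the method from the award-winning paper, and the policies and logged data are made up for illustration.

```python
def importance_sampling_estimate(samples, target_policy, behavior_policy):
    """Average of logged rewards, reweighted by target/behavior action probabilities."""
    total = 0.0
    for action, reward in samples:
        weight = target_policy[action] / behavior_policy[action]
        total += weight * reward
    return total / len(samples)


# Hypothetical data logged by a behavior policy that picks each action half the time.
behavior = {"left": 0.5, "right": 0.5}
logged = [("left", 1.0), ("right", 0.0), ("left", 1.0), ("right", 0.0)]

# Two candidate policies we would like to compare using only the logged data.
always_left = {"left": 1.0, "right": 0.0}
always_right = {"left": 0.0, "right": 1.0}

value_left = importance_sampling_estimate(logged, always_left, behavior)
value_right = importance_sampling_estimate(logged, always_right, behavior)
```

With this small, balanced log the estimates happen to rank the two policies sensibly. The estimates depend on which actions show up in the log, so with few or lopsided samples the ranking can flip, which is the kind of issue the talk was about.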
Shayan Doroudi (to the left) being presented with the UAI 2017 Best Paper Award by the conference chairs. Congratulations!
We now have the following award-winning reinforcement learning papers:
- Importance Sampling for Fair Policy Selection (UAI 2017 best paper)
- Modular Multitask Reinforcement Learning with Policy Sketches (ICML 2017 runner-up best paper)
- Value Iteration Networks (NIPS 2016 best paper)
- Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016 best paper)
At some point, I’d like to grind through the details in these papers. I know the high level idea of each of these but aside from perhaps the dueling networks paper, the details elude me.
After a brief lunch break with another student from India — whom I found standing alone and thus it made sense for us to go to lunch together — we had our second keynote talk. Duke Professor Katherine Heller gave a talk about machine learning in health care. Gee, is anyone seeing a trend with machine learning applications?
Katherine Heller gives the second keynote talk of the day.
You can see the outline of her talk in my picture above. I remember that she discussed the following topics:
- Health conditions: chronic kidney disease, sepsis, and multiple sclerosis.
- Health issues: delayed diagnosis results in problems, surgery can introduce complications; her exact figure was 15% of the time but there are obvious simplifications with numbers like that.
- Modeling health issues: using graphical models with latent variables. Basically, given health conditions (or a sequence of conditions measured at times) what can we say about the patient’s health? I also saw a few mentions of RNNs and LSTMs there (wow, really?) and would be interested in learning more.
Given that much of the talk was about, I believe, modeling health care, I sometimes wonder how accurate our models are. The United States has one of the most inefficient health care systems in the developed world, and I wish we could use some machine learning to cut away at the inefficiency.
After Professor Heller’s talk, we had the usual poster session. I managed to engage in a few interesting one-on-one conversations, which is good enough for me!
We then had a special event provided by the conference: a dinner cruise along Darling Harbor. Awesome! The buffet options included a wide range of food options: prawns, Indian chicken curry, Thai fish curry, pastas, potatoes, and of course, lots of salad options.
UAI 2017 conference attendees lining up to get dinner.
I don’t know about how others feel, but every time there’s an event like this where I have to pick my own seat amidst a large dinner gathering, I worry and overthink it way too much. Fortunately, there was a student who I met earlier at the conference who told me to sit near the center (gulp) of a table filled with other graduate students, thus saving me the stress of coming up with the decision myself. I was happy with this because it meant I wasn’t sitting by myself, and because it’s better for me to know other graduate students (and potential future collaborators/colleagues) rather than, for instance, industry sponsors.
Yes, it was extremely noisy in the ship, and I couldn’t participate in substantive conversations, but hey, at least I was sitting with other graduate students. And it seemed like there was some ongoing discussion regarding my blog, judging by how several of the other students nearby kept looking at my name tag in order to correctly spell my name in Google.
Throughout the cruise, we would frequently walk to the top of the ship and view Darling Harbor.
A view of Luna Park.
A view of the Sydney Opera House.
It’s times like these when I wish I had a girlfriend, so that we could go on a vacation together and explore Darling Harbor.
For the second day of UAI 2017 (August 12)1, I followed the same initial routine from the previous day. I woke up early, had a 4:30am gym session, ate a hearty breakfast at the hotel’s buffet, and then walked over to the conference venue. The talks were held in the room shown in the following image:
Room C4.8 in the ICC Sydney building.
Yeah, it’s a fairly small room. Compare that to the room used for the ICML keynote talks, which is in the same building. Wow!
Fortunately, the second day of UAI started off better than the first one, since the captioner (a.k.a. “CART provider” or “stenographer”) arrived. Whew.
At around 8:30am, the day began with some opening remarks from one of the chairs. After that, it was time for MIT robotics professor Leslie Kaelbling’s one-hour keynote talk on Intelligent Robots in an Uncertain World. It was a nice, relatively high-level talk which centered on Partially Observable Markov Decision Processes (POMDPs) and belief states, with applications that focused on robotics.
Professor Kaelbling gives the initial keynote talk for UAI 2017.
The experiments that she showed in the slides involved the PR2 robot, which I assume is one of the primary robots in her lab. I wish I could use the PR2 one of these days, or at least a robot similar to it.
The final part of her talk contained a request for the UAI community to figure out how to perform action selection in belief spaces. In other words, if we don’t know everything about the environment (which is always the case in real applications) we have to pick actions on the basis of what we believe about the world.
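For readers who have not seen belief states before, here is a minimal sketch of the idea, not anything taken from the talk. It assumes a tiny discrete world with made-up numbers: the names `belief_update`, `greedy_action`, the two-state "door" example, and the reward table are all hypothetical. The agent keeps a probability for each hidden state, updates it after each observation, and then picks the action that looks best under that belief. Real POMDP planners look further ahead than this one-step rule.

```python
def belief_update(belief, transition, observation_model, obs):
    """One step of a discrete Bayes filter.

    belief[s]                : current probability of hidden state s
    transition[s][s2]        : P(next state s2 | state s) for the action taken
    observation_model[s2][o] : P(observation o | next state s2)
    """
    n = len(belief)
    # Predict: push the belief through the transition model.
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correct: weight each state by how well it explains the observation.
    unnormalized = [observation_model[s2][obs] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)  # assumed nonzero, i.e. the observation is possible
    return [u / total for u in unnormalized]


def greedy_action(belief, rewards):
    """Pick the action with the highest expected immediate reward under the belief."""
    def expected(action):
        return sum(b * rewards[action][s] for s, b in enumerate(belief))
    return max(rewards, key=expected)


# Hypothetical example: state 0 = door closed, state 1 = door open.
stay = [[1.0, 0.0], [0.0, 1.0]]          # the door does not change on its own
sensor = [[0.8, 0.2], [0.3, 0.7]]        # rows: true state, columns: reading
rewards = {"wait": [0.0, 0.0], "go": [-1.0, 1.0]}

belief = belief_update([0.5, 0.5], stay, sensor, obs=1)  # sensor says "open"
action = greedy_action(belief, rewards)
```

After a single "open" reading the belief shifts toward the door being open, and the greedy rule prefers "go" even though the agent is still uncertain.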
Overall, it was an excellent talk. There were a few sections that were relatively difficult for me to follow, but I’m not sure if it was because there was too much information to process in the slides (some of them had a lot!) or if it was because I had a hard time getting used to the captioning.
After the keynote talk, we had oral sessions. In these, authors of papers accepted to the conference give 20-minute talks. Not all the papers have oral talks, though; they’re reserved for those with the highest reviews. Also, the first author is typically the one who gives the presentation.
Today, there were four oral sessions, each of which covered one broad research topic with three research papers (so each session was an hour long). The first oral session was about deep models. Yay! Ming Jin started off the oral sessions with his excellent talk on inverse reinforcement learning.
UC Berkeley electrical engineering PhD student Ming Jin gives a talk about his paper Inverse Reinforcement Learning via Deep Gaussian Process.
The two other talks for this oral session were also interesting and perhaps more related to core Deep Learning research.
The second oral session was on the subject of machine learning, which is probably not the best name for it, but whatever. Unfortunately, the papers were quite mathematical and the speakers were a little difficult to understand (the captioner had major difficulty) so it was hard for me to get much out of the talks beyond trying to extract every ounce of information from the slides that I could.
After a break for lunch — though I was simply too full from eating earlier and had to pass — we had our second keynote of the day, Expectations in Learning and Inference by Professor Amir Globerson of Tel Aviv University.
The second keynote talk of the day.
This talk was more about math, in particular about expectations and probabilities. What happens when we aren’t given data but are given the expected value? What can we determine from it? (See the slide above for the related context.) The technical contribution was probably the development of bounds, not on probabilities themselves, but on the minimum of probabilities over a certain class (if that makes sense?). I unfortunately had a harder time understanding this talk compared to the first keynote. I thought I could follow slide by slide and sentence by sentence in the captions when I saw the transcripts, but I couldn’t piece together a good story. Maybe this has happened to other people?
In any case, I decided long ago that for research talks, I don’t try to understand everything. Instead, I find any interesting points and then follow up on these later, either by emailing the author or (more likely) simply searching online. Google has been great for people like me.
We had two more oral presentations after this second keynote. In between the two, I had an entertaining conversation with another student from the University of Helsinki who told me that hearing aids should have some way of blocking out background noise. I told him that, sadly, they’re already supposed to do that, though he didn’t give up and said that we should use machine learning to make them block out noise. Yeah, that would be great.
The poster session for today.
We wrapped up the day with a poster session which featured roughly one-third of the UAI 2017 papers. (There are also poster sessions in day three and day four of the conference.)
After this, I went to the nearby mall and found a quick, cheap Middle Eastern restaurant for dinner. I ate by myself as I didn’t know anyone else, and I couldn’t find any lonely person who I could pounce on with an invitation, but that was OK with me. I just wanted to see what the city had to offer, and I’m pleased to say that I was not disappointed. Darling Harbor has a ridiculous ton of awesome restaurants. It’s food paradise for someone like me.
The view of the lovely harbor at night. The conference is in the large building located at the center-left with the lights on. To its right is a giant mall (which the photo doesn't entirely reveal) with a LOT of stores and restaurants. Wow.
The days when the posts are published on this blog do not necessarily coincide with the day that the conference took place, e.g., this post was published the day after. ↩
Sadly, the day did not start out well. When I opened my inbox this morning (August 11), I saw this from the transcription agency which was supposed to (spoiler alert!) provide captioning services:
Thank you for the confirmation. As at 9:00pm yesterday we were still awaiting confirmation of the details and were not sure that the stenographer was still required for today. As such, it is unlikely that [name redacted] will make it to the venue today. I have spoken with her this morning and she will be there ready to go on Tuesday morning.
I have passed on the information below and [name redacted] has your phone no.
I am very sorry that we are not able to cover the captions for today. It has been difficult, with the lack of communication this week, to know what was happening.
Good luck with today.
Ouch. Where to begin?
I had carefully arranged for captioning services for this conference. I looked at my email records. On June 15 I sent an email to Berkeley’s DSP, and received acknowledgement from them that same day, that we should arrange for accommodation services (either captioning or sign language interpreting). The conference, as you know, starts August 11. That’s almost two months of lead time. And the paper got accepted on June 12 so I couldn’t have known I would be attending UAI before June 12.
A month later, on July 15, I had sent (and again, received acknowledgment from them) the day and times of the talks, which included August 11. I’m not sure if the exact rooms had been set up, but at least the times were known. And of course, we all knew that UAI was in the ICC Sydney building. Once again, I couldn’t send this email earlier since the schedule was not available on the conference website until then.
Thus, on July 15, I thought that the captioning services were essentially set up. Indeed, after I had repeatedly questioned Berkeley’s DSP, they sent me this email on August 3 which implied that things were OK:
Moving forward. The CART provider(s?) are booked, and I’ve passed the schedule updates along.
I’ll put you in touch with them before the end of the week to discuss details and logistics, as well as prep materials if possible.
Yet on August 11, the day the conference began … I got that email above. Gee, doesn’t that directly contradict the August 3 email from Berkeley’s DSP? I double checked, and the schedule I sent to them on or before that day definitely included August 11, not to mention that I explicitly included August 11 in the written portion of the email (and not in an attachment).
I was also intrigued by the fact that the transcription agency emailed me directly about this, but didn’t do so until late at night on August 10 (and I had gone to sleep early due to jetlag). Why didn’t they email me sooner? It’s also not like my contact information is completely invisible; a simple Google search of my name (even if it’s from Australia) should have this blog as the #1 hit. Moreover, I saw an email thread between Berkeley’s DSP and the transcription agency that went back a few days which was forwarded to me on the morning of August 11. It appears that the transcription agency was aware of my email on or before August 9 (to be fair, I’m not sure of the time zone). Somewhere, there must have been some massive miscommunication.
Yeah. I got completely blindsided.
I think back to my previous grievances with accommodations: see here for my prelims and here for the BAIR retreat. One thing that would especially help in these cases is if I could directly contact the agencies providing those accommodations.
After I return to Berkeley, I am going to talk to DSP and demand that for future conferences, I am able to contact whatever captioning (or sign language interpreting) agency they use. I will always copy Berkeley’s DSP to these emails, and DSP is certainly free to email and/or talk to them so long as they always copy me in emails and give me the details of any phone conversation. Always.
If they refuse, then I will figure out how to sue DSP.
In addition, I will switch to figuring out how to get an automated captioning system set up on my laptop for future conferences as I have lost interest in the hassle of setting up accommodations.
In a foul mood, I trudged over to the room of the talk (see picture below) and hoped for the best.
The room where the tutorials were located. This was actually taken after the first tutorial, but I decided to put this picture first in the blog post because ... artistic license.
Tutorial 1 of 4
Fortunately, my mood soon brightened once the first talk began at around 8:45am, featuring Assistant Professor John Duchi of Stanford University. He’s known as a machine learning rockstar. I’ve always wondered how he is able to produce so much mathematical research and still have time for physical activities such as running and participating in triathlons. (When I was applying to graduate school, he told me he was considering accepting the University of Washington’s faculty position, and I sheepishly thought that I could be one of his students there.) I knew I was in for some great content, and on the plus side, his voice is relatively easy to understand. I took a lot of notes (in Google Docs) of his talk. It was divided into two parts: convex optimization and non-convex optimization. Yeah, not the most creative division, but it’s the content that matters.
He started by providing some basic definitions of convex optimization and so forth. Given my background in EE 227BT and EE 227C in Berkeley, I was fortunately aware of this material. Nonetheless, I was happy he went over the concepts slowly. He introduced us to the style of convergence proofs that people use in his sub-field, which often involves relating expressions of the form and .
John then moved on to discuss interesting topics ranging from AdaGrad, which is probably his most well-known research contribution, to how to transform non-convex problems into convex problems, which is a technical way of saying “let’s make the hard problems easy.” Along the way, he inserted some humor here and there. For instance, when talking about the slide in the following image, John referred to the function as “Batman”:
The "Batman function", according to John Duchi.
Because, hey, doesn’t that look like Batman, he asked rhetorically. John then said he was watching too much Batman. I must have misheard because I don’t believe that one iota.
There were a lot of references to go through in the last few slides. I hope to skim a few of them to get a feel for the work, but unfortunately John hasn’t posted the slides online anywhere. But I remember that a lot of them were his own papers anyway.
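Since the slides are not available, here is a minimal sketch of the AdaGrad update mentioned above, written from the standard textbook form rather than from anything shown in the tutorial. The function names, the toy quadratic, and the step size are assumptions chosen for illustration. The idea is that each coordinate gets its own step size, which shrinks as the squared gradients for that coordinate accumulate.

```python
import math


def adagrad(grad_fn, x0, lr=0.5, steps=500, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate steps scaled by accumulated squared gradients."""
    x = list(x0)
    sum_sq = [0.0] * len(x)  # running sum of squared gradients, one per coordinate
    for _ in range(steps):
        g = grad_fn(x)
        for i in range(len(x)):
            sum_sq[i] += g[i] ** 2
            # Coordinates with large past gradients take smaller steps.
            x[i] -= lr * g[i] / (math.sqrt(sum_sq[i]) + eps)
    return x


def quadratic_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * (10 * x[0]**2 + x[1]**2)."""
    return [10.0 * x[0], x[1]]


x_final = adagrad(quadratic_grad, [3.0, -2.0])
```

On this badly scaled quadratic the two coordinates have very different curvature, yet the per-coordinate scaling lets both shrink toward the minimizer at the origin without tuning separate learning rates.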
Tutorial 2 of 4
After a quick coffee break (thanks but no thanks, that lovely cappuccino from the hotel’s buffet was more than enough to keep me awake), the second speaker, Arthur Gretton, gave his talk on comparing probability distributions. The main idea: given samples from two distributions, do the two distributions differ? In addition, what can we learn about these distributions, and can we figure out their dependency relationship? Many interesting questions arise out of this, and anyone who has been following machine learning seriously should know that the problem of trying to distinguish between samples from two probability distributions is precisely what the Discriminator tries to do in Generative Adversarial Networks.
Anything GAN-related is going to catch my attention!
Indeed, Gretton discussed GANs for a few slides, and in fact I think he may have talked more about GANs than Shakir and Danilo did in their Deep Generative Models tutorial later in the afternoon. However, GANs weren’t the entirety of the talk. Much of it was dedicated to some of the technical details regarding “Maximum Mean Discrepancy” and other metrics for comparing distributions.
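To make "Maximum Mean Discrepancy" a bit more concrete, here is a minimal sketch of the standard biased estimate of the squared MMD with a Gaussian kernel on one-dimensional samples. This is not code from the tutorial; the function names, the bandwidth of 1.0, and the toy Gaussian samples are assumptions for illustration. The estimate compares average kernel similarity within each sample against the similarity across the two samples.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD: within-sample minus cross-sample similarity."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


def draw(mean, n, rng):
    """Draw n samples from a unit-variance Gaussian with the given mean."""
    return [rng.gauss(mean, 1.0) for _ in range(n)]


rng = random.Random(0)
same = mmd_squared(draw(0.0, 200, rng), draw(0.0, 200, rng))
different = mmd_squared(draw(0.0, 200, rng), draw(2.0, 200, rng))
```

When both samples come from the same distribution the estimate stays close to zero, and it grows when the distributions are shifted apart, which is what a two-sample test would then threshold.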
Unfortunately, I couldn’t really follow the talk. Aside from the stuff on GANs, and even not entirely that — I don’t know how Wasserstein GANs work, for instance — I barely understood what was going on at the technical level. Yes, I know the talk is about distinguishing between two distributions, and yes that’s obviously useful. I just didn’t get the technical gist of the talk.
On top of it all, I could not understand the speaker’s voice despite sitting in literally the seat which was located closest to where he was talking.
Tutorial 3 of 4
The third talk was the one I was most excited about: Deep Generative Models, presented by Shakir Mohamed (blog) and Danilo Rezende (blog), both of whom are research scientists at Google DeepMind, and both of whom have excellent blogs. I hope that mine will soon be as useful to machine learning researchers as theirs are!
The speakers do some last-minute discussion before giving their joint talk. Gal Elidan (to the right) is one of the conference organizers.
Danilo took over the first half of the talk and gave a long, borderline overkill motivation for why we like generative models, but I liked the examples and it’s important to try and sell one’s research. For the more technical part of the talk, he discussed fully observed and latent variable generative models, and frequently compared different sub-approaches with pros and cons of each. Unfortunately I wasn’t too familiar with many of these models, and I couldn’t understand his voice, so I didn’t get much out of this.
I benefited more from Shakir’s half of the talk since he was relatively easier for me to understand. (By the way, I have to reiterate that when I talk about not understanding speakers, it’s emphatically not a criticism of them but simply a limitation of what I can understand given my hearing limitations.) Shakir was extremely engaged and talked about the score function estimator and the reparameterization trick, both of which are featured on his blog and which I had read beforehand. It was also nice for Shakir to discuss all the ways in which people name these; I learned that Radon-Nikodym derivatives are another way of referring to the policy gradient used in REINFORCE (i.e. Vanilla Policy Gradients).
I certainly think I learned more about the big picture of deep generative models after Shakir’s half, so the talk was on the whole positive for me.
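For a concrete picture of the two estimators mentioned above, here is a minimal sketch on a toy problem that is not from the talk: estimating the gradient, with respect to the mean, of the expected value of z squared when z is Gaussian. The function names and the choice of test function are assumptions for illustration. For this toy case the exact answer is twice the mean, so both estimates can be checked against it.

```python
import random


def f(z):
    """Toy function whose expectation we differentiate."""
    return z * z


def score_function_grad(mu, sigma, n, rng):
    """Score function (REINFORCE-style) estimate of d/dmu E[f(z)], z ~ N(mu, sigma^2)."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(mu, sigma)
        # f(z) times the gradient of log N(z; mu, sigma^2) with respect to mu.
        total += f(z) * (z - mu) / sigma ** 2
    return total / n


def reparam_grad(mu, sigma, n, rng):
    """Reparameterization estimate: write z = mu + sigma * eps and differentiate through f."""
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        # df/dz = 2z and dz/dmu = 1.
        total += 2.0 * z
    return total / n


rng = random.Random(0)
mu, sigma, n = 1.5, 1.0, 200_000
grad_sf = score_function_grad(mu, sigma, n, rng)
grad_rp = reparam_grad(mu, sigma, n, rng)
exact = 2.0 * mu
```

Both estimates land near the exact value, but the reparameterized one is much less noisy for the same number of samples. The score function version only needs to evaluate the function, while the reparameterized version needs its derivative.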
Tutorial 4 of 4
The fourth and last tutorial for the day was about Machine Learning in Healthcare. In general, health care-related talks don’t have much to do with the majority of UAI papers, but the plus side is that they’re much more readily applicable to the real world. And, y’know, that’s important!!
Unfortunately, due to visa issues, Suchi Saria (the speaker) couldn’t attend the tutorial. Thus, her talk was pre-recorded and provided as a video. I stuck around for the first few slides, but unfortunately, I couldn’t understand what she was saying. It would have been hard enough to understand her in person, but having it via video made it even worse. The slides were informative, but the material isn’t related to my research. Thus, I quietly left the talk after the first few slides, took a seat with a nice view of Darling Harbour, and got to work on my blog.
The fourth speaker couldn't attend due to visa issues, so her talk was pre-recorded.
But in all fairness, put yourself in my shoes. If you could not understand what a speaker was saying and had the choice between sitting for an hour through a pre-recorded talk that wasn’t in your area of research and quietly getting some work done, I think you would do the same.
On the morning of August 10 (a.k.a. my birthday), I arrived in Sydney, Australia. It’s my first time in Australia. I am attending the Thirty Third Conference on Uncertainty in Artificial Intelligence (UAI). UAI is probably known in AI circles as the little brother of the ICML and NIPS behemoths, but it’s still a top-tier academic conference. I would say ICML and NIPS are the clear-cut top two conferences in AI, with UAI being “1b” to them.
The reason why I’m attending is that I am presenting a research paper, An Efficient Minibatch Acceptance Test for Metropolis Hastings, which was accepted there. I feel very fortunate to also receive the Honorable Mention for Best Student Paper award. (There are three awards given by this conference: one for the best paper, one for the best student paper, and one honorable mention for the best student paper.) I’ll be giving a talk on the fourth day, so until then I’ll just listen as much as I can.
The lovely view of Darling Harbour from my hotel room. UAI (also ICML before that) is held in the building to the top center with the white cover on it. Sorry for the picture quality; you can tell it was taken through a window.
I got to the hotel sufficiently early on August 10 to relax for a bit, and you can see the excellent view above.
Though to be honest, I was a bit anxious and didn’t fully relax. August 10 was also the next-to-last day of ICML, and I was looking at Twitter to see all the tweets with the #icml2017 hashtag.
And … wow, ICML is crowded! Now, UAI isn’t going to be as crowded, but there will still be lots of people. This raises some serious concerns for me, in particular (a) how to find someone to talk to, given that in large groups, others are able to quickly find conversationalists while I resort to watching people talk, and (b) if I did find someone to talk to, how would I understand that person?
I didn’t come up with an immediate solution to this. On the other hand, I resolved to do something for the conference that I don’t ordinarily do, but which I should be doing more often:
I will never let someone — who is talking to me — get away with saying something that I don’t understand. I will ask that person to repeat what he or she says as many times as needed.
Is that clear? For instance, if I’m talking with another person about research and he says something which I can somewhat tell isn’t that important to understanding the gist of our conversation, I don’t care. I am asking him to repeat.
I’ve gotten reasonably skilled at telling what parts are or are not important to understanding the main idea in verbal conversations. But I don’t care. For this conference, I just want to understand everything. Got it?
Let’s move on. I had a really good night’s sleep, sleeping from about 8:00pm to 4:00am in Sydney time. It pays to be judicious with drinking coffee. I drank a little bit of coffee to stay up as late as was reasonable. In the coming days, if I need to really adjust to Sydney time, I can simply stay up later and later. It’s much harder in my experience to force myself to go to sleep earlier and earlier.
On the morning of August 11, I had a good workout at 4:30am:
The gym at 4:30am. The news on the television was about ... Donald Trump and North Korea. Yes, even in Australia. To be fair, though, this is more relevant to global news than most Trump-related news. And anything nuclear warfare-related keeps me on edge.
I then had a great breakfast at the hotel’s buffet. The sunrise was pretty as I walked on the harbor:
Viewing the hotels from the walkway on Darling Harbor.
All right … let’s begin the conference!!
It is no secret that I’m not the biggest fan of the current President of the United States and his administration. However, I finally received at least one good piece of news regarding a politically contentious issue: affirmative action. The Justice Department is planning to investigate colleges for anti-Asian bias.
I do not say this lightly, but this is good news.
I support having a diverse student body on college campuses — whether racial, socioeconomic, or in other terms — so long as the students are academically qualified and admitted under a fair system. On the other hand, I get disappointed when I read studies that show that Asians have to earn higher test scores compared to members of other races. As additional evidence of anti-Asian bias, the proportion of Asian students at the University of California system has skyrocketed after racial preferences were banned (so has the California Institute of Technology, though I don’t think they explicitly ban racial preferences). The rate of admitted students has exceeded the rate of increase in the Asian-American population as a whole.
This seems at odds with a policy that is designed to ensure fairness to racial minorities in this country. As most of us know, Asians were often the victims of Whites in this country — anyone remember the Internment of Japanese Americans? The treatment of Asians also reminds me of how Jews once had to be held to a higher standard than non-Jews.
I think it would be nice to definitively clarify how we should treat Asian applicants. In addition, what about the thorny issues that arise from multiracials such as myself? For instance, as a half White, half Asian person, do I help or hinder diversity? I am still not sure, and clarification would be nice. (The tech industry, for instance, does not generally celebrate Asian males as contributing to a diverse workforce.)
My preference would be to abolish affirmative action (as well as legacy preferences, consideration of geographical region, etc.) in favor of a simple metric: is this applicant (a) clearly well-qualified academically and (b) did he or she make the most of his/her opportunities given the context of his/her life?
The first factor, (a), establishes the fact that an accepted applicant is academically qualified. The second, (b), would help to ensure a degree of fairness in the process which indexes the applicant’s performance to opportunity, which is precisely what affirmative action was ostensibly designed to do, except that it would not take race into account unless there is good reason to do so, e.g. an inferior neighborhood due to racial segregation.
Yes, I have read the criticism that Whites are using Asians as a “wedge” to advance their “racial agenda.” Yes, it is true that many Asians do support affirmative action to provide more opportunity to members of other minority groups, but that’s precisely why I have that second item (b) as part of my proposed admissions metric. As someone who grew up without financial concerns, even though I wish every day that I had gone to an elite science school in New York City or Silicon Valley, I understand that I had more academic opportunity than the average American child. I do not have any basis for a college admissions committee to give me extra points for overcoming poverty. I understand and accept this, which probably explains why I got rejected from (for instance) Harvard, Princeton, and MIT. I just don’t want the reason for rejection to be partially (mostly?) because of my racial background.
Ultimately, in order for me to consider supporting affirmative action, I would encourage supporters to explain why Asians should be held to a higher standard compared to other races in order to address past sins of Whites.
I wrote a BAIR Blog post about some of the work that I’ve been doing this past year with John Canny and other students on minibatch Metropolis-Hastings. Please check it out and let me know what you think. As a side note, we just got commenting enabled for the blog! But please don’t abuse it, of course. (I have had enough of dealing with Internet trolls in my past life, so I would rather not deal with them any more.)
Some context on the post:
It corresponds to a paper (arXiv link) that I’m presenting in Sydney, Australia for the 33rd Conference on Uncertainty in Artificial Intelligence. I’ve never been anywhere in Australia. In fact, I’ve only been to the United States (obviously), Canada (since my mother’s family lived on the border between New York and Canada), Japan (in part to visit a great uncle who lives there and because my grandmother was born and raised in Japan) and … yeah that’s it. My travel resume is embarrassing compared to other Berkeley EECS graduate students, so hopefully this trip will partially rectify that.
I’ve been working on this project for a while. In fact that was one of the reasons why I wrote this blog post back in June 2016 … that was a long time ago, before I really knew deep reinforcement learning and TensorFlow. Wow.
I was worried that it contains too much math and technical detail for a BAIR Blog post, but I figured that (a) you need to know some math to do AI, and (b) we needed a little more diversity in the research topics that we’re presenting, and this gives more of a statistical perspective on AI.
Finally, we didn’t get a blog post last week since many BAIR students were swamped with things to do such as attending conferences (e.g. CVPR) and we’re approaching ICML/UAI/IJCAI season now. As part of the editorial board, I felt “professionally” obligated to keep the blog moving so I instantly took the chance to write something.
Anyway, I hope you enjoy the post!
Clockwise, starting from top left: the Da Vinci, the ABB YuMi, the PR2, and the Toyota HSR.
I now have several months of research experience in robotics. I am familiar with the following four robots, roughly in descending order of my knowledge of them:
- Da Vinci. Price: $2,000,000 (!!!). I’m not sure how much of the full set I use, though — I only use two of the arms, and the components might be cheaper versions. After all, even for well-funded Berkeley robotics labs, paying $2 million for a robot is impractical. Smaller hospitals also cannot afford the full Da Vinci.
- Toyota Human Support Robot (HSR). Price: ???. Oddly, I can’t find a price! In fact, I’m not even sure where to buy it.
- ABB YuMi. Price: $40,000. At least this one is on the “cheap” end … I think?
- Personal Robot 2 (PR2), from Willow Garage1. Price: $280,000. Yikes! And that’s the open source version – the raw sticker cost started as $400,000 when it was released in 2010. Given that Willow Garage no longer exists, I’m not sure if it’s possible to buy a PR2.
I have sadly never touched or worked with the YuMi and the PR2, though I’ve manipulated the Da Vinci on a regular basis. The one-sentence summary: it’s a surgical robotics system which is currently the only such system approved by the U.S. Food and Drug Administration.
This is interesting. Now let’s switch to another topic: suppose you talk to a layperson about working in robotics. One typical, half-serious conversation involves this question: when will your robots take over the world?
I would respond by pointing out the obvious restrictions placed on the Da Vinci. It’s fixed to a base, with arms that perform a strictly limited set of surgery-related functions. So … it can’t really “join forces” with other Da Vincis and somehow navigate the real world.
But perhaps, your conversationalist argues, we can take the arms of the Da Vinci and integrate them into a mobile robot (e.g. the Toyota HSR). If the Da Vinci works in surgical applications, then it must logically be very nimble-fingered2. Think of the things it can do! It can pick locks, use car keys, redirect electric wires, and so forth.
Alas, from my experience, it’s difficult to even get the robot arms to go where I want them to go. To make this more concrete, suppose we’re looking at an image of a flat surgical platform through the Da Vinci camera mounted above. When we look at the image, we can visually identify the area where we want the arm (or more precisely, the “end effectors”) to go to, and we can figure out the pixel values. Mathematically, given $(x_p, y_p)$ in pixel space, with $x_p$ and $y_p$ positive integers typically bounded by 1080 and 1920 (i.e. the resolution), we want to find the corresponding six-dimensional robot coordinates $(x_r, y_r, z_r, \text{yaw}_r, \text{pitch}_r, \text{roll}_r)$, where I’ve added yaw, pitch, and roll along with the subscript $r$ representing “robot.”
The problem is that we can’t directly convert from pixel to robot points. The best strategy I’ve used for dealing with this is to do some supervised learning. Given known points, I can manually move the robot end effectors to where they should be. Then my code can record the robot coordinates. I repeat this process many times to get a dataset, then perform supervised learning (e.g. with a random forest) to find the mapping. Finally, I use that map in real experiments.
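As a rough illustration of these steps, here is a minimal sketch that records pixel and robot coordinate pairs and then fits a map between them. It is not the code used in the experiments. To stay dependency-free it swaps the random forest for a plain affine least-squares fit, which is a reasonable stand-in only under the assumption of a flat platform and a fixed camera, and it predicts just two robot coordinates. All names, the "true" map, and the noise level are hypothetical and exist only to generate synthetic data. With scikit-learn installed, a `RandomForestRegressor` could be fit on the same pairs instead.

```python
import random


def solve_linear_system(a, b):
    """Solve a small square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_affine(pixels, targets):
    """Least-squares fit of target ~ w0 * x_p + w1 * y_p + w2 for one robot coordinate."""
    feats = [(xp, yp, 1.0) for xp, yp in pixels]
    ata = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    atb = [sum(f[i] * t for f, t in zip(feats, targets)) for i in range(3)]
    return solve_linear_system(ata, atb)


def predict(weights, pixel):
    """Apply a fitted affine map to one pixel location."""
    xp, yp = pixel
    return weights[0] * xp + weights[1] * yp + weights[2]


# Hypothetical ground-truth maps (meters), used only to simulate recorded data.
TRUE_X = (1e-4, -2e-5, 0.05)
TRUE_Y = (3e-5, 1e-4, -0.02)


def make_synthetic_data(n, rng, noise=5e-4):
    """Simulate manually moving the end effector to n known pixels and recording it."""
    pixels = [(rng.randint(1, 1920), rng.randint(1, 1080)) for _ in range(n)]
    xs = [predict(TRUE_X, p) + rng.gauss(0.0, noise) for p in pixels]
    ys = [predict(TRUE_Y, p) + rng.gauss(0.0, noise) for p in pixels]
    return pixels, xs, ys


rng = random.Random(0)
pixels, robot_x, robot_y = make_synthetic_data(50, rng)
weights_x = fit_affine(pixels, robot_x)
weights_y = fit_affine(pixels, robot_y)
target = (predict(weights_x, (960, 540)), predict(weights_y, (960, 540)))
```

The synthetic noise here is half a millimeter, so the fit looks better than what a real setup would give. Lens distortion and arm mechanics are not affine, which is one reason a more flexible regressor is tempting in practice.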
This is the process of calibration. And unfortunately, it doesn’t work that well. I’ve found that I consistently get errors of at least 4 millimeters, and for automated robot surgery that’s pretty dangerous. To be clear, I’m focused on automated surgery, not teleoperation, which is when a human expert surgeon controls some switches which then translate to movement of the Da Vinci arms.
Indeed, calibration is a significant enough problem that it can be part of a research paper on its own. For instance, here’s a 2014 paper from the IEEE International Conference on Automation Science and Engineering (CASE) which deals with the problem of kinematic control (which involves calibration).
Calibration — or more broadly, kinematic control — is one of those necessary evils for research. I will tolerate it because I enjoy working with robotics and with enough manual time, usually calibration becomes workable for running experiments.
I hope to continue working with robotics to make them be more autonomous. Sadly, they won’t be taking over the world.
Willow Garage also developed the ROS system, which is used in many robotics systems, including the Da Vinci and Toyota HSR. While it’s no longer around, it has a long history and is considered an iconic robotics company. Many companies have spawned from Willow Garage. I’m embarrassed to admit that I didn’t know about Willow Garage until a few months ago. I really need to read more about the tech industry; it might be more informative for me than skimming over the latest political insults hurled on The New York Times and The Wall Street Journal. ↩
I’ve been putting more of my work-related stuff in GitHub repositories and by now I have more or less settled on a reasonable workflow for utilizing GitHub. For those of you who are new to this, GitHub helps us easily visualize and share code repositories online, whether in public (visible to everyone) or private (visible only to those with permissions), though technically repositories don’t have to be strictly code-based. GitHub hosts repositories that are tracked with git, the version control system that actually handles the technical machinery. It’s grown into the de facto place where computer scientists, particularly those in Artificial Intelligence, present their work. What follows is a brief description of what I use GitHub for; in particular, I have many public repositories along with a few private repositories.
For public repositories, I have the following:
- A Paper Notes repository, where I write notes for research papers. A few months ago, I wrote a brief blog post describing why I decided to do this. Fortunately, I have come back to this repository several times to see what I wrote for certain research papers. The more I’m doing this, the more useful it is! The same holds for running a blog; the more I find myself rereading it, the better!
- A repository for coding various algorithms. I actually have two repositories which carry out this goal: one for reinforcement learning and another for MCMC-related stuff. The goal of these is to help me understand existing algorithms; many of the state-of-the-art algorithms are tricky to implement precisely because they are state-of-the-art.
- A repository for miscellaneous personal projects, such as one for Project Euler problems (yes, I’m still doing that … um, barely!) and another for self-studying various courses and textbooks.
- A repository for preparing for coding interviews. I thought it might be useful to post some of my solutions to practice problems.
- A repository for my vimrc file. Right now my vimrc file is only a few lines, but it might get more complex. I’m using a number of computers nowadays (mostly via ssh), so one of the first steps to get started with a machine is to clone the repository and establish my vimrc.
- Lastly, but certainly not least, don’t forget that there’s a repository for my blog. That’s obviously the most important one!
On the other hand, there are many cases when it makes sense for individuals to use private repositories. (I’m using “individuals” here since it should be clear that all companies have their “critical” code in private version control.) Here are some of the private repositories I have:
- All ongoing research projects have their own private repository. This should be a no-brainer. You don’t want to get scooped, particularly with a fast-paced field such as Artificial Intelligence. Once such papers are ready to be posted to arXiv, that’s when the repository can be released to the public, or copied to a new public one to start fresh.
- I also have one repository that I’ll call a research sandbox. It contains multiple random ideas I have, and I run smaller-scale experiments here to test ideas. If any ideas look like they’ll work, I start a new repository to develop them further. On a side note, running quick experiments to test an idea before scaling it up is a skill that I need to work on!
- Finally, I have a repository for homework, which also includes class final projects. It’s particularly useful for when one has laptops that are relatively old (like mine) since the computer might die and thus all my work LaTeX-ing statistics homework might be lost. At this point, though, I think I’m done taking any real classes so I don’t know if I’ll be using this one anymore.
Well, this is a picture of how I manage my repositories. I am pleased with this configuration, and perhaps others who are starting out with GitHub might adapt some of these repositories for themselves.
Over the last two weeks, I have been using more Theano-based code for Deep Learning instead of TensorFlow, in part due to diving into OpenAI’s Generative Adversarial Imitation Learning code.
That code base has also taught me something that I have wondered about on occasion: what is the “proper” way to save and load neural network model weights? At the very least, how should we as programmers save weights in a way that’s robust, scalable, and easy to understand? In my view, there are two major steps to this procedure:
- Extracting or setting the model weights from a single vector of parameters.
- Actually storing that vector of weights in a file.
One way to do the first step is to save model weights in a vector, and use that
vector to load the weights back to the model as needed. I do this in my
personal reinforcement learning repository, for instance. It’s implemented
in TensorFlow, but the main ideas still hold across Deep Learning software.
Here’s a conceptually self-contained code snippet for setting model weights
from a vector
In later code, I run TensorFlow sessions on
self.set_params_flat_op and supply
self.theta with the weight vector in the
feed_dict. Then it iteratively
makes an update to extract a segment of the
self.theta vector and assigns it
to the correct weight. The main thing to watch out for here is that
self.theta actually contains the weights in the correct ordering.
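To make the slicing logic concrete without the TensorFlow machinery, here is a minimal NumPy sketch of the same idea. It is not the code from my repository: the function names `get_params_flat` and `set_params_flat` and the layer shapes are hypothetical, and I'm assuming the weights live in a plain Python list whose order defines the layout of the flat vector.

```python
import numpy as np

def get_params_flat(weights):
    """Concatenate a list of weight arrays into one flat vector."""
    return np.concatenate([w.ravel() for w in weights])

def set_params_flat(weights, theta):
    """Copy consecutive segments of theta into each array, in list order."""
    expected = sum(w.size for w in weights)
    if theta.size != expected:
        raise ValueError("theta has %d entries, expected %d" % (theta.size, expected))
    start = 0
    for w in weights:
        # Slice out this array's segment and write it in place.
        w[...] = theta[start:start + w.size].reshape(w.shape)
        start += w.size

# Hypothetical two-layer network; shapes chosen only for illustration.
weights = [np.zeros((4, 3)), np.zeros(3), np.zeros((3, 2)), np.zeros(2)]
theta = np.arange(23, dtype=np.float64)
set_params_flat(weights, theta)
```

The size check is there because a vector of the wrong length is the easiest ordering mistake to catch; a vector of the right length in the wrong order would still load silently.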
I’m more curious about the second stage of this process, that of saving and
loading weights into files. I used to use pickle files to save the weight
vectors, but one problem is the incompatibility between Python 2 and Python 3
pickle files. Given that I sometimes switch back and forth between
versions, and that I’d like to keep the files consistent across versions, this
is a huge bummer for me. Another downside is the lack of organization. Again,
I still have to be careful to ensure that the weights are stored in the correct
ordering so that I can use
After looking at how the GAIL code stores and loads model weights, I realized
it’s different from saving a single pickle file or numpy array. I started by running
their Trust Region Policy Optimization code (
observed that the code specifies neural network weights with a list of
dictionaries. Nice! I was wondering about how I could better generalize my
existing neural network code.
Moving on, what happens after saving the snapshots? (In Deep Learning it’s
common to refer to weights after specific iterations as “snapshots” to be
saved.) The GAIL code uses a
TrainingLog class which utilizes PyTables
and, by extension, the HDF5 file format. If I run the TRPO code I might get
trpo_logs/CartPole-v0.h5 as the output file. It doesn’t have to end with
the HDF5 extension
.h5 but that’s the convention. Policies in the code are
subclasses of a generic
Policy class to handle the case of discrete versus
continuous control. The
Policy class is a subclass of an abstract
class which provides an interface for saving and loading weights.
I decided to explore a bit more, this time using the pre-trained CartPole-v0 policy provided by GAIL:
It took me a while to figure this out, but here’s how to walk through the nodes in the entire file:
PyTables works well for hierarchical data, which is nice for Deep Reinforcement Learning because there are many ways to form a hierarchy: snapshots, iterations, layers, weights, and so on. All in all, PyTables looks like a tremendously useful library. I should definitely consider using it to store weights. Furthermore, even if it would be easier to store with a single weight vector as I now do (see my TensorFlow code snippet from earlier) the generality of PyTables means it might have cross-over effects to other code I want to run in the future. Who knows?
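As a rough illustration of that hierarchy, here is a small self-contained PyTables sketch that writes two fake snapshots and then walks every array in the file. The file name `toy_policy.h5`, the group names, and the layer shapes are all made up for this example; this is not the layout the GAIL code uses, just one plausible way to organize snapshots, layers, and weights.

```python
import os
import tempfile

import numpy as np
import tables  # PyTables

# Hypothetical layout: /snapshots/iter<N>/<layer>/<W or b>.
path = os.path.join(tempfile.mkdtemp(), "toy_policy.h5")
rng = np.random.RandomState(0)

with tables.open_file(path, mode="w") as h5:
    snapshots = h5.create_group("/", "snapshots")
    for it in (0, 10):
        snap = h5.create_group(snapshots, "iter%07d" % it)
        for layer, shape in (("hidden", (4, 8)), ("out", (8, 2))):
            grp = h5.create_group(snap, layer)
            h5.create_array(grp, "W", rng.randn(*shape))
            h5.create_array(grp, "b", np.zeros(shape[1]))

# Walk every array node in the file and record its path and shape.
found = {}
with tables.open_file(path, mode="r") as h5:
    for node in h5.walk_nodes("/", classname="Array"):
        found[node._v_pathname] = node.shape
        print(node._v_pathname, node.shape)
    # Load one specific weight matrix by its full path.
    W = h5.get_node("/snapshots/iter0000010/hidden/W").read()
```

Because each array is addressed by name rather than by position in a flat vector, the ordering problem mostly disappears: loading a layer means asking for its path.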
After taking STAT 210A last semester (and writing way too much about it), it made sense for me to take STAT 210B, the continuation of Berkeley’s theoretical statistics course aimed at PhD students in statistics and related fields.
Our professor was Michael I. Jordan, who is colloquially called the “Michael Jordan of machine learning.” Indeed, how does one begin to describe his research? Yann LeCun, himself an extraordinarily prominent Deep Learning researcher and considered one of the three leaders in the field[1], said this[2] in a public Facebook post:
Mike’s research direction tends to take radical turns every 5 years or so, from cognitive psychology, to neural nets, to motor control, to probabilistic approaches, graphical models, variational methods, Bayesian non-parametrics, etc. Mike is the “Miles Davis of Machine Learning”, who reinvents himself periodically and sometimes leaves fans scratching their heads after he changes direction.
And Professor Jordan responded with:
I am particularly fond of your “the Miles Davis of machine learning” phrase. (While “he’s the Michael Jordan of machine learning” is amusing—or so I’m told—your version actually gets at something real).
As one would expect, he’s extremely busy, and I think he had to miss four lectures for 210B. Part of the reason might be that, as he mentioned to us: “I wasn’t planning on teaching this course … but as chair of the statistics department, I assigned it to myself. I thought it would be fun to teach.” The TAs were able to substitute, though it seemed like some of the students in the class decided to skip those lectures.
Just because his teaching 210B was somewhat “unplanned” doesn’t mean that it was easy. Far from it! In the first minute of the first lecture, he said that 210B is the hardest course that the statistics department offers. Fortunately, he followed up with saying that the grading would be lenient, that he didn’t want to scare us, and so forth. Whew. We also had two TAs (or “GSIs” in Berkeley language) who we could ask for homework assistance.
Then we dived into the material. One of the first things we talked about was U-Statistics, a concept that can often trip me up because of my lack of intuition in internalizing expectations of expectations and how to rearrange related terms in clever ways. Fortunately, we had a homework assignment question about U-Statistics in 210A so I was able to follow some of the material. We also talked about the related Hájek projection.
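For readers who haven't seen one, a small numerical example might help. The sketch below is my own illustration, not something from the lectures: it computes an order-two U-statistic by averaging a symmetric kernel over all unordered pairs, and uses the kernel (a - b)^2 / 2, for which the U-statistic coincides with the usual unbiased sample variance. The function names and the data are made up.

```python
from itertools import combinations

import numpy as np

def u_statistic(xs, kernel):
    """Average a symmetric two-argument kernel over all unordered pairs."""
    pairs = list(combinations(xs, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)

def variance_kernel(a, b):
    # For i.i.d. X and Y, E[(X - Y)^2 / 2] = Var(X).
    return 0.5 * (a - b) ** 2

xs = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])  # arbitrary sample
u_var = u_statistic(xs, variance_kernel)
s_var = np.var(xs, ddof=1)  # unbiased sample variance
print(u_var, s_var)
```

The two printed numbers agree up to floating-point error, which is one way to see that the sample variance is itself a U-statistic.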
Diving into High-Dimensional Statistics
We soon delved into the meat of the course. I consider this to be the material in our textbook for the course, Professor Martin Wainwright’s recent book High-Dimensional Statistics: A Non-Asymptotic Viewpoint.
For those of you who don’t know, Professor Wainwright is a faculty member in the Berkeley statistics and EECS departments who won the 2014 COPSS “Nobel Prize in Statistics” award due to his work on high dimensional statistics. Here’s the transcript of his interview, where he says that serious machine learning students must know statistics. As a caveat, the students he’s referring to are the kind that populate the PhD programs in schools like Berkeley, so he’s talking about the best of the best. It’s true that basic undergraduate statistics courses are useful for a broad range of students — and I wish I had taken more when I was in college — but courses like 210B are not needed for all but a handful of students in specialized domains.
First, what is “high-dimensional” statistics? Suppose we have parameter $\theta \in \mathbb{R}^d$ and $n$ labeled data points $\{(x_i, y_i)\}_{i=1}^n$ which we can use to estimate $\theta$ via linear regression or some other procedure. In the classical setting, we can safely assume that $n > d$, or that $n$ is allowed to increase while the data dimension $d$ is typically held fixed. This is not the case in high-dimensional (or “modern”) statistics where the relationship is reversed, with $d \gg n$. Classical algorithms end up running into brick walls in these cases, so new theory is needed, which is precisely the main contribution of Wainwright’s research. It’s also the main focus of STAT 210B.
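One concrete way to see a "brick wall" is ordinary least squares with more dimensions than samples. The following sketch is my own toy example (the sizes, the random seed, and the sparse "true" parameter are all arbitrary assumptions): the Gram matrix X^T X has rank at most the number of samples, so it cannot be inverted, and the minimum-norm solution fits the data perfectly while missing the true parameter.

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 20, 100                      # fewer samples than dimensions
X = rng.randn(n, d)
theta_star = np.zeros(d)
theta_star[:3] = [1.0, -2.0, 0.5]   # sparse "true" parameter (illustrative)
y = X.dot(theta_star)               # noiseless responses

gram = X.T.dot(X)                   # d x d matrix, but rank at most n
rank = np.linalg.matrix_rank(gram)

# The minimum-norm least-squares solution interpolates the data ...
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
fit_error = np.linalg.norm(X.dot(theta_hat) - y)
# ... but infinitely many vectors do, so it need not be near theta_star.
param_error = np.linalg.norm(theta_hat - theta_star)
print(rank, fit_error, param_error)
```

Extra structure such as sparsity is what makes estimation possible again in this regime, which is roughly where the sparse linear models material in the course comes in.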
The most important material to know from Wainwright’s book is the stuff from the second chapter: sub-Gaussian random variables, sub-Exponential random variables, bounds from Lipschitz functions, and so on. We referenced back to this material all the time.
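For reference, here is the standard definition as I remember it, in notation that may differ slightly from the book's. A random variable $X$ with mean $\mu$ is sub-Gaussian with parameter $\sigma$ if the moment generating function bound on the left holds, and that bound implies the two-sided tail bound on the right:

```latex
\mathbb{E}\left[ e^{\lambda (X - \mu)} \right] \le e^{\sigma^2 \lambda^2 / 2}
\quad \text{for all } \lambda \in \mathbb{R}
\qquad \Longrightarrow \qquad
\mathbb{P}\left[ \, |X - \mu| \ge t \, \right] \le 2 e^{-t^2 / (2 \sigma^2)}
\quad \text{for all } t \ge 0.
```

A Gaussian with variance $\sigma^2$ satisfies the first inequality with equality, which is where the name comes from.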
We then moved away from Wainwright’s book to talk about entropy, the Efron-Stein Inequality, and related topics. Professor Jordan criticized Professor Wainwright for not including the material in this book. I somewhat agree with him, but for a different reason: I found this material harder to follow compared to other class concepts, so it would have been nice to see Professor Wainwright’s interpretation of it.
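For concreteness, this is the usual statement of the Efron-Stein inequality, written from memory and in my own notation rather than copied from the lectures. Let $X = (X_1, \ldots, X_n)$ have independent coordinates, and let $X^{(i)}$ denote $X$ with its $i$-th coordinate replaced by an independent copy $X_i'$. Then for any square-integrable function $f$:

```latex
\mathrm{Var}\big( f(X) \big)
\;\le\;
\frac{1}{2} \sum_{i=1}^{n}
\mathbb{E}\left[ \big( f(X) - f(X^{(i)}) \big)^2 \right].
```

Roughly, if changing any single coordinate only changes $f$ a little on average, then $f(X)$ cannot vary much overall.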
Note to future students: get the book by Boucheron, Lugosi, and Massart, titled Concentration Inequalities: a Nonasymptotic Theory of Independence. I think that’s the book Professor Jordan was reviewing when he gave these non-Wainwright-related lectures, because he was using the same exact notation as in the book.
How did I know about the book, which amazingly, wasn’t even listed on the course website? Another student brought it to the class and I peeked over the student’s shoulder to see the title. Heh. I memorized the title and promptly ordered it online. Unfortunately, or perhaps fortunately, Professor Jordan then moved on exclusively to material from Professor Wainwright’s book.
If any future students want to buy the Boucheron et al. book off me, send me an email.
After a few lectures, it was a relief to me when we returned to material from Wainwright’s book, which included:
- Rademacher and Gaussian Complexity (these concepts were briefly discussed in a Deep Learning paper I recently blogged about)
- Metric entropy, coverings, and packings
- Random matrices and high dimensional covariance matrix estimation
- High dimensional, sparse linear models
- Non-parametric least squares
- Minimax lower bounds, a “Berkeley specialty” according to Professor Jordan
I obtained a decent understanding of how these concepts relate to each other. The concepts appear in many chapters outside the ones where they’re formally defined, because they can be useful as “sub-routines” or as part of technical lemmas for other problems.
Despite my occasional complaint about not understanding details in Wainwright’s book, which I’ll bring up later in this blog post, I think the book is above-average in terms of clarity, relative to other textbooks aimed at graduate students. There were often enough high-level discussions so that I could see the big picture. One thing that needs to be fixed, though, is the typos. Professor Jordan frequently pointed these out during lecture, and would also sometimes ask us to confirm his suspicions that something was a typo.
Regarding homework assignments, we had seven of them, each of which was about five or so problems with multiple parts per problem. I was usually able to correctly complete about half of each homework by myself. For the other half, I needed to consult the GSIs, other students, or perform extensive online research to assist me with the last parts. Some of the homework problems were clearly inspired by Professor Wainwright’s research papers, but I didn’t have much success translating from research paper to homework solution.
For me, some of the most challenging homework problems pertained to material that wasn’t in Wainwright’s textbook. In part this is because some of the problems in Wainwright’s book have a similar flavor to exercises in the main text of the book, which were often accompanied with solutions.
The Final Exam
In one of the final lectures of the class, Professor Jordan talked about the final exam (that it would cover a range of questions, that it would be difficult, and so forth), but then he also mentioned that he could complete it in an hour. (Final exams at Berkeley are in three-hour slots.) While he quickly added “I don’t mean to disparage you…”, unfortunately I found the original comment about completing the exam in an hour quite disparaging. I’m baffled by why professors say that; it seems to be a no-win situation for the students. Furthermore, no student is going to question a Berkeley professor’s intelligence; I certainly wouldn’t.
That comment aside, the final exam was scheduled to be Thursday at 8:00AM (!!). I was hoping we could keep this time slot, since I am a morning person and if other students aren’t, then I have a competitive advantage. Unfortunately, Professor Jordan agreed with the majority in the class that he hated the time, so we had a poll and switched to Tuesday at 3:00PM. Darn. At least we know now that professors are often more lenient towards graduate students than undergrads.
On the day of the final exam, I felt something really wrenching. And it wasn’t something that had to do with the actual exam, though that of course was also “wrenching.” It was this:
It looked like my streak of having all professors know me on a first-name basis was about to be snapped.
For the last seven years at Williams and Berkeley, I’m pretty sure I managed to be known on a first-name basis to the professors from all of my courses. Yes, all of them. It’s easier to get to know professors at Williams, since the school is small and professors often make it a point to know the names of every student. At Berkeley it’s obviously different, but graduate-level courses tend to be better about one-on-one interaction between students and professors. In addition, I’m the kind of student who frequently attends office hours. On top of it all, due to my deafness, I get some form of visible accommodation, either captioning (CART providers) or sign language interpreting services.
Yes, I have a little bit of an unfair advantage in getting noticed by professors[3], but I was worried that my streak was about to be snapped. It wasn’t for lack of trying; I had indeed attended office hours once with Professor Jordan (who promptly criticized me for my lack of measure theory knowledge) and yes, he was obviously aware of the sign language interpreters I had, but as far as I can tell he didn’t really know me.
So here’s what happened just before we took the final. Since the exam was at a different time slot than the “official” one, Professor Jordan decided to take attendance.
My brain orchestrated an impressive mental groan. It’s a pain for me to figure out when I should raise my hand. I did not have a sign language interpreter present, because why? It’s a three-hour exam and there wouldn’t be (well, there better not be!) any real discussion. I also have bad memories because one time during a high school track practice, I gambled and raised my hand when the team captains were taking attendance … only to figure out that the person being called at that time had “Rizzuto” as his last name. Oops.
Then I thought of something. Wait … why should I even raise my hand? If Professor Jordan knew me, then surely he would indicate to me in some way (e.g. by staring at me). Furthermore, if my presence was that important to the extent that my absence would cause a police search for me, then another student or TA should certainly point me out.
So … Professor Jordan took attendance. I kept turning around to see the students who raised their hand (I sat in the front of the class. Big surprise!). I grew anxious when I saw the raised hand of a student whose last name started with “R”. It was the moment of truth …
A few seconds later … Professor Jordan looked at me and checked something off on his paper — without consulting anyone else for assistance. I held my breath mentally, and when another student whose last name was after mine was called, I grinned.
My streak of having professors know me continues! Whew!
That personal scenario aside, let’s get back to the final exam. Or, maybe not. I probably can’t divulge too much about it, given that some of the material might be repeated in future iterations of the course. Let me just say two things regarding the exam:
- Ooof. Ouch. Professor Jordan wasn’t kidding when he said that the final exam was going to be difficult. Not a single student finished early, though some were no doubt quadruple-checking their answers, right?
- Professor Jordan wasn’t kidding when he said that the class would be graded leniently.
I don’t know what else there is to say.
I am Dying to Know
Well, STAT 210B is now over, and in retrospect I am really happy I took the course. Even though I know I won’t be doing research in this field, I’m glad that I got a taste of the research frontier in high-dimensional statistics and theoretical machine learning. I hope that understanding some of the math here can transfer to increased comprehension of technical material more directly relevant to my research.
Possibly more than anything else, STAT 210B made me really appreciate the enormous talent and ability that Professor Michael I. Jordan and Professor Martin Wainwright exhibit in math and statistics. I’m blown away by how fast they can process, learn, connect, and explain technically demanding material. And the fact that Professor Wainwright wrote the textbook solo, and that much of the material there comes straight from his own research papers (often co-authored with Professor Jordan!) surely attests to why those two men are award-winning statistics and machine learning professors.
It makes me wonder: what do I lack compared to them? I know that throughout my life, being deaf has put me at a handicap, which my white male privilege (even though I’m not white) can’t completely overcome. But if Professor Jordan or Professor Wainwright and I were to sit side-by-side and each read the latest machine learning research paper, they would be able to process and understand the material far faster than I could. Reading a research paper theoretically means my disability shouldn’t be a strike against me.
So what is it that prevents me from being like those two?
I tried doing as much of the lecture reading as I could, and I truly understood a lot of the material. Unfortunately, many times I would get bogged down by some technical item which I couldn’t wrap my head around, or I would fail to fill in missing steps to argue why some “obvious” conclusion is true. Or I would miss some (obvious?) mathematical trick that I needed to apply, which was one of the motivating factors for me writing a lengthy blog post about these mathematical tricks.
Then again, after one of the GSIs grinned awkwardly at me when I complained to him during office hours about not understanding one of Professor Wainwright’s incessant “putting together the pieces” comments without any justification whatsoever … maybe even advanced students struggle from time to time? And Wainwright does have this to say in the first chapter of his book:
Probably the most subtle requirement is a certain degree of mathematical maturity on the part of the reader. This book is meant for the person who is interested in gaining a deep understanding of the core issues in high-dimensional statistics. As with anything worthwhile in life, doing so requires effort. This basic fact should be kept in mind while working through the proofs, examples and exercises in the book.
(I’m not sure if a “certain degree” is a good description, more like “very high degree” wouldn’t you say?)
Again, I am dying to know:
What is the difference between me and Professor Jordan? For instance, when we each read Professor Wainwright’s textbook, why is he able to process and understand the information at a much faster rate? Does his brain simply work on a higher plane? Do I lack his intensity, drive, and/or focus? Am I inherently less talented?
I just don’t know.
Here are a few other random thoughts and comments I have about the course:
The course had recitations, which are once-a-week events when one of the TAs leads a class section to discuss certain class concepts in more detail. Attendance was optional, but since the recitations conflicted with one of my research lab meetings, I didn’t attend a single recitation. Thus, I don’t know what they were like. However, future students taking 210B should at least attend one section to see if such sessions would be beneficial.
Yes, I had sign language interpreting services, which are my usual class accommodations. Fortunately, I had a consistent group of two interpreters who attended almost every class. They were kind enough to bear through such technically demanding material, and I know that one of the interpreters was sick once, but came to work anyway since she knew that whoever would be substituting would be scarred for life from the class material. Thanks to both of you[4], and I hope to continue working with you in the future!
To make things easier for my sign language interpreters, I showed up early to every class to arrange two seats for them. (In fact, beyond the first few weeks, I think I was the first student to show up to every class, since in addition to rearranging the chairs, I used the time to review the lecture material from Wainwright’s book.) Once the other students in the class got used to seeing the interpreters, they didn’t touch the two magical chairs.
We had a class Piazza. As usual, I posted way too many times there, but it was interesting to see that we had a lot more discussion compared to 210A.
The class consisted of mostly PhD students in statistics, mathematics, EECS, and mechanical engineering, but there were a few talented undergrads who joined the party.
I’d like to get back to that Facebook discussion between Yann LeCun and Michael I. Jordan at the beginning of this post. Professor Jordan’s final paragraph was a pleasure to read:
Anyway, I keep writing these overly-long posts, and I’ve got to learn to do better. Let me just make one additional remark, which is that I’m really proud to be a member of a research community, one that includes Yann Le Cun, Geoff Hinton and many others, where there isn’t just lip-service given to respecting others’ opinions, but where there is real respect and real friendship.
I found this pleasing to read because I often find myself thinking similar things. I too feel proud to be part of this field, even though I know I don’t have a fraction of the contributions of those guys. I feel privileged[5] to be able to learn statistics and machine learning from Professor Jordan and all the other professors I’ve encountered in my education. My goal is to become a far better researcher than I am now so that I feel like I am giving back to the community. That’s indeed one of the reasons why I started this blog way back in August 2011 when I was hunched over my desk on the eighth floor of a dorm at the University of Washington. I wanted a blog in part so that I could discuss the work I’m doing and new concepts that I’ve learned, all while making it hopefully accessible to many readers.
The other amusing thing that Professor Jordan and I have in common is that we both write overly long posts, him on his Facebook, and me on my blog. It’s time to get back to research.
[1] The other two are Geoffrey Hinton and Yoshua Bengio. Don’t get me started with Jürgen Schmidhuber, though he’s admittedly a clear fourth. ↩
[2] This came out of an interview that Professor Jordan had with IEEE back in 2014. However, it didn’t quite go as well as Professor Jordan wanted, and he criticized the title and hype (see the featured comments below the article). ↩
[3] Sadly, this “unfair advantage” has not translated into “getting noticed” in other respects, such as friendship, dating, and so forth. ↩
[4] While I don’t advertise this blog to sign language interpreters, a few years ago one of them said that there had been “some discussion” of my blog among her social circle of interpreters. Interesting … ↩
[5] Even though that word has gotten a bad rap from the Social Justice Warriors, it’s the right word here. ↩