My Second Graduate Student Instructor Experience for CS 182/282A (Previously 194/294-129)

In Spring 2019, I was the Graduate Student Instructor (i.e., Teaching Assistant) for CS 182/282A, Designing, Visualizing, and Understanding Deep Neural Networks, taught by Professor John Canny. The class was formerly numbered CS 194/294-129, and recently got “upgraded” to have its own three-digit numbers of 182 and 282A for the undergraduate and graduate versions, respectively. The convention for Berkeley EECS courses is that new ones are numbered 194/294-xyz where xyz is a unique set of three digits, and once the course has proven that it is worthy of being treated as a regular course, it gets upgraded to a unique number without the “194/294” prefix.

Judging from my conversations with other Berkeley students who are aware of my blog, my course reviews seem to be a fairly popular category of posts. You can find the full set in the archives and in this other page I have. While most of these course reviews are for classes that I have taken, one of the “reviews” is actually my GSI experience from Fall 2016, when I was the GSI for the first edition of Berkeley’s Deep Learning course. Given that the class will now be taught on a regular basis, and that I just wrapped up my second GSI experience for it, I thought it would be nice to once again dust off my blogging skills and discuss my experience as a course staff member.

Unlike last time, when I was an “emergency” 10-hour GSI, I was a 20-hour GSI for CS 182/282A from the start. At Berkeley, EECS PhD students must GSI for a total of “30 hours.” The “hours” designation means that students are expected to work for that many hours per week in a semester as a GSI, and the sum of all the hours across all semesters must be at least 30. Furthermore, at least one of the courses must be an undergraduate course.¹ As a 20-hour GSI for CS 182/282A and a 10-hour GSI for the Fall 2016 edition, I have now achieved my teaching requirements for the UC Berkeley EECS PhD program.

That being said, let us turn to the course itself.

Course Overview and Logistics

CS 182/282A can be thought of as a mix between 231n and 224n from Stanford. Indeed, I frequently watched lectures or reviewed the notes from those courses to brush up on the material here. Berkeley’s on a semester system, whereas Stanford has quarters, so we are able to cover slightly more material than 231n or 224n alone. We cover some deep reinforcement learning in 182/282A, but that is also a small part of 231n.

In terms of course logistics, the bad news was obvious when the schedule came out. CS 182/282A had lectures on Mondays and Wednesdays, at 8:00am. Ouch. That’s hard on the students; I wish we had a later time, but I think the course was added late to the catalog so we were assigned to the least desirable time slot. My perspective on lecture times is that as a student, I would enjoy an early time because I am a morning person, and thus earlier times fit right into my schedule. In contrast, as a course staff member who hopes to see as many students attend lectures as possible, I prefer early afternoon slots when it’s more likely that we get closer to full attendance.

Throughout the semester, there were only three mornings when the lecture room was crowded: on day one, and on the two in-class midterms. That’s it! Lecture attendance for 182/282A was abysmal. I attended nearly all the lectures, and by the end of the semester, I observed we were only getting about 20 students per lecture, out of a class size of (judging by the number listed on the course evaluations) perhaps 235 students!

Incidentally, the reason for everyone showing up on day one is that I think students on the waiting list have to show up on the first day if they want a chance of getting in the class. The course staff got a lot of requests from students asking if they could get off the waiting list. Unfortunately I don’t think I or anyone on the course staff had control over this, so I was unable to help. I really wish the EECS department had a better way to state unequivocally whether a student can get in a class or not, and I am somewhat confused as to why students constantly ask this question. Do other classes have course staff members deal with the waiting list?

One logistical detail that is unique to me is sign language interpreting services. Normally, Berkeley’s Disabled Students’ Program (DSP) pays for sign language services for courses and lab meetings, since this is part of my academic experience. Since I was getting paid by the EECS department, however, DSP told me that the EECS department had to do the payment. Fortunately, this detail was quickly resolved by the excellent administrators at DSP and EECS, and the funding details were abstracted away from me.

Discussion Sections

Part of our GSI duties for 182/282A is that we need to host discussions or sections; I use the terms interchangeably, and sometimes together, and another term is “recitations” as in this blog post 4.5 years ago. Once a week, each GSI was in charge of two discussions, which are each a 50-minute lecture we give to a smaller audience of students. This allows for a more intimate learning environment, where students may feel more comfortable asking questions as compared to the normal lectures (with a terrible time slot).

The discussions did not start well. They were scheduled only on Mondays, with some overlapping time slots, which seemed like a waste of resources. We first polled the students to see the best times for them, and then requested the changes to the scheduling administrators in the department. After several rounds of ambiguous email exchanges, we got a stern and final response from one of them, who said the students “were not six year olds” and were responsible for knowing the discussion schedule since it was posted ahead of time.

To the students who had scheduling conflicts with the sections, I apologize, but we tried.

We also got off to a rocky start with the material we chose to present. The first discussion was based on Michael I. Jordan’s probabilistic graphical models notes² to describe the connection between Naive Bayes and Logistic Regression. Upon seeing our discussion material, John chastised us for relying on graduate-level notes, and told us to present his simpler ideas instead. Sadly, I did not receive his email until after I had already given my two discussion sections, since he sent it while I was presenting.

I wish we had started off a bit easier to help some students gradually get acclimated to the mathematics. Hopefully after the first week, the discussion material was at a more appropriate difficulty level. I hope the students enjoyed the sections. I certainly did! It was fun to lecture and to throw in the occasional (OK, frequent) joke.

Preparing for the sections meant that I needed to know the material and anticipate the questions students might ask. I dedicated my entire Sundays (from morning to 11:00pm) to preparing for the sections by reviewing the relevant concepts. Each week, one GSI took the lead in forming the notes, which dramatically helped to simplify the workload.

At the end of the course, John praised us (the GSIs) for our hard work on the notes, and said he would reuse them in future iterations of the course.

Piazza and Office Hours

I had ZERO ZERO ZERO people show up to office hours throughout the ENTIRE semester in Fall 2016. I don’t even know how that is humanly possible.

I did not want to repeat that “accomplishment” this year. I had high hopes that in Spring 2019, with the class growing larger, more students would come to office hours. Right? RIGHT?!?

It did not start off well. Not a single student showed up to my office hours after the first discussion. I was fed up, so at the beginning of my section the next week, I wrote down on the board: “number of people who showed up to my office hours”. Anyone want to guess? I asked the students. When no one answered, I wrote a big fat zero on the board, eliciting a few chuckles.

Fortunately, once the homework assignments started, a nonzero number of students showed up to my office hours, so I no longer had to complain.

Students were reasonably active on Piazza, which is expected for a course this large with many undergraduate students. One thing that was also expected — this one not so good — was that many students ran into technical difficulties when doing the homework assignments, and posted incomplete reports on Piazza. Their reports were written in a way that made it hard for the course staff to adequately address them.

This has happened in previous iterations of the course, so John wrote this page on the course website which has a brief and beautiful high-level explanation of how to properly file an issue report. I’m copying some of his words here because they are just so devastatingly effective:

If you have a technical issue with Python, EC2 etc., please follow these guidelines when you report an issue in Piazza. Most issues are relatively easy to resolve when a good report is given. And the process of creating a good Issue Report will often help you fix the problem without getting help - i.e. when you write down or copy/paste the exact actions you took, you will usually discover if you made a slip somewhere.

Unfortunately many of the issue reports we get are incomplete. The effect of this is that a simple problem becomes a major issue to resolve, and staff and students go back-and-force trying to extract more information.

[…]

Well said! Whenever I felt stressed throughout the semester due to teaching or other reasons, I would often go back and read those words on that course webpage, which brought me a dose of sanity and relief. Ahhhhh.

The above is precisely why I have very few questions on StackOverflow and other similar “discussion forums.” The following has happened to me so frequently: I draft a StackOverflow post and structure it by saying that I tried this and that and … oh, I just realized I solved what I wanted to ask!

For an example of how I file (borderline excessive) issue reports, please see this one that I wrote for OpenAI baselines about how their DDPG algorithm does not work. (But what does “does not work” mean?? Read the issue report to find out!)

I think I was probably spending too much time on Piazza this semester. The problem is that I get this uncontrollable urge to respond to student questions.³ I had the same problem when I was a student, since I was constantly trying to answer Piazza questions that other students had. I am proud to have accumulated a long list of “An instructor […] endorsed this answer” marks.

The advantage of my heavy Piazza scrutiny was that I was able to somewhat gauge which students should get a slight participation bonus for helping others on Piazza. Officially, participation was 10% of the grade, but in practice, none of us actually knew what that meant. Students were constantly asking the course staff what their “participation grade” actually meant, and I was never able to get a firm answer from the other course staff members. I hope this is clarified better in future iterations of 182/282A.

Near the end of the grading period, we finally decided that part of participation would consist of slight bonuses to the top few students who were most helpful on Piazza. It took me four hours to scan through Piazza and to send John a list of the students who got the bonus. This was a binary bonus: students could get either nothing or the bonus. Obviously, we didn’t announce this to the students, because we would get endless complaints from those who felt that they were near the cutoff for getting credit.

Homework Assignments

We had four challenging homework assignments for 182/282A, all of which were bundled with Python and Jupyter notebooks:

The first two came straight from the 231n class at Stanford, but we actually took their second and third assignments, and skipped their first one. Last I checked, the first assignment for 231n is mostly an introduction to machine learning and taking gradients, the second is largely about convolutional neural networks, and the third is about recurrent neural networks with a pinch of Generative Adversarial Networks (GANs). Since we skipped the first homework assignment from 231n, this might have made our course relatively harder, but fortunately for the students, we did not ask them to do the GANs part of Stanford’s assignment.
The third homework was on NLP and the Transformer architecture (see my blog post here). One of the other GSIs designed this from the ground up, so it was unique for the class. We provided a lot of starter code for the students, and asked them to implement several modules for the Transformer. Given that this was the first iteration of the assignment, we got a lot of Piazza questions about code usage and correctness. I hope this was educational to the students! Doing the homework myself (to stress test it beforehand) was certainly helpful for me.
The fourth homework was on deep reinforcement learning, and I designed this one. It took a surprisingly long time, even though I borrowed lots of the code from elsewhere. My original plan was actually to get the students to implement Deep Q-Learning from Demonstrations (blog post here) because that’s an algorithm that nicely combines imitation and reinforcement learning, and I have an implementation (actually, two) in a private repository which I could adapt for the assignment. But John encouraged me to keep it simple, so we stuck with the usual “Intro to DeepRL” combination of Vanilla Policy Gradients and Deep Q-learning.

The fourth homework assignment may have been tough on the students since this was due just a few days after the second midterm (sorry!). Hopefully the lectures were helpful for the assignment. Incidentally, I gave one of the lectures for the course, on Deep Q-learning methods. That was fun! I enjoyed giving the lecture. It was exciting to see students raise their hands with questions.

Midterms

We had two midterms for 182/282A.⁴ The midterms consisted of short answer questions. We had to print the midterms and walk a fair distance to some of the exam rooms. I was proctoring one of them with John, and since it was awkward not talking to someone, particularly when that someone is your PhD advisor, I decided to strike up a conversation while we were lugging around the exams: how did you decide to come to Berkeley? Ha ha! I learned some interesting factoids about why John accepted the Berkeley faculty offer.⁵

Anyway, as is usual at Berkeley, we graded exams using Gradescope. We split the midterms so that each of the GSIs graded 25% of the points allocated to the exam.⁶ I followed these steps for grading my questions:

I only grade one question at a time.
I check to make sure that I understand the question and its possible solutions. Some of them are based on concepts from research papers, so this process sometimes took a long time.
I get a group of the student answers on one screen, and scroll through them to get a general feel for what the answers are like. Then I develop rough categories. I use Gradescope’s “grouping” feature to create groups, such as “Correct - said X,Y,Z”, “Half Credit - Missed X”, etc.
Then I read through the answers and assign them to the pre-created groups.
At the end, I go through the groups and check for borderline cases. I look at the best and worst answers in each group, and re-assign answers to different categories if necessary.
Finally, I assign point values for the groups, and grade in batch mode. Fortunately, the entire process is done (mostly) anonymously, and I try not to see the identity of the students for fairness. Unfortunately, some students have distinctive handwriting, so it was not entirely anonymous, but it’s close enough. Grading using Gradescope is FAR better than the alternative of going through physical copies of exams. Bleh!
Actually, there’s one more step: regrade requests. Gradescope includes a convenient way to manage regrade requests, and we allowed a week for students to submit regrade requests. There were, in fact, a handful of instances when we had to give students more points, due to slight errors with our grading. (This is unavoidable in a class with more than 200 students, and with short-answer questions that have many possible answers.)
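The group-then-grade workflow above can be sketched in a few lines of Python. This is only an illustration of the idea, not Gradescope's API or anything the course staff actually ran: the group names, point values, submission IDs, and the `grade_in_batch` and `move_to_group` helpers are all hypothetical.

```python
# Hypothetical rubric groups for ONE short-answer question, with the
# point value assigned to each group (assigned last, as in the steps above).
GROUP_POINTS = {
    "Correct - said X,Y,Z": 4.0,
    "Half Credit - Missed X": 2.0,
    "Incorrect": 0.0,
}


def move_to_group(assignments, submission_id, new_group):
    """Re-assign a borderline answer to a different group (returns a copy)."""
    updated = dict(assignments)
    updated[submission_id] = new_group
    return updated


def grade_in_batch(assignments, group_points):
    """Give every answer the point value of the group it was placed in."""
    scores = {}
    for submission_id, group in assignments.items():
        if group not in group_points:
            raise KeyError(f"no point value set for group {group!r}")
        scores[submission_id] = group_points[group]
    return scores


# Anonymized submission IDs mapped to groups after the first read-through.
assignments = {
    "sub-001": "Correct - said X,Y,Z",
    "sub-002": "Half Credit - Missed X",
    "sub-003": "Incorrect",
}

# Borderline check: on a second look, sub-003 deserves half credit.
assignments = move_to_group(assignments, "sub-003", "Half Credit - Missed X")
scores = grade_in_batch(assignments, GROUP_POINTS)
print(scores)
```

Keeping the point values separate from the group assignments is the useful part: a regrade that changes what a whole group is worth touches one number rather than every answer in it.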

We hosted review sessions before each midterm, which were hopefully helpful to the students.

In retrospect, I think we could have done a better job with the clarity of some midterm questions. Some students gave us constructive feedback after the first midterm by identifying which short-answer questions were ambiguous, and I hope we did a better job designing the second midterm.

We received another comment about potentially making the exam multiple choice. I am a strong opponent of this, because short answer questions far more accurately reflect the real world, where people must explain concepts and are not normally given a clean list of choices. Furthermore, multiple choice questions can also be highly ambiguous, and they are sometimes easy to “game” if they are poorly designed (e.g., avoid any answer that says a concept is “always true,” etc.).

Overall, I hope the exams did a good job measuring students’ retention of the material. Yes, there are limits to how well timed exams correlate with actual knowledge, but it is one of the best resources we have based on time, efficiency, and fairness.

Final Projects

Thankfully, we did not have a final exam for the class. Instead, we had final projects, which were to be done in groups of 2-4 students, though some sneaky students managed to work individually. (“Surely I don’t have to explain why a team of one isn’t a team?” John replied on Piazza to a student who asked if he/she could work on a project alone.) The process of working on the final projects involved two pass/fail “check-ins” with GSIs. At the end of the semester, we had the poster session, and then each team submitted final project reports.

The four GSIs split up the final project grading so that we were the primary grader for 25% of the teams, and then we graded a subset of the other GSI teams to recalibrate grades if needed. I enforced a partial ordering for my teams: projects with grades $x$ were higher quality than those with grades less than $x$ and worse than those with grades higher than $x$, and just about equivalent in the case of equal grades. After a final scan of the grades, I was pretty confident with my ordering of them, and I (like the other GSIs) prepared a set of written comments to send to each team.

We took about a week to grade the final project reports and to provide written feedback. As with the midterms, we allowed for a brief regrade request period. Unfortunately, we could not simply give more points to teams if they gave little to no justification for why they should get a regrade, or if they just said “we spent so much time on the project!!”. We also have to be firm about keeping original grades set without changing them by a handful of points for no reason. If another GSI gave a grade of (for example) 80, and I thought I would have given it (for example) an 81, we cannot and did not adjust grades because otherwise we would be effectively rewarding students for filing pointless regrade requests by letting them get the maximum among the set of GSI grades.

One other aspect of the project deserves extra comment: the credit assignment problem. We required the teams to list the contribution percentages of each team member in their report. This is a sensitive topic, and I encourage you to read a thoughtful blog post by Chris Olah on this topic.

We simply cannot assign equal credit if people contribute unequally to projects. It is not ethical to do so, and we have to avoid people free-riding on top of others who do the actual work. Thus, we re-weighted grades based on the project contribution. That is, each student got a team grade and an individual grade. This is the right thing to do. I get so passionate about ensuring that credit is allocated at least as fairly as is reasonable.
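To make the idea of re-weighting concrete, here is a minimal sketch of one way a team grade could be turned into individual grades from the reported contribution percentages. It is not the formula used in the course, which is not given here; the function name `individual_grades`, the `weight` parameter, and the example numbers are all assumptions made for illustration.

```python
def individual_grades(team_grade, contributions, weight=0.2, max_grade=100.0):
    """Scale a team grade by each member's deviation from an equal split.

    contributions maps member names to reported contribution percentages.
    weight (an assumed knob) controls how strongly unequal contributions
    move the individual grade away from the team grade.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must sum to a positive number")
    equal_share = 1.0 / len(contributions)

    grades = {}
    for name, pct in contributions.items():
        share = pct / total
        # 0.0 means exactly an equal share; positive means more than equal.
        deviation = (share - equal_share) / equal_share
        adjusted = team_grade * (1.0 + weight * deviation)
        # Clamp so that nobody goes below zero or above the maximum grade.
        grades[name] = max(0.0, min(max_grade, adjusted))
    return grades


# Hypothetical two-person team with a 60/40 split and a team grade of 80.
example = individual_grades(80.0, {"student_a": 60, "student_b": 40})
for name, grade in example.items():
    print(f"{name}: {grade:.1f}")
```

With an equal split, every member simply receives the team grade, so this kind of scheme only has an effect when the reported contributions are unequal.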

The Course Evaluation

After the course was over, the course staff received reviews from the students. My reviews were split up into those from the 182 and the 282A students. I’m not sure why this is needed, as it only makes it harder for me to accumulate the results together. Anyway, here is the number of responses we received:

182 students: 145 responses out of 170 total.
282A students: 53 responses out of 65 total.

I don’t know if the “total” here reflects students who dropped the course.
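As a quick sanity check on those counts, here is a small sketch that computes the response rate for each version of the course and for the two combined. The numbers are copied from the list above; the names `EVALUATION_COUNTS` and `response_rate` are just illustrative.

```python
# (responses received, total students listed), copied from the counts above.
EVALUATION_COUNTS = {
    "182": (145, 170),
    "282A": (53, 65),
}


def response_rate(received, total):
    """Return the fraction of listed students who responded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return received / total


received_all = sum(received for received, _ in EVALUATION_COUNTS.values())
total_all = sum(total for _, total in EVALUATION_COUNTS.values())

for course, (received, total) in EVALUATION_COUNTS.items():
    print(f"CS {course}: {response_rate(received, total):.1%}")
print(f"Combined: {received_all}/{total_all} = "
      f"{response_rate(received_all, total_all):.1%}")
```

That works out to roughly 85% for 182, 82% for 282A, and 84% overall, and the combined total of 235 matches the class size mentioned earlier.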

Here are the detailed results. I present my overall numerical grades followed by the open-ended responses. The totals don’t match the numbers I recently stated, I think because some students only filled in a subset of the course evaluation.

Left: undergraduate student responses. Right: graduate student responses.

Top: graduate open-ended responses. Bottom: undergraduate open-ended responses.

My first thought was: ouch! The reviews said that the EECS department average was 4.34. Combining the undergrad and graduate ratings meant that I was below average.

Well, I was hoping to at least be remotely in contention for the Outstanding GSI award from the department. Unfortunately, I guess that will not happen. Nonetheless, I will still strive to be as effective a teacher as I can possibly be in the future. I follow the “growth mindset” from Carol Dweck,⁷ so I must use this opportunity to take some constructive criticism.

From looking at the comments, one student said I was “kinda rude” and another said I was “often condescending and sometimes threatening” (?!?!?). I have a few quick reactions to this.

First, if I displayed any signs of rudeness, condescension, or threatening behavior (!!) in the course, it was entirely by mistake and entirely unintentional! I would be terrified if I was a student and had a course staff member threaten me, and I would never want to impose this feeling on a student.
Regarding the criticism of “condescension,” I have tried long and hard to remain humble in that I do not know everything about every concept, and that I should not (unreasonably) criticize others for this. When I was in elementary school, one of my most painful nightmares was when a speech teacher⁸ called me out for an arrogant comment; I had told her that “skiing on black diamond trails is easy.” That shut me up, and taught me to watch my words in the future. With respect to Deep Learning, I try to make it clear if I do not know enough about a concept to help a student. For example, I did not know much about the Transformer architecture before taking this class, and I had to learn it along with the students. Maybe some of the critical comments above could have been due to vague answers about the Transformer architecture? I don’t use it in my research, unlike the three other GSIs who do, which is why I recommended that students with specific Transformer-related questions ask them.
One possibility might be that those negative comments came from students who posted incomplete issue reports on Piazza that got my prompt response of linking to John Canny’s “filing an issue report” page (discussed earlier). Admittedly, I was probably excessively jumping the gun on posting those messages. Maybe I should not have done that, but the reality is that we simply cannot provide any reasonable help to students if they do not post enough information for us to understand and reproduce their errors, and I figured that students would want a response sooner rather than later.

I want to be a better teacher. If there are students who have specific comments about how I could improve my teaching, then I would like to know about it. I would particularly be interested in getting responses from students who gave me low ratings. To be clear, I have absolutely no idea who wrote what comments above; students have the right to provide negative feedback anonymously to avoid potential retaliation, though I would never do such a thing.

If any students are genuinely worried that I will be angry at them for sending me non-anonymous negative feedback, then in your email or message to me, be sure to paste a screenshot of this blog post which shows that I am asking for this feedback and will not get angry. Unlike Mitch McConnell, I have no interest in being a hypocrite, so I will have to take the feedback to heart. At the very least, a possible message could be structured like the following:

Dear Daniel

I was a student in CS 182/282A this past semester. I think you are a terrible GSI and I rated you 1 out of 5 on the course evaluation (and I would have given a 0 had that been an option). I am emailing to explain why you are an atrocious teacher, and before you get into a hissy fit, here’s a screenshot of your blog showing that we have permission to give this kind of feedback without retaliation:

[insert screenshot and other supporting documentation here]

Anyway, here are the reasons:

Reason 1: [insert constructive feedback here]

Reason 2: [insert constructive feedback here]

I hope you find this helpful!

Sincerely, […]

And there you have it! Without knowing what I can do to be a better teacher, I won’t be able to improve.

On the positive side, at least I got lots of praise for the Piazza comments! That counts for something. I hope students appreciated it, as I enjoy responding and chatting with students.

Obviously, if you graded me 5 stars, then thanks! I am happy to meet with you and chat over tea. I will pay for it.

Finally, without a doubt, the most badass comment above was “I’m inspired by his blog” (I’m removing the swear word here, see the footnote for why).⁹ Ha ha! To whoever said that, if you have not subscribed, here is the link.

Closing Thoughts

Whew. Now that the summer is well underway and I am no longer teaching, I am catching up on research and other activities. I hope this reflection gives an interesting perspective of a GSI for a course on a cutting-edge, rapidly changing subject. It is certainly a privilege to have this opportunity!

Throughout the semester, I recorded the approximate hours that I worked each week on this class. I’m pleased to report that my average was roughly 25 hours a week. I do not count breakfast, lunch, or dinner if I consumed them in between my work schedule. I count meetings, and usually time I spend on emails. It’s hard to say if I am working more or less than other GSIs.

Since I no longer have to check Piazza, my addiction to Piazza is now treated. Thus, my main remaining addictions to confront are reading books, blogging, eating salads, and long-distance running. Unfortunately, despite my best efforts, I think I am failing to adequately reduce the incidence of all four of these.

That is a wrap. I am still super impressed by how quickly John Canny is able to pick up different fields. Despite becoming department chair in two days, he continues to test and implement his own version of the Transformer model. He will be teaching CS 182/282A next Spring, and I am told that John will try to get a better time than 8:00am; given that he’s the department chair, he must somehow get priority on his course times. Right?

Stay tuned for the next iteration of the course, and happy Deep Learning!

I thank David Chan and Forrest Huang for feedback on earlier drafts of this post.

¹ It is confusing, but courses like CS 182/282A, which have both undergraduate and graduate students, should count for the “undergraduate” course. If it doesn’t, then I will have to petition the department.
² These are the notes that form the basis of the AI prelim at Berkeley. You can read about my experience with that here.
³ This was the reason that almost stopped me from being a GSI for this course in Fall 2016. John was concerned that I would spend too much time on Piazza and not enough on research.
⁴ I made a cruel joke in one of my office hours by commenting on the possibility of there being a third midterm. It took some substantial effort on my part to convince the students there that I was joking.
⁵ I’m sure John had his choice of faculty offers, since he won the 1987 ACM Doctoral Dissertation award, for having the best computer science PhD in the world. The award letter in his dissertation says that it contained about “two awards’ worth” (!!) of material. And amazingly, I don’t think his PhD thesis includes much about his paper on edge detection, the one for which he is best known, with over 31,000 citations. As in, he could omit his groundbreaking edge detector, and his thesis still won the dissertation award. You can find the winners of the ACM Doctoral Dissertation Award here. Incidentally, it seems like the last five Berkeley winners or honorable mentions (Chelsea Finn, Aviad Rubinstein, Peter Bailis, Matei Zaharia, and John Duchi) all are currently at Stanford, with Grey Ballard breaking the trend by going back to his alma mater of Wake Forest.
⁶ One of my regrets is that I did not know that some other GSIs were not 20-hour GSIs like me, and worked less. Since that was the case, I should have taken more of the duty in grading the exam questions.
⁷ You can probably guess at least one of the books that’s going to appear on my eventual blog post about the “Books I Read in 2019.” You can find past blog posts about my reading list here (2016), here (2017), and here (2018).
⁸ Most students in the Deaf and Hard of Hearing program in my school district took speech lessons throughout elementary and middle school, since it is harder for us to know if we are pronouncing words in the way that most hearing people do. Even today, I don’t think I can pronounce “s” in a fully satisfactory manner.
⁹ A weird and crazy factoid about me is that, as a conservative estimate, it has been ten years since I last uttered a swear word or used a swear word in writing. This includes “censoring” swear words with asterisks.