I mentioned a few months ago that I volunteered to be a reviewer for NIPS 2016. The NIPS decisions have been out for a few weeks now and a detailed account of the review process has been published.

Overall, I think it was a good experience for me to finally start reviewing academic papers, and I’m really happy that NIPS has so much transparency to help me understand how it works behind the scenes. When I looked over the published account of the review process, a few things immediately caught my attention.

First, wow, NIPS is really large. I would not have expected roughly 6000 authors to submit 2406 papers for review. And NIPS used to be a niche conference a few decades ago, right? I wonder how much larger it can realistically become while still being an academic conference. There were also 3424 reviewers, another eye-popping number. I was also right about the senior versus volunteer reviewer split: the system was designed to assign six reviewers per paper, three senior and three volunteer. I was, naturally, one of the volunteers. Reviewers submitted an average of 4.05 reviews, so my batch of four papers was standard.
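As a rough sanity check on those figures (the numbers are from the published account; the back-of-the-envelope comparison is mine), here is a quick computation:

```python
# Back-of-the-envelope check on the NIPS 2016 reviewing numbers quoted above.
# Inputs come from the published account; six reviewers per paper was the
# stated design goal, not necessarily what every paper ended up with.

papers = 2406
reviewers = 3424
reviewers_per_paper = 6      # three senior + three volunteer, by design
reported_avg_reviews = 4.05

reviews_needed = papers * reviewers_per_paper     # 14436 reviews if every paper got six
implied_avg = reviews_needed / reviewers          # ~4.22 reviews per reviewer
reviews_written = reviewers * reported_avg_reviews  # ~13867 reviews actually submitted

print(f"Reviews needed for 6 per paper:       {reviews_needed}")
print(f"Implied load per reviewer:            {implied_avg:.2f}")
print(f"Reviews implied by the 4.05 average:  {reviews_written:.0f}")
```

The reported 4.05 average sits a bit below the 4.22 that a strict six-reviews-per-paper target would imply, so presumably not every paper actually received all six reviews.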

In the past, NIPS papers were evaluated on a single scale from 1 to 10. This year, the process was more fine-grained, with separate 1-5 scales for (a) technical quality, (b) novelty, (c) impact, and (d) writing/clarity. After some experience with the new scoring scales, I have mixed feelings about them. The change was ostensibly made to help NIPS scale, but I worry about the workload of the Area Chairs, who now have to spend more time analyzing the reviews instead of looking at a single numeric score. If the Area Chairs can work effectively with the new scales, then I suppose it is OK. I actually like how the separation helps me think more clearly about “novelty” versus “impact.”

In general, (d) is the easiest to judge, especially for me and my “unique” typo-catching abilities. Papers with substantial typos and presentation problems should go straight into the rejection pile; for one of the papers I reviewed, nearly every reviewer heaped criticism on the writing/clarity, making it an easy rejection. I’m not sure why I spent so much time on that paper carefully pointing out its flaws when I should have quickly moved on. A faculty member told me that “spending two hours for a review is a good benchmark, and you can spend a little more if the paper looks publishable.”

Gulp. I spent way more than two hours per paper, even on the one that should have been an obvious rejection. Way more.

Some reviewers were smarter (or lazier?). For each paper I reviewed, there was at least one reviewer who gave the dreaded two-line review. Interestingly, from my very small sample, faculty gave more of those reviews than postdocs and graduate students did. (Time for faculty to assign these reviews to their graduate students, I suppose.) While NIPS uses double-blind reviewing, reviewers could see the identities and reviews of the other reviewers on the same paper. We could also leave private comments in our reviews, visible only to the other reviewers and the Area Chairs. One of the most distressing aspects of the system was seeing almost every reviewer say some variant of “I did not completely check the proofs.” It is clearly impractical for a group of six to verify every technical detail of a paper unless, at minimum, they somehow divide up the task, with reviewer 1 checking Theorem 1, reviewer 2 checking Theorem 2, and so on.

Fortunately, I could not see the other reviews until after I had submitted mine, a good thing! Once the reviewing period finished, the conference website provided a private forum where reviewers and Area Chairs could discuss the papers in depth, but by that point I was fried and swamped with other work. A few people labeled as “meta reviewers” (I think these were the Area Chairs) tried to encourage discussion, but most reviewers felt the same way I did and had little appetite for extended conversation. In a perfect world there would be plenty of discussion of the author rebuttals, which are a key part of the NIPS reviewing process. Unfortunately, I found it really hard to motivate myself to read the rebuttals carefully, and I ultimately did not change any of my scores.

From the list of accepted papers, one of the four papers I reviewed got in, which aligns well with the overall NIPS acceptance rate (this year, 568/2406 = 0.236). As an experiment, the NIPS chairs asked reviewers to provide an ordered ranking of just the papers they reviewed. The paper I ranked highest of my four was indeed the one that got in, which is some extra assurance that my reviews are not totally out of whack with what other reviewers think.

Overall, I think my favorite part of reviewing is getting to compare my reviews with those of other reviewers and to read the author rebuttals. I like it when my reviews hit the same core points as the others, and when authors thank me in their rebuttals for catching errors. I hope to keep giving feedback on research papers, whether or not it is part of an official reviewing system.