weka

My new research project is underway. It involves a web crawler that can reach the computer science home pages of colleges and universities across the country. The crawler can then try to derive data from web pages belonging to individual computer scientists. Given that data, is it possible for us to determine a set of features that will indicate whether the crawler has found the homepage of a female computer scientist? If so, then it might be possible to reach out to those people by adding them to mailing lists, such as the one belonging to the CRA-W. I hope to learn a lot during this project, and I’ll probably discuss it later on Seita’s Place.

Since this is a machine learning project, I’m using the Weka software to help me run experiments and determine which features in the data are most useful in indicating whether a homepage belongs to a female computer scientist. One feature I think would be very indicative is the number of times the pronoun “she” appears. Another might be whether the person’s first name appears in an index of common American female first names.
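To make those two features concrete, here is a minimal sketch of how they might be computed from a page's text and written out in ARFF, the plain-text format Weka reads. Everything named here is an assumption for illustration: the tiny `FEMALE_NAMES` set stands in for a real name index, and the function names, attribute names, and the `female`/`other` labels are hypothetical rather than the project's actual code.

```python
import re

# Hypothetical stand-in for an index of common American female first names.
FEMALE_NAMES = {"mary", "jennifer", "linda", "susan"}


def count_she(text):
    """Count whole-word, case-insensitive occurrences of the pronoun 'she'."""
    return len(re.findall(r"\bshe\b", text, flags=re.IGNORECASE))


def extract_features(page_text, first_name, name_index=FEMALE_NAMES):
    """Return the two illustrative features for one homepage."""
    return {
        "she_count": count_she(page_text),
        "name_in_index": int(first_name.lower() in name_index),
    }


def to_arff(rows, relation="homepages"):
    """Build ARFF text from a list of (features, label) pairs.

    Each label is assumed to be either 'female' or 'other'.
    """
    lines = [
        f"@relation {relation}",
        "@attribute she_count numeric",
        "@attribute name_in_index {0,1}",
        "@attribute label {female,other}",
        "@data",
    ]
    for feats, label in rows:
        lines.append(f"{feats['she_count']},{feats['name_in_index']},{label}")
    return "\n".join(lines) + "\n"
```

The word-boundary match means a name like “Sheila” does not inflate the pronoun count. The string returned by `to_arff` can be saved to a `.arff` file and opened in Weka.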

Because I’ve never used Weka before, I looked at this tutorial on YouTube. I wasn’t planning to watch the whole thing, since I just wanted to see what the user’s settings were, where he clicked, and other small things. But I was surprised to see that there were complete and correct subtitles for the entire 23-minute video! This wasn’t the automatic Google captioning that you can just click on for most videos. This guy, Brandon Weinberg, actually inserted a full transcript of his spiel. Major kudos go to him.

Don’t you wish that every YouTube video could be like this? That would be a nice Christmas present for me in 2030.