Just another site

Thought this was cool: alexbw/Netflix-Prize · GitHub

leave a comment »

Comments: “alexbw/Netflix-Prize · GitHub”


My Code for the Netflix Prize

I’m not aware of folks having published their code for the Netflix Prize. Here’s mine.
Under the team name “Hi!”, I competed alone in college. I did it mostly for fun, and to learn modern machine learning techniques. It was an incredibly valuable, but strenuous, time. Well worth it on all fronts, though.
I peaked out at #45 or so, and then dropped out to work on my senior thesis, and came in #145 or so.
What I learned in the process was that smarter wasn’t always better — make an algorithm, and then scale it up, and then make a dozen tweaks to it, and then average all of the results together. That’s how you climbed the leaderboard.

Anyhoo, I haven’t touched this code in awhile, but perhaps it’ll be useful to folks interested in competitive data mining.
Specifically, the lessons I learned:

  • Get the raw data into a saved and manageable format fast. The easier it is to load your data in and start mutating it, the better.
  • If doing simple pivots on your data is hard, and slows you down from visualizing whats in your data, spend time making data structures which make that easy.
  • Generalize. Iterate. If you have a method you think will work, but it has a lot of knobs, and you don’t know the best way to set those knobs, make it easy for you to try every possible iteration. There is often not a good way to figure out what the best approach is. You will have to try many of them in order to build up an intuition. Specifically, that means (for me) a pluggable architecture. If there’s ten ways to try a particular step, make sure you write your overarching algorithm so that it takes a function that you can pass to it, as opposed to having a method hardwired in the code. That way, you can hotswap all your ideas.
  • Speed is a feature. Of course you make sure it works first. But your goal is to see if something works. If an algorithm takes a day to run, but you can spend five hours making it run in 1/3 of a day, do it. You’ll be running it over and over again, and you’ll learn more if you can iterate.

As for the technical nitty-gritty, everything that’s speed sensitive is written in Cython, which was the best balance of speed and convenience in 2009. If I were to do it al again, I would use (Numba)[].

The original data is gone, I believe, but I might have it stored somewhere. I’ll look for that.

from Hacker News 50:


Written by cwyalpha

九月 5, 2012 在 9:24 上午

发表在 Uncategorized


Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  更改 )

Google+ photo

You are commenting using your Google+ account. Log Out /  更改 )

Twitter picture

You are commenting using your Twitter account. Log Out /  更改 )

Facebook photo

You are commenting using your Facebook account. Log Out /  更改 )


Connecting to %s

%d 博主赞过: