
Thought this was cool: Case study: million songs dataset


A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive’s suggestion, we now have an implementation of Fabio Aiolli’s cost function, as explained in the paper A Preliminary Study for a Recommender System for the Million Songs Dataset, which describes the winning method in this contest.

Below are detailed instructions for using the GraphChi CF toolkit on the million songs dataset to compute user ratings from item similarities.

Instructions for computing item to item similarities:

1) For obtaining the dataset, download and scripts.

2) Run to download the million songs dataset and prepare GraphChi compatible format.
$ sh
Note: this operation may take an hour or so to prepare the data.

3) Run GraphChi item based collaborative filtering, to find out the top 500 similar items for each item:

./toolkits/collaborative_filtering/itemcf --training=train --K=500 --asym_cosine_alpha=0.15 --distance=3 --min_allowed_intersection=5
Explanation: --training points to the training file. --K=500 means we compute the top 500 similar items for each item.
--distance=3 selects Aiolli’s metric, and --asym_cosine_alpha=0.15 sets the alpha parameter of that asymmetric cosine. --min_allowed_intersection=5 means we take into account only pairs of items that were rated together by at least 5 users.
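
To make these parameters more concrete, here is a small Python sketch of the asymmetric cosine similarity as described in Aiolli's paper: the number of users two items share, divided by |U(i)|^alpha * |U(j)|^(1 - alpha). This is not GraphChi's code. The function name, the (user, item) input format and the brute-force loop over item pairs are assumptions made for illustration, and GraphChi may orient the asymmetry differently.

```python
from collections import defaultdict
from itertools import combinations


def asymmetric_cosine_topk(ratings, alpha=0.15, k=500, min_intersection=5):
    """Return {item: [(other_item, similarity), ...]} with at most k neighbours.

    `ratings` is an iterable of (user, item) pairs (hypothetical input format).
    """
    # Invert the pairs into item -> set of users who rated it.
    users_of = defaultdict(set)
    for user, item in ratings:
        users_of[item].add(user)

    sims = defaultdict(list)
    for i, j in combinations(sorted(users_of), 2):
        common = len(users_of[i] & users_of[j])
        if common < min_intersection:
            continue  # too few users in common, skip this pair
        ni, nj = len(users_of[i]), len(users_of[j])
        # Asymmetric cosine: |U(i) & U(j)| / (|U(i)|^alpha * |U(j)|^(1 - alpha))
        sims[i].append((j, common / (ni ** alpha * nj ** (1 - alpha))))
        sims[j].append((i, common / (nj ** alpha * ni ** (1 - alpha))))

    # Keep only the k most similar neighbours of each item.
    return {i: sorted(nb, key=lambda p: -p[1])[:k] for i, nb in sims.items()}
```

With alpha = 0.5 this reduces to the ordinary cosine between binary item vectors. The all-pairs loop is only workable on toy data, which is why the real computation needs a tool like GraphChi.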

Note: this operation requires around 20GB of memory and may take a few hours.

4) Post-process the results to create a single item-to-item ratings file:
$ sh ./toolkits/collaborative_filtering/ train
Sorting output file train.out0

Merging sorted files:
File written: train-topk

5) Create a matrix market header – by saving the below bash script into a file and running it:
(or simply copy everything and paste into a bash/sh shell window)

#!/bin/sh -x
# ITEMS must already be set to the number of items in the dataset.
echo "%%MatrixMarket matrix coordinate real general" > train-topk\:info
echo $ITEMS $ITEMS `wc -l train-topk | awk '{print $1}'` >> train-topk\:info
echo "*********************"
cat train-topk\:info
echo "*********************"
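
For readers who prefer to see the header layout spelled out, here is a minimal Python sketch of what the script above produces: a banner line, then one line with the row count, column count and number of entries. The function names are hypothetical, and the item count still has to come from you, just like $ITEMS in the shell version.

```python
def matrix_market_header(num_items, num_entries):
    """Build the two-line header for a square item-by-item coordinate matrix."""
    banner = "%%MatrixMarket matrix coordinate real general"
    size_line = f"{num_items} {num_items} {num_entries}"
    return banner + "\n" + size_line + "\n"


def write_info_file(topk_path, num_items):
    """Write `<topk_path>:info`, counting entries the way `wc -l` would."""
    with open(topk_path) as handle:
        num_entries = sum(1 for _ in handle)
    header = matrix_market_header(num_items, num_entries)
    with open(topk_path + ":info", "w") as out:
        out.write(header)
    return header
```

Calling write_info_file("train-topk", num_items) with your own item count should give the same two lines as the shell script.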

Create user recommendations based on item similarities:

1) Run itemsim2rating to compute recommendations based on item similarities
$ ./toolkits/parsers/itemsim2rating --training=train --similarity=train-topk --K=500 membudget_mb 50000 --nshards=1 --max_iter=2 --Q=3
Note: this operation may require 20GB of RAM and may take a couple of hours based on your computer configuration.

Output file is: train-rec
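
As a rough illustration of what this step computes, the sketch below scores each unseen item for one user by summing the similarity weights, each raised to the power Q, between that item and the items the user already has. This is a simplified reading of the scoring in Aiolli's paper, not the itemsim2rating source. The function name and the dict-based data layout are assumptions.

```python
from collections import defaultdict


def recommend(user_items, topk_sims, q=3, k=500):
    """Return the k best unseen items for one user as (item, score) pairs.

    user_items: set of items the user already has.
    topk_sims:  {item: [(similar_item, weight), ...]} (hypothetical layout).
    """
    scores = defaultdict(float)
    for j in user_items:
        for i, w in topk_sims.get(j, []):
            if i in user_items:
                continue  # never recommend what the user already has
            scores[i] += w ** q  # Q sharpens the weights: strong links dominate
    # Highest score first; ties broken by item id for a stable order.
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))[:k]
```

For weights between 0 and 1, a larger Q shrinks weak similarities much faster than strong ones, so the ranking leans more heavily on the closest neighbours.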

Evaluating the result

1) Prepare test data:
./toolkits/parsers/topk --training=test --K=500

Output file is: test.ids

2) Prepare training recommendations: 

./toolkits/parsers/topk --training=train-rec --K=500

Output file is: train-rec.ids

3) Compute mean average precision @ 500:
./toolkits/collaborative_filtering/metric_eval --training=train-rec.ids --test=test.ids --K=500

About the performance: Q is the power applied to the item similarity weight.
When Q = 1 we get:

INFO:     metric_eval.cpp(eval_metrics:114): 7.48179 Finished evaluating 100000 instances.
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.151431
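
For reference, here is a minimal Python sketch of mean average precision truncated at K, following the definition used for the million songs dataset contest as far as I know it: precision is accumulated at each rank where a hit occurs, then divided by the smaller of K and the number of relevant items. The function names are hypothetical, and metric_eval may normalize differently, so do not expect this to reproduce the number above exactly.

```python
def average_precision_at_k(recommended, relevant, k=500):
    """AP@k for one user; `recommended` is a ranked list without duplicates."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(len(relevant), k)


def mean_average_precision(recs_by_user, truth_by_user, k=500):
    """Average AP@k over every user that appears in the test data."""
    users = list(truth_by_user)
    if not users:
        return 0.0
    return sum(
        average_precision_at_k(recs_by_user.get(u, []), truth_by_user[u], k)
        for u in users
    ) / len(users)
```

Users with no recommendations contribute an AP of zero here, so they pull the mean down instead of being skipped.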


Acknowledgements:

  • Clive Cox, for proposing item-based recommendations in GraphChi and for his support while implementing this method.
  • Fabio Aiolli, University of Padova, winner of the million songs dataset contest, for his great support with the implementation of his metric.

from Large Scale Machine Learning and Other Animals:

Written by cwyalpha

February 2, 2013 at 12:09 PM


