
Thought this was cool: Collaborative Filtering – 3rd Generation [or] winning the kdd cup in 5 minutes!



After spending a few years writing collaborative filtering software with thousands of installations, talking to dozens of companies, and participating in the KDD Cup twice, I have started developing some next-generation collaborative filtering software. The software is very experimental at this point, and I am looking for help from my readers: universities and companies who would like to try it out.

The problem:

Most collaborative filtering methods (like ALS, SGD, bias-SGD, NMF, etc.) use the rating values for computing a matrix factorization. A few "fancier" methods (like tensor-ALS, time-SVD++, etc.) also utilize the temporal information to improve the quality of predictions. So basically we are limited to 2- or 3-dimensional factorization. Typically the utilized data is of the type:
[ user ] [ item ] [ rating ] 
or
[ user ] [ item ] [ time ] [ rating ] 
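For reference, the inner loop of a bias-SGD factorization over such [ user ] [ item ] [ rating ] triples can be sketched in plain Python (the learning rate and regularization values below are illustrative, not any particular library's defaults):

```python
def sgd_step(P, Q, bu, bi, mu, u, i, r, lr=0.01, reg=0.02):
    """One bias-SGD update on a single (user, item, rating) triple.

    Prediction = global mean + user bias + item bias + latent dot product.
    """
    pred = mu + bu[u] + bi[i] + sum(pu * qi for pu, qi in zip(P[u], Q[i]))
    err = r - pred
    # Gradient steps on the biases, with L2 shrinkage.
    bu[u] += lr * (err - reg * bu[u])
    bi[i] += lr * (err - reg * bi[i])
    # Gradient steps on the latent factors, using the pre-update values.
    for k in range(len(P[u])):
        pu, qi = P[u][k], Q[i][k]
        P[u][k] += lr * (err * qi - reg * pu)
        Q[i][k] += lr * (err * pu - reg * qi)
    return err
```

Looping this over all ratings for a few dozen epochs is the whole of "plain" matrix factorization; the methods below differ mainly in what extra terms they add to the prediction.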

I am often asked, how to approach problems when you have data of the type:

[ user ] [ item ] [ item category] [ purchase amount ] [ quantity ] [ user age ] [ zip code ] [ time ] [ date ] … [ user rating ]

In other words, how do we utilize additional information we have about user features, item features, or even fancier features like a user's social connections? This problem is often encountered in practice, and in many cases papers are written about it using problem-specific constructions; see for example Koenigstein's paper. In practice, however, most users do not want to rack their brains inventing novel algorithms; they want a readily accessible method that can take more features into account without much fine-tuning.
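libFM's answer to this is the factorization machine, which folds arbitrary extra features into pairwise factorized interactions. A minimal sketch of its second-order prediction function, with x as a sparse index-to-value dict and V holding one k-dimensional factor vector per feature (the O(k·n) reformulation is the standard factorization machine identity; everything else here is illustrative):

```python
def fm_predict(x, w0, w, V, k):
    """Second-order factorization machine:
    y = w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j,
    computed with the linear-time identity
    sum_{i<j} <V_i, V_j> x_i x_j
      = 0.5 * sum_f [ (sum_i V_i[f] x_i)^2 - sum_i (V_i[f] x_i)^2 ].
    """
    y = w0 + sum(w[i] * xi for i, xi in x.items())
    for f in range(k):
        s = sum(V[i][f] * xi for i, xi in x.items())
        s2 = sum((V[i][f] * xi) ** 2 for i, xi in x.items())
        y += 0.5 * (s * s - s2)
    return y
```

Because each feature (user id, item id, category, zip code, ...) just becomes another index in x, adding a column to the data adds no new code, which is exactly the property we want here.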

The solution:

Following the great success of libFM, I thought about implementing a more general SGD method in GraphChi that can take a list of features into account.

A new SGD-based algorithm was developed, with the following features:
1) Support for string features ("John Smith bought The Matrix").
2) Support for dynamic selection of features at runtime.
3) Support for multiple file formats, with column permutations.
4) Support for an unlimited number of features.
5) Support for multiple ratings of the same item.
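String support typically boils down to mapping each distinct feature string to a dense integer id before factorization. A minimal sketch of such a feature dictionary (how gensgd's --rehash option does this internally is an assumption on my part):

```python
class FeatureMap:
    """Map arbitrary string feature values to consecutive integer ids."""
    def __init__(self):
        self.ids = {}

    def index(self, token):
        # Assign the next free id on first sight; reuse it afterwards.
        return self.ids.setdefault(token, len(self.ids))

# "John Smith bought The Matrix" becomes a row of integer indices:
fmap = FeatureMap()
row = [fmap.index(t) for t in ("John Smith", "The Matrix")]
```

Once every column is an integer index, user ids, item titles, and categories can all be fed into the same factorization without any per-dataset code.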

Working example – KDD CUP 2012 – track1

To give a concrete example, I will use the KDD CUP 2012 track 1 data to demonstrate how easy it is to set up and try the new method.

Preliminaries:
0) Download track 1 data from here. Extract the zip file.
1) Download and install GraphChi using steps 1-3.

2a) In the root graphchi folder, create a file named rec_log_train.txt:info with the following lines:

%%MatrixMarket matrix coordinate real general
2500000 2500000 73209277

2b) Link the file track1/rec_log_train.txt into the root graphchi folder:
cd graphchi
ln -s ../track1/rec_log_train.txt .
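The :info file is just a MatrixMarket header declaring rows, columns, and the number of non-zeros for the raw ratings file. If you don't know the rating count up front, a short script can generate it (a sketch; the 2500000 x 2500000 dimensions come from the contest data, and the ":info" naming follows step 2a):

```python
def write_mm_info(data_path, n_rows, n_cols):
    """Write a ':info' companion file with a MatrixMarket coordinate header."""
    # Each non-blank line of the ratings file is one non-zero entry.
    with open(data_path) as f:
        nnz = sum(1 for line in f if line.strip())
    with open(data_path + ':info', 'w') as out:
        out.write('%%MatrixMarket matrix coordinate real general\n')
        out.write('%d %d %d\n' % (n_rows, n_cols, nnz))
    return nnz
```

For the full track 1 training file this would count the 73,209,277 entries shown in the header above.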

Let’s look at the input file format:

<49|0>bickson@bigbro6:~/graphchi$ head rec_log_train.txt
2088948 1760350 -1 1318348785
2088948 1774722 -1 1318348785
2088948 786313 -1 1318348785
601635 1775029 -1 1318348785
601635 1902321 -1 1318348785

The input format is:
[ user ] [ item ] [ click ] [ timestamp ]
where click is either -1 (not clicked) or 1 (clicked).

First step: regular matrix factorization

Now let’s run a quick matrix factorization using user, item and rating:
 ./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1 --calc_error=1

Explanation: --training is the input file; --val_pos=2 means the rating is in column 2; --rehash=1 means we treat all fields as strings (and thus support string values); --limit_rating means we handle only the first million ratings (to speed up the demo); --max_iter is the number of SGD iterations; --minval and --maxval are the allowed rating range; --quiet=1 gives less verbose output; and --calc_error displays the classification error (how many predictions were wrong).
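For reference, the two per-iteration columns in the log below can be reproduced from a list of predictions as follows; counting a prediction as wrong when its sign disagrees with the ±1 rating is my assumption about what --calc_error reports:

```python
def rmse_and_error(preds, ratings):
    """Return (RMSE, classification-error rate) for +/-1 ratings.

    A prediction counts as wrong when its sign disagrees with the rating's.
    """
    n = len(ratings)
    mse = sum((p - r) ** 2 for p, r in zip(preds, ratings)) / n
    wrong = sum(1 for p, r in zip(preds, ratings) if (p >= 0) != (r >= 0))
    return mse ** 0.5, wrong / n
```

Both metrics are computed over the training set here, so they measure fit, not generalization; a held-out validation file would be needed for the latter.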
And here is the output we get:

WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 0 : 
  0.700402) Iteration:   0 Training RMSE:   0.525669 Train err:   0.016858
   1.13682) Iteration:   1 Training RMSE:   0.509467 Train err:   0.022426
   1.54744) Iteration:   2 Training RMSE:   0.500282 Train err:   0.016037
    1.9904) Iteration:   3 Training RMSE:   0.494313 Train err:   0.020332
     2.409) Iteration:   4 Training RMSE:   0.489487 Train err:   0.019487

   40.8789) Iteration:  96 Training RMSE:   0.181228 Train err:   0.002992
   41.2817) Iteration:  97 Training RMSE:   0.180879 Train err:   0.003891
   41.7102) Iteration:  98 Training RMSE:   0.180597 Train err:   0.005794
   42.1098) Iteration:  99 Training RMSE:   0.180272 Train err:   0.004452

We got a training RMSE of 0.18, and 4,452 wrong predictions out of the first million ratings.

Second step: temporal matrix factorization

Now let's add the time bins (column 3) into the computation as a feature and run again. This is done using the --features=3 command-line flag:

./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1 --calc_error=1 --features=3
WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   3.17175) Iteration:   0 Training RMSE:   0.522901 Train err:   0.033788
   3.59872) Iteration:   1 Training RMSE:   0.502048 Train err:   0.035543
   4.02827) Iteration:   2 Training RMSE:   0.490269 Train err:   0.034003
   4.47459) Iteration:   3 Training RMSE:   0.481781 Train err:   0.031853
   4.91073) Iteration:   4 Training RMSE:   0.473387 Train err:    0.03153
   5.33881) Iteration:   5 Training RMSE:   0.460354 Train err:   0.034153

   46.6616) Iteration:  96 Training RMSE:  0.0195902 Train err:          0
   47.0878) Iteration:  97 Training RMSE:   0.019386 Train err:          0
   47.5255) Iteration:  98 Training RMSE:  0.0191875 Train err:          0
   47.9595) Iteration:  99 Training RMSE:  0.0189961 Train err:          0

By taking the time bins into account, the training RMSE improves from 0.18 to 0.018!
Furthermore, the classification error drops to zero.

Third step: let’s throw in some user features!

Besides the rating data, we have some additional information about the users. The file user_profile.txt holds some properties of each user:
100044 1899 1 5 831;55;198;8;450;7;39;5;111
100054 1987 2 6 0
100065 1989 1 57 0
100080 1986 1 31 113;41;44;48;91;96;42;79;92;35
100086 1986 1 129 0
100097 1981 1 75 0
100100 1984 1 47 71;51
The file has the following format:
[user ] [ year of birth ] [ gender ] [ number of tweets ] [ tag ids (area of interest) ]
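Parsing this format is straightforward; here is a sketch (treating a lone "0" in the tag column as "no tags" is inferred from the sample rows above):

```python
def parse_user_profile(line):
    """Parse one user_profile.txt row:
    [user] [year of birth] [gender] [#tweets] [';'-separated tag ids]."""
    user, year, gender, tweets, tags = line.split()
    # A bare '0' in the tag column appears to mean 'no tags of interest'.
    tag_ids = [] if tags == '0' else [int(t) for t in tags.split(';')]
    return int(user), int(year), int(gender), int(tweets), tag_ids
```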
Adding user features is simply done with the flag --user_file=user_profile.txt:

./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1 --calc_error=1 --features=3 --last_item=1 --user_file=user_profile.txt
WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   2.02809) Iteration:   0 Training RMSE:          0 Train err:          0
   2.90718) Iteration:   1 Training RMSE:   0.511614 Train err:   0.022662
   3.74655) Iteration:   2 Training RMSE:    0.49371 Train err:   0.017136
   4.55983) Iteration:   3 Training RMSE:   0.479225 Train err:   0.015074
   5.40781) Iteration:   4 Training RMSE:   0.465404 Train err:   0.016538
   6.27764) Iteration:   5 Training RMSE:   0.451063 Train err:   0.015657

   77.5867) Iteration:  96 Training RMSE:  0.0177382 Train err:          0
   78.3384) Iteration:  97 Training RMSE:  0.0176325 Train err:          0
   79.0683) Iteration:  98 Training RMSE:  0.0174947 Train err:          0
   79.7872) Iteration:  99 Training RMSE:  0.0174152 Train err:          0
Overall, we got another improvement, from 0.018 to 0.0174.

Fourth step: throw in some item features

In the KDD Cup data we are also given some item features, in the file item.txt:

2335869 8.1.4.2 412042;974;85658;174033;974;9525;72246;39928;8895;30066;2245;1670;85658;174033;6977;6183;974;85658;174033;974;9525;72246;39928;8895;30066;2245;1670;85658;174033;6977;6183;974
1774844 1.8.3.6 31449;517124;45008;2796;79868;45008;202761;2796;101376;144894;31449;327552;133996;17409;2796;4986;2887;31449;6183;2796;79868;45008;13157;16541;2796;17027;2796;2896;4109;501517;2487;2184;9089;17979;9268;2796;79868;45008;202761;2796;101376;144894;31449;327552;133996;17409;2796;4986;2887;31449;6183;2796;79868;45008;13157;16541;2796;17027;2796;2896;4109;501517;2487;2184;9089;17979;9268

The format is:
[ item id ] [ category ] [ list of keywords ]
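A parser sketch for this item format (reading the dotted string as a category path is inferred from the sample rows above):

```python
def parse_item(line):
    """Parse one item.txt row:
    [item id] [dotted category path] [';'-separated keyword ids]."""
    item, category, keywords = line.split()
    return (int(item),
            [int(c) for c in category.split('.')],
            [int(k) for k in keywords.split(';')])
```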

Let's throw some item information into the algorithm. This is done using the --item_file parameter.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=0 --features=3 --last_item=1 --quiet=1 --user_file=user_profile.txt --item_file=item.txt --gensgd_rate5=1e-5 --calc_error=1


WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   2.23951) Iteration:   0 Training RMSE:          0 Train err:          0
   4.95858) Iteration:   1 Training RMSE:   0.527203 Train err:   0.022205
   7.54827) Iteration:   2 Training RMSE:   0.499881 Train err:   0.022271
    10.026) Iteration:   3 Training RMSE:   0.476596 Train err:   0.024138
   12.4976) Iteration:   4 Training RMSE:   0.454496 Train err:   0.016523
   14.9459) Iteration:   5 Training RMSE:   0.431336 Train err:   0.016406

    217.96) Iteration:  96 Training RMSE:  0.0127242 Train err:          0
   220.116) Iteration:  97 Training RMSE:  0.0126185 Train err:          0
   222.317) Iteration:  98 Training RMSE:  0.0125111 Train err:          0
   224.559) Iteration:  99 Training RMSE:  0.0123526 Train err:          0




We got some significant RMSE improvement – from 0.017 to 0.012.


from Large Scale Machine Learning and Other Animals: http://bickson.blogspot.com/2012/12/collaborative-filtering-3rd-generation.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FsYXZE+%28Large+Scale+Machine+Learning+and+Other+Animals%29

Written by cwyalpha

December 13, 2012 at 5:51 am

Posted in Uncategorized
