Thought this was cool: Collaborative filtering with GraphChi



A couple of weeks ago I covered GraphChi by Aapo Kyrola on my blog.
Here is a quick tutorial for trying out the GraphChi collaborative filtering toolbox that I wrote. Currently it supports ALS (alternating least squares), SGD (stochastic gradient descent), bias-SGD (biased stochastic gradient descent), SVD++, NMF (non-negative matrix factorization) and SVD (restarted Lanczos and one-sided Lanczos), and I am soon going to implement several more algorithms.

Here are papers which explain the algorithms in more detail:

  • Alternating Least Squares (ALS)
    Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. Proceedings of the 4th international conference on Algorithmic Aspects in Information and Management. Shanghai, China pp. 337-348, 2008.
    
  • Stochastic gradient descent (SGD)
    Yehuda Koren, Robert Bell and Chris Volinsky. Matrix Factorization Techniques for Recommender Systems. IEEE Computer, Vol. 42, No. 8 (August 2009), pp. 30-37.
    Takács, G, Pilászy, I., Németh, B. and Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research, 10, 623-656.
    
  • Bias stochastic gradient descent (Bias-SGD)
    Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. ACM SIGKDD 2008. Equation (5).
  • SVD++
    Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. ACM SIGKDD 2008. 
  • Weighted-ALS
    Y. Hu, Y. Koren and C. Volinsky. Collaborative Filtering for Implicit Feedback Datasets. IEEE International Conference on Data Mining (ICDM 2008), IEEE (2008).
  • Sparse-ALS
    Xi Chen, Yanjun Qi, Bing Bai, Qihang Lin and Jaime Carbonell. Sparse Latent Semantic Analysis. In SIAM International Conference on Data Mining (SDM), 2011. 
    D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, Vol. 26, No. 3 (April 2008), pp. 301-321.
  • NMF
    Lee, D.D., and Seung, H.S. (2001). Algorithms for Non-negative Matrix Factorization. Adv. Neural Info. Proc. Syst. 13, 556-562.
  • SVD (Restarted Lanczos & One-sided Lanczos)
    V. Hernández, J. E. Román and A. Tomás. STR-8: Restarted Lanczos Bidiagonalization for the SVD in SLEPc.
  • tensor-ALS
    P. Comon, X. Luciani and A. L. F. de Almeida. Tensor Decompositions, Alternating Least Squares and other Tales. Special issue, Journal of Chemometrics, in memory of R. Harshman. August 16, 2009.

The benefit of using GraphChi is that it requires only a single multicore machine and can scale up to very large models, since at no point is the data fully read into memory. In other words, GraphChi is very useful for machines with limited RAM, since it streams over the dataset. It is also possible to configure how much RAM to use during the run.

Here are some performance numbers:

I ran 6 iterations of SGD (stochastic gradient descent) on the full Netflix data.
Netflix has around 100M ratings, so the matrix has 100M non-zeros. The size of the decomposed
matrix is about 480K users x 10K movies. I used a single multicore machine with 8 cores and 8 threads, with GraphChi memory consumption limited to 800MB. The factorized matrix has a width of D=20. In total it takes around 80 seconds for 6 iterations, which is around 14 seconds per iteration.

Preprocessing the matrix is done once and takes around 35 seconds.

The input to GraphChi ALS/SGD/bias-SGD is the sparse matrix A in sparse matrix market format. The output is two matrices U and V such that A ~= U*V', where both U and V have
a lower dimension D.
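
Concretely, each predicted rating is the inner product of one row of U (the user's factors) and one row of V (the item's factors). A minimal Matlab sketch of this relation, with made-up sizes:

    D = 20;
    U = rand(100, D);            % one row of factors per user (toy sizes)
    V = rand(50, D);             % one row of factors per item
    r_hat = U(8,:) * V(12,:)'    % predicted rating of user 8 for item 12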

Let’s start with an example:

1) Download graphchi from mercurial using the instructions here.

2) Change directory to graphchi
    cd graphchi

3) Download Eigen and put the header files inside the ./src directory
    wget http://bitbucket.org/eigen/eigen/get/3.1.1.tar.bz2
  tar -xjf 3.1.1.tar.bz2
  mv eigen-eigen-43d9075b23ef/Eigen ./src

4) Compile using
    make cf

5a) For ALS/SGD/bias-SGD/SVD++/SVD, download the Netflix synthetic sample files. The input is in sparse matrix market format.
    wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm 
  wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mme 


5b) For WALS, download the Netflix sample files including time:

  wget http://www.select.cs.cmu.edu/code/graphlab/datasets/time_smallnetflix
  wget http://www.select.cs.cmu.edu/code/graphlab/datasets/time_smallnetflixe

6a) Run ALS on the Netflix example:
     ./toolkits/collaborative_filtering/als --training=smallnetflix_mm --validation=smallnetflix_mme --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --quiet=1


   The first time, the input file will be preprocessed into an efficient binary representation on disk, and then 6 ALS iterations will be run.

6b) Run SGD on the Netflix example:
  ./toolkits/collaborative_filtering/sgd --training=smallnetflix_mm --validation=smallnetflix_mme --sgd_lambda=1e-4 --sgd_gamma=1e-4 --minval=1 --maxval=5 --max_iter=6 --quiet=1

6c) Run bias-SGD on the Netflix example:
  ./toolkits/collaborative_filtering/biassgd --training=smallnetflix_mm --validation=smallnetflix_mme --biassgd_lambda=1e-4 --biassgd_gamma=1e-4 --minval=1 --maxval=5 --max_iter=6 --quiet=1

6d) Run SVD++ on the Netflix example:

  ./toolkits/collaborative_filtering/svdpp --training=smallnetflix_mm --validation=smallnetflix_mme --biassgd_lambda=1e-4 --biassgd_gamma=1e-4 --minval=1 --maxval=5 --max_iter=6 --quiet=1


6e) Run weighted-ALS on the Netflix time example:
  ./toolkits/collaborative_filtering/als --training=time_smallnetflix --validation=time_smallnetflixe --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --quiet=1

6f) Run NMF on the reverse Netflix example:

./toolkits/collaborative_filtering/nmf --training=reverse_netflix.mm --minval=1 --maxval=5 --max_iter=20 --quiet=1

6g) Run one-sided SVD on the Netflix example:
./toolkits/collaborative_filtering/svd_onesided --training=smallnetflix_mm --nsv=3 --nv=10 --max_iter=1 --quiet=1 --tol=1e-1

6h) Run tensor-ALS on the Netflix time example:
  ./toolkits/collaborative_filtering/als_tensor --training=time_smallnetflix --validation=time_smallnetflixe --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --quiet=1


7) View the output.

For ALS, SGD, bias-SGD, WALS, SVD++ and NMF

Two files are created: filename_U.mm and filename_V.mm.
    The files store the matrices U and V in matrix market format.

    head smallnetflix_mm_U.mm

%%MatrixMarket matrix array real general
95526 5
0.693370
1.740420
0.947675
1.328987
1.150084
1.399164
1.292951
0.300416


For tensor-ALS

An additional output file named filename_T.mm is created. The prediction is computed as the tensor product of U_i, V_j and T_k (namely r_ijk = sum_l( u_il * v_jl * t_kl )).
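
For intuition, a single tensor prediction can be reproduced from the three factor rows; a toy Matlab sketch with made-up values and D=3:

    u = [0.5 1.2 0.3];           % factor row of user i
    v = [0.8 0.4 1.1];           % factor row of item j
    t = [1.0 0.9 0.2];           % factor row of time bin k
    r = sum(u .* v .* t)         % r_ijk = sum_l( u_il * v_jl * t_kl )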

For bias-SGD and SVD++

Three additional files are created: filename_U_bias.mm, filename_V_bias.mm and filename_global_mean.mm. The bias files include the bias for each user (U) and item (V).
The global mean file includes the global mean of the ratings.
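
Putting these outputs together, a bias-SGD prediction (SVD++ adds a neighborhood term on top) can be assembled in Matlab; a sketch, assuming the training file was smallnetflix_mm and the files are loaded with the mmread script described below:

    mu = mmread('smallnetflix_mm_global_mean.mm');   % global mean of the ratings
    bu = mmread('smallnetflix_mm_U_bias.mm');        % per-user biases
    bi = mmread('smallnetflix_mm_V_bias.mm');        % per-item biases
    U  = mmread('smallnetflix_mm_U.mm');
    V  = mmread('smallnetflix_mm_V.mm');
    r  = mu + bu(8) + bi(12) + U(8,:) * V(12,:)'     % biased prediction, user 8 / item 12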

For SVD

For each singular vector, a file named filename.U.XX is created, where XX is the index of the singular vector. The same holds for filename.V.XX. Additionally, a file with the singular values is saved.

Load output in Matlab (optional)

a) Download the script mmread.m.
b) # matlab
    >> A=mmread('smallnetflix_mm');
    >> U=mmread('smallnetflix_mm_U.mm');
    >> V=mmread('smallnetflix_mm_V.mm');
    >> whos
  Name          Size                  Bytes  Class     Attributes

  A         95526x3561             52799104  double    sparse    
  U         95526x5                 3821040  double              
  V         3561x5                   142480  double          


c) Compute the prediction for user 8, movie 12:
   >> U(8,:)*V(12,:)'


d) Compute the approximation error:
     >> norm(A-U*V') % may be slow, depending on problem size
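
Since A is sparse, it is often more meaningful to measure the error only on the observed ratings; a small Matlab sketch using the matrices loaded above:

     >> [i, j, r] = find(A);                  % observed (user, item, rating) triples
     >> pred = sum(U(i,:) .* V(j,:), 2);      % predicted rating for each observed pair
     >> rmse = sqrt(mean((r - pred).^2))      % RMSE over the observed entries only
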
Command line options:

Common options:

--training - the training input file.
--validation - the validation input file (optional). Validation data has known ratings but is not used for training.
--test - the test input file (optional).

--minval - minimum allowed rating (optional). Setting this value is highly recommended, since it improves prediction accuracy.
--maxval - maximum allowed rating (optional). Setting this value is highly recommended, since it improves prediction accuracy.
--max_iter - number of iterations to run.
--quiet - when set to 1, run with fewer traces.
D - width of the factorized matrix. Default is 20. To change it, edit the file toolkits/collaborative_filtering/XXX.hpp (where XXX is the algorithm name), change the macro #define D to some other value, and recompile.

ALS:

--lambda - regularization parameter (optional, default value 0.065)
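
For reference, each ALS pass solves a small regularized least-squares problem per user (and symmetrically per item). A minimal Matlab sketch of one user update in the weighted-lambda formulation of Zhou et al., with toy data:

    D = 20;  lambda = 0.065;
    V_i = rand(7, D);            % factors of the 7 items this user rated (toy data)
    r_i = 1 + 4 * rand(7, 1);    % the corresponding ratings, in [1,5]
    n_i = size(V_i, 1);          % number of ratings by this user
    u_i = (V_i' * V_i + lambda * n_i * eye(D)) \ (V_i' * r_i)   % new user factor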

SGD:

--sgd_lambda - gradient step size (optional). Default value 1e-3.
--sgd_gamma - regularization (optional). Default value 1e-3.
--sgd_step_dec - multiplicative step decrement (optional). Default is 0.9.
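
For intuition, a single SGD update on one observed rating looks roughly as follows (a Matlab sketch with toy values; step and reg stand for the gradient step size and regularization set by the flags above):

    D = 20;  step = 1e-3;  reg = 1e-3;    % toy step size and regularization
    u = rand(1, D);  v = rand(1, D);      % current user and item factor rows
    r = 4;                                % one observed rating
    e = r - u * v';                       % prediction error
    u_new = u + step * (e * v - reg * u); % gradient step on the user factors
    v_new = v + step * (e * u - reg * v); % gradient step on the item factors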

bias-SGD:

--biassgd_lambda - gradient step size (optional). Default value 1e-3.
--biassgd_gamma - regularization (optional). Default value 1e-3.
--biassgd_step_dec - multiplicative gradient step decrement (optional). Default is 0.9.

SVD++

--svdpp_item_bias_step, --svdpp_user_bias_step, --svdpp_user_factor_step, --svdpp_user_factor2_step - gradient step sizes (optional). Default value 1e-4.
--svdpp_item_bias_reg, --svdpp_user_bias_reg, --svdpp_user_factor_reg, --svdpp_user_factor2_reg - regularization (optional). Default value 1e-4.
--svdpp_step_dec - multiplicative gradient step decrement (optional). Default is 0.9.

Weighted-ALS

Note: for weighted-ALS, the input file has 4 columns:
[user] [item] [weight] [rating]. See the example file in section 5b), and the illustrative snippet below.

--lambda - regularization
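
For illustration, a hypothetical 4-column input file in the style of the matrix market examples below might look like this (the values are made up; the third column is the weight):

    %%MatrixMarket matrix coordinate real general
    3 4 3
    1 1 2.0 4.0
    2 3 0.5 3.0
    3 4 1.0 5.0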

Sparse-ALS

--lambda - regularization
--user_sparsity - defines the sparsity of each of the user factors. The value should be in the range [0.5,1).
--movie_sparsity - defines the sparsity of each of the movie factors. The value should be in the range [0.5,1).
--algorithm - sparsity pattern. There are three run modes:
                          SPARSE_USR_FACTOR = 1
                          SPARSE_ITM_FACTOR = 2
                          SPARSE_BOTH_FACTORS = 3





Example running sparse-ALS:

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/sparse_als --training=smallnetflix_mm --user_sparsity=0.8 --movie_sparsity=0.8 --algorithm=3 --quiet=1 --max_iter=15
WARNING:  sparse_als.cpp(main:202): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[user_sparsity] => [0.8]
[movie_sparsity] => [0.8]
[algorithm] => [3]
[quiet] => [1]
[max_iter] => [15]
  0) Training RMSE:    1.11754  Validation RMSE:    3.82345
  1) Training RMSE:    3.75712  Validation RMSE:      3.241
  2) Training RMSE:    3.22943  Validation RMSE:    2.03961
  3) Training RMSE:    2.10314  Validation RMSE:    2.88369
  4) Training RMSE:    2.70826  Validation RMSE:    3.00748
  5) Training RMSE:    2.70374  Validation RMSE:    3.16669
  6) Training RMSE:    3.03717  Validation RMSE:     3.3131
  7) Training RMSE:    3.18988  Validation RMSE:    2.83234
  8) Training RMSE:    2.82192  Validation RMSE:    2.68066
  9) Training RMSE:    2.29236  Validation RMSE:    1.94994
 10) Training RMSE:    1.58655  Validation RMSE:    1.08408
 11) Training RMSE:     1.0062  Validation RMSE:    1.22961
 12) Training RMSE:    1.05143  Validation RMSE:     1.0448
 13) Training RMSE:   0.929382  Validation RMSE:    1.00319
 14) Training RMSE:   0.920154  Validation RMSE:   0.996426

tensor-ALS

Note: for tensor-ALS, the input file has 4 columns:
[user] [item] [time] [rating]. See the example file in section 5b).

--lambda - regularization


NMF

NMF has no special command line arguments.
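
Under the hood, NMF uses the Lee-Seung multiplicative updates. A compact Matlab sketch of the update rule on a toy matrix, assuming a factorization A ~= W*H:

    A = rand(60, 40);  D = 5;                      % toy non-negative data
    W = rand(60, D);  H = rand(D, 40);             % random initialization
    for it = 1:100
      H = H .* (W' * A) ./ (W' * W * H + eps);     % multiplicative update for H
      W = W .* (A * H') ./ (W * (H * H') + eps);   % multiplicative update for W
    end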


SVD

SVD is implemented using the restarted Lanczos algorithm.
The input is a file in sparse matrix market format.
The output is 3 files: one file containing the singular values, and two dense matrix market format files containing the matrices U and V.

Note: for larger models, it is advised to use svd_onesided, since it significantly saves memory.

Here is an example Matrix Market input file for the matrix A2:

<235|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ cat A2
%%MatrixMarket matrix coordinate real general
3 4 12
1 1  0.8147236863931789
1 2  0.9133758561390194
1 3  0.2784982188670484
1 4  0.9648885351992765
2 1  0.9057919370756192
2 2  0.6323592462254095
2 3  0.5468815192049838
2 4  0.1576130816775483
3 1  0.1269868162935061
3 2  0.09754040499940952
3 3  0.9575068354342976
3 4  0.9705927817606157


Here is an example run of SVD:

bickson@thrust:~/graphchi$ ./bin/svd --training=A2 --nsv=4 --nv=4 --max_iter=4 --quiet=1
WARNING: svd.cpp(main:329): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[training] => [A2]
[nsv] => [4]
[nv] => [4]
[max_iter] => [4]
[quiet] => [1]
Load matrix
set status to tol
Number of computed signular values 4
Singular value 0 2.16097 Error estimate: 2.35435e-16
Singular value 1 0.97902 Error estimate: 1.06832e-15
Singular value 2 0.554159 Error estimate: 1.56173e-15
Singular value 3 9.2673e-65 Error estimate: 6.51074e-16
Lanczos finished 7.68793

Command line arguments

Basic configuration:

--training - input file name (in sparse matrix market format).
--nv - number of inner steps in each iteration. Typically this number should be greater than the number of singular values you look for.
--nsv - number of singular values requested. Should typically be less than --nv.
--ortho_repeats - number of repeats of the orthogonalization step. Default is 1 (no repeats). Increase this number for higher accuracy but slower execution. The maximal allowed value is 3.
--max_iter - number of allowed restarts. The minimum is 2 (= no restart).
--save_vectors=true - save the factorized matrices U and V to file.
--tol - convergence threshold. For large matrices set this number higher (for example 1e-1), while for small matrices you can set it to 1e-16. The smaller the convergence threshold, the slower the execution.






Understanding the error measure



Following SLEPc, the error estimate for each singular triplet is computed as:

    sqrt( ||A*v_i - sigma_i*u_i||_2^2 + ||A'*u_i - sigma_i*v_i||_2^2 ) / sigma_i

Namely, the deviation of the approximation sigma_i*u_i from A*v_i, and vice versa.
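
Given one computed triplet (u_i, sigma_i, v_i), the estimate can be reproduced in Matlab; a self-contained sketch using an exact SVD, where the estimate should come out near machine precision:

    A = rand(30, 20);
    [U, S, V] = svd(A, 'econ');
    u = U(:,1);  v = V(:,1);  s = S(1,1);      % first singular triplet
    err = sqrt(norm(A * v - s * u)^2 + norm(A' * u - s * v)^2) / s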


Scalability


Currently the code has been tested with up to 3.5 billion non-zeros on a 24-core machine. Each Lanczos iteration takes about 30 seconds.


Differences from Mahout


The Mahout SVD solver is implemented using the same Lanczos algorithm. However, there are several differences:
1) In Mahout there are no restarts, so the quality of the solution deteriorates very rapidly; after 5-10 iterations the solution is no longer accurate. Running without restarts can be done in our solution using the --max_iter=2 flag.
2) In Mahout there is a single orthonormalization step in each iteration, while in our implementation there are two (after the computation of u_i and after v_i).
3) In Mahout there is no error estimation, while we provide the approximated error for each singular value.
4) Our solution is typically around 100x faster than Mahout.

Computing recommendations

It is possible to compute recommendations using the rating command. Currently it supports ALS, weighted-ALS, sparse-ALS, NMF and SGD.
First run one of the above methods (ALS, SGD, NMF, etc.). Next, run the rating command, for example:
bickson@thrust:~/graphchi/toolkits/collaborative_filtering$ ./rating --training=smallnetflix_mm --K=5 --quiet=1
WARNING:  rating.cpp(main:376): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[K] => [5]
[quiet] => [1]
Computing recommendaitons for user 1 at time: 0.717385
Computing recommendaitons for user 40001 at time: 0.730287
Computing recommendaitons for user 56001 at time: 0.745666
Computing recommendaitons for user 70001 at time: 0.781473
Computing recommendaitons for user 72001 at time: 0.789523
Computing recommendaitons for user 76001 at time: 0.803709
Computing recommendaitons for user 87001 at time: 0.827568
Computing recommendaitons for user 91001 at time: 0.838071
Computing recommendaitons for user 95001 at time: 0.861837
Computing recommendaitons for user 1001 at time: 0.924312
Computing recommendaitons for user 2001 at time: 0.98236
Computing recommendaitons for user 3001 at time: 1.0347
Computing recommendaitons for user 4001 at time: 1.08348
Computing recommendaitons for user 5001 at time: 1.12427
Computing recommendaitons for user 6001 at time: 1.16263
Computing recommendaitons for user 7001 at time: 1.2002
Computing recommendaitons for user 8001 at time: 1.23634
Computing recommendaitons for user 9001 at time: 1.2701
Computing recommendaitons for user 10001 at time: 1.30333
Computing recommendaitons for user 11001 at time: 1.33622
The output of the rating algorithm is two files; the first one is the more useful:
1) filename.ids - contains the recommended item ids for each user.
2) filename.ratings - contains the scalar ratings of the top K items.

bickson@thrust:~/graphchi/toolkits/collaborative_filtering$ head smallnetflix_mm.ids 
%%MatrixMarket matrix array real general
%This file contains item ids matching the ratings. In each row i the top K item ids for user i.
95526 5
3424 1141 1477 2151 2012 
2784 1900 516 1835 1098 
1428 3450 2284 2328 58 
209 1073 3285 60 1271 
132 1702 2575 1816 2284 
2787 1816 3024 2514 985 
3078 375 168 2514 2460 



bickson@thrust:~/graphchi/toolkits/collaborative_filtering$ head smallnetflix_mm.ratings
%%MatrixMarket matrix array real general
%This file contains user scalar rating. In each row i, K top scalar ratings of different items recommended for user i.
95526 5
7.726248219530e+00 7.321665743778e+00 7.023083603761e+00 7.008616274552e+00 6.670937980807e+00
1.222724647853e+01 1.162004403228e+01 1.144299819709e+01 1.133374751034e+01 1.061483854315e+01
7.497070438026e+00 7.187132667285e+00 6.686989429238e+00 6.550680427186e+00 6.542147872641e+00
1.158861203665e+01 9.885307642785e+00 9.045366124418e+00 8.801333430322e+00 8.713271980918e+00


Command line arguments


Basic configuration:

--training - input file name (in sparse matrix market format) for training data.
--K - number of top items to recommend.
--knn_sample_percent - a value in the range (0,1]. When the dataset is big and there are a lot of user/item pairs, it may not be feasible to compute all possible pairs; knn_sample_percent tells the program what fraction of the pairs to sample.
--minval - truncate allowed ratings in range.
--maxval - truncate allowed ratings in range.
--quiet - less verbose output.

Limitations:

The rating command does not yet support all algorithms. Contact me if you would like to add additional algorithms like SVD++.


Handling implicit ratings

Implicit ratings handle the case where we have only positive examples (for example, when a user bought a certain product) but never have an indication that a user DID NOT buy another product. The paper [R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz and Q. Yang. One-Class Collaborative Filtering. ICDM '08, pp. 502-511] proposes adding negative examples at random for unobserved user/item pairs. Implicit ratings are implemented in the collaborative filtering library and can be used with any of the algorithms explained above; the sketch below illustrates the idea.
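
Conceptually, the preprocessing samples unobserved user/item pairs and adds them as negative edges. A Matlab sketch of the idea (not the actual GraphChi code; the matrix and parameter values are made up):

    A = sprand(1000, 500, 0.01);       % toy sparse matrix of positive ratings
    [m, n] = size(A);
    target = round(1e-5 * m * n);      % e.g. --implicitratingpercentage=0.00001
    added = 0;
    while added < target
      i = randi(m);  j = randi(n);
      if A(i, j) == 0                  % only unobserved pairs get an implicit rating
        A(i, j) = -1;                  % e.g. --implicitratingvalue=-1
        added = added + 1;
      end
    end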

Basic configuration:

--implicitratingtype=1 - add implicit ratings at random.
--implicitratingpercentage - a number between 1e-8 and 0.8 which determines the percentage of edges to add to the sparse model. 0 means none, while 1 means a fully dense model.
--implicitratingvalue - the value of the added rating. By default it is zero, but you can change it.
--implicitratingweight - the weight of the implicit rating (for WALS), or the time of the explicit rating (for tensor algorithms).



Example of adding implicit ratings:

./toolkits/collaborative_filtering/sgd --training=smallnetflix_mm --implicitratingtype=1 --implicitratingvalue=-1 --implicitratingpercentage=0.00001
WARNING:  sgd.cpp(main:182): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[implicitratingtype] => [1]
[implicitratingvalue] => [-1]
[implicitratingpercentage] => [0.00001]
INFO:     sharder.hpp(start_preprocessing:164): Started preprocessing: smallnetflix_mm --> smallnetflix_mm.4B.bin.tmp
INFO:     io.hpp(convert_matrixmarket:190): Starting to read matrix-market input. Matrix dimensions: 95526 x 3561, non-zeros: 3298163
INFO:     implicit.hpp(add_implicit_edges:71): Going to add: 3401 implicit edges.
INFO:     implicit.hpp(add_implicit_edges:79): Finished adding 3401 implicit edges. 

Common errors and their meaning

File not found error:

bickson@thrust:~/graphchi$ ./bin/example_apps/matrix_factorization/als_vertices_inmem file smallnetflix_mm 
INFO:     sharder.hpp(start_preprocessing:164): Started preprocessing: smallnetflix_mm --> smallnetflix_mm.4B.bin.tmp
ERROR:    als.hpp(convert_matrixmarket_for_ALS:153): Could not open file: smallnetflix_mm, error: No such file or directory

Solution:
The input file was not found. Repeat step 5 and verify the file is in the right folder.
Environment variable error:
bickson@thrust:~/graphchi/bin/example_apps/matrix_factorization$ ./als_vertices_inmem 
ERROR: Could not read configuration file: conf/graphchi.local.cnf
Please define environment variable GRAPHCHI_ROOT or run the program from that directory.
Solution:
export GRAPHCHI_ROOT=/path/to/graphchi/folder/
Error:

FATAL:    io.hpp(convert_matrixmarket:169): Failed to read global mean from filesmallnetflix_mm.gm

Solution: remove all temporary files created by the preprocessor, verify you have write permissions in your working folder, and try again.

Item-based similarity methods

Item-based similarity methods are documented here.


From Large Scale Machine Learning and Other Animals: http://bickson.blogspot.com/2012/08/collaborative-filtering-with-graphchi.html

Written by cwyalpha

September 13, 2012 at 1:29 pm

Posted in Uncategorized
