CWYAlpha


Thought this was cool: Collaborative filtering – 3rd generation – part 2



A couple of days ago I wrote about new experimental software I am writing, which I call 3rd generation collaborative filtering software. I got a lot of interesting feedback from my readers, which helps improve the software. Previously I examined its performance on the KDD CUP 2012 dataset. Now I have tried it on a completely different dataset, and I am quite pleased with the results.

Below I will explain how to deploy it on a different problem domain: airline on-time performance. It is a completely different dataset from a different domain, but the gensgd software can still deal with it without any modification. I hope that these results, which show how flexible the software is, will encourage additional data scientists to try it out!

The airline on-time dataset has information about 10 years of flights in the US. The data for each year is a CSV file with the following format:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

The fields are rather self-explanatory. Each line represents a single flight and gives information about the date, carrier, airports, etc. The interesting fields are those holding the varying information about flight duration.

And here are the first few lines:

2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA

First task: can we predict the total time the flight was in the air?

Well, for a matrix factorization method, it is not clear what the actual matrix is here. That is why it is useful to have flexible software. In my experiments I have chosen “UniqueCarrier” and “FlightNum” as the two fields which form the matrix, because they characterize each flight rather uniquely. Next we need to decide which field we want to predict. I have chosen ActualElapsedTime as the prediction target. Note that those fields are chosen on the fly, so you are more than welcome to choose others and see how good the prediction is in that case.
(Additional information about the meaning of each field is found here).
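To check which position corresponds to which field, the header can be indexed directly. The sketch below is not part of gensgd; it assumes positions are zero-based (which is consistent with position 8 being UniqueCarrier, 9 being FlightNum and 11 being ActualElapsedTime in the commands used in this post) and uses a hypothetical helper, `to_triples`, to show the (row, column, value) entries that the first sample lines contribute to the matrix.

```python
# Header and sample rows copied from the 2008 file shown above.
HEADER = (
    "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,"
    "UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,"
    "ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,"
    "CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,"
    "SecurityDelay,LateAircraftDelay"
)
SAMPLE_ROWS = [
    "2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA",
    "2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA",
    "2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA",
]

# Zero-based position of every field name (assumed indexing convention).
positions = {name: i for i, name in enumerate(HEADER.split(","))}


def to_triples(rows, from_pos, to_pos, val_pos):
    """Hypothetical helper: turn CSV lines into (row key, column key, value) triples."""
    triples = []
    for line in rows:
        fields = line.split(",")
        triples.append((fields[from_pos], fields[to_pos], float(fields[val_pos])))
    return triples


triples = to_triples(
    SAMPLE_ROWS,
    positions["UniqueCarrier"],
    positions["FlightNum"],
    positions["ActualElapsedTime"],
)
for carrier, flight, minutes in triples:
    print(carrier, flight, minutes)
```

Swapping `positions["ActualElapsedTime"]` for another field name gives the position to use when choosing a different prediction target.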

First let’s use traditional matrix factorization.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 0 : 
   7.58561) Iteration:   0 Training RMSE:    67.1094
   11.7177) Iteration:   1 Training RMSE:    64.6665
   15.8441) Iteration:   2 Training RMSE:    63.2155
   19.9971) Iteration:   3 Training RMSE:    59.0044
   24.0989) Iteration:   4 Training RMSE:    53.9083
   28.1962) Iteration:   5 Training RMSE:    50.2416

   77.6041) Iteration:  17 Training RMSE:    35.6409
   81.7165) Iteration:  18 Training RMSE:     35.505
   85.8197) Iteration:  19 Training RMSE:    35.4046
   89.9266) Iteration:  20 Training RMSE:    35.3288

We got an RMSE of 35.3 minutes on the predicted flight time, taking into account only the carrier and flight number. That is rather bad: we are half an hour off track.
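For readers who have not seen it before, here is a minimal sketch of what "traditional matrix factorization" trained by stochastic gradient descent looks like, together with the training RMSE it reports per iteration. This is a generic textbook formulation, not the gensgd implementation: the function names, the rank, the learning rate and the tiny three-entry dataset (taken from the sample rows above) are all assumptions made for illustration.

```python
import math
import random


def rmse(triples, predict):
    """Root mean squared error of predict(row, col) against the observed values."""
    total = sum((v - predict(r, c)) ** 2 for r, c, v in triples)
    return math.sqrt(total / len(triples))


def train_mf(triples, rank=2, lr=0.005, iters=200, seed=0):
    """Plain SGD matrix factorization: value ~ global mean + <P[row], Q[col]>."""
    rng = random.Random(seed)
    mean = sum(v for _, _, v in triples) / len(triples)
    # Sort the keys so the random initialisation is reproducible.
    P = {r: [rng.uniform(-0.1, 0.1) for _ in range(rank)]
         for r in sorted({r for r, _, _ in triples})}
    Q = {c: [rng.uniform(-0.1, 0.1) for _ in range(rank)]
         for c in sorted({c for _, c, _ in triples})}

    def predict(r, c):
        return mean + sum(p * q for p, q in zip(P[r], Q[c]))

    history = []
    for _ in range(iters):
        for r, c, v in triples:
            err = v - predict(r, c)
            for k in range(rank):
                p, q = P[r][k], Q[c][k]
                P[r][k] += lr * err * q  # gradient step on the row factor
                Q[c][k] += lr * err * p  # gradient step on the column factor
        history.append(rmse(triples, predict))
    return predict, history


# (UniqueCarrier, FlightNum, ActualElapsedTime) from the three sample rows.
toy = [("WN", "335", 128.0), ("WN", "3231", 128.0), ("WN", "448", 96.0)]
predict, history = train_mf(toy)
print("first iteration RMSE:", history[0])
print("last iteration RMSE: ", history[-1])
```

With only a row key and a column key as input, such a model has nothing to go on besides which carrier and flight number it is looking at, which is the limitation the following runs address by adding features.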

Next let’s throw some temporal features into the computation: Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime. How do we do that? It is very easy! Just add --features=1,2,3,4,5,6,7 to the command line, namely the positions of the features in the input file. (Year sits at position 0, but it is constant within a single-year file, and the run below leaves it out.) This is what we call temporal matrix factorization or tensor factorization. To do the same with one of the traditional methods, you would need to merge all of these fields into one integer which encodes the time, which is of course a tedious task.
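To give a feel for the merging step that traditional temporal methods would require, here is one possible encoding, sketched under stated assumptions: the hypothetical function `time_bin` collapses the date fields and the scheduled departure time (stored as an hhmm integer such as 1955) into a single integer counting hourly bins since the start of the year. It assumes the hhmm value lies between 0000 and 2359; real data would need cleaning for missing or out-of-range values. None of this is needed with gensgd, which takes the fields as separate features.

```python
from datetime import datetime


def time_bin(year, month, day, hhmm, bin_minutes=60):
    """Hypothetical encoding: number of bins elapsed since January 1st of `year`."""
    hour, minute = divmod(hhmm, 100)  # 1955 -> (19, 55)
    moment = datetime(year, month, day, hour, minute)
    minutes_since_new_year = int((moment - datetime(year, 1, 1)).total_seconds() // 60)
    return minutes_since_new_year // bin_minutes


# Scheduled departures (CRSDepTime) of the first two sample rows.
print(time_bin(2008, 1, 3, 1955))
print(time_bin(2008, 1, 3, 735))
```

A single integer like this discards the separate effects of day of week or arrival time, which is part of why passing the raw fields as features is more convenient.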

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --file_columns=28 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=100 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --features=1,2,3,4,5,6,7 --quiet=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 7 : 
INFO:     gensgd.cpp(main:1158): Selected feature: 1
INFO:     gensgd.cpp(main:1158): Selected feature: 2
INFO:     gensgd.cpp(main:1158): Selected feature: 3
INFO:     gensgd.cpp(main:1158): Selected feature: 4
INFO:     gensgd.cpp(main:1158): Selected feature: 5
INFO:     gensgd.cpp(main:1158): Selected feature: 6
INFO:     gensgd.cpp(main:1158): Selected feature: 7
   21.8356) Iteration:   0 Training RMSE:    50.3144
   36.6782) Iteration:   1 Training RMSE:    40.4813
    51.425) Iteration:   2 Training RMSE:    36.0579
   66.4348) Iteration:   3 Training RMSE:    33.4226

   272.188) Iteration:  17 Training RMSE:    20.0103
   286.887) Iteration:  18 Training RMSE:    19.7198
   301.602) Iteration:  19 Training RMSE:    19.4597
   316.305) Iteration:  20 Training RMSE:    19.2147

With temporal information we got down to an RMSE of 19.2 minutes, which is still not that good.

Now let’s utilize the full power of gensgd: when the going gets tough, throw in some more features! Without even understanding what the features mean, I have thrown in almost everything…

./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --features=1,2,3,4,5,6,7,12,13,14,15,16,17,18 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --file_columns=4 --max_iter=20 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 14 : 
INFO:     gensgd.cpp(main:1158): Selected feature: 1
INFO:     gensgd.cpp(main:1158): Selected feature: 2
INFO:     gensgd.cpp(main:1158): Selected feature: 3
INFO:     gensgd.cpp(main:1158): Selected feature: 4
INFO:     gensgd.cpp(main:1158): Selected feature: 5
INFO:     gensgd.cpp(main:1158): Selected feature: 6
INFO:     gensgd.cpp(main:1158): Selected feature: 7
INFO:     gensgd.cpp(main:1158): Selected feature: 12
INFO:     gensgd.cpp(main:1158): Selected feature: 13
INFO:     gensgd.cpp(main:1158): Selected feature: 14
INFO:     gensgd.cpp(main:1158): Selected feature: 15
INFO:     gensgd.cpp(main:1158): Selected feature: 16
INFO:     gensgd.cpp(main:1158): Selected feature: 17
INFO:     gensgd.cpp(main:1158): Selected feature: 18
   36.2089) Iteration:   0 Training RMSE:    21.1476
   61.2802) Iteration:   1 Training RMSE:    10.1963
   86.3032) Iteration:   2 Training RMSE:    8.64215
   111.236) Iteration:   3 Training RMSE:    7.76054
   136.246) Iteration:   4 Training RMSE:    7.14308
   161.221) Iteration:   5 Training RMSE:     6.6629

   461.528) Iteration:  17 Training RMSE:    4.26991
    486.61) Iteration:  18 Training RMSE:    4.17239
   511.737) Iteration:  19 Training RMSE:    4.08084
   536.775) Iteration:  20 Training RMSE:    3.99414

Now we got down to an RMSE of about 4 minutes. We can also continue the computation (run more iterations) and get below 2 minutes of error. Isn’t that neat? The average flight time in 2008 is 127 minutes, so a 2-minute prediction error is not that bad.
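To put the three runs on a common scale, the final training RMSE of each can be expressed as a fraction of the 127-minute average flight time quoted above. The helper name `relative_error` is made up for this sketch; the RMSE values are copied from the logs in this post, and the 2.0 entry stands for the "below 2 minutes" figure reached with more iterations.

```python
MEAN_FLIGHT_MINUTES = 127.0  # average flight time in 2008, as stated above


def relative_error(rmse_minutes, mean_minutes=MEAN_FLIGHT_MINUTES):
    """RMSE expressed as a percentage of the mean target value."""
    return 100.0 * rmse_minutes / mean_minutes


runs = [
    ("carrier + flight number only", 35.3288),
    ("plus temporal features", 19.2147),
    ("plus almost everything", 3.99414),
    ("more iterations (upper bound)", 2.0),
]
for label, value in runs:
    print(f"{label:32s} {value:8.4f} min  {relative_error(value):5.1f}% of mean")
```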

Conclusion: traditional matrix / tensor factorization has some severe limitations when dealing with complex real-world data. Additional techniques are needed to improve accuracy!

Second task: let’s predict TaxiIn (time that the plane is on the ground when coming in)

This task is slightly more difficult since, as you may imagine, there is much larger variation in taxi-in time relative to flight time. But is predicting it more difficult to set up? No. We simply change to --val_pos=19, namely point the target at the TaxiIn field.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=19 --rehash=1 --file_columns=28 --gensgd_rate3=1e-3 --gensgd_mult_dec=0.9999 --max_iter=20 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --features=1,2,3,4,5,6,7,10,11,12,13,14,15,16,17,18 --quiet=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[quiet] => [1]
INFO:     gensgd.cpp(main:1155): Total selected features: 16 : 
INFO:     gensgd.cpp(main:1158): Selected feature: 1
INFO:     gensgd.cpp(main:1158): Selected feature: 2
INFO:     gensgd.cpp(main:1158): Selected feature: 3
INFO:     gensgd.cpp(main:1158): Selected feature: 4
INFO:     gensgd.cpp(main:1158): Selected feature: 5
INFO:     gensgd.cpp(main:1158): Selected feature: 6
INFO:     gensgd.cpp(main:1158): Selected feature: 7
INFO:     gensgd.cpp(main:1158): Selected feature: 10
INFO:     gensgd.cpp(main:1158): Selected feature: 11
INFO:     gensgd.cpp(main:1158): Selected feature: 12
INFO:     gensgd.cpp(main:1158): Selected feature: 13
INFO:     gensgd.cpp(main:1158): Selected feature: 14
INFO:     gensgd.cpp(main:1158): Selected feature: 15
INFO:     gensgd.cpp(main:1158): Selected feature: 16
INFO:     gensgd.cpp(main:1158): Selected feature: 17
INFO:     gensgd.cpp(main:1158): Selected feature: 18
   1.56777) Iteration:   0 Training RMSE:    3.89207
   3.01777) Iteration:   1 Training RMSE:    3.64978
    4.5159) Iteration:   2 Training RMSE:    3.46472
    5.8659) Iteration:   3 Training RMSE:    3.30712
   7.26778) Iteration:   4 Training RMSE:    3.17225
    8.7159) Iteration:   5 Training RMSE:    3.06696
   23.6072) Iteration:  16 Training RMSE:    2.60147
   24.9789) Iteration:  17 Training RMSE:    2.57697
   26.3267) Iteration:  18 Training RMSE:    2.55768
   27.6967) Iteration:  19 Training RMSE:    2.54186
   29.0773) Iteration:  20 Training RMSE:    2.53113

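We end up with an RMSE of about 2.5 minutes. Taxi-in times are only a few minutes long (3 to 5 minutes in the sample rows above), so relative to the size of the target this error is far larger than a 2 to 4 minute error on a two-hour flight, which means that this task is actually more difficult than predicting flight time.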

Instructions:
0) Install GraphChi from mercurial using the instructions here.
1) Download the year 2008 from here.
2) Decompress the bzip2 file using:
    bunzip2 2008.csv.bz2
3) Create a matrix market format file, named 2008.csv:info, with the following two lines:

%%MatrixMarket matrix coordinate real general
20 7130 1000000
4) Before each run, remove temporary files using the command
rm -f 2008.csv.*
5) Run the commands as instructed above.
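Steps 3 and 4 are easy to get wrong by hand, so here is a small sketch that automates them. It assumes the info file is named after the training file with an `:info` suffix, and that the temporary files are exactly those matched by the `rm -f 2008.csv.*` pattern from step 4; the helper names `write_info_file` and `remove_temp_files` are made up for this example. The demonstration runs in a throwaway directory with an empty stand-in for the training file, and the colon in the file name means it is meant for Linux or macOS.

```python
import glob
import os
import tempfile


def write_info_file(training_path, rows, cols, nnz):
    """Write the two-line Matrix Market header next to the training file."""
    info_path = training_path + ":info"
    with open(info_path, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        f.write(f"{rows} {cols} {nnz}\n")
    return info_path


def remove_temp_files(training_path):
    """Delete files matching '<training file>.*', like `rm -f 2008.csv.*`."""
    removed = []
    for path in sorted(glob.glob(glob.escape(training_path) + ".*")):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(os.path.basename(path))
    return removed


with tempfile.TemporaryDirectory() as workdir:
    training = os.path.join(workdir, "2008.csv")
    open(training, "w").close()             # empty stand-in for the real data
    open(training + ".stale", "w").close()  # hypothetical leftover temp file
    info_path = write_info_file(training, 20, 7130, 1000000)
    removed = remove_temp_files(training)
    with open(info_path) as f:
        info_lines = f.read().splitlines()
    survivors = sorted(os.listdir(workdir))

print(info_lines)
print(removed, survivors)
```

Note that the cleanup pattern requires a dot after `2008.csv`, so neither the training file nor the `:info` file is deleted.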


from Large Scale Machine Learning and Other Animals: http://bickson.blogspot.com/2012/12/collaborative-filtering-3rd-generation_14.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FsYXZE+%28Large+Scale+Machine+Learning+and+Other+Animals%29

Written by cwyalpha

December 14, 2012 at 12:58 PM

Posted in Uncategorized
