
Thought this was cool: GraphChi parsers toolkit


Consecutive IDs parser

If you would like to use the GraphChi collaborative filtering library, you first need to parse your data into consecutive integer user IDs and consecutive integer item IDs.
For example, assume you have a user movie rating database with the following format:
Johny Walker,The Matrix (Director’s Cut),5
Abba Babba,James Bond VII,3
Bondy Jr.,Titanic,1

Namely, each row contains one rating with the format
<user name>,<item name>,<rating>

You would then like to convert it to the sparse matrix market format:
1 1  5
2 2  3
3 3  1
Namely user 1 rated item 1 with rating of 5, etc.

Additionally, you can also convert files of the format:
12032 12304-0323-X 3
Where for example 12032 is the first user ID, 12304-0323-X is the ISBN of the rated book and 3 is the rating value.

You can use the consecutive_matrix_market parser to create the appropriate format from your data files.
The input to the parser is either a CSV file, a TSV file, or a matrix market input file with non-consecutive integer IDs. The users and items can be either strings or integers.
The consecutive IDs parser produces several outputs:
1) filename.out – a rating file with the format [user] [movie] [rating], where both the user IDs and item IDs are consecutive integers. In our example:

1 1  5
2 2  3
3 3  1

2) A mapping between user names/IDs and consecutive integers. In our example:

Abba Babba 2
Bondy Jr. 3
Johny Walker 1
3) A mapping between movie names/IDs and consecutive integers. In our example:

James Bond VII 2
The Matrix (Director’s Cut) 1
Titanic 3

4+5) The same maps in reverse, namely mapping back from unsigned integers to string IDs.
6) A matrix market header file – in our example there are 3 users, 3 movies and 3 ratings:

%%MatrixMarket matrix coordinate integer general
3 3 3
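The ID-assignment step can be sketched in a few lines of Python. This is a minimal illustration of the logic only, not the toolkit's actual C++ code; the function name and in-memory layout are assumptions:

```python
def consecutive_ids(rows):
    """Assign consecutive 1-based integer IDs to users and items in
    order of first appearance, mimicking the consecutive IDs output."""
    user_map, item_map, out = {}, {}, []
    for user, item, rating in rows:
        uid = user_map.setdefault(user, len(user_map) + 1)
        iid = item_map.setdefault(item, len(item_map) + 1)
        out.append((uid, iid, rating))
    return user_map, item_map, out

rows = [
    ("Johny Walker", "The Matrix (Director's Cut)", "5"),
    ("Abba Babba", "James Bond VII", "3"),
    ("Bondy Jr.", "Titanic", "1"),
]
users, items, ratings = consecutive_ids(rows)
# Emit the matrix market header followed by the consecutive triples.
print("%%MatrixMarket matrix coordinate integer general")
print(len(users), len(items), len(ratings))
for uid, iid, r in ratings:
    print(uid, iid, r)
```

The forward maps (outputs 2 and 3 above) are simply user_map and item_map; inverting each dictionary gives the reverse maps (outputs 4 and 5).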

To run the consecutive IDs parser you should prepare a file containing the input file names, one per line.
For example, the data file named “movies” contains:

Johny Walker,The Matrix (Director’s Cut),5
Abba Babba,James Bond VII,3
Bondy Jr.,Titanic,1

Next, you prepare a file named “list”, with “movies” in its first line:
#> cat list
movies

Finally, you run the parser:
./toolkits/collaborative_filtering/consecutive_matrix_market --file_list=list --csv=1

The supported command line flags are:
--csv=1 – CSV format
--tsv=1 – TSV format
otherwise – space-separated format
--file_list=list – list of files to parse

LDA Parser

At the request of Baoqiang Cao I have started a parsers toolkit in GraphChi to be used for
preparing data for GraphLab/GraphChi. The parsers should be used as templates which can be easily customized to user-specific needs.
The LDA parser code is found here.

Example input file:

I was not a long long ago here
because i am not so damn smart.
as you may have thought

The assumption is that each line holds the words of a certain document. We would like to assign the strings unique IDs and count how many times each word appears in each document.
The input to GraphLab’s LDA is in the format
[docid] [wordid] [word count]
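The counting logic can be sketched in Python as follows. This is a hedged illustration, not the texttokens implementation; note also that the actual run below numbers documents starting from 2, while this sketch starts from 1:

```python
from collections import Counter

def text_tokens(lines):
    """Assign each distinct word a consecutive 1-based ID (in order of
    first appearance) and emit (docid, wordid, count) triples, one
    document per input line."""
    word_ids = {}
    triples = []
    for docid, line in enumerate(lines, start=1):
        counts = Counter(line.lower().split())
        for word, n in counts.items():
            wid = word_ids.setdefault(word, len(word_ids) + 1)
            triples.append((docid, wid, n))
    return word_ids, triples
```

Inverting word_ids gives the reverse map (ID to word) that the parser also writes out.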
Example run:
1) Create a file “stamfile” with the example input above.
2) Create a file “a” which has the file name “stamfile” in its first line:
>> head a
stamfile
3) Run the parser:
bickson@thrust:~/graphchi$ ./toolkits/parsers/texttokens --file_list=a
WARNING:  texttokens.cpp(main:146): GraphChi parsers library is written by Danny Bickson (c). Send any  comments or bug reports to 
[file_list] => [a]
INFO:     texttokens.cpp(parse:138): Finished parsing total of 4 lines in file stamfile
total map size: 16
Finished in 0.024819
INFO:     texttokens.cpp(save_map_to_text_file:56): Wrote a total of 16 map entries to text file: amap.text
INFO:     texttokens.cpp(save_map_to_text_file:68): Wrote a total of 16 map entries to text file:
The output of the parser is a file named stamfile.out
bickson@thrust:~/graphchi$ cat stamfile.out 
2 1 1
2 2 1
2 3 1
2 4 2
2 5 1
2 6 1
3 11 1
3 3 1
3 7 1
3 8 1
3 9 1
3 10 1
4 12 1
4 13 1
4 14 1
4 15 1
4 16 1
And the mapping files:
bickson@thrust:~/graphchi$ cat amap.text
may 14
long 4
am 8
because 7
you 13
as 12
so 9
here 6
have 15
not 3
i 1
smart 11
thought 16
damn 10
was 2
ago 5

bickson@thrust:~/graphchi$ cat 
1 i
2 was
3 not
4 long
5 ago
6 here
7 because
8 am
9 so
10 damn
11 smart
12 as
13 you
14 may
15 have
16 thought

Twitter parser

A second parser, which is slightly fancier, goes over users' Twitter tweets in the following format:

 * Twitter input format is:
 * T  2009-06-11 16:56:42
 * U
 * W  Bus is pulling out now. We gotta be in LA by 8 to check into the Paragon.
 * T  2009-06-11 16:56:42
 * U
 * W  灰を灰皿に落とそうとすると高確率でヘッドセットの線を根性焼きする形になるんだが
 * T  2009-06-11 16:56:43
 * U
 * W  @carolinesweatt There are no orphans…of God! 🙂 Miss your face!

It then extracts graph links between the retweets.
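As a rough illustration of how links might be extracted from the T/U/W record format, here is a Python sketch. The field layout and the @mention heuristic are assumptions for illustration (the U lines in the excerpt above have their usernames elided, and the toolkit's actual retweet-detection logic may differ):

```python
import re

def tweet_edges(lines):
    """Walk T/U/W records: remember the user from each U line, then
    draw an edge from that user to every @mention in the W line."""
    edges, user = [], None
    for line in lines:
        parts = line.split(None, 1)  # split record tag from payload
        if not parts:
            continue
        tag = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if tag == "U":
            user = rest.strip()
        elif tag == "W" and user:
            for mention in re.findall(r"@(\w+)", rest):
                edges.append((user, mention))
    return edges
```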

Operating the parsers

1) Install GraphChi as explained in steps 1+2 here.
2) Compile with “make parsers”.
3) Run using ./toolkits/parsers/texttokens --file_list=files


from Large Scale Machine Learning and Other Animals:


Written by cwyalpha

October 8, 2012 at 1:36 pm

Posted in Uncategorized

