
Thought this was cool: Big Data Linkshare - recent technical literature on big data


Big Data Linkshare is a mailing list I subscribe to; I've forgotten how I signed up.


Copyright © 2012 Israeli Big Data Linkshare, All rights reserved.

The past few weeks have been too hectic to find time for writing the newsletter. I hope that going forward I'll be able to keep to a regular schedule. However, I'm travelling next week, so the next issue will only come out in the last week of September. If any of you are in London and want to have a beer and discuss Data Science, drop me a line!

Time Series and Multivariate Analysis in R R Statistics

Book (75 pp.):

Book (53 pp.):
A series of three short, free books about data analysis in R. The first discusses time series analysis (decomposition into trend and seasonal components, exponential smoothing, ARIMA and the like).
The second is about multivariate analysis and covers PCA, LDA and other topics.
The last one is on biomedical statistics, so it might be a little specific in nature. It is linked to from the other pages.
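
As a rough illustration of one of the time series techniques mentioned above, here is a minimal sketch of simple exponential smoothing. The books themselves use R; this is plain Python, and the function name and the choice to initialise with the first observation are my own assumptions, not taken from the books.

```python
def simple_exponential_smoothing(series, alpha):
    """Smooth a sequence of numbers.

    s[0] = x[0]
    s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]

    alpha close to 1 tracks the data closely; alpha close to 0 smooths heavily.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]  # assumed initialisation: first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    print(simple_exponential_smoothing([10, 20, 30], alpha=0.5))
```

With `alpha=1` the function returns the input unchanged, which is a quick sanity check on the recurrence.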

Mining Twitter with MongoDB and MapReduce Big Data

Demonstration of how to use a Ruby Twitter gem to scrape tweets into MongoDB, and then run a Mongo map-reduce task (non-Hadoop) to get a histogram of tweets per hour of day.
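
To make the tweets-per-hour idea concrete, here is a small sketch of the same map/reduce shape in plain Python, with no MongoDB or Ruby involved. The `created_at` field name and its timestamp format are assumptions for illustration only; the linked post's actual schema may differ.

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp format, for illustration only.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def map_tweet(tweet):
    """Map step: emit (hour_of_day, 1) for one tweet document."""
    created = datetime.strptime(tweet["created_at"], TIME_FORMAT)
    yield created.hour, 1


def reduce_counts(hour, counts):
    """Reduce step: sum all the counts emitted for one hour."""
    return hour, sum(counts)


def tweets_per_hour(tweets):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for tweet in tweets:
        for hour, count in map_tweet(tweet):
            grouped[hour].append(count)
    return dict(reduce_counts(h, c) for h, c in sorted(grouped.items()))


if __name__ == "__main__":
    sample = [
        {"created_at": "2012-09-01 09:15:00"},
        {"created_at": "2012-09-02 09:45:10"},
        {"created_at": "2012-09-02 23:01:59"},
    ]
    print(tweets_per_hour(sample))
```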

Optimizing Pig Jobs Big Data

This guy did his PhD project on running DB benchmarks against Hive and Pig and working out how to optimize them. He has a few tips stashed away in a PowerPoint attached to a JIRA issue linked from the post. They are:
  1. Reorder JOINs properly
  3. Use FLATTEN for self-join
  4. Project before (CO)GROUP
  5. Remove types in LOAD
  6. Use hash-based aggregation
The post itself goes into detail about the benchmarks they ran and why different improvements, like cutting down on the MR jobs Pig generates (e.g. Pig generates 3 jobs for an ORDER BY clause) or enabling map aggregation, are significant.

Unique Python Data Structures Python

The short blog post covers several Python libraries implementing Bloom filters, tries, efficient lists or arrays, and general-purpose graphs. Some of these are especially well suited to big data or NLP, so I opted to include this post.
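
For readers who haven't met a Bloom filter, here is a minimal sketch using only the standard library. It is not any of the libraries from the linked post; the class name, default sizes and the SHA-256-based hashing scheme are assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # must be below 256 in this sketch
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by prefixing the item with a seed byte.
        data = str(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("hadoop")
    print("hadoop" in seen)  # always True once added
    print("hive" in seen)    # almost certainly False, but not guaranteed
```

The appeal for big data is the fixed memory footprint: the filter above uses 128 bytes no matter how many items are added, at the cost of a false positive rate that grows as it fills.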

Scaling Deep Learning by Google ML Big Data

Video (40 min):
Jeff Dean of Google fame (his work was reviewed here before) gives a technical talk, from an engineering standpoint, about how Google runs fast, parallel optimization algorithms (such as SGD or L-BFGS).

Cardinality Estimation Algorithms Big Data Python

Continuing the ideas covered in earlier blog posts featured here on sketches and other probabilistic counting methods, this post covers and demonstrates (in Python) SuperLogLog and HyperLogLog.
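
Here is a compact sketch of HyperLogLog to show the moving parts. It is not the code from the linked post: the class name, the use of SHA-1 truncated to 64 bits as the hash, and the default precision are all assumptions, and only the small-range (linear counting) correction is included.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count using 2**p small registers."""

    def __init__(self, p=12):
        if not 4 <= p <= 16:
            raise ValueError("p must be between 4 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        index = h >> (64 - self.p)             # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Bias-correction constant from the HyperLogLog paper.
        small = {16: 0.673, 32: 0.697, 64: 0.709}
        alpha = small.get(self.m, 0.7213 / (1 + 1.079 / self.m))
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog(p=12)
    for i in range(10000):
        hll.add("user-%d" % i)
    print(round(hll.count()))  # an estimate close to 10000
```

With `p=12` the sketch keeps 4096 registers and the typical relative error is around 1.04 / sqrt(4096), or roughly 1.6%. Adding the same item twice never changes the registers, which is what makes this a distinct counter.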

Re-Expression optimizes your ML models ML

This site includes a very abstract description and reference implementation, in C++, of the re-expression method, whereby Naive Bayes model variables are re-encoded into progressively more complex combinations to better fit the data, while avoiding the creation of dependent variables.

Recommendation engine in Hadoop with Python ML Big Data Python

A very detailed walkthrough of how to write Python MapReduce jobs with mrjob to do collaborative filtering. The algorithm implemented is item-item similarity generated from user-item preference data. Various measures are demonstrated (cosine similarity, Jaccard distance and correlation coefficient).
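
To show what item-item similarity from user-item preferences looks like, here is a single-machine sketch in plain Python. It deliberately leaves out mrjob and Hadoop, so it is not the walkthrough's code; the function names and the `(user, item, rating)` tuple layout are assumptions. It reports Jaccard similarity, which is one minus the Jaccard distance.

```python
import math
from collections import defaultdict
from itertools import combinations


def item_vectors(ratings):
    """Turn (user, item, rating) tuples into {item: {user: rating}}."""
    vectors = defaultdict(dict)
    for user, item, rating in ratings:
        vectors[item][user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {user: rating} vectors."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard similarity of the sets of users who rated each item."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def item_similarities(ratings):
    """Return {(item_a, item_b): (cosine, jaccard)} for every item pair."""
    vectors = item_vectors(ratings)
    return {
        (x, y): (cosine(vectors[x], vectors[y]), jaccard(vectors[x], vectors[y]))
        for x, y in combinations(sorted(vectors), 2)
    }


if __name__ == "__main__":
    data = [
        ("u1", "A", 1), ("u1", "B", 1),
        ("u2", "A", 1), ("u2", "B", 1),
        ("u3", "A", 1),
    ]
    print(item_similarities(data))
```

In a MapReduce setting the all-pairs loop is the expensive part, which is why a job would emit item pairs per user and aggregate them in the reducers rather than iterate over every combination.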

Fast Bayesian Inference with STAN Statistics

A lot of excitement has followed the release of Stan, a BUGS alternative for running Monte Carlo simulations. Under the hood, Stan generates and compiles C++ code from a BUGS-like model description, and the compiled code runs a very fast Hamiltonian MCMC that converges to the model distribution. It comes with R bindings, so it's easy to use from your research code.
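
For a feel of what Hamiltonian MCMC does, here is a toy one-dimensional sampler in plain Python. This is not Stan's code or API, and it is a fixed-step-size textbook variant rather than whatever tuning Stan applies; the function name, parameters and defaults are all assumptions for illustration.

```python
import math
import random


def hmc_sample(log_prob, grad_log_prob, x0, n_samples,
               step_size=0.2, n_leapfrog=10, seed=0):
    """Draw samples from a 1-D density given its log and gradient."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # fresh momentum for each proposal
        x_new, p_new = x, p
        # Leapfrog integration: half step, full steps, half step.
        p_new += 0.5 * step_size * grad_log_prob(x_new)
        for i in range(n_leapfrog):
            x_new += step_size * p_new
            if i < n_leapfrog - 1:
                p_new += step_size * grad_log_prob(x_new)
        p_new += 0.5 * step_size * grad_log_prob(x_new)
        # Hamiltonian = potential energy (-log density) + kinetic energy.
        h_old = -log_prob(x) + 0.5 * p * p
        h_new = -log_prob(x_new) + 0.5 * p_new * p_new
        # Metropolis accept/reject corrects for integration error.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples


if __name__ == "__main__":
    # Target: standard normal, up to an additive constant in the log density.
    draws = hmc_sample(lambda x: -0.5 * x * x, lambda x: -x, 0.0, 3000)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(mean, var)  # should land near 0 and 1
```

The gradient is what lets each proposal travel far while still being accepted most of the time, which is the source of the speed advantage over random-walk samplers.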

Survey of NLP papers NLP

To conclude, I'll be lazy and link to someone else's summaries: Alexandre summarizes the papers he found interesting from the EMNLP '12 conference. As mentioned, I've had almost no time to review anything longer than a page in the past month.



Also noting two articles on Revolutions about analyzing Big Data with R





from 丕子:


Written by cwyalpha

September 13, 2012 at 1:55 am

Posted in Uncategorized

