Thought this was cool: Big Data Linkshare (recent technical literature on big data)

Time Series and Multivariate Analysis in R R Statistics

Book (75 pp.): http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/index.html

Book (53 pp.): http://little-book-of-r-for-multivariate-analysis.readthedocs.org/en/latest/index.html

A series of three short, free books about data analysis in R. The first one discusses time series analysis (decomposition into trend and seasonal components, exponential smoothing, ARIMA and the like).

The second is about multivariate analysis and covers PCA, LDA and other topics.

The last one is on biomedical statistics, so it might be a little specialized. It is linked to from the other pages.
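To give a feel for the exponential smoothing mentioned for the first book, here is a minimal sketch in Python rather than R. It is not taken from the book: the function name `exponential_smoothing` and the choice to seed the smoothed series with the first observation are assumptions made for illustration.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    The first smoothed value is set to the first observation (an assumption;
    other initializations are possible).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for value in series:
        if not smoothed:
            smoothed.append(float(value))
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


result = exponential_smoothing([10, 20, 30], alpha=0.5)
print(result)  # [10.0, 15.0, 22.5]
```

A larger `alpha` tracks recent observations more closely, while a smaller one smooths more aggressively.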

Mining Twitter with MongoDB and MapReduce Big Data

Blog: http://gregmoreno.wordpress.com/2012/09/05/mining-twitter-data-with-ruby-mongodb-and-map-reduce/

Demonstration of how to use a Ruby Twitter gem to scrape tweets into MongoDB, and then run a Mongo map-reduce task (non-Hadoop) to get a histogram of tweets per hour of day.
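The map-reduce idea behind the tweets-per-hour histogram can be sketched in plain Python, without MongoDB or Ruby. This is only an illustration of the pattern, not the post's code: the `created_at` field format and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def map_tweet(tweet):
    """Map step: emit (hour_of_day, 1) for one tweet document."""
    # Assumed timestamp format; real tweet documents may differ.
    created = datetime.strptime(tweet["created_at"], "%Y-%m-%d %H:%M:%S")
    yield created.hour, 1


def reduce_counts(hour, counts):
    """Reduce step: sum all counts emitted for one hour."""
    return sum(counts)


def tweets_per_hour(tweets):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for tweet in tweets:
        for hour, count in map_tweet(tweet):
            grouped[hour].append(count)
    return {hour: reduce_counts(hour, counts)
            for hour, counts in sorted(grouped.items())}


sample = [
    {"text": "a", "created_at": "2012-09-05 09:15:00"},
    {"text": "b", "created_at": "2012-09-05 09:47:10"},
    {"text": "c", "created_at": "2012-09-06 21:03:30"},
]
histogram = tweets_per_hour(sample)
print(histogram)  # {9: 2, 21: 1}
```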

Optimizing Pig Jobs Big Data

Blog: http://hortonworks.com/blog/pig-performance-and-optimization-analysis/

This guy did his PhD project on running database benchmarks against Hive and Pig and working out how to optimize them. He has a few tips stashed away in a PowerPoint deck attached to a JIRA issue linked from the post. They are:

Reorder JOINs properly
Use COGROUP for JOIN + GROUP
Use FLATTEN for self-join
Project before (CO)GROUP
Remove types in LOAD
Use hash-based aggregation

The post itself goes into detail about the benchmarks they ran and why different improvements, like cutting down on Pig’s generated MR jobs (e.g., Pig generates 3 jobs for an ORDER BY clause) or enabling map aggregation, are significant.

Unique Python Data Structures Python

Blog: http://kmike.ru/python-data-structures/

The short blog post covers several libraries implementing Bloom filters, tries, efficient lists or arrays, and general-purpose graph libraries for Python. Some of these are especially well suited to big data or NLP, so I opted to include this post.
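For readers who have not met Bloom filters, a toy version fits in a few lines. This sketch is not one of the libraries from the post: the class name, the default sizes, and the use of seeded SHA-256 digests as hash functions are all assumptions chosen for clarity rather than speed.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bloom = BloomFilter()
for word in ["hadoop", "pig", "hive"]:
    bloom.add(word)
print("pig" in bloom)  # True
```

A membership test that returns False is definitive. A True answer may be a false positive, which is the price paid for the small memory footprint.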

Scaling Deep Learning by Google ML Big Data

Video (40 min): http://techtalks.tv/talks/57639/

Jeff Dean of Google fame (his work was reviewed here before) gives a technical talk about how Google runs fast, parallel optimization algorithms (such as SGD or L-BFGS) from an engineering standpoint.

Cardinality Estimation Algorithms Big Data Python

Blog: http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation

Continuing the ideas covered in earlier blog posts featured here on sketches and other probabilistic counting methods, this post covers and demonstrates (with Python) SuperLogLog and HyperLogLog.
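As a rough illustration of HyperLogLog (not the code from the linked post), the sketch below keeps one small register per bucket and combines them with a harmonic mean. The function name, the SHA-1-based 64-bit hash, and the default `p=11` are assumptions. The bias constant used here is the usual one for 128 or more registers, and the large-range correction is omitted because the hash is 64 bits wide.

```python
import hashlib
import math


def hyperloglog_estimate(items, p=11):
    """Estimate the number of distinct items using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        index = h >> (64 - p)                      # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)               # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


# 60,000 events from 20,000 distinct (hypothetical) user ids.
estimate = hyperloglog_estimate(f"user-{i % 20000}" for i in range(60000))
print(round(estimate))
```

With 2048 registers the typical relative error is around 2 percent, and repeated items never change the registers, so duplicates do not affect the estimate.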

Re-Expression optimizes your ML models ML

Site: http://arauhala.github.com/libreexpweb/

This site includes a very abstract description and reference implementation, in C++, of the re-expression method, whereby Naive Bayes model variables are re-encoded into progressively more complex combinations to better fit the data, while avoiding the creation of dependent variables.

Recommendation engine in Hadoop with Python ML Big Data Python

Blog: http://aimotion.blogspot.com.br/2012/08/introduction-to-recommendations-with.html

A very detailed walkthrough of how to write Python MapReduce jobs with mrjob to do collaborative filtering. The algorithm implemented is item-item similarity generated from user-item preference data. Various measures are demonstrated (cosine similarity, Jaccard distance and correlation coefficient).
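The core of item-item similarity can be shown in memory, without mrjob or Hadoop. This is a sketch under stated assumptions, not the post's implementation: the function name and the sample data are made up, and the cosine here treats a missing rating as zero, which is only one of several common variants.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(ratings):
    """Map (user, item, rating) triples to {(item_a, item_b): cosine}."""
    # Step 1: group ratings by user (the role of the first MapReduce pass).
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating

    # Step 2: accumulate dot products per item pair and squared norms per item.
    dot = defaultdict(float)
    norm = defaultdict(float)
    for prefs in by_user.values():
        for item, rating in prefs.items():
            norm[item] += rating * rating
        for (a, ra), (b, rb) in combinations(sorted(prefs.items()), 2):
            dot[(a, b)] += ra * rb

    # Step 3: cosine similarity for every pair rated together at least once.
    return {pair: d / (sqrt(norm[pair[0]]) * sqrt(norm[pair[1]]))
            for pair, d in dot.items()}


sample_ratings = [
    ("ann", "A", 1.0), ("ann", "B", 1.0),
    ("bob", "A", 1.0), ("bob", "B", 1.0),
    ("cat", "A", 1.0), ("cat", "C", 1.0),
]
sims = item_similarities(sample_ratings)
print(sims)
```

Items A and B come out more similar than A and C because two users rated both A and B, while only one rated both A and C. B and C have no pair entry since no user rated both.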

Fast Bayesian Inference with STAN Statistics

Site: http://mc-stan.org/

Blog: http://martynplummer.wordpress.com/2012/09/02/stan/

A lot of excitement followed the release of Stan, a BUGS alternative for running Monte Carlo simulations. Underneath, Stan generates and compiles C++ code from the BUGS-like model description, which runs a very fast Hamiltonian Monte Carlo sampler to converge to the model distribution. It comes with R bindings so that it’s easy to use from your research code.

Survey of NLP papers NLP

Blog: http://atpassos.posterous.com/emnlp-2012-reading-list

To conclude, I’ll be lazy and link to someone else’s summaries: Alexandre summarizes the papers he found interesting from the EMNLP ’12 conference. As mentioned, I have had almost no time to review anything longer than a page in the past month.


from 丕子: http://www.zhizhihu.com/html/y2012/3922.html

Written by cwyalpha

September 13, 2012 at 1:55 AM

Posted in Uncategorized
