CWYAlpha


Thought this was cool: Big Data Linkshare, recent technical literature on big data



Big Data Linkshare is a mailing list I subscribe to; I have forgotten how I signed up.

But the content is quite good. This issue is especially great.

Copyright: Copyright © 2012 Israeli Big Data Linkshare, All rights reserved.

The past few weeks have been too hectic to find time for writing the newsletter. I hope that going forward I’ll be able to keep to a regular schedule. However, I’m travelling next week, so the next issue will only come out in the last week of September. But if any of you are in London and want to have a beer and discuss Data Science, drop me a line!

Time Series and Multivariate Analysis in R (tags: R, Statistics)

Book (75 pp.): http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/index.html

Book (53 pp.): http://little-book-of-r-for-multivariate-analysis.readthedocs.org/en/latest/index.html
A series of three short, free books about data analysis in R. The first one discusses time series analysis (decomposition into trend and seasonal components, exponential smoothing, ARIMA and the like).
The second is about multivariate analysis and covers PCA, LDA and other topics.
The last one is on biomedical statistics, so it might be a little specialized. It is linked to from the other pages.
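
To make the exponential smoothing topic concrete, here is a minimal sketch of simple exponential smoothing. It is written in Python rather than the R used by the books, and the function name and sample values are illustrative assumptions, not taken from the books.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]
    for x in series[1:]:
        # Blend the new observation with the previous smoothed value.
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical sample data, for illustration only.
values = [10.0, 12.0, 11.0, 15.0]
smoothed = exponential_smoothing(values, alpha=0.5)
```

A larger `alpha` tracks recent observations more closely; a smaller one smooths more heavily.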

Mining Twitter with MongoDB and MapReduce (tags: Big Data)

Blog: http://gregmoreno.wordpress.com/2012/09/05/mining-twitter-data-with-ruby-mongodb-and-map-reduce/
Demonstration of how to use a Ruby Twitter gem to scrape tweets into MongoDB, and then run a Mongo map-reduce task (non-Hadoop) to get a histogram of tweets per hour of day.
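
The tweets-per-hour histogram can be sketched without MongoDB at all. The snippet below is a plain-Python imitation of the map and reduce steps, not the Ruby or Mongo code from the post; the sample tweets, field names, and function names are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical tweets; the "created_at" field name and format are assumptions.
tweets = [
    {"text": "hello", "created_at": "2012-09-05T09:15:00"},
    {"text": "big data", "created_at": "2012-09-05T09:45:00"},
    {"text": "lunch", "created_at": "2012-09-06T14:05:00"},
]


def map_tweet(tweet):
    """Map step: emit (hour of day, 1) for each tweet."""
    hour = datetime.fromisoformat(tweet["created_at"]).hour
    yield hour, 1


def reduce_counts(hour, counts):
    """Reduce step: sum the counts emitted for one hour."""
    return sum(counts)


def map_reduce(records, mapper, reducer):
    """Group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


histogram = map_reduce(tweets, map_tweet, reduce_counts)
```

In a real map-reduce engine the grouping step is done by the engine itself; only the map and reduce functions are supplied by the user.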

Optimizing Pig Jobs (tags: Big Data)

Blog: http://hortonworks.com/blog/pig-performance-and-optimization-analysis/
This guy did his PhD project on running DB benchmarks against Hive and Pig and discovering how to optimize them. He has a few tips stashed away in a PowerPoint attached to a JIRA issue linked from the post. They are:
  1. Reorder JOINs properly
  2. Use COGROUP for JOIN + GROUP
  3. Use FLATTEN for self-join
  4. Project before (CO)GROUP
  5. Remove types in LOAD
  6. Use hash-based aggregation
The post itself goes into detail about the benchmarks they ran and why various improvements, such as cutting down on the number of MR jobs Pig generates (e.g., Pig generates 3 jobs for an ORDER BY clause) or enabling map-side aggregation, are significant.

Unique Python Data Structures (tags: Python)

Blog: http://kmike.ru/python-data-structures/
The short blog post covers several libraries implementing Bloom filters, tries, efficient lists or arrays, and general-purpose graph libraries for Python. Some of these are especially well suited to big data or NLP, so I opted to include this post.
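
For readers unfamiliar with Bloom filters, here is a minimal standard-library sketch. It is not one of the libraries covered in the post; the class name, default sizes, and the choice of seeded SHA-256 hashes are assumptions made to keep the example self-contained.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for word in ["hadoop", "pig", "hive"]:
    seen.add(word)
```

A membership test such as `"pig" in seen` is always true for added items; for items never added it is usually false, with a small false-positive rate that depends on `num_bits` and `num_hashes`.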

Scaling Deep Learning by Google (tags: ML, Big Data)

Video (40 min): http://techtalks.tv/talks/57639/
Jeff Dean of Google fame (his work has been reviewed here before) gives a technical talk, from an engineering standpoint, about how Google runs fast, parallel optimization algorithms (such as SGD or L-BFGS).
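
As a reminder of what SGD does at its core, here is a single-machine sketch that fits a line by stochastic gradient descent. It illustrates only the basic update rule, not the parallel or distributed machinery the talk is about; the function name, learning rate, and synthetic data are assumptions.

```python
import random


def sgd_linear_regression(points, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by per-sample gradient steps on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)  # visit samples in a fresh random order each epoch
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Synthetic, noise-free data drawn from y = 2x + 1 (illustrative only).
points = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(20)]
w, b = sgd_linear_regression(points)
```

Because each update touches only one sample, the work parallelizes naturally across data shards, which is the property large-scale systems exploit.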

Cardinality Estimation Algorithms (tags: Big Data, Python)

Blog: http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation
Continuing the ideas covered in earlier blog posts featured here on sketches and other probabilistic counting methods, this post covers and demonstrates (in Python) SuperLogLog and HyperLogLog.
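
To give a feel for the technique, here is a compact HyperLogLog sketch. It is not the code from the linked post; the class name, the use of SHA-1 truncated to 64 bits, and the register count are assumptions, and only the small-range (linear counting) correction is included.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using m = 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant; this formula is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        index = h >> (64 - self.p)  # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(50000):
    hll.add(f"user-{i}")
estimate = hll.count()
```

With 1024 registers the typical relative error is about 1.04 / sqrt(1024), roughly 3%, while the memory used stays fixed no matter how many items are added.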

Re-Expression optimizes your ML models (tags: ML)

Site: http://arauhala.github.com/libreexpweb/
This site includes a very abstract description and a C++ reference implementation of the re-expression method, whereby Naive Bayes model variables are re-encoded into progressively more complex combinations to better fit the data, while avoiding the creation of dependent variables.

Recommendation engine in Hadoop with Python (tags: ML, Big Data, Python)

Blog: http://aimotion.blogspot.com.br/2012/08/introduction-to-recommendations-with.html
A very detailed walkthrough of how to write Python MapReduce jobs with mrjob to do collaborative filtering. The algorithm implemented is item-item similarity generated from user-item preference data. Various measures are demonstrated (cosine similarity, Jaccard distance and correlation coefficient).
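
The item-item idea can be sketched in plain Python, without mrjob or Hadoop. The example below computes cosine similarity between items over the users who rated both; the sample ratings and function name are assumptions, and restricting the vectors to co-rating users is a simplification chosen here, not necessarily what the post does.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, item, rating) triples, for illustration only.
ratings = [
    ("alice", "A", 1.0), ("alice", "B", 2.0), ("alice", "C", 2.0),
    ("bob", "A", 2.0), ("bob", "B", 4.0), ("bob", "C", 1.0),
]


def item_similarities(ratings):
    """Cosine similarity for every item pair rated by at least one common user."""
    # Step 1: group ratings by user.
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating

    # Step 2 ("map"): for each user, emit the ratings of every item pair.
    pair_ratings = defaultdict(list)
    for items in by_user.values():
        for a, b in combinations(sorted(items), 2):
            pair_ratings[(a, b)].append((items[a], items[b]))

    # Step 3 ("reduce"): compute cosine similarity per item pair.
    similarities = {}
    for pair, co_ratings in pair_ratings.items():
        dot = sum(x * y for x, y in co_ratings)
        norm_a = math.sqrt(sum(x * x for x, _ in co_ratings))
        norm_b = math.sqrt(sum(y * y for _, y in co_ratings))
        similarities[pair] = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return similarities


similarities = item_similarities(ratings)
```

Items A and B come out as most similar here because both users rate B at exactly twice A; swapping in another measure only changes step 3.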

Fast Bayesian Inference with STAN (tags: Statistics)

Site: http://mc-stan.org/
Blog: http://martynplummer.wordpress.com/2012/09/02/stan/
A lot of excitement has followed the release of Stan, a BUGS alternative for running Monte Carlo simulations. Underneath, Stan generates and compiles C++ code from the BUGS-like model description, which runs a very fast Hamiltonian MCMC sampler to converge to the model distribution. It comes with R bindings, so it’s easy to use from your research code.

Survey of NLP papers (tags: NLP)

Blog: http://atpassos.posterous.com/emnlp-2012-reading-list
To conclude, I’ll be lazy and link to someone else’s summaries: Alexandre summarizes the papers he found interesting from the EMNLP ’12 conference. As mentioned, I have had almost no time in the past month to review anything longer than a page.

from 丕子: http://www.zhizhihu.com/html/y2012/3922.html

Written by cwyalpha

September 13, 2012 at 1:55 am

Posted in Uncategorized
