# CWYAlpha

Just another WordPress.com site

## Thought this was cool: Xiao Ming accidentally dropped his iPhone 4 into the river and sat on the bank crying. The river god heard…

from cnBeta.com featured comments: http://www.cnbeta.com/articles/225416.htm

Written by cwyalpha

## Thought this was cool: That Netflix RMSE is way too low, or is it? (Clustering-Based Matrix Factorization – implementation)

We’ve seen this type of occurrence on Nuit Blanche before. This one is either a bombshell or a dud. Early in a discussion in the Advanced Matrix Factorization group, Nima Mirbakhsh shared his thoughts and an interesting, potentially mind-blowing implementation. Here is what he said:

Help to evaluate my proposed extension of matrix factorization.

Hello everyone,
I have a new extension of matrix factorization named “Clustering-Based Matrix Factorization”. I have applied it to many datasets, including “Netflix”, “Movielens”, “Epinions”, and “Flixter”, and it achieves very good results. For the last three datasets the RMSE results are good and reasonable, but on the Netflix dataset it achieves a very interesting result. As we all know, the RMSE of the Netflix prize winner was 0.8567; my method achieves an RMSE of 0.8122.
I know that the Netflix prize winner’s method fuses the results of many different algorithms, and it is hard to believe that a single algorithm can reach such a good result. This has been my concern over the last couple of months too. I have checked my source code and my setup several times but cannot find any bug. I also submitted the paper to ICML; apart from one weak accept, the reviewers all said that my method actually makes sense, but they rejected the work just because of the extraordinary result!
That is why I have decided to put the paper and my source code online so that everyone can evaluate them. I kindly ask you to join me in evaluating the paper and the source code more carefully. If my method does work, it will be a new experience for recommender systems and may show us that there are still opportunities to improve RMSE results.
source code: http://goo.gl/Az0lS
We recently saw some improvement of the Netflix RMSE (Linear Bandits in High Dimension and Recommendation Systems), but this time the code is shared so everybody can kick the tires on it. As a reminder, we featured that paper earlier:

Recommender systems are emerging technologies that nowadays can be found in many applications such as Amazon, Netflix, and so on. These systems help users find relevant information, recommendations, and their preferred items. Matrix Factorization is a popular method in Recommendation Systems showing promising results in accuracy and complexity. In this paper we propose an extension of matrix factorization that uses the clustering paradigm to cluster similar users and items into several communities. We then establish their effect on the prediction model. To the best of our knowledge, our proposed model outperforms all other published recommender methods in accuracy and complexity. For instance, our proposed method achieves an RMSE of 0.8122 on the Netflix dataset, which is better (lower) than the Netflix prize winner’s RMSE of 0.8567.
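For context, here is a minimal sketch of the plain matrix-factorization baseline that the paper extends, trained with SGD on a made-up toy dataset. This is the generic model, not the author's clustering-based extension, and all hyperparameters and data are illustrative:

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=8, lr=0.05, reg=0.02, epochs=200):
    """Plain matrix factorization via SGD: predict r_ui as P[u] . Q[i]."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            # update both factors simultaneously from the old values
            P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                          Q[i] + lr * (err * P[u] - reg * Q[i]))
    return P, Q

def rmse(ratings, P, Q):
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings])))

# Made-up toy data: (user, item, rating) triples for 4 users and 3 items
train = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
         (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
P, Q = factorize(train, n_users=4, n_items=3)
print("training RMSE:", rmse(train, P, Q))
```

The clustering-based extension in the paper adds community structure on top of factors like these; the point of the sketch is only to show what "RMSE of a matrix factorization" refers to.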

Written by cwyalpha

## Thought this was cool: What does fire-control radar mean?

> Military ships and planes do not “lock onto other vessels all the time”. Not with targeting radar, and not in international waters – doing so is proscribed under international law as an act of aggression. It is the high-tech equivalent of sticking a gun in someone’s face and cocking the hammer.


from est's blog: http://blog.est.im/post/42493400735

Written by cwyalpha

## Thought this was cool: LDA-math: Text Modeling

4. Text Modeling

• what kinds of dice God has;
• how God throws these dice;

4.1 Unigram Model

$$p(\overrightarrow{w}) = p(w_1, w_2, \cdots, w_n) = p(w_1)p(w_2) \cdots p(w_n)$$

$$p(\mathcal{W})= p(\overrightarrow{w_1})p(\overrightarrow{w_2}) \cdots p(\overrightarrow{w_m})$$

$$p(\overrightarrow{n}) = Mult(\overrightarrow{n}|\overrightarrow{p}, N) = \binom{N}{\overrightarrow{n}} \prod_{k=1}^V p_k^{n_k}$$

\begin{align*}
p(\mathcal{W})= p(\overrightarrow{w_1})p(\overrightarrow{w_2}) \cdots p(\overrightarrow{w_m})
= \prod_{k=1}^V p_k^{n_k}
\end{align*}

$$\hat{p_i} = \frac{n_i}{N} .$$

$$p(\mathcal{W}) = \int p(\mathcal{W}|\overrightarrow{p}) p(\overrightarrow{p})d\overrightarrow{p}$$

$$p(\overrightarrow{n}) = Mult(\overrightarrow{n}|\overrightarrow{p}, N)$$

$$Dir(\overrightarrow{p}|\overrightarrow{\alpha})= \frac{1}{\Delta(\overrightarrow{\alpha})} \prod_{k=1}^V p_k^{\alpha_k -1}, \quad \overrightarrow{\alpha}=(\alpha_1, \cdots, \alpha_V)$$

$$\Delta(\overrightarrow{\alpha}) = \int \prod_{k=1}^V p_k^{\alpha_k -1} d\overrightarrow{p} .$$

The Unigram Model under a Dirichlet prior

Graphical model of the Unigram Model

Dirichlet prior + multinomial data $\rightarrow$ the posterior is again a Dirichlet distribution

$$Dir(\overrightarrow{p}|\overrightarrow{\alpha}) + MultCount(\overrightarrow{n})= Dir(\overrightarrow{p}|\overrightarrow{\alpha}+\overrightarrow{n})$$

$$p(\overrightarrow{p}|\mathcal{W},\overrightarrow{\alpha}) = Dir(\overrightarrow{p}|\overrightarrow{n}+ \overrightarrow{\alpha}) = \frac{1}{\Delta(\overrightarrow{n}+\overrightarrow{\alpha})} \prod_{k=1}^V p_k^{n_k + \alpha_k -1}$$

$$E(\overrightarrow{p}) = \Bigl(\frac{n_1 + \alpha_1}{\sum_{i=1}^V(n_i + \alpha_i)}, \frac{n_2 + \alpha_2}{\sum_{i=1}^V(n_i + \alpha_i)}, \cdots, \frac{n_V + \alpha_V}{\sum_{i=1}^V(n_i + \alpha_i)} \Bigr)$$

$$\hat{p_i} = \frac{n_i + \alpha_i}{\sum_{i=1}^V(n_i + \alpha_i)}$$
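The posterior-mean estimator above is just the raw word frequency smoothed by the pseudo-counts $\alpha_i$; a minimal sketch with made-up counts:

```python
def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean of p under a Dir(alpha) prior given multinomial counts."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]

# Vocabulary of 3 words; toy corpus counts, symmetric prior alpha = 1
counts = [5, 3, 2]
alpha = [1.0, 1.0, 1.0]
p_hat = dirichlet_posterior_mean(counts, alpha)
print(p_hat)  # [6/13, 4/13, 3/13]
```

With all $\alpha_i = 1$ this is exactly Laplace (add-one) smoothing of the maximum-likelihood estimate $n_i/N$.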

\begin{align}
p(\mathcal{W}|\overrightarrow{\alpha}) & = \int p(\mathcal{W}|\overrightarrow{p}) p(\overrightarrow{p}|\overrightarrow{\alpha})d\overrightarrow{p} \notag \\
& = \int \prod_{k=1}^V p_k^{n_k} Dir(\overrightarrow{p}|\overrightarrow{\alpha}) d\overrightarrow{p} \notag \\
& = \int \prod_{k=1}^V p_k^{n_k} \frac{1}{\Delta(\overrightarrow{\alpha})}
\prod_{k=1}^V p_k^{\alpha_k -1} d\overrightarrow{p} \notag \\
& = \frac{1}{\Delta(\overrightarrow{\alpha})}
\int \prod_{k=1}^V p_k^{n_k + \alpha_k -1} d\overrightarrow{p} \notag \\
& = \frac{\Delta(\overrightarrow{n}+\overrightarrow{\alpha})}{\Delta(\overrightarrow{\alpha})}
\label{likelihood-dir-mult}
\end{align}
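The closed-form evidence $\Delta(\overrightarrow{n}+\overrightarrow{\alpha})/\Delta(\overrightarrow{\alpha})$ can be evaluated numerically with log-gamma functions. As a sanity check: for $V=2$, one occurrence of each word, and $\overrightarrow{\alpha}=(1,1)$, the Pólya urn gives $(1/2)\cdot(1/3) = 1/6$:

```python
from math import lgamma, exp

def log_delta(alpha):
    """log Delta(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))

def corpus_evidence(counts, alpha):
    """p(W|alpha) = Delta(n + alpha) / Delta(alpha) for a bag of word counts n."""
    post = [n + a for n, a in zip(counts, alpha)]
    return exp(log_delta(post) - log_delta(alpha))

# V=2, counts n=(1,1), uniform prior alpha=(1,1): expect 1/6
print(corpus_evidence([1, 1], [1.0, 1.0]))  # 0.1666...
```

Working in log space avoids overflow when the counts $n_k$ are large, which matters for realistic corpora.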

4.2 Topic Model 和 PLSA

• Speaking of linguistics, the words that come to mind include: grammar, sentence, Chomsky, parsing, subject…;
• Talking about probability and statistics, we think of words such as: probability, model, mean, variance, proof, independence, Markov chain, …;
• Talking about computers, the words that come to mind are: memory, hard disk, programming, binary, object, algorithm, complexity…;

A topic is simply a probability distribution over the vocabulary.

The document generation process of the PLSA model

$$p(w|d_m) = \sum_{z=1}^K p(w|z)p(z|d_m) = \sum_{z=1}^K \varphi_{zw} \theta_{mz}$$

$$p(\overrightarrow{w}|d_m) = \prod_{i=1}^n \sum_{z=1}^K p(w_i|z)p(z|d_m) = \prod_{i=1}^n \sum_{z=1}^K \varphi_{zw_i} \theta_{mz}$$
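The PLSA mixture $p(w|d_m) = \sum_z \theta_{mz}\varphi_{zw}$ is just a matrix product of the document-topic and topic-word matrices; a sketch with made-up parameters:

```python
import numpy as np

K, V, M = 2, 4, 3  # topics, vocabulary size, documents
phi = np.array([[0.5, 0.3, 0.1, 0.1],    # p(w|z=0)
                [0.1, 0.1, 0.4, 0.4]])   # p(w|z=1)
theta = np.array([[0.9, 0.1],            # p(z|d_0)
                  [0.5, 0.5],            # p(z|d_1)
                  [0.2, 0.8]])           # p(z|d_2)

# p(w|d_m) = sum_z theta[m, z] * phi[z, w], i.e. a matrix product
p_w_given_d = theta @ phi   # shape (M, V); each row sums to 1
print(p_w_given_d[0])
```

Because each row of `theta` and each row of `phi` sums to 1, each row of the product is again a valid distribution over the vocabulary.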

Written by cwyalpha

## Thought this was cool: HTML5 for TV (Big screen)

@刘兴亮
The standout advantage of HTML5 is that it is multi-device and cross-platform. “HTML5 for TV” is an international open-source collaboration led by CableLabs in the US to promote the use of HTML5 on TV terminals. In 2013 we can expect more developers to move into app development for smart TVs. Developing applications for the big screen is a “rich vein” to mine. In 2013, the defining feature of smart-TV applications will be cross-screen, cross-device interoperability.

vod.xunlei.com: I already watch most of my videos there. But:

• The media keys on multimedia keyboards get no response. This is a browser vendor problem.
• Audio and video content cannot be pushed over DLNA to other devices’ screens.

<video onControlPlay="function(){}" />


from est's blog: http://blog.est.im/post/41774190853

Written by cwyalpha

## Thought this was cool: Which generation does the programming language you know belong to?

1GL (First-generation programming language)

2GL:

• Assembly

3GL:

• Fortran
• COBOL
• Pascal
• C
• C++
• C#
• Java
• BASIC
• Delphi

4GL:

• LabVIEW
• SQL and PL/SQL
• MATLAB
• R
• Scilab
• XQuery
• XUL
• ColdFusion

5GL

• Prolog
• ML
• Erlang

from est's blog: http://blog.est.im/post/41061230598

Written by cwyalpha

## Thought this was cool: Case study: million songs dataset

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested that we implement rankings based on item similarity.

Thanks to Clive’s suggestion, we now have an implementation of Fabio Aiolli’s cost function, as explained in the paper A Preliminary Study for a Recommender System for the Million Songs Dataset, which is the winning method in this contest.

Following are detailed instructions on how to use the GraphChi CF toolkit on the million songs dataset, to compute user ratings out of item similarities.

### Instructions for computing item to item similarities:

2) Run createTrain.sh to download the million songs dataset and prepare a GraphChi-compatible format:

$ sh createTrain.sh

Note: this operation may take an hour or so to prepare the data.

3) Run GraphChi item-based collaborative filtering to find the top 500 similar items for each item:

$ ./toolkits/collaborative_filtering/itemcf --training=train --K=500 --asym_cosine_alpha=0.15 --distance=3 --min_allowed_intersection=5

Explanation: --training points to the training file. --K=500 means we compute the top 500 similar items. --distance=3 selects Aiolli’s metric. --min_allowed_intersection=5 means we only take into account items that were rated together by at least 5 users.

Note: this operation requires around 20GB of memory and may take a few hours…

4) Post-process the results to create a single item-to-item ratings file:

$ sh ./toolkits/collaborative_filtering/topk.sh train
Sorting output file train.out0

Merging sorted files:
File written: train-topk

5) Create a matrix market header by saving the bash script below into a file and running it
(or simply copy everything and paste into a bash/sh shell window):

#!/bin/sh -x
USERS=100000
ITEMS=385371
echo "%%MatrixMarket matrix coordinate real general" > train-topk\:info
echo $ITEMS $ITEMS `wc -l train-topk | awk '{print $1}'` >> train-topk\:info
echo "*********************"
cat train-topk\:info
echo "*********************"

### Create user recommendations based on item similarities:

1) Run itemsim2rating to compute recommendations based on item similarities:

$ ./toolkits/parsers/itemsim2rating --training=train --similarity=train-topk --K=500 membudget_mb 50000 --nshards=1 --max_iter=2 --Q=3
Note: this operation may require 20GB of RAM and may take a couple of hours, depending on your computer configuration.

Output file is: train-rec
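For reference, the asymmetric cosine similarity that --asym_cosine_alpha=0.15 refers to in Aiolli’s paper scores two items by the overlap of the user sets that rated them; a minimal sketch with made-up data:

```python
def asym_cosine(users_i, users_j, alpha=0.15):
    """Asymmetric cosine: |Ui & Uj| / (|Ui|^alpha * |Uj|^(1 - alpha))."""
    inter = len(users_i & users_j)
    if inter == 0:
        return 0.0
    return inter / (len(users_i) ** alpha * len(users_j) ** (1 - alpha))

# Toy item-to-user sets (user ids are made up)
song_a = {1, 2, 3, 4}
song_b = {3, 4, 5}
print(asym_cosine(song_a, song_b))
```

With alpha = 0.5 this reduces to the ordinary cosine similarity of the binary listening vectors; alpha = 0.15 weights the popularity of the target item more heavily.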

### Evaluating the result

1) Prepare test data:
./toolkits/parsers/topk --training=test --K=500

Output file is: test.ids

2) Prepare training recommendations:

./toolkits/parsers/topk --training=train-rec --K=500

Output file is: train-rec.ids

3) Compute mean average precision @ 500:
./toolkits/collaborative_filtering/metric_eval --training=train-rec.ids --test=test.ids --K=500

About the performance: Q is the power applied to the item similarity weight.
When Q = 1 we get:

INFO:     metric_eval.cpp(eval_metrics:114): 7.48179 Finished evaluating 100000 instances.
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.151431
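Mean average precision at K, the metric reported above, can be sketched as follows (this is the standard definition, not necessarily GraphChi’s exact implementation):

```python
def ap_at_k(recommended, relevant, k=500):
    """Average precision @ k for one user: mean precision at each hit rank."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recs, all_rels, k=500):
    """Mean AP@k over all users."""
    return sum(ap_at_k(r, t, k) for r, t in zip(all_recs, all_rels)) / len(all_recs)

# Toy check: hits at ranks 1 and 3, two relevant items in total
print(ap_at_k(["a", "x", "b"], {"a", "b"}, k=3))  # (1/1 + 2/3) / 2 = 0.8333...
```

The metric rewards placing relevant songs near the top of the 500-item list, which is why reweighting similarities with the power Q changes the score.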

Acknowledgements:

• Clive Cox, RummbleLabs.com, for proposing to implement item-based recommendations in GraphChi and for his support in the process of implementing this method.
• Fabio Aiolli, University of Padova, winner of the Million songs dataset contest, for his great support regarding the implementation of his metric.

from Large Scale Machine Learning and Other Animals: http://bickson.blogspot.com/2013/02/case-study-million-songs-dataset.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FsYXZE+%28Large+Scale+Machine+Learning+and+Other+Animals%29

Written by cwyalpha