CWYAlpha

Just another WordPress.com site

Archive for 十二月 2012

Thought this was cool: 有哪些比较基础的计算机书籍?

leave a comment »

谢邀,正好今年夏天的时候,在毕业生卖书的地摊上找到一本书,还挺好的,叫做《Computer Science Illuminated》(计算机科学概论,点亮你的计算机世界),作者是Nell Dale和John Lewis。

http://www.amazon.cn/Computer-Science-Illuminated-Student-Study-Guide-Dale-Nell-B/dp/0763726265/ref=sr_1_7?ie=UTF8&qid=1356683364&sr=8-7

这是第二版的,英文版的比较贵,卓越上有一本第三版的中文译本。http://www.amazon.cn/%E8%AE%A1%E7%AE%97%E6%9C%BA%E7%A7%91%E5%AD%A6%E6%A6%82%E8%AE%BA-%E6%88%B4%E5%B0%94/dp/product-description/B001TDM10Y/ref=dp_proddesc_0?ie=UTF8&s=books

这本书的内容比较丰富,但是对于任何一个话题都没有具体的展开详述,毕竟只有600多页的篇幅,而其中任何一个章节的内容拿出来进行系统叙述都可以变成一本600页以上的书。这本书只能让你有一个概要性的、宏观上的理解,如果要求深入,请另外选择书籍。

以下是一些简要的章节:
Laying the Groundwork(基础知识)

  • Chapter 1 The Big Picture(全景图),本书的层次,计算机和软件的历史。

The Information Layer(信息层面)

  • Chapter 2 Binary Values and Number Systems(二进制值和计数系统),二进制、八进制、十进制、十六进制的计算和转换。
  • Chapter 3 Data Representation(数据表示),模拟量和数字量,如何表示数据,数怎么表示,文字、声音、图像、视频怎么表示。

The Hardware Layer(硬件层面)

  • Chapter 4 Gates and Circuits(逻辑门和电路),门电路、晶体管、加法器、存储器的简单原理。
  • Chapter 5 Computing Components(计算部件),冯式结构、CPU指令周期、外存结构和非冯式结构。

The Programming Layer(编程层面)

  • Chapter 6 Problem Solving and Algorithm Design(问题解决和算法设计),设计简单算法、自顶向下、测试和面向对象。
  • Chapter 7 Low-Level Programming Languages(低级编程语言),机器语言和汇编。
  • Chapter 8 High-Level Programming Languages(高级编程语言),编译器和解释器、编程范式、函数式编程、常用的程序结构(IO、选择、循环、子程序、递归等)、类型系统。
  • Chapter 9 Abstract Data Types and Algorithms(抽象数据类型和算法),数组和链表、排序、二分查找、栈和队列、树。

The Operating System Layer(操作系统层面)

  • Chapter 10 Operating Systems (操作系统),操作系统的功能、内存管理、进程管理、CPU调度。
  • Chapter 11 File Systems and Directories(文件系统和目录),文件操作、目录树、磁盘结构。

The Application Layer(应用程序层面)

  • Chapter 12 Information Systems(信息系统),电子表格和数据库系统。
  • Chapter 13 Artificial Intelligence(人工智能),思考机器、知识表示、专家系统、神经网络、自然语言处理和机器人。
  • Chapter 14 Simulation and Other Application(模拟器和其他应用),模拟系统、CAD和嵌入式系统。

The Communication Layer(通信层面)

  • Chapter 15 Networks(网络),网络的结构和模式、网络协议和地址。
  • Chapter 16 The World Wide Web(万维网),使用网络(搜索引擎、即时通信等),HTML、交互式页面和XML。

In Conclusion(结论)

  • Chapter 17 Limitations of Computing(计算的局限)

— 完 —

下载知乎 iPhone 客户端:http://zhi.hu/ios
from 知乎每日精选: http://www.zhihu.com/question/20679695/answer/15830307

Advertisements

Written by cwyalpha

十二月 31, 2012 at 2:48 下午

发表在 Uncategorized

Thought this was cool: What is the difference between probability and statistics?

leave a comment »

In Statistics (mathematical science): Ben Golub added a question.

3 Answers

See question on Quora

from Ben Golub on Quora: http://www.quora.com/Statistics-mathematical-science/What-is-the-difference-between-probability-and-statistics

Written by cwyalpha

十二月 31, 2012 at 2:48 下午

发表在 Uncategorized

Thought this was cool: Simons Institute Big Data Program

leave a comment »

Michael Jordan sends the below:

The new Simons Institute for the Theory of Computing
will begin organizing semester-long programs starting in 2013.

One of our first programs, set for Fall 2013, will be on the “Theoretical Foundations
of Big Data Analysis”. The organizers of this program are Michael Jordan (chair),
Stephen Boyd, Peter Buehlmann, Ravi Kannan, Michael Mahoney, and Muthu Muthukrishnan.

See http://simons.berkeley.edu/program_bigdata2013.html for more information on
the program.

The Simons Institute has created a number of “Research Fellowships” for young
researchers (within at most six years of the award of their PhD) who wish to
participate in Institute programs, including the Big Data program. Individuals
who already hold postdoctoral positions or who are junior faculty are welcome
to apply, as are finishing PhDs.

Please note that the application deadline is January 15, 2013. Further details
are available at http://simons.berkeley.edu/fellows.html .

Mike Jordan

from Machine Learning (Theory): http://hunch.net/?p=2606

Written by cwyalpha

十二月 31, 2012 at 2:17 下午

发表在 Uncategorized

Thought this was cool: What is the funniest research paper you have ever read?

leave a comment »

“A Mathematical Model for the Determination of Total Area
Under Glucose Tolerance and Other Metabolic Curves”
http://care.diabetesjournals.org…

In 1993, a nutrition scientist at NYU claimed to have invented a novel and highly accurate method for determining the area under metabolic curves. She named the method after herself and dedicated it to her parents.

Tai’s model was developed to correct the deficiency of under- or overestimation of the total area under a metabolic curve. This formula also allows calculating the area under a curve with unequal units on the X-axis. The strategy of this mathematical model is to divide the total area under a curve into individual small segments such as squares, rectangles, and triangles, whose areas can be precisely determined according to existing geometric formulas. The area of the individual segments are then added to obtain the total area under the curve.

The paper was published in the peer-reviewed journal Diabetes Care, which is run by the American Diabetes Association and has a very respectable impact factor of 8.087. The paper currently has 173 citations, mostly from the diabetes literature.

Most people with high school math will realize that Tai is describing the trapezoidal rule, a classic method attributed to Newton, i.e. circa 1600s.

A year later, in response to what must have been significant amounts of backlash, Tai wrote an emotional letter to the journal defending herself:

While a doctoral candidate working on my dissertation at Columbia University in 1981, I needed to calculate total area under a curve. During a session with my statistical advisor, and after examining several alternative methods, I worked out the model in front of him. The concept behind it is obviously common sense, and one does not have to consult the trapezoid rule to figure it out. The trapezoid rule is really not Nobel Prize material, such as the double helix or jumping genes. I also used the formulas to calculate the areas of a square or a triangle without knowing whose rules were being followed. Fortunately, I do not have to answer that for you.

I never thought of publishing the model as a great discovery or accomplishment; it was not published until 14 years later, in 1994. Because of its accuracy and easy application, many colleagues at the Obesity Research Center of St Luke’s-Roosevelt Hospital Center and Columbia University began using it and addressed it as “Tai’s formula” to distinguish it from others. Later, because the investigators were unable to cite an unpublished work, I submitted it for publication at their requests. Therefore, my name was rubber-stamped on the model before its publication.

My intention in publishing the model is therefore to share, rather than to gain honor or glory with its publication, because there is none. Many other investigators probably thought about the same thing, but maybe they did not bother to follow up or produce a model (or the same model). You indicated that I probably did work this out on my own and I am grateful for your “probability,” because I did indeed do so with a witness present. Maybe I can address the model as my creation based on fact rather than your doubtful “probability.” Besides, if I do not address the model as “Tai’s,” other investigators who wish to cite it will.

http://care.diabetesjournals.org…

Read other related questions on Quora:

Read more answers on Quora.
from Quora: http://www.quora.com/Scientific-Research/What-is-the-funniest-research-paper-you-have-ever-read/answer/Kaicheng-Liang

Written by cwyalpha

十二月 31, 2012 at 1:56 下午

发表在 Uncategorized

Thought this was cool: 犹太人在教育上有什么值得我们学习和借鉴的地方?

leave a comment »

作为一个从小学到大学都在以色列的人我说说自己的看法:
兴趣最重要
这其中包括很多教育上的制度,细节很多,我尽量把最重要的列出来。
1)选课自由:以色列高中选课是自由的,国内必修的生物、物理、化学在这里都是选修课,甚至有学生除了必修外只修文科的,比如说戏剧和心理学,不强制学生学习自己不喜欢或者学的不好的学科很打程度让学生有更多的精力来钻研自己喜欢的东西,从而达到更好的科研或人文方面的成果。
2)课程分级:高中数学分三四五级,因为物理等课程的要求三级数学学生不能学物理,对于4级和5级学生来说,大学录取只看分数,如果一个人4级考了100,另一个人5级考了80,学校会优先录取100分那个,虽然4级学的东西比5级要简单。3级特别轻松,一般是决定要向文科发展或者数学不是强项的学生修的,只要成绩好以后的前途也不会太受影响。
3)考试时间:这三个都是上面的马梓耕同学提过的,高中时规定一个星期不得超过两门大考,同一天不得考两门,全部都是学生的权利,大学也这样,如果考试很集中,学生可以集体提出更改日期。
4)人文:从小学一直到大学都很重视学生的展示能力,一直有很多project,学生可以从给出的主题里选择一个,然后根据自己的想法来做演讲或者书面的研究。这种活动也可以很正式,比如说高中和以色列有名的学府合作推出的论文项目,学生可以选一个科研方向,有一个导师带领,做实验,然后写一篇论文,包括proposal和最后的thesis等正式的过程,实验也绝对不含糊,生物,化学,物理,天文等等全部都有涉及。还有就是做志愿者,当时有同学去红十字会做了一年以后迷上了医疗,高中时就已经有了很多专业急救执照,还亲手救过熟人的家人。还有人去照顾自闭症孩子,现在已经连着做了3年多的志愿者了。初中和高中时学校总是带学生去野外踏生和看话剧等等,高中时很多孩子成了童子军的指导员,陪练了组织能力和野外生存能力,后来还经常自己去野外徒步旅行。学校也经常组织学生去各大学府参观,还让学生去参加各式各样的讲座,心理学、天文学、前沿物理和生化等都包括在内。还有就是各种各样的电脑编程课啊,体育课啊,舞蹈课啊等等,这么丰富的活动能让学生了解很多之前不知道的方向和学科。对了,关于大屠杀这个让以色列人特别团结的话题,有很多年轻人会去陪伴这些大多很孤独的老人,照顾他们的生活,如果可能的话记录下他们的故事,以保留一份真相,也是一个很好的成长的经历,志愿者很多都是高中和大学的学生。
5)鼓励:从小老师的教育都是鼓励大家多说说自己的想法,政治课不是单纯地背课本而是讨论实事的辩论课,经常会出现意见不合拍桌子大吼等情况,但很好地体现了以色列内部的矛盾,让学生更加认识到民主国家的好坏还有以色列这个国家内部的不合和需要修改的地方。以色列课堂上很鼓励提问,气氛一般也很轻松,尤其是这边学生比老师地位高,老师没什么权威,所以学生高中毕业后厌学的比较少(虽然课堂有时也很乱)。还有就是老师经常会提到很多有趣的主题,稍加解释后就让我们自己去查询,感兴趣人自然会去深入了解,这样学生会渐渐形成自己钻研的精神,主动学习总要比被动学习好很多。
6)重视阅读:高中的图书馆专业书比较少,反倒是各种小说、名著、杂志、历史书和心理学书籍等等东西,大学的话主要是专业书,但是特别丰富,学生有足够的时间来研究自己喜欢的课题,在以色列逃课的大学生是少数,大多数人对自己选择的专业内容还是很感兴趣的。这边报考大学不是估分质,想考上好大学还是相当简单的,而且高中有好多避免高考不成功就落榜的设置。只要成绩够了,就可以报自己喜欢的专业。这边很多书籍很有趣,比如说有好多数学书都编写的很轻松,有很多很有趣的问题和对数学的各个方面的深入却不枯燥的介绍,物理也如是。以色列文学课一般不会要求背课文等等,学的文学作品和国内相比也要少很多,但是剖析的比较深入,一部作品可以花1个多月的时间来讲解。
7)重视课外知识:其实人文里说了很多,但是还是不够全面,我们每年都能听到很多的演讲,囊括了生活和学术等广泛的方面,所以学生对于很多处于社会或学术前沿的话题都不陌生。

应该还有很多东西,只是打字打累了…不太会总结,不好意思…>_< 说白了就是兴趣最重要,一切的教育都应该更重质一点,最重要的还是让听者觉得很有趣,然后自发地去探讨这个方面的问题。

— 完 —

下载知乎 iPhone 客户端:http://zhi.hu/ios
from 知乎每日精选: http://www.zhihu.com/question/19707668/answer/15841554

Written by cwyalpha

十二月 31, 2012 at 12:24 下午

发表在 Uncategorized

Thought this was cool: 解读Cardinality Estimation算法(第二部分:Linear Counting)

leave a comment »

上一篇文章中,我们知道传统的精确基数计数算法在数据量大时会存在一定瓶颈,瓶颈主要来自于数据结构合并和内存使用两个方面。因此出现了很多基数估计的概率算法,这些算法虽然计算出的结果不是精确的,但误差可控,重要的是这些算法所使用的数据结构易于合并,同时比传统方法大大节省内存。

在这一篇文章中,我们讨论Linear Counting算法。

简介

Linear Counting(以下简称LC)在1990年的一篇论文“A linear-time probabilistic counting algorithm for database applications”中被提出。作为一个早期的基数估计算法,LC在空间复杂度方面并不算优秀,实际上LC的空间复杂度与上文中简单bitmap方法是一样的(但是有个常数项级别的降低),都是O(N_{max}),因此目前很少单独使用LC。不过作为Adaptive Counting等算法的基础,研究一下LC还是比较有价值的。

基本算法

思路

LC的基本思路是:设有一哈希函数H,其哈希结果空间有m个值(最小值0,最大值m-1),并且哈希结果服从均匀分布。使用一个长度为m的bitmap,每个bit为一个桶,均初始化为0,设一个集合的基数为n,此集合所有元素通过H哈希到bitmap中,如果某一个元素被哈希到第k个比特并且第k个比特为0,则将其置为1。当集合所有元素哈希完成后,设bitmap中还有u个bit为0。则:

\hat{n}=-mln\frac{u}{m}

为n的一个估计,且为最大似然估计(MLE)。

示意图如下:

image

推导及证明

(对数学推导不感兴趣的读者可以跳过本节)

由上文对H的定义已知n个不同元素的哈希值服从独立均匀分布。设A_j为事件“经过n个不同元素哈希后,第j个桶值为0”,则:

P(A_j)=(1-\frac{1}{m})^n

又每个桶是独立的,则u的期望为:

E(u)=\sum_{j=1}^mP(A_j)=m(1-\frac{1}{m})^n=m((1+\frac{1}{-m})^{-m})^{-n/m}

当n和m趋于无穷大时,其值约为me^{-n/m}

令:

E(u)=me^{-n/m}

得:

n=-mln\frac{E(u)}{m}

显然每个桶的值服从参数相同0-1分布,因此u服从二项分布。由概率论知识可知,当n很大时,可以用正态分布逼近二项分布,因此可以认为当n和m趋于无穷大时u渐进服从正态分布。

因此u的概率密度函数为:

f(x) = {1 \over \sigma\sqrt{2\pi} }\,e^{- {{(x-\mu )^2 \over 2\sigma^2}}}

由于我们观察到的空桶数u是从正态分布中随机抽取的一个样本,因此它就是\mu的最大似然估计(正态分布的期望的最大似然估计是样本均值)。

又由如下定理:

f(x)的最大似然估计。

-mln\frac{x}{m}的最大似然估计。

偏差分析

下面不加证明给出如下结论:

Bias(\frac{\hat{n}}{n})=E(\frac{\hat{n}}{n})-1=\frac{e^t-t-1}{2n}

StdError(\frac{\hat{n}}{n})=\frac{\sqrt{m}(e^t-t-1)^{1/2}}{n}

其中t=n/m

以上结论的推导在“A linear-time probabilistic counting algorithm for database applications”可以找到。

算法应用

在应用LC算法时,主要需要考虑的是bitmap长度m的选择。这个选择主要受两个因素的影响:基数n的量级以及容许的误差。这里假设估计基数n的量级大约为N,允许的误差为\epsilon,则m的选择需要遵循如下约数。

误差控制

这里以标准差作为误差。由上面标准差公式可以推出,当基数的量级为N,容许误差为\epsilon是,有如下限制:

m > \frac{e^t-t-1}{(\epsilon t)^2}

将量级和容许误差带入上式,就可以得出m的最小值。

满桶控制

由LC的描述可以看到,如果m比n小太多,则很有可能所有桶都被哈希到了,此时u的值为0,LC的估计公式就不起作用了(变成无穷大)。因此m的选择除了要满足上面误差控制的需求外,还要保证满桶的概率非常小。

上面已经说过,u满足二项分布,而当n非常大,p非常小时,可以用泊松分布近似逼近二项分布。因此这里我们可以认为u服从泊松分布(注意,上面我们说u也可以近似服从正态分布,这并不矛盾,实际上泊松分布和正态分布分别是二项分布的离散型和连续型概率逼近,且泊松分布以正态分布为极限):

当n、m趋于无穷大时:

Pr(u=k)=(\frac{\lambda^k}{k!})e^{-\lambda}

因此:

Pr(u=0)<e^{-5}=0.007

由于泊松分布的方差为\lambda的标准差就可以保证满桶的概率不大约0.7%。因此可得:

m > 5(e^t-t-1)

综上所述,当基数量级为N,可接受误差为\epsilon,则m的选取应该遵从

m > \beta (e^t-t-1)

其中\beta = max(5, 1/(\epsilon t)^2)

下图是论文作者预先计算出的关于不同基数量级和误差情况下,m的选择表:

image

可以看出精度要求越高,则bitmap的长度越大。随着m和n的增大,m大约为n的十分之一。因此LC所需要的空间只有传统的bitmap直接映射方法的1/10,但是从渐进复杂性的角度看,空间复杂度仍为O(N_{max})

合并

LC非常方便于合并,合并方案与传统bitmap映射方法无异,都是通过按位或的方式。

小结

这篇文章主要介绍了Linear Counting。LC算法虽然由于空间复杂度不够理想已经很少被单独使用,但是由于其在元素数量较少时表现非常优秀,因此常被用于弥补LogLog Counting在元素较少时误差较大的缺陷,实际上LC及其思想是组成HyperLogLog Counting和Adaptive Counting的一部分。

在下一篇文章中,我会介绍空间复杂度仅有O(log(log(N_{max})))的基数估计算法LogLog Counting。


0

   

0

IT 牛人博客聚合网站(udpwork.com) 聚合
|
评论: 0
|
10000+ 本编程/Linux PDF/CHM 电子书下载

from IT牛人博客聚合网站: http://www.udpwork.com/item/8950.html

Written by cwyalpha

十二月 31, 2012 at 12:24 下午

发表在 Uncategorized

Thought this was cool: Simon: Open-Source Speech Recognition: Simon 0.4.0

leave a comment »

Comments: “Simon: Open-Source Speech Recognition: Simon 0.4.0”

URL: http://simon-listens.blogspot.com/2012/12/simon-040.html

After years of hard work, the Simon team is proud to announce the new major release: Simon 0.4.0.

New in Simon 0.4

This new version of the open source speech recognition system Simon features a whole new recognition layer, context-awareness for improved accuracy and performance, a dialog system able to hold whole conversations with the user and more.

Revisiting Usability

A lot of work has gone into making Simon easier to use – both for existing and new users.

Perhaps most visibly, the main window of Simon has been reorganized to bring the most important options together in one screen.

Moreover, the newly introduced Simon base model format (.sbm) and the integration of a GHNS online repository of base models have removed the last big hurdle of the initial configuration.
One can now easily go from a fresh installation to a working setup in less than 5 minutes without any preparation. Don’t believe me? Check out the quick start below!


Simon 0.4.0: Quick Start

Many other, smaller changes sum up to one simple but important difference: Simon will overall require less user interaction while achieving more.

SPHINX

One of the major internal changes of Simon 0.4 is of course the included support for the BSD licensed CMU SPHINX. While we still also maintain full support for HTK and Julius, new models compiled with Simon will default to the SPHINX backend and the (proprietary) HTK is no longer required to build user-generated models.
Best of all: Simon will select the correct backend for your configuration transparently and automatically.

Voxforge

A major problem of open source speech recognition has always been the lack of freely available high quality speech models.

The Voxforge project has been working for years towards GPL acoustic models for a variety of languages. While their models are certainly not yet perfect, they offer a promising starting point.
The English Voxforge model is of course available as a Simon base model and can be downloaded and imported with Simon.

Additionally, starting with Simon 0.4, users will also have the option to contribute their gathered Simon training samples directly to the Voxforge server.
These recordings will then be used to train and improve the general acoustic models.

Context

There is a simple rule of thumb in speech recognition: The smaller the application domain, the better the recognition accuracy. This was always one of the core principles of Simon.
In Simon 0.4, however, we went one step further: Simon can now re-configure itself on-the-fly as the current situation changes. Through so called “context conditions” Simon 0.4 can automatically activate and deactivate selected scenarios, microphones and even parts of your training corpus.

For example: Why listen for “Close tab” when your browser isn’t even open? Or why listen for anything at all when you’re actually in the next room listening to music? Yes, Simon is watching you.

Dialog System

Simon 0.4.0 also ships with the new dialog system featuring scripted variables (Javascript), integration with Plasma data engines, a templating system and – of course – text-to-speech output.

Simonoid

For users of KDE’s plasma workspace, we now provide the “Simonoid” plasmoid to start and monitor Simon – including the current recording volume.

The screenshot above shows two instances of the plasmoid: One added to the panel and another one to the desktop.

… and everything else

Please don’t be foold to think that the above is a complete list of all improvements. For example, we also have a new sample review tool called Afaras, integration with the Sequitur grapheme to phoneme framework, an Akonadi command plugin and many, many other noteworthy changes.
You’ll have to try out Simon to see for yourself!

Download

To install Simon 0.4.0, you can either compile the official source tarball, install a binary package provider by our Linux distribution or use the installer for Windows.

If you are a packager and would like to package Simon 0.4, please do get in touch with us. Thank you.


from Hacker News 50: http://simon-listens.blogspot.com/2012/12/simon-040.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+hacker-news-feed-50+%28Hacker+News+50%29

Written by cwyalpha

十二月 31, 2012 at 11:23 上午

发表在 Uncategorized