CWYAlpha


Thought this was cool: Beautiful Data - Applications of Statistical Language Models (Part 3): Word Segmentation 7



  At this point, our program for segmenting text with Google's unigram language model is essentially complete. Let's first look at the finished segment.py:

import operator

def segment( text ):
        """Return a list of words that is the best segmentation of text."""
        if not text: return []
        candidates = ( [first] + segment( rem ) for first, rem in splits( text ) )
        return max( candidates, key = Pwords )

def splits( text, L = 20 ):
        """Return a list of all possible ( first, rem ) pairs, len( first ) <= L."""
        return [ ( text[:i+1], text[i+1:] ) for i in range( min( len( text ), L ) ) ]

def Pwords( words ):
        """The Naive Bayes probability of a sequence of words."""
        return product( Pw( w ) for w in words )

def product( nums ):
        """Return the product of a sequence of numbers."""
        return reduce( operator.mul, nums, 1 )

class Pdist( dict ):
        """A probability distribution estimated from counts in datafile."""
        def __init__( self, data, N = None, missingfn = None ):
                for key, count in data:
                        self[key] = self.get( key, 0 ) + int( count )
                self.N = float( N or sum( self.itervalues() ) )
                self.missingfn = missingfn or ( lambda k, N: 1. / N )

        def __call__( self, key ):
                if key in self: return self[key] / self.N
                else: return self.missingfn( key, self.N )

def datafile( name, sep = '\t' ):
        """Read key, value pairs from file."""
        for line in file( name ):
                yield line.split( sep )

def avoid_long_words( word, N ):
        """Estimate the probability of an unknown word."""
        return 10. / ( N * 10 ** len( word ) )

N = 1024908267229       ## Number of tokens in corpus

Pw = Pdist( datafile( 'count_1w.txt' ), N, avoid_long_words )

  The Pwords and segment functions have not been tested yet, so let's try them in the Python interpreter. First test the first three words in count_1w.txt: the, of, and, whose counts are:
  Count(the) = 23135851162
  Count(of) = 13151942776
  Count(and) = 12997637966

  >>> segment.N
  1024908267229L
  >>> segment.Pw( "the" )
  0.022573582340740989
  >>> 23135851162. / segment.N
  0.022573582340740989
  >>> segment.Pw( "of" )
  0.012832312116632971
  >>> 13151942776. / segment.N
  0.012832312116632971
  >>> segment.Pw( "and" )
  0.012681757364628494
  >>> 12997637966. / segment.N
  0.012681757364628494
  Now test a made-up out-of-vocabulary word, "theofand":
  >>> segment.Pw( "theofand" )
  9.7569707648437317e-20
  >>> 10. / ( segment.N * 10 ** len( "theofand" ) )
  9.7569707648437317e-20
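
  This length penalty is exactly what stops the segmenter from treating a whole input as one unknown "word": the longer an unseen string is, the smaller its estimated probability, so a segmentation into known words easily wins. A minimal sketch of the comparison (hypothetical, assuming the segment.py above is importable):

import segment

# The 25-letter unknown string is scored 10. / ( N * 10 ** 25 ) by
# avoid_long_words, far below the product of three common-word probabilities.
whole = segment.Pw( "naturallanguageprocessing" )
split = segment.Pwords( [ "natural", "language", "processing" ] )
print whole < split             # expected: True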

  Next comes the Pwords function, which returns "product( Pw(w) for w in words )":
  >>> segment.Pwords( [ "the", "of", "and" ] )
  3.6735405611059254e-06
  which is the same as:
  >>> segment.Pw( "the" ) * segment.Pw( "of" ) * segment.Pw( "and" )
  3.6735405611059254e-06
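
  One caveat: multiplying many per-word probabilities like this eventually underflows to 0.0 on long inputs. A common remedy, sketched here as a hypothetical variant added inside segment.py (it is not part of the code above), is to sum log probabilities instead; since log is monotonic, max would pick the same candidate:

import math

def log_Pwords( words ):
        """Sum of log10 word probabilities; avoids underflow on long sequences."""
        return sum( math.log10( Pw( w ) ) for w in words )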

  Finally we come to the segment function itself. First, try the two examples from the first post in this series:
  >>> segment.segment( "choosespain" )
  ['choose', 'spain']
  >>> segment.segment( "insufficientnumbers" )
  ['insufficient', 'numbers']
  Not bad, both segmented correctly. Next, try the English words for "natural language processing":
  >>> segment.segment( "naturallanguageprocessing" )
  …
  This stalls for a long time; look at your machine's CPUs and one of them will be running flat out at 100%. Still, the correct segmentation does come back in the end:
  ['natural', 'language', 'processing']
  If you like, you can keep going, for example with "I love natural language processing":
  >>> segment.segment( "Ilovenaturallanguageprocessing" )
  …
  This time I simply couldn't wait for it to finish. So where is the problem? Let's look at the code of the segment function again:

def segment( text ):
        """Return a list of words that is the best segmentation of text."""
        if not text: return []
        candidates = ( [first] + segment( rem ) for first, rem in splits( text ) )
        return max( candidates, key = Pwords )

  Regarding "max( candidates, key = Pwords )", here is the explanation that reader navygong kindly left in the comments on the previous post:

candidates is actually a generator. The two lines of code you mention compute the probability of each candidate segmentation and take the most probable one. For example, the possible segmentations of "wheninthecourse" are
['w', 'henin', 'the', 'course']
['wh', 'en', 'in', 'the', 'course']
['whe', 'n', 'in', 'the', 'course']
…
['wheninthecour', 'se']
['wheninthecours', 'e']
['wheninthecourse']
Take ['wh', 'en', 'in', 'the', 'course'] as an example: applying Pwords to this list gives the product of the probabilities of its words. The max function then picks out the candidate segmentation whose product is largest.
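
  So why does "Ilovenaturallanguageprocessing" take so long? segment recurses on every suffix returned by splits without caching anything, so the same suffix gets re-segmented again and again. Counting the recursive calls makes the blowup concrete; the following is a hypothetical instrumentation sketch (num_calls is not part of segment.py) that evaluates the recurrence T(n) = 1 + sum of T(n-i) for i from 1 to min(n, 20):

def num_calls( n, L = 20 ):
        """Number of segment() invocations on an n-character string, assuming
        one recursive call per ( first, rem ) pair with len( first ) <= L."""
        T = [ 1 ]                       # T[0]: the call on the empty string
        for m in range( 1, n + 1 ):
                T.append( 1 + sum( T[m - i] for i in range( 1, min( m, L ) + 1 ) ) )
        return T[n]

# T(n) equals 2 ** n exactly for n <= 20 and stays close to it beyond:
# "naturallanguageprocessing" (25 chars) costs about 2 ** 25, some 33 million
# calls, while "Ilovenaturallanguageprocessing" (30 chars) costs about
# 2 ** 30, roughly a billion -- hence the wait. Caching the result for each
# suffix (memoization) would collapse this to one call per distinct suffix.

  That cure, presumably, is where the series goes next.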

To be continued: Word Segmentation 8

Note: this is an original article; when reposting, please credit the source, 我爱自然语言处理: www.52nlp.cn

Permalink: http://www.52nlp.cn/beautiful-data-统计语言模型的应用三分词7

Related articles:

  1. Beautiful Data - Applications of Statistical Language Models (Part 3): Word Segmentation 3
  2. Beautiful Data - Applications of Statistical Language Models (Part 3): Word Segmentation 5
  3. Beautiful Data - Applications of Statistical Language Models (Part 3): Word Segmentation 6
  4. Beautiful Data - Applications of Statistical Language Models (Part 3): Word Segmentation 4
  5. Beautiful Data - Applications of Statistical Language Models (Part 3): Word Segmentation 8
  6. Beautiful Data - Applications of Statistical Language Models (Part 3): Word Segmentation 2
  7. Beautiful Data - Applications of Statistical Language Models (Part 3): Word Segmentation 1
  8. Beautiful Data - Applications of Statistical Language Models (Part 2): Background
  9. Beautiful Data - Applications of Statistical Language Models (Part 1): The Origins
  10. MIT Natural Language Processing, Lecture 3: Probabilistic Language Models (Part 5)

From 我爱自然语言处理: http://www.52nlp.cn/beautiful-data-%e7%bb%9f%e8%ae%a1%e8%af%ad%e8%a8%80%e6%a8%a1%e5%9e%8b%e7%9a%84%e5%ba%94%e7%94%a8%e4%b8%89%e5%88%86%e8%af%8d7

Written by cwyalpha

March 17, 2012 at 3:38 pm

Posted in Uncategorized
