CWYAlpha


Thought this was cool: Fixing garbled Chinese text in BeautifulSoup



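import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

# Fetch the page; urlopen returns a file-like object the parser can read.
page = urllib2.urlopen('http://www.leeon.me')
# Suggest gb18030 as the encoding to try first when decoding the page.
soup = BeautifulSoup(page, fromEncoding="gb18030")

print soup.originalEncoding  # the encoding BeautifulSoup actually used
print soup.prettify()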

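If a Chinese page is encoded as gb2312 or gbk, passing fromEncoding="gb18030" to the BeautifulSoup constructor fixes the garbled output, because gb18030 is a superset of both encodings. The original post adds that even a UTF-8 page parsed with gb18030 will not come out garbled. fromEncoding is only the first encoding BeautifulSoup tries, so a page that fails to decode as gb18030 falls through to the other candidates; the originalEncoding printed above shows which one was used.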
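To see why gb18030 is a safe choice for gb2312 and gbk pages, here is a minimal sketch that uses only Python's built-in codecs and no BeautifulSoup at all. The helper name decode_as_gb18030 and the sample string are made up for illustration; they are not part of BeautifulSoup or of the original post.

```python
# -*- coding: utf-8 -*-
# Sketch: bytes written as gb2312 or gbk decode cleanly as gb18030.

def decode_as_gb18030(raw_bytes):
    """Hypothetical helper: decode bytes using the gb18030 codec."""
    return raw_bytes.decode("gb18030")

sample = u"中文乱码"  # illustrative sample text, not from a real page

# Simulate the raw bytes of pages served in the two legacy encodings.
as_gb2312 = sample.encode("gb2312")
as_gbk = sample.encode("gbk")

# Both byte strings should come back as the original text.
round_trip_ok = (
    decode_as_gb18030(as_gb2312) == sample
    and decode_as_gb18030(as_gbk) == sample
)
print(round_trip_ok)
```

This only checks the superset relationship for a handful of common characters. It says nothing about how BeautifulSoup handles a page that is not in a GB encoding, so checking originalEncoding on real pages is still worthwhile.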

Reposted from: 《beautifulsoup解析中文网页乱码》 ("Garbled text when BeautifulSoup parses Chinese web pages")
from 四号程序员: http://www.coder4.com/archives/3621

Written by cwyalpha

September 13, 2012 at 3:26 PM

Posted in Uncategorized
