CWYAlpha


Thought this was cool: Python embraces lxml



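I had been using BeautifulSoup all along. It is easy to pick up, but it often runs into strange problems that are hard to find a fix for.

lxml.html is a lower-level HTML parser, and it is far faster than BeautifulSoup.

Website: http://lxml.de/lxmlhtml.html

Parsing a page: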

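# Download the page and decode it to text
import urllib.request
import lxml.html
page = urllib.request.urlopen("http://www.coder4.com").read().decode("utf-8")
# Parse it with lxml.html
doc = lxml.html.document_fromstring(page)
# Get all of the text
print(doc.text_content())
# Get an element by ID (raises KeyError if no element has that ID)
elem = doc.get_element_by_id("top")
# Get the tag name, e.g. div
print(elem.tag)
# Use XPath to return every div
for tag in doc.xpath('//div'):
    print(tag.text_content())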
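
The same calls can be tried without a network connection. The following is a minimal sketch, assuming Python 3 and an installed lxml; the `SAMPLE` markup and the ID `no-such-id` are made up for illustration and are not from the original post. It also shows the difference between `xpath()`, which takes full XPath, and `findall()`, which takes the simpler ElementPath syntax.

```python
import lxml.html

# Hypothetical sample markup, standing in for a downloaded page.
SAMPLE = """
<html><body>
  <div id="top">Header</div>
  <div class="post"><p>First <b>post</b></p></div>
</body></html>
"""

doc = lxml.html.document_fromstring(SAMPLE)

# get_element_by_id raises KeyError when the ID is missing,
# unless a default is passed as the second argument.
top = doc.get_element_by_id("top")
missing = doc.get_element_by_id("no-such-id", None)

# xpath() accepts full XPath; findall() accepts ElementPath,
# where './/div' means "every div below this element".
divs_xpath = doc.xpath("//div")
divs_findall = doc.findall(".//div")

# text_content() joins the text of an element and all of its children.
div_texts = [d.text_content().strip() for d in divs_xpath]
print(top.tag, missing, div_texts)
```

Passing a default to `get_element_by_id` is one way to avoid wrapping every lookup in a try/except when a page may not contain the expected ID.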

Some more advanced features, such as cleaning; see the documentation for the specific parameters, http://lxml.de/api/lxml.html.clean.Cleaner-class.html:

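# Clean HTML
# (newer lxml releases ship this module as the separate lxml_html_clean package)
from lxml.html.clean import Cleaner
# Illustrative input; in practice this is the page markup as a string
html = '<div><script>alert(1)</script><p>Hello</p></div>'
cleaner = Cleaner(page_structure=False, links=False)
print(cleaner.clean_html(html))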
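
To see what the cleaner actually removes, here is a small sketch under stated assumptions: Python 3, and an lxml installation where `lxml.html.clean` is importable (recent lxml versions need the separate `lxml_html_clean` package for that). The `dirty` string is hypothetical markup invented for this illustration.

```python
from lxml.html.clean import Cleaner

# Hypothetical untrusted markup used only for this illustration.
dirty = '<div><script>alert(1)</script><p onclick="run()">Hello</p></div>'

# Same options as above: keep page structure tags and <link> tags,
# and leave the remaining Cleaner options at their defaults.
cleaner = Cleaner(page_structure=False, links=False)

# Given a string, clean_html returns a cleaned string.
cleaned = cleaner.clean_html(dirty)
print(cleaned)
```

With the default options the script element and the inline event handler should be gone while the visible text stays; the exact serialization of the result may vary between versions.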

 

 
from 四号程序员: http://www.coder4.com/archives/3639

Written by cwyalpha

October 16, 2012 at 4:16 PM

Posted in Uncategorized
