CWYAlpha


Thought this was cool: How to mimic browser behavior with Python



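When scraping information, you often run into the following problems:

0. You access the site too frequently and are denied access;
1. The site decides you are accessing it with a program and denies access;
2. You open too many connections at once and are denied access.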

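The following measures address these problems:

1. Make your network connections present the characteristic attributes of a browser, such as the request headers a browser sends;

2. When accessing the same site, use a connection pool;
A connection pool keeps the connections you have already created in a local list. When a connection is needed, one is picked at random from the pool. This reduces both how often connections are opened and how many are open at once.

3. When a request times out, retry it several times.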
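As an illustration of the first point, here is a minimal sketch of attaching browser-like headers to a request. It uses Python 3's `urllib.request` (the successor of `urllib2`), whose default `User-Agent` is `Python-urllib/x.y` and therefore easy for a site to spot. The names `BROWSER_HEADERS` and `build_browser_request` and the header values are assumptions made for this example, not something from the original post.

```python
import urllib.request

# Hypothetical header values for illustration; copy real ones from
# your own browser's developer tools.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_browser_request(url, extra_headers=None):
    """Return a Request carrying browser-like headers (no network access here)."""
    headers = dict(BROWSER_HEADERS)  # copy so the shared defaults stay unchanged
    if extra_headers:
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen` or to an opener's `open` method.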

Building the connection pool and retry:

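# Python 2 (urllib2, urlparse, cookielib)
import cookielib
import logging
import random
import urllib2
import urlparse

logger = logging.getLogger(__name__)

# Assumed definitions: the original post uses these names without showing them.
http_connection_pool = {}  # hostname -> list of openers
support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())  # assumed: a cookie handler
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20100101 Firefox/16.0'}  # assumed values

def retrieve_url(url, retries=3):
    hostname = urlparse.urlsplit(url)[1]
    if hostname in http_connection_pool:
        connections = http_connection_pool[hostname]
        if len(connections) < 5:
            # Fewer than 5 openers for this host: create another one.
            opener = urllib2.build_opener(support, urllib2.HTTPHandler)
            connections.append(opener)
        else:
            # Pool is full: reuse a randomly chosen opener.
            opener = random.choice(connections)
    else:
        # First request to this host: start its pool.
        opener = urllib2.build_opener(support, urllib2.HTTPHandler)
        http_connection_pool[hostname] = [opener]
    try:
        req = urllib2.Request(url=url, headers=headers)
        return opener.open(req).read()
    except Exception:
        if retries > 0:
            return retrieve_url(url, retries - 1)
        else:
            logger.exception("Can't retrieve content from url: " + url)
            return None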
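For readers on Python 3, where `urllib2` was merged into `urllib.request`, the same idea might look like the sketch below. It is only one possible rendering, not the original author's code: the names `opener_pool`, `get_opener`, `MAX_OPENERS_PER_HOST`, `delay`, and `make_opener` are assumptions, and the growing pause between retries is an addition aimed at the "accessing too frequently" problem listed at the top.

```python
import random
import time
import urllib.request
from urllib.parse import urlsplit

MAX_OPENERS_PER_HOST = 5  # same cap as the code above
opener_pool = {}          # hostname -> list of openers


def get_opener(url, make_opener=urllib.request.build_opener):
    """Return an opener for the URL's host, creating new ones until the cap is hit."""
    host = urlsplit(url).netloc
    openers = opener_pool.setdefault(host, [])
    if len(openers) < MAX_OPENERS_PER_HOST:
        opener = make_opener()
        openers.append(opener)
        return opener
    return random.choice(openers)  # pool is full: reuse one at random


def retrieve_url(url, retries=3, delay=1.0, make_opener=urllib.request.build_opener):
    """Fetch url; on failure retry up to `retries` more times, pausing longer each time."""
    for attempt in range(retries + 1):
        opener = get_opener(url, make_opener)
        try:
            with opener.open(url, timeout=10) as resp:
                return resp.read()
        except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
            if attempt == retries:
                return None
            time.sleep(delay * (2 ** attempt))  # pause doubles after each failure
```

The `make_opener` parameter exists only so that a stand-in opener can be injected for testing without touching the network; in normal use the default `urllib.request.build_opener` applies.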



from duyamin’s blog: http://www.duyamin.com/2012/10/python-connection-pool.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+DuyaminsBlog+%28duyamin%E2%80%99s+blog%29

Written by cwyalpha

November 24, 2012 at 3:26 PM

Posted in Uncategorized
