use proxy in spider
http://love-python.blogspot.com/2008/03/use-proxy-in-your-spider.html
Using a proxy, you can minimize the chance of your crawlers/spiders getting blocked. Let me show you how to use proxy IP addresses in your Python spider. First, load the list of proxies from a file:
# Load the proxy list; strip trailing newlines so each entry is a clean host:port string
fileproxylist = open('proxylist.txt', 'r')
proxyList = [line.strip() for line in fileproxylist.readlines()]
fileproxylist.close()
indexproxy = 0
totalproxy = len(proxyList)
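For reference, proxylist.txt is assumed to hold one proxy per line in host:port form. A minimal example file (the addresses below are just placeholders) could look like this:

# proxylist.txt - one proxy per line (placeholder addresses)
203.0.113.10:8080
203.0.113.11:3128
198.51.100.7:80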
Now, for each proxy in the list, call the following function:
import urllib2

def get_source_html_proxy(url, pip):
    # Route HTTP requests through the given proxy (pip is a 'host:port' string)
    proxy_handler = urllib2.ProxyHandler({'http': pip})
    opener = urllib2.build_opener(proxy_handler)
    # Send a browser-like User-Agent header so the request looks less like a bot
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib2.install_opener(opener)
    req = urllib2.Request(url)
    sock = urllib2.urlopen(req)
    data = sock.read()
    sock.close()
    return data
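The function above fetches a single URL through a single proxy. As a rough sketch, you could rotate through the loaded proxies like this, moving to the next entry whenever a request fails (fetch_with_rotation is just an illustrative helper name, not part of the code above):

def fetch_with_rotation(url):
    # Try each proxy in turn, starting from the current index, until one succeeds
    global indexproxy
    for _ in range(totalproxy):
        pip = proxyList[indexproxy]
        indexproxy = (indexproxy + 1) % totalproxy
        try:
            return get_source_html_proxy(url, pip)
        except urllib2.URLError:
            # This proxy failed or timed out; move on to the next one
            continue
    return None  # all proxies failed

html = fetch_with_rotation('http://example.com/')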
Hope your spidering experience will be better with proxies :-)
Let me know if you have any alternative ideas.