
Writing a Web Crawler in Python from Scratch, Part 1: Your First Web Crawler

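This article starts from the simplest possible crawler and improves it step by step by adding download error detection, a user agent, and proxy support.
First, a note on how to use the code: it targets Python 2.7 and can be run either from the command line or in PyCharm. Each step defines a function, and calling that function fetches a web page.
For example:

        download1('http://www.baidu.com')

        download2('http://www.baidu.com')

        download3('http://www.baidu.com')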




1. The simplest web crawler, in three lines of code

import urllib2
import urlparse


def download1(url):
    """Simple downloader"""
    return urllib2.urlopen(url).read()

2. An upgrade: a crawler that handles download errors

def download2(url):
    """Download function that catches errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
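
For readers on Python 3, where urllib2 no longer exists, the same error handling can be sketched with urllib.request and urllib.error. This is a minimal sketch rather than the author's code, and the function name download_py3 is a placeholder introduced here.

```python
import urllib.error
import urllib.request


def download_py3(url):
    """Return the response body as bytes, or None if the download fails."""
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        # HTTPError subclasses URLError, so HTTP status errors are caught here too.
        print('Download error:', e.reason)
        html = None
    return html
```

Note that read() returns bytes in Python 3, so the result usually needs to be decoded before it is handled as text.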
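3. 5XX errors usually originate on the server side, so add a check to the crawler: when the error code is at least 500 and below 600, retry the download, up to 2 more times.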

def download3(url, num_retries=2):
    """Download function that also retries 5XX errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download3(url, num_retries-1)
    return html
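
The retry behaviour can be checked without touching the network. The sketch below is written for Python 3 (an assumption, since the article's code is Python 2.7) and adds a hypothetical urlopen parameter so that a stand-in function can simulate a server that always answers 503. The names download_with_retries and always_503, and the example.com URL, are illustrative only.

```python
import urllib.error
import urllib.request


def download_with_retries(url, num_retries=2, urlopen=urllib.request.urlopen):
    """Download `url`, retrying on 5XX responses up to `num_retries` times."""
    try:
        return urlopen(url).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        # Only HTTPError carries a status code; a plain URLError does not.
        code = getattr(e, 'code', None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download_with_retries(url, num_retries - 1, urlopen)
        return None


attempts = []


def always_503(url):
    """Stand-in for urlopen that always fails with HTTP 503 (no network needed)."""
    attempts.append(url)
    raise urllib.error.HTTPError(url, 503, 'Service Unavailable', {}, None)


result = download_with_retries('http://example.com', num_retries=2, urlopen=always_503)
print(result, len(attempts))  # None 3 (the first try plus 2 retries)
```

A 4XX error such as 404 is not retried, because repeating a request the server has rejected as invalid is unlikely to succeed.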

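4. Setting the user agent
By default, urllib2 identifies the crawler with its own user agent (Python-urllib/2.7), which some websites block. Here we set a user agent named "wswp" instead.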

def download4(url, user_agent='wswp', num_retries=2):
    """Download function that includes user agent support"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download4(url, user_agent, num_retries-1)
    return html
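
To see what the header change does, the request object can be inspected before anything is sent. This is a small Python 3 sketch (urllib.request in place of urllib2); build_request is a hypothetical helper and example.com is a placeholder URL.

```python
import urllib.request


def build_request(url, user_agent='wswp'):
    """Return a Request for `url` that carries a custom User-agent header."""
    return urllib.request.Request(url, headers={'User-agent': user_agent})


request = build_request('http://example.com')
# Request capitalizes header names, so the lookup key is 'User-agent'.
print(request.get_header('User-agent'))  # wswp

# For comparison: the identity an unmodified opener sends ("Python-urllib/<version>").
default_agent = dict(urllib.request.build_opener().addheaders)['User-agent']
print(default_agent)
```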

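5. Proxy support
Sometimes a website has to be accessed through a proxy. Netflix, for example, has blocked most countries outside the United States. The function below adds proxy support with urllib2's ProxyHandler (the third-party requests module also supports proxies).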

import urllib2
import urlparse

def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download5(url, user_agent, proxy, num_retries-1)
    return html
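
One detail of the function above is that the proxy is registered only for the scheme of the URL being downloaded. The following Python 3 sketch isolates that step so it can be inspected without a real proxy. The helper name build_proxy_opener and the proxy address 127.0.0.1:8080 are placeholders, not values from the original article.

```python
import urllib.parse
import urllib.request


def build_proxy_opener(url, proxy=None):
    """Return an opener that routes requests for `url`'s scheme through `proxy`."""
    opener = urllib.request.build_opener()
    if proxy:
        # Map only the scheme of this URL (e.g. 'http') to the proxy address.
        scheme = urllib.parse.urlparse(url).scheme
        opener.add_handler(urllib.request.ProxyHandler({scheme: proxy}))
    return opener


opener = build_proxy_opener('http://example.com', proxy='http://127.0.0.1:8080')
# Collect the proxy mappings of every ProxyHandler installed on the opener.
proxy_maps = [h.proxies for h in opener.handlers
              if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_maps)
```

With this mapping, an https:// URL opened through the same opener would not use the proxy, since only the 'http' key was registered.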


Original article: https://www.cnblogs.com/mrruning/p/7638377.html