  • 0. Web crawling: the urllib library, urlopen() and Request()

    # Note: pay attention to whether you write import urllib.request or from urllib import request
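A minimal sketch of the two import styles mentioned above. Both bind the same module object; only the name you type afterwards differs.

```python
import urllib.request        # then call urllib.request.urlopen(...)
from urllib import request   # then call request.urlopen(...)

# Both names refer to the same module object
same_module = urllib.request is request
print(same_module)
```

The examples in this post mix both styles, so check which one a snippet imports before copying a call from it.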

    0. urlopen()

    Syntax (as documented for Python 3.6, the version used in this post): urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

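    • Example 0: (in practice this function is usually called with just three arguments: url, data, timeout)

    * The data argument must be of type bytes. Use bytes() (or str.encode()) to convert the URL-encoded string into bytes, a type distinct from str that holds raw byte values rather than text.

    * response.read() returns bytes, so the result has to be decoded into str.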

    import socket
    import urllib.error
    import urllib.parse
    import urllib.request
    
    url = 'http://httpbin.org/post'
    # urlencode() returns a str; the request body must be bytes
    data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
    try:
        response = urllib.request.urlopen(url, data=data, timeout=1)
    except urllib.error.URLError as e:
        # e.reason carries the underlying error, e.g. socket.timeout
        if isinstance(e.reason, socket.timeout):
            print('TIME OUT')
        else:
            print('Request failed:', e.reason)
    else:
        print(response.read().decode("utf-8"))

    Output:

    {
      "args": {}, 
      "data": "", 
      "files": {}, 
      "form": {
        "word": "hello"
      }, 
      "headers": {
        "Accept-Encoding": "identity", 
        "Content-Length": "10", 
        "Content-Type": "application/x-www-form-urlencoded", 
        "Host": "httpbin.org", 
        "User-Agent": "Python-urllib/3.6"
      }, 
      "json": null, 
      "origin": "101.206.170.234, 101.206.170.234", 
      "url": "https://httpbin.org/post"
    }
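As an offline check of the two notes above, the sketch below builds the same request body without sending it. The 10 bytes it produces match the Content-Length of 10 in the output above.

```python
import urllib.parse

# urlencode() turns the dict into a str query string
query = urllib.parse.urlencode({'word': 'hello'})

# urlopen() needs bytes for its data argument; these two forms are equivalent
body_a = bytes(query, encoding='utf8')
body_b = query.encode('utf-8')

print(query)        # word=hello
print(body_a)       # b'word=hello'
print(len(body_a))  # 10

# Going the other way, bytes are decoded back into str
text = body_a.decode('utf-8')
```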
    
    • Example 1: inspect the status code, the response headers, and the Server field of the headers
    import urllib.request
    
    response = urllib.request.urlopen('https://www.python.org')
    print(response.status)
    print(response.getheaders())
    print(response.getheader('Server'))
    

    Output:

    200
    [('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '48410'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 09 Apr 2019 02:32:34 GMT'), ('Via', '1.1 varnish'), ('Age', '722'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-hnd18751-HND'), ('X-Cache', 'MISS, HIT'), ('X-Cache-Hits', '0, 1223'), ('X-Timer', 'S1554777154.210361,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
    nginx
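getheaders() returns a list of (name, value) tuples, and a name such as Via can appear more than once, as the output above shows. The sketch below uses a shortened, hand-copied slice of that output (not a live response) to show what happens when the list is converted to a dict.

```python
from collections import defaultdict

# A shortened copy of the getheaders() output shown above
headers = [
    ('Server', 'nginx'),
    ('Content-Type', 'text/html; charset=utf-8'),
    ('Via', '1.1 vegur'),
    ('Via', '1.1 varnish'),
]

# dict() keeps only the last value for a repeated name
flat = dict(headers)

# Grouping into lists keeps every value
grouped = defaultdict(list)
for name, value in headers:
    grouped[name].append(value)

print(flat['Via'])     # 1.1 varnish
print(grouped['Via'])  # ['1.1 vegur', '1.1 varnish']
```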
    

    Calling urlopen() with a bare URL is quite limited: for example, you cannot set request headers. That is why the Request class is needed.

    1. Request()

    • Example 0: (the two approaches below produce the same result)
    import urllib.request
    
    response = urllib.request.urlopen('https://www.python.org')
    print(response.read().decode('utf-8'))
    
    ######################################
    
    import urllib.request
    
    req = urllib.request.Request('https://python.org')
    response = urllib.request.urlopen(req)
    print(response.read().decode('utf-8'))

    The rest of this section shows how to use Request() to send GET and POST requests and to set their parameters.

    • Example 1: (POST request)
    from urllib import request, parse
    
    url = 'http://httpbin.org/post'
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
        'Host': 'httpbin.org'
    }
    # Named form rather than dict, to avoid shadowing the built-in type
    form = {
        'name': 'Germey'
    }
    data = bytes(parse.urlencode(form), encoding='utf8')
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))

    Headers can also be added with the add_header() method to mimic a browser, and the data argument can equally be built as shown below:

    Note: headers can also be changed through build_opener(). It is not covered further here; the approach below is enough for most cases.

    from urllib import request, parse
    
    url = 'http://httpbin.org/post'
    # Named form rather than dict, to avoid shadowing the built-in type
    form = {
        'name': 'Germey'
    }
    data = parse.urlencode(form).encode('utf-8')
    req = request.Request(url=url, data=data, method='POST')
    req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))
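To see what a Request object holds before anything is sent, the following sketch builds the same request as above and inspects it offline. One detail worth knowing: urllib stores header names capitalized, so 'User-Agent' is looked up as 'User-agent'.

```python
from urllib import request, parse

data = parse.urlencode({'name': 'Germey'}).encode('utf-8')
req = request.Request(url='http://httpbin.org/post', data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

# Nothing has been sent yet; the Request object only holds the settings
print(req.get_method())              # POST
print(req.full_url)                  # http://httpbin.org/post
print(req.data)                      # b'name=Germey'
print(req.get_header('User-agent'))  # the value passed to add_header()

# Without an explicit method, the presence of data decides GET vs POST
implicit_post = request.Request('http://httpbin.org/post', data=data)
implicit_get = request.Request('http://httpbin.org/get')
print(implicit_post.get_method(), implicit_get.get_method())
```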
    • Example 2: (GET request) a Baidu keyword search
    from urllib import request,parse
    
    url = 'http://www.baidu.com/s?wd='
    key = '路飞'
    key_code = parse.quote(key)
    url_all = url + key_code
    """
    # Alternative way to build the URL
    url = 'http://www.baidu.com/s'
    key = '路飞'
    wd = parse.urlencode({'wd':key})
    url_all = url + '?' + wd
    """
    req = request.Request(url_all)
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))
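The two ways of building url_all in the example above yield the same URL for this keyword. A short offline sketch confirms it and also shows one case where quote() and urlencode() differ, namely how they encode a space.

```python
from urllib import parse

key = '路飞'

# Style 1: percent-encode the value only and append it by hand
url_a = 'http://www.baidu.com/s?wd=' + parse.quote(key)

# Style 2: let urlencode() build the whole key=value query string
url_b = 'http://www.baidu.com/s' + '?' + parse.urlencode({'wd': key})

print(url_a)
print(url_a == url_b)

# unquote() reverses the percent-encoding
round_trip = parse.unquote(parse.quote(key))

# The two functions treat spaces differently
space_quote = parse.quote('a b')                # 'a%20b'
space_urlencode = parse.urlencode({'wd': 'a b'})  # 'wd=a+b'
```

For keywords that may contain spaces or other reserved characters, urlencode() is usually the safer choice for query strings, since it encodes the whole key=value pair consistently.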

    At this point, decode and the parse module's quote() and urlencode() functions may raise some questions, so here are a few notes:

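    1. parse.quote: percent-encodes a str (using UTF-8 by default)
    2. parse.urlencode: turns the k: v pairs of a dict into a k=v query string with the values percent-encoded
    3. parse.unquote: turns percent-encoded data back into the original text
    4. decode: turns bytes into str; decode("utf-8") pairs naturally with read()
    5. encode: turns str into bytes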
    >>> str0 = '我爱你'
    >>> str1 = str0.encode('gb2312')    
    >>> str1 
    b'\xce\xd2\xb0\xae\xc4\xe3'
    >>> str2 = str0.encode('gbk')
    >>> str2
    b'\xce\xd2\xb0\xae\xc4\xe3'
    >>> str3 = str0.encode('utf-8')
    >>> str3
    b'\xe6\x88\x91\xe7\x88\xb1\xe4\xbd\xa0'
    >>> str00 = str1.decode('gb2312')
    >>> str00
    '我爱你'
    >>> str11 = str1.decode('utf-8')  # fails, because str1 holds gb2312-encoded bytes
    Traceback (most recent call last):
      File "<pyshell#9>", line 1, in <module>
        str11 = str1.decode('utf-8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte

    * The encoding argument specifies which codec to use; bytes must be decoded with the same codec that produced them.
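When the encoding of a page is not known in advance, one possible approach is to try a few candidate codecs and catch the UnicodeDecodeError seen in the session above. This is only a sketch: decode_body is a hypothetical helper, and the candidate list ('utf-8', then 'gbk') is an assumption, not something urllib provides.

```python
def decode_body(raw, encodings=('utf-8', 'gbk')):
    """Try each candidate codec in turn (hypothetical helper for illustration)."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD
    return raw.decode(encodings[0], errors='replace')


gbk_bytes = '我爱你'.encode('gbk')
utf8_bytes = '我爱你'.encode('utf-8')

print(decode_body(gbk_bytes))
print(decode_body(utf8_bytes))
```

UTF-8 is tried first because it is strict and rejects most input that was produced by another codec. When the server sends a charset in its Content-Type header, using that value is more reliable than guessing.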

    Another question comes up here: what is the difference between read(), readline(), and readlines()?

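    1. read(): everything, as a single object (bytes for an HTTP response, str for a text-mode file)
    2. readline(): one line
    3. readlines(): everything, as a list of lines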
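The three methods can be compared offline. In the sketch below io.BytesIO stands in for the response object (an assumption made so the code runs without a network connection); it offers the same three read methods and, like an HTTP response, returns bytes.

```python
import io

raw = b'first line\nsecond line\nthird line\n'

# A fresh stream is created for each call, because reading consumes the data
whole = io.BytesIO(raw).read()       # everything, as one bytes object
one = io.BytesIO(raw).readline()     # only the first line
lines = io.BytesIO(raw).readlines()  # everything, as a list of lines

print(whole)
print(one)
print(lines)
```

With a real response, a second read() on the same object returns empty bytes, so store the result in a variable if it is needed more than once.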
  • Original article: https://www.cnblogs.com/DC0307/p/10675878.html