  • How to use the urllib library

    This week I plan to re-summarize what I have learned, so it is easier to look things up later.

    The urllib library is part of Python's standard library, so there is nothing to install separately. It is divided into four modules:

    1. urllib.request: the request module

    2. urllib.error: the exception-handling module

    3. urllib.parse: the URL-parsing module

    4. urllib.robotparser: parses a site's robots.txt file to see which pages may be crawled (rarely used)
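    Since urllib.robotparser only gets a brief mention here, a minimal sketch of how it answers "may I crawl this URL?" questions. The rule set below is made up for illustration; feeding lines to parse() avoids any network access:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse rules directly instead of fetching a real robots.txt
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])
print(rp.can_fetch('*', 'http://example.com/private/page'))  # False
print(rp.can_fetch('*', 'http://example.com/public/page'))   # True
```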

    1. urlopen

    import urllib.request
    
    response = urllib.request.urlopen('http://www.baidu.com')
    print(response.read().decode('utf-8'))
    Reading with a timeout
    import socket
    import urllib.request
    import urllib.error
    
    try:
        response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('TIME OUT')
    Inspecting the response
    import urllib.request
    
    response = urllib.request.urlopen('https://www.python.org')
    print(type(response))
    print(response.status)
    print(response.getheaders())
    print(response.getheader('Server'))
    <class 'http.client.HTTPResponse'>
    200
    [('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '50069'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 26 Nov 2018 02:44:49 GMT'), ('Via', '1.1 varnish'), ('Age', '3121'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2150-IAD, cache-sjc3143-SJC'), ('X-Cache', 'MISS, HIT'), ('X-Cache-Hits', '0, 254'), ('X-Timer', 'S1543200290.644687,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
    nginx

    2. Request

    Request lets you pass more request parameters: url, headers, data, and method.
    from urllib import request, parse
    
    url = 'http://httpbin.org/post'
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
        'Host': 'httpbin.org'
    }
    dict = {
        'name': 'Germey'
    }
    data = bytes(parse.urlencode(dict), encoding='utf8')
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))
    {
      "args": {}, 
      "data": "", 
      "files": {}, 
      "form": {
        "name": "Germey"
      }, 
      "headers": {
        "Accept-Encoding": "identity", 
        "Connection": "close", 
        "Content-Length": "11", 
        "Content-Type": "application/x-www-form-urlencoded", 
        "Host": "httpbin.org", 
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
      }, 
      "json": null, 
      "origin": "117.136.66.101", 
      "url": "http://httpbin.org/post"
    }
    Another way to add headers:
    
    from urllib import request, parse
    
    url = 'http://httpbin.org/post'
    dict = {
        'name': 'asd'
    }
    data = bytes(parse.urlencode(dict), encoding='utf8')
    req = request.Request(url=url, data=data, method='POST')
    req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')  # if there are many headers, they can be added in a loop
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))

    3. Handler

    Proxies
    import urllib.request
    
    proxy_handler = urllib.request.ProxyHandler({
        'http': 'http://127.0.0.1:9743',
        'https': 'https://127.0.0.1:9743'
    })
    opener = urllib.request.build_opener(proxy_handler)
    response = opener.open('http://httpbin.org/get')
    print(response.read())

    4. Cookie

    Getting cookies
    import http.cookiejar, urllib.request
    
    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    for item in cookie:
        print(item.name + "=" + item.value)
    Saving cookies
    import http.cookiejar, urllib.request
    filename = "cookie.txt"
    cookie = http.cookiejar.MozillaCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)
    Loading cookies
    import http.cookiejar, urllib.request
    cookie = http.cookiejar.MozillaCookieJar()
    cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    print(response.read().decode('utf-8'))
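    The examples above save cookies in the Mozilla file format; the standard library also provides http.cookiejar.LWPCookieJar, which saves in libwww-perl format. A minimal sketch (the filename is made up, and saving an empty jar still writes the format header, so this runs without any network access):

```python
import http.cookiejar
import os
import tempfile

# hypothetical filename, placed in the temp directory for illustration
filename = os.path.join(tempfile.gettempdir(), 'cookie_lwp.txt')
cookie = http.cookiejar.LWPCookieJar(filename)
# An opener built with HTTPCookieProcessor would populate this jar,
# exactly as in the MozillaCookieJar example above.
cookie.save(ignore_discard=True, ignore_expires=True)
with open(filename) as f:
    print(f.readline().strip())  # #LWP-Cookies-2.0
```

    Either jar class works with HTTPCookieProcessor; the only difference is the on-disk format.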

    5. URLError and HTTPError

    from urllib import request, error
    
    try:
        response = request.urlopen('http://cuiqingcai.com/index.htm')
    except error.HTTPError as e:
        print(e.reason, e.code, e.headers, sep='\n')
    except error.URLError as e:
        print(e.reason)
    else:
        print('Request Successfully')
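    The HTTPError clause comes before URLError deliberately: HTTPError is a subclass of URLError, so catching the parent class first would swallow all HTTP errors. The relationship can be checked directly:

```python
from urllib import error

# HTTPError inherits from URLError, so except clauses must go
# from the more specific class to the more general one.
print(issubclass(error.HTTPError, error.URLError))  # True
```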

    Since urllib.parse does not come up much, I will not cover it for now. Here is a link to a document, in case it is ever needed:

    https://files.cnblogs.com/files/zhangguoxv/urllib%E8%AE%B2%E8%A7%A3.zip
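    As a quick taste of what that document covers, a minimal sketch of the two urllib.parse helpers that come up most often: urlparse for splitting a URL into components, and urlencode, which was already used above to build the POST data:

```python
from urllib import parse

# Split a URL into scheme, host, path, params, query, and fragment
result = parse.urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme)   # http
print(result.netloc)   # www.baidu.com
print(result.path)     # /index.html

# Encode a dict as a query string, as done for the POST body earlier
query = parse.urlencode({'name': 'Germey', 'age': 22})
print(query)  # name=Germey&age=22
```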

  • Original post: https://www.cnblogs.com/zhangguoxv/p/10019546.html