  • Using a basic crawler library: the requests library

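    Using requests

    Handling page authentication and Cookies with urllib means writing Openers and Handlers. The requests library was created to make these operations more convenient, and it is considerably more powerful: it supports Cookies, login authentication, proxy settings, and more.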

    A simple requests example

    import requests
    r = requests.get('http://www.baidu.com/')
    print(type(r), r.status_code, r.text, r.cookies, sep='\n\n')
    

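    GET requests

    requests.get() sends a GET request and returns the corresponding response.

    requests.get(url, params)
    # url is the address of the page to fetch; params holds extra query parameters for the URL (a dict or bytes)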
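To see how params ends up in the URL without touching the network, one option is to build the request and prepare it without sending. This is a minimal sketch; the parameter values are the ones used in the examples in this post.

```python
from requests import Request

# Build the request but do not send it; prepare() performs the URL encoding.
params = {'name': 'LiYihua', 'age': '21'}
prepared = Request('GET', 'http://httpbin.org/get', params=params).prepare()

print(prepared.url)   # http://httpbin.org/get?name=LiYihua&age=21
```

The dictionary is encoded into the query string, which is the same URL that requests.get(url, params=params) would request.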
    

    Example 1:

    import requests
    r = requests.get('http://httpbin.org/get')
    print(r.text)
    
    # Output:
    {
       "args": {}, 
       "headers": {
         "Accept": "*/*", 
         "Accept-Encoding": "gzip, deflate", 
         "Host": "httpbin.org", 
         "User-Agent": "python-requests/2.21.0"
       }, 
       "origin": "120.85.108.192, 120.85.108.192", 
       "url": "https://httpbin.org/get"
     }
    

    Example 2:

    import requests
    data = {
         'name': 'LiYihua',
         'age': '21'
     }
    r = requests.get('http://httpbin.org/get', params=data)
    print(r.text)
    
    # Output:
    {
      "args": {
        "age": "21", 
        "name": "LiYihua"
      }, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.21.0"
      }, 
      "origin": "120.85.108.92, 120.85.108.92", 
      "url": "https://httpbin.org/get?name=LiYihua&age=21"
    }
    

    Example 3:

    import requests
    r = requests.get('http://httpbin.org/get')
    print(type(r.text), r.json(), type(r.json()), sep='\n\n')
    
    # Output:
    <class 'str'>
    
    {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'origin': '120.85.108.92, 120.85.108.92', 'url': 'https://httpbin.org/get'}
    
    <class 'dict'>
    

    Example 4:

    Fetching an image

    import requests
    r = requests.get('https://github.com/favicon.ico')
    with open('favicon.ico', 'wb') as f:
        f.write(r.content)
    
    # After the script runs, an icon file named favicon.ico is created
    

    POST requests

    POST is another common HTTP request method. For example:

    import requests
    
    data = {
        'name': 'LiYihua',
        'age': 21
    }
    r = requests.post('http://httpbin.org/post', data=data)
    print(r.text)
    
    
    # Output:
    {
      "args": {}, 
      "data": "", 
      "files": {}, 
      "form": {
        "age": "21", 
        "name": "LiYihua"
      }, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Content-Length": "19", 
        "Content-Type": "application/x-www-form-urlencoded", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.21.0"
      }, 
      "json": null, 
      "origin": "120.85.108.90, 120.85.108.90", 
      "url": "https://httpbin.org/post"
    }
    
    # The POST request succeeded; the "form" field of the result holds the submitted data
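The data= argument sends a form-encoded body, which is why the submitted values appear under "form". requests.post() also accepts json=, which sends a JSON body instead. The following sketch prepares both variants without sending them, so the difference can be inspected offline; the payload simply reuses the values from the example.

```python
import json
from requests import Request

payload = {'name': 'LiYihua', 'age': 21}
url = 'http://httpbin.org/post'

form_req = Request('POST', url, data=payload).prepare()   # form-encoded body
json_req = Request('POST', url, json=payload).prepare()   # JSON body

print(form_req.headers['Content-Type'], form_req.body)
decoded = json.loads(json_req.body)                        # the JSON body round-trips to the dict
print(json_req.headers['Content-Type'], decoded)
```

With data= the server sees form fields; with json= it sees a JSON document, which httpbin would report under "json" rather than "form".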
    

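    Response

    1. The text and content attributes give the response body (as str and bytes, respectively)
    2. The status_code attribute gives the status code
    3. The headers attribute gives the response headers
    4. The cookies attribute gives the Cookies
    5. The url attribute gives the URL
    6. The history attribute gives the request history (the redirect chain)

    Example: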

    import requests
    
    r = requests.get('https://www.cnblogs.com/liyihua/')
    
    print(type(r.status_code), r.status_code,
          type(r.headers), r.headers,
          type(r.cookies), r.cookies,
          type(r.url), r.url,
          type(r.history), r.history,
          sep='\n\n')
    
    
    # Output:
    <class 'int'>
    
    200
    
    <class 'requests.structures.CaseInsensitiveDict'>
    
    {'Date': 'Thu, 20 Jun 2019 08:18:00 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'private, max-age=10', 'Expires': 'Thu, 20 Jun 2019 08:18:10 GMT', 'Last-Modified': 'Thu, 20 Jun 2019 08:18:00 GMT', 'X-UA-Compatible': 'IE=10', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip'}
    
    <class 'requests.cookies.RequestsCookieJar'>
    
    <RequestsCookieJar[]>
    
    <class 'str'>
    
    https://www.cnblogs.com/liyihua/
    
    <class 'list'>
    
    []
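A common use of status_code is deciding whether a request succeeded. The sketch below is one way to do that; is_success is a hypothetical helper, and the two Response objects are built by hand only so that the code runs without network access.

```python
import requests

def is_success(response):
    """Return True when the response carries a 2xx status code."""
    return 200 <= response.status_code < 300

# Stand-in responses built by hand so the sketch runs offline;
# in real code these objects come from requests.get() or requests.post().
ok_response = requests.Response()
ok_response.status_code = requests.codes.ok          # 200
missing = requests.Response()
missing.status_code = requests.codes.not_found       # 404

print(is_success(ok_response), is_success(missing))

try:
    missing.raise_for_status()      # raises for 4xx and 5xx responses
except requests.exceptions.HTTPError as exc:
    print('request failed:', exc)
```

requests.codes provides named constants for status codes, and raise_for_status() turns an error status into an exception instead of a value that has to be checked by hand.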
    
    

    Advanced usage of requests

    1. File upload

      import requests
      
      # Open the file in a with block so the handle is closed after the upload
      with open('favicon.ico', 'rb') as f:
          r = requests.post('http://httpbin.org/post', files={'file': f})
      print(r.text)
      
      
      # Output:
      {
        "args": {}, 
        "data": "", 
        "files": {
          "file": "data:application/octetstream;base64,AAABAAIAEBAAAAEAIAAoBQAAJgAAACAgAAABACAAKBQAAE4FAAAoAAAAEAAAACAAAAABACAAAAAAAAAFAAA...
        }, 
        "form": {}, 
        "headers": {
          "Accept": "*/*", 
          "Accept-Encoding": "gzip, deflate", 
          "Content-Length": "6665", 
          "Content-Type": "multipart/form-data; boundary=c1b665273fc73e67e57ac97e78f49110", 
          "Host": "httpbin.org", 
          "User-Agent": "python-requests/2.21.0"
        }, 
        "json": null, 
        "origin": "120.85.108.71, 120.85.108.71", 
        "url": "https://httpbin.org/post"
      }
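The files dictionary also accepts a tuple of (filename, content, content type), which gives control over the filename and MIME type sent to the server. A minimal sketch, assuming placeholder bytes in place of the real icon and preparing the request without sending it:

```python
from requests import Request

# Placeholder bytes standing in for the real favicon.ico contents (assumption).
icon_bytes = b'\x00\x00\x01\x00'

files = {'file': ('favicon.ico', icon_bytes, 'image/x-icon')}
prepared = Request('POST', 'http://httpbin.org/post', files=files).prepare()

print(prepared.headers['Content-Type'])   # multipart/form-data with a generated boundary
print(len(prepared.body), 'bytes in the multipart body')
```

The prepared body is a multipart/form-data payload, matching the Content-Type header shown in the output above.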
      
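    2. Session persistence

      1. A Session object makes it easy to maintain a session. Two separate requests.get() calls share no Cookies, while calls made through the same Session do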

        import requests
        
        requests.get('http://httpbin.org/cookies/set/number/123456789')
        r = requests.get('http://httpbin.org/cookies')
        print(r.text)
        
        
        # Output:
        {
          "cookies": {}
        }
        
        
        import requests
        
        s = requests.Session()
        s.get('http://httpbin.org/cookies/set/number/123456789')
        r = s.get('http://httpbin.org/cookies')
        print(r.text)
        
        
        # Output:
        {
          "cookies": {
            "number": "123456789"
          }
        }
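The difference comes from the Session's cookie jar, which is merged into every request the Session prepares. The sketch below shows this offline by setting the cookie directly on the jar (instead of receiving it from httpbin) and comparing the prepared headers.

```python
from requests import Request, Session

s = Session()
s.cookies.set('number', '123456789')      # same cookie the example above sets via httpbin

req = Request('GET', 'http://httpbin.org/cookies')

with_session = s.prepare_request(req)     # merges the session's cookies
standalone = req.prepare()                # no session, so no stored cookies

print(with_session.headers.get('Cookie'))
print(standalone.headers.get('Cookie'))
```

Only the request prepared through the Session carries a Cookie header.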
        
      2. SSL certificate verification

        import requests
        
        r = requests.get('https://www.12306.cn')
        print(r.status_code)
        
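        # If nothing goes wrong, the output is: 200
        # If the HTTPS site's certificate fails verification, the request raises an SSLError.
        
        
        # To avoid the error, modify the example to skip certificate verification (verify=False)
        # and silence the warning that urllib3 emits for unverified requests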
        import requests
        from requests.packages import urllib3
        
        urllib3.disable_warnings()
        r = requests.get('https://www.12306.cn', verify=False)
        print(r.status_code)
        
      3. Proxy settings

        import requests
        
        proxies = {
            'http': 'socks5://user:password@10.10.1.10:3128',
            'https': 'socks5://user:password@10.10.1.10:1080'
        }
        
        requests.get('https://www.taobao.com', proxies=proxies)
        
        
        # Uses a proxy over the SOCKS protocol (requires the extra dependency: pip install "requests[socks]")
        
      4. Timeout settings

        import requests
        
        # timeout=(connect timeout, read timeout), both in seconds
        r = requests.get('https://taobao.com', timeout=(0.1, 1))
        print(r.status_code)
        
        # Output: 200
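When a timeout is exceeded, requests raises an exception rather than returning a response. The sketch below catches it; fetch_status is a hypothetical helper, and the local socket that never replies is only a stand-in for a slow server so the example does not depend on an external site.

```python
import socket
import requests

def fetch_status(url, timeout=(3.05, 10)):
    """Return the HTTP status code, or None if connecting or reading timed out."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:   # covers ConnectTimeout and ReadTimeout
        return None

# A local socket that accepts connections but never replies stands in for a slow server.
silent_server = socket.socket()
silent_server.bind(('127.0.0.1', 0))      # port 0: let the OS pick a free port
silent_server.listen(1)
port = silent_server.getsockname()[1]

status = fetch_status(f'http://127.0.0.1:{port}/', timeout=(1, 0.2))
print(status)
silent_server.close()
```

Here the connection succeeds but no data arrives within the 0.2 second read timeout, so the helper returns None.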
        
      5. Authentication

        import requests
        from requests.auth import HTTPBasicAuth
        
        r = requests.get('http://localhost', auth=HTTPBasicAuth('liyihua', 'woshiyihua134'))
        print(r.status_code)
        
        
        # Output: 200
        
        
        # OAuth1 authentication is also possible (via the requests_oauthlib package)
        import requests
        from requests_oauthlib import OAuth1
        
        url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
        auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
                      'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
        requests.get(url, auth=auth)
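For Basic authentication, requests also accepts a plain (username, password) tuple as shorthand for HTTPBasicAuth. The sketch below prepares both forms without sending them and compares the resulting header; 'user' and 'passwd' are placeholder credentials.

```python
import base64
from requests import Request
from requests.auth import HTTPBasicAuth

url = 'http://localhost'
explicit = Request('GET', url, auth=HTTPBasicAuth('user', 'passwd')).prepare()
shorthand = Request('GET', url, auth=('user', 'passwd')).prepare()

# Basic auth is just "username:password" encoded as Base64 in the Authorization header.
expected = 'Basic ' + base64.b64encode(b'user:passwd').decode('ascii')

print(explicit.headers['Authorization'] == expected)
print(shorthand.headers['Authorization'] == expected)
```

Because the credentials are only Base64-encoded, not encrypted, Basic authentication is normally used over HTTPS.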
        
      6. Prepared Request

        To get a Prepared Request that carries the session's state, use Session.prepare_request()
        
        from requests import Request, Session
        
        url = 'http://httpbin.org/post'
        data = {
            'name': 'LiYihua'
        }           # form data
        header = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537 (KHTML, like Gecko Chrome/53.0.2785.116 Safari/537.36'
        }           # pose as a browser
        s = Session()                       # the session whose state is merged in
        req = Request('POST', url, data=data, headers=header)
        
        prepped = s.prepare_request(req)    # Session.prepare_request() turns req into a Prepared Request object
        r = s.send(prepped)                 # send() sends the request
        print(r.text)
        
        
        # Output:
        {
          "args": {}, 
          "data": "", 
          "files": {}, 
          "form": {
            "name": "LiYihua"
          }, 
          "headers": {
            "Accept": "*/*", 
            "Accept-Encoding": "gzip, deflate", 
            "Content-Length": "12", 
            "Content-Type": "application/x-www-form-urlencoded", 
            "Host": "httpbin.org", 
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537 (KHTML, like Gecko Chrome/53.0.2785.116 Safari/537.36"
          }, 
          "json": null, 
          "origin": "120.85.108.184, 120.85.108.184", 
          "url": "https://httpbin.org/post"
        }
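One reason to prepare a request separately is that it can be inspected or adjusted before it is sent. A minimal sketch of that idea follows; the X-Demo header is made up for illustration, and the send call is left commented out so the code runs offline.

```python
from requests import Request, Session

s = Session()
req = Request('POST', 'http://httpbin.org/post', data={'name': 'LiYihua'})
prepped = s.prepare_request(req)

# Inspect what would go on the wire.
print(prepped.method, prepped.url)
print(prepped.body)
print(prepped.headers['Content-Length'])

# Adjust the prepared request before sending; X-Demo is a made-up header.
prepped.headers['X-Demo'] = 'edited-before-send'
# r = s.send(prepped, timeout=5)   # uncomment to actually send it
```

The body and Content-Length printed here correspond to the "form" and "Content-Length" values in the output above, and the Session contributes its default headers such as User-Agent when none is given.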
        

    This article is from cnblogs (博客园). Author: LeeHua. When reposting, please cite the original link: https://www.cnblogs.com/liyihua/p/11050374.html
