  • Python Web Scraping Notes => Chapter 8: The Requests Library in Detail

    Learning goal:

       The requests library is more concise and more convenient to use than urllib.

    Main steps

    Step 1: What is requests?

      requests is an HTTP library written in Python, built on top of urllib and released under the Apache2 Licensed open-source license. It is more convenient than urllib, saves a great deal of work, fully covers HTTP testing needs, and is simple and easy to use.

    Step 2: A first example

    # -*-  coding:utf-8 -*-
    
    import requests
    
    response = requests.get('http://www.baidu.com')
    print(type(response))
    print(response.content)
    print(response.status_code)
    print(response.text)
    print(type(response.text))
    print(response.cookies)

    Important:

    • response.content: the raw bytes fetched from the network, with no decoding applied; its type is bytes.
    • response.text: a str produced by requests decoding response.content. Decoding requires an encoding; requests guesses one, and the guess is sometimes wrong, so the safest approach is response.content.decode("utf-8"), i.e. decoding manually with an explicit encoding.
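
      To make the decoding point concrete, here is a minimal sketch. An in-memory bytes value stands in for response.content, so no network request is needed:

      ```python
      # Simulate response.content: raw bytes as fetched off the wire
      raw = "百度一下,你就知道".encode("utf-8")

      # response.text would guess an encoding; decoding manually is explicit and safe
      text = raw.decode("utf-8")

      print(type(raw))   # <class 'bytes'>
      print(type(text))  # <class 'str'>
      ```

      The same `.decode("utf-8")` call applied to a real response.content gives you full control over the encoding instead of relying on the library's guess.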

    Step 3: The various request methods

    # -*-  coding:utf-8 -*-
    import requests
    
    requests.post('http://httpbin.org/post')
    requests.put('http://httpbin.org/put')
    requests.delete('http://httpbin.org/delete')
    requests.head('http://httpbin.org/get')
    requests.options('http://httpbin.org/get')
    1. GET requests
      ① Basic usage
      # -*-  coding:utf-8 -*-
      
      import requests
      
      response = requests.get('http://httpbin.org/get')
      print(response.text)

      Output:

      {
        "args": {}, 
        "headers": {
          "Accept": "*/*", 
          "Accept-Encoding": "gzip, deflate", 
          "Connection": "close", 
          "Host": "httpbin.org", 
          "User-Agent": "python-requests/2.18.4"
        }, 
        "origin": "222.94.50.178", 
        "url": "http://httpbin.org/get"
      }


      ② GET with parameters

      import requests
      
      
      data = {
          'name':'python','age':17
      }
      
      response = requests.get('http://httpbin.org/get',params=data)
      print(response.text)

      Output:

      {
        "args": {
          "age": "17", 
          "name": "python"
        }, 
        "headers": {
          "Accept": "*/*", 
          "Accept-Encoding": "gzip, deflate", 
          "Connection": "close", 
          "Host": "httpbin.org", 
          "User-Agent": "python-requests/2.18.4"
        }, 
        "origin": "222.94.50.178", 
        "url": "http://httpbin.org/get?name=python&age=17"
      }

       
      Differences between GET and POST requests:

      • GET retrieves data from the server; POST sends data to the server.

      • GET request parameters are visible: they appear in the browser's address bar, and the HTTP server builds its response from the parameters contained in the request URL. In other words, a GET request's parameters are part of the URL, e.g. http://www.baidu.com/s?wd=Chinese

      • POST request parameters travel in the request body. The message length is not limited and it is sent implicitly, so POST is typically used to submit larger amounts of data to an HTTP server (e.g. requests with many parameters, or file uploads). The "Content-Type" header describes the media type and encoding of the message body.
        Note: avoid submitting forms with GET, as this can cause security problems. For example, using GET on a login form exposes the username and password in the address bar.
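
      The "parameters are part of the URL" point can be shown without any network call. The standard library's urlencode builds the same query string that requests produces from the params= argument in the example above:

      ```python
      from urllib.parse import urlencode

      # GET: parameters are appended to the URL as a query string
      # (essentially what requests does internally with params=)
      params = {'name': 'python', 'age': 17}
      get_url = 'http://httpbin.org/get?' + urlencode(params)
      print(get_url)  # http://httpbin.org/get?name=python&age=17
      ```

      With POST, by contrast, the same dict would travel in the request body and the URL would stay clean.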





      ③ Parsing JSON

      import requests
      import json
      
      response = requests.get('http://httpbin.org/get')
      # response.json() is equivalent to json.loads(response.text)
      print(response.json())
      print(type(response.json()))
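
      The equivalence between response.json() and json.loads() can be checked offline; here a local string stands in for response.text:

      ```python
      import json

      # A JSON body like the one httpbin returns, standing in for response.text
      body = '{"args": {}, "url": "http://httpbin.org/get"}'

      # response.json() does essentially this: parse the text into a Python dict
      data = json.loads(body)
      print(data['url'])  # http://httpbin.org/get
      print(type(data))   # <class 'dict'>
      ```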




      ④ Fetching binary data

      # -*-  coding:utf-8 -*-
      '''
      Save the Baidu logo
      '''
      import requests
      
      response = requests.get('https://www.baidu.com/img/bd_logo1.png')
      with open('baidu.png','wb') as f:
          f.write(response.content)  # the with block closes the file automatically



      ⑤ Adding headers
      Scraping Zhihu directly, without headers, fails with an error:

      import requests
      
      response = requests.get('https://www.zhihu.com/explore')
      print(response.text)

      Output:

      <html><body><h1>500 Server Error</h1>
      An internal server error occured.
      </body></html>

      Solution:

      import requests
      headers = {
          'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
      }
      response = requests.get('https://www.zhihu.com/explore',headers = headers)
      print(response.text)

      Simply adding a headers dict makes the request succeed. I found the header values with Chrome's built-in developer tools and copied them over.
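
      If you send many requests with the same headers, requests can also store them once on a Session, and they are merged into every outgoing request. A small sketch; prepare_request builds the outgoing request without actually sending it, so this runs offline:

      ```python
      import requests

      # Default headers set on a Session apply to every request it sends
      s = requests.Session()
      s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'})

      # prepare_request lets us inspect the outgoing headers without a network call
      prepared = s.prepare_request(requests.Request('GET', 'https://www.zhihu.com/explore'))
      print(prepared.headers['User-Agent'])
      ```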


    2. Basic POST requests
      import requests
      
      data = {
          'name':'python','age' : 18
      }
      headers = {
          'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
      }
      
      response = requests.post('http://httpbin.org/post',data=data,headers=headers)
      print(response.json())

       Example: scrape Python job listings from Lagou and get the data as a dict

      # -*-  coding:utf-8 -*-
      
      import requests
      
      headers = {
          'User-Agent' :'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                               'Chrome/69.0.3497.100 Safari/537.36',
          'Referer':"https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
      }
      
      data = {
          'first':"True",
          'pn':"1",
          'kd' :"python"
      }
      url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
      
      # this Ajax endpoint expects a POST with form data, not a GET
      response = requests.post(url,headers=headers,data=data)
      print(response.json())






    3. The response object
      import requests
      '''
      response attributes
      '''
      response = requests.get('http://www.baidu.com')
      print(response.status_code,type(response.status_code))
      print(response.history,type(response.history))
      print(response.cookies,type(response.cookies))
      print(response.url,type(response.url))
      print(response.headers,type(response.headers))

       Output:

      200 <class 'int'>
      [] <class 'list'>
      <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]> <class 'requests.cookies.RequestsCookieJar'>
      http://www.baidu.com/ <class 'str'>
      {'Server': 'bfe/1.0.8.18', 'Date': 'Thu, 05 Apr 2018 06:27:33 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:24 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Encoding': 'gzip'} <class 'requests.structures.CaseInsensitiveDict'>


    4. Checking status codes
      Status code reference table: http://www.cnblogs.com/wuzhiming/p/8722422.html
      # -*-  coding:utf-8 -*-
      
      import requests
      
      # a non-existent page should come back as 404
      response = requests.get('http://www.cnblogs.com/hello.html')
      if response.status_code == requests.codes.not_found:
          print('404 Not Found')
      
      response1 = requests.get('http://www.baidu.com')
      if response1.status_code == requests.codes.ok:
          print('Request Successful')
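      Instead of comparing status codes by hand, requests also offers raise_for_status(), which raises an HTTPError for any 4xx/5xx response. A sketch using a hand-built Response object (normally requests constructs this for you), so no network call is needed:

      ```python
      import requests

      # Build a Response by hand just to demonstrate; requests normally creates it
      response = requests.models.Response()
      response.status_code = 404

      try:
          response.raise_for_status()  # raises HTTPError for 4xx/5xx codes
      except requests.exceptions.HTTPError as e:
          print('HTTP error:', e)
      ```

      On a 2xx response, raise_for_status() returns None and execution continues normally.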
    5. Advanced usage
      ① File upload
      import requests
      
      file = {'file':open('baidu.png','rb')}
      response = requests.post('http://httpbin.org/post',files = file)
      print(response.text)

       Output omitted.


      ② Getting cookies
      import requests
      
      response = requests.get('http://www.baidu.com')
      cookies = response.cookies
      print(cookies)
      for key,value in cookies.items():
          print(key + '=' + value)
      ③ Session persistence
      import requests
      
      s = requests.Session()
      # note: httpbin's cookie-setting endpoint is /cookies/set/..., not /cookies/get/...
      s.get('http://httpbin.org/cookies/set/number/123456789')
      response = s.get('http://httpbin.org/cookies')
      response = s.get('http://httpbin.org/cookies')
      print(response.text)
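      The reason this works is that a Session keeps cookies in a local jar and resends them on every subsequent request. The jar itself can be inspected without any network call:

      ```python
      import requests

      # A Session stores cookies in a local RequestsCookieJar
      # and attaches them to every request it sends
      s = requests.Session()
      s.cookies.set('number', '123456789')
      print(s.cookies.get('number'))  # 123456789
      ```

      A plain requests.get(), by contrast, uses a fresh connection each time, so cookies set by one call are not carried into the next.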
      ④ Certificate verification
      import requests
      
      # verify=False skips certificate verification
      response = requests.get('https://www.12306.cn',verify=False)
      print(response.status_code)

       Specifying a certificate manually:

      response1 = requests.get('https://www.12306.cn',cert=('/path/server.crt','/path/key'))

      ⑤ Proxy settings
      import requests
      # usage example -- free proxies can be found with a quick search
      # note: each scheme key may appear only once in a dict;
      # use the user:password@ form when the proxy requires authentication
      proxies = {
          'http':'http://username:password@ip:port',
          'https':'https://ip:port',
      }
      
      response = requests.get('http://www.baidu.com',proxies=proxies)
      print(response.status_code)
      ⑥ Timeout settings
      import requests
      
      response = requests.get('http://httpbin.org/get',timeout = 1)
      print(response.status_code)
      ⑦ Authentication
      import requests
      from requests.auth import HTTPBasicAuth
      
      # passing a (user, password) tuple is shorthand for HTTPBasicAuth
      response = requests.get('http://127.0.0.1:8888',auth=('user','password'))
      response1 = requests.get('http://127.0.0.1:8888',auth=HTTPBasicAuth('user','password'))
      print(response.status_code)

       PS: 127.0.0.1:8888 is just an example.


      ⑧ Exception handling (catch the most specific exceptions first)
      import requests
      from requests.exceptions import ReadTimeout,HTTPError,RequestException
      
      try:
          response = requests.get('http://httpbin.org/get',timeout = 0.01)
          print(response.status_code)
      except ReadTimeout:
          print("TIME OUT")
      except HTTPError:
          print('HTTP ERROR')
      except RequestException:
          print("ERROR")
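
      The except clauses above work because requests organizes its exceptions into a hierarchy: ReadTimeout and ConnectTimeout both derive from Timeout, and everything ultimately derives from RequestException, so catching a broad Timeout covers both timeout flavours. A small sketch of the hierarchy:

      ```python
      from requests.exceptions import (
          ConnectTimeout, HTTPError, ReadTimeout, RequestException, Timeout,
      )

      # Both timeout flavours share the Timeout base class,
      # and every requests exception derives from RequestException
      print(issubclass(ReadTimeout, Timeout))         # True
      print(issubclass(ConnectTimeout, Timeout))      # True
      print(issubclass(HTTPError, RequestException))  # True
      ```

      This is why the catch-all except RequestException clause at the end is a safe fallback for anything the earlier clauses miss.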

    Summary:

       Studying web scraping is a way to further practice the Python basics; that is my goal here, and I will continue learning in later chapters.

  • Original post: https://www.cnblogs.com/wuzhiming/p/8711850.html