Python Web Scraping Basics: requests

I. Scrape Any Web Page, Anytime

How do you scrape a web page? Anyone who has done web development knows the flow: the browser requests a URL, and the server responds with a pile of HTML, including HTML tags, CSS styles, JS scripts, and so on. Previously we did this with Python's standard library urllib; now let's write the scraper with the Requests HTTP library instead. Requests' slogan is suitably bold: "HTTP for Humans".

II. Basic Usage of the Python Requests Library

1. GET and POST Request Methods

GET request

    import requests

    payload = {"t": "b", "w": "Python urllib"}
    response = requests.get('http://zzk.cnblogs.com/s', params=payload)
    # print(response.url)  # prints http://zzk.cnblogs.com/s?w=Python+urllib&t=b&AspxAutoDetectCookieSupport=1
    print(response.text)

With a GET request in requests, there is no need to urlencode() the dict of parameters and splice them onto the URL by hand; passing the dict as params makes the get() method do exactly that for you.

POST request

    import requests

    payload = {"t": "b", "w": "Python urllib"}
    response = requests.post('http://zzk.cnblogs.com/s', data=payload)
    print(response.text)  # u'......'

With a POST request, there is likewise no need to urlencode() the dict or encode() the string into bytes before sending. On the response side, the raw attribute returns bytes, while the text attribute returns a Unicode string directly, so you don't have to decode() the returned bytes yourself.
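
A small sketch of that bytes-versus-unicode point: response.content (like response.raw) gives you bytes, while response.text gives an already-decoded str.

    import requests

    response = requests.get('http://zzk.cnblogs.com/')
    print(type(response.content))  # <class 'bytes'> -- the body as bytes, no manual decode()
    print(type(response.text))     # <class 'str'>   -- the same body decoded to unicode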

Compared with Python's urllib, requests is simply easier to use.

2. Setting Request Headers

    import requests

    payload = {"t": "b", "w": "Python urllib"}
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    response = requests.get('http://zzk.cnblogs.com/s', params=payload, headers=headers)
    print(response.request.headers)

Request headers for get() are set by passing a dict to the headers parameter. Note the distinction: response.headers contains the headers the server sent back in its response, while response.request.headers contains the headers the client sent.
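
A quick sketch of that difference against the same endpoint:

    import requests

    response = requests.get('http://zzk.cnblogs.com/')
    print(response.request.headers)  # headers the client sent with the request
    print(response.headers)          # headers the server sent back in the response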

3. Setting Cookies

    import requests

    cookies = {'cookies_are': 'working'}
    response = requests.get('http://zzk.cnblogs.com/', cookies=cookies)
    print(response.text)

Besides a plain dict, the cookies parameter of requests.get() also accepts a RequestsCookieJar object, which additionally lets you scope each cookie to a domain and path.

    import requests
    import requests.cookies

    cookie_jar = requests.cookies.RequestsCookieJar()
    # The cookie is sent only when the request's domain and path match these values.
    cookie_jar.set('cookies_are', 'working', domain='zzk.cnblogs.com', path='/')
    response = requests.get('http://zzk.cnblogs.com/', cookies=cookie_jar)
    print(response.text)

4. Setting a Request Timeout

    import requests

    # A timeout this small almost always fires and raises requests.exceptions.Timeout.
    response = requests.get('http://zzk.cnblogs.com/', timeout=0.001)
    print(response.text)
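
When the timeout elapses, requests raises requests.exceptions.Timeout instead of returning a response, so in practice you catch it. The tuple form below, which is part of the requests API, sets the connect and read timeouts separately:

    import requests

    try:
        # (connect timeout, read timeout) in seconds
        response = requests.get('http://zzk.cnblogs.com/', timeout=(3.05, 27))
        print(response.status_code)
    except requests.exceptions.Timeout as err:
        print('Request timed out:', err)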

III. Advanced Usage of the Python Requests Library

1. Session Objects

    from requests import Session

    s = Session()

    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
    r = s.get('http://httpbin.org/cookies')

    print(r.text)
    # '{"cookies": {"sessioncookie": "123456789"}}'

With a Session, cookies persist across multiple requests, but only within the same domain; requests to other domains won't carry them. For pages that require a login, you can log in once through the Session so that the login cookies are stored, and they will be attached automatically when you visit other pages, as the sketch below shows.
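
A minimal sketch of that login flow. The login URL and form field names here are hypothetical, purely for illustration; substitute the real endpoint and fields of your target site:

    import requests

    s = requests.Session()
    # Hypothetical login endpoint and form fields -- adjust for the real site.
    s.post('http://example.com/login', data={'username': 'alice', 'password': 'secret'})
    # The Session now holds whatever cookies the login set,
    # and attaches them automatically to same-domain requests.
    r = s.get('http://example.com/profile')
    print(r.status_code)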

2. Prepared Requests

    from requests import Request, Session

    url = 'http://zzk.cnblogs.com/s'
    payload = {"t": "b", "w": "Python urllib"}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded'
    }
    s = Session()
    request = Request('GET', url, headers=headers, data=payload)
    prepped = request.prepare()

    # Do something with prepped.headers before sending, e.g. drop a header.
    del prepped.headers['Content-Type']
    response = s.send(prepped, timeout=3)
    print(response.request.headers)

The prepare() method of a Request object returns a PreparedRequest, which lets you do some extra work before the request is sent, such as updating the request body or the headers.
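
A caveat from the requests documentation: Session.send() does not apply environment settings such as proxies picked up from environment variables or SSL verification options. If you need them, merge them in explicitly with merge_environment_settings(). A short sketch:

    from requests import Request, Session

    s = Session()
    prepped = Request('GET', 'http://zzk.cnblogs.com/s').prepare()
    # Pull in environment settings (proxies, stream, verify, cert) that
    # Session.send() would otherwise skip.
    settings = s.merge_environment_settings(prepped.url, {}, None, None, None)
    response = s.send(prepped, **settings)
    print(response.status_code)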

3. Setting Proxies

    import requests

    # set headers
    user_agent = 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4'
    headers = {'User-Agent': user_agent}
    url = 'http://passport.xxx.com/auth/valid.json'
    proxies = {'http': 'http://10.1.1.1:80'}  # set http proxy
    params = {'uin': 'xxxxxx', 'passwd': 'xxxxxx', 'imgcode': 'ijyk', '_1': '12',
              '_2': '10279', '_3': '23935743', 'url': ''}

    response = requests.post(url, headers=headers, data=params, proxies=proxies)
    response.raise_for_status()
    if response.status_code == requests.codes.ok:
        print(response.text)

Both requests.get() and requests.post() support HTTP proxies: pass a dict such as proxies = {'http': 'http://10.1.1.1:80'} and the request is routed through the proxy server at 10.1.1.1:80, which fetches the response and returns it to you.
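
The proxies dict maps URL schemes to proxy URLs, so HTTP and HTTPS traffic can go through different proxies, and credentials can be embedded in the proxy URL. A short sketch (all addresses and credentials below are placeholders):

    import requests

    proxies = {
        'http': 'http://10.1.1.1:80',     # proxy for plain-HTTP requests (placeholder)
        'https': 'http://10.1.1.1:8080',  # proxy for HTTPS requests (placeholder)
        # With basic auth, embed the credentials in the proxy URL:
        # 'http': 'http://user:password@10.1.1.1:80',
    }
    response = requests.get('http://zzk.cnblogs.com/', proxies=proxies)
    print(response.status_code)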

IV. Practical Uses of the Python Requests Library

1. A GET Request Wrapper

    def do_get_request(self, url, headers=None, timeout=3, is_return_text=True, num_retries=2):
        if url is None:
            return None
        print('Downloading:', url)
        if headers is None:  # default request headers
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        response = None
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()  # a 4XX or 5XX response raises requests.exceptions.HTTPError
            if response.status_code == requests.codes.ok:
                if is_return_text:
                    html = response.text
                else:
                    html = response.json()
            else:
                html = None
        except requests.Timeout as err:
            print('Downloading Timeout:', err.args)
            html = None
        except requests.HTTPError as err:
            print('Downloading HTTP Error, msg:{0}'.format(err.args))
            html = None
            if num_retries > 0:
                if 500 <= response.status_code < 600:
                    # Server error: retry (two retries by default).
                    return self.do_get_request(url, headers=headers, timeout=timeout,
                                               is_return_text=is_return_text,
                                               num_retries=num_retries - 1)
        except requests.ConnectionError as err:
            print('Downloading Connection Error:', err.args)
            html = None

        return html
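
As an aside, the retry-on-5XX logic above can also be delegated to requests' transport adapters instead of recursing manually. This is not the wrapper's approach, just an alternative sketch using urllib3's Retry class, which requests builds on:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    # Retry up to 2 times on 5XX responses, with a short exponential backoff.
    retries = Retry(total=2, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
    session.mount('http://', HTTPAdapter(max_retries=retries))
    session.mount('https://', HTTPAdapter(max_retries=retries))

    response = session.get('http://zzk.cnblogs.com/', timeout=3)
    print(response.status_code)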

2. A POST Request Wrapper

    def do_post_request(self, url, data=None, headers=None, timeout=3, is_return_text=True, num_retries=2):
        if url is None:
            return None
        print('Downloading:', url)
        # If the request data is empty, return immediately.
        if data is None:
            return None
        if headers is None:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        response = None
        try:
            response = requests.post(url, data=data, headers=headers, timeout=timeout)
            response.raise_for_status()  # a 4XX or 5XX response raises requests.exceptions.HTTPError
            if response.status_code == requests.codes.ok:
                if is_return_text:
                    html = response.text
                else:
                    html = response.json()
            else:
                html = None
        except requests.Timeout as err:
            print('Downloading Timeout:', err.args)
            html = None
        except requests.HTTPError as err:
            print('Downloading HTTP Error, msg:{0}'.format(err.args))
            html = None
            if num_retries > 0:
                if 500 <= response.status_code < 600:
                    # Server error: retry (two retries by default).
                    return self.do_post_request(url, data=data, headers=headers, timeout=timeout,
                                                is_return_text=is_return_text,
                                                num_retries=num_retries - 1)
        except requests.ConnectionError as err:
            print('Downloading Connection Error:', err.args)
            html = None

        return html

3. Persisting Login Cookies

    import pickle
    import requests

    def save_cookies(requests_cookiejar, filename):
        with open(filename, 'wb') as f:
            pickle.dump(requests_cookiejar, f)

    def load_cookies(filename):
        with open(filename, 'rb') as f:
            return pickle.load(f)

    # save request cookies (url and filename defined elsewhere)
    r = requests.get(url)
    save_cookies(r.cookies, filename)

    # load cookies and do a request
    requests.get(url, cookies=load_cookies(filename))
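
If you work through a Session (see the Session Objects section above), the same pickled jar can be restored in place, because a Session's cookies attribute is itself a RequestsCookieJar. A minimal sketch, assuming cookies were previously saved to a hypothetical 'cookies.pkl' with save_cookies():

    import requests

    s = requests.Session()
    s.cookies.update(load_cookies('cookies.pkl'))  # restore the previously saved jar
    r = s.get('http://zzk.cnblogs.com/')           # later requests carry these cookies automatically
    print(r.status_code)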