  • Python Web Scraping Notes ==> Chapter 7: Basic Usage of the urllib Library

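    Learning objective:

      urllib provides functions for requesting and parsing URLs, so it is worth learning.
    Steps

    Step 1: What is urllib

      urllib is part of Python's standard library and serves as its built-in HTTP request library; nothing needs to be installed.

      It contains four modules: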
      

    >>> import urllib
    >>> # urllib.request      request module
    >>> # urllib.error        exception-handling module
    >>> # urllib.parse        URL-parsing module
    >>> # urllib.robotparser  robots.txt-parsing module
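
Since URL parsing is the stated reason for learning urllib, here is a minimal sketch of urllib.parse. The URL is a made-up example, and the snippet runs without network access.

```python
from urllib.parse import urlparse, urlencode, parse_qs, urljoin

# Hypothetical URL used only for illustration.
url = 'https://example.com/search?word=hello&page=2#results'

# urlparse splits a URL into scheme, netloc, path, query, fragment, ...
parts = urlparse(url)

# parse_qs turns the query string into a dict of lists
query = parse_qs(parts.query)

# urlencode goes the other way: dict -> query string
encoded = urlencode({'word': 'hello', 'page': 2})

# urljoin resolves a relative link against a base URL
joined = urljoin('https://example.com/a/b.html', '../c.html')

print(parts.scheme, parts.netloc, parts.path)
print(query)
print(encoded)
print(joined)
```

urljoin is handy in a crawler because links found in a page are often relative to the page's own address.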

    Step 2: Usage

    1. urlopen
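      # -*- coding: utf-8 -*-
      
      import urllib.request
      
      '''
      Signature of urlopen:
      urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)
      (cafile, capath and cadefault are deprecated since Python 3.6; pass an ssl.SSLContext as context instead)
      '''
      # Example 1: a plain GET request
      response = urllib.request.urlopen('http://www.baidu.com')
      # read() returns the response body as bytes and decode() converts it to str.
      # Without decode(), print() shows the bytes repr (b'...') on a single line.
      print(response.read().decode('utf-8'))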
      
      
      print('\n')
      print('urllib.parse example')
      print('\n')
      import urllib.request
      import urllib.parse
      
      # Passing data makes urlopen send a POST request; data must be bytes
      data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
      response = urllib.request.urlopen('http://httpbin.org/post', data=data)
      print(response.read())
      
      print('\n')
      print('timeout in urllib and the urllib.error exception-handling module')
      print('\n')
      
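      import urllib.request
      import socket
      import urllib.error
      try:
          response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
      except urllib.error.URLError as e:
          # socket.error is an alias of OSError and would match almost any network
          # failure; socket.timeout matches timeouts only
          if isinstance(e.reason, socket.timeout):
              print('TIMEOUT')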
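
The timeout example handles only one failure mode. As a rough sketch, the helper below (classify_error is a hypothetical name, not part of urllib) shows one way to tell apart the exceptions urlopen commonly raises. The sample exceptions are built by hand so the snippet runs offline.

```python
import socket
import urllib.error

def classify_error(exc):
    """Map an exception raised by urlopen to a short label (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it must be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return 'http %d' % exc.code
    if isinstance(exc, urllib.error.URLError):
        # A timeout while connecting arrives wrapped in URLError.reason.
        if isinstance(exc.reason, socket.timeout):
            return 'timeout'
        return 'unreachable'
    # A timeout while reading the body can be raised unwrapped.
    if isinstance(exc, socket.timeout):
        return 'timeout'
    return 'other'

# Hand-built exceptions standing in for real network failures.
samples = [
    urllib.error.URLError(socket.timeout('timed out')),
    urllib.error.URLError(ConnectionRefusedError(111, 'Connection refused')),
    socket.timeout('timed out'),
]
labels = [classify_error(e) for e in samples]
print(labels)
```

In a real script the helper would be called from an except block wrapped around urlopen and read().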
    2. Response
      # -*- coding: utf-8 -*-
      
      print("Response type example")
      
      import urllib.request
      
      response = urllib.request.urlopen('http://www.baidu.com')
      print(type(response))  # <class 'http.client.HTTPResponse'>
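
The response object can also be explored without network access. This is a minimal sketch, assuming Python 3.4 or later, where urlopen accepts data: URLs; the URL and its content are made up for illustration. For a data: URL the returned object is a urllib.response.addinfourl rather than an http.client.HTTPResponse, but read() and headers work the same way.

```python
import urllib.request
import urllib.response

# A data: URL carries its own body, so no network access is needed.
# The content is made up for illustration.
url = 'data:text/plain;charset=utf-8,hello%20urllib'

# Using the response as a context manager closes it automatically.
with urllib.request.urlopen(url) as response:
    response_type = type(response)
    # Take the charset from the Content-Type header instead of hard-coding it
    charset = response.headers.get_content_charset()
    body = response.read().decode(charset)

print(response_type)
print(body)
```

The same with-statement pattern applies to http URLs and avoids leaving connections open.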
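    3. Status code and response headers
      # -*- coding: utf-8 -*-
      
      print('Status code and response header example')
      
      import urllib.request
      
      response = urllib.request.urlopen('http://www.baidu.com')
      print(response.status)                     # HTTP status code
      print(response.getheaders())               # all headers as (name, value) pairs
      print(response.getheader('Content-Type'))  # a single header by name
      print(response.getheader('Date'))
      print(response.getheader('Server'))

      Output:
      Status code and response header example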
      200
      [('Date', 'Tue, 03 Apr 2018 14:29:52 GMT'), ('Content-Type', 'text/html; charset=utf-8'), ('Transfer-Encoding', 'chunked'), ('Connection', 'Close'), ('Vary', 'Accept-Encoding'), ('Set-Cookie', 'BAIDUID=6150350FD6AF7F0B4629DA49AEF7DEAE:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=6150350FD6AF7F0B4629DA49AEF7DEAE; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1522765792; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=0; path=/'), ('Set-Cookie', 'H_PS_PSSID=1430_25809_13290_21093_20927; path=/; domain=.baidu.com'), ('P3P', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Cache-Control', 'private'), ('Cxy_all', 'baidu+66a85a47dcb1b7de8cd2d7ba25b3a1dc'), ('Expires', 'Tue, 03 Apr 2018 14:29:42 GMT'), ('X-Powered-By', 'HPHP'), ('Server', 'BWS/1.1'), ('X-UA-Compatible', 'IE=Edge,chrome=1'), ('BDPAGETYPE', '1'), ('BDQID', '0xa1de1b2000003abd'), ('BDUSERID', '0')]
      text/html; charset=utf-8
      Tue, 03 Apr 2018 14:29:52 GMT
      BWS/1.1
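
The Content-Type header in the output above names the charset, so it does not have to be hard-coded when decoding the body. Below is a minimal sketch using the standard library's email.message.Message to parse the header value; charset_from_content_type is a hypothetical helper name, and the header pairs are copied (shortened) from the sample output.

```python
from email.message import Message

def charset_from_content_type(value, default='utf-8'):
    """Return the charset named in a Content-Type value, or default if absent."""
    msg = Message()
    msg['Content-Type'] = value
    return msg.get_content_charset() or default

# Header pairs copied from the sample output above (shortened).
headers = [
    ('Date', 'Tue, 03 Apr 2018 14:29:52 GMT'),
    ('Content-Type', 'text/html; charset=utf-8'),
    ('Server', 'BWS/1.1'),
]
# Header names are case-insensitive, so normalise them before the lookup.
# Note: a dict keeps only one value for repeated headers such as Set-Cookie.
header_map = {name.lower(): value for name, value in headers}

detected = charset_from_content_type(header_map['content-type'])
fallback = charset_from_content_type('text/html')
print(detected)
print(fallback)
```

With a live response, response.headers.get_content_charset() gives the same information directly.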
    4. Handlers: proxies
      # -*- coding: utf-8 -*-
      import urllib.request
      
      # Replace these addresses with your actual proxy
      proxy_handler = urllib.request.ProxyHandler(
          {'http': 'http://127.0.0.1:9743', 'https': 'https://127.0.0.1:9743'}
      )
      # build_opener returns an opener that sends requests through the handler
      opener = urllib.request.build_opener(proxy_handler)
      response = opener.open('http://www.baidu.com')
      print(response.read())
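
To check how a proxy opener is assembled without contacting any server, the sketch below wraps the two calls in a helper (make_proxy_opener is a hypothetical name) and inspects the result. The proxy address is a placeholder, and the handlers and proxies attributes are CPython implementation details used here only for inspection.

```python
import urllib.request

def make_proxy_opener(proxy_url):
    """Build an opener that routes http and https traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)

# Placeholder address; nothing is contacted until opener.open() is called.
opener = make_proxy_opener('http://127.0.0.1:9743')

# Inspection only: .handlers and .proxies are implementation attributes.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)

# Optional: make plain urllib.request.urlopen() calls use this opener too.
# urllib.request.install_opener(opener)
```

Passing a ProxyHandler to build_opener replaces the default one, which would otherwise read proxy settings from the environment.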
    5. Cookies (text data that records the user's identity)
      # -*- coding: utf-8 -*-
      import urllib.request, http.cookiejar
      # Save cookies to a file
      filename = 'cookie.txt'
      cookie = http.cookiejar.LWPCookieJar(filename)
      handler = urllib.request.HTTPCookieProcessor(cookie)
      opener = urllib.request.build_opener(handler)
      response = opener.open('http://www.baidu.com')
      # ignore_discard also saves session cookies; ignore_expires also saves expired ones
      cookie.save(ignore_discard=True, ignore_expires=True)

      Printing cookies

      # -*- coding: utf-8 -*-
      
      import urllib.request, http.cookiejar
      
      # Create a CookieJar object to hold the cookies
      cookie = http.cookiejar.CookieJar()
      # The handler stores cookies from responses and sends them back on later requests, as a browser would
      handler = urllib.request.HTTPCookieProcessor(cookie)
      # build_opener attaches the cookie handler to the opener
      opener = urllib.request.build_opener(handler)
      response = opener.open('http://www.baidu.com')
      for i in cookie:
          print(i.name + '=' + i.value)

      Loading locally saved cookies and sending them with a request

      # -*- coding: utf-8 -*-
      
      import urllib.request, http.cookiejar
      
      cookie = http.cookiejar.LWPCookieJar()
      # Load the file written by the save example above
      cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
      handler = urllib.request.HTTPCookieProcessor(cookie)
      opener = urllib.request.build_opener(handler)
      response = opener.open('http://www.baidu.com')
      print(response.read().decode('utf-8'))
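
The save and load steps can be tried offline. The sketch below builds a session cookie by hand (normally the server sets it), writes it to a temporary file, and reads it back; it also shows why ignore_discard=True matters when saving. The helper names and the .example.com domain are made up for illustration.

```python
import http.cookiejar
import os
import tempfile

def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets these)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

def saved_names(path, ignore_discard):
    """Save one session cookie to path, reload the file, return the cookie names."""
    jar = http.cookiejar.LWPCookieJar(path)
    jar.set_cookie(make_session_cookie('BD_HOME', '0', '.example.com'))
    jar.save(ignore_discard=ignore_discard, ignore_expires=True)

    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return [cookie.name for cookie in loaded]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie.txt')
    kept = saved_names(path, ignore_discard=True)      # session cookie is written
    dropped = saved_names(path, ignore_discard=False)  # session cookie is skipped

print(kept)
print(dropped)
```

Session cookies carry no expiry date and are marked for discard, so save() skips them unless ignore_discard=True is passed.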

    Summary:

      The remaining built-in methods were not covered here; the next section moves straight on to the requests library.

  • Original article: https://www.cnblogs.com/wuzhiming/p/8700244.html