  • The Python Path -- Crawlers -- Common Modules

    1.requests

    Requests is an HTTP library written in Python, built on top of urllib and released under the Apache2 Licensed open-source license. It is more convenient than urllib, saves us a great deal of work, and fully meets the needs of HTTP testing.

    Parameters of the requests module

    1.1  get  # send a GET request

    requests.get() takes the parameters: url, params, headers, cookies

    requests.get(
        url="http://www.oldboyedu.com",
        params={"nid": 1, "name": "xx"},  # pass parameters in the URL: the URL actually requested becomes http://www.oldboyedu.com?nid=1&name=xx
        headers={...},   # custom request headers
        cookies={...}
    )

    1.2  post  # send a POST request

    requests.post() takes the parameters: url, params, headers, data, cookies

    The parameters of post are used the same way as in get, so they are not repeated one by one here; a minimal sketch follows.
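    For reference, a minimal POST sketch (the URL, form fields, and header/cookie values below are placeholders, not a real endpoint):

    import requests

    r = requests.post(
        url="http://www.oldboyedu.com/login",      # placeholder URL
        data={"user": "xx", "pwd": "123"},         # form fields go in data, not params
        headers={"User-Agent": "Mozilla/5.0"},
        cookies={"session": "abc"}
    )
    print(r.status_code)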

    1.3  proxies

    proxies  -- route requests through a proxy server
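    A minimal sketch of the proxies parameter (the proxy addresses are placeholders):

    import requests

    proxies = {
        "http": "http://127.0.0.1:8888",   # placeholder proxy for plain HTTP requests
        "https": "http://127.0.0.1:8888",  # placeholder proxy for HTTPS requests
    }
    r = requests.get("http://www.oldboyedu.com", proxies=proxies)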

    Uploading a file with a custom filename uses the files parameter:

    file_dict = {
        'f1': ('test.txt', open('readme', 'rb'))  # field name: (filename to send, file object)
    }
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     files=file_dict)

    1.4  json

    Used when the request body is submitted not as form data but as a JSON payload: import the json module and send json.dumps(data).

    Sending JSON data with POST:

    import requests
    import json

    r = requests.post('https://api.github.com/some/endpoint', data=json.dumps({'some': 'data'}))
    print(r.json())
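    Note that requests can also serialize for you: passing json= encodes the dict and sets the Content-Type header automatically:

    import requests

    r = requests.post('https://api.github.com/some/endpoint', json={'some': 'data'})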

    1.5  auth

    Performs basic HTTP authentication.
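    A minimal sketch using HTTPBasicAuth (httpbin.org's basic-auth endpoint simply reports whether the credentials matched):

    import requests
    from requests.auth import HTTPBasicAuth

    r = requests.get('https://httpbin.org/basic-auth/user/passwd',
                     auth=HTTPBasicAuth('user', 'passwd'))
    print(r.status_code)  # 200 when the credentials match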

    1.6  timeout

    # timeout
    timeout=(m, n)
    # means: wait at most m seconds to establish the connection, and at most n seconds to receive the response
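    For example, catching the timeout exception:

    import requests

    try:
        # wait at most 3 seconds to connect and 10 seconds for the response
        r = requests.get('http://httpbin.org/get', timeout=(3, 10))
    except requests.exceptions.Timeout:
        print("request timed out")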

    1.7  allow_redirects

    Whether to follow redirects; defaults to True.
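    With redirects disabled, the redirect response itself is returned instead of being followed (github.com, for example, redirects HTTP to HTTPS):

    import requests

    r = requests.get('http://github.com', allow_redirects=False)
    print(r.status_code)           # 301 instead of the final page
    print(r.headers['Location'])   # the target of the redirect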

    1.8  stream

    Used when downloading large files, so the content is fetched a piece at a time instead of all at once.

    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    for i in ret.iter_content():
        print(i)

    # or make sure the connection is released when done:
    from contextlib import closing
    with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
        # process the response here
        for i in r.iter_content():
            print(i)

    1.9  cert: client-side SSL certificate

    1.10  verify: whether to verify the server's SSL certificate
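    A sketch combining the two (the certificate file paths are placeholders):

    import requests

    r = requests.get('https://example.com',
                     cert=('client.crt', 'client.key'),  # client certificate and private key
                     verify=False)                       # skip verification of the server's certificate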


    2.BeautifulSoup


    Beautiful Soup is a Python library for extracting data from HTML or XML files. It lets you navigate, search, and modify the document through your parser of choice, in the idiomatic ways you would expect. Beautiful Soup can save you hours or even days of work.

    2.1  Installing bs4

    pip install BeautifulSoup4

    2.2  Parsing

    import requests
    from bs4 import BeautifulSoup

    ret = requests.get("http://www.baidu.com")
    soup = BeautifulSoup(ret.text, 'html.parser')
    print(soup)    # print the parsed HTML

    2.3  The find and find_all methods

    div = soup.find(name="div", attrs={"id": "content-list"})
    # Find the div tag whose id attribute is "content-list"; returns that div with everything inside it
    items = div.find_all(name="div", attrs={"class": "item"})
    # Find all div tags whose class attribute is "item"; returns a list of every matching div

    A batch of practice examples using these two most common crawler modules:

    1. Automatically log in to Chouti and upvote in bulk

    import requests
    from bs4 import BeautifulSoup

    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
    }

    # Visit each page of the news list and grab its cookies
    for page in range(5, 6):
        pageurl = "https://dig.chouti.com/all/hot/recent/%s" % page
        response = requests.get(
            url=pageurl,
            headers=header
        )
        cookie1_dict = response.cookies.get_dict()
    # response.encoding = response.apparent_encoding
    # print(response.text)

    # Send a POST request to log in (this authorizes the cookies obtained by the first GET)
    data = {
        "phone": "********",
        "password": "*******",
        "oneMonth": 1
    }
    response1 = requests.post(url="https://dig.chouti.com/login",
                              data=data,
                              headers=header,
                              cookies=cookie1_dict)

    # Find the id of every news item on the page
    soup = BeautifulSoup(response.text, "html.parser")
    div = soup.find(name="div", attrs={"id": "content-list"})
    # print(div)
    items = div.find_all(name="div", attrs={"class": "item"})
    for item in items:
        id = item.find(name="div", attrs={"class": "part2"}).get("share-linkid")

        # Upvote the item
        response2 = requests.post(url="https://dig.chouti.com/link/vote?linksId=%s" % id,
                                  headers=header,
                                  cookies=cookie1_dict)
        print(response2.text)

    2. Automatically log in to GitHub and fetch profile information

    import requests
    from bs4 import BeautifulSoup

    # GET the login page to obtain the CSRF token and initial cookies
    res = requests.get(url="https://github.com/login")
    soup1 = BeautifulSoup(res.text, "html.parser")
    tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
    authenticity_token = tag.get('value')
    cookie1 = res.cookies.get_dict()

    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
    }

    # POST the credentials together with the token and cookies
    res_login = requests.post(url="https://github.com/session",
                              headers=header,
                              data={
                                  "commit": "Sign in",
                                  "utf8": "",
                                  "authenticity_token": authenticity_token,
                                  "login": "******",
                                  "password": "**********"
                              },
                              cookies=cookie1)
    cookie2 = res_login.cookies.get_dict()
    # print(res_login.text)

    # Fetch the profile page with the authenticated cookies
    res_message = requests.get(url="https://github.com/Aberwang",
                               headers=header,
                               cookies=cookie2)
    # print(res_message.text)
    soup2 = BeautifulSoup(res_message.text, "html.parser")
    div = soup2.find(name="div", attrs={"id": "js-pjax-container"})
    h1 = div.find(name="h1", attrs={"class": "vcard-names"})
    span = h1.find(name="span", attrs={"class": "p-nickname vcard-username d-block"})
    username = span.get_text()
    print("Username fetched:", username)
    a = div.find(name="a", attrs={"class": "u-photo d-block tooltipped tooltipped-s"})
    img = a.find(name="img", attrs={"class": "avatar width-full rounded-2"})
    src = img.get("src")
    print("Avatar URL fetched:", src)

    3. Scraping Autohome news

    import requests
    from bs4 import BeautifulSoup

    res = requests.get("https://www.autohome.com.cn/news/")  # fetch the page HTML
    res.encoding = "gbk"  # the site is GBK-encoded

    # Parse the HTML we fetched
    soup = BeautifulSoup(res.text, "html.parser")
    li_list = soup.find(id="auto-channel-lazyload-article").find_all(name="li")
    for li in li_list:
        title = li.find("h3")
        if not title:      # skip li tags that are not news entries
            continue
        summary = li.find("p")
        url = li.find("a").get("href")
        img = li.find('img').get('src')
        print(title.text, url, summary.text, img)

    4. Automatically log in to Gitee and fetch profile information

    import requests
    from bs4 import BeautifulSoup

    # Get the CSRF token from the login page
    r1 = requests.get("https://gitee.com/login")
    r1.encoding = "utf-8"
    soup = BeautifulSoup(r1.text, "html.parser")
    token = soup.find(name="input", attrs={"name": "authenticity_token"}).get("value")

    # POST the username, password, and token to the server
    data = {
        "utf8": "",
        "authenticity_token": token,
        "redirect_to_url": "",
        "user[login]": "***username***",
        "user[password]": "***password***",
        "captcha": "",
        "user[remember_me]": "0",
        "commit": "登录"
    }
    r2 = requests.post("https://gitee.com/login", data)

    # Use the logged-in cookies to fetch the profile page
    cookie_dict = r2.cookies.get_dict()
    r3 = requests.get("https://gitee.com/aberwang/projects", cookies=cookie_dict)
    print(r3.text)