  requests and bs4 web scraping

    I. Start with a few scraper examples

    The battle between scrapers and anti-scraping measures may look like the defenders hold the upper hand, but in the end the scraper usually wins; the only question is how high a price it pays. This article stays at the technical level and covers scraping basics. Scraping is essentially guesswork and gamesmanship, and experience matters: the core work is analyzing requests and web interfaces, so anyone who wants to learn scraping must first understand the web. Any data a browser can obtain, a scraper can obtain too; the only question is how convincingly it disguises itself. Let's start with a few simple scrapers.

    1 Autohome (汽车之家)

    This site is about as easy to scrape as Baidu's homepage: it has no defenses at all, so there is no need to pretend to be a browser, no cookies, and no login:

    import requests
    from bs4 import BeautifulSoup

    # The news pages are GBK-encoded, so set the encoding before reading .text
    response = requests.get("https://www.autohome.com.cn/news/")
    response.encoding = 'gbk'

    soup = BeautifulSoup(response.text, 'html.parser')

    # The article list lives in <div id="auto-channel-lazyload-article">
    div = soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article'})
    li_list = div.find_all(name='li')

    for li in li_list:
        title = li.find(name='h3')
        if not title:
            continue        # skip <li> items that are not articles
        p = li.find(name='p')
        a = li.find(name='a')

        print(title.text)
        print(a.attrs.get('href'))
        print(p.text)

        img = li.find(name='img')
        src = "https:" + img.get('src')     # the page uses protocol-relative image URLs
        print(src)

        # Make a second request to download the image
        file_name = src.rsplit('/', maxsplit=1)[1]
        ret = requests.get(src)
        with open(file_name, 'wb') as f:
            f.write(ret.content)

    2 Chouti (抽屉新热榜)

    Step one is fetching the page content, which requires pretending to be a browser:

    import requests
    from bs4 import BeautifulSoup

    # Without a browser-like User-Agent the site rejects the request
    r1 = requests.get(
        url='https://dig.chouti.com/',
        headers={
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        }
    )

    soup = BeautifulSoup(r1.text, 'html.parser')

    # A single Tag object
    content_list = soup.find(name='div', id='content-list')
    # A list of Tag objects
    item_list = content_list.find_all(name='div', attrs={'class': 'item'})
    for item in item_list:
        a = item.find(name='a', attrs={'class': 'show-content color-chag'})
        print(a.text.strip())

    Going one step further, let's upvote an article. Watch out for the trap: upvoting requires a cookie, but not the cookie returned by the login response. It is the cookie set when the page is first loaded. This means a scraper cannot simply start from the login request, because a normal browser always visits the page before logging in. The cookie used for upvoting shows this is an anti-scraping measure, and it is why a scraper author needs analytical skill and experience; otherwise even a cookie you did obtain may turn out to be a decoy. The code:

    import requests

    # 1. Load the homepage first; the cookie issued here is the one the vote API actually checks
    r1 = requests.get(
        url='https://dig.chouti.com/',
        headers={
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        }
    )

    # 2. Submit the phone number and password (register an account first), carrying the first cookie
    r2 = requests.post(
        url='https://dig.chouti.com/login',
        headers={
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        },
        data={
            'phone': 'your phone number',
            'password': 'your password',
            'oneMonth': 1
        },
        cookies=r1.cookies.get_dict()
    )

    # 3. Upvote an article, again with the cookie from step 1 (not the login response's cookie)
    r3 = requests.post(
        url='https://dig.chouti.com/link/vote?linksId=20435396',
        headers={
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        },
        cookies=r1.cookies.get_dict()
    )
    print(r3.text)

    3 Logging in to GitHub automatically

    The login here is a form submission and requires a CSRF token, which the site uses to block cross-site request forgery. In this case the token goes into the request body together with the username and password. That, too, comes from analysis: when a site's login or data submission requires a token, whether it belongs in the request headers, the request body, a cookie, or some other parameter is something you work out by performing a real login and watching the traffic. The code:

    # 1. GET the login page
    """
    - Find the hidden <input> in the HTML to get the csrf token (authenticity_token)
    - Collect the cookies
    """

    # 2. POST the username and password
    """
    - Send:
        - csrf token
        - username
        - password
    - Carry the cookies from step 1
    """

    # 3. GET a logged-in page such as https://github.com/settings/emails
    """
    - Carry the merged cookies
    """

    import requests
    from bs4 import BeautifulSoup

    # 1. Load the login page and extract authenticity_token plus the initial cookies
    i1 = requests.get('https://github.com/login')
    soup1 = BeautifulSoup(i1.text, features='lxml')
    tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
    authenticity_token = tag.get('value')
    c1 = i1.cookies.get_dict()
    i1.close()

    # 2. Send the credentials together with authenticity_token to authenticate
    form_data = {
        "authenticity_token": authenticity_token,
        "utf8": "",
        "commit": "Sign in",
        "login": "your username",
        'password': 'your password'
    }

    i2 = requests.post('https://github.com/session', data=form_data, cookies=c1)
    c2 = i2.cookies.get_dict()
    c1.update(c2)       # merge the session cookies into the original cookie dict

    # 3. Access a page that requires login and parse the repository list
    i3 = requests.get('https://github.com/settings/repositories', cookies=c1)
    soup3 = BeautifulSoup(i3.text, features='lxml')
    list_group = soup3.find(name='div', class_='listgroup')

    from bs4.element import Tag

    for child in list_group.children:
        if isinstance(child, Tag):
            project_tag = child.find(name='a', class_='mr-1')
            size_tag = child.find(name='small')
            temp = "Project: %s (%s); path: %s" % (project_tag.string, size_tag.string, project_tag.get('href'))
            print(temp)

    4 Logging in to Lagou automatically

    Here two parameters hidden in the login page's response have to be extracted and then sent back in the request headers for the login to succeed. Note that no matter whether you parse pages with bs4 or XPath, do not forget regular expressions; they solve problems the other approaches cannot. The code:

    import re
    import requests

    # Load the login page; the two anti-forgery values are embedded in the page's JavaScript,
    # so pull them out with a regular expression
    r1 = requests.get(
        url='https://passport.lagou.com/login/login.html',
        headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        }
    )
    X_Anti_Forge_Token = re.findall("X_Anti_Forge_Token = '(.*?)'", r1.text, re.S)[0]
    X_Anti_Forge_Code = re.findall("X_Anti_Forge_Code = '(.*?)'", r1.text, re.S)[0]

    # Log in: the tokens go in the request headers, the credentials in the body
    # (the 'X-Anit-Forge-*' spelling is taken from the site's own requests, not a typo here)
    r2 = requests.post(
        url='https://passport.lagou.com/login/login.json',
        headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
            'X-Anit-Forge-Code': X_Anti_Forge_Code,
            'X-Anit-Forge-Token': X_Anti_Forge_Token,
            'Referer': 'https://passport.lagou.com/login/login.html',  # the page the request came from
        },
        data={
            "isValidate": True,
            'username': 'your username',
            'password': 'ab18d270d7126ea65915c50288c22c0d',  # captured from a real login; the value is already hashed, not plaintext
            'request_form_verifyCode': '',
            'submit': ''
        },
        cookies=r1.cookies.get_dict()
    )
    print(r2.text)

    II. The requests module

    1 Request methods

    requests.get

    requests.post

    requests.put

    requests.delete

    ...and so on. Every common HTTP request method is available; to see the full list, look at the requests source code.

    An alternative, general form: requests.request(method='POST', ...)
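    As a quick sanity check, the two call styles are interchangeable. A minimal sketch (httpbin.org is used here only as a convenient echo service; it is not part of the original notes):

    import requests

    # The method-specific helpers are thin wrappers around requests.request
    r1 = requests.get('https://httpbin.org/get', params={'k': 'v'})
    r2 = requests.request(method='GET', url='https://httpbin.org/get', params={'k': 'v'})
    print(r1.status_code, r2.status_code)   # both should print 200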

    2 Request parameters

    2.1 url — the target URL

    2.2 headers — request headers (User-Agent, Referer, custom tokens, ...)

    2.3 cookies — cookies to send with the request

    2.4 params — query-string parameters appended to the URL
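    A minimal sketch combining 2.1–2.4 in one call (the cookie and parameter values are made up for illustration):

    import requests

    response = requests.get(
        url='https://dig.chouti.com/',          # 2.1 url
        headers={                               # 2.2 headers
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        },
        cookies={'sessionid': 'xxxx'},          # 2.3 cookies (or r.cookies.get_dict() from an earlier response)
        params={'page': 1},                     # 2.4 params, sent as ?page=1
    )
    print(response.status_code)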

    2.5 data — form-encoded request body

    requests.post(
        ...,
        data={'user': 'liuneng', 'pwd': '123'}
    )

    # Roughly what goes over the wire:
    # POST /index HTTP/1.1
    # Host: c1.com
    # Content-Type: application/x-www-form-urlencoded
    #
    # user=liuneng&pwd=123

    2.6 json — JSON request body

    requests.post(
        ...,
        json={'user': 'liuneng', 'pwd': '123'}
    )

    # Roughly what goes over the wire:
    # POST /index HTTP/1.1
    # Host: c1.com
    # Content-Type: application/json
    #
    # {"user": "liuneng", "pwd": "123"}

    2.7 Proxies: proxies

    # A proxy without authentication
    proxie_dict = {
        "http": "http://61.172.249.96:80",
        "https": "http://61.185.219.126:3128",
    }
    ret = requests.get("https://www.proxy360.cn/Proxy", proxies=proxie_dict)


    # A proxy that requires authentication
    from requests.auth import HTTPProxyAuth

    proxyDict = {
        'http': '77.75.105.165',
        'https': '77.75.106.165'
    }
    auth = HTTPProxyAuth('username', 'password')

    r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
    print(r.text)
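    Alternatively (not shown in the original notes), requests also accepts credentials embedded directly in the proxy URL; a small sketch with placeholder host and credentials:

    import requests

    # user:password@host:port form; every value here is a placeholder
    proxies = {
        'http': 'http://proxyuser:proxypass@77.75.105.165:8080',
        'https': 'http://proxyuser:proxypass@77.75.105.165:8080',
    }
    r = requests.get('https://httpbin.org/ip', proxies=proxies)
    print(r.text)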
    ----------------------------------------------------------------------------------------- The parameters above are the ones you must master.

    2.8 File upload: files

    # Upload a file
    file_dict = {
        'f1': open('xxxx.log', 'rb')
    }
    requests.request(
        method='POST',
        url='http://127.0.0.1:8000/test/',
        files=file_dict
    )
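    requests also accepts a tuple per field to control the uploaded filename and content type; a small sketch (the field name, filename, and URL are made up for illustration):

    import requests

    file_dict = {
        # (filename as seen by the server, file object, content type)
        'f1': ('report.log', open('xxxx.log', 'rb'), 'text/plain'),
    }
    requests.post('http://127.0.0.1:8000/test/', files=file_dict)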

    2.9 Basic authentication: auth

    Internally, the username and password are combined, base64-encoded, and sent to the server in a request header:

    - "user:password"
    - base64("user:password")
    - "Basic " + base64("user:password")
    - Request header:
      Authorization: "Basic base64(user:password)"

    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)
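    To make the internals above concrete, here is a sketch that builds the same Authorization header by hand (the credentials are the same placeholders as above); it produces the same header that HTTPBasicAuth does:

    import base64
    import requests

    user, password = 'wupeiqi', 'sdfasdfasdf'
    token = base64.b64encode(('%s:%s' % (user, password)).encode('utf-8')).decode('utf-8')

    ret = requests.get(
        'https://api.github.com/user',
        headers={'Authorization': 'Basic ' + token},  # same header HTTPBasicAuth would attach
    )
    print(ret.status_code)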

    2.10 Timeout: timeout

    # A single number applies to both the connect and the read timeout (seconds)
    # ret = requests.get('http://google.com/', timeout=1)
    # print(ret)

    # A tuple sets (connect timeout, read timeout) separately
    # ret = requests.get('http://google.com/', timeout=(5, 1))
    # print(ret)

    2.11 Redirect handling: allow_redirects

    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)
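    A small sketch of the difference (httpbin.org is used here only as a convenient redirect target; it is not from the original notes): with redirects allowed, response.history records the intermediate responses; with them disabled, you get the 3xx response itself and can read its Location header.

    import requests

    r1 = requests.get('http://httpbin.org/redirect/1')                          # follows the redirect
    print(r1.status_code, [resp.status_code for resp in r1.history])            # e.g. 200 [302]

    r2 = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)   # stops at the redirect
    print(r2.status_code, r2.headers.get('Location'))                           # e.g. 302 /get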

    2.12 Streaming large downloads: stream

    from contextlib import closing
    import requests

    with closing(requests.get('http://httpbin.org/get', stream=True)) as r1:
        # Process the response here piece by piece instead of loading it all into memory
        for i in r1.iter_content():
            print(i)
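    For an actual large download, a common pattern (a sketch; the URL and output filename are placeholders) is to iterate in fixed-size chunks and write straight to disk:

    import requests

    with requests.get('http://httpbin.org/bytes/102400', stream=True) as r:
        r.raise_for_status()
        with open('downloaded.bin', 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):   # read 8 KB at a time
                if chunk:
                    f.write(chunk)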

    2.13 Client certificates: cert

    - Sites like Baidu or Tencent => you do not need to supply a certificate (the system's trust store handles it)
    - Supplying your own (client) certificate:
      requests.get('https://127.0.0.1:8000/test/', cert="xxxx/xxx/xxx.pem")
      requests.get('https://127.0.0.1:8000/test/', cert=("xxxx/xxx/xxx.pem", "xxx.xxx.xx.key"))

    2.14 Skipping certificate verification: verify=False

    # Disable verification of the server's certificate (e.g. for a self-signed cert)
    requests.get('https://127.0.0.1:8000/test/', verify=False)
