  • Scrapy basic components series (12): simulating login in Scrapy

    1. Three ways to simulate login in Scrapy

    1.1 Carry cookies directly

    1.2 Find the login URL and send a POST request; Scrapy stores the returned cookie

    1.3 Find the login form, automatically parse its input fields and the POST URL, attach the data, and send the request (this is what scrapy.FormRequest.from_response does; see the sketch at the end of section 3)

    2. Carrying cookies directly so Scrapy can fetch pages that require login

    2.1 Use cases

    2.1.1 The cookie has a very long lifetime, common on less rigorous sites

    2.1.2 All the data can be collected before the cookie expires

    2.1.3 Cooperation with another program, e.g. use selenium to log in, save the resulting cookies locally, and have Scrapy read the local cookie file before sending requests (a sketch of this handoff follows below)
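
    A minimal sketch of the selenium-to-Scrapy handoff described in 2.1.3. The file name cookies.json, the URL, and the login steps themselves are placeholders:

    import json
    from selenium import webdriver

    # Step 1 (separate script): log in with selenium, then dump the cookies.
    driver = webdriver.Chrome()
    driver.get("http://www.renren.com/")  # placeholder: navigate to the login page
    # ... fill in the login form in the browser here ...
    with open("cookies.json", "w") as f:
        # get_cookies() returns a list of {"name": ..., "value": ...} dicts
        json.dump(driver.get_cookies(), f)
    driver.quit()

    Then, inside the spider, read the saved file before sending any request:

    def start_requests(self):
        # Step 2: load the locally saved cookies and attach them to each request
        with open("cookies.json") as f:
            cookies = {c["name"]: c["value"] for c in json.load(f)}
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)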

    2.2 Implementation: override Scrapy's start_requests method

    In Scrapy, the URLs in start_urls are handled by start_requests; its default implementation (from an older Scrapy release that still supported make_requests_from_url) is as follows:

    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    Correspondingly, if the URLs in start_urls can only be accessed after logging in, you need to override start_requests and attach the cookies there by hand.

    2.3 Logging in by carrying cookies

    import scrapy
    
    
    class ItSpider(scrapy.Spider):
        name = 'it'
        allowed_domains = ['renren.com']
        start_urls = ['http://www.renren.com/260246846/newsfeed/photo']
    
        def parse(self, response):
            print("----parse----")
            with open("renren2.html", "w") as f:
                f.write(response.body.decode())
    
        def start_requests(self):  # override the default start_requests
            # cookies_str was captured from a logged-in browser session
            cookies_str = """anonymid=jxkbmqz1k8rnj7; _r01_=1; depovince=GW; ick_login=bf826d1f-53dc-4829-81e1-da6554509e97; first_login_flag=1; ln_uact=dong4716138@163.com; ln_hurl=http://hdn.xnimg.cn/photos/hdn521/20190703/0820/main_Rdy3_c9750000c97b1986.jpg; JSESSIONID=abcHmX81Tn80iaLs-yHWw; jebecookies=210f9dee-e58e-4cb5-a3f8-777b74969dd9|||||; _de=55D1995656E8B7574112FD057B0CD36E34DF20B0B3AA6FF7; p=51eebfa2d9baf41144b0bc8858e9061b6; t=1d75b874aa18d7b78cf616e523078e0f6; societyguester=1d75b874aa18d7b78cf616e523078e0f6; id=260246846; xnsid=de931535; ver=7.0; loginfrom=null; wp_fold=0"""
            # convert cookies_str into a dict of {name: value}
            cookies_dict = {i[:i.find('=')]: i[i.find('=')+1:] for i in cookies_str.split('; ')}
    
            print(">>>cookie>>>", cookies_dict)
            for url in self.start_urls:
                yield scrapy.Request(
                    url=url,
                    callback=self.parse,
                    cookies=cookies_dict
                )

    Notes:

    1. In Scrapy, cookies cannot simply be placed in headers; Request has a dedicated cookies parameter that accepts a dict
    2. Set ROBOTSTXT_OBEY and USER_AGENT in settings.py (see the fragment below)
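
    A minimal settings.py fragment for note 2; the UA string is only an example:

    # settings.py
    ROBOTSTXT_OBEY = False  # don't let robots.txt rules block the login pages
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'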

    Note 2:

    Since every URL in start_urls passes through start_requests, a URL that must be submitted via POST by default can also be handled inside start_requests, as in the sketch below.
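
    A minimal sketch, assuming a hypothetical endpoint and form fields:

    def start_requests(self):
        # Send the first request as a POST instead of the default GET
        yield scrapy.FormRequest(
            url=self.start_urls[0],      # placeholder: an endpoint expecting POST
            formdata={"key": "value"},   # placeholder form fields
            callback=self.parse,
        )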

    Note 3:

    You can also attach cookies through a Scrapy downloader middleware, for example:

    middlewares.py:

    import requests
    import logging
    import json
    # Custom downloader middleware: attaches Weibo cookies fetched from a cookie pool
    class WeiBoMiddleWare(object):
        def __init__(self, cookies_pool_url):
            self.logging = logging.getLogger("WeiBoMiddleWare")
            self.cookies_pool_url = cookies_pool_url
     
        def get_random_cookies(self):
            try:
                response = requests.get(self.cookies_pool_url)
            except Exception as e:
                self.logging.info('Get Cookies failed: {}'.format(e))
                return None  # no cookies this time; process_request guards against this
            # The cookies set on a request must be a dict, not a raw string.
            cookies = json.loads(response.text)
            self.logging.info('Get Cookies success: {}'.format(response.text))
            return cookies
     
        @classmethod
        def from_settings(cls, settings):
            obj = cls(
                cookies_pool_url=settings['WEIBO_COOKIES_URL']
            )
            return obj
     
        def process_request(self, request, spider):
            cookies = self.get_random_cookies()
            if cookies:  # skip if the cookie pool request failed
                request.cookies = cookies
            return None

    settings.py:

    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'weibo.cn',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
    }
    DOWNLOADER_MIDDLEWARES = {
       'weibospider.middlewares.WeiBoMiddleWare': 543,
    }
     
    # Address of the Weibo cookie pool service
    WEIBO_COOKIES_URL = 'http://localhost:5000/weibo/random'

    3. Sending POST requests with scrapy.FormRequest

    scrapy.FormRequest sends POST requests; supply the request body via the formdata parameter, along with a callback:

    yield scrapy.FormRequest(
        "https://github.com/session",
        formdata={
            "authenticity_token":authenticity_token,
            "utf8":utf8,
            "commit":commit,
            "login":"dong4716138@163.com",
            "password":"xxxx"
        },
        callback=self.parse_login
    )

    3.2 Logging in to GitHub with scrapy.FormRequest()

    3.2.1 Approach

    1. Find the POST URL: click the login button while capturing traffic, which locates the URL as https://github.com/session
    2. Work out the request body: analyzing the POST body shows that every parameter it contains appears in the previous response
    3. Check whether login succeeded: request the personal homepage and look for the username

    3.2.2 Code implementation:

    # -*- coding: utf-8 -*-
    import scrapy
    import re
    
    
    class GitSpider(scrapy.Spider):
        name = 'git'
        allowed_domains = ['github.com']
        start_urls = ['https://github.com/login']
    
        def parse(self, response):
            authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
            utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
            commit = response.xpath("//input[@name='commit']/@value").extract_first()
    
            # Build the POST request and hand it over to the engine
            yield scrapy.FormRequest(
                "https://github.com/session",
                formdata={
                    "authenticity_token": authenticity_token,
                    "utf8": utf8,
                    "commit": commit,
                    "login": "993484988@qq.com",  # 填写自己的GitHub账号
                    "password": "xxxxxxx",  # 填写自己的GitHub密码
                    "webauthn - support": "supported"
                },
                callback=self.parse_login
            )
    
        def parse_login(self, response):
            # search the response for the username to confirm the login succeeded
            ret = re.findall(r"dong138", response.text)
            print(ret)
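
    As an aside, approach 1.3 from the overview corresponds to scrapy.FormRequest.from_response, which parses the form in the response and collects its hidden input fields (such as authenticity_token) automatically, so only the fields you want to override need to be supplied. A minimal sketch of the same GitHub login, replacing the parse method above:

    def parse(self, response):
        # from_response fills in the hidden inputs (authenticity_token etc.) for us
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "login": "993484988@qq.com",  # your own GitHub account
                "password": "xxxxxxx",  # your own GitHub password
            },
            callback=self.parse_login
        )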

    4. Tip

    Setting COOKIES_DEBUG = True in settings.py lets you watch the cookies being passed back and forth in the terminal.
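
    It is a one-line addition to settings.py:

    # settings.py
    COOKIES_DEBUG = True  # log the Cookie / Set-Cookie headers of each request and response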

    Summary

    1. The URLs in start_urls are handed to start_requests; override start_requests when necessary
    2. Logging in by carrying cookies directly: cookies can only be passed through the cookies parameter
    3. scrapy.FormRequest() sends POST requests
     
  • Original article: https://www.cnblogs.com/qiu-hua/p/12641510.html