  • Building a Cookies Pool

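    When scraping a website, some pages or API endpoints can usually be fetched without logging in. For large-scale scraping, however, crawling anonymously has two drawbacks. First, pages that sit behind a login cannot be fetched at all. Second, even for pages and endpoints that do not require a login, frequent requests from one IP address trigger the site's anti-crawler measures. A proxy pool works around IP bans by rotating IP addresses, but some sites no longer ban IPs and ban accounts instead. A single account can be banned, so the same idea applies: just as a proxy pool supplies many IPs, rotating requests across many accounts lowers the chance that any one of them gets banned. That means maintaining multiple accounts, and this is where a Cookies pool comes in: it simulates a login for each account, saves the resulting Cookies to a database, and checks their validity on a schedule.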

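    1. Tools

     Install the Redis database, the redis-py, requests, selenium and Flask libraries, and Google Chrome together with ChromeDriver. You also need accounts for the target site; for example, I bought some throwaway Weibo accounts (account-selling sites can be found with a Baidu search; some of them are unstable and return 404, and it is best to buy accounts that can log in without a CAPTCHA). Here we build a Cookies pool for Weibo.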

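    2. Implementing the Cookies pool

     The pool is made up of the following modules:

     The storage module stores each account's username and password, as well as the Cookies that belong to each account, and provides methods for reading and writing that data.

     The generator module obtains Cookies after login. It reads usernames and passwords from the database and simulates a login on the target page; if the login succeeds, it saves the resulting Cookies to the database.

     The tester module periodically checks whether the Cookies in the database are still valid by sending a request with them. If the response shows a logged-in state, the Cookies are valid; otherwise they are expired and get deleted, and the generator module logs in again later to produce fresh ones.

     The API module exposes the pool to other programs through a web API. The more Cookies the pool holds, the less often each account is used, so each one is less likely to be detected and banned.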
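
     How the four modules cooperate can be sketched without Redis or a browser. The following is only an illustration under stated assumptions: the two dictionaries stand in for the Redis hashes, and `fake_login`, the sample accounts and the `SUB` cookie name are hypothetical, not part of the original project.

```python
import json
import random

# In-memory stand-ins for the two Redis hashes (hypothetical sample data).
accounts = {'user_a': 'pass_a', 'user_b': 'pass_b'}
cookies = {}


def fake_login(username, password):
    """Hypothetical stand-in for the simulated browser login."""
    return {'SUB': 'token-for-' + username}


def generate():
    """Generator: log in every account that has no stored Cookies yet."""
    for username, password in accounts.items():
        if username not in cookies:
            cookies[username] = json.dumps(fake_login(username, password))


def validate(is_valid):
    """Tester: drop every entry whose Cookies no longer pass the check."""
    for username in list(cookies):
        if not is_valid(json.loads(cookies[username])):
            del cookies[username]


def random_cookies():
    """API: hand out one stored Cookies entry at random."""
    return json.loads(random.choice(list(cookies.values())))


generate()
validate(lambda c: c['SUB'] != 'token-for-user_b')  # pretend user_b expired
print(sorted(cookies))  # ['user_a']
generate()              # user_b is logged in again on the next cycle
print(sorted(cookies))  # ['user_a', 'user_b']
```

     Deleting expired entries and letting the generator fill the gaps keeps the two steps independent: the tester never needs to know how a login works.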

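    Now for the storage module. It stores two things: 1. usernames and passwords, and 2. usernames and Cookies. Both are one-to-one mappings, so a Redis hash fits well: a hash stores field-value pairs, which is exactly the structure we need. That gives two hashes, one mapping username to password and one mapping username to Cookies, with the username as the field in both.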

    import random
    import redis
    

    # Redis host
    REDIS_HOST = 'localhost'

    # Redis port
    REDIS_PORT = 6379

    # Redis password; use None if there is none
    REDIS_PASSWORD = None

    class RedisClient(object):
        def __init__(self, type, website, host=REDIS_HOST, port=REDIS_PORT, password=REDIS_PASSWORD):
            """
            Initialize the Redis connection
            :param type: kind of data stored, e.g. 'accounts' or 'cookies'
            :param website: site name, e.g. 'weibo'
            :param host: host
            :param port: port
            :param password: password
            """
            self.db = redis.StrictRedis(host=host, port=port, password=password, decode_responses=True)
            self.type = type
            self.website = website
    
        def name(self):
            """
            Get the name of the hash
            :return: hash name, e.g. 'accounts:weibo'
            """
            return "{type}:{website}".format(type=self.type, website=self.website)
    
        def set(self, username, value):
            """
            Set a field-value pair
            :param username: username
            :param value: password or Cookies
            :return:
            """
            return self.db.hset(self.name(), username, value)
    
        def get(self, username):
            """
            Get the value stored for a username
            :param username: username
            :return: password or Cookies, or None if the username is unknown
            """
            return self.db.hget(self.name(), username)
    
        def delete(self, username):
            """
            Delete the entry stored for a username
            :param username: username
            :return: result of the deletion
            """
            return self.db.hdel(self.name(), username)
    
        def count(self):
            """
            Get the number of entries
            :return: number of entries
            """
            return self.db.hlen(self.name())
    
        def random(self):
            """
            Get a random value; used to hand out random Cookies.
            Raises IndexError if the hash is empty.
            :return: random Cookies
            """
            return random.choice(self.db.hvals(self.name()))
    
        def usernames(self):
            """
            Get all usernames
            :return: all usernames
            """
            return self.db.hkeys(self.name())
    
        def all(self):
            """
            Get all field-value pairs
            :return: mapping of username to password or Cookies
            """
            return self.db.hgetall(self.name())
    
    
    if __name__ == '__main__':
        conn = RedisClient('accounts', 'weibo')
        result = conn.set('wert', 'ssdsdsf')
        print(result)

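    Note the name() method: its return value is the name of the hash that holds the field-value pairs. For example, accounts:weibo stores usernames and passwords. Running the script above adds one username/password pair to that hash.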
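
    The naming scheme, and one edge case of random(), can be tried without a Redis server. This is a small sketch, assuming nothing beyond the standard library; `hash_name` and `random_value` are hypothetical helper names that mirror name() and random(). It also shows one way to avoid the IndexError that random.choice raises on an empty list.

```python
import random


def hash_name(type_, website):
    """Build the hash name the same way RedisClient.name() does."""
    return '{type}:{website}'.format(type=type_, website=website)


def random_value(values):
    """Return a random item, or None when the hash is still empty."""
    values = list(values)
    return random.choice(values) if values else None


print(hash_name('accounts', 'weibo'))  # accounts:weibo
print(hash_name('cookies', 'weibo'))   # cookies:weibo
print(random_value([]))                # None
```

    Returning None for an empty pool lets the caller decide what to do, for example answer with an error status instead of crashing.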

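    Implementing the generator module

    To get Cookies that carry a Weibo login, we have to log in to Weibo, but the login can require a CAPTCHA or phone verification, which is awkward to automate. Fortunately Weibo has three login sites: 1. https://weibo.cn  2. https://m.weibo.cn  3. https://weibo.com. The second one is the best fit: it resembles the mobile client, and logging in there needs no CAPTCHA (provided the accounts you bought are CAPTCHA-free).

     [Screenshot: the login page]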

    import json
    from selenium import webdriver
    from selenium.webdriver import DesiredCapabilities
    
    from cookiespool.db import RedisClient
    from login.weibo.cookies import WeiboCookies

    # Browser used by the generator.
    # PhantomJS is deprecated in newer Selenium releases; 'Chrome' is the alternative handled below.
    BROWSER_TYPE = 'PhantomJS'

    class CookiesGenerator(object):
        def __init__(self, website='default'):
            """
            Base class; initializes shared objects
            :param website: site name
            """
            self.website = website
            self.cookies_db = RedisClient('cookies', self.website)
            self.accounts_db = RedisClient('accounts', self.website)
            self.init_browser()
    
        def __del__(self):
            self.close()
        
        def init_browser(self):
            """
            Initialize the shared browser used for simulated login, according to BROWSER_TYPE
            :return:
            """
            if BROWSER_TYPE == 'PhantomJS':
                caps = DesiredCapabilities.PHANTOMJS
                caps[
                    "phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
                self.browser = webdriver.PhantomJS(desired_capabilities=caps)
                self.browser.set_window_size(1400, 500)
            elif BROWSER_TYPE == 'Chrome':
                self.browser = webdriver.Chrome()
        
        def new_cookies(self, username, password):
            """
            Generate new Cookies; subclasses must override this
            :param username: username
            :param password: password
            :return:
            """
            raise NotImplementedError
        
        def process_cookies(self, cookies):
            """
            Convert the browser's list of cookie dicts into a name -> value mapping
            :param cookies: list of cookie dicts
            :return: dict of cookie names and values
            """
            result = {}
            for cookie in cookies:
                result[cookie['name']] = cookie['value']
            return result
        
        def run(self):
            """
            Fetch all accounts, then log in one by one for those without Cookies
            :return:
            """
            accounts_usernames = self.accounts_db.usernames()
            cookies_usernames = self.cookies_db.usernames()
            
            for username in accounts_usernames:
                if username not in cookies_usernames:
                    password = self.accounts_db.get(username)
                    print('Generating Cookies', 'username', username, 'password', password)
                    result = self.new_cookies(username, password)
                    # Cookies obtained successfully
                    if result.get('status') == 1:
                        cookies = self.process_cookies(result.get('content'))
                        print('Got Cookies', cookies)
                        if self.cookies_db.set(username, json.dumps(cookies)):
                            print('Cookies saved')
                    # Wrong password: remove the account
                    elif result.get('status') == 2:
                        print(result.get('content'))
                        if self.accounts_db.delete(username):
                            print('Account deleted')
                    else:
                        print(result.get('content'))
            # Runs once the loop has gone through every account
            print('Finished processing all accounts')
        
        def close(self):
            """
            Close the browser
            :return:
            """
            try:
                print('Closing Browser')
                self.browser.close()
                del self.browser
            except (AttributeError, TypeError):
                # AttributeError: the browser was never created or is already removed
                print('Browser not opened')
    
    
    class WeiboCookiesGenerator(CookiesGenerator):
        def __init__(self, website='weibo'):
            """
            Initialize the Weibo generator
            :param website: site name
            """
            CookiesGenerator.__init__(self, website)
            self.website = website
        
        def new_cookies(self, username, password):
            """
            Generate Cookies
            :param username: username
            :param password: password
            :return: dict with a status code and the Cookies or an error message
            """
            return WeiboCookies(username, password, self.browser).main()
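
    What process_cookies does before the Cookies are written to Redis can be seen on sample data. This is a minimal sketch: Selenium's get_cookies() returns a list of dicts with keys such as 'name' and 'value', but the cookie names and values below are made up for illustration.

```python
import json

# Hypothetical sample in the shape returned by Selenium's get_cookies():
# one dict per cookie, with 'name', 'value' and further fields.
browser_cookies = [
    {'name': 'SUB', 'value': 'abc123', 'domain': '.weibo.cn', 'path': '/'},
    {'name': 'SSOLoginState', 'value': '1561000000', 'domain': '.weibo.cn', 'path': '/'},
]


def process_cookies(cookies):
    """Keep only the name -> value mapping, which is all requests needs."""
    return {cookie['name']: cookie['value'] for cookie in cookies}


flat = process_cookies(browser_cookies)
stored = json.dumps(flat)  # the string that is saved in the cookies hash
print(stored)  # {"SUB": "abc123", "SSOLoginState": "1561000000"}
```

    Storing a flat JSON object means a consumer can pass the decoded dict straight to the cookies argument of requests.get.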

     This part performs the login and determines whether it succeeded

    import time
    from io import BytesIO
    from PIL import Image
    from selenium.common.exceptions import TimeoutException
    #from selenium.webdriver import ActionChains
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from os import listdir
    from os.path import abspath, dirname
    
    TEMPLATES_FOLDER = dirname(abspath(__file__)) + '/templates/'
    
    
    class WeiboCookies():
        def __init__(self, username, password, browser):
            self.url = 'https://passport.weibo.cn/signin/login?entry=mweibo&r=https://m.weibo.cn/'
            self.browser = browser
            self.wait = WebDriverWait(self.browser, 20)
            self.username = username
            self.password = password
        
        def open(self):
            """
            Open the login page, fill in the username and password, and click the login button
            :return: None
            """
            self.browser.delete_all_cookies()
            self.browser.get(self.url)
            username = self.wait.until(EC.presence_of_element_located((By.ID, 'loginName')))
            password = self.wait.until(EC.presence_of_element_located((By.ID, 'loginPassword')))
            submit = self.wait.until(EC.element_to_be_clickable((By.ID, 'loginAction')))
            username.send_keys(self.username)
            password.send_keys(self.password)
            time.sleep(1)
            submit.click()
        
        def password_error(self):
            """
            Check whether the page reports a wrong username or password
            :return: True if the error message appears, otherwise False
            """
            try:
                # This text must match the message shown on the page, so it stays in Chinese
                return WebDriverWait(self.browser, 5).until(
                    EC.text_to_be_present_in_element((By.ID, 'errorMsg'), '用户名或密码错误'))
            except TimeoutException:
                return False
        
        def login_successfully(self):
            """
            Check whether the login succeeded
            :return: True if the profile icon appears, otherwise False
            """
            try:
                return bool(
                    WebDriverWait(self.browser, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'lite-iconf-profile'))))
            except TimeoutException:
                return False
    
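        def get_cookies(self):
            """
            Get the Cookies
            :return: list of cookie dicts from the browser
            """
            return self.browser.get_cookies()
        
        def main(self):
            """
            Entry point
            :return: dict with 'status' (1 success, 2 wrong password, 3 login failed) and 'content'
            """
            self.open()
            if self.password_error():
                return {
                    'status': 2,
                    'content': 'Wrong username or password'
                }
            # Login succeeded directly, no CAPTCHA needed
            if self.login_successfully():
                cookies = self.get_cookies()
                return {
                    'status': 1,
                    'content': cookies
                }
            # Neither case matched: report a failure instead of returning None,
            # so that the caller's result.get('status') does not raise
            return {
                'status': 3,
                'content': 'Login failed'
            }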

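     The tester module: once Cookies have been obtained, their validity still has to be checked, by requesting a page with them and judging from the status code of the returned Response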

    import json
    import requests
    from requests.exceptions import ConnectionError
    from cookiespool.db import *
    
    
    class ValidTester(object):
        def __init__(self, website='default'):
            self.website = website
            self.cookies_db = RedisClient('cookies', self.website)
            self.accounts_db = RedisClient('accounts', self.website)
        
        def test(self, username, cookies):
            raise NotImplementedError
        
        def run(self):
            cookies_groups = self.cookies_db.all()
            for username, cookies in cookies_groups.items():
                self.test(username, cookies)
    
    
    class WeiboValidTester(ValidTester):
        def __init__(self, website='weibo'):
            ValidTester.__init__(self, website)
        
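        def test(self, username, cookies):
            print('Testing Cookies', 'username', username)
            try:
                cookies = json.loads(cookies)
            except (TypeError, ValueError):
                # TypeError: the value is None; ValueError: the value is not valid JSON
                print('Invalid Cookies', username)
                self.cookies_db.delete(username)
                print('Deleted Cookies', username)
                return
            try:
                # TEST_URL_MAP is listed with the scheduler settings later in this article
                # and must be importable in this module
                test_url = TEST_URL_MAP[self.website]
                # Redirects are not followed, so a redirect to the login page shows up as a non-200 status
                response = requests.get(test_url, cookies=cookies, timeout=5, allow_redirects=False)
                if response.status_code == 200:
                    print('Cookies valid', username)
                else:
                    print(response.status_code, response.headers)
                    print('Cookies expired', username)
                    self.cookies_db.delete(username)
                    print('Deleted Cookies', username)
            except ConnectionError as e:
                print('Exception occurred', e.args)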
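
     The first step of the test, decoding the stored string, is worth looking at in isolation, because json.loads fails in two different ways. A minimal sketch follows; `load_cookies` is a hypothetical helper, not a function from the project.

```python
import json


def load_cookies(raw):
    """Parse a stored Cookies string; return None if it cannot be used."""
    try:
        cookies = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: raw is None; ValueError: raw is not valid JSON
        # (json.JSONDecodeError is a subclass of ValueError).
        return None
    # Only a JSON object can be passed on as a cookies mapping.
    return cookies if isinstance(cookies, dict) else None


print(load_cookies('{"SUB": "abc123"}'))  # {'SUB': 'abc123'}
print(load_cookies('not json'))           # None
print(load_cookies(None))                 # None
```

     Catching only TypeError would let a malformed string crash the tester loop, which is why both exception types are handled here.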

     The scheduler module drives the other modules

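    import time
    from multiprocessing import Process
    
    # API_HOST and API_PORT are used by Scheduler.api() below, so import them explicitly
    from cookiespool.api import app, API_HOST, API_PORT
    from cookiespool.generator import *
    from cookiespool.tester import *

    # Generator classes; register other sites here when extending
    GENERATOR_MAP = {
        'weibo': 'WeiboCookiesGenerator'
    }

    # Tester classes; register other sites here when extending
    TESTER_MAP = {
        'weibo': 'WeiboValidTester'
    }

    # URL requested to test the Cookies of each site
    TEST_URL_MAP = {
        'weibo': 'https://m.weibo.cn/'
    }

    # Cycle of the generator and the tester, in seconds
    CYCLE = 120

    # Generator switch: simulate logins and add Cookies
    GENERATOR_PROCESS = True
    # Tester switch: repeatedly check the Cookies in the database and delete unusable ones
    VALID_PROCESS = True
    # API service switch
    API_PROCESS = True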

    class Scheduler(object):
        @staticmethod
        def valid_cookie(cycle=CYCLE):
            while True:
                print('Cookies tester process running')
                try:
                    for website, cls in TESTER_MAP.items():
                        # Look the class up by name instead of building code for eval()
                        tester = globals()[cls](website=website)
                        tester.run()
                        print('Cookies test finished')
                        del tester
                except Exception as e:
                    print(e.args)
                # Sleep outside the try block so that a failure does not cause a busy loop
                time.sleep(cycle)
        
        @staticmethod
        def generate_cookie(cycle=CYCLE):
            while True:
                print('Cookies generator process running')
                try:
                    for website, cls in GENERATOR_MAP.items():
                        generator = globals()[cls](website=website)
                        generator.run()
                        print('Cookies generation finished')
                        generator.close()
                except Exception as e:
                    print(e.args)
                time.sleep(cycle)
        
        @staticmethod
        def api():
            print('API service running')
            app.run(host=API_HOST, port=API_PORT)
        
        def run(self):
            if API_PROCESS:
                api_process = Process(target=Scheduler.api)
                api_process.start()
            
            if GENERATOR_PROCESS:
                generate_process = Process(target=Scheduler.generate_cookie)
                generate_process.start()
            
            if VALID_PROCESS:
                valid_process = Process(target=Scheduler.valid_cookie)
                valid_process.start()
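
     The scheduler finds its classes through maps of class-name strings. An alternative that needs no name lookup at all is to store the classes themselves. The sketch below assumes a stand-in tester class; `TESTER_CLASSES` and `build` are hypothetical names, not part of the project.

```python
class WeiboValidTester(object):
    """Hypothetical stand-in for the real tester class."""

    def __init__(self, website='weibo'):
        self.website = website

    def run(self):
        return 'tested ' + self.website


# Map site names to classes instead of class-name strings.
TESTER_CLASSES = {
    'weibo': WeiboValidTester,
}


def build(registry, website):
    """Instantiate the class registered for a site."""
    return registry[website](website=website)


tester = build(TESTER_CLASSES, 'weibo')
print(tester.run())  # tested weibo
```

     With classes as values, a typo in the map fails at import time rather than inside the scheduler loop, and no string is ever evaluated as code.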

     The API module

    import json
    from flask import Flask, g
    
    from cookiespool.db import *

    # API host and port
    API_HOST = '127.0.0.1'
    API_PORT = 5000

    __all__ = ['app']
    
    app = Flask(__name__)
    
    @app.route('/')
    def index():
        return '<h2>Welcome to Cookie Pool System</h2>'
    
    
    def get_conn():
        """
        Attach one Redis client per site and per hash to flask.g
        :return: flask.g with <website>_cookies and <website>_accounts attributes
        """
        # GENERATOR_MAP is listed with the scheduler settings and must be importable in this module
        for website in GENERATOR_MAP:
            # Check the attribute that is actually set below, so clients are created only once per request
            if not hasattr(g, website + '_cookies'):
                setattr(g, website + '_cookies', RedisClient('cookies', website))
                setattr(g, website + '_accounts', RedisClient('accounts', website))
        return g
    
    
    @app.route('/<website>/random')
    def random(website):
        """
        Get random Cookies, e.g. via /weibo/random
        :return: random Cookies as a JSON string
        """
        g = get_conn()
        cookies = getattr(g, website + '_cookies').random()
        return cookies
    
    
    @app.route('/<website>/add/<username>/<password>')
    def add(website, username, password):
        """
        Add an account, e.g. via /weibo/add/user/password
        :param website: site name
        :param username: username
        :param password: password
        :return: JSON status
        """
        g = get_conn()
        print(username, password)
        getattr(g, website + '_accounts').set(username, password)
        return json.dumps({'status': '1'})
    
    
    @app.route('/<website>/count')
    def count(website):
        """
        Get the total number of Cookies
        """
        g = get_conn()
        count = getattr(g, website + '_cookies').count()
        return json.dumps({'status': '1', 'count': count})
    
    
    if __name__ == '__main__':
        app.run(host='0.0.0.0')
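
     A crawler consumes the pool by calling /<website>/random and decoding the JSON body. The following client-side sketch uses only the standard library; the helper names are hypothetical, and the base URL assumes the API_HOST and API_PORT values shown above. The network call itself is only defined here, not executed.

```python
import json
from urllib.request import urlopen

# Assumed to match API_HOST and API_PORT of the API module.
API_BASE = 'http://127.0.0.1:5000'


def random_cookies_url(website, base=API_BASE):
    """Build the URL of the /<website>/random endpoint."""
    return '{}/{}/random'.format(base.rstrip('/'), website)


def parse_cookies(body):
    """Decode the JSON string returned by the endpoint into a dict."""
    return json.loads(body)


def fetch_random_cookies(website, base=API_BASE):
    """Ask a running pool for one random Cookies entry."""
    with urlopen(random_cookies_url(website, base), timeout=5) as resp:
        return parse_cookies(resp.read().decode('utf-8'))


print(random_cookies_url('weibo'))        # http://127.0.0.1:5000/weibo/random
print(parse_cookies('{"SUB": "abc123"}'))  # {'SUB': 'abc123'}
```

     With the pool running, the dict returned by fetch_random_cookies('weibo') can be passed as the cookies argument of a requests call in the crawler.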

     [Screenshot: the Cookies pool running]

    GitHub: https://github.com/jzxsWZY/CookiesPool

  • Original article: https://www.cnblogs.com/jzxs/p/11084804.html