  • The Scrapy framework

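    The Scrapy crawling framework

    Introduction

    # A general-purpose web crawling framework; it is to crawling what Django is to web development
    
    # Scrapy execution flow
        5 core components
            - ENGINE: the overall coordinator; controls how data flows between the other components
            - SCHEDULER: decides which URL to crawl next and removes duplicate requests
            - DOWNLOADER: downloads page content and returns it to the ENGINE; it is built on Twisted, an efficient asynchronous networking framework
            - SPIDERS: classes written by the developer that parse responses and either extract items or issue new requests
            - ITEM PIPELINES: process items after they are extracted, mainly cleaning, validation, and persistence (for example, saving to a database)
        2 kinds of middleware
            - Spider middleware: sits between the ENGINE and the SPIDERS; processes spider input and output (rarely used)
            - Downloader middleware: sits between the ENGINE and the DOWNLOADER; used to add proxies, add headers, or integrate selenium
            
    # Developers only need to put specific code in specific places; most of the code goes into the spider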
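    The flow between the five components can be pictured with a toy loop. This is only an illustrative sketch in plain Python, not Scrapy's implementation: the names Scheduler, fake_download, spider_parse, and run_engine, and the in-memory SITE dict, are all made up for this example.

```python
from collections import deque


class Scheduler:
    """Queues URLs and drops the ones that were already scheduled (dedup)."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def fake_download(url, site):
    """Stand-in for the DOWNLOADER: looks the page up in an in-memory dict."""
    return site.get(url, {"title": None, "links": []})


def spider_parse(url, page):
    """Stand-in for a SPIDER callback: yields one item, then follow-up URLs."""
    yield {"url": url, "title": page["title"]}
    for link in page["links"]:
        yield link


def run_engine(start_url, site):
    """Stand-in for the ENGINE: moves data between the other components."""
    scheduler = Scheduler()
    collected = []
    scheduler.enqueue(start_url)
    url = scheduler.next_request()
    while url is not None:
        page = fake_download(url, site)
        for result in spider_parse(url, page):
            if isinstance(result, dict):
                collected.append(result)    # an item: goes to the ITEM PIPELINE step
            else:
                scheduler.enqueue(result)   # a new URL: goes back to the SCHEDULER
        url = scheduler.next_request()
    return collected


# A tiny hypothetical site with a link cycle between "/" and "/a".
SITE = {
    "/": {"title": "home", "links": ["/a", "/b"]},
    "/a": {"title": "page a", "links": ["/", "/b"]},
    "/b": {"title": "page b", "links": []},
}
items = run_engine("/", SITE)
print([item["title"] for item in items])
```

    Each page is visited once even though the pages link back to each other, because the scheduler refuses URLs it has already seen.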
    

    Installation

    # 1 pip3 install scrapy (macOS, Linux)
    
    # 2 On Windows (this works for about 80% of people; a minority cannot get it to install)
        1. pip3 install wheel  # once installed, packages can be installed from wheel files; wheel files are available at https://www.lfd.uci.edu/~gohlke/pythonlibs
        
        3. pip3 install lxml
        
        4. pip3 install pyopenssl
        
        5. Download and install pywin32:  # pip3 install pywin32
        https://sourceforge.net/projects/pywin32/files/pywin32/
            
        6. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
            
        7. Run pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
        
        8. pip3 install scrapy
        
    # 3 The scrapy command is now available
        - D:\Python36\Scripts\scrapy.exe, used to create projects
    

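    Creating a project, creating a spider, running a spider

    1 scrapy startproject <project name>
        - scrapy startproject firstscrapy
        
    2 Create a spider
        - cd <project name>
        - scrapy genspider <spider name> <domain to crawl>
        - scrapy genspider chouti dig.chouti.com
        - Running this creates a .py file named chouti in the spiders folder
        
    3 Run the spider; set ROBOTSTXT_OBEY to False in settings.py so that robots.txt is not obeyed
        - scrapy crawl chouti   # with run logs
        - scrapy crawl chouti --nolog  # without logs
        
    4 Running the spider from the IDE (right-click, Run)
        - Create a main.py in the project directory
        from scrapy.cmdline import execute
        execute(['scrapy','crawl','chouti','--nolog'])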
    

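    Directory layout

    tiktok                      # project name
        scrapy.cfg              # deployment-related
        tiktok                  # package
            __init__.py
            items.py            # item classes, one class per kind of item
            main.py             # added by hand; entry script that runs the spider
            middlewares.py      # middleware (spider and downloader middleware both go here)
            pipelines.py        # persistence code (receives objects of the classes in items.py)
            settings.py         # configuration file
            spiders             # all spider files go in here
                __init__.py
                baidu.py        # one file per spider (most of your code ends up here)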
    

    Settings overview

    1 By default, Scrapy obeys robots.txt
    
    2 Change this setting to crawl regardless of robots.txt
        ROBOTSTXT_OBEY = False
        
    3 USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'	# set your own client identification
    
    4 LOG_LEVEL='ERROR'	# set the log level
    

    Crawling Chouti news

    # firstscrapy/spiders/chouti.py
    import scrapy
    from firstscrapy.items import ChoutiItem
    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['dig.chouti.com']
        start_urls = ['http://dig.chouti.com/']
    
        def parse(self, response):
    
            div_list = response.xpath('//div[contains(@class,"link-item")]')
            for div in div_list:
                item = ChoutiItem()	# create the item object
                title = div.css('.link-title::text').extract_first()
                url = div.css('.link-title::attr(href)').extract_first()
                photo_url = div.css('.image-scale::attr(src)').extract_first()
                if not photo_url:
                    photo_url = ''
                # item.title = title would fail: Item fields must be set as item['key'] = value
                # item.url=url
                # item.photo_url=photo_url
                item['title'] = title
                item['news_url'] = url
                item['img'] = photo_url
                yield item  # must be inside the for loop, otherwise only one item is yielded
    
    # firstscrapy/pipelines.py
    import pymysql
    import redis


    class FirstscrapyPipeline(object):
    
        # Runs once when the spider starts: open the MySQL and Redis connections
        def open_spider(self, spider):
            self.conn_mysql = pymysql.connect(host='127.0.0.1',
                                              port=3306,
                                              user='root',
                                              password='123',
                                              db='chouti',
                                              charset='utf8')
            self.i = 0  # counter used to build the Redis key
            self.conn_redis = redis.Redis(host='127.0.0.1', port=6379)
    
        # Runs once when the spider finishes
        def close_spider(self, spider):
            print('Finished writing')
            self.conn_mysql.close()
    
        # Persistence: write each item to MySQL and Redis
        def process_item(self, item, spider):
            # MySQL
            cursor = self.conn_mysql.cursor()
            sql = 'insert into article(title,img,news_url) values(%s,%s,%s)'
            cursor.execute(sql, [item['title'], item['img'], item['news_url']])
            print(item['title'])
            self.conn_mysql.commit()
    
            # Redis: store each article as a hash
            # (hset with mapping= replaces hmset, which is deprecated in recent redis-py versions)
            self.conn_redis.hset(name=f'article{self.i}',
                                 mapping={'title': item['title'], 'img': item['img'],
                                          'news_url': item['news_url']})
            self.i += 1
            return item    
        
    # firstscrapy/items.py	# the class that the item objects above are instances of
    
    import scrapy
    
    
    class ChoutiItem(scrapy.Item):
        title = scrapy.Field()
        img = scrapy.Field()
        news_url = scrapy.Field()
        
    # firstscrapy/main.py    # entry script
    
    from scrapy.cmdline import execute
    
    execute(['scrapy', 'crawl', 'chouti'])
    
    # firstscrapy/settings.py
    
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    ROBOTSTXT_OBEY = False
    LOG_LEVEL='ERROR'	# log level to print
    

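    Parsing data in Scrapy (important)

    # extract() returns a list of every match; extract_first() returns the first match, or None if there is none
    # xpath
        - response.xpath('//a[contains(@class,"link-title")]/text()').extract()  # get text
        - response.xpath('//a[contains(@class,"link-title")]/@href').extract()  # get an attribute
    # css
        - response.css('.link-title::text').extract()  # get text
        - response.css('.link-title::attr(href)').extract_first()  # get an attribute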
    

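    Persisting data in Scrapy (important)

    # 1 Option 1: the parse function must return a list of dicts (good to know; rarely used)
        
    # 2 Option 2 (more advanced): store items through an item pipeline (MySQL, Redis, files)
        - Write a class in items.py
        - In the spider, import it, instantiate it, and fill in the data
                item['title']=title
                item['url']=url
                item['photo_url']=photo_url
                yield item
                
        - Register the pipeline in settings (the smaller the number, the higher the priority)
            ITEM_PIPELINES = {
                'firstscrapy.pipelines.ChoutiFilePipeline': 300,
            }
        - Write ChoutiFilePipeline in pipelines.py
            - open_spider (runs when the spider starts)
            - close_spider (runs when the spider finishes)
            - process_item (do the persistence here)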
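    The post names ChoutiFilePipeline but does not show it. A minimal sketch of what such a file pipeline could look like follows; the output file name chouti.txt, the tab-separated line format, and the path argument are assumptions made for this example, not the author's code.

```python
import os
import tempfile


class ChoutiFilePipeline:
    """Writes one line per item to a text file (hypothetical implementation)."""

    def __init__(self, path="chouti.txt"):
        self.path = path    # output file name is an assumption
        self.file = None

    def open_spider(self, spider):
        # Runs once when the spider starts: open the file a single time.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Runs for every item the spider yields.
        self.file.write(f"{item['title']}\t{item['url']}\n")
        return item    # hand the item on to the next pipeline, if any

    def close_spider(self, spider):
        # Runs once when the spider finishes.
        self.file.close()


# Drive the three hooks by hand, the way Scrapy would during a crawl.
demo_path = os.path.join(tempfile.mkdtemp(), "chouti.txt")
pipeline = ChoutiFilePipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "hello", "url": "https://example.com/1"}, spider=None)
pipeline.close_spider(spider=None)
with open(demo_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

    Opening the file in open_spider rather than in process_item avoids reopening it for every item, and returning the item keeps any lower-priority pipelines working.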
    

    Automatically upvoting Chouti news

    from selenium import webdriver
    import time
    import requests
    
    bro = webdriver.Chrome(executable_path='./chromedriver.exe')  # path to the browser driver
    bro.implicitly_wait(10)  # implicit wait
    bro.get('https://dig.chouti.com/')
    bro.maximize_window()  # maximize the window
    login_b = bro.find_element_by_id('login_btn')
    # print(login_b)
    login_b.click()
    
    username = bro.find_element_by_name('phone')
    username.send_keys('18666550526')
    password = bro.find_element_by_name('password')
    password.send_keys('******')
    
    button = bro.find_element_by_css_selector('button.login-btn')
    button.click()
    # a captcha may appear; solve it by hand during this pause
    time.sleep(10)
    
    my_cookie = bro.get_cookies()  # a list of dicts, one per cookie
    print(my_cookie)
    bro.close()
    
    # requests expects a plain name -> value dict, so convert the list
    cookie = {}
    for item in my_cookie:
        cookie[item['name']] = item['value']
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Referer': 'https://dig.chouti.com/'}
    # ret = requests.get('https://dig.chouti.com/',headers=headers)
    # print(ret.text)
    
    
    ret = requests.get('https://dig.chouti.com/top/24hr?_=1596677637670', headers=headers)
    # print(ret.json())
    ll = []
    for item in ret.json()['data']:
        ll.append(item['id'])
    
    print(ll)
    # upvote each link
    for link_id in ll:
        ret = requests.post('https://dig.chouti.com/link/vote', headers=headers, cookies=cookie, data={'linkId': link_id})
        print(ret.text)
    
    # Endpoint for posting a comment, with its form fields:
    # https://dig.chouti.com/comments/create
    #   content: <comment text>
    #   linkId: 29829529
    #   parentId: 0
    
    

    Crawling all of cnblogs

    # secondscarpy/spiders/cnblogs.py
    import scrapy
    
    from secondscarpy.items import CnblogsItem
    from scrapy import Request
    
    class CnblogSpider(scrapy.Spider):
        name = 'cnblogs'
        allowed_domains = ['cnblogs.com']
        start_urls = ['https://www.cnblogs.com']
    
        def parse(self, response):
            div_list = response.css('article.post-item')
            for div in div_list:
                item = CnblogsItem()
                title = div.xpath('.//div[1]/a/text()').extract_first()
                item['title'] = title
                url = div.xpath('.//div[1]/a/@href').extract_first()
                item['url'] = url
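                # extract_first() returns None when nothing matches, so fall back to ''
                desc = (div.xpath('.//div[1]/p/text()').extract_first() or '').strip()
                item['desc'] = desc
                # Follow the link to crawl the article detail page
                # If callback is omitted, the response is passed to parse by default
                # If it is given, the response is passed to the method you name
                yield Request(url, callback=self.parser_detail, meta={'item': item})
    
            # Work out the URL of the next page (there is none on the last page)
            next_href = response.css('#paging_block>div a:last-child::attr(href)').extract_first()
            if next_href:
                # urljoin resolves the relative href against the current page URL
                yield Request(response.urljoin(next_href))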
    
        def parser_detail(self, response):
            content = response.css('#cnblogs_post_body').extract_first()
            # print(str(content))
            # the item was passed along from parse via meta
            item = response.meta.get('item')
            item['content'] = content
            yield item
    
    # secondscarpy/items.py
    import scrapy
    
    
    class CnblogsItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()
        desc = scrapy.Field()
        content = scrapy.Field()
        
    # secondscarpy/pipelines.py
    import pymysql
    
    
    class SecondscarpyPipeline:
        def open_spider(self, spider):
            # spider is the running spider object
            print('-------', spider.name)
            # counter for the number of items crawled
            self.i  = 0
            self.conn = pymysql.connect(host='127.0.0.1', user='root', password="123", database='cnblogs', port=3306,
                                        autocommit=True)
    
        def process_item(self, item, spider):
            cursor = self.conn.cursor()
            sql = 'insert into article (title,url,content,`desc`) values (%s,%s,%s,%s)'
            cursor.execute(sql, [item['title'], item['url'], item['content'], item['desc']])
            # self.conn.commit()  # not needed: the connection uses autocommit=True
            self.i += 1
            print(self.i)  # print the running count
            return item
    
        def close_spider(self, spider):
            self.conn.close()
    
    # secondscarpy/settings.py: set the following
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'  # set your own client identification
    LOG_LEVEL='ERROR'
    ROBOTSTXT_OBEY = False
    

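    Improving Scrapy crawl throughput

    - All of these are set in settings.py (Scrapy also ships a built-in set of default settings)
    # 1 Increase concurrency:
    By default Scrapy allows 16 concurrent requests (CONCURRENT_REQUESTS); these are asynchronous requests, not threads. The value can be raised where appropriate, for example CONCURRENT_REQUESTS = 100 in settings.py.
    # 2 Lower the log level:
    Scrapy prints a large amount of log output while running. To reduce CPU usage, restrict logging to INFO or ERROR. In settings.py: LOG_LEVEL = 'INFO'
    # 3 Disable cookies:
    If cookies are not actually needed, disabling them reduces CPU usage and speeds up the crawl. In settings.py: COOKIES_ENABLED = False
    # 4 Disable retries:
    Re-sending failed HTTP requests slows the crawl down, so retries can be turned off. In settings.py: RETRY_ENABLED = False
    # 5 Reduce the download timeout:
    When crawling very slow links, a shorter download timeout lets stuck requests be abandoned quickly. In settings.py: DOWNLOAD_TIMEOUT = 10 sets the timeout to 10 seconds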
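    Put together, the five adjustments above might look like this in settings.py. The values are the ones suggested in this post, not tuned recommendations; whether 100 concurrent requests is appropriate depends on the target site and your bandwidth.

```python
# settings.py (fragment)
CONCURRENT_REQUESTS = 100   # Scrapy's default is 16
LOG_LEVEL = "ERROR"         # or "INFO"; less logging means less CPU spent on output
COOKIES_ENABLED = False     # only if the site works without cookies
RETRY_ENABLED = False       # failed requests are not sent again
DOWNLOAD_TIMEOUT = 10       # seconds; Scrapy's default is 180
```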
    

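    Scrapy downloader middleware

    # 1 All middleware is written in middlewares.py
    
    # 2 Spider middleware
    
    # 3 Downloader middleware
    
    # 4 A middleware only takes effect once it is enabled in settings.py
    
    # Inside the downloader middleware (SecondscarpyDownloaderMiddleware)
    - process_request: what happens next depends on the return value (None: continue processing; Response: skip the download and use it; Request: schedule the new request). This is where proxies and so on are added
            # 1 Replace request headers
            # print(type(request.headers))
            # print(request.headers)
            #
            # from scrapy.http.headers import Headers
            # request.headers['User-Agent']=''
    
            # 2 Add cookies (from a cookie pool)
            # assuming a cookie pool has already been set up
            # print('00000--',request.cookies)
            # request.cookies={'username':'asdfasdf'}
    
            # 3 Add a proxy
            # print(request.meta)
            # request.meta['download_timeout'] = 20
            # request.meta["proxy"] = 'http://27.188.62.3:8060'
            # return None
    - process_response: what happens next depends on the object it returns
    - process_exception:
    def process_exception(self, request, exception, spider):
            print('xxxx')
            # request.url is read-only and cannot be assigned directly
            # request.url='https://www.baidu.com'
            from scrapy import Request
            # build a new request instead; returning it schedules it for download
            request=Request(url='https://www.baidu.com',callback=spider.parse)
            return request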
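    The header and proxy steps listed under process_request can be sketched as a small middleware class. The class name RandomHeaderProxyMiddleware and the USER_AGENTS and PROXIES pools are hypothetical; in a real project the proxies would come from a working proxy pool. The stand-in request object only mimics the two attributes the method touches, so the sketch runs without Scrapy installed.

```python
import random
from types import SimpleNamespace

# Hypothetical pools; replace with real values in a project.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
]
PROXIES = ["http://127.0.0.1:8888"]


class RandomHeaderProxyMiddleware:
    """Downloader middleware sketch: rotate the User-Agent and set a proxy."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        request.meta["proxy"] = random.choice(PROXIES)
        return None    # None tells Scrapy to keep processing this request


# Stand-in for a scrapy Request, with just the attributes used above.
fake_request = SimpleNamespace(headers={}, meta={})
result = RandomHeaderProxyMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"], fake_request.meta["proxy"])
```

    To take effect in a project, the class would have to be listed in DOWNLOADER_MIDDLEWARES in settings.py, as point 4 above says.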
    

    Using selenium inside Scrapy

    # All requests made by the spider share the same selenium browser instance
    
    # 1 Create the webdriver object in the spider
    
    import scrapy
    from selenium import webdriver
    class CnblogSpider(scrapy.Spider):
        name = 'cnblogs'
        allowed_domains = ['cnblogs.com']
        start_urls = ['https://www.cnblogs.com']
        bro = webdriver.Chrome(executable_path='../chromedriver.exe')
        def parse(self, response):
            '''
            selenium operations can be written here
            :param response:
            :return:
            '''
            print(response.status)
    
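    # 2 Use it in the middleware (inside process_request)
    
        from scrapy.http import HtmlResponse
        spider.bro.get('https://dig.chouti.com/')
        # wrap the page rendered by the browser in a response object and return it, so Scrapy skips its own download
        response = HtmlResponse(url='https://dig.chouti.com/', body=spider.bro.page_source.encode('utf-8'), request=request)
        return response
    	
    # 3 Close it in the spider
        def close(self, reason):
            print("spider finished")
            self.bro.close()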
    
    

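    Dedup rules: source code analysis

    # Source code analysis of request deduplication
    # from scrapy.core.scheduler import Scheduler
    # In Scheduler, the enqueue_request(self, request) method decides whether a request is filtered out as a duplicate:
        if not request.dont_filter and self.df.request_seen(request):
           # request is a Request object; self.df is an RFPDupeFilter object
    # To write your own dedup class:
      - write a class that inherits from BaseDupeFilter
      - override def request_seen(self, request):
      - register it in settings: DUPEFILTER_CLASS = '<project name>.dup.UrlFilter'
                
              
    - Incremental crawling (for example, 100 links crawled last time, 150 links available now)
      - Store the URLs already crawled somewhere persistent (MySQL, or a Redis set)
      - The default filter keeps crawled fingerprints in memory, so they are lost whenever the project restarts and it no longer knows what was crawled; that is why you need to write your own dedup scheme
    - A custom dedup scheme can also use less memory
        - bitmap
        - Bloom filter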
      
    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint
    
    # These two URLs count as the same request: same parameters, different order
    requests1=Request(url='https://www.baidu.com?name=lqz&age=19')
    requests2=Request(url='https://www.baidu.com?age=19&name=lqz')
    
    ret1=request_fingerprint(requests1)
    ret2=request_fingerprint(requests2)
    print(ret1)
    print(ret2)
    
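    # Bitmap dedup: each small cell stands for one URL, so a single bit records whether that URL has been seen; 32 bits can record 32 URLs
    # https://www.baidu.com?age=18&name=lqz ---> 44   (the URL maps to bit position 44)
    # https://www.baidu.com?age=19&name=lqz ---> 89
    # By comparison, storing a value such as c2c73dfccf73bf175b903c82b06a31bc7831b545 directly: suppose it takes 4 bytes, 4*8 = 32 bits
    # storing one URL that way takes 32 bits
    # 10 URLs take 320 bits
    # Unit of measurement
    # bit: can only hold 0 or 1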
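    The bitmap idea can be sketched with the standard library alone. This is an illustration, not Scrapy's own fingerprinting: the fingerprint function here sorts the query parameters and hashes the URL with SHA-1, and the BitmapSeen class and its default size of 2**20 bits are assumptions chosen for the example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(url):
    """Hash a URL after sorting its query parameters, so their order does not matter."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class BitmapSeen:
    """Records each URL as a single bit inside a fixed-size bytearray."""

    def __init__(self, size_bits=1 << 20):
        self.size = size_bits
        self.bits = bytearray(size_bits // 8)

    def seen(self, url):
        """Return True if the URL's bit is already set; otherwise set it and return False."""
        position = int(fingerprint(url), 16) % self.size
        byte_index, mask = position // 8, 1 << (position % 8)
        if self.bits[byte_index] & mask:
            return True
        self.bits[byte_index] |= mask
        return False


bitmap = BitmapSeen()
first = bitmap.seen("https://www.baidu.com?name=lqz&age=19")
second = bitmap.seen("https://www.baidu.com?age=19&name=lqz")   # same parameters, reordered
print(first, second, len(bitmap.bits))
```

    The whole filter occupies 128 KiB no matter how many URLs are recorded. The price is that two different URLs can land on the same bit, in which case the second one is wrongly reported as seen.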
    
    
        def request_seen(self, request):
            # request_fingerprint turns the request object into a unique string, e.g. aefasdfeasd
            fp = self.request_fingerprint(request)
            # if fp is already in the set, the request was crawled before: return True and it is skipped
            if fp in self.fingerprints:
                return True
            # otherwise, add it to the set
            self.fingerprints.add(fp)
            if self.file:
                self.file.write(fp + os.linesep)
    

    Distributed crawling with scrapy-redis

    1 pip3 install scrapy-redis
    
    2 Inherit from RedisSpider instead of Spider
    
    3 Do not define start_urls = ['https://www.cnblogs.com/']
    
    4 Define redis_key = 'myspider:start_urls' instead
    
    5 Settings
    # Redis connection
    REDIS_HOST = 'localhost'                            # host name
    REDIS_PORT = 6379                                   # port
    #REDIS_PASS = 'redis@Pssw0rd'						# add this if Redis requires a password
    # Use the scrapy-redis dedup filter
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scrapy-redis scheduler
    # (this is the setting that makes the crawl distributed)
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Persisting items to Redis is optional
    ITEM_PIPELINES = {
       'scrapy_redis.pipelines.RedisPipeline': 299	# items can go to both Redis and MySQL; just add the MySQL pipeline as well
    }
    
    # Code
    # cn_redis/spiders/c_redis.py
    
    from scrapy_redis.spiders import RedisSpider
    from cn_redis.items import CnRedisItem
    from scrapy import Request
    
    
    class CnblogSpider(RedisSpider):
        name = 'cn_redis'
        allowed_domains = ['www.cnblogs.com']
        redis_key = 'myspider:start_urls'
    
        def parse(self, response):
            div_list = response.css('article.post-item')
            for div in div_list:
                item = CnRedisItem()
                title = div.xpath('.//div[1]/a/text()').extract_first()
                item['title'] = title
                url = div.xpath('.//div[1]/a/@href').extract_first()
                item['url'] = url
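                # extract_first() returns None when nothing matches, so fall back to ''
                desc = (div.xpath('.//div[1]/p/text()').extract_first() or '').strip()
                item['desc'] = desc
                # Follow the link to crawl the article detail page
                # If callback is omitted, the response is passed to parse by default
                # If it is given, the response is passed to the method you name
                yield Request(url, callback=self.parser_detail, meta={'item': item})
    
            # Work out the URL of the next page (there is none on the last page)
            next_href = response.css('#paging_block>div a:last-child::attr(href)').extract_first()
            if next_href:
                print(next_href)
                yield Request(response.urljoin(next_href))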
    
        def parser_detail(self, response):
            content = response.css('#cnblogs_post_body').extract_first()
            # the item was passed along from parse via meta
            item = response.meta.get('item')
            item['content'] = content
            yield item
            
    # cn_redis/items.py
    
    import scrapy
    
    
    class CnRedisItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()
        desc = scrapy.Field()
        content = scrapy.Field()
        
    # cn_redis/pipelines.py
    import pymysql
    
    
    class CnRedisPipeline(object):
        def open_spider(self, spider):
            # spider is the running spider object
            print('-------', spider.name)
            # counter for the number of items crawled
            self.i = 0
            self.conn = pymysql.connect(host='127.0.0.1', user='root', password="123", database='cnblogs', port=3306,
                                        autocommit=True)
    
        def process_item(self, item, spider):
            cursor = self.conn.cursor()
            sql = 'insert into article (title,url,content,`desc`) values (%s,%s,%s,%s)'
            cursor.execute(sql, [item['title'], item['url'], item['content'], item['desc']])
            # self.conn.commit()  # not needed: the connection uses autocommit=True
            self.i += 1
            print('items crawled so far:', self.i)  # print the running count
            return item
    
        def close_spider(self, spider):
            self.conn.close()
            
    # settings.py
    # Redis connection
    REDIS_HOST = 'localhost'  # host name
    REDIS_PORT = 6379  # port
    # REDIS_PASS = 'redis@Pssw0rd'						# add this if Redis requires a password
    # Use the scrapy-redis dedup filter
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scrapy-redis scheduler
    # (this is the setting that makes the crawl distributed)
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    LOG_LEVEL='ERROR'	# set the log level
    ITEM_PIPELINES = {
        'cn_redis.pipelines.CnRedisPipeline': 300,
        'scrapy_redis.pipelines.RedisPipeline': 299
    }
    SPIDER_MIDDLEWARES = {
       'cn_redis.middlewares.CnRedisSpiderMiddleware': 543,
    }
    ROBOTSTXT_OBEY = False
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    BOT_NAME = 'cn_redis'
    
    SPIDER_MODULES = ['cn_redis.spiders']
    NEWSPIDER_MODULE = 'cn_redis.spiders'
    
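    # How to start: run the same command on every machine (or in several terminals)
    local: scrapy crawl cn_redis
    local2: scrapy crawl cn_redis
    local3: scrapy crawl cn_redis
    
    Then connect to Redis from a terminal and push the start URL: lpush myspider:start_urls https://www.cnblogs.com/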
    

    Cracking Zhihu login (JS reverse engineering and decryption)

    client_id=c3cef7c66a1843f8b3a9e6a1e3160e20&
    grant_type=password&
    timestamp=1596702006088&
    source=com.zhihu.web&
    signature=eac4a6c461f9edf86ef33ef950c7b6aa426dbb39&
    username=%2B86liuqingzheng&
    password=1111111&
    captcha=&
    lang=en&
    utm_source=&
    ref_source=other_https%3A%2F%2Fwww.zhihu.com%2Fsignin%3Fnext%3D%252F
    
    
    # Cracking Zhihu login
    
    import requests    # HTTP request library
    
    import base64							  # base64 encoding and decoding
    from PIL import Image	  			      # image handling
    import hmac								  # keyed hashing (HMAC)
    from hashlib import sha1				  # SHA-1 hash
    import time
    from urllib.parse import urlencode		  # URL encoding
    import execjs							  # runs JavaScript from Python (here through Node.js)
    from http import cookiejar as cookielib
    class Spider():
        def __init__(self):
            self.session = requests.session()
            self.session.cookies = cookielib.LWPCookieJar()    # a cookie jar that supports save() and load()
            self.login_page_url = 'https://www.zhihu.com/signin?next=%2F'
            self.login_api = 'https://www.zhihu.com/api/v3/oauth/sign_in'
            self.captcha_api = 'https://www.zhihu.com/api/v3/oauth/captcha?lang=en'
            self.headers = {
                'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER',
            }
    
            self.captcha =''         # captcha text
            self.signature = ''	   # request signature
    
        # first request, made only to obtain the base cookies
        def get_base_cookie(self):
            self.session.get(url=self.login_page_url, headers=self.headers)
    
        def deal_captcha(self):
            r = self.session.get(url=self.captcha_api, headers=self.headers)
            r = r.json()
            if r.get('show_captcha'):
                while True:
                    r = self.session.put(url=self.captcha_api, headers=self.headers)
                    img_base64 = r.json().get('img_base64')
                    with open('captcha.png', 'wb') as f:
                        f.write(base64.b64decode(img_base64))
                    captcha_img = Image.open('captcha.png')
                    captcha_img.show()
                    self.captcha = input('Enter the captcha: ')
                    r = self.session.post(url=self.captcha_api, data={'input_text': self.captcha},
                                          headers=self.headers)
                    if r.json().get('success'):
                        break
    
        def get_signature(self):
            # Build the HMAC-SHA1 signature. The timestamp is stored so that the
            # login form sends exactly the value that was signed.
            self.timestamp = str(int(time.time() * 1000))
            a = hmac.new(b'd1b964811afb40118a12068ff74a12f4', digestmod=sha1)
            a.update(b'password')
            a.update(b'c3cef7c66a1843f8b3a9e6a1e3160e20')
            a.update(b'com.zhihu.web')
            a.update(self.timestamp.encode('utf-8'))
            self.signature = a.hexdigest()
    
        def post_login_data(self):
            data = {
                'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                'grant_type': 'password',
                'timestamp': self.timestamp,
                'source': 'com.zhihu.web',
                'signature': self.signature,
                'username': '+8618953675221',
                'password': '',
                'captcha': self.captcha,
                'lang': 'en',
                'utm_source': '',
                'ref_source': 'other_https://www.zhihu.com/signin?next=%2F',
            }
    
            headers = {
                'x-zse-83': '3_2.0',
                'content-type': 'application/x-www-form-urlencoded',
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER',
            }
    
            data = urlencode(data)
            with open('zhih.js', 'rt', encoding='utf-8') as f:
                js = execjs.compile(f.read(), cwd='node_modules')
            data = js.call('b', data)
    
            r = self.session.post(url=self.login_api, headers=headers, data=data)
            print(r.text)
            if r.status_code == 201:
                self.session.cookies.save('mycookie')
                print('Login succeeded')
            else:
                print('Login failed')
    
        def login(self):
            self.get_base_cookie()
            self.deal_captcha()
            self.get_signature()
            self.post_login_data()
    if __name__ == '__main__':
        zhihu_spider = Spider()
        zhihu_spider.login()
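    The signature step is easier to follow in isolation. The sketch below pulls it into a standalone helper; the function name make_signature is made up for this example, the key and field values are the ones used in the code above, and whether Zhihu still accepts them is not something this post can confirm. The point it illustrates is that the timestamp sent in the form has to be the same string that went into the signature.

```python
import hmac
import time
from hashlib import sha1


def make_signature(key, grant_type, client_id, source, timestamp_ms):
    """HMAC-SHA1 over the four fields, fed in the same order as in the code above."""
    mac = hmac.new(key, digestmod=sha1)
    for part in (grant_type, client_id, source, timestamp_ms):
        mac.update(part.encode("utf-8"))
    return mac.hexdigest()


KEY = b"d1b964811afb40118a12068ff74a12f4"
CLIENT_ID = "c3cef7c66a1843f8b3a9e6a1e3160e20"

# Take the timestamp once and reuse it for both the signature and the form.
timestamp_ms = str(int(time.time() * 1000))
signature = make_signature(KEY, "password", CLIENT_ID, "com.zhihu.web", timestamp_ms)
form = {"timestamp": timestamp_ms, "signature": signature}
print(form)
```

    Calling time.time() a second time when building the form can produce a value that differs by a millisecond or more, and the signature would then no longer match the submitted timestamp.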
    
    
    
    
    
    

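    Anti-scraping measures and how to deal with them

    1 user-agent
    2 referer
    3 cookies (use a cookie pool; visit the site once first to obtain them)
    4 rate limiting (use a proxy pool, add delays)
    5 JS encryption (extract the JS and run it with the execjs module)
    6 CSS obfuscation
    7 captchas (captcha-solving services, or semi-manual entry)
    8 lazy-loaded images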
    

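    Bloom filter

    # ScalableBloomFilter is assumed to come from the pybloom_live package (pip3 install pybloom-live)
    from pybloom_live import ScalableBloomFilter
    from scrapy.dupefilters import BaseDupeFilter
    
    class UrlFilter(BaseDupeFilter):
        def __init__(self):
            self.bloom = ScalableBloomFilter(initial_capacity=100, error_rate=0.001, mode=ScalableBloomFilter.LARGE_SET_GROWTH)
    
        def request_seen(self, request):
            # True means "already seen", so Scrapy drops the request
            if request.url in self.bloom:
                return True
            self.bloom.add(request.url)
            return False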
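    To show what a Bloom filter does internally, here is a small standard-library sketch. It is not the implementation of the library used above: the class name SimpleBloomFilter, the 8192-bit size, the four hash functions, and the trick of deriving them from SHA-1 with a seed prefix are all choices made for this illustration.

```python
import hashlib


class SimpleBloomFilter:
    """Fixed-size Bloom filter: every value sets num_hashes bits in a shared bit array."""

    def __init__(self, size_bits=8 * 1024, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, value):
        # Derive several hash functions from SHA-1 by prefixing a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{value}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for position in self._positions(value):
            self.bits[position // 8] |= 1 << (position % 8)

    def __contains__(self, value):
        # "Possibly present" only if every one of the value's bits is set.
        return all(
            self.bits[position // 8] & (1 << (position % 8))
            for position in self._positions(value)
        )


bloom = SimpleBloomFilter()
url = "https://www.cnblogs.com/"
before = url in bloom
bloom.add(url)
after = url in bloom
print(before, after)
```

    A Bloom filter never misses a URL that was added, but it can occasionally report a URL as seen when it was not. For a crawler that means a small fraction of pages may be skipped in exchange for a large saving in memory.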
    
  • Original post: https://www.cnblogs.com/linqiaobao/p/13508744.html