  • Scrapy Framework

    Scrapy

    Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, such as data mining, information processing, and archiving historical data.
    It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has broad uses, including data mining, monitoring, and automated testing.

    Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows.

    Scrapy's main components are:

    • Engine (Scrapy)
      Handles the data flow across the whole system and triggers events (the core of the framework).
    • Scheduler
      Accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses, or links, of pages to crawl): it decides which URL to crawl next and removes duplicate URLs.
    • Downloader
      Downloads page content and hands it back to the spiders (the downloader is built on Twisted, an efficient asynchronous model).
    • Spiders
      Spiders do the real work: they extract the information you need, the so-called items, from specific pages. You can also extract links from a page and let Scrapy go on to crawl the next one.
    • Item Pipeline
      Processes the items the spiders extract from pages; its main jobs are persisting items, validating them, and discarding unneeded data. Once a page has been parsed by a spider, the items are sent to the pipeline and processed through several stages in a fixed order.
    • Downloader Middlewares
      A framework hook between the Scrapy engine and the downloader; it mainly processes the requests and responses passing between them.
    • Spider Middlewares
      A framework hook between the Scrapy engine and the spiders; it mainly processes the spiders' response input and request output.
    • Scheduler Middlewares
      Middleware between the Scrapy engine and the scheduler, handling the requests and responses sent from the engine to the scheduler.

    The Scrapy run flow is roughly as follows (a minimal sketch follows the list):

    1. The engine takes a link (URL) from the scheduler for the next crawl.
    2. The engine wraps the URL in a Request and passes it to the downloader.
    3. The downloader fetches the resource and wraps it in a Response.
    4. A spider parses the Response.
    5. If it parses out an item (Item), the item is handed to the item pipeline for further processing.
    6. If it parses out a link (URL), the URL is handed to the scheduler to await crawling.
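
    A minimal spider sketch of this flow (the domain and XPath below are made up for illustration): the Response comes back to parse, yielded items go to the item pipeline, and yielded Requests go back to the scheduler.

    import scrapy

    class FlowDemoSpider(scrapy.Spider):
        name = 'flow_demo'
        start_urls = ['http://example.com/']            # steps 1-3: the engine pulls the URL, wraps it in a Request, the downloader fetches it

        def parse(self, response):                      # step 4: the spider parses the Response
            yield {'url': response.url}                 # step 5: items are handed to the item pipeline
            for href in response.xpath('//a/@href').extract():
                # step 6: new Requests go back to the scheduler to be crawled later
                yield scrapy.Request(response.urljoin(href), callback=self.parse)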

    I. Installation

    Linux:
        pip3 install scrapy 
    
    Windows:
        pip3 install wheel
        # download the Twisted wheel (.whl) that matches your Python version, e.g. to D:\
        pip3 install D:\Twisted-xx.x.x-cpxx-cpxxm-win_amd64.whl
        
        pip3 install scrapy     # running this without the Twisted wheel first fails with a Twisted install error
        
        pip3 install pywin32
    
    
    PS: 
        - Python 3 support in Twisted was incomplete at the time of writing
        - Scrapy support was better on Python 2
    
    # verify the installation
    import scrapy

    II. Basic Usage

    1. Basic commands

    Django:
        django-admin startproject mysite
        cd mysite
        python manage.py startapp app01
         
     
    Scrapy:
        # Create a project; this creates a project directory in the current directory (similar to Django)
        scrapy startproject sp1
            Generated layout:
                sp1
                    - sp1
                        - spiders           directory holding the spider applications you create
                        - middlewares.py    middleware
                        - items.py          structured-data definitions; used together with pipelines.py for persistence
                        - pipelines.py      persistence
                        - settings.py       settings file
                    - scrapy.cfg            configuration
             
        # Create spider applications
        cd sp1
        scrapy genspider xiaohuar xiaohuar.com      # creates xiaohuar.py
        scrapy genspider baidu baidu.com            # creates baidu.py
         
        # List the spider applications
        scrapy list
     
        # Run a spider (from inside the project directory)
        scrapy crawl baidu
        scrapy crawl baidu --nolog

    File overview:

    • scrapy.cfg  The project's main configuration entry. (The actual crawler-related settings live in settings.py.)
    • items.py    Data-storage templates for structured data, similar to Django's Model.
    • pipelines   Data-processing behaviour, e.g. persisting structured data.
    • settings.py Settings file: recursion depth, concurrency, download delay, and so on.
    • spiders     Spider directory: create files here and write the crawling rules.

    Note: spider files are usually named after the site's domain.

    2. Basic operations

    2.1 Filtering with Selector

    hxs = Selector(response=response)
    # print(hxs)
    user_list = hxs.xpath('//div[@class="item masonry_brick"]')
    for item in user_list:
        price = item.xpath('.//span[@class="price"]/text()').extract_first()
        url = item.xpath('div[@class="item_t"]/div[@class="class"]//a/@href').extract_first()
        print(price,url)
    
    result = hxs.xpath('//a[re:test(@href,"http://www.xiaohuar.com/list-1-\d+.html")]/@href')
    print(result)
    result = ['http://www.xiaohuar.com/list-1-1.html','http://www.xiaohuar.com/list-1-2.html']

    2.2 yield Request(url=url,callback=self.parse)   # executed iteratively, crawling further pages

    2.3 Code implementation

    # -*- coding: utf-8 -*-
    import scrapy
    
    class BaiduSpider(scrapy.Spider):
        name = 'baidu'                          # spider name; used to launch the spider from the command line
        allowed_domains = ['baidu.com']         # allowed domains
        start_urls = ['http://baidu.com/']      # start URLs
    
        def parse(self, response):
            print(response.text)
            print(response.body)
    baidu.py
    import scrapy
    from scrapy.selector import HtmlXPathSelector,Selector
    from scrapy.http import Request
    
    class XiaohuarSpider(scrapy.Spider):
        name = 'xiaohuar'
        allowed_domains = ['xiaohuar.com']
        start_urls = ['http://www.xiaohuar.com/hua/']            # start URL
    
        def parse(self, response):
            # deprecated approach
            # hxs = HtmlXPathSelector(response)     # wraps the downloaded response into a selector object
            # print(hxs)
            # result = hxs.select('//a[@class="item_list"]')        # select: run a query; //a : find all <a> tags in the page
            ## result = hxs.select('//a[@class="item_list"]').extract()        # .extract() returns a list [<a></a>,<a></a>...] instead of selector objects
            ## result = hxs.select('//a[@class="item_list"]').extract_first()        # take the first match
            ## result = hxs.select('//a[@class="item_list"]/@href').extract_first()        # take the href attribute
            ## result = hxs.select('//a[@class="item_list"]/text()').extract_first()        # take the text content
    
            ############################# the approach above is not recommended #############################
    
    
            ############################### recommended approach ##############################
    
            hxs = Selector(response=response)
            # print(hxs)
            user_list = hxs.xpath('//div[@class="item masonry_brick"]')     # returns selector objects that can be iterated over; finds every div with class="item masonry_brick"
            for item in user_list:                                              # each item is also a selector object
                price = item.xpath('.//span[@class="price"]/text()').extract_first()     # .//span... searches all descendants relative to the current tag
                # price = item.xpath('//span[@class="price"]/text()').extract_first() would be wrong, because //span... searches the whole HTML document
                url = item.xpath('div[@class="item_t"]/div[@class="class"]//a/@href').extract_first()
                # / means direct children, // means all descendants; this only holds inside a relative path. At the start of an expression, // and / have their own special meanings.
                print(price,url)
                
            # the code above only covers the first index page; the code below handles the pagination
            result = hxs.xpath('//a[re:test(@href,"http://www.xiaohuar.com/list-1-\d+.html")]/@href')    # re:test() regex matching
            print(result)
            result = ['http://www.xiaohuar.com/list-1-1.html','http://www.xiaohuar.com/list-1-2.html']
    
            # rules
            for url in result:
                yield Request(url=url,callback=self.parse)      # yield Request(url=url) just wraps the URL and puts it on the scheduler; with callback=self.parse requests keep being issued, executing iteratively
    xiaohuar.py

    Supplement:

    Selectors:
        //          # all descendants
        /           # direct children
        /@attr      # take an attribute
        /text()     # take the text
     
         
    Special cases:
        item.xpath('.//a')  # search descendants relative to the current node
        item.xpath('./a')   # search direct children relative to the current node (same as item.xpath('a'))
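
    A small, self-contained sketch of these selector rules, run against an inline HTML snippet (the HTML below is made up for illustration):

    from scrapy.selector import Selector

    html = '<div class="item"><h2><a href="/p/1">First</a></h2><span class="price">9.9</span></div>'
    item = Selector(text=html).xpath('//div[@class="item"]')[0]                 # // searches the whole document

    print(item.xpath('.//span[@class="price"]/text()').extract_first())         # .// : descendants of the current node -> '9.9'
    print(item.xpath('./h2/a/@href').extract_first())                           # ./  : direct children -> '/p/1'
    print(item.xpath('.//a/text()').extract_first())                            # text() : text content -> 'First'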

    III. Going Deeper

    (I) The sections below use logging in to Chouti (chouti.com) and upvoting posts as a running example.

    1. Start URLs

    If callback=self.parse1 is not specified, the parse method runs by default once the download finishes.
     
    import scrapy
    from scrapy.http import Request
     
    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['http://chouti.com/']
     
        def start_requests(self):       # per the source, if we don't define start_requests, the inherited scrapy.Spider.start_requests method runs by default
            for url in self.start_urls:
                yield Request(url, dont_filter=True,callback=self.parse1)       # dont_filter=True disables deduplication for this URL
     
        def parse1(self, response):
            pass

    2. How to send a POST request with request headers, cookies, and data

    requests.get(params={},headers={},cookies={})
    requests.post(params={},headers={},cookies={},data={},json={})

    2.1 Request-related parameters

    url, 
    method='GET', 
    headers=None, 
    body=None,
    cookies=None,
    ...

    2.2 GET request

    url, 
    method='GET', 
    headers={}, 
    cookies={}, cookiejar            # cookies can be a dict or a CookieJar object

    2.3 POST request

    url, 
    method='POST', 
    headers={}, 
    cookies={}, cookiejar            # cookies can be a dict or a CookieJar object
    body=None,                        # request body
        With Content-Type application/x-www-form-urlencoded; charset=UTF-8, the data looks like "phone=86155fa&password=asdf&oneMonth=1" 
        With the JSON Content-Type application/json; charset=UTF-8, the data is a dict serialized to a string: '{"k1":"v1","k2":"v2"}'
        
        For application/x-www-form-urlencoded; charset=UTF-8, form_data = {'user':'xyp','pwd': 123} would have to be joined by hand into "user=xyp&pwd=123",
        but urllib.parse can do the encoding automatically:
            import urllib.parse
            data = urllib.parse.urlencode({'k1':'v1','k2':'v2'})
            print(data)
            # ---> "k1=v1&k2=v2"  
             
            
        For the JSON Content-Type application/json; charset=UTF-8:
            json.dumps({'k1':'v1','k2':'v2'})
            
            '{"k1": "v1", "k2": "v2"}'

    2.4 POST request example

     Request(
        url='http://dig.chouti.com/login',
        method='POST',
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
        body='phone=8615131255089&password=pppppppp&oneMonth=1',
        callback=self.check_login
    )
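
    The example above sends a form-encoded body. For the JSON case described in 2.3, a minimal sketch looks like this (the URL is hypothetical; the Request is meant to be yielded inside a spider that defines check_login):

    import json
    from scrapy.http import Request

    Request(
        url='http://example.com/api/login',
        method='POST',
        headers={'Content-Type': 'application/json; charset=UTF-8'},
        body=json.dumps({'phone': '8615131255089', 'password': 'pppppppp'}),
        callback=self.check_login
    )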

    2.5 Cookies

    Request(
        url='http://dig.chouti.com/login',
        method='POST',
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
        body='phone=8615131255089&password=pppppppp&oneMonth=1',
        cookies=self.cookie_dict,
        callback=self.check_login
    )

    Full implementation:

    # the code below loops indefinitely; add deduplication on top of it
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.http import Request
    from scrapy.selector import Selector
    
    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['http://chouti.com/']
        cookie_dict = {}
        """
        1. Send a GET request to Chouti
           and grab the cookies.
           
        2. POST the username and password, carrying the cookies from the previous step.
           Return value: 9999 means login succeeded.
           
        3. Do whatever you like: carry the cookies and upvote.
        """
        def start_requests(self):       # per the source, if we don't define start_requests, the inherited scrapy.Spider.start_requests method runs by default
            for url in self.start_urls:
                yield Request(url, dont_filter=True,callback=self.parse1)       # dont_filter=True disables deduplication for this URL
    
        def parse1(self,response):
            # response.text is the full front-page HTML
            from scrapy.http.cookies import CookieJar
            cookie_jar = CookieJar() # an object that wraps the cookies
            cookie_jar.extract_cookies(response, response.request) # pull the cookies out of the response
    
            for k, v in cookie_jar._cookies.items():
                for i, j in v.items():
                    for m, n in j.items():
                        self.cookie_dict[m] = n.value
            post_dict = {
                'phone': '8615131255089',
                'password': 'woshiniba',
                'oneMonth': 1,
            }
            import urllib.parse
    
            # goal: send a POST request to log in
            yield Request(
                url="http://dig.chouti.com/login",
                method='POST',
                cookies=self.cookie_dict,       # cookies=cookie_jar would also work
                body=urllib.parse.urlencode(post_dict),     # the body data to send
                headers={'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8'},
                callback=self.parse2                        # callback
            )
    
        def parse2(self,response):
            print(response.text)        # here you should check the response to confirm the login succeeded; the check is omitted
            # fetch the news list
            yield Request(url='http://dig.chouti.com/',cookies=self.cookie_dict,callback=self.parse3)
    
        def parse3(self,response):
    
            # find divs with class=part2 and read the share-linkid attribute to get the article id
            hxs = Selector(response)
            link_id_list = hxs.xpath('//div[@class="part2"]/@share-linkid').extract()       # all article ids on the current page
            print(link_id_list)
            for link_id in link_id_list:
                # upvote each id
                base_url = "http://dig.chouti.com/link/vote?linksId=%s" %(link_id,)
                yield Request(url=base_url,method="POST",cookies=self.cookie_dict,callback=self.parse4)
    
    
            #################### the code above only upvotes the front page ####################
            
            
            ####################### upvote every article on every page ####################### 
            
            page_list = hxs.xpath('//div[@id="dig_lcpage"]//a/@href').extract()     # all the page links
            for page in page_list:
                #page : /all/hot/recent/2
                page_url = "http://dig.chouti.com%s" %(page,)
                yield Request(url=page_url,method='GET',callback=self.parse3)       # loop over the pages and upvote
    
        def parse4(self, response):
            print(response.text)
    Auto-login to Chouti and upvote

    (II) The sections below use scraping article titles and URLs from Jiandan (jandan.net) as an example to explore persistence.

    3. Persistence

    3.1 Scraping Jiandan article titles and URLs: full code with detailed persistence comments

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.http import Request
    from scrapy.selector import Selector
    
    class JianDanSpider(scrapy.Spider):
        name = 'jiandan'
        allowed_domains = ['jandan.net']
        start_urls = ['http://jandan.net/']
    
        def start_requests(self):
            for url in self.start_urls:
                yield Request(url, dont_filter=True,callback=self.parse1)
        def parse1(self,response):
            # response.text is the full front-page HTML
            hxs = Selector(response)
            a_list = hxs.xpath('//div[@class="indexs"]/h2')
            for tag in a_list:
                url = tag.xpath('./a/@href').extract_first()
                text = tag.xpath('./a/text()').extract_first()
                from ..items import Sp2Item
                yield Sp2Item(url=url,text=text)        # yielding this special object hands it straight to the pipeline; no persistence happens here, the work is just delegated to the pipeline
            # the code above collects the title text and url of the front-page articles
            # collect the page links [url,url]
            """
            for url in url_list:
                yield Request(url=url,callback=self.parse1)
            """
    jiandan.py
    import scrapy
    
    class Sp2Item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        url = scrapy.Field()
        text = scrapy.Field()
    items.py
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class Sp2Pipeline(object):
        def __init__(self):
            self.f = None
    
        def process_item(self, item, spider):
            """
    
            :param item:  the object yielded from the spider
            :param spider: the spider object, e.g. obj = JianDanSpider()
            :return:
            """
            if spider.name == 'jiandan':
                pass
            print(item)
            self.f.write('....')
            # pass the item on to the next pipeline's process_item method
            # return item
            # from scrapy.exceptions import DropItem
            # raise DropItem()  the next pipeline's process_item method will not run
    
        @classmethod
        def from_crawler(cls, crawler):
            """
            Called at initialisation time to create the pipeline object.
            :param crawler:
            :return:
            """
            # val = crawler.settings.get('MMMM')
            print('running the pipeline from_crawler, instantiating the object')
            return cls()
    
        def open_spider(self,spider):
            """
            Called when the spider starts running.
            :param spider:
            :return:
            """
            print('spider opened')
            self.f = open('a.log','a+')
    
        def close_spider(self,spider):
            """
            Called when the spider is closed.
            :param spider:
            :return:
            """
            self.f.close()
    pipelines.py
    ITEM_PIPELINES = {
               'sp2.pipelines.Sp2Pipeline': 300,        # 300 is the priority
            }
    settings.py

    3.2 Summary

    ① Prerequisites for a pipeline to run

    - the spider yields Item objects
    - the pipelines are registered in settings:
        ITEM_PIPELINES = {
           'sp2.pipelines.Sp2Pipeline': 300,        # 300 is the priority; lower numbers run earlier
           'sp2.pipelines.Sp3Pipeline': 100,
        }

    ② Writing a pipeline

    class Sp2Pipeline(object):
        def __init__(self):
            self.f = None
    
        def process_item(self, item, spider):
            """
    
            :param item:  the object yielded from the spider
            :param spider: the spider object, e.g. obj = JianDanSpider()
            :return:
            """
            print(item)
            self.f.write('....')
            return item
            # from scrapy.exceptions import DropItem
            # raise DropItem()  the next pipeline's process_item method will not run
    
        @classmethod
        def from_crawler(cls, crawler):
            """
            Called at initialisation time to create the pipeline object.
            :param crawler:
            :return:
            """
            # val = crawler.settings.get('MMMM')
            print('running the pipeline from_crawler, instantiating the object')
            return cls()
    
        def open_spider(self,spider):
            """
            Called when the spider starts running.
            :param spider:
            :return:
            """
            print('spider opened')
            self.f = open('a.log','a+')
    
        def close_spider(self,spider):
            """
            Called when the spider is closed.
            :param spider:
            :return:
            """
            self.f.close()
    When both Sp2Pipeline and Sp3Pipeline are registered, the higher-priority pipeline runs its __init__, from_crawler, and open_spider methods first, but it does not immediately move on to the crawling stage.
    Only after the lower-priority pipeline has also finished its __init__, from_crawler, and open_spider methods does the crawling (and item processing) begin.
     
    Pipelines are global: every spider runs through them. To special-case individual spiders, check spider.name (see the sketch below).
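
    A minimal sketch of that per-spider check, together with a higher-priority pipeline that drops items (the class bodies here are illustrative, not from the original project):

    from scrapy.exceptions import DropItem

    class Sp3Pipeline(object):
        """Registered with priority 100, so its process_item runs first."""
        def process_item(self, item, spider):
            if not item.get('url'):
                raise DropItem()            # later pipelines never see this item
            return item                     # hand the item on to the next pipeline

    class Sp2Pipeline(object):
        """Registered with priority 300."""
        def process_item(self, item, spider):
            if spider.name == 'jiandan':
                print('special handling for the jiandan spider:', item)
            return item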

    ③ Methods you can define in pipelines.py, and the order they run in

    # class CustomPipeline(object):
    #     def __init__(self,val):
    #         self.val = val
    #
    #     def process_item(self, item, spider):
    #         # process and persist the item
    #
    #         # return means later pipelines keep processing the item
    #         return item
    #
    #         # drop the item so later pipelines never process it
    #         # raise DropItem()
    #
    #     @classmethod
    #     def from_crawler(cls, crawler):
    #         """
    #         Called at initialisation time to create the pipeline object.
    #         :param crawler:
    #         :return:
    #         """
    #         val = crawler.settings.get('MMMM')
    #         return cls(val)
    #
    #     def open_spider(self,spider):
    #         """
    #         Called when the spider starts running.
    #         :param spider:
    #         :return:
    #         """
    #         print('000000')
    #
    #     def close_spider(self,spider):
    #         """
    #         Called when the spider is closed.
    #         :param spider:
    #         :return:
    #         """
    #         print('111111')
    
    """
    Check whether the CustomPipeline class has a from_crawler method.
    If it does:
           obj = cls.from_crawler()
    If it does not:
           obj = cls()
    obj.open_spider()
    
    while True:
        the spider runs, executing parse and the other callbacks, yielding items
        obj.process_item()
    
    obj.close_spider()    
    
    """

    That concludes the example-driven walkthrough.

    4. Custom deduplication rules

    4.1 Configured in the settings file

    By default Scrapy deduplicates with scrapy.dupefilter.RFPDupeFilter; the related defaults in settings are:
        DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
        DUPEFILTER_DEBUG = False
        JOBDIR = "path where the visited-requests log is kept, e.g. /root/"  # the final path is /root/requests.seen

    4.2 Custom URL deduplication

    class RepeatUrl:
        def __init__(self):
            self.visited_url = set() # kept in the memory of the current process
    
        @classmethod
        def from_settings(cls, settings):
            """
            Called at initialisation time.
            :param settings:
            :return:
            """
            return cls()
    
        def request_seen(self, request):
            """
            Check whether the current request has already been seen.
            :param request:
            :return: True means it has been seen; False means it has not
            """
            if request.url in self.visited_url:
                return True
            self.visited_url.add(request.url)
            return False
    
        def open(self):
            """
            Called when crawling starts.
            :return:
            """
            print('open replication')
    
        def close(self, reason):
            """
            Called when crawling finishes.
            :param reason:
            :return:
            """
            print('close replication')
    
        def log(self, request, spider):
            """
            Log a duplicate.
            :param request:
            :param spider:
            :return:
            """
            print('repeat', request.url)
    rep.py
    DUPEFILTER_CLASS = 'sp2.rep.RepeatUrl'
    settings.py

    5. Custom extensions (signal-based)

    from scrapy import signals
    
    class MyExtension(object):
        def __init__(self, value):
            self.value = value
    
        @classmethod
        def from_crawler(cls, crawler):
            val = crawler.settings.getint('MMMM')
            ext = cls(val)
    
            # register the spider_opened signal with Scrapy
            crawler.signals.connect(ext.opened, signal=signals.spider_opened)        # ext.opened is the function executed when the signal fires
                    
            # register the spider_closed signal with Scrapy
            crawler.signals.connect(ext.closed, signal=signals.spider_closed)
            
            return ext
    
        def opened(self, spider):
            print('open')
    
        def closed(self, spider):
            print('close')
    extends.py
    EXTENSIONS = {
       # 'scrapy.extensions.telnet.TelnetConsole': None,
    }
    settings.py (registration)

    6. Middleware

    6.1 Spider middleware

    SPIDER_MIDDLEWARES = {
       'sp3.middlewares.Sp3SpiderMiddleware': 543,
    }
    settings.py (registration)
    class Sp3SpiderMiddleware(object):
    
        def process_spider_input(self,response, spider):
            """
            Called after the download finishes, before the response is handed to parse.
            :param response: 
            :param spider: 
            :return: 
            """
            pass
    
        def process_spider_output(self,response, result, spider):
            """
            Called when the spider has finished processing and returns results.
            :param response:
            :param result:
            :param spider:
            :return: must return an iterable of Request or Item objects
            """
            return result
    
        def process_spider_exception(self,response, exception, spider):
            """
            Called on exceptions.
            :param response:
            :param exception:
            :param spider:
            :return: None to let later middlewares keep handling the exception; an iterable of Response or Item objects to hand to the scheduler or pipelines
            """
            return None
    
    
        def process_start_requests(self,start_requests, spider):
            """
            Called when the spider starts.
            :param start_requests:
            :param spider:
            :return: an iterable of Request objects
            """
            return start_requests
    middlewares.py

    6.2 Downloader middleware

    DOWNLOADER_MIDDLEWARES = {
       'sp3.middlewares.DownMiddleware1': 543,
    }
    settings.py (registration)
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            """
            Called for every request that needs to be downloaded, through every downloader middleware's process_request.
            :param request: 
            :param spider: 
            :return:  
                None: continue to the later middlewares and download
                Response object: stop running process_request and start running process_response
                Request object: stop the middleware chain and hand the Request back to the scheduler
                raise IgnoreRequest: stop running process_request and start running process_exception
            """
            
            
            """
            from scrapy.http import Request
            # print(request)
            # request.method = "POST"
            request.headers['proxy'] = "{'ip_port': '111.11.228.75:80', 'user_pass': ''},"
            return None
            """
            
            
            """
            from scrapy.http import Response
            import requests
            v = requests.get('http://www.baidu.com')
            data = Response(url='xxxxxxxx',body=v.content,request=request)
            return data
             """
            
            
            pass
    
    
    
        def process_response(self, request, response, spider):
            """
            Called when the spider has finished processing and returns results.
            :param response:
            :param result:
            :param spider:
            :return: 
                Response object: passed on to the other middlewares' process_response
                Request object: stop the middleware chain; the request is rescheduled for download
                raise IgnoreRequest: Request.errback is called
            """
            print('response1')
            return response
    
        def process_exception(self, request, exception, spider):
            """
            Called when the download handler or process_request (a downloader middleware) raises an exception.
            :param response:
            :param exception:
            :param spider:
            :return: 
                None: later middlewares keep handling the exception
                Response object: stop the later process_exception methods
                Request object: stop the middleware chain; the request will be re-downloaded
            """
            return None
    middlewares.py

    7. Custom commands (also the entry point for reading the source of scrapy crawl baidu)

    Create a directory (any name, e.g. commands) at the same level as spiders
    Create a file crawlall.py inside it (this file name becomes the custom command name)
    from scrapy.commands import ScrapyCommand
    
    class Command(ScrapyCommand):
    
        requires_project = True
    
        def syntax(self):
            return '[options]'
    
        def short_desc(self):
            return 'Runs all of the spiders'
    
        def run(self, args, opts):
            # list of spiders
            spider_list = self.crawler_process.spiders.list()
            for name in spider_list:
                print(name)
                # initialise the spider
                self.crawler_process.crawl(name, **opts.__dict__)
            # start all the spiders
            self.crawler_process.start()
    crawlall.py
    Add COMMANDS_MODULE = 'project_name.directory_name' to settings.py
    Run the command from the project directory: scrapy crawlall
             
    This adds a new command: scrapy crawlall      
    scrapy crawlall --nolog     #---> xxx
    scrapy genspider ooo ooo.com
    scrapy crawlall --nolog    
    '''
        ---> xxx
             ooo
    '''

    8. Miscellaneous (the Scrapy settings file)

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for step8_king project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     http://doc.scrapy.org/en/latest/topics/settings.html
    #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    
    # 1. Crawler name
    BOT_NAME = 'step8_king'    
    
    
    # 2. Spider module paths
    SPIDER_MODULES = ['step8_king.spiders']
    NEWSPIDER_MODULE = 'step8_king.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # 3. Client User-Agent request header                
    # USER_AGENT = 'step8_king (+http://www.yourdomain.com)'                # the client user-agent / device string
    
    
    # Obey robots.txt rules
    # 4. robots.txt compliance
    # ROBOTSTXT_OBEY = False            # whether to obey the robots exclusion protocol                    
    
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    # 5. Number of concurrent requests
    # CONCURRENT_REQUESTS = 4
    
    
    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    # 6. Download delay in seconds
    # DOWNLOAD_DELAY = 2
    
    
    # The download delay setting will honor only one of:        # setting per-domain or per-IP concurrency overrides item 5's global CONCURRENT_REQUESTS
    # 7. Concurrency per domain; the download delay also applies per domain
    # CONCURRENT_REQUESTS_PER_DOMAIN = 2
    # Concurrency per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored, and the download delay also applies per IP
    # CONCURRENT_REQUESTS_PER_IP = 3
    
    
    # Disable cookies (enabled by default)
    # 8. Whether cookies are enabled; cookies are handled via a cookiejar
    # COOKIES_ENABLED = True
    # COOKIES_DEBUG = True
    
    
    # Disable Telnet Console (enabled by default)
    # 9. The Telnet console can be used to inspect and control the running crawler            # for monitoring your crawler
    #    connect with telnet ip port, then operate via commands
    # TELNETCONSOLE_ENABLED = True
    # TELNETCONSOLE_HOST = '127.0.0.1'
    # TELNETCONSOLE_PORT = [6023,]
    
    
    # 10. Default request headers applied to all requests; lower priority than headers set in the spider file itself
    # Override the default request headers:    
    # DEFAULT_REQUEST_HEADERS = {
    #     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #     'Accept-Language': 'en',
    # }
    
    
    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    # 11. Define the item pipelines that process scraped items
    # ITEM_PIPELINES = {
    #    'step8_king.pipelines.JsonPipeline': 700,
    #    'step8_king.pipelines.FilePipeline': 500,
    # }
    
    
    
    # 12. Custom extensions, invoked via signals
    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    # EXTENSIONS = {
    #     # 'step8_king.extensions.MyExtension': 500,
    # }
    
    
    # 13. Maximum allowed crawl depth; the current depth can be read from meta; 0 means unlimited
    # DEPTH_LIMIT = 3
    
    
    # 14. Crawl order: 0 means depth-first (LIFO, the default); 1 means breadth-first (FIFO)
    
    # last in, first out: depth-first
    # DEPTH_PRIORITY = 0
    # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
    # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
    # first in, first out: breadth-first
    
    # DEPTH_PRIORITY = 1
    # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
    # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
    
    
    # 15. Scheduler queue
    # SCHEDULER = 'scrapy.core.scheduler.Scheduler'        # Scrapy's default scheduler, used together with the queues from item 14
    # from scrapy.core.scheduler import Scheduler
    
    
    # 16. URL deduplication
    # DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'
    
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    
    """
    17. 自动限速算法
        from scrapy.contrib.throttle import AutoThrottle
        自动限速设置
        1. 获取最小延迟 DOWNLOAD_DELAY
        2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY
        3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY
        4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间
        5. 用于计算的... AUTOTHROTTLE_TARGET_CONCURRENCY
        target_delay = latency / self.target_concurrency
        new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间
        new_delay = max(target_delay, new_delay)
        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
        slot.delay = new_delay
    """
    
    # enable auto-throttling
    # AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    # AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    # AUTOTHROTTLE_MAX_DELAY = 10
    # The average number of requests Scrapy should be sending in parallel to each remote server
    # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    
    # Enable showing throttling stats for every response received:
    # AUTOTHROTTLE_DEBUG = True
    
    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    
    
    """
    18. 启用缓存
        目的用于将已经发送的请求或相应缓存下来,以便以后使用
        
        from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
        from scrapy.extensions.httpcache import DummyPolicy
        from scrapy.extensions.httpcache import FilesystemCacheStorage
    """
    # 是否启用缓存策略
    # HTTPCACHE_ENABLED = True
    
    # 缓存策略:所有请求均缓存,下次在请求直接访问原来的缓存即可
    # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
    # 缓存策略:根据Http响应头:Cache-Control、Last-Modified 等进行缓存的策略
    # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
    
    # 缓存超时时间
    # HTTPCACHE_EXPIRATION_SECS = 0
    
    # 缓存保存路径
    # HTTPCACHE_DIR = 'httpcache'
    
    # 缓存忽略的Http状态码
    # HTTPCACHE_IGNORE_HTTP_CODES = []
    
    # 缓存存储的插件
    # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    
    """
    19. 代理,需要在环境变量中设置
        from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware
        
        方式一:使用默认
            os.environ
            {
                http_proxy:http://root:woshiniba@192.168.11.11:9999/
                https_proxy:http://192.168.11.11:9999/
            }
        方式二:使用自定义下载中间件
        
        def to_bytes(text, encoding=None, errors='strict'):
            if isinstance(text, bytes):
                return text
            if not isinstance(text, six.string_types):
                raise TypeError('to_bytes must receive a unicode, str or bytes '
                                'object, got %s' % type(text).__name__)
            if encoding is None:
                encoding = 'utf-8'
            return text.encode(encoding, errors)
            
        class ProxyMiddleware(object):
            def process_request(self, request, spider):
                PROXIES = [
                    {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                    {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                    {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                    {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
                ]
                proxy = random.choice(PROXIES)
                if proxy['user_pass'] is not None:
                    request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                    encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                    request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                    print "**************ProxyMiddleware have pass************" + proxy['ip_port']
                else:
                    print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                    request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
        
        DOWNLOADER_MIDDLEWARES = {
           'step8_king.middlewares.ProxyMiddleware': 500,
        }
        
    """
    
    
    
    """
    20. HTTPS access
        There are two cases when crawling over HTTPS:
        1. the target site uses a trusted certificate (supported by default)
            DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
            DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
            
        2. the target site uses a custom certificate
            DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
            DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"
            
            # https.py
            from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
            from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)
            
            class MySSLFactory(ScrapyClientContextFactory):
                def getCertificateOptions(self):
                    from OpenSSL import crypto
                    v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/xyp/client.key.unsecure', mode='r').read())
                    v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/xyp/client.pem', mode='r').read())
                    return CertificateOptions(
                        privateKey=v1,  # a pKey object
                        certificate=v2,  # an X509 object
                        verify=False,
                        method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                    )
        Other:
            Related classes
                scrapy.core.downloader.handlers.http.HttpDownloadHandler
                scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
                scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
            Related settings
                DOWNLOADER_HTTPCLIENTFACTORY
                DOWNLOADER_CLIENTCONTEXTFACTORY
    
    """
    
    
    
    """
    21. Spider middleware
        class SpiderMiddleware(object):
    
            def process_spider_input(self,response, spider):
                '''
                Called after the download finishes, before the response is handed to parse.
                :param response: 
                :param spider: 
                :return: 
                '''
                pass
        
            def process_spider_output(self,response, result, spider):
                '''
                Called when the spider has finished processing and returns results.
                :param response:
                :param result:
                :param spider:
                :return: must return an iterable of Request or Item objects
                '''
                return result
        
            def process_spider_exception(self,response, exception, spider):
                '''
                Called on exceptions.
                :param response:
                :param exception:
                :param spider:
                :return: None to let later middlewares keep handling the exception; an iterable of Response or Item objects to hand to the scheduler or pipelines
                '''
                return None
        
        
            def process_start_requests(self,start_requests, spider):
                '''
                Called when the spider starts.
                :param start_requests:
                :param spider:
                :return: an iterable of Request objects
                '''
                return start_requests
        
        Built-in spider middlewares:
            'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
            'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
            'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
            'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
            'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,
    
    """
    # from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    SPIDER_MIDDLEWARES = {
       # 'step8_king.middlewares.SpiderMiddleware': 543,
    }
    
    
    """
    22. Downloader middleware
        class DownMiddleware1(object):
            def process_request(self, request, spider):
                '''
                Called for every request that needs to be downloaded, through every downloader middleware's process_request.
                :param request:
                :param spider:
                :return:
                    None: continue to the later middlewares and download
                    Response object: stop running process_request and start running process_response
                    Request object: stop the middleware chain and hand the Request back to the scheduler
                    raise IgnoreRequest: stop running process_request and start running process_exception
                '''
                pass
        
        
        
            def process_response(self, request, response, spider):
                '''
                Called when the spider has finished processing and returns results.
                :param response:
                :param result:
                :param spider:
                :return:
                    Response object: passed on to the other middlewares' process_response
                    Request object: stop the middleware chain; the request is rescheduled for download
                    raise IgnoreRequest: Request.errback is called
                '''
                print('response1')
                return response
        
            def process_exception(self, request, exception, spider):
                '''
                Called when the download handler or process_request (a downloader middleware) raises an exception.
                :param response:
                :param exception:
                :param spider:
                :return:
                    None: later middlewares keep handling the exception
                    Response object: stop the later process_exception methods
                    Request object: stop the middleware chain; the request will be re-downloaded
                '''
                return None
    
        
        Default downloader middlewares
        {
            'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
            'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
            'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
            'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
            'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
            'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
            'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
            'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
            'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
            'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
            'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
            'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
        }
    
    """
    # from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    # DOWNLOADER_MIDDLEWARES = {
    #    'step8_king.middlewares.DownMiddleware1': 100,
    #    'step8_king.middlewares.DownMiddleware2': 500,
    # }
    settings.py