  • CrawlSpider

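     - CrawlSpider inherits from Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines rules (Rule objects) that give it a convenient way to follow links: it extracts links from each crawled page and keeps crawling them.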
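
The idea of "extract links from a page, then keep crawling the ones that match a rule" can be illustrated without Scrapy. The following is only a rough sketch using the standard library, not how Scrapy is implemented: `HrefCollector`, `crawl`, `FAKE_SITE`, and the `example.com` URLs are all hypothetical names made up for this illustration.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag (a stand-in for a link extractor)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, allow):
    """Visit start_url, then keep following links whose URL matches `allow`."""
    pattern = re.compile(allow)
    seen = {start_url}          # URLs already queued, so no page is visited twice
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        collector = HrefCollector()
        collector.feed(fetch(url))
        for href in collector.hrefs:
            link = urljoin(url, href)   # resolve relative links against the page URL
            if pattern.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/list/1": '<a href="/list/2">next</a> <a href="/about">about</a>',
    "https://example.com/list/2": '<a href="/list/1">prev</a> <a href="/list/3">next</a>',
    "https://example.com/list/3": '<a href="/list/2">prev</a>',
}

visited = crawl(
    "https://example.com/list/1",
    lambda url: FAKE_SITE.get(url, ""),
    r"/list/\d+",
)
print(visited)
```

Starting from a single URL, the loop reaches all three listing pages and skips `/about`, which is roughly the behaviour a Rule with follow=True gives a CrawlSpider.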

     - Creating the project differs from before: the spider is generated with the crawl template (-t crawl)

    scrapy startproject ct
    cd ct
    scrapy genspider -t crawl chouti www.xxx.com

     - Simple example: crawl all the page URLs on Chouti (抽屉网)

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class CtSpider(CrawlSpider):
        name = 'ct'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://dig.chouti.com/all/hot/recent/1']
    
        # Link extractor:
        # allow is the regular expression a link must match to be extracted
        link = LinkExtractor(allow=r'/all/hot/recent/\d+')
    
        rules = (
            # Rule: parses the pages behind the extracted links with the given callback;
            # follow=True applies the link extractor again to those pages
            Rule(link, callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            print(response)
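
The allow pattern is an ordinary regular expression, so the page number has to be written as `\d+` with the backslash; a bare `d+` would look for the literal letter "d". A quick check with the standard `re` module shows the difference. The second and third URLs are sample values assumed to follow the same shape as the start URL, and the use of `search` here is meant to mirror how the pattern is tested against a URL, as far as I know.

```python
import re

urls = [
    "https://dig.chouti.com/all/hot/recent/2",    # sample listing page
    "https://dig.chouti.com/all/hot/recent/15",   # sample listing page
    "https://dig.chouti.com/all/hot/recent/",     # no page number
]

broken = re.compile(r"/all/hot/recent/d+")    # "d+" means one or more letters "d"
fixed = re.compile(r"/all/hot/recent/\d+")    # "\d+" means one or more digits

broken_hits = [u for u in urls if broken.search(u)]
fixed_hits = [u for u in urls if fixed.search(u)]

print(broken_hits)   # nothing matches the pattern without the backslash
print(fixed_hits)    # the two numbered pages match
```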
    

     - Qiushibaike (糗事百科)

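    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class CtSpider(CrawlSpider):
        name = 'qiubai'
    
        start_urls = ['https://www.qiushibaike.com/pic/']
    
        # Numbered listing pages; the "?" is escaped so it matches a literal question mark
        link = LinkExtractor(allow=r'/pic/page/\d+\?s=\d+')
        # The first page, whose URL ends in /pic/
        link1 = LinkExtractor(allow=r'/pic/$')
        rules = (
            Rule(link, callback='parse_item', follow=True),
            Rule(link1, callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            print(response)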
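
Two details of these patterns are easy to get wrong: the `?` that starts the query string must be escaped as `\?`, and the first page needs its own pattern anchored with `$` because its URL has no page number. The small check below uses only the standard `re` module; the query value in `listing` is a made-up sample, not a real page.

```python
import re

listing = "https://www.qiushibaike.com/pic/page/2?s=5167052"   # hypothetical sample URL
first_page = "https://www.qiushibaike.com/pic/"

unescaped = re.compile(r"/pic/page/\d+?s=\d+")    # "+?" is read as a lazy quantifier
escaped = re.compile(r"/pic/page/\d+\?s=\d+")     # "\?" matches a literal question mark
first = re.compile(r"/pic/$")                      # only URLs that end in /pic/

unescaped_hit = unescaped.search(listing) is not None
escaped_hit = escaped.search(listing) is not None
first_hits = [u for u in (listing, first_page) if first.search(u)]

print(unescaped_hit, escaped_hit)
print(first_hits)
```

With the unescaped pattern the numbered pages are never extracted, so only the escaped version lets the first rule do its job; the second rule covers the first page alone.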
    

      

  • Original article: https://www.cnblogs.com/lzmdbk/p/10477503.html