  • scrapy instantiation

    start (a launcher script to run the spider without typing the scrapy command)

    from scrapy.cmdline import execute
    execute(['scrapy', 'crawl', 'jokespider'])
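
    The execute() call above is equivalent to running "scrapy crawl jokespider" on the command line, so this script must be launched from inside the project directory. A minimal alternative sketch using Scrapy's CrawlerProcess API, which loads the project settings itself instead of going through the command-line parser:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Pick up settings.py (pipelines, middlewares, ...) and run the
    # spider registered under the name 'jokespider'.
    process = CrawlerProcess(get_project_settings())
    process.crawl('jokespider')
    process.start()  # blocks until the crawl finishes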
    

      

    items.py

    import scrapy
    
    class JokejiItem(scrapy.Item):
        title=scrapy.Field()
        url=scrapy.Field()
    
    class ListItem(scrapy.Item):
        title=scrapy.Field()
        url=scrapy.Field()
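
    A scrapy.Item behaves like a dict that only accepts its declared Field() keys; the two classes above exist purely so the pipeline can tell list pages and content pages apart. A small illustration (the field values here are made up):

    item = JokejiItem()
    item['title'] = 'some joke title'   # OK: 'title' is a declared field
    try:
        item['author'] = 'anonymous'    # not declared, so Scrapy raises KeyError
    except KeyError:
        pass
    print(dict(item))                   # {'title': 'some joke title'}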
    

      

    spider.py

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from jokeji.items import JokejiItem,ListItem
    
    class JokespiderSpider(CrawlSpider):
        name = 'jokespider'
        allowed_domains = ['zizi.cn']
        start_urls = ['http://www.zizi.cn']
    
        rules = [
            Rule(LinkExtractor(allow=r'/list\w+\.htm'), callback='parse_list', follow=True),
            Rule(LinkExtractor(allow=r'/jokehtml/\w+/\d+\.htm', deny=(r'/list',)), callback='parse_item', follow=True),
        ]
    
        def parse_item(self, response):
            item=JokejiItem()
            item['title']='from content'
            return item
    
        def parse_list(self,response):
            item=ListItem()
            item['url']="from list........"+response.url
            return item
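
    In this demo both callbacks only fill in placeholder strings, so the pipeline output merely shows which rule matched each URL. A sketch of what parse_item could look like when it actually extracts data (the CSS selector below is hypothetical, since the real page markup is not shown in this post; on Scrapy versions before 1.8, use extract_first() instead of get()):

    def parse_item(self, response):
        item = JokejiItem()
        # Hypothetical selector: adjust to the real page structure.
        item['title'] = response.css('title::text').get()
        item['url'] = response.url
        return item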
    

      

    pipelines.py

    class JokejiPipeline(object):
        def process_item(self, item, spider):
            print(item, item.__class__, spider)
            return item  # return the item so later pipelines still receive it
    

     

    In the pipeline, decide how to process the data based on which class item.__class__ is.

    You can of course also add to the Item class:

    def __str__(self):
        return 'ItemClass'

    which makes the output more readable.
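
    Putting both ideas together, a sketch of a pipeline that branches on the item class (the print calls stand in for real storage logic):

    from jokeji.items import JokejiItem, ListItem

    class JokejiPipeline(object):
        def process_item(self, item, spider):
            # Decide how to handle the data based on the concrete Item class.
            if isinstance(item, JokejiItem):
                print('content item from %s: %s' % (spider.name, item))
            elif isinstance(item, ListItem):
                print('list item from %s: %s' % (spider.name, item))
            return item

    And in items.py, a __str__ along these lines makes the printed output easier to read:

    class ListItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()

        def __str__(self):
            return 'ListItem(%s)' % self.get('url', '')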

  • Original article: https://www.cnblogs.com/pythonClub/p/9841509.html