  • scrapy snippet

    1. Spider file

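    # scrapy.contrib import paths come from older Scrapy releases
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    
    # Body of a spider callback: collect image URLs for ImagesPipeline.
    # DomzItem is the project's item class (see the item file).
    hxs = HtmlXPathSelector(response)
    item = DomzItem()
    image_urls = hxs.select('//img/@src').extract()
    # Assumes protocol-relative src values (//host/path), so add the scheme
    item['image_urls'] = ["http:" + x for x in image_urls]
    return item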
    
    class MySpider(CrawlSpider):  # throttle the download rate
        name = 'myspider'
        download_delay = 2  # seconds to wait between requests
    
    $ scrapy crawl somespider -s JOBDIR=crawls/somespider-1
    # Once the crawl is running, press Ctrl + C to stop it; run the same command again to resume
    $ scrapy crawl somespider -s JOBDIR=crawls/somespider-1
    
    # Spider class attributes
    name = "wikipedia"
    allowed_domains = ["wikipedia.org"]
    start_urls = [
      "http://en.wikipedia.org/wiki/Pune"
    ]
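
The `"http:" + x` step in the callback only works when every `src` value is protocol-relative. A minimal sketch of a more general approach, assuming the standard-library `urljoin` is acceptable here; `absolutize` and the sample `src` values are hypothetical, chosen only for illustration:

```python
from urllib.parse import urljoin


def absolutize(page_url, srcs):
    """Resolve each src value against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in srcs]


# Hypothetical src values, as they might appear in <img> tags.
page = "http://en.wikipedia.org/wiki/Pune"
srcs = [
    "//upload.wikimedia.org/a.png",  # protocol-relative
    "/static/logo.png",              # host-relative
    "https://example.org/b.png",     # already absolute
]
print(absolutize(page, srcs))
```

Inside a callback, `page_url` would be `response.url`, so relative, protocol-relative, and absolute links all go through the same code path.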
    

    2. Settings file

    ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
    IMAGES_STORE = '...'  # directory where downloaded images are saved
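
The import path and list form above match the older Scrapy releases this snippet was written for. In recent releases, to my knowledge, `ITEM_PIPELINES` is a dict mapping the pipeline path to an order value, and the images pipeline lives under `scrapy.pipelines.images`. A sketch under that assumption; the directory name is a hypothetical placeholder, not the author's value:

```python
# settings.py (sketch for recent Scrapy versions; check your installed version)
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,  # lower numbers run earlier
}
IMAGES_STORE = "downloaded_images"  # hypothetical directory; replace with your own
```

The images pipeline also needs the Pillow library installed.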
    

    3. Item file

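    from scrapy.item import Item, Field
    
    class DomzItem(Item):
        image_urls = Field()  # input: URLs for ImagesPipeline to download
        images = Field()      # output: filled in by ImagesPipeline after download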
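
After a download, `ImagesPipeline` fills the `images` field with one entry per fetched image, including the file's path relative to `IMAGES_STORE`. As far as I know, the default scheme names each file by the SHA-1 hash of its URL under a `full/` directory. A sketch under that assumption, useful for locating a file on disk; `default_image_path` is a hypothetical helper, not part of Scrapy, and the URL is made up:

```python
import hashlib


def default_image_path(url):
    """Relative path under IMAGES_STORE, assuming the default naming scheme."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/" + digest + ".jpg"


# Hypothetical image URL used only for illustration.
print(default_image_path("http://upload.wikimedia.org/a.png"))
```

If the pipeline's naming has been customized or differs in your Scrapy version, read the path from the item's `images` entries instead.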
    
  • Original article: https://www.cnblogs.com/bushe/p/4003392.html