zoukankan      html  css  js  c++  java
  • scrapy

    scrapy extract提取出的是list,且selectors返回list.

    创建项目:scrapy startproject myproject

    下载内容并送到标准输出:

    scrapy fetch --nolog http://www.example.com/some/page.html

    scrapy fetch --nolog --headers http://www.example.com/

    用浏览器打开指定的URL

    scrapy view <url>

    启动shell

    scrapy shell [url]

    使用spider分给定url

    scrapy parse http://www.example.com/ -c parse_item

    在运行crawl时添加-a可以传递spider参数

    scrapy crawl myspider -a category=electronics

    import scrapy
    
    class MySpider(Spider):
        name = 'myspider'
    
        def __init__(self, category=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.start_urls = ['http://www.example.com/categories/%s' % category]

      

    rules = (
            # 提取匹配 'category.php' (但不匹配 'subsection.php') 的链接并跟进链接(没有callback意味着follow默认为True)
            Rule(LinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
    
            # 提取匹配 'item.php' 的链接并使用spider的parse_item方法进行分析
            Rule(LinkExtractor(allow=('item.php', )), callback='parse_item'),
        )
    

      

    >>> response.xpath('//title/text()')
    [<Selector (text) xpath=//title/text()>]
    >>> response.css('title::text')
    [<Selector (text) xpath=//title/text()>]
    

      

    >>> links = response.xpath('//a[contains(@href, "image")]')
    >>> links.extract()
    [u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
     u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
     u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
     u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
     u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
    
    >>> for index, link in enumerate(links):
            args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
            print 'Link number %d points to url %s and image %s' % args
    
    Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
    Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
    Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
    Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
    Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
    

      

    >>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:s*(.*)')
    [u'My image 1',
     u'My image 2',
     u'My image 3',
     u'My image 4',
     u'My image 5']
    

    divs = response.xpath('//div')

    提取出divs后应该用for p in divs.xpath('.//p')而不是for p in divs.xpath('//p'),起始为 / 的XPath,那么该XPath将对文档使用绝对路径.

    如果p是直系亲属的话用divs.xpath('p').

    >>> from scrapy import Selector
    >>> doc = """
    ... <div>
    ...     <ul>
    ...         <li class="item-0"><a href="link1.html">first item</a></li>
    ...         <li class="item-1"><a href="link2.html">second item</a></li>
    ...         <li class="item-inactive"><a href="link3.html">third item</a></li>
    ...         <li class="item-1"><a href="link4.html">fourth item</a></li>
    ...         <li class="item-0"><a href="link5.html">fifth item</a></li>
    ...     </ul>
    ... </div>
    ... """
    >>> sel = Selector(text=doc, type="html")
    >>> sel.xpath('//li//@href').extract()
    [u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
    >>> sel.xpath('//li[re:test(@class, "item-d$")]//@href').extract()
    [u'link1.html', u'link2.html', u'link4.html', u'link5.html']
    >>>
    for scope in sel.xpath('//div[@itemscope]'):
    ...     print "current scope:", scope.xpath('@itemtype').extract()
    ...     props = scope.xpath('''
    ...                 set:difference(./descendant::*/@itemprop,
    ...                                .//*[@itemscope]/*/@itemprop)''')
    ...     print "    properties:", props.extract()
    ...     print
    

      

  • 相关阅读:
    dubbo服务配置
    架构基本概念和架构本质
    最大子数组和问题
    struts2简单登陆页面
    四则运算随机出题
    省赛训练赛赛题(简单题)
    Ubuntu虚拟机安装,vritualbox虚拟机软件的使用
    Rational Rose 2007破解版
    netbeans出现的错误
    快速幂
  • 原文地址:https://www.cnblogs.com/tuifeideyouran/p/4200668.html
Copyright © 2011-2022 走看看