  • Scrapy selectors

    >>> from scrapy import Selector
    >>> doc = """
    ... <div>
    ...     <ul>
    ...         <li class="item-0"><a href="link1.html">first item</a></li>
    ...         <li class="item-1"><a href="link2.html">second item</a></li>
    ...         <li class="item-inactive"><a href="link3.html">third item</a></li>
    ...         <li class="item-1"><a href="link4.html">fourth item</a></li>
    ...         <li class="item-0"><a href="link5.html">fifth item</a></li>
    ...     </ul>
    ... </div>
    ... """

    >>> sel = Selector(text=doc, type="html")

    >>> sel.xpath('//li//@href').extract()

    [u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']

    Using regular expressions in XPath

    >>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()

    [u'link1.html', u'link2.html', u'link4.html', u'link5.html']

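    Using variables in XPath: a variable is written as $name in the expression and its value is passed as a keyword argument. The first expression below extracts the id attribute of the div that contains five <a> child elements.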

    response.xpath('//div[count(a)=$cnt]/@id',cnt=5).extract_first()

    response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()

    u'Name: My image 1 '

    response.xpath('//base/@href').extract()

    [u'http://example.com/']

    response.css('base::attr(href)').extract()

    [u'http://example.com/']

    response.xpath('//a[contains(@href,"img")]/@href').extract()

    response.css(

  • Original article: https://www.cnblogs.com/Ting-light/p/9543425.html