  • xpath tips

    In the context of web scraping, [XPath](http://en.wikipedia.org/wiki/XPath) is a nice tool to have in your belt, as it lets you specify locations in a document more flexibly than CSS selectors do. In case you're looking for a tutorial, [here is an XPath tutorial with nice examples](http://www.zvon.org/comp/r/tut-XPath_1.html).
    
    
    
    In this post, we'll show you some tips we found valuable when using XPath in the trenches, using the [Scrapy Selector API](http://doc.scrapy.org/en/latest/topics/selectors.html) for our examples.
    
    ## Avoid using contains(.//text(), 'search text') in your XPath conditions.
    
    Use `contains(., 'search text')` instead.
    
    Here is why: the expression `.//text()` yields a collection of text nodes, a *node-set*. When a node-set is converted to a string, which happens when it is passed as an argument to a string function like `contains()` or `starts-with()`, the result is the text of the **first** node only.
    
        >>> from scrapy import Selector
        >>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
        >>> xp = lambda x: sel.xpath(x).extract()  # let's type this only once
        >>> xp('//a//text()')  # take a peek at the node-set
        [u'Click here to go to the ', u'Next Page']
        >>> xp('string(//a//text())')  # convert it to a string
        [u'Click here to go to the ']
    
    A single *node* converted to a string, however, yields its own text concatenated with the text of all its descendants:
    
        >>> xp('//a[1]')  # selects the first a node
        [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
        >>> xp('string(//a[1])')  # converts it to a string
        [u'Click here to go to the Next Page']
    
    So, in general:
    
    **GOOD:**
    
        >>> xp("//a[contains(., 'Next Page')]")
        [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
    
    **BAD:**
    
        >>> xp("//a[contains(.//text(), 'Next Page')]")
        []
    
    **GOOD:**
    
        >>> xp("substring-after(//a, 'Next ')")
        [u'Page']
    
    **BAD:**
    
        >>> xp("substring-after(//a//text(), 'Next ')")
        [u'']
    
    You can read [more detailed explanations about string values of nodes and node-sets in the XPath spec](http://www.w3.org/TR/xpath/#dt-string-value).
    
    ## Beware of the difference between //node[1] and (//node)[1]
    
    `//node[1]` selects all the nodes occurring first under their respective parents.
    
    `(//node)[1]` selects all the nodes in the document, and then gets only the first of them.
    
        >>> from scrapy import Selector
        >>> sel = Selector(text="""
        ...     <ul class="list">
        ...         <li>1</li>
        ...         <li>2</li>
        ...         <li>3</li>
        ...     </ul>
        ...     <ul class="list">
        ...         <li>4</li>
        ...         <li>5</li>
        ...         <li>6</li>
        ...     </ul>""")
        >>> xp = lambda x: sel.xpath(x).extract()
        >>> xp("//li[1]")  # get the first LI element under each parent, whatever the parent is
        [u'<li>1</li>', u'<li>4</li>']
        >>> xp("(//li)[1]")  # get the first LI element in the whole document
        [u'<li>1</li>']
        >>> xp("//ul/li[1]")  # get the first LI element under each UL parent
        [u'<li>1</li>', u'<li>4</li>']
        >>> xp("(//ul/li)[1]")  # get the first LI element under a UL parent in the whole document
        [u'<li>1</li>']
    
    Also,
    
    `//a[starts-with(@href, '#')][1]` gets, for each parent, the first local anchor among its children.
    
    `(//a[starts-with(@href, '#')])[1]` gets the first local anchor in the document.
    
    ## When selecting by class, be as specific as necessary
    
    If you want to select elements by a CSS class, the XPath way to do that is the rather verbose:
    
    

        *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

    
    Let's cook up some examples:
    
        >>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')
        >>> xp = lambda x: sel.xpath(x).extract()
    
    **BAD:** doesn't work because there are multiple classes in the attribute
    
        >>> xp("//*[@class='content']")
        []
    
    **BAD:** gets more than we want
    
        >>> xp("//*[contains(@class,'content')]")
        [u'<p class="content-author">Someone</p>', u'<p class="content text-wrap">Some content</p>']
    
    **GOOD:**
    
        >>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
        [u'<p class="content text-wrap">Some content</p>']
    
    And many times, you can just use a CSS selector instead, and even combine the two of them if needed:
    
    **ALSO GOOD:**
    
        >>> sel.css(".content").extract()
        [u'<p class="content text-wrap">Some content</p>']
        >>> sel.css('.content').xpath('@class').extract()
        [u'content text-wrap']
    
    Read [more about what you can do with Scrapy's Selectors here](http://scrapy.readthedocs.org/en/latest/topics/selectors.html#nesting-selectors).
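Because the verbose class-matching predicate is easy to mistype, one option is to build it with a small helper. The sketch below uses only the standard library. Both function names are hypothetical, and `has_class_token` is just an approximate Python model of what the XPath expression computes, included to show why padding with spaces makes the match token-exact.

```python
def class_predicate(css_class):
    """Return the XPath test that matches css_class as a whole class token."""
    if not css_class or "'" in css_class or any(ch.isspace() for ch in css_class):
        raise ValueError("expected a single class name without spaces or quotes")
    return "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % css_class


def has_class_token(class_attr, css_class):
    """Approximate Python model of the predicate built above."""
    # normalize-space() trims both ends and collapses whitespace runs.
    normalized = " ".join(class_attr.split())
    # Padding both sides with spaces turns a substring test into a token test.
    return (" %s " % css_class) in (" %s " % normalized)


print("//*[%s]" % class_predicate("content"))
print(has_class_token("content text-wrap", "content"))  # True
print(has_class_token("content-author", "content"))     # False
```

The string returned by `class_predicate("content")` is the same predicate used in the GOOD example, so it can be dropped into `sel.xpath("//*[...]")` as is.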
    
    ## Learn to use all the different axes
    
    It is handy to know how to use the axes. You can [follow through the examples given in the tutorial](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths) to quickly review them.
    
    In particular, you should note that [following](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Following_axis) and [following-sibling](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Following-sibling_axis) are not the same thing, which is a common source of confusion. The same goes for [preceding](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Preceding_axis) and [preceding-sibling](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Preceding-sibling_axis), and also for [ancestor](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Ancestor_axis) and [parent](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Parent_axis).
    
    ## Useful trick to get text content
    
    Here is another XPath trick that you may use to get the interesting text contents:
    
        //*[not(self::script or self::style)]/text()[normalize-space(.)]
    
    This excludes the content of `script` and `style` tags and also skips whitespace-only text nodes.
    
    Source: http://stackoverflow.com/a/19350897/2572383
    

    From: https://www.zyte.com/blog/xpath-tips-from-the-web-scraping-trenches/

  • Original article: https://www.cnblogs.com/c-x-a/p/14388313.html