  • XPath

    Converting a string into a selector object:
    - Option 1:
    response.xpath("//div[@id='content-list']/div[@class='item']")
    - Option 2:

    # HtmlXPathSelector is deprecated in newer Scrapy releases; Selector replaces it.
    from scrapy.selector import Selector
    hxs = Selector(response=response)
    items = hxs.xpath("//div[@id='content-list']/div[@class='item']")

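    - Option 3 (lxml, outside Scrapy):

    from lxml import etree

    # r is assumed to be an HTTP response object with a .text attribute (e.g. from requests)
    html = etree.HTML(r.text)

    img_urls = html.xpath('.//img/@src')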
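The lxml route above can be tried without any network access. A minimal sketch, assuming a hypothetical `SAMPLE_HTML` string stands in for `r.text`:

```python
from lxml import etree

# Hypothetical page body; in the snippet above this would be r.text.
SAMPLE_HTML = """
<html><body>
  <div id="images">
    <a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>
    <a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>
  </div>
</body></html>
"""

# etree.HTML parses the string and returns the root <html> element.
html = etree.HTML(SAMPLE_HTML)

# Attribute queries come back as plain strings.
img_urls = html.xpath('.//img/@src')
print(img_urls)
```

Unlike a Scrapy selector, lxml returns attribute values directly as strings, so there is no `extract()` step.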


    Lookup rules:
    //a
    //div/a
    //a[re:test(@id, "id+")]

    items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
    for item in items:
        item.xpath('.//div')

    Extraction:
    Selector objects: xpath('/html/body/ul/li/a/@href')
    List of strings: xpath('/html/body/ul/li/a/@href').extract()
    Single value: xpath('//body/ul/li/a/@href').extract_first()

    # // searches the entire document

    In [1]: response.xpath('//a')
    Out[1]:
    [<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
    <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
    <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
    <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
    <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]

    In [2]: response.xpath('//a').extract()
    Out[2]:
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
    '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
    '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
    '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
    '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

    In [3]: response.xpath('//a').extract_first()
    Out[3]: '<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>'


    # Find direct children
    In [9]: response.xpath('//div/a').extract()
    Out[9]:
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
    '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
    '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
    '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
    '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']


    # Find descendants at any depth
    In [13]: response.xpath('//div//img').extract()
    Out[13]:
    ['<img src="image1_thumb.jpg">',
    '<img src="image2_thumb.jpg">',
    '<img src="image3_thumb.jpg">',
    '<img src="image4_thumb.jpg">',
    '<img src="image5_thumb.jpg">']


    # Get text content
    response.css('a::text').extract()
    response.xpath('//a/text()').extract()

    # Get attribute values
    response.css('img::attr("src")').extract()
    response.xpath('//img/@src').extract()

    # Set a default value for when nothing matches
    In [27]: response.xpath('//img/@srcsssss').extract_first('not found')
    Out[27]: 'not found'

    # Look up by attribute
    response.css('#images').extract()
    response.xpath('//*[@id="images"]').extract()
    response.xpath('//*[@href="image2.html"]').extract()

    # Fuzzy (substring) matching
    response.css('*[src*="im"]').extract()
    response.xpath('//*[contains(@id,"result")]').extract_first()

    # Nested queries
    response.xpath('//div').css('a')
    response.xpath('//div').xpath('a')  # same as response.xpath('//div').xpath('./a')
    response.xpath('//div').xpath('img')

    response.xpath('//div').xpath('//img')  # note: a leading // searches the whole document again, not just each div

    # Regular expressions
    # hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/text()').extract()
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/@href').extract()

    # XPath expressions with variables
    response.xpath('//*[@id="images"]').extract_first()
    response.xpath('//*[@id=$xxx]',xxx='images').extract_first()

    response.xpath('//div[count(a)=$xxx]',xxx=5).extract()

  • Original article: https://www.cnblogs.com/nick477931661/p/8666257.html