  • XPath

    Turning the response into a selector object:
    - Option 1:
    response.xpath("//div[@id='content-list']/div[@class='item']")
    - Option 2:

    # HtmlXPathSelector is deprecated; Selector is its replacement
    from scrapy.selector import Selector
    hxs = Selector(response=response)
    items = hxs.xpath("//div[@id='content-list']/div[@class='item']")

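    - Option 3 (lxml, outside Scrapy; this part is still to be expanded):

    from lxml import etree

    # r is assumed to be a requests-style response whose .text holds the HTML
    html = etree.HTML(r.text)

    img_urls = html.xpath('.//img/@src')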
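The lxml route can be tried without any network request. The sketch below is only an illustration: the `page` string is a made-up stand-in for `r.text`, not markup from the original article.

```python
from lxml import etree

# Hypothetical page body; in the snippet above this would be r.text.
page = """
<html><body>
  <div id="images">
    <a href="image1.html"><img src="image1_thumb.jpg"></a>
    <a href="image2.html"><img src="image2_thumb.jpg"></a>
  </div>
</body></html>
"""

html = etree.HTML(page)               # parse the string into an element tree
img_urls = html.xpath('.//img/@src')  # attribute queries return plain strings
print(img_urls)                       # ['image1_thumb.jpg', 'image2_thumb.jpg']
```

Unlike Scrapy selectors, lxml returns the attribute values directly, so there is no extract() step.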


    Lookup rules:
    //a
    //div/a
    //a[re:test(@id, "id+")]

    items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
    for item in items:
        item.xpath('.//div')

    Extracting results:
    Selector objects: xpath('/html/body/ul/li/a/@href')
    List of strings: xpath('/html/body/ul/li/a/@href').extract()
    Single value: xpath('//body/ul/li/a/@href').extract_first()

    # // searches the whole document

    In [1]: response.xpath('//a')
    Out[1]:
    [<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
    <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
    <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
    <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
    <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]

    In [2]: response.xpath('//a').extract()
    Out[2]:
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
    '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
    '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
    '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
    '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

    In [3]: response.xpath('//a').extract_first()
    Out[3]: '<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>'


    # Find direct children
    In [9]: response.xpath('//div/a').extract()
    Out[9]:
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
    '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
    '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
    '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
    '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']


    # Find descendants
    In [13]: response.xpath('//div//img').extract()
    Out[13]:
    ['<img src="image1_thumb.jpg">',
    '<img src="image2_thumb.jpg">',
    '<img src="image3_thumb.jpg">',
    '<img src="image4_thumb.jpg">',
    '<img src="image5_thumb.jpg">']


    # Get the text content
    response.css('a::text').extract()
    response.xpath('//a/text()').extract()

    # Get attributes
    response.css('img::attr("src")').extract()
    response.xpath('//img/@src').extract()

    # Set a default value for when nothing matches
    In [27]: response.xpath('//img/@srcsssss').extract_first('not found')
    Out[27]: 'not found'

    # Find by attribute
    response.css('#images').extract()
    response.xpath('//*[@id="images"]').extract()
    response.xpath('//*[@href="image2.html"]').extract()

    # Fuzzy (substring) matching
    response.css('*[src*="im"]').extract()
    response.xpath('//*[contains(@id,"result")]').extract_first()

    # Nested queries
    response.xpath('//div').css('a')
    response.xpath('//div').xpath('a')  # same as response.xpath('//div').xpath('./a')
    response.xpath('//div').xpath('img')

    response.xpath('//div').xpath('//img')  # note: // restarts from the document root, not from each div

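    # Regular expressions (the re: prefix is the EXSLT regular-expressions namespace)
    # Note: as written, "id+" matches an "i" followed by one or more "d"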

    # hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]')

    # print(hxs)

    # hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/text()').extract()

    # print(hxs)

    # hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/@href').extract()

    # XPath expressions with variables
    response.xpath('//*[@id="images"]').extract_first()
    response.xpath('//*[@id=$xxx]',xxx='images').extract_first()

    response.xpath('//div[count(a)=$xxx]',xxx=5).extract()

  • Original article: https://www.cnblogs.com/nick477931661/p/8666257.html