  • XPath

    Converting a string into a selector object:
    - Option 1:
    response.xpath("//div[@id='content-list']/div[@class='item']")
    - Option 2:

    # HtmlXPathSelector is deprecated in newer Scrapy releases; Selector is its replacement
    from scrapy.selector import Selector
    hxs = Selector(response=response)
    items = hxs.xpath("//div[@id='content-list']/div[@class='item']")

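    - Option 3 (lxml; notes still to be completed):
    from lxml import etree

    html = etree.HTML(r.text)  # r is presumably an HTTP response object, e.g. from requests

    img_urls = html.xpath('.//img/@src')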


    Lookup rules:
    //a
    //div/a
    //a[re:test(@id, "id+")]

    items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
    for item in items:
        item.xpath('.//div')

    Parsing the results:
    Selector objects: xpath('/html/body/ul/li/a/@href')
    List of strings: xpath('/html/body/ul/li/a/@href').extract()
    Single value: xpath('//body/ul/li/a/@href').extract_first()

    # // searches the entire document

    In [1]: response.xpath('//a')
    Out[1]:
    [<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
    <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
    <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
    <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
    <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]

    In [2]: response.xpath('//a').extract()
    Out[2]:
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
    '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
    '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
    '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
    '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

    In [3]: response.xpath('//a').extract_first()
    Out[3]: '<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>'


    # Find direct children
    In [9]: response.xpath('//div/a').extract()
    Out[9]:
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
    '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
    '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
    '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
    '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']


    # Find descendants at any depth
    In [13]: response.xpath('//div//img').extract()
    Out[13]:
    ['<img src="image1_thumb.jpg">',
    '<img src="image2_thumb.jpg">',
    '<img src="image3_thumb.jpg">',
    '<img src="image4_thumb.jpg">',
    '<img src="image5_thumb.jpg">']


    # Get text content
    response.css('a::text').extract()
    response.xpath('//a/text()').extract()

    # Get attribute values
    response.css('img::attr("src")').extract()
    response.xpath('//img/@src').extract()

    # Set a default value for when nothing is found
    In [27]: response.xpath('//img/@srcsssss').extract_first('not found')
    Out[27]: 'not found'

    # Find by attribute value
    response.css('#images').extract()
    response.xpath('//*[@id="images"]').extract()
    response.xpath('//*[@href="image2.html"]').extract()

    # Fuzzy (substring) matching
    response.css('*[src*="im"]').extract()
    response.xpath('//*[contains(@id,"result")]').extract_first()

    # Nested queries
    response.xpath('//div').css('a')
    response.xpath('//div').xpath('a')  # same as response.xpath('//div').xpath('./a')
    response.xpath('//div').xpath('img')

    response.xpath('//div').xpath('//img')  # a leading // searches the whole document again, not just the div

    # Regular expressions
    # hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/text()').extract()
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/@href').extract()

    # XPath rules with variables
    response.xpath('//*[@id="images"]').extract_first()
    response.xpath('//*[@id=$xxx]',xxx='images').extract_first()

    response.xpath('//div[count(a)=$xxx]',xxx=5).extract()

  • Original article: https://www.cnblogs.com/nick477931661/p/8666257.html