zoukankan      html  css  js  c++  java
  • scrapy中xpath、css用法

    一、实验环境

    1.Windows7x64_SP1

    2.anaconda3 + python3.7.3(anaconda集成,不需单独安装)

    3.scrapy1.6.0

    二、用法举例

    1.开启scrapy shell,在命令行输入如下命令:

    scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
    

    结果如下:

     

    2.提取a节点

    • xpath中用法

    result = response.xpath('//a')
    

    结果如下:

    [<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
     <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
     <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
     <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
     <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]
    
    •  css中用法

    result = response.css('a')

    结果如下:

    [<Selector xpath='descendant-or-self::a' data='<a href="image1.html">Name: My image 1 <'>,
     <Selector xpath='descendant-or-self::a' data='<a href="image2.html">Name: My image 2 <'>,
     <Selector xpath='descendant-or-self::a' data='<a href="image3.html">Name: My image 3 <'>,
     <Selector xpath='descendant-or-self::a' data='<a href="image4.html">Name: My image 4 <'>,
     <Selector xpath='descendant-or-self::a' data='<a href="image5.html">Name: My image 5 <'>]
    

      

    3.查看result的类型

     type(result)
    

    结果如下:

    scrapy.selector.unified.SelectorList
    

    说明:result为Selector组成的列表,也是SelectList类型,他们都可以继续调用xpath()和css()等方法,进一步提取数据。

    4.查看result提取数据全部内容,使用extract()函数

    result.extract()  

    结果如下:

    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
     '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
     '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
     '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
     '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
    

    5.提取节点内容

    • xpath中用法,使用text()函数

    response.xpath('//a/text()')
    

    结果如下:

    [<Selector xpath='//a/text()' data='Name: My image 1 '>,
     <Selector xpath='//a/text()' data='Name: My image 2 '>,
     <Selector xpath='//a/text()' data='Name: My image 3 '>,
     <Selector xpath='//a/text()' data='Name: My image 4 '>,
     <Selector xpath='//a/text()' data='Name: My image 5 '>]
    

    查看HTML内容

    response.xpath('//a/text()').extract()
    

    结果如下:

    ['Name: My image 1 ',
     'Name: My image 2 ',
     'Name: My image 3 ',
     'Name: My image 4 ',
     'Name: My image 5 ']
    
    •  css中用法

    response.css('a::text').extract()
    

    结果如下:

    ['Name: My image 1 ',
     'Name: My image 2 ',
     'Name: My image 3 ',
     'Name: My image 4 ',
     'Name: My image 5 ']
    

        

    6.提取属性值

    • xpath中用法,使用/@属性名(如/@href)

    response.xpath('//a/@href').extract()
    

    结果如下:

    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    •  css中用法
    response.css('a::attr("href")').extract()
    

    结果如下:

    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html'] 

      

    7.提取节点内部子节点

    • xpath中用法,/子节点名  

    response.xpath('//a/img').extract()
    

    结果如下:

    ['<img src="image1_thumb.jpg">',
     '<img src="image2_thumb.jpg">',
     '<img src="image3_thumb.jpg">',
     '<img src="image4_thumb.jpg">',
     '<img src="image5_thumb.jpg">']
    
    •  css中用法

    response.css('a img').extract()

    结果如下:

    ['<img src="image1_thumb.jpg">',
     '<img src="image2_thumb.jpg">',
     '<img src="image3_thumb.jpg">',
     '<img src="image4_thumb.jpg">',
     '<img src="image5_thumb.jpg">']
    

      

      

    再提取其中的src属性值,与步骤6相同

    • xpath用法

    response.xpath('//a/img/@src').extract()
    
    • css用法

    response.css('a img::attr("src")').extract()
    

      

    8.公用方法

    • extract_first()  #用于提取第一个元素
    • extract_first('default value')   #同上,添加默认参数
  • 相关阅读:
    为什么在SqlServer流水模式下,事务无法启动?
    默认web站点被删除,如何设置新的默认站点?
    用C#实现基于TCP协议的网络通讯
    如何通过DataRelation关联两个DataGrid,实现主从表。
    如何设置网站的会话时间?
    性能测试基本概念释疑
    C#中如何获取服务器IP,名称,操作系统,客户端IP,名称!
    DataGridComboBoxColumn控件
    端口基础知识
    P2P之UDP穿透NAT的原理与实现
  • 原文地址:https://www.cnblogs.com/hester/p/11371384.html
Copyright © 2011-2022 走看看