Using Selector in Scrapy
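Scrapy's Selector can also be used on its own to parse HTML. This post summarizes its CSS and XPath usage; personally, I prefer CSS.

The example is a page from Song Tian's MOOC tutorial: python123.io/ws/demo.html

Parsing is one way of extracting information. The things we mainly extract are tag nodes, attributes, and text, so the sections below cover these three in turn.

1. Extracting tag nodes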

from scrapy import Selector

response = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

That is the HTML of the page. Suppose I want to extract the p tags.

Using CSS selectors
selector = Selector(text=response)
p = selector.css('p').extract()
print(p)
# ['<p class="title"><b>The demo python introduces several python courses.</b></p>', '<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>']
This returns every p node, as a list. If you only want the first one, use extract_first() instead of extract().

For simple lookups that is enough. But when many nodes share the same tag and the one I want is not the first, this approach falls short. The fix is to add constraints such as class or id.

For example, to extract the p node with class="course", use p[class="course"]. Any other attribute can serve as the constraint in the same way.
selector = Selector(text=response)
result = selector.css('p[class="course"]').extract_first()
print(result)

# <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
# <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
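Using XPath

The overall idea is the same with XPath; only the syntax differs.

The first example with XPath
selector = Selector(text=response)
result = selector.xpath('//p').extract_first()
print(result)
The second example with XPath
selector = Selector(text=response)
result = selector.xpath('//p[@class="course"]').extract_first()
print(result)
If you look closely, selecting tag nodes with XPath only adds // and @ compared with CSS: a leading // matches nodes anywhere in the document, at any depth, and @ is followed by an attribute name.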

2. Extracting attributes

Sometimes we need attribute values, such as src or href.

response = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

It is the same example, copied here for easy reference.

Say I want the href of the first a tag.

Using CSS

Append ::attr(href) directly to the tag: attr means an attribute is being extracted, and the href in parentheses names which one.
selector = Selector(text=response)
result = selector.css('a::attr(href)').extract_first()
print(result)
# http://www.icourse163.org/course/BIT-268001
To extract the href of a specific a tag, such as the second one, add a constraint as before.
selector = Selector(text=response)
result = selector.css('a[class="py2"]::attr(href)').extract_first()
print(result)
# http://www.icourse163.org/course/BIT-1001870001
Using XPath

The first example above
selector = Selector(text=response)
result = selector.xpath('//a/@href').extract_first()
print(result)
The second example above
selector = Selector(text=response)
result = selector.xpath('//a[@class="py2"]/@href').extract_first()
print(result)
3. Extracting text

response = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

Extract the text of the first a tag.

Using CSS selectors

Just append ::text to the tag. For how to select the tag itself, see the sections above.
selector = Selector(text=response)
result = selector.css('a::text').extract_first()
print(result)
# Basic Python
To get the text of a specific tag, such as the second a tag, again just add a constraint.
selector = Selector(text=response)
result = selector.css('a[class="py2"]::text').extract_first()
print(result)
# Advanced Python
Using XPath

For the first example, //a selects the a nodes and /text() then selects their text.
selector = Selector(text=response)
result = selector.xpath('//a/text()').extract_first()
print(result)
For the second example, remember to put @ before the attribute in the XPath constraint, and to add () after text.
selector = Selector(text=response)
result = selector.xpath('//a[@class="py2"]/text()').extract_first()
print(result)
To sum up: for extraction, XPath needs the extra / and @ symbols, and even when adding a constraint the attribute must be prefixed with @. That is why I prefer CSS: I'm lazy.

Original post: https://www.cnblogs.com/sjfeng1987/p/9930285.html