Scrapy Selector
https://docs.scrapy.org/en/latest/topics/selectors.html
Basic usage
Full form, through response.selector:
>>> response.selector.xpath('//span/text()').get()
'good'
Shortcut form, calling xpath() or css() directly on the response:
>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'
Parsing from a text string:
>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'
Parsing from a response object:
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').get()
'good'
Extracting text
>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'
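.get()
Always returns a single result: the first match if there are several, or None if nothing matches.
.get(default='not-found') returns the given default instead of None when nothing matches.
Older versions used .extract_first() to get the first result.
.getall()
Returns a list of all results (an empty list if nothing matches).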
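Extracting attributes
1. Use the XPath attribute syntax (@href, @src) => response.xpath("//a/@href").getall()
2. Use .attrib, which on a list of matches returns the attributes of the first match => response.css('img').attrib['src']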
Regular expressions
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
'My image 2',
'My image 3',
'My image 4',
'My image 5']
Nesting selectors
>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']
Using variables in XPath
>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '
Removing namespaces
Example feed:
$ scrapy shell https://feeds.feedburner.com/PythonInside
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
xmlns:blogger="http://schemas.google.com/blogger/2008"
xmlns:georss="http://www.georss.org/georss"
xmlns:gd="http://schemas.google.com/g/2005"
xmlns:thr="http://purl.org/syndication/thread/1.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
...
Without removing namespaces, an unprefixed query finds nothing:
>>> response.xpath("//link")
[]
After removing namespaces:
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
<Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
...