
scrapy Selector

https://docs.scrapy.org/en/latest/topics/selectors.html

Basic usage

Full form, through response.selector:
>>> response.selector.xpath('//span/text()').get()
'good'

Shortcut form:
>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'

Parsing from text:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Parsing from a response:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> # body is the str defined above, so an encoding must be given
>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').get()
'good'

Extracting text

>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'

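.get() always returns a single result: the first match if several elements match, or None if nothing matches.

.get(default="a default value") returns the given default instead of None when nothing matches.

In earlier versions, extract_first() was used to get the first result.

.getall() returns a list containing all results.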

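Extracting attributes

1. Use an XPath attribute step such as @href  =>  response.xpath("//a/@href").getall()
2. Use the .attrib property  =>  response.css('img').attrib['src']  (on a list of matches, .attrib reads the first matched element)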

Regular expressions

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']

Nesting selectors

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']

Using variables in XPath

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '

Removing namespaces

Example site:
$ scrapy shell https://feeds.feedburner.com/PythonInside


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
      xmlns:blogger="http://schemas.google.com/blogger/2008"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:gd="http://schemas.google.com/g/2005"
      xmlns:thr="http://purl.org/syndication/thread/1.0"
      xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  ...


By default:
>>> response.xpath("//link")
[]

After removing namespaces:
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
    <Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
    ...


Original article: https://www.cnblogs.com/xt12321/p/13879574.html