XPath

Converting a string into a selector object:
- Method 1:
response.xpath("//div[@id='content-list']/div[@class='item']")
- Method 2:

# Note: HtmlXPathSelector is the legacy class, deprecated in favor of scrapy.selector.Selector
from scrapy.selector import HtmlXPathSelector
hxs = HtmlXPathSelector(response=response)
items = hxs.xpath("//div[@id='content-list']/div[@class='item']")

from lxml import etree  # lxml approach (to be expanded)
html = etree.HTML(r.text)  # r is presumably an HTTP response object, e.g. from requests
img_urls = html.xpath('.//img/@src')
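
The lxml snippet above is left unfinished in the original notes. As a minimal, self-contained sketch of the same idea (parse a string into a tree, then query it), the example below uses the standard library's xml.etree.ElementTree as a stand-in. It supports only a subset of XPath and needs well-formed markup, so the sample page and the attribute lookup through .get() are assumptions for illustration, not what lxml or Scrapy do internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree cannot parse sloppy HTML).
PAGE = """<html><body><div id="images">
<a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"/></a>
<a href="image2.html">Name: My image 2 <img src="image2_thumb.jpg"/></a>
</div></body></html>"""

# String -> element tree, the counterpart of etree.HTML(r.text).
root = ET.fromstring(PAGE)

# ElementTree has no '/@src' step, so read the attribute from each element.
img_urls = [img.get("src") for img in root.findall(".//img")]
print(img_urls)
```

With lxml or Scrapy the attribute step can stay inside the expression ('.//img/@src'); the list comprehension here only compensates for ElementTree's smaller XPath subset.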

Lookup rules:
//a
//div/a
//a[re:test(@id, "id+")]

items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
for item in items:
    item.xpath('.//div')
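
The select-then-iterate pattern can be tried without a running spider. The following is a rough sketch using the standard library's xml.etree.ElementTree in place of the Scrapy selector; the sample markup and the variable names are made up for illustration. The point it shows is that a path starting with './/' is evaluated relative to each item, not to the whole page.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: two items inside #content-list, one item outside it.
PAGE = """<html><body>
<div id="content-list">
  <div class="item"><div class="title">first</div></div>
  <div class="item"><div class="title">second</div><div class="meta">x</div></div>
</div>
<div class="item"><div class="title">outside</div></div>
</body></html>"""

root = ET.fromstring(PAGE)

# Only the items that are direct children of #content-list are selected.
items = root.findall(".//div[@id='content-list']/div[@class='item']")

# './/div' is relative: it only sees divs nested inside the current item.
counts = [len(item.findall(".//div")) for item in items]
print(len(items), counts)
```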

Parsing results:
Selector objects: xpath('/html/body/ul/li/a/@href')
List of strings: xpath('/html/body/ul/li/a/@href').extract()
Single value: xpath('//body/ul/li/a/@href').extract_first()

# // searches the entire document

In [1]: response.xpath('//a')
Out[1]:
[<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
<Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
<Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
<Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
<Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]

In [2]: response.xpath('//a').extract()
Out[2]:
['<a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"></a>',
'<a href="image2.html">Name: My image 2 <img src="image2_thumb.jpg"></a>',
'<a href="image3.html">Name: My image 3 <img src="image3_thumb.jpg"></a>',
'<a href="image4.html">Name: My image 4 <img src="image4_thumb.jpg"></a>',
'<a href="image5.html">Name: My image 5 <img src="image5_thumb.jpg"></a>']

In [3]: response.xpath('//a').extract_first()
Out[3]: '<a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"></a>'

# Find direct children
In [9]: response.xpath('//div/a').extract()
Out[9]:
['<a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"></a>',
'<a href="image2.html">Name: My image 2 <img src="image2_thumb.jpg"></a>',
'<a href="image3.html">Name: My image 3 <img src="image3_thumb.jpg"></a>',
'<a href="image4.html">Name: My image 4 <img src="image4_thumb.jpg"></a>',
'<a href="image5.html">Name: My image 5 <img src="image5_thumb.jpg"></a>']

# Find descendants
In [13]: response.xpath('//div//img').extract()
Out[13]:
['<img src="image1_thumb.jpg">',
'<img src="image2_thumb.jpg">',
'<img src="image3_thumb.jpg">',
'<img src="image4_thumb.jpg">',
'<img src="image5_thumb.jpg">']

# Get text content
response.css('a::text').extract()
response.xpath('//a/text()').extract()

# Get attribute values
response.css('img::attr("src")').extract()
response.xpath('//img/@src').extract()

# Set a default value for when nothing is found
In [27]: response.xpath('//img/@srcsssss').extract_first('not found')
Out[27]: 'not found'
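
The "first match or a default" behaviour can be mimicked in plain Python. This is only a sketch of the semantics, not Scrapy's implementation: first_or_default and attr_values are hypothetical helpers, and the two-image snippet is invented sample data parsed with the standard library.

```python
import xml.etree.ElementTree as ET


def first_or_default(values, default=None):
    """Hypothetical helper: first item of `values`, or `default` when empty."""
    return next(iter(values), default)


PAGE = '<div><img src="image1_thumb.jpg"/><img src="image2_thumb.jpg"/></div>'
root = ET.fromstring(PAGE)


def attr_values(name):
    # Collect the attribute from every <img> that actually has it.
    return [img.get(name) for img in root.iter("img") if img.get(name) is not None]


found = first_or_default(attr_values("src"), "not found")
missing = first_or_default(attr_values("srcsssss"), "not found")
print(found, missing)
```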

# Find by attribute
response.css('#images').extract()
response.xpath('//*[@id="images"]').extract()
response.xpath('//*[@href="image2.html"]').extract()

# Fuzzy (substring) matching
response.css('*[src*="im"]').extract()
response.xpath('//*[contains(@id,"result")]').extract_first()
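
Both selectors above are substring tests on an attribute value. As an illustration only, the sketch below spells out that test in plain Python over a tree parsed with the standard library; ElementTree has no contains() function, so the filtering is done in list comprehensions, and the sample markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with one matching and one non-matching element per test.
PAGE = """<body>
<div id="search-result-1"><img src="image1_thumb.jpg"/></div>
<div id="sidebar"><img src="logo.png"/></div>
</body>"""
root = ET.fromstring(PAGE)

# Analogue of //*[contains(@id, "result")]: substring test on the id attribute.
by_id = [el.get("id") for el in root.iter() if "result" in el.get("id", "")]

# Analogue of the CSS selector *[src*="im"]: substring test on the src attribute.
by_src = [el.get("src") for el in root.iter() if "im" in el.get("src", "")]

print(by_id, by_src)
```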

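# Nested queries
response.xpath('//div').css('a')
response.xpath('//div').xpath('a')  # same as response.xpath('//div').xpath('./a')
response.xpath('//div').xpath('img')

response.xpath('//div').xpath('//img')  # a leading // searches the whole document again; use './/img' to stay inside each div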

# Regular expressions
# hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/@href').extract()
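
To see which ids the pattern "id+" actually accepts, here is a small sketch that applies the same regular expression with Python's re module to a tree parsed by the standard library. It assumes re:test behaves like a regex search on the attribute value; the sample links are invented. Taken literally, "id+" means an "i" followed by one or more "d" characters, so an id such as "i3" does not match. If numeric ids were intended, a pattern like i\d+ would be needed instead.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical links; only the first two ids contain "i" followed by "d".
PAGE = """<div>
<a id="id1" href="link1.html">first</a>
<a id="idd2" href="link2.html">second</a>
<a id="i3" href="link3.html">third</a>
</div>"""
root = ET.fromstring(PAGE)

pattern = re.compile("id+")  # same pattern as in the XPath expressions above

# Keep the <a> elements whose id attribute contains a match for the pattern.
matched = [a for a in root.iter("a") if pattern.search(a.get("id", ""))]

texts = [a.text for a in matched]          # counterpart of .../text()
hrefs = [a.get("href") for a in matched]   # counterpart of .../@href
print(texts, hrefs)
```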

# XPath rules with variables
response.xpath('//*[@id="images"]').extract_first()
response.xpath('//*[@id=$xxx]',xxx='images').extract_first()

response.xpath('//div[count(a)=$xxx]',xxx=5).extract()
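
Variables keep the value separate from the expression, so the query text stays fixed while the value changes. The standard library's ElementTree has no XPath variables, so the sketch below imitates the two queries with ordinary function parameters; find_by_id and divs_with_link_count are hypothetical helpers and the markup is sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one div with three links, one div with a single link.
PAGE = """<body>
<div id="images"><a href="1.html"/><a href="2.html"/><a href="3.html"/></div>
<div id="footer"><a href="about.html"/></div>
</body>"""
root = ET.fromstring(PAGE)


def find_by_id(wanted_id):
    # Counterpart of //*[@id=$xxx]: the value is passed in, not hard-coded.
    return [el for el in root.iter() if el.get("id") == wanted_id]


def divs_with_link_count(n):
    # Counterpart of //div[count(a)=$xxx]: count only direct <a> children.
    return [d.get("id") for d in root.iter("div") if len(d.findall("a")) == n]


print([el.tag for el in find_by_id("images")])
print(divs_with_link_count(3), divs_with_link_count(5))
```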

Original source: https://www.cnblogs.com/nick477931661/p/8666257.html