
Using Selector to extract data in Scrapy

I. Basic concepts

1. Selector is a module that can be used on its own. We can build a selector object with the Selector class and then call its methods, such as xpath() and css(), to extract data:

from scrapy import Selector

body = '<html><head><title>Hello World</title></head><body></body></html>'
selector = Selector(text=body)
# Select the text node inside <title> and return it as a string
title = selector.xpath('//title/text()').extract_first()
print(title)



Output:
Hello World

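2. scrapy shell is mainly used to test whether the extraction code of a Scrapy project works as expected. It can be run directly from bash.

Here we use scrapy shell to try out how selectors extract data from a web page. On Linux, run the following command in bash to enter the interactive scrapy shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html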

The source of the test page above is:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

II. scrapy shell provides a built-in selector, response.selector, that can be used to extract information from the page. Some examples follow.

1. Basic usage of xpath and css

# Get the text of <title>; the ".selector" part can be omitted (response.xpath / response.css)
response.selector.xpath('//title/text()').extract_first()
response.selector.css('title::text').extract_first()


# Get the href attribute value of every <a> tag
response.xpath('//a/@href').extract()
response.css('a::attr(href)').extract() 


# Find all <a> tags whose href attribute value contains "image" and return their href values
response.xpath('//a[contains(@href, "image")]/@href').extract()
response.css('a[href*=image]::attr(href)').extract()


# Find all <a> tags whose href attribute value contains "image", then get the src attribute of the <img> tags inside them
response.xpath('//a[contains(@href, "image")]/img/@src').extract()
response.css('a[href*=image] img::attr(src)').extract()


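# Combine with regular expressions to extract the required content
response.css('a::text').re('Name:(.*)')   # returns the content matched by (.*) for every link
response.css('a::text').re_first('Name:(.*)').strip()  # returns the content matched by the first (.*); strip() removes leading and trailing whitespace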

2. xpath and css can also be chained

# First select the src attributes of the <img> tags inside the div
response.xpath('//div[@id="images"]').css('img::attr(src)')
# Then extract the data
response.xpath('//div[@id="images"]').css('img::attr(src)').extract()  # returns a list of strings
response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first()  # returns a single string
response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first(default='')  # returns the default value if nothing was matched

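Notes:

1. The extract() method converts the selector objects into plain data (strings).

2. [@id="images"] restricts the match by attribute: only div tags whose id attribute equals images are selected. The value images must be quoted inside the []. XPath accepts either single or double quotes, but because the whole expression here is written as a single-quoted Python string, double quotes are used inside it.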


Original article: https://www.cnblogs.com/regit/p/9402626.html