
The Scrapy command line

Global commands

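startproject: creates a new Scrapy project: scrapy startproject demo (demo is the name of the project to create)
runspider: runs a single spider file on its own, without needing a project: scrapy runspider abc.py
view: downloads a page and opens it in the browser as Scrapy sees it: scrapy view http://www.aobossir.com/
shell: starts an interactive console for debugging a spider (rarely needed unless you are debugging): scrapy shell http://www.baidu.com --nolog (--nolog suppresses log output)
version: prints the Scrapy version: scrapy version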

Project commands

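check: checks a spider by running its contract checks; a result of OK means no problems were found: scrapy check f1
crawl: runs a spider by name: scrapy crawl f1 or scrapy crawl f1 --nolog
list: lists all spiders in the current project: scrapy list
edit: opens a spider file in your editor: scrapy edit f1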

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

Extract all image, link, and text information from the element whose id is images

response.xpath("//div[@id='images']").css('img::attr(src)').extract()

response.xpath("//div[@id='images']").css('img::attr(src)').extract_first(default='')

response.css('a::attr(href)').extract()

response.css('a::text').extract()

Find links whose href attribute value contains image

response.xpath('//a[contains(@href,"image")]/@href').extract()

response.css('a[href*=image] img::attr(src)').extract()

response.css('a::text').re_first('Name:(.*)')

response.css('a::text').re('Name:(.*)')


Original article: https://www.cnblogs.com/zsc329/p/9356339.html