CrawlSpider in Scrapy

CrawlSpider
- Automatically extracts URLs and submits requests for them
  Command: scrapy genspider -t crawl spidername 'example.cn'
  
  Imported modules:
  
      # -*- coding: utf-8 -*-
      import scrapy
      from scrapy.linkextractors import LinkExtractor
      from scrapy.spiders import CrawlSpider, Rule
  
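  The spider class inherits from CrawlSpider.
  
  LinkExtractor(allow=r'Items/'): extracts the URL links that match the regular expression.
  
  When an extracted URL is incomplete (relative), it is completed to an absolute URL automatically.
  
  callback='parse_item': the callback function for the responses to the extracted links (optional).
  
  follow=True: whether to keep extracting URL links from the response content.
  
  Multiple Rule objects can be added to rules.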
  
      class PspiderSpider(CrawlSpider):
          name = 'spidername'
          allowed_domains = ['']
          start_urls = ['']

          rules = (
              Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
          )
  
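  You can also define your own functions to process the data.
  
  Do not define a method named parse: CrawlSpider uses parse internally to apply its rules, so overriding it breaks the spider.
  
  The callback can also pass data on with yield.
  
  Content can be extracted with regular expressions.
  
  Content can also be extracted with XPath.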
  
      def parse_item(self, response):
          item = {}
          #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
          #item['name'] = response.xpath('//div[@id="name"]').get()
          # import re
          #item['description'] = re.findall('', response.body.decode())[0]
          return item
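
To make the link extraction step more concrete, here is a rough standard-library sketch of the idea behind LinkExtractor(allow=...): collect every href, complete relative URLs against the page URL, and keep those that match the regular expression. This is only an approximation for illustration, not Scrapy's actual implementation; the names HrefCollector and extract_links, the sample page, and the page URL are all assumptions made up for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the raw href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, allow):
    """Return the absolute URLs in html that match the allow regex."""
    collector = HrefCollector()
    collector.feed(html)
    pattern = re.compile(allow)
    links = []
    for href in collector.hrefs:
        # Complete relative URLs against the URL of the page they came from.
        absolute = urljoin(base_url, href)
        # Keep matching URLs once each, in document order.
        if pattern.search(absolute) and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical page content and page URL, for illustration only.
page = """
<a href="Items/1.html">first</a>
<a href="/Items/2.html">second</a>
<a href="/about.html">about</a>
"""
links = extract_links(page, "https://example.cn/list/index.html", r"Items/")
print(links)
```

The first two links are relative in the page but come out as absolute URLs, while the about page is dropped because it does not match the pattern.
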
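- Supplementary notes:
  
  Other common LinkExtractor parameters:
  
  allow: URLs that match the given regular expression are extracted; if it is empty, every URL matches.
  
  deny: URLs that match the given regular expression are never extracted (takes precedence over allow).
  
  allow_domains: the domains whose links will be extracted.
  
  deny_domains: the domains whose links will never be extracted.
  
  restrict_xpaths: an XPath expression that works together with allow to filter links; only URLs found inside the regions selected by the XPath are extracted.
  
  Common spiders.Rule parameters:
  
  link_extractor: a LinkExtractor object that defines which links to extract.
  
  callback: the callback function invoked for the response of each link obtained by the link extractor.
  
  follow: a boolean that specifies whether the links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
  
  process_links: names the spider function that is called with the list of links obtained from link_extractor; mainly used to filter URLs.
  
  process_request: names the spider function that is called for every request extracted by this rule; used to filter requests.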
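
As an illustration of the process_links hook described above, the following is a minimal sketch of a filtering function. It relies only on the fact that each link object passed in has a url attribute; the function name filter_links, the blocked keywords, and the sample URLs are assumptions for this example, and SimpleNamespace objects stand in for the Link objects Scrapy would supply.

```python
from types import SimpleNamespace


def filter_links(links):
    """process_links-style hook: drop links whose URL contains a blocked keyword."""
    blocked = ("logout", "login")  # hypothetical keywords to skip
    return [
        link for link in links
        if not any(word in link.url for word in blocked)
    ]


# Stand-ins for the link objects Scrapy would pass in; each has a .url attribute.
sample = [
    SimpleNamespace(url="https://example.cn/Items/1.html"),
    SimpleNamespace(url="https://example.cn/Items/logout"),
]
kept = [link.url for link in filter_links(sample)]
print(kept)
```

In a real spider this would usually be written as a method (taking self as its first argument) and attached to a rule by name, for example Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', process_links='filter_links').
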

Original article: https://www.cnblogs.com/l0nmar/p/12553850.html