  • scrapy: spiders

    Official documentation: https://docs.scrapy.org/en/latest/topics/spiders.html#

    One-sentence summary: a spider is where you define the crawling behavior (e.g. whether to follow new links) and how to parse the page structure (extract data and return items).

    I. scrapy.Spider

      1 name

      2 allowed_domains  <----------------------->  OffsiteMiddleware

      3 start_urls  <-----------------------> start_requests()

      4 custom_settings  <-------------------------> Built-in settings reference

      It must be defined as a class attribute since the settings are updated before instantiation.

    import scrapy

    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.baidu.com']
        start_urls = ['https://www.baidu.com/']
        custom_settings = {
            # setting names use the upper-case keys from the Built-in settings reference
            'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        }

        def parse(self, response):
            pass

      5 crawler <----------> from_crawler()

      6 settings

      7 logger
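
      A minimal sketch (spider name and URL are placeholders) showing how the settings and logger attributes are typically used inside a callback:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://www.example.com/']

        def parse(self, response):
            # self.settings exposes the merged project settings (read-only here)
            delay = self.settings.getfloat('DOWNLOAD_DELAY')
            # self.logger is a standard logging.Logger named after the spider
            self.logger.info('Crawled %s (download delay: %s)', response.url, delay)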

      8 from_crawler(crawler, *args, **kwargs)

      This is the class method used by Scrapy to create your spiders.
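
      You rarely call from_crawler() yourself, but overriding it is the standard hook for reaching the crawler (settings, signals, stats). A minimal sketch, with an illustrative logging line that is not from the source:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            # let the base class construct the spider as usual
            spider = super().from_crawler(crawler, *args, **kwargs)
            # crawler exposes settings, signals, stats, extensions, ...
            spider.logger.info('CONCURRENT_REQUESTS = %s',
                               crawler.settings.getint('CONCURRENT_REQUESTS'))
            return spider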

      9 start_requests()

      It is called by Scrapy when the spider is opened for scraping. 

      Core code of the default implementation:

    for url in self.start_urls:
        yield Request(url, dont_filter=True)

      A note on Request: the snippet below is the constructor signature from the Request source code.

    class Request(object_ref):
    
        def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                     cookies=None, meta=None, encoding='utf-8', priority=0,
                     dont_filter=False, errback=None, flags=None):

      As the source shows, Request defaults to a GET request; to send a POST request, you need to override this method. This is where FormRequest, a subclass of Request, comes in:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
    
        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]
    
        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass

      10 parse(response)

      This method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
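
      A minimal sketch of a parse() that does both, written against the quotes.toscrape.com practice site (the CSS selectors are assumptions about that site's markup):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # yield extracted data as plain dicts (Item objects also work)
            for quote in response.css('div.quote'):
                yield {'text': quote.css('span.text::text').get()}
            # and/or yield follow-up Requests; response.follow resolves relative URLs
            next_page = response.css('li.next a::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)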

       11 log(message[, level, component])

      12 closed(reason)
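
      Called when the spider finishes, as a shortcut for the spider_closed signal. A minimal sketch (spider name and URL are placeholders):

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://www.example.com/']

        def parse(self, response):
            pass

        def closed(self, reason):
            # reason is a string such as 'finished', 'cancelled' or 'shutdown'
            self.logger.info('Spider closed: %s', reason)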

      

    II. Spider arguments

      Spider arguments are passed on the command line with the -a option and are received as keyword arguments by the spider's __init__; a sketch follows.
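
      A minimal sketch (the category argument and URL pattern are assumptions modeled on the official docs):

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def __init__(self, category=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # every -a name=value pair arrives here as a keyword argument
            self.start_urls = ['http://www.example.com/categories/%s' % category]

      Run it as: scrapy crawl myspider -a category=electronics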

    III. Generic Spiders

      1 CrawlSpider

        Recommended.

        Adds rules, which simplify following links and extracting from the pages they lead to; a sketch follows.
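
        A minimal sketch (the URL patterns are assumptions). Note that CrawlSpider uses parse() internally, so the Rule callback must have a different name:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MyCrawlSpider(CrawlSpider):
        name = 'mycrawl'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/']

        rules = (
            # follow category pages without extracting anything from them
            Rule(LinkExtractor(allow=r'/category/'), follow=True),
            # send item pages to parse_item
            Rule(LinkExtractor(allow=r'/item/\d+'), callback='parse_item'),
        )

        def parse_item(self, response):
            yield {'url': response.url, 'title': response.css('title::text').get()}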

      2 XMLFeedSpider

      3 CSVFeedSpider

      4 SitemapSpider
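
      For instance, SitemapSpider discovers URLs from a site's sitemap. A minimal sketch (the sitemap URL is a placeholder); by default every discovered URL is dispatched to parse():

    from scrapy.spiders import SitemapSpider

    class MySitemapSpider(SitemapSpider):
        name = 'mysitemap'
        sitemap_urls = ['http://www.example.com/sitemap.xml']

        def parse(self, response):
            yield {'url': response.url, 'title': response.css('title::text').get()}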
