Crawlers
Starting requests in Scrapy
The shortcut: start_urls
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
Overriding start_requests
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
start_requests(): must return an iterable of Requests from which the Spider begins to crawl (you can return a list of requests or write a generator function). Subsequent requests are generated successively from these initial ones.
The Spider class: scrapy.Spider
https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy-spider
CrawlSpider
https://docs.scrapy.org/en/latest/topics/spiders.html#crawlspider
CSVFeedSpider
https://docs.scrapy.org/en/latest/topics/spiders.html#csvfeedspider
The Selector class
Built-in selectors: https://docs.scrapy.org/en/latest/topics/selectors.html#module-scrapy.selector
The common selection methods are xpath(), css(), and re().
The Request class:
https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#request-objects
errback (the error-handling callback parameter)
The Response class
https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#response-objects
Scheduler
To be added
Downloader
To be added
Engine
To be added
Item pipelines
Item
https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#document-topics/items
An Item works like a dict: you read and write values with item['key'], and Item.fields returns all of the item's declared fields.
Custom ItemLoaders
https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#declaring-item-loaders
Declaring input and output processors
https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#declaring-input-and-output-processors
MongoDB pipeline example
https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#write-items-to-mongodb
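A sketch of such a pipeline, modeled on the docs example; the MONGO_URI / MONGO_DATABASE setting names and the collection name are assumptions, and pymongo is imported lazily so the class can be defined without it installed:

```python
class MongoPipeline:
    # Item pipeline that writes each scraped item to a MongoDB collection.
    collection_name = "items"  # assumed collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DATABASE are assumed to be defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scrapy"),
        )

    def open_spider(self, spider):
        import pymongo  # lazy import: pymongo is required to actually run this
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # items behave like dicts, so they can be inserted directly
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

Enable it through the ITEM_PIPELINES setting, e.g. {'myproject.pipelines.MongoPipeline': 300} (the module path is hypothetical).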
Splash pipeline example (taking a screenshot of each item)
https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#take-screenshot-of-item
Exporting items (feed exports; CSV, JSON, etc.)
https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#feed-exports
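Feed exports do not write .xlsx natively; a common approach is exporting CSV, which Excel opens directly. A sketch for settings.py using the FEEDS setting (Scrapy >= 2.1); the file names are illustrative:

```python
# settings.py -- write scraped items to CSV and JSON feeds
FEEDS = {
    "items.csv": {"format": "csv"},
    "items.json": {"format": "json", "encoding": "utf8"},
}
```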
Downloader middleware
Docs:
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#downloader-middleware
Implementing proxy IPs:
https://www.jianshu.com/p/8449b9c397bb
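A minimal proxy-rotation sketch; the proxy pool here is hypothetical, and it relies on Scrapy's built-in HttpProxyMiddleware honoring request.meta["proxy"]:

```python
import random

class RandomProxyMiddleware:
    # Downloader middleware that assigns a random proxy to each request.
    PROXIES = [  # hypothetical pool; in practice load from settings or a file
        "http://127.0.0.1:8888",
        "http://127.0.0.1:8889",
    ]

    def process_request(self, request, spider):
        # HttpProxyMiddleware reads request.meta["proxy"] later in the chain
        request.meta["proxy"] = random.choice(self.PROXIES)
        return None  # None lets the request continue through the chain
```

Enable it in the DOWNLOADER_MIDDLEWARES setting with a priority below 750 so it runs before the built-in HttpProxyMiddleware.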
Custom downloader middleware:
Common built-in downloader middlewares:
Spider middleware
Docs
https://docs.scrapy.org/en/latest/topics/spider-middleware.html#spider-middleware
Spider settings