Scrapy Crawler Usage Guide: A Complete Tutorial

Scrapy notes

Commands

Global commands:
- startproject: creates a Scrapy project named project_name inside a project_name directory.
```
scrapy startproject myproject
```
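- settings: prints the value of a Scrapy setting. Inside a project it shows the project's value; otherwise it shows the Scrapy default.
- runspider: runs a spider written in a single Python file, without having to create a project.
- shell: starts the Scrapy shell for the given URL, or with nothing loaded if no URL is given.
- fetch: downloads the given URL using the Scrapy downloader and writes the content to standard output.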
```
scrapy fetch --nolog --headers http://www.example.com/
```
- view: opens the given URL in a browser, as your Scrapy spider would "see" it.
```
scrapy view http://www.example.com/some/page.html
```
- version: prints the Scrapy version.
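Project-only commands:
- crawl: starts crawling with a spider, for example: scrapy crawl myspider
- check: runs contract checks, for example: scrapy check -l
- list: lists all spiders available in the current project, one per line.
- edit: opens the given spider in the editor defined by the EDITOR setting.
- parse: fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if none is given.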
```
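--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which requests should be followed recursively (default: 1)
--verbose or -v: print detailed information for each depth level
scrapy parse http://www.example.com/ -c parse_item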
```
- genspider: creates a new spider in the current project.
```
scrapy genspider [-t template] <name> <domain>
scrapy genspider -t basic example example.com
```
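- deploy: deploys the project to a Scrapyd server (newer Scrapy releases moved this to the separate scrapyd-client package).
- bench: runs a quick benchmark test.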
Using selectors
```
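from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# Build a selector directly from text
body = '<html><body><span>good</span></body></html>'
Selector(text=body).xpath('//span/text()').extract()

# Build a selector from a response (the str body is encoded using `encoding`)
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
Selector(response=response).xpath('//span/text()').extract()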
```
Scrapy provides two convenient shortcuts: response.xpath() and response.css()
```
>>> response.xpath('//base/@href').extract()
>>> response.css('base::attr(href)').extract()
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
>>> response.css('a[href*=image]::attr(href)').extract()
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
>>> response.css('a[href*=image] img::attr(src)').extract()
```
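Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example: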
```
links = response.xpath('//a[contains(@href, "image")]')
for index, link in enumerate(links):
    # Each link is itself a selector, so it can be queried with relative paths
    args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    print('Link number %d points to url %s and image %s' % args)
```
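Using selectors with regular expressions

Selector also has a .re() method for extracting data with regular expressions. Unlike .xpath() or .css(), however, .re() returns a list of unicode strings, so you cannot construct nested .re() calls.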
```
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
```
Working with relative XPaths
```
>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside the divs
...     print(p.extract())
>>> for p in divs.xpath('p'):  # extracts only the direct <p> children
...     print(p.extract())
```
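For example, the test() function (from the EXSLT regular expressions namespace, prefix re:) can be very useful when XPath's starts-with() or contains() are not sufficient.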
```
>>> sel.xpath('//li//@href').extract()
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
```
XPATH TIPS
- Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.
- Beware of the difference between //node[1] and (//node)[1]
- When selecting by class, be as specific as necessary. When querying by class, consider using CSS.
- Learn to use all the different axes
- Useful trick to get text content
Item Loaders

populate items
```
from scrapy.loader import ItemLoader
from myproject.items import Product  # assumes Product is defined in the project's items.py

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()
```
Item Pipeline
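- Cleansing HTML data
- Validating scraped data (checking that items contain certain fields)
- Checking for duplicates (and dropping them)
- Storing the scraped items in a database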
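Writing your own item pipeline

Each item pipeline component must implement the process_item(self, item, spider) method, which is called for every item. It must either return an Item (or any descendant class) object or raise a DropItem exception. Dropped items are not processed by any later pipeline components.
Parameters:
- item (Item object) – the scraped item
- spider (Spider object) – the spider that scraped the item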
Write items to MongoDB
```
import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # Connect when the spider is opened
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per item class; insert_one() replaces the deprecated insert()
        collection_name = item.__class__.__name__
        self.db[collection_name].insert_one(dict(item))
        return item
```
To activate an Item Pipeline component you must add its class to the ITEM_PIPELINES setting, as in the following example:
```
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
```
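The integer value assigned to each class determines the order in which they run: items pass through the pipelines from the lowest value to the highest. It is customary to define these numbers in the 0-1000 range.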
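The ordering rule can be modelled in a few lines of plain Python. This is only a sketch of the rule itself, not Scrapy's internal code, and pipeline_order is a hypothetical helper name.

```python
# Setting values taken from the ITEM_PIPELINES example, deliberately listed
# out of order to show that only the numbers matter.
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 800,
    "myproject.pipelines.PricePipeline": 300,
}


def pipeline_order(pipelines):
    """Return pipeline class paths sorted by ascending order value."""
    return sorted(pipelines, key=pipelines.get)


print(pipeline_order(ITEM_PIPELINES))
```

Here PricePipeline (300) would see each item before JsonWriterPipeline (800), regardless of the order in which the dictionary entries are written.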

Common practices

Running multiple spiders in the same process
```
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
```
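Avoiding getting banned
- Rotate your user agent from a pool of well-known browser user agents (a web search turns up plenty of them).
- Disable cookies (see COOKIES_ENABLED), as some sites use cookies to spot crawler behaviour.
- Use download delays (2 or higher). See the DOWNLOAD_DELAY setting.
- If possible, use Google cache to fetch pages instead of hitting the sites directly.
- Use a pool of rotating IPs, for example the free Tor project or a paid service such as ProxyMesh.
- Use a highly distributed downloader that circumvents bans internally, so you can focus on parsing pages. One example of such a downloader is Crawlera.
Settings for broad crawls:
- Increase concurrency: CONCURRENT_REQUESTS = 100
- Disable cookies: COOKIES_ENABLED = False
- Disable retries: RETRY_ENABLED = False
- Reduce the download timeout: DOWNLOAD_TIMEOUT = 15
- Disable redirects: REDIRECT_ENABLED = False
- Enable crawling of "Ajax Crawlable Pages": AJAXCRAWL_ENABLED = True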
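The user agent pool suggested in the ban-avoidance list above could be sketched as a small downloader middleware. The class name, the pool variable and the user agent strings are all placeholders invented for this example; a real pool would hold genuine browser user agent strings.

```python
import random

# Placeholder strings: substitute real browser user agents.
USER_AGENT_POOL = [
    "ExampleBrowser/1.0 (placeholder)",
    "ExampleBrowser/2.0 (placeholder)",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: pick a user agent per request."""

    def __init__(self, user_agents=USER_AGENT_POOL):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header with a randomly chosen entry from the pool.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to continue processing the request
```

Such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting, under whatever module path it lives in (for example a middlewares.py in the project, which is again an assumption).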
Firefox add-ons useful for scraping
- Firebug
- XPather
- XPath Checker
- Tamper Data
- Firecookie
Auto-throttling (a Scrapy setting): AUTOTHROTTLE_ENABLED = True
Other topics

- Scrapyd
- Spider middleware
- Downloader middleware
- Built-in settings reference
- Requests and Responses
- Scrapy tutorial
Original article: https://www.cnblogs.com/cutd/p/6208861.html
