Scrapy basic commands

Default structure of Scrapy projects

By default, all Scrapy projects share a file structure similar to this:

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

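The directory containing scrapy.cfg is treated as the project root directory. That file has a field holding the name of the Python module that defines the project settings. For example: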

[settings]
default = myproject.settings
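
As a rough illustration of the two points above, the following sketch locates a project root by walking upward until it finds scrapy.cfg, then reads the settings module name from the [settings] section. It uses only the standard library; find_project_root and read_settings_module are hypothetical helper names, not Scrapy functions, and the "demo" directory is a throwaway created for the example.

```python
import configparser
import tempfile
from pathlib import Path


def find_project_root(start):
    """Walk upward from `start` until a directory containing scrapy.cfg is found."""
    start = Path(start).resolve()
    for directory in (start, *start.parents):
        if (directory / "scrapy.cfg").is_file():
            return directory
    return None  # not inside a project


def read_settings_module(project_root):
    """Return the value of `default` in the [settings] section of scrapy.cfg."""
    parser = configparser.ConfigParser()
    parser.read(project_root / "scrapy.cfg", encoding="utf-8")
    return parser.get("settings", "default")


# Build a throwaway copy of the layout shown above and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "demo"
    spiders = root / "myproject" / "spiders"
    spiders.mkdir(parents=True)
    (root / "scrapy.cfg").write_text(
        "[settings]\ndefault = myproject.settings\n", encoding="utf-8"
    )

    found_root = find_project_root(spiders)
    settings_module = read_settings_module(found_root)
    print(found_root.name, settings_module)
```

Starting the search from the spiders directory shows that any location inside the project resolves to the same root.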

Creating a project

Typically, the first thing you do with the scrapy tool is create your Scrapy project:

scrapy startproject myproject

This command creates a Scrapy project in the myproject directory.

Next, change into the project directory:

cd myproject

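You can now use the scrapy command to manage and control your project.

Controlling a project

Run the scrapy tool from inside your project to control and manage it.

For example, to create a new spider: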

scrapy genspider mydomain mydomain.com

Available tool commands

This section lists the available built-in commands, each with a description and some usage examples. You can always get more details about a command by running:

scrapy <command> -h

You can also list all available commands:

scrapy -h

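Global commands (described below): fetch, view, shell, runspider, version, bench

Project-only commands (described below): genspider, crawl, check, list, edit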

genspider

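Syntax: scrapy genspider [-t template] <name> <domain>
Requires project: yes

Creates a spider in the current project. This is only a convenience shortcut that generates a spider from a predefined template; you can equally well create the spider source files yourself.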

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider -d basic
import scrapy

class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = (
        'http://www.$domain/',
        )

    def parse(self, response):
        pass

$ scrapy genspider -t basic example example.com
Created spider 'example' using template 'basic' in module:
  mybot.spiders.example
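
The $classname, $name and $domain placeholders in the template follow the same $-style syntax as Python's string.Template, so the substitution step can be sketched with the standard library. This is only an illustration: render_spider is a hypothetical helper, and the rule used to derive the class name (capitalised spider name plus "Spider") is an assumption rather than something stated above.

```python
from string import Template

# The basic template as printed by `scrapy genspider -d basic` above.
BASIC_TEMPLATE = '''import scrapy

class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = (
        'http://www.$domain/',
        )

    def parse(self, response):
        pass
'''


def render_spider(name, domain):
    """Fill the template placeholders for a spider called `name`."""
    # Assumption: the class name is the capitalised spider name plus "Spider".
    classname = "".join(part.capitalize() for part in name.split("_")) + "Spider"
    return Template(BASIC_TEMPLATE).substitute(
        classname=classname, name=name, domain=domain
    )


source = render_spider("example", "example.com")
compile(source, "example.py", "exec")  # syntax check only; nothing is imported
print(source)
```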

crawl

Syntax: scrapy crawl <spider>
Requires project: yes

Starts crawling with the given spider.

Example:

$ scrapy crawl myspider
[ ... myspider starts crawling ... ]

check

Syntax: scrapy check [-l] <spider>
Requires project: yes

Runs contract checks.
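
Contracts are declared as @-prefixed lines in a callback's docstring, using built-in names such as @url, @returns and @scrapes. The sketch below only illustrates that idea with the standard library: FirstSpider is a plain stand-in class rather than a real scrapy.Spider, the URLs and the "@returns items 1 1" line are hypothetical, and extract_contracts is a helper written for this example, not Scrapy's API. The callback names, the 0 to 4 request range and the RetailPricex field are borrowed from the sample output in this section.

```python
import inspect
import re


class FirstSpider:
    """Stand-in for a scrapy.Spider subclass; only the docstrings matter here."""

    name = "first_spider"

    def parse(self, response):
        """Parse a listing page.

        @url http://www.example.com/
        @returns requests 0 4
        """

    def parse_item(self, response):
        """Parse a product page.

        @url http://www.example.com/some/page.html
        @returns items 1 1
        @scrapes RetailPricex
        """


def extract_contracts(method):
    """Return (contract_name, arguments) pairs declared in a method docstring."""
    contracts = []
    for line in (inspect.getdoc(method) or "").splitlines():
        match = re.match(r"@(\w+)\s*(.*)", line.strip())
        if match:
            contracts.append(match.groups())
    return contracts


for callback in (FirstSpider.parse, FirstSpider.parse_item):
    print(callback.__name__, extract_contracts(callback))
```

A check fails when the callback's real behaviour does not match what its docstring promises, for example when it returns more requests than the declared range allows or when a declared field is missing from the scraped items.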

Example:

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4

list

Syntax: scrapy list
Requires project: yes

Lists all available spiders in the current project, one spider per line.

Usage example:

$ scrapy list
spider1
spider2

edit

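Syntax: scrapy edit <spider>
Requires project: yes

Edits the given spider using the editor defined in the EDITOR setting.

This command is only a convenience shortcut. Developers are free to use any other tool or IDE to write and debug spiders.

Example: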

$ scrapy edit spider1

fetch

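Syntax: scrapy fetch <url>
Requires project: no

Downloads the given URL using the Scrapy downloader and writes the content to standard output.

This command fetches the page the same way the spider would download it. For example, if the spider has a USER_AGENT attribute that overrides the User Agent, the command uses it.

You can therefore use this command to see how your spider would fetch a particular page.

When run outside a project, the command uses the default Scrapy downloader settings.

Example: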

$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}
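
The header dump above stores each value as a list of strings, some of them padded with trailing spaces. If you want to compare or log such headers, a small helper can tidy them first. This is a minimal sketch: flatten_headers is a hypothetical function, not part of Scrapy, and the dictionary is a subset of the output above.

```python
def flatten_headers(headers):
    """Join each header's list of values and strip surrounding whitespace."""
    return {
        name: ", ".join(value.strip() for value in values)
        for name, values in headers.items()
    }


# A subset of the headers printed by `scrapy fetch --nolog --headers` above.
raw_headers = {
    "Age": ["1263   "],
    "Connection": ["close     "],
    "Content-Type": ["text/html; charset=UTF-8"],
}

clean_headers = flatten_headers(raw_headers)
print(clean_headers)
```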

view

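Syntax: scrapy view <url>
Requires project: no

Opens the given URL in a browser, as the Scrapy spider would see it. Spiders sometimes see pages differently from regular users, so this command lets you check what the spider receives and confirm it is what you expect.

Example: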

$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

shell

Syntax: scrapy shell [url]
Requires project: no

Starts the Scrapy shell for the given URL, or empty if no URL is given. See Scrapy shell for more information.

Example:

$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]

runspider

Syntax: scrapy runspider <spider_file.py>
Requires project: no

Runs a spider that is self-contained in a Python file, without having to create a project.

Example:

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]

version

Syntax: scrapy version [-v]
Requires project: no

Prints the Scrapy version. With -v it also prints Python, Twisted and platform information, which is useful for bug reports.

bench

New in version 0.17. Tests crawling speed.

Syntax: scrapy bench
Requires project: no

Runs a benchmark test. See Benchmarking.

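Saving output to files

Scrapy's Feed Exports make it easy to write the scraped results to a file.

Save as a JSON file; after the run the project contains a new quotes.json file:
scrapy crawl quotes -o quotes.json

Write one JSON object per item per line; the extension is jl, short for jsonlines:
scrapy crawl quotes -o quotes.jl
or
scrapy crawl quotes -o quotes.jsonlines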
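
The difference between the two layouts can be sketched with the standard json module. The items below are made-up sample data, not output from a real crawl, and the code only mimics the file layouts rather than calling Scrapy's exporters.

```python
import json

# Hypothetical scraped items, used only to show the two layouts.
items = [
    {"text": "first quote", "author": "A"},
    {"text": "second quote", "author": "B"},
]

# JSON: the whole result set is one array, so it must be parsed in one go.
as_json = json.dumps(items)

# JSON Lines: one standalone JSON object per line, readable line by line.
as_jsonlines = "\n".join(json.dumps(item) for item in items) + "\n"

# Reading JSON Lines back needs no knowledge of the other lines.
restored = [json.loads(line) for line in as_jsonlines.splitlines()]

print(as_json)
print(as_jsonlines, end="")
```

Because every line is a complete object, a JSON Lines file can be appended to and processed as a stream, which is convenient for large crawls.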

The following commands export to csv, xml, pickle and marshal formats, and to a remote FTP server:
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv
FTP output requires a correct configuration, including the user name, password, host address and output path.
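
To see which pieces an -o target carries, the sketch below splits it with urllib.parse. It assumes, as the commands above suggest, that the export format is implied by the file extension and that a target without a scheme is a local file. describe_output and KNOWN_FORMATS are names introduced for this example, not part of Scrapy.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Format names taken from the commands above; "jl" is short for "jsonlines".
KNOWN_FORMATS = {"json", "jsonlines", "jl", "csv", "xml", "pickle", "marshal"}


def describe_output(target):
    """Split an -o target into storage details and the format implied by its extension."""
    parts = urlsplit(target)
    extension = PurePosixPath(parts.path).suffix.lstrip(".")
    return {
        "scheme": parts.scheme or "file",  # assumption: no scheme means a local file
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "path": parts.path,
        "format": extension if extension in KNOWN_FORMATS else None,
    }


local = describe_output("quotes.jl")
remote = describe_output("ftp://user:pass@ftp.example.com/path/to/quotes.csv")
unknown = describe_output("quotes.jsonline")

print(local)
print(remote)
print(unknown)
```

The last call shows why the extension matters: "jsonline" is not one of the format names used above, so no format can be derived from it.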

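Scrapy selectors

A leading dot (.) in an XPath expression makes it relative to the current element, so it extracts data from inside that element. Without the dot, the expression is evaluated from the root node; for example, //img selects img elements starting from the html node.

response.xpath() and response.css() are shortcuts for querying; use extract() to get the actual content.

To select inner text and attributes with XPath: response.xpath('//a/text()').extract() and response.xpath('//a/@href').extract()

/text() returns the text inside a node and /@href returns the node's href attribute; the name after @ is the attribute to retrieve.

extract_first() accepts a default argument that is returned when the XPath matches nothing: extract_first(default='')

With CSS selectors, text and attributes are written as ::text and ::attr()

The re_first() method returns the first element of the list that re() would return.

The response object cannot call re() or re_first() directly; call them on the result of xpath() or css().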

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
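
To make the leading-dot rule concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. ElementTree implements only a subset of XPath and accepts only relative paths, so the whole-page search is written as root.findall('.//img'); with Scrapy the contrast would be between '//img' and './/img' on a nested selector. The page is a shortened copy of the sample above, and the footer div with logo.jpg is a hypothetical addition so that the two searches give different results.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample page; it is well-formed XML, so ElementTree can parse it.
PAGE = """<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
  </div>
  <div id='footer'><img src='logo.jpg' /></div>
 </body>
</html>"""

root = ET.fromstring(PAGE)
images_div = root.find(".//div[@id='images']")

# Relative to the div: only the images inside it.
inside_div = [img.get("src") for img in images_div.findall(".//img")]

# From the document root: every img on the page, including the footer logo.
whole_page = [img.get("src") for img in root.findall(".//img")]

# Text and attributes, similar in spirit to /text() and /@href.
links = [(a.text, a.get("href")) for a in images_div.findall("a")]

print(inside_div)
print(whole_page)
print(links)
```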



>>> response.selector
<Selector xpath=None data='<html>\n <head>\n  <base href="http://exam'>
>>> response.selector.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
>>> response.selector.xpath('//title/text()').extract_first()
'Example website'
>>> response.selector.css('title::text').extract_first()
'Example website'
>>> response.css('title::text').extract_first()
'Example website'
>>> response.xpath('//div[@id="image"]').css('img')
[]
>>> response.xpath('//div[@id="images"]').css('img')
[<Selector xpath='descendant-or-self::img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image5_thumb.jpg">'>]
>>> response.xpath('//div[@id="images"]').css('img::attr(src)')
[<Selector xpath='descendant-or-self::img/@src' data='image1_thumb.jpg'>,
 <Selector xpath='descendant-or-self::img/@src' data='image2_thumb.jpg'>,
 <Selector xpath='descendant-or-self::img/@src' data='image3_thumb.jpg'>,
 <Selector xpath='descendant-or-self::img/@src' data='image4_thumb.jpg'>,
 <Selector xpath='descendant-or-self::img/@src' data='image5_thumb.jpg'>]
>>> response.xpath('//div[@id="images"]').css('img::attr(src)').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg',
 'image4_thumb.jpg', 'image5_thumb.jpg']
>>> response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first()
'image1_thumb.jpg'
>>> response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first(default='')
'image1_thumb.jpg'
>>> response.xpath('//a/@href')
[<Selector xpath='//a/@href' data='image1.html'>,
 <Selector xpath='//a/@href' data='image2.html'>,
 <Selector xpath='//a/@href' data='image3.html'>,
 <Selector xpath='//a/@href' data='image4.html'>,
 <Selector xpath='//a/@href' data='image5.html'>]
>>> response.xpath('//a/@href').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('a').extract()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> response.css('a::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.xpath('//a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.css('a::text()').extract()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "c:\python3.7\lib\site-packages\scrapy\http\response\text.py", line 122, in css
    return self.selector.css(query)
  File "c:\python3.7\lib\site-packages\parsel\selector.py", line 262, in css
    return self.xpath(self._css2xpath(query))
  File "c:\python3.7\lib\site-packages\parsel\selector.py", line 265, in _css2xpath
    return self._csstranslator.css_to_xpath(query)
  File "c:\python3.7\lib\site-packages\parsel\csstranslator.py", line 109, in css_to_xpath
    return super(HTMLTranslator, self).css_to_xpath(css, prefix)
  File "c:\python3.7\lib\site-packages\cssselect\xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "c:\python3.7\lib\site-packages\cssselect\xpath.py", line 192, in <genexpr>
    for selector in parse(css))
  File "c:\python3.7\lib\site-packages\cssselect\xpath.py", line 222, in selector_to_xpath
    xpath = self.xpath_pseudo_element(xpath, selector.pseudo_element)
  File "c:\python3.7\lib\site-packages\parsel\csstranslator.py", line 72, in xpath_pseudo_element
    % pseudo_element.name)
cssselect.xpath.ExpressionError: The functional pseudo-element ::text() is unknown

>>> response.css('a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.xpath('//a[contains(@href, "image")]/@href')
[<Selector xpath='//a[contains(@href, "image")]/@href' data='image1.html'>,
 <Selector xpath='//a[contains(@href, "image")]/@href' data='image2.html'>,
 <Selector xpath='//a[contains(@href, "image")]/@href' data='image3.html'>,
 <Selector xpath='//a[contains(@href, "image")]/@href' data='image4.html'>,
 <Selector xpath='//a[contains(@href, "image")]/@href' data='image5.html'>]
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('a[href*=image]::attr(hr)')
[]
>>> response.css('a[href*=image]::attr(href)')
[<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image1.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image2.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image3.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image4.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image5.html'>]
>>> response.css('a[href*=image]::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.xpath('//a/img/@src').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
>>> response.css('a[href*=image] img::attr(src)').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
>>> response.css('a::text').re('Name:(.*)')
[' My image 1 ', ' My image 2 ', ' My image 3 ', ' My image 4 ', ' My image 5 ']
>>> response.css('a::text').re_first('Name:(.*)')
' My image 1 '
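
The behaviour of re() and re_first() in the last two calls can be approximated with the standard re module. This is a sketch of the idea, not Scrapy's implementation: re_all and re_first are stand-alone helper functions written for this example, and texts is a shortened copy of the strings returned by response.css('a::text').extract() above.

```python
import re

# Shortened copy of the extracted link texts from the session above.
texts = ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ']


def re_all(strings, pattern):
    """Apply `pattern` to every string and collect the captured groups in order."""
    matches = []
    for string in strings:
        matches.extend(re.findall(pattern, string))
    return matches


def re_first(strings, pattern, default=None):
    """Return the first captured group, or `default` when nothing matches."""
    return next(iter(re_all(strings, pattern)), default)


all_names = re_all(texts, r'Name:(.*)')
first_name = re_first(texts, r'Name:(.*)')
missing = re_first(texts, r'Price:(.*)', default='')

print(all_names)
print(repr(first_name))
print(repr(missing))
```

The default argument plays the same role as in extract_first(default=''): callers get a predictable value instead of None when nothing matches.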

Original article: https://www.cnblogs.com/wangjinliang1991/p/9898876.html