scrapy shell命令的【选项】简介

zoukankan html css js c++ java

scrapy shell命令的【选项】简介
在使用scrapy shell测试某网站时，其返回400 Bad Request，那么，更改User-Agent请求头信息再试。

DEBUG: Crawled (400) <GET https://www.某网站.com> (referer: None)

可是，怎么更改呢？

使用scrapy shell --help命令查看其用法：

Options中没有找到相应的选项；

Global Options呢？里面的--set/-s命令可以设置/重写配置。

使用-s选项更改了User-Agent配置，再测试某网站，成功返回页面（状态200）：

...>scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" https://www.某网站.com

2018-07-15 12:11:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.某网站.com> (referer: None)

--------翻篇--------

说明，其实，这个-s的用法并非自己通过上面步骤知道的（之前一直关注Options下面的选项，忽略了Global Options，觉得没用吗？），而是通过网页搜索，然后见到下面的文章：

scrapy shell 用法（慢慢更新...）原文作者：木木&侃侃（一位园友，原文链接）

更进一步：在Scrapy的源码中会对相关配置项有更详细的信息。

打开C:Python36Libsite-packagesscrapycommands目录，可以在里面看到各种内置的Scrapy命令的Python文件，其中，shell.py正是scrapy shell命令的源文件。

从源码可以看到，里面定义了Command类——继承了scrapy.commands.ScrapyCommand，在Command的add_options函数中，添加了三个选项：

-c：evaluate the code in the shell, print the result and exit（执行一段解析代码？）

--spider：use this spider

--no-redirect：do not handle HTTP 3xx status codes and print response as-is

没有发现-s选项，那么，-s选项来自哪儿呢？看看scrapy.commands.ScrapyCommand的源码（__init__.py文件中）。可以发现，其下的add_options函数中添加了-s选项：
1 def add_options(self, parser): 2 """ 3 Populate option parse with options available for this command 4 """ 5 group = OptionGroup(parser, "Global Options") 6 group.add_option("--logfile", metavar="FILE", 7 help="log file. if omitted stderr will be used") 8 group.add_option("-L", "--loglevel", metavar="LEVEL", default=None, 9 help="log level (default: %s)" % self.settings['LOG_LEVEL']) 10 group.add_option("--nolog", action="store_true", 11 help="disable logging completely") 12 group.add_option("--profile", metavar="FILE", default=None, 13 help="write python cProfile stats to FILE") 14 group.add_option("--pidfile", metavar="FILE", 15 help="write process ID to FILE") 16 group.add_option("-s", "--set", action="append", default=[], metavar="NAME=VALUE", 17 help="set/override setting (may be repeated)") 18 group.add_option("--pdb", action="store_true", help="enable pdb on failure") 19 20 parser.add_option_group(group)
好了，源头找到了。

可是，之前在寻找方法时发现，scrapy crawl、runspider提供了-a选项来设置/重写配置，可是，已经有了-s选项了，为何还要增加-a选项呢？两者有什么区别？

从其解释来看，-a选项仅仅修改spider的参数，而-s可以设置的范围更广泛，包括官文Settings中所有配置吧！（未测试）

parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
　　　　　　　　help="set spider argument (may be repeated)")

--------翻篇--------

实践1：scrapy shell的-c选项

(env0626) D:wsenv0626ws>scrapy shell -c "response.xpath('//title/text()')" https://www.baidu.com

输出：

2018-07-15 13:07:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com> (referer: None)
[<Selector xpath='//title/text()' data='百度一下，你就知道'>]

实践2：scrapy runspider -a选项和-s选项修改User-Agent请求头
1 # -*- coding: utf-8 -*- 2 import scrapy 3 4 5 class MousiteSpider(scrapy.Spider): 6 name = 'mousite' 7 allowed_domains = ['www.zhihu.com'] 8 start_urls = ['https://www.zhihu.com/'] 9 10 def parse(self, response): 11 yield response.xpath('//title/text()')
测试结果：-a选项无法获取数据，返回400；-s选项可以，返回200；

-a选项：

(env0626) D:wsenv0626ws>scrapy runspider -a USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" mousite.py

DEBUG: Crawled (400) <GET https://www.zhihu.com/> (referer: None)

INFO: Ignoring response <400 https://www.zhihu.com/>: HTTP status code is not handled or not allowed

-s选项：

(env0626) D:wsenv0626ws>scrapy runspider -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" mousite.py

DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: None)

{'title': [<Selector xpath='//title/text()' data='知乎 - 发现更大的世界'>]}

看来，两者还是有差别的。

注意，上面的试验都是在Scrapy项目外执行（）。
查看全文

相关阅读:
Apache Airavata 0.6 发布
 Erebus 0.5 发布，2D 实时角色扮演游戏
 Pcompress 1.3.0 发布，性能大幅提升
 JasperStarter 1.0.1 发布
 Newscoop 4.1 发布，适合记者的 CMS 系统
 Wireshark 1.8.5 发布，网络协议检测程序
 Open Search Server 1.4 Beta2 发布
 Erlang/OTP R16A 发布
 Apache Derby 10.8.3.0 发布
 reading notes for solr source code

原文地址：https://www.cnblogs.com/luo630/p/9313322.html