
Scrapy Commands Explained

Creating a project

# Optionally specify the directory for the project at creation time; myproject is the project name
scrapy startproject myproject [project_dir]

This creates a Scrapy project under the project_dir directory. If project_dir is omitted, it defaults to the project name (myproject).
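To check the result of startproject, the default-directory rule and the generated files can be sketched in a few lines of Python. This is only an illustration: the file list reflects what recent Scrapy releases generate, and the helper names expected_files and missing_files are made up for this example, not part of Scrapy.

```python
from pathlib import Path
from typing import List, Optional

# Files that `scrapy startproject` generates in recent Scrapy releases,
# relative to the project directory. "{name}" stands for the project name.
LAYOUT = [
    "scrapy.cfg",
    "{name}/__init__.py",
    "{name}/items.py",
    "{name}/middlewares.py",
    "{name}/pipelines.py",
    "{name}/settings.py",
    "{name}/spiders/__init__.py",
]


def expected_files(project_name: str, project_dir: Optional[str] = None) -> List[Path]:
    """Return the paths that startproject is expected to create (hypothetical helper)."""
    # Same default as the command: project_dir falls back to the project name.
    root = Path(project_dir if project_dir is not None else project_name)
    return [root / entry.format(name=project_name) for entry in LAYOUT]


def missing_files(project_name: str, project_dir: Optional[str] = None) -> List[Path]:
    """Return the expected files that do not exist on disk (hypothetical helper)."""
    return [p for p in expected_files(project_name, project_dir) if not p.is_file()]


if __name__ == "__main__":
    # Show where the files of `scrapy startproject myproject` should end up.
    for path in expected_files("myproject"):
        print(path)
```

Calling missing_files("myproject") right after running the command should return an empty list if the project was generated as expected.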

Creating a spider

# Switch to the project directory first
cd project_dir
# Create the spider: myspider is the spider's name, and mydomain.com is the domain it will crawl
scrapy genspider myspider mydomain.com

Getting help

Run the following from any directory to get more information about a command:

scrapy <command> -h

Example 1:

D:>scrapy startproject -h
Usage
=====
  scrapy startproject <project_name> [project_dir]

Create new project

Options
=======
--help, -h              show this help message and exit

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Example 2:

D:>scrapy genspider -h
Usage
=====
  scrapy genspider [options] <name> <domain>

Generate new spider using pre-defined templates

Options
=======
--help, -h              show this help message and exit
--list, -l              List available templates
--edit, -e              Edit spider after creating it
--dump=TEMPLATE, -d TEMPLATE
                        Dump template to standard output
--template=TEMPLATE, -t TEMPLATE
                        Uses a custom template.
--force                 If the spider already exists, overwrite it with the
                        template

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

List all available commands:

scrapy -h

Example:

D:>scrapy -h
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

Global commands

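Scrapy has two kinds of commands: those that only work inside a Scrapy project (project-only commands) and those that also work without an active project (global commands). Global commands may behave slightly differently when run from inside a project, because they then use the project's overridden settings.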
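Whether a command runs "inside a project" depends on where it is invoked: Scrapy looks for a scrapy.cfg file in the current directory and its parents. The sketch below re-implements that lookup with the standard library only. It is a simplified illustration, not Scrapy's own code, and it ignores the SCRAPY_SETTINGS_MODULE environment variable, which the real check also takes into account.

```python
import os
from pathlib import Path
from typing import Optional


def closest_scrapy_cfg(start: Optional[str] = None) -> Optional[Path]:
    """Walk up from `start` (default: the current directory) and return
    the first scrapy.cfg found, or None if there is none."""
    current = Path(start or os.getcwd()).resolve()
    for directory in [current, *current.parents]:
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


def inside_project(start: Optional[str] = None) -> bool:
    """Simplified check: a project is active if a scrapy.cfg is reachable."""
    return closest_scrapy_cfg(start) is not None


if __name__ == "__main__":
    cfg = closest_scrapy_cfg()
    print("active project:", cfg.parent if cfg else "none")
```

Run from a project's spiders directory, the lookup finds the scrapy.cfg at the project root, which is why project-only commands such as crawl work from any subdirectory of the project.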

startproject

genspider

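# Create a spider, optionally from a specific template
scrapy genspider [-t template] <name> <domain>
# List all templates available for creating spiders
scrapy genspider -l
# Generate a spider from a given template; the default is basic, and here crawl is used
scrapy genspider -t crawl zhihu www.zhihu.com
# Print the contents of a template
scrapy genspider -d <template>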

settings

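Prints Scrapy setting values. Inside a project it shows the project's values; otherwise it shows Scrapy's defaults.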

runspider

Runs a spider that is self-contained in a single Python file, without having to create a project.

shell

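Fetches the data and then drops into an interactive console, where you can work with the same methods your spider code uses. The shell is mainly used for debugging.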

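# Start the shell
scrapy shell
# Send a request to a page; the result is stored in a response object
fetch('http://www.baidu.com')
# View the response in a browser
view(response)
# Get the response body as text
response.text
# response is an HtmlResponse object, so the page can be parsed with its xpath() and css() methods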

fetch

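Works like a URL request made with requests, except that the page is downloaded with the Scrapy downloader and written to standard output. Optional flags: --nolog suppresses log output; --headers prints the HTTP headers instead of the body; --no-redirect disables following HTTP redirects.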

 scrapy fetch <url>
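When fetch is called from a script, the flags can be assembled programmatically. The following is a minimal sketch, assuming the scrapy executable is on PATH; build_fetch_command and fetch are hypothetical helper names, and only the documented flags --nolog (suppress log output), --headers (print headers instead of the body) and --no-redirect (do not follow redirects) are used.

```python
import shutil
import subprocess
from typing import List


def build_fetch_command(url: str, headers: bool = False,
                        no_redirect: bool = False, nolog: bool = True) -> List[str]:
    """Assemble the argument list for `scrapy fetch` (hypothetical helper)."""
    cmd = ["scrapy", "fetch"]
    if nolog:
        cmd.append("--nolog")        # keep log lines out of the output
    if headers:
        cmd.append("--headers")      # print HTTP headers instead of the body
    if no_redirect:
        cmd.append("--no-redirect")  # do not follow HTTP 3xx redirects
    cmd.append(url)
    return cmd


def fetch(url: str, **options: bool) -> str:
    """Run `scrapy fetch` and return what it wrote to standard output."""
    if shutil.which("scrapy") is None:
        raise RuntimeError("scrapy executable not found on PATH")
    result = subprocess.run(build_fetch_command(url, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, fetch("http://www.baidu.com", headers=True) would return the headers of the response rather than the page body.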

view

Saves the downloaded page to a file and opens it in a browser, showing the page as Scrapy sees it.

version

Prints the Scrapy version.

Project-only commands

crawl
Runs a spider; the argument is the spider's name.

scrapy crawl myspider

check
Runs contract checks on the project's spiders and reports any failures.

scrapy check

list
Lists all spiders available in the current project.

scrapy list

edit
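Opens the specified spider for editing, using the editor configured in the EDITOR environment variable or setting.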

parse
Fetches the given URL and parses it with the spider that handles it.

bench
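Runs a quick benchmark of the local hardware. scrapy bench starts a local server and crawls it as fast as possible, solely to measure performance. It reports how many pages per minute can be crawled, which gives a baseline to compare against when running a real project and tuning the spider. Note that, as the scrapy -h output above shows, bench is also available outside a project.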
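To compare benchmark runs, the pages-per-minute figures can be pulled out of the log that scrapy bench writes. The sketch below parses the "Crawled N pages (at M pages/min)" lines emitted by Scrapy's log stats extension. The sample log is made up for illustration and is not a real measurement; page_rates and steady_rate are hypothetical helper names.

```python
import re
from statistics import mean
from typing import List

# Matches the periodic statistics line written by Scrapy's logstats extension.
RATE_RE = re.compile(r"Crawled (\d+) pages \(at (\d+) pages/min\)")

# Illustrative log excerpt; the numbers are invented, not measured.
SAMPLE_LOG = """\
2020-01-01 10:00:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-01-01 10:00:01 [scrapy.extensions.logstats] INFO: Crawled 50 pages (at 3000 pages/min), scraped 0 items (at 0 items/min)
2020-01-01 10:00:02 [scrapy.extensions.logstats] INFO: Crawled 102 pages (at 3120 pages/min), scraped 0 items (at 0 items/min)
"""


def page_rates(log_text: str) -> List[int]:
    """Return every pages/min reading found in the log, in order."""
    return [int(match.group(2)) for match in RATE_RE.finditer(log_text)]


def steady_rate(log_text: str) -> float:
    """Average the non-zero readings, skipping the idle start-up samples."""
    rates = [rate for rate in page_rates(log_text) if rate > 0]
    return float(mean(rates)) if rates else 0.0


if __name__ == "__main__":
    print(page_rates(SAMPLE_LOG))
    print(steady_rate(SAMPLE_LOG))
```

Because Scrapy logs to stderr by default, a real run can be captured with something like scrapy bench 2> bench.log, and the contents of that file passed to steady_rate.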


Original article: https://www.cnblogs.com/yoyowin/p/12200018.html