A Simple PySpider Tutorial

I. Installation

1. Let there be Python

First make sure Python has pip. Note: pip is already installed if you're using Python 2 >=2.7.9 or Python 3 >=3.4.
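
The version rule quoted above can be checked programmatically. The following is a minimal sketch; the helper name pip_is_bundled is made up for this illustration, and it encodes only the rule as quoted here, not whether pip has since been removed or upgraded on a given machine.

```python
import sys


def pip_is_bundled(version_info):
    """Return True if the version matches the rule quoted above:
    Python 2 >= 2.7.9 or Python 3 >= 3.4 (hypothetical helper)."""
    version = tuple(version_info[:3])
    if version[0] == 2:
        return version >= (2, 7, 9)
    return version >= (3, 4, 0)


if __name__ == "__main__":
    # Report the result for the interpreter running this script.
    print(sys.version.split()[0], "bundles pip:", pip_is_bundled(sys.version_info))
```

If the check fails, pip has to be installed separately before continuing.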

2. Install PySpider

Open a command prompt and run pip install pyspider.
It failed with the following error:

error: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat). Get it from http://aka.ms/vcpython27

Download Microsoft Visual C++ Compiler for Python 2.7 from that address.
The installation then fails with another error:

c:\users\xxx\appdata\local\temp\xmlXPathInitzbzwjw.c(1) : fatal error C1083: Cannot open include file: 'libxml/xpath.h': No such file or directory
*********************************************************************************
Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
*********************************************************************************
error: command 'C:\Users\xxx\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe' failed with exit status 2

Install the missing library by running easy_install lxml in cmd, which resolves the problem.
Then run pyspider in cmd and open http://localhost:5000/ in a browser.

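3. Install PhantomJS (optional)

Download PhantomJS from the official site: http://phantomjs.org/download.html
After unzipping, add the bin folder to the system PATH environment variable. Restart pyspider; if it no longer reports that PhantomJS is missing, the installation succeeded.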
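
Before restarting pyspider, it may help to confirm that the PATH change took effect. This is a small sketch that assumes Python 3.3 or later (shutil.which does not exist in Python 2.7) and assumes the executable is named phantomjs; find_on_path is a name invented for this example.

```python
import shutil


def find_on_path(executable):
    """Return the full path of `executable` if it is found on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_on_path("phantomjs")  # assumed executable name
    if location is None:
        print("phantomjs was not found on PATH")
    else:
        print("phantomjs found at", location)
```

Remember that a command prompt opened before the environment variable was changed will still see the old PATH.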

Official PySpider tutorial

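II. Testing an IMDb Movie Crawler

1. Create the script

Once the page is open, click create and fill in the start URL to crawl. A script is generated automatically.

Figure: creating the script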

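2. Edit the script

I followed the tutorial on the official site. For the first exercise, the IMDb page has changed since the tutorial was written, so I modified the body of detail_page: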

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('h1').text(),
            "rating": response.doc('span[itemprop="ratingValue"]').text(),
            "director": [x.text() for x in response.doc('[itemprop="director"] span').items()],
            "stars": [x.text() for x in response.doc('[itemprop="actors"] span').items()],
        }
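
To see what the itemprop selectors above are picking out, here is a standalone sketch that uses only Python 3's standard library html.parser. It is a stand-in for illustration, not PySpider's or PyQuery's API. The sample HTML, the ItempropCollector class, and collect_itemprops are all made up for this example, and real IMDb markup may differ. The parser is deliberately simplistic: it assumes every opened tag is closed, so void tags such as br would confuse it.

```python
from html.parser import HTMLParser

# Invented sample markup that mimics the itemprop attributes used above.
SAMPLE_HTML = """
<h1>Example Movie</h1>
<span itemprop="ratingValue">8.1</span>
<div itemprop="director"><span>Jane Doe</span></div>
<div itemprop="actors"><span>Actor One</span></div>
<div itemprop="actors"><span>Actor Two</span></div>
"""


class ItempropCollector(HTMLParser):
    """Collect text found inside elements that carry an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}           # itemprop name -> list of text fragments
        self._open_itemprops = []  # one entry per open tag (name or None)

    def handle_starttag(self, tag, attrs):
        self._open_itemprops.append(dict(attrs).get("itemprop"))

    def handle_endtag(self, tag):
        if self._open_itemprops:
            self._open_itemprops.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the innermost enclosing itemprop element.
        for name in reversed(self._open_itemprops):
            if name is not None:
                self.values.setdefault(name, []).append(text)
                return


def collect_itemprops(html):
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.values


if __name__ == "__main__":
    print(collect_itemprops(SAMPLE_HTML))
```

In the actual handler, response.doc(...) performs the equivalent lookup with CSS selectors, which is why a change in the page's markup requires updating detail_page.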

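To test the script, click run in the upper-left corner, then switch to the follow tab below. It lists the link crawled at each step; click the ... next to an entry to see its details.

Figure: test-running the script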

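3. Run the crawler

Normally you go back to the PySpider dashboard, change the project's status to RUNNING or DEBUG (there is little difference between them), tick its checkbox, and click Run. The active tasks button on the right shows the state of running tasks, and results holds the crawler's output.
Tip: if you download the results as CSV and the Chinese text looks fine in a text editor but is garbled in Excel, change the CSV file's encoding to UTF-16.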
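
The re-encoding in the tip can be scripted instead of done by hand in an editor. The following is a sketch for Python 3, assuming the downloaded file is UTF-8 (with or without a byte order mark); reencode_csv and the file names in the comment are hypothetical.

```python
import shutil


def reencode_csv(src_path, dst_path, src_encoding="utf-8-sig", dst_encoding="utf-16"):
    """Copy a text file, decoding with src_encoding and encoding with dst_encoding.

    "utf-8-sig" reads UTF-8 with or without a BOM; "utf-16" writes a BOM first.
    newline="" keeps the original line endings untouched.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        shutil.copyfileobj(src, dst)


# Example call with hypothetical file names:
# reencode_csv("results.csv", "results_utf16.csv")
```

Writing to a new file rather than overwriting keeps the original download intact in case the source encoding assumption turns out to be wrong.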

Figure: running the crawler

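4. Errors you may run into

One annoying thing about PySpider is that deleting and re-running a crawler is cumbersome. I won't go into detail, but after some fiddling I hit a strange error: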

[W 160223 19:04:58 index:105] connect to scheduler rpc error: error(10061, 'No connection could be made because the target machine actively refused it')

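The cause is unclear; I may have deleted a project improperly. I don't know a real fix, but there is a workaround that only treats the symptom: open a new cmd window, run pyspider scheduler --no-xmlrpc, and then restart pyspider. The drawback is that the command has to be entered again every time the scheduler restarts, otherwise the problem comes back. Output like the following indicates that the scheduler is running normally (note that the second line shows the project was loaded successfully):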

[I 160225 14:29:57 scheduler:453] loading projects
[I 160225 14:29:57 scheduler:722] select imdb_test:_on_get_info data:,_on_get_info
[I 160225 14:29:57 scheduler:394] in 5m: new:0,success:0,retry:0,failed:0

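One oddity: even though Python27\Scripts had been added to the system PATH, the commands above sometimes did not work. I recommend running both pyspider and the scheduler command from inside the Scripts directory (task data will then also be stored under Scripts).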
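
Error 10061 in the warning above is the Windows "connection refused" code, meaning nothing accepted the TCP connection. A quick probe can show whether anything is listening at a given address. This is a sketch for Python 3; port_is_open is a name invented here, and the port 23333 in the example is an assumption about the scheduler's XML-RPC port that should be adjusted to match your setup.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections (10061 on Windows) and timeouts.
        return False


if __name__ == "__main__":
    # Assumed address of the scheduler's RPC endpoint; change as needed.
    host, port = "127.0.0.1", 23333
    print("%s:%d reachable: %s" % (host, port, port_is_open(host, port)))
```

A False result only says that no process accepted the connection; it does not explain why the scheduler is not listening.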


Original article: https://www.cnblogs.com/zuichuyouren/p/11135565.html
