Web Crawlers
http://blog.csdn.net/almost_mr/article/details/53958940

If you run the following from the project directory:

scrapy crawl comment -o comment.csv

you do not need to write any pipelines; the -o export is enough.

In settings, set USER_AGENT and uncomment ITEM_PIPELINES.
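For reference, those two settings might look like this in `settings.py`. This is only a sketch: the user-agent string and the `douban.pipelines.DoubanPipeline` path are placeholders, not values taken from the original project.

```python
# settings.py (fragment); the values below are illustrative placeholders.

# Send a browser-like user agent instead of Scrapy's default one.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/61.0 Safari/537.36"
)

# The uncommented pipeline registration. The integer (0-1000) sets the
# order in which pipelines run, lower values first.
ITEM_PIPELINES = {
    "douban.pipelines.DoubanPipeline": 300,  # hypothetical project/class name
}
```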

In items, add the fields you need.

In spiders, add the new spider.

Fields needed: name, rating, time, number of "useful" votes, and the distribution of reviewers' places of residence.
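The per-comment fields listed above could be declared roughly as follows. The field names are my own guesses, and the original `items.py` most likely uses `scrapy.Item` with `scrapy.Field()`; this sketch uses a plain dataclass instead (Scrapy 2.2 and later accept dataclasses as items) so that it runs without Scrapy installed. The residence distribution is not a field itself; it would be derived later from the per-comment location.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class CommentItem:
    """One scraped comment; all field names are illustrative."""
    name: Optional[str] = None          # reviewer's display name
    rating: Optional[str] = None        # score given by the reviewer
    time: Optional[str] = None          # when the comment was posted
    useful_count: Optional[int] = None  # number of "useful" votes
    location: Optional[str] = None      # reviewer's place of residence


# Placeholder values, only to show how an item becomes a CSV-like row.
item = CommentItem(name="example_user", rating="4", useful_count=12)
row = asdict(item)
print(row)
```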

import pandas as pd

result = pd.read_csv('items.csv')
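Once the CSV is loaded, the residence distribution comes down to counting the values in one column. Below is a minimal sketch using only the standard library; the `location` column name and the sample rows are invented for illustration. With pandas, the equivalent count would be `result['location'].value_counts()`, assuming the column has that name.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for the exported CSV; not real scraped data.
sample_csv = io.StringIO(
    "name,rating,location\n"
    "user_a,5,Beijing\n"
    "user_b,4,Shanghai\n"
    "user_c,3,Beijing\n"
)


def location_counts(csv_file):
    """Count how many rows carry each value of the 'location' column."""
    reader = csv.DictReader(csv_file)
    # Skip rows where the location is empty.
    return Counter(row["location"] for row in reader if row["location"])


counts = location_counts(sample_csv)
print(counts.most_common())
```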

Command line

scrapy shell http://quotes.toscrape.com/tag/humor/

https://movie.douban.com/subject/25823277/comments?status=P cannot be scraped this way.

The shell lets you check whether an XPath expression is written correctly:

In [1]: response.xpath("/html/body/div/div[2]/div[1]/div[2]/span[1]/text()").extract_first()
Out[1]: '“A day without sunshine is like, you know, night.”'

In [4]: response.xpath("//span[@class='text']/text()").extract()[2]

Out[4]: '“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”'
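The same kind of check can be tried offline on a small, well-formed snippet with the standard library. This is a different tool from the Scrapy shell, not a replacement for it: `xml.etree.ElementTree` supports only a subset of XPath (no `text()`, so the text is read from `.text`), and the markup below is a placeholder rather than the real page.

```python
import xml.etree.ElementTree as ET

# Placeholder markup that mimics the structure targeted by the XPath above.
snippet = """
<div>
  <div class="quote"><span class="text">first quote</span></div>
  <div class="quote"><span class="text">second quote</span></div>
</div>
""".strip()

root = ET.fromstring(snippet)

# ".//span[@class='text']" selects every <span class="text"> below the root.
texts = [span.text for span in root.findall(".//span[@class='text']")]
print(texts)
```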

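Join the list with newlines:
```
"\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract())
```

http://blog.csdn.net/amaomao123/article/details/52511882
Handling 302/301 responses; apparently by ignoring them?
http://www.cnblogs.com/rwxwsblog/p/4575894.html
Adding a delay
http://cuiqingcai.com/968.html
A tutorial on cookies

2017/9/30
Today I added code to the douban spider in L3 to scrape the commenters' locations. The problem that came up was with the headers: the Host value differs between the two kinds of pages.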
```
# Comment pages
self.headers['Host'] = "movie.douban.com"
```
```
# User profile pages
self.headers['Host'] = "www.douban.com"
```
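One way to avoid switching the Host value by hand is to derive it from each request URL. The helper below is a hypothetical sketch, not the original spider's code: `BASE_HEADERS`, `headers_for`, and the profile URL are all invented for illustration.

```python
from urllib.parse import urlsplit

# Headers shared by every request; the user agent is a placeholder.
BASE_HEADERS = {
    "User-Agent": "Mozilla/5.0",
}


def headers_for(url):
    """Return a copy of the shared headers with Host matching the URL."""
    headers = dict(BASE_HEADERS)  # copy, so the shared dict is not mutated
    headers["Host"] = urlsplit(url).netloc
    return headers


comment_headers = headers_for(
    "https://movie.douban.com/subject/25823277/comments?status=P"
)
# Hypothetical profile path, used only to show the second host.
profile_headers = headers_for("https://www.douban.com/people/example/")
print(comment_headers["Host"], profile_headers["Host"])
```

In a spider, the returned dict could be passed as the `headers` argument when building each request.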
Original article: https://www.cnblogs.com/yunyouhua/p/7607805.html