  • Web crawlers

    http://blog.csdn.net/almost_mr/article/details/53958940

    If you run the following in the project directory

    scrapy crawl comment -o comment.csv

    you do not need to write pipelines; the CSV export is enough.

    In settings.py, set USER_AGENT and uncomment ITEM_PIPELINES.
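
    A minimal sketch of what those two settings might look like. The user-agent string is only an example value, and the project name douban and the pipeline class DoubanPipeline are assumptions, not names taken from these notes.

```python
# settings.py (fragment)

# Identify as a regular browser; the exact string is an example value.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0 Safari/537.36"

# Uncommented pipeline registration. "douban.pipelines.DoubanPipeline" is a
# hypothetical path; the number sets the order in which pipelines run.
ITEM_PIPELINES = {
    "douban.pipelines.DoubanPipeline": 300,
}
```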

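    In items.py, add the fields you need.

    In spiders, add the new spider.

    Fields needed: name, rating, time, useful count, and the distribution of reviewers' usual place of residence.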

    import pandas as pd

    result = pd.read_csv('items.csv')
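
    Once the CSV is loaded, the distribution of reviewers' locations can be counted with pandas. This is a small sketch on made-up rows; the column name location is an assumption about how the exported file is laid out.

```python
import io

import pandas as pd

# Stand-in for the exported CSV; column names and rows are made up.
csv_text = """name,rating,location
a,5,Beijing
b,4,Shanghai
c,3,Beijing
d,5,
"""
result = pd.read_csv(io.StringIO(csv_text))

# Count reviewers per location; rows with no location are dropped by default.
location_counts = result["location"].value_counts()
print(location_counts)
```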

    Command line

    scrapy shell http://quotes.toscrape.com/tag/humor/

    https://movie.douban.com/subject/25823277/comments?status=P cannot be crawled this way.

    The shell can be used to check whether an XPath is written correctly.

    In [1]: response.xpath("/html/body/div/div[2]/div[1]/div[2]/span[1]/text()").extract_first()
    Out[1]: '“A day without sunshine is like, you know, night.”'

    In [4]: response.xpath("//span[@class='text']/text()").extract()[2]

    Out[4]: '“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”'

    Join the list with "\n":

    "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract())

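    extract() returns a plain list of strings, so the join above is ordinary Python. A small sketch with a stand-in list, which also strips whitespace and skips empty fragments (that cleanup is an addition, not part of the original one-liner):

```python
def join_paragraphs(parts):
    """Join extracted text fragments with newlines, skipping blank ones."""
    return "\n".join(p.strip() for p in parts if p.strip())


# Stand-in for response.xpath(...).extract(), which returns a list of str.
paragraphs = ["First paragraph. ", "", " Second paragraph."]
print(join_paragraphs(paragraphs))
```
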
    http://blog.csdn.net/amaomao123/article/details/52511882
    Handling 301/302 responses; it seems to work by ignoring them?
    http://www.cnblogs.com/rwxwsblog/p/4575894.html
    Adding a download delay
    http://cuiqingcai.com/968.html
    A tutorial on cookies
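
    For reference, Scrapy has built-in settings that cover the three topics above. The fragment below is one possible configuration, not necessarily what the linked posts recommend, and the values are example choices.

```python
# settings.py (fragment)

# Wait between requests to the same site (seconds).
DOWNLOAD_DELAY = 2
# Vary the wait between 0.5x and 1.5x DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# Keep the cookies middleware on so session cookies are sent back.
COOKIES_ENABLED = True

# One way to deal with 301/302: do not follow redirects, and let those
# responses reach the spider callback instead of being filtered out.
REDIRECT_ENABLED = False
HTTPERROR_ALLOWED_CODES = [301, 302]
```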

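    2017/9/30
    Today I added code to the douban spider in L3 to scrape each reviewer's location, and ran into a problem with the headers. The Host header has to match the site being requested:
    Comment pages: self.headers['Host'] = "movie.douban.com"
    Profile pages: self.headers['Host'] = "www.douban.com"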
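
    One way to avoid setting Host by hand for each request type is to derive it from the URL. This is a sketch using only the standard library; headers_for, BASE_HEADERS, and the profile URL are hypothetical names and values, not taken from the original spider.

```python
from urllib.parse import urlsplit

# Shared headers without a Host entry; the values are examples.
BASE_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Language": "zh-CN,zh;q=0.8",
}


def headers_for(url, base=BASE_HEADERS):
    """Return a copy of the base headers with Host set from the URL."""
    headers = dict(base)
    headers["Host"] = urlsplit(url).netloc
    return headers


comment_url = "https://movie.douban.com/subject/25823277/comments?status=P"
profile_url = "https://www.douban.com/people/someone/"  # hypothetical path
print(headers_for(comment_url)["Host"])
print(headers_for(profile_url)["Host"])
```

    Each request then gets its own headers dict, so the shared one is never mutated between comment and profile requests.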
    
    
  • Original article: https://www.cnblogs.com/yunyouhua/p/7607805.html