1. Fix for a "Memory Error" when crawling a very large number of pages: keep the pending URLs in a queue of your own (database/file), and hand requests to the crawler only when the spider goes idle, for example:
from scrapy import signals, Spider
from scrapy.xlib.pydispatch import dispatcher

class ExampleSpider(Spider):
    name = "example"
    start_urls = ['http://www.example.com/']

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # connect the function to the spider_idle signal
        dispatcher.connect(self.queue_more_requests, signals.spider_idle)

    def queue_more_requests(self, spider):
        # this function runs every time the spider is done processing
        # all of its requests/items (i.e. it is idle)

        # get the next urls from your database/file
        urls = self.get_urls_from_somewhere()

        # if there are no urls left to be processed, do nothing and the
        # spider will now finally close
        if not urls:
            return

        # iterate through the urls, create a request for each, then send them
        # back to the crawler; this gets the spider out of its idle state
        for url in urls:
            req = self.make_requests_from_url(url)
            self.crawler.engine.crawl(req, spider)

    def parse(self, response):
        pass
More info on the spider_idle signal: http://doc.scrapy.org/en/latest/topics/signals.html#spider-idle
More info on debugging memory leaks: http://doc.scrapy.org/en/latest/topics/leaks.html
P.S. There is also a way to limit the crawl depth: the DEPTH_LIMIT setting in settings.py. To be investigated further.
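A minimal sketch of that setting, assuming stock Scrapy behavior (DEPTH_LIMIT defaults to 0, meaning unlimited depth):

# settings.py
# Cap how many links deep the spider may follow from its start URLs;
# requests beyond this depth are not scheduled.
DEPTH_LIMIT = 3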
2. If a requested URL does not exist (404), the parse callback is never called: Scrapy's HttpError middleware filters out non-2xx responses by default, so the spider simply does nothing for that page.
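If the spider should still notice those 404s, one option (a minimal sketch; the spider name and URL here are made up) is to whitelist the status code via handle_httpstatus_list so the response reaches parse():

from scrapy import Spider

class NotFoundAwareSpider(Spider):
    # hypothetical example spider: opts in to receiving 404 responses
    name = "notfound_aware"
    handle_httpstatus_list = [404]  # let 404 responses through to the callback
    start_urls = ['http://www.example.com/missing-page']

    def parse(self, response):
        if response.status == 404:
            self.log("page not found: %s" % response.url)
            return
        # ... normal parsing for successful responses ...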
3. Encoding issues
In pubmed_spider.py:
import sys
reload(sys)                      # re-expose setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding("utf-8")  # Python's default encoding is ascii
This keeps the scraped data in UTF-8. Note this reload(sys) trick only works on Python 2; sys.setdefaultencoding does not exist in Python 3.
In pipeline.py, file = codecs.open('/%s.txt' % (item['name']), mode='w', encoding='utf-8') writes the data out as UTF-8.
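For context, a minimal pipeline sketch built around that codecs.open call (the class name and the item field 'text' are assumptions, not from the original project; 'name' comes from the line above):

import codecs

class Utf8FilePipeline(object):
    # hypothetical pipeline: write each item to its own UTF-8 text file
    def process_item(self, item, spider):
        out = codecs.open('/%s.txt' % (item['name']), mode='w', encoding='utf-8')
        try:
            out.write(item['text'])  # assumed field holding the scraped text
        finally:
            out.close()
        return item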