  • Integrating the Scrapy framework over HTTP

    If you simply call a Scrapy spider directly from Flask, you will likely run into errors such as:

    ValueError: signal only works in main thread
    # or
    twisted.internet.error.ReactorNotRestartable

    Both errors stem from how Scrapy runs: it drives a Twisted reactor that can only be started once per process, and its signal handling expects the main thread, while Flask serves each request in a worker thread. There are a few ways around this.

    1 Use a Python subprocess (the subprocess module)

    First, make sure the project directory looks something like this:

    > tree -L 1                                                                                                                                                              
    
    ├── dirbot
    ├── README.rst
    ├── scrapy.cfg
    ├── server.py
    └── setup.py
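
    The examples below assume a spider named dmoz defined in dirbot/spiders/dmoz.py. A minimal sketch of such a spider might look like the following (the start URL, selectors, and item fields are illustrative assumptions, not the actual dirbot code):

    # dirbot/spiders/dmoz.py -- illustrative sketch only
    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"  # the name used by `scrapy crawl dmoz`
        allowed_domains = ["dmoz.org"]
        start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

        def parse(self, response):
            # yield one item per link on the page
            for link in response.css("a"):
                yield {
                    "title": link.css("::text").get(),
                    "url": link.css("::attr(href)").get(),
                }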

    Then launch the crawl in a new process from the Flask view:

    # server.py
    import subprocess
    
    from flask import Flask
    app = Flask(__name__)
    
    @app.route('/')
    def hello_world():
        """
        Run spider in another process and store items in file. Simply issue command:
    
        > scrapy crawl dmoz -o "output.json"

        Wait for the command to finish, then read output.json and return it to the client.
        """
        spider_name = "dmoz"
        subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
        with open("output.json") as items_file:
            return items_file.read()
    
    if __name__ == '__main__':
        app.run(debug=True)
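
    To try it, start the Flask app and request the root route (this assumes Flask's default development port 5000):

    > python server.py
    > curl http://localhost:5000/

    Note that subprocess.check_output blocks, so each request waits for the whole crawl to finish before the items come back.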


    2 Use Twisted-Klein + Scrapy

    The code looks like this:

    # server.py
    import json
    
    from klein import route, run
    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    
    from dirbot.spiders.dmoz import DmozSpider
    
    
    class MyCrawlerRunner(CrawlerRunner):
        """
        Crawler object that collects items and returns output after finishing crawl.
        """
        def crawl(self, crawler_or_spidercls, *args, **kwargs):
            # keep all items scraped
            self.items = []
    
            # create the crawler (same as in the base CrawlerRunner)
            crawler = self.create_crawler(crawler_or_spidercls)
    
            # handle each item scraped
            crawler.signals.connect(self.item_scraped, signals.item_scraped)
    
            # create a Twisted Deferred that launches the crawl
            dfd = self._crawl(crawler, *args, **kwargs)

            # add a callback - when the crawl is done, call return_items
            dfd.addCallback(self.return_items)
            return dfd
    
        def item_scraped(self, item, response, spider):
            self.items.append(item)
    
        def return_items(self, result):
            return self.items
    
    
    def return_spider_output(output):
        """
        :param output: items scraped by CrawlerRunner
        :return: json with list of items
        """
        # this just turns items into dictionaries
        # you may want to use Scrapy JSON serializer here
        return json.dumps([dict(item) for item in output])
    
    
    @route("/")
    def schedule(request):
        runner = MyCrawlerRunner()
        spider = DmozSpider()
        deferred = runner.crawl(spider)
        deferred.addCallback(return_spider_output)
        return deferred
    
    
    run("localhost", 8080)

    3 Use ScrapyRT

    Install ScrapyRT, then start it from the Scrapy project directory:

    > scrapyrt 
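
    ScrapyRT then exposes the project's spiders over HTTP, by default on port 9080 through a /crawl.json endpoint. A minimal sketch of calling it from Python, assuming the dmoz spider and ScrapyRT's default settings, could look like this:

    # scrapyrt_client.py -- illustrative sketch, assumes ScrapyRT defaults
    import requests

    response = requests.get(
        "http://localhost:9080/crawl.json",
        params={
            "spider_name": "dmoz",          # which spider to run
            "url": "http://www.dmoz.org/",  # start URL handed to the spider
        },
    )
    items = response.json().get("items", [])
    print(items)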

    Source: https://stackoverflow.com/questions/36384286/how-to-integrate-flask-scrapy
