  • An open-source multithreaded crawler framework

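    It combines the architectural ideas of Scrapy with the task-scheduling approach of Twisted. For now it is only a condensed version. The source is available at

    https://github.com/aware-why/multithreaded_crawler/, and a demo example is included.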

    Anyone interested is welcome to join in. If you have a better idea, feel free to open a pull request directly; contributions are warmly welcomed.

    multithreaded_crawler

    A condensed crawler framework built on a multithreaded model.

    Dependency

    At present, the framework depends only on modules in the Python standard library.

    Usage

    cd threaded_spider
    python run.py --help

    Running python run.py produces a demo output: it crawls sina.com.cn using five threads, with the crawl depth limited to 2 by default (tested with Python 2.7).
    The threaded_spider directory also contains log files named like “spider.*.log”, each generated by running python run.py --thread=* with the corresponding thread count.
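
The demo behaviour described above (a fixed number of worker threads plus a crawl depth limit) can be sketched with only the standard library. This is a minimal illustration written for Python 3, not the framework's actual code: the function `crawl` and its parameters `fetch_links`, `num_threads`, and `max_depth` are hypothetical names, and page fetching is left to a caller-supplied function so the sketch runs without network access.

```python
import queue
import threading


def crawl(start_url, fetch_links, num_threads=5, max_depth=2):
    """Crawl from start_url with a pool of worker threads.

    fetch_links(url) is a caller-supplied function (an assumption of this
    sketch) that returns an iterable of URLs found on the page at `url`.
    Pages up to `max_depth` hops from start_url are scheduled.
    Returns the set of scheduled URLs.
    """
    tasks = queue.Queue()
    seen = {start_url}            # URLs already scheduled
    seen_lock = threading.Lock()  # protects `seen` across threads
    tasks.put((start_url, 0))

    def worker():
        while True:
            item = tasks.get()
            if item is None:      # sentinel: no more work, shut down
                tasks.task_done()
                return
            url, depth = item
            try:
                links = fetch_links(url)
                if depth < max_depth:
                    for link in links:
                        with seen_lock:
                            if link in seen:
                                continue
                            seen.add(link)
                        # Enqueue before task_done() so join() cannot
                        # return while new work is still being added.
                        tasks.put((link, depth + 1))
            except Exception:
                pass              # one failed page must not kill the worker
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()                  # wait until every scheduled page is processed
    for _ in threads:
        tasks.put(None)           # one sentinel per worker
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    # A tiny in-memory link graph standing in for real pages.
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["a", "d"], "d": ["e"], "e": ["f"]}
    result = crawl("a", lambda url: graph.get(url, []), num_threads=5, max_depth=2)
    print(sorted(result))  # ['a', 'b', 'c', 'd']
```

With the depth limit at 2, the page "d" is processed but its links are not followed, so "e" and "f" are never scheduled. To crawl real pages, `fetch_links` could be replaced by a function that downloads the page and extracts its links.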

    Community

    QQ Group: 4704309
    Your contributions are welcome.

    Heaven helps those who help themselves; blessed by Heaven, there is good fortune and nothing unfavorable.
  • Original article: https://www.cnblogs.com/6ruce/p/3531523.html