
An open-source multithreaded crawler framework

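This framework combines Scrapy's architectural ideas with Twisted's approach to task scheduling. For now it is only a condensed version. The source is available at https://github.com/aware-why/multithreaded_crawler/ and a demo example is included.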

Anyone interested is welcome to join in. If you have a better idea, feel free to open a pull request directly. Contributions are warmly welcomed.

multithreaded_crawler

A condensed crawler framework built on a “multithreaded model”

Dependencies

At present, the framework depends only on modules from the Python standard library.

Usage

cd threaded_spider
python run.py --help

Running python run.py produces demo output: by default it crawls sina.com.cn using five threads, with the crawl depth limited to 2 (tested on Python 2.7).
The threaded_spider directory also contains log files with names like “spider.*.log”, each generated by the corresponding python run.py --thread=* command.
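
To make the idea of a fixed pool of threads crawling to a limited depth more concrete, here is a minimal sketch that uses only the standard library. It is not the framework's actual code: the names crawl, fetch_links and FAKE_WEB are hypothetical, the "web" is an in-memory link graph so the example runs offline, and it is written for Python 3 rather than the Python 2.7 the framework was tested on.

```python
import queue
import threading

# Hypothetical in-memory "web": page -> links found on that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["a", "d"],
    "d": ["e"],
    "e": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


def crawl(start_url, num_threads=5, max_depth=2):
    """Return the set of pages discovered within max_depth link hops."""
    tasks = queue.Queue()
    seen = {start_url}
    seen_lock = threading.Lock()
    tasks.put((start_url, 0))

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work, shut down
                tasks.task_done()
                return
            url, depth = item
            try:
                # Pages at max_depth are recorded but their links are not followed.
                if depth < max_depth:
                    for link in fetch_links(url):
                        with seen_lock:
                            if link in seen:
                                continue
                            seen.add(link)
                        tasks.put((link, depth + 1))
            finally:
                # Children are queued before the parent is marked done,
                # so tasks.join() cannot return too early.
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()  # wait until every queued page has been processed
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

With the sample graph above, crawl("a") discovers pages a, b, c and d: page e sits three hops away and is skipped by the default depth limit of 2. A real crawler would replace fetch_links with an actual HTTP download and link extraction step.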

Community

QQ Group: 4704309
Your contributions are welcome.

Heaven helps those who help themselves; blessed by Heaven, there is good fortune and nothing that is not favorable.


Original article: https://www.cnblogs.com/6ruce/p/3531523.html
