[Crawler Resources] A roundup of crawler resources: building our own awesome list

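The rise of big data has, to some extent, driven the rise of web crawlers. Some companies do not produce data themselves, so they have to crawl it from the web. I have followed this area for some time and have written several series of posts about crawlers. I have now collected a set of good frameworks, in the hope that it can serve as an index for anyone interested in crawlers or wanting to dig deeper. You are welcome to star, fork, or open an issue at any time, so that we can improve this awesome list together.
GitHub repository: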

Awesome-crawler

A collection of awesome web crawlers, spiders, and resources in different languages.

Python
- Scrapy - A fast high-level screen scraping and web crawling framework.
- pyspider - A powerful spider system.
- cola - A distributed crawling framework.
- Demiurge - PyQuery-based scraping micro-framework.
- feedparser - Universal feed parser.
- Grab - Site scraping framework.
- MechanicalSoup - A Python library for automating interaction with websites.
- portia - Visual scraping for Scrapy.
- crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
- RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
- MSpider - A simple, easy spider using gevent and JS rendering.
Java
- Apache Nutch - Highly extensible, highly scalable web crawler for production environment.
- Crawler4j - Simple and lightweight web crawler.
- JSoup - Scrapes, parses, manipulates and cleans HTML.
- websphinx - Website-Specific Processors for HTML INformation eXtraction.
- Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
- Gecco - An easy-to-use, lightweight web crawler.
- WebCollector - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
- Webmagic - A scalable crawler framework.
- Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
- SeimiCrawler - An agile, distributed crawler framework.
C#
- ccrawler - Built in C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages depending on their content.
- SimpleCrawler - Simple spider based on multithreading and regular expressions.
- Abot - C# web crawler built for speed and flexibility.
- Hawk - Advanced Crawler and ETL tool written in C#/WPF.
JavaScript
- simplecrawler - Event driven web crawler.
- node-crawler - Node-crawler has a clean, simple API.
- js-crawler - Web crawler for Node.js; both HTTP and HTTPS are supported.
PHP
- Goutte - A screen scraping and web crawling library for PHP.
  - laravel-goutte - Laravel 5 Facade for Goutte.
- dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
- pspider - Parallel web crawler written in PHP.
- php-spider - A configurable and extensible PHP web spider.
C++
- open-source-search-engine - A distributed open source search engine and spider/crawler written in C/C++.
Ruby
- wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
- RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.
Go
- gocrawl - Polite, slim and concurrent web crawler.
- fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
Scala
- crawler - Scala DSL for web crawling.
- scrala - Scala crawler (spider) framework, inspired by Scrapy.
- ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
This list is still being updated. For the latest resources, see the Git repository: https://github.com/BruceDone/awesome-crawler

Original post: https://www.cnblogs.com/codefish/p/5947165.html
