  • [Crawler Resources] A roundup of web crawler resources: building our own awesome series

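      The rise of big data has, to some extent, driven the rise of web crawlers: some companies do not produce data themselves, so they can only crawl it from the web. I have followed this area for some time and have written several series on crawlers. I have now collected the good frameworks here, hoping to provide an index for anyone interested in crawlers or wanting to study them further. You are welcome to star, fork, or open an issue at any time so that we can improve this awesome series together.
    GitHub repository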

    Awesome-crawler

    A collection of awesome web crawlers, spiders, and resources in different languages.

    Python

    • Scrapy - A fast high-level screen scraping and web crawling framework.
    • pyspider - A powerful spider system.
    • cola - A distributed crawling framework.
    • Demiurge - PyQuery-based scraping micro-framework.
    • feedparser - Universal feed parser.
    • Grab - Site scraping framework.
    • MechanicalSoup - A Python library for automating interaction with websites.
    • portia - Visual scraping for Scrapy.
    • crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
    • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
    • MSpider - A simple, easy spider using gevent and JS rendering.

    Java

    • Apache Nutch - Highly extensible, highly scalable web crawler for production environment.
    • Crawler4j - Simple and lightweight web crawler.
    • JSoup - Scrapes, parses, manipulates and cleans HTML.
    • websphinx - Website-Specific Processors for HTML INformation eXtraction.
    • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
    • Gecco - An easy-to-use, lightweight web crawler.
    • WebCollector - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
    • Webmagic - A scalable crawler framework.
    • Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
    • SeimiCrawler - An agile, distributed crawler framework.

    C#

    • ccrawler - Built in C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages depending on their content.
    • SimpleCrawler - Simple spider based on multithreading and regular expressions.
    • Abot - C# web crawler built for speed and flexibility.
    • Hawk - Advanced Crawler and ETL tool written in C#/WPF.

    JavaScript

    PHP

    • Goutte - A screen scraping and web crawling library for PHP.
    • dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
    • pspider - Parallel web crawler written in PHP.
    • php-spider - A configurable and extensible PHP web spider.

    C++

    Ruby

    • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
    • RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.

    Go

    • gocrawl - Polite, slim and concurrent web crawler.
    • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

    Scala

    • crawler - Scala DSL for web crawling.
    • scrala - Scala crawler (spider) framework, inspired by Scrapy.
    • ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

    This list is still being updated. For the latest resources, see the Git repository: https://github.com/BruceDone/awesome-crawler

  • Original article: https://www.cnblogs.com/codefish/p/5947165.html