    Niocchi - Java crawl library implementing synchronous I/O multiplexing

    Niocchi is a Java crawler library implementing synchronous I/O multiplexing.
    This type of implementation allows crawling tens of thousands of hosts in parallel on a single low-end server. Niocchi has been designed for big search engines that need to crawl massive amounts of data, but it can also be used to write no-frills crawlers. It is currently used in production by Enormo and Vitalprix.

    javadoc

    Index

    1. Introduction
    2. Requirements
    3. License
    4. Package organization
    5. Architecture
    6. Usage
    7. Caveats
    8. To Do
    9. Download
    10. Change history
    11. About the authors

    Introduction

    Most Java crawling libraries use the standard Java IO package.
    That means crawling N documents in parallel requires at least N running
    threads. Even if each thread takes few resources while fetching
    content, this approach becomes costly when crawling at a large scale.
    By contrast, synchronous I/O multiplexing using the NIO package
    introduced in Java 1.4 allows many documents to be crawled in
    parallel from a single thread.
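
    To make the idea concrete, here is a minimal, self-contained sketch of a
    single-threaded selector loop built directly on java.nio. It is not Niocchi
    code; the host names and the bare HTTP/1.0 request are placeholders, and
    error handling is omitted. One Selector watches every connection, so adding
    more hosts does not add threads.

        import java.io.IOException;
        import java.net.InetSocketAddress;
        import java.nio.ByteBuffer;
        import java.nio.channels.SelectionKey;
        import java.nio.channels.Selector;
        import java.nio.channels.SocketChannel;
        import java.util.Iterator;

        public class SingleThreadFetchSketch {
            public static void main(String[] args) throws IOException {
                Selector selector = Selector.open();

                // Register many non-blocking connections with a single selector.
                String[] hosts = {"example.com", "example.org", "example.net"};
                for (String host : hosts) {
                    SocketChannel ch = SocketChannel.open();
                    ch.configureBlocking(false);
                    ch.connect(new InetSocketAddress(host, 80));
                    ch.register(selector, SelectionKey.OP_CONNECT);
                }

                ByteBuffer buf = ByteBuffer.allocate(8192);
                // Loop until every registered channel has been closed.
                while (selector.select(1000) >= 0 && !selector.keys().isEmpty()) {
                    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                    while (it.hasNext()) {
                        SelectionKey key = it.next();
                        it.remove();
                        SocketChannel ch = (SocketChannel) key.channel();
                        if (key.isConnectable() && ch.finishConnect()) {
                            // Connection established: send a minimal HTTP request.
                            ch.write(ByteBuffer.wrap(
                                "GET / HTTP/1.0\r\n\r\n".getBytes("US-ASCII")));
                            key.interestOps(SelectionKey.OP_READ);
                        } else if (key.isReadable()) {
                            buf.clear();
                            if (ch.read(buf) == -1) {   // server closed the connection
                                key.cancel();
                                ch.close();
                            }
                            // A real crawler would hand the buffered bytes to a processor here.
                        }
                    }
                }
            }
        }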

    Requirements

    Niocchi requires Java 1.5 or above.

    License

    This software is licensed under the Apache license version 2.0.

    Package organization

    • org.niocchi.core holds the library itself.
    • org.niocchi.gc holds an implementation example of a very simple crawler that reads the URL to crawl from a file and saves the crawled documents.
    • org.niocchi.monitor holds a utility thread that can be used by the crawler to provide real-time information through a telnet connection.
    • org.niocchi.rc holds an implementation example of a RedirectionController.
    • org.niocchi.resources holds a few implementation examples of the Resource and ResourceCreator classes.
    • org.niocchi.urlpools holds a few implementation examples of the URLPool class.

    Architecture

    • A Query encapsulates a URL and implements methods to check its
      status after being crawled.
    • A Resource holds the crawled content and implements methods to
      save it.
    • Each Query is associated with a Resource. To crawl one URL, one
      Resource needs to be taken from the pool of resources. Once the URL is
      crawled and its content processed, the Resource is returned to the
      pool. The number of available Resources is fixed and controls how many
      URLs can be crawled in parallel at any time. This number is set through
      the ResourcePool constructor.
    • When a Query is crawled, its associated Resource will be
      processed by one of the workers.
    • The URLPool acts as the source of URLs to crawl, into which the
      crawler taps. It is an interface that must be implemented to provide
      URLs to the crawler (a rough sketch follows this list).
    • The crawler has been designed as "active", meaning it consumes
      URLs from the URLPool, as opposed to being "passive" and waiting to be
      given URLs. When the crawler starts, it will get URLs to crawl from the
      URLPool until all resources are consumed, hasNextQuery() returns false,
      or getNextQuery() returns null. Each time a Query is crawled and
      processed and its Resource returned to the ResourcePool, the crawler
      requests more URLs to crawl from the URLPool, again until all resources
      are consumed, hasNextQuery() returns false, or getNextQuery() returns
      null. If all URLs have been crawled and no more are immediately
      available, the crawler rechecks every second for URLs to crawl.
    • When a Query has been crawled, it is put into a FIFO of queries
      to be processed. One of the Workers takes it and processes the
      content of its associated Resource. The work is done in the
      processResource() method. The Query is then returned to the URLPool,
      which can examine the crawl status and the result of the processing.
      Lastly, the Query's associated Resource is returned to the Resource pool.
    • In order not to block during host name resolution, the Crawler
      uses two additional threads. The ResolverQueue resolves the URLs coming
      from the URLPool and the RedirectionResolverQueue resolves the URLs
      obtained from redirections.
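
    As a rough illustration of the URLPool contract described above, the sketch
    below feeds the crawler from a fixed list. The method names hasNextQuery()
    and getNextQuery() come from the description above, but their exact
    signatures, the Query type, and the completion callback shown here are
    assumptions made for illustration; the authoritative definitions are in
    org.niocchi.core and the javadoc.

        import java.util.Iterator;
        import java.util.List;

        // Sketch only: stands in for an org.niocchi.urlpools-style implementation.
        // "Object" is used as a placeholder where the real library uses Query.
        public class ListURLPool {
            private final Iterator<String> urls;

            public ListURLPool(List<String> urlList) {
                this.urls = urlList.iterator();
            }

            // False once no more URLs will ever become available, letting the crawler stop.
            public boolean hasNextQuery() {
                return urls.hasNext();
            }

            // Next URL to crawl, or null if none is available right now.
            // The real method would wrap the URL in a Query and return it.
            public Object getNextQuery() {
                return urls.hasNext() ? urls.next() : null;
            }

            // Hypothetical callback invoked when a crawled Query comes back,
            // where the pool could examine the crawl status and the processing result.
            public void queryProcessed(Object query) {
                // e.g. log the status, or re-queue the URL after a transient failure
            }
        }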