  • Introduction to Nutch (translation)

    Introduction

    Apache Nutch is an open source Web crawler written in Java. With it, we can discover hyperlinks on Web pages automatically, cut down on maintenance work such as checking for broken links, and build a copy of every visited page to search over. That's where Apache Solr comes in. Solr is an open source full-text search framework; with Solr we can search the pages that Nutch has visited. Luckily, integrating Nutch with Solr is pretty straightforward, as explained below.
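
    To make "finding hyperlinks in an automated manner" concrete, here is a minimal sketch in Python of the link-extraction step that any crawler performs on a fetched page. It is not Nutch's implementation (Nutch is written in Java and does far more); the names `LinkExtractor` and `extract_links` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page (illustrative only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the outgoing links found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

    A crawler would fetch each returned URL in turn; a fetch that fails is a candidate broken link, which is the kind of maintenance check mentioned above.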

    Apache Nutch supports Solr out of the box, which greatly simplifies Nutch-Solr integration. It also removes the legacy dependence on both Apache Tomcat, for running the old Nutch Web application, and Apache Lucene, for indexing. Just download a binary release from http://www.apache.org/dyn/closer.cgi/nutch/.
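
    Once crawled pages are indexed in Solr, searching them is an HTTP request to Solr's `select` handler. The sketch below only builds such a request URL and reads a JSON response body; it makes no network calls. The core URL `http://localhost:8983/solr/nutch` and the field names `url`, `title`, and `content` are assumptions about a typical setup, so adjust them to your own Solr schema. The helper names `build_select_url` and `parse_hits` are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_select_url(solr_core_url, query, rows=10):
    """Build a Solr /select URL for `query`, asking for JSON output."""
    params = {
        "q": query,          # e.g. "content:nutch" (field name is an assumption)
        "fl": "url,title",   # fields to return (assumed to exist in the schema)
        "rows": rows,        # maximum number of hits
        "wt": "json",        # response format
    }
    return f"{solr_core_url.rstrip('/')}/select?{urlencode(params)}"


def parse_hits(response_body):
    """Extract (url, title) pairs from a Solr JSON response body."""
    data = json.loads(response_body)
    return [(doc.get("url"), doc.get("title")) for doc in data["response"]["docs"]]
```

    For example, `build_select_url("http://localhost:8983/solr/nutch", "content:nutch")` gives a URL that can be opened in a browser or fetched with any HTTP client, and `parse_hits` turns the returned body into a list of page URLs and titles.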

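    ---------------------------------------------------------------------------------------------------------------Translation (corrections welcome):

         Apache Nutch is an open source web crawler project written in Java. Using it, we can find the hyperlinks on web pages in an automated way, cut down a great deal of maintenance work such as checking for dead links, and create a copy of all visited pages to search over. This is where Apache Solr comes in: Solr is an open source full-text search framework, and through Solr we can search the pages Nutch has visited. Fortunately, integrating Nutch and Solr is very simple, as the explanation below shows.

         Apache Nutch supports Solr out of the box, which makes integrating Nutch and Solr very simple. It also does away with the legacy dependencies: there is no need to run the old Nutch web application on Apache Tomcat, and no need to rely on Lucene for indexing. Download a Nutch binary release from http://www.apache.org/dyn/closer.cgi/nutch/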

  • Original article: https://www.cnblogs.com/hzhuxin/p/2509456.html