  • [Nutch 2.2.1 Basic Tutorial, Part 3] Nutch 2.2.1 Configuration Files


    nutch-site.xml

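    Nutch 2.2.1 ships with two configuration files: nutch-default.xml and nutch-site.xml.

    The former holds Nutch's built-in default properties and should generally not be edited.

    To change a default, add a property with the same name to nutch-site.xml and set the new value there. Values in nutch-site.xml override those in nutch-default.xml.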
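
    The override rule can be illustrated with a small sketch. This is not how Nutch loads its configuration internally; it only mimics the "defaults first, site file on top" behaviour. The two XML strings are tiny stand-ins for the real files, and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Stand-in for nutch-default.xml (illustrative excerpt, not the real file).
DEFAULT_XML = """
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>
</configuration>
"""

# Stand-in for nutch-site.xml: only the property we want to change.
SITE_XML = """
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
"""

def read_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    return props

def effective_config(default_xml, site_xml):
    """Load the defaults, then let same-named site properties replace them."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged

config = effective_config(DEFAULT_XML, SITE_XML)
print(config["db.ignore.external.links"])  # value taken from the site file
print(config["fetcher.parse"])             # default left untouched
```

    Only the properties that differ from the defaults need to appear in nutch-site.xml; everything else keeps its default value.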


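    1. db.ignore.external.links

    If true, the crawl stays within the initially injected hosts; outlinks that lead to external hosts are ignored.

    The same effect can be achieved by adding filters to regex-urlfilter.txt, but a large number of filters (several thousand, say) severely degrades Nutch's performance.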

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
      <description>If true, outlinks leading from a page to external hosts
      will be ignored. This is an effective way to limit the crawl to include
      only initially injected hosts, without creating complex URLFilters.
      </description>
    </property>
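
    As a rough sketch of what this setting means, the snippet below keeps only outlinks whose host matches the host of the page they were found on. It assumes "external" means "different host name", as the wording of the description suggests; the function name and URLs are hypothetical, and this is not Nutch's actual code.

```python
from urllib.parse import urlparse

def filter_outlinks(page_url, outlinks, ignore_external_links=True):
    """Drop outlinks whose host differs from the host of the source page."""
    if not ignore_external_links:
        return list(outlinks)
    page_host = urlparse(page_url).hostname  # hostname is returned lower-cased
    return [link for link in outlinks if urlparse(link).hostname == page_host]

# Hypothetical outlinks found on http://example.com/index.html
outlinks = [
    "http://example.com/about",
    "http://EXAMPLE.com/contact",
    "http://other.example.org/",
]
kept = filter_outlinks("http://example.com/index.html", outlinks)
print(kept)
```

    A single host comparison per link costs the same no matter how many sites are crawled, which is why it scales better than thousands of regex filters.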

    2. fetcher.parse

    Whether the fetcher should parse content while fetching. This is possible but not recommended.

    <property>
      <name>fetcher.parse</name>
      <value>false</value>
      <description>If true, fetcher will parse content. NOTE: previous releases would
      default to true. Since 2.0 this is set to false as a safer default.</description>
    </property>

    Official explanation:

    N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.

    In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.


    3. db.max.outlinks.per.page

    By default, Nutch processes at most 100 outlinks per page, so some links on link-heavy pages are never crawled. To change this, adjust this property.

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>100</value>
      <description>The maximum number of outlinks that we'll process for a page.  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks  will be processed for a page; otherwise, all outlinks will be processed.
      </description>
    </property>

    The official explanation, from http://wiki.apache.org/nutch/FAQ/:

    Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?

    The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).

    file: conf/nutch-default.xml

     <property>
       <name>db.max.outlinks.per.page</name>
       <value>-1</value>
       <description>The maximum number of outlinks that we'll process for a page.
       If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
       will be processed for a page; otherwise, all outlinks will be processed.
       </description>
     </property>

    see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html
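
    The rule stated in the property description (a nonnegative value caps the number of outlinks, a negative value means no cap) can be sketched as follows. The function name is invented for illustration, and which outlinks Nutch keeps when it hits the cap is an assumption here (the first ones, in document order).

```python
def limit_outlinks(outlinks, max_outlinks_per_page=100):
    """Apply the documented rule: >= 0 caps the list, a negative value keeps everything."""
    if max_outlinks_per_page >= 0:
        return outlinks[:max_outlinks_per_page]
    return list(outlinks)

# A hypothetical page with 250 outlinks.
links = ["http://example.com/page/%d" % i for i in range(250)]

print(len(limit_outlinks(links)))       # default cap of 100
print(len(limit_outlinks(links, -1)))   # -1: unlimited
print(len(limit_outlinks(links, 0)))    # 0 is nonnegative, so nothing is kept
```

    Note that 0 does not mean "unlimited"; only a negative value disables the cap.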


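    4. file.content.limit, http.content.limit, ftp.content.limit

    By default, Nutch downloads only the first 65536 bytes of a page and discards the rest.
    On some large sites, however, the home page is far longer than 65536 bytes, and the first 65536 bytes may contain nothing but layout markup, with no hyperlinks at all.
    The defaults are therefore changed as follows: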

    <property>
      <name>file.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the file
       protocol, in bytes. If this value is nonnegative (>=0), content longer
       than it will be truncated; otherwise, no truncation at all. Do not
       confuse this setting with the http.content.limit setting.
      </description>
    </property>
    
    
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the http
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>
    
    <property>
      <name>ftp.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content, in bytes.
      If this value is nonnegative (>=0), content longer than it will be truncated;
      otherwise, no truncation at all.
      Caution: classical ftp RFCs never defines partial transfer and, in fact,
      some ftp servers out there do not handle client side forced close-down very
      well. Our implementation tries its best to handle such situations smoothly.
      </description>
    </property>
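
    To see why the default limit can hide links, here is a minimal sketch of the truncation rule from the descriptions above (a nonnegative limit truncates, a negative limit does not). The function and the sample page are hypothetical and only illustrate the effect; they are not taken from Nutch.

```python
def apply_content_limit(content, limit=65536):
    """Truncate content to `limit` bytes when limit >= 0; a negative limit keeps it all."""
    if limit >= 0:
        return content[:limit]
    return content

# A hypothetical page: about 110 KB of layout markup followed by a single link.
page = b"<div></div>" * 10000 + b'<a href="/next">next</a>'

truncated = apply_content_limit(page)       # default limit of 65536 bytes
complete = apply_content_limit(page, -1)    # -1 disables truncation

print(len(truncated), b"href" in truncated)
print(len(complete), b"href" in complete)
```

    With the default limit the link at the end of the page never reaches the parser, so it is never followed. Setting the limit to -1 avoids this at the cost of downloading and storing arbitrarily large documents.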


     








  • Original article: https://www.cnblogs.com/eaglegeek/p/4557857.html