  • [Nutch 2.2.1 Tutorial, Part 3] Nutch 2.2.1 Configuration Files (Category: H3_NUTCH, 2014-08-18 16:33)


    nutch-site.xml

    Nutch 2.2.1 ships with two configuration files: nutch-default.xml and nutch-site.xml.

    The former holds Nutch's built-in default properties and should normally not be modified.

    To change a default, add a property with the same name to nutch-site.xml and set the new value there; values in nutch-site.xml override those in nutch-default.xml.
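    For example, a minimal nutch-site.xml that overrides a single default might look like this (http.agent.name is a real Nutch property; the agent name shown is just a placeholder):

    ```xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <!-- Overrides the property of the same name in nutch-default.xml. -->
      <property>
        <name>http.agent.name</name>
        <value>MyCrawler</value> <!-- placeholder crawler name -->
      </property>
    </configuration>
    ```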


    1. db.ignore.external.links

    If set to true, only pages within the originally injected hosts are crawled, and links to external hosts are ignored.

    The same effect can be achieved by adding filters to regex-urlfilter.txt, but a very large number of filters (say, several thousand) will significantly hurt Nutch's performance.
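    For comparison, restricting the crawl to a single domain in regex-urlfilter.txt might look like the following sketch (example.com is a placeholder domain; rules are evaluated top to bottom, `+` accepts and `-` rejects):

    ```
    # accept only URLs under example.com
    +^https?://([a-z0-9-]+\.)*example\.com/
    # reject everything else
    -.
    ```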

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
      <description>If true, outlinks leading from a page to external hosts
      will be ignored. This is an effective way to limit the crawl to include
      only initially injected hosts, without creating complex URLFilters.
      </description>
    </property>

    2. fetcher.parse

    Can pages be parsed in the same step as they are fetched? Yes, but it is not recommended.

    <property>
      <name>fetcher.parse</name>
      <value>false</value>
      <description>If true, fetcher will parse content. NOTE: previous releases would
      default to true. Since 2.0 this is set to false as a safer default.</description>
    </property>

    The official explanation:

    N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls, you may run out of memory or disk space, usually after a very long reduce job.

    In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.


    3. db.max.outlinks.per.page

    By default, Nutch processes only the first 100 outlinks of a page, so some links are never fetched. To change this behaviour, modify this property.

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>100</value>
      <description>The maximum number of outlinks that we'll process for a page.  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks  will be processed for a page; otherwise, all outlinks will be processed.
      </description>
    </property>
    The official FAQ explains (http://wiki.apache.org/nutch/FAQ/):

    Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?

    The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).

    file: conf/nutch-default.xml

     <property>
       <name>db.max.outlinks.per.page</name>
       <value>-1</value>
       <description>The maximum number of outlinks that we'll process for a page.
       If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
       will be processed for a page; otherwise, all outlinks will be processed.
       </description>
     </property>

    see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html


    4. file.content.limit / http.content.limit / ftp.content.limit

    By default, Nutch downloads only the first 65536 bytes of a page and discards the rest.
    For some large sites, however, the homepage is far longer than 65536 bytes; the first 65536 bytes may even consist entirely of layout markup with no hyperlinks at all.
    The defaults can therefore be changed as follows:

    <property>
      <name>file.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the file
       protocol, in bytes. If this value is nonnegative (>=0), content longer
       than it will be truncated; otherwise, no truncation at all. Do not
       confuse this setting with the http.content.limit setting.
      </description>
    </property>
    
    
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the http
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>
    
    <property>
      <name>ftp.content.limit</name>
      <value>-1</value>   
      <description>The length limit for downloaded content, in bytes.
      If this value is nonnegative (>=0), content longer than it will be truncated;
      otherwise, no truncation at all.
      Caution: classical ftp RFCs never defines partial transfer and, in fact,
      some ftp servers out there do not handle client side forced close-down very
      well. Our implementation tries its best to handle such situations smoothly.
      </description>
    </property>


     









  • Original post: https://www.cnblogs.com/lujinhong2/p/4637257.html