  • Notes on nutch-site.xml settings in Nutch 2.3

    nutch-site.xml is not required to run Nutch; in other words, Nutch will run even if you never configure it.

    nutch-site.xml is the customization counterpart of nutch-default.xml.

    nutch-default.xml lists every property Nutch can be configured with, but customizations are not made in nutch-default.xml itself; instead, you override the properties you need in nutch-site.xml.

    nutch-default.xml can be divided into 25 major sections (the trailing number is the line in nutch-default.xml where each section starts):

     1  <!-- general properties -->  (line 25)
     2  <!-- file properties -->  (line 36)
     3  <!-- HTTP properties -->  (line 83)
     4  <!-- FTP properties -->  (line 290)
     5  <!-- web db properties -->  (line 357)
     6  <!-- generate properties -->  (line 557)
     7  <!-- urlpartitioner properties -->  (line 597)
     8  <!-- fetcher properties -->  (line 617)
     9  <!-- indexingfilter plugin properties -->  (line 761)
    10  <!-- BasicIndexingfilter plugin properties -->  (line 790)
    11  <!-- moreindexingfilter plugin properties -->  (line 800)
    12  <!-- AnchorIndexing filter plugin properties -->  (line 811)
    13  <!-- URL normalizer properties -->  (line 822)
    14  <!-- mime properties -->  (line 849)
    15  <!-- plugin properties -->  (line 869)
    16  <!-- parser properties -->  (line 908)
    17  <!-- urlfilter plugin properties -->  (line 1011)
    18  <!-- scoring filters properties -->  (line 1070)
    19  <!-- language-identifier plugin properties -->  (line 1083)
    20  <!-- index-metadata plugin properties -->  (line 1143)
    21  <!-- parse-metatags plugin properties -->  (line 1155)
    22  <!-- Temporary Hadoop 0.17.x workaround. -->  (line 1168)
    23  <!-- solr index properties -->  (line 1181)
    24  <!-- elasticsearch index properties -->  (line 1220)
    25  <!-- storage properties -->  (line 1273)

    http.max.delays

    <property>
      <name>http.max.delays</name>
      <value>100</value>
      <description>The number of times a thread will delay when trying to
      fetch a page.  Each time it finds that a host is busy, it will wait
      fetcher.server.delay.  After http.max.delays attepts, it will give
      up on the page for now.</description>
    </property>

    Despite what the name might suggest, this is a retry count, not a time in seconds: each time a fetcher thread finds the host busy it waits fetcher.server.delay seconds, and after http.max.delays attempts it gives up on the page for now. On a slow or flaky network it therefore helps to give fetcher.server.delay a somewhat larger value; http.timeout is also worth adjusting to match network conditions.
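    As a concrete illustration (the values are arbitrary, not recommendations), overrides like the ones below could go inside the <configuration> element of nutch-site.xml to make the fetcher more patient on a slow network:

    <property>
      <name>fetcher.server.delay</name>
      <value>10.0</value>
      <description>Wait 10 seconds between successive requests to the same server.</description>
    </property>

    <property>
      <name>http.timeout</name>
      <value>20000</value>
      <description>Network timeout in milliseconds (the default is 10000).</description>
    </property>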

    http.content.limit

    <property>
      <name>http.content.limit</name>
      <value>65536</value>
      <description>The length limit for downloaded content using the http
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>

    This property limits the length of document content downloaded over HTTP. With the default of 65536, each fetched document is truncated to 64 KB and anything beyond that is discarded. If your crawler needs complete documents, for example large XML files, raise this value or disable truncation, as sketched below.
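    Per the description above, a negative value means no truncation at all, so a nutch-site.xml override to keep whole documents could look like this (a sketch, not a recommendation):

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>Do not truncate downloaded HTTP content.</description>
    </property>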

    db.fetch.interval.default和db.fetch.interval.max

    <property>
      <name>db.fetch.interval.default</name>
      <value>2592000</value>
      <description>The default number of seconds between re-fetches of a page (30 days).
      </description>
    </property>
    
    <property>
      <name>db.fetch.interval.max</name>
      <value>7776000</value>
      <description>The maximum number of seconds between re-fetches of a page
      (90 days). After this period every page in the db will be re-tried, no
      matter what is its status.
      </description>
    </property>

    These two properties are useful when building periodic, automated re-crawls: they control how many seconds (i.e. how many days) pass before a page is fetched again. A sketch of a weekly schedule follows.
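    For example, a weekly re-crawl might override both intervals in nutch-site.xml like this (604800 s = 7 days, 2592000 s = 30 days; the values are illustrative):

    <property>
      <name>db.fetch.interval.default</name>
      <value>604800</value>
      <description>Re-fetch every page after 7 days by default.</description>
    </property>

    <property>
      <name>db.fetch.interval.max</name>
      <value>2592000</value>
      <description>Force a re-fetch after at most 30 days, regardless of page status.</description>
    </property>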

    fetcher.server.delay

    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
      <description>The number of seconds the fetcher will delay between 
       successive requests to the same server. Note that this might get
       overriden by a Crawl-Delay from a robots.txt and is used ONLY if 
       fetcher.threads.per.queue is set to 1.
       </description>
    </property>

    fetcher.threads.fetch

    <property>
      <name>fetcher.threads.fetch</name>
      <value>10</value>
      <description>The number of FetcherThreads the fetcher should use.
      This is also determines the maximum number of requests that are
      made at once (each FetcherThread handles one connection). The total
      number of threads running in distributed mode will be the number of
      fetcher threads * number of nodes as fetcher has one map task per node.
      </description>
    </property>

    The maximum number of fetcher threads.

    fetcher.threads.per.queue

    <property>
      <name>fetcher.threads.per.queue</name>
      <value>1</value>
      <description>This number is the maximum number of threads that
        should be allowed to access a queue at one time. Setting it to 
        a value > 1 will cause the Crawl-Delay value from robots.txt to
        be ignored and the value of fetcher.server.min.delay to be used
        as a delay between successive requests to the same server instead 
        of fetcher.server.delay.
       </description>
    </property>

    The maximum number of threads allowed to fetch from the same site (host queue) at one time.
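    A possible tuning sketch for nutch-site.xml: raise the overall thread count but keep one thread per host queue, so the Crawl-Delay from robots.txt (and fetcher.server.delay) is still honoured. The numbers are illustrative only:

    <property>
      <name>fetcher.threads.fetch</name>
      <value>50</value>
      <description>Up to 50 concurrent fetch connections in total.</description>
    </property>

    <property>
      <name>fetcher.threads.per.queue</name>
      <value>1</value>
      <description>One thread per host queue, so per-host politeness settings still apply.</description>
    </property>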

    fetcher.verbose

    <property>
      <name>fetcher.verbose</name>
      <value>false</value>
      <description>If true, fetcher will log more verbosely.</description>
    </property>

    If true, the fetcher logs more verbosely.

    plugin.folders

    <property>
      <name>plugin.folders</name>
      <value>plugins</value>
      <description>Directories where nutch plugins are located.  Each
      element may be a relative or absolute path.  If absolute, it is used
      as is.  If relative, it is searched for on the classpath.</description>
    </property>

    A plugin-related property: plugin.folders specifies the directories from which Nutch plugins are loaded.

    plugin.includes

    <property>
      <name>plugin.includes</name>
     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
     <description>Regular expression naming plugin directory names to
      include.  Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable 
      protocol-httpclient, but be aware of possible intermittent problems with the 
      underlying commons-httpclient library.
      </description>
    </property>

    A plugin-related property: plugin.includes is a regular expression naming the plugins to load.
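    For example, to crawl HTTPS sites the description above suggests enabling protocol-httpclient; one way to do that in nutch-site.xml is to copy the default value and swap the protocol plugin, leaving the rest of the list unchanged:

    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      <description>Same as the default, but with protocol-httpclient so HTTPS URLs can be fetched.</description>
    </property>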

    parser.character.encoding.default

    <property>
      <name>parser.character.encoding.default</name>
      <value>windows-1252</value>
      <description>The character encoding to fall back to when no other information
      is available</description>
    </property>

    The fallback character encoding used when parsing a document if no other encoding information is available. windows-1252 seems like a fairly uncommon choice; I am not very familiar with it.
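    If most of the content you crawl is UTF-8, a reasonable (purely illustrative) override is:

    <property>
      <name>parser.character.encoding.default</name>
      <value>utf-8</value>
      <description>Fall back to UTF-8 when a page declares no encoding.</description>
    </property>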

    parser.html.impl

    <property>
      <name>parser.html.impl</name>
      <value>neko</value>
      <description>HTML Parser implementation. Currently the following keywords
      are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
      </description>
    </property>

    Specifies which parser implementation is used for HTML documents. NekoHTML is quite powerful; a later article will cover Neko in more detail, including HTML-to-text conversion and parsing HTML fragments.
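    Per the description above, switching to TagSoup is just a matter of overriding the value in nutch-site.xml:

    <property>
      <name>parser.html.impl</name>
      <value>tagsoup</value>
      <description>Use TagSoup instead of NekoHTML to parse HTML.</description>
    </property>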

    lang.analyze.max.length

    <property>
      <name>lang.analyze.max.length</name>
      <value>2048</value>
      <description> The maximum bytes of data to uses to indentify
      the language (0 means full content analysis).
      The larger is this value, the better is the analysis, but the
      slowest it is.
      </description>
    </property>

    This is language-related and comes into play during language analysis; I have not used this property myself. A few more important properties are configured in nutch-site.xml, as sketched below.
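    Putting it together, a minimal nutch-site.xml might look like the sketch below. http.agent.name (the name the crawler reports in its User-Agent header) must be set or fetching will fail; the remaining properties are just the illustrative overrides discussed above, and MyNutchCrawler is a made-up name:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>MyNutchCrawler</value>
      </property>
      <property>
        <name>http.content.limit</name>
        <value>-1</value>
      </property>
      <property>
        <name>db.fetch.interval.default</name>
        <value>604800</value>
      </property>
    </configuration>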

