  • Nutch 2.3 command-line parameters explained

    List of commands that nutch can run

    [root@ewanalysis ~]# nutch
    Usage: nutch COMMAND
    where COMMAND is one of:
     inject         inject new urls into the database
     hostinject     creates or updates an existing host table from a text file
     generate       generate new batches to fetch from crawl db
     fetch          fetch URLs marked during generate
     parse          parse URLs marked during fetch
     updatedb       update web table after parsing
     updatehostdb   update host table after parsing
     readdb         read/dump records from page database
     readhostdb     display entries from the hostDB
     index          run the plugin-based indexer on parsed batches
     elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
     solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
     solrdedup      remove duplicates from solr
     solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
     clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
     parsechecker   check the parser for a given url
     indexchecker   check the indexing filters for a given url
     plugin         load a plugin and run one of its classes main()
     nutchserver    run a (local) Nutch server on a user defined port
     webapp         run a local Nutch web application
     junit          runs the given JUnit test
     or
     CLASSNAME      run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.

    crawl

    Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>

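    Parameters:

    <seedDir>: an existing directory that holds the text file(s) listing the seed URLs.

    <crawlID>: the ID that identifies this crawl

    [<solrUrl>]: the URL of the Solr instance used to build the index

    <numberOfRounds>: the number of crawl rounds to run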
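
    As a rough illustration of the argument order, the sketch below only assembles and prints a crawl command line; it does not run Nutch. The seed directory `urls`, the crawl ID `demo`, and the Solr URL are hypothetical values chosen for the example.

```shell
# Dry-run sketch: builds the crawl command line and prints it.
SEED_DIR="urls"                          # hypothetical seed directory
CRAWL_ID="demo"                          # hypothetical crawl ID
SOLR_URL="http://localhost:8983/solr"    # hypothetical Solr URL; leave empty to omit it
ROUNDS=2                                 # number of crawl rounds

crawl_cmd() {
  # Argument order follows the usage line:
  #   crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
  if [ -n "$SOLR_URL" ]; then
    echo "crawl $SEED_DIR $CRAWL_ID $SOLR_URL $ROUNDS"
  else
    echo "crawl $SEED_DIR $CRAWL_ID $ROUNDS"
  fi
}

crawl_cmd
```

    Because <solrUrl> is optional, leaving SOLR_URL empty yields the three-argument form.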

    nutch inject

    Usage: InjectorJob <url_dir> [-crawlId <id>]

    Parameters:

    <url_dir>: an existing directory that holds the text file(s) listing the URLs to inject.
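
    A minimal sketch of preparing <url_dir>, assuming the usual layout of one URL per line in a plain text file. The temporary directory, the file name `seed.txt`, and the crawl ID `demo` are illustrative choices; the inject command is printed rather than executed.

```shell
# Create a seed directory containing one text file with one URL per line.
SEED_DIR="$(mktemp -d)/urls"        # hypothetical location for the seed directory
mkdir -p "$SEED_DIR"
printf '%s\n' "http://nutch.apache.org/" > "$SEED_DIR/seed.txt"   # seed.txt is an arbitrary name

CRAWL_ID="demo"                     # hypothetical crawl ID
# Follows the usage line: InjectorJob <url_dir> [-crawlId <id>]
INJECT_CMD="nutch inject $SEED_DIR -crawlId $CRAWL_ID"
echo "$INJECT_CMD"                  # dry run: prints the command instead of running it
```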

    nutch generate

    Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]

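    Parameters:

    [-topN N]: the number of top-ranked URLs to select; the default is Long.MAX_VALUE

    [-noFilter]: do not apply the URL filter plugins (the help text gives the default as true, i.e. filtering is active unless this flag is passed)

    [-noNorm]: do not apply the normalizer plugins to the URLs (the help text gives the default as true, i.e. normalization is active unless this flag is passed)

    [-adddays numDays]: adds <numDays> to the current time when selecting URLs, so that pages that would otherwise only become due later (see db.default.fetch.interval) are crawled now; the default is 0. URLs whose scheduled fetch time is earlier than this adjusted time are selected.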
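
    The optional flags can be combined freely. The following sketch assembles a generate command from a few shell variables and prints it; the variable names and the values (`demo`, 1000) are hypothetical, and nothing is executed against Nutch.

```shell
# Dry-run sketch: assemble a generate command from optional settings.
CRAWL_ID="demo"     # hypothetical crawl ID
TOP_N=1000          # select at most this many URLs
ADD_DAYS=0          # 0 means: do not shift the current time
NO_FILTER=false     # true adds -noFilter (skip URL filter plugins)
NO_NORM=false       # true adds -noNorm (skip URL normalizer plugins)

generate_cmd() {
  cmd="nutch generate -topN $TOP_N -crawlId $CRAWL_ID"
  if [ "$NO_FILTER" = "true" ]; then cmd="$cmd -noFilter"; fi
  if [ "$NO_NORM" = "true" ]; then cmd="$cmd -noNorm"; fi
  if [ "$ADD_DAYS" -gt 0 ]; then cmd="$cmd -adddays $ADD_DAYS"; fi
  echo "$cmd"
}

generate_cmd
```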

    nutch fetch

    Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]

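    Parameters:

    [-crawlId <id>]: the id used to prefix the schemas to operate on (default: storage.crawl.id, as stated in the updatedb help text)

    [-threads N]: the number of fetcher threads to run; the default comes from the configuration key fetcher.threads.fetch, which is 10

    [-resume]: resume an interrupted job

    [-numTasks N]: if N > 0, use N reduce tasks for fetching (default: mapred.map.tasks)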
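
    To show how -numTasks only matters when N > 0, here is a small sketch of a helper that prints a fetch command. The function name `fetch_cmd`, its positional arguments, and the crawl ID `demo` are assumptions made for the example; the command is printed, not run.

```shell
# Dry-run sketch: print a fetch command for a batch ID (or -all).
CRAWL_ID="demo"     # hypothetical crawl ID

fetch_cmd() {
  # $1: batch ID or -all
  # $2: number of fetcher threads
  # $3: number of reduce tasks; 0 leaves the default in place
  batch="$1"
  threads="$2"
  num_tasks="$3"
  cmd="nutch fetch $batch -crawlId $CRAWL_ID -threads $threads"
  if [ "$num_tasks" -gt 0 ]; then
    cmd="$cmd -numTasks $num_tasks"
  fi
  echo "$cmd"
}

fetch_cmd -all 20 0
fetch_cmd -all 20 4
```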

    nutch parse

    Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]

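    Parameters:

    [-crawlId <id>]: the id used to prefix the schemas to operate on (default: storage.crawl.id, as stated in the updatedb help text)

    [-resume]: resume a previously interrupted job

    [-force]: force a page to be parsed again, even if it has already been parsed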

    nutch updatedb

    Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
        <batchId>     - crawl identifier returned by Generator, or -all for all generated batchId-s
        -crawlId <id> - the id to prefix the schemas to operate on (default: storage.crawl.id)

    Parameters: as listed in the usage text above.

    nutch index

    Usage: IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]

    Parameters:
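
    Taken together, the commands above form one crawl round: generate, fetch, parse, updatedb, then index. The sketch below prints that sequence for a hypothetical crawl ID `demo`, using -all wherever a usage line accepts (<batchId> | -all). It is a dry run under those assumptions, not the crawl script's actual logic.

```shell
# Dry-run sketch: print the commands of a single crawl round in order.
CRAWL_ID="demo"     # hypothetical crawl ID
TOP_N=1000          # hypothetical limit for generate

round_cmds() {
  echo "nutch generate -topN $TOP_N -crawlId $CRAWL_ID"
  # The remaining steps all accept (<batchId> | -all) per their usage lines.
  for step in fetch parse updatedb index; do
    echo "nutch $step -all -crawlId $CRAWL_ID"
  done
}

round_cmds
```

    Passing a specific batch ID instead of -all restricts each step to the URLs marked by one generate run.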

  • Original post: https://www.cnblogs.com/zhjsll/p/4704409.html