Nutch 2.3 Command Parameter Reference

Commands available in nutch
[root@ewanalysis ~]# nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 index          run the plugin-based indexer on parsed batches
 elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
 solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
 solrdedup      remove duplicates from solr
 solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 webapp         run a local Nutch web application
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
crawl

Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>

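Parameter description:

<seedDir>: an existing directory containing text files that list the seed URLs.

<crawlID>: the ID of the crawl.

[<solrUrl>]: the address of the Solr instance used for indexing.

<numberOfRounds>: the number of crawl rounds to run.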
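
To show how the four arguments line up, here is a minimal sketch that only assembles and prints a crawl command line; it does not run Nutch. The helper name crawl_cmd, the seed directory urls, the crawl ID demo, and the Solr URL are all hypothetical values chosen for illustration.

```shell
#!/bin/sh
# Assemble a crawl invocation following:
#   crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
# Pass an empty string as the third argument to omit the optional Solr URL.
# crawl_cmd is a hypothetical helper, not part of Nutch.
crawl_cmd() {
    seed_dir=$1; crawl_id=$2; solr_url=$3; rounds=$4
    if [ -n "$solr_url" ]; then
        printf 'crawl %s %s %s %s\n' "$seed_dir" "$crawl_id" "$solr_url" "$rounds"
    else
        printf 'crawl %s %s %s\n' "$seed_dir" "$crawl_id" "$rounds"
    fi
}

# Hypothetical values: seed directory "urls", crawl ID "demo", two rounds.
crawl_cmd urls demo "" 2
crawl_cmd urls demo http://localhost:8983/solr 2
```

The printed lines can be pasted into a shell once the crawl script of the Nutch installation is on the PATH.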

nutch inject

Usage: InjectorJob <url_dir> [-crawlId <id>]

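Parameter description:

<url_dir>: an existing directory containing text files that list the URLs to inject.

[-crawlId <id>]: the ID used to prefix the schemas to operate on (default: storage.crawl.id).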
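
A seed directory is just a folder holding a text file with one URL per line. The sketch below prepares such a folder in a scratch location and prints the matching inject command without running it. The helper names make_seed_dir and inject_cmd, the file name seed.txt, the example URLs, and the crawl ID demo are assumptions made for this example.

```shell
#!/bin/sh
# Create a seed directory and write one URL per line into seed.txt.
# make_seed_dir and inject_cmd are hypothetical helpers, not Nutch commands.
make_seed_dir() {
    dir=$1; shift
    mkdir -p "$dir"
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# Print: nutch inject <url_dir> [-crawlId <id>]
inject_cmd() {
    url_dir=$1; crawl_id=$2
    if [ -n "$crawl_id" ]; then
        printf 'nutch inject %s -crawlId %s\n' "$url_dir" "$crawl_id"
    else
        printf 'nutch inject %s\n' "$url_dir"
    fi
}

# Scratch directory so the example does not touch an existing crawl setup.
work=$(mktemp -d)
make_seed_dir "$work/urls" http://example.com/ http://example.org/
inject_cmd "$work/urls" demo
```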

nutch generate

Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]

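Parameter description:

[-topN N]: the number of top URLs to select; the default is Long.MAX_VALUE.

[-noFilter]: do not apply the URL filter plugins. Filtering is enabled by default.

[-noNorm]: do not apply the normalizer plugins to normalize URLs. Normalization is enabled by default.

[-adddays numDays]: adds <numDays> to the current time, so that URLs which would otherwise not be due until db.default.fetch.interval has elapsed are picked up sooner. The default value is 0. URLs whose scheduled fetch time is earlier than the adjusted time are selected.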
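
The optional flags can be combined freely. As a rough sketch, the helper below builds a generate command from whichever settings are supplied and prints it instead of running it. The name generate_cmd and the sample values (1000 URLs, crawl ID demo, 7 days) are hypothetical; -noFilter and -noNorm could be appended in the same way.

```shell
#!/bin/sh
# Build a GeneratorJob command line from optional settings.
# Empty arguments are skipped, so only the requested flags appear.
# generate_cmd is a hypothetical helper, not part of Nutch.
generate_cmd() {
    top_n=$1; crawl_id=$2; add_days=$3
    cmd='nutch generate'
    if [ -n "$top_n" ]; then cmd="$cmd -topN $top_n"; fi
    if [ -n "$crawl_id" ]; then cmd="$cmd -crawlId $crawl_id"; fi
    if [ -n "$add_days" ]; then cmd="$cmd -adddays $add_days"; fi
    printf '%s\n' "$cmd"
}

# Select at most 1000 URLs for the crawl "demo".
generate_cmd 1000 demo ""
# Same, but treat the current time as 7 days later when choosing due URLs.
generate_cmd 1000 demo 7
```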

nutch fetch

Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]

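Parameter description:

[-crawlId <id>]: the ID used to prefix the schemas to operate on (default: storage.crawl.id).

[-threads N]: the number of fetcher threads. The default comes from the configuration key fetcher.threads.fetch, which is 10.

[-resume]: resume an interrupted job.

[-numTasks N]: if N > 0, use N reduce tasks for fetching (default: mapred.map.tasks).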
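
The first argument is either a batch ID or -all. The sketch below defaults to -all when no batch ID is given and prints the resulting fetch command rather than executing it. The helper name fetch_cmd, the crawl ID demo, the thread count 20, and the batch ID shown are hypothetical.

```shell
#!/bin/sh
# Print: nutch fetch (<batchId> | -all) [-crawlId <id>] [-threads N]
# An empty first argument falls back to -all.
# fetch_cmd is a hypothetical helper, not part of Nutch.
fetch_cmd() {
    batch=${1:--all}; crawl_id=$2; threads=$3
    cmd="nutch fetch $batch"
    if [ -n "$crawl_id" ]; then cmd="$cmd -crawlId $crawl_id"; fi
    if [ -n "$threads" ]; then cmd="$cmd -threads $threads"; fi
    printf '%s\n' "$cmd"
}

# Fetch every generated batch of crawl "demo" with 20 threads.
fetch_cmd "" demo 20
# Fetch a single batch; the batch ID here is made up for illustration.
fetch_cmd 1438934567-12345 demo ""
```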

nutch parse

Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]

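Parameter description:

[-crawlId <id>]: the ID used to prefix the schemas to operate on (default: storage.crawl.id).

[-resume]: resume a previously interrupted job.

[-force]: force the page to be parsed again, even if it has already been parsed.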
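
Whether to pass -resume or -force depends on the situation: picking up after an interruption versus re-parsing pages that were already parsed. A minimal sketch that prints the parse command for either case follows; the helper name parse_cmd, its mode argument, and the crawl ID demo are assumptions for this example.

```shell
#!/bin/sh
# Print: nutch parse (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
# The third argument is "resume", "force", or empty for neither flag.
# parse_cmd is a hypothetical helper, not part of Nutch.
parse_cmd() {
    batch=${1:--all}; crawl_id=$2; mode=$3
    cmd="nutch parse $batch"
    if [ -n "$crawl_id" ]; then cmd="$cmd -crawlId $crawl_id"; fi
    case $mode in
        resume) cmd="$cmd -resume" ;;
        force)  cmd="$cmd -force" ;;
    esac
    printf '%s\n' "$cmd"
}

# Continue a parse job that was interrupted.
parse_cmd "" demo resume
# Parse all batches again, including pages that were already parsed.
parse_cmd "" demo force
```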

nutch updatedb

Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]

Parameter description:

<batchId>: the crawl identifier returned by the generator, or -all for all generated batch IDs.

[-crawlId <id>]: the ID used to prefix the schemas to operate on (default: storage.crawl.id).
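
updatedb closes one round of the cycle described by the command list: generate, fetch, parse, then updatedb. The sketch below prints those four commands in order for a single round, using -all so that no batch ID has to be passed along. The helper name round_cmds and the crawl ID demo are hypothetical, and nothing is executed.

```shell
#!/bin/sh
# Print the commands of one crawl round in the order they would run.
# -all is used for fetch, parse and updatedb, so no batch ID is needed.
# round_cmds is a hypothetical helper, not part of Nutch.
round_cmds() {
    crawl_id=$1
    for step in generate "fetch -all" "parse -all" "updatedb -all"; do
        printf 'nutch %s -crawlId %s\n' "$step" "$crawl_id"
    done
}

round_cmds demo
```

Keeping the same -crawlId on every step matters, because it selects the schemas that all four jobs read and write.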

nutch index

Usage: IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]

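Parameter description:

<batchId>: the crawl identifier returned by the generator, or -all for all generated batch IDs.

[-crawlId <id>]: the ID used to prefix the schemas to operate on (default: storage.crawl.id).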
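
The index command requires exactly one of a batch ID, -all, or -reindex as its first argument. As a sketch, the helper below refuses an empty target and otherwise prints the index command without running it. The helper name index_cmd and the crawl ID demo are hypothetical.

```shell
#!/bin/sh
# Print: nutch index (<batchId> | -all | -reindex) [-crawlId <id>]
# The target is mandatory; an empty target is reported as an error.
# index_cmd is a hypothetical helper, not part of Nutch.
index_cmd() {
    target=$1; crawl_id=$2
    if [ -z "$target" ]; then
        echo 'index_cmd: need <batchId>, -all or -reindex' >&2
        return 1
    fi
    cmd="nutch index $target"
    if [ -n "$crawl_id" ]; then cmd="$cmd -crawlId $crawl_id"; fi
    printf '%s\n' "$cmd"
}

# Index every parsed batch of crawl "demo".
index_cmd -all demo
```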

Original article: https://www.cnblogs.com/zhjsll/p/4704409.html
