Configuring Nutch
(The nutch folder is already located under /home.)
1. Modify the system environment variables
sudo gedit /etc/profile
// Append the following lines:
# set nutch
export PATH=/home/nutch/runtime/local/bin:$PATH
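Since /etc/profile is only read at login, the new PATH entry is not visible in shells that are already open. The sketch below shows one way to apply and check the setting in the current shell. It assumes the install location /home/nutch/runtime/local/bin from the export line above; the variable name NUTCH_BIN is introduced here only for illustration.

```shell
# Assumed install location, taken from the export line above.
NUTCH_BIN=/home/nutch/runtime/local/bin

# Prepend NUTCH_BIN to PATH only if it is not already present,
# so running this more than once does not duplicate the entry.
case ":$PATH:" in
  *":$NUTCH_BIN:"*) ;;
  *) PATH="$NUTCH_BIN:$PATH" ;;
esac
export PATH

# Report whether the nutch script can now be resolved.
command -v nutch || echo "nutch not found on PATH; check $NUTCH_BIN"
```

Alternatively, running `source /etc/profile` or logging out and back in should pick up the edited file directly.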
2. Test (run the nutch and crawl scripts, which live in nutch/runtime/local/bin as ./nutch and ./crawl)
nutch
// Output:
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
crawl
// Output (expected, since no arguments were given):
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
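The usage line above lists four positional arguments. As a rough sketch of what a full invocation might look like, the snippet below creates a seed directory containing one URL per line and then calls crawl with all four arguments. Every value here (the urls directory, the seed.txt file name, the testcrawl ID, the Solr URL, and the round count) is a hypothetical placeholder rather than something taken from these notes, and the Solr URL assumes a Solr instance is already running locally.

```shell
# Hypothetical values; adjust all of these to your own setup.
SEED_DIR=urls
CRAWL_ID=testcrawl
SOLR_URL=http://localhost:8983/solr
ROUNDS=2

# The seed directory holds plain-text files with one URL per line.
mkdir -p "$SEED_DIR"
printf '%s\n' 'https://nutch.apache.org/' > "$SEED_DIR/seed.txt"

# Run the crawl only if the script is resolvable on PATH.
if command -v crawl >/dev/null 2>&1; then
  crawl "$SEED_DIR" "$CRAWL_ID" "$SOLR_URL" "$ROUNDS"
else
  echo "crawl not found on PATH; skipping" >&2
fi
```

Keeping the arguments in variables makes it easier to rerun the crawl with a different seed list or number of rounds.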