  • Configuring Nutch

    (The nutch folder is already located under /home.)

    1. Edit the system environment variables

    sudo gedit /etc/profile
    

    //Add the following lines:

    #set nutch
    export PATH=/home/nutch/runtime/local/bin:$PATH
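
A change to /etc/profile only takes effect in a new login shell, or after running `source /etc/profile` in the current one. As a rough alternative to editing the file by hand, the sketch below appends the same two lines only when they are missing. The helper name `add_nutch_path` is hypothetical, and the path is the one used above.

```shell
#!/usr/bin/env bash
# Sketch: append the Nutch PATH entry to a profile file only if it is missing.
# NUTCH_BIN is the path used in this post; add_nutch_path is a hypothetical helper.
NUTCH_BIN="/home/nutch/runtime/local/bin"

add_nutch_path() {
    local profile="$1"
    local line="export PATH=${NUTCH_BIN}:\$PATH"
    # -F: match literally, -x: match the whole line, -q: print nothing.
    if ! grep -qxF "$line" "$profile" 2>/dev/null; then
        printf '\n#set nutch\n%s\n' "$line" >> "$profile"
    fi
}

# Example (writing to /etc/profile needs root privileges):
#   add_nutch_path /etc/profile
# Then reload the file in the current shell:
#   source /etc/profile
```

Calling the helper twice leaves a single entry in the file, so it is safe to re-run.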
    

    2. Test (the ./nutch and ./crawl scripts in nutch/runtime/local/bin)

    nutch
    
    //Output:
    Usage: nutch COMMAND
    where COMMAND is one of:
     inject         inject new urls into the database
     hostinject     creates or updates an existing host table from a text file
     generate       generate new batches to fetch from crawl db
     fetch          fetch URLs marked during generate
     parse          parse URLs marked during fetch
     updatedb       update web table after parsing
     updatehostdb   update host table after parsing
     readdb         read/dump records from page database
     readhostdb     display entries from the hostDB
     elasticindex   run the elasticsearch indexer
     solrindex      run the solr indexer on parsed batches
     solrdedup      remove duplicates from solr
     parsechecker   check the parser for a given url
     indexchecker   check the indexing filters for a given url
     plugin         load a plugin and run one of its classes main()
     nutchserver    run a (local) Nutch server on a user defined port
     junit          runs the given JUnit test
     or
     CLASSNAME      run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.
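
If the usage text does not appear, a quick check is whether the shell resolves the command at all. The sketch below uses the standard `command -v` builtin; `check_on_path` is a hypothetical helper name, not part of Nutch.

```shell
#!/usr/bin/env bash
# Sketch: report whether a command resolves on the current PATH.
# check_on_path is a hypothetical helper name.
check_on_path() {
    local cmd="$1"
    local resolved
    # command -v prints the resolved location and returns non-zero if nothing is found.
    if resolved="$(command -v "$cmd")"; then
        echo "$cmd -> $resolved"
    else
        echo "$cmd not found on PATH" >&2
        return 1
    fi
}

# Example:
#   check_on_path nutch
#   check_on_path crawl
```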
    
    crawl
    
    //Output:
    Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
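
The message above lists the arguments that crawl expects, starting with a seed directory. A minimal sketch of preparing one follows. The helper `make_seed_dir` is hypothetical, the file name seed.txt follows the usual Nutch tutorial convention, and the directory name, crawl ID, Solr URL and round count in the example are placeholder assumptions rather than values from this post.

```shell
#!/usr/bin/env bash
# Sketch: create a seed directory holding one URL per line in seed.txt.
# make_seed_dir is a hypothetical helper; all example values are placeholders.
make_seed_dir() {
    local seed_dir="$1"
    shift
    mkdir -p "$seed_dir"
    # Write each remaining argument as one line of seed.txt.
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Example, following the argument order from the usage message:
#   make_seed_dir urls "http://nutch.apache.org/"
#   crawl urls TestCrawl http://localhost:8983/solr/ 2
```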
    
  • Original article: https://www.cnblogs.com/timssd/p/5103236.html