  • nutch install

    on ubuntu
    http://wiki.apache.org/nutch/NutchTutorial#A1_Setup_Nutch_from_binary_distribution

    It is best to follow the official documentation; earlier I googled a Chinese installation tutorial written for the old 1.0 release, and it led me completely astray.

    Steps

    1 Setup Nutch from binary distribution

    • Unzip your binary Nutch package to $HOME/nutch-1.3
    • cd $HOME/nutch-1.3/runtime/local

    From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory.
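
    A minimal shell sketch of these two steps; the archive name (apache-nutch-1.3-bin.tar.gz here) and the unpacked directory are assumptions, so adjust them to match your download:

    cd $HOME
    tar -xzf apache-nutch-1.3-bin.tar.gz    # assumed archive name
    cd $HOME/nutch-1.3/runtime/local        # adjust if the archive unpacks under a different folder
    export NUTCH_RUNTIME_HOME=$(pwd)        # referred to as ${NUTCH_RUNTIME_HOME} below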

    2. Verify your Nutch installation

    • run "bin/nutch" - You can confirm a correct installation if you seeing the following:

    Usage: nutch [-core] COMMAND

    Some troubleshooting tips:

    • Run the following command if you are seeing "Permission denied":

    chmod +x bin/nutch
    • Set JAVA_HOME if you see a "JAVA_HOME not set" error. On a Mac, you can run the following command or add it to ~/.bashrc:

    export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
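
    On Ubuntu the JDK usually lives under /usr/lib/jvm instead; the path below is only an example and depends on which JDK package is installed:

    readlink -f $(which java)                      # prints where the installed java actually lives
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk   # example path; adjust to your JDK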

    3. Crawl your first website

    • Add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, for example:

    <property>
     <name>http.agent.name</name>
     <value>My Nutch Spider</value>
    </property>
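
    The property element must sit inside the <configuration> root of conf/nutch-site.xml. A minimal sketch of the whole file, assuming a fresh install where no other properties have been added yet:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
      </property>
    </configuration>
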
    • mkdir -p urls
    • Create a text file named nutch inside the urls/ directory with the following content (one URL per line for each site you want Nutch to crawl); a shell sketch of both steps follows the example URL below.

    http://nutch.apache.org/
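
    Both the directory and the seed file can be created from the shell (the file name nutch matches the step above):

    mkdir -p urls
    echo "http://nutch.apache.org/" > urls/nutch   # one URL per line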

    • Edit the file conf/regex-urlfilter.txt and replace

    # accept anything else
    +.

    with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:

     +^http://([a-z0-9]*\.)*nutch.apache.org/

    This will include any URL in the nutch.apache.org domain.
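
    The leading + in regex-urlfilter.txt is Nutch's include marker, not part of the regular expression itself, so a quick optional sanity check of the pattern with grep -E (a helper step, not part of the tutorial) drops it:

    # prints the URL if the pattern matches it; no output means the filter would reject it
    echo "http://nutch.apache.org/" | grep -E '^http://([a-z0-9]*\.)*nutch.apache.org/'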

    3.1 Using the Crawl Command

    Now we are ready to initiate a crawl. The crawl command takes the following parameters:

    • -dir dir names the directory to put the crawl in.

    • -threads threads determines the number of threads that will fetch in parallel.

    • -depth depth indicates the link depth from the root page that should be crawled.

    • -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

    • Run the following command:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 5
    • Now you should be able to see the following directories created:

    crawl/crawldb
    crawl/linkdb
    crawl/segments
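
    One way to confirm that the crawl actually produced data is to dump statistics from the crawl database (readdb is a standard Nutch command; the exact output format varies between versions):

    bin/nutch readdb crawl/crawldb -stats
    ls crawl/segments/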

