  • Scheduled incremental crawling with Nutch

    Translated from:

    http://wiki.apache.org/nutch/Crawl

    Introduction

    Note: the script does not use the Nutch crawl command (bin/nutch crawl, i.e. the "Crawl" class) directly, so URL filtering does not depend on "conf/crawl-urlfilter.txt". Configure the filters in "regex-urlfilter.txt" instead.
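    For illustration, a minimal "conf/regex-urlfilter.txt" might look like the fragment below. Each rule is a '+' (accept) or '-' (reject) followed by a regular expression, and the first rule that matches a URL decides its fate. The domain example.org is a placeholder chosen for this sketch, not something from the original article.

```text
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):

# accept anything under the placeholder domain example.org (assumption)
+^http://([a-z0-9]*\.)*example\.org/

# reject everything else
-.
```

    Because the first match wins, the catch-all reject rule has to come last.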

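    Steps

    The script consists of roughly 8 steps:

    1. Inject URLs
    2. Generate, Fetch, Parse, Update Loop (repeatedly generate the list of URLs to fetch, fetch them, parse the fetched pages, and update the databases)
    3. Merge Segments
    4. Invert Links (build the link database of incoming links for the fetched pages)
    5. Index
    6. Dedup (remove duplicates)
    7. Merge Indexes
    8. Load new indexes (restart Tomcat so that it loads the new index directory)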

    Modes of Execution

    The script can be run in two modes:

    • Normal Mode
    • Safe Mode

    Normal Mode

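    Run the script as 'bin/runbot'. In this mode it deletes the temporary directories generated during the crawl once it has finished with them.

    Caution: this means that if the crawl is interrupted for any reason and the crawl DB is left incomplete, it cannot be recovered.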

    Safe Mode

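    Run the script as 'bin/runbot safe' to use safe mode, which does not delete those directories. Instead, they are kept as backups with the prefix "BACKUP" (for example crawl/BACKUPsegments). If something goes wrong, these backups can be used for recovery.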

    Normal Mode vs. Safe Mode

    Unless you are sure that nothing will go wrong, we recommend running in safe mode.
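    The difference between the two modes can be sketched as a small helper that swaps a freshly built directory into place. This is an illustration only: rotate_dir is a hypothetical name, and the demo works on a temporary directory rather than a real crawl.

```shell
# Sketch of the cleanup policy described above; not the original script.
# rotate_dir is a hypothetical helper name.
#   $1 = mode ("safe", or anything else for normal mode)
#   $2 = parent directory, $3 = live name, $4 = name of the freshly built dir
rotate_dir() {
  local mode=$1 parent=$2 live=$3 fresh=$4
  if [ "$mode" = "safe" ]; then
    # Safe mode: drop the previous backup, then keep the old data as BACKUP<name>.
    rm -rf "$parent/BACKUP$live"
    if [ -e "$parent/$live" ]; then
      mv "$parent/$live" "$parent/BACKUP$live"
    fi
  else
    # Normal mode: the old data is deleted outright.
    rm -rf "$parent/$live"
  fi
  mv "$parent/$fresh" "$parent/$live"
}

# Demo on a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
mkdir -p "$demo/segments" "$demo/MERGEDsegments"
echo old > "$demo/segments/marker"
echo new > "$demo/MERGEDsegments/marker"
rotate_dir safe "$demo" segments MERGEDsegments
cat "$demo/segments/marker"        # content of the new data
cat "$demo/BACKUPsegments/marker"  # content of the previous data
```

    In normal mode the same call would leave no BACKUPsegments directory behind, which is why a failed crawl cannot be rolled back there.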

    Tinkering

    Set 'depth', 'threads', 'adddays' and 'topN' to suit your needs.

    If you do not want to set 'topN', comment out or delete that line.
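    As a rough sketch of how these tunables reach the command line, the snippet below builds the option string for 'nutch generate' and leaves out -topN when no value is configured. build_generate_args is a hypothetical helper written for this example; the script itself does the same thing inline.

```shell
# Sketch: turn the tunables into arguments for "nutch generate".
# build_generate_args is a hypothetical helper, not part of Nutch.
depth=2      # number of generate/fetch/update rounds
threads=5    # fetcher threads per round
adddays=5    # days added to the current time when selecting URLs due for fetch
topN=15      # maximum URLs per round; set to "" to pass no limit

build_generate_args() {
  # -topN is only passed when a value is configured.
  if [ -n "$topN" ]; then
    printf '%s ' "-topN $topN"
  fi
  printf '%s\n' "-adddays $adddays"
}

build_generate_args          # with topN=15
topN=""
build_generate_args          # without topN
```

    A larger 'adddays' makes already fetched pages count as due sooner, which is what drives the re-crawl.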

    NUTCH_HOME

    If you are not running the script as 'bin/runbot' from the Nutch directory, set 'NUTCH_HOME' in the script to your Nutch path:

    if [ -z "$NUTCH_HOME" ]
    then
      NUTCH_HOME=.
    fi

    P.S. If 'NUTCH_HOME' is already set as an environment variable, you can skip this step.
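    The same fallback can also be written with shell parameter expansion. This is an alternative idiom shown for comparison, not what the script uses, and /opt/nutch is a made-up path.

```shell
# Sketch: ${VAR:=default} assigns the default only when VAR is unset or empty.
unset NUTCH_HOME            # simulate a shell where the variable is missing
: "${NUTCH_HOME:=.}"
echo "NUTCH_HOME=$NUTCH_HOME"

NUTCH_HOME=/opt/nutch       # hypothetical install path, for illustration only
: "${NUTCH_HOME:=.}"        # leaves the existing value untouched
echo "NUTCH_HOME=$NUTCH_HOME"
```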

    CATALINA_HOME

    'CATALINA_HOME' points to the Tomcat installation directory.

    Set it either in the script or as an environment variable, in the same way as 'NUTCH_HOME':

    if [ -z "$CATALINA_HOME" ]
    then
      CATALINA_HOME=/opt/apache-tomcat-6.0.10
    fi
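    Since the script only needs bin/shutdown.sh and bin/startup.sh under 'CATALINA_HOME', a quick sanity check before a long crawl might look like the sketch below. check_catalina_home is a hypothetical helper, and the demo uses a fake directory layout rather than a real Tomcat.

```shell
# Sketch: fail early if CATALINA_HOME does not look like a Tomcat install.
# check_catalina_home is a hypothetical helper; the script itself only calls
# bin/shutdown.sh and bin/startup.sh under CATALINA_HOME.
check_catalina_home() {
  local home=$1 script
  for script in shutdown.sh startup.sh; do
    if [ ! -f "$home/bin/$script" ]; then
      echo "runbot: $home/bin/$script not found" >&2
      return 1
    fi
  done
  return 0
}

# Demo with a fake layout in a temporary directory.
fake=$(mktemp -d)
mkdir -p "$fake/bin"
touch "$fake/bin/shutdown.sh" "$fake/bin/startup.sh"
check_catalina_home "$fake" && echo "looks like Tomcat"
check_catalina_home "$fake/missing" || echo "not a Tomcat home"
```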


    Can it re-crawl?

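    The author has used the script many times, but please test whether it suits your own setup before relying on it. If re-crawling does not work well, please contact us.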
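    Re-crawling on a schedule usually means invoking the script repeatedly, for example from cron. One thing to guard against in that setup is a new run starting while the previous one is still working on the same crawl directory. The sketch below uses a lock directory for that. run_exclusive, the lock path, the wrapper script name and the cron schedule are all assumptions made for this example.

```shell
# Sketch: run a command only if no other run holds the lock.
# run_exclusive and the lock path are hypothetical; mkdir is used because
# creating a directory either succeeds or fails as a single operation.
run_exclusive() {
  local lock=$1 status
  shift
  if ! mkdir "$lock" 2>/dev/null; then
    echo "runbot: another crawl seems to be running, skipping" >&2
    return 1
  fi
  "$@"
  status=$?
  rmdir "$lock"
  return $status
}

# Demo with a harmless command instead of bin/runbot.
lock="${TMPDIR:-/tmp}/runbot.lock.$$"
run_exclusive "$lock" echo "crawl would run here"

# A crontab entry could then call a wrapper script nightly, e.g.
# (the path and schedule are assumptions):
#   0 2 * * * cd /path/to/nutch && ./recrawl-wrapper.sh
```

    In a real wrapper the demo command would be replaced by 'bin/runbot safe'.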

    Script

    #!/bin/bash
    # runbot script to run the Nutch bot for crawling and re-crawling.
    # Usage: bin/runbot [safe]
    #        If executed in 'safe' mode, it doesn't delete the temporary
    #        directories generated during crawl. This might be helpful for
    #        analysis and recovery in case a crawl fails.
    #
    # Author: Susam Pal
    
    depth=2
    threads=5
    adddays=5
    topN=15 #Comment this statement if you don't want to set topN value
    
    # Arguments for rm and mv
    RMARGS="-rf"
    MVARGS="--verbose"
    
    # Parse arguments
    if [ "$1" == "safe" ]  # check which mode to run in
    then
      safe=yes
    fi
    
    if [ -z "$NUTCH_HOME" ]  # check whether NUTCH_HOME is set
    then
      NUTCH_HOME=.
      echo runbot: $0 could not find environment variable NUTCH_HOME
      echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script 
    else
      echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME 
    fi
    
    if [ -z "$CATALINA_HOME" ]  # check whether the Tomcat path is set
    then
      CATALINA_HOME=/opt/apache-tomcat-6.0.10
      echo runbot: $0 could not find environment variable CATALINA_HOME
      echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script 
    else
      echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME 
    fi
    
    if [ -n "$topN" ]  # turn topN into a command-line option
    then
      topN="-topN $topN"
    else
      topN=""
    fi
    
    steps=8
    # Inject the seed URLs
    echo "----- Inject (Step 1 of $steps) -----"
    $NUTCH_HOME/bin/nutch inject crawl/crawldb urls
    
    # Run the generate/fetch/update loop
    echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
    for((i=0; i < $depth; i++))
    do
      echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
      $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
          -adddays $adddays
      if [ $? -ne 0 ]
      then
        echo "runbot: Stopping at depth $depth. No more URLs to fetch."
        break
      fi
      segment=`ls -d crawl/segments/* | tail -1`
    
      $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
      if [ $? -ne 0 ]
      then
        echo "runbot: fetch $segment at depth `expr $i + 1` failed."
        echo "runbot: Deleting segment $segment."
        rm $RMARGS $segment
        continue
      fi

      $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
    done

    # Merge the segments
    echo "----- Merge Segments (Step 3 of $steps) -----"
    $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
    if [ "$safe" != "yes" ]
    then
      rm $RMARGS crawl/segments
    else
      rm $RMARGS crawl/BACKUPsegments
      mv $MVARGS crawl/segments crawl/BACKUPsegments
    fi

    mv $MVARGS crawl/MERGEDsegments crawl/segments

    # Build the link database
    echo "----- Invert Links (Step 4 of $steps) -----"
    $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

    # Build the index
    echo "----- Index (Step 5 of $steps) -----"
    $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
        crawl/segments/*

    # Remove duplicates
    echo "----- Dedup (Step 6 of $steps) -----"
    $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

    # Merge the indexes
    echo "----- Merge Indexes (Step 7 of $steps) -----"
    $NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

    # Restart Tomcat so that it loads the new index directory
    echo "----- Loading New Index (Step 8 of $steps) -----"
    ${CATALINA_HOME}/bin/shutdown.sh

    if [ "$safe" != "yes" ]
    then
      rm $RMARGS crawl/NEWindexes
      rm $RMARGS crawl/index
    else
      rm $RMARGS crawl/BACKUPindexes
      rm $RMARGS crawl/BACKUPindex
      mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
      mv $MVARGS crawl/index crawl/BACKUPindex
    fi

    mv $MVARGS crawl/NEWindex crawl/index

    ${CATALINA_HOME}/bin/startup.sh

    echo "runbot: FINISHED: Crawl completed!"
    echo ""
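    One detail of the loop worth spelling out is how it finds the segment that 'generate' has just created. Nutch names segment directories by their creation timestamp, so the last name in sorted order is the newest one. The sketch below reproduces that lookup on made-up directory names; latest_segment is a hypothetical helper, not part of the script.

```shell
# Sketch: how "ls -d crawl/segments/* | tail -1" finds the newest segment.
# Segment directories are named by timestamp, so the name that sorts last
# is the most recent one.
latest_segment() {
  ls -d "$1"/* 2>/dev/null | tail -1
}

# Demo with made-up timestamp names in a temporary directory.
work=$(mktemp -d)
mkdir -p "$work/segments/20240101120000" \
         "$work/segments/20240102093000" \
         "$work/segments/20231231235959"
latest_segment "$work/segments"   # prints the path ending in 20240102093000
```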

  • Original article: https://www.cnblogs.com/cynchanpin/p/6924023.html