  • Importing Nutch 1.7 into Eclipse


    Recommended development environment: Ubuntu + Eclipse (Windows + Cygwin + Eclipse is not recommended)

    Step 1: Download
    http://archive.apache.org/dist/nutch/
    Download both the src and bin archives from the site above:
    wget 'http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz'
    wget 'http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz'

    Step 2: Extract
    tar zxvf apache-nutch-1.7-bin.tar.gz
    This extracts a folder named apache-nutch-1.7.
    Rename it: mv apache-nutch-1.7 apache-nutch-1.7-bin

    tar zxvf apache-nutch-1.7-src.tar.gz
    This also extracts a folder named apache-nutch-1.7.
    Rename it: mv apache-nutch-1.7 apache-nutch-1.7-src
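
Both archives unpack to the same apache-nutch-1.7 folder, so the first rename has to happen before the second extraction. A minimal sketch of an alternative that unpacks each archive straight into its final folder, assuming a tar that supports -C and --strip-components (GNU tar and bsdtar both do). The name extract_as is a hypothetical helper, not part of Nutch.

```shell
# extract_as is a hypothetical helper (not part of Nutch).
# It unpacks TARBALL into TARGET_DIR and drops the archive's top-level
# folder, so no rename is needed afterwards.
extract_as() {
  tarball=$1
  target=$2
  mkdir -p "$target" || return 1
  tar zxf "$tarball" -C "$target" --strip-components=1
}
```

For example, `extract_as apache-nutch-1.7-bin.tar.gz apache-nutch-1.7-bin` followed by `extract_as apache-nutch-1.7-src.tar.gz apache-nutch-1.7-src` would give the same two folders as the commands above.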

    Step 3: Combine
    Copy all jar files from apache-nutch-1.7-bin/lib into apache-nutch-1.7-src/lib:
    cp apache-nutch-1.7-bin/lib/* apache-nutch-1.7-src/lib/
    Overwrite the configuration files in apache-nutch-1.7-src/conf with those from apache-nutch-1.7-bin/conf.
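
Step 3 gives a command for the jars but not for the configuration files. A minimal sketch that performs both copies, assuming the folder names from step 2. The name merge_nutch_dirs is a hypothetical helper, not part of Nutch, and it copies only *.jar files from lib rather than everything.

```shell
# merge_nutch_dirs is a hypothetical helper (not part of Nutch).
# It copies the jars and the configuration files from the binary
# distribution into the source tree, overwriting files of the same name.
merge_nutch_dirs() {
  bin_dir=$1
  src_dir=$2
  mkdir -p "$src_dir/lib" "$src_dir/conf" || return 1
  # Jars only: plain cp would complain about any sub-folder inside lib.
  cp "$bin_dir"/lib/*.jar "$src_dir/lib/" || return 1
  # "conf/." copies the folder's contents, not the folder itself.
  cp -R "$bin_dir/conf/." "$src_dir/conf/"
}
```

It would be called as `merge_nutch_dirs apache-nutch-1.7-bin apache-nutch-1.7-src`.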


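    Step 4: Import into Eclipse
    Eclipse: File -- New -- Java Project

    This step imports the source code (rather than an existing project) into Eclipse.
    Note: the Eclipse version I used previously had "import project from source", but this version does not; it only offers "import project from existing project", and all we have is the src files.

    Click Next.
    Find the conf folder, then click Add Folder 'conf' to build path.
    Set the default output folder to apache-nutch-1.7/bin.

    Click Finish.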

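    Step 4 (continued): A few small bugs
    At this point the project shows errors (small red crosses) because some referenced libraries are missing.
    Take parse-html as an example:
    import org.cyberneko.html.parsers.*;
    This import fails because nekohtml-0.9.5.jar is missing.

    How to get nekohtml-0.9.5.jar:
    Search for nekohtml under apache-nutch-1.7-bin/plugins to find the jar,
    then copy it into the project's lib folder and add it to the build path.

    The other errors are fixed the same way (all the jars can be found under apache-nutch-1.7-bin/plugins):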

    feed
    cp apache-nutch-1.7-bin/plugins/feed/rome-0.9.jar apache-nutch-1.7-src/lib/
    parse-html
    cp apache-nutch-1.7-bin/plugins/parse-html/tagsoup-1.2.1.jar apache-nutch-1.7-src/lib/
    cp apache-nutch-1.7-bin/plugins/lib-nekohtml/nekohtml-0.9.5.jar apache-nutch-1.7-src/lib/
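
The search for a missing jar can be scripted for the remaining errors. A minimal sketch, assuming the plugins folder layout shown in the cp commands above. The name find_plugin_jar is a hypothetical helper, not part of Nutch.

```shell
# find_plugin_jar is a hypothetical helper (not part of Nutch).
# It lists every jar under PLUGINS_DIR whose file name contains NAME.
find_plugin_jar() {
  plugins_dir=$1
  name=$2
  find "$plugins_dir" -type f -name "*${name}*.jar"
}
```

For example, `find_plugin_jar apache-nutch-1.7-bin/plugins nekohtml` prints the path of the nekohtml jar, which can then be copied into apache-nutch-1.7-src/lib and added to the build path.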



    The project should now build without any errors.

    Step 5: Test crawl
    1. vim conf/nutch-default.xml     (open the file in vim)
    /plugin.folders     (vim search command)
    Change the property to:
    <property>
      <name>plugin.folders</name>
      <value>./src/plugin</value>
      <description>Directories where nutch plugins are located.  Each
      element may be a relative or absolute path.  If absolute, it is used
      as is.  If relative, it is searched for on the classpath.</description>
    </property>

    Reason: in the source tree the plugins live under src/plugin, whereas in the binary distribution they live in plugins at the root.

    2. vim conf/nutch-site.xml and add:
    <property>
      <name>http.agent.name</name>
      <value>your spider name</value>
    </property>

    3. Create a urls folder under apache-nutch-1.7-src, and a text file inside it:
    mkdir urls
    cd urls
    vim seed.txt
    Contents: http://www.163.com/

    4. vim conf/regex-urlfilter.txt
    5. Run configuration:
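
Before launching the run, the setup from items 1 to 3 can be sanity-checked. A minimal sketch, assuming the project root contains conf/ and urls/ as described above and that each <value> line directly follows its <name> line. The names nutch_property and check_crawl_setup are hypothetical helpers, not part of Nutch.

```shell
# nutch_property and check_crawl_setup are hypothetical helpers (not part of Nutch).

# Print the <value> of property NAME ($2) in FILE ($1).
# Assumes the <value> line directly follows the <name> line.
nutch_property() {
  grep -F -A1 "<name>$2</name>" "$1" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Check that items 1 to 3 are in place under the project root ($1).
check_crawl_setup() {
  root=$1
  [ "$(nutch_property "$root/conf/nutch-default.xml" plugin.folders)" = "./src/plugin" ] \
    || { echo "plugin.folders is not ./src/plugin" >&2; return 1; }
  [ -n "$(nutch_property "$root/conf/nutch-site.xml" http.agent.name)" ] \
    || { echo "http.agent.name is empty or missing" >&2; return 1; }
  [ -s "$root/urls/seed.txt" ] \
    || { echo "urls/seed.txt is missing or empty" >&2; return 1; }
  echo "setup looks complete"
}
```

Run from the source folder as `check_crawl_setup .`; it reports the first missing piece on stderr and returns a non-zero status.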

    Run result:

    The crawl now runs successfully.
    Checking the crawl results:


    Statistics: (there are many unfetched URLs because Nutch scores each URL and filters out those with a score below 0; this can be changed in nutch-default.xml)

    2013-09-22 13:17:46,710 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(351)) - Statistics for CrawlDb: crawl/crawldb
    2013-09-22 13:17:46,710 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(354)) - TOTAL urls:    794
    2013-09-22 13:17:46,710 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(369)) - retry 0:    794
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(359)) - min score:    0.0
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(363)) - avg score:    0.003186398
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(361)) - max score:    1.007
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 1 (db_unfetched):    750
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    3g.163.com :    7
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    auto.163.com :    12
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    baby.163.com :    1
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    baoxian.163.com :    2
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.culture.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.ent.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.lady.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    biz.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    blog.163.com :    2
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    book.163.com :    10
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    caipiao.163.com :    50
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    cbachina.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    club.auto.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    corp.163.com :    3
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    data.ent.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    digi.163.com :    40
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    discovery.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    dl.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    ecard.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    edu.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    email.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    emarketing.biz.163.com :    31
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    ent.163.com :    10
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    expo.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    fashion.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    focus.news.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    fushi.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    game.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    gb.corp.163.com :    8
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    hea.163.com :    3
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    help.163.com :    2
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    history.news.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    home.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    house.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    hr.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    jiu.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    kf.yxp.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    lady.163.com :    8
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    live.caipiao.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    love.163.com :    2
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    lovegongyi.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    m.163.com :    81
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    media.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    mibao.gm.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    mobile.163.com :    2
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    money.163.com :    21
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    news.163.com :    9
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    news.tag.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    newsapp.blog.163.com :    19
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    pay.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    pic.auto.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    post.news.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    product.auto.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    qiye.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    quotes.money.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    reg.163.com :    3
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    sports.163.com :    14
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    survey2.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    t.163.com :    45
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    tech.163.com :    42
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    travel.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    tveasy.blog.163.com :    2
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.money.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.news.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.sports.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    vipmail.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    vs.caipiao.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    wangyiyuedu.blog.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    war.news.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    www.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    yuedu.163.com :    265
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    zx.caipiao.163.com :    7
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    zz.yc.163.com :    3
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 2 (db_fetched):    40
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    caipiao.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    corp.163.com :    2
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    digi.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    emarketing.biz.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    gb.corp.163.com :    3
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    help.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    love.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    m.163.com :    11
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    music.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    newsapp.blog.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    open.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    open.yuedu.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    sitemap.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    t.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    tech.163.com :    2
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    www.163.com :    1
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    yuedu.163.com :    9
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    zz.yc.163.com :    1
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 4 (db_redir_temp):    2
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    3g.163.com :    1
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    m.163.com :    1
    2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 5 (db_redir_perm):    2
    2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    caipiao.163.com :    1
    2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    corp.163.com :    1
    2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(374)) - CrawlDb statistics: done
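
The per-status totals are easy to lose among the per-host lines. A minimal sketch that pulls them out of a saved copy of the log, assuming the line format shown above. The name crawldb_status_summary is a hypothetical helper, not part of Nutch.

```shell
# crawldb_status_summary is a hypothetical helper (not part of Nutch).
# It reads CrawlDbReader log lines on stdin and prints "<status name> <count>"
# for each "status N (name): count" line, skipping the per-host lines.
crawldb_status_summary() {
  sed -n 's/.* - status [0-9]* (\([a-z_]*\)):[[:space:]]*\([0-9][0-9]*\).*/\1 \2/p'
}
```

Fed the log above (for example `crawldb_status_summary < stats.log`, where stats.log is an assumed file name), it prints db_unfetched 750, db_fetched 40, db_redir_temp 2 and db_redir_perm 2, which add up to the 794 total URLs.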
  • Original article: https://www.cnblogs.com/i80386/p/3324068.html