  • Importing Nutch 1.7 into Eclipse


    Recommended development environment: Ubuntu + Eclipse (Windows + Cygwin + Eclipse is not recommended)

    Step 1: Download
    http://archive.apache.org/dist/nutch/
    Download both the src and bin archives from the site above:
    wget 'http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz'
    wget 'http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz'

    Step 2: Extract
    tar zxvf apache-nutch-1.7-bin.tar.gz
    This extracts a folder named apache-nutch-1.7.
    Rename it: mv apache-nutch-1.7 apache-nutch-1.7-bin

    tar zxvf apache-nutch-1.7-src.tar.gz
    This also extracts a folder named apache-nutch-1.7.
    Rename it: mv apache-nutch-1.7 apache-nutch-1.7-src

    Step 3: Merge
    Copy all the jars from apache-nutch-1.7-bin/lib into apache-nutch-1.7-src/lib:
    cp apache-nutch-1.7-bin/lib/* apache-nutch-1.7-src/lib/
    Then overwrite the configuration files in apache-nutch-1.7-src/conf with the ones from apache-nutch-1.7-bin/conf.
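
The merge step can be sketched as a small shell function. This is only a minimal sketch: the name merge_bin_into_src is hypothetical, and it copies just the *.jar files from lib/ (an assumption, so that cp does not stop on any subdirectory) before overwriting the files in conf/.

```shell
# Merge jars and configuration from the binary distribution into the source tree.
# merge_bin_into_src is a hypothetical helper name, not part of Nutch.
merge_bin_into_src() {
  bin_dir="$1"   # e.g. apache-nutch-1.7-bin
  src_dir="$2"   # e.g. apache-nutch-1.7-src

  # Make sure the target folders exist before copying into them.
  mkdir -p "$src_dir/lib" "$src_dir/conf"

  # Copy the jars, then overwrite the config files with the bin versions.
  cp "$bin_dir"/lib/*.jar "$src_dir/lib/" &&
  cp "$bin_dir"/conf/* "$src_dir/conf/"
}
```

Called from the directory that holds both folders, this would be: merge_bin_into_src apache-nutch-1.7-bin apache-nutch-1.7-src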


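    Step 4: Import into Eclipse
    Eclipse: File > New > Java Project

    This step imports the source code (rather than an existing project) into Eclipse.
    Note: the Eclipse versions I used before offered "import project from source", but this version only has "import project from existing project", and all we have is the source files.

    Click Next.
    Locate the conf folder, then click Add Folder 'conf' to build path.
    Set the default output folder to apache-nutch-1.7/bin.

    Click Finish.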

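    Step 4 (continued): Fixing a few build errors
    At this point the project shows errors (small red crosses) because some referenced libraries are missing.
    Take parse-html as an example:
    import org.cyberneko.html.parsers.*;
    This import fails because nekohtml-0.9.5.jar is missing.

    How to get nekohtml-0.9.5.jar:
    Search for nekohtml under apache-nutch-1.7-bin/plugins to find the jar,
    then copy it into the project's lib folder and add it to the build path.

    The remaining errors can be fixed the same way (all the jars can be found under apache-nutch-1.7-bin/plugins):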

    feed
    cp apache-nutch-1.7-bin/plugins/feed/rome-0.9.jar apache-nutch-1.7-src/lib/
    parse-html
    cp apache-nutch-1.7-bin/plugins/parse-html/tagsoup-1.2.1.jar apache-nutch-1.7-src/lib/
    cp apache-nutch-1.7-bin/plugins/lib-nekohtml/nekohtml-0.9.5.jar apache-nutch-1.7-src/lib/
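
Instead of copying the jars one at a time, the same idea can be applied to every plugin in one pass. The following is a minimal sketch, assuming that each plugin's own compiled jar is named after its directory (for example parse-html/parse-html.jar) and can be skipped because Eclipse compiles those classes from source. copy_plugin_jars is a hypothetical helper name.

```shell
# Copy the third-party jars bundled with each plugin into the source tree's lib/.
# Assumption: a jar named <plugin-dir>.jar is the plugin's own build output and is skipped.
copy_plugin_jars() {
  bin_dir="$1"   # e.g. apache-nutch-1.7-bin
  src_dir="$2"   # e.g. apache-nutch-1.7-src
  mkdir -p "$src_dir/lib"

  for dir in "$bin_dir"/plugins/*/; do
    name=$(basename "$dir")
    for jar in "$dir"*.jar; do
      # Skip the unexpanded pattern when a plugin ships no jars.
      [ -f "$jar" ] || continue
      # Skip the plugin's own compiled jar.
      [ "$(basename "$jar")" = "$name.jar" ] && continue
      cp "$jar" "$src_dir/lib/"
    done
  done
}
```

After running copy_plugin_jars apache-nutch-1.7-bin apache-nutch-1.7-src, the newly copied jars still have to be added to the build path in Eclipse.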



    At this point the project should build without any errors.

    Step 5: Test crawl
    1. vim conf/nutch-default.xml    (open the file in vim)
    /plugin.folders    (vim search command)
    Change the property to:
    <property>
      <name>plugin.folders</name>
      <value>./src/plugin</value>
      <description>Directories where nutch plugins are located.  Each
      element may be a relative or absolute path.  If absolute, it is used
      as is.  If relative, it is searched for on the classpath.</description>
    </property>

    Reason: in the source distribution the plugins live under src/plugin, whereas in the binary distribution they sit in plugins/ at the root.

    2. vim conf/nutch-site.xml and add:
    <property>
      <name>http.agent.name</name>
      <value>your spider name</value>
    </property>

    3. Create a urls folder under apache-nutch-1.7-src and a text file inside it:
    mkdir urls
    cd urls
    vim seed.txt
    Enter: http://www.163.com/

    4. Back in the project root: vim conf/regex-urlfilter.txt
    5. Run configuration:
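
Items 2 to 5 can also be scripted. The following is a minimal sketch under several assumptions that are not in the original text: the helper names are hypothetical, the URL filter rule (accept only 163.com and its subdomains in place of the default catch-all "+." rule) is just one possible edit, the classpath assumes Eclipse compiled the classes into bin/, and the depth and topN values are placeholders. org.apache.nutch.crawl.Crawl is the all-in-one crawl class shipped with Nutch 1.7.

```shell
# Item 2: write a minimal nutch-site.xml that only sets http.agent.name.
# Note: this overwrites any existing nutch-site.xml in conf_dir.
write_site_config() {
  conf_dir="$1"; agent_name="$2"
  mkdir -p "$conf_dir"
  cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent_name</value>
  </property>
</configuration>
EOF
}

# Item 3: create the seed list with a single start URL.
write_seed() {
  urls_dir="$1"; seed_url="$2"
  mkdir -p "$urls_dir"
  printf '%s\n' "$seed_url" > "$urls_dir/seed.txt"
}

# Item 4 (assumed edit): drop the catch-all "+." rule and accept only 163.com hosts.
restrict_to_163() {
  filter_file="$1"
  sed '/^+\.$/d' "$filter_file" > "$filter_file.tmp"
  printf '%s\n' '+^https?://([a-z0-9-]+\.)*163\.com/' >> "$filter_file.tmp"
  mv "$filter_file.tmp" "$filter_file"
}

# Item 5 (assumed equivalent of the Eclipse run configuration).
# Arguments: seed dir, crawl output dir, depth, topN.
run_crawl() {
  java -cp "conf:bin:lib/*" org.apache.nutch.crawl.Crawl "$1" -dir "$2" -depth "$3" -topN "$4"
}
```

From the project root, a test run might then look like: write_site_config conf my-test-spider; write_seed urls http://www.163.com/; restrict_to_163 conf/regex-urlfilter.txt; run_crawl urls crawl 3 5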

    Run output:

    The crawl now runs successfully.
    Inspecting the crawl results:

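    Statistics (many URLs are still unfetched because Nutch scores each URL and filters out those with a score below 0; this can be changed in nutch-default.xml. Crawl depth and topN limits, if set, also leave discovered URLs unfetched):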

    2013-09-22 13:17:46,710 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(351)) - Statistics for CrawlDb: crawl/crawldb
    2013-09-22 13:17:46,710 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(354)) - TOTAL urls:    794
    2013-09-22 13:17:46,710 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(369)) - retry 0:    794
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(359)) - min score:    0.0
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(363)) - avg score:    0.003186398
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(361)) - max score:    1.007
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 1 (db_unfetched):    750
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    3g.163.com :    7
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    auto.163.com :    12
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    baby.163.com :    1
    2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    baoxian.163.com :    2
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.culture.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.ent.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.lady.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    biz.163.com :    1
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    blog.163.com :    2
    2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    book.163.com :    10
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    caipiao.163.com :    50
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    cbachina.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    club.auto.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    corp.163.com :    3
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    data.ent.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    digi.163.com :    40
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    discovery.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    dl.163.com :    1
    2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    ecard.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    edu.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    email.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    emarketing.biz.163.com :    31
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    ent.163.com :    10
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    expo.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    fashion.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    focus.news.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    fushi.163.com :    1
    2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    game.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    gb.corp.163.com :    8
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    hea.163.com :    3
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    help.163.com :    2
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    history.news.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    home.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    house.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    hr.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    jiu.163.com :    1
    2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    kf.yxp.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    lady.163.com :    8
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    live.caipiao.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    love.163.com :    2
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    lovegongyi.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    m.163.com :    81
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    media.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    mibao.gm.163.com :    1
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    mobile.163.com :    2
    2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    money.163.com :    21
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    news.163.com :    9
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    news.tag.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    newsapp.blog.163.com :    19
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    pay.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    pic.auto.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    post.news.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    product.auto.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    qiye.163.com :    1
    2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    quotes.money.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    reg.163.com :    3
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    sports.163.com :    14
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    survey2.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    t.163.com :    45
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    tech.163.com :    42
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    travel.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    tveasy.blog.163.com :    2
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.money.163.com :    1
    2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.news.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.sports.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    vipmail.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    vs.caipiao.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    wangyiyuedu.blog.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    war.news.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    www.163.com :    1
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    yuedu.163.com :    265
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    zx.caipiao.163.com :    7
    2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    zz.yc.163.com :    3
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 2 (db_fetched):    40
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    caipiao.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    corp.163.com :    2
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    digi.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    emarketing.biz.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    gb.corp.163.com :    3
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    help.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    love.163.com :    1
    2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    m.163.com :    11
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    music.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    newsapp.blog.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    open.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    open.yuedu.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    sitemap.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    t.163.com :    1
    2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    tech.163.com :    2
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    www.163.com :    1
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    yuedu.163.com :    9
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    zz.yc.163.com :    1
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 4 (db_redir_temp):    2
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    3g.163.com :    1
    2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    m.163.com :    1
    2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 5 (db_redir_perm):    2
    2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    caipiao.163.com :    1
    2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    corp.163.com :    1
    2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(374)) - CrawlDb statistics: done
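
Output in this shape is what CrawlDbReader prints for a per-host statistics run (in Nutch 1.x, readdb with the -stats and -sort options). The per-host lines make the totals hard to spot, so the following minimal sketch reduces such a log to one line per status. status_totals is a hypothetical helper, and the pattern assumes the log layout shown above.

```shell
# Reduce CrawlDbReader statistics output (read from stdin) to "status N (name) count" lines.
# status_totals is a hypothetical helper; it relies on the " - status N (name):  count" layout.
status_totals() {
  sed -nE 's/.* - (status [0-9]+ \([a-z_]+\)):[[:space:]]*([0-9]+)[[:space:]]*$/\1 \2/p'
}
```

Piping the log above through status_totals leaves only the four status lines (db_unfetched, db_fetched, db_redir_temp, db_redir_perm) with their counts.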
  • Original article: https://www.cnblogs.com/i80386/p/3324068.html