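Recommended development environment: Ubuntu + Eclipse (Windows + Cygwin + Eclipse is not recommended).
Step 1: Download
http://archive.apache.org/dist/nutch/
Download both the src and the bin archives from the site above: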
wget 'http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz'
wget 'http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz'
Step 2: Extract
tar zxvf apache-nutch-1.7-bin.tar.gz
This extracts a folder named apache-nutch-1.7.
Rename it: mv apache-nutch-1.7 apache-nutch-1.7-bin
tar zxvf apache-nutch-1.7-src.tar.gz
This also extracts a folder named apache-nutch-1.7.
Rename it: mv apache-nutch-1.7 apache-nutch-1.7-src
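The extract-then-rename pair can also be done in one step per archive. The following is a minimal sketch, assuming GNU tar (the default on Ubuntu), whose --strip-components option drops the leading apache-nutch-1.7 folder; the function name extract_as is made up for this example.

```shell
#!/bin/sh
# extract_as ARCHIVE TARGET_DIR   (hypothetical helper name)
# Unpacks a .tar.gz archive straight into TARGET_DIR and drops the archive's
# single top-level folder, so no separate "mv" is needed afterwards.
extract_as() {
    archive=$1
    target=$2
    mkdir -p "$target" &&
    tar -xzf "$archive" -C "$target" --strip-components=1
}

# Usage (run in the directory that holds the two downloaded archives):
#   extract_as apache-nutch-1.7-bin.tar.gz apache-nutch-1.7-bin
#   extract_as apache-nutch-1.7-src.tar.gz apache-nutch-1.7-src
```

The result should be the same two folders as with the manual rename, without the intermediate apache-nutch-1.7 folder that both archives would otherwise share.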
Step 3: Combine
Copy all the jar files in apache-nutch-1.7-bin/lib into apache-nutch-1.7-src/lib:
cp apache-nutch-1.7-bin/lib/* apache-nutch-1.7-src/lib/
Then overwrite the configuration files in apache-nutch-1.7-src/conf with the ones from apache-nutch-1.7-bin/conf.
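The text gives no command for the conf overwrite, so here is one possible sketch. The function name overlay_conf and the conf.orig backup folder are choices made for this example, not part of Nutch.

```shell
#!/bin/sh
# overlay_conf BIN_DIR SRC_DIR   (hypothetical helper name)
# Copies every file under BIN_DIR/conf over SRC_DIR/conf, overwriting files
# that have the same name and leaving files that exist only in SRC_DIR/conf.
overlay_conf() {
    bin_dir=$1
    src_dir=$2
    # Keep one backup of the source tree's own configuration
    # (conf.orig is a name chosen for this sketch).
    if [ ! -e "$src_dir/conf.orig" ]; then
        cp -r "$src_dir/conf" "$src_dir/conf.orig" || return 1
    fi
    cp -r "$bin_dir/conf/." "$src_dir/conf/"
}

# Usage:
#   overlay_conf apache-nutch-1.7-bin apache-nutch-1.7-src
```

The backup is optional; it only makes it easy to compare the two sets of configuration files later.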
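Step 4: Import into Eclipse
In Eclipse: File -> New -> Java Project
This step imports the source code (rather than an existing project) into Eclipse.
Note: the Eclipse version I used before offered "import project from source", but this version does not; it only offers "import project from existing project", and all we have is the src files.
Click Next.
Locate the conf folder, then click Add Folder 'conf' to build path.
Set the default output folder to apache-nutch-1.7/bin.
Click Finish.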
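Step 4 (continued): A few minor errors
At this point the project shows errors (small red crosses). They are caused by missing references.
Take parse-html as an example:
import org.cyberneko.html.parsers.*;
This import fails because nekohtml-0.9.5.jar is missing.
How to obtain nekohtml-0.9.5.jar:
Search for nekohtml under apache-nutch-1.7-bin/plugins and you will find the jar.
Then copy it into the project's lib folder and add it to the build path.
The remaining errors can be fixed the same way (all the required jars can be found under apache-nutch-1.7-bin/plugins).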
feed
cp apache-nutch-1.7-bin/plugins/feed/rome-0.9.jar apache-nutch-1.7-src/lib/
parse-html
cp apache-nutch-1.7-bin/plugins/parse-html/tagsoup-1.2.1.jar apache-nutch-1.7-src/lib/
cp apache-nutch-1.7-bin/plugins/lib-nekohtml/nekohtml-0.9.5.jar apache-nutch-1.7-src/lib/
At this point the project should no longer show any errors.
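Instead of hunting for the jars one plugin at a time, they can be collected in a single pass. This is only a sketch: copy_plugin_jars is a made-up name, and it copies every .jar under plugins, including each plugin's own compiled jar, which is likely more than the build path strictly needs. The copied jars still have to be added to the build path in Eclipse.

```shell
#!/bin/sh
# copy_plugin_jars BIN_DIR SRC_DIR   (hypothetical helper name)
# Copies every .jar found anywhere below BIN_DIR/plugins into SRC_DIR/lib.
copy_plugin_jars() {
    bin_dir=$1
    src_dir=$2
    mkdir -p "$src_dir/lib" || return 1
    # -type f -name '*.jar' selects only jar files; plugin.xml and other
    # files in the plugin folders are left alone.
    find "$bin_dir/plugins" -type f -name '*.jar' -exec cp {} "$src_dir/lib/" \;
}

# Usage:
#   copy_plugin_jars apache-nutch-1.7-bin apache-nutch-1.7-src
```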
Step 5: Test a crawl
1. vim conf/nutch-default.xml   (open the file in vim)
/plugin.folders   (vim search command)
Change the property to:
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>
Reason: in the source distribution the plugins live under the src folder (src/plugin), whereas in the bin distribution they live in the root directory (plugins).
2. vim conf/nutch-site.xml and add:
<property>
<name>http.agent.name</name>
<value>your spider name</value>
</property>
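3. Under apache-nutch-1.7-src, create a urls folder and create a text file inside it:
mkdir urls
cd urls
vim seed.txt
Write the following into it: http://www.163.com/
4. vim conf/regex-urlfilter.txt
5. Run configuration:
Run result:
The crawl now runs successfully.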
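Step 4 above does not show what was changed in conf/regex-urlfilter.txt. Lines in that file start with + (accept) or - (reject) followed by a regular expression. As an assumption only, a rule such as +^http://([a-z0-9]*\.)*163\.com/ would keep the crawl on 163.com hosts, which is consistent with the hosts in the statistics. The sketch below checks candidate URLs against that one hypothetical rule with grep -E; it does not reproduce the other rules in the file or Nutch's own matching code.

```shell
#!/bin/sh
# Hypothetical accept rule for conf/regex-urlfilter.txt, written here
# without the leading "+". It is an assumption, not taken from the tutorial.
RULE='^http://([a-z0-9]*\.)*163\.com/'

# url_accepted URL   (hypothetical helper name)
# Succeeds (exit status 0) when URL matches RULE, fails otherwise.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$RULE"
}

# Usage:
#   url_accepted 'http://www.163.com/' && echo accepted
#   url_accepted 'http://www.example.com/' || echo rejected
```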
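Check the crawl results:
Statistics: (there are many unfetched URLs because Nutch scores each URL and filters out those with a score below 0, which can be changed in nutch-default.xml; the crawl's depth and topN limits also leave newly discovered URLs unfetched)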
2013-09-22 13:17:46,710 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(351)) - Statistics for CrawlDb: crawl/crawldb
2013-09-22 13:17:46,710 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(354)) - TOTAL urls: 794
2013-09-22 13:17:46,710 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(369)) - retry 0: 794
2013-09-22 13:17:46,711 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(359)) - min score: 0.0
2013-09-22 13:17:46,711 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(363)) - avg score: 0.003186398
2013-09-22 13:17:46,711 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(361)) - max score: 1.007
2013-09-22 13:17:46,711 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 1 (db_unfetched): 750
2013-09-22 13:17:46,711 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - 3g.163.com : 7
2013-09-22 13:17:46,711 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - auto.163.com : 12
2013-09-22 13:17:46,711 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - baby.163.com : 1
2013-09-22 13:17:46,711 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - baoxian.163.com : 2
2013-09-22 13:17:46,712 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - bbs.163.com : 1
2013-09-22 13:17:46,712 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - bbs.culture.163.com : 1
2013-09-22 13:17:46,712 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - bbs.ent.163.com : 1
2013-09-22 13:17:46,712 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - bbs.lady.163.com : 1
2013-09-22 13:17:46,712 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - biz.163.com : 1
2013-09-22 13:17:46,712 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - blog.163.com : 2
2013-09-22 13:17:46,712 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - book.163.com : 10
2013-09-22 13:17:46,713 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - caipiao.163.com : 50
2013-09-22 13:17:46,713 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - cbachina.163.com : 1
2013-09-22 13:17:46,713 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - club.auto.163.com : 1
2013-09-22 13:17:46,713 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - corp.163.com : 3
2013-09-22 13:17:46,713 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - data.ent.163.com : 1
2013-09-22 13:17:46,713 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - digi.163.com : 40
2013-09-22 13:17:46,713 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - discovery.163.com : 1
2013-09-22 13:17:46,713 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - dl.163.com : 1
2013-09-22 13:17:46,713 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - ecard.163.com : 1
2013-09-22 13:17:46,714 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - edu.163.com : 1
2013-09-22 13:17:46,714 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - email.163.com : 1
2013-09-22 13:17:46,714 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - emarketing.biz.163.com : 31
2013-09-22 13:17:46,714 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - ent.163.com : 10
2013-09-22 13:17:46,714 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - expo.163.com : 1
2013-09-22 13:17:46,714 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - fashion.163.com : 1
2013-09-22 13:17:46,714 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - focus.news.163.com : 1
2013-09-22 13:17:46,714 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - fushi.163.com : 1
2013-09-22 13:17:46,714 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - game.163.com : 1
2013-09-22 13:17:46,715 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - gb.corp.163.com : 8
2013-09-22 13:17:46,715 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - hea.163.com : 3
2013-09-22 13:17:46,715 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - help.163.com : 2
2013-09-22 13:17:46,715 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - history.news.163.com : 1
2013-09-22 13:17:46,715 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - home.163.com : 1
2013-09-22 13:17:46,715 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - house.163.com : 1
2013-09-22 13:17:46,715 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - hr.163.com : 1
2013-09-22 13:17:46,715 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - jiu.163.com : 1
2013-09-22 13:17:46,715 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - kf.yxp.163.com : 1
2013-09-22 13:17:46,716 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - lady.163.com : 8
2013-09-22 13:17:46,716 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - live.caipiao.163.com : 1
2013-09-22 13:17:46,716 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - love.163.com : 2
2013-09-22 13:17:46,716 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - lovegongyi.163.com : 1
2013-09-22 13:17:46,716 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - m.163.com : 81
2013-09-22 13:17:46,716 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - media.163.com : 1
2013-09-22 13:17:46,716 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - mibao.gm.163.com : 1
2013-09-22 13:17:46,716 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - mobile.163.com : 2
2013-09-22 13:17:46,716 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - money.163.com : 21
2013-09-22 13:17:46,717 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - news.163.com : 9
2013-09-22 13:17:46,717 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - news.tag.163.com : 1
2013-09-22 13:17:46,717 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - newsapp.blog.163.com : 19
2013-09-22 13:17:46,717 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - pay.163.com : 1
2013-09-22 13:17:46,717 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - pic.auto.163.com : 1
2013-09-22 13:17:46,717 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - post.news.163.com : 1
2013-09-22 13:17:46,717 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - product.auto.163.com : 1
2013-09-22 13:17:46,717 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - qiye.163.com : 1
2013-09-22 13:17:46,717 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - quotes.money.163.com : 1
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - reg.163.com : 3
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - sports.163.com : 14
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - survey2.163.com : 1
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - t.163.com : 45
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - tech.163.com : 42
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - travel.163.com : 1
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - tveasy.blog.163.com : 2
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - v.163.com : 1
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - v.money.163.com : 1
2013-09-22 13:17:46,718 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - v.news.163.com : 1
2013-09-22 13:17:46,719 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - v.sports.163.com : 1
2013-09-22 13:17:46,719 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - vipmail.163.com : 1
2013-09-22 13:17:46,719 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - vs.caipiao.163.com : 1
2013-09-22 13:17:46,719 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - wangyiyuedu.blog.163.com : 1
2013-09-22 13:17:46,719 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - war.news.163.com : 1
2013-09-22 13:17:46,719 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - www.163.com : 1
2013-09-22 13:17:46,719 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - yuedu.163.com : 265
2013-09-22 13:17:46,719 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - zx.caipiao.163.com : 7
2013-09-22 13:17:46,719 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - zz.yc.163.com : 3
2013-09-22 13:17:46,720 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 2 (db_fetched): 40
2013-09-22 13:17:46,720 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - caipiao.163.com : 1
2013-09-22 13:17:46,720 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - corp.163.com : 2
2013-09-22 13:17:46,720 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - digi.163.com : 1
2013-09-22 13:17:46,720 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - emarketing.biz.163.com : 1
2013-09-22 13:17:46,720 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - gb.corp.163.com : 3
2013-09-22 13:17:46,720 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - help.163.com : 1
2013-09-22 13:17:46,720 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - love.163.com : 1
2013-09-22 13:17:46,720 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - m.163.com : 11
2013-09-22 13:17:46,721 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - music.163.com : 1
2013-09-22 13:17:46,721 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - newsapp.blog.163.com : 1
2013-09-22 13:17:46,721 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - open.163.com : 1
2013-09-22 13:17:46,721 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - open.yuedu.163.com : 1
2013-09-22 13:17:46,721 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - sitemap.163.com : 1
2013-09-22 13:17:46,721 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - t.163.com : 1
2013-09-22 13:17:46,721 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - tech.163.com : 2
2013-09-22 13:17:46,722 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - www.163.com : 1
2013-09-22 13:17:46,722 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - yuedu.163.com : 9
2013-09-22 13:17:46,722 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - zz.yc.163.com : 1
2013-09-22 13:17:46,722 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 4 (db_redir_temp): 2
2013-09-22 13:17:46,722 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - 3g.163.com : 1
2013-09-22 13:17:46,722 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - m.163.com : 1
2013-09-22 13:17:46,723 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 5 (db_redir_perm): 2
2013-09-22 13:17:46,723 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - caipiao.163.com : 1
2013-09-22 13:17:46,723 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) - corp.163.com : 1
2013-09-22 13:17:46,723 INFO crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(374)) - CrawlDb statistics: done
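With this many per-host lines, the per-status totals are easy to miss. A small filter can pull them out of the log. This is a sketch that assumes the log format shown in the statistics output, with one entry per line as in a real log file; the function name crawldb_status_counts is made up for this example.

```shell
#!/bin/sh
# crawldb_status_counts   (hypothetical helper name)
# Reads CrawlDbReader log lines on stdin and prints one "name count" pair
# for every line of the form "... - status N (name): count".
crawldb_status_counts() {
    sed -n 's/.* - status [0-9]* (\([a-z_]*\)): *\([0-9][0-9]*\) *$/\1 \2/p'
}

# Usage (the log file path is an assumption; use wherever the output was saved):
#   crawldb_status_counts < logs/hadoop.log
```

For the statistics in this section, the four status lines give db_unfetched 750, db_fetched 40, db_redir_temp 2 and db_redir_perm 2, which add up to the 794 total URLs.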