Single-Site Crawl and Search Test
1. Create a urls directory and a seed.txt file inside it. In seed.txt, list the sites you want to crawl, for example: www.osu.edu
mkdir -p urls
cd urls
touch seed.txt
This creates an empty text file seed.txt under urls/. Fill it with the sites you want Nutch to crawl, one URL per line.
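For reference, a minimal seed.txt for the OSU example could contain a single line; the http:// prefix is assumed here, since Nutch expects full URLs:
echo "http://www.osu.edu/" > seed.txt
cat seed.txt
cd ..    # back to the Nutch home directory before running the crawl in step 3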
2. Edit conf/crawl-urlfilter.txt, replacing MY.DOMAIN.NAME with osu.edu
Before:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
After:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*osu.edu/
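To sanity-check that the pattern accepts the hosts you expect, you can test the regex part (dropping the leading +, which is Nutch's accept marker) against sample URLs. This quick check assumes a grep with extended-regex support and is not part of Nutch itself:
echo "http://www.osu.edu/" | grep -E '^http://([a-z0-9]*\.)*osu.edu/'       # printed back: accepted
echo "http://www.example.com/" | grep -E '^http://([a-z0-9]*\.)*osu.edu/'   # no output: filtered out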
3. Start the crawl
bin/nutch crawl urls -dir crawldemo -depth 2
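Here -dir crawldemo names the output directory and -depth 2 limits the crawl to two link levels from the seeds. When the crawl finishes, a rough sanity check is to list the output and dump crawldb statistics; the layout below is what a Nutch 1.2 one-step crawl typically produces, and the counts will vary per run:
ls crawldemo
# typically: crawldb  index  indexes  linkdb  segments
bin/nutch readdb crawldemo/crawldb -stats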
4. Configure Tomcat and restart it; do not forget the restart step.
gsli@ubuntu:~/Downloads/apache-tomcat-7.0.10/webapps/nutch-1.2/WEB-INF/classes$ cat nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>searcher.dir</name>
<value>/home/gsli/Downloads/nutch-1.2/crawldemo</value>
<description>Path to the crawl directory produced in step 3; the search webapp reads its index from here.</description>
</property>
</configuration>
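After saving nutch-site.xml, restart Tomcat so the new searcher.dir value is picked up (standard Tomcat control scripts, using the install path shown in the prompt above):
cd ~/Downloads/apache-tomcat-7.0.10
bin/shutdown.sh
bin/startup.sh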
5. Search from the Nutch search page
Searching only works after the configuration in step 4 is in place and Tomcat has been restarted.
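With Tomcat running again, open the deployed Nutch webapp in a browser and enter a query. Assuming Tomcat listens on its default port 8080 and the webapp name nutch-1.2 from step 4, the search page would be at:
http://localhost:8080/nutch-1.2/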