Configuring Nutch and Crawling Dynamic Websites


http://blog.csdn.net/jimanyu/article/details/5619949

I. Configuring Nutch:

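1. After unpacking Nutch, take crawling http://www.163.com/ as an example. Create a file named urls, put the line http://www.163.com/ in it, and save it. The file can live anywhere (mine is at D:/nutch/urls). Also create a directory for the crawler logs (mine is D:/nutch/logs).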

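Open nutch-0.9/conf/crawl-urlfilter.txt and replace the string MY.DOMAIN.NAME with the domain listed in the urls file (for example, I changed the line to "+^http://([a-z0-9]*\.)*163.com/"). An even simpler option is to delete MY.DOMAIN.NAME altogether, keeping only +^http://([a-z0-9]*\.)*, which allows every http site to be crawled.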
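To get a feel for what the edited rule accepts, the pattern can be tried outside Nutch. This is only a rough sketch: it uses Python's re module instead of the Java regex engine that Nutch uses, and the sample URLs and the function name are made up for illustration.

```python
import re

# The pattern from crawl-urlfilter.txt without the leading "+".
# The "+" only marks the rule as an accept rule; it is not part of the regex.
DOMAIN_RULE = re.compile(r"^http://([a-z0-9]*\.)*163.com/")


def accepted_by_domain_rule(url: str) -> bool:
    """Return True if the URL matches the 163.com accept pattern."""
    return DOMAIN_RULE.search(url) is not None


# Hypothetical sample URLs, chosen only to exercise the pattern.
samples = [
    "http://www.163.com/",
    "http://news.163.com/index.html",
    "http://www.example.com/",
]
for url in samples:
    print(url, accepted_by_domain_rule(url))
```

The first two URLs match and the third does not. Note that the dot in 163.com is unescaped in the rule, so it matches any single character, not only a literal dot.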
2. Edit the nutch-site.xml file under conf and add the following between <configuration> and </configuration>:
<property>
<name>http.agent.name</name>
<value>longtask</value>
<description>HTTP 'User-Agent' request header. </description>
</property>
<property>
<name>http.agent.description</name>
<value>longtask</value>
<description>Further description of our bot- this text is used in the User-Agent header.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://www.longtask.com/blog/</value>
<description>A URL to advertise in the User-Agent header.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>longtask@gmail.com</value>
<description>An email address to advertise in the HTTP 'From' request header and User-Agent header.
</description>
</property>

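Change the contents of each <value></value> element to your own values, for example <value>www.163.com</value>. These settings are needed because Nutch follows the robots protocol: when it requests pages, it submits this information about itself to the site being crawled so that the site can identify it.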
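Before starting a crawl it can help to confirm that none of the four agent properties was left empty. The following is a minimal sketch, not part of Nutch: the function name missing_agent_properties and the inline sample configuration are made up for illustration, and in practice the text would be read from conf/nutch-site.xml.

```python
import xml.etree.ElementTree as ET

# The four properties configured in this step.
AGENT_PROPERTIES = [
    "http.agent.name",
    "http.agent.description",
    "http.agent.url",
    "http.agent.email",
]


def missing_agent_properties(xml_text: str) -> list[str]:
    """Return the agent properties that are absent or have an empty value."""
    root = ET.fromstring(xml_text)
    values = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        values[name] = value
    return [name for name in AGENT_PROPERTIES if not values.get(name)]


# Hypothetical, deliberately incomplete configuration used only as a demo.
sample = """<configuration>
<property><name>http.agent.name</name><value>longtask</value></property>
<property><name>http.agent.url</name><value></value></property>
</configuration>"""

print(missing_agent_properties(sample))
```

For the sample above the check reports http.agent.description, http.agent.url and http.agent.email, since only http.agent.name carries a value.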

II. Crawling dynamic content:
Two files under conf need attention here: regex-urlfilter.txt and crawl-urlfilter.txt.
# skip URLs containing certain characters as probable queries, etc.
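-[?*!@=]  (change the - to +)
This rule skips every page whose link contains ?, *, !, @ or =. Because skipping is the default, dynamic pages, whose URLs usually contain ?, cannot be fetched with the default settings. In both files the rule can be commented out:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
Then add a line that accepts such URLs:
# accept URLs containing certain characters as probable queries, etc.
+[?=&]
This allows the crawler to fetch links that contain the three characters ?, = and &.
Note: both files must be changed, because Nutch loads the rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt.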
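The effect of the change can be simulated with a small sketch. It assumes that rules are tried from top to bottom, that the first matching rule decides, and that a URL matching no rule is rejected, which is how I understand Nutch's regex URL filter to behave. The helper names and the sample URL are made up, and Python's re module stands in for the Java regex engine.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Assumed semantics: first matching rule wins, no match means reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


DEFAULT_RULES = r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+^http://([a-z0-9]*\.)*163.com/
"""

EDITED_RULES = r"""
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
# accept URLs containing certain characters as probable queries, etc.
+[?=&]
+^http://([a-z0-9]*\.)*163.com/
"""

dynamic_url = "http://news.163.com/list.jsp?id=42&page=2"  # hypothetical URL
print(accepts(parse_rules(DEFAULT_RULES), dynamic_url))  # False
print(accepts(parse_rules(EDITED_RULES), dynamic_url))   # True
```

Under the same first-match assumption, a +[?=&] line placed before the domain rule also accepts query URLs on other hosts, so its position in the file matters.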

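III. Running the crawler:
Open Cygwin.
At the command prompt, change to the Nutch directory (the command below is relative to it, so do not cd into bin): cd <nutch directory>
Run the command:
        bin/nutch crawl urls -dir mydir -depth 3 -threads 4 -topN 50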

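Options:
-dir dirnames      directory in which the crawled pages are stored
-depth depth   link depth to which pages are crawled
-delay delay    delay between accesses to different hosts, in seconds
-threads threads      number of threads to start
-topN number    limits the number of top links fetched in each iteration; the default is Integer.MAX_VALUE
When the run finishes, check the log.txt log for detailed information about the pages the crawler retrieved.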
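The -depth and -topN options together put a ceiling on the size of a crawl. The arithmetic below is a sketch under the assumption that each of the depth iterations fetches at most topN links; the function name is made up for illustration.

```python
def max_pages_fetched(depth: int, top_n: int) -> int:
    """Upper bound on fetched pages, assuming at most top_n links per iteration."""
    if depth < 0 or top_n < 0:
        raise ValueError("depth and top_n must be non-negative")
    return depth * top_n


# Values from the command above: -depth 3 -topN 50
print(max_pages_fetched(depth=3, top_n=50))  # 150
```

With the values from the command above, the crawl would stop after at most 150 pages, and usually fewer when an iteration finds fewer than 50 eligible links.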
     Troubleshooting: errors reported during the run:

IV. Deploying to Tomcat:
1. Configure the files inside the nutch-0.9.war package

Unpack nutch-0.9.war, then edit nutch-0.9/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml as follows:
<configuration>
<property>
<name>searcher.dir</name>
<value>D://nutch//mydir</value>
</property>
</configuration>

Rename nutch-0.9 to ROOT and use it to replace the ROOT folder under C:/Program Files/Apache-tomcat/webapps. To support searches in Chinese, edit Tomcat/conf/server.xml and change the Connector element to:

    <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000"
   redirectPort="8443" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
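Why URIEncoding="UTF-8" matters for Chinese searches can be shown with a short sketch. It assumes the browser percent-encodes the query as UTF-8 and that the server would otherwise decode it as ISO-8859-1, which, as far as I know, was the default in older Tomcat versions. The search term and function names are made up for illustration.

```python
from urllib.parse import quote, unquote


def encode_query(term: str) -> str:
    """Percent-encode a search term as UTF-8, as a browser typically would."""
    return quote(term, encoding="utf-8")


def decode_query(encoded: str, charset: str) -> str:
    """Decode a percent-encoded term with the charset the server assumes."""
    return unquote(encoded, encoding=charset)


term = "网易"  # sample Chinese search term
encoded = encode_query(term)
print(encoded)  # %E7%BD%91%E6%98%93

# Decoding with the charset used for encoding recovers the term.
print(decode_query(encoded, "utf-8") == term)        # True
# Decoding the same bytes as ISO-8859-1 yields garbled text instead.
print(decode_query(encoded, "iso-8859-1") == term)   # False
```

When the server decodes with the wrong charset, the search term reaching the application is no longer the one the user typed, so Chinese queries return no matches.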

2. Deploy the application under Tomcat's webapps directory, start Tomcat, and open http://localhost:8080/ to use it.


Original article: https://www.cnblogs.com/zkwarrior/p/5392208.html