  • Single-site crawl and search test with Nutch

    Single-site crawl and search test

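    1. Create a urls directory and, inside it, a seed.txt file that lists the sites to crawl, for example http://www.osu.edu/

    mkdir -p urls
    cd urls
    touch seed.txt

    Then edit urls/seed.txt and enter one URL per line for each site you want Nutch to crawl. Include the http:// scheme, because the URL filter in step 2 only accepts URLs that start with it. Return to the parent directory (cd ..) before continuing.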
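Step 1 can also be done without opening an editor. The following is a minimal sketch, assuming a single seed site; the variable name SEED_URL is only illustrative and is not part of Nutch.

```shell
# Sketch: create the seed list for step 1 in one go.
# SEED_URL is an illustrative variable; list every site to crawl, one URL per line.
SEED_URL="http://www.osu.edu/"

mkdir -p urls
# printf writes the URL followed by a newline; ">" overwrites any existing seed.txt.
printf '%s\n' "$SEED_URL" > urls/seed.txt

# Show the result so the seed list can be checked by eye.
cat urls/seed.txt
```

To crawl more than one site, append further URLs with ">>" instead of ">".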

    2. Edit conf/crawl-urlfilter.txt

    Replace MY.DOMAIN.NAME with osu.edu so that only URLs in that domain are accepted.

    Before:

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    After:

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*osu.edu/
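The substitution in step 2 can be scripted. This is a rough sketch rather than part of the original procedure: FILTER_FILE and DOMAIN are illustrative variable names, and by default the script works on a scratch copy seeded with the stock rule quoted above, so it can be tried without touching a real installation.

```shell
# Sketch: apply the step-2 substitution non-interactively.
# FILTER_FILE defaults to a scratch file; set it to conf/crawl-urlfilter.txt
# to edit a real Nutch 1.2 configuration (assumed layout).
FILTER_FILE="${FILTER_FILE:-$(mktemp)}"
DOMAIN="osu.edu"

# On a scratch run the file is empty, so seed it with the stock rule from the text.
if [ ! -s "$FILTER_FILE" ]; then
  printf '%s\n' '# accept hosts in MY.DOMAIN.NAME' \
                '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' > "$FILTER_FILE"
fi

# Replace the placeholder only on rule lines (those starting with "+"),
# writing to a temporary file first so a failed sed cannot truncate the original.
sed "/^+/s/MY\.DOMAIN\.NAME/$DOMAIN/" "$FILTER_FILE" > "$FILTER_FILE.new" \
  && mv "$FILTER_FILE.new" "$FILTER_FILE"

# Print the accept rules that are now in effect.
grep '^+' "$FILTER_FILE"
```

Writing to a temporary file and renaming it avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.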

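    3. Start the crawl

    From the Nutch installation directory (the one that contains urls/), run:

    bin/nutch crawl urls -dir crawldemo -depth 2

    This stores the crawl data in crawldemo and follows links to a depth of 2 from the seed URLs.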
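A small wrapper can catch the most common mistakes before the crawl starts, such as running from the wrong directory or with an empty seed list. This is only a sketch: NUTCH_HOME, SEED_DIR, OUT_DIR and DEPTH are illustrative variable names, and the crawl is skipped when bin/nutch is not found.

```shell
# Sketch: run the step-3 crawl only when the prerequisites are in place.
# NUTCH_HOME is assumed to be the Nutch 1.2 installation directory.
NUTCH_HOME="${NUTCH_HOME:-.}"
SEED_DIR="urls"
OUT_DIR="crawldemo"
DEPTH=2

# Same command as in the text, assembled from the variables above.
CRAWL_CMD="bin/nutch crawl $SEED_DIR -dir $OUT_DIR -depth $DEPTH"
echo "crawl command: $CRAWL_CMD"

if [ -x "$NUTCH_HOME/bin/nutch" ] && [ -s "$NUTCH_HOME/$SEED_DIR/seed.txt" ]; then
  # Run inside a subshell so the caller's working directory is unchanged.
  # $CRAWL_CMD is left unquoted on purpose so sh/bash split it into words.
  (cd "$NUTCH_HOME" && $CRAWL_CMD)
else
  echo "skipping crawl: bin/nutch or $SEED_DIR/seed.txt not found under $NUTCH_HOME" >&2
fi
```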

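    4. Configure Tomcat and restart it. Do not skip the restart.

    In the deployed web application, set searcher.dir in WEB-INF/classes/nutch-site.xml to the crawl directory from step 3:

    gsli@ubuntu:~/Downloads/apache-tomcat-7.0.10/webapps/nutch-1.2/WEB-INF/classes$ cat nutch-site.xml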

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
      <property>
        <name>searcher.dir</name>
        <value>/home/gsli/Downloads/nutch-1.2/crawldemo</value>
        <description></description>
      </property>
    </configuration>
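The configuration and restart in step 4 could be scripted roughly as follows. This is a sketch under stated assumptions: SEARCHER_DIR, SITE_XML and TOMCAT_HOME are illustrative variable names, the file is written to a scratch location unless SITE_XML is set, and the restart runs only when TOMCAT_HOME points at a Tomcat installation with the standard bin/shutdown.sh and bin/startup.sh scripts.

```shell
# Sketch: write nutch-site.xml with searcher.dir, then restart Tomcat if available.
# SEARCHER_DIR is the crawl directory from the text; adjust it to your own path.
SEARCHER_DIR="${SEARCHER_DIR:-/home/gsli/Downloads/nutch-1.2/crawldemo}"
# SITE_XML defaults to a scratch file; for a real deployment point it at
# <tomcat>/webapps/nutch-1.2/WEB-INF/classes/nutch-site.xml.
SITE_XML="${SITE_XML:-$(mktemp)}"

cat > "$SITE_XML" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>$SEARCHER_DIR</value>
    <description></description>
  </property>
</configuration>
EOF
echo "wrote $SITE_XML"

# Restart only when TOMCAT_HOME is set and its control scripts exist.
TOMCAT_HOME="${TOMCAT_HOME:-}"
if [ -n "$TOMCAT_HOME" ] && [ -x "$TOMCAT_HOME/bin/shutdown.sh" ]; then
  "$TOMCAT_HOME/bin/shutdown.sh"
  sleep 5   # crude wait for the old process to exit
  "$TOMCAT_HOME/bin/startup.sh"
else
  echo "TOMCAT_HOME not set; restart Tomcat manually" >&2
fi
```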

    5. Search from the Nutch search page

    Searching works only after the configuration in step 4 is complete and Tomcat has been restarted.
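A quick way to check that the search page responds after the restart is to request it from the command line. The sketch below rests on several assumptions that are not stated in the original: Tomcat listening on localhost:8080, the web application deployed under the context name nutch-1.2 (taken from the webapps path in step 4), and a search page search.jsp that accepts a query parameter. BASE_URL and QUERY are illustrative variable names.

```shell
# Sketch: build the search URL and check whether the page answers.
# Assumptions: Tomcat on localhost:8080, webapp context "nutch-1.2",
# search page search.jsp taking a "query" parameter.
BASE_URL="${BASE_URL:-http://localhost:8080/nutch-1.2}"
QUERY="osu"
SEARCH_URL="$BASE_URL/search.jsp?query=$QUERY"
echo "search URL: $SEARCH_URL"

# -f makes curl fail on HTTP errors, -s silences progress output,
# --max-time keeps the check from hanging when Tomcat is down.
if command -v curl >/dev/null 2>&1 \
   && curl -fs --max-time 5 -o /dev/null "$SEARCH_URL"; then
  echo "search page reachable"
else
  echo "search page not reachable; check step 4 and restart Tomcat" >&2
fi
```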



     



  • Original article: https://www.cnblogs.com/afreethinker/p/3159587.html