  • Nutch Categorized Search

    Environment

    Ubuntu 11.10

    Tomcat 6.0.35

    Nutch 1.2

    The categorized-search approach the author came up with is to build a separate crawl database for each group of URLs. For a power-industry vertical search engine, for example, the content can be split into news, products, and talent (recruitment): three crawl databases are built, each with its own list of seed URLs, and site filter rules are then configured to keep the crawl on topic.

    Below, the author walks through the implementation step by step.

    First, you need a seed URL list for each category; a quick Baidu search per category is enough to compile three lists from the results.

    Here are the three lists the author compiled.

    News (file name: newsURL)

    http://www.cpnn.com.cn/

    http://news.bjx.com.cn/

    http://www.chinapower.com.cn/news/

    http://news.bjx.com.cn/

    Products (file name: productURL)

    http://www.powerproduct.com/

    http://www.epapi.com/

    http://cnc.powerproduct.com/

    Talent (file name: talentURl)

    http://www.cphr.com.cn/

    http://www.ephr.com.cn/

    http://www.myepjob.com/

    http://www.epjob88.com/

    http://hr.bjx.com.cn/

    http://www.epjob.com.cn/

    http://ep.baidajob.com/

    http://www.01hr.com/

    Since this is only a test, the lists are kept short.

    For vertical search you can no longer crawl with the one-shot bin/nutch crawl urls -dir crawl -depth <d> -topN <N> -threads <n> command: that command targets intranet-style one-shot crawls and cannot crawl incrementally (a sample invocation is shown below for comparison). Instead, the author uses an incremental crawl script that has already been written by others.

    Script source: http://wiki.apache.org/nutch/Crawl
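
    For comparison, a one-shot whole-crawl invocation (not used in this setup) looks roughly like the following; the depth, topN, and threads values here are only illustrative:

    # One-shot crawl: injects ./urls and crawls into ./crawl, but cannot be
    # re-run incrementally against the same crawl database
    bin/nutch crawl urls -dir crawl -depth 5 -topN 1000 -threads 10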

    Because three crawl databases are needed, the script has to be modified a little. The author keeps the crawl databases in /crawldb/news, /crawldb/product and /crawldb/talent, and places the three seed files under the corresponding category directories: /crawldb/news/newsURL, /crawldb/product/productURL and /crawldb/talent/talentURl. (Note that the script injects from a directory named urls relative to each crawl directory, so in practice each seed file should live inside a urls subdirectory, e.g. /crawldb/news/urls/newsURL.) To use the script, the NUTCH_HOME and CATALINA_HOME environment variables must be configured. A directory-setup sketch is shown next, followed by the author's modified crawl script.
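
    Before the first run, the crawl directories and seed files can be laid out with something like the following minimal sketch (the urls subdirectory layout follows the note above; adjust the paths to your own machine):

    # One crawl directory per category, each with a urls/ seed directory
    mkdir -p /crawldb/news/urls /crawldb/product/urls /crawldb/talent/urls
    # Drop each seed list into its category's urls/ directory
    cp newsURL    /crawldb/news/urls/
    cp productURL /crawldb/product/urls/
    cp talentURl  /crawldb/talent/urls/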



    #!/bin/bash
    ############################ Power-news incremental crawl ############################
    # runbot script to run the Nutch bot for crawling and re-crawling.
    # Usage: bin/runbot [safe]
    # If executed in 'safe' mode, it doesn't delete the temporary
    # directories generated during crawl. This might be helpful for
    # analysis and recovery in case a crawl fails.
    #
    # Author: Susam Pal
    echo "----- Starting incremental crawl: power news -----"
    cd /crawldb/news
    depth=5      # number of generate/fetch/update rounds per run
    threads=100  # fetcher threads
    adddays=5    # days added when deciding which URLs are due for re-fetch
    topN=5000    # Comment this statement if you don't want to set topN value
    # Arguments for rm and mv
    RMARGS="-rf"
    MVARGS="--verbose"
    # Parse arguments
    if [ "$1" == "safe" ]
    then
      safe=yes
    fi
    if [ -z "$NUTCH_HOME" ]
    then
      NUTCH_HOME=.
      echo runbot: $0 could not find environment variable NUTCH_HOME
      echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
    else
      echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
    fi
    if [ -z "$CATALINA_HOME" ]
    then
      CATALINA_HOME=/opt/apache-tomcat-6.0.10
      echo runbot: $0 could not find environment variable CATALINA_HOME
      echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
    else
      echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
    fi
    if [ -n "$topN" ]
    then
      topN="-topN $topN"
    else
      topN=""
    fi
    steps=8
    echo "----- Inject (Step 1 of $steps) -----"
    $NUTCH_HOME/bin/nutch inject crawl/crawldb urls
    echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
    for ((i=0; i < $depth; i++))
    do
      echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
      $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
          -adddays $adddays
      if [ $? -ne 0 ]
      then
        echo "runbot: Stopping at depth $depth. No more URLs to fetch."
        break
      fi
      segment=`ls -d crawl/segments/* | tail -1`
      $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
      if [ $? -ne 0 ]
      then
        echo "runbot: fetch $segment at depth `expr $i + 1` failed."
        echo "runbot: Deleting segment $segment."
        rm $RMARGS $segment
        continue
      fi
      $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
    done
    echo "----- Merge Segments (Step 3 of $steps) -----"
    $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
    if [ "$safe" != "yes" ]
    then
      rm $RMARGS crawl/segments
    else
      rm $RMARGS crawl/BACKUPsegments
      mv $MVARGS crawl/segments crawl/BACKUPsegments
    fi
    mv $MVARGS crawl/MERGEDsegments crawl/segments
    echo "----- Invert Links (Step 4 of $steps) -----"
    $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
    echo "----- Index (Step 5 of $steps) -----"
    $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
        crawl/segments/*
    echo "----- Dedup (Step 6 of $steps) -----"
    $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
    echo "----- Merge Indexes (Step 7 of $steps) -----"
    $NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes
    echo "----- Loading New Index (Step 8 of $steps) -----"
    if [ "$safe" != "yes" ]
    then
      rm $RMARGS crawl/NEWindexes
      rm $RMARGS crawl/index
    else
      rm $RMARGS crawl/BACKUPindexes
      rm $RMARGS crawl/BACKUPindex
      mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
      mv $MVARGS crawl/index crawl/BACKUPindex
    fi
    mv $MVARGS crawl/NEWindex crawl/index
    echo "runbot: FINISHED: ----- Power-news incremental crawl complete! -----"
    echo ""

    ############################ Power-products incremental crawl ############################
    echo "----- Starting incremental crawl: power products -----"
    cd /crawldb/product
    steps=8
    echo "----- Inject (Step 1 of $steps) -----"
    $NUTCH_HOME/bin/nutch inject crawl/crawldb urls
    echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
    for ((i=0; i < $depth; i++))
    do
      echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
      $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
          -adddays $adddays
      if [ $? -ne 0 ]
      then
        echo "runbot: Stopping at depth $depth. No more URLs to fetch."
        break
      fi
      segment=`ls -d crawl/segments/* | tail -1`
      $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
      if [ $? -ne 0 ]
      then
        echo "runbot: fetch $segment at depth `expr $i + 1` failed."
        echo "runbot: Deleting segment $segment."
        rm $RMARGS $segment
        continue
      fi
      $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
    done
    echo "----- Merge Segments (Step 3 of $steps) -----"
    $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
    if [ "$safe" != "yes" ]
    then
      rm $RMARGS crawl/segments
    else
      rm $RMARGS crawl/BACKUPsegments
      mv $MVARGS crawl/segments crawl/BACKUPsegments
    fi
    mv $MVARGS crawl/MERGEDsegments crawl/segments
    echo "----- Invert Links (Step 4 of $steps) -----"
    $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
    echo "----- Index (Step 5 of $steps) -----"
    $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
        crawl/segments/*
    echo "----- Dedup (Step 6 of $steps) -----"
    $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
    echo "----- Merge Indexes (Step 7 of $steps) -----"
    $NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes
    echo "----- Loading New Index (Step 8 of $steps) -----"
    if [ "$safe" != "yes" ]
    then
      rm $RMARGS crawl/NEWindexes
      rm $RMARGS crawl/index
    else
      rm $RMARGS crawl/BACKUPindexes
      rm $RMARGS crawl/BACKUPindex
      mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
      mv $MVARGS crawl/index crawl/BACKUPindex
    fi
    mv $MVARGS crawl/NEWindex crawl/index
    echo "runbot: FINISHED: ----- Power-products incremental crawl complete! -----"
    echo ""

    ############################ Power-talent incremental crawl ############################
    echo "----- Starting incremental crawl: power talent -----"
    cd /crawldb/talent
    steps=8
    echo "----- Inject (Step 1 of $steps) -----"
    $NUTCH_HOME/bin/nutch inject crawl/crawldb urls
    echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
    for ((i=0; i < $depth; i++))
    do
      echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
      $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
          -adddays $adddays
      if [ $? -ne 0 ]
      then
        echo "runbot: Stopping at depth $depth. No more URLs to fetch."
        break
      fi
      segment=`ls -d crawl/segments/* | tail -1`
      $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
      if [ $? -ne 0 ]
      then
        echo "runbot: fetch $segment at depth `expr $i + 1` failed."
        echo "runbot: Deleting segment $segment."
        rm $RMARGS $segment
        continue
      fi
      $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
    done
    echo "----- Merge Segments (Step 3 of $steps) -----"
    $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
    if [ "$safe" != "yes" ]
    then
      rm $RMARGS crawl/segments
    else
      rm $RMARGS crawl/BACKUPsegments
      mv $MVARGS crawl/segments crawl/BACKUPsegments
    fi
    mv $MVARGS crawl/MERGEDsegments crawl/segments
    echo "----- Invert Links (Step 4 of $steps) -----"
    $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
    echo "----- Index (Step 5 of $steps) -----"
    $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
        crawl/segments/*
    echo "----- Dedup (Step 6 of $steps) -----"
    $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
    echo "----- Merge Indexes (Step 7 of $steps) -----"
    $NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes
    echo "----- Loading New Index (Step 8 of $steps) -----"
    ${CATALINA_HOME}/bin/shutdown.sh
    if [ "$safe" != "yes" ]
    then
      rm $RMARGS crawl/NEWindexes
      rm $RMARGS crawl/index
    else
      rm $RMARGS crawl/BACKUPindexes
      rm $RMARGS crawl/BACKUPindex
      mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
      mv $MVARGS crawl/index crawl/BACKUPindex
    fi
    mv $MVARGS crawl/NEWindex crawl/index
    ${CATALINA_HOME}/bin/startup.sh
    echo "runbot: FINISHED: ----- Power-talent incremental crawl complete! -----"
    echo ""

    Copy the script above to your Linux machine and make it executable with chmod 755.
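
    For example, assuming the script was saved as runbot.sh (the file name is an assumption) and that Nutch and Tomcat are installed under /opt (also an assumption), the setup could look like this:

    # Make the script executable
    chmod 755 runbot.sh
    # Environment variables the script expects; adjust the paths to your install
    export NUTCH_HOME=/opt/nutch-1.2
    export CATALINA_HOME=/opt/apache-tomcat-6.0.35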

    At this point the script still cannot fetch any pages: you also have to configure URL filter rules in $NUTCH_HOME/conf/regex-urlfilter.txt.

    My configuration is as follows:

    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements.  See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License.  You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    # The default url filter.
    # Better for whole-internet crawling.

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'.  The first matching pattern in the file
    # determines whether a URL is included or ignored.  If no pattern
    # matches, the URL is ignored.

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

    # accept URLs containing characters that usually indicate dynamic/query pages
    # (the stock file skips these with '-'; '+' is used here, presumably so dynamic pages are still crawled)
    +[?*!@=]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # skip .js urls
    -.*\.js

    # accept only the power-industry sites listed below

    +^http://([a-z0-9]*\.)*cpnn.com.cn/

    +^http://([a-z0-9]*\.)*cphr.com.cn/

    +^http://([a-z0-9]*\.)*powerproduct.com/

    +^http://([a-z0-9]*\.)*bjx.com.cn/

    +^http://([a-z0-9]*\.)*renhe.cn/

    +^http://([a-z0-9]*\.)*chinapower.com.cn/

    +^http://([a-z0-9]*\.)*ephr.com.cn/

    +^http://([a-z0-9]*\.)*epapi.com/

    +^http://([a-z0-9]*\.)*myepjob.com/

    +^http://([a-z0-9]*\.)*epjob88.com/

    +^http://([a-z0-9]*\.)*xindianli.com/

    +^http://([a-z0-9]*\.)*epjob.com.cn/

    +^http://([a-z0-9]*\.)*baidajob.com/

    +^http://([a-z0-9]*\.)*01hr.com/

    Next, configure $NUTCH_HOME/conf/nutch-site.xml as follows:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>just a test</value>
        <description>Test</description>
      </property>
    </configuration>

    If all of the steps above succeeded, you can now crawl with the script. Pay attention to where your crawl data is stored and change the corresponding paths in the crawl script to match your own directory layout. An example invocation is sketched below.
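
    A possible invocation, again assuming the script is saved as runbot.sh; running it under nohup is just a convenience so a long crawl survives a closed terminal:

    # Normal run: temporary segment/index directories are deleted after merging
    ./runbot.sh
    # 'safe' run: keeps BACKUP copies of segments and indexes for recovery
    ./runbot.sh safe
    # Run a long crawl in the background with a log file
    nohup ./runbot.sh > runbot.log 2>&1 &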

    Once crawling has finished, the next task is to set up the search environment.

    Copy the war package from the Nutch directory into Tomcat's webapps directory and let Tomcat unpack it. Then delete everything already under the ROOT directory, copy the contents of the unpacked Nutch webapp into ROOT, and modify its WEB-INF/classes/nutch-site.xml as shown after the command sketch below:
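
    A sketch of the deployment commands, assuming the war file is named nutch-1.2.war and that Tomcat unpacks it into webapps/nutch-1.2/ (both assumptions):

    # Deploy the Nutch webapp
    cp $NUTCH_HOME/nutch-1.2.war $CATALINA_HOME/webapps/
    # Start Tomcat once so the war is unpacked
    $CATALINA_HOME/bin/startup.sh
    # Replace the contents of ROOT with the unpacked Nutch webapp
    rm -rf $CATALINA_HOME/webapps/ROOT/*
    cp -r $CATALINA_HOME/webapps/nutch-1.2/* $CATALINA_HOME/webapps/ROOT/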

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
      <property>
        <name>searcher.dir</name>
        <value>/crawldb/news/crawl</value>
      </property>
      <property>
        <name>http.agent.name</name>
        <value>tangmiSpider</value>
        <description>My Search Engine</description>
      </property>
      <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-lucene|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      </property>
    </configuration>

    The value of searcher.dir is the directory that holds your crawl data, so change it to match your setup. Next, create two more directories under webapps, talent and product, copy the contents of the unpacked Nutch webapp into each, and edit their WEB-INF/classes/nutch-site.xml so that searcher.dir is set to /crawldb/talent/crawl and /crawldb/product/crawl respectively (see the sketch below). With that in place, categorized search works: to search a given category, open the corresponding webapp URL.
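
    A sketch of creating the two extra webapps; the localhost URLs and port 8080 are assumptions for a default Tomcat install:

    # Clone the deployed Nutch webapp into the two category webapps
    mkdir -p $CATALINA_HOME/webapps/talent $CATALINA_HOME/webapps/product
    cp -r $CATALINA_HOME/webapps/ROOT/* $CATALINA_HOME/webapps/talent/
    cp -r $CATALINA_HOME/webapps/ROOT/* $CATALINA_HOME/webapps/product/
    # Edit each copy's WEB-INF/classes/nutch-site.xml so searcher.dir points at
    # /crawldb/talent/crawl and /crawldb/product/crawl respectively, then search at:
    #   news:    http://localhost:8080/
    #   talent:  http://localhost:8080/talent/
    #   product: http://localhost:8080/product/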

    My results page (screenshot omitted).
