  • Nutch URL filter configuration rules

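    There are plenty of Nutch source-code walkthroughs online, but the crawling part, and URL filtering in particular, is still hard to follow. I finally worked out how to set it up, so I am posting my crawl-urlfilter.txt here for discussion and as a note to myself.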

    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements.  See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License.  You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.


    # The url filter file used by the crawl command.

    # Better for intranet crawling.
    # Be sure to change MY.DOMAIN.NAME to your domain name.

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'.  The first matching pattern in the file
    # determines whether a URL is included or ignored.  If no pattern
    # matches, the URL is ignored.

    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

    # skip URLs containing certain characters as probable queries, etc.

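    # This matters when crawling dynamic sites. The stock file ships this rule as -[?*!@=];
    # it has to be changed to '+', otherwise pages with a question mark in the URL,
    # such as a.jsp?a=001, cannot be crawled.
    # Note: because the first matching pattern wins, this '+' rule accepts any URL that
    # contains one of these characters before the host rules below are checked.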
    +[?*!@=]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept hosts in MY.DOMAIN.NAME
    ###########################7shop24########################################
    #+^http://([a-z0-9]*\.)*7shop24.com/
    #+^http://www.7shop24.com/indexdtl06.asp\?classid=([0-9]*)&productid=([0-9]*)+$



    ###############################http://www.redbaby.com.cn/##############################

    # Crawling follows a path through the site, so the rules cannot be written arbitrarily.
    # To crawl product pages you first have to admit the home page; products are linked
    # from the category pages, so those must be admitted too; only then do you add the
    # regex for the product pages themselves. For example:
    +^http://www.redbaby.com.cn/$
    +^http://www.redbaby.com.cn/([a-zA-Z]*\.)*index.html$
    +^http://www.redbaby.com.cn/([a-zA-Z]*)/$
    +^http://www.redbaby.com.cn/([a-zA-Z]*)/index.html+$
    +^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BranchID=\d&DepartmentID=\d+$
    +^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w([0-9]*\.)*html$
    +^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BranchID=\d&DepartmentID=\d&SortID=\d+$
    +^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w\d.htm$
    # skip everything else
    -.
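
The first-match behaviour described in the file header can be sketched as follows. This is only an illustration under stated assumptions, not Nutch's own code: Python's `re` stands in for the Java regex engine Nutch uses, `parse_rules` and `accepts` are hypothetical helper names, the rule list is a shortened copy of the file above, and the test URLs are made up.

```python
import re

# A shortened copy of the rules above, kept in file order.
RULES = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|exe)$
+[?*!@=]
+^http://www.redbaby.com.cn/$
+^http://www.redbaby.com.cn/([a-zA-Z]*)/$
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        # Blank lines and lines starting with '#' or a space are treated as comments.
        if not line or line[0] in " #":
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # no pattern matched: the URL is ignored


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in [
        "http://www.redbaby.com.cn/",
        "http://www.redbaby.com.cn/toys/",
        "http://www.redbaby.com.cn/logo.gif",
        "http://www.redbaby.com.cn/about.html",
        "http://other.example.com/a.jsp?a=001",
    ]:
        print(url, accepts(url, rules))
```

In this sketch the last URL is accepted even though it is off-site, because `+[?*!@=]` matches before any host rule is reached. If that is not wanted, one option is to drop that line and let the host-specific `+` rules (with `\?` escaped) admit the query URLs instead.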

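    # Another example: crawling ticket listings from damai.cn
    # (a separate rule set; in a real file these lines must come before the final "-.")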

    +^http://www.damai.cn/map.html
    +^http://www.damai.cn/allticket\w\d+.html$
    -^http://item.damai.cn/(.*)aspx(.*)$
    +^http://item.damai.cn/
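
The damai.cn rules rely on the exclusion for `.aspx` pages coming before the broader `+^http://item.damai.cn/` rule. A small sketch of why the order matters, again using Python's `re` as a stand-in for Java regex; `first_match` is a hypothetical helper and the item URLs are invented for illustration:

```python
import re

# The four damai.cn rules, as (include, pattern) pairs in file order.
DAMAI_RULES = [
    (True, r"^http://www.damai.cn/map.html"),
    (True, r"^http://www.damai.cn/allticket\w\d+.html$"),
    (False, r"^http://item.damai.cn/(.*)aspx(.*)$"),
    (True, r"^http://item.damai.cn/"),
]

# Same rules with the last two swapped, to show what goes wrong.
SWAPPED_RULES = DAMAI_RULES[:2] + [DAMAI_RULES[3], DAMAI_RULES[2]]


def first_match(url, rules):
    """Verdict of the first rule whose pattern is found in the URL, else False."""
    for include, pattern in rules:
        if re.search(pattern, url):
            return include
    return False


if __name__ == "__main__":
    aspx_url = "http://item.damai.cn/buy.aspx?id=1"  # hypothetical URL
    print(first_match(aspx_url, DAMAI_RULES))    # exclusion is reached first
    print(first_match(aspx_url, SWAPPED_RULES))  # broad '+' rule is reached first
```

With the original order the `.aspx` URL is rejected; with the two rules swapped, the broad `+` rule matches first and the exclusion is never consulted.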

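    Java regex escapes you may need when matching URLs:

    ?                corresponds to   \?

    _ (underscore)   corresponds to   \w   (any word character, which includes the underscore)

    . (dot)          corresponds to   \.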
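
A short check of what these escapes change. Python's `re` is used here as a stand-in; for these particular constructs it behaves the same way as Java regex. The query-string values in the test URL are hypothetical.

```python
import re

# Hypothetical category URL with concrete query values.
url = ("http://www.redbaby.com.cn/Product/Product_List.aspx"
       "?Site=1&BranchID=2&DepartmentID=3")

# Unescaped '?' makes the preceding 'x' optional instead of matching a literal '?'.
unescaped = r"^http://www.redbaby.com.cn/Product/Product_List.aspx?Site=\d"
# Escaped '\?' matches the literal question mark in the URL.
escaped = r"^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d"

unescaped_hits = re.search(unescaped, url) is not None
escaped_hits = re.search(escaped, url) is not None

# '\w' matches letters, digits and the underscore.
underscore_is_word_char = re.fullmatch(r"\w", "_") is not None

# An unescaped '.' matches any character; '\.' matches only a literal dot.
loose_dot_hits = re.search(r"index.html$", "indexXhtml") is not None
strict_dot_hits = re.search(r"index\.html$", "indexXhtml") is not None

if __name__ == "__main__":
    print(unescaped_hits, escaped_hits)
    print(underscore_is_word_char)
    print(loose_dot_hits, strict_dot_hits)
```

The unescaped pattern fails on the real URL because, after `aspx`, it expects `Site` immediately rather than `?Site`.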

  • Original article: https://www.cnblogs.com/lixiuran/p/3682095.html