How to modify regex-urlfilter.txt in Nutch to crawl links that match your conditions

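For example, while crawling the Student Online site (学生在线), I found that certain notices could not be crawled, such as the "COFCO Fulinmen Student Aid Fund Application Announcement" (《中粮福临门助学基金申请公告》). Analysis showed that the links to these notices were being filtered out. The URL filter configuration file, regex-urlfilter.txt, is analyzed below so that it can be adjusted to your own situation when needed: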

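Note: in the configuration file, lines starting with "#" are comments. A line starting with "-" excludes every URL that matches its regular expression, and a line starting with "+" keeps every URL that matches. In the regular expressions, "^" matches the start of the string, "$" matches the end of the string, and "[]" denotes a character set. The additional comment line under each English comment is my own annotation (originally written in Chinese).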
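
The rule format described above can be sketched in a few lines of Python. This is only an illustration, not Nutch's own code: the function names `parse_rules` and `accepts` and the three-rule example file are hypothetical, and the sketch assumes that a pattern may match anywhere in the URL, which is what unanchored rules such as `-[*!@]` rely on.

```python
import re

def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Apply the first rule whose pattern is found in the URL."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # no rule matched, so the URL is ignored

# Hypothetical three-rule file, used only for this illustration.
EXAMPLE_RULES = r"""
# skip non-HTTP schemes
-^(file|ftp|mailto):
-[*!@]
+.
"""

rules = parse_rules(EXAMPLE_RULES)
print(accepts(rules, "ftp://example.com/a.txt"))    # False
print(accepts(rules, "http://example.com/a.html"))  # True
```

Because the first matching rule decides, the order of the lines matters: a "+" rule placed above a "-" rule takes precedence for any URL that both match.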

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
# Filter out links that do not use the HTTP protocol, such as file: and ftp:
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# Filter out links to images and similar file formats
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
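# The original rule, -[?*!@=], filters out links that contain special characters. To crawl more links, the rule is changed so that links containing ? and = are no longer filtered out.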
-[*!@]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# Filter out links in which the same path segment keeps repeating
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# Accept all remaining links. You can change this rule so that only the kinds of links you specify are accepted.
+.
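
The suffix rule and the loop-breaking rule both depend on backslashes (`\.` for a literal dot, `\1` for a backreference to the first group), which are easy to lose when the file is copied from a web page. The following Python sketch checks both patterns against a few made-up URLs; the `example.com` addresses and the helper name `is_skipped` are assumptions for illustration only.

```python
import re

# The two exclusion patterns, written with their backslashes in place.
SUFFIX = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"
)
LOOP = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")

def is_skipped(url):
    """True if either exclusion pattern is found in the URL."""
    return bool(SUFFIX.search(url) or LOOP.search(url))

print(is_skipped("http://example.com/logo.PNG"))         # True: image suffix
print(is_skipped("http://example.com/a/b/a/c/a/"))       # True: "/a" repeats 3 times
print(is_skipped("http://example.com/news/index.html"))  # False

# Without the backslash, "." matches any character, so a URL that merely
# ends in the letters "gif" would be excluded as well.
print(bool(re.search(r".(gif)$", "http://example.com/blogif")))   # True
print(bool(re.search(r"\.(gif)$", "http://example.com/blogif")))  # False
```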

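Explanation: the announcement link being crawled is http://www.online.sdu.edu.cn/news/article.php?pid=636514943. It contains the characters ? and =, so it was rejected by the regular expression that filters special characters. After modifying regex-urlfilter.txt as shown above, links to announcements of this kind can be crawled.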
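
As a quick check of this explanation, the two character-set patterns can be tested against the announcement URL in Python. This is a minimal sketch; the names `ORIGINAL` and `MODIFIED` are chosen here for illustration and do not appear in Nutch.

```python
import re

ORIGINAL = re.compile(r"[?*!@=]")  # pattern of the default "-" rule
MODIFIED = re.compile(r"[*!@]")    # pattern after removing ? and =

url = "http://www.online.sdu.edu.cn/news/article.php?pid=636514943"

print(bool(ORIGINAL.search(url)))  # True: "?" and "=" trigger the exclusion
print(bool(MODIFIED.search(url)))  # False: the URL reaches the accept rule
```

A narrower alternative, under the same first-match behavior, would be to leave the default exclusion in place and add a "+" rule for the announcement pages above it; whether that is preferable depends on how many query-style links the crawl should follow.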

Original article: https://www.cnblogs.com/jpfss/p/7903783.html
