  • How Nutch handles robots.txt    Category: H3_NUTCH    2015-01-28 11:20


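    By default, Nutch respects robots.txt, and it offers no configuration option for ignoring it.
    One explanation is quoted below. In short: as an Apache open-source project, Nutch has to follow certain rules, and because the source code is public, anyone can bypass the robots.txt restriction by modifying the source.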

    From the point of view of research and crawling certain pieces of the web, and i strongly agree with you that it should be configurable. But because Nutch being an Apache project, i dismiss it (arguments available upon request). We should adhere to some ethics, it is bad enough that we can just DoS a server by setting some options to a high level. We publish source code, it leaves the option open to everyone to change it, and i think the current situation is balanced enough.
    Patching it is simple, i think we should keep it like that :)

    The source can be modified as follows. [Not verified]
    Edit the class org.apache.nutch.fetcher.FetcherReducer (FetcherReducer.java)
    and comment out the following block:

        if (!rules.isAllowed(fit.u.toString())) {
          // unblock
          fetchQueues.finishFetchItem(fit, true);
          if (LOG.isDebugEnabled()) {
            LOG.debug("Denied by robots.txt: " + fit.url);
          }
          output(fit, null, ProtocolStatusUtils.STATUS_ROBOTS_DENIED,
              CrawlStatus.STATUS_GONE);
          continue;
        }
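
    Commenting the block out removes the check for every crawl. A less invasive variant would be to guard it with a flag. The sketch below is not Nutch code: the class RobotsGateSketch, the prefix-based SimpleRules, and the ignoreRobots parameter are all hypothetical stand-ins, and it only illustrates the control flow of the block above (skip an item when the rules deny it, unless the check is switched off).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the robots check in a fetch loop; not Nutch code. */
class RobotsGateSketch {

  /** Simplified rules: a path is denied if it starts with any disallowed prefix. */
  static final class SimpleRules {
    private final List<String> disallowedPrefixes;

    SimpleRules(List<String> disallowedPrefixes) {
      this.disallowedPrefixes = new ArrayList<>(disallowedPrefixes);
    }

    boolean isAllowed(String path) {
      for (String prefix : disallowedPrefixes) {
        if (path.startsWith(prefix)) {
          return false;
        }
      }
      return true;
    }
  }

  /**
   * Returns the paths that would be fetched. With ignoreRobots == false,
   * denied paths are skipped (the behaviour of the original block); with
   * ignoreRobots == true, the rules are not consulted at all.
   */
  static List<String> selectFetchable(List<String> paths, SimpleRules rules,
      boolean ignoreRobots) {
    List<String> fetchable = new ArrayList<>();
    for (String path : paths) {
      if (!ignoreRobots && !rules.isAllowed(path)) {
        // Mirrors the "Denied by robots.txt" branch: skip this item.
        continue;
      }
      fetchable.add(path);
    }
    return fetchable;
  }

  public static void main(String[] args) {
    SimpleRules rules = new SimpleRules(Arrays.asList("/private/"));
    List<String> paths = Arrays.asList("/index.html", "/private/a.html");
    System.out.println(selectFetchable(paths, rules, false));
    System.out.println(selectFetchable(paths, rules, true));
  }
}
```

    In a real patch, the flag would presumably be read from the job configuration under a property name of your own choosing; no such property exists in Nutch, which is the point of the quoted reply.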




    Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.

  • Original article: https://www.cnblogs.com/lujinhong2/p/4637239.html