  • A collection of Nutch exceptions

    Exception:
    Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-dell\mapred\staging\dell1008071661\.staging to 0700
        at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:691)
        at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:664)
    Cause:
    This is a Hadoop file-permission problem on Windows; it does not occur on Linux.
    Fix:
    1 Change the code:
    I am using nutch-1.7, whose matching Hadoop version is 1.2.0. Download address: (hadoop-core-1.2.0
    In the downloaded release-1.2.0\src, search for 'FileUtil' and change:
    private static void checkReturnValue(boolean rv, File p, FsPermission permission) {
      // Comment out the body so a failed chmod on Windows no longer aborts the job.
      /**
      if (!rv) {
        throw new IOException("Failed to set permissions of path: " + p +
                              " to " +
                              String.format("%04o", permission.toShort()));
      }
      **/
    }
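    A slightly gentler variant (only a sketch of the same idea; it assumes FileUtil's existing LOG field, or add one with LogFactory.getLog(FileUtil.class) if it is missing) downgrades the failure to a warning instead of deleting the check outright:

    private static void checkReturnValue(boolean rv, File p, FsPermission permission) {
      if (!rv) {
        // On Windows the underlying chmod regularly fails; warn and keep going.
        LOG.warn("Failed to set permissions of path: " + p + " to "
                 + String.format("%04o", permission.toShort()));
      }
    }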

      

    2 Rebuild Hadoop (no need to import it into Eclipse):
    Environment: Cygwin, Ant
    Running Ant produces release-1.2.0\build\hadoop-core-1.2.1-SNAPSHOT.jar.
    Rename it to hadoop-core-1.2.0.jar and use it to overwrite apache-nutch-1.7\lib\hadoop-core-1.2.0.jar.


    Exception:
    java.io.IOException: Job failed!
    Solution: the plugin.folders value must point at the plugin directory that actually exists in the build you are running.
    In src:
    <property>
      <name>plugin.folders</name>
      <value>./src/plugin</value>
      <description>Directories where nutch plugins are located. Each
      element may be a relative or absolute path. If absolute, it is used
      as is. If relative, it is searched for on the classpath.</description>
    </property>
    Remember that it is singular here.
    In bin the plugin folder is singular, so a change is needed here:
    <property>
      <name>plugin.folders</name>
      <value>./src/plugins</value>
      <description>Directories where nutch plugins are located. Each
      element may be a relative or absolute path. If absolute, it is used
      as is. If relative, it is searched for on the classpath.</description>
    </property>
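    If Nutch is launched from an IDE rather than from bin/nutch, the effective value can also be checked or overridden in code. A minimal sketch (PluginFolderCheck is a hypothetical helper name and ./src/plugin is only an example path; it assumes the Nutch 1.x conf directory is on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class PluginFolderCheck {
      public static void main(String[] args) {
        // Loads nutch-default.xml / nutch-site.xml from the classpath.
        Configuration conf = NutchConfiguration.create();
        System.out.println("plugin.folders = " + conf.get("plugin.folders"));
        // The value must name the plugin directory that really exists on disk.
        conf.set("plugin.folders", "./src/plugin");
      }
    }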
    Exception: reasons why the HTML downloaded by Nutch is incomplete
    1 http://news.163.com/ skipped. Content of size 481597 was truncated to 65376
    Solution: in conf/nutch-default.xml set parser.skip.truncated to false.


    2 The byte limit on HTTP downloads
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the http://
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>
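    As a quick sanity check on what values are actually in effect, here is a minimal sketch (ShowLimits is a hypothetical class name; it assumes Nutch 1.x's NutchConfiguration with the conf directory on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class ShowLimits {
      public static void main(String[] args) {
        // Reads nutch-default.xml and nutch-site.xml from the classpath.
        Configuration conf = NutchConfiguration.create();
        // -1 disables truncation; the shipped default is 65536 bytes.
        System.out.println("http.content.limit    = " + conf.getInt("http.content.limit", 65536));
        // When true, truncated pages are skipped at parse time.
        System.out.println("parser.skip.truncated = " + conf.getBoolean("parser.skip.truncated", true));
      }
    }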


    Exception:
    The seed http://www.gov.cn/ was added,
    +^http://www.gov.cn/ was added to regex-urlfilter.txt,
    the configuration is entirely correct, yet the crawler collected nothing.
    Cause:
    The site publishes a robots.txt that forbids crawling.
    Solution:
    If you really want to bypass it, modify the Fetcher class:
    // Commenting out this block makes the fetcher ignore robots.txt entirely.
    /**
    if (!rules.isAllowed(fit.u.toString())) {
      // unblock
      fetchQueues.finishFetchItem(fit, true);
      if (LOG.isDebugEnabled()) {
        LOG.debug("Denied by robots.txt: " + fit.url);
      }
      output(fit.url, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE);
      reporter.incrCounter("FetcherStatus", "robots_denied", 1);
      continue;
    }**/
    Exception: unzipBestEffort returned null
    
    Reposted from: http://blog.chinaunix.net/uid-8345138-id-3358621.html

    When the Nutch crawler fetched a certain page, the following exception appeared:

    ERROR http.Http (?:invoke0(?)) - java.io.IOException: unzipBestEffort returned null
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:472)
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:151)
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:208)
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:173)

    Debugging shows that the exception originates from:

    java.io.IOException: Not in GZIP format
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)

    Cause of the exception:

    The page is delivered with chunked transfer encoding, but by default the Nutch crawler treats the body as non-chunked. The GZIP stream is therefore built from the raw chunked bytes, and the subsequent GZIP decompression fails.

    Whether a response is chunked can be seen in the HTTP headers: a chunked response carries Transfer-Encoding: chunked.
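    To see why the decoder chokes, here is a tiny standalone illustration (not Nutch code; ChunkedGzipDemo and the literal body are made up for the demo). A chunked body begins with the chunk size in hex, not with the GZIP magic bytes 0x1f 0x8b, so the constructor throws exactly the error shown above:

    import java.io.ByteArrayInputStream;
    import java.util.zip.GZIPInputStream;

    public class ChunkedGzipDemo {
      public static void main(String[] args) throws Exception {
        // The first bytes are the chunk-size line ("1f\r\n"), not the GZIP header,
        // so this throws java.io.IOException: Not in GZIP format.
        byte[] chunkedBody = "1f\r\n...gzipped bytes...\r\n0\r\n\r\n".getBytes("ISO-8859-1");
        new GZIPInputStream(new ByteArrayInputStream(chunkedBody));
      }
    }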

    Fix:

    1. Modify the interface org.apache.nutch.metadata.HttpHeaders and add:

    public final static String TRANSFER_ENCODING = "Transfer-Encoding";

    2. The org.apache.nutch.protocol.http.HttpResponse class in Nutch already provides a method for handling chunked transfers:

    private void readChunkedContent(PushbackInputStream in,
                                    StringBuffer line)

    All we need to do is call this method from the HttpResponse constructor, adding the following code:

    String transferEncoding = getHeader(Response.TRANSFER_ENCODING);

    if (transferEncoding != null && transferEncoding.equalsIgnoreCase("chunked")) {
      StringBuffer line = new StringBuffer();
      this.readChunkedContent(in, line);
    } else {
      readPlainContent(in);
    }

    With the changes in place, rebuild and run a test.

    The sites that could not be crawled before can now be fetched.

    =========================================================

    Notes:

    1. There are two HttpResponse classes, one in protocol.http and one in protocol.httpclient; the one to modify is the former.

    2. Nutch 2.0 has removed the readChunkedContent method, so the Nutch 1.5 version is included below; put this method into HttpResponse:


    private void readChunkedContent(PushbackInputStream in, StringBuffer line)
            throws HttpException, IOException {
        boolean doneChunks = false;
        int contentBytesRead = 0;
        byte[] bytes = new byte[Http.BUFFER_SIZE];
        ByteArrayOutputStream out = new ByteArrayOutputStream(Http.BUFFER_SIZE);
        while (!doneChunks) {
            if (Http.LOG.isTraceEnabled()) {
                Http.LOG.trace("Http: starting chunk");
            }
            readLine(in, line, false);
            String chunkLenStr;
            // if (LOG.isTraceEnabled()) { LOG.trace("chunk-header: '" + line +
            // "'"); }
            int pos = line.indexOf(";");
            if (pos < 0) {
                chunkLenStr = line.toString();
            } else {
                chunkLenStr = line.substring(0, pos);
                // if (LOG.isTraceEnabled()) { LOG.trace("got chunk-ext: " +
                // line.substring(pos+1)); }
            }
            chunkLenStr = chunkLenStr.trim();
            int chunkLen;
            try {
                chunkLen = Integer.parseInt(chunkLenStr, 16);
            } catch (NumberFormatException e) {
                throw new HttpException("bad chunk length: " + line.toString());
            }
            if (chunkLen == 0) {
                doneChunks = true;
                break;
            }
            if ((contentBytesRead + chunkLen) > http.getMaxContent())
                chunkLen = http.getMaxContent() - contentBytesRead;
            // read one chunk
            int chunkBytesRead = 0;
            while (chunkBytesRead < chunkLen) {
                int toRead = (chunkLen - chunkBytesRead) < Http.BUFFER_SIZE ? (chunkLen - chunkBytesRead)
                        : Http.BUFFER_SIZE;
                int len = in.read(bytes, 0, toRead);
                if (len == -1)
                    throw new HttpException("chunk eof after "
                            + contentBytesRead + " bytes in successful chunks"
                            + " and " + chunkBytesRead + " in current chunk");
                // DANGER!!! Will printed GZIPed stuff right to your
                // terminal!
                // if (LOG.isTraceEnabled()) { LOG.trace("read: " + new
                // String(bytes, 0, len)); }
                out.write(bytes, 0, len);
                chunkBytesRead += len;
            }
            readLine(in, line, false);
        }
        if (!doneChunks) {
            if (contentBytesRead != http.getMaxContent())
                throw new HttpException(
                        "chunk eof: !doneChunk && didn't max out");
            return;
        }
        content = out.toByteArray();
        parseHeaders(in, line);
    }
    3. The place to modify in the constructor is where readPlainContent is called.
    Exception: could only be replicated to 0 nodes, instead of 1

    The machine room lost power over the weekend, and afterwards Hadoop threw the error above. The fix was to turn off the firewall on every node; the relevant commands are:
    
    Check the firewall status:
    /etc/init.d/iptables status
    Stop the firewall temporarily:
    /etc/init.d/iptables stop
    Keep the firewall from starting at boot:
    /sbin/chkconfig --level 2345 iptables off
    Restart iptables:
    /etc/init.d/iptables restart
  • Original post: https://www.cnblogs.com/i80386/p/3223027.html