  • A collection of Nutch exceptions

    Exception:
    Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-dell\mapred\staging\dell1008071661\.staging to 0700
        at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:691)
        at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:664)
    Cause:
    This is a Hadoop file-permission problem on Windows; it does not occur on Linux.
    Fix:
    1 Change the code:
    I am using nutch-1.7, whose matching Hadoop version is 1.2.0. Download address: (hadoop-core-1.2.0
    In the downloaded release-1.2.0\src, search for 'FileUtil' and change:
    private static void checkReturnValue(boolean rv, File p, FsPermission permission) {
      // Comment out the body so a failed chmod on Windows no longer aborts the job.
      /**
      if (!rv) {
        throw new IOException("Failed to set permissions of path: " + p +
                              " to " +
                              String.format("%04o", permission.toShort()));
      }
      **/
    }
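    A slightly gentler variant (only a sketch of the same idea; it assumes FileUtil's existing LOG field, or add one with LogFactory.getLog(FileUtil.class) if it is missing) downgrades the failure to a warning instead of deleting the check outright:

    private static void checkReturnValue(boolean rv, File p, FsPermission permission) {
      if (!rv) {
        // On Windows the underlying chmod regularly fails; warn and keep going.
        LOG.warn("Failed to set permissions of path: " + p + " to "
                 + String.format("%04o", permission.toShort()));
      }
    }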

      

    2 Rebuild Hadoop (no need to import it into Eclipse):
    Environment: Cygwin, Ant
    Running Ant produces release-1.2.0\build\hadoop-core-1.2.1-SNAPSHOT.jar.
    Rename it to hadoop-core-1.2.0.jar and use it to overwrite apache-nutch-1.7\lib\hadoop-core-1.2.0.jar.


    Exception:
    java.io.IOException: Job failed!
    Solution: the plugin.folders value must point at the plugin directory that actually exists in the build you are running.
    In src:
    <property>
      <name>plugin.folders</name>
      <value>./src/plugin</value>
      <description>Directories where nutch plugins are located. Each
      element may be a relative or absolute path. If absolute, it is used
      as is. If relative, it is searched for on the classpath.</description>
    </property>
    Remember that it is singular here.
    In bin the plugin folder is singular, so a change is needed here:
    <property>
      <name>plugin.folders</name>
      <value>./src/plugins</value>
      <description>Directories where nutch plugins are located. Each
      element may be a relative or absolute path. If absolute, it is used
      as is. If relative, it is searched for on the classpath.</description>
    </property>
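    If Nutch is launched from an IDE rather than from bin/nutch, the effective value can also be checked or overridden in code. A minimal sketch (PluginFolderCheck is a hypothetical helper name and ./src/plugin is only an example path; it assumes the Nutch 1.x conf directory is on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class PluginFolderCheck {
      public static void main(String[] args) {
        // Loads nutch-default.xml / nutch-site.xml from the classpath.
        Configuration conf = NutchConfiguration.create();
        System.out.println("plugin.folders = " + conf.get("plugin.folders"));
        // The value must name the plugin directory that really exists on disk.
        conf.set("plugin.folders", "./src/plugin");
      }
    }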
    Exception: reasons why the HTML downloaded by Nutch is incomplete
    1 http://news.163.com/ skipped. Content of size 481597 was truncated to 65376
    Solution: in conf/nutch-default.xml set parser.skip.truncated to false.


    2 The byte limit on HTTP downloads
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the http://
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>
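    As a quick sanity check on what values are actually in effect, here is a minimal sketch (ShowLimits is a hypothetical class name; it assumes Nutch 1.x's NutchConfiguration with the conf directory on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class ShowLimits {
      public static void main(String[] args) {
        // Reads nutch-default.xml and nutch-site.xml from the classpath.
        Configuration conf = NutchConfiguration.create();
        // -1 disables truncation; the shipped default is 65536 bytes.
        System.out.println("http.content.limit    = " + conf.getInt("http.content.limit", 65536));
        // When true, truncated pages are skipped at parse time.
        System.out.println("parser.skip.truncated = " + conf.getBoolean("parser.skip.truncated", true));
      }
    }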


    Exception:
    The seed http://www.gov.cn/ was added,
    +^http://www.gov.cn/ was added to regex-urlfilter.txt,
    the configuration is entirely correct, yet the crawler collected nothing.
    Cause:
    The site publishes a robots.txt that forbids crawling.
    Solution:
    If you really want to bypass it, modify the Fetcher class:
    // Commenting out this block makes the fetcher ignore robots.txt entirely.
    /**
    if (!rules.isAllowed(fit.u.toString())) {
      // unblock
      fetchQueues.finishFetchItem(fit, true);
      if (LOG.isDebugEnabled()) {
        LOG.debug("Denied by robots.txt: " + fit.url);
      }
      output(fit.url, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE);
      reporter.incrCounter("FetcherStatus", "robots_denied", 1);
      continue;
    }**/
    Exception: unzipBestEffort returned null
    
    Reposted from: http://blog.chinaunix.net/uid-8345138-id-3358621.html

    When the Nutch crawler fetched a certain page, the following exception appeared:

    ERROR http.Http (?:invoke0(?)) - java.io.IOException: unzipBestEffort returned null
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:472)
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:151)
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:208)
    ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:173)

    Debugging shows that the exception originates from:

    java.io.IOException: Not in GZIP format
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)

    Cause of the exception:

    The page is delivered with chunked transfer encoding, but by default the Nutch crawler treats the body as non-chunked. The GZIP stream is therefore built from the raw chunked bytes, and the subsequent GZIP decompression fails.

    Whether a response is chunked can be seen in the HTTP headers: a chunked response carries Transfer-Encoding: chunked.
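    To see why the decoder chokes, here is a tiny standalone illustration (not Nutch code; ChunkedGzipDemo and the literal body are made up for the demo). A chunked body begins with the chunk size in hex, not with the GZIP magic bytes 0x1f 0x8b, so the constructor throws exactly the error shown above:

    import java.io.ByteArrayInputStream;
    import java.util.zip.GZIPInputStream;

    public class ChunkedGzipDemo {
      public static void main(String[] args) throws Exception {
        // The first bytes are the chunk-size line ("1f\r\n"), not the GZIP header,
        // so this throws java.io.IOException: Not in GZIP format.
        byte[] chunkedBody = "1f\r\n...gzipped bytes...\r\n0\r\n\r\n".getBytes("ISO-8859-1");
        new GZIPInputStream(new ByteArrayInputStream(chunkedBody));
      }
    }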

    Fix:

    1. Modify the interface org.apache.nutch.metadata.HttpHeaders and add:

    public final static String TRANSFER_ENCODING = "Transfer-Encoding";

    2. The org.apache.nutch.protocol.http.HttpResponse class in Nutch already provides a method for handling chunked transfers:

    private void readChunkedContent(PushbackInputStream in,
                                    StringBuffer line)

    All we need to do is call this method from the HttpResponse constructor, adding the following code:

    String transferEncoding = getHeader(Response.TRANSFER_ENCODING);

    if (transferEncoding != null && transferEncoding.equalsIgnoreCase("chunked")) {
      StringBuffer line = new StringBuffer();
      this.readChunkedContent(in, line);
    } else {
      readPlainContent(in);
    }

    With the changes in place, rebuild and run a test.

    The sites that could not be crawled before can now be fetched.

    =========================================================

    Notes:

    1. There are two HttpResponse classes, one in protocol.http and one in protocol.httpclient; the one to modify is the former.

    2. Nutch 2.0 has removed the readChunkedContent method, so the Nutch 1.5 version is included below; put this method into HttpResponse:


    private void readChunkedContent(PushbackInputStream in, StringBuffer line)
            throws HttpException, IOException {
        boolean doneChunks = false;
        int contentBytesRead = 0;
        byte[] bytes = new byte[Http.BUFFER_SIZE];
        ByteArrayOutputStream out = new ByteArrayOutputStream(Http.BUFFER_SIZE);
        while (!doneChunks) {
            if (Http.LOG.isTraceEnabled()) {
                Http.LOG.trace("Http: starting chunk");
            }
            readLine(in, line, false);
            String chunkLenStr;
            // if (LOG.isTraceEnabled()) { LOG.trace("chunk-header: '" + line +
            // "'"); }
            int pos = line.indexOf(";");
            if (pos < 0) {
                chunkLenStr = line.toString();
            } else {
                chunkLenStr = line.substring(0, pos);
                // if (LOG.isTraceEnabled()) { LOG.trace("got chunk-ext: " +
                // line.substring(pos+1)); }
            }
            chunkLenStr = chunkLenStr.trim();
            int chunkLen;
            try {
                chunkLen = Integer.parseInt(chunkLenStr, 16);
            } catch (NumberFormatException e) {
                throw new HttpException("bad chunk length: " + line.toString());
            }
            if (chunkLen == 0) {
                doneChunks = true;
                break;
            }
            if ((contentBytesRead + chunkLen) > http.getMaxContent())
                chunkLen = http.getMaxContent() - contentBytesRead;
            // read one chunk
            int chunkBytesRead = 0;
            while (chunkBytesRead < chunkLen) {
                int toRead = (chunkLen - chunkBytesRead) < Http.BUFFER_SIZE ? (chunkLen - chunkBytesRead)
                        : Http.BUFFER_SIZE;
                int len = in.read(bytes, 0, toRead);
                if (len == -1)
                    throw new HttpException("chunk eof after "
                            + contentBytesRead + " bytes in successful chunks"
                            + " and " + chunkBytesRead + " in current chunk");
                // DANGER!!! Will printed GZIPed stuff right to your
                // terminal!
                // if (LOG.isTraceEnabled()) { LOG.trace("read: " + new
                // String(bytes, 0, len)); }
                out.write(bytes, 0, len);
                chunkBytesRead += len;
            }
            readLine(in, line, false);
        }
        if (!doneChunks) {
            if (contentBytesRead != http.getMaxContent())
                throw new HttpException(
                        "chunk eof: !doneChunk && didn't max out");
            return;
        }
        content = out.toByteArray();
        parseHeaders(in, line);
    }
    3. The place to modify in the constructor is where readPlainContent is called.
    Exception: could only be replicated to 0 nodes, instead of 1

    The machine room lost power over the weekend, and afterwards Hadoop threw the error above. The fix was to turn off the firewall on every node; the relevant commands are:
    
    Check the firewall status:
    /etc/init.d/iptables status
    Stop the firewall temporarily:
    /etc/init.d/iptables stop
    Keep the firewall from starting at boot:
    /sbin/chkconfig --level 2345 iptables off
    Restart iptables:
    /etc/init.d/iptables restart
  • Original post: https://www.cnblogs.com/i80386/p/3223027.html