  • Using code to inspect the SequenceFile data generated by a Nutch crawl

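    The value object passed to the reader must be an instance of the class that matches the value type stored in the data file. (For this example, the data file was copied to the root of the D: drive on a local Windows machine.)

    Code: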

    package cn.summerchill.nutch;

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseText;
    import org.apache.nutch.protocol.Content;

    /**
     * Reads a SequenceFile generated by Nutch.
     * @author Administrator
     */
    public class SeFileReader {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // A forward slash avoids the invalid "\d" escape in a Java string literal.
            Path dataPath = new Path("D:/data");
            FileSystem fs = dataPath.getFileSystem(conf);
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, dataPath, conf);
            try {
                Text key = new Text();
                // The value class must match the type stored in this data file;
                // swap in one of the commented-out lines for other Nutch data files.
                CrawlDatum value = new CrawlDatum();
                //Content value = new Content();
                //Inlinks value = new Inlinks();
                //ParseText value = new ParseText();
                //ParseData value = new ParseData();
                while (reader.next(key, value)) {
                    System.out.println("key->\n" + key);
                    // Values go to stderr, which is not synchronized with stdout,
                    // so key and value lines can appear out of order in the console.
                    System.err.println("value->\n" + value);
                    try {
                        // Pause so each record can be read as it scrolls by.
                        Thread.sleep(1000);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    System.out.println("=======================================");
                }
            } finally {
                // Close the reader even if reading a record fails.
                reader.close();
            }
        }
    }

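    Output (keys are written to stdout and values to stderr, so the two streams can interleave out of order, as in the second record):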

    key->
    http://bbs.superwu.cn/
    value->
    Version: 7
    Status: 2 (db_fetched)
    Fetch time: Tue Nov 08 08:31:30 CST 2016
    Modified time: Thu Jan 01 08:00:00 CST 1970
    Retries since fetch: 0
    Retry interval: 2592000 seconds (30 days)
    Score: 1.6153846
    Signature: 22defcd7cb4e7b1dc8a16a0a2f339ecb
    Metadata: 
         Content-Type=application/xhtml+xml
        _pst_=success(1), lastModified=0
        _rs_=610
    
    =======================================
    value->
    Version: 7
    Status: 1 (db_unfetched)
    Fetch time: Sun Oct 09 08:31:35 CST 2016
    Modified time: Thu Jan 01 08:00:00 CST 1970
    Retries since fetch: 0
    Retry interval: 2592000 seconds (30 days)
    Score: 0.23076925
    Signature: null
    Metadata: 
     
    key->
    http://bbs.superwu.cn/archiver/
    =======================================
    key->
    http://bbs.superwu.cn/forum.php
    value->
    Version: 7
    Status: 1 (db_unfetched)
    Fetch time: Sun Oct 09 08:31:35 CST 2016
    Modified time: Thu Jan 01 08:00:00 CST 1970
    Retries since fetch: 0
    Retry interval: 2592000 seconds (30 days)
    Score: 0.15384616
    Signature: null
    Metadata: 
     
    =======================================
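
Because the value class has to match what is stored in the data file, it helps to check the file before choosing between CrawlDatum, Content, Inlinks, ParseText and ParseData. The following is a minimal sketch, not part of the original post, that reads the key and value class names from the start of a SequenceFile using only the JDK. It assumes the header layout used by SequenceFile format versions 4 and later (the bytes `SEQ`, a version byte, then two length-prefixed class names) and only handles class names shorter than 128 bytes. The class `SeqHeaderPeek` and its methods are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: prints the key/value class names stored in a SequenceFile header. */
public class SeqHeaderPeek {

    /** Returns {keyClassName, valueClassName} read from the start of a SequenceFile stream. */
    public static String[] readClassNames(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("Not a SequenceFile: missing SEQ magic bytes");
        }
        int version = in.readByte();
        if (version < 4) {
            // Assumption: older format versions encode the class names differently.
            throw new IOException("Unsupported SequenceFile version: " + version);
        }
        String keyClass = readShortString(in);
        String valueClass = readShortString(in);
        return new String[] { keyClass, valueClass };
    }

    /** Reads a string prefixed by a one-byte length (enough for names under 128 bytes). */
    static String readShortString(DataInputStream in) throws IOException {
        int length = in.readByte();
        if (length < 0) {
            throw new IOException("Multi-byte length prefix is not handled in this sketch");
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Builds an in-memory header so the sketch can run without a real data file. */
    public static byte[] sampleHeader(String keyClass, String valueClass) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write('S');
        out.write('E');
        out.write('Q');
        out.write(6); // format version byte
        for (String name : new String[] { keyClass, valueClass }) {
            byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
            out.write(bytes.length); // one-byte length prefix
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, inspect that file; otherwise use the in-memory sample.
        try (InputStream in = args.length > 0
                ? new FileInputStream(args[0])
                : new ByteArrayInputStream(sampleHeader(
                        "org.apache.hadoop.io.Text", "org.apache.nutch.crawl.CrawlDatum"))) {
            String[] names = readClassNames(in);
            System.out.println("key class:   " + names[0]);
            System.out.println("value class: " + names[1]);
        }
    }
}
```

Running it with the path of the data file as the only argument (for example `D:/data`) prints the two class names. When Hadoop is on the classpath, `SequenceFile.Reader` exposes the same information through `getKeyClassName()` and `getValueClassName()`, which is the simpler route inside a program such as SeFileReader.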
  • Original article: https://www.cnblogs.com/DreamDrive/p/5944073.html