Reading the Contents of the Webpage Table

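Nutch stores the information it fetches from web pages in HBase; by default the table is named $crawlId_webpage. The cell values are stored as raw serialized bytes, which the HBase shell displays as hex escapes, so a direct scan, or a read through the Java API without decoding, yields only those raw bytes.
Nutch therefore provides the readdb command, which dumps the contents of the table to a readable text file.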

Usage:

$ bin/nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                      [-crawlId <id>] [-content] [-headers] [-links] [-text]
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -stats [-sort] - print overall statistics to System.out
    [-sort]        - list status sorted by host
    -url <url>     - print information on <url> to System.out
    -dump <out_dir> [-regex regex] - dump the webtable to a text file in
                     <out_dir>
    -content       - dump also raw content
    -headers       - dump protocol headers
    -links         - dump links
    -text          - dump extracted text
    [-regex]       - filter on the URL of the webtable entry
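
As a rough illustration of how the dump options above combine, the following Python sketch assembles a readdb command line. The helper name `build_readdb_cmd` and its parameters are hypothetical; only the flags themselves come from the usage text, and the sketch assumes `bin/nutch` is invoked from the Nutch runtime directory.

```python
def build_readdb_cmd(out_dir, crawl_id=None, regex=None,
                     content=False, headers=False, links=False, text=False):
    """Assemble the argument list for `bin/nutch readdb -dump <out_dir> ...`."""
    cmd = ["bin/nutch", "readdb", "-dump", out_dir]
    if regex is not None:
        # Optional URL filter for the dumped entries
        cmd += ["-regex", regex]
    if crawl_id is not None:
        # Prefix of the table to read, e.g. test001 -> test001_webpage
        cmd += ["-crawlId", crawl_id]
    # Optional sections to include in the dump
    for flag, enabled in (("-content", content), ("-headers", headers),
                          ("-links", links), ("-text", text)):
        if enabled:
            cmd.append(flag)
    return cmd


print(" ".join(build_readdb_cmd("/mnt/jediael/2", crawl_id="test001", content=True)))
```

The resulting list can be passed to `subprocess.run` if the dump should be triggered from a script rather than typed by hand.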

Example:
(1) seed.txt contains:
http://www.163.com

(2) Run the following command to inject the seed URL:
bin/nutch inject seed.txt -crawlId test001

(3) Scan the table; the raw values are not human-readable:

hbase(main):002:0> scan 'test001_webpage'
ROW                                         COLUMN+CELL                                                                                                                 
 com.163.money:http/                        column=f:fi, timestamp=1423550107073, value=\x00'\x8D\x00
 com.163.money:http/                        column=f:ts, timestamp=1423550107073, value=\x00\x00\x01Kr2\xC7\xD6
 com.163.money:http/                        column=mk:_injmrk_, timestamp=1423550107073, value=y
 com.163.money:http/                        column=mk:dist, timestamp=1423550107073, value=0
 com.163.money:http/                        column=mtdt:_csh_, timestamp=1423550107073, value=?\x80\x00\x00
 com.163.money:http/                        column=s:s, timestamp=1423550107073, value=?\x80\x00\x00
1 row(s) in 0.4090 seconds
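
The values are not really meaningless; they are numbers in binary form. As a minimal sketch, the Python snippet below decodes three of the cell values from the scan, written here with their `\x` hex escapes. It assumes the values are big-endian and that f:fi is a 4-byte int, f:ts an 8-byte long, and s:s a 4-byte float; these type assignments are inferred from the byte lengths, not taken from the Nutch schema. The variable names are illustrative only.

```python
import struct

# Cell values from the scan above, as Python byte strings
raw_fetch_interval = b"\x00'\x8d\x00"          # column f:fi
raw_fetch_time = b"\x00\x00\x01Kr2\xc7\xd6"    # column f:ts
raw_score = b"?\x80\x00\x00"                   # column s:s

# Assumption: big-endian int / long / float (see the note above)
fetch_interval = struct.unpack(">i", raw_fetch_interval)[0]
fetch_time = struct.unpack(">q", raw_fetch_time)[0]
score = struct.unpack(">f", raw_score)[0]

print(fetch_interval)  # 2592000
print(fetch_time)      # 1423550105558
print(score)           # 1.0
```

These are the same numbers that readdb reports as fetchInterval, fetchTime and score for this row.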

(4) Dump the table contents to /mnt/jediael/2:

bin/nutch readdb  -dump /mnt/jediael/2  -crawlId test001 -content

(5) Inspect the contents of /mnt/jediael/2:

$ ll
total 4
-rwxrwxrwx. 1 jediael jediael 344 Feb 10 14:41 part-r-00000
-rwxrwxrwx. 1 jediael jediael   0 Feb 10 14:41 _SUCCESS

$ cat part-r-00000
http://money.163.com/   key:    com.163.money:http/
baseUrl:        null
status: 0 (null)
fetchTime:      1423550105558
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:      0
modifiedTime:   0
prevModifiedTime:       0
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  1.0
marker _injmrk_ :       y
marker dist :   0
reprUrl:        null
metadata _csh_ :        ?锟
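
If the dump needs further processing, a record like the one above can be loaded into a dictionary. The Python sketch below assumes the layout shown here: a first line of the form `<url> key: <row key>`, followed by `name: value` lines. The function name `parse_dump_record` is hypothetical, and the sketch handles only the metadata fields of a single record, not the sections added by -content, -headers, -links or -text.

```python
def parse_dump_record(lines):
    """Parse the metadata lines of one readdb dump record into a dict."""
    # Drop blank lines and trailing newlines
    first, *rest = [ln.rstrip("\n") for ln in lines if ln.strip()]
    # First line: "<url>   key:    <row key>"
    url, _, key = first.partition("key:")
    record = {"url": url.strip(), "key": key.strip()}
    # Remaining lines: "name: value", split on the first colon only
    for line in rest:
        name, sep, value = line.partition(":")
        if sep:
            record[name.strip()] = value.strip()
    return record


sample = """\
http://money.163.com/   key:    com.163.money:http/
baseUrl:        null
status: 0 (null)
fetchTime:      1423550105558
fetchInterval:  2592000
score:  1.0
marker _injmrk_ :       y
""".splitlines()

record = parse_dump_record(sample)
print(record["url"], record["key"], record["fetchTime"])
```

All values are kept as strings; convert fields such as fetchTime with `int()` where numeric values are needed.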

Original article: https://www.cnblogs.com/jediael/p/4304040.html