  • Reading the Contents of the Webpage Table  Category: H3_NUTCH  2015-02-10 14:59



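        Nutch stores the information it crawls from web pages in an HBase database. By default the table is named $crawlId_webpage. The cell values are stored as binary and displayed as hexadecimal escapes, so a direct scan or a read through the Java API returns only hex data rather than readable text.
        For this reason Nutch provides the readdb option, which reads the contents of the table and writes them to a text file.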

     The usage is as follows:

    $ bin/nutch readdb
    Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                          [-crawlId <id>] [-content] [-headers] [-links] [-text]
        -crawlId <id>  - the id to prefix the schemas to operate on,
                         (default: storage.crawl.id)
        -stats [-sort] - print overall statistics to System.out
        [-sort]        - list status sorted by host
        -url <url>     - print information on <url> to System.out
        -dump <out_dir> [-regex regex] - dump the webtable to a text file in
                         <out_dir>
        -content       - dump also raw content
        -headers       - dump protocol headers
        -links         - dump links
        -text          - dump extracted text
        [-regex]       - filter on the URL of the webtable entry
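
The options above can be combined on one command line. As a minimal sketch, the helper below assembles an argument list for a `-dump` run using only the flags listed in the usage text. The function name `build_readdb_dump_cmd` and the default `bin/nutch` path are assumptions made for illustration; they are not part of Nutch.

```python
from typing import List, Optional


def build_readdb_dump_cmd(
    out_dir: str,
    crawl_id: Optional[str] = None,
    regex: Optional[str] = None,
    content: bool = False,
    headers: bool = False,
    links: bool = False,
    text: bool = False,
    nutch_bin: str = "bin/nutch",  # assumed path, relative to the Nutch home directory
) -> List[str]:
    """Assemble an argument list for `nutch readdb -dump` (hypothetical helper)."""
    cmd = [nutch_bin, "readdb", "-dump", out_dir]
    if regex is not None:
        cmd += ["-regex", regex]        # filter on the URL of the webtable entry
    if crawl_id is not None:
        cmd += ["-crawlId", crawl_id]   # selects the <crawlId>_webpage table
    # Optional sections to include in the dump, in the order the usage text lists them.
    for flag, enabled in (("-content", content), ("-headers", headers),
                          ("-links", links), ("-text", text)):
        if enabled:
            cmd.append(flag)
    return cmd


cmd = build_readdb_dump_cmd("/mnt/jediael/2", crawl_id="test001", content=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Nutch is installed.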

    Example:
    (1) seed.txt contains:
    http://www.163.com

    (2) Run the following command to perform the inject:
     bin/nutch inject seed.txt -crawlId test001

    (3) Scan the table; the values are not human-readable:

    hbase(main):002:0> scan 'test001_webpage'
    ROW                                         COLUMN+CELL                                                                                                                 
     com.163.money:http/                        column=f:fi, timestamp=1423550107073, value=\x00'\x8D\x00
     com.163.money:http/                        column=f:ts, timestamp=1423550107073, value=\x00\x00\x01Kr2\xC7\xD6
     com.163.money:http/                        column=mk:_injmrk_, timestamp=1423550107073, value=y
     com.163.money:http/                        column=mk:dist, timestamp=1423550107073, value=0
     com.163.money:http/                        column=mtdt:_csh_, timestamp=1423550107073, value=?\x80\x00\x00
     com.163.money:http/                        column=s:s, timestamp=1423550107073, value=?\x80\x00\x00
    1 row(s) in 0.4090 seconds
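
The values in the scan are not random: they look like fixed-width binary numbers. As a rough sketch, the snippet below decodes the f:fi, f:ts and s:s cells from the scan above. It assumes they are a big-endian 4-byte int, 8-byte long and 4-byte float respectively. That encoding, and the mapping of these columns to fetchInterval, fetchTime and score, are inferred from the data in this example rather than taken from the Nutch documentation.

```python
import struct

# Cell values copied from the scan output, written as Python bytes literals.
fetch_interval_raw = b"\x00'\x8d\x00"         # column f:fi
fetch_time_raw = b"\x00\x00\x01Kr2\xc7\xd6"   # column f:ts
score_raw = b"?\x80\x00\x00"                  # column s:s

# Assumption: big-endian int (4 bytes), long (8 bytes) and float (4 bytes).
fetch_interval = struct.unpack(">i", fetch_interval_raw)[0]
fetch_time = struct.unpack(">q", fetch_time_raw)[0]
score = struct.unpack(">f", score_raw)[0]

print(fetch_interval)  # 2592000
print(fetch_time)      # 1423550105558
print(score)           # 1.0
```

These are the same fetchInterval, fetchTime and score values that readdb prints in step (5), which suggests the assumed encoding is right for this table.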
    


    (4) Dump the table contents to /mnt/jediael/2:
    bin/nutch readdb -dump /mnt/jediael/2 -crawlId test001 -content

    (5) Inspect the contents of /mnt/jediael/2:
    $ ll
    total 4
    -rwxrwxrwx. 1 jediael jediael 344 Feb 10 14:41 part-r-00000
    -rwxrwxrwx. 1 jediael jediael   0 Feb 10 14:41 _SUCCESS

    $ cat part-r-00000
    http://money.163.com/   key:    com.163.money:http/
    baseUrl:        null
    status: 0 (null)
    fetchTime:      1423550105558
    prevFetchTime:  0
    fetchInterval:  2592000
    retriesSinceFetch:      0
    modifiedTime:   0
    prevModifiedTime:       0
    protocolStatus: (null)
    parseStatus:    (null)
    title:  null
    score:  1.0
    marker _injmrk_ :       y
    marker dist :   0
    reprUrl:        null
    metadata _csh_ :        ?锟
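
If the dump needs further processing, each record can be turned into a dictionary. The sketch below parses a record in the layout shown above, assuming one `name: value` pair per line and a first line that holds the URL and the row key. `parse_record` and `SAMPLE` are hypothetical names used for illustration. The `metadata _csh_` line is left out of the sample because its value is raw bytes and does not print as text.

```python
import re
from datetime import datetime, timezone
from typing import Dict

# A shortened copy of the record printed by `cat part-r-00000`.
SAMPLE = """\
http://money.163.com/   key:    com.163.money:http/
baseUrl:        null
status: 0 (null)
fetchTime:      1423550105558
fetchInterval:  2592000
score:  1.0
marker _injmrk_ :       y
marker dist :   0
"""

# Field name, optional spaces, a colon, at least one space, then the value.
FIELD_RE = re.compile(r"^(.*?)\s*:\s+(.*)$")


def parse_record(text: str) -> Dict[str, str]:
    """Parse one dumped record into a dict of strings (illustrative only)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # First line: "<url>  key:  <row key>"
    url, _, key = lines[0].split()[:3]
    record = {"url": url, "key": key}
    for line in lines[1:]:
        match = FIELD_RE.match(line)
        if match:
            record[match.group(1)] = match.group(2)
    return record


record = parse_record(SAMPLE)

# fetchTime is in milliseconds since the Unix epoch; fetchInterval is in seconds.
fetched_at = datetime.fromtimestamp(int(record["fetchTime"]) // 1000, tz=timezone.utc)
interval_days = int(record["fetchInterval"]) // 86400

print(record["key"])           # com.163.money:http/
print(fetched_at.isoformat())  # 2015-02-10T06:35:05+00:00
print(interval_days)           # 30
```

The timestamp converts to 14:35 in UTC+8, a few minutes before the dump files in step (5) were written.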







    Copyright notice: this is an original article by the author and may not be reproduced without the author's permission.

  • Original article: https://www.cnblogs.com/lujinhong2/p/4637230.html