zoukankan      html  css  js  c++  java
  • HBase 数据迁移实战案例

                  HBase 数据迁移实战案例

                                         作者:尹正杰

    版权声明:原创作品,谢绝转载!否则将追究法律责任。

     

      通过HBase的相关JavaAPI,我们可以实现伴随HBase操作的MapReduce过程,比如使用MapReduce将数据从本地文件系统导入到HBase的表中,比如我们从HBase中读取一些原始数据后使用MapReduce做数据分析。

     

    一.官方HBase-MapReduce案例 

      博主推荐阅读:
        http://hbase.apache.org/2.2/book.html#mapreduce

    1>.统计"yinzhengjie2020:teacher"表中有多少行数据(需要同时启动Hadoop,Hbase集群)

    [root@hadoop101.yinzhengjie.org.cn ~]# yarn jar /yinzhengjie/softwares/hbase/lib/hbase-server-2.2.4.jar org.apache.hadoop.hbase.mapreduce.RowCounter yinzhengjie2020:teacher

    2>.使用MapReduce将本地数据导入到HBase

    [root@hadoop101.yinzhengjie.org.cn ~]# vim fruit.tsv
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# cat fruit.tsv
    1001    Apple    Red
    1002    Pear    Yellow
    1003    Pineapple    Yellow
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# vim fruit.tsv                  #创建测试文件
    [root@hadoop101.yinzhengjie.org.cn ~]# hdfs dfs -mkdir /input_fruit/
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# ll
    total 4
    -rw-r--r-- 1 root root 54 May 29 18:50 fruit.tsv
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# hdfs dfs -put fruit.tsv /input_fruit/
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# hdfs dfs -ls /input_fruit/
    Found 1 items
    -rw-r--r--   3 root supergroup         54 2020-05-29 18:55 /input_fruit/fruit.tsv
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# hdfs dfs -put fruit.tsv /input_fruit/    #将测试文件上传到HDFS上
    [root@hadoop101.yinzhengjie.org.cn ~]# yarn jar /yinzhengjie/softwares/hbase/lib/hbase-server-2.2.4.jar org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:color fruit hdfs://hadoop106.yinzhengjie.org.cn:9000/input_fruit

    二.自定义HBase-MapReduce案例

    1>.查看Hadoop的mapreduce执行任务所需要HBase的依赖包(也可以说是类路径)

    [root@hadoop101.yinzhengjie.org.cn ~]# hbase mapredcp
    /yinzhengjie/softwares/hbase/bin/../lib/shaded-clients/hbase-shaded-mapreduce-2.2.4.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/audience-ann
    otations-0.5.0.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/commons-logging-1.2.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/findbugs-annotations-1.3.9-1.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/log4j-1.2.17.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/slf4j-api-1.7.25.jar[root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# hbase mapredcp

     

    2>.配置环境变量

    [root@hadoop101.yinzhengjie.org.cn ~]# cat /etc/profile.d/hbase.sh 
    #Add ${HBASE_HOME} by yinzhengjie
    HBASE_HOME=/yinzhengjie/softwares/hbase
    PATH=$PATH:${HBASE_HOME}/bin
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# cat /etc/profile.d/hbase.sh
    [root@hadoop101.yinzhengjie.org.cn ~]# cat /yinzhengjie/softwares/ha/etc/hadoop/hadoop-env.sh  
    #!/bin/bash
    # Licensed to the Apache Software Foundation (ASF) under one
    # or more contributor license agreements.  See the NOTICE file
    # distributed with this work for additional information
    # regarding copyright ownership.  The ASF licenses this file
    # to you under the Apache License, Version 2.0 (the
    # "License"); you may not use this file except in compliance
    # with the License.  You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    # Set Hadoop-specific environment variables here.
    
    # The only required environment variable is JAVA_HOME.  All others are
    # optional.  When running a distributed configuration it is best to
    # set JAVA_HOME in this file, so that it is correctly defined on
    # remote nodes.
    
    # The java implementation to use.
    #export JAVA_HOME=${JAVA_HOME}
    export JAVA_HOME=/yinzhengjie/softwares/jdk1.8.0_201
    
    # The jsvc implementation to use. Jsvc is required to run secure datanodes
    # that bind to privileged ports to provide authentication of data transfer
    # protocol.  Jsvc is not required if SASL is configured for authentication of
    # data transfer protocol using non-privileged ports.
    #export JSVC_HOME=${JSVC_HOME}
    
    export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
    
    # Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
    for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
      if [ "$HADOOP_CLASSPATH" ]; then
        export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
      else
        export HADOOP_CLASSPATH=$f
      fi
    done
    
    #Add by yinzhengjie
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/yinzhengjie/softwares/hbase/lib/*
    
    # The maximum amount of heap to use, in MB. Default is 1000.
    #export HADOOP_HEAPSIZE=
    #export HADOOP_NAMENODE_INIT_HEAPSIZE=""
    
    # Enable extra debugging of Hadoop's JAAS binding, used to set up
    # Kerberos security.
    # export HADOOP_JAAS_DEBUG=true
    
    # Extra Java runtime options.  Empty by default.
    # For Kerberos debugging, an extended option set logs more invormation
    # export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug"
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
    
    # Command specific options appended to HADOOP_OPTS when specified
    export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
    export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
    
    export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
    
    export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
    export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
    
    # The following applies to multiple commands (fs, dfs, fsck, distcp etc)
    export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS"
    # set heap args when HADOOP_HEAPSIZE is empty
    if [ "$HADOOP_HEAPSIZE" = "" ]; then
      export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
    fi
    #HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"
    
    # On secure datanodes, user to run the datanode as after dropping privileges.
    # This **MUST** be uncommented to enable secure HDFS if using privileged ports
    # to provide authentication of data transfer protocol.  This **MUST NOT** be
    # defined if SASL is configured for authentication of data transfer protocol
    # using non-privileged ports.
    export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
    
    # Where log files are stored.  $HADOOP_HOME/logs by default.
    #export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
    
    # Where log files are stored in the secure data environment.
    #export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
    
    ###
    # HDFS Mover specific parameters
    ###
    # Specify the JVM options to be used when starting the HDFS Mover.
    # These options will be appended to the options specified as HADOOP_OPTS
    # and therefore may override any similar flags set in HADOOP_OPTS
    #
    # export HADOOP_MOVER_OPTS=""
    
    ###
    # Router-based HDFS Federation specific parameters
    # Specify the JVM options to be used when starting the RBF Routers.
    # These options will be appended to the options specified as HADOOP_OPTS
    # and therefore may override any similar flags set in HADOOP_OPTS
    #
    # export HADOOP_DFSROUTER_OPTS=""
    ###
    
    ###
    # Advanced Users Only!
    ###
    
    # The directory where pid files are stored. /tmp by default.
    # NOTE: this should be set to a directory that can only be written to by 
    #       the user that will run the hadoop daemons.  Otherwise there is the
    #       potential for a symlink attack.
    #export HADOOP_PID_DIR=${HADOOP_PID_DIR}
    export HADOOP_PID_DIR=/yinzhengjie/softwares/hadoop-2.10.0/pid
    export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
    
    # A string representing this instance of hadoop. $USER by default.
    export HADOOP_IDENT_STRING=$USER
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# cat /yinzhengjie/softwares/ha/etc/hadoop/hadoop-env.sh

    3>.查看原表数据

    hbase(main):001:0> list
    TABLE                                                                                                                                                                                                                                                                         
    engineer                                                                                                                                                                                                                                                                      
    teacher                                                                                                                                                                                                                                                                       
    yinzhengjie2020:teacher                                                                                                                                                                                                                                                       
    3 row(s)
    Took 0.3388 seconds                                                                                                                                                                                                                                                           
    => ["engineer", "teacher", "yinzhengjie2020:teacher"]
    hbase(main):002:0>
    hbase(main):001:0> list
    hbase(main):002:0> describe 'yinzhengjie2020:teacher'
    Table yinzhengjie2020:teacher is ENABLED                                                                                                                                                                                                                                      
    yinzhengjie2020:teacher                                                                                                                                                                                                                                                       
    COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
    {NAME => 'synopsis', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMF
    ILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                               
    
    1 row(s)
    
    QUOTAS                                                                                                                                                                                                                                                                        
    0 row(s)
    Took 0.2182 seconds                                                                                                                                                                                                                                                           
    hbase(main):003:0>
    hbase(main):002:0> describe 'yinzhengjie2020:teacher'            #查看"yinzhengjie2020:teacher"表的结构
    hbase(main):003:0> scan 'yinzhengjie2020:teacher'
    ROW                                                                  COLUMN+CELL                                                                                                                                                                                              
     100001                                                              column=synopsis:name, timestamp=1590715486602, value=JasonYin                                                                                                                                            
     10010                                                               column=synopsis:name, timestamp=1590726110051, value=xE5xB0xB9xE6xADxA3xE6x9DxB0                                                                                                                
    2 row(s)
    Took 0.0478 seconds                                                                                                                                                                                                                                                           
    hbase(main):004:0> 
    hbase(main):003:0> scan 'yinzhengjie2020:teacher'

    4>.新建目标表

    hbase(main):005:0> create 'yinzhengjie2020:student','synopsis'
    Created table yinzhengjie2020:student
    Took 1.3685 seconds                                                                                                                                                                                                                                                           
    => Hbase::Table - yinzhengjie2020:student
    hbase(main):006:0>
    hbase(main):005:0> create 'yinzhengjie2020:student','synopsis'
    hbase(main):006:0> list
    TABLE                                                                                                                                                                                                                                                                         
    engineer                                                                                                                                                                                                                                                                      
    teacher                                                                                                                                                                                                                                                                       
    yinzhengjie2020:student                                                                                                                                                                                                                                                       
    yinzhengjie2020:teacher                                                                                                                                                                                                                                                       
    4 row(s)
    Took 0.0091 seconds                                                                                                                                                                                                                                                           
    => ["engineer", "teacher", "yinzhengjie2020:student", "yinzhengjie2020:teacher"]
    hbase(main):007:0> 
    hbase(main):006:0> list
    hbase(main):007:0> describe 'yinzhengjie2020:student'
    Table yinzhengjie2020:student is ENABLED                                                                                                                                                                                                                                      
    yinzhengjie2020:student                                                                                                                                                                                                                                                       
    COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
    {NAME => 'synopsis', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMF
    ILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                               
    
    1 row(s)
    
    QUOTAS                                                                                                                                                                                                                                                                        
    0 row(s)
    Took 0.0651 seconds                                                                                                                                                                                                                                                           
    hbase(main):008:0> 
    hbase(main):007:0> describe 'yinzhengjie2020:student'            #创建另外一张表其结构与"yinzhengjie2020:teacher"一样
    hbase(main):008:0> scan 'yinzhengjie2020:student'
    ROW                                                                  COLUMN+CELL                                                                                                                                                                                              
    0 row(s)
    Took 0.0765 seconds                                                                                                                                                                                                                                                           
    hbase(main):009:0> 
    hbase(main):008:0> scan 'yinzhengjie2020:student'

    5>.编写MapReduce程序实现将"yinzhengjie2020:teacher"表中的数据迁移至"yinzhengjie2020:student"表中

  • 相关阅读:
    远程过程调用RPC
    CAP原理
    2021/03/08阿里在线笔试问题总结
    水容器问题
    rand5生成rand3和rand7
    二维数组查找K(Go语言)
    判别IP为IPV4或者IPV6 (Go语言)
    路径总和(Go)
    合并K个升序链表(Go)
    delphi idhttp提交网址包含中文
  • 原文地址:https://www.cnblogs.com/yinzhengjie2020/p/12271835.html
Copyright © 2011-2022 走看看