  • Migrating Data Across Hadoop Clusters (Compiled Edition)

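    1. What is DistCp

      DistCp (distributed copy) is a tool for large-scale copying within and between clusters. It uses Map/Reduce to implement file distribution, error handling and recovery, and report generation. It takes a list of files and directories as the input to the map tasks, and each task copies a portion of the files in the source list. Because it is built on Map/Reduce, the tool has some particularities in both its semantics and its execution.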

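    1.1 Notes on using DistCp

      1. DistCp tries to split the data to be copied evenly, so that each map copies roughly the same amount. However, because a file is the smallest unit of copying, configuring more simultaneous copiers (i.e. maps) does not necessarily increase the number of copies actually running in parallel or the overall throughput.

      2. If the -m option is not used, DistCp tries to set the number of maps at scheduling time to min(total_bytes / bytes.per.map, 20 * num_task_trackers), where bytes.per.map defaults to 256 MB.

      3. For long-running or regularly scheduled jobs, it is recommended to tune the number of maps according to the sizes of the source and target clusters, the amount of data to copy, and the available bandwidth.

      4. For copies between different Hadoop versions, use HftpFileSystem. It is a read-only file system, so DistCp must run on the target cluster (more precisely, on TaskTrackers that can write to the target cluster). The source is given as hftp://<dfs.http.address>/<path> (by default, dfs.http.address is <namenode>:50070).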
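
      The default in item 2 and the source format in item 4 can be worked through with a few lines of shell. This is only a sketch: default_map_count and hftp_source_uri are hypothetical helper names (not Hadoop commands), the formula is the one quoted in item 2 using integer division, and the host and path in the last call are borrowed from the test case purely as an illustration.

```shell
# Default number of maps when -m is not given, per the formula in item 2:
#   min(total_bytes / bytes.per.map, 20 * num_task_trackers)
# default_map_count is a hypothetical helper, not a Hadoop command.
BYTES_PER_MAP=$((256 * 1024 * 1024))   # 256 MB default for bytes.per.map

default_map_count() {
    # $1 = total bytes to copy, $2 = number of TaskTrackers
    by_size=$(( $1 / BYTES_PER_MAP ))
    by_nodes=$(( 20 * $2 ))
    if [ "$by_size" -lt "$by_nodes" ]; then
        echo "$by_size"
    else
        echo "$by_nodes"
    fi
}

# Source URI for a cross-version copy over HFTP (item 4).
# $1 = NameNode host, $2 = absolute path; 50070 is the default
# dfs.http.address port mentioned in item 4 (an assumption for any real cluster).
hftp_source_uri() {
    echo "hftp://$1:50070$2"
}

# 100 GB on 4 TaskTrackers: 400 by size, 80 by nodes, so this prints 80
default_map_count $((100 * 1024 * 1024 * 1024)) 4
# 1 GB on 4 TaskTrackers: 4 by size, 80 by nodes, so this prints 4
default_map_count $((1024 * 1024 * 1024)) 4
# prints hftp://calculation101:50070/test/2018/10/23
hftp_source_uri calculation101 /test/2018/10/23
```

      The two calls show both regimes: for large copies the cluster size caps the map count, while for small copies the data volume does.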

    2. Using the Hadoop DistCp API (command-line options)

    [root@node105 ~]# hadoop distcp
    usage: distcp OPTIONS [source_path...] <target_path>
                  OPTIONS
     -append                       Reuse existing data in target files and
                                   append new data to them if possible
     -async                        Should distcp execution be blocking
     -atomic                       Commit all changes or none
     -bandwidth <arg>              Specify bandwidth per map in MB
     -blocksperchunk <arg>         If set to a positive value, fileswith more
                                   blocks than this value will be split into
                                   chunks of <blocksperchunk> blocks to be
                                   transferred in parallel, and reassembled on
                                   the destination. By default,
                                   <blocksperchunk> is 0 and the files will be
                                   transmitted in their entirety without
                                   splitting. This switch is only applicable
                                   when the source file system implements
                                   getBlockLocations method and the target
                                   file system implements concat method
     -copybuffersize <arg>         Size of the copy buffer to use. By default
                                   <copybuffersize> is 8192B.
     -delete                       Delete from target, files missing in source
     -diff <arg>                   Use snapshot diff report to identify the
                                   difference between source and target
     -f <arg>                      List of files that need to be copied
     -filelimit <arg>              (Deprecated!) Limit number of files copied
                                   to <= n
     -filters <arg>                The path to a file containing a list of
                                   strings for paths to be excluded from the
                                   copy.
     -i                            Ignore failures during copy
     -log <arg>                    Folder on DFS where distcp execution logs
                                   are saved
     -m <arg>                      Max number of concurrent maps to use for
                                   copy
     -mapredSslConf <arg>          Configuration for ssl config file, to use
                                   with hftps://. Must be in the classpath.
     -numListstatusThreads <arg>   Number of threads to use for building file
                                   listing (max 40).
     -overwrite                    Choose to overwrite target files
                                   unconditionally, even if they exist.
     -p <arg>                      preserve status (rbugpcaxt)(replication,
                                   block-size, user, group, permission,
                                   checksum-type, ACL, XATTR, timestamps). If
                                   -p is specified with no <arg>, then
                                   preserves replication, block size, user,
                                   group, permission, checksum type and
                                   timestamps. raw.* xattrs are preserved when
                                   both the source and destination paths are
                                   in the /.reserved/raw hierarchy (HDFS
                                   only). raw.* xattrpreservation is
                                   independent of the -p flag. Refer to the
                                   DistCp documentation for more details.
     -rdiff <arg>                  Use target snapshot diff report to identify
                                   changes made on target
     -sizelimit <arg>              (Deprecated!) Limit number of files copied
                                   to <= n bytes
     -skipcrccheck                 Whether to skip CRC checks between source
                                   and target paths.
     -strategy <arg>               Copy strategy to use. Default is dividing
                                   work based on file sizes
     -tmp <arg>                    Intermediate work path to be used for
                                   atomic commit
     -update                       Update target, copying only missingfiles or
                                   directories
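
    Several of the options listed above are commonly combined in a single invocation. The sketch below only assembles and prints such a command line so it can be reviewed before being run. build_distcp_cmd is a hypothetical helper, the values 20 and 50 are arbitrary examples rather than recommendations, and the cluster addresses are the ones that appear in the test case of section 3.

```shell
# build_distcp_cmd is a hypothetical helper that prints a distcp command
# line instead of running it, so the options can be checked first.
# $1 = max concurrent maps (-m)
# $2 = bandwidth per map in MB (-bandwidth)
# $3 = source URI, $4 = target URI
build_distcp_cmd() {
    # -update : update the target, copying only missing files or directories
    # -p      : with no argument, preserve replication, block size, user,
    #           group, permission, checksum type and timestamps
    echo "hadoop distcp -update -p -m $1 -bandwidth $2 $3 $4"
}

build_distcp_cmd 20 50 \
    hdfs://calculation101:8020/test/2018/10/23 \
    hdfs://node105:8020/yangjianqiu/data
```

    Printing the command first keeps a typo in a path or option from starting a long-running job; once it looks right, the printed line can be pasted into the shell.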

    3. Test case

      1. Check the files that are going to be migrated:

    [root@calculation101 ~]# hdfs dfs -du -h /test/2018/10/

      2. Create a test directory on the new cluster:

    [hdfs@node105 root]$ 
    [hdfs@node105 root]$ hdfs dfs -mkdir -p /yangjianqiu/data/
    [hdfs@node105 root]$ 
    [hdfs@node105 root]$ hdfs dfs -chown -R root:root  /yangjianqiu/data/  
    [hdfs@node105 root]$ 
    [hdfs@node105 root]$ exit 
    exit
    [root@node105 ~]# 
    [root@node105 ~]# hdfs dfs -ls /yangjianqiu
    Found 1 items
    drwxr-xr-x   - root root          0 2018-10-29 03:29 /yangjianqiu/data

      3. Start the migration, recording a log and the time the migration takes:

    [root@node105 ~]# mkdir /yangjianqiu
    [root@node105 ~]# 
    [root@node105 ~]# 
    [root@node105 ~]# nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 & 
    [1] 11125
    [root@node105 ~]#
    [root@node105 ~]# jobs
    [1]+ Running nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 &
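
    Because time writes its report to stderr and the command redirects stderr into the log, the elapsed time ends up in /yangjianqiu/distcp.log once the job finishes. A minimal sketch for pulling it out follows. It assumes that time resolves to GNU time here (nohup runs an external program, not the shell keyword) and that it uses its default output format of the form "...user ...system ...elapsed ..."; distcp_elapsed is a hypothetical helper name.

```shell
# distcp_elapsed is a hypothetical helper: print the wall-clock value that
# GNU time appended to a log, i.e. the field immediately before "elapsed".
# If the log contains several such lines, the last one is used.
# $1 = path to the log file
distcp_elapsed() {
    sed -n 's/.* \([0-9:.]*\)elapsed.*/\1/p' "$1" | tail -n 1
}

# Usage, once `jobs` no longer lists the copy as Running:
#   distcp_elapsed /yangjianqiu/distcp.log
```

    If time on the system prints a different format, the sed pattern needs to be adjusted accordingly.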

    4. Calling the DistCp interface from an application

    Summary


  • Original article: https://www.cnblogs.com/swordfall/p/11846683.html