  Apache Sqoop: Transferring Data Between Hadoop and Relational Databases


    Introduction:

    Apache Sqoop is a tool for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

    I. Install MySQL and import test data

    1. MySQL installation guide: http://www.cnblogs.com/wangxiaoqiangs/p/5336048.html

    2. Import test data

    mysql > create database tmpdb;
    mysql > use tmpdb;
    mysql > system ls
    tmp_recommend_hot.sql
    mysql > source tmp_recommend_hot.sql

    # Create the test database and load the test table

    3. Grant privileges to the hadoop user

    mysql > grant all on *.* to hadoop@'%' identified by 'hadoop';
    mysql > flush privileges;
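
    A quick sanity check before moving on is to log in with the new account from another node (a minimal check, assuming the MySQL client is installed there and that master.hadoop, the host used in the JDBC URLs below, is where MySQL runs):

    shell > mysql -h master.hadoop -u hadoop -phadoop -e 'show databases;'

    # tmpdb should appear in the output; if the login fails, recheck the grant and that MySQL accepts remote connections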

    II. Install Sqoop

    1. Download, unpack, and set environment variables

    shell > cd /usr/local/src
    shell > wget http://apache.fayea.com/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
    shell > tar zxf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C ../
    shell > cd /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
    shell > vim /etc/profile
    export PATH=$PATH:/usr/local/mysql/bin:/usr/local/hadoop-2.8.0/bin:/usr/local/apache-hive-2.1.1-bin/bin:/usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/bin
    shell > source /etc/profile
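
    To confirm the new PATH entry took effect, check that the sqoop command resolves and runs (a minimal check; warnings about HBASE_HOME, HCAT_HOME and the like being unset can be ignored for this setup):

    shell > which sqoop
    # Should print /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/bin/sqoop
    shell > sqoop version
    # Should report Sqoop 1.4.6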

    2. Edit sqoop-env.sh

    shell > cp conf/sqoop-env-template.sh conf/sqoop-env.sh
    shell > vim conf/sqoop-env.sh
    # Point to the Hadoop installation directories
    export HADOOP_COMMON_HOME=/usr/local/hadoop-2.8.0
    export HADOOP_MAPRED_HOME=/usr/local/hadoop-2.8.0

    3. Copy the MySQL JDBC connector

    shell > cp /usr/local/src/mysql-connector-java-5.1.41/mysql-connector-java-5.1.41-bin.jar lib/
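
    It is worth confirming the driver jar actually landed in Sqoop's lib directory, since Sqoop loads JDBC drivers from there at runtime:

    shell > ls lib/ | grep mysql-connector
    mysql-connector-java-5.1.41-bin.jar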

    4. Test the connection

    shell > sqoop list-databases --connect jdbc:mysql://master.hadoop:3306 \
    > --username hadoop --password hadoop
    
    information_schema
    hive_meta
    mysql
    performance_schema
    test
    tmpdb

    # Connection successful
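
    Besides list-databases, you can point Sqoop at the test database and list its tables, which exercises the same JDBC path the imports below rely on:

    shell > sqoop list-tables --connect jdbc:mysql://master.hadoop:3306/tmpdb \
    > --username hadoop --password hadoop

    # tmp_recommend_hot should be listed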

    III. MySQL To HDFS To Hive

    1. Create a data directory on HDFS

    hadoop shell > hdfs dfs -mkdir /user/root
    hadoop shell > hdfs dfs -chown root /user/root

    2. Import the data into HDFS

    shell > sqoop import --connect jdbc:mysql://master.hadoop:3306/tmpdb \
    > --username hadoop --password hadoop \
    > --table tmp_recommend_hot --warehouse-dir=/user/root
    
    hadoop shell > hdfs dfs -ls /user/root
    Found 1 items
    drwxr-xr-x   - root supergroup          0 2017-05-26 18:47 /user/root/tmp_recommend_hot
    
    hadoop shell > hdfs dfs -ls /user/root/tmp_recommend_hot
    Found 5 items
    -rw-r--r--   3 root supergroup          0 2017-05-26 18:47 /user/root/tmp_recommend_hot/_SUCCESS
    -rw-r--r--   3 root supergroup      17426 2017-05-26 18:47 /user/root/tmp_recommend_hot/part-m-00000
    -rw-r--r--   3 root supergroup      18188 2017-05-26 18:47 /user/root/tmp_recommend_hot/part-m-00001
    -rw-r--r--   3 root supergroup      18719 2017-05-26 18:47 /user/root/tmp_recommend_hot/part-m-00002
    -rw-r--r--   3 root supergroup      18430 2017-05-26 18:47 /user/root/tmp_recommend_hot/part-m-00003

    # By default Sqoop launches 4 map tasks, hence the 4 output files
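
    The number of output files follows the number of map tasks, which can be tuned; a sketch of the same import with 2 map tasks split on the id column (-m and --split-by are standard import options; clear the target directory first, see the troubleshooting section below):

    shell > sqoop import --connect jdbc:mysql://master.hadoop:3306/tmpdb \
    > --username hadoop --password hadoop \
    > --table tmp_recommend_hot --warehouse-dir=/user/root \
    > --split-by id -m 2

    # Should produce part-m-00000 and part-m-00001 only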

    3. Load the data from HDFS into Hive

    shell > beeline -u jdbc:hive2://master.hadoop:10000 -n hadoop
    
    0: jdbc:hive2://master.hadoop:10000> create database tmpdb;
    0: jdbc:hive2://master.hadoop:10000> use tmpdb;
    0: jdbc:hive2://master.hadoop:10000> dfs -cat /user/root/tmp_recommend_hot/*;
    +--------------------------------------------------------------------------------------------------------------------------------+--+
    | 401,1859110,资讯,2017,《人民的名义》热播原著小说杭州卖断货,http://pic2.qiyipic.com/image/20170410/0f/2a/v_112112674_m_601.jpg,934,null             |
    | 402,1859123,资讯,2017,临汾旅游景区体制机制改革再出招,http://pic6.qiyipic.com/image/20170410/a5/bb/v_112112690_m_601.jpg,420,null                |
    | 403,1291853,电影,2016,魔兽,http://imgbftv.b0.upaiyun.com/upload/origin/8/147598326883101.jpg,326,null                              |
    | 404,1838847,综艺,2017,奇葩说第4季,http://imgbftv.b0.upaiyun.com/upload/origin/2/149176084218704.jpg,579,null                          |
    | 405,14614,电视剧,2014,神雕侠侣,http://imgbftv.b0.upaiyun.com/upload/origin/6/143945924668370.jpg,260,null                             |
    | 406,387443,电影,2005,金刚2005,http://imgbftv.b0.upaiyun.com/upload/origin/3/148497964349088.jpg,2563,null                          |
    | 407,1861695,资讯,2017,追踪:夜半横躺马路中央男子遭碾压致死,http://pic6.qiyipic.com/image/20170411/3e/66/v_112119228_m_601.jpg,806,null             |
    | 408,1841442,综艺,2017,天生是优我,http://imgbftv.b0.upaiyun.com/upload/origin/1/149182951136923.jpg,1094,null

    # A quick look at the raw data shows it is comma-delimited text

    0: jdbc:hive2://master.hadoop:10000> create external table hot_film
    . . . . . . . . . . . . . . . . . .> (id int, vid int, type string, year int, name string, image string, views int, dtime int)
    . . . . . . . . . . . . . . . . . .> row format delimited fields terminated by ','
    . . . . . . . . . . . . . . . . . .> location 'hdfs:///user/root/tmp_recommend_hot';

    # This creates an external table: there is no data under /user/hive/warehouse/tmpdb.db, the data stays in its original location /user/root/tmp_recommend_hot
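
    This can be confirmed from beeline: the table's metadata points at the original HDFS path, and nothing is copied into the Hive warehouse (a quick check, assuming the default warehouse location):

    0: jdbc:hive2://master.hadoop:10000> show create table hot_film;
    # The DDL shows CREATE EXTERNAL TABLE ... LOCATION pointing at /user/root/tmp_recommend_hot
    0: jdbc:hive2://master.hadoop:10000> dfs -ls /user/hive/warehouse/tmpdb.db;
    # No hot_film directory here -- the external table only references the data in place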

    0: jdbc:hive2://master.hadoop:10000> select id, vid, name, views from hot_film limit 3;
    +-----+----------+-----------+--------+--+
    | id  |   vid    |   name    | views  |
    +-----+----------+-----------+--------+--+
    | 1   | 1544131  | 三生三世十里桃花  | 1003   |
    | 2   | 1774150  | 情圣        | 630    |
    | 3   | 1774815  | 因为遇见你     | 548    |
    +-----+----------+-----------+--------+--+

    # The data looks right, and a count(id) comparison against MySQL shows no rows are missing!
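
    The count comparison mentioned above is simply the same aggregate run on both sides; the two numbers should match (the exact count depends on the test data):

    0: jdbc:hive2://master.hadoop:10000> select count(id) from hot_film;

    mysql > select count(id) from tmpdb.tmp_recommend_hot;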

    IV. MySQL To Hive

    1. Create the Hive database

    shell > beeline -u jdbc:hive2://master.hadoop:10000 -n hadoop
    
    0: jdbc:hive2://master.hadoop:10000> create database tmpdb2;

    2. Import the data

    shell > sqoop import --connect jdbc:mysql://master.hadoop:3306/tmpdb \
    > --username hadoop --password hadoop \
    > --fields-terminated-by '\t' --table tmp_recommend_hot \
    > --hive-import --hive-database tmpdb2 --hive-table hot_film

    # import / import-all-tables: import a single table / all tables
    # --fields-terminated-by: field delimiter ('\t' here, i.e. tab-separated)
    # --table: the source table to import
    # --hive-import: import into a Hive table
    # --hive-database: the target Hive database
    # --hive-table: the target table name; defaults to the source table name if omitted
    # --hive-overwrite: overwrite existing data in the Hive table
    # -m: number of map tasks to launch; when the table has no primary key, specify -m 1 (a combined example follows this list)
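
    Putting several of these options together, a re-runnable version of the same import might look like the sketch below (--hive-overwrite avoids duplicate rows on repeat runs, and --delete-target-dir clears Sqoop's temporary HDFS staging directory left by a previous run; adjust to your table and cluster):

    shell > sqoop import --connect jdbc:mysql://master.hadoop:3306/tmpdb \
    > --username hadoop --password hadoop \
    > --fields-terminated-by '\t' --table tmp_recommend_hot \
    > --delete-target-dir \
    > --hive-import --hive-database tmpdb2 --hive-table hot_film \
    > --hive-overwrite -m 1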

    3. Verify the data

    0: jdbc:hive2://master.hadoop:10000> use tmpdb2;
    0: jdbc:hive2://master.hadoop:10000> show tables;
    +-----------+--+
    | tab_name  |
    +-----------+--+
    | hot_film  |
    +-----------+--+
    0: jdbc:hive2://master.hadoop:10000> select id, vid, name, views from hot_film limit 3;
    +-----+----------+-----------+--------+--+
    | id  |   vid    |   name    | views  |
    +-----+----------+-----------+--------+--+
    | 1   | 1544131  | 三生三世十里桃花  | 1003   |
    | 2   | 1774150  | 情圣        | 630    |
    | 3   | 1774815  | 因为遇见你     | 548    |
    +-----+----------+-----------+--------+--+

    # Everything checks out. This import creates a managed (internal) table, so the data is moved to the path set in the Hive configuration, /user/hive/warehouse by default
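
    To see where the data actually ended up, describe the table; for a managed table the Location should sit under the warehouse directory (default shown, your hive.metastore.warehouse.dir may differ):

    0: jdbc:hive2://master.hadoop:10000> describe formatted hot_film;
    # Look for Table Type: MANAGED_TABLE and a Location under /user/hive/warehouse/tmpdb2.db/hot_film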

    Troubleshooting:

    1. Output directory already exists

    17/05/27 14:34:15 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException:
    Output directory hdfs://master.hadoop:8020/user/root/tmp_recommend_hot already exists

    # This conflicts with the data imported in the previous walkthrough; remove it from HDFS (see below). Of course, the external table from that walkthrough will then be empty
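
    The quickest fix is to remove the stale directory before re-running the import (or add --delete-target-dir to the sqoop command so it is cleared automatically):

    hadoop shell > hdfs dfs -rm -r /user/root/tmp_recommend_hot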

    2. Insufficient permissions

    17/05/27 14:48:34 INFO hive.HiveImport: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=root, access=WRITE, inode="/user/hive/warehouse/tmpdb2.db":hadoop:supergroup:drwxrwxr-x

    # Open up permissions on the /user/hive directory: hdfs dfs -chmod -R 777 /user/hive
