  • Sqoop: Importing MySQL Data into a Hive ORC Table

    Create a Hive ORC table with Sqoop and import data into it

    sqoop import \
    --connect jdbc:mysql://localhost:3306/spider \
    --username root --password 1234qwer \
    --table org_ic_track --driver com.mysql.jdbc.Driver \
    --create-hcatalog-table \
    --hcatalog-database spider_tmp \
    --hcatalog-table org_ic_track \
    --hcatalog-partition-keys batch \
    --hcatalog-partition-values 20190404 \
    --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
    -m 1

    View the resulting table definition

    CREATE TABLE `org_ic_track`(
    `id` int, 
    `info_id` int, 
    `company` varchar(250), 
    `company_url` varchar(250), 
    `invest_date` varchar(150), 
    `invested_company` varchar(500), 
    `invested_ratio` varchar(100), 
    `update_time` string)
    PARTITIONED BY ( 
    `batch` string)
    ROW FORMAT SERDE 
    'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
    STORED AS INPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
    OUTPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    LOCATION
    'hdfs://hadoop1:8020/home/hive/warehouse/spider_tmp.db/org_ic_track'
    TBLPROPERTIES (
    'orc.compress'='SNAPPY', 
    'transient_lastDdlTime'='1554342988')

    Import data with Sqoop into an existing Hive ORC table

    sqoop import \
    --connect jdbc:mysql://localhost:3306/spider \
    --username root --password 1234qwer \
    --table org_ic_track --driver com.mysql.jdbc.Driver \
    --hcatalog-database spider_tmp \
    --hcatalog-table org_ic_track \
    --hcatalog-partition-keys batch \
    --hcatalog-partition-values 20190405 \
    -m 1
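
    The partition values used in these commands look like dates in YYYYMMDD form. As a minimal sketch, assuming one batch per day, a wrapper script could compute the value and assemble the command as shown here. The function name build_import_cmd and the "default to today's date" behaviour are illustrative assumptions, not part of the original article. The sketch only prints the command (a dry run) and never invokes Sqoop.

```shell
# Hypothetical helper: assemble the import command for one batch.
# Dry run only: the command is printed, not executed.
build_import_cmd() {
  # Use the given batch value, or today's date as YYYYMMDD
  # (an assumption about how batch values are chosen).
  batch="${1:-$(date +%Y%m%d)}"
  cmd="sqoop import"
  cmd="$cmd --connect jdbc:mysql://localhost:3306/spider"
  cmd="$cmd --username root --password 1234qwer"
  cmd="$cmd --table org_ic_track --driver com.mysql.jdbc.Driver"
  cmd="$cmd --hcatalog-database spider_tmp"
  cmd="$cmd --hcatalog-table org_ic_track"
  cmd="$cmd --hcatalog-partition-keys batch"
  cmd="$cmd --hcatalog-partition-values $batch"
  cmd="$cmd -m 1"
  printf '%s\n' "$cmd"
}

build_import_cmd 20190405   # explicit batch value
build_import_cmd            # falls back to today's date
```

    Printing the command first makes it easy to review the partition value before anything is written to Hive.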

    Import the result of a query with Sqoop into an existing Hive ORC table

    sqoop import \
    --connect jdbc:mysql://localhost:3306/spider \
    --username root --password 1234qwer \
    --query "select * from org_ic_track where update_time between '2019-04-01 21:16:04' and '2019-04-01 21:16:05' and \$CONDITIONS" \
    --driver com.mysql.jdbc.Driver \
    --hcatalog-database spider_tmp \
    --hcatalog-table org_ic_track \
    --hcatalog-partition-keys batch \
    --hcatalog-partition-values 20190406 \
    -m 1
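
    Sqoop needs to receive the literal token $CONDITIONS inside a free-form query. When the query is wrapped in double quotes, the shell expands $CONDITIONS as a variable before Sqoop ever sees it, unless the dollar sign is escaped. The following plain-shell demonstration (no Sqoop required) shows the three quoting variants. The shortened query text and the variable names are illustrative only.

```shell
# CONDITIONS is normally unset in a shell; an empty value behaves the same
# way for this demonstration.
CONDITIONS=''

double_unescaped="select * from org_ic_track where $CONDITIONS"
double_escaped="select * from org_ic_track where \$CONDITIONS"
single_quoted='select * from org_ic_track where $CONDITIONS'

printf '%s\n' "$double_unescaped"   # the token has been expanded away
printf '%s\n' "$double_escaped"     # the literal $CONDITIONS survives
printf '%s\n' "$single_quoted"      # the literal $CONDITIONS survives
```

    Single quotes are the simpler choice when the query contains no single-quoted literals. The query in this article does contain them (the timestamps), so double quotes with \$CONDITIONS fit better there.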

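    Parameter reference

    connect    JDBC connection string
    username    User name for JDBC authentication
    password    Password for JDBC authentication
    table    Name of the source table to import
    driver    JDBC driver class to use
    create-hcatalog-table    Create the target table; without this option no table is created. Note that the job fails with an error if the table already exists
    hcatalog-database    Target database
    hcatalog-table    Target table name
    hcatalog-storage-stanza    Storage format; the value is appended to the generated create table statement. Default: stored as rcfile
    hcatalog-partition-keys    Partition columns; separate multiple columns with commas (an enhanced version of hive-partition-key)
    hcatalog-partition-values    Partition values; separate multiple values with commas (an enhanced version of hive-partition-value)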
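
    Since both partition options take comma-separated lists, each key presumably pairs with the value in the same position, so the two lists should have the same length. The following is a minimal sketch of a pre-flight check a wrapper script might run. The helper names count_fields and check_partition_spec, and the second partition key source, are hypothetical and used only for illustration.

```shell
# Count the comma-separated fields in $1 (an empty string counts as 0).
count_fields() {
  if [ -z "$1" ]; then
    echo 0
  else
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
  fi
}

# Print the two options if keys and values line up; fail otherwise.
check_partition_spec() {
  keys="$1"
  values="$2"
  if [ "$(count_fields "$keys")" -eq "$(count_fields "$values")" ]; then
    echo "--hcatalog-partition-keys $keys --hcatalog-partition-values $values"
  else
    echo "mismatch: keys '$keys' vs values '$values'" >&2
    return 1
  fi
}

check_partition_spec batch 20190404
check_partition_spec batch,source 20190404,mysql   # 'source' is a made-up second key
```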

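    Note: unless column types are specified, MySQL varchar columns are also created as varchar in Hive, and varchar columns cause several problems in Hive:

      1. Long text and text containing special characters are not extracted completely

      2. Operating on varchar columns of an ORC table in Hive produces garbled text

    Fix: specify the column types when extracting the data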

    --map-column-hive company=String,company_url=String
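
    The table definition shown earlier has five varchar columns, while the option above maps only two of them. As a minimal sketch, the mapping value can be generated from a list of column names instead of being typed by hand. The helper name build_string_mapping is hypothetical, and mapping all five columns is an assumption rather than something the original article does.

```shell
# Join the given column names into "col1=String,col2=String,...".
build_string_mapping() {
  mapping=""
  for col in "$@"; do
    # Add a comma separator only when the mapping is already non-empty.
    mapping="${mapping:+$mapping,}$col=String"
  done
  printf '%s\n' "$mapping"
}

# The varchar columns of org_ic_track, taken from the table definition.
mapping="$(build_string_mapping company company_url invest_date invested_company invested_ratio)"
printf '%s\n' "--map-column-hive $mapping"
```

    The printed option can then be appended to any of the sqoop import commands in this article.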
  • Original article: https://www.cnblogs.com/EnzoDin/p/10653350.html