Hive Notes (Part 1)

Hive translates SQL queries into a series of MapReduce jobs that run on a Hadoop cluster.

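create table records (year string, temperature int, quality int)
row format delimited
fields terminated by '\t';

ROW FORMAT DELIMITED declares that each row is delimited text; a SERDE clause can be used instead to name the serializer-deserializer to use. FIELDS TERMINATED BY '\t' sets the field delimiter within a row to the tab character.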

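Option                                  Value                           Purpose
hive.metastore.warehouse.dir            /user/hive/warehouse (default)  Warehouse directory
javax.jdo.option.ConnectionURL          jdbc:mysql://host/dbname?       JDBC URL of the metastore database (MySQL example)
javax.jdo.option.ConnectionDriverName   com.mysql.jdbc.Driver           JDBC driver class (MySQL example)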

Configuration:

hive --config /user/hadoop/hive-0.9.0/conf

Specifies the configuration directory (the directory that contains hive-site.xml, not the file itself).

hive -hiveconf fs.default.name=localhost -hiveconf mapred.job.tracker=localhost:8021

Sets properties for a single session; here the session uses a pseudo-distributed cluster.

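hive> set hive.enforce.bucketing=true;

The SET command changes a setting for the current session, which is useful for adjusting Hive or MapReduce job settings for one particular query. SET with a property name but no value prints that property's current value.

set -v lists all properties and their values, including the Hadoop defaults.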

show functions;              lists the available functions

describe function length;    shows the description of a function

create external table external_table (dummy string)
location '';
load data inpath '' into table external_table;

This creates an external table. Dropping it deletes only the metadata, not the data.

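If all of the data is processed by Hive, create a managed table. If Hive and other tools work on the same dataset, create an external table.

A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table. The reverse also works: an external table (not necessarily on HDFS) can be used to export data from Hive for other applications. Another reason to use an external table is to associate several schemas with the same dataset.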

Partitions:

create table logs (ts bigint, line string)

partitioned by (dt string, country string);

load data local inpath '' into table logs

partition (dt='',country='');

select ts,line,dt

from logs

where country='';
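Each partition is stored as a nested subdirectory of the table's directory, named after the partition column values, which is why a filter on country only has to read the matching directories. The following Python sketch shows that layout under the default warehouse directory listed earlier; the helper partition_path and the sample values are hypothetical and only for illustration.

```python
import posixpath

# Default value of hive.metastore.warehouse.dir
WAREHOUSE_DIR = "/user/hive/warehouse"

def partition_path(table, partition_spec):
    """Build the directory of one partition.

    partition_spec is an ordered list of (column, value) pairs, in the
    order given in the PARTITIONED BY clause.
    """
    parts = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(WAREHOUSE_DIR, table, *parts)

# Hypothetical partition values for the logs table
path = partition_path("logs", [("dt", "2001-01-01"), ("country", "GB")])
print(path)  # /user/hive/warehouse/logs/dt=2001-01-01/country=GB
```

The partition columns dt and country do not appear in the data files themselves; their values come from the directory names.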

Buckets:

A join of two tables that are bucketed on the same columns can be implemented efficiently as a map-side join.

create table bucketed_users (id int, name string)

clustered by (id) into 4 buckets;

Hive hashes the value and takes the remainder after dividing by the number of buckets, so rows with the same id always land in the same bucket.
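The bucket assignment can be sketched in Python as follows. This is a rough model rather than Hive's actual code: it assumes that the hash of an INT column is the integer itself and that the hash is masked to a non-negative value before the modulo. The function name bucket_for and the sample ids are hypothetical.

```python
INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE

def bucket_for(id_value, num_buckets):
    """Return the bucket index for an INT column value.

    Assumption: the hash of an INT is the integer itself; masking with
    INT_MAX keeps the result non-negative before taking the remainder.
    """
    return (id_value & INT_MAX) % num_buckets

for user_id in [0, 2, 3, 4]:
    print(user_id, "-> bucket", bucket_for(user_id, 4))
```

With 4 buckets, ids 0 and 4 map to the same bucket because both leave remainder 0.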

The data within a bucket can also be sorted:

create table bucketed_users (id int,name string)

clustered by (id) sorted by (id asc) into 4 buckets;

The bucketing process in detail:

First, enable bucketing enforcement; otherwise Hive does not bucket the data when it is inserted:

hive> set hive.enforce.bucketing=true;

create table bucketed_users (id int,name string)

clustered by (id) sorted by (id asc) into 4 buckets;

Create a users table:

create table users (id int, name string)

row format delimited

fields terminated by '\t';

Load data into the users table:

load data local inpath '/home/hadoop/users' into table users;

Insert the data from users into bucketed_users (a dynamic insert, covered later):

insert overwrite table bucketed_users

select * from users;

This runs a MapReduce job. Afterwards, inspect the buckets:

dfs -ls /user/hive/warehouse/bucketed_users;

Four files appear, one per bucket:

000000_0

000001_0

000002_0

000003_0

Use dfs -cat to view the contents of each file.
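To make the mapping from rows to the four files concrete, here is a small Python simulation of the insert. It is only a sketch: the sample rows are made up, and it assumes non-negative INT ids whose hash is the id itself. The file names follow the listing above.

```python
def bucket_files(rows, num_buckets):
    """Group (id, name) rows into one simulated file per bucket."""
    # One file per bucket, named like the listing above: 000000_0, 000001_0, ...
    files = {f"{bucket:06d}_0": [] for bucket in range(num_buckets)}
    for user_id, name in rows:
        bucket = user_id % num_buckets  # assumes non-negative INT ids
        files[f"{bucket:06d}_0"].append((user_id, name))
    for contents in files.values():
        contents.sort()  # models SORTED BY (id ASC) within each bucket
    return files

# Hypothetical contents of the users table
users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]
for file_name, contents in bucket_files(users, 4).items():
    print(file_name, contents)
```

A bucket file can be empty, as 000001_0 is here, when no id hashes to that bucket.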

hive>select * from bucketed_users

　　　tablesample(bucket 1 out of 4 on id);

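The TABLESAMPLE clause samples a table. On a bucketed table the query reads only some of the buckets rather than the whole table. In contrast, sampling a table that is not bucketed scans the entire table: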

hive>select * from users

　　　tablesample(bucket 1 out of 4 on rand());
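The difference between the two queries can be modelled in Python. This is a simplified sketch, not Hive's implementation: sampling on the id column keeps the rows whose id falls into the requested bucket, while sampling on rand() is approximated by drawing a random bucket for every row, so every row has to be read. The function names and sample rows are hypothetical.

```python
import random

def tablesample_on_column(rows, x, y, key):
    """Model of TABLESAMPLE(BUCKET x OUT OF y ON col); x is 1-based."""
    return [row for row in rows if key(row) % y == x - 1]

def tablesample_on_rand(rows, x, y, rng):
    """Model of TABLESAMPLE(BUCKET x OUT OF y ON rand()).

    Each row gets a random bucket, so the whole input is scanned.
    """
    return [row for row in rows if rng.randrange(y) == x - 1]

users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]

# Bucket 1 out of 4 on id keeps the ids with remainder 0
print(tablesample_on_column(users, 1, 4, key=lambda row: row[0]))

# Roughly a quarter of the rows, different on each unseeded run
print(tablesample_on_rand(users, 1, 4, random.Random()))
```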

Storage formats:

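The SerDe (serializer-deserializer) defines the "row format" in Hive.

Serialization: when an INSERT or CTAS (CREATE TABLE ... AS SELECT) statement runs, the SerDe serializes Hive's internal representation of a row into bytes that are written to the output file.

Deserialization: when a query runs, the SerDe deserializes the bytes of a row in the file into the objects Hive uses internally to operate on that row.

The "file format" is the container for the rows. The simplest is plain text, but row-oriented and column-oriented binary formats are also available. This is what the STORED AS clause controls; the default is a text file.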

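The default field delimiter within a row is Ctrl-A.

The default collection-item delimiter is Ctrl-B (used between array elements, map key-value pairs, and struct fields).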
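As an illustration of these defaults, the Python sketch below splits one line of default-delimited text into a scalar, an array, and a map. The row layout (id int, tags array, props map) and the sample line are invented for the example. The Ctrl-C delimiter between a map key and its value is not mentioned in these notes; it is included here as Hive's default map-key delimiter.

```python
FIELD_DELIM = "\x01"       # Ctrl-A: separates fields in a row
COLLECTION_DELIM = "\x02"  # Ctrl-B: separates collection items
MAP_KEY_DELIM = "\x03"     # Ctrl-C: separates a map key from its value (not from the notes)

def parse_row(line):
    """Parse a hypothetical row of (id int, tags array<string>, props map<string,string>)."""
    raw_id, raw_tags, raw_props = line.split(FIELD_DELIM)
    tags = raw_tags.split(COLLECTION_DELIM)
    props = dict(pair.split(MAP_KEY_DELIM, 1)
                 for pair in raw_props.split(COLLECTION_DELIM))
    return int(raw_id), tags, props

line = "1\x01a\x02b\x01k1\x03v1\x02k2\x03v2"
print(parse_row(line))  # (1, ['a', 'b'], {'k1': 'v1', 'k2': 'v2'})
```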

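Binary storage formats: SequenceFile and RCFile

Adding STORED AS SEQUENCEFILE to the CREATE TABLE statement stores the table as sequence files.

When a table produced by Hive is stored as a sequence file, each row is stored as one record of the sequence file.

Row-oriented formats suit queries that need many columns of a single row at the same time. Column-oriented formats suit queries that access only a small number of the columns in a wide table, because the columns that are not referenced can be skipped.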

create table ..
row format serde 'org.......ColumnarSerDe'
stored as rcfile;

The table created above uses column-oriented storage; the statement names a column-oriented SerDe as part of the table definition.

Summary:

FIELDS TERMINATED BY '\t' must be preceded by ROW FORMAT DELIMITED; otherwise the statement fails.

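If a table is created without specifying a field delimiter and tab-separated data is brought in with LOAD DATA, the load itself succeeds because it only moves the files. The problem shows up at SELECT time: Hive splits each line on the default delimiter (Ctrl-A), the values fail the column type check, and the columns display as NULL NULL.

If a table is created without specifying a field delimiter but is populated with INSERT ... SELECT, no error occurs, because Hive writes the files itself using the table's own delimiter.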
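The NULL results can be reproduced with a small Python model of reading a row. This is a sketch of the behaviour, not Hive's code: the function read_row and the sample line are hypothetical, and a failed conversion or a missing field is modelled as None in place of NULL.

```python
def read_row(line, delimiter, column_types):
    """Split a line on the table's delimiter and convert each field.

    A missing field or a value that cannot be converted becomes None,
    which models Hive returning NULL.
    """
    raw = line.split(delimiter)
    row = []
    for index, convert in enumerate(column_types):
        try:
            row.append(convert(raw[index]))
        except (IndexError, ValueError):
            row.append(None)
    return row

line = "0\tNat"  # tab-separated file brought in with LOAD DATA

# Table created without a delimiter: lines are split on Ctrl-A
print(read_row(line, "\x01", [int, str]))  # [None, None]

# Table created with FIELDS TERMINATED BY '\t'
print(read_row(line, "\t", [int, str]))    # [0, 'Nat']
```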

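Internally, Hive uses a SerDe called LazySimpleSerDe for this delimited format. It is a text format, which has many advantages (for example, it is easy to process with MapReduce programs and Streaming), but more compact and efficient binary SerDes are also available.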

Original article: https://www.cnblogs.com/waxili/p/3021231.html