(I) Hive Data Analysis
(1) User behavior analysis: how many purchase records are there from 2014-12-11 through 2014-12-12?
select count(*) from bigdata_user where visit_date>='2014-12-11' and visit_date<='2014-12-12' and behavior_type='4';
(2) User behavior analysis: number of purchases for each day of the month (1-31)
select day(visit_date),count(*) from bigdata_user where behavior_type='4' group by day(visit_date);
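To make the effect of the grouping key concrete, here is a small Python sketch on made-up records (the data and the function names are illustrative assumptions, not part of the dataset). Grouping on the day of the month merges the same day from different months, whereas grouping on the full date keeps them apart.

```python
from collections import Counter
from datetime import date

# Toy (visit_date, behavior_type) records; the values are invented for illustration.
records = [
    (date(2014, 11, 12), "4"),
    (date(2014, 12, 12), "4"),
    (date(2014, 12, 12), "1"),  # a browse record, ignored below
    (date(2014, 12, 11), "4"),
]

def purchases_by_day_of_month(rows):
    """Count purchase records (behavior_type '4') per day of the month."""
    return Counter(d.day for d, behavior in rows if behavior == "4")

def purchases_by_date(rows):
    """Count purchase records (behavior_type '4') per full calendar date."""
    return Counter(d for d, behavior in rows if behavior == "4")

by_day = purchases_by_day_of_month(records)
by_date = purchases_by_date(records)
print(dict(by_day))
print(len(by_date))
```

With these toy records, the two purchases made on the 12th of different months fall into one group when keyed by day of the month, but into two groups when keyed by date.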
(3) User behavior analysis for a specific date (e.g. Double 12, December 12)
a) Number of purchases per province
select province,count(*) from bigdata_user where visit_date='2014-12-12' and behavior_type='4' group by province;
b) Purchases relative to overall activity: total action count and purchase count per user
select uid,count(*) ca, sum(case when behavior_type='4' then 1 else 0 end) from bigdata_user group by uid limit 10;
c) User activity analysis
Number of page views per user on 2014-12-12:
select uid,count(*) from bigdata_user where visit_date='2014-12-12' and behavior_type='1' group by uid;
d) Users who made more than 5 purchases on 2014-12-12
select uid,count(*) from bigdata_user where behavior_type='4' and visit_date='2014-12-12' group by uid having count(*)>5 limit 10;
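The filter-then-group-then-having pattern in item d) can be tried on a toy table. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example runs anywhere); the three-column table and its rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for a runnable toy).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table bigdata_user (uid text, behavior_type text, visit_date text)"
)

# Invented rows: u1 buys 6 times on 12-12, u2 buys 5 times, u3 only browses.
rows = (
    [("u1", "4", "2014-12-12")] * 6
    + [("u2", "4", "2014-12-12")] * 5
    + [("u3", "1", "2014-12-12")] * 7
    + [("u1", "4", "2014-12-11")]
)
conn.executemany("insert into bigdata_user values (?, ?, ?)", rows)

# WHERE narrows to purchases on the day, GROUP BY collects them per user,
# and HAVING keeps only users whose group holds more than 5 rows.
heavy_buyers = conn.execute(
    "select uid, count(*) from bigdata_user "
    "where behavior_type='4' and visit_date='2014-12-12' "
    "group by uid having count(*) > 5"
).fetchall()
print(heavy_buyers)
```

Only u1 passes the filter here: u2 has exactly 5 purchases, which does not satisfy the strict "more than 5" condition.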
(4) Ratio of purchases to page views per user
a) Compute the ratio of each user's purchase count to their browse count.
select c.*, c.c4/c.c1 c41 from (select uid, count(*) countall, sum(case when behavior_type='4' then 1 else 0 end) c4, sum(case when behavior_type='1' then 1 else 0 end) c1 from bigdata_user group by uid) c order by c41 desc limit 10;
b) Create a table to store the result.
create table if not exists buybrowse1 row format delimited fields terminated by ' ' as select c.*, c.c4/c.c1 c41 from (select uid, count(*) countall, sum(case when behavior_type='4' then 1 else 0 end) c4, sum(case when behavior_type='1' then 1 else 0 end) c1 from bigdata_user group by uid) c order by c41 desc;
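The conditional aggregation behind c4, c1, and c41 can be sketched in plain Python. This is a minimal illustration on invented rows; the function name is hypothetical, and returning None for users with no browse records is an assumption meant to mirror the NULL that Hive generally produces for division by zero.

```python
from collections import defaultdict

def buy_browse_ratio(rows):
    """Return {uid: (c4, c1, ratio)} from (uid, behavior_type) pairs.

    c4 counts purchases ('4'), c1 counts browses ('1'); ratio is c4 / c1,
    or None when the user has no browse records (assumed NULL-like result).
    """
    counts = defaultdict(lambda: {"c4": 0, "c1": 0})
    for uid, behavior in rows:
        entry = counts[uid]  # make sure every user gets an entry
        if behavior == "4":
            entry["c4"] += 1
        elif behavior == "1":
            entry["c1"] += 1
    return {
        uid: (c["c4"], c["c1"], c["c4"] / c["c1"] if c["c1"] else None)
        for uid, c in counts.items()
    }

# Invented sample: u1 browses twice and buys once, u2 only buys, u3 only browses.
sample = [("u1", "1"), ("u1", "1"), ("u1", "4"), ("u2", "4"), ("u3", "1")]
ratios = buy_browse_ratio(sample)
print(ratios)
```

Users who bought without any recorded browse have no defined ratio, so it is worth deciding how such rows should be treated before ranking by c41.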
(5) User behavior analysis by geographic location
Purchase volume per province on 2014-12-12:
select province,count(*) from bigdata_user where behavior_type='4' and visit_date='2014-12-12' group by province limit 10;
Custom queries
a) View the behavior type of the first ten records: select behavior_type from bigdata_user limit 10;
b) View the date and item category of the first 20 purchase records: select visit_date,item_category from bigdata_user where behavior_type='4' limit 20;
c) Count the total number of records: select count(*) from bigdata_user;
d) Count the number of distinct users: select count(distinct uid) from bigdata_user;
e) Count the records that are not duplicated (rows that occur exactly once): select count(*) from (select uid,item_id,behavior_type,item_category,visit_date,province from bigdata_user group by uid,item_id,behavior_type,item_category,visit_date,province having count(*)=1)a;
f) Count the browse records dated after 2014-12-10 and before 2014-12-13 (both bounds exclusive): select count(*) from bigdata_user where behavior_type='1' and visit_date<'2014-12-13' and visit_date>'2014-12-10';
g) Grouping by day of the month, show for each day the number of distinct users who made a purchase: select count(distinct uid), day(visit_date) from bigdata_user where behavior_type='4' group by day(visit_date);
h) Count the purchase orders placed from Jiangxi (江西) on 2014-12-12: select count(*) from bigdata_user where province='江西' and visit_date='2014-12-12' and behavior_type='4';
i) Count how many users made a purchase on 2014-12-11: select count(distinct uid) from bigdata_user where visit_date='2014-12-11' and behavior_type='4';
j) Count how many users visited the site on 2014-12-11: select count(distinct uid) from bigdata_user where visit_date='2014-12-11';
k) Count the number of actions by user 10001082 on the site on 2014-12-12: select count(*) from bigdata_user where uid=10001082 and visit_date='2014-12-12';
l) Count the number of actions by all users on the site on 2014-12-12: select count(*) from bigdata_user where visit_date='2014-12-12';
m) List the ids of users who made more than 5 purchases on the site on 2014-12-12: select uid from bigdata_user where behavior_type='4' and visit_date='2014-12-12' group by uid having count(*)>5;
n) Viewing and saving analysis results
i. Query the number of browse records for each province and save the result to the database:
ii. Create the table: create table scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday' ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE;
iii. Insert the data: insert overwrite table scan select province,count(behavior_type) from bigdata_user where behavior_type='1' group by province;
iv. Display the data: select * from scan;
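Item e) in the list above counts rows that occur exactly once, which is not the same as counting distinct rows. The sketch below shows the difference on a toy table, using Python's sqlite3 module as a stand-in for Hive (an assumption for runnability) and a reduced, invented three-column version of bigdata_user.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for a runnable toy).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table bigdata_user (uid text, item_id text, behavior_type text)"
)
conn.executemany(
    "insert into bigdata_user values (?, ?, ?)",
    [
        ("u1", "i1", "1"),
        ("u1", "i1", "1"),  # exact duplicate of the row above
        ("u2", "i2", "4"),
        ("u3", "i3", "1"),
    ],
)

# Rows that appear exactly once: the duplicated row is excluded entirely.
occurring_once = conn.execute(
    "select count(*) from (select uid, item_id, behavior_type from bigdata_user "
    "group by uid, item_id, behavior_type having count(*) = 1) a"
).fetchone()[0]

# Distinct rows: the duplicated row is counted a single time.
distinct_rows = conn.execute(
    "select count(*) from (select distinct uid, item_id, behavior_type "
    "from bigdata_user) a"
).fetchone()[0]

print(occurring_once, distinct_rows)
```

Which of the two counts is wanted depends on whether duplicated records should be dropped altogether or kept once.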
(II) Moving Data Between Hive, MySQL, and HBase
Exporting from Hive to MySQL
a) Create the temporary table user_action: create table dblable.user_action(id STRING,uid STRING, item_id STRING, behavior_type STRING, item_category STRING, visit_date DATE, province STRING) COMMENT 'Welcome to XMU dblab! ' ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE;
b) Check that the table was created: hdfs dfs -ls /user/hive/warehouse/dblable.db
c) Insert the data: INSERT OVERWRITE TABLE dblable.user_action select * from dblable.bigdata_user;
d) Check the inserted data: select * from user_action limit 10;
e) Log in to MySQL: mysql -u root -p
f) Create the database: create database dblable;
g) Create the table: CREATE TABLE `dblable`.`user_action` (`id` varchar(50),`uid` varchar(50),`item_id` varchar(50),`behavior_type` varchar(10),`item_category` varchar(50), `visit_date` DATE,`province` varchar(20)) ENGINE=InnoDB DEFAULT CHARSET=utf8;
h) Check that the table was created: show tables;
i) Export the data: sqoop export --connect jdbc:mysql://localhost:3306/dblable --username root --password hao991206 --table user_action --export-dir '/user/hive/warehouse/dblable.db/user_action' --fields-terminated-by ' ';
j) Log in to MySQL and switch to the dblable database:
i. mysql -uroot -p
ii. use dblable;
k) View the first ten rows: select * from user_action limit 10;
Importing from MySQL into HBase
l) Start HBase: start-hbase.sh
o) Open the HBase shell: hbase shell
p) Create the user_action table: create 'user_action', { NAME => 'f1', VERSIONS => 5}
q) Import the data: sqoop import --connect jdbc:mysql://localhost:3306/dblable --username root --password hadoop --table user_action --hbase-table user_action --column-family f1 --hbase-row-key id --hbase-create-table -m 1
r) View part of the imported data: scan 'user_action',{LIMIT=>10}
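To help read the scan output, the sketch below models how one MySQL row would be laid out in HBase under the options used in the import step (--hbase-row-key id, --column-family f1). It is a simplified Python illustration, not Sqoop's code: the function name and the sample row are hypothetical, and it assumes Sqoop's documented default behavior of storing each non-null column as the UTF-8 bytes of its string form and not repeating the row-key column as a cell.

```python
def row_to_hbase_cells(row, row_key_column="id", column_family="f1"):
    """Map one relational row (a dict) to an HBase row key and its cells.

    Each remaining non-null column becomes a cell named 'family:column'
    whose value is the UTF-8 encoding of the column's string form.
    """
    row_key = str(row[row_key_column]).encode("utf-8")
    cells = {
        f"{column_family}:{name}": str(value).encode("utf-8")
        for name, value in row.items()
        if name != row_key_column and value is not None
    }
    return row_key, cells

# Invented sample row using the column names of the user_action table.
sample_row = {
    "id": 1,
    "uid": "10001082",
    "item_id": "285259775",
    "behavior_type": "1",
    "item_category": "4076",
    "visit_date": "2014-12-08",
    "province": None,  # a null column produces no cell
}
row_key, cells = row_to_hbase_cells(sample_row)
print(row_key)
for qualifier, value in sorted(cells.items()):
    print(qualifier, value)
```

Each line printed by scan 'user_action' then corresponds to one such cell: a row key, a column of the form f1:name, a timestamp, and the value.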
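Using the HBase Java API to import local data into HBase
s) Copy the user_action data from HDFS into the local directory /usr/local/bigdatacase/dataset (run the command from that directory): hdfs dfs -get /user/hive/warehouse/dblable.db/user_action .
t) View the first ten rows: cat ./user_action/* | head -10
u) Concatenate the 00000* files into a single file named user_action.output: cat ./user_action/00000* > user_action.output
v) View the first ten rows: head -10 user_action.output
w) Download and install Eclipse
x) Create the project
Remote transfer: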