1数据预处理(Linux环境)
搜狗数据的数据格式:
访问时间 用户ID [查询词] 该URL在返回结果中的排名 用户点击的顺序号 用户点击的URL
其中,用户ID是根据用户使用浏览器访问搜索引擎时的Cookie信息自动赋值,即同一次使用浏览器输入的不同查询对应同一个用户ID。
1.1数据扩展 awk -F ‘ ’ ‘{print ()}’ sogo1w > sogo1w_e
awk -F' ''{print$0" "substr($1,1,4)" "substr($1,5,2)" "substr($1,7,2)" "substr($1,9,2)}' sougoulog.txt > sougoulog1.txt
将时间字段拆分并拼接,添加年、月、日、小时字段,结果如图所示:
2基于Hive构建日志数据的数据仓库
2.1创外部表
1 创建扩展4个字段(年、月、日、小时)数据的外部表
create external table sogou1(time string,uid string,name string,rank int,oder int,url string, year string,month string,day string,hour string) row format delimited fields terminated by ' ';
2 装载数据
load data local inpath '/usr/local/hivedata/sougoulog1.txt' into table sogou1;
2.2创建带分区的表(按照年、月、天、小时分区)
Create table(,) partitioned by (,,,,)
Create table sogou2 (time string,uid string,name string,rank int,oder int,url string) partitioned by (year string,month string,day string,hour string) row format delimited fields terminated by ‘ ’;
2.3灌入数据-->动态分区
Insert into table sogou2 patition (year,month,day,hour) select time,uid,name,rank,oder,url,year,month,day,hour from sogou1;
3条数统计
3.1数据总条数
Select count(*) from sogou2;
3.2关键字的非空查询条数
Select count(*) from sogou2 where name is not null;
3.3独立UID总数
Select count(*) from (select distinct uid from sogou2) s;
Select count(*) from (select count(*) from sogou2 group by uid)s;
单个uid出现的次数
Select count(*) from (select count(*) from sogou2 group by uid having count(*)=1)s;
4 UID分析
4.1 UID的查询次数分布(查询1次的UID个数,...查询N次的UID个数)
select s.c ,count(*) from (select count(*) c from sogou2 group by uid)s group by s.c;
4.2 每个UID平均查询次数
select avg(s.c) from (select count(*) c from sogou2 group by uid) s ;
4.3查询次数大于2次的用户总数
select sum(s.c) from (select count(*) c from sogou2 group by uid ) s where s.c>2;
4.3查询次数大于2次的用户占比
select sum(m.c)/sum(n.d) from (select count(*) c from sogou2 group by uid having c>2) m , (select count(*) d from sogou2 group by uid) n;
5 Rank和关键词的分析
5.1 Rank在10以内的点击次数占比
select m.c/n.d from (select count(*)c from sogou2 where runk < 10)m,(select count(*)d from sogou2 )n;
5.2 直接输入URL作为查询词的比例
select m.c/n.d from (select count(*)c from sogou2 where name = ‘url’)m,(select count(*) d from sogou2)n;