Requirement: compute a page's visit count (PV, page views) and its unique visitor count (UV).
Environment: pseudo-distributed Hadoop (CDH), MySQL, Hive, Sqoop, Spring Boot
Approach: upload the raw logs to HDFS, clean and partition them in Hive, compute PV/UV with GROUP BY, export the results to MySQL via Sqoop, then chart them with Spring Boot + ECharts.
Implementation:
1. Upload the crawled raw data to HDFS
bin/hdfs dfs -mkdir /project
bin/hdfs dfs -put /usr/local/2015082818 /project
bin/hdfs dfs -put /usr/local/2015082819 /project
2. Create the source table in Hive
create table yhd_source(
    id string, url string, referer string, keyword string, type string,
    guid string, pageId string, moduleId string, linkId string,
    attachedInfo string, sessionId string, trackerU string, trackerType string,
    ip string, trackerSrc string, cookie string, orderCode string,
    trackTime string, endUserId string, firstLink string, sessionViewNo string,
    productId string, curMerchantId string, provinceId string, cityId string,
    fee string, edmActivity string, edmEmail string, edmJobId string,
    ieVersion string, platform string, internalKeyword string, resultSum string,
    currentPage string, linkPosition string, buttonPosition string
)
row format delimited fields terminated by ' '
location '/project';
3. Create a cleansing table that will hold only the extracted parts of the time field
create table yhd_qingxi(
    id string, url string, guid string, date string, hour string
)
row format delimited fields terminated by ' ';
4. Extract the day & hour substrings
insert into table yhd_qingxi
select id, url, guid,
       substring(trackTime, 9, 2) date,
       substring(trackTime, 12, 2) hour
from yhd_source;
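The substring positions can be sanity-checked outside Hive. A minimal Python sketch, assuming trackTime values look like '2015-08-28 18:30:00' (note that Hive's substring() is 1-indexed while Python slices are 0-indexed):

```python
def extract_day_hour(track_time):
    """Mirror substring(trackTime, 9, 2) and substring(trackTime, 12, 2).

    With 'YYYY-MM-DD HH:MM:SS', Hive positions 9-10 are the day
    and positions 12-13 are the hour.
    """
    day = track_time[8:10]    # Python [8:10] == Hive substring(.., 9, 2)
    hour = track_time[11:13]  # Python [11:13] == Hive substring(.., 12, 2)
    return day, hour

print(extract_day_hour("2015-08-28 18:30:00"))  # → ('28', '18')
```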
5. Partitioning approach: Hive static partitions
create table yhd_part1(
    id string, url string, guid string
)
partitioned by (date string, hour string)
row format delimited fields terminated by ' ';
6. Load the data from the cleansed table yhd_qingxi, one partition at a time
insert into table yhd_part1 partition (date='20150828', hour='18')
select id, url, guid from yhd_qingxi where date='28' and hour='18';

insert into table yhd_part1 partition (date='20150828', hour='19')
select id, url, guid from yhd_qingxi where date='28' and hour='19';
7. PV computation
select date, hour, count(url) PV from yhd_part1 group by date, hour;
-> grouped by day and hour
-> result:
+-----------+-------+--------+--+
|   date    | hour  |   pv   |
+-----------+-------+--------+--+
| 20150828  | 18    | 64972  |
| 20150828  | 19    | 61162  |
+-----------+-------+--------+--+
8. UV computation
select date, hour, count(distinct guid) UV from yhd_part1 group by date, hour;
-> result:
+-----------+-------+--------+--+
|   date    | hour  |   uv   |
+-----------+-------+--------+--+
| 20150828  | 18    | 23938  |
| 20150828  | 19    | 22330  |
+-----------+-------+--------+--+
9. Combine PV and UV into one result table
create table if not exists result as
select date, hour, count(url) PV, count(distinct guid) UV
from yhd_part1
group by date, hour;
-> result:
+--------------+--------------+------------+------------+--+
| result.date  | result.hour  | result.pv  | result.uv  |
+--------------+--------------+------------+------------+--+
| 20150828     | 18           | 64972      | 23938      |
| 20150828     | 19           | 61162      | 22330      |
+--------------+--------------+------------+------------+--+
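The GROUP BY above boils down to: PV counts every row (page hit), UV counts distinct guid values, both per (date, hour) group. A minimal Python sketch of the same aggregation over in-memory records (synthetic sample rows, not the real dataset):

```python
from collections import defaultdict

def pv_uv(records):
    """records: iterable of (date, hour, url, guid) tuples.
    Returns {(date, hour): (pv, uv)}: pv = row count, uv = distinct guids."""
    hits = defaultdict(int)
    visitors = defaultdict(set)
    for date, hour, url, guid in records:
        hits[(date, hour)] += 1           # every page view counts toward PV
        visitors[(date, hour)].add(guid)  # only distinct guids count toward UV
    return {key: (hits[key], len(visitors[key])) for key in hits}

sample = [
    ("20150828", "18", "/home", "g1"),
    ("20150828", "18", "/cart", "g1"),   # same visitor, second page view
    ("20150828", "18", "/home", "g2"),
    ("20150828", "19", "/home", "g3"),
]
print(pv_uv(sample))
# → {('20150828', '18'): (3, 2), ('20150828', '19'): (1, 1)}
```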
10. Export the results to a MySQL table
First create a table in MySQL to hold the result set:

create table if not exists save(
    date varchar(30) not null,
    hour varchar(30) not null,
    pv varchar(30) not null,
    uv varchar(30) not null
);

Then export with Sqoop:

bin/sqoop export \
    --connect jdbc:mysql://hadoop5.lhcedu.com:3306/sqoop \
    --username root \
    --password 1234456 \
    --table save \
    --export-dir /user/hive/warehouse/lhc125.db/result \
    --num-mappers 1 \
    --input-fields-terminated-by '\001'

(\001 is Hive's default field delimiter.)

+----------+------+-------+-------+
| date     | hour | pv    | uv    |
+----------+------+-------+-------+
| 20150828 | 18   | 64972 | 23938 |
| 20150828 | 19   | 61162 | 22330 |
+----------+------+-------+-------+
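The field delimiter is the usual failure point of this export: the CTAS result table was written with Hive's default separator \001 (Ctrl-A), an unprintable control character, so Sqoop must be told about it explicitly. A small Python sketch of what one line of the warehouse file actually contains (illustrative data from the result above):

```python
# One row of the `result` table as stored under /user/hive/warehouse/...:
# fields are joined by the \001 (Ctrl-A) control character, Hive's default.
line = "20150828\x0118\x0164972\x0123938\n"

fields = line.rstrip("\n").split("\x01")
print(fields)  # → ['20150828', '18', '64972', '23938']
```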
11. Dynamic partitions
Partitioning approach: Hive dynamic partitions
Relevant settings in hive-site.xml:

<property>
    <name>hive.exec.dynamic.partition</name>
    <value>true</value>
    <description>Whether or not to allow dynamic partitions in DML/DDL.</description>
</property>
-> defaults to true, i.e. dynamic partitioning is allowed

<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>strict</value>
    <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
</property>
-> set hive.exec.dynamic.partition.mode=nonstrict;  (switch to non-strict mode)

Create the table:

create table yhd_part2(
    id string, url string, guid string
)
partitioned by (date string, hour string)
row format delimited fields terminated by ' ';

Run the dynamic-partition insert (the trailing date and hour columns of yhd_qingxi supply the partition values):

insert into table yhd_part2 partition (date, hour)
select * from yhd_qingxi;
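The difference from step 6 is that the partition values now come from each row instead of being hard-coded in the INSERT. A minimal Python sketch of that routing (synthetic rows; each row's own date/hour columns decide its partition):

```python
from collections import defaultdict

def dynamic_partition(rows):
    """rows: (id, url, guid, date, hour) tuples. Route each row into the
    partition directory derived from its own trailing date/hour columns,
    the way a Hive dynamic-partition insert does."""
    partitions = defaultdict(list)
    for rid, url, guid, date, hour in rows:
        partitions[f"date={date}/hour={hour}"].append((rid, url, guid))
    return partitions

rows = [
    ("1", "/home", "g1", "28", "18"),
    ("2", "/cart", "g2", "28", "19"),
    ("3", "/home", "g3", "28", "18"),
]
parts = dynamic_partition(rows)
print(sorted(parts))  # → ['date=28/hour=18', 'date=28/hour=19']
```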
Integrate Spring Boot with ECharts to produce the graphical report.
ECharts official example: https://echarts.baidu.com/examples/editor.html?c=area-stack
In a conventional project setup:
1. Include the echarts.min.js library in the JSP page
<%@ page contentType="text/html;charset=UTF-8" pageEncoding="UTF-8" %>
<!DOCTYPE html>
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta content="always" name="referrer">
    <title>file</title>
    <script type="text/javascript" src="${pageContext.request.contextPath}/jquery.min.js"></script>
    <script src="${pageContext.request.contextPath}/echarts.min.js"></script>
</head>
<body>
<!-- Prepare a DOM element with an explicit width and height for ECharts -->
<div id="ec" style="width:600px;height:400px;"></div>
<script type="text/javascript">
    var myChart = echarts.init(document.getElementById('ec'));
    option = {
        title: { text: 'Yihaodian PV/UV chart' },
        tooltip: {
            trigger: 'axis',
            axisPointer: {
                type: 'cross',
                label: { backgroundColor: '#6a7985' }
            }
        },
        legend: { data: ['PV', 'UV'] },
        toolbox: { feature: { saveAsImage: {} } },
        grid: { left: '3%', right: '4%', bottom: '3%', containLabel: true },
        xAxis: [ {} ],
        yAxis: [ { /*type : 'value'*/ } ],
        series: []
    };
    myChart.setOption(option);
    // Fetch the PV/UV data, then fill in the x-axis and the two series
    $.ajax({
        type: "get",
        url: "${pageContext.request.contextPath}/re/query",
        dataType: "JSON",
        success: function (data) {
            myChart.setOption({
                xAxis: [ { type: 'category', boundaryGap: false, data: data.h } ],
                series: [
                    { name: 'PV', type: 'line', stack: 'total', areaStyle: {}, data: data.pv },
                    { name: 'UV', type: 'line', stack: 'total', areaStyle: {}, data: data.uv }
                ]
            })
        }
    })
</script>
</body>
</html>
2. Controller that assembles the data structure ECharts expects
package com.lhc.controller;

import com.lhc.entity.Result;
import com.lhc.service.ResultService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@Controller
@RequestMapping("re")
public class ResultController {

    @Autowired
    ResultService resultService;

    @RequestMapping("query")
    @ResponseBody
    public Map query() {
        List<Result> results = resultService.queryResult();
        List<String> pv = new ArrayList<>();
        List<String> uv = new ArrayList<>();
        List<String> date = new ArrayList<>();
        for (Result result : results) {
            pv.add(result.getPv());
            uv.add(result.getUv());
            date.add(result.getHour());
        }
        Map map = new HashMap();
        map.put("pv", pv);
        map.put("uv", uv);
        map.put("h", date);
        return map;
    }
}
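The controller returns JSON shaped as three parallel arrays ({"pv": [...], "uv": [...], "h": [...]}), which is exactly what the ajax callback feeds into xAxis.data and the two series. A Python sketch of the same transformation from result rows (hypothetical (date, hour, pv, uv) tuples, mirroring the fields of the Result entity):

```python
def to_echarts_payload(rows):
    """rows: (date, hour, pv, uv) tuples, as read from the MySQL `save`
    table. Returns the three parallel lists the JSP page expects."""
    return {
        "pv": [pv for _, _, pv, _ in rows],
        "uv": [uv for _, _, _, uv in rows],
        "h":  [hour for _, hour, _, _ in rows],  # x-axis categories
    }

rows = [("20150828", "18", "64972", "23938"),
        ("20150828", "19", "61162", "22330")]
print(to_echarts_payload(rows))
# → {'pv': ['64972', '61162'], 'uv': ['23938', '22330'], 'h': ['18', '19']}
```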
Result: the graphical report is rendered.