zoukankan html css js c++ java

有效线索主题看板阿善有用清洗转换具体怎么做

今日内容: ----------------------------------------------------------------------------------------
1) 分桶表的相关优化 -- 理解
2) 建模分层操作 -- 需要操作
3) 全量流程的统计分析: -- 需求操作 (尝试自己实现)
数据的采集, 数据的清洗转换, 数据维度退化, 数据的统计分析
4) 增量流程的: 如何对拉链表实现增量处理 -- 理解

1.意向客户主题看板_需求说明:
需求一: 计期内，新增意向客户（包含自己录入的意向客户）总数。
指标: 意向数量
维度:
时间维度:
年月天小时
新老维度:
线上线下:

涉及表:
customer_relationship(意向表)
涉及的字段:
create_date_time
基于这个字段统计意向用户数量: customer_id:先去重

需求二: 统计指定时间段内，新增的意向客户，所在城市区域人数热力图
指标: 意向数量
维度:
时间维度: 年月天小时
新老维度:
线上线下
区域维度:
涉及表:
customer_relationship(意向表)
customer (客户表(学员表))
涉及的字段:
意向表中: create_date_time

客户表: area

基于这个字段统计意向用户数量: customer_id:先去重
两个表关联条件:
意向表.customer_id=客户表.id

需求三: 统计指定时间段内，新增的意向客户中，意向学科人数排行榜。学科名称要关联查询出来
指标: 意向数量
维度:
时间维度: 年月天小时
新老维度:
线上线下
学科维度
涉及表:
customer_relationship(意向表),
itcast_subject(学科表)
customer_clue(线索表)

涉及字段:
线索表 :
clue_state : 可以帮助识别新老用户
deleted : 用于判断数据是否删除
create_date_time
意向表 :
origin_type: 此字段可以帮助判断是否为线上还是线下
如果值为: NETSERVICE OR PRESIDNUP 说明是线上其他就是为线下
基于这个字段统计意向用户数量: customer_id:先去重
学科表:
name
关联条件:
线索表.customer_relationship_id = 意向表.id
学科表.id = 意向表.itcast_subject_id

需求四: 统计指定时间段内，新增的意向客户中，意向校区人数排行榜
指标: 意向数量
维度:
时间维度: 年月天小时
新老维度:
线上线下
校区维度

注意：学校id，同步时，0和null转换为统一数据，都转换为-1

涉及表:
customer_relationship(意向表),
customer_clue(线索表),
itcast_school(校区表)
涉及字段:
线索表 :
clue_state : 可以帮助识别新老用户
deleted : 用于判断数据是否删除
create_date_time
意向表 :
origin_type: 此字段可以帮助判断是否为线上还是线下
如果值为: NETSERVICE OR PRESIDNUP 说明是线上其他就是为线下
基于这个字段统计意向用户数量: customer_id:先去重
校区表:
name
关联条件:
意向表.itcast_school_id = 校区表.id
线索表.customer_relationship_id = 意向表.id

需求五: 统计指定时间段内，新增的意向客户中，不同来源渠道的意向客户占比。
指标: 意向数量
维度:
时间维度: 年月天小时
新老维度:
线上线下
来源渠道

涉及表:
customer_relationship(意向表),
customer_clue(线索表)
涉及字段:
线索表 :
clue_state : 可以帮助识别新老用户
deleted : 用于判断数据是否删除
意向表:
create_date_time
origin_type: 此字段可以帮助判断是否为线上还是线下此字段也表示来源渠道
如果值为: NETSERVICE OR PRESIDNUP 说明是线上其他就是为线下
基于这个字段统计意向用户数量: customer_id:先去重
关联条件:
线索表.customer_relationship_id = 意向表.id

需求6: 统计指定时间段内，新增的意向客户中，各咨询中心产生的意向客户数占比情况
指标: 意向数量
维度:
时间维度: 年月天小时
新老维度:
线上线下
各咨询中心

涉及表:
customer_relationship(意向表),
employee: 员工表
scrm_department : 部门表
customer_clue(线索表)
涉及字段:
线索表 :
clue_state : 可以帮助识别新老用户
意向表:
create_date_time
origin_type: 此字段可以帮助判断是否为线上还是线下此字段也表示来源渠道
如果值为: NETSERVICE OR PRESIDNUP 说明是线上其他就是为线下
基于这个字段统计意向用户数量: customer_id:先去重
员工表:
tdepart_id : 部门id
部门表:
name
关联条件:
线索表.customer_relationship_id = 意向表.id
员工表.tdepart_id = 部门表.id
意向表.creator = 员工表.id

总结:
指标: 意向数量
维度:
时间维度: 年月天小时
新老维度:
线上线下
产品属性维度:
地区维度 , 来源渠道, 学科维度, 校区维度 , 各咨询中心

涉及表: 7张表
customer_relationship(意向表),
涉及到字段: create_date_time , origin_type , customer_id
employee: 员工表
涉及到字段 : tdepart_id 和 id
scrm_department : 部门表
涉及到字段 : name 和 id
customer_clue(线索表)
涉及到字段 : clue_state ,deleted ,create_date_time ,customer_relationship_id
itcast_school(校区表) :
涉及到字段 : name 和 id
itcast_subject(学科表)
涉及到字段 : name 和 id
customer(客户表)
涉及到字段: area 和 id
表关联:
线索表.customer_relationship_id = 意向表.id
员工表.tdepart_id = 部门表.id
意向表.creator = 员工表.id
意向表.itcast_school_id = 校区表.id
学科表.id = 意向表.itcast_subject_id
意向表.customer_id=客户表.id

意向主题看板案例_导入原始业务数据 --- 此层在实际工作中不存在
create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;

将原来发的知行教育分析平台资料中 --> 原始完整数据集 --> scrm --> 将7个表依次导入MySQL中

意向主题看板案例_建模分析:
ODS层:
事实表: 意向表
额外放置一张表: 线索表 (说明: 此表由于是后续主题看板事实表, 为了方便后续的处理, 将此表放置在ODS层)
表: 内部表 + 分桶表 + 分区表 + 拉链表实施
DIM层: 维度层
员工表, 校区表, 学科表, 客户表 ,部门表
表: 外部表 + 分区表
关于以上两层: 只需要一对对应原生数据表结构构建即可, 构建时注意添加一个 start_time(抽取时间)
数据格式和压缩方式: ORC + ZLIB(SNAPPY)

DW层:
DWD: 清洗转换以及如果表字段过多, 可以抽取相关的字段 , 对 ODS层表进行处理
清洗工作:
清理掉以及被标识为删除的数据
转换工作:
将 origin_type中数据转换为 0 和 1 形成一个新的字段, 用于标识线上上下
create_date_time将时间转换为年月日小时
学校id，同步时，0和null转换为统一数据，都转换为-1
涉及到字段:
普通字段:
id,create_date_time,delete ,customer_id ,origin_type ,origin_type_stat,
itcast_school_id ,itcast_subject_id,creator,hourinfo
分区:
年(yearinfo) , 月(monthinfo) 日(dayinfo)

DWM: 基于维度提前聚合操作 (不能做) 维度退化
将六个维度表, 和 DWD的事实表进行组合, 形成一张表, 从而实现维度退化操作
思想: 考虑要从各个维度表中获取那些字段数据, 将这些字段数据全部糅杂在一个表即可
相关字段:
普通字段:
customer_id, create_date_time,clue_state_stat ,origin_type_stat,area,origin_type,
itcast_school_id,school_name,itcast_subject_id,itcast_subject_name,department_id,
department_name ,hourinfo
分区字段:
年(yearinfo) , 月(monthinfo) 日(dayinfo)

要想生成这个表的数据, 此处需要进行从ODS+DIM 进行七表联查得出此表结果

DWS: 指标只有一个, 表也就只有一个
customerid_total,clue_state_stat,origin_type_stat,area,origin_type,
itcast_school_id,school_name,itcast_subject_id,itcast_subject_name,
department_id, department_name , time_type,group_type ,hourinfo ,time_str

分区:
年(yearinfo) , 月(monthinfo) 日(dayinfo)
time_type: 1(年) 2(月) 3(日) 4(小时)
group_type: 1地区维度 , 2来源渠道, 3学科维度, 4校区维度 , 5各咨询中心 ,6 总意向量

数据结果:
1000 0 0 年 -1 -1 -1 -1
1000 0 1 年 -1 -1 -1 -1
1000 1 0 年 -1 -1 -1 -1
1000 1 1 年 -1 -1 -1 -1
1000 0 0 年 11 -1 -1 -1
1000 0 1 年 11 -1 -1 -1
1000 1 0 年 11 -1 -1 -1
1000 1 1 年 11 -1 -1 -1
1000 0 0 年 11 01 -1 -1
1000 0 1 年 11 01 -1 -1
1000 1 0 年 11 01 -1 -1
1000 1 1 年 11 01 -1 -1
1000 0 0 年 11 -1 山西 -1
1000 0 1 年 11 -1 山西 -1
1000 1 0 年 11 -1 山西 -1
1000 1 1 年 11 -1 山西 -1
1000 0 0 年 11 01 -1 weixin
1000 0 1 年 11 01 -1 weixin
1000 1 0 年 11 01 -1 weixin
1000 1 1 年 11 01 -1 weixin

app层: 不要 DWS已经成功将各个维度分析完成....

2. 分桶表的相关优化:
分桶表: 分文件将一个文件拆分多个文件的操作, 具体拆分多少, 取决于设置的分桶的数量
底层是如何实现分文件呢? 核心采用 MR 分区, 采用 Hash取模计算法对分桶字段进行分区操作
会将数据进行打散操作, 同时保证相同数据会发往同一个reduce中

桶表的操作:
创建表:
create table test_buck(id int, name string)
clustered by(id) sorted by (id asc) into 6 buckets -- 主要此处代码
row format delimited fields terminated by ' ';

插入数据:
--启用桶表
set hive.enforce.bucketing=true;
insert into ...

注意: 桶表不能使用 load data 方式来插入桶表数据,
set hive.strict.checks.bucketing = true; 禁止桶表使用load data 默认true
如何将数据插入到桶表:
对桶表建立一张临时表(千万不能桶表) 通过 load data 方式将数据进行加载到临时表, 然后通过 insert into 从临时表
将数据加载到桶表中

作用:
数据的抽样处理 : 将一个文件的数据拆分为多个文件后, 从中获取其中某几个文件来进行处理, 这个过程数据采样
作用:
1. 测试的时候, 由于数据过于庞大, 可以对数据进行采样, 然后在采样的结果上进行统计分析即可,提升快速开发的效率
2. 对整体数据分析不是很方便, 可以进行采样分析, 得出的结果依然可以反映整个数据的结果信息
如何实现抽样:
格式:
select * from table tablesample(bucket x out of y on column) as a

放置位置: 紧跟在表的后面如果表有别名, 请将抽样函数放置在别名之前, 表之后
函数说明: tablesample(bucket x out of y on column)
X : 从第几个桶开始抽 x的值必须小于等于y的值
y : 抽桶数量比例 , 必须是桶的倍数或者因子
column : 按照那个字段进行分桶抽样

例子: 表有 10个桶分桶字段为id

tablesample(bucket 3 out of 5 on id):
思考 : 会抽出几个桶? 10/5 = 2
会抽出那两个桶呢?
第三个桶和第八个桶

提升多表join的查询性能 : 主要的手段就是 map join
1. mapjoin: 适合于小表和大表的join操作
必备条件:
set hive.auto.convert.join=true; -- 必须开启 mapjoin的优化默认值为true
set hive.auto.convert.join.noconditionaltask.size=512000000; 小表阈值默认值为 20971520 (20M)

2. 中等大小的表和大表进行join: 要求使用 map join 可以使用 Bucket-MapJoin
实现必备条件:
1) 两个表的关联条件的字段必须是分桶字段
2) 中型表的分桶数量小于等于大表的分桶数量并且必须是大表桶的倍数
3) 开启 bucket_mapjoin : set hive.optimize.bucketmapjoin = true
4) 两个表必须是分桶表 : 启用 set hive.enforce.bucketing=true;

一旦将以上的条件都满足, hive自动采用 Bucket-MapJoin 如果不满足, hive会检测是否满足 map join, 如果不满足, 那么就采用
原始 reduce join的方案

3. 大表和大表 join: 要求使用 map join 可以采用 SMB Join
基于 Bucket-MapJoin 实施的, 首先要先满足 Bucket-MapJoin
实现必备条件:
1) 两个表的关联条件的字段必须是分桶字段, 并且必须按照分桶字段进行排序
2) 两个表的分桶数量必须相等
3) 开启 bucket_mapjoin : set hive.optimize.bucketmapjoin = true
4) 两个表必须是分桶表 : 启用 set hive.enforce.bucketing=true;
5) 开启 SMB join的必备三项条件 :
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin.sortedmerge = true; --开启 SMBjoin
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
set hive.enforce.sorting=true;

建表操作:
create table test_smb_2(mid string,age_id string)
CLUSTERED BY(mid) SORTED BY(mid) INTO 500 BUCKETS;
--3. 意向用户主题看板: 建模分层操作
准备工作: 开启写入压缩
set hive.exec.orc.compression.strategy=COMPRESSION;
--3.1: 创建 ODS层表: 2张表 (意向表和线索表)
CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship` (
`id` int COMMENT '客户关系id',
`create_date_time` STRING COMMENT '创建时间',
`update_date_time` STRING COMMENT '最后更新时间',
`deleted` int COMMENT '是否被删除（禁用）',
`customer_id` int COMMENT '所属客户id',
`first_id` int COMMENT '第一条客户关系id',
`belonger` int COMMENT '归属人',
`belonger_name` STRING COMMENT '归属人姓名',
`initial_belonger` int COMMENT '初始归属人',
`distribution_handler` int COMMENT '分配处理人',
`business_scrm_department_id` int COMMENT '归属部门',
`last_visit_time` STRING COMMENT '最后回访时间',
`next_visit_time` STRING COMMENT '下次回访时间',
`origin_type` STRING COMMENT '数据来源',
`itcast_school_id` int COMMENT '校区Id',
`itcast_subject_id` int COMMENT '学科Id',
`intention_study_type` STRING COMMENT '意向学习方式',
`anticipat_signup_date` STRING COMMENT '预计报名时间',
`level` STRING COMMENT '客户级别',
`creator` int COMMENT '创建人',
`current_creator` int COMMENT '当前创建人：初始==创建人，当在公海拉回时为拉回人',
`creator_name` STRING COMMENT '创建者姓名',
`origin_channel` STRING COMMENT '来源渠道',
`comment` STRING COMMENT '备注',
`first_customer_clue_id` int COMMENT '第一条线索id',
`last_customer_clue_id` int COMMENT '最后一条线索id',
`process_state` STRING COMMENT '处理状态',
`process_time` STRING COMMENT '处理状态变动时间',
`payment_state` STRING COMMENT '支付状态',
`payment_time` STRING COMMENT '支付状态变动时间',
`signup_state` STRING COMMENT '报名状态',
`signup_time` STRING COMMENT '报名时间',
`notice_state` STRING COMMENT '通知状态',
`notice_time` STRING COMMENT '通知状态变动时间',
`lock_state` STRING COMMENT '锁定状态',
`lock_time` STRING COMMENT '锁定状态修改时间',
`itcast_clazz_id` int COMMENT '所属ems班级id',
`itcast_clazz_time` STRING COMMENT '报班时间',
`payment_url` STRING COMMENT '付款链接',
`payment_url_time` STRING COMMENT '支付链接生成时间',
`ems_student_id` int COMMENT 'ems的学生id',
`delete_reason` STRING COMMENT '删除原因',
`deleter` int COMMENT '删除人',
`deleter_name` STRING COMMENT '删除人姓名',
`delete_time` STRING COMMENT '删除时间',
`course_id` int COMMENT '课程ID',
`course_name` STRING COMMENT '课程名称',
`delete_comment` STRING COMMENT '删除原因说明',
`close_state` STRING COMMENT '关闭装填',
`close_time` STRING COMMENT '关闭状态变动时间',
`appeal_id` int COMMENT '申诉id',
`tenant` int COMMENT '租户',
`total_fee` DECIMAL COMMENT '报名费总金额',
`belonged` int COMMENT '小周期归属人',
`belonged_time` STRING COMMENT '归属时间',
`belonger_time` STRING COMMENT '归属时间',
`transfer` int COMMENT '转移人',
`transfer_time` STRING COMMENT '转移时间',
`follow_type` int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取',
`transfer_bxg_oa_account` STRING COMMENT '转移到博学谷归属人OA账号',
`transfer_bxg_belonger_name` STRING COMMENT '转移到博学谷归属人OA姓名',
`end_time` STRING COMMENT '有效截止时间')
comment '客户关系表'
PARTITIONED BY(start_time STRING)
clustered by(id) sorted by(id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue (
id int COMMENT 'customer_clue_id',
create_date_time STRING COMMENT '创建时间',
update_date_time STRING COMMENT '最后更新时间',
deleted STRING COMMENT '是否被删除（禁用）',
customer_id int COMMENT '客户id',
customer_relationship_id int COMMENT '客户关系id',
session_id STRING COMMENT '七陌会话id',
sid STRING COMMENT '访客id',
status STRING COMMENT '状态（undeal待领取 deal 已领取 finish 已关闭 changePeer 已流转）',
users STRING COMMENT '所属坐席',
create_time STRING COMMENT '七陌创建时间',
platform STRING COMMENT '平台来源（pc-网站咨询|wap-wap咨询|sdk-app咨询|weixin-微信咨询）',
s_name STRING COMMENT '用户名称',
seo_source STRING COMMENT '搜索来源',
seo_keywords STRING COMMENT '关键字',
ip STRING COMMENT 'IP地址',
referrer STRING COMMENT '上级来源页面',
from_url STRING COMMENT '会话来源页面',
landing_page_url STRING COMMENT '访客着陆页面',
url_title STRING COMMENT '咨询页面title',
to_peer STRING COMMENT '所属技能组',
manual_time STRING COMMENT '人工开始时间',
begin_time STRING COMMENT '坐席领取时间 ',
reply_msg_count int COMMENT '客服回复消息数',
total_msg_count int COMMENT '消息总数',
msg_count int COMMENT '客户发送消息数',
comment STRING COMMENT '备注',
finish_reason STRING COMMENT '结束类型',
finish_user STRING COMMENT '结束坐席',
end_time STRING COMMENT '会话结束时间',
platform_description STRING COMMENT '客户平台信息',
browser_name STRING COMMENT '浏览器名称',
os_info STRING COMMENT '系统名称',
area STRING COMMENT '区域',
country STRING COMMENT '所在国家',
province STRING COMMENT '省',
city STRING COMMENT '城市',
creator int COMMENT '创建人',
name STRING COMMENT '客户姓名',
idcard STRING COMMENT '身份证号',
phone STRING COMMENT '手机号',
itcast_school_id int COMMENT '校区Id',
itcast_school STRING COMMENT '校区',
itcast_subject_id int COMMENT '学科Id',
itcast_subject STRING COMMENT '学科',
wechat STRING COMMENT '微信',
qq STRING COMMENT 'qq号',
email STRING COMMENT '邮箱',
gender STRING COMMENT '性别',
level STRING COMMENT '客户级别',
origin_type STRING COMMENT '数据来源渠道',
information_way STRING COMMENT '资讯方式',
working_years STRING COMMENT '开始工作时间',
technical_directions STRING COMMENT '技术方向',
customer_state STRING COMMENT '当前客户状态',
valid STRING COMMENT '该线索是否是网资有效线索',
anticipat_signup_date STRING COMMENT '预计报名时间',
clue_state STRING COMMENT '线索状态',
scrm_department_id int COMMENT 'SCRM内部部门id',
superior_url STRING COMMENT '诸葛获取上级页面URL',
superior_source STRING COMMENT '诸葛获取上级页面URL标题',
landing_url STRING COMMENT '诸葛获取着陆页面URL',
landing_source STRING COMMENT '诸葛获取着陆页面URL来源',
info_url STRING COMMENT '诸葛获取留咨页URL',
info_source STRING COMMENT '诸葛获取留咨页URL标题',
origin_channel STRING COMMENT '投放渠道',
course_id int COMMENT '课程编号',
course_name STRING COMMENT '课程名称',
zhuge_session_id STRING COMMENT 'zhuge会话id',
is_repeat int COMMENT '是否重复线索(手机号维度) 0:正常 1：重复',
tenant int COMMENT '租户id',
activity_id STRING COMMENT '活动id',
activity_name STRING COMMENT '活动名称',
follow_type int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取',
shunt_mode_id int COMMENT '匹配到的技能组id',
shunt_employee_group_id int COMMENT '所属分流员工组',
ends_time STRING COMMENT '有效时间')
comment '客户关系表'
PARTITIONED BY(starts_time STRING)
clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

--3.2: 创建 DIM层表: 5张表
CREATE DATABASE IF NOT EXISTS itcast_dimen;
CREATE TABLE IF NOT EXISTS itcast_dimen.`customer` (
`id` int COMMENT 'key id',
`customer_relationship_id` int COMMENT '当前意向id',
`create_date_time` STRING COMMENT '创建时间',
`update_date_time` STRING COMMENT '最后更新时间',
`deleted` int COMMENT '是否被删除（禁用）',
`name` STRING COMMENT '姓名',
`idcard` STRING COMMENT '身份证号',
`birth_year` int COMMENT '出生年份',
`gender` STRING COMMENT '性别',
`phone` STRING COMMENT '手机号',
`wechat` STRING COMMENT '微信',
`qq` STRING COMMENT 'qq号',
`email` STRING COMMENT '邮箱',
`area` STRING COMMENT '所在区域',
`leave_school_date` date COMMENT '离校时间',
`graduation_date` date COMMENT '毕业时间',
`bxg_student_id` STRING COMMENT '博学谷学员ID，可能未关联到，不存在',
`creator` int COMMENT '创建人ID',
`origin_type` STRING COMMENT '数据来源',
`origin_channel` STRING COMMENT '来源渠道',
`tenant` int,
`md_id` int COMMENT '中台id')
comment '客户表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.employee (
id int COMMENT '员工id',
email STRING COMMENT '公司邮箱，OA登录账号',
real_name STRING COMMENT '员工的真实姓名',
phone STRING COMMENT '手机号，目前还没有使用；隐私问题OA接口没有提供这个属性，',
department_id STRING COMMENT 'OA中的部门编号，有负值',
department_name STRING COMMENT 'OA中的部门名',
remote_login STRING COMMENT '员工是否可以远程登录',
job_number STRING COMMENT '员工工号',
cross_school STRING COMMENT '是否有跨校区权限',
last_login_date STRING COMMENT '最后登录日期',
creator int COMMENT '创建人',
create_date_time STRING COMMENT '创建时间',
update_date_time STRING COMMENT '最后更新时间',
deleted STRING COMMENT '是否被删除（禁用）',
scrm_department_id int COMMENT 'SCRM内部部门id',
leave_office STRING COMMENT '离职状态',
leave_office_time STRING COMMENT '离职时间',
reinstated_time STRING COMMENT '复职时间',
superior_leaders_id int COMMENT '上级领导ID',
tdepart_id int COMMENT '直属部门',
tenant int COMMENT '租户',
ems_user_name STRING COMMENT 'ems用户名称'
)
comment '员工表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`scrm_department` (
`id` int COMMENT '部门id',
`name` STRING COMMENT '部门名称',
`parent_id` int COMMENT '父部门id',
`create_date_time` STRING COMMENT '创建时间',
`update_date_time` STRING COMMENT '更新时间',
`deleted` STRING COMMENT '删除标志',
`id_path` STRING COMMENT '编码全路径',
`tdepart_code` int COMMENT '直属部门',
`creator` STRING COMMENT '创建者',
`depart_level` int COMMENT '部门层级',
`depart_sign` int COMMENT '部门标志，暂时默认1',
`depart_line` int COMMENT '业务线，存储业务线编码',
`depart_sort` int COMMENT '排序字段',
`disable_flag` int COMMENT '禁用标志',
`tenant` int COMMENT '租户')
comment 'scrm部门表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_school` (
`id` int COMMENT '自增主键',
`create_date_time` timestamp COMMENT '创建时间',
`update_date_time` timestamp COMMENT '最后更新时间',
`deleted` STRING COMMENT '是否被删除（禁用）',
`name` STRING COMMENT '校区名称',
`code` STRING COMMENT '校区标识',
`tenant` int COMMENT '租户')
comment '校区字典表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_subject` (
`id` int COMMENT '自增主键',
`create_date_time` timestamp COMMENT '创建时间',
`update_date_time` timestamp COMMENT '最后更新时间',
`deleted` STRING COMMENT '是否被删除（禁用）',
`name` STRING COMMENT '学科名称',
`code` STRING COMMENT '学科编码',
`tenant` int COMMENT '租户')
comment '学科字典表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

--3.3 构建 DWD层: -- 演示 join优化
CREATE TABLE IF NOT EXISTS itcast_dwd.`itcast_intention_dwd` (
`rid` int COMMENT 'id',
`customer_id` STRING COMMENT '客户id',
`create_date_time` STRING COMMENT '创建时间',
`itcast_school_id` STRING COMMENT '校区id',
`deleted` STRING COMMENT '是否被删除',
`origin_type` STRING COMMENT '来源渠道',
`itcast_subject_id` STRING COMMENT '学科id',
`creator` int COMMENT '创建人',
`hourinfo` STRING COMMENT '小时信息',
`origin_type_stat` STRING COMMENT '数据来源:0.线下；1.线上'
)
comment '客户意向dwd表'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(rid) sorted by(rid) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.4: 构建 DWM层
create database itcast_dwm;
CREATE TABLE IF NOT EXISTS itcast_dwm.`itcast_intention_dwm` (
`customer_id` STRING COMMENT 'id信息',
`create_date_time` STRING COMMENT '创建时间',
`area` STRING COMMENT '区域信息',
`itcast_school_id` STRING COMMENT '校区id',
`itcast_school_name` STRING COMMENT '校区名称',
`deleted` STRING COMMENT '是否被删除',
`origin_type` STRING COMMENT '来源渠道',
`itcast_subject_id` STRING COMMENT '学科id',
`itcast_subject_name` STRING COMMENT '学科名称',
`hourinfo` STRING COMMENT '小时信息',
`origin_type_stat` STRING COMMENT '数据来源:0.线下；1.线上',
`clue_state_stat` STRING COMMENT '新老客户：0.老客户；1.新客户',
`tdepart_id` STRING COMMENT '创建者部门id',
`tdepart_name` STRING COMMENT '咨询中心名称'
)
comment '客户意向dwm表'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(customer_id) sorted by(customer_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.5 构建 DWS 层
CREATE TABLE IF NOT EXISTS itcast_dws.itcast_intention_dws (
`customer_total` INT COMMENT '聚合意向客户数',
`area` STRING COMMENT '区域信息',
`itcast_school_id` STRING COMMENT '校区id',
`itcast_school_name` STRING COMMENT '校区名称',
`origin_type` STRING COMMENT '来源渠道',
`itcast_subject_id` STRING COMMENT '学科id',
`itcast_subject_name` STRING COMMENT '学科名称',
`hourinfo` STRING COMMENT '小时信息',
`origin_type_stat` STRING COMMENT '数据来源:0.线下；1.线上',
`clue_state_stat` STRING COMMENT '客户属性：0.老客户；1.新客户',
`tdepart_id` STRING COMMENT '创建者部门id',
`tdepart_name` STRING COMMENT '咨询中心名称',
`time_str` STRING COMMENT '时间明细',
`groupType` STRING COMMENT '产品属性类别：1.总意向量；2.区域信息；3.校区、学科组合分组；4.来源渠道；5.咨询中心;',
`time_type` STRING COMMENT '时间维度：1、按小时聚合；2、按天聚合；3、按周聚合；4、按月聚合；5、按年聚合；'
)
comment '客户意向dws表'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

4. 意向主题看板案例_数据的采集:
4.1: 完成 DIM层的数据采集:
sqoop import
--connect jdbc:mysql://192.168.52.150:3306/scrm
--username root
--password 123456
--query 'select id, customer_relationship_id, create_date_time, update_date_time, deleted, name, idcard, birth_year, gender, phone, wechat, qq, email, area, leave_school_date, graduation_date, bxg_student_id, creator, origin_type, origin_channel, tenant, md_id, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from customer where $CONDITIONS'
--hcatalog-database itcast_dimen
--hcatalog-table customer
-m 1
--split-by id

sqoop import
--connect jdbc:mysql://192.168.52.150:3306/scrm
--username root
--password 123456
--query 'select id,email,real_name,-1 as phone,department_id,department_name,remote_login,job_number,cross_school,last_login_date,creator,create_date_time,update_date_time,deleted,scrm_department_id,leave_office,leave_office_time,reinstated_time,superior_leaders_id,tdepart_id,tenant,ems_user_name,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from employee where $CONDITIONS'
--hcatalog-database itcast_dimen
--hcatalog-table employee
-m 1
--split-by id

sqoop import
--connect jdbc:mysql://192.168.52.150:3306/scrm
--username root
--password 123456
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from scrm_department where $CONDITIONS'
--hcatalog-database itcast_dimen
--hcatalog-table scrm_department
-m 1
--split-by id

sqoop import
--connect jdbc:mysql://192.168.52.150:3306/scrm
--username root
--password 123456
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from itcast_school where $CONDITIONS'
--hcatalog-database itcast_dimen
--hcatalog-table itcast_school
-m 1
--split-by id

sqoop import
--connect jdbc:mysql://192.168.52.150:3306/scrm
--username root
--password 123456
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from itcast_subject where $CONDITIONS'
--hcatalog-database itcast_dimen
--hcatalog-table itcast_subject
-m 1
--split-by id

4.2: 完成ODS层的数据采集
由于ODS层表时两张桶表数据, 而 sqoop 无法支持桶表数据的导入工作, 此时解决方案: 为对应的桶表构建临时表, 然后通过sqoop将数据导入到临时表
在通过临时表使用 insert into 的方式将数据导入分桶表中即可

4.2.1: 意向表的数据导入
第一步: 创建意向表的临时表结构
CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship_tmp` (
`id` int COMMENT '客户关系id',
`create_date_time` STRING COMMENT '创建时间',
`update_date_time` STRING COMMENT '最后更新时间',
`deleted` int COMMENT '是否被删除（禁用）',
`customer_id` int COMMENT '所属客户id',
`first_id` int COMMENT '第一条客户关系id',
`belonger` int COMMENT '归属人',
`belonger_name` STRING COMMENT '归属人姓名',
`initial_belonger` int COMMENT '初始归属人',
`distribution_handler` int COMMENT '分配处理人',
`business_scrm_department_id` int COMMENT '归属部门',
`last_visit_time` STRING COMMENT '最后回访时间',
`next_visit_time` STRING COMMENT '下次回访时间',
`origin_type` STRING COMMENT '数据来源',
`itcast_school_id` int COMMENT '校区Id',
`itcast_subject_id` int COMMENT '学科Id',
`intention_study_type` STRING COMMENT '意向学习方式',
`anticipat_signup_date` STRING COMMENT '预计报名时间',
`level` STRING COMMENT '客户级别',
`creator` int COMMENT '创建人',
`current_creator` int COMMENT '当前创建人：初始==创建人，当在公海拉回时为拉回人',
`creator_name` STRING COMMENT '创建者姓名',
`origin_channel` STRING COMMENT '来源渠道',
`comment` STRING COMMENT '备注',
`first_customer_clue_id` int COMMENT '第一条线索id',
`last_customer_clue_id` int COMMENT '最后一条线索id',
`process_state` STRING COMMENT '处理状态',
`process_time` STRING COMMENT '处理状态变动时间',
`payment_state` STRING COMMENT '支付状态',
`payment_time` STRING COMMENT '支付状态变动时间',
`signup_state` STRING COMMENT '报名状态',
`signup_time` STRING COMMENT '报名时间',
`notice_state` STRING COMMENT '通知状态',
`notice_time` STRING COMMENT '通知状态变动时间',
`lock_state` STRING COMMENT '锁定状态',
`lock_time` STRING COMMENT '锁定状态修改时间',
`itcast_clazz_id` int COMMENT '所属ems班级id',
`itcast_clazz_time` STRING COMMENT '报班时间',
`payment_url` STRING COMMENT '付款链接',
`payment_url_time` STRING COMMENT '支付链接生成时间',
`ems_student_id` int COMMENT 'ems的学生id',
`delete_reason` STRING COMMENT '删除原因',
`deleter` int COMMENT '删除人',
`deleter_name` STRING COMMENT '删除人姓名',
`delete_time` STRING COMMENT '删除时间',
`course_id` int COMMENT '课程ID',
`course_name` STRING COMMENT '课程名称',
`delete_comment` STRING COMMENT '删除原因说明',
`close_state` STRING COMMENT '关闭装填',
`close_time` STRING COMMENT '关闭状态变动时间',
`appeal_id` int COMMENT '申诉id',
`tenant` int COMMENT '租户',
`total_fee` DECIMAL COMMENT '报名费总金额',
`belonged` int COMMENT '小周期归属人',
`belonged_time` STRING COMMENT '归属时间',
`belonger_time` STRING COMMENT '归属时间',
`transfer` int COMMENT '转移人',
`transfer_time` STRING COMMENT '转移时间',
`follow_type` int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取',
`transfer_bxg_oa_account` STRING COMMENT '转移到博学谷归属人OA账号',
`transfer_bxg_belonger_name` STRING COMMENT '转移到博学谷归属人OA姓名',
`end_time` STRING COMMENT '有效截止时间')
comment '客户关系表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

第二步: 使用sqoop 完成数据导入到临时表:
sqoop import
--connect jdbc:mysql://192.168.52.150:3306/scrm
--username root
--password 123456
--query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name,date_format("9999-12-31","%Y-%m-%d") as end_time, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time from customer_relationship where $CONDITIONS'
--hcatalog-database itcast_ods
--hcatalog-table customer_relationship_tmp
-m 1
--split-by id

--第三步: 将临时表的数据, 在次灌入到 ODS的分桶的意向表中:
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶 set hive.optimize.bucketmapjoin = true;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_ods.customer_relationship partition(start_time)
select * from customer_relationship_tmp;

4.2.2: 将线索表数据导入到ods层的表中
第一步: 建立线索表的临时表:
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp (
id int COMMENT 'customer_clue_id',
create_date_time STRING COMMENT '创建时间',
update_date_time STRING COMMENT '最后更新时间',
deleted STRING COMMENT '是否被删除（禁用）',
customer_id int COMMENT '客户id',
customer_relationship_id int COMMENT '客户关系id',
session_id STRING COMMENT '七陌会话id',
sid STRING COMMENT '访客id',
status STRING COMMENT '状态（undeal待领取 deal 已领取 finish 已关闭 changePeer 已流转）',
users STRING COMMENT '所属坐席',
create_time STRING COMMENT '七陌创建时间',
platform STRING COMMENT '平台来源（pc-网站咨询|wap-wap咨询|sdk-app咨询|weixin-微信咨询）',
s_name STRING COMMENT '用户名称',
seo_source STRING COMMENT '搜索来源',
seo_keywords STRING COMMENT '关键字',
ip STRING COMMENT 'IP地址',
referrer STRING COMMENT '上级来源页面',
from_url STRING COMMENT '会话来源页面',
landing_page_url STRING COMMENT '访客着陆页面',
url_title STRING COMMENT '咨询页面title',
to_peer STRING COMMENT '所属技能组',
manual_time STRING COMMENT '人工开始时间',
begin_time STRING COMMENT '坐席领取时间 ',
reply_msg_count int COMMENT '客服回复消息数',
total_msg_count int COMMENT '消息总数',
msg_count int COMMENT '客户发送消息数',
comment STRING COMMENT '备注',
finish_reason STRING COMMENT '结束类型',
finish_user STRING COMMENT '结束坐席',
end_time STRING COMMENT '会话结束时间',
platform_description STRING COMMENT '客户平台信息',
browser_name STRING COMMENT '浏览器名称',
os_info STRING COMMENT '系统名称',
area STRING COMMENT '区域',
country STRING COMMENT '所在国家',
province STRING COMMENT '省',
city STRING COMMENT '城市',
creator int COMMENT '创建人',
name STRING COMMENT '客户姓名',
idcard STRING COMMENT '身份证号',
phone STRING COMMENT '手机号',
itcast_school_id int COMMENT '校区Id',
itcast_school STRING COMMENT '校区',
itcast_subject_id int COMMENT '学科Id',
itcast_subject STRING COMMENT '学科',
wechat STRING COMMENT '微信',
qq STRING COMMENT 'qq号',
email STRING COMMENT '邮箱',
gender STRING COMMENT '性别',
level STRING COMMENT '客户级别',
origin_type STRING COMMENT '数据来源渠道',
information_way STRING COMMENT '资讯方式',
working_years STRING COMMENT '开始工作时间',
technical_directions STRING COMMENT '技术方向',
customer_state STRING COMMENT '当前客户状态',
valid STRING COMMENT '该线索是否是网资有效线索',
anticipat_signup_date STRING COMMENT '预计报名时间',
clue_state STRING COMMENT '线索状态',
scrm_department_id int COMMENT 'SCRM内部部门id',
superior_url STRING COMMENT '诸葛获取上级页面URL',
superior_source STRING COMMENT '诸葛获取上级页面URL标题',
landing_url STRING COMMENT '诸葛获取着陆页面URL',
landing_source STRING COMMENT '诸葛获取着陆页面URL来源',
info_url STRING COMMENT '诸葛获取留咨页URL',
info_source STRING COMMENT '诸葛获取留咨页URL标题',
origin_channel STRING COMMENT '投放渠道',
course_id int COMMENT '课程编号',
course_name STRING COMMENT '课程名称',
zhuge_session_id STRING COMMENT 'zhuge会话id',
is_repeat int COMMENT '是否重复线索(手机号维度) 0:正常 1：重复',
tenant int COMMENT '租户id',
activity_id STRING COMMENT '活动id',
activity_name STRING COMMENT '活动名称',
follow_type int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取',
shunt_mode_id int COMMENT '匹配到的技能组id',
shunt_employee_group_id int COMMENT '所属分流员工组',
ends_time STRING COMMENT '有效时间')
comment '客户关系表'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

第二步: 使用sqoop 完成数据导入到线索表临时表

sqoop import
--connect jdbc:mysql://192.168.52.150:3306/scrm
--username root
--password 123456
--query 'select id,create_date_time,update_date_time,deleted,customer_id,customer_relationship_id,session_id,sid,status,user as users,create_time,platform,s_name,seo_source,seo_keywords,ip,referrer,from_url,landing_page_url,url_title,to_peer,manual_time,begin_time,reply_msg_count,total_msg_count,msg_count,comment,finish_reason,finish_user,end_time,platform_description,browser_name,os_info,area,country,province,city,creator,name,"-1" as idcard,"-1" as phone,itcast_school_id,itcast_school,itcast_subject_id,itcast_subject,"-1" as wechat,"-1" as qq,"-1" as email,gender,level,origin_type,information_way,working_years,technical_directions,customer_state,valid,anticipat_signup_date,clue_state,scrm_department_id,superior_url,superior_source,landing_url,landing_source,info_url,info_source,origin_channel,course_id,course_name,zhuge_session_id,is_repeat,tenant,activity_id,activity_name,follow_type,shunt_mode_id,shunt_employee_group_id,date_format("9999-12-31","%Y-%m-%d") as ends_time,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as starts_time from customer_clue where $CONDITIONS'
--hcatalog-database itcast_ods
--hcatalog-table customer_clue_tmp
-m 1
--split-by id

第三步: 将临时表的数据, 导入到线索表:

insert into table itcast_ods.customer_clue partition(starts_time)
select * from itcast_ods.customer_clue_tmp;

4.3: 完成数据清洗转换处理工作: ODS的意向表 --> DWD层清洗后的意向表
需要清洗和转换的操作都有哪些?
清洗:
将标记为delete=1进行清除
转换工作:
create_date_time字段, 需要转换出有年月天小时
origin_type 中数据生成一个新的字段 origin_type_stat 用于区分线上和线下
学校id和学科ID，同步时，0和null转换为统一数据，都转换为-1

清洗转换的SQL :
INSERT INTO TABLE itcast_dwd.itcast_intention_dwd partition(yearinfo,monthinfo,dayinfo)
select
id as rid,
customer_id,
create_date_time,
if(itcast_school_id is null or itcast_school_id =0,'-1',itcast_school_id) as itcast_school_id ,
deleted,
origin_type,
if(itcast_subject_id is null or itcast_subject_id =0,'-1',itcast_subject_id) as itcast_subject_id,
creator,
substr(create_date_time,12,2) as hourinfo,
if(origin_type in('NETSERVICE','PRESIGNUP'),'1','0') as origin_type_stat,
substr(create_date_time,1,4) as yearinfo,
substr(create_date_time,6,2) as monthinfo,
substr(create_date_time,9,2) as dayinfo
from itcast_ods.customer_relationship TABLESAMPLE(BUCKET 1 OUT OF 10 on id) as cr where deleted = 0;

--4.4: 完成数据转换操作: DWD --> DWM
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_dwm.itcast_intention_dwm partition(yearinfo,monthinfo,dayinfo)
select
iid.customer_id,
iid.create_date_time,
dcu.area,
iid.itcast_school_id,
dis.name,
iid.deleted,
iid.origin_type,
iid.itcast_subject_id,
disub.name,
iid.hourinfo,
iid.origin_type_stat,
if(cc.clue_state ='VALID_NEW_CLUES' , '1', if(cc.clue_state ='VALID_PUBLIC_NEW_CLUE','0','-1') ) as clue_state_stat, -- 找新老用户
demp.tdepart_id,
dsd.name,
iid.yearinfo,
iid.monthinfo,
iid.dayinfo
from itcast_dwd.itcast_intention_dwd as iid
left join itcast_ods.customer_clue as cc on iid.rid = cc.customer_relationship_id
left join itcast_dimen.itcast_school as dis on dis.id = iid.itcast_school_id
left join itcast_dimen.itcast_subject as disub on disub.id=iid.itcast_subject_id
left join itcast_dimen.customer as dcu on dcu.id = iid.customer_id
left join itcast_dimen.employee as demp on demp.id = iid.creator
left join itcast_dimen.scrm_department as dsd on dsd.id = demp.tdepart_id;

经过测试发现: itcast_intention_dwd 和 customer_clue 产生 SMB的mapjoin优化
其余表均为普通 map join

4.5) 统计分析:
指标: 意向数量
维度:
时间维度: 年月天小时
新老维度:
线上线下
产品属性维度:
地区维度 , 来源渠道, 学科维度, 校区维度 , 各咨询中心

--需求1: 按照月统计新老用户以及线上下产生意向用户数量
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
count(distinct customer_id ) as customer_total,
'-1' as area,
'-1' as itcast_school_id,
'-1' as itcast_school_name,
'-1' as origin_type,
'-1' as itcast_subject_id,
'-1' as itcast_subject_name,
'-1' as hourinfo,
origin_type_stat,
clue_state_stat,
'-1' as tdepart_id,
'-1' as tdepart_name,
concat(yearinfo,'-',monthinfo) as time_str,
'1' as grouptype ,
'4' as time_type,
yearinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwm.itcast_intention_dwm group by yearinfo,monthinfo, clue_state_stat,
origin_type_stat;

-- 需求2: 按照天统计新老用户以及线上下以及各个地区产生意向用户数量
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
count(distinct customer_id ) as customer_total,
area,
'-1' as itcast_school_id,
'-1' as itcast_school_name,
'-1' as origin_type,
'-1' as itcast_subject_id,
'-1' as itcast_subject_name,
'-1' as hourinfo,
origin_type_stat,
clue_state_stat,
'-1' as tdepart_id,
'-1' as tdepart_name,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'2' as grouptype ,
'2' as time_type,
yearinfo,
monthinfo,
dayinfo
from itcast_dwm.itcast_intention_dwm group by yearinfo,monthinfo,dayinfo, clue_state_stat,
origin_type_stat,area;

有效线索主题看板

1. 学习目标

了解有效线索主题看板需求

了解Hive索引的用法

掌握Row Group Index的用法

掌握Bloom Filter Index的用法

能够采集有效线索全量数据

能够使用Hive进行并行操作

掌握Hive常用的判断函数

能够编写有效线索指标的DWD清洗转换SQL

能够编写有效线索指标的DWM中间层SQL

能够编写有效线索指标的DWS业务层SQL

能够导出有效线索指标结果到Mysql

掌握增量数据分析的过程

2. 主题需求

2.1 有效线索转化率

说明：统计期内，访客咨询产生的有效线索的占比。有效线索量 / 咨询量，有效线索指的是拿到电话且电话有效。

展现：线状图。双轴：有效线索量、有效线索转化率。

条件：年、月、线上线下

维度：年、月、线上线下

指标：访客咨询率=有效线索量/咨询量

粒度：天

数据来源：客户管理系统的customer_clue线索表、customer_relationship意向表、customer_appeal申诉表；咨询系统的web_chat_ems访问咨询表

SQL：

--咨询量(暂时以2019年7月的数据为例)：
SELECT
count(1)
FROM
web_chat_ems_2019_07
WHERE
msg_count >= 1
AND create_time >= '2019-07-01'
AND create_time <= '2019-07-15 23:59:59';

--有效线索量：
SELECT
count(1)
FROM
customer_clue cc
LEFT JOIN customer_relationship cr ON cc.customer_relationship_id = cr.id
WHERE
cc.clue_state IN (
'VALID_NEW_CLUES', --新客户新线索
'VALID_PUBLIC_NEW_CLUE' --老客户新线索
)
AND cc.customer_relationship_id NOT IN (
SELECT
ca.customer_relationship_first_id
FROM --投诉表，投诉成功的数据为无效线索
customer_appeal ca
WHERE
ca.appeal_status = 1 AND ca.customer_relationship_first_id != 0
)
AND cr.origin_type IN ('NETSERVICE','PRESIGNUP') --线上（排除挖掘录入量）
AND ! cc.deleted
AND cc.create_date_time <= '2019-07-01'
AND cc.create_date_time <= '2019-07-15 23:59:59';

2.2 有效线索转化率时间段趋势

说明：统计期内，1-24h之间，每个时间段的有效线索转化率。横轴：1-24h，间隔为1h，纵轴：每个时间段的有效线索转化率。

展现：线状图

条件：天、线上线下

维度：天、线上线下

指标：某小时的总有效线索转化率

粒度：区间内小时段（区间内同一个时间点的总有效线索转化率）

数据来源：客户管理系统的customer_clue线索表、customer_relationship意向表、customer_appeal申诉表；咨询系统的web_chat_ems访问咨询表

SQL：同上

2.3 有效线索量

说明：统计期内，新增的咨询客户中，有效线索的数量。

展现：线状图。

条件：年、月、线上线下

维度：年、月、线上线下

指标：有效线索的数量

粒度：天

数据来源：客户管理系统的customer_clue线索表、customer_relationship意向表、customer_appeal申诉表

SQL：同上

2.4 原始数据结构

有效线索指标的原始数据为客户管理系统的customer_clue线索表和customer_relationship意向客户表。

customer_clue是线索事实表，customer_relationship表主要是用来判断数据来源为线上还是线下。

这两张表在意向客户指标的ODS层已经抽取过，此处可以直接复用。

customer_appeal表是线索申诉表，主要用来判断客户线索被投诉无效。

测试数据：已包含在意向客户主题测试sql中，无需重复导入。

CREATE TABLE `customer_appeal` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT '主键',
  `customer_relationship_first_id` int(11) NOT NULL COMMENT '第一条客户关系id',
  `employee_id` int(11) DEFAULT NULL COMMENT '申诉人',
  `employee_name` varchar(64) COLLATE utf8_bin DEFAULT NULL COMMENT '申诉人姓名',
  `employee_department_id` int(11) DEFAULT NULL COMMENT '申诉人部门',
  `employee_tdepart_id` int(11) DEFAULT NULL COMMENT '申诉人所属部门',
  `appeal_status` int(1) NOT NULL COMMENT '申诉状态，0:待稽核 1:无效 2：有效',
  `audit_id` int(11) DEFAULT NULL COMMENT '稽核人id',
  `audit_name` varchar(255) COLLATE utf8_bin DEFAULT NULL COMMENT '稽核人姓名',
  `audit_department_id` int(11) DEFAULT NULL COMMENT '稽核人所在部门',
  `audit_department_name` varchar(255) COLLATE utf8_bin DEFAULT NULL COMMENT '稽核人部门名称',
  `audit_date_time` datetime DEFAULT NULL COMMENT '稽核时间',
  `create_date_time` datetime DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间（申诉时间）',
  `update_date_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间',
  `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '删除标志位',
  `tenant` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `id` (`id`,`appeal_status`) USING BTREE,
  KEY `idx_customer_relationship_first_id` (`customer_relationship_first_id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=2012358 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

3. 建模分析

3.1 指标和维度

根据主题需求，我们来进行指标和维度的提取：

有效线索转化率，公式：有效线索量/咨询量。其中分母咨询量，在我们之前的指标中已经统计过，在这里可以直接使用。因此只需要统计分子有效线索量即可。有效线索指的是拿到电话且电话有效。维度虽然只有年、月、线上线下，但是因为数据粒度具体到天，所以维度统计时也要增加天。维度：年、月、天、线上线下。

有效线索转化率时间段趋势，主要是针对小时段区间内的数据进行统计，跨天数据无需去重。因此维度统计时需要统计到小时维度。维度：天、小时、线上线下。

有效线索量，即有效线索转化率的分子，数据粒度为天。但此处特别指明是新增客户的有效线索，因此统计维度中需要包含此线索所属的客户是新客户还是老客户。维度：年、月、天、线上线下、新旧客户。

总结：

l 指标：有效线索量；

l 维度：年、月、天、小时、线上线下、新旧客户。

3.2 分层设计

最终需要统计的数据维度：年、月、天、小时、线上线下、新旧客户；
将维度分为三类：时间维度（年、月、天、小时）、数据来源（线上线下）和新旧客户类型；
ODS层原始数据包括：主表有效线索、意向客户表（用来判断是新客户还是老客户）；
有效线索数据同样的数据只能录入一次，因此不存在去重的问题，所以可以使用DWM中间层来进行维度关联，并做少量聚合，可被DWS层调用以提高计算速度；
DWS层在DWM层的基础上进行统计，得出数据集市；
因为使用的是帆软BI自定义可视化展现，所以不再提供细分的APP层，直接将DWS数据集市导出到OLAP应用的mysql中即可。

4. 实现

4.1 建模

4.1.1 指标和维度

l指标：有效线索量；

l维度：

l 时间维度：年、月、天、小时

l 数据来源：线上、线下

l 客户类型：新客户线索、老客户线索

4.1.2 事实表和维度表

事实表：customer_clue线索表

维表：

customer_relationship意向客户表，主要为了判断数据来源为线上还是线下，也是意向客户指标的事实表。
customer_appeal线索申诉表，主要为了判断线索数据是否有效。

4.1.3 Hive索引

Hive支持索引，但是Hive的索引与关系型数据库中的索引并不相同，比如，Hive不支持主键或者外键。

Hive索引可以建立在表中的某些列上，以提升一些操作的效率，例如减少MapReduce任务中需要读取的数据块的数量。

在可以预见到分区数据非常庞大的情况下，分桶和索引常常是优于分区的。而分桶由于SMB Join对关联键要求严格，所以并不是总能生效。

4.1.3.1 Hive原始索引

Hive的索引目的是提高Hive表指定列的查询速度。

没有索引时，类似'WHERE tab1.col1 = 10' 的查询，Hive会加载整张表或分区，然后处理所有的rows，但是如果在字段col1上面存在索引时，那么只会加载和处理文件的一部分。

在每次建立、更新数据后，Hive索引不会自动更新，需要手动进行更新（重建索引以构建索引表），会触发一个mr job。

Hive索引使用过程繁杂，而且性能一般，在Hive3.0中已被删除，在工作环境中不推荐优先使用，在分区数量过多或查询字段不是分区字段时，索引可以作为补充方案同时使用。推荐使用ORC文件格式的索引类型进行查询。

4.1.3.2 Row Group Index

一个ORC文件包含一个或多个stripes(groups of row data)，每个stripe中包含了每个column的min/max值的索引数据，当查询中有<,>,=的操作时，会根据min/max值，跳过扫描不包含的stripes。

而其中为每个stripe建立的包含min/max值的索引，就称为Row Group Index行组索引，也叫min-max Index大小对比索引，或者Storage Index。

在建立ORC格式表时，指定表参数’orc.create.index’=’true’之后，便会建立Row Group Index，需要注意的是，为了使Row Group Index有效利用，向表中加载数据时，必须对需要使用索引的字段进行排序，否则，min/max会失去意义。另外，这种索引主要用于数值型字段的查询过滤优化上。

设置hive.optimize.index.filter为true，并重启hive

创建表/插入数据：

CREATE TABLE lxw1234_orc2 stored AS ORC
TBLPROPERTIES
(
    'orc.compress'='SNAPPY',
--     开启行组索引
    'orc.create.index'='true'
)
AS
    SELECT CAST(siteid AS INT) AS id,
    pcid
    FROM lxw1234_text
--     插入的数据保持排序
    DISTRIBUTE BY id sort BY id;

查询：

set hive.optimize.index.filter=true;
SELECT COUNT(1) FROM lxw1234_orc1 WHERE id >= 1382 AND id <= 1399;

4.1.3.3 Bloom Filter Index

在建表时候，通过表参数”orc.bloom.filter.columns”=”pcid”来指定为那些字段建立BloomFilter索引，这样，在生成数据的时候，会在每个stripe中，为该字段建立BloomFilter的数据结构，当查询条件中包含对该字段的=号过滤时候，先从BloomFilter中获取以下是否包含该值，如果不包含，则跳过该stripe。

创建：

CREATE TABLE lxw1234_orc2 stored AS ORC
TBLPROPERTIES
(
    'orc.compress'='SNAPPY',
    'orc.create.index'='true',
--     pcid字段开启BloomFilter索引
    "orc.bloom.filter.columns"="pcid"
)
AS
SELECT CAST(siteid AS INT) AS id,
pcid
FROM lxw1234_text
DISTRIBUTE BY id sort BY id;

查询：

SET hive.optimize.index.filter=true;
SELECT COUNT(1) FROM lxw1234_orc1 WHERE id >= 0 AND id <= 1000
AND pcid IN ('0005E26F0DCCDB56F9041C','A');

只有在数据量较大时，使用索引才能带来性能优势。

4.1.4 分层

ODS层可以复用意向客户指标，无需重复创建。

因为线索数据不会重复（不用distinct），所以可以采用DWM中间层过度。

4.1.4.1 ODS

customer_clue线索表和customer_relationship意向客户表复用意向客户指标的ODS层。

写入时压缩生效

set hive.exec.orc.compression.strategy=COMPRESSION;

4.1.4.1.1 customer_appeal线索申诉表

分桶表，sqoop抽取数据是不支持的，但是索引表是支持的。

CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.`customer_appeal` (
  `id` int COMMENT 'customer_appeal_id',
  `customer_relationship_first_id` int COMMENT '第一条客户关系id',
  `employee_id` int COMMENT '申诉人',
  `employee_name` STRING COMMENT '申诉人姓名',
  `employee_department_id` int COMMENT '申诉人部门',
  `employee_tdepart_id` int COMMENT '申诉人所属部门',
  `appeal_status` int COMMENT '申诉状态，0:待稽核 1:无效 2：有效',
  `audit_id` int COMMENT '稽核人id',
  `audit_name` STRING COMMENT '稽核人姓名',
  `audit_department_id` int COMMENT '稽核人所在部门',
  `audit_department_name` STRING COMMENT '稽核人部门名称',
  `audit_date_time` STRING COMMENT '稽核时间',
  `create_date_time` STRING COMMENT '创建时间（申诉时间）',
  `update_date_time` STRING COMMENT '更新时间',
  `deleted` STRING COMMENT '删除标志位',
  `tenant` int COMMENT '租户id')
comment '客户申诉表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.create.index'='true','orc.bloom.filter.columns'='appeal_status,customer_relationship_first_id');

4.1.4.2 DWD

过滤已删除数据(deleted)、线索状态clue_state为空的数据；

此处回顾SMB Join的用法：关联意向表的customer_relationship_id如果为空则转换为-1。

customer_clue线索表、customer_relationship意向表可以获取并转换得到新老客户、线上线下。在意向客户主题已经建立过这两个ODS表，类似的关联查询可以通过分桶提高查询效率，我们在这两个字段上分别建立分桶，而且要保证他们的排序和分桶的数量是一致的。

CREATE TABLE IF NOT EXISTS itcast_dwd.itcast_clue_dwd (
   `id` STRING COMMENT '线索id',
   `customer_relationship_id` int COMMENT '客户关系id',
   `origin_type_stat` STRING COMMENT '数据来源:0.线下；1.线上',
   `for_new_user` STRING COMMENT '0:未知；1：新客户线索；2：旧客户线索',
   `deleted` STRING COMMENT '是否删除',
   `create_date_time` BIGINT COMMENT '创建时间',
   `hourinfo` STRING COMMENT '小时信息'
)
comment '客户申诉dwd表'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '	'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.create.index'='true','orc.bloom.filter.columns'='customer_relationship_id');

4.1.4.3 DWM

DWM层过滤投诉数据，并以小时进行统计。

分区并不是越多越好，统计后数据量变小，可以年作为分区。

CREATE TABLE IF NOT EXISTS itcast_dwm.itcast_clue_dwm (
   `clue_nums` STRING COMMENT '根据id聚合',
   `origin_type_stat` STRING COMMENT '数据来源:0.线下；1.线上',
   `for_new_user` STRING COMMENT '0:未知；1：新客户线索；2：旧客户线索',
   `hourinfo` STRING COMMENT '小时信息',
   `dayinfo`STRING COMMENT '天信息',
   `monthinfo` STRING COMMENT '月信息'
)
comment '客户申诉dwm表'
PARTITIONED BY(yearinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');

4.1.4.4 DWS

针对不同的时间维度进行统计，方便OLAP系统使用。

CREATE TABLE IF NOT EXISTS itcast_dws.itcast_clue_dws (
   `clue_nums` INT COMMENT '根据id聚合',
   `origin_type_stat` STRING COMMENT '数据来源:0.线下；1.线上',
   `for_new_user` STRING COMMENT '0:未知；1：新客户线索；2：旧客户线索',
   `hourinfo` STRING COMMENT '小时信息',
   `dayinfo`STRING COMMENT '天信息',
   `monthinfo` STRING COMMENT '月信息',
   `time_type` STRING COMMENT '聚合时间类型：1、按小时聚合；2、按天聚合；3、按周聚合；4、按月聚合；5、按年聚合；',
   `time_str` STRING COMMENT '时间明细'
)
comment '客户申诉app表'
PARTITIONED BY(yearinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');

4.1.4.5 APP

如果用户需要具体的报表展示，可以针对不同的报表页面设计APP层结构，然后导出至OLAP系统的mysql中。此系统使用FineReport，需要通过宽表来进行灵活的展现。因此APP层不再进行细化。直接将DWS层导出至mysql即可。

4.2 全量流程

4.2.1 数据采集

4.2.1.1 Customer_clue线索表、customer_relationship表

ODS复用意向客户指标，不需要重复采集数据。

4.2.1.2 customer_appeal表

SQL：

select `id`,

`customer_relationship_first_id`,

`employee_id`,

`employee_name`,

`employee_department_id`,

`employee_tdepart_id`,

`appeal_status`,

`audit_id`,

`audit_name`,

`audit_department_id`,

`audit_department_name`,

`audit_date_time`,

`create_date_time`,

`update_date_time`,

`deleted`,

`tenant`,

DATE_SUB(curdate(),INTERVAL 1 DAY) as start_time

from customer_appeal

Sqoop：

sqoop import

--connect jdbc:mysql://192.168.52.150:3306/scrm

--username root

--password 123456

--query 'select `id`,`customer_relationship_first_id`,`employee_id`,`employee_name`,`employee_department_id`,`employee_tdepart_id`,`appeal_status`,`audit_id`,`audit_name`,`audit_department_id`,`audit_department_name`,`audit_date_time`,`create_date_time`,`update_date_time`,`deleted`,`tenant`,DATE_SUB(curdate(),INTERVAL 1 DAY) as start_time from customer_appeal where $CONDITIONS'

--hcatalog-database itcast_ods

--hcatalog-table customer_appeal

-m 100

--split-by id

4.2.2 数据清洗转换

4.2.2.1 分析

一、清洗：

过滤已删除数据(deleted)、线索状态clue_state为空的数据；

二、转换：

关联意向表的customer_relationship_id字段如果为空则转换为-1。

customer_clue线索表、customer_relationship意向表可以获取并转换得到新老客户、线上线下。

customer_clue线索表获取clue_state信息，将clue_state状态转换为新老客户：如果clue_state状态为VALID_NEW_CLUES，则为新客户，为VALID_PUBLIC_NEW_CLUE，则为老客户，否则为无效数据。

此处回顾SMB Join的用法，customer_clue线索表的customer_relationship_id字段与customer_relationship表的id字段进行关联。customer_relationship意向客户主表，将origin_type来源渠道字段转换为线上/线下：如果origin_type是NETSERVICE和PRESIGNUP类型，即为1线上，否则为0线下。

在意向客户主题已经建立过这两个ODS表，类似的关联查询可以通过分桶提高查询效率，我们在这两个字段上分别建立分桶，而且要保证他们的排序和分桶的数量是一致的。

类似的关联查询可以通过分桶提高查询效率，我们在这两个字段上分别建立分桶，而且要保证他们的排序和分桶的数量是一致的。在意向客户主题已经建立过这两个ODS表。

4.2.2.2 代码

4.2.2.3 测试

customer_relationship可以采用分桶采样的方式进行测试，以提升执行效率。注意tablesample关键字所在的位置。

通过执行计划，可以看到分桶后的Join查询，使用了SMB Join进行优化。

INSERT INTO itcast_dwd.itcast_clue_dwd PARTITION (yearinfo, monthinfo, dayinfo)
SELECT
……
FROM itcast_ods.customer_clue TABLESAMPLE(BUCKET 1 OUT OF 10 ON customer_relationship_id) clue
LEFT JOIN itcast_ods.customer_relationship rs on clue.customer_relationship_id = rs.id
where clue.clue_state is not null AND clue.deleted = 'false';

4.2.3 统计分析

4.2.3.1 分析

在DWD层通过关联转换获取到客户线索是否有效，以及线索的来源是线上还是线下。

DWM层会在DWD层的基础之上，判断线索是否被客服投诉；

因为不涉及去重问题，此处简单的按照最细粒度维度打包进行统计，便于上层的数据集市按需取数。其中时间维度使用最小粒度的小时维度。

DWS根据需要的维度，分别从DWM获取数据后进行二次聚合，因为DWM已经打包统计过一次，数据较少，所以DWS的统计效率会比较高。

4.2.3.2 代码

4.2.3.2.1 DWM

4.2.3.2.1.1 实现

INSERT into itcast_dwm.itcast_clue_dwm partition (yearinfo)
SELECT
    count(dwd.id) as clue_nums,
    dwd.origin_type_stat,
    dwd.for_new_user,
    dwd.hourinfo,
    dwd.dayinfo,
    dwd.monthinfo,
    dwd.yearinfo
from itcast_dwd.itcast_clue_dwd dwd
WHERE
dwd.customer_relationship_id NOT IN
(
    SELECT a.customer_relationship_first_id from itcast_ods.customer_appeal a
    WHERE a.appeal_status = 1 and a.customer_relationship_first_id != 0
)
GROUP BY dwd.yearinfo, dwd.monthinfo, dwd.dayinfo, dwd.hourinfo, dwd.origin_type_stat, dwd.for_new_user;

4.2.3.2.1.2 验证索引性能

创建无索引表并导入数据：

CREATE TABLE IF NOT EXISTS itcast_dwd.itcast_clue_dwd_test (
   `id` STRING COMMENT '线索id',
   `customer_relationship_id` STRING COMMENT '客户关系id',
   `origin_type_stat` STRING COMMENT '数据来源:0.线下；1.线上',
   `for_new_user` STRING COMMENT '0:未知；1：新客户线索；2：旧客户线索',
   `deleted` STRING COMMENT '是否删除',
   `create_date_time` BIGINT COMMENT '创建时间',
   `hourinfo` STRING COMMENT '小时信息'
)
comment '客户申诉dwd表'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.`customer_appeal_test` (
  `id` int COMMENT 'customer_appeal_id',
  `customer_relationship_first_id` int COMMENT '第一条客户关系id',
  `employee_id` int COMMENT '申诉人',
  `employee_name` STRING COMMENT '申诉人姓名',
  `employee_department_id` int COMMENT '申诉人部门',
  `employee_tdepart_id` int COMMENT '申诉人所属部门',
  `appeal_status` int COMMENT '申诉状态，0:待稽核 1:无效 2：有效',
  `audit_id` int COMMENT '稽核人id',
  `audit_name` STRING COMMENT '稽核人姓名',
  `audit_department_id` int COMMENT '稽核人所在部门',
  `audit_department_name` STRING COMMENT '稽核人部门名称',
  `audit_date_time` STRING COMMENT '稽核时间',
  `create_date_time` STRING COMMENT '创建时间（申诉时间）',
  `update_date_time` STRING COMMENT '更新时间',
  `deleted` STRING COMMENT '删除标志位',
  `tenant` int COMMENT '租户id')
comment '客户申诉表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

INSERT INTO itcast_ods.customer_appeal_test PARTITION (start_time)
SELECT * from itcast_ods.customer_appeal;

INSERT INTO itcast_dwd.itcast_clue_dwd_test PARTITION (yearinfo, monthinfo, dayinfo)
SELECT
    clue.id,
    clue.customer_relationship_id,
    if(rs.origin_type='NETSERVICE' or rs.origin_type='PRESIGNUP', '1', '0') as origin_type_stat,
    if(clue.clue_state='VALID_NEW_CLUES', 1, if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', 2, 0)) as for_new_user,
    clue.deleted,
    clue.create_date_time,
    substr(clue.create_date_time, 12, 2) as hourinfo,
    substr(clue.create_date_time, 1, 4) as yearinfo,
    substr(clue.create_date_time, 6, 2) as monthinfo,
    substr(clue.create_date_time, 9, 2) as dayinfo
FROM itcast_ods.customer_clue TABLESAMPLE(BUCKET 1 OUT OF 10 ON customer_relationship_id) clue
LEFT JOIN itcast_ods.customer_relationship rs on clue.customer_relationship_id = rs.id
where clue.clue_state is not null AND clue.deleted = 'false';

explain
SELECT
    count(dwd.id) as clue_nums,
    dwd.origin_type_stat,
    dwd.for_new_user,
    dwd.hourinfo,
    dwd.dayinfo,
    dwd.monthinfo,
    dwd.yearinfo
from itcast_dwd.itcast_clue_dwd_test dwd
WHERE
dwd.customer_relationship_id NOT IN
(
    SELECT a.customer_relationship_first_id from itcast_ods.customer_appeal_test a
    WHERE a.appeal_status = 1 and a.customer_relationship_first_id != 0
)
GROUP BY dwd.yearinfo, dwd.monthinfo, dwd.dayinfo, dwd.hourinfo, dwd.origin_type_stat, dwd.for_new_user;

通过执行计划，可以明显看到，加上索引以后，查询所读取的数据大小缩小了10倍以上，数据量越大提升越大：

4.2.3.2.2 DWS

DWS根据需要的维度，分别从DWM获取数据后进行二次聚合，因为DWM已经打包统计过一次，数据较少，所以DWS的统计效率会比较高。

小时数据和DWM层的数据是一致的，可以直接拿来使用。而年月日数据，则需要group by以后执行sum求和操作。

小时：

--小时
INSERT INTO itcast_clue_dws PARTITION(yearinfo)
SELECT
    clue_nums,
    origin_type_stat,
    for_new_user,
    hourinfo,
    dayinfo,
    monthinfo,
    '1' as time_type,
    concat(dwm.yearinfo,'-',dwm.monthinfo,'-',dwm.dayinfo,' ', dwm.hourinfo) as time_str,
    yearinfo
from itcast_dwm.itcast_clue_dwm dwm;

年月日数据：

--天
INSERT INTO itcast_clue_dws PARTITION(yearinfo)
SELECT
    sum(clue_nums) as clue_nums,
    origin_type_stat,
    for_new_user,
    '-1' as hourinfo,
    dayinfo,
    monthinfo,
    '2' as time_type,
    concat(dwm.yearinfo,'-',dwm.monthinfo,'-',dwm.dayinfo) as time_str,
    yearinfo
from itcast_dwm.itcast_clue_dwm dwm
GROUP BY dwm.yearinfo, dwm.monthinfo, dwm.dayinfo, dwm.origin_type_stat, dwm.for_new_user;

--月
INSERT INTO itcast_clue_dws PARTITION(yearinfo)
SELECT
    sum(clue_nums) as clue_nums,
    origin_type_stat,
    for_new_user,
    '-1' as hourinfo,
    '-1' asdayinfo,
    monthinfo,
    '4' as time_type,
    concat(dwm.yearinfo,'-',dwm.monthinfo) as time_str,
    yearinfo
from itcast_dwm.itcast_clue_dwm dwm
GROUP BY dwm.yearinfo, dwm.monthinfo, dwm.origin_type_stat, dwm.for_new_user;

--年
INSERT INTO itcast_clue_dws PARTITION(yearinfo)
SELECT
    sum(clue_nums) as clue_nums,
    origin_type_stat,
    for_new_user,
    '-1' as hourinfo,
    '-1' as dayinfo,
    '-1' as monthinfo,
    '5' as time_type,
    dwm.yearinfo as time_str,
    yearinfo
from itcast_dwm.itcast_clue_dwm dwm
GROUP BY dwm.yearinfo, dwm.origin_type_stat, dwm.for_new_user;

4.2.4 导出数据

4.2.4.1 创建Mysql表

CREATE TABLE `itcast_clue` (
   `clue_nums` INT(11) COMMENT '有效线索量',
   `origin_type_stat` varchar(32) COMMENT '数据来源:0.线下；1.线上',
   `for_new_user` varchar(32) COMMENT '0:未知；1：新客户线索；2：旧客户线索',
   `hourinfo` varchar(32) COMMENT '小时信息',
   `dayinfo` varchar(32) COMMENT '日信息',
   `monthinfo` varchar(32) COMMENT '月信息',
   `time_type` varchar(32) COMMENT '聚合时间类型：1、按小时聚合；2、按天聚合；3、按周聚合；4、按月聚合；5、按年聚合；',
   `time_str` varchar(32) COMMENT '时间明细',
   `yearinfo` varchar(32) COMMENT '年信息'
);

4.2.4.2 Sqoop导出脚本

sqoop export

--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8"

--username root

--password 123456

--table itcast_clue

--hcatalog-database itcast_dws

--hcatalog-table itcast_clue_dws

-m 100

4.3 增量流程

4.3.1 数据采集

4.3.1.1 customer_relationship表

ODS复用意向客户指标，不需要重复采集数据。

4.3.1.2 customer_appeal表

此表数据较少，因此可以直接全部覆盖同步，同全量过程。

4.3.2 数据清洗转换

4.3.2.1 分析

customer_clue表是一个拉链表，会保存数据的历史状态。因为业务方将更新周期限制在30天内，所以只需查询更新30天内的数据即可。

因此在统计时，我们只需要将上个月1日至今的数据进行统计。

4.3.2.2 代码

4.3.2.2.1 SQL：

4.3.2.2.2 Shell脚本：

#! /bin/bash

#采集日期

if [[ $1 == "" ]];then

TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`

else

TD_DATE=$1

${HIVE_HOME} -S -e "

--分区

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;

set hive.exec.max.dynamic.partitions.pernode=10000;

set hive.exec.max.dynamic.partitions=100000;

set hive.exec.max.created.files=150000;

--hive压缩

set hive.exec.compress.intermediate=true;

set hive.exec.compress.output=true;

--写入时压缩生效

set hive.exec.orc.compression.strategy=COMPRESSION;

--分桶

set hive.enforce.bucketing=true;

set hive.enforce.sorting=true;

set hive.optimize.bucketmapjoin = true;

set hive.auto.convert.sortmerge.join=true;

set hive.auto.convert.sortmerge.join.noconditionaltask=true;

--并行执行

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=8;

INSERT INTO itcast_dwd.itcast_clue_dwd PARTITION (yearinfo, monthinfo, dayinfo)

SELECT

clue.id,

clue.customer_relationship_id,

if(rs.origin_type='NETSERVICE' or rs.origin_type='PRESIGNUP', '1', '0') as origin_type_stat,

if(clue.clue_state='VALID_NEW_CLUES', 1, if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', 2, 0)) as for_new_user,

clue.deleted,

clue.create_date_time,

substr(clue.create_date_time, 12, 2) as hourinfo,

substr(clue.create_date_time, 1, 4) as yearinfo,

substr(clue.create_date_time, 6, 2) as monthinfo,

substr(clue.create_date_time, 9, 2) as dayinfo

FROM itcast_ods.customer_clue TABLESAMPLE(BUCKET 1 OUT OF 10 ON customer_relationship_id) clue

LEFT JOIN itcast_ods.customer_relationship rs on clue.customer_relationship_id = rs.id

where clue.clue_state is not null AND clue.deleted = 'false' substr(clue.create_date_time, 1, 10) >= '${TD_DATE}';--2019-11-01

4.3.3 统计分析

增量统计时，只需要统计上个月1日至今的数据。

4.3.3.1 DWM

4.3.3.1.1 SQL

--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入表时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
--并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;
 
INSERT into itcast_dwm.itcast_clue_dwm partition (yearinfo)
SELECT
    count(dwd.id) as clue_nums,
    dwd.origin_type_stat,
    dwd.for_new_user,
    dwd.hourinfo,
    dwd.dayinfo,
    dwd.monthinfo,
    dwd.yearinfo
from itcast_dwd.itcast_clue_dwd dwd
WHERE 
dwd.customer_relationship_id NOT IN
(
    SELECT a.customer_relationship_first_id from itcast_ods.customer_appeal a
    WHERE a.appeal_status = 1 and a.customer_relationship_first_id != 0
)
AND concat_ws('-',dwd.yearinfo,dwd.monthinfo,dwd.dayinfo)>= '2019-11-01'
GROUP BY dwd.yearinfo, dwd.monthinfo, dwd.dayinfo, dwd.hourinfo, dwd.origin_type_stat, dwd.for_new_user;

4.3.3.1.2 Shell脚本

#! /bin/bash

#上个月1日

Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)

${HIVE_HOME} -S -e "

--分区

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;

set hive.exec.max.dynamic.partitions.pernode=10000;

set hive.exec.max.dynamic.partitions=100000;

set hive.exec.max.created.files=150000;

--hive压缩

set hive.exec.compress.intermediate=true;

set hive.exec.compress.output=true;

--写入表时压缩生效

set hive.exec.orc.compression.strategy=COMPRESSION;

--分桶

set hive.enforce.bucketing=true;

set hive.enforce.sorting=true;

set hive.optimize.bucketmapjoin = true;

set hive.auto.convert.sortmerge.join=true;

set hive.auto.convert.sortmerge.join.noconditionaltask=true;

--并行执行

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=16;

INSERT into itcast_dwm.itcast_clue_dwm partition (yearinfo)

SELECT

count(dwd.id) as clue_nums,

dwd.origin_type_stat,

dwd.for_new_user,

dwd.hourinfo,

dwd.dayinfo,

dwd.monthinfo,

dwd.yearinfo

from itcast_dwd.itcast_clue_dwd dwd

WHERE

dwd.customer_relationship_id NOT IN

(

SELECT a.customer_relationship_first_id from itcast_ods.customer_appeal a

WHERE a.appeal_status = 1 and a.customer_relationship_first_id != 0

)

AND concat_ws('-',dwd.yearinfo,dwd.monthinfo,dwd.dayinfo)>= '$Last_Month_DATE'

GROUP BY dwd.yearinfo, dwd.monthinfo, dwd.dayinfo, dwd.hourinfo, dwd.origin_type_stat, dwd.for_new_user;

4.3.3.2 DWS

统计上个月1日至今的数据。

4.3.3.2.1 SQL

--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
--并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;
--小时
INSERT INTO itcast_dws.itcast_clue_dws PARTITION(yearinfo)
SELECT
    clue_nums,
    origin_type_stat,
    for_new_user,
    hourinfo,
    dayinfo,
    monthinfo,
    '1' as time_type,
    concat(dwm.yearinfo,'-',dwm.monthinfo,'-',dwm.dayinfo,' ', dwm.hourinfo) as time_str,
    yearinfo
from itcast_dwm.itcast_clue_dwm dwm
where CONCAT_WS('-', dwm.yearinfo, dwm.monthinfo, dwm.dayinfo)>='${Last_Month_DATE}';

--天
INSERT INTO itcast_dws.itcast_clue_dws PARTITION(yearinfo)
SELECT
    sum(clue_nums) as clue_nums,
    origin_type_stat,
    for_new_user,
    '-1' as hourinfo,
    dayinfo,
    monthinfo,
    '2' as time_type,
    concat(dwm.yearinfo,'-',dwm.monthinfo,'-',dwm.dayinfo) as time_str,
    yearinfo
from itcast_dwm.itcast_clue_dwm dwm
where CONCAT_WS('-', dwm.yearinfo, dwm.monthinfo, dwm.dayinfo)>='${Last_Month_DATE}'
GROUP BY dwm.yearinfo, dwm.monthinfo, dwm.dayinfo, dwm.origin_type_stat, dwm.for_new_user;

--月
INSERT INTO itcast_dws.itcast_clue_dws PARTITION(yearinfo)
SELECT
    sum(clue_nums) as clue_nums,
    origin_type_stat,
    for_new_user,
    '-1' as hourinfo,
    '-1' asdayinfo,
    monthinfo,
    '4' as time_type,
    concat(dwm.yearinfo,'-',dwm.monthinfo) as time_str,
    yearinfo
from itcast_dwm.itcast_clue_dwm dwm
where CONCAT_WS('-', dwm.yearinfo, dwm.monthinfo)>='${V_Month}'
GROUP BY dwm.yearinfo, dwm.monthinfo, dwm.origin_type_stat, dwm.for_new_user;

--年
INSERT INTO itcast_dws.itcast_clue_dws PARTITION(yearinfo)
SELECT
    sum(clue_nums) as clue_nums,
    origin_type_stat,
    for_new_user,
    '-1' as hourinfo,
    '-1' as dayinfo,
    '-1' as monthinfo,
    '5' as time_type,
    dwm.yearinfo as time_str,
    yearinfo
from itcast_dwm.itcast_clue_dwm dwm
where dwm.yearinfo>='${V_Year}'
GROUP BY dwm.yearinfo, dwm.origin_type_stat, dwm.for_new_user;

4.3.3.2.2 Shell脚本

#! /bin/bash

#上个月1日

Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)

#根据TD_DATE计算年季度月日

V_PARYEAR=`date --date="$Last_Month_DATE" +%Y`

V_PARMONTH=`date --date="$Last_Month_DATE" +%m`

V_PARDAY=`date --date="$Last_Month_DATE" +%d`

V_month_for_quarter=`date --date="$Last_Month_DATE" +%-m`

V_PARQUARTER=$(((${V_month_for_quarter}-1)/3+1))

#计算所需要的日期字符串

V_Month="${V_PARYEAR}"_"${V_PARMONTH}"

V_QUARTER="${V_PARYEAR}"_Q"${V_PARQUARTER}"

V_Year="${V_PARYEAR}"

${HIVE_HOME} -S -e "

--分区

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;

set hive.exec.max.dynamic.partitions.pernode=10000;

set hive.exec.max.dynamic.partitions=100000;

set hive.exec.max.created.files=150000;

--分桶

set hive.enforce.bucketing=true;

set hive.enforce.sorting=true;

set hive.optimize.bucketmapjoin = true;

set hive.auto.convert.sortmerge.join=true;

set hive.auto.convert.sortmerge.join.noconditionaltask=true;

--并行执行

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=16;

--小时

INSERT INTO itcast_dws.itcast_clue_dws PARTITION(yearinfo)

SELECT

clue_nums,

origin_type_stat,

for_new_user,

hourinfo,

dayinfo,

monthinfo,

'1' as time_type,

concat(dwm.yearinfo,'-',dwm.monthinfo,'-',dwm.dayinfo,' ', dwm.hourinfo) as time_str,

yearinfo

from itcast_dwm.itcast_clue_dwm dwm

where CONCAT_WS('-', dwm.yearinfo, dwm.monthinfo, dwm.dayinfo)>='${Last_Month_DATE}';

--天

INSERT INTO itcast_dws.itcast_clue_dws PARTITION(yearinfo)

SELECT

sum(clue_nums) as clue_nums,

origin_type_stat,

for_new_user,

'-1' as hourinfo,

dayinfo,

monthinfo,

'2' as time_type,

concat(dwm.yearinfo,'-',dwm.monthinfo,'-',dwm.dayinfo) as time_str,

yearinfo

from itcast_dwm.itcast_clue_dwm dwm

where CONCAT_WS('-', dwm.yearinfo, dwm.monthinfo, dwm.dayinfo)>='${Last_Month_DATE}'

GROUP BY dwm.yearinfo, dwm.monthinfo, dwm.dayinfo, dwm.origin_type_stat, dwm.for_new_user;

--月

INSERT INTO itcast_dws.itcast_clue_dws PARTITION(yearinfo)

SELECT

sum(clue_nums) as clue_nums,

origin_type_stat,

for_new_user,

'-1' as hourinfo,

'-1' asdayinfo,

monthinfo,

'4' as time_type,

concat(dwm.yearinfo,'-',dwm.monthinfo) as time_str,

yearinfo

from itcast_dwm.itcast_clue_dwm dwm

where CONCAT_WS('-', dwm.yearinfo, dwm.monthinfo)>='${V_Month}'

GROUP BY dwm.yearinfo, dwm.monthinfo, dwm.origin_type_stat, dwm.for_new_user;

--年

INSERT INTO itcast_dws.itcast_clue_dws PARTITION(yearinfo)

SELECT

sum(clue_nums) as clue_nums,

origin_type_stat,

for_new_user,

'-1' as hourinfo,

'-1' as dayinfo,

'-1' as monthinfo,

'5' as time_type,

dwm.yearinfo as time_str,

yearinfo

from itcast_dwm.itcast_clue_dwm dwm

where dwm.yearinfo>='${V_Year}'

GROUP BY dwm.yearinfo, dwm.origin_type_stat, dwm.for_new_user;

4.3.4 导出数据

#! /bin/bash
SQOOP_HOME=/usr/bin/sqoop
HOST=192.168.52.150
USERNAME="root"

PASSWORD="123456"
PORT=3306
DBNAME="scrm_bi"
MYSQL=/usr/local/mysql_5723/bin/mysql

#上个月1日

if [[ $1 == "" ]];then
Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)
else
Last_Month_DATE=$1
fi
${MYSQL} -h${HOST} -P${PORT} -u${USERNAME} -p${PASSWORD} -D${DBNAME} -e "delete from itcast_clue where yearinfo = '${Last_Month_DATE:0:4}'"
${SQOOP_HOME} export
--connect "jdbc:mysql://${HOST}:${PORT}/${DBNAME}?useUnicode=true&characterEncoding=utf-8"
--username ${USERNAME}
--password ${PASSWORD}
--table itcast_clue
--hcatalog-database itcast_dws
--hcatalog-table itcast_clue_dws
--hcatalog-partition-keys yearinfo
--hcatalog-partition-values ${TD_DATE:0:4}
-m 100

查看全文

相关阅读:
另一种遍历Map的方式： Map.Entry 和 Map.entrySet()
mycat 插入语句导致的一个Dobbo问题
 Json数据处理
 List与字符串转换
 MySQL中四舍五入的实现
 java连接mysql ：No Suitable Driver Found For Jdbc 解决方法
 Linux中printf格式化输出
 bat隐藏文件夹
 Python 3.5.2建立与DB2的连接
 Python 爬虫实例

原文地址：https://www.cnblogs.com/shan13936/p/14060319.html

有效线索主题看板 阿善有用 清洗转换具体怎么做

有效线索主题看板

1. 学习目标

2. 主题需求

2.1 有效线索转化率

2.2 有效线索转化率时间段趋势

2.3 有效线索量

2.4 原始数据结构

3. 建模分析

3.1 指标和维度

3.2 分层设计

4. 实现

4.1 建模

4.1.1 指标和维度

4.1.2 事实表和维度表

4.1.3 Hive索引

4.1.3.1 Hive原始索引

4.1.3.2 Row Group Index

4.1.3.3 Bloom Filter Index

4.1.4 分层

4.1.4.1 ODS

4.1.4.1.1 customer_appeal线索申诉表

4.1.4.2 DWD

4.1.4.3 DWM

4.1.4.4 DWS

4.1.4.5 APP

4.2 全量流程

4.2.1 数据采集

4.2.1.1 Customer_clue线索表、customer_relationship表

4.2.1.2 customer_appeal表

4.2.2 数据清洗转换

4.2.2.1 分析

4.2.2.2 代码

4.2.2.3 测试

4.2.3 统计分析

4.2.3.1 分析

4.2.3.2 代码

4.2.3.2.1 DWM

4.2.3.2.2 DWS

4.2.4 导出数据

4.2.4.1 创建Mysql表

4.2.4.2 Sqoop导出脚本

4.3 增量流程

4.3.1 数据采集

4.3.1.1 customer_relationship表

4.3.1.2 customer_appeal表

4.3.2 数据清洗转换

4.3.2.1 分析

4.3.2.2 代码

4.3.2.2.1 SQL：

4.3.2.2.2 Shell脚本：

4.3.3 统计分析

4.3.3.1 DWM

4.3.3.1.1 SQL

4.3.3.1.2 Shell脚本

4.3.3.2 DWS

4.3.3.2.1 SQL

4.3.3.2.2 Shell脚本

4.3.4 导出数据

有效线索主题看板阿善有用清洗转换具体怎么做