  • Interactive Data Processing

      I. Data Preprocessing

      1. Inspecting the Data

      Copy sogou.500w.utf8, a 547 MB file containing 5,000,000 Sogou web access log records, to /home/jun/Resources, then view it with the less command: page through it with PgUp/PgDn and press q to quit.
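
      For example, a minimal sketch of these two steps (the source path of the downloaded file is an assumption; adjust it to wherever the file actually lives):

    [jun@master ~]$ cp ~/Downloads/sogou.500w.utf8 /home/jun/Resources/
    [jun@master ~]$ less /home/jun/Resources/sogou.500w.utf8

      The first few records of the file: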

    20111230000013  b89952902d7821db37e8999776b32427        怎么骂一个人不带脏字    2       2       http://zhidao.baidu.com/question/224925866
    20111230000013  072fa3643c91b29bd586aff29b402161        暴力破解无线网络密码    2       1       http://download.csdn.net/detail/think1919/3935722
    20111230000014  f31f594bd1f3147298bd952ba35de84d        12306.cn        1       1       http://www.12306.cn/

      Each record has six fields, separated by a single Tab character: access time, user ID, query keywords, the rank of the clicked result in the returned list, the sequence number of the user's click, and the URL the user clicked.
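
      As a quick sanity check of this layout (a sketch; -F '\t' matches the Tab delimiter described above), you can print the field count and the keywords column for the first few records:

    [jun@master Resources]$ head -3 sogou.500w.utf8 | awk -F '\t' '{print NF, $3}'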

      Use the wc command to count the file's lines (-l), words (-w), and bytes (-c). The results below show there really are 5,000,000 records, which genuinely earns the name big data.

    [jun@master Resources]$ wc -l sogou.500w.utf8 
    5000000 sogou.500w.utf8
    [jun@master Resources]$ wc -w sogou.500w.utf8 
    30436251 sogou.500w.utf8
    [jun@master Resources]$ wc -c sogou.500w.utf8 
    573670020 sogou.500w.utf8

      You can also use the head command to extract part of the file:

    [jun@master Resources]$ head -200 sogou.500w.utf8 > sogou.200.utf8
    [jun@master Resources]$ wc -l sogou.200.utf8 
    200 sogou.200.utf8

      2. Extending the Data

      The first field has the form 20111230000013. To make statistics easier, extract it and split it into four new fields for the year (2011), month (12), day (30), and hour (00), then append these four fields to the end of each record.

      A shell script can do the job. The first argument is the input file, the second the output file; an awk statement re-edits each line's fields. Note that awk's substr() is 1-indexed, so the year, month, day, and hour begin at offsets 1, 5, 7, and 9 of the first field.

    #!/bin/bash
    # Append year, month, day and hour columns parsed from the timestamp
    # in field 1 (e.g. 20111230000013 -> 2011, 12, 30, 00).
    infile=$1
    outfile=$2
    awk -F '\t' '{print $0"\t"substr($1,1,4)"\t"substr($1,5,2)"\t"substr($1,7,2)"\t"substr($1,9,2)}' "$infile" > "$outfile"

      Grant the script execute permission, then run it:

    [jun@master Resources]$ chmod +x sogou-log-extend.sh
    [jun@master Resources]$ ./sogou-log-extend.sh sogou.500w.utf8 sogou.500w.utf8.ext

      Inspecting the generated file shows that four fields have indeed been appended to each record:

    20111230000013  e0d255845fc9e66b2a25c43a70de4a9a        无饶河,益慎职 意思     3       1       http://hanyu.iciba.com/wiki/1230433.shtml       2011    12      30      00
    20111230000013  b89952902d7821db37e8999776b32427        怎么骂一个人不带脏字    2       2       http://zhidao.baidu.com/question/224925866      2011    12      30      00
    20111230000013  072fa3643c91b29bd586aff29b402161        暴力破解无线网络密码    2       1       http://download.csdn.net/detail/think1919/3935722       2011    12      30      00

      3. Filtering the Data

      Some of the 5,000,000 records are incomplete, missing one or more fields. To keep only reasonably complete records, filter out those whose 2nd or 3rd field is empty, again with a shell script.

    #!/bin/bash
    # Keep only records whose 2nd (uid) and 3rd (keywords) fields are non-empty.
    infile=$1
    outfile=$2
    awk -F '\t' '{if($2 != "" && $3 != "" && $2 != " " && $3 != " ") print $0}' "$infile" > "$outfile"

      Grant the script execute permission, then run it:

    [jun@master Resources]$ chmod +x sogou-log-filter.sh 
    [jun@master Resources]$ ./sogou-log-filter.sh sogou.500w.utf8.ext sogou.500w.utf8.flt
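
      To see how many records the filter dropped, a line-count comparison of the two files is enough (a routine check; output omitted):

    [jun@master Resources]$ wc -l sogou.500w.utf8.ext sogou.500w.utf8.flt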

      4. Uploading the Data

      With the preprocessed data in hand, the analysis runs on the Hadoop platform, so the file has to be put on HDFS. Make sure Hadoop is running, then create a directory:

    [jun@master Resources]$ hadoop fs -mkdir /sogou_ext
    [jun@master Resources]$ hadoop fs -mkdir /sogou_ext/20180724

      Upload the file into the new directory:

    [jun@master Resources]$ hadoop fs -put ~/Resources/sogou.500w.utf8.flt /sogou_ext/20180724/
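
      A directory listing confirms the upload (a routine check; output omitted):

    [jun@master Resources]$ hadoop fs -ls /sogou_ext/20180724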

      II. Creating the Data Warehouse

      1. Create a New Hive Database

      Start Hive:

    [jun@master Resources]$ cd /home/jun/apache-hive-2.3.3-bin/bin         
    [jun@master bin]$ ./hive

      Run the command to create the database, then switch to it and list its tables (none exist yet):

    hive> create database sogou;
    OK
    Time taken: 6.412 seconds
    hive> show databases;
    OK
    default
    sogou
    test_db
    Time taken: 0.146 seconds, Fetched: 3 row(s)
    hive> use sogou;
    OK
    Time taken: 0.019 seconds
    hive> show tables;
    OK
    Time taken: 0.035 seconds

      2. Create an external table that includes the extended fields:

    hive> create external table sogou.sogou_ext_20180724(
        > time string,
        > uid string,
        > keywords string,
        > rank int,
        > ordering int,
        > url string,
        > year int,
        > month int,
        > day int,
        > hour int)
        > comment 'This is the sogou search data extend'
        > row format delimited
    > fields terminated by '\t'
        > stored as textfile
        > location '/sogou_ext/20180724';
    OK
    Time taken: 0.758 seconds
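
      To double-check the schema, you can describe the new table (a routine check; output omitted):

    hive> describe sogou.sogou_ext_20180724;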

      3. Create a partitioned table, using the last four time fields as partition keys:

    hive> create external table sogou.sogou_partition(
        > time string,
        > uid string,
        > keywords string,
        > rank int,
        > ordering int,
        > url string)
        > partitioned by (
        > year int,
        > month int,
        > day int,
        > hour int)
        > row format delimited
    > fields terminated by '\t'
        > stored as textfile
        > ;
    OK
    Time taken: 0.416 seconds

      4. Import the data. Because the external table was created with location '/sogou_ext/20180724', Hive reads its data from that path automatically; the dynamic-partition insert below then populates the partitioned table from it.

    hive> set hive.exec.dynamic.partition.mode=nonstrict;      
    hive> insert overwrite table sogou.sogou_partition partition(year,month,day,hour) select * from sogou.sogou_ext_20180724;
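
      Once the insert finishes, the partitions it created can be listed (a routine check; output omitted):

    hive> show partitions sogou.sogou_partition;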

      Query the imported data:

    hive> select * from sogou_ext_20180724 limit 5
        > ;
    OK
    20111230000005    57375476989eea12893c0c3811607bcf    奇艺高清    1    1    http://www.qiyi.com/    2011    12    30    0
    20111230000005    66c5bb7774e31d0a22278249b26bc83a    凡人修仙传    3    1    http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1    2011    12    30    0
    20111230000007    b97920521c78de70ac38e3713f524b50    本本联盟    1    1    http://www.bblianmeng.com/    2011    12    30    0
    20111230000008    6961d0c97fe93701fc9c0d861d096cd9    华南师范大学图书馆    1    1    http://lib.scnu.edu.cn/    2011    12    30    0
    20111230000008    f2f5a21c764aebde1e8afcc2871e086f    在线代理    2    1    http://proxyie.cn/    2011    12    30    0
    Time taken: 0.187 seconds, Fetched: 5 row(s)

      III. Data Analysis

      1. Basic Statistics

      (1) Count the total number of records

    hive> select count(*) from sogou_ext_20180724;
    Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
    2018-07-24 17:32:24,305 Stage-1 map = 0%,  reduce = 0%
    2018-07-24 17:32:38,146 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 9.42 sec
    2018-07-24 17:32:39,211 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 17.38 sec
    2018-07-24 17:32:41,956 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 33.68 sec
    2018-07-24 17:32:51,246 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 38.55 sec
    MapReduce Total cumulative CPU time: 38 seconds 550 msec
    Ended Job = job_1532414392815_0003
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 38.55 sec   HDFS Read: 643703894 HDFS Write: 107 SUCCESS
    Total MapReduce CPU Time Spent: 38 seconds 550 msec
    OK
    5000000
    Time taken: 45.712 seconds, Fetched: 1 row(s)

      (2) Count the records whose keywords field is non-empty (the result equals the total because the filtering step already removed records with an empty keywords field):

    hive> select count(*) from sogou_ext_20180724 where keywords is not null and keywords!='';
    Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
    2018-07-24 17:34:28,102 Stage-1 map = 0%,  reduce = 0%
    2018-07-24 17:34:39,467 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 6.55 sec
    2018-07-24 17:35:04,710 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 56.56 sec
    2018-07-24 17:35:07,843 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 57.75 sec
    2018-07-24 17:35:08,875 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 60.35 sec
    MapReduce Total cumulative CPU time: 1 minutes 0 seconds 350 msec
    Ended Job = job_1532414392815_0004
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 60.35 sec   HDFS Read: 643705555 HDFS Write: 107 SUCCESS
    Total MapReduce CPU Time Spent: 1 minutes 0 seconds 350 msec
    OK
    5000000
    Time taken: 49.97 seconds, Fetched: 1 row(s)

      (3) Count the distinct uids

    hive> select count(distinct(uid)) from sogou_ext_20180724;
    Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
    2018-07-24 17:36:46,624 Stage-1 map = 0%,  reduce = 0%
    2018-07-24 17:37:00,919 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 10.85 sec
    2018-07-24 17:37:02,995 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 46.25 sec
    2018-07-24 17:37:10,351 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 54.47 sec
    MapReduce Total cumulative CPU time: 54 seconds 470 msec
    Ended Job = job_1532414392815_0005
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 54.47 sec   HDFS Read: 643704766 HDFS Write: 107 SUCCESS
    Total MapReduce CPU Time Spent: 54 seconds 470 msec
    OK
    1352664
    Time taken: 33.131 seconds, Fetched: 1 row(s)

      (4) Keyword length statistics: the query below computes the average number of whitespace-separated terms per query; the result is about 1.09 terms, since most Chinese queries contain no spaces.

    hive> select avg(a.cnt) from (select size(split(keywords,'\s+')) as cnt from sogou.sogou_ext_20180724) a;
    Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
    2018-07-24 17:42:10,425 Stage-1 map = 0%,  reduce = 0%
    2018-07-24 17:42:24,710 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 11.31 sec
    2018-07-24 17:42:31,858 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 49.17 sec
    2018-07-24 17:42:34,182 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 53.05 sec
    2018-07-24 17:42:36,285 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 55.35 sec
    MapReduce Total cumulative CPU time: 55 seconds 350 msec
    Ended Job = job_1532414392815_0006
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 55.35 sec   HDFS Read: 643705682 HDFS Write: 109 SUCCESS
    Total MapReduce CPU Time Spent: 55 seconds 350 msec
    OK
    1.0869984
    Time taken: 34.047 seconds, Fetched: 1 row(s)
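
      As a complementary metric (a sketch, not part of the original analysis), the average keyword length in characters could be computed with Hive's length() function:

    hive> select avg(length(keywords)) from sogou.sogou_ext_20180724;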

      (5) The 20 most frequent keywords

    hive> select keywords, count(*) as cnt from sogou.sogou_ext_20180724 group by keywords order by cnt desc limit 20;
    Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
    2018-07-24 17:45:47,831 Stage-2 map = 0%,  reduce = 0%
    2018-07-24 17:45:56,536 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 6.49 sec
    2018-07-24 17:45:57,613 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 10.82 sec
    2018-07-24 17:46:01,833 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 12.62 sec
    MapReduce Total cumulative CPU time: 12 seconds 620 msec
    Ended Job = job_1532414392815_0008
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 75.89 sec   HDFS Read: 643711008 HDFS Write: 62953072 SUCCESS
    Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 12.62 sec   HDFS Read: 62961345 HDFS Write: 949 SUCCESS
    Total MapReduce CPU Time Spent: 1 minutes 28 seconds 510 msec
    OK
    百度    38441
    baidu    18312
    人体艺术    14475
    4399小游戏    11438
    qq空间    10317
    优酷    10158
    新亮剑    9654
    馆陶县县长闫宁的父亲    9127
    公安卖萌    8192
    百度一下 你就知道    7505
    百度一下    7104
    4399    7041
    魏特琳    6665
    qq网名    6149
    7k7k小游戏    5985
    黑狐    5610
    儿子与母亲不正当关系    5496
    新浪微博    5369
    李宇春体    5310
    新疆暴徒被击毙图片    4997
    Time taken: 95.338 seconds, Fetched: 20 row(s)

      (6) Distribution of query counts: how many uids issued exactly 1, 2, 3, or more than 3 queries. (As a sanity check, the four buckets below sum to 1,352,664, matching the distinct-uid count from (3).)

    hive> select sum(if(uids.cnt=1,1,0)), sum(if(uids.cnt=2,1,0)), sum(if(uids.cnt=3,1,0)), sum(if(uids.cnt>3,1,0)) from (select uid, count(*) as cnt from sogou.sogou_ext_20180724 group by uid) uids;
    MapReduce Total cumulative CPU time: 5 seconds 690 msec
    Ended Job = job_1532414392815_0020
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 82.1 sec   HDFS Read: 643715334 HDFS Write: 384 SUCCESS
    Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 5.69 sec   HDFS Read: 9325 HDFS Write: 127 SUCCESS
    Total MapReduce CPU Time Spent: 1 minutes 27 seconds 790 msec
    OK
    549148    257163    149562    396791
    Time taken: 62.601 seconds, Fetched: 1 row(s)

      (7) Average number of queries per uid (i.e., total records divided by distinct uids)

    hive> select sum(a.cnt)/count(a.uid) from (select uid,count(*) as cnt from sogou.sogou_ext_20180724 group by uid) a;
    MapReduce Total cumulative CPU time: 6 seconds 610 msec
    Ended Job = job_1532414392815_0010
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 70.07 sec   HDFS Read: 643712322 HDFS Write: 363 SUCCESS
    Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 6.61 sec   HDFS Read: 9207 HDFS Write: 118 SUCCESS
    Total MapReduce CPU Time Spent: 1 minutes 16 seconds 680 msec
    OK
    3.6964094557111005
    Time taken: 89.135 seconds, Fetched: 1 row(s)

      (8) Number of users with more than 2 queries

    hive> select count(a.cnt) from (select uid,count(*) as cnt from sogou.sogou_ext_20180724 group by uid having cnt > 2 ) a;
    MapReduce Total cumulative CPU time: 4 seconds 790 msec
    Ended Job = job_1532414392815_0012
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 70.04 sec   HDFS Read: 643713027 HDFS Write: 351 SUCCESS
    Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 4.79 sec   HDFS Read: 7712 HDFS Write: 106 SUCCESS
    Total MapReduce CPU Time Spent: 1 minutes 14 seconds 830 msec
    OK
    546353
    Time taken: 61.16 seconds, Fetched: 1 row(s)

      (9) Sample records of users with more than 2 queries

    hive> select b.* from 
        > (select uid,count(*) as cnt from sogou.sogou_ext_20180724 group by uid having cnt>2) a
        > join sogou.sogou_ext_20180724 b on a.uid=b.uid
        > limit 20;
    MapReduce Total cumulative CPU time: 3 minutes 40 seconds 190 msec
    Ended Job = job_1532414392815_0014
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 73.96 sec   HDFS Read: 643711740 HDFS Write: 27591098 SUCCESS
    Stage-Stage-2: Map: 5  Reduce: 3   Cumulative CPU: 220.19 sec   HDFS Read: 671324193 HDFS Write: 9785 SUCCESS
    Total MapReduce CPU Time Spent: 4 minutes 54 seconds 150 msec
    OK
    20111230222158    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    6    3    http://bbs.17500.cn/thread-2453170-1-1.html    2011    12    30    22
    20111230222603    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    10    5    http://www.18888.com/read-htm-tid-6069520.html    2011    12    30    22
    20111230222128    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    5    2    http://www.zibocn.com/Infor/i8513.html    2011    12    30    22
    20111230222802    000080fd3eaf6b381e33868ec6459c49    福彩3d单选号码走势图    1    1    http://zst.cjcp.com.cn/cjw3d/view/3d_danxuan.php    2011    12    30    22
    20111230222417    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    7    4    http://bbs.18888.com/read-htm-tid-4017348.html    2011    12    30    22
    20111230220953    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    4    1    http://www.55125.cn/3djq/20111103_352210.htm    2011    12    30    22
    20111230211504    0000c2d1c4375c8a827bff5dab0cc0a6    穿越小说txt    3    2    http://www.booktxt.com/chuanyue/    2011    12    30    21
    20111230213029    0000c2d1c4375c8a827bff5dab0cc0a6    浮生若梦txt    1    1    http://ishare.iask.sina.com.cn/f/15694326.html?from=like    2011    12    30    21
    20111230211319    0000c2d1c4375c8a827bff5dab0cc0a6    穿越小说txt    2    1    http://www.zlsy.net.cn/    2011    12    30    21
    20111230213047    0000c2d1c4375c8a827bff5dab0cc0a6    浮生若梦txt    2    2    http://www.txtinfo.com/txtshow/txt6105.html    2011    12    30    21
    20111230205803    0000c2d1c4375c8a827bff5dab0cc0a6    步步惊心歌曲    4    1    http://www.tingge123.com/zhuanji/1606.shtml    2011    12    30    20
    20111230205643    0000c2d1c4375c8a827bff5dab0cc0a6    步步惊心主题曲    4    1    http://bubujingxin.net/music.shtml    2011    12    30    20
    20111230212531    0000c2d1c4375c8a827bff5dab0cc0a6    乱世公主txt    1    1    http://ishare.iask.sina.com.cn/f/20689380.html    2011    12    30    21
    20111230210041    0000c2d1c4375c8a827bff5dab0cc0a6    步步惊心歌曲    5    2    http://www.yue365.com/mlist/10981.shtml    2011    12    30    21
    20111230213911    0000c2d1c4375c8a827bff5dab0cc0a6    浮生若梦小说在线阅读    2    1    http://www.readnovel.com/partlist/22004/    2011    12    30    21
    20111230213835    0000c2d1c4375c8a827bff5dab0cc0a6    浮生若梦小说txt下载    2    1    http://www.2yanqing.com/f_699993/244670/download.html    2011    12    30    21
    20111230195312    0000d08ab20f78881a2ada2528671c58    棉花价格    3    3    http://www.yz88.org.cn/jg/    2011    12    30    19
    20111230195114    0000d08ab20f78881a2ada2528671c58    棉花价格    2    2    http://www.cnjidan.com/mianhua.asp    2011    12    30    19
    20111230200339    0000d08ab20f78881a2ada2528671c58    棉花价格最新    2    2    http://www.yz88.org.cn/jg/    2011    12    30    20
    20111230195652    0000d08ab20f78881a2ada2528671c58    棉花价格行情走势图    1    1    http://www.yz88.org.cn/jg/    2011    12    30    19

      2. User Behavior Analysis

      (1) The first 10 results a search engine returns are exactly the ones on the first results page, so count how many records clicked a result ranked in the top 10:

    hive> select count(*) from sogou.sogou_ext_20180724 where rank<11;
    MapReduce Total cumulative CPU time: 23 seconds 180 msec
    Ended Job = job_1532414392815_0015
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 23.18 sec   HDFS Read: 643705566 HDFS Write: 107 SUCCESS
    Total MapReduce CPU Time Spent: 23 seconds 180 msec
    OK
    4999869
    Time taken: 29.25 seconds, Fetched: 1 row(s)

      A total of 4,999,869 records have a rank of 10 or less, which means essentially all users only view the first page of results.
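
      The same conclusion can be expressed as a proportion in a single query (a sketch; given the counts above it should come to about 99.997%):

    hive> select sum(if(rank < 11, 1, 0)) / count(*) from sogou.sogou_ext_20180724;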

      (2) Some users enter keywords; others cannot remember a site's full domain name and use the search engine to find the site they want to visit. To count such records, the query below uses a LIKE pattern to count records whose keywords contain "www". As the result shows, the vast majority of users do not query with a URL.

    hive> select count(*) from sogou.sogou_ext_20180724 where keywords like '%www%';
    MapReduce Total cumulative CPU time: 27 seconds 960 msec
    Ended Job = job_1532414392815_0016
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 27.96 sec   HDFS Read: 643705515 HDFS Write: 105 SUCCESS
    Total MapReduce CPU Time Spent: 27 seconds 960 msec
    OK
    73979
    Time taken: 28.339 seconds, Fetched: 1 row(s)

      We can also count how many times a user who entered a URL then clicked exactly that URL in the results. It turns out that 27561/73979 = 37% of the users who submitted a URL query went on to click the queried URL, presumably because they could not remember the complete URL and used the search engine to find it. This suggests an improvement for the search engine: when handling this kind of query, return the matching complete URL first, which would very likely improve these users' experience and better meet their needs.

    hive> select sum(if(instr(url,keywords)>0,1,0)) from (select * from sogou.sogou_ext_20180724 where keywords like '%www%' ) a;
    MapReduce Total cumulative CPU time: 32 seconds 220 msec
    Ended Job = job_1532414392815_0017
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 32.22 sec   HDFS Read: 643706391 HDFS Write: 105 SUCCESS
    Total MapReduce CPU Time Spent: 32 seconds 220 msec
    OK
    27561
    Time taken: 27.895 seconds, Fetched: 1 row(s)

      (3) To find out how many people are fans of 仙剑奇侠传 (The Legend of Sword and Fairy), query the uids that searched for it more than 3 times. Two users qualify, with 6 and 5 searches respectively.

    hive> select uid,count(*) as cnt from sogou.sogou_ext_20180724 where keywords='仙剑奇侠传' group by uid having cnt > 3;
    Ended Job = job_1532414392815_0018
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 47.1 sec   HDFS Read: 643717341 HDFS Write: 355 SUCCESS
    Total MapReduce CPU Time Spent: 47 seconds 100 msec
    OK
    653d48aa356d5111ac0e59f9fe736429    6
    e11c6273e337c1d1032229f1b2321a75    5
    Time taken: 39.244 seconds, Fetched: 2 row(s)
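
      To drill into what one of these fans actually searched, their records can be pulled by uid (a sketch using the first uid from the result above):

    hive> select time, keywords, url from sogou.sogou_ext_20180724
        > where uid = '653d48aa356d5111ac0e59f9fe736429' limit 10;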

      3. Real-Time Data

      In practice, to display the search engine's current statistics in real time, you first create some staging tables; at the end of each day the data is processed and inserted into these tables for the display layer to show.

      (1) Create the staging table

    hive> create table sogou.uid_cnt(uid string, cnt int)
        > comment 'This is the sogou search data of one day'
        > row format delimited
    > fields terminated by '\t'
        > stored as textfile;
    OK
    Time taken: 0.488 seconds

      (2) Insert the data

    hive> insert overwrite table sogou.uid_cnt select uid, count(*) as cnt
        > from sogou.sogou_ext_20180724 group by uid;
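
      A quick consistency check (a sketch; the count should equal the 1,352,664 distinct uids found earlier):

    hive> select count(*) from sogou.uid_cnt;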

      (3) View the data

    hive> select * from  uid_cnt limit 20;
    OK
    00005c113b97c0977c768c13a6ffbb95    2
    000080fd3eaf6b381e33868ec6459c49    6
    0000c2d1c4375c8a827bff5dab0cc0a6    10
    0000d08ab20f78881a2ada2528671c58    9
    0000e7482034da216ce878a9f16feb49    5
    0001520a31ed091fa857050a5df35554    1
    0001824d091de069b4e5611aad47463d    1
    0001894c9f9de37ef9c90b6e5a456767    2
    0001b04bf9473458af40acb4c13f1476    1
    0001f5bacf60b0ff8c1c9e66e4905c1f    2
    000202ae03f7acc86d5ae784b4bf56ba    1
    0002b0dfc0b974b05f246acc590694ea    2
    0002c93607740aa5919c0de3645639cb    1
    000312ca0eaa91c30e5bafbcf2981bfd    21
    00032480797f1578f8fc83f47e180a77    1
    00032937ee88388581c86aa910b2a85b    1
    0003dbdb7fca09669a9784c6aaaf3bb1    6
    00043047d46f5e49dfcf15979b1bd49d    11
    00043fcb1a34d32bb06c0dfa35fb199b    3
    00047c0822b036bc1b473d9373fda199    1
    Time taken: 0.16 seconds, Fetched: 20 row(s)

      Front-end developers can then query this staging table and display the data in whatever form the application requires, such as tables or charts.
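
      For instance, a dashboard might show the day's most active users (a hypothetical display query, not from the original write-up):

    hive> select uid, cnt from sogou.uid_cnt order by cnt desc limit 10;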

     
