  • group by and distinct in Hive

    Preface

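    Only today did I realize clearly that group by also has a deduplicating effect. On reflection this makes sense: when rows are grouped by some column, identical values necessarily fall into the same group, which amounts to deduplication. Let's look at it in detail.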

    group by
    • Example 1:
    hive> select * from test;
    OK
    zhao	15	20170807
    zhao	14	20170809
    zhao	15	20170809
    zhao	16	20170809
    
    hive> select name from test;
    OK
    zhao
    zhao
    zhao
    zhao
    
    hive> select name from test group by name;
    
    ...
    
    OK
    zhao
    Time taken: 40.273 seconds, Fetched: 1 row(s)
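
    One way to see why the grouped query returns a single row is to attach an aggregate to it. A minimal sketch against the same test table; the alias cnt is only illustrative:

    ```sql
    -- Every distinct value of name becomes one group, hence one output row.
    -- count(*) shows how many input rows were collapsed into that group.
    select name, count(*) as cnt
    from test
    group by name;
    ```

    With the four rows shown above this should return a single row, zhao with a count of 4.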
    

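    Grouping by name leaves a single row in the result, so the query has deduplicated the column. Strictly speaking, deduplication only collapses rows that are identical in every grouped column, so let's try two columns: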

    hive> select name,age from test group by name,age;
    ...
    
    OK
    zhao	14
    zhao	15
    zhao	16
    Time taken: 36.943 seconds, Fetched: 3 row(s)
    
    hive> select name,age from test group by name;
    FAILED: SemanticException [Error 10025]: Line 1:12 Expression not in GROUP BY key 'age'
    

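    Grouping by name alone fails: with name as the only key, each group holds several different age values and Hive has no rule for choosing one of them, so the error is reasonable.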
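
    If one row per name really is the goal, the usual way around Error 10025 is to tell Hive how to combine the differing age values with an aggregate function. A sketch using the built-in max, min and collect_set; the column aliases are made up for illustration:

    ```sql
    -- One row per name: every column outside the group by key
    -- must be wrapped in an aggregate.
    select name,
           max(age)         as max_age,  -- largest age in the group
           min(age)         as min_age,  -- smallest age in the group
           collect_set(age) as ages      -- array of distinct ages, order not guaranteed
    from test
    group by name;
    ```

    For the sample data, the single zhao row should carry 16 as max_age and 14 as min_age.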

    distinct

    This one is also simple; it just removes duplicate rows:

    hive> select distinct name from test;
    ...
    
    OK
    zhao
    Time taken: 37.047 seconds, Fetched: 1 row(s)
    
    hive> select distinct name,age from test;
    OK
    zhao	14
    zhao	15
    zhao	16
    Time taken: 39.131 seconds, Fetched: 3 row(s)
    
    hive> select distinct(name),age from test;
    OK
    zhao	14
    zhao	15
    zhao	16
    Time taken: 37.739 seconds, Fetched: 3 row(s)
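
    The last query returns the same three rows as the one before it. The parentheses do not restrict distinct to name: distinct modifies the whole select list, and (name) is just a parenthesized expression. Read that way, the following three statements should be interchangeable on this table:

    ```sql
    -- distinct applies to the (name, age) pair, not to name alone.
    select distinct name, age from test;

    -- Same query; the parentheses only wrap the expression name.
    select distinct (name), age from test;

    -- group by over the same columns yields the same set of rows.
    select name, age from test group by name, age;
    ```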
    
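    Differences
    • With large data volumes, distinct tends to be less efficient, so group by is generally recommended.
    • As for the reason, I recommend this article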
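
    The case often cited for this difference is count(distinct ...). A commonly suggested rewrite deduplicates with group by in a subquery and counts the result in the outer query. This is a sketch against the test table above, not a guaranteed speedup: the subquery alias t is arbitrary, and whether the rewrite helps depends on the Hive version, execution engine and data, so it is worth comparing the plans with explain.

    ```sql
    -- Direct form: count the distinct names in a single aggregation.
    select count(distinct name) from test;

    -- Rewritten form: deduplicate first, then count the deduplicated rows.
    -- count(name) skips a NULL group, matching count(distinct name).
    select count(name)
    from (select name from test group by name) t;

    -- Show the plan Hive generates, to compare against the direct form.
    explain
    select count(name)
    from (select name from test group by name) t;
    ```

    On the sample data both counting queries should return 1.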
  • Original article: https://www.cnblogs.com/wswang/p/7718085.html