  • The effect of the DISTINCT keyword on execution plans

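    I. Introduction

    I recently came across this statement: "For count(distinct column), if the column is indexed and either has a NOT NULL constraint or is filtered with IS NOT NULL in the WHERE clause, the optimizer chooses an index fast full scan. In all other cases it chooses a full table scan." I did not understand the reasoning behind it, so I ran the experiment below.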


    II. Preparation

    1. Create table t1

    SQL> create table t1 as select * from dba_objects;
    SQL> insert into t1 select * from t1;
    SQL> insert into t1 select * from t1;
    SQL> commit;


    2. Set object_name to NULL in a small number of rows

    SQL> update t1 set object_name = null where owner = 'SCOTT';


    3. Create an ordinary (B-tree) index on object_name

    SQL> create index idx_t1_name on t1(object_name);


    4. Gather statistics for t1 and its index

    SQL> begin
           dbms_stats.gather_table_stats(ownname => 'SCOTT',
           tabname => 'T1',
           estimate_percent => 100,
           cascade => true,
           no_invalidate => false,
           degree => 4);
         end;
         /
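
    To confirm that the statistics were actually gathered, the data dictionary can be queried. This is a minimal sketch that assumes the session is connected as the owner of t1 (SCOTT in this experiment); the output is not shown because it was not part of the original test.

    ```sql
    -- Table-level statistics: num_rows should match count(*) after a 100% sample
    select table_name, num_rows, last_analyzed
      from user_tables
     where table_name = 'T1';

    -- Index statistics: num_rows counts index entries, not table rows
    select index_name, num_rows, distinct_keys, leaf_blocks
      from user_indexes
     where index_name = 'IDX_T1_NAME';

    -- Column statistics: nullable flag and number of NULLs known to the optimizer
    select column_name, nullable, num_nulls, num_distinct
      from user_tab_columns
     where table_name = 'T1'
       and column_name = 'OBJECT_NAME';
    ```

    Comparing num_rows in user_tables with num_rows in user_indexes is a quick way to see whether the index holds fewer entries than the table has rows.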

     

    5. Count the rows in t1 and the values in object_name

    SQL> select count(*), count(object_name), count(distinct object_name) from t1;

      COUNT(*) COUNT(OBJECT_NAME) COUNT(DISTINCTOBJECT_NAME)
    ---------- ------------------ --------------------------
         54068              54060                      10472

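    The preparation is now complete. Table t1 has 54068 rows, of which 54060 have a non-NULL object_name. The second number is smaller than the total because count(column) does not count rows where that column is NULL.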
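
    The number of NULL rows can also be checked directly. The query below is a small sketch (the column aliases are made up for illustration) that computes it in two ways. Given the counts above, both columns should come out as 54068 - 54060 = 8.

    ```sql
    select count(*) - count(object_name)                   as nulls_by_difference,
           count(case when object_name is null then 1 end) as nulls_direct
      from t1;
    ```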
         
         

    III. Examining the execution plans

    Run the following four SQL statements and compare their execution plans:
    a. select count(object_name) from t1;    
    b. select count(object_name) from t1 where object_name is not null;        
    c. select count(distinct object_name) from t1 where object_name is not null;        
    d. select count(distinct object_name) from t1;
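
    The runs below use SQL*Plus autotrace. As an alternative sketch, the plans can be displayed without fetching the query results by using EXPLAIN PLAN together with DBMS_XPLAN.DISPLAY, shown here for sql(d) only. The plan shown this way is the optimizer's estimate and may not match the plan used at run time in every case.

    ```sql
    -- Show the estimated plan for sql(d) without executing the query
    explain plan for
    select count(distinct object_name) from t1;

    select * from table(dbms_xplan.display);

    -- SQL*Plus alternative: print only the plan for subsequent statements
    -- set autotrace traceonly explain
    ```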

        

    1. Run sql(a)

    SQL> set autot on
    SQL> select count(object_name) from t1;

    COUNT(OBJECT_NAME)
    ------------------
                 54060
    
    -------------------------------------------------------------------------------------
    | Id  | Operation             | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
    -------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT      |             |     1 |    19 |    63   (0)| 00:00:01 |
    |   1 |  SORT AGGREGATE       |             |     1 |    19 |            |          |
    |   2 |   INDEX FAST FULL SCAN| IDX_T1_NAME | 54068 |  1003K|    63   (0)| 00:00:01 |
    -------------------------------------------------------------------------------------


    2. Run sql(b)

    SQL> select count(object_name) from t1 where object_name is not null;

    COUNT(OBJECT_NAME)
    ------------------
                 54060
    
    -------------------------------------------------------------------------------------
    | Id  | Operation             | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
    -------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT      |             |     1 |    19 |    63   (0)| 00:00:01 |
    |   1 |  SORT AGGREGATE       |             |     1 |    19 |            |          |
    |*  2 |   INDEX FAST FULL SCAN| IDX_T1_NAME | 54060 |  1003K|    63   (0)| 00:00:01 |
    -------------------------------------------------------------------------------------

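    sql(a) and sql(b) return the same result and use the same plan. The identical result is easy to understand: count(object_name) never counts rows where object_name is NULL, so adding where object_name is not null changes nothing.
    The identical plan is also easy to understand. Both statements use an index fast full scan: I only want to know how many values object_name has and do not care about NULLs, and a single-column B-tree index does not store entries for NULL keys anyway, so counting the entries in the index on object_name is enough.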
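
    The claim that the index has no entries for NULL keys can be probed from the other direction. The sketch below asks for count(*), which must include the NULL rows. On this setup I would expect a full table scan in both cases, because the index alone cannot account for the rows where object_name is NULL and the hint therefore cannot be honored. This was not run as part of the original experiment, so treat the expectation as an assumption to verify.

    ```sql
    -- count(*) needs every row, including those with a NULL object_name
    explain plan for
    select count(*) from t1;

    select * from table(dbms_xplan.display);

    -- Same query with a hint asking for an index fast full scan
    explain plan for
    select /*+ index_ffs(t1 idx_t1_name) */ count(*) from t1;

    select * from table(dbms_xplan.display);
    ```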


    3. Run sql(c)

    SQL> select count(distinct object_name) from t1 where object_name is not null;

    COUNT(DISTINCTOBJECT_NAME)
    --------------------------
                         10472
    
    -----------------------------------------------------------------------------------------------
    | Id  | Operation               | Name        | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT        |             |     1 |    66 |       |   220   (2)| 00:00:03 |
    |   1 |  SORT AGGREGATE         |             |     1 |    66 |       |            |          |
    |   2 |   VIEW                  | VW_DAG_0    | 10472 |   674K|       |   220   (2)| 00:00:03 |
    |   3 |    HASH GROUP BY        |             | 10472 |   194K|  1496K|   220   (2)| 00:00:03 |
    |*  4 |     INDEX FAST FULL SCAN| IDX_T1_NAME | 54060 |  1003K|       |    63   (0)| 00:00:01 |
    -----------------------------------------------------------------------------------------------

    sql(c) differs from sql(b) only by the distinct keyword, and its plan still uses an index fast full scan.


    4. Run sql(d)

    SQL> select count(distinct object_name) from t1;

    COUNT(DISTINCTOBJECT_NAME)
    --------------------------
                         10472
    
    -----------------------------------------------------------------------------------------
    | Id  | Operation            | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT     |          |     1 |    66 |       |   349   (1)| 00:00:05 |
    |   1 |  SORT AGGREGATE      |          |     1 |    66 |       |            |          |
    |   2 |   VIEW               | VW_DAG_0 | 10472 |   674K|       |   349   (1)| 00:00:05 |
    |   3 |    HASH GROUP BY     |          | 10472 |   194K|  1496K|   349   (1)| 00:00:05 |
    |   4 |     TABLE ACCESS FULL| T1       | 54068 |  1003K|       |   192   (0)| 00:00:03 |
    -----------------------------------------------------------------------------------------

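    sql(d) is sql(c) with where object_name is not null removed. The result is unchanged, but the plan switches from an index fast full scan to a full table scan. In principle sql(d) could still get its answer from an index fast full scan, yet the optimizer chose the full table scan with the higher cost (349 versus 220). Why?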
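
    The quoted statement also mentions a NOT NULL constraint, which the four statements above do not exercise. A possible way to test that branch is sketched below. The table t2 and index idx_t2_name are hypothetical names introduced only for this illustration: t2 is a copy of t1 without the NULL rows, so the constraint can be added. According to the quoted statement, the plan for the last query should show an index fast full scan even though it has no WHERE clause, but this was not verified in the original experiment.

    ```sql
    -- Hypothetical copy of t1 that contains no NULL object_name values
    create table t2 as
    select * from t1 where object_name is not null;

    -- Declare the column NOT NULL so the optimizer can rely on it
    alter table t2 modify (object_name not null);

    create index idx_t2_name on t2(object_name);

    begin
      dbms_stats.gather_table_stats(ownname => 'SCOTT',
                                    tabname => 'T2',
                                    cascade => true);
    end;
    /

    -- Same shape as sql(d), but against the NOT NULL column
    explain plan for
    select count(distinct object_name) from t2;

    select * from table(dbms_xplan.display);
    ```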


    IV. Questions

    a. select count(object_name) from t1;    
    b. select count(object_name) from t1 where object_name is not null;        
    c. select count(distinct object_name) from t1 where object_name is not null;        
    d. select count(distinct object_name) from t1;

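    sql(a) and sql(b) both use INDEX FAST FULL SCAN with SORT AGGREGATE directly above it. In other words, Oracle scans the index and counts its entries.
    sql(c) also uses INDEX FAST FULL SCAN, but above it comes HASH GROUP BY, then VIEW, and only then SORT AGGREGATE.
    sql(d) uses a full table scan, with HASH GROUP BY above it, then VIEW, and only then SORT AGGREGATE.

    For count(object_name), Oracle knows that NULLs do not affect the result, so it can use the index with or without the where clause.
    For count(distinct object_name), Oracle seems to be thrown off. It appears to check first whether the statement has a filter: if the NULLs are filtered out it happily uses the index, and if they are not it falls back to a full table scan.
    Why is that?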
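
    One way to probe this question: the VW_DAG_0 view under SORT AGGREGATE in plans (c) and (d) suggests that Oracle rewrote count(distinct ...) into an inline view with GROUP BY, followed by a plain count. Assuming that reading is right, a hand-written approximation looks like the sketch below. Without the filter, the inner GROUP BY has to produce a group for NULL, and the index on object_name holds no entries for those rows, which may be why the index is not used in sql(d). This is a hypothesis to test, not a confirmed explanation.

    ```sql
    -- Hand-written approximation of sql(d) after the rewrite (assumed shape)
    explain plan for
    select count(object_name)
      from (select object_name
              from t1
             group by object_name);

    select * from table(dbms_xplan.display);

    -- Approximation of sql(c): the inner query no longer needs a NULL group
    explain plan for
    select count(object_name)
      from (select object_name
              from t1
             where object_name is not null
             group by object_name);

    select * from table(dbms_xplan.display);
    ```

    If these two statements show the same access paths as sql(d) and sql(c), that supports the idea that the difference comes from the NULL group in the inner query rather than from distinct itself.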

  • Original article: https://www.cnblogs.com/ddzj01/p/11418848.html