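    The first concept to understand is selectivity. Selectivity is the fraction of the rows in a row source that an operation is expected to return. For example, if a query against a 100-row table returns 48 rows, the selectivity is 0.48. Selectivity is very important to the CBO's decisions. Put simply, if the selectivity is high and the returned rows make up most of the row source, the CBO tends to access the table with a full table scan; otherwise it tends to use an index scan.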


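    The selectivity of a predicate on a column is estimated from the following column statistics:
    • The number of distinct values
    • The low and high values of the column
    • The number of null values
    • Data distribution information, that is, a histogram (this one is optional)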

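    Without histogram information, the CBO uses the first three items to estimate selectivity, and it then assumes that the column's values are evenly distributed: between the low and high values, every distinct value is assumed to occur equally often. Let's look at an example. (Note: the value 10000 here is a character value; a proper test should use a number. Character and numeric values behave differently, which is worth further study.)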

    SQL> create table GOOD as select rownum all_distinct, 10000 skew from dual connect by level <= 10000;
    SQL> update GOOD set skew=all_distinct+10 where rownum<=10;
    SQL> select * from GOOD where rownum<12;
    ALL_DISTINCT       SKEW
    ------------ ----------
               1         11
               2         12
               3         13
               4         14
               5         15
               6         16
               7         17
               8         18
               9         19
              10         20
              11      10000


    SQL> exec dbms_stats.gather_table_stats('SYS','GOOD', method_opt=>'for all columns size 1');


    SQL> select column_name,num_distinct,LOW_VALUE,HIGH_VALUE,NUM_NULLS,density from user_tab_col_statistics where table_name='GOOD';
    COLUMN_NAME        NUM_DISTINCT LOW_VALUE          HIGH_VALUE          NUM_NULLS    DENSITY
    ------------------ ------------ ------------------ ------------------ ---------- ----------
    ALL_DISTINCT              10000 C102               C302                        0      .0001
    SKEW                         11 C10C               C302                        0 .090909091



    SQL> select /*+ gather_plan_statistics */ * from GOOD where skew=12;
    ALL_DISTINCT       SKEW
    ------------ ----------
               2         12
    SQL> select * from TABLE(dbms_xplan.display_cursor(null,null,'iostats last'));
    | Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
    |   0 | SELECT STATEMENT  |      |      1 |        |      1 |00:00:00.01 |      21 |
    |*  1 |  TABLE ACCESS FULL| GOOD |      1 |    909 |      1 |00:00:00.01 |      21 |

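    E-Rows is the number of rows the CBO expects the operation to return. Because no histogram was gathered, Oracle assumes the data is evenly distributed, so cardinality = density * 10000 = 909.09 rows, where density = 1/num_distinct = 1/11. In reality only one row matches (A-Rows = 1). Let's look at another execution plan.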
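
    The arithmetic behind the 909-row estimate can be reproduced with a few lines of Python. This is only a sketch of the uniform-distribution assumption described above, not Oracle's actual code; the variable names are illustrative and the inputs are the statistics shown for column SKEW.

```python
# Uniform-distribution estimate used when a column has no histogram.
# Inputs are the column statistics reported above for GOOD.SKEW.
num_rows = 10000      # rows in table GOOD
num_distinct = 11     # NUM_DISTINCT for SKEW
num_nulls = 0         # NUM_NULLS for SKEW

# Without a histogram, density is simply 1 / num_distinct.
density = 1 / num_distinct

# Every distinct value is assumed to own an equal share of the non-null rows.
estimated_rows = (num_rows - num_nulls) * density

print(round(density, 9))      # 0.090909091
print(round(estimated_rows))  # 909
```

    The same estimate is produced for every equality predicate on SKEW, whether the value is rare or very common.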

    SQL> explain plan for select * from GOOD where skew=10000;
    SQL> select * from table(dbms_xplan.display());
    | Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    |   0 | SELECT STATEMENT  |      |   909 |  8181 |     7  (15)| 00:00:01 |
    |*  1 |  TABLE ACCESS FULL| GOOD |   909 |  8181 |     7  (15)| 00:00:01 |

    The estimate is still 909 rows, although skew=10000 actually matches 9990 of the 10000 rows. With an index on the column, the same underestimate makes an index range scan look cheap:


    SQL> create index GOOD_I on GOOD(skew);
    SQL> exec dbms_stats.gather_index_stats('SYS','GOOD_I');
    SQL> explain plan for select * from GOOD where skew=10000;
    SQL> select * from table(dbms_xplan.display());
    | Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
    |   0 | SELECT STATEMENT            |        |   909 |  5454 |     4   (0)| 00:00:01 |
    |   1 |  TABLE ACCESS BY INDEX ROWID| GOOD   |   909 |  5454 |     4   (0)| 00:00:01 |
    |*  2 |   INDEX RANGE SCAN          | GOOD_I |   909 |       |     2   (0)| 00:00:01 |




    Width-balanced or Frequency Histograms

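    In this type of histogram the x-axis holds the distinct values and the y-axis holds the number of times each distinct value occurs in the column. It requires that the x-axis can cover every distinct value. An Oracle histogram can have at most 254 buckets, so this type of histogram can be used only when the column has no more than 254 distinct values. Below is the frequency histogram for our example.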


    SQL> exec dbms_stats.gather_table_stats('SYS','GOOD',method_opt=>'for columns skew size 11');
    PL/SQL procedure successfully completed.
    SQL> select column_name,endpoint_number,endpoint_value from user_tab_histograms where table_name='GOOD' and column_name='SKEW';
    COLUMN_NAME        ENDPOINT_NUMBER ENDPOINT_VALUE
    ------------------ --------------- --------------
    SKEW                             1             11
    SKEW                             2             12
    SKEW                             3             13
    SKEW                             4             14
    SKEW                             5             15
    SKEW                             6             16
    SKEW                             7             17
    SKEW                             8             18
    SKEW                             9             19
    SKEW                            10             20
    SKEW                         10000          10000

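    Here ENDPOINT_VALUE is the distinct value itself, and ENDPOINT_NUMBER is the running total of occurrences up to and including that value. For example, 11 occurs once, 12 occurs 2-1=1 time, and 20 occurs 10-9=1 time. You may wonder why Oracle stores the running total rather than the count itself. The reason is that the cumulative form is very useful for range predicates: the number of rows with skew>15 is simply 10000-5=9995.

    Of course, a query like the following gives a more intuitive mapping from each distinct value to its number of occurrences: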

    select  endpoint_value as column_value,
            endpoint_number as cumulative_frequency,
            endpoint_number - lag(endpoint_number,1,0) over (order by endpoint_number) as frequency
    from user_tab_histograms
    where table_name = 'GOOD' and column_name = 'SKEW';
    COLUMN_VALUE CUMULATIVE_FREQUENCY  FREQUENCY
    ------------ -------------------- ----------
              11                    1          1
              12                    2          1
              13                    3          1
              14                    4          1
              15                    5          1
              16                    6          1
              17                    7          1
              18                    8          1
              19                    9          1
              20                   10          1
           10000                10000       9990
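
    To make the cumulative encoding concrete, here is a small Python sketch that works on the (ENDPOINT_VALUE, ENDPOINT_NUMBER) pairs listed for SKEW. The helper names `frequencies` and `rows_greater_than` are made up for illustration; they mimic the arithmetic only and are not part of any Oracle API.

```python
# (endpoint_value, endpoint_number) pairs of the frequency histogram on SKEW.
histogram = [(10 + i, i) for i in range(1, 11)] + [(10000, 10000)]

def frequencies(hist):
    """Turn cumulative endpoint numbers into per-value row counts."""
    result = {}
    previous = 0
    for value, cumulative in hist:
        result[value] = cumulative - previous
        previous = cumulative
    return result

def rows_greater_than(hist, bound):
    """Count rows with value > bound using only the cumulative numbers."""
    total = hist[-1][1]            # last endpoint number = total rows
    at_or_below = 0
    for value, cumulative in hist:
        if value <= bound:
            at_or_below = cumulative
    return total - at_or_below

freq = frequencies(histogram)
print(freq[12], freq[10000])             # 1 9990
print(rows_greater_than(histogram, 15))  # 9995
```

    The range count needs a single subtraction, which is the benefit of storing running totals instead of individual counts.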


    SQL> select column_name, density, histogram from user_tab_col_statistics where table_name='GOOD' ;
    COLUMN_NAME           DENSITY HISTOGRAM
    ------------------ ---------- ---------------------------------------------
    ALL_DISTINCT            .0001 NONE
    SKEW                   .00005 FREQUENCY

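    First, notice that the density has changed: for SKEW it is now .00005, which equals 1/(2 * 10000). The execution plan also improves: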

    SQL> explain plan for Select * from good where skew=10000;
    SQL> select * from table(dbms_xplan.display);
    | Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    |   0 | SELECT STATEMENT  |      |  9990 | 59940 |     6  (17)| 00:00:01 |
    |*  1 |  TABLE ACCESS FULL| GOOD |  9990 | 59940 |     6  (17)| 00:00:01 |

    The Rows value is 9990, which shows that the CBO now estimates the size of the result correctly.

    Height-balanced Histograms
    With a frequency histogram, Oracle allocates one bucket for each distinct value. However, the
    maximum number of buckets is 254, so for a column with more than 254 distinct values you have to
    use a height-balanced histogram.
    In a height-balanced histogram there are more distinct values than buckets, so Oracle first sorts
    the column data and then divides the complete data set into the requested number of buckets. All
    buckets contain the same number of values (which is why they are called height-balanced
    histograms), except the last bucket, which may have fewer values than the others.

    There is no separate statement to create a height-balanced histogram. When the number of buckets
    requested is less than the number of distinct values in a column, Oracle creates a height-balanced
    histogram, and the meanings of ENDPOINT_VALUE and ENDPOINT_NUMBER are quite
    different. To understand how to interpret the histogram information, take another example: a
    column with 23 values, 9 of them distinct, for which we have requested 5 buckets. The figure
    below shows how the data is stored in the histogram.

    [Figure: the 23 sorted column values divided into 5 buckets, with the bucket end points marked]

    We can make the following points based on the figure above:
    • The number of buckets is less than the number of distinct values in the column.
    • Since we requested 5 buckets, the total data set is divided into equally sized
    buckets, except the last bucket, which in this case has only 3 values.
    • The end point of each bucket and the first point of the first bucket are marked, as they are of special
    interest.
    • The data value ‘3’ is marked in red; it is special in the sense that it is the end point of multiple
    buckets.

    With 5 buckets and 23 values, there are 5 values in each bucket except the last bucket, which
    has 3 values. This is in fact how Oracle stores height-balanced histogram information in the data
    dictionary views, with one minor change. Since buckets 1 and 2 both have 3 as an end point, Oracle
    does not store bucket 1, to save space. The two buckets are merged, and a single entry is
    stored.

    Let’s create a histogram on column skew, this time with fewer buckets than the actual number
    of distinct values, which is 11.

    SQL> exec dbms_stats.gather_table_stats('SYS','GOOD',method_opt=>'for columns skew size 5');
    SQL> select table_name, column_name,endpoint_number,endpoint_value from DBA_TAB_HISTOGRAMS  where table_name='GOOD' and COLUMN_NAME='SKEW';
    TABLE_NAME         COLUMN_NAME        ENDPOINT_NUMBER ENDPOINT_VALUE
    ------------------ ------------------ --------------- --------------
    GOOD               SKEW                             0             11
    GOOD               SKEW                             5          10000

    Here buckets 1-5 all have 10000 as an end point, so buckets 1-4 are not stored, to save
    space.
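
    The two rows returned for SKEW can be reproduced with a simplified model of the bucketing described above. The Python sketch below sorts the values, cuts them into equal slices, records the end point of each slice, and merges repeated end points. The function name `height_balanced` is hypothetical, and the model is an assumption-laden illustration rather than Oracle's internal algorithm.

```python
def height_balanced(values, buckets):
    """Return compressed (endpoint_number, endpoint_value) rows.

    Simplified model: bucket 0 holds the lowest value, each later bucket
    holds the last value of its slice, and repeated end points keep only
    the highest bucket number.
    """
    data = sorted(values)
    size = -(-len(data) // buckets)        # ceiling division: rows per bucket
    endpoints = [(0, data[0])]
    for b in range(1, buckets + 1):
        last = min(b * size, len(data)) - 1
        endpoints.append((b, data[last]))

    compressed = []
    for number, value in endpoints:
        if compressed and compressed[-1][1] == value:
            compressed[-1] = (number, value)   # merge repeated end point
        else:
            compressed.append((number, value))
    return compressed

# Column SKEW of table GOOD: ten rows with 11..20, the rest 10000.
skew = list(range(11, 21)) + [10000] * 9990
print(height_balanced(skew, 5))  # [(0, 11), (5, 10000)]
```

    Each of the 5 buckets covers 2000 rows, and the 2000th sorted value is already 10000, so every bucket ends on the same value and only the last one survives the merge.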

    In a nutshell, in a height-balanced histogram the data is divided into 'buckets' that each
    contain the same number of values. The highest value in each bucket is recorded
    (ENDPOINT_VALUE), together with the lowest value in the first bucket (bucket 0), and
    ENDPOINT_NUMBER holds the bucket number. Once the data is recorded in buckets, we
    recognize two types of data value: non-popular values and popular values.

    Popular values are those that occur multiple times as end points. For instance, in our previous
    example 3 is a popular value, and in the column skew 10000 is a popular value. Non-popular values
    are those that do not occur multiple times as end points. As you might expect, popular and
    non-popular values are not fixed: they depend on the bucket size. Changing the bucket size will
    result in different popular values.
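
    Because repeated end points are merged in the dictionary, a popular value shows up as a jump of more than one in ENDPOINT_NUMBER. The Python sketch below applies that rule to the rows stored for SKEW. The helper `popular_values` is a made-up name, and the second input is a hypothetical histogram shaped like the 23-value example (its end points other than 3 are invented).

```python
def popular_values(compressed):
    """Map each popular value to the number of buckets it closes.

    `compressed` is a list of (endpoint_number, endpoint_value) rows in
    endpoint_number order, as stored for a height-balanced histogram.
    """
    popular = {}
    previous_number = 0
    for number, value in compressed:
        span = number - previous_number    # buckets ending on this value
        if span > 1:
            popular[value] = span
        previous_number = number
    return popular

# Rows stored for GOOD.SKEW with 5 buckets.
print(popular_values([(0, 11), (5, 10000)]))   # {10000: 5}

# Hypothetical rows where 3 is the end point of buckets 1 and 2.
example = [(0, 1), (2, 3), (3, 5), (4, 7), (5, 9)]
print(popular_values(example))                 # {3: 2}
```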

    Let me summarize our discussion of the two histogram types:
    • Distinct values less than or equal to the number of buckets: When the number of
    distinct values does not exceed the number of buckets, the ENDPOINT_VALUE column contains the
    distinct values themselves, while the ENDPOINT_NUMBER column holds the
    cumulative number of rows with a value less than or equal to that column value (Frequency Histograms).
    • More distinct values than the number of buckets: When there are more
    distinct values than buckets, the ENDPOINT_NUMBER column
    contains the bucket id and the ENDPOINT_VALUE column holds the highest value in each bucket.
    Bucket 0 is special in that it holds the low value for that column (Height-balanced
    Histograms).
    Creating Histograms
    The GATHER_TABLE_STATS procedure of the DBMS_STATS package is used to gather table and
    column statistics, and optionally we can instruct it to create a histogram on certain column(s) using the
    method_opt parameter.
    The ‘method_opt’ parameter of the procedure accepts the following values:
    • For all [Indexed | Hidden] Columns [Size option]
    • For Columns column_name [Size option] column_name [Size option] …
    The SIZE keyword specifies the maximum number of buckets for the histogram and takes the following
    form:
    SIZE {integer | REPEAT | AUTO | SKEWONLY}
    • integer: Number of histogram buckets. Must be in the range 1-254.
    • Repeat: Collects histograms only on the columns that already have histograms.
    • Auto: Oracle determines the columns to collect histograms on, based on the data distribution and the
    workload of the columns.
    • Skewonly: Oracle determines the columns to collect histograms on, based on the data distribution
    of the columns.

    The Auto option also considers the workload of the columns; that is, it checks for SQL
    queries that use the column in a WHERE predicate.
    The default for method_opt changed to ‘For all Columns Size Auto’ in 10g; in 9i it was
    ‘For all columns size 1’. In other words, Oracle now decides for us which columns
    need histograms and how many buckets to use. This seems an ideal situation, but it has many caveats
    that are outside the scope of this paper. In the next part, I’ll cover this topic in greater detail.
    The default value can be changed using the SET_PARAM procedure.

    Viewing Histograms
    We have been querying histogram information for a while; now it’s time to look in detail at the
    options available for viewing it. We can find information about existing histograms in the
    database through the DBA_TAB_HISTOGRAMS data dictionary view. This view lists histograms on
    columns of all tables. The actual value may be stored in ENDPOINT_ACTUAL_VALUE if the
    column is not a number (e.g. a varchar2) and the first six bytes of some values are the same.
    The number of buckets in each column’s histogram and the density value can be found in the
    DBA_TAB_COLUMNS and DBA_TAB_COL_STATISTICS data dictionary views. The latter
    extracts its data from DBA_TAB_COLUMNS only.
    There are corresponding views available for partition and sub-partitions columns, for example

