zoukankan      html  css  js  c++  java
  • [Hive

    Sampling Syntax  抽样语法

    Sampling Bucketized Table 

    table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

    The TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table. The TABLESAMPLE clause can be added to any table in the FROM clause. The buckets are numbered starting from 1. colname indicates the column on which to sample each row in the table. colname can be one of the non-partition columns in the table or rand() indicating sampling on the entire row instead of an individual column. The rows of the table are 'bucketed' on the colname randomly into y buckets numbered 1 through y. Rows which belong to bucket x are returned.

    In the following example the 3rd bucket out of the 32 buckets of the table source. 's' is the table alias.

    SELECT *
    FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;

    Input pruning: Typically, TABLESAMPLE will scan the entire table and fetch the sample. But, that is not very efficient. Instead, the table can be created with a CLUSTERED BY clause which indicates the set of columns on which the table is hash-partitioned/clustered on. If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.

    Example:

    So in the above example, if table 'source' was created with 'CLUSTERED BY id INTO 32 BUCKETS'

    TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)

    would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.

    On the other hand the tablesample clause

    TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)

    would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.

    For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.

    Block Sampling

    Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121

    block_sample: TABLESAMPLE (n PERCENT)

    This will allow Hive to pick up at least n% data size (notice it doesn't necessarily mean number of rows) as inputs. Only CombineHiveInputFormat is supported and some special compression formats are not handled. If we fail to sample it, the input of MapReduce job will be the whole table/partition. We do it in HDFS block level so that the sampling granularity is block size. For example, if block size is 256MB, even if n% of input size is only 100MB, you get 256MB of data.

    In the following example the input size 0.1% or more will be used for the query.

    SELECT *
    FROM source TABLESAMPLE(0.1 PERCENT) s;

    Sometimes you want to sample the same data with different blocks, you can change this seed number:

    set hive.sample.seednumber=<INTEGER>;

    Or user can specify total length to be read, but it has same limitation with PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

    block_sample: TABLESAMPLE (ByteLengthLiteral)
     
    ByteLengthLiteral : (Digit)+ ('b' 'B' 'k' 'K' 'm' 'M' 'g' 'G')

    In the following example the input size 100M or more will be used for the query.

    SELECT *
    FROM source TABLESAMPLE(100M) s;

    Hive also supports limiting input by row count basis, but it acts differently with above two. First, it does not need CombineHiveInputFormat which means this can be used with non-native tables. Second, the row count given by user is applied to each split. So total row count can be vary by number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

    block_sample: TABLESAMPLE (n ROWS)

    For example, the following query will take the first 10 rows from each input split.

    SELECT * FROM source TABLESAMPLE(10 ROWS);
    谨言慎行,专注思考 , 工作与生活同乐
  • 相关阅读:
    Eureka 系列(04)客户端源码分析
    Eureka 系列(03)Spring Cloud 自动装配原理
    Eureka 系列(02)Eureka 一致性协议
    Eureka 系列(01)最简使用姿态
    Feign 系列(05)Spring Cloud OpenFeign 源码解析
    python 线程,进程与协程
    Python IO多路复用
    python 作用域
    python 网络编程:socket(二)
    python 网络编程:socket
  • 原文地址:https://www.cnblogs.com/tmeily/p/4249952.html
Copyright © 2011-2022 走看看