  • Flink Basics (132): FLINK-SQL Syntax (26) DQL (18) OPERATIONS (15) Deduplication

    Deduplication 

    Batch Streaming

    Deduplication removes rows that are duplicated over a set of columns, keeping only the first or the last one. In some cases the upstream ETL jobs are not end-to-end exactly-once, which may result in duplicate records in the sink after a failover. These duplicates affect the correctness of downstream analytical jobs (e.g. SUM, COUNT), so deduplication is needed before further analysis.

    Flink uses ROW_NUMBER() to remove duplicates, in the same way as a Top-N query. In theory, deduplication is a special case of Top-N in which N is one and the rows are ordered by processing time or event time.

    The following shows the syntax of the Deduplication statement:

    SELECT [column_list]
    FROM (
       SELECT [column_list],
         ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
           ORDER BY time_attr [asc|desc]) AS rownum
       FROM table_name)
    WHERE rownum = 1

    Parameter Specification:

    • ROW_NUMBER(): Assigns a unique, sequential number to each row, starting with one.
    • PARTITION BY col1[, col2...]: Specifies the partition columns, i.e. the deduplication key.
    • ORDER BY time_attr [asc|desc]: Specifies the ordering column, which must be a time attribute. Currently Flink supports processing time attributes and event time attributes. Ordering by ASC keeps the first row; ordering by DESC keeps the last row.
    • WHERE rownum = 1: The rownum = 1 condition is required for Flink to recognize the query as a deduplication.
    Note: the above pattern must be followed exactly, otherwise the optimizer won’t be able to translate the query.
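
    To illustrate the DESC case described above, the following is a minimal sketch that keeps the last row per key, ordered by an event time attribute. The table OrderUpdates, its columns, and the 5-second watermark delay are hypothetical and chosen only for illustration; the connector options are left elided.

    ```sql
    -- Hypothetical table: names and the watermark delay are assumptions.
    CREATE TABLE OrderUpdates (
      order_id   STRING,
      status     STRING,
      event_time TIMESTAMP(3),
      -- declaring a watermark makes event_time an event time attribute
      WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (...);

    -- keep only the most recent row per order_id:
    -- ordering by DESC means the last row is the one with row_num = 1.
    SELECT order_id, status, event_time
    FROM (
      SELECT *,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_time DESC) AS row_num
      FROM OrderUpdates)
    WHERE row_num = 1;
    ```

    Because a later row can replace the one previously kept, the result of a keep-last query on a streaming table is expected to be an updating result rather than an append-only one.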

    The following examples show how to specify SQL queries with Deduplication on streaming tables.

    CREATE TABLE Orders (
      order_id    STRING,
      `user`      STRING,  -- USER is a reserved keyword in Flink SQL, so it is quoted
      product     STRING,
      num         BIGINT,
      proctime AS PROCTIME()
    ) WITH (...);
    
    -- remove duplicate rows on order_id and keep the first occurrence row,
    -- because there shouldn't be two orders with the same order_id.
    SELECT order_id, `user`, product, num
    FROM (
      SELECT *,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) AS row_num
      FROM Orders)
    WHERE row_num = 1;
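
    Since the motivation for deduplication is the correctness of downstream aggregates such as SUM and COUNT, it may help to see the deduplicated result feeding an aggregation. The following is a sketch only: the view name DedupedOrders and the aggregation by product are assumptions for illustration, built on the Orders table above.

    ```sql
    -- Hypothetical view name; wraps the deduplication query over Orders.
    CREATE TEMPORARY VIEW DedupedOrders AS
    SELECT order_id, product, num
    FROM (
      SELECT *,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) AS row_num
      FROM Orders)
    WHERE row_num = 1;

    -- aggregate over the deduplicated rows, so a replayed order
    -- is counted once instead of once per duplicate.
    SELECT product, COUNT(*) AS order_cnt, SUM(num) AS total_num
    FROM DedupedOrders
    GROUP BY product;
    ```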

    This article is from cnblogs (博客园). Author: 秋华. When reposting, please cite the original link: https://www.cnblogs.com/qiu-hua/p/15195954.html

  • Original article: https://www.cnblogs.com/qiu-hua/p/15195954.html