zoukankan      html  css  js  c++  java
  • Redis数据结构之HperLogLog

    一、HyperLogLog

    HyperLogLog是用来做基数统计的。

    其可以非常省内存的去统计各种计数,比如注册ip数、每日访问IP数、页面实时UV(PV肯定字符串就搞定了)、在线用户数等在对准确性不是很重要的应用场景。

    HyperLogLog的优点是:

    在输入元素的数量或者体积非常非常大时,计算基数所需的空间总是固定的、并且是很小的,

    HyperLogLog的缺点:

    它是估计基数的算法,所以会有一定误差0.81%。

    每个HyperLogLog键只需要花费12KB内存,就可以计算接近264个不同元素的基数。这和计算基数时,元素越多耗费内存就越多的集合形成鲜明对比。 

    但是,因为 HyperLogLog 只会根据输入元素来计算基数,而不会储存输入元素本身,所以 HyperLogLog 不能像集合那样,返回输入的各个元素即无法知道统计的详细内容。

    二、基数和估算值

    1、基数

    基数是集合中不同元素的数量。

    比如数据集 {1, 3, 5, 7, 5, 7, 8}, 那么这个数据集的基数集为 {1, 3, 5 ,7, 8}, 基数(不重复元素)为5。 

    基数估计就是在误差可接受的范围内,快速计算基数。

    2、估算值

    算法给出的基数并不是精确的,可能会比实际稍微多一些或者稍微少一些,但会控制在合理的范围之内。

    三、HperLogLog基本命令

    redis HyperLogLog 的基本命令:

    1 PFADD key element [element ...] 

    添加指定元素到 HyperLogLog 中。

    2 PFCOUNT key [key ...] 

    返回给定 HyperLogLog 的基数估算值。

    3 PFMERGE destkey sourcekey [sourcekey ...] 

    将多个 HyperLogLog 合并为一个 HyperLogLog

    PFADD

    将任意数量的元素添加到指定的 HyperLogLog 里面。在执行这个命令之后,HyperLogLog内部的结构会被更新,并有所反馈,

    如果执行完之后HyperLogLog内部的基数估算发生了变化,那么就会返回1,否则(认为已经存在)就返回0。

    这个命令还有一个比较神器的就是可以只有键,没有值,这样的意思就是只是创建空的键,不放值。

    如果这个键存在,不做任何事情,返回0;不存在的话就创建,并返回1。

    这个命令的时间复杂度为O(1),所以就放心用吧~

    PFCOUNT

    当命令作用于单个键的时候,返回这个键的基数估算值。如果键不存在,则返回0。

    当 PFCOUNT 命令作用于多个键时, 返回所有给定 HyperLogLog 的并集的近似基数, 这个近似基数是通过将所有给定 HyperLogLog 合并至一个临时 HyperLogLog 来计算得出的。

    这个命令在作用于单个值的时候,时间复杂度为O(1),并且具有非常低的平均常数时间;在作用于N个值的时候,时间复杂度为O(N),这个命令的常数复杂度会比较低些。

    命令返回的可见集合(observed set)基数并不是精确值, 而是一个带有 0.81% 标准错误(standard error)的近似值。

    举个例子, 为了记录一天会执行多少次各不相同的搜索查询, 一个程序可以在每次执行搜索查询时调用一次 PFADD , 并通过调用 PFCOUNT 命令来获取这个记录的近似结果。

    PFMERGE

    合并(merge)多个HyperLogLog为一个HyperLogLog。 合并后的 HyperLogLog 的基数接近于所有输入 HyperLogLog 的可见集合(observed set)的并集。

    合并得出的 HyperLogLog 会被储存在 destkey 键里面, 如果该键并不存在, 那么命令在执行之前, 会先为该键创建一个空的 HyperLogLog 。

    这个命令的第一个参数为目标键,剩下的参数为要合并的HyperLogLog。命令执行时,如果目标键不存在,则创建后再执行合并。

    这个命令的时间复杂度为O(N),其中N为要合并的HyperLogLog的个数。不过这个命令的常数时间复杂度比较高。

    redis> PFADD  ip:20170626  "192.168.0.10"  "192.168.0.20"  "192.168.0.30"

    (integer) 1

    redis> PFADD  ip:20170626 "192.168.0.20"  "192.168.0.40"  "192.168.0.50"  # 存在就只加新的

    (integer) 1

    redis> PFCOUNT ip:20170626  # 元素估计数量没有变化

    (integer) 5

    redis> PFADD  ip:20170626 "192.168.0.20"  # 存在就不会增加

    (integer) 0

    edis> PFMERGE ip:20170626   ip:20170627   ip:20170628

    OK

    redis> PFCOUNT  ip:201706

    (integer) 5

    四、hperloglog 描述

    由于hperloglog,这种数据结构在实际应用场景中并不多。因此,这里就不再详细讨论了。

    我们看下hperloglog.c文件,对HperLogLog的描述

    /* The Redis HyperLogLog implementation is based on the following ideas:

     *

     * * The use of a 64 bit hash function as proposed in [1], in order to don't

     *   limited to cardinalities up to 10^9, at the cost of just 1 additional

     *   bit per register.

     * * The use of 16384 6-bit registers for a great level of accuracy, using

     *   a total of 12k per key.

     * * The use of the Redis string data type. No new type is introduced.

     * * No attempt is made to compress the data structure as in [1]. Also the

     *   algorithm used is the original HyperLogLog Algorithm as in [2], with

     *   the only difference that a 64 bit hash function is used, so no correction

     *   is performed for values near 2^32 as in [1].

     *

     * [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic

     *     Engineering of a State of The Art Cardinality Estimation Algorithm.

     *

     * [2] P. Flajolet, éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The

     *     analysis of a near-optimal cardinality estimation algorithm.

     *

     * Redis uses two representations:

     *

     * 1) A "dense" representation where every entry is represented by

     *    a 6-bit integer.

     * 2) A "sparse" representation using run length compression suitable

     *    for representing HyperLogLogs with many registers set to 0 in

     *    a memory efficient way.

     *

     *

     * HLL header

     * ===

     *

     * Both the dense and sparse representation have a 16 byte header as follows:

     *

     * +------+---+-----+----------+

     * | HYLL | E | N/U | Cardin.  |

     * +------+---+-----+----------+

     *

     * The first 4 bytes are a magic string set to the bytes "HYLL".

     * "E" is one byte encoding, currently set to HLL_DENSE or

     * HLL_SPARSE. N/U are three not used bytes.

     *

     * The "Cardin." field is a 64 bit integer stored in little endian format

     * with the latest cardinality computed that can be reused if the data

     * structure was not modified since the last computation (this is useful

     * because there are high probabilities that HLLADD operations don't

     * modify the actual data structure and hence the approximated cardinality).

     *

     * When the most significant bit in the most significant byte of the cached

     * cardinality is set, it means that the data structure was modified and

     * we can't reuse the cached value that must be recomputed.

     *

     * Dense representation

     * ===

     *

     * The dense representation used by Redis is the following:

     *

     * +--------+--------+--------+------//      //--+

     * |11000000|22221111|33333322|55444444 ....     |

     * +--------+--------+--------+------//      //--+

     *

     * The 6 bits counters are encoded one after the other starting from the

     * LSB to the MSB, and using the next bytes as needed.

     *

     * Sparse representation

     * ===

     *

     * The sparse representation encodes registers using a run length

     * encoding composed of three opcodes, two using one byte, and one using

     * of two bytes. The opcodes are called ZERO, XZERO and VAL.

     *

     * ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented

     * by the six bits 'xxxxxx', plus 1, means that there are N registers set

     * to 0. This opcode can represent from 1 to 64 contiguous registers set

     * to the value of 0.

     *

     * XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit

     * integer represented by the bits 'xxxxxx' as most significant bits and

     * 'yyyyyyyy' as least significant bits, plus 1, means that there are N

     * registers set to 0. This opcode can represent from 0 to 16384 contiguous

     * registers set to the value of 0.

     *

     * VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer

     * representing the value of a register, and a 2-bit integer representing

     * the number of contiguous registers set to that value 'vvvvv'.

     * To obtain the value and run length, the integers vvvvv and xx must be

     * incremented by one. This opcode can represent values from 1 to 32,

     * repeated from 1 to 4 times.

     *

     * The sparse representation can't represent registers with a value greater

     * than 32, however it is very unlikely that we find such a register in an

     * HLL with a cardinality where the sparse representation is still more

     * memory efficient than the dense representation. When this happens the

     * HLL is converted to the dense representation.

     *

     * The sparse representation is purely positional. For example a sparse

     * representation of an empty HLL is just: XZERO:16384.

     *

     * An HLL having only 3 non-zero registers at position 1000, 1020, 1021

     * respectively set to 2, 3, 3, is represented by the following three

     * opcodes:

     *

     * XZERO:1000 (Registers 0-999 are set to 0)

     * VAL:2,1    (1 register set to value 2, that is register 1000)

     * ZERO:19    (Registers 1001-1019 set to 0)

     * VAL:3,2    (2 registers set to value 3, that is registers 1020,1021)

     * XZERO:15362 (Registers 1022-16383 set to 0)

     *

     * In the example the sparse representation used just 7 bytes instead

     * of 12k in order to represent the HLL registers. In general for low

     * cardinality there is a big win in terms of space efficiency, traded

     * with CPU time since the sparse representation is slower to access:

     *

     * The following table shows average cardinality vs bytes used, 100

     * samples per cardinality (when the set was not representable because

     * of registers with too big value, the dense representation size was used

     * as a sample).

     *

     * 100 267

     * 200 485

     * 300 678

     * 400 859

     * 500 1033

     * 600 1205

     * 700 1375

     * 800 1544

     * 900 1713

     * 1000 1882

     * 2000 3480

     * 3000 4879

     * 4000 6089

     * 5000 7138

     * 6000 8042

     * 7000 8823

     * 8000 9500

     * 9000 10088

     * 10000 10591

     *

     * The dense representation uses 12288 bytes, so there is a big win up to

     * a cardinality of ~2000-3000. For bigger cardinalities the constant times

     * involved in updating the sparse representation is not justified by the

     * memory savings. The exact maximum length of the sparse representation

     * when this implementation switches to the dense representation is

     * configured via the define server.hll_sparse_max_bytes.

     */

  • 相关阅读:
    微信小程序scroll-viwe遇到的问题
    微信小程序缓存
    微信刷新数据不刷新页面的另一个方法
    微信小程序中无刷新修改
    Bayesian
    深度学习(七)object detection
    深度学习(十二)wide&deep model
    深度学习(十)训练时的调参技巧
    深度学习(九)过拟合和欠拟合
    深度学习(二)常见概念
  • 原文地址:https://www.cnblogs.com/exceptioneye/p/7081896.html
Copyright © 2011-2022 走看看