  • [Repost] Calculating Entropy

    From:  johndcook.com/blog

    For a set of positive probabilities $p_i$ summing to 1, their entropy is defined as

    $$ -\sum_i p_i \log p_i $$

    (For this post, log will mean log base 2, not natural log.)

    This post looks at a couple questions about computing entropy. First, are there any numerical problems computing entropy directly from the equation above?
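
    As a baseline, here is a minimal sketch of the direct computation in Python (the original post gives no code, so the function name entropy is my own):

        import math

        def entropy(probs):
            """Entropy in bits of positive probabilities that sum to 1."""
            return -sum(p * math.log2(p) for p in probs)

        print(entropy([0.5, 0.25, 0.25]))  # 1.5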

    Second, imagine you don’t have the $p_i$ values directly but rather counts $n_i$ that sum to $N$. Then $p_i = n_i/N$. To apply the equation directly, you’d first need to compute $N$, then go back through the data again to compute entropy. If you have a large amount of data, could you compute the entropy in one pass?

    To address the second question, note that

    $$ -\sum_i \frac{n_i}{N} \log \frac{n_i}{N} = \log N - \frac{1}{N} \sum_i n_i \log n_i $$

    so you can sum $n_i$ and $n_i \log n_i$ in the same pass; a sketch follows.
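
    A minimal one-pass sketch in Python based on the identity above (the function name entropy_one_pass is mine, not the original post’s):

        import math

        def entropy_one_pass(counts):
            """Entropy in bits from raw counts, accumulating N and sum(n * log2 n) together."""
            N = 0
            s = 0.0
            for n in counts:
                N += n
                s += n * math.log2(n)
            return math.log2(N) - s / N

        print(entropy_one_pass([2, 1, 1]))  # 1.5, matching the direct computation on [0.5, 0.25, 0.25]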

    One of the things you learn in numerical analysis is to look carefully at subtractions. Subtracting two nearly equal numbers can result in a loss of precision. Could the numbers above be nearly equal? Maybe if the $n_i$ are ridiculously large. Not just astronomically large (astronomically large numbers like the number of particles in the universe are fine) but ridiculously large: numbers whose logarithms approach the limits of machine-representable numbers. If we’re only talking about numbers as big as the number of particles in the universe, their logs will be at most three-digit numbers.
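
    For a concrete sense of scale (my arithmetic, not the original post’s), take the number of particles in the universe to be roughly $10^{80}$:

        import math

        print(math.log2(10 ** 80))  # about 265.75, a three-digit number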

    Now to the problem of computing the sum of $n_i \log n_i$. Could the order of the terms matter? This also applies to the first question of the post if we look at summing the $p_i \log p_i$ terms. In general, you’ll get better accuracy summing a lot of positive numbers by sorting them and adding from smallest to largest, and worse accuracy by summing largest to smallest. If adding a sorted list gives essentially the same result when summed in either direction, summing the list in any other order should too; the sketch below compares the two directions.
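
    Here is a sketch of the ascending-versus-descending comparison, using math.fsum as a correctly rounded reference; the test data is my own invention, not the post’s:

        import math
        import random

        random.seed(42)
        # Positive terms spanning several orders of magnitude.
        terms = [random.random() ** 8 for _ in range(10**6)]

        asc   = sum(sorted(terms))                # smallest to largest
        desc  = sum(sorted(terms, reverse=True))  # largest to smallest
        exact = math.fsum(terms)                  # correctly rounded reference sum

        print(asc - exact, desc - exact)  # the ascending error is typically the smaller one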

    To test the methods discussed here, I used two sets of count data, one on the order of a million counts and the other on the order of a billion counts. Both data sets had approximately a power law distribution, with counts varying over seven or eight orders of magnitude. For each data set I computed the entropy four ways: two equations times two orders. That is, I either convert the counts to probabilities first or use the counts directly, and I sum either smallest to largest or largest to smallest.

    For the smaller data set, all four methods produced the same answer to nine significant figures. For the larger data set, all four methods produced the same answer to seven significant figures. So at least for the kind of data I’m looking at, it doesn’t matter how you calculate entropy, and you might as well use the one-pass algorithm to get the result faster.
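
    The original data sets aren’t reproduced here, but a sketch of the four-way comparison on synthetic power-law counts might look like the following (the seed, the count-generating formula, and all names are my own illustration):

        import math
        import random

        def entropy_from_probs(counts, ascending=True):
            """Two-pass method: convert counts to probabilities, sum the positive terms -p * log2(p)."""
            N = sum(counts)
            terms = sorted((-(n / N) * math.log2(n / N) for n in counts), reverse=not ascending)
            return sum(terms)

        def entropy_from_counts(counts, ascending=True):
            """Count-based identity log2(N) - sum(n * log2 n) / N, summed in a chosen order."""
            N = sum(counts)
            terms = sorted((n * math.log2(n) for n in counts), reverse=not ascending)
            return math.log2(N) - sum(terms) / N

        random.seed(1)
        # Roughly power-law counts spanning several orders of magnitude.
        counts = [int(1 / random.random() ** 2) + 1 for _ in range(10**5)]

        for f in (entropy_from_probs, entropy_from_counts):
            print(f(counts, ascending=True), f(counts, ascending=False))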

  • Original post: https://www.cnblogs.com/johnpher/p/3271006.html