zoukankan      html  css  js  c++  java
  • 离散序列的标准互信息计算(转载)

                                                                                                        离散序列的标准互信息计算

                                                                    来源:http://www.cnblogs.com/ziqiao/archive/2011/12/13/2286273.html

    一、离散序列样本

      X = [1 1 1 1 1 1   2 2 2 2 2 2    3 3 3 3 3];

      Y= [1 2 1 1 1 1   1 2 2 2 2 3    1 1 3 3 3];

    二、计算离散序列X与Y的互信息(Mutual information

      MI可以按照下面的公式(1)计算:

                                                           

    X和Y的联合分布概率p(x,y)和边缘分布律p(x)、p(y)如下:

                                                     

    其中,分子p(x,y)为x和y的联合分布概率:

                                                               p(1,1)=5/17, p(1,2)=1/17, p(1,3)=0;

                                                               p(2,1)=1/17, p(2,2)=4/17, p(2,3)=1/17;

                                                               p(3,1)=2/17, p(3,2)=0, p(3,3)=3/17;                  

    分母p(x)为x的概率函数,p(y)为y的概率函数,

                                                    对p(x):

                                                               p(1)=6/17,p(2)=6/17,p(3)=5/17  

                                                    对p(y):

                                                               p(1)=8/17,p(2)=5/17,P(3)=4/17 

    把上述概率代入公式(1),就可以算出MI。

    三、计算标准化互信息NMI(Normalized Mutual information)

      标准化互信息,即用熵做分母将MI值调整到0与1之间。一个比较多见的实现是下面所示:

                                                                 

    H(X)和H(Y)分别为X和Y的熵,H(X)计算公式如下,公式中log的底b=2。

                                                          

    例如,H(X) =  -p(1)*log2(p(1)) - -p(2)*log2(p(2)) -p(3)*log2(p(3))。

    四、计算标准互信息的MATLAB程序

    function MIhat = nmi( A, B ) %NMI Normalized mutual information
    % http://en.wikipedia.org/wiki/Mutual_information
    % http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
    % Author: http://www.cnblogs.com/ziqiao/   [2011/12/13] if length( A ) ~= length( B)
        error('length( A ) must == length( B)');
    end
    total = length(A);
    A_ids = unique(A);
    B_ids = unique(B);
    
    % Mutual information
    MI = 0;
    for idA = A_ids
        for idB = B_ids
             idAOccur = find( A == idA );
             idBOccur = find( B == idB );
             idABOccur = intersect(idAOccur,idBOccur); 
             
             px = length(idAOccur)/total;
             py = length(idBOccur)/total;
             pxy = length(idABOccur)/total;
             
             MI = MI + pxy*log2(pxy/(px*py)+eps); % eps : the smallest positive number
    
        end
    end
    
    % Normalized Mutual information
    Hx = 0; % Entropies
    for idA = A_ids
        idAOccurCount = length( find( A == idA ) );
        Hx = Hx - (idAOccurCount/total) * log2(idAOccurCount/total + eps);
    end
    Hy = 0; % Entropies
    for idB = B_ids
        idBOccurCount = length( find( B == idB ) );
        Hy = Hy - (idBOccurCount/total) * log2(idBOccurCount/total + eps);
    end
    
    MIhat = 2 * MI / (Hx+Hy);
    end
    
    % Example :  
    % (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
    % A = [1 1 1 1 1 1   2 2 2 2 2 2    3 3 3 3 3];
    % B = [1 2 1 1 1 1   1 2 2 2 2 3    1 1 3 3 3];
    % nmi(A,B)% ans = 0.3646

      为了节省运行时间,将for循环用矩阵运算代替,1百万的数据量运行 1.795723second,上面的方法运行3.491053 second;  但是这种方法太占内存空间, 五百万时,利用matlab2011版本的内存设置就显示Out of memory了。

    版本一:

    function MIhat = nmi( A, B )
    %NMI Normalized mutual information
    % http://en.wikipedia.org/wiki/Mutual_information
    % http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
    % Author: http://www.cnblogs.com/ziqiao/   [2011/12/15] if length( A ) ~= length( B)
        error('length( A ) must == length( B)');
    end
    total = length(A);
    A_ids = unique(A);
    A_class = length(A_ids);
    B_ids = unique(B);
    B_class = length(B_ids);
    % Mutual information
    idAOccur = double (repmat( A, A_class, 1) == repmat( A_ids', 1, total ));
    idBOccur = double (repmat( B, B_class, 1) == repmat( B_ids', 1, total ));
    idABOccur = idAOccur * idBOccur';
    Px = sum(idAOccur') / total;
    Py = sum(idBOccur') / total;
    Pxy = idABOccur / total;
    MImatrix = Pxy .* log2(Pxy ./(Px' * Py)+eps);
    MI = sum(MImatrix(:))
    % Entropies
    Hx = -sum(Px .* log2(Px + eps),2);
    Hy = -sum(Py .* log2(Py + eps),2);
    %Normalized Mutual information
    MIhat = 2 * MI / (Hx+Hy);
    % MIhat = MI / sqrt(Hx*Hy); another version of NMIend
    
    % Example :  
    % (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
    % A = [1 1 1 1 1 1   2 2 2 2 2 2    3 3 3 3 3];
    % B = [1 2 1 1 1 1   1 2 2 2 2 3    1 1 3 3 3];
    % nmi(A,B) % ans =  0.3646

    版本二:

    A = [1 1 1 1 1 1   2 2 2 2 2 2    3 3 3 3 3];
    B = [1 2 1 1 1 1   1 2 2 2 2 3    1 1 3 3 3];
     
     if length( A ) ~= length( B)
        error('length( A ) must == length( B)');
     end
     
    total = length(A);                  %17
    A_ids = unique(A);                %[1,2,3]
    A_class = length(A_ids);       % 3
    B_ids = unique(B);                 %[1,2,3]
    B_class = length(B_ids);        %3
    
    % Mutual information
    idAOccur = double (repmat( A, A_class, 1) == repmat( A_ids', 1, total ));
    idBOccur = double (repmat( B, B_class, 1) == repmat( B_ids', 1, total ));
    idABOccur = idAOccur * idBOccur';
    Px = sum(idAOccur') / total;
    Py = sum(idBOccur') / total;
    Pxy = idABOccur / total;
    miMatrix = Pxy .* log2(Pxy ./(Px' * Py)+eps);  %加上一个很小的数,log2的真数部分不能<=0.
    mi = sum(miMatrix(:));
    
    % Entropies(熵)
    Hx = -sum(Px .* log2(Px + eps),2);
    Hy = -sum(Py .* log2(Py + eps),2);
    
    %Normalized Mutual information
    nmi= 2 * mi / (Hx+Hy)   % nmi = mi / sqrt(Hx*Hy); another version of nmi
  • 相关阅读:
    windows下查看端口进程占用情况
    分布式文件管理系统MooseFS在centOS 7中的安装
    开发中的技术选型调研总结
    windows和linux无法访问VMware中linux的tomcat主页问题
    如何使windows系统ping通VMware下面的linux系统
    VMware中linux安装jdk
    mysql数据库之存储过程入门
    Hibernate批处理操作优化 (批量插入、更新与删除)
    JDBC: 批量处理提高SQL处理速度
    小程序 长按复制文本
  • 原文地址:https://www.cnblogs.com/hezhiyao/p/7208615.html
Copyright © 2011-2022 走看看