zoukankan      html  css  js  c++  java
  • 机器学习&数据挖掘笔记_14(GMM-HMM语音识别简单理解)

       为了对GMM-HMM在语音识别上的应用有个宏观认识,花了些时间读了下HTK(用htk完成简单的孤立词识别)的部分源码,对该算法总算有了点大概认识,达到了预期我想要的。不得不说,网络上关于语音识别的通俗易懂教程太少,都是各种公式满天飞,很少有说具体细节的,当然了,那需要有实战经验才行。下面总结以下几点,对其有个宏观印象即可(以孤立词识别为例)。

      一、每个单词的读音都对应一个HMM模型,大家都知道HMM模型中有个状态集S,那么每个状态用什么来表示呢,数字?向量?矩阵?其实这个状态集中的状态没有具体的数学要求,只是一个名称而已,你可以用’1’, ’2’, ‘3’…表示,也可以用’a’, ‘b’, ’c ’表示。另外每个HMM模型中到底该用多少个状态,是通过先验知识人为设定的。

      二、HMM的每一个状态都对应有一个观察值,这个观察值可以是一个实数,也可以是个向量,且每个状态对应的观察值的维度应该相同。假设现在有一个单词的音频文件,首先需要将其进行采样得到数字信息(A/D转换),然后分帧进行MFCC特征提取,假设每一帧音频对应的MFCC特征长度为39,则每个音频文件就转换成了N个MFCC向量(不同音频文件对应的N可能不同),这就成了一个序列,而在训练HMM模型的参数时(比如用Baum-Welch算法),每次输入到HMM中的数据要求就是一个观测值序列。这时,每个状态对应的观测值为39维的向量,因为向量中元素的取值是连续的,需要用多维密度函数来模拟,通常情况下用的是多维高斯函数。在GMM-HMM体系中,这个拟合函数是用K个多维高斯混合得到的。假设知道了每个状态对应的K个多维高斯的所有参数,则该GMM生成该状态上某一个观察向量(一帧音频的MFCC系数)的概率就可以求出来了。

      三、对每个单词建立一个HMM模型,需要用到该单词的训练样本,这些训练样本是提前标注好的,即每个样本对应一段音频,该音频只包含这个单词的读音。当有了该单词的多个训练样本后,就用这些样本结合Baum-Welch算法和EM算法来训练出GMM-HMM的所有参数,这些参数包括初始状态的概率向量,状态之间的转移矩阵,每个状态对应的观察矩阵(这里对应的是GMM,即每个状态对应的K个高斯的权值,每个高斯的均值向量和方差矩阵)。

      四、在识别阶段,输入一段音频,如果该音频含有多个单词,则可以手动先将其分割开(考虑的是最简单的方法),然后提取每个单词的音频MFCC特征序列,将该序列输入到每个HMM模型(已提前训练好的)中,采用前向算法求出每个HMM模型生成该序列的概率,最后取最大概率对应的那个模型,而那个模型所表示的单词就是我们识别的结果。

      五、在建立声学模型时,可以用Deep Learning的方法来代替GMM-HMM中的GMM,因为GMM模拟任意函数的功能取决于混合高斯函数的个数,所以具有一定的局限性,属于浅层模型。而Deep Network可以模拟任意的函数,因而表达能力更强。注意,这里用来代替GMM的Deep Nets模型要求是产生式模型,比如DBN,DBM等,因为在训练HMM-DL网络时,需要用到HMM的某个状态产生一个样本的概率。

      六、GMM-HMM在具体实现起来还是相当复杂的。

      七、一般涉及到时间序列时才会使用HMM,比如这里音频中的语音识别,视频中的行为识别等。如果我们用GMM-HMM对静态的图片分类,因为这里没涉及到时间信息,所以HMM的状态数可设为1,那么此时的GMM-HMM算法就退化成GMM算法了。

      MFCC:

      MFCC的matlab实现教程可参考:张智星老师的网页教程mfcc. 最基本的12维特征。

    function mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
    % frame2mfcc: Frame to MFCC conversion.
    %    Usage: mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
    %
    %    For example:
    %        waveFile='what_movies_have_you_seen_recently.wav';
    %        [y, fs, nbits]=wavReadInt(waveFile);
    %        startIndex=12000;
    %        frameSize=512;
    %        frame=y(startIndex:startIndex+frameSize-1);
    %        frame2mfcc(frame, fs, 20, 12, 1);
    
    %    Roger Jang 20060417
    
    if nargin<1, selfdemo; return; end
    if nargin<2, fs=16000; end
    if nargin<3, filterNum=20; end
    if nargin<4, mfccNum=12; end
    if nargin<5, plotOpt=0; end
    
    frameSize=length(frame);
    % ====== Preemphasis should be done at wave level
    %a=0.95;
    %frame2 = filter([1, -a], 1, frame);
    frame2=frame;
    % ====== Hamming windowing
    frame3=frame2.*hamming(frameSize);
    % ====== FFT
    [fftMag, fftPhase, fftFreq, fftPowerDb]=fftOneSide(frame3, fs);
    % ====== Triangular band-pass filter bank
    triFilterBankPrm=getTriFilterBankPrm(fs, filterNum);    % Get parameters for triangular band-pass filter bank
    % Triangular bandpass filter.
    for i=1:filterNum
        tbfCoef(i)=dot(fftPowerDb, trimf(fftFreq, triFilterBankPrm(:,i)));%得到filterNum个滤波系数
    end
    % ====== DCT
    mfcc=zeros(mfccNum, 1); %DCT变换的前后个数也没有变
    for i=1:mfccNum
        coef = cos((pi/filterNum)*i*((1:filterNum)-0.5))'; %mfcc中的前mfccNum个系数
        mfcc(i) = sum(coef.*tbfCoef');%直接按照DCT公式
    end
    % ====== Log energy
    %logEnergy=10*log10(sum(frame.*frame));
    %mfcc=[logEnergy; mfcc];
    
    if plotOpt
        subplot(2,1,1);
        plot(frame, '.-');
        set(gca, 'xlim', [-inf inf]);
        title('Input frame');
        subplot(2,1,2);
        plot(mfcc, '.-');
        set(gca, 'xlim', [-inf inf]);
        title('MFCC vector');
    end
    
    % ====== trimf.m (from fuzzy toolbox)
    function y = trimf(x, prm) %由频率的横坐标算出三角形内的纵坐标,0~1
    a = prm(1); b = prm(2); c = prm(3);
    y = zeros(size(x));
    % Left and right shoulders (y = 0)
    index = find(x <= a | c <= x);
    y(index) = zeros(size(index)); %只考虑三角波内的量
    % Left slope
    if (a ~= b)
        index = find(a < x & x < b);
        y(index) = (x(index)-a)/(b-a);
    end
    % right slope
    if (b ~= c)
        index = find(b < x & x < c);
        y(index) = (c-x(index))/(c-b);
    end
    % Center (y = 1)
    index = find(x == b);
    y(index) = ones(size(index));
    
    % ====== Self demo
    function selfdemo
    waveFile='what_movies_have_you_seen_recently.wav';
    [y, fs, nbits]=wavReadInt(waveFile);
    startIndex=12000;
    frameSize=512;
    frame=y(startIndex:startIndex+frameSize-1);
    feval(mfilename, frame, fs, 20, 12, 1);

      ZCR:

      过0检测,用于判断每一帧中过零点的数量情况,最简单的版本可参考:zeros cross rate. 

    waveFile='csNthu.wav';
    frameSize=256;
    overlap=0;
    [y, fs, nbits]=wavread(waveFile);
    frameMat=enframe(y, frameSize, overlap);
    frameNum=size(frameMat, 2);
    for i=1:frameNum
        frameMat(:,i)=frameMat(:,i)-mean(frameMat(:,i));    % mean justification
    end
    zcr=sum(frameMat(1:end-1, :).*frameMat(2:end, :)<0);
    sampleTime=(1:length(y))/fs;
    frameTime=((0:frameNum-1)*(frameSize-overlap)+0.5*frameSize)/fs;
    subplot(2,1,1); plot(sampleTime, y); ylabel('Amplitude'); title(waveFile);
    subplot(2,1,2); plot(frameTime, zcr, '.-');
    xlabel('Time (sec)'); ylabel('Count'); title('ZCR');

      EPD:

      端点检测,检测声音的起始点和终止点,可参考:EPD in Time Domain,在时域中的最简单检测方法。

    waveFile='sunday.wav';
    [wave, fs, nbits] = wavread(waveFile);
    frameSize = 256;
    overlap = 128;
    
    wave=wave-mean(wave);                % zero-mean substraction
    frameMat=buffer2(wave, frameSize, overlap);    % frame blocking,每一列代表一帧
    frameNum=size(frameMat, 2);            % no. of frames
    volume=frame2volume(frameMat);        % volume,求每一帧的能量,绝对值或者平方和,volume为行向量
    volumeTh1=max(volume)*0.1;            % volume threshold 1
    volumeTh2=median(volume)*0.1;            % volume threshold 2
    volumeTh3=min(volume)*10;            % volume threshold 3
    volumeTh4=volume(1)*5;                % volume threshold 4
    index1 = find(volume>volumeTh1); %找出volume大于阈值的那些帧序号
    index2 = find(volume>volumeTh2);
    index3 = find(volume>volumeTh3);
    index4 = find(volume>volumeTh4);
    %frame2sampleIndex()为从帧序号找到样本点的序号(即每一个采样点的序号)
    %endPointX长度为2,包含了起点和终点的样本点序号
    endPoint1=frame2sampleIndex([index1(1), index1(end)], frameSize, overlap);
    endPoint2=frame2sampleIndex([index2(1), index2(end)], frameSize, overlap);
    endPoint3=frame2sampleIndex([index3(1), index3(end)], frameSize, overlap);
    endPoint4=frame2sampleIndex([index4(1), index4(end)], frameSize, overlap);
    
    subplot(2,1,1);
    time=(1:length(wave))/fs;
    plot(time, wave);
    ylabel('Amplitude'); title('Waveform');
    axis([-inf inf -1 1]);
    line(time(endPoint1(  1))*[1 1], [-1, 1], 'color', 'm');%标起点终点线
    line(time(endPoint2(  1))*[1 1], [-1, 1], 'color', 'g');
    line(time(endPoint3(  1))*[1 1], [-1, 1], 'color', 'k');
    line(time(endPoint4(  1))*[1 1], [-1, 1], 'color', 'r');
    line(time(endPoint1(end))*[1 1], [-1, 1], 'color', 'm');
    line(time(endPoint2(end))*[1 1], [-1, 1], 'color', 'g');
    line(time(endPoint3(end))*[1 1], [-1, 1], 'color', 'k');
    line(time(endPoint4(end))*[1 1], [-1, 1], 'color', 'r');
    legend('Waveform', 'Boundaries by threshold 1', 'Boundaries by threshold 2', 'Boundaries by threshold 3', 'Boundaries by threshold 4');
    
    subplot(2,1,2);
    frameTime=frame2sampleIndex(1:frameNum, frameSize, overlap);
    plot(frameTime, volume, '.-');
    ylabel('Sum of Abs.'); title('Volume');
    axis tight;
    line([min(frameTime), max(frameTime)], volumeTh1*[1 1], 'color', 'm');
    line([min(frameTime), max(frameTime)], volumeTh2*[1 1], 'color', 'g');
    line([min(frameTime), max(frameTime)], volumeTh3*[1 1], 'color', 'k');
    line([min(frameTime), max(frameTime)], volumeTh4*[1 1], 'color', 'r');
    legend('Volume', 'Threshold 1', 'Threshold 2', 'Threshold 3', 'Threshold 4');

       GMM: 

       GMM用在拟合数据分布上,本质上是先假设样本的概率分布为GMM,然后用多个样本去学习这些GMM的参数。GMM建模在语音中可用于某个单词的发音,某个人的音色等。其训练过程可参考:speaker recognition.

    function [M, V, W, logProb] = gmmTrain(data, gaussianNum, dispOpt)
    % gmmTrain: Parameter training for gaussian mixture model (GMM)
    %    Usage: function [M, V, W, logProb] = gmm(data, gaussianNum, dispOpt)
    %        data: dim x dataNum matrix where each column is a data point
    %        gaussianNum: No. of Gaussians or initial centers
    %        dispOpt: Option for displaying info during training
    %        M: dim x meanNum matrix where each column is a mean vector
    %        V: 1 x gaussianNum vector where each element is a variance for a Gaussian
    %        W: 1 x gaussianNum vector where each element is a weighting factor for a Gaussian
    
    % Roger Jang 20000610
    
    if nargin==0, selfdemo; return; end
    if nargin<3, dispOpt=0; end
    
    maxLoopCount = 50;    % Max. iteration
    minImprove = 1e-6;    % Min. improvement
    minVariance = 1e-6;    % Min. variance
    logProb = zeros(maxLoopCount, 1);   % Array for objective function
    [dim, dataNum] = size(data);
    
    % Set initial parameters
    % Set initial M
    %M = data(1+floor(rand(gaussianNum,1)*dataNum),:);    % Randomly select several data points as the centers
    if length(gaussianNum)==1,
        % Using vqKmeans to find initial centers
        fprintf('Start KMEANS to find the initial mu...
    ');
    %    M = vqKmeansMex(data, gaussianNum, 0);
        M = vqKmeans(data, gaussianNum, 0); %利用聚类的方法求均值,聚成gaussianNum类
    %    M = vqLBG(data, gaussianNum, 0);
        fprintf('Start GMM training...
    ');
        if any(any(~isfinite(M))); keyboard; end
    else
        % gaussianNum is in fact the initial centers
        M = gaussianNum;
        gaussianNum = size(M, 2);
    end
    % Set initial V as the distance to the nearest center
    if gaussianNum==1
        V=1;
    else
        distance=pairwiseSqrDist(M);%pairwiseSqrDist是dll
       %distance=pairwiseSqrDist2(M);
       
        distance(1:(gaussianNum+1):gaussianNum^2)=inf;    % Diagonal elements are inf
        [V, index]=min(distance);    % Initial variance for each Gaussian
    end
    % Set initial W
    W = ones(1, gaussianNum)/gaussianNum;    % Weight for each Gaussian,初始化时是均分权值
    
    if dispOpt & dim==2, displayGmm(M, V, data); end
    for i = 1:maxLoopCount  %开始迭代训练参数,EM算法
        % Expectation step:
        % P(i,j) is the probability of data(:,j) to the i-th Gaussian
        % Prob为每个样本在GMM下的概率
        [prob, P]=gmmEval(data, M, V, W);
        logProb(i)=sum(log(prob)); %所有样本的联合概率
        if dispOpt
            fprintf('i = %d, log prob. = %f
    ',i-1, logProb(i));
        end
        PW = diag(W)*P;
        BETA=PW./(ones(gaussianNum,1)*sum(PW));    % BETA(i,j) is beta_i(x_j)
        sumBETA=sum(BETA,2);
    
        % Maximization step:  eqns (2.96) to (2.98) from Bishop p.67:
        M = (data*BETA')./(ones(dim,1)*sumBETA');
    
       DISTSQ = pairwiseSqrDist(M, data);                    % Distance of M to data
       %DISTSQ = pairwiseSqrDist2(M, data);                    % Distance of M to data
       
        V = max((sum(BETA.*DISTSQ, 2)./sumBETA)/dim, minVariance);    % (2.97)
        W = (1/dataNum)*sumBETA;                    % (2.98)
    
        if dispOpt & dim==2, displayGmm(M, V, data); end
        if i>1, if logProb(i)-logProb(i-1)<minImprove, break; end; end
    end
    [prob, P]=gmmEval(data, M, V, W);
    logProb(i)=sum(log(prob));
    fprintf('Iteration count = %d, log prob. = %f
    ',i, logProb(i));
    logProb(i+1:maxLoopCount) = [];
    
    % ====== Self Demo ======
    function selfdemo
    %[data, gaussianNum] = dcdata(2);
    data = rand(1000,2);
    gaussianNum = 8;
    data=data';
    plotOpt=1;
    [M, V, W, lp] = feval(mfilename, data, gaussianNum, plotOpt);
    
    pointNum = 40;
    x = linspace(min(data(1,:)), max(data(1,:)), pointNum);
    y = linspace(min(data(2,:)), max(data(2,:)), pointNum);
    [xx, yy] = meshgrid(x, y);
    data = [xx(:) yy(:)]';
    z = gmmEval(data, M, V, W);
    zz = reshape(z, pointNum, pointNum);
    figure; mesh(xx, yy, zz); axis tight; box on; rotate3d on
    figure; contour(xx, yy, zz, 30); axis image
    
    % ====== Other subfunctions ======
    function displayGmm(M, V, data)
    % Display function for EM algorithm
    figureH=findobj(0, 'tag', mfilename);
    if isempty(figureH)
        figureH=figure;
        set(figureH, 'tag', mfilename);
        colordef black
        plot(data(1,:), data(2,:),'.r'); axis image
        theta=linspace(-pi, pi, 21);
        x=cos(theta); y=sin(theta);
        sigma=sqrt(V);
        for i=1:length(sigma)
            circleH(i)=line(x*sigma(i)+M(1,i), y*sigma(i)+M(2,i), 'color', 'y');
        end
        set(circleH, 'tag', 'circleH', 'erasemode', 'xor');
    else
        circleH=findobj(figureH, 'tag', 'circleH');
        theta=linspace(-pi, pi, 21);
        x=cos(theta); y=sin(theta);
        sigma=sqrt(V);
        for i=1:length(sigma)
            set(circleH(i), 'xdata', x*sigma(i)+M(1,i), 'ydata', y*sigma(i)+M(2,i));
        end
        drawnow
    end

       Speaker identification:

       给N个人的语音资料,用GMM可以训练这N个人的声音模型,然后给定一段语音,判断该语音与这N个人中哪个最相似。方法是求出该语音在N个GMM模型下的概率,选出概率最大的那个。可参考:speaker recognition.

    function [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
    % speakerIdentify: speaker identification using GMM parameters
    %    Usage: [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
    %        speakerData: structure array generated by speakerDataRead.m
    %        speakerGmm: speakerGmm(i).gmmPrm is the GMM parameters for speaker i.
    %        useIntGmm: use fixed-point GMM
    
    %    Roger Jang, 20070517, 20080726
    
    if nargin<3, useIntGmm=0; end
    
    % ====== Speaker identification using GMM parameters
    speakerNum=length(speakerData);
    for i=1:speakerNum
    %    fprintf('%d/%d: Recognizing wave files by %s
    ', i, speakerNum, speakerData(i).name);
        for j=1:length(speakerData(i).sentence)
    %        fprintf('	Sentece %d...
    ', j);
            frameNum=size(speakerData(i).sentence(j).fea, 2);
            logProb=zeros(speakerNum, frameNum); %logProb(i,m)表示第i个人第j个句子中第m帧在GMM模型下的log概率
            %找出一个句子,看它属于哪个speaker
            for k=1:speakerNum,
    %            fprintf('		Speaker %d...
    ', k);
            %    logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
                if ~useIntGmm
                %    logProb(k, :)=gmmEvalMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                    logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
                else
                %    logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                    logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, speakerGmm(i).gmmPrm);
                end
            end
            cumLogProb=sum(logProb, 2);
            [maxProb, index]=max(cumLogProb);
            speakerData(i).sentence(j).predictedSpeaker=index; %找出身份
            speakerData(i).sentence(j).logProb=logProb;
        end
    end
    
    % ====== Compute confusion matrix and recognition rate
    confusionMatrix=zeros(speakerNum);
    for i=1:speakerNum,
        predictedSpeaker=[speakerData(i).sentence.predictedSpeaker];
        [index, count]=elementCount(predictedSpeaker);
        confusionMatrix(i, index)=count;
    end
    recogRate=sum(diag(confusionMatrix))/sum(sum(confusionMatrix));

      GMM-HMM:

      训练阶段:给出HMM的k个状态,每个状态下的观察样本的生成可以用一个概率分布来拟合,这里是采用GMM拟合的。其实,可以把GMM-HMM整体看成是一个生成模型。给定该模型的5个初始参数(结合随机和训练样本获得),启动EM算法的E步:获得训练样本分布,即计算训练样本在各个状态下的概率。M步:用这些训练样本重新评估那5个参数。

      测试阶段:(以孤立词识别为例)给定每个词发音的frame矩阵,取出某一个GMM-HMM模型,算出该发音每一帧数据在取出的GMM-HMM模型各个state下的概率,结合模型的转移概率和初始概率,获得对应的clique tree,可用图模型的方法inference出生成该语音的概率。比较多个GMM-HMM模型,取最大概率的模型对应的词。

       参考资料:

         机器学习&数据挖掘笔记_13(用htk完成简单的孤立词识别)

         http://htk.eng.cam.ac.uk/extensions/

         张智星老师的网页教程mfcc.

  • 相关阅读:
    Linux 清理boot分区
    Linux 虚拟内存
    使用mongoskin操作MongoDB
    nodejs操作session和cookie
    nodejs接收get参数和post参数
    Tomcat--在IDEA创建Java Web项目,通过tomcat启动
    Tomcat--配置
    Mac安装Tomcat
    XML--解析
    XML--约束
  • 原文地址:https://www.cnblogs.com/tornadomeet/p/3276753.html
Copyright © 2011-2022 走看看