zoukankan      html  css  js  c++  java
  • 机器学习&数据挖掘笔记_14(GMM-HMM语音识别简单理解)

       为了对GMM-HMM在语音识别上的应用有个宏观认识,花了些时间读了下HTK(用htk完成简单的孤立词识别)的部分源码,对该算法总算有了点大概认识,达到了预期我想要的。不得不说,网络上关于语音识别的通俗易懂教程太少,都是各种公式满天飞,很少有说具体细节的,当然了,那需要有实战经验才行。下面总结以下几点,对其有个宏观印象即可(以孤立词识别为例)。

      一、每个单词的读音都对应一个HMM模型,大家都知道HMM模型中有个状态集S,那么每个状态用什么来表示呢,数字?向量?矩阵?其实这个状态集中的状态没有具体的数学要求,只是一个名称而已,你可以用’1’, ’2’, ‘3’…表示,也可以用’a’, ‘b’, ’c ’表示。另外每个HMM模型中到底该用多少个状态,是通过先验知识人为设定的。

      二、HMM的每一个状态都对应有一个观察值,这个观察值可以是一个实数,也可以是个向量,且每个状态对应的观察值的维度应该相同。假设现在有一个单词的音频文件,首先需要将其进行采样得到数字信息(A/D转换),然后分帧进行MFCC特征提取,假设每一帧音频对应的MFCC特征长度为39,则每个音频文件就转换成了N个MFCC向量(不同音频文件对应的N可能不同),这就成了一个序列,而在训练HMM模型的参数时(比如用Baum-Welch算法),每次输入到HMM中的数据要求就是一个观测值序列。这时,每个状态对应的观测值为39维的向量,因为向量中元素的取值是连续的,需要用多维密度函数来模拟,通常情况下用的是多维高斯函数。在GMM-HMM体系中,这个拟合函数是用K个多维高斯混合得到的。假设知道了每个状态对应的K个多维高斯的所有参数,则该GMM生成该状态上某一个观察向量(一帧音频的MFCC系数)的概率就可以求出来了。

      三、对每个单词建立一个HMM模型,需要用到该单词的训练样本,这些训练样本是提前标注好的,即每个样本对应一段音频,该音频只包含这个单词的读音。当有了该单词的多个训练样本后,就用这些样本结合Baum-Welch算法和EM算法来训练出GMM-HMM的所有参数,这些参数包括初始状态的概率向量,状态之间的转移矩阵,每个状态对应的观察矩阵(这里对应的是GMM,即每个状态对应的K个高斯的权值,每个高斯的均值向量和方差矩阵)。

      四、在识别阶段,输入一段音频,如果该音频含有多个单词,则可以手动先将其分割开(考虑的是最简单的方法),然后提取每个单词的音频MFCC特征序列,将该序列输入到每个HMM模型(已提前训练好的)中,采用前向算法求出每个HMM模型生成该序列的概率,最后取最大概率对应的那个模型,而那个模型所表示的单词就是我们识别的结果。

      五、在建立声学模型时,可以用Deep Learning的方法来代替GMM-HMM中的GMM,因为GMM模拟任意函数的功能取决于混合高斯函数的个数,所以具有一定的局限性,属于浅层模型。而Deep Network可以模拟任意的函数,因而表达能力更强。注意,这里用来代替GMM的Deep Nets模型要求是产生式模型,比如DBN,DBM等,因为在训练HMM-DL网络时,需要用到HMM的某个状态产生一个样本的概率。

      六、GMM-HMM在具体实现起来还是相当复杂的。

      七、一般涉及到时间序列时才会使用HMM,比如这里音频中的语音识别,视频中的行为识别等。如果我们用GMM-HMM对静态的图片分类,因为这里没涉及到时间信息,所以HMM的状态数可设为1,那么此时的GMM-HMM算法就退化成GMM算法了。

      MFCC:

      MFCC的matlab实现教程可参考:张智星老师的网页教程mfcc. 最基本的12维特征。

    function mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
    % frame2mfcc: Frame to MFCC conversion.
    %    Usage: mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
    %
    %    For example:
    %        waveFile='what_movies_have_you_seen_recently.wav';
    %        [y, fs, nbits]=wavReadInt(waveFile);
    %        startIndex=12000;
    %        frameSize=512;
    %        frame=y(startIndex:startIndex+frameSize-1);
    %        frame2mfcc(frame, fs, 20, 12, 1);
    
    %    Roger Jang 20060417
    
    if nargin<1, selfdemo; return; end
    if nargin<2, fs=16000; end
    if nargin<3, filterNum=20; end
    if nargin<4, mfccNum=12; end
    if nargin<5, plotOpt=0; end
    
    frameSize=length(frame);
    % ====== Preemphasis should be done at wave level
    %a=0.95;
    %frame2 = filter([1, -a], 1, frame);
    frame2=frame;
    % ====== Hamming windowing
    frame3=frame2.*hamming(frameSize);
    % ====== FFT
    [fftMag, fftPhase, fftFreq, fftPowerDb]=fftOneSide(frame3, fs);
    % ====== Triangular band-pass filter bank
    triFilterBankPrm=getTriFilterBankPrm(fs, filterNum);    % Get parameters for triangular band-pass filter bank
    % Triangular bandpass filter.
    for i=1:filterNum
        tbfCoef(i)=dot(fftPowerDb, trimf(fftFreq, triFilterBankPrm(:,i)));%得到filterNum个滤波系数
    end
    % ====== DCT
    mfcc=zeros(mfccNum, 1); %DCT变换的前后个数也没有变
    for i=1:mfccNum
        coef = cos((pi/filterNum)*i*((1:filterNum)-0.5))'; %mfcc中的前mfccNum个系数
        mfcc(i) = sum(coef.*tbfCoef');%直接按照DCT公式
    end
    % ====== Log energy
    %logEnergy=10*log10(sum(frame.*frame));
    %mfcc=[logEnergy; mfcc];
    
    if plotOpt
        subplot(2,1,1);
        plot(frame, '.-');
        set(gca, 'xlim', [-inf inf]);
        title('Input frame');
        subplot(2,1,2);
        plot(mfcc, '.-');
        set(gca, 'xlim', [-inf inf]);
        title('MFCC vector');
    end
    
    % ====== trimf.m (from fuzzy toolbox)
    function y = trimf(x, prm) %由频率的横坐标算出三角形内的纵坐标,0~1
    a = prm(1); b = prm(2); c = prm(3);
    y = zeros(size(x));
    % Left and right shoulders (y = 0)
    index = find(x <= a | c <= x);
    y(index) = zeros(size(index)); %只考虑三角波内的量
    % Left slope
    if (a ~= b)
        index = find(a < x & x < b);
        y(index) = (x(index)-a)/(b-a);
    end
    % right slope
    if (b ~= c)
        index = find(b < x & x < c);
        y(index) = (c-x(index))/(c-b);
    end
    % Center (y = 1)
    index = find(x == b);
    y(index) = ones(size(index));
    
    % ====== Self demo
    function selfdemo
    waveFile='what_movies_have_you_seen_recently.wav';
    [y, fs, nbits]=wavReadInt(waveFile);
    startIndex=12000;
    frameSize=512;
    frame=y(startIndex:startIndex+frameSize-1);
    feval(mfilename, frame, fs, 20, 12, 1);

      ZCR:

      过0检测,用于判断每一帧中过零点的数量情况,最简单的版本可参考:zeros cross rate. 

    waveFile='csNthu.wav';
    frameSize=256;
    overlap=0;
    [y, fs, nbits]=wavread(waveFile);
    frameMat=enframe(y, frameSize, overlap);
    frameNum=size(frameMat, 2);
    for i=1:frameNum
        frameMat(:,i)=frameMat(:,i)-mean(frameMat(:,i));    % mean justification
    end
    zcr=sum(frameMat(1:end-1, :).*frameMat(2:end, :)<0);
    sampleTime=(1:length(y))/fs;
    frameTime=((0:frameNum-1)*(frameSize-overlap)+0.5*frameSize)/fs;
    subplot(2,1,1); plot(sampleTime, y); ylabel('Amplitude'); title(waveFile);
    subplot(2,1,2); plot(frameTime, zcr, '.-');
    xlabel('Time (sec)'); ylabel('Count'); title('ZCR');

      EPD:

      端点检测,检测声音的起始点和终止点,可参考:EPD in Time Domain,在时域中的最简单检测方法。

    waveFile='sunday.wav';
    [wave, fs, nbits] = wavread(waveFile);
    frameSize = 256;
    overlap = 128;
    
    wave=wave-mean(wave);                % zero-mean substraction
    frameMat=buffer2(wave, frameSize, overlap);    % frame blocking,每一列代表一帧
    frameNum=size(frameMat, 2);            % no. of frames
    volume=frame2volume(frameMat);        % volume,求每一帧的能量,绝对值或者平方和,volume为行向量
    volumeTh1=max(volume)*0.1;            % volume threshold 1
    volumeTh2=median(volume)*0.1;            % volume threshold 2
    volumeTh3=min(volume)*10;            % volume threshold 3
    volumeTh4=volume(1)*5;                % volume threshold 4
    index1 = find(volume>volumeTh1); %找出volume大于阈值的那些帧序号
    index2 = find(volume>volumeTh2);
    index3 = find(volume>volumeTh3);
    index4 = find(volume>volumeTh4);
    %frame2sampleIndex()为从帧序号找到样本点的序号(即每一个采样点的序号)
    %endPointX长度为2,包含了起点和终点的样本点序号
    endPoint1=frame2sampleIndex([index1(1), index1(end)], frameSize, overlap);
    endPoint2=frame2sampleIndex([index2(1), index2(end)], frameSize, overlap);
    endPoint3=frame2sampleIndex([index3(1), index3(end)], frameSize, overlap);
    endPoint4=frame2sampleIndex([index4(1), index4(end)], frameSize, overlap);
    
    subplot(2,1,1);
    time=(1:length(wave))/fs;
    plot(time, wave);
    ylabel('Amplitude'); title('Waveform');
    axis([-inf inf -1 1]);
    line(time(endPoint1(  1))*[1 1], [-1, 1], 'color', 'm');%标起点终点线
    line(time(endPoint2(  1))*[1 1], [-1, 1], 'color', 'g');
    line(time(endPoint3(  1))*[1 1], [-1, 1], 'color', 'k');
    line(time(endPoint4(  1))*[1 1], [-1, 1], 'color', 'r');
    line(time(endPoint1(end))*[1 1], [-1, 1], 'color', 'm');
    line(time(endPoint2(end))*[1 1], [-1, 1], 'color', 'g');
    line(time(endPoint3(end))*[1 1], [-1, 1], 'color', 'k');
    line(time(endPoint4(end))*[1 1], [-1, 1], 'color', 'r');
    legend('Waveform', 'Boundaries by threshold 1', 'Boundaries by threshold 2', 'Boundaries by threshold 3', 'Boundaries by threshold 4');
    
    subplot(2,1,2);
    frameTime=frame2sampleIndex(1:frameNum, frameSize, overlap);
    plot(frameTime, volume, '.-');
    ylabel('Sum of Abs.'); title('Volume');
    axis tight;
    line([min(frameTime), max(frameTime)], volumeTh1*[1 1], 'color', 'm');
    line([min(frameTime), max(frameTime)], volumeTh2*[1 1], 'color', 'g');
    line([min(frameTime), max(frameTime)], volumeTh3*[1 1], 'color', 'k');
    line([min(frameTime), max(frameTime)], volumeTh4*[1 1], 'color', 'r');
    legend('Volume', 'Threshold 1', 'Threshold 2', 'Threshold 3', 'Threshold 4');

       GMM: 

       GMM用在拟合数据分布上,本质上是先假设样本的概率分布为GMM,然后用多个样本去学习这些GMM的参数。GMM建模在语音中可用于某个单词的发音,某个人的音色等。其训练过程可参考:speaker recognition.

    function [M, V, W, logProb] = gmmTrain(data, gaussianNum, dispOpt)
    % gmmTrain: Parameter training for gaussian mixture model (GMM)
    %    Usage: function [M, V, W, logProb] = gmm(data, gaussianNum, dispOpt)
    %        data: dim x dataNum matrix where each column is a data point
    %        gaussianNum: No. of Gaussians or initial centers
    %        dispOpt: Option for displaying info during training
    %        M: dim x meanNum matrix where each column is a mean vector
    %        V: 1 x gaussianNum vector where each element is a variance for a Gaussian
    %        W: 1 x gaussianNum vector where each element is a weighting factor for a Gaussian
    
    % Roger Jang 20000610
    
    if nargin==0, selfdemo; return; end
    if nargin<3, dispOpt=0; end
    
    maxLoopCount = 50;    % Max. iteration
    minImprove = 1e-6;    % Min. improvement
    minVariance = 1e-6;    % Min. variance
    logProb = zeros(maxLoopCount, 1);   % Array for objective function
    [dim, dataNum] = size(data);
    
    % Set initial parameters
    % Set initial M
    %M = data(1+floor(rand(gaussianNum,1)*dataNum),:);    % Randomly select several data points as the centers
    if length(gaussianNum)==1,
        % Using vqKmeans to find initial centers
        fprintf('Start KMEANS to find the initial mu...
    ');
    %    M = vqKmeansMex(data, gaussianNum, 0);
        M = vqKmeans(data, gaussianNum, 0); %利用聚类的方法求均值,聚成gaussianNum类
    %    M = vqLBG(data, gaussianNum, 0);
        fprintf('Start GMM training...
    ');
        if any(any(~isfinite(M))); keyboard; end
    else
        % gaussianNum is in fact the initial centers
        M = gaussianNum;
        gaussianNum = size(M, 2);
    end
    % Set initial V as the distance to the nearest center
    if gaussianNum==1
        V=1;
    else
        distance=pairwiseSqrDist(M);%pairwiseSqrDist是dll
       %distance=pairwiseSqrDist2(M);
       
        distance(1:(gaussianNum+1):gaussianNum^2)=inf;    % Diagonal elements are inf
        [V, index]=min(distance);    % Initial variance for each Gaussian
    end
    % Set initial W
    W = ones(1, gaussianNum)/gaussianNum;    % Weight for each Gaussian,初始化时是均分权值
    
    if dispOpt & dim==2, displayGmm(M, V, data); end
    for i = 1:maxLoopCount  %开始迭代训练参数,EM算法
        % Expectation step:
        % P(i,j) is the probability of data(:,j) to the i-th Gaussian
        % Prob为每个样本在GMM下的概率
        [prob, P]=gmmEval(data, M, V, W);
        logProb(i)=sum(log(prob)); %所有样本的联合概率
        if dispOpt
            fprintf('i = %d, log prob. = %f
    ',i-1, logProb(i));
        end
        PW = diag(W)*P;
        BETA=PW./(ones(gaussianNum,1)*sum(PW));    % BETA(i,j) is beta_i(x_j)
        sumBETA=sum(BETA,2);
    
        % Maximization step:  eqns (2.96) to (2.98) from Bishop p.67:
        M = (data*BETA')./(ones(dim,1)*sumBETA');
    
       DISTSQ = pairwiseSqrDist(M, data);                    % Distance of M to data
       %DISTSQ = pairwiseSqrDist2(M, data);                    % Distance of M to data
       
        V = max((sum(BETA.*DISTSQ, 2)./sumBETA)/dim, minVariance);    % (2.97)
        W = (1/dataNum)*sumBETA;                    % (2.98)
    
        if dispOpt & dim==2, displayGmm(M, V, data); end
        if i>1, if logProb(i)-logProb(i-1)<minImprove, break; end; end
    end
    [prob, P]=gmmEval(data, M, V, W);
    logProb(i)=sum(log(prob));
    fprintf('Iteration count = %d, log prob. = %f
    ',i, logProb(i));
    logProb(i+1:maxLoopCount) = [];
    
    % ====== Self Demo ======
    function selfdemo
    %[data, gaussianNum] = dcdata(2);
    data = rand(1000,2);
    gaussianNum = 8;
    data=data';
    plotOpt=1;
    [M, V, W, lp] = feval(mfilename, data, gaussianNum, plotOpt);
    
    pointNum = 40;
    x = linspace(min(data(1,:)), max(data(1,:)), pointNum);
    y = linspace(min(data(2,:)), max(data(2,:)), pointNum);
    [xx, yy] = meshgrid(x, y);
    data = [xx(:) yy(:)]';
    z = gmmEval(data, M, V, W);
    zz = reshape(z, pointNum, pointNum);
    figure; mesh(xx, yy, zz); axis tight; box on; rotate3d on
    figure; contour(xx, yy, zz, 30); axis image
    
    % ====== Other subfunctions ======
    function displayGmm(M, V, data)
    % Display function for EM algorithm
    figureH=findobj(0, 'tag', mfilename);
    if isempty(figureH)
        figureH=figure;
        set(figureH, 'tag', mfilename);
        colordef black
        plot(data(1,:), data(2,:),'.r'); axis image
        theta=linspace(-pi, pi, 21);
        x=cos(theta); y=sin(theta);
        sigma=sqrt(V);
        for i=1:length(sigma)
            circleH(i)=line(x*sigma(i)+M(1,i), y*sigma(i)+M(2,i), 'color', 'y');
        end
        set(circleH, 'tag', 'circleH', 'erasemode', 'xor');
    else
        circleH=findobj(figureH, 'tag', 'circleH');
        theta=linspace(-pi, pi, 21);
        x=cos(theta); y=sin(theta);
        sigma=sqrt(V);
        for i=1:length(sigma)
            set(circleH(i), 'xdata', x*sigma(i)+M(1,i), 'ydata', y*sigma(i)+M(2,i));
        end
        drawnow
    end

       Speaker identification:

       给N个人的语音资料,用GMM可以训练这N个人的声音模型,然后给定一段语音,判断该语音与这N个人中哪个最相似。方法是求出该语音在N个GMM模型下的概率,选出概率最大的那个。可参考:speaker recognition.

    function [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
    % speakerIdentify: speaker identification using GMM parameters
    %    Usage: [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
    %        speakerData: structure array generated by speakerDataRead.m
    %        speakerGmm: speakerGmm(i).gmmPrm is the GMM parameters for speaker i.
    %        useIntGmm: use fixed-point GMM
    
    %    Roger Jang, 20070517, 20080726
    
    if nargin<3, useIntGmm=0; end
    
    % ====== Speaker identification using GMM parameters
    speakerNum=length(speakerData);
    for i=1:speakerNum
    %    fprintf('%d/%d: Recognizing wave files by %s
    ', i, speakerNum, speakerData(i).name);
        for j=1:length(speakerData(i).sentence)
    %        fprintf('	Sentece %d...
    ', j);
            frameNum=size(speakerData(i).sentence(j).fea, 2);
            logProb=zeros(speakerNum, frameNum); %logProb(i,m)表示第i个人第j个句子中第m帧在GMM模型下的log概率
            %找出一个句子,看它属于哪个speaker
            for k=1:speakerNum,
    %            fprintf('		Speaker %d...
    ', k);
            %    logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
                if ~useIntGmm
                %    logProb(k, :)=gmmEvalMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                    logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
                else
                %    logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                    logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, speakerGmm(i).gmmPrm);
                end
            end
            cumLogProb=sum(logProb, 2);
            [maxProb, index]=max(cumLogProb);
            speakerData(i).sentence(j).predictedSpeaker=index; %找出身份
            speakerData(i).sentence(j).logProb=logProb;
        end
    end
    
    % ====== Compute confusion matrix and recognition rate
    confusionMatrix=zeros(speakerNum);
    for i=1:speakerNum,
        predictedSpeaker=[speakerData(i).sentence.predictedSpeaker];
        [index, count]=elementCount(predictedSpeaker);
        confusionMatrix(i, index)=count;
    end
    recogRate=sum(diag(confusionMatrix))/sum(sum(confusionMatrix));

      GMM-HMM:

      训练阶段:给出HMM的k个状态,每个状态下的观察样本的生成可以用一个概率分布来拟合,这里是采用GMM拟合的。其实,可以把GMM-HMM整体看成是一个生成模型。给定该模型的5个初始参数(结合随机和训练样本获得),启动EM算法的E步:获得训练样本分布,即计算训练样本在各个状态下的概率。M步:用这些训练样本重新评估那5个参数。

      测试阶段:(以孤立词识别为例)给定每个词发音的frame矩阵,取出某一个GMM-HMM模型,算出该发音每一帧数据在取出的GMM-HMM模型各个state下的概率,结合模型的转移概率和初始概率,获得对应的clique tree,可用图模型的方法inference出生成该语音的概率。比较多个GMM-HMM模型,取最大概率的模型对应的词。

       参考资料:

         机器学习&数据挖掘笔记_13(用htk完成简单的孤立词识别)

         http://htk.eng.cam.ac.uk/extensions/

         张智星老师的网页教程mfcc.

  • 相关阅读:
    【leetcode】Binary Search Tree Iterator
    【leetcode】Palindrome Partitioning II
    【leetcode】Best Time to Buy and Sell Stock III
    【leetcode】Best Time to Buy and Sell Stock II
    【leetcode】Longest Consecutive Sequence
    【leetcode】Factorial Trailing Zeroes
    【leetcode】Simplify Path
    【leetcode】Generate Parentheses
    【leetcode】Combination Sum II
    【leetcode】Combination Sum
  • 原文地址:https://www.cnblogs.com/tornadomeet/p/3276753.html
Copyright © 2011-2022 走看看