  • 【sklearn】Gaussian Mixture Model

    Overview

    Reference

    A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or Maximum A Posteriori(MAP) estimation from a well-trained prior model.

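    The Gaussian mixture model is a classic probabilistic (generative) model, commonly used in pattern recognition applications such as voiceprint (speaker) recognition and speech recognition. It is usually trained (that is, its parameters are estimated) by maximum likelihood estimation, carried out in practice with the Expectation-Maximization (EM) algorithm. Algorithm principle:
    [figure]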
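    Written out, the "weighted sum of Gaussian component densities" and the EM iteration take the following standard form. This is a sketch in notation chosen here, not taken from the original post: $K$ is the number of components, $N$ the number of training samples $x_i$, $w_k$, $\mu_k$, $\Sigma_k$ the weight, mean and covariance of component $k$, and $\gamma_{ik}$ the responsibility of component $k$ for sample $x_i$.

```latex
% Mixture density
p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1

% E-step: responsibilities under the current parameters
\gamma_{ik} = \frac{w_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

% M-step: maximum likelihood re-estimation
N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad
w_k = \frac{N_k}{N}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^{\top}
```

    The two steps are repeated until the log-likelihood stops improving by more than a tolerance.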

    The sklearn.mixture module implements mixture modeling algorithms. It provides the GaussianMixture and BayesianGaussianMixture classes, both of which inherit from BaseMixture.

    GaussianMixture

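    Represents a Gaussian mixture model probability distribution and estimates its parameters.
    See the sklearn.mixture.GaussianMixture documentation and its source code.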

    class sklearn.mixture.GaussianMixture(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10) 
    

    Initialization

    Instantiate the GaussianMixture class as follows:

    from sklearn.mixture import GaussianMixture
    gmm = GaussianMixture(n_components=20, max_iter=200, covariance_type='diag', n_init=3)
    

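    This GMM consists of 20 Gaussian components. The EM algorithm runs for at most 200 iterations per initialization, the covariance type is 'diag' (each component has its own diagonal covariance matrix), and 3 initializations are performed, of which the best result is kept.
    By default (init_params='kmeans'), the weights, means, and precisions are initialized with k-means. precisions_init defaults to None; if it is supplied, its required shape depends on covariance_type, and is (n_components, n_features) for 'diag'.
    warm_start defaults to False.
    Note: fitting requires n_samples >= n_components.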
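    As a rough check of the shapes described above, the sketch below fits a small 'diag' model and inspects the fitted attributes. The synthetic data, the reduced n_components=3, and random_state=0 are assumptions made for illustration; they are not from the original post.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (assumption for illustration): 500 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))

# Same options as above, with fewer components to suit the small data set.
gmm = GaussianMixture(n_components=3, max_iter=200, covariance_type='diag',
                      n_init=3, random_state=0)
gmm.fit(X)

print(gmm.weights_.shape)      # (n_components,)
print(gmm.means_.shape)        # (n_components, n_features)
print(gmm.covariances_.shape)  # (n_components, n_features) for 'diag'
print(gmm.precisions_.shape)   # (n_components, n_features) for 'diag'

# Fitting with fewer samples than components is rejected with a ValueError.
try:
    GaussianMixture(n_components=20).fit(X[:5])
    too_few_accepted = True
except ValueError:
    too_few_accepted = False
print(too_few_accepted)        # False
```

    With covariance_type='full' the covariance and precision arrays would instead have shape (n_components, n_features, n_features).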

    .fit

    Estimate model parameters with the EM algorithm. The method fits the model n_init times and keeps the parameters with which the model has the largest likelihood or lower bound. Within each trial, the method iterates between the E-step and the M-step for at most max_iter iterations, stopping early once the change of likelihood or lower bound is less than tol. If that threshold is not reached within max_iter iterations, a ConvergenceWarning is raised.
    If warm_start is True, then n_init is ignored and a single initialization is performed upon the first call. Upon consecutive calls, training starts where it left off.
    See the source code. Usage:

    gmm.fit(datas)  
    

    datas is an array-like of shape (n_samples, n_features): a list of n_features-dimensional data points, where each row corresponds to a single data point.
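    A minimal sketch of a fit call and the convergence information it leaves on the estimator. The two-cluster synthetic datas array and the chosen hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic training data (assumption): two well-separated 2-D clusters.
rng = np.random.default_rng(0)
datas = np.vstack([
    rng.normal(-5.0, 1.0, size=(200, 2)),
    rng.normal(5.0, 1.0, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type='diag',
                      n_init=3, random_state=0)
gmm.fit(datas)

print(gmm.converged_)    # True if the best run reached the tol threshold
print(gmm.n_iter_)       # EM iterations used by the best run
print(gmm.lower_bound_)  # lower bound on the log-likelihood of the best run

# The fitted means should sit close to the two cluster centres.
first_coords = np.sort(gmm.means_[:, 0])
print(first_coords)
```

    If converged_ is False, raising max_iter or loosening tol are the usual first things to try.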

    .score

    Compute the per-sample average log-likelihood of the given data X.
    See the source code. Usage:

    ll_score = gmm.score(test_datas)
    

    test_datas is an array-like of shape (n_samples, n_features): a list of n_features-dimensional data points, where each row corresponds to a single data point.
    The result ll_score is a float: the average per-sample log-likelihood of test_datas under the fitted Gaussian mixture.

    Parameters: X, array-like of shape (n_samples, n_features).
    Returns: log_likelihood, float.
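    To make "per-sample average" concrete, the sketch below compares score with the mean of score_samples, which returns one log-likelihood per row. The random train_datas and test_datas arrays and the model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (assumption): 3 features, separate train and test sets.
rng = np.random.default_rng(0)
train_datas = rng.normal(size=(300, 3))
test_datas = rng.normal(size=(100, 3))

gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
gmm.fit(train_datas)

ll_score = gmm.score(test_datas)              # single float
per_sample = gmm.score_samples(test_datas)    # shape (n_samples,)

print(ll_score)
print(per_sample.mean())  # same value as ll_score
```

    In a recognition setting, one model is typically trained per class (for example, per speaker), and a test utterance is assigned to the model that gives the highest score.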

    preprocessing

    Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
    For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of SVMs, or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order.
    If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features as expected.
    The sklearn.preprocessing module includes scaling, centering, normalization, and binarization methods.

    .scale

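    Before training a Gaussian mixture model, the features are usually standardized. sklearn.preprocessing.scale can be used for this: it standardizes a dataset along any axis, centering to the mean and scaling component-wise to unit variance.
    See sklearn.preprocessing.scale and its source code. Usage: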

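    from sklearn import preprocessing
    # The keyword arguments shown are the defaults; the `*` in the documented signature only marks them as keyword-only.
    X_tr = preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)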
    

    Note that the input X is an array-like or sparse matrix of shape (n_samples, n_features): the data to center and scale.
    axis is the axis along which the means and standard deviations are computed. If 0, each feature is standardized independently; otherwise (if 1), each sample is standardized.
    The returned X_tr is an ndarray or sparse matrix of shape (n_samples, n_features).
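    The effect of the axis argument can be checked on a tiny array. The values below are made up for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Toy data (assumption): 3 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_cols = preprocessing.scale(X, axis=0)  # standardize each feature (column)
X_rows = preprocessing.scale(X, axis=1)  # standardize each sample (row)

# Each column of X_cols has mean 0 and standard deviation 1.
print(X_cols.mean(axis=0), X_cols.std(axis=0))
# Each row of X_rows has mean 0 and standard deviation 1.
print(X_rows.mean(axis=1), X_rows.std(axis=1))
```

    For GMM features, axis=0 is normally the one wanted, since it puts every feature on a comparable scale.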

    Warning: Risk of data leak

    Do not use scale unless you know what you are doing.
    A common mistake is to apply it to the entire data before splitting into training and test sets. This will bias the model evaluation because information would have leaked from the test set to the training set. In general, we recommend using StandardScaler within a Pipeline in order to prevent most risks of data leaking: pipe = make_pipeline(StandardScaler(), LogisticRegression()).
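    The same pattern can be applied to a mixture model. The sketch below swaps the LogisticRegression of the quoted example for a GaussianMixture, which is an adaptation made here and not something the scikit-learn warning itself shows; the synthetic data and hyperparameters are likewise assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 50.0], size=(400, 2))

# Split first, so the scaler never sees the test set.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    GaussianMixture(n_components=2, covariance_type='diag', random_state=0),
)
pipe.fit(X_train)               # scaler statistics come from X_train only

test_ll = pipe.score(X_test)    # X_test is scaled with the training statistics
scaler_mean = pipe.named_steps['standardscaler'].mean_
print(test_ll)
print(scaler_mean)
```

    Because the scaler is fitted inside the pipeline, its mean and standard deviation are computed from the training split alone.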

    Not used?

    .StandardScaler

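    Standardize features by removing the mean and scaling to unit variance. The standard score of a sample $x$ is computed as $z = \left( x - \mu \right)/\sigma$, where $\mu$ is the mean of the training samples and $\sigma$ is their standard deviation.
    See sklearn.preprocessing.StandardScaler.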

    class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)  
    

    By default, with_mean and with_std are both True, so the input data is centered and then scaled to unit standard deviation.
    Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
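    A small sketch of the fit-then-transform workflow, checking that transform applies the stored training statistics. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): statistics are learned from X_train only.
X_train = np.array([[0.0, 10.0],
                    [2.0, 30.0],
                    [4.0, 50.0]])
X_test = np.array([[1.0, 20.0]])

scaler = StandardScaler()
scaler.fit(X_train)

print(scaler.mean_)   # per-feature mean of X_train
print(scaler.scale_)  # per-feature standard deviation of X_train

# transform reuses the stored statistics: z = (x - mean_) / scale_
X_test_tr = scaler.transform(X_test)
manual = (X_test - scaler.mean_) / scaler.scale_
print(X_test_tr, manual)
```

    The same fitted scaler should be applied to every later batch of data, rather than fitting a new one per batch.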

  • Original post: https://www.cnblogs.com/ytxwzqin/p/14363476.html