  • The Pipeline mechanism and FeatureUnion in sklearn

    1. Using Pipeline

    A Pipeline chains several estimators into a single estimator. This is useful because data processing usually follows a fixed sequence of steps, for example feature selection -> normalization -> classification.

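    A Pipeline serves two purposes:

    • Convenience: a single call to fit and a single call to predict trains and applies the whole sequence of estimators on a dataset.
    • Joint parameter selection: a grid search can cover the parameter combinations of all estimators in the pipeline at once.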

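    Note: every estimator in a Pipeline except the last must be a transformer. The last estimator can be of any type (transformer, classifier, regressor).

    If the last estimator is a classifier, the whole pipeline can be used as a classifier. If the last estimator is a clusterer, the whole pipeline can be used as a clusterer.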

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    
    from sklearn.pipeline import Pipeline
    
    estimator=[('pca', PCA()),
               ('clf', LogisticRegression())
               ]
    pipe=Pipeline(estimator)
    print(pipe)
    #Pipeline(steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False,fit_intercept=True,intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,penalty='l2', random_state=None, solver='liblinear', tol=0.0001,verbose=0, warm_start=False))])
    print(pipe.steps[0])
    #('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,  svd_solver='auto', tol=0.0, whiten=False))
    print(pipe.named_steps['pca'])
    #PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,  svd_solver='auto', tol=0.0, whiten=False)
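
    As a minimal sketch of the "one fit, one predict" convenience described above, the example below trains a PCA + LogisticRegression pipeline end to end. The iris dataset, the train/test split, n_components=2 and max_iter=1000 are assumptions chosen for illustration; they are not part of the original article.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Illustrative data: 150 iris samples with 4 features each (assumption)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

pipe = Pipeline([
    ("pca", PCA(n_components=2)),                # transformer step
    ("clf", LogisticRegression(max_iter=1000)),  # final step: a classifier
])

# One fit call fits PCA, transforms the data, then fits the classifier
pipe.fit(X_train, y_train)

# One predict call applies PCA.transform and then the classifier's predict
y_pred = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
print(y_pred.shape, accuracy)
```

    Because the last step is a classifier, the pipeline exposes predict and score like any other classifier.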

    Parameters of the estimators inside a pipeline are accessed with the <estimator>__<parameter> syntax (step name, two underscores, parameter name).

    # Change a parameter and print the result
    print(pipe.set_params(clf__C=10))
    #Pipeline(steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=10, class_weight=None, dual=False,fit_intercept=True,intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,penalty='l2', random_state=None, solver='liblinear', tol=0.0001,verbose=0, warm_start=False))])
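
    The double-underscore naming works for reading as well as writing. The sketch below rebuilds the same two-step pipeline and inspects the values before and after set_params; setting pca__n_components=2 is an added illustration, not something the original does.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression())])

# get_params() returns nested parameters under <step>__<parameter> keys
c_before = pipe.get_params()["clf__C"]

# Several nested parameters can be set in one call
pipe.set_params(clf__C=10, pca__n_components=2)

# The change is applied to the step objects themselves
c_after = pipe.named_steps["clf"].C
n_components = pipe.named_steps["pca"].n_components
print(c_before, c_after, n_components)
```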

    Since these parameters are exposed, grid search can be used to tune them:

    from sklearn.model_selection import GridSearchCV
    # C must be strictly positive, so the grid starts at 0.1 rather than 0
    params=dict(pca__n_components=[2,5,10],clf__C=[0.1,1,10,100])
    grid_search=GridSearchCV(pipe,param_grid=params)
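
    Constructing GridSearchCV does not run anything yet; the search happens on fit. A minimal sketch of the full loop follows. The iris dataset, cv=3 and the specific grid values are assumptions (iris has only 4 features, so n_components is kept at or below 4).

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])

# Keys follow the <step>__<parameter> convention
params = {"pca__n_components": [2, 3, 4], "clf__C": [0.1, 1, 10]}

# Each of the 3 x 3 combinations is cross-validated on the whole pipeline,
# so PCA is refit inside every fold together with the classifier
grid_search = GridSearchCV(pipe, param_grid=params, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

    After fitting, grid_search.best_estimator_ is a pipeline refit on all the data with the best combination found.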

    An individual step can itself be replaced as a parameter, and any step other than the last can be set to None to skip it:

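    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    # 'pca' and 'clf' replace whole steps; clf__C (double underscore) sets C on whichever classifier is selected
    params=dict(pca=[None,PCA(5),PCA(10)],clf=[SVC(),LogisticRegression()],
                clf__C=[0.1,10,100])
    grid_search=GridSearchCV(pipe,param_grid=params)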
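
    A runnable sketch of searching over whole steps is shown below. Several things here are assumptions added for illustration: the wine dataset (13 features, so PCA(5) and PCA(10) are valid), the extra StandardScaler step, cv=3, and the use of the string 'passthrough', which recent scikit-learn versions accept for skipping a step.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),  # added so both classifiers see scaled inputs
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

params = {
    # Skip PCA entirely, or keep 5 or 10 components
    "pca": ["passthrough", PCA(5), PCA(10)],
    # Swap the final estimator
    "clf": [SVC(), LogisticRegression(max_iter=1000)],
    # C exists on both SVC and LogisticRegression, so it applies to either
    "clf__C": [0.1, 10, 100],
}

grid_search = GridSearchCV(pipe, param_grid=params, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

    The grid has 3 x 2 x 3 = 18 candidates. A nested key such as clf__C only makes sense when every candidate for that step has a parameter of that name.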

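    The function make_pipeline is a shorthand for constructing a pipeline. It accepts a variable number of estimators and returns a pipeline, filling in the name of each step automatically.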

    from sklearn.pipeline import make_pipeline
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.preprocessing import Binarizer
    print(make_pipeline(Binarizer(),MultinomialNB()))
    
    #Pipeline(steps=[('binarizer', Binarizer(copy=True, threshold=0.0)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
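
    The automatically generated names are the lowercased class names, and they are what the <estimator>__<parameter> syntax expects. A short sketch follows; the threshold and alpha values are arbitrary illustrations.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

pipe = make_pipeline(Binarizer(), MultinomialNB())

# Step names are derived from the class names
step_names = [name for name, _ in pipe.steps]
print(step_names)

# The generated names work with the double-underscore parameter syntax
pipe.set_params(binarizer__threshold=0.5, multinomialnb__alpha=0.1)
print(pipe.named_steps["binarizer"].threshold)
```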

    FeatureUnion: composite feature spaces

    A FeatureUnion combines several transformer objects into a new transformer whose output is the combination (side-by-side concatenation) of their outputs. A FeatureUnion object takes a list of transformer objects.

    2. Using FeatureUnion

    from sklearn.pipeline import FeatureUnion
    from sklearn.decomposition import PCA
    from sklearn.decomposition import KernelPCA
    estimators=[('linear_pca',PCA()),('kernel_pca',KernelPCA())]
    combined=FeatureUnion(estimators)
    print(combined)
    
    #FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,  svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',     fit_inverse_transform=False, gamma=None, kernel='linear',     kernel_params=None, max_iter=None, n_components=None, n_jobs=1,  random_state=None, remove_zero_eig=False, tol=0))],transformer_weights=None)
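
    To make "combines their outputs" concrete, the sketch below fits a FeatureUnion and checks the shape of the result. The iris dataset, the component counts and the rbf kernel are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)  # shape (150, 4)

combined = FeatureUnion([
    ("linear_pca", PCA(n_components=2)),
    ("kernel_pca", KernelPCA(n_components=3, kernel="rbf")),
])

# Each transformer is fit on the same input independently;
# the outputs are stacked column-wise: 2 + 3 = 5 features
X_combined = combined.fit_transform(X)
print(X.shape, X_combined.shape)
```

    Since a FeatureUnion is itself a transformer, it can be used as a step inside a Pipeline.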

    As with pipelines, a feature union has a simpler constructor, make_union, which does not require explicitly naming each estimator.
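
    A minimal sketch of make_union, mirroring the two transformers used above; the names it generates are the lowercased class names.

```python
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_union

# No (name, transformer) tuples needed; names are generated automatically
union = make_union(PCA(), KernelPCA())

names = [name for name, _ in union.transformer_list]
print(names)
```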

    Setting FeatureUnion parameters

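    # Change a parameter: set the kernel_pca transformer to None so it is ignored
    # (newer scikit-learn releases use the string 'drop' instead of None for this; check your version)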
    print(combined.set_params(kernel_pca=None))
    
    #FeatureUnion(n_jobs=1,transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', None)],transformer_weights=None)
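
    The effect of disabling a transformer is visible in the output width. The sketch below uses the string 'drop', which is what recent scikit-learn versions accept for this purpose; the iris dataset and the component counts are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)

combined = FeatureUnion([
    ("linear_pca", PCA(n_components=2)),
    ("kernel_pca", KernelPCA(n_components=3)),
])
full_width = combined.fit_transform(X).shape[1]

# Disable one transformer: only the PCA columns remain
combined.set_params(kernel_pca="drop")
reduced_width = combined.fit_transform(X).shape[1]

# Nested parameters use the same <name>__<parameter> syntax as Pipeline
combined.set_params(linear_pca__n_components=3)
nested_width = combined.fit_transform(X).shape[1]

print(full_width, reduced_width, nested_width)
```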

    Another good article on pipelines: http://blog.csdn.net/lanchunhui/article/details/50521648

  • Original article: https://www.cnblogs.com/nolonely/p/6970419.html