
    Supervised Learning - Classification

    Project description: Finding Donors for CharityML

    In this project, you will apply several supervised learning algorithms to data collected in the 1994 U.S. census in order to model respondents' income accurately. You will then pick the best candidate algorithm from the preliminary results and tune it further to model the data as well as possible. The goal is to build a model that accurately predicts whether a respondent earns more than $50,000 a year. Tasks of this kind arise for non-profit organizations that depend on donations: understanding people's income helps a non-profit judge how large a donation to ask for, or whether it should reach out to someone at all. While it is hard to infer a person's income bracket directly from public sources, we can (and will) infer it from other publicly available features.

    The dataset for this project comes from the UCI Machine Learning Repository. It was donated by Ron Kohavi and Barry Becker after they published the article _"Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid"_; you can find the article in the online version provided by Ron Kohavi. The dataset explored here differs slightly from the original: the feature 'fnlwgt' has been removed, along with records that had missing or badly formatted values.

    # Import the libraries needed for this project
    import numpy as np
    import pandas as pd
    from time import time
    
    # Load the census data
    data = pd.read_csv("census.csv")
    

    Data Exploration

    Features

    • age: an integer, the respondent's age.
    • workclass: a categorical variable, the respondent's usual class of employment; allowed values are {Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked}
    • education_level: a categorical variable, the level of education; allowed values are {Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool}
    • education-num: an integer, the number of years spent in school
    • marital-status: a categorical variable; allowed values are {Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse}
    • occupation: a categorical variable, the general field of occupation; allowed values are {Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces}
    • relationship: a categorical variable, the household relationship; allowed values are {Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried}
    • race: a categorical variable, the respondent's race; allowed values are {White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black}
    • sex: a categorical variable, the respondent's sex; allowed values are {Female, Male}
    • capital-gain: continuous.
    • capital-loss: continuous.
    • hours-per-week: continuous.
    • native-country: a categorical variable, the country of origin; allowed values are {United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands}

    Target Variable

    • income: a categorical variable indicating which income bracket the respondent belongs to; allowed values are {<=50K, >50K}
    data.head(3).T
    
    0 1 2
    age 39 50 38
    workclass State-gov Self-emp-not-inc Private
    education_level Bachelors Bachelors HS-grad
    education-num 13 13 9
    marital-status Never-married Married-civ-spouse Divorced
    occupation Adm-clerical Exec-managerial Handlers-cleaners
    relationship Not-in-family Husband Not-in-family
    race White White White
    sex Male Male Male
    capital-gain 2174 0 0
    capital-loss 0 0 0
    hours-per-week 40 13 40
    native-country United-States United-States United-States
    income <=50K <=50K <=50K
    # The two income classes are imbalanced
    data.income.value_counts()
    
    <=50K    34014
    >50K     11208
    Name: income, dtype: int64
    
    data.income.count()
    
    45222
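    
    About 25% of the records (11208 of 45222) fall in the >50K class. Before training anything, it helps to compute a naive baseline that always predicts ">50K"; any useful model must beat it. A quick sketch, using only the counts above:
    
    # Naive baseline: always predict the positive class (">50K").
    # Its accuracy equals the positive rate; recall is 1 by construction.
    n_records = data.shape[0]
    n_greater_50k = (data.income == '>50K').sum()
    
    accuracy = float(n_greater_50k) / n_records   # ~0.2478
    precision = accuracy                          # every prediction is positive
    recall = 1.0                                  # every true positive is caught
    f1 = 2 * precision * recall / (precision + recall)
    print("Naive baseline -- accuracy: {:.4f}, F1: {:.4f}".format(accuracy, f1))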
    

    Data Cleaning / Preprocessing

    # Split the data into features and the corresponding labels
    income_raw = data['income']
    features_raw = data.drop(['income'], axis = 1)
    

    Transform the prediction target (y) so that the label values are encoded in the range [0, n_classes-1]

    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    le.fit(["<=50K",">50K"])
    income = le.transform(income_raw)
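    
    The fitted encoder records the class-to-integer mapping. Since classes_ is stored in sorted order, "<=50K" maps to 0 and ">50K" to 1, which matters later because f1_score treats label 1 as the positive class by default. A quick check:
    
    # Confirm the mapping: classes_ is sorted, so "<=50K" -> 0, ">50K" -> 1
    print(le.classes_)                      # ['<=50K' '>50K']
    print(le.transform(["<=50K", ">50K"]))  # [0 1]
    print(le.inverse_transform([0, 1]))     # ['<=50K' '>50K']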
    

    Feature Engineering

    Feature Scaling

    from sklearn.preprocessing import MinMaxScaler
    
    # Initialize a scaler and apply it to the features
    scaler = MinMaxScaler()
    numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
    features_raw[numerical] = scaler.fit_transform(data[numerical])
    
    features_raw.head(3).T
    
    0 1 2
    age 0.30137 0.452055 0.287671
    workclass State-gov Self-emp-not-inc Private
    education_level Bachelors Bachelors HS-grad
    education-num 0.8 0.8 0.533333
    marital-status Never-married Married-civ-spouse Divorced
    occupation Adm-clerical Exec-managerial Handlers-cleaners
    relationship Not-in-family Husband Not-in-family
    race White White White
    sex Male Male Male
    capital-gain 0.0217402 0 0
    capital-loss 0 0 0
    hours-per-week 0.397959 0.122449 0.397959
    native-country United-States United-States United-States
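    
    Note that the scaler above is fit on the full dataset before the train/test split, so the test data influences the scaling range. A leak-free variant (a sketch; it assumes the split made further below has already been run) fits the scaler on the training portion only:
    
    # Sketch: fit the scaler on the training split only, then reuse it
    # on the test split so no test-set information leaks into training.
    from sklearn.preprocessing import MinMaxScaler
    
    scaler = MinMaxScaler()
    X_train_scaled = X_train.copy()
    X_test_scaled = X_test.copy()
    X_train_scaled[numerical] = scaler.fit_transform(X_train[numerical])
    X_test_scaled[numerical] = scaler.transform(X_test[numerical])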

    One-Hot Encoding

    # Categorical variables that need one-hot encoding
    cols = ['workclass', 'education_level',
            'marital-status', 'occupation', 
            'relationship', 'race', 
            'sex','native-country']
    
    # One-hot encode the 'features_raw' data with pandas.get_dummies()
    features = pd.get_dummies(features_raw, columns=cols)
    
    # Print the number of features after one-hot encoding
    encoded = list(features.columns)
    print("{} total features after one-hot encoding.".format(len(encoded)))
    
    103 total features after one-hot encoding.
    

    Data Splitting

    # NOTE: in scikit-learn >= 0.18 this import moved to sklearn.model_selection
    from sklearn.cross_validation import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(features, 
                                                        income, 
                                                        test_size = 0.2, 
                                                        random_state = 0)
    
    # Show the result of the split
    print("Training set has {} samples.".format(X_train.shape[0]))
    print("Testing set has {} samples.".format(X_test.shape[0]))
    
    Training set has 36177 samples.
    Testing set has 9045 samples.
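    
    Since roughly 25% of the labels are >50K, a stratified split keeps that ratio identical in both partitions. train_test_split supports this directly; a sketch (the stratify argument is available in scikit-learn >= 0.17), using new names so it does not overwrite the split above:
    
    # Stratified variant: preserve the <=50K / >50K class ratio in both splits.
    X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(features,
                                                                income,
                                                                test_size=0.2,
                                                                random_state=0,
                                                                stratify=income)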
    

    Modeling

    The supervised learning models below are the ones currently available to choose from in scikit-learn:

    • Gaussian naive Bayes (GaussianNB)
    • Decision tree (CART)
    • K-nearest neighbors (KNeighbors)
    • Support vector machine (SVM)
    • Logistic regression
    • Ensemble methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
      • Random forest
      • AdaBoost
      • Gradient boosted trees (GBDT)

    Creating a Preliminary Training and Prediction Pipeline

    To evaluate each chosen model's performance properly, it is important to build a training and testing pipeline that can quickly and efficiently train models on training sets of varying size and make predictions on the test set.

    • Import f1_score and accuracy_score from sklearn.metrics.
    • Fit the learner to the sampled training set and record the training time.
    • Use the learner to predict on the test set and record the prediction time.
    • Compute the accuracy on the test data.
    • Compute the F-score on the test data.
    # Import two evaluation metrics from sklearn - f1_score and accuracy_score
    from sklearn.metrics import f1_score, accuracy_score
    
    def train_predict(learner, X_train, y_train, X_test, y_test):
    
        results = {}
        results['1_model'] = learner.__class__.__name__
        
        # Training
        start = time()  # record the start time
        learner = learner.fit(X_train, y_train)
        end = time()  # record the end time
        # Training time
        results['2_train_time'] = end - start
        
        # Prediction
        start = time()  # record the start time
        predictions_test = learner.predict(X_test)
        end = time()  # record the end time
        # Prediction time
        results['3_pred_time'] = end - start
        
        # Accuracy on the test set
        results['4_acc_test'] = accuracy_score(y_test, predictions_test)
        # F-score on the test set
        results['5_f_test'] = f1_score(y_test, predictions_test)
        
        return results
    

    Preliminary Selection

    clfs = {}
    
    # Gaussian naive Bayes (GaussianNB)
    from sklearn.naive_bayes import GaussianNB
    clfs["nb"] = GaussianNB()
    
    # Decision tree
    from sklearn.tree import DecisionTreeClassifier
    clfs["dtc"] = DecisionTreeClassifier()
    
    # K-nearest neighbors (KNeighbors)
    from sklearn.neighbors import KNeighborsClassifier
    clfs["knc"] = KNeighborsClassifier()
    
    # Support vector machine classifier (SVC)
    from sklearn.svm import SVC
    clfs["svc"] = SVC()
    
    # Logistic regression
    from sklearn.linear_model import LogisticRegression
    clfs['lr'] = LogisticRegression()
    
    # Random forest
    from sklearn.ensemble import RandomForestClassifier
    clfs["rfc"] = RandomForestClassifier()
    
    # AdaBoost
    from sklearn.ensemble import AdaBoostClassifier
    clfs["adc"] = AdaBoostClassifier()
    
    # Gradient boosted trees (GBDT)
    from sklearn.ensemble import GradientBoostingClassifier
    clfs["gbdt"] = GradientBoostingClassifier()
    
    totaldata = pd.DataFrame(columns=['1_model', 
                                      '2_train_time', 
                                      '3_pred_time', 
                                      '4_acc_test', 
                                      '5_f_test'])
    
    for clf in clfs:
        print(clf)
        temp = train_predict(clfs[clf], X_train, y_train, X_test, y_test)
        rdata = pd.DataFrame(pd.Series(temp)).T
        totaldata = pd.concat([totaldata, rdata])
    
    gbdt
    adc
    nb
    rfc
    svc
    knc
    dtc
    lr
    
    totaldata.sort_values(by="2_train_time", ascending=False)
    
    1_model 2_train_time 3_pred_time 4_acc_test 5_f_test
    0 SVC 91.771 14.1266 0.830072 0.593923
    0 GradientBoostingClassifier 8.75785 0.0195289 0.863018 0.683686
    0 AdaBoostClassifier 1.64838 0.0730915 0.857601 0.673924
    0 KNeighborsClassifier 1.607 26.832 0.820122 0.608988
    0 RandomForestClassifier 0.522327 0.0280254 0.841459 0.652785
    0 LogisticRegression 0.453595 0.010886 0.848314 0.659046
    0 DecisionTreeClassifier 0.377471 0.00451517 0.819016 0.62255
    0 GaussianNB 0.0841055 0.0220275 0.608292 0.536923

    Preliminary Selection Results

    • GBDT
    • Adaboost
    • Logistic regression

    Note: SVM is very slow to train and also slow to predict; K-nearest neighbors is extremely slow to predict!
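    
    A common workaround for such slow learners is to benchmark them on a small subsample first and only move to the full training set if the scores justify the cost. A sketch, reusing the pipeline above on 10% of the training data:
    
    # Benchmark a slow learner on a 10% subsample of the training set first.
    subset = int(0.1 * X_train.shape[0])
    results_small = train_predict(SVC(), X_train[:subset], y_train[:subset],
                                  X_test, y_test)
    print(results_small)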

    Model Tuning

    Tuning requires both grid search and cross-validation, and scikit-learn's GridSearchCV class combines the two.
    Below we tune every algorithm, since this is a learning exercise, to see what each one gains.

    # NOTE: in scikit-learn >= 0.18 GridSearchCV moved to sklearn.model_selection
    from sklearn.grid_search import GridSearchCV
    

    Naive Bayes

    Reference: http://www.cnblogs.com/pinard/p/6074222.html

    scikit-learn provides three naive Bayes classifier classes: GaussianNB, MultinomialNB, and BernoulliNB.

    • GaussianNB: naive Bayes with Gaussian-distributed feature likelihoods;
    • MultinomialNB: naive Bayes with multinomially distributed feature likelihoods;
    • BernoulliNB: naive Bayes with Bernoulli-distributed feature likelihoods.

    How to choose

    • GaussianNB: when most sample features are continuous values;
    • MultinomialNB: when most sample features are multi-valued discrete counts;
    • BernoulliNB: when the sample features are binary, or sparse multi-valued discrete values.

    GaussianNB has only a single parameter, so there is no real need to tune it.
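    
    That said, the one-hot encoded dummies are binary features, so by the guidance above BernoulliNB is arguably a better match for most of this feature matrix than GaussianNB. A quick comparison on the same split (a sketch, not part of the original selection):
    
    # The one-hot dummies are binary, so BernoulliNB is a plausible
    # alternative to GaussianNB here; compare both on the same split.
    from sklearn.naive_bayes import BernoulliNB
    
    for nb in (GaussianNB(), BernoulliNB()):
        nb.fit(X_train, y_train)
        print(nb.__class__.__name__, f1_score(y_test, nb.predict(X_test)))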

    Decision Tree

    Reference: http://www.cnblogs.com/pinard/p/6056319.html

    The decision-tree classes in scikit-learn are implemented internally with a tuned version of the CART algorithm, which supports both classification and regression.

    We tune the maximum tree depth, max_depth; the default None leaves the depth unlimited.

    from sklearn.tree import DecisionTreeClassifier
    CART = DecisionTreeClassifier()
    CART
    
    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
                min_samples_split=2, min_weight_fraction_leaf=0.0,
                presort=False, random_state=None, splitter='best')
    
    %%time
    param_grids = {'max_depth':[8,9,10,11,12,13,14]}
    CART_gridsearchCV = GridSearchCV(CART, param_grids, scoring='f1', cv=3)
    CART_gridsearchCV.fit(X_train,y_train)
    
    Wall time: 3.59 s
    

    # NOTE: grid_scores_ is specific to the old sklearn.grid_search API;
    # with sklearn.model_selection, inspect cv_results_ instead
    CART_gridsearchCV.grid_scores_
    
    [mean: 0.66541, std: 0.00187, params: {'max_depth': 8},
     mean: 0.66285, std: 0.00206, params: {'max_depth': 9},
     mean: 0.66937, std: 0.00707, params: {'max_depth': 10},
     mean: 0.66663, std: 0.00176, params: {'max_depth': 11},
     mean: 0.66743, std: 0.00184, params: {'max_depth': 12},
     mean: 0.66651, std: 0.00235, params: {'max_depth': 13},
     mean: 0.66680, std: 0.00542, params: {'max_depth': 14}]
    
    print("最佳参数:%s"%CART_gridsearchCV.best_params_)
    print("最佳得分:%s"%CART_gridsearchCV.best_score_)
    print("测试集得分:%s"%f1_score(y_test, CART_gridsearchCV.predict(X_test)))
    print("未调参测试集得分:0.62255")
    
    最佳参数:{'max_depth': 10}
    最佳得分:0.669372654206
    测试集得分:0.669708491762
    未调参测试集得分:0.62255
    

    Conclusion: in the first round of modeling, the untuned CART model scored an F1 of 0.62255 on the test set; tuning improves it clearly.

    Logistic Regression

    Reference: http://www.cnblogs.com/pinard/p/6035872.html

    Besides LogisticRegression, scikit-learn offers LogisticRegressionCV, which uses cross-validation to choose the regularization strength C; with plain LogisticRegression you must specify C yourself each time.

    So there is no need to use GridSearchCV to tune logistic regression.

    Parameter notes:
    
    Parameter   Description
    penalty     Regularization term: 'l1' or 'l2'
    solver      Optimization algorithm: {'newton-cg', 'lbfgs', 'liblinear', 'sag'}
    multi_class Classification scheme: 'ovr' or 'multinomial'; the default is 'ovr'
    Cs          Number of candidate values for C, spaced logarithmically between 1e-4 and 1e4; defaults to 10. Combined with cv, 10 x cv models are trained
    dual        Defaults to False; set to True only when there are fewer samples than features, which rarely happens
    refit       After cross-validation, the model is refit with the best parameters
    from sklearn.linear_model import LogisticRegressionCV
    lgCV = LogisticRegressionCV(Cs=10,
                                cv=3,
                                scoring='f1',
                                verbose=1,
                                n_jobs=-1)
    lgCV
    
    LogisticRegressionCV(Cs=10, class_weight=None, cv=3, dual=False,
               fit_intercept=True, intercept_scaling=1.0, max_iter=100,
               multi_class='ovr', n_jobs=-1, penalty='l2', random_state=None,
               refit=True, scoring='f1', solver='lbfgs', tol=0.0001, verbose=1)
    
    %%time
    lgCV.fit(X_train,y_train)
    
    [Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    7.4s finished
    

    Wall time: 8.37 s
    

    LogisticRegressionCV(Cs=10, class_weight=None, cv=3, dual=False,
               fit_intercept=True, intercept_scaling=1.0, max_iter=100,
               multi_class='ovr', n_jobs=-1, penalty='l2', random_state=None,
               refit=True, scoring='f1', solver='lbfgs', tol=0.0001, verbose=1)
    
    print("测试集得分:%s"%f1_score(y_test, lgCV.predict(X_test)))
    print("未调参测试集得分:0.659046")
    
    测试集得分:0.664203005666
    未调参测试集得分:0.659046
    

    Conclusion: a slight improvement over the untuned model.

    Random Forest

    Reference: http://www.cnblogs.com/pinard/p/6160412.html

    Tuning a random forest has two parts: the parameters of the Bagging framework, and the parameters of the underlying CART trees.

    from sklearn.ensemble import RandomForestClassifier
    RF = RandomForestClassifier()
    RF
    
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='auto', max_leaf_nodes=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                oob_score=False, random_state=None, verbose=0,
                warm_start=False)
    
    %%time
    param_grids = {'n_estimators':[10,30,60,120]}
    RF_gridsearchCV = GridSearchCV(RF, param_grids, scoring='f1', cv=3)
    RF_gridsearchCV.fit(X_train,y_train)
    
    Wall time: 29.6 s
    

    print("最佳参数:%s"%RF_gridsearchCV.best_params_)
    print("最佳得分:%s"%RF_gridsearchCV.best_score_)
    print("测试集得分:%s"%f1_score(y_test, RF_gridsearchCV.predict(X_test)))
    print("未调参测试集得分:0.652785")
    
    最佳参数:{'n_estimators': 120}
    最佳得分:0.664833898038
    测试集得分:0.661571530929
    未调参测试集得分:0.652785
    

    Conclusion: tuning only the ensemble size gives a very slight improvement over the untuned model.
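    
    A fuller search would tune the CART-level parameters together with the ensemble size; the grid below is a sketch with illustrative ranges, not validated values:
    
    # Sketch: search tree-level parameters together with the ensemble size.
    # The ranges are illustrative guesses, not tuned values.
    param_grids = {'n_estimators': [60, 120],
                   'max_depth': [10, 15, None],
                   'max_features': ['sqrt', 0.5]}
    RF_gridsearchCV = GridSearchCV(RandomForestClassifier(), param_grids,
                                   scoring='f1', cv=3, n_jobs=-1)
    RF_gridsearchCV.fit(X_train, y_train)
    print(RF_gridsearchCV.best_params_)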

    K-Nearest Neighbors

    Reference: http://www.cnblogs.com/pinard/p/6065607.html

    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier()
    knn
    
    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               metric_params=None, n_jobs=1, n_neighbors=5, p=2,
               weights='uniform')
    
    %%time
    param_grids = {'n_neighbors':[5,10]}
    knn_gridsearchCV = GridSearchCV(knn, param_grids, scoring='f1', cv=3)
    knn_gridsearchCV.fit(X_train,y_train)
    
    Wall time: 3min 13s
    

    print("最佳参数:%s"%knn_gridsearchCV.best_params_)
    print("最佳得分:%s"%knn_gridsearchCV.best_score_)
    print("测试集得分:%s"%f1_score(y_test, knn_gridsearchCV.predict(X_test)))
    print("未调参测试集得分:0.608988")
    
    最佳参数:{'n_neighbors': 5}
    最佳得分:0.614849876343
    测试集得分:0.608988223985
    未调参测试集得分:0.608988
    

    K-nearest neighbors needs essentially no training, but prediction is painfully slow, so we drop it.

    AdaBoost

    Reference: http://www.cnblogs.com/pinard/p/6136914.html

    AdaBoostClassifier uses the CART classification tree DecisionTreeClassifier as its default base estimator, while AdaBoostRegressor defaults to the CART regression tree DecisionTreeRegressor.

    from sklearn.ensemble import AdaBoostClassifier
    adc = AdaBoostClassifier()
    adc
    
    AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
              learning_rate=1.0, n_estimators=50, random_state=None)
    
    %%time
    param_grids = {'n_estimators':[100,300,500,750,900]}
    adc_gridsearchCV = GridSearchCV(adc, param_grids, scoring='f1', cv=2, n_jobs=-1)
    adc_gridsearchCV.fit(X_train,y_train)
    
    Wall time: 57.2 s
    

    adc_gridsearchCV.best_params_
    
    {'n_estimators': 750}
    
    print("测试集得分:%s"%f1_score(y_test, adc_gridsearchCV.predict(X_test)))
    print("未调参测试集得分:0.673924")
    
    测试集得分:0.70091834202
    未调参测试集得分:0.673924
    

    Conclusion: tuning only the number of boosting iterations already gains nearly 3 points, a solid improvement and the best score so far.
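    
    In boosting, n_estimators and learning_rate interact, so a joint search is the natural next step. A sketch with illustrative grid values:
    
    # Sketch: n_estimators and learning_rate trade off in boosting,
    # so search them jointly; the grid values are illustrative.
    param_grids = {'n_estimators': [500, 750],
                   'learning_rate': [0.5, 1.0, 1.5]}
    adc_gridsearchCV = GridSearchCV(AdaBoostClassifier(), param_grids,
                                    scoring='f1', cv=2, n_jobs=-1)
    adc_gridsearchCV.fit(X_train, y_train)
    print(adc_gridsearchCV.best_params_)
    print(f1_score(y_test, adc_gridsearchCV.predict(X_test)))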

    GBDT

    Reference: http://www.cnblogs.com/pinard/p/6143927.html

    from sklearn.ensemble import GradientBoostingClassifier
    GBDT = GradientBoostingClassifier()
    GBDT
    
    GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
                  max_depth=3, max_features=None, max_leaf_nodes=None,
                  min_samples_leaf=1, min_samples_split=2,
                  min_weight_fraction_leaf=0.0, n_estimators=100,
                  presort='auto', random_state=None, subsample=1.0, verbose=0,
                  warm_start=False)
    
    %%time
    param_grids = {'n_estimators':[500,750,900]}
    GBDT_gridsearchCV = GridSearchCV(GBDT, param_grids, scoring='f1', cv=2, n_jobs=-1)
    GBDT_gridsearchCV.fit(X_train,y_train)
    
    Wall time: 1min 40s
    

    print("测试集得分:%s"%f1_score(y_test, GBDT_gridsearchCV.predict(X_test)))
    print("未调参测试集得分:0.683686")
    
    测试集得分:0.714708785785
    未调参测试集得分:0.683686
    
    GBDT_gridsearchCV.best_params_
    
    {'n_estimators': 750}
    

    Conclusion: again we tuned only the number of iterations; since GBDT is also a boosting framework, we guessed the result would resemble AdaBoost's, and it did. The best score so far!
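    
    Since the tuned GBDT is the final winner, a natural follow-up is to inspect which features drive its predictions. A sketch using feature_importances_ (refit=True is the GridSearchCV default, so best_estimator_ is already trained on the full training set):
    
    # Sketch: rank the features the best GBDT model relies on most.
    best_gbdt = GBDT_gridsearchCV.best_estimator_
    importances = pd.Series(best_gbdt.feature_importances_,
                            index=features.columns).sort_values(ascending=False)
    print(importances.head(10))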

    Summary

    • Learning type: supervised learning, classification
    • Data: a teaching dataset, clean and tidy, so no data cleaning was needed
    • Project steps: data exploration, (data cleaning), feature processing, preliminary modeling, (simple) tuning
    • Takeaways:
      • the basic workflow for this type of machine learning task
      • handling a categorical target variable: LabelEncoder
      • building a training and prediction pipeline
      • timing notebook cells: %%time
      • the basic tuning workflow: GridSearchCV
      • model evaluation: the sklearn.metrics module and the scoring parameter in cross-validation