  • kaggle—first play—Titanic (Feature Engineering)

    The survival predictions from the earlier logistic-regression model scored 0.75119 on Kaggle, which, emmm, is a fairly poor score, so let's tune things below.

    -----------------------------------------------------------------------------------------------

    1. Determining the fitting state

    Because overfitting and underfitting call for different treatments of the data set, the first step is to determine which state the current model is in.

    Baidu Baike: underfitting

    Baidu Baike: overfitting

    We can judge this by plotting a learning curve (number of training samples on the x-axis, score on the y-axis).

    learning curve official documentation    learning curve official example code

    First, define the learning-curve plotting function:

    from sklearn.model_selection import learning_curve, ShuffleSplit
    import numpy as np
    import matplotlib.pyplot as plt
    # Define the learning-curve plotting function
    def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                            train_sizes=np.linspace(.05, 1., 20)):
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel("Training examples")
        plt.ylabel("Score")
        train_sizes, train_scores, test_scores = learning_curve(
            estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        plt.grid()
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color="g")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
        plt.legend(loc="best")
        return plt

    With the existing model, set the function's parameters and call it:

    # Concrete parameters
    estimator = lrModel
    title = 'Learning Curves (LogisticRegression)'
    # ravel the single-column target into a 1-D array
    X, y = data_train[inputcolumns], data_train[outputcolumns].values.ravel()
    # 100 random 80/20 splits (the sample count, 891, is now inferred from X
    # rather than passed in, per the current scikit-learn API)
    cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
    # Call the plotting function
    plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)
    plt.show()

    The resulting learning curve shows the training and cross-validation scores converging at a similarly low level, i.e. the model is underfitting, so more feature engineering is needed.
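
    As a rough rule of thumb for reading the two curves (my own sketch, not something from scikit-learn): if both curves converge to a low score the model is underfitting; a large, persistent gap between a high training score and a low cross-validation score suggests overfitting. In code:

    # Hypothetical helper: feed it the train_scores_mean / test_scores_mean
    # arrays computed inside plot_learning_curve (the function would need to
    # return them). The 0.85 / 0.05 thresholds are illustrative assumptions.
    def diagnose_fit(train_mean, test_mean, good_score=0.85, gap_tol=0.05):
        gap = train_mean[-1] - test_mean[-1]
        if gap >= gap_tol:
            return 'overfitting: training score stays well above CV score'
        if test_mean[-1] < good_score:
            return 'underfitting: both curves converge at a low score'
        return 'reasonable fit'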

    2. Feature engineering

    Thinking further about how the data set is processed, there are a few directions worth digging into:

    1) Can the so-far-unused Name and Ticket columns be put to use?

    2) SibSp and Parch count a passenger's siblings/spouses and parents/children on board; can their sum serve as the family size aboard?

    3) Imputing the missing ages with a single random-forest fit was not ideal; is there a better approach?

    After a closer look at the data, apply the following processing first (easiest to hardest):

    1) Sum Parch and SibSp to get the family size

    # Sum the Parch and SibSp variables to get the family size
    data['family_size'] = data['Parch'] + data['SibSp']
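
    A quick sanity check on the new feature (assuming data here is still the training frame, so Survived is available):

    # Mean survival rate by family size
    print(data.groupby('family_size')['Survived'].mean())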

    2) Group by Ticket to get the per-person fare, then discretize it into fare brackets

    # Group by Ticket to get the per-person fare (the listed Fare is per ticket,
    # and one ticket can cover several passengers), then discretize by bracket
    data['Fare'] = data['Fare'] / data.groupby(by=['Ticket'])['Fare'].transform('count')
    
    data['Fare'].describe()
    
    def fare_level(s):
        if s <= 5:          # budget ticket
            m = 0
        elif 5 < s <= 20:   # standard ticket
            m = 1
        elif 20 < s <= 40:  # first-class ticket
            m = 2
        else:               # premium ticket
            m = 3
        return m
    
    data['Fare_level'] = data['Fare'].apply(fare_level)
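
    For reference, the same bracketing can be written more compactly with pandas' own pd.cut (a sketch of an equivalent alternative, not what the post originally used):

    # Same brackets via pd.cut; labels=False yields the integer codes 0-3
    # (unlike fare_level above, pd.cut leaves a missing Fare as NaN instead
    # of dumping it into the top bracket)
    fare_level_alt = pd.cut(data['Fare'],
                            bins=[-np.inf, 5, 20, 40, np.inf], labels=False)
    print((fare_level_alt == data['Fare_level']).all())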

    3) Combine the two most influential factors, Sex and Pclass, into a single new variable

    # e.g. 'female_1' for a first-class female passenger
    data['Sex_Pclass'] = data.Sex + "_" + data.Pclass.map(str)
    dummies_Sex_Pclass = pd.get_dummies(data['Sex_Pclass'], prefix='Sex_Pclass')
    data = pd.concat([data, dummies_Sex_Pclass], axis=1)
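
    The encoding yields six indicator columns, one per sex/class combination; the age-imputation step below refers to them by name, so it is worth confirming they exist:

    # Expect: Sex_Pclass_female_1 ... Sex_Pclass_male_3
    print([c for c in data.columns if c.startswith('Sex_Pclass_')])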

    4) Fill the missing ages, here using the mean of linear-regression and random-forest predictions

    age_data = data[['Age', 'Fare_level', 'family_size', 'Pclass', 'Sex_Pclass_female_1',
           'Sex_Pclass_female_2', 'Sex_Pclass_female_3', 'Sex_Pclass_male_1',
           'Sex_Pclass_male_2', 'Sex_Pclass_male_3', 'embarked_C', 'embarked_Q', 'embarked_S']]
    fcolumns = ['Fare_level', 'family_size', 'Pclass', 'Sex_Pclass_female_1',
           'Sex_Pclass_female_2', 'Sex_Pclass_female_3', 'Sex_Pclass_male_1',
           'Sex_Pclass_male_2', 'Sex_Pclass_male_3', 'embarked_C', 'embarked_Q', 'embarked_S']
    tcolumn = 'Age'
    
    age_data_known = age_data[age_data.Age.notnull()]
    age_data_unknown = age_data[age_data.Age.isnull()]
    
    x = age_data_known[fcolumns]  # feature variables
    y = age_data_known[tcolumn]   # target variable (a Series, so fits stay 1-D)
    
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import GridSearchCV
    
    # Linear regression ('normalize' was removed from LinearRegression in
    # newer scikit-learn, so only fit_intercept is tuned here)
    lr = LinearRegression()
    lr_grid_pattern = {'fit_intercept': [True]}
    lr_grid = GridSearchCV(lr, lr_grid_pattern, cv=10, n_jobs=1, verbose=1, scoring='neg_mean_squared_error')
    lr_grid.fit(x, y)
    print('Age feature Best LR Params:' + str(lr_grid.best_params_))
    print('Age feature Best LR Score:' + str(lr_grid.best_score_))
    lr_pred = lr_grid.predict(age_data_unknown[fcolumns])
    
    # Random-forest regression
    rfr = RandomForestRegressor()
    rfr_grid_pattern = {'max_depth': [3], 'max_features': [3]}
    rfr_grid = GridSearchCV(rfr, rfr_grid_pattern, cv=10, n_jobs=1, verbose=1, scoring='neg_mean_squared_error')
    rfr_grid.fit(x, y)
    print('Age feature Best RF Params:' + str(rfr_grid.best_params_))
    print('Age feature Best RF Score:' + str(rfr_grid.best_score_))
    rfr_pred = rfr_grid.predict(age_data_unknown[fcolumns])
    
    # Average the two predictions; assigning .values sidesteps index alignment
    # so the filled ages land on the right rows of data
    predictresult = pd.DataFrame({'lr': lr_pred, 'rfr': rfr_pred})
    predictresult['result'] = (predictresult['lr'] + predictresult['rfr']) / 2
    data.loc[data['Age'].isnull(), 'Age'] = predictresult['result'].values
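
    A quick check that the imputation behaved (my own sanity check, not part of the original flow):

    # Every age should now be filled, and within a plausible range
    assert data['Age'].isnull().sum() == 0
    print(data['Age'].describe())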

    5) Discretize the ages into brackets

    def age_level(s):
        if s <= 14:         # child
            m = 0
        elif 14 < s <= 35:  # youth
            m = 1
        elif 35 < s <= 60:  # middle-aged
            m = 2
        else:               # elderly
            m = 3
        return m
    
    data['age_level'] = data['Age'].apply(age_level)
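
    A quick look at how the passengers spread across the new brackets:

    # Count of passengers per age bracket
    print(data['age_level'].value_counts().sort_index())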

    3. Fitting a single model

    Fit the processed data set once more with the logistic-regression model:

    data_train = data.drop(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
                            'Ticket', 'Fare', 'Cabin', 'Embarked', 'Sex_Pclass'], axis=1, inplace=False)
    #data_train.columns
    
    # Fit a single model again; liblinear is required for the l1 penalty,
    # since the default lbfgs solver does not support it
    from sklearn import linear_model
    lrModel = linear_model.LogisticRegression(penalty='l1', solver='liblinear')
    
    inputcolumns = ['family_size', 'Fare_level', 'Sex_Pclass_female_1',
           'Sex_Pclass_female_2', 'Sex_Pclass_female_3', 'Sex_Pclass_male_1',
           'Sex_Pclass_male_2', 'Sex_Pclass_male_3', 'embarked_C', 'embarked_Q',
           'embarked_S', 'age_level']
    outputcolumns = ['Survived']
    
    # ravel the single-column target to keep the fit 1-D
    lrModel.fit(data_train[inputcolumns], data_train[outputcolumns].values.ravel())
    lrModel.score(data_train[inputcolumns], data_train[outputcolumns].values.ravel())
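
    The score above is computed on the very data the model was fit on, so it is optimistic; a quick cross-validated estimate gives a fairer number (a sketch of what the follow-up mentioned below will do more thoroughly):

    from sklearn.model_selection import cross_val_score
    
    # 5-fold cross-validation for a less biased accuracy estimate
    scores = cross_val_score(lrModel, data_train[inputcolumns],
                             data_train[outputcolumns].values.ravel(), cv=5)
    print(scores.mean())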

    -----------------------------------------------------------------------

    After applying the same processing to the test set, the predictions were uploaded to Kaggle and scored 0.77, an improvement of 0.02.  >_<

    Cross-validation and model-ensemble optimizations will follow in a later post.
