  • Predicting Titanic survival with sklearn (scikit-learn) LogisticRegression (Kaggle)

    Titanic: prediction using sklearn

    With the EDA above complete, we can pick a few features, preprocess the training data, and then choose and fit a model for prediction using the scikit-learn (sklearn) machine-learning library.

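    We use scikit-learn because it implements the basic machine-learning methods, is convenient to use, and saves us from writing code from scratch. It also supports cross-validation. Unless a problem is better served by a multi-layer deep-learning neural network, in which case we could use TensorFlow or Theano, scikit-learn is a sufficient choice whenever traditional machine-learning methods can solve the problem.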
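
    As a minimal sketch of the cross-validation support mentioned above, the snippet below scores a plain LogisticRegression with 5-fold cross-validation. The data is synthetic (generated with make_classification) and only stands in for the Titanic features; the sample count, feature count, and number of folds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a small tabular dataset (NOT the Titanic data).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# 5-fold cross-validation: fit on 4 folds, measure accuracy on the held-out fold.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

print('fold accuracies:', scores)
print('mean accuracy  :', scores.mean())
```

    The mean of the fold accuracies is a less optimistic estimate of test performance than the accuracy measured on the training data itself.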

    import numpy as np
    import pandas as pd
    import seaborn as sns
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.preprocessing import StandardScaler
    ## Read the train and test data and preprocess: fill missing values, convert str to int, and standardize the scale
    path = './titanic/'
    trainset = pd.read_csv(path + 'train.csv')
    testset = pd.read_csv(path + 'test.csv')
    print('[*] trainset shape is : ' + str(trainset.shape))
    print('[*] testset shape is : ' + str(testset.shape))
    ## Fill missing values and convert data types
    ## On the training set:
    trainset.loc[trainset.Sex == 'male','Sex'] = 0
    trainset.loc[trainset.Sex == 'female','Sex'] = 1
    trainset.loc[trainset.Embarked == 'S','Embarked'] = 1
    trainset.loc[trainset.Embarked == 'C','Embarked'] = 2
    trainset.loc[trainset.Embarked == 'Q','Embarked'] = 3
    trainset.Age = trainset.Age.fillna(trainset.Age.median())
    trainset.Sex = trainset.Sex.fillna(trainset.Sex.mode()[0])
    trainset.Fare = trainset.Fare.fillna(trainset.Fare.mean())
    trainset.Pclass = trainset.Pclass.fillna(trainset.Pclass.mode()[0])
    trainset.Embarked = trainset.Embarked.fillna(trainset.Embarked.mode()[0])
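    ## On the test set: (under the i.i.d. assumption, fillna uses statistics of the training set (median, mean, or mode), since the training set is larger; statistics of train and test combined would also work)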
    testset.loc[testset.Sex == 'male','Sex'] = 0
    testset.loc[testset.Sex == 'female','Sex'] = 1
    testset.loc[testset.Embarked == 'S','Embarked'] = 1
    testset.loc[testset.Embarked == 'C','Embarked'] = 2
    testset.loc[testset.Embarked == 'Q','Embarked'] = 3
    testset.Age = testset.Age.fillna(trainset.Age.median())
    testset.Sex = testset.Sex.fillna(trainset.Sex.mode()[0])
    testset.Fare = testset.Fare.fillna(trainset.Fare.mean())
    testset.Pclass = testset.Pclass.fillna(trainset.Pclass.mode()[0])
    testset.Embarked = testset.Embarked.fillna(trainset.Embarked.mode()[0])
    ## Rescale the training and test sets with StandardScaler (fitted on the training set)
    AgeScaler = StandardScaler().fit(trainset[['Age']])
    FareScaler = StandardScaler().fit(trainset[['Fare']])
    #print(AgeScaler.mean_, AgeScaler.scale_)
    #print(FareScaler.mean_, FareScaler.scale_)
    trainset.Age = AgeScaler.transform(trainset[['Age']])
    trainset.Fare = FareScaler.transform(trainset[['Fare']])
    testset.Age = AgeScaler.transform(testset[['Age']])
    testset.Fare = FareScaler.transform(testset[['Fare']])
    ## Select features and fit logistic regression
    print('[*] Using Logistic Regression Model')
    features = ['Pclass','Sex','Age','Fare','Embarked']
    predlabel = ['Survived']
    train_X = trainset[features]
    train_Y = trainset[predlabel].values.ravel()  # 1-D label array; avoids sklearn's column-vector warning
    test_X = testset[features]
    LogReg = LogisticRegressionCV(random_state=0)
    LogReg.fit(train_X,train_Y)
    test_Y_hat = LogReg.predict(test_X)
    print('[*] prediction completed')
    submission = pd.DataFrame(columns=['PassengerId','Survived'])
    submission['PassengerId'] = range(892,1310)
    submission['Survived'] = test_Y_hat
    #trainset.head(10)
    #pd.read_csv(path+'gender_submission.csv')
    ## Save in the required format as a CSV file without the index.
    submission.to_csv('./titanic/logreg_submission.csv',index=False)
    print('[*] result saved')
    print('[*] done')
    [*] trainset shape is : (891, 12)
    [*] testset shape is : (418, 11)
    [*] Using Logistic Regression Model
    [*] prediction completed
    [*] result saved
    [*] done
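
    The preprocessing above fills missing values and rescales the test set with statistics learned from the training set. The same idea can be sketched with scikit-learn's SimpleImputer and StandardScaler chained in a pipeline. The two small data frames are made up for illustration, and this sketch uses the median for both columns, whereas the script above uses the mean for Fare.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up frames standing in for train.csv / test.csv.
train = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0],
                      'Fare': [7.25, 71.28, 7.92, 53.10]})
test = pd.DataFrame({'Age': [np.nan, 47.0],
                     'Fare': [7.83, np.nan]})

# Median imputation followed by standardization; both steps learn their
# statistics in fit() and reuse them in transform().
prep = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())

train_scaled = prep.fit_transform(train)  # statistics come from train only
test_scaled = prep.transform(test)        # test reuses the train statistics

# Per-column medians learned from the training frame.
medians = prep.named_steps['simpleimputer'].statistics_
print(medians)
print(test_scaled)
```

    Keeping both steps in one fitted object makes it harder to fit a scaler or imputer on the test set by accident.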
    

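    Above we used LogisticRegressionCV instead of the earlier LogisticRegression. This amounts to running cross-validation to tune C, the inverse of the regularization strength. The change improved the leaderboard ranking by 439 places.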
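
    To make the role of C concrete, here is a small sketch on synthetic data (not the Titanic set). LogisticRegressionCV evaluates a grid of candidate C values with internal cross-validation and stores the selected one in `C_`; a smaller C means stronger regularization. The grid size of 10 and cv=5 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data (NOT the Titanic data).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Cs=10 requests 10 candidate values on a logarithmic scale from 1e-4 to 1e4.
model = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000, random_state=0)
model.fit(X, y)

# np.ravel handles C_ whether it is stored as an array or a scalar.
selected_C = float(np.ravel(model.C_)[0])

print('candidate C values:', model.Cs_)
print('selected C        :', selected_C)
print('training accuracy :', model.score(X, y))
```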

    Next, consider adding the two features SibSp and Parch to see whether they help:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.preprocessing import StandardScaler
    ## Read the train and test data and preprocess: fill missing values, convert str to int, and standardize the scale
    path = './titanic/'
    trainset = pd.read_csv(path + 'train.csv')
    testset = pd.read_csv(path + 'test.csv')
    print('[*] trainset shape is : ' + str(trainset.shape))
    print('[*] testset shape is : ' + str(testset.shape))
    ## Fill missing values and convert data types
    ## On the training set:
    trainset.loc[trainset.Sex == 'male','Sex'] = 0
    trainset.loc[trainset.Sex == 'female','Sex'] = 1
    trainset.loc[trainset.Embarked == 'S','Embarked'] = 1
    trainset.loc[trainset.Embarked == 'C','Embarked'] = 2
    trainset.loc[trainset.Embarked == 'Q','Embarked'] = 3
    trainset.Age = trainset.Age.fillna(trainset.Age.median())
    trainset.Sex = trainset.Sex.fillna(trainset.Sex.mode()[0])
    trainset.Fare = trainset.Fare.fillna(trainset.Fare.mean())
    trainset.Pclass = trainset.Pclass.fillna(trainset.Pclass.mode()[0])
    trainset.Embarked = trainset.Embarked.fillna(trainset.Embarked.mode()[0])
    trainset.SibSp = trainset.SibSp.fillna(trainset.SibSp.mode()[0])
    trainset.Parch = trainset.Parch.fillna(trainset.Parch.mode()[0])
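    ## On the test set: (under the i.i.d. assumption, fillna uses statistics of the training set (median, mean, or mode), since the training set is larger; statistics of train and test combined would also work)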
    testset.loc[testset.Sex == 'male','Sex'] = 0
    testset.loc[testset.Sex == 'female','Sex'] = 1
    testset.loc[testset.Embarked == 'S','Embarked'] = 1
    testset.loc[testset.Embarked == 'C','Embarked'] = 2
    testset.loc[testset.Embarked == 'Q','Embarked'] = 3
    testset.Age = testset.Age.fillna(trainset.Age.median())
    testset.Sex = testset.Sex.fillna(trainset.Sex.mode()[0])
    testset.Fare = testset.Fare.fillna(trainset.Fare.mean())
    testset.Pclass = testset.Pclass.fillna(trainset.Pclass.mode()[0])
    testset.Embarked = testset.Embarked.fillna(trainset.Embarked.mode()[0])
    testset.SibSp = testset.SibSp.fillna(trainset.SibSp.mode()[0])
    testset.Parch = testset.Parch.fillna(trainset.Parch.mode()[0])
    ## Rescale the training and test sets with StandardScaler (fitted on the training set)
    AgeScaler = StandardScaler().fit(trainset[['Age']])
    FareScaler = StandardScaler().fit(trainset[['Fare']])
    #print(AgeScaler.mean_, AgeScaler.scale_)
    #print(FareScaler.mean_, FareScaler.scale_)
    trainset.Age = AgeScaler.transform(trainset[['Age']])
    trainset.Fare = FareScaler.transform(trainset[['Fare']])
    testset.Age = AgeScaler.transform(testset[['Age']])
    testset.Fare = FareScaler.transform(testset[['Fare']])
    ## Select features and fit logistic regression
    print('[*] Using Logistic Regression Model')
    features = ['Pclass','Sex','Age','Fare','Embarked','SibSp','Parch']
    predlabel = ['Survived']
    train_X = trainset[features]
    train_Y = trainset[predlabel].values.ravel()  # 1-D label array; avoids sklearn's column-vector warning
    test_X = testset[features]
    LogReg = LogisticRegressionCV(random_state=0)
    LogReg.fit(train_X,train_Y)
    test_Y_hat = LogReg.predict(test_X)
    print('[*] prediction completed')
    submission = pd.DataFrame(columns=['PassengerId','Survived'])
    submission['PassengerId'] = range(892,1310)
    submission['Survived'] = test_Y_hat
    #trainset.head(10)
    #pd.read_csv(path+'gender_submission.csv')
    ## Save in the required format as a CSV file without the index.
    submission.to_csv('./titanic/logreg_submission.csv',index=False)
    print('[*] result saved')
    print('[*] done')
    [*] trainset shape is : (891, 12)
    [*] testset shape is : (418, 11)
    [*] Using Logistic Regression Model
    [*] prediction completed
    [*] result saved
    [*] done
    

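    As expected, this improves the score a little. Features that seemed only weakly correlated in the earlier analysis can still raise classification accuracy once they are passed through logistic regression. Accuracy could be improved further by considering the title contained in Name, the cabin number (the Titanic deck plan is a useful reference), and so on. Switching to other models and combining them in an ensemble are further options. Since this problem is meant only as a toy problem for getting familiar with the environment and methods, we stop exploring here. On a real problem it is worth investing more time in model selection, cross-validation, and ensembling to improve model performance.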
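
    As a sketch of the title idea mentioned above, the snippet below pulls the title out of names written in the "Surname, Title. Given names" format used by the Titanic data. The regular expression and the new column name `Title` are assumptions of this example, not part of the original script.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format.
names = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                               'Heikkinen, Miss. Laina',
                               'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
                               'Palsson, Master. Gosta Leonard']})

# Capture the text between the comma and the first period.
names['Title'] = (names['Name']
                  .str.extract(r',\s*([^.]+)\.', expand=False)
                  .str.strip())

print(names['Title'].tolist())
```

    The extracted titles would still need to be converted to numeric codes, as was done for Sex and Embarked, before being passed to the model.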

    2018-02-23 18:01:36
    We take risks precisely because God gave us this mortal flesh, not because we are careless with our lives. (Stephen King)

    In the end I tried a random forest as well, and the improvement was clear.

    So it pays to try several models and to tune their parameters.
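
    One way to tune parameters systematically is a grid search with cross-validation. The sketch below does this for a random forest on synthetic data. The parameter grid is a small illustrative assumption and is not the configuration used for the submission above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 7 features (NOT the Titanic data).
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_redundant=0, random_state=0)

# Small illustrative grid; a real search usually covers more values.
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}

# Each parameter combination is scored with 5-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print('best parameters :', search.best_params_)
print('best CV accuracy:', search.best_score_)
```

    After fitting, `search.predict` uses the best combination refitted on all of the training data.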

  • Original article: https://www.cnblogs.com/morikokyuro/p/13256799.html