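  • Kaggle Competition (1): Titanic: Machine Learning from Disaster

    Titanic survival prediction was the first introductory Kaggle competition I tried as a beginner. I relied mainly on these two tutorials: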

    1. https://www.cnblogs.com/star-zhao/p/9801196.html
    2. https://zhuanlan.zhihu.com/p/30538352

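    This model's best score on the Leaderboard was 0.79904, which placed it in the top 13%.

    I did this competition quite a while ago, so I have forgotten many details of the analysis, and since it was my first attempt the whole thing is fairly crude. On a whim today, I am writing it down as a simple record (a running log).

    Import the required packages: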

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import re
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
    

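    Read the training and test sets and combine them so they can be processed together:

    train_raw = pd.read_csv('datasets/train.csv')
    test_raw = pd.read_csv('datasets/test.csv')
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    train_test = pd.concat([train_raw, test_raw], ignore_index=True, sort=False)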
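
As a small illustration of the combine-then-split pattern, here is a sketch on two made-up frames (the `*_demo` names and values are hypothetical, not taken from the dataset). The test rows have no Survived column, so they come out as NaN after concatenation, and the training length can be used to split the frame back later.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv
train_demo = pd.DataFrame({'PassengerId': [1, 2, 3],
                           'Survived': [0, 1, 1],
                           'Pclass': [3, 1, 3]})
test_demo = pd.DataFrame({'PassengerId': [4, 5],
                          'Pclass': [2, 3]})

# Stack the rows; ignore_index gives a fresh 0..n-1 index
combined = pd.concat([train_demo, test_demo], ignore_index=True, sort=False)

# Split back by position using the training length
n_train = len(train_demo)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
print(combined)
```

Using `len(train_demo)` avoids hard-coding the number of training rows.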
    

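    The title in a passenger's name reflects, to some extent, their sex, age, status, and social standing, so it is an important feature that should not be ignored. We first extract the title from the Name field with a regular expression and then group the titles:

    • Mr and Don denote men
    • Miss, Ms, and Mlle denote unmarried women
    • Mrs, Mme, Lady, and Dona denote married women
    • Countess and Jonkheer are both noble titles
    • Capt, Col, Dr, Major, and Sir are rare titles and are grouped into an "Other" category

    # The title is the word right before the first period, e.g. "Mr" in "Braund, Mr. Owen Harris"
    train_test['Title'] = train_test['Name'].apply(lambda x: re.search(r'(\w+)\.', x).group(1))
    train_test['Title'] = train_test['Title'].replace(['Don'], 'Mr')
    train_test['Title'] = train_test['Title'].replace(['Mlle', 'Ms'], 'Miss')
    train_test['Title'] = train_test['Title'].replace(['Mme', 'Lady', 'Dona'], 'Mrs')
    train_test['Title'] = train_test['Title'].replace(['Countess', 'Jonkheer'], 'Noble')
    train_test['Title'] = train_test['Title'].replace(['Capt', 'Col', 'Dr', 'Major', 'Sir'], 'Other')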
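
To check what the regular expression picks up, here is a minimal sketch on a few names written in the style of the Name column. The helper `extract_title` is a hypothetical name introduced only for this illustration; it also returns None instead of raising when a name contains no period.

```python
import re

def extract_title(name):
    """Return the word immediately before the first period, or None."""
    match = re.search(r'(\w+)\.', name)
    return match.group(1) if match else None

sample_names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
]
titles = [extract_title(n) for n in sample_names]
print(titles)
```

Note that the backslashes matter: `\w+` matches a run of word characters and `\.` matches a literal period, so the pattern needs a raw string.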
    

    One-hot encode the title categories:

    title_onehot = pd.get_dummies(train_test['Title'], prefix='Title')
    train_test = pd.concat([train_test, title_onehot], axis=1)
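
For readers new to one-hot encoding, the sketch below shows what `pd.get_dummies` produces on a made-up four-row column (the `toy` frame is an assumption for illustration only). Each distinct value becomes its own indicator column, and every row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical toy column with three distinct titles
toy = pd.DataFrame({'Title': ['Mr', 'Miss', 'Mrs', 'Mr']})

# One indicator column per distinct value, named with the given prefix
toy_onehot = pd.get_dummies(toy['Title'], prefix='Title')
toy = pd.concat([toy, toy_onehot], axis=1)
print(toy)
```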
    

    One-hot encode sex:

    sex_onehot = pd.get_dummies(train_test['Sex'], prefix='Sex')
    train_test = pd.concat([train_test, sex_onehot], axis=1)
    

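    Combine SibSp and Parch into a single feature that represents family size (the passenger counts as 1), since the analysis showed that passengers travelling with relatives had a higher survival rate than those travelling alone.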

    train_test['FamilySize'] = train_test['SibSp'] + train_test['Parch'] + 1
    

    Fill the missing Embarked values with the mode:

    train_test['Embarked'] = train_test['Embarked'].fillna(train_test['Embarked'].mode()[0])
    embarked_onehot = pd.get_dummies(train_test['Embarked'], prefix='Embarked')
    train_test = pd.concat([train_test, embarked_onehot], axis=1)
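
Mode imputation on its own can be sketched as follows, using a made-up Series rather than the real column. `Series.mode()` returns a Series (there can be ties), so `[0]` takes the first entry.

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with one missing value
embarked = pd.Series(['S', 'C', np.nan, 'S', 'Q'])

most_common = embarked.mode()[0]     # mode() ignores NaN by default
filled = embarked.fillna(most_common)
print(filled.tolist())
```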
    

    Cabin has too many missing values, so for now we only use whether a Cabin value is present as a feature:

    train_test['Cabin'] = train_test['Cabin'].fillna('NO')
    train_test['Cabin'] = np.where(train_test['Cabin'] == 'NO', 'NO', 'YES')
    cabin_onehot = pd.get_dummies(train_test['Cabin'], prefix='Cabin')
    train_test = pd.concat([train_test, cabin_onehot], axis=1)
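
A more compact way to express the same idea, offered only as an alternative sketch, is a single 0/1 indicator built with `notna()`. The column name `HasCabin` and the toy values are assumptions for illustration; it carries the same information as the Cabin_NO / Cabin_YES pair.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values; NaN means no cabin was recorded
toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'E46', np.nan]})

# 1 if a cabin is recorded, 0 otherwise
toy['HasCabin'] = toy['Cabin'].notna().astype(int)
print(toy)
```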
    

    Fill missing Fare values with the mean fare of the same passenger class:

    train_test['Fare'] = train_test['Fare'].fillna(train_test.groupby('Pclass')['Fare'].transform('mean'))
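
The group-wise fill relies on `transform('mean')` returning a Series aligned with the original rows, so it can be passed straight to `fillna`. Here is a minimal sketch on made-up numbers (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical fares; one third-class fare is missing
toy = pd.DataFrame({'Pclass': [1, 1, 3, 3, 3],
                    'Fare': [80.0, 100.0, 8.0, np.nan, 10.0]})

# Per-row mean fare of that row's class (NaN is skipped when averaging)
class_mean = toy.groupby('Pclass')['Fare'].transform('mean')
toy['Fare'] = toy['Fare'].fillna(class_mean)
print(toy)
```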
    

    Because some tickets are group tickets, split the fare evenly among the passengers who share a ticket:

    shares = train_test.groupby('Ticket')['Fare'].transform('count')
    train_test['Fare'] = train_test['Fare'] / shares
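
To make the per-person split concrete, here is a sketch on a made-up frame where three passengers share ticket A1. The ticket numbers, fares, and the column name `FarePerPerson` are all hypothetical.

```python
import pandas as pd

# Hypothetical tickets: A1 is shared by three passengers, B2 by one
toy = pd.DataFrame({'Ticket': ['A1', 'A1', 'B2', 'A1'],
                    'Fare': [30.0, 30.0, 8.0, 30.0]})

# Number of passengers holding each row's ticket
toy_shares = toy.groupby('Ticket')['Fare'].transform('count')
toy['FarePerPerson'] = toy['Fare'] / toy_shares
print(toy)
```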
    

    Bin the fare into levels:

    train_test.loc[train_test['Fare'] < 5, 'Fare'] = 0
    train_test.loc[(train_test['Fare'] >= 5) & (train_test['Fare'] < 10), 'Fare'] = 1
    train_test.loc[(train_test['Fare'] >= 10) & (train_test['Fare'] < 15), 'Fare'] = 2
    train_test.loc[(train_test['Fare'] >= 15) & (train_test['Fare'] < 30), 'Fare'] = 3
    train_test.loc[(train_test['Fare'] >= 30) & (train_test['Fare'] < 60), 'Fare'] = 4
    train_test.loc[(train_test['Fare'] >= 60) & (train_test['Fare'] < 100), 'Fare'] = 5
    train_test.loc[train_test['Fare'] >= 100, 'Fare'] = 6
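
The chain of `.loc` assignments works because each assigned level is smaller than every later lower bound, but it depends on statement order. As an alternative sketch, not what the original code does, `pd.cut` can express the same bins in one call. The fare values below are made up; `right=False` makes each bin include its left edge, matching the `>=` / `<` conditions.

```python
import numpy as np
import pandas as pd

# Hypothetical per-person fares covering every bin and its boundaries
fare = pd.Series([0.0, 4.99, 5.0, 12.5, 29.9, 30.0, 75.0, 100.0, 512.3])

# Same cut points as the .loc chain: [<5), [5,10), [10,15), [15,30), [30,60), [60,100), [100, inf)
bin_edges = [-np.inf, 5, 10, 15, 30, 60, 100, np.inf]
fare_level = pd.cut(fare, bins=bin_edges, right=False, labels=False)
print(fare_level.tolist())
```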
    

    Use shares to build a new feature that separates passengers on group tickets from those who bought a ticket on their own:

    train_test['GroupTicket'] = np.where(shares == 1, 'NO', 'YES')
    group_ticket_onehot = pd.get_dummies(train_test['GroupTicket'], prefix='GroupTicket')
    train_test = pd.concat([train_test, group_ticket_onehot], axis=1)
    

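    Age has many missing values, so filling it directly with the mean or median is not a good fit. Instead, we use machine learning models to predict age from the other features.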

    missing_age_df = pd.DataFrame(train_test[['Age', 'Parch', 'Sex', 'SibSp', 'FamilySize', 'Title', 'Fare', 'Pclass', 'Embarked']])
    missing_age_df = pd.get_dummies(missing_age_df,columns=['Title', 'FamilySize', 'Sex', 'Pclass' ,'Embarked'])
    missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
    missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]
    
    def fill_missing_age(missing_age_train, missing_age_test):
        missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
        missing_age_Y_train = missing_age_train['Age']
        missing_age_X_test = missing_age_test.drop(['Age'], axis=1)
        # Model 1: gradient boosting regressor
        gbm_reg = GradientBoostingRegressor(n_estimators=100, max_depth=3, learning_rate=0.01, max_features=3, random_state=42)
        gbm_reg.fit(missing_age_X_train, missing_age_Y_train)
        age_gb = gbm_reg.predict(missing_age_X_test)
        # Model 2: linear regression (the normalize argument was removed in scikit-learn 1.2, so it is omitted)
        lrf_reg = LinearRegression(fit_intercept=True)
        lrf_reg.fit(missing_age_X_train, missing_age_Y_train)
        age_lrf = lrf_reg.predict(missing_age_X_test)
        # Final prediction: row-wise mean of the two models (axis=0); without axis, np.mean returns a single scalar
        return np.mean([age_gb, age_lrf], axis=0)

    # Rows of missing_age_test are in the same order as the rows with missing Age
    train_test.loc[train_test['Age'].isnull(), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)
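
One detail worth spelling out when averaging two models' predictions: `np.mean` over a list of arrays collapses everything to a single number unless `axis=0` is given. The sketch below uses made-up predictions (`pred_a` and `pred_b` are hypothetical) to show the difference.

```python
import numpy as np

# Hypothetical age predictions from two models for three passengers
pred_a = np.array([20.0, 30.0, 40.0])
pred_b = np.array([30.0, 50.0, 20.0])

overall = np.mean([pred_a, pred_b])          # one scalar over all six values
per_row = np.mean([pred_a, pred_b], axis=0)  # one averaged value per passenger
print(overall, per_row)
```

Only the `axis=0` version gives each passenger their own blended prediction; the scalar version would fill every missing age with the same number.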
    

    Bin the ages into groups:

    train_test.loc[train_test['Age'] < 9, 'Age'] = 0
    train_test.loc[(train_test['Age'] >= 9) & (train_test['Age'] < 18), 'Age'] = 1
    train_test.loc[(train_test['Age'] >= 18) & (train_test['Age'] < 27), 'Age'] = 2
    train_test.loc[(train_test['Age'] >= 27) & (train_test['Age'] < 36), 'Age'] = 3
    train_test.loc[(train_test['Age'] >= 36) & (train_test['Age'] < 45), 'Age'] = 4
    train_test.loc[(train_test['Age'] >= 45) & (train_test['Age'] < 54), 'Age'] = 5
    train_test.loc[(train_test['Age'] >= 54) & (train_test['Age'] < 63), 'Age'] = 6
    train_test.loc[(train_test['Age'] >= 63) & (train_test['Age'] < 72), 'Age'] = 7
    train_test.loc[(train_test['Age'] >= 72) & (train_test['Age'] < 81), 'Age'] = 8
    train_test.loc[train_test['Age'] >= 81, 'Age'] = 9
    

    Save the PassengerId values of the test set:

    passengerId_test = train_test['PassengerId'][891:]
    

    Drop the features that are no longer needed:

    train_test.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Title', 'Sex', 'Embarked', 'Cabin', 'Ticket', 'GroupTicket'], axis=1, inplace=True)
    

    Split the data back into training and test sets:

    train = train_test[:891]
    test = train_test[891:]
    X_train = train.drop(['Survived'], axis=1)
    y_train = train['Survived']
    X_test = test.drop(['Survived'], axis=1)
    

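    Train a random forest, extremely randomized trees, and gradient boosted trees, then combine them with VotingClassifier to build the final prediction model.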

    rf = RandomForestClassifier(n_estimators=500, max_depth=5, min_samples_split=13)
    et = ExtraTreesClassifier(n_estimators=500, max_depth=7, min_samples_split=8)
    gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.0135)
    voting = VotingClassifier(estimators=[('rf', rf), ('et', et), ('gbm', gbm)], voting='soft')
    voting.fit(X_train, y_train)
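
With `voting='soft'`, the ensemble averages the class probabilities predicted by each model and picks the class with the highest average, whereas hard voting counts each model's predicted label. The NumPy sketch below only illustrates that arithmetic; the probabilities are made up and do not come from the fitted models.

```python
import numpy as np

# Hypothetical [P(died), P(survived)] for three passengers from three models
proba = np.array([
    [[0.6, 0.4], [0.3, 0.7], [0.9, 0.1]],      # stands in for 'rf'
    [[0.7, 0.3], [0.45, 0.55], [0.45, 0.55]],  # stands in for 'et'
    [[0.8, 0.2], [0.2, 0.8], [0.45, 0.55]],    # stands in for 'gbm'
])

# Soft voting: average probabilities across models, then take the best class
soft_pred = proba.mean(axis=0).argmax(axis=1)

# Hard voting: each model votes with its own label; majority of three wins
hard_votes = proba.argmax(axis=2)
hard_pred = (hard_votes.sum(axis=0) >= 2).astype(int)
print(soft_pred, hard_pred)
```

For the third passenger, one very confident model outweighs two lukewarm ones under soft voting, while hard voting follows the two-to-one majority.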
    

    Predict and generate the submission file:

    y_predict = voting.predict(X_test)
    submission = pd.DataFrame({'PassengerId': passengerId_test, 'Survived': y_predict.astype(np.int32)})
    submission.to_csv('submission.csv', index=False)
    
  • Original article: https://www.cnblogs.com/timdyh/p/11379991.html