  • Kaggle:Titanic: Machine Learning from Disaster

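    I had long wanted to scrape stock price movements. While reading a blog post about scraping stock data I happened to come across Kaggle, browsed its problems, found them refreshingly different, and decided to try one.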

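    The task: given train.csv, predict whether each passenger in test.csv survived. train.csv contains the following passenger information:

    PassengerId  passenger ID
    Survived     whether the passenger survived
    Pclass       ticket class
    Name         passenger name
    Sex          passenger sex
    Age          passenger age
    SibSp        number of siblings/spouses aboard
    Parch        number of parents/children aboard
    Ticket       ticket number
    Fare         fare paid
    Cabin        cabin
    Embarked     port of embarkation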

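    The SibSp, Parch, Name, PassengerId, Ticket, and Cabin columns are treated as uninformative for the prediction and are dropped. SibSp and Parch are first summed into a single family-size feature, family_member, which is kept.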
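
    One quick way to sanity-check whether a column carries signal is to compare survival rates across its values. The following is a minimal sketch, not part of the original script: the rows are invented for illustration (they are not taken from train.csv), and survival_rate_by is a hypothetical helper name.

```python
import pandas as pd

# Toy rows invented for illustration only; NOT taken from train.csv.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
    "Sex": ["female", "male", "female", "male",
            "female", "male", "male", "female"],
    "SibSp": [1, 0, 0, 1, 0, 0, 2, 1],
    "Parch": [0, 0, 1, 0, 0, 0, 2, 0],
})


def survival_rate_by(frame, column):
    """Hypothetical helper: mean of Survived within each value of `column`."""
    return frame.groupby(column)["Survived"].mean()


# Same derived feature as in the post: siblings/spouses plus parents/children.
toy["family_member"] = toy["SibSp"] + toy["Parch"]

rate_by_sex = survival_rate_by(toy, "Sex")
rate_by_family = survival_rate_by(toy, "family_member")
print(rate_by_sex)
print(rate_by_family)
```

    On the real data, a column whose groups show nearly identical survival rates is a candidate for dropping, while large gaps suggest it is worth keeping.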

    A random forest is then used for the prediction.
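
    Before fitting on the full training set, a random forest can be scored with cross-validation. The sketch below is an illustration under stated assumptions rather than the author's procedure: X_demo and y_demo are synthetic placeholders (not Titanic data), and n_estimators is set to 100 instead of the 1000 used in the script, purely to keep the run short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix; NOT the Titanic data.
rng = np.random.RandomState(312)
X_demo = rng.randint(0, 3, size=(200, 4))
# The label follows the first column, with about 10% of labels flipped,
# so there is a real but imperfect pattern to learn.
y_demo = ((X_demo[:, 0] > 0) ^ (rng.rand(200) < 0.1)).astype(int)

model = RandomForestClassifier(
    n_estimators=100,      # smaller than the post's 1000, for speed
    random_state=312,
    min_samples_leaf=3,
)
# Five-fold cross-validation; each entry is the accuracy on one held-out fold.
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.mean())
```

    The mean of scores gives a rough estimate of accuracy on unseen rows, which is useful when comparing settings such as min_samples_leaf.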

    # -*- coding: utf-8 -*-
    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    from subprocess import check_output
    import csv
    import random as rnd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    # sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
    # their contents now live in sklearn.model_selection
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    train_df = pd.read_csv('train.csv', header=0)
    test_df = pd.read_csv('test.csv', header=0)
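    # Stack train and test so both get the same preprocessing; ignore_index renumbers the rows
    df = pd.concat([train_df, test_df], ignore_index=True)
    # Restore the train.csv column order (reindex_axis has been removed from pandas)
    df = df.reindex(columns=train_df.columns)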
    # Fill missing Age, Fare and Embarked values in the combined table
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    # Encode Sex and Embarked as integer codes
    df['Sex'] = pd.factorize(df['Sex'])[0]
    df['Embarked'] = pd.factorize(df['Embarked'])[0]
    # Family size: siblings/spouses plus parents/children
    df['family_member'] = df['SibSp'] + df['Parch']
    # Drop the 'Cabin', 'Ticket', 'Name', 'SibSp', 'Parch' and 'PassengerId' columns
    df = df.drop(['Cabin', 'Ticket', 'Name', 'SibSp', 'Parch', 'PassengerId'], axis=1)
    # Rows with a Survived value are the training set; rows without one are the test set
    survived_member = df[df['Survived'].notnull()].values
    test_message = df[df['Survived'].isnull()].values
    # Labels: the first column (Survived) of the training rows
    Y = survived_member[:, 0].astype(int)
    # Features: every column except the first; astype(int) truncates Age and Fare
    X = survived_member[:, 1:].astype(int)
    # Fit the random forest
    result = RandomForestClassifier(n_estimators=1000, random_state=312, min_samples_leaf=3).fit(X, Y)
    pre = result.predict(test_message[:, 1:]).astype(int)
    Id = test_df['PassengerId']
    # Write the submission file; newline='' stops the csv module emitting blank rows on Windows
    with open('result1.csv', 'w', newline='') as result_csv:
        result_fd = csv.writer(result_csv)
        result_fd.writerow(['PassengerId', 'Survived'])
        result_fd.writerows(zip(Id, pre))
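
    The submission step at the end of the script can be isolated into a small function and checked by reading the file back. This is a minimal sketch using only the standard library; write_submission is a hypothetical helper name, and the IDs and predictions are example values, not real model output.

```python
import csv
import os
import tempfile


def write_submission(path, passenger_ids, predictions):
    """Hypothetical helper: write PassengerId/Survived pairs as a CSV file."""
    # newline="" lets the csv module control line endings itself,
    # which avoids blank rows between records on Windows.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["PassengerId", "Survived"])
        writer.writerows(zip(passenger_ids, predictions))


# Example values for illustration only; not real predictions.
demo_path = os.path.join(tempfile.mkdtemp(), "result_demo.csv")
write_submission(demo_path, [892, 893, 894], [0, 1, 0])

# Read the file back to confirm the header and the row layout.
with open(demo_path, newline="") as handle:
    rows = list(csv.reader(handle))
print(rows)
```

    Using a with block closes the file even if writing raises an exception, so no explicit close() call is needed.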
  • Original post: https://www.cnblogs.com/chenyang920/p/7248138.html