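I had long wanted to scrape stock price movements. While reading a blog post about scraping stock data, I came across Kaggle by chance, browsed its problems, found them refreshingly different, and decided to try one.
The task, as shown in the figure: given train.csv, predict whether each passenger in test.csv survived. train.csv contains the following passenger information: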
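PassengerId | Passenger ID |
Survived | Whether the passenger survived |
Pclass | Ticket class |
Name | Passenger name |
Sex | Passenger sex |
Age | Passenger age |
SibSp | Number of siblings or spouses aboard |
Parch | Number of parents or children aboard |
Ticket | Ticket information |
Fare | Ticket fare |
Cabin | Cabin |
Embarked | Port of embarkation |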
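Of these columns, SibSp, Parch, Name, PassengerId, Ticket and Cabin are treated as irrelevant to the prediction and dropped. SibSp and Parch are first summed into a single family_member column, so the family size is still kept.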
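The column handling described above can be sketched on a tiny hand-made table. This is only an illustration: the helper name `prepare_features` and the two sample rows are made up for the example and are not taken from train.csv.

```python
import pandas as pd

def prepare_features(df):
    """Sum SibSp and Parch into family_member, then drop the unused columns.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    out['family_member'] = out['SibSp'] + out['Parch']
    unused = ['Cabin', 'Ticket', 'Name', 'SibSp', 'Parch', 'PassengerId']
    return out.drop(columns=unused)

# Two invented rows that use the same column names as train.csv
sample = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 1],
    'Pclass': [3, 1],
    'Name': ['Passenger A', 'Passenger B'],
    'Sex': ['male', 'female'],
    'Age': [22.0, 38.0],
    'SibSp': [1, 1],
    'Parch': [0, 2],
    'Ticket': ['T-0001', 'T-0002'],
    'Fare': [7.25, 71.28],
    'Cabin': [None, 'C85'],
    'Embarked': ['S', 'C'],
})

features = prepare_features(sample)
print(list(features.columns))
print(features['family_member'].tolist())
```

Working on a copy keeps the original table intact, which makes it easier to compare the data before and after the columns are removed.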
The model used is a random forest.
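As a rough sketch of how a random forest is fitted and checked with scikit-learn, the example below uses synthetic data rather than the Titanic files. The feature values and the labelling rule are invented for the example. `random_state=312` and `min_samples_leaf=3` match the script in this post, while `n_estimators` is reduced to 100 only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the real Titanic data)
rng = np.random.default_rng(0)
n = 200
pclass = rng.integers(1, 4, size=n)   # values 1..3
sex = rng.integers(0, 2, size=n)      # values 0 or 1
age = rng.integers(1, 80, size=n)     # values 1..79
X = np.column_stack([pclass, sex, age])

# Invented rule: the label follows `sex`, flipped for about 10% of rows
noise = rng.random(n) < 0.1
y = np.where(noise, 1 - sex, sex)

model = RandomForestClassifier(n_estimators=100, random_state=312,
                               min_samples_leaf=3)

# 5-fold cross-validation gives an accuracy estimate on held-out rows
scores = cross_val_score(model, X, y, cv=5)
print('mean CV accuracy:', scores.mean())

# Fit on all rows, then predict labels for a few samples
model.fit(X, y)
predictions = model.predict(X[:5])
print(predictions)
```

Cross-validation is one way to get a feel for accuracy before uploading a submission, since test.csv has no Survived column to score against.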
```python
import csv

import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestClassifier

train_df = pd.read_csv('train.csv', header=0)
test_df = pd.read_csv('test.csv', header=0)

# Stack train and test so both receive the same preprocessing.
# test.csv has no Survived column, so those rows hold NaN there.
df = pd.concat([train_df, test_df], ignore_index=True)
df = df.reindex(columns=train_df.columns)

# Fill the missing Age, Fare and Embarked values in the combined table
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Map Sex and Embarked to integer codes
df['Sex'] = pd.factorize(df['Sex'])[0]
df['Embarked'] = pd.factorize(df['Embarked'])[0]

df['family_member'] = df['SibSp'] + df['Parch']

# Drop the columns that are not used as features
df = df.drop(['Cabin', 'Ticket', 'Name', 'SibSp', 'Parch', 'PassengerId'], axis=1)

# Rows with a Survived value are training data; the rest are the test rows
survived_member = df[df['Survived'].notnull()].values
test_message = df[df['Survived'].isnull()].values

Y = survived_member[:, 0].astype(int)    # first column: Survived
X = survived_member[:, 1:].astype(int)   # remaining columns (truncates Age and Fare to integers)
X_test = test_message[:, 1:].astype(int) # same conversion for the test rows

# Random forest
result = RandomForestClassifier(n_estimators=1000, random_state=312,
                                min_samples_leaf=3).fit(X, Y)
pre = result.predict(X_test).astype(int)

# Write the submission file
Id = test_df['PassengerId']
with open('result1.csv', 'w', newline='') as result_csv:
    result_fd = csv.writer(result_csv)
    result_fd.writerow(['PassengerId', 'Survived'])
    result_fd.writerows(zip(Id, pre))
```
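To make the preprocessing steps in the script more concrete, here is a small sketch of median/mode filling followed by `pd.factorize` on a three-row table. The table `toy` and its values are invented for the example.

```python
import pandas as pd

# Invented three-row table with one missing Age and one missing Embarked
toy = pd.DataFrame({
    'Age': [22.0, None, 30.0],
    'Embarked': ['S', None, 'S'],
    'Sex': ['male', 'female', 'male'],
})

# Numeric column: fill the gap with the median of the known values
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

# Categorical column: fill the gap with the most frequent value
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

# factorize assigns integer codes in order of first appearance
toy['Sex'] = pd.factorize(toy['Sex'])[0]
toy['Embarked'] = pd.factorize(toy['Embarked'])[0]

print(toy)
```

Filling before factorizing matters here, because `pd.factorize` encodes a missing value as -1 by default, and that code would otherwise reach the model as if it were a real category.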