  • [Data Science from Zero to One] · Titanic Survival Prediction (Data Loading, Processing, and Modeling)

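    • Introduction:

    This post builds a model that predicts whether a passenger survived the sinking of the Titanic. It is based on a classic Kaggle competition.

    Dataset:

    1. Download the data from the Kaggle Titanic competition page: https://www.kaggle.com/c/titanic

    2. Baidu Netdisk mirror: https://pan.baidu.com/s/1BfRZdCz6Z1XR6aDXxiHmHA      Extraction code: jzb3

    • Code

    Data loading: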

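    #%%
    import tensorflow as tf
    import keras
    import pandas as pd
    import numpy as np
    
    # Load the Kaggle training set and take a first look at it.
    data = pd.read_csv("titanic/train.csv")
    print(data.head())      # first five rows
    print(data.describe())  # summary statistics of the numeric columns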

    Data processing:

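    #%%
    # Keep the label (Survived) plus the seven columns used as features.
    strs = "Survived Pclass Sex Age SibSp Parch Fare Embarked"
    clos = strs.split(" ")
    print(clos)
    #%%
    # .copy() gives an independent frame, so the assignments below do not
    # trigger pandas' SettingWithCopyWarning.
    x_datas = data[clos].copy()
    print(x_datas.head())
    #%%
    # Count missing values per column (Age and Embarked have gaps).
    print(x_datas.isnull().sum())
    
    #%%
    # Fill missing ages with the mean and missing ports with the most common one.
    x_datas["Age"] = x_datas["Age"].fillna(x_datas["Age"].mean())
    x_datas["Embarked"] = x_datas["Embarked"].fillna(x_datas["Embarked"].mode()[0])
    
    # One-hot encode the categorical columns; dtype=float keeps the dummies numeric.
    x_datas = pd.get_dummies(x_datas,columns=["Pclass","Sex","Embarked"],dtype=float)
    # Rough rescaling so Age and Fare are closer in magnitude to the 0/1 columns.
    x_datas["Age"]/=100
    x_datas["Fare"]/=100
    
    print(x_datas.isnull().sum())
    print(x_datas.head())
    
    #%%
    # Split by row order: the first 75% of rows for training, the rest for testing.
    seq = int(0.75*(len(x_datas)))
    
    # Column 0 is the Survived label; the remaining columns are the features.
    X ,Y = x_datas.iloc[:,1:],x_datas.iloc[:,0]
    X_train,Y_train,X_test,Y_test = X[:seq],Y[:seq],X[seq:],Y[seq:]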
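
    To see what the processing step produces, the sketch below wraps the same operations in a hypothetical helper, `preprocess`, and runs it on five made-up rows (illustrative values, not real passenger records). It assumes every category of Pclass, Sex, and Embarked appears at least once, as in the real training set, which yields 12 feature columns.

```python
import pandas as pd


def preprocess(df):
    """Apply the post's processing steps; hypothetical helper for illustration."""
    cols = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
    out = df[cols].copy()
    # Fill gaps: mean age, most frequent port of embarkation.
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # One-hot encode the categorical columns as floats.
    out = pd.get_dummies(out, columns=["Pclass", "Sex", "Embarked"], dtype=float)
    # Same rough rescaling as in the post.
    out["Age"] /= 100
    out["Fare"] /= 100
    return out


# Made-up rows covering every category once; row 2 has a missing Age and Embarked.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 2, 3, 3],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [20.0, 40.0, None, 30.0, 30.0],
    "SibSp": [1, 1, 0, 0, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [10.0, 70.0, 8.0, 8.0, 8.0],
    "Embarked": ["S", "C", None, "Q", "S"],
})

processed = preprocess(toy)
features = processed.iloc[:, 1:]  # drop the Survived label, as in the post
print(list(features.columns))
print(features.shape)
```

    The 12 feature columns are the input width that the first dense layer in the model summary implies (12 × 64 weights + 64 biases = 832 parameters).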

    Model construction:
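
    The layer summary in the output section (Dense 64, Dropout, Dense 16, Dense 2; 1,906 parameters in total) is consistent with a small fully connected network over 12 input features. A minimal sketch of such a network follows. The activations, dropout rate, optimizer, and loss are assumptions, since the summary does not record them, and `build_model` is a hypothetical helper name.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input


def build_model(input_dim=12):
    """Build a network whose layer shapes match the summary in the post.

    Assumed, not taken from the post: relu/softmax activations,
    a dropout rate of 0.5, the adam optimizer, and the loss function.
    """
    model = Sequential([
        Input(shape=(input_dim,)),
        Dense(64, activation="relu"),    # 12*64 + 64 = 832 parameters
        Dropout(0.5),                    # no parameters
        Dense(16, activation="relu"),    # 64*16 + 16 = 1040 parameters
        Dense(2, activation="softmax"),  # 16*2 + 2 = 34 parameters
    ])
    # Two softmax outputs with integer 0/1 labels pair with the sparse loss.
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

    Calling `build_model().summary()` should print a table with the same output shapes and parameter counts as the one in the output section; layer names and the metric name (`acc` versus `accuracy`) vary with the Keras version.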

    Model training and evaluation:
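
    The log in the output section shows 100 epochs on 534 samples with validation metrics, followed by an evaluation on 223 samples. The sketch below is one training and evaluation step consistent with that log. `train_and_evaluate` is a hypothetical helper that expects an already compiled Keras-style model, and `validation_split=0.2` is an assumption inferred from 534 being 80% of the 668 training rows.

```python
import numpy as np


def train_and_evaluate(model, X_train, Y_train, X_test, Y_test,
                       epochs=100, validation_split=0.2):
    """Fit a compiled Keras-style model, then report loss and accuracy on the test rows."""
    # Convert DataFrames/Series (or lists) to plain arrays before handing them to Keras.
    X_train = np.asarray(X_train, dtype="float32")
    X_test = np.asarray(X_test, dtype="float32")
    Y_train = np.asarray(Y_train)
    Y_test = np.asarray(Y_test)

    # Keras holds out the last `validation_split` fraction of the training rows
    # and reports val_loss / val_acc on them after every epoch.
    model.fit(X_train, Y_train, epochs=epochs, validation_split=validation_split)

    # With one metric compiled in, evaluate returns [loss, accuracy].
    loss, acc = model.evaluate(X_test, Y_test)
    print("test loss is %f, acc %f" % (loss, acc))
    return loss, acc
```

    With the variables from the data-processing step, the call would be `train_and_evaluate(model, X_train, Y_train, X_test, Y_test)`.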

    • Output:
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    dense_1 (Dense)              (None, 64)                832
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 64)                0
    _________________________________________________________________
    dense_2 (Dense)              (None, 16)                1040
    _________________________________________________________________
    dense_3 (Dense)              (None, 2)                 34
    =================================================================
    Total params: 1,906
    Trainable params: 1,906
    Non-trainable params: 0
    _________________________________________________________________
    ...
    Epoch 96/100
    534/534 [==============================] - 0s 80us/step - loss: 0.3870 - acc: 0.8277 - val_loss: 0.5083 - val_acc: 0.7612
    Epoch 97/100
    534/534 [==============================] - 0s 80us/step - loss: 0.3921 - acc: 0.8352 - val_loss: 0.5070 - val_acc: 0.7687
    Epoch 98/100
    534/534 [==============================] - 0s 82us/step - loss: 0.3940 - acc: 0.8371 - val_loss: 0.5102 - val_acc: 0.7687
    Epoch 99/100
    534/534 [==============================] - 0s 78us/step - loss: 0.3996 - acc: 0.8277 - val_loss: 0.5106 - val_acc: 0.7687
    Epoch 100/100
    534/534 [==============================] - 0s 80us/step - loss: 0.3892 - acc: 0.8352 - val_loss: 0.5082 - val_acc: 0.7612
    223/223 [==============================] - 0s 63us/step
    test loss is 0.389338, acc 0.829596
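
    The numbers in this output can be cross-checked with a little arithmetic. The sketch assumes the standard Kaggle train.csv with 891 rows and, as an inference rather than something the log states, a Keras `validation_split` of 0.2. The counts of correct predictions (102 and 185) are likewise back-calculated from the reported accuracies.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


# Layer sizes from the model summary; 12 is the number of feature columns.
n_features = 12
params = {
    "dense_1": dense_params(n_features, 64),
    "dense_2": dense_params(64, 16),
    "dense_3": dense_params(16, 2),
}
total_params = sum(params.values())

# Row counts: 75/25 split by row order, then an assumed 80/20 validation split.
n_rows = 891
seq = int(0.75 * n_rows)          # rows before the split point
n_test = n_rows - seq             # rows passed to evaluate
n_fit = int(seq * (1 - 0.2))      # rows Keras actually trains on
n_val = seq - n_fit               # rows held out for val_loss / val_acc

print(params, total_params)       # 832, 1040, 34 and 1906
print(seq, n_test, n_fit, n_val)  # 668 223 534 134
# Reported accuracies expressed as whole numbers of correct predictions.
print(round(102 / n_val, 4), round(185 / n_test, 6))  # 0.7612 0.829596
```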
  • Original article: https://www.cnblogs.com/xiaosongshine/p/10418388.html