    Kaggle Summary

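    I. Feature Analysis (EDA, Exploratory Data Analysis)

    1.1 Feature analysis with seaborn

    ROC curve:
    sns.lineplot(x="X", y="y", data=df)

    Effect of each value of a feature on survival, for a feature with a limited number of distinct values:
    sns.barplot(x="X", y="y", data=df)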

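    Continuous features with many distinct values:
    # distplot is deprecated since seaborn 0.11; sns.histplot is its replacement
    sns.distplot(train['SibSp'][train['Survived'] == 1], bins=50)
    sns.distplot(train['SibSp'][train['Survived'] == 0], bins=50)
    which is equivalent to:
    sns.distplot(train.loc[train['Survived'] == 1, 'SibSp'], bins=50)
    sns.distplot(train.loc[train['Survived'] == 0, 'SibSp'], bins=50)

    Counts of survivors and non-survivors for each value of a feature:
    sns.countplot(x="Embarked", hue='Survived', data=df)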
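
    The two selection styles above return the same rows. A minimal sketch on a made-up frame (the values are illustrative, not Titanic data) checks the equivalence:

```python
import pandas as pd

# Tiny stand-in for the training frame; the values are invented.
train = pd.DataFrame({
    "SibSp":    [1, 0, 3, 0, 1, 2],
    "Survived": [1, 0, 0, 1, 1, 0],
})

mask = train["Survived"] == 1

chained = train["SibSp"][mask]      # pick the column, then filter the rows
via_loc = train.loc[mask, "SibSp"]  # filter rows and pick the column in one step

same = chained.equals(via_loc)
print(same)
```

    The .loc form is usually the safer habit, since it also works on the left-hand side of an assignment.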

    1.2 Feature overview

    data.head(10)
    data.describe()
    data.describe().T
    data.info()
    train['Survived'].value_counts()  # check the share of survivors
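
    The same questions the plots answer can also be checked numerically. The sketch below uses an invented six-row frame; normalize=True, groupby and pd.crosstab are standard pandas calls:

```python
import pandas as pd

# Invented sample; only the column names follow the Titanic data set.
train = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "C", "S"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Share of each class instead of raw counts.
class_share = train["Survived"].value_counts(normalize=True)

# Survival rate per category (the numbers behind a barplot).
rate_by_port = train.groupby("Embarked")["Survived"].mean()

# Survivor / non-survivor counts per category (the numbers behind a countplot).
counts = pd.crosstab(train["Embarked"], train["Survived"])

print(class_share, rate_by_port, counts, sep="\n\n")
```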

    II. Feature Selection and Processing

    2.1 Binning continuous values

    1. Automatic binning with pd.cut
      train['Age'] = pd.cut(train['Age'], 5, labels=[0, 1, 2, 3, 4])
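
    To see what pd.cut does with an integer bin count, here is a small sketch on invented ages. With 5 bins over the range 0 to 80 the edges fall at 16, 32, 48 and 64, and each interval is closed on the right:

```python
import pandas as pd

ages = pd.Series([0, 10, 20, 35, 50, 64, 80])  # invented values

# 5 equal-width bins; labels replace the interval objects with small integers.
binned = pd.cut(ages, 5, labels=[0, 1, 2, 3, 4])

print(binned.astype(int).tolist())
```

    Missing ages stay missing after pd.cut, so it is usually easier to fill them first.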

    2. Manual binning
      def ProcessLabel(val):
          # Map family size to a coarse label: small, medium, large
          if val < 3:
              return 0
          elif val < 7:
              return 1
          else:
              return 2
      train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
      train['FamLabel'] = train['FamilySize'].apply(ProcessLabel)
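
    The same thresholds can also be expressed with pd.cut and explicit bin edges. This is a sketch of an alternative, not the method used above; process_label mirrors ProcessLabel and the family sizes are invented:

```python
import numpy as np
import pandas as pd

def process_label(val):
    # Same thresholds as ProcessLabel above.
    if val < 3:
        return 0
    elif val < 7:
        return 1
    return 2

family_size = pd.Series([1, 2, 3, 6, 7, 11])  # invented values

by_apply = family_size.apply(process_label)

# right=False makes the intervals [low, high), matching the "<" comparisons.
by_cut = pd.cut(
    family_size,
    bins=[-np.inf, 3, 7, np.inf],
    right=False,
    labels=[0, 1, 2],
).astype(int)

print(by_apply.tolist(), by_cut.tolist())
```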

    2.2 String handling

    train['Embarked'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
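
    One thing to watch with map and a dict: any value that is not a key, including missing values, becomes NaN. A short sketch with an invented series (the "X" entry is a hypothetical unexpected category):

```python
import pandas as pd

embarked = pd.Series(["S", "C", "Q", None, "X"])  # "X" is a made-up stray value

codes = embarked.map({"S": 0, "C": 1, "Q": 2})

# Unmapped entries turn into NaN, which also makes the column float.
n_unmapped = int(codes.isnull().sum())
print(codes.tolist(), n_unmapped)
```

    Filling missing values before mapping (see 2.3) avoids introducing NaN at this step.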

    2.3 Handling missing values

    Filling a string column:

    train['Embarked'] = train['Embarked'].fillna('S')
    

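    Filling around the mean (random integers within one standard deviation of the mean):

    avg = train['Age'].mean()
    std = train['Age'].std()
    age_null_count = train['Age'].isnull().sum()
    # randint expects integer bounds, so truncate mean - std and mean + std
    age_list = np.random.randint(int(avg - std), int(avg + std), size=age_null_count)
    train.loc[train['Age'].isnull(), 'Age'] = age_list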
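
    A self-contained version of this fill, on invented ages. It uses np.random.default_rng with a fixed seed instead of np.random.randint, which is an assumption made here only so that the result is reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, an assumption for reproducibility

train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan]})

avg = train["Age"].mean()  # NaN values are skipped by default
std = train["Age"].std()
missing = train["Age"].isnull()

low, high = int(avg - std), int(avg + std)

# Draw one integer in [low, high) for every missing age.
train.loc[missing, "Age"] = rng.integers(low, high, size=missing.sum())

filled = train.loc[missing, "Age"]
print(low, high, filled.tolist())
```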
    
    When many values are missing, predict them with a regression model:
    from sklearn.ensemble import RandomForestRegressor
    import lightgbm as lgbm
    
    # 'Age' comes first so that column 0 is the target after get_dummies
    data = train[['Age', 'Pclass', 'Sex', 'Title']]
    data = pd.get_dummies(data)
    model = RandomForestRegressor(n_estimators=128, n_jobs=-1)
    # model = lgbm.LGBMRegressor(n_estimators=128, n_jobs=-1)
    tr = data[data['Age'].notnull()].values  # rows with a known age
    te = data[data['Age'].isnull()].values   # rows whose age must be predicted
    tr_X = tr[:, 1:]
    tr_y = tr[:, 0]
    te_X = te[:, 1:]
    model.fit(tr_X, tr_y)
    pred_age = model.predict(te_X)
    train.loc[data['Age'].isnull(), 'Age'] = pred_age
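
    The same idea as a runnable sketch on synthetic data. The frame, the relationship between Pclass and Age, and the smaller forest are all assumptions made for illustration; only the overall pattern (fit on rows with a known age, predict the rest) follows the snippet above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the Titanic columns.
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
})
df["Age"] = 50 - 8 * df["Pclass"] + rng.normal(0, 3, size=n)

# Knock out 40 ages to simulate missing values.
df.loc[rng.choice(n, size=40, replace=False), "Age"] = np.nan

# One-hot encode the predictors and force a plain float matrix.
features = pd.get_dummies(df[["Pclass", "Sex"]]).astype(float)
known = df["Age"].notnull()

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(features[known], df.loc[known, "Age"])

# Fill only the rows where the age was missing.
df.loc[~known, "Age"] = model.predict(features[~known])
print(df["Age"].isnull().sum())
```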
    

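    2.4 One-hot encoding

    Always apply this to all_data (train and test combined); otherwise the training and test sets can easily end up with mismatched columns:

    all_data = pd.get_dummies(all_data)
    
    Or encode a single column and append it:
    
    Emb = pd.get_dummies(all_data['Embarked'], prefix='Embarked')
    all_data = pd.concat([all_data, Emb], axis = 1)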
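
    A small sketch of the mismatch this warning is about. The two frames are invented; the test set happens to contain only one of the three categories:

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Survived": [0, 1, 1]})
test = pd.DataFrame({"Embarked": ["S", "S"]})  # only one category present

# Encoding separately gives different column sets.
separate_train_cols = set(pd.get_dummies(train[["Embarked"]]).columns)
separate_test_cols = set(pd.get_dummies(test[["Embarked"]]).columns)

# Encoding the combined frame gives both parts the same columns.
all_data = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(all_data, columns=["Embarked"])

train_enc = encoded[encoded["Survived"].notnull()].drop(columns="Survived")
test_enc = encoded[encoded["Survived"].isnull()].drop(columns="Survived")

print(sorted(separate_train_cols), sorted(separate_test_cols))
print(list(train_enc.columns) == list(test_enc.columns))
```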
    

    2.5 Combining and splitting the data

    all_data = pd.concat([train, test], ignore_index = True)
    

    Split (test rows have no Survived value after the concat):

    train = all_data.loc[all_data['Survived'].notnull()]
    test = all_data.loc[all_data['Survived'].isnull()]
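
    One side effect of this round trip is that Survived becomes a float column, because the test rows hold NaN. A sketch on invented frames, with the cast back to int added as a suggestion rather than something the original notes do:

```python
import pandas as pd

train = pd.DataFrame({"Age": [22.0, 38.0], "Survived": [0, 1]})
test = pd.DataFrame({"Age": [26.0]})  # no Survived column

all_data = pd.concat([train, test], ignore_index=True)
survived_dtype = all_data["Survived"].dtype  # float64, since NaN was introduced

# Split on the presence of the label.
train_part = all_data.loc[all_data["Survived"].notnull()].copy()
test_part = all_data.loc[all_data["Survived"].isnull()].drop(columns="Survived")

# Restore the integer label on the training part.
train_part["Survived"] = train_part["Survived"].astype(int)

print(survived_dtype, train_part["Survived"].tolist(), len(test_part))
```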
    

    2.6 Feature scaling and standardization

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    data_new[['Amount', 'Hour']] = sc.fit_transform(data_new[['Amount', 'Hour']])
    data_new.head()
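
    When a separate test set exists, a common pattern is to fit the scaler on the training rows and reuse it on the test rows. The sketch below assumes that setup; the frames are invented and the column names follow the snippet above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"Amount": [10.0, 20.0, 30.0, 40.0],
                      "Hour":   [0.0, 6.0, 12.0, 18.0]})
test = pd.DataFrame({"Amount": [25.0], "Hour": [9.0]})

cols = ["Amount", "Hour"]
sc = StandardScaler()

train[cols] = sc.fit_transform(train[cols])  # learn mean and std from train only
test[cols] = sc.transform(test[cols])        # apply the same mean and std

print(train[cols].mean().tolist(), test[cols].to_numpy().tolist())
```

    Here the test row sits exactly at the training mean, so it scales to zero in both columns.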
    

    III. Model Tuning

    lgbm:

    objective = 'regression', 'binary', or 'multiclass'
    

    3.1 Parameter search with GridSearchCV

    import lightgbm as lgb
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import GridSearchCV
    
    # Every combination in the grid is fitted cv times
    params = {'num_leaves': [32, 64, 128, 256, 1024], 'max_depth': [10, 20, 30, 60], 'learning_rate': [0.01, 0.05, 0.1], 'n_estimators': [100, 200, 300]}
    model = lgb.LGBMClassifier()
    gridS = GridSearchCV(model, params, cv=5, n_jobs=-1)
    gridS.fit(X, y)
    gridS.best_estimator_
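
    A runnable sketch of the same search. To keep it free of extra dependencies it swaps in scikit-learn's GradientBoostingClassifier for LGBMClassifier and uses a much smaller grid on synthetic data; both are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

params = {
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [20, 40],
}

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    params,
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

best_params = grid.best_params_   # the winning combination
best_score = grid.best_score_     # its mean cross-validated accuracy
n_candidates = len(grid.cv_results_["params"])
print(best_params, round(best_score, 3), n_candidates)
```

    Besides best_estimator_, the best_params_ and best_score_ attributes are usually the first things to inspect.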
    

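    IV. Results

    4.1 Plotting the ROC curve

    The scores passed to roc_curve should be probabilities. With hard 0/1 predictions the curve has only a single operating point, so use a probability output: lgb.train() returns a Booster whose predict() gives probabilities for a binary objective, whereas LGBMClassifier().predict() returns class labels (LGBMClassifier().predict_proba() gives probabilities as well).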

    from sklearn.metrics import roc_curve
    from matplotlib import pyplot as plt
    import seaborn as sns
    
    sns.set()
    fpr, tpr, thresh = roc_curve(y, pred)
    plt.plot(fpr, tpr)
    plt.show()
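
    The difference between probabilities and hard labels shows up in the number of points roc_curve returns. The sketch below uses LogisticRegression on synthetic data as a stand-in model (an assumption; any classifier with predict_proba would do) and skips the plotting:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class
labels = clf.predict(X_te)             # hard 0/1 predictions

# drop_intermediate=False keeps every threshold so the counts are comparable.
fpr_p, tpr_p, _ = roc_curve(y_te, proba, drop_intermediate=False)
fpr_l, tpr_l, _ = roc_curve(y_te, labels, drop_intermediate=False)

print(len(fpr_p), len(fpr_l))
```

    With hard labels only the two corner points and one operating point remain, so the plotted curve is just two straight segments.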
    

    4.2 Cross-validated accuracy

    import numpy as np
    from sklearn.model_selection import cross_val_score
    
    score = cross_val_score(model, X, y, scoring='accuracy', cv=5)
    print(np.mean(score))
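
    A self-contained sketch of the same call. The model, the synthetic data, and the explicit StratifiedKFold with shuffling are assumptions for illustration; reporting the spread next to the mean is a suggestion, not part of the original notes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Shuffled, stratified folds with a fixed seed for repeatable scores.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring="accuracy", cv=cv
)

mean_acc, std_acc = np.mean(scores), np.std(scores)
print(len(scores), round(mean_acc, 3), round(std_acc, 3))
```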
    

    4.3 Saving to CSV

    res = pd.DataFrame({'PassengerId': passenger_id, 'Survived': pred.astype(np.int32)})
    res.to_csv('pred.csv', index=False)
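
    If pred holds probabilities, astype(np.int32) truncates them toward zero, so 0.8 would become 0. The sketch below thresholds first; the 0.5 cut-off and the sample values are assumptions, and the file is written to an in-memory buffer only to keep the example self-contained:

```python
import io

import numpy as np
import pandas as pd

passenger_id = np.array([892, 893, 894])  # invented ids
pred = np.array([0.2, 0.8, 0.5])          # hypothetical predicted probabilities

# Threshold first, then cast; a bare astype would truncate 0.8 to 0.
labels = (pred >= 0.5).astype(np.int32)

res = pd.DataFrame({"PassengerId": passenger_id, "Survived": labels})

buffer = io.StringIO()
res.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
csv_text = buffer.getvalue()
print(csv_text)
```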
    

    Others

    Model training fails with "Input contains NaN, infinity or a value too large for dtype('float64')":
    This happens because the features contain NaN.

    Related functions:

    np.isnan
    train.info()
    train['Age'].isnull()
    train['Age'].notnull()
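
    A short sketch of how these checks can be combined to locate the offending cells. The frame is invented; note that isnull does not flag infinity, which is why np.isfinite is used for the row check:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "Age":  [22.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, np.inf],
})

# NaN count per column; infinity is not counted as null.
nan_per_column = X.isnull().sum()

# np.isfinite is False for both NaN and +/- infinity.
finite_mask = np.isfinite(X.to_numpy())
bad_rows = X.index[~finite_mask.all(axis=1)].tolist()

print(nan_per_column.to_dict(), bad_rows)
```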
    
  • Original article: https://www.cnblogs.com/gr-nick/p/11125777.html