  • Data Analysis for a Python-Based Credit Card Scoring System

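    import pandas as pd
    import matplotlib.pyplot as plt  # plotting library
    from sklearn.ensemble import RandomForestRegressor
    # Fill missing MonthlyIncome values with random forest predictions
    def set_missing(df):
        # Select the numeric features by position; column 5 (assumed to be MonthlyIncome) goes first
        process_df = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
        # Split into rows where MonthlyIncome is known and rows where it is missing
        # (.ix and .as_matrix() were removed from pandas; .iloc and .to_numpy() replace them)
        known = process_df[process_df.MonthlyIncome.notnull()].to_numpy()
        unknown = process_df[process_df.MonthlyIncome.isnull()].to_numpy()
        # X holds the feature values
        X = known[:, 1:]
        # y holds the target values (MonthlyIncome)
        y = known[:, 0]
        # Fit a RandomForestRegressor
        rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
        rfr.fit(X, y)
        # Predict the missing values with the fitted model
        predicted = rfr.predict(unknown[:, 1:]).round(0)
        print(predicted)
        # Write the predictions back into the missing entries
        df.loc[(df.MonthlyIncome.isnull()), 'MonthlyIncome'] = predicted
        return df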
    # Path kept as printed in the original post; point it at your local copy of cs-training.csv
    data = pd.read_csv(r'E:PythonSourceCreditScorecs-training.csv')
    # Same steps as set_missing, applied inline; assumes MonthlyIncome is column 5
    process_df = data.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    known = process_df[process_df.MonthlyIncome.notnull()].to_numpy()
    unknown = process_df[process_df.MonthlyIncome.isnull()].to_numpy()
    X = known[:, 1:]
    y = known[:, 0]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the missing values with the fitted model
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    data.loc[(data.MonthlyIncome.isnull()), 'MonthlyIncome'] = predicted
     
    [8311. 1159. 8311. ... 1159. 2554. 2554.]
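
    The imputation above picks columns by position, so it silently depends on the column order of the CSV. A minimal sketch of the same idea with columns selected by name is shown below. The helper name impute_with_forest and the synthetic demo frame are assumptions made for illustration, not part of the original post; the hyperparameters follow the post, with n_jobs left at its default.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_forest(df, target, features):
    """Return a copy of df with missing `target` values predicted from `features`.

    Hypothetical helper: the feature columns must not contain missing values.
    """
    df = df.copy()
    missing = df[target].isnull()
    if not missing.any():
        return df
    model = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3)
    # Train on the rows where the target is known
    model.fit(df.loc[~missing, features], df.loc[~missing, target])
    # Predict the target for the rows where it is missing and write it back
    df.loc[missing, target] = model.predict(df.loc[missing, features]).round(0)
    return df


# Synthetic stand-in data: column names mirror the dataset, the values do not.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(21, 80, size=200),
    "DebtRatio": rng.random(200),
})
demo["MonthlyIncome"] = 1000.0 + 50.0 * demo["age"]
demo.loc[demo.index[::10], "MonthlyIncome"] = np.nan  # blank out every 10th value

filled = impute_with_forest(demo, "MonthlyIncome", ["age", "DebtRatio"])
print(filled["MonthlyIncome"].isnull().sum())
```

    Because the helper works on a copy, the caller decides whether to overwrite the original frame, for example with data = impute_with_forest(data, ...).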
    data = data.dropna()  # drop the few rows that still have missing values
    data = data.drop_duplicates()  # drop duplicate rows
    # Outlier handling
    #x1 = data["age"]
    x2 = data["RevolvingUtilizationOfUnsecuredLines"]
    x3 = data["DebtRatio"]
    fig = plt.figure(1)
    ax = fig.add_subplot(111)
    ax.boxplot([x2, x3])
    ax.set_xticklabels(["RevolvingUtilizationOfUnsecuredLines", "DebtRatio"])
     
    Out[48]:
    [Text(0,0,'RevolvingUtilizationOfUnsecuredLines'), Text(0,0,'DebtRatio')]
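
    A box plot shows outliers visually; by default matplotlib draws the whiskers at 1.5 times the interquartile range (IQR). The sketch below applies the same rule numerically so the flagged points can be counted. The function name iqr_outlier_mask and the sample values are hypothetical, and the post itself does not filter on this rule.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up values standing in for a column such as DebtRatio
sample = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 50.0], name="DebtRatio")
mask = iqr_outlier_mask(sample)
print(sample[mask])  # only the value 50.0 is flagged
```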
     
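    # Outlier handling
    data = data[data["age"] > 0]
    data = data[data['NumberOfTime30-59DaysPastDueNotWorse'] < 90]  # drop outliers
    # Overall analysis of good and bad customers
    data['SeriousDlqin2yrs'] = 1 - data['SeriousDlqin2yrs']  # after flipping: 1 = good, 0 = bad
    grouped = data["SeriousDlqin2yrs"].groupby(data["SeriousDlqin2yrs"]).count()
    # grouped[0] / grouped[1] is the number of bad customers relative to good customers
    print("Bad-to-good customer ratio: {:.2%}".format(grouped[0]/grouped[1]))
    print(grouped)
    grouped.plot(kind="bar")
    Bad-to-good customer ratio: 7.16%
    SeriousDlqin2yrs
    0      9706
    1    135648
    Name: SeriousDlqin2yrs, dtype: int64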
    
    Out[54]:
    <matplotlib.axes._subplots.AxesSubplot at 0x126eecc0>
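
    The 7.16% printed above is 9706 / 135648, that is, bad customers relative to good customers. As a share of all customers the figure is slightly lower. The arithmetic below uses only the two counts from the output; the variable names are chosen here for illustration.

```python
# Counts taken from the grouped output above (0 = bad, 1 = good)
bad, good = 9706, 135648

ratio_bad_to_good = bad / good        # what grouped[0] / grouped[1] computes
share_of_total = bad / (bad + good)   # bad customers as a fraction of all customers

print("bad / good:  {:.2%}".format(ratio_bad_to_good))  # 7.16%
print("bad / total: {:.2%}".format(share_of_total))     # 6.68%
```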
     
    Y = data['SeriousDlqin2yrs']
     

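    This article mines and analyzes the Give Me Some Credit data from Kaggle and, following the principles behind credit scorecards, builds a simple credit scoring system that runs from data preprocessing and variable selection through modeling to score creation. The project still has many shortcomings. For example, the binning should use optimal binning or chi-square binning to reduce the arbitrariness of manual binning, and since the model is a logistic regression, other models are worth trying as well.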
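
    Chi-square binning, mentioned above as a less arbitrary alternative to hand-picked bins, can be sketched as follows: start from fine-grained bins and repeatedly merge the adjacent pair whose good/bad distributions are most alike, measured by the chi-square statistic. This is a simplified illustration in the spirit of ChiMerge, not the post's implementation. The function names and the counts are hypothetical, and a fuller version would usually also stop on a chi-square threshold.

```python
def chi2_pair(bin_a, bin_b):
    """Chi-square statistic for two adjacent bins, each given as (good, bad) counts."""
    col_totals = [bin_a[0] + bin_b[0], bin_a[1] + bin_b[1]]
    n = sum(col_totals)
    chi2 = 0.0
    for row in (bin_a, bin_b):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            if expected > 0:  # skip empty cells to avoid division by zero
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def chi_merge(bins, max_bins):
    """Merge the most similar adjacent bins until at most max_bins remain."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))  # adjacent pair with the lowest chi-square
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins


# Made-up (good, bad) counts for four ordered bins of one variable
initial_bins = [(90, 10), (88, 12), (60, 40), (58, 42)]
merged_bins = chi_merge(initial_bins, max_bins=2)
print(merged_bins)  # [(178, 22), (118, 82)]
```

    In this example the first two bins have nearly the same bad rate, as do the last two, so each pair collapses into one bin while the boundary between them survives.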

     
     
     
     
     
     
  • Original post: https://www.cnblogs.com/gylhaut/p/9887471.html