Problem definition
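This is a loan approval problem. Suppose you are a loan officer at a bank. Customers apply for loans of a certain amount and fill in their personal information (provided in datas.txt). Based on this information, you need to build a classification model that decides whether each of them can be granted a loan.
Using the information provided, build a classification model and evaluate it. Also briefly describe how the model was built, and briefly explain each feature.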
Dataset
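User ID, age, gender, application amount, occupation type, education level, marital status, housing type, household registration (hukou) type, loan purpose, company type, salary, loan label (0 = loan denied, 1 = loan approved)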
Data preprocessing
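The user ID is of no use for modeling. Among the fields that describe a user, age, application amount, and salary are continuous values; the rest are discrete.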
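As a minimal sketch of this step, the snippet below drops the ID column and one-hot encodes two of the discrete fields with `pandas.get_dummies`. The toy rows and the `UserId` column name are made up for illustration. The other column names follow those used in the modeling script, and whether one-hot encoding actually helps on the real data is an assumption that would need checking.

```python
import pandas as pd

# Toy rows for illustration only; the real data comes from the dataset file.
toy = pd.DataFrame({
    "UserId": [1, 2, 3, 4],                    # hypothetical name for the ID column
    "Age": [25, 40, 33, 51],                   # continuous
    "AppAmount": [5000, 20000, 8000, 15000],   # continuous
    "Salary": [3000, 9000, 4500, 7000],        # continuous
    "Gender": [0, 1, 1, 0],                    # discrete code
    "Marital": [1, 2, 1, 3],                   # discrete code
})

# The ID carries no predictive information, so drop it.
features = toy.drop(columns=["UserId"])

# Expand each discrete column into 0/1 indicator columns so that the model
# does not treat the category codes as ordered quantities.
discrete_cols = ["Gender", "Marital"]
encoded = pd.get_dummies(features, columns=discrete_cols)

print(encoded.columns.tolist())
```

The continuous columns pass through unchanged, while each discrete column is replaced by one indicator column per observed category.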
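Inspection shows that some records have missing values. These need to be handled, for example by deleting the affected rows or by imputing the missing values.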
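The two options can be sketched with pandas as follows. The toy rows are invented for illustration, and the choice of median for continuous columns and mode for discrete columns is one common convention, not something prescribed by the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows with gaps, for illustration only.
toy = pd.DataFrame({
    "Age": [25, np.nan, 33, 51],
    "Salary": [3000, 9000, np.nan, 7000],
    "Education": [2, 3, np.nan, 2],   # discrete code
})

# Option 1: delete every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: impute. Use the median for continuous columns and the most
# frequent value (mode) for discrete columns.
filled = toy.copy()
for col in ["Age", "Salary"]:
    filled[col] = filled[col].fillna(filled[col].median())
filled["Education"] = filled["Education"].fillna(filled["Education"].mode()[0])

print(len(dropped), "rows kept after dropna")
print(filled)
```

Deleting rows is simpler but discards data. Imputation keeps every row at the cost of introducing estimated values.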
The Logit Function
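The logit function maps a probability p in (0, 1) to its log-odds, log(p / (1 - p)). Its inverse is the sigmoid function. A minimal NumPy sketch, with function names chosen here for illustration:

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real z back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)
print(z)            # log-odds; p = 0.5 maps to 0
print(sigmoid(z))   # recovers the original probabilities
```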
The Logistic Regression
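Logistic regression models the probability of the positive class as the sigmoid of a linear score, P(y = 1 | x) = 1 / (1 + exp(-(w·x + b))). The sketch below fits scikit-learn's `LogisticRegression` on synthetic data (generated here purely for illustration) and checks the formula by recomputing the probabilities by hand from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Linear score w.x + b, followed by the sigmoid.
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_probs = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn.
sklearn_probs = model.predict_proba(X)[:, 1]

print(np.allclose(manual_probs, sklearn_probs))
```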
Model Data
# Logistic regression model
# Classify whether to grant a loan to each bank customer

import numpy
import pandas
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Feature columns used by the model; the target column is 'Label'
features = ['Age', 'Gender', 'AppAmount', 'Occupation',
            'Education', 'Marital', 'Property', 'Residence',
            'LoanUse', 'Company', 'Salary']

data = pandas.read_csv("datas.csv")
# Drop rows that contain missing values
data = data.dropna()

# Randomly shuffle the rows before splitting into training and test sets
shuffled = data.loc[numpy.random.permutation(data.index)]

# Use the first num_train shuffled rows for training and the rest for testing
num_train = 14968
data_train = shuffled[:num_train]
data_test = shuffled[num_train:]

# Fit logistic regression to the label using the training set
logistic_model = LogisticRegression()
logistic_model.fit(data_train[features], data_train['Label'])

# Print the model's coefficients
print(logistic_model.coef_)

# .predict() uses a probability threshold of 0.50 by default
predicted_train = logistic_model.predict(data_train[features])

# The mean of the boolean match array gives the accuracy
accuracy_train = (predicted_train == data_train['Label']).mean()
print("Accuracy in Training Set = {s}".format(s=accuracy_train))

# Predict labels for the test set and compute the test accuracy
predicted_test = logistic_model.predict(data_test[features])
accuracy_test = (predicted_test == data_test['Label']).mean()
print("Accuracy in Test Set = {s}".format(s=accuracy_test))

# Predicted probability of label 1 for the training and test sets
train_probs = logistic_model.predict_proba(data_train[features])[:, 1]
test_probs = logistic_model.predict_proba(data_test[features])[:, 1]

# Compute AUC for the training set and the test set
auc_train = roc_auc_score(data_train["Label"], train_probs)
auc_test = roc_auc_score(data_test["Label"], test_probs)

# Difference in AUC values (a large gap suggests overfitting)
auc_diff = auc_train - auc_test
print("AUC train = {a}, AUC test = {b}, difference = {d}".format(
    a=auc_train, b=auc_test, d=auc_diff))

# Compute ROC curves: each result is (false positive rate, true positive rate, thresholds)
roc_train = roc_curve(data_train["Label"], train_probs)
roc_test = roc_curve(data_test["Label"], test_probs)

# Plot true positive rate against false positive rate
plt.plot(roc_train[0], roc_train[1], label="train")
plt.plot(roc_test[0], roc_test[1], label="test")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
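The shuffle-and-slice split in the script gives a different partition on every run. As an alternative sketch, scikit-learn's `train_test_split` can produce a reproducible, label-stratified split, and `confusion_matrix` breaks the accuracy figure down into correct and incorrect decisions per class. The data here is synthetic and stands in for the real features and label. The 70/30 ratio and the random seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the 0/1 loan label.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

# Reproducible 70/30 split that preserves the label proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)

# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```

For loan approval, the off-diagonal cells matter separately: approving a loan that should have been denied and denying one that should have been approved usually carry different costs.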