Main:
- Template.py: the main workflow, which implements, in order:
- Train Test Split
- Missing Imputation
- Feature Selection
- Cap and Floor
- Data Scaling
- Model Selection
- Feature Reduction
- AUC & KS graphing, Model Ranking and PSI
- Ranking of the predicted probabilities against the observed overdue (bad) rates
- Validation of model performance after feature reduction
- Feature importance plot of the final model (if available)
- Saving the model as PKL or PMML
Function Files:
- Data_Processing.py: the data preprocessing module, containing functions for:
- Train Test Split
- Feature Selection
- Cap and Floor
- Multiple_Model_Selection.py: the model selection module, which implements:
- Model selection via grid search with cross validation (CV) and randomized search with CV, returning the best model after training several types of models.
- Feature Reduction: after the best model is obtained, reduce the number of input variables as far as possible while keeping performance close to that of the best model, in order to improve the model's robustness.
- Model_Evaluation.py: the model evaluation module, which implements:
- AUC and KS plotting
- PSI calculation
- Model Ranking
- Ranking of the predicted probabilities against the observed overdue (bad) rates
- Feature importance plot of the final model (if available)
The code as a whole follows a functional programming style and is intended as a framework, so variable handling is deliberately not very detailed; dataset-specific handling has to be added according to the actual data structure.
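For example, if the raw data contained a categorical column, a dataset-specific line such as the following might be added before the split. The column name 'channel' is purely hypothetical; the Template itself only keeps numeric columns and just notes pd.get_dummies for this case.

# Hypothetical dataset-specific step: one-hot encode a categorical column before the split
dataset = pd.get_dummies(dataset, columns = ['channel'], dummy_na = True)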
Part 1: The Data_Processing module
First, import the required modules:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit    # Mainly used when the label distribution is imbalanced
from sklearn.feature_selection import VarianceThreshold, SelectFromModel    # The first is the variance-threshold method (features below the threshold are dropped); the second is a form of embedded feature selection
#from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import ExtraTreesClassifier    # Extremely randomized trees, a variant of random forest
from matplotlib import style, pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']    # Display Chinese Characters
plt.rcParams['axes.unicode_minus'] = False      # Display Minus Sign
style.use('ggplot')
1. Train/test split with StratifiedShuffleSplit (mainly for imbalanced label distributions); it returns the training and test sets with X and y recombined.
train, test = trainTestSplitV2(dataset, 'flag')
def trainTestSplitV2(data, response, testsize = 0.3, trainsize = 0.7, rdm_state = None):
    '''
    Train test split with tolerance of the mean difference between dataset and test set.
    Parameters:
    -----------
    data :     pandas DataFrame with all possible predictors and response.
    response:  string, name of response column in data (i.e. the y label).
    testsize:  numeric, between 0.0 and 1.0, the size of the testing set.
    trainsize: numeric, between 0.0 and 1.0, the size of the training set.
    rdm_state: None or int, random state of feature selection model.
    -----------
    '''
    X = np.array(data.drop(response, axis = 1))    # Convert to arrays because many models cannot take a DataFrame directly
    y = np.array(data[response])

    sssplit = StratifiedShuffleSplit(n_splits = 1, test_size = testsize,
                                     train_size = trainsize, random_state = rdm_state)    # 70/30 split with a single fold, so no for loop is needed

    # Generate indices to split data into training and test set.
    split_index = sssplit.split(X, y)
    train_index, test_index = next(split_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Get all the columns name back
    train_data = pd.DataFrame(data = np.c_[X_train, y_train],    # np.c_ concatenates two arrays column-wise, keeping the number of rows
                              columns = np.array((data.drop(response, axis = 1).columns.tolist() + [response])))    # .tolist() converts an array or matrix into a list
    test_data = pd.DataFrame(data = np.c_[X_test, y_test],
                             columns = np.array((data.drop(response, axis = 1).columns.tolist() + [response])))

    print('Mean of y_all is: {:.4f}   Mean of y_train is: {:.4f}   Mean of y_test is: {:.4f}'.format(
        data[response].mean(), train_data[response].mean(), test_data[response].mean()))

    return(train_data, test_data)
2. Compute each feature's variance and use VarianceThreshold to keep the non-zero-variance features; the function returns the names of the zero-variance features (which are dropped afterwards).
elmi_col = varThreshold(dataset, train)
#-------------------------Variance Threshold Function-------------------------
def varThreshold(data, trainset, thd = 0):
    '''
    Feature selector that removes all low-variance features.
    Parameters:
    -----------
    data :    pandas DataFrame with all possible predictors and response.
    trainset: nparray, training features set.
    thd:      numeric, threshold of the variance.
    -----------
    '''
    sel_def = VarianceThreshold(threshold = thd)
    new_train = sel_def.fit_transform(trainset)
    print('The Number of Features Selected When Removes All Zero-variance Features:', new_train.shape[1])    # shape[0] is the number of rows, shape[1] the number of columns

    # Get names of low variance features
    bool_arr = (sel_def.variances_ == thd).tolist()    # List of booleans: True where the variance equals the threshold, False otherwise
    seq = [i for i, value in enumerate(bool_arr) if value]    # Indices of the True entries
    eliminate_feature = [data.columns.tolist()[ele] for ele in seq]

    return(eliminate_feature)
['身份证_姓名命中法院结案模糊名单', '身份证命中信贷逾期名单', '手机号命中信贷逾期名单', '第一联系人手机号命中信贷逾期名单', '身份证命中法院失信名单', '身份证_姓名命中法院失信模糊名单', '身份证命中公司欠税名单', 'X1month_第三方服务商', 'X1month_理财机构', 'X1month_银行小微贷款', 'X1month_汽车租赁', 'X1month_房地产金融', 'X1month_融资租赁', 'X3month_银行小微贷款', 'X3month_房地产金融', 'X7days_互联网金融门户', 'X7days_第三方服务商', 'X7days_理财机构', 'X7days_财产保险', 'X7days_银行小微贷款', 'X7days_汽车租赁', 'X7days_房地产金融', 'X7days_融资租赁']
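As in the Template (Part 4), the returned zero-variance columns are then dropped from both splits:

train.drop(elmi_col, axis = 1, inplace = True)
test.drop(elmi_col, axis = 1, inplace = True)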
3. Outlier treatment (cap and floor: values above or below a quantile threshold are replaced by that threshold) and data scaling (min-max scaling); it returns each feature's quantile limits and the scaled data.
Note: step 4 is actually run before step 3. That is, the 50 features are selected first, the -1 placeholder values in those features are converted back to NaN, and only then is the operation below applied.
quantile_limit, test = replaceOutlierNScale(train, test, 'flag', low = 0.02, high = 0.98, scale = True)
quantile_limit, train = replaceOutlierNScale(train, train, 'flag', low = 0.02, high = 0.98, scale = True)
It returns the features after capping/flooring and scaling.
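Per the note above, the surrounding steps in the Template look roughly like this; -1 is the placeholder used earlier for missing values, and the two replaceOutlierNScale calls shown above sit in the middle:

# Before the two calls above: turn the -1 placeholders back into NaN so that they are
# ignored when the quantiles are computed and the data is scaled
train.replace(-1, np.nan, inplace = True)
test.replace(-1, np.nan, inplace = True)

# ... the two replaceOutlierNScale calls above go here: test set first, because it must be
#     transformed with the quantiles of the still-unscaled training set ...

# After the two calls above: fill the NaNs back with -1 before modelling
train.replace([np.nan, np.inf, -np.inf], -1, inplace = True)
test.replace([np.nan, np.inf, -np.inf], -1, inplace = True)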
#-------------------------Apply Quantile(Low and High) of Data to Replace Outliers-------------------------
def replaceOutlierNScale(data_a, data_b, response, low = 0.02, high = 0.98,
                         interpolation = 'midpoint', scale = True):
    '''
    Replace low and high quantile outliers of data_b with specified quantile value of data_a.
    Only deal with numeric columns, ignore the Nan value and string type columns.
    Parameter:
    -----------
    data_a & data_b: pandas DataFrame, have same columns and only have numeric columns.
    response:      string, name of response column in data.
    low:           low limit.
    high:          high limit.
    interpolation: This parameter specifies the interpolation method to use.
    -----------
    '''
    # Replace Outliers: Cap and floor
    new_df = data_b.drop(response, axis = 1)
    quantile_limit = data_a.drop(response, axis = 1).quantile([low, high], interpolation = interpolation)    # DataFrame indexed by [low, high]
    '''
    e.g.:
            0    1    2    3
    0.02  0.0  1.0  0.5  3.0
    0.98  0.0  1.5  2.5  3.0
    '''
    outliers_low = (new_df < quantile_limit.loc[low, :])    # Boolean DataFrame
    '''
           0      1      2      3
    0  False  False   True  False
    1  False  False  False  False
    2  False  False  False  False
    '''
    new_df.mask(outliers_low, quantile_limit.loc[low, :], inplace = True, axis = 1)    # Replace values below the floor with the floor; mask replaces where the condition is True
    outliers_high = (new_df > quantile_limit.loc[high, :])
    new_df.mask(outliers_high, quantile_limit.loc[high, :], inplace = True, axis = 1)
    new_df[response] = data_b[response]

    # Min-max scale
    if scale:
#        # Replace Outliers: Cap and floor
#        X_a = data_a.drop(response, axis = 1)
#        outliers_low = (X_a < quantile_limit.loc[low, :])
#        X_a.mask(outliers_low, quantile_limit.loc[low, :], inplace = True, axis = 1)
#        outliers_high = (X_a > quantile_limit.loc[high, :])
#        X_a.mask(outliers_high, quantile_limit.loc[high, :], inplace = True, axis = 1)
#        X_a = np.array(X_a)
#        X_b = np.array(new_df.drop(response, axis = 1))
#        # Fit on X_a and transform to X_b
#        min_max_scaler = MinMaxScaler().fit(X_a)
#        X_b = min_max_scaler.transform(X_b)
#        # Get all the columns name back
#        new_df = pd.DataFrame(data = np.c_[X_b, np.array(data_b[response])],
#                              columns = np.array((data_b.drop(response, axis = 1).columns.tolist() + [response])))
        new_df.drop(response, axis = 1, inplace = True)
        new_df = (new_df - quantile_limit.min()) / (quantile_limit.max() - quantile_limit.min())    # Scale with the quantile min and max
        new_df[response] = data_b[response]

    return(quantile_limit, new_df)
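Note that when scale = True the scaling divides by the quantile range rather than the raw column min and max: scaled = (x - q_low) / (q_high - q_low). For a (hypothetical) column whose 2nd and 98th percentiles are 1.0 and 5.0, a raw value of 7.0 is first capped to 5.0 and then scaled to 1.0, while 3.0 maps to (3.0 - 1.0) / (5.0 - 1.0) = 0.5; because capping happens first, the scaled values always land in [0, 1] (provided q_low < q_high).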
4. Feature selection (featureSelectFromModel): returns the names of the features kept according to the model's feature_importances_.
imp_col = featureSelectFromModel(train, 'flag', figname = 'Top 10 Important Features', n_tree = 500,
                                 n_core = -1, rdm_state = None, thd = 'median', word = 1, show = 10, figsize = (12, 10))
(Again: step 4 is run before step 3.)
#-------------------------Feature Selected From Model-------------------------
def featureSelectFromModel(trainset, response, figname, n_tree = 500, n_core = -1,
                           rdm_state = None, word = 0, thd = 'median', show = 10, figsize = (10, 8)):
    '''
    Feature selector that removes all unimportant features.
    Parameters:
    -----------
    trainset : pandas DataFrame with all possible predictors and response in train set.
    response:  string, name of response column in data.
    figname:   string, name of feature importance graph.
    rdm_state: None or int, random state of feature selection model.
    thd:       string, set as 'median', 'mean' or '0.25*median', '0.25*mean'; features whose
               importance is greater or equal to thd are kept while the others are discarded.
    show:      int, the number of features to be shown in the graph.
    figsize:   tuple, the size of figure.
    -----------
    '''
    column_names = trainset.drop(response, axis = 1).columns.tolist()
    X = np.array(trainset.drop(response, axis = 1))
    y = np.array(trainset[response])

    # Create the SelectFromModel object and retrieve the optimal number of features which
    # the threshold value is set as thd of the feature importances.
    clf = ExtraTreesClassifier(n_estimators = n_tree, n_jobs = n_core, random_state = rdm_state, verbose = word)
    clf = clf.fit(X, y)
    model = SelectFromModel(clf, threshold = thd, prefit = True)
    X_new = model.transform(X)
    print('The Number of Features Selected:', X_new.shape[1])

    # Get the feature importances
    importances = clf.feature_importances_    # Array with one importance value per feature

    #***************Obtain names of important features***************
    def ImptFeature(thd = thd, impt = importances):
        method = thd.split('*')
        if 'median' in method:
            name_val = pd.Series(impt, index = column_names)    # The index holds the column names, so what is returned below are column names
            if len(method) == 1:    # In the default case this length is 1
                imp_val = name_val[name_val >= np.median(impt)]    # Keep features whose importance is at least the median
                return(imp_val.index.tolist())
            elif len(method) == 2:
                coef = np.float(method[0])
                imp_val = name_val[name_val >= (coef * np.median(impt))]
                return(imp_val.index.tolist())
            else:
                print('"thd" maybe not follow the pattern, check it !!!')
        elif 'mean' in method:
            name_val = pd.Series(impt, index = column_names)
            if len(method) == 1:
                imp_val = name_val[name_val >= np.mean(impt)]
                return(imp_val.index.tolist())
            elif len(method) == 2:
                coef = np.float(method[0])
                imp_val = name_val[name_val >= (coef * np.mean(impt))]
                return(imp_val.index.tolist())
            else:
                print('"thd" maybe not follow the pattern, check it !!!')
        else:
            print('"thd" maybe not follow the pattern, check it !!!')
    #*********************************************************

    # Standard deviation of feature importances
    std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
    # Return the indices of the top several important features
    indices = np.argsort(importances)[::-1][:show]    # With show = 10, the top 10 features
    # Get the top important features name
    features = [column_names[ele] for ele in indices]

    #***************Graph of feature importance***************
    # Show importance of each feature
    fig = plt.figure(figsize = figsize)
    axes = plt.subplot2grid((1,1), (0,0))
    axes.bar(range(show), importances[indices], color = '#4682B4', yerr = std[indices], align = 'center')
    plt.xticks(range(show), features)
    plt.xlim((-1, show))
    # Rotate the angle of the labels
    for label in axes.xaxis.get_ticklabels():
        label.set_rotation(90)
    plt.title(('Top ' + str(show) + ' Important Features'))
    plt.xlabel('Name of Variable')
    plt.ylabel('Importance')
    plt.savefig((figname + '.jpg'))
    #*********************************************************

    # Output the top important features
    print('Top ' + str(show) + ' Features Ranking:')
    for f in range(show):
        print('{}. Importance of feature {} named "{}" is: {:.4f}'.format(
            f + 1, indices[f], column_names[indices[f]], importances[indices[f]]))

    return(ImptFeature())
It returns the 50 selected features:
['loan_amount', 'loan_term', 'final_score', 'X3个月内申请人在多个平台申请借款', 'X1个月内申请人在多个平台申请借款', 'X1month_P2P网贷', 'X1month_财产保险', 'X3month_P2P网贷', 'X3month大型消费金融公司', 'X3month_互联网金融门户', 'X3month_信用卡中心', 'label', 'cell_number_xiangguan', 'risk_count', 'annual_income_500000', 'X_y', '本品牌合作时间', '经营年限_注册时间_', '上级经销商法人手机', '总和_经营面积', '总和_经营年限', '总和_年销售额', '总和_年销售总量', 'order_sum', 'first_dd_sum', 'first_dd_avg', 'first_dd_sd', 'max_dd_sum', 'max_dd_avg', 'max_dd_sd', 'pass_term_avg', 'pass_term_sd', 'fpd_sum', 'loan_amt_sum', 'loan_amt_avg', 'loan_amt_sd', 'fpd_times', 'times', 'loan_times', 'xs_amount', 'xy_amount', 'history', 'layer', 'type', 'size', 'wang2_count_x', 'wang2_def30_count', 'wang2_c_order_sum', 'wang2_count_y', 'amount_sum']
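These names are then used to subset both splits, exactly as in the Template:

train = train[(imp_col + ['flag'])]
test = test[(imp_col + ['flag'])]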
The complete Data_Processing.py consists of the import block and the four functions shown above.
Part 2: Model selection, with two tuning approaches (grid search CV and randomized search CV), returning the best model after training several types of models
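The call below assumes a classifiers list whose entries are [name, estimator, parameter space] triples; the full definitions live in the Template in Part 4. A minimal, trimmed-down sketch with a single entry (the small parameter grid here is illustrative only):

from sklearn.ensemble import RandomForestClassifier

# Each entry of classifiers is [name, estimator, parameter space]
classifiers = []
rf_param = {'n_estimators': [600, 800, 1000], 'max_depth': [3, 10, None]}    # trimmed-down search space
classifiers.append(['RandomForest', RandomForestClassifier(), rf_param])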
model_set = trainModelSequence(train, 'flag', classifiers, method = randomSearchCVTraining, iternum = 10, evalmetric = 'roc_auc', n_core = 4, fold = 5, word = 1)
Best score of RandomizedSearchCV is: 0.7967
Best parameters of RandomizedSearchCV is:
{'kernel': 'linear', 'gamma': 'auto', 'C': 100.0}
Best score of RandomizedSearchCV is: 0.8623
Best parameters of RandomizedSearchCV is:
{'max_depth': None, 'max_features': 0.3, 'min_samples_leaf': 6, 'min_samples_split': 8, 'n_estimators': 600}
Best score of RandomizedSearchCV is: 0.8514
Best parameters of RandomizedSearchCV is:
{'colsample_bylevel': 0.7, 'colsample_bytree': 0.7, 'gamma': 0.15000000000000002, 'learning_rate': 0.05, 'max_delta_step': 6, 'max_depth': 1, 'min_child_weight': 2, 'n_estimators': 300, 'random_state': 457, 'reg_alpha': 2, 'reg_lambda': 10, 'subsample': 0.7}
# Get best CV estimator
best_model = bestModel(model_set)    # Returns the entry with the highest score, in this run the random forest
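best_model is one of the dict entries built by trainModelSequence, keyed by the model name plus a 'Best Score' field, so the fitted search object is retrieved via the name key:

# For the run shown above the winner is the random forest, so roughly:
#   best_model == {'RandomForest': the fitted RandomizedSearchCV object, 'Best Score': '0.8623'}
best_search = best_model['RandomForest']         # fitted RandomizedSearchCV
best_estimator = best_search.best_estimator_     # the refitted RandomForestClassifier
best_params = best_search.best_params_           # its tuned hyperparameters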
===== Feature Reduction: 50 features went into the best model, so we now prune them =====
One confusing thing happened, though: the run above clearly shows the random forest with the highest score, yet the feature-reduction step below uses the XGBoost model, so I removed the random forest from the candidates and only then ran the pruning step below.
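A possible explanation (my guess, not stated in the original code): featureReduce is called below with best_model['XGB'], which hard-codes the key 'XGB'; if the random forest wins, the winning dict is keyed 'RandomForest' instead and that lookup raises a KeyError. A more general lookup would recover the name first (sketch only; best_model itself is unchanged):

best_name = [k for k in best_model if k != 'Best Score'][0]   # e.g. 'RandomForest' or 'XGB'
best_search = best_model[best_name]                           # the fitted search object to pass to featureReduce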
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 23 16:27:24 2017

@author: Hin
"""

import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

#-------------------------GridSearchCV-------------------------
def gridSearchCVTraining(data, response, params, mod = None, evalmetric = 'roc_auc',
                         n_core = -1, fold = 5, word = 1, dispatch = '2*n_jobs'):
    '''
    GridSearchCV: Exhaustive search over specified parameter values for an estimator.
    Parameters:
    -----------
    data:     pandas DataFrame with all possible predictors and response.
    response: string, name of response column in data.
    other parameters see: scikit-learn documentation(http://scikit-learn.org/stable/modules/
    generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.fit)
    -----------
    '''
    X = np.array(data.drop(response, axis = 1))
    y = np.array(data[response])

    gs = GridSearchCV(estimator = mod, param_grid = params, scoring = evalmetric,
                      n_jobs= n_core, cv = fold, refit = True, verbose = word,
                      pre_dispatch = dispatch, error_score = 'raise')
    gs.fit(X, y)

    print(' Best score of GridSearchCV is: {:.4f}'.format(round(gs.best_score_, 4)))
    print(' Best parameters of GridSearchCV is: {}'.format(gs.best_params_))

    return(gs)

#-------------------------RandomizedSearchCV-------------------------
def randomSearchCVTraining(data, response, params, mod = None, evalmetric = 'roc_auc',
                           n_core = -1, iter_num = 1000, fold = 5, word = 1, dispatch = '2*n_jobs'):
    '''
    RandomizedSearchCV: Randomized search on hyper parameters.
    Parameters:
    -----------
    data:     pandas DataFrame with all possible predictors and response.
    response: string, name of response column in data.
    other parameters see: scikit-learn documentation(http://scikit-learn.org/stable/
    modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)
    -----------
    '''
    X = np.array(data.drop(response, axis = 1))
    y = np.array(data[response])

    rs = RandomizedSearchCV(estimator = mod, param_distributions = params, n_iter = iter_num,
                            scoring = evalmetric, n_jobs= n_core, cv = fold, refit = True,
                            verbose = word, pre_dispatch = dispatch, error_score = 'raise')
    rs.fit(X, y)

    print(' Best score of RandomizedSearchCV is: {:.4f}'.format(round(rs.best_score_, 4)))
    print(' Best parameters of RandomizedSearchCV is: {}'.format(rs.best_params_))

    return(rs)

#-------------------------Training Models in Sequence-------------------------
# Note that a function is passed in as a parameter here (the method argument).
def trainModelSequence(data, response, classifiers, method = randomSearchCVTraining,
                       iternum = None, evalmetric = 'roc_auc', n_core = -1, fold = 5,
                       word = 1, dispatch = '2*n_jobs'):
    '''
    Training several models in sequence.
    Parameters:
    -----------
    data:        pandas DataFrame with all possible predictors and response.
    response:    string, name of response column in data.
    classifiers: a set of algorithms.
    method:      the method of parameter tuning, randomSearchCVTraining or gridSearchCVTraining.
    other parameters see: scikit-learn documentation(RandomizedSearchCV / GridSearchCV)
    -----------
    '''
    model_set = []

    # RandomizedSearchCV Interface
    if(iternum):
        for clf in range(len(classifiers)):
            try:
                print(' ***************** {} Begin *****************'.format(classifiers[clf][0]))    # Print the name of each model, e.g. SVM
                # Train model
                model = method(data, response, params = classifiers[clf][2],    # The parameter set of each model
                               mod = classifiers[clf][1], iter_num = iternum,   # mod is the base estimator, e.g. SVC()
                               evalmetric = evalmetric, n_core = n_core, fold = fold,
                               word = word, dispatch = dispatch)    # Returns the fitted search object and prints best_score_ and best_params_
                bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score})
            except Exception as e:
                print(' ', '#' * 15, 'Error Message', '#' * 15, ' ')
                print('Error is:', e)
                print(' ', '#' * 15, 'End','#' * 15, ' ')
                model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                next
            else:
                pass
            finally:
                print('***************** {} End ***************** '.format(classifiers[clf][0]))

    # GridSearchCV Interface: set method = gridSearchCVTraining, iternum = None
    else:
        for clf in range(len(classifiers)):
            try:
                print(' ***************** {} Begin *****************'.format(classifiers[clf][0]))
                # Train model
                model = method(data, response, params = classifiers[clf][2],
                               mod = classifiers[clf][1], evalmetric = evalmetric,
                               n_core = n_core, fold = fold, word = word, dispatch = dispatch)
                bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score})
            except Exception as e:
                print(' ', '#' * 15, 'Error Message', '#' * 15, ' ')
                print('Error is:', e)
                print(' ', '#' * 15, 'End','#' * 15, ' ')
                model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                next
            else:
                pass
            finally:
                print('***************** {} End ***************** '.format(classifiers[clf][0]))

    return(model_set)

#-------------------------Select the Best Model in Model Sequence-------------------------
def bestModel(model_set, eval_logic = 'Max'):
    '''
    Select the best model from trainModelSequence.
    Parameters:
    -----------
    model_set:  result from trainModelSequence.
    eval_logic: either 'Max' or 'Min', represent the logic of the evalmetric method.
    -----------
    '''
    if (eval_logic == 'Max'):
        bst_model, cur_model = -np.inf, -np.inf
        for model in model_set:
            if (model['Best Score'] != 'Error'):
                if (np.float64(model['Best Score']) > cur_model):
                    bst_model = model
                    cur_model = np.float64(model['Best Score'])
                else:
                    next
            else:
                next
    else:
        bst_model, cur_model = np.inf, np.inf
        for model in model_set:
            if (model['Best Score'].isnumeric()):
                if (np.float64(model['Best Score']) < cur_model):
                    bst_model = model
                    cur_model = np.float64(model['Best Score'])
                else:
                    next
            else:
                next

    return(bst_model)

#-------------------------Reduce Features from the Best Model-------------------------
def featureReduce(data, response, classifiers, model, method = randomSearchCVTraining,
                  interpolation = 'higher', iternum = None, evalmetric = 'roc_auc',
                  n_core = -1, fold = 5, word = 1, dispatch = '2*n_jobs'):
    '''
    Reduce features to increase robustness of model
    Parameters:
    -----------
    data:        pandas DataFrame with all possible predictors and response.
    response:    string, name of response column in data.
    classifiers: a set of algorithms.
    model:       result from RandomizedSearchCV / GridSearchCV.
    method:      the method of parameter tuning, randomSearchCVTraining or gridSearchCVTraining.
    other parameters see: scikit-learn documentation(RandomizedSearchCV / GridSearchCV)
    '''
    model_set = []
    column_names = data.drop(response, axis = 1).columns.tolist()

    # Get feature importances
    importances = model.best_estimator_.feature_importances_
    # Return the indices of the important features from max to min
    indices = np.argsort(importances)[::-1]
    # Get the top important features name
    features = [column_names[ele] for ele in indices]
    features = pd.DataFrame(data = features, index = range(len(features)), columns = ['feature'])
    features['seq'] = list(range(len(features)))

    quantile_limit = features.drop('feature', axis = 1).quantile(q = ([i/len(classifiers) for i in range(len(classifiers))] + [0.5]),
                                                                 interpolation = interpolation)
    quantile_limit.reset_index(drop = True, inplace = True)
    quantile_limit.sort_values('seq', inplace = True)
    quantile_limit = quantile_limit.iloc[1:, :]

    feature_sets = [(features['feature'][:int(i)].tolist() + [response]) for i in quantile_limit.seq.values]

    # RandomizedSearchCV Interface
    if(iternum):
        for clf in range(len(classifiers)):
            try:
                print(' ***************** {} Begin *****************'.format(classifiers[clf][0]))
                # Train model
                model = method(data[feature_sets[clf]], response, params = classifiers[clf][2],
                               mod = classifiers[clf][1], iter_num = iternum,
                               evalmetric = evalmetric, n_core = n_core, fold = fold,
                               word = word, dispatch = dispatch)
                bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score, 'feature' : feature_sets[clf]})
            except Exception as e:
                print(' ', '#' * 15, 'Error Message', '#' * 15, ' ')
                print('Error is:', e)
                print(' ', '#' * 15, 'End','#' * 15, ' ')
                model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                next
            else:
                pass
            finally:
                print('***************** {} End ***************** '.format(classifiers[clf][0]))

    # GridSearchCV Interface: set method = gridSearchCVTraining, iternum = None
    else:
        for clf in range(len(classifiers)):
            try:
                print(' ***************** {} Begin *****************'.format(classifiers[clf][0]))
                # Train model
                model = method(data[feature_sets[clf]], response, params = classifiers[clf][2],
                               mod = classifiers[clf][1], evalmetric = evalmetric,
                               n_core = n_core, fold = fold, word = word, dispatch = dispatch)
                bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score, 'feature' : feature_sets[clf]})
            except Exception as e:
                print(' ', '#' * 15, 'Error Message', '#' * 15, ' ')
                print('Error is:', e)
                print(' ', '#' * 15, 'End','#' * 15, ' ')
                model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                next
            else:
                pass
            finally:
                print('***************** {} End ***************** '.format(classifiers[clf][0]))

    return(model_set)
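To make the quantile logic in featureReduce concrete: with 50 ranked features and three candidate classifiers, the quantile positions [1/3, 0.5, 2/3] of the rank sequence 0..49 (interpolation = 'higher') come out as 17, 25 and 33, so the three re-tuned models use the top 17, 25 and 33 features plus the response column. A quick check of that arithmetic:

import pandas as pd

seq = pd.Series(range(50))                                                  # ranks 0..49, as in featureReduce
print(seq.quantile([1/3, 0.5, 2/3], interpolation = 'higher').tolist())     # prints [17.0, 25.0, 33.0]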
Part 3: Model evaluation
# -*- coding: utf-8 -*-
"""
Created on Sun Sep  3 12:06:42 2017

@author: Hin
"""

import math
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc    # Metrics for a classification model
from matplotlib import style, pyplot as plt
style.use('ggplot')

#-------------------------Feature Importance Graph-------------------------
def featureImpGraph(data, response, model, figname = 'Feature Importance', show = 10, figsize = (10, 8)):
    '''
    Plot the feature importance graph.
    Parameters:
    -----------
    data :    pandas DataFrame with all possible predictors and response.
    response: string, name of response column in data.
    model:    result from RandomizedSearchCV / GridSearchCV.
    figname:  string, name of the graph.
    show:     int, the number of features to be shown in the graph.
    figsize:  tuple, the size of figure.
    -----------
    '''
    column_names = data.drop(response, axis = 1).columns.tolist()
    # Get feature importances
    importances = model.best_estimator_.feature_importances_    # Feature importances of the best estimator
    # Return the indices of the top several important features
    indices = np.argsort(importances)[::-1][:show]    # argsort sorts the values in ascending order and returns their indices; [::-1] reverses them to descending order
    # Get the top important features name
    features = [column_names[ele] for ele in indices]

    # Show importance of each feature
    fig = plt.figure(figsize = figsize)
    axes = plt.subplot2grid((1,1), (0,0))
    axes.bar(range(show), importances[indices], color = '#4682B4', align = 'center')
    plt.xticks(range(show), features)
    plt.xlim((-1, show))
    # Rotate the angle of the labels
    for label in axes.xaxis.get_ticklabels():
        label.set_rotation(90)
    plt.title(('Top ' + str(show) + ' Important Features'))
    plt.xlabel('Name of Variable')
    plt.ylabel('Importance')
    plt.savefig((figname + '.jpg'))

    # Output the top important features
    print('Top ' + str(show) + ' Features:')
    for f in range(show):
        print('{}. Importance of feature {} named "{}" is: {:.4f}'.format(
            f + 1, indices[f], column_names[indices[f]], importances[indices[f]]))

#-------------------------AUC KS Graph-------------------------
def aucKSGraph(data, response, pred_value, pos_label, model_name = 'Model',
               figname = 'AUC KS Graph', figsize = (10, 8)):
    '''
    Plot AUC and KS graph.
    Parameters:
    -----------
    data :      pandas DataFrame with all possible predictors and response.
    response:   string, name of response column in data.
    pred_value: pandas Series with the predict value of model.
    pos_label:  int, for binary classification this represents positive label.
    figname:    string, name of the graph.
    model_name: string, label of the graph.
    figsize:    tuple, the size of figure.
    -----------
    '''
    fpr, tpr, thresholds = roc_curve(data[response], pred_value, pos_label = pos_label)
    ks_value = max(tpr - fpr)
    ks_value = round(ks_value, 4)
    roc_auc_value = auc(fpr, tpr)
    roc_auc_value = round(roc_auc_value, 4)

    fig = plt.figure(figsize = figsize)
    plt.plot([0, 1],[0, 1], linestyle = '--', color = 'b', label = "random guessing")
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.title('{} {}'.format(model_name, figname))
    plt.plot(fpr, tpr, label = '{} (auc = {:.4f}, ks = {:.4f})'.format(model_name, roc_auc_value, ks_value), color = 'r')
    plt.legend(loc = 'lower right')
    plt.savefig((' '.join([model_name, figname]) + '.jpg'))

#-------------------------Model Ranking Ability-------------------------
def modelRank(data, response, model, show = 10, pos_label = 1, neg_label = 0):
    '''
    Model Ranking Ability.
    Parameters:
    -----------
    data :     pandas DataFrame with all possible predictors and response.
    response:  string, name of response column in data.
    model:     result from RandomizedSearchCV / GridSearchCV.
    show:      int, the number of sections to be shown in the table.
    pos_label: int, for binary classification this represents positive label.
    neg_label: int, for binary classification this represents negative label.
    -----------
    '''
    prob = data[[response]]
    prob['prob'] = model.predict_proba(data.drop(response, axis = 1).values)[:, 1]
    prob.sort_values('prob', ascending = False, inplace = True)
    prob.reset_index(drop = True, inplace = True)
    prob['seq'] = list(range(len(prob)))

    # Get quantile of data
    quantile_limit = prob.seq.quantile([i/show for i in range(1, (show + 1))], interpolation = 'higher')
    quantile_limit.reset_index(drop = True, inplace = True)

    rank_df = pd.DataFrame()
    for i in quantile_limit.index.values:
        rank_df_tmp = pd.DataFrame()
        if i == 0:
            rank_df_tmp['Bad'] = [sum(prob.loc[: quantile_limit[i], response] == pos_label)]
            rank_df_tmp['Good'] = [sum(prob.loc[: quantile_limit[i], response] == neg_label)]
            rank_df_tmp['Total'] = [int(rank_df_tmp['Bad'] + rank_df_tmp['Good'])]
            rank_df_tmp['Bad_Rate'] = ['{:.4f}'.format(round(float(rank_df_tmp['Bad'] / rank_df_tmp['Total']), 4))]
            rank_df_tmp['Min_Prob'] = [prob.loc[: quantile_limit[i], 'prob'].min()]
            rank_df_tmp['Max_Prob'] = [prob.loc[: quantile_limit[i], 'prob'].max()]
            rank_df = rank_df.append(rank_df_tmp)
        else:
            rank_df_tmp['Bad'] = [sum(prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], response] == pos_label)]
            rank_df_tmp['Good'] = [sum(prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], response] == neg_label)]
            rank_df_tmp['Total'] = [int(rank_df_tmp['Bad'] + rank_df_tmp['Good'])]
            rank_df_tmp['Bad_Rate'] = ['{:.4f}'.format(round(float(rank_df_tmp['Bad'] / rank_df_tmp['Total']), 4))]
            rank_df_tmp['Min_Prob'] = [prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], 'prob'].min()]
            rank_df_tmp['Max_Prob'] = [prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], 'prob'].max()]
            rank_df = rank_df.append(rank_df_tmp)

    rank_df['Cum_Bad_Num'] = rank_df.Bad.cumsum()
    rank_df['Cum_Bad_Pct'] = rank_df['Cum_Bad_Num'] / rank_df['Bad'].sum()
    rank_df.drop('Cum_Bad_Num', axis = 1, inplace = True)
    rank_df['Cum_Bad_Pct'] = rank_df['Cum_Bad_Pct'].map(lambda x: '{:.4f}'.format(round(x, 4)))

    total = pd.DataFrame({'Bad' : [rank_df['Bad'].sum()],
                          'Good' : [rank_df['Good'].sum()],
                          'Total' : [rank_df['Total'].sum()],
                          'Bad_Rate' : ['{:.4f}'.format(round((rank_df['Bad'].sum() / rank_df['Total'].sum()), 4))],
                          'Min_Prob' : [rank_df['Min_Prob'].min()],
                          'Max_Prob' : [rank_df['Max_Prob'].max()],
                          'Cum_Bad_Pct' : ['1.0000']})
    rank_df = rank_df.append(total)
    rank_df.reset_index(drop = True, inplace = True)
    rank_df['Min_Prob'] = rank_df['Min_Prob'].map(lambda x: '{:.4f}'.format(round(x, 4)))
    rank_df['Max_Prob'] = rank_df['Max_Prob'].map(lambda x: '{:.4f}'.format(round(x, 4)))

    # Output ordered columns
    return(rank_df[['Min_Prob', 'Max_Prob', 'Bad', 'Good', 'Total', 'Bad_Rate', 'Cum_Bad_Pct']])

#-------------------------Population Stability Index(PSI)-------------------------
def PSI(data_a, data_b, response, model, show = 10):
    '''
    Calculate Population Stability Index(PSI).
    Parameters:
    -----------
    data_a & data_b: pandas DataFrame, have same columns and only have numeric columns.
    response: string, name of response column in data.
    model:    result from RandomizedSearchCV / GridSearchCV.
    show:     int, the number of sections to be split in the table.
    -----------
    '''
    # Get probability from base data and predict data
    prob_base = pd.Series(model.predict_proba(data_a.drop(response, axis = 1).values)[:, 1])
    quantile_limit = prob_base.quantile([i/show for i in range(1, (show + 1))], interpolation = 'higher')
    quantile_limit.reset_index(drop = True, inplace = True)
    prob_pred = pd.Series(model.predict_proba(data_b.drop(response, axis = 1).values)[:, 1])

    # base and predict list
    base_list = []
    pred_list = []

    # Original
    for i in quantile_limit.index.values:
        if i == 0:
            base_list.append((sum(prob_base <= quantile_limit[i]) / len(prob_base)))
            pred_list.append((sum(prob_pred <= quantile_limit[i]) / len(prob_pred)))
        else:
            base_list.append((sum((prob_base > quantile_limit[i - 1]) & (prob_base <= quantile_limit[i])) / len(prob_base)))
            pred_list.append((sum((prob_pred > quantile_limit[i - 1]) & (prob_pred <= quantile_limit[i])) / len(prob_pred)))

    # Deal with 0 in base_list & pred_list
    psi = '{:.4f}%'.format((round(sum([np.inf if (y == 0 or t == 0) else ((t - y) * math.log(t / y)) for t, y in zip(pred_list, base_list)]),
                                  8) * 100))
    print(' {} sections PSI is: {}'.format(show, psi))
    if np.float(psi[:(len(psi) - 1)]) > 10:
        print('''
              ************************************
              Warning: Beware of the large PSI !!!
              ************************************
              ''')
    print(' ')

    #**************************************************
#    # For cooperate Xu Min
#    for i in quantile_limit.index.values:
#        if i == 0:
#            base_list.append((sum(prob_base < quantile_limit[i]) / len(prob_base)))
#            pred_list.append((sum(prob_pred < quantile_limit[i]) / len(prob_pred)))
#        elif i == 9:
#            base_list.append((sum((prob_base >= quantile_limit[i - 1]) & (prob_base <= quantile_limit[i])) / len(prob_base)))
#            pred_list.append((sum((prob_pred >= quantile_limit[i - 1]) & (prob_pred <= quantile_limit[i])) / len(prob_pred)))
#        else:
#            base_list.append((sum((prob_base >= quantile_limit[i - 1]) & (prob_base < quantile_limit[i])) / len(prob_base)))
#            pred_list.append((sum((prob_pred >= quantile_limit[i - 1]) & (prob_pred < quantile_limit[i])) / len(prob_pred)))
#
#    psi = '{:.4f}%'.format((round(sum([0 if y == 0 or t == 0 else ((t - y) * math.log(t / y)) for t, y in zip(pred_list, base_list)]),
#                                  8) * 100))
    #**************************************************

    return(psi)
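For reference, the PSI computed above is the sum over the score bins of (p_pred - p_base) * ln(p_pred / p_base), with the bin edges taken from the quantiles of the base (training-set) scores; the warning threshold in the code is 10%, in line with the common rule of thumb that a PSI below 0.1 indicates a stable population. A tiny hand-checkable example with made-up bin proportions:

import math

base_list = [0.25, 0.25, 0.25, 0.25]    # made-up proportions of the base scores in 4 bins
pred_list = [0.30, 0.20, 0.25, 0.25]    # made-up proportions of the new scores in the same bins
psi = sum((t - y) * math.log(t / y) for t, y in zip(pred_list, base_list))
print('{:.4f}'.format(psi))             # 0.0203, i.e. about 2%, well below the 10% warning level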
Part 4: Example (Template.py)
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 21 20:57:44 2017

@author: Hin
"""

%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
from scipy.stats import randint as sp_randint
from Data_Processing import trainTestSplitV2, varThreshold, replaceOutlierNScale, featureSelectFromModel
from Multiple_Model_Selection import randomSearchCVTraining, gridSearchCVTraining, trainModelSequence, bestModel, featureReduce
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost.sklearn import XGBClassifier
from Model_Evaluation import featureImpGraph, aucKSGraph, modelRank, PSI
import joblib
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml, PMMLPipeline

#-------------------------Set Label-------------------------
# Read csv with chinese characters and rename y as flag
dataset = pd.read_csv('ZZZ_test_purpose.csv', header = 0, encoding = 'gb18030')
dataset.rename(columns = {'def30_dup': 'flag'}, inplace = True)

# Get number of rows and columns of data
print("Number of Rows: ", dataset.shape[0])
print("Number of Columns: ", dataset.shape[1])

# Show missing values for each column
dataset.isnull().sum()
print('The Number of Missing Values: ', dataset.isnull().sum().sum())

# Split all the numeric columns from data
# For test purpose, only deal with numeric data for the rest
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
dataset = dataset.select_dtypes(include = numerics)

#-------------------------Missing Imputation-------------------------
# Use certain value for missing imputation
dataset.replace([np.nan, np.inf, -np.inf], -1, inplace = True)

#-------------------------Train Test Split-------------------------
train, test = trainTestSplitV2(dataset, 'flag')

#-------------------------Feature Engineering and Feature Selection-------------------------
# Deal with categorical variables(one-hot encoding): pd.get_dummies(df)
# Removes all 0 variance features
elmi_col = varThreshold(dataset, train)
train.drop(elmi_col, axis = 1, inplace = True)
test.drop(elmi_col, axis = 1, inplace = True)

# Obtain all the important features based on certain threshold
imp_col = featureSelectFromModel(train, 'flag', figname = 'Top 10 Important Features', n_tree = 500,
                                 n_core = -1, rdm_state = None, thd = 'median', word = 1, show = 10, figsize = (12, 10))
train = train[(imp_col + ['flag'])]
test = test[(imp_col + ['flag'])]

#-------------------------Replacing Outliers and Scaling-------------------------
# Change test set first owing to the logic of the replaceOutlierNScale function
# Ignore nan when scale the data
train.replace(-1, np.nan, inplace = True)
test.replace(-1, np.nan, inplace = True)
quantile_limit, test = replaceOutlierNScale(train, test, 'flag', low = 0.02, high = 0.98, scale = True)
quantile_limit, train = replaceOutlierNScale(train, train, 'flag', low = 0.02, high = 0.98, scale = True)
# Fill nan data
train.replace([np.nan, np.inf, -np.inf], -1, inplace = True)
test.replace([np.nan, np.inf, -np.inf], -1, inplace = True)

#-------------------------Model Selection-------------------------
# Use same interface for several models
# classifiers consists of name, algorithm, parameters set
classifiers = []

## Logistic regression
#logreg_param = {'C': list(np.power(10.0, np.arange(-10, 10))),'penalty': ['l1','l2']}
#classifiers.append(['Logistic Regression', LogisticRegression(), logreg_param])

# SVM
svm_para = {'kernel':['linear','rbf'],
            'C': list(np.power(10.0, np.arange(-10, 3))),
            'gamma': list(np.logspace(-4,0,5)) + ['auto']}
classifiers.append(['SVM', SVC(probability = True), svm_para])

# Random Forest
rf_param = {'n_estimators': list(np.arange(600, 1100, 100)),
            'max_depth': [3, 10, None],
            'max_features': [i/10 for i in range(1, 10)],
            'min_samples_split': sp_randint(2, 10),
            'min_samples_leaf': sp_randint(1, 10)}
classifiers.append(['RandomForest', RandomForestClassifier(), rf_param])

# XGBoost
xgb_param = {'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
             'n_estimators': list(range(100, 1100, 100)),            # Number of trees
             'max_depth': list(range(1, 10, 2)),                     # Max depth of trees
             'gamma': list(np.arange(0, 0.5, 0.05)),                 # Minimum loss reduction required to make a further partition on a leaf node of the tree
             'subsample': [i/10 for i in range(5, 11)],              # subsample ratio of the training data, row-wise
             'colsample_bytree': [i/10 for i in range(3, 11)],       # subsample ratio of columns when constructing each tree
             'colsample_bylevel': [i/10 for i in range(1, 11)],      # subsample ratio of columns for each split
             'reg_alpha': ([0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5] + list(np.arange(1, 10, 1)) + list(np.arange(10, 110, 10))),    # L1 regularization term on weights
             'reg_lambda' : (list(np.arange(1, 10, 1)) + list(np.arange(10, 60, 10))),    # L2 regularization term on weights
             'min_child_weight': sp_randint(1, 6),                   # Defines the minimum sum of weights of all observations required in a child
             'max_delta_step': sp_randint(0, 11),                    # In maximum delta step we allow each tree's weight estimation to be
             'random_state': sp_randint(0, 1000)}
xgb = XGBClassifier(objective = 'binary:logistic', missing = None)
classifiers.append(['XGB', xgb, xgb_param])

model_set = trainModelSequence(train, 'flag', classifiers, method = randomSearchCVTraining,
                               iternum = 10, evalmetric = 'roc_auc', n_core = 4, fold = 5, word = 1)

# Get best CV estimator
best_model = bestModel(model_set)

#-------------------------Feature Reduction-------------------------
# Reduce features from the best model above
# classifiers consists of name, algorithm, parameters set
classifiers = []

# XGBoost1
xgb_param = {'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
             'n_estimators': list(range(100, 1100, 100)),            # Number of trees
             'max_depth': list(range(1, 10, 2)),                     # Max depth of trees
             'gamma': list(np.arange(0, 0.5, 0.05)),                 # Minimum loss reduction required to make a further partition on a leaf node of the tree
             'subsample': [i/10 for i in range(5, 11)],              # subsample ratio of the training data, row-wise
             'colsample_bytree': [i/10 for i in range(3, 11)],       # subsample ratio of columns when constructing each tree
             'colsample_bylevel': [i/10 for i in range(1, 11)],      # subsample ratio of columns for each split
             'reg_alpha': ([0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5] + list(np.arange(1, 10, 1)) + list(np.arange(10, 110, 10))),    # L1 regularization term on weights
             'reg_lambda' : (list(np.arange(1, 10, 1)) + list(np.arange(10, 60, 10))),    # L2 regularization term on weights
             'min_child_weight': sp_randint(1, 6),                   # Defines the minimum sum of weights of all observations required in a child
             'max_delta_step': sp_randint(0, 11),                    # In maximum delta step we allow each tree's weight estimation to be
             'random_state': sp_randint(0, 1000)}
xgb = XGBClassifier(objective = 'binary:logistic', missing = None)
classifiers.append(['XGB1', xgb, xgb_param])
# XGBoost2
classifiers.append(['XGB2', xgb, xgb_param])
# XGBoost3
classifiers.append(['XGB3', xgb, xgb_param])

# Pick one model that has good performance and less features in xgb_models
xgb_models = featureReduce(train, 'flag', classifiers, best_model['XGB'], method = randomSearchCVTraining,
                           iternum = 10, evalmetric = 'roc_auc', n_core = 4, fold = 5, word = 1)

#-------------------------Create Graph, Ranking and PSI-------------------------
# Feature Importance
featureImpGraph(train[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'],
                figname = 'XGB2 Feature Importance', show = 10, figsize = (12, 10))

# AUC and KS Graph
# For training set
aucKSGraph(train, 'flag',
           xgb_models[1]['XGB2'].predict_proba(np.array(train[xgb_models[1]['feature']].drop('flag', axis = 1)))[:, 1],
           pos_label = 1, model_name = 'Training', figsize = (12, 10))
# For testing set
aucKSGraph(test, 'flag',
           xgb_models[1]['XGB2'].predict_proba(np.array(test[xgb_models[1]['feature']].drop('flag', axis = 1)))[:, 1],
           pos_label = 1, model_name = 'Testing', figsize = (12, 10))

# Ranking
rank_train = modelRank(train[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'], show = 20)
rank_test = modelRank(test[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'], show = 20)

# PSI
PSI(train[xgb_models[1]['feature']], test[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'])

#-------------------------Save Model as PKL-------------------------
joblib.dump(xgb_models[1]['XGB2'], 'XGB2oost_xgb_models.pkl', compress = 3)
# XGB2_best = joblib.load("XGB2oost_XGB2_models.pkl")

#-------------------------Save Model as PMML-------------------------
# XGB to PMML
# xgb_models[1]['XGB2'].best_params_: Get best parameters from model
# xgb_models[1]['XGB2'].best_estimator_: Estimator that was chosen by the search
xgb_pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([(i, None) for i in xgb_models[1]['feature'][:(len(xgb_models[1]['feature']) - 1)]])),
    ("classifier", xgb_models[1]['XGB2'].best_estimator_)])
# xgb_pipeline is a model which can also be used to predict
xgb_pipeline.fit(train[xgb_models[1]['feature']].drop('flag', axis = 1),
                 train[xgb_models[1]['feature']].flag)
# PMML Transfer
sklearn2pmml(xgb_pipeline, "xgb.pmml", with_repr = True)