  • Machine Learning (from zhouxun, former team lead)

    Main:

    • Template.py
    • Template.py is the main workflow and implements, in order:
    1. Train Test Split
    2. Missing Imputation
    3. Feature Selection
    4. Cap and Floor
    5. Data Scaling
    6. Model Selection
    7. Feature Reduction
    8. AUC & KS graphing, Model Ranking and PSI
    9. Ranking of the overdue rate against the predicted probabilities
    10. Performance validation after feature reduction
    11. Feature importance plot of the final model (if available)
    12. Saving the model as PKL or PMML

    Function Files:

    • Data_Processing.py
    • Data_Processing.py handles data preprocessing and contains functions for:
    1. Train Test Split
    2. Feature Selection
    3. Cap and Floor
    • Multiple_Model_Selection.py
    • Multiple_Model_Selection.py handles model selection and implements:
    1. Model selection via grid search with cross-validation (CV) and randomized search with CV, returning the best model after several model families have been trained.
    2. Feature reduction: once the best model is found, retrain it with fewer input variables and verify the result, cutting the number of inputs as far as possible without a large drop from the best model's performance, which improves the model's robustness.
    • Model_Evaluation.py
    • Model_Evaluation.py handles model evaluation and implements:
    1. AUC and KS plotting
    2. PSI calculation
    3. Model Ranking
    4. Ranking of the overdue rate against the predicted probabilities
    5. Feature importance plot of the final model (if available)

    The code is written in a functional programming style and is meant as a framework, so variable handling is deliberately not fine-grained; add whatever is needed for the structure of your actual data.

    Part 1: Data_Processing

    First, import the required modules:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import StratifiedShuffleSplit  # mainly for samples with an imbalanced label
    from sklearn.feature_selection import VarianceThreshold, SelectFromModel  # the first is variance-threshold feature selection (drop features whose variance falls below a threshold); the second is a form of embedded feature selection
    #from sklearn.preprocessing import MinMaxScaler
    from sklearn.ensemble import ExtraTreesClassifier  # extremely randomized trees, a variant of random forests
    from matplotlib import style, pyplot as plt
    plt.rcParams['font.sans-serif'] = ['SimHei'] # Display Chinese Characters
    plt.rcParams['axes.unicode_minus'] = False # Display Minus Sign
    style.use('ggplot')

    1. Train/test split with StratifiedShuffleSplit (mainly for samples with an imbalanced label); returns the train and test sets with X and y recombined.

    train, test = trainTestSplitV2(dataset, 'flag')

    def trainTestSplitV2(data, response, testsize = 0.3, trainsize = 0.7, rdm_state = None):
        '''
        Train test split with tolerance of the mean difference between
        dataset and test set.
        
        Parameters:
        -----------
        data : pandas DataFrame with all possible predictors and response.
    
        response: string, name of response column in data (this is the y label).
        
        testsize: numeric, between 0.0 and 1.0, the size of the testing set.
        
        trainsize: numeric, between 0.0 and 1.0, the size of the training set.
        
        rdm_state: None or int, random state of feature selection model.
        -----------
        '''
        
        X = np.array(data.drop(response, axis = 1))  # convert to arrays because many models cannot take a DataFrame directly
        y = np.array(data[response])
        
        sssplit = StratifiedShuffleSplit(n_splits = 1, test_size = testsize, 
                                         train_size = trainsize, random_state = rdm_state)  # 70/30 split with a single fold, so no for loop is needed
        
        # Generate indices to split data into training and test set.
        split_index = sssplit.split(X, y)
        train_index, test_index = next(split_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        # Get all the columns name back
        train_data = pd.DataFrame(data = np.c_[X_train, y_train],   # np.c_ concatenates arrays column-wise, keeping the number of rows
                                  columns = np.array((data.drop(response, axis = 1).columns.tolist() + [response])))  # .tolist() converts the column Index to a list
        test_data = pd.DataFrame(data = np.c_[X_test, y_test], 
                                 columns = np.array((data.drop(response, axis = 1).columns.tolist() + [response])))
     
        print('Mean of y_all is: {:.4f}\nMean of y_train is: {:.4f}\nMean of y_test is: {:.4f}'.format(
            data[response].mean(), train_data[response].mean(), test_data[response].mean()))
    
        return(train_data, test_data)

    2. Compute each feature's variance and use VarianceThreshold to keep features with non-zero variance; the function returns the zero-variance features (to be dropped afterwards).

    elmi_col = varThreshold(dataset, train)

    #-------------------------Variance Threshold Function-------------------------
    def varThreshold(data, trainset, thd = 0):
        '''
        Feature selector that removes all low-variance features.
        
        Parameters:
        -----------
        data : pandas DataFrame with all possible predictors and response.
    
        trainset: pandas DataFrame, the training set.
        
        thd: numeric, threshold of the variance.
        -----------
        '''
        
        sel_def = VarianceThreshold(threshold = thd)
        new_train = sel_def.fit_transform(trainset)
        print('The Number of Features Selected After Removing All Zero-variance Features:', new_train.shape[1])  # shape[0] is the number of rows, shape[1] the number of columns
        
        # Get names of low variance features
        bool_arr = (sel_def.variances_ == thd).tolist()  # returns a list of booleans, True where the variance equals the threshold
        seq = [i for i, value in enumerate(bool_arr) if value]  # indices of the True entries
        eliminate_feature = [trainset.columns.tolist()[ele] for ele in seq]  # index into trainset's columns, which are the columns variances_ refers to
        
        return(eliminate_feature)
    ['身份证_姓名命中法院结案模糊名单',
     '身份证命中信贷逾期名单',
     '手机号命中信贷逾期名单',
     '第一联系人手机号命中信贷逾期名单',
     '身份证命中法院失信名单',
     '身份证_姓名命中法院失信模糊名单',
     '身份证命中公司欠税名单',
     'X1month_第三方服务商',
     'X1month_理财机构',
     'X1month_银行小微贷款',
     'X1month_汽车租赁',
     'X1month_房地产金融',
     'X1month_融资租赁',
     'X3month_银行小微贷款',
     'X3month_房地产金融',
     'X7days_互联网金融门户',
     'X7days_第三方服务商',
     'X7days_理财机构',
     'X7days_财产保险',
     'X7days_银行小微贷款',
     'X7days_汽车租赁',
     'X7days_房地产金融',
     'X7days_融资租赁']

    3. Outlier handling (values above/below a quantile threshold are replaced by that threshold) and data scaling (min-max); the function returns each feature's quantile limits and the transformed data.

    Note: step 4 is run before step 3, i.e. the 50 features are selected first and their -1 placeholders are turned back into NaN, and only then is this step applied. The test set is transformed before the training set, because the quantile limits are computed from the (not yet scaled) training set.

    quantile_limit, test = replaceOutlierNScale(train, test, 'flag', low = 0.02, high = 0.98, scale = True)
    quantile_limit, train = replaceOutlierNScale(train, train, 'flag', low = 0.02, high = 0.98, scale = True)

    The function returns the capped/floored and scaled features.

    #-------------------------Apply Quantile(Low and High) of Data to Replace Outliers-------------------------
    def replaceOutlierNScale(data_a, data_b, response, low = 0.02, high = 0.98, 
                             interpolation = 'midpoint', scale = True):
        '''
        Replace low and high quantile outliers of data_b with specified quantile value of data_a.
        Only deal with numeric columns, ignore the Nan value and string type columns.
        
        Parameter:
        -----------
        data_a & data_b: pandas DataFrame, have same columns and only have numeric columns.
        
        response: string, name of response column in data.
            
        low: low limit.
        
        high: high limit.
        
        interpolation: This parameter specifies the interpolation method to use.  
        -----------
        '''
        
        # Replace Outliers: Cap and floor 
        new_df = data_b.drop(response, axis = 1)
        
        quantile_limit = data_a.drop(response, axis = 1).quantile([low, high], interpolation = interpolation)  # a two-row table indexed by low and high
        '''
        e.g.:
                  0    1    2    3
        0.02    0.0  1.0  0.5  3.0
        0.98    0.0  1.5  2.5  3.0
        '''
        
        outliers_low = (new_df < quantile_limit.loc[low, :])  # a boolean DataFrame
        '''
               0      1      2      3
        0  False  False   True  False
        1  False  False  False  False
        2  False  False  False  False
        '''
    
        new_df.mask(outliers_low, quantile_limit.loc[low, :], inplace = True, axis = 1)  # replace values below the low limit with that limit; mask substitutes where the condition is True
        
        outliers_high = (new_df > quantile_limit.loc[high, :])
        new_df.mask(outliers_high, quantile_limit.loc[high, :], inplace = True, axis = 1)
        
        new_df[response] = data_b[response]
        
        # Min-max scale
        if scale:
    #        # Replace Outliers: Cap and floor 
    #        X_a = data_a.drop(response, axis = 1)
    #        outliers_low = (X_a < quantile_limit.loc[low, :])
    #        X_a.mask(outliers_low, quantile_limit.loc[low, :], inplace = True, axis = 1)  
    #        outliers_high = (X_a > quantile_limit.loc[high, :])
    #        X_a.mask(outliers_high, quantile_limit.loc[high, :], inplace = True, axis = 1)
            
    #        X_a = np.array(X_a)
    #        X_b = np.array(new_df.drop(response, axis = 1))
            
    #        # Fit on X_a and transform to X_b 
    #        min_max_scaler = MinMaxScaler().fit(X_a)
    #        X_b = min_max_scaler.transform(X_b)
            
    #        # Get all the columns name back
    #        new_df = pd.DataFrame(data = np.c_[X_b, np.array(data_b[response])], 
    #                              columns = np.array((data_b.drop(response, axis = 1).columns.tolist() + [response])))
        
            new_df.drop(response, axis = 1, inplace = True)
            new_df = (new_df - quantile_limit.min()) / (quantile_limit.max() - quantile_limit.min())  # min-max scale using the quantile limits
            
            new_df[response] = data_b[response] 
        
        return(quantile_limit, new_df)   
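
    For illustration, a tiny sanity check of replaceOutlierNScale on made-up numbers (the toy DataFrame below is invented for this example and is not part of the project):

    toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 100.0], 'flag': [0, 0, 1, 0, 1]})
    limits, toy_scaled = replaceOutlierNScale(toy, toy, 'flag', low = 0.02, high = 0.98, scale = True)
    print(limits)       # the 0.02 and 0.98 quantiles of 'x' (midpoint interpolation gives 1.5 and 52.0)
    print(toy_scaled)   # 'x' capped at those limits and min-max scaled into [0, 1]; 'flag' unchanged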
        

    4. Feature selection (featureSelectFromModel): features are chosen from a tree model's feature_importances_ and the names of the selected features are returned.

    imp_col =featureSelectFromModel(train, 'flag', figname = 'Top 10 Important Features', 
                                    n_tree = 500, n_core = -1, rdm_state = None, thd = 'median', 
                                    word = 1, show = 10, figsize = (12, 10))

     Remember: step 4 runs before step 3.

    #-------------------------Feature Selected From Model-------------------------
    def featureSelectFromModel(trainset, response, figname, 
                               n_tree = 500, n_core = -1, 
                               rdm_state = None, word = 0, thd = 'median', 
                               show = 10, figsize = (10, 8)):
        '''
        Feature selector that removes all unimportant features.
        
        Parameters:
        -----------
        trainset : pandas DataFrame with all possible predictors and response in train set.
        
        response: string, name of response column in data.
        
        figname: string, name of the feature importance graph.
        
        rdm_state: None or int, random state of feature selection model.
        
        thd: string, set to 'median', 'mean', or a scaled form such as '0.25*median' or '0.25*mean';
        features whose importance is greater than or equal to thd are kept 
        while the others are discarded.
            
        show: int, the number of features to be shown in the graph.
        
        figsize: tuple, the size of figure.
        -----------
        '''
        
        column_names = trainset.drop(response, axis = 1).columns.tolist()
        X = np.array(trainset.drop(response, axis = 1))
        y = np.array(trainset[response])
        # Create the SelectFromModel object and retrieve the optimal number of features which
        # the threshold value is set as thd of the feature importances.
        clf = ExtraTreesClassifier(n_estimators = n_tree, n_jobs = n_core, 
                                   random_state = rdm_state, verbose = word)
        clf = clf.fit(X, y)  
        model = SelectFromModel(clf, threshold = thd, prefit = True)
        X_new = model.transform(X)
        print('The Number of Features Selected:', X_new.shape[1]) 
        
        # Get the feature importances
        importances = clf.feature_importances_  # an array with one value per feature
        
        #***************Obtain names of important features***************
        def ImptFeature(thd = thd, impt = importances):
            method = thd.split('*')
            if 'median' in method:
                name_val = pd.Series(impt, index = column_names)  # the index holds the column names, so the values returned below are column names
                if len(method) == 1:  # length 1 when thd is simply 'median'
                    imp_val = name_val[name_val >= np.median(impt)]  # keep features whose importance is >= the median
                    return(imp_val.index.tolist())
                    
                elif len(method) == 2:
                    coef = float(method[0])
                    imp_val = name_val[name_val >= (coef * np.median(impt))]
                    return(imp_val.index.tolist())
                    
                else:
                    print('"thd" maybe not follow the pattern, check it !!!')
                    
            elif 'mean' in method:
                name_val = pd.Series(impt, index = column_names)
                if len(method) == 1:
                    imp_val = name_val[name_val >= np.mean(impt)]
                    return(imp_val.index.tolist())
                    
                elif len(method) == 2:
                    coef = float(method[0])
                    imp_val = name_val[name_val >= (coef * np.mean(impt))]
                    return(imp_val.index.tolist())
                    
                else:
                    print('"thd" maybe not follow the pattern, check it !!!')
                    
            else:
                print('"thd" maybe not follow the pattern, check it !!!')
        #*********************************************************
    
        # Standard deviation of feature importances
        std = np.std([tree.feature_importances_ for tree in clf.estimators_],
                     axis=0)
        
        # Return the indices of the top several important features 
        indices = np.argsort(importances)[::-1][:show]  # show = 10 means the top 10 features
        
        # Get the top important features name
        features = [column_names[ele] for ele in indices]    
        
        #***************Graph of feature importance***************
        # Show importance of each feature   
        fig = plt.figure(figsize = figsize)
        axes = plt.subplot2grid((1,1), (0,0))
        axes.bar(range(show), importances[indices],
               color = '#4682B4', yerr = std[indices], align = 'center')
        plt.xticks(range(show), features)
        plt.xlim((-1, show))
        
        # Rotate the angle of the labels
        for label in axes.xaxis.get_ticklabels():
            label.set_rotation(90)
            
        plt.title(('Top ' + str(show) + ' Important Features'))
        plt.xlabel('Name of Variable')
        plt.ylabel('Importance')   
        plt.savefig((figname + '.jpg'))
        #*********************************************************
        
        # Output the top important features
        print('Top ' + str(show) + ' Features Ranking:')
        for f in range(show): 
            print('{}. Importance of feature {} named "{}" is: {:.4f}'.format(f + 1,
                  indices[f], column_names[indices[f]], importances[indices[f]]))
            
        return(ImptFeature())

     The 50 selected features that are returned:

    ['loan_amount',
     'loan_term',
     'final_score',
     'X3个月内申请人在多个平台申请借款',
     'X1个月内申请人在多个平台申请借款',
     'X1month_P2P网贷',
     'X1month_财产保险',
     'X3month_P2P网贷',
     'X3month大型消费金融公司',
     'X3month_互联网金融门户',
     'X3month_信用卡中心',
     'label',
     'cell_number_xiangguan',
     'risk_count',
     'annual_income_500000',
     'X_y',
     '本品牌合作时间',
     '经营年限_注册时间_',
     '上级经销商法人手机',
     '总和_经营面积',
     '总和_经营年限',
     '总和_年销售额',
     '总和_年销售总量',
     'order_sum',
     'first_dd_sum',
     'first_dd_avg',
     'first_dd_sd',
     'max_dd_sum',
     'max_dd_avg',
     'max_dd_sd',
     'pass_term_avg',
     'pass_term_sd',
     'fpd_sum',
     'loan_amt_sum',
     'loan_amt_avg',
     'loan_amt_sd',
     'fpd_times',
     'times',
     'loan_times',
     'xs_amount',
     'xy_amount',
     'history',
     'layer',
     'type',
     'size',
     'wang2_count_x',
     'wang2_def30_count',
     'wang2_c_order_sum',
     'wang2_count_y',
     'amount_sum']


    Part 2: Model selection — two search strategies, returning the best model after training several model families

    model_set = trainModelSequence(train, 'flag', classifiers, method = randomSearchCVTraining,
                                   iternum = 10, evalmetric = 'roc_auc', n_core = 4, fold = 5, word = 1)
    Best score of RandomizedSearchCV is: 0.7967
    Best parameters of RandomizedSearchCV is: 
     {'kernel': 'linear', 'gamma': 'auto', 'C': 100.0}
    Best score of RandomizedSearchCV is: 0.8623
    Best parameters of RandomizedSearchCV is: 
     {'max_depth': None, 'max_features': 0.3, 'min_samples_leaf': 6, 'min_samples_split': 8, 'n_estimators': 600}
    Best score of RandomizedSearchCV is: 0.8514
    Best parameters of RandomizedSearchCV is: 
     {'colsample_bylevel': 0.7, 'colsample_bytree': 0.7, 'gamma': 0.15000000000000002, 'learning_rate': 0.05, 'max_delta_step': 6, 'max_depth': 1, 'min_child_weight': 2, 'n_estimators': 300, 'random_state': 457, 'reg_alpha': 2, 'reg_lambda': 10, 'subsample': 0.7}
    # Get best CV estimator
    best_model = bestModel(model_set)
    # The output is the model with the highest score, i.e. the random forest

    ===== Now the feature reduction step: as shown above, 50 features enter the model, so we prune =====

    One confusing thing happened: in the run above the random forest clearly had the highest score, yet the feature-reduction step uses the XGBoost result, so I removed the random forest from the classifier list before running the reduction below. A small workaround sketch follows.
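    A small sketch (a suggestion, not part of the original template) that pulls the winning search object out of bestModel's result without hard-coding the algorithm name, so the feature-reduction step can be fed whichever family actually won:

    best_model = bestModel(model_set)
    best_name = [k for k in best_model if k != 'Best Score'][0]    # e.g. 'RandomForest' or 'XGB'
    best_search = best_model[best_name]                            # the fitted RandomizedSearchCV object
    print('Winner:', best_name, 'with CV score', best_model['Best Score'])
    # Note: featureReduce relies on best_estimator_.feature_importances_, so this only works
    # when the winner is a tree-based model (e.g. random forest or XGBoost), not the SVM.
    # xgb_models = featureReduce(train, 'flag', classifiers, best_search, ...)   # instead of best_model['XGB']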

    # -*- coding: utf-8 -*-
    """
    Created on Thu Nov 23 16:27:24 2017
    
    @author: Hin
    """
    
    
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    
    
    #-------------------------GridSearchCV-------------------------
    def gridSearchCVTraining(data, response, params, mod = None, evalmetric = 'roc_auc',
                             n_core = -1, fold = 5, word = 1, dispatch = '2*n_jobs'):
        '''
        GridSearchCV: Exhaustive search over specified parameter values for an estimator.
        
        Parameters:
        -----------
        data: pandas DataFrame with all possible predictors and response.
    
        response: string, name of response column in data.
        
        other parameters see: scikit-learn documentation(http://scikit-learn.org/stable/modules/
        generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.fit)
        -----------
        '''
        
        X = np.array(data.drop(response, axis = 1))
        y = np.array(data[response])
        
        gs = GridSearchCV(estimator = mod, param_grid = params, scoring = evalmetric, 
                          n_jobs= n_core, cv = fold, refit = True, verbose = word, 
                          pre_dispatch = dispatch, error_score = 'raise')
        
        gs.fit(X, y)
    
        print('\nBest score of GridSearchCV is: {:.4f}'.format(round(gs.best_score_, 4)))
        print('\nBest parameters of GridSearchCV is: \n {}'.format(gs.best_params_))
    
        return(gs)
    
    #-------------------------RandomizedSearchCV-------------------------
    def randomSearchCVTraining(data, response, params, mod = None, evalmetric = 'roc_auc', 
                               n_core = -1, iter_num = 1000, fold = 5, word = 1, 
                               dispatch = '2*n_jobs'):
        '''
        RandomizedSearchCV: Randomized search on hyper parameters.
        
        Parameters:
        -----------
        data: pandas DataFrame with all possible predictors and response.
    
        response: string, name of response column in data.
        
        other parameters see: scikit-learn documentation(http://scikit-learn.org/stable/
        modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)
        -----------
        '''
        
        X = np.array(data.drop(response, axis = 1))
        y = np.array(data[response])
        
        rs = RandomizedSearchCV(estimator = mod, param_distributions  = params, n_iter = iter_num, 
                                scoring = evalmetric, n_jobs= n_core, cv = fold, refit = True, 
                                verbose = word, pre_dispatch = dispatch, error_score = 'raise')
        
        rs.fit(X, y)
    
        print('\nBest score of RandomizedSearchCV is: {:.4f}'.format(round(rs.best_score_, 4)))
        print('\nBest parameters of RandomizedSearchCV is: \n {}'.format(rs.best_params_))
    
        return(rs)
    
    
    #-------------------------Training Models in Sequence-------------------------
    def trainModelSequence(data, response, classifiers, method = randomSearchCVTraining,
                           iternum = None, evalmetric = 'roc_auc', n_core = -1, fold = 5, 
                           word = 1, dispatch = '2*n_jobs'):
        '''
        Training several models in sequence. Note that the tuning function itself
        (randomSearchCVTraining or gridSearchCVTraining) is passed in as the method parameter.
        
        Parameters:
        -----------
        data: pandas DataFrame with all possible predictors and response.
    
        response: string, name of response column in data.
        
        classifiers: a set of algorithms.
        
        method: the method of parameter tuning, randomSearchCVTraining or gridSearchCVTraining.
        
        other parameters see: scikit-learn documentation(RandomizedSearchCV / GridSearchCV)
        -----------
        '''
        
        model_set = []
        
        # RandomizedSearchCV Interface
        if(iternum):
            for clf in range(len(classifiers)):
                try:
                    print('\n***************** {} Begin *****************'.format(classifiers[clf][0]))  # print the name of each model, e.g. SVM
                    
                    # Train model
                    model = method(data, response, params = classifiers[clf][2],  # the parameter set of each model
                                   mod = classifiers[clf][1], iter_num = iternum,  # mod is the base estimator, e.g. SVC()
                                   evalmetric = evalmetric, n_core = n_core, fold = fold, 
                                   word = word, dispatch = dispatch)  # returns the fitted search object and prints best_score_ and best_params_
                    
                    bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                    model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score})
                    
                except Exception as e:
                    print('\n', '#' * 15, 'Error Message', '#' * 15, '\n')
                    print('Error is:', e)
                    print('\n', '#' * 15, 'End', '#' * 15, '\n')
                    model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                    next
                else:
                    pass
                finally:
                    print('***************** {} End *****************\n'.format(classifiers[clf][0]))
        
        # GridSearchCV Interface: set method = gridSearchCVTraining, iternum = None
        else:
            for clf in range(len(classifiers)):
                try:
                    print('\n***************** {} Begin *****************'.format(classifiers[clf][0]))
                    
                    # Train model
                    model = method(data, response, params = classifiers[clf][2], 
                                   mod = classifiers[clf][1], 
                                   evalmetric = evalmetric, n_core = n_core, fold = fold, 
                                   word = word, dispatch = dispatch)
                    
                    bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                    model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score})
                    
                except Exception as e:
                    print('\n', '#' * 15, 'Error Message', '#' * 15, '\n')
                    print('Error is:', e)
                    print('\n', '#' * 15, 'End', '#' * 15, '\n')
                    model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                    next
                else:
                    pass
                finally:
                    print('***************** {} End *****************\n'.format(classifiers[clf][0]))
        
        return(model_set)
    
    
    #-------------------------Select the Best Model in Model Sequence-------------------------
    def bestModel(model_set, eval_logic = 'Max'):
        '''
        Select the best model from trainModelSequence.
        
        Parameters:
        -----------
        model_set: result from trainModelSequence.
        
        eval_logic: either 'Max' or 'Min', represent the logic of the evalmetric method.
        -----------
        '''
        
        if (eval_logic == 'Max'):
            bst_model, cur_model = -np.inf, -np.inf
            for model in model_set:
                if (model['Best Score'] != 'Error'):
                    if (np.float64(model['Best Score']) > cur_model):
                        bst_model = model
                        cur_model = np.float64(model['Best Score'])
                    else:
                        next
                else:
                    next
        else:
            bst_model, cur_model = np.inf, np.inf
            for model in model_set:
                if (model['Best Score'].isnumeric()):
                    if (np.float64(model['Best Score']) < cur_model):
                        bst_model = model
                        cur_model = np.float64(model['Best Score'])
                    else:
                        next
                else:
                    next
                    
        return(bst_model)
    
    
    #-------------------------Reduce Features from the Best Model-------------------------
    def featureReduce(data, response, classifiers, model, method = randomSearchCVTraining, 
                      interpolation = 'higher', iternum = None, evalmetric = 'roc_auc', 
                      n_core = -1, fold = 5, word = 1, dispatch = '2*n_jobs'):
        '''
        Reduce features to increase robustness of model.
        
        Parameters:
        -----------
        data: pandas DataFrame with all possible predictors and response.
    
        response: string, name of response column in data.
        
        classifiers: a set of algorithms.
        
        model: result from RandomizedSearchCV / GridSearchCV.
        
        method: the method of parameter tuning, randomSearchCVTraining or gridSearchCVTraining.
        
        other parameters see: scikit-learn documentation(RandomizedSearchCV / GridSearchCV)
        -----------
        '''
        
        model_set = []
        column_names = data.drop(response, axis = 1).columns.tolist()
        
        # Get feature importances
        importances = model.best_estimator_.feature_importances_
        
        # Return the indices of the important features from max to min
        indices = np.argsort(importances)[::-1]
        
        # Get the top important features name
        features = [column_names[ele] for ele in indices]
        features = pd.DataFrame(data = features, index = range(len(features)), columns = ['feature'])
        features['seq'] = list(range(len(features)))
        
        quantile_limit = features.drop('feature', axis = 1).quantile(q = ([i/len(classifiers) for i in range(len(classifiers))] + [0.5]), 
                                                                     interpolation = interpolation)
        quantile_limit.reset_index(drop = True, inplace = True)
        quantile_limit.sort_values('seq', inplace = True)
        quantile_limit = quantile_limit.iloc[1:, :]
        
        feature_sets = [(features['feature'][:int(i)].tolist() + [response]) for i in quantile_limit.seq.values]
        
        # RandomizedSearchCV Interface
        if(iternum):
            for clf in range(len(classifiers)):
                try:
                    print('\n***************** {} Begin *****************'.format(classifiers[clf][0]))
                    
                    # Train model
                    model = method(data[feature_sets[clf]], response, params = classifiers[clf][2], 
                                   mod = classifiers[clf][1], iter_num = iternum, 
                                   evalmetric = evalmetric, n_core = n_core, fold = fold, 
                                   word = word, dispatch = dispatch)
                    
                    bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                    model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score, 
                                      'feature': feature_sets[clf]})
                    
                except Exception as e:
                    print('\n', '#' * 15, 'Error Message', '#' * 15, '\n')
                    print('Error is:', e)
                    print('\n', '#' * 15, 'End', '#' * 15, '\n')
                    model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                    next
                else:
                    pass
                finally:
                    print('***************** {} End *****************\n'.format(classifiers[clf][0]))
        
        # GridSearchCV Interface: set method = gridSearchCVTraining, iternum = None
        else:
            for clf in range(len(classifiers)):
                try:
                    print('\n***************** {} Begin *****************'.format(classifiers[clf][0]))
                    
                    # Train model
                    model = method(data[feature_sets[clf]], response, params = classifiers[clf][2], 
                                   mod = classifiers[clf][1], 
                                   evalmetric = evalmetric, n_core = n_core, fold = fold, 
                                   word = word, dispatch = dispatch)
                    
                    bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                    model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score, 
                                      'feature': feature_sets[clf]})
                    
                except Exception as e:
                    print('\n', '#' * 15, 'Error Message', '#' * 15, '\n')
                    print('Error is:', e)
                    print('\n', '#' * 15, 'End', '#' * 15, '\n')
                    model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                    next
                else:
                    pass
                finally:
                    print('***************** {} End *****************\n'.format(classifiers[clf][0]))
        
        return(model_set)
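
    To make the quantile logic in featureReduce concrete: with 50 ranked features and three classifiers in the list (as in the template below), the cut points are the 1/3, 1/2 and 2/3 quantiles of the importance ranking, so the candidate models are retrained on the top 17, top 25 and top 33 features respectively. A quick check of those cut points (illustration only):

    import pandas as pd
    seq = pd.Series(range(50))                                                    # positions 0..49 in the importance ranking
    cuts = seq.quantile([i/3 for i in range(3)] + [0.5], interpolation = 'higher')
    print(sorted(int(c) for c in cuts)[1:])                                       # [17, 25, 33] -> sizes of the reduced feature sets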

    Part 3: Model evaluation

    # -*- coding: utf-8 -*-
    """
    Created on Sun Sep  3 12:06:42 2017
    
    @author: Hin
    """
    
    
    import math
    import numpy as np
    import pandas as pd
    from sklearn.metrics import roc_curve, auc  # classification metrics
    from matplotlib import style, pyplot as plt
    style.use('ggplot')
    
    
    #-------------------------Feature Importance Graph-------------------------
    def featureImpGraph(data, response, model, figname = 'Feature Importance', 
                        show = 10, figsize = (10, 8)):
        '''
        Plot the feature importance graph.
        
        Parameters:
        -----------
        data : pandas DataFrame with all possible predictors and response.
    
        response: string, name of response column in data.
        
        model: result from RandomizedSearchCV / GridSearchCV.
        
        figname: string, name of the graph.
        
        show: int, the number of features to be shown in the graph.
        
        figsize: tuple, the size of figure.
        -----------
        '''    
        
        column_names = data.drop(response, axis = 1).columns.tolist()
        
        # Get feature importances
        importances = model.best_estimator_.feature_importances_  # feature importances of the best estimator found by the search
        
        # Return the indices of the top several important features 
        indices = np.argsort(importances)[::-1][:show]  # argsort sorts ascending and returns indices; [::-1] reverses them, giving the most important features first
        
        # Get the top important features name
        features = [column_names[ele] for ele in indices]
        
        # Show importance of each feature   
        fig = plt.figure(figsize = figsize)
        axes = plt.subplot2grid((1,1), (0,0))
        axes.bar(range(show), importances[indices],
               color = '#4682B4', align = 'center')
        plt.xticks(range(show), features)
        plt.xlim((-1, show))
        
        # Rotate the angle of the labels
        for label in axes.xaxis.get_ticklabels():
            label.set_rotation(90)
            
        plt.title(('Top ' + str(show) + ' Important Features'))
        plt.xlabel('Name of Variable')
        plt.ylabel('Importance')   
        plt.savefig((figname + '.jpg'))
        
        # Output the top important features
        print('Top ' + str(show) + ' Features:')
        for f in range(show): 
            print('{}. Importance of feature {} named "{}" is: {:.4f}'.format(f + 1,
                  indices[f], column_names[indices[f]], importances[indices[f]]))
      
    
    
    #-------------------------AUC KS Graph-------------------------
    def aucKSGraph(data, response, pred_value, pos_label, model_name = 'Model', 
                   figname = 'AUC KS Graph', figsize = (10, 8)):
        '''
        Plot AUC and KS graph.
        
        Parameters:
        -----------
        data : pandas DataFrame with all possible predictors and response.
    
        response: string, name of response column in data.
        
        pred_value: pandas Series with the predict value of model.
        
        pos_label: int, for binary classification this represents positive label.
        
        figname: string, name of the graph.
        
        model_name: string, label of the graph.
        
        figsize: tuple, the size of figure.
        -----------
        '''
        
        fpr, tpr, thresholds = roc_curve(data[response], pred_value, pos_label = pos_label)
        
        ks_value = max(tpr - fpr)
        ks_value = round(ks_value, 4)
        roc_auc_value = auc(fpr, tpr)
        roc_auc_value = round(roc_auc_value, 4)
        
        fig = plt.figure(figsize = figsize)
        plt.plot([0, 1],[0, 1], linestyle = '--', color = 'b', 
                 label = "random guessing")
       
        plt.xlabel('FPR')
        plt.ylabel('TPR')
        plt.title('{} {}'.format(model_name, figname))   
        plt.plot(fpr, tpr, label = '{} (auc = {:.4f}, ks = {:.4f})'.format(model_name, 
                 roc_auc_value, ks_value), color = 'r')
        plt.legend(loc = 'lower right')
        
        plt.savefig((' '.join([model_name, figname]) + '.jpg')) 
    
    
    #-------------------------Model Ranking Ability-------------------------
    def modelRank(data, response, model, show = 10, pos_label = 1, neg_label = 0):
        '''
        Model Ranking Ability.
        
        Parameters:
        -----------
        data : pandas DataFrame with all possible predictors and response.
    
        response: string, name of response column in data.
        
        model: result from RandomizedSearchCV / GridSearchCV.
        
        show: int, the number of sections to be shown in the table.
        
        pos_label: int, for binary classification this represents positive label.
        
        neg_label: int, for binary classification this represents negative label.
        -----------
        '''       
        
        prob = data[[response]].copy()  # copy to avoid modifying the caller's DataFrame
        prob['prob'] = model.predict_proba(data.drop(response, axis = 1).values)[:, 1]
        prob.sort_values('prob', ascending = False, inplace = True)
        prob.reset_index(drop = True, inplace = True)
        prob['seq'] = list(range(len(prob)))
        
        # Get quantile of data
        quantile_limit = prob.seq.quantile([i/show for i in range(1, (show + 1))], 
                                                interpolation = 'higher')
        quantile_limit.reset_index(drop = True, inplace = True)
        
        rank_df = pd.DataFrame()
        
        for i in quantile_limit.index.values:
            rank_df_tmp = pd.DataFrame()
            if i == 0:
                rank_df_tmp['Bad'] = [sum(prob.loc[: quantile_limit[i], response] == pos_label)]
                rank_df_tmp['Good'] = [sum(prob.loc[: quantile_limit[i], response] == neg_label)]
                rank_df_tmp['Total'] = [int(rank_df_tmp['Bad'] + rank_df_tmp['Good'])]
                rank_df_tmp['Bad_Rate'] = ['{:.4f}'.format(round(float(rank_df_tmp['Bad'] / rank_df_tmp['Total']), 4))]
                rank_df_tmp['Min_Prob'] = [prob.loc[: quantile_limit[i], 'prob'].min()]
                rank_df_tmp['Max_Prob'] = [prob.loc[: quantile_limit[i], 'prob'].max()]
                
                rank_df = rank_df.append(rank_df_tmp)
                
            else:
                rank_df_tmp['Bad'] = [sum(prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], response] == pos_label)]
                rank_df_tmp['Good'] = [sum(prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], response] == neg_label)]
                rank_df_tmp['Total'] = [int(rank_df_tmp['Bad'] + rank_df_tmp['Good'])]
                rank_df_tmp['Bad_Rate'] = ['{:.4f}'.format(round(float(rank_df_tmp['Bad'] / rank_df_tmp['Total']), 4))]
                rank_df_tmp['Min_Prob'] = [prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], 'prob'].min()]
                rank_df_tmp['Max_Prob'] = [prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], 'prob'].max()]
                
                rank_df = rank_df.append(rank_df_tmp)
                
        rank_df['Cum_Bad_Num'] = rank_df.Bad.cumsum()
        rank_df['Cum_Bad_Pct'] = rank_df['Cum_Bad_Num'] / rank_df['Bad'].sum()
        rank_df.drop('Cum_Bad_Num', axis = 1, inplace = True)
        rank_df['Cum_Bad_Pct'] = rank_df['Cum_Bad_Pct'].map(lambda x: '{:.4f}'.format(round(x, 4)))
         
        total = pd.DataFrame({'Bad' : [rank_df['Bad'].sum()],
                              'Good' : [rank_df['Good'].sum()],
                              'Total' : [rank_df['Total'].sum()],
                              'Bad_Rate' : ['{:.4f}'.format(round((rank_df['Bad'].sum() / rank_df['Total'].sum()), 4))],
                              'Min_Prob' : [rank_df['Min_Prob'].min()],
                              'Max_Prob' : [rank_df['Max_Prob'].max()],
                              'Cum_Bad_Pct' : ['1.0000']}) 
        
        rank_df = rank_df.append(total)
        rank_df.reset_index(drop = True, inplace = True)
        
        rank_df['Min_Prob'] = rank_df['Min_Prob'].map(lambda x: '{:.4f}'.format(round(x, 4)))
        rank_df['Max_Prob'] = rank_df['Max_Prob'].map(lambda x: '{:.4f}'.format(round(x, 4)))
        
        # Output ordered columns
        return(rank_df[['Min_Prob', 'Max_Prob', 'Bad', 'Good',
                        'Total', 'Bad_Rate', 'Cum_Bad_Pct']])
        
                
    #-------------------------Population Stability Index(PSI)-------------------------
    def PSI(data_a, data_b, response, model, show = 10):
        '''
        Calculate Population Stability Index(PSI).
        
        Parameters:
        -----------
        data_a & data_b: pandas DataFrame, have same columns and only have numeric columns.
    
        response: string, name of response column in data.
        
        model: result from RandomizedSearchCV / GridSearchCV.
        
        show: int, the number of sections to be split in the table.
        -----------
        '''
        
        # Get probability from base data and predict data
        prob_base = pd.Series(model.predict_proba(data_a.drop(response, axis = 1).values)[:, 1])
        quantile_limit = prob_base.quantile([i/show for i in range(1, (show + 1))], 
                                             interpolation = 'higher')
        quantile_limit.reset_index(drop = True, inplace = True)
        
        prob_pred = pd.Series(model.predict_proba(data_b.drop(response, axis = 1).values)[:, 1])
        
        # base and predict list
        base_list = []
        pred_list = []
        
        # Orignal  
        for i in quantile_limit.index.values:
            if i == 0:
                base_list.append((sum(prob_base <= quantile_limit[i]) / len(prob_base)))
                pred_list.append((sum(prob_pred <= quantile_limit[i]) / len(prob_pred)))
                
            else:
                base_list.append((sum((prob_base > quantile_limit[i - 1]) & (prob_base <= quantile_limit[i])) / len(prob_base)))
                pred_list.append((sum((prob_pred > quantile_limit[i - 1]) & (prob_pred <= quantile_limit[i])) / len(prob_pred)))
                
        # Deal with 0 in base_list & pred_list    
        psi = '{:.4f}%'.format((round(sum([np.inf if (y == 0 or t == 0) else ((t - y) * math.log(t / y)) for t, y in zip(pred_list, base_list)]), 
                                8) * 100))
        
        print('\n{} sections PSI is: {}'.format(show, psi))
        
        if float(psi[:(len(psi) - 1)]) > 10:  # strip the trailing '%' and compare with a 10% threshold
            print('''
                  ************************************
                  Warning: Beware of the large PSI !!!
                  ************************************
                  ''')
            
        print('\n')
            
    #**************************************************    
    #    # For cooperate Xu Min
    #    for i in quantile_limit.index.values:
    #        if i == 0:       
    #            base_list.append((sum(prob_base < quantile_limit[i]) / len(prob_base)))
    #            pred_list.append((sum(prob_pred < quantile_limit[i]) / len(prob_pred)))
    #            
    #        elif i == 9:       
    #            base_list.append((sum((prob_base >= quantile_limit[i - 1]) & (prob_base <= quantile_limit[i])) / len(prob_base)))
    #            pred_list.append((sum((prob_pred >= quantile_limit[i - 1]) & (prob_pred <= quantile_limit[i])) / len(prob_pred)))
    #            
    #        else:
    #            base_list.append((sum((prob_base >= quantile_limit[i - 1]) & (prob_base < quantile_limit[i])) / len(prob_base)))
    #            pred_list.append((sum((prob_pred >= quantile_limit[i - 1]) & (prob_pred < quantile_limit[i])) / len(prob_pred)))
    #        
    #    psi = '{:.4f}%'.format((round(sum([0 if y == 0 or t == 0 else ((t - y) * math.log(t / y)) for t, y in zip(pred_list, base_list)]), 
    #                            8) * 100))
    #**************************************************  
           
        return(psi)    
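
    A toy illustration of the PSI formula used above, PSI = sum over bins of (pred% - base%) * ln(pred% / base%), with made-up bin proportions (not real model output):

    import math
    base_list = [0.10] * 10                                                     # baseline share per score decile
    pred_list = [0.08, 0.09, 0.10, 0.11, 0.12, 0.10, 0.10, 0.10, 0.10, 0.10]    # share of the new population per decile
    psi = sum((t - y) * math.log(t / y) for t, y in zip(pred_list, base_list))
    print('{:.4f}%'.format(psi * 100))   # about 1.01%, far below the 10% warning threshold used above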

    Part 4: Example (Template.py)

    # -*- coding: utf-8 -*-
    """
    Created on Tue Nov 21 20:57:44 2017
    
    @author: Hin
    """
    
    
    %load_ext autoreload
    %autoreload 2
    
    import pandas as pd
    import numpy as np
    from scipy.stats import randint as sp_randint
    from Data_Processing import trainTestSplitV2, varThreshold, replaceOutlierNScale, featureSelectFromModel
    from Multiple_Model_Selection import randomSearchCVTraining, gridSearchCVTraining, trainModelSequence, bestModel, featureReduce
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from xgboost.sklearn import XGBClassifier
    from Model_Evaluation import featureImpGraph, aucKSGraph, modelRank, PSI
    import joblib
    from sklearn_pandas import DataFrameMapper
    from sklearn2pmml import sklearn2pmml, PMMLPipeline
    
    #-------------------------Set Label-------------------------
    # Read csv with chinese characters and rename y as flag
    dataset = pd.read_csv('ZZZ_test_purpose.csv', header = 0, encoding = 'gb18030')
    dataset.rename(columns = {'def30_dup': 'flag'}, inplace = True)
    
    # Get number of rows and columns of data
    print("Number of Rows: ", dataset.shape[0])
    print("Number of Columns: ", dataset.shape[1])
    
    # Show missing values for each column
    dataset.isnull().sum()
    print('The Number of Missing Values: ', dataset.isnull().sum().sum())
    
    # Select all the numeric columns from the data
    # For test purpose, only deal with numeric data for the rest
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    dataset = dataset.select_dtypes(include = numerics)
    
    
    #-------------------------Missing Imputation-------------------------
    # Use a constant value for missing imputation
    dataset.replace([np.nan, np.inf, -np.inf], -1, inplace = True)
    
    
    #-------------------------Train Test Split-------------------------
    train, test = trainTestSplitV2(dataset, 'flag')
    
    
    #-------------------------Feature Engineering and Feature Selection-------------------------
    # Deal with categorical variables (one-hot encoding): pd.get_dummies(df)
    
    # Removes all 0 variance features
    elmi_col = varThreshold(dataset, train)
    train.drop(elmi_col, axis = 1, inplace = True)
    test.drop(elmi_col, axis = 1, inplace = True)
    
    # Obtain all the important features based on certain threshold
    imp_col =featureSelectFromModel(train, 'flag', figname = 'Top 10 Important Features', 
                                    n_tree = 500, n_core = -1, rdm_state = None, thd = 'median', 
                                    word = 1, show = 10, figsize = (12, 10))
    train = train[(imp_col + ['flag'])]
    test = test[(imp_col + ['flag'])]
    
    
    #-------------------------Replacing Outliers and Scaling-------------------------
    # Change test set first owing to the logic of the replaceOutlierNScale function
    
    # Ignore nan when scale the data
    train.replace(-1, np.nan, inplace = True)
    test.replace(-1, np.nan, inplace = True)
    
    quantile_limit, test = replaceOutlierNScale(train, test, 'flag', low = 0.02, high = 0.98, scale = True)
    quantile_limit, train = replaceOutlierNScale(train, train, 'flag', low = 0.02, high = 0.98, scale = True)
    
    # Fill nan data
    train.replace([np.nan, np.inf, -np.inf], -1, inplace = True)
    test.replace([np.nan, np.inf, -np.inf], -1, inplace = True)
    
    
    
    #-------------------------Model Selection-------------------------
    # Use same interface for several models
    # classifiers consists of name, algorithm, parameters set
    classifiers = []
    
    ## Logistic regression
    #logreg_param = {'C': list(np.power(10.0, np.arange(-10, 10))),'penalty': ['l1','l2']}
    #classifiers.append(['Logistic Regression', LogisticRegression(), logreg_param])
    
    # SVM
    svm_para = {'kernel':['linear','rbf'],
                'C': list(np.power(10.0, np.arange(-10, 3))),
                'gamma': list(np.logspace(-4,0,5)) + ['auto']}
    classifiers.append(['SVM', SVC(probability = True), svm_para])
    
    # Random Forest
    rf_param = {'n_estimators': list(np.arange(600, 1100, 100)), 
                'max_depth': [3, 10, None],
                'max_features': [i/10 for i in range(1, 10)],
                'min_samples_split': sp_randint(2, 10),
                'min_samples_leaf': sp_randint(1, 10)}
    classifiers.append(['RandomForest', RandomForestClassifier(), rf_param])
    
    
    # XGBoost
    xgb_param = {'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
                 'n_estimators': list(range(100, 1100, 100)), # Number of trees
                 'max_depth': list(range(1, 10, 2)), # Max depth of trees
                 'gamma': list(np.arange(0, 0.5, 0.05)), # Minimum loss reduction required to make a further partition on a leaf node of the tree
                 'subsample': [i/10 for i in range(5, 11)], # subsample ratio of the training data, row-wise
                 'colsample_bytree': [i/10 for i in range(3, 11)], # subsample ratio of columns when constructing each tree
                 'colsample_bylevel': [i/10 for i in range(1, 11)], # subsample ratio of columns for each split
                 'reg_alpha': ([0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5] + list(np.arange(1, 10, 1)) + list(np.arange(10, 110, 10))), # L1 regularization term on weights
                 'reg_lambda' : (list(np.arange(1, 10, 1)) + list(np.arange(10, 60, 10))), # L2 regularization term on weights
                 'min_child_weight': sp_randint(1, 6),  # Defines the minimum sum of weights of all observations required in a child
                 'max_delta_step': sp_randint(0, 11), # In maximum delta step we allow each tree’s weight estimation to be
                 'random_state': sp_randint(0, 1000)} 
    xgb = XGBClassifier(objective = 'binary:logistic', missing = None)
    classifiers.append(['XGB', xgb, xgb_param])
    
    model_set = trainModelSequence(train, 'flag', classifiers, method = randomSearchCVTraining,
                                   iternum = 10, evalmetric = 'roc_auc', n_core = 4, fold = 5, word = 1)
    
    # Get best CV estimator
    best_model = bestModel(model_set)
    
    
    #-------------------------Feature Reduction-------------------------
    # Reduce features from the best model above
    # classifiers consists of name, algorithm, parameters set
    classifiers = []
    
    
    # XGBoost1
    xgb_param = {'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
                 'n_estimators': list(range(100, 1100, 100)), # Number of trees
                 'max_depth': list(range(1, 10, 2)), # Max depth of trees
                 'gamma': list(np.arange(0, 0.5, 0.05)), # Minimum loss reduction required to make a further partition on a leaf node of the tree
                 'subsample': [i/10 for i in range(5, 11)], # subsample ratio of the training data, row-wise
                 'colsample_bytree': [i/10 for i in range(3, 11)], # subsample ratio of columns when constructing each tree
                 'colsample_bylevel': [i/10 for i in range(1, 11)], # subsample ratio of columns for each split
                 'reg_alpha': ([0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5] + list(np.arange(1, 10, 1)) + list(np.arange(10, 110, 10))), # L1 regularization term on weights
                 'reg_lambda' : (list(np.arange(1, 10, 1)) + list(np.arange(10, 60, 10))), # L2 regularization term on weights
                 'min_child_weight': sp_randint(1, 6),  # Defines the minimum sum of weights of all observations required in a child
                 'max_delta_step': sp_randint(0, 11), # In maximum delta step we allow each tree’s weight estimation to be
                 'random_state': sp_randint(0, 1000)} 
    xgb = XGBClassifier(objective = 'binary:logistic', missing = None)
    classifiers.append(['XGB1', xgb, xgb_param])
    
    # XGBoost2
    classifiers.append(['XGB2', xgb, xgb_param])
    
    # XGBoost3
    classifiers.append(['XGB3', xgb, xgb_param])
    
    # Pick one model that has good performance and less features in xgb_models
    xgb_models = featureReduce(train, 'flag', classifiers, best_model['XGB'], method = randomSearchCVTraining,
                               iternum = 10, evalmetric = 'roc_auc', n_core = 4, fold = 5, word = 1)
    
    
    
    #-------------------------Create Graph, Ranking and PSI-------------------------
    # Feature Importance 
    featureImpGraph(train[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'], 
                    figname = 'XGB2 Feature Importance', 
                    show = 10, figsize = (12, 10))
    
    # AUC and KS Graph
    # For training set
    aucKSGraph(train, 'flag', xgb_models[1]['XGB2'].predict_proba(np.array(train[xgb_models[1]['feature']].drop('flag', axis = 1)))[:, 1],
               pos_label = 1, model_name = 'Training', figsize = (12, 10))
    
    # For testing set
    aucKSGraph(test, 'flag', xgb_models[1]['XGB2'].predict_proba(np.array(test[xgb_models[1]['feature']].drop('flag', axis = 1)))[:, 1],
               pos_label = 1, model_name = 'Testing', figsize = (12, 10))
    
    # Ranking
    rank_train = modelRank(train[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'], show = 20)
    rank_test = modelRank(test[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'], show = 20)
    
    # PSI 
    PSI(train[xgb_models[1]['feature']], test[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'])
    
    
    #-------------------------Save Model as PKL-------------------------
    joblib.dump(xgb_models[1]['XGB2'], 'XGB2oost_xgb_models.pkl', compress = 3)
    # XGB2_best = joblib.load("XGB2oost_XGB2_models.pkl") 
    
    #-------------------------Save Model as PMML-------------------------
    # XGB to PMML 
    # xgb_models[1]['XGB2'].best_params_: Get best parameters from model
    # xgb_models[1]['XGB2'].best_estimator_: Estimator that was chosen by the search
    xgb_pipeline = PMMLPipeline([  
        ("mapper", DataFrameMapper([(i, None) for i in xgb_models[1]['feature'][:(len(xgb_models[1]['feature']) - 1)]])),    
        ("classifier", xgb_models[1]['XGB2'].best_estimator_)])    
    
    # xgb_pipeline is a model which can also be used to predict    
    xgb_pipeline.fit(train[xgb_models[1]['feature']].drop('flag', axis = 1), train[xgb_models[1]['feature']].flag)  
    
    # PMML Transfer
    sklearn2pmml(xgb_pipeline, "xgb.pmml", with_repr = True) 
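
    A minimal scoring sketch (the CSV file name below is hypothetical): new data must go through the same preprocessing as the training set (missing-value fill, cap/floor with the training quantile limits, min-max scaling) before predict_proba is called.

    #-------------------------Score New Data (sketch)-------------------------
    loaded = joblib.load('XGB2oost_xgb_models.pkl')                                    # the search object saved above
    new_data = pd.read_csv('new_applications.csv', header = 0, encoding = 'gb18030')   # hypothetical scoring file
    feature_cols = [c for c in xgb_models[1]['feature'] if c != 'flag']                # reduced feature list without the label
    # ...apply the same imputation / replaceOutlierNScale steps to new_data[feature_cols] here...
    scores = loaded.predict_proba(new_data[feature_cols].values)[:, 1]                 # probability of flag == 1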
  • Original article: https://www.cnblogs.com/cgmcoding/p/13590573.html