zoukankan      html  css  js  c++  java
  • Variance Inflation Factor (VIF) 方差膨胀因子解释_附python脚本

    python信用评分卡(附代码,博主录制)

    https://etav.github.io/python/vif_factor_python.html

    Colinearity is the state where two variables are highly correlated and contain similiar information about the variance within a given dataset. To detect colinearity among variables, simply create a correlation matrix and find variables with large absolute values. In R use the corr function and in python this can by accomplished by using numpy's corrcoeffunction.

    Multicolinearity on the other hand is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model. To make matters worst multicolinearity can emerge even when isolated pairs of variables are not colinear.

    A common R function used for testing regression assumptions and specifically multicolinearity is "VIF()" and unlike many statistical concepts, its formula is straightforward:

    $$ V.I.F. = 1 / (1 - R^2). $$

    The Variance Inflation Factor (VIF) is a measure of colinearity among predictor variables within a multiple regression. It is calculated by taking the the ratio of the variance of all a given model's betas divide by the variane of a single beta if it were fit alone.

    Steps for Implementing VIF

    1. Run a multiple regression.
    2. Calculate the VIF factors.
    3. Inspect the factors for each predictor variable, if the VIF is between 5-10, multicolinearity is likely present and you should consider dropping the variable.
    #Imports
    import pandas as pd
    import numpy as np
    from patsy import dmatrices
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    
    df = pd.read_csv('loan.csv')
    df.dropna()
    df = df._get_numeric_data() #drop non-numeric cols
    
    df.head()
    
     idmember_idloan_amntfunded_amntfunded_amnt_invint_rateinstallmentannual_incdtidelinq_2yrs...total_bal_ilil_utilopen_rv_12mopen_rv_24mmax_bal_bcall_utiltotal_rev_hi_liminq_fitotal_cu_tlinq_last_12m
    0 1077501 1296599 5000.0 5000.0 4975.0 10.65 162.87 24000.0 27.65 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    1 1077430 1314167 2500.0 2500.0 2500.0 15.27 59.83 30000.0 1.00 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2 1077175 1313524 2400.0 2400.0 2400.0 15.96 84.33 12252.0 8.72 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    3 1076863 1277178 10000.0 10000.0 10000.0 13.49 339.31 49200.0 20.00 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    4 1075358 1311748 3000.0 3000.0 3000.0 12.69 67.79 80000.0 17.94 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

    5 rows × 51 columns

    df = df[['annual_inc','loan_amnt', 'funded_amnt','annual_inc','dti']].dropna() #subset the dataframe
    

    Step 1: Run a multiple regression

    %%capture
    #gather features
    features = "+".join(df.columns - ["annual_inc"])
    
    # get y and X dataframes based on this regression:
    y, X = dmatrices('annual_inc ~' + features, df, return_type='dataframe')
    

    Step 2: Calculate VIF Factors

    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    

    Step 3: Inspect VIF Factors

    vif.round(1)
    
     VIF Factorfeatures
    0 5.1 Intercept
    1 1.0 dti
    2 678.4 funded_amnt
    3 678.4 loan_amnt

    As expected, the total funded amount for the loan and the amount of the loan have a high variance inflation factor because they "explain" the same variance within this dataset. We would need to discard one of these variables before moving on to model building or risk building a model with high multicolinearity.

    https://study.163.com/course/courseMain.htm?courseId=1005988013&share=2&shareId=400000000398149

  • 相关阅读:
    【原创】使用开源libimobiledevice盗取iphone信息
    【原创】Arduino制作Badusb实践
    【原创】Aduino小车玩法全记录
    【原创】Arduino入门基础知识总结
    【原创】Arduino、arm、树莓派与单片机
    【原创】PM3破解IC卡记录
    【转】反编译D-Link路由器固件程序并发现后门
    DDOS分布式拒绝服务
    XSS 初识
    针对企业级别渗透测试流程
  • 原文地址:https://www.cnblogs.com/webRobot/p/10889525.html
Copyright © 2011-2022 走看看