  • kaggle house price

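    Getting started with Kaggle competitions

    • For people who are new to machine learning, Kaggle is a large platform for learning and for competing against participants from all over the world. It hosts several entry-level competitions where beginners can try out their skills.

    • This post works through the house price prediction competition, from EDA and feature engineering to model tuning, and shares a few small competition tricks along the way. I hope it is of some help. My own background is modest, so corrections are welcome.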

    from IPython.display import HTML
    from IPython.display import Image
    
    HTML('''<script>
    code_show=true; 
    function code_toggle() {
     if (code_show){
     $('div.input').hide();
     } else {
     $('div.input').show();
     }
     code_show = !code_show
    } 
    $( document ).ready(code_toggle);
    </script>
    <form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
    

    Import the commonly used data analysis and modeling libraries

    import pandas as pd
    import numpy as np
    
    • The files in the current directory can be listed with !ls
    !ls
    
    data_description.txt
    data_description.zip
    kaggle house price.ipynb
    sample_submission.csv
    stacking-house-prices-walkthrough-to-top-5.ipynb
    test.csv
    train.csv
    
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    
    train.head()
    
    Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
    0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
    1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
    2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
    3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
    4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

    5 rows × 81 columns

    train.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1460 entries, 0 to 1459
    Data columns (total 81 columns):
    Id               1460 non-null int64
    MSSubClass       1460 non-null int64
    MSZoning         1460 non-null object
    LotFrontage      1201 non-null float64
    LotArea          1460 non-null int64
    Street           1460 non-null object
    Alley            91 non-null object
    LotShape         1460 non-null object
    LandContour      1460 non-null object
    Utilities        1460 non-null object
    LotConfig        1460 non-null object
    LandSlope        1460 non-null object
    Neighborhood     1460 non-null object
    Condition1       1460 non-null object
    Condition2       1460 non-null object
    BldgType         1460 non-null object
    HouseStyle       1460 non-null object
    OverallQual      1460 non-null int64
    OverallCond      1460 non-null int64
    YearBuilt        1460 non-null int64
    YearRemodAdd     1460 non-null int64
    RoofStyle        1460 non-null object
    RoofMatl         1460 non-null object
    Exterior1st      1460 non-null object
    Exterior2nd      1460 non-null object
    MasVnrType       1452 non-null object
    MasVnrArea       1452 non-null float64
    ExterQual        1460 non-null object
    ExterCond        1460 non-null object
    Foundation       1460 non-null object
    BsmtQual         1423 non-null object
    BsmtCond         1423 non-null object
    BsmtExposure     1422 non-null object
    BsmtFinType1     1423 non-null object
    BsmtFinSF1       1460 non-null int64
    BsmtFinType2     1422 non-null object
    BsmtFinSF2       1460 non-null int64
    BsmtUnfSF        1460 non-null int64
    TotalBsmtSF      1460 non-null int64
    Heating          1460 non-null object
    HeatingQC        1460 non-null object
    CentralAir       1460 non-null object
    Electrical       1459 non-null object
    1stFlrSF         1460 non-null int64
    2ndFlrSF         1460 non-null int64
    LowQualFinSF     1460 non-null int64
    GrLivArea        1460 non-null int64
    BsmtFullBath     1460 non-null int64
    BsmtHalfBath     1460 non-null int64
    FullBath         1460 non-null int64
    HalfBath         1460 non-null int64
    BedroomAbvGr     1460 non-null int64
    KitchenAbvGr     1460 non-null int64
    KitchenQual      1460 non-null object
    TotRmsAbvGrd     1460 non-null int64
    Functional       1460 non-null object
    Fireplaces       1460 non-null int64
    FireplaceQu      770 non-null object
    GarageType       1379 non-null object
    GarageYrBlt      1379 non-null float64
    GarageFinish     1379 non-null object
    GarageCars       1460 non-null int64
    GarageArea       1460 non-null int64
    GarageQual       1379 non-null object
    GarageCond       1379 non-null object
    PavedDrive       1460 non-null object
    WoodDeckSF       1460 non-null int64
    OpenPorchSF      1460 non-null int64
    EnclosedPorch    1460 non-null int64
    3SsnPorch        1460 non-null int64
    ScreenPorch      1460 non-null int64
    PoolArea         1460 non-null int64
    PoolQC           7 non-null object
    Fence            281 non-null object
    MiscFeature      54 non-null object
    MiscVal          1460 non-null int64
    MoSold           1460 non-null int64
    YrSold           1460 non-null int64
    SaleType         1460 non-null object
    SaleCondition    1460 non-null object
    SalePrice        1460 non-null int64
    dtypes: float64(3), int64(35), object(43)
    memory usage: 924.0+ KB
    
    print(train.shape)
    print(test.shape)
    
    (1460, 81)
    (1459, 80)
    
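    • The data is structured much like the Boston housing price data. It has 79 features describing each house; the meaning of each field can be looked up in the data description.
    • This post also covers how the missing values are handled.
    • PoolQC 7 non-null object
    • Fence 281 non-null object
    • MiscFeature 54 non-null object. These three features are missing in most rows; how they are handled is described later in the post.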

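    Data processing

    Handling outliers

    • An outlier is a value that falls outside the expected range. How to define and handle outliers depends on the person and on the specific problem.
    • Outliers usually lie far from the bulk of the data, so they tend to distort computed results and the apparent distribution of the data.
    • Take the figure below as an example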

    with open ('data_description.txt','r') as f:
        for i in f.readlines():
            print(i)
            break
        
    
    MSSubClass: Identifies the type of dwelling involved in the sale.	
    

    Data fields

    Here's a brief version of what you'll find in the data description file.

    • SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.

    • MSSubClass: The building class

    • MSZoning: The general zoning classification

    • LotFrontage: Linear feet of street connected to property

    • LotArea: Lot size in square feet

    • Street: Type of road access

    • Alley: Type of alley access

    • LotShape: General shape of property

    • LandContour: Flatness of the property

    • Utilities: Type of utilities available

    • LotConfig: Lot configuration

    • LandSlope: Slope of property

    • Neighborhood: Physical locations within Ames city limits

    • Condition1: Proximity to main road or railroad

    • Condition2: Proximity to main road or railroad (if a second is present)

    • BldgType: Type of dwelling

    • HouseStyle: Style of dwelling

    • OverallQual: Overall material and finish quality

    • OverallCond: Overall condition rating

    • YearBuilt: Original construction date

    • YearRemodAdd: Remodel date

    • RoofStyle: Type of roof

    • RoofMatl: Roof material

    • Exterior1st: Exterior covering on house

    • Exterior2nd: Exterior covering on house (if more than one material)

    • MasVnrType: Masonry veneer type

    • MasVnrArea: Masonry veneer area in square feet

    • ExterQual: Exterior material quality

    • ExterCond: Present condition of the material on the exterior

    • Foundation: Type of foundation

    • BsmtQual: Height of the basement

    • BsmtCond: General condition of the basement

    • BsmtExposure: Walkout or garden level basement walls

    • BsmtFinType1: Quality of basement finished area

    • BsmtFinSF1: Type 1 finished square feet

    • BsmtFinType2: Quality of second finished area (if present)

    • BsmtFinSF2: Type 2 finished square feet

    • BsmtUnfSF: Unfinished square feet of basement area

    • TotalBsmtSF: Total square feet of basement area

    • Heating: Type of heating

    • HeatingQC: Heating quality and condition

    • CentralAir: Central air conditioning

    • Electrical: Electrical system

    • 1stFlrSF: First Floor square feet

    • 2ndFlrSF: Second floor square feet

    • LowQualFinSF: Low quality finished square feet (all floors)

    • GrLivArea: Above grade (ground) living area square feet

    • BsmtFullBath: Basement full bathrooms

    • BsmtHalfBath: Basement half bathrooms

    • FullBath: Full bathrooms above grade

    • HalfBath: Half baths above grade

    • Bedroom: Number of bedrooms above basement level

    • Kitchen: Number of kitchens

    • KitchenQual: Kitchen quality

    • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

    • Functional: Home functionality rating

    • Fireplaces: Number of fireplaces

    • FireplaceQu: Fireplace quality

    • GarageType: Garage location

    • GarageYrBlt: Year garage was built

    • GarageFinish: Interior finish of the garage

    • GarageCars: Size of garage in car capacity

    • GarageArea: Size of garage in square feet

    • GarageQual: Garage quality

    • GarageCond: Garage condition

    • PavedDrive: Paved driveway

    • WoodDeckSF: Wood deck area in square feet

    • OpenPorchSF: Open porch area in square feet

    • EnclosedPorch: Enclosed porch area in square feet

    • 3SsnPorch: Three season porch area in square feet

    • ScreenPorch: Screen porch area in square feet

    • PoolArea: Pool area in square feet

    • PoolQC: Pool quality

    • Fence: Fence quality

    • MiscFeature: Miscellaneous feature not covered in other categories

    • MiscVal: $Value of miscellaneous feature

    • MoSold: Month Sold

    • YrSold: Year Sold

    • SaleType: Type of sale

    • SaleCondition: Condition of sale

    • First look at the feature GrLivArea: Above grade (ground) living area square feet, i.e. the living area in square feet

    Removing outliers
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    sns.set(style='white', context='notebook', palette='deep')
    
    plt.subplots(figsize=(15,8))
    plt.subplot(1,2,1)
    g= sns.regplot(x=train['GrLivArea'],y= train['SalePrice'],fit_reg=False).set_title('Before')
    plt.subplot(1,2,2)
    train= train.drop(train[train['GrLivArea']>4000].index)
    g=sns.regplot(x=train['GrLivArea'],y=train['SalePrice'],fit_reg=False).set_title('After')
    

    [Figure: scatter plots of GrLivArea against SalePrice, titled 'Before' and 'After' outlier removal]

    • The figure shows that 4 samples have a living area above 4000 square feet, and all four deviate strongly from the rest of the distribution.
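    The 4000 square foot cutoff above was read off the scatter plot. As a rough cross-check, the sketch below flags outliers with the common 1.5 × IQR rule. The helper name iqr_outlier_mask and the sample values are made up for illustration and are not taken from train.csv.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up living areas in square feet; the last value is deliberately extreme.
area = pd.Series([1200, 1500, 1710, 1786, 2198, 2090, 1694, 5642])

mask = iqr_outlier_mask(area)
cleaned = area[~mask]          # keep only the rows that are not flagged

print(area[mask].tolist())     # the flagged values
print(len(cleaned))            # number of rows kept
```

    A rule like this is only a starting point. Whether a flagged row is really dropped is still a judgment call, as noted above.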
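    Handling missing values
    • Missing values can come from manual data-entry mistakes, instrument errors, and similar problems.
    • In some cases missing values can be filled with 0, provided you know what the feature means and that a missing value really does stand for 0.
    • In practice, filling with 0 is not always the best choice, and different algorithms cope with missing values differently. This post fits the house prices with several algorithms, so the missing values need to be handled properly. There are generally two approaches:
      • Drop the columns that contain missing values
      • Fill in the missing values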
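    The two approaches can be compared on a tiny frame. This is a minimal sketch with made-up values; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame: PoolQC is almost entirely missing, LotFrontage only partly.
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Approach 1: drop every column that contains at least one missing value.
dropped = df.dropna(axis=1)

# Approach 2: fill instead, choosing a value that matches the feature's meaning.
filled = df.copy()
filled["PoolQC"] = filled["PoolQC"].fillna("None")  # missing means "no pool"
filled["LotFrontage"] = filled["LotFrontage"].fillna(filled["LotFrontage"].median())

print(dropped.columns.tolist())     # columns that survive dropping
print(filled.isnull().sum().sum())  # remaining missing values after filling
```

    Dropping loses the partly observed LotFrontage column, while filling keeps all three columns but adds assumptions about what a missing value means.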
    # First record the sizes of the training and test sets for later use
    ntrain = train.shape[0]
    ntest = test.shape[0]
    
    # Keep the training target, SalePrice
    y_train = train.SalePrice.values
    all_data = pd.concat((train,test)).reset_index(drop=True)
    all_data.drop(['SalePrice'],axis=1,inplace=True)
    all_data.drop(['Id'],axis=1,inplace=True)
    print('all data shape:{}'.format(all_data.shape))
    
    all data shape:(2915, 79)
    
    
    
    all_data_na = all_data.isnull().sum()
    
    all_data_na.sort_values(ascending=False)
    
    PoolQC           2907
    MiscFeature      2810
    Alley            2717
    Fence            2345
    FireplaceQu      1420
    LotFrontage       486
    GarageFinish      159
    GarageQual        159
    GarageYrBlt       159
    GarageCond        159
    GarageType        157
    BsmtCond           82
    BsmtExposure       82
    BsmtQual           81
    BsmtFinType2       80
    BsmtFinType1       79
    MasVnrType         24
    MasVnrArea         23
    MSZoning            4
    BsmtHalfBath        2
    Utilities           2
    Functional          2
    BsmtFullBath        2
    Electrical          1
    Exterior2nd         1
    KitchenQual         1
    GarageCars          1
    Exterior1st         1
    GarageArea          1
    TotalBsmtSF         1
                     ... 
    GrLivArea           0
    YearRemodAdd        0
    YearBuilt           0
    WoodDeckSF          0
    TotRmsAbvGrd        0
    Street              0
    ScreenPorch         0
    SaleCondition       0
    RoofStyle           0
    RoofMatl            0
    PoolArea            0
    PavedDrive          0
    OverallQual         0
    OverallCond         0
    OpenPorchSF         0
    Neighborhood        0
    MoSold              0
    MiscVal             0
    MSSubClass          0
    LowQualFinSF        0
    LotShape            0
    LotConfig           0
    LotArea             0
    LandSlope           0
    LandContour         0
    KitchenAbvGr        0
    HouseStyle          0
    HeatingQC           0
    Heating             0
    1stFlrSF            0
    Length: 79, dtype: int64
    
    all_data_na = all_data_na.drop(all_data_na[all_data_na==0].index).sort_values(ascending=False)
    
    plt.subplots(figsize=(12,6))
    all_data_na.plot(kind='bar')
    
    <matplotlib.axes._subplots.AxesSubplot at 0x128568710>
    

    [Figure: bar chart of missing-value counts per feature]
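    Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a count and percentage table on a small made-up frame; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with column names borrowed from the dataset.
df = pd.DataFrame({
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing_count = df.isnull().sum()
missing_pct = (missing_count / len(df) * 100).round(1)

# Keep only the features that actually have missing values, worst first.
summary = pd.DataFrame({"count": missing_count, "percent": missing_pct})
summary = summary[summary["count"] > 0].sort_values("count", ascending=False)
print(summary)
```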

    !pip install xgboost
    
    Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
    Requirement already satisfied: xgboost in /Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages (0.90)
    Requirement already satisfied: numpy in /Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages (from xgboost) (1.16.2)
    Requirement already satisfied: scipy in /Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages (from xgboost) (1.2.1)
    
    train[all_data_na.index[:25]].info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1456 entries, 0 to 1459
    Data columns (total 25 columns):
    PoolQC          5 non-null object
    MiscFeature     54 non-null object
    Alley           91 non-null object
    Fence           280 non-null object
    FireplaceQu     766 non-null object
    LotFrontage     1197 non-null float64
    GarageQual      1375 non-null object
    GarageCond      1375 non-null object
    GarageFinish    1375 non-null object
    GarageYrBlt     1375 non-null float64
    GarageType      1375 non-null object
    BsmtExposure    1418 non-null object
    BsmtCond        1419 non-null object
    BsmtQual        1419 non-null object
    BsmtFinType2    1418 non-null object
    BsmtFinType1    1419 non-null object
    MasVnrType      1448 non-null object
    MasVnrArea      1448 non-null float64
    MSZoning        1456 non-null object
    BsmtFullBath    1456 non-null int64
    BsmtHalfBath    1456 non-null int64
    Utilities       1456 non-null object
    Functional      1456 non-null object
    Electrical      1455 non-null object
    BsmtUnfSF       1456 non-null int64
    dtypes: float64(3), int64(3), object(19)
    memory usage: 295.8+ KB
    
    • For categorical features where a missing value means the item is absent, we fill the missing values with "None".
    • For numeric features where a missing value means the item is absent (garage and basement measurements, for example), we fill with 0.
    • For features with only a few missing values, we fill with the mode.
    • For LotFrontage, which has many missing values, we fill with the median of the feature within each neighborhood.
    for col in ("PoolQC", 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageQual', 'GarageCond',
                'GarageFinish', 'GarageType','BsmtExposure','BsmtCond','BsmtQual','BsmtFinType2','BsmtFinType1',
               'MasVnrType'):
        all_data[col] = all_data[col].fillna('None')
    
        
    print('Object-type features: missing values filled with "None" according to the feature descriptions. Done.')

    for col in ("GarageYrBlt", "GarageArea", "GarageCars", "BsmtFinSF1", 
               "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "MasVnrArea",
               "BsmtFullBath", "BsmtHalfBath"):
        all_data[col] = all_data[col].fillna(0)

    print('Numeric features: missing values filled with 0 according to the feature descriptions. Done.')


    all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
    all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
    all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
    all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
    all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
    all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
    all_data["Functional"] = all_data["Functional"].fillna(all_data['Functional'].mode()[0])

    print('Features with few missing values: missing values filled with the mode of each feature. Done.')

    all_data_na = all_data.isnull().sum()
    print("Features with missing values: ", all_data_na.drop(all_data_na[all_data_na == 0].index))


    Object-type features: missing values filled with "None" according to the feature descriptions. Done.
    Numeric features: missing values filled with 0 according to the feature descriptions. Done.
    Features with few missing values: missing values filled with the mode of each feature. Done.
    Features with missing values:  LotFrontage    486
    Utilities        2
    dtype: int64
    
    all_data.groupby(["Neighborhood"])['LotFrontage'].sum()
    
    Neighborhood
    Blmngtn      938.0
    Blueste      273.0
    BrDale       645.0
    BrkSide     5300.0
    ClearCr     1763.0
    CollgCr    15694.0
    Crawfor     5806.0
    Edwards    11467.0
    Gilbert     8237.0
    IDOTRR      5415.0
    MeadowV      845.0
    Mitchel     6763.0
    NAmes      28204.0
    NPkVill      591.0
    NWAmes      6929.0
    NoRidge     4684.0
    NridgHt    13722.0
    OldTown    14147.0
    SWISU       2599.0
    Sawyer      7306.0
    SawyerW     7491.0
    Somerst    10457.0
    StoneBr     2860.0
    Timber      4626.0
    Veenker     1152.0
    Name: LotFrontage, dtype: float64
    
    all_data['LotFrontage']=all_data.groupby("Neighborhood")["LotFrontage"].transform(
        lambda x: x.fillna(x.median()))
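    To see what the groupby and transform combination does, here is the same pattern on a small made-up frame with two neighborhoods. Each missing LotFrontage is replaced by the median of its own neighborhood, not by the overall median.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names and neighborhood labels come from the dataset.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "OldTown"],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, 60.0, np.nan],
})

# transform returns a result aligned with the original rows, so it can be
# assigned straight back to the column.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(df)
```

    One caveat: if every value in a group is missing, the group median is itself NaN and the gaps stay unfilled, so a final isnull check is still worth running.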
    
    Analyzing Utilities
    plt.subplots(figsize=(12,5))
    plt.subplot(1,2,1)
    g=sns.countplot(x='Utilities',data=train).set_title('Utilities_train')
    plt.subplot(1,2,2)
    g=sns.countplot(x='Utilities',data=test).set_title('Utilities_test')
    

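    [Figure: count plots of Utilities in the training set ('Utilities_train') and the test set ('Utilities_test')]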

    train['Utilities'].value_counts()
    
    AllPub    1455
    NoSeWa       1
    Name: Utilities, dtype: int64
    
    test['Utilities'].value_counts()
    
    AllPub    1457
    Name: Utilities, dtype: int64
    
    all_data = all_data.drop(['Utilities'], axis=1)
    
    all_data_na = all_data.isnull().sum()
    print("Features with missing values: ", len(all_data_na.drop(all_data_na[all_data_na == 0].index)))
    
    Features with missing values:  0
    

    Exploratory Data Analysis

    Correlation matrix
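    • With outliers and missing values handled, the next step is to look at the relationships among the features and between the features and the target. The correlation matrix gives a reference for how strongly each feature relates to the target.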
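    # Restrict to numeric columns; newer pandas versions raise on object columns
    corr = train.select_dtypes(include=[np.number]).corr()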
    plt.subplots(figsize=(30,30))
    cmap = sns.diverging_palette(150, 250, as_cmap=True)
    sns.heatmap(corr, cmap="RdYlBu", vmax=1, vmin=-0.6, center=0.2, square=True, linewidths=0, cbar_kws={"shrink": .5}, annot = True)
    
    <matplotlib.axes._subplots.AxesSubplot at 0x12901bc18>
    

    [Figure: annotated heatmap of the correlation matrix]

    • The features that most strongly influence SalePrice are good candidates for feature engineering

    • From the correlation matrix, we pick the features most highly correlated with the final sale price for further analysis

    • The main influencing factors are:

    1. OverallQual: overall material and finish quality
    2. GrLivArea: above grade (ground) living area in square feet
    3. GarageCars: size of garage in car capacity
    4. GarageArea: size of garage in square feet
    5. TotalBsmtSF: total square feet of basement area
    6. 1stFlrSF: first floor square feet
    7. FullBath: full bathrooms above grade
    8. TotRmsAbvGrd: total rooms above grade (does not include bathrooms)
    9. Fireplaces: number of fireplaces
    10. MasVnrArea: masonry veneer area in square feet
    11. BsmtFinSF1: Type 1 finished square feet (finished basement area)
    12. LotFrontage: linear feet of street connected to property
    13. WoodDeckSF: wood deck area in square feet
    14. OpenPorchSF: open porch area in square feet
    15. 2ndFlrSF: second floor square feet
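    A list like this can also be read off programmatically instead of from the heatmap. The sketch below ranks numeric features by the absolute value of their correlation with the target. The data is synthetic (seeded random numbers), so the resulting numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: SalePrice depends strongly on OverallQual, weakly on
# GrLivArea, and not at all on Noise. Street is a non-numeric column.
rng = np.random.default_rng(0)
n = 200
quality = rng.integers(1, 11, size=n)
area = rng.normal(1500, 400, size=n)

df = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, size=n),
    "Street": ["Pave"] * n,
    "SalePrice": 20000 * quality + 50 * area + rng.normal(0, 10000, size=n),
})

# Correlation is only defined for numeric columns, so select those first.
numeric = df.select_dtypes(include=[np.number])
ranking = (
    numeric.corr()["SalePrice"]
    .drop("SalePrice")            # the target trivially correlates with itself
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

    Pearson correlation only captures linear relationships, which is one reason the post goes on to add squared, cubed, and square-root versions of these features.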
    # The 15 features listed above, in the same order
    poly_features = ["OverallQual", "GrLivArea", "GarageCars", "GarageArea", "TotalBsmtSF",
                     "1stFlrSF", "FullBath", "TotRmsAbvGrd", "Fireplaces", "MasVnrArea",
                     "BsmtFinSF1", "LotFrontage", "WoodDeckSF", "OpenPorchSF", "2ndFlrSF"]

    # Quadratic
    for col in poly_features:
        all_data[col + "-2"] = all_data[col] ** 2
    print("Quadratics done!...")

    # Cubic
    for col in poly_features:
        all_data[col + "-3"] = all_data[col] ** 3
    print("Cubics done!...")

    # Square Root
    for col in poly_features:
        all_data[col + "-Sq"] = np.sqrt(all_data[col])
    print("Roots done!...")

    Quadratics done!...
    Cubics done!...
    Roots done!...
    
    BsmtQual
    train['BsmtQual'].value_counts()
    
    TA    649
    Gd    618
    Ex    117
    Fa     35
    Name: BsmtQual, dtype: int64
    
    train.groupby(['BsmtQual'])['SalePrice'].mean()
    """
    BsmtQual: Evaluates the height of the basement
    
           Ex	Excellent (100+ inches)	
           Gd	Good (90-99 inches)
           TA	Typical (80-89 inches)
           Fa	Fair (70-79 inches)
           Po	Poor (<70 inches
           NA	No Basement
    """
    
    
    plt.subplots(figsize=(20,6))
    plt.subplot(1,3,1)  # box plot
    sns.boxplot(x='BsmtQual',y='SalePrice',data=train,order= ['Fa', 'TA', 'Gd', 'Ex'])


    plt.subplot(1,3,2)  # strip plot: one column of points per category on the x-axis
    sns.stripplot(x='BsmtQual',y='SalePrice',data=train,size=5,jitter=True,order= ['Fa', 'TA', 'Gd', 'Ex'])


    plt.subplot(1,3,3)  # bar plot of the mean SalePrice per category
    sns.barplot(x='BsmtQual',y='SalePrice',data=train,order= ['Fa', 'TA', 'Gd', 'Ex'],estimator=np.mean)
    
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1263d5e10>
    

    [Figure: box plot, strip plot, and bar plot of SalePrice by BsmtQual]

    all_data['BsmtQual'] = all_data['BsmtQual'].map({"None":0, "Fa":1, "TA":2, "Gd":3, "Ex":4})
    all_data['BsmtQual'].unique()
    
    array([3, 2, 4, 0, 1])
    
    all_data['BsmtQual'].value_counts()
    
    2    1283
    3    1209
    4     254
    1      88
    0      81
    Name: BsmtQual, dtype: int64
    
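    • This feature clearly has a marked effect on the sale price: the higher the basement, the higher the price.
    • The Typical and Good categories contain most of the samples.
    • The values of this feature have a natural order from worse to better, so as an ordinal categorical feature it can be converted to numbers (personally, I don't think this makes much difference).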
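    One thing to watch with a hand-written mapping is that Series.map silently turns any label missing from the dictionary into NaN. The sketch below wraps the mapping in a small check. The helper encode_ordinal and the constant QUALITY_ORDER are hypothetical names introduced here, and because the order includes "Po" the resulting codes differ from the ones produced by the mapping used in this post.

```python
import pandas as pd

# Order from data_description.txt, worst to best. "None" stands for
# "no basement" after the earlier fillna step.
QUALITY_ORDER = ["None", "Po", "Fa", "TA", "Gd", "Ex"]


def encode_ordinal(values: pd.Series, order: list) -> pd.Series:
    """Map ordered labels to 0..len(order)-1 and fail on unknown labels.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    mapping = {label: rank for rank, label in enumerate(order)}
    encoded = values.map(mapping)
    if encoded.isnull().any():
        unknown = values[encoded.isnull()].unique().tolist()
        raise ValueError(f"labels missing from order: {unknown}")
    return encoded.astype(int)


quality = pd.Series(["Gd", "TA", "Ex", "None", "Fa"])
codes = encode_ordinal(quality, QUALITY_ORDER)
print(codes.tolist())
```

    Failing loudly is a design choice: an unmapped label then shows up at encoding time and not later as an unexplained NaN in the model input.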
    BsmtCond
    """
    BsmtCond: Evaluates the general condition of the basement
    
           Ex	Excellent
           Gd	Good
           TA	Typical - slight dampness allowed
           Fa	Fair - dampness or some cracking or settling
           Po	Poor - Severe cracking, settling, or wetness
           NA	No Basement
    """
    
    
    
    
    
    plt.subplots(figsize=(20,5))
    plt.subplot(1,3,1)
    sns.boxplot(x='BsmtCond',y='SalePrice',data=train,order=['Po','Fa','TA','Gd'])
    plt.subplot(1,3,2)
    
    
    sns.stripplot(x='BsmtCond',y='SalePrice',data=train,size=5,jitter=True,order= ['Po','Fa','TA','Gd'])
    
    
    plt.subplot(1,3,3)
    
    
    sns.barplot(x='BsmtCond',y='SalePrice',data=train,order=['Po','Fa','TA','Gd'])
    
    
    
    <matplotlib.axes._subplots.AxesSubplot at 0x12ab8d6d8>
    

    [Figure: box plot, strip plot, and bar plot of SalePrice by BsmtCond]

    train['BsmtCond'].value_counts()
    
    TA    1307
    Gd      65
    Fa      45
    Po       2
    Name: BsmtCond, dtype: int64
    
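    • In the second panel, Typical accounts for most of the samples; the bar plot shows that this feature clearly affects the sale price.
    • In the first panel, prices within the TA category are widely spread.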
    all_data['BsmtCond'] = all_data['BsmtCond'].map({"None":0, "Po":1, "Fa":2, "TA":3,"Gd":4, "Ex":5})
    all_data['BsmtCond'].unique()
    
    array([3, 4, 0, 2, 1])
    
    BsmtExposure
    """
    BsmtExposure: Refers to walkout or garden level walls
    
           Gd	Good Exposure
           Av	Average Exposure (split levels or foyers typically score average or above)	
           Mn	Mimimum Exposure
           No	No Exposure
           NA	No Basement
    
    """
    
    
    plt.subplots(figsize=(20,5))
    plt.subplot(1,3,1)
    sns.boxplot(x='BsmtExposure',y='SalePrice',data=train,order=['No','Mn','Av','Gd'])
    plt.subplot(1,3,2)
    sns.stripplot(x='BsmtExposure',y='SalePrice',data=train,size=5,jitter=True,order= ['No','Mn','Av','Gd'])
    plt.subplot(1,3,3)
    sns.barplot(x='BsmtExposure',y='SalePrice',data=train,order=['No','Mn','Av','Gd'])
    
    <matplotlib.axes._subplots.AxesSubplot at 0x12b8e4470>
    

    [Figure: box plot, strip plot, and bar plot of SalePrice by BsmtExposure]

    all_data['BsmtExposure'] = all_data['BsmtExposure'].map({"None":0, "No":1, "Mn":2, "Av":3,"Gd":4})
    all_data['BsmtExposure'].unique()
    
    array([1, 4, 2, 3, 0])
    
    BsmtFinType1
    """
    BsmtFinType1: Rating of basement finished area
    
           GLQ	Good Living Quarters
           ALQ	Average Living Quarters
           BLQ	Below Average Living Quarters	
           Rec	Average Rec Room
           LwQ	Low Quality
           Unf	Unfinshed
           NA	No Basement
    """
    
    
    plt.subplots(figsize =(20, 5))
    
    plt.subplot(1, 3, 1)
    sns.boxplot(x="BsmtFinType1", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
    
    plt.subplot(1, 3, 2)
    sns.stripplot(x="BsmtFinType1", y="SalePrice", data=train, size = 5, jitter = True, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
    
    plt.subplot(1, 3, 3)
    sns.barplot(x="BsmtFinType1", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
    

    [Figure: box plot, strip plot, and bar plot of SalePrice by BsmtFinType1]

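    • The first panel shows that many houses with an unfinished basement still sell at a high price.
    • The third panel shows that the price does not rise in step with the category order, so there is no clear ordinal relationship between the categories and the sale price.
    • The feature is therefore one-hot encoded, which can be done with the pandas get_dummies function.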
    all_data = pd.get_dummies(all_data, columns = ["BsmtFinType1"], prefix="BsmtFinType1")
    all_data.head(3)
    
    1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... WoodDeckSF-Sq OpenPorchSF-Sq 2ndFlrSF-Sq BsmtFinType1_ALQ BsmtFinType1_BLQ BsmtFinType1_GLQ BsmtFinType1_LwQ BsmtFinType1_None BsmtFinType1_Rec BsmtFinType1_Unf
    0 856 854 0 None 3 1Fam 3 1 706.0 0.0 ... 0.000000 7.810250 29.223278 0 0 1 0 0 0 0
    1 1262 0 0 None 3 1Fam 3 4 978.0 0.0 ... 17.262677 0.000000 0.000000 1 0 0 0 0 0 0
    2 920 866 0 None 3 1Fam 3 2 486.0 0.0 ... 0.000000 6.480741 29.427878 0 0 1 0 0 0 0

    3 rows × 129 columns
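    The effect of get_dummies is easier to see on a small frame. This sketch uses made-up rows and passes dtype=int so that the indicator columns hold 0/1 as in the output above; newer pandas versions return booleans by default.

```python
import pandas as pd

# Made-up rows; only the column names and category labels come from the dataset.
df = pd.DataFrame({
    "BsmtFinType1": ["GLQ", "ALQ", "Unf", "GLQ"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# The original column is replaced by one indicator column per category.
encoded = pd.get_dummies(df, columns=["BsmtFinType1"], prefix="BsmtFinType1", dtype=int)
print(encoded.columns.tolist())
print(encoded["BsmtFinType1_GLQ"].tolist())
```

    Only categories that actually occur get a column, which is one reason the post encodes the concatenated train and test data together.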

    BsmtFinSF1
    • BsmtFinSF1: Type 1 finished square feet
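    from scipy.stats import pearsonr
    grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25) 
    # Create a grid that fixes where each subplot is placed. The number of rows and columns must be given; layout parameters such as left and right can optionally be adjusted.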
    plt.subplots(figsize=(30,15))
    plt.subplot(grid[0,0])
    
    
    g = sns.regplot(x=train['BsmtFinSF1'], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train['BsmtFinSF1'], train['SalePrice'])[0]))
    g.legend(loc='best')
    
    plt.subplot(grid[0,1:])
    
    sns.boxplot(x='Neighborhood',y='BsmtFinSF1',data=train)
    
    plt.subplot(grid[1,0])
    sns.barplot(x='BldgType',y= 'BsmtFinSF1',data=train)
    
    
    plt.subplot(grid[1,1])
    
    sns.barplot(x='HouseStyle',y ='BsmtFinSF1',data=train)
    
    plt.subplot(grid[1,2])
    
    
    sns.barplot(x='LotShape',y='BsmtFinSF1',data=train)
    
    
    <matplotlib.axes._subplots.AxesSubplot at 0x129034e10>
    

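    [Figure: scatter plot of BsmtFinSF1 against SalePrice, box plot of BsmtFinSF1 by Neighborhood, and bar plots of BsmtFinSF1 by BldgType, HouseStyle, and LotShape]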
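    The correlation shown in the legend of the scatter plot is the Pearson coefficient: the covariance of the two variables divided by the product of their standard deviations. The sketch below writes that formula out with NumPy. The helper pearson_r and the five data points are made up for illustration.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation written out from its definition.

    Hypothetical helper for illustration; the post itself uses scipy's pearsonr.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)


finished_area = [0, 486, 706, 978, 1369]                # made-up, square feet
sale_price = [140000, 223500, 208500, 181500, 307000]   # made-up, dollars

r = pearson_r(finished_area, sale_price)
print("corr: %.2f" % r)
```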

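    • The finished basement area has a strong effect on the sale price, but it varies across Neighborhood, BldgType, HouseStyle, and LotShape without any clear pattern.
    • Because the feature is a continuous numeric value, we consider cutting it into bins.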
    bins = [-5,1000,2000,3000,float('inf')]
    all_data['BsmtFinSF1_Band'] = pd.cut(all_data['BsmtFinSF1'], bins,labels=['1','2','3','4'])
    
    all_data['BsmtFinSF1_Band'].unique()
    all_data.drop('BsmtFinSF1',axis=1,inplace=True)
    
    all_data = pd.get_dummies(all_data, columns = ["BsmtFinSF1_Band"], prefix="BsmtFinSF1")
    all_data.head()
    
    1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF2 BsmtFinType2 ... BsmtFinType1_BLQ BsmtFinType1_GLQ BsmtFinType1_LwQ BsmtFinType1_None BsmtFinType1_Rec BsmtFinType1_Unf BsmtFinSF1_1 BsmtFinSF1_2 BsmtFinSF1_3 BsmtFinSF1_4
    0 856 854 0 None 3 1Fam 3 1 0.0 Unf ... 0 1 0 0 0 0 1 0 0 0
    1 1262 0 0 None 3 1Fam 3 4 0.0 Unf ... 0 0 0 0 0 0 1 0 0 0
    2 920 866 0 None 3 1Fam 3 2 0.0 Unf ... 0 1 0 0 0 0 1 0 0 0
    3 961 756 0 None 3 1Fam 4 1 0.0 Unf ... 0 0 0 0 0 0 1 0 0 0
    4 1145 1053 0 None 4 1Fam 3 3 0.0 Unf ... 0 1 0 0 0 0 1 0 0 0

    5 rows × 132 columns
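
As a minimal sketch of the binning step above, the snippet below applies the same bin edges to a small made-up column and then one-hot encodes the resulting band. The values in `demo` are illustrative assumptions, not rows from the competition data.

```python
import pandas as pd

# Toy values standing in for BsmtFinSF1 (assumed for illustration only).
demo = pd.DataFrame({"BsmtFinSF1": [0, 706, 1000, 1500, 2260, 5644]})

# Same edges as in the cell above. Intervals are closed on the right,
# so 1000 falls into band '1'; the lower edge of -5 keeps 0 in the first band.
bins = [-5, 1000, 2000, 3000, float("inf")]
demo["BsmtFinSF1_Band"] = pd.cut(demo["BsmtFinSF1"], bins, labels=["1", "2", "3", "4"])

# One-hot encode the band; get_dummies replaces the band column with indicators.
encoded = pd.get_dummies(demo, columns=["BsmtFinSF1_Band"], prefix="BsmtFinSF1")
print(encoded.columns.tolist())
```

Explicit edges make the bands reproducible on new data, which an integer bin count does not guarantee because its edges depend on the observed minimum and maximum.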

    BsmtFinType2
    """
    BsmtFinType2: Rating of basement finished area (if multiple types)
    
           GLQ	Good Living Quarters
           ALQ	Average Living Quarters
           BLQ	Below Average Living Quarters	
           Rec	Average Rec Room
           LwQ	Low Quality
           Unf	Unfinshed
           NA	No Basement
    
    """
    
    
    plt.subplots(figsize =(20, 5))
    
    plt.subplot(1, 3, 1)
    sns.boxplot(x="BsmtFinType2", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
    
    plt.subplot(1, 3, 2)
    sns.stripplot(x="BsmtFinType2", y="SalePrice", data=train, size = 5, jitter = True, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
    
    plt.subplot(1, 3, 3)
    sns.barplot(x="BsmtFinType2", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
    

    png

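    • For many houses the second basement area is unfinished, and their sale prices vary widely
    • Unlike the previous feature, the finish quality of the second basement area shows no ordinal relationship with price (third plot)
    • So this feature is converted to one-hot dummy variables
    all_data = pd.get_dummies(all_data, columns = ["BsmtFinType2"], prefix="BsmtFinType2")  # the columns argument expects a list-like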
    
    all_data.head(3)
    """
    columns : list-like, default None
    Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
    
    """
    
    
    BsmtFinSF2
    """
    BsmtFinSF2: Type 2 finished square feet
    """
    grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25) 
    # Define the grid geometry on which the subplots are placed: the number of rows and columns is required; layout parameters (e.g. left, right) can optionally be adjusted.
    plt.subplots(figsize=(30,15))
    plt.subplot(grid[0,0])
    
    
    g = sns.regplot(x=train['BsmtFinSF2'], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train['BsmtFinSF2'], train['SalePrice'])[0]))
    g.legend(loc='best')
    
    plt.subplot(grid[0,1:])
    
    sns.boxplot(x='Neighborhood',y='BsmtFinSF2',data=train)
    
    plt.subplot(grid[1,0])
    sns.barplot(x='BldgType',y= 'BsmtFinSF2',data=train)
    
    
    plt.subplot(grid[1,1])
    
    sns.barplot(x='HouseStyle',y ='BsmtFinSF2',data=train)
    
    plt.subplot(grid[1,2])
    
    
    sns.barplot(x='LotShape',y='BsmtFinSF2',data=train)
    
    
    

    png

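    • The finished area of the second basement type has no obvious relationship with the sale price
    • Most houses have no second finished area, so this feature is highly correlated with the previous one (BsmtFinType2)
    • The feature can be reduced to a flag for whether a second finished area exists (similar to replacing a column that has missing values with a missing/not-missing indicator)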
    all_data['BsmtFinType2_None'].value_counts()
    
    0    2835
    1      80
    Name: BsmtFinType2_None, dtype: int64
    
    all_data['BsmtFinSf2_Flag'] = all_data['BsmtFinSF2'].map(lambda x:0 if x==0 else 1)
    all_data.drop('BsmtFinSF2', axis=1, inplace=True)
    
    all_data['BsmtFinSf2_Flag'].value_counts()
    
    0    2568
    1     347
    Name: BsmtFinSf2_Flag, dtype: int64
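
The same 0/1 flag can also be built without a Python-level lambda. The sketch below uses a vectorized comparison on made-up values (an assumption for illustration); for inputs like these it should give the same result as the `map` call above.

```python
import pandas as pd

# Toy column standing in for BsmtFinSF2 (values are illustrative only).
demo = pd.DataFrame({"BsmtFinSF2": [0.0, 0.0, 32.0, 0.0, 668.0]})

# 1 when any second finished area exists, 0 otherwise.
demo["BsmtFinSF2_Flag"] = (demo["BsmtFinSF2"] != 0).astype(int)

counts = demo["BsmtFinSF2_Flag"].value_counts()
print(counts)
```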
    
    BsmtUnfSF
    """
    Unfinished square feet of basement area
    
    """
    grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25) 
    # Define the grid geometry on which the subplots are placed: the number of rows and columns is required; layout parameters (e.g. left, right) can optionally be adjusted.
    plt.subplots(figsize=(30,15))
    plt.subplot(grid[0,0])
    
    
    g = sns.regplot(x=train['BsmtUnfSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train['BsmtUnfSF'], train['SalePrice'])[0]))
    g.legend(loc='best')
    
    plt.subplot(grid[0,1:])
    
    sns.boxplot(x='Neighborhood',y='BsmtUnfSF',data=train)
    
    plt.subplot(grid[1,0])
    sns.barplot(x='BldgType',y= 'BsmtUnfSF',data=train)
    
    
    plt.subplot(grid[1,1])
    
    sns.barplot(x='HouseStyle',y ='BsmtUnfSF',data=train)
    
    plt.subplot(grid[1,2])
    
    
    sns.barplot(x='LotShape',y='BsmtUnfSF',data=train)
    
    

    png

    
    
    """
    This feature has a significant positive correlation with SalePrice, with a small proportion of data points having a value of 0.
    This tells me that most houses will have some amount of square feet unfinished within the basement, and this actually positively contributes towards SalePrice.
    The amount of unfinished square feet also varies widely based on location and style.
    Whereas the average unfinished square feet within the basement is fairly consistent across the different lot shapes.
    Since this is a continuous numeric feature with a significant correlation, I will bin this and create dummy variables.
    In short: positively correlated with the sale price;
    unfinished basement area is largely unrelated to lot shape;
    it is a continuous variable, so it is binned and the bins are then one-hot encoded
    """
    all_data['BsmtUnfSF_Band'] = pd.cut(all_data['BsmtUnfSF'], 3,labels=['1','2','3'])
    all_data.drop('BsmtUnfSF',axis=1,inplace=True)
    all_data['BsmtUnfSF_Band'].unique()
    all_data['BsmtUnfSF_Band'] = all_data['BsmtUnfSF_Band'].astype(int)
    all_data = pd.get_dummies(all_data, columns = ["BsmtUnfSF_Band"], prefix="BsmtUnfSF")
    all_data.head()
    
    1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFullBath BsmtHalfBath ... BsmtFinType2_BLQ BsmtFinType2_GLQ BsmtFinType2_LwQ BsmtFinType2_None BsmtFinType2_Rec BsmtFinType2_Unf BsmtFinSf2_Flag BsmtUnfSF_1 BsmtUnfSF_2 BsmtUnfSF_3
    0 856 854 0 None 3 1Fam 3 1 1.0 0.0 ... 0 0 0 0 0 1 0 1 0 0
    1 1262 0 0 None 3 1Fam 3 4 0.0 1.0 ... 0 0 0 0 0 1 0 1 0 0
    2 920 866 0 None 3 1Fam 3 2 1.0 0.0 ... 0 0 0 0 0 1 0 1 0 0
    3 961 756 0 None 3 1Fam 4 1 1.0 0.0 ... 0 0 0 0 0 1 0 1 0 0
    4 1145 1053 0 None 4 1Fam 3 3 1.0 0.0 ... 0 0 0 0 0 1 0 1 0 0

    5 rows × 140 columns
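
When `pd.cut` is given an integer, it splits the observed range into that many equal-width intervals. The sketch below (toy values, assumed for illustration) also returns the computed edges via `retbins=True`, which can help to check how a skewed column is spread across the bands.

```python
import pandas as pd

# Toy values standing in for BsmtUnfSF (illustrative only).
unfinished = pd.Series([0, 500, 1000, 2336], name="BsmtUnfSF")

# Three equal-width bins over the observed range; retbins also returns the edges.
bands, edges = pd.cut(unfinished, 3, labels=["1", "2", "3"], retbins=True)

print(bands.tolist())   # band label per row
print(edges)            # four edges delimit three bins
```

With right-skewed data most rows tend to land in the first band, so it may be worth inspecting `value_counts()` on the band before encoding it.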

    TotalBsmtSF
    """
    Total square feet of basement area.
    """
    grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25) 
    # Define the grid geometry on which the subplots are placed: the number of rows and columns is required; layout parameters (e.g. left, right) can optionally be adjusted.
    plt.subplots(figsize=(30,15))
    plt.subplot(grid[0,0])
    
    
    g = sns.regplot(x=train['TotalBsmtSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train['TotalBsmtSF'], train['SalePrice'])[0]))
    g.legend(loc='best')
    
    plt.subplot(grid[0,1:])
    
    sns.boxplot(x='Neighborhood',y='TotalBsmtSF',data=train)
    
    plt.subplot(grid[1,0])
    sns.barplot(x='BldgType',y= 'TotalBsmtSF',data=train)
    
    
    plt.subplot(grid[1,1])
    
    sns.barplot(x='HouseStyle',y ='TotalBsmtSF',data=train)
    
    plt.subplot(grid[1,2])
    
    
    sns.barplot(x='LotShape',y='TotalBsmtSF',data=train)
    
    

    png

    def get_feature_corr(feature_name):
        grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25) 
        # Define the grid geometry on which the subplots are placed: the number of rows and columns is required; layout parameters (e.g. left, right) can optionally be adjusted.
        plt.subplots(figsize=(30,15))
        plt.subplot(grid[0,0])
    
    
        g = sns.regplot(x=train[feature_name], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train[feature_name], train['SalePrice'])[0]))
        g.legend(loc='best')
    
        plt.subplot(grid[0,1:])
    
        sns.boxplot(x='Neighborhood',y=feature_name,data=train)
    
        plt.subplot(grid[1,0])
        sns.barplot(x='BldgType',y= feature_name,data=train)
    
    
        plt.subplot(grid[1,1])
    
        sns.barplot(x='HouseStyle',y =feature_name,data=train)
    
        plt.subplot(grid[1,2])
    
    
        sns.barplot(x='LotShape',y=feature_name,data=train)
        plt.show()
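
The number shown in the scatter plot legend is the Pearson correlation coefficient. As a rough illustration of that value and of the label formatting, the sketch below uses pandas' `Series.corr` (Pearson by default) in place of `scipy.stats.pearsonr`, on made-up values; this substitution is an assumption made to keep the example dependency-free.

```python
import pandas as pd

# Toy feature and target (illustrative values only).
feature = pd.Series([1, 2, 3, 4, 5])
target = pd.Series([2, 4, 5, 4, 5])

# Series.corr computes the Pearson coefficient by default.
corr = feature.corr(target)

# "%.2f" rounds to two decimals for the legend label.
label = "corr: %.2f" % corr
print(label)
```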
    
    1stFlrSF
    get_feature_corr('1stFlrSF')
    """
    First floor square feet.
    """
    

    png

    
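    • First-floor area is strongly correlated with the sale price
    • The distribution of first-floor area varies widely across neighborhoods
    • First-floor area varies little across house types
    • This is a continuous feature, so it is binned and then one-hot encoded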
    all_data['1stFlrSF_Band'] = pd.cut(all_data['1stFlrSF'], 6,labels=['1','2','3','4','5','6'])
    all_data['1stFlrSF_Band'].unique()
    all_data['1stFlrSF_Band'] = all_data['1stFlrSF_Band'].astype(int)
    
    all_data.drop('1stFlrSF', axis=1, inplace=True)
    all_data = pd.get_dummies(all_data, columns = ["1stFlrSF_Band"], prefix="1stFlrSF")
    all_data.head(3)
    
    2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFullBath BsmtHalfBath BsmtQual ... BsmtFinSf2_Flag BsmtUnfSF_1 BsmtUnfSF_2 BsmtUnfSF_3 1stFlrSF_1 1stFlrSF_2 1stFlrSF_3 1stFlrSF_4 1stFlrSF_5 1stFlrSF_6
    0 854 0 None 3 1Fam 3 1 1.0 0.0 3 ... 0 1 0 0 1 0 0 0 0 0
    1 0 0 None 3 1Fam 3 4 0.0 1.0 3 ... 0 1 0 0 0 1 0 0 0 0
    2 866 0 None 3 1Fam 3 2 1.0 0.0 3 ... 0 1 0 0 1 0 0 0 0 0

    3 rows × 145 columns

    2ndFlrSF
    get_feature_corr('2ndFlrSF')
    """
    Second floor square feet.
    """
    

    png

    
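    • Many houses have no second floor, so their second-floor area is 0
    • Second-floor area varies widely across neighborhoods
    • Second-floor area also varies widely across house types
    • This is a continuous variable, so it is binned and then one-hot encoded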
    all_data['2ndFlrSF_Band'] = pd.cut(all_data['2ndFlrSF'], 6,labels=list('123456'))
    all_data['2ndFlrSF_Band'].unique()
    all_data=pd.get_dummies(all_data,columns=['2ndFlrSF_Band'],prefix="2ndFlrSF")
    all_data.drop('2ndFlrSF', axis=1, inplace=True)
    all_data.head()
    
    3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFullBath BsmtHalfBath BsmtQual CentralAir ... 1stFlrSF_3 1stFlrSF_4 1stFlrSF_5 1stFlrSF_6 2ndFlrSF_1 2ndFlrSF_2 2ndFlrSF_3 2ndFlrSF_4 2ndFlrSF_5 2ndFlrSF_6
    0 0 None 3 1Fam 3 1 1.0 0.0 3 Y ... 0 0 0 0 0 0 1 0 0 0
    1 0 None 3 1Fam 3 4 0.0 1.0 3 Y ... 0 0 0 0 1 0 0 0 0 0
    2 0 None 3 1Fam 3 2 1.0 0.0 3 Y ... 0 0 0 0 0 0 1 0 0 0
    3 0 None 3 1Fam 4 1 1.0 0.0 2 Y ... 0 0 0 0 0 0 1 0 0 0
    4 0 None 4 1Fam 3 3 1.0 0.0 3 Y ... 0 0 0 0 0 0 0 1 0 0

    5 rows × 150 columns

    LowQualFinSF
    get_feature_corr('LowQualFinSF')
    
    '''
    Low quality finished square feet (all floors)
    '''
    

    png

    
    • This feature can be converted to a 0/1 flag
    all_data['LowQualFinSF_Flag'] = all_data['LowQualFinSF'].map(lambda x:0 if x==0 else 1)
    all_data.drop('LowQualFinSF', axis=1, inplace=True)
    
    BsmtHalfBath BsmtFullBath HalfBath FullBath
    all_data['TotalBathrooms'] = all_data['BsmtHalfBath'] + all_data['BsmtFullBath'] + all_data['HalfBath'] + all_data['FullBath']
    
    columns = ['BsmtHalfBath', 'BsmtFullBath', 'HalfBath', 'FullBath']
    all_data.drop(columns, axis=1, inplace=True)
    
    def get_feature_corr1(feature_name,order=None):
        plt.subplots(figsize =(20, 5))
    
        plt.subplot(1, 3, 1)
        sns.boxplot(x=feature_name, y="SalePrice", data=train,order=order)
    
        plt.subplot(1, 3, 2)
        sns.stripplot(x=feature_name, y="SalePrice", data=train, size = 5, jitter = True ,order=order);
    
        plt.subplot(1, 3, 3)
        sns.barplot(x=feature_name, y="SalePrice", data=train,order=order)
        plt.show()
    
    get_feature_corr1('BedroomAbvGr',order=None)
    """
    Bedrooms above grade (does not include basement bedrooms)
    """
    

    png

    
    get_feature_corr1('KitchenAbvGr',order=None)
    

    png

    get_feature_corr1('KitchenQual',order=['Fa','TA','Gd','Ex'])
    print("""
    This feature should be encoded as an ordered category
    """)
    

    png


    This feature should be encoded as an ordered category

    all_data['KitchenQual'] = all_data['KitchenQual'].map({"Fa":1, "TA":2, "Gd":3, "Ex":4})
    all_data['KitchenQual'].unique()
    
    array([3, 2, 4, 1])
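
One thing to watch with `map` and a dictionary is that any value missing from the dictionary silently becomes NaN. The sketch below illustrates this on a made-up series; the `"Po"` entry is a hypothetical extra level added only to trigger the case.

```python
import pandas as pd

# Ordered quality levels, as in the mapping above.
quality_order = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

# Toy series; "Po" is deliberately not in the mapping (hypothetical level).
demo = pd.Series(["Gd", "TA", "Ex", "Fa", "Po"], name="KitchenQual")
encoded = demo.map(quality_order)

# Values that were not covered by the mapping end up as NaN.
unmapped = demo[encoded.isna()].unique().tolist()
print(unmapped)
```

Checking `unmapped` right after the mapping is a cheap way to catch levels that only appear in the test set.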
    
    TotRmsAbvGrd
    get_feature_corr1('TotRmsAbvGrd')
    

    png

    Fireplaces
    get_feature_corr1('Fireplaces')
    

    png

    FireplaceQu
    get_feature_corr1('FireplaceQu',order=['Po','Fa','TA','Gd','Ex'])
    

    png

    all_data['FireplaceQu'] = all_data['FireplaceQu'].map({"None":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5})
    all_data['FireplaceQu'].unique()
    
    array([0, 3, 4, 2, 5, 1])
    
    GrLivArea
    get_feature_corr('GrLivArea')
    

    png

    • Continuous feature with a very strong correlation with the sale price
    • Bin it, then convert the bins to one-hot features
    all_data['GrLivArea_Band'] = pd.cut(all_data['GrLivArea'], 6,labels=list('123456'))
    all_data['GrLivArea_Band'].unique()
    all_data['GrLivArea_Band'] = all_data['GrLivArea_Band'].astype(int)
    all_data.drop('GrLivArea',axis=1,inplace=True)
    all_data = pd.get_dummies(all_data, columns = ["GrLivArea_Band"], prefix="GrLivArea")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 ... 2ndFlrSF_5 2ndFlrSF_6 LowQualFinSF_Flag TotalBathrooms GrLivArea_1 GrLivArea_2 GrLivArea_3 GrLivArea_4 GrLivArea_5 GrLivArea_6
    0 0 None 3 1Fam 3 1 3 Y Norm Norm ... 0 0 0 4.0 0 1 0 0 0 0
    1 0 None 3 1Fam 3 4 3 Y Feedr Norm ... 0 0 0 3.0 0 1 0 0 0 0
    2 0 None 3 1Fam 3 2 3 Y Norm Norm ... 0 0 0 4.0 0 1 0 0 0 0

    3 rows × 152 columns

    MSSubClass
    get_feature_corr1('MSSubClass')
    

    png

    all_data['MSSubClass'] = all_data['MSSubClass'].astype(str)
    
    all_data = pd.get_dummies(all_data, columns = ["MSSubClass"], prefix="MSSubClass")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 ... MSSubClass_30 MSSubClass_40 MSSubClass_45 MSSubClass_50 MSSubClass_60 MSSubClass_70 MSSubClass_75 MSSubClass_80 MSSubClass_85 MSSubClass_90
    0 0 None 3 1Fam 3 1 3 Y Norm Norm ... 0 0 0 0 1 0 0 0 0 0
    1 0 None 3 1Fam 3 4 3 Y Feedr Norm ... 0 0 0 0 0 0 0 0 0 0
    2 0 None 3 1Fam 3 2 3 Y Norm Norm ... 0 0 0 0 1 0 0 0 0 0

    3 rows × 167 columns
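
MSSubClass is stored as integers although its values are category codes. The sketch below (toy rows, assumed for illustration) shows why the cast to `str` is a useful safeguard: without an explicit `columns` argument, `get_dummies` leaves integer columns untouched.

```python
import pandas as pd

# Toy frame; MSSubClass holds codes, LotArea holds a real quantity.
demo = pd.DataFrame({"MSSubClass": [60, 20, 60, 70],
                     "LotArea": [8450, 9600, 11250, 9550]})

# With no columns given, integer columns are not encoded.
untouched = pd.get_dummies(demo)

# Cast the codes to strings and name the column explicitly.
demo["MSSubClass"] = demo["MSSubClass"].astype(str)
encoded = pd.get_dummies(demo, columns=["MSSubClass"], prefix="MSSubClass")
print(encoded.columns.tolist())
```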

    BldgType
    get_feature_corr1('BldgType')
    

    png

    all_data['BldgType'] = all_data['BldgType'].astype(str)
    
    all_data = pd.get_dummies(all_data, columns = ["BldgType"], prefix="BldgType")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... MSSubClass_70 MSSubClass_75 MSSubClass_80 MSSubClass_85 MSSubClass_90 BldgType_1Fam BldgType_2fmCon BldgType_Duplex BldgType_Twnhs BldgType_TwnhsE
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0

    3 rows × 171 columns

    HouseStyle
    get_feature_corr1('HouseStyle')
    

    png

    all_data['HouseStyle'] = all_data['HouseStyle'].map({"2Story":"2Story", "1Story":"1Story", "1.5Fin":"1.5Story", "1.5Unf":"1.5Story", 
                                                         "SFoyer":"SFoyer", "SLvl":"SLvl", "2.5Unf":"2.5Story", "2.5Fin":"2.5Story"})
    
    all_data = pd.get_dummies(all_data, columns = ["HouseStyle"], prefix="HouseStyle")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... BldgType_2fmCon BldgType_Duplex BldgType_Twnhs BldgType_TwnhsE HouseStyle_1.5Story HouseStyle_1Story HouseStyle_2.5Story HouseStyle_2Story HouseStyle_SFoyer HouseStyle_SLvl
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 1 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 1 0 0

    3 rows × 176 columns

    OverallQual
    get_feature_corr1('OverallQual')
    

    png

    OverallCond
    get_feature_corr1('OverallCond')
    

    png

    YearRemodAdd
    get_feature_corr1('YearRemodAdd')
    

    png

    train['Remod_Diff'] = train['YearRemodAdd'] - train['YearBuilt']
    
    plt.subplots(figsize =(40, 10))
    sns.barplot(x="Remod_Diff", y="SalePrice", data=train);
    

    png

    all_data['Remod_Diff'] = all_data['YearRemodAdd'] - all_data['YearBuilt']
    
    all_data.drop('YearRemodAdd', axis=1, inplace=True)
    
    YearBuilt
    get_feature_corr1('YearBuilt')
    

    png

    all_data['YearBuilt_Band'] = pd.cut(all_data['YearBuilt'], 7,labels=list('1234567'))
    all_data['YearBuilt_Band'].unique()
    all_data['YearBuilt_Band'] = all_data['YearBuilt_Band'].astype(int)
    all_data.drop('YearBuilt',axis=1,inplace=True)
    all_data = pd.get_dummies(all_data, columns = ["YearBuilt_Band"], prefix="YearBuilt")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... HouseStyle_SFoyer HouseStyle_SLvl Remod_Diff YearBuilt_1 YearBuilt_2 YearBuilt_3 YearBuilt_4 YearBuilt_5 YearBuilt_6 YearBuilt_7
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 0 0 1
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 0 0 0 1 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 1 0 0 0 0 0 0 1

    3 rows × 182 columns

    Foundation
    get_feature_corr1('Foundation')
    

    png

    all_data = pd.get_dummies(all_data, columns = ["Foundation"], prefix="Foundation")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... YearBuilt_4 YearBuilt_5 YearBuilt_6 YearBuilt_7 Foundation_BrkTil Foundation_CBlock Foundation_PConc Foundation_Slab Foundation_Stone Foundation_Wood
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 1 0 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 1 0 0 1 0 0 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 1 0 0 0

    3 rows × 187 columns

    Functional
    get_feature_corr1('Functional')
    

    png

    all_data['Functional'] = all_data['Functional'].map({"Sev":1, "Maj2":2, "Maj1":3, "Mod":4, "Min2":5, "Min1":6, "Typ":7})
    all_data['Functional'].unique()
    
    array([7, 6, 3, 5, 4, 2, 1])
    
    RoofStyle
    get_feature_corr1('RoofStyle')
    

    png

    all_data = pd.get_dummies(all_data, columns = ["RoofStyle"], prefix="RoofStyle")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Foundation_PConc Foundation_Slab Foundation_Stone Foundation_Wood RoofStyle_Flat RoofStyle_Gable RoofStyle_Gambrel RoofStyle_Hip RoofStyle_Mansard RoofStyle_Shed
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 0 0 0 0 1 0 0 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 0 0 0 0 1 0 0 0 0

    3 rows × 192 columns

    RoofMatl
    """
    Roof material.
    """
    
    get_feature_corr1('RoofMatl')
    

    png

    all_data = pd.get_dummies(all_data, columns = ["RoofMatl"], prefix="RoofMatl")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... RoofStyle_Hip RoofStyle_Mansard RoofStyle_Shed RoofMatl_CompShg RoofMatl_Membran RoofMatl_Metal RoofMatl_Roll RoofMatl_Tar&Grv RoofMatl_WdShake RoofMatl_WdShngl
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 0 0 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 1 0 0 0 0 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 0 0 0 0

    3 rows × 198 columns

    Exterior1st & Exterior2nd
    get_feature_corr1('Exterior1st')
    

    png

    get_feature_corr1('Exterior2nd')
    

    png

    def Exter2(col):
        if col['Exterior2nd'] == col['Exterior1st']:
            return 1
        else:
            return 0
        
    all_data['ExteriorMatch_Flag'] = all_data.apply(Exter2, axis=1)
    all_data.drop('Exterior2nd', axis=1, inplace=True)
    
    all_data = pd.get_dummies(all_data, columns = ["Exterior1st"], prefix="Exterior1st")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Exterior1st_CemntBd Exterior1st_HdBoard Exterior1st_ImStucc Exterior1st_MetalSd Exterior1st_Plywood Exterior1st_Stone Exterior1st_Stucco Exterior1st_VinylSd Exterior1st_Wd Sdng Exterior1st_WdShing
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 1 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 1 0 0 0 0 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 1 0 0

    3 rows × 212 columns
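
The match flag can also be computed column-wise instead of row by row with `apply`. A minimal sketch on made-up rows (assumed for illustration), which should agree with `Exter2` on inputs like these:

```python
import pandas as pd

# Toy rows standing in for the two exterior columns.
demo = pd.DataFrame({
    "Exterior1st": ["VinylSd", "MetalSd", "Wd Sdng"],
    "Exterior2nd": ["VinylSd", "MetalSd", "Wd Shng"],
})

# Element-wise comparison gives booleans; cast them to 0/1.
demo["ExteriorMatch_Flag"] = (demo["Exterior1st"] == demo["Exterior2nd"]).astype(int)
print(demo["ExteriorMatch_Flag"].tolist())
```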

    MasVnrType
    get_feature_corr1('MasVnrType')
    

    png

    all_data = pd.get_dummies(all_data, columns = ["MasVnrType"], prefix="MasVnrType")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Exterior1st_Plywood Exterior1st_Stone Exterior1st_Stucco Exterior1st_VinylSd Exterior1st_Wd Sdng Exterior1st_WdShing MasVnrType_BrkCmn MasVnrType_BrkFace MasVnrType_None MasVnrType_Stone
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 0 1 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 0 0 0 1 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 0 1 0 0

    3 rows × 215 columns

    MasVnrArea
    get_feature_corr('MasVnrArea')
    

    png

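    • This feature carries little signal: its relationship with each of the plotted dimensions is weak, and the values vary widely without any pattern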
    all_data.drop('MasVnrArea', axis=1, inplace=True)
    
    ExterQual
    get_feature_corr1('ExterQual',order=['Fa','TA','Gd', 'Ex'])
    

    png

    all_data['ExterQual'] = all_data['ExterQual'].map({"Fa":1, "TA":2, "Gd":3, "Ex":4})
    all_data['ExterQual'].unique()
    
    array([3, 2, 4, 1])
    
    ExterCond
    """
    Evaluates the present condition of the material on the exterior.
    """
    
    
    get_feature_corr1('ExterCond',order=['Po','Fa',"TA",'Gd','Ex'])
    

    png

    all_data = pd.get_dummies(all_data, columns = ["ExterCond"], prefix="ExterCond")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Exterior1st_WdShing MasVnrType_BrkCmn MasVnrType_BrkFace MasVnrType_None MasVnrType_Stone ExterCond_Ex ExterCond_Fa ExterCond_Gd ExterCond_Po ExterCond_TA
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 1 0 0 0 0 0 0 1
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 1 0 0 0 0 0 1
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 1 0 0 0 0 0 0 1

    3 rows × 218 columns

    GarageType
    """
    location of the Garage
    """
    get_feature_corr1('GarageType')
    

    png

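    • On inspection the categories do appear to have a natural ranking, but the sale price does not follow that ranking, so plain one-hot encoding is sufficient
    • Houses with a built-in garage (BuiltIn) have the highest average sale price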
    all_data = pd.get_dummies(all_data, columns = ["GarageType"], prefix="GarageType")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... ExterCond_Gd ExterCond_Po ExterCond_TA GarageType_2Types GarageType_Attchd GarageType_Basment GarageType_BuiltIn GarageType_CarPort GarageType_Detchd GarageType_None
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 1 0 1 0 0 0 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 1 0 1 0 0 0 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 1 0 1 0 0 0 0 0

    3 rows × 224 columns

    GarageYrBlt
    """
    Year Garage was built
    """
    get_feature_corr1('GarageYrBlt')
    

    png

    • The more recently the garage was built, the higher the sale price tends to be
    plt.subplots(figsize =(50, 10))
    
    sns.boxplot(x="GarageYrBlt", y="SalePrice", data=train);
    

    png

    plt.subplots(figsize =(50, 10))
    sns.violinplot(x = 'GarageYrBlt', y = 'SalePrice', data = train,
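                   linewidth = 2, # line width
                   width = 0.8,   # width of each violin within its slot
                   palette = 'hls', # color palette
    #                order = {'Thur', 'Fri', 'Sat','Sun'}, # restrict the categories shown
    #                scale = 'count',  # how violin widths are scaled: area = equal area, count = proportional to sample count, width = equal width
                   gridsize = 50, # number of grid points for the density estimate; higher is smoother
                   inner = 'box', # what is drawn inside each violin --> 'box','quartile','point','stick',None
                   #bw = 0.8      # bandwidth of the density estimate; can usually be left unset
                   )
    ### Some seaborn plot types I learned recently
    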
    

    png

    train['GarageYrBlt'].value_counts()
    sns.distplot(train['GarageYrBlt'].dropna(), kde=True, bins=5, rug=True)
    
    

    png

    all_data['GarageYrBlt_Band']  = pd.qcut(all_data['GarageYrBlt'],3,labels=list('123'))
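    # qcut picks bin edges from the quantiles of the values, so each bin holds roughly the same number of samples
    # cut picks evenly spaced edges over the value range, so each bin has the same width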
    
    all_data['GarageYrBlt_Band'] = all_data['GarageYrBlt_Band'].astype(int)
    all_data.drop(['GarageYrBlt'],axis=1,inplace=True)
    
    all_data = pd.get_dummies(all_data, columns = ["GarageYrBlt_Band"], prefix="GarageYrBlt")  # get_dummies drops the encoded column by default, so it does not need to be dropped by hand
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageType_2Types GarageType_Attchd GarageType_Basment GarageType_BuiltIn GarageType_CarPort GarageType_Detchd GarageType_None GarageYrBlt_1 GarageYrBlt_2 GarageYrBlt_3
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 1 0 0 0 0 0 0 0 1
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 1 0 0 0 0 0 0 1 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 1 0 0 0 0 0 0 0 1

    3 rows × 226 columns
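
To make the difference between `cut` and `qcut` concrete, the sketch below bins the same made-up years (an assumption for illustration) both ways and compares how many rows land in each band.

```python
import pandas as pd

# Toy construction years, skewed toward recent values.
years = pd.Series([1920, 1950, 1960, 1975, 1995, 2000, 2003, 2005, 2007])

# Equal-width bins: every bin spans the same number of years.
equal_width = pd.cut(years, 3, labels=list("123"))

# Equal-frequency bins: every bin holds about the same number of rows.
equal_freq = pd.qcut(years, 3, labels=list("123"))

print(equal_width.value_counts().sort_index().tolist())
print(equal_freq.value_counts().sort_index().tolist())
```

On skewed data `cut` leaves some bands nearly empty, while `qcut` balances the counts at the cost of bands with unequal widths.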

    GarageFinish
    get_feature_corr1('GarageFinish')
    

    png

    all_data = pd.get_dummies(all_data, columns = ["GarageFinish"], prefix="GarageFinish")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageType_CarPort GarageType_Detchd GarageType_None GarageYrBlt_1 GarageYrBlt_2 GarageYrBlt_3 GarageFinish_Fin GarageFinish_None GarageFinish_RFn GarageFinish_Unf
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 1 0 0 1 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 1 0 0 0 1 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 1 0 0 1 0

    3 rows × 229 columns

    GarageCars
    """
    size of the Garage in car capacity 
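    Already numeric, so no further processing is needed. Houses with a 3-car garage sell for the most; houses with a 4-car garage are rarely sold (5 samples)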
    """
    get_feature_corr1('GarageCars')
    

    png

    GarageArea
    get_feature_corr('GarageArea')
    

    png

    all_data['GarageArea_Band']  = pd.cut(all_data['GarageArea'],3,labels=list('123'))
    all_data['GarageArea_Band'] =all_data['GarageArea_Band'].astype('int')
    all_data.drop(['GarageArea'],axis=1,inplace=True)
    
    all_data = pd.get_dummies(all_data, columns = ["GarageArea_Band"], prefix="GarageArea")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageYrBlt_1 GarageYrBlt_2 GarageYrBlt_3 GarageFinish_Fin GarageFinish_None GarageFinish_RFn GarageFinish_Unf GarageArea_1 GarageArea_2 GarageArea_3
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 1 0 0 1 0 0 1 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 1 0 0 0 1 0 1 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 1 0 0 1 0 0 1 0

    3 rows × 231 columns

    GarageQual
    """
    Garage  quality
    """
    
    get_feature_corr1('GarageQual',order=['Po','Fa','TA','Gd','Ex'])
    

    png

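    • Most samples fall in "TA", which also contains the higher sale prices, while the categories at both ends are sparse, so the levels on each side can be merged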
    all_data['GarageQual'] = all_data['GarageQual'].map({"None":"None", "Po":"Low", "Fa":"Low", "TA":"TA", "Gd":"High", "Ex":"High"})
    all_data['GarageQual'].unique()
    
    array(['TA', 'Low', 'High', 'None'], dtype=object)
    
    all_data = pd.get_dummies(all_data, columns = ["GarageQual"], prefix="GarageQual")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageFinish_None GarageFinish_RFn GarageFinish_Unf GarageArea_1 GarageArea_2 GarageArea_3 GarageQual_High GarageQual_Low GarageQual_None GarageQual_TA
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 1 0 0 1 0 0 0 0 1
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 1 0 1 0 0 0 0 0 1
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 1 0 0 1 0 0 0 0 1

    3 rows × 234 columns
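
Merging sparse levels before encoding keeps the number of dummy columns small. A minimal sketch of the same idea on a made-up series (values assumed for illustration, with the string "None" standing for houses without a garage as above):

```python
import pandas as pd

# Toy quality ratings.
demo = pd.Series(["TA", "Fa", "Gd", "None", "Po", "Ex", "TA"], name="GarageQual")

# Collapse the two lowest and the two highest levels into one group each.
merge_map = {"None": "None", "Po": "Low", "Fa": "Low",
             "TA": "TA", "Gd": "High", "Ex": "High"}
merged = demo.map(merge_map)

# Six original levels become four dummy columns.
dummies = pd.get_dummies(merged, prefix="GarageQual")
print(dummies.columns.tolist())
```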

    GarageCond
    """
    Garage condition.
    """
    
    get_feature_corr1('GarageCond',order=['Po','Fa','TA','Gd','Ex'])
    

    png

    • This feature is handled the same way as GarageQual
    all_data['GarageCond']= all_data['GarageCond'].map({"None":'None',"Po":'Low','Fa':'Low','TA':'TA','Gd':'High','Ex':'High'})
    all_data['GarageCond'].unique()
    
    array(['TA', 'Low', 'None', 'High'], dtype=object)
    
    all_data = pd.get_dummies(all_data, columns = ["GarageCond"], prefix="GarageCond")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageArea_2 GarageArea_3 GarageQual_High GarageQual_Low GarageQual_None GarageQual_TA GarageCond_High GarageCond_Low GarageCond_None GarageCond_TA
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 0 0 0 0 1 0 0 0 1
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 1 0 0 0 1
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 0 0 0 0 1 0 0 0 1

    3 rows × 237 columns

    WoodDeckSF
    """
    Wood deck area in SF.
    """
    
    get_feature_corr('WoodDeckSF')
    

    png

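    • High correlation with SalePrice
    • Many values are 0, so a separate flag feature is created to indicate whether the house has a wood deck
    • The non-zero values are binned and then converted to one-hot features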
    def WoodDeckFlag(col):
        if col['WoodDeckSF'] == 0:
            return 1
        else:
            return 0
        
    all_data['NoWoodDeck_Flag'] = all_data.apply(WoodDeckFlag, axis=1)  # new feature
    
    all_data['WoodDeckSF_Band'] = pd.cut(all_data['WoodDeckSF'], 4,labels=list('1234'))  ## bin 
    
    all_data['WoodDeckSF_Band'] = all_data['WoodDeckSF_Band'].astype(int)
    
    all_data.drop('WoodDeckSF', axis=1, inplace=True)
    
    all_data = pd.get_dummies(all_data, columns = ["WoodDeckSF_Band"], prefix="WoodDeckSF")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageQual_TA GarageCond_High GarageCond_Low GarageCond_None GarageCond_TA NoWoodDeck_Flag WoodDeckSF_1 WoodDeckSF_2 WoodDeckSF_3 WoodDeckSF_4
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 0 0 0 1 1 1 0 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 1 0 0 0 1 0 1 0 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 0 0 0 1 1 1 0 0 0

    3 rows × 241 columns
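
One possible variant, sketched here on made-up values, bins only the non-zero areas and reserves band 0 for houses without a deck, so that the zeros do not share a band with small decks. This is an alternative under stated assumptions, not the procedure used in the cells above.

```python
import pandas as pd

# Toy deck areas; 0 means no wood deck (illustrative values only).
demo = pd.DataFrame({"WoodDeckSF": [0, 298, 0, 192, 857, 40]})

# Flag the houses without a deck.
demo["NoWoodDeck_Flag"] = (demo["WoodDeckSF"] == 0).astype(int)

# Bin only the non-zero areas into four equal-width bands (1 to 4).
nonzero = demo["WoodDeckSF"] > 0
bands = pd.cut(demo.loc[nonzero, "WoodDeckSF"], 4, labels=[1, 2, 3, 4]).astype(int)

# Rows without a deck get band 0.
demo["WoodDeckSF_Band"] = bands.reindex(demo.index, fill_value=0)
print(demo)
```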

    TotalPorchSF
    """
    OpenPorchSF, EnclosedPorch, 3SsnPorch & ScreenPorch
    
    I will sum these features together to create a total porch in square feet feature.
    """
    all_data['TotalPorchSF'] = all_data['OpenPorchSF'] + all_data['EnclosedPorch'] + all_data['3SsnPorch'] + all_data['ScreenPorch']
    train['TotalPorchSF'] = train['OpenPorchSF'] + train['EnclosedPorch'] + train['3SsnPorch'] + train['ScreenPorch']
    
    get_feature_corr('TotalPorchSF')
    

    png

    def PorchFlag(col):
        if col['TotalPorchSF'] == 0:
            return 1
        else:
            return 0
        
    all_data['NoPorch_Flag'] = all_data.apply(PorchFlag, axis=1)
    
    all_data['TotalPorchSF_Band'] = pd.cut(all_data['TotalPorchSF'], 4,labels=list('1234'))
    all_data['TotalPorchSF_Band'].unique()
    all_data['TotalPorchSF_Band'] = all_data['TotalPorchSF_Band'].astype(int)
    
    all_data.drop('TotalPorchSF', axis=1, inplace=True)
    
    all_data = pd.get_dummies(all_data, columns = ["TotalPorchSF_Band"], prefix="TotalPorchSF")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... NoWoodDeck_Flag WoodDeckSF_1 WoodDeckSF_2 WoodDeckSF_3 WoodDeckSF_4 NoPorch_Flag TotalPorchSF_1 TotalPorchSF_2 TotalPorchSF_3 TotalPorchSF_4
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 1 0 0 0 0 1 0 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 1 0 0 0 1 1 0 0 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 1 0 0 0 0 1 0 0 0

    3 rows × 246 columns
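The flag functions in this section are applied row by row with `apply(..., axis=1)`. The same flag can usually be computed with a vectorized comparison, which is shorter and typically faster on large frames. A minimal sketch on a toy frame (the data and the names `df`, `porch_flag`, `flag_apply`, `flag_vec` are hypothetical, not taken from the notebook):

```python
import pandas as pd

# Hypothetical toy frame standing in for all_data
df = pd.DataFrame({"TotalPorchSF": [0, 61, 0, 307, 84]})

# Row-wise version, same logic as PorchFlag
def porch_flag(row):
    return 1 if row["TotalPorchSF"] == 0 else 0

flag_apply = df.apply(porch_flag, axis=1)

# Vectorized version: compare the whole column at once, then cast bool -> int
flag_vec = (df["TotalPorchSF"] == 0).astype(int)

print(flag_vec.tolist())
```

Both versions give the same 0/1 column, so either can be assigned to `NoPorch_Flag`.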

    PoolArea
    """
    PoolArea Pool area in square feet.
    """
    get_feature_corr('PoolArea')
    

    png

    def PoolFlag(col):
        if col['PoolArea'] == 0:
            return 0
        else:
            return 1
        
    all_data['HasPool_Flag'] = all_data.apply(PoolFlag, axis=1)
    all_data.drop('PoolArea', axis=1, inplace=True)
    
    PoolQC
    """
    Pool quality.
    """
    get_feature_corr1('PoolQC',order=['Fa','Gd','Ex'])
    
    

    png

    all_data['PoolQC'].value_counts()  # only 8 rows have a pool and the rest have none, so this quality feature carries little information
    
    None    2907
    Gd         3
    Ex         3
    Fa         2
    Name: PoolQC, dtype: int64
    
    all_data.drop('PoolQC', axis=1, inplace=True)
    
    Fence
    '''
    Fence: Fence quality
    		
           GdPrv	Good Privacy
           MnPrv	Minimum Privacy
           GdWo	Good Wood
           MnWw	Minimum Wood/Wire
           NA	No Fence
    '''
    
    get_feature_corr1('Fence',order=['MnWw','GdWo','MnPrv','GdPrv'])
    

    png

    all_data = pd.get_dummies(all_data, columns = ["Fence"], prefix="Fence")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... TotalPorchSF_1 TotalPorchSF_2 TotalPorchSF_3 TotalPorchSF_4 HasPool_Flag Fence_GdPrv Fence_GdWo Fence_MnPrv Fence_MnWw Fence_None
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 0 0 0 0 0 0 0 0 1
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 1 0 0 0 0 0 0 0 0 1
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 0 0 0 0 0 0 0 0 1

    3 rows × 249 columns
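`pd.get_dummies` with `columns=` replaces the named column by one indicator column per category and leaves the other columns in place. A small sketch on made-up rows (the frame `df` is hypothetical). Note that recent pandas versions return boolean indicator columns by default, so `dtype=int` is passed here to get the 0/1 layout shown in the outputs above:

```python
import pandas as pd

# Hypothetical rows; "None" is the string used for missing fences
df = pd.DataFrame({
    "Fence": ["None", "MnPrv", "GdWo", "None"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# One indicator column per category, named <prefix>_<category>
encoded = pd.get_dummies(df, columns=["Fence"], prefix="Fence", dtype=int)

print(encoded.columns.tolist())
print(encoded["Fence_None"].tolist())
```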

    MSZoning
    """
    MSZoning: Identifies the general zoning classification of the sale.
    		
           A	Agriculture
           C	Commercial
           FV	Floating Village Residential
           I	Industrial
           RH	Residential High Density
           RL	Residential Low Density
           RP	Residential Low Density Park 
           RM	Residential Medium Density
    """
    get_feature_corr1('MSZoning')
    all_data['MSZoning'].value_counts()
    

    png

    RL         2265
    RM          460
    FV          139
    RH           26
    C (all)      25
    Name: MSZoning, dtype: int64
    
    all_data = pd.get_dummies(all_data, columns = ["MSZoning"], prefix="MSZoning")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Fence_GdPrv Fence_GdWo Fence_MnPrv Fence_MnWw Fence_None MSZoning_C (all) MSZoning_FV MSZoning_RH MSZoning_RL MSZoning_RM
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 1 0 0 0 1 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 1 0 0 0 1 0
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 1 0 0 0 1 0

    3 rows × 253 columns

    Neighborhood
    """
    this feature has lots of values,and SalePrice varies a lot in the values of the feature,
    we  just use one-hot to transform this feature
    
    """
    
    get_feature_corr1('Neighborhood')
    all_data = pd.get_dummies(all_data, columns = ["Neighborhood"], prefix="Neighborhood")
    all_data.head(3)
    

    png

    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Neighborhood_NoRidge Neighborhood_NridgHt Neighborhood_OldTown Neighborhood_SWISU Neighborhood_Sawyer Neighborhood_SawyerW Neighborhood_Somerst Neighborhood_StoneBr Neighborhood_Timber Neighborhood_Veenker
    0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 0 0 0
    1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 0 0 0 0 1
    2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 0 0 0

    3 rows × 277 columns

    Condition1 & Condition2
    print('condition1')
    get_feature_corr1('Condition1')
    print('condition2')
    get_feature_corr1('Condition2')
    
    condition1
    

    png

    condition2
    

    png

    '''
    Condition1: Proximity to various conditions
           Artery	Adjacent to arterial street
           Feedr	Adjacent to feeder street
           Norm	Normal
           RRNn	Within 200' of North-South Railroad
           RRAn	Adjacent to North-South Railroad
           PosN	Near positive off-site feature--park, greenbelt, etc.
           PosA	Adjacent to positive off-site feature
           RRNe	Within 200' of East-West Railroad
           RRAe	Adjacent to East-West Railroad
    
    '''
    all_data['Condition1'] = all_data['Condition1'].map({"Norm":"Norm", "Feedr":"Street", "PosN":"Pos", "Artery":"Street", "RRAe":"Train",
                                                        "RRNn":"Train", "RRAn":"Train", "PosA":"Pos", "RRNe":"Train"})
    all_data['Condition2'] = all_data['Condition2'].map({"Norm":"Norm", "Feedr":"Street", "PosN":"Pos", "Artery":"Street", "RRAe":"Train",
                                                        "RRNn":"Train", "RRAn":"Train", "PosA":"Pos", "RRNe":"Train"})
    
    def ConditionMatch(col):
        if col['Condition1'] == col['Condition2']:
            return 0
        else:
            return 1
        
    all_data['Diff2ndCondition_Flag'] = all_data.apply(ConditionMatch, axis=1)
    all_data.drop('Condition2', axis=1, inplace=True)
    
    all_data = pd.get_dummies(all_data, columns = ["Condition1"], prefix="Condition1")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual ... Neighborhood_SawyerW Neighborhood_Somerst Neighborhood_StoneBr Neighborhood_Timber Neighborhood_Veenker Diff2ndCondition_Flag Condition1_Norm Condition1_Pos Condition1_Street Condition1_Train
    0 0 None 3 3 1 3 Y SBrkr 0 3 ... 0 0 0 0 0 0 1 0 0 0
    1 0 None 3 3 4 3 Y SBrkr 0 2 ... 0 0 0 0 1 1 0 0 1 0
    2 0 None 3 3 2 3 Y SBrkr 0 3 ... 0 0 0 0 0 0 1 0 0 0

    3 rows × 280 columns

    LotFrontage
    """
    Linear feet of street connected to property.
    """
    
    get_feature_corr('LotFrontage')
    

    png

    • This feature shows no clear correlation with SalePrice, so dropping it is worth considering
    LotArea
    '''
    Lot size in square feet.
    '''
    get_feature_corr('LotArea')
    

    png

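    • This feature is clearly correlated with SalePrice, and its distribution is positively skewed (peak shifted left, long right tail)
    all_data['LotArea_Band'] = pd.qcut(all_data['LotArea'], 8,labels=list('12345678'))  # use qcut (equal-frequency bins) for unevenly distributed features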
    all_data['LotArea_Band'].unique()
    all_data['LotArea_Band'] = all_data['LotArea_Band'].astype(int)
    
    all_data.drop('LotArea', axis=1, inplace=True)
    
    all_data = pd.get_dummies(all_data, columns = ["LotArea_Band"], prefix="LotArea")
    all_data.head(3)
    
    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual ... Condition1_Street Condition1_Train LotArea_1 LotArea_2 LotArea_3 LotArea_4 LotArea_5 LotArea_6 LotArea_7 LotArea_8
    0 0 None 3 3 1 3 Y SBrkr 0 3 ... 0 0 0 0 1 0 0 0 0 0
    1 0 None 3 3 4 3 Y SBrkr 0 2 ... 1 0 0 0 0 0 1 0 0 0
    2 0 None 3 3 2 3 Y SBrkr 0 3 ... 0 0 0 0 0 0 0 1 0 0

    3 rows × 287 columns
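The earlier area features were binned with `pd.cut` (equal-width bins), while `LotArea` uses `pd.qcut` (equal-frequency bins). On a right-skewed column the difference is large: equal-width bins put most rows in the first bin, while quantile bins hold roughly the same number of rows each. A sketch on synthetic data (the lognormal sample is an assumption, not the Ames data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "lot area" values (assumption, not the real column)
area = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=1000))

equal_width = pd.cut(area, 4, labels=list("1234"))    # bins of equal value range
equal_freq = pd.qcut(area, 4, labels=list("1234"))    # bins of equal row count

cut_counts = equal_width.value_counts().sort_index().tolist()
qcut_counts = equal_freq.value_counts().sort_index().tolist()

print("cut :", cut_counts)
print("qcut:", qcut_counts)
```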

    LotShape
    """
    LotShape: General shape of property
    
           Reg	Regular	
           IR1	Slightly irregular
           IR2	Moderately Irregular
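           IR3	Irregular
    This feature clearly affects the sale price: in overseas markets a lot needs not only a large area but also a sensible shape, otherwise it is hard to sell at a high price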
    """
    get_feature_corr1('LotShape')
    

    png

    all_data = pd.get_dummies(all_data, columns = ["LotShape"], prefix="LotShape")
    all_data.head(3)
    print("Lot shapes are concentrated in Reg and IR1, and SalePrice varies a lot across the categories")
    
    Lot shapes are concentrated in Reg and IR1, and SalePrice varies a lot across the categories
    
    LandContour
    """
    LandContour: Flatness of the property
    
           Lvl	Near Flat/Level	
           Bnk	Banked - Quick and significant rise from street grade to building
           HLS	Hillside - Significant slope from side to side
           Low	Depression
    
    """
    get_feature_corr1('LandContour')
    all_data = pd.get_dummies(all_data, columns = ["LandContour"], prefix="LandContour")
    all_data.head(3)
    

    png

    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual ... LotArea_7 LotArea_8 LotShape_IR1 LotShape_IR2 LotShape_IR3 LotShape_Reg LandContour_Bnk LandContour_HLS LandContour_Low LandContour_Lvl
    0 0 None 3 3 1 3 Y SBrkr 0 3 ... 0 0 0 0 0 1 0 0 0 1
    1 0 None 3 3 4 3 Y SBrkr 0 2 ... 0 0 0 0 0 1 0 0 0 1
    2 0 None 3 3 2 3 Y SBrkr 0 3 ... 0 0 1 0 0 0 0 0 0 1

    3 rows × 293 columns

    LotConfig
    """
    LotConfig: Lot configuration
    
           Inside	Inside lot
           Corner	Corner lot
           CulDSac	Cul-de-sac
           FR2	Frontage on 2 sides of property
           FR3	Frontage on 3 sides of property
    Describes how the lot sits relative to its surroundings
    """
    get_feature_corr1('LotConfig')
    all_data['LotConfig'] = all_data['LotConfig'].map({"Inside":"Inside", "FR2":"FR", "Corner":"Corner", "CulDSac":"CulDSac", "FR3":"FR"})
    
    all_data = pd.get_dummies(all_data, columns = ["LotConfig"], prefix="LotConfig")
    all_data.head(3)
    
    
    

    png

    3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual ... LotShape_IR3 LotShape_Reg LandContour_Bnk LandContour_HLS LandContour_Low LandContour_Lvl LotConfig_Corner LotConfig_CulDSac LotConfig_FR LotConfig_Inside
    0 0 None 3 3 1 3 Y SBrkr 0 3 ... 0 1 0 0 0 1 0 0 0 1
    1 0 None 3 3 4 3 Y SBrkr 0 2 ... 0 1 0 0 0 1 0 0 1 0
    2 0 None 3 3 2 3 Y SBrkr 0 3 ... 0 0 0 0 0 1 0 0 0 1

    3 rows × 296 columns

    LandSlope
    """
    LandSlope: Slope of property
           Gtl	Gentle slope
           Mod	Moderate Slope
           Sev	Severe Slope
    """
    get_feature_corr1('LandSlope')
    
    

    png

    all_data['LandSlope'] = all_data['LandSlope'].map({"Gtl":1, "Mod":0, "Sev":0})
    '''
    Mod and Sev fall in the same SalePrice range, so the two are merged
    '''
    
    '\nMod and Sev fall in the same SalePrice range, so the two are merged\n'
    
    all_data['LandSlope'].value_counts()
    
    1    2774
    0     141
    Name: LandSlope, dtype: int64
    
    Street
    get_feature_corr1('Street')
    

    png

    • Prices vary widely within Pave and there are very few Grvl rows, so this feature carries little information and is dropped
    all_data.drop('Street', axis=1, inplace=True)
    
    Alley
    get_feature_corr1('Alley')
    

    png

    all_data['Alley'].value_counts()
    
    None    2717
    Grvl     120
    Pave      78
    Name: Alley, dtype: int64
    
    all_data = pd.get_dummies(all_data, columns = ["Alley"], prefix="Alley")
    all_data.head(3)
    
    3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual FireplaceQu ... LandContour_HLS LandContour_Low LandContour_Lvl LotConfig_Corner LotConfig_CulDSac LotConfig_FR LotConfig_Inside Alley_Grvl Alley_None Alley_Pave
    0 0 3 3 1 3 Y SBrkr 0 3 0 ... 0 0 1 0 0 0 1 0 1 0
    1 0 3 3 4 3 Y SBrkr 0 2 3 ... 0 0 1 0 0 1 0 0 1 0
    2 0 3 3 2 3 Y SBrkr 0 3 3 ... 0 0 1 0 0 0 1 0 1 0

    3 rows × 297 columns

    PavedDrive
    """
    PavedDrive: Paved driveway
    
           Y	Paved
           P	Partial Pavement
           N	Dirt/Gravel
    Prices differ considerably between categories and there is no clear ordering, so the feature is one-hot encoded
    """
    get_feature_corr1('PavedDrive')
    

    png

    all_data=pd.get_dummies(all_data,columns=['PavedDrive'],prefix='PavedDrive')
    all_data.head()
    
    3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual FireplaceQu ... LotConfig_Corner LotConfig_CulDSac LotConfig_FR LotConfig_Inside Alley_Grvl Alley_None Alley_Pave PavedDrive_N PavedDrive_P PavedDrive_Y
    0 0 3 3 1 3 Y SBrkr 0 3 0 ... 0 0 0 1 0 1 0 0 0 1
    1 0 3 3 4 3 Y SBrkr 0 2 3 ... 0 0 1 0 0 1 0 0 0 1
    2 0 3 3 2 3 Y SBrkr 0 3 3 ... 0 0 0 1 0 1 0 0 0 1
    3 0 3 4 1 2 Y SBrkr 272 2 4 ... 1 0 0 0 0 1 0 0 0 1
    4 0 4 3 3 3 Y SBrkr 0 3 3 ... 0 0 1 0 0 1 0 0 0 1

    5 rows × 299 columns

    Heating
    get_feature_corr1('Heating')
    

    png

    """
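    Almost all rows are GasA and the other categories are tiny, so the feature could be
    reduced to a binary flag: GasA versus any other heating type
    """
    all_data['Heating']  = all_data['Heating'].map({'GasA':1,'GasW':0,'Grav':0,'Wall':0,'OthW':0,'Floor':0})
    
    # Note: the column is dropped right away, so the binary flag above never reaches the model
    all_data.drop('Heating', axis=1, inplace=True)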
    all_data.head(3)
    
    3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual FireplaceQu ... LotConfig_Corner LotConfig_CulDSac LotConfig_FR LotConfig_Inside Alley_Grvl Alley_None Alley_Pave PavedDrive_N PavedDrive_P PavedDrive_Y
    0 0 3 3 1 3 Y SBrkr 0 3 0 ... 0 0 0 1 0 1 0 0 0 1
    1 0 3 3 4 3 Y SBrkr 0 2 3 ... 0 0 1 0 0 1 0 0 0 1
    2 0 3 3 2 3 Y SBrkr 0 3 3 ... 0 0 0 1 0 1 0 0 0 1

    3 rows × 298 columns

    HeatingQC
    """
    Heating quality and condition.
    """
    get_feature_corr1('HeatingQC',order=['Po','Fa','TA','Gd','Ex'])
    

    png

    all_data['HeatingQC'] = all_data['HeatingQC'].map({"Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5})
    all_data['HeatingQC'].unique()
    
    array([5, 4, 3, 2, 1])
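Quality columns such as `HeatingQC` have a natural order, so they are mapped to integers rather than one-hot encoded. One way to avoid typing the ranks by hand is to derive the mapping from an ordered list of labels. A minimal sketch (the names `quality_order` and `quality_map` and the sample Series are hypothetical):

```python
import pandas as pd

# Labels listed from worst to best; the rank is the position in this list
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
quality_map = {label: rank for rank, label in enumerate(quality_order, start=1)}

heating_qc = pd.Series(["Ex", "Gd", "TA", "Fa", "Po", "Ex"])
encoded = heating_qc.map(quality_map)

print(quality_map)
print(encoded.tolist())
```

The same `quality_order` list can then be reused for every column that shares the Po/Fa/TA/Gd/Ex scale.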
    
    CentralAir
    """
    Central air conditioning.
    
    """
    get_feature_corr1('CentralAir')
    
    
    

    png

    all_data['CentralAir'] = all_data['CentralAir'].map({"Y":1,"N":0})
    
    Electrical
    """
    Electrical system.
    
    """
    
    get_feature_corr1('Electrical')
    

    png

    all_data['Electrical'] = all_data['Electrical'].map({'SBrkr':'SBrkr','FuseF':'Fuse','FuseA':'Fuse','FuseP':'Fuse','Mix':'Mix'})
    all_data = pd.get_dummies(all_data, columns = ["Electrical"], prefix="Electrical")
    all_data.head(3)
    
    3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir EnclosedPorch ExterQual FireplaceQu Fireplaces ... LotConfig_Inside Alley_Grvl Alley_None Alley_Pave PavedDrive_N PavedDrive_P PavedDrive_Y Electrical_Fuse Electrical_Mix Electrical_SBrkr
    0 0 3 3 1 3 1 0 3 0 0 ... 1 0 1 0 0 0 1 0 0 1
    1 0 3 3 4 3 1 0 2 3 1 ... 0 0 1 0 0 0 1 0 0 1
    2 0 3 3 2 3 1 0 3 3 1 ... 1 0 1 0 0 0 1 0 0 1

    3 rows × 300 columns

    all_data['MiscFeature'].value_counts()  #
    
    None    2810
    Shed      95
    Gar2       5
    Othr       4
    TenC       1
    Name: MiscFeature, dtype: int64
    
    get_feature_corr1('MiscFeature')
    '''
    Too few informative rows, so this feature is dropped
    '''
    

    png

    '\nToo few informative rows, so this feature is dropped\n'
    
    get_feature_corr1('MiscVal')
    

    png

    all_data['MiscVal'].value_counts()
    """
    Too few informative rows, so this feature is dropped
    """
    
    '\nToo few informative rows, so this feature is dropped\n'
    
    all_data.drop(['MiscVal','MiscFeature'],axis=1,inplace=True)
    
    MoSold and YrSold
    """
    month sold,Year Sold
    """
    get_feature_corr1('MoSold')
    

    png

    get_feature_corr1('YrSold')
    

    png

    all_data = pd.get_dummies(all_data, columns = ["MoSold"], prefix="MoSold")
    all_data = pd.get_dummies(all_data,columns=['YrSold'],prefix='YrSold')
    all_data.head(3)
    
    3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir EnclosedPorch ExterQual FireplaceQu Fireplaces ... MoSold_8 MoSold_9 MoSold_10 MoSold_11 MoSold_12 YrSold_2006 YrSold_2007 YrSold_2008 YrSold_2009 YrSold_2010
    0 0 3 3 1 3 1 0 3 0 0 ... 0 0 0 0 0 0 0 1 0 0
    1 0 3 3 4 3 1 0 2 3 1 ... 0 0 0 0 0 0 1 0 0 0
    2 0 3 3 2 3 1 0 3 3 1 ... 0 1 0 0 0 0 0 1 0 0

    3 rows × 313 columns

    SaleType
    """
    SaleType: Type of sale
    		
           WD 	Warranty Deed - Conventional
           CWD	Warranty Deed - Cash
           VWD	Warranty Deed - VA Loan
           New	Home just constructed and sold
           COD	Court Officer Deed/Estate
           Con	Contract 15% Down payment regular terms
           ConLw	Contract Low Down payment and low interest
           ConLI	Contract Low Interest
           ConLD	Contract Low Down
           Oth	Other
    
    """
    get_feature_corr1('SaleType')
    

    png

    # Keys must match the data exactly ('ConLw', lower-case w); unmatched values become NaN
    all_data['SaleType'] = all_data['SaleType'].map({'WD':"WD",'New':"New",'COD':"COD",'CWD':'Oth','ConLD':'Oth','ConLI':'Oth',
                                                    "ConLw":'Oth','Con':'Oth','Oth':'Oth'})
    all_data=  pd.get_dummies(all_data,columns=['SaleType'],prefix='SaleType')
    all_data.head()
    
    3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir EnclosedPorch ExterQual FireplaceQu Fireplaces ... MoSold_12 YrSold_2006 YrSold_2007 YrSold_2008 YrSold_2009 YrSold_2010 SaleType_COD SaleType_New SaleType_Oth SaleType_WD
    0 0 3 3 1 3 1 0 3 0 0 ... 0 0 0 1 0 0 0 0 0 1
    1 0 3 3 4 3 1 0 2 3 1 ... 0 0 1 0 0 0 0 0 0 1
    2 0 3 3 2 3 1 0 3 3 1 ... 0 0 0 1 0 0 0 0 0 1
    3 0 3 4 1 2 1 272 2 4 1 ... 0 1 0 0 0 0 0 0 0 1
    4 0 4 3 3 3 1 0 3 3 1 ... 1 0 0 1 0 0 0 0 0 1

    5 rows × 316 columns
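`Series.map` with a dictionary returns NaN for every value that has no matching key, and the lookup is case-sensitive. Because `get_dummies` silently skips NaN, a misspelled key can make a category disappear without any error. A quick check for unmapped values, sketched on a toy Series (the data and both dictionaries are hypothetical):

```python
import pandas as pd

sale_type = pd.Series(["WD", "New", "ConLw", "CWD", "WD"])

# The first dictionary has a deliberately wrong key: "ConLW" instead of "ConLw"
bad_map = {"WD": "WD", "New": "New", "CWD": "Oth", "ConLW": "Oth"}
good_map = {"WD": "WD", "New": "New", "CWD": "Oth", "ConLw": "Oth"}

bad = sale_type.map(bad_map)
good = sale_type.map(good_map)

# Original values that ended up as NaN after mapping
unmapped = sale_type[bad.isna()].unique().tolist()
print(unmapped)
print(good.tolist())
```

Running such a check after each `map` call makes grouping mistakes visible before the one-hot step.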

    SaleCondition
    """
    Condition of sale.
    
    """
    
    get_feature_corr1('SaleCondition')
    

    png

    all_data = pd.get_dummies(all_data, columns = ["SaleCondition"], prefix="SaleCondition")
    all_data.head(3)
    
    3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir EnclosedPorch ExterQual FireplaceQu Fireplaces ... SaleType_COD SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
    0 0 3 3 1 3 1 0 3 0 0 ... 0 0 0 1 0 0 0 0 1 0
    1 0 3 3 4 3 1 0 2 3 1 ... 0 0 0 1 0 0 0 0 1 0
    2 0 3 3 2 3 1 0 3 3 1 ... 0 0 0 1 0 0 0 0 1 0

    3 rows × 321 columns

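    Transforming the target

    • Unlike classification, regression fits a continuous target value
    • It is usually worth inspecting the distribution of the target first. Many models, linear ones in particular, tend to fit better when the target is roughly normally distributed, so a skewed target should be transformed toward a normal shape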
    from scipy.stats import skew, norm
    plt.subplots(figsize=(15,12))
    g = sns.distplot(train['SalePrice'],fit=norm,label="Skewness:%.2f" % (train['SalePrice'].skew()))
    g.legend(loc='best')
    
    <matplotlib.legend.Legend at 0x12f5f5cc0>
    

    png

    • The target is positively skewed; NumPy's log1p can be used to transform it
    train["SalePrice"] = np.log1p(train["SalePrice"])
    y_train = train["SalePrice"]
    
    #Check the new distribution 
    plt.subplots(figsize=(15,10))
    g = sns.distplot(train['SalePrice'], fit=norm, label = "Skewness : %.2f"%(train['SalePrice'].skew()));
    g = g.legend(loc="best")
    

    png
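`np.log1p(x)` computes log(1 + x) and `np.expm1` is its inverse, so predictions made on the transformed target have to be mapped back with `expm1` before they are submitted as prices. A sketch on synthetic right-skewed data (the lognormal sample is an assumption, not the real `SalePrice` column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed prices (assumption: lognormal)
price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000))

log_price = np.log1p(price)
print("skew before: %.2f" % price.skew())
print("skew after:  %.2f" % log_price.skew())

# Values in log space are mapped back to the original scale with expm1
restored = np.expm1(log_price)
```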

    Handling skewed features
    numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
    
    # Check how skewed they are
    skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
    
    plt.subplots(figsize =(65, 20))
    skewed_feats.plot(kind='bar');
    

    png

    
    from scipy.special import boxcox1p
    
    skewness = skewed_feats[abs(skewed_feats) > 0.5]
    
    skewed_features = skewness.index
    lam = 0.15
    for feat in skewed_features:
        all_data[feat] = boxcox1p(all_data[feat], lam)
    
    print(skewness.shape[0],  "skewed numerical features have been Box-Cox transformed")
    
    294 skewed numerical features have been Box-Cox transformed
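`boxcox1p(x, lam)` computes ((1 + x)^lam - 1) / lam for a non-zero `lam`. The sketch below checks that formula against the SciPy function on a few sample values (the array `x` is made up). It also shows that an input of 1 becomes about 0.7305 when `lam = 0.15`, which would explain the 0.73046315 entries visible in the `X_train` array further down: the loop transforms every numeric column whose skewness exceeds the threshold, and that includes 0/1 indicator columns.

```python
import numpy as np
from scipy.special import boxcox1p

lam = 0.15
x = np.array([0.0, 1.0, 10.0, 100.0])   # sample inputs, chosen for illustration

transformed = boxcox1p(x, lam)
manual = ((1.0 + x) ** lam - 1.0) / lam  # the same transform written out

print(transformed)
```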
    

    Preparing the data for model training

    train = all_data[:ntrain]
    test = all_data[ntrain:]
    print(train.shape)
    print(test.shape)
    
    (1456, 321)
    (1459, 321)
    
    y_train.shape
    
    (1456,)
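Slicing `all_data` at `ntrain` only works because the training rows were stacked first and the row order was never changed afterwards. The pattern, sketched on tiny hypothetical frames (all names and values here are for illustration only):

```python
import pandas as pd

# Hypothetical miniature train / test frames
train_df = pd.DataFrame({"LotArea": [8450, 9600, 11250], "SalePrice": [208500, 181500, 223500]})
test_df = pd.DataFrame({"LotArea": [11622, 14267]})

ntrain = train_df.shape[0]
y = train_df["SalePrice"]

# Stack train on top of test so both receive identical feature engineering
all_df = pd.concat([train_df.drop(columns="SalePrice"), test_df]).reset_index(drop=True)
all_df["LotArea_k"] = all_df["LotArea"] / 1000.0   # stand-in for the real transformations

# Split back by position: the first ntrain rows are the training rows
train_part = all_df[:ntrain]
test_part = all_df[ntrain:]
print(train_part.shape, test_part.shape)
```

Any step that sorts, shuffles, or drops rows of the combined frame would break this positional split.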
    
    feature importance
    import xgboost as xgb
    
    model = xgb.XGBRegressor()
    model.fit(train, y_train)
    
    
    # Sort feature importances from the XGBoost model fitted above
    indices = np.argsort(model.feature_importances_)[::-1]
    indices = indices[:75]
    
    # Visualise these with a barplot
    plt.subplots(figsize=(20, 15))
    g = sns.barplot(y=train.columns[indices], x = model.feature_importances_[indices], orient='h')
    g.set_xlabel("Relative importance",fontsize=12)
    g.set_ylabel("Features",fontsize=12)
    g.tick_params(labelsize=9)
    g.set_title("XGB feature importance");
    

    png

    xgb_train = train.copy()
    xgb_test = test.copy()
    from sklearn.feature_selection import SelectFromModel
    
    xgb_feat_red = SelectFromModel(model,prefit=True)
    # reduce estimation validation and test datasets
    xgb_train = xgb_feat_red.transform(xgb_train)
    xgb_test = xgb_feat_red.transform(xgb_test)
    print('X_train: ', xgb_train.shape, '\nX_test: ', xgb_test.shape)
    
    X_train:  (1456, 47) 
    X_test:  (1459, 47)
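`SelectFromModel` with `prefit=True` reuses an already fitted estimator and keeps the features whose importance is at or above a threshold. For an estimator that exposes `feature_importances_`, the default threshold is the mean importance. A sketch with a random forest on synthetic data (the forest is swapped in for XGBoost only to keep the example dependency-free, and the data is made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Only the first two columns drive the target (synthetic assumption)
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

selector = SelectFromModel(forest, prefit=True)   # default threshold: mean importance
X_reduced = selector.transform(X)
kept = selector.get_support(indices=True)         # indices of the surviving columns

print(kept, X_reduced.shape)
```

`get_support` is useful in the notebook as well: `transform` returns a plain NumPy array, so the column names are lost unless the kept indices are recorded.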
    
    
    from sklearn import model_selection
    
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(xgb_train, y_train, test_size=0.3, random_state=42)
    
    # X_train = predictor features for estimation dataset
    # X_test = predictor variables for validation dataset
    # Y_train = target variable for the estimation dataset
    # Y_test = target variable for the validation dataset
    
    print('X_train: ', X_train.shape, '\nX_test: ', X_test.shape, '\nY_train: ', Y_train.shape, '\nY_test: ', Y_test.shape)
    
    
    
    X_train:  (1019, 47) 
    X_test:  (437, 47) 
    Y_train:  (1019,) 
    Y_test:  (437,)
    
    X_train
    
    array([[0.73046315, 3.        , 0.73046315, ..., 0.        , 0.        ,
            0.        ],
           [0.73046315, 3.        , 0.73046315, ..., 0.        , 0.        ,
            0.        ],
           [1.19431764, 2.        , 0.73046315, ..., 0.        , 0.        ,
            0.        ],
           ...,
           [1.8203341 , 3.        , 0.73046315, ..., 0.73046315, 0.        ,
            0.        ],
           [0.73046315, 3.        , 0.73046315, ..., 0.        , 0.        ,
            0.        ],
           [1.54096276, 3.        , 0.73046315, ..., 0.        , 0.        ,
            0.        ]])
    

    Training different models

    # Import different regression models from sklearn
    from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
    from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor, ExtraTreesRegressor
    from sklearn.kernel_ridge import KernelRidge
    import xgboost as xgb
    print('Algorithm packages imported!')
    
    
    Algorithm packages imported!
    
    # Model selection packages used for sampling dataset and optimising parameters
    from sklearn import model_selection
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import ShuffleSplit
    print('Model selection packages imported!')
    
    Model selection packages imported!
    
    models = [KernelRidge(),ElasticNet(),Lasso(),GradientBoostingRegressor(),BayesianRidge(),LassoLarsIC(),RandomForestRegressor(),xgb.XGBRegressor()]
    # Random sampling; a regular split with shuffle=True would also work
    # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    shuff =ShuffleSplit(n_splits=5,test_size=0.2,random_state=42)
    # Create a DataFrame to hold the model metrics
    columns = ['Name','Parameters','Train mean_squared_error','Test mean_squared_error']
    before_model_compare = pd.DataFrame(columns=columns)
    
    
    # Add each model's parameters and results to the DataFrame
    row_index=0
    for alg in models:
        model_name = alg.__class__.__name__
        before_model_compare.loc[row_index,'Name'] = model_name
        before_model_compare.loc[row_index,'Parameters'] = str(alg.get_params())
        alg.fit(X_train,Y_train)
        # cross_val_score returns negative MSE, so flip the sign; the square root then gives an RMSE
        # (the columns are named mean_squared_error, but the stored values are RMSE * 100)
        training_results = np.sqrt((-cross_val_score(alg,X_train,Y_train,cv=shuff,scoring='neg_mean_squared_error')).mean())
        test_results = np.sqrt(((Y_test-alg.predict(X_test))**2).mean())
        before_model_compare.loc[row_index,"Train mean_squared_error"] = training_results*100
        before_model_compare.loc[row_index,'Test mean_squared_error'] = test_results*100
        row_index+=1
        print(row_index,model_name,"trained>>>>")
    
        
    decimals = 3
    before_model_compare['Train mean_squared_error'] = before_model_compare['Train mean_squared_error'].apply(lambda x:round(x,decimals))
    # Round the Test column from its own values; rounding from the Train column instead copies
    # the Train values over it, which is why the two columns in the table printed below are identical
    before_model_compare['Test mean_squared_error'] = before_model_compare['Test mean_squared_error'].apply(lambda x:round(x,decimals))
    before_model_compare
        
    
    1 KernelRidge trained>>>>
    2 ElasticNet trained>>>>
    3 Lasso trained>>>>
    4 GradientBoostingRegressor trained>>>>
    5 BayesianRidge trained>>>>
    6 LassoLarsIC trained>>>>
    7 RandomForestRegressor trained>>>>
    8 XGBRegressor trained>>>>
    
       Name                       Parameters                                          Train mean_squared_error  Test mean_squared_error
    0  KernelRidge                {'alpha': 1, 'coef0': 1, 'degree': 3, 'gamma':...   31.424                    31.424
    1  ElasticNet                 {'alpha': 1.0, 'copy_X': True, 'fit_intercept'...   23.245                    23.245
    2  Lasso                      {'alpha': 1.0, 'copy_X': True, 'fit_intercept'...   28.008                    28.008
    3  GradientBoostingRegressor  {'alpha': 0.9, 'criterion': 'friedman_mse', 'i...   12.381                    12.381
    4  BayesianRidge              {'alpha_1': 1e-06, 'alpha_2': 1e-06, 'compute_...   11.118                    11.118
    5  LassoLarsIC                {'copy_X': True, 'criterion': 'aic', 'eps': 2....   11.818                    11.818
    6  RandomForestRegressor      {'bootstrap': True, 'criterion': 'mse', 'max_d...   14.299                    14.299
    7  XGBRegressor               {'base_score': 0.5, 'booster': 'gbtree', 'cols...   12.466                    12.466
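The training score in the loop above is the square root of the mean cross-validated MSE. scikit-learn reports `neg_mean_squared_error` as a negative number (its scorers follow a "higher is better" convention), so the sign is flipped before taking the root. The sketch below reproduces that calculation on synthetic data with a `Ridge` model (both are assumptions for illustration) and compares it with the alternative of averaging the per-fold RMSE values:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=200)

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
neg_mse = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="neg_mean_squared_error")

rmse_of_mean = np.sqrt((-neg_mse).mean())   # root of the averaged MSE, as in the loop above
mean_of_rmse = np.sqrt(-neg_mse).mean()     # per-fold RMSE, then averaged

print(rmse_of_mean, mean_of_rmse)
```

The two numbers are usually close; the first is never smaller than the second because the square root is concave.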
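
    Parameter optimization
    • So far the models were run with default parameters to get a first look at their scores
    • In practice, each of these models needs further parameter tuning
    • The next step uses GridSearchCV to adjust the parameters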
    models = [KernelRidge(),ElasticNet(),Lasso(),GradientBoostingRegressor(),BayesianRidge(),LassoLarsIC(),RandomForestRegressor(),
             xgb.XGBRegressor()]
    KR_param_grid = {'alpha': [0.1], 'coef0': [100], 'degree': [1], 'gamma': [None], 'kernel': ['polynomial']}
    EN_param_grid = {'alpha': [0.001], 'copy_X': [True], 'l1_ratio': [0.6], 'fit_intercept': [True], 'normalize': [False], 
                             'precompute': [False], 'max_iter': [300], 'tol': [0.001], 'selection': ['random'], 'random_state': [None]}
    LASS_param_grid = {'alpha': [0.0005], 'copy_X': [True], 'fit_intercept': [True], 'normalize': [False], 'precompute': [False], 
                        'max_iter': [300], 'tol': [0.01], 'selection': ['random'], 'random_state': [None]}
    GB_param_grid = {'loss': ['huber'], 'learning_rate': [0.1], 'n_estimators': [300], 'max_depth': [3], 
                                            'min_samples_split': [0.0025], 'min_samples_leaf': [5]}
    BR_param_grid = {'n_iter': [200], 'tol': [0.00001], 'alpha_1': [0.00000001], 'alpha_2': [0.000005], 'lambda_1': [0.000005], 
                     'lambda_2': [0.00000001], 'copy_X': [True]}
    LL_param_grid = {'criterion': ['aic'], 'normalize': [True], 'max_iter': [100], 'copy_X': [True], 'precompute': ['auto'], 'eps': [0.000001]}
    RFR_param_grid = {'n_estimators': [50], 'max_features': ['auto'], 'max_depth': [None], 'min_samples_split': [5], 'min_samples_leaf': [2]}
    XGB_param_grid = {'max_depth': [3], 'learning_rate': [0.1], 'n_estimators': [300], 'booster': ['gbtree'], 'gamma': [0], 'reg_alpha': [0.1],
                      'reg_lambda': [0.7], 'max_delta_step': [0], 'min_child_weight': [1], 'colsample_bytree': [0.5], 'colsample_bylevel': [0.2],
                      'scale_pos_weight': [1]}
    params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]
    
    after_model_compare = pd.DataFrame(columns=columns)
    row_index= 0
    
    # Pair each model with its own grid instead of popping grids off the list
    for alg, param_grid in zip(models, params_grid):
        gs_alg = GridSearchCV(alg,param_grid=param_grid,cv=shuff,scoring='neg_mean_squared_error',n_jobs=-1)
        
        model_name = alg.__class__.__name__
        after_model_compare.loc[row_index,'Name'] = model_name
        gs_alg.fit(X_train,Y_train)
        gs_best=gs_alg.best_estimator_
        after_model_compare.loc[row_index,"Parameters"] = str(gs_alg.best_params_)
        after_training_results = np.sqrt(-gs_alg.best_score_)
        # Square the residual (Y_test - prediction), not the prediction alone
        after_test_results = np.sqrt(((Y_test-gs_alg.predict(X_test))**2).mean())
        after_model_compare.loc[row_index,"Train mean_squared_error"] = after_training_results*100
        after_model_compare.loc[row_index,'Test mean_squared_error']= after_test_results*100
        row_index+=1
        print(row_index,model_name,"trained>>>>>")
    
    
        
    decimals = 3
    after_model_compare['Train mean_squared_error'] = after_model_compare['Train mean_squared_error'].apply(lambda x:round(x,decimals))
    # Round the Test column from its own values rather than from the Train column
    after_model_compare['Test mean_squared_error'] = after_model_compare['Test mean_squared_error'].apply(lambda x:round(x,decimals))
    after_model_compare
    
    1 KernelRidge trained>>>>>
    2 ElasticNet trained>>>>>
    3 Lasso trained>>>>>
    4 GradientBoostingRegressor trained>>>>>
    5 BayesianRidge trained>>>>>
    6 LassoLarsIC trained>>>>>
    7 RandomForestRegressor trained>>>>>
    8 XGBRegressor trained>>>>>
    
       Name                       Parameters                                         Train mean_squared_error  Test mean_squared_error
    0  KernelRidge                {'alpha': 0.1, 'coef0': 100, 'degree': 1, 'gam...                    11.140                   11.140
    1  ElasticNet                 {'alpha': 0.001, 'copy_X': True, 'fit_intercep...                    11.234                   11.234
    2  Lasso                      {'alpha': 0.0005, 'copy_X': True, 'fit_interce...                    11.203                   11.203
    3  GradientBoostingRegressor  {'learning_rate': 0.1, 'loss': 'huber', 'max_d...                    11.966                   11.966
    4  BayesianRidge              {'alpha_1': 1e-08, 'alpha_2': 5e-06, 'copy_X':...                    11.118                   11.118
    5  LassoLarsIC                {'copy_X': True, 'criterion': 'aic', 'eps': 1e...                    11.818                   11.818
    6  RandomForestRegressor      {'max_depth': None, 'max_features': 'auto', 'm...                    13.735                   13.735
    7  XGBRegressor               {'booster': 'gbtree', 'colsample_bylevel': 0.2...                    11.964                   11.964

    Note: in this recorded run the Test column is a copy of the Train column (the two match to three decimals), so it should not be read as an independent held-out score. Both columns hold RMSE × 100, not the raw mean squared error.
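
For reference, the held-out score in the comparison loop is meant to be a root mean squared error on the validation split. Below is a minimal sketch with made-up numbers (the `rmse` helper and the sample arrays are illustrative and not part of the notebook). It also shows how a misplaced parenthesis makes the mean negative, which is one way to end up with the `invalid value encountered in sqrt` warning seen in the output.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Subtract first, then square the residual, then average, then take the root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up log-price values, only for illustration.
y_true = np.array([12.0, 11.5, 12.3])
y_pred = np.array([12.1, 11.4, 12.0])

score = rmse(y_true, y_pred)
print(score)

# With the parenthesis in the wrong place only the prediction is squared,
# so the mean becomes negative and np.sqrt would return NaN with a warning.
wrong_mean = float(np.mean(y_true - y_pred ** 2))
print(wrong_mean < 0)
```

Because the residual is squared before averaging, the value under the square root can never be negative, so the warning disappears.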

    stacking method

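    • Prepare a set of base models.
    • Split the training data into a training part and a validation part (X_train, Y_train, X_test, Y_test).
    • Fit each base model on X_train, then use it to predict X_test (the validation set). The validation predictions of all models, paired with the true Y_test, form a new training set (new_train dataset).
    • Use the same fitted models to predict the real test set. The per-model predictions form a new test set (new_test dataset).
    • Fit a relatively simple, or simply different, model (the meta-model), for example Lasso, on the new training set.
    • Use the fitted meta-model to predict the new test set (new_test dataset) to get the final predictions.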
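
The steps above can be sketched end to end on synthetic data. This is only a minimal illustration under assumptions: the data comes from `make_regression` rather than the competition files, only two base models are used, and no hyperparameter search is done. The variable names (`X_fit`, `X_val`, `X_unlabeled`, `new_train`, `new_test`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption, not the real files).
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Hold back 60 rows to play the role of the unlabeled test set.
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(
    X, y, test_size=60, random_state=0
)

# Step 2: split the labeled rows into a fitting part and a validation part.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_labeled, y_labeled, test_size=72, random_state=0
)

# Step 1: a small set of base models.
base_models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Steps 3 and 4: each fitted model contributes one column of predictions
# for the validation rows and one column for the test rows.
val_columns, test_columns = [], []
for name, model in base_models.items():
    model.fit(X_fit, y_fit)
    val_columns.append(model.predict(X_val))
    test_columns.append(model.predict(X_unlabeled))

new_train = np.column_stack(val_columns)  # shape: (n_validation, n_models)
new_test = np.column_stack(test_columns)  # shape: (n_test, n_models)

# Steps 5 and 6: a simple meta-model learns how to combine the base predictions.
meta_model = Lasso(alpha=0.1, max_iter=10000)
meta_model.fit(new_train, y_val)
final_pred = meta_model.predict(new_test)

print(new_train.shape, new_test.shape, final_pred.shape)
```

The meta-model only ever sees predictions for rows the base models were not fitted on, which is what keeps it from simply rewarding the base model that overfits the most.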
    models  = [KernelRidge(),ElasticNet(),Lasso(),GradientBoostingRegressor(),BayesianRidge(),LassoLarsIC(),RandomForestRegressor(),xgb.XGBRegressor()]
    names = ['KernelRidge','ElasticNet','Lasso','GradientBoostingRegressor','BayesianRidge','LassoLarsIC','RandomForest','XGBoost']
    params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]
    stacked_validation_train = pd.DataFrame()
    stacked_test_train = pd.DataFrame()
    
    # First layer: tune each base model, then stack its predictions column by column
    for col_index, (alg, name, grid) in enumerate(zip(models, names, params_grid)):
        gs_alg = GridSearchCV(alg,param_grid=grid,cv=shuff,scoring='neg_mean_squared_error',n_jobs=-1)
        gs_alg.fit(X_train,Y_train)
        gs_best = gs_alg.best_estimator_
        # DataFrame.insert: loc is the column position, column the column name, value the column contents
        stacked_validation_train.insert(loc=col_index,column=name,value=gs_best.predict(X_test))
        print(col_index+1,alg.__class__.__name__,"validation-set predictions stacked into the new training set")
        stacked_test_train.insert(loc=col_index,column=name,value=gs_best.predict(xgb_test))
        print(col_index+1,alg.__class__.__name__,"test-set predictions stacked into the new test set")
        print("---"*50)

    print("First layer done: new training and test sets are ready")
    
    1 KernelRidge validation-set predictions stacked into the new training set
    1 KernelRidge test-set predictions stacked into the new test set
    ------------------------------------------------------------------------------------------------------------------------------------------------------
    2 ElasticNet validation-set predictions stacked into the new training set
    2 ElasticNet test-set predictions stacked into the new test set
    ------------------------------------------------------------------------------------------------------------------------------------------------------
    3 Lasso validation-set predictions stacked into the new training set
    3 Lasso test-set predictions stacked into the new test set
    ------------------------------------------------------------------------------------------------------------------------------------------------------
    4 GradientBoostingRegressor validation-set predictions stacked into the new training set
    4 GradientBoostingRegressor test-set predictions stacked into the new test set
    ------------------------------------------------------------------------------------------------------------------------------------------------------
    5 BayesianRidge validation-set predictions stacked into the new training set
    5 BayesianRidge test-set predictions stacked into the new test set
    ------------------------------------------------------------------------------------------------------------------------------------------------------
    6 LassoLarsIC validation-set predictions stacked into the new training set
    6 LassoLarsIC test-set predictions stacked into the new test set
    ------------------------------------------------------------------------------------------------------------------------------------------------------
    7 RandomForestRegressor validation-set predictions stacked into the new training set
    7 RandomForestRegressor test-set predictions stacked into the new test set
    ------------------------------------------------------------------------------------------------------------------------------------------------------
    [15:23:01] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    8 XGBRegressor validation-set predictions stacked into the new training set
    8 XGBRegressor test-set predictions stacked into the new test set
    ------------------------------------------------------------------------------------------------------------------------------------------------------
    First layer done: new training and test sets are ready
    
    
    /Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
      if getattr(data, 'base', None) is not None and 
    
    print(stacked_validation_train.shape)
    stacked_validation_train.head()
    # Base-model predictions for the validation rows (the matching targets are Y_test)
    
    (437, 8)
    
    KernelRidge ElasticNet Lasso GradientBoostingRegressor BayesianRidge LassoLarsIC RandomForest XGBoost
    0 12.096814 12.095574 12.095347 12.103610 12.095675 12.104932 12.170897 12.084927
    1 11.952395 11.966939 11.964576 12.027570 11.957859 11.999328 12.066678 12.071651
    2 11.798390 11.800390 11.807569 11.842686 11.807968 11.787126 11.880778 11.789903
    3 11.834224 11.814334 11.820662 11.806835 11.840026 11.837654 11.755137 11.753889
    4 11.287412 11.267859 11.271162 11.150576 11.289689 11.290524 11.328786 11.278980
    print(stacked_test_train.shape)
    stacked_test_train.head()
    
    (1459, 8)
    
    KernelRidge ElasticNet Lasso GradientBoostingRegressor BayesianRidge LassoLarsIC RandomForest XGBoost
    0 11.655653 11.666206 11.661235 11.717153 11.664298 11.639410 11.735618 11.754628
    1 12.033653 12.042914 12.039875 11.950150 12.032724 12.007921 11.956780 11.985191
    2 12.121196 12.121925 12.124266 12.138572 12.125334 12.072644 12.097413 12.115376
    3 12.194246 12.200128 12.201113 12.166538 12.196015 12.143436 12.095009 12.139894
    4 12.171520 12.180859 12.179168 12.145913 12.167523 12.168576 12.178091 12.176064
    stacked_validation_train.drop('Lasso',axis=1,inplace=True)
    stacked_test_train.drop('Lasso',axis=1,inplace=True)
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler
    
    
    
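    # normalize=False (the old default) is left out because newer scikit-learn releases no longer accept the parameter
    meta_model = make_pipeline(RobustScaler(),Lasso(alpha=0.00001,copy_X=True,fit_intercept=True,precompute=False,
                                                   max_iter=10000,tol=0.0001,selection='random',random_state=42))
    meta_model.fit(stacked_validation_train,Y_test)
    # The models predict on the log scale, so expm1 converts the predictions back to prices
    meta_model_pred= np.expm1(meta_model.predict(stacked_test_train))
    print("meta_model trained and test-set predictions made")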
    
    meta_model trained and test-set predictions made
    
    
    /Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.7538551527086552, tolerance: 0.006483051719467419
      positive)
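
The `np.expm1` call above undoes a log transform of the target. Here is a small round-trip sketch, assuming the target was transformed with `np.log1p` earlier in the notebook (that step is outside this section, so treat it as an assumption). The three prices are the first `SalePrice` values shown in `train.head()`.

```python
import numpy as np

# First three SalePrice values from train.head()
prices = np.array([208500.0, 181500.0, 223500.0])

# log1p(x) = log(1 + x): compresses the long right tail of the prices
log_prices = np.log1p(prices)

# expm1(x) = exp(x) - 1: the exact inverse, used to turn predictions back into prices
restored = np.expm1(log_prices)

print(np.round(log_prices, 3))
print(np.allclose(restored, prices))
```

This also explains why the stacked prediction tables hold values around 12: they are log prices, not prices.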
    
    models = [KernelRidge(), ElasticNet(), Lasso(), GradientBoostingRegressor(), BayesianRidge(), LassoLarsIC(), RandomForestRegressor(), xgb.XGBRegressor()]
    names = ['KernelRidge', 'ElasticNet', 'Lasso', 'Gradient Boosting', 'Bayesian Ridge', 'Lasso Lars IC', 'Random Forest', 'XGBoost']
    params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]
    final_predictions = pd.DataFrame()
    
    # Second layer: tune every model again, this time on the stacked first-layer predictions
    for col_index, (alg, name, grid) in enumerate(zip(models, names, params_grid)):
        gs_alg = GridSearchCV(alg, param_grid = grid, cv = shuff, scoring = 'neg_mean_squared_error', n_jobs=-1)
        gs_alg.fit(stacked_validation_train, Y_test)
        gs_best = gs_alg.best_estimator_
        # expm1 converts the log-scale predictions back to prices
        final_predictions.insert(loc = col_index, column = name, value = np.expm1(gs_best.predict(stacked_test_train)))
        print(col_index+1, alg.__class__.__name__, 'final results predicted added to table...')

    print("-"*50)
    print("Done")
    final_predictions.head()
    
    1 KernelRidge final results predicted added to table...
    2 ElasticNet final results predicted added to table...
    3 Lasso final results predicted added to table...
    4 GradientBoostingRegressor final results predicted added to table...
    5 BayesianRidge final results predicted added to table...
    6 LassoLarsIC final results predicted added to table...
    7 RandomForestRegressor final results predicted added to table...
    [18:03:42] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    8 XGBRegressor final results predicted added to table...
    --------------------------------------------------
    Done
    
    
    /Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
      if getattr(data, 'base', None) is not None and 
    
               KernelRidge         ElasticNet              Lasso  Gradient Boosting     Bayesian Ridge      Lasso Lars IC      Random Forest            XGBoost
    0      120698.786728      121126.968875      120569.541877      119545.552352      121817.672344      121618.593011      120774.731602      117987.320312
    1      162778.261755      162293.616103      163198.661456      154034.245333      162888.953970      162663.194168      154944.085742      154422.265625
    2      184187.690046      183822.395933      184145.902661      181996.954345      185167.984485      184643.383928      181824.224304      174336.687500
    3      193128.541814      192388.040730      193035.580999      195110.109361      193760.580424      193069.794744      188563.541259      181933.593750
    4      192957.823204      192839.290437      193289.070140      192292.299199      192910.466862      192890.725826      190770.891456      192144.093750
    # Weighted average of the meta-model and seven second-layer models (ElasticNet is left out); the weights sum to 1
    ensemble = (meta_model_pred*(1/10)
                + final_predictions['XGBoost']*(1.5/10)
                + final_predictions['Gradient Boosting']*(2/10)
                + final_predictions['Bayesian Ridge']*(1/10)
                + final_predictions['Lasso']*(1/10)
                + final_predictions['KernelRidge']*(1/10)
                + final_predictions['Lasso Lars IC']*(1/10)
                + final_predictions['Random Forest']*(1.5/10))
    
    submission = pd.DataFrame()
    test1 = pd.read_csv('test.csv',index_col=False)
    test_ID = test1['Id']
    submission['Id'] = test_ID
    submission['SalePrice'] = ensemble
    submission.to_csv('final_submission.csv',index=False)
    print("Submission file, created!")
    
    Submission file, created!
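
The final blend is a hand-weighted average, and it only stays on the price scale if the weights add up to 1. Here is a small sketch of a helper that checks this before blending. The `weighted_blend` function and the two toy models are hypothetical; only the list of weights is taken from the ensemble line in the notebook.

```python
import numpy as np


def weighted_blend(predictions, weights):
    """Blend per-model prediction arrays with weights that must sum to 1 (hypothetical helper)."""
    if set(predictions) != set(weights):
        raise ValueError("predictions and weights must name the same models")
    total = sum(weights.values())
    if not np.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return sum(weights[name] * np.asarray(predictions[name], dtype=float) for name in weights)


# Toy predictions from two made-up models
toy_predictions = {"model_a": [100.0, 200.0], "model_b": [110.0, 190.0]}
toy_weights = {"model_a": 0.25, "model_b": 0.75}
blend = weighted_blend(toy_predictions, toy_weights)
print(blend)

# The eight weights used in the notebook's ensemble line
notebook_weights = [1/10, 1.5/10, 2/10, 1/10, 1/10, 1/10, 1/10, 1.5/10]
print(np.isclose(sum(notebook_weights), 1.0))
```

Keeping the weights in a dictionary next to the model names makes it harder to drop or double-count a model when the blend is adjusted by hand.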
  • Original article: https://www.cnblogs.com/onemorepoint/p/11236051.html