  • House Price Prediction (Basic Version, Test)

    #coding=utf8
    
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestRegressor
    
    # The first column (Id) is not a feature; use it only as the index
    train_df = pd.read_csv('./input/train.csv', index_col=0)
    test_df = pd.read_csv('./input/test.csv', index_col=0)
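    # The label (SalePrice) is not smoothly distributed. To help the model learn
    # more accurately, we first "smooth" (normalize) the label. If this step is
    # missed, the results tend to stay below a good score no matter what else is tuned.
    # Here we use log1p, i.e. log(x + 1), which stays well-defined at x = 0 and
    # never returns a negative value for non-negative x.
    # Since the model is trained on the smoothed label, the predictions must be
    # transformed back at the end. Follow the "undo it the way you did it" rule:
    # log1p() is reversed by expm1(); likewise, log() is reversed by exp(), and so on.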

    prices = pd.DataFrame({'price':train_df['SalePrice'], 'log(price + 1)':np.log1p(train_df['SalePrice'])})
    #print train_df.columns
    #prices.hist()
    #print 'ok'
    y_train = np.log1p(train_df.pop('SalePrice'))
    #print y_train.shape
    #print train_df.index
    all_df = pd.concat((train_df,test_df), axis=0)
    # Variable conversion: MSSubClass is a category code, so cast it to str
    
    #print all_df['MSSubClass'].dtypes
    all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
    #print all_df.shape
    #print all_df['MSSubClass'].value_counts()
    #print all_df['MSSubClass'].dtypes
    #print pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass').head()
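    # When numbers are used to represent a categorical variable, keep in mind that
    # numbers carry an ordering and a magnitude, so arbitrary numeric codes can
    # mislead the model later on. One-hot encoding expresses the categories instead.
    # pandas ships with get_dummies, which does one-hot encoding in a single call.
    # One-hot encode all of the categorical columns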
    all_dummy_df = pd.get_dummies(all_df)
    #print all_dummy_df.head()
    #print all_dummy_df.isnull().sum().sort_values(ascending=False).head(10)
    # Handle missing values: fill each column with its mean
    mean_cols = all_dummy_df.mean()
    #print mean_cols
    all_dummy_df = all_dummy_df.fillna(mean_cols)
    #print all_dummy_df.isnull().sum().sum()
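    # Standardize the numerical data. The one-hot 0/1 columns do not need to be
    # standardized; the target is the columns that were numerical to begin with.
    # First, find out which columns are numerical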
    numeric_cols = all_df.columns[all_df.dtypes != 'object']
    #print numeric_cols
    #print train_df.index
    numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()
    numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
    all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
    
    dummy_train_df = all_dummy_df.loc[train_df.index]
    dummy_test_df = all_dummy_df.loc[test_df.index]
    #print train_df.index
    #print test_df.index
    #print dummy_train_df.shape
    #print dummy_test_df.shape
    #print type(dummy_train_df)
    
    X_train = dummy_train_df.values
    X_test = dummy_test_df.values
    #print type(X_train)
    
    print(X_train.shape)
    alphas = np.logspace(-3, 2, 50)
    test_scores = []
    for alpha in alphas:
        clf = Ridge(alpha=alpha)
        test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))
    
    plt.figure()
    plt.plot(alphas, test_scores)
    plt.title('Alpha vs CV Error')
    
    max_features = [.1, .3, .5, .7, .9, .99]
    test_scores = []
    for max_feat in max_features:
        clf = RandomForestRegressor(n_estimators=200, max_features=max_feat)
        test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))
    
    plt.figure()  # new figure so this curve is not drawn over the previous plot
    plt.plot(max_features, test_scores)
    plt.title("Max Features vs CV Error")
    
    # Ensemble: average the Ridge and random forest predictions
    ridge = Ridge(alpha=15)
    rf = RandomForestRegressor(n_estimators=500, max_features=.3)
    
    ridge.fit(X_train, y_train)
    rf.fit(X_train, y_train)
    
    y_ridge = np.expm1(ridge.predict(X_test))
    y_rf = np.expm1(rf.predict(X_test))
    y_final = (y_ridge + y_rf) / 2
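
The two ideas described in the comments above (the log1p/expm1 round trip, and one-hot encoding of a numeric category code) can be checked on a toy table. This is only a minimal sketch: the DataFrame `toy` and its values are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20],              # numeric codes that are really categories
    "SalePrice": [100000.0, 250000.0, 0.0],  # includes 0 to show that log1p is safe there
})

# log1p(x) = log(x + 1): maps 0 to 0 instead of -inf.
log_price = np.log1p(toy["SalePrice"])

# expm1 is the inverse of log1p, so it recovers the original prices.
restored = np.expm1(log_price)

# Cast the code column to str so that get_dummies treats it as categorical
# and creates one 0/1 column per distinct code.
toy["MSSubClass"] = toy["MSSubClass"].astype(str)
dummies = pd.get_dummies(toy[["MSSubClass"]])

print(log_price.tolist())
print(np.allclose(restored, toy["SalePrice"]))
print(list(dummies.columns))
```

The same pairing applies to the ensemble step: both models were fit on `np.log1p(SalePrice)`, so their predictions are passed through `np.expm1` before they are averaged.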
  • Original article: https://www.cnblogs.com/TMatrix52/p/7717906.html