  • Linear Regression

    Linear regression is one of the most fundamental methods in machine learning. There is plenty of material on the theory, so it is not covered here; this post focuses on using the algorithm through the scikit-learn library.

    First, we randomly generate a set of data and add some random noise to it.

import numpy as np
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

def f(x):
    return np.sin(2 * np.pi * x)

x_plot = np.linspace(0, 1, 100)

n_samples = 100
X = np.random.uniform(0, 1, size=n_samples)[:, np.newaxis]
y = f(X) + np.random.normal(scale=0.3, size=n_samples)[:, np.newaxis]  # add random noise to the dataset

# keep only 20% of the samples for training so that overfitting is easy to provoke
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
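    The plotting code below relies on a helper called plot_approximation that never appears in the post (later it is pulled in via from plot import *). A minimal sketch of such a helper, assuming the signature used below and the x_plot, f, X_train, y_train defined above, might look like this:

def plot_approximation(est, ax, label=None, xlabel=None):
    # hypothetical helper: draw the true function, the noisy training samples,
    # and the fitted model's prediction over [0, 1] on the given axes
    ax.plot(x_plot, f(x_plot), color='green', label='ground truth')
    ax.scatter(X_train.ravel(), y_train.ravel(), s=10, alpha=0.4)
    ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='red', label=label)
    ax.set_ylim((-2, 2))
    if xlabel:
        ax.set_xlabel(xlabel)
    if label:
        ax.legend(loc='upper right', fontsize='small')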

    First, fit polynomial models without any regularization:

import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

fig, axes = plt.subplots(5, 2, figsize=(8, 5))
train_error = np.empty(10)
test_error = np.empty(10)

for ax, degree in zip(axes.ravel(), range(10)):
    est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    est.fit(X_train, y_train)
    train_error[degree] = mean_squared_error(y_train, est.predict(X_train))
    test_error[degree] = mean_squared_error(y_test, est.predict(X_test))
    plot_approximation(est, ax, label='degree=%d' % degree)
plt.show()

plt.plot(np.arange(10), train_error, color='green', label='train')
plt.plot(np.arange(10), test_error, color='red', label='test')
plt.ylim(0.0, 1e0)
plt.ylabel('mean squared error')
plt.xlabel('degree')
plt.legend(loc="upper left")
plt.show()

    The resulting training and test errors are:

    Once the highest polynomial degree exceeds 6, the training error stays small while the test error becomes very large: the model overfits. Next, add an L2 penalty with Ridge.
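    For reference, scikit-learn's Ridge minimizes the L2-penalized least-squares objective

    ||y - Xw||_2^2 + alpha * ||w||_2^2

    so the larger alpha is, the more strongly the coefficients are shrunk toward zero. The code below fits degree-7 to degree-9 polynomials for several values of alpha: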

from sklearn.linear_model import Ridge

alphas = [0.0, 1e-8, 1e-5, 1e-1]
fig, ax_rows = plt.subplots(3, 4, figsize=(8, 5))
for degree, ax_row in zip(range(7, 10), ax_rows):
    for alpha, ax in zip(alphas, ax_row):
        est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
        est.fit(X_train, y_train)
        plot_approximation(est, ax, xlabel="degree=%d alpha=%r" % (degree, alpha))
# plt.tight_layout()
plt.show()

    Let's take a closer look at how different values of alpha affect the polynomial coefficients:

def plot_coefficients(est, ax, label=None, yscale='log'):
    # extract the coefficients of the final estimator in the pipeline
    coef = est.steps[-1][1].coef_.ravel()
    if yscale == 'log':
        ax.semilogy(np.abs(coef), marker='o', label=label)
        ax.set_ylim((1e-1, 1e8))
    else:
        ax.plot(np.abs(coef), marker='o', label=label)
    ax.set_ylabel('abs(coefficient)')
    ax.set_xlabel('coefficients')
    ax.set_xlim((1, 9))

degree = 9
fig, ax_rows = plt.subplots(4, 2, figsize=(8, 5))
alphas = [0.0, 1e-8, 1e-5, 1e-1]
for alpha, ax_row in zip(alphas, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r' % alpha)
    plot_coefficients(est, ax_right, label='Ridge(alpha=%r) coefficients' % alpha)

plt.show()

    The larger alpha is, the smaller the coefficients become and the smoother the fitted curve gets. Ridge adds an L2 penalty; an L1 penalty can be added instead by using Lasso.
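    In scikit-learn, Lasso minimizes

    (1 / (2 * n_samples)) * ||y - Xw||_2^2 + alpha * ||w||_1

    and the L1 penalty tends to drive some coefficients exactly to zero rather than merely shrinking them. The code below fits Lasso for two values of alpha: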

from sklearn.linear_model import Lasso

fig, ax_rows = plt.subplots(2, 2, figsize=(8, 5))

degree = 9
alphas = [1e-3, 1e-2]

for alpha, ax_row in zip(alphas, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), Lasso(alpha=alpha))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r' % alpha)
    plot_coefficients(est, ax_right, label='Lasso(alpha=%r) coefficients' % alpha, yscale=None)

plt.tight_layout()
plt.show()

    Besides the two approaches above, scikit-learn also supports combining L1 and L2 regularization, which is done by training with ElasticNet:

from sklearn.linear_model import ElasticNet

fig, ax_rows = plt.subplots(8, 2, figsize=(8, 5))
alphas = [1e-2, 1e-2, 1e-2, 1e-3, 1e-3, 1e-3, 1e-4, 1e-4]
ratios = [0.05, 0.85, 0.50, 0.05, 0.85, 0.50, 0.03, 0.95]
for alpha, ratio, ax_row in zip(alphas, ratios, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), ElasticNet(alpha=alpha, l1_ratio=ratio))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r ratio=%r' % (alpha, ratio))
    plot_coefficients(est, ax_right, label="ElasticNet(alpha=%r ratio=%r) coefficients" % (alpha, ratio), yscale=None)

plt.show()

    For a fixed alpha, the shape of the curve does not change much. alpha constrains the magnitude of the parameters: the smaller alpha is, the wider the range the parameters can take, just as when only L2 or L1 regularization is used. The l1_ratio controls how the parameter values are distributed: when the ratio is large, the parameters are relatively sparse (only a few have large values while the rest are small or close to zero); when the ratio is small, the parameters differ less from one another and are distributed more evenly.
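    This behaviour matches the objective ElasticNet minimizes in scikit-learn, which mixes the two penalties:

    (1 / (2 * n_samples)) * ||y - Xw||_2^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2

    Here alpha sets the overall strength of the regularization, while l1_ratio controls how much of it is the sparsity-inducing L1 part versus the smoothly shrinking L2 part.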

    Dataset 1: Test Scores for General Psychology

    Each record is a four-tuple <x1, x2, x3, x4>, where x1, x2, x3 are the scores of the first three exams and x4 is the final score. The task is to predict x4 from (x1, x2, x3). The dataset has 25 records in total, with the first row being a header. Below, a simple linear regression model is trained with no regularization, with L2 regularization, and with L1 regularization, and the results are compared.

# -*-encoding:utf-8-*-
'''
Created on
author: dstarer
copyright: dstarer
'''

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from plot import *  # local helper module providing plotScatter2D


def readData(filename, ignoreFirstLine=True, separator='\t'):
    # read a tab-separated file of integers into a list of rows
    dataSet = []
    fp = open(filename, "r")
    if ignoreFirstLine:
        fp.readline()
    for line in fp.readlines():
        elements = list(map(int, line.strip().split(separator)))
        dataSet.append(elements)
    fp.close()
    return dataSet


def Print(message, train_error, test_error, coef):
    print("%s--------------" % message)
    print("train error: %.3f" % train_error)
    print("test error: %.3f" % test_error)
    print(coef)
    print("sum of coef: ", np.sum(coef))


def process(X, y, show=True):
    error = np.empty(3)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    est = LinearRegression()
    est.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, est.predict(X_train))
    test_error = mean_squared_error(y_test, est.predict(X_test))
    error[0] = test_error

    if show:
        Print(message="train without regularization", train_error=train_error, test_error=test_error, coef=est.coef_)

    ridge = Ridge()
    ridge.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, ridge.predict(X_train))
    test_error = mean_squared_error(y_test, ridge.predict(X_test))
    error[1] = test_error

    if show:
        # report the ridge model's own coefficients
        Print(message="train using L2 regularization", train_error=train_error, test_error=test_error, coef=ridge.coef_)

    lasso = Lasso()
    lasso.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, lasso.predict(X_train))
    test_error = mean_squared_error(y_test, lasso.predict(X_test))
    error[2] = test_error

    if show:
        # report the lasso model's own coefficients
        Print(message="train using L1 regularization", train_error=train_error, test_error=test_error, coef=lasso.coef_)

    if show:
        print("Data ------------")
        print("[x1  x2  x3 ] [y] [ est ] [ ridge ] [ lasso ]")
        for X_row, y_row, est_v, ridge_v, lasso_v in zip(X_test, y_test, est.predict(X_test), ridge.predict(X_test), lasso.predict(X_test)):
            print(X_row, y_row, est_v, ridge_v, lasso_v)

    return error


def error_estimate(X, y):
    # average the test errors of the three models over 20 random splits
    error = np.empty(3)
    Iters = 20

    for i in range(Iters):
        tmp = process(X, y, show=False)
        error = error + tmp
    error /= Iters
    print("normal error: %.3f" % error[0])
    print("L2 error: %.3f" % error[1])
    print("L1 error: %.3f" % error[2])


def extract_data(filename):
    dataset = np.mat(readData(filename))

    y = dataset[:, -1]
    X = dataset[:, :-1]

    process(X, y, show=True)

    # original data set
    print("original data set:")
    error_estimate(X, y)

    print("using the first two dimensions")
    X = dataset[:, :-2]
    error_estimate(X, y)

    print("use the first and third dimensions")
    X = dataset[:, ::2]
    error_estimate(X, y)

    print("only use the third dimension")
    X = dataset[:, 2]
    error_estimate(X, y)

    print("use the second and third dimensions")
    X = dataset[:, 1:-1]
    error_estimate(X, y)

    # plot the data
    ax = plt.gca()
    X1 = dataset[:, 0]
    X2 = dataset[:, 1]
    X3 = dataset[:, 2]
    plotScatter2D(ax=ax, X=X1, y=y, color="red")
    plotScatter2D(ax=ax, X=X2, y=y, color="blue")
    plotScatter2D(ax=ax, X=X3, y=y, color="green")
    plt.show()


if '__main__' == __name__:
    extract_data(r"E:\dataset\mldata\test_score.csv")
train without regularization--------------
train error: 2.457
test error: 16.153
[[ 0.39285405  0.5157764   1.16694498]]
sum of coef:  2.07557543476
train using L2 regularization--------------
train error: 2.457
test error: 16.159
[[ 0.39285405  0.5157764   1.16694498]]
sum of coef:  2.07557543476
train using L1 regularization--------------
train error: 2.466
test error: 16.038
[[ 0.39285405  0.5157764   1.16694498]]
sum of coef:  2.07557543476
Data ------------
[x1  x2  x3 ] [y] [ est ] [ ridge ] [ lasso ]
[70 73 78] [148] [ 150.36924252] [ 150.36061654] 150.447200166
[78 75 68] [147] [ 142.87417786] [ 142.89774266] 142.910175633
[93 89 96] [192] [ 188.66231778] [ 188.65674942] 188.514072928
[93 88 93] [185] [ 184.64570642] [ 184.64589888] 184.502840202
[47 56 60] [115] [ 111.56039086] [ 111.54876013] 111.860472713
[87 79 90] [175] [ 174.14575956] [ 174.14218324] 174.036875299
[78 83 85] [175] [ 166.83845382] [ 166.82925063] 166.853488692
original data set:
normal error: 9.255
L2 error: 11.200
L1 error: 12.574
using the first two dimensions
normal error: 63.057
L2 error: 64.947
L1 error: 66.151
use the first and third dimensions
normal error: 23.051
L2 error: 23.057
L1 error: 23.230
only use the third dimension
normal error: 39.893
L2 error: 39.890
L1 error: 39.899
use the second and third dimensions
normal error: 12.268
L2 error: 12.265
L1 error: 12.260

    The above are some test results. To get a more concrete picture of how linear regression performs, the experiment is repeated 20 times with a different random train/test split each time, and the test errors are plotted.
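    The plotting code for those error curves is not included in the post; a minimal sketch that reuses the process function above could look like this (the red line matches the description below; the other colour and label choices are my own):

n_runs = 20
# one row of [no-regularization, ridge, lasso] test errors per random split
errors = np.array([process(X, y, show=False) for _ in range(n_runs)])

plt.plot(np.arange(n_runs), errors[:, 0], color='red', label='no regularization')
plt.plot(np.arange(n_runs), errors[:, 1], color='blue', label='L2 (Ridge)')
plt.plot(np.arange(n_runs), errors[:, 2], color='green', label='L1 (Lasso)')
plt.xlabel('run')
plt.ylabel('test mean squared error')
plt.legend(loc='upper right')
plt.show()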

    The red line shows the results without regularization; the test error varies quite a bit across different splits. Using all three features together, the results are reasonably good overall. Could another approach do even better? To look for a better predictor, I picked every pair of the three features and tested them. Those test results are already shown above; to be more rigorous, each pair was also tested 20 times with random splits, and the error curves are as follows:

    The first plot uses (x1, x2), the second (x1, x3), and the third (x2, x3). It is easy to see that using only (x2, x3) performs almost the same as using all of (x1, x2, x3)! Looking back at the coefficients printed earlier, a3 > a2 > a1 and a3 > a2 + a1; in other words, x3 is the most important feature. That is why the test error is large when only x1 and x2 are used and drops once x3 is added, and why, with x2 and x3 together, most of the signal in the data is already captured, so the result is essentially the same as with (x1, x2, x3).
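    That ordering can also be checked programmatically by ranking features by the absolute value of their fitted coefficients; a small sketch, assuming est is the unregularized LinearRegression fitted in process above:

coef = np.abs(np.asarray(est.coef_)).ravel()
order = np.argsort(coef)[::-1]  # feature indices sorted by decreasing |coefficient|
for idx in order:
    print("x%d: abs(coef) = %.3f" % (idx + 1, coef[idx]))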

    To check how important x3 is on its own, I also trained using only x3. Comparing x3 alone against (x1, x2) together, x3 alone performs better. Of course, my testing methodology here may not be ideal, but on average it still supports the conclusion above. Finally, let's look at how each (xi, y) is distributed:

    Overall, x3 is distributed relatively tightly, while x1 and x2 are more scattered and fluctuate more. To be continued with the next dataset...
