  • Linear Regression

    Linear regression is one of the most basic methods in machine learning. There is plenty of theory behind it that won't be repeated here; this post looks at the algorithm from the angle of using the scikit-learn library.

    First, we generate a set of random data and add some random noise to it.

        import numpy as np
        from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases

        def f(x):
            return np.sin(2 * np.pi * x)

        x_plot = np.linspace(0, 1, 100)

        n_samples = 100
        X = np.random.uniform(0, 1, size=n_samples)[:, np.newaxis]
        # add random Gaussian noise to the targets
        y = f(X) + np.random.normal(scale=0.3, size=n_samples)[:, np.newaxis]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
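
    The plotting code below calls a helper named plot_approximation that the original post does not include. A minimal sketch, assuming it simply draws the true function, the noisy training points, and the fitted curve over x_plot (the axis limits and styling are guesses):

        import matplotlib.pyplot as plt

        def plot_approximation(est, ax, label=None, xlabel=None):
            # true function, noisy training points, and the model's prediction
            ax.plot(x_plot, f(x_plot), color='green')
            ax.scatter(X_train.ravel(), y_train.ravel(), s=10, alpha=0.4)
            ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='red', label=label)
            ax.set_ylim((-2, 2))
            if xlabel:
                ax.set_xlabel(xlabel)
            if label:
                ax.legend(loc='upper right', fontsize='small')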

    First, fit polynomial models without adding a regularization term:

        import matplotlib.pyplot as plt
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_squared_error

        fig, axes = plt.subplots(5, 2, figsize=(8, 5))
        train_error = np.empty(10)
        test_error = np.empty(10)
        # fit one model per polynomial degree and record its train/test error
        for ax, degree in zip(axes.ravel(), range(10)):
            est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
            est.fit(X_train, y_train)
            train_error[degree] = mean_squared_error(y_train, est.predict(X_train))
            test_error[degree] = mean_squared_error(y_test, est.predict(X_test))
            plot_approximation(est, ax, label='degree=%d' % degree)
        plt.show()

        plt.plot(np.arange(10), train_error, color='green', label='train')
        plt.plot(np.arange(10), test_error, color='red', label='test')
        plt.ylim(0.0, 1e0)
        plt.ylabel('mean squared error')
        plt.xlabel('degree')
        plt.legend(loc="upper left")
        plt.show()

    The resulting errors are shown below:

    Once the highest polynomial degree exceeds 6, the training error stays small while the test error becomes far too large: the model overfits. Next, add an L2 regularization term:

        from sklearn.linear_model import Ridge

        alphas = [0.0, 1e-8, 1e-5, 1e-1]
        fig, ax_rows = plt.subplots(3, 4, figsize=(8, 5))
        for degree, ax_row in zip(range(7, 10), ax_rows):
            for alpha, ax in zip(alphas, ax_row):
                est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
                est.fit(X_train, y_train)
                plot_approximation(est, ax, xlabel="degree=%d alpha=%r" % (degree, alpha))
        #plt.tight_layout()
        plt.show()

    Let's look more closely at how different values of alpha affect the polynomial coefficients:

        def plot_coefficients(est, ax, label=None, yscale='log'):
            # plot the absolute values of the fitted polynomial coefficients
            coef = est.steps[-1][1].coef_.ravel()
            if yscale == 'log':
                ax.semilogy(np.abs(coef), marker='o', label=label)
                ax.set_ylim((1e-1, 1e8))
            else:
                ax.plot(np.abs(coef), marker='o', label=label)
            ax.set_ylabel('abs(coefficient)')
            ax.set_xlabel('coefficients')
            ax.set_xlim((1, 9))

        degree = 9
        fig, ax_rows = plt.subplots(4, 2, figsize=(8, 5))
        alphas = [0.0, 1e-8, 1e-5, 1e-1]
        for alpha, ax_row in zip(alphas, ax_rows):
            ax_left, ax_right = ax_row
            est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
            est.fit(X_train, y_train)
            plot_approximation(est, ax_left, label='alpha=%r' % alpha)
            plot_coefficients(est, ax_right, label='Ridge(alpha=%r) coefficients' % alpha)

        plt.show()

    The larger alpha is, the smaller the coefficients become and the smoother the fitted curve gets. Ridge adds an L2 penalty; with Lasso, an L1 penalty can be used instead:

        from sklearn.linear_model import Lasso

        fig, ax_rows = plt.subplots(2, 2, figsize=(8, 5))

        degree = 9
        alphas = [1e-3, 1e-2]

        for alpha, ax_row in zip(alphas, ax_rows):
            ax_left, ax_right = ax_row
            est = make_pipeline(PolynomialFeatures(degree), Lasso(alpha=alpha))
            est.fit(X_train, y_train)
            plot_approximation(est, ax_left, label='alpha=%r' % alpha)
            plot_coefficients(est, ax_right, label='Lasso(alpha=%r) coefficients' % alpha, yscale=None)

        plt.tight_layout()
        plt.show()

    Besides the two approaches above, scikit-learn also supports combining L1 and L2 regularization at the same time, which requires training with ElasticNet:

        from sklearn.linear_model import ElasticNet

        fig, ax_rows = plt.subplots(8, 2, figsize=(8, 5))
        alphas = [1e-2, 1e-2, 1e-2, 1e-3, 1e-3, 1e-3, 1e-4, 1e-4]
        ratios = [0.05, 0.85, 0.50, 0.05, 0.85, 0.50, 0.03, 0.95]
        for alpha, ratio, ax_row in zip(alphas, ratios, ax_rows):
            ax_left, ax_right = ax_row
            est = make_pipeline(PolynomialFeatures(degree), ElasticNet(alpha=alpha, l1_ratio=ratio))
            est.fit(X_train, y_train)
            plot_approximation(est, ax_left, label='alpha=%r ratio=%r' % (alpha, ratio))
            plot_coefficients(est, ax_right, label="ElasticNet(alpha=%r l1_ratio=%r) coefficients" % (alpha, ratio), yscale=None)

        plt.show()

    For a fixed alpha, the shape of the fitted curve barely changes. alpha constrains the range of the parameters: the smaller alpha is, the larger the values the parameters may take, just as with pure L2 or L1 regularization. The l1_ratio determines how the parameter values are distributed: when the ratio is large, the parameters are relatively sparse (only a few take large values while the rest are small or close to zero); when the ratio is small, the parameters differ less from one another and are spread more evenly.
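
    To make this sparsity effect concrete, a quick check along these lines (a sketch reusing the pipeline above; the two ratio values and the 1e-6 threshold are arbitrary choices) counts how many coefficients end up at essentially zero:

        # count coefficients that ElasticNet shrinks to (almost) zero
        for ratio in [0.05, 0.95]:
            est = make_pipeline(PolynomialFeatures(degree), ElasticNet(alpha=1e-3, l1_ratio=ratio))
            est.fit(X_train, y_train)
            coef = est.steps[-1][1].coef_.ravel()
            n_zero = np.sum(np.abs(coef) < 1e-6)
            print("l1_ratio=%.2f -> %d of %d coefficients are ~0" % (ratio, n_zero, coef.size))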

    Dataset 1: Test Scores for General Psychology

     Each record is a four-tuple <x1, x2, x3, x4>, where x1, x2, x3 are the scores of the first three exams and x4 is the final score. The task is to predict x4 from (x1, x2, x3). The dataset has 25 records in total, the first line being a header. Below we compare simple linear regression models trained with no regularization, with L2 regularization, and with L1 regularization.
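
    For reference, readData below expects a tab-separated file whose first line is a header. Using values that appear in the program output further down (the column names are assumed), the file would look roughly like:

        x1    x2    x3    x4
        70    73    78    148
        78    75    68    147
        93    89    96    192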

        # -*-encoding:utf-8-*-
        '''
        Created on 
        author: dstarer
        copyright: dstarer
        '''

        import numpy as np
        import matplotlib.pyplot as plt
        from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_squared_error
        from sklearn.linear_model import Ridge
        from sklearn.linear_model import Lasso
        from plot import *  # the author's local helper module providing plotScatter2D


        def readData(filename, ignoreFirstLine=True, separator='\t'):
            dataSet = []
            fp = open(filename, "r")
            if ignoreFirstLine:
                fp.readline()
            for line in fp.readlines():
                elements = list(map(int, line.strip().split(separator)))
                dataSet.append(elements)
            fp.close()
            return dataSet


        def Print(message, train_error, test_error, coef):
            print("%s--------------" % message)
            print("train error: %.3f" % train_error)
            print("test error: %.3f" % test_error)
            print(coef)
            print("sum of coef: ", np.sum(coef))


        def process(X, y, show=True):
            error = np.empty(3)
            X_train, X_test, y_train, y_test = train_test_split(X, y)

            est = LinearRegression()
            est.fit(X_train, y_train)
            train_error = mean_squared_error(y_train, est.predict(X_train))
            test_error = mean_squared_error(y_test, est.predict(X_test))
            error[0] = test_error
            if show:
                Print(message="train without regularization", train_error=train_error, test_error=test_error, coef=est.coef_)

            ridge = Ridge()
            ridge.fit(X_train, y_train)
            train_error = mean_squared_error(y_train, ridge.predict(X_train))
            test_error = mean_squared_error(y_test, ridge.predict(X_test))
            error[1] = test_error
            if show:
                # print each model's own coefficients (the recorded run below passed
                # est.coef_ here, which is why the same vector appears three times)
                Print(message="train using L2 regularization", train_error=train_error, test_error=test_error, coef=ridge.coef_)

            lasso = Lasso()
            lasso.fit(X_train, y_train)
            train_error = mean_squared_error(y_train, lasso.predict(X_train))
            test_error = mean_squared_error(y_test, lasso.predict(X_test))
            error[2] = test_error
            if show:
                Print(message="train using L1 regularization", train_error=train_error, test_error=test_error, coef=lasso.coef_)

            if show:
                print("Data ------------")
                print("[x1  x2  x3 ] [y] [\t est \t] [\t ridge \t] [\t lasso \t]")
                for X_row, y_row, est_v, ridge_v, lasso_v in zip(X_test, y_test, est.predict(X_test), ridge.predict(X_test), lasso.predict(X_test)):
                    print(X_row, y_row, est_v, ridge_v, lasso_v)

            return error


        def error_estimate(X, y):
            # average the test error of each model over 20 random splits
            error = np.empty(3)
            Iters = 20
            for i in range(Iters):
                tmp = process(X, y, show=False)
                error = error + tmp
            error /= Iters
            print("normal error: %.3f" % error[0])
            print("L2 error: %.3f" % error[1])
            print("L1 error: %.3f" % error[2])


        def extract_data(filename):
            dataset = np.mat(readData(filename))

            y = dataset[:, -1]
            X = dataset[:, :-1]

            process(X, y, show=True)

            # original data set
            print("original data set:")
            error_estimate(X, y)

            print("using the first two dimensions")
            X = dataset[:, :-2]
            error_estimate(X, y)

            print("use the first and third dimensions")
            X = dataset[:, ::2]
            error_estimate(X, y)

            print("only use the third dimension")
            X = dataset[:, 2]
            error_estimate(X, y)

            print("use the second and third dimensions")
            X = dataset[:, 1:-1]
            error_estimate(X, y)

            # plot the data
            ax = plt.gca()
            X1 = dataset[:, 0]
            X2 = dataset[:, 1]
            X3 = dataset[:, 2]
            plotScatter2D(ax=ax, X=X1, y=y, color="red")
            plotScatter2D(ax=ax, X=X2, y=y, color="blue")
            plotScatter2D(ax=ax, X=X3, y=y, color="green")
            plt.show()


        if '__main__' == __name__:
            extract_data(r"E:\dataset\mldata\test_score.csv")
        train without regularization--------------
        train error: 2.457
        test error: 16.153
        [[ 0.39285405  0.5157764   1.16694498]]
        sum of coef:  2.07557543476
        train using L2 regularization--------------
        train error: 2.457
        test error: 16.159
        [[ 0.39285405  0.5157764   1.16694498]]
        sum of coef:  2.07557543476
        train using L1 regularization--------------
        train error: 2.466
        test error: 16.038
        [[ 0.39285405  0.5157764   1.16694498]]
        sum of coef:  2.07557543476
        Data ------------
        [x1  x2  x3 ] [y] [     est     ] [     ridge     ] [     lasso     ]
        [70 73 78] [148] [ 150.36924252] [ 150.36061654] 150.447200166
        [78 75 68] [147] [ 142.87417786] [ 142.89774266] 142.910175633
        [93 89 96] [192] [ 188.66231778] [ 188.65674942] 188.514072928
        [93 88 93] [185] [ 184.64570642] [ 184.64589888] 184.502840202
        [47 56 60] [115] [ 111.56039086] [ 111.54876013] 111.860472713
        [87 79 90] [175] [ 174.14575956] [ 174.14218324] 174.036875299
        [78 83 85] [175] [ 166.83845382] [ 166.82925063] 166.853488692
        original data set:
        normal error: 9.255
        L2 error: 11.200
        L1 error: 12.574
        using the first two dimensions
        normal error: 63.057
        L2 error: 64.947
        L1 error: 66.151
        use the first and third dimensions
        normal error: 23.051
        L2 error: 23.057
        L1 error: 23.230
        only use the third dimension
        normal error: 39.893
        L2 error: 39.890
        L1 error: 39.899
        use the second and third dimensions
        normal error: 12.268
        L2 error: 12.265
        L1 error: 12.260

    The output above shows the results of a single run. To get a more concrete picture of how linear regression behaves, I repeated the test 20 times, randomly re-splitting the data each time, and plotted the test errors, as shown in the figure below:
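
    The post does not include the plotting code for this figure. A minimal sketch of the idea, reusing the process() helper defined above with X and y loaded as in extract_data() (only the red colour for the unregularized model is stated in the text; the other colours are my choice):

        # run 20 random splits and plot each model's test error per split
        iters = 20
        errors = np.array([process(X, y, show=False) for _ in range(iters)])
        plt.plot(errors[:, 0], color='red', label='no regularization')
        plt.plot(errors[:, 1], color='blue', label='Ridge (L2)')
        plt.plot(errors[:, 2], color='green', label='Lasso (L1)')
        plt.xlabel('random split')
        plt.ylabel('test mean squared error')
        plt.legend(loc='upper right')
        plt.show()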

    The red line shows the results without regularization; the test error varies considerably across different splits. Using the combination of all three features gives reasonably good results overall, but could some other choice do even better? To look for a better predictor, I tried every pair of the three features. The numeric results are already shown above; to be a bit more rigorous, each pair was again tested 20 times, each time with a random split of the data, giving the error curves below:

    The first plot uses (x1, x2), the second (x1, x3), and the third (x2, x3). It is easy to see that using only (x2, x3) performs almost the same as using (x1, x2, x3)! Looking back at the coefficients printed earlier, a3 > a2 > a1 and a3 > a2 + a1, so x3 is the most important feature. That is why the test error is large when only x1 and x2 are considered and drops once x3 is included; with x2 and x3 together, the main structure of the data is already captured, so the results essentially match those obtained with (x1, x2, x3). To check how important x3 is on its own, I trained a model using only x3, with the following result:

    Comparing x3 alone against (x1, x2) together, x3 by itself does better. My testing procedure here may not be entirely sound, but on average it still supports the conclusion above. Finally, let's look at the distribution of each (xi, y):

    Overall, the values of x3 are relatively concentrated, while x1 and x2 are more scattered and fluctuate more. To be continued with the next dataset...

  • Original article: https://www.cnblogs.com/bootstar/p/4212902.html