  • Linear Regression

    Linear regression is one of the most basic methods in machine learning. There is plenty of theory behind it that won't be repeated here; this post looks at the algorithm from the angle of using the scikit-learn library.

    First, we generate a set of random data and add some random noise to it.

        import numpy as np
        from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases

        def f(x):
            return np.sin(2 * np.pi * x)

        x_plot = np.linspace(0, 1, 100)

        n_samples = 100
        X = np.random.uniform(0, 1, size=n_samples)[:, np.newaxis]
        # add random Gaussian noise to the targets
        y = f(X) + np.random.normal(scale=0.3, size=n_samples)[:, np.newaxis]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
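
    The plotting code below calls a helper named plot_approximation that the original post does not include. A minimal sketch, assuming it simply draws the true function, the noisy training points, and the fitted curve over x_plot (the axis limits and styling are guesses):

        import matplotlib.pyplot as plt

        def plot_approximation(est, ax, label=None, xlabel=None):
            # true function, noisy training points, and the model's prediction
            ax.plot(x_plot, f(x_plot), color='green')
            ax.scatter(X_train.ravel(), y_train.ravel(), s=10, alpha=0.4)
            ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='red', label=label)
            ax.set_ylim((-2, 2))
            if xlabel:
                ax.set_xlabel(xlabel)
            if label:
                ax.legend(loc='upper right', fontsize='small')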

    First, fit polynomial models without adding a regularization term:

        import matplotlib.pyplot as plt
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_squared_error

        fig, axes = plt.subplots(5, 2, figsize=(8, 5))
        train_error = np.empty(10)
        test_error = np.empty(10)
        # fit one model per polynomial degree and record its train/test error
        for ax, degree in zip(axes.ravel(), range(10)):
            est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
            est.fit(X_train, y_train)
            train_error[degree] = mean_squared_error(y_train, est.predict(X_train))
            test_error[degree] = mean_squared_error(y_test, est.predict(X_test))
            plot_approximation(est, ax, label='degree=%d' % degree)
        plt.show()

        plt.plot(np.arange(10), train_error, color='green', label='train')
        plt.plot(np.arange(10), test_error, color='red', label='test')
        plt.ylim(0.0, 1e0)
        plt.ylabel('mean squared error')
        plt.xlabel('degree')
        plt.legend(loc="upper left")
        plt.show()

    The resulting errors are shown below:

    Once the highest polynomial degree exceeds 6, the training error stays small while the test error becomes far too large: the model overfits. Next, add an L2 regularization term:

        from sklearn.linear_model import Ridge

        alphas = [0.0, 1e-8, 1e-5, 1e-1]
        fig, ax_rows = plt.subplots(3, 4, figsize=(8, 5))
        for degree, ax_row in zip(range(7, 10), ax_rows):
            for alpha, ax in zip(alphas, ax_row):
                est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
                est.fit(X_train, y_train)
                plot_approximation(est, ax, xlabel="degree=%d alpha=%r" % (degree, alpha))
        #plt.tight_layout()
        plt.show()

    Let's look more closely at how different values of alpha affect the polynomial coefficients:

        def plot_coefficients(est, ax, label=None, yscale='log'):
            # plot the absolute values of the fitted polynomial coefficients
            coef = est.steps[-1][1].coef_.ravel()
            if yscale == 'log':
                ax.semilogy(np.abs(coef), marker='o', label=label)
                ax.set_ylim((1e-1, 1e8))
            else:
                ax.plot(np.abs(coef), marker='o', label=label)
            ax.set_ylabel('abs(coefficient)')
            ax.set_xlabel('coefficients')
            ax.set_xlim((1, 9))

        degree = 9
        fig, ax_rows = plt.subplots(4, 2, figsize=(8, 5))
        alphas = [0.0, 1e-8, 1e-5, 1e-1]
        for alpha, ax_row in zip(alphas, ax_rows):
            ax_left, ax_right = ax_row
            est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
            est.fit(X_train, y_train)
            plot_approximation(est, ax_left, label='alpha=%r' % alpha)
            plot_coefficients(est, ax_right, label='Ridge(alpha=%r) coefficients' % alpha)

        plt.show()

    The larger alpha is, the smaller the coefficients become and the smoother the fitted curve gets. Ridge adds an L2 penalty; with Lasso, an L1 penalty can be used instead:

        from sklearn.linear_model import Lasso

        fig, ax_rows = plt.subplots(2, 2, figsize=(8, 5))

        degree = 9
        alphas = [1e-3, 1e-2]

        for alpha, ax_row in zip(alphas, ax_rows):
            ax_left, ax_right = ax_row
            est = make_pipeline(PolynomialFeatures(degree), Lasso(alpha=alpha))
            est.fit(X_train, y_train)
            plot_approximation(est, ax_left, label='alpha=%r' % alpha)
            plot_coefficients(est, ax_right, label='Lasso(alpha=%r) coefficients' % alpha, yscale=None)

        plt.tight_layout()
        plt.show()

    Besides the two approaches above, scikit-learn also supports combining L1 and L2 regularization at the same time, which requires training with ElasticNet:

        from sklearn.linear_model import ElasticNet

        fig, ax_rows = plt.subplots(8, 2, figsize=(8, 5))
        alphas = [1e-2, 1e-2, 1e-2, 1e-3, 1e-3, 1e-3, 1e-4, 1e-4]
        ratios = [0.05, 0.85, 0.50, 0.05, 0.85, 0.50, 0.03, 0.95]
        for alpha, ratio, ax_row in zip(alphas, ratios, ax_rows):
            ax_left, ax_right = ax_row
            est = make_pipeline(PolynomialFeatures(degree), ElasticNet(alpha=alpha, l1_ratio=ratio))
            est.fit(X_train, y_train)
            plot_approximation(est, ax_left, label='alpha=%r ratio=%r' % (alpha, ratio))
            plot_coefficients(est, ax_right, label="ElasticNet(alpha=%r l1_ratio=%r) coefficients" % (alpha, ratio), yscale=None)

        plt.show()

    For a fixed alpha, the shape of the fitted curve barely changes. alpha constrains the range of the parameters: the smaller alpha is, the larger the values the parameters may take, just as with pure L2 or L1 regularization. The l1_ratio determines how the parameter values are distributed: when the ratio is large, the parameters are relatively sparse (only a few take large values while the rest are small or close to zero); when the ratio is small, the parameters differ less from one another and are spread more evenly.
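
    To make this sparsity effect concrete, a quick check along these lines (a sketch reusing the pipeline above; the two ratio values and the 1e-6 threshold are arbitrary choices) counts how many coefficients end up at essentially zero:

        # count coefficients that ElasticNet shrinks to (almost) zero
        for ratio in [0.05, 0.95]:
            est = make_pipeline(PolynomialFeatures(degree), ElasticNet(alpha=1e-3, l1_ratio=ratio))
            est.fit(X_train, y_train)
            coef = est.steps[-1][1].coef_.ravel()
            n_zero = np.sum(np.abs(coef) < 1e-6)
            print("l1_ratio=%.2f -> %d of %d coefficients are ~0" % (ratio, n_zero, coef.size))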

    Dataset 1: Test Scores for General Psychology

     Each record is a four-tuple <x1, x2, x3, x4>, where x1, x2, x3 are the scores of the first three exams and x4 is the final score. The task is to predict x4 from (x1, x2, x3). The dataset has 25 records in total, the first line being a header. Below we compare simple linear regression models trained with no regularization, with L2 regularization, and with L1 regularization.
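
    For reference, readData below expects a tab-separated file whose first line is a header. Using values that appear in the program output further down (the column names are assumed), the file would look roughly like:

        x1    x2    x3    x4
        70    73    78    148
        78    75    68    147
        93    89    96    192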

        # -*-encoding:utf-8-*-
        '''
        Created on 
        author: dstarer
        copyright: dstarer
        '''

        import numpy as np
        import matplotlib.pyplot as plt
        from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_squared_error
        from sklearn.linear_model import Ridge
        from sklearn.linear_model import Lasso
        from plot import *  # the author's local helper module providing plotScatter2D


        def readData(filename, ignoreFirstLine=True, separator='\t'):
            dataSet = []
            fp = open(filename, "r")
            if ignoreFirstLine:
                fp.readline()
            for line in fp.readlines():
                elements = list(map(int, line.strip().split(separator)))
                dataSet.append(elements)
            fp.close()
            return dataSet


        def Print(message, train_error, test_error, coef):
            print("%s--------------" % message)
            print("train error: %.3f" % train_error)
            print("test error: %.3f" % test_error)
            print(coef)
            print("sum of coef: ", np.sum(coef))


        def process(X, y, show=True):
            error = np.empty(3)
            X_train, X_test, y_train, y_test = train_test_split(X, y)

            est = LinearRegression()
            est.fit(X_train, y_train)
            train_error = mean_squared_error(y_train, est.predict(X_train))
            test_error = mean_squared_error(y_test, est.predict(X_test))
            error[0] = test_error
            if show:
                Print(message="train without regularization", train_error=train_error, test_error=test_error, coef=est.coef_)

            ridge = Ridge()
            ridge.fit(X_train, y_train)
            train_error = mean_squared_error(y_train, ridge.predict(X_train))
            test_error = mean_squared_error(y_test, ridge.predict(X_test))
            error[1] = test_error
            if show:
                # print each model's own coefficients (the recorded run below passed
                # est.coef_ here, which is why the same vector appears three times)
                Print(message="train using L2 regularization", train_error=train_error, test_error=test_error, coef=ridge.coef_)

            lasso = Lasso()
            lasso.fit(X_train, y_train)
            train_error = mean_squared_error(y_train, lasso.predict(X_train))
            test_error = mean_squared_error(y_test, lasso.predict(X_test))
            error[2] = test_error
            if show:
                Print(message="train using L1 regularization", train_error=train_error, test_error=test_error, coef=lasso.coef_)

            if show:
                print("Data ------------")
                print("[x1  x2  x3 ] [y] [\t est \t] [\t ridge \t] [\t lasso \t]")
                for X_row, y_row, est_v, ridge_v, lasso_v in zip(X_test, y_test, est.predict(X_test), ridge.predict(X_test), lasso.predict(X_test)):
                    print(X_row, y_row, est_v, ridge_v, lasso_v)

            return error


        def error_estimate(X, y):
            # average the test error of each model over 20 random splits
            error = np.empty(3)
            Iters = 20
            for i in range(Iters):
                tmp = process(X, y, show=False)
                error = error + tmp
            error /= Iters
            print("normal error: %.3f" % error[0])
            print("L2 error: %.3f" % error[1])
            print("L1 error: %.3f" % error[2])


        def extract_data(filename):
            dataset = np.mat(readData(filename))

            y = dataset[:, -1]
            X = dataset[:, :-1]

            process(X, y, show=True)

            # original data set
            print("original data set:")
            error_estimate(X, y)

            print("using the first two dimensions")
            X = dataset[:, :-2]
            error_estimate(X, y)

            print("use the first and third dimensions")
            X = dataset[:, ::2]
            error_estimate(X, y)

            print("only use the third dimension")
            X = dataset[:, 2]
            error_estimate(X, y)

            print("use the second and third dimensions")
            X = dataset[:, 1:-1]
            error_estimate(X, y)

            # plot the data
            ax = plt.gca()
            X1 = dataset[:, 0]
            X2 = dataset[:, 1]
            X3 = dataset[:, 2]
            plotScatter2D(ax=ax, X=X1, y=y, color="red")
            plotScatter2D(ax=ax, X=X2, y=y, color="blue")
            plotScatter2D(ax=ax, X=X3, y=y, color="green")
            plt.show()


        if '__main__' == __name__:
            extract_data(r"E:\dataset\mldata\test_score.csv")
        train without regularization--------------
        train error: 2.457
        test error: 16.153
        [[ 0.39285405  0.5157764   1.16694498]]
        sum of coef:  2.07557543476
        train using L2 regularization--------------
        train error: 2.457
        test error: 16.159
        [[ 0.39285405  0.5157764   1.16694498]]
        sum of coef:  2.07557543476
        train using L1 regularization--------------
        train error: 2.466
        test error: 16.038
        [[ 0.39285405  0.5157764   1.16694498]]
        sum of coef:  2.07557543476
        Data ------------
        [x1  x2  x3 ] [y] [     est     ] [     ridge     ] [     lasso     ]
        [70 73 78] [148] [ 150.36924252] [ 150.36061654] 150.447200166
        [78 75 68] [147] [ 142.87417786] [ 142.89774266] 142.910175633
        [93 89 96] [192] [ 188.66231778] [ 188.65674942] 188.514072928
        [93 88 93] [185] [ 184.64570642] [ 184.64589888] 184.502840202
        [47 56 60] [115] [ 111.56039086] [ 111.54876013] 111.860472713
        [87 79 90] [175] [ 174.14575956] [ 174.14218324] 174.036875299
        [78 83 85] [175] [ 166.83845382] [ 166.82925063] 166.853488692
        original data set:
        normal error: 9.255
        L2 error: 11.200
        L1 error: 12.574
        using the first two dimensions
        normal error: 63.057
        L2 error: 64.947
        L1 error: 66.151
        use the first and third dimensions
        normal error: 23.051
        L2 error: 23.057
        L1 error: 23.230
        only use the third dimension
        normal error: 39.893
        L2 error: 39.890
        L1 error: 39.899
        use the second and third dimensions
        normal error: 12.268
        L2 error: 12.265
        L1 error: 12.260

    The output above shows the results of a single run. To get a more concrete picture of how linear regression behaves, I repeated the test 20 times, randomly re-splitting the data each time, and plotted the test errors, as shown in the figure below:
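
    The post does not include the plotting code for this figure. A minimal sketch of the idea, reusing the process() helper defined above with X and y loaded as in extract_data() (only the red colour for the unregularized model is stated in the text; the other colours are my choice):

        # run 20 random splits and plot each model's test error per split
        iters = 20
        errors = np.array([process(X, y, show=False) for _ in range(iters)])
        plt.plot(errors[:, 0], color='red', label='no regularization')
        plt.plot(errors[:, 1], color='blue', label='Ridge (L2)')
        plt.plot(errors[:, 2], color='green', label='Lasso (L1)')
        plt.xlabel('random split')
        plt.ylabel('test mean squared error')
        plt.legend(loc='upper right')
        plt.show()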

    The red line shows the results without regularization; the test error varies considerably across different splits. Using the combination of all three features gives reasonably good results overall, but could some other choice do even better? To look for a better predictor, I tried every pair of the three features. The numeric results are already shown above; to be a bit more rigorous, each pair was again tested 20 times, each time with a random split of the data, giving the error curves below:

    The first plot uses (x1, x2), the second (x1, x3), and the third (x2, x3). It is easy to see that using only (x2, x3) performs almost the same as using (x1, x2, x3)! Looking back at the coefficients printed earlier, a3 > a2 > a1 and a3 > a2 + a1, so x3 is the most important feature. That is why the test error is large when only x1 and x2 are considered and drops once x3 is included; with x2 and x3 together, the main structure of the data is already captured, so the results essentially match those obtained with (x1, x2, x3). To check how important x3 is on its own, I trained a model using only x3, with the following result:

    Comparing x3 alone against (x1, x2) together, x3 by itself does better. My testing procedure here may not be entirely sound, but on average it still supports the conclusion above. Finally, let's look at the distribution of each (xi, y):

    Overall, the values of x3 are relatively concentrated, while x1 and x2 are more scattered and fluctuate more. To be continued with the next dataset...

  • Original article: https://www.cnblogs.com/bootstar/p/4212902.html