『cs231n』Assignment 2 Notes: Understanding Optimizers Through Code
1) Adagrad
Adagrad is an adaptive learning-rate method. Its per-parameter update can be implemented as:
```python
cache += dx**2
x += -learning_rate * dx / (np.sqrt(cache) + eps)
```
The benefit of this scheme is that weights receiving large gradients have their effective learning rate reduced, while weights receiving small gradients see their effective learning rate increased over the iterations. Note that the square root is important here. The smoothing term eps avoids division by zero and is usually set somewhere between 1e-4 and 1e-8.
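As a minimal sketch (my own illustration, not assignment code; the function name adagrad and the config-dict convention simply mirror the full implementations at the end of this post), a complete Adagrad step looks like this:

```python
import numpy as np

def adagrad(x, dx, config=None):
    """Sketch of one Adagrad step, in the config-dict style used later in this post."""
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))

    # The cache only ever grows, so each parameter's effective
    # learning rate shrinks monotonically over training.
    config['cache'] += dx ** 2
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
    return next_x, config
```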
2) RMSprop
RMSProp makes a simple modification to Adagrad that softens its aggressive, monotonically decaying effective learning rate:
```python
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += -learning_rate * dx / (np.sqrt(cache) + eps)
```
Here decay_rate is a hyperparameter, typically chosen from [0.9, 0.99, 0.999].
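To see why the leaky cache matters, here is a toy numeric check (my own illustration, not from the assignment): with a constant gradient, Adagrad's cache grows without bound and keeps shrinking the step size, while RMSProp's cache settles at a fixed value.

```python
import numpy as np

# Constant gradient of 1.0 for 1000 steps.
dx = 1.0
adagrad_cache, rmsprop_cache = 0.0, 0.0
decay_rate = 0.99

for t in range(1000):
    adagrad_cache += dx ** 2
    rmsprop_cache = decay_rate * rmsprop_cache + (1 - decay_rate) * dx ** 2

print(np.sqrt(adagrad_cache))   # ~31.6 and still growing
print(np.sqrt(rmsprop_cache))   # ~1.0, bounded
```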
3) Adam
Adam is roughly RMSProp combined with momentum, and in practice it tends to work slightly better than RMSProp. A simplified version of the update is:
```python
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx**2)
x += -learning_rate * m / (np.sqrt(v) + eps)
```
The paper recommends eps = 1e-8, beta1 = 0.9, and beta2 = 0.999. Note that this simplified snippet omits the bias-correction step, which the complete implementation at the end of this post includes.
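Because m and v are initialized to zero, their moving averages are biased toward zero during the first iterations; dividing by (1 - beta1**t) and (1 - beta2**t) corrects for this. A toy check at t = 1 (my own illustration, not assignment code):

```python
import numpy as np

beta1 = 0.9
dx = np.array([2.0])
m = np.zeros_like(dx)

m = beta1 * m + (1 - beta1) * dx   # m == [0.2], far smaller than the true gradient
mb = m / (1 - beta1 ** 1)          # mb == [2.0], the bias-corrected estimate
print(m, mb)
```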
Putting it all together, the complete update rules (vanilla SGD, SGD with momentum, RMSProp, and Adam) are implemented below:

```python
import numpy as np

"""
Each update rule below has the same interface.

Inputs:
  - w (or x): current parameter values
  - dw (or dx): gradient of the loss with respect to the parameters
  - config: dict holding hyperparameters and cached state

Returns:
  - next_w (or next_x): updated parameter values
  - config: updated config dict, to be passed back in on the next call
"""


def sgd(w, dw, config=None):
    """Vanilla stochastic gradient descent.

    config format:
    - learning_rate: scalar learning rate.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)

    w -= config['learning_rate'] * dw
    return w, config


def sgd_momentum(w, dw, config=None):
    """SGD with momentum (one of the most commonly used rules).

    config format:
    - learning_rate: scalar learning rate.
    - momentum: momentum coefficient.
    - velocity: a numpy array of the same shape as w and dw, used to
      store a moving average of the gradients.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    # Update the velocity, then step in its direction.
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v

    return next_w, config


def rmsprop(x, dx, config=None):
    """RMSProp: scale the gradient by a moving average of its squared values.

    config format:
    - learning_rate: scalar learning rate.
    - decay_rate: decay rate of the squared-gradient cache.
    - epsilon: small value to avoid division by zero.
    - cache: moving average of the squared gradients.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))

    # Leaky accumulation of squared gradients, then a scaled step.
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx * dx)
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])

    return next_x, config


def adam(x, dx, config=None):
    """Adam: RMSProp-style scaling plus momentum and bias correction.

    config format:
    - learning_rate: scalar learning rate.
    - beta1: decay rate of the moving average of the gradient (m).
    - beta2: decay rate of the moving average of the squared gradient (v).
    - epsilon: small value to avoid division by zero.
    - m: moving average of the gradient.
    - v: moving average of the squared gradient.
    - t: iteration number.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 0)

    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx ** 2)

    # Bias correction: m and v start at zero, so rescale them early on.
    mb = config['m'] / (1 - config['beta1'] ** config['t'])
    vb = config['v'] / (1 - config['beta2'] ** config['t'])

    next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
    return next_x, config
```
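As a usage sketch (my own illustration; the toy loss and variable names are hypothetical), each rule is applied by threading the config dict through successive calls:

```python
import numpy as np

# Toy quadratic loss 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0, 3.0])
config = None

for step in range(500):
    dw = w                      # gradient of the toy loss at the current w
    w, config = adam(w, dw, config)

print(w)   # each entry has moved toward the minimum at w = 0
```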