Theory for \(f : \mathbb{R}^{n} \mapsto \mathbb{R}\)
First, define a notation: the scalar product \(\langle a \mid b\rangle=\sum_{i=1}^{n} a_{i} b_{i}\).
The derivative can then be defined by the following formula:
\[f(x+h)=f(x)+\mathrm{d}_{x} f(h)+o_{h \rightarrow 0}(h)\]
Here \(o_{h \rightarrow 0}(h)\) denotes a remainder of the form \(\epsilon(h)\,\|h\|\) with \(\lim_{h \rightarrow 0} \epsilon(h)=0\).
The differential \(\mathrm{d}_{x} f : \mathbb{R}^{n} \mapsto \mathbb{R}\) is a linear transformation. For example, consider the function
\[f\left(\left(\begin{array}{l}{x_{1}} \\ {x_{2}}\end{array}\right)\right)=3 x_{1}+x_{2}^{2}\]
For \(\left(\begin{array}{l}{a} \\ {b}\end{array}\right) \in \mathbb{R}^{2}\) and \(h=\left(\begin{array}{l}{h_{1}} \\ {h_{2}}\end{array}\right) \in \mathbb{R}^{2}\), we have
\[\begin{aligned} f\left(\left(\begin{array}{c}{a+h_{1}} \\ {b+h_{2}}\end{array}\right)\right) &=3\left(a+h_{1}\right)+\left(b+h_{2}\right)^{2} \\ &=3 a+3 h_{1}+b^{2}+2 b h_{2}+h_{2}^{2} \\ &=3 a+b^{2}+3 h_{1}+2 b h_{2}+h_{2}^{2} \\ &=f(a, b)+3 h_{1}+2 b h_{2}+o(h) \end{aligned}\]
That is, \(\mathrm{d}_{\left(\begin{array}{l}{a} \\ {b}\end{array}\right)} f\left(\left(\begin{array}{l}{h_{1}} \\ {h_{2}}\end{array}\right)\right)=3 h_{1}+2 b h_{2}\).
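A quick numerical check makes the remainder term concrete; this is a minimal sketch (not part of the original notes) showing that the gap between \(f(a+h_{1}, b+h_{2})-f(a, b)\) and \(3 h_{1}+2 b h_{2}\) shrinks quadratically as \(h \rightarrow 0\):

```python
# Minimal sketch: compare the exact increment of f with its differential.
def f(x1, x2):
    return 3 * x1 + x2 ** 2

a, b = 1.0, 2.0
for scale in (1e-1, 1e-2, 1e-3):
    h1, h2 = 0.5 * scale, -0.3 * scale
    exact = f(a + h1, b + h2) - f(a, b)
    linear = 3 * h1 + 2 * b * h2          # d_(a,b) f applied to (h1, h2)
    print(scale, exact - linear)          # leftover is exactly h2**2 here
```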
Gradient Descent in Neural Networks
Vectorized Gradients
We can view a map \(\mathbb{R}^{n} \rightarrow \mathbb{R}^{m}\) (in the linear case, a matrix) as a function \(\boldsymbol{f}(\boldsymbol{x})=\left[f_{1}\left(x_{1}, \ldots, x_{n}\right), f_{2}\left(x_{1}, \ldots, x_{n}\right), \ldots, f_{m}\left(x_{1}, \ldots, x_{n}\right)\right]\) of the vector \(\boldsymbol{x} = (x_{1}, \dots, x_{n})\). The derivative of \(\boldsymbol{f}\) with respect to the vector \(\boldsymbol{x}\) is:
\[\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}=\left[\begin{array}{ccc}{\frac{\partial f_{1}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{1}}{\partial x_{n}}} \\ {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial f_{m}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{m}}{\partial x_{n}}}\end{array}\right]\]
This matrix of derivatives is called the Jacobian matrix; its entries are:
\[\left(\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}\right)_{i j}=\frac{\partial f_{i}}{\partial x_{j}}\]
This matrix is very useful. A neural network can be understood as a function of a vector, i.e. \(\boldsymbol{f}(\boldsymbol{x})\), and locally this function acts on the vector like a linear transformation, i.e. a matrix. Backpropagation requires partial derivatives with respect to the parameters, and in a multi-layer network this becomes a chain of derivatives; the Jacobian matrix is exactly what the matrix form of the chain rule multiplies together. A numerical sketch of the Jacobian follows below.
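As an illustration (a sketch, not code from the notes; `numerical_jacobian` is a made-up helper name), the Jacobian of a vector-valued function can be approximated entry by entry with central differences:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate (df/dx)_{ij} = d f_i / d x_j with central differences."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x), dtype=float)
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * eps)
    return J

# Example: f : R^2 -> R^2
f = lambda v: np.array([3 * v[0] + v[1] ** 2, v[0] * v[1]])
print(numerical_jacobian(f, [1.0, 2.0]))
# Analytic Jacobian at (1, 2): [[3, 4], [2, 1]]
```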
Consider the following example: \(f(x)=\left[f_{1}(x), f_{2}(x)\right]\) takes a scalar to a \(1 \times 2\) row of outputs, and \(g(y)=\left[g_{1}\left(y_{1}, y_{2}\right), g_{2}\left(y_{1}, y_{2}\right)\right]\) takes two inputs to two outputs. The composition of \(f\) and \(g\) is \(g \circ f\), obtained by substitution: \((g \circ f)(x)=\left[g_{1}\left(f_{1}(x), f_{2}(x)\right), g_{2}\left(f_{1}(x), f_{2}(x)\right)\right]\). Differentiating this composition with respect to \(x\) gives:
\[\frac{\partial \boldsymbol{g}}{\partial x}=\left[\begin{array}{c}{\frac{\partial}{\partial x} g_{1}\left(f_{1}(x), f_{2}(x)\right)} \\ {\frac{\partial}{\partial x} g_{2}\left(f_{1}(x), f_{2}(x)\right)}\end{array}\right]=\left[\begin{array}{c}{\frac{\partial g_{1}}{\partial f_{1}} \frac{\partial f_{1}}{\partial x}+\frac{\partial g_{1}}{\partial f_{2}} \frac{\partial f_{2}}{\partial x}} \\ {\frac{\partial g_{2}}{\partial f_{1}} \frac{\partial f_{1}}{\partial x}+\frac{\partial g_{2}}{\partial f_{2}} \frac{\partial f_{2}}{\partial x}}\end{array}\right]\]
This is essentially the same chain rule as for ordinary scalar functions. The matrix above is exactly the product of the two Jacobian matrices:
\[\frac{\partial g}{\partial x}=\frac{\partial g}{\partial f} \frac{\partial f}{\partial x}=\left[\begin{array}{ll}{\frac{\partial g_{1}}{\partial f_{1}}} & {\frac{\partial g_{1}}{\partial f_{2}}} \\ {\frac{\partial g_{2}}{\partial f_{1}}} & {\frac{\partial g_{2}}{\partial f_{2}}}\end{array}\right] \left[\begin{array}{c}{\frac{\partial f_{1}}{\partial x}} \\ {\frac{\partial f_{2}}{\partial x}}\end{array}\right]\]
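A small numpy sketch (with made-up concrete choices for \(f\) and \(g\)) confirms that the Jacobian of the composition equals the product of the two Jacobians:

```python
import numpy as np

# Hypothetical concrete choices for f : R -> R^2 and g : R^2 -> R^2.
f = lambda x: np.array([np.sin(x), x ** 2])
g = lambda y: np.array([y[0] * y[1], y[0] + 3 * y[1]])

x = 0.7
y = f(x)
df_dx = np.array([[np.cos(x)], [2 * x]])   # Jacobian of f, shape (2, 1)
dg_df = np.array([[y[1], y[0]],            # Jacobian of g at y, shape (2, 2)
                  [1.0,  3.0]])

# Finite-difference derivative of the composition g(f(x)) w.r.t. x.
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)

print((dg_df @ df_dx).ravel())   # chain rule: (dg/df)(df/dx)
print(numeric)                   # agrees up to O(eps^2)
```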
Useful Identities
For a general matrix
We take a general matrix to be \(\boldsymbol{W} \in \mathbb{R}^{n \times m}\), which maps an \(m\)-dimensional vector to an \(n\)-dimensional vector. This can be written \(z=W x\), where:
\[z_{i}=\sum_{k=1}^{m} W_{i k} x_{k}\]
Each entry of the derivative is then easy to compute:
\[\left(\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right)_{i j}=\frac{\partial z_{i}}{\partial x_{j}}=\frac{\partial}{\partial x_{j}} \sum_{k=1}^{m} W_{i k} x_{k}=\sum_{k=1}^{m} W_{i k} \frac{\partial}{\partial x_{j}} x_{k}=W_{i j}\]
Therefore:
\[\frac{\partial z}{\partial x}=W\]
Another common way to write the matrix product is \(z=x W\), where \(x\) is a row vector of dimension \(m\); \(z\) is then a linear combination of the rows of \(W\). In this case:
\[\frac{\partial z}{\partial x}=\boldsymbol{W}^{T}\]
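Both identities can be verified with a finite-difference Jacobian; a minimal numpy sketch (the `jacobian` helper is illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.normal(size=(n, m))
x_col = rng.normal(size=m)   # input for z = W x
x_row = rng.normal(size=n)   # input for z = x W (length chosen so the product is defined)

def jacobian(f, x, eps=1e-6):
    """Central-difference approximation of (dz/dx)_{ij} = dz_i/dx_j."""
    f0 = f(x)
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

print(np.allclose(jacobian(lambda v: W @ v, x_col), W))    # dz/dx = W
print(np.allclose(jacobian(lambda v: v @ W, x_row), W.T))  # dz/dx = W^T
```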
Looking at the matrix from another angle
We can write the linear transformation as \(z=W x\). So far we have differentiated with respect to \(x\); what does the derivative look like if we instead treat \(W\) as the variable?
Suppose we have:
\[z=\boldsymbol{W} \boldsymbol{x}, \quad \boldsymbol{\delta}=\frac{\partial J}{\partial \boldsymbol{z}}\]
\[\frac{\partial J}{\partial \boldsymbol{W}}=\frac{\partial J}{\partial \boldsymbol{z}} \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}=\delta \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}\]
How do we compute this? For \(\frac{\partial z}{\partial W}\) we can write
\[\begin{aligned} z_{k} &=\sum_{l=1}^{m} W_{k l} x_{l} \\ \frac{\partial z_{k}}{\partial W_{i j}} &=\sum_{l=1}^{m} x_{l} \frac{\partial}{\partial W_{i j}} W_{k l} \end{aligned}\]
Since \(\frac{\partial}{\partial W_{i j}} W_{k l}\) is 1 when \(k=i\) and \(l=j\), and 0 otherwise, the only nonzero entry is the \(i\)-th one, where the value is \(x_{j}\). That is:
\[\frac{\partial z}{\partial W_{i j}}=\left[\begin{array}{c}{0} \\ {\vdots} \\ {0} \\ {x_{j}} \\ {0} \\ {\vdots} \\ {0}\end{array}\right]\]
Therefore:
\[\frac{\partial J}{\partial W_{i j}}=\frac{\partial J}{\partial z} \frac{\partial z}{\partial W_{i j}}=\delta \frac{\partial z}{\partial W_{i j}}=\sum_{k=1}^{n} \delta_{k} \frac{\partial z_{k}}{\partial W_{i j}}=\delta_{i} x_{j}\]
So \(\frac{\partial J}{\partial \boldsymbol{W}}=\boldsymbol{\delta}^{T} \boldsymbol{x}^{T}\).
Similarly, if we write the product as \(z=x W\), then \(\frac{\partial J}{\partial W}=x^{T} \delta\). A numerical check is sketched below.
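Here is a sketch that checks \(\frac{\partial J}{\partial W}=\delta^{T} x^{T}\) against finite differences, using a made-up quadratic loss purely so that \(J\) is a concrete scalar (the loss is an assumption for the check, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=(m, 1))          # column vector input
t = rng.normal(size=(n, 1))          # fixed target so that J is a scalar

def J(W):
    """Made-up quadratic loss of z = W x, used only for the gradient check."""
    z = W @ x
    return 0.5 * float(np.sum((z - t) ** 2))

z = W @ x
delta = (z - t).T                    # dJ/dz as a row vector (1 x n) for this loss

analytic = delta.T @ x.T             # the identity: dJ/dW = delta^T x^T

numeric = np.zeros_like(W)
eps = 1e-6
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

print(np.allclose(analytic, numeric))   # True
```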
Example: a one-layer neural network
As a simple example, consider the following neural network, trained with a cross-entropy loss to optimize its parameters:
\[\begin{array}{l}{x=\text { input }} \\ {z=W x+b_{1}} \\ {h=\operatorname{ReLU}(z)} \\ {\theta=U h+b_{2}} \\ {\hat{y}=\operatorname{softmax}(\theta)} \\ {J=\operatorname{CE}(y, \hat{y})}\end{array}\]
The dimensions of the quantities involved are:
\[\boldsymbol{x} \in \mathbb{R}^{D_{x} \times 1} \quad \boldsymbol{b}_{1} \in \mathbb{R}^{D_{h} \times 1} \quad \boldsymbol{W} \in \mathbb{R}^{D_{h} \times D_{x}} \quad \boldsymbol{b}_{2} \in \mathbb{R}^{N_{c} \times 1} \quad \boldsymbol{U} \in \mathbb{R}^{N_{c} \times D_{h}}\]
Here \(D_{x}\) is the input dimension, \(D_{h}\) is the hidden-layer dimension, and \(N_{c}\) is the number of classes.
The gradients we need are:
\[\frac{\partial J}{\partial U} \quad \frac{\partial J}{\partial b_{2}} \quad \frac{\partial J}{\partial W} \quad \frac{\partial J}{\partial b_{1}} \quad \frac{\partial J}{\partial x}\]
These are all fairly easy to compute once we define
\[\delta_{1}=\frac{\partial J}{\partial \theta} \quad \delta_{2}=\frac{\partial J}{\partial z}\]
\[\begin{aligned} \delta_{1} &=\frac{\partial J}{\partial \theta}=(\hat{y}-y)^{T} \\ \delta_{2} &=\frac{\partial J}{\partial z}=\frac{\partial J}{\partial \theta} \frac{\partial \theta}{\partial h} \frac{\partial h}{\partial z} \\ &=\delta_{1} \frac{\partial \theta}{\partial h} \frac{\partial h}{\partial z} \\ &=\delta_{1} U \frac{\partial h}{\partial z} \\ &=\delta_{1} U \circ \operatorname{ReLU}^{\prime}(z) \\ &=\delta_{1} U \circ \operatorname{sgn}(h) \end{aligned}\]
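Putting the pieces together, here is a compact numpy sketch of the forward pass and of the gradients above, with variable names following the equations (the random initialization and one-hot label are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
Dx, Dh, Nc = 4, 5, 3

# Illustrative parameters and a single (input, one-hot label) pair.
W  = 0.1 * rng.normal(size=(Dh, Dx))
b1 = np.zeros((Dh, 1))
U  = 0.1 * rng.normal(size=(Nc, Dh))
b2 = np.zeros((Nc, 1))
x  = rng.normal(size=(Dx, 1))
y  = np.zeros((Nc, 1)); y[1] = 1.0

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

# Forward pass, following the equations above.
z     = W @ x + b1
h     = np.maximum(z, 0)                    # ReLU
theta = U @ h + b2
y_hat = softmax(theta)
J     = -float(np.sum(y * np.log(y_hat)))   # cross-entropy

# Backward pass: delta1 = dJ/dtheta, delta2 = dJ/dz, both as row vectors.
delta1 = (y_hat - y).T                      # 1 x Nc
delta2 = (delta1 @ U) * np.sign(h).T        # 1 x Dh; np.sign(h) = ReLU'(z)

dJ_dU  = delta1.T @ h.T                     # Nc x Dh
dJ_db2 = delta1.T                           # Nc x 1
dJ_dW  = delta2.T @ x.T                     # Dh x Dx
dJ_db1 = delta2.T                           # Dh x 1
dJ_dx  = (delta2 @ W).T                     # Dx x 1

print(dJ_dU.shape, dJ_db2.shape, dJ_dW.shape, dJ_db1.shape, dJ_dx.shape)
```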
The backpropagation above is worked out for the cross-entropy loss; for the least-squares case, the video below is recommended:
BP