Some concepts of differentiation, and gradient descent in neural networks

Theory for \(f : \mathbb{R}^{n} \mapsto \mathbb{R}\)

First define a piece of notation: the scalar product \(\langle a \mid b \rangle=\sum_{i=1}^{n} a_{i} b_{i}\).

We can then define the derivative by the following formula:

\[ f(x+h)=f(x)+\mathrm{d}_{x} f(h)+o_{h \rightarrow 0}(h) \]

where \(o_{h \rightarrow 0}(h)\) denotes a term of the form \(\|h\|\,\epsilon(h)\) with \(\lim _{h \rightarrow 0} \epsilon(h)=0\).

Here \(\mathrm{d}_{x} f : \mathbb{R}^{n} \mapsto \mathbb{R}\) is a linear map. As an example, take the function:

\[ f\left(\left(\begin{array}{l}{x_{1}} \\ {x_{2}}\end{array}\right)\right)=3 x_{1}+x_{2}^{2} \]

At the point \(\left(\begin{array}{l}{a} \\ {b}\end{array}\right) \in \mathbb{R}^{2}\), with \(h=\left(\begin{array}{l}{h_{1}} \\ {h_{2}}\end{array}\right) \in \mathbb{R}^{2}\), we have

\[ \begin{aligned} f\left(\left(\begin{array}{c}{a+h_{1}} \\ {b+h_{2}}\end{array}\right)\right) &=3\left(a+h_{1}\right)+\left(b+h_{2}\right)^{2} \\ &=3 a+3 h_{1}+b^{2}+2 b h_{2}+h_{2}^{2} \\ &=3 a+b^{2}+3 h_{1}+2 b h_{2}+h_{2}^{2} \\ &=f(a, b)+3 h_{1}+2 b h_{2}+o(h) \end{aligned} \]

In other words: \(\mathrm{d}_{\left(\begin{array}{l}{a} \\ {b}\end{array}\right)} f\left(\left(\begin{array}{l}{h_{1}} \\ {h_{2}}\end{array}\right)\right)=3 h_{1}+2 b h_{2}\).
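As a quick numerical sanity check (my addition, not part of the original notes), the sketch below compares \(f(x+h)-f(x)\) with the differential \(3 h_{1}+2 b h_{2}\) derived above; the point \((a, b)=(1, 2)\) and the displacement \(h\) are arbitrary choices.

```python
import numpy as np

def f(x):
    # f(x1, x2) = 3*x1 + x2^2, the example function above
    return 3 * x[0] + x[1] ** 2

def df(x, h):
    # the differential derived above: d_x f(h) = 3*h1 + 2*x2*h2
    return 3 * h[0] + 2 * x[1] * h[1]

x = np.array([1.0, 2.0])       # the point (a, b)
h = np.array([1e-4, -2e-4])    # a small displacement

print(f(x + h) - f(x))   # ~ -0.00049996  (exact change in f)
print(df(x, h))          #   -0.0005      (differential; the gap is o(h))
```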

Gradient descent in neural networks

    Vectorized Gradients

We can view a map \(\mathbb{R}^{n} \rightarrow \mathbb{R}^{m}\) as a function \(\boldsymbol{f}(\boldsymbol{x})=\left[f_{1}\left(x_{1}, \ldots, x_{n}\right), f_{2}\left(x_{1}, \ldots, x_{n}\right), \ldots, f_{m}\left(x_{1}, \ldots, x_{n}\right)\right]\) of the vector \(\boldsymbol{x} = (x_{1}, \dots, x_{n})\). The derivative of this vector-valued function with respect to the vector is:

\[ \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}=\left[\begin{array}{ccc}{\frac{\partial f_{1}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{1}}{\partial x_{n}}} \\ {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial f_{m}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{m}}{\partial x_{n}}}\end{array}\right] \]

This matrix of derivatives is called the Jacobian matrix. Its entries can be written as:

\[ \left(\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}\right)_{i j}=\frac{\partial f_{i}}{\partial x_{j}} \]
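As an illustration (again my addition), here is a minimal NumPy sketch that approximates the Jacobian by finite differences and compares it with the analytic matrix of partial derivatives; the example function and the step size `eps` are arbitrary choices.

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    # approximate (df_i / dx_j) by forward finite differences
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - y) / eps
    return J

# example: f(x) = [3*x1 + x2^2, x1*x2], a map from R^2 to R^2
f = lambda x: np.array([3 * x[0] + x[1] ** 2, x[0] * x[1]])
x = np.array([1.0, 2.0])

print(jacobian_fd(f, x))
# analytic Jacobian [[3, 2*x2], [x2, x1]] = [[3, 4], [2, 1]]
```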

This matrix is very useful. In a neural network, each layer can be viewed as a function of a vector, \(\boldsymbol{f}(\boldsymbol{x})\), and locally that function acts on the vector like a linear transformation, i.e. a matrix. Backpropagation requires partial derivatives with respect to the parameters, and in a multi-layer network this becomes a chain of derivatives; the Jacobian matrix is exactly what the chain rule multiplies together.

Consider the following example: \(f(x)=\left[f_{1}(x), f_{2}(x)\right]\) maps a scalar to \(\mathbb{R}^{2}\), and \(g(y)=\left[g_{1}\left(y_{1}, y_{2}\right), g_{2}\left(y_{1}, y_{2}\right)\right]\) maps \(\mathbb{R}^{2}\) to \(\mathbb{R}^{2}\). Composing \(f\) and \(g\) gives \(g \circ f\), that is, \(g(f(x))=\left[g_{1}\left(f_{1}(x), f_{2}(x)\right), g_{2}\left(f_{1}(x), f_{2}(x)\right)\right]\). Differentiating the composition gives:

\[ \frac{\partial \boldsymbol{g}}{\partial x}=\left[\begin{array}{c}{\frac{\partial}{\partial x} g_{1}\left(f_{1}(x), f_{2}(x)\right)} \\ {\frac{\partial}{\partial x} g_{2}\left(f_{1}(x), f_{2}(x)\right)}\end{array}\right]=\left[\begin{array}{c}{\frac{\partial g_{1}}{\partial f_{1}} \frac{\partial f_{1}}{\partial x}+\frac{\partial g_{1}}{\partial f_{2}} \frac{\partial f_{2}}{\partial x}} \\ {\frac{\partial g_{2}}{\partial f_{1}} \frac{\partial f_{1}}{\partial x}+\frac{\partial g_{2}}{\partial f_{2}} \frac{\partial f_{2}}{\partial x}}\end{array}\right] \]

In essence this is the same chain rule as for ordinary single-variable functions. The resulting matrix equals the matrix product of the two Jacobians:

\[ \frac{\partial g}{\partial x}=\frac{\partial g}{\partial f} \frac{\partial f}{\partial x}=\left[\begin{array}{ll}{\frac{\partial g_{1}}{\partial f_{1}}} & {\frac{\partial g_{1}}{\partial f_{2}}} \\ {\frac{\partial g_{2}}{\partial f_{1}}} & {\frac{\partial g_{2}}{\partial f_{2}}}\end{array}\right] \left[\begin{array}{c}{\frac{\partial f_{1}}{\partial x}} \\ {\frac{\partial f_{2}}{\partial x}}\end{array}\right] \]
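A small numeric sketch of this identity (my addition, not the author's): for a hypothetical pair of functions \(f:\mathbb{R}\rightarrow\mathbb{R}^{2}\) and \(g:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}\) chosen only for illustration, the product of the two analytic Jacobians should agree with a finite-difference derivative of the composition.

```python
import numpy as np

# f : R -> R^2 and g : R^2 -> R^2, arbitrary smooth examples
f = lambda x: np.array([np.sin(x), x ** 2])
g = lambda y: np.array([y[0] * y[1], y[0] + y[1]])

def jac_f(x):
    # df/dx is a 2 x 1 matrix
    return np.array([[np.cos(x)], [2 * x]])

def jac_g(y):
    # dg/dy is a 2 x 2 matrix
    return np.array([[y[1], y[0]], [1.0, 1.0]])

x = 0.7
# chain rule: d(g o f)/dx = (dg/df at f(x)) @ (df/dx at x)
analytic = jac_g(f(x)) @ jac_f(x)

eps = 1e-6
numeric = ((g(f(x + eps)) - g(f(x))) / eps).reshape(-1, 1)
print(analytic.ravel())
print(numeric.ravel())   # the two should agree to ~1e-5
```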

    Useful Identities

For a general matrix

Take a general matrix \(\boldsymbol{W} \in \mathbb{R}^{n \times m}\), which transforms an \(m\)-dimensional vector into an \(n\)-dimensional vector. This can be written as \(z=W x\), where:

\[ z_{i}=\sum_{k=1}^{m} W_{i k} x_{k} \]

So each entry of the derivative is easy to compute:

\[ \left(\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right)_{i j}=\frac{\partial z_{i}}{\partial x_{j}}=\frac{\partial}{\partial x_{j}} \sum_{k=1}^{m} W_{i k} x_{k}=\sum_{k=1}^{m} W_{i k} \frac{\partial}{\partial x_{j}} x_{k}=W_{i j} \]

Therefore:

\[ \frac{\partial z}{\partial x}=W \]

The transformation can also be written the other way around as \(z=x W\), with \(x\) a row vector; each entry of \(z\) is then the dot product of \(x\) with one column of \(W\). In this case:

\[ \frac{\partial z}{\partial x}=\boldsymbol{W}^{T} \]
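As a quick hedged check (my addition), the NumPy sketch below verifies both identities numerically for a random matrix: the finite-difference Jacobian of \(z=Wx\) recovers \(W\), and that of \(z=xW\) (with \(x\) a row vector) recovers \(W^{T}\); the sizes and data are arbitrary.

```python
import numpy as np

np.random.seed(0)
n, m = 3, 4
W = np.random.randn(n, m)        # W maps R^m -> R^n
x = np.random.randn(m)
eps = 1e-6

# z = W x : the Jacobian dz/dx should be W itself
J = np.zeros((n, m))
for j in range(m):
    e = np.zeros(m); e[j] = eps
    J[:, j] = (W @ (x + e) - W @ x) / eps
print(np.allclose(J, W, atol=1e-5))      # True

# z = x W with x a row vector of dimension n : dz/dx should be W^T
xr = np.random.randn(n)
Jr = np.zeros((m, n))
for j in range(n):
    e = np.zeros(n); e[j] = eps
    Jr[:, j] = ((xr + e) @ W - xr @ W) / eps
print(np.allclose(Jr, W.T, atol=1e-5))   # True
```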

Looking at a general matrix from another angle

We can write the linear transformation as \(z=W x\). So far we have always treated \(x\) as the variable and differentiated with respect to it; what does it look like if instead we treat \(W\) as the variable and differentiate with respect to it?

Suppose we have:

\[ z=\boldsymbol{W} \boldsymbol{x}, \quad \boldsymbol{\delta}=\frac{\partial J}{\partial \boldsymbol{z}} \]

\[ \frac{\partial J}{\partial \boldsymbol{W}}=\frac{\partial J}{\partial \boldsymbol{z}} \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}=\delta \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}} \]

How do we compute this? For \(\frac{\partial z}{\partial W}\) we can write

\[ \begin{aligned} z_{k} &=\sum_{l=1}^{m} W_{k l} x_{l} \\ \frac{\partial z_{k}}{\partial W_{i j}} &=\sum_{l=1}^{m} x_{l} \frac{\partial}{\partial W_{i j}} W_{k l} \end{aligned} \]

Since \(\frac{\partial}{\partial W_{i j}} W_{k l}\) equals 1 only when \(k=i\) and \(l=j\) (and 0 otherwise), the sum collapses to \(x_{j}\) when \(k=i\) and to 0 for every other \(k\). That is, with \(x_{j}\) sitting in the \(i\)-th row:

\[ \frac{\partial z}{\partial W_{i j}}=\left[\begin{array}{c}{0} \\ {\vdots} \\ {0} \\ {x_{j}} \\ {0} \\ {\vdots} \\ {0}\end{array}\right] \]

Therefore:

\[ \frac{\partial J}{\partial W_{i j}}=\frac{\partial J}{\partial z} \frac{\partial z}{\partial W_{i j}}=\delta \frac{\partial z}{\partial W_{i j}}=\sum_{k=1}^{n} \delta_{k} \frac{\partial z_{k}}{\partial W_{i j}}=\delta_{i} x_{j} \]

So \(\frac{\partial J}{\partial \boldsymbol{W}}=\boldsymbol{\delta}^{T} \boldsymbol{x}^{T}\).

Similarly, if we write the transformation as \(z=x W\), then \(\frac{\partial J}{\partial W}=x^{T} \delta\).
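Here is a minimal sketch (my addition, using an artificial loss \(J = v\,W x\) chosen so that \(\partial J/\partial z = v\) is known) that checks the identity \(\frac{\partial J}{\partial W}=\delta^{T} x^{T}\) against finite differences; all sizes and data are arbitrary.

```python
import numpy as np

np.random.seed(1)
n, m = 3, 4
W = np.random.randn(n, m)
x = np.random.randn(m, 1)            # column vector
v = np.random.randn(1, n)            # toy downstream gradient: J = v @ W @ x, so dJ/dz = v

def loss(Wmat):
    return (v @ (Wmat @ x)).item()

delta = v                            # dJ/dz as a 1 x n row vector
analytic = delta.T @ x.T             # the identity dJ/dW = delta^T x^T (an n x m matrix)

# finite-difference check of dJ/dW, entry by entry
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        Wp = W.copy(); Wp[i, j] += eps
        numeric[i, j] = (loss(Wp) - loss(W)) / eps
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```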

An example: a one-layer neural network

A simple neural network is described below; we use the cross-entropy loss to optimize its parameters:

\[ \begin{array}{l}{x=\text { input }} \\ {z=W x+b_{1}} \\ {h=\operatorname{ReLU}(z)} \\ {\theta=U h+b_{2}} \\ {\hat{y}=\operatorname{softmax}(\theta)} \\ {J=C E(y, \hat{y})}\end{array} \]

The dimensions of the quantities involved are:

\[ \boldsymbol{x} \in \mathbb{R}^{D_{x} \times 1} \quad \boldsymbol{b}_{1} \in \mathbb{R}^{D_{h} \times 1} \quad \boldsymbol{W} \in \mathbb{R}^{D_{h} \times D_{x}} \quad \boldsymbol{b}_{2} \in \mathbb{R}^{N_{c} \times 1} \quad \boldsymbol{U} \in \mathbb{R}^{N_{c} \times D_{h}} \]

where \(D_{x}\) is the input dimension, \(D_{h}\) is the hidden-layer dimension, and \(N_{c}\) is the number of classes.

The gradients we need are:

\[ \frac{\partial J}{\partial U} \quad \frac{\partial J}{\partial b_{2}} \quad \frac{\partial J}{\partial W} \quad \frac{\partial J}{\partial b_{1}} \quad \frac{\partial J}{\partial x} \]

These are all fairly easy to compute once we define:

\[ \delta_{1}=\frac{\partial J}{\partial \theta} \quad \delta_{2}=\frac{\partial J}{\partial z} \]

\[ \begin{aligned} \delta_{1} &=\frac{\partial J}{\partial \theta}=(\hat{y}-y)^{T} \\ \delta_{2} &=\frac{\partial J}{\partial z}=\frac{\partial J}{\partial \theta} \frac{\partial \theta}{\partial h} \frac{\partial h}{\partial z} \\ &=\delta_{1} \frac{\partial \theta}{\partial h} \frac{\partial h}{\partial z} \\ &=\delta_{1} U \frac{\partial h}{\partial z} \\ &=\delta_{1} U \circ \operatorname{ReLU}^{\prime}(z) \\ &=\delta_{1} U \circ \operatorname{sgn}(h) \end{aligned} \]
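Below is a minimal NumPy sketch of this one-layer network and its backward pass (my addition, not the author's code). It uses column-vector conventions, so \(\delta_{1}\) appears as a column rather than the row vector \((\hat{y}-y)^{T}\) above, and the gradients come out as \(\delta_{1} h^{T}\), \(\delta_{2} x^{T}\), and so on; the dimensions and random data are arbitrary, and a finite-difference check on one entry of \(W\) is included.

```python
import numpy as np

np.random.seed(2)
Dx, Dh, Nc = 5, 4, 3
x = np.random.randn(Dx, 1)
W = np.random.randn(Dh, Dx); b1 = np.random.randn(Dh, 1)
U = np.random.randn(Nc, Dh); b2 = np.random.randn(Nc, 1)
y = np.zeros((Nc, 1)); y[1] = 1.0           # one-hot label

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def forward(Wmat):
    # forward pass, following the equations above
    z     = Wmat @ x + b1
    h     = np.maximum(z, 0)                 # ReLU
    theta = U @ h + b2
    y_hat = softmax(theta)
    J     = -(y * np.log(y_hat)).sum()       # cross-entropy
    return z, h, y_hat, J

z, h, y_hat, J = forward(W)

# backward pass
delta1 = y_hat - y                           # dJ/dtheta (column convention)
dU, db2 = delta1 @ h.T, delta1
delta2 = (U.T @ delta1) * (h > 0)            # dJ/dz = U^T delta1, masked by sgn(h)
dW, db1 = delta2 @ x.T, delta2
dx = W.T @ delta2

# finite-difference sanity check on one entry of W
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
print(abs((forward(Wp)[3] - J) / eps - dW[0, 0]) < 1e-4)   # True
```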

The backpropagation above uses the cross-entropy loss; for the least-squares (mean-squared-error) case, the following video is recommended:

    BP
