BPTT (backpropagation through time): exploding and vanishing gradients

For the output weights \(W_o\):
\[
\frac{\partial L}{\partial W_o} = \sum_{t=1}^{T}\frac{\partial L_t}{\partial \hat y_t}\frac{\partial \hat y_t}{\partial W_o}
\]
\[
\frac{\partial L}{\partial W_h} = \sum_{t=1}^{T}\frac{\partial L_t}{\partial \hat y_t}\frac{\partial \hat y_t}{\partial h_t}\frac{\partial h_t}{\partial W_h}
\]
Since \(h_t\) depends on \(h_{t-1}\), which in turn depends on \(W_h\), differentiating term by term back through the time steps gives
\[
\frac{\partial h_t}{\partial W_h} = \sum_{i=1}^{t}\frac{\partial h_t}{\partial h_i}\frac{\partial h_i}{\partial W_h}
\]
\[
\frac{\partial h_t}{\partial h_i} = \prod_{j=i}^{t-1}\frac{\partial h_{j+1}}{\partial h_j}
\]
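This product of Jacobians can be checked numerically. Below is a minimal NumPy sketch (the hidden size, weight scale, and the plain tanh cell are illustrative assumptions, not a specific model from the text) that builds \(\frac{\partial h_t}{\partial h_i}\) for a tanh RNN, where each factor is \(\mathrm{diag}(1-h_{j+1}^2)\,W_h\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 20                        # hypothetical hidden size and sequence length
W_h = rng.normal(0, 0.15, (d, d))   # small recurrent weights (spectral norm < 1)
W_x = rng.normal(0, 0.15, (d, d))
x = rng.normal(size=(T, d))

# Forward pass: h_{j+1} = tanh(W_h h_j + W_x x_j)
h = np.zeros((T + 1, d))
for j in range(T):
    h[j + 1] = np.tanh(W_h @ h[j] + W_x @ x[j])

def jacobian(i, t):
    """dh_t/dh_i = prod_{j=i}^{t-1} diag(1 - h_{j+1}^2) @ W_h"""
    J = np.eye(d)
    for j in range(i, t):
        J = np.diag(1 - h[j + 1] ** 2) @ W_h @ J
    return J

# The long-range Jacobian is far smaller than the short-range one:
print(np.linalg.norm(jacobian(0, T)))      # many contracting factors -> tiny
print(np.linalg.norm(jacobian(T - 2, T)))  # only two factors -> much larger
```

With this weight scale every factor is a contraction, so the norm of \(\frac{\partial h_t}{\partial h_i}\) decays geometrically in \(t-i\), which is exactly the mechanism the sum below makes visible.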
Therefore,
\[
\frac{\partial L}{\partial W_h} = \sum_{t=1}^{T}\frac{\partial L_t}{\partial \hat y_t}\frac{\partial \hat y_t}{\partial h_t}\sum_{i=1}^{t}\left(\prod_{j=i+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_i}{\partial W_h}
\]
Note that \(\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(f')\,W_h\). When \(f\) is tanh, its derivative lies in \((0, 1]\), so when \(j\) and \(t\) are far apart the product contains many such factors: if \(\|W_h\| > 1\) (in the scalar case, \(|W_h| > 1\)) the gradient can explode, and if \(\|W_h\| < 1\) it vanishes. (Note: the formula shows that the gradient of the total loss with respect to the parameters still exists; it is simply dominated by the nearby time steps, so the network cannot learn long-term dependencies. "Vanishing gradients" in an RNN therefore really means the inability to learn long-range dependencies.)
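The scalar case makes the two regimes concrete. In the sketch below (values are hypothetical), the state is held near the origin, where \(\tanh'(0) = 1\), so the long-range factor \(\prod_j f'\,w_h\) reduces to exactly \(w_h^{\,t-i}\):

```python
import numpy as np

def long_range_grad(w_h, steps):
    """Product of per-step factors tanh'(a_j) * w_h for a scalar RNN
    held at h = 0, where tanh'(0) = 1, so the product is w_h ** steps."""
    g = 1.0
    for _ in range(steps):
        g *= (1 - np.tanh(0.0) ** 2) * w_h   # tanh'(0) = 1
    return g

print(long_range_grad(0.9, 50))   # ~0.005: |w_h| < 1, gradient vanishes
print(long_range_grad(1.1, 50))   # ~117:   |w_h| > 1, gradient explodes
```

Away from the origin tanh saturates and \(f' < 1\), which shrinks the product further; that is why vanishing is the far more common failure mode in practice, while explosion appears in bursts.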