



^ is the square root of epsilon


a simplified version of the hard version
a smoother way to find the correct solution

the first term is the REINFORCE term, and the second term is the gradient of the log probability of our loss

b is a stochastic node
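
a minimal sketch of what a REINFORCE (score-function) term looks like for a stochastic node b. Everything here is an assumption made for illustration and not taken from these notes: b is drawn from Bernoulli(theta), the toy loss f(b) = (b - t)^2, and the names f, score, reinforce_gradient and exact_gradient are hypothetical. The estimator averages f(b) * d/dtheta log p(b | theta) over samples of b and is compared against the exact gradient f(1) - f(0).

```python
import random


def f(b, t=0.45):
    # Toy loss, chosen only for illustration (assumption, not from the notes).
    return (b - t) ** 2


def score(b, theta):
    # d/dtheta log p(b | theta) for b ~ Bernoulli(theta).
    return b / theta - (1 - b) / (1 - theta)


def reinforce_gradient(theta, n_samples=200_000, seed=0):
    # Monte Carlo average of f(b) * score(b, theta) over samples of b.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        b = 1 if rng.random() < theta else 0  # sample the stochastic node
        total += f(b) * score(b, theta)
    return total / n_samples


def exact_gradient(theta):
    # E[f(b)] = theta * f(1) + (1 - theta) * f(0), so the gradient
    # with respect to theta is f(1) - f(0), independent of theta.
    return f(1) - f(0)


if __name__ == "__main__":
    theta = 0.5
    print("REINFORCE estimate:", reinforce_gradient(theta))
    print("exact gradient:    ", exact_gradient(theta))
```

the estimate is unbiased but noisy, which is why many samples are used in this sketch.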




further formula derivations are omitted.