Reinforcement Learning Basics: Monte Carlo and Temporal Difference

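- Problem 1: estimate the state-value function $v_{\pi}$ corresponding to a policy $\pi$
  
  First-visit MC estimates $v_{\pi}(s)$ as the average of the returns following only the first visit to $s$ in each episode
  
  Every-visit MC estimates $v_{\pi}(s)$ as the average of the returns following all visits to $s$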
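- Problem 2 (right figure): estimate the action-value function $q_{\pi}$
  
  First-visit MC estimates $q_{\pi}(s,a)$ as the average of the returns following only the first visit to the pair $(s,a)$ in each episode
  
  Every-visit MC estimates $q_{\pi}(s,a)$ as the average of the returns following all visits to $(s,a)$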
- Problem 3 (left figure): get the optimal policy $\pi_*$
  
  Relationship between the running mean and the individual returns: $\bar{Q}_k=\frac{\sum_{i=1}^k G_i}{k}=\bar{Q}_{k-1}+\frac{1}{k}(G_k-\bar{Q}_{k-1})$
  
  $\epsilon$-greedy: Exploration vs. Exploitation
  
  with probability $1-\epsilon$, select the greedy action $\pi(s)=\arg\max_{a \in \mathcal{A}(s)} Q(s, a)$ (Exploitation)
  
  with probability $\epsilon$, select an action uniformly at random, $\pi(a|s)=\frac{1}{|\mathcal{A}(s)|}$ (Exploration)
- Problem 4 (right figure): modify the algorithm to put more weight on the most recent returns
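
The Monte Carlo ideas in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state), and the constant step size `alpha` are choices made here, not taken from the original post. In particular, a constant step size is one common way to weight recent returns more heavily; the post does not say which modification it intends.

```python
def mc_state_returns(episode, gamma=1.0, first_visit=True):
    """Collect the sampled returns for each state in one episode.

    episode: list of (state, reward) pairs (assumed format).
    Returns a dict mapping state -> list of returns G_t.
    """
    g = 0.0
    returns_at = []
    # Walk backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        returns_at.append((state, g))
    returns_at.reverse()

    collected = {}
    for state, g in returns_at:
        if first_visit and state in collected:
            continue  # first-visit MC ignores later visits to the same state
        collected.setdefault(state, []).append(g)
    return collected


def incremental_mean(returns):
    """Running mean: Q_k = Q_{k-1} + (G_k - Q_{k-1}) / k."""
    q = 0.0
    for k, g in enumerate(returns, start=1):
        q += (g - q) / k
    return q


def constant_alpha_estimate(returns, alpha, q0=0.0):
    """Constant step size (assumed variant): recent returns weigh more."""
    q = q0
    for g in returns:
        q += alpha * (g - q)
    return q


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities: explore uniformly with probability epsilon,
    otherwise take the greedy action."""
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1.0 - epsilon
    return probs
```

For the returns `[0.0, 0.0, 10.0]`, `incremental_mean` gives about 3.33, while `constant_alpha_estimate` with `alpha=0.5` gives 5.0, which shows how the constant step size shifts weight toward the latest return.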
Solution method: Temporal Difference

Monte Carlo (MC) prediction methods must wait until the end of an episode to update the value function estimate, whereas temporal-difference (TD) methods update the value function after every time step.
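
To make the contrast concrete, here is a sketch of a single one-step TD(0) update in its standard textbook form, $V(S_t) \leftarrow V(S_t) + \alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$. The dict-based value table, the function name `td0_update`, and the parameters `alpha`, `gamma`, and `done` are assumptions made for this example, not details from the original post.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, done=False):
    """Apply one TD(0) update to the value table V (a dict) in place.

    Unlike MC, this needs only a single transition, not a full episode.
    """
    v_s = V.get(state, 0.0)
    # A terminal next state contributes no bootstrapped value.
    v_next = 0.0 if done else V.get(next_state, 0.0)
    td_target = reward + gamma * v_next
    V[state] = v_s + alpha * (td_target - v_s)
    return V[state]
```

Calling this once per time step updates the estimate for the state just left, so learning can proceed inside an episode, or in a task that never terminates.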
- Problem 1 (left figure): estimate the state-value function $v_{\pi}$ (the estimation of $q_{\pi}$ is similar)
- Problem 2 (right figure): get the optimal action-value function $q_*$
  
  On-policy: the agent interacts with the environment by following the same policy $\pi$ that it seeks to evaluate (or improve)
  
  Sarsa(0) is an on-policy method
- Problem 3: modified algorithm to get the optimal action-value function $q_*$
  
  Off-policy: the agent interacts with the environment by following a behavior policy $b$ that differs from the policy $\pi$ it seeks to evaluate (or improve)
- Problem 4: get the optimal action-value function $q_*$
  
  Expected Sarsa is an on-policy method
  
  $\pi(a|S_{t+1})$ is derived from $Q$ (e.g., $\epsilon$-greedy)
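
The TD control methods in this list differ mainly in the target used for the update of $Q(S_t, A_t)$. The sketch below compares three standard targets. Sarsa and Expected Sarsa are named in the text; the max-based target (Q-learning, also called Sarsamax) is included here as a common off-policy example, and it is an assumption that this is the off-policy method the post had in mind. The table layout (a dict from state to a list of action values) and all function names are hypothetical.

```python
def epsilon_greedy_probs(q_row, epsilon):
    """Epsilon-greedy action probabilities derived from one row of Q."""
    n = len(q_row)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_row[a])
    probs[greedy] += 1.0 - epsilon
    return probs


def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy: uses the action actually chosen in the next state."""
    return reward + gamma * Q[next_state][next_action]


def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy (assumed example): uses the greedy action value."""
    return reward + gamma * max(Q[next_state])


def expected_sarsa_target(Q, reward, next_state, gamma, epsilon):
    """Uses the expectation of Q under pi(a | S_{t+1}), here epsilon-greedy."""
    probs = epsilon_greedy_probs(Q[next_state], epsilon)
    expected = sum(p * q for p, q in zip(probs, Q[next_state]))
    return reward + gamma * expected


def td_control_update(Q, state, action, target, alpha):
    """Shared update rule: move Q(s, a) a fraction alpha toward the target."""
    Q[state][action] += alpha * (target - Q[state][action])
    return Q[state][action]
```

All three share the same update rule and differ only in how the next state's value is summarized: a sampled action, the maximum, or a policy-weighted average.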

Original article: https://www.cnblogs.com/sunwq06/p/11084512.html