  • Temporal-Difference Control: SARSA and Q-Learning

    SARSA

    The SARSA algorithm also estimates Action-Value functions rather than the State-Value function. The difference between SARSA and Monte Carlo is that SARSA does not need to wait for the actual return until the end of the episode; instead, it learns at every time step from an estimate of the return.

    At every step, the agent takes an action A from state S, then receives a reward R and arrives at a new state S'. Following the policy π, the algorithm then picks the next action A'. So now we have S, A, R, S', A', and the task is to estimate the Q-function of the (S, A) pair.

    We borrow the idea used for estimating State-Value functions and apply it to Action-Value function estimation, which gives the SARSA update:

    Q(S, A) ← Q(S, A) + α [ R + γ Q(S', A') − Q(S, A) ]

    Here is the pseudocode for SARSA, written below as a simple loop:
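    The code is only a minimal tabular sketch in Python, assuming a simplified Gym-style environment where env.reset() returns an integer state and env.step(action) returns (next_state, reward, done); the hyperparameters alpha, gamma, and epsilon are illustrative defaults.

    ```python
    import numpy as np

    def epsilon_greedy(Q, state, n_actions, epsilon):
        """With probability epsilon pick a random action, otherwise the greedy one."""
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[state]))

    def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular SARSA: on-policy TD control."""
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            state = env.reset()
            action = epsilon_greedy(Q, state, n_actions, epsilon)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                # A' is chosen by the same epsilon-greedy policy that chose A (on-policy).
                next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
                # TD update: move Q(S, A) toward R + gamma * Q(S', A').
                target = reward + gamma * Q[next_state, next_action] * (not done)
                Q[state, action] += alpha * (target - Q[state, action])
                state, action = next_state, next_action
        return Q
    ```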

    On-Policy vs Off-Policy

    If we look into the learning process, there are actually two steps. First, the agent takes an action A from state S based on policy π, gets the reward R, and arrives at the next state S'. Second, the update uses the Q-function of an action A' selected at S' by the same policy π. Both steps use the same policy π, but in general they can be different. The policy used in the first step, which generates the agent's experience, is called the Behavior Policy. The policy used in the second step, the one whose value estimates we are actually updating, is called the Target Policy. Q-Learning uses different policies for the two steps, as the short fragment below illustrates.
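    To make the distinction concrete, here are the two update targets side by side. This is just an illustrative fragment, assuming a tabular Q stored as a NumPy array and hypothetical variables reward, gamma, next_state, and next_action already in scope:

    ```python
    # SARSA (on-policy): bootstrap from the action A' actually chosen by the
    # same epsilon-greedy behavior policy that generates the experience.
    sarsa_target = reward + gamma * Q[next_state, next_action]

    # Q-Learning (off-policy): the target policy is greedy with respect to Q,
    # regardless of which action the behavior policy takes next.
    q_learning_target = reward + gamma * Q[next_state].max()
    ```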

    Q-Learning

    From state S', the Q-Learning algorithm picks the action that maximizes the Q-function for its update target: it stands at state S', looks at all possible actions, and chooses the best one. The update is:

    Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S', a) − Q(S, A) ]
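    A matching minimal Q-Learning sketch, under the same assumed environment interface as the SARSA sketch above; the hyperparameters are again illustrative:

    ```python
    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Q-Learning: off-policy TD control."""
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                # Behavior policy: epsilon-greedy exploration.
                if np.random.rand() < epsilon:
                    action = np.random.randint(n_actions)
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward, done = env.step(action)
                # Target policy: greedy -- bootstrap from max_a Q(S', a).
                target = reward + gamma * np.max(Q[next_state]) * (not done)
                Q[state, action] += alpha * (target - Q[state, action])
                state = next_state
        return Q
    ```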
