  • Dynamic Programming and Policy Evaluation

    Dynamic programming (DP) divides the original problem into subproblems and then completes the whole task by recursively solving these subproblems. The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies. DP assumes full knowledge of the environment: we are given the state space, the action space, the transition structure, the reward structure, and the discount factor.

    We start with policy evaluation: given the MDP and an arbitrary policy π, we use the Bellman equation to recursively calculate the state-value function of π.
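    For reference, the Bellman expectation equation in its standard form is written out below. The notation is an assumption here and follows Sutton and Barto: p(s', r | s, a) is the environment's transition dynamics, π(a | s) is the probability that the policy picks action a in state s, and γ is the discount factor.

```latex
v_\pi(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
\qquad \text{for all } s \in \mathcal{S}
```

    The value of a state is thus the expected immediate reward plus the discounted value of the successor state, averaged over the policy's action choices and the environment's transitions.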

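    The policy evaluation algorithm turns this equation into an iterative update: starting from an arbitrary initial value function, each sweep replaces the value of every state with the expected one-step return under π, computed from the current estimates.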

    The stopping criterion is that the state-value function changes only very slightly between two successive sweeps.
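    The iteration and its stopping rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the original post's code: the function name `policy_evaluation`, the array layout (`P[s, a, s2]`, `R[s, a]`, `policy[s, a]`), and the defaults for `gamma`, `theta`, and `max_sweeps` are all assumptions chosen for the example.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8, max_sweeps=10_000):
    """Iteratively evaluate `policy` on a tabular MDP (illustrative sketch).

    P[s, a, s2]  : probability of moving from s to s2 when taking action a
    R[s, a]      : expected immediate reward for taking action a in s
    policy[s, a] : probability that the policy chooses action a in s
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                  # arbitrary initial guess
    for _ in range(max_sweeps):
        # One-step lookahead for every (state, action) pair; shape (S, A).
        Q = R + gamma * (P @ V)
        # Average the lookahead values over the policy's action choices.
        V_new = np.sum(policy * Q, axis=1)
        # Largest change of any state value in this sweep.
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:                   # stop once the change is tiny
            break
    return V
```

    This version computes every new value from the previous sweep's estimates (a synchronous update). Updating `V` in place during the sweep is an equally common variant and usually needs fewer sweeps.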

    The example is a GridWorld puzzle: the task is to reach a grey cell while collecting the most reward. The policy is uniform over the possible actions (up, down, left, right), so each action is chosen with probability 25%.

    Under this policy the agent moves like a random walk, and running policy evaluation gives the value of every cell.
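    The GridWorld calculation can be reproduced with a short script. The text above does not give the grid's details, so everything specific here is an assumption borrowed from the well-known small GridWorld in Sutton and Barto: a 4x4 grid, grey terminal cells in the top-left and bottom-right corners, a reward of -1 per move, no discounting, and moves into a wall leaving the agent where it is.

```python
import numpy as np

N = 4                                          # assumed grid size (4x4)
TERMINALS = {(0, 0), (N - 1, N - 1)}           # assumed grey (terminal) cells
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
STEP_REWARD = -1.0                             # assumed reward for every move
GAMMA = 1.0                                    # assumed: undiscounted episodic task

def step(state, action):
    """Deterministic move; walking into a wall leaves the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c)
    return state

def evaluate_random_policy(theta=1e-8):
    """Policy evaluation for the uniform random policy on the grid."""
    V = np.zeros((N, N))
    while True:
        V_new = np.zeros_like(V)
        for r in range(N):
            for c in range(N):
                if (r, c) in TERMINALS:
                    continue                   # terminal values stay at 0
                # Each of the four actions has probability 0.25.
                V_new[r, c] = sum(
                    0.25 * (STEP_REWARD + GAMMA * V[step((r, c), a)])
                    for a in ACTIONS
                )
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:                      # values have stopped changing
            return V

V = evaluate_random_policy()
print(np.round(V, 1))
```

    Under these assumed settings each value is the negative of the expected number of random moves needed to reach a grey cell from that position, so cells far from both corners end up with the lowest values (about -22 in the two remaining corners).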

  • Original article: https://www.cnblogs.com/rhyswang/p/11161983.html