zoukankan      html  css  js  c++  java
  • Monte Carlo Control

    Problem of State-Value Function

    Similar as Policy Iteration in Model-Based Learning, Generalized Policy Iteration will be used in Monte Carlo Control. In Policy Iteration, we keep doing Policy Evaluation and Policy Improvement untill our policy converging to Optimal Policy.

    Every time when we improve the policy, the action that gives the best return(reward+value function of the next state) will be picked.

    The problem of this algorithm if we directly transfering to Monte Carlo is: it is based on the Transition Matrix.

    Monte Carlo Control based on Q function

    The idea of Policy Iteration can be used to Estimite Action-Value Function, and it is very useful for Model-Free problem. The process of choosing actions does not depend on State-Value function, because the return from a specific action is given by Monte Carlo estimation.

    Q function can be updated by:

    When we improve the policy, we just pick the action that produce the maximum Q value.

    Exploration-exploitation Dilemma and ε-Greedy Exploration:

    In Model-Based Policy Iteration algorithm, we update all State-Value function within a single policy evaluation process, so that we can choose the best actions from the whole action space  whiled improving policies. Nevertheless, Monte Carlo Learning only updates the Action-Value functions whose actions were taken on the previous episode. So there are probabily some actions having better returns than the actions we have tried. Sometimes we need to give them a trial. We call that problem the Exploration-Exploitation Delemma.

    It is necessary to try some new opened restaurant, rather than going to the usual place every day. 

    ε-Greedy Exploration is the algorithm that gives the agent probability=ε to choose randomly actions and 1-ε to stay on the optimal action.

  • 相关阅读:
    lua源码分析 伪索引
    visual studio 插件
    修改Linux内核参数 减少TIME-WAIT
    linux下编译libmysqlclient, 安装mysql-server mysql-client
    编译静态库tinyxml2
    linux下编译lua库
    在Xshell中文件内容显示乱码
    Java中的自增自减
    Integer的缓存机制
    八大基本排序
  • 原文地址:https://www.cnblogs.com/rhyswang/p/11258273.html
Copyright © 2011-2022 走看看