Monte Carlo Control

Problem of State-Value Function

As with Policy Iteration in Model-Based Learning, Monte Carlo Control uses Generalized Policy Iteration. In Policy Iteration, we alternate between Policy Evaluation and Policy Improvement until the policy converges to the Optimal Policy.

Each time we improve the policy, we pick the action that gives the best return (the immediate reward plus the value function of the next state).

The problem with carrying this algorithm over directly to Monte Carlo is that the improvement step relies on the Transition Matrix, which is not available in a Model-Free setting.
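Written out, the greedy improvement step over the State-Value function looks like the following. The symbols R (reward), P (transition probabilities), γ (discount factor) and V (State-Value function) are standard notation assumed here, not taken from the post:

```latex
\pi'(s) = \arg\max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]
```

The sum over next states s' needs P(s' | s, a), which is exactly the quantity a Model-Free agent does not know.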

Monte Carlo Control based on Q function

The idea of Policy Iteration can instead be used to estimate the Action-Value Function, which is very useful for Model-Free problems. Choosing an action then no longer depends on the State-Value function, because the return of a specific action is given directly by the Monte Carlo estimate.

The Q function can be updated by:
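A common form of this update, assumed here because the formula itself is not shown, is the incremental mean of sampled returns. N(s,a) counts how often the pair (s,a) has been visited and G_t is the return observed after taking action a in state s:

```latex
N(s,a) \leftarrow N(s,a) + 1, \qquad
Q(s,a) \leftarrow Q(s,a) + \frac{1}{N(s,a)} \big( G_t - Q(s,a) \big)
```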

When we improve the policy, we simply pick the action that produces the maximum Q value.
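The evaluation and improvement steps on the Q function can be sketched in a few lines of Python. This is a minimal illustration, not the post's own code: it assumes an every-visit incremental mean of returns, and the names `update_q_from_episode`, `greedy_action`, and the toy states `s0`/`s1` are hypothetical.

```python
from collections import defaultdict


def update_q_from_episode(q, counts, episode, gamma=1.0):
    """Every-visit Monte Carlo update of q from one finished episode.

    episode is a list of (state, action, reward) tuples in time order.
    """
    g = 0.0
    # Walk backwards so g accumulates the discounted return from each step.
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        counts[(state, action)] += 1
        # Incremental mean: move the estimate toward the sampled return.
        q[(state, action)] += (g - q[(state, action)]) / counts[(state, action)]
    return q


def greedy_action(q, state, actions):
    """Policy improvement: pick the action with the largest Q value."""
    return max(actions, key=lambda a: q[(state, a)])


# Toy data (hypothetical): two short episodes starting in state "s0".
q = defaultdict(float)
counts = defaultdict(int)
update_q_from_episode(q, counts, [("s0", "left", 0.0), ("s1", "right", 1.0)])
update_q_from_episode(q, counts, [("s0", "right", 0.0)])
best = greedy_action(q, "s0", ["left", "right"])
```

Note that no transition probabilities appear anywhere: only sampled returns are used, which is what makes the approach Model-Free.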

Exploration-Exploitation Dilemma and ε-Greedy Exploration

In the Model-Based Policy Iteration algorithm, we update the State-Value function of every state within a single policy evaluation pass, so we can choose the best action from the whole action space while improving the policy. Monte Carlo Learning, however, only updates the Action-Value estimates of the actions that were actually taken in the previous episode. So there are probably some untried actions with better returns than the ones we have tried, and sometimes we need to give them a chance. We call this problem the Exploration-Exploitation Dilemma.

It is like trying a newly opened restaurant now and then, rather than going to the usual place every day.

ε-Greedy Exploration gives the agent probability ε of choosing an action at random and probability 1-ε of staying with the action that currently looks best.
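A minimal sketch of ε-greedy action selection follows. It assumes Q values are stored in a dictionary keyed by (state, action); the function name `epsilon_greedy_action` and the toy values are hypothetical.

```python
import random


def epsilon_greedy_action(q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: any action, including the currently best one.
        return rng.choice(actions)
    # Exploit: the action with the highest current Q estimate.
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Toy Q table (hypothetical values): "left" currently looks best in "s0".
q = {("s0", "left"): 1.0, ("s0", "right"): 0.0}
rng = random.Random(0)
picks = [
    epsilon_greedy_action(q, "s0", ["left", "right"], epsilon=0.2, rng=rng)
    for _ in range(1000)
]
greedy_share = picks.count("left") / len(picks)
```

Because the random branch can also land on the greedy action, that action is chosen with probability 1-ε+ε/|A| overall, which is about 0.9 for ε=0.2 and two actions.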

Original article: https://www.cnblogs.com/rhyswang/p/11258273.html