  • Study Notes | Morvan

    Q-learning

    Auxiliary Material

    1. A Painless Q-Learning Tutorial

    2. Simple Reinforcement Learning with Tensorflow Part 0: Q-Learning

    3. 6.5 Q-Learning: Off-Policy TD Control (Sutton and Barto's Reinforcement Learning ebook)

    Note

    1. tabular

      flat; in table form (tabular Q-learning stores one Q value per (state, action) entry in a table)

    2. Q-learning by Morvan

      Q-learning is a method that records action values (Q values): every action taken in a given state has a value Q(s, a); that is, the value of taking action a in state s is Q(s, a).

      In the explorer game above, s is the position of o. At every position the explorer can take two actions, left/right, and these make up all of the explorer's available actions a.

    3. Q-learning by Wikipedia

      Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter.

      When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards, without requiring any adaptations.

    4. Pseudocode
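      The pseudocode itself appears to have been an image in the original post. As a minimal sketch of the tabular Q-learning loop on a 1-D explorer world like the one in Morvan's tutorial (the world size, reward placement, and hyperparameters here are illustrative assumptions, not the tutorial's exact settings):

```python
import numpy as np

N_STATES = 6          # length of the 1-D world; treasure at the right end
ACTIONS = [0, 1]      # 0 = left, 1 = right
EPSILON = 0.9         # probability of acting greedily
ALPHA = 0.1           # learning rate
GAMMA = 0.9           # discount factor

def step(s, a):
    """Deterministic 1-D world: reward 1 only on reaching the rightmost state."""
    if a == 1:                       # move right
        if s == N_STATES - 2:
            return 'terminal', 1
        return s + 1, 0
    return max(0, s - 1), 0          # move left, bounded at position 0

rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, len(ACTIONS)))

for episode in range(50):
    s = 0
    while s != 'terminal':
        # epsilon-greedy: exploit the best known action, sometimes explore
        if rng.random() < EPSILON and q_table[s].any():
            a = int(q_table[s].argmax())
        else:
            a = int(rng.integers(len(ACTIONS)))
        s_next, r = step(s, a)
        # TD target: r for terminal moves, else r + gamma * max over next actions
        target = r if s_next == 'terminal' else r + GAMMA * q_table[s_next].max()
        q_table[s, a] += ALPHA * (target - q_table[s, a])
        s = s_next

print(np.round(q_table, 3))
```

      After training, the right-moving actions dominate the table, so the greedy policy walks straight to the treasure.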

    5. The transition rule of Q learning is a very simple formula (source):

      Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
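      As a NumPy sketch of this rule, with a hypothetical 2-state reward matrix R standing in for the tutorial's room graph (the matrix values and indices are assumptions for illustration):

```python
import numpy as np

GAMMA = 0.8                               # discount factor, as in the tutorial's example
R = np.array([[0., 100.], [0., 0.]])      # hypothetical reward matrix, 2 states x 2 actions
Q = np.zeros_like(R)                      # Q table starts at zero

state, action, next_state = 0, 1, 1
# the transition rule: immediate reward plus discounted best value of the next state
Q[state, action] = R[state, action] + GAMMA * Q[next_state].max()
print(Q[state, action])                   # 100.0 here, since Q[next_state] is still all zeros
```

      Note this simplified rule has no learning rate; repeated sweeps over experienced transitions propagate the reward backwards through the Q table.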

    6. epsilon greedy

      EPSILON is the value that controls the degree of greediness. EPSILON can be raised over time as exploration proceeds (becoming greedier and greedier).
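      A small sketch of this selection rule, following the convention above where EPSILON is the probability of acting greedily (the Q values here are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(42)

def choose_action(q_row, epsilon, rng):
    """With probability epsilon pick the greedy action; otherwise pick at random."""
    if rng.random() < epsilon:
        return int(np.argmax(q_row))
    return int(rng.integers(len(q_row)))

q_row = np.array([0.1, 0.5])     # hypothetical Q values for one state: [left, right]
counts = [0, 0]
for _ in range(1000):
    counts[choose_action(q_row, epsilon=0.9, rng=rng)] += 1
print(counts)                    # the higher-valued action is chosen far more often
```

      Raising epsilon toward 1 over training shifts the agent from exploring to exploiting what it has learned.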

    7. Why is Q-learning considered an off-policy control method? (Exercise 6.9 of Sutton and Barto's book)

      If the algorithm estimates the value function of the policy generating the data, the method is called on-policy. Otherwise it is called off-policy.

      If the samples used in the TD update are not generated according to your behavior policy (the policy the agent is following), then it is called off-policy learning; you can also say learning from off-policy data. (source)

      Q-learning is an off-policy algorithm, because the max over actions lets the Q table be updated from experience the agent is not currently generating (it can learn now from experience gathered long ago, or even from someone else's experience).
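      The distinction shows up directly in the two TD targets. A sketch comparing SARSA (on-policy) with Q-learning (off-policy); the Q values and the transition here are made-up numbers:

```python
import numpy as np

GAMMA = 0.9
Q = np.array([[0.0, 1.0],
              [0.5, 0.2]])      # hypothetical Q table, 2 states x 2 actions

s, a, r, s_next = 0, 0, 0.0, 1
a_next = 1    # the action the behavior policy actually takes next (an exploratory move)

# SARSA (on-policy): the target uses the action the behavior policy actually took
sarsa_target = r + GAMMA * Q[s_next, a_next]

# Q-learning (off-policy): the target uses the best next action, whatever the behavior did
q_target = r + GAMMA * Q[s_next].max()

print(sarsa_target, q_target)   # the targets differ whenever the behavior explores
```

      Because the Q-learning target ignores which action was actually taken next, the same update applies to old or third-party experience, which is exactly what makes it off-policy.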

  • Original post: https://www.cnblogs.com/casperwin/p/6305351.html