Reinforcement Learning Cheatsheet

    1. MDPs

    What are MDPs?

    MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and, through those, future rewards.

    The dynamics of an MDP are defined as:

    \[ p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \]
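
    As a concrete illustration, the dynamics of a small finite MDP can be stored as a lookup table and checked for consistency. The two states, two actions, and rewards below are assumed toy values, not taken from the original cheatsheet:

```python
# Toy tabular dynamics p(s', r | s, a) for an assumed two-state MDP.
# dynamics[(s, a)] is a list of (next_state, reward, probability) triples.
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 5.0, 1.0)],
}

# Sanity check: for each (s, a), the probabilities over (s', r) must sum to 1.
for (s, a), outcomes in dynamics.items():
    total = sum(p for _, _, p in outcomes)
    assert abs(total - 1.0) < 1e-9, f"p(.|{s},{a}) does not sum to 1"
```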

    What is the Markov property?

    The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. If it does, the state is said to have the Markov property.

    What is a finite MDP?

    In a finite MDP, the sets of states, actions, and rewards (\(S\), \(A\), and \(R\)) all have a finite number of elements. In this case, the random variables \(R_t\) and \(S_t\) have well-defined discrete probability distributions that depend only on the preceding state and action.

    What can one do if the state or action space is continuous?

    One can quantize (discretize) it into a finite set of values, as sketched below.
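
    A minimal sketch of such quantization, assuming a one-dimensional continuous state in \([-1, 1]\) and made-up bin edges:

```python
import numpy as np

# Assumed example: map a continuous 1-D state in [-1, 1] onto a finite
# set of discrete states by binning it.
bin_edges = np.linspace(-1.0, 1.0, num=11)   # 10 equal-width bins

def discretize(x):
    """Return the index of the bin that the continuous value x falls into."""
    return int(np.digitize(x, bin_edges))

print(discretize(-0.95), discretize(0.0), discretize(0.73))
```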

    What makes MDPs different from the k-armed bandit problem?

    Whereas in bandit problems we estimated the value \(q_*(a)\) of each action \(a\), in MDPs we estimate the value \(q_*(s, a)\) of each action \(a\) in each state \(s\), or we estimate the value \(v_*(s)\) of each state given optimal action selections.

    Policy function

    A policy, e.g. \(\pi(a \mid s)\), is a mapping from states to probabilities of selecting each possible action.
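
    A minimal sketch of a tabular stochastic policy; the states, actions, and probabilities below are assumed purely for illustration:

```python
import random

# pi[s][a] = probability of selecting action a in state s (each row sums to 1).
pi = {
    "s0": {"stay": 0.3, "go": 0.7},
    "s1": {"stay": 0.5, "go": 0.5},
}

def sample_action(policy, state):
    """Draw an action a with probability pi(a | s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s0"))
```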

    Value function

    State-value function

    The state-value function of a state \(s\) under a policy \(\pi\), denoted \(v_\pi(s)\), is the expected return the agent can get by starting from state \(s\) and following policy \(\pi\) thereafter.

    \[ v_\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right], \quad \forall s \in S \]
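
    To unpack the formula: the quantity inside the expectation is the discounted return, i.e. the discounted sum of future rewards, and \(v_\pi(s)\) is its average over trajectories generated by \(\pi\) from \(s\). The snippet below only computes the discounted return for one assumed reward sequence; a Monte Carlo estimate of \(v_\pi(s)\) would average this value over many episodes started in \(s\).

```python
# Discounted return sum_k gamma^k * R_{t+k+1} for one assumed episode.
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 5.0]   # R_{t+1}, R_{t+2}, ... (made-up values)

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)                         # 1.0 + 0.9**3 * 5.0 = 4.645
```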

    Action-value function

    \(q_\pi(s, a)\) is called the action-value function for policy \(\pi\).
    Similarly to the state-value function, it is the expected return the agent can get by starting from the state–action pair \((s, a)\) at time \(t\) and following policy \(\pi\) thereafter.

    \[ q_\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right], \quad \forall s \in S, \ \forall a \in A \]
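
    The two value functions differ only in whether the first action is fixed; they are connected by the standard identity \(v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)\). A small sketch with assumed numbers:

```python
# Assumed action values q_pi(s, a) and policy pi(a | s) for a single state s.
q = {"stay": 2.0, "go": 4.0}
pi_s = {"stay": 0.3, "go": 0.7}

# v_pi(s) is the policy-weighted average of the action values.
v_s = sum(pi_s[a] * q[a] for a in q)
print(v_s)   # 0.3 * 2.0 + 0.7 * 4.0 = 3.4
```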

    2. Dynamic Programming

    DP refers to a collection of algorithms that can be used to compute value functions, and thus to find optimal policies, given a perfect model of the environment as an MDP.
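
    As a minimal sketch of such an algorithm, iterative policy evaluation repeatedly sweeps the states and backs up the expected one-step return under the policy until the values stop changing. The MDP and policy below reuse the assumed toy values from the earlier examples:

```python
# Iterative policy evaluation (DP) on an assumed two-state MDP.
# dynamics[(s, a)] -> list of (next_state, reward, probability)
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 5.0, 1.0)],
}
pi = {"s0": {"stay": 0.3, "go": 0.7}, "s1": {"stay": 0.5, "go": 0.5}}
gamma, theta = 0.9, 1e-8   # discount factor and convergence threshold

V = {"s0": 0.0, "s1": 0.0}
while True:
    delta = 0.0
    for s in V:
        # Expected one-step return under pi, using the model p(s', r | s, a).
        v_new = sum(
            pi[s][a] * p * (r + gamma * V[s_next])
            for a in pi[s]
            for (s_next, r, p) in dynamics[(s, a)]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

print(V)   # approximate v_pi for each state
```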
