  • MDPs

    Markov Decision Processes  (MDPs)

    Named after Andrey Markov; known at least as early as the 1950s (cf. Bellman 1957).

    A discrete-time stochastic control process.

    State: the situation s_t ∈ S that the process is in at time t.

    Action: the choice a_t ∈ A made by the decision maker at time t.

    Reward: the immediate payoff R_{a_t}(s_t, s_{t+1}) received after each transition.

    Markov property: given s_t and a_t, the next state s_{t+1} is conditionally independent of all previous states and actions.

    Extension of Markov chains:

    Difference: the addition of actions (allowing choice) and rewards (giving motivation).

    Formally,

    Definition

    A 4-tuple (S, A, P_a(·,·), R_a(·,·)) (see the encoding sketch after this list), where

    • S is a finite set of states,
    • A is a finite set of actions (alternatively, A_s is the finite set of actions available from state s),
    • P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the probability that action a in state s at time t will lead to state s' at time t+1,
    • R_a(s,s’) is the immediate reward (or expected immediate reward) received after transition to state s’ from state s with transition probability P_a(s,s’)
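    As a concrete illustration (not part of the original definition), a minimal Python sketch of how this 4-tuple might be encoded for a made-up two-state, two-action MDP:

        # Hypothetical two-state, two-action MDP encoded as plain dictionaries.
        S = ["s0", "s1"]              # finite set of states
        A = ["stay", "move"]          # finite set of actions

        # P[a][s][s2] = Pr(s_{t+1} = s2 | s_t = s, a_t = a)
        P = {
            "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
            "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
        }

        # R[a][s][s2] = immediate reward after moving from s to s2 under action a
        R = {
            "stay": {"s0": {"s0": 0.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
            "move": {"s0": {"s0": 0.0, "s1": 5.0}, "s1": {"s0": 0.0, "s1": 0.0}},
        }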

    Problem

    Find a policy for the decision maker: a function \pi that specifies the action \pi(s) that the decision maker will choose when in state s.

    The goal is to choose a policy \pi that will maximize some cumulative function of the random rewards:

    \sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})   (where a_t = \pi(s_t))

    where \gamma \in (0,1] is the discount factor; it is typically close to 1.
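    For example, a short sketch of this objective on one sampled trajectory; the reward values and \gamma below are made-up numbers:

        # Discounted return of one sampled trajectory: sum_t gamma^t * r_t.
        gamma = 0.95                      # discount factor, typically close to 1
        rewards = [0.0, 5.0, 0.0, 1.0]    # hypothetical samples of R_{a_t}(s_t, s_{t+1})

        discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
        print(discounted_return)          # 0 + 0.95*5 + 0 + 0.857375*1 = 5.607375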

    Algorithm

     \pi(s) := \arg \max_a \left\{ \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V(s') \right) \right\}

     V(s) := \sum_{s'} P_{\pi(s)} (s,s') \left( R_{\pi(s)} (s,s') + \gamma V(s') \right)

    P : transition function

    R: Reward function

    V: an array of real values; V(s) will contain the discounted sum of the rewards to be earned (on average) by following that policy from state s.
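    A sketch of these two interleaved updates (improve the policy, then re-evaluate it) on an invented two-state MDP; all transition and reward numbers below are assumptions for illustration:

        # Policy iteration sketch: alternate the two updates above until the policy is stable.
        gamma = 0.9
        S, A = ["s0", "s1"], ["stay", "move"]
        P = {"stay": {"s0": {"s0": 1.0}, "s1": {"s1": 1.0}},
             "move": {"s0": {"s1": 1.0}, "s1": {"s0": 1.0}}}
        R = {"stay": {"s0": {"s0": 0.0}, "s1": {"s1": 1.0}},
             "move": {"s0": {"s1": 0.0}, "s1": {"s0": 0.0}}}

        V = {s: 0.0 for s in S}
        pi = {s: "stay" for s in S}

        for _ in range(20):
            # pi(s) := argmax_a sum_{s'} P_a(s,s') (R_a(s,s') + gamma V(s'))
            for s in S:
                pi[s] = max(A, key=lambda a: sum(p * (R[a][s][s2] + gamma * V[s2])
                                                 for s2, p in P[a][s].items()))
            # V(s) := sum_{s'} P_{pi(s)}(s,s') (R_{pi(s)}(s,s') + gamma V(s'))
            for _ in range(100):          # iterative policy evaluation for the current pi
                for s in S:
                    a = pi[s]
                    V[s] = sum(p * (R[a][s][s2] + gamma * V[s2]) for s2, p in P[a][s].items())

        print(pi, V)   # expected: move from s0, stay in s1; V(s1) is roughly 1/(1-gamma) = 10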

    Notable variants

    Value iteration

     V(s) := \max_a \left\{ \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V(s') \right) \right\}.
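    A matching value iteration sketch, reusing the same invented two-state MDP as in the policy iteration example above; the numbers are assumptions:

        # Value iteration sketch: apply the max-backup above until V stops changing.
        gamma = 0.9
        S, A = ["s0", "s1"], ["stay", "move"]
        P = {"stay": {"s0": {"s0": 1.0}, "s1": {"s1": 1.0}},
             "move": {"s0": {"s1": 1.0}, "s1": {"s0": 1.0}}}
        R = {"stay": {"s0": {"s0": 0.0}, "s1": {"s1": 1.0}},
             "move": {"s0": {"s1": 0.0}, "s1": {"s0": 0.0}}}

        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                # V(s) := max_a sum_{s'} P_a(s,s') (R_a(s,s') + gamma V(s'))
                new_v = max(sum(p * (R[a][s][s2] + gamma * V[s2]) for s2, p in P[a][s].items())
                            for a in A)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < 1e-8:
                break

        # Greedy policy extracted from the converged V
        pi = {s: max(A, key=lambda a: sum(p * (R[a][s][s2] + gamma * V[s2])
                                          for s2, p in P[a][s].items())) for s in S}
        print(V, pi)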

    Policy iteration / Modified policy iteration / Prioritized sweeping

    Extensions and generalizations

    Partial observability (POMDP)

    Reinforcement learning (RL):

    Probabilities or rewards are unknown.

    Q(s,a) = \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V(s') \right)

    "I was in state s, I tried doing a, and s' happened."
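    The original note stops here, so as an assumed illustration only: a tabular Q-learning sketch, one standard way to handle this model-free setting, which learns Q(s, a) from sampled transitions (s, a, r, s') without knowing P or R:

        import random

        # Tabular Q-learning sketch: learn Q(s,a) purely from observed transitions.
        # The step() environment below is a made-up stand-in for the unknown P and R.
        gamma, alpha, epsilon = 0.9, 0.1, 0.1
        S, A = ["s0", "s1"], ["stay", "move"]
        Q = {(s, a): 0.0 for s in S for a in A}

        def step(s, a):
            """Hypothetical environment: returns (next_state, reward)."""
            s2 = s if a == "stay" else ("s1" if s == "s0" else "s0")
            r = 1.0 if (s == "s1" and a == "stay") else 0.0
            return s2, r

        s = "s0"
        for _ in range(10000):
            # epsilon-greedy action selection
            a = random.choice(A) if random.random() < epsilon else max(A, key=lambda x: Q[(s, x)])
            s2, r = step(s, a)
            # "I was in state s, I tried doing a, and s2 happened":
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s2, a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in A) - Q[(s, a)])
            s = s2

        print(Q)   # Q[("s1", "stay")] should approach roughly 1/(1 - gamma) = 10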
