  • DRL Hands-on book

Code: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On

    Chapter 1 What is Reinforcement Learning

    Learning - supervised, unsupervised, and reinforcement

    RL is not completely blind as in an unsupervised learning setup--we have a reward system.

(1) Observations depend on the agent's own behavior. If the agent keeps making mistakes, its experience can suggest that a larger reward is impossible ("life is suffering"), which could be totally wrong. In machine learning terms, this means the data is non-i.i.d.

(2) The exploration/exploitation dilemma is one of the open fundamental questions in RL.

(3) The third complication lies in the fact that reward can be seriously delayed from the actions that caused it.

RL formalisms and relations

    RL entities and their communications

    • Agent and Environment are the two nodes of the graph
    • Actions are edges directed from the Agent to the Environment
    • Rewards and Observations are edges directed from the Environment to the Agent (a minimal interaction loop is sketched below)
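    A minimal sketch of this interaction loop in OpenAI Gym terms (the environment name and the random "agent" are illustrative choices, not from these notes):

    ```python
    # Agent-environment loop: the agent sends actions, the environment
    # answers with observations and rewards.
    import gym

    env = gym.make("CartPole-v0")            # illustrative environment
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()   # a random "agent" for illustration
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print("Episode reward:", total_reward)
    ```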

    Reward

    We don't define how frequently the agent receives this reward. In the case of once-in-a-lifetime reward systems, all rewards except the last one will be zero.

    The agent

    The environment

    Action

    Two types of actions: discrete or continuous.
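    In Gym terms (an assumption about the framing, since these notes only name the two types), discrete and continuous actions correspond to the Discrete and Box spaces:

    ```python
    from gym import spaces

    discrete_actions = spaces.Discrete(4)                             # e.g. four directions
    continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(2,))   # e.g. steering, throttle

    print(discrete_actions.sample())     # an integer in [0, 4)
    print(continuous_actions.sample())   # a float vector in [-1, 1]^2
    ```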

    Observations

    Markov decision process

    It is the theoretical foundation of RL, which makes it possible to start moving toward the methods used to solve the RL problem.

    We start from the simplest case of a Markov process (also known as a Markov chain), then extend it with rewards, which turns it into a Markov reward process. Then we put this idea into one other extra envelope by adding actions, which leads us to a Markov decision process.

    Markov process

    You can always make your model more complex by extending your state space, which allows you to capture more dependencies in the model at the cost of a larger state space.

    You can capture transition probabilities with a transition matrix, which is a square matrix of size N×N, where N is the number of states in your model.

    The transition matrix can be estimated from observed episodes, as in the sketch below.
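    A rough sketch of that estimation by transition counting (function and variable names are my own, not the book's):

    ```python
    import numpy as np

    def estimate_transition_matrix(episodes, n_states):
        """episodes: list of state sequences, states encoded as 0..n_states-1."""
        counts = np.zeros((n_states, n_states))
        for episode in episodes:
            for s, s_next in zip(episode[:-1], episode[1:]):
                counts[s, s_next] += 1
        # Normalize each row into probabilities; unvisited states keep zero rows.
        row_sums = counts.sum(axis=1, keepdims=True)
        return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

    # Example: two observed episodes over 3 states
    episodes = [[0, 1, 1, 2], [0, 0, 1, 2]]
    print(estimate_transition_matrix(episodes, n_states=3))
    ```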

    Markov reward process

    The first thing is to add reward to the Markov process model.

    Representation: a full reward transition matrix, or a more compact representation that is applicable only if the reward value depends only on the target state, which is not always the case.

    The second thing is to add a discount factor gamma (from 0 to 1).
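    As a reminder (standard definitions, not spelled out in these notes), the discounted return of a trajectory and the value of a state are:

    ```latex
    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
        = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
    \qquad
    V(s) = \mathbb{E}\big[\, G_t \mid S_t = s \,\big]
    ```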

    Markov decision process

    Add an 'action' dimension to the transition matrix, so transition probabilities become indexed by (state, action, next state); see the sketch below.
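    A rough sketch of what that looks like (dimensions and numbers are illustrative, not from the book):

    ```python
    import numpy as np

    n_states, n_actions = 3, 2
    # P[s, a, s'] = probability of moving to s' when taking action a in state s
    P = np.zeros((n_states, n_actions, n_states))

    # Example entry: from state 0, action 1 leads to state 2 with probability 0.8
    # and stays in state 0 with probability 0.2.
    P[0, 1] = [0.2, 0.0, 0.8]

    # Each (state, action) slice must be a probability distribution over next states.
    assert np.isclose(P[0, 1].sum(), 1.0)
    ```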

    Chapter 2 OpenAI Gym

    Chapter 3 Deep Learning with PyTorch

    Chapter 4 The Cross-Entropy Method

    Taxonomy of RL methods

    • Model-free or model-based
    • Value-based or policy-based
    • On-policy or off-policy

    Practical cross-entropy
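    The cross-entropy method trains a policy network only on "elite" episodes whose total reward is above a percentile boundary. A minimal sketch of that filtering step (function and variable names are my own, not the book's; episode generation and the policy network are assumed to exist elsewhere):

    ```python
    import numpy as np

    def filter_elite(episodes, percentile=70):
        """episodes: list of (total_reward, steps), where steps is a list of (obs, action)."""
        rewards = np.array([r for r, _ in episodes])
        boundary = np.percentile(rewards, percentile)
        train_obs, train_act = [], []
        for total_reward, steps in episodes:
            if total_reward >= boundary:          # keep only the best episodes
                for obs, action in steps:
                    train_obs.append(obs)
                    train_act.append(action)
        # The policy network is then trained to imitate these (obs, action) pairs,
        # and the play/filter/train loop repeats.
        return train_obs, train_act, boundary
    ```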

  • Original post: https://www.cnblogs.com/ZeroTensor/p/10926679.html