DRL Study Notes

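Notes on Learning Deep Reinforcement Learning
- This is not really my first encounter with the topic: I read a little about it during my sophomore-year SRP and even wrote it up on my blog, but my current understanding of DRL is barely better than knowing nothing at all.
- So now I am reading some articles more systematically to build a rough framework for understanding DRL.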
Background

Machine Learning
- RL is one kind of ML.
  
  ML can be divided into three types: supervised, unsupervised, and reinforcement learning.
  
  RL has evaluative feedback but no supervised signals.
  
  Supervised and unsupervised learning are usually one-shot and myopic, considering only instant reward.
  
  RL is sequential and far-sighted, considering long-term cumulative reward.
- Gradient descent: a commonly used method for solving optimization problems.
- Occam's Razor: with the same expressiveness, simple models are preferred.
- No free lunch theorem: there is no universally best model or best regularizer.
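As a small illustration of the gradient descent bullet above, here is a minimal sketch in plain Python. The objective \(f(x) = (x-3)^2\), the learning rate, and the step count are assumptions chosen only for the example; they do not come from the notes.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With these settings the iterate moves toward the minimizer at x = 3; a larger learning rate could overshoot, which is why the step size is usually tuned.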
What is Deep Learning
- There are one or more hidden layers between the input and output layers.
- MLP
- CNN: convolutional layers, pooling layers, fully connected layers.
- ResNets: ease the training of very deep neural networks by adding shortcut connections to learn residual functions with reference to the layer inputs.
- RNN: the same weights are shared across the unrolled layers (time steps) of the network.
- LSTM: can retain information over a long history.
- The biggest difference between deep RL and "shallow" RL is the function approximator used.
  
  "Shallow" RL uses linear functions, decision trees, tile coding, etc.
  
  Deep RL uses deep neural networks, whose weight parameters are usually updated with SGD.
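What is Reinforcement Learning
- Covered in another post.
- Training the policy is somewhat like a classification problem: {s1, a1} is the input and A acts like the label. The loss can therefore be measured by the cross entropy \(e_n\) together with \(A_n\), for example \(L = \sum_n A_n e_n\).
  
  When updating, a baseline can be subtracted, because good and bad are relative.
  
  The average value can be used as the baseline.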
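The classification view above can be sketched as a cross-entropy loss weighted by the score. This is only an illustration under assumptions: the two-action logits, the scores, and the helper names (`softmax`, `cross_entropy`, `weighted_loss`) are hypothetical, and the baseline is taken to be the batch average of the scores.

```python
import math

def softmax(logits):
    """Turn raw action scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, action):
    """e_n: negative log-probability of the action that was actually taken."""
    return -math.log(softmax(logits)[action])

def weighted_loss(batch, baseline=None):
    """L = sum_n (A_n - b) * e_n over (logits, action, score) triples."""
    scores = [score for _, _, score in batch]
    if baseline is None:
        baseline = sum(scores) / len(scores)  # assumed: batch average as baseline
    return sum((score - baseline) * cross_entropy(logits, action)
               for logits, action, score in batch)

# Two made-up samples: same policy output, different actions and scores.
batch = [
    ([0.0, 0.0], 0, 3.0),
    ([0.0, 0.0], 1, 1.0),
]
loss = weighted_loss(batch)
```

Without the baseline both samples would push their action's probability up; after subtracting the average score (2.0 here), the better-than-average action gets a positive weight and the worse one a negative weight.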
Value Function
- DQN
Policy Gradient
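- Subtracting a baseline: whether an action is good is relative, so the baseline makes G take both positive and negative values.
- On-policy Policy Gradient: the actor to train and the actor for interacting are the same.
  
  Each collected episode of data (one play-through) can be used for only one parameter update.
  
  This is very time-consuming.
- Off-policy Policy Gradient: the actor to train and the actor for interacting are different.
  
  There is no need to collect new data for every parameter update.
  
  Classic algorithm: Proximal Policy Optimization (PPO), which reuses collected data through importance sampling (although PPO is usually classified as on-policy, because the data come from a recent version of the same policy).
  
  The actor to train has to know how it differs from the actor that interacts.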
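One common way to express "knowing the difference" between the two actors is the probability ratio between the trained policy and the interacting policy. The sketch below shows the clipped surrogate term used by the PPO-clip variant for a single sample. The notes only name PPO, so the function name, the probabilities, and `eps=0.2` are assumptions for illustration.

```python
def clipped_surrogate(new_prob, old_prob, advantage, eps=0.2):
    """PPO-clip style objective for one (state, action) sample.

    new_prob: probability of the action under the actor being trained
    old_prob: probability under the actor that collected the data
    """
    ratio = new_prob / old_prob                        # importance weight
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))    # keep ratio near 1
    return min(ratio * advantage, clipped * advantage)

# Made-up numbers: the trained actor has drifted far from the data-collecting actor.
gain_good_action = clipped_surrogate(0.6, 0.3, advantage=1.0)
gain_bad_action = clipped_surrogate(0.1, 0.4, advantage=-1.0)
```

The clipping limits how much a single batch can change the policy, which is what makes it reasonable to reuse the same data for several updates.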
Policy
- Methods are divided into value-based and policy-based, and both can be trained with TD updates.
  
  Value-based methods include on-policy SARSA and off-policy Q-Learning.
  
  Policy-based methods include Policy Gradient, Actor-Critic, and A3C.
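The on-policy versus off-policy distinction between SARSA and Q-Learning shows up in the TD target. A minimal sketch, where the function names, the discount `gamma`, the step size `alpha`, and the sample values are all assumptions for illustration:

```python
def sarsa_target(reward, q_next, next_action, gamma=0.9):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + gamma * q_next[next_action]

def q_learning_target(reward, q_next, gamma=0.9):
    """Off-policy: bootstrap from the greedy (best) next action."""
    return reward + gamma * max(q_next)

def td_update(q_old, target, alpha=0.5):
    """Move the old estimate a fraction alpha toward the TD target."""
    return q_old + alpha * (target - q_old)

q_next = [1.0, 2.0]  # made-up Q-values of the next state, one per action
sarsa = sarsa_target(1.0, q_next, next_action=0)
qlearn = q_learning_target(1.0, q_next)
q_new = td_update(0.0, sarsa)
```

The two targets differ only when the next action taken is not the greedy one, for example during exploration.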
Actor-Critic Methods
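- Actor: comes from Policy Gradients. Traditional Policy Gradients update once per episode.
  
  Critic: given an actor θ, the critic evaluates how good it is after observing s (and taking action a).
  
  The critic comes from value-based methods, so it allows single-step updates.
  
  It can be estimated with Monte-Carlo (MC) or TD.
- It consists of two parts: an actor and a critic.
  
  Actor: the policy.
  
  Critic: the estimate of a value function.
  
  In RL, both parts can be represented by non-linear neural network function approximators.
- The actor is a network that determines the policy and the critic is a network that evaluates the actor; they can share some parts, such as the processing of the input.
- The critic provides the term \(r_t^n + V^\pi(s_{t+1}^n) - V^\pi(s_t^n)\).
  
  The actor performs policy gradient using the value given by the critic.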
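The quantity the critic hands to the actor can be computed per step from the reward and two value estimates. A minimal sketch, assuming the values come from some critic that is not shown here; the function name, the `gamma` and `done` parameters, and the numbers are hypothetical (with `gamma=1.0` the expression matches the undiscounted form in the notes):

```python
def td_advantage(reward, v_s, v_next, gamma=1.0, done=False):
    """r_t + V(s_{t+1}) - V(s_t), with no bootstrap at the end of an episode."""
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s

# Made-up critic outputs for one transition.
adv = td_advantage(reward=1.0, v_s=2.0, v_next=2.5)
adv_terminal = td_advantage(reward=1.0, v_s=2.0, v_next=2.5, done=True)
```

A positive result means the step turned out better than the critic expected from that state, so the actor would raise the probability of the action taken; a negative result lowers it.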
Critic
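- Approximates the value function of the current policy, with parameters θ.
- \(Q(s, a; \theta) \approx Q^\pi (s, a)\)
- Subtracting the V value: use \(v^\theta(s_i)\) as the baseline. If \(A_1>0\), the action is better than the average in \(s_1\).
  
  This subtracts an average from the reward of a single sample.
- Advantage Actor-Critic (A2C)
  
  Subtracts an average from an average.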
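The two variants above can be written side by side. This is a sketch under assumptions: \(G'_t\) denotes the sampled return from step \(t\) onward (a symbol not used in the notes), and rewards are undiscounted, as in the critic term earlier.

```latex
% Sample minus average: one sampled return against the critic's estimate.
A_t = G'_t - V^\theta(s_t)

% Average minus average: replace the sampled return by its expectation,
% using Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_t + V^\pi(s_{t+1}) \right].
A_t = r_t + V^\theta(s_{t+1}) - V^\theta(s_t)
```

The second form usually has lower variance because only the single reward \(r_t\) is sampled, at the cost of relying on the critic being accurate.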
Reference
- Deep Reinforcement Learning: An Overview
- An Introduction to Deep Reinforcement Learning
- Zhou Zhihua, "Machine Learning" (the "watermelon book")
- https://www.cnblogs.com/xuwanwei/p/13641755.html
- Hung-yi Lee's DRL lecture videos
- Morvan Python (莫烦Python) videos on Bilibili

Original post: https://www.cnblogs.com/xuwanwei/p/15715893.html