Deep RL Bootcamp Lecture 4A: Policy Gradients

In policy gradients, the action "a" is usually written as "u" (the control-theory convention for a control input).
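
For reference, here is the expected-return objective and its likelihood-ratio gradient written with this notation; this is my own restatement of the standard result, not copied from the slides.

```latex
% Expected return over trajectories tau, with u_t the action (control) at time t,
% and the likelihood-ratio (score-function) form of its gradient.
U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{H} R(s_t, u_t)\Big],
\qquad
\nabla_\theta U(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\Big(\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\Big)\, R(\tau)\Big]
```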

Use this new (likelihood-ratio) form to estimate, from sampled trajectories, how good the update is.
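
A minimal sketch of what that estimator looks like in code, assuming a tabular softmax policy; the variable names (`theta`, `trajectories`) and the tabular setup are my own simplifications, not the lecture's code.

```python
# Likelihood-ratio (REINFORCE) gradient estimate for a tabular softmax policy.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, u):
    # theta: (n_states, n_actions) table of logits, pi(u|s) = softmax(theta[s])
    pi = softmax(theta[s])
    g = np.zeros_like(theta)
    g[s] = -pi                 # d/d theta[s,:] of log pi(u|s) is one_hot(u) - pi
    g[s, u] += 1.0
    return g

def policy_gradient_estimate(theta, trajectories):
    # g_hat = (1/m) * sum_i [ (sum_t grad log pi(u_t|s_t)) * R(tau_i) ]
    g_hat = np.zeros_like(theta)
    for traj in trajectories:                      # traj: list of (s, u, r) tuples
        R = sum(r for (_, _, r) in traj)           # total reward of the trajectory
        score = sum(grad_log_pi(theta, s, u) for (s, u, _) in traj)
        g_hat += score * R
    return g_hat / len(trajectories)

# Gradient-ascent update (step size is a hyperparameter):
# theta += learning_rate * policy_gradient_estimate(theta, sampled_trajectories)
```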

If all three sampled paths show positive reward, should the policy increase the probability of all of the samples?
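
My understanding of the answer (not verbatim from the lecture): the raw estimator does push all three up, but because the probabilities are normalized, the trajectories with higher reward are pushed up more and the others lose probability mass relatively; subtracting a baseline from the returns makes this explicit and reduces variance. A sketch, reusing `grad_log_pi` and numpy from the snippet above, with a simple constant (mean-return) baseline as an illustrative choice:

```python
# Same estimator but with a baseline b subtracted from each return.
# Trajectories that are positive but below average now get pushed down.
def policy_gradient_with_baseline(theta, trajectories):
    returns = [sum(r for (_, _, r) in traj) for traj in trajectories]
    b = np.mean(returns)                           # simple constant baseline
    g_hat = np.zeros_like(theta)
    for traj, R in zip(trajectories, returns):
        score = sum(grad_log_pi(theta, s, u) for (s, u, _) in traj)
        g_hat += score * (R - b)                   # "better than average" weighting
    return g_hat / len(trajectories)
```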

Monte Carlo estimate
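
A sketch of what a Monte Carlo estimate of the return looks like; the discount factor value is just an illustrative default.

```python
# Monte Carlo estimate: the value target for time t is the full (discounted)
# return actually observed until the end of the episode.
def monte_carlo_returns(rewards, gamma=0.99):
    G, targets = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        targets.append(G)
    targets.reverse()          # targets[t] = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    return targets
```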

TD estimate
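
And a sketch of the corresponding TD(0) estimate, which bootstraps from the current value function instead of waiting for the full return; `V`, `alpha`, and `gamma` are placeholder names.

```python
# TD(0) estimate: bootstrap from the current value of the next state
# (lower variance than Monte Carlo, but some bias).
def td_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    target = r if done else r + gamma * V[s_next]  # one-step TD target
    V[s] += alpha * (target - V[s])                # move V(s) toward the target
    return V
```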

It takes about 2 weeks to train with respect to the real-world time scale,

but it could be much faster in an emulator (MuJoCo).

We don't know whether a set of hyperparameters is going to work until enough iterations have passed, so tuning is tricky; using an emulator alleviates this problem.

Question: how do we transfer what the robot has learned to real life if we are not sure how well the simulator matches the real world?

Randomly initialize many simulators and check the robustness of the algorithm across them.
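
A sketch of that idea (often called domain randomization); the physics attributes below (`friction`, `mass_scale`, `sensor_noise`) are hypothetical placeholders, just to illustrate randomizing simulator parameters per episode.

```python
import random

def make_randomized_sim(base_env_factory):
    # Sample a fresh simulator with perturbed physics for each episode.
    env = base_env_factory()
    env.friction = random.uniform(0.5, 1.5)       # hypothetical parameters
    env.mass_scale = random.uniform(0.8, 1.2)
    env.sensor_noise = random.uniform(0.0, 0.05)
    return env

# Training loop idea: the policy only succeeds if it is robust to the
# sim-to-real mismatch, because every episode sees different dynamics.
# for episode in range(num_episodes):
#     env = make_randomized_sim(base_env_factory)
#     run_episode(policy, env)
```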

This video shows that even a robot built with two years of effort by a group of experts still isn't good at locomotion.

Hindsight Experience Replay (HER)

Marcin Andrychowicz from OpenAI

The program is set to find the best way to get pizza, but when the agent finds an ice cream instead, it treats the ice cream, which corresponds to a higher reward, as exactly the thing it wanted to get: in hindsight, the outcome it actually achieved is relabeled as the goal.
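
A minimal sketch of that relabeling idea (my paraphrase of the hindsight trick, not the paper's code); `Transition`, `reward_fn`, and relabeling with a single achieved goal are simplifications.

```python
from collections import namedtuple

Transition = namedtuple("Transition", "state action next_state goal reward")

def her_relabel(episode, reward_fn):
    # Pretend the outcome the agent actually reached (the "ice cream")
    # was the goal, so a failed episode still yields useful reward signal.
    achieved = episode[-1].next_state              # what the agent actually reached
    return [t._replace(goal=achieved,
                       reward=reward_fn(t.next_state, achieved))
            for t in episode]

# Both the original transitions and the relabeled ones go into the replay buffer.
```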

     https://zhuanlan.zhihu.com/p/29486661

     https://zhuanlan.zhihu.com/p/31527085
