Using OpenAI gym Environments on Windows
Author: 凯鲁嘎吉 - 博客园 (cnblogs) http://www.cnblogs.com/kailugaji/
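1. Key commands for setting up the gym environment
1.1 Preparation
First create a virtual environment with conda create -n RL python=3.8 and activate it with activate RL. Note that the listing below reports python 3.7.1, so pass python=3.7 instead to reproduce this exact environment. The packages and versions I used, as shown by conda list: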
ale-py 0.7.3 <pip>
atari-py 1.2.2 <pip>
Box2D 2.3.10 <pip>
box2d-py 2.3.8 <pip>
ca-certificates 2021.10.26 haa95532_2 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
certifi 2020.6.20 py37_0 anaconda
cffi 1.15.0 <pip>
cloudpickle 2.0.0 <pip>
cycler 0.11.0 <pip>
Cython 0.29.26 <pip>
fasteners 0.16.3 <pip>
ffmpeg 1.4 <pip>
fonttools 4.28.5 <pip>
glfw 2.5.0 <pip>
gym 0.21.0 <pip>
imageio 2.13.5 <pip>
importlib-metadata 2.0.0 py_1 anaconda
importlib-resources 5.4.0 <pip>
kiwisolver 1.3.2 <pip>
lockfile 0.12.2 <pip>
matplotlib 3.5.1 <pip>
mujoco-py 1.50.1.68 <pip>
numpy 1.21.5 <pip>
openssl 1.0.2t vc14h62dcd97_0 [vc14] anaconda
packaging 21.3 <pip>
Pillow 9.0.0 <pip>
pip 20.2.4 py37_0 anaconda
pycparser 2.21 <pip>
pyglet 1.5.21 <pip>
pyparsing 3.0.6 <pip>
python 3.7.1 h33f27b4_4 anaconda
python-dateutil 2.8.2 <pip>
setuptools 50.3.0 py37h9490d1a_1 anaconda
six 1.16.0 <pip>
sqlite 3.20.1 vc14h7ce8c62_1 [vc14] anaconda
swig 3.0.12 h047fa9f_3 anaconda
vc 14.2 h21ff451_1 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
vs2015_runtime 14.27.29016 h5e58377_2 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
wheel 0.37.0 pyhd3eb1b0_1 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
wincertstore 0.2 py37_0 anaconda
wrappers 0.1.9 <pip>
zipp 3.3.1 py_0 anaconda
zipp 3.7.0 <pip>
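Because several steps below depend on specific versions, it can help to check an existing environment against the list above. The following is a minimal sketch, not part of the original setup: the helper name find_mismatches and the EXPECTED dictionary are hypothetical, and only a few pins from the list are copied in. On Python 3.7 it falls back to the importlib-metadata backport, which appears in the list above.

```python
# Sketch: compare installed package versions with a few pins from the list above.
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:  # Python 3.7: use the importlib-metadata backport
    import importlib_metadata as metadata

# Hypothetical subset of the versions shown by conda list.
EXPECTED = {
    "gym": "0.21.0",
    "ale-py": "0.7.3",
    "box2d-py": "2.3.8",
    "mujoco-py": "1.50.1.68",
    "pyglet": "1.5.21",
}


def find_mismatches(expected):
    """Return {name: (expected_version, installed_version_or_None)} for each mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every pinned package matches; a found value of None means the package is missing.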
Then install numpy: pip install numpy
1.2 Installing gym, atari, Box2D, and mujoco
1.2.1 Installing the basic gym
pip install gym
pip install pyglet
List all registered environments:
from gym import envs
names = [env.id for env in envs.registry.all()]
print('\n'.join(names))
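The printed list is long because many environments are registered in several versions. As a rough sketch, the ids can be grouped by base name to keep only the highest version of each. The helper latest_versions is hypothetical, and the sample list is written by hand so that the snippet runs without gym; with gym installed, the names list built above could be passed in instead.

```python
import re

# Environment ids follow the pattern "<Name>-v<version>", e.g. "CartPole-v0".
ENV_ID = re.compile(r"^(?P<name>.+)-v(?P<version>\d+)$")


def latest_versions(env_ids):
    """Map each base name to the highest version number found for it."""
    latest = {}
    for env_id in env_ids:
        match = ENV_ID.match(env_id)
        if match is None:
            continue  # skip ids that do not follow the pattern
        name = match.group("name")
        version = int(match.group("version"))
        if version > latest.get(name, -1):
            latest[name] = version
    return latest


if __name__ == "__main__":
    # Hand-written sample ids, not the real registry contents.
    sample = ["CartPole-v0", "CartPole-v1", "Taxi-v3", "AirRaid-ram-v0"]
    for name, version in sorted(latest_versions(sample).items()):
        print(f"{name}-v{version}")
```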
1.2.2 Installing atari
pip install gym[atari]
pip uninstall atari_py
pip install --no-index -f https://github.com/Kojoley/atari-py/releases atari_py
1.2.3 Installing Gym Box2D
conda install -c anaconda swig
pip install box2d-py
1.2.4 Installing mujoco
1). When installing Visual Studio, select the Windows 10 SDK component.
2). Create a folder named .mujoco under C:\Users\24410\. From https://www.roboti.us/index.html download mjpro150 win64 and mjkey.txt, and save the text at https://www.apache.org/licenses/LICENSE-2.0.txt as LICENSE.txt.
3). Put these three files in C:\Users\24410\.mujoco. Extract mjpro150 win64 so that the extracted folder is named mjpro150, then also copy mjkey.txt and LICENSE.txt into C:\Users\24410\.mujoco\mjpro150\bin.
4). Add the following system environment variables:
Variable name: MUJOCO_PY_MJKEY_PATH
Variable value: C:\Users\24410\.mujoco\mjpro150\bin\mjkey.txt
Variable name: MUJOCO_PY_MUJOCO_PATH
Variable value: C:\Users\24410\.mujoco\mjpro150\bin
Also add this directory to Path: C:\Users\24410\.mujoco\mjpro150\bin
5). In a terminal, run:
cd C:\Users\24410\.mujoco\mjpro150\bin
simulate.exe ../model/humanoid.xml
6). In the RL environment, run pip install mujoco-py==1.50.1.68. Done.
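Steps 2) to 4) are easy to get slightly wrong, so a quick check of the folder layout and the two environment variables may save some debugging. This is a sketch only: check_mujoco_layout is a hypothetical helper, it assumes that .mujoco sits in the current user's home directory as in the steps above, and it does not inspect the Path entry.

```python
import os
from pathlib import Path


def check_mujoco_layout(root, environ=None):
    """Return a list of problems; an empty list means the layout matches the steps above."""
    environ = os.environ if environ is None else environ
    root = Path(root)
    bin_dir = root / "mjpro150" / "bin"
    problems = []
    # Files that steps 2) and 3) place in .mujoco and in mjpro150\bin.
    for path in (
        root / "mjkey.txt",
        root / "LICENSE.txt",
        bin_dir / "mjkey.txt",
        bin_dir / "LICENSE.txt",
    ):
        if not path.is_file():
            problems.append(f"missing file: {path}")
    # Environment variables from step 4).
    expected_vars = {
        "MUJOCO_PY_MJKEY_PATH": bin_dir / "mjkey.txt",
        "MUJOCO_PY_MUJOCO_PATH": bin_dir,
    }
    for name, expected in expected_vars.items():
        value = environ.get(name)
        if value is None:
            problems.append(f"{name} is not set")
        elif Path(value) != expected:
            problems.append(f"{name} is {value}, expected {expected}")
    return problems


if __name__ == "__main__":
    for problem in check_mujoco_layout(Path.home() / ".mujoco"):
        print(problem)
```

If nothing is printed, the files and variables are where the steps above put them.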
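2. A first Python program (gym environment)
2.1 Basic use of a gym environment (actions chosen by random sampling)
This example plays 5 episodes, each capped at 1000 steps.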
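# -*- coding: UTF-8 -*-
# https://www.cnblogs.com/kailugaji/ - 凯鲁嘎吉 - 博客园
import gym
import time

def run_gym(index):
    env = gym.make(index)
    for i_episode in range(5):  # number of episodes to play
        observation = env.reset()  # reset the environment
        for t in range(1000):  # at most 1000 steps per episode
            env.render()  # render the current state of the agent and the environment
            time.sleep(0.01)  # slow the display down; otherwise the animation runs too fast
            action = env.action_space.sample()  # sample a random action; an RL policy can later supply better actions
            observation, reward, done, info = env.step(action)  # take a random action
            # env.step() returns four values:
            # observation (object): an environment-specific object representing what you observe,
            #     e.g. pixel data from a camera, joint angles and joint velocities of a robot,
            #     or the board state of a board game
            # reward (float): the amount of reward obtained by the previous action. The scale varies
            #     between environments, but the goal is always to maximize the total reward
            # done (boolean): whether it is time to reset the environment. Most, but not all, tasks
            #     are split into well-defined episodes, e.g. the CartPole episode ends when the pole tips too far
            # info (dict): diagnostic information useful for debugging. It can sometimes help learning,
            #     e.g. it may contain the raw probabilities behind the last state change,
            #     but it must not be used for learning when the agent is being evaluated
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
    env.close()
    return observation, reward, env.action_space, env.observation_space

if __name__ == "__main__":
    index = 'CartPole-v0'  # Classic control
    # index = 'MountainCar-v0'  # Classic control
    # index = 'AirRaid-ram-v0'  # Atari
    # index = 'Taxi-v3'  # Toy text
    # Taxi dispatch
    #   There are 4 locations, each labelled with a letter. The task is to pick up the passenger at
    #   one location and drop them off at one of the other 3, as quickly as possible.
    #   Colors: blue = passenger, magenta = passenger's destination, yellow = empty taxi, green = occupied taxi.
    #   A ":" barrier can be crossed; a "|" barrier cannot.
    #   Reward: +20 for successfully delivering a passenger
    #           -1 per step (to encourage reaching the destination quickly)
    #           -10 for not dropping the passenger at the designated location
    #   Action: 0 = move south, 1 = move north, 2 = move east, 3 = move west, 4 = pick up, 5 = drop off
    #   State: 500 discrete states, each encoding (taxi row, taxi column, passenger location, destination)
    # index = 'Ant-v2'  # MuJoCo
    # index = 'BipedalWalker-v3'  # Box2D
    # Train a bipedal robot to walk
    #   Goal: the agent must learn to move forward over various obstacles.
    #   State: 24-dimensional vector: angular speeds of the body parts, horizontal speed, vertical speed,
    #          joint positions, leg contact with the ground, and 10 lidar rangefinder measurements.
    #   Action: 4-dimensional continuous action space in [-1, 1]: the torques of the robot's two hip joints and two knee joints.
    #   Reward: moving forward gives a positive reward, falling gives -100, and driving the joints costs a small amount.
    #   Done: the episode ends when the robot falls or reaches the end of the terrain.
    # index = 'LunarLander-v2'  # Box2D
    # Navigate the lander to its landing pad
    #   The landing pad is always at coordinates (0, 0); the coordinates are the first two numbers of the state vector.
    #   Fuel is infinite, so an agent can learn to fly and then land on its first attempt.
    #   Reward: moving from the top of the screen to the landing pad at zero speed earns about 100..140 points.
    #           The lander loses reward if it moves away from the landing pad. The episode ends if the lander
    #           crashes or comes to rest, with an additional -100 or +100 points respectively.
    #           Each leg in ground contact is +10. Firing the main engine costs 0.3 points per frame.
    #           The task counts as solved at 200 points. Landing outside the landing pad is possible.
    #   Action: do nothing, fire left orientation engine, fire main engine, fire right orientation engine.
    #   State: horizontal coordinate x, vertical coordinate y, horizontal velocity, vertical velocity, angle,
    #          angular velocity, leg 1 ground contact, leg 2 ground contact.
    observation, reward, action_space, observation_space = run_gym(index)
    print('Action Space: \n', action_space)
    print('Observation Space: \n', observation_space)
    print('Observation: \n', observation)
    print('Reward: \n', reward)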
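The program above prints only the last reward, while the quantity an agent tries to maximize is the total reward over an episode. The following sketch factors the inner loop into a helper that accumulates it. Both run_episode and CountdownEnv are hypothetical names: CountdownEnv is a toy stand-in that merely follows the same reset()/step() four-value convention, so the snippet runs without gym.

```python
class CountdownEnv:
    """Toy stand-in (not a gym environment) using the reset()/step() convention above."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is simply the step counter

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # action 1 earns one point
        done = self.t >= self.horizon  # the episode ends after `horizon` steps
        return self.t, reward, done, {}


def run_episode(env, policy, max_steps=1000):
    """Play one episode and return (total_reward, steps_taken)."""
    observation = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            return total_reward, t + 1
    return total_reward, max_steps


if __name__ == "__main__":
    total, steps = run_episode(CountdownEnv(horizon=10), policy=lambda obs: 1)
    print(total, steps)
```

Under gym 0.21 the same helper should also accept an environment created with gym.make, with a policy that ignores the observation and returns env.action_space.sample().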
[Figures: rendered windows for CartPole-v0, MountainCar-v0, AirRaid-ram-v0, Taxi-v3, Ant-v2, BipedalWalker-v3, and LunarLander-v2]
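The Taxi-v3 comment in the program above mentions 500 states built from (taxi row, taxi column, passenger location, destination). The count follows from a 5 x 5 grid, 5 passenger locations (the 4 named locations plus "in the taxi"), and 4 destinations: 5 * 5 * 5 * 4 = 500. The sketch below shows one way such a tuple can be packed into a single index and unpacked again. The function names and the exact ordering of the factors are assumptions for illustration and may differ from gym's internal implementation.

```python
def encode(taxi_row, taxi_col, passenger, destination):
    """Pack (row, col, passenger, destination) into one index in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination


def decode(state):
    """Inverse of encode(): recover the four components from the index."""
    state, destination = divmod(state, 4)
    state, passenger = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger, destination


if __name__ == "__main__":
    state = encode(3, 1, 2, 0)
    print(state, decode(state))
```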
2.2 Saving the gym rendering as an animated GIF
This example plays 1 episode, capped at 1000 steps.
# -*- coding: UTF-8 -*-
# https://www.cnblogs.com/kailugaji/ - 凯鲁嘎吉 - 博客园
import gym
import time
from matplotlib import animation
import matplotlib.pyplot as plt

# Save the rendered gym frames as an animated GIF
def save_frames_as_gif(frames, path, index):
    filename = 'gym_' + index + '.gif'
    # Mess with this to change frame size
    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi=72)
    patch = plt.imshow(frames[0])
    plt.axis('off')
    def animate(i):
        patch.set_data(frames[i])
    anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=50)
    anim.save(path + filename, writer='pillow', fps=60)

def run_gym(index):
    env = gym.make(index)
    frames = []
    for i_episode in range(1):  # number of episodes to play
        observation = env.reset()  # reset the environment
        for t in range(1000):  # at most 1000 steps per episode
            env.render()  # render the current state of the agent and the environment
            frames.append(env.render(mode='rgb_array'))  # also keep the frame as an RGB array for the GIF
            time.sleep(0.01)  # slow the display down; otherwise the animation runs too fast
            action = env.action_space.sample()  # sample a random action; an RL policy can later supply better actions
            observation, reward, done, info = env.step(action)  # take a random action
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
    env.close()
    save_frames_as_gif(frames, path='./', index=index)
    return observation, reward, env.action_space, env.observation_space

if __name__ == "__main__":
    index = 'CartPole-v0'  # Classic control
    # index = 'MountainCar-v0'  # Classic control
    # index = 'AirRaid-ram-v0'  # Atari
    # index = 'Taxi-v3'  # Toy text
    # index = 'Ant-v2'  # MuJoCo
    # index = 'BipedalWalker-v3'  # Box2D
    # index = 'LunarLander-v2'  # Box2D
    observation, reward, action_space, observation_space = run_gym(index)
    print('Action Space: \n', action_space)
    print('Observation Space: \n', observation_space)
    print('Observation: \n', observation)
    print('Reward: \n', reward)
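The program above stores one frame per step, so a full 1000-step episode yields up to 1000 frames, and at fps=60 a 200-frame episode plays back in about 3.3 seconds. If the resulting files turn out too large, one possible remedy (not something the program above does) is to thin out the frame list before saving and lower fps to match. The helpers subsample_frames and playback_seconds below are hypothetical.

```python
def subsample_frames(frames, max_frames):
    """Keep at most max_frames evenly spaced frames, always including the first one."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    if len(frames) <= max_frames:
        return list(frames)
    stride = -(-len(frames) // max_frames)  # ceiling division
    return list(frames[::stride])


def playback_seconds(n_frames, fps):
    """Nominal playback time of an animation with n_frames shown at fps frames per second."""
    return n_frames / fps


if __name__ == "__main__":
    frames = list(range(1000))  # stand-in for a list of RGB arrays
    kept = subsample_frames(frames, max_frames=100)
    print(len(kept), playback_seconds(len(kept), fps=10))
```

The thinned list could then be passed to save_frames_as_gif in place of frames.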
CartPole-v0
Episode finished after 14 timesteps
Action Space:
Discrete(2)
Observation Space:
Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Observation:
[-0.21281277 -0.82338583 0.21441801 1.4051291 ]
Reward:
1.0
MountainCar-v0
Episode finished after 200 timesteps
Action Space:
Discrete(3)
Observation Space:
Box([-1.2 -0.07], [0.6 0.07], (2,), float32)
Observation:
[-0.44918552 0.00768781]
Reward:
-1.0
AirRaid-ram-v0
Episode finished after 520 timesteps
Action Space:
Discrete(6)
Observation Space:
Box([0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], [255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
255 255], (128,), uint8)
Observation:
[188 184 180 184 188 62 188 0 0 0 128 16 56 68 146 56 68 0
37 64 44 124 0 191 252 175 252 2 1 240 240 240 62 62 62 0
0 0 224 255 0 8 117 14 14 4 155 246 150 246 145 246 140 246
135 246 130 246 234 0 31 31 10 1 2 6 30 79 123 0 0 0
0 2 2 0 16 16 181 236 236 0 0 248 248 35 89 1 0 3
0 0 22 60 80 80 64 0 2 15 0 0 0 0 0 0 0 0
0 0 0 0 62 247 236 247 62 247 181 247 240 240 0 0 202 245
144 245]
Reward:
0.0
Ant-v2
Episode finished after 84 timesteps
Action Space:
Box([-1. -1. -1. -1. -1. -1. -1. -1.], [1. 1. 1. 1. 1. 1. 1. 1.], (8,), float32)
Observation Space:
Box([-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf
inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf
inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf
inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf
inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf
inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf
inf inf inf], (111,), float64)
Observation:
[ 1.01382928 0.93552107 -0.3281438 -0.11665695 0.05927168 0.45597194
0.97790293 0.54600325 -1.25270942 -0.52486369 -1.22324054 -0.37710358
1.20411015 -0.63179069 0.77311549 1.25462637 -3.07764597 -0.53052459
-0.56717237 5.52724079 -7.48125022 -0.67642035 1.00847733 0.0252006
-6.03428845 5.96745216 5.86833084 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. ]
Reward:
-1.3955676167834756
BipedalWalker-v3
Episode finished after 72 timesteps
Action Space:
Box([-1. -1. -1. -1.], [1. 1. 1. 1.], (4,), float32)
Observation Space:
Box([-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf
inf inf inf inf inf inf], (24,), float32)
Observation:
[ 2.2528927 0.10642751 -0.30818883 -0.07858526 -0.64701635 0.07910275
-0.09122896 -0.10672931 1. -0.7459374 -1.264112 -0.5282707
0.41768146 1. 0.18380441 0.18589178 0.19239755 0.20412572
0.22270261 0.25120568 0.2956909 0.36940333 0.50724566 0.83926386]
Reward:
-100
LunarLander-v2
Episode finished after 88 timesteps
Action Space:
Discrete(4)
Observation Space:
Box([-inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf], (8,), float32)
Observation:
[ 0.620516 -0.04573616 0.10586149 0.08483806 -1.4809935 -0.49739528
0. 0. ]
Reward:
-100
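Because actions are chosen at random here, the agent learns essentially nothing; it is pure guesswork. Reinforcement learning methods can later be used to decide the agent's next action.
For atari and mujoco, the colors render correctly in the first program but incorrectly in the second. This does not affect the algorithm's ability to learn (although nothing is learned here). It may be caused by the versions of some packages and still needs to be sorted out.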
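As a small step beyond random sampling, here is a hand-written rule for CartPole-v0. It is not a reinforcement learning method and nothing is learned; it only illustrates how the observation can drive the action. The function name lean_policy is hypothetical. It assumes the CartPole observation layout [cart position, cart velocity, pole angle, pole angular velocity] and the actions 0 = push left, 1 = push right.

```python
def lean_policy(observation):
    """Push the cart toward the side the pole is leaning or falling to.

    Assumes observation = [cart position, cart velocity, pole angle, pole angular velocity]
    and actions 0 = push left, 1 = push right.
    """
    pole_angle = observation[2]
    pole_angular_velocity = observation[3]
    return 1 if pole_angle + pole_angular_velocity > 0 else 0


if __name__ == "__main__":
    # Pole leaning right and not rotating: push right.
    print(lean_policy([0.0, 0.0, 0.1, 0.0]))
```

To try it, replace action = env.action_space.sample() in the loop from section 2.1 with action = lean_policy(observation). How much longer the episodes last was not measured here.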