
[CS285 Deep Reinforcement Learning] Homework 1 Walkthrough


Contents

- Background and references
- Reading the code in order
  - BC_Trainer
  - BCAgent
  - MLP_policy.py
  - ReplayBuffer
  - RL_Trainer
  - collect_training_trajectories
  - do_relabel_with_expert
- Result analysis after completing the code
  - BC Behaviour cloning
  - Q1.3 analysis
  - Training result comparison
- Collected problems
  - numpy.core._exceptions.MemoryError: Unable to allocate 1.40 GiB for an array with shape (1000, 500, 1000, 3) and data type uint8
  - Images never showing up
  - pickle.load
  - ptu.to_numpy() strips the grad_fn
- References
- Installing MuJoCo on Ubuntu
  - Introduction
  - Prerequisites and dependencies
  - License application
  - Setting the paths

Background and references:

- Bilibili lecture videos: /video/BV1dJ411W78A
- Official course page: http://rail.eecs.berkeley.edu/deeprlcourse/
- My code: /kin_zhang/drl-hwprogramm/tree/kin/hw1/hw1

Please read readme.md, installation.md, and the other docs in the original repo first. The lectures I watched are from an earlier fall offering, but I did the assignments against the latest fall release. Some references (mainly reference code) are listed in the last section.

It seems the reinforcement learning book from my last "flag" never got finished. This course was recommended by my classmate pjc and looked interesting, so I started watching the lectures and doing the homework, treating it as a preview of the whole workflow. I will still keep working on Carla later and set up an environment to do some DRL/RL there. As for the course notes, I will see whether I can organize them better... right now they probably only make sense to me, haha.

Original notes on Notion

Reading the code in order:

- scripts/run_hw.py (you should read this file, but you don't need to edit it)
- infrastructure/rl_trainer.py
- agents/bc_agent.py (another read-only file)
- policies/MLP_policy.py
- infrastructure/replay_buffer.py
- infrastructure/utils.py
- infrastructure/pytorch_utils.py

First read run_hw1.py in order, starting from main(). It:

- parses the arguments: do_dagger (whether to use the expert data for relabeling, i.e. DAgger), the logging directory, and so on (a small sketch of how do_dagger gates n_iter follows below)
- (key) constructs the BC_Trainer
- (key) runs the training loop
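For context, here is a minimal sketch of how do_dagger typically gates the number of iterations in main(); the flag names follow the scaffold's argparse, while the prefix strings and the assert logic are assumptions from memory:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--do_dagger', action='store_true')
parser.add_argument('--n_iter', '-n', type=int, default=1)
args = parser.parse_args()
params = vars(args)

if params['do_dagger']:
    logdir_prefix = 'q2_'   # DAgger relabels with the expert, so it needs more than one iteration
    assert params['n_iter'] > 1
else:
    logdir_prefix = 'q1_'   # vanilla behavior cloning trains on the expert data only once
    assert params['n_iter'] == 1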

BC_Trainer

- takes in the parameters
- constructs the BCAgent
- constructs the RL training object (RL_Trainer)
- loads the expert policy

Of these, we first step into the construction of BCAgent.

BCAgent

It initializes the environment and the parameters and sets up the actor, which drops us into the following:

MLP_policy.py

self.actor.update(ob_no, ac_na)  # HW1: you will modify this (TODO)

This line tells us the actor/policy being constructed is an MLPPolicySL; stepping into it, we find we need to implement the policy update and the loss.

The loss is given by self.loss = nn.MSELoss(). For how to use it, it helps to first go through the PyTorch 60-minute blitz tutorial: /tutorials/beginner/blitz/neural_networks_tutorial.html

output = net(input)
target = torch.randn(10)     # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()
loss = criterion(output, target)
print(loss)

So here we fill it in as follows: the first argument is the action obtained from the current observations (we will also need to complete get_action next), and the second argument converts the numpy actions into a tensor.

loss = self.loss(self.forward(ptu.from_numpy(observations)),
                 ptu.from_numpy(actions))
# the reason why we cannot use get_action here is that to_numpy removes the grad_fn

Then complete get_action: run the current observation through the network to get the action, noting that the result has to be converted back to numpy:

return ptu.to_numpy(self.forward(ptu.from_numpy(observation)))

Next, complete forward. From __init__ earlier we know:

if self.discrete, the network is logits_na; otherwise it is mean_net.

So forward is:

def forward(self, observation: torch.FloatTensor) -> Any:
    # raise NotImplementedError
    if self.discrete:
        return self.logits_na(observation)
    else:
        return self.mean_net(observation)

Finally, for backprop, the PyTorch tutorial shows the whole usage pattern:

import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()        # Does the update

Mimicking that pattern, this whole part can be completed:

class MLPPolicySL(MLPPolicy):
    def __init__(self, ac_dim, ob_dim, n_layers, size, **kwargs):
        super().__init__(ac_dim, ob_dim, n_layers, size, **kwargs)
        self.loss = nn.MSELoss()

    def update(self, observations, actions,
               adv_n=None, acs_labels_na=None, qvals=None):
        # DONE TODO: update the policy and return the loss
        self.optimizer.zero_grad()  # zero the gradient buffers of all parameters
        loss = self.loss(self.forward(ptu.from_numpy(observations)),
                         ptu.from_numpy(actions))  # use forward, not get_action (see the grad_fn issue below)
        loss.backward()  # backprop
        self.optimizer.step()  # does the update
        return {
            # You can add extra logging information here, but keep this line
            'Training Loss': ptu.to_numpy(loss),
        }

ReplayBuffer

What needs to be completed here is random sampling: following the hint, we select random entries from each of the 5 component arrays.

That is np.random.permutation(len(self)); since we need batch_size such indices, the whole function becomes the following (with a tiny sanity check of the indexing idea after the code):

def sample_random_data(self, batch_size):
    assert (
        self.obs.shape[0]
        == self.acs.shape[0]
        == self.rews.shape[0]
        == self.next_obs.shape[0]
        == self.terminals.shape[0]
    )
    ## TODO return batch_size number of random entries from each of the 5 component arrays above
    ## HINT 1: use np.random.permutation to sample random indices
    ## HINT 2: return corresponding data points from each array (i.e., not different indices from each array)
    ## HINT 3: look at the sample_recent_data function below
    indices = np.random.permutation(len(self))[:batch_size]
    return (self.obs[indices], self.acs[indices], self.rews[indices],
            self.next_obs[indices], self.terminals[indices])
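A tiny sanity check of that indexing idea on toy arrays (not the real buffer): using the same index array for every component keeps the transitions aligned.

import numpy as np

obs = np.arange(10).reshape(5, 2)       # 5 fake observations
acs = np.arange(5)                      # 5 fake actions, aligned row-for-row with obs
indices = np.random.permutation(5)[:3]  # 3 distinct random indices
print(obs[indices])                     # the sampled observations...
print(acs[indices])                     # ...and the actions at the same indices stay paired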

With that, the TODOs in this part are done.

RL_Trainer

Back in run_hw1.py: once the BCAgent is constructed, we need to build what is inside RL_Trainer.

Initialization: take in the parameters, create the logger (TF summary writer), set up parts of the environment, and create the agent.

Then we can see that run_hw1.py directly calls self.rl_trainer.run_training_loop, and the part we need to complete there is collecting the training trajectories.

collect_training_trajectories

The hint already explains that on the first iteration we should load the expert paths.

# DONE TODO decide whether to load training data or use the current policy to collect more data
# HINT: depending on if it's the first iteration or not, decide whether to either
#   (1) load the data. In this case you can directly return as follows
#       ``` return loaded_paths, 0, None ```
#   (2) collect `self.params['batch_size']` transitions
if itr == 0:
    with open(load_initial_expertdata, 'rb') as f:
        loaded_paths = pickle.loads(f.read())
    return loaded_paths, 0, None

# DONE TODO collect `batch_size` samples to be used for training
# HINT1: use sample_trajectories from utils
# HINT2: you want each of these collected rollouts to be of length self.params['ep_len']
print("\nCollecting data to be used for training...")
paths, envsteps_this_batch = utils.sample_trajectories(
    self.env, collect_policy, batch_size, self.params['ep_len'])
# signature: sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length, render=False, render_mode=('rgb_array'))

The second TODO's hint is quite explicit: call that function from utils. Jumping over to it, you can see which parameters it needs and just pass in what you already know. While jumping in, we find that sample_trajectories itself also has a TODO to complete.

Per the hints: use sample_trajectory to get each path, and use get_pathlength to count the timesteps.

def sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length, render=False, render_mode=('rgb_array')):
    """
    Collect rollouts until we have collected min_timesteps_per_batch steps.

    TODO implement this function
    Hint1: use sample_trajectory to get each path (i.e. rollout) that goes into paths
    Hint2: use get_pathlength to count the timesteps collected in each path
    """
    timesteps_this_batch = 0
    paths = []
    while timesteps_this_batch < min_timesteps_per_batch:
        path = sample_trajectory(env, policy, max_path_length)
        paths.append(path)
        timesteps_this_batch += get_pathlength(path)
    return paths, timesteps_this_batch

Then look at sample_trajectory. First you need the basic usage of gym (see its official site):

import gym

env = gym.make("CartPole-v1")
observation = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample()  # your agent here (this takes random actions)
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()

Filling it in is then just resetting the environment with ob = env.reset() and so on; see the gitee link for the full version. A rough sketch of the idea follows.
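This is a minimal sketch of what sample_trajectory has to do, assuming the Path helper from infrastructure/utils.py packs the lists below into the path dict; the ac = ac[0] line assumes get_action returns a batch of size 1, as in the MLP policy above.

def sample_trajectory(env, policy, max_path_length, render=False, render_mode=('rgb_array')):
    ob = env.reset()  # initialize the env for a new rollout
    obs, acs, rewards, next_obs, terminals, image_obs = [], [], [], [], [], []
    steps = 0
    while True:
        if render and 'rgb_array' in render_mode:
            image_obs.append(env.render(mode='rgb_array'))  # frames for the logged video

        obs.append(ob)
        ac = policy.get_action(ob)[0]  # query the policy and drop the batch dimension
        acs.append(ac)

        ob, rew, done, _ = env.step(ac)  # take that action in the env
        steps += 1
        next_obs.append(ob)
        rewards.append(rew)

        # end the rollout when the env says done or we hit max_path_length
        rollout_done = 1 if (done or steps >= max_path_length) else 0
        terminals.append(rollout_done)
        if rollout_done:
            break

    return Path(obs, image_obs, acs, rewards, next_obs, terminals)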

Back in the main trajectory-collection code, there is one more TODO: implement sample_n_trajectories, which returns a list of paths whose length is the given ntraj.

def sample_n_trajectories(env, policy, ntraj, max_path_length, render=False, render_mode=('rgb_array')):
    """
    Collect ntraj rollouts.

    TODO implement this function
    Hint1: use sample_trajectory to get each path (i.e. rollout) that goes into paths
    """
    paths = []
    for i in range(ntraj):
        # forward the render flags so that video logging (render=True) actually works
        path = sample_trajectory(env, policy, max_path_length, render, render_mode)
        paths.append(path)
    return paths

do_relabel_with_expert

After collecting, we return to the RL_Trainer functions; the next step is relabeling the collected observations with the expert policy. Hmm, I got lazy about writing the rest in this much detail... just go read the code, haha. A rough sketch of the relabeling step is below.
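A minimal sketch of do_relabel_with_expert (the DAgger step): for every collected path, overwrite the recorded actions with what the expert policy would have done at the same observations. The 'observation'/'action' keys follow the Path dict convention used above; treat the exact expert_policy.get_action signature as an assumption.

def do_relabel_with_expert(self, expert_policy, paths):
    print("\nRelabelling collected observations with labels from an expert policy...")
    for path in paths:
        path["action"] = expert_policy.get_action(path["observation"])
    return paths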

The finished code is here: /kin_zhang/drl-hwprogramm/tree/kin/hw1/hw1

Result analysis after completing the code

BC Behaviour cloning

First, as I wrote in solution.md, ep_len defaults to 1000, but when I ran the training in a loop, my ep_len started from 100, like this:

#####################
## RUN TRAINING
#####################
# to run repeatedly, uncomment this block and comment out the two lines below
add_item = 100
for step in range(25):
    params['ep_len'] = add_item * (step + 1)
    trainer = BC_Trainer(params)
    trainer.run_training_loop()

# trainer = BC_Trainer(params)
# trainer.run_training_loop()

Then, exporting to CSV and plotting:

Hmm, it looks about right: at ep_len = 1000 the return falls in the expected range. It also reminded me of a point from the lecture: if BC makes a mistake early on, it is very hard to recover later. Is that why the expert data is loaded, but only loaded once? The return stays at its highest at the start (with the default ep_len of 1000).

Question 2 asks for an analysis over the number of iterations, but I did not think n_iter really counted as "iterations"; it felt more like how much length of learning is done. Shouldn't "iterations" mean repeatedly learning on the same observed states? Oh, I think I remember now: with ep_len=1000 and eval_batch_size=5000, you collect 5 trajectories; that is what controls how many times you learn per observed state. By default both are 1000, so it is once. (How did I manage to finish this without understanding what the parameters mean... mostly because the HINTs already tell you what to pass in, so there is barely anything to think about.)

The larger n_iter is, the farther these robots walk... hmm, actually what I just said is wrong: running with n_iter=300 still produced the same few steps, so that is definitely not it; n_iter is the number of iterations.

Hmm, maybe because I wrote solution.md a day after writing the code, I had forgotten everything?? Going back over it again, it is the path length.

def sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length, render=False, render_mode=('rgb_array')):
    """
    Collect rollouts until we have collected min_timesteps_per_batch steps.

    DONE TODO implement this function
    Hint1: use sample_trajectory to get each path (i.e. rollout) that goes into paths
    Hint2: use get_pathlength to count the timesteps collected in each path
    """
    timesteps_this_batch = 0
    paths = []
    while timesteps_this_batch < min_timesteps_per_batch:
        path = sample_trajectory(env, policy, max_path_length, render)
        paths.append(path)
        timesteps_this_batch += get_pathlength(path)
    return paths, timesteps_this_batch

This is the place. The call coming in is paths, envsteps_this_batch = utils.sample_trajectories(self.env, collect_policy, batch_size, self.params['ep_len'], render=False, render_mode=('rgb_array')), i.e. max_path_length is self.params['ep_len'].

For logging, the call is eval_paths, eval_envsteps_this_batch = utils.sample_trajectories(self.env, eval_policy, self.params['eval_batch_size'], self.params['ep_len']).

But when recording video it actually switches over to MAX_VIDEO_LEN, and the original scaffold seems to have a mistake here: the overwrite in __init__ fails if the variable is not declared global.

train_video_paths = utils.sample_n_trajectories(self.env, collect_policy, MAX_NVIDEO, MAX_VIDEO_LEN, True)

May 18: the global issue has been fixed and the length increased.
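A minimal sketch of that fix inside rl_trainer.py; the constant names come from the scaffold, and the exact placement is an assumption:

# at module level in rl_trainer.py
MAX_NVIDEO = 2
MAX_VIDEO_LEN = 40  # overwritten in __init__ below

class RL_Trainer:
    def __init__(self, params):
        self.params = params
        # without the `global` declaration this assignment would only create a local
        # variable, and the module-level MAX_VIDEO_LEN used later would stay at 40
        global MAX_VIDEO_LEN
        MAX_VIDEO_LEN = self.params['ep_len']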

To summarize the parameters above:

n_iter: the number of iterations, i.e. how many times the overall training loop runs.

For example, the very first iteration can already train a policy that walks for ep_len steps, but the training loss may still be large and the result poor.

ep_len: max length of episodes. In rl_trainer.py you can see that it determines MAX_VIDEO_LEN, and together with eval_batch_size it determines how many trajectories are collected:

self.params['ep_len'] = self.params['ep_len'] or self.env.spec.max_episode_steps

eval_batch_size: eval data collected (in the env) for logging metrics, i.e. how much data we gather in the environment for evaluation. Roughly: the episode length is decided by ep_len, while the amount of evaluation data is decided by this. By default both are 1000:

# in rl_trainer.py, collecting data for training
print("\nCollecting data to be used for training...")
paths, envsteps_this_batch = utils.sample_trajectories(
    self.env, collect_policy, batch_size, self.params['ep_len'])

# in rl_trainer.py, collecting eval trajectories, for logging
print("\nCollecting data for eval...")
eval_paths, eval_envsteps_this_batch = utils.sample_trajectories(
    self.env, eval_policy, self.params['eval_batch_size'], self.params['ep_len'])

# the function definition
def sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length, render=False, render_mode=('rgb_array')):
    """
    Collect rollouts until we have collected min_timesteps_per_batch steps.

    DONE TODO implement this function
    Hint1: use sample_trajectory to get each path (i.e. rollout) that goes into paths
    Hint2: use get_pathlength to count the timesteps collected in each path
    """
    timesteps_this_batch = 0
    paths = []
    while timesteps_this_batch < min_timesteps_per_batch:
        path = sample_trajectory(env, policy, max_path_length, render)
        paths.append(path)
        timesteps_this_batch += get_pathlength(path)
    return paths, timesteps_this_batch

# inside sample_trajectory
while True:
    # ...
    rollout_done = 1 if steps >= max_path_length else 0  # HINT: this is either 0 or 1
    if rollout_done:
        break
    # ...

Here you can see that min_timesteps_per_batch is the minimum number of timesteps per batch (default 1000), while max_path_length is the length of each path, i.e. how far you want the agent to walk (also default 1000). [The GIFs only show 40 steps by default, to keep the folder size small.]

As hw1.pdf explains, if ep_len=1000 and eval_batch_size=5000, the while loop in this function runs about five times, i.e. there are five trajectories (five rollout sequences). The returns of those five trajectories are then summed, and the mean and standard deviation are used to judge how well the policy performs this time.
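A quick sanity check of that count; plain arithmetic, not tied to the repo:

ep_len = 1000           # max_path_length: each eval rollout is capped at 1000 steps
eval_batch_size = 5000  # min_timesteps_per_batch for the eval collection
n_rollouts = -(-eval_batch_size // ep_len)  # ceiling division
print(n_rollouts)       # 5 -> the while loop runs about five times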

So the mystery is solved: in Q1.2 I swept ep_len from 100 to 2500, which means the number of trajectories goes from many to few. With many trajectories, some very bad ones may show up; with few, is it just following exactly what the expert-relabeled actions say?

But looking at the code, every path gets relabeled, so the gap shouldn't be that large.

After rethinking it, here is the resolution:

The earlier analysis was right: as ep_len goes from 100 to 2500, the number of trajectories goes from many to few. But the reported reward is the average over ep_len-length episodes: if I walk 100 steps and run 10 episodes, I average the returns of those ten 100-step episodes; when I walk 1000 steps in a single episode, I report the return of that one 1000-step episode. So obviously the more steps you walk, the larger the return (as long as rewards are never negative).
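A toy illustration of that point, assuming a constant reward of 1 per step:

per_step_reward = 1.0
returns_short = [per_step_reward * 100 for _ in range(10)]  # 10 rollouts of 100 steps each
returns_long = [per_step_reward * 1000]                     # 1 rollout of 1000 steps
print(sum(returns_short) / len(returns_short))  # 100.0  -> what gets reported for ep_len=100
print(sum(returns_long) / len(returns_long))    # 1000.0 -> what gets reported for ep_len=1000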

Q1.3 analysis

First: even though run_hw1.py has many tunable parameters, this is BC with itr=0, so it never goes through the data-collection stage of the loop. That means the network-related parameters train_batch_size, learning_rate, size, n_layers, and batch_size are of no use for this question (they matter in Q2). And ep_len was analyzed above: changing it changes the total amount of data, and keeping the total fixed while varying the trajectories isn't very meaningful to analyze.

Concretely, in collect_training_trajectories, the first iteration directly returns the initial expert data, and in the BC case there is only that one iteration:

if itr == 0:
    with open(load_initial_expertdata, 'rb') as f:
        loaded_paths = pickle.loads(f.read())
    return loaded_paths, 0, None

But then hw1.pdf says the tunable parameters include the training data set and how much data you draw from the expert dataset, which confused me... The latter I can understand, but the former has no effect at all, because we never proceed to the "Collecting data to be used for training" step... and for the latter there is no parameter controlling how much expert data to use either, so you can only do it by editing the source code.
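If you do want to vary how much expert data is used, here is a purely illustrative hack (not part of the scaffold): truncate the loaded expert paths before returning them; 'num_expert_paths' is a hypothetical knob.

if itr == 0:
    with open(load_initial_expertdata, 'rb') as f:
        loaded_paths = pickle.loads(f.read())
    # hypothetical parameter: keep only the first k expert trajectories
    k = self.params.get('num_expert_paths', len(loaded_paths))
    return loaded_paths[:k], 0, None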

Training result comparison

All of these were run with ep_len = 200. I once set it to 1000 and the agent ran out of bounds: the physics was still there, the ground just disappeared, and it kept running on the invisible floor, haha.

First, the comparison in the Ant-v2 environment:

At the early iterations, e.g. n=5, the agent on the right flips over and stops; by iteration 95 it walks the whole way without flipping.

The Ant-v2 comparison above isn't very striking because the task is fairly easy, so here is a much clearer one in the Humanoid-v2 environment. (At the time I hadn't marked failure as done, so even after falling over it keeps flailing, haha.)

n=5 ep_len=100

n=95 ep_len=100

That's it for the rough analysis.

Collected problems

numpy.core._exceptions.MemoryError: Unable to allocate 1.40 GiB for an array with shape (1000, 500, 1000, 3) and data type uint8

This error appears when ep_len is too large while saving video.

Reference: /questions/57507832/unable-to-allocate-array-with-shape-and-data-type

# check whether overcommit is currently allowed
cat /proc/sys/vm/overcommit_memory
# switch to root
sudo su
# allow overcommit
echo 1 > /proc/sys/vm/overcommit_memory

Images never showing up

The pdf says that if you want to save images and video, remove --video_log_freq -1.

Actually... you also need to make sure render is actually True inside def sample_trajectory(env, policy, max_path_length, render=False, render_mode=('rgb_array')) (either change the default, or forward the flag from sample_n_trajectories as done above). Otherwise it errors out, and printing shows that p['image_obs'] is empty; following that thread leads you straight to the problem.

pickle.load

if itr == 0:
    with open(load_initial_expertdata, 'rb') as f:
        loaded_paths = pickle.loads(f.read())
    return loaded_paths, 0, None

Here, on the first iteration, the expert data file is opened and read in; I initially wrote pickle.loads(f) by mistake.
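For reference, both of these forms deserialize the expert data correctly (assuming it was written with pickle.dump); passing the file object itself to pickle.loads, as I first did, raises a TypeError because it expects bytes.

import pickle

with open(load_initial_expertdata, 'rb') as f:
    loaded_paths = pickle.load(f)            # read directly from the file object
    # loaded_paths = pickle.loads(f.read())  # or: read the bytes first, then deserialize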

ptu.to_numpy() strips the grad_fn

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I discovered this problem because at first I kept writing loss = self.loss(self.get_action(observations), ptu.from_numpy(actions)); the corrected version is:

class MLPPolicySL(MLPPolicy):
    def __init__(self, ac_dim, ob_dim, n_layers, size, **kwargs):
        super().__init__(ac_dim, ob_dim, n_layers, size, **kwargs)
        self.loss = nn.MSELoss()

    def update(self, observations, actions,
               adv_n=None, acs_labels_na=None, qvals=None):
        # DONE TODO: update the policy and return the loss
        self.optimizer.zero_grad()  # zero the gradient buffers of all parameters
        loss = self.loss(self.forward(ptu.from_numpy(observations)),
                         ptu.from_numpy(actions))
        # the reason why we cannot use get_action: to_numpy removes the grad_fn
        loss.backward()  # backprop
        self.optimizer.step()  # does the update
        return {
            # You can add extra logging information here, but keep this line
            'Training Loss': ptu.to_numpy(loss),
        }

And get_action was:

def get_action(self, obs: np.ndarray) -> np.ndarray:
    if len(obs.shape) > 1:
        observation = obs
    else:
        observation = obs[None]
    # DONE TODO return the action that the policy prescribes
    return ptu.to_numpy(self.forward(ptu.from_numpy(observation)))

Since the output has to be converted to an np.ndarray in the end anyway, I never felt anything was wrong, until the error showed up:

Then I assumed the loss had no data and even printed loss.shape, which only dug the hole deeper, because with a single scalar the output is just torch.Size([]).

So I mistakenly thought there was no data and kept being confused.

In the end Jiancong came over to look at my code; hmm, printing the tensor showed the grad_fn was gone, and then we knew.
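A minimal sketch of why this happens (ptu.from_numpy / ptu.to_numpy are assumed here to be the homework's thin wrappers around torch.from_numpy and .detach().cpu().numpy()):

import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()
print(y.grad_fn)                   # <SumBackward0 ...>: y is still attached to the autograd graph

y_np = y.detach().cpu().numpy()    # roughly what a to_numpy() helper does
z = torch.as_tensor(y_np)          # re-wrapping the numpy value in a tensor
print(z.grad_fn, z.requires_grad)  # None False: the graph is gone
# z.backward() would now raise:
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn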

References

A few GitHub references used while writing the code:

- /cww97/cs285_fall_cww/tree/main/hw1
- /vincentkslim/cs285_homework_fall/tree/master
- /mdeib/berkeley-deep-RL-pytorch-solutions

Installing MuJoCo on Ubuntu

Introduction

MuJoCo: MuJoCo is owned by Roboti LLC and was initially used by the Movement and Control Laboratory at the University of Washington. MuJoCo stands for Multi-Joint dynamics with Contact and is a physics engine that provides simulation environments for research in several areas related to robotics, biomechanics, and graphics. The minimal installation of OpenAI Gym doesn't include MuJoCo, because MuJoCo needs to be properly licensed, which can cost you up to $2000 unless you are a student, and the student license in turn has some very strict clauses regarding publication. Either way, you can still install it and use it for personal projects as much as you want with a student license or a 30-day trial license.

Prerequisites and dependencies

sudo apt-get install libosmesa6-dev
conda install -c anaconda patchelf

License application

Application page: https://www.roboti.us/license.html — you can apply with a university (.edu) email.

Setting the paths

> set MUJOCO_PY_MJKEY_PATH=C:\path\to\.mujoco\mjkey.txt
> set MUJOCO_PY_MUJOCO_PATH=C:\Users\zhangqingwen\Downloads\AProgramm\drl-hwprogramm\mujoco200_win64\bin
> set PATH=C:\Users\zhangqingwen\Downloads\AProgramm\drl-hwprogramm\mujoco200_win64\bin

If you use the trial version, download it; on Ubuntu you then need chmod +x getid_linux followed by ./getid_linux.
