by Jannes Klaas

Deep reinforcement learning: where to start

Last year, DeepMind’s AlphaGo beat Go world champion Lee Sedol 4–1. More than 200 million people watched as reinforcement learning (RL) took to the world stage. A few years earlier, DeepMind had made waves with a bot that could play Atari games. The company was soon acquired by Google.

Many researchers believe that RL is our best shot at creating artificial general intelligence. It is an exciting field, with many unsolved challenges and huge potential.

Although it can appear challenging at first, getting started in RL is actually not so difficult. In this article, we will create a simple bot with Keras that can play a game of Catch.

The game

Catch is a very simple arcade game, which you might have played as a child. Fruits fall from the top of the screen, and the player has to catch them with a basket. For every fruit caught, the player scores a point. For every fruit lost, the player loses a point.

The goal here is to let the computer play Catch by itself. But we will not use the pretty game above. Instead, we will use a simplified version to make the task easier:

While playing Catch, the player decides between three possible actions. They can move the basket to the left, to the right, or stay put.

The basis for this decision is the current state of the game. In other words: the positions of the falling fruit and of the basket.

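To make this concrete, here is a minimal sketch of what such a simplified Catch environment could look like on a small grid. The class name CatchEnv, the 10×10 grid, and the method names are illustrative assumptions, not the exact code from the original notebook.

```python
import numpy as np

class CatchEnv:
    """Minimal Catch on a grid: a fruit falls one row per step and the
    player moves a 3-cell basket along the bottom row.
    Illustrative sketch; names and grid size are assumptions."""

    def __init__(self, grid_size=10):
        self.grid_size = grid_size
        self.reset()

    def reset(self):
        # Fruit starts at a random column in the top row,
        # basket starts at a random valid position in the bottom row.
        fruit_col = np.random.randint(0, self.grid_size)
        basket_col = np.random.randint(1, self.grid_size - 1)
        self.state = np.array([0, fruit_col, basket_col])  # [fruit_row, fruit_col, basket_center]
        return self._observe()

    def step(self, action):
        # action: 0 = move left, 1 = stay, 2 = move right
        move = action - 1
        fruit_row, fruit_col, basket = self.state
        basket = min(max(basket + move, 1), self.grid_size - 2)
        fruit_row += 1  # the fruit falls one row per step
        self.state = np.array([fruit_row, fruit_col, basket])

        done = fruit_row == self.grid_size - 1
        if not done:
            reward = 0
        else:
            # +1 if the basket is under the fruit, -1 otherwise
            reward = 1 if abs(fruit_col - basket) <= 1 else -1
        return self._observe(), reward, done

    def _observe(self):
        # Render the state as a flattened grid_size x grid_size "screen".
        canvas = np.zeros((self.grid_size, self.grid_size))
        fruit_row, fruit_col, basket = self.state
        canvas[fruit_row, fruit_col] = 1
        canvas[-1, basket - 1:basket + 2] = 1
        return canvas.reshape(1, -1)
```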

Our goal is to create a model that, given the content of the game screen, chooses the action that leads to the highest possible score.

This task can be seen as a simple classification problem. We could ask expert human players to play the game many times and record their actions. Then, we could train a model to choose the ‘correct’ action that mirrors the expert players.

But this is not how humans learn. Humans can learn a game like Catch by themselves, without guidance. This is very useful. Imagine if you had to hire a bunch of experts to perform a task thousands of times every time you wanted to learn something as simple as Catch! It would be expensive and slow.

In reinforcement learning, the model trains from experience, rather than labeled data.

Deep reinforcement learning

Reinforcement learning is inspired by behavioral psychology.

Instead of providing the model with ‘correct’ actions, we provide it with rewards and punishments. The model receives information about the current state of the environment (e.g. the computer game screen). It then outputs an action, like a joystick movement. The environment reacts to this action and provides the next state, along with any rewards.

The model then learns to find actions that lead to maximum rewards.

There are many ways this can work in practice. Here, we are going to look at Q-Learning. Q-Learning made a splash when it was used to train a computer to play Atari games. Today, it is still a relevant concept. Most modern RL algorithms are some adaptation of Q-Learning.

Q-learning intuition

A good way to understand Q-learning is to compare playing Catch with playing chess.

In both games you are given a state, S. With chess, this is the positions of the figures on the board. In Catch, this is the location of the fruit and the basket.

The player then has to take an action, A. In chess, this is moving a figure. In Catch, this is to move the basket left or right, or remain in the current position.

As a result, there will be some reward R, and a new state S'.

The problem with both Catch and chess is that the rewards do not appear immediately after the action.

In Catch, you only earn rewards when the fruits hit the basket or fall on the floor, and in chess you only earn a reward when you win or lose the game. This means that rewards are sparsely distributed. Most of the time, R will be zero.

When there is a reward, it is not always a result of the action taken immediately before. Some action taken long before might have caused the victory. Figuring out which action is responsible for the reward is often referred to as the credit assignment problem.

Because rewards are delayed, good chess players do not choose their plays only by the immediate reward. Instead, they choose by the expected future reward.

For example, they do not think only about whether they can take an opponent's piece in the next move. They also consider how taking a certain action now will help them in the long run.

In Q-learning, we choose our action based on the highest expected future reward. We use a “Q-function” to calculate this. This is a math function that takes two arguments: the current state of the game, and a given action.

We can write this as: Q(state, action)

While in state S, we estimate the future reward for each possible action A. We assume that after we have taken action A and moved to the next state S', everything works out perfectly.

The expected future reward Q(S,A) for a given state S and action A is calculated as the immediate reward R, plus the expected future reward thereafter, Q(S',A'). We assume the next action A' is optimal.

Because there is uncertainty about the future, we discount Q(S',A') by the factor gamma, γ.

Q(S,A) = R + γ * max Q(S',A')

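As a quick worked example with made-up numbers: suppose the immediate reward R is 0, γ is 0.9, and the network predicts future rewards of 0.2, 0.8 and 0.5 for the three possible actions in S'. The target then works out as follows:

```python
gamma = 0.9                       # discount factor (illustrative value)
R = 0                             # immediate reward for this step
q_next = [0.2, 0.8, 0.5]          # hypothetical Q(S', A') for left, stay, right

target = R + gamma * max(q_next)  # Q(S, A) target = 0 + 0.9 * 0.8 = 0.72
```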

Good chess players are very good at estimating future rewards in their head. In other words, their Q-function Q(S,A) is very precise.

Most chess practice revolves around developing a better Q-function. Players peruse many old games to learn how specific moves played out in the past, and how likely a given action is to lead to victory.

But how can a machine estimate a good Q-function? This is where neural networks come into play.

Regression after all

When playing a game, we generate lots of “experiences”. These experiences consist of:

the initial state, S

the action taken, A

the reward earned, R

and the state that followed, S'

These experiences are our training data. We can frame the problem of estimating Q(S,A) as a regression problem. To solve this, we can use a neural network.

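One straightforward way to represent a single experience in code is a small record type; the name Experience and its fields are purely illustrative.

```python
from collections import namedtuple

# One transition observed during play: state S, action A, reward R, next state S'.
# In practice a terminal flag is often added to mark the end of a game.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])
```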

Given an input vector consisting of S and A, the neural net is supposed to predict the value of Q(S,A) equal to the target: R + γ * max Q(S',A').

If we are good at predicting Q(S,A) for different states S and actions A, we have a good approximation of the Q-function. Note that we estimate Q(S',A') through the same neural net as Q(S,A).

The training process

Given a batch of experiences < S, A, R, S' >, the training process then looks as follows:

For each possible action A' (left, right, stay), predict the expected future reward Q(S',A') using the neural net

Choose the highest value of the three predictions as max Q(S',A')

Calculate R + γ * max Q(S',A'). This is the target value for the neural net

Train the neural net using a loss function. This is a function that calculates how near or far the predicted value is from the target value. Here, we will use 0.5 * (predicted_Q(S,A) - target)² as the loss function.

During gameplay, all the experiences are stored in a replay memory. This acts like a simple buffer in which we store < S, A, R, S' > tuples. The experience replay class also handles preparing the data for training. Check out the code below:

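Below is a hedged sketch of what such an experience replay class could look like. The names ExperienceReplay, remember, and get_batch, as well as the buffer size and discount default, are assumptions for illustration; get_batch follows the four training steps listed above.

```python
import numpy as np

class ExperienceReplay:
    """Stores < S, A, R, S' > transitions and builds training batches.
    Illustrative sketch; names and defaults are assumptions."""

    def __init__(self, max_memory=500, discount=0.9):
        self.max_memory = max_memory
        self.discount = discount
        self.memory = []

    def remember(self, experience, game_over):
        # experience is [state, action, reward, next_state]
        self.memory.append([experience, game_over])
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    def get_batch(self, model, batch_size=10):
        len_memory = len(self.memory)
        num_actions = model.output_shape[-1]
        state_dim = self.memory[0][0][0].shape[1]

        inputs = np.zeros((min(len_memory, batch_size), state_dim))
        targets = np.zeros((inputs.shape[0], num_actions))

        for i, idx in enumerate(np.random.randint(0, len_memory, size=inputs.shape[0])):
            state, action, reward, next_state = self.memory[idx][0]
            game_over = self.memory[idx][1]

            inputs[i] = state[0]
            # Start from the model's current predictions so that only the
            # Q-value of the action actually taken gets a new target.
            targets[i] = model.predict(state, verbose=0)[0]
            max_q_next = np.max(model.predict(next_state, verbose=0)[0])
            if game_over:
                targets[i, action] = reward
            else:
                targets[i, action] = reward + self.discount * max_q_next
        return inputs, targets
```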

Defining the model

Now it is time to define the model that will learn a Q-function for Catch.

We are using Keras as a front end to TensorFlow. Our baseline model is a simple three-layer dense network.

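Here is a hedged sketch of what such a baseline model could look like. The layer sizes, optimizer, and learning rate are plausible choices rather than the exact hyperparameters from the original notebook, and with TensorFlow 2 the imports would come from tensorflow.keras instead of keras.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

grid_size = 10      # assumed size of the simplified Catch screen
num_actions = 3     # left, stay, right
hidden_size = 100   # illustrative hidden layer width

model = Sequential([
    Dense(hidden_size, input_shape=(grid_size ** 2,), activation="relu"),
    Dense(hidden_size, activation="relu"),
    Dense(num_actions),  # one Q-value per action, no activation (regression)
])
# Mean squared error is proportional to the 0.5 * (prediction - target)² loss above.
model.compile(optimizer=SGD(learning_rate=0.1), loss="mse")
```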

Already, this model performs quite well on this simple version of Catch. Head over to GitHub for the full implementation. You can experiment with more complex models to see if you can get better performance.

Exploration

A final ingredient to Q-Learning is exploration.

Everyday life shows that sometimes you have to do something weird and/or random to find out whether there is something better than your daily trot.

The same goes for Q-Learning. Always choosing the best option means you might miss out on some unexplored paths. To avoid this, the learner will sometimes choose a random option, and not necessarily the best.

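In code this is commonly done with an epsilon-greedy rule: with probability epsilon the agent picks a random action, otherwise it picks the action with the highest predicted Q-value. A minimal sketch (the epsilon value is illustrative):

```python
import numpy as np

epsilon = 0.1  # exploration rate: fraction of moves chosen at random (illustrative)

def choose_action(model, state, num_actions=3):
    if np.random.rand() <= epsilon:
        return np.random.randint(0, num_actions)  # explore: random action
    q_values = model.predict(state, verbose=0)[0]
    return int(np.argmax(q_values))               # exploit: best known action
```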

Now we can define the training method:

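Below is a hedged sketch of how such a training loop could be wired together from the pieces sketched above (CatchEnv, ExperienceReplay, and choose_action). The epoch count and batch size are illustrative, and the real notebook differs in its details.

```python
num_epochs = 5000
batch_size = 50

env = CatchEnv(grid_size=10)
replay = ExperienceReplay(max_memory=500, discount=0.9)

win_count = 0
for epoch in range(num_epochs):
    state = env.reset()
    game_over = False
    loss = 0.0

    while not game_over:
        action = choose_action(model, state)  # epsilon-greedy, see above
        next_state, reward, game_over = env.step(action)
        if reward == 1:
            win_count += 1

        # Store the transition, then train on a freshly sampled mini-batch.
        replay.remember([state, action, reward, next_state], game_over)
        inputs, targets = replay.get_batch(model, batch_size=batch_size)
        loss += model.train_on_batch(inputs, targets)
        state = next_state

    print(f"Epoch {epoch:04d} | loss {loss:.4f} | wins {win_count}")
```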

I let the game train for 5,000 epochs, and it does quite well now!

As you can see in the animation, the computer catches the apples falling from the sky.

To visualize how the model learned, I plotted the moving average of victories over the epochs:

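For reference, such a moving average can be computed with a simple convolution. This helper assumes a hypothetical list wins holding a 1 for every epoch in which the fruit was caught and a 0 otherwise:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_win_rate(wins, window=100):
    """Plot the moving average of a per-epoch win indicator (1 = caught, 0 = missed)."""
    moving_avg = np.convolve(wins, np.ones(window) / window, mode="valid")
    plt.plot(moving_avg)
    plt.xlabel("Epoch")
    plt.ylabel(f"Win rate (moving average over {window} epochs)")
    plt.show()
```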

Where to go from here

You now have gained a first overview and an intuition of RL. I recommend taking a look at the full code for this tutorial. You can experiment with it.

You might also want to check out Arthur Juliani’s series. If you’d like a more formal introduction, have a look at Stanford’s CS 234, Berkeley’s CS 294 or David Silver’s lectures from UCL.

A great way to practice your RL skills is OpenAI’s Gym, which offers a set of training environments with a standardized API.

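A minimal interaction loop against that API looks roughly like the sketch below. Note that the reset and step signatures have changed between Gym and Gymnasium versions, so treat this as the classic interface rather than a definitive one:

```python
import gym

env = gym.make("CartPole-v1")           # any registered environment uses the same loop
observation = env.reset()

for _ in range(200):
    action = env.action_space.sample()  # a random policy, just to show the loop
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()

env.close()
```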

Acknowledgements

This article builds upon Eder Santana's simple RL example. I refactored his code and added explanations in a notebook I wrote earlier. For readability on Medium, I only show the most relevant code here. Head over to the notebook or Eder's original post for more.

About Jannes Klaas

This text is part of the Machine Learning in Financial Context course material, which helps economics and business students understand machine learning.

I spent a decade building software and am now on a journey to bring ML to the financial world. I study at the Rotterdam School of Management and have done research with the Institute for Housing and Urban Development Studies.

You can follow me on Twitter. If you have any questions or suggestions please leave a comment or ping me on Medium.

Translated from: /news/deep-reinforcement-learning-where-to-start-291fb0058c01/
