How to Make Sense of Reinforcement Learning Agents?


By Piotr Januszewski, Research Software Engineer and PhD Student

Merely watching how an agent acts in its environment, it’s hard to tell anything about why it behaves a certain way or how it works internally. That’s why it’s essential to set up metrics that tell you WHY the agent performs the way it does.

This is especially difficult when the agent doesn’t behave the way we want it to… which is, like, all the time. Every AI practitioner knows that whatever we work on, most of the time it won’t simply work out of the box (they wouldn’t pay us so much for it otherwise).

In this blog post, you’ll learn what to keep track of to inspect/debug your agent’s training trajectory. I’ll assume you’re already familiar with the Reinforcement Learning (RL) agent-environment setting (see Figure 1) and that you’ve heard about at least some of the most common RL algorithms and environments.

Nevertheless, don’t worry if you’re just beginning your journey with RL. I’ve tried not to rely too much on readers’ prior knowledge and, where I couldn’t omit some details, I’ve put references to useful materials.


Figure 1: The Reinforcement Learning framework (Sutton & Barto, 2018).


I’ll start by discussing useful metrics that give us a glimpse into the training and decision processes of the agent.

Then we’ll focus on aggregate statistics of these metrics, like the average, that help us analyze them across the many episodes played by the agent throughout training. These will help with root-causing any issues with the agent.

At every step, I’ll base my suggestions on my own experience in RL research. Let’s jump right into it!


Metrics I use to inspect RL agent training

There are several types of metrics to follow, and each of them gives you different information about the model’s performance. So the researcher can get information about…

…how the agent is doing

Here, we’ll take a closer look at three metrics that diagnose the overall performance of the agent.



Episode return

This is what we care about the most. The entire agent training is all about achieving the highest expected return possible (see Figure 2). If this metric goes up throughout training, it’s a good sign.



Figure 2: The RL problem. Find a policy π that maximizes the objective J. The objective J is the expected return E[R] under the environment dynamics P. τ is the trajectory played by the agent (or its policy π).


However, it’s much more useful to us when we know what return to expect, or what a good score is.

That’s why you should always look for baselines, i.e. others’ results in the environment you work on, and compare your results against them.

A random agent baseline is often a good start; it allows you to recalibrate and get a feel for the true “zero” score in the environment: the minimum return you get simply from mashing the controller (see Figure 3).
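As a minimal sketch, a random-agent baseline can be computed by playing a handful of episodes with uniformly sampled actions and averaging the returns. The environment interface below (reset/step/sample_action) is a hypothetical stand-in for illustration, not any particular library’s API:

```python
# A minimal sketch of a random-agent baseline, written against a hypothetical
# environment interface (reset/step/sample_action are assumptions for
# illustration, not any particular library's API).

def random_baseline(env, num_episodes=10):
    """Average return of a uniformly random policy: the environment's "zero" score."""
    returns = []
    for _ in range(num_episodes):
        env.reset()
        done, episode_return = False, 0.0
        while not done:
            # sample an action uniformly at random and accumulate the reward
            reward, done = env.step(env.sample_action())
            episode_return += reward
        returns.append(episode_return)
    return sum(returns) / len(returns)
```

With a real environment you’d swap in, e.g., a Gym-style loop; the idea stays the same: the resulting average is the “zero” score to compare your agent against.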


Figure 3. Table 3 from the SimPLe paper, with their results on Atari environments compared to many baselines, alongside the random agent and human scores.


Episode length

This is a useful metric to analyze together with the episode return. It tells us whether our agent is able to survive for some time before termination. In MuJoCo environments, where various creatures learn to walk (see Figure 4), it tells you, for instance, whether your agent does some moves before flipping over and resetting to the beginning of the episode.



Solve rate

Yet another metric to analyze alongside the episode return. If your environment has a notion of being solved, then it’s useful to check how many episodes the agent can solve. For instance, in Sokoban (see Figure 5) there are partial rewards for pushing a box onto a target. That being said, the room is only solved when all boxes are on targets.


Figure 5. Sokoban is a transportation puzzle where the player has to push all boxes in the room onto the storage targets.


So, it’s possible for the agent to have a positive episode return but still not finish the task it’s required to solve.

One more example would be Google Research Football (see Figure 6) with its academies. There are some partial rewards for moving towards the opponents’ goal, but an academy episode (e.g. practicing a counterattack scenario in smaller groups) is only considered “solved” when the agent’s team scores a goal.



…progress of training

There are several ways of representing the notion of “time” in RL and of measuring progress against it. Here are the top 4 picks.

Total environment steps

This simple metric tells you how much experience, in terms of environment steps or timesteps, the agent has already gathered. This is often more informative of training progress (steps) than wall-time, which heavily depends on how fast your machine can simulate the environment and do the calculations on the neural network (see Figure 6).


Figure 6. DDPG training on the MuJoCo Ant environment. Both runs took 24h, but on different machines. One did ~5M steps and the other ~9.5M. For the latter, that was enough time to converge; for the former it wasn’t, and it scored worse.


Moreover, we report the final agent score together with how many environment steps (often called samples) it took to train it. The higher the score with the fewer samples, the more sample-efficient the agent.

Training steps

We train neural networks with the Stochastic Gradient Descent (SGD) algorithm (see the Deep Learning Book).

The training steps metric tells us how many batch updates we did to the network. When training from an off-policy replay buffer, we can match it with the total environment steps to better understand how many times, on average, each sample from the environment is shown to the network to learn from:

batch size * training steps / total environment steps = batch size / rollout length

where rollout length is the number of new timesteps we gather, on average, during the data-collection phase in between training steps (when data collection and training are run sequentially).

The above ratio, sometimes called training intensity, shouldn’t be below 1, as that would mean some samples aren’t shown to the network even once! In fact, it should be much higher than 1, e.g. 256 (as set in, for example, the RLlib implementation of DDPG; search for “training intensity”).
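As a quick sketch (with made-up numbers), the training-intensity ratio from the formula above can be computed like this:

```python
# A minimal sketch of the training-intensity ratio described above.
# The batch size and rollout length values below are made up for illustration.

def training_intensity(batch_size: int, rollout_length: int) -> float:
    """How many times, on average, each environment sample is used for learning."""
    return batch_size / rollout_length

# e.g. a batch of 256 samples trained after every single environment step
print(training_intensity(batch_size=256, rollout_length=1))   # -> 256.0
# a batch of 32 trained only after gathering 64 new timesteps: below 1, a red flag!
print(training_intensity(batch_size=32, rollout_length=64))   # -> 0.5
```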

Wall time

This simply tells us how long an experiment has been running.

It can be useful when planning how much time we’ll need for each experiment to simply finish:

  • 2-3 hours?
  • a full night??
  • or a couple of days???
  • a whole week?!?!?!

Yes, some experiments might take even a whole week on your PC to fully converge, or to train to the maximum episode return the method you use can achieve.

Thankfully, in the development phase, shorter experiments (a few hours, up to 24h) are more often than not sufficient to simply tell whether the agent is working or not, or to test some improvement ideas.

Note that you always want to plan your work in such a way that some experiments are running in the background while you work on something else, e.g. code, read, write, think, and so on.

This is why dedicated workstations used only for running experiments can be useful.

Steps per second

How many environment steps the agent does each second. The average of this value allows you to calculate how much time you need to run some number of environment steps.
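For instance, assuming you log the average steps per second, a back-of-the-envelope time estimate (the function name and numbers are illustrative) could look like:

```python
# A small helper, assuming you log the average steps-per-second (SPS) of your
# experiments, to estimate how long a run of a given length will take.

def estimated_hours(total_env_steps: int, avg_steps_per_second: float) -> float:
    """Estimated wall time (in hours) to gather total_env_steps at the given SPS."""
    return total_env_steps / avg_steps_per_second / 3600

# e.g. 5M environment steps at ~60 steps/s is roughly a day of training
print(round(estimated_hours(5_000_000, 60), 1))  # -> 23.1
```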


…what the agent is thinking/doing

Finally, let’s have a look inside the agent’s brain. In my research, depending on the project, I use the value function and the policy entropy to get a hint of what’s going on.

State/Action value function

Q-learning and actor-critic methods make use of value functions (VFs).

It’s useful to look at the values they predict to detect anomalies and to see how the agent evaluates its odds in the environment.

In the simplest case, I log the network’s state-value estimate at each of the episode’s timesteps and then average them across the whole episode (more on this in the next section). With more training, this metric should start to match the logged episode return (see Figure 7) or, more precisely, the discounted episode return, since that is what’s used to train the VF. If it doesn’t, that’s a bad sign.
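A minimal sketch of this kind of logging is below; `value_fn` is a hypothetical placeholder for your critic network, and the discounted return is the target the VF is trained on:

```python
import numpy as np

# A sketch of the per-episode value-function logging described above.
# `value_fn` stands in for your critic network (an assumption for illustration).

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards: the target the value function is trained on."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def episode_value_stats(states, rewards, value_fn, gamma=0.99):
    """Average the per-timestep state-value estimates over one episode."""
    values = [value_fn(s) for s in states]
    return {
        "value_mean": float(np.mean(values)),
        "discounted_return": discounted_return(rewards, gamma),
    }
```

With a well-trained VF, the logged `value_mean` curve should track the discounted-return curve over the course of training.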


Figure 7. An experiment on the Google Research Football environment. With time, as the agent trains, the agent’s value function matches the mean episode return.


Moreover, on the chart of VF values we can see whether some additional data processing is required.

For instance, in the Cart Pole environment, an agent gets a reward of 1 for every timestep until it falls over and dies. The episode return quickly gets to the order of tens and hundreds. A VF network initialized in such a way that at the beginning of training it outputs small values around zero has a hard time catching up with this range of values (see Figure 8).

That’s why some additional return normalization before training on it is required. The simplest way is to just divide by the maximum possible return, but sometimes we don’t know the maximum return, or there is none (see e.g. the Q-value normalization in the MuZero paper, Appendix B – Backup).
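A minimal sketch of that simplest normalization, assuming a Cart Pole-like setting where the maximum return is known (500 in CartPole-v1):

```python
import numpy as np

# A sketch of the simple normalization described above: dividing returns by the
# maximum possible return so the value-function targets land in [0, 1].
# The CartPole-v1 cap of 500 (500 steps, reward 1 per step) is assumed here.

MAX_RETURN = 500.0

def normalize_returns(returns):
    """Scale raw episode returns into [0, 1] for use as VF targets."""
    return np.asarray(returns, dtype=np.float64) / MAX_RETURN

print(normalize_returns([50.0, 250.0, 500.0]).tolist())  # -> [0.1, 0.5, 1.0]
```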


Figure 8. An experiment on the Cart Pole environment. The value function target isn’t normalized and the network has a hard time catching up with it.


I’ll discuss an example in the next section where this particular metric, combined with the extreme-value (min/max) aggregations, helped me detect a bug in my code.

Policy entropy

Because some RL methods make use of stochastic policies, we can calculate their entropy: how random they are. Even with deterministic policies, we often use an epsilon-greedy exploratory policy, for which we can still calculate the entropy.

The equation for the policy entropy H is H(p) = -Σ p(a) ln p(a), where a is an action, p(a) is the action probability, and the sum runs over all actions.

The maximum entropy value equals ln(N), where N is the number of actions, and it means that the policy chooses actions uniformly at random. The minimum entropy value equals 0, and it means that only one action is ever possible (it has 100% probability).
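As a short sketch, the entropy of a discrete action distribution can be computed directly from the formula above:

```python
import numpy as np

# A minimal sketch of the policy-entropy calculation: H = -sum_a p(a) * ln p(a).

def policy_entropy(action_probs):
    p = np.asarray(action_probs, dtype=np.float64)
    p = p[p > 0]  # use the convention 0 * ln(0) = 0
    return float(-np.sum(p * np.log(p)))

uniform = policy_entropy([0.25, 0.25, 0.25, 0.25])  # maximum: ln(4) ≈ 1.386
greedy = policy_entropy([1.0, 0.0, 0.0, 0.0])       # minimum: 0.0
```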

If you observe that the entropy of the agent’s policy drops rapidly, it’s a bad sign. It means that your agent stops exploring very quickly. If you use stochastic policies, you should think about entropy regularization methods (e.g. Soft Actor-Critic). If you use deterministic policies with an epsilon-greedy exploratory policy, you probably use too aggressive a schedule for the epsilon decay.


…how the training is going

Last, but not least, we have some more standard Deep Learning metrics.

KL divergence

On-policy methods like Vanilla Policy Gradient (VPG) train on batches of experience sampled from the current policy (they don’t use any replay buffer with experience to train on).

It means that what we do has a high impact on what we learn. If you set the learning rate too high, then the approximate gradient update might take too big a step in some seemingly promising direction, which can push the agent right into a worse region of the state space.

Therefore, the agent will do worse than before the update (see Figure 9)! This is why we need to monitor the KL divergence between the old and the new policy. It can help us, for example, set the learning rate.



Figure 9. VPG training on the Cart Pole environment. On the y-axis, we have the episode length (it equals the episode return in this environment). The orange line is the sliding-window average of the score. In the left diagram, the learning rate is too big and the training is unstable. In the right diagram, the learning rate was properly fine-tuned (I found it by hand).


KL divergence is a measure of the distance between two distributions. In our case, these are action distributions (policies). We don’t want our policy to differ too much before and after an update. There are methods like PPO that put a constraint on the KL divergence and won’t allow too big updates at all!
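For discrete action distributions, the KL divergence for a single state can be sketched as follows (a minimal illustration, not any particular library’s implementation):

```python
import numpy as np

# A minimal sketch of the KL divergence between the old and new action
# distributions for one state: KL(p_old || p_new) = sum_a p_old(a) * ln(p_old(a) / p_new(a)).

def kl_divergence(p_old, p_new):
    p_old = np.asarray(p_old, dtype=np.float64)
    p_new = np.asarray(p_new, dtype=np.float64)
    mask = p_old > 0  # terms with p_old(a) = 0 contribute nothing
    return float(np.sum(p_old[mask] * np.log(p_old[mask] / p_new[mask])))

# identical policies give 0; the more an update changed the policy, the larger the KL
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # -> 0.0
```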

Network weights/gradients/activations histograms

Logging the histograms of activations, gradients, and weights of every layer can help you monitor the artificial neural network’s training dynamics. You should look for signs of:

  • Dying ReLUs:
    If a ReLU neuron gets clamped to zero in the forward pass, then it won’t get a gradient signal in the backward pass. It can even happen that some neurons won’t get excited (return a non-zero output) for any input, because of unlucky initialization or a too big update during training.
    “Sometimes you can forward the entire training set <i.e. the replay buffer in RL> through a trained network and find that a large fraction (e.g. 40%) of your neurons were zero the entire time.” ~ Yes you should understand backprop by Andrej Karpathy
  • Vanishing or Exploding gradients:
    Very large values of gradient updates can indicate exploding gradients. Gradient clipping may help.
    On the other hand, very low values of gradient updates can indicate vanishing gradients. Using ReLU activations and the Glorot uniform initializer (a.k.a. Xavier uniform initializer) should help with it.
  • Vanishing or Exploding activations:
    A healthy standard deviation for the activations is on the order of 0.5 to 2.0. Values significantly outside of this range may indicate vanishing or exploding activations, which in turn may cause problems with gradients. Try Layer/Batch Normalization to keep your activations’ distribution under control.
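The dead-ReLU check from the Karpathy quote above can be sketched as follows: forward a batch of inputs through a layer and measure the fraction of units that were zero on every input. The toy weights and the large negative bias below are made up purely for illustration:

```python
import numpy as np

# A sketch of the dead-ReLU check: forward many inputs through a ReLU layer
# and count the units that never produced a non-zero output.

def dead_relu_fraction(activations: np.ndarray) -> float:
    """activations: (num_samples, num_units) array of post-ReLU outputs."""
    dead = np.all(activations == 0.0, axis=0)  # a unit that never fired on any sample
    return float(dead.mean())

# Toy demonstration with made-up weights: a huge negative bias kills 4 of 16 units.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))
w = rng.normal(size=(8, 16))
b = np.zeros(16)
b[:4] = -1e9  # these units can (almost surely) never produce a positive pre-activation
acts = np.maximum(0.0, x @ w + b)
print(dead_relu_fraction(acts))  # -> 0.25
```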

In general, distributions of layer weights (and activations) that are close to a normal distribution (values around zero without many outliers) are a sign of healthy training.

The above tips should help you keep your network healthy throughout training.

Policy/Value/Quality/… heads losses

Even though we do optimize some loss function to train an agent, you should know that it isn’t a loss function in the typical sense of the word. Specifically, it’s different from the loss functions used in supervised learning.

We optimize the objective from Figure 2. To do so, in Policy Gradient methods you derive the gradient of this objective (called the Policy Gradient). However, because TensorFlow and other DL frameworks are built around auto-grad, you define a surrogate loss function that, after auto-grad is run on it, yields a gradient equal to the Policy Gradient.

Note that the data distribution depends on the policy and changes with training. This means that the loss functions don’t have to decrease monotonically for training to proceed. The loss can sometimes even increase when the agent discovers some new area of the state space (see Figure 10).


Figure 10. SAC training on the MuJoCo Humanoid environment. When the episode return starts to go up (our agent learns successfully), the Q-function loss goes up too! It starts to go down again after a while.


Moreover, the loss doesn’t measure the performance of the agent! The true performance of the agent is the episode return. It’s useful to log losses as a sanity check, but don’t base your judgment of training progress on them.


Aggregated statistics

Of course, for some metrics (like state/action values) it’s infeasible to log them for every environment timestep of every experiment. Typically, you’ll calculate statistics every episode or every couple of episodes.

For other metrics, we deal with randomness (e.g. the episode return when the environment and/or the policy are stochastic). Therefore, we have to use sampling to estimate the expected metric value (a sample = one agent episode in the episode-return case).

In either case, aggregate statistics are the solution!


Average and standard deviation

When you deal with a stochastic environment (e.g. the ghosts in PacMan act randomly) and/or your policy draws actions at random (e.g. the stochastic policy in VPG), you should:

  • play multiple episodes (10-20 should be fine),
  • average the metrics across them,
  • log this average and the standard deviation.

The average will estimate the true expected return better than just one episode, and the standard deviation gives you a hint of how much the metric varies when playing multiple episodes.

If the variance is too high, you should take more samples into the average (play more episodes) or employ one of the smoothing methods, like the Exponential Moving Average.
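The steps above can be sketched as a small evaluation helper; `play_episode` is a hypothetical placeholder for running your agent through one episode and returning its episode return:

```python
import numpy as np

# A sketch of the evaluation loop described above. `play_episode` is a
# hypothetical stand-in for running your agent for one full episode.

def evaluate(play_episode, num_episodes=20):
    """Play several episodes and aggregate the episode returns."""
    returns = [play_episode() for _ in range(num_episodes)]
    return {
        "return_mean": float(np.mean(returns)),
        "return_std": float(np.std(returns)),
        "return_min": float(np.min(returns)),
        "return_max": float(np.max(returns)),
    }

# e.g. with a stochastic stand-in for the agent:
rng = np.random.default_rng(42)
stats = evaluate(lambda: rng.normal(loc=100.0, scale=10.0))
```

Logging the min/max alongside the mean and standard deviation also sets up the bug-hunting trick described in the next section.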


Minimum/Maximum value

It’s really useful to inspect the extreme values when hunting for a bug. I’ll discuss it with an example.

In experiments on Google Research Football, with my agent that used random rollouts from the current timestep to calculate action qualities, I noticed some strange minimum values of these action qualities.

The average statistic made sense, but something with the minimum values was off. They were below the reasonable minimum value (below minus one, see Figure 11).



Figure 11. The mean qualities are all above zero. The minimum qualities are quite often below minus one, which is lower than should be possible.


After some digging, it turned out that I used np.empty to create the array for the action qualities.

np.empty is a fancy way of doing np.zeros that allocates memory but doesn’t initialize the NumPy array just yet.

Because of that, from time to time only some actions got updated scores (which overrode the initial values in the array), while the rest kept whatever came from allocated memory locations that had not been erased!

I changed np.empty to np.zeros and it fixed the problem.
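A small demonstration of the difference (the contents of np.empty are arbitrary leftover memory, so no expected output is shown for it):

```python
import numpy as np

# np.empty only allocates memory, so the array starts out with whatever bytes
# happened to be there; np.zeros guarantees a clean, zero-filled start.

uninitialized = np.empty(5)  # contents are arbitrary leftover memory
initialized = np.zeros(5)    # contents are guaranteed to be 0.0

print(initialized)  # -> [0. 0. 0. 0. 0.]
# `uninitialized` may print anything, including values below any "reasonable minimum"
```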



Median of multiple runs

The same idea that we used for averaging over stochastic episodes can be applied to the whole training!

As we know, the algorithm used for deep learning is called Stochastic Gradient Descent. It’s stochastic because we draw training samples at random and pack them into batches. This means that running one training multiple times will yield different results.

You should always run your training multiple times with different seeds (pseudo-random number generator initializations) and report the median of these runs, to make sure that the score isn’t that high or that low simply by chance.
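As a tiny sketch (with made-up final scores standing in for three runs that differ only by seed):

```python
import numpy as np

# Reporting the median over seeds, as described above. The final scores below
# are made-up stand-ins for the episode returns of three seeded runs.

final_scores_per_seed = [5200.0, 610.0, 4800.0]
print(float(np.median(final_scores_per_seed)))  # -> 4800.0
```

The median (4800.0 here) is robust to the one unlucky run, while the mean would be dragged down by it.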


Figure 12. SAC training on the MuJoCo Ant environment. All runs have the same hyper-parameters, only different seeds. Three runs, three results.


Deep Reinforcement Learning Doesn’t Work Yet, and so your agent might fail to learn anything even when your implementation is correct. It can simply fail by chance, e.g. because of unlucky initialization (see Figure 12).



Now you know what to log and why, in order to get the full picture of the agent training process. Moreover, you know what to look for in these logs and even how to deal with the common problems.

Before we finish, please take a look at Figure 12 once again. We can see that the training curves, though different, follow similar paths, and two out of three even converge to a similar result. Any ideas what that might mean?

Stay tuned for future posts!

Bio: Piotr Januszewski is a Research Software Engineer at the University of Warsaw and a PhD student at Gdansk University of Technology.

Original. Reposted with permission.


