N-step Bootstrapping. This is part 7 of the RL tutorial… | by Sagi Shaier | Nov, 2020
Chapter 7 — n-step Bootstrapping
This chapter focuses on unifying the one step temporal difference (TD) methods and Monte Carlo (MC) methods.
The idea is that neither one-step TD nor MC is always the best fit. Rather, if you think of a spectrum with one-step TD on one end and MC on the other, we can often find something better in between.
n-step TD Prediction
The simplest way to see which methods lie between MC and one-step TD is their backup diagrams:

[Figure: the spectrum of n-step backup diagrams, from the 1-step TD backup, through 2-step and 3-step backups, to the full-episode MC backup.]
Consider estimating the value function Vπ from some sample episodes.
One-step TD methods update each state using the next reward plus the discounted estimated value of the next state, which stands in for all the remaining rewards.
MC methods on the other hand, will perform an update using the entire sequence of rewards that it observed from the starting state until the end of the episode.
So it’s quite easy to see what’s in between: for example, 2 step TD will perform an update based on the next 2 rewards, and the estimated value of the corresponding state (2 steps ahead).
Note that these n-step methods are still TD methods, because they update an earlier estimate based on its difference from a later estimate.
More formally, the return for the MC method:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$$

where T is the last time step of the episode.
The return for the 1-step TD method:

$$G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$$
The return for the 2-step TD method:

$$G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$$
Lastly, the return for the n-step TD method:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})$$
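To make the formula concrete, here is a minimal Python sketch of the n-step return computed from a recorded episode (the function name, argument layout, and default gamma are illustrative assumptions, not code from the chapter):

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """Compute the n-step return G_{t:t+n} from a recorded episode.

    rewards[k] holds R_{k+1}, the reward received after step k, and
    values[s] is the current estimate V for the state reached at step s.
    """
    T = len(rewards)          # terminal time step
    h = min(t + n, T)         # truncate at the end of the episode
    G = 0.0
    for k in range(t, h):     # discounted sum of the next n rewards
        G += gamma ** (k - t) * rewards[k]
    if h < T:                 # bootstrap from V(S_{t+n}) if not terminal
        G += gamma ** n * values[h]
    return G
```

With n larger than the remaining episode length the bootstrap term disappears and this reduces to the MC return; with n = 1 it is exactly the one-step TD target.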
In the previous chapter we discussed Sarsa(0), or one-step Sarsa. As a quick recap, in Sarsa we care about the values of state-action pairs rather than just the values of states.
So very similar to the technique above with the n step TD method:
Note that n step expected Sarsa has a slightly different last step that I’ll talk about later.
So now our returns are just in terms of estimated action values:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n})$$

And our action-value update:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$$
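Putting the return and the update together, one n-step Sarsa update might look like this minimal sketch (the episode layout and names are my own assumptions for illustration):

```python
def n_step_sarsa_update(Q, episode, t, n, alpha=0.1, gamma=0.9):
    """Apply one n-step Sarsa update to Q[(state, action)].

    episode is a list of (state, action, reward) tuples, where reward
    is R_{k+1}, received after taking action A_k in state S_k.
    """
    T = len(episode)
    h = min(t + n, T)
    G = 0.0
    for k in range(t, h):                  # discounted sum of n rewards
        G += gamma ** (k - t) * episode[k][2]
    if h < T:                              # bootstrap from Q(S_{t+n}, A_{t+n})
        s_n, a_n, _ = episode[h]
        G += gamma ** n * Q.get((s_n, a_n), 0.0)
    s, a, _ = episode[t]
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (G - Q.get((s, a), 0.0))
    return Q
```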
Consider an agent that takes a path as seen in the left grid, receiving a reward only at the goal. One-step Sarsa would strengthen only the value of the last action of the sequence, since that is the only state-action pair whose update sees the reward. Ten-step Sarsa would strengthen the values of the last 10 actions of the sequence.
As seen from its backup diagram, n-step Expected Sarsa differs only in its last step: instead of using the value of the action actually taken, it takes an expectation over all actions, weighting each action's value by its probability under the policy. All the steps up to the last one are the same as in n-step Sarsa.
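The expectation used at that final step is just the policy-weighted average of the action values, e.g. (function and variable names assumed):

```python
def expected_state_value(Q, pi, s, actions):
    """Sum over actions of pi(a|s) * Q(s, a): the quantity n-step
    Expected Sarsa bootstraps from at its final step."""
    return sum(pi(a, s) * Q.get((s, a), 0.0) for a in actions)
```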
n-step Off-policy Learning
Recall that in on-policy learning we sample from and update a single policy, whereas in off-policy learning we sample actions from a behavior policy b and aim to improve a separate target policy π.
We will do the same as in the previous chapter, where we used the importance sampling ratio to weight the returns, except that the ratio now covers only the n actions involved in the update:

$$\rho_{t:h} = \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

So our n-step TD update becomes:

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \, \rho_{t:t+n-1} \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right]$$

And the update for n-step Sarsa:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \, \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$$
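The importance sampling ratio itself is just a product over the actions involved; a small sketch (representing π and b as callables is my own convention, not the chapter's):

```python
def importance_ratio(episode, pi, b, t, h):
    """rho_{t:h}: product over k = t..h of pi(A_k|S_k) / b(A_k|S_k).

    episode is a list of (state, action) pairs; pi and b are functions
    (a, s) -> probability of taking action a in state s.
    """
    rho = 1.0
    for k in range(t, min(h + 1, len(episode))):  # clip at episode end
        s, a = episode[k]
        rho *= pi(a, s) / b(a, s)
    return rho
```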
Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
This section presents an algorithm that works with n steps but without importance sampling: the tree-backup algorithm.
The idea is that instead of updating our estimate of the top node using only the discounted rewards received along the path plus the bottom node's value, we also add in the estimated values of the actions that were not taken at each state along the way, each weighted by its probability under the target policy.
More formally, the return for the 1-step tree-backup algorithm is just the Expected Sarsa target:

$$G_{t:t+1} = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) \, Q_t(S_{t+1}, a)$$

For the 2-step tree-backup algorithm:

$$G_{t:t+2} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) \, Q_{t+1}(S_{t+1}, a) + \gamma \, \pi(A_{t+1} \mid S_{t+1}) \, G_{t+1:t+2}$$

And so on. The action values Q are then updated as in n-step Sarsa, but without an importance sampling ratio:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$$
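The recursive structure of the tree-backup return can be sketched directly in Python (a minimal sketch under assumed names; the chapter gives no code):

```python
def tree_backup_return(episode, Q, pi, actions, t, n, gamma=0.9):
    """Recursive n-step tree-backup return G_{t:t+n}.

    episode: list of (state, action, reward) tuples, reward = R_{k+1}.
    At each intermediate state we take the expectation over the actions
    NOT taken, weighted by pi, and recurse down the action actually taken.
    """
    T = len(episode)
    s, a, r = episode[t]
    if t + 1 >= T:                 # last transition: G_{T-1:T} = R_T
        return r
    s1, a1, _ = episode[t + 1]
    # expectation over the actions not taken at S_{t+1}
    leaf = sum(pi(ap, s1) * Q.get((s1, ap), 0.0)
               for ap in actions if ap != a1)
    if n <= 1:                     # horizon reached: bootstrap on all actions
        return r + gamma * (leaf + pi(a1, s1) * Q.get((s1, a1), 0.0))
    # otherwise recurse along the action actually taken
    return r + gamma * (leaf + pi(a1, s1)
                        * tree_backup_return(episode, Q, pi, actions,
                                             t + 1, n - 1, gamma))
```

Note how with n = 1 the whole expectation (including the action taken) is used, recovering the Expected Sarsa target.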
A Unifying Algorithm: n-step Q(σ)
So far we have considered the left three diagrams: n-step Sarsa (all sampling), n-step Tree Backup (no sampling), and n-step Expected Sarsa (sampling at every step except a final expectation).
The idea behind the fourth algorithm, n-step Q(σ), is quite simple: alternate between the other algorithms on a step-by-step basis, where σ ∈ [0, 1] defines how much sampling to do at each time step.
If σ = 1, the step does full sampling, as in Sarsa.
If σ = 0, the step takes a pure expectation without sampling, as in Tree Backup.
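The per-step blend can be sketched as a recursive return, with a weight that interpolates between the sampled importance ratio and the policy probability (a sketch under assumed names; edge cases are simplified):

```python
def q_sigma_return(episode, Q, pi, b, actions, sigma, t, h, gamma=0.9):
    """Recursive return G_{t:h} for n-step Q(sigma).

    episode: list of (state, action, reward) tuples, reward = R_{k+1};
    sigma[k] in [0, 1] picks sampling (1) vs. expectation (0) at step k;
    pi and b map (a, s) -> probability.
    """
    T = len(episode)
    s, a, r = episode[t]
    if t + 1 >= T:                       # last transition: G_{T-1:T} = R_T
        return r
    if t >= h:                           # past the horizon: bootstrap on Q
        return Q.get((s, a), 0.0)
    s1, a1, _ = episode[t + 1]
    rho = pi(a1, s1) / b(a1, s1)         # importance ratio of sampled action
    v_bar = sum(pi(ap, s1) * Q.get((s1, ap), 0.0) for ap in actions)
    # blend: sampling (Sarsa-like) vs. expectation (tree-backup-like)
    w = sigma[t + 1] * rho + (1 - sigma[t + 1]) * pi(a1, s1)
    g_next = q_sigma_return(episode, Q, pi, b, actions, sigma,
                            t + 1, h, gamma)
    return r + gamma * (w * (g_next - Q.get((s1, a1), 0.0)) + v_bar)
```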
Until next time,