Reinforcement Learning Explained Visually (Part 4): Q Learning, step-by-step | by Ketan Doshi | Nov, 2020
The difference, which is the key hallmark of the Q Learning algorithm, is how it updates its estimates. The equation used to make the update in the fourth step is based on the Bellman equation, but if you examine it carefully it uses a slight variation of the formula we had studied earlier.
Let’s zoom in on the flow and examine this in more detail.
Let’s look at an example to understand this.
In step #2 of the algorithm, the agent uses the ε-greedy policy to pick the current action (a1) from the current state (S1). This is the action that it passes to the environment to execute, and gets feedback in the form of a reward (R1) and the next state (S2).
Now, for step #4, the algorithm has to use a Q-value from the next state in order to update its estimated Q-value (Q1) for the current state and selected action.
And here is where the Q-Learning algorithm uses its clever trick. The next state has several actions, so which Q-value does it use? It uses the action (a4) from the next state which has the highest Q-value (Q4). What is critical to note is that it treats this action as a target action to be used only for the update to Q1. It is not necessarily the action that it will actually end up executing from the next state when it reaches the next time step.
Now that it has identified the target Q-value, it uses the update formula to compute a new value for the current Q-value, using the reward and the target Q-value…
…and updates the current Q-value.
In other words, there are two actions involved:
- Current action — the action from the current state that is actually executed in the environment, and whose Q-value is updated.
- Target action — has the highest Q-value from the next state, and used to update the current action’s Q value.
This duality of actions is what makes Q-Learning unique.
- We can explore and discover new paths for actions that we execute.
- However, when we update Q-value estimates to improve them, we always use the best Q-value, even though that action may not get executed.
This might sound confusing, so let’s move forward to the next time-step to see what happens. Now the next state has become the new current state.
The agent again uses the ε-greedy policy to pick an action. If it ends up exploring rather than exploiting, the action that it executes (a2) will be different from the target action (a4) used for the Q-value update in the previous time-step.
This is known as ‘off-policy’ learning because the actions that are executed are different from the target actions that are used for learning.
At the start of the game, the agent doesn’t know which action is better than any other action. So we start by giving all Q-values arbitrary estimates and set all entries in the Q-table to 0.
Let’s see an example of what happens in the first time-step so we can visualize how the Q-table gets populated with actual values.
The algorithm then picks an ε-greedy action, gets feedback from the environment, and uses the formula to update the Q-value, as below. This new Q-value reflects the reward that we observed.
In this way, one cell of the Q-table has gone from zero values to being populated with some real data from the environment.
Our goal is for the Q-values to converge towards their Optimal Values. We are seeing those Q-values getting populated with something, but, are they being updated with random values, or are they progressively becoming more accurate?
If you think about it, it seems utterly incredible that an algorithm such as Q Learning converges to the Optimal Value at all.
You start with arbitrary estimates, and then at each time-step, you update those estimates with other estimates.
So why does this eventually give you better estimates?
The reason is that at every time-step, the estimates become slightly more accurate because they get updated with real observations.
The update formula combines three terms in some weighted proportion:
- The reward for the current action
- Best Estimated Q-value of the next state-action
- Estimated Q-value of the current state-action
Two of the three terms in the update formula are estimates which are not very accurate at first. We’ll address those two terms a little later.
However, the third term ie. the reward received is concrete data. That allows the agent to learn and improve its estimates based on actual experience with the environment.
To visualize this more clearly, let’s take an example where we focus on just one cell in the Q-table (ie. one state-action pair), and follow the progression of the updates to that one cell.
Let’s see what happens over time to the Q-value for state S3 and action a1 (corresponding to the orange cell). The very first time we visit it, this cell has a Q-value of 0. In fact, most of the Q-table is filled with zeros. Using the update formula, we update this cell with a value that is largely based on the reward (R1) that we observed.
Now let’s see what happens when we visit that state-action pair again. This could be within the same episode, or in a future episode.
This time we see that some of the other Q-values in the table have also been filled with values. As the agent follows various paths and starts to visit state-action pairs, those cells which were previously zeros have been populated.
Also, notice that the reward each time (for the same action from the same state) need not be the same.
Let’s visit that cell a third time. By the way, notice that the target action (in purple) need not be the same in each of our three visits.
Let’s layout all our visits to that same cell in a single picture to visualize the progression over time. As we visit that same state-action pair more and more times over many episodes, we collect rewards each time. An individual reward observation might fluctuate, but over time, the rewards will converge towards their expected values. This allows the Q-value to also converge over time.