## Introduction to Reinforcement Learning (RL) — Part 3 — “Finite Markov Decision Processes” | by Sagi Shaier | Nov, 2020

Chapter 3 — Finite Markov Decision Processes
The key concepts of this chapter:
– How RL problems fit into the Markov decision process (MDP) framework
– Understanding what is a Markov property
– What are transition probabilities
– Discounting future rewards
– Solving for optimal policy and value functions with the bellman optimality equations

A quick recap:

• The learner and decision-maker is called the agent.
• The thing it interacts with, comprising everything outside the agent, is called the environment.
• The agent interacts with the environment over a sequence of discrete time steps, t = 0, 1, 2, 3, . . ..
• At each time step t, the environment sends the agent a state, St ∈ S, where S is the set of possible states.
• Based on that, the agent selects an action, At ∈ A(St), where A(St) is the set of actions available in state St.
• One time step later the agent receives a reward, Rt+1 ∈ R, and finds itself in a new state, St+1.
• At each time step, the agent follows a mapping from its current state to probabilities of selecting each possible action in that state.
• This mapping is called the agent’s policy and is denoted πt(A|S), where πt(A|S) is the probability of taking action “A” in state “S” in time step “t”
• The agent’s goal is to maximize the total reward it receives in the long run

Returns
If we denote the sequence of rewards the agent receive after time step t as: R(t+1), R(t+2), . . .
The agent seek to maximize the expected return, E(Gt).
Where the return Gt is defined as:
Gt = R(t+1) +R(t+2) +R(t+3) + · · · +R(T), and T is a final time step.

This makes sense when we actually have a final time step. That is, when the agent’s and the environment’s interaction breaks naturally into subsequences (like playing chess, or TicTacToe), which we call episodes.

Each episode ends in a special state called the terminal state. After the agent reach the terminal state, the game is reset to some starting state.

On the other hand, there are many situations where the interaction between the agent and the environment goes on continually without a limit (e.g a robot with a long life span).

The return (Gt) is problematic for continuing tasks though.
For example, imagine the agent receives a reward of +1 at each time step.
Since the final time step is T = ∞ (i.e. there is no final step), the return (which is what we are trying to maximize), will be infinite.

Hence we need another term that we will call discounting.
The agent’s goal now, is to choose actions so that the sum of the discounted rewards it receives is maximized.
In general, it chooses action A at time t (At) to maximize the expected discounted return

where γ is a parameter, 0 ≤ γ ≤ 1, called the discount rate

γ basically says, that a reward received k time steps in the future is only worth γ^k−1 times what it would be worth if it were received immediately.

That solve our problem, since if γ < 1, the infinite sum has a finite value (as long as the sequence of rewards is bounded).

One key idea that comes out from introducing discounting, is that we can factor out the γ from a sequence of rewards in different time steps:

to have our return (Gt) equal to the reward at the next time step plus the discount factor times the expected return at that next time step.
This is a key idea in the Bellman equation that we’ll see later.

The Markov Property
Optimally, we would like to have a state signal that summarize the past in such a way that all relevant information is retained.
A state signal that is capable of having all of that relevant information from the past, is said to be Markov, or to have the Markov property.
For example, a TicTacToe position —any state (i.e. configuration of all the pieces on the board) — would serve as a Markov state because it summarizes everything important.
In other words, all the information that we need to predict the future is contained in the state representation that we have.
For the TicTacToe game, we shouldn’t care about the past (i.e. the sequence of moves that the opponent chose before). Everything that really matters for the future of the game is retained. We should only care about where we currently are (where are the pieces on the board- its state).

Now, if an environment has the Markov property, we can use the following equation

The equation above is called the dynamic function p.
It allows us to predict the next state (S’) and expected next reward (r) given the current state (S) and action (a).
Which basically tells us how the environment is changing as we move from state to state, and taking different actions.

The Markov property is important in RL because decisions and values are assumed to be a function of only the current state.

Markov Decision Processes
A RL problem that satisfies the Markov property is called a Markov decision process, or MDP.
Moreover, if there are only a finite number of states and actions, then it’s called a finite Markov decision process (finite MDP).

From the dynamic function we can also derive several other functions that might be useful:

Which is the next state (S’) probability given the current state and action

Which is the reward for a certain state and action

Which is the reward for a certain state, action, and next state

Transition diagram
A useful way to see the dynamics of a finite MDP is transition diagram

A transition diagram is simply a graph that tells you, the agent, what are the possible actions at each state. It can sometimes have the probability of taking each action, and what are the rewards for taking each action (as in the image above).
This graph can also be viewed as a table:

First row — We can see how in this environment we can go from the state “high” to the state “high” by taking the action “search” . We can also see that we have a probability of “a” of moving the state “high” after being in state “high” and taking action “a”. Lastly, we can see that the reward that we will get doing such, will be “r_search”

Value Functions
Almost all RL algorithms involve estimating value functions. To recap, value functions estimate how good it is for an agent to be in a given state.
Recall that πt(A|S) is the probability of taking action “A” in state “S” in time step “t” — also called the agent’s policy.
The value of a state S under a policy π, denoted Vπ(s), is the expected return when starting in S and following π afterwards.

We call the function Vπ the state-value function for policy π.
Where Gt is the expected return I mentioned earlier.

Similarly, we can define the value of taking action “A” in state “S” under a policy π as qπ(S, A).

We call this the action-value function for policy π. It is the expected return starting from state S, taking the action A, and following policy π afterwards.

These 2 value functions (Vπ and qπ) can be estimated from experience. Meaning, as the agent continuously interact with the environment, and keeps an average for each state it encountered, of the actual returns that have followed that state. As the interaction approaches infinity, the average will converge to the state’s value vπ(s).

If the agent also maintains separate averages for each action taken in a state, then these averages will also converge to the action values, qπ(s, a).

Bellman equation
So now that we have the idea of discounted returns and transition probabilities, we can solve for the value functions and policy using the idea of the Bellman equations

Where π(a|s) is the probability of taking action “a” in state “s” under policy π.
The above equation is the Bellman equation for Vπ. It expresses a relationship between the value of a state S, and the values of the next state S’, following policy π.
Think about it as “the value of a given a state (where you currently are) depends on what following states are possible”.

This can be seen in a backup diagram:

We’re basically computing the expected value of the state S (top node — our current state) using all the next states (bottom nodes) it can transition into.
We call those diagram backup diagrams because they show how the update or backup operations work. We transfer information back to a state from its possible following states.

Note that similarly to Vπ, there is also a bellman equation and a backup diagram for qπ(S, A), following the same principles.

Example:
Let’s see how the Bellman equation can be used to find the value of every state given a certain policy.

In our policy, we select actions uniformly at random: left, up, right, down, each with equal probability (25%).
The environment tells us that we receive 10 reward when we transition from A to A’, 5 reward when we transition from B to B’, negative reward when we fall off the map, and 0 reward otherwise.

So let’s see how we derived the value 2.3 for the circled state:
From the circled state, we can go left, up, right, or down to the next state.
Each one of these states has a certain value — Gt+1 (4.4, 3.0, 0.7, 1.9).
We have 0.25 chance of going into each one.
And our discount factor γ is 0.9 in this example.

To get the expected value of the circle state we simply sum the reward that we’ll get in each and the probability of going to each of the possible states times the discount factor:

0 + 0.9* [(0.25 * 4.4) + (0.25*1.9) + (0.25*0.7) + (0.25*3.0)] = 2.25 — > 2.3

0 is the reward
0.9 is the discount factor
0.25 is the probability of going to each state (left, up…)
the value that 0.25 is multiplied by is the value of that state (e.g. left=3.0)

Optimal Value Functions
We’ve seen how we can use the Bellman equations for estimating the value of states as a function of their successor states.

The Bellman optimality equations are a way of solving for the optimal value of states.

The optimal value of a state is denoted by V*(s)

Where the equations 3.16, and 3.17 are two forms of the Bellman optimality equation for V∗.

Quite similar to before, but now we’re taking “max(a)”, meaning, the action with the best reward for each state in order to maximize the reward.
Once we have V∗, it’s easy to determine the optimal policy, which in this example looks as follows:

The optimal policy π* tells us what’s the best course of actions. From every state, we simply follow the directions with the highest values.

The problem is however, that with nearly all problems it is computationally intractable to explicitly solve for the optimal value at every state. For example, in a more complex environment like the game Backgammon, we have about 10²⁰ states. In the next chapters we’ll see how we can go about that issue.

Until next time,

Cheers.