## Tackling the UNO Card Game with Reinforcement Learning | by Bernhard Pfann | Jan, 2021

With a fully functioning game engine, we are now able to run as many simulations of the game, as we like. This is helpful to get insights into the statistical features of the game, but it also facilitates the training of an RL-agent. I was mainly interested in the following questions:

• How many turns do games last?
• How big is the advantage of the player making the first turn?
• What are the most likely situations in the course of a game?

## 2.1 Game Length

Within the generated sample of 100,000 simulations, the average game length is 41 turns. The significantly lower mode of 13 turns results from the right skewness (the portion of games with more than 100 turns).

The longest game in the sample lasted 327 turns, while the shortest only took 3 turns, which is only possible through the combination of multiple special cards (Skip, Reverse, +2, +4).

## 2.2 Advantage of the First Mover

A player who starts the game has a certain edge over his opponent. To quantify this advantage, the cumulative long-run winning-rate of two randomly playing agents is observed in the following two scenarios:

1. Player 1 always gets to make the first move (red simulations)
2. Player 1 gets to make the first move every 2nd game (violet simulations)

As expected, the winning rate of two randomly playing agents converges to 50% in a fair set-up with alternating first movers. This is an indication that the game engine works as expected.

More interesting is the fact that the player who always makes the first move consistently wins 52% of all games. This corresponds to a ca. 4% (52% vs. 48%) higher likelihood to win against the opponent.

## 2.3 The Course of Play

Like in sports analytics, a heatmap can be used to display where the course of play had its focus. By keeping track of every state and action a player has experienced, we can identify the most likely situations in the game.

The axes of the heatmap denote a players’ number of hand cards, as well as the action taken at the respective point in time. It stands out, that most of the time players are lingering in the “mid-game” by holding 4–5 cards. It makes sense that normal cards (0–9) from all colors are played the most often since they are the most common cards in a deck.

Searching for the optimal strategy in the UNO card game is a classical use case for Reinforcement Learning. Thereby the game itself can be framed as a finite Markov Decision Process (MDP), which implies the following characteristics:

• States: Each step of the game can be described by a state
• Actions: The decision-maker interacts with the game by taking actions based on the state he is in
• Reward: Taking certain actions can lead to a desirable terminal state (e.g winning the game), which is rewarded

The stochastic element, which is inherent through the randomly drawn cards as well as the opponents’ moves, require numerous simulations, to identify a long-run optimum. The basic techniques, which I applied are Monte Carlo and Q-Learning with a discrete state-action matrix.

## 3.1 Defining State-Space

In a supervised machine learning set-up, a function is fitted to map the available features to the output variable. RL on the other hand evaluates each action sequentially and individually at each step. Therefore states, actions, and rewards need to be defined.

A state can represent any information available to the decision-maker, that is useful to describe the current situation of the game. This could be the type of cards he holds, the number of cards the opponent holds, or information regarding cards that have already been played.

At first sight, UNO might appear to be a very simplistic card game, due to its limited set of rules. However, when figuring out the possible combinations of cards a player can hold, it quickly gets out of hand. To be precise there are about 10¹⁶³ combinations. Since the RL-agent has to learn an optimal action for each state, it makes sense to limit the number of states.

Thus, I formed a 16-digit state identification, that captures which cards the agent holds and which he can play, differentiating only between colors, and special cards (skip, reverse, etc.)

By capping each property of the state-identification with a maximum value of 2, state-space has been successfully limited to 270,000 possibilities.

## 3.2 Possible Actions

The cards, that the agent decides to play, represent the actions he is taking. Applying the same aggregation logic as above results in 9 different actions: Red, Green, Blue, Yellow, Skip, Reverse, +2, +4, Wild Color.

## 3.3 Monte Carlo Agent

Given the discrete state-action matrix, the agent navigates through the fields by simulating multiple games. While the matrix is initialized with all values at zero, Monte Carlo (MC) updates all visited state-action values after every completed game.

The q-value at state s taking action a is updated dependent on the achieved reward in this episode R, and the step size parameter alpha. In order to decide which action to take in a respective state, the epsilon-greedy form of the algorithm chooses:

• With epsilon probability: Random action
• With 1-epsilon probability: Action with maximum q-values

## 3.4 Q-Learning

In its basic form, Q-learning works in a similar way. However, while MC waits for the completion of each episode before updating q-values, Q-learning updates them with a lag of one step, at each step.

The q-value is thereby dependent on the step-size parameter, the reward of the next step r, and the q-value of the next step at state s-hat and action-hat.

Both algorithms consequently take the same 2 parameters which have the following effects:

• Alpha: A higher step size parameter increases the change in q-values at each update while prohibiting values to converge closer to their true optimum
• Epsilon: A higher epsilon grants more exploration of actions, which do not appear profitable at first sight. At the same time, this dilutes the optimal game strategy when it has been picked up by the agent.

Now that we have specified the models, let us see if they are useful. Therefore games are simulated with one random player and the other one leveraging the RL-algorithm. As an evaluation criterion, I use the cumulative long-run winning rate of the RL-player.

After running 100,000 simulations for each of the two RL-models, a consistent outperformance over the random agent can be testified. Given the enormous number of run episodes, statistical significance can be ensured, even though the margin of outperformance seems to be limited.

A win-rate of 51.5% means, that the RL-algorithm is 3% more likely to win a game against a random player.

## 4.1 Q-Table

To see how well the algorithms have learned the game, a closer inspection of the Q-table can be helpful. Since the full table is of size 270,000 x 8, states have been aggregated by the number of hand cards.

The steadily declining violet curve depicts the fact that states with fewer hand cards are closer to win a game. Therefore these states carry higher Q-values, which helps the RL-agent to choose actions that reduce their number of hand cards as fast as possible, in order to achieve a reward.

## 4.2 State-Action Coverage

To judge if the models have been trained on enough simulations, the state-action coverage can be assessed. Therefore the different state-action pairs that have already occurred during 100,000 games can be counted.

In total, both models have seen ca. 60,000 different combinations. At first sight, this is lower than expected, since we now know that 100,000 games have had on average 41 turns (41 x 100,000 = 4.1 Mio. turns). However, we also showcased how concentrated the course of play is towards certain card combinations, that appear multiple times within the sample.

The quicker initial coverage increase of the Monte-Carlo model does not come by surprise, due to the nature of its update function.