## A/B Testing Intuition with Python | by Nam Vu | Dec, 2020


This step mostly relies on the business need, so be sure to talk to your stakeholders beforehand. Also, let’s familiarize ourselves with some of the concepts:

- Statistical power: The ability of the experiment to correctly identify a positive change, given that there is indeed one.
- Significance level: The probability of wrongly identifying a positive change, when there is actually none.
- Minimum detectable effect (MDE): The minimum change that the business would like to detect in this test.

I know statistical power and significance level sound quite abstract at this moment, so let’s develop some intuition there. Before we dive deep, I would like to point out one thing that I have seen many questions about, and that is particularly important: there is no hard math behind the MDE; it is based entirely on the needs of the business — **and it is the product owners who should be telling you, “I want to detect at least a 1% change in conversion in order for it to be worth the effort.”**

Now let’s get to some action.

## 1. Creating some data

Let’s first assume that you have exactly the same number of pageviews in the control and experiment groups. This is unlikely to be true in practice, but it helps with understanding the math.

```python
bcr = 0.1    # Baseline conversion rate
lift = 0.02  # Difference between the 2 groups
size = 4000  # Total size of data, each group will have 2000

df, summary = generate_data_equal_size(size, bcr, lift)
```
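The helper `generate_data_equal_size` comes from the accompanying notebook and is not shown here. A minimal sketch of what it could look like (the column names, seed, and implementation details are my assumptions, not the author’s exact code):

```python
import numpy as np
import pandas as pd

def generate_data_equal_size(size, bcr, lift):
    """Simulate two equal-sized groups with Bernoulli conversions:
    group A converts at rate bcr, group B at rate bcr + lift."""
    n = size // 2
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        'group': ['A'] * n + ['B'] * n,
        'converted': np.concatenate([
            rng.binomial(1, bcr, n),         # control outcomes
            rng.binomial(1, bcr + lift, n),  # experiment outcomes
        ]),
    })
    # Per-group totals, sizes, and observed conversion rates
    summary = df.groupby('group')['converted'].agg(['sum', 'count', 'mean'])
    return df, summary

df, summary = generate_data_equal_size(4000, 0.1, 0.02)
print(summary)
```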

This tells us that the click-through rate (CTR) of group B (the experiment group) is higher than that of group A. However, how do we translate this into statistical confidence and interpret the result?

## 2. Binomial distribution

First, we know that whether a user clicks through follows a Bernoulli distribution (either they click or they don’t). When multiple users do this (assuming independence), the number of click-throughs (successes) follows a Binomial distribution. In the real world, the parameters of these distributions can be derived from the experiment itself.
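This claim is easy to check numerically: if we sum many independent Bernoulli trials and compare the empirical frequency of a given click count against `scipy.stats.binom`, the two should agree. The group size and rate below simply reuse the numbers from our example:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, p = 2000, 0.1   # users per experiment, per-user CTR
k = 200            # click count to check (the expected value, n * p)

# Each row is one experiment: n independent Bernoulli(p) users
counts = rng.binomial(1, p, size=(10000, n)).sum(axis=1)

empirical = np.mean(counts == k)   # observed frequency of exactly k clicks
theoretical = binom(n, p).pmf(k)   # Binomial pmf at k
print(empirical, theoretical)      # the two should be close
```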

This can be represented graphically:

```python
from scipy.stats import binom
import matplotlib.pyplot as plt
import numpy as np

# Group sizes and click-through rates from our example
n_A, n_B = 2000, 2000
p_A, p_B = 0.1, 0.12
xx = np.arange(120, 320)  # range of click counts to plot

fig, ax = plt.subplots(figsize=(8,5))
data_a = binom(n_A, p_A).pmf(xx)
data_b = binom(n_B, p_B).pmf(xx)
ax.bar(xx, data_a, alpha=0.5)
ax.bar(xx, data_b, alpha=0.5)
plt.xlabel('converted')
plt.ylabel('probability')
```

However, this doesn’t really tell us much. Here we can see that group B (orange) has a higher number of clicks, but we want to standardize this. In fact, we want to draw the distribution of the click-through rates instead of the number of clicks.

**Illustrating Central Limit Theorem**

From above, we know that when each of many users independently clicks or doesn’t click with some probability, the total number of clicks follows a Binomial distribution. But there’s another way to think about this.

Let’s call the users X₁, X₂, X₃, …, Xₙ. Assume that each user Xᵢ has a click-through probability of *p*. Then Xᵢ ∼ Bernoulli(*p*), with mean *p* and variance *p*(1 − *p*) (let’s not get deep into why this is the case for the Bernoulli distribution).

Then, by the Central Limit Theorem, the sample mean **p** approximately follows a normal distribution: **p** ∼ 𝒩(*p*, *p*(1 − *p*)/*n*) **(Equation 1)**.

Note that there is a distinction between **p** (bolded), the sample mean, which follows a normal distribution, and *p* (italic), the click-through probability of a single user.

Let’s do a small visualization to understand the above better. Let’s simulate 100,000 users, each with a CTR of 0.1:

```python
import numpy as np
import seaborn as sns

outcomes = []
for i in range(100000):
    p = 0.1
    outcome = np.random.binomial(1, p)  # one Bernoulli(p) trial per user
    outcomes.append(outcome)

sns.countplot(outcomes)
```

Let’s say we have 2000 users in each sample. Then we can calculate the sample mean X̄ = **p** for each sample. If we do this many times, we get the distribution of those sample means:

```python
import random

sample_means = []
# Simulating 1000 samples
for i in range(1000):
    # For each sample, we simulate 2000 users
    sample = random.choices(outcomes, k=2000)
    mean = np.mean(sample)
    sample_means.append(mean)

sns.distplot(sample_means)
```

We can see that the distribution of the sample mean does indeed follow a normal distribution.
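We can also check the spread predicted by Equation (1): the standard deviation of the simulated sample means should be close to √(*p*(1 − *p*)/*n*). A self-contained re-run of the simulation above:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.1, 2000

# 1000 samples of 2000 users each; take the mean of each sample
sample_means = rng.binomial(1, p, size=(1000, n)).mean(axis=1)

theoretical_se = np.sqrt(p * (1 - p) / n)  # Equation (1)
print(sample_means.std(), theoretical_se)  # the two should be close
```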

## 3. Forming the hypothesis

Since the sample mean follows a normal distribution, we can graph the “assumed” distribution of the sample mean of group A and group B using Equation (1). We can assume the observed *p* of each group is an unbiased estimator of the population mean.

```python
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# p_A, p_B, n_A, n_B as above (0.1, 0.12, 2000, 2000)
fig, ax = plt.subplots()
xx = np.linspace(0.07, 0.17, 1000)

data_a_norm = norm.pdf(xx, p_A, np.sqrt(p_A*(1-p_A) / n_A))
sns.lineplot(xx, data_a_norm, color='blue', ax=ax)
data_b_norm = norm.pdf(xx, p_B, np.sqrt(p_B*(1-p_B) / n_B))
sns.lineplot(xx, data_b_norm, color='red', ax=ax)

ax.axvline(p_A, color='cyan', linestyle='--')
ax.axvline(p_B, color='orange', linestyle='--')
plt.xlabel('converted')
plt.ylabel('probability')
```

What we are interested in is the difference between the two groups above. In other words, if I pick a mean 𝜇ₐ from distribution A and a mean 𝜇ᵦ from distribution B, how likely is 𝜇ᵦ to be greater than 𝜇ₐ?

Thus, our hypotheses are as follows, in terms of the difference d = **p**ᵦ − **p**ₐ between the two sample means: H₀: d = 0, and H₁: d ≠ 0.

**Distribution of difference in mean**

We can assume the users in the two groups are independent, so the variances add:

Var(d) = Var(**p**ₐ) + Var(**p**ᵦ) = p_A(1 − p_A)/n_A + p_B(1 − p_B)/n_B

Taking the square root gives the standard error of the difference, SE_d = √(p_A(1 − p_A)/n_A + p_B(1 − p_B)/n_B), which is the Satterthwaite approximation. Note that this assumes the two variances are different (which is what we are assuming here), unlike the pooled standard error equation that Udacity uses.
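As a sketch, this unpooled (Satterthwaite-style) standard error can be computed like so; I am assuming this is roughly what the `calculate_SE` helper used later does (with equal group sizes and a small lift, it is nearly identical to the pooled version):

```python
import numpy as np

def calculate_SE(p_A, p_B, n_A, n_B):
    # Unpooled standard error of the difference in sample means:
    # SE_d = sqrt(p_A(1-p_A)/n_A + p_B(1-p_B)/n_B)
    return np.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)

print(calculate_SE(0.1, 0.12, 2000, 2000))  # ≈ 0.0099
```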

Since both distributions are normal, we know that **d** also follows a normal distribution, so we can graph its distribution under each of the two hypotheses.

## 4. Calculating the sample size

Now that we have understood how the hypotheses are formed, let’s continue and derive the sample size. Evan Miller wrote an awesome blog post on why you MUST decide on a sample size before running the experiment, and I will try to build some intuition around it.

Recall that our baseline conversion rate is 0.1 and our lift is 0.02 for our population. However, before running the experiment you would not have the lift number, and there is no way to figure it out in advance. As such, we will use the minimum detectable effect (MDE), the smallest lift that we would accept as “significant”, to calculate the sample size. The tool I personally use, which is also recommended by Udacity’s course, is Evan Miller’s sample size calculator.

But how does it work?

**Power of a test and statistical significance**

Let’s see how power and the significance level work together. Recall the definitions:

- Statistical power: The ability of the experiment to correctly identify a positive change, given that there is indeed one.
- Significance level: The probability of wrongly identifying a positive change, when there is actually none.

So essentially, assuming there is in fact no difference between A and B (H₀ is true), 5% of the time a difference of at least 𝑍𝛼/2 will still be **observed**. We can calculate 𝑍𝛼/2 mathematically with the function `norm.ppf(1-alpha/2, d0, SE)`, where d0 = 0 and SE is the pooled standard error.

The reason we use 𝑍𝛼/2 instead of 𝑍𝛼 is that we are running a two-tailed test, as shown in the graph below, where the total shaded area is 5%. This area is our probability of wrongly rejecting H₀. We take 1 − alpha/2 because we are interested in the right tail, and the graph below shows the calculated minimum difference needed.

`plot_null(p_A=0.1, p_B=0.12, n_A=2000, n_B=2000)`

So under our test, whenever we observe a difference of at least 0.01939, we will reject the null hypothesis and say that **our test group does indeed perform better than the control group**.
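We can reproduce this threshold directly. Here I use the pooled standard error under H₀ (pooling the two observed rates into one), which is my reading of how the 0.01939 figure was obtained:

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
p_A, p_B, n_A, n_B = 0.1, 0.12, 2000, 2000

# Pooled proportion and pooled SE under H0 (no real difference)
p_pool = (p_A * n_A + p_B * n_B) / (n_A + n_B)
SE = np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))

# Critical difference: reject H0 if the observed difference exceeds this
z_crit = norm.ppf(1 - alpha / 2, loc=0, scale=SE)
print(z_crit)  # ≈ 0.01939
```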

In addition, in an A/B test we would need to reach a particular power, normally 80%. This means we need to reject H₀ 80% of the time given that the alternative hypothesis H₁ is true. Recall that we reject the null hypothesis whenever the difference is greater than 0.01939. As such, we need to find the alternative distribution such that the probability of getting a mean greater than 0.01939 is at least 80% under that distribution.

Let’s take a look at a few of such alternative distributions:

```python
# Since we draw the graph for 3 alternative hypotheses, let's just assume
# the SE is calculated with bcr=0.1 and lift=0.02 for all of them
SE_0 = calculate_SE(0.1, 0.12, 2000, 2000)
SE_1 = calculate_SE(0.1, 0.12, 2000, 2000)
SE_2 = calculate_SE(0.1, 0.12, 2000, 2000)

plot_multiple_alt(0, 0.01, 0.02, SE_0, SE_1, SE_2)
```

We can see here that the power can be calculated as the area under the alternative hypothesis’s curve to the right of the calculated 𝑍𝛼/2 = 0.01939. This area denotes the probability of getting an observed mean greater than 𝑍𝛼/2 if the alternative hypothesis is true.

For example, under H₂, which is when the MDE is 0.02 and the sample size is 2000, the power is only 52%. It is greater than the power when the MDE is 0.01, which makes sense: if the actual effect is larger, our test has an easier time detecting the difference. However, it is still not quite what we wanted.
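Putting the pieces together, the power for a given MDE and per-group sample size can be sketched as follows (I use the unpooled SE everywhere for simplicity; with these numbers, the pooled and unpooled versions agree to two decimal places):

```python
import numpy as np
from scipy.stats import norm

def power(mde, n, bcr=0.1, alpha=0.05):
    """Approximate power of the two-tailed test for a true effect of
    size mde with n users per group (ignoring the tiny left tail)."""
    p_B = bcr + mde
    se = np.sqrt(bcr * (1 - bcr) / n + p_B * (1 - p_B) / n)
    # Critical difference under H0
    z_crit = norm.ppf(1 - alpha / 2, loc=0, scale=se)
    # Probability of observing a difference above z_crit when H1 is true
    return 1 - norm.cdf(z_crit, loc=mde, scale=se)

print(round(power(0.02, 2000), 2))  # ≈ 0.52, as in the text
print(round(power(0.02, 4000), 2))  # ≈ 0.82
```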

In order to increase the power in this case, what we need to do is somehow make the bell curves above thinner (for lack of a better word), which means reducing the standard error. If we recall the formula for the SE (**hint**: there are two of them), both shrink as the sample size grows (the SE is inversely proportional to the square root of the sample size). This means that to reduce the SE, we need to increase the sample size, so let’s do exactly that.

```python
# Same as above, but with double the sample size in each group
SE_0 = calculate_SE(0.1, 0.12, 4000, 4000)
SE_1 = calculate_SE(0.1, 0.12, 4000, 4000)
SE_2 = calculate_SE(0.1, 0.12, 4000, 4000)

plot_multiple_alt(0, 0.01, 0.02, SE_0, SE_1, SE_2)
```

Above we can see that the curves do indeed appear thinner. We also observe that the 𝑍𝛼/2 threshold under H₀ has **shifted left** due to the thinner bell curves. If we do the same calculation for the area under the H₂ curve, we will see that the power is 82%, which is good.

A simple way to put all these together is:

- Calculate 𝑍𝛼/2 based on H₀; this is the threshold such that if the observed difference crosses it, we reject H₀.
- If H₁ is true, we should observe a difference of at least 𝑍𝛼/2 at least 80% of the time.

**Calculating the sample size**

So now we know that the sample size is based on the MDE, the power, and the significance level. The easiest way is to scroll through sample sizes like in the gif below (just kidding, but it is cool to visualize how the power changes as a function of the sample size). I have built a simple slider in a notebook, which can be found on my GitHub.

The official way to do it is with a formula, which can be found online as linked above. There are a few ways to calculate the required sample size (due to the different assumptions used), and the derivation of these formulas is beyond the scope of this article.
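As one concrete sketch (under the unpooled-variance assumption; the pooled variant used by some calculators gives a slightly different number), the per-group sample size can be computed as:

```python
import numpy as np
from scipy.stats import norm

def sample_size(bcr, mde, alpha=0.05, power=0.8):
    """Per-group sample size for detecting an absolute lift of mde
    over a baseline rate bcr, using unpooled variances."""
    p1, p2 = bcr, bcr + mde
    z_alpha = norm.ppf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # ≈ 0.84 for power = 0.8
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return int(np.ceil((z_alpha + z_beta) ** 2 * var_sum / mde ** 2))

print(sample_size(0.1, 0.02))  # a bit under 4000 per group
```

This is consistent with the earlier observation that 4000 users per group gives a power of 82%, slightly above the 80% target.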

