Data Distribution vs. Sampling Distribution: What You Need to Know | by Esmaeil Alizadeh | Jan, 2021
Much of statistics deals with inference from samples drawn from a larger population. Hence, we need to distinguish between analysis done on the original data and analysis done on its samples. First, let’s go over the definition of the data distribution:
Data distribution: The frequency distribution of individual data points in the original dataset.
Let’s first generate random skewed data that will result in a non-normal (non-Gaussian) data distribution. The reason for generating non-normal data is to better illustrate the relationship between the data distribution and the sampling distribution.
So, let’s import the Python plotting packages and generate right-skewed data.
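The original code cell is not reproduced here, so what follows is only a minimal sketch of how right-skewed data could be generated. It assumes NumPy and uses a lognormal draw rescaled to [0, 1]. The distribution, its parameters, the seed, and the `skewness` helper are all illustrative assumptions, not the data behind the numbers quoted in this post.

```python
import numpy as np

# Assumption: a lognormal draw stands in for the post's skewed dataset.
rng = np.random.default_rng(seed=42)
data = rng.lognormal(mean=0.0, sigma=0.75, size=10_000)

# Rescale to [0, 1] so the values are easy to read (illustrative choice).
data = (data - data.min()) / (data.max() - data.min())


def skewness(x):
    """Third standardized moment; a positive value means right-skewed."""
    centered = x - x.mean()
    return np.mean(centered**3) / np.std(x) ** 3


print(f"Mean:     {data.mean():.5f}")
print(f"Median:   {np.median(data):.5f}")
print(f"Skewness: {skewness(data):.3f}")
```

A histogram of `data` (for example with Matplotlib's `plt.hist(data, bins=50)`) should show a long right tail, with the mean sitting above the median.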
To build a sampling distribution, you draw samples from the dataset and compute a statistic, such as the mean, on each one. It’s very important to differentiate between the data distribution and the sampling distribution, because most confusion comes from not being clear whether an operation is done on the original dataset or on its (re)samples.
Sampling distribution: The frequency distribution of a sample statistic (aka metric) over many samples drawn from the dataset. Or to put it simply, the distribution of sample statistics is called the sampling distribution.
1. Draw a sample from the dataset.
2. Compute a statistic/metric of the sample drawn in Step 1 and save it.
3. Repeat Steps 1 and 2 many times.
4. Plot the distribution (histogram) of the computed statistic.
>>> Mean: 0.23269
The sampling distribution above is the histogram of the mean of each drawn sample (here, we draw samples of 50 elements over 2000 iterations). The mean of this sampling distribution is around 0.23, as can be seen by computing the mean of all the sample means.
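The four steps above can be sketched roughly as follows, using 50-element samples over 2000 iterations as in the text. The dataset is a stand-in lognormal draw and `sampling_distribution` is a hypothetical helper, so the printed mean will not match the 0.23 reported above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumption: stand-in right-skewed dataset, not the post's data.
data = rng.lognormal(mean=0.0, sigma=0.75, size=10_000)


def sampling_distribution(data, statistic, sample_size, n_iterations, rng):
    """Return the statistic computed on each of n_iterations random samples."""
    stats = np.empty(n_iterations)
    for i in range(n_iterations):
        # Step 1: draw a sample from the dataset (without replacement).
        sample = rng.choice(data, size=sample_size, replace=False)
        # Step 2: compute the statistic and save it.
        stats[i] = statistic(sample)
    return stats  # Step 3 is the loop; Step 4 would plot a histogram of stats.


sample_means = sampling_distribution(
    data, np.mean, sample_size=50, n_iterations=2000, rng=rng
)
print(f"Mean of sample means: {sample_means.mean():.5f}")
print(f"Mean of the dataset:  {data.mean():.5f}")
```

The two printed means should be close, while the spread of `sample_means` is much narrower than the spread of `data` itself.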
⚠️ Do not confuse the sampling distribution with the sample distribution. The sampling distribution is the distribution of a sample statistic (e.g. the mean) across many samples, whereas the sample distribution is the distribution of the data points within a single sample taken from the population.
💡 Central Limit Theorem: As the sample size gets larger, the sampling distribution of the mean tends toward a normal distribution (bell-curve shape).
The CLT is a statement about the sampling distribution, not the data distribution, which is an important distinction. The CLT is widely used in hypothesis testing and confidence interval analysis, so it’s important to be aware of it, even though the use of the bootstrap in data science means the theorem comes up less often in practice. More on bootstrapping is provided later in the post.
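One rough way to see this effect is to measure how skewed the distribution of sample means is for different sample sizes. The sketch below assumes an exponential population and a hand-written `skewness` helper, both chosen purely for illustration. The skewness of the sample means should shrink toward 0 (the value for a normal distribution) as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Assumption: an exponential population as a clearly right-skewed example.
population = rng.exponential(scale=1.0, size=100_000)


def skewness(x):
    """Third standardized moment; about 0 for a symmetric distribution."""
    centered = x - x.mean()
    return np.mean(centered**3) / np.std(x) ** 3


def skew_of_sample_means(n, n_iterations=5000):
    """Skewness of the means of n_iterations samples of size n."""
    # Random indices (drawn with replacement) select each sample's elements.
    idx = rng.integers(0, population.size, size=(n_iterations, n))
    means = population[idx].mean(axis=1)
    return skewness(means)


skew_by_n = {n: skew_of_sample_means(n) for n in (2, 10, 50, 200)}
for n, skew in skew_by_n.items():
    print(f"sample size {n:>3}: skewness of sample means = {skew:.3f}")
```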
The standard error (SE) is a metric that describes the variability of a statistic in the sampling distribution. For the sample mean, we can compute the standard error as follows:
SE = s / √n
where s denotes the standard deviation of the sample values and n denotes the sample size. It can be seen from the formula that as the sample size increases, the SE decreases.
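As a small worked example, the standard formula for the standard error of the mean, SE = s / √n, can be computed directly. The function name `standard_error_of_mean` and the lognormal test samples are assumptions made for illustration.

```python
import numpy as np


def standard_error_of_mean(sample):
    """SE = s / sqrt(n), with s the sample standard deviation (ddof=1)."""
    sample = np.asarray(sample, dtype=float)
    s = sample.std(ddof=1)
    return s / np.sqrt(sample.size)


# Assumption: lognormal samples of increasing size, just to show the trend.
rng = np.random.default_rng(seed=2)
for n in (25, 100, 400):
    sample = rng.lognormal(mean=0.0, sigma=0.75, size=n)
    print(f"n = {n:>3}: SE = {standard_error_of_mean(sample):.4f}")
```

Because n sits under a square root, quadrupling the sample size only roughly halves the standard error.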
We can estimate the standard error using the following approach:
1. Draw a new sample from the dataset.
2. Compute a statistic/metric (e.g., mean) of the sample drawn in Step 1 and save it.
3. Repeat Steps 1 and 2 several times.
4. Estimate the standard error by computing the standard deviation of the statistics saved in the previous steps.
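The steps above can be sketched as follows. This is a minimal illustration on an assumed lognormal dataset, and the sample size and number of repetitions are arbitrary choices. For comparison, it also prints s / √n using the standard deviation of the whole dataset as s.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Assumption: stand-in right-skewed dataset, not the post's data.
data = rng.lognormal(mean=0.0, sigma=0.75, size=10_000)

sample_size = 50
n_repeats = 2000

# Steps 1-3: repeatedly draw a sample and save its mean.
sample_means = np.array([
    rng.choice(data, size=sample_size, replace=False).mean()
    for _ in range(n_repeats)
])

# Step 4: the standard deviation of the saved means estimates the SE.
se_empirical = sample_means.std(ddof=1)

# Reference value from the formula, using the dataset's standard deviation.
se_formula = data.std(ddof=1) / np.sqrt(sample_size)

print(f"Empirical SE: {se_empirical:.5f}")
print(f"s / sqrt(n):  {se_formula:.5f}")
```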
While the above approach can be used to estimate the standard error, we can use bootstrapping instead, which is preferable. I will go over that in the next section.
⚠️ Do not confuse the standard error with the standard deviation. The standard deviation captures the variability of the individual data points (how spread out the data is), whereas the standard error captures the variability of a sample statistic.
Bootstrapping is an easy way of estimating the sampling distribution by randomly drawing samples from the dataset (with replacement) and computing each resample’s statistic. Bootstrapping does not rely on the CLT or on assumptions about the shape of the distribution, and it is a standard way of estimating the SE.
Luckily, we can use the bootstrap() function from the MLxtend library (you can read my post on the MLxtend library, which covers other interesting functionality). This function also provides the flexibility to pass a custom sample statistic.
>>> Mean: 0.23293
>>> Standard Error: +/- 0.00144
>>> CI95: [0.23023, 0.23601]
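To make the mechanics concrete, here is a NumPy-only sketch of a percentile bootstrap. It is not MLxtend's implementation. The helper name `bootstrap_statistic`, its parameters, and the lognormal data are assumptions for illustration, so its output will differ from the numbers above.

```python
import numpy as np


def bootstrap_statistic(x, statistic=np.mean, n_rounds=1000, ci=0.95, seed=None):
    """Return (estimate, standard error, (ci_lower, ci_upper)) via bootstrap."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    # Resample with replacement, each resample as large as the original data.
    replicates = np.array([
        statistic(rng.choice(x, size=x.size, replace=True))
        for _ in range(n_rounds)
    ])
    estimate = statistic(x)
    # The spread of the bootstrap replicates estimates the standard error.
    standard_error = replicates.std(ddof=1)
    # Percentile confidence interval taken from the replicates.
    alpha = 1.0 - ci
    lower, upper = np.percentile(
        replicates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return estimate, standard_error, (lower, upper)


# Assumption: stand-in right-skewed dataset, not the post's data.
data = np.random.default_rng(seed=4).lognormal(mean=0.0, sigma=0.75, size=2_000)
estimate, se, (low, high) = bootstrap_statistic(data, np.mean, seed=4)

print(f"Mean: {estimate:.5f}")
print(f"Standard Error: +/- {se:.5f}")
print(f"CI95: [{low:.5f}, {high:.5f}]")
```

Passing a different callable as `statistic`, such as `np.median`, gives the bootstrap standard error and interval for that statistic instead.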
The main takeaway is to be clear about whether a computation is done on the original dataset or on samples of it. Plotting a histogram of the data gives the data distribution, whereas plotting a sample statistic computed over many samples of the data gives a sampling distribution. On a similar note, the standard deviation tells us how spread out the data is, whereas the standard error tells us how spread out a sample statistic is.
Another takeaway is that even if the original data distribution is non-normal, the sampling distribution of the mean is approximately normal when the sample size is large enough (central limit theorem).
You can find the Jupyter notebook for this blog post on GitHub.
Thanks for reading!
Originally published at https://www.ealizadeh.com.