A Guide to Metrics in Exploratory Data Analysis | by Esmaeil Alizadeh | Dec, 2020
Estimates of location are measures of the central tendency of the data (where most of the data is located). In statistics, this is usually referred to as the first moment of a distribution.
The arithmetic mean, or simply the mean or average, is probably the most popular estimate of location. There are different variants of the mean, such as the weighted mean and the trimmed (truncated) mean. You can see how they can be computed below.
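For reference, the standard textbook definitions of the three variants can be written as follows. The notation is an assumption here: x_(i) denotes the i-th smallest value, w_i is the weight of observation i, and p is the number of values trimmed from each end of the sorted data.

```latex
\begin{align}
\text{Mean:} \quad \bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i \tag{1.1} \\
\text{Weighted mean:} \quad \bar{x}_w &= \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{1.2} \\
\text{Trimmed mean:} \quad \bar{x}_{\text{trim}} &= \frac{1}{n-2p}\sum_{i=p+1}^{n-p} x_{(i)} \tag{1.3}
\end{align}
```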
where n denotes the total number of observations (rows).
The weighted mean (equation 1.2) is a variant of the mean that can be used when the sample does not represent the different groups in a dataset equally. By assigning a larger weight to groups that are under-represented, the computed weighted mean more accurately represents all groups in the dataset.
Extreme values can easily influence both the mean and weighted mean since neither one is a robust metric!
Another variant of the mean is the trimmed mean (eq. 1.3), which is a robust estimate.
Robust estimate: A metric that is not sensitive to extreme values (outliers).
The trimmed mean is used in calculating the final score in many sports where each member of a panel of judges gives a score. The lowest and the highest scores are then dropped, and the mean of the remaining scores is computed as a part of the final score. One such example is the international diving score system.
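As a rough illustration of the judging scheme described above, here is a minimal sketch. The function name `judged_score` and the panel scores are hypothetical, and the exact rules of any particular sport (how many scores are dropped, difficulty multipliers, and so on) are not modeled.

```python
def judged_score(scores):
    """Mean of the scores after dropping one lowest and one highest value."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    kept = ordered[1:-1]  # drop the single lowest and single highest score
    return sum(kept) / len(kept)


# Hypothetical panel of five judges; one judge gives an unusually low score
panel = [7.5, 8.0, 8.0, 8.5, 4.0]
print("Trimmed mean:", round(judged_score(panel), 2))
print("Plain mean:", round(sum(panel) / len(panel), 2))
```

With these made-up scores, the trimmed mean is about 7.83 while the plain mean is 7.2, which shows how a single extreme score pulls the plain mean down.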
In statistics, x̄ refers to a sample mean, whereas μ refers to the population mean.
A Use Case for the Weighted Mean
If you want to buy a smartphone, a smartwatch, or any other gadget with many options on the market, you can use the weighted mean to choose among them.
Let’s assume you want to buy a smartphone, and the following features are important to you: 1) battery life, 2) camera quality, 3) price and 4) the phone design. Then, you give the following weights to each one:
Let’s say you have two options: an iPhone and Google’s Pixel. You can give each feature a score between 1 and 10 (1 being the worst and 10 being the best). After going over some reviews, you may give the following scores to the features of each phone.
So, which phone is better for you?
Google Pixel score=0.15×5+0.3×9.5+0.25×8+0.3×5=7.1
And based on your feature preferences, the Google Pixel might be the better option for you!
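The Pixel calculation above can be reproduced with NumPy's `average()` function. This is only a sketch: the weights and scores are copied from the calculation above, and the assignment of each number to a feature (battery life, camera quality, price, design, in that order) is an assumption.

```python
import numpy as np

# Weights and Pixel scores taken from the calculation above.
# Assumed feature order: battery life, camera quality, price, design.
weights = np.array([0.15, 0.30, 0.25, 0.30])
pixel_scores = np.array([5.0, 9.5, 8.0, 5.0])

# np.average divides the weighted sum by the sum of the weights
pixel_total = np.average(pixel_scores, weights=weights)
print("Google Pixel score:", round(float(pixel_total), 2))
```

Because `np.average` normalizes by the sum of the weights, the weights do not have to add up to 1 for the result to be a valid weighted mean.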
The median is the middle value of a sorted list, and it’s a robust estimate. For an ordered sequence x_1, x_2, …, x_n, the median is computed as follows:
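The usual textbook form of this rule, written for the ordered sequence above (a standard formulation, not necessarily the original notation), is:

```latex
\text{median}(x) =
\begin{cases}
x_{(n+1)/2}, & \text{if } n \text{ is odd} \\[4pt]
\dfrac{x_{n/2} + x_{n/2+1}}{2}, & \text{if } n \text{ is even}
\end{cases}
```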
Analogous to the weighted mean, we can also have the weighted median that can be computed as follows for an ordered sequence x_1, x_2, …, x_n with weights w_1, w_2, …, w_n where w_i > 0.
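One common way to define the weighted median is as the first value in the sorted data at which the cumulative weight reaches half of the total weight. The sketch below implements that definition with NumPy. The helper is hypothetical and written only for illustration; it is not the robustats implementation, and other conventions exist for handling ties.

```python
import numpy as np


def weighted_median(values, weights):
    """Lower weighted median: the first sorted value at which the
    cumulative weight reaches half of the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if values.shape != weights.shape or values.size == 0:
        raise ValueError("values and weights must be non-empty and equally long")
    if np.any(weights <= 0):
        raise ValueError("weights must be strictly positive")
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    # First index whose cumulative weight is >= half of the total weight
    k = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return values[k]


print(weighted_median([1, 2, 3, 4], [1, 1, 1, 10]))
```

Note that with equal weights and an even number of observations, this version returns the lower of the two middle values rather than their average.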
The mode is the value that appears most often in the data and is typically used for categorical data and less for numeric data.
Let’s first import all necessary Python libraries and generate our dataset.
You can use NumPy’s average() function to calculate the mean and weighted mean (equations 1.1 & 1.2). For computing the truncated mean, you can use trim_mean() from the SciPy stats module. A common choice for truncating the top and bottom of the data is 10%.
You can use NumPy’s median() function to calculate the median. For computing the weighted median, you can use weighted_median() from the robustats Python library (you can install it using pip install robustats). Robustats is a high-performance Python library, implemented in C, for computing robust statistical estimators.
For computing the mode, you can use the mode() function either from the robustats library, which is particularly useful on large datasets, or from the SciPy stats module.
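Putting these functions together, a minimal sketch might look like the following. The dataset is an assumption: eight values ending with the outlier 20, chosen to be consistent with the unweighted figures printed below. The weighted estimates are left out because they depend on a set of weights that is not shown here.

```python
import numpy as np
from scipy import stats

# Assumed dataset: eight values ending with the outlier 20, chosen to be
# consistent with the unweighted figures printed below.
data = np.array([1, 2, 2, 2, 2, 3, 3, 20])

mean = np.average(data)
trimmed_10 = stats.trim_mean(data, 0.1)     # int(0.1 * 8) = 0 values cut per side
trimmed_125 = stats.trim_mean(data, 0.125)  # int(0.125 * 8) = 1 value cut per side
median = np.median(data)
mode_result = stats.mode(data)
# np.atleast_1d handles both array results (older SciPy) and scalar results
mode_value = np.atleast_1d(mode_result.mode)[0]

print("Mean:", round(float(mean), 3))
print("Truncated Mean (10%):", round(float(trimmed_10), 3))
print("Truncated Mean (12.5%):", round(float(trimmed_125), 3))
print("Median:", float(median))
print("Mode:", int(mode_value))
```

On a sample this small, trimming 10% removes nothing, because `trim_mean()` rounds the number of cut values down to zero. That is why the truncated mean equals the plain mean here. A proportion of 12.5% drops one value from each end and brings the estimate much closer to the median.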
>>> Mean: 4.375
>>> Weighted Mean: 3.5
>>> Truncated Mean: 4.375
>>> Median: 2.0
>>> Weighted Median: 2.0
>>> Mode: ModeResult(mode=array([2]), count=array([4]))
Now, let’s see if we just remove 20 from our data, how that will impact our mean.
mean = np.average(data[:-1])  # Remove the last data point (20)
print("Mean: ", mean.round(3))
>>> Mean: 2.143
You can see how the last data point (20) impacted the mean (4.375 vs 2.143). In many situations, we end up with outliers that should be cleaned from our datasets, such as faulty measurements that are orders of magnitude away from the other data points.
The second dimension (or moment) addresses how the data is spread out (variability or dispersion of the data). For this, we have to measure the difference (aka residual) between an estimate of location and an observed value.
Mean Absolute Deviation
One way to get this estimate is to calculate the difference between the largest and the smallest value, known as the range. However, the range is, by definition, very sensitive to the two extreme values. Another option is the mean absolute deviation, which is the average of the absolute deviations from the mean, as can be seen in the formula below:
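In standard notation, with x̄ denoting the sample mean and n the number of observations, the definition described here reads:

```latex
\text{Mean absolute deviation} = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - \bar{x} \right|
```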
One reason why the mean absolute deviation receives less attention is that, mathematically, it’s preferable not to work with absolute values when other desirable options such as squared values are available (for instance, x² is differentiable everywhere, while the derivative of |x| is not defined at x=0).
The variance and standard deviation are much more popular statistics than the mean absolute deviation to estimate the data dispersion.
The variance is actually the average of the squared deviations from the mean.
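Written out under the usual sample convention (an assumption here, since the divisor depends on whether a sample or a whole population is described), the variance and standard deviation are:

```latex
\begin{align}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 \\
s &= \sqrt{s^2}
\end{align}
```

Dividing by n instead of n − 1 gives the population variance σ², which is the literal "average" of the squared deviations and is what NumPy's var() computes by default (ddof=0).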
In statistics, s is used to refer to a sample standard deviation, whereas σ refers to the population standard deviation.
As can be noted from the formula, the standard deviation is on the same scale as the original data, making it an easier metric to interpret than the variance. Analogous to the trimmed mean, we can also compute the trimmed/truncated standard deviation, which is less sensitive to outliers.
A good way of remembering some of the above estimates of variability is to link them to other metrics or distances that share a similar formulation. For instance,
Variance ≡ Mean Squared Error (MSE) (aka Mean Squared Deviation MSD)
Standard deviation ≡ L2-norm, Euclidean norm
Mean absolute deviation ≡ L1-norm, Manhattan norm, Taxicab norm
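These correspondences hold up to a scaling by the number of observations, which a quick numerical check can make concrete. The sketch below uses an illustrative array (an assumption) and verifies each link with NumPy.

```python
import numpy as np

data = np.array([1, 2, 2, 2, 2, 3, 3, 20])  # illustrative values (assumed)
deviations = data - data.mean()
n = data.size

# Variance: the mean squared error of the data around its own mean
variance = np.mean(deviations ** 2)
# Standard deviation: L2 norm of the deviations, scaled by sqrt(n)
std_from_l2 = np.linalg.norm(deviations, ord=2) / np.sqrt(n)
# Mean absolute deviation: L1 norm of the deviations, scaled by n
mad_from_l1 = np.linalg.norm(deviations, ord=1) / n

assert np.isclose(variance, np.var(data))
assert np.isclose(std_from_l2, np.std(data))
assert np.isclose(mad_from_l1, np.mean(np.abs(deviations)))
print("All three identities hold")
```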
Like the arithmetic mean, none of the estimates of variability (variance, standard deviation, mean absolute deviation) is robust to outliers. Instead, we can use the median absolute deviation from the median to check how our data is spread out in the presence of outliers. The median absolute deviation is a robust estimator, just like the median.
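The median absolute deviation is simple enough to compute by hand, as the sketch below shows. The helper and the data are illustrative assumptions. The optional factor 1.4826 is the usual constant that makes the estimate comparable to the standard deviation for normally distributed data.

```python
import numpy as np


def median_abs_deviation(values, scale=1.0):
    """Median of the absolute deviations from the median, times `scale`."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    return float(scale * np.median(np.abs(values - center)))


data = [1, 2, 2, 2, 2, 3, 3, 20]  # illustrative values (assumed)
raw = median_abs_deviation(data)
scaled = median_abs_deviation(data, scale=1.4826)
print("Raw:", raw, "Scaled:", round(scaled, 3))
```

For this data, the raw value is 0.5 and the scaled value is about 0.741. The outlier 20 barely affects it, in contrast with the standard deviation.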
Percentiles (or quantiles) are another measure of the data dispersion that is based on order statistics (statistics based on sorted data). The P-th percentile is a value such that at least P percent of the values are lower than or equal to it.
The median is the 50th percentile (0.5 quantile).
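When the requested rank falls between two observations, the percentile is technically computed as a weighted average of the two neighboring sorted values.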
The 25th (Q1) and 75th (Q3) percentiles are particularly interesting since their difference (Q3 − Q1) spans the middle 50% of the data. The difference is known as the interquartile range (IQR) (IQR = Q3 − Q1). Percentiles are used to visualize the data distribution using boxplots. A nice article about boxplots is available on the Towards Data Science blog.
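The quartiles and the IQR can be obtained with NumPy's `percentile()` function, as in this short sketch. The data is an illustrative assumption, and NumPy's default linear interpolation between neighboring sorted values is used.

```python
import numpy as np

data = np.array([1, 2, 2, 2, 2, 3, 3, 20])  # illustrative values (assumed)

# 25th, 50th and 75th percentiles in one call
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print("Q1:", q1, "Median:", q2, "Q3:", q3, "IQR:", iqr)
```

The 50th percentile coincides with the median, and the IQR ignores the outlier 20 because it only looks at the middle half of the sorted data.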
You can use NumPy’s var() and std() functions to calculate the variance and standard deviation, respectively. To calculate the mean absolute deviation, you can use Pandas’ mad() function (deprecated in pandas 1.5 and removed in 2.0; on newer versions, the same value is (x - x.mean()).abs().mean()). For computing the trimmed standard deviation, you can use SciPy’s tstd() from the stats module. You can use Pandas’ boxplot() to quickly visualize a boxplot of the data.
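A minimal sketch of these calls is shown below. It assumes the same illustrative eight-value dataset used for the location estimates, and it computes the mean absolute deviation directly with NumPy so that it does not depend on the pandas version.

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 2, 2, 2, 3, 3, 20])  # assumed dataset

variance = np.var(data)   # population variance (ddof=0 by default)
std_dev = np.std(data)    # population standard deviation (ddof=0 by default)
# Computed by hand because pandas' mad() was removed in pandas 2.0
mean_abs_dev = np.mean(np.abs(data - data.mean()))
# No limits are passed, so nothing is trimmed: this is the sample std (ddof=1)
trimmed_std = stats.tstd(data)

print("Variance:", round(float(variance), 3))
print("Standard Deviation:", round(float(std_dev), 3))
print("Mean Absolute Deviation:", round(float(mean_abs_dev), 3))
print("Trimmed Standard Deviation:", round(float(trimmed_std), 3))
```

Note that `tstd()` only trims when a `limits` argument is given. Without it, the result differs from `np.std()` solely because it divides by n − 1 instead of n, which is why the "trimmed" figure can come out larger than the plain standard deviation.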
>>> Variance: 35.234
>>> Standard Deviation: 5.936
>>> Mean Absolute Deviation: 3.906
>>> Trimmed Standard Deviation: 6.346
>>> Median Absolute Deviation: 0.741
>>> Interquartile Range (IQR): 1.0