Whatever the question was, correlation is not the answer – Probably Overthinking It


Pearson’s coefficient of correlation, ρ, is one of the most widely reported statistics. But in my opinion, it is useless; there is no good reason to report it, ever.

Most of the time, what you really care about is either effect size or predictive value:

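  • To quantify effect size, report the slope of a regression line.
  • To quantify predictive value, report the reduction in predictive error, measured for example by MAE or RMSE.

If there’s no reason to prefer one measure of predictive error over another, report the reduction in RMSE, because you can compute it directly from R².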
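The link between R² and RMSE can be sketched as follows. This assumes a least-squares linear model evaluated on the data it was fit to, and a baseline that predicts the mean for everyone; neither assumption is spelled out above.

```latex
R^2 = 1 - \frac{\mathrm{MSE}_{\text{model}}}{\mathrm{MSE}_{\text{baseline}}}
\quad\Longrightarrow\quad
\frac{\mathrm{RMSE}_{\text{model}}}{\mathrm{RMSE}_{\text{baseline}}} = \sqrt{1 - R^2}
\quad\Longrightarrow\quad
\text{reduction in RMSE} = 1 - \sqrt{1 - R^2}
```

For example, R² = 0.5 corresponds to a reduction in RMSE of 1 − √0.5, or about 29%.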

If you don’t care about effect size or predictive value, and you just want to show that there’s a (linear) relationship between two variables, use R², which is more interpretable than ρ and exaggerates the strength of the relationship less.

In summary, there is no case where ρ is the best statistic to report. Most of the time, it answers the wrong question and makes the relationship sound more important than it is.

To explain that second point, let me show an example.

Height and weight

I’ll use data from the BRFSS to quantify the relationship between weight and height. Here’s a scatter plot of the data and a regression line:

The slope of the regression line is 0.9 kg / cm, which means that if someone is 1 cm taller, we expect them to be 0.9 kg heavier. If we care about effect size, that’s what we should report.

If we care about predictive value, we should compare predictive error with and without the explanatory variable.

  • Without the model, the estimate that minimizes mean absolute error (MAE) is the median; in that case, the MAE is about 15.9 kg.
  • With the model, MAE is 13.8 kg.

So the model reduces MAE by about 13%.

If you don’t care about effect size or predictive value, you’re probably up to no good. But even in that case, you should report R² = 0.22 rather than ρ = 0.47, because

  • R² can be interpreted as the fraction of variance explained by the model; I don’t love this interpretation because I think the use of “explained” is misleading, but it’s better than ρ, which has no natural interpretation.
  • R² is generally smaller than ρ, which means it exaggerates the strength of the relationship less.


This dataset isn’t unusual. R² and ρ generally overstate the predictive value of the model.

The following figure shows the relationship between ρ, R², and the reduction in RMSE.

Values of ρ that sound impressive correspond to values of R² that are more modest and to reductions in RMSE that are substantially less impressive.

This inflation is particularly hazardous when ρ is small. For example, if you see ρ = 0.25, you might think you’ve found an important relationship. But that only “explains” 6% of the variance, and in terms of predictive value, only decreases RMSE by 3%.

In some contexts, that predictive value might be useful, but it is substantially more modest than ρ = 0.25 might lead you to believe.
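To see how quickly the numbers shrink, here is a small sketch that converts a few values of ρ to R² and to the reduction in RMSE. It assumes a simple linear regression, where R² = ρ² and the reduction in RMSE relative to predicting the mean is 1 − √(1 − R²). The function name `rmse_reduction` is made up for this illustration.

```python
import math


def rmse_reduction(rho):
    """Fractional reduction in RMSE for a simple linear regression
    whose correlation coefficient is rho (assumes R^2 = rho^2)."""
    r2 = rho ** 2
    return 1 - math.sqrt(1 - r2)


for rho in [0.1, 0.25, 0.5, 0.75, 0.9]:
    r2 = rho ** 2
    print(f"rho = {rho:.2f}   R^2 = {r2:.3f}   "
          f"RMSE reduction = {rmse_reduction(rho):.1%}")
```

For ρ = 0.25 this gives R² ≈ 0.06 and a reduction in RMSE of about 3%, matching the figures quoted above.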

The details of this example are in this Jupyter notebook.

And the analysis I used to generate the last figure is in this notebook.

