dynamite plots must die · Simply Statistics
Statisticians have been pointing out the problem with dynamite plots, also known as bar and line graphs, for years. Karl Broman lists them as one of the top ten worst graphs. The problem has even been documented in the peer-reviewed literature. For example, this British Journal of Pharmacology paper titled Show the data, don’t conceal them was published in 2011.
However, despite all these efforts, dynamite plots continue to be ubiquitous in the scientific literature. Just open the latest issue of Nature, Science or Cell and you will likely see a few. In fact, in this PLOS Biology paper, Tracey Weissgerber and co-authors perform a systematic review of “top physiology journals” and find that “85.6% of papers included at least one bar graph”. They go on to recommend “training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies”. In my view, the training will be accelerated if editors implement a policy that requires authors to show the data or, if the dataset is too large, show the distribution of the data with boxplots, histograms or smooth density estimates.
What’s wrong with dynamite plots
Dynamite plots are used to compare measurements from two or more groups: cases and controls, for example. In a two-group comparison, the plots are graphical representations of a grand total of 4 numbers, regardless of the sample size. The 4 numbers are the average and the standard error (or the standard deviation, it’s not always clear) for each group. Here is a simulated example comparing diastolic blood pressure for patients on a drug and placebo:
Stars are often added to point out that the differences are statistically significant.
So what is the problem with these plots? First, if you have a print version of your journal you are wasting ink. No need to waste all that toner just to show these 4 summaries:
##          x average  se
## 1 Controls      60 2.3
## 2    Cases      81 9.7
From these numbers you compute the p-value, which in this case is just below 0.05.
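As a rough check, the sketch below recomputes a p-value from the four rounded summaries alone. It assumes 10 observations per group (the sample size used in the simulation code later in this post) and a pooled-variance two-sample t-test. The post does not say which test was used, so both are assumptions.

```r
# Rounded summaries from the table above
avg <- c(Controls = 60, Cases = 81)
se  <- c(Controls = 2.3, Cases = 9.7)
n   <- 10  # assumed per-group sample size (not stated with the table)

# With equal group sizes, the standard error of the difference is the
# square root of the sum of the squared standard errors
se_diff <- sqrt(sum(se^2))
t_stat  <- unname(avg["Cases"] - avg["Controls"]) / se_diff

# Two-sided p-value from a t distribution with 2n - 2 degrees of freedom
# (pooled-variance test; an assumption about the analysis)
p_value <- 2 * pt(-abs(t_stat), df = 2 * n - 2)
p_value
```

A Welch test would use the same statistic here but fewer degrees of freedom, so its p-value would be somewhat larger. That is one more reason a result sitting this close to 0.05 deserves a look at the underlying data.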
Second, the dynamite plot makes it appear as if there is a clear difference between the two groups. Showing the data reveals more information. In our example, showing the data reveals that the lowest blood pressure is actually in the treatment group. It also reveals the presence of one somewhat extreme value of 150. This might represent a data entry mistake. Perhaps systolic pressure was recorded by accident? Note that without that data point, the difference is no longer significant at the 0.05 level.
Note also that, as pointed out by Weissgerber, data that look quite different can result in exactly the same barplot. For instance, the two datasets below would produce the same barplot as the one shown above.
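To make this concrete, here is a minimal sketch that builds two differently shaped samples with exactly the same average and standard error. The helper names `match_moments` and `std_err` and both datasets are hypothetical illustrations, not the data behind the figures in this post; the targets of 81 and 9.7 are borrowed from the summary table, and 10 observations per group is assumed.

```r
# Shift and rescale a vector so it has exactly the requested mean and SD
match_moments <- function(v, target_mean, target_sd) {
  target_mean + target_sd * (v - mean(v)) / sd(v)
}

std_err <- function(v) sd(v) / sqrt(length(v))

n           <- 10               # assumed group size
target_mean <- 81
target_sd   <- 9.7 * sqrt(n)    # because SE = SD / sqrt(n)

# Two hypothetical shapes: one extreme value vs. two separated clusters
one_outlier  <- match_moments(c(rep(0, n - 1), 1), target_mean, target_sd)
two_clusters <- match_moments(rep(c(0, 1), each = n / 2), target_mean, target_sd)

# Both rows show the same average and standard error
rbind(one_outlier  = c(average = mean(one_outlier),  se = std_err(one_outlier)),
      two_clusters = c(average = mean(two_clusters), se = std_err(two_clusters)))
```

A bar with error bars would look identical for both vectors, even though one has a single extreme point and the other splits into two clusters.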
What should we do instead?
First, let’s generate the data that we will use in the example R code shown below.
library(tidyverse)
set.seed(0)
n <- 10
# simulate on the log2 scale, then transform back
cases <- rnorm(n, log2(64), 0.25)
controls <- rnorm(n, log2(64), 0.25)
cases <- 2^(cases)
controls <- 2^(controls)
cases[1:2] <- c(110, 150) # introduce outliers
dat <- data.frame(x = factor(rep(c("Controls", "Cases"), each = n),
                             levels = c("Controls", "Cases")),
                  Outcome = c(controls, cases))
One option is simply to show the data points, which you can do like this:
dat %>%
  ggplot(aes(x, Outcome)) +
  geom_jitter(width = 0.05)
In this case we see that the data is right skewed, so we might want to remake the plot in the log scale:
dat %>%
  ggplot(aes(x, Outcome)) +
  geom_jitter(width = 0.05) +
  scale_y_log10()
If we want to show summary statistics for the data, we can superimpose a boxplot:
dat %>%
  ggplot(aes(x, Outcome)) +
  geom_boxplot() +
  geom_jitter(width = 0.05) +
  scale_y_log10()
Although not the case here, if there are too many points, we can simply show the boxplot.
dat %>%
  ggplot(aes(x, Outcome)) +
  geom_boxplot() +
  scale_y_log10()
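For readers who want to see what a boxplot actually summarizes, here is a minimal base R sketch of the five numbers. It follows the convention documented for geom_boxplot(): hinges at the quartiles, and whiskers that reach the most extreme points within 1.5 times the interquartile range of the hinges. The vector `y` is hypothetical, not the simulated data above.

```r
# Hypothetical measurements with two high values
y <- c(50, 55, 58, 60, 62, 64, 66, 70, 110, 150)

# Hinges: first quartile, median, third quartile
hinges <- quantile(y, c(0.25, 0.5, 0.75), names = FALSE)
iqr    <- hinges[3] - hinges[1]

# Whiskers reach the most extreme points within 1.5 * IQR of the hinges
lower_whisker <- min(y[y >= hinges[1] - 1.5 * iqr])
upper_whisker <- max(y[y <= hinges[3] + 1.5 * iqr])

five_numbers <- c(ymin = lower_whisker, lower = hinges[1], middle = hinges[2],
                  upper = hinges[3], ymax = upper_whisker)

# Points beyond the whiskers are drawn individually
outliers <- y[y < lower_whisker | y > upper_whisker]

five_numbers
outliers
```

Note that when a log scale such as scale_y_log10() is added to the plot, these statistics are computed on the transformed values, so the sketch would need to be applied to log10(y) to match.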
And if we are worried that 5 summary statistics might be hiding important characteristics of the data, we can use ridge plots.
library(ggridges)
dat %>%
  ggplot(aes(Outcome, x)) +
  scale_x_log10() +
  geom_density_ridges(scale = 0.9)
If the dataset is of manageable size, you should show the data points as well:
library(ggridges)
dat %>%
  ggplot(aes(Outcome, x)) +
  scale_x_log10() +
  geom_density_ridges(scale = 0.9,
                      jittered_points = TRUE,
                      position = position_points_jitter(width = 0.05, height = 0),
                      point_shape = '|', point_size = 3,
                      point_alpha = 1, alpha = 0.7)