Evidence for the null hypothesis? A case of equivalence testing | by Rafael Valdece Sousa Bastos | Jan, 2021


A statistical tool to use when the results are not significant (with example in R)

Rafael Valdece Sousa Bastos

Let’s say you run a study to test differences between kids’ and adolescents’ moral disgust toward unfair treatment. After you collect the data and run the analysis, you see that there is no statistically significant difference between the two groups. When this happens, it is common to interpret the result as evidence for the null hypothesis, that is, that there is no real difference between kids and adolescents in moral disgust. However, this is a misinterpretation of a non-significant result, since it is impossible to show the total absence of an effect in a population.

Quertemont (2011) stated that non-significant results can occur for three different reasons:

1. Mistakes have been made during the collection or encoding of the data, which mask otherwise significant results. This also includes measurement error (imprecision).

2. The study did not have enough statistical power to detect an otherwise real effect at the population level. The result is a “false equivalence”, due to sampling error.

3. There is actually no real effect (or a negligible effect) at the population level. The result is a “true equivalence”.
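The second reason can be made concrete with base R's power.t.test. The sketch below assumes a hypothetical small true effect (a standardized difference of 0.2) and 50 participants per group; both numbers are illustrative and do not come from any study discussed here.

```r
# Hypothetical scenario: true standardized difference of 0.2, 50 per group
res <- power.t.test(n = 50, delta = 0.2, sd = 1, sig.level = 0.05,
                    type = "two.sample", alternative = "two.sided")

# Probability of getting a significant result when the effect is real
power_small_n <- res$power
print(power_small_n)
```

With power this low, a non-significant result is the most likely outcome even though the effect is real.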

Although it’s impossible to show the absence of an effect in a population, we can use statistics to test whether the size of an effect in the population is smaller than some value considered too small to be useful (Quertemont, 2011). That is what equivalence testing does.

Equivalence testing arose in bioequivalence research, where drugs were considered to be bioequivalent if their absorption rate and concentration levels in the blood after a certain amount of time were the same.

Equivalence tests, as said before, examine whether the hypothesis that there are effects extreme enough to be considered meaningful can be rejected (Lakens et al., 2018).

Difference between the classical null-hypothesis significance test and the equivalence test. Image by author.

As the image above shows, to do equivalence testing the researcher must define the smallest effect size of interest. Let’s look at an example from Lakens et al. (2018) to make this clearer:

After an extensive discussion with experts, the researcher decides that as long as the gender difference does not deviate from the population difference by more than .06, it is too small to care about. Given an expected true difference in the population of .015, the researcher will test if the observed difference falls outside the boundary values (or equivalence bounds) of −.055 and .075. If differences at least as extreme as these boundary values can be rejected in two one-sided tests […], the researcher will conclude that the application rates are statistically equivalent; the gender difference will be considered trivially small, and no money will be spent on addressing a gender difference in participation.

To learn how to justify the smallest effect size of interest, see Lakens et al. (2018).

Reading data

We will read the dataset using the R function read.delim.

EqT <- read.delim("")

The name we chose for our data frame is EqT. Using View(EqT) we can see our dataset:

Where sex is divided into 1 = men, 2 = women; and SPPD = Self-Perception of Prejudice and Discrimination (Bastos et al. 2021).

Loading TOSTER package

To run the test, we first need to install the TOSTER package. Just run the following code.

install.packages("TOSTER")

Ok. Now the package is on your computer. Next, we need to load it with library(TOSTER).

Summary Statistics

Equivalence testing can be performed from summary statistics using the TOSTER package. So, we will calculate the mean, standard deviation, and sample size for each group.

aggregate(x = EqT$SPPD, by = list(EqT$sex), FUN = "mean")
aggregate(x = EqT$SPPD, by = list(EqT$sex), FUN = "sd")
aggregate(x = EqT$SPPD, by = list(EqT$sex), FUN = "length")
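As an alternative sketch, the three summaries can be collected in a single aggregate call. The data frame below is a toy stand-in with the same column names as EqT; its values are made up purely for illustration and are not the study data.

```r
# Toy data with the same column names as EqT (values are made up)
toy <- data.frame(
  sex  = c(1, 1, 1, 2, 2, 2, 2),
  SPPD = c(2, 3, 4, 1, 3, 5, 3)
)

# One call returning mean, SD, and n for each group
stats <- aggregate(SPPD ~ sex, data = toy,
                   FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))

# aggregate() stores the three values in a matrix column; flatten it
stats <- do.call(data.frame, stats)
print(stats)
```

Replacing toy with EqT should give the same three statistics per group in one table.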

This gives us the following output:

Men had Mean = 2.99, SD = 1.82, and n = 197 participants, while women had Mean = 3.03, SD = 1.90, and n = 374 participants.

Equivalence Testing in R

Now we will put those results into a single function call that gives us an informative output and a figure. Note that the lower and upper equivalence bounds used here were set only as an example, so the outcome should not be treated as a real result.

TOSTtwo(m1 = 2.991117, m2 = 3.036096, sd1 = 1.823769, sd2 = 1.904216, n1 = 197, n2 = 374,
low_eqbound_d = -0.06, high_eqbound_d = 0.06, alpha = 0.05,
var.equal = FALSE)

The parameters of the test are defined inside the parentheses in the code above. To perform the test on your own data, simply copy these lines of code, replace the values with the corresponding values from your own study, and run the code. Results and a plot will be printed automatically. Running help("TOSTtwo") opens a help file with more detailed information.

With the following output:

And figure:

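As we can see, based on the equivalence test and the null-hypothesis test, we can conclude that the observed effect is statistically not different from zero and statistically not equivalent to zero. In other words, with these bounds the data are inconclusive: we can neither claim a difference nor claim equivalence.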
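To see what the two one-sided tests are doing, the same logic can be sketched by hand in base R from the summary statistics used in the TOSTtwo call. This is a minimal sketch, not TOSTER's actual implementation: it assumes Welch's t-test and assumes the standardized bounds are converted to raw units with the root-mean-square of the two standard deviations, so small numerical differences from the package output are possible.

```r
# Summary statistics from the TOSTtwo call (group 1 = men, group 2 = women)
m1 <- 2.991117; sd1 <- 1.823769; n1 <- 197
m2 <- 3.036096; sd2 <- 1.904216; n2 <- 374
alpha <- 0.05
d_low <- -0.06; d_high <- 0.06

# Assumption: convert standardized bounds to raw units using the
# root-mean-square of the two SDs (TOSTER's internals may differ)
sd_avg   <- sqrt((sd1^2 + sd2^2) / 2)
raw_low  <- d_low * sd_avg
raw_high <- d_high * sd_avg

# Welch standard error and degrees of freedom
v1 <- sd1^2 / n1
v2 <- sd2^2 / n2
se <- sqrt(v1 + v2)
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

mean_diff <- m1 - m2

# Two one-sided tests against the equivalence bounds
t_low  <- (mean_diff - raw_low) / se    # H0: difference <= lower bound
t_high <- (mean_diff - raw_high) / se   # H0: difference >= upper bound
p_low  <- pt(t_low, df, lower.tail = FALSE)
p_high <- pt(t_high, df, lower.tail = TRUE)
p_tost <- max(p_low, p_high)            # both tests must reject

# Classical two-sided test against zero
t_nhst <- mean_diff / se
p_nhst <- 2 * pt(-abs(t_nhst), df)

equivalent <- p_tost < alpha
different  <- p_nhst < alpha
print(c(p_tost = p_tost, p_nhst = p_nhst))
```

Under these assumptions both p-values come out above .05, so neither equivalence nor a difference can be claimed.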

In this post, I showed why equivalence testing matters for research and walked through an example using two one-sided tests (TOST), although equivalence tests can be performed with other statistics as well. It is important to note one limitation of equivalence testing: power. To run these tests, you may need a sample size beyond n = 100 or even n = 500 to have sufficient power (Goertzen and Cribbie, 2010). That means researchers may need to invest more resources in data collection to use them.
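The power limitation can be illustrated with a rough sample-size calculation. The sketch below uses a normal approximation for a two-group TOST with symmetric standardized bounds and assumes the true effect is exactly zero; the helper name n_per_group_tost is hypothetical and not part of any package.

```r
# Approximate n per group for a TOST with symmetric bounds of +/- d,
# assuming a true effect of exactly zero (normal approximation)
n_per_group_tost <- function(d, alpha = 0.05, power = 0.80) {
  beta <- 1 - power
  ceiling(2 * (qnorm(1 - alpha) + qnorm(1 - beta / 2))^2 / d^2)
}

n_narrow <- n_per_group_tost(0.06)  # bounds as narrow as the example above
n_wide   <- n_per_group_tost(0.50)  # much wider bounds, for comparison
print(c(narrow = n_narrow, wide = n_wide))
```

Narrow bounds such as the ±0.06 used in the example call for thousands of participants per group under this approximation, while wider bounds need far fewer.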

Feel free to contact me by

Gmail: rafavsbastos@gmail.com



E. Quertemont, How to statistically show the absence of an effect, 2011, Psychologica Belgica, 51(2), 109–127.

D. Lakens, A. M. Scheel, and P. M. Isager, Equivalence testing for psychological research: A tutorial, 2018, Advances in Methods and Practices in Psychological Science, 1(2), 259–269.

R. V. S. Bastos, F. C. Novaes, J. C. Natividade, Self-Perception of Prejudice and Discrimination Scale: Evidence of Validity and Other Psychometric Properties, 2021, Manuscript Submitted to peer-review.

J. R. Goertzen, and R. A. Cribbie, Detecting a lack of association: An equivalence testing approach, 2010, British Journal of Mathematical and Statistical Psychology, 63(3), 527–537.
