The validity of psychological and educational tests | by Rafael Valdece Sousa Bastos | Jan, 2021
Evidence based on content
Collect data on how well the items of a test represent the domain, investigating whether they are an adequate sample of what the test is meant to measure. The set of items is judged for its coverage of the proposed construct. In general, this is done through expert evaluation: specialists rate the relevance of each item to the aspects being assessed. Agreement statistics, such as the percentage of agreement and the Kappa coefficient, can be used to summarize their judgments.
Example: In a submitted paper, Bastos et al. (2020) created a measure of self-perceived prejudice and discrimination for different social groups. The authors used the following procedure to seek validity based on content:
- Reviewed the literature on existing measures of prejudice and discrimination.
- Defined self-perceived prejudice as a perception that a person is a victim of negative attitudes towards themselves based on their social group; and self-perceived discrimination as a perception that a person is a victim of negative and unjustified behavior towards themselves based on their social group.
- Based on those definitions and on previous measures, the authors developed new items for other social groups.
- After creating the items, they sent them to experts (i.e., psychologists and psychometricians) for evaluation.
- Based on the proportion of agreement among the experts, the authors selected nine items for the subsequent analyses.
Evidence based on response processes
Collect data on the mental processes involved in performing given tasks. This usually concerns an individual's response process: researchers ask the person evaluated about the cognitive path used to reach a given result. As an example, Noble et al. (2014) sought this kind of evidence in their study. They found that English language learners (ELL) had lower scores on a high-stakes test than non-English language learners. Based on interviews, they discovered that
ELL students’ interactions with specific linguistic features of test items often led to alternative interpretations of the items that resulted in incorrect answers.
Evidence based on internal structure
Collect data on the structure of the correlations among items that evaluate the same construct. Statistical techniques often used are Exploratory Factor Analysis (EFA), Confirmatory Factor Analysis (CFA), and Exploratory Structural Equation Modeling (ESEM).
As an example, we can use the paper by Selau et al. (2020). The authors wanted to measure intellectual disability in 7- to 15-year-old children. They investigated the internal structure of the scale through EFA and CFA and found a structure in which the items are divided into social, conceptual, and practical factors, which are in turn explained by a higher-order factor called adaptive functioning.
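To see what a higher-order structure of this kind implies for the item correlations, the sketch below computes model-implied correlations from standardized loadings. All numbers are hypothetical (they are not the estimates reported by Selau et al.), and the model is simplified so that every item has the same loading and loads on a single first-order factor.

```python
# Hypothetical standardized loadings, chosen only for illustration.
ITEM_LOADING = 0.8    # loading of each item on its first-order factor
FACTOR_LOADING = 0.7  # loading of each first-order factor on the higher-order factor

FACTORS = ("social", "conceptual", "practical")


def implied_correlation(item_a, item_b):
    """Model-implied correlation between two items.

    Each item is a (factor_name, item_number) pair. Under a standardized
    higher-order model, two first-order factors correlate only through the
    higher-order factor, so their correlation is the product of their loadings.
    """
    if item_a == item_b:
        return 1.0
    same_factor = item_a[0] == item_b[0]
    factor_correlation = 1.0 if same_factor else FACTOR_LOADING * FACTOR_LOADING
    return ITEM_LOADING * ITEM_LOADING * factor_correlation


within = implied_correlation(("social", 1), ("social", 2))
between = implied_correlation(("social", 1), ("practical", 1))
print(f"Items on the same factor:    r = {within:.2f}")
print(f"Items on different factors:  r = {between:.2f}")
```

Under these assumed loadings, items on the same factor correlate more strongly than items on different factors, yet the latter still correlate above zero because all three factors share the higher-order factor. This is the pattern that EFA and CFA look for in the observed correlation matrix.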
Evidence based on its relations with external variables
Collect data on the pattern of correlations between test scores and other variables measuring the same construct or different constructs. This kind of evidence is usually obtained by correlating test scores with those other variables, and it can take several forms:
- Predictive: evidence of the capacity of an instrument to predict the evaluated construct.
- Convergent: when tests measure the same construct, we expect them to be closely related.
- Related constructs: when tests measure related constructs, we expect them to be moderately related.
- Discriminant: when tests measure different constructs, we expect them to be unrelated.
As an example, Beymer et al. (2021) developed a scale of College Students’ Perceptions of Cost. They correlated items from the scale with students’ perceptions and values. They expected (and found) that “costs” were negatively correlated with “expectancies” and “value” (you can see the definition of each variable in their paper).
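The expected correlation patterns described in this section can be checked with a plain Pearson correlation. The sketch below uses made-up scores for six respondents on a new scale and on three hypothetical comparison measures. None of these numbers come from Beymer et al. or any other cited study, and a real study would need a far larger sample.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("Need two lists of the same length with at least two scores.")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("Correlation is undefined when one variable is constant.")
    return covariation / (spread_x * spread_y)


# Made-up scores for six respondents.
new_scale = [1, 2, 3, 4, 5, 6]
same_construct = [2, 1, 4, 3, 6, 5]       # another test of the same construct
related_construct = [3, 1, 5, 2, 6, 4]    # a test of a related construct
different_construct = [3, 5, 1, 6, 2, 4]  # a test of a different construct

r_same = pearson_r(new_scale, same_construct)
r_related = pearson_r(new_scale, related_construct)
r_different = pearson_r(new_scale, different_construct)

print(f"Same construct:      r = {r_same:.2f}")
print(f"Related construct:   r = {r_related:.2f}")
print(f"Different construct: r = {r_different:.2f}")
```

In this toy data the correlation is high for the same construct, moderate for the related construct, and close to zero for the different construct, which is the ordering the expectations above predict.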
Evidence based on the consequences of testing
Examine the intended and unintended social consequences of using a test, to verify whether its use is producing the desired effects, according to the purpose for which it was built. Tests have this type of validity evidence when they are used for the purpose they were built for. Although you can’t predict what people will do with an instrument you developed, the responsibilities of the authors of the instrument need to be discussed.
As an example, we can think about IQ measures. Their purpose is to measure people’s intelligence. However, at times in history IQ scores have been used to justify racism.