The Content Validity of Cognitively Oriented Tests: Commentary on Schmidt (†)
- Author note: I thank Michael C. Campion for his help with the reference section.
Schmidt (International Journal of Selection and Assessment, 20, 1–13 (2012)) argues that it is possible for scores based on measures of general cognitive ability (GCA) to have content validity evidence. This commentary examines this argument further. I first decompose the various lines of validity evidence that may exist for GCA scores. Next, I consider whether GCA scores can have content validity evidence and whether they cannot. I conclude with several observations about the meaning of content validity within GCA research and practice. The bottom line is that although I agree with Schmidt that GCA scores can have content validity evidence, I am not sure such evidence tells us much about the overall validity of GCA.
Schmidt's (2012) central argument is that it is possible to acquire evidence and support for the content validity of general cognitive ability (GCA). He further contends that, ‘… in the domains of cognitive skills, aptitudes, and abilities, test development procedures that yield content validity also yield criterion-related validity’ (p. 3). The paper makes a variety of other points that are not as central to the core focus but are nevertheless worth some reflection.
This may be the most boring commentary you have read in some time because I find myself in agreement with Schmidt's basic point – GCA can have content validity evidence. As I will try to explain, to argue otherwise simply does not make sense professionally, technically, theoretically, or practically. At the same time, the role of content validity, as one piece of evidence in the broader accumulation of evidence for construct validity, needs closer examination. In today's world, with the mass of data and knowledge we have about jobs, work, workers, individual differences, and validity, do the classic distinctions (and relationships) among content, criterion-related, and construct validity make sense? Is there an instance where GCA could not have content validity?
This commentary will provide some thoughts on these issues. The following subsections will tackle key points and arguments raised by Schmidt (2012) and present some additional thinking and discussion around them. Let us begin by clarifying terms.
2. Scores and validity
Most of the phenomena and individual differences of interest to applied psychologists are latent in nature. We cannot directly see or observe GCA, but neither can we directly observe conscientiousness, collectivism, or work values. In this sense, I agree with Schmidt (2012) that ‘mental processes’ are vital for virtually every type of work behavior, even those that may not be conscious. Mental processes are even relevant for those individual differences that we often term ‘noncognitive,’ such as personality or attitudes. For example, Mischel and Shoda's (1995) cognitive–affective personality system is based heavily on the idea that traits are systematic forms of cognitive processing (see McCrae & Costa, 1996; Matthews, 1997 for very similar points). The question is not about mental processing but rather the nature of that processing and whether it is focused on adding digits, interpreting social situations, or expressing affect toward others and objects. Mental processing underlies all constructs of interest to industrial/organizational (I/O) psychologists, and hence, the measures and indicators used to represent those constructs. The Uniform Guidelines' (1978) treatment of mental processes is simply out of date.
However, I wish Schmidt (2012) had made a clearer distinction between indicators (scores) of GCA and the latent nature of GCA. He mentioned a focus on the construct level, but then also emphasized GCA scores from paper-and-pencil tests. This distinction is important for the discussion at hand. We must make inferences about latent constructs based on fallible, manifest indicators. Such indicators may include responses to paper-and-pencil tests, observations from knowledgeable observers, or behavior in simulations. Performance on paper-and-pencil tests can be as informative as observations of behavior because, as Anastasi and Urbina (1997) suggested, tests are just samples of behavior. Recognizing the distinction between latent constructs (e.g., GCA) and fallible measures or indicators of those constructs (e.g., paper-and-pencil assessments) leads to an appreciation that we care less about the measures themselves than about the scores that come from them.
Scores, and not tests, may have validity, and there are multiple types of validity evidence that can be accumulated to understand the overall sense of whether the scores fulfill their intended purpose. Validity is a stretched concept, which means it has been used in so many different ways for so long that the term is almost meaningless without definition (Osigweh, 1989). Therefore, I think it is critical to always (a) define one's view of validity and (b) be precise about the specific types of validity that support this overall view. I favor the definition by Messick (1995, p. 74): ‘Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment.’ Notice the emphasis on overall evaluation of scores; it is language that is consistent with all current professional standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; Society for Industrial and Organizational Psychology, 2003) and flows naturally from the thinking that validity is a unitary concept based on the accumulation of multiple lines of evidence (see Guion, 2011 for a nice, concise historical review).
If validity is an overall evaluation of scores, then there are multiple lines of evidence that can be used to support (or refute) the appropriateness of those scores for a specific purpose. The American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1999) and Messick (1995) developed several of these lines of evidence:
- Content: there are two types. One involves matching the manifest indicators (e.g., test items) to the latent construct. The second involves matching the content of the assessment to the intended purpose of the assessment (e.g., overlap between test items and job requirements).
- Response process: ensuring psychological processes used to perform in the assessment are consistent with the intended construct and purpose for the assessment. For example, if the test is designed to assess problem solving, but the respondents already know the answer, then it is not a problem-solving test but a knowledge test.
- Internal structure: the interrelationships among items or indicators should be consistent with the latent construct. If the construct is supposed to be unidimensional, then the assessment should produce a unidimensional factor structure.
- Relations with other variables: this includes the classic lines of validity evidence including convergent, discriminant, and criterion-related.
- Consequential: this is a more controversial form of validity, but is not simply whether the scores have desired or undesired consequences (e.g., racioethnic subgroup differences). Rather, it is whether the scores are consistent with their intended purpose, and if not, why?
Notice that the trinity of content, criterion-related, and construct validity is subsumed here. In personnel selection, content validity historically has been defined in terms of whether there is an overlap between the content of the assessment and the content of the job (Goldstein, Zedeck, & Schneider, 1993), which is similar to the content and response process forms of validity noted earlier. Criterion-related validity is subsumed within relations with other variables. Construct validity, when referring to factor structure, is similar to internal structure and relations with other variables. Thus, the reason I like Messick's (1995) thinking is that he reserves the term ‘validity’ for an overall evaluative judgment of scores and uses more precise terms for specific lines of validity evidence (content validity, response process validity, internal structure validity, relational validity, consequential validity; and even more specific types such as discriminant validity, etc.).
If I have one major source of disagreement with Schmidt (2012), it is with his treatment of theory, the term construct, and construct validity. It would help to be more precise so that it is easier to understand the arguments. For example, I find his statement on page 27 confusing: ‘The abilities measured in cognitive employment tests are not constructs.’ Does this mean that the scores based on the measures are not good indicators of the GCA construct? The gist of this paragraph is summarized by the last sentence, ‘Since the cognitive abilities and skills measured in employment tests are not constructs, they are suitable for use in a content validity methodology.’ I do not think we want to equate ‘suitability for content validity’ with measures or indicators. I do not think we want constructs to be lost in content validation. Rather, if we keep the focus on scores, then we need to keep constructs in the picture. The scores from the measures are obviously not constructs; they are indicators of the constructs. We want to evaluate the content validity of the scores, but to do so effectively, we need to understand (a) whether the scores are consistent with the nature of the underlying GCA construct and (b) whether the scores (and items) overlap with the demands of the job. Constructs and theory need to play a role in the evaluation of content validity. Perhaps I simply misunderstand Schmidt's (2012) point on this issue. It needs clarification.
To put all this more simply, validity is just an idea that is subjected to theoretical, empirical, and logical examination – that is, theory building and hypothesis testing, as Guion (1965) argued decades ago. And if validity is just an idea to be tested, then each form of validity evidence is simply testing a different hypothesis about the scores. By using a more precise and complete view of validity, and distinguishing between scores, measures, and constructs, we can better evaluate whether GCA scores can have content validity.
3. Can GCA have content validity?
Schmidt (2012) suggests that some people believe it is not possible for GCA scores to have content validity (which he equates with the link between the GCA measure content and the content of the job). To respond to such a criticism gives it legitimacy. But I wonder, does this criticism even make sense given the broad view of validity presented earlier (and endorsed by all major professional scientific organizations)?
The answer depends on one's theory of GCA. Schmidt (2012) is a bit vague in his definition of GCA, referring mostly to psychometric descriptions not unlike many in the personnel selection literature (e.g., the most general factor sitting at the apex of specific abilities and aptitudes). But if GCA is the determinant of these more specific abilities and aptitudes, then GCA is more than an ability to learn. It also represents an ability to perceive, interpret, manipulate, store, retrieve, and respond to data and information (see Jensen, 1998).
These ‘cognitive demands’ are usually found in the content of most jobs. Jobs differ in the types of cognitive activity required, and jobs differ in the levels of proficiency needed even for the same types of cognitive ability. For example, engineering occupations require more advanced forms of math than construction occupations, which in turn require more advanced forms of math than customer service occupations (e.g., retail associate). But in all of these jobs, the ability to learn and to perceive, interpret, manipulate, store, retrieve, and respond to data and information is present. These cognitive demands may not be the most critical elements of the job (e.g., retail associate), and hence, other individual difference constructs may be more critical, but everyone will need to learn new job tasks sooner or later. Thus, on theoretical grounds, there is no reason why GCA scores should not have content validity evidence when the job has sufficient cognitive demands.
Likewise, consider the methods used to establish content validity. Job analysis is the most obvious one. Job analysis starts by defining the nature of the job or occupation to be studied. It then identifies the critical tasks, uses these tasks to identify the critical knowledge, skill, ability, and other characteristics (KSAOs) needed to perform the tasks competently, and concludes by specifying the levels of competence required on these KSAOs. Thus, job analysis is a process of accumulating content validity evidence, and if the job analysis is performed according to professional standards, then the KSAOs identified should – by definition – be content valid. There are other approaches. Lawshe's (1975) content validity ratio is one familiar to many I/O psychologists, but cognitive psychologists and some training scholars also use methods such as cognitive task analysis, process tracing, and verbal protocols. The point is that it is possible to identify, and even quantify, the overlap between GCA content, GCA scores, and the demands of the job using existing approaches.
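To make the quantification concrete, here is a minimal sketch of Lawshe's (1975) content validity ratio, CVR = (n_e − N/2) / (N/2), where n_e is the number of panelists who rate an element as essential and N is the panel size. The KSAO names, the panel size, and the rating counts are hypothetical values chosen only for illustration; they are not data from Schmidt (2012) or from any job analysis.

```python
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's (1975) CVR = (n_e - N/2) / (N/2).

    Ranges from -1 (no panelist rates the element essential)
    to +1 (every panelist does); 0 means exactly half do.
    """
    if n_panelists <= 0:
        raise ValueError("n_panelists must be positive")
    if not 0 <= n_essential <= n_panelists:
        raise ValueError("n_essential must lie between 0 and n_panelists")
    half = n_panelists / 2
    return (n_essential - half) / half


# Hypothetical panel: 10 subject matter experts rate whether each
# cognitive KSAO is essential to the job (illustrative counts only).
PANEL_SIZE = 10
essential_counts = {
    "verbal reasoning": 9,
    "numerical reasoning": 8,
    "spatial visualization": 5,
}

cvr_by_ksao = {
    ksao: content_validity_ratio(count, PANEL_SIZE)
    for ksao, count in essential_counts.items()
}

for ksao, cvr in cvr_by_ksao.items():
    print(f"{ksao}: CVR = {cvr:.2f}")
```

Whether a given CVR is large enough to retain an element depends on the panel size and the critical value adopted, which this sketch deliberately leaves out.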
Thus, I agree with Schmidt (2012) – I see no theoretical, scientific, or practical reason why it is not possible for GCA scores to have content validity. Apparently, neither does the US Federal Government, as the Occupational Information Network system contains GCA within its framework, and nearly every job (perhaps all!) contains at least some of the elements of GCA, and theoretically, by extension, GCA itself. This is all to say that it should be possible for GCA scores to have content validity evidence. Job analysis, content validity ratios, and related methodologies can be used to quantify and test the content validity hypothesis. I see little doubt that GCA scores can have content validity. I have more doubt whether this ‘GCA content validity hypothesis’ is falsifiable.
4. Can GCA not have content validity?
Let us consider the question differently – is it possible for GCA scores to have criterion-related (relational) validity but not content validity? Murphy, Dzieweczynski, and Zhang (2009) suggested that when predictor scores exhibit positive manifold, and are all positively related to the criterion, then even scores that do not have content validity may still have criterion-related validity. Since theory suggests that GCA is the determinant of more specific abilities and aptitudes (i.e., they are saturated with the general factor), even specific abilities and aptitudes that seem unrelated to the job (lack of content validity) may exhibit criterion-related validity. Based on the theory underlying GCA, it is not inconsistent to find that GCA scores have criterion-related validity but not content validity, because the specific abilities and aptitudes are all saturated by a common GCA factor. A more serious concern would be if a GCA score has content validity but not criterion-related validity (assuming the design and methodological requirements for conducting a criterion-related study were adequately met).
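The positive manifold argument can be illustrated with a small worked example. The sketch below assumes a deliberately simplified single-factor model in which every standardized subtest and the criterion share only the general factor, so the implied correlation between any two of them is the product of their loadings on that factor. This is an illustrative assumption, not the analysis reported by Murphy, Dzieweczynski, and Zhang (2009), and all loadings and the content-match labels are hypothetical.

```python
def implied_correlation(loading_a: float, loading_b: float) -> float:
    """Correlation implied by a single common factor.

    Assumes standardized variables whose only shared variance is the
    common factor, so r_ab = loading_a * loading_b.
    """
    return loading_a * loading_b


# Hypothetical loadings on the general factor (illustrative values only).
subtest_g_loadings = {
    "numerical reasoning": 0.8,    # assume job analysis links this to the job
    "spatial visualization": 0.7,  # assume no apparent content overlap
}
criterion_g_loading = 0.5  # hypothetical loading of job performance on g

# Implied criterion-related validity for each subtest under the model.
implied_validities = {
    subtest: implied_correlation(loading, criterion_g_loading)
    for subtest, loading in subtest_g_loadings.items()
}

for subtest, validity in implied_validities.items():
    print(f"{subtest}: implied validity = {validity:.2f}")
```

Under these assumptions, the subtest with no apparent content overlap still shows a positive implied validity, which is the pattern the positive manifold argument anticipates.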
This leads to a bigger question – does content validity evidence provide a useful means for supporting the overall validity of GCA? This question is important because it is often not possible or feasible to conduct an appropriate criterion-related validity study. If content validity evidence does not relate much to overall validity, is there any point in establishing it? I would say yes (assuming appropriate resources to do it right), for the following reasons:
- Content validity can, at the very least, help determine the content of the GCA items that will be used operationally. The test must include GCA content of some type, and content validity can help ensure the item content is more relevant to the job.
- Content validity (based on a job analysis or related methodology) can help establish the level of competence required on GCA (useful for setting cut scores or minimum qualifications).
- Content validity can be used to show the relevance of GCA meta-analytic data, by demonstrating that the job is not so unique that prior research is irrelevant.
- Content validation procedures (e.g., job analysis) provide a means to involve organizational decision makers and incumbents in the selection development process. This involvement helps build support for the selection system.
- Content validity provides a means of communicating the job relevance of the predictor scores to the general public, lawyers, and applicants.
It is one thing to tell the general public that GCA is being used for selection because it has been found in prior research to be one of the strongest predictors of performance in most jobs. It is quite another to say that a process involving the organization's own employees and managers was followed to link the GCA content to that of the job. It is obvious that the general public will be more willing to accept the content validity argument.
The bottom line is this: GCA scores can have content validity evidence, but that evidence is not a particularly ‘strong’ form of evidence contributing to overall validity evaluations because of positive manifold. Yet content validity evidence offers value in many other ways, such as making the linkage between GCA scores and performance more transparent for the lay public. To summarize:
- Nearly all predictors of interest in personnel selection are latent, and hence, the key focus is on the validity of the scores obtained from those predictors. The Uniform Guidelines' (1978) treatment of this issue is inconsistent with current professional and scientific thinking.
- Validity is an overall evaluative judgment based upon different, specific lines of evidence informing the meaning of the scores. Theory determines how much convergence should exist across these different lines of specific evidence.
- GCA scores can have content validity evidence.
- Content validity evidence does not provide strong evidence for the overall validity of GCA scores due to the positive manifold present within subscores.
- Content validity helps establish the job relatedness of a predictor.
- Content validity helps provide an explanation of why predictor scores are related to performance scores.
- Content validity is built into the process of validation, whether informally (a hunch or idea that predictor scores are related to job performance) or formally (through job analysis and related methodologies).
- If GCA is used based on a competently performed job analysis, then GCA scores will have content validity evidence.
Thus, I agree with Schmidt (2012) that it is possible to establish content validity evidence for GCA scores, but the benefits of doing so are probably more important for legal and public relations reasons than for establishing overall validity.