Abstract

This commentary discusses a number of issues that build on Schmidt's (2012, International Journal of Selection and Assessment, 20, 1–13) perspective on content validity and cognitive tests. First, it elaborates on the relationship between the treatment of content validity in various professional standards and government guidelines. Second, it offers a perspective on the definition of ‘construct’ that differs from Schmidt's. Third, it elaborates on the settings in which content validity can and cannot be used to support the use of a given test in the cognitive ability domain.

1. Introduction

This commentary offers some reactions to the argument Schmidt (2012) puts forth, in the hopes of furthering the discussion of the role of content-oriented validity evidence in personnel selection. I agree with the vast majority of Schmidt's paper, including the fundamental assertion that content validity (or content-oriented validity evidence, in the language of the Standards for Educational and Psychological Testing; American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999) can be used to support the use of some tests in the cognitive domain for selection purposes.

2. Content-oriented validity evidence in the Guidelines, Standards, and Principles

A central goal of Schmidt's paper is to counter the position taken by Section 14C of the Uniform Guidelines on Employee Selection Procedures (UGESP; Equal Employment Opportunity Commission [EEOC], 1978) that ‘a selection procedure based upon inferences about mental processes cannot be supported solely or primarily on the basis of content validity,’ and thus that content validity ‘is not appropriate for demonstrating the validity of selection procedures which purport to measure traits or constructs.’ Intelligence and aptitude are offered in 14C as examples of such traits or constructs, thus appearing to exclude content validity as a viable strategy for measures in the cognitive domain.

However, Section 14C does permit using content validity for measures of knowledge, skill, and ability that are ‘operationally defined in terms of observable aspects of work behavior of the job.’ It also cautions that ‘as the content of the selection procedure less resembles work behavior … the less likely the selection procedure is to be content valid.’ It is useful to also consider the questions and answers (Q&A) that accompany the UGESP. Q&A 73 notes the importance of minimizing the inferential leap between test and job but acknowledges that content validity is appropriate for measures of knowledge, skill, and ability that are observable behaviors even if they are not actual ‘on the job’ samples of work behavior. Q&A 75 notes that although some selection procedures may carry trait or construct labels, they may actually be samples of observable behaviors, and thus be amenable to content validity. Q&As 73 and 75 have been the basis for making a content validity argument for measures in the cognitive domain.

I believe that it is important to separate the UGESP from the field's two key professional standards documents, namely, the AERA/APA/NCME Standards for Educational and Psychological Testing (1999) and the Society for Industrial and Organizational Psychology's (2003) Principles for the Validation and Use of Personnel Selection Procedures. Schmidt's summary of the argument against the use of content validity with measures in the cognitive domain includes the statement that ‘the use of a construct cannot be justified by content validity under professional standards or under the EEOC Uniform Guidelines – only criterion-related validity or construct validity can justify its use’ (manuscript p. 21). However, the parsing of tests into those that measure constructs and those that do not, and the resulting restriction on the applicability of content validity for tests deemed to measure constructs, is unique to the UGESP. The Standards and the Principles do not take this stance on constructs (see Jeanneret, 2005, or McDaniel, Kepes, & Banks, 2011, for a comparison of the three documents). Thus, Schmidt is taking issue with the UGESP, which, in contrast to the Standards and the Principles, is a political/policy document, not a professional one (Sackett, 2011; Sharf, 2011).

3. Do tests in the cognitive domain measure constructs?

Schmidt offers a number of arguments in response to the position taken by the UGESP. One is that the UGESP is inconsistent, as it permits content validity for knowledge measures, and knowledge certainly reflects mental processes that are no more ‘directly observable’ than skill and ability. In each of these cases, we observe performance on items designed to reflect the domain in question rather than observing the knowledge, skill, or ability directly. I find this a persuasive analysis.

A second argument Schmidt puts forward is that the specific abilities reflected in measures developed using proper content validation methods are not constructs, and as such, are not covered by the UGESP restrictions on the use of content validity to justify the use of tests that measure constructs. It is here that I part ways with Schmidt. While I agree with the end conclusion that content-oriented validity evidence can support the predictive inference for certain measures in the cognitive domain, I do view such measures as reflecting constructs. Schmidt and I have differing views as to the meaning of ‘construct.’

Schmidt defines a construct as ‘a variable which is defined in theoretical terms’ and ‘a variable which is not defined directly in terms of empirical measurement operations but in terms of some particular theory.’ Based on this definition, he argues that the abilities measured by employment tests are not constructs because ‘the definitions are not given in theoretical terms, but in terms of empirical measurement operations (i.e., operationalizations).’

In contrast, consider the perspective taken by the Standards for Educational and Psychological Testing (1999), which defines ‘construct’ as ‘the concept or characteristic that a test is designed to measure’ (p. 5). The Standards then note that rarely, if ever, is there a single possible meaning that can be attached to a pattern of responses to test items, and by attaching a meaning, one makes a claim as to the construct that is the key determinant of variance in test performance (e.g., ‘from the pattern of responses to this set of items I infer the examinee's standing on the ability to comprehend reading material written at the 8th grade level of difficulty’). Thus, this perspective rejects the notion that some characteristics measured by tests are constructs while other characteristics are not.

Interestingly, both the Schmidt perspective and the Standards perspective (which I share, and to which I acknowledge having contributed as cochair of the committee that produced the 1999 Standards) draw on Cronbach and Meehl (1955) as their foundation. I see the Standards perspective as consistent with Cronbach and Meehl, who offer as their summary definition: ‘A construct is some postulated attribute of people, assumed to be reflected in test performance. In test validation the attribute about which we make statements in interpreting a test is a construct’ (p. 283). This is a very inclusive perspective on constructs.

There are additional statements in Cronbach and Meehl that may contribute to Schmidt's interpretation. For example, Cronbach and Meehl wrote: ‘Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not “operationally defined” ’ (p. 282). This language resonates with Schmidt's view of a construct as ‘a variable which is not defined directly in terms of empirical measurement operations.’ I believe, though, that Cronbach and Meehl intended a very specific meaning of ‘operationally defined,’ namely, a test for which no attribute is postulated as the basis for variance in test performance. The example they offer is of some unspecified ‘Test X,’ which is predictive of a criterion of interest. If one's only interest is in prediction, this lack of any interpretation of the meaning of scores on Test X is not an issue. However, when scores on Test X are interpreted as reflecting examinee standing on some attribute, a construct is invoked, and questions of ‘why do you believe that variance on this test is due to this attribute’ become salient. As Cronbach and Meehl noted: ‘Construct validation takes place when an investigator believes that his instrument reflects a particular construct, to which are attached certain meanings. The proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim’ (p. 290). The notion that constructs are restricted to attributes above some level on a continuum from operational to theoretical is also at odds with Cronbach and Meehl, who wrote: ‘Constructs may vary in nature from those very close to “pure description” (involving little more than extrapolation of relations among observation-variables) to highly theoretical constructs involving hypothesized entities and processes’ (p. 300).

In short, my perspective is that a construct is invoked any time we attach an attribute label to test scores and attempt to answer the question ‘why do individuals vary in test performance?’ Attaching an attribute label to test performance is a claim that the specified attribute is the predominant source of variance. The task of the validator is to offer and evaluate evidence in support of this claim. As the Standards note, it is rarely the case that there is only a single possible meaning that can be attached to test scores. Consider a test intended to tap the ability to perform the set of mathematical operations used on a given job, developed using content sampling procedures. We administer this test and observe variance in scores. Our intent is to attribute variation in performance on that set of items to variation in the ability to perform these mathematical operations. But alternatives are possible. Someone might hypothesize that all examinees are, in fact, equally able to perform these mathematical operations, and that variance in test performance reflects variance in effort, not ability. We may attempt to refute that claim on logical grounds: in research settings, it is plausible that there is wide variation in level of effort, but we presume that individuals who present themselves as candidates for a job that they desire are exhibiting maximum performance in pursuit of this valued goal. Another alternative that may be put forward is that variance in test performance reflects variance in the degree to which performance is affected by stereotype threat. We may attempt to refute that claim by comparing variance in test performance across groups: if the hypothesis is that women's performance on math tests is influenced by stereotype threat while men's is not, a finding of comparable variance is inconsistent with the notion that variance is due to differences in experienced threat rather than to differences in ability. Thus, logical, theoretical, and/or empirical evidence can be brought to bear to justify a claim that a measure reflects a particular construct.
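To make the variance comparison described above concrete, the brief sketch below offers one way such a check might be carried out. It is purely illustrative and not part of the original commentary: the score arrays are hypothetical, and Levene's test is used here simply as one common procedure for comparing variances across groups; other approaches could serve equally well.

```python
# Illustrative sketch (hypothetical data): comparing score variances across
# two groups as one piece of evidence bearing on the alternative explanation
# that group-specific stereotype threat, rather than ability, drives variance.
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(0)
scores_women = rng.normal(loc=50, scale=10, size=200)  # hypothetical test scores
scores_men = rng.normal(loc=50, scale=10, size=200)    # hypothetical test scores

# Levene's test for equality of variances; a nonsignificant result is
# consistent with comparable variances across the two groups.
stat, p_value = levene(scores_women, scores_men, center='median')

print(f"Variance (women): {scores_women.var(ddof=1):.1f}")
print(f"Variance (men):   {scores_men.var(ddof=1):.1f}")
print(f"Levene W = {stat:.2f}, p = {p_value:.3f}")
```

As noted in the text, a finding of comparable variances would be inconsistent with the threat-based explanation, though such a check is only one piece of the logical, theoretical, or empirical evidence that might be brought to bear.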

Thus, Schmidt and I both reach the conclusion that the UGESP is incorrect in rejecting the applicability of content validity to measures in the cognitive domain, but for different reasons. Schmidt argues that these measures do not reflect constructs and thus are not covered by the UGESP restriction on the applicability of content validity to constructs. I would argue that it is misguided to argue that some tests measure constructs and others do not. As constructs are invoked when attribute labels are attached to test scores, virtually all tests measure constructs, rendering meaningless the UGESP parsing of tests into those that do and those that do not. (I write ‘virtually all’ because there may be limited situations analogous to Cronbach and Meehl's ‘Test X,’ where one makes no attempt to attach any meaning or any labels to patterns of test responses, but relies solely on the correlation between Test X and a criterion of interest. In such cases, all hinges on criterion-related validity evidence.)

One additional note on Schmidt's perspective on constructs and construct validity. Schmidt asserts that ‘construct validation is required if the theory has not been independently verified; i.e., construct validation is required if the measurement operations must be validated along with the theory itself’ (manuscript p. 26). There is a classic distinction (Loevinger, 1957) between two types of construct validity questions, namely, questions about the existence of a construct (e.g., can one define a construct labeled ‘integrity’ and differentiate it from other constructs?) and questions about the adequacy of a given measure of a construct (e.g., can test X be viewed as a measure of integrity?). Thus, from that perspective, the second construct validity question (i.e., about the adequacy of a given measure of a construct) remains relevant, even if the underlying theory is well-developed and supported.

4. Content validity hinges on domain sampling

The following material elaborates on ideas developed by Schmidt; it is not intended as contrary to Schmidt's position. Reliance on evidence based on test content rests on the idea of sampling: that is, that the set of test items constitutes an appropriate sample of the knowledge, skills, abilities, or other characteristics that contribute to effectiveness in the criterion domain of interest. As has long been acknowledged, there is a continuum regarding the inferential leap to be made in rendering a judgment regarding the relationship between predictor content and the criterion construct domain of interest (e.g., Tenopyr, 1977).

Consider the domain of reading comprehension, an example used by Schmidt. One common strategy, requiring a minimal inferential leap, is to select reading passages from material directly used on the job (e.g., training materials, office communiqués) and ask candidates to respond to questions to demonstrate their understanding of the material. In other settings, concerns may arise about factors such as differential opportunity for advance access to the materials from which passages are to be sampled, leading to a strategy of selecting passages comparable in reading difficulty to those used on the job, but sampled from a wide variety of sources with the goal of minimizing the likelihood that the operational test contains passages that a candidate has previously encountered. Here, the inferential leap is modestly larger, but I believe that the sampling logic at the heart of content-related validity evidence still applies. Now consider a third scenario: rather than constructing a test based on reading passages tailored to the complexity of the job in question, an off-the-shelf reading comprehension test is selected, one that contains passages varying in complexity, including passages more complex than the reading matter used on the job. Here, in my judgment, sampling logic is violated, as the test contains material that cannot be construed as reflecting the type of reading material used on the job. Schmidt would appear to agree with this assessment, as he lists, as part of the process of establishing content validity, the requirement that the difficulty level of the measures of the required cognitive skills be comparable with the level at which those skills are used on the job.

All three of the tests outlined earlier (i.e., reading comprehension based on passages directly sampled from the job, passages matched to the job in terms of complexity, and passages chosen without regard to job complexity) are cognitive tests. Yet only the first two can lay claim to the relevance of content-related evidence of validity in support of establishing the predictive inference for the job in question. In the case of the third, possible support for the predictive inference would need to rest on other sources of evidence.

The arguments put forth by Schmidt (and with which I agree) regarding the reason why cognitive tests exhibit criterion-related validity are relevant here. Those arguments would posit that all three of the exemplar tests described earlier would load on a broad verbal factor, with that verbal factor subsequently loading on a higher-level general cognitive ability factor. Meta-analytic evidence indicates consistent criterion-related validity against job performance criteria for tests in this verbal domain, due in substantial part to general cognitive ability serving as an indicator of learning ability. Thus, I would expect that if a large-scale, high-quality criterion-related validity study were conducted on any of the three tests, positive relationships with the criterion of interest would be found. In the case of the first two tests, these would constitute examples of the predictive inference being supported by both criterion-related and content-related validity evidence. In the case of the third test, this would illustrate a setting in which content-related validity evidence could not be brought to bear, though criterion-related evidence would support the predictive inference.

This is jarring at first glance. By the logic of content-related validity evidence, the third test would be rejected due to its inclusion of material more complex than required by the job. But at the same time, criterion-related validity evidence may support the test. Even though the test includes material more complex than required by the job (and thus is rejected by sampling logic), candidate performance on items reflecting the more complex material can still signal learning ability that is relevant for the prediction of job performance. Thus, the claims made for the test become crucial: if the claim is made that ‘the test samples reading passages of the level of complexity required on the job, and thus individuals able to better comprehend those passages on the test can be expected to better comprehend reading material on the job,’ then that claim is not supported for the third test. If the claim is ‘the ability to comprehend reading passages of varying complexity reflects general learning ability that is predictive of job performance,’ then that claim may be supported for the third test. Both claims may be supported for the first two tests.

Note, though, that support for the predictive inference is but one factor affecting the decision as to whether to use a measure in a selection system. For example, questions may arise as to whether the use of more complex material results in a test with criterion-related validity comparable to, but subgroup mean differences larger than, a test limited to the level of complexity required on the job.

In sum, what the aforementioned discussion should make clear is that there are tests in the cognitive domain for which content-oriented validity evidence can support establishing the predictive inference. These would be tests of specific skills and abilities, developed from a job-analytic foundation, using items that sample a specified domain with psychological fidelity to the processes used on the job (e.g., reading passages comparable in complexity with those used on the job). At the same time, there are other tests in the cognitive domain for which this is not the case. These would include tests of abilities not linked to the job domain of interest via job analysis, tests not based on domain sampling logic, and tests at a level of complexity not comparable to the job setting.

References

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  • Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–300.
  • Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform Guidelines on Employee Selection Procedures. Federal Register, 43, 38294–38309.
  • Jeanneret, P. R. (2005). Professional and technical authorities and guidelines. In F. L. Landy (Ed.), Employment discrimination litigation: Behavioral, quantitative, and legal perspectives (pp. 47–100). San Francisco, CA: Jossey-Bass.
  • Loevinger, J. (1957). Objective tests as instruments of psychological theory [Monograph No. 9]. Psychological Reports, 3, 635–694.
  • McDaniel, M. A., Kepes, S., & Banks, G. (2011). The Uniform Guidelines are a detriment to the field of personnel selection. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 493–514.
  • Sackett, P. R. (2011). The Uniform Guidelines is not a scientific document: Implications for expert testimony. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 545–546.
  • Schmidt, F. L. (2012). Cognitive tests used in selection can have content validity as well as criterion validity: A broader research review and implications for practice. International Journal of Selection and Assessment, 20, 1–13.
  • Sharf, J. C. (2011). Equal employment versus equal opportunity: A naked political agenda covered by a scientific fig leaf. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 527–539.
  • Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: SIOP.
  • Tenopyr, M. (1977). Content–construct confusion. Personnel Psychology, 30, 47–54.