Cognitive Tests Used in Selection Can Have Content Validity as Well as Criterion Validity: A broader research review and implications for practice

Authors


  • Author Note. The author would like to acknowledge the contributions of the late Jack Hunter to the ideas expressed in this article.

Abstract

Many industrial/organizational (I/O) psychologists, both academics and practitioners, believe that the content validity model is not appropriate for cognitive ability measures used in personnel selection. They believe that cognitive tests can have criterion validity and construct validity but not content validity. Based on a review of the broader differential psychology research literature on cognitive skills, aptitudes, and abilities, this article demonstrates that with the proper content validity procedures, cognitive ability measures, including, ultimately, some de facto measures of general cognitive ability, can have content validity in addition to criterion and construct validity. Finally, the article considers, critiques, and refutes the specific arguments contending that content validity is inappropriate for use with cognitive skills and abilities. These research facts have implications for I/O practice, professional standards, and legal defensibility of selection programs.

1. Introduction

Many personnel psychologists, both academics and practitioners, believe that cognitive ability measures can have criterion validity and construct validity but that the content validity model is not appropriate for cognitive skills and abilities. As described later, this is also the position taken by agencies of the US government (e.g., the Equal Employment Opportunity Commission [EEOC]) in their efforts to enforce laws and regulations related to employment discrimination. This article challenges that position. It describes why and how content validity can be, should be, and has been used with measures of cognitive skills shown by job analysis to be required for job performance and how the aggregation of several specific content valid cognitive skills into a content valid selection test results in a de facto measure of general cognitive ability (GCA). This issue is important for professional practice standards, and it is especially important for practitioners who develop selection programs for business, industry, nonprofit organizations, and government at all levels, because it has important implications for the legal defensibility of those programs.

To provide the necessary foundation for later developments, this article first describes the history of research on the organization and meaning of human abilities and explicates the relations among cognitive skills, cognitive aptitudes, and GCA. Many personnel psychologists are not familiar with this research. These research findings show that the use of a sum or composite of three or more cognitive skill or aptitude measures has criterion-related validity for all jobs for which it might be used because a combination of three or more cognitive skills or aptitudes is a de facto measure of GCA. Empirical evidence is cited that demonstrates the universal validity of GCA for job and training performance. The article then summarizes the research showing the reasons why GCA universally predicts job performance. This theoretical explanation and the evidence demonstrating it are important because it is not sufficient to simply demonstrate empirically the generalizable criterion validity of GCA. There also needs to be an explanation for why the universal validity of GCA exists. This information is needed to set the stage for the main purpose of this article.

The article next describes how content validity can be used with measures of cognitive skills shown by job analysis to be required on the job. The conclusion is that in the domains of cognitive skills, aptitudes, and abilities, test development procedures that yield content validity also yield criterion-related validity. Demonstration of these facts is the main purpose of this article. In the final section, the article considers and refutes the arguments advanced in support of the position that cognitive skills are constructs and therefore cannot be validated via a content validity strategy.

It is important at the outset to distinguish between the approach taken in this article and that taken by Murphy and his colleagues (Murphy, 2009; Murphy, Dzieweczynski, & Zhang, 2009). The present article focuses on the fact that a test measuring specific cognitive skills that is properly constructed to be content valid will, in fact, have content validity – and will also be a measure of GCA and therefore have criterion validity. That is, if the test is composed of the specific cognitive skills shown by job analysis to be used in job performance, and meets the other content validity requirements, such a test will be both content-valid and a measure of GCA (and will therefore have two lines of validity evidence supporting its use). Murphy and his colleagues do not disagree with this conclusion. Rather, they emphasize the fact that a test made up of specific cognitive skills not shown to be directly used in job performance (and hence, a cognitive test without content validity) will usually have essentially the same substantial level of criterion validity as the content valid test – because it will likewise be a measure of GCA and hence will show the criterion validity that all reliable measures of GCA show. The focus of Murphy and his colleagues is on the fact that the degree of content validity of a cognitive skills test does not reveal much about the level of criterion validity. The present article does not dispute this fact. But this conclusion by Murphy and associates does not deny that a cognitive skills test can have content validity, which is the contention of the present article.

2. Cognitive skills, cognitive aptitudes, and GCA

Cognitive skills tests have been developed at different levels of generality. Tests of GCA have been developed to measure the general ability to learn. There are also tests for specific cognitive skills such as problem solving, following complex instructions, understanding words and numbers, etc. As demonstrated later, the sum of several such cognitive skills is a measure of GCA. GCA, in turn, is predictive of performance on virtually all jobs.

2.1. Criterion validity studies versus explanatory studies

There are two kinds of studies showing that cognitive skills have predictive validity for job and training performance: direct validation studies (criterion-related validity studies) and indirect validation studies (content validity studies). A direct validation study gathers data on job performance to estimate how well it is predicted by various potential methods of selecting applicants. While such studies can demonstrate criterion validity, they do not usually provide any understanding of why the predictor is valid. An indirect study or content validity study establishes validity by studying the psychological tasks and processes underlying work performance, which are often specific cognitive processes and skills (e.g., simple mental arithmetic or mental rotation of objects). The study starts with a specification of the particular domain of the job content to be sampled, is followed by a job analysis that reveals what skills and abilities are actually used in that performance domain and are required for high job performance, and creates a process for sampling these critical skills and abilities (Society for Industrial and Organizational Psychology [SIOP] Principles, pp. 21–24). A component of a job may require mental arithmetic, or it may require determination of how many boxes can be fitted into a certain storage space on a truck. Content validity studies are more powerful from an explanatory point of view because they reveal why a skill is relevant rather than just presenting raw numerical and statistical evidence of a linkage to a performance measure. However, a content validity study provides no quantitative estimate of the size of predictive validity.

2.2. Definition and nature of GCA

Cognitive skills determine how well individuals carry out mental operations: learning, thinking, remembering, perceiving, reasoning, etc. Many different cognitive skills have been defined and measured, and there is a large literature on the relations between those skills. This literature ultimately led to the concept and definition of GCA, as described later. A useful way to understand the definition and nature of GCA is by starting with a focus on the more numerous specific cognitive skills.

For our purposes here, we can simplify this discussion without loss of generality by considering only three levels of skill: specific skills (specific aptitudes), general aptitudes, and GCA. General aptitudes are higher order skills that are used in the development of more specific skills. GCA is the highest order skill and is used in the development of all the aptitudes and thus indirectly in the development of all specific skills as well. There are other levels of skill that are more specific than the specific skills that are considered in this article – for example, skills that are specific to single tasks. For instance, solving anagrams with five characters is slightly different from solving anagrams with seven characters. However, skills at that extreme level of specificity are usually studied only in the laboratory. Even the simplest subtask included in job performance can usually be broken into many component tasks at the narrow level studied in the laboratory. Hence, cognitive skills at this extreme level of specificity are less useful for personnel selection than somewhat less narrow cognitive skills.

The human performance literature extends back over nearly 100 years and has resulted in several taxonomies of cognitive skills. Perhaps the most widely accepted taxonomy is the one that stemmed from the groundbreaking review of over 100 factor analytic studies by John French (1951) (updated and extended in French, 1954; French, Ekstrom, & Price, 1963; Ekstrom, 1973; Ekstrom, French, Harman, & Dermon, 1976; followed up by Pearlman, 1979; Hirsh, Northrup, & Schmidt, 1986; Trattner, 1988). The other main taxonomist is Ed Fleishman (cf. Fleishman & Quaintance, 1984). Those who have devoted their research careers to cognitive taxonomy – the differentiation of cognitive skills from one another – have had to face the fact that the numerous specific skills can often be replaced functionally in applied psychology (i.e., in terms of predictive validity) by one overall GCA. This is a key finding in the human performance and abilities literature. The main reason for this result is the overarching importance of learning in performance of all kinds and hence of learning ability (Bloom, 1976), which is assessed by GCA measures. Learning ability tends to set the pace for the acquisition of all other cognitive skills – including those learned on the job (Schmidt & Hunter, 2004).

The fact that the cognitive skills are all positively correlated with each other was discovered first by Spearman (1904, 1927). The positive correlations between specific skills can be explained by relating them to higher order skills called ‘general aptitudes.’ That is, skills can be grouped into clusters within which they are very highly correlated. The data show that the learning of each specific skill at the primary level is determined in large part by the level of skill at a higher level, the level of the general aptitude that dominates the skills in that cluster. The main aptitudes studied in industrial/organizational psychology have been verbal aptitude, quantitative aptitude, and technical aptitude, although others have also been examined.

The general aptitudes are themselves highly correlated. The high correlation between aptitudes can be explained by relating them to a still higher order ability: GCA. The first key fact is that specific skills are determined by higher order skills called general aptitudes. The specific skills within a given cluster are highly correlated because the skills within that cluster are all determined in large part by the same aptitude. The second key fact is that all of the aptitudes are determined by a still higher order skill called GCA. These facts are the core of the hierarchical theory of cognitive skills and abilities that was originally developed by Spearman (1904, 1927) and Holzinger (1935, 1944) and extended by Gustafsson (1984, 2002). This theory has been confirmed by hundreds of factor analytic studies. Substantive reviews can be found in Vernon (1960) and Carroll (1993).

The most complete quantitative research to date on this structure of cognitive ability is the reanalysis of over 300 factor analytic studies conducted by John Carroll (1986, 1993). In this project, he collected and computerized over 6,000 references on individual differences in cognitive skill and ability. All 309 of the most important datasets were reanalyzed using a uniform factor analytic methodology that included second-order factor analysis. Carroll's reanalysis confirmed the general applicability of the hierarchical model.

Figure 1 illustrates this hierarchical organization of cognitive skills and abilities. Figure 1 is based on US Army data on over 16,000 soldiers (Hunter, Crossen, & Friedman, 1985). At the highest level, we see GCA, which causes the development of the general aptitudes (quantitative aptitude, verbal aptitude, and technical aptitude). These aptitudes, in turn, cause the development of the specific aptitudes (cognitive skills), such as arithmetic reasoning, word knowledge, and mechanical comprehension, as shown in the figure (note that the meaning of the general aptitudes can be inferred from the specific aptitudes that they cause; e.g., technical aptitude causes mechanical comprehension and electronics information). Figure 1 is illustrative but is not exhaustive because the number of general aptitudes is larger than the three depicted in Figure 1. For example, spatial aptitude is a general aptitude. Likewise, the number of specific aptitudes (cognitive skills) is much larger than the six depicted here.

Figure 1.

The hierarchy of mental abilities as found in the Armed Services Vocational Aptitude Battery. From Hunter et al. (1985, p. C61); MK, math knowledge; AR, arithmetic reasoning; WK, word knowledge; GS, general science knowledge; EI, electronics information; and MC, mechanical comprehension.

Figure 1 also illustrates the finding that it is GCA that drives job performance, not the aptitudes or the specific aptitudes. That is, after one controls for GCA, the aptitudes make no incremental contribution to the prediction of job or training performance over and above the contribution of GCA (Jensen, 1984). Note that there are no paths in Figure 1 from general or specific aptitudes to performance. What this means is that the factors specific to any particular aptitude (e.g., verbal) do not predict performance. This finding has been replicated using other large-scale databases. For prediction of job performance, these include Olea and Ree (1994), Ree, Earles, and Teachout (1994), Thorndike (1985, 1986), and Schmidt, Hunter, and Caplan (1981). For large sample studies demonstrating the same conclusion for performance in training, see Brown, Le, and Schmidt (2006) and Ree and Earles (1991). This important finding is reviewed and discussed in Ree and Earles (1992) and Schmidt, Ones, and Hunter (1992).

(An important methodological note here is that a valid test of the hypothesis of incremental validity requires controlling for measurement error in the measures of GCA and the aptitudes. Absent such control, results may falsely indicate incremental validity [Schmidt et al., 1981]. Note that the question here is not whether specific measures of aptitudes increment validity over a measure of GCA but whether the actual aptitudes themselves increment validity. So the question must be addressed at the true score level. See Schmidt et al. [1981] and Brown et al. [2006] for more details.)
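To make this point concrete, the following sketch uses hypothetical true-score correlations (chosen for illustration, not taken from the studies cited) to show why a specific aptitude adds no predictive power beyond GCA when its relation to performance runs entirely through GCA.

```python
import math

# Hypothetical true-score correlations, for illustration only.
r_gca_perf = 0.50    # GCA with job performance
r_apt_gca = 0.95     # a specific aptitude with GCA
# If the aptitude relates to performance only through GCA (no specific factor),
# its correlation with performance is the product of the two paths:
r_apt_perf = r_apt_gca * r_gca_perf   # = 0.475

# Squared multiple correlation of performance on GCA plus the aptitude
# (standard two-predictor formula).
r_squared = ((r_gca_perf ** 2 + r_apt_perf ** 2
              - 2 * r_apt_gca * r_gca_perf * r_apt_perf)
             / (1 - r_apt_gca ** 2))
print(round(math.sqrt(r_squared), 3))  # 0.5: identical to GCA alone, zero increment
```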

Figure 1 also illustrates why any measure that combines measures of three or more specific aptitudes is, in effect, a measure of GCA. This is because the measurement of GCA is operationalized by measuring the causal effects of GCA. For example, GCA causes arithmetic reasoning ability (through quantitative aptitude), word knowledge (through verbal aptitude), and mechanical comprehension (through technical aptitude). Therefore, a composite of reliable tests that includes measures of arithmetic reasoning, word knowledge, and mechanical comprehension will be a measure of GCA because such a measure will be correlated nearly perfectly with GCA. The key point is that GCA is not (and cannot be) measured directly but rather is measured through its effects or products (at a lower level, the same thing is true of the general aptitudes). For this reason, these products or effects are often called ‘indicators’ of GCA (Gustafsson, 1984, 2002).

The data presented in Figure 1 are typical of such findings. We can use the principles of path analysis (the tracing rules) in Figure 1 to determine the correlations of the specific aptitudes with GCA. The correlation between GCA and arithmetic reasoning is (.93)(.87) = .81. The correlation between GCA and word knowledge is (.93)(.86) = .80. The correlation between GCA and mechanical comprehension is (.93)(.84) = .78. The average correlation among the tests measuring these specific aptitudes is .55 (Hunter et al., 1985). Based on this information, we can use the formulas for the correlation of composites (Nunnally & Bernstein, 1994) to compute the correlation between the sum of these three tests and GCA. This correlation is .95. This correlation is so high that the sum of these three specific aptitudes functions as a measure of GCA. This process applies to any composite of three or more specific aptitude tests used to predict performance for any given job. In fact, this is how all major tests of GCA are created; that is, by combining measures of specific cognitive skills.
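The composite calculation described in the preceding paragraph can be reproduced directly; the short sketch below uses the path coefficients and average intercorrelation given above, with the helper function and variable names introduced purely for illustration.

```python
import math

# Path coefficients from Figure 1 (Hunter et al., 1985):
# GCA -> general aptitude, then general aptitude -> specific aptitude.
# Tracing rule: r(GCA, specific) = product of the two path coefficients.
r_gca_arith = 0.93 * 0.87  # arithmetic reasoning (via quantitative aptitude) = .81
r_gca_word = 0.93 * 0.86   # word knowledge (via verbal aptitude)            = .80
r_gca_mech = 0.93 * 0.84   # mechanical comprehension (via technical)        = .78

def composite_correlation(r_with_each, mean_intercorrelation):
    """Correlation of an external variable with an unweighted sum of k
    standardized components (formula for correlations of composites)."""
    k = len(r_with_each)
    composite_sd = math.sqrt(k + k * (k - 1) * mean_intercorrelation)
    return sum(r_with_each) / composite_sd

r = composite_correlation([r_gca_arith, r_gca_word, r_gca_mech],
                          mean_intercorrelation=0.55)
print(round(r, 2))  # ~0.95: the three-test sum functions as a measure of GCA
```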

An important result from this research for the purposes of personnel selection is that all of the different cognitive skills are highly correlated with each other. To a considerable extent, the person who is high on one tends to be high on all, and the person who is low on one tends to be low on all. This fact led to the current professional consensus regarding GCA (Carroll, 1993).

The same cognitive skill can be used in many different tasks, even tasks that may not have much superficial similarity to each other. A given specific cognitive skill is often measured with some ‘marker’ test that has been developed to measure that specific skill (Ekstrom, 1973; Ekstrom et al., 1976). Most of these marker tests are paper-and-pencil tests. There is a large factor analytic literature showing that the paper-and-pencil skills tests measure the same cognitive skill that is used in corresponding practical tasks (Vernon, 1960; Breland, 1977). People unfamiliar with this literature sometimes doubt that paper-and-pencil tests can really measure the skills named. This, in turn, can lead to doubt that paper-and-pencil tests can predict real-world behavior such as job performance. The claim that tests cannot predict real-world performance is one form of a concern voiced early in the 20th century in the psychological literature on intelligence tests: ‘How can a test with just words, numbers, and pictures predict factory performance with real, three-dimensional objects? Can paper-and-pencil tests really predict job performance?’ Thousands of studies bearing on this question have now been conducted, including over 500 studies done by the US Department of Labor's Employment Service (Hunter, 1980a, 1980b) and well over 500 studies done by the US military (Hunter, 1983a, 1983b, 1983c, 1984a, 1984b; Hunter et al., 1985; McHenry, Hough, Toquam, Hanson, & Ashworth, 1990; see also Campbell, 1990). These are direct validation studies that gather data measuring job performance and compute the extent of prediction of job performance for the workers in that study. As discussed later, these studies clearly show that paper-and-pencil tests do indeed predict real-world performance.

The discussion earlier points out that the direction of causality is from GCA to the general aptitudes and from the general aptitudes to the specific aptitudes. This point is important because descriptions of the hierarchical model are sometimes seen that erroneously assume the opposite direction of causality. For example, it is stated that groupings of specific aptitudes combine to produce general aptitudes and that the general aptitudes then combine to produce GCA. This is the opposite of the actual direction of causality (Carroll, 1986, 1993). The causal direction in the hierarchical model is from the top down. It is only the data analysis process that moves from bottom to top.

3. Local validation studies

Local validation studies are usually either content validity studies or criterion-related validity studies. In this section, we discuss criterion-related studies. Content validity studies are discussed in a later section. Most criterion-related local validation studies are attempts to determine whether particular tests or other predictors will predict performance in a particular setting. Thirty years or so ago, there was a widespread belief that the best evidence of the validity of a test in a given organization was a ‘local’ validation study conducted in that organization for that job. But the number of employees available in any particular organization on any particular job is almost always so small that the results of the local study are unstable statistically (Schmidt, Hunter, & Urry, 1976). If the number of workers available for the study is less than about 2,000, the estimate of predictive validity will suffer from large statistical sampling error, and as a result, the confidence interval around the validity estimate will be wide (Schmidt & Hunter, 1978) (even with N = 2,000, the 95% confidence interval width is .09). In addition, statistical power will be limited. The depth of the problem of sampling error can be seen in comparing the needed sample size of around 2,000 to the actual sample size. Local validation studies have traditionally had an average of only 68 workers (Schmidt et al., 1976) – only about 3% of the number needed for stable statistical estimation. In more recent years, sample sizes have increased only marginally. For example, in the Russell et al. (1994) meta-analysis of all the criterion-related validity studies published in Journal of Applied Psychology and Personnel Psychology between 1964 and 1992, the median sample size was 103. Salgado (1998) examined the sample sizes of all criterion-related validity studies published between 1983 and 1994 in Journal of Applied Psychology, Personnel Psychology, and Journal of Occupational and Organizational Psychology and also found a median sample size of 103. A sample size of 103 contains large amounts of sampling error and does not produce statistically stable validity estimates (with N = 103, the width of the 95% confidence interval is approximately .43). It is likely that median sample sizes for unpublished studies are even smaller because sample size is one of the methodological factors taken into account by reviewers and editors in accepting or rejecting studies submitted for publication. These facts explain why the American Psychological Association (APA) (1999) Standards for Educational and Psychological Testing (henceforth, Standards or APA Standards) include the following statement:

Sample size affects the degree to which different lines of evidence can be drawn on in examining validity for the intended inference to be drawn from the test. For example, relying on the local setting for empirical linkages between test and criterion scores is not technically feasible with small sample sizes. (p. 153)

Local validation studies have other problems in addition to sampling error due to small sample sizes: error of measurement, range restriction, dichotomization of continuous variables, etc. Most of these problems can be corrected at the level of the local validation study, but while these corrections eliminate bias in estimates, they also increase sampling error, and hence, make the problem of sampling error even more severe (i.e., confidence intervals become even wider). So even a correctly conducted local validation study usually provides only a very imprecise estimate of predictive validity.
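As a rough illustration of the sampling error problem described in this section, the sketch below computes approximate 95% confidence interval widths for an observed validity coefficient at several sample sizes using the Fisher z transformation. The assumed observed correlation of .25 is illustrative only, and the widths quoted above also reflect particular assumptions and artifact corrections, so the sketch reproduces the pattern rather than the exact figures.

```python
import math

def ci_width_for_r(r_obs, n, z_crit=1.96):
    """Approximate 95% confidence interval width for an observed correlation,
    using Fisher's z transformation."""
    z = math.atanh(r_obs)
    se = 1.0 / math.sqrt(n - 3)
    lo, hi = math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
    return hi - lo

# Assumed observed validity of .25, for illustration.
for n in (68, 103, 2000):
    print(n, round(ci_width_for_r(0.25, n), 2))
# The interval shrinks from roughly .45 at N = 68 to under .10 at N = 2,000,
# which is why single local studies yield statistically unstable estimates.
```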

4. Validity generalization and meta-analysis

The solution to the problems with local validation studies is meta-analysis of multiple studies as initially described by Schmidt and Hunter (1977). That article noted that if many local validation studies are considered together, it is possible to solve the problem of sampling error. For example, if 30 local validation studies are considered together, the total number of workers may be about 30 × (68) = 2,040. That is, cumulating the results of many local validation studies can provide the statistically accurate estimates needed for precise inference as to predictive validity. Schmidt and Hunter (1977) proposed a statistical method of meta-analysis for cumulating the results of local validation studies that they called validity generalization.
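The basic mechanics of a bare-bones validity generalization analysis in the Schmidt and Hunter tradition can be sketched briefly. The study values below are invented for illustration; the point is simply how pooling across studies separates real variation in validity from sampling error.

```python
# Bare-bones psychometric meta-analysis (in the style of Hunter & Schmidt, 2004).
# Each tuple is (sample size, observed validity) from a hypothetical local study.
studies = [(68, .18), (95, .31), (120, .22), (75, .09), (150, .27),
           (60, .35), (110, .15), (90, .24)]

total_n = sum(n for n, _ in studies)
mean_r = sum(n * r for n, r in studies) / total_n   # N-weighted mean validity

obs_var = sum(n * (r - mean_r) ** 2 for n, r in studies) / total_n
avg_n = total_n / len(studies)
sampling_var = (1 - mean_r ** 2) ** 2 / (avg_n - 1)  # variance expected from sampling error alone
residual_var = max(obs_var - sampling_var, 0.0)      # variance remaining after sampling error

print(f"Pooled N = {total_n}, mean observed r = {mean_r:.3f}")
print(f"Observed variance = {obs_var:.4f}, expected sampling error variance = {sampling_var:.4f}")
print(f"Residual variance = {residual_var:.4f}")
# If most of the between-study variance is sampling error, the apparent
# disagreement among local studies is illusory and validity generalizes.
```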

There is now a consensus among personnel psychology researchers and practitioners that validity generalization provides more accurate estimates of predictive validity than does a typical local validation study (Sackett, 2003). The need for correction for the problems of local validation studies has been noted in the 2003 Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology [SIOP], 2003; henceforth Principles or SIOP Principles). Both the problems of the local validation study and the potential solution using validity generalization have been noted in commissions set up by the federal government. For example, in 1985, the Department of Labor commissioned the National Academy of Sciences to appoint an expert committee (National Research Council committee) to examine the General Aptitude Test Battery, a battery of cognitive skills tests developed by the US Employment Service. In its report (Hartigan & Wigdor, 1989), the committee recognized the consensus on validity generalization. The APA (1999) Standards also recognize and accept the use of meta-analysis in validity calibration and generalization, as do the SIOP Principles (the APA Standards are issued jointly by the APA, the National Council on Measurement in Education, and the American Educational Research Association). In summary, the best data on criterion validity come from the cumulative results (i.e., the cumulative knowledge) from the thousands of local validation studies that have been conducted. In personnel selection research, this cumulative knowledge based on the accumulation of results across studies is called validity generalization; in other scientific areas, it is called meta-analysis (Hunter, Schmidt, & Jackson, 1992; Hunter & Schmidt, 1990; Hunter & Schmidt, 2004).

5. The predictive validity of GCA: direct evidence

The fact that GCA predicts job performance and training success on all jobs need not be theoretically or logically demonstrated. It can be and has been demonstrated by the brute force of empirical studies showing positive correlations for large representative samples of jobs. These validity generalization-based research findings have been summarized in a number of publications, including Hunter and Hunter (1984), Hunter and Schmidt (1996), Hunter, Schmidt, and Le (2006), Schmidt and Hunter (1998), Schmidt (2002), and Schmidt, Shaffer, and Oh (2008). A summary of these validity generalization studies can be found in Schmidt (F. L. Schmidt, Construct, content, and criterion validity of ability and aptitude measure: Implications for practice, Unpublished manuscript), which includes the results of 11 large meta-analyses on the prediction of job performance and 12 on the prediction of training performance. These studies come from both North America and Europe. Each of these meta-analytic studies contains multiple separate meta-analyses for individual job classes and individual GCA measures for a total of 62 separate meta-analyses. Because these findings on the criterion validity of GCA measures are now familiar to, and accepted by, most or all personnel psychologists, we do not present them in detail. For present purposes, one recent example should suffice. The best estimate of the validity of GCA measures for medium complexity jobs (63% of the US workforce) is .66. Validities range from .74 for the most complex jobs down to .39 for the least complex jobs (Hunter et al., 2006, Table 3). These validity values are corrected for criterion unreliability and indirect range restriction but not for predictor unreliability. The application since 2006 of a more accurate correction for range restriction has resulted in somewhat higher validity values for GCA (cf. Hunter et al., 2006; Le & Schmidt, 2006; Schmidt, Oh, & Le, 2006).

6. Why does GCA predict job performance?

Some – especially laymen (including some judges) – find raw validity correlations too abstract to be convincing. The theoretical (i.e., explanatory) basis for validity is not contained in the criterion validation studies. These studies do not tell people why it is that scores on paper-and-pencil tests should predict such important real world performances – and to many people, perhaps most, it seems a dubious proposition that they would or could. The theoretical explanation for validity is shown in the data that relate ability, knowledge, and performance. This research is based on the use of path analysis to test causal theories of job performance. It is reviewed in Schmidt and Hunter (1992) and is briefly summarized here.

The major direct causal impact of GCA has been found to be on the acquisition of job knowledge, which, in turn, has a large causal impact on job performance. That is, the major reason why employees with higher GCA levels have higher job performance is that they acquire job knowledge more rapidly and acquire more of it. This knowledge of how to perform the job then causes their job performance to be higher (McDaniel, 1985; Hunter, 1986). Even jobs often considered to be ‘simple’ (low-complexity jobs) require employees to master substantial amounts of job knowledge (Schmidt & Hunter, 1992). As a result, employees who do not know how to do the job cannot do the job well. Thus, the major effect of GCA on job performance is indirect and is mediated by job knowledge acquisition. There are many kinds of knowledge that are important on most jobs. The largest block of knowledge is usually procedural: what to do, how to do it, when to do it, when not to do it, whom to coordinate with, whose help is needed, who needs your help, and so on. Individual differences in learning procedural knowledge are very large, and these differences are primarily determined by GCA (Hunter & Schmidt, 1996). The fact that procedural knowledge is universal in work means that differences in GCA will be universally relevant to the prediction of job performance.

There is also a direct effect of GCA on job performance, but it is smaller. This direct effect probably results from the use of GCA to solve problems encountered on the job for which standard job knowledge does not provide an answer (rare, unique, or unprecedented problems). For nonsupervisory jobs, this direct effect is about 20% as large as the indirect effect; for supervisory jobs, it is about 50% as large (Schmidt, Hunter, & Outerbridge, 1986; Borman, White, Pulakos, & Oppler, 1991). Hence, the research literature does provide an explanation for why measures of GCA predict the important real-world variables of job and training performance. This explanation helps to make the research findings on the predictive validity of GCA more plausible and easier to accept for many laymen and others, including judges. This fact is important in any defense of GCA measures used in selection.
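A small numerical sketch may help make this causal structure concrete. The path coefficients below are hypothetical (the cited studies report varying values); the sketch simply shows how the indirect effect through job knowledge is computed and compared with the direct effect.

```python
# Hypothetical standardized path coefficients, for illustration only.
gca_to_knowledge = 0.55          # GCA -> job knowledge acquisition
knowledge_to_performance = 0.60  # job knowledge -> job performance
gca_to_performance_direct = 0.07 # direct GCA -> performance path (nonsupervisory job)

indirect_effect = gca_to_knowledge * knowledge_to_performance  # effect mediated by knowledge
direct_effect = gca_to_performance_direct

print(f"Indirect effect via job knowledge: {indirect_effect:.2f}")
print(f"Direct effect:                     {direct_effect:.2f}")
print(f"Ratio direct/indirect:             {direct_effect / indirect_effect:.2f}")
# With these illustrative values, the direct path is about 20% as large as the
# indirect path, matching the pattern described for nonsupervisory jobs.
```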

The remainder of this article is devoted to demonstrating that cognitive tests (GCA tests) can have content validity in addition to criterion-related validity. A natural question on the part of the reader at this point is: ‘If we already have criterion-related validity for GCA tests, why do we need content validity? Isn't this overkill?’ One answer is that both the 2003 SIOP Principles and the 1999 APA Standards no longer focus on discrete types of validity. Instead, they define validity as a unitary concept that is ideally supported by multiple lines of evidence (Principles, p. 4; Standards, p. 17). The more different lines of validity evidence supporting test use and interpretation, the stronger is the foundation for an inference of validity. Hence, for both professional purposes and legal defensibility purposes, it is an advantage to have both criterion and content validity as the foundation for test use.

There is another potential value of content validity evidence. The content validity evidence is based on a local content validity study, while the criterion-related validity evidence is derived from cumulative validity generalization studies. For legal defensibility purposes, it might often be useful to be able to demonstrate the linkage to the local job context and specific local job content as opposed to sole reliance on published validity generalization research.

Another consideration is that the Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice, 1978; henceforth Uniform Guidelines) state the following:

A selection procedure based upon inferences about mental processes cannot be supported solely or primarily on the basis of content validity. Thus a content strategy is not appropriate for demonstrating the validity of selection procedures which purport to measure traits or constructs, such as intelligence, aptitude, personality, common sense, judgment, leadership, or spatial ability. [Section 1607.14C(1)]

The position of EEOC has been that this language precludes the use of content validity validation with cognitive measures, and this position has been upheld by some (but not all) courts, most recently US v. City of New York (2009) (F. Supp. 2d 419). The UGLs were issued jointly by EEOC, the Office of Federal Contract Compliance (OFCC), and the Department of Justice (DOJ; Civil Rights Division), and all three agencies have adopted this same interpretation in their enforcement efforts, including in court actions. Finally, it is possible that this position could be adopted internationally by other countries. Hence, it is very important to address the question of the scientific validity of this interpretation.

7. Why cognitive tests can have content validity

This section presents evidence showing that measures of GCA and other cognitive tests can have content validity. Such content-valid cognitive measures are not deliberately constructed to be GCA measures, but instead, are composed of assemblies of specific cognitive skills shown by job analysis in a content-oriented validation study to be used in job performance.

A key requirement for content validity is a showing that the skills or abilities measured by the test are also used in performance of the job. The content validity study (a) starts with a specification of the particular domain of the job content to be sampled (Standards, Standard 14.8, p. 160), (b) is followed by a job analysis that reveals what cognitive skills are used in that performance domain and are required for high job performance, (c) creates a process for sampling these critical skills and abilities, and (d) then creates measures of these skills and abilities (SIOP Principles, pp. 21–24; Ployhart, Schneider, & Schmitt, 2006, pp. 313–319). The domain need not include all performances involved in the job but can cover a portion or subarea of job performance (Principles, pp. 22, 24). Another consideration is that the level of difficulty of the measures of the required cognitive skills should be comparable with the level at which these skills are used on the job (Standards, Standard 14.11, p. 160). In short, all the usual procedural requirements for content validity apply in the case of cognitive skills.
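One way to operationalize steps (b) and (c) is a simple task-by-skill linkage analysis. The sketch below is purely hypothetical: the rating scale, retention threshold, tasks, and skills are assumptions for illustration, not requirements of the Principles or Standards.

```python
# Hypothetical SME linkage ratings: 0 = not used in the task, 3 = essential to the task.
# Rows are critical job tasks from the job analysis; columns are candidate cognitive skills.
skills = ["reading comprehension", "mental arithmetic", "spatial visualization"]
linkage_ratings = {
    "interpret product order printouts": [3, 2, 0],
    "compute load weights":              [1, 3, 1],
    "plan box placement in truck":       [0, 2, 3],
    "write shift summary reports":       [3, 1, 0],
}

RETENTION_THRESHOLD = 1.5  # assumed cutoff for a skill's mean linkage across tasks

n_tasks = len(linkage_ratings)
for j, skill in enumerate(skills):
    mean_linkage = sum(row[j] for row in linkage_ratings.values()) / n_tasks
    decision = "retain" if mean_linkage >= RETENTION_THRESHOLD else "drop"
    print(f"{skill:25s} mean linkage = {mean_linkage:.2f} -> {decision}")
# Retained skills are then measured with items written at the difficulty level
# at which the skill is actually used on the job (cf. Standards, Standard 14.11).
```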

For some personnel psychologists, there is still some hesitation in using the phrase ‘content validity’ for cognitive skills. Some would use only the phrase ‘construct validity’ in the case of cognitive skills. The last section of this article addresses this objection and finds that it does not hold up logically (see the section headed The Argument against the Use of Content Validity with Cognitive Skills).

8. Job analysis, content validity, and GCA

Virtually every validation study begins with a job analysis to determine what skills are used on that job. Such a job analysis is part of standard practice for industrial/organizational psychologists. Thus, virtually every validation study begins with an analysis of the content of the job and the skills relevant to performance on that job.

8.1. Different approaches to job analysis

There are a variety of approaches to job analysis (McCormick, 1979; Pearlman, 1980; Cornelius, Schmidt, & Carron, 1984; Brannick, Levine, & Morgeson, 2007; Pearlman & Sanchez, 2010), and to some extent, these fall along a continuum running from a micro-focus on specific (often narrow) job tasks to a focus on broad knowledge, skills, abilities, and other characteristics required for high job performance (Pearlman, 1980; Pearlman & Sanchez, 2010). The most common specific examples of these two approaches are task analysis with its immediate performance focus and a broad analysis of the job where the emphasis is on the evaluation of job complexity or job level (which may be determined by the time required to learn the job or some other index of job complexity [Pearlman, 1980; Cornelius et al., 1984]).

Those who start with a task analysis usually produce a psychological job analysis focused on specific cognitive skills needed for performance of specific job tasks. They end up with a content-valid test that measures GCA because they find that when they combine across the different specific cognitive skills used in different tasks, the resulting final test contains many different specific cognitive skills. These cognitive skills might be quite specific (e.g., the ability to read product order printouts or the ability to identify misspelled words in reports) or somewhat broader (e.g., the ability to comprehend long passages written at the college level of difficulty; i.e., reading comprehension). But in any case, all such cognitive skills are directly tied to tasks contained in the job as shown by job analysis. This is a content validity-based approach. One area in which this approach has been used with some frequency is in the construction of selection tests for entry level firefighters and police personnel (cf. Murdy & Norton, 1972; Bownas & Heckman, 1976; City of Milwaukee, 1978; Payne & Van Rijn, 1978; Northrup, 1979; Merit Employment Assessment Services, 1980; International Personnel Management Association, 1989; Morris, 1990).

Those who start with a broader job analysis approach wind up focusing on the need to use learning ability as the key to successful selection and successful job performance. They recommend a measure of GCA for that reason. The test that they build is a test that measures several cognitive skills or aptitudes because this is the current best technology for building a test of GCA. This is not a content validity approach but rather relies on the cumulative evidence for the criterion-related validity of GCA.

The fact that the two resulting tests in the examples given here can be viewed as similar – and are similar – does not mean that the underlying theories or job analyses are similar. The problem is not that one or the other job analysis approach is wrong, but rather that each of the two basic approaches is conceptually incomplete. The fact is that performance on all known jobs both requires considerable learning and makes use of many of the known specific cognitive skills (although different jobs use those specific skills to different degrees).

9. The argument against the use of content validity with cognitive skills

An argument is sometimes advanced against the use of content validity when the predictors are cognitive skills. This argument holds that cognitive skills and abilities are constructs and as such cannot be validated using content validity. As indicated earlier, this is the position argued by the EEOC, the OFCC, and the DOJ Civil Rights Division. This section shows that this is a false argument. The argument can be summarized as follows: (a) abilities are mental processes; (b) mental processes are not directly observable; (c) a variable that is not directly observable is a construct; and (d) the use of a construct cannot be justified by content validity under professional standards or under the EEOC UGLs – only criterion-related validity or construct validity can justify its use.

The conclusion is that the use in selection of specific cognitive skills and abilities or aptitudes such as verbal comprehension, numerical ability, or inductive reasoning cannot be justified based on a content validity methodology. It is important to evaluate this argument in detail. This argument has a surface plausibility when paired with some of the UGL language taken in isolation – in particular, Section 1607.14C(1), quoted earlier. Yet it contradicts the UGL taken as a whole. The UGL clearly states that content validity can be used to justify a job knowledge test. No one has ever suggested that knowledge is not a mental process. Thus, the UGL clearly provides for the use of content validity to justify a selection device based on mental processes. The UGL also provides for the use of content validity to justify tests of skill, yet there is no known skill that does not depend on mental processes. If the writers of the UGL had not intended that content validity be used for knowledge and skills, then they would presumably not have provided explicit instructions for the use of content validity with such measures. This interpretation of the UGL is consistent with the SIOP Principles and the APA Standards.

The key points in critiquing the previous argument are as follows: (a) observing in science is not seeing or visual detection, but measuring; (b) all important work components are mental processes, in particular, all work behaviors are mental processes; (c) most narrowly defined mental processes are directly observable even though they cannot be seen by the eye (i.e., are not visible); and (d) the word ‘construct’ refers to variables that are defined within a theory in terms of other theoretical variables that are part of that theory; hence, construct validation is required only when the measurement process for the construct must be validated by validating the larger theory itself – a condition that does not occur in the case of cognitive skills. We now consider these points one by one.

9.1. Observation is not seeing

Those who endorse this argument refer to the word ‘observable’ as if it means ‘can be seen.’ Yet by that criterion, almost nothing in science is observable. Consider three examples from physics: weight, time, and electricity. You cannot see the weight of an object (since different objects have different densities). But if you attach it to a spring scale, you can easily observe and measure its weight. You cannot see time, yet you can observe its passage. With a clock, you can not only observe time but also measure it. You cannot see electricity, but with an ammeter, you can both observe and measure its amount. Note that the numbers on the ammeter are not the electricity itself; they are the coding of its amount. This is characteristic of science: observation does not mean seeing; it means measurement.

Consider an ability such as ‘adding two digit numbers.’ You cannot see a person's ability by looking at the person. But the ability is very easily observable in most adults under most conditions. If the person can hear, and can speak English, and is not drunk or asleep or otherwise disabled, and if the person is motivated to answer questions, then you need only ask the person ‘What is the sum of 25 and 36?’ If the person can indeed add two digit numbers, then with high probability, the person will answer ‘61.’ Thus, the ability becomes observable, even though it cannot be seen. Of course, observation and measurement are not perfect. The person might answer ‘61’ by guessing. The person might answer ‘51’ because of nervously forgetting to carry. These error processes lead to the technical requirements of measurement, in particular the use of multiple items in a test to average across the different error processes to reduce measurement error. This is the reason why, other things equal, longer tests are more reliable (and therefore more valid) than shorter tests.
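The value of multiple items can be quantified with the classical Spearman-Brown formula for the reliability of a lengthened test; the single-item reliability assumed below is illustrative only.

```python
def spearman_brown(reliability, lengthening_factor):
    """Reliability of a test lengthened by the given factor (classical test theory)."""
    k = lengthening_factor
    return k * reliability / (1 + (k - 1) * reliability)

# Assume a single arithmetic item has reliability .20 (mostly error: lucky guesses,
# slips in carrying, momentary inattention).
single_item_reliability = 0.20
for n_items in (1, 5, 10, 20, 40):
    print(n_items, round(spearman_brown(single_item_reliability, n_items), 2))
# Averaging over more items washes out the independent error processes,
# which is why longer tests are more reliable, other things being equal.
```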

The test score is not the ability itself; it is the coding of the results of the measurement process. This coding constitutes the observation of that ability.

9.2. The definition of ‘mental process’

Those who argue that content validity cannot be used with cognitive skills because cognitive skills involve nonobservable ‘mental processes’ do not give an empirical definition of the phrase ‘mental process.’ The phrase is defined later and related to the terms ‘work behavior’ and ‘ability.’

The 19th century was characterized by a vigorous search for the causal locus of perception, action, and behavior. This led to the discovery of the central nervous system and the tracing of the neural pathways to and from the brain. One of the conclusions established by the year 1900 was that the phrase ‘mental process’ could be exactly translated into ‘activity in the brain.’ Every significant human event or behavior is a mental process: seeing, hearing, recognition, memory, thought, and action. All can be identified with activity patterns in the brain. Today, this is done directly using functional magnetic resonance imaging (fMRI). Words such as ‘mind’ or ‘mental’ simply represent the linguistic expression of brain activity developed before the relevant biological facts were known.

Consider the skill of typing. Once it was believed that dexterity was in the fingers, that courage was in the heart, and so on. Many primitive tribes believed that if you ate your enemy's hands or heart, you could obtain his or her dexterity or his or her courage, respectively. It is now known that the skill of typing is in the brain. Even though the hand, arm, and shoulder may be physically and physiologically perfect, a lesion in the brain may render a previously excellent typist completely incapable of typing. The lesion might be in the parietal lobe and destroy the typist's manual dexterity. Or the lesion might be in the temporal lobe and destroy the linguistic abilities that are prerequisite to the task. For example, in certain aphasias, a typist might be able to type what he or she hears but not what he or she reads (or vice versa).

Every work behavior is a mental process. Consider a police officer writing a ticket. The brain not only dictates the words to be written, it also dictates the writing itself. Slow motion photography reveals what every elementary school teacher knows: that writing is a laborious process composed of hundreds of discrete actions that are each guided by the brain on the basis of visual and kinesthetic (i.e., nerves that start in the muscles, bones, etc.) feedback. These motions are adjusted by the brain depending on the angle of the paper, the thickness of the writing implement, the emotional mood of the officer, etc.

The only known movements that are not governed by mental processes are the spinal reflexes such as the knee-jerk reflex (Dewey, 1896). And even these are subject to cognitive control in situations where this is required. For example, the knee-jerk reflex can be suppressed by a mental process if a police officer bangs a knee on a car bumper so that the reflex does not interfere with postural adjustment and balance.

Those who endorse the argument in question draw a distinction between ‘observable work behaviors’ and ‘mental processes.’ The distinction is false; work behaviors are a special case of mental processes.

9.3. Observable abilities

Consider police officers as an example. If an officer cannot understand the words in city codes, then he or she cannot issue correct citations, cannot write correct reports, and cannot give accurate testimony in court. It is true that we cannot see the neural events that constitute the mental process of recognizing the meaning of ‘felon.’ Indeed, at the present time, we do not even know what specific events in the brain accompany such recognition (although research using fMRI is close to answering this question). But we do have over 80 years of empirical research showing that the ability to answer vocabulary test items measures the ability to understand the meaning of words in the sense that it correlates perfectly with any other method of measuring such understanding (when corrected for biasing artifacts such as measurement error). Virtually every ability, aptitude, and cognitive skill used in employment testing has a research history of at least 40 years and has been used and validated in many studies by many different investigators for many different purposes. For example, there are literally thousands of studies showing that vocabulary measures verbal comprehension, not only of words, but of sentences, paragraphs, and text segments. Hundreds of studies have shown that vocabulary correlates with level of training success in programs such as the police academy. We cannot see the neural events that make up the ability, but we can observe the presence of the ability and measure its amount with tools proven by nearly a century of empirical work.
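The parenthetical reference to correcting for measurement error is the classical correction for attenuation; the sketch below shows the calculation with assumed values chosen for illustration.

```python
import math

def correct_for_attenuation(r_observed, rel_x, rel_y):
    """True-score correlation between two measures given their reliabilities
    (classical correction for attenuation)."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Assumed values: two different measures of verbal comprehension correlate .80
# as observed, and each has reliability .80.
r_true = correct_for_attenuation(0.80, 0.80, 0.80)
print(round(r_true, 2))  # 1.0: at the true-score level the two measures are interchangeable,
                         # illustrating the sense in which they correlate 'perfectly'
```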

9.4. The definition of ‘construct’

The UGL gives no definition of ‘construct,’ and neither do advocates of the argument being critiqued here. This is not surprising since most scientists in the area of personnel selection have never engaged in a theoretical study. Indeed, this is why researchers and practitioners in the area of personnel selection have never used construct validation in its original scientific sense: they have never had to in their research and practice. This lack of experience has led some to misinterpret the meaning of construct validation as laid out by Cronbach and Meehl (1955) in their original work for the APA Standards for test validation. This article will now give a definition of ‘construct’ as used in construct validation and illustrate that definition in relation to abilities relevant to police work.

Definition: a construct is a variable that is defined in theoretical terms. That is, a construct is a variable that is not defined directly in terms of empirical measurement operations but in terms of some particular theory. Construct validation is required if the theory has not been independently verified; that is, construct validation is required if the measurement operations must be validated along with the theory itself. For example, a well-known construct in psychology is the construct ‘anxiety.’ This term has no concrete definition in the English language and hence can only be used if such a definition is created by the scientist. Every existing definition has been phrased within some specific theory. For example, in psychoanalytic theory, anxiety is defined as the conscious manifestation of unchanneled libidinous energy. In stimulus response theory, anxiety is defined as the amount of arousal in the autonomic nervous system. Both definitions are phrased in terms of other theoretical terms that are only meaningful if the theory itself is substantially correct. For example, the psychoanalytic theory assumes the existence of libido; the stimulus response theory assumes the existence of a unitary drive mechanism that underlies activity in the autonomic nervous system.

Consider the term ‘leadership.’ It was once thought that leadership was a narrowly defined ability such as verbal comprehension or numerical ability. At that time, it was believed that all we needed to do to measure leadership was to go out into a variety of group contexts and see what differentiates good leaders from bad ones. Alas, the empirical research showed this belief to be false. The early belief that good leaders are strong persons who rule their subordinates with an iron fist was discredited by multiple studies showing that the successful leader is often one who is considerate of his or her subordinates and who shares decision making with them. On the other hand, the belief that the good leader can be defined in terms of consideration and power sharing was discredited by studies showing that authoritarian leaders such as Steve Jobs are sometimes more successful. There are now a number of theories that try to specify the conditions that determine what kind of leader will be successful in different situations. The construct of ‘leadership’ is defined differently in each such theory. In the case of each such theory, the validation of its construct of ‘leadership’ will be inextricably linked to the validation of the theory itself. If the theory is supported when tested, then the construct is supported. If the theory is disconfirmed by empirical evidence, then the construct is discredited.

The abilities measured in cognitive employment tests are not constructs. The definitions are not given in theoretical terms, but in terms of empirical measurement operations, that is, operationalizations. Verbal comprehension is defined in terms of the extent to which a person can correctly answer questions about the meaning of words. The operational definition makes only the assumption that verbal comprehension when writing out a ticket is the same as verbal comprehension while reading city code, is the same as verbal comprehension while reading training manuals, is the same as verbal comprehension while taking a verbal test. This assumption of empirical generality has been verified in hundreds of prior studies showing large correlations between different assessments of verbal comprehension. Since the cognitive abilities and skills measured in employment tests are not constructs, they are suitable for use in a content validity methodology.

In summary, the arguments that have been advanced against the use of the content validity model with cognitive skills, aptitudes, and abilities do not stand up to close scrutiny and must be rejected.

10. Overall summary

In summary, this article challenges the belief held by many personnel psychologists and by US government enforcement agencies that the content validity model is not appropriate for cognitive measures used in personnel selection. Based on a review of the broader differential psychology research literature on cognitive skills, aptitudes, and abilities, this article demonstrates that with the proper job analytic and content validity procedures, cognitive ability measures, including tests that are de facto measures of GCA, can have content validity in addition to criterion and construct validity. The article considers, critiques, and refutes the specific arguments contending that content validity is inappropriate for use with cognitive skills and abilities. The implication of the facts presented in this article is that practitioners should be free under the SIOP Principles, APA Standards, and EEOC UGL to apply the content validity model to measures of the specific cognitive skills shown by appropriate job analysis to be required in job performance and to be properly sampled from defined job performance domains. Such applications by practitioners should be viewed as meeting professional standards and therefore as legally defensible. This means that it is possible for a de facto GCA test to be supported by both criterion-related validity (via the massive validity generalization evidence from the research literature) and by content validity (via a content validity study of the sort described in this article). In such cases, the existence of dual lines of validity evidence strengthens the professional and legal defensibility of the test.
