Addiction sciences and its psychometrics: the measurement of alcohol-related problems
Lorraine T. Midanik, University of California at Berkeley, School of Social Welfare, 120 Haviland Hall MC #7400, Berkeley, CA, USA, 94720–7400. E-mail: firstname.lastname@example.org
Aims The focus of this paper is on psychometric issues related to the measurement of alcohol problems.
Methods Taking a broad perspective, this paper first examines several issues around the use of instruments to provide diagnostic categories in surveys, including dimensionality, severity and alcohol consumption. Secondly, a discussion of some of the political issues surrounding measurement of alcohol problems is presented, including some of the conflicts that arise when the psychometric properties of commonly used instruments are questioned. Finally, newer statistical techniques that can be applied to scale development in the alcohol field are examined, including non-linear multivariate analyses and confirmatory/hypothesis-based methods.
Results and conclusions Continued scholarly discussion needs to be encouraged around these psychometric issues so that instrument development and maintenance in the addiction sciences becomes an ongoing academic pursuit as we strive to measure alcohol problems in the best way possible.
How one measures the use of alcohol and its consequences has been a concern of researchers, clinicians, politicians and policymakers for well over half a century [1,2]. It has long been recognized that measurement could be used as a political tool to maximize or minimize social problems to the gain or loss of vested interest groups (pp. 41–65). It has been argued ‘. . . that political judgments are implicit in the choice of what to measure, how to measure it, how often to measure it, and how to present and interpret the results’ (p. 3). Thus, while there is a need for accuracy, precision and objectivity, we must also recognize that a given culture's social construction of problems may well influence any type of self-report measurement approach [5–7]. This has been recognized not only from a sociological viewpoint, but also from a psychiatric perspective. For example, when making the recommendation that a workgroup create the categorical definitions of substance use disorders for the forthcoming DSM-V revisions, it has been noted that such a process:
is not a strictly empirical one. Ultimately it is based primarily upon the judgment of the experts selected as members of the diagnostic workgroups. Even [with] abundant clinical data and secondary analyses . . . judgments differ and can be significantly influenced by nonempirical considerations such as personal bias and political considerations. Results may or may not closely reflect empirical reality  (p. 19).
Consideration of measurement in the addiction sciences is a daunting task under any circumstances, and even more challenging when the discussion must be confined to a single article. Thus, the focus of this paper will not be a traditional review of scales that are used in the alcohol field. Readers who would like this information are referred to a lengthy and thorough compendium of alcohol assessment instruments published in 2003 entitled Assessing Alcohol Problems. A Guide for Clinicians and Researchers. This monograph, funded by the US National Institute of Alcohol Abuse and Alcoholism (NIAAA) and available on their website (http://www.niaaa.nih.gov), includes eight papers that discuss a wide range of assessment tools designed for screening, diagnosing, measuring drinking behavior, assessing adolescent alcohol use, treatment planning, treatment and process assessment and outcome evaluation. A fairly large appendix includes fact-sheets about each instrument, as well as copies of each of the scales discussed.
In contrast to NIAAA's 2003 publication, this paper will take a broader approach to the overall issue of measurement. As such, it is organized into three related thematic areas to provide a provocative overview of the role of psychometrics in the substance abuse field. First, the article focuses on issues surrounding the use of instruments or scales in general populations to measure alcohol dependence prevalence rates. This will include a discussion of the ‘two worlds’ [10,11] of alcohol problems (the clinical and the general population) as a background to presenting issues concerning dimensionality of dependence, severity and the role of alcohol consumption in dependence diagnoses.
The arguments surrounding how alcohol dependence is measured are part of a larger discourse on political concerns that is not usually included in discussions on instrumentation in the alcohol field. Thus, the second section of this paper will explore some of these political issues. Case examples of standardized instruments now viewed as ‘gold standards’ and the debates that ensue will be used to illustrate how commonly used scales take on a ‘life of their own’ and may continue to be preferred over objectively better-performing scales. Moreover, their continued use, despite flaws that limit their ability to provide valid and reliable diagnostic information, may impede the emergence of newer and better ways to achieve these goals.
Finally, after assessing political factors that can impede progress in psychometric research, we will discuss newer strategies and techniques that need to be considered for scale development and, we would argue, for scale maintenance and improvement as well. This section will address specifically the following two areas: non-linear multivariate analysis and confirmatory/hypothesis-based methods.
ALCOHOL DEPENDENCE DIAGNOSIS: THE ‘TWO WORLDS’ OF ALCOHOL PROBLEMS
Dimensionality of dependence
Since Edwards & Gross' seminal work over 30 years ago defining alcohol dependence as a psycho–physiological syndrome with specific features distinct from social consequences of drinking, there has been continued discussion of its underlying empirical basis and whether alcohol dependence is a unitary or multi-dimensional construct. In the early 1960s, alcohol survey researchers also raised the issue of dimensionality and whether there is evidence for a relatively stable disease entity distinct from a broader, and not necessarily unitary, collection of alcohol problems that might vary over time.
It was during this early period that the ‘two worlds of alcohol problems’ was recognized [10,11]. This conceptualization implied that the correlations and clustering of items found in clinical samples differed substantially from those seen in the general population. Room speculated that ‘Perhaps what was being measured as alcoholism or drinking problems in the general population was qualitatively as well as quantitatively different from alcoholism as it appeared in the clinic’ (p. 211). Although more studies have been conducted in recent times, to this day very few, with some exceptions, fully compare dependence- or consequence-related problem measurements derived from general population and clinically enrolled samples.
While large-scale alcohol surveys generally include items on alcohol treatment, these data have been underexploited in distinguishing the possible psychometric differences in alcohol-related symptom and consequence dimensionality in subjects drawn from each of the two worlds—clinical and general populations. Although recent work has used responses to survey questions on receiving treatment to understand more clearly the use of and outcomes from alcohol services [16–18], if the ‘two worlds’ conjecture is accurate, such analyses would be limited by selectively representing the clinical population actually found in treatment programs. Parenthetically it should be noted that, at least in the United States, national censuses of the treatment population also have biases, for example, by over-representing public sector programs  so that large national population surveys remain very important for studying service use. Compared to the 1980s when in-patient services were more common, clients in specialty and non-specialty substance abuse treatment services may have a broader spectrum of problem severity, including those who may not meet the criteria for alcohol dependence . Thus, the two worlds distinction may not be as extreme as it was previously. However, because of sampling differences in clinical and general populations, treated subjects in a general population survey are still probably not representative of those enrolling in clinical studies . Yet this issue, discussed somewhat comprehensively in the 1970s , has still not really been addressed fully today.
Analyses of the general population using the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) continue to be used to inform the development of DSM-V. For example, in the recent Addiction supplement marshalling evidence useful for development of a new DSM-V classification, Muthén  conducted a psychometric analysis of NESARC that is presumably intended to inform both research and clinical definitions. Another included paper  drew on an earlier analysis by Muthén and colleagues  which had used the 1988 National Health Interview Survey (NHIS). Hasin and colleagues , when addressing psychometric validation, cited those factor-analytical and latent class studies; the NHIS data set was noted to be sufficiently robust to overcome the many problems of analyzing binary variables in small data sets and also to overcome problems that occur ‘when the sample is defined by characteristics included in the analysis’ (p. 68). By inference, today's experts take the position that psychometric analyses that investigate dimensionality of dependence symptoms, deemed relevant for clinical assessment, may need to rely on well-developed large nationally representative general population data sets. Conversely, many clinical data sets are seen as inadequate, based on these methodological issues and often a limited sample size. Should one then presume that the field has moved beyond the ‘two worlds’ concerns of earlier years and become comfortable with inferences for clinical scale development via general population surveys? From proceedings of expert panels such as those collected in Addiction 2006; 101 (Supplement 1), this would appear to be the case.
Another issue to consider is the spectrum of symptom severity and how best to assess or characterize it. Much of this issue hinges on a unidimensional view of dependence. A straightforward and pragmatic approach is recommended by Helzer and colleagues, who argue that individual DSM-defined symptom items could be ranked on a three-point scale that could provide an easier way for respondents or patients to report their symptoms, providing ‘an intermediate between an absolute “yes” and “no”’ (p. 19). It is not clear exactly how this item-level innovation would be designed to assure good inter-rater and test–retest reliability, or how it would overcome the deficiency the authors note for simple symptom counts ‘given that cross-symptom equivalence . . . is not necessarily justified’ (p. 19).
Muthén provides an analysis that addresses severity by applying latent class analysis (LCA) and other hybrid modeling approaches using a sample of current drinkers in NESARC defined as having ‘at least one’ to ‘five or more’ drinks per occasion in the last year. More discussion of this methodology and its application can be found in the section to follow. Seven indicators based on DSM-IV domains and four based on abuse criteria were implemented from 32 symptom items that operationalized DSM-IV alcohol abuse or dependence. The LCA classification results in ‘parallel profiles’ across 11 indicators, involving both abuse and dependence items. A latent class factor analysis (LCFA) provided both dimensional information and a classification of individuals into four classes. Because the LCFA specifies a unidimensional model, a sum of the 11 indicators is possible, but results in numerous misclassifications. This illustrates some of the problems with simplifications based on complex analytical strategies: ‘The classification based on the LCFA model uses more information than merely the sum of criteria and also has a statistical modeling rationale’ (p. 14). Once again, one may wonder whether the classes so established in a broad population group including numerous quite modestly heavy drinkers (n = 13 067) are most relevant to treated alcoholics. To be sure, the large Class 4 in the LCFA (Table 2, p. 13) has no (n = 9184) or only one (n = 1196) problem. Class 1, on the other hand, gains steam at eight to 11 affirmed indicators (combined n = 123), although a further eight members have seven indicators (at which score an additional 91 are, rather, in Class 2, which ranges from three to eight indicators; n = 581). Class 1 represents 1% of the heavy drinkers, while Class 2 accounts for 4.4%. Referenced to the full NESARC sample of 43 093 (unweighted), the combined Classes 1 and 2 (n = 713) total 1.7% of the population.
Would these be those likely to be seen in the clinic and would only two classes be found there if one had a sufficiently diverse and large sample? A two-worlds adherent might very well be skeptical.
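The proportions quoted for the LCFA classes can be re-derived from the counts given in the text. The short sketch below does only that arithmetic; the variable names are illustrative, and treating Class 1 as the 123 members with eight to 11 indicators plus the eight members with seven indicators is our reading of the passage rather than a figure taken directly from the original table.

```python
# Counts as quoted in the text for the LCFA of NESARC current drinkers.
DRINKER_SAMPLE = 13_067    # current drinkers included in the analysis
FULL_SAMPLE = 43_093       # full NESARC sample (unweighted)
CLASS_1 = 123 + 8          # assumed reading: 8-11 indicators plus eight members with seven
CLASS_2 = 581
COMBINED_REPORTED = 713    # combined Classes 1 and 2, used as reported


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


class_1_pct = percent(CLASS_1, DRINKER_SAMPLE)
class_2_pct = percent(CLASS_2, DRINKER_SAMPLE)
combined_pct = percent(COMBINED_REPORTED, FULL_SAMPLE)

print(class_1_pct, class_2_pct, combined_pct)
```

Expressed against the full survey sample rather than the drinker subsample, the two most severe classes together form a very small group, which is the point at issue when asking whether such classes resemble a treated population.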
The relationship between alcohol consumption and the diagnosis of alcohol dependence has recently been receiving attention. One aspect of the ‘two worlds’ issue that was discussed from early on is that those ‘diagnosed’ as being alcohol-dependent in general population surveys tend to be younger than clinical samples. Noting this, and adopting a continuum perspective, Caetano tested whether application of a drinking criterion of averaging two drinks/day or less, and drinking five or more drinks less than six times per year, might remove false-positives on alcohol dependence, resulting in a ‘corrected prevalence’. The prevalence of impairment of control among men aged 18–29 years, although reduced by this filtering (from 46.7% to 41.5%), remained high. To Caetano, the younger profile of survey-estimated dependence prevalence ‘serves as a good indicator of the misidentification of normal behavior as abnormal’ (p. 262). Overall, the reduction of percentages among all men in the NHIS-88 was ‘relatively small’. Excluding impairment of control, affirmatives on other domains were reduced in the range of 0.3–1.6%. With regard to meeting criteria for dependence, considering the less-hazardous drinking pattern to invalidate the reports, the younger group aged 18–29 years was unfortunately less affected by the filter than their elders. This is, no doubt, because younger people drink substantially more heavily than those older in the general population. Thus, filtering by higher alcohol consumption for younger age groups does not seem to be able to ‘correct’ the overpopulation of apparently dependent individuals among the young adults.
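The logic of such a consumption filter can be made concrete with a minimal sketch. The record layout, field names and the four invented respondents below are hypothetical and serve only to illustrate the rule as described above (an average of two drinks per day or less, and fewer than six occasions of five or more drinks per year); this is not the code or the data used in the original analysis.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    meets_dependence: bool      # survey-based dependence classification
    avg_drinks_per_day: float   # average daily volume
    days_5plus_per_year: int    # occasions of five or more drinks in the past year


def low_consumption(r):
    """True when the drinking pattern falls under the filter described in the text."""
    return r.avg_drinks_per_day <= 2 and r.days_5plus_per_year < 6


def corrected_dependence(r):
    """Keep a dependence classification only if the drinking pattern is not low."""
    return r.meets_dependence and not low_consumption(r)


def prevalence(sample, classify):
    """Percentage of the sample classified as positive."""
    return 100 * sum(classify(r) for r in sample) / len(sample)


# Invented respondents, for illustration only.
sample = [
    Respondent(True, 1.0, 2),    # reclassified by the filter
    Respondent(True, 4.5, 40),   # remains classified as dependent
    Respondent(False, 0.5, 0),
    Respondent(False, 3.0, 12),
]

raw = prevalence(sample, lambda r: r.meets_dependence)
corrected = prevalence(sample, corrected_dependence)
print(raw, corrected)
```

Because the filter can only remove respondents whose consumption is low, it has little effect in any subgroup where most of those classified as dependent also drink heavily, which is consistent with the small correction reported for the youngest men.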
It may be worth revisiting a very early report by Room  on amounts drunk by institutionalized alcoholics based on reports from diverse clinical samples. Having on occasion drunk a pint of spirits (approximately 12 standard drinks) at some time in one's life (or even 100 times) did not discriminate alcoholics effectively from non-alcoholic controls.
Having drunk a fifth [of spirits in a day (approximately 20 standard drinks)] 10 or more times identified more than three-quarters of the labeled alcoholics while providing the greatest discrimination, in terms of percentage differences, between the two samples [alcoholics/not] (p. 12).
Perhaps we are setting our criteria for alcoholic drinking patterns far too low. When considering general population data, it may be worth noting that the Alcohol Use Disorders and Associated Disabilities Interview (AUDADIS) includes as a tolerance item whether the respondent ever/in the prior 12 months drank the equivalent of a fifth of liquor in one day . Just as cut-points on other standard psychiatric epidemiological measures, such as the Beck Depression Inventory , have to be adjusted for population characteristics, the same may be needed for alcohol dependence and younger drinkers.
THE POLITICS OF PROBLEM MEASUREMENT
The previous section delineated multiple concerns with the measurement of alcohol dependence and the relevance of developing dependence diagnoses from general population samples that do not necessarily mirror clinical populations. Beyond specific ways in which dependence is measured, there are larger political concerns that need to be acknowledged and handled when considering the utility and appropriateness of scales and screening instruments for either general or clinical populations. These concerns include definitions and thresholds of reliability and validity estimates; the inherent friction between keeping existing instruments the same for comparability reasons or abandoning the instrument by creating a new one, enhancing the current one or finding another scale that measures a specific area better; and finally, how ‘ownership’ of instruments by researchers and/or institutions can be a barrier by impeding progress in the ongoing development of research instruments. While the focus of this section will not be a review of scales or screeners per se, specific instruments will be used as case examples to illustrate these concerns.
Reliability and validity
The traditional way to assess the merits of any instrument is to determine its reliability and validity. Potential users of any instrument appreciate knowing that psychometric studies have already been conducted and that the general consensus is that the instrument ‘does its job’ in terms of consistent and accurate measurement. Moreover, there is an expectation in research grants as well as published articles that use of established instruments is predicated on the fact that they perform well from a psychometric perspective. Thus, a sentence or two describing the moderate to excellent validity and reliability (expressed as correlations or Cronbach's alphas) of an instrument with preferably several references is the norm. Yet the actual threshold used to demonstrate that an instrument is valid or reliable may seem somewhat arbitrary. It has been argued that reliability parameters must be focused on both the purposes and the specific sample to which it is applied . While this may be the case, others have gone further by differentiating the purpose of a measure by whether or not a specific score will or will not have direct application to a specific individual, e.g. have direct clinical relevance . In the early developmental stages of an instrument, it may be appropriate to have reliabilities in the range of 0.50–0.60; however, when measures have direct application, and important decisions will be made on exact scores, a reliability of 0.80 may not be sufficient . Nunnally  concludes that when a scale score is being used to make critical decisions about an individual, ‘. . . a reliability of 0.90 is the minimum that should be tolerated, and a reliability of 0.95 should be considered the desirable standard’ (p. 226).
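Because Cronbach's alpha is the statistic most often quoted in support of such claims, it may help to recall how it is computed. The sketch below uses the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score); the function name and the small invented data set are illustrative only.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Five invented respondents answering three items.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
]

alpha = cronbach_alpha(responses)
# Thresholds quoted in the text for decisions about individuals.
meets_minimum = alpha >= 0.90
meets_desirable = alpha >= 0.95
print(round(alpha, 3), meets_minimum, meets_desirable)
```

The same function applied to a different sample may return a very different value, which is one reason why a coefficient reported in an instrument's development study cannot simply be assumed to hold elsewhere.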
Finally, researchers tend typically to treat statements about the reliability and validity of an instrument by other researchers and perhaps, most importantly, by those who actually developed the scale, as sacrosanct and not question the methodology, the appropriateness of the sample or even who conducted the psychometric research. While it is expected that authors of new scales provide data on validity and reliability, it can be argued that those who develop a scale may not be the best ones to establish its psychometric properties . Whether conscious or not of bias, those who develop an instrument do have a vested interest in ensuring that it be presented in a positive way with an emphasis on its strengths without a thorough focus on its weaknesses . Mäkelä illustrates this in his critique of the Addiction Severity Index (ASI). Most researchers who used the ASI discussed its reliability and validity ‘. . . in very positive and categorical terms’ (p. 398), despite evidence upon further investigation that this was not the case. While it is important to cite previous studies that have found an instrument to have good psychometric qualities, this does not preclude researchers who choose to use an instrument in a new study from conducting their own assessments of reliability and validity as part of their research. Thus, reliability and validity of instruments based on other researchers' work should not be the only data considered when assessing the merits of an instrument. Rather, an instrument's psychometric properties need to be assessed and reassessed, particularly as the samples with which the instrument will be used vary from the original one despite additional financial or time costs.
Comparability versus creativity
Once an instrument has been well established and disseminated in the research literature, it de facto becomes the standard. Along with this status comes strong pressure for the instrument to remain ‘intact’ thereby avoiding the problem of not being able to compare results across studies and over time in longitudinal studies. Some instruments that have become standard, such as the DSM and the ICD, have been revised multiple times since their original development. Despite serious criticisms of how and why the DSM was originally created  and maintained, efforts to improve these diagnostic criteria have continued. For example, currently, a substance use disorders workgroup is meeting to update the DSM-IV to DSM-V .
Regular assessment and revision of other instruments that have become standard is not necessarily the norm. This phenomenon has been referred to as ‘intellectual slumber’, in which researchers and clinicians accept the standard as is, and no longer question its psychometric properties or its current relevance. Moreover, uses of these standard instruments may vary from the original intent of those researchers who developed them. One example of this is the Alcohol Use Disorders Identification Test (AUDIT), which was developed originally by the World Health Organization as a screening instrument to identify patients in primary health-care settings who may be appropriate candidates for a brief clinical intervention [38,39]. Importantly, the AUDIT was designed to be able to screen individuals before the development of serious alcohol problems, thus ensuring that a brief intervention aimed at early hazardous or harmful drinking would be effective. The AUDIT is a 10-item scale that includes three alcohol consumption and seven problem items with each of the 10 items scored on a Likert scale. Use of this instrument over time has changed from its original intent. While the AUDIT continues to be the instrument of choice for the identification of at-risk drinkers who can be referred for brief interventions within primary care settings, it is also used to screen for alcohol dependence.
‘Ownership’ of instruments and change potential
Scales that allow clinicians and researchers to determine who is at risk or can be diagnosed with alcohol abuse or alcohol dependence are useful in so far as they have a high level of sensitivity and specificity. As more is known, both in terms of how a condition is defined or of how to measure a condition more effectively, it is necessary to either revise current instruments or create new ones. Issues arise with both strategies, but there is considerable pressure to maintain the instrument as is, in order to preserve comparability despite its deficits. Another issue rarely discussed is the personal investment in instruments that have become ‘standards’ by those individuals who developed them; thus, there are personal motivations to maintain and promote some scales beyond their ‘natural lifespan’. Ownership in this instance is not limited to individuals or groups of researchers but sometimes belongs to organizations. Regardless of who or what ‘owns’ the instrument, the result is continuation of the status quo with less emphasis on innovation.
Related to this issue of ownership is a lack of incentive for researchers to create newer and better instruments. Typically in the alcohol and drug fields, methodological research is considered ‘less than’ so-called substantive research . Thus, it is clearly more difficult to obtain funding for the development of new or improved instruments and, in addition, it is difficult to get methodological articles published in well-respected journals in the substance abuse field. Hence, there is less than adequate motivation to continue to improve existing scales. Even when new scales are created or existing scales revised, implementation is also at issue. As discussed above, new and revised scales present a challenge to the researcher who wishes to compare results with earlier studies. It has been suggested that it might even be preferable to move away from revisions in favor of new and presumably better existing instruments that better fit the theoretical principles that should underlie any diagnostic instrument .
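The sensitivity and specificity by which screening instruments are judged in this section are simple to compute once screening results can be compared with a criterion diagnosis. The following sketch shows the calculation; the function name and the eight invented cases are illustrative and do not correspond to any instrument or data set discussed here.

```python
def sensitivity_specificity(screen_positive, has_condition):
    """Return (sensitivity, specificity) from paired boolean lists."""
    tp = fn = tn = fp = 0
    for screened, diagnosed in zip(screen_positive, has_condition):
        if diagnosed and screened:
            tp += 1          # true positive
        elif diagnosed and not screened:
            fn += 1          # false negative
        elif not diagnosed and screened:
            fp += 1          # false positive
        else:
            tn += 1          # true negative
    return tp / (tp + fn), tn / (tn + fp)


# Invented screening results and criterion diagnoses for eight people.
screen = [True, True, False, True, False, False, False, True]
criterion = [True, True, True, False, False, False, False, False]

sens, spec = sensitivity_specificity(screen, criterion)
print(round(sens, 3), round(spec, 3))
```

Both quantities depend on the criterion diagnosis chosen and on the cut-point applied to the screener, so an instrument revised or applied to a new purpose needs these values to be estimated again.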
FUTURE DIRECTIONS FOR PSYCHOMETRICS IN THE SUBSTANCE ABUSE FIELD
In addition to political considerations, researchers in the alcohol field who want to assess the psychometric properties of their instruments may also be hindered by a fairly traditional set of psychometric tools that have been developed and standardized over many decades. The vast majority of these older methods assume that the items considered for analysis are continuous variables. In addition, the ways in which items are combined to form scales and to assess their coherence have almost exclusively been linear (often simply summative). Scale construction and evaluation have also traditionally been considered exploratory in nature, and thus the majority of methods widely in use do not incorporate formal statistical testing procedures.
Several factors influence the degree and rate of adoption of newly developed psychometric methods in applied fields. One is the complexity of the methodologies developed. A second is the extent to which these methods have been documented, such as through publications in the substance literature, in a way that allows social scientists to apply the methods to their own research questions and data. These methods are often computationally intensive and so the availability of their software implementation is also a key factor in their use. Although a great deal of effort has been exerted to expedite this process with the advantage of continually improving electronic tools, there continues to be a significant lag between development and wide application of psychometric methodologies.
There have been a number of significant advances in both the development of psychometric methodological tools and their software implementations in recent years. Two specific areas that will be discussed here include the generalization of linear factor analysis to a more general non-linear multivariate analysis framework, and confirmatory methods that incorporate formal statistical testing into a framework where more complex multi-dependent latent variable relationships can be studied. These areas by no means constitute an exhaustive list of the work that has taken place in the generalization of traditional psychometric methods, but represent an important set of areas that address some of the fundamental limitations of some of the first methods developed in the field.
Non-linear multivariate analysis
One of the most obvious generalizations of the standard set of psychometric tools has been that of the relaxation of the assumptions of continuous outcome variables and of the sole use of linear methods for creating scales. It is often the case that variables used in factor analyses are not what would be considered typically as continuous variables and could often more rationally be considered as ordinal. Even variables that might be considered numerical [e.g. number of Alcoholics Anonymous (AA) meetings attended, number of substances consumed, etc.] might not best reproduce the underlying factor structure when combined only in a linear fashion.
One quite common practice is to use factor-analytical methods on binary variables [43–45], and several different methods have been developed in an attempt to adapt methods designed for continuous data for use with categorical data. Some have suggested the use of polychoric correlations [46,47] and estimation of the factor structure via generalized least-squares. Care should always be taken in assuming underlying normal variables for the estimation of such correlations, as the resulting models may produce misleading results. Others have used a range of techniques, including estimation methods and rotation techniques, in order to study the stability of the model under alternative specifications.
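To illustrate why the choice of correlation matters for binary items, the sketch below compares the ordinary Pearson (phi) coefficient computed on a 2 × 2 table with a tetrachoric correlation, the two-category special case of the polychoric correlation. The tetrachoric value is obtained here with the classical ‘cosine-pi’ closed-form approximation rather than the maximum-likelihood estimation used in practice, and the function names and cell counts are invented for illustration.

```python
import math


def phi_coefficient(a, b, c, d):
    """Pearson correlation of two binary items from 2x2 cell counts.

    a = both endorsed, b = first only, c = second only, d = neither.
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator


def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    Assumes both items dichotomize underlying normal variables;
    this sketch requires all four cells to be non-empty.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this approximation")
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1 + math.sqrt(odds_ratio)))


# Invented counts for two symptom items in 100 respondents.
phi = phi_coefficient(40, 10, 10, 40)
tet = tetrachoric_cos_pi(40, 10, 10, 40)
print(round(phi, 3), round(tet, 3))
```

The tetrachoric estimate is larger than phi for the same table because it refers to the assumed continuous variables underlying the items. That gain depends entirely on the normality assumption, which is the caution raised above.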
Another common practice, either when binary variables are considered or when continuous or categorized continuous (ordinal) variables are analyzed, is to form summative scales with unit weights [43,50]. Obviously, the simplification produced from the use of a scale may be argued, in some cases, to outweigh the precision produced from using the weighting produced from analyses. For analyses using binary data only, few other options are available. However, when continuous or ordinal variables are used, choices for further dichotomization of variables can be somewhat arbitrary.
Non-linear multivariate analysis (generally referred to as the ‘Gifi System’ of non-linear multivariate analysis) includes a range of methods that generalize some of the standard psychometric tools such as factor analysis. This allows for different levels of measurement of the variables analyzed, including continuous, ordinal and nominal, while allowing simultaneously for non-linear combinations of these variables to form scales. The linear scaling used traditionally in factor analysis methods, although simple, is somewhat arbitrary, and the representation of underlying constructs may be improved drastically by not imposing such a restrictive scaling. For example, in the construction of a substance abuse treatment service satisfaction scale containing innately ordinal items such as four-point Likert scale items, there is no a priori reason to believe that an item's contribution to the overall scale when moving from category 1 to category 2 should be the same as when moving from category 2 to category 3. Even variables that might be considered as linear may be better represented as a non-linear variable (e.g. a smoothed function of the variable using bases of splines). In addition, missing data issues are handled much more readily in these methods than in traditional factor analyses, where solutions such as listwise deletion and pairwise estimation of correlations can produce biased results or cause technical problems in model estimation (e.g. non-positive definiteness of the correlation matrix). However, solutions for missing data in these scenarios (e.g. multiple imputation) have been proposed.
HOMALS (HOMogeneity analysis by Alternating Least Squares), an optimal scaling method that is part of the family of the ‘Gifi system’ of non-linear multivariate analysis techniques, can be described in pure graphical language, with the basic premise being that complicated multivariate data can be made more accessible by displaying their main regularities and patterns in plots. What the technique accomplishes is the scaling of the respondents in such a way that those respondents with similar response profiles are close together, while respondents with different profiles are relatively far apart. Similar to factor analysis, eigenvalues and a factor score are produced for each dimension of the solution as well as diagnostics, called discrimination measures, which indicate the level of contribution of each variable to each dimension produced in the solution. Similarities of HOMALS to other techniques, such as correspondence analysis, multi-dimensional scaling, cluster analysis, discriminant analysis and analysis of variance (ANOVA) can be found in Michailidis & de Leeuw .
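The alternating least squares idea behind homogeneity analysis can be sketched for a single dimension. In the illustration below, each category is quantified as the mean score of the respondents who chose it, each respondent is then re-scored as the mean quantification of his or her chosen categories, and the scores are re-centred and re-scaled before the next pass. This is a simplified one-dimensional sketch with invented data and hypothetical function names; it omits eigenvalues, discrimination measures and further dimensions, and it is not the implementation found in any of the packages mentioned in this paper.

```python
from collections import defaultdict
from math import sqrt


def normalise(scores):
    """Centre the scores and scale them so that their mean square equals 1."""
    n = len(scores)
    mean = sum(scores) / n
    centred = [s - mean for s in scores]
    scale = sqrt(sum(s * s for s in centred) / n)
    return [s / scale for s in centred]


def homals_1d(data, iterations=200):
    """One-dimensional homogeneity analysis by alternating least squares."""
    n, m = len(data), len(data[0])
    x = normalise([float(i) for i in range(n)])  # arbitrary deterministic start
    quantifications = []
    for _ in range(iterations):
        # Step 1: quantify each category as the mean score of its respondents.
        quantifications = []
        for j in range(m):
            members = defaultdict(list)
            for i in range(n):
                members[data[i][j]].append(x[i])
            quantifications.append(
                {cat: sum(v) / len(v) for cat, v in members.items()}
            )
        # Step 2: score each respondent as the mean of his or her categories.
        x = normalise(
            [sum(quantifications[j][data[i][j]] for j in range(m)) / m
             for i in range(n)]
        )
    return x, quantifications


# Six invented respondents answering three categorical items.
answers = [
    ["never", "no", "low"],
    ["never", "no", "low"],
    ["never", "no", "mid"],
    ["often", "yes", "mid"],
    ["often", "yes", "high"],
    ["often", "yes", "high"],
]

scores, category_values = homals_1d(answers)
print([round(s, 3) for s in scores])
```

In this toy example respondents with identical answers receive identical scores, and the two broad response profiles fall on opposite sides of zero, which is the sense in which similar profiles are placed close together and dissimilar ones far apart.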
Rehm used HOMALS to derive a latent construct assessing the level of detrimental drinking across a range of countries. This measure has been found to be a very effective predictor of a range of characteristics associated with variability in the risk of injury due to drinking, the risk of alcohol-related injury and the causal attribution of injury to alcohol across a range of studies from around the world. Variations of the HOMALS family of optimal scaling techniques are also widely available, including implementations in the SPSS Categories add-on package, as well as in freely available software such as R and Xlisp-Stat.
Confirmatory/hypothesis testing-based methods
Traditional factor analytical methods, along with the generalization discussed above (the Gifi system), generally fall into the category of exploratory data analysis (EDA) methods. These methods rely less on probability models to describe relationships in the data and more on geometric representation of the respondents and variables. Often, however, the interest of the analyst is to propose models to fit multivariate data that can be tested in a formal probabilistic sense. The latent variable literature attempts to provide such probabilistic structure by modeling parametrically the covariance structure of the data. These methods make it possible not only to explore the factor structure of a set of variables but also to model simultaneous relationships between multiple sets of scales, including predictive relationships among observed and latent variables.
Structural equation modeling (SEM), a general type of covariance structure modeling, generalizes factor analysis methods and assumes that the underlying latent variables are continuous. Such modeling procedures have often proved very useful for studying the behavior of scales in longitudinal studies. For example, Aneshensel and Huba studied the relationship between a measure of depression (the CES-D) and alcohol use and smoking over four waves of data collection. Harford explored the factor structure of the DSM-IV scale in a sample of youth while allowing simultaneously for background demographics in the prediction of the factor structure. Note that each of these sets of analyses could have been carried out as a sequence of separate analyses. However, the influence of estimating one set of relationships or factors on another, as is done in the assessment of measurement invariance between groups, can be explored only in such simultaneous models.
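For readers less familiar with this class of models, a structural equation model is commonly written as two linked sets of equations. The notation below follows one widely used convention and is offered as a sketch; the symbols are not taken from the studies cited above.

```latex
\begin{align}
\mathbf{y}_i &= \boldsymbol{\nu} + \boldsymbol{\Lambda}\,\boldsymbol{\eta}_i + \boldsymbol{\varepsilon}_i
  && \text{(measurement model)} \\
\boldsymbol{\eta}_i &= \boldsymbol{\alpha} + \mathbf{B}\,\boldsymbol{\eta}_i + \boldsymbol{\Gamma}\,\mathbf{x}_i + \boldsymbol{\zeta}_i
  && \text{(structural model)}
\end{align}
```

Here $\mathbf{y}_i$ holds the observed items for respondent $i$, $\boldsymbol{\eta}_i$ the continuous latent variables, $\boldsymbol{\Lambda}$ the factor loadings, $\mathbf{x}_i$ the observed covariates (e.g. background demographics), and $\boldsymbol{\varepsilon}_i$ and $\boldsymbol{\zeta}_i$ the residuals. Ordinary factor analysis corresponds to the first equation alone. Fitting both equations together is what allows the estimation of one set of relationships to inform the other.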
Recently there has been a great deal of interest not only in the modeling of continuous latent variables, but also in the modeling of categorical latent variables. The integration of the modeling of continuous and categorical latent variables is described in Muthén & Shedden. An excellent overview of the role of both continuous and categorical latent variables as applied to the substance abuse field can be found in Muthén & Muthén. The ability to group respondents empirically while simultaneously estimating parameters associated with the grouping (e.g. growth of alcohol dependence across time within each group) has seen a great deal of application in the substance abuse literature [68–72], and its use will probably continue to grow.
One serious issue facing the field of substance abuse is the need for formal abuse/dependence diagnoses that classify people into categories (e.g. alcohol dependent or not dependent, alcohol abuse or no abuse). These categories are derived by choosing cut-points on a scale formed from responses to a series of items chosen carefully for their content. Although these cut-points are chosen by those with substantive expertise, the resulting diagnoses still retain the properties of categorical variables. That is, they do not convey the underlying severity within a diagnosis, which can be important, for example, for clinicians assigning appropriate treatment. In addition, scales such as the DSM-IV use a simple summative scale and assume that all criteria are equivalent, so that it does not matter which three domains are met, an assumption that has not been found to hold in a nationally representative dataset.
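The two limitations noted above can be made concrete with a small sketch of a summative scale with a cut-point. The criterion labels are shorthand chosen for this illustration, the respondents are hypothetical, and the cut-point of three criteria follows the description in the text; none of this reproduces an actual diagnostic instrument.

```python
# Shorthand labels (chosen for this sketch) for the seven DSM-IV dependence criteria.
CRITERIA = (
    "tolerance",
    "withdrawal",
    "larger_or_longer",
    "unable_to_cut_down",
    "time_spent",
    "activities_given_up",
    "continued_despite_problems",
)


def criteria_count(endorsed):
    """Simple summative score: every endorsed criterion counts equally."""
    unknown = set(endorsed) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return len(set(endorsed))


def is_dependent(endorsed, cut_point=3):
    """Categorical diagnosis obtained by applying a cut-point to the count."""
    return criteria_count(endorsed) >= cut_point


# Hypothetical respondents.
respondent_a = {"tolerance", "withdrawal", "larger_or_longer"}
respondent_b = {"time_spent", "activities_given_up", "continued_despite_problems"}
respondent_c = set(CRITERIA)

for name, r in [("a", respondent_a), ("b", respondent_b), ("c", respondent_c)]:
    print(name, criteria_count(r), is_dependent(r))
```

Respondents a and b meet entirely different criteria yet receive the same score, because the count treats all criteria as interchangeable. Respondents a and c receive the same diagnosis although one meets three criteria and the other all seven, so the category alone carries no information about severity.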
With the recent developments in the combination of categorical and continuous latent variable methods, two analytical models that combine the features of both categorical (latent class or mixture) and continuous (latent trait or factor) latent variables have been developed: latent class factor analysis (LCFA) and factor mixture analysis (FMA). LCFA can be seen as an attempt to add a continuous dimensional aspect to the assignment of respondents to classes based on their responses, while relaxing the distributional assumptions of the underlying factor (usually assumed to be normally distributed) by representing it non-parametrically by discrete points (the classes). Because the factor structure (the regression parameters of the items on the factor) is assumed to be the same for each class, this method is appropriate when the profiles of the items (the plot of the average value for each item within a class) are parallel (i.e. do not cross) across classes. FMA can be seen as a generalization of LCFA, in that it allows the factor analysis parameters (i.e. means, variances) to vary across the latent classes.
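One way to write down the contrast between the two models is sketched below for a single factor. The notation is an assumption of this sketch rather than that of the cited authors, and for binary or ordinal items the linear item equation would be replaced by a logit or probit link.

```latex
\begin{align}
y_{ij} &= \nu_j + \lambda_j\,\eta_i + \varepsilon_{ij},
  \qquad P(c_i = k) = \pi_k, \quad k = 1,\dots,K \\
\text{LCFA:}\quad & \eta_i \mid (c_i = k) = \alpha_k \\
\text{FMA:}\quad  & \eta_i \mid (c_i = k) \sim N(\alpha_k,\ \psi_k)
\end{align}
```

Here $y_{ij}$ is the response of respondent $i$ to item $j$, $c_i$ is the latent class, and $\eta_i$ is the factor. In LCFA the factor takes only the $K$ values $\alpha_1,\dots,\alpha_K$, so the classes act as discrete support points for the factor distribution, and the item parameters $\nu_j$ and $\lambda_j$ are shared across classes, which gives the parallel item profiles described above. In FMA the factor varies within each class, with class-specific mean $\alpha_k$ and variance $\psi_k$, which supplies the within-class severity dimension.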
Although these relatively new methods make it possible to add a continuous severity component to an empirical grouping, the choice of a specific implementation cannot be based on statistical criteria alone, as substantive and predictive considerations should always drive the analysis. However, statistical tools are available to test for model improvements within a class of models, as these models are most often estimated by maximum likelihood. In addition, measures are available to aid in the choice of the number of classes to use for a given model.
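As an illustration of the kinds of statistical tools referred to here, the sketch below computes a likelihood ratio statistic for two nested models and uses information criteria to compare models with different numbers of classes. The paper does not name specific measures, so the use of the Akaike and Bayesian information criteria (AIC, BIC) is an assumption, and all log-likelihoods, parameter counts and the sample size are invented for the example.

```python
from math import log


def aic(loglik, n_params):
    """Akaike information criterion (smaller is better)."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion (smaller is better)."""
    return -2.0 * loglik + n_params * log(n_obs)


def lr_statistic(loglik_restricted, loglik_general):
    """Likelihood ratio statistic for a restricted model nested in a general one."""
    return 2.0 * (loglik_general - loglik_restricted)


# Hypothetical fits: (number of classes, maximized log-likelihood, free parameters).
n_obs = 1000
fits = [(1, -5210.4, 7), (2, -5032.9, 15), (3, -5011.6, 23), (4, -5004.8, 31)]

for k, ll, p in fits:
    print(k, round(aic(ll, p), 1), round(bic(ll, p, n_obs), 1))

best_by_bic = min(fits, key=lambda f: bic(f[1], f[2], n_obs))
best_by_aic = min(fits, key=lambda f: aic(f[1], f[2]))

# Hypothetical nested comparison within the two-class model:
# item parameters constrained equal (restricted) versus freely estimated (general).
lr = lr_statistic(-5040.2, -5032.9)
```

The likelihood ratio statistic is usually referred to a chi-square distribution with degrees of freedom equal to the difference in the number of free parameters. That reference distribution is generally not appropriate for comparing models with different numbers of classes, which is one reason information criteria are commonly consulted for that choice. Different criteria need not agree, which leaves room for the substantive considerations emphasized above.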
Following the development of these psychometric methodologies, software tools have recently been developed to estimate models that include both categorical and continuous latent variables, notably the Mplus package. Because of the generality of the program, its use may be somewhat intimidating to untrained analysts, although the developers have gone to great lengths to simplify the process of implementation. Other programs are also available for estimation of structural equation models, each with its own limitations on the types of models that are estimable, such as Proc Traj in SAS, EQS (EQuationS), AMOS and LISREL. One particular challenge that will continue to face those who develop scales is the limited usability of software applications, which serves as a barrier to more widespread adoption of these newer methods. However, with input from the user community at large, the future appears bright for the continued development of more flexible and appropriate psychometric methodologies.
The purpose of this paper was to examine some of the psychometric issues that arise when assessing scales or instruments that measure alcohol-related problems. To that end, this paper has provided a forum for the discussion of the unitary versus multi-dimensional conceptualizations of alcohol abuse/dependence, the ‘life of scales’ once they have become accepted as the standard practice, and some future directions for the development of scales that have not yet been well incorporated into the set of analytical strategies used in the alcohol field. To the extent possible, the goal of this paper is to stimulate more discussion in these areas in the hope that it will enhance further development of newer ways to measure alcohol-related problems.
The research for this paper was supported by grant P30 AA05595 from the US National Institute on Alcohol Abuse and Alcoholism to the Alcohol Research Group, Public Health Institute, Emeryville, California.