The percentages do not sum to 100%, because many articles included multiple methods. ANOVA = analysis of variance; ICC = intraclass correlation coefficient; ANCOVA = analysis of covariance; ROC = receiver operating characteristic.

Special Article

# Statistical methods in *Arthritis & Rheumatism*: Current trends

Article first published online: 28 NOV 2006

DOI: 10.1002/art.22251

Copyright © 2006 by the American College of Rheumatology

Additional Information

#### How to Cite

Kim, M. (2006), Statistical methods in *Arthritis & Rheumatism*: Current trends. Arthritis & Rheumatism, 54: 3741–3749. doi: 10.1002/art.22251

#### Publication History

- Issue published online: 28 NOV 2006
- Article first published online: 28 NOV 2006
- Manuscript Accepted: 24 AUG 2006
- Manuscript Received: 1 AUG 2006


### Introduction


The prevalence and complexity of statistical methods used in biomedical research have increased in recent years. The majority of studies reported in the *New England Journal of Medicine* in 2004–2005 used sophisticated statistical methods, a marked increase compared with prior years (1). Similarly, more than 20 years ago, Felson et al (2), comparing articles in *Arthritis & Rheumatism* (*A&R*) from 1982 with those from 1967 to 1968, concluded that the number and sophistication of analytical techniques had increased substantially, but so had the number of statistical errors and the misuse of methods.

The *A&R* survey by Felson and colleagues was conducted when the most common analyses were simple, such as the *t*-test, chi-square test, and linear regression. New statistical innovations and software programs have proliferated since then, influencing how rheumatology studies are now designed and analyzed. This article includes an evaluation of statistical procedures used in *A&R* articles published in 2005 and an overview of some advanced statistical approaches, with examples from recent literature.

### Analysis of articles published in *A&R* in 2005


All 419 reports classified as “research articles” published in *A&R* in 2005 were included in this survey. Following the approach described by Horton and Switzer (1), the Methods section of each article was examined by the author, and all statistical techniques were recorded. For articles that did not include a separate description of the analytical methods, the Results section was also scanned. The proportion of articles using different analytical approaches was tabulated.

Table 1 lists the statistical methods used, ordered according to frequency of use. In 94% of the reported studies, some type of statistical procedure was used; this represents a substantial increase from prior years: only 50% of studies published in *A&R* from 1967 to 1968 and 62% from 7 randomly selected months in 1982 used statistics (2). The remaining 6% of articles published in *A&R* in 2005 were predominantly case reports that did not require the use of statistics.

Statistical method | No. (%) of articles |
---|---|
No statistics | 26 (6) |
t-test | 153 (37) |
Wilcoxon rank sum test | 147 (35) |
Chi-square test | 83 (20) |
Multiple comparisons | 68 (16) |
ANOVA | 67 (16) |
Correlation | 63 (15) |
Fisher's exact test | 59 (14) |
Logistic regression | 40 (10) |
Kruskal-Wallis test | 31 (7) |
Multiple linear regression | 27 (6) |
Statistical genetics | 25 (6) |
Survival analysis | 23 (5) |
Measures of reliability or agreement (kappa/ICC/Bland-Altman method) | 20 (5) |
Generalized estimating equations | 17 (4) |
Missing data methods | 17 (4) |
Mixed effects models | 15 (4) |
ANCOVA | 11 (3) |
Simple linear regression | 11 (3) |
Repeated-measures ANOVA | 10 (2) |
Microarray/proteomics analysis | 10 (2) |
Mantel-Haenszel test | 6 (1) |
ROC curve | 6 (1) |
Propensity scores | 6 (1) |
Other methods† | 12 (3) |

† Other advanced methods included simulations (n = 3), Poisson models (n = 3), multivariate analysis of variance (n = 3), survey methods (n = 2), meta-analysis (n = 1), multidimensional scaling (n = 1), log-linear models (n = 1), and Bayesian hierarchical models (n = 1).

The most frequently used procedures in 2005 were the *t*-test (37%) and its nonparametric analog, the Wilcoxon rank sum test (35%), followed by the Pearson chi-square test (20%). Of the studies published in 1967–1968, 17% used the *t*-test, and 19% used the chi-square test; in 1982, the corresponding percentages were 50% and 30%. The percentage of studies in which these 2 tests were used has declined since 1982, presumably because of increased awareness that the Wilcoxon rank sum test rather than the *t*-test should be applied for non-normally distributed continuous data, and Fisher's exact test or other methods should be used for the analysis of contingency tables with small sample sizes.

In more than half (57%) of the articles published in *A&R* in 2005, either no statistics were reported, or only elementary statistical methods, defined as the *t*-test, Wilcoxon test, chi-square test, Fisher's exact test, analysis of variance (ANOVA), correlation, or simple linear regression, were reported. In contrast, only 21% of the articles published in the *New England Journal of Medicine* from 2004 to 2005 were in this category, indicating greater use of more advanced methods (1). The discrepancy between the 2 journals is most likely attributable to a larger proportion of basic science papers requiring simpler analytical approaches appearing in *A&R*.

Felson et al (2) observed that the most common statistical error in *A&R* articles from 1982 was failure to appropriately adjust for multiple testing, which can inflate the rate of false-positive findings. In 2005, however, multiple testing procedures were one of the most frequently applied statistical methods, used in 16% of *A&R* articles.

Forty-three percent of the articles published in 2005 used more sophisticated analytical techniques. Particularly notable is the frequent use of statistical genetics (6%), reflecting in general the increased focus on genetics in biomedical research. Use of statistical genetics methods was nearly as prevalent as or more prevalent than use of multiple linear regression and survival analysis. Approaches for analyzing longitudinal and other types of correlated data from clinical trials and prospective observational studies, including generalized estimating equations (GEEs) and mixed effects models, were also among the more commonly used sophisticated methods.

### Advanced statistical methods


Described below are some advanced statistical methods and examples of applications from articles published in *A&R* in 2005. Many of these methods have been incorporated in statistical software packages such as SAS, SPSS, and Stata. A summary of the different approaches is provided in Table 2.

Method | Application |
---|---|
Multiple linear regression | Evaluating the relationship between a normally distributed outcome and categorical or continuous predictor variables |
Logistic regression | Evaluating the relationship between a binary outcome and categorical or continuous predictor variables; can be extended to ordinal and categorical outcomes |
Survival analysis | Analyzing time-to-event data with censored observations |
Kaplan-Meier plot | Estimating survival distribution using a nonparametric approach |
Log rank test | Comparing survival distributions across ≥2 groups |
Cox proportional hazards model | Evaluating the effects of continuous or categorical variables on survival time |
Longitudinal data analysis | Analyzing outcomes that are measured at multiple time points and other types of correlated data |
Repeated-measures analysis of variance | Analyzing a repeatedly measured continuous outcome variable; excludes observations with missing data |
Linear mixed effects models | Analyzing a repeatedly measured continuous outcome variable; includes both random and fixed effects and allows for data that are missing at random |
Generalized estimating equations | Analyzing a repeatedly measured outcome variable that can be continuous or categorical |
Missing data methods | Adjusting for the occurrence of missing data |
Last observation carried forward | Filling in missing data using the last observed value for subjects lost to followup in a prospective study; not recommended since it may yield biased results |
Multiple imputation | Using multiple values randomly drawn from an assumed distribution to fill in each missing value; appropriately accounts for uncertainty and variability of imputed value |
Multiple testing procedures | Adjusting for increase in false-positivity rate due to multiple testing |
Bonferroni, Tukey's test | Controlling for the family-wise error rate, i.e., probability of any false-positive result in a family of tests |
False discovery rate | Controlling for the false discovery rate, i.e., proportion of false-positive results among all significant findings; more powerful than family-wise error rate procedures |
Measures of agreement | Assessing degree of agreement between 2 raters or methods |
Kappa statistic | Quantifying magnitude of agreement between ≥2 raters for categorical outcomes |
Intraclass correlation coefficient | Quantifying magnitude of agreement between ≥2 raters for continuous outcomes; evaluating reproducibility of measurements |
Bland-Altman method | Evaluating whether one method can be substituted for another when measurements are continuous |
Propensity scores | Adjusting for differences in patient characteristics when evaluating the effect of a treatment or exposure in an observational study |

#### Multivariable linear regression.

In simple linear regression, one is interested in evaluating the linear association of a single predictor variable with a continuous outcome or dependent variable that is normally distributed. The parameter of interest is the slope, which measures the magnitude of change in the outcome variable per unit change in the predictor variable. For example, Farge et al (3) fit a simple linear regression model to study immunologic reconstitution rates after autologous bone marrow transplantation in systemic sclerosis. The outcome variable in this case was percentage of lymphocytes remaining from levels measured at baseline, and the predictor variable was time in months since transplantation. The estimated slope in this study indicates how much the lymphocyte levels changed over a 1-month period.
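To make the interpretation of the slope concrete, the following is a minimal sketch of an ordinary least-squares fit with a single predictor. The function name `least_squares_line` and the data are hypothetical and chosen only for illustration; they are not taken from the study by Farge et al.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and sum of cross-products of x and y about their means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must not be constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: months since transplantation and lymphocyte level
# expressed as a percentage of the baseline level.
months = [1, 2, 3, 4, 5, 6]
percent_of_baseline = [22.0, 27.5, 31.0, 38.5, 41.0, 47.0]

intercept, slope = least_squares_line(months, percent_of_baseline)
print(f"Estimated change per month: {slope:.2f} percentage points")
```

The printed slope is read exactly as described above: the estimated change in the outcome for each additional month of followup.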

Multivariable (or multiple) linear regression allows for analysis of the effects of multiple predictors, continuous or categorical, on a continuous outcome. This approach adjusts for the effects of potential confounders in evaluating the association between a main predictor variable and the outcome. Bultink et al (4) used multiple regression analysis to model low bone mineral density in patients with systemic lupus erythematosus as a function of age, sex, race, menopause status, body mass index, disease duration, and disease activity. Independent risk factors that were found included low body mass index, postmenopause status, and vitamin D deficiency.

Two critical issues in multiple linear regression models are multicollinearity and variable selection strategies. When the correlation between predictors is high (multicollinearity), the resulting regression coefficient estimates can be very unstable, and the standard errors can be large. Intuitively, this makes sense, because there is little information in subsequent correlated predictor variables beyond what is contained in the first variable. Approaches for addressing multicollinearity include dropping 1 or more correlated variables, normalizing the variables, or using an analytical approach such as ridge regression, a modification of multiple regression, to deal with collinear predictors (5, 6).

Another important issue in multiple linear regression analysis is how to determine which variables to include as predictors in the model (variable selection), especially when the number of potential predictors is large. Strategies include forward selection, backward elimination, and all-subsets regression, each of which has its own strengths and limitations, as discussed by Draper and Smith (5). Clinical or biologic considerations also need to be taken into account in model building.

#### Logistic regression.

In many studies, the outcome variable is binary (i.e., belongs to only 1 of 2 categories). Examples include disease progression/no progression, presence/absence of a biologic or clinical characteristic, and response/nonresponse to an experimental treatment. The association between a binary outcome and another categorical variable can be assessed with a simple chi-square test or Fisher's exact test. Often, however, the aim is to evaluate the independent effects of multiple continuous or categorical predictor variables on the binary outcome or adjust for the effects of confounders in the analysis. Logistic regression models, which may be viewed as a variation of linear regression for binary outcomes, can accomplish these objectives. With this approach, the logarithm of the odds of the outcome, i.e., log[*P*/(1 − *P*)], where *P* is the probability of the event of interest, is modeled as a linear function of the predictor variables.
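The log-odds formulation can be illustrated with a short sketch. The helper names and the coefficient values below are hypothetical; in practice the coefficients would be estimated from data by statistical software.

```python
import math


def log_odds(p):
    """Logarithm of the odds, log[P / (1 - P)], for a probability 0 < P < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def predicted_probability(intercept, coefficients, predictors):
    """Probability of the event implied by a logistic regression model.

    The log odds are a linear function of the predictors; inverting the
    log-odds transformation gives the probability.
    """
    linear_predictor = intercept + sum(
        b * x for b, x in zip(coefficients, predictors)
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical model: intercept -1.0, coefficient 0.8 for a binary treatment
# indicator, and 0.05 per unit of a continuous covariate.
p = predicted_probability(-1.0, [0.8, 0.05], [1, 10])
print(f"Predicted probability: {p:.3f}")
print(f"Log odds: {log_odds(p):.3f}")
print(f"Odds ratio for treatment: {math.exp(0.8):.2f}")
```

Exponentiating a coefficient gives the odds ratio associated with a 1-unit increase in the corresponding predictor, which is how logistic regression results are usually reported.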

Harris and colleagues (7) assessed whether patients with fibromyalgia who experienced large fluctuations in pain were likely to respond to milnacipran, in a randomized placebo-controlled clinical trial. Those investigators fit a logistic regression model with the binary response defined as a ≥50% improvement in pain. Treatment assignment and pain variability index were the main predictor variables; age, duration of fibromyalgia, and race were additional covariates to control for these factors.

There are extensions to the logistic regression model for matched data (conditional logistic regression), for the outcome variable having >2 categories (multinomial regression), and for the outcome variable having ordered categories (ordinal regression). For example, Hunter et al (8) investigated whether knee height (predictor variable) was associated with the severity of knee symptoms (dependent variable) in elderly Beijing residents, while adjusting for age, physical activity, quadriceps strength, radiographic severity, and body mass index (confounders). Those authors chose ordinal logistic regression analysis because severity of knee pain had 4 ordinal categories (0 = no pain, 1 = usually bearable pain, 2 = sometimes unbearable pain, and 3 = mostly or always unbearable pain). Some investigators might consider combining adjacent categories of an ordinal variable to form a binary outcome (e.g., 0/1 versus 2/3) and applying conventional logistic regression, but this approach can result in a significant loss of information. Further discussion of how to analyze ordinal data can be found in the report by LaValley and Felson (9).

#### Survival analysis.

Many studies measure time-to-event outcomes, such as time to clinical response, disease progression, or death, commonly referred to as survival data. The usual goal in these studies is to estimate the cumulative incidence of the event of interest and identify factors that predict when the event will occur. A complicating feature, however, is that subjects may not experience the event by the end of the followup period; the survival time in this case is known only to be longer than the last time the subject was under observation. These survival times are considered to be right-censored.

The Kaplan-Meier method estimates the distribution of survival times in the presence of right-censored data. Survival distributions between 2 or more groups can be compared using a log rank test, and the effect of 1 or more predictor variables on the survival time is analyzed with the Cox proportional hazards model. The Cox model allows for the inclusion of continuous, categorical, or time-dependent predictor variables. Time-dependent covariates are variables whose values can change over time, such as blood pressure, cholesterol levels, and other biologic measurements, as opposed to fixed covariates such as sex and ethnicity.
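As an illustration of how right-censored observations enter the Kaplan-Meier estimate, the sketch below computes the product-limit estimator directly. The function name and the followup times are hypothetical; standard statistical packages provide equivalent routines along with confidence intervals.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : followup time for each subject
    events : 1 if the event was observed at that time, 0 if right-censored
    Returns a list of (time, estimated survival probability) pairs, one for
    each distinct time at which an event was observed.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects whose followup ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without changing the estimate.
        at_risk -= n_leaving
    return curve


# Hypothetical followup times in months; 0 marks a censored observation.
followup = [1, 2, 3, 4, 5]
observed = [1, 0, 1, 0, 1]
for t, s in kaplan_meier(followup, observed):
    print(f"month {t}: estimated survival {s:.3f}")
```

Censored subjects contribute to the number at risk up to their last observation time, which is why they are not simply discarded.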

Goodson et al (10) fit Cox proportional hazards models to assess the association between C-reactive protein levels and risk of death from cardiovascular disease in patients with inflammatory polyarthritis. A fixed covariate was used to model the effect of C-reactive protein, since only the baseline measurement was analyzed. In contrast, Maradit-Kremers and colleagues (11) used time-dependent covariates to determine whether systemic inflammation increases the risk of death from cardiovascular disease among patients with rheumatoid arthritis, with adjustment for traditional cardiovascular risk factors. Because the status of dyslipidemia, diabetes mellitus, congestive heart failure, and hypertension can change during a subject's followup period, time-dependent covariates were included in the Cox model. Extensions to the Cox model to allow for more complicated types of censoring, truncated survival times, and correlated time-to-event outcomes on the same subject are also available (12, 13).

#### Longitudinal and correlated data analysis.

Studies involving repeated collection of information on the same individuals or experimental units over time yield longitudinal data. Examples occur in clinical trials, epidemiologic studies, and laboratory experiments in which animals or biologic specimens are observed or assessed for specific outcomes over multiple time points. The analytical challenge here is to relate one or more predictor variables to a repeatedly measured outcome variable while properly taking into account time trends and correlations among multiple measurements from the same subject. Measurements of the same subject tend to be more similar to each other than to measurements of different subjects. Ignoring this correlation can lead to invalid standard errors and inferences.

Nearly 10% of *A&R* articles published in 2005 used longitudinal methods. The most common approaches were GEEs, repeated-measures ANOVA, and linear mixed effects models. The GEE method is an extension for correlated data of generalized linear models, a unified regression framework that includes linear regression and logistic regression as special cases (14). The advantage of the GEE method is that it can accommodate different types of repeated outcome variables, such as categorical, count, and continuous data, although it is most commonly used for noncontinuous outcomes.

Repeated-measures ANOVA and linear mixed effects models can be applied to only continuous outcomes. Repeated-measures ANOVA is a special case of linear mixed effects models and requires that all subjects be evaluated under the same experimental conditions. Subjects with missing values are deleted from the analysis, which can be a major source of potential bias. In contrast, linear mixed effects models use all available data and are able to handle unequally spaced time points, subjects observed at different time points, and missing data, provided they are missing at random, i.e., the missing data pattern is not dependent on unobserved data but may depend on observed data.

Gerloni et al (15) evaluated the efficacy of infliximab combined with methotrexate in persistently active, refractory juvenile idiopathic arthritis. Disease activity indices were assessed at baseline, week 2, week 6, and every 8 weeks thereafter, and repeated-measures ANOVA models were fit to analyze all of the multiple measurements obtained from the same subject during the first year of treatment. Mixed effects and GEE models also could have been applied to these data and would have yielded estimates of treatment effects in addition to tests of significance (*P* values). Messier et al (16) applied mixed effects models to predict gait kinetic outcome variables, with body mass as the primary predictor variable, in an 18-month clinical trial of diet and exercise in overweight and obese older adults with knee osteoarthritis. Mixed models were used so that all of the available followup data from 6 months and 18 months could be included simultaneously in the analysis.

The methods described above are also applicable to data that arise in contexts other than longitudinal studies but are likely to be correlated, such as family studies or measurements obtained from different body parts of one individual. Reijman et al (17) investigated the effect of different types of nonsteroidal antiinflammatory drugs on progression of osteoarthritis of the hip and knee. Because data from both of a subject's hips or knees were included in the analyses and the outcome was binary (presence or absence of progression), the GEE method was used to take into account the correlation between joints from the same subject.

#### Methods for handling missing data.

Missing data are a common problem, especially in longitudinal studies, and can lead to bias and inefficiency in statistical analyses. Reasons for missing data include early withdrawal from a prospective study, missed visits, and technical or laboratory problems.

Most strategies for handling missing data involve imputation, or filling in the missing values. Missing data due to losses to followup in prospective studies reported in *A&R* were most commonly handled using the last observation carried forward (LOCF) technique. This method is an example of a single imputation approach, in which a single value is used to fill in each missing value. In the case of the LOCF method, the last observed value is used for all subsequent missing values to yield a complete data set.

Although the LOCF approach is easy to implement and understand, it often leads to biased estimates of covariate effects and always leads to biased estimates of variances, since the assumption is that the variable remains constant at the last observed value when in fact it often changes, and no distinction is made between the imputed values and the values actually observed. Other types of single imputation include imputing the worst-case and best-case values, mean imputation, conditional mean imputation, and the hot deck method (18).
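The mechanics of LOCF, and the reason it flattens a subject's trajectory, can be seen in a minimal sketch. The function name `locf` and the scores are hypothetical; `None` marks a missed visit.

```python
def locf(values):
    """Last observation carried forward.

    Each missing value (None) is replaced by the most recent observed value.
    Missing values that precede the first observation remain None.
    """
    filled = []
    last_observed = None
    for value in values:
        if value is not None:
            last_observed = value
        filled.append(last_observed)
    return filled


# Hypothetical pain scores for a subject who withdrew after the third visit.
scores = [8, 6, 5, None, None, None]
print(locf(scores))
```

The imputed values are indistinguishable from observed ones in the completed data set, so any continued improvement or worsening after withdrawal is ignored and the variability of the later visits is understated.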

In contrast to imputing a single value to fill in each missing observation, multiple imputation involves using multiple values randomly drawn from an assumed distribution for each missing value. The advantage of this approach is that the analyses take into consideration the inherent uncertainty and variability surrounding which value to impute (19).
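One way to see how multiple imputation carries the uncertainty about the imputed values into the final result is through the rules commonly used to combine the analyses of the imputed data sets (often called Rubin's rules). The article does not spell these out, so the sketch below is an assumption about the combining step, with a hypothetical function name and hypothetical numbers.

```python
import statistics


def pool_imputations(estimates, variances):
    """Combine results from m imputed data sets.

    estimates : the parameter estimate from each imputed data set
    variances : the squared standard error of each estimate
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least 2 imputations with matching variances")
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)   # variance between imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical treatment effects and squared standard errors from 5 imputations.
effects = [1.9, 2.3, 2.1, 1.7, 2.0]
squared_ses = [0.25, 0.27, 0.24, 0.26, 0.25]
estimate, variance = pool_imputations(effects, squared_ses)
print(f"Pooled estimate {estimate:.2f}, standard error {variance ** 0.5:.2f}")
```

The between-imputation term is what a single imputation method omits; it grows as the imputed values disagree more from one draw to the next.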

When <10% of the values are missing and the missing data mechanism is random, unconditional and conditional mean imputation, multiple hot deck imputation, expectation maximization imputation, and multiple imputation tend to yield similar results (20). If the proportion of missing data is larger, the performance of the multiple imputation approach is superior. The LOCF method, as a rule, should not be used. Sensitivity analyses to evaluate the stability of the results to different approaches for handling missing data are also recommended.

#### Multiple testing procedures.

Multiple testing arises when several groups are evaluated in the same study and pairwise comparisons are performed between the groups, or when statistical tests are conducted for several different outcomes. The problem is that the chance of a false-positive finding increases with the number of hypothesis tests performed. To illustrate, if 10 independent tests are performed, each at the α = 0.05 significance level, then the chance of obtaining 1 or more false-positive results is equal to $1 - (1 - 0.05)^{10} \approx 0.40$, or 40%. Most multiple comparison procedures ensure that the probability of 1 or more false-positive findings in a family of tests does not exceed the chosen significance level α. This probability is called the family-wise error rate.
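The calculation above generalizes to any number of independent tests. The following sketch, with hypothetical function names, reproduces the 40% figure and shows the effect of a Bonferroni-adjusted per-test significance level.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least 1 false-positive result among n_tests
    independent tests, each performed at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_level(alpha, n_tests):
    """Per-test significance level under the Bonferroni adjustment."""
    return alpha / n_tests


unadjusted = family_wise_error_rate(0.05, 10)
adjusted = family_wise_error_rate(bonferroni_level(0.05, 10), 10)
print(f"Unadjusted family-wise error rate: {unadjusted:.3f}")
print(f"With Bonferroni adjustment: {adjusted:.3f}")
```

The independence assumption is what makes the closed-form calculation possible; the Bonferroni adjustment itself keeps the family-wise error rate at or below α whether or not the tests are independent.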

In the current survey, the most common approaches for controlling the family-wise error rate in pairwise comparisons among multiple groups were the Bonferroni adjustment and Tukey's test. Tukey's test is generally recommended over the Bonferroni adjustment, because the latter tends to be more conservative and less powerful. Other family-wise error rate–based approaches that were used in *A&R* were the Newman-Keuls test, which produces significance more frequently than Tukey's test but also yields more false-positive results, and Dunnett's multiple comparison test for comparing multiple groups with a single control.

The problems associated with multiple testing are especially acute in genomic studies, because these routinely involve the analysis of expression levels of thousands of genes. Control of the family-wise error rate is viewed as too stringent in these studies, because it limits the probability of observing *any* false-positive results. A few errors are usually acceptable, provided the proportion of errors among the identified set of differentially expressed genes is sufficiently small. The expected proportion of false-positive results among all findings declared significant is known as the false discovery rate (FDR) (21). A 10% FDR implies that, on average, 10% of the significant findings are false-positive.

Tan et al (22) used an FDR-based approach to minimize false-positive results in the comparison of expression levels of 8,324 genes in patients with diffuse systemic sclerosis versus levels in a matched group of normal controls. With this approach, 832 genes were found to be differentially expressed, but with 320 potential false discoveries (38%). If a traditional Bonferroni correction had been applied, a *P* value of <0.000006 (i.e., 0.05/8,324) would have to be observed in order for a gene to be declared differentially expressed at a family-wise error rate of α = 0.05. The FDR approach, which yields greater power than the Bonferroni approach, should be used in situations in which control of the family-wise error rate is considered to be too stringent. Such situations include large-scale screening, as opposed to confirmatory, studies of genes, proteins, or other variables, in which the goal is to make as many “discoveries” as possible while limiting the fraction of false-positive findings.
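The difference in power between the 2 approaches can be illustrated with the Benjamini-Hochberg step-up procedure, a widely used method for controlling the FDR. This is an assumption made for illustration only; it is not necessarily the procedure applied by Tan et al, and the function names and *P* values below are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure.

    The P values are ranked from smallest to largest; the largest rank k
    with p_(k) <= k * fdr / m is found, and the k smallest P values are
    declared significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def bonferroni(p_values, alpha=0.05):
    """Indices of hypotheses rejected at a family-wise error rate of alpha."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


# Hypothetical P values from 10 tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print("Benjamini-Hochberg rejects:", benjamini_hochberg(p_values))
print("Bonferroni rejects:", bonferroni(p_values))
print(f"Bonferroni threshold for 8,324 tests: {0.05 / 8324:.7f}")
```

Because the FDR threshold increases with the rank of the *P* value, the procedure rejects at least as many hypotheses as the Bonferroni adjustment at the same nominal level.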

It should be noted that the issues of when and how to adjust for multiple testing have been debated extensively and remain controversial. Most would agree, however, that multiple testing procedures are not needed if the hypotheses being tested were planned a priori, whereas they should be applied to studies that are more exploratory in nature, such as gene-finding experiments.

#### Measures of agreement.

Five percent of studies published in *A&R* in 2005 used statistical methods for assessing reliability, agreement, and validity of new research instruments, diagnostic tests, assays, or biomarkers. The most common methods included the kappa statistic, intraclass correlation coefficient (ICC), measures of sensitivity and specificity, and Bland-Altman plots.

The kappa statistic was originally developed as a measure of interrater agreement for categorical outcomes. This statistic, which reflects the magnitude of agreement between 2 raters beyond that attributable to chance, can technically range from −1 to +1, although most values tend to fall between 0 (agreement no better than chance) and 1 (perfect agreement) (23). The kappa statistic is also used as a measure of similarity or reproducibility for categorical responses and assumes no inherent ordering in terms of importance or validity in the different sets of measurements. If, however, the aim is to assess the validity of a new method by comparing it with a more established method, or to compare the responses of an inexperienced rater with those of an expert rater, then other approaches, including measuring sensitivity and specificity, may be more appropriate measures of agreement (24, 25).
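A minimal sketch of the kappa statistic for 2 raters is given below. The function name and the ratings are hypothetical; the calculation compares the observed proportion of agreement with the proportion expected by chance from each rater's marginal frequencies.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Kappa statistic for agreement between 2 raters on categorical ratings."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must rate the same nonempty set of subjects")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance, from the marginal frequencies of each rater.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical severity grades assigned by 2 raters to 8 radiographs.
grades_a = [0, 1, 2, 2, 3, 1, 0, 2]
grades_b = [0, 1, 2, 3, 3, 1, 1, 2]
print(f"kappa = {cohen_kappa(grades_a, grades_b):.2f}")
```

This unweighted form treats all disagreements equally, consistent with the absence of ordering noted above.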

The ICC is a measure of interrater agreement or reproducibility for continuous variables. Unlike Pearson's correlation and Spearman's rank correlation, the ICC will reflect not only the degree of correlation between 2 sets of measurements but also systematic differences between the measurements. For example, if one set of measurements is consistently higher by the same amount than another set, a simple correlation between them would still yield ρ = 1 (i.e., perfect correlation), whereas the ICC would be <1 because of the systematic differences.
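The contrast between correlation and the ICC can be checked numerically. Several forms of the ICC exist and the text does not specify one; the sketch below assumes the one-way random effects form, and the function names and measurements are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between 2 sets of measurements."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def one_way_icc(ratings):
    """One-way random effects ICC for a single measurement.

    ratings : one list per subject, each holding k measurements.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need >=2 subjects, each with the same number (>=2) of measurements")
    subject_means = [sum(r) / k for r in ratings]
    grand_mean = sum(subject_means) / n
    # Between-subject and within-subject mean squares from a 1-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for r, m in zip(ratings, subject_means) for v in r
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all measurements are identical")
    return (ms_between - ms_within) / denominator


# Hypothetical measurements: the second set is always 2 units higher.
first = [10, 12, 14, 16, 18]
second = [value + 2 for value in first]
print(f"Pearson correlation: {pearson_correlation(first, second):.2f}")
print(f"ICC: {one_way_icc([[a, b] for a, b in zip(first, second)]):.2f}")
```

The constant offset leaves the correlation at 1 but adds within-subject variability, which lowers the ICC.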

Reijman and colleagues (17) estimated both the kappa statistic and the ICC in a study of interrater reliability of radiographic assessments of the hip and knee. Agreement in Kellgren/Lawrence grades, a system for categorizing the severity of osteoarthritis, was measured using the kappa statistic, which was estimated to be 0.68. For minimum joint space width, a continuous outcome, the ICC was estimated to be 0.85, indicating excellent agreement.

A Bland-Altman plot is a simple graphic approach for evaluating whether one measurement technique can be substituted for another (26); outcomes must be continuous variables. In this approach, the difference in the 2 measurements is first plotted against the mean of the 2 measurements to assess whether the magnitude of the difference depends on the mean. The 95% limits of agreement (i.e., the mean difference ± 2 SD of the differences) are computed and plotted on the graph. If the differences are normally distributed, approximately 95% of the differences should lie within the limits. If all of the differences within these limits are biologically unimportant, there is sufficient agreement between the measurements. LaValley et al (27) used the Bland-Altman method to graph the degree of test–retest reliability of joint space width on lateral-view radiographs.
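The quantities plotted in a Bland-Altman graph can be computed as in the following sketch. The function name and the measurements are hypothetical, and the multiplier of 2 follows the description above (1.96 is also commonly used).

```python
import statistics


def bland_altman_limits(method_a, method_b):
    """Quantities needed for a Bland-Altman plot of 2 measurement methods."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need paired measurements on at least 2 subjects")
    differences = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    mean_difference = statistics.mean(differences)
    sd_difference = statistics.stdev(differences)
    return {
        "means": means,                # horizontal axis of the plot
        "differences": differences,    # vertical axis of the plot
        "mean_difference": mean_difference,
        "lower_limit": mean_difference - 2 * sd_difference,
        "upper_limit": mean_difference + 2 * sd_difference,
    }


# Hypothetical joint space width measurements (mm) from 2 methods.
width_a = [3.1, 2.8, 4.0, 3.6, 2.5, 3.9]
width_b = [3.0, 2.9, 3.8, 3.7, 2.4, 3.6]
result = bland_altman_limits(width_a, width_b)
print(f"Mean difference: {result['mean_difference']:.3f}")
print(f"Limits of agreement: {result['lower_limit']:.3f} to {result['upper_limit']:.3f}")
```

Whether the resulting limits are narrow enough is a clinical or biologic judgment rather than a statistical one.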

#### Propensity scores.

In a clinical trial, the random assignment of subjects to treatment groups ensures that the groups are balanced with respect to patient factors. In an observational study, however, some subjects may be more likely to receive one treatment than another. The propensity score is an increasingly used method of adjusting for differences in patient characteristics when the objective is to evaluate the effect of a treatment or an exposure on an outcome (28).

Fessler et al (29) investigated whether hydroxychloroquine (HCQ) was associated with a reduced risk of damage accrual, in a longitudinal observational cohort of patients with systemic lupus erythematosus. Since the study was not a clinical trial in which the treatments were assigned in a randomized manner, the treatment groups were unlikely to be balanced with respect to patient characteristics. In fact, because HCQ is traditionally used to treat milder disease, a simple comparison of patients receiving HCQ with patients not receiving HCQ would yield biased results of treatment effect; it would appear that patients receiving HCQ accrued less damage than patients who were not receiving HCQ, even in the absence of a treatment effect. The authors used propensity scores to adjust for differences in patient characteristics between treatment groups.

A propensity score is the probability of exposure to a treatment, given a subject's observed covariates, and is estimated by using a logistic regression model with treatment exposure as the outcome variable. The scores are estimated for each subject from the logistic regression model, and then the study population can be stratified on the propensity score. The result is that all subjects in a given stratum will have approximately the same propensity or probability of receiving treatment, even though there will be differences in the actual treatment received and covariate profiles. The covariate distributions will be similar across treatment groups within a stratum, so the groups can be validly compared. In addition to stratification, one can use the propensity score to create matched pairs or adjust for the propensity score as a covariate in the main analysis. Austin and Mamdani (30) observed that greater balance in subject characteristics is achieved by matching than by stratification, but at the potential expense of a reduced sample size if appropriate matches for some individuals cannot be found. Covariate adjustment for the propensity score is limited in that the regression model must be correctly specified in order to obtain unbiased estimates of treatment or exposure effects.
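The stratification step can be sketched as follows, assuming that the propensity scores have already been estimated from a logistic regression model. The function name, the choice of equal-sized strata, and the data are all hypothetical and serve only to show why a naive comparison and a stratified comparison can differ.

```python
def stratified_effect(scores, treated, outcomes, n_strata=5):
    """Treatment effect estimated within strata of the propensity score.

    Subjects are ranked by propensity score and divided into n_strata groups
    of (nearly) equal size. The difference in mean outcome between treated
    and untreated subjects is computed within each stratum, and the
    differences are averaged with weights proportional to stratum size.
    Strata lacking either treated or untreated subjects are skipped.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    weighted_sum = 0.0
    n_used = 0
    for s in range(n_strata):
        members = order[s * n // n_strata:(s + 1) * n // n_strata]
        treated_outcomes = [outcomes[i] for i in members if treated[i]]
        control_outcomes = [outcomes[i] for i in members if not treated[i]]
        if not treated_outcomes or not control_outcomes:
            continue
        difference = (sum(treated_outcomes) / len(treated_outcomes)
                      - sum(control_outcomes) / len(control_outcomes))
        weighted_sum += difference * len(members)
        n_used += len(members)
    if n_used == 0:
        raise ValueError("no stratum contains both treated and untreated subjects")
    return weighted_sum / n_used


# Hypothetical data: subjects with high scores are more likely to be treated
# and also have higher outcomes regardless of treatment.
scores = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
treated = [1, 0, 0, 0, 1, 1, 1, 0]
outcomes = [2, 0, 0, 0, 12, 12, 12, 10]

treated_mean = sum(o for o, t in zip(outcomes, treated) if t) / sum(treated)
control_mean = sum(o for o, t in zip(outcomes, treated) if not t) / (len(treated) - sum(treated))
print(f"Naive difference: {treated_mean - control_mean:.1f}")
print(f"Stratified difference: {stratified_effect(scores, treated, outcomes, n_strata=2):.1f}")
```

In this constructed example the naive comparison mixes the effect of treatment with the effect of the characteristics that drive treatment assignment, whereas the within-stratum comparison does not.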

With sufficiently large sample sizes, a randomization procedure balances observed as well as unobserved factors, while propensity scores balance only observed covariates. In addition, the propensity score does not substitute for direct adjustment for covariates in a model-based method such as regression; rather, stratifying or matching on propensity scores is usually supplemented by further adjustment for other variables in the main analytical model (31).

#### Other advanced techniques.

Other advanced techniques used in studies published in *A&R* are methods for the analysis of data from genomic and proteomic technologies. These experiments yield large quantities of molecular data. The high dimensionality of the data, coupled with the typically limited number of biologic and technical replicates available, presents significant analytical challenges. In addition to FDR-based multiple testing, the studies in *A&R* used methods such as cluster analysis, classification methods, principal components, and class prediction techniques to analyze gene and protein expression data and correlate them with clinical outcomes.

Specialized methods were also used for the analysis of human genetics studies. In 2005, 25 studies applied principles of statistical genetics, including linkage analysis, haplotype analysis, transmission disequilibrium tests, sibpair methods, and assessment of Hardy-Weinberg equilibrium. A detailed discussion of these methods is beyond the scope of this report; Thomas (32) provides an overview of statistical genetics and further information on specific approaches.

### Conclusion

Nearly all *A&R* articles published in 2005 included a statistical procedure. Many of the studies incorporated multiple methods and advanced approaches, confirming the trend of increasing usage and sophistication of statistical methods in biomedical research. The analytical challenges faced by researchers are likely to grow even more in the future as the technologies for conducting basic science and clinical research continue to evolve. Recent advances in high-throughput experimental techniques and a larger number of studies using genetics have especially necessitated the development and application of more specialized statistical methods.

These documented trends should have an impact on how current and future biomedical investigators are trained and on the resources they will need to successfully conduct their research. The gap between the increasing complexity of methods used in medical research and the level of statistical education obtained by the typical researcher will make it more difficult for investigators not only to effectively conduct their own research, but also to comprehend and critically evaluate the work of others. In addition, readers, including those not actively involved in research, need assistance to stay abreast of new methodologic developments. Recommendations to address these issues have included more opportunities for training in advanced statistics for scientific investigators, such as short courses and workshops introducing new methodologic approaches, and more didactic reports in medical journals on statistical techniques (33); this article and one by LaValley and Felson (9) are examples of recent efforts in *A&R* to address this need.

The growing availability of sophisticated analytical tools and user-friendly software introduces the potential for misuse of these methods by statistically naive researchers and misunderstandings by readers. Twenty years ago, the most common statistical error in *A&R* was failure to appropriately adjust for multiple comparisons. Current investigators appear to be more cognizant of this issue; however, other types of errors are becoming apparent, such as the common use of the LOCF approach to handle missing data without adequate consideration of the underlying assumptions needed for the approach to be valid.

Collaborations between investigators and experienced statisticians on research projects will be more important than ever to ensure that methods, especially those that have been recently developed, are applied appropriately. In addition, journals will need to engage multiple statisticians with different areas of expertise in the manuscript review process, because a single statistician usually does not possess the depth and breadth of knowledge required to critically evaluate all types of articles. Studies in new and rapidly changing fields such as genomics and proteomics, which lack well-established analytical guidelines, need to be especially well scrutinized. Writers of articles will need to explain clearly the strengths and weaknesses of their choices of statistical methods, and finally, readers must assume the obligation of informing themselves about statistical methodology.

### Acknowledgements

I am grateful to Dr. Michael D. Lockshin (editor of *A&R*) and to my colleagues, Dr. Xiaonan Xue and Dr. Charles Hall, for their very helpful comments on this article.

### REFERENCES

- 1. Statistical methods in the journal. N Engl J Med 2005; 353: 1977–9.
- 2. Misuse of statistical methods in Arthritis and Rheumatism 1982 versus 1967–1968. Arthritis Rheum 1984; 27: 1018–22.
- 3. Intensification et Autogreffe dans les Maladies Auto Immunes Resistantes (ISAMAIR) Study Group. Analysis of immune reconstitution after autologous bone marrow transplantation in systemic sclerosis. Arthritis Rheum 2005; 52: 1555–63.
- 4. Prevalence of and risk factors for low bone mineral density and vertebral fractures in patients with systemic lupus erythematosus. Arthritis Rheum 2005; 52: 2044–50.
- 5. Applied regression analysis. New York: John Wiley & Sons; 1981.
- 6. Regression diagnostics: influential data and sources of collinearity. New York: John Wiley & Sons; 1980.
- 7. Characterization and consequences of pain variability in individuals with fibromyalgia. Arthritis Rheum 2005; 52: 3670–4.
- 8. Knee height, knee pain, and knee osteoarthritis: the Beijing Osteoarthritis Study. Arthritis Rheum 2005; 52: 1418–23.
- 9. Statistical presentation and analysis of ordered categorical outcome data in rheumatology journals. Arthritis Rheum 2002; 47: 255–9.
- 10. Baseline levels of C-reactive protein and prediction of death from cardiovascular disease in patients with inflammatory polyarthritis: a ten-year followup study of a primary care–based inception cohort. Arthritis Rheum 2005; 52: 2293–9.
- 11. Cardiovascular death in rheumatoid arthritis: a population-based study. Arthritis Rheum 2005; 52: 722–32.
- 12. Cox regression analysis of multivariate failure time data: the marginal approach. Stat Med 1994; 13: 2233–47.
- 13. Tutorial in biostatistics: methods for interval-censored data. Stat Med 1998; 17: 219–38.
- 14. Models for longitudinal data: a generalized estimating equation approach. Biometrics 1988; 44: 1049–60.
- 15. Efficacy of repeated intravenous infusions of an anti–tumor necrosis factor α monoclonal antibody, infliximab, in persistently active, refractory juvenile idiopathic arthritis: results of an open-label prospective study. Arthritis Rheum 2005; 52: 548–53.
- 16. Weight loss reduces knee-joint loads in overweight and obese older adults with knee osteoarthritis. Arthritis Rheum 2005; 52: 2026–32.
- 17. Is there an association between the use of different types of nonsteroidal antiinflammatory drugs and radiologic progression of osteoarthritis? The Rotterdam Study. Arthritis Rheum 2005; 52: 3137–42.
- 18. Statistical analysis with missing data. New York: John Wiley & Sons; 1987.
- 19. Handling missing data in clinical trials: an overview. Drug Inf J 2000; 34: 525–33.
- 20. Imputations of missing values in practice: results from imputations of serum cholesterol in 28 cohort studies. Am J Epidemiol 2004; 160: 34–45.
- 21. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 1995; 57: 289–300.
- 22. Classification analysis of the transcriptosome of nonlesional cultured dermal fibroblasts from systemic sclerosis patients with early disease. Arthritis Rheum 2005; 52: 865–76.
- 23. Statistical techniques for comparing measures and methods of measurement: a critical review. Clin Exp Pharmacol Physiol 2002; 29: 527–36.
- 24 , , .
- 25. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol 1987; 126: 161–8.
- 26. Statistical method for assessing agreement between two methods of clinical measurement. Lancet 1986; 1: 307–10.
- 27. The lateral view radiograph for assessment of the tibiofemoral joint space in knee osteoarthritis: its reliability, sensitivity to change, and longitudinal validity. Arthritis Rheum 2005; 52: 3542–7.
- 28. Analytic strategies to adjust confounding using exposure propensity scores and disease risk scores: nonsteroidal anti-inflammatory drugs and short-term mortality in the elderly. Am J Epidemiol 2005; 161: 891–8.
- 29. LUMINA Study Group. Systemic lupus erythematosus in three ethnic groups. XVI. Association of hydroxychloroquine use with reduced risk of damage accrual. Arthritis Rheum 2005; 52: 1473–80.
- 30. A comparison of propensity score methods: a case-study estimating the effectiveness of post-AMI statin use. Stat Med 2006; 25: 2084–106.
- 31. Invited commentary: propensity scores. Am J Epidemiol 1999; 150: 327–33.
- 32. Statistical methods in genetic epidemiology. Oxford: Oxford University Press; 2004.
- 33. Transfer of technology from statistical journals to the biomedical literature. JAMA 1994; 272: 129–32.