### ABSTRACT

**Objectives:** The objective of this study was twofold: 1) to confirm the hypothetical eight scales and two component summaries of the Short Form 36 Health Survey (SF-36) questionnaire, and 2) to evaluate the performance of two alternative measures to the original physical component summary (PCS) and mental component summary (MCS).

**Methods:** We performed principal component analysis (PCA) based on 35 items, after optimal scaling via multiple correspondence analysis (MCA), and subsequently on eight scales, after standard summative scoring. Item-based summary measures were planned. Data from the European Community Respiratory Health Survey II follow-up of 8854 subjects from 25 centers were analyzed to cross-validate the original and the novel PCS and MCS.

**Results:** Overall, the scale- and item-based comparison indicated that the SF-36 scales and summaries meet the supposed dimensionality. However, the vitality, social functioning, and general health items did not fit the data optimally. The novel measures, derived a posteriori by unit-rule from an oblique (correlated) MCA/PCA solution, are simple item sums or weighted scale sums where the weights are the raw scale ranges. These item-based scores yielded consistent scale-summary results for outlier profiles and showed the expected known-groups validity.

**Conclusions:** We were able to confirm the hypothesized dimensionality of eight scales and two summaries of the SF-36. The alternative scoring meets at least the same standards as the original scoring. In addition, it can reduce the item-scale inconsistencies without loss of predictive validity.

### Introduction

Health-related quality of life (HRQoL) measurement is often used in clinical trials, quality control programs, and health-care systems. Several methods to correctly judge HRQoL data are available in clinical practice, and these should be validated and standardized, via rigorous development and user feedback, so that results from different studies can be compared [1,2]. Among the different questionnaires for assessing HRQoL, the Short Form 36 Health Survey (SF-36), developed by the Medical Outcomes Study (MOS), is the most used worldwide [3–5]. For over 15 years, the SF-36 has proven useful in comparing general and specific populations, in measuring health deficits and treatment efficacy, and for screening individual patients. Moreover, it has been found to correlate with the frequency and severity of specific symptoms and diseases, as reported in more than 5000 papers on MEDLINE.

The SF-36 questionnaire has been evaluated and proposed in different (SF-36v1, SF-36v2) and shorter (SF-12, SF-8) versions with the aim of better measuring HRQoL and of being easily applied in clinical trials. The SF-36 questionnaire describes eight scales with scale scores ranging from 0 to 100 (percent of maximum sum score); it covers four physical health concepts (physical functioning, PF; role limitations because of physical health problems, RP; bodily pain, BP; general health, GH) and four mental health concepts (vitality, VT; social functioning, SF; role limitations because of personal or emotional problems, RE; and mental health perceptions, MH). Subsequently, two global measures, depending on the eight scales, have been derived and are referred to as the physical component summary (PCS) and the mental component summary (MCS).

The strategy of summary development was based on a data-driven analysis of the 8 × 8 Pearson's correlation matrix of the scale scores by means of principal component analysis (PCA). The number of underlying dimensions was determined by eigenvalue rules, and both Varimax (orthogonal = uncorrelated components) and Promax (oblique = correlated components) rotations were performed to confirm the hypothesis of two higher-order underlying dimensions. Finally, a weighted sum of the eight scales based on the rotated “component score coefficients” has been proposed [6]. The recommended standard MOS system aims to provide maximally independent measurements of the physical and mental health domains; thus, the scoring method forces the PCS and MCS to be uncorrelated by means of orthogonal weights (MOS_{UC}). As physical and mental health are often empirically related, and disease may influence both of them to different extents, an optional MOS scoring method with correlated oblique weights (MOS_{C}) has been further proposed by the SF-36 developers [6], although it is not currently recommended.

A number of studies have confirmed the validity of the dimensional structure of the SF-36 by applying the MOS strategy [7–9]. Other studies, via confirmatory factor analysis and structural equation modeling, have introduced additional factors (components), or residual pairwise item correlations, with contrasting results [10–16].

There is also an ongoing debate about which summary scoring should be applied. Specifically, the uncorrelated summaries do not seem to agree with the empirical evidence that mental and physical health may strongly interact with each other [17,18]. Some authors [19,20] highlighted discrepancies between scores on individual scales and component summaries. Taft et al. [21,22] suggested that these discrepancies are attributable to the effects of the negatively weighted scales used in the PCS and MCS scoring algorithm. Three “mental” scales are negatively weighted for the PCS, while four “physical” scales are negatively weighted for the MCS. Thus, the higher the mental health scale scores, the lower the PCS, and the higher the physical health scale scores, the lower the MCS (and vice versa). In the extreme, the PCS primarily measures impaired mental health, and the MCS impaired physical health. Negative loadings were also assigned by the correlated solution.

With these caveats in mind, by means of the European Community Respiratory Health Survey (ECRHS) data [23,24], the purpose of this study was: 1) to confirm the hypothetical eight scales and two summaries of the SF-36 questionnaire based on a data-driven (exploratory) analysis of the 35 items recoded by “optimal” weights; and 2) to propose two new summary measures to avoid the negative weightings of the MOS (uncorrelated and correlated) component scoring.

### Discussion

The SF-36 is one of the most widely used HRQoL measures, and the ECRHS II dataset, composed of 8854 valid questionnaires from 25 international centers, is one of the most widespread datasets that include an SF-36 administration. The large amount of data gave us the opportunity to produce reliable results and to evaluate the measurement features of the SF-36 questionnaire.

Our first aim was to confirm the eight first-order dimensions and the two second-order dimensions of the SF-36. To do this, we performed a PCA based on the 35 items, after MCA optimal quantifications. Scale-based analysis has been used in several studies on the SF-36, and item-based analysis is performed in confirmatory studies using structural equation modeling [12–16]. Only a small number of exploratory studies have considered the item level [34–37], and both exploratory and confirmatory studies have treated the Likert/binary item responses as continuous variables.

As the SF-36 is widely used in clinical practice, it was mandatory to investigate its dimensionality by different approaches. There is no reason to assume that the response options of items a–d of the GH scale (1 = “definitely true,” 2 = “mostly true,” 3 = “don't know,” 4 = “mostly false,” and 5 = “definitely false”) are separated by equal intervals, as Likert recoding supposes. As highlighted by our study on Likert/binary formats [38], the assumption of linearity among ordinal response points is often not respected in SF-36 items, and it is necessary to calibrate the Likert recoding. Thus, we used the MCA “optimal scaling” recoding before testing dimensionality via PCA.

Optimal scaling is a technique from psychometrics that assigns numeric values to categorical variables in an optimal way, so that the item responses can then be treated as continuous. As detailed by De Leeuw [39], the single-item quantifications derived by MCA linearize all the bivariate regressions in the Pearson's correlation matrix. In this way, MCA makes it possible to handle the nonlinear information contained in the original data and to perform a suitable linear PCA on the 35 items of the SF-36 questionnaire; i.e., MCA/PCA produces a nonlinear multivariate analysis (see, e.g., Gifi [25]).
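To make the idea of optimal quantification concrete, the following is a minimal sketch in Python using reciprocal averaging, an iterative scheme that is known to converge to the first MCA dimension. It is not the software used in this study: the function name, the toy data, and the iteration count are all assumptions chosen for illustration, and the starting vector assumes numeric category codes.

```python
from statistics import mean, pstdev

def first_dimension_quantifications(data, n_iter=200):
    """Reciprocal averaging on categorical data (hypothetical helper).

    data: list of rows, one per subject, each a list of category codes
    (one code per item). Returns a list with one dict per item that maps
    each category code to its quantification on the first dimension.
    """
    n_items = len(data[0])
    # Starting point: sum of the raw codes (assumes numeric codes).
    x = [float(sum(row)) for row in data]
    quants = []
    for _ in range(n_iter):
        # Standardise the subject scores; centring removes the trivial
        # constant solution.
        m, s = mean(x), pstdev(x)
        x = [(v - m) / s for v in x]
        # Category quantification = mean score of the subjects who chose it.
        quants = []
        for j in range(n_items):
            groups = {}
            for row, score in zip(data, x):
                groups.setdefault(row[j], []).append(score)
            quants.append({cat: mean(vals) for cat, vals in groups.items()})
        # Subject score = mean quantification of the categories chosen.
        x = [mean([quants[j][row[j]] for j in range(n_items)]) for row in data]
    return quants

# Toy data (invented): 8 subjects, 3 items, codes 1..3.
toy = [
    [1, 1, 1], [1, 1, 2], [2, 2, 2], [2, 2, 3],
    [3, 3, 3], [3, 3, 2], [1, 2, 1], [3, 2, 3],
]
quants = first_dimension_quantifications(toy)
# Replace every code by its quantification; these recoded items are what
# a subsequent linear PCA would analyse.
quantified = [[quants[j][row[j]] for j in range(len(row))] for row in toy]
```

In this sketch the distances between the quantified categories are estimated from the data rather than fixed at one unit, which is the point of calibrating the Likert recoding. A full analysis would rely on an MCA routine from a statistical package rather than on this loop.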

After optimal item quantifications via MCA, the conventional PCA output presented here supported the supposed dimensionality of the eight scales and the two summaries: in general, the items of a scale, and the scales of a summary, loaded with high component loadings (>0.40) on the supposed underlying constructs. However, some discrepancies were noted.

Considering the eight dimensions (scales), the items of the VT scale split into items measuring positive (VTa and VTe, together with the MHh item) and negative (VTg and VTi) mental health status, indicating that the VT scale does not measure a single underlying construct in the ECRHS subjects. The two items of the SF scale also split, but onto other supposed constructs: SFa onto the RE scale and SFb onto the MH scale. Notably, positively and negatively worded items loaded on two different components, suggesting that the subjects had difficulty switching between these reversed answering formats or were misled by them.

The summary components defined by scale- and item-based analysis with orthogonal (uncorrelated) and oblique (correlated) rotations confirmed the underlying two-component structure of the SF-36 questionnaire. Nevertheless, item-based analysis suggested that the GH scale correlated (on average) with the mental rather than the physical component of health. Specifically, the GHx and GHd items load on both the physical and mental components, the GHa and GHb items on the mental component, and GHc on neither.

These findings, also reported in other item-based studies [35–37], highlight the need to consider the VT, SF, and GH items more closely and possibly to modify the conceptual framework to improve the underlying dimensionality of the questionnaire.

The second aim of the present study was to compare the two global summary components based on the eight scales (MOS approach) with alternative ones based on the 35 items (item-based approach). The two approaches follow a similar course in how the summaries are handled. The MOS approach studies the scale dimensionality and structure via PCA using the Pearson's correlation matrix of the eight scales. These scales were computed after Likert coding (from 1 to a maximum of 6 points), recalibration, and summation of the item responses. Subsequently, the PCS and MCS summaries were obtained as a weighted sum of the eight scales. The alternative item-based approach starts by considering the items as categorical (nominal) variables and evaluates the item dimensionality and structure via PCA using the Pearson's correlation matrix of the 35 items after MCA optimal coding of the item responses. Subsequently, the PCS and MCS summaries were obtained as an on/off (0/1) sum of the 35 items. Thus, both approaches require two steps to calculate the summaries, but these steps differ considerably in the conceptual framework and operational procedure of scoring development.

According to the Likert model of the MOS first step, a construct is regarded as latent and continuous, and is operationalized to be measured by highly correlated and equally important items in order to increase reliability and improve precision. It uses arbitrary numbers, which indicate the ordered structure of the alternative responses, and also assumes that precision increases with the number of points in the scale [28]. In contrast, according to the MCA model of the item-based first step, the ordinal/binary data of the SF-36 items were processed as nominal and were transformed into continuous form by optimal quantifications. The MCA solution makes it possible to define the optimal weights for the item options and their ranking, independently of any a priori recoding, enabling an optimal grading for each response category of the questionnaire.

Scores for the two summary measures in the MOS second step are generated in three stages. First, the 0–100 scale scores are standardized (*z* score transformation) by subtracting the US population mean for that scale and dividing the difference by the US population standard deviation for the scale. Next, the *z* scores are multiplied by the respective principal component coefficients, derived from US population data, and summed. These are weighted scale sums with an equal weighting of items within each scale. Finally, these summary scores are linearly reexpressed to have a mean of 50 and a standard deviation of 10 (*T*-score transformation) in the general US population.
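The three stages can be sketched in Python as follows. All numeric values in this sketch are illustrative placeholders and not the published US norms or component score coefficients. Only the sign pattern follows the text (three “mental” scales negatively weighted for the PCS, four “physical” scales for the MCS), and the names `NORM_MEAN`, `NORM_SD`, `PCS_WEIGHTS`, `MCS_WEIGHTS`, and `component_summary` are assumptions.

```python
SCALES = ["PF", "RP", "BP", "GH", "VT", "SF", "RE", "MH"]

# ILLUSTRATIVE placeholders only: NOT the published norms or coefficients.
NORM_MEAN = {s: 75.0 for s in SCALES}
NORM_SD = {s: 20.0 for s in SCALES}
PCS_WEIGHTS = {"PF": 0.40, "RP": 0.35, "BP": 0.30, "GH": 0.25,
               "VT": 0.05, "SF": -0.01, "RE": -0.20, "MH": -0.20}
MCS_WEIGHTS = {"PF": -0.20, "RP": -0.10, "BP": -0.10, "GH": -0.02,
               "VT": 0.25, "SF": 0.25, "RE": 0.45, "MH": 0.50}

def component_summary(scale_scores, weights):
    """Norm-based component summary from 0-100 scale scores."""
    # Stage 1: z-score each scale against the reference population.
    z = {s: (scale_scores[s] - NORM_MEAN[s]) / NORM_SD[s] for s in SCALES}
    # Stage 2: weighted sum of the z scores.
    aggregate = sum(weights[s] * z[s] for s in SCALES)
    # Stage 3: T-score transformation (mean 50, SD 10).
    return 50.0 + 10.0 * aggregate

# A profile equal to the reference means scores 50 on both summaries.
at_norm = dict(NORM_MEAN)
pcs_at_norm = component_summary(at_norm, PCS_WEIGHTS)
mcs_at_norm = component_summary(at_norm, MCS_WEIGHTS)

# Raising only the MH scale lowers the PCS, because MH carries a
# negative PCS weight in this kind of scoring.
better_mh = dict(at_norm, MH=95.0)
pcs_better_mh = component_summary(better_mh, PCS_WEIGHTS)
```

With any weight set of this sign pattern, an improvement on a mental scale alone pulls the PCS below the reference value, which is the kind of scale-summary discrepancy discussed in the Introduction.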

The alternative item-based scoring in the second step is very basic: a simple sum (without weighting) of the Likert/binary responses of the items loading on the physical or mental component. To facilitate comparisons across scales and summaries, the summaries are reexpressed in the 0–100 range of the scale scores. Alternatively, a weighted scale sum, where the weights are the raw ranges of the scales (except for the GH scale), can be computed; the summaries can thus be reexpressed in standard deviation units as *T*-scores, using US or other population norms.
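A small worked example in Python may help to show why the simple item sum and the range-weighted scale sum lead to the same 0–100 summary. The two mini-scales below are hypothetical and are not SF-36 scales; item recalibration and the special handling of the GH scale are deliberately left out, and all function names are assumptions.

```python
def rescale_0_100(raw_sum, lowest, highest):
    """Express a raw sum as a percentage of its possible range."""
    return 100.0 * (raw_sum - lowest) / (highest - lowest)

# Hypothetical component made of two scales:
# scale name -> (item responses, minimum code, maximum code)
responses = {
    "A": ([3, 2, 3, 1], 1, 3),  # four 3-point items
    "B": ([2, 1], 1, 2),        # two binary items
}

def item_sum_summary(resp):
    """Unweighted sum of all items, rescaled to 0-100."""
    total = sum(sum(items) for items, lo, hi in resp.values())
    lowest = sum(lo * len(items) for items, lo, hi in resp.values())
    highest = sum(hi * len(items) for items, lo, hi in resp.values())
    return rescale_0_100(total, lowest, highest)

def range_weighted_summary(resp):
    """Mean of the 0-100 scale scores, weighted by each raw scale range."""
    numerator = denominator = 0.0
    for items, lo, hi in resp.values():
        raw_range = (hi - lo) * len(items)
        scale_score = rescale_0_100(sum(items), lo * len(items), hi * len(items))
        numerator += raw_range * scale_score
        denominator += raw_range
    return numerator / denominator

summary_from_items = item_sum_summary(responses)         # 60.0
summary_from_scales = range_weighted_summary(responses)  # 60.0
```

Here scale A scores 62.5 with a raw range of 8 and scale B scores 50 with a raw range of 2, so the range-weighted mean is (8 × 62.5 + 2 × 50) / 10 = 60, the same value obtained by summing the six items directly and rescaling.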

Ware et al. [40,41] have provided extensive justification for two uncorrelated PCS and MCS solutions. The advantages include easier modeling with respect to additional factors (components) or residual pairwise item correlations; independent components that are more responsive to the distinction between physical and mental health outcomes; and a direct relationship between component loading and explained variance, which gives an easier interpretation. Also, oblique components require negative scoring weights. Nevertheless, recent comparison studies by Farivar et al. [42], Hann and Reeves [43], and Anagnostopoulos et al. [44] have recommended that users of the SF-36 adopt the oblique solution for calculating the PCS and MCS, but the proposed SF-36 summary scores were similar to MOS_{C} or were structural equations-based.

Our item-based scoring is derived from a data-driven (exploratory) correlated PCA solution and uses the 1/3-unit rule; thus, it retains several of the PCA advantages of Ware's scoring system and overcomes the possible negative weightings of the oblique solution with an a posteriori rule. Consequently, our alternative scoring is equivalent to the RAND-36 method [45], which is based on item response theory for item scoring and on a correlated confirmatory factor analysis solution that conceptually forces a priori the weights of the four mental health scales to zero in the PCS and, vice versa, those of the four physical health scales to zero in the MCS.

Similarly to the MOS_{C} strategy, the new scoring allows the physical and mental health summary scores to be somewhat correlated and makes it possible to assess the extent of this correlation in each study population. As hypothesized, in the ECRHS populations the correlation between PCS and MCS reached 0.53 and 0.47 using the scale- and item-based systems, respectively, a moderate value in line with previous studies using MOS_{C}. Moreover, the correlations between the MOS_{C} and item-based scores were equal to 0.97 for both the physical and mental summaries, suggesting that the scores were empirically the same. By contrast, the correlations between the MOS_{UC} and item-based physical and mental scores were lower (PCS: 0.92 and MCS: 0.89).

Our item-based scores can be expressed as 0–100 scores, matching the original scale scores, or as norm-based scores, matching the original summary scores. Compared with the MOS_{UC} summaries, the alternative item-based ones are in line with the scores expected for hypothetical outlier scale profiles. Thus, the SF-36 scale scores and the physical and mental health summary scores are in agreement, reducing the inconsistent results reported in some SF-36 studies. Additionally, the clinical (criterion-based) validity of the proposed scores, assessed by means of known-groups comparison, supports the suggested hypotheses, and the results are compatible with those of the MOS scores.

We would encourage other authors to investigate this alternative item-based scoring in addition to the MOS uncorrelated and correlated scorings, to determine whether our findings can be replicated in other populations and condition-specific samples. MCA, like PCA, is a procedure available in all the general statistical packages (SPSS, SAS, Stata, R); in SPSS and R, the data matrix with the optimal quantifications of the first MCA dimension can optionally be saved to a file automatically. Lastly, future research should be dedicated to deriving norms for the new scoring across various conditions, ages, genders, and countries; until such norms are available, norm-based rescaling with the means and standard deviations of the US population, or of other specific populations, can be employed.