Use of quantile regression to investigate the longitudinal association between physical activity and body mass index


  • Funding agencies: This work was supported by the U.S. Department of Defense (W81XWH-08-1-0082) and the National Institutes of Health grants AG006945, HL062508, and R21 DK088195 (to X.S. from the National Institute of Diabetes and Digestive and Kidney Diseases).

  • Disclosures: The authors have no competing interests.



To examine the associations of age, physical activity (PA), and birth cohort with body mass index (BMI) percentiles in men.


Longitudinal analyses using quantile regression were conducted among men with ≥ two examinations between 1970 and 2006 from the Aerobics Center Longitudinal Study (n = 17,759). Height and weight were measured; men reported their PA and were categorized as inactive, moderately, or highly active at each visit. Analyses allowed for longitudinal changes in PA.


BMI was greater in older than younger men and in those born in 1960 than those born in 1940. Inactive men gained weight significantly more rapidly than active men. At the 10th percentile, increases in BMI among inactive, moderately active, and highly active men were 0.092, 0.078, and 0.069 kg/m2 per year of age, respectively. Controlling for age, BMI increased by 0.081 kg/m2 per birth year at the 10th percentile and by 0.180 kg/m2 per birth year at the 90th percentile.


Although BMI increased with age, PA reduced the magnitude of the gradient among active compared to inactive men. Regular PA had an important, protective effect against weight gain. This study provides evidence of the utility of quantile regression to examine the specific causes of the obesity epidemic.


The world-wide obesity epidemic [1-4] can be attributed to a widespread imbalance between energy intake and energy expenditure. The prevalence of obesity, defined as body mass index (BMI) ≥ 30.0 kg/m2, has increased dramatically among men over the past 50 years from 10.4% in 1960–1962 [5] to 35.5% in 2009–2010 [6]. The specific factors that cause the energy imbalance are still poorly understood. Some have suggested that physical activity has been essentially unchanged during the obesity epidemic, and conclude that the cause of the epidemic must be an increase in energy intake [7, 8]. However, a major factor to consider is the rapid change in occupational energy expenditure over the past 50 years, with a large decline in manufacturing, mining, and farming; and a consistent increase in service jobs with substantially lower energy requirements [9]. Importantly, mean daily energy expenditure from occupational physical activity declined by more than 100 calories over the past five decades, and that decrease accounted for a significant portion of the mean weight gain during that time period [9].

Multiple individual and environmental factors may affect an individual's ability to achieve energy balance and maintain a stable weight over time, and a number of observational and interventional studies have examined the potential effects of these factors [10-13]. Assessments of the potential influences on obesity tend to focus on the upper percentiles of the frequency distribution of BMI in categorical logistic regression analyses or on the mean as in linear regression analyses. Both approaches are limited because they sacrifice what can be learned about the entire distribution. For instance, the influence of age, physical activity, and birth cohort on BMI may affect subgroups of the population differently; thus, the effect on mean BMI may not adequately convey the potential varying impact on the entire distribution. Quantile regression is an analytical method that is compatible with assessing associations throughout the distribution of BMI [14-19]. To date, no study has used quantile regression to examine the influences of age, physical activity, and birth cohort prospectively on obesity among adult men.

Therefore, the primary purpose of this paper was to determine the associations of age, physical activity, and birth cohort with the percentiles of the BMI distribution in a large sample of men. We hypothesized that BMI values would be centered on higher values in 60-year-old men than in 20-year-old men, and that BMI would be higher in 40-year-old men born in 1960 than in 40-year-old men born in 1940. We also expected that the BMI distribution would be shifted towards larger values with age to a greater degree in inactive men than in active men, but in a way that would not be uniform across the BMI distribution. The secondary purpose of this paper was to describe the application of an underutilized statistical method, quantile regression [14, 15], to study factors influencing BMI, an application for which the method seems particularly well suited.


Sample selection

The Aerobics Center Longitudinal Study (ACLS) is a prospective observational study [20]. Participants came to the Cooper Clinic in Dallas, TX, for periodic preventive health examinations and counseling regarding diet, exercise, and other lifestyle factors associated with increased risk of chronic disease. Between 1970 and 2006, participants received at least one comprehensive medical examination and maximal graded treadmill exercise test at the clinic, and were enrolled in the ACLS. Most study participants were non-Hispanic whites from middle-to-upper socioeconomic strata, and were either referred by their employers or physicians or were self-referred. The study was reviewed and approved annually by the Cooper Institute Institutional Review Board, and all participants gave written informed consent. From the initial sample of 120,649 observations from 50,787 men, we included men without any history of heart attack, stroke or cancer (observations = 103,379, participants = 46,132), 25–75 years old (observations = 102,229, participants = 45,515), and men with at least two visits (observations = 74,473, participants = 17,759). In the final sample, 7334 men had two visits, 3566 men had three visits, 1989 men had four visits, and 4870 men had five or more visits.
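The three inclusion steps can be sketched as a simple filter. This is a minimal illustration, not the study's actual data-management code: the record fields (`id`, `age`, `history`) are hypothetical, and the sketch assumes that a recorded history of heart attack, stroke, or cancer excludes all of a man's observations and that the age restriction is applied to each examination. The final line checks that the reported visit counts add up to the final sample of men.

```python
from collections import Counter

# Men in the final sample by number of visits, as reported in the text.
VISIT_COUNTS = {"two": 7334, "three": 3566, "four": 1989, "five or more": 4870}


def select_sample(observations):
    """Return the observations that pass the three inclusion steps.

    Each observation is a dict with hypothetical keys 'id', 'age', and
    'history' (True if heart attack, stroke, or cancer was recorded).
    """
    # Step 1 (assumption): drop every observation of a man with any history.
    flagged = {o["id"] for o in observations if o["history"]}
    kept = [o for o in observations if o["id"] not in flagged]
    # Step 2 (assumption): keep examinations taken at ages 25 to 75.
    kept = [o for o in kept if 25 <= o["age"] <= 75]
    # Step 3: keep men with at least two remaining visits.
    visits = Counter(o["id"] for o in kept)
    return [o for o in kept if visits[o["id"]] >= 2]


total_men = sum(VISIT_COUNTS.values())  # equals the reported 17,759 men
```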


The comprehensive health evaluation is described in detail elsewhere [20, 21]. The outcome of interest in this study was BMI (kg/m2). Height and weight were measured on a physician's scale and stadiometer. The exposures of interest were self-reported physical activity, diet, and smoking behavior. Physical activity was categorized based on participants' responses to questions about their regular physical activity habits over the past three months (1 = no activity, 2 = some sports or activity or walk/jog/run up to 10 miles per week, 3 = walk/jog/run more than 10 miles per week) [21-23]. Categories of physical activity were defined at each visit as “inactive” if physical activity = 1, “moderate” if physical activity = 2, and “high” if physical activity = 3. The analysis allowed for changes in physical activity level over time. Smoking habits were obtained from a standardized questionnaire. Participants were classified as a nonsmoker or current smoker at the time of each examination. Eating habits were self-reported as eating: [1] much less, [2] somewhat less, [3] just what, [4] somewhat more, or [5] much more than I want. Birth cohort was defined as each participant's year of birth.
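The coding of the self-reported exposures can be written down compactly. The sketch below only restates the categories given above; the function name and the example man are hypothetical. Because physical activity was coded at each visit, one man can contribute observations at different activity levels.

```python
# Physical activity codes and labels as defined in the text.
PA_LEVELS = {1: "inactive", 2: "moderate", 3: "high"}

# Eating habit codes; each label completes "I eat ... than I want".
EATING_HABITS = {
    1: "much less than I want",
    2: "somewhat less than I want",
    3: "just what I want",
    4: "somewhat more than I want",
    5: "much more than I want",
}


def code_visits(visits):
    """Map (age, pa_code) pairs to (age, PA level), one entry per visit."""
    return [(age, PA_LEVELS[code]) for age, code in visits]


# Hypothetical man who became more active between examinations.
history = code_visits([(41, 1), (44, 2), (50, 3)])
```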

Statistical analyses

We employed quantile regression to assess associations of predictor variables at the 10th, 25th, 50th, 75th, and 90th percentiles of BMI. Quantile regression has the advantage of allowing examination at multiple points in the distribution of BMI rather than only at the mean. Quantile regression does not require any assumption about the distribution of the regression residuals and, unlike ordinary linear regression, is not influenced by outliers or skewness in the distribution of the dependent variable, providing greater statistical efficiency when outliers are present. In addition, inference on quantiles can accommodate transformation of the dependent variable without the problems encountered in ordinary linear regression [24].
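The robustness to outliers can be illustrated with the loss function that quantile regression minimizes, which the text does not spell out: the standard check (pinball) loss. The sketch below is a toy, intercept-only case with made-up BMI values, not the analysis reported here. It finds the value that minimizes the total check loss and shows that one extreme value moves the mean but not the median.

```python
import statistics


def check_loss(residual, tau):
    """Check loss: tau * r for r >= 0 and (tau - 1) * r for r < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def quantile_by_loss(values, tau):
    """Return the observed value that minimizes the total check loss."""
    return min(
        sorted(values),
        key=lambda c: sum(check_loss(v - c, tau) for v in values),
    )


# Hypothetical BMI values; the second list replaces 29.0 with an outlier.
bmi = [22.0, 24.0, 25.0, 27.0, 29.0]
bmi_outlier = [22.0, 24.0, 25.0, 27.0, 70.0]

median_clean = quantile_by_loss(bmi, 0.5)            # 25.0
median_outlier = quantile_by_loss(bmi_outlier, 0.5)  # still 25.0
mean_clean = statistics.mean(bmi)                    # 25.4
mean_outlier = statistics.mean(bmi_outlier)          # 33.6
```

Setting `tau` to 0.1 or 0.9 targets the 10th or 90th percentile in the same way; a regression model replaces the single constant with a linear predictor.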

Quantile regression parameters are interpreted similarly to normal linear regression parameters except that the parameter indicates the change in the value at the modeled percentile, not the mean, of the dependent variable for each unit change in the independent variable. For example, a parameter estimate of 0.133 for age in the 75th percentile model would indicate that the 75th percentile of BMI increased by 0.133 kg/m2 for each one year increase in age.
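For concreteness, the model form implied by the predictors listed in Table 2 can be written as follows. The notation is ours and is only a sketch: the symbols for the coefficients and indicator variables are assumptions, and each coefficient is allowed to differ by percentile.

```latex
Q_{\tau}(\mathrm{BMI} \mid x) =
    \beta_0(\tau)
  + \beta_1(\tau)\,\mathrm{PA}_{\mathrm{mod}}
  + \beta_2(\tau)\,\mathrm{PA}_{\mathrm{high}}
  + \beta_3(\tau)\,(\mathrm{age} - 50)
  + \beta_4(\tau)\,(\mathrm{age} - 50)\,\mathrm{PA}_{\mathrm{mod}}
  + \beta_5(\tau)\,(\mathrm{age} - 50)\,\mathrm{PA}_{\mathrm{high}}
  + \sum_{k=2}^{5} \gamma_k(\tau)\,\mathrm{Eat}_k
  + \delta(\tau)\,(\mathrm{birth\ year} - 1940),
\qquad \tau \in \{0.10, 0.25, 0.50, 0.75, 0.90\}.
```

Under this form, the illustrative estimate of 0.133 for age in the 75th percentile model corresponds to $\beta_3(0.75) = 0.133$ for inactive men, so a 10-year difference in age shifts the 75th percentile of BMI by $10 \times 0.133 = 1.33$ kg/m2.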

The densities shown were smoothed by applying the Epanechnikov kernel function, K(x) = 0.75 (1 − x²) I(|x| < 1), with bandwidth 3 to a dense set of estimated quantiles (2nd, 4th, …, 98th percentile). A kernel function gives the weights of the nearby data points in making an estimate while ensuring that the result is a probability density function and that the average of the corresponding distribution is equal to that of the sample used. The repeated observations of BMI taken on the same men may be dependent. The quantile regression estimator is consistent when the data are dependent [25]. Because we were interested in population-level, not individual-level, estimates, we estimated the standard errors and confidence intervals with 1000 cluster bootstrap samples to account for the dependence [26-29].
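The idea of the cluster bootstrap is to resample whole men, with all of their visits, rather than individual observations. The sketch below illustrates this for a simple statistic. It is not the code used for the analysis: the function name and the toy data are hypothetical, and the percentile method for forming the interval is an assumption, since the text does not state how intervals were derived from the bootstrap samples.

```python
import random
import statistics


def cluster_bootstrap_ci(data_by_man, statistic, n_boot=1000, seed=0):
    """Approximate 95% interval from resampling men with replacement.

    data_by_man maps a man's id to the list of his BMI measurements, so
    every resample keeps the repeated visits of a sampled man together.
    """
    rng = random.Random(seed)
    ids = list(data_by_man)
    estimates = []
    for _ in range(n_boot):
        sampled = [rng.choice(ids) for _ in ids]  # clusters, not visits
        pooled = [bmi for i in sampled for bmi in data_by_man[i]]
        estimates.append(statistic(pooled))
    estimates.sort()
    # Percentile interval (assumption; not stated in the text).
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return lower, upper


# Hypothetical data: three men with one to three visits each.
toy = {1: [22.0, 23.0], 2: [30.0, 31.0, 32.0], 3: [26.0]}
ci_median = cluster_bootstrap_ci(toy, statistics.median, n_boot=200)
```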

For completeness, we complemented our inference on quantiles with that from two other, more traditional approaches: linear regression and multinomial logistic regression. The former permits inference about the mean of BMI, whereas the latter allows estimation of the conditional probability of being in any given BMI class (18–25, 25–30, and ≥ 30 kg/m2). To take the potential intra-individual dependence into account, the standard errors were estimated by applying generalized estimating equations (GEE) with an exchangeable working covariance matrix for the linear regression on the mean [30] and the robust cluster sandwich estimator for multinomial logistic regression [31].
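As a minimal sketch of the categorical approach, the code below assigns a BMI value to one of the three classes and converts linear predictors into class probabilities with the usual multinomial logit formula. The handling of the class boundaries, the choice of normal weight as the reference class, and the numeric linear predictors are assumptions made for illustration; they are not estimates from this study.

```python
import math


def bmi_class(bmi):
    """Assign a BMI class; boundary handling is an assumption."""
    if 18 <= bmi < 25:
        return "normal"
    if 25 <= bmi < 30:
        return "overweight"
    if bmi >= 30:
        return "obese"
    return None  # below 18 kg/m2 falls outside the three classes


def multinomial_probs(eta_overweight, eta_obese):
    """Class probabilities with normal weight as reference (eta = 0)."""
    exps = [1.0, math.exp(eta_overweight), math.exp(eta_obese)]
    total = sum(exps)
    return {
        "normal": exps[0] / total,
        "overweight": exps[1] / total,
        "obese": exps[2] / total,
    }


# Hypothetical linear predictors for one covariate pattern.
probs = multinomial_probs(eta_overweight=0.2, eta_obese=-1.0)
```

The three probabilities always sum to one, which is why a model of this kind describes membership in a few classes but not the shape of the distribution within them.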


The shape of the BMI distribution differed across the levels of physical activity with respect to location, spread, and skewness (Figure 1). Across the three levels of physical activity, there were no statistically significant differences in the mean or quartiles of age or height of the men (Table 1). There were, however, gradients in weight, waist circumference, BMI, and body fat mass in the expected direction, with inactive men having the greatest relative weight and fat, and highly active men having the least.

Figure 1.

Box plots of BMI by levels of physical activity of men, Dallas, TX, 1970–2006. Values with BMI > 70 kg/m2 were excluded.

Table 1. Descriptive statistics of men by physical activity level (inactive, moderate PA, high PA) across all visits, Dallas, TX, 1970–2006 (men = 17,759, observations = 74,473)a

                                  Inactive (obs. = 18,552)        Moderate PA (obs. = 37,058)     High PA (obs. = 18,863)
Variable                          25th   50th   75th   Mean or %  25th   50th   75th   Mean or %  25th   50th   75th   Mean or %
Age, years                        40     47     54     47.2       41     48     54     48.0       42     48     55     48.6
Height, cm                        174.6  179.1  182.9  178.9      175.3  179.1  183.5  179.2      174.6  179.1  182.9  178.9
Weight, kg                        76.4   83.4   92.0   85.2       75.5   82.0   90.3   83.7       73.3   79.3   86.3   80.4
Waist circumference, cm           88.0   94.0   101.0  94.3       86.
BMI, kg/m2                        24.2   26.0   28.3   26.5       23.8   25.5   27.6   26.1       23.2   24.7   26.5   25.1
Body fat, %
Current smoker                                         14%                              12%                              8%
Alcohol consumption (0/wk)                             27%                              26%                              25%
Alcohol consumption (1−7/wk)                           51%                              50%                              49%
Alcohol consumption (≥ 8/wk)                           22%                              23%                              26%
Eat much less than I want                              7%                               5%                               6%
Eat somewhat less than I want                          35%                              43%                              43%
Eat just what I want                                   44%                              41%                              41%
Eat somewhat more than I want                          12%                              10%                              9%
Eat much more than I want                              2%                               1%                               1%

a Differences across physical activity levels in the percentiles (tested with quantile regression with cluster bootstrapped standard errors), the means (GEE), and the proportions (multinomial logistic regression with robust cluster sandwich estimator) are significant (P < 0.05) for all variables.

PA, physical activity; obs., observations; BMI, body mass index; GEE, generalized estimating equations.

BMI was higher at older ages for all three physical activity levels, but with a smaller gradient in the physically active as compared to the inactive men (Table 2). The difference in the magnitude of increase was significant at the 10th and 25th percentiles, as indicated by the statistically significant cross-product interaction terms for age and physical activity level (high vs. inactive) in those models. At the 10th percentile, below which was the leanest 10% of the population, the gradients in BMI with age in the inactive, moderately active, and highly active were 0.092, 0.078, and 0.069 kg/m2 per year of age, respectively. Adjusting for age, BMI increased with year of birth by 0.081 kg/m2 per birth year at the 10th percentile and by 0.180 kg/m2 per birth year at the 90th percentile, and the magnitude of increase associated with year of birth was larger at each successive percentile. Eating habits were significantly associated with BMI at all percentiles, and the category “eat just what I want” showed the largest reduction of BMI values at all percentiles. Smoking and drinking habits were not significant predictors or confounders, and were omitted from all models.
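The reported gradients can be turned into implied changes with a few lines of arithmetic. The sketch below uses only the 10th and 90th percentile figures quoted above; the function names are hypothetical, and the 20-year spans are chosen for illustration (the cohort comparison corresponds to men born in 1960 versus 1940).

```python
# Age gradients at the 10th percentile (kg/m2 per year), from the text.
AGE_SLOPE_10TH = {"inactive": 0.092, "moderate": 0.078, "high": 0.069}

# Birth-year gradients (kg/m2 per birth year), from the text.
COHORT_SLOPE = {"10th": 0.081, "90th": 0.180}


def implied_interaction(level):
    """Difference in age slope relative to inactive men."""
    return round(AGE_SLOPE_10TH[level] - AGE_SLOPE_10TH["inactive"], 3)


def change_with_age(level, years):
    """Change in the 10th percentile of BMI over a span of years."""
    return round(AGE_SLOPE_10TH[level] * years, 2)


def cohort_shift(percentile, birth_year, reference_year=1940):
    """Shift in a BMI percentile relative to the reference birth year."""
    return round(COHORT_SLOPE[percentile] * (birth_year - reference_year), 2)


high_vs_inactive = implied_interaction("high")  # -0.023, as in Table 2's note
inactive_20y = change_with_age("inactive", 20)  # 1.84 kg/m2
high_20y = change_with_age("high", 20)          # 1.38 kg/m2
shift_10th = cohort_shift("10th", 1960)         # 1.62 kg/m2
shift_90th = cohort_shift("90th", 1960)         # 3.6 kg/m2
```

Over 20 years of age, the 10th percentile rises by about 0.46 kg/m2 less in highly active than in inactive men, and the cohort shift at the 90th percentile is more than twice that at the 10th.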

Table 2. Effects of predictors at five percentiles (10th, 25th, 50th, 75th, and 90th) of the distribution of body mass index (kg/m2) estimated by quantile regression in men, Dallas, TX, 1970–2006 (men = 8885; observations = 17,304)

                                        10th percentile   25th percentile   50th percentile   75th percentile   90th percentile
PA (Moderate vs. Inactive)
  95% CIb                               −0.553, −0.164    −0.732, −0.420    −1.026, −0.674    −1.226, −0.774    −1.756, −0.979
PA (High vs. Inactive)
  95% CIb                               −1.039, −0.629    −1.330, −0.978    −1.744, −1.383    −2.156, −1.641    −2.939, −2.085
Age (years, centered at 50)
  95% CIb                               0.069, 0.115      0.078, 0.122      0.098, 0.142      0.104, 0.162      0.100, 0.186
Interaction (Age × Moderate PA)
  95% CIb                               −0.035, 0.007     −0.030, 0.008     −0.037, 0.005     −0.030, 0.024     −0.022, 0.053
Interaction (Age × High PA)
  95% CIb                               −0.043, −0.003    −0.043, −0.004    −0.038, 0.005     −0.022, 0.033     −0.033, 0.052
Eating Habit (2 vs. 1)c
  95% CIb                               −1.844, −0.815    −2.235, −1.551    −3.093, −2.229    −3.960, −3.038    −4.625, −3.235
Eating Habit (3 vs. 1)c
  95% CIb                               −2.880, −1.827    −3.198, −2.523    −4.139, −3.232    −5.089, −4.095    −5.520, −4.099
Eating Habit (4 vs. 1)c
  95% CIb                               −1.312, −0.071    −1.354, −0.580    −2.055, −1.095    −2.888, −1.837    −3.222, −1.438
Eating Habit (5 vs. 1)c
  95% CIb                               −0.984, 0.781     −0.498, 1.282     −0.121, 1.448     −0.572, 1.532     −0.172, 2.582
Cohort (birth year, centered at 1940)d
  95% CIb                               0.067, 0.096      0.077, 0.102      0.095, 0.125      0.129, 0.159      0.153, 0.207
Intercepte
  95% CIb                               23.991, 25.040    26.173, 26.886    28.704, 29.583    31.507, 32.503    33.985, 35.484

a The coefficient represents the change in the value at the nth percentile of BMI for each unit change in the independent variable. For interactions, the coefficient is the difference in that change relative to the change when the interacting variable is at its reference level. For example, at the 10th percentile, BMI increases by 0.092 for each year of age for those who are inactive, but by 0.023 less than that per year of age for those with high physical activity.

b Confidence intervals (CI) are based on 1000 cluster bootstrap samples. Test for interaction terms: 10th (P = 0.082), 25th (P = 0.048), 50th (P = 0.255), 75th (P = 0.695), 90th (P = 0.722).

c 1=Eat much less than I want; 2=Eat somewhat less than I want; 3=Eat just what I want; 4=Eat somewhat more than I want; 5=Eat much more than I want.

d Birth year centered at 1940.

e The intercept is the value of the nth percentile of BMI when all other variables are zero.

BMI, body mass index; PA, physical activity.

Figure 2 illustrates the results of the quantile regression analysis and compares the estimates for the distribution of BMI in the inactive and highly active populations at age 30, 50, and 70 (chart rows), in the cohorts born in 1940 and 1960 (chart columns) in those who reported that they eat “just what I want.” The distributions for moderate physical activity were similar to those for high physical activity and are not shown. The distribution in the active population in the 1940 cohort at age 30 is included as a shaded area in all graphs for reference. Moving down each column, the distribution of BMI shifted toward larger values with older age. Skewness abated with age in the inactive men because of the comparatively larger increase in the lower percentiles than in the higher ones. Conversely, skewness increased in the active population. Comparing the two columns, there was a conspicuous, significant cohort effect on both location and spread of the BMI distribution. The later generation shifted toward higher values and showed an accentuated elongation.

Figure 2.

Quantile regression estimates of the BMI distributions from the model shown in Table 2 in inactive (solid curve) and highly active (dashed curve) men at ages 30, 50, and 70 in the 1940 cohort and 1960 cohort for men who “eat just what I want,” Dallas, TX, 1970–2006 (men = 8,885; observations = 17,304). Shaded area in all panels is for reference and represents physically active 30-year-old men in the 1940 cohort.
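The smoothing step behind these curves, as described under Statistical analyses, can be sketched as follows: the Epanechnikov kernel with bandwidth 3 is applied to a dense set of estimated quantiles (2nd, 4th, …, 98th percentile). The quantile values below are made up for illustration and are not estimates from the model; the function names are hypothetical.

```python
def epanechnikov(x):
    """K(x) = 0.75 * (1 - x^2) for |x| < 1, and 0 otherwise."""
    return 0.75 * (1.0 - x * x) if abs(x) < 1 else 0.0


def smoothed_density(point, quantiles, bandwidth=3.0):
    """Kernel density estimate at `point` from a set of quantile values."""
    n = len(quantiles)
    total = sum(epanechnikov((point - q) / bandwidth) for q in quantiles)
    return total / (n * bandwidth)


# Hypothetical BMI quantiles at the 2nd, 4th, ..., 98th percentiles.
quantiles = [20.0 + 0.12 * p for p in range(2, 100, 2)]

# Density evaluated on a grid of BMI values from 15 to 40 kg/m2.
grid = [15.0 + 0.5 * i for i in range(51)]
density = [smoothed_density(x, quantiles) for x in grid]
```

Because each kernel integrates to one and is symmetric about its center, the resulting curve is a probability density whose mean equals the mean of the quantile values used.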

Table 3 reports the estimated coefficients and associated confidence intervals from GEE linear regression models for mean BMI, which increased with age at all levels of physical activity. All main effects were statistically significant (P < 0.05). The difference in the slopes of the increase in BMI over age across levels of physical activity is borderline significant (P = 0.060). These estimates, however, were obtained after removing outliers (BMI < 14 or BMI > 50). Inference was dependent on which values were identified as outliers and removed. If all data were utilized, the estimated coefficients were substantially different, the standard errors inflated, the confidence intervals wider, and the difference in slopes far from statistically significant (data not shown).

Table 3. Effects of predictors at the mean of the distribution of body mass index estimated by generalized estimating equations after removing outliersa in men, Dallas, TX, 1970–2006 (men = 8882; observations = 17,295)

                                        Coefficient  P value  95% CI
PA (Moderate vs. Inactive)              −0.335       <0.001   −0.413, −0.257
PA (High vs. Inactive)                  −0.787       <0.001   −0.881, −0.693
Age (years, centered at 50)             0.106        <0.001   0.096, 0.115
Interaction (Age × Moderate PA)b        −0.003       0.433    −0.011, 0.005
Interaction (Age × High PA)b            0.006        0.226    −0.004, 0.016
Eating habit (2 vs. 1)c                 −0.868       <0.001   −1.000, −0.736
Eating habit (3 vs. 1)c                 −1.010       <0.001   −1.149, −0.870
Eating habit (4 vs. 1)c                 −0.363       <0.001   −0.520, −0.206
Eating habit (5 vs. 1)c                 0.895        <0.001   0.594, 1.196
Cohort (birth year, centered at 1940)   0.114        <0.001   0.105, 0.123
Intercept                               27.184       <0.001   27.029, 27.339

a Defined as BMI < 14 kg/m2 or BMI > 50 kg/m2.

b Test for interaction terms: P = 0.060.

c 1=Eat much less than I want; 2=Eat somewhat less than I want; 3=Eat just what I want; 4=Eat somewhat more than I want; 5=Eat much more than I want.

BMI, body mass index; CI, confidence interval; PA, physical activity.

Table 4 shows the estimated probability of being in each of the three BMI categories at ages 30, 50, and 70 for men in the 1940 cohort who reported that they “eat just what I want.” All main effects were statistically significant (P < 0.05). The probability of having normal BMI was lower at older ages. At age 70, the probability of having normal BMI was more than twice as great for highly active men as for inactive men. Further, the probability of being obese was less than half as great for highly active men as for inactive men.

Table 4. Predicted probabilities for being in a defined BMI category (normal weight, overweight, obese) based on multinomial regression in the 1940 cohort of men who “Eat Just What I Want,” Dallas, TX, 1970–2006 (men = 8885; observations = 17,304)

               Normal weight            Overweight               Obese
PA level  Age  Pred. Prob. 95% CIa      Pred. Prob. 95% CIa      Pred. Prob. 95% CIa
Inactive  30   0.77        0.81, 0.71   0.22        0.18,                    , 0.02
          50   0.45        0.48, 0.42   0.46        0.43, 0.48   0.09        0.08, 0.10
          70   0.15        0.20, 0.11   0.54        0.53, 0.54   0.31        0.27, 0.35
Moderate  30   0.82        0.85, 0.77   0.17        0.14,                    , 0.01
          50   0.56        0.59, 0.54   0.39        0.37, 0.40   0.05        0.05, 0.06
          70   0.25        0.30, 0.21   0.55        0.53, 0.57   0.19        0.17, 0.22
High      30   0.87        0.89, 0.83   0.13        0.10,                    , 0.01
          50   0.66        0.68, 0.64   0.31        0.29, 0.33   0.03        0.02, 0.03
          70   0.35        0.41, 0.29   0.52        0.48, 0.54   0.14        0.11, 0.17

a The 95% confidence intervals are based on the robust cluster sandwich estimator of the standard error.

BMI, body mass index; PA, physical activity; Pred. Prob., predicted probability; CI, confidence interval.
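The two comparisons at age 70 described in the text preceding Table 4 can be checked directly from the tabulated probabilities. The snippet below is only that arithmetic; the variable names are ours.

```python
# Predicted probabilities at age 70 from Table 4 (1940 cohort).
AGE_70 = {
    "inactive": {"normal": 0.15, "overweight": 0.54, "obese": 0.31},
    "high": {"normal": 0.35, "overweight": 0.52, "obese": 0.14},
}

# Highly active relative to inactive men.
normal_ratio = AGE_70["high"]["normal"] / AGE_70["inactive"]["normal"]
obese_ratio = AGE_70["high"]["obese"] / AGE_70["inactive"]["obese"]

more_than_twice_normal = normal_ratio > 2.0  # 0.35 / 0.15 is about 2.3
less_than_half_obese = obese_ratio < 0.5     # 0.14 / 0.31 is about 0.45
```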


Quantile regression permitted us to describe and quantify that, as expected, BMI was centered on larger values in older compared to younger men and in those born in 1960 compared to those born in 1940. With the use of quantile regression, we also found that the distribution of BMI was shifted towards larger values in older ages to a greater degree in inactive men than in those who were physically active. It also allowed us to show that these relationships were not uniform across the distribution of BMI. For example, the association of physical activity and BMI was greater at the larger percentiles of BMI than at the smaller percentiles, as shown by the greater magnitude of the regression coefficients at the larger percentiles. Therefore, men in the normal BMI range who led inactive lives tended to have higher weight gradients with age than men who maintained an active lifestyle. For example, at age 70, nearly half of the active men had a BMI that was below the 10th percentile of the inactive men. Quantile regression also permitted estimation of the entire distribution of BMI by age and year of birth adjusted for eating habits. Furthermore, quantile regression was statistically efficient and insensitive to extreme values of BMI.

Quantile regression has several advantages that apply directly to the analysis of our data set. First, research interest lies not in the mean of BMI but in its quantiles. Our study interest was in the complete distribution of BMI: the underweight, normal, overweight, and obese men. Inference on mean BMI alone would not be as informative as inference on multiple quantiles throughout the distribution. Quantile regression permits inference on multiple percentiles of BMI given a set of covariate values. Second, quantile regression is robust to outliers and statistically efficient. Large, outlying values have a major impact on the mean and therefore on linear regression estimates. Conversely, quantile regression is robust to them. Robustness to outliers makes the quantile estimator more efficient than the mean estimator when the population being sampled contains outliers. Third, quantile regression does not require transformations. When the relationship between dependent and independent variables is nonlinear or the distribution of the dependent variable is skewed, transforming the outcome may simplify modeling when using linear regression. Transformations such as the logarithm and the square root are frequently applied in linear regression, but are often challenging to implement in practice because of inconsistent back transformation and challenges in interpretation [32]. In contrast, quantile regression accommodates skewed distributions seamlessly. This property of quantile regression has been exploited to great advantage in other settings, for example, power-transformations [33] and censored data [34], and is applicable to the analysis of BMI. Other measures of obesity may be bounded from above or below, such as percentage of body fat mass bounded between 0% and 100%, and quantile regression is also applicable for these measures [24].
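The point about transformations rests on a general property: quantiles commute with monotone transformations, whereas means do not. The sketch below illustrates this with made-up, right-skewed BMI values; it is not drawn from the study data.

```python
import math
import statistics

# Hypothetical right-skewed BMI sample with an odd number of values.
bmi = [21.0, 24.0, 26.0, 31.0, 48.0]
log_bmi = [math.log(v) for v in bmi]

# The median of the logs is exactly the log of the median.
log_of_median = math.log(statistics.median(bmi))
median_of_logs = statistics.median(log_bmi)

# The mean does not back-transform consistently: exponentiating the mean
# of the logs gives the geometric mean, which is below the arithmetic mean.
back_transformed_mean = math.exp(statistics.mean(log_bmi))
arithmetic_mean = statistics.mean(bmi)
```

A quantile model fitted on a transformed scale can therefore be reported on the original scale by applying the inverse transformation, which is not true of a model for the mean.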

These complex relationships could not be described by linear or multinomial logistic regression analyses. Inference about mean BMI from linear regression was unsatisfactory because it did not permit understanding the differential effect of physical activity across the distribution of BMI, and it was highly affected by the extreme values of BMI. Unless outliers were identified and removed through an ad hoc and somewhat arbitrary process, the amount of change in mean BMI with age did not appear to differ significantly across levels of physical activity, whereas in the quantile regression results, the differences in slope with age for differing activity levels were apparent in some quantiles. Multinomial logistic regression performed better than linear regression because it allowed inference about the upper tail of the distribution of BMI (i.e., overweight and obese), but it was inferior to quantile regression because it categorized the outcome BMI into a small number of groups that did not permit examination of the entire BMI distribution.

Quantile regression could be extended to the analysis of the effects of other risk factors on BMI or the analysis of other obesity measures. This analytic approach has the potential to contribute greatly to forming a fuller picture of the extent and causes of the obesity epidemic, and understanding the impact of large-scale obesity interventions, given that interventions may primarily affect one portion of the distribution of the outcome [35]. Quantile regression is readily implemented by available statistical software (e.g., Stata, SAS, S-plus). It should be used instead of linear regression in any study for which the effects of explanatory variables may differ across the range of the outcome variable and affect the shape of the distribution.

Therefore, analyses should not focus merely on the mean BMI, for these would provide diluted, tangential measures of the effects of interest. For instance, influences on BMI may differ across the population: stronger associations at the upper end of the distribution (> 70th percentile), moderate associations in the middle of the distribution (30th−70th percentile), and low or no association at the lower end of the distribution (< 30th percentile). Thus, the effect on mean BMI may not adequately convey the impact on the entire distribution. Further, if, as Rose [36] argued, there are potentially greater population benefits from approaches that encompass the large portion of the population at moderate risk, then we must employ methods that can provide information about effects throughout the distribution.

This study has strengths and limitations that deserve mention. A major strength is the large sample with multiple measurements over the period of 36 years. Another strength is the use of quantile regression which allowed for a comprehensive examination of the relationship between physical activity and BMI across the entire distribution of BMI. A limitation is that the ACLS cohort is predominately white, well-educated, and of middle-to-upper socio-economic status, and is not representative of the general population [20]. However, we contend that one of the advantages of the method proposed here is that it allows for more appropriate comparisons of findings from similar studies in other populations where distributions may be shifted or associations may vary across the distributions in different ways.

In summary, quantile regression was an effective statistical method that allowed us to examine how physical activity affected BMI across the entire distribution of BMI. Our findings demonstrated that the distribution of BMI ranged over higher values in older men than in younger men. However, the shift in the BMI distributions between younger and older men was smaller among regularly active men than among inactive men. The beneficial effect of regular physical activity in attenuating the BMI increase with ageing was most evident among the lower percentiles, which in younger men were within the range of normal BMI values. For example, the 25th percentile of BMI increased with age at a rate that was 24% smaller in active men (76 g/m2 per year) than inactive men (100 g/m2 per year). This study provides compelling evidence of the utility of quantile regression to examine the specific causes of the obesity epidemic.


The authors thank the Cooper Clinic physicians and technicians for collecting the data and the staff at the Cooper Institute for data entry and data management. In addition, this work was undertaken by the collaborative effort of the TRIM Research Group, and the TRIM authors express their appreciation to other members: Clemson University: Susan Barefoot, Margaret Condrasky, Ellen Granberg; Cooper Institute: Susan Campbell; Medical University of South Carolina: Patrick M. O'Neil; Pennington Biomedical Research Center: David W. Harsha; South Carolina Research Authority: Kate Beaver, Robert Davis, Stephen L. Jones; South Carolina State University: Bonita Manson; University of Iowa: Kathleen F. Janz; University of South Carolina: Robert R. Moran; Winthrop University: Patricia G. Wolman.