Keywords:

  • public health surveillance;
  • nutrition;
  • obesity;
  • statistical methods

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Methods
  5. Results
  6. Discussion
  7. Conflicts of interest
  8. References
  9. Supporting Information

Added sugar, particularly in carbonated soft drinks (CSDs), represents a considerable proportion of caloric intake in North America. Interventions to decrease the intake of added sugar have been proposed, but monitoring their effectiveness can be difficult due to the costs and limitations of dietary surveys. We developed, assessed the accuracy of, and took an initial step toward validating an indicator of neighborhood-level purchases of CSDs using automatically captured store scanner data in Montreal, Canada, between 2008 and 2010 and census data describing neighborhood socioeconomic characteristics. Our indicator predicted total monthly neighborhood sales based on historical sales and promotions and characteristics of the stores and neighborhoods. The prediction error for monthly sales in sampled stores was low (2.2%), and we demonstrated a negative association between predicted total sales and median personal income. For each $10,000 decrease in median personal income, we observed a fivefold increase in predicted monthly sales of CSDs. This indicator can be used by public health agencies to implement automated systems for neighborhood-level monitoring of an important upstream determinant of health. Future refinement of this indicator is possible to account for factors such as store catchment areas and to incorporate nutritional information about products.


Introduction

The global obesity epidemic is attributable to increases in caloric intake and sedentary lifestyle,[1] but there are few effective interventions to address these fundamental causes.[2, 3] A significant and growing proportion of total caloric intake is attributable to added sugar,[4] which is associated with body mass index (BMI) in both men and women.[5] In North America, sugar-sweetened drinks, such as carbonated soft drinks (CSDs), are the primary source of added sugar.[6] Interventions that decrease the intake of CSDs and other sources of added sugar can reduce caloric intake and BMI,[7] and may play a role in controlling obesity. In this context, population monitoring of added sugar intake is important so that public health agencies can assess this health determinant and evaluate the effect of interventions to decrease added sugar intake.

Added sugar intake can be measured using individual diet surveys, but these methods are resource intensive and are subject to considerable measurement error[8] and reporting bias.[9, 10] Technological innovations in diet measurement offer promise for the future,[11] but individual diet measurement remains difficult to accomplish accurately in a representative and ongoing manner within constrained public health budgets. Owing to these limitations, it is not practical for public health agencies to routinely use surveys to monitor trends in diet over time at a high geographical resolution. Tracking food purchasing, however, offers a novel alternative to diet surveys for monitoring population intake of added sugar.[12, 13]

Many grocery and convenience stores now use digital scanners to identify items at checkout and to generate an electronic record of sales. Companies such as Nielsen obtain these electronic sales data routinely from randomly sampled stores around the world and make aggregated sales data and information products available to companies for marketing and other purposes.[14] These aggregate sales data should also allow routine surveillance over time and geography of purchases of sugar-sweetened products. The information derived from such surveillance may be a useful complement to the limited data available through dietary surveys and other sources.

In previous work, we developed a system of indicators from these aggregate food sales data to describe product sales by category and to account for factors such as in-store promotions.[15] In this paper, we build on our earlier work to develop a surveillance indicator that will allow tracking of neighborhood-level purchases for a food category over time. In particular, we use CSDs as an example food category due to their importance. We then validate our surveillance indicator by demonstrating its inverse association with median personal income, a relationship others have shown previously using data from dietary surveys.[16]

Methods

Data

The Island of Montreal had a population of approximately 1.6 million people in 2006. For our analyses, we partitioned Montreal into 98 forward sortation areas (FSAs). The FSA is a unit of postal geography defined by the first three characters of the six-character postal code used in Canada. In urban areas, the size and population of an FSA are roughly equivalent to those of a ZIP code in the United States.
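As a small illustration of the postal geography described above, the following sketch derives an FSA from a Canadian postal code. The function name and the example postal code are hypothetical and are not taken from the study.

```python
def fsa_from_postal_code(postal_code: str) -> str:
    """Return the forward sortation area: the first three characters of the code."""
    # Normalize common formatting variants such as "h1a 0a1" or "H1A-0A1".
    compact = postal_code.replace(" ", "").replace("-", "").upper()
    if len(compact) != 6:
        raise ValueError(f"expected a six-character postal code, got {postal_code!r}")
    return compact[:3]


# Hypothetical postal code, used only to show the call.
example_fsa = fsa_from_postal_code("h1a 0a1")
```

The sketch only normalizes and truncates the code; it does not check that the result is a valid FSA.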

Data on sales of foods in 16 categories between 2008 and 2010 on the Island of Montreal were obtained from Nielsen. To select stores for sampling, Nielsen arranges all stores in each FSA into substrata based on store characteristics (e.g., total sales, square footage, and number of checkouts), and stores are then sampled randomly from each substratum. If a selected store is not scanning, a replacement store is selected from the same substratum. The selected sample is compared against all stores in the FSA, and if the sample is not representative, a partial reselection is performed. A field audit team from Nielsen visits each sampled store on a weekly basis to quantify all product display and promotion activity, and the digital sales data are obtained automatically from scanners in the same stores.

Scanner data were available as a single row for each product sold with variables to indicate the stock-keeping unit (SKU), the FSA of the store, the type of store (grocery or convenience store), the week of the sale, whether the product was on promotion through placement in the store (binary variable), the purchase price, the regular price, and whether the product was being advertised in the region (binary variable). We created the CSD category by grouping together all SKUs for flavored soft drinks containing sugar. Diet soft drinks were not included in the CSD category. Using methods we developed previously, we determined the discount frequency and advertising intensity for the CSD category.[15] Individual items were aggregated to arrive at total CSD sales by FSA and month, and corresponding summary measures for discount frequency, in-store promotion, and advertising intensity.
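The aggregation step described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the record fields are hypothetical, weeks are assumed to have been mapped to months already, and discount frequency is assumed here to be the share of records sold below the regular price (the published definition is in the cited earlier work).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScannerRow:
    fsa: str
    month: str               # e.g. "2010-07"; assumes weeks were already mapped to months
    sales_dollars: float     # hypothetical field holding the sales value of the record
    purchase_price: float
    regular_price: float
    in_store_promotion: bool


def aggregate_by_fsa_month(rows):
    """Sum sales and compute simple promotion shares per (FSA, month)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row.fsa, row.month)].append(row)
    summary = {}
    for key, items in groups.items():
        n = len(items)
        summary[key] = {
            "total_sales": sum(r.sales_dollars for r in items),
            # Assumption: a record counts as discounted when sold below regular price.
            "discount_frequency": sum(r.purchase_price < r.regular_price for r in items) / n,
            "in_store_promotion": sum(r.in_store_promotion for r in items) / n,
        }
    return summary


# Made-up records, for illustration only.
rows = [
    ScannerRow("H1A", "2010-07", 120.0, 0.99, 1.29, True),
    ScannerRow("H1A", "2010-07", 80.0, 1.29, 1.29, False),
    ScannerRow("H1B", "2010-07", 50.0, 1.49, 1.49, False),
]
summary = aggregate_by_fsa_month(rows)
```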

Stores were sampled in 76 of the 98 FSAs and data were consistent (i.e., without obvious errors such as zero stores sampled with positive sales) and regularly available for only 68 FSAs. We computed values of our indicator for these 68 FSAs with sampled stores and regularly available data. Approximately 10% of the records were missing data on sales, food category pricing, and/or marketing indicators. In order to avoid dropping incomplete observations, we imputed missing values. Data on the total number of stores by type in each FSA were obtained from the Institut national de santé publique du Québec.[17] Total population, average number of children per household, proportion speaking French or English, and median personal income were obtained for each FSA from the Canada 2006 Census. We considered the proportion of the population speaking French or English as a marker for recent immigration.

CSD indicator

An indicator of CSD purchasing should provide estimates at the level of a neighborhood so that local public health departments can target and evaluate interventions. In the sales data, however, some observations are missing over time owing to rotation of sampled stores, and the proportion missing differs across food categories. Consequently, the proportion of all stores that is sampled and the mix of sampled stores may differ by neighborhood and over time. To develop a robust surveillance indicator that is comparable over time and across geographical regions, we partitioned the data into a training set (all monthly data for each sampled neighborhood in 2008 and 2009) to construct a monthly sales prediction model and a test set (all monthly data in 2010 for each neighborhood) to assess the accuracy of predicted monthly sales. We proceeded in four steps. First, we imputed missing values in the training set. Second, we built and assessed the accuracy of a model to predict monthly CSD sales for sampled stores in each neighborhood. Third, we predicted total monthly CSD sales from all stores in each neighborhood for 2010. Finally, we assessed the association at the neighborhood level between predicted total monthly CSD sales for 2010 and median personal income as measured in the 2006 census.

Statistical analyses

The sequential regression imputation method in IVEware[18] and SAS software was used to perform multiple imputation on missing sales, pricing, and marketing data in the training sample (2008–2009) to produce five imputed data sets. In subsequent analysis, regression models were fit to each of the imputed data sets and the parameter estimates from the five models were combined using the mianalyze procedure in SAS version 8.2.[19] We compared the prediction error of four regression models, which included FSA-level variables describing stores, sales and promotions, sociodemographic characteristics, FSA indicator variables, and temporal trends (Table 1). The outcome was log-transformed monthly sales in the FSA. Overall error in predictions of monthly FSA sales for sampled stores in 2010 was measured using the mean absolute predictive error (MAPE), which is calculated as

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{Y_i - \hat{Y}_i}{Y_i} \right| \tag{1}$$

where $n$ is the number of observations, $Y_i$ is the observed data, and $\hat{Y}_i$ is the forecast.
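A minimal sketch of this error measure follows. It assumes the conventional percentage form (the absolute error divided by the observed value, averaged over observations and multiplied by 100), which matches the percentages reported in Table 1. Whether the error was computed on the log scale or the dollar scale is not stated here, and the values below are made up.

```python
from typing import Sequence


def mape(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute predictive error, as a percentage of the observed values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    total = 0.0
    for y, y_hat in zip(observed, predicted):
        if y == 0:
            raise ValueError("observed values must be non-zero")
        total += abs((y - y_hat) / y)
    return 100.0 * total / len(observed)


# Made-up log-sales values, for illustration only.
observed_log_sales = [10.0, 8.0, 12.5]
predicted_log_sales = [10.5, 7.6, 12.5]
error_pct = mape(observed_log_sales, predicted_log_sales)
```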

Table 1. Variables in candidate forecasting models and prediction accuracy of the models for 2010 data when trained using data from 2008 and 2009

| Variable | Source | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|---|
| Stores | | | | | |
| Number of sampled outlets | Nielsen | | | | |
| Number of sampled grocery stores | Nielsen | | | | |
| Total number of grocery stores in FSA | INSPQ | | | | |
| Total number of convenience stores in FSA | INSPQ | | | | |
| Sales and promotion | | | | | |
| Number of SKUs | Nielsen | | | | |
| Regular price of product | Nielsen | | | | |
| Discount frequency | Nielsen | | | | |
| In-store promotion | Nielsen | | | | |
| Advertising intensity | Nielsen | | | | |
| Sociodemographic | | | | | |
| Population | Census | | | | |
| Average number of children in household | Census | | | | |
| Proportion speaking French or English | Census | | | | |
| Temporal and spatial | | | | | |
| Season indicator | | | | | |
| Month indicator | | | | | |
| FSA random effect | | | | | |
| Mean absolute prediction error | | 6.03% | 5.75% | 5.77% | 2.17% |

Note: The sociodemographic variable “proportion speaking French or English” is included as a marker of recent immigration.

The model with the lowest MAPE was used to predict total sales of CSDs across all stores for each month and FSA in 2010. To do this, we used the total number of outlets in the FSA and the total number of grocery stores in the FSA as the inputs for the number of sampled outlets and the number of sampled grocery stores in the predictive model. Details of this model are given in the Supporting Information.
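The substitution described above can be illustrated with a deliberately simplified sketch. The coefficients are hypothetical placeholders rather than estimates from the paper, and the fitted model includes further covariates and FSA random effects that are omitted here.

```python
# Hypothetical coefficients on the log-sales scale; NOT estimates from the paper.
COEFS = {"intercept": 7.0, "n_outlets": 0.30, "n_grocery": 0.50}


def predict_log_sales(n_outlets: int, n_grocery: int) -> float:
    """Linear predictor for log monthly sales given store counts."""
    return (
        COEFS["intercept"]
        + COEFS["n_outlets"] * n_outlets
        + COEFS["n_grocery"] * n_grocery
    )


# Prediction for the sampled stores: plug in the numbers of sampled stores.
log_sales_sampled = predict_log_sales(n_outlets=2, n_grocery=1)

# Prediction for the whole FSA: plug in the total store counts instead.
log_sales_total = predict_log_sales(n_outlets=10, n_grocery=3)
```

The same fitted coefficients are reused in both calls; only the store-count inputs change.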

The predicted total FSA monthly log-transformed sales for 2010 were used as the outcome in assessing the relationship between median personal income and sales of CSDs. We used a Bayesian spatiotemporal model that accounts for spatially structured and unstructured variation in sales across FSAs and the serial correlation in sales through time.[20] To account for spatial autocorrelation in FSA sales, we used a conditional autoregressive prior distribution. We used additional FSA random effects for unstructured variation in sales. For the temporal correlation, we used a first-order random walk model. The FSA random effects accounted for unmeasured FSA-level confounders. Median personal income was the independent variable of interest. The spatiotemporal models were implemented using WinBUGS 1.4. We used three Markov chain Monte Carlo (MCMC) chains from different initial values to assess convergence. A detailed description of this model is available in the Supporting Information. For mapping sales data and demographic variables, observations were grouped into classes to give equal-sized classes, with each class representing a quartile.
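One common way to write a model with these components is sketched below. This is an assumed specification for orientation only; the authors' exact formulation is in the Supporting Information. Here $S_{it}$ denotes predicted total sales in FSA $i$ and month $t$, $x_i$ is median personal income, $\partial_i$ is the set of FSAs neighboring $i$, and $m_i$ is the number of such neighbors. All symbols are introduced for this sketch.

```latex
\begin{align*}
\log(S_{it}) &\sim \mathrm{Normal}(\mu_{it}, \sigma^2) \\
\mu_{it} &= \alpha + \beta\, x_i + u_i + v_i + \gamma_t \\
u_i \mid u_{-i} &\sim \mathrm{Normal}\!\left(\frac{1}{m_i}\sum_{j \in \partial_i} u_j,\ \frac{\sigma_u^2}{m_i}\right)
  && \text{(spatially structured, conditional autoregressive)} \\
v_i &\sim \mathrm{Normal}(0, \sigma_v^2)
  && \text{(unstructured FSA effect)} \\
\gamma_t \mid \gamma_{t-1} &\sim \mathrm{Normal}(\gamma_{t-1}, \sigma_\gamma^2)
  && \text{(first-order random walk)}
\end{align*}
```

In this form, $\beta$ is the coefficient for median personal income on the log-sales scale.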

Results

Predictive model

Using only the average of the logarithm of monthly sales of CSDs in each FSA from 2008 and 2009 to predict monthly FSA sales in 2010 gave a prediction error of 13.2%. We considered four predictive regression models in an attempt to improve on this simple model (Table 1). The model with the lowest prediction error for sales from sampled stores (2.2%) was used to predict log-transformed total sales for each FSA and month in 2010.
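The simple baseline described above can be sketched as follows, assuming the training data are available as (FSA, month, sales) records. The record layout and the values are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def baseline_log_sales(training):
    """Map each FSA to the mean of its log monthly sales in the training period."""
    logs = defaultdict(list)
    for fsa, _month, sales in training:
        logs[fsa].append(math.log(sales))
    return {fsa: mean(values) for fsa, values in logs.items()}


# Made-up monthly sales (dollars) for two FSAs; not data from the study.
training = [
    ("H1A", "2008-01", 1000.0),
    ("H1A", "2009-01", 4000.0),
    ("H1B", "2008-01", 500.0),
    ("H1B", "2009-01", 500.0),
]
baseline = baseline_log_sales(training)
```

Under this baseline, every month of 2010 in a given FSA receives the same predicted log sales, so it cannot capture seasonal variation.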

Descriptive analysis

Table 2 presents descriptive statistics for the FSAs with and without sampled stores. The FSAs without sampled stores tend to have a lower population but are comparable with respect to the other census variables examined. Plots of the predicted and observed sales by region over time indicated that CSD sales tend to increase in the summer within each region but that sales are otherwise relatively constant within a region over time, and considerable differences in total sales are seen across regions (Fig. 1). The spatial distribution of total predicted monthly sales indicates some spatial clustering of regions with high sales, and similar clustering is seen in median personal income (Fig. 2).

Table 2. Descriptive statistics for the forward sortation areas (FSAs) with and without sampled stores

| Variable | Min | 25th percentile | Median | 75th percentile | Max | Mean | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|---|---|
| FSAs with sampled stores (n = 68) | | | | | | | | |
| Monthly CSD sales ($) | 454 | 1934 | 4554 | 20,091 | 117,428 | 15,946 | 10,016 | 21,876 |
| Number of SKUs | 24 | 66 | 76 | 243 | 349 | 139 | 115 | 162 |
| Number of outlets | 1.00 | 1.00 | 2.00 | 3.00 | 6.00 | 2.25 | 1.95 | 2.55 |
| Number of grocery stores | 0.00 | 0.00 | 1.00 | 1.25 | 4.00 | 0.96 | 0.71 | 1.20 |
| Discount frequency | 0.00 | 0.01 | 0.10 | 0.45 | 0.97 | 0.23 | 0.17 | 0.29 |
| In-store promotion | 0.00 | 0.00 | 0.00 | 0.22 | 0.97 | 0.13 | 0.09 | 0.18 |
| Advertising intensity | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 |
| Population | 5944 | 15,971 | 21,060 | 27,502 | 50,916 | 21,817 | 19,794 | 23,840 |
| Median household income | 14,492 | 20,002 | 23,375 | 26,831 | 44,705 | 24,162 | 22,791 | 25,534 |
| Proportion speaking English or French | 0.87 | 0.97 | 0.98 | 0.99 | 1.00 | 0.97 | 0.97 | 0.98 |
| Average number of children in household | 0.64 | 0.92 | 1.06 | 1.25 | 1.46 | 1.07 | 1.03 | 1.12 |
| FSAs without sampled stores (n = 30) | | | | | | | | |
| Population | 1595 | 5103 | 11,477 | 18,055 | 24,901 | 12,315 | 9742 | 14,889 |
| Median household income | 16,051 | 20,995 | 23,952 | 29,491 | 48,182 | 25,633 | 23,000 | 28,266 |
| Proportion speaking English or French | 0.88 | 0.98 | 0.98 | 0.99 | 1.00 | 0.98 | 0.97 | 0.99 |
| Average number of children in household | 0.48 | 0.76 | 1.02 | 1.22 | 1.44 | 0.98 | 0.88 | 1.08 |

Note: Sales and marketing data are from Nielsen, and population and household characteristics are from the Canada 2006 Census.

Figure 1. Predicted and observed monthly sales of carbonated soft drinks for five (H1A, H1B, H1E, H1G, and H1H) of the 68 forward sortation areas in Montreal with sampled stores. The seasonal variation is statistically significant, as indicated by the parameter estimate for the season variable (summer versus winter) of 0.212 (95% CI 0.080–0.345) in our space–time prediction model.

Figure 2. Geographical distributions of total predicted carbonated soft drink sales and median personal income by forward sortation areas in Montreal.

Spatial regression

Convergence of the Bayesian model was achieved following 20,000 iterations. An additional 20,000 iterations were used to estimate the random effects and regression coefficients. The regression coefficient for FSA median personal income was −0.0001641 (credible interval −0.00023, −0.000059). The logarithm of total sales ranged from 7.9 to 16.2, whereas FSA median personal income ranged from $14,273 to $37,903. Because the coefficient is expressed per dollar of income, a $10,000 decrease in FSA median income corresponds to a change of −0.0001641 × (−10,000) = 1.641 units in log(total sales), that is, an increase in total sales by a factor of exp(1.641) ≈ 5.16.
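The arithmetic behind this interpretation can be checked with a short calculation. The coefficient and interval bounds are the values reported in the text. Applying the same transformation to the interval bounds is an added illustration, not a result reported by the authors.

```python
import math

beta_per_dollar = -0.0001641                 # coefficient reported in the text
interval_per_dollar = (-0.00023, -0.000059)  # credible interval reported in the text
income_change = -10_000                      # a $10,000 decrease in median personal income

# Change in log(total sales) and the corresponding multiplicative change in sales.
log_change = beta_per_dollar * income_change
sales_ratio = math.exp(log_change)

# The same transformation applied to the interval bounds (illustrative only).
ratio_bounds = sorted(math.exp(b * income_change) for b in interval_per_dollar)
```

Here `sales_ratio` rounds to 5.16, matching the factor quoted in the text.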

Discussion

Using automatically captured data on monthly sales of food products, we developed an indicator of CSD sales by neighborhood. We then took an initial step toward validating this indicator of added sugar intake by demonstrating its negative association with median personal income, a relationship that others have observed using data from a dietary survey.[16] More specifically, we found at the neighborhood level that a decrease of $10,000 in median personal income was associated with a fivefold increase in sales of CSDs. Given the importance of CSDs as a source of added sugar intake[6] and the need to identify effective public health interventions for reducing the intake of added sugar,[7, 21] the ability to routinely monitor this upstream determinant of obesity can provide guidance for targeting and evaluating public health interventions. The indicator that we developed is based on data captured automatically for inventory and marketing purposes. These data are available throughout the world,[14] so it should be possible to implement this indicator in other settings with few modifications. Moreover, the automated nature of the data capture should enable the development of an automated surveillance system based on this and other similar indicators of different food categories of public health importance.[22]

The development of our surveillance indicator builds on our previous work to develop indicators from automatically captured store sales data.[12, 15, 22] In this work, we have demonstrated how these sales indicators can be used to develop a public health surveillance indicator, which provides interpretable information over time and at a high geographical resolution. We do not know of any previous efforts to develop a similar sort of indicator using food sales data, but in public health settings, surveillance systems increasingly rely on automated feeds of data captured for other purposes, such as electronic medical records,[23] telehealth calls,[24] and pharmaceutical sales.[25] With respect to our substantive finding, the small amount of evidence regarding the negative association between CSD sales and median personal income is based on survey data and is difficult to compare to our results, as the measures of added sugar intake and income were different in those studies.[16, 26] For example, one study reported a 15% increase in added sugar intake when comparing the third of respondents with the highest family income to the third of respondents with the lowest family income.[16]

Although we have identified a novel approach to monitoring food purchasing at the neighborhood level, there are limitations to this approach. For one, the indicator measures food purchasing, not dietary intake. However, we did demonstrate that our indicator of CSD sales has an inverse association with income, as reported by others. In addition, because our indicator follows purchasing at the neighborhood level, it is not possible to attribute purchases to specific population subgroups, although differences in demographic profiles between neighborhoods could be used to assess ecological associations between population subgroups and purchasing patterns. Another limitation is that our indicator measures a single food category and not a variable of more direct nutritional interest, such as added sugar. In the case of added sugar, CSDs represent a large proportion of the total intake,[6] so the connection is clear. In general, however, we could address this limitation in the future by linking data on nutritional values of products to data on sales[13] and developing new indicators of total added sugar or other measures, which would be the product of the sales data and the nutritional value of each item sold. Monitoring multiple categories of food purchasing simultaneously may also be important to identify if interventions are resulting in substitution of one food category for another.[12]

The trend toward dining out creates another limitation to our approach. The indicator measures only food purchased by consumers for eating at home and does not capture food purchased at restaurants. In the United States, the frequency of dining out increases with family income,[27] but the energy density of restaurant meals is inversely associated with family income.[28] There is also some spatial imprecision inherent in the estimates, as we directly attribute the sales information from stores to the regions that they fall within, making no allowance for store catchment areas. In future work, it should be possible to address this limitation by defining store catchment areas and then using those areas to assign sales proportionally to regions. Finally, we relied on multiple imputation to estimate missing values in the sales data. Only 10% of the values were missing in our data, but if a large proportion of values were missing, then caution would be warranted in using this approach.
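For readers unfamiliar with how estimates from multiply imputed data sets are combined, the pooling step can be sketched with Rubin's rules, the standard approach for combining multiple-imputation results. The function name and the numbers below are hypothetical and are not results from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their variances using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from at least 2 imputations")
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance within imputations
    between = variance(estimates)   # sample variance of estimates across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up coefficient estimates and squared standard errors from five imputed data sets.
estimates = [0.20, 0.22, 0.21, 0.19, 0.23]
variances = [0.004, 0.004, 0.005, 0.004, 0.003]
pooled_estimate, total_variance = pool_rubin(estimates, variances)
```

The between-imputation term grows as the imputed data sets disagree, which is one reason caution is warranted when a large proportion of values is missing.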

The method that we have developed for monitoring food purchasing opens many avenues for future research. One avenue of research is to refine and extend the method. Particularly useful extensions would be to incorporate catchment areas for stores and to link food products to nutritional databases, allowing population-level monitoring of purchasing at the nutrient level. Another avenue of research is to extend the method to multiple food categories, allowing near real-time monitoring of the full breadth of food purchased within neighborhoods. This comprehensive view would allow a richer understanding of how the complete food “basket” varies over time and across neighborhoods. Perhaps the most promising avenue of research is using this method, ideally extended to measure nutrition or the full basket, to discover the effect of interventions and neighborhood-level characteristics on spatial and temporal variations in food purchasing.
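As a rough illustration of nutrient-level monitoring, an added sugar indicator could be computed as units sold multiplied by the sugar content of each product. The data layout and the values below are hypothetical. Linking real SKUs to a nutritional database is the substantive step and is not shown.

```python
def added_sugar_grams(sales):
    """Total added sugar purchased: units sold times grams of added sugar per unit."""
    return sum(units * sugar_g for units, sugar_g in sales)


# Hypothetical (units sold, grams of added sugar per unit) pairs for one FSA and month.
sales = [(100, 39.0), (40, 65.0)]
total_sugar = added_sugar_grams(sales)
```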

In conclusion, we have developed, assessed the accuracy of, and begun to validate an indicator to allow neighborhood-level surveillance of CSD purchasing. This indicator should be straightforward to implement in other settings, and there are many ways that this indicator can be refined and extended in the future to support automated surveillance of food purchasing within neighborhoods.

References

Supporting Information

Disclaimer: Supplementary materials have been peer-reviewed but not copyedited.

| Filename | Format | Size | Description |
|---|---|---|---|
| nyas12332-sup-0001-Suppmat.docx | | 444K | Detailed description of regression models |

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.