Keywords: C1; C8; C53

Abstract

In this paper we show how to use search engine data to forecast near-term values of economic indicators. Examples include automobile sales, unemployment claims, travel destination planning and consumer confidence.


I Introduction

Government agencies periodically release indicators of the level of economic activity in various sectors. However, these releases are typically only available with a reporting lag of several weeks and are often revised a few months later. It would clearly be helpful to have more timely forecasts of these economic indicators.

Nowadays there are several sources of data on real-time economic activity available from private sector companies such as Google, MasterCard, Federal Express, UPS, Intuit and many others. In this paper we examine Google Trends, which is a real-time daily and weekly index of the volume of queries that users enter into Google. We have found that these query indices are often correlated with various economic indicators and may be helpful for short-term economic prediction.

We are not claiming that Google Trends data can help in predicting the future. Rather we are claiming that Google Trends may help in predicting the present. For example, the volume of queries on automobile sales during the second week in June may be helpful in predicting the June auto sales report which is released several weeks later in July.

It may also be true that June queries help to predict July sales, but we leave that question for future research, as this depends very much on the particular time series in question. We have found that queries can be useful leading indicators for subsequent consumer purchases in situations where consumers start planning purchases significantly in advance of their actual purchase decision.

Predicting the present, in the sense described above, is a form of ‘contemporaneous forecasting’ or ‘nowcasting,’ a topic which is of particular interest to central banks and other government agencies. As Castle et al. (2009) point out, contemporaneous forecasting is valuable in itself, but it also raises a number of interesting econometric research questions involving topics such as variable selection, mixed frequency estimation, and incorporation of data revisions, to name just a few.

Our goals in this paper are to familiarise readers with Google Trends data, illustrate some simple forecasting methods that use this data and encourage readers to undertake their own analyses.

We do not claim any methodological advances here; certainly it is possible to build more sophisticated forecasting models than those we use. However, we believe that the models we describe can serve as baselines that help analysts get started with their own modelling efforts and that can subsequently be refined for specific applications.1

Our examples use R, a freely available open-source statistics package from http://CRAN.R-project.org. We provide the R source code and data in the online appendix available at http://www.ischool.berkeley.edu/~hal.

II Literature Review

So far as we know, the first published paper that suggested that web search data was useful in forecasting economic statistics was Ettredge et al. (2005), which examined the US unemployment rate. At about the same time Cooper et al. (2005) described using Internet search volume for cancer-related topics. Since then there have been several papers that have examined web search data in various fields.

For example, in the field of epidemiology, Polgreen et al. (2008) and Ginsberg et al. (2009) showed that search data could help predict the incidence of influenza-like diseases. This work was widely publicised and stimulated several further findings in epidemiology, including Brownstein et al. (2009), Corley et al. (2009), Hulth et al. (2009), Pelat et al. (2009), Valdivia and Monge-Corella (2010) and Wilson (2009).

In economics, Choi and Varian (2009a,b) described how to use Google Search Insights data to predict several economic metrics including initial claims for unemployment, automobile demand, and vacation destinations; this report is an updated and streamlined version of those working papers. Askitas and Zimmermann (2010), D'Amuri and Marcucci (2010) and Suhoy (2009) examined unemployment in Germany, the US and Israel, respectively. Guzman (2011) has examined Google data as a predictor of inflation.

Recently, Baker and Fradkin (2011) have used Google search data to examine how job search responded to extensions of unemployment payments.

Radinsky et al. (2009), Huang and Penna (2009) and Preis et al. (2010) examine the use of search data for measuring consumer sentiment while Schmidt and Vosen (2009) and Lindberg (2011) examine retail sales and consumption metrics. Wu and Brynjolfsson (2010) examine housing data using longitudinal data extracted from Google Search Insights.

Shimshoni et al. (2009) describe the predictability of Google Trends data itself, pointing out that a substantial number of search terms are highly predictable using simple seasonal decomposition methods.

Goel et al. (2010) provide a useful survey of work in this area and describe some of the limitations of web search data. As they point out, search data is easy to acquire and is often helpful in making forecasts, but may not provide dramatic increases in predictability. Although we generally agree with this view, we typically find economically significant, if not dramatic, improvements in forecast performance using search engine data, as illustrated in this paper.

Finally, McLaren and Shanbhogue (2011) summarise how web search data can be used for economic nowcasting by central banks.

III Google Trends

Google Trends provides a time series index of the volume of queries users enter into Google in a given geographic area.

The query index is based on query share: the total query volume for the search term in question within a particular geographic region divided by the total number of queries in that region during the time period being examined. The maximum query share in the time period specified is normalised to be 100, and the query share at the initial date being examined is normalised to be zero.
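The query-share calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Google's implementation: the function name `query_index` and the counts are hypothetical. It covers only the share and the scaling of the maximum to 100; the normalisation of the initial date is not reproduced because the text does not give its formula.

```python
def query_index(term_counts, total_counts):
    """Query share per period, rescaled so the largest share equals 100."""
    shares = [term / total for term, total in zip(term_counts, total_counts)]
    peak = max(shares)
    return [100.0 * share / peak for share in shares]


# Hypothetical weekly counts for one term and for all queries in a region.
term_counts = [120, 150, 300, 240]
total_counts = [10_000, 10_000, 12_000, 12_000]
index = query_index(term_counts, total_counts)
print([round(value, 1) for value in index])
```

Because the index is a share, a term can have a falling index while its raw query count rises, provided total query volume rises faster.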

The queries are ‘broad matched’ in the sense that queries such as [used automobiles] are counted in the calculation of the query index for [automobile]. The data go back to January 1, 2004.

Note that Google Trends data is computed using a sampling method, and the results therefore vary a few per cent from day to day. Furthermore, due to privacy considerations, only queries with a meaningful volume are tracked. There is a substantial amount of online help available via links on the site, which describes in detail how the data is collected.

This query index data is available at country, state and metro level for the United States and several other countries. There are two user interfaces for the data, Google Trends and Google Insights for Search (I4S). The latter is the more useful for our purposes since it allows a logged-in user to download the query index data as a CSV file.

Figure 1 depicts example output from I4S for the query [free shipping] in Australia. The search share for this query has increased significantly since 2008 and tends to peak during the holiday shopping season.

Figure 1. Search Index for [Free Shipping] in Australia

Google classifies search queries into approximately 30 categories at the top level and approximately 250 categories at the second level using a natural language classification engine. For example, the query [car tire] would be assigned to category Vehicle Tires which is a subcategory of Auto Parts which is a subcategory of Automotive. The assignment is probabilistic in the sense that a query such as [apple] could be partially assigned to Computers & Electronics, Food & Drink and Entertainment.

IV Examples

(i) Motor Vehicles and Parts

As an initial example we use the ‘Motor Vehicles and Parts Dealers’ series from the US Census Bureau ‘Advance Monthly Sales for Retail and Food Services’ report.2

This index summarises results from a survey sent to motor vehicle and parts dealers that asks about current sales. The preliminary index is released two weeks after the end of each month. The data is available in both seasonally adjusted and unadjusted form; here we use the unadjusted data.

Let $y_t$ be the log of the observation at time $t$. We first estimate a simple baseline seasonal AR-1 model, $y_t = b_0 + b_1 y_{t-1} + b_{12} y_{t-12} + e_t$, for the period 2004-01-01 to 2011-07-01.

              Estimate   Standard Error   t value   Pr(>|t|)
(Intercept)   0.67266    0.76355          0.881     0.381117
lag(y, −1)    0.64345    0.07332          8.776     3.59e−13 ***
lag(y, −12)   0.29565    0.07282          4.060     0.000118 ***

Multiple R-squared: 0.7185, Adjusted R-squared: 0.7111.
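A regression of this kind, with an intercept and lags 1 and 12 of the series, can be sketched in Python as follows. This is an illustrative stand-in for the R code in the online appendix, not a copy of it: the helper names (`solve`, `ols`, `seasonal_ar_design`) and the synthetic series are assumptions made for the example, and the estimator is plain least squares via the normal equations.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, targets):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def seasonal_ar_design(y, lags=(1, 12)):
    """Rows [1, y[t-1], y[t-12]] and targets y[t] for all t with full history."""
    start = max(lags)
    rows = [[1.0] + [y[t - lag] for lag in lags] for t in range(start, len(y))]
    return rows, y[start:]


# Synthetic, noise-free series built from known coefficients (illustration only).
true_beta = [0.5, 0.6, 0.3]
y = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0, 2.0, 8.0, 5.0, 0.0, 10.0, 5.0]
for t in range(12, 60):
    y.append(true_beta[0] + true_beta[1] * y[t - 1] + true_beta[2] * y[t - 12])

rows, targets = seasonal_ar_design(y)
beta = ols(rows, targets)
print([round(b, 4) for b in beta])
```

Additional predictors, such as a Trends category index aligned to the same months, would enter as extra columns in each row.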

Google Trends contains several automotive-related categories. A little experimentation shows that two of these categories, Trucks & SUVs and Automotive Insurance, significantly improve in-sample fit when added to this regression.

              Estimate   Standard Error   t value   Pr(>|t|)
(Intercept)   −0.45798   0.78438          −0.584    0.561081
lag(y, −1)    0.61947    0.06318          9.805     5.09e−15 ***
lag(y, −12)   0.42865    0.06535          6.559     6.45e−09 ***
suvs          1.05721    0.16686          6.336     1.66e−08 ***
insurance     −0.52966   0.15206          −3.483    0.000835 ***

Multiple R-squared: 0.8179, Adjusted R-squared: 0.808.

However, the perils of in-sample forecasting are well-known. The question of interest is whether the Trends variables improve out-of-sample forecasting.

To check this, we use a rolling window forecast where we estimate the model using the data for periods $k$ through $t-1$ and then forecast $y_t$ using $y_{t-1}$, $y_{t-12}$, and the contemporaneous values of the Trends variables as predictors. Since the series is actually released two weeks after the end of each month, this gives us a meaningful forecasting lead. The value of $k$ is chosen so that there are a reasonable number of observations for the first regression in the sequence. In this case we chose $k = 17$, which implied the forecasts start on 2005-06-01.
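The one-step-ahead scheme can be sketched as below. The sketch rests on two assumptions that go beyond the text: the estimation sample always starts at the first observation and grows by one period at each step, and a single-predictor least-squares fit (`fit_line`, a hypothetical helper) stands in for the full regression to keep the example short. The point is the timing: the model that forecasts period i never sees data from period i or later.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rolling_forecasts(xs, ys, first):
    """Forecast ys[i] from xs[i], fitting only on observations before i."""
    preds = []
    for i in range(first, len(ys)):
        intercept, slope = fit_line(xs[:i], ys[:i])
        preds.append(intercept + slope * xs[i])
    return preds


# Synthetic AR(1) series without noise, so each forecast should be exact.
y = [1.0]
for _ in range(40):
    y.append(0.5 + 0.8 * y[-1])

lagged, current = y[:-1], y[1:]
first = 17  # index of the first forecast, echoing k = 17 in the text
preds = rolling_forecasts(lagged, current, first)
errors = [abs(p - a) for p, a in zip(preds, current[first:])]
print(max(errors) < 1e-9)
```

With real data the predictor list would hold whatever is known when the forecast is made, which here means the lagged series and the same-month Trends values.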

The results are shown in Figure 2. The mean absolute error (MAE) of $\log(y_t)$ using the baseline seasonal AR-1 model is 6.34 per cent while the MAE using the Trends data is 5.66 per cent, an improvement of 10.5 per cent. If we look at the MAE during the recession (December 2007 through June 2009) we find that the MAE without Trends data is 8.86 per cent and with Trends data is 6.96 per cent, an improvement of 21.5 per cent.
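The comparison above rests on two small calculations, sketched here with invented numbers. The function names are hypothetical, and the relative improvement is assumed to be one minus the ratio of the two MAEs. Because the series is in logs, an absolute error of 0.05 corresponds to an error of roughly 5 per cent in the level.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def improvement(mae_base, mae_trends):
    """Share of the baseline MAE removed by the Trends model."""
    return 1.0 - mae_trends / mae_base


# Hypothetical log-scale observations and forecasts, for illustration only.
actual = [4.60, 4.70, 4.65, 4.80]
base_pred = [4.50, 4.80, 4.55, 4.70]
trends_pred = [4.55, 4.75, 4.60, 4.75]

mae_base = mae(actual, base_pred)
mae_trends = mae(actual, trends_pred)
gain = improvement(mae_base, mae_trends)
print(round(mae_base, 3), round(mae_trends, 3), round(gain, 3))
```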

Figure 2. Motor Vehicles and Parts

(ii) Initial Claims for Unemployment Benefits

Each Thursday morning the US Department of Labor releases a report describing the number of people who filed for unemployment benefits in the previous week.3

Initial claims have a good record as a leading indicator. Macroeconomist Robert Gordon indicates that there is a ‘surprisingly tight historical relationship in past US recessions between the cyclical peak in new claims for unemployment insurance (measured as a four-week moving average) and the subsequent National Bureau of Economic Research (NBER) trough.’4 Furthermore, a cursory inspection of the relationship between initial claims and the unemployment rate indicates that initial claims tend to peak a few months before the unemployment rate peaks.

When someone becomes unemployed it is natural to expect that they will issue searches such as [file for unemployment], [unemployment office], [unemployment benefits], [unemployment claim], [jobs], [resume] and so on. Google Trends classifies search queries like these into two categories, Local/Jobs and Society/Social Services/Welfare & Unemployment.

In this example we work with the seasonally adjusted initial claims data, since that is the number used by most economic forecasters. Since our dependent variable is seasonally adjusted, it makes sense to seasonally adjust the independent variables as well, so we used the stl command in R to remove the seasonal component of the Trends data.
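The idea of removing a seasonal component can be illustrated with a much cruder method than stl. The sketch below is not the loess-based decomposition that stl performs: it simply subtracts, for each position in the seasonal cycle, the average deviation of that position from the overall mean, and it ignores any trend. The function name and the data are hypothetical.

```python
def remove_seasonal_means(series, period):
    """Subtract each seasonal position's average deviation from the overall mean."""
    overall = sum(series) / len(series)
    adjusted = list(series)
    for pos in range(period):
        values = series[pos::period]
        effect = sum(values) / len(values) - overall
        for i in range(pos, len(series), period):
            adjusted[i] = series[i] - effect
    return adjusted


# Three complete cycles of a purely seasonal pattern around a level of 10.
series = [11.0, 9.0, 12.0, 8.0] * 3
adjusted = remove_seasonal_means(series, period=4)
print(adjusted)
```

For a series with a trend or a slowly changing seasonal pattern this shortcut would leave seasonal residue behind, which is one reason to prefer a full decomposition.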

In this case, our baseline regression is a simple AR-1 model on the log of initial claims.

Start = 2004-01-17, End = 2011-07-02

              Estimate   Standard Error   t value   Pr(>|t|)
(Intercept)   0.25488    0.12951          1.968     0.0498 *
L(y, 1)       0.98022    0.01007          97.368    <2e−16 ***

Multiple R-squared: 0.9607, Adjusted R-squared: 0.9606.

Note that the coefficient on the lagged term is almost one, suggesting that the process for initial claims is very close to a random walk (with drift).

As Nelson and Plosser (1982) and many subsequent authors have pointed out, it is very common for macroeconomic data to be represented as a random walk. For a random walk, the best univariate forecast for $y_t$ is simply $y_{t-1}$. However, perhaps we can improve on this baseline forecast by using additional predictors from Google Trends.

Using the Google Trends categories Jobs and Welfare...Unemployment we find that these are marginally significant but have little impact on in-sample fit.

                         Estimate    Standard Error   t value   Pr(>|t|)
(Intercept)              1.0563440   0.2686360        3.932     9.98e−05 ***
L(y, 1)                  0.9183560   0.0208778        43.987    <2e−16 ***
Jobs                     0.0007069   0.0003847        1.838     0.0669
Welfare...Unemployment   0.0003752   0.0001838        2.042     0.0418 *

Multiple R-squared: 0.962, Adjusted R-squared: 0.9618.

When we look at one-step-ahead out-of-sample forecasts we find that the MAE goes from 3.37 per cent using the baseline forecast to 3.68 per cent using the Trends data, which is a 5.95 per cent reduction in fit. However, when we look at the series a bit more closely a rather different picture emerges.

It is well-known that it is difficult to identify ‘turning points’ in economic series. A smoothly increasing or decreasing trend is easy to fit with a simple linear AR model. Turning points in time series are much harder to forecast.

If we look just at the recession period (December 2007 through June 2009) we find that using Trends data reduces the MAE from 3.98 per cent to 3.44 per cent, an improvement of 13.6 per cent. Looking more closely at the series, we see that there are four notable turning points indicated by the shaded areas in Figure 3. The MAEs for the periods surrounding these turning points are reported in Table 1. Note that there is a reduction in MAE at all turning points, with particularly pronounced reductions in the first two. In this case, the Google Trends data seems to help in identifying at least two of the turning points in the series.

Figure 3. Seasonally Adjusted Initial Claims for Unemployment; Turning Points in Gray

Table 1. Behavior of MAE around Turning Points

Start        End          MAE base   MAE trends   1-ratio
2009-03-01   2009-05-01   0.0306     0.02398      21.85%
2009-12-01   2010-02-01   0.0356     0.03127      12.36%
2010-07-15   2010-07-15   0.0513     0.05101      0.65%
2011-01-01   2011-05-01   0.0252     0.02446      3.22%

Figure 4 plots the difference in MAE for the Base and Trends model. A positive value indicates that the Trends forecast had a smaller error. Here it is clear that the Trends model fits better during the recession (December 2007 through June 2009), while the Base fits better immediately after.

Figure 4. Base Absolute Error – Trends Absolute Error

D'Amuri and Marcucci (2010), Askitas and Zimmermann (2010) and Suhoy (2009) have confirmed the value of search data in forecasting unemployment in the US, Germany and Israel, respectively.

(iii) Travel

The Internet is commonly used for travel planning which suggests that Google Trends data about destinations may be useful in predicting visits to that destination. We illustrate this using data from the Hong Kong Tourism Board.5

The Hong Kong Tourism Board publishes monthly visitor arrival statistics, including ‘Monthly visitor arrival summary’ by country/territory of residence. For this study we use visitor data from US, Canada, Great Britain, Germany, France, Italy, Australia, Japan and India.

‘Hong Kong’ is also one of the subcategories under Vacation Destinations in Google Trends. We can examine the query index for this category by country of origin.

The Hong Kong visitor arrival data is not seasonally adjusted, nor is the Google Trends data. We used the average query index in the first two weekly observations of the month to predict the total monthly visitors. Since the data is released with a one-month lag, this gives us roughly a six-week lead in terms of forecasting.

We let $y_t$ be the visitors from a given country in month $t$, and $x_t$ be the average Google Trends index for Vacation Destinations/Hong Kong for the first two weeks of that month. We can specify a basic seasonal AR-1 model of the form [inline equation image not recovered].
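The construction of the monthly predictor from weekly data can be sketched as follows. The sketch assumes that each weekly observation is stamped with a date and belongs to the month of that date; the function name, the dates and the index values are hypothetical.

```python
from collections import defaultdict
from datetime import date


def first_two_weeks_mean(weekly):
    """Map (year, month) to the mean of that month's first two weekly values."""
    by_month = defaultdict(list)
    for day, value in sorted(weekly):
        by_month[(day.year, day.month)].append(value)
    return {
        month: sum(values[:2]) / 2
        for month, values in by_month.items()
        if len(values) >= 2
    }


# Hypothetical weekly query index values, for illustration only.
weekly = [
    (date(2011, 6, 5), 40.0),
    (date(2011, 6, 12), 50.0),
    (date(2011, 6, 19), 70.0),
    (date(2011, 6, 26), 90.0),
    (date(2011, 7, 3), 60.0),
    (date(2011, 7, 10), 80.0),
]
monthly = first_two_weeks_mean(weekly)
print(monthly)
```

Only the first two observations of each month enter the average, so the predictor is available by mid-month, well before the visitor figures are published.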

We estimate this model for each country and compare the actual to the fitted results in Figure 5. Unlike the previous examples, we have here used in-sample fits. As can be seen, the in-sample fits are reasonably good, with the exception of Japan. Excluding Japan, the average $R^2$ is 73.3 per cent. In Choi and Varian (2009a) we use a more elaborate random effects model with some additional predictors and find a somewhat better in-sample fit.

Figure 5. Visitors to Hong Kong

V Consumer Confidence

In our final example, we examine the Roy Morgan Consumer Confidence Index for Australia.6 Unlike our earlier examples, it is not obvious which categories would be most helpful in predicting this series. There are a variety of methods one can use for variable selection; see Castle et al. (2010) for a recent discussion of this topic with emphasis on nowcasting applications.

We used a Bayesian method known as ‘spike and slab’ regression described by George and McCulloch (1997). This technique produces a posterior probability that a variable enters a regression (i.e., has a non-zero coefficient) along with an estimate of that coefficient's posterior distribution.
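For orientation, one common textbook formulation of a spike and slab prior is given below. It is a generic sketch and not necessarily the exact specification used here; the symbols $\gamma_j$, $\pi_j$ and $\tau^2$ are notation introduced for this illustration.

```latex
\gamma_j \sim \mathrm{Bernoulli}(\pi_j), \qquad
\beta_j \mid (\gamma_j = 0) \;=\; 0 \quad \text{(spike)}, \qquad
\beta_j \mid (\gamma_j = 1) \;\sim\; \mathcal{N}(0, \tau^2) \quad \text{(slab)}
```

Under this formulation the quantity reported for each candidate predictor $j$ is the posterior inclusion probability $\Pr(\gamma_j = 1 \mid y)$, together with the posterior distribution of $\beta_j$.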

We used Google Trends category data for Australia, taking the average value of the category data for the first two weeks of the month and seasonally adjusting it using the R command stl. The spike and slab technique assigned high posterior probabilities to the categories Crime & Justice, Trucks & SUVs and Hybrid & Alternative Vehicles. The last two are not surprising as they are highly correlated with the price of gasoline, which is known to impact consumer confidence in the United States. We have no explanation for the first predictor. Plotting the Crime & Justice time series shows a definite correlation with consumer confidence for the period we examine, but of course there is no way to know if this correlation will persist in the future.

Our predictor for Australian log(consumer confidence) is summarised in this table.

                                Estimate     Standard Error   t value   Pr(>|t|)
(Intercept)                     1.5172617    0.2795356        5.428     5.67e−07 ***
lag(y, −1)                      0.6839436    0.0584158        11.708    <2e−16 ***
Crime … Justice                 −0.0009664   0.0002404        −4.020    0.000129 ***
Trucks … SUVs                   0.0010600    0.0005346        1.983     0.050735 .
Hybrid … Alternative.Vehicles   −0.0007869   0.0001482        −5.308    9.26e−07 ***

Multiple R-squared: 0.8583, Adjusted R-squared: 0.8514.

The Trends predictors reduce MAE of the simple AR-1 model by about 12.7 per cent for in-sample forecasts. One-step-ahead MAE goes from 3.63 per cent to 3.29 per cent, an improvement of 9.3 per cent; see Figure 6. The big drop in Spring 2008 is due to a significant increase in queries on Hybrid & Alternative Vehicles, which is likely due to the increased price of oil that occurred during that period.

Figure 6. Australia Consumer Confidence

VI Conclusion

We have found that simple seasonal AR models that include relevant Google Trends variables tend to outperform models that exclude these predictors by 5 per cent to 20 per cent. We hope that these examples will encourage other researchers to experiment with this data source in their own research.

Google Trends data is available at a state and metro level for several countries. We have also had success with forecasting various business metrics using state-level data. In some cases longitudinal data helps make up for the rather short time series available from Google Trends.

References
