When is a correlation between non-independent variables “spurious”?

Authors


Abstract

Correlations that are artifacts of data transformations can be said to be spurious. This study considers four common types of analyses in which the X and Y variables are not independent: regressions of the form X/Z vs Y/Z, X×Z vs Y×Z, X vs Y/X, and X+Y vs Y. These analyses were carried out using a series of Monte Carlo simulations in which sample size and sample variability were varied. The study also considered how the magnitude of the spurious correlations was affected by disparities in variability between the shared and non-shared terms and by measurement error in the shared term. The accuracy of equations previously derived to predict the magnitude of spurious correlations was also assessed. The results show that the risk of producing spurious correlations when analyzing non-independent variables is very large. Spurious correlations occurred in all cases assessed, the mean spurious coefficient of determination (r²) frequently exceeded 0.50, and in some cases the 90% confidence interval for these simulations included all large r² values. The magnitude of spurious correlations was sensitive to differences in the variability of the shared and non-shared terms, with large spurious correlations obtained when the variability of the shared term was larger. Sample size had only a modest impact on the magnitude of spurious correlations. When measurement error for the shared variable was smaller than one half the coefficient of variation for that variable, which is generally the case, the measurement error did not generate large spurious correlations. The equations available to predict expected spurious correlations provided accurate predictions for the case of X×Z vs Y×Z, variable predictions for the case of X vs Y/X, and poor predictions for most cases of X/Z vs Y/Z and X+Y vs Y.
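
The X/Z vs Y/Z case can be illustrated with a minimal Monte Carlo sketch. This is not the authors' simulation code: the lognormal distributions, the function names (pearson_r, mean_r2), and all parameter values are assumptions chosen for illustration. X, Y, and Z are drawn independently, so any correlation between X/Z and Y/Z comes only from the shared divisor.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_r2(n=50, sigma_shared=0.3, sigma_other=0.1, trials=200, seed=1):
    """Return (mean r^2 of X vs Y, mean r^2 of X/Z vs Y/Z) over many trials.

    Assumption: X, Y, Z are independent lognormal variables. sigma_* is the
    standard deviation on the log scale, which is close to the coefficient
    of variation when it is small.
    """
    rng = random.Random(seed)
    sum_raw = 0.0
    sum_ratio = 0.0
    for _ in range(trials):
        x = [rng.lognormvariate(0.0, sigma_other) for _ in range(n)]
        y = [rng.lognormvariate(0.0, sigma_other) for _ in range(n)]
        z = [rng.lognormvariate(0.0, sigma_shared) for _ in range(n)]
        # Baseline: X and Y are independent, so r^2 should be near zero.
        sum_raw += pearson_r(x, y) ** 2
        # Dividing both by the same Z induces a correlation between them.
        x_over_z = [a / c for a, c in zip(x, z)]
        y_over_z = [b / c for b, c in zip(y, z)]
        sum_ratio += pearson_r(x_over_z, y_over_z) ** 2
    return sum_raw / trials, sum_ratio / trials


if __name__ == "__main__":
    # Compare a highly variable shared term with a weakly variable one.
    for s_shared, s_other in [(0.3, 0.1), (0.1, 0.3)]:
        raw, ratio = mean_r2(sigma_shared=s_shared, sigma_other=s_other)
        print(f"shared={s_shared} other={s_other} "
              f"mean r2 raw={raw:.3f} ratio={ratio:.3f}")
```

Under these assumed settings, the mean r² for X/Z vs Y/Z should be much larger when the shared term Z is more variable than X and Y than when it is less variable, which matches the sensitivity described in the abstract.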
