Interannual variability in biosphere-atmosphere exchange of CO2 is driven by a diverse range of biotic and abiotic factors. Replicating this variability thus represents the ‘acid test’ for terrestrial biosphere models. Although such models are commonly used to project responses to both normal and anomalous variability in climate, they are rarely tested explicitly against interannual variability in observations. Herein, using standardized data from the North American Carbon Program, we assess the performance of 16 terrestrial biosphere models and 3 remote sensing products against long-term measurements of biosphere-atmosphere CO2 exchange made with eddy-covariance flux towers at 11 forested sites in North America. Instead of focusing on model-data agreement, we take a systematic, variability-oriented approach and show that although the models tend to reproduce the mean magnitude of the observed annual flux variability, they fail to reproduce its timing. Large biases in modeled annual means are evident for all models. Observed interannual variability is commonly of the same order of magnitude as the mean fluxes. None of the models consistently reproduces observed interannual variability within measurement uncertainty. Underrepresentation of variability in spring phenology, soil thaw and snowpack melting, and difficulties in reproducing the lagged response to extreme climatic events are identified as systematic errors common to all models included in this study.
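
The distinction between the magnitude and the timing of interannual variability can be illustrated with a minimal sketch. The abstract does not state which metrics were used, so the choices here are assumptions made for illustration only: magnitude is taken as the sample standard deviation of annual fluxes, and timing as the Pearson correlation between observed and modeled annual anomalies. The function names and the flux values are hypothetical and are not data from the study.

```python
import math


def anomalies(annual_flux):
    """Return each year's departure from the multi-year mean."""
    mean = sum(annual_flux) / len(annual_flux)
    return [x - mean for x in annual_flux]


def iav_magnitude(annual_flux):
    """Sample standard deviation of annual fluxes (assumed magnitude metric)."""
    a = anomalies(annual_flux)
    return math.sqrt(sum(x * x for x in a) / (len(a) - 1))


def iav_timing(observed, modeled):
    """Pearson correlation of annual anomalies (assumed timing metric)."""
    ao, am = anomalies(observed), anomalies(modeled)
    num = sum(o * m for o, m in zip(ao, am))
    den = math.sqrt(sum(o * o for o in ao) * sum(m * m for m in am))
    return num / den


def mean_bias(observed, modeled):
    """Difference between modeled and observed multi-year means."""
    return sum(modeled) / len(modeled) - sum(observed) / len(observed)


# Synthetic annual net CO2 exchange values in arbitrary units.
# These numbers are invented for illustration; they are not measurements.
observed = [-200.0, -150.0, -250.0, -100.0, -300.0]
modeled = [-350.0, -250.0, -450.0, -400.0, -300.0]

print("observed magnitude:", iav_magnitude(observed))
print("modeled magnitude: ", iav_magnitude(modeled))
print("timing correlation:", iav_timing(observed, modeled))
print("bias in annual mean:", mean_bias(observed, modeled))
```

In this constructed case the modeled series has the same spread as the observed series but places its anomalous years differently and carries an offset in the mean, which is the pattern of behavior the abstract describes: matching magnitude, mismatched timing, and biased annual means.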