Terrestrial biosphere models can help identify physical processes that control carbon dynamics, including land–atmosphere CO2 fluxes, and have great potential to predict the terrestrial ecosystem response to changing climate. The skill of models that provide continental-scale carbon flux estimates, however, remains largely untested. This paper evaluates the performance of continental-scale flux estimates from 17 models against observations from 36 North American flux towers. Fluxes extracted from regional model simulations were compared with co-located flux tower observations at monthly and annual time steps. Site-level model simulations were used to help interpret sources of the mismatch between the regional simulations and site-based observations. On average, the regional model runs overestimated the annual gross primary productivity (5%) and total respiration (15%), and they significantly underestimated the annual net carbon uptake (64%) during the period 2000–2005. Comparison with site-level simulations implicated choices specific to regional model simulations as contributors to the gross flux biases, but not to the net carbon uptake bias. The models performed best at simulating carbon exchange at deciduous broadleaf sites, likely because a number of models used prescribed phenology to simulate seasonal fluxes. The models did not perform as well for crop, grass, and evergreen sites. The regional models matched the observations most closely in terms of seasonal correlation and seasonal magnitude of variation, but they had very little skill at interannual correlation and minimal skill at interannual magnitude of variability. The comparison of site-level vs. regional model runs demonstrated that (1) the interannual correlation is higher for site-level model runs, but the skill remains low; and (2) the underestimation of year-to-year variability for all fluxes is an inherent weakness of the models.
The best-performing regional models that did not use flux tower calibration were CLM-CN, CASA-GFEDv2, and SIB3.1. Two empirical models calibrated with flux tower data, EC-MOD and MOD17+, performed as well as the best process-based models. This suggests that (1) empirical, calibrated models can perform as well as complex, process-based models and (2) combining process-based model structure with relevant constraining data could significantly improve model performance.
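
The three kinds of skill discussed above (mean bias, correlation, and magnitude of variability) can be illustrated with a minimal sketch. The function names and the toy flux values are hypothetical, and the formulas are common textbook choices (percent bias of the mean, Pearson correlation, ratio of standard deviations). They are assumptions made for illustration and are not necessarily the exact metrics used in this study.

```python
import math

def percent_bias(model, obs):
    """Percent difference between mean modeled and mean observed flux.
    Positive values mean the model overestimates the observed mean."""
    mean_model = sum(model) / len(model)
    mean_obs = sum(obs) / len(obs)
    return 100.0 * (mean_model - mean_obs) / abs(mean_obs)

def _sums_of_squares(model, obs):
    """Return (covariance sum, model sum of squares, obs sum of squares)."""
    n = len(obs)
    mean_model = sum(model) / n
    mean_obs = sum(obs) / n
    cov = sum((m - mean_model) * (o - mean_obs) for m, o in zip(model, obs))
    ss_model = sum((m - mean_model) ** 2 for m in model)
    ss_obs = sum((o - mean_obs) ** 2 for o in obs)
    return cov, ss_model, ss_obs

def correlation(model, obs):
    """Pearson correlation between modeled and observed time series."""
    cov, ss_model, ss_obs = _sums_of_squares(model, obs)
    return cov / math.sqrt(ss_model * ss_obs)

def variability_ratio(model, obs):
    """Ratio of modeled to observed standard deviation.
    Values below 1 mean the model underestimates the variability."""
    _, ss_model, ss_obs = _sums_of_squares(model, obs)
    return math.sqrt(ss_model / ss_obs)

# Toy annual fluxes (arbitrary units), invented for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
modeled = [2.0, 4.0, 6.0, 8.0]

bias = percent_bias(modeled, observed)
r = correlation(modeled, observed)
ratio = variability_ratio(modeled, observed)
```

Applied to monthly series, the same functions would describe seasonal skill; applied to annual totals, interannual skill. In this toy case the model tracks the observed pattern perfectly but doubles both the mean and the variability.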