In this study, six hydrological models that differ only in their conceptual geological models are established for a 465 km² area. The performance of the six models is evaluated in differential split-sample tests against a unique data set with well-documented groundwater head and discharge data for different periods with different groundwater abstractions. The calibration results of the six models are comparable, with no model being superior to the others. However, the six models make very different predictions of changes in groundwater head and discharge in response to changes in groundwater abstraction. This confirms the utmost importance of the conceptual geological model for making predictions of variables and conditions beyond the calibration situation. In most cases, the observed changes in hydraulic head and discharge are within the range of the changes predicted by the six models, implying that a multiple modeling approach can be useful in obtaining more robust assessments of likely prediction errors. We conclude that the use of multiple models appears to be a good alternative to traditional differential split-sample schemes. A model averaging analysis shows that model weights estimated from model performance in the calibration or validation situation are in many cases not optimal for making other predictions. Hence, the critical assumption that is always made in model averaging, namely that the model weights derived from the calibration situation are also optimal for model predictions, cannot be assumed to be generally valid.
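
The model averaging step mentioned above can be illustrated with a minimal sketch. It assumes weights proportional to the inverse mean squared error of each model in the calibration period; this weighting rule, the function names, and all numbers are hypothetical choices for illustration and are not taken from the study.

```python
import numpy as np

def inverse_mse_weights(sim, obs):
    """Return one weight per model, proportional to 1/MSE against observations.

    sim: simulated values, shape (n_models, n_obs)
    obs: observed values, shape (n_obs,)
    Assumption: inverse-MSE weighting is only one of many possible schemes.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    mse = np.mean((sim - obs) ** 2, axis=1)  # calibration error per model
    inv = 1.0 / mse
    return inv / inv.sum()                   # normalise so weights sum to 1

def averaged_prediction(weights, predictions):
    """Weighted mean of the individual model predictions."""
    return float(np.dot(weights, predictions))

# Hypothetical calibration data: three models, three head observations (m).
cal_sim = np.array([[10.1, 12.0,  9.8],
                    [10.3, 11.7, 10.1],
                    [ 9.8, 12.2, 10.2]])
cal_obs = np.array([10.0, 12.0, 10.0])
w = inverse_mse_weights(cal_sim, cal_obs)

# Hypothetical predicted head changes (m) after a change in abstraction.
pred_change = np.array([-0.5, -1.5, -3.0])
print("weights:", w)
print("averaged predicted change:", averaged_prediction(w, pred_change))
```

Because the weights depend only on calibration errors, a model that fits the calibration data best dominates the average even if its predicted response to a new abstraction pattern is the least accurate, which is the situation the final sentences of the abstract caution against.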