To evaluate the utility of automated deformable image registration (DIR) algorithms, it is necessary to evaluate both the registration accuracy of the DIR algorithm itself, as well as the registration accuracy of the human readers from whom the “gold standard” is obtained. We propose a Bayesian hierarchical model to evaluate the spatial accuracy of human readers and automatic DIR methods based on multiple image registration data generated by human readers and automatic DIR methods. To fully account for the locations of landmarks in all images, we treat the true locations of landmarks as latent variables and impose a hierarchical structure on the magnitude of registration errors observed across image pairs. DIR registration errors are modeled using Gaussian processes with reference prior densities on prior parameters that determine the associated covariance matrices. We develop a Gibbs sampling algorithm to efficiently fit our models to high-dimensional data, and apply the proposed method to analyze an image dataset obtained from a 4D thoracic CT study.