There is currently great interest in comparing protein–ligand docking programs. A review of recent comparisons shows that it is difficult to draw conclusions of general applicability. Statistical hypothesis testing is required to ensure that differences in pose-prediction success rates and enrichment rates are significant. Numerical measures such as root-mean-square deviation need careful interpretation and may profitably be supplemented by interaction-based measures and visual inspection of dockings. Test sets must be of appropriate diversity and of good experimental reliability. The effects of crystal-packing interactions may be important. The method used for generating starting ligand geometries and positions may have an appreciable effect on docking results. For fair comparison, programs must be given search problems of equal complexity (e.g. binding-site regions of the same size) and approximately equal time in which to solve them. Comparisons based on rescoring require local optimization of the ligand in the space of the new objective function. Re-implementations of published scoring functions may give significantly different results from the originals. Ostensibly minor details in methodology may have a profound influence on headline success rates. Proteins 2005. © 2005 Wiley-Liss, Inc.
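The point about statistical hypothesis testing can be made concrete. When two docking programs are run on the same test set with a shared success criterion (e.g. top-ranked pose within 2 Å RMSD of the crystal pose), the outcomes are paired, and McNemar's exact test on the discordant complexes is one appropriate choice. The sketch below is illustrative only; the counts are hypothetical, not data from the paper, and the test choice is this editor's assumption rather than the authors' prescription.

```python
# Illustrative sketch: McNemar's exact test for paired pose-prediction
# success rates of two docking programs on the same test set.
# All counts below are hypothetical examples, not data from the paper.
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant counts.

    b: complexes where program A succeeds and program B fails
    c: complexes where program B succeeds and program A fails
    (Concordant complexes carry no information about the difference.)
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0 (no difference), each discordant outcome is 50/50:
    # exact binomial tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical example: of 100 shared complexes, A alone succeeds on 12,
# B alone on 4. A headline success-rate gap of 8 percentage points may
# still not be statistically significant.
print(f"p = {mcnemar_exact(12, 4):.3f}")
```

Note that the test conditions only on the discordant complexes; two programs with very different headline success rates can fail to differ significantly if few complexes discriminate between them, which is one reason raw success-rate comparisons mislead.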