• Algorithmic complexity;
• Compound selection;
• Dissimilarity selection;
• Random screening;
• Similarity coefficient


Current algorithms for the selection of a set of n dissimilar molecules from a dataset of N molecules have an expected time complexity of O(n²N). This paper describes an improved algorithm that has an expected time complexity of O(nN) and that will identify exactly the same set of molecules as the normal algorithm if the cosine coefficient is used for the calculation of the inter-molecular (dis)similarities. The algorithm is applicable to any type of representation that characterises a molecule by a set of attribute values and to any procedure that involves calculating a sum of inter-molecular similarities. It is also both more effective and more efficient than our implementation of a genetic algorithm for the selection of maximally-dissimilar sets of molecules.
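The abstract does not spell out how the speed-up is obtained, so the following is only an illustrative sketch resting on one assumption: that the saving comes from the fact that a sum of cosine coefficients between a molecule and a set of molecules equals the dot product of the molecule's unit-length vector with the sum of the set's unit-length vectors. Under that assumption, a greedy maximum-dissimilarity selection can keep a running sum vector instead of re-summing pairwise similarities at every step. The function names (`select_naive`, `select_centroid`), the choice of the first molecule, and the greedy "minimum summed similarity" criterion are all hypothetical choices made for illustration, not details taken from the paper.

```python
import math
import random


def unit(v):
    """Scale an attribute vector to unit length (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_naive(vectors, n, first=0):
    """Greedy selection; each step re-sums the cosine similarity between
    every candidate and every already-selected item (O(n^2 N) similarities)."""
    u = [unit(v) for v in vectors]
    chosen = [first]
    remaining = [i for i in range(len(u)) if i != first]
    while len(chosen) < n:
        best = min(remaining,
                   key=lambda i: sum(dot(u[i], u[j]) for j in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def select_centroid(vectors, n, first=0):
    """Same greedy criterion, but the summed similarity is obtained from a
    single dot product with the running sum of the selected unit vectors
    (O(nN) dot products)."""
    u = [unit(v) for v in vectors]
    chosen = [first]
    remaining = [i for i in range(len(u)) if i != first]
    total = list(u[first])  # running sum of the selected unit vectors
    while len(chosen) < n:
        best = min(remaining, key=lambda i: dot(u[i], total))
        chosen.append(best)
        remaining.remove(best)
        total = [t + x for t, x in zip(total, u[best])]
    return chosen


if __name__ == "__main__":
    # Synthetic attribute vectors stand in for molecular descriptors here.
    rng = random.Random(42)
    data = [[rng.random() for _ in range(8)] for _ in range(200)]
    print(select_naive(data, 10) == select_centroid(data, 10))
```

Both functions apply the same selection criterion, so they are expected to return the same subset apart from ties that differ only by floating-point rounding; the second simply avoids recomputing each pairwise similarity. The identity relies on the cosine coefficient being a dot product of normalised vectors, which is consistent with the abstract's restriction of the exact-equivalence claim to that coefficient.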