Structural Key Bit Occurrence Frequencies and Dependencies in PubChem and Their Effect on Similarity Searches



Little published literature exists on the 881-bit structural keys that PubChem uses to categorize and compare the compounds in its database. We characterized these structural keys by examining their frequencies of occurrence within the PubChem compound database. We also determined bit dependencies, defined as cases in which one bit is always set whenever another bit is set. We show that the vast majority of bits are rarely set and that substantial numbers of dependencies exist. Using the Tanimoto coefficient, we compared similarity searches performed with the full structural keys against searches performed with a variant from which all dependent bits had been removed, taking five United States Food and Drug Administration-approved drugs as reference compounds. These bit dependencies not only affect similarity scores but also alter the compounds returned in similarity searching. Judicious selection of bits is needed to maintain sufficient ability to differentiate related compounds.
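The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the fingerprints are toy sets of set-bit positions rather than real 881-bit PubChem keys, all function names are hypothetical, and "removing dependent bits" is assumed here to mean dropping every bit whose presence is implied by some other bit.

```python
def bit_frequencies(fingerprints, n_bits):
    """Fraction of fingerprints in which each bit position is set."""
    counts = [0] * n_bits
    for fp in fingerprints:
        for bit in fp:
            counts[bit] += 1
    return [c / len(fingerprints) for c in counts]


def bit_dependencies(fingerprints):
    """Pairs (a, b), a != b, where b is set in every fingerprint that sets a."""
    always_with = {}  # bit a -> bits present in every fingerprint containing a
    for fp in fingerprints:
        for a in fp:
            if a in always_with:
                always_with[a] &= fp
            else:
                always_with[a] = set(fp)
    return {(a, b) for a, group in always_with.items() for b in group if b != a}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared set bits divided by bits set in either."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


# Toy data set: each fingerprint is the set of bit positions that are on.
fps = [{0, 1, 2}, {0, 1}, {0, 3}, {1, 2}]

deps = bit_dependencies(fps)           # {(2, 1), (3, 0)}
implied = {b for _, b in deps}         # assumption: drop the implied bits

full = tanimoto(fps[0], fps[3])
reduced = tanimoto(fps[0] - implied, fps[3] - implied)
print(bit_frequencies(fps, 4), sorted(deps), full, reduced)
```

In this toy set, bit 2 never occurs without bit 1 and bit 3 never occurs without bit 0, so bits 1 and 0 are treated as dependent. Dropping them raises the Tanimoto score between the first and last fingerprints from 2/3 to 1.0, a small-scale version of how removing dependent bits can shift similarity scores and therefore the ranking of compounds returned by a search.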