Organic Chemistry as a Language and the Implications of Chemical Linguistics for Structural and Retrosynthetic Analyses


This work was supported by the Non-equilibrium Energy Research Center, which is an Energy Frontier Research Center funded by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences under Award Number DE-SC0000989. We thank Prof. Maciej Eder and Prof. Rafal Gorski from the Institute of the Polish Language, Polish Academy of Sciences, for helpful comments.


Methods of computational linguistics are used to demonstrate that a natural language such as English and organic chemistry have the same structure in terms of the frequency of, respectively, text fragments and molecular fragments. This quantitative correspondence suggests that the methods of computational corpus linguistics can be extended to the analysis of organic molecules. Within organic molecules, the bonds with the highest information content are shown to be those that 1) define repeat/symmetry subunits and 2) in asymmetric molecules, define the loci of potential retrosynthetic disconnections. Linguistics-based methods thus appear well suited to analyzing complex structural and reactivity patterns within organic molecules.
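As a rough illustration of the fragment-frequency comparison described above, the following Python sketch counts fragments in two toy corpora, ranks them by frequency, and assigns each an information content of -log2(p). It is not the procedure used in this work: contiguous substrings of SMILES strings stand in for molecular fragments, which is a crude assumption, and the corpora and the function names `ngram_counts`, `rank_frequency`, and `information_content` are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def ngram_counts(items, n):
    """Count every contiguous length-n substring across all strings in items."""
    counts = Counter()
    for s in items:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, fragment, count) triples, most frequent first.

    Ties in count are broken alphabetically so the ordering is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, frag, c) for rank, (frag, c) in enumerate(ordered, start=1)]


def information_content(counts):
    """Map each fragment to its self-information -log2(p), in bits."""
    total = sum(counts.values())
    return {frag: -log2(c / total) for frag, c in counts.items()}


# Toy corpora (hypothetical, for illustration only).
english = ["the cat sat on the mat", "the dog ate the bone"]
# Assumption: substrings of SMILES strings approximate molecular fragments.
smiles = ["CCO", "CCN", "CCOC", "CC(=O)O"]

text_counts = ngram_counts(english, 3)
mol_counts = ngram_counts(smiles, 2)

text_ranked = rank_frequency(text_counts)
mol_ranked = rank_frequency(mol_counts)
mol_info = information_content(mol_counts)

if __name__ == "__main__":
    for rank, frag, count in mol_ranked:
        print(f"{rank:2d}  {frag!r:6}  count={count}  bits={mol_info[frag]:.2f}")
```

In this sketch, rare fragments carry more bits than common ones, which is the sense in which a bond or fragment can be said to have high information content. A realistic analysis would need a large corpus and a chemically meaningful fragment definition (for example, substructures obtained with a cheminformatics toolkit) in place of raw substrings.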