I’m listening to Jean-Louis Reymond of Bern talking about the GDB data set, the massive enumerated set of possible molecules. That’s the set of chemically feasible molecules at or below a certain heavy atom count – the first iteration was GDB11 (blogged about here), and it’s since been extended to GDB13, which has nearly one billion compounds with up to 13 C, N, O, S and Cl atoms. (Note, as always, that vast heaps of poly-small-ring compounds, especially concatenations of 3-membered rings, are pre-filtered out of these sets, because otherwise they would overwhelm them completely.) They’re working now on GDB17, which is a truly huge mound of data.
I was particularly taken with the image shown (from this paper), an artificial set of compounds (with heavy atom counts of up to 500) from several main classes of real molecules. It’s a 3-D principal component analysis plot, which tunes things up to emphasize the differences, of course, and this is what chemical space looks like from this angle. There go the proteins and nucleic acids, off into their own zones, and similarly the linear alkanes and diamond-like lattices, beaming off in separate directions. In the middle are drug-like compounds – and don’t imagine for a minute that any substantial number of those have actually been prepared, either. This is where we live, all of us organic chemists.