Thanks to a comment on this post, I’ve had a chance to read this interesting article from Stephen Johnson of Bristol-Myers Squibb, entitled “The Trouble with QSAR (Or How I Learned to Stop Worrying And Embrace Fallacy)”. (As a side note, it’s interesting to see that people still make references to the titling of *Dr. Strangelove*. I’ve never met Johnson, but I’d gather from that title that he can’t be much younger than I am).

The most arresting part of the article is the graph found in its abstract. No mention is made of it in the text, but none has to be. It’s a plot of the US highway fatality rate versus the tonnage of fresh lemons imported from Mexico, and I have to say, it’s a pretty darn straight line. I’ve seen a *lot* shakier plots used to justify some sweeping conclusions, and if those were justified, well, then I’m forced to conclude that Mexican lemons have improved highway safety a great deal. The vitamin C, maybe? The fragrance? Bioflavonoids?

None of the above, of course. Correlation, tiresomely, once again refuses to imply causation, even when you ask it nicely. And that’s the whole point of the article. QSAR, for those outside the business, stands for Quantitative Structure-Activity Relationship(s), an attempt to rationalize the behavior of a series of drug candidate compounds through computational means. The problem is, there are plenty of possible variables (size, surface area, molecular weight, polarity, solubility, charge, hydrogen bond donors and acceptors, and as many structural representation parameters as you can stand). As Johnson notes dryly:

“With such an infinite array of descriptions possible, each of which can be coupled with any of a myriad of statistical methods, the number of equivalent solutions is typically fairly substantial.”

That it is. And (as he rightly mentions) one of the other problems is that all these variables are discontinuous. Some region of the molecule can get larger, but only up to a point. When it’s too large to fit into the binding site any more, activity drops off steeply. Similarly, the difference between forming a crucial hydrogen bond and not forming one is a big difference, and it can be realized by a very small change in structure and properties. (Thus the “magic methyl” effect).

But that’s not the whole problem. Johnson takes many of his fellow computational chemists to task for what he sees as sloppy work. Too many models are advanced just because they’ve shown some (limited) correlations, and they’re not tested hard enough afterwards. Finding a model with a good “fitness score” becomes an end in itself:

“We can generate so many hypotheses, relating convoluted molecular factors to activity in such complicated ways, that the process of careful hypothesis testing so critical to scientific understanding has been circumvented in favor of blind validation tests with low resulting information content. QSAR disappoints so often, not only because the response surface is not smooth but because we have embraced the fallacy that correlation begets causation.”

A similar declining straight-line plot can be made between ambient lead levels and time. A nearly identical plot can be made for the decline in college board scores over time (before they were normalized upward to improve educators’ and students’ self-esteem). Clearly, then, lead makes us smarter.

And here I thought I was unusual for finding this cartoon so funny.

http://xkcd.com/552/

Zz

I have seen that one. There are indeed many problems with QSAR, including overfitting and mistaking correlation for causation. Here are two similar but somewhat more detailed and engaging articles from Arthur Doweyko, also from BMS:

1. QSAR: Dead or Alive? J Comput Aided Mol Des (2008) 22:81–89. DOI 10.1007/s10822-007-9162-7

2. Is QSAR Relevant to Drug Discovery? IDrugs (2008) 11(12):894–899

In one of these, Doweyko cites the correlation between the number of breeding storks and the number of new births in Germany. Another more subtle but still obvious example is the correlation between the number of executions in the US and the decline in the US population growth rate. You need to know the physical basis of a correlation in order to distinguish correlation from causation. A reasonable and sane computational chemist will usually know the problems with QSAR well and will interpret models judiciously.

Reminds me of

http://en.wikipedia.org/wiki/Flying_Spaghetti_Monster#Pirates_and_global_warming

Hit the graph at the right.

@ Wavefunction

The example with storks and babies in Germany is a really poor example: since everybody knows that in Germany babies are delivered by storks, there is not only a correlation but also a causation in this case.

Fewer storks, fewer babies. Simple!

Mexican lemons —> cheaper car air fresheners —> happier drivers.

See? It works! I are a genius.

Mexican lemon truck drivers in the US drive slowly to comprehend highway signs because English is their second language. Everyone slows down as a result, leading to fewer fatalities.

But what about pirates? Any correlation to the number of pirates?

As long as it’s testable, I’m happy to use any metric that I can vary predictably by changing the structure to try to understand the activity of my compounds. But when you start getting into numbers based on some black-box multicomponent analysis that isn’t readily derivable in the real world, you’ve lost me.

What is this “slowing down to read signs because English is your second language” bit? One, no one where I live actually bothers to read signs – if their exit is nearby, they simply cut over however many lanes are needed to get to their exit and ignore what might be in their way (without any of this pesky foresight stuff), while speed limit, nearby-lane-closing, and no-U-turn signs are roundly ignored. (And turn signals are only for when police are around or for when you cut someone off.) Two, truck drivers where I live don’t slow down – they seem to figure it’s everyone else’s job to stay out of their way, and if they need to go somewhere they (usually) just signal and move, with whatever is in their way being irrelevant. (I would also assume Mexican trucking companies are less likely than American ones to have satellite/GPS truck monitors, so speeding to meet their (probably optimistic) schedules is likely.)

For these sorts of exercises, I think it’s important to realize what the p-value or R^2 truly means. Correct me if I’m wrong, but for the lemon chart, R^2 = 0.97 means there is a 3% chance the trend is due to pure random chance (or ~1 in 30). Sounds pretty good? Not if you have a modeler who spends his entire day looking for trends. That would mean if he looked at 30 possible trends, he would’ve found at least one with an R^2 of 0.97 by pure chance. If he spends all week, he could probably find one with an R^2 of 0.995.
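Whatever the right name for that number is, the trend-hunting effect itself is real and easy to simulate. A quick sketch (entirely my own; the series and counts are made up) of a "modeler" who scans many unrelated random series against one fixed five-point trend and keeps only the best-looking fit:

```python
import random

def r_squared(xs, ys):
    """Square of the Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

random.seed(0)
fatality_rate = [16.0, 15.5, 15.3, 15.0, 14.7]  # stand-in for a 5-point trend

# Scan 1,000 random "descriptor" series against the same trend and
# keep only the strongest-looking correlation.
best = max(
    r_squared([random.gauss(0, 1) for _ in fatality_rate], fatality_rate)
    for _ in range(1000)
)
print(best)  # a strong-looking R^2 found by pure chance
```

With only five points per series, a day's worth of scanning reliably turns up an impressive-looking straight line out of pure noise.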

This sloppy way of thinking is (sadly) not confined to QSAR. It seems that the entire field of modern epidemiology is dedicated to finding correlations: coffee consumption vs. cancer, meat intake vs. impotence, etc. No biological plausibility enters into these papers; the correlations are presented along with the rubber-stamp sentence “More research into the possible reasons for this relationship is needed” (read: give me more grant money). These types of stories contribute to the poor public perception of science.

tyrosine:

Incorrect. You’re thinking of a P-value. R^2 is simply the square of the correlation coefficient, R, which can be positive or negative. It’s a measurement of a model’s predictive power.

tyrosine, in particular:

There’s a data-mining element to this (we aren’t given a p-value here, but I bet it’s not terrible), but the bigger thing that jumps out from this particular case is cointegration, where two variables that are independently following trends over a period of time will therefore be correlated. Note the year labels on the points; data-mining would become the more plausible explanation if, reading from upper left to lower right, they went 1999, 1996, 1998, 2000, 1997 or something. The other examples given in these comments — lead versus test scores, pirates versus anything — are also cointegration issues rather than people simply having dug through a bunch of data to find out that ambient lead correlates with higher test scores.
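The cointegration point is easy to see with two made-up series that each trend over time for completely unrelated reasons (a sketch of my own, not data from the post or the papers):

```python
import random

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
years = range(30)
# Both series decline over the same period, each for its own reasons.
ambient_lead = [100.0 - 3.0 * t + random.gauss(0, 5) for t in years]
test_scores = [1000.0 - 4.0 * t + random.gauss(0, 8) for t in years]

print(corr(ambient_lead, test_scores))  # strongly positive, no causal link
```

Any two variables dominated by time trends over the same window will correlate this way, which is exactly why the year-labeled lemon plot is suspicious on its face.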

“3% chance the trend is due to pure random chance”

That’s a p-value. Calculating that for a particular model may be non-trivial.

Aren’t there a crapload of pirates around? Try sailing a luxury yacht by Somalia or in some waters around Malaysia.

You would be surprised at how many scientists believe in QSAR models, and yet how few examples there are of QSAR models that have actually been of real use in a PROspective fashion.

Clustering in the training set is a real problem in QSAR because it can trick you into thinking that you are interpolating when in fact you are extrapolating.

What is it about QSAR modelers and their humorous paper titles? I was at a cheminformatics conference in the Netherlands last year and one of the QSAR talks was entitled ‘QSAR modeler seeks meaningful relationship’ – one of the best titles I’ve seen.

They may not convince us they’re right, but at least they can be funny about it.

How about the opposite situation?

Take a look at a currently approved manuscript in BMCL from AZ (Defining optimum lipophilicity and molecular weight ranges for drug candidates—Molecular weight dependent lower log D limits based on permeability).

There is data all over the Papp vs. LogD graph, with an R^2 = 0.12… and that does not stop the author from drawing conclusions…

I am wondering: what is the real physical meaning of this?

Can anyone help me understand it?

Lost in NJ,

TX Raven

It’s always good to see the statistics shown up, but the big question is what the power calculation would look like for that: just how many lemon trucks / accidents would be needed to judge the relevance?

FWIW, The modelers here are pretty careful at checking out QSAR methods reported in the literature — they’ve set up a streamlined way to do this — and it is nearly always the case that the correlations don’t hold up to more rigorous testing. Them molecules is sneaky.

I think that people are misinterpreting Stephen and Arthur’s messages. The data they present in their papers is exactly why creating linear models, especially in the presence of a limited number of cases, is a fool’s errand.

The probability of finding a chance correlation, especially in situations which are not linear by nature and where there are few cases (i.e. short wide data tables), is so great that it’s not even worth trying. It’s also why non-linear methods like Forest of Trees, SVM, Bayes, and kNN methods, paired with descriptor-selecting methods, have become state of the art.
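The short-wide-table problem can be demonstrated numerically: with more descriptors than compounds, a linear model can fit pure noise exactly. A sketch of my own (the table dimensions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_compounds, n_descriptors = 8, 50  # a "short wide" data table

X = rng.normal(size=(n_compounds, n_descriptors))  # random "descriptors"
y = rng.normal(size=n_compounds)                   # random "activity"

# Least-squares fit: with 50 free coefficients and only 8 compounds,
# the training-set fit is exact no matter what y is.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ coef
print(np.abs(residual).max())  # essentially zero: a "perfect" model of noise
```

The training R^2 here is 1.0 by construction, and the model means nothing at all, which is why descriptor selection and honest external validation matter so much more than fit statistics.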

The lack of interpretability is the weakness of these methods, and perhaps part of the motivation for building traditional QSAR models, but I’d rather have a correct model that’s a black box than an incorrect one that I think I understand. Unfortunately, convincing the medchemists to trust the models is harder when there’s no visual or intuitive component.

Perhaps the term QSAR should be laid to rest, and replaced with MLSAR (Machine Learning SAR). This would change the mindset.

@19: r^2 is the strength of a relationship; p is how well you’ve established it. Even a very weak relationship can be established with high probability if you have enough data points, and that’s what Fig 2 shows us. The weakness of the correlation means that predictability is poor. As a direct consequence of this poor predictive power, the author suggests not bothering with most of the Lipinski-rule-type criteria and just looking at MW and lipophilicity (Fig 4). (At least, that’s what I get from a quick skim.)
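That r²-versus-p distinction is easy to illustrate with made-up numbers (my own sketch): a genuinely weak relationship, firmly established by a large sample.

```python
import math
import random

random.seed(2)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.1 * xi + random.gauss(0, 1) for xi in x]  # weak dependence on x

mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = sxy / math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))

# Fisher-transform z-score: with n this large, even r ~ 0.1 is
# overwhelmingly "significant" despite near-zero predictive power.
z_score = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
print(r * r, z_score)  # tiny r^2, huge z
```

A relationship can be real beyond any statistical doubt and still be useless for predicting any individual compound.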

Saw that photo on Facebook. I was wondering where it came from. Totally awesome.

FWIW, setting R^2 = 0.965 to be as generous as possible with the rounding, the p-value for a one-sided test against zero correlation is 0.0004 (Fisher rho-to-z transform and Gaussian approximation). If you generated 1,644 bivariate Gaussian data sets with N = 5 and rho = 0 you’d have a 50% chance of getting one with a stronger correlation.
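For the curious, that calculation can be reproduced in a few lines (my own sketch of the stated Fisher rho-to-z transform with the Gaussian approximation):

```python
import math

# Reproducing the stated calculation: R^2 = 0.965 with N = 5 points,
# one-sided test against zero correlation.
r = math.sqrt(0.965)
n = 5
z = 0.5 * math.log((1 + r) / (1 - r))         # Fisher rho-to-z transform
se = 1.0 / math.sqrt(n - 3)                   # approximate standard error
p = 0.5 * math.erfc((z / se) / math.sqrt(2))  # one-sided Gaussian tail

print(round(p, 4))       # ~0.0004, as stated
print(math.log(2) / p)   # ~1,600-odd random data sets for a 50% shot
```

The second number follows because the chance of *no* stronger correlation in k independent tries is (1 − p)^k, which hits 50% near k = ln(2)/p.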

In addition to the cointegration explanation, another obvious cheat is the fact that only 5 data points are plotted — many more years of data on both variables are likely available.

(P-values are lousy as measures of statistical evidence — see the work of Richard Royall for more.)

The notion of correlation and causation can be used to highlight this graph a bit. If someone says that reliable predictions state that Mexican lemon imports will increase 12% you may argue that automotive insurance stocks are a good buy because their claims will go down. Tell people the gov’t should subsidize Mexican lemon imports to improve road safety and you’ll…well…I guess you’ll still find people who will agree. They’re probably the only people still owning insurance stocks.

Steve and I went to school together at PSU; he knows what he is talking about. But I’m not ready to throw the baby out with the bathwater just yet. Countless times at my old workplace, we had fast synthesis cycle times. This allowed us to really “validate” our QSAR models. For the most part, we performed quite well on “new” compounds not seen by the models, and our med chemists were not reluctant to use our models. We just didn’t get around to publishing the models in scientific journals.

I definitely wouldn’t throw the baby out with the bathwater either. IMHO, if you are not looking for trends between chemical descriptors and biological activity, you are not doing your job as a medicinal chemist. If a QSAR model fails to be predictive, it is not due to misinterpreting causation; it is because you are not using the correct descriptor or combination of descriptors (PLS regression).

I have noticed that chemists will often apply QSAR (without physically plotting the relationship) without realizing they are using it.

I find it amusing when chemists notice a trend between a functional group property (lipophilicity, etc.) and activity, make the appropriate compound, and mock QSAR at the same time.

Like protein crystal structures, a QSAR relationship is a model. And much like structure based drug design, QSAR models are not always predictive. If I had a dollar for every compound a modeler “designed” that was a dud, well…

Coming from far outside the field as I do, I made a few guesses about QSAR’s acronymic expansion before the term was defined.

I’d come up with “quodlibet search at random”, and the more I read, the more I think it fits!

Formulating trends between biological activity and structure (defined by physicochemical descriptors) and designing new compounds based on the trend is random?

“Myself”, keep reading.

OK, I’m a little late to this game, but I’ve got the answer. The original graph is inverted. The x-axis should be fatalities, the y-axis imported lemons. The causation is then obvious: the lower the fatality rate, the more lemons will be imported. More people, more margaritas consumed, more lemons imported. QED.
