

Levels of Data

Here’s a brief article in Science that a lot of us should keep a copy of. Plenty of journalists and investors should do the same. It’s a summary of what sorts of questions get asked of data sets, and the differences between them. There are six broad data analysis categories (a rough sketch in code follows the list):
1. Descriptive. This is the simplest case, where you’re just summarizing a data set and describing the totals in it.
2. Exploratory. The next step – you search through the descriptive analysis looking for trends or relationships, with which to develop new hypotheses. No guarantees, of course – you’ll have to confirm these with more work.
3. Inferential. This one looks at an exploratory treatment and tries to determine whether those trends are likely to hold up. As the authors say, this is probably the most common statistical workup in the literature – better than random chance, or not? But it can’t tell you why something is happening, of course.
4. Predictive. An inferential study is necessarily done on a large sample (well, it had better be, at any rate, if you’re going to infer with much confidence). A predictive analysis uses some subset of the data to predict how individual cases will go. The example from drug development would be the use of biomarkers to predict whether a given patient in a trial will respond to some new investigational drug.
5. Causal. At this level, you’re trying to see how large the changes are across the system when you start changing things – what often gets called the “tone” of the system. What are the most important variables, and what has little effect on the outcome?
6. Mechanistic. With the information at the causal level available, now you can really get down to the nuts and bolts. Change A causes effect B, through this detailed mechanism. We don’t see this as much with anything involving biology – there always seem to be exceptions. This is more the realm of engineering and physics, although a lot of time and money is going into trying to change that.
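
To make the first four levels a bit more concrete, here’s a minimal Python sketch on a completely made-up “biomarker versus response” data set. Every number and name in it is invented purely for illustration; the point is only where one level of question ends and the next begins.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
biomarker = rng.normal(50, 10, n)                   # hypothetical biomarker readout
response = (biomarker + rng.normal(0, 15, n)) > 55  # hypothetical responder flag

# 1. Descriptive: just summarize what's in the data set.
print("mean biomarker:", round(biomarker.mean(), 1),
      "responders:", int(response.sum()), "of", n)

# 2. Exploratory: look for a trend worth following up.
print("responder mean:", round(biomarker[response].mean(), 1),
      "non-responder mean:", round(biomarker[~response].mean(), 1))

# 3. Inferential: is the apparent difference better than random chance?
t, p = stats.ttest_ind(biomarker[response], biomarker[~response])
print("t =", round(t, 2), "p =", round(p, 5))

# 4. Predictive: use the biomarker to call individual patients as responders.
#    (A crude threshold rule stands in for a real model.)
predicted = biomarker > 55.0
print("accuracy of threshold rule on this sample:",
      round((predicted == response).mean(), 2))
```

The causal and mechanistic levels don’t fit into a ten-line toy like this, which is rather the point of the article.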
It’s only at the causal and mechanistic levels that you can start doing detailed modeling with confidence. That’s where everyone would like to be with computational binding predictions, but we don’t understand them well enough yet. And think how far we have to go to get predictive toxicology to those levels! We can do that sort of thing on a small scale – for example, saying that a compound that (say) inhibits angiotensin-converting enzyme, to this degree, and with that average half-life in vivo, will be expected to lower the blood pressure of X% of a random population’s members by at least Y%. That’s after decades of experience and data-gathering, keep in mind.
But that’s not aeronautical engineering. Those folks don’t tell you that wing design A will provide at least so much lift on a certain percentage of the airframes it gets bolted on to. Nope, those folks get to build their airframes to the same exact specifications, not just take whatever shows up at the factory needing wings, and those airframe/wing combinations had better perform within some very tight tolerances or something has gone seriously wrong. This is just another way of stating the “built by humans” difference I was talking about the other day.
So some of that data analysis hierarchy above is, well, aspirational for those of us doing drug research. The authors of the Science article are well aware of this themselves, saying that “Outside of engineering, mechanistic data analysis is extremely challenging and rarely achievable.” But that level is where many people expect science to be, most of the time, which leads to a lot of frustration: “Look, is this pill going to help me or not?” We should remember where we are on the scale and try to work our way up.

12 comments on “Levels of Data”

  1. Pete says:

    When exploring and modelling data it is vital to make an honest distinction between what one knows and what one believes. Our ability to make predictions based on trends in data is determined by the strength (and not the statistical significance) of those trends. Be particularly wary of any data analysis in which variation is hidden or disguised (e.g. by using standard error rather than standard deviation as a measure of spread in the data). When I see medicinal chemists presenting average values without error bars, I always wonder what their reactions would be if an in vivo biologist had the temerity to present mean values without error bars.
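
    To illustrate the standard error point with some purely simulated numbers: the standard error of the mean keeps shrinking as n grows, while the standard deviation, which is the actual spread in the data, stays put. A minimal sketch (all values invented):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    for n in (5, 50, 500):
        values = rng.normal(100, 20, n)   # hypothetical assay readouts, true SD = 20
        sd = values.std(ddof=1)           # sample standard deviation: the spread
        sem = sd / np.sqrt(n)             # standard error of the mean: shrinks with n
        print(f"n={n:4d}  mean={values.mean():6.1f}  SD={sd:5.1f}  SEM={sem:5.2f}")
    ```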

  2. Stu West says:

    One of the authors has published a pay-what-you-want ebook, The Elements of Data Analytic Style, based on the format of the classic Strunk & White English usage text, which goes into the subject in more detail. Recommended.

  3. Anonymous says:

    The most important concept for the media, non-scientists, and unfortunately some percentage of scientists to understand is this: Correlation does not equal causation. If those groups could grasp this concept, scientific reporting would be exponentially more accurate and well understood.

  4. Pete says:

    It is indeed an important point that correlation is not causation. However, in the drug-likeness area, correlation is sometimes not even correlation, in that correlations are presented in ways that make it impossible to see how strong (actually how weak) the relevant trends are. For example, correlations between predictor variables like molecular weight and aromatic ring count are probably stronger than the correlations between descriptor variables and ADME properties like solubility. That doesn’t stop the Anti-Fat Anti-Flat Movement from asserting that one descriptor variable is better than another. We see opinions ‘supported’ by statements that the trend shown in one pie chart array is clearly stronger than that shown in another pie chart array. These statements are often accompanied by much hand-wringing to the effect that marketed drugs are getting less drug-like with time, and we wonder why drug discovery appears to be getting more difficult.
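
    A small numerical illustration of how weak a “significant” trend can be (simulated data and placeholder names only; no real descriptors or measurements involved): with a few thousand points, a correlation that explains about 1% of the variance still comes with a p-value far below 0.05.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 5000
    descriptor = rng.normal(0, 1, n)               # stand-in for some molecular descriptor
    prop = 0.1 * descriptor + rng.normal(0, 1, n)  # stand-in ADME property, mostly noise

    r, p = stats.pearsonr(descriptor, prop)
    print(f"r = {r:.3f}, r^2 = {r*r:.3f}, p = {p:.2g}")
    # Typically r^2 comes out around 0.01 -- the trend explains ~1% of the variance --
    # yet p is tiny, so the correlation is "significant" but nearly useless for prediction.
    ```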

  5. Anonymous says:

    @4 Pete – @2 here. I agree. I am somewhat old school in thought. Although all the data, parameters, and modeling we can come up with are interesting, to me it is too much dogma and number crunching vs. actual compound making. I don’t care what somebody’s model says or what their software told them. I am going to go ahead and make compounds to answer questions and discover drugs. Companies spend a lot of time figuring out (ironically) how to save time in drug discovery. Perhaps we should spend more of that time at the bench instead of at a computer.

  6. diver dude says:

    What @2/5 said. In homo, veritas (unless you are a vet).

  7. Tuck says:

    Whenever I go to your site from Internet Explorer, I get something akin to the following:
    “Fatal error: Call to undefined function: str_split() in /home/corante/public_html/pipeline/connect.php(1) : regexp code(1) : eval()’d code(1) : regexp code on line 1”
    Tried it on both of my computers. Chrome works fine, but can’t use that from the office.

  8. Level -1 says:

    Or, what many practitioners of Molecular Dynamics do.
    Level -1: Bogus. You just fit your computer simulations to the available data until you get an R^2 that you think deserves publishing.
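
    A quick sketch of what “Level -1” looks like in practice, using nothing but synthetic noise: fit a flexible enough model and the training R^2 looks publishable even though there is no signal at all, and it falls apart on fresh data.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 1.0, 15)
    y = rng.normal(0, 1, x.size)          # pure noise -- nothing to model

    coeffs = np.polyfit(x, y, deg=10)     # absurdly flexible 10th-degree polynomial
    fit = np.polyval(coeffs, x)

    def r_squared(observed, predicted):
        ss_res = ((observed - predicted) ** 2).sum()
        ss_tot = ((observed - observed.mean()) ** 2).sum()
        return 1.0 - ss_res / ss_tot

    print("training R^2:", round(r_squared(y, fit), 3))         # close to 1

    y_new = rng.normal(0, 1, x.size)      # fresh noise at the same x values
    print("R^2 on new data:", round(r_squared(y_new, fit), 3))  # near zero or negative
    ```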

  9. Philip says:

    @Tuck, Here at tech support, we suggest that you upgrade to the latest version of Internet Explorer that your company will let you use.
    Somebody at your company needs to be fired. Any good tech person knows that the only use for Internet Explorer is to download a standards-compliant web browser. Sounds like your tech guys love Kool-Aid.

  10. M says:

    @3: The problem with harping on “correlation does not equal causation” is that, while true, it applies to everything. So it does nothing to help distinguish between useful and meaningless observations. The best you can hope for is that it inspires healthy skepticism, but I’ve seen it trotted out just as often as a mindless mantra when someone doesn’t like a particular conclusion.

  11. Anonymous says:

    7. Non-reproducible
    8. Fraudulent / fabricated
    9. Random noise
    10. Wishful thinking
    These data seem to make up 99% of what we see in papers these days.

  12. Sally says:

    Now, apply the above, objectively, to climate modeling.
