OK, we have kind of a complicated situation today. Biogen and Eisai have press-released what appear to be positive results for an anti-amyloid antibody therapy for Alzheimer’s. Since every single other attempt in this area has failed, often at great expense, what are we to make of this?

Well, the first thing to say is that antibodies are all different. That’s why they exist. So you can’t draw a direct line from the failure of previous agents to a possible success here. But that’s about the most positive thing I have to say, because those earlier failures are not meaningless either, not by a long shot. Previous agents have demonstrated binding to amyloid, brain penetration, and even reduction of brain amyloid *in vivo* in human subjects – all without having any useful clinical effects. So if this one (BAN2401, developed earlier by Eisai in collaboration with a Swedish company, BioArctic) is showing some, then there is explaining to be done. As I understand it, it’s targeted at soluble amyloid protofibrils, but I am unaware of any direct comparisons of its binding profile with previous antibodies in the field. Biogen and Eisai must have had reason to believe that this one might be different (positive), but lots of people have thought that in Alzheimer’s research and landed hard anyway (negative).

One complication is that the companies have not announced all that much. They say that the highest dose of the antibody demonstrated statistically significant slowing of the disease as determined by ADCOMS scores and that PET imaging also showed positive results. But we have no numbers whatsoever, making it completely impossible to evaluate these claims. There’s no way to have an informed opinion on the matter without seeing the data, which will surely be shown at some future conference. Until then, all one can say is that Biogen and Eisai believe that what they have is sufficient to make a public announcement about it and will be backing up their claims at a later date.

And when the data do come out, it’s going to be complicated to evaluate. That’s because this is a Bayesian trial, as opposed to one run under the more typical “frequentist” statistical framework. Adaptive/Bayesian trial designs have been talked about for years in the business, because they have possible advantages versus the traditional trials, but they’ve been talked about a *lot* more than they’ve actually been run. Oversimplifying, a Bayesian trial has the possibility of changing the patient population being evaluated while the trial is still being run, thus the “adaptive” part. If you do that in a traditional trial without having stated up front everything you’re going to do and exactly when and why (as in a crossover design), you will have blown your chance to say anything statistically meaningful about its outcome, but not so with Bayesian statistics, if set up properly.

Now come the headaches. That’s because (as Matthew Herper notes here), the Bayesian trial design, when run for 12 months, said that the drug *failed to show* *meaningful effects*. But the companies said earlier that it was always their plan to continue out to at least 18 months and do a more conventional statistical analysis at that point, and that’s the basis of this new announcement about the high-dose group. I am absolutely not competent to evaluate how you’d do that sort of statistical-regime crossover, and I very much look forward to people who know their stuff getting a chance to look over the trial design, the data, and the results. Did the Bayesian design allow patients to move into the high-dose group if it showed more effect? In that case, why didn’t it show an overall positive result at 12 months? How many patients, then, are in the cohort that showed significance in the later conventional evaluation? Who knows? We don’t have any of the key details yet, and there are definitely more details to go over this time than usual. Until then. . .?

I’d watch to see if senior Biogen & Eisai folks are selling stock after this tease. But I’m a cynic.

Exactly my thought.

You *might* get some significance. However, effect size is not going to be that good. Given the change in analysis plans, I’m deeply skeptical.

If you try 20 antibodies, one of them will probably show significance at p=0.05. Obviously you know this already – I’m just wondering if the solution to the paradox could really be that simple.

This is NOT how p value is defined.

Kevin,

About the p value, I got the following from the CELG board on ‘investor village’ on 06-29-18, when their MDS drug gave great results in a phase-III trial, and CELG stock price rose sharply.

___ __ _

My interpretation of “highly statistically significant” means more than one zero to the right of the decimal place in the p-value lines up well with the following source:

https://www.statsdirect.com/help/basics/p_values.htm

P Values

The P value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true – the definition of ‘extreme’ depends on how the hypothesis is being tested. P is also described in terms of rejecting H0 when it is actually true, however, it is not a direct probability of this state.

The null hypothesis is usually an hypothesis of “no difference” e.g. no difference between blood pressures in group A and group B. Define a null hypothesis for each study question clearly before the start of your study.

The only situation in which you should use a one sided P value is when a large change in an unexpected direction would have absolutely no relevance to your study. This situation is unusual; if you are in any doubt then use a two sided P value.

The term significance level (alpha) is used to refer to a pre-chosen probability and the term “P value” is used to indicate a probability that you calculate after a given study.

The alternative hypothesis (H1) is the opposite of the null hypothesis; in plain language terms this is usually the hypothesis you set out to investigate. For example, question is “is there a significant (not due to chance) difference in blood pressures between groups A and B if we give group A the test drug and group B a sugar pill?” and alternative hypothesis is ” there is a difference in blood pressures between groups A and B if we give group A the test drug and group B a sugar pill”.

If your P value is less than the chosen significance level then you reject the null hypothesis i.e. accept that your sample gives reasonable evidence to support the alternative hypothesis. It does NOT imply a “meaningful” or “important” difference; that is for you to decide when considering the real-world relevance of your result.

The choice of significance level at which you reject H0 is arbitrary. Conventionally the 5% (less than 1 in 20 chance of being wrong), 1% and 0.1% (P < 0.05, 0.01 and 0.001) levels have been used. These numbers can give a false sense of security.

In the ideal world, we would be able to define a "perfectly" random sample, the most appropriate test and one definitive conclusion. We simply cannot. What we can do is try to optimise all stages of our research to minimise sources of uncertainty. When presenting P values some groups find it helpful to use the asterisk rating system as well as quoting the P value:

P < 0.05 *

P < 0.01 **

P < 0.001 ***

Most authors refer to statistically significant as P < 0.05 and statistically highly significant as P < 0.001 (less than one in a thousand chance of being wrong).

The asterisk system avoids the woolly term "significant". Please note, however, that many statisticians do not like the asterisk rating system when it is used without showing P values. As a rule of thumb, if you can quote an exact P value then do. You might also want to refer to a quoted exact P value as an asterisk in text narrative or tables of contrasts elsewhere in a report.

At this point, a word about error. Type I error is the false rejection of the null hypothesis and type II error is the false acceptance of the null hypothesis. As an aid memoir: think that our cynical society rejects before it accepts.

The significance level (alpha) is the probability of type I error. The power of a test is one minus the probability of type II error (beta). Power should be maximised when selecting statistical methods. If you want to estimate sample sizes then you must understand all of the terms mentioned here.

The following table shows the relationship between power and error in hypothesis testing:

| TRUTH | DECISION: Accept H0 | DECISION: Reject H0 |
| --- | --- | --- |
| H0 is true | correct decision, P = 1-alpha | type I error, P = alpha (significance) |
| H0 is false | type II error, P = beta | correct decision, P = 1-beta (power) |

H0 = null hypothesis

P = probability
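
To make the alpha/beta/power relationship concrete, here is a minimal sketch that computes the power of a two-sided, two-sample z-test. It is not part of the quoted StatsDirect text; the function name `z_test_power`, the use of a standardized effect size, and the assumption of equal group sizes with known, equal variances are all illustrative choices.

```python
from statistics import NormalDist


def z_test_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power (1 - beta) of a two-sided, two-sample z-test.

    effect_size is the standardized difference in means (an assumption of
    this sketch); groups are taken to be equal in size with known, equal
    variances.
    """
    std_normal = NormalDist()
    # Critical value for a two-sided test at significance level alpha
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Mean of the z statistic when the alternative hypothesis is true
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in the upper or the lower rejection region
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


if __name__ == "__main__":
    for n in (20, 64, 200):
        power = z_test_power(effect_size=0.5, n_per_group=n)
        print(f"n per group = {n:3d}  power = {power:.3f}  beta = {1 - power:.3f}")
```

With an effect size of zero the function returns alpha itself, which is the type I error cell of the table; for any nonzero effect, beta shrinks as the sample size grows.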

If you are interested in further details of probability and sampling theory at this point then please refer to one of the general texts listed in the reference section.

You must understand confidence intervals if you intend to quote P values in reports and papers. Statistical referees of scientific journals expect authors to quote confidence intervals with greater prominence than P values.

Notes about Type I error:

is the incorrect rejection of the null hypothesis

maximum probability is set in advance as alpha

is not affected by sample size as it is set in advance

increases with the number of tests or end points (i.e. do 20 rejections of H0 and 1 is likely to be wrongly significant for alpha = 0.05)

Notes about Type II error:

is the incorrect acceptance of the null hypothesis

probability is beta

beta depends upon sample size and alpha

can't be estimated except as a function of the true population effect

beta gets smaller as the sample size gets larger

beta gets smaller as the number of tests or end points increases

While that’s not exactly how a p-value is defined, the expected number of false positives from 1/p_threshold trials is 1. In frequentist tests, p is the probability of getting a result equal to or more extreme than a given value under the null hypothesis, in this case that the drug does nothing. By applying a p threshold, you turn results under the null hypothesis into a binomial with p = p_threshold for a false positive (type I error), so the expected number over n independent trials is np. Hence the need for multiple-comparisons correction when doing multiple hypothesis tests, and hence people like John Ioannidis saying there’s a crisis in research reproducibility and calling for stricter p-thresholds in research (which I think doesn’t really fix that problem and introduces its own).
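
The arithmetic behind the "20 antibodies" remark can be sketched in a few lines. This assumes every test is independent and that the null hypothesis is true for all of them (every drug does nothing); the function names are made up for illustration.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Mean of a Binomial(n_tests, alpha): expected type I errors under the null."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance that at least one of n independent null tests crosses the threshold."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    n_tests, alpha = 20, 0.05
    print(expected_false_positives(n_tests, alpha))
    print(round(prob_at_least_one_false_positive(n_tests, alpha), 3))
```

So with 20 ineffective drugs tested at 0.05, the expected number of false positives is 1, but the chance of seeing at least one is about 64%, not a certainty.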

Presumably Biogen will be hoping to elaborate a bit at AAIC in about two weeks.

p-Value is defined as the probability to get at least the same outcome, assuming that the Null hypothesis is true.

Thus a p-Value of 0.05 means that there is a 5% probability to get at least the same outcome, assuming that the Null hypothesis is true…

… which means that you’d need to run the same experiment 1/0.05 = 20 times on average to get at least the same outcome assuming that the Null hypothesis is true.

In other words, as long as we keep testing a crappy (untrue) hypothesis, we’re bound to get a “statistically significant” positive result sooner or later.

That’s why we must set the bar higher (with a lower p-value) each time we run a new experiment, to account for multiple hypothesis testing.

For the amyloid hypothesis, I would not be convinced unless the p-Value is less than 0.05%(0.0005), just to take into account the number of times we’ve tested it in clinical trials now!
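
One simple way to "set the bar higher" is a Bonferroni correction, which divides the overall significance level by the number of tests. The sketch below is only an illustration: the count of 100 tests is a hypothetical figure chosen because it reproduces the 0.0005 threshold mentioned above, not a tally of actual amyloid trials.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate <= alpha."""
    return alpha / n_tests


def family_wise_error_rate(per_test_threshold: float, n_tests: int) -> float:
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - per_test_threshold) ** n_tests


if __name__ == "__main__":
    n_tests = 100  # hypothetical number of prior tests of the hypothesis
    threshold = bonferroni_threshold(0.05, n_tests)
    print(threshold)
    print(round(family_wise_error_rate(threshold, n_tests), 4))
```

Bonferroni is conservative, and whether separate clinical trials of different antibodies should be pooled into one "family" of tests at all is a judgment call rather than a settled rule.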

So true! That’s how you use a p value…

This is a reason for multiple independent trials, passing p<=0.05 twice is p<=0.0025, or 0.25%. If you've found something you think works then sufficiently powered replication studies (doing your replication with 10% power proves nothing either way) are a strong confirmation.
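
Both halves of that point can be put into numbers with a rough Bayes-rule sketch. The 10% prior probability that the drug really works is a made-up figure for illustration, and the function names are hypothetical.

```python
def joint_false_positive_rate(alpha: float, n_trials: int) -> float:
    """Chance that n independent trials of an ineffective drug all pass at alpha."""
    return alpha ** n_trials


def prob_true_after_positive(prior: float, power: float, alpha: float) -> float:
    """P(drug works | trial positive), by Bayes' rule."""
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)


def prob_true_after_negative(prior: float, power: float, alpha: float) -> float:
    """P(drug works | trial negative), by Bayes' rule."""
    false_neg = prior * (1 - power)
    true_neg = (1 - prior) * (1 - alpha)
    return false_neg / (false_neg + true_neg)


if __name__ == "__main__":
    prior = 0.10  # hypothetical prior belief that the drug works
    print(joint_false_positive_rate(0.05, 2))
    for power in (0.80, 0.10):
        pos = prob_true_after_positive(prior, power, 0.05)
        neg = prob_true_after_negative(prior, power, 0.05)
        print(f"power={power:.2f}  after positive={pos:.3f}  after negative={neg:.3f}")
```

Under these assumptions, a positive replication at 80% power moves a 10% prior to about 64%, while at 10% power a positive result only reaches about 18% and a negative one leaves it near 9.5%, which is barely different from where it started.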

(And of course, we're not exactly directly testing the amyloid hypothesis, but your point does need to be borne in mind when looking at promising early amyloid antibody trials these days.)

The big question is whether the Bayesian trial design translates to a treatment paradigm that is acceptable to FDA down the road. Intuitively, it might make sense to come up with trial designs that present a better model of a chronic treatment of much longer duration than the trial itself over the common “frequentist” approaches. I’m curious to find out whether any therapeutics tested this way in the clinic have ever gained FDA approval.

Regardless, the common failures of the anti-amyloid therapies make it most likely that these therapies do too little too late for an irreversible pathology (to be precise: not to say there are no disease modifying therapies to discover, but anti-amyloid/BACE inhibition is insufficient albeit may be necessary)

Everyone likes to point to Bayesian analysis being an advantage, more efficient, more robust interpretation etc…, but if one were to take a pessimistic prior (it will not work), then more, not less information will be needed to push the posterior belief into the positive treatment realm. Whilst I’m not familiar with the details of the trial, a Bayesian updating for 12 months of data across doses, followed by dose selection of the “most likely” best dose for expansion using probability criteria, and a final frequentist analysis at the end is pretty standard.

A Bayesian analysis of the data at 18 months based on a pessimistic prior would be a more robust approach, given the past history of mAbs in AD. That won’t happen.
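
As a rough illustration of what a pessimistic prior does, here is a normal-normal conjugate update. Every number is invented (the trial data have not been released), the effect is scaled so that positive values mean benefit, and this is a textbook simplification rather than the trial's actual model.

```python
from statistics import NormalDist


def posterior_normal(prior_mean, prior_sd, estimate, std_error):
    """Combine a normal prior with a normal likelihood (known standard error)."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / std_error ** 2
    post_var = 1 / (prior_precision + data_precision)
    # Posterior mean is the precision-weighted average of prior and data
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, post_var ** 0.5


def prob_benefit(prior_mean, prior_sd, estimate, std_error):
    """Posterior probability that the true effect is greater than zero."""
    mean, sd = posterior_normal(prior_mean, prior_sd, estimate, std_error)
    return 1 - NormalDist(mean, sd).cdf(0.0)


if __name__ == "__main__":
    estimate, std_error = 0.30, 0.15  # hypothetical observed effect and its SE
    vague = prob_benefit(0.0, 10.0, estimate, std_error)     # nearly flat prior
    skeptical = prob_benefit(0.0, 0.1, estimate, std_error)  # tight prior around "no effect"
    print(f"vague prior:     P(benefit) = {vague:.3f}")
    print(f"skeptical prior: P(benefit) = {skeptical:.3f}")
```

The same hypothetical data yield a lower posterior probability of benefit under the skeptical prior, so more or stronger data would be needed to reach any fixed decision threshold.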

From my 101-level understanding of Bayesian statistics, the posterior probability alone would be a poor metric; a likelihood ratio (or rather something more advanced) should be a better indicator of whether the treatment works. And it should be much less dependent on the choice of prior.

The thing I don’t understand at all is, if they did change the way studies are conducted, can they apply frequentist statistics post-hoc and still get FDA approval?

According to ClinicalTrials.gov, the study used an adaptive allocation to treatment algorithm – presumably more subjects were allocated to the high dose than placebo as the trial progressed. This is not an issue for Bayesian analysis but is an issue for conventional frequentist analysis. It will be interesting to see the details of the data, what the Bayesian analysis says at 18 months, what changes occurred between 12 and 18 months and whether there is high drop out after 12 months. This is a Phase 2 trial, so presumably it did not necessarily need to be positive – however the FDA will be deeply skeptical about the change in analysis.

“Incidence of ARIA-E (edema) was not more than 10% in any of the treatment arms” anyone remember the incidence of ARIA in the other MAb trials? I always thought this risk, especially in treating pre-symptomatic or early stage patients, could be a deal breaker.

The ARIA rate for apoE4 carriers in the aducanumab trial was quite high as I recall, well over the 15% they report in the BAN2401 trial. For non-apoE4 carriers, the rates are fairly similar. ApoE4 carriers make up roughly 60% of Alzheimer’s patients, so this could be a major liability for aducanumab.

Mmm. Does anyone simulate what would be a meaningful result in these long duration, large population trials with a switch from Bayesian treatment to ‘traditional’ statistics? And I mean someone independent, not from the companies involved.

My gut needs convincing this isn’t an artefact akin to a dodgy sub-group analysis after failing the primary endpoint…

My gut thought the same thing. Would love for it to be real, though!

Is a 12- or 18-month trial really going to show an effect, even if the treatment were completely effective? We are talking about a condition with a time course that evolves over decades. The clinical trials that need to be done will involve enrolling asymptomatic people and will conclude when it is too late for the information to be of any use for anyone reading this who is over 40.

Yes, “all antibodies are different”. But PD is PD. As I understand it, others have demonstrated clearance of amyloid plaques with the antibodies, but found no effect on AD. A (very) large number of molecularly different antibodies might achieve that clearance of plaques. But it’s the PD that counts (even though differences between such antibodies might still explain why some are, e.g., immunogenic).

Here’s a link to a humorous interpretation of p-values:

https://xkcd.com/1478/

I’m tempted to suggest that this is Biogen’s version of “I put on my robe and wizard hat” …

Or better still, for the amyloid hypothesis: https://xkcd.com/882/

Well…The first thing that needs to be said is that the result implies that this antibody got into the brain. It didn’t. Antibodies don’t get into the brain. Secondly, Biogen added $12B to their market cap with this breezy little press release. Good for them. They should invest some of that windfall into a brain-permeable small molecule drug program for AD. Furthermore, all of those risk-averse pharma companies who are opting out of Alzheimer’s disease drug development…They should all be docked $12B in market cap because that’s what it’s worth. RB

No, some antibodies do cross. It’s not wildly efficient, and they’re subject to efflux, etc., but some antibodies definitely cross the BBB.

Could always open up the BBB and let more antibodies in.

What is the word about the open label extension? Have they actually allowed the patients in the phase 2 to drift away? From the Monday morning quarterback position that seems to have been unwise. Generating more data from those patients would be extremely valuable.

If secretase inhibitors, which successfully lower amyloid (but fail to provide therapeutic benefit), function UPSTREAM of the appearance of fibrils (of any sort) and hence block their production, how can a downstream block directed at existing fibrils help? Happy to be corrected on this.