OK, we’re going to get a bit esoteric this morning. There are all kinds of things going on in the world, but I’m going to seek refuge for a little bit in abstraction, and if that’s your sort of thing, then let’s lift off. This is broadly on the hot topic of machine learning, which we will define non-rigorously as “having software dig through a large data set in search of rules, trends, and correlations”. We all have such large data piles in this business, and we all feel that there must be a lot of those correlations that we have missed. Certainly there are humungous gaps in our knowledge, so perhaps some more knowledge is hiding in there and needs the diligence of a machine to make it clear? If you announce yourself as a biopharma company interested in machine learning and with money to spend, then that long lonely period of no one wanting to set up meetings with you is at an end. There are more folks than I can count who are ready to sell you software, hardware, consultancies, conferences, and for all I know matching T-shirts and coffee mugs.

Now, for a machine learning algorithm to be able to extract something useful from a set of data, there has to be something useful in there in the first place. You do not want your snazzy new ML program announcing that it has found rules and patterns in piles of random noise (and in fact, deliberately dumping a pile of random noise into its hopper is a useful reality check; you want to be sure that it comes up blank). That’s for starters. You also have to wonder if these correlations you’re looking for are, in fact, extractable from the data you have. That’s a bit more of a stare-out-the-window idea: of course, if there are trends, the program will find them. . .right? This is the topic of “learnability”, and it’s intimately related to what sort of machine-learning model your system is using.
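On that reality-check point, here’s what it looks like in miniature. This toy “learner” and its dataset are my own invention for illustration (not anything from the paper or a real ML library): a model that simply memorizes random training data will score brilliantly on what it has seen and at chance on anything held out, which is exactly the blank you want your system to come up with.

```python
import random
from collections import defaultdict

random.seed(0)

def random_dataset(n, d=16):
    # n points of d random bits, with random 0/1 labels: pure noise
    X = [tuple(random.randint(0, 1) for _ in range(d)) for _ in range(n)]
    y = [random.randint(0, 1) for _ in range(n)]
    return X, y

def fit_lookup(X, y):
    # "Learn" by memorizing the majority label for each exact bit pattern
    counts = defaultdict(lambda: [0, 0])
    for x, label in zip(X, y):
        counts[x][label] += 1
    return {x: (0 if c[0] >= c[1] else 1) for x, c in counts.items()}

def accuracy(model, X, y, default=0):
    hits = sum(model.get(x, default) == label for x, label in zip(X, y))
    return hits / len(y)

X_train, y_train = random_dataset(200)
X_test, y_test = random_dataset(200)
model = fit_lookup(X_train, y_train)

print("train:", accuracy(model, X_train, y_train))  # near 1.0: the noise was memorized
print("test: ", accuracy(model, X_test, y_test))    # near 0.5: nothing real was learned
```

The gap between those two numbers is the tell: apparent “rules” that evaporate on held-out data were never rules at all.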

That already takes you a bit further under the hood than most people like to get. But there are all sorts of algorithmic choices for ML. For some datasets, it almost doesn’t seem to matter much which of the well-known ones you use; the program will extract the rule(s) for you. Other problems are pickier (and not least pickier in terms of computational time, because some of the methods are just more efficient than others for different sorts of problems). But can there ever be a complete mismatch? That is, could you have a dataset that really does have a rule in it that could theoretically be discovered, but you are luckless enough to have picked an ML algorithm that is *incapable* of finding such a rule? How general is learnability?

A recent paper came up with a surprising answer to that question. (*Update: here’s Ash/Wavefunction with a post on this as well*). The authors have a particular learning model (“estimating the maximum”, or EMX) and investigate its behavior with data produced by several different underlying “rule” functions. And they show – as in no escape, mathematically prove – that there are functions for which it is absolutely impossible to know whether they are learnable by the EMX algorithm or not. Not just difficult to know, not computationally hard until someone comes up with a better way to evaluate the problem, none of that: the question of learnability in those cases is *formally undecidable*.

The mathematically inclined will be wondering if I mean that term the way it sounds like, and the answer is, yep, Kurt Gödel rides again. At this point, it might be advisable to go do something else if you’ve been wondering whether to stick with this post. But if you’re game and want to know more, Gödel is famous for his 1931 proof (yeah, proof) that any mathematical system that is complex enough to at least encompass the rules of arithmetic will have things in it that are true (or false) but *cannot be proven to be so* using the rules of that system. His proof of that is rock-solid and is a work of genius, but even those without advanced mathematical training can be taken through versions of it (this is a classic text that does so, which I’ve gone through myself). When Gödel first presented his results in public, the only person in the audience who appears to have grasped his point and asked him questions was John von Neumann, which figures – he was probably one of the few people on Earth who could have understood that presentation in real time from a standing start. As another character witness, in his later years Einstein said that the main reason he still came in to his office at the Institute for Advanced Study was to have the chance to walk back home with Gödel.

How did undecidability crop up in machine learning? As explained here, the new paper first exploits a connection between ML and data compression. Basically, if you can use algorithmic (lossless) compression on a given data set, that tells you that there are rules and patterns inside it that can be learned; there’s some aspect of it that has low complexity which allows that data compression to be possible. The authors use a particular sort of compression algorithm that they show really is related to learnability via their EMX machine-learning system. So far, so good.
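As a crude analogy for that compression-and-structure connection (the paper’s compression scheme is nothing like zlib, so take this purely as a gesture at the idea): data with internal regularities squeezes down, and noise doesn’t.

```python
import os
import zlib

structured = b"ATGATGATG" * 1000   # highly patterned "data"
random_ish = os.urandom(9000)      # incompressible noise of the same length

def ratio(data):
    # Compressed size over original size; well below 1 means structure was found
    return len(zlib.compress(data, 9)) / len(data)

print(f"structured: {ratio(structured):.3f}")  # tiny: the repeat is fully exploited
print(f"random:     {ratio(random_ish):.3f}")  # about 1: nothing there to exploit
```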

And then they show that the ability to do that compression is related to the size of infinite sets – particularly the “unit interval”, which is all the real numbers between 0 and 1. Shout-out here to Georg Cantor, who introduced this whole crazy idea that there are all kinds of infinite sets and that some of them are larger than others. If that sounds nuts, consider all the rational numbers between 0 and 1. There’s all your fractions, 1/3, 1/2, 2/7 and what have you. It’s an infinite set; you can keep making fractions by throwing one integer over another integer forever. But the set of real numbers between 0 and 1 is even larger, because it includes the *irrational* numbers (never-ending, never-repeating decimal expansions). It’s also infinite, but it’s a *bigger kind* of infinite, as you find out when you try to pair the two sets off one-to-one: any such pairing provably leaves real numbers unmatched. Cantor’s “continuum hypothesis” was that there are no infinite sets with a “size” in between those two. (The size, more accurately the “cardinality”, of the rational-number set is equivalent to that of the integer set, so another way of stating the continuum hypothesis is that the set of real numbers has the lowest possible cardinality that’s greater than that of the integers.)
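For readers who want the one-paragraph version of why no such pairing can work, this is Cantor’s famous diagonal argument: assume the reals in [0, 1] could be written out in a list, then build a number the list necessarily misses.

```latex
% Suppose every real in [0,1] appeared in a list r_1, r_2, r_3, \dots,
% each written out by its decimal digits:
r_i = 0.d_{i1}\,d_{i2}\,d_{i3}\cdots
% Build a new number x by changing every "diagonal" digit:
x = 0.x_1 x_2 x_3 \cdots, \qquad
x_n = \begin{cases} 5 & \text{if } d_{nn} \neq 5,\\ 6 & \text{if } d_{nn} = 5. \end{cases}
% Then x differs from r_n at the nth digit for every n, so x is not on
% the list: no list can capture all the reals, and their infinity is
% strictly bigger than that of the integers (or the rationals).
```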

Settling that question was one of the great unsolved mathematical issues of the 20th century, and Gödel strongly suspected that it was in fact undecidable. Paul Cohen nailed that down, relative to standard mathematical set theory, in the early 1960s – work (which goes even further than I’m going into here) that rightly won him every major award that the mathematical world could bestow. The connection here is that the ability to do the data compression in this ML setting, which is tied to learnability, is *also tied to the continuum hypothesis*. Which is undecidable and unprovable. And therefore the question of whether this ML algorithm is capable of learning a (real) rule from such a data set is also undecidable.

So this is fascinating – well, OK, you may or may not agree – but does it have any relevance to the real world? That, we don’t know. You could argue that this system is set up to be particularly Gödel-vulnerable and that the same undecidability will not rear its head under ordinary conditions. That’s *probably* true, although it’s too early to say. And it’s important to remember that the fact of not being able to *prove* that an ML system like this one can learn a rule doesn’t mean that such a system *won’t* learn such a rule. It’s just that you literally can’t be sure if it can or not; you can’t nail that larger question down. The larger field of mathematics (and mathematical logic) has been able to carry on post-Gödel, in some ways perhaps by just not letting itself get worried about it, and no doubt the machine learning field will as well. But it’s very interesting, and not a little unnerving, to know that the cold finger of undecidability can reach into this area as well. I was going to finish up by saying “You never know”, but it’s not just that. In some cases, you never *can* know.

*Update: as per the comments, for those who really want to dive in, here’s an extended commentary.*

So… Machine learning would never be able to discover all the correlations in a set of data?

It might do a better job than a human but deus ex machina it is not…

Correct?

No, that’s not what the paper is saying. It’s saying that there exists a set of data for which we cannot determine whether it’s learnable. (This would apply to humans as well.) An unlearnable set may or may not exist, and they’ve proven that we can never prove which. As a real example: say you’ve found a problem that your ML algorithm is having a hard time encapsulating, and you want to know if it’s worth pursuing. The proof says: “It’s possible that, for your problem, there’s no way to know for sure.”

Two excellent books on the topic:

Deep Medicine by Eric Topol, MD; and

The Man Who Solved the Market by Gregory Zuckerman

Neither is about undecidability, but both are good explanations of how machine learning succeeds (when it does).

But I think you have the wrong definition of learning. Learning is the process of moving from one state of knowledge to another (improved?) state. Something that chunks through a pile of collected data and reports the results afterwards is establishing correlations and rules, but it only becomes learning when you compare the before and after states.

That said, this doesn’t invalidate your argument about patterns being able to be recognized. For any recognizer there will inevitably be patterns that it won’t recognize. And for any data set there will be false patterns, i.e. patterns that appear only because of random chance, which would disappear if the data set were larger or selected differently. (I.e. any finite set of random data will have patterns that can be learned.)

That said, the probability of a rule or pattern being random noise can usually be estimated. And it’s often(?) useful to add (some) random noise to your data set to make patterns more evident.

P.S.: This is based on general knowledge of the field, I don’t know anything about the particular recognizer you’re describing. Also, I have grave doubts about the applicability of anything invoking the continuum hypothesis to finite discrete data sets. And all measurements are discrete. But the conclusion looks correct anyway.
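On the “can usually be estimated” point above: the standard workhorse for that estimate is a permutation test, sketched here with made-up data. Shuffle the labels many times and see how often pattern-free data looks at least as correlated as yours; if shuffled data matches your pattern all the time, your “pattern” is probably noise.

```python
import random

random.seed(1)

def pearson(xs, ys):
    # Plain Pearson correlation coefficient, no libraries needed
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    dx = sum((x - mx) ** 2 for x in xs) ** 0.5
    dy = sum((y - my) ** 2 for y in ys) ** 0.5
    return num / (dx * dy)

def permutation_p(xs, ys, trials=2000):
    # How often does shuffled (pattern-free) data look at least this correlated?
    observed = abs(pearson(xs, ys))
    ys = ys[:]  # shuffle a copy, leave the caller's data alone
    hits = 0
    for _ in range(trials):
        random.shuffle(ys)
        if abs(pearson(xs, ys)) >= observed:
            hits += 1
    return hits / trials

xs = [float(i) for i in range(30)]
signal = [x * 0.5 + random.gauss(0, 1) for x in xs]  # a real trend plus noise
noise = [random.gauss(0, 1) for _ in xs]             # no trend at all

print("p for real trend:", permutation_p(xs, signal))  # essentially zero
print("p for pure noise:", permutation_p(xs, noise))   # usually large
```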

Did you mean to say “the cardinality of the RATIONAL-number set is equivalent to [that of] the integer set”?

Good Lord, yes. Fixed, thanks. It’s at least proof that someone read that far!

People often understate the practical importance of undecidability. What is often proved is that there exists at least one case where this, that or the other is undecidable.

How many such cases are likely to exist in practice? This question is rarely asked, but I know plenty of people (including me) who assume that it is small enough to ignore (for the problem we are working on).

A commentary on the paper here: https://arxiv.org/abs/1909.08410

I find it extraordinary that a theorem as abstruse and abstract as Cantor’s Continuum Hypothesis has real-world applications.

G. H. Hardy (of “A Mathematician’s Apology”) would have been horrified to find out how many applications pure, abstract math has turned out to have in the modern world (!)

And maybe worse, the field he picked out for himself as having no practical use so far as he knew turns out to be the basis for secure network transactions, ciphers, and key management.

Decidability is semi-connected to computability, and the simplest-to-state uncomputable function that I know of is “the busy beaver number”. For any N, it is the largest number of 1s that any N-state Turing machine that eventually halts can leave on an initially blank tape.

Small cases can be brute-forced, but there’s provably no formula (or algorithm) that computes the busy-beaver number for every N.
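For the curious, here is that brute force in miniature (the transition table below is the standard published 2-state, 2-symbol champion machine): simulate it and count the 1s. What’s provably impossible is not this, but a single procedure that delivers the answer for all N.

```python
# Known 2-state, 2-symbol busy-beaver champion; "H" is the halt state.
CHAMPION = {
    ("A", 0): (1, +1, "B"),  # (symbol to write, head move, next state)
    ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"),
    ("B", 1): (1, +1, "H"),
}

def run(machine, max_steps=1000):
    # Tape is a dict of position -> symbol; unwritten cells read as 0
    tape, head, state, steps = {}, 0, "A", 0
    while state != "H" and steps < max_steps:
        write, move, state = machine[(state, tape.get(head, 0))]
        tape[head] = write
        head += move
        steps += 1
    return sum(tape.values()), steps

ones, steps = run(CHAMPION)
print(ones, steps)  # 4 ones in 6 steps: BB(2) = 4
```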

There are a fair number of practical implications. A somewhat boring but common one has to do with why your computer hangs sometimes (the halting problem).

I’ll make the modest point that ‘random’ is not synonymous with ‘unstructured’ or ‘unpatterned’. In fact, a random sequence in time will show -all possible- structures and patterns if you wait long enough.

On the contrary; it is synonymous with both. By definition, any patterns you perceive in a truly random data set are illusions.

You can have an infinite number of apples without having any oranges (although it must be acknowledged that that’s an awful lot of apples).

But I think you would be hard pressed to find that infinite data set on which to base your inquiry, right? Does the proof apply to finite data sets? I would find that … odd

From what appears to be in the literature (see, i.a., https://arxiv.org/pdf/1711.05195.pdf), the EMX learning method is continuous by assumption. I haven’t digested the paper well yet, but I think arguments based on the finitude of the input data set are incorrect: the infinities arise from elsewhere.

Whether this is attractive or not, I don’t know, but I can imagine that there is a pretty wide distance in many cases between finite and tractable. It makes sense to me that there is a lot of utility to be had in finding real-valued functions for ML in those cases.

Is machine learning (as it’s being discussed here) mostly concerned with detecting trends in chemical activity, response of organisms to various chemical entities, other things, or “all of the above”? Just curious.

The latter mostly

“There are more folks than I can count who are ready to sell you . . . for all I know matching T-shirts and coffee mugs.”

The first rule of tech procurement is, if their salesmen aren’t giving you the matching T-shirts and coffee mugs for free, they don’t want to make a sale.

The second rule of tech procurement is, use a competitor’s coffee mug when you’re talking to the next salesman.

haha that was great

Most of my fellow programmers will know that the ultimate compressibility of a given data stream or file is unknowable; few of us could give a concise explanation demonstrating an understanding of why, or even a lengthy one that states the principles in a way that improves human (as opposed to machine) understanding.

Mostly, we run the compression algorithm anyway, and look surprised when a file doesn’t zip up any smaller.

Those of us who test our data encryption with a self-correlation check, or entropy metrics, are defeated by our colleagues’ insistence on blaming the testing algorithm when it fails to detect a non-random data segment in a ‘random-looking’ encrypted stream that definitely contains structured data. Most times they’re right: it’s an implementation error. But they are unaware that the answer is, in an unknown number of cases, fundamentally unknowable.

This has applications in computational biology: you know the known patterns in DNA, and some of the unknown ones flag up anomalous entropy values that indicate a potentially-functional sequence. But you can never prove, formally, that an unusual sequence is truly ‘junk’, even if it’s apparently nonsense – not yet, anyway, because you can’t say “there’s no such mechanism that would ever use that”, only “no known one” – and there’s only so much that you can infer from observing whether it’s conserved across related species.

The entropy or self-correlation check won’t help you here: it will always give you an unknown number of wrong answers from the unknowable data space.

So this is very much a topic in applied mathematics, not an abstract exercise in philosophy.
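A toy version of that entropy flagging, for concreteness (the window size and the sequence itself are invented for illustration): slide a window along the sequence and compute the Shannon entropy of the base composition in each one. Low-complexity repeats stand out as low-entropy windows; what the metric can’t do, per the point above, is tell you whether a flagged region is functional.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    # Bits per symbol, from the base composition of this window
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_profile(dna, window=20):
    # Entropy of each non-overlapping window along the sequence
    return [shannon_entropy(dna[i:i + window])
            for i in range(0, len(dna) - window + 1, window)]

# Toy sequence: a low-complexity AT repeat buried in a mixed background
dna = "ACGTTGCAGCTAGCGATCGA" + "ATATATATATATATATATAT" + "GGCTAGCTAAGCTTCGATCG"
for i, h in enumerate(entropy_profile(dna)):
    print(f"window {i}: {h:.2f} bits")  # the middle window dips to 1.0 bit
```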

The fact that any given DNA sequence exists in our genome means at some point in deep time, Nature selected for it. The circumstances under which that happened may now be obscure or even absent, but the concept of “junk DNA” is a bet that our present ignorance is definitive.

I don’t think its existence means that natural selection selected for it, just that it didn’t select against it. If a sequence doesn’t disadvantage an organism, and no sequence that conveys an advantage occupies that same spot, there’s no reason for the sequence to be selected against, and it’s just as likely to stick around as to disappear. In fact it may be an advantage to have a wide variety of those “junk” sequences in the genome, because they’re available if the environment changes so that they become useful, and resilience vs. change is definitely an advantage for an organism.

The ‘deep time’ past can indeed preserve unused sequences that rapidly degrade into unusability; and, eventually, into random noise; natural selection is neither efficient nor timely at removing them.

I wonder if I’m right in speculating that they might decay into *completely* random noise over time: unfortunately, we cannot definitively prove that this is true or false for any given ‘junk’ sequence.

An aside, for amusement:

There is, perhaps, a deep future in which every organism’s genome contains short sequences of ‘junk’ containing encrypted copyright signatures and license keys.

Fortunately, that is not relevant to biology today.

Goedel bites…

Read “Uncle Petros and Goldbach’s Conjecture” for an amusing take on mathematical insanity.

It was very interesting to see Tegmark putting the knowable (~manageable) meanings of the various infinities into perspective starting from a more cosmological point of view. Transcribing those considerations to the scale of the molecular and smaller was a novel brain candy flavor.

https://www.goodreads.com/book/show/19395553-our-mathematical-universe

There’s a neat little graphic novel on Bertrand Russell and Alfred North Whitehead, who try to prove the consistency of mathematics only to hear the death rattle of their decades-long endeavor as Gödel steps in out of nowhere with his proof. An interesting narrative of the events in the early 1900s in comic form, with other characters such as Frege and Cantor. I highly recommend taking a gander at it or checking it out from some library. https://en.wikipedia.org/wiki/Logicomix <- for anyone interested in following down that rabbit hole.

Overall, it is refreshing to see Gödel’s results still making waves today, even in this field. I am surprised yet not surprised. Very cool application.