Treatise on Universal Induction

Samuel Rathmanner and Marcus Hutter recently wrote a treatise on universal induction.  While there are no proofs, it does get into some fairly deep aspects of the problem of inductive inference.  Download it here:

http://dx.doi.org/10.3390/e13061076

 

 

This entry was posted in Uncategorized. Bookmark the permalink.

25 Responses to Treatise on Universal Induction

  1. Razorcam says:

    btw, Dr. Dan Burfoot is writing a book about a similar but simplified theorem, which could be easily taught in schools. Here is a quote from his website http://www.danburfoot.net/research.html
    “At the core of the book is refined version of the scientific method, consisting of the following steps:
    Obtain a large database of measurements relating to a phenomenon of interest.
    Through study of the phenomenon, attempt to develop a theory describing it, or revise a previous theory.
    Instantiate the new theory as a compression program. Invoke the compressor on the database.
    Prefer the new theory to a previous one if it achieves a lower net encoded file size, taking into account the length of the compressor itself.”

    He is recruiting volunteers to read drafts of his book.

    I don’t like that he completely throws away the “bigger” theories, but I admit that it is easier to teach this way.

  2. David Rimshnick says:

    Question about the strange “non-black non-raven” providing evidence for “all ravens are black” deal on page 1108-1109: it seems that the argument that a red apple provides support for the statement “all ravens are black” is that, since this is not a counterexample, it is less likely that there is a counterexample. But a black apple would also not be a counterexample to “all ravens are black”, yet it would provide no more support to “all ravens are black”. Am I missing something?

    Thanks,
    Dave

    • Shane Legg says:

      I think your understanding is correct (though I don’t profess much understanding of this point myself). I think that the key thing to remember is that the amount of evidence in these situations is (always?) extremely small.

      It would be interesting to know if this could be “stretched” into a situation where either Solomonoff induction does something that seems seriously wrong (rather than a tiny but weird looking update), or whether the opposite is possible, i.e. a situation where not doing this weird update that Solomonoff induction suggests would imply something seriously strange.

      • David Rimshnick says:

        Right – I guess that’s my worry. I mean 99.999% of things in the world aren’t ravens, so pretty much anything non-black would provide evidence for “all ravens are black” if this is the case. It seems like you could prove anything this way….

        • Shane Legg says:

          Right, but if the weight of this evidence is extremely small and tending towards zero… then the fact that it’s non-zero might be just a technical issue rather than something of actual consequence.

          • David Rimshnick says:

            Right.. I guess the how quickly it diminishes would have to be formalized. Otherwise, this could lead to a sort of selection-bias.

            I feel like it should be straightforward just to ignore contrapositives though, theory be damned (or I guess just modified)…

    • sp says:

      To David and Shane.

      Depending on the prior probability of the world, black apple may provide support for “all ravens are black”, or against it, or have no effect. It is not difficult to construct a prior for any outcome. For example, we could consider a prior which fixes number of objects of each kind (ravens,apples), and number of objects possessing a particular property (red,black). Then a black apple decreases probability of a black raven. Simple probability theory, no magic here.

      Now consider the Solomonoff prior. I think you are right that affect of an apple is to simply remove possibility of a counterexample. And I do not agree that this probability is small, because very simple worlds (with small number of object types and their attributes) dominate the universal prior. I’m not sure, but it seems to me that main increase in the probability of “all ravens are black” comes from decrease in the probability of having ravens in the world. But this will also trivially increase the probability of correctness of “all ravens are red”. Not very interesting.

      More interesting to consider conditional statement: “if there are any ravens, then all ravens are black”. It seems that observing a red apple actually gives support against the statement, because a world with just one color is simpler than a world with two colors. Correct me please if I am wrong.

      Maybe we could condition the statement so that we are actually consider something similar to our world, apart from the colors (maybe make all color permutations equally probable)? But then I fail to see the stated paradox, as well the smallness of probabilities.

  3. David says:

    Hello Shane,

    So, how does solomonoff induction escape the performance limits that the NFL theorems establish? From section 2.5 I assume that SI assumes structure and applies Occam as the bias to exploit this structure. But then must I assume that optimality results for SI are only applicable to said structured environments? Also, am I correct in thinking that the assumption is formalized as the universal prior that grants greater probability to shorter programs? Would removing this assumption then result in a prior that grants equal probability to all programs irrespective of length?

    In other words, if we remove the condition that there is structure, and employ SI against environments where there is no structure, does SI boil down to average performance as NFL states?

    It seems that although Occam works in practice, the bias is not really justified (as argued by Hume), Assumptions of structure based on past performance are circular arguments.

    Regards and thanks for the post,

    David

    • Shane Legg says:

      The Solomonoff convergence result has a term for the complexity of the true distribution. If this is high, it will take longer to converge. The results always hold, but they don’t contradict NFL either.

      In NFL the usual thing to do is to take a finite hypothesis space, perhaps of programs, and then put a uniform prior over these. However, for a program space most programs will have near maximal complexity. This means that NFL is assuming that with high probability the true distribution is very complex. That completely breaks learning because you end up considering any structure you see to probably just be an anomaly. 1, 2, 3, 4, 5, 6, 7, perhaps 9939485, or -5 or 82 or … comes next rather than 8? Under a uniform prior all the programs are just as likely. This shows us that if you screw up your prior badly enough, as NFL does, you can break learning.

      Is Occam’s razor justified? If you mean empirically justified, then the answer is obvious: the universe we live in is extremely structured.

      • David says:

        If I understand you correctly, then most of what I was saying before was on roughly the right track. Ie removing the assumption of structure yields a uniform prior over programs. Ofcourse this breaks learning as learning can only occur when structure is present, in other words, if there is something to learn.

        “Is Occam’s razor justified? If you mean empirically justified, then the answer is obvious: the universe we live in is extremely structured.”

        Im just being philosophical here for the sake of understanding the fundamentals, but:

        Is it actually obvious as you say? Does the fact that the universe _has_ been structured in the past mean it will be structured in the future? This is exactly the place where circularity can occur:

        Using your example before, our observations say 1,2,3,4,5,6,7. But what does this regularity say about the future? If we assume the future is like the past, then yes, 8 is more likely. But assuming that the future is like the past is exactly what we are trying to determine. In other words, it seems we need Occam in order interpret the data in a way that it supports Occam.

        Thanks for your reply

        • Shane Legg says:

          Yes, I’d say that you’re on the right track.

          In my opinion, the reason NFL has staying power is due to three things. 1) for small spaces a uniform prior is good and has intuitive appeal, thus using it over a large space seems correct at first sight. 2) the NFL result is then surprising and yet simple. 3) it can be used as an excuse! People working on GAs are especially guilty of this: “Hey, you can’t criticise my new kind of GA because it mostly seems to work on a hand picked problem, NFL says that no algorithm is better or worse in general!”

          So my first point is to push back on number 1. Yes, uniform priors are sensible on small spaces, but no they aren’t on large spaces.

          Actually, given that you’re interested in theoretical justifications for Occam’s razor, let’s take the above a bit further. If we want an extremely general learning system, and we’re fairly pure Bayesian, then we need to start with an extremely large hypothesis space and put a prior over it. Indeed, as we don’t want to rule out any hypothesis a priori, we’ll want to take an infinitely large hypothesis space such as the space of programs.

          Over a denumerable (infinite but countable) space you mathematically can’t put a non-zero uniform prior because it will always sum to more than one. Thus, some elements have to have higher prior probability than others. Now if you order these elements according to some complexity measure (any total ordering will do, so the choice of complexity function makes no difference), then the boundedness of the probability sum means that the supremium of the sequence converges to zero. Thus, for very general learning systems, and for any complexity measure you like, you can only violate Occam’s razor locally. Eventually, as things get more complex, their probabilities must decrease.

          In summary: if you want to do pure Bayesian learning over very large spaces you must set a prior that globally respects a kind of Occam’s razor. It’s not a matter of belief, it’s a mathematical necessity.

          If you accept a kind of Occam’s razor, you will also expect that the future will generally tend to be like the past because these hypotheses will generally have lower complexity.

          • Razorcam says:

            Many thanks for this very pedagogic answer. I will reuse it.

            But, of course, Solomonoff’s assumption that one should use the infinite space of all computer programs can be challenged.

            The fact that most, if not all, successful scientific models are computable is a main reason for choosing this assumption.

            But one can always argue that scientists always work with finite computational ressources, so that the space of computer programs they use is finite, unlike Solomonoff’s infinite space.

          • David says:

            “In summary: if you want to do pure Bayesian learning over very large spaces you must set a prior that globally respects a kind of Occam’s razor. It’s not a matter of belief, it’s a mathematical necessity.”

            Ah! This is a very interesting point I was missing.

            So, even if you had a non-Occam _belief_ (whatever that would be), it would be mathematically impossible to formalize it within bayesian learning.

            “If you accept a kind of Occam’s razor, you will also expect that the future will generally tend to be like the past because these hypotheses will generally have lower complexity.”

            Agreed, thats what I was trying to say in my previous comment.

            Thanks again

    • Razorcam says:

      NFL is the proof that you need a prior in order to predict that an outcome is more likely than another. Because if the environement is a completely random function, then all outcomes have perfectly equal probabilities, so you can’t predict anything.
      NFL is the proof that when there is no prior there can be no scientific method (a rigourous method that predict what will happen, based on previous observations).

      So what prior should a scientific method use ?
      Modern scientists job is often to write a program that will simulate an environment compatible with previous observation. Simply because in science there is a long history that shows that building a model of an environement is a good way to predict what will happen. That’s the main reason why people do science: it usually works.

      A scientist has to write a program which size is not larger than the maximum size (let’s call this size Smax) her cluster of computers can handle.
      You can easily prove that for each program P of size S<Smax there are many larger programs that are perfectly equivalent to P. The smaller P is, the more there are larger programs equivalent to it. This number of equivalent programs grows exponentially as S decrease.
      Wait a second, this really looks like Solomonoff's prior !

  4. BcM7mL says:

    Hi I have a question about inductive inference / Solomonoff induction:

    My understanding is that, for a given finite sequence, the algorithms that compute that sequence are given probability values according to their length, complexity or something similar.

    It seems to me that this method of algorithm selection leaves no room for “margins of errors” in the input values, meaning that if there is the slightest variation in the input string then it can have adverse effects on the algorithms being considered in terms of simplicity. Is there a variation of the theory that accomodates small errors in the input strings? It seems to me that such a feature is desirable if an approximation of the model is to be used in real-life prediction tasks, because measurement values are often accompanied by “margins of erros”.

    I think this kind of error-accomodating variation of inductive inference can be seen as a generalization of regression analyses used in statistics, where data points are matched used simple models (i.e. linear).

    • Razorcam says:

      BcM7mL said: “if there is the slightest variation in the input string then it can have adverse effects on the algorithms being considered in terms of simplicity”.

      Not really. If you have a very good compressor program for a large input string, except for some rare errors, then adding some program lines to this compressor in order to correct those rare errors won’t add much to the size of the compressor, relative to the size of the large input string.

      • BcM7mL says:

        In science for example, what is usually being sought is the simplest theory that fits the given set of data with slight errors allowed rather than the slightly more complicated theory that fits all the data points exactly.

        So it seems to me that in some situations a model that trades errors for simplicity can be useful even if the simplicity cost is relatively small when the input string is large.

        The use case I’m thinking of is closer to “letting a machine discover scientific laws automatically” than to “letting a machine make accurate predictions”.

        • Razorcam says:

          Here is an example of an input string S with “small errors”:
          11111111111111111111111111111165
          22222222222222222222222222222221
          33333333333333333333333333333346
          44444444444444444444444444444487
          ……………………….
          99999999999999999999999999999956

          The neverending program P that generates an infinite string that starts with S, but without the errrors found in S, is small compared to the string. Adding some program lines to get the same errors will be small compared to the string, too. But it won’t affect the predictions made by P.

          P can be seen as a compressor, as a scientific “law” and as a predictor.

          Thus “letting a machine discover scientific laws automatically” and “letting a machine make accurate predictions” are compatible.

  5. Razorcam says:

    Yes, if you already have a satisfying model of a phenomenon, “algorithmic statistics” seems what you need. http://homepages.cwi.nl/~paulv/papers/algorithmicstatistics.pdf

    Here, when we talk about induction, we are talking about making a new model.

    How can an induction agent know that those numbers in S are errors and are not an interesting phenomenon that should also be predicted ?
    If you are already sure that those are errors, that’s because you have previous information that is not available in S.
    If you can, a simple solution is to use your previous information to smooth those errors before giving S to the agent.

    Alternatively, you can give all the previous information to the agent.
    If the information comes a physical apparatus, you will never be sure that those “errors” can’t be predicted. And that’s what the agent will try to do for you. Even if it fails, it will come up with several theories, each having its own probability, and you can use it to predict the most likely range of values that the errors will be in.

    • Razorcam says:

      Oops, the previous message was an answer to the last message from BcM7mL.

    • BcM7mL says:

      “How can an induction agent know that those numbers in S are errors and are not an interesting phenomenon that should also be predicted ?”

      Good point. The errors can be separated if there is a known model, but not without.

      But suppose that the “errors” remain relatively constant with respect to all models of a certain class. Then it would make sense to call these parts the “errors” and the rest the “interesting phenomena” or “meaningful data”. This would be similar to how Kolmogorov complexity is determined not with respect to a single Turing machine but with respect to Turing machines in general.

      This seems to be the kind of theory developed in the “Correspondence Meaningful Data” paper by Paul M. Vitanyi as far as I can tell. Assuming my interpretation is correct, this seems like a promising approach because it would separate the “errors” and “meaningful data” in an objective / model-independent way.

      • Razorcam says:

        It is very difficult to distinguish numbers coming from a true random number generator (let’s call them “pure errors” or “pure noise”, completely unpredictable) and numbers coming from a pseudo random generator (a predictable phenomenon). In fact a large number of succesful cryptographic algorithms are based on this difficulty (stream ciphers). Until all of these algorithms get cracked, I choose to give up trying to solve this difficult problem.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>